Thank you, Xuefu! Excellent explanation and comparison! We should put it on the Hive on Spark wiki: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark

On Wed, May 20, 2015 at 10:45 AM, Xuefu Zhang <xzh...@cloudera.com> wrote:

> I have been working on Hive on Spark, and know a little about SparkSQL.
> Here are a few factors to be considered:
>
> 1. SparkSQL is similar to Shark (discontinued) in that it clones Hive's
> front end (parser and semantic analyzer) and metastore, and injects in
> between a layer where Hive's operator tree is reinterpreted in Spark's
> constructs (transformations and actions). Thus, it's tied to a specific
> version of Hive, which is always behind official Hive releases.
> 2. Because of the reinterpretation, many features (window functions,
> lateral views, etc.) from Hive need to be reimplemented in the Spark
> world. If an implementation hasn't been done, you see a gap. That's why
> you would expect functional disparity, not to mention future Hive
> features.
> 3. SparkSQL is far from production ready.
> 4. On the other hand, Hive on Spark is native in Hive, embracing all Hive
> features and growing with Hive. Hive's operators are honored without
> reinterpretation. The integration is done at the execution layer, where
> Spark is nothing but an advanced MapReduce engine.
> 5. Hive is aiming at enterprise use cases, where there are more important
> concerns, such as security, than purely whether it works or runs fast.
> Hive on Spark certainly makes the query run faster, but still keeps the
> same enterprise-readiness.
> 6. SparkSQL is a good fit if you're a heavy Spark user who occasionally
> needs to run some SQL, or if you're a casual SQL user who likes to try
> something new.
> 7. If you haven't touched either Spark or Hive, I'd suggest you start
> with Hive, especially for an enterprise.
> 8. If you're an existing Hive user and are considering taking advantage
> of Spark, consider Hive on Spark.
> 9. It's strongly discouraged to mix Hive and SparkSQL in your deployment.
> SparkSQL includes a version of Hive, which is very likely different from
> the version of Hive that you have (even if you don't use Hive on Spark).
> Library conflicts can put you in a nightmare.
> 10. I haven't benchmarked SparkSQL myself, but I have heard several
> reports that SparkSQL, when tried at scale, is either fast or fails your
> queries.
>
> Hope this helps.
>
> Thanks,
>
>
> On Tue, May 19, 2015 at 10:38 PM, guoqing0...@yahoo.com.hk <
> guoqing0...@yahoo.com.hk> wrote:
>
>> Which is better, Hive on Spark or SparkSQL, and what are the key
>> characteristics, advantages, and disadvantages of each?
>>
>> ------------------------------
>> guoqing0...@yahoo.com.hk
>>