I have been working on Hive on Spark, and know a little about SparkSQL. Here are a few factors to consider:

1. SparkSQL is similar to Shark (discontinued) in that it clones Hive's front end (parser and semantic analyzer) and metastore, and injects in between a layer where Hive's operator tree is reinterpreted in Spark's constructs (transformations and actions). Thus, it's tied to a specific version of Hive, which is always behind official Hive releases.

2. Because of the reinterpretation, many features from Hive (window functions, lateral views, etc.) need to be reimplemented in the Spark world. If an implementation hasn't been done, you see a gap. That's why you should expect functional disparity, not to mention future Hive features.

3. SparkSQL is far from production ready.

4. On the other hand, Hive on Spark is native to Hive, embracing all Hive features and growing with Hive. Hive's operators are honored without reinterpretation. The integration is done at the execution layer, where Spark is nothing but an advanced MapReduce engine.

5. Hive is aimed at enterprise use cases, where concerns such as security matter more than purely whether it works or whether it runs fast. Hive on Spark certainly makes queries run faster, but still keeps the same enterprise readiness.

6. SparkSQL is a good fit if you're a heavy Spark user who occasionally needs to run some SQL, or if you're a casual SQL user who likes to try something new.

7. If you haven't touched either Spark or Hive, I'd suggest you start with Hive, especially for an enterprise.

8. If you're an existing Hive user and are considering taking advantage of Spark, consider Hive on Spark.

9. Mixing Hive and SparkSQL in your deployment is strongly discouraged. SparkSQL includes a version of Hive, which is very likely different from the version of Hive that you have (even if you don't use Hive on Spark). Library conflicts can put you in a nightmare.

10. I haven't benchmarked SparkSQL myself, but I have heard several reports that SparkSQL, when tried at scale, either runs fast or fails your queries.

Hope this helps.

Thanks,

On Tue, May 19, 2015 at 10:38 PM, guoqing0...@yahoo.com.hk <guoqing0...@yahoo.com.hk> wrote:

> Hive on Spark and SparkSQL which should be better , and what are the key
> characteristics and the advantages and the disadvantages between ?
>
> ------------------------------
> guoqing0...@yahoo.com.hk