Thank you, Xuefu!

Excellent explanation and comparison!
We should put it on the Hive on Spark wiki:
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark


On Wed, May 20, 2015 at 10:45 AM, Xuefu Zhang <xzh...@cloudera.com> wrote:

> I have been working on Hive on Spark, and know a little about SparkSQL.
> Here are a few factors to consider:
>
> 1. SparkSQL is similar to Shark (discontinued) in that it clones Hive's
> front end (parser and semantic analyzer) and metastore, and injects in
> between a layer where Hive's operator tree is reinterpreted in Spark's
> constructs (transformations and actions). Thus, it's tied to a specific
> version of Hive, which is always behind official Hive releases.
> 2. Because of the reinterpretation, many features (window functions,
> lateral views, etc.) from Hive need to be reimplemented in the Spark world.
> If an implementation hasn't been done, you see a gap. That's why you would
> expect functional disparity, not to mention future Hive features.
> 3. SparkSQL is far from production ready.
> 4. On the other hand, Hive on Spark is native in Hive, embracing all Hive
> features and growing with Hive. Hive's operators are honored without
> re-interpretation. The integration is done at the execution layer, where
> Spark is nothing but an advanced MapReduce engine.
> 5. Hive is aiming at enterprise use cases, where concerns such as security
> matter more than purely whether a query works or runs fast. Hive on Spark
> certainly makes queries run faster, but still keeps the same
> enterprise-readiness.
> 6. SparkSQL is a good fit if you're a heavy Spark user who occasionally
> needs to run some SQL, or if you're a casual SQL user who likes to try
> something new.
> 7. If you haven't touched either Spark or Hive, I'd suggest you start with
> Hive, especially for an enterprise.
> 8. If you're an existing Hive user and are considering taking advantage of
> Spark, consider Hive on Spark.
> 9. It's strongly discouraged to mix Hive and SparkSQL in your deployment.
> SparkSQL includes a version of Hive, which is very likely a different
> version from the Hive that you have (even if you don't use Hive on Spark).
> Library conflicts can turn into a nightmare.
> 10. I haven't benchmarked SparkSQL myself, but I have heard several reports
> that SparkSQL, when tried at scale, either runs fast or fails your
> queries.
>
> Hope this helps.
>
> Thanks,
>
>
> On Tue, May 19, 2015 at 10:38 PM, guoqing0...@yahoo.com.hk <
> guoqing0...@yahoo.com.hk> wrote:
>
>> Between Hive on Spark and SparkSQL, which is better, and what are the key
>> characteristics, advantages, and disadvantages of each?
>>
>> ------------------------------
>> guoqing0...@yahoo.com.hk
>>
>
>
