@Maninda, +1 for suggesting Spark SQL.
Quoting Databricks: "Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore." [1]

But I am not entirely sure whether Spark SQL and Siddhi are comparable, because Spark SQL (like Hive) is designed for batch processing, whereas Siddhi does real-time processing. That said, if there are implementations where Siddhi runs on top of Spark, it would be very interesting.

Spark supports either Hadoop 1 or Hadoop 2. But I think we should evaluate which is better: MR1, or YARN + MR2. [image: Hadoop Architecture] [2]

[1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
[2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html

On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <lasan...@wso2.com> wrote:

> Hi Maninda,
>
> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com> wrote:
>
>> In the case of discontinuation of the Shark project, IMO we should not move to Shark at all. It seems better to go with Spark SQL, as we are already using Spark for CEP. But I am not sure of the difference between Spark SQL and Siddhi queries on the Spark engine.
>
> Currently, we are doing the integration with CEP using Apache Storm, not Spark... :-). Spark Streaming is a possible candidate for integrating with CEP, but we have opted for Storm. I think there has been some independent work on integrating Kafka + Spark Streaming + Siddhi. Please refer to the thread on arch@ "[Architecture] A few questions about WSO2 CEP/Siddhi".
>
>> And we have to figure out how Spark SQL can be used for historical data, and whether it can execute incremental processing by default, which would cover all our existing BAM use cases. On the other hand, Hadoop 2 [1] uses a completely different platform for resource allocation, known as YARN.
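To make the Hive-compatibility claim in the Databricks quote above concrete: a typical BAM-style summarisation script is plain Hive QL, so in principle it should run unchanged on Shark or on Spark SQL. A hypothetical sketch (table and column names are illustrative, not from an actual BAM script):

```sql
-- Hypothetical HiveQL summarisation (illustrative names only).
-- Per the Databricks quote, the same statement should work on Hive,
-- Shark, and Spark SQL, since all three share Hive's QL, UDFs, and
-- metastore.
INSERT OVERWRITE TABLE request_summary
SELECT host,
       COUNT(*)           AS request_count,
       AVG(response_time) AS avg_latency_ms
FROM   request_log
GROUP  BY host;
```

Incidentally, this is the DML-plus-retrieval pattern (INSERT ... SELECT ... FROM ...) where the benchmark later in this thread saw Shark's largest gains over Hive.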
>> Sometimes this may be more suitable for batch jobs.
>>
>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>
> Thanks,
> Lasantha
>
>> *Maninda Edirisooriya*
>> Senior Software Engineer
>>
>> *WSO2, Inc.* lean.enterprise.middleware.
>>
>> *Blog* : http://maninda.blogspot.com/
>> *E-mail* : mani...@wso2.com
>> *Skype* : @manindae
>> *Twitter* : @maninda
>>
>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com> wrote:
>>
>>> Hi Anjana and Srinath,
>>>
>>> After the discussion I had with Anjana, I researched further into the continuation of the Shark project by Databricks. Here is what I found:
>>>
>>> - Shark was built on the Hive codebase and achieved performance improvements by swapping out the physical execution engine of Hive. While this approach let Shark users speed up their Hive queries, Shark inherited a large, complicated code base from Hive that made it hard to optimize and maintain. Hence, Databricks has announced that they are halting development of Shark as of July 2014 (Shark 0.9 will be the last release). [1]
>>> - Shark will be replaced by Spark SQL, which beats Shark in TPC-DS performance <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html> by almost an order of magnitude. It also supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore. [2]
>>> - The Shark to Spark SQL migration plan is described here: http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>> - For legacy Hive and MapReduce users, they have proposed a new 'Hive on Spark' project. [3], [4] But, given the performance enhancement, it is quite certain that Hive and MR will be replaced by engines built on top of Spark (e.g. Spark SQL).
>>>
>>> In my opinion, there are a few matters to settle if we are migrating from Hive:
>>>
>>> 1. Are we changing the query engine only? (Then, we can replace Hive with Shark.)
>>> 2. Are we changing the existing Hadoop/MapReduce framework to Spark? (Then, we can replace Hive and Hadoop with Spark and Spark SQL.)
>>>
>>> Considering the long-term impact and the availability of support, I think it is best to migrate from Hive/Hadoop to Spark. It is open for discussion!
>>>
>>> In the meantime, I have already tried Spark SQL, and Databricks' claims of improved performance seem to be true. I will work more on this.
>>>
>>> Cheers
>>>
>>> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>> [2] http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>
>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>
>>>> Hi Srinath,
>>>>
>>>> No, this has not been tested on multiple nodes. I told Niranda here in my last mail to test a cluster with the same set of hardware we are using to test our large data set with Hive. As for the effort to make the change, we still have to figure out the MT aspects of Shark. Sinthuja was working on making the latest Hive version MT-ready, and most probably we can make the same changes to the Hive version Shark is using; after we do that, the integration should be seamless. Also, as I mentioned earlier, we are going to test this with the APIM Hive script as well, to check for any unforeseen incompatibilities.
>>>>
>>>> Cheers,
>>>> Anjana.
>>>>
>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com> wrote:
>>>>
>>>>> This looks great.
>>>>>
>>>>> We need to test Spark with multiple nodes. Did we do that?
>>>>> Please create a few VMs in the performance cloud (talk to Lakmal) and test with at least 5 nodes. We need to make sure it works OK in a distributed setup as well.
>>>>>
>>>>> What does it take to change to Spark? Anjana.. how much work is it?
>>>>>
>>>>> --Srinath
>>>>>
>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>
>>>>>> Thank you Anjana.
>>>>>>
>>>>>> Yes, I am working on it.
>>>>>>
>>>>>> In the meantime, I found this in the Hive documentation [1]. It talks about Hive on Spark, and compares Hive, Shark and Spark SQL at a higher architectural level.
>>>>>>
>>>>>> Additionally, it is said that the in-memory performance of Shark can be improved by introducing Tachyon [2]. I guess we can consider this later on.
>>>>>>
>>>>>> Cheers.
>>>>>>
>>>>>> [1] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>>
>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>>>>
>>>>>>> Hi Niranda,
>>>>>>>
>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of insight into how both operate in different scenarios. As the next step, we will need to run this on an actual cluster of machines. Since you have used a subset of the 2014 DEBS challenge dataset, we should use the full data set in a clustered environment and check this. Gokul is already working on the Hive-based setup for this; after that is done, you can create a Shark cluster on the same hardware and run the tests there, to get a clear comparison of how these two match up in a cluster. Until the setup is ready, do continue with your next steps of checking the RDD support and Spark SQL use.
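On the RDD and Spark SQL checks mentioned above: as far as I can see, Spark SQL also exposes RDD caching at the query level, which is essentially where Shark gets its in-memory speed-up. A rough sketch (the table name is illustrative, and the statements are from my reading of the Spark SQL docs, so worth verifying):

```sql
-- Pin the table's underlying RDD in cluster memory (Spark SQL).
CACHE TABLE request_log;

-- Subsequent retrieval queries (FILTER / ORDER BY / JOIN style) should
-- then be served from memory rather than re-read from HDFS.
SELECT   host, COUNT(*)
FROM     request_log
GROUP BY host;

-- Release the memory when done.
UNCACHE TABLE request_log;
```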
>>>>>>> After these are done, we should also do a trial run of our own APIM Hive scripts, migrated to Shark.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Anjana.
>>>>>>>
>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have been evaluating the performance of Shark (a distributed SQL query engine for Hadoop) against Hive, with the objective of assessing whether WSO2 BAM data processing (which currently uses Hive) could be moved to Shark (and Apache Spark) for improved performance. I am sharing my findings herewith.
>>>>>>>>
>>>>>>>> *AMPLab Shark*
>>>>>>>> Shark can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. It supports Hive's QL, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones. [1]
>>>>>>>>
>>>>>>>> *Apache Spark*
>>>>>>>> Apache Spark is an open-source data-analytics cluster computing framework. It fits into the Hadoop open-source community, builds on top of HDFS, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. [2] Official documentation: [3]
>>>>>>>>
>>>>>>>> I carried out the comparison between the following Hive and Shark releases, with input files ranging from 100 to 1 billion entries.
>>>>>>>>                   Hive setup          Shark setup
>>>>>>>> QL engine         Apache Hive 0.11    Shark 0.9.1 (latest release), which uses:
>>>>>>>>                                         - Scala 2.10.3
>>>>>>>>                                         - Spark 0.9.1
>>>>>>>>                                         - AMPLab's Hive 0.9.0
>>>>>>>> Framework         Hadoop 1.0.4        Spark 0.9.1
>>>>>>>> File system       HDFS                HDFS
>>>>>>>>
>>>>>>>> Attached herewith is a report describing the performance comparison between Shark and Hive in detail.
>>>>>>>>
>>>>>>>> hive_vs_shark <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>> hive_vs_shark_report.odt <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>
>>>>>>>> In summary, the following conclusions can be drawn from the evaluation:
>>>>>>>>
>>>>>>>> - Shark performs on par with Hive in DDL operations (CREATE/DROP TABLE, DATABASE). Both engines show fairly constant performance as the input size increases.
>>>>>>>> - Shark also performs on par with Hive in plain DML operations (LOAD, INSERT), but when a DML operation is combined with a data retrieval operation (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly outperforms Hive, with a performance factor of 10x+ (ranging from 10x to 80x in some instances). Shark's performance factor decreases as the input size increases, while Hive's performance stays fairly constant.
>>>>>>>> - Shark clearly outperforms Hive in data retrieval operations (FILTER, ORDER BY, JOIN). Hive's performance is fairly constant across the data retrieval operations, while Shark's performance drops as the input size increases.
>>>>>>>> But in every instance, Shark outperformed Hive with a minimum performance factor of 5x+ (ranging from 5x to 80x in some instances).
>>>>>>>>
>>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all the information about the queries and timings graphically.
>>>>>>>>
>>>>>>>> The code repository can be found at https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>
>>>>>>>> Moving forward, I am currently working on the following:
>>>>>>>>
>>>>>>>> - Apache Spark's resilient distributed dataset (RDD) abstraction (a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel): the use of RDDs and their impact on performance.
>>>>>>>> - Spark SQL: the use of Spark SQL over Shark on the Spark framework.
>>>>>>>>
>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>
>>>>>>>> Would love to have your feedback on this.
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Niranda Perera*
>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>
>>>>>>> --
>>>>>>> *Anjana Fernando*
>>>>>>> Senior Technical Lead
>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>> lean . enterprise . middleware
>>>>>>
>>>>>> --
>>>>>> *Niranda Perera*
>>>>>> Software Engineer, WSO2 Inc.
>>>>>> Mobile: +94-71-554-8430
>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>
>>>>> --
>>>>> ============================
>>>>> Srinath Perera, Ph.D.
>>>>> http://people.apache.org/~hemapani/
>>>>> http://srinathsview.blogspot.com/
>>>>
>>>> --
>>>> *Anjana Fernando*
>>>> Senior Technical Lead
>>>> WSO2 Inc. | http://wso2.com
>>>> lean . enterprise . middleware
>>>
>>> --
>>> *Niranda Perera*
>>> Software Engineer, WSO2 Inc.
>>> Mobile: +94-71-554-8430
>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>
>> _______________________________________________
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
> --
> *Lasantha Fernando*
> Software Engineer - Data Technologies Team
> WSO2 Inc. http://wso2.com
>
> email: lasan...@wso2.com
> mobile: (+94) 71 5247551

--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>