Re: [Architecture] [POC] Performance evaluation of Hive vs Shark

Lasantha Fernando Wed, 20 Aug 2014 00:45:12 -0700

Hi Maninda,

On 20 August 2014 12:02, Maninda Edirisooriya <[email protected]> wrote:


> Given that the Shark project is being discontinued, IMO we should not move
> to Shark at all.
> It seems better to go with Spark SQL, as we are already using Spark for
> CEP. But I am not sure about the difference between Spark SQL and Siddhi
> queries on the Spark engine.
>

Currently, we are integrating with CEP using Apache Storm, not
Spark... :-). Spark Streaming is a possible candidate for integrating with
CEP, but we have opted for Storm. I think there has been some independent
work on integrating Kafka + Spark Streaming + Siddhi. Please refer to the
thread on arch@: "[Architecture] A few questions about WSO2 CEP/Siddhi".


> And we have to figure out how Spark SQL is used for historical data, and
> whether it can do incremental processing by default, so that it can
> cover all our existing BAM use cases.
> On the other hand, Hadoop 2 [1] uses a completely different platform for
> resource allocation, known as YARN. This may sometimes be more suitable
> for batch jobs.
>
> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>
>
Thanks,
Lasantha

>
> *Maninda Edirisooriya*
> Senior Software Engineer
>
> *WSO2, Inc. *lean.enterprise.middleware.
>
> *Blog* : http://maninda.blogspot.com/
> *E-mail* : [email protected]
> *Skype* : @manindae
> *Twitter* : @maninda
>
>
> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <[email protected]> wrote:
>
>> Hi Anjana and Srinath,
>>
>> After the discussion I had with Anjana, I looked further into whether
>> Databricks will continue the Shark project.
>>
>> Here's what I found out:
>> - Shark was built on the Hive codebase and achieved performance
>> improvements by swapping out the physical execution engine part of Hive.
>> While this approach enabled Shark users to speed up their Hive queries,
>> Shark inherited a large, complicated code base from Hive that made it hard
>> to optimize and maintain.
>> Hence, Databricks has announced that they are halting the development of
>> Shark as of July 2014 (Shark 0.9 will be the last release). [1]
>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS
>> performance
>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html>
>> by almost an order of magnitude. It also supports all existing Hive data
>> formats, user-defined functions (UDF), and the Hive metastore.  [2]
>> - The Shark to Spark SQL migration plan is described here:
>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>
>> - For legacy Hive and MapReduce users, a new 'Hive on Spark' project
>> has been proposed. [3], [4]
>> But, given the performance improvement, it is quite certain that Hive and
>> MR will be replaced by engines built on top of Spark (e.g. Spark SQL).
>>
>>
>>
>> In my opinion, there are a few questions to settle if we are migrating
>> from Hive:
>>
>> 1. Are we changing the query engine only? (Then we can replace
>> Hive with Shark.)
>> 2. Are we changing the existing Hadoop/MapReduce framework to
>> Spark? (Then we can replace Hive and Hadoop with Spark and Spark SQL.)
>>
>>
>> In my opinion, considering the long-term impact and the availability of
>> support, it is best to migrate from Hive/Hadoop to Spark.
>> It is open for discussion!
>>
>> In the meantime, I've already tried Spark SQL, and Databricks' claims of
>> improved performance seem to be true. I will work more on this.
>>
>> Cheers
>>
>> [1]
>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>> [2]
>> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>
>>
>>
>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <[email protected]>
>> wrote:
>>
>>> Hi Srinath,
>>>
>>> No, this has not been tested on multiple nodes. In my last mail here, I
>>> asked Niranda to test on a cluster with the same hardware that we are
>>> using to test our large data set with Hive. As for the effort to
>>> make the change, we still have to figure out the MT aspects of Shark.
>>> Sinthuja was working on making the latest Hive version MT ready, and most
>>> probably we can make the same changes to the Hive version Shark is using.
>>> After we do that, the integration should be seamless. Also, as I
>>> mentioned earlier, we are going to test this with the APIM Hive
>>> script, to check whether there are any unforeseen incompatibilities.
>>>
>>> Cheers,
>>> Anjana.
>>>
>>>
>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <[email protected]>
>>> wrote:
>>>
>>>> This looks great.
>>>>
>>>> We need to test Spark with multiple nodes. Did we do that? Please
>>>> create a few VMs in the performance cloud (talk to Lakmal) and test with
>>>> at least 5 nodes. We need to make sure it works OK with a distributed
>>>> setup as well.
>>>>
>>>> What does it take to change to Spark? Anjana .. how much work is it?
>>>>
>>>> --Srinath
>>>>
>>>>
>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <[email protected]>
>>>> wrote:
>>>>
>>>>> Thank you Anjana.
>>>>>
>>>>> Yes, I am working on it.
>>>>>
>>>>> In the meantime, I found this in the Hive documentation [1]. It talks
>>>>> about Hive on Spark, and compares Hive, Shark and Spark SQL at a high
>>>>> architectural level.
>>>>>
>>>>> Additionally, it is said that the in-memory performance of Shark can
>>>>> be improved by introducing Tachyon [2]. I guess we can consider this later
>>>>> on.
>>>>>
>>>>> Cheers.
>>>>>
>>>>> [1]
>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>  Hi Niranda,
>>>>>>
>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of insight
>>>>>> into how both operate in different scenarios. As the next step, we will
>>>>>> need to run this on an actual cluster of computers. Since you've used a
>>>>>> subset of the 2014 DEBS challenge dataset, we should use the full data
>>>>>> set in a clustered environment and check this. Gokul is already working
>>>>>> on the Hive-based setup for this. After that is done, you can create a
>>>>>> Shark cluster on the same hardware and run the tests there, to get a
>>>>>> clear comparison of how these two match up in a cluster. Until the setup
>>>>>> is ready, do continue with your next steps on checking the RDD support
>>>>>> and Spark SQL use.
>>>>>>
>>>>>> After these are done, we should also do a trial run of our own APIM
>>>>>> Hive scripts, migrated to Shark.
>>>>>>
>>>>>> Cheers,
>>>>>> Anjana.
>>>>>>
>>>>>>
>>>>>>  On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I have been evaluating the performance of Shark (distributed SQL
>>>>>>> query engine for Hadoop) against Hive. The objective is to assess
>>>>>>> whether the WSO2 BAM data processing (which currently uses Hive)
>>>>>>> can be moved to Shark (and Apache Spark) for improved performance.
>>>>>>>
>>>>>>> I am sharing my findings herewith.
>>>>>>>
>>>>>>>  *AMP Lab Shark*
>>>>>>> Shark can execute Hive QL queries up to 100 times faster than Hive
>>>>>>> without any modification to the existing data or queries. It supports
>>>>>>> Hive's QL, metastore, serialization formats, and user-defined functions,
>>>>>>> providing seamless integration with existing Hive deployments and a
>>>>>>> familiar, more powerful option for new ones. [1]
>>>>>>>
>>>>>>>
>>>>>>> *Apache Spark*
>>>>>>> Apache Spark is an open-source data analytics cluster
>>>>>>> computing framework. It fits into the Hadoop open-source community,
>>>>>>> building on top of HDFS, and promises performance up to 100 times
>>>>>>> faster than Hadoop MapReduce for certain applications. [2]
>>>>>>> Official documentation: [3]
>>>>>>>
>>>>>>>
>>>>>>> I carried out the comparison between the following Hive and Shark
>>>>>>> releases with input files ranging from 100 to 1 billion entries.
>>>>>>>
>>>>>>>                Hive                Shark
>>>>>>> QL Engine      Apache Hive 0.11    Shark 0.9.1 (latest release), which uses:
>>>>>>>                                      - Scala 2.10.3
>>>>>>>                                      - Spark 0.9.1
>>>>>>>                                      - AMPLab's Hive 0.9.0
>>>>>>> Framework      Hadoop 1.0.4        Spark 0.9.1
>>>>>>> File system    HDFS                HDFS
>>>>>>>
>>>>>>> Attached herewith is a report that describes the performance
>>>>>>> comparison between Shark and Hive in detail.
>>>>>>> 
>>>>>>>  hive_vs_shark
>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>> 
>>>>>>>  hive_vs_shark_report.odt
>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>> 
>>>>>>>
>>>>>>> In summary, the following conclusions can be drawn from the evaluation.
>>>>>>>
>>>>>>>    - Shark performs about the same as Hive in DDL operations (CREATE,
>>>>>>>    DROP .. TABLE, DATABASE). Both engines show fairly constant
>>>>>>>    performance as the input size increases.
>>>>>>>    - Shark performs about the same as Hive in DML operations (LOAD,
>>>>>>>    INSERT), but when a DML operation is combined with a data retrieval
>>>>>>>    operation (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark
>>>>>>>    significantly outperforms Hive, by a factor of 10x or more (ranging
>>>>>>>    from 10x to 80x in some instances). Shark's performance factor
>>>>>>>    decreases as the input size increases, while Hive's performance
>>>>>>>    stays fairly steady.
>>>>>>>    - Shark clearly outperforms Hive in data retrieval operations
>>>>>>>    (FILTER, ORDER BY, JOIN). Hive's performance stays fairly steady in
>>>>>>>    these operations, while Shark's performance drops as the input size
>>>>>>>    increases. Even so, in every instance Shark outperformed Hive by a
>>>>>>>    factor of at least 5x (ranging from 5x to 80x in some instances).
>>>>>>>
>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all the
>>>>>>> queries and timings graphically.
>>>>>>>
>>>>>>> The code repository can also be found at
>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>
>>>>>>>
>>>>>>> Moving forward, I am currently working on the following.
>>>>>>>
>>>>>>>    - Apache Spark's resilient distributed dataset (RDD) abstraction
>>>>>>>    (a collection of elements partitioned across the nodes of the
>>>>>>>    cluster that can be operated on in parallel): the use of RDDs and
>>>>>>>    their impact on performance.
>>>>>>>    - Spark SQL: using Spark SQL in place of Shark on the Spark framework.
>>>>>>>
>>>>>>>
>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Would love to have your feedback on this.
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> --
>>>>>>>  *Niranda Perera*
>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>> Mobile: +94-71-554-8430
>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Anjana Fernando*
>>>>>> Senior Technical Lead
>>>>>> WSO2 Inc. | http://wso2.com
>>>>>> lean . enterprise . middleware
>>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ============================
>>>> Srinath Perera, Ph.D.
>>>>    http://people.apache.org/~hemapani/
>>>>    http://srinathsview.blogspot.com/
>>>>
>
>
> _______________________________________________
> Architecture mailing list
> [email protected]
> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
>


-- 
*Lasantha Fernando*
Software Engineer - Data Technologies Team
WSO2 Inc. http://wso2.com

email: [email protected]
mobile: (+94) 71 5247551

