On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <nira...@wso2.com> wrote:
> @Maninda,
>
> +1 for suggesting Spark SQL.
>
> Quoting Databricks,
> "Spark SQL provides state-of-the-art SQL performance and maintains
> compatibility with Shark/Hive. In particular, like Shark, Spark SQL
> supports all existing Hive data formats, user-defined functions (UDF), and
> the Hive metastore." [1]
>
> But I am not entirely sure whether Spark SQL and Siddhi are comparable,
> because Spark SQL (like Hive) is designed for batch processing, whereas
> Siddhi does real-time processing. But if there are implementations where
> Siddhi is run on top of Spark, it would be very interesting.

Yes, Siddhi's current mode of operation does not support this, but with
partitions we can achieve it to some extent.

Suho

> Spark supports either Hadoop 1 or 2. But I think we should see which is
> better, MR1 or YARN+MR2.
>
> [image: Hadoop Architecture] [2]
>
> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
>
> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <lasan...@wso2.com> wrote:
>
>> Hi Maninda,
>>
>> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com> wrote:
>>
>>> In the case of discontinuation of the Shark project, IMO we should not
>>> move to Shark at all. And it seems better to go with Spark SQL, as we
>>> are already using Spark for CEP. But I am not sure of the difference
>>> between Spark SQL and Siddhi queries on the Spark engine.
>>
>> Currently, we are doing the integration with CEP using Apache Storm, not
>> Spark... :-). Spark Streaming is a possible candidate for integrating
>> with CEP, but we have opted for Storm. I think there has been some
>> independent work on integrating Kafka + Spark Streaming + Siddhi.
>> Please refer to the thread on arch@, "[Architecture] A few questions
>> about WSO2 CEP/Siddhi".
>>
>>> And we have to figure out how Spark SQL is used for historical data,
>>> and whether it can execute incremental processing by default, which
>>> would cover all our existing BAM use cases.
>>> On the other hand, Hadoop 2 [1] uses a completely different platform
>>> for resource allocation, known as YARN. Sometimes this may be more
>>> suitable for batch jobs.
>>>
>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>>
>> Thanks,
>> Lasantha
>>
>>> *Maninda Edirisooriya*
>>> Senior Software Engineer
>>>
>>> *WSO2, Inc. *lean.enterprise.middleware.
>>>
>>> *Blog* : http://maninda.blogspot.com/
>>> *E-mail* : mani...@wso2.com
>>> *Skype* : @manindae
>>> *Twitter* : @maninda
>>>
>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com> wrote:
>>>
>>>> Hi Anjana and Srinath,
>>>>
>>>> After the discussion I had with Anjana, I researched more on the
>>>> continuation of the Shark project by Databricks.
>>>>
>>>> Here's what I found out:
>>>> - Shark was built on the Hive codebase and achieved performance
>>>> improvements by swapping out the physical execution engine of Hive.
>>>> While this approach enabled Shark users to speed up their Hive
>>>> queries, Shark inherited a large, complicated code base from Hive that
>>>> made it hard to optimize and maintain. Hence, Databricks has announced
>>>> that they are halting the development of Shark from July 2014 (Shark
>>>> 0.9 will be the last release). [1]
>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS
>>>> performance
>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html>
>>>> by almost an order of magnitude. It also supports all existing Hive
>>>> data formats, user-defined functions (UDF), and the Hive metastore.
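[Since the discussion hinges on existing Hive QL running unchanged on
Spark SQL's Hive compatibility layer, here is a sketch of the kind of
script being talked about. The table and column names are hypothetical,
not from any actual BAM script:]

```sql
-- Hypothetical BAM-style event table; names are illustrative only.
CREATE TABLE IF NOT EXISTS http_events (
    event_time    BIGINT,
    host          STRING,
    response_code INT,
    response_time DOUBLE
);

-- A typical summarisation query. Because Spark SQL (like Shark) claims
-- support for Hive QL, Hive data formats, UDFs, and the Hive metastore,
-- a script of this shape should run on either engine without changes.
SELECT host,
       COUNT(*)           AS request_count,
       AVG(response_time) AS avg_response_time
FROM http_events
WHERE response_code >= 500
GROUP BY host;
```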
>>>> [2]
>>>> - Following is the Shark to Spark SQL migration plan:
>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>>> - For legacy Hive and MapReduce users, they have proposed a new
>>>> 'Hive on Spark' project [3], [4]. But, given the performance
>>>> enhancements, it is quite certain that Hive and MR will be replaced
>>>> by engines built on top of Spark (e.g. Spark SQL).
>>>>
>>>> In my opinion, there are a few matters to figure out if we are
>>>> migrating from Hive:
>>>>
>>>> 1. Are we changing the query engine only? (Then we can replace Hive
>>>> with Shark.)
>>>> 2. Are we changing the existing Hadoop/MapReduce framework to Spark?
>>>> (Then we can replace Hive and Hadoop with Spark and Spark SQL.)
>>>>
>>>> In my opinion, considering the long-term impact and the availability
>>>> of support, it is best to migrate from Hive/Hadoop to Spark. It is
>>>> open for discussion!
>>>>
>>>> In the meantime, I have already tried Spark SQL, and Databricks'
>>>> claim of improved performance seems to be true. I will work more on
>>>> this.
>>>>
>>>> Cheers
>>>>
>>>> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>> [2] http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>>
>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>>
>>>>> Hi Srinath,
>>>>>
>>>>> No, this has not been tested on multiple nodes. I told Niranda here
>>>>> in my last mail to test a cluster with the same set of hardware that
>>>>> we are using to test our large data set with Hive. As for the effort
>>>>> to make the change, we still have to figure out the MT aspects of
>>>>> Shark here.
>>>>> Sinthuja was working on making the latest Hive version MT ready,
>>>>> and most probably we can make the same changes to the Hive version
>>>>> Shark is using. So after we do that, the integration should be
>>>>> seamless. Also, as I mentioned earlier, we are going to test this
>>>>> with the APIM Hive script, to check if there are any unforeseen
>>>>> incompatibilities.
>>>>>
>>>>> Cheers,
>>>>> Anjana.
>>>>>
>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com> wrote:
>>>>>
>>>>>> This looks great.
>>>>>>
>>>>>> We need to test Spark with multiple nodes; did we do that? Please
>>>>>> create a few VMs in the performance cloud (talk to Lakmal) and test
>>>>>> with at least 5 nodes. We need to make sure it works well in a
>>>>>> distributed setup too.
>>>>>>
>>>>>> What does it take to change to Spark? Anjana, how much work is it?
>>>>>>
>>>>>> --Srinath
>>>>>>
>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>
>>>>>>> Thank you, Anjana.
>>>>>>>
>>>>>>> Yes, I am working on it.
>>>>>>>
>>>>>>> In the meantime, I found this in the Hive documentation [1]. It
>>>>>>> talks about Hive on Spark, and compares Hive, Shark, and Spark SQL
>>>>>>> at a higher architectural level.
>>>>>>>
>>>>>>> Additionally, it is said that the in-memory performance of Shark
>>>>>>> can be improved by introducing Tachyon [2]. I guess we can
>>>>>>> consider this later on.
>>>>>>>
>>>>>>> Cheers.
>>>>>>>
>>>>>>> [1] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>>>
>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>>>>>
>>>>>>>> Hi Niranda,
>>>>>>>>
>>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of insight
>>>>>>>> into how both operate in different scenarios.
>>>>>>>> As the next step, we will need to run this on an actual cluster
>>>>>>>> of computers. Since you have used a subset of the dataset of the
>>>>>>>> 2014 DEBS challenge, we should use the full data set in a
>>>>>>>> clustered environment and check this. Gokul is already working on
>>>>>>>> the Hive-based setup for this; after that is done, you can create
>>>>>>>> a Shark cluster on the same hardware and run the tests there, to
>>>>>>>> get a clear comparison of how these two match up in a cluster.
>>>>>>>> Until the setup is ready, do continue with your next steps on
>>>>>>>> checking the RDD support and Spark SQL use.
>>>>>>>>
>>>>>>>> After these are done, we should also do a trial run of our own
>>>>>>>> APIM Hive scripts, migrated to Shark.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Anjana.
>>>>>>>>
>>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I have been evaluating the performance of Shark (a distributed
>>>>>>>>> SQL query engine for Hadoop) against Hive, with the objective of
>>>>>>>>> assessing the possibility of moving WSO2 BAM data processing
>>>>>>>>> (which currently uses Hive) to Shark (and Apache Spark) for
>>>>>>>>> improved performance.
>>>>>>>>>
>>>>>>>>> I am sharing my findings herewith.
>>>>>>>>>
>>>>>>>>> *AMP Lab Shark*
>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster than
>>>>>>>>> Hive without any modification to the existing data or queries.
>>>>>>>>> It supports Hive's QL, metastore, serialization formats, and
>>>>>>>>> user-defined functions, providing seamless integration with
>>>>>>>>> existing Hive deployments and a familiar, more powerful option
>>>>>>>>> for new ones. [1]
>>>>>>>>>
>>>>>>>>> *Apache Spark*
>>>>>>>>> Apache Spark is an open-source data analytics cluster computing
>>>>>>>>> framework.
>>>>>>>>> It fits into the Hadoop open-source community, building on top
>>>>>>>>> of HDFS, and promises performance up to 100 times faster than
>>>>>>>>> Hadoop MapReduce for certain applications. [2]
>>>>>>>>> Official documentation: [3]
>>>>>>>>>
>>>>>>>>> I carried out the comparison between the following Hive and
>>>>>>>>> Shark releases, with input files ranging from 100 to 1 billion
>>>>>>>>> entries:
>>>>>>>>>
>>>>>>>>> QL engine:
>>>>>>>>> - Hive: Apache Hive 0.11
>>>>>>>>> - Shark: Shark 0.9.1 (latest release), which uses Scala 2.10.3,
>>>>>>>>>   Spark 0.9.1, and AMPLab's Hive 0.9.0
>>>>>>>>> Framework:
>>>>>>>>> - Hive: Hadoop 1.0.4
>>>>>>>>> - Shark: Spark 0.9.1
>>>>>>>>> File system:
>>>>>>>>> - HDFS for both
>>>>>>>>>
>>>>>>>>> Attached herewith is a report which describes the performance
>>>>>>>>> comparison between Shark and Hive in detail.
>>>>>>>>>
>>>>>>>>> hive_vs_shark
>>>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>>>
>>>>>>>>> hive_vs_shark_report.odt
>>>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>>
>>>>>>>>> In summary, the following conclusions can be drawn from the
>>>>>>>>> evaluation:
>>>>>>>>>
>>>>>>>>> - Shark is on par with Hive in DDL operations (CREATE, DROP ..
>>>>>>>>> TABLE, DATABASE). Both engines show fairly constant performance
>>>>>>>>> as the input size increases.
>>>>>>>>> - Shark is on par with Hive in plain DML operations (LOAD,
>>>>>>>>> INSERT), but when a DML operation is combined with a data
>>>>>>>>> retrieval operation (e.g.
>>>>>>>>> INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly
>>>>>>>>> outperforms Hive, with a performance factor of 10x+ (ranging
>>>>>>>>> from 10x to 80x in some instances). Shark's performance factor
>>>>>>>>> decreases as the input size increases, while Hive's performance
>>>>>>>>> stays fairly constant.
>>>>>>>>> - Shark clearly outperforms Hive in data retrieval operations
>>>>>>>>> (FILTER, ORDER BY, JOIN). Hive's performance is fairly constant
>>>>>>>>> in the data retrieval operations, while Shark's performance
>>>>>>>>> drops as the input size increases. But in every instance Shark
>>>>>>>>> outperformed Hive, with a minimum performance factor of 5x+
>>>>>>>>> (ranging from 5x to 80x in some instances).
>>>>>>>>>
>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it has all the
>>>>>>>>> information about the queries and timings, presented
>>>>>>>>> graphically.
>>>>>>>>>
>>>>>>>>> The code repository can be found at
>>>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>>
>>>>>>>>> Moving forward, I am currently working on the following:
>>>>>>>>>
>>>>>>>>> - Apache Spark's resilient distributed dataset (RDD) abstraction
>>>>>>>>> (a collection of elements partitioned across the nodes of the
>>>>>>>>> cluster that can be operated on in parallel): the use of RDDs
>>>>>>>>> and their impact on performance.
>>>>>>>>> - Spark SQL: the use of Spark SQL over Shark on the Spark
>>>>>>>>> framework.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>>
>>>>>>>>> Would love to have your feedback on this.
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Niranda Perera*
>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Anjana Fernando*
>>>>>>>> Senior Technical Lead
>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>> lean . enterprise . middleware
>>>>>>
>>>>>> --
>>>>>> ============================
>>>>>> Srinath Perera, Ph.D.
>>>>>> http://people.apache.org/~hemapani/
>>>>>> http://srinathsview.blogspot.com/
>>
>> --
>> *Lasantha Fernando*
>> Software Engineer - Data Technologies Team
>> WSO2 Inc. http://wso2.com
>>
>> email: lasan...@wso2.com
>> mobile: (+94) 71 5247551

--
*S. Suhothayan*
Technical Lead & Team Lead of WSO2 Complex Event Processor
*WSO2 Inc. *http://wso2.com
lean . enterprise . middleware
*cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/
twitter: http://twitter.com/suhothayan | linked-in:
http://lk.linkedin.com/in/suhothayan*
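[As a footnote to the benchmark discussion above: the three classes of
queries compared in the hive_vs_shark report (DDL, DML combined with
retrieval, and pure data retrieval) can be sketched in Hive QL as
follows. Table and column names here are hypothetical, not taken from
the actual benchmark scripts:]

```sql
-- Hypothetical table and column names, for illustration only.

-- 1. DDL operations: both engines performed on par, roughly constant
--    with input size.
CREATE TABLE events_summary (host STRING, hits BIGINT);

-- 2. DML combined with data retrieval, the INSERT ... SELECT pattern
--    where Shark showed the 10x-80x advantage.
INSERT OVERWRITE TABLE events_summary
SELECT host, COUNT(*)
FROM events
GROUP BY host;

-- 3. Pure data retrieval (FILTER / ORDER BY / JOIN), where Shark
--    showed a minimum 5x advantage.
SELECT e.host, s.hits
FROM events e
JOIN events_summary s ON (e.host = s.host)
WHERE s.hits > 1000
ORDER BY s.hits DESC;
```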
_______________________________________________ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture