Given the discontinuation of the Shark project, IMO we should not move to
Shark at all.
It seems better to go with Spark SQL, as we are already using Spark for
CEP. However, I am not sure of the difference between Spark SQL and Siddhi
queries running on the Spark engine. We also have to figure out how Spark SQL
can be used for historical data, i.e. whether it can execute incremental
processing out of the box, which would cover all our existing BAM use cases.
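
Just to make that question concrete, here is a rough, untested sketch
(assuming the Spark 1.0.x SQLContext API, a hypothetical "events" data set in
HDFS, and a hypothetical persisted high-water-mark timestamp) of what an
incremental run over historical data could look like, processing only records
newer than the last run instead of re-scanning the full data set:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Hypothetical record shape for BAM-style events; field names are assumptions.
    case class Event(id: String, timestamp: Long, payload: String)

    object IncrementalSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("incremental-sketch"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.createSchemaRDD // implicit conversion: RDD[Event] -> SchemaRDD

        // High-water mark from the previous run; in BAM this would have to be persisted.
        val lastProcessed = 1407700000000L

        val events = sc.textFile("hdfs:///bam/events.csv") // hypothetical input path
          .map(_.split(","))
          .map(p => Event(p(0), p(1).toLong, p(2)))
        events.registerAsTable("events")

        // Only rows newer than the last successful run are selected.
        val fresh = sqlContext.sql(
          s"SELECT id, payload FROM events WHERE timestamp > $lastProcessed")
        fresh.collect().foreach(println)

        sc.stop()
      }
    }

Whether BAM can maintain such a watermark per script is exactly the part we
would need to verify.
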
On the other hand, Hadoop 2 [1] uses a completely different platform for
resource allocation, known as YARN, which may sometimes be more suitable for
batch jobs.

[1] https://www.youtube.com/watch?v=RncoVN0l6dc


*Maninda Edirisooriya*
Senior Software Engineer

*WSO2, Inc.* lean . enterprise . middleware

*Blog* : http://maninda.blogspot.com/
*E-mail* : mani...@wso2.com
*Skype* : @manindae
*Twitter* : @maninda


On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com> wrote:

> Hi Anjana and Srinath,
>
> After the discussion I had with Anjana, I researched further into the
> continuation of the Shark project by Databricks.
>
> Here's what I found out,
> - Shark was built on the Hive codebase and achieved performance
> improvements by swapping out the physical execution engine part of Hive.
> While this approach enabled Shark users to speed up their Hive queries,
> Shark inherited a large, complicated code base from Hive that made it hard
> to optimize and maintain.
> Hence, Databricks has announced that they are halting the development of
> Shark as of July 2014 (Shark 0.9 will be the last release). [1]
> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS performance
> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html>
> by almost an order of magnitude. It also supports all existing Hive data
> formats, user-defined functions (UDFs), and the Hive metastore [2] (see the
> sketch after this list).
> - Following is the Shark to Spark SQL migration plan:
> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>
> - For legacy Hive and MapReduce users, they have proposed a new 'Hive on
> Spark' project [3], [4].
> But, given the performance enhancements, it is quite certain that Hive and
> MR will be replaced by engines built on top of Spark (e.g. Spark SQL).
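>
> To make the Hive-compatibility point concrete, here is a minimal, untested
> sketch (assuming the Spark 1.0.x HiveContext API and a hypothetical existing
> Hive table named org_wso2_bam_service_stats); it only illustrates the kind
> of drop-in reuse of existing Hive QL and the Hive metastore that Databricks
> describes:
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.sql.hive.HiveContext
>
>     object HiveCompatSketch {
>       def main(args: Array[String]): Unit = {
>         val sc = new SparkContext(new SparkConf().setAppName("hive-compat-sketch"))
>
>         // HiveContext works against the existing Hive metastore, data formats and UDFs.
>         val hiveContext = new HiveContext(sc)
>
>         // The same Hive QL we run today should work unchanged (table name is hypothetical).
>         val result = hiveContext.hql(
>           "SELECT service_name, COUNT(*) AS request_count " +
>           "FROM org_wso2_bam_service_stats GROUP BY service_name")
>         result.collect().foreach(println)
>
>         sc.stop()
>       }
>     }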
>
>
>
> In my opinion, there are a few matters to figure out if we are migrating
> away from Hive:
>
> 1. Are we changing the query engine only? (Then we can replace Hive with
> Shark.)
> 2. Are we changing the existing Hadoop/MapReduce framework to Spark? (Then
> we can replace Hive and Hadoop with Spark and Spark SQL.)
>
>
> In my opinion, considering the long-term impact and the availability of
> support, it is best to migrate from Hive/Hadoop to Spark.
> This is open for discussion!
>
> In the meantime, I have already tried Spark SQL, and Databricks' claims of
> improved performance seem to hold. I will work more on this.
>
> Cheers
>
> [1]
> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
> [2]
> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
> [3] https://issues.apache.org/jira/browse/HIVE-7292
> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>
>
>
> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com> wrote:
>
>> Hi Srinath,
>>
>> No, this has not been tested on multiple nodes. As I told Niranda in my
>> last mail here, he should test a cluster on the same set of hardware we are
>> using to test our large data set with Hive. As for the effort to make the
>> change, we still have to figure out the MT aspects of Shark. Sinthuja was
>> working on making the latest Hive version MT-ready, and most probably we
>> can apply the same changes to the Hive version Shark uses. After we do
>> that, the integration should be seamless. Also, as I mentioned earlier
>> here, we are going to test this with the APIM Hive script, to check if
>> there are any unforeseen incompatibilities.
>>
>> Cheers,
>> Anjana.
>>
>>
>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com>
>> wrote:
>>
>>> This looks great.
>>>
>>> We need to test Spark with multiple nodes; did we do that? Please create a
>>> few VMs in the performance cloud (talk to Lakmal) and test with at least 5
>>> nodes. We need to make sure it works OK in a distributed setup as well.
>>>
>>> What does it take to change to Spark? Anjana, how much work is it?
>>>
>>> --Srinath
>>>
>>>
>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com>
>>> wrote:
>>>
>>>> Thank you Anjana.
>>>>
>>>> Yes, I am working on it.
>>>>
>>>> In the meantime, I found this in the Hive documentation [1]. It talks
>>>> about Hive on Spark, and compares Hive, Shark and Spark SQL at a higher
>>>> architectural level.
>>>>
>>>> Additionally, it is said that the in-memory performance of Shark can be
>>>> improved by introducing Tachyon [2]. I guess we can consider this later on.
>>>>
>>>> Cheers.
>>>>
>>>> [1]
>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>
>>>>
>>>>
>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com>
>>>> wrote:
>>>>
>>>>> Hi Niranda,
>>>>>
>>>>> Excellent analysis of Hive vs Shark! This gives a lot of insight into
>>>>> how both operate in different scenarios. As the next step, we will need
>>>>> to run this on an actual cluster of computers. Since you've used a
>>>>> subset of the 2014 DEBS challenge dataset, we should use the full data
>>>>> set in a clustered environment and check this. Gokul is already working
>>>>> on the Hive-based setup for this; after that is done, you can create a
>>>>> Shark cluster on the same hardware and run the tests there, to get a
>>>>> clear comparison of how these two match up in a cluster. Until the setup
>>>>> is ready, do continue with your next steps of checking the RDD support
>>>>> and Spark SQL use.
>>>>>
>>>>> After these are done, we should also do a trial run of our own APIM Hive
>>>>> scripts, migrated to Shark.
>>>>>
>>>>> Cheers,
>>>>> Anjana.
>>>>>
>>>>>
>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I have been evaluating the performance of Shark (a distributed SQL
>>>>>> query engine for Hadoop) against Hive, with the objective of assessing
>>>>>> the possibility of moving WSO2 BAM data processing (which currently
>>>>>> uses Hive) to Shark (and Apache Spark) for improved performance.
>>>>>>
>>>>>> I am sharing my findings herewith.
>>>>>>
>>>>>>  *AMP Lab Shark*
>>>>>> Shark can execute Hive QL queries up to 100 times faster than Hive
>>>>>> without any modification to the existing data or queries. It supports
>>>>>> Hive's QL, metastore, serialization formats, and user-defined functions,
>>>>>> providing seamless integration with existing Hive deployments and a
>>>>>> familiar, more powerful option for new ones. [1]
>>>>>>
>>>>>>
>>>>>> *Apache Spark*
>>>>>> Apache Spark is an open-source data analytics cluster computing
>>>>>> framework. It fits into the Hadoop open-source community, building on
>>>>>> top of HDFS, and promises performance up to 100 times faster than
>>>>>> Hadoop MapReduce for certain applications. [2]
>>>>>> Official documentation: [3]
>>>>>>
>>>>>>
>>>>>> I carried out the comparison between the following Hive and Shark
>>>>>> releases with input files ranging from 100 to 1 billion entries.
>>>>>>
>>>>>>                Hive setup            Shark setup
>>>>>> QL engine      Apache Hive 0.11      Shark 0.9.1 (latest release), which uses
>>>>>>                                      Scala 2.10.3, Spark 0.9.1 and AMPLab's Hive 0.9.0
>>>>>> Framework      Hadoop 1.0.4          Spark 0.9.1
>>>>>> File system    HDFS                  HDFS
>>>>>>
>>>>>> Attached herewith is a report which describes the performance
>>>>>> comparison between Shark and Hive in detail.
>>>>>>
>>>>>> hive_vs_shark
>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>> hive_vs_shark_report.odt
>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>
>>>>>> In summary,
>>>>>>
>>>>>> From the evaluation, the following conclusions can be drawn.
>>>>>>
>>>>>>    - Shark is on par with Hive in DDL operations (CREATE/DROP TABLE,
>>>>>>    DATABASE). Both engines show fairly constant performance as the
>>>>>>    input size increases.
>>>>>>    - Shark is on par with Hive in plain DML operations (LOAD, INSERT),
>>>>>>    but when a DML operation is combined with a data retrieval operation
>>>>>>    (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly
>>>>>>    outperforms Hive, with a performance factor of 10x+ (ranging from
>>>>>>    10x to 80x in some instances). Shark's performance factor decreases
>>>>>>    as the input size increases, while Hive's performance is fairly
>>>>>>    insensitive to it.
>>>>>>    - Shark clearly outperforms Hive in data retrieval operations
>>>>>>    (FILTER, ORDER BY, JOIN). Hive's performance is fairly insensitive
>>>>>>    to input size in these operations, while Shark's performance drops
>>>>>>    as the input size increases. Even so, in every instance Shark
>>>>>>    outperformed Hive by a minimum performance factor of 5x+ (ranging
>>>>>>    from 5x to 80x in some instances).
>>>>>>
>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all the
>>>>>> information about the queries and timings graphically.
>>>>>>
>>>>>> The code repository can also be found at
>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>
>>>>>> Moving forward, I am currently working on the following.
>>>>>>
>>>>>>    - Apache Spark's resilient distributed dataset (RDD) abstraction (a
>>>>>>    collection of elements partitioned across the nodes of the cluster
>>>>>>    that can be operated on in parallel): the use of RDDs and their
>>>>>>    impact on performance (see the sketch after this list).
>>>>>>    - Spark SQL: the use of Spark SQL over Shark on the Spark framework.
>>>>>>
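>>>>>> As a pointer to what I mean by the RDD abstraction, here is a minimal,
>>>>>> untested sketch (assuming the Spark 0.9.1 core API and a hypothetical
>>>>>> event log in HDFS); it only shows that an RDD is partitioned across the
>>>>>> cluster, that transformations on it run in parallel, and that it can be
>>>>>> cached in memory between actions:
>>>>>>
>>>>>>     import org.apache.spark.{SparkConf, SparkContext}
>>>>>>
>>>>>>     object RddSketch {
>>>>>>       def main(args: Array[String]): Unit = {
>>>>>>         val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch"))
>>>>>>
>>>>>>         // Transformations (filter) build new RDDs lazily; nothing runs
>>>>>>         // until an action (count) is called.
>>>>>>         val lines  = sc.textFile("hdfs:///bam/events.log") // hypothetical path
>>>>>>         val errors = lines.filter(_.contains("ERROR"))
>>>>>>         errors.cache() // keep the filtered partitions in memory for reuse
>>>>>>
>>>>>>         println("error events: " + errors.count())
>>>>>>         println("auth errors : " + errors.filter(_.contains("auth")).count())
>>>>>>
>>>>>>         sc.stop()
>>>>>>       }
>>>>>>     }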
>>>>>>
>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>
>>>>>>
>>>>>>
>>>>>> Would love to have your feedback on this.
>>>>>>
>>>>>> Best regards
>>>>>>
>>>>>> --
>>>>>>  *Niranda Perera*
>>>>>> Software Engineer, WSO2 Inc.
>>>>>> Mobile: +94-71-554-8430
>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Anjana Fernando*
>>>>> Senior Technical Lead
>>>>> WSO2 Inc. | http://wso2.com
>>>>> lean . enterprise . middleware
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> *Niranda Perera*
>>>> Software Engineer, WSO2 Inc.
>>>> Mobile: +94-71-554-8430
>>>>  Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>
>>>
>>>
>>>
>>> --
>>> ============================
>>> Srinath Perera, Ph.D.
>>>    http://people.apache.org/~hemapani/
>>>    http://srinathsview.blogspot.com/
>>>
>>
>>
>>
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>>
>
>
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 <https://twitter.com/N1R44>
>
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
