Re: [Architecture] [POC] Performance evaluation of Hive vs Shark

Anjana Fernando Wed, 13 Aug 2014 23:48:07 -0700

Hi Srinath,

No, this has not been tested on multiple nodes yet. In my last mail here, I
asked Niranda to test on a cluster using the same hardware that we are
using to test our large data set with Hive. As for the effort to make
the change, we still have to figure out the MT aspects of Shark.
Sinthuja was working on making the latest Hive version MT ready, and most
probably we can make the same changes to the Hive version Shark is using.
Once we do that, the integration should be seamless. Also, as I
mentioned earlier here, we are going to test this with the APIM Hive
script, to check for any unforeseen incompatibilities.


Cheers,
Anjana.


On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com> wrote:

> This looks great.
>
> We need to test Spark with multiple nodes. Did we do that? Please create a
> few VMs in the performance cloud (talk to Lakmal) and test with at least 5
> nodes. We need to make sure it works OK with a distributed setup as well.
>
> What does it take to change to Spark? Anjana, how much work is it?
>
> --Srinath
>
>
> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> wrote:
>
>> Thank you Anjana.
>>
>> Yes, I am working on it.
>>
>> In the meantime, I found this in the Hive documentation [1]. It talks about
>> Hive on Spark, and compares Hive, Shark and Spark SQL at a higher
>> architectural level.
>>
>> Additionally, it is said that the in-memory performance of Shark can be
>> improved by introducing Tachyon [2]. I guess we can consider this later on.
>>
>> Cheers.
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>
>>
>>
>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>
>>> Hi Niranda,
>>>
>>> Excellent analysis of Hive vs Shark! This gives a lot of insight into
>>> how both operate in different scenarios. As the next step, we will need to
>>> run this in an actual cluster of computers. Since you've used a subset of
>>> the DEBS 2014 challenge dataset, we should use the full data set in a
>>> clustered environment and check this. Gokul is already working on the
>>> Hive-based setup for this. After that is done, you can create a Shark
>>> cluster on the same hardware and run the tests there, to get a clear
>>> comparison of how these two match up in a cluster. Until the setup is
>>> ready, do continue with your next steps on checking the RDD support and
>>> Spark SQL use.
>>>
>>> After these are done, we should also do a trial run of our own APIM Hive
>>> scripts, migrated to Shark.
>>>
>>> Cheers,
>>> Anjana.
>>>
>>>
>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have been evaluating the performance of Shark (a distributed SQL query
>>>> engine for Hadoop) against Hive. The objective is to assess whether we
>>>> can move the WSO2 BAM data processing (which currently uses Hive) to
>>>> Shark (and Apache Spark) for improved performance.
>>>>
>>>> I am sharing my findings herewith.
>>>>
>>>>  *AMP Lab Shark*
>>>> Shark can execute Hive QL queries up to 100 times faster than Hive
>>>> without any modification to the existing data or queries. It supports
>>>> Hive's QL, metastore, serialization formats, and user-defined functions,
>>>> providing seamless integration with existing Hive deployments and a
>>>> familiar, more powerful option for new ones. [1]
>>>>
>>>>
>>>> *Apache Spark*
>>>> Apache Spark is an open-source data analytics cluster computing
>>>> framework. It fits into the Hadoop open-source community, builds on top
>>>> of HDFS, and promises performance up to 100 times faster than Hadoop
>>>> MapReduce for certain applications. [2]
>>>> Official documentation: [3]
>>>>
>>>>
>>>> I carried out the comparison between the following Hive and Shark
>>>> releases with input files ranging from 100 to 1 billion entries.
>>>>
>>>>               | Hive             | Shark
>>>> --------------+------------------+------------------------------
>>>> QL engine     | Apache Hive 0.11 | Shark 0.9.1 (latest release),
>>>>               |                  | which uses:
>>>>               |                  |   - Scala 2.10.3
>>>>               |                  |   - Spark 0.9.1
>>>>               |                  |   - AMPLab’s Hive 0.9.0
>>>> Framework     | Hadoop 1.0.4     | Spark 0.9.1
>>>> File system   | HDFS             | HDFS
>>>>
>>>> Attached herewith is a report that describes the performance comparison
>>>> between Shark and Hive in detail.
>>>> 
>>>>  hive_vs_shark
>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>> 
>>>>  hive_vs_shark_report.odt
>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>> 
>>>>
>>>> In summary, the following conclusions can be drawn from the evaluation.
>>>>
>>>>    - Shark performs on par with Hive in DDL operations (CREATE, DROP ..
>>>>    TABLE, DATABASE). Both engines show fairly constant performance as the
>>>>    input size increases.
>>>>    - Shark performs on par with Hive in DML operations (LOAD, INSERT), but
>>>>    when a DML operation is combined with a data retrieval operation
>>>>    (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly
>>>>    outperforms Hive, by a factor of 10x or more (ranging from 10x to
>>>>    80x in some instances). Shark's performance factor decreases as the
>>>>    input size increases, while Hive's performance stays fairly constant.
>>>>    - Shark clearly outperforms Hive in data retrieval operations
>>>>    (FILTER, ORDER BY, JOIN). Hive's performance is fairly insensitive to
>>>>    input size in these operations, while Shark's performance degrades
>>>>    as the input size increases. Even so, Shark outperformed Hive in
>>>>    every instance, by a factor of at least 5x (ranging from 5x to 80x
>>>>    in some instances).
>>>>
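
The "performance factor" in the summary above can be read as Hive's elapsed
time divided by Shark's elapsed time for the same query and input size. Here
is a minimal Python sketch of that calculation, assuming this definition.
The function name, the query label and all timings are made-up placeholders,
not figures from the report.

```python
def performance_factor(hive_seconds, shark_seconds):
    """Return how many times faster Shark ran than Hive (assumed definition)."""
    if hive_seconds <= 0 or shark_seconds <= 0:
        raise ValueError("timings must be positive")
    return hive_seconds / shark_seconds


# PLACEHOLDER timings in seconds, keyed by input size (rows).
# These are illustrative values only, NOT measurements from the report.
placeholder_timings = {
    "insert_select": {
        1_000: (40.0, 0.5),       # (hive_seconds, shark_seconds)
        1_000_000: (60.0, 6.0),
    },
}

# Compute the factor for every query and input size.
factors = {
    query: {rows: performance_factor(hive, shark)
            for rows, (hive, shark) in by_size.items()}
    for query, by_size in placeholder_timings.items()
}

for query, by_size in factors.items():
    for rows, factor in sorted(by_size.items()):
        print(f"{query} @ {rows} rows: {factor:.1f}x")
```

A factor that falls as the row count grows, as in these placeholder values,
is the pattern the summary describes for Shark relative to Hive.
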
>>>> Please refer to the 'hive_vs_shark_report'; it presents all the
>>>> queries and timings graphically.
>>>>
>>>> The code repository can be found at
>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>
>>>> Moving forward, I am currently working on the following.
>>>>
>>>>    - Apache Spark's resilient distributed dataset (RDD) abstraction
>>>>    (a collection of elements partitioned across the nodes of the
>>>>    cluster that can be operated on in parallel): the use of RDDs and
>>>>    their impact on performance.
>>>>    - Spark SQL: the use of Spark SQL in place of Shark on the Spark
>>>>    framework.
>>>>
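
To make the RDD description above concrete, here is a small pure-Python
sketch of the idea: a collection split into partitions, with each partition
processed in parallel. This is only an illustration of the concept. It does
not use the Spark API, the helper names are hypothetical, and local threads
stand in for cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, num_partitions):
    """Split items into num_partitions contiguous, roughly equal chunks."""
    size, extra = divmod(len(items), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def parallel_map(chunks, fn):
    """Apply fn to every element, using one worker thread per partition."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return list(pool.map(lambda chunk: [fn(x) for x in chunk], chunks))


records = list(range(10))
chunks = partition(records, 3)                    # three "nodes"
squared = parallel_map(chunks, lambda x: x * x)   # per-partition transform
total = sum(sum(chunk) for chunk in squared)      # combine partial results
print(chunks, total)
```

In Spark itself the partitions live on different machines and the framework
handles scheduling and recovery; the sketch only shows the partition, then
transform, then combine shape of the computation.
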
>>>>
>>>> [1] https://github.com/amplab/shark/wiki
>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>> [3] http://spark.apache.org/docs/latest/
>>>>
>>>>
>>>>
>>>> Would love to have your feedback on this.
>>>>
>>>> Best regards
>>>>
>>>> --
>>>>  *Niranda Perera*
>>>> Software Engineer, WSO2 Inc.
>>>> Mobile: +94-71-554-8430
>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>
>>>
>>>
>>>
>>> --
>>> *Anjana Fernando*
>>> Senior Technical Lead
>>> WSO2 Inc. | http://wso2.com
>>> lean . enterprise . middleware
>>>
>>
>>
>>
>> --
>> *Niranda Perera*
>> Software Engineer, WSO2 Inc.
>> Mobile: +94-71-554-8430
>>  Twitter: @n1r44 <https://twitter.com/N1R44>
>>
>
>
>
> --
> ============================
> Srinath Perera, Ph.D.
>    http://people.apache.org/~hemapani/
>    http://srinathsview.blogspot.com/
>



-- 
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware

_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
