This looks great.

We need to test Spark with multiple nodes. Did we do that? Please create a
few VMs in the performance cloud (talk to Lakmal) and test with at least 5
nodes. We need to make sure it works OK with a distributed setup as well.
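
For the multi-node runs, here is a minimal Python sketch of how the Hive and
Shark timings could be turned into per-query performance factors. The
function names (speedup, summarize) and the sample timings are hypothetical
placeholders, not measured results; the sketch only assumes that each query
has one Hive timing and one Shark timing, both in seconds.

```python
def speedup(hive_seconds, shark_seconds):
    """Return how many times faster Shark ran than Hive for one query."""
    if hive_seconds <= 0 or shark_seconds <= 0:
        raise ValueError("timings must be positive")
    return hive_seconds / shark_seconds


def summarize(timings):
    """Map each query name to its Hive/Shark performance factor.

    `timings` maps a query name to a (hive_seconds, shark_seconds) pair.
    """
    return {name: speedup(h, s) for name, (h, s) in timings.items()}


# Placeholder numbers for illustration only; these are NOT measured results.
example = {"filter": (100.0, 10.0), "join": (240.0, 48.0)}
for name, factor in sorted(summarize(example).items()):
    print(f"{name}: {factor:.1f}x")
```

Running the same summary for each cluster size (1 node, 5 nodes, and so on)
would show whether the performance factor holds up in the distributed setup.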

What does it take to change to Spark? Anjana, how much work is it?

--Srinath


On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> wrote:

> Thank you Anjana.
>
> Yes, I am working on it.
>
> In the meantime, I found this in the Hive documentation [1]. It talks about
> Hive on Spark, and compares Hive, Shark and Spark SQL at a high
> architectural level.
>
> Additionally, it is said that the in-memory performance of Shark can be
> improved by introducing Tachyon [2]. I guess we can consider this later on.
>
> Cheers.
>
> [1]
> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>
>
>
> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com> wrote:
>
>> Hi Niranda,
>>
>> Excellent analysis of Hive vs Shark! This gives a lot of insight into
>> how both operate in different scenarios. As the next step, we will need to
>> run this in an actual cluster of computers. Since you've used a subset of
>> the 2014 DEBS challenge dataset, we should use the full dataset in a
>> clustered environment and check this. Gokul is already working on the
>> Hive-based setup for this. After that is done, you can create a Shark
>> cluster on the same hardware and run the tests there, to get a clear
>> comparison of how these two match up in a cluster. Until the setup is
>> ready, do continue with your next steps on checking the RDD support and
>> Spark SQL use.
>>
>> After these are done, we should also do a trial run of our own APIM Hive
>> scripts, migrated to Shark.
>>
>> Cheers,
>> Anjana.
>>
>>
>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> I have been evaluating the performance of Shark (a distributed SQL query
>>> engine for Hadoop) against Hive. The objective is to assess whether the
>>> WSO2 BAM data processing (which currently uses Hive) can be moved to
>>> Shark (and Apache Spark) for improved performance.
>>>
>>> I am sharing my findings herewith.
>>>
>>>  *AMP Lab Shark*
>>> Shark can execute Hive QL queries up to 100 times faster than Hive
>>> without any modification to the existing data or queries. It supports
>>> Hive's QL, metastore, serialization formats, and user-defined functions,
>>> providing seamless integration with existing Hive deployments and a
>>> familiar, more powerful option for new ones. [1]
>>>
>>>
>>> *Apache Spark*
>>> Apache Spark is an open-source data analytics cluster computing
>>> framework. It fits into the Hadoop open-source community, building on top
>>> of HDFS, and promises performance up to 100 times faster than Hadoop
>>> MapReduce for certain applications. [2]
>>> Official documentation: [3]
>>>
>>>
>>> I carried out the comparison between the following Hive and Shark
>>> releases with input files ranging from 100 to 1 billion entries.
>>>
>>>                Hive               Shark
>>> QL engine      Apache Hive 0.11   Shark 0.9.1 (latest release), which uses
>>>                                   Scala 2.10.3, Spark 0.9.1 and AMPLab’s
>>>                                   Hive 0.9.0
>>> Framework      Hadoop 1.0.4       Spark 0.9.1
>>> File system    HDFS               HDFS
>>>
>>>
>>> Attached herewith is a report which describes the performance comparison
>>> between Shark and Hive in detail.
>>>
>>>  hive_vs_shark
>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>
>>>  hive_vs_shark_report.odt
>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>
>>> In summary, the following conclusions can be derived from the evaluation.
>>>
>>>    - Shark performs on par with Hive in DDL operations (CREATE, DROP ..
>>>    TABLE, DATABASE). Both engines show fairly constant performance as the
>>>    input size increases.
>>>    - Shark performs on par with Hive in DML operations (LOAD, INSERT), but
>>>    when a DML operation is called in conjunction with a data retrieval
>>>    operation (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark
>>>    significantly outperforms Hive, with a performance factor of 10x+
>>>    (ranging from 10x to 80x in some instances). Shark's performance factor
>>>    decreases as the input size increases, while Hive's performance stays
>>>    fairly constant.
>>>    - Shark clearly outperforms Hive in data retrieval operations
>>>    (FILTER, ORDER BY, JOIN). Hive's performance stays fairly constant in
>>>    the data retrieval operations, while Shark's performance decreases as
>>>    the input size increases. But in every instance Shark outperformed Hive
>>>    with a minimum performance factor of 5x+ (ranging from 5x to 80x in
>>>    some instances).
>>>
>>> Please refer to the 'hive_vs_shark_report'; it presents all the
>>> information about the queries and timings graphically.
>>>
>>> The code repository can also be found at
>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>
>>> Moving forward, I am currently working on the following.
>>>
>>>    - Apache Spark's resilient distributed dataset (RDD) abstraction
>>>    (a collection of elements partitioned across the nodes of the
>>>    cluster that can be operated on in parallel): the use of RDDs and
>>>    their impact on performance.
>>>    - Spark SQL: the use of Spark SQL instead of Shark on the Spark
>>>    framework.
>>>
>>>
>>> [1] https://github.com/amplab/shark/wiki
>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>> [3] http://spark.apache.org/docs/latest/
>>>
>>>
>>>
>>> Would love to have your feedback on this.
>>>
>>> Best regards
>>>
>>> --
>>>  *Niranda Perera*
>>> Software Engineer, WSO2 Inc.
>>> Mobile: +94-71-554-8430
>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>
>>
>>
>>
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>>
>
>
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 <https://twitter.com/N1R44>
>



-- 
============================
Srinath Perera, Ph.D.
   http://people.apache.org/~hemapani/
   http://srinathsview.blogspot.com/
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
