Hi Niranda,

Excellent analysis of Hive vs Shark! This gives a lot of insight into
how both operate in different scenarios. As the next step, we will need to
run this on an actual cluster of computers. Since you've used a subset of
the DEBS 2014 challenge dataset, we should use the full dataset in a
clustered environment and check this. Gokul is already working on the
Hive-based setup for this. After that is done, you can create a Shark cluster
on the same hardware and run the tests there, to get a clear comparison of how
these two match up in a cluster. Until the setup is ready, do continue with
your next steps on checking the RDD support and Spark SQL use.

After these are done, we should also do a trial run of our own APIM Hive
scripts, migrated to Shark.

Cheers,
Anjana.


On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> wrote:

> Hi all,
>
> I have been evaluating the performance of Shark (distributed SQL query
> engine for Hadoop) against Hive. The objective is to see whether we can
> move the WSO2 BAM data processing (which currently uses Hive) to Shark
> (and Apache Spark) for improved performance.
>
> I am sharing my findings herewith.
>
>  *AMP Lab Shark*
> Shark can execute Hive QL queries up to 100 times faster than Hive without
> any modification to the existing data or queries. It supports Hive's QL,
> metastore, serialization formats, and user-defined functions, providing
> seamless integration with existing Hive deployments and a familiar, more
> powerful option for new ones. [1]
>
>
> *Apache Spark*
> Apache Spark is an open-source data analytics cluster computing
> framework. It fits into the Hadoop open-source community, building on top
> of HDFS, and promises performance up to 100 times faster than Hadoop
> MapReduce for certain applications. [2]
> Official documentation: [3]
>
>
> I carried out the comparison between the following Hive and Shark releases
> with input files ranging from 100 to 1 billion entries.
>
>               Hive               Shark
> QL engine     Apache Hive 0.11   Shark 0.9.1 (latest release), which uses
>                                  Scala 2.10.3, Spark 0.9.1, and AMPLab’s
>                                  Hive 0.9.0
> Framework     Hadoop 1.0.4       Spark 0.9.1
> File system   HDFS               HDFS
>
> Attached herewith is a report that describes the performance comparison
> between Shark and Hive in detail.
>
>  hive_vs_shark
> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>  hive_vs_shark_report.odt
> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>
> In summary, the following conclusions can be drawn from the evaluation.
>
>    - Shark performs on par with Hive in DDL operations (CREATE, DROP ..
>    TABLE, DATABASE). Both engines show fairly constant performance as the
>    input size increases.
>    - Shark performs on par with Hive in DML operations (LOAD, INSERT), but
>    when a DML operation is combined with a data retrieval operation
>    (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly
>    outperforms Hive, by a factor of 10x or more (ranging from 10x to
>    80x in some instances). Shark's performance factor decreases as the input
>    size increases, while Hive's performance is fairly insensitive to it.
>    - Shark clearly outperforms Hive in data retrieval operations
>    (FILTER, ORDER BY, JOIN). Hive's performance is fairly insensitive to
>    input size in these operations, while Shark's performance degrades as
>    the input size increases. Even so, Shark outperformed Hive in every
>    instance, by a factor of at least 5x (ranging from 5x to 80x in some
>    instances).
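
For anyone reading along, here is a minimal sketch of how a performance
factor like the ones quoted above could be computed, assuming it is simply
the Hive wall-clock time divided by the Shark wall-clock time for the same
query and input size. The function name and the timings are hypothetical
and purely illustrative; they are not taken from the report.

```python
def performance_factor(hive_seconds: float, shark_seconds: float) -> float:
    """Return how many times faster Shark ran than Hive for one query.

    Assumption: the factor is hive_time / shark_time on the same input.
    """
    if hive_seconds <= 0 or shark_seconds <= 0:
        raise ValueError("timings must be positive")
    return hive_seconds / shark_seconds


# Hypothetical (hive_seconds, shark_seconds) pairs, NOT measured values.
hypothetical_timings = {
    "FILTER": (60.0, 3.0),
    "ORDER BY": (90.0, 9.0),
}

# Compute one factor per query type.
factors = {
    query: performance_factor(hive, shark)
    for query, (hive, shark) in hypothetical_timings.items()
}

for query, factor in factors.items():
    print(f"{query}: Shark is {factor:.0f}x faster than Hive")
```

Computing the factor per query type and per input size, rather than one
overall average, keeps visible the trend that the factor shrinks as the
input grows.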
>
> Please refer to the 'hive_vs_shark_report'; it presents all the
> information about the queries and timings graphically.
>
> The code repository can be found at
> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>
> Moving forward, I am currently working on the following.
>
>    - Apache Spark's resilient distributed dataset (RDD) abstraction
>    (a collection of elements partitioned across the nodes of the
>    cluster that can be operated on in parallel): the use of RDDs and their
>    impact on performance.
>    - Spark SQL: the use of Spark SQL in place of Shark on the Spark
>    framework.
>
>
> [1] https://github.com/amplab/shark/wiki
> [2] http://en.wikipedia.org/wiki/Apache_Spark
> [3] http://spark.apache.org/docs/latest/
>
>
>
> Would love to have your feedback on this.
>
> Best regards
>
> --
>  *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 <https://twitter.com/N1R44>
>



-- 
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
