Re: [Architecture] [POC] Performance evaluation of Hive vs Shark

Anjana Fernando Wed, 13 Aug 2014 04:50:29 -0700

On Wed, Aug 13, 2014 at 3:51 PM, Sumedha Rubasinghe <sume...@wso2.com>
wrote:


> > After these are done, we should also do a trial run of our own APIM Hive
> scripts, migrated to Shark.
>
> Do we need to migrate? I thought existing Hive scripts can run as they are.
> First of all we need to create a large data set of API stats.
>
Oh yeah, wrong choice of words I guess :) .. we wouldn't have to migrate
.. I just meant testing the same APIM Hive scripts in Shark.

Cheers,
Anjana.

> >
> > Cheers,
> > Anjana.
> >
> >
> > On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com>
> wrote:
> >>
> >> Hi all,
> >>
> >> I have been evaluating the performance of Shark (a distributed SQL query
> engine for Hadoop) against Hive. The objective is to assess whether we can
> move the WSO2 BAM data processing (which currently uses Hive) to Shark (and
> Apache Spark) for improved performance.
> >>
> >> I am sharing my findings herewith.
> >>
> >> AMP Lab Shark
> >> Shark can execute Hive QL queries up to 100 times faster than Hive
> without any modification to the existing data or queries. It supports
> Hive's QL, metastore, serialization formats, and user-defined functions,
> providing seamless integration with existing Hive deployments and a
> familiar, more powerful option for new ones. [1]
> >>
> >> Apache Spark
> >> Apache Spark is an open-source data analytics cluster computing
> framework. It fits into the Hadoop open-source community, building on top
> of the HDFS and promises performance up to 100 times faster than Hadoop
> MapReduce for certain applications. [2]
> >> Official documentation: [3]
> >>
> >>
> >> I carried out the comparison between the following Hive and Shark
> releases with input files ranging from 100 to 1 billion entries.
> >>
> >>               Hive               Shark
> >> QL engine     Apache Hive 0.11   Shark 0.9.1 (latest release), which uses
> >>                                  Scala 2.10.3, Spark 0.9.1, and
> >>                                  AMPLab’s Hive 0.9.0
> >> Framework     Hadoop 1.0.4       Spark 0.9.1
> >> File system   HDFS               HDFS
> >>
> >> Attached herewith is a report that describes the performance comparison
> between Shark and Hive in detail.
> >>
> >> hive_vs_shark
> >> hive_vs_shark_report.odt
> >>
> >> In summary,
> >>
> >> The following conclusions can be drawn from the evaluation.
> >> Shark performs on par with Hive in DDL operations (CREATE, DROP .. TABLE,
> DATABASE). Both engines show fairly constant performance as the input
> size increases.
> >> Shark performs on par with Hive in DML operations (LOAD, INSERT), but when
> a DML operation is combined with a data retrieval operation (e.g.
> INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly outperforms
> Hive, by a factor of at least 10x (ranging from 10x to 80x in some
> instances). Shark's performance factor decreases as the input size
> increases, while Hive's performance is largely unaffected by input size.
> >> Shark clearly outperforms Hive in data retrieval operations (FILTER,
> ORDER BY, JOIN). Hive's performance is largely unaffected by input size in
> these operations, while Shark's performance decreases as the input size
> increases. Even so, Shark outperformed Hive in every instance, by a factor
> of at least 5x (ranging from 5x to 80x in some instances).
> >> Please refer to the 'hive_vs_shark_report'; it presents all the
> queries and timings graphically.
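For anyone re-running the comparison, the "performance factor" in the summary above can be read as Hive's elapsed time divided by Shark's elapsed time for the same query. A minimal Python sketch, assuming that definition; the function name and the timings are hypothetical placeholders, not measurements from the report:

```python
def performance_factor(hive_seconds: float, shark_seconds: float) -> float:
    """Return how many times faster Shark ran than Hive for one query."""
    if hive_seconds <= 0 or shark_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return hive_seconds / shark_seconds


# Hypothetical timings in seconds (Hive, Shark), for illustration only.
# These are NOT figures from the hive_vs_shark report.
example_timings = {
    "filter": (50.0, 5.0),
    "order_by": (120.0, 3.0),
}

factors = {
    query: performance_factor(hive, shark)
    for query, (hive, shark) in example_timings.items()
}

for query, factor in factors.items():
    print(f"{query}: Shark ran {factor:.0f}x faster than Hive")
```

Computing the factor per query and per input size, rather than averaging over all runs, keeps the trend visible: the summary reports that the factor shrinks as the input grows.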
> >>
> >> The code repository can also be found at
> >> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
> >>
> >> Moving forward, I am currently working on the following.
> >> Apache Spark's resilient distributed dataset (RDD) abstraction (a
> collection of elements partitioned across the nodes of the cluster that can
> be operated on in parallel): the use of RDDs and their impact on
> performance.
> >> Spark SQL: the use of Spark SQL instead of Shark on the Spark framework.
> >> [1] https://github.com/amplab/shark/wiki
> >> [2] http://en.wikipedia.org/wiki/Apache_Spark
> >> [3] http://spark.apache.org/docs/latest/
> >>
> >>
> >>
> >> Would love to have your feedback on this.
> >>
> >> Best regards
> >>
> >> --
> >> Niranda Perera
> >> Software Engineer, WSO2 Inc.
> >> Mobile: +94-71-554-8430
> >> Twitter: @n1r44
> >
> >
> >
> >
> > --
> > Anjana Fernando
> > Senior Technical Lead
> > WSO2 Inc. | http://wso2.com
> > lean . enterprise . middleware
> >
> > _______________________________________________
> > Architecture mailing list
> > Architecture@wso2.org
> > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
> >
>
>
>
>


-- 
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
