Hi Niranda,

Excellent analysis of Hive vs Shark! This gives a lot of insight into how both operate in different scenarios. As the next step, we will need to run this on an actual cluster of computers. Since you've used a subset of the DEBS 2014 challenge dataset, we should use the full dataset in a clustered environment and check this. Gokul is already working on the Hive-based setup for this; once that is done, you can create a Shark cluster on the same hardware and run the tests there, to get a clear comparison of how the two match up in a cluster. Until the setup is ready, do continue with your next steps on checking the RDD support and Spark SQL use.
After these are done, we should also do a trial run of our own APIM Hive scripts, migrated to Shark.

Cheers,
Anjana.

On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> wrote:

> Hi all,
>
> I have been evaluating the performance of Shark (distributed SQL query
> engine for Hadoop) against Hive. The objective is to assess whether the
> WSO2 BAM data processing (which currently uses Hive) can be moved to
> Shark (and Apache Spark) for improved performance.
>
> I am sharing my findings herewith.
>
> *AMP Lab Shark*
> Shark can execute Hive QL queries up to 100 times faster than Hive without
> any modification to the existing data or queries. It supports Hive's QL,
> metastore, serialization formats, and user-defined functions, providing
> seamless integration with existing Hive deployments and a familiar, more
> powerful option for new ones. [1]
>
> *Apache Spark*
> Apache Spark is an open-source data analytics cluster computing
> framework. It fits into the Hadoop open-source community, building on
> top of HDFS, and promises performance up to 100 times faster than
> Hadoop MapReduce for certain applications. [2]
> Official documentation: [3]
>
> I carried out the comparison between the following Hive and Shark releases
> with input files ranging from 100 to 1 billion entries.
>
>               Hive                Shark
>   QL engine   Apache Hive 0.11    Shark 0.9.1 (latest release), which uses
>                                     - Scala 2.10.3
>                                     - Spark 0.9.1
>                                     - AMPLab's Hive 0.9.0
>   Framework   Hadoop 1.0.4        Spark 0.9.1
>   File system HDFS                HDFS
>
> Attached herewith is a report which describes the performance comparison
> between Shark and Hive in detail.
>
> hive_vs_shark
> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>
> hive_vs_shark_report.odt
> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>
> In summary, the following conclusions can be derived from the evaluation.
>
> - Shark performs on par with Hive in DDL operations (CREATE, DROP ..
>   TABLE, DATABASE). Both engines show fairly constant performance as the
>   input size increases.
> - Shark performs on par with Hive in DML operations (LOAD, INSERT), but
>   when a DML operation is combined with a data retrieval operation
>   (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly
>   outperforms Hive, by a factor of at least 10x (ranging from 10x to
>   80x in some instances). Shark's performance factor decreases as the
>   input size increases, while Hive's performance is fairly insensitive
>   to it.
> - Shark clearly outperforms Hive in data retrieval operations
>   (FILTER, ORDER BY, JOIN). Hive's performance is fairly insensitive to
>   the input size in these operations, while Shark's performance degrades
>   as the input size increases. But in every instance Shark outperformed
>   Hive by a factor of at least 5x (ranging from 5x to 80x in some
>   instances).
>
> Please refer to the 'hive_vs_shark_report'; it presents all the queries
> and timings graphically.
>
> The code repository can also be found at
> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>
> Moving forward, I am currently working on the following.
>
> - Apache Spark's resilient distributed dataset (RDD) abstraction
>   (a collection of elements partitioned across the nodes of the
>   cluster that can be operated on in parallel): the use of RDDs and
>   their impact on performance.
> - Spark SQL: the use of Spark SQL over Shark on the Spark framework.
>
> [1] https://github.com/amplab/shark/wiki
> [2] http://en.wikipedia.org/wiki/Apache_Spark
> [3] http://spark.apache.org/docs/latest/
>
> Would love to have your feedback on this.
>
> Best regards
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 <https://twitter.com/N1R44>

--
*Anjana Fernando*
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture