On Wed, Aug 13, 2014 at 3:51 PM, Sumedha Rubasinghe <sume...@wso2.com> wrote:
> > After these are done, we should also do a trial run of our own APIM Hive
> > scripts, migrated to Shark.
>
> Do we need to migrate? I thought existing Hive scripts can run as they are.
> First of all we need to create a large data set of API stats.

Oh yeah, wrong selection of words I guess :) .. we wouldn't have to
migrate .. I just meant testing the same APIM Hive script in Shark.

Cheers,
Anjana.

> > Cheers,
> > Anjana.
> >
> > On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> wrote:
> >>
> >> Hi all,
> >>
> >> I have been evaluating the performance of Shark (a distributed SQL query
> >> engine for Hadoop) against Hive. The objective is to assess whether the
> >> WSO2 BAM data processing (which currently uses Hive) could be moved to
> >> Shark (and Apache Spark) for improved performance.
> >>
> >> I am sharing my findings herewith.
> >>
> >> AMP Lab Shark
> >> Shark can execute Hive QL queries up to 100 times faster than Hive
> >> without any modification to the existing data or queries. It supports
> >> Hive's QL, metastore, serialization formats, and user-defined functions,
> >> providing seamless integration with existing Hive deployments and a
> >> familiar, more powerful option for new ones. [1]
> >>
> >> Apache Spark
> >> Apache Spark is an open-source data analytics cluster computing
> >> framework. It fits into the Hadoop open-source community, building on top
> >> of HDFS, and promises performance up to 100 times faster than Hadoop
> >> MapReduce for certain applications. [2]
> >> Official documentation: [3]
> >>
> >> I carried out the comparison between the following Hive and Shark
> >> releases with input files ranging from 100 to 1 billion entries.
> >>
> >> QL engine    | Apache Hive 0.11 | Shark 0.9.1 (latest release), which uses
> >>              |                  | Scala 2.10.3, Spark 0.9.1 and
> >>              |                  | AMPLab's Hive 0.9.0
> >> Framework    | Hadoop 1.0.4     | Spark 0.9.1
> >> File system  | HDFS             | HDFS
> >>
> >> Attached herewith is a report which describes the performance comparison
> >> between Shark and Hive in detail.
> >>
> >> hive_vs_shark
> >>
> >> hive_vs_shark_report.odt
> >>
> >> In summary, the following conclusions can be derived from the evaluation.
> >>
> >> - Shark performs on par with Hive in DDL operations (CREATE, DROP ..
> >>   TABLE, DATABASE). Both engines show fairly constant performance as the
> >>   input size increases.
> >> - Shark performs on par with Hive in DML operations (LOAD, INSERT), but
> >>   when a DML operation is combined with a data retrieval operation (e.g.
> >>   INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly outperforms
> >>   Hive, by a factor of 10x or more (ranging from 10x to 80x in some
> >>   instances). Shark's performance factor decreases as the input size
> >>   increases, while Hive's performance is fairly insensitive to input
> >>   size.
> >> - Shark clearly outperforms Hive in data retrieval operations (FILTER,
> >>   ORDER BY, JOIN). Hive's performance is fairly insensitive to input size
> >>   in these operations, while Shark's performance decreases as the input
> >>   size increases. Even so, Shark outperformed Hive in every instance, by a
> >>   factor of at least 5x (ranging from 5x to 80x in some instances).
> >>
> >> Please refer to the 'hive_vs_shark_report'; it presents all the
> >> information about the queries and timings graphically.
> >>
> >> The code repository can also be found at
> >> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
> >>
> >> Moving forward, I am currently working on the following.
> >> - Apache Spark's resilient distributed dataset (RDD) abstraction (a
> >>   collection of elements partitioned across the nodes of the cluster
> >>   that can be operated on in parallel): the use of RDDs and its impact
> >>   on performance.
> >> - Spark SQL: the use of Spark SQL over Shark on the Spark framework.
> >>
> >> [1] https://github.com/amplab/shark/wiki
> >> [2] http://en.wikipedia.org/wiki/Apache_Spark
> >> [3] http://spark.apache.org/docs/latest/
> >>
> >> Would love to have your feedback on this.
> >>
> >> Best regards
> >>
> >> --
> >> Niranda Perera
> >> Software Engineer, WSO2 Inc.
> >> Mobile: +94-71-554-8430
> >> Twitter: @n1r44
> >
> > --
> > Anjana Fernando
> > Senior Technical Lead
> > WSO2 Inc. | http://wso2.com
> > lean . enterprise . middleware
> >
> > _______________________________________________
> > Architecture mailing list
> > Architecture@wso2.org
> > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture

--
Anjana Fernando
Senior Technical Lead
WSO2 Inc. | http://wso2.com
lean . enterprise . middleware