This looks great. Have we tested Spark with multiple nodes yet? Please create a few VMs in the performance cloud (talk to Lakmal) and test with at least 5 nodes. We need to make sure it works well in a distributed setup too.
What would it take to switch to Spark? Anjana, how much work is it?

--Srinath

On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> wrote:

> Thank you Anjana.
>
> Yes, I am working on it.
>
> In the meantime, I found this in the Hive documentation [1]. It talks about
> Hive on Spark, and compares Hive, Shark and Spark SQL at a higher
> architectural level.
>
> Additionally, it is said that the in-memory performance of Shark can be
> improved by introducing Tachyon [2]. I guess we can consider this later on.
>
> Cheers.
>
> [1] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>
> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com> wrote:
>
>> Hi Niranda,
>>
>> Excellent analysis of Hive vs Shark! This gives a lot of insight into
>> how both operate in different scenarios. As the next step, we will need to
>> run this in an actual cluster of computers. Since you've used a subset of
>> the dataset of the 2014 DEBS challenge, we should use the full data set in a
>> clustered environment and check this. Gokul is already working on the
>> Hive-based setup for this; after that is done, you can create a Shark
>> cluster on the same hardware and run the tests there, to get a clear
>> comparison of how these two match up in a cluster. Until the setup is
>> ready, do continue with your next steps on checking the RDD support and
>> Spark SQL use.
>>
>> After these are done, we should also do a trial run of our own APIM Hive
>> scripts, migrated to Shark.
>>
>> Cheers,
>> Anjana.
>>
>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> wrote:
>>
>>> Hi all,
>>>
>>> I have been evaluating the performance of Shark (distributed SQL query
>>> engine for Hadoop) against Hive. The objective is to see whether the
>>> WSO2 BAM data processing (which currently uses Hive) can be moved to
>>> Shark (and Apache Spark) for improved performance.
>>>
>>> I am sharing my findings herewith.
>>>
>>> *AMP Lab Shark*
>>> Shark can execute Hive QL queries up to 100 times faster than Hive
>>> without any modification to the existing data or queries. It supports
>>> Hive's QL, metastore, serialization formats, and user-defined functions,
>>> providing seamless integration with existing Hive deployments and a
>>> familiar, more powerful option for new ones. [1]
>>>
>>> *Apache Spark*
>>> Apache Spark is an open-source data analytics cluster computing
>>> framework. It fits into the Hadoop open-source community, building on
>>> top of HDFS, and promises performance up to 100 times faster than
>>> Hadoop MapReduce for certain applications. [2]
>>> Official documentation: [3]
>>>
>>> I carried out the comparison between the following Hive and Shark
>>> releases with input files ranging from 100 to 1 billion entries.
>>>
>>> QL engine
>>>   Hive:  Apache Hive 0.11
>>>   Shark: Shark 0.9.1 (latest release), which uses
>>>          - Scala 2.10.3
>>>          - Spark 0.9.1
>>>          - AMPLab’s Hive 0.9.0
>>>
>>> Framework
>>>   Hive:  Hadoop 1.0.4
>>>   Shark: Spark 0.9.1
>>>
>>> File system
>>>   Hive:  HDFS
>>>   Shark: HDFS
>>>
>>> Attached herewith is a report which describes the performance
>>> comparison between Shark and Hive in detail.
>>>
>>> hive_vs_shark
>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>
>>> hive_vs_shark_report.odt
>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>
>>> In summary, the following conclusions can be derived from the
>>> evaluation.
>>>
>>> - Shark performs about the same as Hive in DDL operations (CREATE,
>>> DROP .. TABLE, DATABASE). Both engines show fairly constant
>>> performance as the input size increases.
>>> - Shark performs about the same as Hive in DML operations (LOAD,
>>> INSERT), but when a DML operation is called in conjunction with a data
>>> retrieval operation (ex. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark
>>> significantly outperforms Hive with a performance factor of 10x+
>>> (ranging from 10x to 80x in some instances). Shark's performance factor
>>> decreases as the input size increases, while Hive's performance stays
>>> fairly constant.
>>> - Shark clearly outperforms Hive in data retrieval operations
>>> (FILTER, ORDER BY, JOIN). Hive's performance stays fairly constant in
>>> the data retrieval operations, while Shark's performance decreases as
>>> the input size increases. But in every instance Shark outperformed Hive
>>> with a minimum performance factor of 5x+ (ranging from 5x to 80x in
>>> some instances).
>>>
>>> Please refer to the 'hive_vs_shark_report'; it presents all the
>>> information about the queries and timings graphically.
>>>
>>> The code repository can also be found at
>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>
>>> Moving forward, I am currently working on the following.
>>>
>>> - Apache Spark's resilient distributed dataset (RDD) abstraction
>>> (a collection of elements partitioned across the nodes of the cluster
>>> that can be operated on in parallel): the use of RDDs and their impact
>>> on performance.
>>> - Spark SQL: the use of Spark SQL over Shark on the Spark framework.
>>>
>>> [1] https://github.com/amplab/shark/wiki
>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>> [3] http://spark.apache.org/docs/latest/
>>>
>>> Would love to have your feedback on this.
>>>
>>> Best regards
>>>
>>> --
>>> *Niranda Perera*
>>> Software Engineer, WSO2 Inc.
>>> Mobile: +94-71-554-8430
>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>
>> --
>> *Anjana Fernando*
>> Senior Technical Lead
>> WSO2 Inc. | http://wso2.com
>> lean . enterprise . middleware
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 <https://twitter.com/N1R44>

--
============================
Srinath Perera, Ph.D.
http://people.apache.org/~hemapani/
http://srinathsview.blogspot.com/
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture