Hi Maninda, On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com> wrote:
> In the case of discontinuity of Shark project, IMO we should not move to > Shark at all. > And it seems better to go with Spark SQL as we are already using Spark for > CEP. But I am not sure the difference between Spark SQL and the Siddhi > queries on the Spark engine. > Currently, we are doing integration with CEP using Apache Storm, not Spark... :-). Spark Streaming is a possible candidate for integrating with CEP, but we have opted with Storm. I think there has been some independent work on integrating Kafka + Spark Streaming + Siddhi. Please refer to thread on arch@ "[Architecture] A few questions about WSO2 CEP/Siddhi" And we have to figure out how Spark SQL is used for historical data, > whether it can execute incremental processing by default which will > implement all out existing BAM use cases. > On the other hand in Hadoop 2 [1] they are using a completely different > platform for resource allocation known as Yarn. Sometimes this may be more > suitable for batch jobs. > > [1] https://www.youtube.com/watch?v=RncoVN0l6dc > > Thanks, Lasantha > > *Maninda Edirisooriya* > Senior Software Engineer > > *WSO2, Inc. *lean.enterprise.middleware. > > *Blog* : http://maninda.blogspot.com/ > *E-mail* : mani...@wso2.com > *Skype* : @manindae > *Twitter* : @maninda > > > On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com> wrote: > >> Hi Anjana and Srinath, >> >> After the discussion I had with Anjana, I researched more on the >> continuation of Shark project by Databricks. >> >> Here's what I found out, >> - Shark was built on the Hive codebase and achieved performance >> improvements by swapping out the physical execution engine part of Hive. >> While this approach enabled Shark users to speed up their Hive queries, >> Shark inherited a large, complicated code base from Hive that made it hard >> to optimize and maintain. >> Hence, Databricks has announced that they are halting the development of >> Shark from July, 2014. (Shark 0.9 would be the last release) [1] >> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS >> performance >> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html> >> by almost an order of magnitude. It also supports all existing Hive data >> formats, user-defined functions (UDF), and the Hive metastore. [2] >> - Following is the Shark, Spark SQL migration plan >> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf >> >> - For the legacy Hive and MapReduce users, they have proposed a new 'Hive >> on Spark Project' [3], [4] >> But, given the performance enhancement, it is quite certain that Hive and >> MR would be replaced by engines build on top of Spark (ex: Spark SQL) >> >> >> >> In my opinion there are a few matters to figure out if we are migrating >> from Hive, >> >> 1. whether we are changing the query engine only? (Then, we can replace >> Hive by Shark) >> 2. whether we are changing the existing Hadoop/ MapReduce framework to >> Spark? (Then we can replace Hive and Hadoop with Spark and Spark SQL) >> >> >> In my opinion, considering the longterm impact and the availability of >> support, it is best to migrate the Hive/Hadoop to Spark. >> It is open for discussion! >> >> In the mean time, I've already tried Spark SQL, and Databricks claims on >> improved performance seems to be true. I will work more on this. >> >> Cheers >> >> [1] >> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html >> [2] >> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html >> [3] https://issues.apache.org/jira/browse/HIVE-7292 >> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark >> >> >> >> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com> >> wrote: >> >>> Hi Srinath, >>> >>> No, this has not been tested in multiple nodes. I told Niranda here in >>> my last mail, to test a cluster with the same set of hardware we have, that >>> we are using to test our large data set with Hive. As for the effort to >>> make the change, we still have to figure out the MT aspects of Shark here. >>> Sinthuja was working on making the latest Hive version MT ready, and most >>> probably, we can do the same changes to the Hive version Shark is using. So >>> after we do that, the integration should be seamless. And also, as I >>> mentioned earlier here, we are also going to test this with the APIM Hive >>> script, to check if there are any unforeseen incompatibilities. >>> >>> Cheers, >>> Anjana. >>> >>> >>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com> >>> wrote: >>> >>>> This look great. >>>> >>>> We need to test Spark with multiple nodes? Did we do that. Please >>>> create few VMs in performance could (talk to Lakmal) and test with at least >>>> 5 nodes. We need to make sure it works OK with distributed setup as well. >>>> >>>> What does it take to change to spark? Anjana .. how much work is it? >>>> >>>> --Srinath >>>> >>>> >>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> >>>> wrote: >>>> >>>>> Thank you Anjana. >>>>> >>>>> Yes, I am working on it. >>>>> >>>>> In the mean time, I found this in Hive documentation [1]. It talks >>>>> about Hive on Spark, and compares Hive, Shark and Spark SQL at an higher >>>>> architectural level. >>>>> >>>>> Additionally, it is said that the in-memory performance of Shark can >>>>> be improved by introducing Tachyon [2]. I guess we can consider this later >>>>> on. >>>>> >>>>> Cheers. >>>>> >>>>> [1] >>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL >>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html >>>>> >>>>> >>>>> >>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com> >>>>> wrote: >>>>> >>>>>> Hi Niranda, >>>>>> >>>>>> Excellent analysis of Hive vs Shark! .. This gives a lot of insight >>>>>> into how both operates in different scenarios. As the next step, we will >>>>>> need to run this in an actual cluster of computers. Since you've used a >>>>>> subset of the dataset of 2014 DEBS challenge, we should use the full data >>>>>> set in a clustered environment and check this. Gokul is already working >>>>>> on >>>>>> the Hive based setup for this, after that is done, you can create a Shark >>>>>> cluster in the same hardware and run the tests there, to get a clear >>>>>> comparison on how these two match up in a cluster. Until the setup is >>>>>> ready, do continue with your next steps on checking the RDD support and >>>>>> Spark SQL use. >>>>>> >>>>>> After these are done, we should also do a trial run of our own APIM >>>>>> Hive scripts, migrated to Shark. >>>>>> >>>>>> Cheers, >>>>>> Anjana. >>>>>> >>>>>> >>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> >>>>>> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> I have been evaluating the performance of Shark (distributed SQL >>>>>>> query engine for Hadoop) against Hive. This is with the objective of >>>>>>> seeing >>>>>>> the possibility to move the WSO2 BAM data processing (which currently >>>>>>> uses >>>>>>> Hive) to Shark (and Apache Spark) for improved performance. >>>>>>> >>>>>>> I am sharing my findings herewith. >>>>>>> >>>>>>> *AMP Lab Shark* >>>>>>> Shark can execute Hive QL queries up to 100 times faster than Hive >>>>>>> without any modification to the existing data or queries. It supports >>>>>>> Hive's QL, metastore, serialization formats, and user-defined functions, >>>>>>> providing seamless integration with existing Hive deployments and a >>>>>>> familiar, more powerful option for new ones. [1] >>>>>>> >>>>>>> >>>>>>> *Apache Spark*Apache Spark is an open-source data analytics cluster >>>>>>> computing framework. It fits into the Hadoop open-source community, >>>>>>> building on top of the HDFS and promises performance up to 100 times >>>>>>> faster >>>>>>> than Hadoop MapReduce for certain applications. [2] >>>>>>> Official documentation: [3] >>>>>>> >>>>>>> >>>>>>> I carried out the comparison between the following Hive and Shark >>>>>>> releases with input files ranging from 100 to 1 billion entries. >>>>>>> >>>>>>> QL Engine >>>>>>> >>>>>>> Apache Hive 0.11 >>>>>>> >>>>>>> Shark Shark 0.9.1 (Latest release) which uses, >>>>>>> >>>>>>> - >>>>>>> >>>>>>> Scala 2.10.3 >>>>>>> - >>>>>>> >>>>>>> Spark 0.9.1 >>>>>>> - >>>>>>> >>>>>>> AMPLab’s Hive 0.9.0 >>>>>>> >>>>>>> >>>>>>> Framework >>>>>>> >>>>>>> Hadoop 1.0.4 >>>>>>> Spark 0.9.1 >>>>>>> >>>>>>> File system >>>>>>> >>>>>>> HDFS >>>>>>> HDFS >>>>>>> >>>>>>> Attached herewith is a report which describes in detail about the >>>>>>> performance comparison between Shark and Hive. >>>>>>> >>>>>>> hive_vs_shark >>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web> >>>>>>> >>>>>>> hive_vs_shark_report.odt >>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web> >>>>>>> >>>>>>> >>>>>>> In summary, >>>>>>> >>>>>>> From the evaluation, following conclusions can be derived. >>>>>>> >>>>>>> - Shark is indifferent to Hive in DDL operations (CREATE, DROP >>>>>>> .. TABLE, DATABASE). Both engines show a fairly constant performance >>>>>>> as the >>>>>>> input size increases. >>>>>>> - Shark is indifferent to Hive in DML operations (LOAD, INSERT) >>>>>>> but when a DML operation is called in conjuncture of a data retrieval >>>>>>> operation (ex. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark >>>>>>> significantly >>>>>>> over-performs Hive with a performance factor of 10x+ (Ranging from >>>>>>> 10x to >>>>>>> 80x in some instances). Shark performance factor reduces with the >>>>>>> input >>>>>>> size increases, while HIVE performance is fairly indifferent. >>>>>>> - Shark clearly over-performs Hive in Data Retrieval operations >>>>>>> (FILTER, ORDER BY, JOIN). Hive performance is fairly indifferent in >>>>>>> the >>>>>>> data retrieval operations while Shark performance reduces as the >>>>>>> input size >>>>>>> increases. But at every instance Shark over-performed Hive with a >>>>>>> minimum >>>>>>> performance factor of 5x+ (Ranging from 5x to 80x in some instances). >>>>>>> >>>>>>> Please refer the 'hive_vs_shark_report', it has all the information >>>>>>> about the queries and timings pictographically. >>>>>>> >>>>>>> The code repository can also be found in >>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark >>>>>>> >>>>>>> >>>>>>> Moving forward, I am currently working on the following. >>>>>>> >>>>>>> - Apache Spark's resilient distributed dataset (RDD) abstraction >>>>>>> (which is a collection of elements partitioned across the nodes of >>>>>>> the >>>>>>> cluster that can be operated on in parallel). The use of RDDs and its >>>>>>> impact to the performance. >>>>>>> - Spark SQL - Use of this Spark SQL over Shark on Spark framework >>>>>>> >>>>>>> >>>>>>> [1] https://github.com/amplab/shark/wiki >>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark >>>>>>> [3] http://spark.apache.org/docs/latest/ >>>>>>> >>>>>>> >>>>>>> >>>>>>> Would love to have your feedback on this. >>>>>>> >>>>>>> Best regards >>>>>>> >>>>>>> -- >>>>>>> *Niranda Perera* >>>>>>> Software Engineer, WSO2 Inc. >>>>>>> Mobile: +94-71-554-8430 >>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> *Anjana Fernando* >>>>>> Senior Technical Lead >>>>>> WSO2 Inc. | http://wso2.com >>>>>> lean . enterprise . middleware >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> *Niranda Perera* >>>>> Software Engineer, WSO2 Inc. >>>>> Mobile: +94-71-554-8430 >>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>> >>>> >>>> >>>> >>>> -- >>>> ============================ >>>> Srinath Perera, Ph.D. >>>> http://people.apache.org/~hemapani/ >>>> http://srinathsview.blogspot.com/ >>>> >>> >>> >>> >>> -- >>> *Anjana Fernando* >>> Senior Technical Lead >>> WSO2 Inc. | http://wso2.com >>> lean . enterprise . middleware >>> >> >> >> >> -- >> *Niranda Perera* >> Software Engineer, WSO2 Inc. >> Mobile: +94-71-554-8430 >> Twitter: @n1r44 <https://twitter.com/N1R44> >> > > > _______________________________________________ > Architecture mailing list > Architecture@wso2.org > https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture > > -- *Lasantha Fernando* Software Engineer - Data Technologies Team WSO2 Inc. http://wso2.com email: lasan...@wso2.com mobile: (+94) 71 5247551
_______________________________________________ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture