@Maninda, +1 for suggesting Spark SQL.
Quoting Databricks: "Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore." [1]

But I am not entirely sure whether Spark SQL and Siddhi are comparable, because Spark SQL (like Hive) is designed for batch processing, whereas Siddhi does real-time processing. That said, if there are implementations where Siddhi runs on top of Spark, it would be very interesting.

Spark supports either Hadoop 1 or Hadoop 2. But I think we should evaluate which is better: MR1, or YARN + MR2. [image: Hadoop Architecture] [2]

[1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
[2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html

On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <lasan...@wso2.com> wrote:

> Hi Maninda,
>
> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com> wrote:
>
>> In the case of discontinuation of the Shark project, IMO we should not move to Shark at all. It seems better to go with Spark SQL, as we are already using Spark for CEP. But I am not sure of the difference between Spark SQL and Siddhi queries on the Spark engine.
>
> Currently, we are doing the integration with CEP using Apache Storm, not Spark... :-). Spark Streaming is a possible candidate for integrating with CEP, but we have opted for Storm. I think there has been some independent work on integrating Kafka + Spark Streaming + Siddhi. Please refer to the thread on arch@ "[Architecture] A few questions about WSO2 CEP/Siddhi".
>
>> And we have to figure out how Spark SQL can be used for historical data, and whether it can execute incremental processing by default, which would cover all our existing BAM use cases. On the other hand, Hadoop 2 [1] uses a completely different platform for resource allocation, known as YARN.
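To make the Hive-compatibility claim in the Databricks quote above concrete: a typical BAM-style summarisation script is plain Hive QL, so in principle it should run unchanged on Shark or on Spark SQL. A hypothetical sketch (table and column names are illustrative, not from an actual BAM script):

```sql
-- Hypothetical HiveQL summarisation (illustrative names only).
-- Per the Databricks quote, the same statement should work on Hive,
-- Shark, and Spark SQL, since all three share Hive's QL, UDFs, and
-- metastore.
INSERT OVERWRITE TABLE request_summary
SELECT host,
       COUNT(*)           AS request_count,
       AVG(response_time) AS avg_latency_ms
FROM   request_log
GROUP  BY host;
```

Incidentally, this is the DML-plus-retrieval pattern (INSERT ... SELECT ... FROM ...) where the benchmark later in this thread saw Shark's largest gains over Hive.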
>> Sometimes this may be more suitable for batch jobs.
>>
>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>
> Thanks,
> Lasantha
>
>> *Maninda Edirisooriya*
>> Senior Software Engineer
>>
>> *WSO2, Inc.* lean.enterprise.middleware.
>>
>> *Blog* : http://maninda.blogspot.com/
>> *E-mail* : mani...@wso2.com
>> *Skype* : @manindae
>> *Twitter* : @maninda
>>
>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com> wrote:
>>
>>> Hi Anjana and Srinath,
>>>
>>> After the discussion I had with Anjana, I researched further into the continuation of the Shark project by Databricks. Here is what I found:
>>>
>>> - Shark was built on the Hive codebase and achieved performance improvements by swapping out the physical execution engine of Hive. While this approach let Shark users speed up their Hive queries, Shark inherited a large, complicated code base from Hive that made it hard to optimize and maintain. Hence, Databricks has announced that they are halting development of Shark as of July 2014 (Shark 0.9 will be the last release). [1]
>>> - Shark will be replaced by Spark SQL, which beats Shark in TPC-DS performance <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html> by almost an order of magnitude. It also supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore. [2]
>>> - The Shark to Spark SQL migration plan is described here: http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>> - For legacy Hive and MapReduce users, they have proposed a new 'Hive on Spark' project. [3], [4] But, given the performance enhancement, it is quite certain that Hive and MR will be replaced by engines built on top of Spark (e.g. Spark SQL).
>>>
>>> In my opinion, there are a few matters to settle if we are migrating from Hive:
>>>
>>> 1. Are we changing the query engine only? (Then, we can replace Hive with Shark.)
>>> 2. Are we changing the existing Hadoop/MapReduce framework to Spark? (Then, we can replace Hive and Hadoop with Spark and Spark SQL.)
>>>
>>> Considering the long-term impact and the availability of support, I think it is best to migrate from Hive/Hadoop to Spark. It is open for discussion!
>>>
>>> In the meantime, I have already tried Spark SQL, and Databricks' claims of improved performance seem to be true. I will work more on this.
>>>
>>> Cheers
>>>
>>> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>> [2] http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>
>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>
>>>> Hi Srinath,
>>>>
>>>> No, this has not been tested on multiple nodes. I told Niranda here in my last mail to test a cluster with the same set of hardware we are using to test our large data set with Hive. As for the effort to make the change, we still have to figure out the MT aspects of Shark. Sinthuja was working on making the latest Hive version MT-ready, and most probably we can make the same changes to the Hive version Shark is using; after we do that, the integration should be seamless. Also, as I mentioned earlier, we are going to test this with the APIM Hive script as well, to check for any unforeseen incompatibilities.
>>>>
>>>> Cheers,
>>>> Anjana.
>>>>
>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com> wrote:
>>>>
>>>>> This looks great.
>>>>>
>>>>> We need to test Spark with multiple nodes. Did we do that?
>>>>> Please create a few VMs in the performance cloud (talk to Lakmal) and test with at least 5 nodes. We need to make sure it works OK in a distributed setup as well.
>>>>>
>>>>> What does it take to change to Spark? Anjana.. how much work is it?
>>>>>
>>>>> --Srinath
>>>>>
>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>
>>>>>> Thank you Anjana.
>>>>>>
>>>>>> Yes, I am working on it.
>>>>>>
>>>>>> In the meantime, I found this in the Hive documentation [1]. It talks about Hive on Spark, and compares Hive, Shark and Spark SQL at a higher architectural level.
>>>>>>
>>>>>> Additionally, it is said that the in-memory performance of Shark can be improved by introducing Tachyon [2]. I guess we can consider this later on.
>>>>>>
>>>>>> Cheers.
>>>>>>
>>>>>> [1] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>>
>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>>>>
>>>>>>> Hi Niranda,
>>>>>>>
>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of insight into how both operate in different scenarios. As the next step, we will need to run this on an actual cluster of machines. Since you have used a subset of the 2014 DEBS challenge dataset, we should use the full data set in a clustered environment and check this. Gokul is already working on the Hive-based setup for this; after that is done, you can create a Shark cluster on the same hardware and run the tests there, to get a clear comparison of how these two match up in a cluster. Until the setup is ready, do continue with your next steps of checking the RDD support and Spark SQL use.
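On the RDD and Spark SQL checks mentioned above: as far as I can see, Spark SQL also exposes RDD caching at the query level, which is essentially where Shark gets its in-memory speed-up. A rough sketch (the table name is illustrative, and the statements are from my reading of the Spark SQL docs, so worth verifying):

```sql
-- Pin the table's underlying RDD in cluster memory (Spark SQL).
CACHE TABLE request_log;

-- Subsequent retrieval queries (FILTER / ORDER BY / JOIN style) should
-- then be served from memory rather than re-read from HDFS.
SELECT   host, COUNT(*)
FROM     request_log
GROUP BY host;

-- Release the memory when done.
UNCACHE TABLE request_log;
```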
>>>>>>> After these are done, we should also do a trial run of our own APIM Hive scripts, migrated to Shark.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Anjana.
>>>>>>>
>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I have been evaluating the performance of Shark (a distributed SQL query engine for Hadoop) against Hive, with the objective of assessing whether WSO2 BAM data processing (which currently uses Hive) could be moved to Shark (and Apache Spark) for improved performance. I am sharing my findings herewith.
>>>>>>>>
>>>>>>>> *AMPLab Shark*
>>>>>>>> Shark can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. It supports Hive's QL, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones. [1]
>>>>>>>>
>>>>>>>> *Apache Spark*
>>>>>>>> Apache Spark is an open-source data-analytics cluster computing framework. It fits into the Hadoop open-source community, builds on top of HDFS, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. [2] Official documentation: [3]
>>>>>>>>
>>>>>>>> I carried out the comparison between the following Hive and Shark releases, with input files ranging from 100 to 1 billion entries.
>>>>>>>>                   Hive setup          Shark setup
>>>>>>>> QL engine         Apache Hive 0.11    Shark 0.9.1 (latest release), which uses:
>>>>>>>>                                         - Scala 2.10.3
>>>>>>>>                                         - Spark 0.9.1
>>>>>>>>                                         - AMPLab's Hive 0.9.0
>>>>>>>> Framework         Hadoop 1.0.4        Spark 0.9.1
>>>>>>>> File system       HDFS                HDFS
>>>>>>>>
>>>>>>>> Attached herewith is a report describing the performance comparison between Shark and Hive in detail.
>>>>>>>>
>>>>>>>> hive_vs_shark <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>> hive_vs_shark_report.odt <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>
>>>>>>>> In summary, the following conclusions can be drawn from the evaluation:
>>>>>>>>
>>>>>>>> - Shark performs on par with Hive in DDL operations (CREATE/DROP TABLE, DATABASE). Both engines show fairly constant performance as the input size increases.
>>>>>>>> - Shark also performs on par with Hive in plain DML operations (LOAD, INSERT), but when a DML operation is combined with a data retrieval operation (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly outperforms Hive, with a performance factor of 10x+ (ranging from 10x to 80x in some instances). Shark's performance factor decreases as the input size increases, while Hive's performance stays fairly constant.
>>>>>>>> - Shark clearly outperforms Hive in data retrieval operations (FILTER, ORDER BY, JOIN). Hive's performance is fairly constant across the data retrieval operations, while Shark's performance drops as the input size increases.
>>>>>>>> But in every instance, Shark outperformed Hive with a minimum performance factor of 5x+ (ranging from 5x to 80x in some instances).
>>>>>>>>
>>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all the information about the queries and timings graphically.
>>>>>>>>
>>>>>>>> The code repository can be found at https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>
>>>>>>>> Moving forward, I am currently working on the following:
>>>>>>>>
>>>>>>>> - Apache Spark's resilient distributed dataset (RDD) abstraction (a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel): the use of RDDs and their impact on performance.
>>>>>>>> - Spark SQL: the use of Spark SQL over Shark on the Spark framework.
>>>>>>>>
>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>
>>>>>>>> Would love to have your feedback on this.
>>>>>>>>
>>>>>>>> Best regards
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Niranda Perera*
>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>
>>>>>>> --
>>>>>>> *Anjana Fernando*
>>>>>>> Senior Technical Lead
>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>> lean . enterprise . middleware
>>>>>>
>>>>>> --
>>>>>> *Niranda Perera*
>>>>>> Software Engineer, WSO2 Inc.
>>>>>> Mobile: +94-71-554-8430
>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>
>>>>> --
>>>>> ============================
>>>>> Srinath Perera, Ph.D.
>>>>> http://people.apache.org/~hemapani/
>>>>> http://srinathsview.blogspot.com/
>>>>
>>>> --
>>>> *Anjana Fernando*
>>>> Senior Technical Lead
>>>> WSO2 Inc. | http://wso2.com
>>>> lean . enterprise . middleware
>>>
>>> --
>>> *Niranda Perera*
>>> Software Engineer, WSO2 Inc.
>>> Mobile: +94-71-554-8430
>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>
>> _______________________________________________
>> Architecture mailing list
>> Architecture@wso2.org
>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>
> --
> *Lasantha Fernando*
> Software Engineer - Data Technologies Team
> WSO2 Inc. http://wso2.com
>
> email: lasan...@wso2.com
> mobile: (+94) 71 5247551

--
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>