On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <nira...@wso2.com> wrote:
> @Maninda,
>
> +1 for suggesting Spark SQL.
>
> Quoting Databricks,
> "Spark SQL provides state-of-the-art SQL performance and maintains
> compatibility with Shark/Hive. In particular, like Shark, Spark SQL
> supports all existing Hive data formats, user-defined functions (UDF), and
> the Hive metastore." [1]
>
> But I am not entirely sure whether Spark SQL and Siddhi are comparable,
> because Spark SQL (like Hive) is designed for batch processing, whereas
> Siddhi does real-time processing. But if there are implementations where
> Siddhi is run on top of Spark, it would be very interesting.

Yes, Siddhi's current mode of operation does not support this, but with
partitions we can achieve it to some extent.

Suho

> Spark supports either Hadoop 1 or 2. But I think we should see which is
> better, MR1 or YARN+MR2.
>
> [image: Hadoop Architecture] [2]
>
> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
>
> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <lasan...@wso2.com> wrote:
>
>> Hi Maninda,
>>
>> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com> wrote:
>>
>>> In the case of discontinuation of the Shark project, IMO we should not
>>> move to Shark at all. And it seems better to go with Spark SQL, as we
>>> are already using Spark for CEP. But I am not sure of the difference
>>> between Spark SQL and Siddhi queries on the Spark engine.
>>
>> Currently, we are doing the integration with CEP using Apache Storm, not
>> Spark... :-). Spark Streaming is a possible candidate for integrating
>> with CEP, but we have opted for Storm. I think there has been some
>> independent work on integrating Kafka + Spark Streaming + Siddhi.
>> Please refer to the thread on arch@, "[Architecture] A few questions
>> about WSO2 CEP/Siddhi".
>>
>>> And we have to figure out how Spark SQL is used for historical data,
>>> and whether it can execute incremental processing by default, which
>>> would cover all our existing BAM use cases.
>>> On the other hand, Hadoop 2 [1] uses a completely different platform
>>> for resource allocation, known as YARN. Sometimes this may be more
>>> suitable for batch jobs.
>>>
>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>>
>> Thanks,
>> Lasantha
>>
>>> *Maninda Edirisooriya*
>>> Senior Software Engineer
>>>
>>> *WSO2, Inc. *lean.enterprise.middleware.
>>>
>>> *Blog* : http://maninda.blogspot.com/
>>> *E-mail* : mani...@wso2.com
>>> *Skype* : @manindae
>>> *Twitter* : @maninda
>>>
>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com> wrote:
>>>
>>>> Hi Anjana and Srinath,
>>>>
>>>> After the discussion I had with Anjana, I researched more on the
>>>> continuation of the Shark project by Databricks.
>>>>
>>>> Here's what I found out:
>>>> - Shark was built on the Hive codebase and achieved performance
>>>> improvements by swapping out the physical execution engine of Hive.
>>>> While this approach enabled Shark users to speed up their Hive
>>>> queries, Shark inherited a large, complicated code base from Hive that
>>>> made it hard to optimize and maintain. Hence, Databricks has announced
>>>> that they are halting the development of Shark from July 2014 (Shark
>>>> 0.9 will be the last release). [1]
>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS
>>>> performance
>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html>
>>>> by almost an order of magnitude. It also supports all existing Hive
>>>> data formats, user-defined functions (UDF), and the Hive metastore.
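[Since the discussion hinges on existing Hive QL running unchanged on
Spark SQL's Hive compatibility layer, here is a sketch of the kind of
script being talked about. The table and column names are hypothetical,
not from any actual BAM script:]

```sql
-- Hypothetical BAM-style event table; names are illustrative only.
CREATE TABLE IF NOT EXISTS http_events (
    event_time    BIGINT,
    host          STRING,
    response_code INT,
    response_time DOUBLE
);

-- A typical summarisation query. Because Spark SQL (like Shark) claims
-- support for Hive QL, Hive data formats, UDFs, and the Hive metastore,
-- a script of this shape should run on either engine without changes.
SELECT host,
       COUNT(*)           AS request_count,
       AVG(response_time) AS avg_response_time
FROM http_events
WHERE response_code >= 500
GROUP BY host;
```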
>>>> [2]
>>>> - Following is the Shark to Spark SQL migration plan:
>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>>> - For legacy Hive and MapReduce users, they have proposed a new
>>>> 'Hive on Spark' project [3], [4]. But, given the performance
>>>> enhancements, it is quite certain that Hive and MR will be replaced
>>>> by engines built on top of Spark (e.g. Spark SQL).
>>>>
>>>> In my opinion, there are a few matters to figure out if we are
>>>> migrating from Hive:
>>>>
>>>> 1. Are we changing the query engine only? (Then we can replace Hive
>>>> with Shark.)
>>>> 2. Are we changing the existing Hadoop/MapReduce framework to Spark?
>>>> (Then we can replace Hive and Hadoop with Spark and Spark SQL.)
>>>>
>>>> In my opinion, considering the long-term impact and the availability
>>>> of support, it is best to migrate from Hive/Hadoop to Spark. It is
>>>> open for discussion!
>>>>
>>>> In the meantime, I have already tried Spark SQL, and Databricks'
>>>> claim of improved performance seems to be true. I will work more on
>>>> this.
>>>>
>>>> Cheers
>>>>
>>>> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>> [2] http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>>
>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>>
>>>>> Hi Srinath,
>>>>>
>>>>> No, this has not been tested on multiple nodes. I told Niranda here
>>>>> in my last mail to test a cluster with the same set of hardware that
>>>>> we are using to test our large data set with Hive. As for the effort
>>>>> to make the change, we still have to figure out the MT aspects of
>>>>> Shark here.
>>>>> Sinthuja was working on making the latest Hive version MT ready,
>>>>> and most probably we can make the same changes to the Hive version
>>>>> Shark is using. So after we do that, the integration should be
>>>>> seamless. Also, as I mentioned earlier, we are going to test this
>>>>> with the APIM Hive script, to check if there are any unforeseen
>>>>> incompatibilities.
>>>>>
>>>>> Cheers,
>>>>> Anjana.
>>>>>
>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com> wrote:
>>>>>
>>>>>> This looks great.
>>>>>>
>>>>>> We need to test Spark with multiple nodes; did we do that? Please
>>>>>> create a few VMs in the performance cloud (talk to Lakmal) and test
>>>>>> with at least 5 nodes. We need to make sure it works well in a
>>>>>> distributed setup too.
>>>>>>
>>>>>> What does it take to change to Spark? Anjana, how much work is it?
>>>>>>
>>>>>> --Srinath
>>>>>>
>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>
>>>>>>> Thank you, Anjana.
>>>>>>>
>>>>>>> Yes, I am working on it.
>>>>>>>
>>>>>>> In the meantime, I found this in the Hive documentation [1]. It
>>>>>>> talks about Hive on Spark, and compares Hive, Shark, and Spark SQL
>>>>>>> at a higher architectural level.
>>>>>>>
>>>>>>> Additionally, it is said that the in-memory performance of Shark
>>>>>>> can be improved by introducing Tachyon [2]. I guess we can
>>>>>>> consider this later on.
>>>>>>>
>>>>>>> Cheers.
>>>>>>>
>>>>>>> [1] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>>>
>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>>>>>
>>>>>>>> Hi Niranda,
>>>>>>>>
>>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of insight
>>>>>>>> into how both operate in different scenarios.
>>>>>>>> As the next step, we will need to run this on an actual cluster
>>>>>>>> of computers. Since you have used a subset of the dataset of the
>>>>>>>> 2014 DEBS challenge, we should use the full data set in a
>>>>>>>> clustered environment and check this. Gokul is already working on
>>>>>>>> the Hive-based setup for this; after that is done, you can create
>>>>>>>> a Shark cluster on the same hardware and run the tests there, to
>>>>>>>> get a clear comparison of how these two match up in a cluster.
>>>>>>>> Until the setup is ready, do continue with your next steps on
>>>>>>>> checking the RDD support and Spark SQL use.
>>>>>>>>
>>>>>>>> After these are done, we should also do a trial run of our own
>>>>>>>> APIM Hive scripts, migrated to Shark.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Anjana.
>>>>>>>>
>>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I have been evaluating the performance of Shark (a distributed
>>>>>>>>> SQL query engine for Hadoop) against Hive, with the objective of
>>>>>>>>> assessing the possibility of moving WSO2 BAM data processing
>>>>>>>>> (which currently uses Hive) to Shark (and Apache Spark) for
>>>>>>>>> improved performance.
>>>>>>>>>
>>>>>>>>> I am sharing my findings herewith.
>>>>>>>>>
>>>>>>>>> *AMP Lab Shark*
>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster than
>>>>>>>>> Hive without any modification to the existing data or queries.
>>>>>>>>> It supports Hive's QL, metastore, serialization formats, and
>>>>>>>>> user-defined functions, providing seamless integration with
>>>>>>>>> existing Hive deployments and a familiar, more powerful option
>>>>>>>>> for new ones. [1]
>>>>>>>>>
>>>>>>>>> *Apache Spark*
>>>>>>>>> Apache Spark is an open-source data analytics cluster computing
>>>>>>>>> framework.
>>>>>>>>> It fits into the Hadoop open-source community, building on top
>>>>>>>>> of HDFS, and promises performance up to 100 times faster than
>>>>>>>>> Hadoop MapReduce for certain applications. [2]
>>>>>>>>> Official documentation: [3]
>>>>>>>>>
>>>>>>>>> I carried out the comparison between the following Hive and
>>>>>>>>> Shark releases, with input files ranging from 100 to 1 billion
>>>>>>>>> entries:
>>>>>>>>>
>>>>>>>>> QL engine:
>>>>>>>>> - Hive: Apache Hive 0.11
>>>>>>>>> - Shark: Shark 0.9.1 (latest release), which uses Scala 2.10.3,
>>>>>>>>>   Spark 0.9.1, and AMPLab's Hive 0.9.0
>>>>>>>>> Framework:
>>>>>>>>> - Hive: Hadoop 1.0.4
>>>>>>>>> - Shark: Spark 0.9.1
>>>>>>>>> File system:
>>>>>>>>> - HDFS for both
>>>>>>>>>
>>>>>>>>> Attached herewith is a report which describes the performance
>>>>>>>>> comparison between Shark and Hive in detail.
>>>>>>>>>
>>>>>>>>> hive_vs_shark
>>>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>>>
>>>>>>>>> hive_vs_shark_report.odt
>>>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>>
>>>>>>>>> In summary, the following conclusions can be drawn from the
>>>>>>>>> evaluation:
>>>>>>>>>
>>>>>>>>> - Shark is on par with Hive in DDL operations (CREATE, DROP ..
>>>>>>>>> TABLE, DATABASE). Both engines show fairly constant performance
>>>>>>>>> as the input size increases.
>>>>>>>>> - Shark is on par with Hive in plain DML operations (LOAD,
>>>>>>>>> INSERT), but when a DML operation is combined with a data
>>>>>>>>> retrieval operation (e.g.
>>>>>>>>> INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly
>>>>>>>>> outperforms Hive, with a performance factor of 10x+ (ranging
>>>>>>>>> from 10x to 80x in some instances). Shark's performance factor
>>>>>>>>> decreases as the input size increases, while Hive's performance
>>>>>>>>> stays fairly constant.
>>>>>>>>> - Shark clearly outperforms Hive in data retrieval operations
>>>>>>>>> (FILTER, ORDER BY, JOIN). Hive's performance is fairly constant
>>>>>>>>> in the data retrieval operations, while Shark's performance
>>>>>>>>> drops as the input size increases. But in every instance Shark
>>>>>>>>> outperformed Hive, with a minimum performance factor of 5x+
>>>>>>>>> (ranging from 5x to 80x in some instances).
>>>>>>>>>
>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it has all the
>>>>>>>>> information about the queries and timings, presented
>>>>>>>>> graphically.
>>>>>>>>>
>>>>>>>>> The code repository can be found at
>>>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>>
>>>>>>>>> Moving forward, I am currently working on the following:
>>>>>>>>>
>>>>>>>>> - Apache Spark's resilient distributed dataset (RDD) abstraction
>>>>>>>>> (a collection of elements partitioned across the nodes of the
>>>>>>>>> cluster that can be operated on in parallel): the use of RDDs
>>>>>>>>> and their impact on performance.
>>>>>>>>> - Spark SQL: the use of Spark SQL over Shark on the Spark
>>>>>>>>> framework.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>>
>>>>>>>>> Would love to have your feedback on this.
>>>>>>>>>
>>>>>>>>> Best regards
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Niranda Perera*
>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Anjana Fernando*
>>>>>>>> Senior Technical Lead
>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>> lean . enterprise . middleware
>>>>>>
>>>>>> --
>>>>>> ============================
>>>>>> Srinath Perera, Ph.D.
>>>>>> http://people.apache.org/~hemapani/
>>>>>> http://srinathsview.blogspot.com/
>>
>> --
>> *Lasantha Fernando*
>> Software Engineer - Data Technologies Team
>> WSO2 Inc. http://wso2.com
>>
>> email: lasan...@wso2.com
>> mobile: (+94) 71 5247551

--
*S. Suhothayan*
Technical Lead & Team Lead of WSO2 Complex Event Processor
*WSO2 Inc. *http://wso2.com
lean . enterprise . middleware
*cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/
twitter: http://twitter.com/suhothayan | linked-in:
http://lk.linkedin.com/in/suhothayan*
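[As a footnote to the benchmark discussion above: the three classes of
queries compared in the hive_vs_shark report (DDL, DML combined with
retrieval, and pure data retrieval) can be sketched in Hive QL as
follows. Table and column names here are hypothetical, not taken from
the actual benchmark scripts:]

```sql
-- Hypothetical table and column names, for illustration only.

-- 1. DDL operations: both engines performed on par, roughly constant
--    with input size.
CREATE TABLE events_summary (host STRING, hits BIGINT);

-- 2. DML combined with data retrieval, the INSERT ... SELECT pattern
--    where Shark showed the 10x-80x advantage.
INSERT OVERWRITE TABLE events_summary
SELECT host, COUNT(*)
FROM events
GROUP BY host;

-- 3. Pure data retrieval (FILTER / ORDER BY / JOIN), where Shark
--    showed a minimum 5x advantage.
SELECT e.host, s.hits
FROM events e
JOIN events_summary s ON (e.host = s.host)
WHERE s.hits > 1000
ORDER BY s.hits DESC;
```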
_______________________________________________ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture