Niranda, we need to test Spark in multi-node mode before making a decision. Spark is very fast; I think there is no doubt about that. We need to make sure it is stable.
David, thanks for the detailed email! How big (in nodes) is the Spark setup you guys are running?

--Srinath

On Thu, Aug 21, 2014 at 1:34 PM, David Morales <dmora...@stratio.com> wrote:

> Sorry for disturbing this thread, but I think I can help clarify a few things (we attended the last Spark Summit, where we were also speakers, and we work very closely with Spark).
>
> *Hive/Shark and other benchmarks*
>
> You can find a nice comparison and benchmark here: https://amplab.cs.berkeley.edu/benchmark/
>
> *Shark and SparkSQL*
>
> SparkSQL is the natural replacement for Shark, but SparkSQL is still young at this moment. If you are looking for Hive compatibility, you have to execute SparkSQL with a specific context.
>
> Quoting the Spark website:
>
> *Note that Spark SQL currently uses a very basic SQL parser. Users that want a more complete dialect of SQL should look at the HiveQL support provided by HiveContext.*
>
> So note that SparkSQL is a work in progress. If you want SparkSQL you have to run a SQLContext; if you want Hive, you will need a different context (HiveContext).
>
> *Spark - Hadoop: the future*
>
> Most Hadoop distributions now include Spark (Cloudera, Hortonworks, MapR...) and are contributing to migrating the Hadoop ecosystem to Spark.
>
> Spark is a bit more than Map/Reduce, as you can read here: http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/
>
> *Spark Streaming / Spark SQL*
>
> Spark Streaming is built on Spark and provides stream processing through an abstraction called DStreams (a collection of RDDs in a window of time).
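The two contexts David mentions can be sketched as follows. This is a minimal sketch against the Spark 1.x-era pyspark API, assuming a local Spark installation built with Hive support; the table `src` is hypothetical and only shown for illustration.

```python
# Sketch of the two Spark SQL entry points: the basic SQL parser (SQLContext)
# versus the HiveQL-compatible one (HiveContext). Requires a Spark 1.x-era
# installation with Hive support; the table "src" is hypothetical.
from pyspark import SparkContext
from pyspark.sql import SQLContext, HiveContext

sc = SparkContext("local", "context-demo")

# Plain Spark SQL: the basic SQL parser, no Hive metastore.
sql_ctx = SQLContext(sc)

# Hive compatibility: HiveQL parser, Hive metastore, SerDes, and UDFs.
hive_ctx = HiveContext(sc)
hive_ctx.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
rows = hive_ctx.sql("SELECT key, value FROM src LIMIT 10").collect()
```

The point is that Hive compatibility is opt-in: the same driver chooses it simply by constructing a HiveContext instead of a SQLContext.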
> There are some efforts to make SparkSQL compatible with Spark Streaming (something similar to Trident for Storm), as you can see here:
>
> *StreamSQL (https://github.com/thunderain-project/StreamSQL) is a POC project based on Spark that combines the power of Catalyst and Spark Streaming, to offer the ability to manipulate SQL on top of DStreams. It keeps the same semantics as SparkSQL, offering a SchemaDStream on top of DStream. You don't need to do tricky things like extracting an RDD to register it as a table. Besides that, other parts are the same as Spark.*
>
> So you can apply SQL to a data stream, but it is very simple at the moment... you can expect a bunch of improvements in this area in the coming months (I guess that SparkSQL will work on Spark Streaming streams before the end of this year).
>
> *Spark Streaming / Spark SQL and CEP*
>
> There is no relationship at this moment between (your absolutely amazing) Siddhi CEP and Spark. As far as I know, you are working on doing distributed CEP with Storm and Siddhi.
>
> We are currently working on an interactive CEP built with Kafka + Spark Streaming + Siddhi, with features such as an API, an interactive shell, built-in statistics and auditing, and built-in functions (save2cassandra, save2mongo, save2elasticsearch...).
>
> If you are interested we can talk about this project; I think it would be a nice idea!
>
> Anyway, I don't think that SparkSQL will evolve into something like a CEP. Patterns and sequences, for example, would be very complex to do with Spark Streaming (at least for now).
>
> Thanks.
>
> 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan <s...@wso2.com>:
>
>> On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <nira...@wso2.com> wrote:
>>
>>> @Maninda,
>>>
>>> +1 for suggesting Spark SQL.
>>> Quoting Databricks:
>>>
>>> "Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore." [1]
>>>
>>> But I am not entirely sure whether Spark SQL and Siddhi are comparable, because SparkSQL (like Hive) is designed for batch processing, whereas Siddhi does real-time processing. But if there are implementations where Siddhi runs on top of Spark, it would be very interesting.
>>
>> Yes, Siddhi's current mode of operation does not support this. But with partitions we can achieve this to some extent.
>>
>> Suho
>>
>>> Spark supports either Hadoop 1 or 2. But I think we should see which is best, MR1 or YARN+MR2.
>>>
>>> [image: Hadoop Architecture] [2]
>>>
>>> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
>>>
>>> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <lasan...@wso2.com> wrote:
>>>
>>>> Hi Maninda,
>>>>
>>>> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com> wrote:
>>>>
>>>>> In the case of the discontinuation of the Shark project, IMO we should not move to Shark at all. And it seems better to go with Spark SQL, as we are already using Spark for CEP. But I am not sure of the difference between Spark SQL and Siddhi queries on the Spark engine.
>>>>
>>>> Currently, we are doing the integration with CEP using Apache Storm, not Spark... :-). Spark Streaming is a possible candidate for integrating with CEP, but we have opted for Storm. I think there has been some independent work on integrating Kafka + Spark Streaming + Siddhi.
>>>> Please refer to the thread on arch@: "[Architecture] A few questions about WSO2 CEP/Siddhi".
>>>>
>>>>> And we have to figure out how Spark SQL is used for historical data, and whether it can execute incremental processing by default, which would implement all our existing BAM use cases. On the other hand, Hadoop 2 [1] uses a completely different platform for resource allocation, known as YARN. Sometimes this may be more suitable for batch jobs.
>>>>>
>>>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>>>>
>>>> Thanks,
>>>> Lasantha
>>>>
>>>>> *Maninda Edirisooriya*
>>>>> Senior Software Engineer
>>>>> *WSO2, Inc.* lean.enterprise.middleware.
>>>>> *Blog*: http://maninda.blogspot.com/
>>>>> *E-mail*: mani...@wso2.com
>>>>> *Skype*: @manindae
>>>>> *Twitter*: @maninda
>>>>>
>>>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>
>>>>>> Hi Anjana and Srinath,
>>>>>>
>>>>>> After the discussion I had with Anjana, I researched further into the continuation of the Shark project by Databricks.
>>>>>>
>>>>>> Here's what I found out:
>>>>>>
>>>>>> - Shark was built on the Hive codebase and achieved performance improvements by swapping out the physical execution engine of Hive. While this approach enabled Shark users to speed up their Hive queries, Shark inherited a large, complicated code base from Hive that made it hard to optimize and maintain. Hence, Databricks has announced that they are halting the development of Shark as of July 2014 (Shark 0.9 will be the last release). [1]
>>>>>> - Shark will be replaced by Spark SQL, which beats Shark in TPC-DS performance by almost an order of magnitude [2].
>>>>>> It also supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore. [2]
>>>>>> - The Shark-to-Spark SQL migration plan is outlined here: http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>>>>> - For legacy Hive and MapReduce users, a new 'Hive on Spark' project has been proposed [3], [4]. But, given the performance enhancement, it is quite certain that Hive and MR will be replaced by engines built on top of Spark (e.g. Spark SQL).
>>>>>>
>>>>>> In my opinion, there are a few things to figure out if we are migrating from Hive:
>>>>>>
>>>>>> 1. Are we changing the query engine only? (Then we can replace Hive with Shark.)
>>>>>> 2. Are we changing the existing Hadoop/MapReduce framework to Spark? (Then we can replace Hive and Hadoop with Spark and Spark SQL.)
>>>>>>
>>>>>> In my opinion, considering the long-term impact and the availability of support, it is best to migrate Hive/Hadoop to Spark. It is open for discussion!
>>>>>>
>>>>>> In the meantime, I have already tried Spark SQL, and Databricks' claims of improved performance seem to be true. I will work more on this.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>> [2] http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>>>>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>>>>
>>>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>>>>
>>>>>>> Hi Srinath,
>>>>>>>
>>>>>>> No, this has not been tested on multiple nodes.
>>>>>>> I told Niranda in my last mail to test a cluster with the same set of hardware we are using to test our large data set with Hive. As for the effort to make the change, we still have to figure out the MT aspects of Shark. Sinthuja was working on making the latest Hive version MT-ready, and most probably we can make the same changes to the Hive version Shark uses. After we do that, the integration should be seamless. Also, as I mentioned earlier, we are going to test this with the APIM Hive script, to check whether there are any unforeseen incompatibilities.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Anjana.
>>>>>>>
>>>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com> wrote:
>>>>>>>
>>>>>>>> This looks great.
>>>>>>>>
>>>>>>>> Have we tested Spark with multiple nodes? Please create a few VMs in the performance cloud (talk to Lakmal) and test with at least 5 nodes. We need to make sure it works OK in a distributed setup as well.
>>>>>>>>
>>>>>>>> What does it take to change to Spark? Anjana, how much work is it?
>>>>>>>>
>>>>>>>> --Srinath
>>>>>>>>
>>>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>>>
>>>>>>>>> Thank you Anjana.
>>>>>>>>>
>>>>>>>>> Yes, I am working on it.
>>>>>>>>>
>>>>>>>>> In the meantime, I found this in the Hive documentation [1]. It talks about Hive on Spark, and compares Hive, Shark, and Spark SQL at a higher architectural level.
>>>>>>>>>
>>>>>>>>> Additionally, it is said that the in-memory performance of Shark can be improved by introducing Tachyon [2]. I guess we can consider this later on.
>>>>>>>>>
>>>>>>>>> Cheers.
>>>>>>>>> [1] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>>>>>
>>>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Niranda,
>>>>>>>>>>
>>>>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of insight into how the two operate in different scenarios. As the next step, we will need to run this on an actual cluster of computers. Since you've used a subset of the dataset from the 2014 DEBS challenge, we should use the full data set in a clustered environment and check this. Gokul is already working on the Hive-based setup for this; after that is done, you can create a Shark cluster on the same hardware and run the tests there, to get a clear comparison of how the two match up in a cluster. Until the setup is ready, do continue with your next steps of checking the RDD support and Spark SQL use.
>>>>>>>>>>
>>>>>>>>>> After these are done, we should also do a trial run of our own APIM Hive scripts, migrated to Shark.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Anjana.
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> I have been evaluating the performance of Shark (a distributed SQL query engine for Hadoop) against Hive, with the objective of assessing the possibility of moving WSO2 BAM data processing (which currently uses Hive) to Shark (and Apache Spark) for improved performance.
>>>>>>>>>>>
>>>>>>>>>>> I am sharing my findings herewith.
>>>>>>>>>>> *AMP Lab Shark*
>>>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster than Hive without any modification to the existing data or queries. It supports Hive's QL, metastore, serialization formats, and user-defined functions, providing seamless integration with existing Hive deployments and a familiar, more powerful option for new ones. [1]
>>>>>>>>>>>
>>>>>>>>>>> *Apache Spark*
>>>>>>>>>>> Apache Spark is an open-source data analytics cluster computing framework. It fits into the Hadoop open-source community, building on top of HDFS, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. [2] Official documentation: [3]
>>>>>>>>>>>
>>>>>>>>>>> I carried out the comparison between the following Hive and Shark releases, with input files ranging from 100 to 1 billion entries:
>>>>>>>>>>>
>>>>>>>>>>>                 Hive              Shark
>>>>>>>>>>> QL engine       Apache Hive 0.11  Shark 0.9.1 (latest release), which uses Scala 2.10.3, Spark 0.9.1, and AMPLab's Hive 0.9.0
>>>>>>>>>>> Framework       Hadoop 1.0.4      Spark 0.9.1
>>>>>>>>>>> File system     HDFS              HDFS
>>>>>>>>>>>
>>>>>>>>>>> Attached herewith is a report which describes the performance comparison between Shark and Hive in detail.
>>>>>>>>>>> hive_vs_shark <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>>>>> hive_vs_shark_report.odt <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>>>>
>>>>>>>>>>> In summary, the following conclusions can be drawn from the evaluation:
>>>>>>>>>>>
>>>>>>>>>>> - Shark performs on par with Hive in DDL operations (CREATE, DROP ... TABLE, DATABASE). Both engines show fairly constant performance as the input size increases.
>>>>>>>>>>> - Shark performs on par with Hive in plain DML operations (LOAD, INSERT), but when a DML operation is combined with a data retrieval operation (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly outperforms Hive, with a performance factor of 10x+ (ranging from 10x to 80x in some instances). Shark's performance factor decreases as the input size increases, while Hive's performance stays fairly constant.
>>>>>>>>>>> - Shark clearly outperforms Hive in data retrieval operations (FILTER, ORDER BY, JOIN). Hive's performance is fairly constant in the data retrieval operations, while Shark's performance degrades as the input size increases. But in every instance Shark outperformed Hive, with a minimum performance factor of 5x+ (ranging from 5x to 80x in some instances).
>>>>>>>>>>>
>>>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all the information about the queries and timings graphically.
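The performance factors in the summary above are simple wall-clock speedup ratios. As a quick worked example (the timings below are hypothetical placeholders, not the measured values from the attached report):

```python
# Speedup factor = Hive wall-clock time / Shark wall-clock time for the
# same query on the same input. Hypothetical timings, for illustration only.
hive_seconds = 480.0    # e.g. an INSERT ... SELECT over a large input on Hive
shark_seconds = 12.0    # the same query on Shark

speedup = hive_seconds / shark_seconds
print("Shark speedup: %.0fx" % speedup)  # prints "Shark speedup: 40x"
```

A 40x factor such as this one would fall within the 10x-80x band reported above for combined DML and retrieval queries.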
>>>>>>>>>>> The code repository can be found at https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>>>>
>>>>>>>>>>> Moving forward, I am currently working on the following:
>>>>>>>>>>>
>>>>>>>>>>> - Apache Spark's resilient distributed dataset (RDD) abstraction (a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel): the use of RDDs and their impact on performance.
>>>>>>>>>>> - Spark SQL: the use of Spark SQL over Shark on the Spark framework.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>>>>
>>>>>>>>>>> Would love to have your feedback on this.
>>>>>>>>>>>
>>>>>>>>>>> Best regards
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Anjana Fernando*
>>>>>>>>>> Senior Technical Lead
>>>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Niranda Perera*
>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>
>>>>>>>> --
>>>>>>>> ============================
>>>>>>>> Srinath Perera, Ph.D.
>>>>>>>> http://people.apache.org/~hemapani/
>>>>>>>> http://srinathsview.blogspot.com/
>>>>>>>
>>>>>>> --
>>>>>>> *Anjana Fernando*
>>>>>>> Senior Technical Lead
>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>> lean . enterprise .
>>>>>>> middleware
>>>>>>
>>>>>> --
>>>>>> *Niranda Perera*
>>>>>> Software Engineer, WSO2 Inc.
>>>>>> Mobile: +94-71-554-8430
>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>
>>>>> _______________________________________________
>>>>> Architecture mailing list
>>>>> Architecture@wso2.org
>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>
>>>> --
>>>> *Lasantha Fernando*
>>>> Software Engineer - Data Technologies Team
>>>> WSO2 Inc. http://wso2.com
>>>> email: lasan...@wso2.com
>>>> mobile: (+94) 71 5247551
>>>
>>> --
>>> *Niranda Perera*
>>> Software Engineer, WSO2 Inc.
>>> Mobile: +94-71-554-8430
>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>
>> --
>> *S. Suhothayan*
>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>> *WSO2 Inc.* http://wso2.com
>> lean . enterprise . middleware
>> cell: (+94) 779 756 757 | blog: http://suhothayan.blogspot.com/
>> twitter: http://twitter.com/suhothayan | linked-in: http://lk.linkedin.com/in/suhothayan

--
============================
Srinath Perera, Ph.D.
http://people.apache.org/~hemapani/
http://srinathsview.blogspot.com/