Hi David, Sorry to revive this thread, but may I ask whether you have done any benchmarking comparing the DataStax Spark Cassandra connector and the Stratio Deep Spark-Cassandra integration? I would love to take a look at it.
I recently checked the deep-spark GitHub repo and noticed that there has been no activity since Oct 29th. May I know what your future plans are for this particular project? Cheers On Tue, Aug 26, 2014 at 9:12 PM, David Morales <dmora...@stratio.com> wrote: > Yes, it is already included in our benchmarks. > > It could be a nice idea to share our findings; let me talk about it here. > Meanwhile, you can ask us any question by using my mail or this thread, we > are glad to help you. > > Best regards. > > 2014-08-24 15:49 GMT+02:00 Niranda Perera <nira...@wso2.com>: > >> Hi David, >> >> Thank you for your detailed reply. >> >> It was great to hear about Stratio-Deep and I must say, it looks very >> interesting. Storage handlers for databases such as Cassandra, MongoDB, etc. >> would be very helpful. We will definitely look into Stratio-Deep. >> >> I came across the DataStax Spark-Cassandra connector ( >> https://github.com/datastax/spark-cassandra-connector ). Have you done >> any comparison between your implementation and DataStax's connector? >> >> And, yes, please do share the performance results with us once they are ready. >> >> On a different note, is there any way for us to interact with the Stratio dev >> community, in the form of dev mailing lists etc., so that we could mutually >> share our findings? >> >> Best regards >> >> On Fri, Aug 22, 2014 at 2:07 PM, David Morales <dmora...@stratio.com> >> wrote: >> >>> Hi there, >>> >>> *1. About the size of deployments.* >>> >>> It depends on your use case... especially when you combine Spark with a >>> datastore. We usually deploy Spark with Cassandra or MongoDB, instead of >>> using HDFS, for example. >>> >>> Spark will be faster if you put the data in memory, so if you need a lot >>> of speed (interactive queries, for example), you should have enough memory. >>> >>> *2. 
About storage handlers.* >>> >>> We have developed the first tight integration between Cassandra and >>> Spark, called Stratio Deep, announced at the first Spark Summit. You can >>> check Stratio Deep out here: https://github.com/Stratio/stratio-deep (open, >>> Apache 2 license). >>> >>> *Deep is a thin integration layer between Apache Spark and several NoSQL >>> datastores. We currently support Apache Cassandra and MongoDB, but in the >>> near future we will add support for several other datastores.* >>> >>> DataStax announced its own driver for Spark at the last Spark >>> Summit, but we have been working on our solution for almost a year. >>> >>> Furthermore, we are working to extend this solution to >>> work with other databases as well... MongoDB integration is complete right >>> now and Elasticsearch support will be ready in a few weeks. >>> >>> And that is not all; we have also developed an integration between >>> Cassandra and Lucene for indexing data (open source, Apache 2). >>> >>> *Stratio Cassandra is a fork of Apache Cassandra >>> <http://cassandra.apache.org/> where the index functionality has been extended >>> to provide near-real-time search, like Elasticsearch or Solr, >>> including full-text search >>> <http://en.wikipedia.org/wiki/Full_text_search> capabilities and free >>> multivariable search. It is achieved through an Apache Lucene >>> <http://lucene.apache.org/> based implementation of Cassandra secondary >>> indexes, where each node of the cluster indexes its own data.* >>> >>> We will publish some benchmarks in two weeks, so I will share our >>> results here if you are interested. >>> >>> If you are more interested in distributed file systems, you should take >>> a look at Tachyon: http://tachyon-project.org/index.html >>> >>> *3. Spark - Hive compatibility* >>> >>> Spark supports anything that implements the Hadoop InputFormat interface. >>> >>> *4. 
Performance* >>> >>> We are working a lot with Cassandra and MongoDB, and the performance is >>> quite nice. We are finishing some benchmarks right now comparing Hadoop + >>> HDFS vs Spark + HDFS vs Spark + Cassandra (using Stratio Deep and even our >>> fork of Cassandra). >>> >>> Let me share these results with you when they are ready, OK? >>> >>> Regards. >>> >>> 2014-08-22 7:53 GMT+02:00 Niranda Perera <nira...@wso2.com>: >>> >>>> Hi Srinath, >>>> Yes, I am working on deploying it on a multi-node cluster with the DEBS >>>> dataset. I will keep architecture@ posted on the progress. >>>> >>>> Hi David, >>>> Thank you very much for the detailed insight you've provided. >>>> A few quick questions: >>>> 1. Do you have experience in using storage handlers in Spark? >>>> 2. Would a storage handler used in Hive be directly compatible with >>>> Spark? >>>> 3. How do you grade the performance of Spark with other databases such >>>> as Cassandra, HBase, H2, etc.? >>>> >>>> Thank you very much again for your interest. I look forward to hearing >>>> from you. >>>> >>>> Regards >>>> >>>> On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera <srin...@wso2.com> >>>> wrote: >>>>> Niranda, we need to test Spark in multi-node mode before making a >>>>> decision. Spark is very fast, I think there is no doubt about that. We need >>>>> to make sure it is stable. >>>>> >>>>> David, thanks for the detailed email! How big (in nodes) is the Spark setup >>>>> you guys are running? 
>>>>> --Srinath >>>>> >>>>> On Thu, Aug 21, 2014 at 1:34 PM, David Morales <dmora...@stratio.com> >>>>> wrote: >>>>>> Sorry for jumping into this thread, but I think I can help clarify a few >>>>>> things (we attended the last Spark Summit, where we were also speakers, and >>>>>> we work very closely with Spark). >>>>>> >>>>>> *> Hive/Shark and other benchmarks* >>>>>> >>>>>> You can find a nice comparison and benchmark here: >>>>>> https://amplab.cs.berkeley.edu/benchmark/ >>>>>> >>>>>> *> Shark and SparkSQL* >>>>>> >>>>>> SparkSQL is the natural replacement for Shark, but SparkSQL is still >>>>>> young at this moment. If you are looking for Hive compatibility, you have >>>>>> to run SparkSQL with a specific context. >>>>>> >>>>>> Quoted from the Spark website: >>>>>> >>>>>> *> Note that Spark SQL currently uses a very basic SQL parser. Users >>>>>> that want a more complete dialect of SQL should look at the HiveQL support >>>>>> provided by HiveContext.* >>>>>> >>>>>> So, just note that SparkSQL is a work in progress. If you want SparkSQL >>>>>> you run a SQLContext; if you want Hive, you will use a different context... >>>>>> >>>>>> *> Spark - Hadoop: the future* >>>>>> >>>>>> Most Hadoop distributions now include Spark (Cloudera, Hortonworks, >>>>>> MapR...) and are contributing to migrating the whole Hadoop ecosystem to Spark. >>>>>> >>>>>> Spark is quite a bit more than Map/Reduce... as you can read here: >>>>>> http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/ >>>>>> >>>>>> *> Spark Streaming / Spark SQL* >>>>>> >>>>>> Spark Streaming is built on Spark and provides stream processing through >>>>>> an abstraction called DStreams (a collection of RDDs in a window of time). 
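[Editor's note: to make the DStream model described above concrete — a stream chopped into micro-batches, with windowed operations covering the last few batches — here is a toy, plain-Python sketch. `ToyDStream` and its methods are illustrative names, not Spark's actual API; each list stands in for one RDD.]

```python
from collections import deque

class ToyDStream:
    """Toy model of a Spark Streaming DStream: the stream is chopped into
    micro-batches, and a sliding window keeps only the last N batches."""

    def __init__(self, window_batches):
        # each deque element stands in for one RDD (one micro-batch)
        self.window = deque(maxlen=window_batches)

    def push_batch(self, batch):
        # a new micro-batch arrives; the oldest one slides out automatically
        self.window.append(list(batch))

    def windowed_elements(self):
        # a windowed operation sees the union of the batches in the window
        return [x for batch in self.window for x in batch]

stream = ToyDStream(window_batches=3)
for batch in ([1, 2], [3], [4, 5], [6]):
    stream.push_batch(batch)

# the oldest batch [1, 2] has slid out of the 3-batch window
print(stream.windowed_elements())  # -> [3, 4, 5, 6]
```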
>>>>>> There are some efforts to make SparkSQL compatible with Spark >>>>>> Streaming (something similar to Trident for Storm), as you can see here: >>>>>> >>>>>> *StreamSQL (https://github.com/thunderain-project/StreamSQL) is a POC >>>>>> project based on Spark that combines the power of Catalyst and Spark >>>>>> Streaming, to offer the ability to run SQL on top of DStreams. It keeps >>>>>> the same semantics as SparkSQL and offers a SchemaDStream on top of >>>>>> DStream, so you don't need to do tricky things like extracting the RDD to >>>>>> register it as a table. Other parts are the same as Spark.* >>>>>> >>>>>> So, you can apply SQL to a data stream, but it is very basic at the >>>>>> moment... you can expect a bunch of improvements in this area in the coming >>>>>> months (I guess that SparkSQL will work on Spark Streaming streams before >>>>>> the end of this year). >>>>>> >>>>>> *> Spark Streaming / Spark SQL and CEP* >>>>>> >>>>>> There is no relationship at this moment between (your absolutely >>>>>> amazing) Siddhi CEP and Spark. As far as I know, you are working on >>>>>> distributed CEP with Storm and Siddhi. >>>>>> >>>>>> We are currently working on an interactive CEP built with Kafka + Spark >>>>>> Streaming + Siddhi, with features such as an API, an interactive shell, >>>>>> built-in statistics and auditing, and built-in functions (save2cassandra, >>>>>> save2mongo, save2elasticsearch...). >>>>>> >>>>>> If you are interested, we can talk about this project; I think it would >>>>>> be a nice idea! >>>>>> >>>>>> Anyway, I don't think that SparkSQL will evolve into something like a >>>>>> CEP. Patterns and sequences, for example, would be very complex to do with >>>>>> Spark Streaming (at least for now). >>>>>> >>>>>> Thanks. 
>>>>>> 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan <s...@wso2.com>: >>>>>>> >>>>>>> On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <nira...@wso2.com> >>>>>>> wrote: >>>>>>>> @Maninda, >>>>>>>> >>>>>>>> +1 for suggesting Spark SQL. >>>>>>>> >>>>>>>> Quoting Databricks: >>>>>>>> "Spark SQL provides state-of-the-art SQL performance and maintains >>>>>>>> compatibility with Shark/Hive. In particular, like Shark, Spark SQL >>>>>>>> supports all existing Hive data formats, user-defined functions (UDF), and >>>>>>>> the Hive metastore." [1] >>>>>>>> >>>>>>>> But I am not entirely sure whether Spark SQL and Siddhi are comparable, >>>>>>>> because SparkSQL (like Hive) is designed for batch processing, whereas >>>>>>>> Siddhi does real-time processing. But if there are implementations where >>>>>>>> Siddhi is run on top of Spark, it would be very interesting. >>>>>>> >>>>>>> Yes, Siddhi's current mode of operation does not support this, but with >>>>>>> partitions we can achieve this to some extent. >>>>>>> >>>>>>> Suho >>>>>>>> >>>>>>>> Spark supports either Hadoop 1 or 2, but I think we should evaluate >>>>>>>> which is best: MR1 or YARN+MR2. >>>>>>>> >>>>>>>> [image: Hadoop Architecture] >>>>>>>> [2] >>>>>>>> >>>>>>>> [1] >>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html >>>>>>>> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html >>>>>>>> >>>>>>>> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando < >>>>>>>> lasan...@wso2.com> wrote: >>>>>>>>> Hi Maninda, >>>>>>>>> >>>>>>>>> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com> >>>>>>>>> wrote: >>>>>>>>>> In the case of discontinuation of the Shark project, IMO we should not >>>>>>>>>> move to Shark at all. 
>>>>>>>>>> And it seems better to go with Spark SQL as we are already using >>>>>>>>>> Spark for CEP. But I am not sure of the difference between Spark SQL >>>>>>>>>> and Siddhi queries on the Spark engine. >>>>>>>>> >>>>>>>>> Currently, we are doing the integration with CEP using Apache Storm, >>>>>>>>> not Spark... :-). Spark Streaming is a possible candidate for integrating >>>>>>>>> with CEP, but we have opted for Storm. I think there has been some >>>>>>>>> independent work on integrating Kafka + Spark Streaming + Siddhi. Please >>>>>>>>> refer to the thread on arch@ "[Architecture] A few questions about >>>>>>>>> WSO2 CEP/Siddhi". >>>>>>>>> >>>>>>>>>> And we have to figure out how Spark SQL can be used for historical >>>>>>>>>> data, and whether it can execute incremental processing by default, which >>>>>>>>>> would cover all our existing BAM use cases. >>>>>>>>>> On the other hand, Hadoop 2 [1] uses a completely different platform >>>>>>>>>> for resource allocation, known as YARN. Sometimes this may be more >>>>>>>>>> suitable for batch jobs. >>>>>>>>>> >>>>>>>>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Lasantha >>>>>>>>>> >>>>>>>>>> *Maninda Edirisooriya* >>>>>>>>>> Senior Software Engineer >>>>>>>>>> *WSO2, Inc. *lean.enterprise.middleware. >>>>>>>>>> *Blog* : http://maninda.blogspot.com/ >>>>>>>>>> *E-mail* : mani...@wso2.com >>>>>>>>>> *Skype* : @manindae >>>>>>>>>> *Twitter* : @maninda >>>>>>>>>> >>>>>>>>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera < >>>>>>>>>> nira...@wso2.com> wrote: >>>>>>>>>>> Hi Anjana and Srinath, >>>>>>>>>>> >>>>>>>>>>> After the discussion I had with Anjana, I researched more on the >>>>>>>>>>> continuation of the Shark project by Databricks. 
>>>>>>>>>>> Here's what I found out: >>>>>>>>>>> - Shark was built on the Hive codebase and achieved performance >>>>>>>>>>> improvements by swapping out the physical execution engine part of Hive. >>>>>>>>>>> While this approach enabled Shark users to speed up their Hive queries, >>>>>>>>>>> Shark inherited a large, complicated code base from Hive that made it hard >>>>>>>>>>> to optimize and maintain. Hence, Databricks has announced that they are >>>>>>>>>>> halting the development of Shark from July 2014 (Shark 0.9 will be the >>>>>>>>>>> last release). [1] >>>>>>>>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS >>>>>>>>>>> performance >>>>>>>>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html> >>>>>>>>>>> by almost an order of magnitude. It also supports all existing Hive data >>>>>>>>>>> formats, user-defined functions (UDF), and the Hive metastore. [2] >>>>>>>>>>> - The Shark to Spark SQL migration plan is here: >>>>>>>>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf >>>>>>>>>>> - For legacy Hive and MapReduce users, they have proposed a new >>>>>>>>>>> 'Hive on Spark' project [3], [4]. But, given the performance enhancement, >>>>>>>>>>> it is quite certain that Hive and MR will be replaced by engines built on >>>>>>>>>>> top of Spark (e.g. Spark SQL). >>>>>>>>>>> >>>>>>>>>>> In my opinion there are a few matters to figure out if we are >>>>>>>>>>> migrating from Hive: >>>>>>>>>>> 1. Are we changing the query engine only? (Then we can replace Hive >>>>>>>>>>> with Shark.) >>>>>>>>>>> 2. Are we changing the existing Hadoop/MapReduce framework to Spark? 
(Then we can replace Hive and Hadoop with Spark and Spark SQL.) >>>>>>>>>>> >>>>>>>>>>> In my opinion, considering the long-term impact and the availability >>>>>>>>>>> of support, it is best to migrate from Hive/Hadoop to Spark. It is open >>>>>>>>>>> for discussion! >>>>>>>>>>> >>>>>>>>>>> In the meantime, I've already tried Spark SQL, and Databricks' claims >>>>>>>>>>> of improved performance seem to be true. I will work more on this. >>>>>>>>>>> >>>>>>>>>>> Cheers >>>>>>>>>>> >>>>>>>>>>> [1] >>>>>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html >>>>>>>>>>> [2] >>>>>>>>>>> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html >>>>>>>>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292 >>>>>>>>>>> [4] >>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark >>>>>>>>>>> >>>>>>>>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando < >>>>>>>>>>> anj...@wso2.com> wrote: >>>>>>>>>>>> Hi Srinath, >>>>>>>>>>>> >>>>>>>>>>>> No, this has not been tested on multiple nodes. I told Niranda here >>>>>>>>>>>> in my last mail to test a cluster with the same set of hardware that we >>>>>>>>>>>> are using to test our large data set with Hive. As for the effort to make >>>>>>>>>>>> the change, we still have to figure out the MT aspects of Shark here. >>>>>>>>>>>> Sinthuja was working on making the latest Hive version MT-ready, and most >>>>>>>>>>>> probably we can make the same changes to the Hive version Shark is using. >>>>>>>>>>>> So after we do that, the integration should be seamless. 
>>>>>>>>>>>> And also, as I mentioned earlier here, we are also going to test >>>>>>>>>>>> this with the APIM Hive script, to check if there are any unforeseen >>>>>>>>>>>> incompatibilities. >>>>>>>>>>>> >>>>>>>>>>>> Cheers, >>>>>>>>>>>> Anjana. >>>>>>>>>>>> >>>>>>>>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera < >>>>>>>>>>>> srin...@wso2.com> wrote: >>>>>>>>>>>>> This looks great. >>>>>>>>>>>>> >>>>>>>>>>>>> We need to test Spark with multiple nodes. Did we do that? Please >>>>>>>>>>>>> create a few VMs in the performance cloud (talk to Lakmal) and test with >>>>>>>>>>>>> at least 5 nodes. We need to make sure it works OK in a distributed setup >>>>>>>>>>>>> as well. >>>>>>>>>>>>> >>>>>>>>>>>>> What does it take to change to Spark? Anjana... how much work is it? >>>>>>>>>>>>> >>>>>>>>>>>>> --Srinath >>>>>>>>>>>>> >>>>>>>>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera < >>>>>>>>>>>>> nira...@wso2.com> wrote: >>>>>>>>>>>>>> Thank you, Anjana. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Yes, I am working on it. >>>>>>>>>>>>>> >>>>>>>>>>>>>> In the meantime, I found this in the Hive documentation [1]. It >>>>>>>>>>>>>> talks about Hive on Spark, and compares Hive, Shark and Spark SQL at a >>>>>>>>>>>>>> higher architectural level. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Additionally, it is said that the in-memory performance of Shark >>>>>>>>>>>>>> can be improved by introducing Tachyon [2]. I guess we can consider this >>>>>>>>>>>>>> later on. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Cheers. 
>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL >>>>>>>>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando < >>>>>>>>>>>>>> anj...@wso2.com> wrote: >>>>>>>>>>>>>>> Hi Niranda, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of insight >>>>>>>>>>>>>>> into how both operate in different scenarios. As the next step, we will >>>>>>>>>>>>>>> need to run this on an actual cluster of computers. Since you've used a >>>>>>>>>>>>>>> subset of the dataset from the 2014 DEBS challenge, we should use the full >>>>>>>>>>>>>>> data set in a clustered environment and check this. Gokul is already >>>>>>>>>>>>>>> working on the Hive-based setup for this; after that is done, you can >>>>>>>>>>>>>>> create a Shark cluster on the same hardware and run the tests there, to >>>>>>>>>>>>>>> get a clear comparison of how these two match up in a cluster. Until the >>>>>>>>>>>>>>> setup is ready, do continue with your next steps on checking the RDD >>>>>>>>>>>>>>> support and Spark SQL use. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> After these are done, we should also do a trial run of our own >>>>>>>>>>>>>>> APIM Hive scripts, migrated to Shark. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>> Anjana. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera < >>>>>>>>>>>>>>> nira...@wso2.com> wrote: >>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I have been evaluating the performance of >>>>>>>>>>>>>>>> Shark (a distributed SQL query engine for Hadoop) against Hive. 
>>>>>>>>>>>>>>>> This is with the objective of assessing the possibility of >>>>>>>>>>>>>>>> moving the WSO2 BAM data processing (which currently uses Hive) to Shark >>>>>>>>>>>>>>>> (and Apache Spark) for improved performance. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I am sharing my findings herewith. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> *AMPLab Shark* >>>>>>>>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster than >>>>>>>>>>>>>>>> Hive without any modification to the existing data or queries. It supports >>>>>>>>>>>>>>>> Hive's QL, metastore, serialization formats, and user-defined functions, >>>>>>>>>>>>>>>> providing seamless integration with existing Hive deployments and a >>>>>>>>>>>>>>>> familiar, more powerful option for new ones. [1] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> *Apache Spark* >>>>>>>>>>>>>>>> Apache Spark is an open-source data analytics cluster computing >>>>>>>>>>>>>>>> framework. It fits into the Hadoop open-source community, building on top >>>>>>>>>>>>>>>> of HDFS, and promises performance up to 100 times faster than Hadoop >>>>>>>>>>>>>>>> MapReduce for certain applications. [2] >>>>>>>>>>>>>>>> Official documentation: [3] >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I carried out the comparison between the following Hive and >>>>>>>>>>>>>>>> Shark releases, with input files ranging from 100 to 1 billion entries. 
>>>>>>>>>>>>>>>> QL engine: Apache Hive 0.11 vs. Shark 0.9.1 (the latest >>>>>>>>>>>>>>>> release), which uses Scala 2.10.3, Spark 0.9.1, and AMPLab's Hive 0.9.0. >>>>>>>>>>>>>>>> Framework: Hadoop 1.0.4 vs. Spark 0.9.1. >>>>>>>>>>>>>>>> File system: HDFS in both cases. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Attached herewith is a report which describes the performance >>>>>>>>>>>>>>>> comparison between Shark and Hive in detail. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> hive_vs_shark >>>>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web> >>>>>>>>>>>>>>>> hive_vs_shark_report.odt >>>>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> In summary, the following conclusions can be drawn from the >>>>>>>>>>>>>>>> evaluation: >>>>>>>>>>>>>>>> - Shark performs on par with Hive in DDL operations (CREATE, >>>>>>>>>>>>>>>> DROP .. TABLE, DATABASE). Both engines show fairly constant performance as >>>>>>>>>>>>>>>> the input size increases. >>>>>>>>>>>>>>>> - Shark performs on par with Hive in plain DML operations (LOAD, >>>>>>>>>>>>>>>> INSERT), but when a DML operation is combined with a data retrieval >>>>>>>>>>>>>>>> operation (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark >>>>>>>>>>>>>>>> significantly outperforms Hive, with a performance factor of 10x+ (ranging >>>>>>>>>>>>>>>> from 10x to 80x in some instances). 
The Shark performance factor decreases as >>>>>>>>>>>>>>>> the input size increases, while Hive performance stays fairly constant. >>>>>>>>>>>>>>>> - Shark clearly outperforms Hive in data retrieval operations >>>>>>>>>>>>>>>> (FILTER, ORDER BY, JOIN). Hive performance is fairly constant across the >>>>>>>>>>>>>>>> data retrieval operations, while Shark performance drops as the input size >>>>>>>>>>>>>>>> increases. But in every instance Shark outperformed Hive, with a minimum >>>>>>>>>>>>>>>> performance factor of 5x+ (ranging from 5x to 80x in some instances). >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all the >>>>>>>>>>>>>>>> information about the queries and timings graphically. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> The code repository can be found at >>>>>>>>>>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Moving forward, I am currently working on the following: >>>>>>>>>>>>>>>> - Apache Spark's resilient distributed dataset (RDD) >>>>>>>>>>>>>>>> abstraction (a collection of elements partitioned across the nodes of the >>>>>>>>>>>>>>>> cluster that can be operated on in parallel): the use of RDDs and their >>>>>>>>>>>>>>>> impact on performance. >>>>>>>>>>>>>>>> - Spark SQL: the use of Spark SQL over Shark on the Spark >>>>>>>>>>>>>>>> framework. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> [1] https://github.com/amplab/shark/wiki >>>>>>>>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark >>>>>>>>>>>>>>>> [3] http://spark.apache.org/docs/latest/ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Would love to have your feedback on this. 
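[Editor's note: the RDD idea mentioned above — a dataset partitioned across nodes, with each partition operated on in parallel and the results combined — can be sketched with a toy, plain-Python model. This is not Spark's API; `partition` and `map_reduce` are illustrative names, and threads stand in for cluster nodes.]

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def partition(data, n):
    """Split data into n roughly equal chunks, the way an RDD is
    partitioned across the nodes of a cluster."""
    data = list(data)
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]

def map_reduce(data, map_fn, reduce_fn, n_partitions=4):
    parts = partition(data, n_partitions)
    # each partition is mapped and locally reduced in parallel,
    # independently of the others
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(
            lambda p: reduce(reduce_fn, (map_fn(x) for x in p)), parts))
    # a final reduce combines the per-partition results
    return reduce(reduce_fn, partials)

# sum of squares of 1..100
total = map_reduce(range(1, 101), lambda x: x * x, lambda a, b: a + b)
print(total)  # -> 338350
```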
>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Best regards >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> *Niranda Perera* >>>>>>>>>>>>>>>> Software Engineer, WSO2 Inc. >>>>>>>>>>>>>>>> Mobile: +94-71-554-8430 >>>>>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> *Anjana Fernando* >>>>>>>>>>>>>>> Senior Technical Lead >>>>>>>>>>>>>>> WSO2 Inc. | http://wso2.com >>>>>>>>>>>>>>> lean . enterprise . middleware >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> *Niranda Perera* >>>>>>>>>>>>>> Software Engineer, WSO2 Inc. >>>>>>>>>>>>>> Mobile: +94-71-554-8430 >>>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> ============================ >>>>>>>>>>>>> Srinath Perera, Ph.D. >>>>>>>>>>>>> http://people.apache.org/~hemapani/ >>>>>>>>>>>>> http://srinathsview.blogspot.com/ >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> *Anjana Fernando* >>>>>>>>>>>> Senior Technical Lead >>>>>>>>>>>> WSO2 Inc. | http://wso2.com >>>>>>>>>>>> lean . enterprise . middleware >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> *Niranda Perera* >>>>>>>>>>> Software Engineer, WSO2 Inc. >>>>>>>>>>> Mobile: +94-71-554-8430 >>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> Architecture mailing list >>>>>>>>>> Architecture@wso2.org >>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> *Lasantha Fernando* >>>>>>>>> Software Engineer - Data Technologies Team >>>>>>>>> WSO2 Inc. 
http://wso2.com >>>>>>>>> >>>>>>>>> email: lasan...@wso2.com >>>>>>>>> mobile: (+94) 71 5247551 >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> *Niranda Perera* >>>>>>>> Software Engineer, WSO2 Inc. >>>>>>>> Mobile: +94-71-554-8430 >>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> Architecture mailing list >>>>>>>> Architecture@wso2.org >>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> >>>>>>> *S. Suhothayan* >>>>>>> Technical Lead & Team Lead of WSO2 Complex Event Processor >>>>>>> *WSO2 Inc. *http://wso2.com >>>>>>> * <http://wso2.com/>* >>>>>>> lean . enterprise . middleware >>>>>>> >>>>>>> >>>>>>> >>>>>>> *cell: (+94) 779 756 757 <%28%2B94%29%20779%20756%20757> | blog: >>>>>>> http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/> >>>>>>> twitter: >>>>>>> http://twitter.com/suhothayan <http://twitter.com/suhothayan> | >>>>>>> linked-in: >>>>>>> http://lk.linkedin.com/in/suhothayan >>>>>>> <http://lk.linkedin.com/in/suhothayan>* >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Architecture mailing list >>>>>>> Architecture@wso2.org >>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Architecture mailing list >>>>>> Architecture@wso2.org >>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> ============================ >>>>> Srinath Perera, Ph.D. >>>>> http://people.apache.org/~hemapani/ >>>>> http://srinathsview.blogspot.com/ >>>>> >>>>> _______________________________________________ >>>>> Architecture mailing list >>>>> Architecture@wso2.org >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>> >>>> >>>> -- >>>> *Niranda Perera* >>>> Software Engineer, WSO2 Inc. 
>>>> Mobile: +94-71-554-8430 >>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>> >>> >>> >> >> >> -- >> *Niranda Perera* >> Software Engineer, WSO2 Inc. >> Mobile: +94-71-554-8430 >> Twitter: @n1r44 <https://twitter.com/N1R44> >> > > -- *Niranda Perera* Software Engineer, WSO2 Inc. Mobile: +94-71-554-8430 Twitter: @n1r44 <https://twitter.com/N1R44>
_______________________________________________ Architecture mailing list Architecture@wso2.org https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture