Hi David,

Sorry to revive this thread, but may I know whether you have done any
benchmarking of the DataStax Spark Cassandra connector against the Stratio
Deep Spark-Cassandra integration? I would love to take a look at the results.

I recently checked the deep-spark GitHub repo and noticed that there has
been no activity since Oct 29th. May I know what your future plans are for
this particular project?

Cheers

On Tue, Aug 26, 2014 at 9:12 PM, David Morales <dmora...@stratio.com> wrote:

> Yes, it is already included in our benchmarks.
>
> It could be a nice idea to share our findings; let me discuss it here.
> Meanwhile, you can ask us any questions via my email or this thread; we
> are glad to help.
>
>
> Best regards.
>
>
> 2014-08-24 15:49 GMT+02:00 Niranda Perera <nira...@wso2.com>:
>
>> Hi David,
>>
>> Thank you for your detailed reply.
>>
>> It was great to hear about Stratio Deep and I must say, it looks very
>> interesting. Storage handlers for databases such as Cassandra, MongoDB, etc.
>> would be very helpful. We will definitely look into Stratio Deep.
>>
>> I came across the DataStax Spark-Cassandra connector (
>> https://github.com/datastax/spark-cassandra-connector ). Have you done
>> any comparison between your implementation and DataStax's connector?
>>
>> And, yes, please do share the performance results with us once it's ready.
>>
>> On a different note, is there any way for us to interact with the Stratio
>> dev community, in the form of dev mailing lists etc., so that we could
>> mutually share our findings?
>>
>> Best regards
>>
>>
>>
>> On Fri, Aug 22, 2014 at 2:07 PM, David Morales <dmora...@stratio.com>
>> wrote:
>>
>>> Hi there,
>>>
>>> *1. About the size of deployments.*
>>>
>>> It depends on your use case... especially when you combine Spark with a
>>> datastore. We usually deploy Spark with Cassandra or MongoDB instead of
>>> using HDFS, for example.
>>>
>>> Spark will be faster if you put the data in memory, so if you need a lot
>>> of speed (interactive queries, for example), you should have enough memory.
>>>
>>>
>>> *2. About storage handlers.*
>>>
>>> We have developed the first tight integration between Cassandra and
>>> Spark, called Stratio Deep, announced at the first Spark Summit. You can
>>> check Stratio Deep out here: https://github.com/Stratio/stratio-deep (open
>>> source, Apache 2 license).
>>>
>>> *Deep is a thin integration layer between Apache Spark and several NoSQL
>>> datastores. We currently support Apache Cassandra and MongoDB, but in the
>>> near future we will add support for several other datastores.*
>>>
>>> DataStax announced its own driver for Spark at the last Spark Summit,
>>> but we have been working on our solution for almost a year.
>>>
>>> Furthermore, we are working to extend this solution to work with other
>>> databases as well... MongoDB integration is complete right now, and
>>> Elasticsearch support will be ready in a few weeks.
>>>
>>> And that is not all: we have also developed an integration between
>>> Cassandra and Lucene for indexing data (open source, Apache 2).
>>>
>>> *Stratio Cassandra is a fork of Apache Cassandra
>>> <http://cassandra.apache.org/> where the index functionality has been
>>> extended to provide near-real-time search, as in Elasticsearch or Solr,
>>> including full text search
>>> <http://en.wikipedia.org/wiki/Full_text_search> capabilities and free
>>> multivariable search. It is achieved through an Apache Lucene
>>> <http://lucene.apache.org/> based implementation of Cassandra secondary
>>> indexes, where each node of the cluster indexes its own data.*
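[Editor's note] The per-node indexing model described above can be illustrated with a toy scatter-gather sketch in plain Python. All names here are hypothetical and no Cassandra or Lucene code is involved: each node keeps a local inverted index over only its own rows, and a full-text query fans out to every node and merges the local hits.

```python
from collections import defaultdict

class Node:
    """One cluster node holding a local inverted index over its own rows."""
    def __init__(self):
        self.rows = {}                 # row_id -> text
        self.index = defaultdict(set)  # term -> {row_id}

    def insert(self, row_id, text):
        self.rows[row_id] = text
        for term in text.lower().split():
            self.index[term].add(row_id)

    def search(self, term):
        """Search only this node's local index, as a Lucene-backed 2i would."""
        return {rid: self.rows[rid] for rid in self.index.get(term.lower(), ())}

def cluster_search(nodes, term):
    """Fan the query out to every node and merge the per-node hits."""
    hits = {}
    for node in nodes:
        hits.update(node.search(term))
    return hits

nodes = [Node() for _ in range(3)]
# Rows are partitioned across nodes (here: round-robin by id).
for i, text in enumerate(["spark cassandra integration",
                          "full text search",
                          "cassandra secondary indexes"]):
    nodes[i % 3].insert(i, text)

print(sorted(cluster_search(nodes, "cassandra")))  # [0, 2]
```

The point of the sketch is only the shape of the architecture: indexing stays local to each node's data, and distribution happens at query time by merging per-node results.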
>>>
>>>
>>> We will publish some benchmarks in two weeks, so I will share our
>>> results here if you are interested.
>>>
>>>
>>> If you are more interested in distributed file systems, you should take
>>> a look at Tachyon: http://tachyon-project.org/index.html
>>>
>>>
>>> *3. Spark - Hive compatibility*
>>>
>>> Spark will support anything with the Hadoop InputFormat interface.
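[Editor's note] To make the InputFormat point concrete, here is a toy plain-Python sketch of the contract that Hadoop's InputFormat expresses (this is not the real Java API; the class and method names are made up for illustration): an input format partitions the data into splits, and each split can be read independently as (key, value) records. Any store that can present itself this way is readable by MapReduce or Spark.

```python
class LineInputFormat:
    """Toy analogue of a Hadoop InputFormat over in-memory text."""
    def __init__(self, text, split_size=2):
        self.lines = text.splitlines()
        self.split_size = split_size

    def get_splits(self):
        """Partition the input into independently readable chunks."""
        for start in range(0, len(self.lines), self.split_size):
            yield self.lines[start:start + self.split_size], start

    def record_reader(self, split):
        """Yield (key, value) records from one split (key = line number)."""
        lines, offset = split
        for i, line in enumerate(lines):
            yield offset + i, line

fmt = LineInputFormat("a\nb\nc\nd\ne")
records = [rec for split in fmt.get_splits()
               for rec in fmt.record_reader(split)]
print(records)  # [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]
```

Splits are what let the engine schedule parallel tasks; the record reader is what hides the storage format from the computation.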
>>>
>>>
>>> *4. Performance*
>>>
>>> We are working a lot with Cassandra and MongoDB, and the performance is
>>> quite nice. We are finishing right now some benchmarks comparing Hadoop +
>>> HDFS vs Spark + HDFS vs Spark + Cassandra (using Stratio Deep and even our
>>> fork of Cassandra).
>>>
>>> Let me share these results with you when they are ready, OK?
>>>
>>> Regards.
>>>
>>> 2014-08-22 7:53 GMT+02:00 Niranda Perera <nira...@wso2.com>:
>>>
>>> Hi Srinath,
>>>> Yes, I am working on deploying it on a multi-node cluster with the debs
>>>> dataset. I will keep architecture@ posted on the progress.
>>>>
>>>>
>>>> Hi David,
>>>> Thank you very much for the detailed insight you've provided.
>>>> A few quick questions:
>>>> 1. Do you have experience in using storage handlers with Spark?
>>>> 2. Would a storage handler used in Hive be directly compatible with
>>>> Spark?
>>>> 3. How would you grade the performance of Spark with other databases
>>>> such as Cassandra, HBase, H2, etc.?
>>>>
>>>> Thank you very much again for your interest. I look forward to hearing
>>>> from you.
>>>>
>>>> Regards
>>>>
>>>>
>>>> On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera <srin...@wso2.com>
>>>> wrote:
>>>>
>>>>> Niranda, we need to test Spark in multi-node mode before making a
>>>>> decision. Spark is very fast, I think there is no doubt about that. We
>>>>> need to make sure it is stable.
>>>>>
>>>>> David, thanks for a detailed email! How big (nodes) is the Spark setup
>>>>> you guys are running?
>>>>>
>>>>> --Srinath
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 21, 2014 at 1:34 PM, David Morales <dmora...@stratio.com>
>>>>> wrote:
>>>>>
>>>>>> Sorry for intruding on this thread, but I think I can help clarify a
>>>>>> few things (we attended the last Spark Summit, where we were also
>>>>>> speakers, and we work very closely with Spark).
>>>>>>
>>>>>> *> Hive/Shark and other benchmarks*
>>>>>>
>>>>>> You can find a nice comparison and benchmark on this website:
>>>>>> https://amplab.cs.berkeley.edu/benchmark/
>>>>>>
>>>>>>
>>>>>> *> Shark and SparkSQL*
>>>>>>
>>>>>> Spark SQL is the natural replacement for Shark, but Spark SQL is still
>>>>>> young at this moment. If you are looking for Hive compatibility, you
>>>>>> have to execute Spark SQL with a specific context.
>>>>>>
>>>>>> Quoted from spark website:
>>>>>>
>>>>>> *> Note that Spark SQL currently uses a very basic SQL parser. Users
>>>>>> that want a more complete dialect of SQL should look at the HiveQL
>>>>>> support provided by HiveContext.*
>>>>>>
>>>>>> So, just note that Spark SQL is a work in progress. If you want the
>>>>>> basic Spark SQL parser you run a SQLContext; if you want Hive
>>>>>> compatibility, you use a different context, HiveContext...
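[Editor's note] The two-context point can be illustrated with a toy plain-Python sketch (hypothetical names, not the pyspark API; the real Spark 1.0 entry points were SQLContext and HiveContext): a basic context accepts only a minimal SQL form, while a Hive-flavoured context extends it with some HiveQL-only constructs, so you pick the context by the dialect you need.

```python
import re

class BasicSQLContext:
    """Understands only a minimal SELECT ... FROM ... form."""
    GRAMMAR = re.compile(r"^SELECT\s+[\w*,\s]+\s+FROM\s+\w+$", re.IGNORECASE)

    def can_parse(self, query):
        return bool(self.GRAMMAR.match(query.strip()))

class HiveLikeContext(BasicSQLContext):
    """Accepts the basic form plus some HiveQL-only constructs."""
    EXTRAS = ("LATERAL VIEW", "DISTRIBUTE BY", "CLUSTER BY")

    def can_parse(self, query):
        q = query.strip().upper()
        if any(kw in q for kw in self.EXTRAS):
            return True
        return super().can_parse(query)

basic, hiveish = BasicSQLContext(), HiveLikeContext()
simple = "SELECT name FROM users"
hiveql = "SELECT name FROM users DISTRIBUTE BY name"

print(basic.can_parse(simple))    # True
print(basic.can_parse(hiveql))    # False: needs the Hive-flavoured context
print(hiveish.can_parse(hiveql))  # True
```

The real choice in Spark worked the same way: the context you instantiated determined which SQL dialect your queries could use.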
>>>>>>
>>>>>>
>>>>>> *> Spark - Hadoop: the future*
>>>>>>
>>>>>> Most Hadoop distributions are including Spark: Cloudera, Hortonworks,
>>>>>> MapR... and they are contributing to migrating the whole Hadoop
>>>>>> ecosystem to Spark.
>>>>>>
>>>>>> Spark is a bit more than MapReduce... as you can read here:
>>>>>> http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/
>>>>>>
>>>>>>
>>>>>> *> Spark Streaming / Spark SQL*
>>>>>>
>>>>>> Spark Streaming is built on Spark, and it provides stream processing
>>>>>> through an abstraction called DStreams (a collection of RDDs over a
>>>>>> window of time).
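[Editor's note] The DStream idea can be sketched in plain Python (no Spark involved, names are illustrative): a stream is modelled as a sequence of micro-batches, and a windowed operation works over the last N batches at a time, just as a DStream groups RDDs by window.

```python
def micro_batches(events, batch_size):
    """Chop an ordered event stream into fixed-size micro-batches."""
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def windowed_counts(batches, window_len):
    """For each new batch, count events across the last window_len batches."""
    window, out = [], []
    for batch in batches:
        window.append(batch)
        if len(window) > window_len:
            window.pop(0)  # slide the window forward, dropping the oldest batch
        out.append(sum(len(b) for b in window))
    return out

events = list(range(10))                       # 10 events arriving in order
batches = list(micro_batches(events, 2))       # 5 micro-batches of 2 events
print(windowed_counts(batches, window_len=3))  # [2, 4, 6, 6, 6]
```

This is the essence of the micro-batch model: each batch is processed with the same (batch) machinery, and "streaming" operations are just computations over a sliding window of recent batches.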
>>>>>>
>>>>>> There are some efforts to make Spark SQL compatible with Spark
>>>>>> Streaming (something similar to Trident for Storm), as you can see here:
>>>>>>
>>>>>> *StreamSQL (https://github.com/thunderain-project/StreamSQL
>>>>>> <https://github.com/thunderain-project/StreamSQL>) is a POC project based
>>>>>> on Spark to combine the power of Catalyst and Spark Streaming, to offer
>>>>>> people the ability to manipulate SQL on top of DStream as you wanted, 
>>>>>> this
>>>>>> keep the same semantics with SparkSQL as offer a SchemaDStream on top of
>>>>>> DStream. You don't need to do tricky thing like extracting rdd to 
>>>>>> register
>>>>>> as a table. Besides other parts are the same as Spark.*
>>>>>>
>>>>>> So, you can apply SQL to a data stream, but it is very simple at the
>>>>>> moment... you can expect a bunch of improvements in this area in the
>>>>>> coming months (I guess that Spark SQL will work on Spark Streaming
>>>>>> streams before the end of this year).
>>>>>>
>>>>>>
>>>>>>
>>>>>> *> Spark Streaming / Spark SQL and CEP*
>>>>>>
>>>>>> There is no relationship at this moment between (your absolutely
>>>>>> amazing) Siddhi CEP and Spark. As far as I know, you are working on
>>>>>> distributed CEP with Storm and Siddhi.
>>>>>>
>>>>>> We are currently working on an interactive CEP built with Kafka +
>>>>>> Spark Streaming + Siddhi, with features such as an API, an interactive
>>>>>> shell, built-in statistics and auditing, and built-in functions
>>>>>> (save2cassandra, save2mongo, save2elasticsearch...).
>>>>>>
>>>>>> If you are interested we can talk about this project; I think it would
>>>>>> be a nice idea!
>>>>>>
>>>>>>
>>>>>> Anyway, I don't think that Spark SQL will evolve into something like
>>>>>> a CEP. Patterns and sequences, for example, would be very complex to do
>>>>>> with Spark Streaming (at least for now).
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan <s...@wso2.com>:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <nira...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> @Maninda,
>>>>>>>>
>>>>>>>> +1 for suggesting Spark SQL.
>>>>>>>>
>>>>>>>> Quote Databricks,
>>>>>>>> "Spark SQL provides state-of-the-art SQL performance and maintains
>>>>>>>> compatibility with Shark/Hive. In particular, like Shark, Spark SQL
>>>>>>>> supports all existing Hive data formats, user-defined functions (UDF), 
>>>>>>>> and
>>>>>>>> the Hive metastore." [1]
>>>>>>>>
>>>>>>>> But I am not entirely sure whether Spark SQL and Siddhi are
>>>>>>>> comparable, because Spark SQL (like Hive) is designed for batch
>>>>>>>> processing, whereas Siddhi does real-time processing. But if there are
>>>>>>>> implementations where Siddhi runs on top of Spark, it would be very
>>>>>>>> interesting.
>>>>>>>>
>>>>>>> Yes, Siddhi's current way of operation does not support this. But
>>>>>>> with partitions we can achieve this to some extent.
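[Editor's note] The partition idea can be sketched in plain Python (hypothetical names, not the Siddhi API): events are routed by a partition key so that each key gets its own independent processing state, which is what makes it possible to spread the work across workers.

```python
from collections import defaultdict

class PartitionedProcessor:
    """Route events by key so each partition processes independently."""
    def __init__(self, key_fn, make_state):
        self.key_fn = key_fn
        self.partitions = defaultdict(make_state)

    def process(self, event):
        # Each partition could live on a different worker;
        # state never crosses partition keys.
        state = self.partitions[self.key_fn(event)]
        state["count"] += 1
        state["total"] += event["value"]
        return state["total"] / state["count"]  # running average per key

proc = PartitionedProcessor(key_fn=lambda e: e["symbol"],
                            make_state=lambda: {"count": 0, "total": 0.0})
for e in [{"symbol": "IBM", "value": 10.0},
          {"symbol": "WSO2", "value": 4.0},
          {"symbol": "IBM", "value": 20.0}]:
    avg = proc.process(e)
print(proc.partitions["IBM"])  # {'count': 2, 'total': 30.0}
```

Because no query state is shared across keys, each partition can run on its own node, which is the extent to which a per-key query distributes naturally.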
>>>>>>>
>>>>>>> Suho
>>>>>>>
>>>>>>>>
>>>>>>>> Spark supports either Hadoop 1 or 2. But I think we should evaluate
>>>>>>>> which is best: MR1 or YARN+MR2.
>>>>>>>>
>>>>>>>> [image: Hadoop Architecture]
>>>>>>>> [2]
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>>>> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <
>>>>>>>> lasan...@wso2.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Maninda,
>>>>>>>>>
>>>>>>>>> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Given the discontinuation of the Shark project, IMO we should not
>>>>>>>>>> move to Shark at all.
>>>>>>>>>> It seems better to go with Spark SQL, as we are already using
>>>>>>>>>> Spark for CEP. But I am not sure of the difference between Spark SQL
>>>>>>>>>> and the Siddhi queries on the Spark engine.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Currently, we are doing the integration with CEP using Apache
>>>>>>>>> Storm, not Spark... :-). Spark Streaming is a possible candidate for
>>>>>>>>> integrating with CEP, but we have opted for Storm. I think there has
>>>>>>>>> been some independent work on integrating Kafka + Spark Streaming +
>>>>>>>>> Siddhi. Please refer to the thread on arch@, "[Architecture] A few
>>>>>>>>> questions about WSO2 CEP/Siddhi".
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> And we have to figure out how Spark SQL can be used for historical
>>>>>>>>>> data, and whether it can execute incremental processing by default,
>>>>>>>>>> which would cover all our existing BAM use cases.
>>>>>>>>>> On the other hand, Hadoop 2 [1] uses a completely different
>>>>>>>>>> platform for resource allocation, known as YARN. Sometimes this may
>>>>>>>>>> be more suitable for batch jobs.
>>>>>>>>>>
>>>>>>>>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Lasantha
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *Maninda Edirisooriya*
>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>
>>>>>>>>>> *WSO2, Inc. *lean.enterprise.middleware.
>>>>>>>>>>
>>>>>>>>>> *Blog* : http://maninda.blogspot.com/
>>>>>>>>>> *E-mail* : mani...@wso2.com
>>>>>>>>>> *Skype* : @manindae
>>>>>>>>>> *Twitter* : @maninda
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <
>>>>>>>>>> nira...@wso2.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Anjana and Srinath,
>>>>>>>>>>>
>>>>>>>>>>> After the discussion I had with Anjana, I researched more on the
>>>>>>>>>>> continuation of the Shark project by Databricks.
>>>>>>>>>>>
>>>>>>>>>>> Here's what I found out,
>>>>>>>>>>> - Shark was built on the Hive codebase and achieved performance
>>>>>>>>>>> improvements by swapping out the physical execution engine part of 
>>>>>>>>>>> Hive.
>>>>>>>>>>> While this approach enabled Shark users to speed up their Hive 
>>>>>>>>>>> queries,
>>>>>>>>>>> Shark inherited a large, complicated code base from Hive that made 
>>>>>>>>>>> it hard
>>>>>>>>>>> to optimize and maintain.
>>>>>>>>>>> Hence, Databricks has announced that they are halting the
>>>>>>>>>>> development of Shark as of July 2014 (Shark 0.9 will be the last
>>>>>>>>>>> release). [1]
>>>>>>>>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS
>>>>>>>>>>> performance
>>>>>>>>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html>
>>>>>>>>>>> by almost an order of magnitude. It also supports all existing Hive 
>>>>>>>>>>> data
>>>>>>>>>>> formats, user-defined functions (UDF), and the Hive metastore.  [2]
>>>>>>>>>>> - The following is the Shark to Spark SQL migration plan:
>>>>>>>>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>>>>>>>>>>
>>>>>>>>>>> - For legacy Hive and MapReduce users, they have proposed a
>>>>>>>>>>> new 'Hive on Spark' project [3], [4].
>>>>>>>>>>> But, given the performance enhancement, it is quite certain that
>>>>>>>>>>> Hive and MR will be replaced by engines built on top of Spark (ex:
>>>>>>>>>>> Spark SQL).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In my opinion, there are a few matters to figure out if we are
>>>>>>>>>>> migrating away from Hive:
>>>>>>>>>>>
>>>>>>>>>>> 1. Are we changing the query engine only? (Then we can replace
>>>>>>>>>>> Hive with Shark.)
>>>>>>>>>>> 2. Are we changing the existing Hadoop/MapReduce framework to
>>>>>>>>>>> Spark? (Then we can replace Hive and Hadoop with Spark and Spark
>>>>>>>>>>> SQL.)
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> In my opinion, considering the long-term impact and the
>>>>>>>>>>> availability of support, it is best to migrate Hive/Hadoop to
>>>>>>>>>>> Spark. It is open for discussion!
>>>>>>>>>>> It is open for discussion!
>>>>>>>>>>>
>>>>>>>>>>> In the meantime, I've already tried Spark SQL, and Databricks'
>>>>>>>>>>> claims of improved performance seem to be true. I will work more on
>>>>>>>>>>> this.
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>>>>>>> [2]
>>>>>>>>>>> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>>>>>>>>>> [4]
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <
>>>>>>>>>>> anj...@wso2.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Srinath,
>>>>>>>>>>>>
>>>>>>>>>>>> No, this has not been tested on multiple nodes. I told Niranda
>>>>>>>>>>>> in my last mail to test a cluster on the same hardware that we are
>>>>>>>>>>>> using to test our large data set with Hive. As for the effort
>>>>>>>>>>>> needed to make the change, we still have to figure out the MT
>>>>>>>>>>>> aspects of Shark. Sinthuja was working on making the latest Hive
>>>>>>>>>>>> version MT ready, and most probably we can make the same changes
>>>>>>>>>>>> to the Hive version Shark is using. So after we do that, the
>>>>>>>>>>>> integration should be seamless. And also, as I mentioned earlier,
>>>>>>>>>>>> we are going to test this with the APIM Hive script, to check
>>>>>>>>>>>> whether there are any unforeseen incompatibilities.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Anjana.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <
>>>>>>>>>>>> srin...@wso2.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> This looks great.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We need to test Spark with multiple nodes; did we do that?
>>>>>>>>>>>>> Please create a few VMs in the performance cloud (talk to Lakmal)
>>>>>>>>>>>>> and test with at least 5 nodes. We need to make sure it works OK
>>>>>>>>>>>>> in a distributed setup as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> What does it take to change to Spark? Anjana... how much work is
>>>>>>>>>>>>> it?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --Srinath
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <
>>>>>>>>>>>>> nira...@wso2.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you Anjana.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, I am working on it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In the meantime, I found this in the Hive documentation [1]. It
>>>>>>>>>>>>>> talks about Hive on Spark, and compares Hive, Shark and Spark
>>>>>>>>>>>>>> SQL at a higher architectural level.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Additionally, it is said that the in-memory performance of Shark
>>>>>>>>>>>>>> can be improved by introducing Tachyon [2]. I guess we can
>>>>>>>>>>>>>> consider this later on.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cheers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>>>>>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <
>>>>>>>>>>>>>> anj...@wso2.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  Hi Niranda,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of
>>>>>>>>>>>>>>> insight into how both operate in different scenarios. As the
>>>>>>>>>>>>>>> next step, we will need to run this on an actual cluster of
>>>>>>>>>>>>>>> computers. Since you've used a subset of the 2014 DEBS
>>>>>>>>>>>>>>> challenge dataset, we should use the full data set in a
>>>>>>>>>>>>>>> clustered environment and check this. Gokul is already working
>>>>>>>>>>>>>>> on the Hive based setup for this; after that is done, you can
>>>>>>>>>>>>>>> create a Shark cluster on the same hardware and run the tests
>>>>>>>>>>>>>>> there, to get a clear comparison of how these two match up in a
>>>>>>>>>>>>>>> cluster. Until the setup is ready, do continue with your next
>>>>>>>>>>>>>>> steps on checking the RDD support and Spark SQL use.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> After these are done, we should also do a trial run of our
>>>>>>>>>>>>>>> own APIM Hive scripts, migrated to Shark.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>> Anjana.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <
>>>>>>>>>>>>>>> nira...@wso2.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have been evaluating the performance of Shark (a
>>>>>>>>>>>>>>>> distributed SQL query engine for Hadoop) against Hive. This is
>>>>>>>>>>>>>>>> with the objective of assessing the possibility of moving the
>>>>>>>>>>>>>>>> WSO2 BAM data processing (which currently uses Hive) to Shark
>>>>>>>>>>>>>>>> (and Apache Spark) for improved performance.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I am sharing my findings herewith.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  *AMP Lab Shark*
>>>>>>>>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster
>>>>>>>>>>>>>>>> than Hive without any modification to the existing data or 
>>>>>>>>>>>>>>>> queries. It
>>>>>>>>>>>>>>>> supports Hive's QL, metastore, serialization formats, and 
>>>>>>>>>>>>>>>> user-defined
>>>>>>>>>>>>>>>> functions, providing seamless integration with existing Hive 
>>>>>>>>>>>>>>>> deployments
>>>>>>>>>>>>>>>> and a familiar, more powerful option for new ones. [1]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Apache Spark*
>>>>>>>>>>>>>>>> Apache Spark is an open-source data analytics cluster
>>>>>>>>>>>>>>>> computing framework. It fits into the Hadoop open-source
>>>>>>>>>>>>>>>> community, building on top of HDFS, and promises performance
>>>>>>>>>>>>>>>> up to 100 times faster than Hadoop MapReduce for certain
>>>>>>>>>>>>>>>> applications. [2]
>>>>>>>>>>>>>>>> Official documentation: [3]
>>>>>>>>>>>>>>>> Official documentation: [3]
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I carried out the comparison between the following Hive and
>>>>>>>>>>>>>>>> Shark releases, with input files ranging from 100 to 1 billion
>>>>>>>>>>>>>>>> entries:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>               Hive setup         Shark setup
>>>>>>>>>>>>>>>> QL engine     Apache Hive 0.11   Shark 0.9.1 (latest release),
>>>>>>>>>>>>>>>>                                  which uses Scala 2.10.3,
>>>>>>>>>>>>>>>>                                  Spark 0.9.1 and AMPLab's
>>>>>>>>>>>>>>>>                                  Hive 0.9.0
>>>>>>>>>>>>>>>> Framework     Hadoop 1.0.4       Spark 0.9.1
>>>>>>>>>>>>>>>> File system   HDFS               HDFS
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Attached herewith is a report which describes in detail the
>>>>>>>>>>>>>>>> performance comparison between Shark and Hive.
>>>>>>>>>>>>>>>> hive_vs_shark
>>>>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>>>>>>>>>> hive_vs_shark_report.odt
>>>>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In summary, the following conclusions can be drawn from the
>>>>>>>>>>>>>>>> evaluation:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - Shark is on par with Hive in DDL operations (CREATE,
>>>>>>>>>>>>>>>>    DROP .. TABLE, DATABASE). Both engines show fairly constant
>>>>>>>>>>>>>>>>    performance as the input size increases.
>>>>>>>>>>>>>>>>    - Shark is on par with Hive in plain DML operations (LOAD,
>>>>>>>>>>>>>>>>    INSERT), but when a DML operation is combined with a data
>>>>>>>>>>>>>>>>    retrieval operation (ex. INSERT <TBL> SELECT <PROP> FROM
>>>>>>>>>>>>>>>>    <TBL>), Shark significantly outperforms Hive, with a
>>>>>>>>>>>>>>>>    speedup factor of 10x+ (ranging from 10x to 80x in some
>>>>>>>>>>>>>>>>    instances). Shark's speedup factor decreases as the input
>>>>>>>>>>>>>>>>    size increases, while Hive's performance stays fairly
>>>>>>>>>>>>>>>>    constant.
>>>>>>>>>>>>>>>>    - Shark clearly outperforms Hive in data retrieval
>>>>>>>>>>>>>>>>    operations (FILTER, ORDER BY, JOIN). Hive's performance
>>>>>>>>>>>>>>>>    stays fairly constant in the data retrieval operations,
>>>>>>>>>>>>>>>>    while Shark's speedup decreases as the input size
>>>>>>>>>>>>>>>>    increases. But in every instance Shark outperformed Hive,
>>>>>>>>>>>>>>>>    with a minimum speedup factor of 5x+ (ranging from 5x to
>>>>>>>>>>>>>>>>    80x in some instances).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all
>>>>>>>>>>>>>>>> the queries and timings graphically.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The code repository can also be found at
>>>>>>>>>>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Moving forward, I am currently working on the following:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    - Apache Spark's resilient distributed dataset (RDD)
>>>>>>>>>>>>>>>>    abstraction (a collection of elements partitioned across
>>>>>>>>>>>>>>>>    the nodes of the cluster that can be operated on in
>>>>>>>>>>>>>>>>    parallel): the use of RDDs and their impact on performance.
>>>>>>>>>>>>>>>>    - Spark SQL: using Spark SQL instead of Shark on the Spark
>>>>>>>>>>>>>>>>    framework.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>>>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Would love to have your feedback on this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Best regards
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>  *Niranda Perera*
>>>>>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> *Anjana Fernando*
>>>>>>>>>>>>>>> Senior Technical Lead
>>>>>>>>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>>>>  Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> ============================
>>>>>>>>>>>>> Srinath Perera, Ph.D.
>>>>>>>>>>>>>    http://people.apache.org/~hemapani/
>>>>>>>>>>>>>    http://srinathsview.blogspot.com/
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> *Anjana Fernando*
>>>>>>>>>>>> Senior Technical Lead
>>>>>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> Architecture mailing list
>>>>>>>>>> Architecture@wso2.org
>>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Lasantha Fernando*
>>>>>>>>> Software Engineer - Data Technologies Team
>>>>>>>>> WSO2 Inc. http://wso2.com
>>>>>>>>>
>>>>>>>>> email: lasan...@wso2.com
>>>>>>>>> mobile: (+94) 71 5247551
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Niranda Perera*
>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>  Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> *S. Suhothayan*
>>>>>>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>>>>>>>  *WSO2 Inc. *http://wso2.com
>>>>>>> * <http://wso2.com/>*
>>>>>>> lean . enterprise . middleware
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *cell: (+94) 779 756 757 <%28%2B94%29%20779%20756%20757> | blog:
>>>>>>> http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/> 
>>>>>>> twitter:
>>>>>>> http://twitter.com/suhothayan <http://twitter.com/suhothayan> | 
>>>>>>> linked-in:
>>>>>>> http://lk.linkedin.com/in/suhothayan 
>>>>>>> <http://lk.linkedin.com/in/suhothayan>*
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ============================
>>>>> Srinath Perera, Ph.D.
>>>>>    http://people.apache.org/~hemapani/
>>>>>    http://srinathsview.blogspot.com/
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Niranda Perera*
>>>> Software Engineer, WSO2 Inc.
>>>> Mobile: +94-71-554-8430
>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>
>>>
>>>
>>
>>
>> --
>> *Niranda Perera*
>> Software Engineer, WSO2 Inc.
>> Mobile: +94-71-554-8430
>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>
>
>


-- 
*Niranda Perera*
Software Engineer, WSO2 Inc.
Mobile: +94-71-554-8430
Twitter: @n1r44 <https://twitter.com/N1R44>