Hi there,

*1. About the size of deployments.*

It depends on your use case, especially when you combine Spark with a
datastore. We usually deploy Spark with Cassandra or MongoDB instead of
using HDFS, for example.

Spark is faster when the data fits in memory, so if you need a lot of
speed (interactive queries, for example), you should have enough memory.


*2. About storage handlers.*

We have developed the first tight integration between Cassandra and Spark,
called Stratio Deep, announced at the first Spark Summit. You can check
Stratio Deep out here: https://github.com/Stratio/stratio-deep (open
source, Apache 2 license).

*Deep is a thin integration layer between Apache Spark and several NoSQL
datastores. We currently support Apache Cassandra and MongoDB, but in the
near future we will add support for several other datastores.*

DataStax announced its own Spark driver at the last Spark Summit, but we
have been working on our solution for almost a year.

Furthermore, we are extending this solution to work with other databases
as well... MongoDB integration is complete right now, and Elasticsearch
support will be ready in a few weeks.

And that is not all: we have also developed an integration between
Cassandra and Lucene for indexing data (open source, Apache 2 license).

*Stratio Cassandra is a fork of Apache Cassandra
<http://cassandra.apache.org/> in which the index functionality has been
extended to provide near-real-time search, as in Elasticsearch or Solr,
including full-text search
<http://en.wikipedia.org/wiki/Full_text_search> capabilities and free
multivariable search. This is achieved through an Apache Lucene
<http://lucene.apache.org/>-based implementation of Cassandra secondary
indexes, where each node of the cluster indexes its own data.*
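The "each node of the cluster indexes its own data" scheme can be sketched in plain Python (a toy model with hypothetical names, not Stratio's actual code): every node keeps a Lucene-style inverted index over its local rows only, and a search scatters the query to all nodes and merges the partial hits.

```python
from collections import defaultdict

class NodeIndex:
    """Toy model of one cluster node: it indexes only its own data."""
    def __init__(self):
        self.rows = {}                 # row_id -> stored text
        self.index = defaultdict(set)  # term -> row_ids (local inverted index)

    def insert(self, row_id, text):
        self.rows[row_id] = text
        for term in text.lower().split():
            self.index[term].add(row_id)

    def search(self, term):
        # A node answers the query from its local index only.
        return {(rid, self.rows[rid]) for rid in self.index.get(term.lower(), ())}

def cluster_search(nodes, term):
    """Scatter the query to every node and merge the partial result sets."""
    hits = set()
    for node in nodes:
        hits |= node.search(term)
    return hits

nodes = [NodeIndex(), NodeIndex(), NodeIndex()]
nodes[0].insert(1, "full text search in Cassandra")
nodes[1].insert(2, "secondary indexes with Lucene")
nodes[2].insert(3, "near real time search")

print(sorted(rid for rid, _ in cluster_search(nodes, "search")))  # -> [1, 3]
```

The point of the design is that indexing stays local (no cross-node coordination on writes); only queries fan out.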


We will publish some benchmarks in two weeks, and I will share our results
here if you are interested.


If you are more interested in distributed file systems, you should take a
look at Tachyon: http://tachyon-project.org/index.html


*3. Spark - Hive compatibility*

Spark can read from anything that implements the Hadoop InputFormat
interface.
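As a rough mental model (toy Python with hypothetical names, not the real org.apache.hadoop classes): an InputFormat only has to enumerate splits and yield (key, value) records from each split, which is why Spark can sit on top of so many different stores.

```python
class ToyLineInputFormat:
    """Toy stand-in for a Hadoop InputFormat: splits + per-split records."""
    def __init__(self, lines, num_splits):
        self.lines = lines
        self.num_splits = num_splits

    def get_splits(self):
        # Partition the data into independently readable splits.
        return [self.lines[i::self.num_splits] for i in range(self.num_splits)]

    def read_split(self, split):
        # Plays the role of a RecordReader: yields (key, value) pairs.
        for offset, line in enumerate(split):
            yield offset, line

def word_count(fmt):
    """An engine like Spark would process each split in parallel;
    here the same computation is simulated serially."""
    counts = {}
    for split in fmt.get_splits():
        for _, line in fmt.read_split(split):
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
    return counts

fmt = ToyLineInputFormat(["a b", "b c", "c d", "d e"], num_splits=2)
print(word_count(fmt))  # -> {'a': 1, 'b': 2, 'c': 2, 'd': 2, 'e': 1}
```

Any datastore that can express its data through this split/record contract becomes a valid Spark (or Hive) input source.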


*4. Performance*

We are working a lot with Cassandra and MongoDB, and the performance is
quite good. We are right now finishing some benchmarks comparing Hadoop +
HDFS vs. Spark + HDFS vs. Spark + Cassandra (using Stratio Deep and even
our fork of Cassandra).

Please let me share these results with you when they are ready, OK?




Regards.
2014-08-22 7:53 GMT+02:00 Niranda Perera <nira...@wso2.com>:

> Hi Srinath,
> Yes, I am working on deploying it on a multi-node cluster with the debs
> dataset. I will keep architecture@ posted on the progress.
>
>
> Hi David,
> Thank you very much for the detailed insight you've provided.
> A few quick questions:
> 1. Do you have experiences in using storage handlers in Spark?
> 2. Would a storage handler used in Hive, be directly compatible with
> Spark?
> 3. How would you rate the performance of Spark with other databases such as
> Cassandra, HBase, H2, etc.?
>
> Thank you very much again for your interest. Look forward to hearing from
> you.
>
> Regards
>
>
> On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera <srin...@wso2.com> wrote:
>
>> Niranda, we need to test Spark in multi-node mode before making a decision.
>> Spark is very fast, I think there is no doubt about that. We need to make
>> sure it is stable.
>>
>> David, thanks for a detailed email! How big (nodes) is the Spark setup
>> you guys are running?
>>
>> --Srinath
>>
>>
>>
>> On Thu, Aug 21, 2014 at 1:34 PM, David Morales <dmora...@stratio.com>
>> wrote:
>>
>>> Sorry for interrupting this thread, but I think I can help clarify a
>>> few things (we attended the last Spark Summit, where we were also
>>> speakers, and we work very closely with Spark).
>>>
>>> *> Hive/Shark and others benchmark*
>>>
>>> You can find a nice comparison and benchmark on this site:
>>> https://amplab.cs.berkeley.edu/benchmark/
>>>
>>>
>>> *> Shark and SparkSQL*
>>>
>>> SparkSQL is the natural replacement for Shark, but SparkSQL is still
>>> young at this moment. If you are looking for Hive compatibility, you have
>>> to run SparkSQL with a specific context.
>>>
>>> Quoted from spark website:
>>>
>>> *> Note that Spark SQL currently uses a very basic SQL parser. Users
>>> that want a more complete dialect of SQL should look at the HiveSQL support
>>> provided by HiveContext.*
>>>
>>> So, just note that SparkSQL is a work in progress. If you want plain
>>> SparkSQL you run a SQLContext; if you want Hive compatibility, you use a
>>> different context (the HiveContext)...
>>>
>>>
>>> *> Spark - Hadoop: the future*
>>>
>>> Most Hadoop distributions are including Spark (Cloudera, Hortonworks,
>>> MapR...) and contributing to migrating the whole Hadoop ecosystem to Spark.
>>>
>>> Spark is quite a bit more than MapReduce, as you can read here:
>>> http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/
>>>
>>>
>>> *> Spark Streaming / Spark SQL*
>>>
>>> Spark Streaming is built on Spark and provides stream processing
>>> through an abstraction called DStreams (a sequence of RDDs within a
>>> window of time).
>>>
>>> There are some efforts to make SparkSQL compatible with Spark
>>> Streaming (something similar to Trident for Storm), as you can see here:
>>>
>>> *StreamSQL (https://github.com/thunderain-project/StreamSQL
>>> <https://github.com/thunderain-project/StreamSQL>) is a POC project based
>>> on Spark that combines the power of Catalyst and Spark Streaming to offer
>>> people the ability to run SQL on top of DStreams. It keeps the same
>>> semantics as SparkSQL by offering a SchemaDStream on top of DStream, so you
>>> don't need to do tricky things like extracting RDDs to register them as
>>> tables. The other parts are the same as Spark.*
>>>
>>> So, you can apply SQL to a data stream, but it is very basic at the
>>> moment... you can expect a bunch of improvements in this area in the next
>>> months (I guess that SparkSQL will work on Spark Streaming streams before
>>> the end of this year).
>>>
>>>
>>>
>>> *> Spark Streaming / Spark SQL and CEP*
>>>
>>> There is no relationship at this moment between (your absolutely
>>> amazing) Siddhi CEP and Spark. As far as I know, you are working on
>>> distributed CEP with Storm and Siddhi.
>>>
>>> We are currently working on an interactive CEP built with Kafka +
>>> Spark Streaming + Siddhi, with features such as an API, an interactive
>>> shell, built-in statistics and auditing, and built-in functions
>>> (save2cassandra, save2mongo, save2elasticsearch...).
>>>
>>> If you are interested, we can talk about this project; I think it
>>> would be a nice idea!
>>>
>>>
>>> Anyway, I don't think that SparkSQL will evolve into something like a
>>> CEP. Patterns and sequences, for example, would be very complex to do with
>>> Spark Streaming (at least for now).
>>>
>>>
>>>
>>> Thanks.
>>>
>>> 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan <s...@wso2.com>:
>>>
>>>
>>>>
>>>>
>>>> On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <nira...@wso2.com>
>>>> wrote:
>>>>
>>>>> @Maninda,
>>>>>
>>>>> +1 for suggesting Spark SQL.
>>>>>
>>>>> Quote Databricks,
>>>>> "Spark SQL provides state-of-the-art SQL performance and maintains
>>>>> compatibility with Shark/Hive. In particular, like Shark, Spark SQL
>>>>> supports all existing Hive data formats, user-defined functions (UDF), and
>>>>> the Hive metastore." [1]
>>>>>
>>>>> But I am not entirely sure if Spark SQL and Siddhi is comparable,
>>>>> because SparkSQL (like Hive) is designed for batch processing, where as
>>>>> Siddhi is real-time processing. But if there are implementations where
>>>>> Siddhi is run on top of Spark, it would be very interesting.
>>>>>
>>>> Yes, Siddhi's current mode of operation does not support this, but with
>>>> partitions we can achieve this to some extent.
>>>>
>>>> Suho
>>>>
>>>>>
>>>>> Spark supports either Hadoop 1 or 2. But I think we should see what is
>>>>> best: MR1 or YARN+MR2.
>>>>>
>>>>> [image: Hadoop Architecture]
>>>>> [2]
>>>>>
>>>>> [1]
>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
>>>>>
>>>>>
>>>>> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <lasan...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Maninda,
>>>>>>
>>>>>> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Given the discontinuation of the Shark project, IMO we should not
>>>>>>> move to Shark at all.
>>>>>>> And it seems better to go with Spark SQL, as we are already using
>>>>>>> Spark for CEP. But I am not sure of the difference between Spark SQL and
>>>>>>> Siddhi queries on the Spark engine.
>>>>>>>
>>>>>>
>>>>>> Currently, we are doing the integration with CEP using Apache Storm, not
>>>>>> Spark... :-). Spark Streaming is a possible candidate for integrating with
>>>>>> CEP, but we have opted for Storm. I think there has been some independent
>>>>>> work on integrating Kafka + Spark Streaming + Siddhi. Please refer to the
>>>>>> thread on arch@ "[Architecture] A few questions about WSO2 CEP
>>>>>> /Siddhi"
>>>>>>
>>>>>>
>>>>>>> And we have to figure out how Spark SQL is used for historical data, and
>>>>>>> whether it can do incremental processing by default, which would cover
>>>>>>> all our existing BAM use cases.
>>>>>>> On the other hand, Hadoop 2 [1] uses a completely
>>>>>>> different platform for resource allocation, known as YARN. Sometimes this
>>>>>>> may be more suitable for batch jobs.
>>>>>>>
>>>>>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>>>>>>>
>>>>>>>
>>>>>> Thanks,
>>>>>> Lasantha
>>>>>>
>>>>>>>
>>>>>>> *Maninda Edirisooriya*
>>>>>>> Senior Software Engineer
>>>>>>>
>>>>>>> *WSO2, Inc. *lean.enterprise.middleware.
>>>>>>>
>>>>>>> *Blog* : http://maninda.blogspot.com/
>>>>>>> *E-mail* : mani...@wso2.com
>>>>>>> *Skype* : @manindae
>>>>>>> *Twitter* : @maninda
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Anjana and Srinath,
>>>>>>>>
>>>>>>>> After the discussion I had with Anjana, I researched more on the
>>>>>>>> continuation of Shark project by Databricks.
>>>>>>>>
>>>>>>>> Here's what I found out,
>>>>>>>> - Shark was built on the Hive codebase and achieved performance
>>>>>>>> improvements by swapping out the physical execution engine part of 
>>>>>>>> Hive.
>>>>>>>> While this approach enabled Shark users to speed up their Hive queries,
>>>>>>>> Shark inherited a large, complicated code base from Hive that made it 
>>>>>>>> hard
>>>>>>>> to optimize and maintain.
>>>>>>>> Hence, Databricks has announced that they are halting the
>>>>>>>> development of Shark from July, 2014. (Shark 0.9 would be the last 
>>>>>>>> release)
>>>>>>>> [1]
>>>>>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS
>>>>>>>> performance
>>>>>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html>
>>>>>>>> by almost an order of magnitude. It also supports all existing Hive 
>>>>>>>> data
>>>>>>>> formats, user-defined functions (UDF), and the Hive metastore.  [2]
>>>>>>>> - Following is the Shark, Spark SQL migration plan
>>>>>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>>>>>>>
>>>>>>>> - For the legacy Hive and MapReduce users, they have proposed a new
>>>>>>>> 'Hive on Spark Project' [3], [4]
>>>>>>>> But, given the performance enhancement, it is quite certain that
>>>>>>>> Hive and MR will be replaced by engines built on top of Spark (e.g. Spark
>>>>>>>> SQL)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In my opinion there are a few matters to figure out if we are
>>>>>>>> migrating from Hive,
>>>>>>>>
>>>>>>>> 1. whether we are changing the query engine only? (Then, we can
>>>>>>>> replace Hive by Shark)
>>>>>>>> 2. whether we are changing the existing Hadoop/ MapReduce framework
>>>>>>>> to Spark? (Then we can replace Hive and Hadoop with Spark and Spark 
>>>>>>>> SQL)
>>>>>>>>
>>>>>>>>
>>>>>>>> In my opinion, considering the longterm impact and the availability
>>>>>>>> of support, it is best to migrate the Hive/Hadoop to Spark.
>>>>>>>> It is open for discussion!
>>>>>>>>
>>>>>>>> In the meantime, I've already tried Spark SQL, and Databricks'
>>>>>>>> claims of improved performance seem to be true. I will work more on this.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>>>> [2]
>>>>>>>> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>>>>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>>>>>>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Srinath,
>>>>>>>>>
>>>>>>>>> No, this has not been tested in multiple nodes. I told Niranda
>>>>>>>>> here in my last mail, to test a cluster with the same set of hardware 
>>>>>>>>> we
>>>>>>>>> have, that we are using to test our large data set with Hive. As for 
>>>>>>>>> the
>>>>>>>>> effort to make the change, we still have to figure out the MT aspects 
>>>>>>>>> of
>>>>>>>>> Shark here. Sinthuja was working on making the latest Hive version MT
>>>>>>>>> ready, and most probably, we can do the same changes to the Hive 
>>>>>>>>> version
>>>>>>>>> Shark is using. So after we do that, the integration should be 
>>>>>>>>> seamless.
>>>>>>>>> And also, as I mentioned earlier here, we are also going to test this 
>>>>>>>>> with
>>>>>>>>> the APIM Hive script, to check if there are any unforeseen
>>>>>>>>> incompatibilities.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Anjana.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com
>>>>>>>>> > wrote:
>>>>>>>>>
>>>>>>>>>> This look great.
>>>>>>>>>>
>>>>>>>>>> We need to test Spark with multiple nodes. Did we do that? Please
>>>>>>>>>> create a few VMs in the performance cloud (talk to Lakmal) and test with
>>>>>>>>>> at least 5 nodes. We need to make sure it works OK in a distributed setup
>>>>>>>>>> as well.
>>>>>>>>>>
>>>>>>>>>> What does it take to change to spark? Anjana .. how much work is
>>>>>>>>>> it?
>>>>>>>>>>
>>>>>>>>>> --Srinath
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you Anjana.
>>>>>>>>>>>
>>>>>>>>>>> Yes, I am working on it.
>>>>>>>>>>>
>>>>>>>>>>> In the meantime, I found this in the Hive documentation [1]. It
>>>>>>>>>>> talks about Hive on Spark, and compares Hive, Shark and Spark SQL at a
>>>>>>>>>>> higher architectural level.
>>>>>>>>>>>
>>>>>>>>>>> Additionally, it is said that the in-memory performance of Shark
>>>>>>>>>>> can be improved by introducing Tachyon [2]. I guess we can consider 
>>>>>>>>>>> this
>>>>>>>>>>> later on.
>>>>>>>>>>>
>>>>>>>>>>> Cheers.
>>>>>>>>>>>
>>>>>>>>>>> [1]
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <
>>>>>>>>>>> anj...@wso2.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>  Hi Niranda,
>>>>>>>>>>>>
>>>>>>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of
>>>>>>>>>>>> insight into how both operate in different scenarios. As the next step, we
>>>>>>>>>>>> step, we
>>>>>>>>>>>> will need to run this in an actual cluster of computers. Since 
>>>>>>>>>>>> you've used
>>>>>>>>>>>> a subset of the dataset of 2014 DEBS challenge, we should use the 
>>>>>>>>>>>> full data
>>>>>>>>>>>> set in a clustered environment and check this. Gokul is already 
>>>>>>>>>>>> working on
>>>>>>>>>>>> the Hive based setup for this, after that is done, you can create 
>>>>>>>>>>>> a Shark
>>>>>>>>>>>> cluster in the same hardware and run the tests there, to get a 
>>>>>>>>>>>> clear
>>>>>>>>>>>> comparison on how these two match up in a cluster. Until the setup 
>>>>>>>>>>>> is
>>>>>>>>>>>> ready, do continue with your next steps on checking the RDD 
>>>>>>>>>>>> support and
>>>>>>>>>>>> Spark SQL use.
>>>>>>>>>>>>
>>>>>>>>>>>> After these are done, we should also do a trial run of our own
>>>>>>>>>>>> APIM Hive scripts, migrated to Shark.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers,
>>>>>>>>>>>> Anjana.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <
>>>>>>>>>>>> nira...@wso2.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have been evaluating the performance of Shark (distributed
>>>>>>>>>>>>> SQL query engine for Hadoop) against Hive. This is with the 
>>>>>>>>>>>>> objective of
>>>>>>>>>>>>> seeing the possibility to move the WSO2 BAM data processing (which
>>>>>>>>>>>>> currently uses Hive) to Shark (and Apache Spark) for improved 
>>>>>>>>>>>>> performance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am sharing my findings herewith.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  *AMP Lab Shark*
>>>>>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster than
>>>>>>>>>>>>> Hive without any modification to the existing data or queries. It 
>>>>>>>>>>>>> supports
>>>>>>>>>>>>> Hive's QL, metastore, serialization formats, and user-defined 
>>>>>>>>>>>>> functions,
>>>>>>>>>>>>> providing seamless integration with existing Hive deployments and 
>>>>>>>>>>>>> a
>>>>>>>>>>>>> familiar, more powerful option for new ones. [1]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Apache Spark*
>>>>>>>>>>>>> Apache Spark is an open-source data analytics
>>>>>>>>>>>>> cluster computing framework. It fits into the Hadoop open-source community,
>>>>>>>>>>>>> community,
>>>>>>>>>>>>> building on top of the HDFS and promises performance up to 100 
>>>>>>>>>>>>> times faster
>>>>>>>>>>>>> than Hadoop MapReduce for certain applications. [2]
>>>>>>>>>>>>> Official documentation: [3]
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> I carried out the comparison between the following Hive and
>>>>>>>>>>>>> Shark releases with input files ranging from 100 to 1 billion 
>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>
>>>>>>>>>>>>> QL Engine: Apache Hive 0.11 (Hive setup) vs. Shark 0.9.1 (latest
>>>>>>>>>>>>> release), which uses Scala 2.10.3, Spark 0.9.1 and AMPLab’s Hive 0.9.0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Framework: Hadoop 1.0.4 (Hive setup) vs. Spark 0.9.1 (Shark setup)
>>>>>>>>>>>>>
>>>>>>>>>>>>> File system: HDFS in both setups
>>>>>>>>>>>>>
>>>>>>>>>>>>> Attached herewith is a report which describes in detail the
>>>>>>>>>>>>> performance comparison between Shark and Hive:
>>>>>>>>>>>>> hive_vs_shark
>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>>>>>>> hive_vs_shark_report.odt
>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>>>>>>
>>>>>>>>>>>>> In summary,
>>>>>>>>>>>>>
>>>>>>>>>>>>> From the evaluation, the following conclusions can be derived:
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Shark is on par with Hive in DDL operations (CREATE,
>>>>>>>>>>>>>    DROP ... TABLE, DATABASE). Both engines show a fairly constant performance
>>>>>>>>>>>>>    as the input size increases.
>>>>>>>>>>>>>    - Shark is on par with Hive in plain DML operations (LOAD,
>>>>>>>>>>>>>    INSERT), but when a DML operation is run in conjunction with a data
>>>>>>>>>>>>>    retrieval operation (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark
>>>>>>>>>>>>>    significantly outperforms Hive, with a speedup of 10x+ (ranging
>>>>>>>>>>>>>    from 10x to 80x in some instances). Shark's speedup shrinks as the
>>>>>>>>>>>>>    input size increases, while Hive's performance stays fairly flat.
>>>>>>>>>>>>>    - Shark clearly outperforms Hive in data retrieval
>>>>>>>>>>>>>    operations (FILTER, ORDER BY, JOIN). Hive's performance is fairly flat
>>>>>>>>>>>>>    in the data retrieval operations, while Shark's performance degrades as
>>>>>>>>>>>>>    the input size increases. But in every instance Shark outperformed Hive
>>>>>>>>>>>>>    with a minimum speedup of 5x (ranging from 5x to 80x in some
>>>>>>>>>>>>>    instances).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all the
>>>>>>>>>>>>> information about the queries and timings graphically.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The code repository can also be found at:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Moving forward, I am currently working on the following.
>>>>>>>>>>>>>
>>>>>>>>>>>>>    - Apache Spark's resilient distributed dataset (RDD)
>>>>>>>>>>>>>    abstraction (which is a collection of elements partitioned 
>>>>>>>>>>>>> across the nodes
>>>>>>>>>>>>>    of the cluster that can be operated on in parallel). The use 
>>>>>>>>>>>>> of RDDs and
>>>>>>>>>>>>>    its impact to the performance.
>>>>>>>>>>>>>    - Spark SQL - Use of this Spark SQL over Shark on Spark
>>>>>>>>>>>>>    framework
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Would love to have your feedback on this.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>>  *Niranda Perera*
>>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> *Anjana Fernando*
>>>>>>>>>>>> Senior Technical Lead
>>>>>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>  Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> ============================
>>>>>>>>>> Srinath Perera, Ph.D.
>>>>>>>>>>    http://people.apache.org/~hemapani/
>>>>>>>>>>    http://srinathsview.blogspot.com/
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Anjana Fernando*
>>>>>>>>> Senior Technical Lead
>>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Niranda Perera*
>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Architecture mailing list
>>>>>>> Architecture@wso2.org
>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Lasantha Fernando*
>>>>>> Software Engineer - Data Technologies Team
>>>>>> WSO2 Inc. http://wso2.com
>>>>>>
>>>>>> email: lasan...@wso2.com
>>>>>> mobile: (+94) 71 5247551
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Niranda Perera*
>>>>> Software Engineer, WSO2 Inc.
>>>>> Mobile: +94-71-554-8430
>>>>>  Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> *S. Suhothayan*
>>>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>>>>  *WSO2 Inc. *http://wso2.com
>>>> * <http://wso2.com/>*
>>>> lean . enterprise . middleware
>>>>
>>>>
>>>>
>>>> *cell: (+94) 779 756 757 <%28%2B94%29%20779%20756%20757> | blog:
>>>> http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/> twitter:
>>>> http://twitter.com/suhothayan <http://twitter.com/suhothayan> | linked-in:
>>>> http://lk.linkedin.com/in/suhothayan 
>>>> <http://lk.linkedin.com/in/suhothayan>*
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>> --
>> ============================
>> Srinath Perera, Ph.D.
>>    http://people.apache.org/~hemapani/
>>    http://srinathsview.blogspot.com/
>>
>>
>>
>
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 <https://twitter.com/N1R44>
>
