Hi!

Please check the develop branch if you want a more realistic view
of our development activity. The last commit was about two hours ago :)

Stratio Deep is one of our core modules, so there is a core team at Stratio
fully devoted to Spark + NoSQL integration. In recent months, for
example, we have added MongoDB, Elasticsearch and Aerospike to Stratio
Deep, so you can talk to these databases from Spark just as you do with
HDFS.
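As a concrete picture of that "just like HDFS" point, here is a minimal PySpark sketch that reaches MongoDB through Spark's generic Hadoop InputFormat route. The connector class names and key/value classes below are illustrative assumptions taken from the mongo-hadoop project; this is not Stratio Deep's own API, which is more direct:

```python
# Minimal sketch: read a MongoDB collection into a Spark RDD via the generic
# Hadoop InputFormat route. The class names are assumptions borrowed from the
# mongo-hadoop connector, not Stratio Deep's own API.

MONGO_INPUT_FORMAT = "com.mongodb.hadoop.MongoInputFormat"

def mongo_conf(uri):
    """Hadoop configuration dict pointing the InputFormat at one collection."""
    return {"mongo.input.uri": uri}

def read_collection(sc, uri):
    """Return an RDD of (document id, document) pairs from the collection."""
    return sc.newAPIHadoopRDD(
        MONGO_INPUT_FORMAT,
        "org.apache.hadoop.io.Text",         # key class (document _id)
        "org.apache.hadoop.io.MapWritable",  # value class (document fields)
        conf=mongo_conf(uri),
    )
```

The result is an ordinary RDD, so the same map/filter/join pipeline you would run over HDFS files applies unchanged.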

Furthermore, we are working on more backends, such as Neo4j and Couchbase.


Regarding our benchmarks, you can check out some results at this link:
http://www.stratio.com/deep-vs-datastax/

Please keep in mind that Spark integration with a datastore can be done
in two ways: HCI or native. We are now working on improving the native
integration because it is considerably more performant. Along these lines, we
are running further tests with even more impressive results.


Here you can find a technical overview of our whole platform.


http://www.slideshare.net/Stratio/stratio-platform-overview-v41

Regards

2014-12-02 11:14 GMT+01:00 Niranda Perera <nira...@wso2.com>:

> Hi David,
>
> Sorry to re-initiate this thread, but may I know if you have done any
> benchmarking of the Datastax Spark Cassandra connector against the Stratio
> Deep Spark Cassandra integration? I would love to take a look at it.
>
> I recently checked the deep-spark GitHub repo and noticed that there has been
> no activity since Oct 29th. May I know your future plans for this
> particular project?
>
> Cheers
>
> On Tue, Aug 26, 2014 at 9:12 PM, David Morales <dmora...@stratio.com>
> wrote:
>
>> Yes, it is already included in our benchmarks.
>>
>> It could be a nice idea to share our findings; let me discuss it here.
>> Meanwhile, you can ask us any question via my email or this thread; we
>> are glad to help you.
>>
>>
>> Best regards.
>>
>>
>> 2014-08-24 15:49 GMT+02:00 Niranda Perera <nira...@wso2.com>:
>>
>>> Hi David,
>>>
>>> Thank you for your detailed reply.
>>>
>>> It was great to hear about Stratio-Deep and I must say, it looks very
>>> interesting. Storage handlers for databases such as Cassandra, MongoDB,
>>> etc. would be very helpful. We will definitely look into Stratio-Deep.
>>>
>>> I came across the Datastax Spark-Cassandra connector (
>>> https://github.com/datastax/spark-cassandra-connector ). Have you done
>>> any comparison between your implementation and Datastax's connector?
>>>
>>> And, yes, please do share the performance results with us once it's
>>> ready.
>>>
>>> On a different note, is there any way for us to interact with the Stratio
>>> dev community, in the form of dev mailing lists etc., so that we could
>>> mutually share our findings?
>>>
>>> Best regards
>>>
>>>
>>>
>>> On Fri, Aug 22, 2014 at 2:07 PM, David Morales <dmora...@stratio.com>
>>> wrote:
>>>
>>>> Hi there,
>>>>
>>>> *1. About the size of deployments.*
>>>>
>>>> It depends on your use case, especially when you combine Spark with a
>>>> datastore. We usually deploy Spark with Cassandra or MongoDB instead of
>>>> HDFS, for example.
>>>>
>>>> Spark will be faster if you put the data in memory, so if you need a
>>>> lot of speed (interactive queries, for example), you should have enough
>>>> memory.
>>>>
>>>>
>>>> *2. About storage handlers.*
>>>>
>>>> We have developed the first tight integration between Cassandra and
>>>> Spark, called Stratio Deep, announced at the first Spark Summit. You can
>>>> check Stratio Deep out here: https://github.com/Stratio/stratio-deep (open,
>>>> Apache 2 license).
>>>>
>>>> *Deep is a thin integration layer between Apache Spark and several
>>>> NoSQL datastores. We currently support Apache Cassandra and MongoDB, but in
>>>> the near future we will add support for several other datastores.*
>>>>
>>>> Datastax announced its own driver for Spark at the last Spark
>>>> Summit, but we have been working on our solution for almost a year.
>>>>
>>>> Furthermore, we are working to extend this solution so that it
>>>> also works with other databases: MongoDB integration is complete right
>>>> now, and Elasticsearch will be ready in a few weeks.
>>>>
>>>> And that is not all: we have also developed an integration between
>>>> Cassandra and Lucene for indexing data (open source, Apache 2).
>>>>
>>>> *Stratio Cassandra is a fork of Apache Cassandra
>>>> <http://cassandra.apache.org/> where the index functionality has been
>>>> extended to provide near real-time search, as in Elasticsearch or Solr,
>>>> including full text search
>>>> <http://en.wikipedia.org/wiki/Full_text_search> capabilities and free
>>>> multivariable search. This is achieved through an Apache Lucene
>>>> <http://lucene.apache.org/>-based implementation of Cassandra secondary
>>>> indexes, where each node of the cluster indexes its own data.*
>>>>
>>>>
>>>> We will publish some benchmarks in two weeks, so I will share our
>>>> results here if you are interested.
>>>>
>>>>
>>>> If you are more interested in distributed file systems, you should take
>>>> a look at Tachyon: http://tachyon-project.org/index.html
>>>>
>>>>
>>>> *3. Spark - Hive compatibility*
>>>>
>>>> Spark will support anything with the Hadoop InputFormat interface.
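As a sketch of what that one-line claim means in practice, here is the stock Hadoop TextInputFormat fed into a PySpark RDD; any custom InputFormat (Cassandra's, MongoDB's...) slots into the same call:

```python
# Sketch: anything exposing a Hadoop InputFormat can feed a Spark RDD.
# TextInputFormat (shipped with Hadoop itself) yields (offset, line) pairs.

TEXT_INPUT_FORMAT = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
KEY_CLASS = "org.apache.hadoop.io.LongWritable"  # byte offset of each line
VALUE_CLASS = "org.apache.hadoop.io.Text"        # the line itself

def read_lines(sc, path):
    """Read `path` through the new-API Hadoop InputFormat machinery."""
    pairs = sc.newAPIHadoopFile(path, TEXT_INPUT_FORMAT, KEY_CLASS, VALUE_CLASS)
    return pairs.values()  # drop the offsets, keep the lines
```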
>>>>
>>>>
>>>> *4. Performance*
>>>>
>>>> We are working a lot with Cassandra and MongoDB, and the performance is
>>>> quite good. We are right now finishing some benchmarks comparing Hadoop +
>>>> HDFS vs Spark + HDFS vs Spark + Cassandra (using Stratio Deep and even our
>>>> fork of Cassandra).
>>>>
>>>> Please let me share these results with you when they are ready, OK?
>>>>
>>>>
>>>>
>>>>
>>>> Regards.
>>>>
>>>>
>>>> 2014-08-22 7:53 GMT+02:00 Niranda Perera <nira...@wso2.com>:
>>>>
>>>>> Hi Srinath,
>>>>> Yes, I am working on deploying it on a multi-node cluster with the
>>>>> debs dataset. I will keep architecture@ posted on the progress.
>>>>>
>>>>>
>>>>> Hi David,
>>>>> Thank you very much for the detailed insight you've provided.
>>>>> Few quick questions,
>>>>> 1. Do you have experiences in using storage handlers in Spark?
>>>>> 2. Would a storage handler used in Hive, be directly compatible with
>>>>> Spark?
>>>>> 3. How would you rate the performance of Spark with other databases such
>>>>> as Cassandra, HBase, H2, etc.?
>>>>>
>>>>> Thank you very much again for your interest. Look forward to hearing
>>>>> from you.
>>>>>
>>>>> Regards
>>>>>
>>>>>
>>>>> On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera <srin...@wso2.com>
>>>>> wrote:
>>>>>
>>>>>> Niranda, we need to test Spark in multi-node mode before making a
>>>>>> decision. Spark is very fast; I think there is no doubt about that. We
>>>>>> need to make sure it is stable.
>>>>>>
>>>>>> David, thanks for a detailed email! How big (nodes) is the Spark
>>>>>> setup you guys are running?
>>>>>>
>>>>>> --Srinath
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 21, 2014 at 1:34 PM, David Morales <dmora...@stratio.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Sorry for intruding on this thread, but I think I can help by
>>>>>>> clarifying a few things (we attended the last Spark Summit, we were
>>>>>>> also speakers there, and we work very closely with Spark).
>>>>>>>
>>>>>>> *> Hive/Shark and others benchmark*
>>>>>>>
>>>>>>> You can find a nice comparison and benchmark in this web:
>>>>>>> https://amplab.cs.berkeley.edu/benchmark/
>>>>>>>
>>>>>>>
>>>>>>> *> Shark and SparkSQL*
>>>>>>>
>>>>>>> SparkSQL is the natural replacement for Shark, but SparkSQL is still
>>>>>>> young at this moment. If you are looking for Hive compatibility, you
>>>>>>> have to run SparkSQL with a specific context.
>>>>>>>
>>>>>>> Quoted from the Spark website:
>>>>>>>
>>>>>>> *> Note that Spark SQL currently uses a very basic SQL parser. Users
>>>>>>> that want a more complete dialect of SQL should look at the HiveQL
>>>>>>> support provided by HiveContext.*
>>>>>>>
>>>>>>> So, just note that SparkSQL is a work in progress. If you want
>>>>>>> SparkSQL, you run a SQLContext; if you want Hive compatibility, you
>>>>>>> use a different context (HiveContext)...
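Concretely, in the Spark 1.x Python API the two contexts are created as below (a sketch; it needs a live SparkContext to run):

```python
# Sketch of the two Spark 1.x SQL entry points being contrasted: SQLContext
# carries the basic SQL parser, HiveContext the fuller HiveQL dialect plus
# Hive metastore and Hive UDF support.

def make_contexts(sc):
    from pyspark.sql import SQLContext, HiveContext  # Spark 1.x API
    basic = SQLContext(sc)  # basic SQL parser
    hive = HiveContext(sc)  # HiveQL, Hive metastore, Hive UDFs
    return basic, hive
```

Hive-compatible queries then go through the second context's sql() method rather than the first's.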
>>>>>>>
>>>>>>>
>>>>>>> *> Spark - Hadoop: the future*
>>>>>>>
>>>>>>> Most Hadoop distributions are including Spark: Cloudera,
>>>>>>> Hortonworks, MapR... and contributing to migrating the whole Hadoop
>>>>>>> ecosystem to Spark.
>>>>>>>
>>>>>>> Spark is quite a bit more than MapReduce, as you can read here:
>>>>>>> http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/
>>>>>>>
>>>>>>>
>>>>>>> *> Spark Streaming / Spark SQL*
>>>>>>>
>>>>>>> Spark Streaming is built on Spark and provides stream
>>>>>>> processing through an abstraction called DStreams (a collection
>>>>>>> of RDDs in a window of time).
>>>>>>>
>>>>>>> There are some efforts to make SparkSQL compatible with
>>>>>>> Spark Streaming (something similar to Trident for Storm), as you can see
>>>>>>> here:
>>>>>>>
>>>>>>> *StreamSQL (https://github.com/thunderain-project/StreamSQL
>>>>>>> <https://github.com/thunderain-project/StreamSQL>) is a POC project
>>>>>>> based on Spark that combines the power of Catalyst and Spark Streaming,
>>>>>>> to offer people the ability to run SQL on top of DStreams, and it keeps
>>>>>>> the same semantics as SparkSQL by offering a SchemaDStream on top of
>>>>>>> DStream. You don't need to do tricky things like extracting an RDD to
>>>>>>> register it as a table. All other parts are the same as Spark.*
>>>>>>>
>>>>>>> So, you can apply SQL to a data stream, but it is very simple at
>>>>>>> the moment... you can expect a bunch of improvements in this area in the
>>>>>>> coming months (I guess that SparkSQL will work on Spark Streaming streams
>>>>>>> before the end of this year).
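The "tricky thing" mentioned in the StreamSQL description can be sketched in Spark 1.x PySpark terms as follows; without StreamSQL, each micro-batch RDD is pulled out by hand and registered as a temporary table (the table name and query below are illustrative):

```python
# Sketch of the manual workaround StreamSQL removes: for every micro-batch
# RDD of a DStream, register it as a temp table and run the SQL yourself.
# (Spark 1.x API; "events" and the query are illustrative.)

def sql_over_stream(dstream, sql_ctx, results):
    def per_batch(rdd):
        if not rdd.isEmpty():
            sql_ctx.inferSchema(rdd).registerTempTable("events")
            results.append(sql_ctx.sql("SELECT COUNT(*) FROM events").collect())
    dstream.foreachRDD(per_batch)
```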
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *> Spark Streaming / Spark SQL and CEP*
>>>>>>>
>>>>>>> There is no relationship at this moment between (your absolutely
>>>>>>> amazing) Siddhi CEP and Spark. As far as I know, you are working on
>>>>>>> distributed CEP with Storm and Siddhi.
>>>>>>>
>>>>>>>
>>>>>>> We are currently working on an interactive CEP built with
>>>>>>> Kafka + Spark Streaming + Siddhi, with some features such as an API, an
>>>>>>> interactive shell, built-in statistics and auditing, and built-in
>>>>>>> functions (save2cassandra, save2mongo, save2elasticsearch...).
>>>>>>>
>>>>>>> If you are interested, we can talk about this project; I think
>>>>>>> it would be a nice idea!
>>>>>>>
>>>>>>>
>>>>>>> Anyway, I don't think that SparkSQL will evolve into something like a
>>>>>>> CEP. Patterns and sequences, for example, would be very complex to do
>>>>>>> with Spark Streaming (at least for now).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>> 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan <s...@wso2.com>:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <nira...@wso2.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> @Maninda,
>>>>>>>>>
>>>>>>>>> +1 for suggesting Spark SQL.
>>>>>>>>>
>>>>>>>>> Quote Databricks,
>>>>>>>>> "Spark SQL provides state-of-the-art SQL performance and
>>>>>>>>> maintains compatibility with Shark/Hive. In particular, like Shark, 
>>>>>>>>> Spark
>>>>>>>>> SQL supports all existing Hive data formats, user-defined functions 
>>>>>>>>> (UDF),
>>>>>>>>> and the Hive metastore." [1]
>>>>>>>>>
>>>>>>>>> But I am not entirely sure if Spark SQL and Siddhi are comparable,
>>>>>>>>> because SparkSQL (like Hive) is designed for batch processing, whereas
>>>>>>>>> Siddhi is for real-time processing. But if there are implementations
>>>>>>>>> where Siddhi is run on top of Spark, it would be very interesting.
>>>>>>>>>
>>>>>>>> Yes, Siddhi's current mode of operation does not support this, but
>>>>>>>> with partitions we can achieve this to some extent.
>>>>>>>>
>>>>>>>> Suho
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Spark supports either Hadoop 1 or 2, but I think we should see
>>>>>>>>> what is best: MR1 or YARN+MR2.
>>>>>>>>>
>>>>>>>>> [image: Hadoop Architecture]
>>>>>>>>> [2]
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>>>>> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <
>>>>>>>>> lasan...@wso2.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Maninda,
>>>>>>>>>>
>>>>>>>>>> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Given the discontinuation of the Shark project, IMO we should not
>>>>>>>>>>> move to Shark at all.
>>>>>>>>>>> It seems better to go with Spark SQL, as we are already using
>>>>>>>>>>> Spark for CEP. But I am not sure of the difference between Spark SQL
>>>>>>>>>>> and Siddhi queries on the Spark engine.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Currently, we are doing the integration with CEP using Apache Storm,
>>>>>>>>>> not Spark... :-). Spark Streaming is a possible candidate for
>>>>>>>>>> integrating with CEP, but we have opted for Storm. I think there has
>>>>>>>>>> been some independent work on integrating Kafka + Spark Streaming +
>>>>>>>>>> Siddhi. Please refer to the thread on arch@ "[Architecture] A few
>>>>>>>>>> questions about WSO2 CEP/Siddhi"
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> And we have to figure out how Spark SQL can be used for historical
>>>>>>>>>>> data, and whether it can execute incremental processing by default,
>>>>>>>>>>> which would cover all our existing BAM use cases.
>>>>>>>>>>> On the other hand, Hadoop 2 [1] uses a completely different platform
>>>>>>>>>>> for resource allocation, known as YARN. Sometimes this may be more
>>>>>>>>>>> suitable for batch jobs.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Lasantha
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *Maninda Edirisooriya*
>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>
>>>>>>>>>>> *WSO2, Inc. *lean.enterprise.middleware.
>>>>>>>>>>>
>>>>>>>>>>> *Blog* : http://maninda.blogspot.com/
>>>>>>>>>>> *E-mail* : mani...@wso2.com
>>>>>>>>>>> *Skype* : @manindae
>>>>>>>>>>> *Twitter* : @maninda
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <
>>>>>>>>>>> nira...@wso2.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Anjana and Srinath,
>>>>>>>>>>>>
>>>>>>>>>>>> After the discussion I had with Anjana, I researched more on
>>>>>>>>>>>> the continuation of Shark project by Databricks.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's what I found out,
>>>>>>>>>>>> - Shark was built on the Hive codebase and achieved performance
>>>>>>>>>>>> improvements by swapping out the physical execution engine part of 
>>>>>>>>>>>> Hive.
>>>>>>>>>>>> While this approach enabled Shark users to speed up their Hive 
>>>>>>>>>>>> queries,
>>>>>>>>>>>> Shark inherited a large, complicated code base from Hive that made 
>>>>>>>>>>>> it hard
>>>>>>>>>>>> to optimize and maintain.
>>>>>>>>>>>> Hence, Databricks has announced that they are halting the
>>>>>>>>>>>> development of Shark from July, 2014. (Shark 0.9 would be the last 
>>>>>>>>>>>> release)
>>>>>>>>>>>> [1]
>>>>>>>>>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS
>>>>>>>>>>>> performance
>>>>>>>>>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html>
>>>>>>>>>>>> by almost an order of magnitude. It also supports all existing 
>>>>>>>>>>>> Hive data
>>>>>>>>>>>> formats, user-defined functions (UDF), and the Hive metastore.  [2]
>>>>>>>>>>>> - Following is the Shark, Spark SQL migration plan
>>>>>>>>>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf
>>>>>>>>>>>>
>>>>>>>>>>>> - For the legacy Hive and MapReduce users, they have proposed a
>>>>>>>>>>>> new 'Hive on Spark Project' [3], [4]
>>>>>>>>>>>> But, given the performance enhancement, it is quite certain
>>>>>>>>>>>> that Hive and MR will be replaced by engines built on top of Spark
>>>>>>>>>>>> (e.g. Spark SQL).
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In my opinion there are a few matters to figure out if we are
>>>>>>>>>>>> migrating from Hive,
>>>>>>>>>>>>
>>>>>>>>>>>> 1. whether we are changing the query engine only? (Then, we can
>>>>>>>>>>>> replace Hive by Shark)
>>>>>>>>>>>> 2. whether we are changing the existing Hadoop/ MapReduce
>>>>>>>>>>>> framework to Spark? (Then we can replace Hive and Hadoop with 
>>>>>>>>>>>> Spark and
>>>>>>>>>>>> Spark SQL)
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> In my opinion, considering the longterm impact and the
>>>>>>>>>>>> availability of support, it is best to migrate the Hive/Hadoop to 
>>>>>>>>>>>> Spark.
>>>>>>>>>>>> It is open for discussion!
>>>>>>>>>>>>
>>>>>>>>>>>> In the meantime, I have already tried Spark SQL, and Databricks'
>>>>>>>>>>>> claims of improved performance seem to be true. I will work more on this.
>>>>>>>>>>>>
>>>>>>>>>>>> Cheers
>>>>>>>>>>>>
>>>>>>>>>>>> [1]
>>>>>>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>>>>>>>> [2]
>>>>>>>>>>>> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>>>>>>>>>>> [4]
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <
>>>>>>>>>>>> anj...@wso2.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Srinath,
>>>>>>>>>>>>>
>>>>>>>>>>>>> No, this has not been tested on multiple nodes. I told Niranda
>>>>>>>>>>>>> in my last mail to test a cluster with the same hardware that
>>>>>>>>>>>>> we are using to test our large data set with Hive. As for the
>>>>>>>>>>>>> have, that we are using to test our large data set with Hive. As 
>>>>>>>>>>>>> for the
>>>>>>>>>>>>> effort to make the change, we still have to figure out the MT 
>>>>>>>>>>>>> aspects of
>>>>>>>>>>>>> Shark here. Sinthuja was working on making the latest Hive 
>>>>>>>>>>>>> version MT
>>>>>>>>>>>>> ready, and most probably, we can do the same changes to the Hive 
>>>>>>>>>>>>> version
>>>>>>>>>>>>> Shark is using. So after we do that, the integration should be 
>>>>>>>>>>>>> seamless.
>>>>>>>>>>>>> And also, as I mentioned earlier here, we are also going to test 
>>>>>>>>>>>>> this with
>>>>>>>>>>>>> the APIM Hive script, to check if there are any unforeseen
>>>>>>>>>>>>> incompatibilities.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>> Anjana.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <
>>>>>>>>>>>>> srin...@wso2.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> This look great.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We need to test Spark with multiple nodes. Did we do that?
>>>>>>>>>>>>>> Please create a few VMs in the performance cloud (talk to Lakmal)
>>>>>>>>>>>>>> and test with at least 5 nodes. We need to make sure it works OK
>>>>>>>>>>>>>> with a distributed setup as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What does it take to change to Spark? Anjana, how much work
>>>>>>>>>>>>>> is it?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --Srinath
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <
>>>>>>>>>>>>>> nira...@wso2.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you Anjana.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, I am working on it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In the meantime, I found this in the Hive documentation [1]. It
>>>>>>>>>>>>>>> talks about Hive on Spark and compares Hive, Shark and Spark SQL
>>>>>>>>>>>>>>> at a higher architectural level.
>>>>>>>>>>>>>>> higher architectural level.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Additionally, it is said that the in-memory performance of
>>>>>>>>>>>>>>> Shark can be improved by introducing Tachyon [2]. I guess we 
>>>>>>>>>>>>>>> can consider
>>>>>>>>>>>>>>> this later on.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cheers.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL
>>>>>>>>>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando <
>>>>>>>>>>>>>>> anj...@wso2.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  Hi Niranda,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Excellent analysis of Hive vs Shark! This gives a lot of
>>>>>>>>>>>>>>>> insight into how both operate in different scenarios. As the
>>>>>>>>>>>>>>>> next step, we will need to run this on an actual cluster of
>>>>>>>>>>>>>>>> computers. Since you've used a subset of the 2014 DEBS challenge
>>>>>>>>>>>>>>>> dataset, we should use the full data set in a clustered
>>>>>>>>>>>>>>>> environment and check this. Gokul is already working on the
>>>>>>>>>>>>>>>> Hive-based setup for this; after that is done, you can create a
>>>>>>>>>>>>>>>> Shark cluster on the same hardware and run the tests there, to
>>>>>>>>>>>>>>>> get a clear comparison of how these two match up in a cluster.
>>>>>>>>>>>>>>>> Until the setup is ready, do continue with your next steps on
>>>>>>>>>>>>>>>> checking the RDD support and Spark SQL use.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> After these are done, we should also do a trial run of our
>>>>>>>>>>>>>>>> own APIM Hive scripts, migrated to Shark.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>> Anjana.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera <
>>>>>>>>>>>>>>>> nira...@wso2.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I have been evaluating the performance of
>>>>>>>>>>>>>>>>> Shark (a distributed SQL query engine for Hadoop) against Hive,
>>>>>>>>>>>>>>>>> with the objective of assessing the possibility of moving the
>>>>>>>>>>>>>>>>> WSO2 BAM data processing (which currently uses Hive) to Shark
>>>>>>>>>>>>>>>>> (and Apache Spark) for improved performance.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am sharing my findings herewith.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>  *AMP Lab Shark*
>>>>>>>>>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster
>>>>>>>>>>>>>>>>> than Hive without any modification to the existing data or 
>>>>>>>>>>>>>>>>> queries. It
>>>>>>>>>>>>>>>>> supports Hive's QL, metastore, serialization formats, and 
>>>>>>>>>>>>>>>>> user-defined
>>>>>>>>>>>>>>>>> functions, providing seamless integration with existing Hive 
>>>>>>>>>>>>>>>>> deployments
>>>>>>>>>>>>>>>>> and a familiar, more powerful option for new ones. [1]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> *Apache Spark*
>>>>>>>>>>>>>>>>> Apache Spark is an open-source data analytics cluster computing
>>>>>>>>>>>>>>>>> framework. It fits into the Hadoop open-source community,
>>>>>>>>>>>>>>>>> building on top of HDFS, and promises performance up to 100
>>>>>>>>>>>>>>>>> times faster than Hadoop MapReduce for certain applications. [2]
>>>>>>>>>>>>>>>>> Official documentation: [3]
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I carried out the comparison between the following Hive
>>>>>>>>>>>>>>>>> and Shark releases with input files ranging from 100 to 1 
>>>>>>>>>>>>>>>>> billion entries.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>              Hive setup          Shark setup
>>>>>>>>>>>>>>>>> QL engine    Apache Hive 0.11    Shark 0.9.1 (latest release), which uses
>>>>>>>>>>>>>>>>>                                  Scala 2.10.3, Spark 0.9.1 and
>>>>>>>>>>>>>>>>>                                  AMPLab's Hive 0.9.0
>>>>>>>>>>>>>>>>> Framework    Hadoop 1.0.4        Spark 0.9.1
>>>>>>>>>>>>>>>>> File system  HDFS                HDFS
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Attached herewith is a report which describes in detail
>>>>>>>>>>>>>>>>> the performance comparison between Shark and Hive.
>>>>>>>>>>>>>>>>> hive_vs_shark
>>>>>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>>>>>>>>>>> hive_vs_shark_report.odt
>>>>>>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> In summary, the following conclusions can be derived from the evaluation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Shark performs on par with Hive in DDL operations
>>>>>>>>>>>>>>>>>    (CREATE/DROP TABLE, DATABASE). Both engines show a fairly
>>>>>>>>>>>>>>>>>    constant performance as the input size increases.
>>>>>>>>>>>>>>>>>    - Shark performs on par with Hive in plain DML operations
>>>>>>>>>>>>>>>>>    (LOAD, INSERT), but when a DML operation is combined with a
>>>>>>>>>>>>>>>>>    data retrieval operation (e.g. INSERT <TBL> SELECT <PROP>
>>>>>>>>>>>>>>>>>    FROM <TBL>), Shark significantly outperforms Hive, with a
>>>>>>>>>>>>>>>>>    performance factor of 10x+ (ranging from 10x to 80x in some
>>>>>>>>>>>>>>>>>    instances). Shark's performance factor shrinks as the input
>>>>>>>>>>>>>>>>>    size increases, while Hive's performance stays fairly constant.
>>>>>>>>>>>>>>>>>    - Shark clearly outperforms Hive in data retrieval
>>>>>>>>>>>>>>>>>    operations (FILTER, ORDER BY, JOIN). Hive's performance stays
>>>>>>>>>>>>>>>>>    fairly constant across data retrieval operations, while
>>>>>>>>>>>>>>>>>    Shark's performance degrades as the input size increases. But
>>>>>>>>>>>>>>>>>    in every instance Shark outperformed Hive, with a minimum
>>>>>>>>>>>>>>>>>    performance factor of 5x+ (ranging from 5x to 80x in some
>>>>>>>>>>>>>>>>>    instances).
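The second bullet's query class, written out as an illustrative HiveQL statement (table and column names are invented) that can be submitted identically to Hive or Shark:

```python
# The "DML combined with retrieval" query class from the benchmark, as a
# HiveQL string (illustrative names), submitted through any object with a
# .sql(str) method, e.g. a HiveContext.

INSERT_SELECT = (
    "INSERT OVERWRITE TABLE fast_laps "
    "SELECT id, lap_time FROM laps WHERE lap_time < 90.0"
)

def run(query_engine):
    """Submit the statement; `query_engine` is e.g. a HiveContext."""
    return query_engine.sql(INSERT_SELECT)
```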
>>>>>>>>>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all the
>>>>>>>>>>>>>>>>> information about the queries and timings graphically.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The code repository can also be found at
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Moving forward, I am currently working on the following.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>    - Apache Spark's resilient distributed dataset (RDD)
>>>>>>>>>>>>>>>>>    abstraction (a collection of elements partitioned across the
>>>>>>>>>>>>>>>>>    nodes of the cluster that can be operated on in parallel):
>>>>>>>>>>>>>>>>>    the use of RDDs and their impact on performance.
>>>>>>>>>>>>>>>>>    - Spark SQL: use of Spark SQL in place of Shark on the
>>>>>>>>>>>>>>>>>    Spark framework
>>>>>>>>>>>>>>>>>
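For reference, the RDD abstraction named in the first bullet in its smallest form (a sketch; running the Spark version needs a local Spark install):

```python
# Sketch: an RDD is a partitioned collection operated on in parallel.

def square_sum(sc):
    rdd = sc.parallelize(range(1, 5), 2)   # 4 elements in 2 partitions
    return rdd.map(lambda x: x * x).sum()  # 1 + 4 + 9 + 16 = 30

def square_sum_local(xs):
    """The same computation without a cluster, for comparison."""
    return sum(x * x for x in xs)
```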
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>>>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>>>>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Would love to have your feedback on this.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Best regards
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>  *Niranda Perera*
>>>>>>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>> *Anjana Fernando*
>>>>>>>>>>>>>>>> Senior Technical Lead
>>>>>>>>>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>>>>>  Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> ============================
>>>>>>>>>>>>>> Srinath Perera, Ph.D.
>>>>>>>>>>>>>>    http://people.apache.org/~hemapani/
>>>>>>>>>>>>>>    http://srinathsview.blogspot.com/
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> *Anjana Fernando*
>>>>>>>>>>>>> Senior Technical Lead
>>>>>>>>>>>>> WSO2 Inc. | http://wso2.com
>>>>>>>>>>>>> lean . enterprise . middleware
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> Architecture mailing list
>>>>>>>>>>> Architecture@wso2.org
>>>>>>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Lasantha Fernando*
>>>>>>>>>> Software Engineer - Data Technologies Team
>>>>>>>>>> WSO2 Inc. http://wso2.com
>>>>>>>>>>
>>>>>>>>>> email: lasan...@wso2.com
>>>>>>>>>> mobile: (+94) 71 5247551
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Niranda Perera*
>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>  Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> *S. Suhothayan*
>>>>>>>> Technical Lead & Team Lead of WSO2 Complex Event Processor
>>>>>>>>  *WSO2 Inc. *http://wso2.com
>>>>>>>> * <http://wso2.com/>*
>>>>>>>> lean . enterprise . middleware
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *cell: (+94) 779 756 757 | blog:
>>>>>>>> http://suhothayan.blogspot.com/ | twitter:
>>>>>>>> http://twitter.com/suhothayan | linked-in:
>>>>>>>> http://lk.linkedin.com/in/suhothayan*
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ============================
>>>>>> Srinath Perera, Ph.D.
>>>>>>    http://people.apache.org/~hemapani/
>>>>>>    http://srinathsview.blogspot.com/
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Niranda Perera*
>>>>> Software Engineer, WSO2 Inc.
>>>>> Mobile: +94-71-554-8430
>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Niranda Perera*
>>> Software Engineer, WSO2 Inc.
>>> Mobile: +94-71-554-8430
>>> Twitter: @n1r44 <https://twitter.com/N1R44>
>>>
>>
>>
>
>
> --
> *Niranda Perera*
> Software Engineer, WSO2 Inc.
> Mobile: +94-71-554-8430
> Twitter: @n1r44 <https://twitter.com/N1R44>
>



-- 

David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf
<https://twitter.com/dmoralesdf>


<http://www.stratio.com/>
Avenida de Europa, 26. Ática 5. 2ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture
