Of course, here is a simple example:
// reading from MongoDB
val config: ExtractorConfig[Cells] = new ExtractorConfig()
config.setExtractorImplClass(classOf[MongoNativeCellExtractor])
config.putValue(ExtractorConstants.DATABASE, "test")
config.putValue(ExtractorConstants.COLLECTION, "input")
config.putValue(ExtractorConstants.HOST, "localhost:27017")

// registering the schema
val schema = deepContext.createJavaSchemaRDD(config)
schema.registerTempTable("input")

// running the query
val messagesFiltered = deepContext.sql("SELECT * FROM input WHERE number > 10")

// collecting the results
val rows = messagesFiltered.collect()

2014-12-15 10:14 GMT+01:00 Niranda Perera <nira...@wso2.com>:
> Hi David,
>
> Could you point me to an example where SparkSQL is used in Stratio Deep?
>
> Rgds
>
> On Mon, Dec 15, 2014 at 2:20 PM, David Morales <dmora...@stratio.com> wrote:
>>
>> Hi there,
>>
>> For sure, the new release does support SparkSQL, so you can use SparkSQL and Stratio Deep together just out of the box.
>>
>> About Crossdata, it's not itself related to Spark, but it can use Spark/Deep. It's an interactive SQL engine, like Hive, for example.
>>
>> Regards.
>>
>> 2014-12-12 21:29 GMT+01:00 Niranda Perera <nira...@wso2.com>:
>>>
>>> Hi David,
>>>
>>> I have been going through the Deep-Spark examples. It looks very promising.
>>>
>>> On a follow-up query, does Deep-Spark / Deep-Cassandra support SQL-like operations on RDDs (like SparkSQL)?
>>>
>>> Example (from the Datastax Cassandra connector demos):
>>>
>>> object SQLDemo extends DemoApp {
>>>
>>>   val cc = new CassandraSQLContext(sc)
>>>
>>>   CassandraConnector(conf).withSessionDo { session =>
>>>     session.execute("CREATE KEYSPACE IF NOT EXISTS test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
>>>     session.execute("DROP TABLE IF EXISTS test.sql_demo")
>>>     session.execute("CREATE TABLE test.sql_demo (key INT PRIMARY KEY, grp INT, value DOUBLE)")
>>>     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (1, 1, 1.0)")
>>>     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (2, 1, 2.5)")
>>>     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (3, 1, 10.0)")
>>>     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (4, 2, 4.0)")
>>>     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (5, 2, 2.2)")
>>>     session.execute("INSERT INTO test.sql_demo(key, grp, value) VALUES (6, 2, 2.8)")
>>>   }
>>>
>>>   val rdd = cc.cassandraSql("SELECT grp, max(value) AS mv FROM test.sql_demo GROUP BY grp ORDER BY mv")
>>>   rdd.collect().foreach(println) // [2, 4.0] [1, 10.0]
>>>
>>>   sc.stop()
>>> }
>>>
>>> I also read about Stratio Crossdata. Does Crossdata serve this purpose?
>>>
>>> Rgds
>>>
>>> On Tue, Dec 2, 2014 at 11:14 PM, David Morales <dmora...@stratio.com> wrote:
>>>>
>>>> Hi!
>>>>
>>>> Please check the develop branch if you want a more realistic view of our development path. The last commit was about two hours ago :)
>>>>
>>>> Stratio Deep is one of our core modules, so there is a core team in Stratio fully devoted to Spark + NoSQL integration. In these last months, for example, we have added MongoDB, ElasticSearch and Aerospike to Stratio Deep, so you can talk to these databases from Spark just like you do with HDFS.
>>>>
>>>> Furthermore, we are working on more backends, such as Neo4j or Couchbase, for example.
>>>>
>>>> About our benchmarks, you can check out some results at this link: http://www.stratio.com/deep-vs-datastax/
>>>>
>>>> Please keep in mind that Spark integration with a datastore can be done in two ways: HCI or native. We are now working on improving the native integration because it is considerably more performant. Along these lines, we are working on some other tests with even more impressive results.
>>>>
>>>> Here you can find a technical overview of our whole platform:
>>>>
>>>> http://www.slideshare.net/Stratio/stratio-platform-overview-v41
>>>>
>>>> Regards
>>>>
>>>> 2014-12-02 11:14 GMT+01:00 Niranda Perera <nira...@wso2.com>:
>>>>
>>>>> Hi David,
>>>>>
>>>>> Sorry to re-initiate this thread, but may I know if you have done any benchmarking of the Datastax Spark-Cassandra connector against the Stratio Deep-Spark Cassandra integration? I would love to take a look at it.
>>>>>
>>>>> I recently checked the deep-spark GitHub repo and noticed that there has been no activity since Oct 29th. May I know what your future plans are for this particular project?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Tue, Aug 26, 2014 at 9:12 PM, David Morales <dmora...@stratio.com> wrote:
>>>>>
>>>>>> Yes, it is already included in our benchmarks.
>>>>>>
>>>>>> It could be a nice idea to share our findings; let me talk about it here. Meanwhile, you can ask us any question via my mail or this thread; we are glad to help you.
>>>>>>
>>>>>> Best regards.
>>>>>>
>>>>>> 2014-08-24 15:49 GMT+02:00 Niranda Perera <nira...@wso2.com>:
>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> Thank you for your detailed reply.
>>>>>>>
>>>>>>> It was great to hear about Stratio Deep and I must say, it looks very interesting. Storage handlers for databases such as Cassandra, MongoDB, etc. would be very helpful.
We will definitely look into Stratio Deep.
>>>>>>>
>>>>>>> I came across the Datastax Spark-Cassandra connector ( https://github.com/datastax/spark-cassandra-connector ). Have you done any comparison between your implementation and Datastax's connector?
>>>>>>>
>>>>>>> And, yes, please do share the performance results with us once they are ready.
>>>>>>>
>>>>>>> On a different note, is there any way for us to interact with the Stratio dev community, in the form of dev mailing lists etc., so that we could mutually share our findings?
>>>>>>>
>>>>>>> Best regards
>>>>>>>
>>>>>>> On Fri, Aug 22, 2014 at 2:07 PM, David Morales <dmora...@stratio.com> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> *1. About the size of deployments.*
>>>>>>>>
>>>>>>>> It depends on your use case... especially when you combine Spark with a datastore. We usually deploy Spark with Cassandra or MongoDB instead of HDFS, for example.
>>>>>>>>
>>>>>>>> Spark will be faster if you put the data in memory, so if you need a lot of speed (interactive queries, for example), you should have enough memory.
>>>>>>>>
>>>>>>>> *2. About storage handlers.*
>>>>>>>>
>>>>>>>> We have developed the first tight integration between Cassandra and Spark, called Stratio Deep, announced at the first Spark Summit. You can check Stratio Deep out here: https://github.com/Stratio/stratio-deep (open, Apache 2 license).
>>>>>>>>
>>>>>>>> *Deep is a thin integration layer between Apache Spark and several NoSQL datastores. We currently support Apache Cassandra and MongoDB, but in the near future we will add support for several other datastores.*
>>>>>>>>
>>>>>>>> Datastax announced its own driver for Spark at the last Spark Summit, but we have been working on our solution for almost a year.
>>>>>>>>
>>>>>>>> Furthermore, we are working to extend this solution to work with other databases as well... MongoDB integration is complete right now and ElasticSearch will be ready in a few weeks.
>>>>>>>>
>>>>>>>> And that is not all; we have also developed an integration between Cassandra and Lucene for indexing data (open source, Apache 2).
>>>>>>>>
>>>>>>>> *Stratio Cassandra is a fork of Apache Cassandra <http://cassandra.apache.org/> where index functionality has been extended to provide near real-time search, as in ElasticSearch or Solr, including full text search <http://en.wikipedia.org/wiki/Full_text_search> capabilities and free multivariable search. It is achieved through an Apache Lucene <http://lucene.apache.org/> based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data.*
>>>>>>>>
>>>>>>>> We will publish some benchmarks in two weeks, so I will share our results here if you are interested.
>>>>>>>>
>>>>>>>> If you are more interested in distributed file systems, you should take a look at Tachyon: http://tachyon-project.org/index.html
>>>>>>>>
>>>>>>>> *3. Spark - Hive compatibility*
>>>>>>>>
>>>>>>>> Spark will support anything with the Hadoop InputFormat interface.
>>>>>>>>
>>>>>>>> *4. Performance*
>>>>>>>>
>>>>>>>> We are working a lot with Cassandra and MongoDB and the performance is quite nice. We are finishing some benchmarks right now comparing Hadoop + HDFS vs Spark + HDFS vs Spark + Cassandra (using Stratio Deep and even our fork of Cassandra).
>>>>>>>>
>>>>>>>> Let me share these results with you when they are ready, ok?
>>>>>>>>
>>>>>>>> Regards.
>>>>>>>>
>>>>>>>> 2014-08-22 7:53 GMT+02:00 Niranda Perera <nira...@wso2.com>:
>>>>>>>>
>>>>>>>>> Hi Srinath,
>>>>>>>>> Yes, I am working on deploying it on a multi-node cluster with the DEBS dataset. I will keep architecture@ posted on the progress.
>>>>>>>>>
>>>>>>>>> Hi David,
>>>>>>>>> Thank you very much for the detailed insight you've provided. A few quick questions:
>>>>>>>>> 1. Do you have experience using storage handlers in Spark?
>>>>>>>>> 2. Would a storage handler used in Hive be directly compatible with Spark?
>>>>>>>>> 3. How do you rate the performance of Spark with other databases such as Cassandra, HBase, H2, etc.?
>>>>>>>>>
>>>>>>>>> Thank you very much again for your interest. I look forward to hearing from you.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera <srin...@wso2.com> wrote:
>>>>>>>>>
>>>>>>>>>> Niranda, we need to test Spark in multi-node mode before making a decision. Spark is very fast; I think there is no doubt about that. We need to make sure it is stable.
>>>>>>>>>>
>>>>>>>>>> David, thanks for a detailed email! How big (nodes) is the Spark setup you guys are running?
>>>>>>>>>>
>>>>>>>>>> --Srinath
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 21, 2014 at 1:34 PM, David Morales <dmora...@stratio.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sorry for disturbing this thread, but I think I can help clarify a few things (we attended the last Spark Summit, we were also speakers there, and we work very closely with Spark).
>>>>>>>>>>>
>>>>>>>>>>> *> Hive/Shark and others benchmark*
>>>>>>>>>>>
>>>>>>>>>>> You can find a nice comparison and benchmark on this site: https://amplab.cs.berkeley.edu/benchmark/
>>>>>>>>>>>
>>>>>>>>>>> *> Shark and SparkSQL*
>>>>>>>>>>>
>>>>>>>>>>> SparkSQL is the natural replacement for Shark, but SparkSQL is still young at this moment. If you are looking for Hive compatibility, you have to execute SparkSQL with a specific context.
>>>>>>>>>>>
>>>>>>>>>>> Quoted from the Spark website:
>>>>>>>>>>>
>>>>>>>>>>> *> Note that Spark SQL currently uses a very basic SQL parser. Users that want a more complete dialect of SQL should look at the HiveQL support provided by HiveContext.*
>>>>>>>>>>>
>>>>>>>>>>> So, just note that SparkSQL is a work in progress. If you want SparkSQL you run a SQLContext; if you want Hive, you use a different context (HiveContext)...
>>>>>>>>>>>
>>>>>>>>>>> *> Spark - Hadoop: the future*
>>>>>>>>>>>
>>>>>>>>>>> Most Hadoop distributions (Cloudera, Hortonworks, MapR...) are including Spark and contributing to migrating the whole Hadoop ecosystem to Spark.
>>>>>>>>>>>
>>>>>>>>>>> Spark is a bit more than Map/Reduce...
as you can read here: http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/
>>>>>>>>>>>
>>>>>>>>>>> *> Spark Streaming / Spark SQL*
>>>>>>>>>>>
>>>>>>>>>>> Spark Streaming is built on Spark and provides stream processing through an abstraction called DStreams (a collection of RDDs in a window of time).
>>>>>>>>>>>
>>>>>>>>>>> There are some efforts to make SparkSQL compatible with Spark Streaming (something similar to Trident for Storm), as you can see here:
>>>>>>>>>>>
>>>>>>>>>>> *StreamSQL (https://github.com/thunderain-project/StreamSQL) is a POC project based on Spark that combines the power of Catalyst and Spark Streaming, to offer people the ability to manipulate SQL on top of DStreams; it keeps the same semantics as SparkSQL by offering a SchemaDStream on top of DStream. You don't need to do tricky things like extracting RDDs to register as a table. The other parts are the same as Spark.*
>>>>>>>>>>>
>>>>>>>>>>> So, you can apply SQL to a data stream, but it is very simple at the moment... you can expect a bunch of improvements in this area in the coming months (I guess that SparkSQL will work on Spark Streaming streams before the end of this year).
>>>>>>>>>>>
>>>>>>>>>>> *> Spark Streaming / Spark SQL and CEP*
>>>>>>>>>>>
>>>>>>>>>>> There is no relationship at this moment between (your absolutely amazing) Siddhi CEP and Spark. As far as I know, you are working on distributed CEP with Storm and Siddhi.
>>>>>>>>>>>
>>>>>>>>>>> We are currently working on an interactive CEP built with Kafka + Spark Streaming + Siddhi, with some features such as an API, an interactive shell, built-in statistics and auditing, and built-in functions (save2cassandra, save2mongo, save2elasticsearch...).
>>>>>>>>>>>
>>>>>>>>>>> If you are interested we can talk about this project; I think it would be a nice idea!
>>>>>>>>>>>
>>>>>>>>>>> Anyway, I don't think that SparkSQL will evolve into something like a CEP. Patterns and sequences, for example, would be very complex to do with Spark Streaming (at least for now).
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan <s...@wso2.com>:
>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> @Maninda,
>>>>>>>>>>>>>
>>>>>>>>>>>>> +1 for suggesting Spark SQL.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Quoting Databricks:
>>>>>>>>>>>>> "Spark SQL provides state-of-the-art SQL performance and maintains compatibility with Shark/Hive. In particular, like Shark, Spark SQL supports all existing Hive data formats, user-defined functions (UDF), and the Hive metastore." [1]
>>>>>>>>>>>>>
>>>>>>>>>>>>> But I am not entirely sure if Spark SQL and Siddhi are comparable, because SparkSQL (like Hive) is designed for batch processing, whereas Siddhi is real-time processing.
But if there are implementations where Siddhi is run on top of Spark, it would be very interesting.
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, Siddhi's current way of operation does not support this, but with partitions we can achieve this to some extent.
>>>>>>>>>>>>
>>>>>>>>>>>> Suho
>>>>>>>>>>>>
>>>>>>>>>>>>> Spark supports either Hadoop 1 or 2. But I think we should see what is best: MR1 or YARN+MR2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [image: Hadoop Architecture] [2]
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>>>>>>>>> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <lasan...@wso2.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Maninda,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In the case of discontinuity of the Shark project, IMO we should not move to Shark at all. And it seems better to go with Spark SQL, as we are already using Spark for CEP. But I am not sure of the difference between Spark SQL and Siddhi queries on the Spark engine.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Currently, we are doing the integration with CEP using Apache Storm, not Spark... :-). Spark Streaming is a possible candidate for integrating with CEP, but we have opted for Storm. I think there has been some independent work on integrating Kafka + Spark Streaming + Siddhi.
>>>>>>>>>>>>>> Please refer to the thread on arch@ "[Architecture] A few questions about WSO2 CEP/Siddhi".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And we have to figure out how Spark SQL is used for historical data, and whether it can execute incremental processing by default, which would implement all our existing BAM use cases. On the other hand, Hadoop 2 [1] uses a completely different platform for resource allocation known as YARN. Sometimes this may be more suitable for batch jobs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Lasantha
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> *Maninda Edirisooriya*
>>>>>>>>>>>>>>> Senior Software Engineer
>>>>>>>>>>>>>>> *WSO2, Inc. *lean.enterprise.middleware.
>>>>>>>>>>>>>>> *Blog* : http://maninda.blogspot.com/
>>>>>>>>>>>>>>> *E-mail* : mani...@wso2.com
>>>>>>>>>>>>>>> *Skype* : @manindae
>>>>>>>>>>>>>>> *Twitter* : @maninda
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Anjana and Srinath,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> After the discussion I had with Anjana, I researched more on the continuation of the Shark project by Databricks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here's what I found out:
>>>>>>>>>>>>>>>> - Shark was built on the Hive codebase and achieved performance improvements by swapping out the physical execution engine part of Hive.
While this approach enabled Shark users to speed up >>>>>>>>>>>>>>>> their Hive >>>>>>>>>>>>>>>> queries, Shark inherited a large, complicated code base from >>>>>>>>>>>>>>>> Hive that made >>>>>>>>>>>>>>>> it hard to optimize and maintain. >>>>>>>>>>>>>>>> Hence, Databricks has announced that they are halting the >>>>>>>>>>>>>>>> development of Shark from July, 2014. (Shark 0.9 would be the >>>>>>>>>>>>>>>> last release) >>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS >>>>>>>>>>>>>>>> performance >>>>>>>>>>>>>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html> >>>>>>>>>>>>>>>> by almost an order of magnitude. It also supports all existing >>>>>>>>>>>>>>>> Hive data >>>>>>>>>>>>>>>> formats, user-defined functions (UDF), and the Hive metastore. >>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>> - Following is the Shark, Spark SQL migration plan >>>>>>>>>>>>>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - For the legacy Hive and MapReduce users, they have >>>>>>>>>>>>>>>> proposed a new 'Hive on Spark Project' [3], [4] >>>>>>>>>>>>>>>> But, given the performance enhancement, it is quite certain >>>>>>>>>>>>>>>> that Hive and MR would be replaced by engines build on top of >>>>>>>>>>>>>>>> Spark (ex: >>>>>>>>>>>>>>>> Spark SQL) >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> In my opinion there are a few matters to figure out if we >>>>>>>>>>>>>>>> are migrating from Hive, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 1. whether we are changing the query engine only? (Then, we >>>>>>>>>>>>>>>> can replace Hive by Shark) >>>>>>>>>>>>>>>> 2. whether we are changing the existing Hadoop/ MapReduce >>>>>>>>>>>>>>>> framework to Spark? 
(Then we can replace Hive and Hadoop with Spark and Spark SQL.)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In my opinion, considering the long-term impact and the availability of support, it is best to migrate from Hive/Hadoop to Spark. It is open for discussion!
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In the meantime, I have already tried Spark SQL, and Databricks' claims of improved performance seem to be true. I will work more on this.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html
>>>>>>>>>>>>>>>> [2] http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html
>>>>>>>>>>>>>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292
>>>>>>>>>>>>>>>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi Srinath,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No, this has not been tested on multiple nodes. I told Niranda here in my last mail to test a cluster with the same set of hardware that we are using to test our large data set with Hive. As for the effort to make the change, we still have to figure out the MT aspects of Shark here. Sinthuja was working on making the latest Hive version MT-ready, and most probably we can do the same changes to the Hive version Shark is using.
So after we do that, the integration should be seamless. And also, as I mentioned earlier here, we are also going to test this with the APIM Hive script, to check if there are any unforeseen incompatibilities.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>> Anjana.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> This looks great.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> We need to test Spark with multiple nodes. Did we do that? Please create a few VMs in the performance cloud (talk to Lakmal) and test with at least 5 nodes. We need to make sure it works OK in a distributed setup as well.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> What does it take to change to Spark? Anjana... how much work is it?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> --Srinath
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thank you Anjana.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Yes, I am working on it.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> In the meantime, I found this in the Hive documentation [1]. It talks about Hive on Spark, and compares Hive, Shark and Spark SQL at a higher architectural level.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Additionally, it is said that the in-memory performance of Shark can be improved by introducing Tachyon [2]. I guess we can consider this later on.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Cheers.
>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> [1] >>>>>>>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL >>>>>>>>>>>>>>>>>>> [2] >>>>>>>>>>>>>>>>>>> http://tachyon-project.org/Running-Tachyon-Locally.html >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando < >>>>>>>>>>>>>>>>>>> anj...@wso2.com> wrote: >>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Hi Niranda, >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Excellent analysis of Hive vs Shark! .. This gives a >>>>>>>>>>>>>>>>>>>> lot of insight into how both operates in different >>>>>>>>>>>>>>>>>>>> scenarios. As the next >>>>>>>>>>>>>>>>>>>> step, we will need to run this in an actual cluster of >>>>>>>>>>>>>>>>>>>> computers. Since >>>>>>>>>>>>>>>>>>>> you've used a subset of the dataset of 2014 DEBS >>>>>>>>>>>>>>>>>>>> challenge, we should use >>>>>>>>>>>>>>>>>>>> the full data set in a clustered environment and check >>>>>>>>>>>>>>>>>>>> this. Gokul is >>>>>>>>>>>>>>>>>>>> already working on the Hive based setup for this, after >>>>>>>>>>>>>>>>>>>> that is done, you >>>>>>>>>>>>>>>>>>>> can create a Shark cluster in the same hardware and run >>>>>>>>>>>>>>>>>>>> the tests there, to >>>>>>>>>>>>>>>>>>>> get a clear comparison on how these two match up in a >>>>>>>>>>>>>>>>>>>> cluster. Until the >>>>>>>>>>>>>>>>>>>> setup is ready, do continue with your next steps on >>>>>>>>>>>>>>>>>>>> checking the RDD >>>>>>>>>>>>>>>>>>>> support and Spark SQL use. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> After these are done, we should also do a trial run of >>>>>>>>>>>>>>>>>>>> our own APIM Hive scripts, migrated to Shark. >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>>>>>> Anjana. 
>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera < >>>>>>>>>>>>>>>>>>>> nira...@wso2.com> wrote: >>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> Hi all, >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I have been evaluating the performance of >>>>>>>>>>>>>>>>>>>>> Shark (distributed SQL query engine for Hadoop) against >>>>>>>>>>>>>>>>>>>>> Hive. This is with >>>>>>>>>>>>>>>>>>>>> the objective of seeing the possibility to move the WSO2 >>>>>>>>>>>>>>>>>>>>> BAM data >>>>>>>>>>>>>>>>>>>>> processing (which currently uses Hive) to Shark (and >>>>>>>>>>>>>>>>>>>>> Apache Spark) for >>>>>>>>>>>>>>>>>>>>> improved performance. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> I am sharing my findings herewith. >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> *AMP Lab Shark* >>>>>>>>>>>>>>>>>>>>> Shark can execute Hive QL queries up to 100 times >>>>>>>>>>>>>>>>>>>>> faster than Hive without any modification to the existing >>>>>>>>>>>>>>>>>>>>> data or queries. >>>>>>>>>>>>>>>>>>>>> It supports Hive's QL, metastore, serialization formats, >>>>>>>>>>>>>>>>>>>>> and user-defined >>>>>>>>>>>>>>>>>>>>> functions, providing seamless integration with existing >>>>>>>>>>>>>>>>>>>>> Hive deployments >>>>>>>>>>>>>>>>>>>>> and a familiar, more powerful option for new ones. [1] >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>>>>> *Apache Spark*Apache Spark is an open-source data >>>>>>>>>>>>>>>>>>>>> analytics cluster computing framework. It fits into the >>>>>>>>>>>>>>>>>>>>> Hadoop open-source >>>>>>>>>>>>>>>>>>>>> community, building on top of the HDFS and promises >>>>>>>>>>>>>>>>>>>>> performance up to 100 >>>>>>>>>>>>>>>>>>>>> times faster than Hadoop MapReduce for certain >>>>>>>>>>>>>>>>>>>>> applications. 
[2]
>>>>>>>>>>>>>>>>>>>>> Official documentation: [3]
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I carried out the comparison between the following Hive and Shark releases, with input files ranging from 100 to 1 billion entries.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> QL Engine: Apache Hive 0.11 | Shark 0.9.1 (latest release), which uses Scala 2.10.3, Spark 0.9.1 and AMPLab's Hive 0.9.0
>>>>>>>>>>>>>>>>>>>>> Framework: Hadoop 1.0.4 | Spark 0.9.1
>>>>>>>>>>>>>>>>>>>>> File system: HDFS | HDFS
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Attached herewith is a report which describes in detail the performance comparison between Shark and Hive.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> hive_vs_shark <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web>
>>>>>>>>>>>>>>>>>>>>> hive_vs_shark_report.odt <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> In summary, the following conclusions can be drawn from the evaluation:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> - Shark is on par with Hive in DDL operations (CREATE, DROP ... TABLE, DATABASE). Both engines show fairly constant performance as the input size increases.
>>>>>>>>>>>>>>>>>>>>> - Shark is on par with Hive in DML operations (LOAD, INSERT), but when a DML operation is called in conjunction with a data retrieval operation (e.g. INSERT <TBL> SELECT <PROP> FROM <TBL>), Shark significantly outperforms Hive with a performance factor of 10x+ (ranging from 10x to 80x in some instances). Shark's performance factor reduces as the input size increases, while Hive's performance stays fairly constant.
>>>>>>>>>>>>>>>>>>>>> - Shark clearly outperforms Hive in data retrieval operations (FILTER, ORDER BY, JOIN). Hive's performance is fairly constant in the data retrieval operations, while Shark's performance reduces as the input size increases. But in every instance Shark outperformed Hive with a minimum performance factor of 5x+ (ranging from 5x to 80x in some instances).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it has all the information about the queries and timings, presented graphically.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> The code repository can be found at: https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Moving forward, I am currently working on the following.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>    - Apache Spark's resilient distributed dataset (RDD)
>>>>>>>>>>>>>>>>>>>>>    abstraction (a collection of elements partitioned across the
>>>>>>>>>>>>>>>>>>>>>    nodes of the cluster that can be operated on in parallel):
>>>>>>>>>>>>>>>>>>>>>    the use of RDDs and their impact on performance.
>>>>>>>>>>>>>>>>>>>>>    - Spark SQL: use of Spark SQL instead of Shark on the
>>>>>>>>>>>>>>>>>>>>>    Spark framework.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> [1] https://github.com/amplab/shark/wiki
>>>>>>>>>>>>>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark
>>>>>>>>>>>>>>>>>>>>> [3] http://spark.apache.org/docs/latest/
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I would love to have your feedback on this.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Best regards
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>>>> *Niranda Perera*
>>>>>>>>>>>>>>>>>>>>> Software Engineer, WSO2 Inc.
>>>>>>>>>>>>>>>>>>>>> Mobile: +94-71-554-8430
>>>>>>>>>>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44>
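[Editor's note: the two "moving forward" items in the quoted email above (the RDD abstraction and Spark SQL instead of Shark) can be sketched together. The following is a minimal, illustrative Scala example assuming a Spark 1.1-era API (SQLContext/SchemaRDD, matching the period of this thread); the input file, case class, and table name are made up for illustration.]

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type for lines of "name,age" in people.txt
case class Person(name: String, age: Int)

object RddSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-sql-sketch").setMaster("local[*]"))

    // 1. RDD abstraction: a collection of elements partitioned across the
    //    nodes of the cluster, operated on in parallel.
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))

    // 2. Spark SQL over the same data, instead of going through Shark.
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD[Person] -> SchemaRDD
    people.registerTempTable("people")

    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
    adults.map(row => row(0)).collect().foreach(println)

    sc.stop()
  }
}
```

This mirrors the SchemaRDD/registerTempTable pattern shown later in this thread for Stratio Deep, only over a plain text-file RDD rather than a NoSQL extractor.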
--

David Morales de Frías :: +34 607 010 411 :: @dmoralesdf
<https://twitter.com/dmoralesdf>

<http://www.stratio.com/>
Avenida de Europa, 26. Ática 5. 2ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
_______________________________________________
Architecture mailing list
Architecture@wso2.org
https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture