Hi there, *1. About the size of deployments.*
It depends on your use case... especially when you combine Spark with a datastore. We usually deploy Spark with Cassandra or MongoDB, instead of using HDFS, for example. Spark will be faster if you put the data in memory, so if you need a lot of speed (interactive queries, for example), you should have enough memory. *2. About storage handlers.* We have developed the first tight integration between Cassandra and Spark, called Stratio Deep, announced at the first Spark Summit. You can check Stratio Deep out here: https://github.com/Stratio/stratio-deep (open, Apache 2 license). *Deep is a thin integration layer between Apache Spark and several NoSQL datastores. We currently support Apache Cassandra and MongoDB, but in the near future we will add support for several other datastores.* Datastax announced its own driver for Spark at the last Spark Summit, but we have been working on our solution for almost a year. Furthermore, we are working to extend this solution to other databases as well... the MongoDB integration is complete right now and ElasticSearch will be ready in a few weeks. And that is not all: we have also developed an integration between Cassandra and Lucene for indexing data (open source, Apache 2). *Stratio Cassandra is a fork of Apache Cassandra <http://cassandra.apache.org/> where index functionality has been extended to provide near-real-time search, like ElasticSearch or Solr, including full-text search <http://en.wikipedia.org/wiki/Full_text_search> capabilities and free multivariable search. It is achieved through an Apache Lucene <http://lucene.apache.org/> based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data.* We will publish some benchmarks in two weeks, so I will share our results here if you are interested. If you are more interested in distributed file systems, you should take a look at Tachyon: http://tachyon-project.org/index.html *3. 
Spark - Hive compatibility* Spark will support anything that implements the Hadoop InputFormat interface. *4. Performance* We are working a lot with Cassandra and MongoDB and the performance is quite good. We are currently finishing some benchmarks comparing Hadoop + HDFS vs Spark + HDFS vs Spark + Cassandra (using Stratio Deep and even our fork of Cassandra). Let me share these results with you when they are ready, OK? Regards. 2014-08-22 7:53 GMT+02:00 Niranda Perera <nira...@wso2.com>: > Hi Srinath, > Yes, I am working on deploying it on a multi-node cluster with the DEBS > dataset. I will keep architecture@ posted on the progress. > > > Hi David, > Thank you very much for the detailed insight you've provided. > A few quick questions: > 1. Do you have experience in using storage handlers in Spark? > 2. Would a storage handler used in Hive be directly compatible with > Spark? > 3. How do you grade the performance of Spark with other databases such as > Cassandra, HBase, H2, etc.? > > Thank you very much again for your interest. Look forward to hearing from > you. > > Regards > > > On Thu, Aug 21, 2014 at 7:02 PM, Srinath Perera <srin...@wso2.com> wrote: > >> Niranda, we need to test Spark in multi-node mode before making a decision. >> Spark is very fast, I think there is no doubt about that. We need to make >> sure it is stable. >> >> David, thanks for a detailed email! How big (nodes) is the Spark setup >> you guys are running? 
>> >> --Srinath >> >> >> >> On Thu, Aug 21, 2014 at 1:34 PM, David Morales <dmora...@stratio.com> >> wrote: >> >>> Sorry for disturbing this thread, but I think I can help clarify >>> a few things (we attended the last Spark Summit, where we were also >>> speakers, and we work very closely with Spark). >>> >>> *> Hive/Shark and other benchmarks* >>> >>> You can find a nice comparison and benchmark on this website: >>> https://amplab.cs.berkeley.edu/benchmark/ >>> >>> >>> *> Shark and SparkSQL* >>> >>> SparkSQL is the natural replacement for Shark, but SparkSQL is still >>> young at this moment. If you are looking for Hive compatibility, you have >>> to execute SparkSQL with a specific context. >>> >>> Quoted from the Spark website: >>> >>> *> Note that Spark SQL currently uses a very basic SQL parser. Users >>> that want a more complete dialect of SQL should look at the HiveQL support >>> provided by HiveContext.* >>> >>> So, just note that SparkSQL is a work in progress. If you want SparkSQL >>> you have to run a SQLContext; if you want Hive, you will have a >>> different context (HiveContext)... >>> >>> >>> *> Spark - Hadoop: the future* >>> >>> Most Hadoop distributions are including Spark: Cloudera, Hortonworks, >>> MapR... and contributing to migrating the whole Hadoop ecosystem to Spark. >>> >>> Spark is a bit more than Map/Reduce... as you can read here: >>> http://gigaom.com/2014/06/28/4-reasons-why-spark-could-jolt-hadoop-into-hyperdrive/ >>> >>> >>> *> Spark Streaming / Spark SQL* >>> >>> Spark Streaming is built on Spark and provides stream processing >>> through an abstraction called DStreams (a sequence of RDDs in >>> a window of time). 
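The DStream idea described above (a stream modeled as a sequence of small RDD-like micro-batches, grouped into a sliding time window) can be sketched in plain Python. This is only an illustrative simulation of the concept, not the actual Spark Streaming API; the function name `windowed_counts` and the batch lists are invented for the example.

```python
from collections import deque

def windowed_counts(batches, window_size):
    """Simulate a DStream window: each incoming micro-batch is a small
    collection of events (standing in for one RDD), and a window is the
    union of the last `window_size` batches."""
    window = deque(maxlen=window_size)  # the oldest batch falls out automatically
    results = []
    for batch in batches:
        window.append(batch)
        # The "windowed RDD" is just the concatenation of the recent batches.
        events = [e for b in window for e in b]
        results.append(len(events))
    return results

# Three micro-batches arriving over time, with a window of 2 batch intervals.
print(windowed_counts([[1, 2], [3], [4, 5, 6]], window_size=2))  # [2, 3, 4]
```

The key point the sketch makes is that a window operation never touches "the stream" as a whole, only the handful of recent batches, which is what makes this model cheap to distribute.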
>>> >>> There are some efforts to make SparkSQL compatible with Spark >>> Streaming (something similar to Trident for Storm), as you can see here: >>> >>> *StreamSQL (https://github.com/thunderain-project/StreamSQL >>> <https://github.com/thunderain-project/StreamSQL>) is a POC project based >>> on Spark that combines the power of Catalyst and Spark Streaming, to offer >>> people the ability to run SQL on top of DStreams. It keeps the same >>> semantics as SparkSQL and offers a SchemaDStream on top of DStream, so you >>> don't need to do tricky things like extracting an RDD to register it as a >>> table. Other parts are the same as Spark.* >>> >>> So you can apply SQL to a data stream, but it is very simple at the >>> moment... you can expect a bunch of improvements in this area in the next >>> few months (I guess that SparkSQL will work on Spark Streaming streams before >>> the end of this year). >>> >>> >>> >>> *> Spark Streaming / Spark SQL and CEP* >>> >>> There is no relationship at this moment between (your absolutely >>> amazing) Siddhi CEP and Spark. As far as I know, you are working on >>> distributed CEP with Storm and Siddhi. >>> >>> We are currently working on an interactive CEP built with Kafka + >>> Spark Streaming + Siddhi, with features such as an API, an interactive >>> shell, built-in statistics and auditing, and built-in functions >>> (save2cassandra, save2mongo, save2elasticsearch...). >>> >>> If you are interested we can talk about this project; I think it >>> would be a nice idea! >>> >>> >>> Anyway, I don't think that SparkSQL will evolve into something like a CEP. >>> Patterns and sequences, for example, would be very complex to do with Spark >>> Streaming (at least for now). >>> >>> >>> >>> Thanks. 
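The "SQL on a DStream" idea behind StreamSQL amounts to running the same query against every micro-batch as it arrives, so the result is itself a stream of (smaller) batches. Here is a minimal pure-Python sketch of that idea; `run_query` and `query_stream` are invented for illustration and are not the StreamSQL or SparkSQL API.

```python
def run_query(rows, where, select):
    """A toy 'SQL' over one micro-batch of dict rows:
    SELECT <select columns> FROM batch WHERE <where predicate>."""
    return [{k: r[k] for k in select} for r in rows if where(r)]

def query_stream(batches, where, select):
    """Apply the same query to each micro-batch, SchemaDStream-style:
    a query over a stream yields a stream of per-batch results."""
    return [run_query(batch, where, select) for batch in batches]

stream = [
    [{"user": "a", "amount": 5}, {"user": "b", "amount": 50}],
    [{"user": "c", "amount": 70}],
]
# SELECT user FROM stream WHERE amount > 10, evaluated batch by batch.
print(query_stream(stream, where=lambda r: r["amount"] > 10, select=["user"]))
# [[{'user': 'b'}], [{'user': 'c'}]]
```

This also hints at why CEP-style patterns and sequences are hard in this model, as noted above: each batch is queried independently, whereas a pattern match has to correlate events across batch boundaries.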
>>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> 2014-08-21 6:18 GMT+02:00 Sriskandarajah Suhothayan <s...@wso2.com>: >>> >>> >>>> >>>> >>>> On Wed, Aug 20, 2014 at 1:36 PM, Niranda Perera <nira...@wso2.com> >>>> wrote: >>>> >>>>> @Maninda, >>>>> >>>>> +1 for suggesting Spark SQL. >>>>> >>>>> Quote Databricks, >>>>> "Spark SQL provides state-of-the-art SQL performance and maintains >>>>> compatibility with Shark/Hive. In particular, like Shark, Spark SQL >>>>> supports all existing Hive data formats, user-defined functions (UDF), and >>>>> the Hive metastore." [1] >>>>> >>>>> But I am not entirely sure if Spark SQL and Siddhi are comparable, >>>>> because SparkSQL (like Hive) is designed for batch processing, whereas >>>>> Siddhi does real-time processing. But if there are implementations where >>>>> Siddhi is run on top of Spark, it would be very interesting. >>>>> >>>> Yes, Siddhi's current way of operation does not support this. But with >>>> partitions we can achieve this to some extent. >>>> >>>> Suho >>>> >>>>> >>>>> Spark supports either Hadoop 1 or 2. But I think we should see what is >>>>> best: MR1 or YARN+MR2. >>>>> >>>>> [image: Hadoop Architecture] >>>>> [2] >>>>> >>>>> [1] >>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html >>>>> [2] http://www.tomsitpro.com/articles/hadoop-2-vs-1,2-718.html >>>>> >>>>> >>>>> On Wed, Aug 20, 2014 at 1:13 PM, Lasantha Fernando <lasan...@wso2.com> >>>>> wrote: >>>>> >>>>>> Hi Maninda, >>>>>> >>>>>> On 20 August 2014 12:02, Maninda Edirisooriya <mani...@wso2.com> >>>>>> wrote: >>>>>> >>>>>>> Given the discontinuation of the Shark project, IMO we should not >>>>>>> move to Shark at all. >>>>>>> And it seems better to go with Spark SQL as we are already using >>>>>>> Spark for CEP. But I am not sure of the difference between Spark SQL and >>>>>>> the >>>>>>> Siddhi queries on the Spark engine. 
>>>>>>> >>>>>> >>>>>> Currently, we are doing integration with CEP using Apache Storm, not >>>>>> Spark... :-). Spark Streaming is a possible candidate for integrating >>>>>> with >>>>>> CEP, but we have opted for Storm. I think there has been some >>>>>> independent >>>>>> work on integrating Kafka + Spark Streaming + Siddhi. Please refer to >>>>>> the thread on arch@ "[Architecture] A few questions about WSO2 CEP >>>>>> /Siddhi" >>>>>> >>>>>> >>>>>> And we have to figure out how Spark SQL is used for historical data, >>>>>>> and whether it can execute incremental processing by default, which will >>>>>>> implement all our existing BAM use cases. >>>>>>> On the other hand, Hadoop 2 [1] uses a completely >>>>>>> different platform for resource allocation, known as YARN. Sometimes this >>>>>>> may be more suitable for batch jobs. >>>>>>> >>>>>>> [1] https://www.youtube.com/watch?v=RncoVN0l6dc >>>>>>> >>>>>>> >>>>>> Thanks, >>>>>> Lasantha >>>>>> >>>>>>> >>>>>>> *Maninda Edirisooriya* >>>>>>> Senior Software Engineer >>>>>>> >>>>>>> *WSO2, Inc. *lean.enterprise.middleware. >>>>>>> >>>>>>> *Blog* : http://maninda.blogspot.com/ >>>>>>> *E-mail* : mani...@wso2.com >>>>>>> *Skype* : @manindae >>>>>>> *Twitter* : @maninda >>>>>>> >>>>>>> >>>>>>> On Wed, Aug 20, 2014 at 11:33 AM, Niranda Perera <nira...@wso2.com> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Anjana and Srinath, >>>>>>>> >>>>>>>> After the discussion I had with Anjana, I researched more on the >>>>>>>> continuation of the Shark project by Databricks. >>>>>>>> >>>>>>>> Here's what I found out: >>>>>>>> - Shark was built on the Hive codebase and achieved performance >>>>>>>> improvements by swapping out the physical execution engine part of >>>>>>>> Hive. >>>>>>>> While this approach enabled Shark users to speed up their Hive queries, >>>>>>>> Shark inherited a large, complicated code base from Hive that made it >>>>>>>> hard >>>>>>>> to optimize and maintain. 
>>>>>>>> Hence, Databricks has announced that they are halting the >>>>>>>> development of Shark from July 2014. (Shark 0.9 will be the last >>>>>>>> release.) >>>>>>>> [1] >>>>>>>> - Shark will be replaced by Spark SQL. It beats Shark in TPC-DS >>>>>>>> performance >>>>>>>> <http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html> >>>>>>>> by almost an order of magnitude. It also supports all existing Hive >>>>>>>> data >>>>>>>> formats, user-defined functions (UDF), and the Hive metastore. [2] >>>>>>>> - The following is the Shark to Spark SQL migration plan: >>>>>>>> http://spark-summit.org/wp-content/uploads/2014/07/Future-of-Spark-Patrick-Wendell.pdf >>>>>>>> >>>>>>>> - For legacy Hive and MapReduce users, they have proposed a new >>>>>>>> 'Hive on Spark' project [3], [4]. >>>>>>>> But, given the performance enhancement, it is quite certain that >>>>>>>> Hive and MR will be replaced by engines built on top of Spark (ex: >>>>>>>> Spark >>>>>>>> SQL). >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> In my opinion there are a few matters to figure out if we are >>>>>>>> migrating from Hive: >>>>>>>> >>>>>>>> 1. whether we are changing the query engine only? (Then we can >>>>>>>> replace Hive with Shark.) >>>>>>>> 2. whether we are changing the existing Hadoop/MapReduce framework >>>>>>>> to Spark? (Then we can replace Hive and Hadoop with Spark and Spark >>>>>>>> SQL.) >>>>>>>> >>>>>>>> >>>>>>>> In my opinion, considering the long-term impact and the availability >>>>>>>> of support, it is best to migrate from Hive/Hadoop to Spark. >>>>>>>> It is open for discussion! >>>>>>>> >>>>>>>> In the meantime, I've already tried Spark SQL, and Databricks' >>>>>>>> claims of improved performance seem to be true. I will work more on >>>>>>>> this. 
>>>>>>>> >>>>>>>> Cheers >>>>>>>> >>>>>>>> [1] >>>>>>>> http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html >>>>>>>> [2] >>>>>>>> http://databricks.com/blog/2014/06/02/exciting-performance-improvements-on-the-horizon-for-spark-sql.html >>>>>>>> [3] https://issues.apache.org/jira/browse/HIVE-7292 >>>>>>>> [4] https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Aug 14, 2014 at 12:16 PM, Anjana Fernando <anj...@wso2.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Srinath, >>>>>>>>> >>>>>>>>> No, this has not been tested on multiple nodes. I told Niranda >>>>>>>>> in my last mail to test a cluster with the same set of hardware >>>>>>>>> that we are using to test our large data set with Hive. As for >>>>>>>>> the >>>>>>>>> effort to make the change, we still have to figure out the MT aspects >>>>>>>>> of >>>>>>>>> Shark here. Sinthuja was working on making the latest Hive version MT >>>>>>>>> ready, and most probably, we can make the same changes to the Hive >>>>>>>>> version >>>>>>>>> Shark is using. So after we do that, the integration should be >>>>>>>>> seamless. >>>>>>>>> And also, as I mentioned earlier here, we are also going to test this >>>>>>>>> with >>>>>>>>> the APIM Hive script, to check if there are any unforeseen >>>>>>>>> incompatibilities. >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> Anjana. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Aug 14, 2014 at 11:53 AM, Srinath Perera <srin...@wso2.com >>>>>>>>> > wrote: >>>>>>>>>> >>>>>>>>>> This looks great. >>>>>>>>>> >>>>>>>>>> We need to test Spark with multiple nodes. Did we do that? Please >>>>>>>>>> create a few VMs in the performance cloud (talk to Lakmal) and test with >>>>>>>>>> at least >>>>>>>>>> 5 nodes. We need to make sure it works OK in a distributed setup as >>>>>>>>>> well. >>>>>>>>>> >>>>>>>>>> What does it take to change to Spark? Anjana .. how much work is >>>>>>>>>> it? 
>>>>>>>>>> >>>>>>>>>> --Srinath >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Wed, Aug 13, 2014 at 7:06 PM, Niranda Perera <nira...@wso2.com >>>>>>>>>> > wrote: >>>>>>>>>> >>>>>>>>>>> Thank you Anjana. >>>>>>>>>>> >>>>>>>>>>> Yes, I am working on it. >>>>>>>>>>> >>>>>>>>>>> In the meantime, I found this in the Hive documentation [1]. It >>>>>>>>>>> talks about Hive on Spark, and compares Hive, Shark and Spark SQL >>>>>>>>>>> at a >>>>>>>>>>> higher architectural level. >>>>>>>>>>> >>>>>>>>>>> Additionally, it is said that the in-memory performance of Shark >>>>>>>>>>> can be improved by introducing Tachyon [2]. I guess we can consider >>>>>>>>>>> this >>>>>>>>>>> later on. >>>>>>>>>>> >>>>>>>>>>> Cheers. >>>>>>>>>>> >>>>>>>>>>> [1] >>>>>>>>>>> https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark#HiveonSpark-1.3ComparisonwithSharkandSparkSQL >>>>>>>>>>> [2] http://tachyon-project.org/Running-Tachyon-Locally.html >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Wed, Aug 13, 2014 at 3:17 PM, Anjana Fernando < >>>>>>>>>>> anj...@wso2.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Niranda, >>>>>>>>>>>> >>>>>>>>>>>> Excellent analysis of Hive vs Shark! .. This gives a lot of >>>>>>>>>>>> insight into how both operate in different scenarios. As the next >>>>>>>>>>>> step, we >>>>>>>>>>>> will need to run this on an actual cluster of computers. Since >>>>>>>>>>>> you've used >>>>>>>>>>>> a subset of the dataset of the 2014 DEBS challenge, we should use the >>>>>>>>>>>> full data >>>>>>>>>>>> set in a clustered environment and check this. Gokul is already >>>>>>>>>>>> working on >>>>>>>>>>>> the Hive-based setup for this; after that is done, you can create >>>>>>>>>>>> a Shark >>>>>>>>>>>> cluster on the same hardware and run the tests there, to get a >>>>>>>>>>>> clear >>>>>>>>>>>> comparison of how these two match up in a cluster. 
Until the setup >>>>>>>>>>>> is >>>>>>>>>>>> ready, do continue with your next steps on checking the RDD >>>>>>>>>>>> support and >>>>>>>>>>>> Spark SQL use. >>>>>>>>>>>> >>>>>>>>>>>> After these are done, we should also do a trial run of our own >>>>>>>>>>>> APIM Hive scripts, migrated to Shark. >>>>>>>>>>>> >>>>>>>>>>>> Cheers, >>>>>>>>>>>> Anjana. >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Aug 11, 2014 at 12:21 PM, Niranda Perera < >>>>>>>>>>>> nira...@wso2.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi all, >>>>>>>>>>>>> >>>>>>>>>>>>> I have been evaluating the performance of Shark (distributed >>>>>>>>>>>>> SQL query engine for Hadoop) against Hive. This is with the >>>>>>>>>>>>> objective of >>>>>>>>>>>>> exploring the possibility of moving the WSO2 BAM data processing (which >>>>>>>>>>>>> currently uses Hive) to Shark (and Apache Spark) for improved >>>>>>>>>>>>> performance. >>>>>>>>>>>>> >>>>>>>>>>>>> I am sharing my findings herewith. >>>>>>>>>>>>> >>>>>>>>>>>>> *AMP Lab Shark* >>>>>>>>>>>>> Shark can execute Hive QL queries up to 100 times faster than >>>>>>>>>>>>> Hive without any modification to the existing data or queries. It >>>>>>>>>>>>> supports >>>>>>>>>>>>> Hive's QL, metastore, serialization formats, and user-defined >>>>>>>>>>>>> functions, >>>>>>>>>>>>> providing seamless integration with existing Hive deployments and >>>>>>>>>>>>> a >>>>>>>>>>>>> familiar, more powerful option for new ones. [1] >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> *Apache Spark* Apache Spark is an open-source data analytics >>>>>>>>>>>>> cluster computing framework. It fits into the Hadoop open-source >>>>>>>>>>>>> community, >>>>>>>>>>>>> building on top of HDFS and promising performance up to 100 >>>>>>>>>>>>> times faster >>>>>>>>>>>>> than Hadoop MapReduce for certain applications. 
[2] >>>>>>>>>>>>> Official documentation: [3] >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> I carried out the comparison between the following Hive and >>>>>>>>>>>>> Shark releases, with input files ranging from 100 to 1 billion >>>>>>>>>>>>> entries. >>>>>>>>>>>>> >>>>>>>>>>>>> - QL engine: Apache Hive 0.11 vs. Shark 0.9.1 (latest release), >>>>>>>>>>>>> which uses Scala 2.10.3, Spark 0.9.1 and AMPLab's Hive 0.9.0 >>>>>>>>>>>>> - Framework: Hadoop 1.0.4 vs. Spark 0.9.1 >>>>>>>>>>>>> - File system: HDFS for both >>>>>>>>>>>>> >>>>>>>>>>>>> Attached herewith is a report which describes in detail >>>>>>>>>>>>> the performance comparison between Shark and Hive. >>>>>>>>>>>>> >>>>>>>>>>>>> hive_vs_shark >>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/folderview?id=0B1GsnfycTl32QTZqUktKck1Ucjg&usp=drive_web> >>>>>>>>>>>>> >>>>>>>>>>>>> hive_vs_shark_report.odt >>>>>>>>>>>>> <https://docs.google.com/a/wso2.com/file/d/0B1GsnfycTl32X3J5dTh6Slloa0E/edit?usp=drive_web> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> In summary, >>>>>>>>>>>>> >>>>>>>>>>>>> From the evaluation, the following conclusions can be derived. >>>>>>>>>>>>> >>>>>>>>>>>>> - Shark performs on par with Hive in DDL operations (CREATE/DROP >>>>>>>>>>>>> TABLE, DATABASE). Both engines show fairly constant >>>>>>>>>>>>> performance >>>>>>>>>>>>> as the input size increases. >>>>>>>>>>>>> - Shark performs on par with Hive in plain DML operations (LOAD, >>>>>>>>>>>>> INSERT), but when a DML operation is called in conjunction with a >>>>>>>>>>>>> data >>>>>>>>>>>>> retrieval operation (ex. 
INSERT <TBL> SELECT <PROP> FROM >>>>>>>>>>>>> <TBL>), Shark >>>>>>>>>>>>> significantly outperforms Hive, with a performance factor of >>>>>>>>>>>>> 10x+ (ranging >>>>>>>>>>>>> from 10x to 80x in some instances). The Shark performance factor >>>>>>>>>>>>> decreases as >>>>>>>>>>>>> the input size increases, while Hive performance remains fairly >>>>>>>>>>>>> constant. >>>>>>>>>>>>> - Shark clearly outperforms Hive in data retrieval >>>>>>>>>>>>> operations (FILTER, ORDER BY, JOIN). Hive performance is >>>>>>>>>>>>> fairly constant >>>>>>>>>>>>> in the data retrieval operations, while Shark's advantage >>>>>>>>>>>>> decreases as the >>>>>>>>>>>>> input size increases. But in every instance Shark >>>>>>>>>>>>> outperformed Hive with >>>>>>>>>>>>> a minimum performance factor of 5x+ (ranging from 5x to 80x in >>>>>>>>>>>>> some >>>>>>>>>>>>> instances). >>>>>>>>>>>>> >>>>>>>>>>>>> Please refer to the 'hive_vs_shark_report'; it presents all the >>>>>>>>>>>>> information about the queries and timings graphically. >>>>>>>>>>>>> >>>>>>>>>>>>> The code repository can also be found at >>>>>>>>>>>>> >>>>>>>>>>>>> https://github.com/nirandaperera/hiveToShark/tree/master/hiveVsShark >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Moving forward, I am currently working on the following. >>>>>>>>>>>>> >>>>>>>>>>>>> - Apache Spark's resilient distributed dataset (RDD) >>>>>>>>>>>>> abstraction (a collection of elements partitioned >>>>>>>>>>>>> across the nodes >>>>>>>>>>>>> of the cluster that can be operated on in parallel): the use >>>>>>>>>>>>> of RDDs and >>>>>>>>>>>>> their impact on performance. 
>>>>>>>>>>>>> - Spark SQL - Use of this Spark SQL over Shark on Spark >>>>>>>>>>>>> framework >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> [1] https://github.com/amplab/shark/wiki >>>>>>>>>>>>> [2] http://en.wikipedia.org/wiki/Apache_Spark >>>>>>>>>>>>> [3] http://spark.apache.org/docs/latest/ >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> Would love to have your feedback on this. >>>>>>>>>>>>> >>>>>>>>>>>>> Best regards >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> *Niranda Perera* >>>>>>>>>>>>> Software Engineer, WSO2 Inc. >>>>>>>>>>>>> Mobile: +94-71-554-8430 >>>>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> *Anjana Fernando* >>>>>>>>>>>> Senior Technical Lead >>>>>>>>>>>> WSO2 Inc. | http://wso2.com >>>>>>>>>>>> lean . enterprise . middleware >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> *Niranda Perera* >>>>>>>>>>> Software Engineer, WSO2 Inc. >>>>>>>>>>> Mobile: +94-71-554-8430 >>>>>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> ============================ >>>>>>>>>> Srinath Perera, Ph.D. >>>>>>>>>> http://people.apache.org/~hemapani/ >>>>>>>>>> http://srinathsview.blogspot.com/ >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> -- >>>>>>>>> *Anjana Fernando* >>>>>>>>> Senior Technical Lead >>>>>>>>> WSO2 Inc. | http://wso2.com >>>>>>>>> lean . enterprise . middleware >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> *Niranda Perera* >>>>>>>> Software Engineer, WSO2 Inc. 
>>>>>>>> Mobile: +94-71-554-8430 >>>>>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Architecture mailing list >>>>>>> Architecture@wso2.org >>>>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> *Lasantha Fernando* >>>>>> Software Engineer - Data Technologies Team >>>>>> WSO2 Inc. http://wso2.com >>>>>> >>>>>> email: lasan...@wso2.com >>>>>> mobile: (+94) 71 5247551 >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> *Niranda Perera* >>>>> Software Engineer, WSO2 Inc. >>>>> Mobile: +94-71-554-8430 >>>>> Twitter: @n1r44 <https://twitter.com/N1R44> >>>>> >>>>> _______________________________________________ >>>>> Architecture mailing list >>>>> Architecture@wso2.org >>>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>>> >>>>> >>>> >>>> >>>> -- >>>> >>>> *S. Suhothayan* >>>> Technical Lead & Team Lead of WSO2 Complex Event Processor >>>> *WSO2 Inc. *http://wso2.com >>>> * <http://wso2.com/>* >>>> lean . enterprise . middleware >>>> >>>> >>>> >>>> *cell: (+94) 779 756 757 <%28%2B94%29%20779%20756%20757> | blog: >>>> http://suhothayan.blogspot.com/ <http://suhothayan.blogspot.com/> twitter: >>>> http://twitter.com/suhothayan <http://twitter.com/suhothayan> | linked-in: >>>> http://lk.linkedin.com/in/suhothayan >>>> <http://lk.linkedin.com/in/suhothayan>* >>>> >>>> _______________________________________________ >>>> Architecture mailing list >>>> Architecture@wso2.org >>>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>>> >>>> >>> >>> _______________________________________________ >>> Architecture mailing list >>> Architecture@wso2.org >>> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >>> >>> >> >> >> -- >> ============================ >> Srinath Perera, Ph.D. 
>> http://people.apache.org/~hemapani/ >> http://srinathsview.blogspot.com/ >> >> _______________________________________________ >> Architecture mailing list >> Architecture@wso2.org >> https://mail.wso2.org/cgi-bin/mailman/listinfo/architecture >> >> > > > -- > *Niranda Perera* > Software Engineer, WSO2 Inc. > Mobile: +94-71-554-8430 > Twitter: @n1r44 <https://twitter.com/N1R44> >