Some friends referred me to this thread about OLAP/Kylin and Spark... Here are my 2 cents.
If you are trying to set up OLAP, Apache Kylin is a good option for you to evaluate. The project has been in development for more than two years and is about to graduate to an Apache Top-Level Project [1]. It already has many production deployments, including eBay, Exponential, JD.com, VIP.com and others; see the powered-by page [2].

Apache Kylin's Spark engine is also on the way; there is a discussion about tuning its performance [3].

A variety of clients can interact with Kylin using ANSI SQL, including Tableau, Zeppelin, Pentaho/Mondrian and Saiku/Mondrian, and Excel/Power BI support will roll out this week.

Apache Kylin is young but mature, validated by large-scale use cases (one of the biggest cubes at eBay contains 85+ billion rows, and the 90th-percentile query latency on the production platform is a few seconds).

Streaming OLAP is coming in Kylin v2.0 with a pluggable architecture. There is already one real production case inside eBay; please refer to our design deck [4].

Everyone is very welcome to join and contribute to Kylin as an OLAP engine for Big Data :-)

Please feel free to contact our community or me with any questions.

Thanks.

[1] http://s.apache.org/bah
[2] http://kylin.incubator.apache.org/community/poweredby.html
[3] http://s.apache.org/lHA
[4] http://www.slideshare.net/lukehan/1-apache-kylin-deep-dive-streaming-and-plugin-architecture-apache-kylin-meetup-shanghai
[5] http://kylin.io

Best Regards!
---------------------
Luke Han

On Tue, Nov 10, 2015 at 2:56 AM, tsh <t...@timshenkao.su> wrote:

> Hi,
>
> I'm in the same position right now: we are going to implement something
> like OLAP BI + Machine Learning explorations on the same cluster.
> Well, the question is quite ambivalent: on the one hand, we have terabytes
> of versatile data and the necessity to make something like cubes (Hive and
> Hive on HBase are unsatisfactory). On the other, our users are accustomed
> to Tableau + Vertica.
> So, right now I consider the following choices:
> 1) Platfora (not free, I don't know the price right now) + Spark
> 2) AtScale + Tableau (not free, I don't know the price right now) + Spark
> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
> storage
> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
> Flume (has somebody used it in production?)
> 5) Spark + Tableau (cubes?)
>
> For myself, I decided not to dive into Mesos. Cassandra is hard to
> configure; you'll have to dedicate a special employee to support it.
>
> I'll be glad to hear other ideas & propositions as we are at the beginning
> of the process too.
>
> Sincerely yours, Tim Shenkao
>
> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>
> Hi,
>
> Thanks for the suggestion. Actually we are now evaluating and stress-testing
> Spark SQL on Cassandra, while trying to define business models. FWIW, the
> solution mentioned here is different from a traditional OLAP cube engine,
> right? So we are hesitating over the general direction to choose for the
> OLAP architecture.
>
> And we are happy to hear more use cases from this community.
>
> Best,
> Sun.
>
> ------------------------------
> fightf...@163.com
>
> *From:* Jörn Franke <jornfra...@gmail.com>
> *Date:* 2015-11-09 14:40
> *To:* fightf...@163.com
> *CC:* user <u...@spark.apache.org>; dev <dev@spark.apache.org>
> *Subject:* Re: OLAP query using spark dataframe with cassandra
>
> Is there any distributor supporting these software components in
> combination? If not, and your core business is not software, then you may
> want to look for something else, because it might not make sense to build
> up internal know-how in all of these areas.
>
> In any case, it all depends highly on your data and queries. You will
> have to do your own experiments.
>
> On 09 Nov 2015, at 07:02, "fightf...@163.com" <fightf...@163.com> wrote:
>
> Hi, community
>
> We are especially interested in this feature integration, according to
> some slides from [1]. SMACK (Spark+Mesos+Akka+Cassandra+Kafka) seems a
> good implementation of the lambda architecture in the open-source world,
> especially for non-Hadoop-based cluster environments. As we can see, the
> advantages obviously consist of:
>
> 1 the feasibility and scalability of the Spark DataFrame API, which can
> also make a perfect complement to Apache Cassandra's native CQL feature.
>
> 2 both streaming and batch processing are available using the ALL-STACK
> thing, cool.
>
> 3 we can achieve both compacity and usability for Spark with Cassandra,
> including seamless integration with job scheduling and resource
> management.
>
> Our only concern is OLAP query performance, which is mainly affected by
> frequent aggregation work between large tables that grow daily, for both
> Spark SQL and Cassandra. I can see that the use case in [1] uses FiloDB to
> achieve columnar storage and query performance, but we have no further
> knowledge of it.
>
> The question is: has anyone had such a use case so far, especially in a
> production environment? We would be interested in your architecture for
> designing this OLAP engine using Spark + Cassandra. How do you think this
> scenario compares with a traditional OLAP cube design, like Apache Kylin
> or Pentaho Mondrian?
>
> Best Regards,
>
> Sun.
>
> [1] http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>
> ------------------------------
> fightf...@163.com
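
P.S. The contrast raised in this thread, aggregating at query time versus answering from a precomputed cube, can be sketched in a few lines of plain Python. This is only a toy illustration of the general idea, not how Kylin or Spark SQL is actually implemented. The sample rows, the dimension names and every function name below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (day, region, product, sales). Illustrative data only.
ROWS = [
    ("2015-11-09", "EU", "book", 10),
    ("2015-11-09", "US", "book", 5),
    ("2015-11-10", "EU", "pen", 3),
    ("2015-11-10", "EU", "book", 7),
]
DIMS = ("day", "region", "product")


def scan_aggregate(rows, group_by):
    """Query-time aggregation: every query scans all fact rows."""
    idx = [DIMS.index(d) for d in group_by]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def build_cube(rows):
    """Precompute one aggregate table per subset of the dimensions."""
    cube = {}
    for n in range(len(DIMS) + 1):
        for group_by in combinations(DIMS, n):
            cube[group_by] = scan_aggregate(rows, group_by)
    return cube


def cube_lookup(cube, group_by):
    """Cube query: pick the precomputed table, no scan of the fact rows."""
    key = tuple(d for d in DIMS if d in group_by)  # normalise dimension order
    return cube[key]


CUBE = build_cube(ROWS)  # paid once, at build time
print(cube_lookup(CUBE, ("region",)))
```

The trade-off is visible even at this size: the cube stores one table per dimension subset (2^3 = 8 here), so storage and build time grow quickly with the number of dimensions, while each query becomes a lookup instead of a scan.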