Re: OLAP query using spark dataframe with cassandra

Luke Han Mon, 09 Nov 2015 23:17:32 -0800

Some friends referred me to this thread about OLAP/Kylin and Spark...

Here are my 2 cents.


If you are trying to set up OLAP, Apache Kylin is a good option for you to
evaluate.

The project has been in development for more than 2 years and is about to
graduate to an Apache Top-Level Project [1].
There are already many production deployments, including eBay, Exponential,
JD.com, VIP.com and others; refer to the powered-by page [2].

Apache Kylin's Spark engine is also on the way; there's a discussion about
tuning its performance [3].

A variety of clients are available to interact with Kylin using ANSI SQL,
including Tableau, Zeppelin, Pentaho/Mondrian and Saiku/Mondrian; Excel and
Power BI support will roll out this week.

Apache Kylin is young but mature, validated by large deployments (one of the
biggest cubes at eBay contains 85+ billion rows, and the production
platform's 90th-percentile query latency is a few seconds).
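
For readers new to cube engines, here is a minimal Python sketch of the
general idea behind pre-aggregated cubes: sums are computed ahead of time
for every combination of dimensions (each combination is a "cuboid"), so a
GROUP BY becomes a lookup instead of a scan. This is an illustration only,
not Kylin's actual implementation; the rows, the dimension names and the
helpers build_cube and query are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows; "sales" is the measure, the rest are dimensions.
ROWS = [
    {"site": "US", "category": "books", "day": "2015-11-08", "sales": 10},
    {"site": "US", "category": "toys", "day": "2015-11-08", "sales": 5},
    {"site": "UK", "category": "books", "day": "2015-11-09", "sales": 7},
    {"site": "US", "category": "books", "day": "2015-11-09", "sales": 3},
]
DIMENSIONS = ("site", "category", "day")


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of dimensions (every cuboid)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, dimensions, group_by):
    """Answer a GROUP BY by looking up the matching precomputed cuboid."""
    # Normalise the requested columns to the canonical dimension order.
    dims = tuple(d for d in dimensions if d in group_by)
    return cube[dims]


cube = build_cube(ROWS, DIMENSIONS, "sales")
sales_by_site = query(cube, DIMENSIONS, {"site"})
grand_total = query(cube, DIMENSIONS, set())
```

The trade-off is build time and storage (2^n cuboids for n dimensions in
this naive form) in exchange for query latency that no longer depends on
the number of fact rows.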

Streaming OLAP is coming in Kylin v2.0 with a pluggable architecture; there's
already one real case in production inside eBay. Please refer to our design
deck [4].

Everyone is welcome to join and contribute to Kylin as an OLAP engine for
Big Data :-)

Please feel free to contact our community or me with any questions.

Thanks.

1. http://s.apache.org/bah
2. http://kylin.incubator.apache.org/community/poweredby.html
3. http://s.apache.org/lHA
4. http://www.slideshare.net/lukehan/1-apache-kylin-deep-dive-streaming-and-plugin-architecture-apache-kylin-meetup-shanghai
5. http://kylin.io


Best Regards!
---------------------

Luke Han

On Tue, Nov 10, 2015 at 2:56 AM, tsh <t...@timshenkao.su> wrote:

> Hi,
>
> I'm in the same position right now: we are going to implement something
> like OLAP BI + Machine Learning explorations on the same cluster.
> Well, the question cuts both ways: on the one hand, we have terabytes
> of varied data and need to build something like cubes (Hive and
> Hive on HBase are unsatisfactory). On the other, our users are accustomed
> to Tableau + Vertica.
> So, right now I am considering the following choices:
> 1) Platfora (not free, I don't know the price right now) + Spark
> 2) AtScale + Tableau (not free, I don't know the price right now) + Spark
> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
> storage
> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
> Flume (has anybody used it in production?)
> 5) Spark + Tableau (cubes?)
>
> For myself, I decided not to dive into Mesos. Cassandra is hard to
> configure; you'll have to dedicate an employee to support it.
>
> I'll be glad to hear other ideas and suggestions, as we are at the
> beginning of the process too.
>
> Sincerely yours, Tim Shenkao
>
>
> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>
> Hi,
>
> Thanks for the suggestion. Actually, we are now evaluating and
> stress-testing Spark SQL on Cassandra while trying to define business
> models. FWIW, the solution mentioned here is different from a traditional
> OLAP cube engine, right? So we are hesitating over the general direction
> of our OLAP architecture.
>
> And we are happy to hear more use cases from this community.
>
> Best,
> Sun.
>
> ------------------------------
> fightf...@163.com
>
>
> *From:* Jörn Franke <jornfra...@gmail.com>
> *Date:* 2015-11-09 14:40
> *To:* fightf...@163.com
> *CC:* user <u...@spark.apache.org>; dev <dev@spark.apache.org>
> *Subject:* Re: OLAP query using spark dataframe with cassandra
>
> Is there any distributor supporting these software components in
> combination? If not, and your core business is not software, then you may
> want to look for something else, because it might not make sense to build
> up internal know-how in all of these areas.
>
> In any case, it all depends heavily on your data and queries. You will
> have to do your own experiments.
>
> On 09 Nov 2015, at 07:02, "fightf...@163.com" <fightf...@163.com> wrote:
>
> Hi, community
>
> We are especially interested in this integration, based on some slides
> from [1]. SMACK (Spark+Mesos+Akka+Cassandra+Kafka) seems to be a good
> implementation of the lambda architecture in the open-source world,
> especially for non-Hadoop-based cluster environments. As we can see, the
> advantages are:
>
> 1 the feasibility and scalability of the Spark DataFrame API, which also
> complements Apache Cassandra's native CQL features well.
>
> 2 both streaming and batch processing are available with the full stack,
> cool.
>
> 3 we can achieve both compatibility and usability for Spark with
> Cassandra, including seamless integration with job scheduling and
> resource management.
>
> Our only concern is OLAP query performance, which suffers mainly from
> frequent aggregation work across large tables that grow daily, for both
> Spark SQL and Cassandra. I can see that the use case in [1] uses FiloDB
> to achieve columnar storage and query performance, but we know nothing
> more about it.
>
> Question: has anyone had such a use case, especially in a production
> environment? We would be interested in your architecture for designing
> this OLAP engine using Spark + Cassandra. How do you think this scenario
> compares with a traditional OLAP cube design, like Apache Kylin or
> Pentaho Mondrian?
>
> Best Regards,
>
> Sun.
>
>
> [1] http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>
> ------------------------------
> fightf...@163.com
>
>
>
