Re: OLAP query using spark dataframe with cassandra

David Morales Tue, 10 Nov 2015 00:46:23 -0800

Hi there,

Please consider our real-time aggregation engine, Sparkta, which is fully
open source (Apache License 2.0).


Here are some slides about the project:

   - http://www.slideshare.net/Stratio/strata-sparkta

And the source code:


   - https://github.com/Stratio/sparkta

Sparkta is a real-time aggregation engine built on Spark Streaming. You can
define your aggregation policy declaratively and also choose the output for
your rollups. In addition, you can store the raw data and transform data
on the fly, among other features.
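
To make the idea of a rollup concrete, here is a minimal sketch in plain
Python. It is not Sparkta's API or policy format: the function name
`rollup`, the event fields and the hourly granularity are all assumptions
made up for illustration. It keeps a count and a sum for every combination
of dimensions per time bucket, which is roughly what a cube pre-aggregation
does:

```python
from collections import defaultdict
from datetime import datetime
from itertools import combinations


def rollup(events, dimensions, measure, granularity="%Y-%m-%d %H:00"):
    """Pre-aggregate events into one (count, sum) cell per
    (time bucket, subset of dimensions). Hypothetical helper, for
    illustration only."""
    cells = defaultdict(lambda: [0, 0.0])
    for event in events:
        # Truncate the timestamp to the chosen granularity (hourly here).
        bucket = event["ts"].strftime(granularity)
        # Visit every subset of the dimensions, from () up to all of them.
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = (bucket, tuple((d, event[d]) for d in dims))
                cell = cells[key]
                cell[0] += 1                # running count
                cell[1] += event[measure]   # running sum
    return {key: tuple(cell) for key, cell in cells.items()}


# Made-up sample events.
events = [
    {"ts": datetime(2015, 11, 10, 8, 5), "country": "ES",
     "device": "mobile", "amount": 10.0},
    {"ts": datetime(2015, 11, 10, 8, 40), "country": "ES",
     "device": "web", "amount": 5.0},
    {"ts": datetime(2015, 11, 10, 9, 1), "country": "US",
     "device": "web", "amount": 7.5},
]

cube = rollup(events, ["country", "device"], "amount")

# A query against the cube is then a dictionary lookup, not a scan.
es_8am = cube[("2015-11-10 08:00", (("country", "ES"),))]
```

A streaming engine would update these cells incrementally per micro-batch
and write them to the chosen output instead of holding them in memory.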

When working with Cassandra, you may also find the Lucene integration that
we have released at Stratio useful:


   - http://www.slideshare.net/Stratio/cassandra-meetup-20150217
   - https://github.com/Stratio/cassandra-lucene-index


It is ready for use with Spark SQL or in your CQL queries.

We are now working on a SQL layer for querying the cubes in a flexible way,
but it is not available yet.

Do not hesitate to contact us if you have any questions.


Regards.

2015-11-10 8:16 GMT+01:00 Luke Han <luke...@gmail.com>:

> Some friends referred me to this thread about OLAP/Kylin and Spark...
>
> Here's my 2 cents..
>
> If you are trying to set up OLAP, Apache Kylin would be a good option for
> you to evaluate.
>
> The project has been in development for more than 2 years and is about to
> graduate to an Apache Top-Level Project [1].
> There are already many production deployments, including
> eBay, Exponential, JD.com, VIP.com and others; see the powered-by page [2].
>
> Apache Kylin's Spark engine is also on the way; there's a discussion about
> tuning its performance [3].
>
> A variety of clients are available to interact with Kylin through
> ANSI SQL, including Tableau, Zeppelin, Pentaho/Mondrian, Saiku/Mondrian,
> and Excel/PowerBI support will roll out this week.
>
> Apache Kylin is young but mature, validated by large use cases (one of the
> biggest cubes at eBay contains 85+ billion rows, and the 90th-percentile
> query latency on the production platform is a few seconds).
>
> StreamingOLAP is coming in Kylin v2.0 with a pluggable architecture.
> There is already one real case in production inside eBay; please refer to
> our design deck [4].
>
> We really welcome everyone to join and contribute to Kylin as an OLAP
> engine for Big Data :-)
>
> Please feel free to contact our community or me with any questions.
>
> Thanks.
>
> 1. http://s.apache.org/bah
> 2. http://kylin.incubator.apache.org/community/poweredby.html
> 3. http://s.apache.org/lHA
> 4. http://www.slideshare.net/lukehan/1-apache-kylin-deep-dive-streaming-and-plugin-architecture-apache-kylin-meetup-shanghai
> 5. http://kylin.io
>
>
> Best Regards!
> ---------------------
>
> Luke Han
>
> On Tue, Nov 10, 2015 at 2:56 AM, tsh <t...@timshenkao.su> wrote:
>
>> Hi,
>>
>> I'm in the same position right now: we are going to implement something
>> like OLAP BI + Machine Learning explorations on the same cluster.
>> Well, the question cuts both ways: on the one hand, we have terabytes
>> of varied data and need to build something like cubes (Hive and
>> Hive on HBase are unsatisfactory). On the other, our users are accustomed
>> to Tableau + Vertica.
>> So, right now I consider the following choices:
>> 1) Platfora (not free, I don't know the price right now) + Spark
>> 2) AtScale + Tableau (not free, I don't know the price right now) + Spark
>> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some
>> storage
>> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka +
>> Flume (has anybody used it in production?)
>> 5) Spark + Tableau (cubes?)
>>
>> For myself, I decided not to dive into Mesos. Cassandra is hard to
>> configure; you'll have to dedicate a special employee to support it.
>>
>> I'll be glad to hear other ideas & proposals, as we are at the
>> beginning of the process too.
>>
>> Sincerely yours, Tim Shenkao
>>
>>
>> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>>
>> Hi,
>>
>> Thanks for the suggestion. Actually, we are now evaluating and
>> stress-testing Spark SQL on Cassandra while trying to define our business
>> models. FWIW, the solution mentioned here is different from a traditional
>> OLAP cube engine, right? So we are still undecided on the overall
>> direction for our OLAP architecture.
>>
>> And we are happy to hear more use cases from this community.
>>
>> Best,
>> Sun.
>>
>> ------------------------------
>> fightf...@163.com
>>
>>
>> *From:* Jörn Franke <jornfra...@gmail.com>
>> *Date:* 2015-11-09 14:40
>> *To:* fightf...@163.com
>> *CC:* user <user@spark.apache.org>; dev <d...@spark.apache.org>
>> *Subject:* Re: OLAP query using spark dataframe with cassandra
>>
>> Is there any distributor supporting these software components in
>> combination? If not, and your core business is not software, then you may
>> want to look for something else, because it might not make sense to build
>> up internal know-how in all of these areas.
>>
>> In any case, it all depends heavily on your data and queries. You will
>> have to do your own experiments.
>>
>> On 09 Nov 2015, at 07:02, "fightf...@163.com" <fightf...@163.com> wrote:
>>
>> Hi, community
>>
>> We are especially interested in this integration, as described in the
>> slides at [1]. The SMACK stack (Spark + Mesos + Akka + Cassandra + Kafka)
>> seems a good implementation of the lambda architecture in the open-source
>> world, especially for non-Hadoop cluster environments. As we can see,
>> the advantages are:
>>
>> 1. The feasibility and scalability of the Spark DataFrame API, which also
>> complements Apache Cassandra's native CQL features well.
>>
>> 2. Both streaming and batch processing are available within the same
>> stack, which is cool.
>>
>> 3. We can achieve both compactness and usability for Spark with Cassandra,
>> including seamless integration with job scheduling and resource
>> management.
>>
>> Our only concern is OLAP query performance, which suffers mainly from
>> frequent aggregation across large tables that grow daily, for both
>> Spark SQL and Cassandra. I can see that the use case in [1] uses FiloDB
>> to achieve columnar storage and query performance, but we know nothing
>> more about it.
>>
>> The question is: has anyone had such a use case, especially in a
>> production environment? We would be interested in your architecture for
>> designing this OLAP engine using Spark + Cassandra. How do you think this
>> scenario compares with a traditional OLAP cube design, such as
>> Apache Kylin or Pentaho Mondrian?
>>
>> Best Regards,
>>
>> Sun.
>>
>>
>> [1] http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>>
>> ------------------------------
>> fightf...@163.com
>>
>>
>>
>


-- 

David Morales de Frías  ::  +34 607 010 411 :: @dmoralesdf
<https://twitter.com/dmoralesdf>


<http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 828 6473 // www.stratio.com // *@stratiobd
<https://twitter.com/StratioBD>*
