Please consider using a NoSQL engine such as HBase. 

Cheers

> On Nov 9, 2015, at 3:03 PM, Andrés Ivaldi <iaiva...@gmail.com> wrote:
> 
> Hi,
> I'm also considering something similar. Plain Spark is too slow for my case. A 
> possible solution is to use Spark as a multi-source connector and basic 
> transformation layer, then persist the information (currently into an RDBMS); 
> after that, our own engine builds a kind of cube query on top of it, and the 
> result is processed again by Spark, which adds Machine Learning.
> Our missing piece is replacing the RDBMS with something more suitable and 
> scalable; we don't mind pre-processing the information as long as the queries 
> are fast afterwards.
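> 
> For what it's worth, a rough sketch of that first stage with the Spark 1.5 
> DataFrame API (the file paths, JDBC URLs, tables and columns below are only 
> placeholders, and the CSV source assumes the spark-csv package):
> 
>   import java.util.Properties
>   import org.apache.spark.{SparkConf, SparkContext}
>   import org.apache.spark.sql.{SQLContext, SaveMode}
> 
>   val sc = new SparkContext(new SparkConf().setAppName("multi-source-etl"))
>   val sqlContext = new SQLContext(sc)
> 
>   // Source 1: flat files; Source 2: a table in the existing RDBMS
>   val sales = sqlContext.read.format("com.databricks.spark.csv")
>     .option("header", "true").load("/data/sales")
>   val orders = sqlContext.read.format("jdbc")
>     .option("url", "jdbc:postgresql://dbhost:5432/erp")
>     .option("dbtable", "public.orders").load()
> 
>   // Basic transformation layer: join and aggregate before persisting
>   val staged = sales.join(orders, sales("order_id") === orders("id"))
>     .groupBy("region", "product").sum("amount")
> 
>   // Persist into the RDBMS that the cube engine queries afterwards
>   staged.write.mode(SaveMode.Overwrite)
>     .jdbc("jdbc:postgresql://dbhost:5432/dw", "staging.sales_agg", new Properties())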
> 
> Regards
> 
>> On Mon, Nov 9, 2015 at 3:56 PM, tsh <t...@timshenkao.su> wrote:
>> Hi,
>> 
>> I'm in the same position right now: we are going to implement something like 
>> OLAP BI + Machine Learning explorations on the same cluster.
>> Well, the question is quite ambivalent: on one hand, we have terabytes of 
>> versatile data and the need to build something like cubes (Hive and Hive on 
>> HBase are unsatisfactory). On the other, our users are accustomed to 
>> Tableau + Vertica. 
>> So, right now I consider the following choices:
>> 1) Platfora (not free, I don't know the price right now) + Spark
>> 2) AtScale + Tableau (not free, I don't know the price right now) + Spark
>> 3) Apache Kylin (young project?) + Spark on YARN + Kafka + Flume + some 
>> storage
>> 4) Apache Phoenix + Apache HBase + Mondrian + Spark on YARN + Kafka + Flume 
>> (has anybody used it in production?)
>> 5) Spark + Tableau (cubes?)
>> 
>> For myself, I decided not to dive into Mesos. Cassandra is hard to configure; 
>> you'll have to dedicate a special employee to support it.
>> 
>> I'll be glad to hear other ideas & propositions as we are at the beginning 
>> of the process too.
>> 
>> Sincerely yours, Tim Shenkao
>> 
>> 
>>> On 11/09/2015 09:46 AM, fightf...@163.com wrote:
>>> Hi, 
>>> 
>>> Thanks for the suggestion. Actually we are now evaluating and stress-testing 
>>> Spark SQL on Cassandra while trying to define our business models. FWIW, the 
>>> solution mentioned here is different from a traditional OLAP cube engine, 
>>> right? So we are hesitant about the overall direction of the OLAP 
>>> architecture. 
>>> 
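>>> For reference, the kind of query we are stress-testing looks roughly like 
>>> this, using the DataStax spark-cassandra-connector (the host, keyspace, 
>>> table and column names are placeholders):
>>> 
>>>   import org.apache.spark.{SparkConf, SparkContext}
>>>   import org.apache.spark.sql.SQLContext
>>> 
>>>   val conf = new SparkConf()
>>>     .setAppName("olap-eval")
>>>     .set("spark.cassandra.connection.host", "10.0.0.1")  // placeholder host
>>>   val sc = new SparkContext(conf)
>>>   val sqlContext = new SQLContext(sc)
>>> 
>>>   // Expose a Cassandra table as a DataFrame
>>>   sqlContext.read
>>>     .format("org.apache.spark.sql.cassandra")
>>>     .options(Map("keyspace" -> "dw", "table" -> "events"))
>>>     .load()
>>>     .registerTempTable("events")
>>> 
>>>   // A typical rollup we measure: a wide group-by over a large table
>>>   val byDay = sqlContext.sql(
>>>     """SELECT event_date, category, COUNT(*) AS cnt, SUM(amount) AS total
>>>        FROM events GROUP BY event_date, category""")
>>>   byDay.show()
>>> 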
>>> And we are happy to hear more use cases from this community. 
>>> 
>>> Best,
>>> Sun. 
>>> 
>>> fightf...@163.com
>>>  
>>> From: Jörn Franke
>>> Date: 2015-11-09 14:40
>>> To: fightf...@163.com
>>> CC: user; dev
>>> Subject: Re: OLAP query using spark dataframe with cassandra
>>> 
>>> Is there any distributor supporting these software components in 
>>> combination? If not, and your core business is not software, then you may 
>>> want to look for something else, because it might not make sense to build up 
>>> internal know-how in all of these areas.
>>> 
>>> In any case - it depends all highly on your data and queries. You will have 
>>> to do your own experiments.
>>> 
>>> On 09 Nov 2015, at 07:02, "fightf...@163.com" <fightf...@163.com> wrote:
>>> 
>>>> Hi, community
>>>> 
>>>> We are especially interested in this feature integration based on some 
>>>> slides from [1]. The SMACK stack (Spark + Mesos + Akka + Cassandra + Kafka) 
>>>> seems to be a good implementation of the lambda architecture in the 
>>>> open-source world, especially for non-Hadoop-based cluster environments. As 
>>>> we see it, the advantages consist of:
>>>> 
>>>> 1 the feasibility and scalability of the Spark DataFrame API, which also 
>>>> makes a perfect complement to Apache Cassandra's native CQL features (see 
>>>> the sketch after this list);
>>>> 
>>>> 2 availability of both streaming and batch processing on the same stack, 
>>>> which is cool;
>>>> 
>>>> 3 we can achieve both capacity and usability for Spark with Cassandra, 
>>>> including seamless integration with job scheduling and resource management.
>>>> 
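>>>> To illustrate point 1, a rough sketch of a join plus group-by that plain 
>>>> CQL cannot express directly but the DataFrame API handles (keyspace, table 
>>>> and column names are only placeholders):
>>>> 
>>>>   // run in spark-shell with the spark-cassandra-connector on the classpath,
>>>>   // so an existing SparkContext `sc` is available
>>>>   import org.apache.spark.sql.SQLContext
>>>> 
>>>>   val sqlContext = new SQLContext(sc)
>>>>   def table(name: String) = sqlContext.read
>>>>     .format("org.apache.spark.sql.cassandra")
>>>>     .options(Map("keyspace" -> "shop", "table" -> name))
>>>>     .load()
>>>> 
>>>>   val orders    = table("orders")
>>>>   val customers = table("customers")
>>>> 
>>>>   // CQL has no joins or arbitrary GROUP BY; the DataFrame API fills that gap
>>>>   val revenueBySegment = orders
>>>>     .join(customers, orders("customer_id") === customers("id"))
>>>>     .groupBy(customers("segment"))
>>>>     .sum("amount")
>>>>   revenueBySegment.show()
>>>> 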
>>>> Our only concern is the OLAP query performance issue, which is mainly 
>>>> caused by frequent aggregation work over large, daily-growing tables, for 
>>>> both Spark SQL and Cassandra. I can see that the use case in [1] uses 
>>>> FiloDB to achieve columnar storage and query performance, but we have 
>>>> little further knowledge of it. 
>>>> 
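>>>> To make that concern concrete, the kind of aggregation we would have to 
>>>> rerun as the tables grow day by day looks roughly like this (table and 
>>>> column names are only placeholders):
>>>> 
>>>>   // run in spark-shell with the spark-cassandra-connector, so `sc` exists
>>>>   import org.apache.spark.sql.SQLContext
>>>>   import org.apache.spark.sql.functions._
>>>> 
>>>>   val sqlContext = new SQLContext(sc)
>>>>   val facts = sqlContext.read
>>>>     .format("org.apache.spark.sql.cassandra")
>>>>     .options(Map("keyspace" -> "dw", "table" -> "fact_events"))
>>>>     .load()
>>>> 
>>>>   // The expensive part: a full scan plus a wide aggregation, repeated
>>>>   // while the underlying table gains one day of data every day
>>>>   val rollup = facts
>>>>     .groupBy(col("event_date"), col("dimension_a"), col("dimension_b"))
>>>>     .agg(count(lit(1)).as("cnt"), sum("measure").as("measure_total"))
>>>>   rollup.count()  // forces the aggregation; this is what we benchmark
>>>> 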
>>>> The question is: has anyone had such a use case so far, especially in a 
>>>> production environment? We would be interested in your architecture for 
>>>> designing this OLAP engine using Spark + Cassandra. How do you think this 
>>>> scenario compares with a traditional OLAP cube design, like Apache Kylin or 
>>>> Pentaho Mondrian? 
>>>> 
>>>> Best Regards,
>>>> 
>>>> Sun.
>>>> 
>>>> 
>>>> [1]  
>>>> http://www.slideshare.net/planetcassandra/cassandra-summit-2014-interactive-olap-queries-using-apache-cassandra-and-spark
>>>> 
>>>> fightf...@163.com
> 
> 
> 
> -- 
> Ing. Ivaldi Andres
