Evaluating Spark + Cassandra for our use cases

2015-08-18 Thread Benjamin Ross
My company is interested in building a real-time time-series querying solution 
using Spark and Cassandra.  Specifically, we're interested in setting up Spark 
against Cassandra, with a Hive Thrift server in front.  We need to be able to 
perform real-time queries on time-series data - things like: how many accounts 
have spent more than $300 in total on product X in the past 3 months, and also 
purchased product Y in the past month?
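
To make that concrete, here's a rough sketch of the query shape in Spark SQL 
(Scala), assuming a rolled-up table called monthly_purchases with columns 
account_id, product, month ('yyyy-MM' strings) and total_spent - all of these 
names are invented for illustration:

    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)  // sc: an existing SparkContext

    // Accounts that spent more than $300 in total on product X over the
    // past 3 months AND purchased product Y in the past month.
    val result = sqlContext.sql("""
      SELECT spent.account_id
      FROM (
        SELECT account_id
        FROM monthly_purchases
        WHERE product = 'X' AND month >= '2015-06'
        GROUP BY account_id
        HAVING SUM(total_spent) > 300
      ) spent
      JOIN (
        SELECT DISTINCT account_id
        FROM monthly_purchases
        WHERE product = 'Y' AND month >= '2015-08'
      ) recent
      ON spent.account_id = recent.account_id
    """)
    result.show()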

These queries need to be fast - preferably sub-second, though we can tolerate 
a few seconds if absolutely necessary.  The data sizes are in the millions of 
records when rolled up to per-month records - something on the order of 100M 
per customer.

My question is, based on experience, how hard would it be to get Cassandra and 
Spark working together to give us sub-second response times in this use case?  
Note that we'll need to use DataStax Enterprise (which is unappealing from a 
cost standpoint) because it's the only distribution that provides the 
Hive/Spark Thrift server on top of Cassandra.
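
For reference, the way we'd expect clients to hit that Thrift server is 
ordinary Hive JDBC - a minimal sketch, where the host, port and table name 
are all placeholders:

    import java.sql.DriverManager

    // The Spark SQL Thrift server speaks the HiveServer2 protocol, so a
    // standard Hive JDBC driver can query it.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://dse-host:10000/default", "user", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT COUNT(*) FROM monthly_purchases")
    while (rs.next()) println(rs.getLong(1))
    conn.close()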

The two top contenders for our solution are Spark+Cassandra and Druid.

Neither of these solutions works perfectly out of the box:

-  Druid would need to be modified, possibly hacked, to support the 
queries we require.  I'm also not clear on how operationally ready it is.

-  Cassandra and Spark would require paying for DataStax Enterprise.  
It also feels like it's going to be tricky to configure Cassandra and Spark 
to be lightning fast for our use case.  Finally, window functions (which we 
need - see above) are not supported unless we use a pre-release milestone of 
the DataStax Spark Cassandra Connector.

I was wondering if anyone had any thoughts.  How easy is it to get Spark and 
Cassandra down to sub-second speeds in our use case?

Thanks,
Ben


Re: Evaluating Spark + Cassandra for our use cases

2015-08-18 Thread Jörn Franke
Hi,

First, you need to make your SLAs clear. It does not sound to me like they
are well defined, or that your proposed solution is necessary for the
scenario. I also find it hard to believe that one customer has 100 million
transactions per month.

Time-series data is easy to precalculate - you do not necessarily need
in-memory technology here.
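
For example, a periodic batch job along these lines is usually enough - the
interactive queries then only touch the small rolled-up table. This is just
a sketch: the table and column names are invented, I assume event_date is a
'yyyy-MM-dd' string, and sqlContext is a HiveContext:

    // One row per (account, product, month), so interactive queries
    // never have to scan the raw transactions.
    val rollup = sqlContext.sql("""
      SELECT account_id, product,
             substr(event_date, 1, 7) AS month,
             SUM(amount) AS total_spent
      FROM raw_transactions
      GROUP BY account_id, product, substr(event_date, 1, 7)
    """)
    rollup.write.format("parquet").saveAsTable("monthly_purchases")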

I recommend that your company do a proof of concept and get more
details/clarification on the requirements before risking millions of dollars
of investment.



RE: Evaluating Spark + Cassandra for our use cases

2015-08-18 Thread Benjamin Ross
Hi Jörn,
Of course we're planning on doing a proof of concept here - the difficulty is 
that our timeline is short, so we cannot afford too many PoCs before we have 
to make a decision.  We also need to figure out *which* databases are worth a 
proof of concept.

Note that one tricky aspect of our problem is that we need to support window 
functions partitioned on a per-account basis.  I've found that support for 
window functions is very limited in most databases, and they're also 
generally slow where available.
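
To be concrete about what we need: a window partitioned per account, e.g. a 
running total of spend.  With a HiveContext (and, against Cassandra, the 
milestone connector), the sketch looks like this - table and column names 
are placeholders again:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.sum

    // Window functions require a HiveContext in Spark 1.4/1.5.
    val monthly = sqlContext.table("monthly_purchases")

    // Running total of spend, computed independently for each account.
    val w = Window.partitionBy("account_id").orderBy("month")
    val withRunning =
      monthly.withColumn("running_spend", sum("total_spent").over(w))
    withRunning.show()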

Also, one customer certainly does not have 100M transactions per month.  
There are 100M transactions total for a given customer when we roll 
everything up to be per-month.  We do not care about granularity finer than 
a month.  There are also many columns that we care about - on the order of 
several thousand.

What makes you suggest that we do not need in-memory technology?

Ben


