My company is interested in building a real-time time-series querying solution 
using Spark and Cassandra.  Specifically, we're interested in setting up a 
Spark system against Cassandra running a Hive Thrift server.  We need to be 
able to perform real-time queries on time-series data - for example, how many 
accounts have spent more than $300 in total on product X in the past 3 months 
and have also purchased product Y in the past month?
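
To make the shape of that example query concrete, here is a minimal sketch.  
It uses an in-memory SQLite database purely as a stand-in, not Spark, 
Cassandra, or Druid.  The table name (purchases), its columns, and the sample 
rows are all hypothetical, and "3 months" / "1 month" are approximated as 90 
and 30 days.

```python
import sqlite3
from datetime import date, timedelta


def build_sample_db():
    """Create a tiny in-memory table with made-up purchase events."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases ("
        "account_id TEXT, product TEXT, amount REAL, purchased_on TEXT)"
    )
    rows = [
        ("a1", "X", 200.0, "2015-05-01"),
        ("a1", "X", 150.0, "2015-06-10"),
        ("a1", "Y", 20.0, "2015-06-20"),   # a1 satisfies both conditions
        ("a2", "X", 400.0, "2015-06-01"),  # a2 never bought Y
        ("a3", "X", 400.0, "2015-01-01"),  # a3's X spend is too old
        ("a3", "Y", 10.0, "2015-06-25"),
        ("a4", "X", 350.0, "2015-06-05"),
        ("a4", "Y", 10.0, "2015-04-20"),   # a4's Y purchase is too old
    ]
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)
    return conn


def count_matching_accounts(conn, today):
    """Count accounts with > $300 on X in ~3 months and any Y in ~1 month."""
    params = {
        "d90": (today - timedelta(days=90)).isoformat(),  # assumed "3 months"
        "d30": (today - timedelta(days=30)).isoformat(),  # assumed "1 month"
    }
    sql = """
        SELECT COUNT(*) FROM (
            SELECT account_id
            FROM purchases
            WHERE purchased_on >= :d90
            GROUP BY account_id
            HAVING SUM(CASE WHEN product = 'X' THEN amount ELSE 0 END) > 300
               AND SUM(CASE WHEN product = 'Y' AND purchased_on >= :d30
                            THEN 1 ELSE 0 END) > 0
        )
    """
    return conn.execute(sql, params).fetchone()[0]


if __name__ == "__main__":
    print(count_matching_accounts(build_sample_db(), date(2015, 6, 30)))
```

The point of the sketch is that each query is a per-account conditional 
aggregation over two overlapping time ranges, followed by a count.  Whatever 
system we pick has to do that over the full data set within the latency budget.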

These queries need to be fast - preferably sub-second but we can deal with a 
few seconds if absolutely necessary.  Once rolled up to monthly records, the 
data runs to millions of records - something on the order of 100M per 
customer.

My question is, based on experience, how hard would it be to get Cassandra and 
Spark working together to give us sub-second response times in this use case?  
Note that we'll need to use DataStax Enterprise (which is unappealing from a 
cost standpoint) because it's the only thing that provides the Hive Spark 
Thrift server for Cassandra.

The two top contenders for our solution are Spark+Cassandra and Druid.

Neither of these solutions works perfectly out of the box:

- Druid would need to be modified, possibly hacked, to support the queries we 
  require.  I'm also not clear how operationally ready it is.

- Cassandra and Spark would require paying for DataStax Enterprise.  It also 
  feels like it's going to be tricky to configure Cassandra and Spark to be 
  lightning fast for our use case.  Finally, window functions (which we need - 
  see above) are not supported unless we use a pre-release milestone of the 
  DataStax Spark Cassandra Connector.

I was wondering if anyone had any thoughts.  How easy is it to get Spark and 
Cassandra down to sub-second speeds in our use case?

Thanks,
Ben
