Jirka,

> But I am really interested how it can work well with Spark/Hadoop where
> you basically needs to read all the data as well (as far as I understand
> that).


I can't give you any benchmarking between technologies (nor am I
particularly interested in getting involved in such a discussion), but I
can share our experiences with Cassandra, Hadoop, and Spark over the
past 4+ years, and hopefully assure you that Cassandra+Spark is a smart
choice.

On a four-node cluster we were running 5000+ small Hadoop jobs each day,
each finishing within two minutes (often within one), resulting in
(give or take) a billion records read from and 150 million records
written to c*.
These small jobs incrementally process a limited set of partition keys
each time. They primarily read data from a "raw events store" that has a
TTL of 3 months and generates 22+ GB of tombstones a day (reads over old
partition keys are rare).
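
For illustration only, here's a minimal sketch of that incremental
pattern using the spark-cassandra-connector (we were running it on the
Hadoop input formats at the time; keyspace, table, and column names
below are made up):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: "events_ks", "raw_events", "user_id" etc. are hypothetical names.
    val sc = new SparkContext(
      new SparkConf()
        .setAppName("incremental-events")
        .set("spark.cassandra.connection.host", "127.0.0.1"))

    // Only read the partition keys touched since the last run,
    // instead of scanning the whole "raw events store".
    val recentKeys: Seq[String] = Seq("user-1", "user-2") // normally loaded from a checkpoint

    val processed = recentKeys
      .map(key => sc.cassandraTable("events_ks", "raw_events").where("user_id = ?", key))
      .reduce(_ union _)
      .map(row => (row.getString("user_id"), 1L))
      .reduceByKey(_ + _)

    // Write the aggregate back to a separate table.
    processed.saveToCassandra("events_ks", "user_event_counts",
      SomeColumns("user_id", "event_count"))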

We also run full-table-scan jobs and have never come across any issues
particular to them. There are Hadoop map/reduce settings to increase
durability if you have tables with troublesome partition keys.
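
The exact knobs depend on your Hadoop and Cassandra versions, but as a
rough sketch (tune the numbers for your own cluster):

    import org.apache.hadoop.conf.Configuration
    import org.apache.cassandra.hadoop.ConfigHelper

    val conf = new Configuration()

    // Smaller splits and batches mean a failed/retried map task re-reads less data
    // when it hits a troublesome (e.g. very wide) partition key.
    ConfigHelper.setInputSplitSize(conf, 64 * 1024) // rows per input split
    ConfigHelper.setRangeBatchSize(conf, 1000)      // rows fetched per request

    // Let map tasks retry more times before the whole job is failed.
    conf.setInt("mapreduce.map.maxattempts", 8)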

This is also a cluster that serves requests to web applications that
need low latency.

We recently wrote a Spark job that does full table scans over 100
million+ rows, involves a handful of stages (two tables, 9 maps, 4
reduces, and 2 joins), and writes 5 million rows back to a new table.
This job runs in ~260 seconds.
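
At a very high level the shape of that job looks something like the
following (all keyspace/table/column names here are placeholders, not
our real schema):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf()
        .setAppName("full-scan-rollup")
        .set("spark.cassandra.connection.host", "127.0.0.1"))

    // Full table scans over the two source tables.
    val events = sc.cassandraTable("ks", "events")
      .map(r => (r.getString("user_id"), r.getLong("duration_ms")))
    val profiles = sc.cassandraTable("ks", "profiles")
      .map(r => (r.getString("user_id"), r.getString("segment")))

    // A couple of reduce and join stages, then write the result to a new table.
    val totals = events.reduceByKey(_ + _)
    val result = totals.join(profiles)
      .map { case (userId, (totalMs, segment)) => (segment, userId, totalMs) }

    result.saveToCassandra("ks", "segment_user_totals",
      SomeColumns("segment", "user_id", "total_ms"))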

Spark is becoming a natural complement to schema evolution for
Cassandra, something you'll want to do to keep your schema optimised
for your read request patterns, even for little things like switching
clustering keys around.
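
For example, switching a clustering key comes down to creating the new
table in CQL and letting a small Spark job copy the data across (table
and column names below are hypothetical):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Assumes "events_by_type" was already created in CQL with the new clustering key.
    val sc = new SparkContext(
      new SparkConf()
        .setAppName("reshape-events")
        .set("spark.cassandra.connection.host", "127.0.0.1"))

    sc.cassandraTable("ks", "events_by_time")
      .map(r => (r.getString("user_id"),
                 r.getString("event_type"),
                 r.getLong("ts"),
                 r.getString("payload")))
      .saveToCassandra("ks", "events_by_type",
        SomeColumns("user_id", "event_type", "ts", "payload"))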

With any new technology, hitting some hurdles (especially if you go
wandering outside recommended practices) will of course be part of the
game, but that said, I've only had positive experiences with this
community's ability to help out (and do so quickly).

Starting from scratch I'd use Spark (on Scala) over Hadoop, no questions
asked.
Otherwise, Cassandra has always been our 'big data' platform;
Hadoop/Spark is just an extra tool on top.
We've never kept data in HDFS and are very grateful for having made that
choice.

~mck

ref
https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/
