Hi,

Thanks for the reference; I really appreciate that you shared your
experience.

Could you please share how much data you store on the cluster and what
the hardware configuration of the nodes is? I am really impressed that
you are able to read 100M records in ~4 minutes on 4 nodes. That works
out to roughly 100k reads per second per node, which is far beyond what
we currently achieve.

This leads me to wonder whether reading from Spark goes through
Cassandra's JVM, and thus the normal read path, or whether it reads the
SSTables directly from disk sequentially and filters out old/tombstoned
values by itself. If it is the latter, then I understand why it can
perform that well.
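
To make sure I follow, I picture the Spark job you describe below (full
table scans over two tables, some maps/reduces and joins, then a write
back) as something roughly like this much-simplified sketch with the
spark-cassandra-connector; the keyspace, table, and column names are
only placeholders, not your actual schema:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object FullScanJoinSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("full-scan-join-sketch")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point
        val sc = new SparkContext(conf)

        // Full scan of two (placeholder) tables, keyed for the join.
        val events = sc.cassandraTable("my_ks", "events")
          .map(row => (row.getString("user_id"), row.getLong("value")))
        val users = sc.cassandraTable("my_ks", "users")
          .map(row => (row.getString("user_id"), row.getString("segment")))

        // Join on the key, aggregate, and write the result to a new table.
        events.join(users)
          .map { case (_, (value, segment)) => (segment, value) }
          .reduceByKey(_ + _)
          .saveToCassandra("my_ks", "value_by_segment",
            SomeColumns("segment", "total_value"))

        sc.stop()
      }
    }

Is that roughly the shape of it?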

Thank you.

Regards
Jirka H.

On 02/14/2015 09:17 PM, mck wrote:
> Jirka,
>
>> But I am really interested in how it can work well with Spark/Hadoop,
>> where you basically need to read all the data as well (as far as I
>> understand it).
>
> I can't give you any benchmarking between technologies (nor am I
> particularly interested in getting involved in such a discussion), but I
> can share our experiences with Cassandra, Hadoop, and Spark over the
> past 4+ years, and hopefully assure you that Cassandra+Spark is a smart
> choice.
>
> On a four-node cluster we were running 5000+ small Hadoop jobs each day,
> each finishing within two minutes, often within one, resulting in (give
> or take) a billion records read from and 150 million records written
> to C*.
> These small jobs incrementally process a limited set of partition keys
> each time. They primarily read data from a "raw events store" that has a
> TTL of 3 months and generates 22+ GB of tombstones a day (reads over
> old partition keys are rare).
>
> We also run full-table-scan jobs and have never come across any issues
> particular to that. There are Hadoop map/reduce settings to increase
> durability if you have tables with troublesome partition keys.
>
> This is also a cluster that serves requests to web applications that
> need low latency.
>
> We recently wrote a Spark job that does full table scans over 100
> million+ rows, involves a handful of stages (two tables, 9 maps, 4
> reduces, and 2 joins), and writes 5 million rows back to a new table.
> This job runs in ~260 seconds.
>
> Spark is becoming a natural complement to schema evolution for
> Cassandra, something you'll want to do to keep your schema optimised
> against your read request patterns, even for little things like
> switching clustering keys around.
>
> With any new technology, hitting some hurdles (especially if you go
> wandering outside recommended practices) will of course be part of the
> game, but that said I've only had positive experiences with this
> community's ability to help out (and do so quickly).
>
> Starting from scratch, I'd use Spark (in Scala) over Hadoop, no
> questions asked.
> Otherwise, Cassandra has always been our 'big data' platform;
> Hadoop/Spark is just an extra tool on top.
> We've never kept data in HDFS and are very grateful for having made
> that choice.
>
> ~mck
>
> ref
> https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/
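
PS: on the schema-evolution point above (e.g. switching clustering keys
around), I assume that in practice it comes down to reading the old
table and writing everything into a new table created with the
clustering order you actually query by, roughly like this sketch (again
with the spark-cassandra-connector and placeholder names only):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object ReclusterSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf()
            .setAppName("recluster-sketch")
            .set("spark.cassandra.connection.host", "127.0.0.1")) // placeholder contact point

        // Copy every row of the old table into a new table that was created
        // with a different clustering key order. All names are placeholders.
        sc.cassandraTable("my_ks", "events_by_user")
          .map(row => (
            row.getString("user_id"),
            row.getString("event_type"),
            row.getLong("ts"),        // timestamp stored as a bigint in this sketch
            row.getString("payload")))
          .saveToCassandra("my_ks", "events_by_user_and_type",
            SomeColumns("user_id", "event_type", "ts", "payload"))

        sc.stop()
      }
    }

Please correct me if that is a misunderstanding.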
