> 1) The FAQ … informs that I can have only files of around 64 MB …

See http://wiki.apache.org/cassandra/CassandraLimitations
 "A single column value may not be larger than 2GB; in practice, "single
 digits of MB" is a more reasonable limit, since there is no streaming
 or random access of blob values."

CASSANDRA-16  only covers pushing those objects through compaction.
Getting the objects in and out of the heap during normal requests is
still a problem.

You could manually chunk them into 64 MB pieces.
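
A rough sketch of what that manual chunking could look like, assuming the
DataStax Python driver and a made-up table layout (object_id text,
chunk_no int, data blob, PRIMARY KEY (object_id, chunk_no)):

    from cassandra.cluster import Cluster

    # 64 MB per chunk, as above; the wiki quote suggests single-digit MB
    # chunks are kinder on the heap, so tune this down if requests struggle.
    CHUNK_SIZE = 64 * 1024 * 1024

    def store_file(session, object_id, path, chunk_size=CHUNK_SIZE):
        """Split a local file into fixed-size chunks, one row per chunk."""
        insert = session.prepare(
            "INSERT INTO blobs (object_id, chunk_no, data) VALUES (?, ?, ?)")
        with open(path, "rb") as f:
            chunk_no = 0
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                session.execute(insert, (object_id, chunk_no, data))
                chunk_no += 1
        return chunk_no  # number of chunks written

    def fetch_file(session, object_id, out_path):
        """Reassemble the file by reading chunks back in clustering order."""
        select = session.prepare(
            "SELECT data FROM blobs WHERE object_id = ?")
        with open(out_path, "wb") as out:
            for row in session.execute(select, (object_id,)):
                out.write(row.data)

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("mykeyspace")
    store_file(session, "big-file-1", "/tmp/big-file")
    fetch_file(session, "big-file-1", "/tmp/big-file.copy")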


> 2) Can I replace HDFS with Cassandra so that I don't have to sync/fetch
> the file from cassandra to HDFS when I want to process it in hadoop cluster?


We keep HDFS as a volatile filesystem simply for hadoop internals. No
need for backups of it, no need to upgrade its data, and we're free to
wipe it whenever hadoop has been stopped.

Otherwise, all our hadoop jobs still read from and write to Cassandra.
Cassandra is our "big data" platform, with hadoop/spark just providing
additional aggregation abilities. I think this is the more effective
approach, rather than trying to completely gut out HDFS.
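
For example, a Spark job wired up with the spark-cassandra-connector can
read its input straight out of Cassandra and write the aggregated result
back, with no HDFS staging step in between. A minimal sketch, with
invented keyspace/table names:

    from pyspark.sql import SparkSession

    # Assumes the spark-cassandra-connector package is on the classpath.
    spark = (SparkSession.builder
             .appName("cassandra-aggregation")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())

    # Read the source table directly from Cassandra.
    events = (spark.read
              .format("org.apache.spark.sql.cassandra")
              .options(keyspace="analytics", table="events")
              .load())

    # Aggregate, then write the result back into another Cassandra table.
    daily_counts = events.groupBy("day").count()
    (daily_counts.write
     .format("org.apache.spark.sql.cassandra")
     .options(keyspace="analytics", table="daily_counts")
     .mode("append")
     .save())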

There was a DataStax project a while back aimed at replacing HDFS with
Cassandra, but I don't think it's alive anymore.

~mck
