Throughout the year I've seen several conversations about Riak discussing how to store, retrieve, and manipulate large (250MB+) data sets, and I'm wondering whether anyone has implemented a good system yet with Riak. My situation:

1) Data files come in as two columns, time-from-start (usually at a fixed interval in milliseconds) and value, with metadata attached.
2) Files can have hundreds of millions of rows.
3) Retrieval will be all data, raw subsets, subsets smoothed over time, or subsets selected by metadata.
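For concreteness, a minimal sketch of parsing the two-column rows described in (1); the whitespace separator and numeric types are assumptions, since the post doesn't specify the exact file encoding:

```python
# Parse rows of "ms-from-start value" into typed pairs. The separator
# (whitespace) and the int/float types are assumptions, not from the post.
from typing import Iterable, Iterator, Tuple

def parse_rows(lines: Iterable[str]) -> Iterator[Tuple[int, float]]:
    """Yield (ms_from_start, value) pairs from an iterable of text rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank rows
        ms, value = line.split()
        yield int(ms), float(value)

sample = ["0 1.25", "10 1.30", "20 1.27"]
rows = list(parse_rows(sample))
# rows == [(0, 1.25), (1.3 at 10), (1.27 at 20)] as (int, float) tuples
```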
I've had some success storing one bucket per file, with each key the number of milliseconds from start, and retrieval is awesome; M/R works well for getting subsets. The problems are:

a) The speed of getting data into Riak. I fork off 100-1000 threads and do a PUT for each row (basically chunk and fork), but this is really slow. So is a single process doing one row at a time.
b) Memory (RAM) usage stays very high (30-40%) after only a few files are loaded, so I worry this may not perform well with thousands of files.
c) I simply don't know whether this is the best way to do this sort of work.

Other DBs are an option, but I prefer Riak's features. Any and all advice is welcome!

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
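One common answer to problem (a) is to bound the concurrency rather than forking hundreds of threads: each PUT is network-bound, so a pool of a few dozen workers usually keeps the node saturated without the scheduling overhead. A sketch of that chunk-and-fork ingestion, assuming Riak's classic HTTP interface (`PUT /riak/<bucket>/<key>`); the bucket name, host, pool size, and `put_row` helper are all hypothetical:

```python
# Chunk-and-fork ingestion with a bounded thread pool instead of
# 100-1000 raw threads. Host, bucket, and chunk size are assumptions.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

RIAK_URL = "http://localhost:8098/riak"  # assumed local node, HTTP interface

def put_row(bucket, ms, value):
    """PUT one row; the key is the millisecond offset, as in the post."""
    req = urllib.request.Request(
        f"{RIAK_URL}/{bucket}/{ms}",
        data=json.dumps({"value": value}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

def chunked(rows, size):
    """Split a list of rows into fixed-size batches, one per worker task."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def load_file(bucket, rows, workers=32, chunk_size=1000, put=put_row):
    """Load all (ms, value) rows; `put` is injectable so the logic is testable."""
    def load_chunk(chunk):
        for ms, value in chunk:
            put(bucket, ms, value)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(load_chunk, chunked(rows, chunk_size)))  # drain to surface errors
```

If per-request overhead is still the bottleneck, Riak's Protocol Buffers interface is generally faster than HTTP for bulk loads, though the sketch above sticks to plain HTTP to stay dependency-free.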
