Throughout the year I've seen several conversations about Riak discussing how to store, retrieve, and manipulate large (250MB+) data sets, and I'm wondering whether anyone has implemented a good system yet with Riak. My situation:

1) Data files come in as two columns, time-from-start (usually at a fixed interval in milliseconds) and value, with metadata attached.
2) Files can have hundreds of millions of rows.
3) Retrieval will be all data, raw subsets, subsets smoothed over time, or subsets selected by metadata.
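For concreteness, a minimal sketch of parsing the two-column rows described in (1); the whitespace separator and numeric types are assumptions, since the post doesn't specify the exact file encoding:

```python
# Parse rows of "ms-from-start value" into typed pairs. The separator
# (whitespace) and the int/float types are assumptions, not from the post.
from typing import Iterable, Iterator, Tuple

def parse_rows(lines: Iterable[str]) -> Iterator[Tuple[int, float]]:
    """Yield (ms_from_start, value) pairs from an iterable of text rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank rows
        ms, value = line.split()
        yield int(ms), float(value)

sample = ["0 1.25", "10 1.30", "20 1.27"]
rows = list(parse_rows(sample))
# rows == [(0, 1.25), (1.3 at 10), (1.27 at 20)] as (int, float) tuples
```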
I've had some success storing one bucket per file, with each key the number of milliseconds from start, and retrieval is awesome; M/R works well for getting subsets. The problems are:

a) The speed of getting data into Riak. I fork off 100-1000 threads and do a PUT for each row (basically chunk and fork), but this is really slow. So is a single process doing one row at a time.
b) Memory (RAM) usage stays very high (30-40%) after only a few files are loaded, so I worry this may not perform well with thousands of files.
c) I simply don't know whether this is the best way to do this sort of work.

Other DBs are an option, but I prefer Riak's features. Any and all advice is welcome!

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
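One common answer to problem (a) is to bound the concurrency rather than forking hundreds of threads: each PUT is network-bound, so a pool of a few dozen workers usually keeps the node saturated without the scheduling overhead. A sketch of that chunk-and-fork ingestion, assuming Riak's classic HTTP interface (`PUT /riak/<bucket>/<key>`); the bucket name, host, pool size, and `put_row` helper are all hypothetical:

```python
# Chunk-and-fork ingestion with a bounded thread pool instead of
# 100-1000 raw threads. Host, bucket, and chunk size are assumptions.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

RIAK_URL = "http://localhost:8098/riak"  # assumed local node, HTTP interface

def put_row(bucket, ms, value):
    """PUT one row; the key is the millisecond offset, as in the post."""
    req = urllib.request.Request(
        f"{RIAK_URL}/{bucket}/{ms}",
        data=json.dumps({"value": value}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    urllib.request.urlopen(req)

def chunked(rows, size):
    """Split a list of rows into fixed-size batches, one per worker task."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def load_file(bucket, rows, workers=32, chunk_size=1000, put=put_row):
    """Load all (ms, value) rows; `put` is injectable so the logic is testable."""
    def load_chunk(chunk):
        for ms, value in chunk:
            put(bucket, ms, value)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(load_chunk, chunked(rows, chunk_size)))  # drain to surface errors
```

If per-request overhead is still the bottleneck, Riak's Protocol Buffers interface is generally faster than HTTP for bulk loads, though the sketch above sticks to plain HTTP to stay dependency-free.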
