On 9/27/2010 10:45 AM, Jason McInerney wrote:
Throughout the year I've seen several conversations about Riak
discussing ideas on how to store, retrieve, and manipulate large
(250MB+) data sets, and I'm wondering if anyone has implemented a good
system yet with Riak.
My situation:
1) data files come in as 2 columns: time-from-start (usually a fixed
interval in milliseconds) and value, with meta data
2) files can have hundreds of millions of rows
3) retrieval will be as all data, raw subsets, subsets smoothed over
time, or subsets according to meta data

I've had some success storing a bucket per file, with keys as
milliseconds from start, and retrieval is awesome.  M/R works well for
getting subsets.

Problems are:
a) the speed of getting the data into Riak -- I fork off 100 - 1000
threads and do PUTs on each row (basically chunk & fork), but this is
really slow.  So is a single process doing one row at a time.
b) memory (RAM) usage after a few files are in remains very high
(30-40%), so I worry that this may not perform well with thousands of
files
c) I simply don't know if this is the best way to do this sort of
work.  Other DBs are an option, but I prefer Riak's features.
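One way to attack problem (a) is to stop doing one PUT per row and instead pack a fixed number of rows into each stored object, with chunk boundaries aligned to the sample interval so range reads can compute exactly which keys to fetch. Here's a minimal sketch of that packing step; the names (chunk_key, make_chunks, CHUNK_ROWS) are hypothetical helpers, not any Riak client API, and the actual PUTs would go through whatever client you're using:

```python
# Sketch: pack (offset_ms, value) rows into fixed-size chunks so one
# Riak PUT stores many samples instead of one.  Chunk keys are derived
# from the time offset, so a subset read knows which keys to fetch
# without listing the bucket.  All names here are hypothetical.

CHUNK_ROWS = 10_000  # samples stored per Riak object (tunable)

def chunk_key(file_id, offset_ms, interval_ms):
    """Key of the chunk that holds the sample at offset_ms."""
    chunk_span = CHUNK_ROWS * interval_ms
    return f"{file_id}:{(offset_ms // chunk_span) * chunk_span}"

def make_chunks(file_id, rows, interval_ms):
    """Group (offset_ms, value) rows into {key: rows} for bulk PUTs."""
    chunks = {}
    for offset_ms, value in rows:
        key = chunk_key(file_id, offset_ms, interval_ms)
        chunks.setdefault(key, []).append((offset_ms, value))
    return chunks
```

At 10k rows per object, a 100-million-row file becomes ~10k PUTs instead of 100 million, which should also ease the per-key overhead behind problem (b).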

I don't have an answer, but would like to expand the question a bit to see if there is a good way to replace the rrd-type files commonly used in SNMP monitoring systems like mrtg, cacti, etc., or the Java jrobin equivalent that OpenNMS uses. These store data at the full sample rate for some length of time, then average it into lower-resolution aggregates as the data ages, so you can keep a long history online. At least with OpenNMS, writing this data is usually what limits the number of devices it can monitor, so distributing it would be a big win.
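The consolidation step Les describes (full resolution recently, averaged aggregates as data ages) is itself a small, pure computation, whatever store the result lands in. A minimal sketch, assuming (offset_ms, value) pairs as in the original post; the function name is illustrative, not from RRDtool, jrobin, or any Riak API:

```python
# Sketch: RRD-style consolidation -- average raw (offset_ms, value)
# samples into coarser buckets as data ages.  An aging job could run
# this over old chunks and store the smaller result under a new key.

def downsample(rows, bucket_ms):
    """Average samples into bucket_ms-wide buckets.

    Returns a sorted list of (bucket_start_ms, mean_value).
    """
    sums = {}
    for offset_ms, value in rows:
        start = (offset_ms // bucket_ms) * bucket_ms
        total, count = sums.get(start, (0.0, 0))
        sums[start] = (total + value, count + 1)
    return sorted((s, total / count) for s, (total, count) in sums.items())
```

Run repeatedly with larger bucket_ms as data ages, this gives the mrtg/cacti-style tiers (e.g. 5-minute averages after a day, hourly after a month) while old full-resolution chunks get deleted.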

--
  Les Mikesell
   [email protected]

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
