To follow up on my earlier note about setting n_val to 1 when testing on
one machine: the same holds true when testing a dev cluster setup on one
machine. You are still writing the data multiple times to the same disk,
which will absolutely skew your testing results.
-Alexander
@siculars on twitter
http://siculars.posterous.com
Sent from my iPhone
On Sep 27, 2010, at 12:20, Alexander Sicular <[email protected]> wrote:
Inline...
On Sep 27, 2010, at 11:45, Jason McInerney <[email protected]> wrote:
Throughout the year I've seen several conversations about Riak
discussing ideas on how to store, retrieve, and manipulate large
(250MB+) data sets, and I'm wondering if anyone has implemented a good
system with Riak yet.
My situation:
1) data files come in as two columns: time-from-start (usually at a fixed
interval in milliseconds) and value, plus metadata
2) files can have hundreds of millions of rows
3) retrieval will be as all data, raw subsets, subsets smoothed over
time, or subsets selected by metadata
I've had some success storing a bucket per file, with keys as
milliseconds from start, and retrieval is awesome. M/R works well for
getting subsets.
Problems are:
a) the speed of getting the data into Riak -- I fork off 100-1000
threads and do a PUT per row (basically chunk & fork), but this is
really slow. So is a single process doing one row at a time.
Are you using the protobuf or the HTTP interface? I would use the
former. There are lots of resources on how best to chunk large text
files in your favorite language, and there may well be a protobuf lib
for Riak in that language.
-Whichever interface you use, make sure your lib is not returning the
data in its reply or doing any extra Riak work like waiting on n
successful replies.
-In your connection, supply the client id. Otherwise Riak auto-generates
one for you (an optimization, but extra processing nonetheless).
-Round-robin your IPs if you have a cluster of Riak nodes.
-If you are testing on one node, set n_val to 1 in your app.config file,
or you're just churning disk.
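Putting the points above together, here is a rough sketch of a bulk
loader in Python. This is only an illustration under assumptions: the
poster's language isn't stated, and the commented-out Riak calls assume
the Python riak client over protobuf; names like parse_rows, store_chunk,
and the bucket name are made up. The parsing/chunking part runs on its own.

```python
# Sketch of a bulk loader for "ms value" time-series rows, using the
# millisecond offset as the Riak key. The actual store calls are
# hypothetical and commented out so the rest runs standalone.

def parse_rows(lines):
    """Parse 'ms value' lines into (key, value) string pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ms, value = line.split()
        pairs.append((ms, value))
    return pairs

def chunk(pairs, size):
    """Split the row list into fixed-size chunks, one per worker."""
    for i in range(0, len(pairs), size):
        yield pairs[i:i + size]

def store_chunk(pairs, bucket_name="datafile-1"):
    """Store one chunk. Hypothetical riak-python-client calls:

    client = riak.RiakClient(port=8087,            # protobuf, not HTTP
                             client_id="loader-1") # fixed client id
    bucket = client.bucket(bucket_name)
    for key, value in pairs:
        obj = bucket.new(key, data=value)
        obj.store(return_body=False)  # don't echo the object back
    """
    pass

lines = ["0 1.25", "10 1.31", "20 1.28", "30 1.40"]
chunks = list(chunk(parse_rows(lines), 2))
print(len(chunks))    # 2 chunks of 2 rows each
print(chunks[0][0])   # ('0', '1.25')
```

Each chunk would then go to its own thread or process, round-robinned
across node IPs if you have a real cluster.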
b) memory (RAM) usage after a few files are in remains very high
(30-40%), so I worry that this may not perform well with thousands of
files
If you are using the bitcask backend (the default), memory use is
governed by the total number of keys in the cask: add more keys, eat
more RAM. There is a metric floating around (off the top of my head I
think it's roughly 40 bytes + key length, per key). The InnoDB backend
may be better for this memory-wise, although it does consume file
descriptors on a per-bucket basis.
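For a ballpark, that rule of thumb works out like this. Note the
~40-byte per-key overhead is the from-memory figure quoted above, so
treat the result as an order-of-magnitude estimate only:

```python
# Rough bitcask RAM estimate: every key is resident in memory and
# costs roughly a fixed overhead plus the key bytes themselves.
OVERHEAD_BYTES = 40  # the approximate figure quoted in the thread

def bitcask_ram_bytes(num_keys, avg_key_len):
    return num_keys * (OVERHEAD_BYTES + avg_key_len)

# e.g. one file with 100 million rows, keyed by millisecond offset
# (keys like "86399999", roughly 8 bytes each):
est = bitcask_ram_bytes(100_000_000, 8)
print(est / 2**30)  # ~4.5 GiB of keydir just for one file's keys
```

At that rate, thousands of such files would be far beyond RAM on a
single box, which matches the poster's concern in (b).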
c) I simply don't know if this is the best way to do this sort of
work. Other DBs are an option, but I prefer Riak's features.
Take a look at basho_bench to get a feel for the ops/sec of your own
setup. You may be hitting some max ops/sec due to hardware constraints.
Lots of knobs to tweak there.
Any and all advice is welcome!
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com