It is a huge email on Christmas eve, I am sorry, but I think the downtime fosters the bigger questions (but maybe not the bigger answers).

Yes, we are currently working on very clean batch-based inserts with 10k batches of lists of row mutations (roughly the sketch below). We will be inserting 3 batches (3 CFs) per 10k inserts, and there should hopefully not be too much row contention between writers. So we definitely will not be trying to do 10k RPC calls/sec/node. Thanks.
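A minimal sketch of what one such batch looks like through the Thrift gateway, assuming the gen-py bindings produced from HBase's Hbase.thrift around the 0.20/0.90 releases (mutateRows() picks up extra parameters in later versions); the host, table name, columns, and row data are placeholders:

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase                  # gen-py module from Hbase.thrift
    from hbase.ttypes import Mutation, BatchMutation

    def open_client(host='localhost', port=9090):
        # One buffered socket to the Thrift gateway; reuse it across batches.
        transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
        client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
        transport.open()
        return client, transport

    def insert_batch(client, table, rows):
        # rows: [(row_key, [(column, value), ...]), ...]
        # All cells for a row go into one BatchMutation, so its row lock is
        # taken only once, and the whole 10k-row batch travels in a single
        # mutateRows() RPC instead of 10k individual calls.
        batch = [BatchMutation(row=key,
                               mutations=[Mutation(column=col, value=val)
                                          for col, val in cells])
                 for key, cells in rows]
        client.mutateRows(table, batch)

The plan of 3 batches per 10k inserts would then just be three insert_batch() calls, one per CF's list of row mutations.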
On Sun, Dec 26, 2010 at 1:48 AM, Ryan Rawson <[email protected]> wrote:
> This is a huge email to drop on Xmas eve, you won't likely get a
> comprehensive answer until January. I can offer a few tidbits
> though...
>
> First off you have to moderate your expectations in terms of datagrams
> and RPC calls. Wanting to do 10k RPC calls/sec/node is a bit of a hard
> sell, especially since there is a flow of data from
> client->thrift->rs->datanode. There is shared infra there, and you can
> get weird pauses as hidden dependencies are revealed, e.g. 3
> regionservers hammer 1 datanode and things choke a bit.
>
> So back to performanceland, the key here is batching if at all
> possible. If you want to do 10k inserts/sec with good perf, batching
> is key. I think if you use the batch put calls in Thrift that will do
> what is expected.
>
> As for the read side, I think your read goals are achievable. We get
> low read latencies using PHP and Thrift. Interestingly, running
> through Thrift amortizes the client cache for short-lived
> scripts/programs.
>
> -ryan
>
> On Fri, Dec 24, 2010 at 5:09 AM, Wayne <[email protected]> wrote:
> > We are in the process of evaluating HBase in an effort to switch
> > from a different NoSQL solution. Performance is of course an
> > important part of our evaluation. We are a Python shop, and we are
> > very worried that we cannot get any real performance out of HBase
> > using Thrift (and must drop down to Java). We are aware of the
> > various lower-level options for bulk insert, or Java-based inserts
> > with the WAL turned off etc., but none of these are available to us
> > in Python, so they are not part of our evaluation. We have a 10-node
> > cluster (24GB RAM, 6 x 1TB disks, 16 cores) that we are setting up
> > as data/region nodes, and we are looking for suggestions on
> > configuration as well as benchmarks in terms of expectations of
> > performance. Below are some specific questions. I realize there are
> > a million factors that help determine specific performance numbers,
> > so any examples of performance from running clusters would be great
> > as examples of what can be done. Again, Thrift seems to be our
> > "problem", so non-Java-based solutions are preferred (do any
> > non-Java shops run large-scale HBase clusters?). Our total
> > production cluster size is estimated to be 50TB.
> >
> > Our data model is 3 CFs, one primary and 2 secondary indexes. All
> > writes go to all 3 CFs and are grouped as a batch of row mutations,
> > which should avoid row-locking issues.
> >
> > What heap size is recommended for the master, and for region servers
> > (24GB RAM)?
> > What other settings can/should be tweaked in HBase to optimize
> > performance (we have looked at the wiki page)?
> > What is a good batch size for writes? We will start with 10k
> > values/batch.
> > How many concurrent writers/readers can a single data node handle
> > with evenly distributed load? Are there settings specific to this?
> > What is "very good" read/write latency for a single put/get in HBase
> > using Thrift?
> > What is "very good" read/write throughput per node in HBase using
> > Thrift?
> >
> > We are looking to get performance numbers in the range of 10k
> > aggregate inserts/sec/node and read latency < 30ms/read with 3-4
> > concurrent readers/node. Can our expectations be met with HBase
> > through Thrift? Can they be met with HBase through Java?
> >
> > Thanks in advance for any help, examples, or recommendations that
> > you can provide!
> >
> > Wayne
> >
>
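On the read-latency question in the quoted thread, a rough way to measure it from Python through the same gateway, again assuming the 0.20/0.90-era gen-py bindings (getRow() also gains extra parameters in later versions); the table and row key are placeholders:

    import time

    def timed_get(client, table, row_key):
        # One single-row read through Thrift, timed wall-clock, to check
        # the < 30ms/read target. getRow() returns a list of TRowResult;
        # TRowResult.columns maps 'family:qualifier' strings to TCells.
        start = time.time()
        results = client.getRow(table, row_key)
        elapsed_ms = (time.time() - start) * 1000.0
        cells = {}
        if results:
            cells = dict((col, cell.value)
                         for col, cell in results[0].columns.items())
        return cells, elapsed_ms

Running a few thousand of these from 3-4 client processes per node would approximate the concurrent-reader load described above.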
