http://code.google.com/p/caliper/ might be useful.
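As a sketch of how that could apply here: a Caliper benchmark comparing the two maps discussed below might look roughly like the following, assuming the Caliper 0.5-era SimpleBenchmark API (the class name, key type, and workload are illustrative, not from this thread):

    import com.google.caliper.Param;
    import com.google.caliper.Runner;
    import com.google.caliper.SimpleBenchmark;

    import java.util.Collections;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class MapBenchmark extends SimpleBenchmark {
      @Param({"cslm", "synctree"}) String impl;
      private Map<Long, Long> map;

      @Override protected void setUp() {
        map = impl.equals("cslm")
            ? new ConcurrentSkipListMap<Long, Long>()
            : Collections.synchronizedMap(new TreeMap<Long, Long>());
      }

      // Caliper drives time* methods with a rep count and does its own
      // warm-up, which sidesteps the JIT pitfalls discussed below.
      public void timePutGetRemove(int reps) {
        for (int i = 0; i < reps; i++) {
          map.put((long) i, (long) i);
          map.get((long) i);
          map.remove((long) i);
        }
      }

      public static void main(String[] args) {
        Runner.main(MapBenchmark.class, args);
      }
    }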
On Tue, Oct 25, 2011 at 7:36 AM, Andrew Purtell <apurt...@apache.org> wrote:
> Also please make note of what JVM version and report it with the results.
>
> Best regards,
>
>   - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
> ----- Original Message -----
> > From: Li Pi <l...@idle.li>
> > To: dev@hbase.apache.org
> > Sent: Monday, October 24, 2011 11:27 PM
> > Subject: Re: CSLM performance Was: SILT - nice keyvalue store paper
> >
> > You might also want to run the code a few times, and only take results
> > from the latter half. Let the JVM warm up and JIT things for a fair
> > comparison.
> >
> > On Mon, Oct 24, 2011 at 11:14 PM, Akash Ashok <thehellma...@gmail.com> wrote:
> > > On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > Akash:
> > > > Take a look at the following two possible replacements for CSLM:
> > > > https://github.com/mspiegel/lockfreeskiptree
> > > > https://github.com/nbronson/snaptree
> > >
> > > Thanks Ted :) :) :). I was pretty much on the lookout for other
> > > structures. I tried
> > > http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html
> > > but it didn't perform that well. I will look into these replacements.
> > >
> > > > About your test, I got advice from another expert:
> > > >
> > > > - the JVM warm-up penalty is being accrued by the CSLM run
> > > > - use Thread.yield() to end a request, as otherwise the active
> > > >   thread may be able to run through its time slice without
> > > >   incurring much lock contention
> > >
> > > Hmm, I was just wondering: since each thread runs 10,000+ inserts
> > > and deletes, it has to context switch quite a few times during its
> > > lifecycle, right? Wouldn't it be enough if I increased the number of
> > > iterations per thread by a factor of, say, 100?
> > >
> > > > You can publish your code and results under HBASE-3992.
> > > >
> > > > Thanks
> > > >
> > > > On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:
> > > > > Oh, and when running these experiments, you should look at the
> > > > > impact of the order they are run in, whether you run them
> > > > > multiple times per JVM instance, etc. Basically, you need to be
> > > > > cognizant of the HotSpot optimizations the JVM is doing at
> > > > > runtime.
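To make the warm-up advice above concrete: a tiny harness, purely illustrative (not from the thread), that runs several full passes and keeps only the timings from the latter half, after HotSpot has compiled the hot paths:

    import java.util.Arrays;

    public class WarmedTiming {
      // Run the benchmark body `passes` times; return only the latter half
      // of the per-pass timings so measurements come from JIT-compiled code.
      static long[] timeLatterHalf(Runnable body, int passes) {
        long[] nanos = new long[passes];
        for (int i = 0; i < passes; i++) {
          long start = System.nanoTime();
          body.run();
          nanos[i] = System.nanoTime() - start;
        }
        return Arrays.copyOfRange(nanos, passes / 2, passes);
      }

      public static void main(String[] args) {
        long[] kept = timeLatterHalf(() -> {
          java.util.concurrent.ConcurrentSkipListMap<Long, Long> m =
              new java.util.concurrent.ConcurrentSkipListMap<>();
          for (long k = 0; k < 100_000; k++) { m.put(k, k); m.remove(k); }
        }, 10);
        System.out.println(Arrays.toString(kept));
      }
    }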
> > > > > > -----Original Message-----
> > > > > > From: Jonathan Gray [mailto:jg...@fb.com]
> > > > > > Sent: Sunday, October 23, 2011 4:20 PM
> > > > > > To: dev@hbase.apache.org
> > > > > > Subject: RE: SILT - nice keyvalue store paper
> > > > > >
> > > > > > Very nice experiment, Akash. Keep getting your hands dirty and
> > > > > > digging! :)
> > > > > >
> > > > > > I think your results might change if you bump the test up to
> > > > > > 1000 threads or so. 100 threads can still perform okay when
> > > > > > there's a global lock, but the contention at 1000 threads will
> > > > > > kill you, and that's when CSLM should do much better. (1000
> > > > > > handler threads is approx. what I run with on a RS in prod.)
> > > > > > Though I am a bit surprised that at 100 threads the TreeMap was
> > > > > > significantly faster. Your inconsistent results are a bit odd;
> > > > > > you might try an order of magnitude more operations per thread.
> > > > > > You might also gather some statistics about tree size and
> > > > > > per-operation latency.
> > > > > >
> > > > > > I've done some isolated CSLM benchmarks in the past and have
> > > > > > never been able to reproduce any of the slowness people suggest.
> > > > > > I recall trying some impractically large MemStores and
> > > > > > everything still being quite fast.
> > > > > >
> > > > > > Over in Cassandra, I believe they have a two-level CSLM, with
> > > > > > the first map key being the row and then the columns for each
> > > > > > row in their own CSLM. I've been told this is somewhat of a pain
> > > > > > point for them. And keep in mind they have one shard/region per
> > > > > > node and we generally have several smaller MemStores on each
> > > > > > node (tens to thousands). Not sure we would want to try that.
> > > > > > There could be some interesting optimizations if you had very
> > > > > > specific issues; for example, if you had a ton of reads to
> > > > > > MemStore and not many writes, you could keep some kind of
> > > > > > mirrored hashmap.
> > > > > >
> > > > > > And for writes, the WAL is definitely the latency bottleneck.
> > > > > > But if you are doing lots of small operations, so your WALEdits
> > > > > > are not large, and with some of the HLog batching features going
> > > > > > into trunk, you end up with hundreds of requests per HLog sync.
> > > > > > And although the syncs are higher latency, with batching you end
> > > > > > up getting high throughput. And the bottleneck shifts.
> > > > > >
> > > > > > Each sync will take approx. 1-5ms, so let's say 250 requests per
> > > > > > HLog sync batch and 4ms per sync: that's 62.5k req/sec (62.5k *
> > > > > > 100 bytes/req = ~6 MB/sec, very reasonable). If you're mixing in
> > > > > > reads as well (or if you're doing increments, which do a read
> > > > > > and a write), then this adds to the CPU usage and contention
> > > > > > without adding to HLog throughput.
> > > > > >
> > > > > > All of a sudden the bottleneck becomes CPU/contention and not
> > > > > > HLog latency or throughput. Highly concurrent
> > > > > > increments/counters with a largely in-memory dataset can easily
> > > > > > be CPU bottlenecked.
> > > > > >
> > > > > > For one specific application Dhruba and I worked on, we made
> > > > > > some good improvements in CPU efficiency by reducing the number
> > > > > > of operations and increasing efficiency on the CSLM: doing
> > > > > > things like always taking a tailMap and working from that
> > > > > > instead of starting at the root node, using an iterator() and
> > > > > > taking advantage of the available remove() semantics, or simply
> > > > > > just mutating things that are normally immutable :)
> > > > > > Unfortunately many of these optimizations were semi-horrid hacks
> > > > > > and introduced things like ModifiableKeyValues, so they haven't
> > > > > > all made their way to Apache.
> > > > > >
> > > > > > In the end, after our optimizations, the real-world workload
> > > > > > Dhruba and I were working with was not all in-memory, so the
> > > > > > bottleneck in production became the random reads (so increasing
> > > > > > the block cache hit ratio is the focus) rather than CPU
> > > > > > contention or HLog throughput.
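The tailMap/iterator idiom Jonathan mentions, sketched with illustrative Long keys (the real code operated on KeyValues; this only shows the shape of the optimization):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class TailMapScan {
      // Start the scan at startKey via tailMap() instead of repeatedly
      // searching from the root, and delete through the iterator. CSLM
      // iterators are weakly consistent, so this is safe under concurrent
      // writers and never throws ConcurrentModificationException.
      static void pruneFrom(ConcurrentSkipListMap<Long, byte[]> memstore,
                            long startKey) {
        ConcurrentNavigableMap<Long, byte[]> tail = memstore.tailMap(startKey);
        Iterator<Map.Entry<Long, byte[]>> it = tail.entrySet().iterator();
        while (it.hasNext()) {
          Map.Entry<Long, byte[]> e = it.next();
          if (e.getValue().length == 0) { // illustrative "delete marker" check
            it.remove();                  // removes from the underlying map
          }
        }
      }
    }

The point of the pattern is that each subsequent lookup or removal starts from where the scan already is, rather than paying a fresh O(log n) search from the top of the skip list index for every operation.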
> > > > > > JG
> > > > > >
> > > > > > From: Akash Ashok [mailto:thehellma...@gmail.com]
> > > > > > Sent: Sunday, October 23, 2011 2:57 AM
> > > > > > To: dev@hbase.apache.org
> > > > > > Subject: Re: SILT - nice keyvalue store paper
> > > > > >
> > > > > > I was running some similar tests and came across a surprising
> > > > > > finding. I compared reads and writes on ConcurrentSkipListMap
> > > > > > (which the memstore uses) and a synchronized TreeMap (which was
> > > > > > literally a TreeMap, synchronized). I executed concurrent reads,
> > > > > > writes and deletes on both of them. Surprisingly, the
> > > > > > synchronized TreeMap performed better, though only slightly,
> > > > > > than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
> > > > > >
> > > > > > Here are the outputs of a few runs.
> > > > > >
> > > > > > Sometimes the difference was considerable:
> > > > > > Using HBaseMap it took 20438ms
> > > > > > Using TreeMap it took 11613ms
> > > > > > Time Difference: 8825ms
> > > > > >
> > > > > > And sometimes the difference was negligible:
> > > > > > Using HBaseMap it took 13370ms
> > > > > > Using TreeMap it took 9482ms
> > > > > > Time Difference: 3888ms
> > > > > >
> > > > > > I'm attaching the test java file which I wrote to test it. This
> > > > > > might be a very minor difference, but it is still surprising
> > > > > > considering the fact that ConcurrentSkipListMap uses fancy
> > > > > > two-level indexes, which they say improve deletion performance.
> > > > > >
> > > > > > And here are the details about the test run:
> > > > > > 100 threads each fetching 1,000,000 records
> > > > > > 100 threads each adding 1,000,000 records
> > > > > > 100 threads each deleting 1,000,000 records
> > > > > > (reads, writes and deletes run simultaneously)
> > > > > >
> > > > > > Cheers,
> > > > > > Akash A
> > > > > >
> > > > > > On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
> > > > > > > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
> > > > > > > > I would think that the bottleneck for insert is the WAL
> > > > > > > > part? It would be possible to do a kind of memory list
> > > > > > > > preparation during the WAL insertion, and if the WAL
> > > > > > > > insertion is confirmed, do the insertion in the memory
> > > > > > > > list. But it's strange for the in-memory insertion to
> > > > > > > > matter compared to the insertion on disk...
> > > > > > >
> > > > > > > Yes, WAL is the long pole writing. But MemStore has issues
> > > > > > > too; Dhruba says CPU, above. Reading and writing it is also
> > > > > > > 'slow'.
> > > > > > > St.Ack
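Akash's attachment isn't reproduced in the archive; a hedged reconstruction of the test's shape, where the thread counts and the key range are guesses based on the description above:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Random;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    public class MapContentionTest {
      // 100 reader, 100 writer, and 100 deleter threads run simultaneously,
      // each issuing opsPerThread operations, as in the test described above.
      static long run(Map<Long, Long> map, int threadsPerGroup, int opsPerThread)
          throws InterruptedException {
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threadsPerGroup * 3);
        for (int group = 0; group < 3; group++) {
          final int op = group;
          for (int t = 0; t < threadsPerGroup; t++) {
            new Thread(() -> {
              Random rnd = new Random();
              try {
                start.await();
                for (int i = 0; i < opsPerThread; i++) {
                  long k = rnd.nextInt(1_000_000);
                  if (op == 0) map.get(k);
                  else if (op == 1) map.put(k, k);
                  else map.remove(k);
                }
              } catch (InterruptedException ignored) {
              } finally {
                done.countDown();
              }
            }).start();
          }
        }
        long t0 = System.currentTimeMillis();
        start.countDown(); // release all threads at once
        done.await();
        return System.currentTimeMillis() - t0;
      }

      public static void main(String[] args) throws Exception {
        System.out.println("CSLM: "
            + run(new ConcurrentSkipListMap<>(), 100, 1_000_000) + "ms");
        System.out.println("sync TreeMap: "
            + run(Collections.synchronizedMap(new TreeMap<>()), 100, 1_000_000) + "ms");
      }
    }

Per the warm-up discussion earlier in the thread, whichever map runs first in a shared JVM pays the JIT compilation cost, so each configuration should be run several times (or in separate JVM instances) before comparing numbers.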