http://www.azulsystems.com/events/javaone_2009/session/2009_J1_Benchmark.pdf
Might also be useful.

On Tue, Oct 25, 2011 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:

http://code.google.com/p/caliper/ might be useful.

On Tue, Oct 25, 2011 at 7:36 AM, Andrew Purtell <apurt...@apache.org> wrote:

Also please make note of what JVM version and report it with the results.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

----- Original Message -----
From: Li Pi <l...@idle.li>
To: dev@hbase.apache.org
Sent: Monday, October 24, 2011 11:27 PM
Subject: Re: CSLM performance Was: SILT - nice keyvalue store paper

You might also want to run the code a few times, and only take results from the latter half. Let the JVM warm up and JIT things for a fair comparison.

On Mon, Oct 24, 2011 at 11:14 PM, Akash Ashok <thehellma...@gmail.com> wrote:

On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Akash:
> Take a look at the following two possible replacements for CSLM:
> https://github.com/mspiegel/lockfreeskiptree
> https://github.com/nbronson/snaptree

Thanks Ted :) :) :). I was pretty much on the lookout for other structures. I tried
http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html
but it didn't perform that well. Will look into these replacements.

> About your test, I got advice from another expert:
>
> - the JVM warm-up penalty is being accrued by the CSLM run
> - Use Thread.yield() to end a request, as otherwise the active thread may be able to
>   run through its time slice without incurring much lock contention.

Hmm, I was just wondering: since each thread runs 10000+ inserts and deletes, it has to context switch quite a few times during its lifecycle, right? Wouldn't it be enough if I increased the number of iterations by a factor of, say, 100 per thread?

> You can publish your code and results under HBASE-3992.
>
> Thanks

On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:

Oh, and when running these experiments, you should look at the impact of the order they are run in, whether you run them multiple times per JVM instance, etc. Basically, you need to be cognizant of the HotSpot optimizations the JVM is doing at runtime.
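A minimal sketch of that warm-up discipline, assuming a standalone driver rather than the attached test (the class name, run count, and placeholder workload below are illustrative): time the same workload several times in one JVM and report only the latter runs, after HotSpot has had a chance to JIT the hot paths.

import java.util.concurrent.ConcurrentSkipListMap;

public class WarmupBench {

    private static final int TOTAL_RUNS = 10;  // first half is treated as JIT warm-up

    public static void main(String[] args) {
        long[] elapsed = new long[TOTAL_RUNS];
        for (int run = 0; run < TOTAL_RUNS; run++) {
            long start = System.nanoTime();
            runOnce();  // the workload under test
            elapsed[run] = System.nanoTime() - start;
        }
        // Report only the latter half of the runs, once the hot paths are compiled.
        long sum = 0;
        for (int run = TOTAL_RUNS / 2; run < TOTAL_RUNS; run++) {
            sum += elapsed[run];
        }
        long avgMs = (sum / (TOTAL_RUNS / 2)) / 1000000L;
        System.out.println("avg warm run: " + avgMs + " ms");
    }

    // Placeholder workload: hammer a ConcurrentSkipListMap with puts and removes.
    private static void runOnce() {
        ConcurrentSkipListMap<Long, Long> map = new ConcurrentSkipListMap<Long, Long>();
        for (long i = 0; i < 1000000; i++) {
            map.put(i, i);
            map.remove(i);
        }
    }
}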
-----Original Message-----
From: Jonathan Gray [mailto:jg...@fb.com]
Sent: Sunday, October 23, 2011 4:20 PM
To: dev@hbase.apache.org
Subject: RE: SILT - nice keyvalue store paper

Very nice experiment, Akash. Keep getting your hands dirty and digging! :)

I think your results might change if you bump the test up to 1000 threads or so. 100 threads can still perform okay when there's a global lock, but the contention at 1000 threads will kill you, and that's when CSLM should do much better. (1000 handler threads is approx. what I run with on RS in prod.) Though I am a bit surprised that at 100 threads the TreeMap was significantly faster. Your inconsistent results are a bit odd; you might try an order of magnitude more operations per thread. You might also gather some statistics about tree size and per-operation latency.

I've done some isolated CSLM benchmarks in the past and have never been able to reproduce any of the slowness people suggest. I recall trying some impractically large MemStores and everything still being quite fast.

Over in Cassandra, I believe they have a two-level CSLM, with the first map key being the row and then the columns for each row in their own CSLM. I've been told this is somewhat of a pain point for them. And keep in mind they have one shard/region per node and we generally have several smaller MemStores on each node (tens to thousands). Not sure we would want to try that. There could be some interesting optimizations if you had very specific issues; for example, if you had a ton of reads to MemStore and not many writes you could keep some kind of mirrored hashmap.

And for writes, the WAL is definitely the latency bottleneck. But if you are doing lots of small operations, so your WALEdits are not large, then with some of the HLog batching features going into trunk you end up with hundreds of requests per HLog sync. And although the syncs are higher latency, with batching you end up getting high throughput. And the bottleneck shifts.

Each sync will take approx. 1-5ms, so let's say 250 requests per HLog sync batch at 4ms per sync, which is 62.5k req/sec (62.5k req/sec * 100 bytes/req = ~6.25MB/sec, very reasonable). If you're mixing in reads as well (or if you're doing increments, which do a read and a write), then this adds to the CPU usage and contention without adding to HLog throughput.

All of a sudden the bottleneck becomes CPU/contention and not HLog latency or throughput. Highly concurrent increments/counters with a largely in-memory dataset can easily be CPU bottlenecked.

For one specific application Dhruba and I worked on, we made some good improvements in CPU efficiency by reducing the number of operations and increasing efficiency on the CSLM. Doing things like always taking a tailMap and working from that instead of starting at the root node, using an iterator() and taking advantage of the available remove() semantics, or simply just mutating things that are normally immutable :) Unfortunately many of these optimizations were semi-horrid hacks and introduced things like ModifiableKeyValues, so they haven't all made their way to Apache.

In the end, after our optimizations, the real-world workload Dhruba and I were working with was not all in-memory, so the bottleneck in production became the random reads (so increasing the block cache hit ratio is the focus) rather than CPU contention or HLog throughput.

JG
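The tailMap/iterator pattern JG describes might look roughly like the following on a bare ConcurrentSkipListMap; the byte[]-keyed map, the comparator argument, and the pruning condition are simplified stand-ins, not the actual MemStore patches.

import java.util.Comparator;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class TailMapScan {

    // Walk from startKey using one tailMap view and its iterator: each next() follows
    // the bottom-level links of the skip list instead of doing a fresh lookup from the
    // root for every key. The view's iterator also supports remove(), which is handy
    // for pruning entries while scanning.
    // Assumes the map was created with this same comparator, e.g.
    // new ConcurrentSkipListMap<byte[], byte[]>(cmp), since byte[] has no natural ordering.
    static void scanAndPrune(ConcurrentSkipListMap<byte[], byte[]> memstore,
                             byte[] startKey, byte[] stopKey,
                             Comparator<byte[]> cmp) {
        ConcurrentNavigableMap<byte[], byte[]> tail = memstore.tailMap(startKey, true);
        Iterator<Map.Entry<byte[], byte[]>> it = tail.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<byte[], byte[]> e = it.next();
            if (cmp.compare(e.getKey(), stopKey) >= 0) {
                break;  // walked past the range of interest
            }
            if (e.getValue().length == 0) {  // stand-in condition for "entry to drop"
                it.remove();
            }
        }
    }
}

In the actual patches the keys would be KeyValues under the KeyValue comparator; the sketch keeps raw byte[] keys to stay self-contained.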
From: Akash Ashok [mailto:thehellma...@gmail.com]
Sent: Sunday, October 23, 2011 2:57 AM
To: dev@hbase.apache.org
Subject: Re: SILT - nice keyvalue store paper

I was running some similar tests and came across a surprising finding. I compared reads and writes on ConcurrentSkipListMap (which the MemStore uses) and a synchronized TreeMap (which was literally a TreeMap, synchronized). Executed concurrent reads, writes and deletes on both of them. Surprisingly, the synchronized TreeMap performed better, though just slightly better, than the ConcurrentSkipListMap which KeyValueSkipListSet uses.

Here are the outputs of a few runs.

Sometimes the difference was considerable:
Using HBaseMap it took 20438ms
Using TreeMap it took 11613ms
Time Difference: 8825ms

And sometimes the difference was negligible:
Using HBaseMap it took 13370ms
Using TreeMap it took 9482ms
Time Difference: 3888ms

I'm attaching the test java file which I wrote to test it. This might be a very minor difference, but it is still surprising considering the fact that ConcurrentSkipListMap uses fancy 2-level indexes, which they say improve deletion performance.

And here are the details about the test run:
100 threads each fetching 1,000,000 records
100 threads each adding 1,000,000 records
100 threads each deleting 1,000,000 records
(reads, writes and deletes running simultaneously)

Cheers,
Akash A

On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:

On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
> I would think that the bottleneck for insert is the WAL part?
> It would be possible to do a kind of memory list preparation during the WAL insertion,
> and, if the WAL insertion is confirmed, do the insertion in the memory list. But it's
> strange for the insertion in memory to matter much vs. the insertion on disk...

Yes, the WAL is the long pole when writing. But MemStore has issues too; Dhruba says CPU above. Reading and writing it is also 'slow'.
St.Ack
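The attached test file isn't reproduced above, but a stripped-down sketch of the comparison Akash describes (reader, writer and deleter threads running simultaneously against a ConcurrentSkipListMap and against a synchronized TreeMap) might look like this. The thread and operation counts, key space, and class names are illustrative; the original test used 1,000,000 operations per thread.

import java.util.Collections;
import java.util.Random;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.CountDownLatch;

public class MapShootout {

    static final int THREADS_PER_OP = 100;
    static final int OPS_PER_THREAD = 100000;  // scaled down from 1,000,000
    static final int KEY_SPACE = 100000;

    public static void main(String[] args) throws InterruptedException {
        long cslm = run(new ConcurrentSkipListMap<Integer, Integer>());
        long sync = run(Collections.synchronizedSortedMap(new TreeMap<Integer, Integer>()));
        System.out.println("ConcurrentSkipListMap: " + cslm + " ms");
        System.out.println("synchronized TreeMap:  " + sync + " ms");
    }

    // Start 100 writer, 100 reader and 100 deleter threads against the same map
    // and return the wall-clock time until all of them finish.
    static long run(final SortedMap<Integer, Integer> map) throws InterruptedException {
        final CountDownLatch done = new CountDownLatch(THREADS_PER_OP * 3);
        long start = System.currentTimeMillis();
        for (int t = 0; t < THREADS_PER_OP; t++) {
            startWorker(done, new Runnable() {  // writer
                public void run() {
                    Random r = new Random();
                    for (int i = 0; i < OPS_PER_THREAD; i++) map.put(r.nextInt(KEY_SPACE), i);
                }
            });
            startWorker(done, new Runnable() {  // reader
                public void run() {
                    Random r = new Random();
                    for (int i = 0; i < OPS_PER_THREAD; i++) map.get(r.nextInt(KEY_SPACE));
                }
            });
            startWorker(done, new Runnable() {  // deleter
                public void run() {
                    Random r = new Random();
                    for (int i = 0; i < OPS_PER_THREAD; i++) map.remove(r.nextInt(KEY_SPACE));
                }
            });
        }
        done.await();
        return System.currentTimeMillis() - start;
    }

    static void startWorker(final CountDownLatch done, final Runnable work) {
        new Thread(new Runnable() {
            public void run() {
                try {
                    work.run();
                } finally {
                    done.countDown();
                }
            }
        }).start();
    }
}

Results from a harness like this are sensitive to the warm-up, run-order and thread-count caveats raised earlier in the thread, so runs should be repeated within one JVM and the earlier ones discarded.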