You might also want to run the code a few times, and only take results from the latter half. Let the JVM warm up and JIT things for a fair comparison.
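Something like this is the shape I mean: a rough, hypothetical harness for the
CSLM vs. synchronized TreeMap comparison discussed below (not Akash's actual
test file), which throws away the first half of the rounds:

import java.util.Collections;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class WarmupBench {
    static final int ROUNDS = 10;      // total rounds; measure only the last half
    static final int OPS = 1000000;    // operations per round

    public static void main(String[] args) {
        SortedMap<Long, Long> cslm = new ConcurrentSkipListMap<Long, Long>();
        SortedMap<Long, Long> syncTree =
            Collections.synchronizedSortedMap(new TreeMap<Long, Long>());

        long cslmMs = 0, treeMs = 0;
        for (int round = 0; round < ROUNDS; round++) {
            long a = timePuts(cslm);
            long b = timePuts(syncTree);
            // Discard the early rounds: they are dominated by JIT
            // compilation and class loading, not steady-state speed.
            if (round >= ROUNDS / 2) {
                cslmMs += a;
                treeMs += b;
            }
        }
        System.out.println("CSLM (warm):    " + cslmMs + "ms");
        System.out.println("TreeMap (warm): " + treeMs + "ms");
    }

    static long timePuts(SortedMap<Long, Long> map) {
        map.clear();
        long start = System.nanoTime();
        for (long i = 0; i < OPS; i++) {
            map.put(i, i);
        }
        return (System.nanoTime() - start) / 1000000L;
    }
}

Interleaving the two maps within the same rounds also helps control for the
run-order effect Jonathan mentions below.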
On Mon, Oct 24, 2011 at 11:14 PM, Akash Ashok <thehellma...@gmail.com> wrote:
> On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> Akash:
>> Take a look at the following two possible replacements for CSLM:
>> https://github.com/mspiegel/lockfreeskiptree
>> https://github.com/nbronson/snaptree
>
> Thanks Ted :) :) :). Was pretty much on the lookout for other structures.
> I tried
> http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html
> but it didn't perform that well. Will look into these replacements.
>
>> About your test, I got advice from another expert:
>>
>> - The JVM warm-up penalty is being accrued by the CSLM run.
>> - Use Thread.yield() to end a request, as otherwise the active thread
>> may be able to run through its time slice without incurring much lock
>> contention.
>
> Hmm, I was just wondering: since each thread runs 10000+ inserts and
> deletes, it has to context switch quite a few times during its lifecycle,
> right? Wouldn't it be enough if I increased the number of iterations per
> thread by a factor of, say, 100?
>
>> You can publish your code and results under HBASE-3992.
>>
>> Thanks
>>
>> On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:
>> > Oh, and when running these experiments, you should look at the impact
>> > of the order they are run in, whether you run them multiple times per
>> > JVM instance, etc. Basically, you need to be cognizant of the HotSpot
>> > optimizations the JVM is doing at runtime.
>> >
>> > > -----Original Message-----
>> > > From: Jonathan Gray [mailto:jg...@fb.com]
>> > > Sent: Sunday, October 23, 2011 4:20 PM
>> > > To: dev@hbase.apache.org
>> > > Subject: RE: SILT - nice keyvalue store paper
>> > >
>> > > Very nice experiment, Akash. Keep getting your hands dirty and
>> > > digging! :)
>> > >
>> > > I think your results might change if you bump the test up to 1000
>> > > threads or so. 100 threads can still perform okay when there's a
>> > > global lock, but the contention at 1000 threads will kill you, and
>> > > that's when CSLM should do much better. (1000 handler threads is
>> > > approximately what I run with on RegionServers in prod.) Though I am
>> > > a bit surprised that at 100 threads the TreeMap was significantly
>> > > faster. Your inconsistent results are a bit odd; you might try an
>> > > order of magnitude more operations per thread. You might also gather
>> > > some statistics about tree size and per-operation latency.
>> > >
>> > > I've done some isolated CSLM benchmarks in the past and have never
>> > > been able to reproduce any of the slowness people suggest. I recall
>> > > trying some impractically large MemStores and everything still being
>> > > quite fast.
>> > >
>> > > Over in Cassandra, I believe they have a two-level CSLM, with the
>> > > first map keyed on the row and the columns for each row in their own
>> > > CSLM. I've been told this is somewhat of a pain point for them. And
>> > > keep in mind they have one shard/region per node, while we generally
>> > > have several smaller MemStores on each node (tens to thousands). Not
>> > > sure we would want to try that. There could be some interesting
>> > > optimizations if you had very specific issues; for example, if you
>> > > had a ton of reads to MemStore and not many writes, you could keep
>> > > some kind of mirrored hashmap.
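As an aside, the two-level layout described there amounts to roughly the
following. This is only a sketch (illustrative names, not Cassandra's actual
code), and I'm borrowing HBase's Bytes.BYTES_COMPARATOR for byte[] ordering:

import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoLevelMemStore {
    // Outer level: row key -> that row's columns, each row in its own CSLM.
    private final ConcurrentNavigableMap<byte[], ConcurrentNavigableMap<byte[], byte[]>>
        rows = new ConcurrentSkipListMap<byte[], ConcurrentNavigableMap<byte[], byte[]>>(
            Bytes.BYTES_COMPARATOR);

    public void put(byte[] row, byte[] column, byte[] value) {
        ConcurrentNavigableMap<byte[], byte[]> columns = rows.get(row);
        if (columns == null) {
            ConcurrentNavigableMap<byte[], byte[]> fresh =
                new ConcurrentSkipListMap<byte[], byte[]>(Bytes.BYTES_COMPARATOR);
            // Another writer may have created the row concurrently;
            // keep whichever map won the race.
            ConcurrentNavigableMap<byte[], byte[]> existing = rows.putIfAbsent(row, fresh);
            columns = (existing == null) ? fresh : existing;
        }
        columns.put(column, value);
    }

    public byte[] get(byte[] row, byte[] column) {
        ConcurrentNavigableMap<byte[], byte[]> columns = rows.get(row);
        return (columns == null) ? null : columns.get(column);
    }
}

Each per-row map stays small, so searches and contention are scoped to a
single row; the cost is the extra outer lookup and the putIfAbsent race
handling.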
>> > > And for writes, the WAL is definitely the latency bottleneck. But if
>> > > you are doing lots of small operations, so your WALEdits are not
>> > > large, and with some of the HLog batching features going into trunk,
>> > > you end up with hundreds of requests per HLog sync. And although the
>> > > syncs have higher latency, with batching you end up getting high
>> > > throughput. And the bottleneck shifts.
>> > >
>> > > Each sync will take approximately 1-5ms, so let's say 250 requests
>> > > per HLog sync batch at 4ms per sync: that's 62.5k req/sec (62.5k
>> > > req/sec * 100 bytes/req = about 6.25MB/sec, very reasonable). If
>> > > you're mixing in reads as well (or doing increments, which do a read
>> > > and a write), then this adds to the CPU usage and contention without
>> > > adding to HLog throughput.
>> > >
>> > > All of a sudden the bottleneck becomes CPU/contention and not HLog
>> > > latency or throughput. Highly concurrent increments/counters with a
>> > > largely in-memory dataset can easily be CPU bottlenecked.
>> > >
>> > > For one specific application Dhruba and I worked on, we made some
>> > > good improvements in CPU efficiency by reducing the number of
>> > > operations and increasing efficiency on the CSLM: doing things like
>> > > always taking a tailMap and working from that instead of starting at
>> > > the root node, using an iterator() and taking advantage of the
>> > > available remove() semantics, or simply just mutating things that are
>> > > normally immutable :) Unfortunately many of these optimizations were
>> > > semi-horrid hacks and introduced things like ModifiableKeyValues, so
>> > > they haven't all made their way to Apache.
>> > >
>> > > In the end, after our optimizations, the real-world workload Dhruba
>> > > and I were working with was not all in-memory, so the bottleneck in
>> > > production became the random reads (so increasing the block cache hit
>> > > ratio is the focus) rather than CPU contention or HLog throughput.
>> > >
>> > > JG
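The tailMap()/iterator() pattern mentioned there looks roughly like this (a
simplified, hypothetical sketch, not the actual patch):

import java.util.Comparator;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentNavigableMap;

public class RangeDelete {
    // Position once with tailMap() instead of descending from the head of
    // the skip list for every key, then work from the iterator.
    static <K, V> void deleteRange(ConcurrentNavigableMap<K, V> map,
                                   K from, K to, Comparator<? super K> cmp) {
        Iterator<Map.Entry<K, V>> it =
            map.tailMap(from, true).entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<K, V> e = it.next();
            if (cmp.compare(e.getKey(), to) >= 0) {
                break;      // walked past the end of the range
            }
            it.remove();    // delete while traversing, no separate remove(key) call
        }
    }
}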
>> > > From: Akash Ashok [mailto:thehellma...@gmail.com]
>> > > Sent: Sunday, October 23, 2011 2:57 AM
>> > > To: dev@hbase.apache.org
>> > > Subject: Re: SILT - nice keyvalue store paper
>> > >
>> > > I was running some similar tests and came across a surprising
>> > > finding. I compared reads and writes on ConcurrentSkipListMap (which
>> > > the MemStore uses) and a synchronized TreeMap (which was literally a
>> > > TreeMap, synchronized), executing concurrent reads, writes and
>> > > deletes on both of them. Surprisingly, the synchronized TreeMap
>> > > performed better, though just slightly, than the
>> > > ConcurrentSkipListMap which KeyValueSkipListSet uses.
>> > >
>> > > Here is the output of a few runs.
>> > >
>> > > Sometimes the difference was considerable:
>> > > Using HBaseMap it took 20438ms. Using TreeMap it took 11613ms.
>> > > Time difference: 8825ms.
>> > >
>> > > And sometimes the difference was negligible:
>> > > Using HBaseMap it took 13370ms. Using TreeMap it took 9482ms.
>> > > Time difference: 3888ms.
>> > >
>> > > I'm attaching the test Java file which I wrote to test it. This might
>> > > be a very minor difference, but it is still surprising considering
>> > > the fact that ConcurrentSkipListMap uses fancy two-level indexes,
>> > > which they say improve deletion performance.
>> > >
>> > > And here are the details of the test run:
>> > > 100 threads each fetching 1,000,000 records,
>> > > 100 threads each adding 1,000,000 records, and
>> > > 100 threads each deleting 1,000,000 records
>> > > (reads, writes and deletes running simultaneously).
>> > >
>> > > Cheers,
>> > > Akash A
>> > >
>> > > On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
>> > > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
>> > > > I would think that the bottleneck for inserts is the WAL part?
>> > > > It would be possible to do a kind of memory list preparation during
>> > > > the WAL insertion, and if the WAL insertion is confirmed, do the
>> > > > insertion in the memory list. But it's strange to have the
>> > > > in-memory insertion matter vs. the insertion on disk...
>> > >
>> > > Yes, the WAL is the long pole for writing. But MemStore has issues
>> > > too; Dhruba says CPU, above. Reading and writing it is also 'slow'.
>> > > St.Ack
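For what it's worth, the prepare-then-commit ordering N Keywal describes would
look roughly like this. All of these names are made up for illustration (they
are not HBase's actual WAL/MemStore API):

import java.io.IOException;

// Hypothetical interfaces, only to show the ordering.
interface Wal {
    long append(byte[] edit) throws IOException;  // enqueue the edit for the log
    void sync(long txid) throws IOException;      // block until it is durable
}

interface MemStore {
    Object prepare(byte[] kv);     // build the entry; not yet visible to readers
    void commit(Object prepared);  // publish the prepared entry to readers
    void abort(Object prepared);   // throw it away if logging failed
}

class WriteOrdering {
    static void write(Wal wal, MemStore memstore, byte[] edit) throws IOException {
        long txid = wal.append(edit);              // WAL write is in flight...
        Object prepared = memstore.prepare(edit);  // ...while we build the entry
        try {
            wal.sync(txid);                        // durability point
        } catch (IOException e) {
            memstore.abort(prepared);              // never expose an unlogged edit
            throw e;
        }
        memstore.commit(prepared);                 // only now readable
    }
}

The expensive in-memory preparation overlaps with the WAL round trip, and
nothing becomes visible to readers until the edit is durable.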