You might also want to run the code a few times, and only take results
from the latter half. Let the JVM warm up and JIT things for a fair
comparison.
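
To make that concrete, here's a rough sketch of the kind of warm-up loop I
mean; the map and the iteration counts below are just placeholders, not the
actual test:

    import java.util.concurrent.ConcurrentSkipListMap;

    // Rough sketch: run the same workload several times in one JVM and only
    // report the later runs, after the JIT has had a chance to warm up.
    public class WarmupSketch {
        public static void main(String[] args) {
            ConcurrentSkipListMap<Long, Long> map =
                new ConcurrentSkipListMap<Long, Long>();
            int totalRuns = 10;
            for (int run = 0; run < totalRuns; run++) {
                long start = System.nanoTime();
                for (long i = 0; i < 1000000; i++) {  // placeholder workload
                    map.put(i, i);
                    map.remove(i);
                }
                long millis = (System.nanoTime() - start) / 1000000;
                if (run >= totalRuns / 2) {           // keep only the latter half
                    System.out.println("run " + run + ": " + millis + "ms");
                }
            }
        }
    }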

On Mon, Oct 24, 2011 at 11:14 PM, Akash Ashok <thehellma...@gmail.com> wrote:
> On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Akash:
>> Take a look at the following two possible replacements for CSLM:
>> https://github.com/mspiegel/lockfreeskiptree
>> https://github.com/nbronson/snaptree
>
> Thanks Ted :) :) :). I was pretty much on the lookout for other structures.
> I tried
> http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html
> but it didn't perform that well. Will look into these replacements.
>
>
>>
>> About your test, I got advice from another expert:
>>
>> - the JVM warm-up penalty is being accrued by the CSLM run
>> - Use Thread.yield() to end a request, as otherwise the active thread may
>> be able to run through its time slice without incurring much lock
>> contention.
>>
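FWIW, a rough sketch of the Thread.yield() idea; the map, thread count and
iteration count here are only placeholders:

    import java.util.concurrent.ConcurrentSkipListMap;

    // Sketch: each worker yields after every simulated request so it gives up
    // its time slice instead of running through it with little contention.
    public class YieldSketch {
        static final ConcurrentSkipListMap<Long, Long> map =
            new ConcurrentSkipListMap<Long, Long>();

        public static void main(String[] args) throws InterruptedException {
            Thread[] workers = new Thread[100];
            for (int t = 0; t < workers.length; t++) {
                workers[t] = new Thread(new Runnable() {
                    public void run() {
                        for (long i = 0; i < 100000; i++) {
                            map.put(i, i);   // one "request" (placeholder work)
                            Thread.yield();  // end the request by yielding
                        }
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) {
                w.join();
            }
        }
    }
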
>
> Hmm, I was just wondering: since each thread runs 10,000+ inserts and
> deletes, it has to context switch quite a few times during its lifecycle,
> right? Wouldn't it be enough if I increased the number of iterations by a
> factor of, say, 100 per thread?
>
>
>
>> You can publish your code and results under HBASE-3992.
>>
>> Thanks
>>
>> On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:
>>
>> > Oh, and when running these experiments, you should look at the impact of
>> > the order they are run in, whether you run them multiple times per JVM
>> > instance, etc.  Basically, you need to be cognizant of the HotSpot
>> > optimizations the JVM is doing at runtime.
>> >
>> > > -----Original Message-----
>> > > From: Jonathan Gray [mailto:jg...@fb.com]
>> > > Sent: Sunday, October 23, 2011 4:20 PM
>> > > To: dev@hbase.apache.org
>> > > Subject: RE: SILT - nice keyvalue store paper
>> > >
>> > > Very nice experiment, Akash.  Keep getting your hands dirty and
>> > > digging!  :)
>> > >
>> > > I think your results might change if you bump the test up to 1000
>> > > threads or so.  100 threads can still perform okay when there's a
>> > > global lock but the contention at 1000 threads will kill you and that's
>> > > when CSLM should do much better.  (1000 handler threads is approx. what
>> > > I run with on RS in prod).  Though I am a bit surprised that at 100
>> > > threads the TreeMap was significantly faster.  Your inconsistent
>> > > results are a bit odd, you might try an order of magnitude more
>> > > operations per thread.  You might also gather some statistics about
>> > > tree size and per operation latency.
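
Something like the following could collect those statistics; the AtomicLong
counters and the Long keys are just one illustrative way to do it:

    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: accumulate per-operation latency and a count so runs can be
    // compared on more than total wall-clock time.
    public class OpStats {
        static final AtomicLong putNanos = new AtomicLong();
        static final AtomicLong putCount = new AtomicLong();

        static void timedPut(ConcurrentSkipListMap<Long, Long> map, long k, long v) {
            long start = System.nanoTime();
            map.put(k, v);
            putNanos.addAndGet(System.nanoTime() - start);
            putCount.incrementAndGet();
        }

        static void report(ConcurrentSkipListMap<Long, Long> map) {
            long count = putCount.get();
            long avg = (count == 0) ? 0 : putNanos.get() / count;
            // note: CSLM.size() is O(n), so only call this at the end of a run
            System.out.println("puts=" + count + " avgNs=" + avg
                + " treeSize=" + map.size());
        }
    }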
>> > >
>> > > I've done some isolated CSLM benchmarks in the past and have never
>> > > been able to reproduce any of the slowness people suggest.  I recall
>> > > trying some impractically large MemStores and everything still being
>> > > quite fast.
>> > >
>> > > Over in Cassandra, I believe they have a two-level CSLM with the first
>> > > map key being the row and then the columns for each row in their own
>> > > CSLM.  I've been told this is somewhat of a pain point for them.  And
>> > > keep in mind they have one shard/region per node and we generally have
>> > > several smaller MemStores on each node (tens to thousands).  Not sure
>> > > we would want to try that.  There could be some interesting
>> > > optimizations if you had very specific issues, like if you had a ton of
>> > > reads to MemStore and not many writes you could keep some kind of
>> > > mirrored hashmap.
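
To make the two-level layout concrete, here is a sketch of roughly what that
structure looks like; the String/byte[] types are simplified placeholders,
not Cassandra's or HBase's real classes:

    import java.util.concurrent.ConcurrentSkipListMap;

    // Sketch of a two-level map: the outer map is keyed by row, and each row
    // holds its own skip list of columns.  Simplified String keys throughout.
    public class TwoLevelMemStoreSketch {
        private final ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, byte[]>> rows =
            new ConcurrentSkipListMap<String, ConcurrentSkipListMap<String, byte[]>>();

        public void put(String row, String column, byte[] value) {
            ConcurrentSkipListMap<String, byte[]> columns = rows.get(row);
            if (columns == null) {
                ConcurrentSkipListMap<String, byte[]> fresh =
                    new ConcurrentSkipListMap<String, byte[]>();
                ConcurrentSkipListMap<String, byte[]> existing = rows.putIfAbsent(row, fresh);
                columns = (existing != null) ? existing : fresh;
            }
            columns.put(column, value);
        }

        public byte[] get(String row, String column) {
            ConcurrentSkipListMap<String, byte[]> columns = rows.get(row);
            return (columns == null) ? null : columns.get(column);
        }
    }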
>> > >
>> > > And for writes, the WAL is definitely the latency bottleneck.  But if
>> > > you are doing lots of small operations (so your WALEdits are not
>> > > large), then with some of the HLog batching features going into trunk
>> > > you end up with hundreds of requests per HLog sync.  And although the
>> > > syncs are higher latency, with batching you end up getting high
>> > > throughput.  And the bottleneck shifts.
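
Not the actual HLog code, but a toy sketch of that group-commit pattern:
writers enqueue edits cheaply and a single syncer thread pays one sync for
each accumulated batch.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.LinkedBlockingQueue;

    // Toy group-commit sketch: many writers enqueue edits, one thread drains
    // whatever has accumulated and pays one (slow) sync for the whole batch.
    public class GroupCommitSketch {
        private final LinkedBlockingQueue<byte[]> pending = new LinkedBlockingQueue<byte[]>();

        public void append(byte[] edit) {
            pending.add(edit);                // cheap, per-request
        }

        // Run in a dedicated thread.
        public void syncLoop() throws InterruptedException {
            List<byte[]> batch = new ArrayList<byte[]>();
            while (true) {
                batch.add(pending.take());    // block for at least one edit
                pending.drainTo(batch);       // grab everything else waiting
                syncToDisk(batch);            // one ~1-5ms sync covers N edits
                batch.clear();
            }
        }

        private void syncToDisk(List<byte[]> batch) {
            // placeholder for the actual write + fsync of the batch
        }
    }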
>> > >
>> > > Each sync will take approx. 1-5ms, so let's say 250 requests per HLog
>> > > sync batch at 4ms per sync, which is 62.5k req/sec.  (62.5k req/sec *
>> > > 100 bytes/req is roughly 6MB/sec of log data, very reasonable.)  If
>> > > you're mixing in reads as well (or if you're doing increments, which do
>> > > a read and a write), then this adds to the CPU usage and contention
>> > > without adding to HLog throughput.
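
Spelling out that arithmetic (illustrative numbers only):

    // Back-of-the-envelope throughput for the batching numbers above.
    public class SyncThroughput {
        public static void main(String[] args) {
            int reqPerBatch = 250;          // requests per HLog sync batch
            double syncSeconds = 0.004;     // ~4ms per sync
            int bytesPerReq = 100;
            double reqPerSec = reqPerBatch / syncSeconds;              // 62,500 req/sec
            double mbPerSec = reqPerSec * bytesPerReq / (1024 * 1024); // ~6 MB/sec
            System.out.println(reqPerSec + " req/sec, " + mbPerSec + " MB/sec");
        }
    }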
>> > >
>> > > All of a sudden the bottleneck becomes CPU/contention and not HLog
>> > > latency or throughput.  Highly concurrent increments/counters with a
>> > > largely in-memory dataset can easily be CPU bottlenecked.
>> > >
>> > > For one specific application Dhruba and I worked on, we made some good
>> > > improvements in CPU efficiency by reducing the number of operations and
>> > > increasing efficiency on the CSLM: doing things like always taking a
>> > > tailMap and working from that instead of starting at the root node,
>> > > using an iterator() and taking advantage of its remove() semantics, or
>> > > simply mutating things that are normally immutable :)  Unfortunately
>> > > many of these optimizations were semi-horrid hacks and introduced
>> > > things like ModifiableKeyValues, so not all of them have made their way
>> > > to Apache.
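
A sketch of that tailMap/iterator pattern; the String keys and the row-prefix
convention here are stand-ins for the real KeyValue comparator:

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    // Sketch: instead of repeated lookups from the root of the skip list,
    // take a tailMap at the key range you care about and walk it with an
    // iterator, removing entries in place as you go.
    public class TailMapSketch {
        public static void deleteRow(ConcurrentSkipListMap<String, byte[]> memstore,
                                     String rowPrefix) {
            ConcurrentNavigableMap<String, byte[]> tail = memstore.tailMap(rowPrefix);
            Iterator<Map.Entry<String, byte[]>> it = tail.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, byte[]> entry = it.next();
                if (!entry.getKey().startsWith(rowPrefix)) {
                    break;            // past the row, stop scanning
                }
                it.remove();          // remove() is supported on the view
            }
        }
    }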
>> > >
>> > > In the end, after our optimizations, the real-world workload Dhruba
>> > > and I were working with was not all in-memory, so the bottleneck in
>> > > production became the random reads (so increasing the block cache hit
>> > > ratio is the focus) rather than CPU contention or HLog throughput.
>> > >
>> > > JG
>> > >
>> > > From: Akash Ashok [mailto:thehellma...@gmail.com]
>> > > Sent: Sunday, October 23, 2011 2:57 AM
>> > > To: dev@hbase.apache.org
>> > > Subject: Re: SILT - nice keyvalue store paper
>> > >
>> > > I was running some similar tests and came across a surprising finding.
>> > > I compared reads and writes on ConcurrentSkipListMap (which the
>> > > MemStore uses) and a synchronized TreeMap (literally a TreeMap wrapped
>> > > in synchronization), executing concurrent reads, writes and deletes on
>> > > both of them.  Surprisingly the synchronized TreeMap performed better,
>> > > though just slightly better, than the ConcurrentSkipListMap which
>> > > KeyValueSkipListSet uses.
>> > >
>> > > Here is the output of a few runs.
>> > >
>> > > Sometimes the difference was considerable:
>> > > Using HBaseMap it took 20438ms
>> > > Using TreeMap it took 11613ms
>> > > Time Difference: 8825ms
>> > >
>> > > And sometimes the difference was negligible:
>> > > Using HBaseMap it took 13370ms
>> > > Using TreeMap it took 9482ms
>> > > Time Difference: 3888ms
>> > >
>> > > I'm attaching the test Java file which I wrote.  This might be a very
>> > > minor difference but it is still surprising, considering the fact that
>> > > ConcurrentSkipListMap uses fancy two-level indexes which are said to
>> > > improve deletion performance.
>> > >
>> > > And here are the details of the test run:
>> > > 100 threads each fetching 1,000,000 records
>> > > 100 threads each adding 1,000,000 records
>> > > 100 threads each deleting 1,000,000 records
>> > > (reads, writes and deletes run simultaneously)
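
This isn't the attached file, just a stripped-down sketch of that kind of
comparison (writers and deleters only, counts reduced, no warm-up handling):

    import java.util.Collections;
    import java.util.SortedMap;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    // Rough sketch of the comparison: hammer a SortedMap with concurrent
    // writers and deleters and time the whole run.  Not the attached test.
    public class MapRaceSketch {
        static long race(final SortedMap<Long, Long> map, int threads,
                         final int opsPerThread) throws InterruptedException {
            final CountDownLatch done = new CountDownLatch(threads * 2);
            long start = System.currentTimeMillis();
            for (int t = 0; t < threads; t++) {
                new Thread() {                    // writers
                    public void run() {
                        for (long i = 0; i < opsPerThread; i++) map.put(i, i);
                        done.countDown();
                    }
                }.start();
                new Thread() {                    // deleters
                    public void run() {
                        for (long i = 0; i < opsPerThread; i++) map.remove(i);
                        done.countDown();
                    }
                }.start();
            }
            done.await();
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws InterruptedException {
            long cslm = race(new ConcurrentSkipListMap<Long, Long>(), 100, 100000);
            long tree = race(Collections.synchronizedSortedMap(
                new TreeMap<Long, Long>()), 100, 100000);
            System.out.println("CSLM: " + cslm + "ms, sync TreeMap: " + tree + "ms");
        }
    }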
>> > >
>> > > Cheers,
>> > > Akash A
>> > > On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
>> > > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
>> > > > I would think that the bottleneck for insert is the WAL part?
>> > > > It would be possible to do a kind of memory list preparation during
>> > > > the WAL insertion, and if the WAL insertion is confirmed, do the
>> > > > insertion in the memory list.  But it's strange to have the insertion
>> > > > in memory be significant compared to the insertion on disk...
>> > > >
>> > > Yes, the WAL is the long pole when writing.  But MemStore has issues
>> > > too; Dhruba says CPU above.  Reading and writing it is also 'slow'.
>> > > St.Ack
>> >
>> >
>>
>
