http://www.azulsystems.com/events/javaone_2009/session/2009_J1_Benchmark.pdf

Might also be useful.

On Tue, Oct 25, 2011 at 9:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> http://code.google.com/p/caliper/ might be useful.
>
> On Tue, Oct 25, 2011 at 7:36 AM, Andrew Purtell <apurt...@apache.org> wrote:
>
>> Also, please make note of which JVM version you use and report it with the results.
>>
>>
>> Best regards,
>>
>>
>>    - Andy
>>
>> Problems worthy of attack prove their worth by hitting back. - Piet Hein
>> (via Tom White)
>>
>>
>> ----- Original Message -----
>> > From: Li Pi <l...@idle.li>
>> > To: dev@hbase.apache.org
>> > Cc:
>> > Sent: Monday, October 24, 2011 11:27 PM
>> > Subject: Re: CSLM performance Was: SILT - nice keyvalue store paper
>> >
>> > You might also want to run the code a few times, and only take results
>> > from the latter half. Let the JVM warm up and JIT things for a fair
>> > comparison.
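>> >
>> > As a rough sketch (class and method names here are just illustrative,
>> > not from the attached test), the harness could look like:
>> >
>> > public class WarmupTimer {
>> >   // Run the same workload several times and only time the later runs,
>> >   // so the JIT has compiled the hot paths before measurement starts.
>> >   public static void main(String[] args) {
>> >     int runs = 10;
>> >     long totalNanos = 0;
>> >     for (int i = 0; i < runs; i++) {
>> >       long start = System.nanoTime();
>> >       runWorkload();                  // the benchmark body under test
>> >       long elapsed = System.nanoTime() - start;
>> >       if (i >= runs / 2) {            // discard the first half as warm-up
>> >         totalNanos += elapsed;
>> >       }
>> >     }
>> >     System.out.println("avg ms over measured runs: "
>> >         + totalNanos / (runs / 2) / 1000000L);
>> >   }
>> >
>> >   private static void runWorkload() {
>> >     // placeholder for the CSLM / TreeMap operations being compared
>> >   }
>> > }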
>> >
>> > On Mon, Oct 24, 2011 at 11:14 PM, Akash Ashok <thehellma...@gmail.com>
>> > wrote:
>> >>  On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>> >>
>> >>>  Akash:
>> >>>  Take a look at the following two possible replacements for CSLM:
>> >>>  https://github.com/mspiegel/lockfreeskiptree
>> >>>  https://github.com/nbronson/snaptree
>> >>
>> >>  Thanks Ted :) :) :). Was pretty much on the lookout for other
>> >>  structures. I tried
>> >>  http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html
>> >>  but it didn't perform that great. Will look into these replacements.
>> >>
>> >>
>> >>>
>> >>>  About your test, I got advice from another expert:
>> >>>
>> >>>  - the JVM warm-up penalty is being accrued by the CSLM run
>> >>>  - Use Thread.yield() to end a request, as otherwise the active thread
>> >>>  may be able to run through its time slice without incurring much lock
>> >>>  contention.
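>> >>>
>> >>>  A sketch of the second point (names here are made up; the shared map
>> >>>  stands in for whichever structure is being benchmarked):
>> >>>
>> >>>  import java.util.Random;
>> >>>  import java.util.concurrent.ConcurrentSkipListMap;
>> >>>
>> >>>  public class YieldingWriter implements Runnable {
>> >>>    private final ConcurrentSkipListMap<Long, Long> map;
>> >>>    private final int iterations;
>> >>>    private final Random rnd = new Random();
>> >>>
>> >>>    YieldingWriter(ConcurrentSkipListMap<Long, Long> map, int iterations) {
>> >>>      this.map = map;
>> >>>      this.iterations = iterations;
>> >>>    }
>> >>>
>> >>>    public void run() {
>> >>>      for (int i = 0; i < iterations; i++) {
>> >>>        map.put(rnd.nextLong(), rnd.nextLong()); // one "request"
>> >>>        Thread.yield(); // end the request: give other threads a turn so
>> >>>                        // the run actually exercises contention
>> >>>      }
>> >>>    }
>> >>>  }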
>> >>>
>> >>
>> >>  Hmm. I was just wondering. Since each thread runs 10000+ inserts and
>> >>  deletes, it has to context switch quite a few times during its
>> >>  lifecycle, right? Wouldn't it be enough if I increase the number of
>> >>  iterations per thread by a factor of, say, 100?
>> >>
>> >>
>> >>
>> >>>  You can publish your code and results under HBASE-3992.
>> >>>
>> >>>  Thanks
>> >>>
>> >>>  On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com>
>> > wrote:
>> >>>
>> >>>  > Oh, and when running these experiments, you should look at the
>> >>>  > impact of the order in which they are run, whether you run them
>> >>>  > multiple times per JVM instance, etc.  Basically, you need to be
>> >>>  > cognizant of the HotSpot optimizations the JVM is doing at runtime.
>> >>>  >
>> >>>  > > -----Original Message-----
>> >>>  > > From: Jonathan Gray [mailto:jg...@fb.com]
>> >>>  > > Sent: Sunday, October 23, 2011 4:20 PM
>> >>>  > > To: dev@hbase.apache.org
>> >>>  > > Subject: RE: SILT - nice keyvalue store paper
>> >>>  > >
>> >>>  > > Very nice experiment, Akash.  Keep getting your hands dirty and
>> >>>  > > digging!  :)
>> >>>  > >
>> >>>  > > I think your results might change if you bump the test up to 1000
>> >>>  > > threads or so.  100 threads can still perform okay when there's a
>> >>>  > > global lock but the contention at 1000 threads will kill you and
>> >>>  > > that's when CSLM should do much better.  (1000 handler threads is
>> >>>  > > approx. what I run with on RS in prod).
>> >>>  > > Though I am a bit surprised that at 100 threads the TreeMap was
>> >>>  > > significantly faster.  Your inconsistent results are a bit odd, you
>> >>>  > > might try an order of magnitude more operations per thread.  You
>> >>>  > > might also gather some statistics about tree size and per operation
>> >>>  > > latency.
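>> >>>  > >
>> >>>  > > Per-operation latency and map size could be tracked with something
>> >>>  > > along these lines (class and field names are made up):
>> >>>  > >
>> >>>  > > import java.util.concurrent.ConcurrentSkipListMap;
>> >>>  > > import java.util.concurrent.atomic.AtomicLong;
>> >>>  > >
>> >>>  > > public class OpStats {
>> >>>  > >   // AtomicLongs keep the counters safe to update from many threads.
>> >>>  > >   static final AtomicLong putCount = new AtomicLong();
>> >>>  > >   static final AtomicLong putNanos = new AtomicLong();
>> >>>  > >
>> >>>  > >   static void timedPut(ConcurrentSkipListMap<Long, Long> map,
>> >>>  > >                        long key, long value) {
>> >>>  > >     long start = System.nanoTime();
>> >>>  > >     map.put(key, value);
>> >>>  > >     putNanos.addAndGet(System.nanoTime() - start);
>> >>>  > >     putCount.incrementAndGet();
>> >>>  > >   }
>> >>>  > >
>> >>>  > >   static void report(ConcurrentSkipListMap<Long, Long> map) {
>> >>>  > >     // note: size() on a CSLM is O(n), so only call it in reports
>> >>>  > >     System.out.println("entries: " + map.size());
>> >>>  > >     System.out.println("avg put ns: "
>> >>>  > >         + putNanos.get() / Math.max(1, putCount.get()));
>> >>>  > >   }
>> >>>  > > }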
>> >>>  > >
>> >>>  > > I've done some isolated CSLM benchmarks in the past and have never
>> >>>  > > been able to reproduce any of the slowness people suggest.  I recall
>> >>>  > > trying some impractically large MemStores and everything still being
>> >>>  > > quite fast.
>> >>>  > >
>> >>>  > > Over in Cassandra, I believe they have a two-level CSLM with the
>> >>>  > > first map key being the row and then the columns for each row in
>> >>>  > > their own CSLM.  I've been told this is somewhat of a pain point
>> >>>  > > for them.  And keep in mind they have one shard/region per node and
>> >>>  > > we generally have several smaller MemStores on each node (tens to
>> >>>  > > thousands).  Not sure we would want to try that.  There could be
>> >>>  > > some interesting optimizations if you had very specific issues,
>> >>>  > > like if you had a ton of reads to MemStore and not many writes you
>> >>>  > > could keep some kind of mirrored hashmap.
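>> >>>  > >
>> >>>  > > The mirrored-hashmap idea, purely as an illustration (this is not
>> >>>  > > how MemStore works today, and the two maps below are not kept in
>> >>>  > > sync atomically):
>> >>>  > >
>> >>>  > > import java.util.concurrent.ConcurrentHashMap;
>> >>>  > > import java.util.concurrent.ConcurrentSkipListMap;
>> >>>  > >
>> >>>  > > public class MirroredStore<K extends Comparable<K>, V> {
>> >>>  > >   private final ConcurrentSkipListMap<K, V> sorted =
>> >>>  > >       new ConcurrentSkipListMap<K, V>();
>> >>>  > >   private final ConcurrentHashMap<K, V> mirror =
>> >>>  > >       new ConcurrentHashMap<K, V>();
>> >>>  > >
>> >>>  > >   public void put(K key, V value) {
>> >>>  > >     sorted.put(key, value);  // keeps scan order
>> >>>  > >     mirror.put(key, value);  // cheap point reads
>> >>>  > >   }
>> >>>  > >
>> >>>  > >   public V get(K key) {
>> >>>  > >     return mirror.get(key);  // skips the skip-list traversal
>> >>>  > >   }
>> >>>  > >
>> >>>  > >   public ConcurrentSkipListMap<K, V> scanView() {
>> >>>  > >     return sorted;           // scans still use the ordered map
>> >>>  > >   }
>> >>>  > > }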
>> >>>  > >
>> >>>  > > And for writes, the WAL is definitely the latency bottleneck.  But
>> >>>  > > if you are doing lots of small operations, so your WALEdits are not
>> >>>  > > large, and with some of the HLog batching features going in to
>> >>>  > > trunk, you end up with hundreds of requests per HLog sync.  And
>> >>>  > > although the syncs are higher latency, with batching you end up
>> >>>  > > getting high throughput.  And the bottleneck shifts.
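>> >>>  > >
>> >>>  > > A toy sketch of the group-commit idea (not the actual HLog code,
>> >>>  > > just the shape of it):
>> >>>  > >
>> >>>  > > import java.util.concurrent.atomic.AtomicLong;
>> >>>  > >
>> >>>  > > public class GroupCommitLog {
>> >>>  > >   private final AtomicLong appended = new AtomicLong(); // last appended
>> >>>  > >   private long synced = 0;                              // last durable
>> >>>  > >
>> >>>  > >   public long append(byte[] edit) {
>> >>>  > >     // write the edit to the log buffer here (omitted)
>> >>>  > >     return appended.incrementAndGet();
>> >>>  > >   }
>> >>>  > >
>> >>>  > >   public synchronized void waitForSync(long seq)
>> >>>  > >       throws InterruptedException {
>> >>>  > >     while (synced < seq) {
>> >>>  > >       wait();               // block until the syncer covers our edit
>> >>>  > >     }
>> >>>  > >   }
>> >>>  > >
>> >>>  > >   // run by a single background thread
>> >>>  > >   public void syncLoop() throws InterruptedException {
>> >>>  > >     while (true) {
>> >>>  > >       long target = appended.get();
>> >>>  > >       Thread.sleep(4);      // stand-in for one real fsync (~4ms)
>> >>>  > >       synchronized (this) {
>> >>>  > >         synced = target;    // one sync covers every edit <= target
>> >>>  > >         notifyAll();        // wake all writers batched behind it
>> >>>  > >       }
>> >>>  > >     }
>> >>>  > >   }
>> >>>  > > }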
>> >>>  > >
>> >>>  > > Each sync will take approx. 1-5ms, so let's say 250 requests per
>> >>>  > > HLog sync batch, 4ms per sync, so 62.5k req/sec.  (62.5k req/sec *
>> >>>  > > 100 bytes/req = about 6.25MB/sec, very reasonable).  If you're
>> >>>  > > mixing in reads as well (or if you're doing increments which do a
>> >>>  > > read and write), then this adds to the CPU usage and contention
>> >>>  > > without adding to HLog throughput.
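>> >>>  > >
>> >>>  > > (Spelled out: 1 sync / 4ms = 250 syncs/sec; 250 syncs/sec * 250
>> >>>  > > requests/sync = 62,500 req/sec; 62,500 req/sec * 100 bytes/req is
>> >>>  > > roughly 6.25MB/sec of WAL writes.)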
>> >>>  > >
>> >>>  > > All of a sudden the bottleneck becomes CPU/contention and not HLog
>> >>>  > > latency or throughput.  Highly concurrent increments/counters with
>> >>>  > > a largely in-memory dataset can easily be CPU bottlenecked.
>> >>>  > >
>> >>>  > > For one specific application Dhruba and I worked on, we made some
>> >>>  > > good improvements in CPU efficiency by reducing the number of
>> >>>  > > operations and increasing efficiency on the CSLM.  Doing things
>> >>>  > > like always taking a tailMap and working from that instead of
>> >>>  > > starting at the root node, using an iterator() and taking advantage
>> >>>  > > of the available remove() semantics, or simply just mutating things
>> >>>  > > that are normally immutable :)  Unfortunately many of these
>> >>>  > > optimizations were semi-horrid hacks and introduced things like
>> >>>  > > ModifiableKeyValues, so they haven't all made their way to Apache.
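>> >>>  > >
>> >>>  > > A rough illustration of the tailMap/iterator pattern (types
>> >>>  > > simplified to longs; the real code works on KeyValues):
>> >>>  > >
>> >>>  > > import java.util.Iterator;
>> >>>  > > import java.util.Map;
>> >>>  > > import java.util.concurrent.ConcurrentSkipListMap;
>> >>>  > >
>> >>>  > > public class TailMapScan {
>> >>>  > >   // Instead of repeated lookups that each start from the top of the
>> >>>  > >   // skip list, position once with tailMap() and then walk (and
>> >>>  > >   // optionally delete) entries through the iterator.
>> >>>  > >   static void scanFrom(ConcurrentSkipListMap<Long, Long> map,
>> >>>  > >                        long fromKey) {
>> >>>  > >     Iterator<Map.Entry<Long, Long>> it =
>> >>>  > >         map.tailMap(fromKey).entrySet().iterator();
>> >>>  > >     while (it.hasNext()) {
>> >>>  > >       Map.Entry<Long, Long> entry = it.next();
>> >>>  > >       if (entry.getValue() < 0) {  // placeholder predicate
>> >>>  > >         it.remove();               // removes from the backing CSLM
>> >>>  > >       }
>> >>>  > >     }
>> >>>  > >   }
>> >>>  > > }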
>> >>>  > >
>> >>>  > > In the end, after our optimizations, the real world workload Dhruba
>> >>>  > > and I were working with was not all in-memory so the bottleneck in
>> >>>  > > production became the random reads (so increasing the block cache
>> >>>  > > hit ratio is the focus) rather than CPU contention or HLog
>> >>>  > > throughput.
>> >>>  > >
>> >>>  > > JG
>> >>>  > >
>> >>>  > > From: Akash Ashok [mailto:thehellma...@gmail.com]
>> >>>  > > Sent: Sunday, October 23, 2011 2:57 AM
>> >>>  > > To: dev@hbase.apache.org
>> >>>  > > Subject: Re: SILT - nice keyvalue store paper
>> >>>  > >
>> >>>  > > I was running some similar tests and came across a surprising
>> >>>  > > finding.  I compared reads and writes on ConcurrentSkipListMap
>> >>>  > > (which the memstore uses) and a synchronized TreeMap (which was
>> >>>  > > literally a TreeMap, synchronized).  I executed concurrent reads,
>> >>>  > > writes and deletes on both of them.
>> >>>  > > Surprisingly, the synchronized TreeMap performed better, though
>> >>>  > > just slightly better, than the ConcurrentSkipListMap which
>> >>>  > > KeyValueSkipListSet uses.
>> >>>  > >
>> >>>  > > Here is the output of a few runs:
>> >>>  > >
>> >>>  > > Sometimes the difference was considerable:
>> >>>  > > Using HBaseMap it took 20438ms
>> >>>  > > Using TreeMap it took 11613ms
>> >>>  > > Time Difference: 8825ms
>> >>>  > >
>> >>>  > > And sometimes the difference was negligible:
>> >>>  > > Using HBaseMap it took 13370ms
>> >>>  > > Using TreeMap it took 9482ms
>> >>>  > > Time Difference: 3888ms
>> >>>  > >
>> >>>  > > I'm attaching the test java file which I wrote to test it.
>> >>>  > > This might be a very minor difference but it is still surprising
>> >>>  > > considering the fact that ConcurrentSkipListMap uses fancy 2-level
>> >>>  > > indexes which they say improve the deletion performance.
>> >>>  > >
>> >>>  > > And here are the details about the test run:
>> >>>  > > 100 threads each fetching 1,000,000 records
>> >>>  > > 100 threads each adding 1,000,000 records
>> >>>  > > 100 threads each deleting 1,000,000 records
>> >>>  > > (reads, writes and deletes run simultaneously)
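>> >>>  > >
>> >>>  > > The shape of the comparison, as a cut-down sketch (not the attached
>> >>>  > > file; each thread here mixes the three operations instead of using
>> >>>  > > separate reader/writer/deleter pools):
>> >>>  > >
>> >>>  > > import java.util.Collections;
>> >>>  > > import java.util.Map;
>> >>>  > > import java.util.Random;
>> >>>  > > import java.util.TreeMap;
>> >>>  > > import java.util.concurrent.ConcurrentSkipListMap;
>> >>>  > > import java.util.concurrent.CountDownLatch;
>> >>>  > >
>> >>>  > > public class MapContention {
>> >>>  > >   static void race(final Map<Long, Long> map, int threads,
>> >>>  > >                    final int ops) throws InterruptedException {
>> >>>  > >     final CountDownLatch done = new CountDownLatch(threads);
>> >>>  > >     long start = System.currentTimeMillis();
>> >>>  > >     for (int t = 0; t < threads; t++) {
>> >>>  > >       new Thread(new Runnable() {
>> >>>  > >         public void run() {
>> >>>  > >           Random rnd = new Random();
>> >>>  > >           for (int i = 0; i < ops; i++) {
>> >>>  > >             long k = rnd.nextLong();
>> >>>  > >             map.put(k, k);    // write, read and delete per iteration
>> >>>  > >             map.get(k);
>> >>>  > >             map.remove(k);
>> >>>  > >           }
>> >>>  > >           done.countDown();
>> >>>  > >         }
>> >>>  > >       }).start();
>> >>>  > >     }
>> >>>  > >     done.await();
>> >>>  > >     System.out.println(map.getClass().getSimpleName() + ": "
>> >>>  > >         + (System.currentTimeMillis() - start) + "ms");
>> >>>  > >   }
>> >>>  > >
>> >>>  > >   public static void main(String[] args) throws InterruptedException {
>> >>>  > >     race(new ConcurrentSkipListMap<Long, Long>(), 100, 100000);
>> >>>  > >     race(Collections.synchronizedMap(new TreeMap<Long, Long>()),
>> >>>  > >          100, 100000);
>> >>>  > >   }
>> >>>  > > }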
>> >>>  > >
>> >>>  > > Cheers,
>> >>>  > > Akash A
>> >>>  > > On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
>> >>>  > > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
>> >>>  > > > I would think that the bottleneck for insert is the wal part?
>> >>>  > > > It would be possible to do a kind of memory list preparation
>> >>>  > > > during the wal insertion, and if the wal insertion is confirmed,
>> >>>  > > > do the insertion in the memory list. But it's strange for the
>> >>>  > > > insertion in memory to matter compared with the insertion on
>> >>>  > > > disk...
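>> >>>  > > >
>> >>>  > > > Roughly the ordering being suggested (Wal, Memstore and Edit below
>> >>>  > > > are stand-ins, not the real HBase types):
>> >>>  > > >
>> >>>  > > > import java.io.IOException;
>> >>>  > > > import java.util.List;
>> >>>  > > >
>> >>>  > > > public class WalFirstWrite {
>> >>>  > > >   interface Wal {
>> >>>  > > >     long append(List<Edit> edits) throws IOException;
>> >>>  > > >     void sync(long seq) throws IOException;
>> >>>  > > >   }
>> >>>  > > >   interface Memstore { void add(Edit edit); }
>> >>>  > > >   static class Edit { }
>> >>>  > > >
>> >>>  > > >   private final Wal wal;
>> >>>  > > >   private final Memstore memstore;
>> >>>  > > >
>> >>>  > > >   WalFirstWrite(Wal wal, Memstore memstore) {
>> >>>  > > >     this.wal = wal;
>> >>>  > > >     this.memstore = memstore;
>> >>>  > > >   }
>> >>>  > > >
>> >>>  > > >   void applyWrite(List<Edit> edits) throws IOException {
>> >>>  > > >     long seq = wal.append(edits); // durable intent first
>> >>>  > > >     wal.sync(seq);                // confirmed on disk
>> >>>  > > >     for (Edit e : edits) {
>> >>>  > > >       memstore.add(e);            // memory insert after confirmation
>> >>>  > > >     }
>> >>>  > > >   }
>> >>>  > > > }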
>> >>>  > > >
>> >>>  > > Yes, WAL is the long pole writing.  But MemStore has issues too;
>> >>>  > > Dhruba says cpu above.  Reading and writing it is also 'slow'.
>> >>>  > > St.Ack
>> >>>  >
>> >>>  >
>> >>>
>> >>
>> >
>
