http://code.google.com/p/caliper/ might be useful.
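As a sketch of how that could apply here: a Caliper benchmark comparing the two maps discussed below might look roughly like the following, assuming the Caliper 0.5-era SimpleBenchmark API (the class name, key type, and workload are illustrative, not from this thread):

    import com.google.caliper.Param;
    import com.google.caliper.Runner;
    import com.google.caliper.SimpleBenchmark;

    import java.util.Collections;
    import java.util.Map;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class MapBenchmark extends SimpleBenchmark {
      @Param({"cslm", "synctree"}) String impl;
      private Map<Long, Long> map;

      @Override protected void setUp() {
        map = impl.equals("cslm")
            ? new ConcurrentSkipListMap<Long, Long>()
            : Collections.synchronizedMap(new TreeMap<Long, Long>());
      }

      // Caliper drives time* methods with a rep count and does its own
      // warm-up, which sidesteps the JIT pitfalls discussed below.
      public void timePutGetRemove(int reps) {
        for (int i = 0; i < reps; i++) {
          map.put((long) i, (long) i);
          map.get((long) i);
          map.remove((long) i);
        }
      }

      public static void main(String[] args) {
        Runner.main(MapBenchmark.class, args);
      }
    }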
On Tue, Oct 25, 2011 at 7:36 AM, Andrew Purtell <apurt...@apache.org> wrote:
> Also please make note of what JVM version and report it with the results.
>
> Best regards,
>
>   - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>
> ----- Original Message -----
> > From: Li Pi <l...@idle.li>
> > To: dev@hbase.apache.org
> > Sent: Monday, October 24, 2011 11:27 PM
> > Subject: Re: CSLM performance Was: SILT - nice keyvalue store paper
> >
> > You might also want to run the code a few times, and only take results
> > from the latter half. Let the JVM warm up and JIT things for a fair
> > comparison.
> >
> > On Mon, Oct 24, 2011 at 11:14 PM, Akash Ashok <thehellma...@gmail.com> wrote:
> > > On Mon, Oct 24, 2011 at 10:49 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> > > > Akash:
> > > > Take a look at the following two possible replacements for CSLM:
> > > > https://github.com/mspiegel/lockfreeskiptree
> > > > https://github.com/nbronson/snaptree
> > >
> > > Thanks Ted :) :) :). I was pretty much on the lookout for other
> > > structures. I tried
> > > http://g.oswego.edu/dl/classes/EDU/oswego/cs/dl/util/concurrent/SyncSortedMap.html
> > > but it didn't perform that well. I will look into these replacements.
> > >
> > > > About your test, I got advice from another expert:
> > > >
> > > > - the JVM warm-up penalty is being accrued by the CSLM run
> > > > - use Thread.yield() to end a request, as otherwise the active
> > > >   thread may be able to run through its time slice without
> > > >   incurring much lock contention
> > >
> > > Hmm, I was just wondering: since each thread runs 10,000+ inserts
> > > and deletes, it has to context switch quite a few times during its
> > > lifecycle, right? Wouldn't it be enough if I increased the number of
> > > iterations per thread by a factor of, say, 100?
> > >
> > > > You can publish your code and results under HBASE-3992.
> > > >
> > > > Thanks
> > > >
> > > > On Sun, Oct 23, 2011 at 5:05 PM, Jonathan Gray <jg...@fb.com> wrote:
> > > > > Oh, and when running these experiments, you should look at the
> > > > > impact of the order they are run in, whether you run them
> > > > > multiple times per JVM instance, etc. Basically, you need to be
> > > > > cognizant of the HotSpot optimizations the JVM is doing at
> > > > > runtime.
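To make the warm-up advice above concrete: a tiny harness, purely illustrative (not from the thread), that runs several full passes and keeps only the timings from the latter half, after HotSpot has compiled the hot paths:

    import java.util.Arrays;

    public class WarmedTiming {
      // Run the benchmark body `passes` times; return only the latter half
      // of the per-pass timings so measurements come from JIT-compiled code.
      static long[] timeLatterHalf(Runnable body, int passes) {
        long[] nanos = new long[passes];
        for (int i = 0; i < passes; i++) {
          long start = System.nanoTime();
          body.run();
          nanos[i] = System.nanoTime() - start;
        }
        return Arrays.copyOfRange(nanos, passes / 2, passes);
      }

      public static void main(String[] args) {
        long[] kept = timeLatterHalf(() -> {
          java.util.concurrent.ConcurrentSkipListMap<Long, Long> m =
              new java.util.concurrent.ConcurrentSkipListMap<>();
          for (long k = 0; k < 100_000; k++) { m.put(k, k); m.remove(k); }
        }, 10);
        System.out.println(Arrays.toString(kept));
      }
    }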
> > > > > > -----Original Message-----
> > > > > > From: Jonathan Gray [mailto:jg...@fb.com]
> > > > > > Sent: Sunday, October 23, 2011 4:20 PM
> > > > > > To: dev@hbase.apache.org
> > > > > > Subject: RE: SILT - nice keyvalue store paper
> > > > > >
> > > > > > Very nice experiment, Akash. Keep getting your hands dirty and
> > > > > > digging! :)
> > > > > >
> > > > > > I think your results might change if you bump the test up to
> > > > > > 1000 threads or so. 100 threads can still perform okay when
> > > > > > there's a global lock, but the contention at 1000 threads will
> > > > > > kill you, and that's when CSLM should do much better. (1000
> > > > > > handler threads is approx. what I run with on a RS in prod.)
> > > > > > Though I am a bit surprised that at 100 threads the TreeMap was
> > > > > > significantly faster. Your inconsistent results are a bit odd;
> > > > > > you might try an order of magnitude more operations per thread.
> > > > > > You might also gather some statistics about tree size and
> > > > > > per-operation latency.
> > > > > >
> > > > > > I've done some isolated CSLM benchmarks in the past and have
> > > > > > never been able to reproduce any of the slowness people suggest.
> > > > > > I recall trying some impractically large MemStores and
> > > > > > everything still being quite fast.
> > > > > >
> > > > > > Over in Cassandra, I believe they have a two-level CSLM, with
> > > > > > the first map key being the row and then the columns for each
> > > > > > row in their own CSLM. I've been told this is somewhat of a pain
> > > > > > point for them. And keep in mind they have one shard/region per
> > > > > > node and we generally have several smaller MemStores on each
> > > > > > node (tens to thousands). Not sure we would want to try that.
> > > > > > There could be some interesting optimizations if you had very
> > > > > > specific issues; for example, if you had a ton of reads to
> > > > > > MemStore and not many writes, you could keep some kind of
> > > > > > mirrored hashmap.
> > > > > >
> > > > > > And for writes, the WAL is definitely the latency bottleneck.
> > > > > > But if you are doing lots of small operations, so your WALEdits
> > > > > > are not large, and with some of the HLog batching features going
> > > > > > into trunk, you end up with hundreds of requests per HLog sync.
> > > > > > And although the syncs are higher latency, with batching you end
> > > > > > up getting high throughput. And the bottleneck shifts.
> > > > > >
> > > > > > Each sync will take approx. 1-5ms, so let's say 250 requests per
> > > > > > HLog sync batch and 4ms per sync: that's 62.5k req/sec (62.5k *
> > > > > > 100 bytes/req = ~6 MB/sec, very reasonable). If you're mixing in
> > > > > > reads as well (or if you're doing increments, which do a read
> > > > > > and a write), then this adds to the CPU usage and contention
> > > > > > without adding to HLog throughput.
> > > > > >
> > > > > > All of a sudden the bottleneck becomes CPU/contention and not
> > > > > > HLog latency or throughput. Highly concurrent
> > > > > > increments/counters with a largely in-memory dataset can easily
> > > > > > be CPU bottlenecked.
> > > > > >
> > > > > > For one specific application Dhruba and I worked on, we made
> > > > > > some good improvements in CPU efficiency by reducing the number
> > > > > > of operations and increasing efficiency on the CSLM: doing
> > > > > > things like always taking a tailMap and working from that
> > > > > > instead of starting at the root node, using an iterator() and
> > > > > > taking advantage of the available remove() semantics, or simply
> > > > > > just mutating things that are normally immutable :)
> > > > > > Unfortunately many of these optimizations were semi-horrid hacks
> > > > > > and introduced things like ModifiableKeyValues, so they haven't
> > > > > > all made their way to Apache.
> > > > > >
> > > > > > In the end, after our optimizations, the real-world workload
> > > > > > Dhruba and I were working with was not all in-memory, so the
> > > > > > bottleneck in production became the random reads (so increasing
> > > > > > the block cache hit ratio is the focus) rather than CPU
> > > > > > contention or HLog throughput.
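The tailMap/iterator idiom Jonathan mentions, sketched with illustrative Long keys (the real code operated on KeyValues; this only shows the shape of the optimization):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentNavigableMap;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class TailMapScan {
      // Start the scan at startKey via tailMap() instead of repeatedly
      // searching from the root, and delete through the iterator. CSLM
      // iterators are weakly consistent, so this is safe under concurrent
      // writers and never throws ConcurrentModificationException.
      static void pruneFrom(ConcurrentSkipListMap<Long, byte[]> memstore,
                            long startKey) {
        ConcurrentNavigableMap<Long, byte[]> tail = memstore.tailMap(startKey);
        Iterator<Map.Entry<Long, byte[]>> it = tail.entrySet().iterator();
        while (it.hasNext()) {
          Map.Entry<Long, byte[]> e = it.next();
          if (e.getValue().length == 0) { // illustrative "delete marker" check
            it.remove();                  // removes from the underlying map
          }
        }
      }
    }

The point of the pattern is that each subsequent lookup or removal starts from where the scan already is, rather than paying a fresh O(log n) search from the top of the skip list index for every operation.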
> > > > > > JG
> > > > > >
> > > > > > From: Akash Ashok [mailto:thehellma...@gmail.com]
> > > > > > Sent: Sunday, October 23, 2011 2:57 AM
> > > > > > To: dev@hbase.apache.org
> > > > > > Subject: Re: SILT - nice keyvalue store paper
> > > > > >
> > > > > > I was running some similar tests and came across a surprising
> > > > > > finding. I compared reads and writes on ConcurrentSkipListMap
> > > > > > (which the memstore uses) and a synchronized TreeMap (which was
> > > > > > literally a TreeMap, synchronized). I executed concurrent reads,
> > > > > > writes and deletes on both of them. Surprisingly, the
> > > > > > synchronized TreeMap performed better, though only slightly,
> > > > > > than the ConcurrentSkipListMap which KeyValueSkipListSet uses.
> > > > > >
> > > > > > Here are the outputs of a few runs.
> > > > > >
> > > > > > Sometimes the difference was considerable:
> > > > > > Using HBaseMap it took 20438ms
> > > > > > Using TreeMap it took 11613ms
> > > > > > Time Difference: 8825ms
> > > > > >
> > > > > > And sometimes the difference was negligible:
> > > > > > Using HBaseMap it took 13370ms
> > > > > > Using TreeMap it took 9482ms
> > > > > > Time Difference: 3888ms
> > > > > >
> > > > > > I'm attaching the test java file which I wrote to test it. This
> > > > > > might be a very minor difference, but it is still surprising
> > > > > > considering the fact that ConcurrentSkipListMap uses fancy
> > > > > > two-level indexes, which they say improve deletion performance.
> > > > > >
> > > > > > And here are the details about the test run:
> > > > > > 100 threads each fetching 1,000,000 records
> > > > > > 100 threads each adding 1,000,000 records
> > > > > > 100 threads each deleting 1,000,000 records
> > > > > > (reads, writes and deletes run simultaneously)
> > > > > >
> > > > > > Cheers,
> > > > > > Akash A
> > > > > >
> > > > > > On Sun, Oct 23, 2011 at 3:25 AM, Stack <st...@duboce.net> wrote:
> > > > > > > On Sat, Oct 22, 2011 at 2:41 PM, N Keywal <nkey...@gmail.com> wrote:
> > > > > > > > I would think that the bottleneck for insert is the WAL
> > > > > > > > part? It would be possible to do a kind of memory list
> > > > > > > > preparation during the WAL insertion, and if the WAL
> > > > > > > > insertion is confirmed, do the insertion in the memory
> > > > > > > > list. But it's strange for the in-memory insertion to
> > > > > > > > matter compared to the insertion on disk...
> > > > > > >
> > > > > > > Yes, WAL is the long pole writing. But MemStore has issues
> > > > > > > too; Dhruba says CPU, above. Reading and writing it is also
> > > > > > > 'slow'.
> > > > > > > St.Ack
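Akash's attachment isn't reproduced in the archive; a hedged reconstruction of the test's shape, where the thread counts and the key range are guesses based on the description above:

    import java.util.Collections;
    import java.util.Map;
    import java.util.Random;
    import java.util.TreeMap;
    import java.util.concurrent.ConcurrentSkipListMap;
    import java.util.concurrent.CountDownLatch;

    public class MapContentionTest {
      // 100 reader, 100 writer, and 100 deleter threads run simultaneously,
      // each issuing opsPerThread operations, as in the test described above.
      static long run(Map<Long, Long> map, int threadsPerGroup, int opsPerThread)
          throws InterruptedException {
        CountDownLatch start = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(threadsPerGroup * 3);
        for (int group = 0; group < 3; group++) {
          final int op = group;
          for (int t = 0; t < threadsPerGroup; t++) {
            new Thread(() -> {
              Random rnd = new Random();
              try {
                start.await();
                for (int i = 0; i < opsPerThread; i++) {
                  long k = rnd.nextInt(1_000_000);
                  if (op == 0) map.get(k);
                  else if (op == 1) map.put(k, k);
                  else map.remove(k);
                }
              } catch (InterruptedException ignored) {
              } finally {
                done.countDown();
              }
            }).start();
          }
        }
        long t0 = System.currentTimeMillis();
        start.countDown(); // release all threads at once
        done.await();
        return System.currentTimeMillis() - t0;
      }

      public static void main(String[] args) throws Exception {
        System.out.println("CSLM: "
            + run(new ConcurrentSkipListMap<>(), 100, 1_000_000) + "ms");
        System.out.println("sync TreeMap: "
            + run(Collections.synchronizedMap(new TreeMap<>()), 100, 1_000_000) + "ms");
      }
    }

Per the warm-up discussion earlier in the thread, whichever map runs first in a shared JVM pays the JIT compilation cost, so each configuration should be run several times (or in separate JVM instances) before comparing numbers.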