Re: Good VLDB paper on WALs

M. C. Srivas Sat, 01 Jan 2011 10:44:08 -0800

My observation (from working on Spinnaker's NFS server, and then on MapR's
server) is that ELR + group-commit is essential. ELR is trivial and I am a
bit surprised that the paper claims no one does it.


Once ELR is implemented, the bottleneck immediately shifts to forcing the
log on a commit. But if multiple commit records end up landing on the same
VM page in the Linux kernel (imagine tiny transactions), then fsync issued
during log-force will cause the Linux kernel to lock out further writes to
that page while it is flushed to disk, so things come to a halt
anyway. HBase writes to a single log file (thus a single spindle on HDFS),
so the fsync rate is further limited.

Thus, a group-commit at the HBase level will go a long way in improving
performance. But group-commit (as usually implemented in most systems) ends
up requiring a timer + 2 extra context switches for each transaction. Perhaps
a "peek" into the transaction-manager to see if other transactions are
actually running can tell whether to even bother with the group-commit (i.e.,
wait for a group-commit only if there are other uncommitted transactions in
flight).
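
To make the "peek" idea concrete, here is a minimal sketch in Java. It is not
HBase code: the class PeekingGroupCommitLog, its methods (begin, commit,
forceCount) and the maxWaitMillis parameter are hypothetical names invented
for illustration, and force() only stands in for the real fsync. A committer
waits for a batch only when the in-flight counter says someone else is still
running; a lone transaction forces the log right away, with no timer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: group commit that waits only if other transactions are in flight. */
class PeekingGroupCommitLog {
    private final List<String> pending = new ArrayList<>();     // records not yet forced
    private final AtomicInteger inFlight = new AtomicInteger(); // begun, not yet committed
    private final long maxWaitMillis; // assumed upper bound on waiting for a batch
    private long appended = 0;        // sequence number of the last appended record
    private long forced = 0;          // sequence number of the last durable record
    private int forceCount = 0;       // number of simulated fsyncs, for inspection

    PeekingGroupCommitLog(long maxWaitMillis) { this.maxWaitMillis = maxWaitMillis; }

    /** Called when a transaction starts. */
    void begin() { inFlight.incrementAndGet(); }

    /** Appends a commit record and returns once it is (simulated) durable. */
    synchronized void commit(String record) {
        pending.add(record);
        long mySeq = ++appended;
        // The "peek": how many other transactions are still uncommitted?
        int others = inFlight.decrementAndGet();
        if (others > 0) {
            // Someone else may commit soon: wait so one force covers both.
            long deadline = System.currentTimeMillis() + maxWaitMillis;
            while (forced < mySeq) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) break;
                try {
                    wait(remaining); // releases the monitor so others can append
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        if (forced < mySeq) {
            force(); // nobody covered our record: force it ourselves
        }
    }

    /** Stand-in for the real log force; makes every appended record durable. */
    private void force() {
        pending.clear();
        forced = appended;
        forceCount++;
        notifyAll(); // wake committers whose records are now durable
    }

    synchronized int forceCount() { return forceCount; }
}
```

With two transactions in flight, whichever commits first waits and the second
one's force covers both records. For simplicity the sketch holds the monitor
during the force, so appends block while it runs; a real log would not want
that.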


On Wed, Dec 29, 2010 at 12:07 PM, Ryan Rawson <[email protected]> wrote:

> Oh no, let's be wary of those server rewrites.  My micro profiling is
> showing about 30 usec for a lock handoff in the HBase client...
>
> I think we should be able to get big wins with minimal things.  A big
> rewrite has its major costs, not to mention that to be effectively async
> we'd have to rewrite every single piece of code more complex than
> Bytes.*.  If you need to block you will need to push context on a
> context-store (aka stack) and manage that all ourselves.
>
> I've been seeing papers that are talking about threading improvements
> that could get us better performance.  Assuming that ctx is the actual
> reason why we aren't as fast as we could be (note: we are NOT slow!).
>
> As for the DI, I think I'd like to see more study on the costs and
> benefits.  We have a relatively minimal amount of interfaces and
> concrete objects, for the interfaces we do, we have 1 or 2
> implementations at most.  Usually 1.  There is a cost, I'd like to see
> more descriptions of the costs vs the benefits.
>
> -ryan
>
> On Wed, Dec 29, 2010 at 11:32 AM, Stack <[email protected]> wrote:
> > Nice list of things we need to do to make logging faster (with useful
> > citations on current state of art).  This notion of early lock release
> > (ELR) is worth looking into (Jon, for high rates of counter
> > transactions, you've been talking about aggregating counts in front of
> > the WAL lock... maybe an ELR and then a hold on the transaction until
> > confirmation of flush would be the way to go?).  Regards flush-pipelining,
> > it would be interesting to see if there are traces of the sys-time
> > that Dhruba is seeing in his NN out in HBase servers.  My guess is
> > that it's probably drowned by other context switches done in our
> > servers.  Definitely worth study.
> >
> > St.Ack
> > P.S. Minimizing context switches, a system for ELR and
> > flush-pipelining, recasting the server to make use of one of the DI or
> > OSGi frameworks, moving off log4j, etc..... Is it just me or do others
> > feel a server rewrite coming on?
> >
> >
> > On Mon, Dec 27, 2010 at 11:48 AM, Dhruba Borthakur <[email protected]> wrote:
> >> HDFS currently uses Hadoop RPC and the server thread blocks till the WAL is
> >> written to disk. In earlier deployments, I thought we could safely ignore
> >> flush-pipelining by creating more server threads. But in our largest HDFS
> >> systems, I am starting to see 20% sys-time usage on the namenode machine;
> >> most of this could be thread scheduling. If so, then it makes sense to
> >> enhance the logging code to release server threads even before the WAL is
> >> flushed to disk (but, of course, we still have to delay the transaction
> >> response to the client till the WAL is synced to disk).
> >>
> >> Does anybody have any idea on how to figure out what percentage of the above
> >> sys-time is spent in thread scheduling vs the time spent in other system
> >> calls (especially in the Namenode context)?
> >>
> >> thanks,
> >> dhruba
> >>
> >>
> >> On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon <[email protected]> wrote:
> >>
> >>> Via Hammer - I thought this was a pretty good read, some good ideas for
> >>> optimizations for our WAL.
> >>>
> >>> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
> >>>
> >>> -Todd
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>>
> >>
> >>
> >>
> >> --
> >> Connect to me at http://www.facebook.com/dhruba
> >>
> >
>
