My observation (from working on Spinnaker's NFS server, and then on MapR's server) is that ELR (early lock release) + group-commit is essential. ELR is trivial, and I am a bit surprised that the paper claims no one does it.
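
To make the ELR + group-commit combination concrete, here is a minimal sketch in Python. It is not HBase, HDFS, or MapR code: GroupCommitLog, commit, and the no-op _fsync are made-up names for illustration, and the "log" is just an in-memory list. The commit path buffers the commit record, releases its locks right away (ELR), and only then waits for the record to become durable. The first waiter that finds no log force in progress does the force itself, and that one force covers every record buffered so far.

```python
import threading


class GroupCommitLog:
    """Toy in-memory log; _fsync() is a stand-in for the real log force."""

    def __init__(self):
        self._cond = threading.Condition()
        self._records = []        # buffered log records; list index == LSN
        self._flushing = False    # True while some thread is forcing the log
        self.durable_lsn = -1     # highest LSN covered by a completed force
        self.flush_count = 0      # number of simulated fsyncs so far

    def append(self, record):
        """Buffer a record and return its LSN. No I/O happens here."""
        with self._cond:
            self._records.append(record)
            return len(self._records) - 1

    def _fsync(self):
        """Placeholder: a real log would write and fsync the buffered records."""

    def wait_durable(self, lsn):
        """Block until `lsn` is durable, forcing the log if nobody else is."""
        while True:
            with self._cond:
                # Another thread is already forcing the log: wait for it.
                while self._flushing and self.durable_lsn < lsn:
                    self._cond.wait()
                if self.durable_lsn >= lsn:
                    return
                # Become the leader; this force covers everything buffered so far.
                self._flushing = True
                target = len(self._records) - 1
            self._fsync()  # outside the lock, so other threads can keep appending
            with self._cond:
                self.durable_lsn = max(self.durable_lsn, target)
                self.flush_count += 1
                self._flushing = False
                self._cond.notify_all()


def commit(log, locks, txn_id):
    """Commit with early lock release; the caller must already hold `locks`."""
    lsn = log.append(("COMMIT", txn_id))  # 1. buffer the commit record
    for lock in locks:                    # 2. ELR: release locks before the force
        lock.release()
    log.wait_durable(lsn)                 # 3. hold the reply until it is durable
    return lsn
```

A caller would take its locks, apply the change, call commit(log, locks, txn_id), and reply to the client only after it returns. In this sketch a transaction waits only when another force is already in flight, so a lone transaction never sits on a timer.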

Once ELR is implemented, the bottleneck immediately shifts to forcing the log on a commit. But if multiple commit records end up landing on the same VM page in the Linux kernel (imagine tiny transactions), then the fsync issued during the log-force will cause the kernel to lock out further writes to that page while it is flushed to disk, so things come to a halt anyway.

HBase writes to a single log file (thus a single spindle on HDFS), so the fsync rate is further limited. Thus, a group-commit at the HBase level will go a long way in improving performance.

But group-commit (as usually implemented in most systems) ends up requiring a timer + 2 extra context switches for each transaction. Perhaps a "peek" into the transaction manager to see if other transactions are actually running can tell whether to even bother with the group-commit (i.e., wait for a group-commit only if there are other uncommitted transactions in flight).

On Wed, Dec 29, 2010 at 12:07 PM, Ryan Rawson <[email protected]> wrote:

> Oh no, let's be wary of those server rewrites. My micro profiling is
> showing about 30 usec for a lock handoff in the HBase client...
>
> I think we should be able to get big wins with minimal things. A big
> rewrite has its major costs, not to mention to effectively be async
> we'd have to rewrite every single piece of code more complex than
> Bytes.*. If you need to block you will need to push context on a
> context-store (aka stack) and manage that all ourselves.
>
> I've been seeing papers that are talking about threading improvements
> that could get us better performance. Assuming that ctx is the actual
> reason why we aren't as fast as we could be (note: we are NOT slow!).
>
> As for the DI, I think I'd like to see more study on the costs and
> benefits. We have a relatively minimal amount of interfaces and
> concrete objects, for the interfaces we do, we have 1 or 2
> implementations at most. Usually 1. There is a cost, I'd like to see
> more descriptions of the costs vs the benefits.
>
> -ryan
>
> On Wed, Dec 29, 2010 at 11:32 AM, Stack <[email protected]> wrote:
> > Nice list of things we need to do to make logging faster (with useful
> > citations on current state of art). This notion of early lock release
> > (ELR) is worth looking into (Jon, for high rates of counter
> > transactions, you've been talking about aggregating counts in front of
> > the WAL lock... maybe an ELR and then a hold on the transaction until
> > confirmation of flush would be way to go?). Regards flush-pipelining,
> > it would be interesting to see if there are traces of the sys-time
> > that Dhruba is seeing in his NN out in HBase servers. My guess is
> > that it's probably drowned by other context switches done in our
> > servers. Definitely worth study.
> >
> > St.Ack
> > P.S. Minimizing context switches, a system for ELR and
> > flush-pipelining, recasting the server to make use of one of the DI or
> > OSGi frameworks, moving off log4j, etc..... Is it just me or do others
> > feel a server rewrite coming on?
> >
> >
> > On Mon, Dec 27, 2010 at 11:48 AM, Dhruba Borthakur <[email protected]> wrote:
> >> HDFS currently uses Hadoop RPC and the server thread blocks till the WAL is
> >> written to disk. In earlier deployments, I thought we could safely ignore
> >> flush-pipelining by creating more server threads. But in our largest HDFS
> >> systems, I am starting to see 20% sys-time usage on the namenode machine;
> >> most of this could be thread scheduling. If so, then it makes sense to
> >> enhance the logging code to release server threads even before the WAL is
> >> flushed to disk (but, of course, we still have to delay the transaction
> >> response to the client till the WAL is synced to disk).
> >>
> >> Does anybody have any idea on how to figure out what percentage of the above
> >> sys-time is spent in thread scheduling vs the time spent in other system
> >> calls (especially in the Namenode context)?
> >>
> >> thanks,
> >> dhruba
> >>
> >>
> >> On Fri, Dec 24, 2010 at 8:17 PM, Todd Lipcon <[email protected]> wrote:
> >>
> >>> Via Hammer - I thought this was a pretty good read, some good ideas for
> >>> optimizations for our WAL.
> >>>
> >>> http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
> >>>
> >>> -Todd
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera
> >>>
> >>
> >>
> >>
> >> --
> >> Connect to me at http://www.facebook.com/dhruba
> >>
