By default we roll every hour regardless of the number of edits. We also roll at 0.95 x the HDFS block size (fairly arbitrary, but it usually works out to around 120M). Same deal on flushes - we trigger them when we have too many outstanding log segments - but we're much less aggressive than you: our default here is 32 segments, I think. We'll also roll immediately any time there are fewer than 3 DNs in the write pipeline.
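In pseudocode, the policy above works out to roughly this (a schematic sketch, not the actual HBase HLog code; the WalRollPolicy class name and the way the thresholds are passed in are invented for illustration):

    // Schematic only -- not the actual HBase HLog code. Class name, field
    // names, and how thresholds are wired in are made up for illustration.
    class WalRollPolicy {
      private static final long ROLL_INTERVAL_MS = 60L * 60 * 1000;  // roll hourly
      private static final double BLOCK_FILL_RATIO = 0.95;           // roll at 0.95 x block size
      private static final int MAX_OUTSTANDING_SEGMENTS = 32;        // flush beyond this
      private static final int MIN_PIPELINE_DATANODES = 3;           // roll below this

      private final long blockSizeBytes;

      WalRollPolicy(long blockSizeBytes) {
        this.blockSizeBytes = blockSizeBytes;
      }

      // Roll on age, on approximate size (~120M with a 128M block size), or
      // when the current write pipeline has fewer than 3 datanodes.
      boolean shouldRoll(long segmentAgeMs, long segmentSizeBytes, int pipelineDatanodes) {
        return segmentAgeMs >= ROLL_INTERVAL_MS
            || segmentSizeBytes >= (long) (blockSizeBytes * BLOCK_FILL_RATIO)
            || pipelineDatanodes < MIN_PIPELINE_DATANODES;
      }

      // Ask for memstore flushes once too many un-archived segments pile up,
      // so old segments can be dropped and recovery time stays bounded.
      boolean shouldRequestFlush(int outstandingSegments) {
        return outstandingSegments > MAX_OUTSTANDING_SEGMENTS;
      }
    }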
-Todd

On Mon, Jan 30, 2012 at 11:15 AM, Eric Newton <[email protected]> wrote:
> What is "periodically" for closing the log segments? Accumulo bases this
> decision on the log size, with 1G segments by default.
>
> Accumulo marks the tablets with the log segment, and initiates flushes
> when the number of segments grows large (3 segments by default).
> Otherwise, tablets with low write-rates, like the metadata table, can
> take forever to recover.
>
> Does HBase do similar things? Have you experimented with larger/smaller
> segments?
>
> -Eric
>
> On Mon, Jan 30, 2012 at 1:59 PM, Todd Lipcon <[email protected]> wrote:
>
>> On Mon, Jan 30, 2012 at 10:53 AM, Eric Newton <[email protected]> wrote:
>> > I will definitely look at using HBase's WAL. What did you guys do to
>> > distribute the log-sort/split?
>>
>> In 0.90 the log sorting was done serially by the master, which as you
>> can imagine was slow after a full outage.
>> In 0.92 we have distributed log splitting:
>> https://issues.apache.org/jira/browse/hbase-1364
>>
>> I don't think there is a single design doc for it anywhere, but
>> basically we use ZK as a distributed work queue. The master shoves
>> items in there, and the region servers try to claim them. Each item
>> corresponds to a log segment (the HBase WAL rolls periodically). After
>> the segments are split, they're moved into the appropriate region
>> directories, so when the region is next opened, they are replayed.
>>
>> -Todd
>>
>> >
>> > On Mon, Jan 30, 2012 at 1:37 PM, Todd Lipcon <[email protected]> wrote:
>> >
>> >> On Mon, Jan 30, 2012 at 10:05 AM, Aaron Cordova
>> >> <[email protected]> wrote:
>> >> >> The big problem is that writing replicas in HDFS is done in a
>> >> >> pipeline, rather than in parallel. There is a ticket to change
>> >> >> this (HDFS-1783), but no movement on it since last summer.
>> >> >
>> >> > ugh - why would they change this? Pipelining maximizes bandwidth
>> >> > usage. It'd be cool if the log stream could be configured to
>> >> > return after being written to one, two, or more nodes though.
>> >> >
>> >>
>> >> The JIRA proposes to allow "star replication" instead of "pipeline
>> >> replication" on a per-stream basis. Pipelining trades off latency
>> >> for bandwidth -- multiple RTTs instead of 1 RTT.
>> >>
>> >> A few other notes relevant to the discussion above (sorry for losing
>> >> the quote history):
>> >>
>> >> Regarding HDFS being designed for large sequential writes rather
>> >> than small records: that was originally true, but now it's actually
>> >> fairly efficient. We have optimizations like HDFS-895 specifically
>> >> for the WAL use case which approximate things like group commit, and
>> >> when you combine that with group commit at the tablet-server level
>> >> you can get very good throughput along with durability guarantees.
>> >> I haven't ever benchmarked against Accumulo's Loggers, but I'd be
>> >> surprised if the difference were substantial -- we tend to be
>> >> network bound on the WAL unless the edits are really quite tiny.
>> >>
>> >> We're also looking at making our WAL implementation pluggable: see
>> >> HBASE-4529. Maybe a similar approach could be taken in Accumulo such
>> >> that HBase could use Accumulo loggers, or Accumulo could use HBase's
>> >> existing WAL class?
>> >>
>> >> -Todd
>> >>
>> >> --
>> >> Todd Lipcon
>> >> Software Engineer, Cloudera
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

--
Todd Lipcon
Software Engineer, Cloudera
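The distributed log splitting described in the quoted thread boils down to a ZK-backed work queue: the master enqueues one item per WAL segment, region servers race to claim items, split them, and delete them. Below is a minimal sketch of that claim pattern using the plain ZooKeeper client API; the /splitlog path, znode layout, and SplitWorker class are illustrative only, not what HBASE-1364 actually implements:

    // Minimal sketch of a ZK-backed work queue for log splitting.
    // Znode layout and claim protocol are illustrative, not HBase's actual one.
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class SplitWorker {
      private static final String TASKS = "/splitlog";  // hypothetical path, assumed to exist
      private final ZooKeeper zk;
      private final String workerName;

      SplitWorker(ZooKeeper zk, String workerName) {
        this.zk = zk;
        this.workerName = workerName;
      }

      // Master side: enqueue one task per WAL segment that needs splitting.
      static void enqueue(ZooKeeper zk, String walSegmentPath)
          throws KeeperException, InterruptedException {
        zk.create(TASKS + "/" + walSegmentPath.replace('/', '%'),
            new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      }

      // Region-server side: try to claim a task by creating an ephemeral
      // "owner" child; whoever wins the create does the split, then deletes
      // the task node.
      void runOnce() throws KeeperException, InterruptedException {
        List<String> tasks = zk.getChildren(TASKS, false);
        for (String task : tasks) {
          String owner = TASKS + "/" + task + "/owner";
          try {
            zk.create(owner, workerName.getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
          } catch (KeeperException.NodeExistsException claimed) {
            continue;  // another server already owns this segment
          }
          splitSegment(task.replace('%', '/'));
          zk.delete(owner, -1);
          zk.delete(TASKS + "/" + task, -1);
        }
      }

      private void splitSegment(String walSegmentPath) {
        // Sort the edits by region and write them into per-region
        // recovered-edits files so they are replayed when the region is
        // next opened.
      }
    }

Because the owner znode is ephemeral, a worker that dies mid-split simply loses its claim and another server can pick the segment up again.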

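Group commit, mentioned above in connection with HDFS-895 and batching at the tablet-server level, is simply the idea that many writer threads share a single sync. A schematic sketch, assuming a made-up GroupCommitLog class rather than HBase's actual HLog/syncer code:

    // Schematic group commit: many appender threads queue edits; a single
    // syncer thread writes and syncs them in batches so each sync covers
    // many edits instead of one.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.CountDownLatch;

    class GroupCommitLog {
      private final List<byte[]> pending = new ArrayList<>();
      private final List<CountDownLatch> waiters = new ArrayList<>();

      // Called by many client threads: enqueue the edit, then block until a
      // syncer has made it durable.
      void append(byte[] edit) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        synchronized (this) {
          pending.add(edit);
          waiters.add(done);
          notifyAll();
        }
        done.await();
      }

      // Called in a loop by a single background thread: grab everything
      // queued so far, write and sync it once, then release the waiters.
      void syncLoop() throws InterruptedException {
        while (true) {
          List<byte[]> batch;
          List<CountDownLatch> batchWaiters;
          synchronized (this) {
            while (pending.isEmpty()) wait();
            batch = new ArrayList<>(pending);
            batchWaiters = new ArrayList<>(waiters);
            pending.clear();
            waiters.clear();
          }
          writeAndSync(batch);  // one flush/sync call covers the whole batch
          for (CountDownLatch latch : batchWaiters) latch.countDown();
        }
      }

      private void writeAndSync(List<byte[]> batch) {
        // Append each edit to the log output stream, then sync once.
      }
    }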