Re: append (hadoop-4379), was -> Re: roadmap: data integrity

stack Sat, 08 Aug 2009 18:15:54 -0700

Was DEBUG enabled in hdfs?  If so, make the NN log available for Dhruba...
comment in HADOOP-4379.
St.Ack



On Sat, Aug 8, 2009 at 9:38 AM, Andrew Purtell <[email protected]> wrote:

> Well, technically the cluster is still up but no clients have been able to
> make any progress for 8+ hours now.
>
>   - Andy
>
>
>
>
> ________________________________
> From: Andrew Purtell <[email protected]>
> To: [email protected]
> Sent: Saturday, August 8, 2009 9:36:42 AM
> Subject: Re: append (hadoop-4379), was -> Re: roadmap: data integrity
>
>
> Cluster down hard after RS failure. Master stuck indefinitely splitting logs.
> Endless instances of this message, once per second:
>
> org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/content/1965559571/oldlogfile.lo retrying...
>
> Turning off "dfs.support.append".
>
>   - Andy
>
>
>
>
> ________________________________
> From: stack <[email protected]>
> To: [email protected]
> Sent: Friday, August 7, 2009 12:34:40 PM
> Subject: Re: append (hadoop-4379), was -> Re: roadmap: data integrity
>
> You are a good man Andrew.
> St.Ack
>
> On Fri, Aug 7, 2009 at 10:27 AM, Andrew Purtell <[email protected]>
> wrote:
>
> > I'm going to join you in testing this stack, taking the below as config
> > recipe.
> >
> >   - Andy
> >
> >
> >
> >
> > ________________________________
> > From: stack <[email protected]>
> > To: [email protected]
> > Sent: Friday, August 7, 2009 9:54:53 AM
> > Subject: append (hadoop-4379), was -> Re: roadmap: data integrity
> >
> > Here is a quick note on the current state of my testing of HADOOP-4379
> > (support for 'append' in hadoop 0.20.x).
> >
> > On my small test cluster, I have not been able to break the latest patch
> > posted by Dhruba under heavy load.  It seems to basically work.  On
> > regionserver crash, the master runs log split, and when it comes to the
> > last of the regionserver logs to split, the one that is inevitably
> > unclosed because the process crashed, we are able to recover most edits
> > in this last file (in my testing, it seemed to be all edits up to the
> > last flush of the regionserver process).
> >
> > The upshot is that, tentatively, we may have a "working" append in the
> > 0.20 timeframe (in 0.21, we should have
> > https://issues.apache.org/jira/browse/HDFS-265).  I'll keep testing, but
> > I'd suggest it's time for others to try it out.
> >
> > With HADOOP-4379, the process recovering non-closed log files -- the
> > master in our case -- must successfully open the file in append mode and
> > then close it.  Once closed, new readers can purportedly see up to the
> > last flush.  The open for append can take a little while before it goes
> > through (the complaint is that another process holds the file's lease).
> > In the meantime, the process opening for append must retry.  In my
> > experience it takes 2-10 seconds.
> >
> > Support for appends is off by default in hadoop even after HADOOP-4379
> > has been applied.  To enable it, you need to set dfs.support.append.
> > Set it everywhere -- all over hadoop and in hbase-site.xml so
> > hbase/DFSClient can see the attribute.
> >
> > HBase TRUNK will recognize whether the bundled hadoop supports append via
> > introspection (SequenceFile has a new syncFs method when HADOOP-4379 has
> > been applied).  If an append-supporting hadoop is present and
> > dfs.support.append is set in the hbase context, then hbase, when it is
> > running HLog#splitLog, will try opening files for append.  On
> > regionserver crash, you can see the master's HLog#splitLog loop retrying
> > the open for append until it is successful (you'll see in the master log
> > the complaint that the lease on the file is held by another process).  We
> > retry every second.
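
The check-then-retry behaviour described above can be sketched roughly as
follows. This is a minimal illustration in Python, not the HBase code:
append_enabled, recover_lease, and the open_for_append callable are
hypothetical names, and the only details taken from the thread are the
syncFs method check, the dfs.support.append setting, and the one-second
retry interval.

```python
import time


def append_enabled(writer_cls, conf):
    """Use append only if the writer class exposes syncFs (the introspection
    check) and dfs.support.append is set to true in the given config dict."""
    has_sync = callable(getattr(writer_cls, "syncFs", None))
    flag = str(conf.get("dfs.support.append", "false")).lower() == "true"
    return has_sync and flag


def recover_lease(open_for_append, path, interval=1.0, max_attempts=None,
                  sleep=time.sleep):
    """Retry open-for-append until it succeeds, then close the file.

    open_for_append is a hypothetical callable that raises IOError while
    another process still holds the lease. Returns the number of attempts.
    With max_attempts=None the loop retries indefinitely.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            handle = open_for_append(path)
        except IOError:
            if max_attempts is not None and attempts >= max_attempts:
                raise
            sleep(interval)  # wait before asking for the lease again
            continue
        handle.close()  # closing is what lets new readers see the edits
        return attempts
```

Passing the sleep function in as a parameter is only a convenience so the
loop can be exercised without real one-second waits.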
> >
> > Successful recovery of all edits is uncovering new, interesting issues.
> > In my testing I was killing the regionserver only, but also killing the
> > regionserver and datanode together.  In the latter case, what I would see
> > is that the namenode would continue to assign the dead datanode work, at
> > least until its lease expired.  Fair enough, says you, only the datanode
> > lease is ten minutes by default.  I set it down in my tests using
> > heartbeat.recheck.interval (there is a pregnant comment in HADOOP-4379
> > w/ clientside code where Ruyue Ma says they get around this issue by
> > having the client pass the namenode the datanodes it knows are dead when
> > asking for an extra block).  We might want to recommend setting it down
> > in general.
> >
> > Other issues are hbase bugs we see when all edits are recovered.  I've
> > been filing issues on these over the last few days.
> >
> > St.Ack
> >
> >
> >
> >
> >
> > On Fri, Aug 7, 2009 at 9:03 AM, Andrew Purtell <[email protected]>
> > wrote:
> >
> > > Good to see there's direct edit replication support; that can make
> > > things easier.
> > >
> > > I've seen people use DRBD or NFS to replicate edits currently.
> > >
> > > Namenode failover is a "solvable" issue with traditional HA: OS level
> > > heartbeats, fencing, fail over -- e.g. HA infrastructure daemon starts
> > > NN instance on node B if heartbeat from node A is lost and takes a
> > > power control operation on A to make sure it is dead. On both nodes the
> > > infrastructure daemons trigger the OS watchdog if the NN process dies.
> > > Combine this with automatic IP address reassignment. Then, page the
> > > operators. Add another node C for additional redundancy, and make sure
> > > all of the alternatives are on separate racks and power rails, and make
> > > sure the L2 and L3 topology is also HA (e.g. bonded ethernet to
> > > redundant switches at L2, mesh routing at L3, etc.) If the cluster is
> > > not super huge it can all be spanned at L2 over redundant switches. L3
> > > redundancy is trickier. A typical configuration could have a lot of OSPF
> > > stub networks -- depends how L2 is partitioned -- which can make the
> > > routing table difficult for operators to sort out.
> > >
> > > I've seen this type of thing work for myself, ~15 seconds from
> > > (simulated) fault on NN node A to the new NN up and responding to DN
> > > reconnections on node B, with 0.19.
> > >
> > > You can build in additional assurance of fast failover by building
> > > redundant processes to run concurrently with a few datanodes which over
> > > and over ping the NN via the namenode protocol and trigger fencing and
> > > failover if it stops responding.
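
A monitor of the kind described above might look roughly like the
following Python sketch. It is illustrative only: watch_namenode, ping,
and on_failure are hypothetical names, the ping callable stands in for a
real namenode protocol call, and the threshold of three consecutive misses
is an assumption rather than anything stated in the thread.

```python
import time


def watch_namenode(ping, on_failure, max_misses=3, interval=1.0,
                   sleep=time.sleep, max_checks=None):
    """Ping the namenode over and over; after max_misses consecutive failed
    pings, call on_failure (fencing and failover) once and return True.

    Returns False if max_checks pings complete without triggering failover.
    With max_checks=None the monitor runs until failover is triggered.
    """
    misses = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        try:
            ok = bool(ping())
        except Exception:
            ok = False  # an error from the ping counts as a missed response
        misses = 0 if ok else misses + 1
        if misses >= max_misses:
            on_failure()
            return True
        sleep(interval)
    return False
```

Requiring several consecutive misses before fencing is a design choice
that trades a slightly slower reaction for fewer spurious failovers.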
> > >
> > > One wrinkle is the new namenode starts up in safe mode. As long as
> > > HBase can handle temporary periods where the cluster goes into
> > > safemode after NN fail over, it can ride it out.
> > >
> > > This is ugly, but this is I believe an accepted and valid systems
> > > engineering solution for the NN SPOF issue for the folks I mentioned
> > > in my previous email, something they would be familiar with. Edit
> > > replication support in HDFS 0.21 makes it a little less work to
> > > achieve and maybe a little faster to execute, so that's an
> > > improvement.
> > >
> > > It may be overstating it a little bit to say that the NN SPOF is not a
> > > concern for HBase, but, in my opinion, we need to address WAL and
> > > (lack of FSCK) issues first before being concerned about it. HBase can
> > > lose data all on its own.
> > >
> > >   - Andy
> > >
> > >
> > >
> > >
> > >
> > > ________________________________
> > > From: Jean-Daniel Cryans <[email protected]>
> > > To: [email protected]
> > > Sent: Friday, August 7, 2009 3:25:19 AM
> > > Subject: Re: roadmap: data integrity
> > >
> > > https://issues.apache.org/jira/browse/HADOOP-4539
> > >
> > > This issue was closed long ago. But, Steve Loughran just said on the
> > > hadoop mailing list that the new NN has to come up with the same
> > > IP/hostname as the failed one.
> > >
> > > J-D
> > >
> > > On Fri, Aug 7, 2009 at 2:37 AM, Ryan Rawson<[email protected]> wrote:
> > > > WAL is a major issue, but another one that is coming up fast is the
> > > > SPOF that is the namenode.
> > > >
> > > > Right now, namenode aside, I can rolling restart my entire cluster,
> > > > including rebooting the machines if I needed to. But not so with the
> > > > namenode, because if it goes AWOL, all sorts of bad can happen.
> > > >
> > > > I hope that HDFS 0.21 addresses both these issues.  Can we get
> > > > positive confirmation that this is being worked on?
> > > >
> > > > -ryan
> > > >
> > > > On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell<[email protected]>
> > > wrote:
> > > >> I updated the roadmap up on the wiki:
> > > >>
> > > >>
> > > >> * Data integrity
> > > >>    * Ensure that proper append() support in HDFS actually closes the
> > > >>      WAL last block write hole
> > > >>    * HBase-FSCK (HBASE-7) -- Suggest making this a blocker for 0.21
> > > >>
> > > >> I have had several recent conversations on my travels with people in
> > > >> Fortune 100 companies (based on this list:
> > > >> http://www.wageproject.org/content/fortune/index.php).
> > > >>
> > > >> You and I know we can set up well engineered HBase 0.20 clusters that
> > > >> will be operationally solid for a wide range of use cases, but given
> > > >> those aforementioned discussions there are certain sectors which would
> > > >> say HBASE-7 is #1 before HBase is "bank ready". Not until we can say:
> > > >>
> > > >>  - Yes, when the client sees data has been committed, it actually has
> > > >> been written and replicated on spinning or solid state media in all
> > > >> cases.
> > > >>
> > > >>  - Yes, we go to great lengths to recover data if ${deity} forbid you
> > > >> crush some underprovisioned cluster with load or some bizarre bug or
> > > >> system fault happens.
> > > >>
> > > >> HBASE-1295 is also required for business continuity reasons, but this
> > > >> is already a priority item for some HBase committers.
> > > >>
> > > >> The question, I think, is whether the above aligns with project goals.
> > > >> Making HBase-FSCK a blocker will probably knock something someone
> > > >> wants for the 0.21 timeframe off the list.
> > > >>
> > > >>   - Andy
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > >
> > >
> > >
> > >
> >
> >
> >
> >
> >
>
>
>
>
