Well, technically the cluster is still up, but no clients have been able to make any progress for 8+ hours now.
- Andy

________________________________
From: Andrew Purtell <[email protected]>
To: [email protected]
Sent: Saturday, August 8, 2009 9:36:42 AM
Subject: Re: append (hadoop-4379), was -> Re: roadmap: data integrity

Cluster down hard after RS failure. Master stuck indefinitely splitting logs.
Endless instances of this message, once per second:

  org.apache.hadoop.hdfs.DFSClient: Could not complete file /hbase/content/1965559571/oldlogfile.lo retrying...

Turning off "dfs.support.append".

- Andy

________________________________
From: stack <[email protected]>
To: [email protected]
Sent: Friday, August 7, 2009 12:34:40 PM
Subject: Re: append (hadoop-4379), was -> Re: roadmap: data integrity

You are a good man Andrew.
St.Ack

On Fri, Aug 7, 2009 at 10:27 AM, Andrew Purtell <[email protected]> wrote:
> I'm going to join you in testing this stack, taking the below as config
> recipe.
>
> - Andy
>
> ________________________________
> From: stack <[email protected]>
> To: [email protected]
> Sent: Friday, August 7, 2009 9:54:53 AM
> Subject: append (hadoop-4379), was -> Re: roadmap: data integrity
>
> Here is a quick note on the current state of my testing of HADOOP-4379
> (support for 'append' in hadoop 0.20.x).
>
> On my small test cluster, I am not able to break the latest patch posted
> by Dhruba under heavy loading. It seems to basically work. On regionserver
> crash, the master runs log split, and when it comes to the last in the set
> of regionserver logs for splitting -- the one that is inevitably unclosed
> because the process crashed -- we are able to recover most edits in this
> last file (in my testing, it seemed to be all edits up to the last flush
> of the regionserver process).
>
> The upshot is that, tentatively, we may have a "working" append in the
> 0.20 timeframe (in 0.21, we should have
> https://issues.apache.org/jira/browse/HDFS-265). I'll keep testing but I'd
> suggest it's time for others to try it out.
>
> With HADOOP-4379, the process recovering non-closed log files -- the
> master in our case -- must successfully open the file in append mode and
> then close it. Once closed, new readers can purportedly see up to the last
> flush. The open for append can take a little while before it will go
> through (the complaint is that another process holds the file's lease).
> Meantime, the process opening for append must retry. In my experience it's
> taking 2-10 seconds.
>
> Support for appends is off by default in hadoop even after HADOOP-4379 has
> been applied. To enable it, you need to set dfs.support.append. Set it
> everywhere -- all over hadoop and in hbase-site.xml so hbase/DFSClient can
> see the attribute.
>
> HBase TRUNK will recognize whether the bundled hadoop supports append via
> introspection (SequenceFile has a new syncFs method when HADOOP-4379 has
> been applied). If an append-supporting hadoop is present, and
> dfs.support.append is set in the hbase context, then hbase, when it runs
> HLog#splitLog, will try opening files for append. On regionserver crash,
> you can see the master's HLog#splitLog loop retrying the open for append
> until it is successful (you'll see in the master log a complaint that the
> lease on the file is held by another process). We retry every second.
>
> Successful recovery of all edits is uncovering new, interesting issues. In
> my testing I was at first killing the regionserver only, but I have also
> been killing the regionserver and datanode together. In the latter case,
> what I would see is that the namenode would continue to assign the dead
> datanode work at least until its lease expired. Fair enough, you say, but
> the datanode lease is ten minutes by default. I set it down in my tests
> using heartbeat.recheck.interval (there is a pregnant comment in
> HADOOP-4379 w/ client-side code where Ruyue Ma says they get around this
> issue by having the client pass the namenode the datanodes it knows are
> dead when asking for an extra block). We might want to recommend setting
> it down in general.
>
> Other issues are hbase bugs we see once all edits are recovered. I've been
> filing issues on these over the last few days.
>
> St.Ack
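[For readers trying this at home, here is a minimal sketch of the two mechanisms described above: detecting an append-capable hadoop by looking for SequenceFile.Writer's syncFs method via reflection, and recovering an unclosed log by opening it for append and immediately closing it, retrying every second while the dead writer's lease is still held. The class and method names (AppendRecoverySketch, isAppendSupported, recoverLog) are invented for illustration and are not the actual HBase code; only the Hadoop APIs are real.]

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class AppendRecoverySketch {

  // True if the bundled hadoop has HADOOP-4379 applied, i.e. SequenceFile's
  // Writer grew a syncFs() method. Mirrors the introspection described above.
  public static boolean isAppendSupported() {
    try {
      SequenceFile.Writer.class.getMethod("syncFs");
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  // Recover an unclosed log by opening it for append and closing it again,
  // so new readers can see edits up to the last flush. The open fails while
  // the dead writer still holds the lease, so retry once a second. A real
  // implementation would bound the retries and inspect the exception.
  public static void recoverLog(FileSystem fs, Path log)
      throws IOException, InterruptedException {
    while (true) {
      try {
        FSDataOutputStream out = fs.append(log); // needs dfs.support.append
        out.close();
        return;
      } catch (IOException e) {
        // Typically "another process holds the file's lease"; retry in 1s.
        Thread.sleep(1000);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // This property must be set in hadoop's config *and* in hbase-site.xml
    // so the DFSClient inside hbase sees it; setting it here only covers
    // this client.
    conf.setBoolean("dfs.support.append", true);
    FileSystem fs = FileSystem.get(conf);
    if (isAppendSupported()) {
      recoverLog(fs, new Path(args[0]));
    }
  }
}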
>
> On Fri, Aug 7, 2009 at 9:03 AM, Andrew Purtell <[email protected]> wrote:
>
> > Good to see there's direct edit replication support; that can make
> > things easier.
> >
> > I've seen people use DRBD or NFS to replicate edits currently.
> >
> > Namenode failover is a "solvable" issue with traditional HA: OS-level
> > heartbeats, fencing, failover -- e.g. an HA infrastructure daemon starts
> > the NN instance on node B if the heartbeat from node A is lost and takes
> > a power control operation on A to make sure it is dead. On both nodes
> > the infrastructure daemons trigger the OS watchdog if the NN process
> > dies. Combine this with automatic IP address reassignment. Then, page
> > the operators. Add another node C for additional redundancy, and make
> > sure all of the alternatives are on separate racks and power rails, and
> > make sure the L2 and L3 topology is also HA (e.g. bonded ethernet to
> > redundant switches at L2, mesh routing at L3, etc.). If the cluster is
> > not super huge it can all be spanned at L2 over redundant switches. L3
> > redundancy is trickier. A typical configuration could have a lot of OSPF
> > stub networks -- it depends how L2 is partitioned -- which can make the
> > routing table difficult for operators to sort out.
> >
> > I've seen this type of thing work for myself: ~15 seconds from
> > (simulated) fault on NN node A to the new NN up and responding to DN
> > reconnections on node B, with 0.19.
> >
> > You can build in additional assurance of fast failover by running
> > redundant processes alongside a few datanodes which over and over ping
> > the NN via the namenode protocol and trigger fencing and failover if it
> > stops responding.
> >
> > One wrinkle is that the new namenode starts up in safe mode. As long as
> > HBase can handle temporary periods where the cluster goes into safe mode
> > after NN failover, it can ride it out.
> >
> > This is ugly, but it is, I believe, an accepted and valid systems
> > engineering solution to the NN SPOF issue for the folks I mentioned in
> > my previous email, something they would be familiar with. Edit
> > replication support in HDFS 0.21 makes it a little less work to achieve
> > and maybe a little faster to execute, so that's an improvement.
> >
> > It may be overstating it a little to say that the NN SPOF is not a
> > concern for HBase, but, in my opinion, we need to address the WAL and
> > (lack of FSCK) issues first before being concerned about it. HBase can
> > lose data all on its own.
> >
> > - Andy
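[Below is a rough sketch of the external liveness probe Andrew describes: a process that repeatedly pings the NN and kicks off fencing and failover once it stops answering. For simplicity it goes through the ordinary FileSystem API rather than the namenode protocol proper, and the failure threshold and fencing-script path are placeholders, so treat it as an illustration of the pattern rather than a ready-made HA daemon.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Poll the namenode; after several consecutive failures, run an external
// fencing/failover script (power off node A, bring the NN up on node B,
// move the IP), then page the operators. Sketch only -- the threshold,
// interval, and script path are made up.
public class NameNodeProbe {

  private static final int MAX_FAILURES = 3;            // consecutive misses
  private static final long PROBE_INTERVAL_MS = 5000;
  private static final String FENCE_SCRIPT =
      "/usr/local/bin/fence-and-failover.sh";           // hypothetical

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();           // picks up fs.default.name
    int failures = 0;
    while (true) {
      try {
        FileSystem fs = FileSystem.get(conf);
        // Cheap round trip to the NN. The IPC layer may retry internally
        // for a while before this throws. A namespace read like this still
        // succeeds while a freshly failed-over NN sits in safe mode, so the
        // probe should not flap during recovery.
        fs.getFileStatus(new Path("/"));
        failures = 0;
      } catch (Exception e) {
        failures++;
        if (failures >= MAX_FAILURES) {
          Runtime.getRuntime().exec(FENCE_SCRIPT).waitFor();
          return;
        }
      }
      Thread.sleep(PROBE_INTERVAL_MS);
    }
  }
}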
> >
> > ________________________________
> > From: Jean-Daniel Cryans <[email protected]>
> > To: [email protected]
> > Sent: Friday, August 7, 2009 3:25:19 AM
> > Subject: Re: roadmap: data integrity
> >
> > https://issues.apache.org/jira/browse/HADOOP-4539
> >
> > This issue was closed long ago. But, Steve Loughran just said on the
> > hadoop mailing list that the new NN has to come up with the same
> > IP/hostname as the failed one.
> >
> > J-D
> >
> > On Fri, Aug 7, 2009 at 2:37 AM, Ryan Rawson <[email protected]> wrote:
> > > WAL is a major issue, but another one that is coming up fast is the
> > > SPOF that is the namenode.
> > >
> > > Right now, namenode aside, I can rolling-restart my entire cluster,
> > > including rebooting the machines if I needed to. But not so with the
> > > namenode, because if it goes AWOL, all sorts of bad can happen.
> > >
> > > I hope that HDFS 0.21 addresses both these issues. Can we get
> > > positive confirmation that this is being worked on?
> > >
> > > -ryan
> > >
> > > On Thu, Aug 6, 2009 at 10:25 AM, Andrew Purtell <[email protected]> wrote:
> > >> I updated the roadmap up on the wiki:
> > >>
> > >> * Data integrity
> > >>   * Ensure that proper append() support in HDFS actually closes the
> > >>     WAL last-block write hole
> > >>   * HBase-FSCK (HBASE-7) -- Suggest making this a blocker for 0.21
> > >>
> > >> I have had several recent conversations on my travels with people in
> > >> Fortune 100 companies (based on this list:
> > >> http://www.wageproject.org/content/fortune/index.php).
> > >>
> > >> You and I know we can set up well-engineered HBase 0.20 clusters that
> > >> will be operationally solid for a wide range of use cases, but given
> > >> those aforementioned discussions there are certain sectors which
> > >> would say HBASE-7 is #1 before HBase is "bank ready". Not until we
> > >> can say:
> > >>
> > >> - Yes, when the client sees data has been committed, it actually has
> > >>   been written and replicated on spinning or solid state media in all
> > >>   cases.
> > >>
> > >> - Yes, we go to great lengths to recover data if, ${deity} forbid,
> > >>   you crush some underprovisioned cluster with load or some bizarre
> > >>   bug or system fault happens.
> > >>
> > >> HBASE-1295 is also required for business continuity reasons, but this
> > >> is already a priority item for some HBase committers.
> > >>
> > >> The question, I think, is whether the above aligns with project
> > >> goals. Making HBase-FSCK a blocker will probably knock something
> > >> someone wants for the 0.21 timeframe off the list.
> > >>
> > >> - Andy
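[To make the first roadmap bullet concrete, here is a toy illustration of the writer side: with an append-capable hadoop (the HADOOP-4379 syncFs method discussed earlier in the thread), a WAL-style writer can push each edit out to the datanodes before acknowledging it, which is exactly the last-block write hole being closed. This is not HBase's HLog code; the file name and key/value types are arbitrary, and syncFs only exists on a patched hadoop.]

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Toy write-ahead-log writer: append an edit, then syncFs() so the edit is
// pushed to the datanodes before we tell the client it is committed. Without
// append support, edits in the last (unclosed) block could vanish on crash.
public class ToyWalWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("dfs.support.append", true);   // must also be set cluster-side
    FileSystem fs = FileSystem.get(conf);

    SequenceFile.Writer wal = SequenceFile.createWriter(
        fs, conf, new Path("/hbase/toy-wal.log"), LongWritable.class, Text.class);
    try {
      wal.append(new LongWritable(1L), new Text("put row1/cf:col=value"));
      wal.syncFs();   // only present on a HADOOP-4379-patched hadoop
    } finally {
      wal.close();
    }
  }
}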
