Is there a place to upload these logs?
On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Nicholas,
>
> Attached are the namenode and datanode logs (from one of the healthy
> replicas of the WAL block) and the logs of the region server that got
> stuck doing the log split. Action begins at 2013-04-19 00:27*.
>
> Also, the rogue block is 5723958680970112840_174056. It is very
> interesting to trace it through the HDFS logs (datanode and namenode).
>
> By the way, do you know what the UNDER_RECOVERY stage in HDFS is for?
> And does the stale node handling kick in for that state?
>
> Thanks
> Varun
>
>
> On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkey...@gmail.com> wrote:
>
>> Thanks for the detailed scenario and analysis. I'm going to have a look.
>> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
>> times out), could you please send them directly to me?
>>
>> Thanks,
>>
>> Nicolas
>>
>>
>> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com> wrote:
>>
>>> Hi Nicholas,
>>>
>>> Here is the failure scenario; I have dug up the logs.
>>>
>>> A machine fails and stops accepting/transmitting traffic. The HMaster
>>> starts the distributed split with 13 tasks. There are 12 region
>>> servers. 12 tasks succeed but the 13th one takes a very long time.
>>>
>>> The ZooKeeper timeout is set to 30 seconds and the stale node timeout
>>> to 20 seconds. Both patches are in place.
>>>
>>> a) The machine fails around 27:30.
>>>
>>> b) The master starts the split around 27:40 and submits the tasks. The
>>> one task which fails seems to be the one which contains the WAL being
>>> currently written to:
>>>
>>> 2013-04-19 00:27:44,325 INFO
>>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
>>> hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141,
>>> length=0
>>>
>>> Basically this region server picks up the task but finds the length of
>>> this file to be 0 and drops it. This happens again.
>>>
>>> c) Finally another region server picks up the task, but it ends up
>>> going to the bad datanode (which should not happen because of the stale
>>> node timeout). Unfortunately it hits the 45 retries, with a connect
>>> timeout of 20 seconds each time. This delays recovery significantly.
>>> Reducing the number of retries to 1 is one possibility (see the sketch
>>> after this message). But then the namenode logs are very interesting.
>>>
>>> d) The namenode seems to be stuck in a cyclic lease recovery loop until
>>> the node is marked dead. There is one last block which exhibits this:
>>>
>>> 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK*
>>> blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>>> primaryNodeIndex=1,
>>> replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>>> ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>>> ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
>>> primary=10.156.192.106:50010
>>> 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
>>> NameSystem.internalReleaseLease: File
>>> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>>> has not been closed. Lease recovery is in progress. RecoveryId = 174413
>>> for block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>>> primaryNodeIndex=1,
>>> replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>>> ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>>> ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>>>
>>> I see this over and over again in the logs until the datanode is marked
>>> dead. The namenode seems to be cycling through the replicas of this WAL
>>> block and trying to add it to the recovery list. I looked at the code
>>> and it says:
>>>
>>>     // Cannot close file right now, since the last block requires recovery.
>>>     // This may potentially cause infinite loop in lease recovery
>>>     // if there are no valid replicas on data-nodes.
>>>     NameNode.stateChangeLog.warn(
>>>         "DIR* NameSystem.internalReleaseLease: " +
>>>         "File " + src + " has not been closed." +
>>>         " Lease recovery is in progress. " +
>>>         "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
>>>     break;
>>>
>>> Eventually, for this block, we get:
>>>
>>> 2013-04-19 00:41:20,736 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>>> newgenerationstamp=174413, newlength=119148648,
>>> newtargets=[10.156.192.106:50010, 10.156.195.38:50010], closeFile=true,
>>> deleteBlock=false)
>>> 2013-04-19 00:41:20,736 ERROR
>>> org.apache.hadoop.security.UserGroupInformation:
>>> PriviledgedActionException as:hdfs (auth:SIMPLE)
>>> cause:java.io.IOException: Block
>>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>>> not found
>>> 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 35 on 8020, call
>>> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
>>> from 10.156.192.106:53271: error: java.io.IOException: Block
>>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>>> not found
>>>
>>> On the datanode side, I see a call to recover blocks; a write pipeline
>>> is there, which gets terminated with some socket timeouts...
>>>
>>> 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>> NameNode at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020
>>> calls
>>> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>>> targets=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010],
>>> newGenerationStamp=174413)
>>>
>>> Not sure, but this looks like a case where data could be lost?
>>>
>>> Varun
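The 45 retries and 20-second connect timeout in (c) line up with the stock
Hadoop 2.0 IPC client defaults (ipc.client.connect.max.retries.on.timeouts =
45, ipc.client.connect.timeout = 20000 ms). Below is a minimal, hypothetical
sketch of how a client could tighten those knobs; the key names are assumed
from core-default.xml / hdfs-default.xml and should be verified against the
CDH 4.2 build before relying on them.

    import org.apache.hadoop.conf.Configuration;

    public class FastFailClientConf {
      /**
       * Tighten the client-side connect/read timeouts that dominate recovery
       * time when a datanode silently disappears. Key names and defaults are
       * assumed from stock Hadoop 2.0 (45 retries x 20 s connect timeout).
       */
      public static Configuration tighten(Configuration conf) {
        // Give up on an unreachable node after 1 connect timeout instead of 45.
        conf.setInt("ipc.client.connect.max.retries.on.timeouts", 1);
        // Shorten each connect attempt from the default 20 s to 10 s.
        conf.setInt("ipc.client.connect.timeout", 10000);
        // Read-side socket timeout once a connection is established (ms).
        conf.setInt("dfs.client.socket-timeout", 10000);
        return conf;
      }
    }

Lowering the retry count trades recovery speed for a higher chance of giving
up on a datanode that is merely slow rather than dead.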
>>> On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <nkey...@gmail.com> wrote:
>>>
>>>> Hey Varun,
>>>>
>>>> Could you please share the logs and the configuration (hdfs/hbase
>>>> settings plus a cluster description)? What's the failure scenario?
>>>> From an HDFS point of view, HDFS-3703 does not change the dead node
>>>> status, but these nodes will be given the lowest priority when reading.
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Nicolas
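For reference, a sketch of the stale-datanode settings that HDFS-3703 (reads)
and HDFS-3912 (writes) expose. The key names below follow the later 2.0.x
naming; in CDH 4.2 the read-side switch may still be called
dfs.namenode.check.stale.datanode, so treat them as assumptions to check
against the cluster's hdfs-default.xml.

    import org.apache.hadoop.conf.Configuration;

    public class StaleNodeConf {
      /** Namenode-side settings that demote a non-heartbeating datanode quickly. */
      public static Configuration configure(Configuration conf) {
        // A datanode is considered stale after missing heartbeats for this long
        // (20 s here, matching the value quoted in the thread; default is 30 s).
        conf.setLong("dfs.namenode.stale.datanode.interval", 20000L);
        // HDFS-3703: stale nodes get the lowest priority for reads.
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        // HDFS-3912: stale nodes are also avoided when choosing write targets.
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        return conf;
      }
    }

As Nicolas notes, a stale node is only deprioritized for reads; it is not
treated as dead.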
>>>> On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <va...@pinterest.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are facing problems with really slow HBase region server
>>>>> recoveries, around 20 minutes. The version is HBase 0.94.3 compiled
>>>>> with hadoop.profile=2.0.
>>>>>
>>>>> The Hadoop version is CDH 4.2 with HDFS-3703 and HDFS-3912 patched in
>>>>> and stale node timeouts configured correctly. The time for dead node
>>>>> detection is still 10 minutes.
>>>>>
>>>>> We see that our region server gets stuck for a long time trying to
>>>>> read an HLog. Logs here:
>>>>>
>>>>> 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>>>>> to connect to /10.156.194.251:50010 for file
>>>>> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
>>>>> for block
>>>>> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
>>>>> 15000 millis timeout while waiting for channel to be ready for read.
>>>>> ch : java.nio.channels.SocketChannel[connected
>>>>> local=/10.156.192.173:52818 remote=/10.156.194.251:50010]
>>>>>
>>>>> I would think that HDFS-3703 would make the server fail fast and go to
>>>>> the third datanode. Currently, the recovery seems way too slow for
>>>>> production usage...
>>>>>
>>>>> Varun
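On the 10-minute dead node detection: the namenode only declares a datanode
dead after roughly 2 x dfs.namenode.heartbeat.recheck-interval + 10 x
dfs.heartbeat.interval, which with the defaults (5 minutes and 3 seconds)
comes to about 10.5 minutes. The sketch below spells out that arithmetic,
based on a reading of the 2.0-era DatanodeManager, together with a
hypothetical override that shrinks the window; verify the formula and key
names against the actual release before depending on them.

    import org.apache.hadoop.conf.Configuration;

    public class DeadNodeInterval {
      /** Approximate time before the namenode marks a silent datanode dead. */
      public static long expiryMillis(Configuration conf) {
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval",
            5 * 60 * 1000L);                                  // default 5 min, in ms
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3L); // seconds
        // 2 * recheck + 10 * heartbeat = 2*300s + 10*3s = 630 s (~10.5 min) by default.
        return 2 * recheckMs + 10 * heartbeatSec * 1000L;
      }

      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Example: a 30 s recheck interval marks dead nodes in ~1.5 minutes,
        // at the cost of more aggressive re-replication on false positives.
        conf.setLong("dfs.namenode.heartbeat.recheck-interval", 30 * 1000L);
        System.out.println("Dead node detection ~ " + expiryMillis(conf) / 1000 + " s");
      }
    }

That window lines up with the observation earlier in the thread that the
lease recovery loop only resolves once the node is finally marked dead.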