Hi Ted,

I had a long offline discussion with Nicholas on this. It looks like the
last block, which was still being written to, took an enormous amount of
time to recover.
Here's what happened.
a) The master splits the tasks and region servers process them.
b) A region server tries to recover the lease for each WAL log - in most
cases this is a no-op since the logs are already rolled over/finalized
(a rough sketch of this step is below).
c) Lease recovery for the last file takes some time, since the crashing
server was writing to it and still held the lease - but basically we have
the lease about 1 minute after the server was lost.
d) Now we start the recovery for this block but we end up hitting the stale
datanode, which is puzzling.
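
For context, here is a minimal sketch of what the lease-recovery step in
(b)/(c) boils down to, assuming a DistributedFileSystem handle; the retry
loop and 1-second interval are illustrative, not the actual HLogSplitter
code:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Minimal sketch of recovering the lease on a WAL before splitting it.
// For already rolled/finalized logs recoverLease() returns true right away
// (the no-op case in (b)); for the log the dead server was still writing,
// it keeps returning false until the NameNode completes block recovery (c).
void recoverWalLease(DistributedFileSystem dfs, Path walPath) throws Exception {
  boolean recovered = dfs.recoverLease(walPath);
  while (!recovered) {
    Thread.sleep(1000);                     // illustrative retry interval
    recovered = dfs.recoverLease(walPath);  // retry until the lease is freed
  }
}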

It seems that we did not hit the stale datanode when we were trying to
recover the finalized WAL blocks with trivial lease recovery. However, for
the final block, we hit the stale datanode. Any clue why this might be
happening?
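
For reference, the stale check Ted quotes below (from HDFS-3703) is just a
heartbeat-age comparison; a rough paraphrase with simplified names, not the
actual DatanodeInfo code:

// A datanode is treated as stale once the namenode has gone longer than
// dfs.namenode.stale.datanode.interval without receiving a heartbeat from it.
boolean isStale(long lastHeartbeatMillis, long staleIntervalMillis) {
  return System.currentTimeMillis() - lastHeartbeatMillis > staleIntervalMillis;
}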

Varun


On Fri, Apr 19, 2013 at 10:40 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> Can you show snippet from DN log which mentioned UNDER_RECOVERY ?
>
> Here is the criteria for stale node checking to kick in (from
>
> https://issues.apache.org/jira/secure/attachment/12544897/HDFS-3703-trunk-read-only.patch
> ):
>
> +   * Check if the datanode is in stale state. Here if
> +   * the namenode has not received heartbeat msg from a
> +   * datanode for more than staleInterval (default value is
> +   * {@link DFSConfigKeys#DFS_NAMENODE_STALE_DATANODE_INTERVAL_MILLI_DEFAULT}),
> +   * the datanode will be treated as stale node.
>
>
> On Fri, Apr 19, 2013 at 10:28 AM, Varun Sharma <va...@pinterest.com>
> wrote:
>
> > Is there a place to upload these logs?
> >
> >
> > On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com>
> > wrote:
> >
> > > Hi Nicholas,
> > >
> > > Attached are the namenode, dn logs (of one of the healthy replicas of
> > > the WAL block) and the rs logs which got stuck doing the log split.
> > > Action begins at 2013-04-19 00:27*.
> > >
> > > Also, the rogue block is 5723958680970112840_174056. It's very
> > > interesting to trace this guy through the HDFS logs (dn and nn).
> > >
> > > Btw, do you know what the UNDER_RECOVERY stage is for in HDFS? Also,
> > > does the stale node stuff kick in for that state?
> > >
> > > Thanks
> > > Varun
> > >
> > >
> > > On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkey...@gmail.com
> > >wrote:
> > >
> > >> Thanks for the detailed scenario and analysis. I'm going to have a
> look.
> > >> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
> > >> times out), could you please send them directly to me?
> > >>
> > >> Thanks,
> > >>
> > >> Nicolas
> > >>
> > >>
> > >> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com>
> > >> wrote:
> > >>
> > >> > Hi Nicholas,
> > >> >
> > >> > Here is the failure scenario; I have dug up the logs.
> > >> >
> > >> > A machine fails and stops accepting/transmitting traffic. The HMaster
> > >> > starts the distributed split for 13 tasks. There are 12 region
> > >> > servers. 12 tasks succeed but the 13th one takes a looong time.
> > >> >
> > >> > Zookeeper timeout is set to 30 seconds. Stale node timeout is 20
> > >> > seconds. Both patches are there.
> > >> >
> > >> > a) Machine fails around 27:30
> > >> > b) Master starts the split around 27:40 and submits the tasks. The
> > >> > one task which fails seems to be the one which contains the WAL being
> > >> > currently written to:
> > >> >
> > >> > 2013-04-19 00:27:44,325 INFO
> > >> > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
> > >> > hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141,
> > >> > length=0
> > >> >
> > >> > Basically this region server picks up the task but finds the length
> > >> > of this file to be 0 and drops it. This happens again.
> > >> >
> > >> > c) Finally another region server picks up the task but it ends up
> > >> > going to the bad datanode (which should not happen because of the
> > >> > stale node timeout). Unfortunately it hits the 45 retries and a
> > >> > connect timeout of 20 seconds every time. This delays recovery
> > >> > significantly. Now I guess reducing the # of retries to 1 is one
> > >> > possibility. But then the namenode logs are very interesting.
> > >> >
> > >> > d) The namenode seems to be in a cyclic lease recovery loop until
> > >> > the node is marked dead. There is one last block which exhibits this.
> > >> >
> > >> > 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK*
> > >> > blk_-*5723958680970112840_174056*{blockUCState=UNDER_RECOVERY,
> > >> > primaryNodeIndex=1,
> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
> > >> > primary=10.156.192.106:50010
> > >> > 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> > >> > NameSystem.internalReleaseLease: File
> > >> > /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
> > >> > has not been closed. Lease recovery is in progress. RecoveryId = 174413
> > >> > for block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
> > >> > primaryNodeIndex=1,
> > >> > replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.192.106:50010|RBW],
> > >> > ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
> > >> >
> > >> > I see this over and over again in the logs until the datanode is
> > >> > marked dead. It seems to be cycling through the replicas for this WAL
> > >> > block and trying to add it to the recovery list. I looked at the code
> > >> > and it says:
> > >> >
> > >> >       // Cannot close file right now, since the last block requires recovery.
> > >> >       // This may potentially cause infinite loop in lease recovery
> > >> >       // if there are no valid replicas on data-nodes.
> > >> >       NameNode.stateChangeLog.warn(
> > >> >                 "DIR* NameSystem.internalReleaseLease: " +
> > >> >                 "File " + src + " has not been closed." +
> > >> >                 " Lease recovery is in progress. " +
> > >> >                 "RecoveryId = " + blockRecoveryId + " for block " +
> > >> >                 lastBlock);
> > >> >       break;
> > >> >
> > >> > Eventually for this block, we get
> > >> >
> > >> > 2013-04-19 00:41:20,736 INFO
> > >> > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> > >> > commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-*5723958680970112840_174056*,
> > >> > newgenerationstamp=174413, newlength=119148648,
> > >> > newtargets=[10.156.192.106:50010, 10.156.195.38:50010],
> > >> > closeFile=true, deleteBlock=false)
> > >> > 2013-04-19 00:41:20,736 ERROR
> > >> > org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException
> > >> > as:hdfs (auth:SIMPLE) cause:java.io.IOException: Block
> > >> > (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> > >> > not found
> > >> > 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server
> > >> > handler 35 on 8020, call
> > >> > org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
> > >> > from 10.156.192.106:53271: error: java.io.IOException: Block
> > >> > (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
> > >> > not found
> > >> >
> > >> > On the datanode side, I see a call to recover blocks - I see that a
> > >> > write pipeline is there, which gets terminated with some socket
> > >> > timeouts...
> > >> >
> > >> > 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> > >> > NameNode at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020
> > >> > calls
> > >> > recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
> > >> > targets=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010],
> > >> > newGenerationStamp=174413)
> > >> >
> > >> > Not sure, but this looks like a case where data could be lost?
> > >> >
> > >> > Varun
> > >> >
> > >> >
> > >> > On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <
> nkey...@gmail.com>
> > >> > wrote:
> > >> >
> > >> > > Hey Varun,
> > >> > >
> > >> > > Could you please share the logs and the configuration (hdfs / hbase
> > >> > > settings + cluster description)? What's the failure scenario?
> > >> > > From an HDFS pov, HDFS 3703 does not change the dead node status,
> > >> > > but these nodes will be given the lowest priority when reading.
> > >> > >
> > >> > >
> > >> > > Cheers,
> > >> > >
> > >> > > Nicolas
> > >> > >
> > >> > >
> > >> > > On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <
> va...@pinterest.com>
> > >> > wrote:
> > >> > >
> > >> > > > Hi,
> > >> > > >
> > >> > > > We are facing problems with really slow HBase region server
> > >> > > > recoveries, ~20 minutes. Version is hbase 0.94.3 compiled with
> > >> > > > hadoop.profile=2.0.
> > >> > > >
> > >> > > > Hadoop version is CDH 4.2 with HDFS 3703 and HDFS 3912 patched and
> > >> > > > stale node timeouts configured correctly. Time for dead node
> > >> > > > detection is still 10 minutes.
> > >> > > >
> > >> > > > We see that our region server trying to read an HLog is stuck
> > >> > > > there for a long time. Logs here:
> > >> > > >
> > >> > > > 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient:
> > >> > > > Failed to connect to /10.156.194.251:50010 for file
> > >> > > > /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
> > >> > > > for block
> > >> > > > BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
> > >> > > > 15000 millis timeout while waiting for channel to be ready for read.
> > >> > > > ch : java.nio.channels.SocketChannel[connected
> > >> > > > local=/10.156.192.173:52818 remote=/10.156.194.251:50010]
> > >> > > >
> > >> > > > I would think that HDFS 3703 would make the server fail fast and
> > >> > > > go to the third datanode. Currently, the recovery seems way too
> > >> > > > slow for production usage...
> > >> > > >
> > >> > > > Varun
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>
