Is there a place to upload these logs?
On Fri, Apr 19, 2013 at 10:25 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi Nicholas,
>
> Attached are the namenode and datanode logs (from one of the healthy
> replicas of the WAL block) and the logs of the region server that got
> stuck doing the log split. Action begins at 2013-04-19 00:27*.
>
> Also, the rogue block is 5723958680970112840_174056. It is very
> interesting to trace it through the HDFS logs (datanode and namenode).
>
> By the way, do you know what the UNDER_RECOVERY stage in HDFS is for?
> And does the stale node handling kick in for that state?
>
> Thanks
> Varun
>
>
> On Fri, Apr 19, 2013 at 4:00 AM, Nicolas Liochon <nkey...@gmail.com> wrote:
>
>> Thanks for the detailed scenario and analysis. I'm going to have a look.
>> I can't access the logs (ec2-107-20-237-30.compute-1.amazonaws.com
>> times out), could you please send them directly to me?
>>
>> Thanks,
>>
>> Nicolas
>>
>>
>> On Fri, Apr 19, 2013 at 12:46 PM, Varun Sharma <va...@pinterest.com> wrote:
>>
>>> Hi Nicholas,
>>>
>>> Here is the failure scenario; I have dug up the logs.
>>>
>>> A machine fails and stops accepting/transmitting traffic. The HMaster
>>> starts the distributed split with 13 tasks. There are 12 region
>>> servers. 12 tasks succeed but the 13th one takes a very long time.
>>>
>>> The ZooKeeper timeout is set to 30 seconds and the stale node timeout
>>> to 20 seconds. Both patches are in place.
>>>
>>> a) The machine fails around 27:30.
>>>
>>> b) The master starts the split around 27:40 and submits the tasks. The
>>> one task which fails seems to be the one which contains the WAL being
>>> currently written to:
>>>
>>> 2013-04-19 00:27:44,325 INFO
>>> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
>>> hdfs://ec2-107-20-237-30.compute-1.amazonaws.com/hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141,
>>> length=0
>>>
>>> Basically this region server picks up the task but finds the length of
>>> this file to be 0 and drops it. This happens again.
>>>
>>> c) Finally another region server picks up the task, but it ends up
>>> going to the bad datanode (which should not happen because of the stale
>>> node timeout). Unfortunately it hits the 45 retries, with a connect
>>> timeout of 20 seconds each time. This delays recovery significantly.
>>> Reducing the number of retries to 1 is one possibility (see the sketch
>>> after this message). But then the namenode logs are very interesting.
>>>
>>> d) The namenode seems to be stuck in a cyclic lease recovery loop until
>>> the node is marked dead. There is one last block which exhibits this:
>>>
>>> 2013-04-19 00:28:09,744 INFO BlockStateChange: BLOCK*
>>> blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>>> primaryNodeIndex=1,
>>> replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>>> ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>>> ReplicaUnderConstruction[10.156.195.38:50010|RBW]]} recovery started,
>>> primary=10.156.192.106:50010
>>> 2013-04-19 00:28:09,744 WARN org.apache.hadoop.hdfs.StateChange: DIR*
>>> NameSystem.internalReleaseLease: File
>>> /hbase/.logs/ip-10-156-194-94.ec2.internal,60020,1366323217601-splitting/ip-10-156-194-94.ec2.internal%2C60020%2C1366323217601.1366331156141
>>> has not been closed. Lease recovery is in progress. RecoveryId = 174413
>>> for block blk_-5723958680970112840_174056{blockUCState=UNDER_RECOVERY,
>>> primaryNodeIndex=1,
>>> replicas=[ReplicaUnderConstruction[10.156.194.94:50010|RBW],
>>> ReplicaUnderConstruction[10.156.192.106:50010|RBW],
>>> ReplicaUnderConstruction[10.156.195.38:50010|RBW]]}
>>>
>>> I see this over and over again in the logs until the datanode is marked
>>> dead. The namenode seems to be cycling through the replicas of this WAL
>>> block and trying to add it to the recovery list. I looked at the code
>>> and it says:
>>>
>>>     // Cannot close file right now, since the last block requires recovery.
>>>     // This may potentially cause infinite loop in lease recovery
>>>     // if there are no valid replicas on data-nodes.
>>>     NameNode.stateChangeLog.warn(
>>>         "DIR* NameSystem.internalReleaseLease: " +
>>>         "File " + src + " has not been closed." +
>>>         " Lease recovery is in progress. " +
>>>         "RecoveryId = " + blockRecoveryId + " for block " + lastBlock);
>>>     break;
>>>
>>> Eventually, for this block, we get:
>>>
>>> 2013-04-19 00:41:20,736 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>> commitBlockSynchronization(lastblock=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>>> newgenerationstamp=174413, newlength=119148648,
>>> newtargets=[10.156.192.106:50010, 10.156.195.38:50010], closeFile=true,
>>> deleteBlock=false)
>>> 2013-04-19 00:41:20,736 ERROR
>>> org.apache.hadoop.security.UserGroupInformation:
>>> PriviledgedActionException as:hdfs (auth:SIMPLE)
>>> cause:java.io.IOException: Block
>>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>>> not found
>>> 2013-04-19 00:41:20,736 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 35 on 8020, call
>>> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.commitBlockSynchronization
>>> from 10.156.192.106:53271: error: java.io.IOException: Block
>>> (=BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056)
>>> not found
>>>
>>> On the datanode side, I see a call to recover blocks; a write pipeline
>>> is there, which gets terminated with some socket timeouts...
>>>
>>> 00:28:11,471 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
>>> NameNode at ec2-107-20-237-30.compute-1.amazonaws.com/10.168.7.226:8020
>>> calls
>>> recoverBlock(BP-696828882-10.168.7.226-1364886167971:blk_-5723958680970112840_174056,
>>> targets=[10.156.194.94:50010, 10.156.192.106:50010, 10.156.195.38:50010],
>>> newGenerationStamp=174413)
>>>
>>> Not sure, but this looks like a case where data could be lost?
>>>
>>> Varun
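The 45 retries and 20-second connect timeout in (c) line up with the stock
Hadoop 2.0 IPC client defaults (ipc.client.connect.max.retries.on.timeouts =
45, ipc.client.connect.timeout = 20000 ms). Below is a minimal, hypothetical
sketch of how a client could tighten those knobs; the key names are assumed
from core-default.xml / hdfs-default.xml and should be verified against the
CDH 4.2 build before relying on them.

    import org.apache.hadoop.conf.Configuration;

    public class FastFailClientConf {
      /**
       * Tighten the client-side connect/read timeouts that dominate recovery
       * time when a datanode silently disappears. Key names and defaults are
       * assumed from stock Hadoop 2.0 (45 retries x 20 s connect timeout).
       */
      public static Configuration tighten(Configuration conf) {
        // Give up on an unreachable node after 1 connect timeout instead of 45.
        conf.setInt("ipc.client.connect.max.retries.on.timeouts", 1);
        // Shorten each connect attempt from the default 20 s to 10 s.
        conf.setInt("ipc.client.connect.timeout", 10000);
        // Read-side socket timeout once a connection is established (ms).
        conf.setInt("dfs.client.socket-timeout", 10000);
        return conf;
      }
    }

Lowering the retry count trades recovery speed for a higher chance of giving
up on a datanode that is merely slow rather than dead.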
>>> On Fri, Apr 19, 2013 at 12:38 AM, Nicolas Liochon <nkey...@gmail.com> wrote:
>>>
>>>> Hey Varun,
>>>>
>>>> Could you please share the logs and the configuration (hdfs/hbase
>>>> settings plus a cluster description)? What's the failure scenario?
>>>> From an HDFS point of view, HDFS-3703 does not change the dead node
>>>> status, but these nodes will be given the lowest priority when reading.
>>>>
>>>>
>>>> Cheers,
>>>>
>>>> Nicolas
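For reference, a sketch of the stale-datanode settings that HDFS-3703 (reads)
and HDFS-3912 (writes) expose. The key names below follow the later 2.0.x
naming; in CDH 4.2 the read-side switch may still be called
dfs.namenode.check.stale.datanode, so treat them as assumptions to check
against the cluster's hdfs-default.xml.

    import org.apache.hadoop.conf.Configuration;

    public class StaleNodeConf {
      /** Namenode-side settings that demote a non-heartbeating datanode quickly. */
      public static Configuration configure(Configuration conf) {
        // A datanode is considered stale after missing heartbeats for this long
        // (20 s here, matching the value quoted in the thread; default is 30 s).
        conf.setLong("dfs.namenode.stale.datanode.interval", 20000L);
        // HDFS-3703: stale nodes get the lowest priority for reads.
        conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
        // HDFS-3912: stale nodes are also avoided when choosing write targets.
        conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
        return conf;
      }
    }

As Nicolas notes, a stale node is only deprioritized for reads; it is not
treated as dead.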
>>>> On Fri, Apr 19, 2013 at 3:01 AM, Varun Sharma <va...@pinterest.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are facing problems with really slow HBase region server
>>>>> recoveries, around 20 minutes. The version is HBase 0.94.3 compiled
>>>>> with hadoop.profile=2.0.
>>>>>
>>>>> The Hadoop version is CDH 4.2 with HDFS-3703 and HDFS-3912 patched in
>>>>> and stale node timeouts configured correctly. The time for dead node
>>>>> detection is still 10 minutes.
>>>>>
>>>>> We see that our region server gets stuck for a long time trying to
>>>>> read an HLog. Logs here:
>>>>>
>>>>> 2013-04-12 21:14:30,248 WARN org.apache.hadoop.hdfs.DFSClient: Failed
>>>>> to connect to /10.156.194.251:50010 for file
>>>>> /hbase/feeds/fbe25f94ed4fa37fb0781e4a8efae142/home/1d102c5238874a5d82adbcc09bf06599
>>>>> for block
>>>>> BP-696828882-10.168.7.226-1364886167971:blk_-3289968688911401881_9428:java.net.SocketTimeoutException:
>>>>> 15000 millis timeout while waiting for channel to be ready for read.
>>>>> ch : java.nio.channels.SocketChannel[connected
>>>>> local=/10.156.192.173:52818 remote=/10.156.194.251:50010]
>>>>>
>>>>> I would think that HDFS-3703 would make the server fail fast and go to
>>>>> the third datanode. Currently, the recovery seems way too slow for
>>>>> production usage...
>>>>>
>>>>> Varun
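On the 10-minute dead node detection: the namenode only declares a datanode
dead after roughly 2 x dfs.namenode.heartbeat.recheck-interval + 10 x
dfs.heartbeat.interval, which with the defaults (5 minutes and 3 seconds)
comes to about 10.5 minutes. The sketch below spells out that arithmetic,
based on a reading of the 2.0-era DatanodeManager, together with a
hypothetical override that shrinks the window; verify the formula and key
names against the actual release before depending on them.

    import org.apache.hadoop.conf.Configuration;

    public class DeadNodeInterval {
      /** Approximate time before the namenode marks a silent datanode dead. */
      public static long expiryMillis(Configuration conf) {
        long recheckMs = conf.getLong("dfs.namenode.heartbeat.recheck-interval",
            5 * 60 * 1000L);                                  // default 5 min, in ms
        long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3L); // seconds
        // 2 * recheck + 10 * heartbeat = 2*300s + 10*3s = 630 s (~10.5 min) by default.
        return 2 * recheckMs + 10 * heartbeatSec * 1000L;
      }

      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Example: a 30 s recheck interval marks dead nodes in ~1.5 minutes,
        // at the cost of more aggressive re-replication on false positives.
        conf.setLong("dfs.namenode.heartbeat.recheck-interval", 30 * 1000L);
        System.out.println("Dead node detection ~ " + expiryMillis(conf) / 1000 + " s");
      }
    }

That window lines up with the observation earlier in the thread that the
lease recovery loop only resolves once the node is finally marked dead.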