[ https://issues.apache.org/jira/browse/HDFS-6617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Nauroth updated HDFS-6617:
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.6.0
                   3.0.0
           Status: Resolved  (was: Patch Available)

I committed this to trunk and branch-2. Liang Xie, thank you for contributing this patch.

> Flake TestDFSZKFailoverController.testManualFailoverWithDFSHAAdmin due to a
> long edit log sync op
> -------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-6617
>                 URL: https://issues.apache.org/jira/browse/HDFS-6617
>             Project: Hadoop HDFS
>          Issue Type: Test
>          Components: auto-failover, test
>    Affects Versions: 3.0.0, 2.5.0
>            Reporter: Liang Xie
>            Assignee: Liang Xie
>            Priority: Minor
>             Fix For: 3.0.0, 2.6.0
>
>         Attachments: HDFS-6617-v2.txt, HDFS-6617.txt
>
>
> Just hit a false-alarm test failure while working on HDFS-6614; see
> https://builds.apache.org/job/PreCommit-HDFS-Build/7259//testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestDFSZKFailoverController/testManualFailoverWithDFSHAAdmin/
> Looking at the log shows the failure came from a timeout in
> ZKFailoverController.doCedeActive():
>   localTarget.getProxy(conf, timeout).transitionToStandby(createReqInfo());
> while stopping the active services; see FSNamesystem.stopActiveServices():
>   void stopActiveServices() {
>     LOG.info("Stopping services started for active state");
>     ....
> This correlates with the log:
>   2014-07-01 08:12:50,615 INFO namenode.FSNamesystem (FSNamesystem.java:stopActiveServices(1167)) - Stopping services started for active state
> stopActiveServices() then calls editLog.close(), which goes to
> endCurrentLogSegment(); see the log:
>   2014-07-01 08:12:50,616 INFO namenode.FSEditLog (FSEditLog.java:endCurrentLogSegment(1216)) - Ending log segment 1
> This operation did not finish within 5 seconds, which triggered the timeout:
>   2014-07-01 08:12:55,624 WARN ha.ZKFailoverController (ZKFailoverController.java:doCedeActive(577)) - Unable to transition local node to standby: Call From asf001.sp2.ygridcore.net/67.195.138.31 to localhost:10021 failed on socket timeout exception: java.net.SocketTimeoutException: 5000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/127.0.0.1:53965 remote=localhost/127.0.0.1:10021]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
> The logEdit/logSync finally completed, followed by printStatistics(true):
>   2014-07-01 08:13:05,243 INFO namenode.FSEditLog (FSEditLog.java:printStatistics(675)) - Number of transactions: 2 Total time for transactions(ms): 0 Number of transactions batched in Syncs: 0 Number of syncs: 3 SyncTimes(ms): 14667 74 105
> So obviously this long sync (14.6 seconds on the first journal) caused the timeout; the QA box was probably very slow at that moment, so one possible fix is raising the default graceful fence timeout.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
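The failure mode quoted above is a generic race between a fixed RPC timeout and a blocking shutdown step. The minimal sketch below (not Hadoop code; class and method names are invented for illustration) reproduces the shape of it: the caller waits a bounded time for a "transition to standby" that cannot return until a final log sync finishes, so any sync longer than the budget surfaces as a timeout on the caller, exactly as in the ZKFC's WARN message.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative sketch: a bounded wait racing a blocking shutdown step. */
public class CedeActiveTimeoutSketch {

    /** Stand-in for transitionToStandby(): blocks on a "logSync" of the given length. */
    static void transitionToStandby(long syncMillis) throws InterruptedException {
        Thread.sleep(syncMillis); // models endCurrentLogSegment()/logSync()
    }

    /** Returns true iff the standby transition finished within rpcTimeoutMillis. */
    static boolean cedeActive(long rpcTimeoutMillis, long syncMillis) throws Exception {
        ExecutorService rpc = Executors.newSingleThreadExecutor();
        try {
            Future<?> call = rpc.submit(() -> {
                transitionToStandby(syncMillis);
                return null;
            });
            call.get(rpcTimeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            return false; // corresponds to the ZKFC's "Unable to transition" WARN
        } finally {
            rpc.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Fast sync: transition completes inside the timeout budget.
        System.out.println(cedeActive(5000, 100));
        // Sync slower than the budget (the JIRA's 14.6 s sync vs a 5 s timeout): timeout.
        System.out.println(cedeActive(200, 1000));
    }
}
```

With numbers scaled down from the JIRA's 5 s timeout vs 14.6 s sync, the second call times out; raising the timeout (the fix proposed above) is equivalent to enlarging `rpcTimeoutMillis` until it comfortably exceeds worst-case sync latency on a slow box.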