[ https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954670#comment-13954670 ]
Hudson commented on HDFS-6166:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1742 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1742/])
HDFS-6166. Change Balancer socket read timeout to 20 minutes and add 10 seconds delay after error. Contributed by Nathan Roberts (szetszwo: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1583018)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java

> revisit balancer so_timeout
> ----------------------------
>
>                 Key: HDFS-6166
>                 URL: https://issues.apache.org/jira/browse/HDFS-6166
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer
>    Affects Versions: 3.0.0, 2.3.0
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>            Priority: Blocker
>             Fix For: 2.4.0
>
>         Attachments: HDFS-6166.patch
>
>
> HDFS-5806 changed the socket read timeout for the balancer connection to DN
> to 60 seconds. This works as long as balancer bandwidth is such that it's
> safe to assume that the DN will easily complete the operation within this
> time. Obviously this isn't a good assumption. When this assumption isn't
> valid, the balancer will time out the cmd BUT it will then be out-of-sync with
> the datanode (balancer thinks the DN has room to do more work, DN is still
> working on the request and will fail any subsequent requests with "threads
> quota exceeded errors"). This causes expensive NN traffic via getBlocks() and
> also causes lots of WARNs in the balancer log.
> Unfortunately the protocol is such that it's impossible to tell if the DN is
> busy working on replacing the block, OR is in bad shape and will never finish.
> So, in the interest of a small change to deal with both situations, I propose
> the following two changes:
> * Crank up the socket read timeout to 20 minutes
> * Delay looking at a node for a bit if we did time out in this way (the DN
> could still have xceiver threads working on the replace
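
The two proposed changes can be illustrated with a minimal sketch. This is not the code from HDFS-6166.patch or Balancer.java: the class `TimeoutBackoff` and its methods are hypothetical names chosen for illustration, and only the two values (20 minute read timeout, 10 second delay after error) come from the commit message above. The sketch assumes a read timeout set via the standard `java.net.Socket.setSoTimeout` call and a per-node "not before" timestamp that keeps a node from being picked again right after an error.

```java
import java.net.Socket;
import java.net.SocketException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not the actual Balancer implementation. */
class TimeoutBackoff {
    // Values taken from the commit message; the constant names are assumptions.
    static final int READ_TIMEOUT_MS = 20 * 60 * 1000;     // 20 minutes
    static final long DELAY_AFTER_ERROR_MS = 10 * 1000L;   // 10 seconds

    // Earliest time (ms) at which each node may be considered again.
    private final Map<String, Long> notBefore = new HashMap<>();

    /** Apply the long read timeout to a socket used to talk to a datanode. */
    static void configure(Socket socket) throws SocketException {
        socket.setSoTimeout(READ_TIMEOUT_MS);
    }

    /** Record an error (e.g. a read timeout) against a node at time nowMs. */
    void recordError(String node, long nowMs) {
        notBefore.put(node, nowMs + DELAY_AFTER_ERROR_MS);
    }

    /** A node is eligible if it has no recorded error or its delay has elapsed. */
    boolean isEligible(String node, long nowMs) {
        Long earliest = notBefore.get(node);
        return earliest == null || nowMs >= earliest;
    }
}
```

A caller would invoke `recordError` when a read on the datanode connection throws `java.net.SocketTimeoutException`, and check `isEligible` before scheduling more work on that node. Passing the current time in as a parameter is a design choice made here so the delay logic can be exercised without sleeping.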