[ 
https://issues.apache.org/jira/browse/HDFS-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13954670#comment-13954670
 ] 

Hudson commented on HDFS-6166:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk #1742 (See 
[https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1742/])
HDFS-6166. Change Balancer socket read timeout to 20 minutes and add 10 seconds 
delay after error.  Contributed by Nathan Roberts (szetszwo: 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1583018)
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* /hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java


> revisit balancer so_timeout 
> ----------------------------
>
>                 Key: HDFS-6166
>                 URL: https://issues.apache.org/jira/browse/HDFS-6166
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer
>    Affects Versions: 3.0.0, 2.3.0
>            Reporter: Nathan Roberts
>            Assignee: Nathan Roberts
>            Priority: Blocker
>             Fix For: 2.4.0
>
>         Attachments: HDFS-6166.patch
>
>
> HDFS-5806 changed the socket read timeout for the balancer connection to DN 
> to 60 seconds. This works as long as balancer bandwidth is such that it's 
> safe to assume that the DN will easily complete the operation within this 
> time. Obviously this isn't a good assumption. When this assumption isn't 
> valid, the balancer will time out the command, but it will then be out of sync
> with the datanode (the balancer thinks the DN has room to do more work, while
> the DN is still working on the request and will fail any subsequent requests
> with "threads quota exceeded" errors). This causes expensive NN traffic via
> getBlocks() and also causes lots of WARNs in the balancer log.
> Unfortunately the protocol is such that it's impossible to tell if the DN is 
> busy working on replacing the block, OR is in bad shape and will never finish.
> So, in the interest of a small change to deal with both situations, I propose 
> the following two changes:
> * Crank up the socket read timeout to 20 minutes
> * Delay looking at a node for a bit if we did time out in this way (the DN 
> could still have xceiver threads working on the replace 
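
The two proposed changes can be illustrated with a minimal sketch. This is not the code from HDFS-6166.patch: the class name, method names, and the 5-second connect timeout are hypothetical, and only the 20-minute read timeout and the 10-second delay are taken from the commit message above. It assumes a plain java.net.Socket rather than the balancer's actual data transfer protocol.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; not the Balancer implementation.
public class BalancerTimeoutSketch {
    // Values from the commit message: 20-minute read timeout, 10-second delay.
    static final int READ_TIMEOUT_MS = 20 * 60 * 1000;
    static final long DELAY_AFTER_ERROR_MS = 10 * 1000L;

    // Datanode id -> time (ms) before which the node should not be picked again.
    private final Map<String, Long> delayUntil = new ConcurrentHashMap<>();

    // Record an error so the node is skipped for DELAY_AFTER_ERROR_MS.
    void markError(String datanode, long nowMs) {
        delayUntil.put(datanode, nowMs + DELAY_AFTER_ERROR_MS);
    }

    // True while the node is still inside its post-error delay window.
    boolean isDelayed(String datanode, long nowMs) {
        Long until = delayUntil.get(datanode);
        return until != null && nowMs < until;
    }

    // Wait for a single reply byte from the datanode. Any I/O failure,
    // including a read timeout (SocketTimeoutException is an IOException),
    // puts the node into the delay window.
    boolean awaitReply(String datanode, InetSocketAddress addr, int readTimeoutMs) {
        try (Socket sock = new Socket()) {
            sock.connect(addr, 5_000);        // connect timeout: an assumption
            sock.setSoTimeout(readTimeoutMs); // bounds each blocking read
            InputStream in = sock.getInputStream();
            return in.read() != -1;
        } catch (IOException e) {
            markError(datanode, System.currentTimeMillis());
            return false;
        }
    }
}
```

A scheduler built along these lines would call isDelayed() before dispatching a move to a node and skip the node while it returns true, which gives any datanode threads still working on the earlier request time to finish.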



--
This message was sent by Atlassian JIRA
(v6.2#6252)
