Nathan Roberts created HDFS-6166:
------------------------------------

             Summary: revisit balancer so_timeout 
                 Key: HDFS-6166
                 URL: https://issues.apache.org/jira/browse/HDFS-6166
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: balancer
    Affects Versions: 2.3.0, 3.0.0
            Reporter: Nathan Roberts
            Assignee: Nathan Roberts
            Priority: Blocker


HDFS-5806 changed the socket read timeout for the balancer connection to DN to 
60 seconds. This works as long as balancer bandwidth is such that it's safe to 
assume that the DN will easily complete the operation within this time. 
Obviously this isn't a good assumption. When this assumption isn't valid, the 
balancer will time out the command BUT it will then be out of sync with the 
datanode (the balancer thinks the DN has room to do more work, while the DN is 
still working on the request and will fail any subsequent requests with 
"threads quota exceeded" errors). This causes expensive NN traffic via 
getBlocks() and also causes lots of WARNs in the balancer log.

Unfortunately the protocol is such that it's impossible to tell if the DN is 
busy working on replacing the block, OR is in bad shape and will never finish.

So, in the interest of a small change to deal with both situations, I propose 
the following two changes:
* Crank up the socket read timeout to 20 minutes
* Delay looking at a node for a bit if we did time out in this way (the DN could 
still have xceiver threads working on the replace)
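
To make the two proposed changes concrete, here is a rough sketch of what they 
could look like. It is not the actual Balancer code: the class NodeBackoff, its 
methods, and the delay value passed to it are hypothetical names chosen for 
illustration. The only value taken from this issue is the 20 minute read 
timeout, which would be passed to something like java.net.Socket#setSoTimeout.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical sketch of the proposal; not the real Balancer implementation. */
class NodeBackoff {
    // Proposed socket read timeout: 20 minutes, in milliseconds.
    // Assumption: this would be applied with Socket#setSoTimeout(int).
    static final int READ_TIMEOUT_MS = 20 * 60 * 1000;

    private final long delayMs;          // how long to leave a timed-out node alone
    private final LongSupplier clockMs;  // injectable clock so the logic is testable
    private final Map<String, Long> retryAfterMs = new HashMap<>();

    NodeBackoff(long delayMs, LongSupplier clockMs) {
        this.delayMs = delayMs;
        this.clockMs = clockMs;
    }

    /** Call when a request to this node hit the socket read timeout. */
    void recordTimeout(String node) {
        retryAfterMs.put(node, clockMs.getAsLong() + delayMs);
    }

    /** True if the node may be considered for new work again. */
    boolean isEligible(String node) {
        Long retryAfter = retryAfterMs.get(node);
        if (retryAfter == null) {
            return true; // never timed out, or the delay already expired
        }
        if (clockMs.getAsLong() >= retryAfter) {
            retryAfterMs.remove(node); // delay elapsed; forget the timeout
            return true;
        }
        return false; // the DN may still have xceiver threads busy on the replace
    }
}
```

In this sketch the balancer would check isEligible(node) before scheduling more 
work on a node and call recordTimeout(node) whenever a read times out, so it 
stops sending requests that the DN is likely to reject while it is still busy.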



--
This message was sent by Atlassian JIRA
(v6.2#6252)
