Uma Maheswara Rao G created HDFS-17255:
------------------------------------------

             Summary: There should be mechanism between client and NN to 
eliminate stale nodes from current pipeline sooner.
                 Key: HDFS-17255
                 URL: https://issues.apache.org/jira/browse/HDFS-17255
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Uma Maheswara Rao G


In one user's cluster, they hit an issue similar to HDFS-2891. The client 
always sees the first node as failed even though the second node is the 
problematic one (it times out because its network (NW) was pulled out). When a 
pipeline failure happens, the client asks for a new node and uses it to replace 
the node it blamed. But the actual bad node is still in the pipeline, because 
the client identified the wrong node (actually a good node) as bad. So the 
pipeline failures continue until the client happens to detect the real bad node 
through random shuffling. The NN had actually detected the bad node as stale. 
But pipeline reconstruction only considers the node the client detected as 
failed, and only that node is replaced with a new node.
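
To make the failure loop concrete, here is a toy model in Java. It is only a 
sketch and not HDFS client code: the class and method names are made up for 
illustration, datanodes are plain strings, and it assumes the client blames the 
first node in every round, as in the report above.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the recovery loop described above; hypothetical, not HDFS code. */
final class PipelineRecoverySketch {

    /** Replaces only the node the client blamed; every other node stays in place. */
    static List<String> replaceBlamedNode(List<String> pipeline, String blamed, String spare) {
        List<String> next = new ArrayList<>(pipeline);
        int idx = next.indexOf(blamed);
        if (idx >= 0) {
            next.set(idx, spare);
        }
        return next;
    }

    /**
     * Runs one recovery round per spare node, with the client always blaming
     * the first node in the pipeline (the assumed misdiagnosis).
     * Returns the number of rounds needed to get rid of the truly bad node,
     * or -1 if it is still in the pipeline when the spares run out.
     */
    static int roundsUntilBadNodeRemoved(List<String> pipeline, String trulyBad, List<String> spares) {
        List<String> current = new ArrayList<>(pipeline);
        int rounds = 0;
        for (String spare : spares) {
            if (!current.contains(trulyBad)) {
                return rounds;
            }
            String blamed = current.get(0);
            current = replaceBlamedNode(current, blamed, spare);
            rounds++;
        }
        return current.contains(trulyBad) ? -1 : rounds;
    }
}
```

With a pipeline of dn1, dn2, dn3 where dn2 is the bad one, every round swaps out 
the first node and dn2 is never touched, so the method returns -1 no matter how 
many spares are supplied.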

I don't have a best solution in hand, but we can discuss. I think it may be a 
good idea for the client to pass all current pipeline nodes to the NN for a 
recheck on the first pipeline failure. The NN can then give hints back to the 
client about which other nodes are not good, and provide additional backup 
replacement nodes in a single call. It looks like over-designing to me, but I 
don't really have any better ideas in mind. Changing the protocol API is 
painful because of the compatibility problems and the testing needed.
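
As a starting point for the discussion, the exchange could look roughly like 
the Java sketch below. Everything in it is an assumption: PipelineRecheckReply, 
recheck and apply are hypothetical names that do not exist in HDFS, datanodes 
are plain strings, and the NN's stale set and candidate list are passed in 
directly instead of coming from real NN state or a real RPC.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** Hypothetical reply type; no such class exists in HDFS today. */
final class PipelineRecheckReply {
    final List<String> staleNodes;   // pipeline members the NN considers stale
    final List<String> replacements; // suggested replacements, at most one per stale node

    PipelineRecheckReply(List<String> staleNodes, List<String> replacements) {
        this.staleNodes = Collections.unmodifiableList(staleNodes);
        this.replacements = Collections.unmodifiableList(replacements);
    }
}

final class PipelineRecheckSketch {

    /** NN side: compare the reported pipeline against the NN's own stale set. */
    static PipelineRecheckReply recheck(List<String> pipeline,
                                        Set<String> staleOnNameNode,
                                        List<String> healthyCandidates) {
        List<String> stale = new ArrayList<>();
        for (String dn : pipeline) {
            if (staleOnNameNode.contains(dn)) {
                stale.add(dn);
            }
        }
        List<String> replacements = new ArrayList<>();
        for (String candidate : healthyCandidates) {
            if (replacements.size() == stale.size()) {
                break; // one replacement per stale node is enough
            }
            if (!pipeline.contains(candidate) && !staleOnNameNode.contains(candidate)) {
                replacements.add(candidate);
            }
        }
        return new PipelineRecheckReply(stale, replacements);
    }

    /** Client side: swap each stale node for a suggestion, or drop it if none is left. */
    static List<String> apply(List<String> pipeline, PipelineRecheckReply reply) {
        List<String> next = new ArrayList<>();
        int used = 0;
        for (String dn : pipeline) {
            if (!reply.staleNodes.contains(dn)) {
                next.add(dn);
            } else if (used < reply.replacements.size()) {
                next.add(reply.replacements.get(used++));
            }
        }
        return next;
    }
}
```

The point of the sketch is that the hint and the replacement nodes come back in 
one reply, so the client can remove the stale node on the first failure without 
having to diagnose it correctly itself.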



