Uma Maheswara Rao G created HDFS-17255:
------------------------------------------

             Summary: There should be mechanism between client and NN to eliminate stale nodes from current pipeline sooner.
                 Key: HDFS-17255
                 URL: https://issues.apache.org/jira/browse/HDFS-17255
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Uma Maheswara Rao G


In one user's cluster, they hit an issue similar to HDFS-2891. The client always sees the first node as failed, even though the second node is the problematic one (it times out because it was pulled out for network (NW) reasons).

When a pipeline failure happens, the client asks for a new node and swaps it into the pipeline in place of the node it believes has failed. But the actual bad node stays in the pipeline, because the client flagged the wrong node (actually a good one) as bad. So the pipeline failure keeps recurring until the random shuffling happens to pick out the real bad node.

The NN had actually detected the real bad node as stale. But pipeline reconstruction only considers the node the client reported as failed, and only that node is replaced with a new one.

I don't have the best solution in hand, but we can discuss. I think it may be a good idea for the client to pass all current pipeline nodes to the NN for a recheck on the first pipeline failure. The NN can then hint back to the client which other nodes are not good, and provide additional backup replacement nodes in a single call. It looks like over-designing to me, but I don't really have any better ideas in mind. Changing the protocol API is painful because of the compatibility problems and the testing needed.
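
To make the idea above concrete, here is a rough, self-contained sketch of the "recheck the whole pipeline in one call" exchange. Everything in it is hypothetical: the class PipelineRecheckSketch, the RecheckResult type, and the methods recheckPipeline and rebuildPipeline are illustrative names only and are not part of the existing HDFS client protocol. Nodes are plain strings rather than real datanode descriptors, and the client-side policy (trust the NN's stale hints over the client's own guess, and fall back to the client's suspect only when there are no hints) is an assumption, not something settled in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PipelineRecheckSketch {

    /** Hypothetical NN reply: stale pipeline nodes plus backup replacements. */
    static final class RecheckResult {
        final List<String> staleNodes;
        final List<String> replacements;

        RecheckResult(List<String> staleNodes, List<String> replacements) {
            this.staleNodes = staleNodes;
            this.replacements = replacements;
        }
    }

    /**
     * Hypothetical NN-side handler. The client sends the whole pipeline, not
     * just the node it suspects. The NN reports which of those nodes it has
     * marked stale and offers one spare per stale node, plus one extra spare
     * that can cover the client's own suspect.
     */
    static RecheckResult recheckPipeline(List<String> pipeline,
                                         Set<String> staleOnNameNode,
                                         List<String> spares) {
        List<String> stale = new ArrayList<>();
        for (String node : pipeline) {
            if (staleOnNameNode.contains(node)) {
                stale.add(node);
            }
        }
        int wanted = Math.min(spares.size(), stale.size() + 1);
        return new RecheckResult(stale, new ArrayList<>(spares.subList(0, wanted)));
    }

    /**
     * Client-side policy (an assumption): prefer the NN's stale hints over the
     * client's own guess; use the client's suspect only if there are no hints.
     */
    static List<String> rebuildPipeline(List<String> pipeline,
                                        String clientSuspect,
                                        RecheckResult hint) {
        Set<String> toDrop = new HashSet<>(hint.staleNodes);
        if (toDrop.isEmpty()) {
            toDrop.add(clientSuspect);
        }
        List<String> rebuilt = new ArrayList<>();
        int next = 0;
        for (String node : pipeline) {
            if (!toDrop.contains(node)) {
                rebuilt.add(node);                           // keep healthy node
            } else if (next < hint.replacements.size()) {
                rebuilt.add(hint.replacements.get(next++));  // swap in a spare
            }
        }
        return rebuilt;
    }

    public static void main(String[] args) {
        List<String> pipeline = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> stale = new HashSet<>(Arrays.asList("dn2"));
        List<String> spares = Arrays.asList("dn4", "dn5", "dn6");

        // The client wrongly blames dn1, but the NN knows dn2 is stale.
        RecheckResult hint = recheckPipeline(pipeline, stale, spares);
        System.out.println(rebuildPipeline(pipeline, "dn1", hint));
    }
}
```

In this toy run the client blames dn1, the NN reports dn2 as stale, and the rebuilt pipeline is [dn1, dn4, dn3]: the good node stays and the stale one is replaced after a single round trip. A real change would still have to deal with the protocol compatibility concerns raised above.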