[ https://issues.apache.org/jira/browse/HDFS-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Binglin Chang updated HDFS-4273:
--------------------------------
    Attachment: HDFS-4273.patch

Here read() refers to
  int readWithStrategy(ReaderStrategy strategy, int off, int len)
and pread() refers to
  int read(long position, byte[] buffer, int offset, int length)

Changes:
1. Add a new argument "dislike" to chooseDataNode() and bestNode() in order to fix seekToNewSource.
2. Make failures a local variable so that pread can be thread-safe.
3. In read(), let the outer layer handle BlockMissingException, bypassing seekToNewSource.
4. Remove the read retries, because MaxBlockAcquireFailures already handles retrying.
5. Throw ChecksumException iff we have tried enough times and there is only one replica available. In the original logic, whether ChecksumException or BlockMissingException is thrown is somewhat random, depending on the order of the locations returned by getLocatedBlocks(). An alternative is to always throw BlockMissingException (as pread does), but that breaks the current test cases.
6. In pread(), modify the code to follow the same retry logic as read(). Note that the exception behavior of read() and pread() currently differs: read() sometimes throws ChecksumException, while pread() never does. The current patch keeps that behavior unchanged.
7. Add sanity checks for seek and seekToNewSource.
8. Add a test to check that DFSInputStream tries MaxBlockAcquireFailures times under error.
9. Add test cases for seekToNewSource that mirror the original test cases for seek.

> Problem in DFSInputStream read retry logic may cause early failure
> ------------------------------------------------------------------
>
>                 Key: HDFS-4273
>                 URL: https://issues.apache.org/jira/browse/HDFS-4273
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>            Priority: Minor
>         Attachments: HDFS-4273.patch, TestDFSInputStream.java
>
>
> Assume the following call logic:
> {noformat}
> readWithStrategy()
>   -> blockSeekTo()
>   -> readBuffer()
>      -> reader.doRead()
>      -> seekToNewSource() adds currentNode to deadNodes, hoping to get a different datanode
>         -> blockSeekTo()
>            -> chooseDataNode()
>               -> block missing, clear deadNodes and pick the currentNode again
>         seekToNewSource() returns false
>      readBuffer() re-throws the exception and quits the loop
> readWithStrategy() gets the exception, and may fail the read call before having tried MaxBlockAcquireFailures times.
> {noformat}
> Some issues with this logic:
> 1. The seekToNewSource() logic is broken because deadNodes may be cleared in the middle of it.
> 2. The variable "int retries=2" in readWithStrategy seems to conflict with MaxBlockAcquireFailures; should it be removed?
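
The interaction between the "dislike" argument and the clearing of deadNodes can be sketched roughly as follows. This is only a simplified, hypothetical model and not the code in the patch: the class ReplicaChooserSketch, the String node names, the unchecked NoReplicaException, and the method chooseNode() are all assumptions made for illustration. The point it tries to show is that a node passed as "dislike" stays excluded even after deadNodes is cleared, and that the failure counter lives in a local variable rather than in a field.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical, simplified model of replica selection; not the HDFS classes. */
final class ReplicaChooserSketch {

    /** Hypothetical stand-in for "no usable replica after all allowed attempts". */
    static final class NoReplicaException extends RuntimeException {
        NoReplicaException(String message) {
            super(message);
        }
    }

    private final Set<String> deadNodes = new HashSet<>();
    private final int maxAcquireFailures;

    ReplicaChooserSketch(int maxAcquireFailures) {
        this.maxAcquireFailures = maxAcquireFailures;
    }

    /** Record a node that recently failed. */
    void markDead(String node) {
        deadNodes.add(node);
    }

    /** Return the first replica that is neither dead nor disliked, or null. */
    private String bestNode(List<String> replicas, String dislike) {
        for (String replica : replicas) {
            if (!deadNodes.contains(replica) && !replica.equals(dislike)) {
                return replica;
            }
        }
        return null;
    }

    /**
     * Choose a replica, never returning the disliked node.
     * The failure counter is local to this call instead of being a field.
     */
    String chooseNode(List<String> replicas, String dislike) {
        int failures = 0;
        while (true) {
            String node = bestNode(replicas, dislike);
            if (node != null) {
                return node;
            }
            failures++;
            if (failures >= maxAcquireFailures) {
                throw new NoReplicaException(
                    "no replica available after " + failures + " attempts");
            }
            // Forgive previously dead nodes and retry; the disliked node is
            // still filtered out by bestNode(), so it cannot be picked again.
            deadNodes.clear();
        }
    }
}
```

In this model, with replicas "a" and "b" both marked dead and "a" passed as dislike, chooseNode() clears deadNodes once and then returns "b" instead of handing "a" back to the caller. With "a" as the only replica, it gives up after maxAcquireFailures attempts.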