Aaron McCurry created HDFS-9104:
-----------------------------------
Summary: DFSInputStream goes into infinite loop
Key: HDFS-9104
URL: https://issues.apache.org/jira/browse/HDFS-9104
Project: Hadoop HDFS
Issue Type: Bug
Components: hdfs-client
Affects Versions: 2.6.0, 2.5.0
Reporter: Aaron McCurry
I recently came across a bug that causes an infinite loop in the DFSClient. I
have experienced this issue in Hadoop 2.5.0, and it appears to be present in
2.6.0 as well.
The bug is hard to reproduce. It seems to occur only when the NameNode is under
heavy load, which makes me think it is a timing issue.
On the client side, a small file (100s of bytes) is written and then sync() is
called. The deprecated sync() is used because the code is set up to
cross-compile against Hadoop 1 and Hadoop 2. After sync() is called, the close
happens on the output stream in another thread, asynchronously to the writing
thread. This is done because the close call can be very time consuming.
Once the sync happens, the output stream is handed off to the closing thread.
The writing thread then turns around and reads the output it has written and
synced. When this happens, I believe the client reads the file length from the
NameNode, which appears to still be 0 (more on that in a moment).
Once the input stream is open and the client tries to read the first byte, the
DFSInputStream goes into an infinite loop. It appears that the error handling
logic does not handle all IOExceptions.
fetchBlockByteRange =>
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L991
The loop occurs in the fetchBlockByteRange method, which catches all
IOExceptions and simply calls the actualGetFromOneDataNode method again,
assuming that method handles everything correctly.
actualGetFromOneDataNode =>
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1025
Inside its while loop, actualGetFromOneDataNode calls getBlockAt, which throws
an IOException that actualGetFromOneDataNode does not handle.
actualGetFromOneDataNode calls getBlockAt =>
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1040
getBlockAt =>
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L406
The getBlockAt method checks that the position to read is within the file
length, which I believe is still zero at this point. This is where I believe
the IOException is thrown.
IOException =>
https://github.com/apache/hadoop/blob/release-2.6.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L413
Because the IOException is not handled in actualGetFromOneDataNode, and
fetchBlockByteRange blindly calls actualGetFromOneDataNode over and over again,
the result is an infinite loop.
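
To make the control flow concrete, here is a minimal standalone sketch of the
pattern described above. It is not the HDFS source: the class RetryLoopSketch,
the methods readOnce and fetchWithRetryCap, and the maxAttempts parameter are
hypothetical names made up for illustration, and the retry cap is only one
possible way to bound the loop, not necessarily the right fix for HDFS.

```java
import java.io.IOException;

class RetryLoopSketch {
    /** Stand-in for getBlockAt: rejects offsets outside the known file length. */
    static void getBlockAt(long offset, long fileLength) throws IOException {
        if (offset < 0 || offset >= fileLength) {
            throw new IOException(
                "offset " + offset + " is outside file length " + fileLength);
        }
    }

    /** Stand-in for actualGetFromOneDataNode: the IOException simply propagates. */
    static void readOnce(long offset, long fileLength) throws IOException {
        getBlockAt(offset, fileLength);
    }

    /**
     * Stand-in for fetchBlockByteRange with a hypothetical retry cap added.
     * With an unconditional loop instead of the cap, a failure that never
     * clears (file length stuck at 0) would be retried forever.
     * Returns the attempt number on which the read succeeded.
     */
    static int fetchWithRetryCap(long offset, long fileLength, int maxAttempts)
            throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                readOnce(offset, fileLength);
                return attempt;
            } catch (IOException e) {
                last = e; // remember the failure, then retry
            }
        }
        throw new IOException("giving up after " + maxAttempts + " attempts", last);
    }

    public static void main(String[] args) {
        try {
            // File length still reported as 0, so every attempt fails.
            fetchWithRetryCap(0L, 0L, 3);
        } catch (IOException e) {
            System.out.println(e.getMessage() + " (cause: " + e.getCause().getMessage() + ")");
        }
    }
}
```

The point of the sketch is only that the caller sees an exception once the cap
is reached, instead of spinning on an error that cannot resolve itself.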
My current workaround is to wait until the file length is properly reported by
the NameNode before opening the file. This is likely the correct approach
regardless, but I think the client should never go into an infinite loop during
an error condition.
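
A rough sketch of that kind of wait is below. It is written against a small
hypothetical interface (LengthSource) so it runs without a cluster; the names
LengthWaitSketch, waitForLength, timeoutMillis and pollMillis are assumptions
for illustration. In a real client the length could presumably be supplied by
something like FileSystem.getFileStatus(path).getLen().

```java
import java.io.IOException;

class LengthWaitSketch {
    /** Hypothetical source of the currently reported file length. */
    interface LengthSource {
        long currentLength() throws IOException;
    }

    /**
     * Polls until the reported length reaches expectedLength or the timeout
     * elapses. Returns true if the length was reached, false on timeout.
     */
    static boolean waitForLength(LengthSource source, long expectedLength,
                                 long timeoutMillis, long pollMillis)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (source.currentLength() >= expectedLength) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // caller decides what to do, rather than spinning
            }
            Thread.sleep(pollMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        // Simulated NameNode that reports 0 for the first two polls.
        int[] polls = {0};
        boolean ready = waitForLength(() -> ++polls[0] < 3 ? 0L : 300L, 300L, 2000L, 10L);
        System.out.println("ready=" + ready + " after " + polls[0] + " polls");
    }
}
```

The caller would open the file for reading only when waitForLength returns
true, and treat a false return as an error to report.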