DFSClient block read failures cause open DFSInputStream to become unusable
--------------------------------------------------------------------------

                 Key: HADOOP-4681
                 URL: https://issues.apache.org/jira/browse/HADOOP-4681
             Project: Hadoop Core
          Issue Type: Bug
    Affects Versions: 0.18.2, 0.19.0, 0.19.1, 0.20.0
            Reporter: Igor Bolotin
             Fix For: 0.19.1, 0.20.0


We use some Lucene indexes directly from HDFS, and for quite a long time we ran on Hadoop version 0.15.3.

When we tried to upgrade to Hadoop 0.19, index searches started to fail with exceptions like:
2008-11-13 16:50:20,314 WARN [Listener-4] [] DFSClient : DFS Read: java.io.IOException: Could not obtain block: blk_5604690829708125511_15489 file=/usr/collarity/data/urls-new/part-00000/20081110-163426/_0.tis
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
at java.io.DataInputStream.read(DataInputStream.java:132)
at org.apache.nutch.indexer.FsDirectory$DfsIndexInput.readInternal(FsDirectory.java:174)
at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:152)
at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:76)
at org.apache.lucene.index.TermBuffer.read(TermBuffer.java:63)
at org.apache.lucene.index.SegmentTermEnum.next(SegmentTermEnum.java:131)
at org.apache.lucene.index.SegmentTermEnum.scanTo(SegmentTermEnum.java:162)
at org.apache.lucene.index.TermInfosReader.scanEnum(TermInfosReader.java:223)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:217)
at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:54)
...

The investigation showed that the root cause was that we had exceeded the number of xcievers on the datanodes; raising the dfs.datanode.max.xcievers setting to 2k fixed the immediate overload.
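
For reference, that limit is the dfs.datanode.max.xcievers property in the datanodes' hadoop-site.xml. A quick way to double-check what value a node actually picks up from its configuration (the class name is mine; 256 is, as far as I remember, the default used when the property is unset):

    import org.apache.hadoop.conf.Configuration;

    public class PrintXcieverLimit {
      public static void main(String[] args) {
        // Loads hadoop-default.xml / hadoop-site.xml from the classpath.
        Configuration conf = new Configuration();
        System.out.println("dfs.datanode.max.xcievers = "
            + conf.getInt("dfs.datanode.max.xcievers", 256));
      }
    }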
However, one thing that bothered me was that even after the datanodes had recovered from the overload and most of the client servers had been shut down, we still observed these errors in the logs of the servers that were still running.
Further investigation showed that the fix for HADOOP-1911 introduced another problem: a DFSInputStream instance can become permanently unusable once the number of failures accumulated over the lifetime of that instance exceeds the configured threshold (dfs.client.max.block.acquire.failures).
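
To make the failure mode concrete, here is a stripped-down sketch of the pattern as I understand it. The class and method names are mine for illustration only, not the actual DFSClient code; only the shape of the logic matters:

    // Sketch of the failure-counting pattern; not the real DFSClient code.
    class FailureCountingSketch {
      private final int maxBlockAcquireFailures; // dfs.client.max.block.acquire.failures
      private int failures = 0;                  // accumulated over the whole lifetime of the stream

      FailureCountingSketch(int maxBlockAcquireFailures) {
        this.maxBlockAcquireFailures = maxBlockAcquireFailures;
      }

      /** Pick a datanode for the current block, or give up once the counter is exhausted. */
      String chooseDataNode(String blockId, java.util.List<String> candidates)
          throws java.io.IOException {
        while (true) {
          if (failures >= maxBlockAcquireFailures) {
            // Once this trips, every subsequent read on the same stream fails
            // immediately -- even if the datanodes recovered long ago.
            throw new java.io.IOException("Could not obtain block: " + blockId);
          }
          if (!candidates.isEmpty()) {
            return candidates.get(0);
          }
          failures++;  // never reset, so transient trouble is remembered forever
          // ... refetch block locations from the namenode, back off, retry ...
        }
      }
    }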

The fix for this specific issue seems trivial: just reset the failure counter before reading the next block (a patch will be attached shortly).
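
In terms of the sketch above, the change amounts to resetting the counter whenever the stream moves on to a new block; where exactly that reset lands in DFSClient (e.g. at the start of blockSeekTo()) is what the attached patch will show:

    // Belongs in the sketch class above; rough analogue of DFSInputStream.blockSeekTo().
    void seekToNewBlock(String blockId) throws java.io.IOException {
      failures = 0;  // a new block starts with a fresh failure budget, so old,
                     // already-recovered incidents can no longer poison the stream
      // ... fetch block locations, call chooseDataNode(blockId, ...), connect ...
    }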

This also seems related to HADOOP-3185, but I'm not sure I really understand why the DFS client needs to keep track of failed block accesses across the lifetime of a stream at all.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
