[ https://issues.apache.org/jira/browse/HADOOP-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12504572 ]
dhruba borthakur commented on HADOOP-1489:
------------------------------------------
So, can it be related to HADOOP-1491? In that issue, distcp copies files with a
buffer size specified by "copy.buf.size", which has a default value of 4K.
> Input file gets truncated for text files with \r\n
> -------------------------------------------------
>
> Key: HADOOP-1489
> URL: https://issues.apache.org/jira/browse/HADOOP-1489
> Project: Hadoop
> Issue Type: Bug
> Components: io
> Affects Versions: 0.13.0
> Reporter: Bwolen Yang
> Attachments: MRIdentity.java, slashr33.txt
>
>
> When input file has \r\n, LineRecordReader uses mark()/reset() to read one
> byte ahead to check if \r is followed by \n. This probably caused the
> BufferedInputStream to issue a small read request (e.g., 127 bytes). The
> ChecksumFileSystem.FSInputChecker.read() code
> {code}
> public int read(byte b[], int off, int len) throws IOException {
>   // make sure that it ends at a checksum boundary
>   long curPos = getPos();
>   long endPos = len+curPos/bytesPerSum*bytesPerSum;
>   return readBuffer(b, off, (int)(endPos-curPos));
> }
> {code}
> tries to truncate "len" to a checksum boundary. For DFS, bytesPerSum is 512.
> So for small reads, the truncated length becomes negative (i.e., endPos -
> curPos is < 0). The underlying DFS read returns 0 when the length is negative.
> However, readBuffer changes it to -1, assuming end-of-file has been reached.
> This means that, effectively, the rest of the input file does not get read.
> In my case, only 8MB of a 52MB file is actually read. Two sample stacks are
> appended.
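> For concreteness, here is a tiny standalone demo of that arithmetic (a
> hypothetical class, not Hadoop code; the constants are taken from the first
> stack trace appended below):
> {code}
> // Demo: FSInputChecker.read()'s boundary arithmetic on a small read.
> public class TruncationDemo {
>     public static void main(String[] args) {
>         long bytesPerSum = 512;  // DFS checksum chunk size
>         long curPos = 45223932;  // stream position at the time of the read
>         int len = 127;           // small read issued by BufferedInputStream
>
>         // Same expression as in FSInputChecker.read(): curPos is rounded
>         // *down* to a checksum boundary before len is added, so endPos can
>         // land before curPos whenever len is smaller than the offset into
>         // the current chunk.
>         long endPos = len + curPos / bytesPerSum * bytesPerSum;
>         System.out.println(endPos - curPos);  // prints -381, as in the trace
>     }
> }
> {code}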
> One related issue: if there are assumptions (such as len >= bytesPerSum) in
> FSInputChecker's read(), would it be OK to add a check that throws an
> exception when the assumption is violated? This assumption is a bit unusual,
> and as the code changes (both Hadoop and Java's implementation of
> BufferedInputStream), the assumption may get violated. Silently dropping a
> large part of the input is really difficult for people to notice (and
> debug) once they start dealing with terabytes of data. Also, I suspect
> the performance impact of such a check would not be noticeable.
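> A minimal sketch of what such a check might look like (a suggestion only,
> not a patch; the names follow the read() snippet above):
> {code}
> public int read(byte b[], int off, int len) throws IOException {
>     // make sure that it ends at a checksum boundary
>     long curPos = getPos();
>     long endPos = len + curPos / bytesPerSum * bytesPerSum;
>     int chunkLen = (int) (endPos - curPos);
>     // Proposed guard: fail loudly instead of letting a non-positive
>     // length be silently turned into a bogus end-of-file.
>     if (chunkLen <= 0) {
>         throw new IOException("read length truncated to " + chunkLen
>             + " (pos=" + curPos + ", len=" + len
>             + ", bytesPerSum=" + bytesPerSum + ")");
>     }
>     return readBuffer(b, off, chunkLen);
> }
> {code}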
> bwolen
> Here are two sample stacks. (I made readBuffer() throw when it gets 0 bytes,
> and had FSInputChecker catch the exception and rethrow it. This way, I
> capture the values from both caller and callee; the callee's exception starts
> with "Caused by".)
> -------------------------------------
> {code}
> java.lang.RuntimeException: end of read() in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker len=127 pos=45223932 res=-999999
>     at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
>     at java.io.FilterInputStream.read(FilterInputStream.java:66)
>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
>     at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
> Caused by: java.lang.RuntimeException: end of read() datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream pos=45223932 len=-381 bytesPerSum=512 eof=false read=0
>     at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
>     at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
>     at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
>     ... 11 more
> ---------------
> java.lang.RuntimeException: end of read() in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker len=400 pos=4503 res=-999999
>     at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>     at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
>     at java.io.FilterInputStream.read(FilterInputStream.java:66)
>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
>     at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
>     at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
>     at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
>     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
> Caused by: java.lang.RuntimeException: end of read() datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream pos=4503 len=-7 bytesPerSum=512 eof=false read=0
>     at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
>     at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
>     at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
>     ... 11 more
> {code}