[ https://issues.apache.org/jira/browse/HADOOP-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536008 ]

Raghu Angadi commented on HADOOP-2071:
--------------------------------------

bq. the readlimit argument for mark is not honored in these changes. If one 
calls reset after more than readlimit bytes have been read after mark, that 
reset is supposed to throw IOException.

If we want to keep that behavior, we can track how many bytes we have read 
since mark() and throw an IOException from reset() if the count exceeds 
readlimit. Actually, we could just throw an exception as soon as no record is 
found within readlimit (instead of reading until there is a match or EOF).
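
To make the first option concrete, here is a minimal sketch of that bookkeeping as a wrapper stream. The class name is made up for illustration (it is not from the attached patch), and it assumes the wrapped stream can actually rewind on reset():

{code:java}
import java.io.IOException;
import java.io.InputStream;

// Illustrative only: counts bytes read since mark() and invalidates
// the mark once more than readlimit bytes have been consumed.
public class MarkLimitInputStream extends InputStream {
  private final InputStream in;
  private long readSinceMark = -1;  // -1 means no active mark
  private int readLimit;

  public MarkLimitInputStream(InputStream in) {
    this.in = in;
  }

  @Override
  public synchronized void mark(int readlimit) {
    this.readLimit = readlimit;
    this.readSinceMark = 0;
    in.mark(readlimit);
  }

  @Override
  public boolean markSupported() {
    return in.markSupported();
  }

  @Override
  public int read() throws IOException {
    // The bulk read(byte[], int, int) inherited from InputStream
    // funnels through here, so the count stays accurate.
    int b = in.read();
    if (b >= 0 && readSinceMark >= 0) {
      readSinceMark++;
    }
    return b;
  }

  @Override
  public synchronized void reset() throws IOException {
    // Keep the java.io contract: reset() after reading more than
    // readlimit bytes past the mark throws IOException.
    if (readSinceMark < 0 || readSinceMark > readLimit) {
      throw new IOException("Mark invalid: read past readlimit");
    }
    in.reset();
    readSinceMark = 0;  // back at the mark, so the count starts over
  }
}
{code}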

Lohit and I looked at the surrounding code, and it seems to seek back pretty 
heavily (pretty much for every record). Seeking back is quite inefficient in 
DFS: it throws away the current buffers (both app and TCP) and in most cases 
starts a new connection. The current patch does not make this situation any 
worse. I wonder what the typical size of these records is...

One problem with using BufferedInputStream is that the current code uses 
getPos() and seek() in many places, and those are specific to 
FSDataInputStream. So it would need more changes to manage that.
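
For illustration only, a drop-in buffered wrapper would at least have to keep its own notion of position. Something along these lines (a hypothetical class, not in Hadoop), which still leaves seek() unimplemented:

{code:java}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: a buffered stream that can still answer getPos(),
// which plain BufferedInputStream cannot do. skip(), mark()/reset() and,
// above all, seek() are left out here.
public class PositionedBufferedInputStream extends BufferedInputStream {
  private long pos;  // bytes handed to the caller so far

  public PositionedBufferedInputStream(InputStream in, int size) {
    super(in, size);
  }

  @Override
  public synchronized int read() throws IOException {
    int b = super.read();
    if (b >= 0) {
      pos++;
    }
    return b;
  }

  @Override
  public synchronized int read(byte[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    if (n > 0) {
      pos += n;
    }
    return n;
  }

  public long getPos() {
    return pos;
  }
}
{code}

Wiring a seek() through the buffer (invalidating or reusing buffered data depending on the target offset) is exactly where the extra changes would come in.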

> StreamXmlRecordReader throws java.io.IOException: Mark/reset exception in 
> hadoop 0.14
> -------------------------------------------------------------------------------------
>
>                 Key: HADOOP-2071
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2071
>             Project: Hadoop
>          Issue Type: Bug
>          Components: contrib/streaming
>    Affects Versions: 0.14.3
>            Reporter: lohit vijayarenu
>            Assignee: lohit vijayarenu
>         Attachments: HADOOP-2071-1.patch
>
>
> In hadoop 0.14, using -inputreader StreamXmlRecordReader for streaming jobs 
> throws java.io.IOException: Mark/reset exception.
> This looks to be related to HADOOP-2067 
> (https://issues.apache.org/jira/browse/HADOOP-2067).
> <stack trace>
> Caused by: java.io.IOException: Mark/reset not supported
>       at org.apache.hadoop.dfs.DFSClient$DFSInputStream.reset(DFSClient.java:1353)
>       at java.io.FilterInputStream.reset(FilterInputStream.java:200)
>       at org.apache.hadoop.streaming.StreamXmlRecordReader.fastReadUntilMatch(StreamXmlRecordReader.java:289)
>       at org.apache.hadoop.streaming.StreamXmlRecordReader.readUntilMatchBegin(StreamXmlRecordReader.java:118)
>       at org.apache.hadoop.streaming.StreamXmlRecordReader.seekNextRecordBoundary(StreamXmlRecordReader.java:111)
>       at org.apache.hadoop.streaming.StreamXmlRecordReader.init(StreamXmlRecordReader.java:73)
>       at org.apache.hadoop.streaming.StreamXmlRecordReader.<init>(StreamXmlRecordReader.java:63)
> </stack trace>

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
