[ 
https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629661#action_12629661
 ] 

Doug Cutting commented on HADOOP-4065:
--------------------------------------

> With this interface, the SequenceFileRecordReader can read SequenceFiles, 
> DeserializerTypedFiles (thrift, protocol buffers, record io, whatever) and any 
> other self-describing typed files, SequenceFiles being one example of these.

I'm all for reducing code duplication.  So if SequenceFileRecordReader can 
mostly be replaced with code that's shared by other file formats, that'd be 
great.  But we need those file formats to exist before we perform this 
factoring.  Is there a splittable thrift or protocol-buffer input file format 
implementation yet that can share code with SequenceFileInputFormat?  Let's not 
refactor until we have these.

> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: HADOOP-4065.0.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary 
> records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set 
> splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration 
> somewhere).
> - how to determine EOF (it would be nice if hadoop could determine EOF rather 
> than have the deserializer throw an exception, which is hard to distinguish 
> from an exception due to corruption). this is easy for non-compressed 
> streams. for compressed streams, DecompressorStream has a useful-looking 
> getAvailable() call, except the class is marked package private.
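
As a rough illustration of the EOF question in the description, here is a
minimal sketch that tells a clean end of file apart from a truncated record
without asking the stream how many bytes are available. It assumes records are
framed with a 4-byte length prefix, which the issue does not specify, and it
uses java.util.zip in place of Hadoop's codec streams. The names
FlatRecordReader, gzipRecords and readAll are hypothetical and not part of
Hadoop or of the attached patch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

/** Hypothetical reader for length-prefixed binary records in a flat file. */
final class FlatRecordReader {
    private final PushbackInputStream in;
    private final DataInputStream data;

    /** raw may be a plain stream or an already-decompressing stream. */
    FlatRecordReader(InputStream raw) {
        this.in = new PushbackInputStream(raw, 1);
        this.data = new DataInputStream(in);
    }

    /**
     * Returns the next record, or null on a clean EOF at a record boundary.
     * An EOF in the middle of a record surfaces as java.io.EOFException,
     * so callers can treat it as truncation or corruption.
     */
    byte[] next() throws IOException {
        int first = in.read();          // peek one byte at the boundary
        if (first < 0) {
            return null;                // nothing left: clean end of file
        }
        in.unread(first);               // put the byte back for readInt()
        int len = data.readInt();       // assumed framing: 4-byte length
        if (len < 0) {
            throw new IOException("corrupt record length: " + len);
        }
        byte[] record = new byte[len];  // a real reader would cap len
        data.readFully(record);         // throws EOFException if truncated
        return record;
    }

    /** Reads every record until a clean EOF. */
    static List<byte[]> readAll(InputStream raw) throws IOException {
        FlatRecordReader reader = new FlatRecordReader(raw);
        List<byte[]> records = new ArrayList<>();
        for (byte[] r = reader.next(); r != null; r = reader.next()) {
            records.add(r);
        }
        return records;
    }

    /** Test helper: writes length-prefixed records into a gzip byte array. */
    static byte[] gzipRecords(byte[]... records) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out =
                 new DataOutputStream(new GZIPOutputStream(bytes))) {
            for (byte[] r : records) {
                out.writeInt(r.length);
                out.write(r);
            }
        }
        return bytes.toByteArray();
    }
}
```

Because the peek goes through the ordinary read() call, the same loop works
whether the underlying stream is compressed or not: wrapping the file in a
GZIPInputStream and passing it to readAll returns the records and stops at the
first -1 seen on a record boundary. Formats without a length prefix would need
a different boundary check.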

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
