[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Owen O'Malley (JIRA) Fri, 05 Sep 2008 08:51:38 -0700

    [ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628658#action_12628658 ]


Owen O'Malley commented on HADOOP-4065:
---------------------------------------

Ok, a dispatching FileInputFormat could make sense: one that looks at the
filenames or headers of the files and picks the appropriate reader. At that
point, it isn't about binary files, because you'd want it to work for text
files also. What would be the approach? Matching on filenames, as we do for
the compression of text files? Or sampling the first 80 bytes and looking
for a header?

-- Owen
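
To make the two options in the comment above concrete, here is a minimal
sketch of such a dispatcher in plain Java. It is not taken from Hadoop: the
class ReaderDispatcher, its methods, and the reader names are all
hypothetical. It tries a filename-suffix match first and falls back to
comparing the sampled head of the file (up to 80 bytes) against registered
magic prefixes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: class, method and reader names are illustrative,
// not Hadoop APIs.
public class ReaderDispatcher {
    public static final int SAMPLE_SIZE = 80;

    // Insertion order is kept so earlier registrations win.
    private final Map<String, String> readerBySuffix = new LinkedHashMap<>();
    private final Map<String, byte[]> magicByReader = new LinkedHashMap<>();
    private final String fallbackReader;

    public ReaderDispatcher(String fallbackReader) {
        this.fallbackReader = fallbackReader;
    }

    public void registerSuffix(String suffix, String reader) {
        readerBySuffix.put(suffix, reader);
    }

    public void registerMagic(String reader, byte[] magic) {
        magicByReader.put(reader, magic.clone());
    }

    // Reads up to SAMPLE_SIZE bytes; returns fewer if the stream is shorter.
    // The stream is consumed, so the caller must reopen or reset it.
    public static byte[] sampleHead(InputStream in) throws IOException {
        byte[] buf = new byte[SAMPLE_SIZE];
        int n = 0;
        while (n < SAMPLE_SIZE) {
            int r = in.read(buf, n, SAMPLE_SIZE - n);
            if (r < 0) {
                break;
            }
            n += r;
        }
        return Arrays.copyOf(buf, n);
    }

    // Filename suffix first, then header bytes, then the fallback reader.
    public String pick(String fileName, byte[] head) {
        for (Map.Entry<String, String> e : readerBySuffix.entrySet()) {
            if (fileName.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        for (Map.Entry<String, byte[]> e : magicByReader.entrySet()) {
            byte[] magic = e.getValue();
            if (head.length >= magic.length
                    && Arrays.equals(Arrays.copyOf(head, magic.length), magic)) {
                return e.getKey();
            }
        }
        return fallbackReader;
    }
}
```

Registering the prefix 'S', 'E', 'Q' (the bytes a SequenceFile starts with)
under a reader name of your choice would let header sampling route those
files, while a fallback of "text" keeps plain text files working. Whether
suffixes or headers should take precedence is a design choice this sketch
does not settle.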


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>
> Like TextInputFormat: looking for a concrete implementation to read binary
> records from a flat file (that may be compressed).
> It's assumed that Hadoop can't split such a file, so the InputFormat can set
> splittable to false.
> Tricky aspects are:
> - How to know what class the file contains (has to be in a configuration
> somewhere).
> - How to determine EOF. It would be nice if Hadoop could determine EOF rather
> than have the deserializer throw an exception (which is hard to distinguish
> from an exception due to corruption). This is easy for non-compressed
> streams. For compressed streams, DecompressorStream has a useful-looking
> getAvailable() call, except the class is marked package private.
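
One way to separate a clean EOF from corruption without relying on
getAvailable() is sketched below, in plain Java. It assumes a record layout
that is not specified in this issue (a 4-byte big-endian length followed by
the payload), and the class RecordStreamReader is hypothetical. The reader
peeks one byte before each record: -1 at a record boundary is a normal end
of stream, while an EOFException in the middle of a record points to a
truncated or corrupt file. Because it only uses InputStream.read(), the same
loop should also work on top of a decompressing stream.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the record layout (int length + payload) is an
// assumption made for illustration, not something defined by Hadoop.
public class RecordStreamReader {

    public static List<byte[]> readAll(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 1);
        DataInputStream data = new DataInputStream(in);
        List<byte[]> records = new ArrayList<>();
        while (true) {
            // Peek one byte: -1 here means EOF exactly on a record boundary.
            int first = in.read();
            if (first < 0) {
                return records;
            }
            in.unread(first);

            // From here on, running out of bytes throws EOFException,
            // which signals a truncated record rather than a clean end.
            int length = data.readInt();
            if (length < 0) {
                throw new IOException("negative record length: " + length);
            }
            byte[] payload = new byte[length];
            data.readFully(payload);
            records.add(payload);
        }
    }
}
```

A real reader would also want to cap the record length before allocating,
since a corrupt length field could otherwise request a very large buffer.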

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
