[jira] Commented: (HADOOP-4065) support for reading binary data from flat files

Pete Wyckoff (JIRA) Thu, 18 Sep 2008 10:45:36 -0700

    [ 
https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12632309#action_12632309
 ]


Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

bq. Could the types be called FlatFileInputFormat and FlatFileRecordReader?
Yes, better names.

bq. Is a SerializationContext class needed?

If the Serialization is in contrib, one would need to use ReflectionUtils to 
instantiate it and it wouldn't be in any Factory, would it?  So, in this case, 
it needs to know the name of the Class to instantiate it, no?  
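
To make the point concrete, here is a minimal sketch in plain Java rather than 
Hadoop's own API. It assumes the class name is stored under a hypothetical key 
(flatfile.serialization.class) and uses java.util.Properties as a stand-in for 
the Configuration; DummySerialization is likewise made up for illustration. In 
Hadoop itself the instantiation step would presumably go through 
ReflectionUtils instead of raw reflection.

```java
import java.util.Properties;

class SerializationLookup {
    /** Hypothetical stand-in for a Serialization that lives in contrib. */
    public static class DummySerialization {
        public String describe() {
            return "dummy";
        }
    }

    /** Hypothetical configuration key; not a real Hadoop property. */
    static final String SERIALIZATION_CLASS_KEY = "flatfile.serialization.class";

    /**
     * Looks up the class name in the configuration and instantiates it
     * reflectively through its no-argument constructor.
     */
    static Object newSerialization(Properties conf) {
        String className = conf.getProperty(SERIALIZATION_CLASS_KEY);
        if (className == null) {
            throw new IllegalStateException("missing " + SERIALIZATION_CLASS_KEY);
        }
        try {
            Class<?> clazz = Class.forName(className);
            return clazz.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot instantiate " + className, e);
        }
    }
}
```

The only thing the caller has to supply is the class name, which is why 
something has to carry that name around.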

bq. Can't just get these two classes from the Configuration itself.

I wanted to make it extensible so it could come from the configuration or maybe 
some place else - the name of the file or some external store or something 
depending on the application. Of course, in that case, one could argue a higher 
level is setting that up anyway, so why don't they just do the lookup and store 
the info in the configuration. 

bq. Shouldn't keys be file offsets, similar to TextInputFormat? The row numbers 
you have are actually the row number within the split, which might be confusing 
(and they're not unique per file).

Are the file offsets useful anywhere?  Maybe we should just always return the 
same instance of some dummy Writable for performance if the key isn't used 
anyway??
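
A rough sketch of the two key choices, in plain Java with made-up names 
(KeyChoices, DummyKey); it is not taken from the attached patch. offsetKeys 
computes the byte offset at which each record starts, which is unique per file, 
while dummyKeys hands back one shared instance for every record so nothing is 
allocated per row. Hadoop's NullWritable.get() is an existing singleton of the 
second kind.

```java
import java.util.ArrayList;
import java.util.List;

class KeyChoices {
    /** Placeholder key with exactly one instance, like a dummy Writable. */
    enum DummyKey { INSTANCE }

    /** Keys as file offsets: the starting byte position of each record. */
    static List<Long> offsetKeys(int[] recordLengths) {
        List<Long> keys = new ArrayList<>();
        long position = 0;
        for (int length : recordLengths) {
            keys.add(position);
            position += length;
        }
        return keys;
    }

    /** Keys as a shared dummy: the same instance is returned for every record. */
    static List<DummyKey> dummyKeys(int recordCount) {
        List<DummyKey> keys = new ArrayList<>();
        for (int i = 0; i < recordCount; i++) {
            keys.add(DummyKey.INSTANCE);
        }
        return keys;
    }
}
```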


> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>         Attachments: FlatFileReader.java, HADOOP-4065.0.txt, 
> HADOOP-4065.1.txt, HADOOP-4065.1.txt, ThriftFlatFile.java
>
>
> like textinputformat - looking for a concrete implementation to read binary 
> records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set 
> splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration 
> somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not 
> have the deserializer throw an exception  (which is hard to distinguish from 
> an exception due to corruption?)). this is easy for non-compressed streams - 
> for compressed streams - DecompressorStream has a useful looking 
> getAvailable() call - except the class is marked package private.
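
One possible way to tell a clean EOF from a truncated record without 
getAvailable(), sketched in plain Java. It assumes a hypothetical framing in 
which every record is a 4-byte length followed by the payload; the issue does 
not specify any framing, and RecordBoundaryReader is an illustrative name. The 
reader peeks one byte at each record boundary: -1 there means the stream ended 
cleanly, while an EOFException after that point means the record was cut short. 
Since it only relies on read() returning -1, it should behave the same when the 
wrapped stream is a decompressing stream such as java.util.zip.GZIPInputStream.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

class RecordBoundaryReader {
    private final PushbackInputStream in;
    private final DataInputStream data;

    RecordBoundaryReader(InputStream raw) {
        // One byte of pushback is enough to peek at a record boundary.
        this.in = new PushbackInputStream(raw, 1);
        this.data = new DataInputStream(in);
    }

    /**
     * Returns the next record payload, or null on a clean end of stream.
     * Throws EOFException if the stream ends in the middle of a record.
     */
    byte[] next() throws IOException {
        int first = in.read();
        if (first < 0) {
            return null; // EOF exactly on a record boundary
        }
        in.unread(first);
        int length = data.readInt(); // EOFException if the length is cut short
        if (length < 0) {
            throw new IOException("negative record length: " + length);
        }
        byte[] payload = new byte[length];
        data.readFully(payload); // EOFException if the payload is cut short
        return payload;
    }
}
```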

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
