[ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628530#action_12628530 ]

Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

It would be nice to also handle the case where the file is self-describing, 
like sequence files. If we had a FlatFileRecordReader(conf, split, 
serializerFactory), where the serializerFactory could be initialized with the 
conf, the split, and the input stream, the factory could optionally read the 
self-describing data from the input stream. Either way, it could then be used 
to implement getKey and getValue, whether from the self-described data or, as 
you said, based on the path. Alternatively, the factory could simply be 
instantiated from a conf variable.

Then the user is free to implement the serializer lookup however they want, 
much like most of the rest of the system.

This does mean the serializer lookup sits very low in the stack, but since one 
must implement next, and, as Joy points out, you can't do that without the 
serializer, is there any way around that?
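A rough sketch of what that factory shape might look like, with Hadoop's conf/split/serialization machinery stripped out; all of the names here (RecordDeserializer, FlatFileDeserializerFactory, SelfDescribingFactory) are invented for illustration, not actual Hadoop APIs:

```java
import java.io.*;
import java.util.*;

// Invented stand-in for a Hadoop Deserializer bound to a stream.
interface RecordDeserializer<T> {
  T deserialize(DataInputStream in) throws IOException;
}

// Invented stand-in for the proposed serializerFactory: the real one would
// also see the conf and the split, not just the stream.
interface FlatFileDeserializerFactory {
  // Given the already-opened stream, optionally consume a self-describing
  // header and return a deserializer for the records that follow.
  RecordDeserializer<?> open(DataInputStream in) throws IOException;
}

public class FlatFileReaderSketch {
  // A factory that reads the record class name from the file header itself.
  static class SelfDescribingFactory implements FlatFileDeserializerFactory {
    public RecordDeserializer<?> open(DataInputStream in) throws IOException {
      String className = in.readUTF();   // self-describing header
      if (!className.equals("java.lang.String"))
        throw new IOException("unsupported record class: " + className);
      return (RecordDeserializer<String>) DataInputStream::readUTF;
    }
  }

  // Drain records from an in-memory self-describing "file".
  public static List<Object> readAll(byte[] data, int nRecords)
      throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    RecordDeserializer<?> de = new SelfDescribingFactory().open(in);
    List<Object> records = new ArrayList<>();
    for (int i = 0; i < nRecords; i++) records.add(de.deserialize(in));
    return records;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeUTF("java.lang.String");    // header names the record class
    out.writeUTF("rec1");
    out.writeUTF("rec2");
    System.out.println(readAll(buf.toByteArray(), 2));
  }
}
```

The point is only that the header read happens inside the factory, so the reader itself never needs to know where the serializer came from.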

This solves not only this case, but also the case of a self-describing thrift 
TRecordStream, since the serializer class info would be in the header itself.

It would also be nice if the underlying input stream could be hidden behind an 
interface: sometimes there's a plain flat file, but other times the data may be 
compressed or in some other format that is nonetheless capable of producing a 
stream of bytes.
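For instance, the reader could be handed a plain java.io.InputStream and let a small opener decide whether to wrap the raw bytes in a decompressor. In this sketch the `.gz` suffix check is just a stand-in for whatever codec lookup Hadoop would actually perform:

```java
import java.io.*;
import java.util.zip.*;

public class StreamOpener {
  // The record reader only ever sees an InputStream; whether the underlying
  // bytes are flat or compressed is decided here. (Suffix matching stands in
  // for a real codec lookup.)
  public static InputStream open(String path, InputStream raw)
      throws IOException {
    return path.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
  }

  public static void main(String[] args) throws IOException {
    byte[] payload = "binary records here".getBytes("UTF-8");

    // Compress the payload, then read it back through the same interface
    // a flat file would use.
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (OutputStream gz = new GZIPOutputStream(buf)) { gz.write(payload); }

    InputStream in = open("data.gz",
        new ByteArrayInputStream(buf.toByteArray()));
    System.out.println(new String(in.readAllBytes(), "UTF-8"));
  }
}
```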

So, I guess I'm advocating for flexibility both in defining the serializer 
lookup logic and in how the file is read.



> support for reading binary data from flat files
> -----------------------------------------------
>
>                 Key: HADOOP-4065
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4065
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Joydeep Sen Sarma
>
> like textinputformat - looking for a concrete implementation to read binary 
> records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set 
> splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration 
> somewhere).
> - how to determine EOF (it would be nice if hadoop could determine EOF 
> rather than have the deserializer throw an exception, which is hard to 
> distinguish from an exception due to corruption). this is easy for 
> non-compressed streams; for compressed streams, DecompressorStream has a 
> useful-looking getAvailable() call, except the class is marked 
> package-private.
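On the EOF question in the description above, one way to avoid relying on a deserializer exception is to peek one byte ahead of each record; a clean EOF then looks different from a truncated record. This is a sketch in plain java.io, not Hadoop's stream classes, using a hypothetical length-prefixed record format:

```java
import java.io.*;
import java.util.*;

public class EofPeekReader {
  // Read length-prefixed records until clean EOF. A -1 from the peek means
  // "no more records"; an EOFException mid-record means truncation (likely
  // corruption), so the two cases are distinguishable.
  public static List<byte[]> readRecords(InputStream raw) throws IOException {
    PushbackInputStream in = new PushbackInputStream(raw);
    DataInputStream din = new DataInputStream(in);
    List<byte[]> records = new ArrayList<>();
    int b;
    while ((b = in.read()) != -1) {   // peek one byte
      in.unread(b);                   // put it back for the record read
      int len = din.readInt();        // EOFException here => truncated record
      byte[] rec = new byte[len];
      din.readFully(rec);
      records.add(rec);
    }
    return records;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    out.writeInt(3); out.write(new byte[]{1, 2, 3});
    out.writeInt(1); out.write(new byte[]{9});
    System.out.println(readRecords(
        new ByteArrayInputStream(buf.toByteArray())).size());
  }
}
```

For an uncompressed stream this is straightforward; the harder part, as the description notes, is getting the same "bytes remain?" signal out of a compressed stream without access to package-private classes.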

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
