[ https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628806#action_12628806 ]
Pete Wyckoff commented on HADOOP-4065:
--------------------------------------

I propose we reuse the code from SequenceFileRecordReader by making it depend on a SplittableTypedFile interface (below), which SequenceFile conveniently already implements. Then we're basically done. I am not super-familiar with this code and the devil is probably in the details, but looking at SequenceFileRecordReader, it uses only about 5 methods from SequenceFile, and those are all well defined and seem necessary for any implementation of a self-describing file that is splittable. We could also leave SequenceFileRecordReader untouched, but then we'd just be duplicating all of its code.

{code:title=TypedFile Interfaces}
public interface TypedFile {
  public void initialize(Configuration conf, InputStream in);
  public Class getKeyClass();
  public Class getValueClass();
  public boolean next(Writable key);
  public boolean next(Writable key, Writable value);
  public Writable getCurrentValue();
}

// a sub-interface extends (not implements) its parent
public interface SplittableTypedFile extends TypedFile {
  public boolean syncSeen();          // i.e., atEOF()
  public boolean sync(long position); // skip to past the last frame boundary
}
{code}

{code:title=TypedSplittableRecordReader}
// This is almost a complete cut-n-paste of the existing SequenceFileRecordReader - which would be removed
public class TypedSplittableRecordReader<K, V> implements RecordReader<K, V> {
  SplittableTypedFile in;

  public TypedSplittableRecordReader(Configuration conf, FileSplit split,
                                     SplittableTypedFileFactory<K, V> fileFactory) {
    this.in = fileFactory.getFileReader(fs, path, conf);
  }

  // the rest is basically exactly like the current SequenceFileRecordReader implementation.
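  // Sketch (an assumption on my part, not existing Hadoop code) of what that
  // shared body would look like: next() delegates to the SplittableTypedFile
  // and stops once the reader has passed the end of its split at a sync point
  // (pos/end would track the stream position and the split end, as today):
  //
  //   public synchronized boolean next(K key, V value) throws IOException {
  //     boolean more = in.next((Writable) key, (Writable) value);
  //     if (pos >= end && in.syncSeen()) {
  //       more = false;
  //     }
  //     return more;
  //   }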
}
{code}

{code:title=SequenceFile}
-public class SequenceFile {
+public class SequenceFile implements SplittableTypedFile {
{code}

{code:title=SequenceFileInputFormat}
public class SequenceFileInputFormat<K, V> extends FileInputFormat<K, V> {
  public RecordReader<K, V> getRecordReader() {
    return new TypedSplittableRecordReader<K, V>(job, split, new SequenceFileFactory<K, V>());
  }
}
{code}

{code:title=SelfDescribingFileExample}
public class TFixedFrameTransportInputFormat implements SplittableTypedFile {
  // implementing all of the above should be straightforward
}

public class TFixedFrameTransportFileInputFormat<K, V> extends FileInputFormat<K, V> {
  public RecordReader<K, V> getRecordReader() {
    return new TypedSplittableRecordReader<K, V>(job, split, new TFixedFrameFileFactory<K, V>());
  }
}
{code}

One problem: for non-splittable files, I would have to create another record reader with almost the same code. Maybe it's better to put everything in one interface, add a boolean isSplittable(), and have sync() just do a seek(0) and syncSeen() just check for EOF.

> support for reading binary data from flat files
> -----------------------------------------------
>
> Key: HADOOP-4065
> URL: https://issues.apache.org/jira/browse/HADOOP-4065
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Reporter: Joydeep Sen Sarma
>
> Like TextInputFormat - looking for a concrete implementation to read binary
> records from a flat file (that may be compressed).
> It's assumed that Hadoop can't split such a file, so the InputFormat can set
> splittable to false.
> Tricky aspects are:
> - how to know what class the file contains (it has to be in a configuration
> somewhere).
> - how to determine EOF (it would be nice if Hadoop could determine EOF rather
> than have the deserializer throw an exception, which is hard to distinguish
> from an exception due to corruption).
> This is easy for non-compressed streams;
> for compressed streams, DecompressorStream has a useful-looking
> getAvailable() call - except the class is marked package-private.
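On the EOF question in the original report: for uncompressed streams, one way to avoid relying on deserializer exceptions is to probe for the next byte before deserializing, so end-of-stream at a record boundary becomes a clean false from next(), while a truncated record becomes an unambiguous error. A minimal sketch, assuming fixed-size records; the class and method names here are illustrative, not existing Hadoop API:

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative sketch, not Hadoop code: reads fixed-size binary records and
// signals clean EOF via a boolean rather than a deserializer exception.
public class FixedRecordReader {
  private final DataInputStream in;

  public FixedRecordReader(InputStream in) {
    this.in = new DataInputStream(in);
  }

  // Returns false on a clean EOF (stream exhausted exactly at a record
  // boundary). An EOFException from readFully() can then only mean a
  // truncated record, i.e. likely corruption.
  public boolean next(byte[] record) throws IOException {
    int first = in.read();       // probe one byte instead of deserializing blind
    if (first == -1) {
      return false;              // clean EOF at a record boundary
    }
    record[0] = (byte) first;
    in.readFully(record, 1, record.length - 1); // short read here = truncated record
    return true;
  }
}
```

With this shape the ambiguity the report mentions goes away for flat files; compressed streams would still need some way to ask the decompressor whether more input remains, which is exactly where the package-private DecompressorStream gets in the way.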