[
https://issues.apache.org/jira/browse/HADOOP-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628806#action_12628806
]
Pete Wyckoff commented on HADOOP-4065:
--------------------------------------
I propose we re-use the code from SequenceFileRecordReader by making it depend
on a SplittableTypedFile interface (below) which conveniently is already
implemented by SequenceFile. Then we're basically done.
I am not super-familiar with this code and the devil is probably in the
details, but looking at SequenceFileRecordReader, it uses only about five
methods from SequenceFile. Those are all well defined and seem necessary for
any implementation of a self-describing file that is splittable.
We could also not touch SequenceFileRecordReader, but it seems we'd just be
duplicating all of its code.
{code:title=TypedFile Interfaces}
public interface TypedFile {
  public void initialize(Configuration conf, InputStream in);
  public Class getKeyClass();
  public Class getValueClass();
  public boolean next(Writable key);
  public boolean next(Writable key, Writable value);
  public Writable getCurrentValue();
}

// An interface extends (not implements) another interface.
public interface SplittableTypedFile extends TypedFile {
  public boolean syncSeen(); // i.e., atEOF()
  public boolean sync(long position); // skip to past last frame boundary
}
{code}
{code:title=TypedSplittableRecordReader}
// This is almost a complete cut-n-paste of the existing
// SequenceFileRecordReader, which would be removed.
public class TypedSplittableRecordReader<K, V> implements RecordReader<K, V> {
  private SplittableTypedFile in;

  public TypedSplittableRecordReader(Configuration conf, FileSplit split,
      SplittableTypedFileFactory<K, V> fileFactory) throws IOException {
    // Resolve the file system and path from the split before opening the file.
    Path path = split.getPath();
    FileSystem fs = path.getFileSystem(conf);
    this.in = fileFactory.getFileReader(fs, path, conf);
  }
  // The rest is basically the same as the current SequenceFileRecordReader
  // implementation.
}
{code}
{code:title=SequenceFile}
-public class SequenceFile {
+public class SequenceFile implements SplittableTypedFile {
{code}
{code:title=SequenceFileInputFormat}
public class SequenceFileInputFormat<K, V> extends FileInputFormat<K, V> {
  public RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new TypedSplittableRecordReader<K, V>(job, (FileSplit) split,
        new SequenceFileFactory<K, V>());
  }
}
{code}
{code:title=SelfDescribingFileExample}
public class TFixedFrameTransportInputFormat implements SplittableTypedFile {
  // implementing all the above should be straightforward
}

public class TFixedFrameTransportFileInputFormat<K, V> extends
    FileInputFormat<K, V> {
  public RecordReader<K, V> getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new TypedSplittableRecordReader<K, V>(job, (FileSplit) split,
        new TFixedFrameFileFactory<K, V>());
  }
}
{code}
One problem is that for non-splittable files I would have to create another
record reader with almost the same code. It may be better to put everything in
one interface, add a boolean isSplittable, and for non-splittable files have
sync just do a seek(0) and syncSeen just check for EOF.
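
A minimal sketch of that single-interface idea, just to show that one reader
loop could then serve both cases. Everything here is an assumption made for
illustration: the merged TypedFile shape, the getPosition() method, the
NonSplittableFile and TypedFileReaderSketch names, and the use of String
records held in memory in place of Writables read from a stream. It is not the
existing Hadoop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical merged interface: splittability is a flag, not a subtype.
interface TypedFile {
    boolean isSplittable();
    void sync(long position);  // move to a record boundary at or after position
    boolean syncSeen();        // true once a boundary (or EOF) has been reached
    String next();             // next record, or null when none are left
    long getPosition();
}

// Toy non-splittable file: one in-memory record per position.
final class NonSplittableFile implements TypedFile {
    private final String[] records;
    private int pos = 0;

    NonSplittableFile(String... records) { this.records = records; }

    public boolean isSplittable() { return false; }

    // Non-splittable: ignore the requested offset and just seek(0).
    public void sync(long position) { pos = 0; }

    // Non-splittable: the only boundary is EOF.
    public boolean syncSeen() { return pos >= records.length; }

    public String next() { return pos < records.length ? records[pos++] : null; }

    public long getPosition() { return pos; }
}

final class TypedFileReaderSketch {
    // One read loop, loosely modelled on a split-oriented record reader.
    static List<String> readAll(TypedFile in, long start, long end) {
        List<String> out = new ArrayList<>();
        in.sync(start);
        boolean more = in.getPosition() < end;
        while (more) {
            long pos = in.getPosition();
            String record = in.next();
            // Stop when the file is exhausted, or when we are past the end of
            // the split and have crossed a boundary.
            if (record == null || (pos >= end && in.syncSeen())) {
                more = false;
            } else {
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TypedFile file = new NonSplittableFile("a", "b", "c");
        // The start offset is ignored because sync() rewinds to 0.
        System.out.println(readAll(file, 2, 3));
    }
}
```

A splittable implementation would differ only in sync and syncSeen, so the
loop in readAll would not need a second copy.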
> support for reading binary data from flat files
> -----------------------------------------------
>
> Key: HADOOP-4065
> URL: https://issues.apache.org/jira/browse/HADOOP-4065
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Reporter: Joydeep Sen Sarma
>
> like textinputformat - looking for a concrete implementation to read binary
> records from a flat file (that may be compressed).
> it's assumed that hadoop can't split such a file. so the inputformat can set
> splittable to false.
> tricky aspects are:
> - how to know what class the file contains (has to be in a configuration
> somewhere).
> - how to determine EOF (would be nice if hadoop can determine EOF and not
> have the deserializer throw an exception (which is hard to distinguish from
> a exception due to corruptions?)). this is easy for non-compressed streams -
> for compressed streams - DecompressorStream has a useful looking
> getAvailable() call - except the class is marked package private.
