Hi, I have implemented a subclass of RecordReader to handle a plain
text file format where a record is multi-line and of variable length.
Schematically, each record has the form:
some_title
foo
bar
$$$$
another_title
foo
foo
bar
$$$$
where $$$$ is the marker for the end of the record. My code is at http://blog.rguha.net/?p=293
and it seems to work fine on my input data.
However, I realized that when I run the program, Hadoop will split
the input file into chunks. As a result, my SDFRecordReader might
receive a chunk of input text whose last record is incomplete (its
closing $$$$ is missing). Is this correct?
If so, how should the RecordReader implementation recover from this
situation? Or is there a way to tell Hadoop to split the input file
only at end-of-record delimiters?
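
To make the boundary problem concrete, here is a minimal sketch in plain
Java (no Hadoop classes). It mimics the convention used by Hadoop's
line-oriented LineRecordReader, adapted to $$$$-terminated records: a reader
whose chunk does not begin at offset 0 first skips past the first delimiter
line it sees, and every reader keeps reading past the end of its chunk to
finish the last record it started. The class name SplitBoundaryDemo and the
method readSplit are hypothetical, and the sketch assumes the whole file is
an in-memory String with '\n' line endings, so backing up to the start of a
line is trivial. A real RecordReader would do the same with stream seeks.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: simulates split-aware reading on an in-memory string.
class SplitBoundaryDemo {
    static final String DELIM = "$$$$";

    // Offset just past the end of the line that begins at pos.
    static int lineEnd(String data, int pos) {
        int nl = data.indexOf('\n', pos);
        return nl < 0 ? data.length() : nl + 1;
    }

    // Returns the records owned by the chunk [start, end): those whose first
    // character lies in (start, end], plus the record at offset 0 for the
    // first chunk. Assumes 0 <= start <= end <= data.length().
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Back up to the beginning of the line containing 'start', so a
            // chunk that begins inside a "$$$$" line still sees the delimiter.
            pos = data.lastIndexOf('\n', start - 1) + 1;
            // Skip through the first delimiter line; whatever precedes it
            // belongs to a record that an earlier chunk is responsible for.
            while (pos < data.length()) {
                int next = lineEnd(data, pos);
                boolean isDelim = data.substring(pos, next).trim().equals(DELIM);
                pos = next;
                if (isDelim) break;
            }
        }
        // Emit each record that starts at or before 'end', reading beyond
        // 'end' when needed to reach its closing delimiter.
        while (pos < data.length() && pos <= end) {
            int recStart = pos;
            boolean closed = false;
            while (pos < data.length() && !closed) {
                int next = lineEnd(data, pos);
                closed = data.substring(pos, next).trim().equals(DELIM);
                pos = next;
            }
            // Trailing text with no closing delimiter is dropped in this sketch.
            if (closed) records.add(data.substring(recStart, pos));
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "some_title\nfoo\nbar\n$$$$\n"
                    + "another_title\nfoo\nfoo\nbar\n$$$$\n";
        int chunk = 20; // arbitrary chunk size that cuts through a record
        for (int s = 0; s < data.length(); s += chunk) {
            int e = Math.min(s + chunk, data.length());
            System.out.println("chunk [" + s + "," + e + ") -> "
                    + readSplit(data, s, e).size() + " record(s)");
        }
    }
}
```

With this rule the chunks can be cut at arbitrary byte offsets and each
record is still read exactly once, so the file would not need to be split
on record boundaries.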
Thanks
-------------------------------------------------------------------
Rajarshi Guha <rg...@indiana.edu>
-------------------------------------------------------------------