Hi, I have implemented a subclass of RecordReader (SDFRecordReader) to handle a plain text file format in which each record spans multiple lines and has variable length. Schematically, a file with two records looks like this:

some_title
foo
bar
$$$$
another_title
foo
foo
bar
$$$$

where a line containing $$$$ marks the end of a record. My code is at http://blog.rguha.net/?p=293 and it seems to work fine on my input data.

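However, I realized that when I run the program, Hadoop will 'chunk' (split) the input file without knowing anything about my record format. As a result, my SDFRecordReader might get a chunk of input text whose last record is incomplete (its closing $$$$ is missing), and, presumably, whose first lines are the tail of a record that began in the previous chunk. Is this correct?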
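To make the concern concrete, here is a minimal sketch in plain Java (no Hadoop involved). The class name NaiveChunkDemo, the helper methods, and the chunk size of 30 characters are all made up for illustration; it only mimics cutting a file at fixed offsets and is not how Hadoop itself computes splits.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: cuts text at fixed offsets, ignoring record boundaries.
final class NaiveChunkDemo {
    // End-of-record marker line from the file format described above.
    static final String DELIM = "$$$$\n";

    // Cut the text into pieces of at most `size` characters.
    static List<String> chunk(String text, int size) {
        List<String> chunks = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            chunks.add(text.substring(i, Math.min(text.length(), i + size)));
        }
        return chunks;
    }

    // True only if the chunk stops right after an end-of-record marker.
    static boolean endsOnRecordBoundary(String chunk) {
        return chunk.endsWith(DELIM);
    }

    public static void main(String[] args) {
        String data = "some_title\nfoo\nbar\n$$$$\n"
                    + "another_title\nfoo\nfoo\nbar\n$$$$\n";
        for (String c : chunk(data, 30)) {
            System.out.println(endsOnRecordBoundary(c) + " <- " + c.replace("\n", "\\n"));
        }
    }
}
```

With these made-up numbers, the first chunk is cut in the middle of the second record's title line, which is exactly the situation I am worried about.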

If so, how should the RecordReader implementation recover from this situation? Or is there a way to tell Hadoop to split the input file only at end-of-record delimiters?
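One approach I could imagine, sketched below, borrows the convention that (as far as I understand) Hadoop's line-oriented reader uses for lines: a reader whose split does not start at offset 0 skips forward to the first record that begins inside its split, and every reader finishes the last record it started even if that means reading past the end of its split. This is only a sketch on an in-memory String; SplitAwareSketch and recordsForSplit are hypothetical names, not Hadoop API, and it assumes $$$$ never appears anywhere except as the delimiter line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: each record belongs to the split in which it STARTS.
final class SplitAwareSketch {
    // Assumption: "$$$$" occurs only as the end-of-record line.
    static final String DELIM = "$$$$\n";

    // Return the records owned by the split [start, end) of `text`.
    static List<String> recordsForSplit(String text, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = 0;
        if (start != 0) {
            // Look back by one delimiter length so that a record beginning
            // exactly at `start` is still found; otherwise skip the partial
            // record that the previous split is responsible for.
            int idx = text.indexOf(DELIM, Math.max(0, start - DELIM.length()));
            if (idx < 0) {
                return records; // no record begins at or after `start`
            }
            pos = idx + DELIM.length();
        }
        // Emit every record that begins before `end`, reading past `end`
        // if needed to reach its closing delimiter.
        while (pos < end && pos < text.length()) {
            int idx = text.indexOf(DELIM, pos);
            if (idx < 0) {
                break; // trailing text without a closing delimiter is ignored
            }
            records.add(text.substring(pos, idx));
            pos = idx + DELIM.length();
        }
        return records;
    }

    public static void main(String[] args) {
        String data = "some_title\nfoo\nbar\n$$$$\n"
                    + "another_title\nfoo\nfoo\nbar\n$$$$\n";
        int splitSize = 30; // made-up split size for the demo
        for (int s = 0; s < data.length(); s += splitSize) {
            int e = Math.min(data.length(), s + splitSize);
            System.out.println("[" + s + "," + e + ") -> "
                    + recordsForSplit(data, s, e).size() + " record(s)");
        }
    }
}
```

In a real RecordReader the same idea would presumably have to work on a seekable stream with byte offsets rather than on a String, which is the part I am unsure how to do properly.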

Thanks

-------------------------------------------------------------------
Rajarshi Guha  <rg...@indiana.edu>
GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
-------------------------------------------------------------------
Q:  What's polite and works for the phone company?
A:  A deferential operator.