multi-line records and file splits

Rajarshi Guha Tue, 05 May 2009 14:38:04 -0700

Hi, I have implemented a subclass of RecordReader to handle a plaintext file format where a record is multi-line and of variable length. Schematically, each record is of the form


some_title
foo
bar
$$$$
another_title
foo
foo
bar
$$$$

where $$$$ is the marker for the end of the record. My code is at http://blog.rguha.net/?p=293 and it seems to work fine on my input data.
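
To make the format concrete, here is a minimal sketch in plain Java (no Hadoop dependencies) of how such records could be pulled out of a block of text. It is not the code from the link above; the class name SdfRecordParser and its parse method are made up for illustration, and it assumes the delimiter always sits on a line of its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
class SdfRecordParser {
    static final String DELIMITER = "$$$$";

    // Returns one list of lines per record. The $$$$ line itself is not
    // included. Trailing lines with no closing $$$$ are treated as an
    // incomplete record and dropped.
    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String line : text.split("\n", -1)) {
            if (line.trim().equals(DELIMITER)) {
                records.add(current);
                current = new ArrayList<>();
            } else {
                current.add(line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String text = "some_title\nfoo\nbar\n$$$$\n"
                    + "another_title\nfoo\nfoo\nbar\n$$$$\n";
        for (List<String> record : parse(text)) {
            // The first line of each record is its title.
            System.out.println(record.get(0) + " (" + record.size() + " lines)");
        }
    }
}
```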

However, I realized that when I run the program, Hadoop will 'chunk' the input file. As a result, the SDFRecordReader might get a chunk of input text, such that the last record is actually incomplete (a missing $$$$). Is this correct?

If so, how would the RecordReader implementation recover from this situation? Or is there a way to indicate to Hadoop that the input file should be chunked keeping in mind end of record delimiters?
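
For what it is worth, one convention that would avoid both lost and duplicated records mirrors how Hadoop's own LineRecordReader treats lines at split boundaries: a record belongs to the split in which its first byte lies, a reader whose split does not start at offset 0 skips forward to just past the first $$$$ line, and every reader is allowed to read past the end of its split to finish its last record. The sketch below simulates this on an in-memory string rather than a real Hadoop input stream. The class SplitAwareReader and its methods are hypothetical, and it assumes single-byte characters so that string indices can stand in for byte offsets.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of split handling, for illustration only.
class SplitAwareReader {
    static final String DELIMITER = "$$$$";

    // Returns the records owned by the split [start, end) of text.
    // Each returned record includes its closing $$$$ line.
    static List<String> readSplit(String text, int start, int end) {
        int pos = 0;
        if (start > 0) {
            // Back up to the beginning of the line holding offset start - 1,
            // so a $$$$ line straddling the boundary is still recognised.
            pos = text.lastIndexOf('\n', start - 2) + 1;
            // The first record owned by this split begins after that delimiter.
            pos = skipPastDelimiter(text, pos);
        }
        List<String> records = new ArrayList<>();
        // A record is read only if it starts inside the split, but it may
        // extend beyond 'end' until its own $$$$ line is reached.
        while (pos >= 0 && pos < end && pos < text.length()) {
            int next = skipPastDelimiter(text, pos);
            if (next < 0) {
                break; // no closing $$$$ before end of file: incomplete record
            }
            records.add(text.substring(pos, next));
            pos = next;
        }
        return records;
    }

    // Returns the offset just past the first $$$$ line found at or after
    // lineStart (which must be the start of a line), or -1 if there is none.
    static int skipPastDelimiter(String text, int lineStart) {
        int pos = lineStart;
        while (pos < text.length()) {
            int newline = text.indexOf('\n', pos);
            int lineEnd = (newline < 0) ? text.length() : newline;
            int next = (newline < 0) ? text.length() : newline + 1;
            if (text.substring(pos, lineEnd).equals(DELIMITER)) {
                return next;
            }
            pos = next;
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "some_title\nfoo\nbar\n$$$$\n"
                    + "another_title\nfoo\nfoo\nbar\n$$$$\n";
        int boundary = 13; // an arbitrary split point inside the first record
        System.out.println(readSplit(text, 0, boundary).size() + " record(s) in first split");
        System.out.println(readSplit(text, boundary, text.length()).size() + " record(s) in second split");
    }
}
```

With this rule, every record is read by exactly one split no matter where the boundary falls. The other route would be to subclass the input format and override FileInputFormat's isSplitable to return false, so that each file goes to a single reader, at the cost of parallelism within a file.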


Thanks

-------------------------------------------------------------------
Rajarshi Guha  <rg...@indiana.edu>
GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
-------------------------------------------------------------------
Q:  What's polite and works for the phone company?
A:  A deferential operator.
