Reading large HDFS files record by record

Yogi Devendra Thu, 28 Apr 2016 04:00:00 -0700

Hi,

My use case involves reading from HDFS and emitting each record as a
separate tuple. A record can be either a fixed-length record or a
separator-based record (for example, newline-delimited). The expected
output is a byte[] for each record.


I am planning to solve this as follows:
- A new operator which extends BlockReader.
- It will have a configuration option to select the mode: FIXED_LENGTH or
SEPARATOR_BASED.
- It will use the appropriate ReaderContext based on the mode.
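
A minimal sketch of the two modes in plain Java, to illustrate the intended
output (one byte[] per record). This is not the Malhar API: the class
RecordSplitSketch, the Mode enum and the split() method are hypothetical
names used only for illustration. The sketch works on a single in-memory
buffer and does not handle records that straddle block boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RecordSplitSketch {

    // Hypothetical mode switch, mirroring the proposed configuration option.
    enum Mode { FIXED_LENGTH, SEPARATOR_BASED }

    // Splits one in-memory buffer into records, one byte[] per record.
    // recordLength is used only in FIXED_LENGTH mode,
    // separator only in SEPARATOR_BASED mode.
    static List<byte[]> split(byte[] data, Mode mode, int recordLength, byte separator) {
        List<byte[]> records = new ArrayList<>();
        if (mode == Mode.FIXED_LENGTH) {
            if (recordLength <= 0) {
                throw new IllegalArgumentException("recordLength must be positive");
            }
            for (int start = 0; start < data.length; start += recordLength) {
                // The last record may be shorter if the data is not an exact multiple.
                int end = Math.min(start + recordLength, data.length);
                records.add(Arrays.copyOfRange(data, start, end));
            }
        } else {
            int start = 0;
            for (int i = 0; i < data.length; i++) {
                if (data[i] == separator) {
                    // Emit the record without its trailing separator.
                    records.add(Arrays.copyOfRange(data, start, i));
                    start = i + 1;
                }
            }
            // Emit a final record that has no trailing separator.
            if (start < data.length) {
                records.add(Arrays.copyOfRange(data, start, data.length));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "ab\ncd\nef".getBytes(StandardCharsets.US_ASCII);
        for (byte[] record : split(data, Mode.SEPARATOR_BASED, 0, (byte) '\n')) {
            System.out.println(new String(record, StandardCharsets.US_ASCII));
        }
        for (byte[] record : split(data, Mode.FIXED_LENGTH, 3, (byte) 0)) {
            System.out.println(record.length);
        }
    }
}
```

In the proposed operator this logic would not live in one method. The
selected mode would instead decide which ReaderContext is used.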

The reason for having a separate operator, rather than reusing BlockReader
directly, is that its output port signature differs from BlockReader's.
This new operator can be used in conjunction with FileSplitter.

Any feedback?

~ Yogi
