devendra tagare created APEXMALHAR-2032:
-------------------------------------------

             Summary: MapReduce Input format support for File Splitter
                 Key: APEXMALHAR-2032
                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2032
             Project: Apache Apex Malhar
          Issue Type: New Feature
            Reporter: devendra tagare


This feature extends the FileSplitter to work with 
org.apache.hadoop.mapreduce.InputSplit and the BlockReader to work with 
org.apache.hadoop.mapreduce.RecordReader.
Some more details and the rationale for the approach:
An InputFormat lets MR create InputSplits, i.e. individual chunks of bytes.
Whether these splits can be created correctly is determined by the input format 
itself, e.g. SequenceFile or Avro.
Internally these formats are organized as a sequence of blocks. Each block can 
be compressed with a compression codec, and it does not matter whether this 
codec is itself splittable.
When one of them is set as the input format, the MR framework creates input 
splits based on the block boundaries given by the metadata packed with the file.
Each InputFormat has a specific block definition. For Avro, for example, the 
block definition is as follows:
An Avro file data block consists of:
- A long indicating the count of objects in this block.
- A long indicating the size in bytes of the serialized objects in the current 
  block, after any codec is applied.
- The serialized objects. If a codec is specified, this is compressed by that 
  codec.
- The file's 16-byte sync marker.
Thus, each block's binary data can be efficiently extracted or skipped without 
deserializing the contents. The combination of block size, object counts, and 
sync markers enables detection of corrupt blocks and helps ensure data integrity.
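
The block layout above can be illustrated with a small sketch. It is written in 
Python only to stay self-contained and is not Malhar or Avro library code: the 
names write_block and scan_blocks are hypothetical, and the Avro file header 
that precedes the first block is omitted. Longs use Avro's zigzag 
variable-length encoding. The scan walks from block to block using only the two 
longs and the sync marker, never touching the serialized objects.

```python
import io

SYNC_SIZE = 16  # Avro sync markers are 16 bytes long


def encode_long(n: int) -> bytes:
    """Encode a signed 64-bit integer as an Avro long (zigzag, then varint)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(stream) -> int:
    """Decode one Avro long from a binary stream."""
    shift = 0
    z = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("truncated long")
        b = raw[0]
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1)  # undo the zigzag mapping


def write_block(out, count: int, payload: bytes, sync: bytes) -> None:
    """Write one data block: object count, byte size, payload, sync marker."""
    out.write(encode_long(count))
    out.write(encode_long(len(payload)))
    out.write(payload)
    out.write(sync)


def scan_blocks(stream, sync: bytes):
    """Yield (offset, count, size) per block, skipping over each payload."""
    while True:
        offset = stream.tell()
        if not stream.read(1):
            return  # clean end of data
        stream.seek(offset)
        count = decode_long(stream)
        size = decode_long(stream)
        stream.seek(size, io.SEEK_CUR)  # skip the payload without decoding it
        if stream.read(SYNC_SIZE) != sync:
            raise ValueError(f"corrupt block at offset {offset}")
        yield offset, count, size
```

A mismatch between the bytes found after the payload and the expected sync 
marker is what flags a corrupt block in this sketch.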
Each map task gets an entire block to read. A RecordReader is used to read the 
individual records in the block and generate key/value pairs.
The records could be fixed length or use a schema, as in the case of Parquet or 
Avro.
We can extend the BlockReader to work with a RecordReader, relying on the sync 
markers to correctly identify and parse the individual records.
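
As a rough illustration of reading by sync marker, the sketch below shows how a 
reader handed an arbitrary byte range might align itself to block boundaries. 
It is not the BlockReader or Hadoop API: build_file and read_split are 
hypothetical names, and the container is simplified to a leading sync marker 
followed by blocks of a 4-byte length, a payload, and a sync marker. The 
ownership rule assumed here is that a block belongs to the range containing the 
first byte of the sync marker that precedes it, so adjacent ranges never read 
the same block twice.

```python
import struct

SYNC_SIZE = 16  # length of the sync marker in this simplified container


def build_file(blocks, sync: bytes) -> bytes:
    """Build a toy container: leading sync marker, then length/payload/sync."""
    out = bytearray(sync)  # stands in for the marker that ends a file header
    for payload in blocks:
        out += struct.pack(">I", len(payload)) + payload + sync
    return bytes(out)


def read_split(data: bytes, start: int, end: int, sync: bytes):
    """Return the payloads of the blocks owned by the byte range [start, end)."""
    # Align to the first sync marker at or after the start of the range.
    marker = data.find(sync, start)
    payloads = []
    while marker != -1 and marker < end:
        pos = marker + SYNC_SIZE
        if pos >= len(data):
            break  # trailing marker, no block follows it
        (size,) = struct.unpack_from(">I", data, pos)
        body = pos + 4
        payloads.append(data[body:body + size])
        marker = body + size  # the next marker must start right here
        if data[marker:marker + SYNC_SIZE] != sync:
            raise ValueError(f"missing sync marker at offset {marker}")
    return payloads
```

The search assumes the marker bytes do not occur inside a payload; real formats 
make that collision improbable by using randomly generated markers.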
Thanks,
Dev




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
