+1 for the idea

On Mar 24, 2016 8:41 PM, "Thomas Weise" <[email protected]> wrote:
> +1 for the idea in general and extending the existing implementation.
>
> In case this introduces a MapReduce dependency we will also need to
> consider a separate module.
>
> Thomas
>
>
> On Thu, Mar 24, 2016 at 2:35 AM, Devendra Tagare <[email protected]>
> wrote:
>
> > Hi,
> >
> > We are thinking of extending the FileSplitter and BlockReader.
> > Changing the existing code could have side effects.
> >
> > Thanks,
> > Dev
> >
> > On Mar 24, 2016 1:16 AM, "Tushar Gosavi" <[email protected]> wrote:
> >
> > > My suggestion is to extend from FileSplitter and BlockReader without
> > > changing them, and add support for InputFormat in derived classes.
> > > FileSplitter and BlockReader already provide enough hooks to define
> > > splits and read records.
> > >
> > > - Tushar.
> > >
> > > On Thu, Mar 24, 2016 at 11:17 AM, Yogi Devendra
> > > <[email protected]> wrote:
> > >
> > > > Aligning FileSplitter, BlockReader with their respective
> > > > counterparts from mapreduce will be an excellent value addition.
> > > >
> > > > IMO, it has 2 advantages:
> > > >
> > > > 1. It will allow us to plug in more formats for the
> > > > FileSplitter+BlockReader pattern use cases.
> > > > 2. It will be easy for end users coming from a mapreduce
> > > > background if they get something equivalent in Apex.
> > > >
> > > > One question:
> > > > Are you planning to refactor the existing FileSplitter and
> > > > BlockReader, or is the plan to have this implementation as fresh
> > > > classes? If these are fresh classes, are we saying that they will
> > > > eventually deprecate the existing FileSplitter and BlockReader?
> > > >
> > > > We have a few other components dependent on the existing
> > > > FileSplitter and BlockReader. Hence, we would like to know the
> > > > future direction for these classes.
> > > >
> > > > ~ Yogi
> > > >
> > > > On 24 March 2016 at 10:47, Priyanka Gugale <[email protected]>
> > > > wrote:
> > > >
> > > > > So as I understand, the splitter would be format-aware. In that
> > > > > case, would we still need the different kinds of parsers we have
> > > > > right now? Or will the format-aware splitter take care of parsing
> > > > > different file formats, e.g. csv etc.?
> > > > >
> > > > > -Priyanka
> > > > >
> > > > > On Wed, Mar 23, 2016 at 11:41 PM, Devendra Tagare
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > Initiating this thread to get the community's opinion on
> > > > > > aligning the FileSplitter with InputSplit & the BlockReader
> > > > > > with the RecordReader from
> > > > > > org.apache.hadoop.mapreduce.InputSplit &
> > > > > > org.apache.hadoop.mapreduce.RecordReader respectively.
> > > > > >
> > > > > > Some more details and rationale on the approach:
> > > > > >
> > > > > > InputFormat lets MR create input splits, i.e. individual
> > > > > > chunks of bytes. The ability to correctly create these splits
> > > > > > is determined by the InputFormat itself, e.g. the SequenceFile
> > > > > > format or Avro.
> > > > > >
> > > > > > Internally these formats are organized as a sequence of
> > > > > > blocks. Each block can be compressed with a compression codec
> > > > > > & it does not matter if this codec in itself is splittable.
> > > > > > When they are set as an input format, the MR framework creates
> > > > > > input splits based on the block boundaries given by the
> > > > > > metadata object packed with the file.
> > > > > >
> > > > > > Each InputFormat has a specific block definition. E.g. for
> > > > > > Avro, a file data block consists of:
> > > > > >
> > > > > > 1. A long indicating the count of objects in this block.
> > > > > > 2. A long indicating the size in bytes of the serialized
> > > > > > objects in the current block, after any codec is applied.
> > > > > > 3. The serialized objects. If a codec is specified, this is
> > > > > > compressed by that codec.
> > > > > > 4. The file's 16-byte sync marker.
> > > > > >
> > > > > > Thus, each block's binary data can be efficiently extracted or
> > > > > > skipped without deserializing the contents. The combination of
> > > > > > block size, object counts, and sync markers enables detection
> > > > > > of corrupt blocks and helps ensure data integrity.
> > > > > >
> > > > > > Each map task gets an entire block to read. A RecordReader is
> > > > > > used to read the individual records of the block and generates
> > > > > > key/value pairs. The records could be fixed-length or use a
> > > > > > schema, as in the case of Parquet or Avro.
> > > > > >
> > > > > > We can extend the BlockReader to work with a RecordReader
> > > > > > based on the sync markers to correctly identify & parse the
> > > > > > individual records.
> > > > > >
> > > > > > Please send across your thoughts on the same.
> > > > > >
> > > > > > Thanks,
> > > > > > Dev
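
To make the split-creation side of the proposal concrete, here is a
minimal sketch of delegating split computation to a Hadoop InputFormat.
Only the org.apache.hadoop.mapreduce types and calls are real Hadoop
API; the InputFormatSplitter class itself is a hypothetical
illustration, not the existing FileSplitter.

    import java.io.IOException;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Hypothetical splitter that asks a Hadoop InputFormat for splits
    // instead of cutting the file at fixed byte offsets.
    public class InputFormatSplitter
    {
      /**
       * Computes splits for the given file via the same getSplits()
       * hook the MR framework uses. For block-organized formats
       * (SequenceFile, Avro) the returned splits fall on block
       * boundaries, so downstream readers never start mid-record.
       */
      public List<InputSplit> computeSplits(InputFormat<?, ?> inputFormat,
          Configuration conf, Path input)
          throws IOException, InterruptedException
      {
        Job job = Job.getInstance(conf);
        FileInputFormat.addInputPath(job, input);
        return inputFormat.getSplits(job);
      }
    }

A splitter along these lines would make the format pluggable: swapping
in AvroKeyInputFormat or SequenceFileInputFormat changes how split
boundaries are chosen without touching the splitter code.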
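On the reading side, a sketch of what a RecordReader-driven block
reader might look like, assuming each reader instance is handed one
InputSplit. The RecordReaderBlockReader class and its emit() hook are
hypothetical stand-ins (an Apex operator would write to an output port
there); only the mapreduce types and calls are from Hadoop.

    import java.io.IOException;

    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Hypothetical reader that delegates record parsing to the
    // InputFormat's RecordReader for one split.
    public abstract class RecordReaderBlockReader<K, V>
    {
      /**
       * Reads every record of the given split. The format's own
       * RecordReader handles seeking to the next sync marker and any
       * decompression, so records are parsed correctly even inside
       * compressed blocks.
       */
      public void readBlock(InputFormat<K, V> inputFormat, InputSplit split,
          TaskAttemptContext context) throws IOException, InterruptedException
      {
        RecordReader<K, V> reader =
            inputFormat.createRecordReader(split, context);
        try {
          reader.initialize(split, context);
          while (reader.nextKeyValue()) {
            emit(reader.getCurrentKey(), reader.getCurrentValue());
          }
        } finally {
          reader.close();
        }
      }

      /** Hypothetical hook: emit one record downstream. */
      protected abstract void emit(K key, V value);
    }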
