We need partitions for parallel reads, but how would a reader partition know which offset of the file it should read from? Normally FileSplitter creates this metadata (let's call each piece a reader task) and forwards it to the next operator, the block reader. Each block reader receives one of the tasks and reads from the specified offset in the file. If FileSplitter is absent, one reader partition has to consume a file entirely, which means we can't read a single file in parallel. I hope this answers your question.
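To make the reader-task idea concrete, here is a minimal sketch in plain Java. It does not use the Malhar API: ReaderTaskSketch, ReaderTask, split and read are hypothetical names chosen for illustration, and the fixed block size is an assumption. It only shows that each task carries an offset and a length, so independent readers can each fetch their own part of one file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, not the Malhar operators themselves.
class ReaderTaskSketch {

    /** Metadata for one reader task: which file, where to start, how many bytes. */
    static final class ReaderTask {
        final String path;
        final long offset;
        final long length;

        ReaderTask(String path, long offset, long length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }

        @Override
        public String toString() {
            return path + "@" + offset + "+" + length;
        }
    }

    /** Splitter role: cut a file of fileLength bytes into tasks of at most blockSize bytes. */
    static List<ReaderTask> split(String path, long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<ReaderTask> tasks = new ArrayList<>();
        for (long offset = 0; offset < fileLength; offset += blockSize) {
            long length = Math.min(blockSize, fileLength - offset);
            tasks.add(new ReaderTask(path, offset, length));
        }
        return tasks;
    }

    /** Reader role: read exactly the byte range named by one task. */
    static byte[] read(ReaderTask task) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(task.path, "r")) {
            byte[] buffer = new byte[(int) task.length];
            file.seek(task.offset);
            file.readFully(buffer);
            return buffer;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("reader-task-sketch", ".txt");
        try {
            Files.write(tmp, "0123456789".getBytes(StandardCharsets.US_ASCII));
            // Each task could be handed to a different reader partition.
            for (ReaderTask task : split(tmp.toString(), Files.size(tmp), 4)) {
                String part = new String(read(task), StandardCharsets.US_ASCII);
                System.out.println(task + " -> " + part);
            }
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Without the split step there is nothing telling a second reader where to start, which is why a single partition would end up consuming the whole file.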
The advantage of this module is that it gives us a reusable component made up of operators which are frequently used together to do file reading.

-Priyanka

On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra <[email protected]> wrote:

> Let me rephrase Ram's question to make it clear:
>
> For an application developer using Malhar:
> What are the advantages / disadvantages of using the proposed HDFS File
> input Module as compared to directly using FileSplitter, BlockReader
> Operators available in Malhar?
>
> ~ Yogi
>
> On 16 February 2016 at 21:56, Munagala Ramanath <[email protected]>
> wrote:
>
> > Can parallel read not be achieved by partitioning ?
> >
> > Ram
> >
> > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale <[email protected]>
> > wrote:
> >
> > > Hi,
> > >
> > > It is a common usecase to read big files on HDFS in parallel fashion i.e.
> > > many reader thread are used to read the file in parallel. We can achieve
> > > this on top of Apex using following Malhar operators:
> > >
> > > 1. AbstractFileSplitter
> > > 2. AbstractBlockReader
> > >
> > > where FileSplitter, as per file metadata, creates small reader tasks(to
> > > read file in parts). Those reader tasks are run by BlockReaders in parallel
> > > to read the file.
> > >
> > > As these operators are generally used together to achieve file read
> > > operation, I propose we create a module, called HDFSFileReader for this.
> > >
> > > Please provide your suggestions on same.
> > >
> > > -Priyanka
