Priyanka,

Can you please share details on which output ports this module would expose?
I am thinking of an HDFS File Copy Module which can be used in conjunction
with this module to copy files from HDFS to HDFS.

~ Yogi

On 18 February 2016 at 10:29, Mohit Jotwani <[email protected]> wrote:

> +1 to add this.
>
> Regards,
> Mohit
> On 17 Feb 2016 23:30, "Pramod Immaneni" <[email protected]> wrote:
>
> > +1 to add this module
> >
> > On Wed, Feb 17, 2016 at 9:21 AM, Priyanka Gugale <[email protected]>
> > wrote:
> >
> > > We need partitions for parallel reads, but how will a reader
> > > partition know which offset of the file it should read from?
> > > Normally FileSplitter creates this metadata (let's call them reader
> > > tasks) and forwards them to the next operator, which is the block
> > > reader. The block reader receives one of the tasks and reads from
> > > the specified offset in the file. If FileSplitter is absent, one
> > > reader partition has to consume one file entirely, which means we
> > > can't have parallel reading over one file. I hope this answers your
> > > question.
> > >
> > > The advantage of having this module is a reusable component made up
> > > of operators which are frequently used together to do file reading.
> > >
> > > -Priyanka
> > >
> > > On Wed, Feb 17, 2016 at 11:31 AM, Yogi Devendra <[email protected]>
> > > wrote:
> > >
> > > > Let me rephrase Ram's question to make it clear:
> > > >
> > > > For an application developer using Malhar:
> > > > What are the advantages / disadvantages of using the proposed
> > > > HDFS File Input Module as compared to directly using the
> > > > FileSplitter and BlockReader operators available in Malhar?
> > > >
> > > > ~ Yogi
> > > >
> > > > On 16 February 2016 at 21:56, Munagala Ramanath <[email protected]>
> > > > wrote:
> > > >
> > > > > Can parallel read not be achieved by partitioning?
> > > > >
> > > > > Ram
> > > > >
> > > > > On Tue, Feb 16, 2016 at 1:01 AM, Priyanka Gugale
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > It is a common use case to read big files on HDFS in parallel,
> > > > > > i.e. many reader threads are used to read the file in
> > > > > > parallel. We can achieve this on top of Apex using the
> > > > > > following Malhar operators:
> > > > > >
> > > > > > 1. AbstractFileSplitter
> > > > > > 2. AbstractBlockReader
> > > > > >
> > > > > > where FileSplitter, as per file metadata, creates small reader
> > > > > > tasks (to read the file in parts). Those reader tasks are run
> > > > > > by BlockReaders in parallel to read the file.
> > > > > >
> > > > > > As these operators are generally used together to achieve the
> > > > > > file read operation, I propose we create a module, called
> > > > > > HDFSFileReader, for this.
> > > > > >
> > > > > > Please provide your suggestions on the same.
> > > > > >
> > > > > > -Priyanka
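
For readers following the splitter / block reader discussion above, here is a minimal sketch of the idea in Python. It is not the Malhar API: `ReaderTask`, `split_file`, `read_block`, `parallel_read` and `demo` are hypothetical names made up for illustration, and it reads a local temp file rather than HDFS. It only shows why per-block metadata (offset and length) is what lets several readers work on one file at once.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class ReaderTask:
    """Hypothetical stand-in for the block metadata a file splitter emits."""
    path: str
    offset: int
    length: int


def split_file(path, block_size):
    """Splitter role: turn one file into a list of offset/length reader tasks."""
    size = os.path.getsize(path)
    return [
        ReaderTask(path, offset, min(block_size, size - offset))
        for offset in range(0, size, block_size)
    ]


def read_block(task):
    """Block reader role: read only the byte range named by the task."""
    with open(task.path, "rb") as f:
        f.seek(task.offset)
        return f.read(task.length)


def parallel_read(path, block_size, readers):
    """Hand the tasks to a pool of readers and reassemble the blocks.

    Executor.map() yields results in task order, so joining them
    reproduces the original byte sequence.
    """
    tasks = split_file(path, block_size)
    with ThreadPoolExecutor(max_workers=readers) as pool:
        return b"".join(pool.map(read_block, tasks))


def demo():
    """Write a small temp file, then return (original, reread, tasks)."""
    data = bytes(range(256)) * 40  # 10,240 bytes of sample content
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
    try:
        tasks = split_file(tmp.name, 1024)
        reread = parallel_read(tmp.name, block_size=1024, readers=4)
        return data, reread, tasks
    finally:
        os.remove(tmp.name)
```

Without the `split_file` step, each reader would have no offset to start from and would have to consume a whole file by itself, which is the limitation Priyanka describes for partitioning alone.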
