Hi Apex Dev community,

Kindly provide feedback, if any, on the following approach for implementing S3RecordReader.

*S3RecordReader (delimited records)*

*Input:* BlockMetaData containing offset and length
*Expected Output:* Records in the block
*Approach:* Similar to the approach currently followed in FSRecordReader.

1) Fetch the block from S3. The S3 block fetch size should ideally be large, say 64 MB, to avoid unnecessary network delays.
2) Search for newline characters in the block and emit each record.
3) The last record in the current block might overflow into the subsequent block. To handle this, we will fetch a small part of the subsequent block, say 1 MB, search it for a newline character, and emit the record if one is found. We will keep fetching additional 1 MB chunks until a newline character is found.
4) We will also skip the first record in every block except the first, since those bytes are part of the last record of the previous block.

Regards,
Ajay

On Wed, Oct 19, 2016 at 7:31 AM, Ajay Gupta (JIRA) <[email protected]> wrote:

> [ https://issues.apache.org/jira/browse/APEXMALHAR-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Ajay Gupta reassigned APEXMALHAR-2303:
> --------------------------------------
>
>     Assignee: Ajay Gupta
>
> > S3 Line By Line Module
> > ----------------------
> >
> >                 Key: APEXMALHAR-2303
> >                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2303
> >             Project: Apache Apex Malhar
> >          Issue Type: Bug
> >            Reporter: Ajay Gupta
> >            Assignee: Ajay Gupta
> >   Original Estimate: 336h
> >  Remaining Estimate: 336h
> >
> > This is a new module which will consist of 2 operators
> > 1) File Splitter -- Already existing in Malhar library
> > 2) S3RecordReader -- Read a file from S3 and output the records (delimited or fixed width)
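
To make steps 1) to 4) of the proposed approach easier to discuss, here is a rough, self-contained sketch of the per-block logic. It is only an illustration under stated assumptions, not the actual operator: RangeFetcher, S3RecordReaderSketch, readBlock, inMemory and overflowChunk are hypothetical names, the fetcher is an in-memory stand-in for a ranged S3 read, and records are assumed to be newline-delimited UTF-8 text. One detail the steps above do not spell out is assumed here: when a block ends exactly on a newline, that block also reads the record starting at the first byte of the next block, because the next block will skip it under step 4.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a ranged S3 read; not a Malhar or AWS SDK type.
interface RangeFetcher {
  // Returns up to `length` bytes starting at `offset`; fewer (possibly zero) at end of file.
  byte[] fetch(long offset, int length);
}

class S3RecordReaderSketch {

  // In-memory fetcher so the sketch can be run without S3.
  static RangeFetcher inMemory(byte[] file) {
    return (offset, length) -> {
      int from = (int) Math.min(offset, (long) file.length);
      int to = (int) Math.min(offset + length, (long) file.length);
      return Arrays.copyOfRange(file, from, to);
    };
  }

  // Index of the first '\n' at or after `from`, or -1 if there is none.
  static int indexOfNewline(byte[] data, int from) {
    for (int i = from; i < data.length; i++) {
      if (data[i] == '\n') {
        return i;
      }
    }
    return -1;
  }

  static List<String> readBlock(RangeFetcher s3, long blockOffset, int blockLength, int overflowChunk) {
    List<String> records = new ArrayList<>();

    // Step 1: fetch the whole block in one request.
    byte[] block = s3.fetch(blockOffset, blockLength);
    int pos = 0;

    // Step 4: in every block except the first, skip through the first newline;
    // those bytes belong to a record emitted by the previous block.
    if (blockOffset > 0) {
      int firstNewline = indexOfNewline(block, 0);
      if (firstNewline < 0) {
        return records; // the block lies entirely inside one long record
      }
      pos = firstNewline + 1;
    }

    // Step 2: emit every record that is terminated inside the block.
    int newline;
    while ((newline = indexOfNewline(block, pos)) >= 0) {
      records.add(new String(block, pos, newline - pos, StandardCharsets.UTF_8));
      pos = newline + 1;
    }

    // Step 3: finish the last record by fetching small chunks past the block
    // until a newline (or end of file) is reached.
    ByteArrayOutputStream tail = new ByteArrayOutputStream();
    tail.write(block, pos, block.length - pos);
    long next = blockOffset + block.length;
    while (true) {
      byte[] chunk = s3.fetch(next, overflowChunk);
      if (chunk.length == 0) { // end of file
        if (tail.size() > 0) {
          records.add(new String(tail.toByteArray(), StandardCharsets.UTF_8));
        }
        break;
      }
      int end = indexOfNewline(chunk, 0);
      if (end >= 0) {
        tail.write(chunk, 0, end);
        records.add(new String(tail.toByteArray(), StandardCharsets.UTF_8));
        break;
      }
      tail.write(chunk, 0, chunk.length);
      next += chunk.length;
    }
    return records;
  }
}
```

Calling readBlock once per block (for example with 8-byte blocks and 4-byte overflow chunks on a small test file) and concatenating the results should yield every record exactly once, including when a block boundary falls exactly after a newline. In the real operator the block size and overflow size would be the 64 MB and 1 MB values suggested above.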
