Hi Apex Dev community,

Please share any feedback on the following approach for
implementing S3RecordReader.

*S3RecordReader (delimited records)*
*Input:* BlockMetaData containing offset and length
*Expected output:* Records in the block
*Approach:*
Similar to the approach currently followed in FSRecordReader.
1) Fetch the block from S3. The S3 block fetch size should ideally be large
enough, say 64 MB, to avoid unnecessary network delays.
2) Search for newline characters in the block and emit each record found.
3) The last record in the current block might overflow into the subsequent
block. To handle this, we will fetch a small part of the subsequent block,
say 1 MB, and search it for a newline character, emitting the record if one
is found. We will keep fetching additional 1 MB chunks until a newline
character is found.
4) We will also skip the first record of every block (except the
first block), as these bytes are part of the last record of the previous
block.
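
To make steps 1) to 4) concrete, here is a minimal Java sketch of the idea. It is only an illustration, not the proposed operator code: the class and method names (BlockRecordSketch, fetch, readRecords) are hypothetical, and an in-memory byte array stands in for the S3 object so that no S3 client calls are needed. It also assumes one convention that is not spelled out above: a block additionally reads the record that starts exactly at its end offset, so that the unconditional skip in step 4) does not drop a record when a block boundary happens to fall on a record boundary.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class BlockRecordSketch {

  /** Stand-in for a ranged S3 read: returns up to len bytes starting at pos. */
  static byte[] fetch(byte[] object, long pos, int len) {
    if (pos >= object.length) {
      return new byte[0]; // past the end of the object
    }
    int to = (int) Math.min((long) object.length, pos + len);
    return Arrays.copyOfRange(object, (int) pos, to);
  }

  /** Index of the first '\n' in buf at or after from, or -1 if none. */
  static int indexOfNewline(byte[] buf, int from) {
    for (int i = from; i < buf.length; i++) {
      if (buf[i] == '\n') {
        return i;
      }
    }
    return -1;
  }

  /** Returns the records owned by the block [offset, offset + length). */
  static List<String> readRecords(byte[] object, long offset, int length, int overflowChunk) {
    List<String> records = new ArrayList<>();
    long blockEnd = offset + length;
    byte[] buf = fetch(object, offset, length); // step 1: fetch the whole block
    boolean skipFirst = offset != 0;            // step 4: partial first record
    int start = 0;                              // start of current record in buf

    // Assumed convention: a record starting exactly at blockEnd is read here,
    // because the next block will skip it as its "first record".
    while (offset + start <= blockEnd) {
      int nl = indexOfNewline(buf, start);      // step 2: look for the delimiter
      while (nl < 0) {
        // step 3: pull small chunks of the following data until a newline shows up
        byte[] more = fetch(object, offset + buf.length, overflowChunk);
        if (more.length == 0) {
          break; // end of object reached
        }
        int oldLen = buf.length;
        buf = Arrays.copyOf(buf, oldLen + more.length);
        System.arraycopy(more, 0, buf, oldLen, more.length);
        nl = indexOfNewline(buf, oldLen);
      }
      int end = (nl < 0) ? buf.length : nl;
      if (skipFirst) {
        skipFirst = false; // bytes belong to the previous block's last record
      } else if (nl >= 0 || end > start) {
        records.add(new String(buf, start, end - start, StandardCharsets.UTF_8));
      }
      if (nl < 0) {
        break; // nothing left to read
      }
      start = nl + 1;
    }
    return records;
  }
}
```

Calling readRecords once per block (for example with a 10-byte block size and a 4-byte overflow chunk on a small test input) should emit every record exactly once across all blocks. In the real operator the block size and overflow chunk would be the 64 MB and 1 MB values mentioned above, and fetch would be replaced by a ranged read against S3.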


Regards,
Ajay



On Wed, Oct 19, 2016 at 7:31 AM, Ajay Gupta (JIRA) <j...@apache.org> wrote:

>
>      [ https://issues.apache.org/jira/browse/APEXMALHAR-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Ajay Gupta reassigned APEXMALHAR-2303:
> --------------------------------------
>
>     Assignee: Ajay Gupta
>
> > S3 Line By Line Module
> > ----------------------
> >
> >                 Key: APEXMALHAR-2303
> >                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2303
> >             Project: Apache Apex Malhar
> >          Issue Type: Bug
> >            Reporter: Ajay Gupta
> >            Assignee: Ajay Gupta
> >   Original Estimate: 336h
> >  Remaining Estimate: 336h
> >
> > This is a new module which will consist of 2 operators
> > 1) File Splitter -- Already existing in Malhar library
> > 2) S3RecordReader -- Read a file from S3 and output the records (delimited or fixed width)
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>
