Re: [jira] [Assigned] (APEXMALHAR-2303) S3 Line By Line Module

Chaitanya Chebolu Thu, 20 Oct 2016 22:13:48 -0700

+1 for the new approach, i.e., adding the file length to FileBlockMetadata.
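
To make the proposal concrete, here is a minimal sketch of how a fileLength field could let a reader detect the end of file without any extra S3 request. FileBlockMetadataSketch is a hypothetical, simplified stand-in and not the actual Malhar class; the helper methods isLastBlock() and clampToFile() are assumptions added only for illustration.

```java
/**
 * Hypothetical, simplified stand-in for FileBlockMetadata.
 * Only the fields needed for this illustration are modelled.
 */
final class FileBlockMetadataSketch {
    private final String filePath;
    private final long offset;      // start of this block within the file
    private final long length;      // number of bytes in this block
    private final long fileLength;  // proposed new field: total size of the file

    FileBlockMetadataSketch(String filePath, long offset, long length, long fileLength) {
        this.filePath = filePath;
        this.offset = offset;
        this.length = length;
        this.fileLength = fileLength;
    }

    String getFilePath() {
        return filePath;
    }

    long getFileLength() {
        return fileLength;
    }

    /** True when this block reaches (or passes) the end of the file. */
    boolean isLastBlock() {
        return offset + length >= fileLength;
    }

    /** Clamps an exclusive range end so that a read never goes past the file. */
    long clampToFile(long endExclusive) {
        return Math.min(endExclusive, fileLength);
    }
}
```

With the length carried in the metadata, a reader can clamp every range before issuing the request instead of relying on an out-of-range error.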

On Thu, Oct 20, 2016 at 12:00 PM, Tushar Gosavi <[email protected]> wrote:


> I think this approach is cleaner than the previous two approaches you
> mentioned. Depending on an exception or a non-standard error code to
> determine EOF is not a good approach, as we might treat some other valid
> exception as EOF and not take corrective action. Also, this will avoid
> multiple requests to get the file length from each reader.
>
> - Tushar.
>
>
> On Thu, Oct 20, 2016 at 11:45 AM, AJAY GUPTA <[email protected]> wrote:
> > Hi
> >
> > Following is another approach for getting information regarding the file
> > length for S3.
> >
> > We have an existing class FileBlockMetadata which currently contains only
> > filePath. To this, we can add a fileLength field, which will then get
> > passed to the module. This approach will be a lot cleaner, and no
> > additional requests will be made to S3 in this case.
> >
> > Kindly provide your opinion on which approach would be best suited.
> >
> >
> > Regards,
> > Ajay
> >
> > On Wed, Oct 19, 2016 at 6:43 PM, AJAY GUPTA <[email protected]> wrote:
> >
> >> Hi
> >>
> >> I need suggestions from the Apex dev community on the following.
> >>
> >> For the S3RecordReader approach mentioned in the previous mail, I am
> >> facing an issue with determining the end of file.
> >> Note that the input to this operator will not contain the file size.
> >>
> >> The following approaches are possible:
> >>
> >> 1) The S3 getObject() call which fetches file data within a range will
> >> throw an AmazonS3Exception if the range provided is out of bounds. Hence,
> >> if the file size is 10 bytes and I make a getObject request for bytes 11
> >> to 15, I will get this exception:
> >> Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception:
> >> The requested range is not satisfiable (Service: Amazon S3; Status Code:
> >> 416; Error Code: InvalidRange; Request ID:
> >> If this exception gets thrown, I can catch it in the code and conclude
> >> that the end of file is reached.
> >>
> >> 2) For every container running this application, maintain a
> >> map<filename, filesize>. If the filesize already exists in this map, use
> >> it from there. If not, fetch the filesize information from S3 and add it
> >> to this map.
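
The second approach could be sketched roughly as below, assuming one cache instance per container. FileLengthCache and its fetcher argument are hypothetical names; the fetcher stands in for whatever S3 metadata request would return the object size, and is not part of any existing Malhar or AWS API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToLongFunction;

/** Hypothetical per-container cache of file sizes, keyed by file name. */
final class FileLengthCache {
    private final Map<String, Long> lengths = new ConcurrentHashMap<>();
    // Assumed lookup that asks S3 for the size of the named file.
    private final ToLongFunction<String> fetcher;

    FileLengthCache(ToLongFunction<String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the cached size, calling the fetcher only on the first request. */
    long lengthOf(String fileName) {
        return lengths.computeIfAbsent(fileName, fetcher::applyAsLong);
    }
}
```

Each container still makes one size request per file, which is the cost that passing the length in the block metadata avoids.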
> >>
> >> My own opinion is to go with the first approach since the number of calls
> >> to S3 for getting the file length will be lower.
> >> Kindly suggest any other approaches you can think of.
> >>
> >>
> >> Thanks,
> >> Ajay
> >>
> >>
> >>
> >> On Wed, Oct 19, 2016 at 11:53 AM, AJAY GUPTA <[email protected]> wrote:
> >>
> >>> Hi Apex Dev community,
> >>>
> >>> Kindly provide feedback, if any, on the following approach for
> >>> implementing S3RecordReader.
> >>>
> >>> *S3RecordReader (delimited records)*
> >>> *Input:* BlockMetaData containing offset and length
> >>> *Expected Output:* Records in the block
> >>> *Approach:*
> >>> Similar to the approach currently followed in FSRecordReader.
> >>> 1) Fetch the block from S3. The S3 block fetch size should ideally be
> >>> large enough, say 64 MB, to avoid unnecessary network delays.
> >>> 2) Search for the newline character in the block and emit the record.
> >>> 3) The last record in the current block might overflow into the
> >>> subsequent block. For this, we will get a small part of the subsequent
> >>> block, say 1 MB, and search for the newline character, emitting the
> >>> record if one is found. We will fetch additional 1 MB blocks till a
> >>> newline character is found.
> >>> 4) We will also avoid reading the first record from all blocks (except
> >>> the first block), as this set of bytes is part of the last record in the
> >>> previous block.
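
Steps 1 to 4 can be sketched as follows. This is only an illustration under stated assumptions, not the FSRecordReader or S3RecordReader code: RangeReader is a hypothetical interface standing in for a ranged S3 fetch, the file length is assumed to be known, and a record that starts exactly at the block boundary is assigned to the earlier block so that the skip in step 4 never drops data.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class BlockRecordSketch {
    /** Hypothetical ranged fetch returning bytes [start, endExclusive). */
    interface RangeReader {
        byte[] read(long start, long endExclusive);
    }

    static List<String> readRecords(RangeReader reader, long fileLength,
                                    long offset, long length, int overflowChunk) {
        long end = Math.min(offset + length, fileLength);
        byte[] data = reader.read(offset, end);   // step 1: fetch the block
        long dataStart = offset;                  // file position of data[0]
        long loadedEnd = end;                     // exclusive end of fetched bytes
        boolean skipFirst = offset > 0;           // step 4: skip the partial first record
        long lineStart = offset;
        long pos = offset;
        List<String> records = new ArrayList<>();
        ByteArrayOutputStream current = new ByteArrayOutputStream();

        // Keep reading while the current record starts at or before the block end.
        while (lineStart <= end && pos < fileLength) {
            if (pos >= loadedEnd) {               // step 3: fetch a small overflow chunk
                long next = Math.min(loadedEnd + overflowChunk, fileLength);
                data = reader.read(loadedEnd, next);
                dataStart = loadedEnd;
                loadedEnd = next;
            }
            byte b = data[(int) (pos - dataStart)];
            pos++;
            if (b == '\n') {                      // step 2: a newline ends the record
                if (skipFirst) {
                    skipFirst = false;
                } else {
                    records.add(new String(current.toByteArray(), StandardCharsets.UTF_8));
                }
                current.reset();
                lineStart = pos;
            } else if (!skipFirst) {
                current.write(b);
            }
        }
        // A final record without a trailing newline ends at the end of the file.
        if (!skipFirst && current.size() > 0) {
            records.add(new String(current.toByteArray(), StandardCharsets.UTF_8));
        }
        return records;
    }
}
```

Running this once per block and concatenating the results should yield every record exactly once, regardless of where the block boundaries fall.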
> >>>
> >>>
> >>> Regards,
> >>> Ajay
> >>>
> >>>
> >>>
> >>> On Wed, Oct 19, 2016 at 7:31 AM, Ajay Gupta (JIRA) <[email protected]>
> >>> wrote:
> >>>
> >>>>
> >>>>      [ https://issues.apache.org/jira/browse/APEXMALHAR-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >>>>
> >>>> Ajay Gupta reassigned APEXMALHAR-2303:
> >>>> --------------------------------------
> >>>>
> >>>>     Assignee: Ajay Gupta
> >>>>
> >>>> > S3 Line By Line Module
> >>>> > ----------------------
> >>>> >
> >>>> >                 Key: APEXMALHAR-2303
> >>>> >                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2303
> >>>> >             Project: Apache Apex Malhar
> >>>> >          Issue Type: Bug
> >>>> >            Reporter: Ajay Gupta
> >>>> >            Assignee: Ajay Gupta
> >>>> >   Original Estimate: 336h
> >>>> >  Remaining Estimate: 336h
> >>>> >
> >>>> > This is a new module which will consist of 2 operators
> >>>> > 1) File Splitter -- Already existing in Malhar library
> >>>> > 2) S3RecordReader -- Read a file from S3 and output the records
> >>>> > (delimited or fixed width)
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> This message was sent by Atlassian JIRA
> >>>> (v6.3.4#6332)
> >>>>
> >>>
> >>>
> >>
>
