+1 for the new approach, i.e., adding the file length to FileBlockMetadata.

On Thu, Oct 20, 2016 at 12:00 PM, Tushar Gosavi <[email protected]> wrote:
> I think this approach is cleaner than the previous two approaches you
> mentioned. Depending on an exception or a non-standard error code to
> determine EOF is not a good approach, as we might treat some other valid
> exception as EOF and not take corrective action. It will also avoid
> multiple requests from each reader to get the file length.
>
> - Tushar.
>
> On Thu, Oct 20, 2016 at 11:45 AM, AJAY GUPTA <[email protected]> wrote:
> > Hi
> >
> > Following is another approach for getting the file length for S3.
> >
> > We have an existing class, FileBlockMetadata, which currently contains only
> > filePath. To this we can add a fileLength field, which will then be
> > passed to the module. This approach is a lot cleaner, and no additional
> > requests will be made to S3.
> >
> > Kindly provide your opinion on which approach would be best suited.
> >
> > Regards,
> > Ajay
> >
> > On Wed, Oct 19, 2016 at 6:43 PM, AJAY GUPTA <[email protected]> wrote:
> >
> >> Hi
> >>
> >> I need suggestions from the Apex dev community on the following.
> >>
> >> For the S3RecordReader approach mentioned in the previous mail, I am
> >> facing an issue with determining the end of file.
> >> Note that the input to this operator will not contain the file size.
> >>
> >> The following approaches are possible:
> >>
> >> 1) The S3 getObject() call, which fetches file data within a range, will
> >> throw an AmazonS3Exception if the range provided is out of bounds. Hence,
> >> if the file size is 10 bytes and I make a getObject request for bytes 11
> >> to 15, I will get this exception:
> >> Exception in thread "main" com.amazonaws.services.s3.model.AmazonS3Exception:
> >> The requested range is not satisfiable (Service: Amazon S3; Status Code:
> >> 416; Error Code: InvalidRange; Request ID:
> >> If this exception gets thrown, I can catch it in the code and conclude
> >> that end of file is reached.
> >>
> >> 2) For every container running this application, maintain a
> >> map<filename, filesize>. If the file size already exists in this map, use
> >> it from there. If not, fetch the file size from S3 and add it to this map.
> >>
> >> My own opinion is to go with the first approach, since the number of
> >> calls to S3 for getting the file length will be lower.
> >> Kindly suggest any other approaches you can think of.
> >>
> >> Thanks,
> >> Ajay
> >>
> >> On Wed, Oct 19, 2016 at 11:53 AM, AJAY GUPTA <[email protected]> wrote:
> >>
> >>> Hi Apex Dev community,
> >>>
> >>> Kindly provide feedback, if any, on the following approach for
> >>> implementing S3RecordReader.
> >>>
> >>> *S3RecordReader (delimited records)*
> >>> *Input:* BlockMetaData containing offset and length
> >>> *Expected output:* Records in the block
> >>> *Approach:*
> >>> Similar to the approach currently followed in FSRecordReader.
> >>> 1) Fetch the block from S3. The S3 block fetch size should ideally be
> >>> large enough, say 64 MB, to avoid unnecessary network delays.
> >>> 2) Search for the newline character in the block and emit the record.
> >>> 3) The last record in the current block might overflow into the
> >>> subsequent block. For this, we will get a small part of the subsequent
> >>> block, say 1 MB, search for the newline character, and emit the record
> >>> if one is found. We will fetch additional 1 MB blocks until a newline
> >>> character is found.
> >>> 4) We will also avoid reading the first record from all blocks (except
> >>> the first block), as this set of bytes is part of the last record in the
> >>> previous block.
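For illustration, steps 1) to 4) of the S3RecordReader approach can be sketched in memory as below. This is only a sketch under stated assumptions, not the Malhar implementation: the class BlockRecordSketch and its methods are hypothetical names, fetch() stands in for a ranged S3 read, the record delimiter is assumed to be '\n', and the file length is assumed to be known to the reader. It also assumes that a record starting exactly at a block boundary is emitted by the earlier block, so that skipping the first record of every later block never drops data.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Apex Malhar.
class BlockRecordSketch {

  // Stand-in for a ranged S3 read: returns bytes [start, end), clamped to the file.
  static byte[] fetch(byte[] file, long start, long end) {
    int from = (int) Math.min(start, file.length);
    int to = (int) Math.min(end, file.length);
    return Arrays.copyOfRange(file, from, to);
  }

  // Index of the first '\n' in buf at or after 'from', or -1 if there is none.
  static int indexOfNewline(byte[] buf, int from) {
    for (int i = from; i < buf.length; i++) {
      if (buf[i] == '\n') {
        return i;
      }
    }
    return -1;
  }

  // Returns the records owned by the block [offset, offset + length).
  static List<String> readBlock(byte[] file, long offset, long length, int overflowChunk) {
    if (offset < 0 || length <= 0 || overflowChunk <= 0) {
      throw new IllegalArgumentException("offset must be >= 0; length and overflowChunk must be > 0");
    }
    List<String> records = new ArrayList<>();
    long fileLength = file.length;
    if (offset >= fileLength) {
      return records;
    }
    long blockEnd = Math.min(offset + length, fileLength);

    byte[] buf = fetch(file, offset, blockEnd);  // step 1: fetch the block
    long fetchedEnd = blockEnd;                  // absolute position just past the buffered bytes
    int pos = 0;                                 // index in buf where the next record starts
    boolean skipping = offset > 0;               // step 4: the first record belongs to the previous block

    while (true) {
      long recordStart = offset + pos;
      if (!skipping && (recordStart > blockEnd || recordStart >= fileLength)) {
        break;  // the next record belongs to a later block, or the file has ended
      }
      int nl = indexOfNewline(buf, pos);           // step 2: look for the delimiter
      while (nl < 0 && fetchedEnd < fileLength) {  // step 3: pull small overflow chunks
        byte[] more = fetch(file, fetchedEnd, fetchedEnd + overflowChunk);
        int searchFrom = buf.length;
        buf = Arrays.copyOf(buf, buf.length + more.length);
        System.arraycopy(more, 0, buf, searchFrom, more.length);
        fetchedEnd += more.length;
        nl = indexOfNewline(buf, searchFrom);
      }
      int recordEnd = (nl >= 0) ? nl : buf.length;  // the last record may lack a trailing newline
      if (!skipping) {
        records.add(new String(buf, pos, recordEnd - pos, StandardCharsets.UTF_8));
      }
      skipping = false;
      if (nl < 0) {
        break;  // reached end of file
      }
      pos = nl + 1;
    }
    return records;
  }
}
```

Calling readBlock once per block and concatenating the results should yield every record exactly once, whatever block size is chosen. The overflow loop stops at the known file length instead of relying on an out-of-range error from the store.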
> >>>
> >>> Regards,
> >>> Ajay
> >>>
> >>> On Wed, Oct 19, 2016 at 7:31 AM, Ajay Gupta (JIRA) <[email protected]>
> >>> wrote:
> >>>
> >>>> [ https://issues.apache.org/jira/browse/APEXMALHAR-2303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> >>>>
> >>>> Ajay Gupta reassigned APEXMALHAR-2303:
> >>>> --------------------------------------
> >>>>
> >>>>     Assignee: Ajay Gupta
> >>>>
> >>>> > S3 Line By Line Module
> >>>> > ----------------------
> >>>> >
> >>>> >                 Key: APEXMALHAR-2303
> >>>> >                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2303
> >>>> >             Project: Apache Apex Malhar
> >>>> >          Issue Type: Bug
> >>>> >            Reporter: Ajay Gupta
> >>>> >            Assignee: Ajay Gupta
> >>>> >   Original Estimate: 336h
> >>>> >  Remaining Estimate: 336h
> >>>> >
> >>>> > This is a new module which will consist of 2 operators
> >>>> > 1) File Splitter -- Already existing in Malhar library
> >>>> > 2) S3RecordReader -- Read a file from S3 and output the records
> >>>> > (delimited or fixed width)
> >>>>
> >>>> --
> >>>> This message was sent by Atlassian JIRA
> >>>> (v6.3.4#6332)
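
For illustration, the fileLength proposal at the top of this thread lets a reader clamp each byte range before requesting it, so end of file is detected without catching the 416 InvalidRange error. The sketch below rests on assumptions: FileLengthRangeSketch, ByteRange and clamp are hypothetical names rather than Malhar or AWS SDK APIs, and the range end is inclusive, as in an HTTP Range header.

```java
import java.util.Optional;

// Hypothetical illustration only; not part of Apex Malhar or the AWS SDK.
class FileLengthRangeSketch {

  // Inclusive byte range, matching HTTP Range semantics (bytes=first-last).
  static final class ByteRange {
    final long first;
    final long last;

    ByteRange(long first, long last) {
      this.first = first;
      this.last = last;
    }
  }

  // Clamps a read of 'requested' bytes starting at 'start' to the known file length.
  // An empty result means 'start' is at or past end of file, so no request is needed.
  static Optional<ByteRange> clamp(long start, long requested, long fileLength) {
    if (start < 0 || requested <= 0 || fileLength < 0) {
      throw new IllegalArgumentException("start and fileLength must be >= 0; requested must be > 0");
    }
    if (start >= fileLength) {
      return Optional.empty();
    }
    long count = Math.min(requested, fileLength - start);
    return Optional.of(new ByteRange(start, start + count - 1));
  }
}
```

With the thread's example of a 10-byte file, clamp(11, 5, 10) is empty, so the reader can stop without sending a request, while clamp(8, 5, 10) is shortened to bytes 8 to 9.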
