+1

Regards,
Sandeep

On Thu, Oct 27, 2016 at 1:53 PM, Chaitanya Chebolu <
[email protected]> wrote:

> Hi All,
>
>   I am planning to implement approach (2) of the S3 Output Module, which I
> proposed in my previous email. It should perform better than approach (1)
> because the blocks are uploaded without first being saved to HDFS.
>
>   Please share your opinions.
>
> Regards,
> Chaitanya
>
> On Thu, Oct 20, 2016 at 8:11 PM, Chaitanya Chebolu <
> [email protected]> wrote:
>
> > Hi All,
> >
> > I am proposing the below new design for S3 Output Module using multi part
> > upload feature:
> >
> > Input to this Module: FileMetadata, FileBlockMetadata, ReaderRecord
> >
> > Steps for uploading files using S3 multipart feature:
> >
> > =============================
> >
> >    1. Initiate the upload. S3 will return upload id.
> >
> >       Mandatory: bucket name, file path
> >
> >       Note: Upload id is the unique identifier for multi part upload of a
> >       file.
> >
> >    2. Upload each block using the received upload id. S3 will return ETag in
> >       response of each upload.
> >
> >       Mandatory: block number, upload id
> >
> >    3. Send the merge request by providing the upload id and list of ETags.
> >
> >       Mandatory: upload id, file path, block ETags
> >
> > Here
> > <http://docs.aws.amazon.com/AmazonS3/latest/dev/llJavaUploadFile.html>
> > is an example link for uploading a file using multi part feature.
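The three steps above can be tried out locally with a minimal sketch like the one below. It assumes a hypothetical in-memory stand-in for S3 (InMemoryMultipartStore, with made-up methods initiate, uploadBlock and complete, and a made-up ETag format); none of these names come from the AWS SDK or from the proposal. In this sketch the file path is remembered from step 1, so the merge call only takes the upload id and the ETags.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.UUID;

// Hypothetical in-memory stand-in for S3. It is NOT the AWS SDK; it only
// mimics the three multipart calls so the flow can be run locally.
class InMemoryMultipartStore {
  private final Map<String, String> targets = new HashMap<>();   // uploadId -> bucket/path
  private final Map<String, TreeMap<Integer, byte[]>> pending = new HashMap<>();
  private final Map<String, byte[]> objects = new HashMap<>();   // bucket/path -> merged bytes

  // Step 1: initiate the upload; returns the upload id.
  String initiate(String bucket, String filePath) {
    String uploadId = UUID.randomUUID().toString();
    targets.put(uploadId, bucket + "/" + filePath);
    pending.put(uploadId, new TreeMap<>());
    return uploadId;
  }

  // Step 2: upload one block; the returned string plays the role of the ETag.
  String uploadBlock(String uploadId, int blockNumber, byte[] data) {
    TreeMap<Integer, byte[]> blocks = pending.get(uploadId);
    if (blocks == null) {
      throw new IllegalStateException("unknown upload id");
    }
    blocks.put(blockNumber, data.clone());
    return etagOf(data);
  }

  // Step 3: check the ETags, then merge the blocks in block-number order.
  void complete(String uploadId, List<String> etags) {
    TreeMap<Integer, byte[]> blocks = pending.remove(uploadId);
    if (blocks == null || blocks.size() != etags.size()) {
      throw new IllegalStateException("unknown upload id or missing blocks");
    }
    ByteArrayOutputStream merged = new ByteArrayOutputStream();
    int i = 0;
    for (byte[] block : blocks.values()) {
      if (!etagOf(block).equals(etags.get(i++))) {
        throw new IllegalStateException("ETag mismatch");
      }
      merged.write(block, 0, block.length);
    }
    objects.put(targets.remove(uploadId), merged.toByteArray());
  }

  byte[] get(String bucket, String filePath) {
    return objects.get(bucket + "/" + filePath);
  }

  // Made-up tag for the sketch; real S3 computes its own ETag values.
  private static String etagOf(byte[] data) {
    return Integer.toHexString(Arrays.hashCode(data));
  }
}

class MultipartUploadSketch {
  // Runs the three steps for one file whose blocks are already in memory.
  static void upload(InMemoryMultipartStore store, String bucket,
                     String filePath, List<byte[]> blocks) {
    String uploadId = store.initiate(bucket, filePath);
    List<String> etags = new ArrayList<>();
    for (int i = 0; i < blocks.size(); i++) {
      // Block numbers start at 1 in this sketch.
      etags.add(store.uploadBlock(uploadId, i + 1, blocks.get(i)));
    }
    store.complete(uploadId, etags);
  }

  public static void main(String[] args) {
    InMemoryMultipartStore store = new InMemoryMultipartStore();
    upload(store, "my-bucket", "dir/file.txt", Arrays.asList(
        "hello ".getBytes(StandardCharsets.UTF_8),
        "world".getBytes(StandardCharsets.UTF_8)));
    System.out.println(
        new String(store.get("my-bucket", "dir/file.txt"), StandardCharsets.UTF_8));
  }
}
```

A real module would replace the in-memory store with calls to the S3 client; the ordering of the calls and the hand-off of upload id and ETags is the part this sketch is meant to show.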
> >
> >
> > I am proposing the below two approaches for S3 output module.
> >
> >
> > (Solution 1)
> >
> > S3 Output Module consists of the below two operators:
> >
> > 1) BlockWriter: Writes the blocks into HDFS. Once a block is successfully
> > written into HDFS, this operator emits the BlockMetadata.
> >
> > 2) S3MultiPartUpload: This consists of two parts:
> >
> >      a) If a file has more than one block, upload the blocks using the
> > multipart feature. Otherwise, upload the single block using putObject().
> >
> >      b) Once all the blocks are successfully uploaded, send the merge
> > complete request.
> >
> >
> > (Solution 2)
> >
> > The DAG for this solution is as follows:
> >
> > 1) InitateS3Upload:
> >
> > Input: FileMetadata
> >
> > Initiates the upload. This operator emits (filemetadata, uploadId) to
> > S3FileMerger and (filePath, uploadId) to S3BlockUpload.
> >
> > 2) S3BlockUpload:
> >
> > Input: FileBlockMetadata, ReaderRecord
> >
> > Upload the blocks into S3. S3 will return ETag for each upload.
> > S3BlockUpload emits (path, ETag) to S3FileMerger.
> >
> > 3) S3FileMerger: Sends the file merge request to S3.
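One detail the merger has to handle in Solution 2 is that the file tuple and the block tuples arrive on separate streams, possibly out of order. The sketch below shows one way to track this. It is only an illustration under stated assumptions: MergeTracker, MergeRequest, onFile and onBlock are hypothetical names, and it assumes the block tuple also carries a block number and the file tuple carries the block count, neither of which is spelled out in the proposal.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical bookkeeping for the merger: collects ETags per file and
// reports a merge request once every block of that file has been seen.
class MergeTracker {
  // What the merger would need in order to send the merge request.
  static final class MergeRequest {
    final String filePath;
    final String uploadId;
    final List<String> etags;  // ordered by block number

    MergeRequest(String filePath, String uploadId, List<String> etags) {
      this.filePath = filePath;
      this.uploadId = uploadId;
      this.etags = etags;
    }
  }

  private static final class FileState {
    String uploadId;
    int expectedBlocks = -1;  // unknown until the file tuple arrives
    final TreeMap<Integer, String> etags = new TreeMap<>();
  }

  private final Map<String, FileState> files = new HashMap<>();

  // Tuple from the initiating operator (assumed to include the block count).
  Optional<MergeRequest> onFile(String filePath, String uploadId, int blockCount) {
    FileState state = files.computeIfAbsent(filePath, p -> new FileState());
    state.uploadId = uploadId;
    state.expectedBlocks = blockCount;
    return readyRequest(filePath, state);
  }

  // Tuple from the block uploader (assumed to include the block number).
  Optional<MergeRequest> onBlock(String filePath, int blockNumber, String etag) {
    FileState state = files.computeIfAbsent(filePath, p -> new FileState());
    state.etags.put(blockNumber, etag);
    return readyRequest(filePath, state);
  }

  // Returns a request only when the block count is known and all ETags are in.
  private Optional<MergeRequest> readyRequest(String filePath, FileState state) {
    if (state.expectedBlocks < 0 || state.etags.size() < state.expectedBlocks) {
      return Optional.empty();
    }
    files.remove(filePath);
    return Optional.of(new MergeRequest(
        filePath, state.uploadId, new ArrayList<>(state.etags.values())));
  }
}
```

In an actual operator this state would also have to be checkpointed so that a recovery does not lose ETags that were already received; the sketch leaves that out.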
> >
> > Pros:
> >
> > (1) Supports uploading files of up to 5 TB.
> >
> > (2) Reduces the end-to-end latency, because we do not wait for all the
> > blocks of a file to be written to HDFS before uploading.
> >
> > Please vote and share your thoughts on these approaches.
> >
> > Regards,
> > Chaitanya
> >
> > On Tue, Mar 29, 2016 at 2:35 PM, Chaitanya Chebolu <
> > [email protected]> wrote:
> >
> >> @ Tushar
> >>
> >>   S3 Copy Output Module consists of the following operators:
> >> 1) BlockWriter: Writes the blocks into HDFS.
> >> 2) Synchronizer: Sends a trigger to the downstream operator when all the
> >> blocks of a file have been written to HDFS.
> >> 3) FileMerger: Merges all the blocks into a file and uploads the
> >> merged file into the S3 bucket.
> >>
> >> @ Ashwin
> >>
> >>     Good suggestion. In the first iteration, I will add the proposed
> >> design. I will add multipart support in the next iteration.
> >>
> >> Regards,
> >> Chaitanya
> >>
> >> On Thu, Mar 24, 2016 at 2:44 AM, Ashwin Chandra Putta <
> >> [email protected]> wrote:
> >>
> >>> +1 regarding the s3 upload functionality.
> >>>
> >>> However, I think we should just focus on multipart upload directly as it
> >>> comes with various advantages like higher throughput, faster recovery,
> >>> not needing to wait for entire file being created before uploading each
> >>> part.
> >>> See: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
> >>>
> >>> Also, seems like we can do multipart upload if the file size is more than
> >>> 5MB. They do recommend using multipart if the file size is more than
> >>> 100MB. I am not sure if there is a hard lower limit though. See:
> >>> http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
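On the size limits: as far as the S3 documentation states, every part of a multipart upload except the last must be at least 5 MB, and one upload can have at most 10,000 parts. The sketch below turns that into a small check. PartPlanner and partCount are hypothetical names for illustration, and 5 MB is taken here as 5 * 1024 * 1024 bytes.

```java
// Hypothetical helper that checks a part size against the multipart limits.
class PartPlanner {
  // Minimum size of every part except the last one (5 MB, assumed binary).
  static final long MIN_PART_SIZE = 5L * 1024 * 1024;
  // Maximum number of parts in a single multipart upload.
  static final int MAX_PARTS = 10_000;

  // Returns how many parts a file of fileSize bytes needs with the given
  // part size, or throws if the combination breaks one of the limits.
  static int partCount(long fileSize, long partSize) {
    if (fileSize <= 0 || partSize <= 0) {
      throw new IllegalArgumentException("sizes must be positive");
    }
    long parts = (fileSize + partSize - 1) / partSize;  // ceiling division
    if (parts > 1 && partSize < MIN_PART_SIZE) {
      throw new IllegalArgumentException("part size below the 5 MB minimum");
    }
    if (parts > MAX_PARTS) {
      throw new IllegalArgumentException("more than 10,000 parts needed");
    }
    return (int) parts;
  }
}
```

A file that fits into a single part is accepted at any size here, since the minimum does not apply to the last part.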
> >>>
> >>> This way, it seems like we don't have to wait until a file is completely
> >>> written to HDFS before performing the upload operation.
> >>>
> >>> Regards,
> >>> Ashwin.
> >>>
> >>> On Wed, Mar 23, 2016 at 5:10 AM, Tushar Gosavi <[email protected]>
> >>> wrote:
> >>>
> >>> > +1 , we need this functionality.
> >>> >
> >>> > Is it going to be a single operator or multiple operators? If multiple
> >>> > operators, then can you explain what functionality each operator will
> >>> > provide?
> >>> >
> >>> >
> >>> > Regards,
> >>> > -Tushar.
> >>> >
> >>> >
> >>> > On Wed, Mar 23, 2016 at 5:01 PM, Yogi Devendra <[email protected]>
> >>> > wrote:
> >>> >
> >>> > > Writing to S3 is a common use-case for applications.
> >>> > > This module will definitely be helpful.
> >>> > >
> >>> > > +1 for adding this module.
> >>> > >
> >>> > >
> >>> > > ~ Yogi
> >>> > >
> >>> > > On 22 March 2016 at 13:52, Chaitanya Chebolu <[email protected]>
> >>> > > wrote:
> >>> > >
> >>> > > > Hi All,
> >>> > > >
> >>> > > >   I am proposing an S3 output copy module. The primary functionality
> >>> > > > of this module is uploading files to an S3 bucket using a
> >>> > > > block-by-block approach.
> >>> > > >
> >>> > > >   Below is the JIRA created for this task:
> >>> > > > https://issues.apache.org/jira/browse/APEXMALHAR-2022
> >>> > > >
> >>> > > >   The design of this module is similar to the HDFS copy module, so I
> >>> > > > will extend the HDFS copy module for S3.
> >>> > > >
> >>> > > > Design of this Module:
> >>> > > > =======================
> >>> > > > 1) Write the blocks into HDFS.
> >>> > > > 2) Merge the blocks into a file.
> >>> > > > 3) Upload the merged file into the S3 bucket using the AmazonS3Client
> >>> > > > APIs.
> >>> > > >
> >>> > > > Steps (1) & (2) are the same as in the HDFS copy module.
> >>> > > >
> >>> > > > *Limitation:* Supports files of up to 5 GB only. Please refer to the
> >>> > > > link below for the limitations on uploading objects into S3:
> >>> > > > http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
> >>> > > >
> >>> > > > We can resolve the above limitation by using the S3 multipart feature.
> >>> > > > I will add multipart support in the next iteration.
> >>> > > >
> >>> > > >  Please share your thoughts on this.
> >>> > > >
> >>> > > > Regards,
> >>> > > > Chaitanya
> >>> > > >
> >>> > >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>>
> >>> Regards,
> >>> Ashwin.
> >>>
> >>
> >>
> >
>
