@ Tushar
The S3 Copy Output Module consists of the following operators:
1) BlockWriter: Writes the blocks to HDFS.
2) Synchronizer: Sends a trigger to the downstream operator once all the
blocks of a file have been written to HDFS.
3) FileMerger: Merges all the blocks into a single file and uploads the
merged file to the S3 bucket.
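For clarity, below is a minimal wiring sketch of how these three operators
could be composed in the module. The operator classes come from the design
above, but the port names and the module class name are illustrative
assumptions, not the final Malhar API:

    // Sketch only: port names and stream wiring are assumptions based on
    // the design above; the final Malhar code may differ.
    import org.apache.hadoop.conf.Configuration;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.Module;

    public class S3CopyOutputModule implements Module
    {
      @Override
      public void populateDAG(DAG dag, Configuration conf)
      {
        // 1) Write incoming blocks to HDFS.
        BlockWriter blockWriter = dag.addOperator("BlockWriter", new BlockWriter());
        // 2) Emit a trigger once all blocks of a file are on HDFS.
        Synchronizer synchronizer = dag.addOperator("Synchronizer", new Synchronizer());
        // 3) Merge the blocks and upload the merged file to the S3 bucket.
        FileMerger fileMerger = dag.addOperator("FileMerger", new FileMerger());

        dag.addStream("BlocksMetadata", blockWriter.blocksMetadataOutput,
            synchronizer.blocksMetadataInput);
        dag.addStream("MergeTrigger", synchronizer.trigger, fileMerger.input);
      }
    }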
@ Ashwin
Good suggestion. In the first iteration, I will implement the proposed
design. Multipart support will be added in the next iteration.
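For reference, a minimal sketch of what the multipart flow could look like
with the AWS SDK's AmazonS3Client (the class name, bucket, key, and file
here are placeholders, not the eventual operator code):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.List;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.*;

    public class MultipartUploadSketch
    {
      public static void upload(File file, String bucket, String key)
      {
        AmazonS3 s3 = new AmazonS3Client();
        // Step 1: initiate the upload and remember its upload id.
        String uploadId = s3.initiateMultipartUpload(
            new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

        // Step 2: upload the file in parts, collecting each part's ETag.
        List<PartETag> partETags = new ArrayList<PartETag>();
        long partSize = 5L * 1024 * 1024; // every part except the last must be >= 5 MB
        long position = 0;
        for (int partNumber = 1; position < file.length(); partNumber++) {
          long size = Math.min(partSize, file.length() - position);
          UploadPartRequest request = new UploadPartRequest()
              .withBucketName(bucket).withKey(key).withUploadId(uploadId)
              .withPartNumber(partNumber).withFile(file)
              .withFileOffset(position).withPartSize(size);
          partETags.add(s3.uploadPart(request).getPartETag());
          position += size;
        }

        // Step 3: complete the upload so S3 assembles the parts.
        s3.completeMultipartUpload(
            new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
      }
    }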
Regards,
Chaitanya
On Thu, Mar 24, 2016 at 2:44 AM, Ashwin Chandra Putta <
[email protected]> wrote:
> +1 regarding the s3 upload functionality.
>
> However, I think we should just focus on multipart upload directly, as it
> comes with various advantages: higher throughput, faster recovery, and not
> needing to wait for the entire file to be created before uploading each part.
> See: http://docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
>
> Also, it seems we can do a multipart upload if the file size is more than
> 5 MB. They recommend using multipart upload once the file size is more than
> 100 MB. I am not sure if there is a hard lower limit, though. See:
> http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
>
> This way, it seems we don't have to wait until a file is completely
> written to HDFS before performing the upload operation.
>
> Regards,
> Ashwin.
>
> On Wed, Mar 23, 2016 at 5:10 AM, Tushar Gosavi <[email protected]>
> wrote:
>
> > +1 , we need this functionality.
> >
> > Is it going to be a single operator or multiple operators? If multiple
> > operators, then can you explain what functionality each operator will
> > provide?
> >
> >
> > Regards,
> > -Tushar.
> >
> >
> > On Wed, Mar 23, 2016 at 5:01 PM, Yogi Devendra <[email protected]>
> > wrote:
> >
> > > Writing to S3 is a common use case for applications.
> > > This module will definitely be helpful.
> > >
> > > +1 for adding this module.
> > >
> > >
> > > ~ Yogi
> > >
> > > On 22 March 2016 at 13:52, Chaitanya Chebolu <[email protected]>
> > > wrote:
> > >
> > > > Hi All,
> > > >
> > > > I am proposing an S3 copy output module. The primary functionality of
> > > > this module is uploading files to an S3 bucket using a block-by-block
> > > > approach.
> > > >
> > > > Below is the JIRA created for this task:
> > > > https://issues.apache.org/jira/browse/APEXMALHAR-2022
> > > >
> > > > The design of this module is similar to the HDFS copy module, so I
> > > > will extend the HDFS copy module for S3.
> > > >
> > > > Design of this Module:
> > > > =======================
> > > > 1) Write the blocks to HDFS.
> > > > 2) Merge the blocks into a file.
> > > > 3) Upload the merged file to the S3 bucket using the AmazonS3Client
> > > > APIs (a rough sketch follows below).
> > > >
> > > > Steps (1) and (2) are the same as in the HDFS copy module.
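> > > > As a rough sketch of step (3), assuming the merged file is read back
> > > > from HDFS and streamed to S3 (class name, bucket, key, and path here
> > > > are placeholders):
> > > >
> > > >   import org.apache.hadoop.conf.Configuration;
> > > >   import org.apache.hadoop.fs.FileSystem;
> > > >   import org.apache.hadoop.fs.Path;
> > > >   import com.amazonaws.services.s3.AmazonS3;
> > > >   import com.amazonaws.services.s3.AmazonS3Client;
> > > >   import com.amazonaws.services.s3.model.ObjectMetadata;
> > > >   import com.amazonaws.services.s3.model.PutObjectRequest;
> > > >
> > > >   public class S3UploadSketch
> > > >   {
> > > >     public static void upload(String mergedPath, String bucket, String key)
> > > >         throws java.io.IOException
> > > >     {
> > > >       FileSystem fs = FileSystem.get(new Configuration());
> > > >       Path merged = new Path(mergedPath);
> > > >       // S3 needs the content length up front when uploading a stream.
> > > >       ObjectMetadata metadata = new ObjectMetadata();
> > > >       metadata.setContentLength(fs.getFileStatus(merged).getLen());
> > > >       AmazonS3 s3 = new AmazonS3Client();
> > > >       s3.putObject(new PutObjectRequest(bucket, key, fs.open(merged), metadata));
> > > >     }
> > > >   }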
> > > >
> > > > *Limitation:* Supports file sizes only up to 5 GB. Please refer to
> > > > the link below about the limitations of uploading objects to S3:
> > > > http://docs.aws.amazon.com/AmazonS3/latest/dev/UploadingObjects.html
> > > >
> > > > We can resolve the above limitation by using the S3 multipart upload
> > > > feature. I will add multipart support in the next iteration.
> > > >
> > > > Please share your thoughts on this.
> > > >
> > > > Regards,
> > > > Chaitanya
> > > >
> > >
> >
>
>
>
> --
>
> Regards,
> Ashwin.
>