Pedro,

Thanks for the response. Unfortunately I am running it on an in-house cluster, and from there I need to upload to S3.
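
For what it's worth, my assumption is that on a plain (non-EMR) cluster the equivalent of your bootstrap-action would be to pass those same two multipart properties to distcp directly, something along these lines (untested sketch; the bucket and paths below are just placeholders):

    # same properties as in your configure-hadoop args; split size is ~500 MB
    hadoop distcp \
      -Dfs.s3n.multipart.uploads.enabled=true \
      -Dfs.s3n.multipart.uploads.split.size=524288000 \
      hdfs:///data/input \
      s3n://my-bucket/data/

Does that look about right?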
-Himanshu

On Wed, May 2, 2012 at 2:03 PM, Pedro Figueiredo <p...@89clouds.com> wrote:
>
> On 2 May 2012, at 18:29, Himanshu Vijay wrote:
>
> > Hi,
> >
> > I have 100 files each of ~3 GB. I need to distcp them to S3 but copying
> > fails because of large size of files. The files are not gzipped so they are
> > splittable. Is there a way or property to tell Distcp to first split the
> > input files into let's say 200 MB or N lines each before copying to
> > destination.
> >
>
> Assuming you're using EMR, use s3distcp:
>
> http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
>
> In any case, that's strange because S3's limit is 5GB per PUT request;
> again if you're running on EMR, try starting your cluster with
>
> --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
> --args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"
>
> (or add those to whatever parameters you currently use).
>
> Going back to plain distcp, I'm not sure what the -sizelimit option
> does, as I've never used it.
>
> If push comes to shove, seeing as you have a Hadoop cluster, running a job
> to write the files to S3 with compression enabled is always an option :)
>
> Cheers,
>
> Pedro Figueiredo
> Skype: pfig.89clouds
> http://89clouds.com/ - Big Data Consulting

--
-Himanshu Vijay