On 2 May 2012, at 18:29, Himanshu Vijay wrote:

> Hi,
> 
> I have 100 files, each ~3 GB. I need to distcp them to S3, but copying
> fails because of the large file sizes. The files are not gzipped, so they
> are splittable. Is there a way or a property to tell distcp to first split
> the input files into, say, 200 MB or N lines each before copying to the
> destination?
> 

Assuming you're using EMR, use s3distcp:

http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_s3distcp.html
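For reference, a step along these lines should do it. I haven't tested this exact invocation, so treat it as a sketch: the jobflow ID, paths, and bucket are placeholders, and per the docs --targetSize is in MB and applies to files aggregated with --groupBy:

elastic-mapreduce --jobflow j-XXXXXXXX \
  --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,hdfs:///data/input,--dest,s3://your-bucket/data/,--groupBy,.*(part-.*),--targetSize,200'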

In any case, the failure is strange: at ~3 GB your files are well under S3's 5 GB
limit per single PUT request. That said, if you're running on EMR, try starting
your cluster with

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
  --args "-c,fs.s3n.multipart.uploads.enabled=true,-c,fs.s3n.multipart.uploads.split.size=524288000"

(or add those to whatever parameters you currently use).

Going back to plain distcp, I'm not sure what the -sizelimit option does, as
I've never used it.

If push comes to shove, seeing as you have a Hadoop cluster, running a job to 
write the files to S3 with compression enabled is always an option :)
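For example, a map-only identity job with Hadoop Streaming would do it. Again, only a sketch: the streaming jar path varies by distribution, the paths and bucket are placeholders, and bear in mind that gzipped output is no longer splittable (each input split simply becomes one compressed output file):

# streaming jar location depends on your Hadoop distribution
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
  -D mapred.output.compress=true \
  -D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -D mapred.reduce.tasks=0 \
  -input hdfs:///data/input \
  -output s3n://your-bucket/data/ \
  -mapper /bin/cat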

Cheers,

Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting