On 2 May 2012, at 18:29, Himanshu Vijay wrote:

> Hi,
> I have 100 files each of ~3 GB. I need to distcp them to S3 but copying
> fails because of large size of files. The files are not gzipped so they are
> splittable. Is there a way or property to tell Distcp to first split the
> input files into let's say 200 MB or N lines each before copying to
> destination.

Assuming you're using EMR, use s3distcp:


In any case, that's strange, because S3's limit is 5 GB per PUT request and 
your files are only ~3 GB each. Again, if you're running on EMR, try starting 
your cluster with

--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \

(or add those to whatever parameters you currently use).
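
To make the size argument concrete, here is a rough sketch of the arithmetic in Python. The helper names are made up for illustration, and I'm assuming the 5 GB limit means 5 * 1024^3 bytes; it only checks whether a file fits in one PUT and how many 200 MB pieces a split would produce.

```python
GB = 1024 ** 3
MB = 1024 ** 2

# Assumption: the 5 GB per-PUT limit mentioned above, read as 5 * 1024^3 bytes.
S3_SINGLE_PUT_LIMIT = 5 * GB


def fits_single_put(size_bytes, limit=S3_SINGLE_PUT_LIMIT):
    """Return True if a file of this size can go up in one PUT request."""
    return size_bytes <= limit


def chunk_count(size_bytes, chunk_bytes):
    """Number of pieces needed to split a file into chunks of chunk_bytes."""
    # Integer ceiling division, avoids floating-point rounding.
    return -(-size_bytes // chunk_bytes)


if __name__ == "__main__":
    file_size = 3 * GB  # one of the ~3 GB input files
    print(fits_single_put(file_size))
    print(chunk_count(file_size, 200 * MB))
```

So a ~3 GB file is under the per-request limit on its own, and splitting it into 200 MB pieces would give 16 objects per file.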

Going back to plain distcp, I'm not sure what the -sizelimit option does, as 
I've never used it.

If push comes to shove, seeing as you have a Hadoop cluster, running a job to 
write the files to S3 with compression enabled is always an option :)


Pedro Figueiredo
Skype: pfig.89clouds
http://89clouds.com/ - Big Data Consulting
