Check out 
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html.
I don't know if there's an S3 version, but this should help.
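Something like the (untested) sketch below is what I have in mind, using the newer mapreduce API's CombineTextInputFormat, which is the concrete text subclass of CombineFileInputFormat. The class name, paths, and the 128 MB cap are just placeholders; an S3 URI should be usable as the input path as long as the S3 filesystem is configured, though I haven't tried that myself.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SmallFileJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "combine-small-files");
            job.setJarByClass(SmallFileJobDriver.class);

            // Pack many small files into each split instead of one split per file.
            job.setInputFormatClass(CombineTextInputFormat.class);

            // Cap the combined split size (here 128 MB); without a cap, all the
            // files assigned to one node can end up in a single split.
            FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

            // Set your own mapper/reducer and output types here.

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }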

On Tue, Jun 17, 2014 at 4:48 PM, Brian Stempin <[email protected]> wrote:
> Hi,
> I was comparing the performance of a Hadoop job that I wrote in Java to one
> that I wrote in Pig.  I have ~106,000 small (<1 MB) input files.  In my Java
> job, I get one split per file, which is really inefficient.  In Pig, the same
> work gets done over 49 splits, which is much faster.
>
> How does Pig do this?  Is there a piece of the source code that you can point
> me to?  I seem to be banging my head on how to combine multiple S3
> objects into a single split.
>
> Thanks,
> Brian