Check out https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/CombineFileInputFormat.html. I don't know if there's an S3 version, but this should help.
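In case it helps to see the idea: CombineFileInputFormat packs many small files into each split, up to a configurable max split size, which is why the task count drops so dramatically. Pig does something similar under the hood (look at its pig.splitCombination and pig.maxCombinedSplitSize properties). Below is a toy, standalone sketch of just the packing step — the class and method names are mine, not Hadoop's, and the real input format also groups blocks by node and rack locality, which this ignores:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPacking {
    // Toy illustration of split combining: greedily pack file lengths into
    // splits no larger than maxSplitSize. Hadoop's CombineFileInputFormat
    // additionally considers node/rack locality when grouping blocks.
    static List<List<Long>> combine(List<Long> fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            // Start a new split when the next file would push us past the cap.
            if (currentSize + size > maxSplitSize && !current.isEmpty()) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // ~106,000 files of 512 KB each, combined under a 256 MB cap:
        // far fewer map tasks than one per file.
        List<Long> files = new ArrayList<>();
        for (int i = 0; i < 106_000; i++) {
            files.add(512L * 1024);
        }
        List<List<Long>> splits = combine(files, 256L * 1024 * 1024);
        System.out.println(splits.size());
    }
}
```

With your numbers (roughly 106,000 sub-MB files), a cap like 256 MB packs hundreds of files per split, which is the same effect you're seeing from Pig's 49 splits.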
On Tue, Jun 17, 2014 at 4:48 PM, Brian Stempin <[email protected]> wrote:
> Hi,
> I was comparing performance of a Hadoop job that I wrote in Java to one
> that I wrote in Pig. I have ~106,000 small (<1Mb) input files. In my Java
> job, I get one split per file, which is really inefficient. In Pig, this
> gets done over 49 splits, which is much faster.
>
> How does Pig do this? Is there a piece of the source code that I can be
> referred to? I seem to be banging my head on how to combine multiple S3
> objects into a single split.
>
> Thanks,
> Brian
