Each bz2 file after merging is about 50 MB. The reducers take about 9 minutes.
Note: 'getmerge' is not an option. There isn't enough disk space to do a
getmerge on the local production box. Plus, we need a scalable solution, as
these files will get a lot bigger soon.

On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <benjij...@gmail.com> wrote:

> How big are your 50 files? How long are the reducers taking?
>
> On Jul 30, 2013, at 10:26 PM, Something Something <mailinglist...@gmail.com> wrote:
>
> > Hello,
> >
> > One of our Pig scripts creates over 500 small part files. To save on
> > namespace, we need to cut down the number of files, so instead of saving
> > 500 small files we want to merge them into 50. We tried the following:
> >
> > 1) When we set the parallel number to 50, the Pig script takes a long
> > time, for obvious reasons.
> > 2) If we use Hadoop Streaming, it puts garbage values into the key field.
> > 3) We wrote our own MapReduce program that reads these 500 small part
> > files and uses 50 reducers. The mappers simply write each line, and the
> > reducers loop through the values and write them out. We set
> > job.setOutputKeyClass(NullWritable.class) so that the key is not written
> > to the output file. This performs better than Pig: the mappers run very
> > fast, and although the reducers take some time to complete, the approach
> > seems to be working well.
> >
> > Is there a better way to do this? What strategy can you think of to
> > increase the speed of the reducers?
> >
> > Any help in this regard will be greatly appreciated. Thanks.
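For reference, a minimal sketch of the merge job described in (3), assuming
the Hadoop 2.x org.apache.hadoop.mapreduce API. The class names (MergeFiles,
MergeMapper, MergeReducer) and the random-bucket keying are illustrative
assumptions, not the poster's actual code:

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeFiles {

  // Mapper: tag each line with a random bucket in [0, 50) so the
  // lines spread roughly evenly across the 50 reducers.
  public static class MergeMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random rnd = new Random();
    private final IntWritable bucket = new IntWritable();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      bucket.set(rnd.nextInt(50));
      ctx.write(bucket, line);
    }
  }

  // Reducer: drop the bucket key and emit each line unchanged; keying
  // the output by NullWritable means only the line reaches the file.
  public static class MergeReducer
      extends Reducer<IntWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(IntWritable bucket, Iterable<Text> lines, Context ctx)
        throws IOException, InterruptedException {
      for (Text line : lines) {
        ctx.write(NullWritable.get(), line);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge-small-files");
    job.setJarByClass(MergeFiles.class);
    job.setMapperClass(MergeMapper.class);
    job.setReducerClass(MergeReducer.class);
    job.setNumReduceTasks(50);                    // 500 inputs -> 50 outputs
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);    // key is not written out
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Write bz2-compressed output, matching the ~50 MB .bz2 files above.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The random bucket key is one way to balance the 50 reducers regardless of how
the 500 input files are sized; any evenly distributed key would do.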