Each bz2 file after merging is about 50 MB.  The reducers take about 9
minutes.

Note:  'getmerge' is not an option.  There isn't enough disk space on the
local production box to do a getmerge.  We also need a scalable solution,
as these files will get much bigger soon.
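
In case it helps, the job from point 3 in the quoted message below looks
roughly like this.  It's a simplified sketch using the
org.apache.hadoop.mapreduce API: the round-robin bucket key and the
"merge.buckets" config name are illustrative (just one way to spread
lines evenly across the reducers), and the bz2 output codec matches what
we described above.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeParts {

  // Mapper: tag each line with a rotating bucket number so the lines
  // spread evenly over the reducers ("merge.buckets" is a made-up
  // config key for this sketch).
  public static class LineMapper
      extends Mapper<Object, Text, IntWritable, Text> {
    private final IntWritable bucket = new IntWritable();
    private int next = 0;
    private int numBuckets;

    @Override
    protected void setup(Context context) {
      numBuckets = context.getConfiguration().getInt("merge.buckets", 50);
    }

    @Override
    protected void map(Object offset, Text line, Context context)
        throws IOException, InterruptedException {
      bucket.set(next);
      next = (next + 1) % numBuckets;
      context.write(bucket, line);
    }
  }

  // Reducer: loop through the values and write only the line; the
  // NullWritable output key keeps the key out of the output file.
  public static class LineReducer
      extends Reducer<IntWritable, Text, NullWritable, Text> {
    @Override
    protected void reduce(IntWritable bucket, Iterable<Text> lines,
        Context context) throws IOException, InterruptedException {
      for (Text line : lines) {
        context.write(NullWritable.get(), line);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "merge part files");
    job.setJarByClass(MergeParts.class);
    job.setMapperClass(LineMapper.class);
    job.setReducerClass(LineReducer.class);
    // One output file per reducer: 50 reducers -> 50 merged files.
    job.setNumReduceTasks(conf.getInt("merge.buckets", 50));
    job.setMapOutputKeyClass(IntWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);  // key is not written out
    job.setOutputValueClass(Text.class);
    // Compress the merged output with bz2, as in our current setup.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}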

On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <benjij...@gmail.com> wrote:

> How big are your 50 files?  How long are the reducers taking?
>
> On Jul 30, 2013, at 10:26 PM, Something Something <mailinglist...@gmail.com> wrote:
>
> > Hello,
> >
> > One of our Pig scripts creates over 500 small part files.  To conserve
> > NameNode namespace, we need to cut down the number of files, so instead
> > of keeping 500 small files we want to merge them into 50.  We tried the
> > following:
> >
> > 1)  When we set the parallel number to 50, the Pig script takes a long
> > time - for obvious reasons.
> > 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> > field.
> > 3)  We wrote our own MapReduce program that reads these 500 small part
> > files and uses 50 reducers.  Basically, the mappers simply write each
> > line, and the reducers loop through the values and write them out.  We
> > set job.setOutputKeyClass(NullWritable.class) so that the key is not
> > written to the output file.  This is performing better than Pig.  The
> > mappers actually run very fast; the reducers take some time to complete,
> > but this approach seems to be working well.
> >
> > Is there a better way to do this?  What strategies can you think of to
> > increase the speed of the reducers?
> >
> > Any help in this regard will be greatly appreciated.  Thanks.
>
>
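
Would a map-only job avoid the reducer bottleneck entirely?  Something
like the sketch below: CombineTextInputFormat packs the 500 small inputs
into a handful of large splits, and with zero reducers each mapper writes
one merged file, so there is no shuffle or sort at all.  This assumes a
Hadoop release that ships
org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat, and the
512 MB split cap is illustrative (it would be tuned so total input size
divided by the cap comes out near 50 files).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyMerge {

  // Identity-style mapper: pass each line through under a null key so
  // only the raw line lands in the output files.
  public static class PassThroughMapper
      extends Mapper<Object, Text, NullWritable, Text> {
    @Override
    protected void map(Object offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), line);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "map-only merge");
    job.setJarByClass(MapOnlyMerge.class);
    // Pack many small files into each split; the split cap bounds how
    // much input one mapper (and hence one output file) receives.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);  // map-only: no shuffle, no sort
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}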
