Re: Merging files

Something Something Wed, 31 Jul 2013 13:40:14 -0700

So you are saying we will first do a 'hadoop fs -count' to get the total # of
bytes for all files.  Let's say that comes to:  1538684305


Default Block Size is:  128M

So, total # of blocks needed:  1538684305 / 131072 = 11740

Max file blocks = 11740 / 50 (# of output files) = 234

Does this calculation look right?
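
A quick way to sanity-check the arithmetic above is to redo it with the block
size written out in bytes. The sketch below is only an illustration: the
variable names are made up, and it assumes "128M" means 128 MiB
(128 * 1024 * 1024 = 134217728 bytes) and that max-file-blocks counts
blocks per output file, as the quoted reply suggests.

```python
import math

# Figures taken from the message; names are illustrative only.
total_bytes = 1538684305            # total size of all part files
block_size = 128 * 1024 * 1024      # assumed 128 MiB = 134217728 bytes
                                    # (131072 would be 128 KiB, not 128 MiB)
output_files = 50                   # desired number of merged files

# Number of full-size blocks needed to hold all the data.
total_blocks = math.ceil(total_bytes / block_size)

# Average size of each output file, in bytes and in blocks.
bytes_per_output_file = total_bytes / output_files
blocks_per_output_file = bytes_per_output_file / block_size

print("total blocks:", total_blocks)
print("MiB per output file:", round(bytes_per_output_file / (1024 * 1024), 1))
print("blocks per output file:", round(blocks_per_output_file, 2))
```

Under these assumptions the data fits in roughly a dozen blocks, so each of 50
output files would be well under one block. Dividing by 131072 instead gives a
count in 128 KiB units rather than in 128 MiB blocks.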

On Wed, Jul 31, 2013 at 10:28 AM, John Meagher <[email protected]> wrote:

> It is file size based, not file count based.  For fewer files, raise the
> max-file-blocks setting.
>
> On Wed, Jul 31, 2013 at 12:21 PM, Something Something
> <[email protected]> wrote:
> > Thanks, John.  But I don't see an option to specify the # of output files.
> > How does Crush decide how many files to create?  Is it only based on file
> > sizes?
> >
> > On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <[email protected]> wrote:
> >
> >> Here's a great tool for handling exactly that case:
> >> https://github.com/edwardcapriolo/filecrush
> >>
> >> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
> >> <[email protected]> wrote:
> >> > Each bz2 file after merging is about 50 MB.  The reducers take about 9
> >> > minutes.
> >> >
> >> > Note:  'getmerge' is not an option.  There isn't enough disk space to do a
> >> > getmerge on the local production box.  Plus we need a scalable solution as
> >> > these files will get a lot bigger soon.
> >> >
> >> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <[email protected]> wrote:
> >> >
> >> >> How big are your 50 files?  How long are the reducers taking?
> >> >>
> >> >> On Jul 30, 2013, at 10:26 PM, Something Something <[email protected]> wrote:
> >> >>
> >> >> > Hello,
> >> >> >
> >> >> > One of our pig scripts creates over 500 small part files.  To save on
> >> >> > namespace, we need to cut down the # of files, so instead of saving 500
> >> >> > small files we need to merge them into 50.  We tried the following:
> >> >> >
> >> >> > 1)  When we set parallel number to 50, the Pig script takes a long time -
> >> >> > for obvious reasons.
> >> >> > 2)  If we use Hadoop Streaming, it puts some garbage values into the key
> >> >> > field.
> >> >> > 3)  We wrote our own MapReduce program that reads these 500 small part
> >> >> > files & uses 50 reducers.  Basically, the Mappers simply write the line &
> >> >> > reducers loop thru values & write them out.  We set
> >> >> > job.setOutputKeyClass(NullWritable.class) so that the key is not written to
> >> >> > the output file.  This is performing better than Pig.  The Mappers actually
> >> >> > run very fast and the Reducers take some time to complete, but this approach
> >> >> > seems to be working well.
> >> >> >
> >> >> > Is there a better way to do this?  What strategy can you think of to
> >> >> > increase the speed of the reducers?
> >> >> >
> >> >> > Any help in this regard will be greatly appreciated.  Thanks.
> >> >>
> >> >>
> >>
>
