Re: Merging files

John Meagher Wed, 31 Jul 2013 10:29:40 -0700

It is file-size based, not file-count based.  For fewer files, raise the
max-file-blocks setting.
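
To make the size-based behaviour concrete, here is a rough back-of-the-envelope sketch, not Crush's actual code. It assumes each output file is capped at (dfs block size x max-file-blocks) and that inputs pack perfectly, so it only gives a lower bound on the output file count. The class name, method name, and the 128 MB block size are made up for illustration; the 50 files of about 50 MB each come from the figures quoted further down the thread.

```java
public class CrushEstimate {

    /**
     * Lower-bound estimate of output files for a size-based merge.
     * Assumption (not taken from the Crush source): one output file holds
     * at most blockSizeBytes * maxFileBlocks bytes.
     */
    static long estimateOutputFiles(long totalInputBytes, long blockSizeBytes, int maxFileBlocks) {
        if (blockSizeBytes <= 0 || maxFileBlocks <= 0) {
            throw new IllegalArgumentException("block size and max-file-blocks must be positive");
        }
        if (totalInputBytes <= 0) {
            return 0;
        }
        long maxOutputBytes = blockSizeBytes * maxFileBlocks;
        // Ceiling division: a partially filled last file still counts as a file.
        return (totalInputBytes + maxOutputBytes - 1) / maxOutputBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long totalBytes = 50 * 50 * mb;   // 50 files at roughly 50 MB each (from the thread)
        long blockSize = 128 * mb;        // assumed dfs block size, check your cluster config

        for (int blocks : new int[] {1, 4, 8}) {
            System.out.println("max-file-blocks=" + blocks + " -> at least "
                    + estimateOutputFiles(totalBytes, blockSize, blocks) + " output files");
        }
    }
}
```

Under these assumptions, a larger max-file-blocks value means a bigger per-file cap and therefore fewer output files, which is why there is no direct "number of output files" option. The real count may be higher, since actual files rarely pack perfectly and compression ratios differ.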


On Wed, Jul 31, 2013 at 12:21 PM, Something Something
<mailinglist...@gmail.com> wrote:
> Thanks, John.  But I don't see an option to specify the # of output files.
>  How does Crush decide how many files to create?  Is it only based on file
> sizes?
>
> On Wed, Jul 31, 2013 at 6:28 AM, John Meagher <john.meag...@gmail.com> wrote:
>
>> Here's a great tool for handling exactly that case:
>> https://github.com/edwardcapriolo/filecrush
>>
>> On Wed, Jul 31, 2013 at 2:40 AM, Something Something
>> <mailinglist...@gmail.com> wrote:
>> > Each bz2 file after merging is about 50Megs.  The reducers take about 9
>> > minutes.
>> >
>> > Note:  'getmerge' is not an option.  There isn't enough disk space to do a
>> > getmerge on the local production box.  Plus we need a scalable solution as
>> > these files will get a lot bigger soon.
>> >
>> > On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <benjij...@gmail.com> wrote:
>> >
>> >> How big are your 50 files?  How long are the reducers taking?
>> >>
>> >> On Jul 30, 2013, at 10:26 PM, Something Something <
>> >> mailinglist...@gmail.com> wrote:
>> >>
>> >> > Hello,
>> >> >
>> >> > One of our pig scripts creates over 500 small part files.  To save on
>> >> > namespace, we need to cut down the # of files, so instead of saving 500
>> >> > small files we need to merge them into 50.  We tried the following:
>> >> >
>> >> > 1)  When we set parallel number to 50, the Pig script takes a long time -
>> >> > for obvious reasons.
>> >> > 2)  If we use Hadoop Streaming, it puts some garbage values into the key
>> >> > field.
>> >> > 3)  We wrote our own Map Reducer program that reads these 500 small part
>> >> > files & uses 50 reducers.  Basically, the Mappers simply write the line &
>> >> > reducers loop thru values & write them out.  We set
>> >> > job.setOutputKeyClass(NullWritable.class) so that the key is not written to
>> >> > the output file.  This is performing better than Pig.  Actually Mappers run
>> >> > very fast, but Reducers take some time to complete, but this approach seems
>> >> > to be working well.
>> >> >
>> >> > Is there a better way to do this?  What strategy can you think of to
>> >> > increase the speed of the reducers?
>> >> >
>> >> > Any help in this regard will be greatly appreciated.  Thanks.
>> >>
>> >>
>>
