Here's a great tool for handling exactly that case:
https://github.com/edwardcapriolo/filecrush
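
If you'd rather keep your own MapReduce job, the identity-style merge you
describe below would look roughly like the sketch here. This is only a
sketch against the new-API mapreduce classes; the class name, the paths and
the reducer count of 50 are assumptions you'd adjust for your cluster:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.BZip2Codec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MergeSmallFiles {

  // Mapper: pass every line through unchanged, keyed on the line itself
  // so the byte-offset key never reaches the output.
  public static class LineMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(line, NullWritable.get());
    }
  }

  // Reducer: write each occurrence of the line back out; with NullWritable
  // as the output key only the text appears in the output file.
  public static class LineReducer
      extends Reducer<Text, NullWritable, NullWritable, Text> {
    @Override
    protected void reduce(Text line, Iterable<NullWritable> occurrences,
        Context context) throws IOException, InterruptedException {
      for (NullWritable ignored : occurrences) {
        context.write(NullWritable.get(), line);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "merge-small-files");
    job.setJarByClass(MergeSmallFiles.class);
    job.setMapperClass(LineMapper.class);
    job.setReducerClass(LineReducer.class);
    job.setNumReduceTasks(50);                      // one output file per reducer
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(NullWritable.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    FileOutputFormat.setCompressOutput(job, true);  // bz2 output, as below
    FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You'd launch it with something like "hadoop jar merge.jar MergeSmallFiles
<in-dir> <out-dir>" (jar and class names are placeholders). Note that the
shuffle still sorts every line because the line is the key, which is likely
where your reducer time goes, so filecrush may still be the cheaper route.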

On Wed, Jul 31, 2013 at 2:40 AM, Something Something
<mailinglist...@gmail.com> wrote:
> Each bz2 file after merging is about 50 MB.  The reducers take about 9
> minutes.
>
> Note:  'getmerge' is not an option.  There isn't enough disk space to do a
> getmerge on the local production box.  Plus we need a scalable solution as
> these files will get a lot bigger soon.
>
> On Tue, Jul 30, 2013 at 10:34 PM, Ben Juhn <benjij...@gmail.com> wrote:
>
>> How big are your 50 files?  How long are the reducers taking?
>>
>> On Jul 30, 2013, at 10:26 PM, Something Something <
>> mailinglist...@gmail.com> wrote:
>>
>> > Hello,
>> >
>> > One of our Pig scripts creates over 500 small part files.  To save on
>> > namespace, we need to cut down the # of files, so instead of saving 500
>> > small files we need to merge them into 50.  We tried the following:
>> >
>> > 1)  When we set the parallel number to 50, the Pig script takes a long
>> > time - for obvious reasons.
>> > 2)  If we use Hadoop Streaming, it puts some garbage values into the key
>> > field.
>> > 3)  We wrote our own MapReduce program that reads these 500 small part
>> > files & uses 50 reducers.  Basically, the Mappers simply write each line &
>> > the reducers loop through the values & write them out.  We set
>> > job.setOutputKeyClass(NullWritable.class) so that the key is not written
>> > to the output file.  This is performing better than Pig.  The Mappers run
>> > very fast; the Reducers take some time to complete, but this approach
>> > seems to be working well.
>> >
>> > Is there a better way to do this?  What strategy can you think of to
>> > increase the speed of the reducers?
>> >
>> > Any help in this regard will be greatly appreciated.  Thanks.
>>
>>
