Re: Problem when using MultipleOutputs with many files

David Rosenstrauch Fri, 02 Sep 2011 08:09:46 -0700

On 09/02/2011 09:14 AM, Panagiotis Antonopoulos wrote:


Hello guys,

I am using hadoop-0.20.2-cdh3u0 and I use MultipleOutputs to divide the HFiles 
(which are the output of my MR job) so that each file can fit into one region 
of the table where I am going to bulk load them.

Therefore I have one MultipleOutput per region, and as a result I have 280 
different outputs.
I just realized that using so many outputs makes my job a lot slower than it is 
when I have just one output.

Do you know what goes wrong? Has anyone noticed the same?

Thank you!
Panagiotis

You're probably running into this bug, which crushes the performance of MultipleOutputs:


https://issues.apache.org/jira/browse/MAPREDUCE-1853

Apparently it's fixed in v0.21, so try to upgrade if you can.

I wasn't able to upgrade in our case, however (we were also using Cloudera CDH, which as you see is 0.20). What I eventually wound up doing to work around it was to use our own local copy of the MultipleOutputs class (I called it BugFixMultipleOutputs_0_20), which I manually patched with the fix.
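
To give a rough idea of what that kind of patch does: if I'm reading the ticket right, the slowdown comes from building a fresh task context on every write instead of reusing one per named output. The sketch below is only an illustration of that caching idea in plain Java. It does not use the Hadoop API at all; CachedContextSketch, OutputContext, and the method names are hypothetical stand-ins, and the real fix is whatever is attached to the JIRA.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: hypothetical stand-in classes, NOT the Hadoop API.
public class CachedContextSketch {

    // Stand-in for a per-output context object that is expensive to build.
    public static final class OutputContext {
        public final String namedOutput;

        OutputContext(String namedOutput) {
            this.namedOutput = namedOutput;
        }
    }

    // One cached context per named output.
    private final Map<String, OutputContext> contexts = new HashMap<>();

    // Counts how many contexts were actually constructed.
    private int contextsBuilt = 0;

    // Unpatched behaviour (as assumed here): a new context on every call.
    public OutputContext getContextUncached(String namedOutput) {
        contextsBuilt++;
        return new OutputContext(namedOutput);
    }

    // Patched behaviour (as assumed here): build once per name, then reuse.
    public OutputContext getContextCached(String namedOutput) {
        OutputContext ctx = contexts.get(namedOutput);
        if (ctx == null) {
            ctx = getContextUncached(namedOutput);
            contexts.put(namedOutput, ctx);
        }
        return ctx;
    }

    public int contextsBuilt() {
        return contextsBuilt;
    }

    public static void main(String[] args) {
        CachedContextSketch uncached = new CachedContextSketch();
        CachedContextSketch cached = new CachedContextSketch();

        // 28000 records spread evenly over 280 named outputs.
        for (int record = 0; record < 28000; record++) {
            String name = "region" + (record % 280);
            uncached.getContextUncached(name);
            cached.getContextCached(name);
        }

        System.out.println("uncached builds: " + uncached.contextsBuilt()); // 28000
        System.out.println("cached builds:   " + cached.contextsBuilt());   // 280
    }
}
```

With 280 outputs the cached version builds 280 contexts no matter how many records go through, while the uncached one builds a context per record.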


HTH,

DR
