On 12/10/2010 02:16 PM, Harsh J wrote:
Hi,
On Thu, Dec 2, 2010 at 10:40 PM, Matt Tanquary <matt.tanqu...@gmail.com> wrote:
I am using MultipleOutputs to split a mapper input into about 20
different files. Adding this split has had an extremely adverse effect
on performance. Is MultipleOutputs known for performing slowly?
There was a bug in MultipleOutputs which could've led to this. It has
been fixed in MAPREDUCE-1853. Should be in the next 0.21 maintenance
release as well as 0.22.
(And in the next CDH3 release, if you are using that).
Is there any workaround for this issue for those of us who are still
running 0.20?
I have a job that very much lends itself to using the MultipleOutputs
functionality, but this bug is absolutely crushing the job's performance.
Are there any ways to fix or work around this issue without having to a)
upgrade our cluster to 0.21, or b) completely re-write my job?
Thanks,
DR