I'm not doing a conventional join, but in my case one split/file consists of only one key-value pair. I'm not using the default mapper/reducer implementations. My guess is that the problem is that a combiner is only applied to the output of a single map task (an instance of the mapper class), and since one map task processes one split and I only have one key-value pair per split, there is nothing to combine. What I would need is either a combiner across multiple map tasks or a way to treat all splits on a datanode as one, so that there would be only one map task. Is there a way to do something like that? Reusing the JVM hasn't worked in my tests.
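
Just to make the second idea concrete, the direction I'm thinking of is something like the driver sketch below. It assumes a Hadoop version that ships CombineTextInputFormat (otherwise one would have to subclass CombineFileInputFormat and provide a record reader), the class and path names are only placeholders, and I haven't checked how this interacts with the join package:

// Rough sketch, not tested: pack many small files into one split so that a
// single map task (and hence a single combiner) sees records from several
// files. CombinedSplitDriver is a made-up name.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedSplitDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "combined-split job");
    job.setJarByClass(CombinedSplitDriver.class);

    // Group small files into combined splits of up to 256 MB each.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Mapper, combiner and reducer classes would be set here as usual.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}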

On 25.09.2012 at 15:40, Björn-Elmar Macek wrote:
Oops, sorry. You are using standard implementations? I don't know what's
happening then. Sorry. But the fact that your input size equals your
output size in a "join" process reminded me too much of my own problems.
Sorry for any confusion I may have caused.

Best,
On 25.09.2012 at 15:32, Björn-Elmar Macek <ma...@cs.uni-kassel.de> wrote:

Hi,

I had this problem once, too. Did you properly override the reduce
method and mark it with the @Override annotation?
Does your reduce method use OutputCollector or Context for gathering
outputs? If you are using the current version, it has to be Context.

The thing is: if you do NOT override it, the standard reduce function
(identity) is used, and this of course results in the same number of
tuples as you read as input.
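
For illustration, a correctly overridden reduce in the new (Context-based) API looks roughly like this; SumReducer and the IntWritable value type are just an example, not your code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // If the signature does not match the superclass exactly, @Override makes
    // the compiler complain instead of silently using the identity reduce.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}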

Good luck!
Elmar

On 25.09.2012 at 11:57, Sigurd Spieckermann <sigurd.spieckerm...@gmail.com> wrote:

I think I have tracked the problem down to the fact that each split
contains only one big key-value pair and a combiner is tied to a single
map task. Please correct me if I'm wrong, but I assume each map task
takes one split and the combiner operates only on the key-value pairs
within that split. That's why the combiner has no effect in my case.
Is there a way to combine the mapper outputs of multiple splits
before they are sent off to the reducer?

2012/9/25 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>

    Maybe one more note: the combiner and the reducer class are the
    same, and in the reduce phase the values get aggregated correctly.
    Why is this not happening in the combiner phase?
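
    To be explicit about the wiring, it is essentially the following (a sketch with placeholder class names, not my exact driver):

    // Sketch with placeholder names (MyMapper, SumReducer): the same class is
    // registered both as the combiner and as the reducer.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinerDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "combiner test");
        job.setJarByClass(CombinerDriver.class);

        job.setMapperClass(MyMapper.class);      // placeholder mapper
        job.setCombinerClass(SumReducer.class);  // same class used twice
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }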


    2012/9/25 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>

        Hi guys,

        I'm experiencing a strange behavior when I use the Hadoop
        join-package. After running a job the result statistics show
        that my combiner has an input of 100 records and an output of
        100 records. From the task I'm running and the way it's
        implemented, I know that each key appears multiple times and
        the values should be combinable before getting passed to the
        reducer. I'm running my tests in pseudo-distributed mode with
        one or two map tasks. Using the debugger, I noticed that each
        key-value pair is processed by the combiner individually, so
        there is actually no list of values passed into the combiner
        that it could aggregate. Can anyone think of a reason for this
        undesired behavior?
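
        For context, a join-package job is set up roughly along these lines (just a sketch using the old mapred API's CompositeInputFormat, with made-up paths and input format, not my exact code):

        // Sketch of a map-side join using the join package from the old mapred
        // API; input paths and KeyValueTextInputFormat are only placeholders.
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.KeyValueTextInputFormat;
        import org.apache.hadoop.mapred.join.CompositeInputFormat;

        public class JoinDriver {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JoinDriver.class);
            conf.setJobName("map-side join");

            // CompositeInputFormat merges the two inputs split by split; both
            // datasets must be partitioned and sorted the same way.
            conf.setInputFormat(CompositeInputFormat.class);
            conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path(args[0]), new Path(args[1])));

            FileOutputFormat.setOutputPath(conf, new Path(args[2]));

            // Mapper, combiner and reducer are set as usual, e.g.:
            // conf.setMapperClass(...); conf.setCombinerClass(...); conf.setReducerClass(...);

            JobClient.runJob(conf);
          }
        }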

        Thanks
        Sigurd




