I'm not doing a conventional join, but in my case one split/file consists of only one key-value pair. I'm not using the default mapper/reducer implementations. My guess is that the problem is that a combiner is only applied to the output of a single map task (an instance of the mapper class), and since one map task processes one split and I only have one key-value pair per split, there is nothing to combine. What I would need is either a combiner across multiple map tasks or a way to treat all splits on a datanode as one, so that there would be only one map task. Is there a way to do something like that? Reusing the JVM hasn't worked in my tests.
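
Just to make the second idea concrete, the direction I'm thinking of is something like the driver sketch below. It assumes a Hadoop version that ships CombineTextInputFormat (otherwise one would have to subclass CombineFileInputFormat and provide a record reader), the class and path names are only placeholders, and I haven't checked how this interacts with the join package:

// Rough sketch, not tested: pack many small files into one split so that a
// single map task (and hence a single combiner) sees records from several
// files. CombinedSplitDriver is a made-up name.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinedSplitDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "combined-split job");
    job.setJarByClass(CombinedSplitDriver.class);

    // Group small files into combined splits of up to 256 MB each.
    job.setInputFormatClass(CombineTextInputFormat.class);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Mapper, combiner and reducer classes would be set here as usual.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}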

On 25.09.2012 at 15:40, Björn-Elmar Macek wrote:
Oops, sorry. You are using standard implementations? I don't know what's
happening then. Sorry. But the fact that your input size equals your
output size in a "join" process reminded me too much of my own problems.
Sorry for any confusion I may have caused.

Best,
On 25.09.2012 at 15:32, Björn-Elmar Macek <ma...@cs.uni-kassel.de> wrote:

Hi,

I had this problem once, too. Did you properly override the reduce
method and mark it with the @Override annotation?
Does your reduce method use OutputCollector or Context for gathering
outputs? If you are using the current version, it has to be Context.

The thing is: if you do NOT override it, the standard reduce function
(identity) is used, and this of course results in the same number of
tuples as you read as input.
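
For illustration, a correctly overridden reduce in the new (Context-based) API looks roughly like this; SumReducer and the IntWritable value type are just an example, not your code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // If the signature does not match the superclass exactly, @Override makes
    // the compiler complain instead of silently using the identity reduce.
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}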

Good luck!
Elmar

On 25.09.2012 at 11:57, Sigurd Spieckermann <sigurd.spieckerm...@gmail.com> wrote:

I think I have tracked the problem down to the fact that each split
contains only one big key-value pair and a combiner is tied to a single
map task. Please correct me if I'm wrong, but I assume each map task
takes one split and the combiner operates only on the key-value pairs
within that split. That's why the combiner has no effect in my case.
Is there a way to combine the mapper outputs of multiple splits
before they are sent off to the reducer?

2012/9/25 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>

    Maybe one more note: the combiner and the reducer class are the
    same, and in the reduce phase the values get aggregated correctly.
    Why is this not happening in the combiner phase?
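
    To be explicit about the wiring, it is essentially the following (a sketch with placeholder class names, not my exact driver):

    // Sketch with placeholder names (MyMapper, SumReducer): the same class is
    // registered both as the combiner and as the reducer.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinerDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "combiner test");
        job.setJarByClass(CombinerDriver.class);

        job.setMapperClass(MyMapper.class);      // placeholder mapper
        job.setCombinerClass(SumReducer.class);  // same class used twice
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }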


    2012/9/25 Sigurd Spieckermann <sigurd.spieckerm...@gmail.com>

        Hi guys,

        I'm experiencing a strange behavior when I use the Hadoop
        join-package. After running a job the result statistics show
        that my combiner has an input of 100 records and an output of
        100 records. From the task I'm running and the way it's
        implemented, I know that each key appears multiple times and
        the values should be combinable before getting passed to the
        reducer. I'm running my tests in pseudo-distributed mode with
        one or two map tasks. Using the debugger, I noticed that each
        key-value pair is processed by the combiner individually, so
        there is actually no list of values passed into the combiner
        that it could aggregate. Can anyone think of a reason for this
        undesired behavior?
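
        For context, a join-package job is set up roughly along these lines (just a sketch using the old mapred API's CompositeInputFormat, with made-up paths and input format, not my exact code):

        // Sketch of a map-side join using the join package from the old mapred
        // API; input paths and KeyValueTextInputFormat are only placeholders.
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.mapred.FileOutputFormat;
        import org.apache.hadoop.mapred.JobClient;
        import org.apache.hadoop.mapred.JobConf;
        import org.apache.hadoop.mapred.KeyValueTextInputFormat;
        import org.apache.hadoop.mapred.join.CompositeInputFormat;

        public class JoinDriver {
          public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(JoinDriver.class);
            conf.setJobName("map-side join");

            // CompositeInputFormat merges the two inputs split by split; both
            // datasets must be partitioned and sorted the same way.
            conf.setInputFormat(CompositeInputFormat.class);
            conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path(args[0]), new Path(args[1])));

            FileOutputFormat.setOutputPath(conf, new Path(args[2]));

            // Mapper, combiner and reducer are set as usual, e.g.:
            // conf.setMapperClass(...); conf.setCombinerClass(...); conf.setReducerClass(...);

            JobClient.runJob(conf);
          }
        }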

        Thanks
        Sigurd




