Yeah, my understanding is that the iterator hands you a reference to
the same object each time and just refills it on every call to next().
This matches the behavior we've both observed, but maybe someone more
familiar with the internals can verify.
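
To make that concrete, here is a small plain-Java sketch of what I think
is going on. It does not use any Hadoop classes: Pair is a hypothetical
stand-in for LongDoublePair, and reusingIterator() is my assumption about
how the values iterator behaves, not the actual Hadoop implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReuseDemo {
    // Hypothetical stand-in for a custom Writable such as LongDoublePair.
    static final class Pair {
        long id;
        double score;

        // Field-by-field copy, playing the role of clone().
        Pair copy() {
            Pair p = new Pair();
            p.id = id;
            p.score = score;
            return p;
        }
    }

    // Assumed behaviour: one shared Pair, refilled on every call to next().
    static Iterator<Pair> reusingIterator(long[] ids, double[] scores) {
        final Pair shared = new Pair();
        return new Iterator<Pair>() {
            int i = 0;

            public boolean hasNext() {
                return i < ids.length;
            }

            public Pair next() {
                shared.id = ids[i];
                shared.score = scores[i];
                i++;
                return shared; // always the same reference
            }
        };
    }

    // Buffers every value, optionally copying it before storing it.
    static List<Pair> buffer(Iterator<Pair> it, boolean copyFirst) {
        List<Pair> out = new ArrayList<>();
        while (it.hasNext()) {
            Pair p = it.next();
            out.add(copyFirst ? p.copy() : p);
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 3};
        double[] scores = {0.5, 0.9, 0.1};

        List<Pair> aliased = buffer(reusingIterator(ids, scores), false);
        List<Pair> copied = buffer(reusingIterator(ids, scores), true);

        // Without copying, every list entry is the last item: prints "3 3"
        System.out.println(aliased.get(0).id + " " + aliased.get(2).id);
        // With copying, the buffered values stay distinct: prints "1 3"
        System.out.println(copied.get(0).id + " " + copied.get(2).id);
    }
}
```

If the real iterator works like this, then copying each value before
adding it to your list should be enough to fix the buffering.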

Also, in case you didn't know, you can use what's called a secondary
sort to make values come out of the iterator in sorted order. The
basic idea is to stuff the value attribute(s) you're sorting on into
the key, but then to specify a custom partitioner and custom grouper
that operate on just the original key attribute(s). See
org.apache.hadoop.examples.SecondarySort for a nice example. This lets
Hadoop internals do some of the heavy lifting and removes the
requirement that all values for a key fit in memory (though I guess if
you only care about the top 20, your space requirement is still O(1)).
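
Here is a rough plain-Java simulation of the idea, in case it helps. It
is only a sketch and deliberately uses no Hadoop classes: CompositeKey,
partition(), SORT and topN() are names I made up for illustration. In a
real job those three roles would be played by the custom partitioner,
the key's sort order and the custom grouper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {
    // Composite key: the original key plus the value attribute to sort on.
    static final class CompositeKey {
        final String naturalKey;
        final double score;

        CompositeKey(String naturalKey, double score) {
            this.naturalKey = naturalKey;
            this.score = score;
        }
    }

    // Partitioner role: looks only at the original key, so every record
    // for that key ends up in the same partition.
    static int partition(CompositeKey k, int numPartitions) {
        return (k.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Sort role: original key first, then score in descending order.
    static final Comparator<CompositeKey> SORT = (a, b) -> {
        int c = a.naturalKey.compareTo(b.naturalKey);
        return c != 0 ? c : Double.compare(b.score, a.score);
    };

    // Grouper role: a new group starts whenever the original key changes.
    // Values arrive already sorted, so keeping the top n needs no buffering.
    static List<String> topN(List<CompositeKey> keys, int n) {
        List<CompositeKey> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);

        List<String> out = new ArrayList<>();
        String current = null;
        int taken = 0;
        for (CompositeKey k : sorted) {
            if (!k.naturalKey.equals(current)) {
                current = k.naturalKey;
                taken = 0;
            }
            if (taken < n) {
                out.add(k.naturalKey + ":" + k.score);
                taken++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = Arrays.asList(
            new CompositeKey("a", 0.2),
            new CompositeKey("b", 0.7),
            new CompositeKey("a", 0.9),
            new CompositeKey("a", 0.5),
            new CompositeKey("b", 0.1));

        // prints [a:0.9, a:0.5, b:0.7, b:0.1]
        System.out.println(topN(keys, 2));
    }
}
```

The point is that the reducer only ever looks at the first n values of
each group, so nothing has to be collected and sorted in memory.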

Ed

On Mon, Feb 8, 2010 at 5:58 PM, James Hammerton
<james.hammer...@mendeley.com> wrote:
> Thanks, Ed. I'm copying the values into a list, sorting them, and then
> emitting the top 20, so yes, they are buffered. I'll try cloning each
> item tomorrow and see if that works.
>
> Does this mean the Iterator is returning the same pointer with each call to
> next() but with different contents being stored at that location each time?
> E.g. it returns a pointer to a buffer that gets filled with different
> contents each time you call the iterator?
>
> Regards,
>
> James
>
> On Mon, Feb 8, 2010 at 7:09 PM, Ed Mazur <ma...@cs.umass.edu> wrote:
>>
>> Hi James,
>>
>> I ran into something similar in the past and suspect the problem may
>> be in your reduce function. Are you buffering values from the
>> iterator? If you are, then you need to first clone the value when
>> taking it from the iterator (implement Cloneable in your custom
>> Writable). Otherwise they will all be references to the last item from
>> the iterator.
>>
>> Ed
>>
>> On Mon, Feb 8, 2010 at 12:23 PM, James Hammerton
>> <james.hammer...@mendeley.com> wrote:
>> > Hi,
>> >
>> > For a particular project I created a writable for holding a long and a
>> > double called LongDoublePair. My mapper outputs LongDoublePair values and
>> > the reducer receives an Iterable<LongDoublePair>.
>> >
>> > The problem is that when I try to use it, whilst I get the right number of
>> > elements in the Iterable, they are all copies of the same object! I tested
>> > that this was the case by using the following code in the loop that
>> > processes the pairs:
>> >
>> >             if (prev != null) {
>> >                 if (prev == next) {
>> >                     context.getCounter("MY COUNTERS",
>> >                         key.toString() + "Values are same object").increment(1);
>> >                 }
>> >             } else {
>> >                 prev = next;
>> >             }
>> >
>> > The counters appeared with all sorts of values, e.g. I got lots of lines
>> > like: "10/02/08 16:57:18 INFO mapred.JobClient:     990Values are same
>> > object=46", indicating that the iterator was iterating through copies of the
>> > same object.
>> >
>> > My code works if instead of using the LongDoublePair I use a Text object and
>> > simply concatenate the two number strings with a space to separate them and
>> > have the reducer parse the string into a LongDoublePair and process it.
>> >
>> > Via unit tests, I've ensured the LongDoublePair's serialisation and
>> > deserialisation code works, that hashCode and equals do what they should do,
>> > etc, but I can't seem to get this to work other than by falling back on
>> > using Text objects. Any ideas what might be going wrong?
>> >
>> > I've attached the source code for LongDoublePair to this email in case you
>> > can spot anything that might be behind the problem.
>> >
>> > James
>
> --
> James Hammerton | Senior Data Mining Engineer
> www.mendeley.com/profiles/james-hammerton
>
> Mendeley Limited | London, UK | www.mendeley.com
> Registered in England and Wales | Company Number 6419015