Michael,

Environment variables are available in Java, but the environment itself is
not shared between instances. I read your code - you are solving exactly the
same problem I am interested in - but I did not see how it works in a
distributed environment.

By the way, it occurs to me that JavaSpaces - a different approach to
distributed computing, largely eclipsed by Hadoop - could be used here! Just
keep one GigaSpaces instance running at all times, and you have your
auto-increment for any number of jobs. It is well suited to concurrent
processing and very fast.
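For the unique-id route Aaron suggested below, here is a rough sketch of
combining the reducer's output byte offset with a zero-padded reducer id. The
class and method names are just for illustration - they are not part of
Hadoop's API, and in a real job you would read the offset and task id from
your output stream and task attempt context.

```java
// Sketch of the <byte-offset><zero-padded-reducer-id> scheme: ids are
// unique (not sequential) across any number of reducers, with no
// cross-task coordination. Names here are illustrative, not Hadoop API.
public class UniqueIds {

    static long uniqueId(long byteOffset, int reducerId, int numReducers) {
        // Pad the reducer id to a fixed width so ids can never collide
        // across reducers; the byte offset is already unique within one
        // reducer's own output file.
        int width = String.valueOf(numReducers - 1).length();
        return Long.parseLong(
                byteOffset + String.format("%0" + width + "d", reducerId));
    }

    public static void main(String[] args) {
        // e.g. byte offset 12345 in reducer 7 of a 100-reducer job
        System.out.println(uniqueId(12345L, 7, 100)); // prints 1234507
    }
}
```

Since the reducer id occupies exactly the last `width` digits, this is
equivalent to `byteOffset * 10^width + reducerId`, so distinct
(offset, reducer) pairs always produce distinct ids.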

Thank you,
Mark

On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <michael.kl...@gmail.com> wrote:

>
> I posted an approach to this using streaming, but if the environment
> variables are available through the standard Java interface, this may work for you.
>
> http://www.mail-archive.com/core-u...@hadoop.apache.org/msg09079.html
>
> You'll have to be able to tolerate some small gaps in the ids.
>
> Michael
>
>
> Mark Kerzner wrote:
>
>>
>>
>> Aaron, although your notes are not a ready solution, they are a great
>> help.
>>
>> Thank you,
>> Mark
>>
>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <aa...@cloudera.com>
>> wrote:
>>
>>> There is no in-MapReduce mechanism for cross-task synchronization. You'll
>>> need to use something like ZooKeeper for this, or another external
>>> database. Note that this will greatly complicate your life.
>>>
>>> If I were you, I'd try to either redesign my pipeline to eliminate this
>>> need, or maybe get really clever. For example, do your numbers need to be
>>> sequential, or just unique?
>>>
>>> If the latter, then take the byte offset into the reducer's current output
>>> file and combine that with the reducer id (e.g.,
>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that they're
>>> all building unique sequences. If the former... rethink your pipeline? :)
>>>
>>> - Aaron
>>>
>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <markkerz...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I need to number all output records consecutively: 1, 2, 3, ...
>>> >
>>> > This is no problem with one reducer: make recordId an instance
>>> > variable in the Reducer class and set conf.setNumReduceTasks(1).
>>> >
>>> > However, the single reducer then becomes a bottleneck, and processing
>>> > needs force an architecture with multiple reducers. Can I have a global
>>> > variable for all reducers, which would give each one the next
>>> > consecutive recordId? In a database, this would be a unique
>>> > auto-increment key. How can I do it in MapReduce?
>>> >
>>> > Thank you
>>> >
>>>
>>>
>>
>>
