Michael, environment variables are available in Java, but the environment itself is not shared between instances. I read your code - you are solving exactly the same problem I am interested in - but I did not see how it works in a distributed environment.
By the way, it occurs to me that JavaSpaces, a different approach to distributed computing that was trumped by Hadoop, could be used here! Just run one instance with GigaSpaces at all times, and you've got your self-increment for any number of jobs. It is perfect for concurrent processing and very fast. (A rough sketch of the idea is at the bottom of this mail.)

Thank you,
Mark

On Wed, Oct 28, 2009 at 12:40 PM, Michael Klatt <michael.kl...@gmail.com> wrote:

> I posted an approach to this using streaming, but if the environment
> variables are available in the standard Java interface, this may work
> for you.
>
> http://www.mail-archive.com/core-u...@hadoop.apache.org/msg09079.html
>
> You'll have to be able to tolerate some small gaps in the ids.
>
> Michael
>
>
> Mark Kerzner wrote:
>
>> Aaron, although your notes are not a ready solution, they are a great
>> help.
>>
>> Thank you,
>> Mark
>>
>> On Tue, Oct 27, 2009 at 11:27 PM, Aaron Kimball <aa...@cloudera.com>
>> wrote:
>>
>>> There is no in-MapReduce mechanism for cross-task synchronization.
>>> You'll need to use something like ZooKeeper for this, or another
>>> external database. Note that this will greatly complicate your life.
>>>
>>> If I were you, I'd try to either redesign my pipeline elsewhere to
>>> eliminate this need, or maybe get really clever. For example, do your
>>> numbers need to be sequential, or just unique?
>>>
>>> If the latter, then take the byte offset into the reducer's current
>>> output file and combine that with the reducer id (e.g.,
>>> <current-byte-offset><zero-padded-reducer-id>) to guarantee that
>>> they're all building unique sequences. If the former... rethink your
>>> pipeline? :)
>>>
>>> - Aaron
>>>
>>> On Tue, Oct 27, 2009 at 8:55 PM, Mark Kerzner <markkerz...@gmail.com>
>>> wrote:
>>>
>>> > Hi,
>>> >
>>> > I need to number all output records consecutively, like 1, 2, 3...
>>> >
>>> > This is no problem with one reducer: make recordId an instance
>>> > variable in the Reducer class and set conf.setNumReduceTasks(1).
>>> >
>>> > However, this is an architectural decision forced by processing
>>> > needs, and the single reducer becomes a bottleneck. Can I have a
>>> > global variable for all reducers, which would give each one the next
>>> > consecutive recordId? In the database scenario, this would be the
>>> > unique autokey. How to do it in MapReduce?
>>> >
>>> > Thank you
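P.S. Here is a rough sketch of the JavaSpaces idea, just to show the shape of it. It assumes a JavaSpace proxy has already been obtained (via Jini lookup or the GigaSpaces finder, which I leave out), and the Counter entry and the nextId() helper are names I made up for illustration:

// Counter.java
import net.jini.core.entry.Entry;

// Entry holding the shared counter; JavaSpaces entries need a public
// no-arg constructor and public, non-primitive fields.
public class Counter implements Entry {
    public String name;
    public Long value;

    public Counter() {
    }

    public Counter(String name, Long value) {
        this.name = name;
        this.value = value;
    }
}

// SpaceSequence.java
import net.jini.core.lease.Lease;
import net.jini.space.JavaSpace;

public class SpaceSequence {

    // Seed the counter once, before any job starts.
    public static void init(JavaSpace space) throws Exception {
        space.write(new Counter("recordId", Long.valueOf(1L)), null, Lease.FOREVER);
    }

    // take() removes the counter entry atomically, so concurrent callers
    // (many reducers, many jobs) block until the incremented entry is
    // written back; that blocking take/write cycle is the "self-increment".
    public static long nextId(JavaSpace space) throws Exception {
        Counter template = new Counter("recordId", null);
        Counter counter = (Counter) space.take(template, null, Long.MAX_VALUE);
        long id = counter.value.longValue();
        counter.value = Long.valueOf(id + 1);
        space.write(counter, null, Lease.FOREVER);
        return id;
    }
}

Of course this makes the space the single place that hands out ids, but that is exactly the point of an auto-increment; the reducers just call nextId() instead of keeping their own counters.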
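P.P.S. And this is roughly how I read Aaron's "unique rather than sequential" suggestion, except that I use a simple per-reducer counter instead of the byte offset; the zero-padded reducer id still makes the ids unique across reducers. A sketch against the 0.20 mapred API; the class name and the id format are mine:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class UniqueIdReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private int partition;    // this reducer's id, 0..numReduceTasks-1
    private long counter = 0; // local record counter

    public void configure(JobConf job) {
        // the framework sets mapred.task.partition for each task
        partition = job.getInt("mapred.task.partition", 0);
    }

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            Text value = values.next();
            // unique but not consecutive: <local-counter><zero-padded-reducer-id>,
            // assuming fewer than 10,000 reducers
            String id = String.format("%d%04d", counter++, partition);
            output.collect(new Text(id), value);
        }
    }
}

So each reducer produces its own arithmetic sequence, and the padded suffix keeps the sequences from colliding; if I truly need consecutive numbers, I am back to the single reducer or the shared counter above.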