No, no, you misunderstand. I didn't mean to contact ZooKeeper for every single record.
Each map instance will contact ZooKeeper once for every X records it sees. The mapper gets a block of numbers, and that block becomes available only to that one mapper; the next mapper to ask for a block gets a different one. So this ID scheme can work in Integer space (since I'm memory constrained)...

Hmmm, I thought Pig or some parts of Hadoop had already been using ZooKeeper... anyway... I guess it doesn't even have to be ZooKeeper, just a transactional database...

I'll probably end up using random numbers or UUID as you suggest, after trying the synchronized version. hehe ;-)

On Fri, Apr 23, 2010 at 12:23 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

> On Apr 23, 2010, at 12:13 PM, hc busy wrote:
>
>> Is the Java class guaranteed to be unique? Or will I have to perform an
>> additional check after I join back?
>
> I'd check the Java docs, but AFAIK it is guaranteed.
>
> I don't know the performance of UUID vs Zookeeper, nor how Zookeeper
> generates its UUIDs. You could ask on that list. Pig does not currently
> have integration with Zookeeper.
>
> Alan.
>
>> I guess I see how I can connect to a zookeeper server inside my UDF to get
>> a block of, say 50k, Id's at a time and sequentially increase within the
>> block. Then the UDF connects again to get another block. This way I can get
>> a guaranteed unique ID. (And it's probably faster and smaller this way than
>> generating UUID)
>>
>> Does pig use zookeeper to do anything? Can I connect to that one if it
>> does?
>>
>> On Fri, Apr 23, 2010 at 12:08 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>>
>>> Unique identifiers are easy enough. Row ids (monotonically increasing
>>> values) are impossible because of the parallel nature of map reduce. If
>>> you just want to generate a unique identifier you can write a UDF to wrap
>>> Java's UUID class (or use the new GenericInvoker UDF if you're working off
>>> trunk).
>>>
>>> Alan.
>>>
>>> On Apr 23, 2010, at 11:48 AM, hc busy wrote:
>>>
>>>> Guys, is there a easy way to generate a unique row id that is guaranteed
>>>> to be unique?
>>>>
>>>> R = foreach T generate *, globally_unique() as id;
>>>>
>>>> The reason why I need this is because I have a really nasty memory
>>>> problem here and I can't perform a group on the entire row, so all I can
>>>> resort to is to split the alias into two aliases
>>>>
>>>> A1 = foreach R generate keys, id;
>>>> A2 = foreach R generate values, id;
>>>>
>>>> and operate on my A1, and then come back for the rest of the values
>>>> later.
>>>>
>>>> But it is important for the id's to be generated globally unique so that
>>>> different mappers don't all start at 1. Any suggestions?
>>>>
>>>> Thnx!
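
The block scheme described at the top of this message could look roughly like the sketch below. It is only an illustration under assumptions: BlockIdAllocator, BlockSource and InMemoryBlockSource are made-up names, and the in-memory counter merely stands in for whatever shared store (ZooKeeper, a transactional database) would hand out blocks in practice. It uses long for simplicity; an int works the same way as long as the total row count fits.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: each mapper hands out IDs from a block it reserved. */
final class BlockIdAllocator {

    /** Stand-in for the shared coordinator (ZooKeeper, a database, ...). */
    public interface BlockSource {
        /** Atomically reserves blockSize consecutive IDs; returns the first. */
        long reserveBlock(int blockSize);
    }

    /** In-memory BlockSource for illustration; a real one would be remote. */
    public static final class InMemoryBlockSource implements BlockSource {
        private final AtomicLong next = new AtomicLong(0);

        @Override
        public long reserveBlock(int blockSize) {
            return next.getAndAdd(blockSize);
        }
    }

    private final BlockSource source;
    private final int blockSize;
    private long nextId = 0;   // next ID to hand out
    private long blockEnd = 0; // exclusive end of the current block

    BlockIdAllocator(BlockSource source, int blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.source = source;
        this.blockSize = blockSize;
    }

    /** Returns the next ID, contacting the source only when a block runs out. */
    long nextId() {
        if (nextId == blockEnd) {
            // Current block is exhausted (or none reserved yet): get a new one.
            nextId = source.reserveBlock(blockSize);
            blockEnd = nextId + blockSize;
        }
        return nextId++;
    }
}
```

With a block size of 50k, each mapper would talk to the shared store once per 50k records. The IDs are unique across mappers but not contiguous, since a mapper that finishes early leaves the rest of its block unused.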