How do I generate a row id?

2010-04-23 Thread hc busy
Guys, is there an easy way to generate a row id that is guaranteed to be unique? R = foreach T generate *, globally_unique() as id; The reason why I need this is because I have a really nasty memory problem here and I can't perform a group on the entire row, so all I can resort to is to spl

Re: How do I generate a row id?

2010-04-23 Thread Alan Gates
Unique identifiers are easy enough. Row ids (monotonically increasing values) are impossible because of the parallel nature of map reduce. If you just want to generate a unique identifier you can write a UDF to wrap Java's UUID class (or use the new GenericInvoker UDF if you're working of
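
To illustrate the UUID suggestion above, here is a minimal sketch in plain Java. The class name RowUuid is hypothetical, and the Pig UDF wrapper is left out so the snippet stays self-contained; in a real UDF this call would sit inside the exec method. Note that UUID.randomUUID() produces random (version 4) UUIDs, so a collision is astronomically unlikely rather than strictly impossible.

```java
import java.util.UUID;

// Hypothetical helper, not part of Pig. A Pig UDF would call next()
// once per input tuple and append the result as the id field.
final class RowUuid {
    private RowUuid() {}

    /** Returns a random (version 4) UUID as a 36-character string. */
    static String next() {
        return UUID.randomUUID().toString();
    }
}
```

The values are unique identifiers only; they carry no ordering, which matches the point above that monotonically increasing row ids are a different problem.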

Re: How do I generate a row id?

2010-04-23 Thread hc busy
Is the Java class guaranteed to be unique? Or will I have to perform an additional check after I join back? I guess I see how I can connect to a zookeeper server inside my UDF to get a block of, say, 50k IDs at a time and sequentially increase within the block. Then the UDF connects again to get

Re: How do I generate a row id?

2010-04-23 Thread Dmitriy Ryaboy
You can certainly connect to zookeeper, but you don't really need to (relying on zookeeper to do atomic increments may not scale if you are doing this for millions of records, though I haven't done timings. Y! people?) Just grab the task id from the jobconf and use it as a uuid prefix. Details ab
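
The task-id-prefix idea can be sketched as follows. This is only an illustration under stated assumptions: the class name TaskPrefixedIds is hypothetical, and the task id is passed in as a plain string because the message above does not show how it is read from the job configuration. Uniqueness across the job relies on every task having a distinct task id.

```java
// Hypothetical sketch: combine a per-task prefix with a local counter.
// No coordination between tasks is needed, as long as task ids differ.
final class TaskPrefixedIds {
    private final String taskId;   // assumed to be unique per task
    private long counter = 0;      // local to this task, starts at 0

    TaskPrefixedIds(String taskId) {
        this.taskId = taskId;
    }

    /** Returns the task id, a dash, and the next local sequence number. */
    String next() {
        return taskId + "-" + (counter++);
    }
}
```

Within one task the ids increase sequentially; across tasks they are unique but not globally ordered.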

Re: How do I generate a row id?

2010-04-23 Thread Alan Gates
On Apr 23, 2010, at 12:13 PM, hc busy wrote: "Is the Java class guaranteed to be unique? Or will I have to perform an additional check after I join back?" I'd check the Java docs, but AFAIK it is guaranteed. I don't know the performance of UUID vs Zookeeper, nor how Zookeeper generates its

Re: How do I generate a row id?

2010-04-23 Thread hc busy
No, no, you misunderstand. I didn't mean to contact zookeeper for every single record. Each map instance will contact zookeeper once for every X number of records it sees. What the mapper portion does is it gets a block of numbers, and that block of numbers becomes available only to that one mapper, the
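
The block-reservation scheme described above can be sketched like this. Everything here is an assumption for illustration: BlockAllocator is an in-process stand-in for the shared coordinator (ZooKeeper in this thread), implemented with an AtomicLong rather than any real ZooKeeper call, and BlockIdGenerator plays the role of one mapper. The point is only that the coordinator is contacted once per block, not once per record.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the shared coordinator. A real job would
// replace reserveBlock() with an atomic increment on a shared service.
final class BlockAllocator {
    private final AtomicLong nextBlockStart = new AtomicLong(0);
    private final long blockSize;

    BlockAllocator(long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        this.blockSize = blockSize;
    }

    long blockSize() {
        return blockSize;
    }

    /** Atomically reserves the next block and returns its first id. */
    long reserveBlock() {
        return nextBlockStart.getAndAdd(blockSize);
    }
}

// One instance per mapper: hands out ids from its current block and
// goes back to the allocator only when the block is used up.
final class BlockIdGenerator {
    private final BlockAllocator allocator;
    private long next = 0;  // next id to hand out
    private long end = 0;   // exclusive end of the current block

    BlockIdGenerator(BlockAllocator allocator) {
        this.allocator = allocator;
    }

    long nextId() {
        if (next == end) {  // no block yet, or block exhausted
            next = allocator.reserveBlock();
            end = next + allocator.blockSize();
        }
        return next++;
    }
}
```

With a block size of 50,000, a mapper that sees one million records would contact the coordinator 20 times. Ids are unique across mappers but leave gaps wherever a mapper finishes before using its whole block.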