No, no, you misunderstand. I didn't mean contacting ZooKeeper for every
single record.

Each map instance will contact ZooKeeper once for every X records it
sees. The mapper requests a block of numbers, and that block becomes
available only to that one mapper; the next mapper to ask for a block
gets a different one.
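
Roughly what I have in mind, as a minimal Java sketch. The names
BlockSource and BlockIdAllocator are made up for illustration, and the
AtomicInteger is only an in-process stand-in for the shared counter that
would really live in ZooKeeper or a transactional database:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the shared, transactional counter. In the real
// setup this would be ZooKeeper or a database; here it is in-process only.
final class BlockSource {
    private final AtomicInteger nextBlockStart = new AtomicInteger(0);
    private final int blockSize;

    BlockSource(int blockSize) {
        this.blockSize = blockSize;
    }

    int blockSize() {
        return blockSize;
    }

    // Atomically reserves the range [start, start + blockSize) and returns
    // start, so no two callers ever receive overlapping ranges.
    int reserveBlock() {
        return nextBlockStart.getAndAdd(blockSize);
    }
}

// One instance per mapper. IDs come out of the current block, and the shared
// source is contacted only when that block is used up.
// Note: int overflow is not handled in this sketch.
final class BlockIdAllocator {
    private final BlockSource source;
    private int next; // next ID to hand out
    private int end;  // exclusive end of the current block

    BlockIdAllocator(BlockSource source) {
        this.source = source;
        // next == end == 0, so the first nextId() call reserves a block.
    }

    int nextId() {
        if (next == end) {
            next = source.reserveBlock();
            end = next + source.blockSize();
        }
        return next++;
    }
}
```

With a block size of 50k, each mapper would make one round trip per 50,000
records, and the IDs still fit in an int.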

So this ID scheme can work in Integer space (since I'm memory
constrained)... Hmmm, I thought Pig or some parts of Hadoop were already
using ZooKeeper... anyway, I guess it doesn't even have to be ZooKeeper,
just a transactional database...

I'll probably end up using random numbers or UUIDs as you suggest, after
trying the synchronized version. hehe ;-)
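
For comparison, the UUID fallback is tiny. This is only a sketch of the
core call, without the Pig UDF wrapper around it, and the class name
UuidIds is made up. As far as I know, random UUIDs are unique only with
overwhelming probability rather than by strict guarantee:

```java
import java.util.UUID;

// Hypothetical helper: the part a Pig UDF would wrap.
final class UuidIds {
    // Returns a random (version 4) UUID in its 36-character string form.
    // No coordination between mappers is needed, but each ID is much larger
    // than an int.
    static String newId() {
        return UUID.randomUUID().toString();
    }
}
```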



On Fri, Apr 23, 2010 at 12:23 PM, Alan Gates <ga...@yahoo-inc.com> wrote:

>
> On Apr 23, 2010, at 12:13 PM, hc busy wrote:
>
>> Is the Java class guaranteed to be unique? Or will I have to perform an
>> additional check after I join back?
>>
>
> I'd check the Java docs, but AFAIK it is guaranteed.
>
> I don't know the performance of UUID vs Zookeeper, nor how Zookeeper
> generates its UUIDs.  You could ask on that list.  Pig does not currently
> have integration with Zookeeper.
>
> Alan.
>
>
>
>> I guess I see how I can connect to a zookeeper server inside my UDF to get a
>> block of, say 50k, Id's at a time and sequentially increase within the
>> block. Then the UDF connects again to get another block. This way I can get
>> a guaranteed unique ID. (And it's probably faster and smaller this way than
>> generating UUID)
>>
>> Does pig use zookeeper to do anything? Can I connect to that one if it does?
>>
>>
>>
>> On Fri, Apr 23, 2010 at 12:08 PM, Alan Gates <ga...@yahoo-inc.com> wrote:
>>
>>> Unique identifiers are easy enough.  Row ids (monotonically increasing
>>> values) are impossible because of the parallel nature of map reduce.  If you
>>> just want to generate a unique identifier you can write a UDF to wrap Java's
>>> UUID class (or use the new GenericInvoker UDF if you're working off trunk).
>>>
>>> Alan.
>>>
>>>
>>> On Apr 23, 2010, at 11:48 AM, hc busy wrote:
>>>
>>>> Guys, is there an easy way to generate a row id that is guaranteed to
>>>> be unique?
>>>>
>>>> R = foreach T generate *, globally_unique() as id;
>>>>
>>>> The reason why I need this is because I have a really nasty memory problem
>>>> here and I can't perform a group on the entire row, so all I can resort to
>>>> is to split the alias into two aliases
>>>>
>>>> A1 = foreach R generate keys, id;
>>>> A2 = foreach R generate values, id;
>>>>
>>>>
>>>> and operate on my A1, and then come back for the rest of the values later.
>>>>
>>>>
>>>> But it is important for the id's to be generated globally unique so that
>>>> different mappers don't all start at 1. Any suggestions?
>>>>
>>>>
>>>> Thnx!
>>>>
>>>>
>>>
>>>
>
