On Sun, Oct 28, 2012 at 9:15 PM, David Parks <davidpark...@yahoo.com> wrote:

> I need a unique & permanent ID assigned to each new item encountered, which
> has a constraint that it is in the range of, let’s say for simple discussion,
> one to one million.
>

Having such a limited range may require a central service to generate IDs.
The use of a central service can be disastrous for throughput.


> I suppose I could assign a range of usable IDs to each reduce task
> (where IDs are assigned) and keep those organized somehow at the end of
> the job, but this seems clunky too.
>

Yes.  Much better.
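
A rough sketch of that block idea inside a Hadoop reduce task (my own
illustration; the class name, key/value types, and block math are assumptions,
not anything from your job).  Each task derives a disjoint slice of the
1..1,000,000 space from its partition number, so no task ever has to talk to a
central service:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce task i owns the ID block [i * blockSize + 1, (i + 1) * blockSize].
    public class IdAssigningReducer extends Reducer<Text, Text, Text, LongWritable> {
        private static final long ID_SPACE = 1000000L;   // the 1..1,000,000 budget
        private long nextId;
        private long blockEnd;

        @Override
        protected void setup(Context context) {
            int numTasks = context.getNumReduceTasks();
            int taskIndex = context.getTaskAttemptID().getTaskID().getId();
            long blockSize = ID_SPACE / numTasks;
            nextId = taskIndex * blockSize + 1;
            blockEnd = nextId + blockSize - 1;
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Hand out the next ID from this task's private block.
            if (nextId > blockEnd) {
                throw new IllegalStateException("ID block exhausted for this reduce task");
            }
            context.write(key, new LongWritable(nextId++));
        }
    }

A retried task attempt starts over at the same block, so you only get
duplicates if a failed attempt leaked IDs somewhere outside the job's normal
output.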


> Since this is on AWS, ZooKeeper is not a good option. I thought it was
> part of the Hadoop cluster (and thus easy to access), but I guess I was
> wrong there.
>

No.  This is specifically not part of Hadoop for performance reasons.


> I would think that such a service would run most logically on the
> taskmaster server. I’m surprised this isn’t a common issue. I guess I could
> launch a separate job that runs such a sequence service, but that’s
> non-trivial itself, with failure concerns.
>

The problem is that a serial number service is a major loss of performance
in a parallel system.  Unless you relax the idea considerably (by allowing
blocks, or by having lots of bits like Snowflake), you wind up with a
round-trip per ID and a critical section on the ID generator.  This is bad.

Look up Amdahl's Law.
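
To put a number on it (my illustrative figures, not anything from your job):
if a fraction s of each record's work is serialized behind the ID service,
Amdahl's Law caps the speedup on N machines at

    S(N) = 1 / (s + (1 - s) / N)

so with s = 0.05 and N = 100 you top out around 1 / (0.05 + 0.95/100) ~ 17x,
and adding machines beyond that buys you almost nothing.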


> Perhaps there’s just a better way of thinking of this?
>

Yes.  Use lots of bits and be satisfied with uniqueness rather than perfect
ordering and limited range.

As the other respondent said, look up Snowflake.
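
The core of the Snowflake idea, sketched in Java (my own simplification, not
Twitter's actual code): pack a millisecond timestamp, a worker ID, and a
per-millisecond sequence into a single 64-bit long.  Each worker then
generates IDs locally; the only coordination is handing each worker its
workerId once at startup:

    // 41 bits of timestamp + 10 bits of worker ID + 12 bits of sequence.
    public class SnowflakeLikeIdGenerator {
        private static final long EPOCH = 1351000000000L;    // arbitrary custom epoch (ms)
        private static final int WORKER_BITS = 10;           // up to 1024 workers
        private static final int SEQUENCE_BITS = 12;          // 4096 IDs per worker per ms
        private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

        private final long workerId;
        private long lastTimestamp = -1L;
        private long sequence = 0L;

        public SnowflakeLikeIdGenerator(long workerId) {
            this.workerId = workerId;
        }

        public synchronized long nextId() {
            long now = System.currentTimeMillis();
            if (now == lastTimestamp) {
                sequence = (sequence + 1) & MAX_SEQUENCE;
                if (sequence == 0) {
                    // Sequence exhausted for this millisecond: spin until the clock ticks.
                    while (now <= lastTimestamp) {
                        now = System.currentTimeMillis();
                    }
                }
            } else {
                sequence = 0;
            }
            // (A real implementation would also guard against the clock moving backwards.)
            lastTimestamp = now;
            return ((now - EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                    | (workerId << SEQUENCE_BITS)
                    | sequence;
        }
    }

The IDs are 64 bits instead of 1..1,000,000, which is exactly the trade I'm
suggesting: give up the small range and the central service goes away with it.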
