On Sun, Oct 28, 2012 at 9:15 PM, David Parks <davidpark...@yahoo.com> wrote:
> I need a unique & permanent ID assigned to each new item encountered, which
> has a constraint that it is in the range of, let's say for simple
> discussion, one to one million.

Having such a limited range may require a central service to generate the
IDs, and a central service can be disastrous for throughput.

> I suppose I could assign a range of usable IDs to each reduce task (where
> IDs are assigned) and keep those organized somehow at the end of the job,
> but this seems clunky too.

Yes. That is much better than a central service; a sketch of the idea
follows at the end of this message.

> Since this is on AWS, zookeeper is not a good option. I thought it was part
> of the hadoop cluster (and thus easy to access), but guess I was wrong
> there.

No. ZooKeeper is specifically not part of Hadoop, for performance reasons.

> I would think that such a service would run most logically on the
> taskmaster server. I'm surprised this isn't a common issue. I guess I could
> launch a separate job that runs such a sequence service, perhaps, but
> that's non-trivial itself, with failure concerns.

The problem is that a serial-number service is a major loss of performance
in a parallel system. Unless you relax the idea considerably (by allowing
blocks of IDs, or by using lots of bits, as Snowflake does), you wind up
with a round trip per ID, and the ID generator becomes a critical section.
This is bad. Look up Amdahl's Law; a worked example follows below.

> Perhaps there's just a better way of thinking of this?

Yes. Use lots of bits and be satisfied with uniqueness rather than perfect
ordering and a limited range. As the other respondent said, look up
Snowflake.
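To make the block-per-reducer idea concrete, here is a minimal standalone
sketch in Java. It assumes the 1..1,000,000 range from above and a fixed
upper bound on the number of reduce tasks (the maxTasks parameter and class
name are my own, purely illustrative); in an actual Hadoop reducer the task
index could be derived from context.getTaskAttemptID().getTaskID().getId().

    // Sketch: each reduce task claims a disjoint block of the 1..1M range
    // up front, so IDs are assigned with no runtime coordination at all.
    public class BlockIdAssigner {
        private static final long RANGE_MAX = 1_000_000L; // the 1..1M constraint
        private final long blockStart; // first ID in this task's block (inclusive)
        private final long blockEnd;   // last ID in this task's block (inclusive)
        private long next;

        // taskIndex: 0-based index of this reduce task; maxTasks: the
        // configured number of reduce tasks for the job.
        public BlockIdAssigner(int taskIndex, int maxTasks) {
            long blockSize = RANGE_MAX / maxTasks;
            this.blockStart = taskIndex * blockSize + 1;
            // The last task absorbs the remainder so the whole range is covered.
            this.blockEnd = (taskIndex == maxTasks - 1)
                    ? RANGE_MAX
                    : blockStart + blockSize - 1;
            this.next = blockStart;
        }

        public long nextId() {
            if (next > blockEnd) {
                throw new IllegalStateException("ID block exhausted for this task");
            }
            return next++;
        }

        public static void main(String[] args) {
            BlockIdAssigner ids = new BlockIdAssigner(3, 100); // task 3 of 100
            System.out.println(ids.nextId()); // 30001
            System.out.println(ids.nextId()); // 30002
        }
    }

If one task can emit more items than its block holds, you either need
bigger blocks (fewer tasks) or the lots-of-bits approach sketched further
down.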
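For the Amdahl's Law point above, the bound on speedup looks like this
(the 5% figure below is an assumption, chosen only to show the shape of
the bound):

    \[
      S(N) \;=\; \frac{1}{s + \frac{1 - s}{N}},
      \qquad
      \lim_{N \to \infty} S(N) \;=\; \frac{1}{s}
    \]

Here s is the serial fraction of the work, which is exactly what the
synchronized round trip to a central ID server becomes. If that round trip
serializes even 5% of the per-record work (s = 0.05), the whole cluster is
capped at 20x the speed of one machine, no matter how many nodes you add.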
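And to illustrate the lots-of-bits alternative, a minimal Snowflake-style
sketch, again standalone Java. The 41/10/12 bit split follows the commonly
described Snowflake layout, but the custom epoch and all names here are my
own assumptions, not Twitter's code. Note it deliberately gives up the
one-to-one-million range in exchange for coordination-free uniqueness.

    // Sketch of a Snowflake-style 64-bit ID: ~41 bits of milliseconds since
    // a custom epoch, 10 bits of worker ID, 12 bits of per-millisecond
    // sequence. Each worker generates IDs locally with no shared service.
    public class SnowflakeSketch {
        private static final long CUSTOM_EPOCH = 1351000000000L; // arbitrary, ~Oct 2012
        private static final int WORKER_BITS = 10;
        private static final int SEQUENCE_BITS = 12;
        private static final long MAX_SEQUENCE = (1L << SEQUENCE_BITS) - 1;

        private final long workerId; // must be unique per process, 0..1023
        private long lastMillis = -1L;
        private long sequence = 0L;

        public SnowflakeSketch(long workerId) {
            if (workerId < 0 || workerId >= (1L << WORKER_BITS)) {
                throw new IllegalArgumentException("workerId out of range");
            }
            this.workerId = workerId;
        }

        public synchronized long nextId() {
            long now = System.currentTimeMillis();
            if (now < lastMillis) {
                throw new IllegalStateException("clock moved backwards; refusing to issue IDs");
            }
            if (now == lastMillis) {
                sequence = (sequence + 1) & MAX_SEQUENCE;
                if (sequence == 0) { // 4096 IDs this millisecond: wait for the next tick
                    while (now <= lastMillis) {
                        now = System.currentTimeMillis();
                    }
                }
            } else {
                sequence = 0;
            }
            lastMillis = now;
            return ((now - CUSTOM_EPOCH) << (WORKER_BITS + SEQUENCE_BITS))
                    | (workerId << SEQUENCE_BITS)
                    | sequence;
        }

        public static void main(String[] args) {
            SnowflakeSketch gen = new SnowflakeSketch(42);
            System.out.println(gen.nextId());
            System.out.println(gen.nextId());
        }
    }

The only coordination left is handing each worker a distinct workerId once,
at startup, which is a far cheaper problem than a round trip per ID.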