I would recommend using KeptCollections to get a close estimate of the currently live machines.
When you need to process a document, select a processor at random from the collection. Then if that processor manages to tell you that the document has been accepted and successfully processed, you can move to the next document. If you don't get confirmation for any reason (timeout, connection loss, whatever), you pick another random processor from the collection (using the kept collection so that you know who is live) and also add the misbehaving processor to a local or global temporary black-list. If nothing ever goes down, this does a pretty much perfect job. Adding a node also works perfectly. Taking a node out in an orderly way can also work seamless if the node first removes itself from the collection of live processors and then finishes all pending requests plus waits a few seconds to see if any more requests come through. In the presence you have a few (inevitable) problems. For instance, a node can fail in the tiny window between seeing that the node exists and sending a request. That shouldn't be much of a since the connect will fail. Another mild problem is that a node may fail in such a way that ZK takes 30 seconds or so to be sure that it is gone. Again, connection failure will let this be handled gracefully. The only serious error here is when you send a document to a processor which successfully handles the document, commits it downstream but then fails before reporting the success to you. This will lead to double commit of the processed document. Without a reliable central repository, it is impossible to avoid some error like this. On Fri, Mar 18, 2011 at 9:41 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > My app listens to this stream. Because of the high document ingestion rate > I > need N instances of my app to listen to this stream. So all N apps listen > and > they all "get" the same documents, but only 1 app actually processes each > document -- "if (docID mod N == appID) then process doc" -- the usual > consistent > hashing stuff. I'd like to be able to add and remove apps dynamically and > have > the remaining apps realize that "N" has changed. Similarly, when some app > instance dies and thus "N" changes, I'd like all the remaining instances to > know > about it. > > If my apps don't know the correct "N" then 1/Nth of docs will go > unprocessed (if > the app died or was removed) until the remaining apps adjust their local > value > of "N". >