On Thu, Jul 14, 2011 at 5:46 PM, Jordan Zimmerman <[email protected]> wrote:
> But the implication is still the same. After each item processing,
> getChildren() will get called. This seems incredibly inefficient to me.
> Further, the next item must wait until the getChildren() returns.

I don't think that a wait is required at all. There will inevitably be some delay between the insertion of a new item and the recognition of that item by all of the workers. If some of the workers have already started on an out-of-order item, or if they don't find out about a new high-priority item for a short period of time, that isn't the end of the world.

The notification will happen shortly after the deletion, in a separate thread. If additional deletions have happened before the watcher fires, you will still only get one notice covering multiple changes.

> Imagine a system that is putting hundreds (1000s?) of items in this queue.
> You're going to get an ugly stampede. For each Zk client processing the
> queue, you will require n (number of messages) trips to getChildren().

OK. Let's take that even further. Suppose that we have 100,000 todo items and 1000 workers, and that each todo takes 1 second and each getChildren takes 200 ms because of a slow network and such. Workers will complete tasks at a rate of 1000 tasks per second and will be sending deletes at that rate. The watcher threads in each worker will be continually firing in order to update the state of the queue, so each client will get updates every 200 ms. The average delay before *some* client finds out about the most current state that it can see will be less than 1 ms. The delay before a new item is reflected in some client's state will be 200 ms (because that is how long it takes to find out about anything). On average, 5000 getChildren calls will be firing each second, as well as 1000 inserts and 1000 deletes, for a total of 7000 operations per second. This is entirely doable.
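To make the arithmetic above easy to check, here is a minimal sketch of the back-of-envelope calculation. The class and method names are mine, and it assumes a steady state where inserts keep pace with deletes (as implied by the constant backlog):

```java
// Back-of-envelope check of the 7000 ops/sec figure: 1000 workers,
// 1000 ms per task, 200 ms per getChildren. Hypothetical names, not ZK API.
public class QueueLoadEstimate {
    public static int totalOpsPerSecond(int workers, int taskMillis, int getChildrenMillis) {
        int deletesPerSec = workers * 1000 / taskMillis;            // 1000 completions/s -> 1000 deletes/s
        int insertsPerSec = deletesPerSec;                          // steady state: inserts keep pace
        int getChildrenPerSec = workers * 1000 / getChildrenMillis; // each client re-reads every 200 ms -> 5000/s
        return insertsPerSec + deletesPerSec + getChildrenPerSec;
    }

    public static void main(String[] args) {
        System.out.println(totalOpsPerSecond(1000, 1000, 200)); // prints 7000
    }
}
```

The same sketch also shows where the load comes from: the getChildren traffic dominates, which is what the mitigations below attack.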
In more reasonable scenarios, the number of pending tasks will be much, much smaller, and the time to read the pending ops will probably be as small as a few milliseconds, possibly tens of milliseconds. On a fully active queue, this will lead to the getChildren calls saturating ZK, as you mentioned.

This can be mitigated by having each client only do the getChildren when there is an available local worker thread. If no such worker is available, then a flag should be set. When a worker finishes and finds the flag set, it should pick the next op and start an asynchronous getChildren call. This will decrease the total operations to about 3000 per second.

A higher-performance design involves putting several queue items into each znode. Each worker should commit to doing several items on the list and should send back an atomic update to the znode. Upon completion, completion notifications should be batched up for a short period of time before doing another atomic update together with a scan for changed task lists. With good batching, performance in this model can be very high and latency can be bounded. With the previous scenario, but with a 10 ms task time, we can keep the number of ops down to 4000 per second while pushing through 100x more operations. You still have roughly a 1 second delay before all workers switch to higher-priority tasks, but only a few milliseconds before *some* worker switches to them.

These ideas of batching tasks and soft recognition of high-priority tasks are common in distributed systems, since they allow you to amortize the cost of coordination over a larger number of operations. If you require no batching and hard-and-fast recognition of high-priority items, then you inherently have high-frequency broadcast storms, which are the simple consequence of those hard constraints.
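The flag-based mitigation can be sketched in a few lines. This is a simulation of the coordination logic only: the refresh() call stands in for an asynchronous getChildren, and the class itself (DeferredRefresh) is a hypothetical name, not part of the ZooKeeper API. The point it demonstrates is that any number of watch notifications arriving while all workers are busy collapse into a single re-read:

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of "only re-read the queue when a worker is free". Change
// notifications that arrive while all workers are busy just set a dirty
// flag; the next worker to finish performs one getChildren for all of them.
public class DeferredRefresh {
    private final AtomicBoolean dirty = new AtomicBoolean(false);
    private final AtomicInteger idleWorkers;
    private final AtomicInteger refreshes = new AtomicInteger(0);

    public DeferredRefresh(int idleWorkerCount) {
        this.idleWorkers = new AtomicInteger(idleWorkerCount);
    }

    // Called from the watcher thread when the queue's children change.
    public void onQueueChanged() {
        if (idleWorkers.get() > 0) {
            refresh();       // a worker is free: read the children right away
        } else {
            dirty.set(true); // all workers busy: just remember that something changed
        }
    }

    // Called by a worker thread when it finishes its current item.
    public void onWorkerFinished() {
        idleWorkers.incrementAndGet();
        if (dirty.compareAndSet(true, false)) {
            refresh();       // one read coalesces any number of pending changes
        }
    }

    // Stand-in for an asynchronous getChildren call.
    private void refresh() {
        refreshes.incrementAndGet();
    }

    public int refreshCount() {
        return refreshes.get();
    }

    public static void main(String[] args) {
        DeferredRefresh d = new DeferredRefresh(0); // all workers busy
        d.onQueueChanged();                         // three changes arrive...
        d.onQueueChanged();
        d.onQueueChanged();
        d.onWorkerFinished();                       // ...coalesced into one read
        System.out.println(d.refreshCount());       // prints 1
    }
}
```

With one re-read per completed task instead of one per change notification, the getChildren load drops to roughly the task-completion rate, which is where the 3000 ops/sec figure comes from.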
