Hi

I have been reading a lot about ZooKeeper lately because we have the 
requirement to distribute our workflow engine (for performance and reliability 
reasons). Currently we have a single instance; if it fails, no work gets done 
anymore. One approach mentioned on this mailing list and in online documents is 
a master-worker setup: the master is made reliable by running multiple 
instances and electing a leader. This setup should work unless there are too 
many events for a single master to dispatch. I am a bit concerned about that, 
because the work dispatcher currently does no IO and is blazing fast. If the 
work dispatcher (scheduler) now has to communicate over the network, it might 
become too slow. So I thought about a different solution and would like to hear 
what you think about it.

Let’s say each workflow engine instance (referred to as a node from now on) 
registers an ephemeral znode under /nodes. Each node installs a children watch 
on /nodes and uses this information to populate a consistent hash ring. Each 
node is only responsible for the workflow instances that map to its part of the 
ring. Event notifications would then be dispatched to the correct node based on 
the same ring. The persistence of workflow instances and events would still be 
handled by a highly available database; only notifications about events would 
be dispatched. Whenever an event notification is received, the workflow 
instance has to be dispatched to a worker thread within the node.
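
To make this a bit more concrete, here is roughly what I have in mind, using 
the plain ZooKeeper Java client. The /nodes layout, the toy hash function and 
the class/field names are just placeholders of mine (and it assumes /nodes 
already exists):

import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRingNode implements Watcher {

    private final ZooKeeper zk;
    private final String nodeId;
    // maps hash positions to node ids; volatile so dispatch threads see updates
    private volatile SortedMap<Integer, String> ring = new TreeMap<>();

    public HashRingNode(String connectString, String nodeId) throws Exception {
        this.nodeId = nodeId;
        this.zk = new ZooKeeper(connectString, 15000, this);
        // register this node as an ephemeral child of /nodes
        // (a real version would wait for SyncConnected before creating)
        zk.create("/nodes/" + nodeId, new byte[0],
                  Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        rebuildRing();
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                rebuildRing();
            } catch (Exception e) {
                // a real version would retry / resync instead of just logging
                e.printStackTrace();
            }
        }
    }

    // re-read /nodes (re-installing the children watch) and rebuild the ring
    private void rebuildRing() throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren("/nodes", this);
        SortedMap<Integer, String> newRing = new TreeMap<>();
        for (String child : children) {
            // toy hash; a real ring would use several virtual points per node
            newRing.put(child.hashCode(), child);
        }
        ring = newRing;
    }

    // true if this node owns the given workflow instance
    public boolean isResponsibleFor(String workflowInstanceId) {
        SortedMap<Integer, String> snapshot = ring;
        if (snapshot.isEmpty()) return false;
        SortedMap<Integer, String> tail =
                snapshot.tailMap(workflowInstanceId.hashCode());
        Integer key = tail.isEmpty() ? snapshot.firstKey() : tail.firstKey();
        return nodeId.equals(snapshot.get(key));
    }
}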

The interesting cases happen when a new node is added to the cluster or when a 
workflow engine instance fails. Let’s talk about failures first. Since 
processing a workflow instance takes some time, we cannot simply switch over to 
a new instance right away. After all, the previous node might still be running 
(i.e. it did not crash, it just failed to heartbeat). We would have to wait 
some time (how long?!) until we know that the failing node has dropped the work 
it was doing (it gets notified of the session expiration). The failing node 
cannot communicate with ZooKeeper, so it has no way of telling us that it is 
done. We also cannot use the central database to find out what happened: if a 
node fails (e.g. due to a hardware or network failure) it cannot update the 
database, so the database will look the same whether the node is still working 
or has crashed. It must be impossible for two workflow engines to work on the 
same workflow instance, because that would result in duplicate messages being 
sent out to backend systems (not a good idea). 
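
On the failing node’s side, the only thing I can see doing is to stop all work 
the moment the client learns that its session has expired, something like the 
sketch below (the worker pool is just a stand-in for whatever runs the workflow 
instances; as far as I understand, a partitioned node only receives this event 
once it manages to reconnect, which is exactly the gap I am worried about):

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SessionExpiryGuard implements Watcher {

    // stand-in for the pool that runs this node's workflow instances
    private final ExecutorService workerPool = Executors.newFixedThreadPool(8);

    @Override
    public void process(WatchedEvent event) {
        if (event.getState() == Event.KeeperState.Expired) {
            // best effort: interrupt running instances, accept no new ones
            workerPool.shutdownNow();
        }
    }
}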

The same issue arises when adding a node: the new node has to wait until the 
other nodes have stopped working on the instances it takes over. OK, I could 
use ZooKeeper in this case to lock workflow instances: a node may only start 
working on a workflow instance if the lock is available. Whenever a node is 
added, it will look up pending workflow instances in the database (those that 
have open events). 
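
For the locking part I was thinking of something as simple as one ephemeral 
znode per workflow instance, i.e. a try-lock. Just a sketch (the /locks path is 
again a placeholder and assumed to exist; a tested recipe such as Curator’s 
InterProcessMutex would probably be the safer choice):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class WorkflowInstanceLock {

    private final ZooKeeper zk;

    public WorkflowInstanceLock(ZooKeeper zk) {
        this.zk = zk;
    }

    // returns true if we now own the instance, false if another node holds it
    public boolean tryLock(String workflowInstanceId)
            throws KeeperException, InterruptedException {
        try {
            zk.create("/locks/" + workflowInstanceId, new byte[0],
                      Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false;
        }
    }

    // release by deleting the znode (it also disappears if our session expires)
    public void unlock(String workflowInstanceId)
            throws KeeperException, InterruptedException {
        zk.delete("/locks/" + workflowInstanceId, -1);
    }
}

Of course this has the same weakness as above: if the owner’s session expires 
while it is still processing, the ephemeral lock disappears and another node 
could grab the instance, so the exclusivity is only as strong as the session 
handling.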

To summarize: instead of having a central master that coordinates all work, I 
would like to divide the work “space” into segments by using consistent 
hashing. Each node is an equal peer and can operate freely within its assigned 
work “space”. 

What do you think about that setup? Is it something completely stupid?! If yes, 
let me know why. Has somebody done something similar successfully? What would I 
have to watch out for? What would I need to add to the basic setup mentioned 
above? 

Regards,
Simon
