Hi, I have been reading a lot about ZooKeeper lately because we have a requirement to distribute our workflow engine (for performance and reliability reasons). Currently we have a single instance; if it fails, no work gets done anymore. One approach mentioned on this mailing list and in online documents is a master-worker setup. The master is made reliable by running multiple instances and electing a leader. This setup might work unless there are too many events for a single master to dispatch. I am a bit concerned about that, because currently the work dispatcher does no I/O and is blazing fast. If the work dispatcher (scheduler) now has to communicate over the network, it might become too slow. So I thought about a different solution and would like to hear what you think about it.
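(Just so we are talking about the same thing: by leader election I mean the standard recipe with ephemeral sequential znodes, roughly like the sketch below. The /election path and the error handling are placeholders, and I am glossing over the race where the watched predecessor is already gone.)

import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;

public class LeaderElection {
    private final ZooKeeper zk;
    private String myZnode;   // e.g. /election/n_0000000042

    public LeaderElection(ZooKeeper zk) {
        this.zk = zk;
    }

    public boolean volunteer() throws Exception {
        // Every candidate creates an ephemeral sequential znode under /election.
        myZnode = zk.create("/election/n_", new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        return checkLeadership();
    }

    private boolean checkLeadership() throws Exception {
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        if (("/election/" + children.get(0)).equals(myZnode)) {
            return true;   // we hold the lowest sequence number, so we are the master
        }
        // Otherwise watch the candidate just ahead of us and re-check when it disappears.
        int myIndex = children.indexOf(myZnode.substring("/election/".length()));
        String predecessor = "/election/" + children.get(myIndex - 1);
        zk.exists(predecessor, event -> {
            if (event.getType() == Watcher.Event.EventType.NodeDeleted) {
                try { checkLeadership(); } catch (Exception ignored) { }
            }
        });
        return false;
    }
}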
Let’s say each workflow engine instance (referred to as a node from now on) registers an ephemeral znode under /nodes. Each node installs a children watch on /nodes and uses this information to populate a hash ring. Each node is responsible only for the workflow instances that map to its part of the hash ring, and event notifications would be dispatched to the correct node based on the same hash ring (a rough code sketch of this part is in the P.S. below). The persistence of workflow instances and events would still be done in a highly available database; only notifications about events would be dispatched. Whenever an event notification is received, the corresponding workflow instance has to be dispatched to a worker thread within the node.

The interesting cases are when a new node is added to the cluster or when a workflow engine instance fails. Let’s talk about failures first. As processing a workflow instance takes some time, we cannot simply switch over to a new instance right away. After all, the previous node might still be running (i.e. it did not crash, it just failed to heartbeat). We would have to wait some time (how long?!) until we know that the failing node has dropped the work it was doing (it gets notified of the session expiration). The failing node cannot communicate with ZooKeeper, so it has no way of telling us that it is done. We also cannot use the central database to find out what happened: if a node fails (e.g. due to a hardware or network failure), it cannot update the database, so the database looks the same whether the node is still working or has crashed. It must be impossible for two workflow engines to work on the same workflow instance; that would result in duplicate messages being sent to backend systems (not a good idea).

The same issue arises when adding a node: the new node has to wait until the other nodes have stopped working on the instances that now map to it. OK, I could use ZooKeeper in this case to lock workflow instances; a node can only start working on a workflow instance if it can acquire the lock (see the second sketch in the P.S.). Whenever a node is added, it will look up pending workflow instances in the database (those that have open events).

To summarize: instead of having a central master that coordinates all work, I would like to divide the work “space” into segments using consistent hashing. Each node is an equal peer and can operate freely within its assigned segment of the work “space”.

What do you think about that setup? Is it something completely stupid?! If yes, let me know why. Has somebody done something similar successfully? What would I have to watch out for? What would I need to add to the basic setup mentioned above?

Regards,
Simon
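P.S. To make the hash ring part more concrete, here is roughly what I have in mind for the registration and the children watch (plain ZooKeeper API; the /nodes path, the node id and the hash function are just placeholders, and I am assuming /nodes itself already exists and at least one node is registered):

import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;

public class NodeRegistry implements Watcher {
    private final ZooKeeper zk;
    private final String nodeId;
    private final SortedMap<Integer, String> ring = new TreeMap<>();

    public NodeRegistry(ZooKeeper zk, String nodeId) {
        this.zk = zk;
        this.nodeId = nodeId;
    }

    public void register() throws Exception {
        // The ephemeral znode disappears automatically when our session expires.
        zk.create("/nodes/" + nodeId, new byte[0],
                Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        rebuildRing();
    }

    private synchronized void rebuildRing() throws Exception {
        // Passing "this" re-installs the children watch on every call.
        List<String> nodes = zk.getChildren("/nodes", this);
        ring.clear();
        for (String n : nodes) {
            ring.put(hash(n), n);   // one point per node; real code would add virtual nodes
        }
    }

    // The owner of a workflow instance is the first node at or after its hash on the ring.
    public synchronized String ownerOf(String workflowInstanceId) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(workflowInstanceId));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Watcher.Event.EventType.NodeChildrenChanged) {
            try { rebuildRing(); } catch (Exception ignored) { }
        }
    }

    private int hash(String s) {
        // Placeholder: a real ring should use a stable, well-distributed hash, not hashCode().
        return s.hashCode() & 0x7fffffff;
    }
}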
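And the per-instance lock I mentioned would be something along these lines (again just a sketch with a placeholder /locks path). Note that it does not by itself solve the “old owner might still be running” problem: the ephemeral lock znode only disappears once the old session has expired.

import org.apache.zookeeper.*;
import org.apache.zookeeper.ZooDefs.Ids;

public class InstanceLock {
    private final ZooKeeper zk;

    public InstanceLock(ZooKeeper zk) {
        this.zk = zk;
    }

    public boolean tryLock(String workflowInstanceId) throws Exception {
        try {
            // Ephemeral: the lock is released automatically when our session dies.
            zk.create("/locks/" + workflowInstanceId, new byte[0],
                    Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false;   // another node is working on it (or its session has not expired yet)
        }
    }

    public void unlock(String workflowInstanceId) throws Exception {
        zk.delete("/locks/" + workflowInstanceId, -1);   // -1 = any version
    }
}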
