Good morning all,

I wanted to run an idea by you. I'm currently working on using Twill to schedule GraphLab (http://graphlab.org), a distributed graph analytics package written in C++. GraphLab currently uses MPI, but only to coordinate the launch of a cluster, so it should be comparatively easy to migrate it to YARN and Twill. To do this, I would like to add a mechanism to Twill that lets me:
1. Request X containers.
2. Wait for the first container to be assigned to me.
3. Wait Y seconds for the rest of the containers to be assigned to me.
4. If the number of containers allocated equals X, continue; otherwise release my containers and go back to step 1.

I can think of two main ways to implement this, one inside Twill itself and one inside the application:

1. Modify `YarnAMClient.doRun` to block launching the processes until all the containers have been allocated.
2. Add some sort of distributed barrier that the application could use to block until all the containers have been allocated.

I'm leaning towards the second option, as ZooKeeper and Curator already implement distributed barriers, so all that's left is figuring out the right API to expose this. I have a few ideas:

1. Pass the ZooKeeper connection string to the `TwillRunnable`. This would be simplest, as I wouldn't have to modify Twill, but I would then have a redundant connection to ZooKeeper.
2. Expose the ZooKeeper client through `TwillContext`. This would also be simple, but it would tightly couple the Twill API to ZooKeeper.
3. Draw inspiration from service discovery and add a `SynchronizationService`, a `Barrier` interface, and a `TwillContext.createBarrier(String)` method. It would use ZooKeeper or Curator under the covers. This would be a bit more work, but it could be useful for a lot of other applications, and it would be a nice place to put other synchronization primitives.

My plan right now is to start by passing the ZooKeeper connection string to the `TwillRunnable`. Once I get that working, I'd like to try implementing the `SynchronizationService`.

Does this sound like a good plan, or would any of you suggest a better approach?

Thanks,
Erick
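
For illustration, the four allocation steps listed above could be sketched roughly as follows. This is only a sketch under assumptions: `ContainerSource` and `GangAllocator` are hypothetical names invented for the example, not Twill or YARN APIs, and the `maxAttempts` cap is an addition so the loop cannot spin forever.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for whatever hands out containers; not a Twill or YARN API. */
interface ContainerSource {
    /** Step 1: ask for {@code count} containers. */
    void request(int count);

    /** Returns the next assigned container id, or null if none arrives before the timeout. */
    String awaitNext(long timeout, TimeUnit unit) throws InterruptedException;

    /** Hands the given containers back. */
    void release(List<String> containers);
}

final class GangAllocator {
    /** Repeats steps 1-4 until exactly {@code wanted} containers are held. */
    static List<String> allocateAll(ContainerSource source, int wanted,
                                    Duration window, int maxAttempts)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            source.request(wanted);                               // step 1
            List<String> held = new ArrayList<>();

            // Step 2: wait (effectively without a deadline) for the first container.
            String first = source.awaitNext(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
            if (first != null) {
                held.add(first);
            }

            // Step 3: give the remaining containers `window` (Y seconds) to arrive.
            long deadline = System.nanoTime() + window.toNanos();
            while (!held.isEmpty() && held.size() < wanted) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    break;
                }
                String next = source.awaitNext(remaining, TimeUnit.NANOSECONDS);
                if (next == null) {
                    break;
                }
                held.add(next);
            }

            // Step 4: all-or-nothing. Keep a full set, otherwise release and retry.
            if (held.size() == wanted) {
                return held;
            }
            source.release(held);
        }
        throw new IllegalStateException(
                "could not obtain " + wanted + " containers in " + maxAttempts + " attempts");
    }
}
```

A caller would invoke something like `GangAllocator.allocateAll(source, x, Duration.ofSeconds(y), 5)` and only start the GraphLab processes once it returns.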

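
As a rough illustration of the third API idea, a `Barrier` interface might look something like the sketch below. Everything here is an assumption made for the example: the `await` signature is a guess at a reasonable shape, and `LocalBarrier` is an in-process stand-in built on `CountDownLatch` so the snippet runs without a cluster. A real implementation would presumably sit on ZooKeeper, for instance via one of Curator's barrier recipes.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical interface; the proposal names `Barrier` but does not define its methods. */
interface Barrier {
    /**
     * Announces this party's arrival, then blocks until all parties have arrived.
     * Returns false if the timeout elapses first.
     */
    boolean await(long timeout, TimeUnit unit) throws InterruptedException;
}

/**
 * In-process stand-in used only to exercise the interface.
 * It is not distributed; a ZooKeeper-backed version would replace it.
 */
final class LocalBarrier implements Barrier {
    private final CountDownLatch latch;

    LocalBarrier(int parties) {
        this.latch = new CountDownLatch(parties);
    }

    @Override
    public boolean await(long timeout, TimeUnit unit) throws InterruptedException {
        latch.countDown();                  // record this party's arrival
        return latch.await(timeout, unit);  // wait for everyone else
    }
}
```

Inside a runnable, the intended use would be along the lines of calling `await(y, TimeUnit.SECONDS)` on a barrier obtained from the proposed `TwillContext.createBarrier(String)`, and exiting (so the containers are released) if it returns false.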