Is the code somewhere we can look at it right now? C
-----Original Message-----
From: Avery Ching [mailto:ach...@yahoo-inc.com]
Sent: Tuesday, January 11, 2011 2:02 PM
To: dev@zookeeper.apache.org
Subject: Discussion - Clusterlib as a subproject for ZooKeeper

Hello,

We have been working on Clusterlib at Yahoo! and would like to contribute it as a subproject to ZooKeeper. Clusterlib was developed as a next-generation platform for creating/coordinating search applications/services (including crawling, processing, indexing, and front end) at Yahoo!. We suspect much of this work will be useful for others trying to build up large-scale/distributed applications that would like to coordinate and share the same semantics.

Here is a (relatively) short summary of why Clusterlib was developed:

Large-scale distributed applications are difficult and time-consuming to develop since a great deal of effort is spent solving the same challenges (consistency, fault-tolerance, naming problems, etc.). Additionally, coordinating these applications is typically ad-hoc and hard to maintain. Clusterlib fills the gap by providing distributed application developers with an object-oriented data model, asynchronous event handling system, well-defined consistency semantics, and methods for making coordination easy across cooperating applications. Some example applications might include a search engine, scalable file system, large-scale data cache, etc.

Clusterlib is a middleware library for building distributed applications. It was designed to simplify the job of application developers and provides a set of distributed objects that all inherit from the same Notifyable interface. The set of distributed objects includes: Root, Application, Group, DataDistribution, Node, ProcessSlot, PropertyList, and Queue. In order to give context, each object is described briefly.

* Root is a point-of-entry object at the top of the hierarchy in Clusterlib and manages its Applications. There is only one Root per Clusterlib instance.
* Applications are used as a namespace for managing Groups, Nodes, DataDistributions, Queues, and PropertyLists in a user-defined application. Using the application concept (as opposed to only having groups) makes accessing another Application's child objects explicit to developers.
* Groups are a logical association of Clusterlib objects that can be nested. Since large-scale applications often require hundreds or thousands of nodes to operate, there might be a "node" Group that has an "alive" child Group and a "dead" child Group that are each populated with their respective sets of nodes.
* DataDistributions balance load and data across a set of objects. DataDistributions provide user-extensible key hashing to variable-sized hash ranges for user flexibility.
* Nodes typically represent a physical or virtual node in an application. Each Node has child ProcessSlots that can be used to reserve system resources.
* ProcessSlots maintain an actual process running locally on the physical machine. A ProcessSlot can also contain other information about the process, such as a PID or port array.
* PropertyLists may be created and maintained as a child of any Notifyable object. A PropertyList is basically a key-value store that can, for instance, be used to determine how long a timeout would be on a particular server or the number of retries to allow before giving up. PropertyLists are leaves in the Clusterlib hierarchy and cannot have any children.
* Queues are distributed FIFO queues. They can be used to synchronize threads, pass messages between threads, and for JSON-RPC.

Clusterlib objects are composed in a hierarchy and maintain ACID compliance. Distributed, non-blocking, fault-tolerant locks can be acquired on any Clusterlib object and asynchronous event handlers can be registered for object-specific changes. For example, if a ProcessSlot changed, an asynchronous event handler might check to see if the process is still running and, if not, try to restart it.
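
To make the DataDistribution bullet above a little more concrete, here is a minimal Python sketch of hashing keys into variable-sized hash ranges with a pluggable hash function. It is only an illustration of the general technique and not Clusterlib's actual API: the class name RangeDistribution, the method owner_for, the hash_fn and space parameters, and the MD5-based default hash are all assumptions made for this example.

```python
import bisect
import hashlib


def default_hash(key):
    """Assumed default: first 4 bytes of MD5, read as an unsigned integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


class RangeDistribution:
    """Hypothetical sketch: maps keys to owners via variable-sized hash ranges."""

    def __init__(self, ranges, hash_fn=default_hash, space=2**32):
        # ranges is a list of (exclusive_upper_bound, owner) pairs that
        # together cover [0, space). Ranges may have different sizes.
        self._bounds = [bound for bound, _ in ranges]
        self._owners = [owner for _, owner in ranges]
        self._hash_fn = hash_fn  # user-supplied hash function (the extension point)
        self._space = space
        strictly_increasing = all(
            a < b for a, b in zip(self._bounds, self._bounds[1:])
        )
        if not self._bounds or not strictly_increasing or self._bounds[-1] != space:
            raise ValueError("ranges must be increasing and end exactly at space")

    def owner_for(self, key):
        # Reduce the hash into [0, space), then binary-search for the first
        # upper bound that is strictly greater than the hash value.
        h = self._hash_fn(key) % self._space
        return self._owners[bisect.bisect_right(self._bounds, h)]
```

For instance, RangeDistribution([(2**31, "shard-a"), (2**32, "shard-b")]) would split the 32-bit hash space into two equal halves, while unequal bounds would give one owner a larger share of the keys. Passing a different hash_fn changes how keys are spread without touching the range table.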
There are 3 types of Clusterlib-defined locks (child, notifyable, and ownership). Clusterlib internally uses a child lock on a parent object to access child objects; however, users may also use this lock if desired. A notifyable lock is intended as a general-purpose lock on a Notifyable. Finally, ownership locks are intended to express concepts such as "leadership" in a Group or "reservation" of a Node. In order to allow more parallelism, Clusterlib locks can be accessed in shared or exclusive modes.

Since Clusterlib relies upon ZooKeeper as a fault-tolerant consensus service, it inherits many of its performance and fault-tolerance properties. As the number of ZooKeeper servers increases, read performance scales up nearly linearly; however, write performance scales inversely due to ZooKeeper's internal atomic broadcast protocol. As long as a quorum of ZooKeeper servers is functioning correctly, ZooKeeper can continue to operate. The same is true for Clusterlib applications. The locks and leadership election algorithms in Clusterlib are fault-tolerant to client failure due to the use of ZooKeeper ephemeral nodes.

In addition to being a library, Clusterlib comes with an HTTP server for viewing/manipulating Clusterlib objects and/or ZooKeeper znodes directly. I've linked some PNGs to illustrate this. It is also bundled with an extensible CLI. We have also developed a suite of over 90 unit tests that simulate distributed event ordering using MPI to test for many of those hard-to-find distributed bugs. It has been tested to build on flavors of Red Hat Linux, Ubuntu Linux, and OS X.

We would like to see it as a subproject of ZooKeeper because it's tightly integrated with ZooKeeper. What do folks think about Clusterlib as a subproject of ZooKeeper?

Thanks,

Avery

Clusterlib-UI snapshot link
http://users.eecs.northwestern.edu/~aching/clusterlib-ui.png

ZooKeeper-UI snapshot link
http://users.eecs.northwestern.edu/~aching/zookeeper-ui.png