Patrick,

Thanks for the suggestions on http://incubator.apache.org/ and http://apache-extras.org. I'll have to look into that more.

The reason we thought it would be best as a ZooKeeper subproject is that it is heavily dependent on ZooKeeper. As for libmicrohttpd's LGPL license, sorry if the README wasn't clearer, but we only link to libmicrohttpd; we do not include its source code. libmicrohttpd is only required if you want to build the Clusterlib HTTP server.

Avery

On Jan 12, 2011, at 8:53 AM, Patrick Hunt wrote:

Hi Avery, clusterlib looks like some great functionality, I don't see why we couldn't include it as a subproject (see one caveat I noticed below).

I'd also like to point out that incubator is also a great option for the project. http://incubator.apache.org/ , have you considered that?

According to the readme on GH a dependency exists on "libmicrohttpd" which is LGPL licensed. Unfortunately we (apache projects) cannot include LGPL licensed code, see "category X" here http://www.apache.org/legal/3party.html

This dependency would have to be removed prior to adding the subproject.

Regards,

Patrick

On Tue, Jan 11, 2011 at 5:34 PM, Avery Ching <ach...@yahoo-inc.com> wrote:

Sorry for the delay (meetings). I just threw it up on GitHub.

https://github.com/aching/Clusterlib

Enjoy!

Avery

On Jan 11, 2011, at 3:42 PM, Fournier, Camille F. [Tech] wrote:

Is the code somewhere we can look at it right now?

C

-----Original Message-----
From: Avery Ching [mailto:ach...@yahoo-inc.com]
Sent: Tuesday, January 11, 2011 2:02 PM
To: dev@zookeeper.apache.org
Subject: Discussion - Clusterlib as a subproject for ZooKeeper

Hello,

We have been working on Clusterlib at Yahoo! and would like to contribute it as a subproject to ZooKeeper. Clusterlib was developed as a next-generation platform for creating/coordinating search applications/services (including crawling, processing, indexing, and front end) at Yahoo!.
We suspect much of this work will be useful for others trying to build up large-scale/distributed applications that would like to coordinate and share the same semantics.

Here is a (relatively) short summary of why Clusterlib was developed:

Large-scale distributed applications are difficult and time-consuming to develop since a great deal of effort is spent solving the same challenges (consistency, fault-tolerance, naming problems, etc.). Additionally, coordinating these applications is typically ad-hoc and hard to maintain. Clusterlib fills the gap by providing distributed application developers with an object-oriented data model, asynchronous event handling system, well-defined consistency semantics, and methods for making coordination easy across cooperating applications. Some example applications might include a search engine, scalable file system, large-scale data cache, etc.

Clusterlib is a middleware library for building distributed applications. It was designed to simplify the job of application developers and provides a set of distributed objects that all inherit from the same Notifyable interface. The set of distributed objects includes: Root, Application, Group, DataDistribution, Node, ProcessSlot, PropertyList, and Queue. In order to give context, each object is described briefly.

* Root is a point-of-entry object at the top of the hierarchy in Clusterlib and manages its Applications. There is only one Root per Clusterlib instance.
* Applications are used as a namespace for managing Groups, Nodes, DataDistributions, Queues, and PropertyLists in a user-defined application. Using the application concept (as opposed to only having groups) makes accessing another Application's child objects explicit to developers.
* Groups are a logical association of Clusterlib objects that can be nested.
  Since large-scale applications often require hundreds or thousands of nodes to operate, there might be a "node" Group that has an "alive" child Group and a "dead" child Group that are each populated with their respective sets of nodes.
* DataDistributions balance load and data across a set of objects. DataDistributions provide user-extensible key hashing to variable-sized hash ranges for user flexibility.
* Nodes typically represent a physical or virtual node in an application. Each Node has child ProcessSlots that can be used to reserve system resources.
* ProcessSlots each maintain an actual process running locally on the physical machine. A ProcessSlot can also contain other information about the process, such as a PID or port array.
* PropertyLists may be created and maintained as a child of any Notifyable object. A PropertyList is basically a key-value store that can, for instance, be used to determine how long a timeout would be on a particular server or the number of retries to allow before giving up. PropertyLists are leaves in the Clusterlib hierarchy and cannot have any children.
* Queues are distributed FIFO queues. They can be used to synchronize threads, pass messages between threads, and for JSON-RPC.

Clusterlib objects are composed in a hierarchy and maintain ACID compliance. Distributed, non-blocking, fault-tolerant locks can be acquired on any Clusterlib object and asynchronous event handlers can be registered for object-specific changes. For example, if a ProcessSlot changed, an asynchronous event handler might check to see if the process is still running and if not, try to restart it.

There are 3 types of Clusterlib-defined locks (child, notifyable, and ownership). Clusterlib internally uses a child lock on a parent object to access child objects; however, users may also use this lock if desired. A notifyable lock is intended as a general-purpose lock on a Notifyable.
Finally, ownership locks are intended to express concepts such as "leadership" in a Group or "reservation" of a Node. In order to allow more parallelism, Clusterlib locks can be acquired in shared or exclusive modes.

Since Clusterlib relies upon ZooKeeper as a fault-tolerant consensus service, it inherits many of its performance and fault-tolerance properties. As the number of ZooKeeper servers increases, read performance scales up nearly linearly; however, write performance decreases due to ZooKeeper's internal atomic broadcast protocol. As long as the number of correctly functioning ZooKeeper servers maintains a quorum, ZooKeeper can continue to operate. The same is true for Clusterlib applications. The locks and leadership election algorithms in Clusterlib tolerate client failure due to the use of ZooKeeper ephemeral nodes.

In addition to being a library, Clusterlib comes with an HTTP server for viewing/manipulating Clusterlib objects and/or ZooKeeper znodes directly. I've linked some PNGs to illustrate this. It is also bundled with an extensible CLI. We have also developed a suite of over 90 unit tests that simulate distributed event ordering using MPI to test for many of those hard-to-find distributed bugs. It's been tested to build on flavors of Red Hat Linux, Ubuntu Linux, and OS X.

We would like to see it as a subproject of ZooKeeper because it's tightly integrated with ZooKeeper. What do folks think about Clusterlib as a subproject of ZooKeeper?

Thanks,

Avery

Clusterlib-UI snapshot link
http://users.eecs.northwestern.edu/~aching/clusterlib-ui.png

ZooKeeper-UI snapshot link
http://users.eecs.northwestern.edu/~aching/zookeeper-ui.png
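
A note on the quorum condition mentioned in the original message: with ZooKeeper's default majority quorums, an ensemble keeps operating as long as a strict majority of its servers is functioning. The arithmetic can be sketched as below. This is only an illustration under that assumption; the function names are hypothetical and are not part of the ZooKeeper or Clusterlib APIs.

```python
# Hypothetical helpers illustrating majority-quorum arithmetic.
# Assumes the default "simple majority" quorum; names are illustrative only.


def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble with the given server count."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a majority is still available."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, healthy_servers: int) -> bool:
    """True if the healthy servers still form a strict majority."""
    return healthy_servers >= quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 4, 5, 7):
        print(f"servers={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under this assumption, going from 3 to 4 servers does not increase the number of failures that can be tolerated, which is one reason odd ensemble sizes are commonly chosen.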