I forgot, something else for you to consider, there's also now a third option for ASF project hosting called apache extras: http://apache-extras.org
more detail here https://blogs.apache.org/foundation/entry/the_apache_software_foundation_launches Patrick On Wed, Jan 12, 2011 at 8:53 AM, Patrick Hunt <ph...@apache.org> wrote: > Hi Avery, clusterlib looks like some great functionality, I don't see > why we couldn't include it as a subproject (see one caveat I noticed > below). I'd also like to point out that incubator is also a great > option for the project. http://incubator.apache.org/ , have you > considered that? > > According to the readme on GH a dependency exists on "libmicrohttpd" > which is LGPL licensed. Unfortunately we (apache projects) cannot > include LGPL licensed code, see "category X" here > http://www.apache.org/legal/3party.html This dependency would have to > be removed prior to adding the subproject. > > Regards, > > Patrick > > On Tue, Jan 11, 2011 at 5:34 PM, Avery Ching <ach...@yahoo-inc.com> wrote: >> Sorry for the delay (meetings). I just threw it up on GitHub. >> >> https://github.com/aching/Clusterlib >> >> Enjoy! >> >> Avery >> >> On Jan 11, 2011, at 3:42 PM, Fournier, Camille F. [Tech] wrote: >> >>> Is the code somewhere we can look at it right now? >>> >>> C >>> >>> -----Original Message----- >>> From: Avery Ching [mailto:ach...@yahoo-inc.com] >>> Sent: Tuesday, January 11, 2011 2:02 PM >>> To: dev@zookeeper.apache.org >>> Subject: Discussion - Clusterlib as a subproject for ZooKeeper >>> >>> Hello, >>> >>> We have been working on Clusterlib at Yahoo! and would like to contribute >>> it as a subproject to ZooKeeper. Clusterlib was developed as a >>> next-generation platform for creating/coordinating search >>> applications/services (including crawling, processing, indexing, and front >>> end) at Yahoo!. We suspect much of this work will be useful for others >>> trying to build up large-scale/distributed applications that would like to >>> coordinate and share the same semantics. >>> >>> Here is a (relatively) short summary of why Clusterlib was developed: >>> >>> Large-scale distributed applications are difficult and time-consuming to >>> develop since a great deal of effort is spent solving the same >>> challenges (consistency, fault-tolerance, naming problems, etc.). >>> Additionally, coordinating these applications is typically ad-hoc and >>> hard to maintain. Clusterlib fills the gap by providing distributed >>> application developers with an object-oriented data model, >>> asynchronous event handling system, well-defined consistency semantics, and >>> methods for making coordination easy across >>> cooperating applications. Some example applications might include a search >>> engine, scalable file system, large-scale data cache, etc. >>> >>> Clusterlib is a middleware library for building distributed applications. >>> It was designed to simplify the job of application developers and provides >>> a set of distributed objects that all inherit from the same Notifyable >>> interface. The set of distributed objects includes: Root, Application, >>> Group, DataDistribution, Node, ProcessSlot, PropertyList, and Queue. In >>> order to give context, each object is described briefly. >>> >>> * Root is a point-of-entry object at the top of the hierarchy in Clusterlib >>> and manages its Applications. There is only one Root per Clusterlib >>> instance. >>> * Applications are used as a namespace for managing Groups, Nodes, >>> DataDistributions, Queues, and PropertyLists in a user-defined application. >>> Using the application concept (as opposed to only having groups) makes >>> accessing another Application's child objects explicit to developers. >>> * Groups are a logical association of Clusterlib objects that can be >>> nested. Since large-scale applications often require hundreds or thousands >>> of nodes to operate, there might a "node" Group that has an "alive" child >>> Group and a "dead" child Group that are each populated with their >>> respective sets of nodes. >>> * DataDistributions balance load and data across a set of objects. >>> DataDistributions provide user-extensible key hashing to variable-sized >>> hash ranges for user flexibility. >>> * Nodes typically represent a physical or virtual node in an application. >>> It has child ProcessSlots that can be used to reserve system resources. >>> * ProcessSlots maintain an actual process running locally on the physical >>> machine. It can also contain other information about the process, such as a >>> PID or port array. >>> * PropertyLists may be created and maintained as a child of any Notifyable >>> object. It is basically a key-value storage that can, for instance, be used >>> to determine how long a timeout would be on a particular server or the >>> number of retries to allow before giving up. PropertyLists are leafs in the >>> Clusterlib hierarchy and cannot have any children. >>> * Queues are distributed FIFO queues. They can be used to synchronize >>> threads, pass messages between threads, and for JSON-RPC. >>> >>> Clusterlib objects are composed in a hierarchy and maintain ACID >>> compliance. Distributed, non-blocking, fault-tolerant locks can be acquired >>> on any Clusterlib object and asynchronous event handlers can be registered >>> for object-specific changes. For example, if a ProcessSlot changed, an >>> asynchronous event handler might check to see if the process is still >>> running and if not, try to restart it. There are 3 types of >>> Clusterlib-defined locks (child, notifyable, and ownership). Clusterlib >>> internally uses a child lock on a parent object to access child objects, >>> however users may also use this lock if desired. A notifyable lock is >>> intended as a general-purpose lock on a Notifyable. Finally, ownership >>> locks are intended to express concepts suchs as "leadership" in a Group or >>> "reservation" of a Node. In order to allow more parallelism, Clusterlib >>> locks can be accessed in shared or exclusive modes. >>> >>> Since Clusterlib relies upon Zookeeper as a fault-tolerant, consensus >>> service, it inherits many of its performance and fault-tolerance >>> properties. As the number of Zookeeper servers increases, read performance >>> scales up nearly linearly, however write performance scales inversely due >>> to Zookeeper's internal atomic broadcast protocol. As long as the number of >>> correctly functioning Zookeeper servers maintains a quorum, Zookeeper can >>> continue to operate. The same is true for Clusterlib applications. The >>> locks and leadership election algorithms in Clusterlib are fault-tolerant >>> to client failure due to the use of Zookeeper ephemeral nodes. >>> >>> In addition to being a library, Clusterlib comes with a http server to >>> viewing/manipulating Clusterlib objects and/or ZooKeeper znodes directly. >>> I've linked some PNGs to illustrate this. It also is bundled with a CLI >>> that is extensible. We have also developed a suite of over 90 unittests >>> that simulate distributed event ordering using MPI to test for many of >>> those hard-to-find distributed bugs. It's been tested to build on flavors >>> of Redhat Linux, Ubuntu Linux, and OSX. >>> >>> We would like to see it as a subproject of ZooKeeper because its tightly >>> integrated with ZooKeeper. What do folks think about Clusterlib as a >>> subproject of ZooKeeper? >>> >>> Thanks, >>> >>> Avery >>> >>> Clusterlib-UI snapshot link >>> http://users.eecs.northwestern.edu/~aching/clusterlib-ui.png >>> >>> ZooKeeper-UI snapshot link >>> http://users.eecs.northwestern.edu/~aching/zookeeper-ui.png >>> >>> >> >> >