See BookKeeper. The analogy is this:

ZK => Chubby
BookKeeper => distributed log
Application => Application

On Wed, Jul 13, 2011 at 10:17 AM, Yang <[email protected]> wrote:

> actually I was just thinking about this and tried to ask exactly the same
> question.
>
> now zk is used to store small pieces of data such as shared config, and
> used for locking/coordination, but since it has a replicated data store, it
> would be nice to use it to store large volumes of data directly.
>
> in fact from the "Paxos made live" paper:
> http://static.googleusercontent.com/external_content/untrusted_dlcp/labs.google.com/en/us/papers/paxos_made_live.pdf
> page 3
> "We devoted effort to designing clean interfaces separating the Paxos
> framework, the database, and Chubby. We did this partly for clarity while
> developing this system, but also with the intention of reusing the
> replicated log layer in other applications. We anticipate future systems at
> Google that seek fault-tolerance through replication. We believe that a
> fault-tolerant log is a powerful primitive on which to build such systems."
>
> essentially in the google paxos implementation, application code can simply
> grab the latest committed log record, and use it for whatever it wants for
> the application. if ZooKeeper abstracts out the messaging protocol, and
> provides the committed transaction "stream" as the interface to
> applications, potentially we could use it for many applications, including
> data storage. note that this is completely outside of the current ZK data
> model (znodes etc.); all we use from ZK is the underlying committed
> transaction stream, and probably this part of ZK can be provided as a library.
>
> yang
>
> On Wed, Jul 13, 2011 at 5:01 AM, Flavio Junqueira <[email protected]> wrote:
>
>> Hi Simon, it is not entirely clear to me what you need ZooKeeper for in
>> this case. Are blocks replicated and you need to guarantee that the updates
>> are consistent across replicas?
>>
>> On your observations, I'm quite sure people will have an opinion, so here
>> are my thoughts, which might not be representative of the whole community:
>> 1- You're right, we do not recommend using ZooKeeper directly as the
>> data store. ZooKeeper servers keep their state in memory.
>> 2- Cassandra already provides replication. Are you trying to strengthen
>> the guarantees of Cassandra? I don't get it...
>> 3- Sounds right that you could use BK as a journal, but it is not clear
>> which element is writing to the journal. Are you assuming a metadata manager
>> such as the namenode of HDFS?
>> 4- I'm not sure what this option means. Are you proposing ZooKeeper to
>> manage the metadata of the file system? If so, I don't find it entirely
>> unrealistic, since metadata updates are supposed to be small and the
>> performance of ZooKeeper should be good enough for your case, but it might
>> be awkward to have your block storage clients talking directly to ZooKeeper.
>> Changes to metadata management would imply in this case rolling out a new
>> version of the client application instead of just having the changes
>> implemented on the service side.
>>
>> -Flavio
>>
>> On Jul 13, 2011, at 12:02 PM, Simon Felix wrote:
>>
>> Hello everyone
>>
>> What is the best way to build a distributed, shared storage system on top
>> of ZooKeeper? I'm talking about block storage in the terabyte range (i.e.
>> store billions of 4k blocks). Consistency and availability are important,
>> as is throughput (both read & write). I need at least 50 MB/s with 3 nodes
>> with two regular SATA drives each for my application.
>>
>> Some options I came up with:
>> 1. Use ZooKeeper directly as a data store (not recommended according to
>> the docs, and it really leads to abysmally bad performance, I tested that)
>> 2. Use Cassandra as data store
>> 3. Use BookKeeper as write-ahead log and implement my own underlying store
>> 4. Use ZooKeeper to create my own (probably buggy...) data store
>>
>> What would you recommend? Are there other options?
>>
>> Cheers,
>> Simon
>>
>> *flavio*
>> *junqueira*
>>
>> research scientist
>>
>> [email protected]
>> direct +34 93-183-8828
>>
>> avinguda diagonal 177, 8th floor, barcelona, 08018, es
>> phone (408) 349 3300 fax (408) 349 3301
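
Option 3 in Simon's list (a write-ahead log in front of a custom store) and Yang's "committed transaction stream" idea share the same shape: every update is appended to a replicated log first, and the store is whatever you get by applying the log in order. The sketch below illustrates that shape only. `InMemoryLog` and `BlockStore` are hypothetical names invented for this example; the log is a plain Python list standing in for a durable, replicated log such as a BookKeeper ledger, and nothing here uses the real BookKeeper or ZooKeeper client APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

BLOCK_SIZE = 4096  # 4k blocks, as in the original question


@dataclass
class InMemoryLog:
    """Hypothetical stand-in for a replicated, durable log.

    A real deployment would put a service such as BookKeeper here;
    this class only mimics "append" and "read committed entries".
    """
    entries: List[Tuple[int, bytes]] = field(default_factory=list)

    def append(self, block_id: int, data: bytes) -> int:
        """Append one record and return its sequence number."""
        self.entries.append((block_id, data))
        return len(self.entries) - 1

    def read_from(self, seq: int) -> Iterator[Tuple[int, int, bytes]]:
        """Yield (seq, block_id, data) for each entry at or after seq."""
        for i in range(seq, len(self.entries)):
            block_id, data = self.entries[i]
            yield i, block_id, data


class BlockStore:
    """Block store whose state is derived purely from the log."""

    def __init__(self, log: InMemoryLog) -> None:
        self.log = log
        self.blocks: Dict[int, bytes] = {}
        self.applied_up_to = -1  # sequence number of the last applied entry

    def write(self, block_id: int, data: bytes) -> int:
        if len(data) != BLOCK_SIZE:
            raise ValueError("block must be exactly BLOCK_SIZE bytes")
        seq = self.log.append(block_id, data)  # 1. log the update first
        self._apply(seq, block_id, data)       # 2. then apply it locally
        return seq

    def read(self, block_id: int) -> bytes:
        return self.blocks[block_id]

    def recover(self) -> None:
        """Replay every log entry this store has not applied yet."""
        for seq, block_id, data in self.log.read_from(self.applied_up_to + 1):
            self._apply(seq, block_id, data)

    def _apply(self, seq: int, block_id: int, data: bytes) -> None:
        self.blocks[block_id] = data
        self.applied_up_to = seq
```

A second store attached to the same log reaches the same state by replaying it, which is the property Yang describes:

```python
log = InMemoryLog()
primary = BlockStore(log)
primary.write(7, b"a" * BLOCK_SIZE)
primary.write(7, b"b" * BLOCK_SIZE)

replica = BlockStore(log)
replica.recover()
assert replica.read(7) == primary.read(7)
```

Flavio's question under point 3 still applies to this sketch: something has to be the single writer to the log, and the example simply assumes `primary` plays that role.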

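
For a sense of scale, the numbers in Simon's question can be turned into a quick back-of-envelope calculation. This is only arithmetic on the figures quoted in the thread. The assumptions are mine, not the thread's: "MB" is read as MiB, "billions" as 10**9 blocks, and every block is written to all three nodes (a replication factor of 3).

```python
BLOCK_SIZE = 4 * 1024                    # 4k blocks
TARGET_BYTES_PER_SEC = 50 * 1024 * 1024  # 50 MB/s, read here as MiB/s
NODES = 3
DRIVES_PER_NODE = 2
REPLICAS = 3                             # assumed, not stated in the thread


def blocks_per_second(bytes_per_sec: int, block_size: int) -> float:
    """Block writes per second needed to sustain the target throughput."""
    return bytes_per_sec / block_size


def per_drive_write_rate(bytes_per_sec: int, replicas: int,
                         nodes: int, drives_per_node: int) -> float:
    """Bytes per second each drive absorbs if writes spread evenly."""
    return bytes_per_sec * replicas / (nodes * drives_per_node)


def dataset_bytes(num_blocks: int, block_size: int) -> int:
    """Raw size of the data set before replication."""
    return num_blocks * block_size


print(blocks_per_second(TARGET_BYTES_PER_SEC, BLOCK_SIZE))   # 12800.0
print(per_drive_write_rate(TARGET_BYTES_PER_SEC, REPLICAS,
                           NODES, DRIVES_PER_NODE))           # 26214400.0
print(dataset_bytes(10**9, BLOCK_SIZE))                       # 4096000000000
```

Under these assumptions the target works out to 12,800 block writes per second, about 25 MiB/s per drive, and roughly 4 TB of raw data per billion blocks. The last figure is the one that matters for Flavio's first point: since ZooKeeper servers keep their state in memory, a data set of that size does not fit the way ZooKeeper stores data.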