Hi Simon, it is not entirely clear to me what you need ZooKeeper for in this case. Are the blocks replicated, and do you need to guarantee that updates are consistent across replicas?

On your observations, I'm quite sure people will have opinions, so here are my thoughts, which might not be representative of the whole community:
1- You're right, we do not recommend using ZooKeeper directly as the data store. ZooKeeper servers keep their state in memory, so it is not a fit for terabytes of block data.
2- Cassandra already provides replication. Are you trying to strengthen the guarantees of Cassandra? I don't get it...
3- It sounds right that you could use BK as a journal, but it is not clear which element is writing to the journal. Are you assuming a metadata manager such as the namenode of HDFS?
4- I'm not sure what this option means. Are you proposing to use ZooKeeper to manage the metadata of the file system? If so, I don't find it entirely unrealistic, since metadata updates are supposed to be small and the performance of ZooKeeper should be good enough for your case. It might be awkward, though, to have your block storage clients talking directly to ZooKeeper: changes to metadata management would then mean rolling out a new version of the client application instead of just implementing the changes on the service side.
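
To put rough numbers on point 1, here is a back-of-envelope sketch in Python. The figures come from your message, but two readings are my own assumptions: I take "billions" as one billion blocks (a lower bound) and "MB" as 2^20 bytes. The names are made up for illustration.

```python
# Back-of-envelope sizing. All constants are assumptions read off the question.
BLOCK_SIZE = 4 * 1024               # "4k blocks", in bytes
TARGET_THROUGHPUT = 50 * 1024 ** 2  # "50 MB/s", read here as 50 * 2^20 bytes/s
NUM_BLOCKS = 10 ** 9                # "billions" of blocks; one billion as a lower bound


def blocks_per_second(throughput_bytes: int, block_size: int) -> int:
    """Block operations per second needed to sustain the target throughput."""
    return throughput_bytes // block_size


def total_bytes(num_blocks: int, block_size: int) -> int:
    """Raw size of the data set, before any replication."""
    return num_blocks * block_size


ops = blocks_per_second(TARGET_THROUGHPUT, BLOCK_SIZE)
data = total_bytes(NUM_BLOCKS, BLOCK_SIZE)

print(ops)              # 12800 block operations per second
print(data // 10 ** 9)  # 4096 GB of block data, i.e. about 4 TB
```

Since every ZooKeeper server holds the complete state in memory, each server would need room for those roughly 4 TB under these assumptions, which is why I would keep the blocks themselves out of ZooKeeper.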

-Flavio

On Jul 13, 2011, at 12:02 PM, Simon Felix wrote:

Hello everyone

What is the best way to build a distributed, shared storage system on top of
ZooKeeper? I'm talking about block storage in the terabyte-range (i.e. store
billions of 4k blocks). Consistency and Availability are important, as is
throughput (both read & write). I need at least 50 MB/s with 3 nodes with
two regular SATA drives each for my application.

Some options I came up with:
1. Use ZooKeeper directly as a data store (Not recommended according to the
docs - and it really leads to abysmally bad performance, I tested that)
2. Use Cassandra as data store
3. Use BookKeeper as write-ahead log and implement my own underlying store
4. Use ZooKeeper to create my own (probably buggy...) data store

What would you recommend? Are there other options?

Cheers,
Simon

flavio junqueira
research scientist

[email protected]
direct +34 93-183-8828

avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301

