Thawan Kooburat created ZOOKEEPER-1608:
------------------------------------------

             Summary: Add support for key-value store as optional storage engine
                 Key: ZOOKEEPER-1608
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1608
             Project: ZooKeeper
          Issue Type: Improvement
          Components: server
    Affects Versions: 3.4.3
            Reporter: Thawan Kooburat


Problem:
1. ZooKeeper need to load the entire dataset into its memory. So the total data 
size and number of znode are limited by the amount of available memory.
2. We want to minimize ZooKeeper down time, but found that it is bound by 
snapshot loading and writing time. The bigger the database, the longer it take 
for the system to recover. The worst case is that if the data size grow too 
large and initLimit wasn't update accordingly, the quorum won't form after 
failure.  

Implementation: (still work in progress)

1. Create a new type of DataTree that supported key-value storage as backing 
store. Our current candidate backing store is Oracle's Berkeley DB Java Edition

2. There is no need to use snapshot facility for this type of DataTree. Since 
doing a sync write of lastProcessedZxid into the backing store is the same as 
taking a snapshot. However, the system still use txnlog as before. The system 
can be considered as having only a single snapshot. It has to rely on backing 
store to detect data corruption and recovery.  

3. There is no need to do any per-node locking. CommitProcessor 
(ZOOKEEPER-1505) prevents concurrent read and write to reach the DataTree. The 
DataTree is also accessed by PrepRequestProcessor (to create ChangeRecord), but 
I believe that read and write to the same znode cannot happens concurrently.

4. There are 3 types of data which is required to be persisted in backing 
store: ACLs, znodes and sessions. However, we also store other data reduce 
oDataTree initialization time or serialization cost such as list of node's 
children and list of ephemeral node. 

5. Each Zookeeper's txn may translate into multiple actions on the DataTree. 
For example, creating a node may result in  AddingZNODE, AddingChildren and 
AddingEphemeralNode. However, as a long as these operations are idempotent, 
there is no need to group them into a transaction. So txns can be replayed on 
DataTree without corrupting the data. This also means that the system don't 
need key-value store that support transaction semantic. Currently, only 
operations related to quota break this assumption because it use increment 
operation.    

6. SNAP protocol is supported so the ensemble can be upgraded online. In the 
future we may add extend SNAP protocol to send raw data file in order to save 
CPU cost when sending large database.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to