MalcolmSanders created FLINK-11809:
--------------------------------------
Summary: Implement etcd based StateHandleStore
Key: FLINK-11809
URL: https://issues.apache.org/jira/browse/FLINK-11809
Project: Flink
Issue Type: Sub-task
Components: Runtime / Coordination
Reporter: MalcolmSanders
Implement StateHandleStore using jetcd.
Previously ZooKeeperStateHandleStore stores data in a dfs file while records
its metadata to a zookeeper node in order to keep the data size of a zookeeper
node small. EtcdStateHandleStore should work in the same way while there is a
corner case that should be carefully dealt with using etcd.
As described in FLINK-6612, the ZooKeeperStateHandleStore does not guard
against concurrent delete operations which could happen in case of a lost
leadership and a new leadership grant. The problem is that checkpoint nodes can
get deleted even after they have been recovered by another
ZooKeeperCompletedCheckpointStore. This corrupts the recovered checkpoint and
thwarts future recoveries. In order to guard against deletions of ZooKeeper
nodes which are still being used by a different ZooKeeperStateHandleStore, a
locking mechanism has been introduced to make sure that zookeeper nodes are
allowed to be deleted only after all ZooKeeperStateHandleStores have released
their lock. The locking mechanism is implemented via ephemeral child nodes of
the respective ZooKeeper node. Whenever a ZooKeeperStateHandleStore wants to
lock a ZNode, thus, protecting it from being deleted, it creates an ephemeral
child node. The node's name is unique to the ZooKeeperStateHandleStore
instance. The delete operations will then only delete the node if it does not
have any children associated. In order to guard against orphaned lock nodes,
they are created as ephemeral nodes. This means that they will be deleted by
ZooKeeper once the connection of the ZooKeeper client which created the node
timed out.
The solution leverages that a zookeeper directory cannot be deleted if it still
has child nodes and a ephemeral node will be deleted by zookeeper server once
the corresponding client is disconnected. Since etcd doesn’t have hierarchical
structure like zookeeper, there is no actual relations between a path and its
so-called parent directory. Suppose we create a persistent key-value to store
metadata and then create an ephemeral key-value using previous key as the
prefix. Once the client disconnects with etcd server, the ephemeral key-value
pair will be deleted from etcd server while there’ll be no effect on its parent
key-value pair.
My proposal to tackle this case is illustrated in Part 6 Implementation of etcd
based StateHandleStore in [design
doc|https://docs.google.com/document/d/12-gIZDuT4IOWG7gmwSqNFsOHuGlkdRHge0ahJ7M311Y/edit#heading=h.sqkj9zjvgicu].
Any comments or suggestions will be appreciated.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)