This is an automated email from the ASF dual-hosted git repository.
hulee pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/helix.wiki.git
The following commit(s) were added to refs/heads/master by this push:
new a886186 Created Concurrency and Parallelism for BucketDataAccessor
(markdown)
a886186 is described below
commit a8861862ff92d130adbe314e339ccf053fca73ea
Author: Hunter Lee <[email protected]>
AuthorDate: Thu Oct 10 14:15:11 2019 -0700
Created Concurrency and Parallelism for BucketDataAccessor (markdown)
---
...rency-and-Parallelism-for-BucketDataAccessor.md | 68 ++++++++++++++++++++++
1 file changed, 68 insertions(+)
diff --git a/Concurrency-and-Parallelism-for-BucketDataAccessor.md
b/Concurrency-and-Parallelism-for-BucketDataAccessor.md
new file mode 100644
index 0000000..7b5878f
--- /dev/null
+++ b/Concurrency-and-Parallelism-for-BucketDataAccessor.md
@@ -0,0 +1,68 @@
+# Overview
+BucketDataAccessor is a Helix data accessor interface designed to write large amounts of data to ZooKeeper. This wiki identifies the limitations of BucketDataAccessor and suggests ways to address them.
+
+# Background
+See Storing Metadata for Helix Constraint-based Rebalancer for more context.
+
+# Problem Statement
+## Low Parallelism
+Currently, ZkBasedBucketDataAccessor uses a locking mechanism implemented with ZooKeeper's ephemeral nodes. This behaves like a mutex: only one entity may read or write at a time. Since there can be only one ongoing operation, users of this data accessor, such as the WAGED rebalancer, suffer from low parallelism. More specifically, verifiers cannot read the metadata written to ZK for testing/verification purposes while another read or write is in progress.
+
+## Latency
+The overall latency of the rebalance pipeline (BestPossibleCalcStage) could be reduced if read and write operations were parallelized. The current implementation, due to locking, has to serialize all read and write operations.
+
+## Added Complexity from Locking
+Unexpected failures are inevitable. When we use locks in a distributed setting, we need to account for every failure scenario related to connection issues. This makes the code more complex, with many try-catch statements.
+
+# Requirements
+## Readers
+* Reads shouldn't be blocked by writes.
+* Reads shouldn't block each other.
+* Reads should read the latest data if possible.
+* Reads should only read successfully-written data (no corrupt data).
+## Writers
+* Writes shouldn't block other writes.
+* Writes shouldn't be blocked by reads.
+* Writes shouldn't overwrite each other.
+* Older writes should never overwrite the lastSuccessfulWriteVersion of newer writes.
+# Solution
+## Metadata ZNodes
+## LastSuccessfulWriteVersion for Reads
+There will be a ZNode that records the last version that was written successfully (and is therefore safe to read), so that readers know which version to read.
+
+## LastWriteVersion for Writes
+There will be a ZNode that records the version used by the last writer. A writer comes in, reads this ZNode, increments it, and uses the incremented version for its write.
+
+For these two ZNodes, synchronization now becomes a problem. For this, we'll use optimistic concurrency control so that there are no conflicts among different versions.
+
+## Optimistic Concurrency Control
+Helix's data accessor supports a ZkClient interface called DataUpdater that we can use to implement ZNode version-based optimistic concurrency control. It works like this:
+
+1. Read the old Stat (which contains the ZNode version).
+2. Write with the expected version from the Stat read in Step 1.
+3. If the versions don't match, retry the write.
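The check-and-retry loop above can be sketched in plain Java. This is a minimal sketch, assuming an in-memory holder standing in for a ZNode; `VersionedZnode`, `setData`, and `updateWithRetry` are illustrative names, not Helix or ZkClient APIs, though the conditional-write semantics mirror ZooKeeper's `setData(path, data, expectedVersion)`.

```java
// In-memory stand-in for a ZNode with a version, mimicking ZooKeeper's
// setData(path, data, expectedVersion) semantics. Illustrative only.
class VersionedZnode {
    private byte[] data = new byte[0];
    private int version = 0; // ZooKeeper bumps the ZNode version on every write

    synchronized int getVersion() { return version; }

    // Succeeds only if the caller saw the latest version (the optimistic check).
    synchronized boolean setData(byte[] newData, int expectedVersion) {
        if (expectedVersion != version) {
            return false; // another writer got in between; caller must retry
        }
        data = newData;
        version++;
        return true;
    }
}

public class OccDemo {
    // Read-modify-write with retry, mirroring Steps 1-3 above.
    static void updateWithRetry(VersionedZnode znode, byte[] newData) {
        while (true) {
            int expected = znode.getVersion();      // Step 1: read the old version
            if (znode.setData(newData, expected)) { // Step 2: conditional write
                return;
            }
            // Step 3: version mismatch -> loop and retry with a fresh read
        }
    }

    public static void main(String[] args) {
        VersionedZnode znode = new VersionedZnode();
        updateWithRetry(znode, "v1".getBytes());
        updateWithRetry(znode, "v2".getBytes());
        System.out.println(znode.getVersion()); // prints 2: two successful writes
    }
}
```

Because the write only succeeds against the exact version that was read, two concurrent updaters can never silently overwrite each other; the loser simply retries against the fresh state.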
+## Monotonically Increasing Version Number
+Another property of this solution is that it essentially adopts the idea of MVCC and has versions increase monotonically. This ensures that there will never be version conflicts.
+
+To prevent integer overflow, we can reset the counter once it reaches a high-enough version number (Integer.MAX_VALUE, for example). An easier alternative is to use a long.
+
+## Garbage Collection of Stale Metadata
+Because of the monotonically increasing version numbers, we will accumulate many versions of the metadata. This needs to be dealt with because it could exhaust ZK's disk space. To garbage collect, we will append an asynchronous operation at the very end of each successful write that deletes all versions older than the new last successful write version.
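The garbage-collection step can be sketched as follows. This is an illustrative in-memory model, not the actual Helix implementation: the `TreeMap` stands in for the per-version ZNodes, and `commit`/`garbageCollect` are hypothetical names.

```java
import java.util.TreeMap;
import java.util.concurrent.CompletableFuture;

// Sketch of version-based garbage collection appended to a successful write.
public class BucketGcDemo {
    // Bucketed data keyed by write version; TreeMap keeps versions ordered.
    private final TreeMap<Long, String> versions = new TreeMap<>();

    // A successful write publishes its version, then kicks off async cleanup.
    CompletableFuture<Void> commit(long version, String data) {
        synchronized (this) {
            versions.put(version, data);
        }
        // Appended at the very end of the write, off the write path.
        return CompletableFuture.runAsync(() -> garbageCollect(version));
    }

    synchronized void garbageCollect(long lastSuccessfulWriteVersion) {
        // Delete every version strictly older than the new last successful write.
        versions.headMap(lastSuccessfulWriteVersion, false).clear();
    }

    synchronized int versionCount() { return versions.size(); }

    public static void main(String[] args) {
        BucketGcDemo store = new BucketGcDemo();
        store.commit(0, "v0").join();
        store.commit(1, "v1").join();
        store.commit(2, "v2").join();
        System.out.println(store.versionCount()); // prints 1: only version 2 survives
    }
}
```

Running the cleanup asynchronously keeps it off the write's critical path; the demo only calls `join()` so the result is observable.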
+
+## Retrying Reads upon Failure
+The garbage collection described above may disrupt an ongoing read. In that case, we fail the read. If the read is retried, it will fetch the last successful write version afresh and read the latest version.
+
+# Pseudocode
+## Reader
+```
+boolean read() {
+  read the last successful write version number from the metadata ZNode
+  if empty/failure, terminate
+  read the data stored under that version
+}
+```
+## Writer
+```
+boolean write() {
+  increment lastWriteVersion using the updater
+  write the data under the incremented lastWriteVersion
+  if the write succeeds, update lastSuccessfulWriteVersion using the updater
+  asynchronously trigger garbage collection
+}
+```
\ No newline at end of file