---
Makefile.am | 4 +-
doc/design-2.1-lock-acquire.dot | 34 +++++++++++
doc/design-2.1-lock-release.dot | 31 ++++++++++
doc/design-2.1.rst | 121 +++++++++++++++++++++++++++++++++++++++
4 files changed, 189 insertions(+), 1 deletions(-)
create mode 100644 doc/design-2.1-lock-acquire.dot
create mode 100644 doc/design-2.1-lock-release.dot
diff --git a/Makefile.am b/Makefile.am
index 9e67492..103e8b5 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -160,7 +160,9 @@ doc/html: $(docrst) $(docpng) doc/conf.py configure.ac
touch "$@"
docdot = \
- doc/arch-2.0.dot
+ doc/arch-2.0.dot \
+ doc/design-2.1-lock-acquire.dot \
+ doc/design-2.1-lock-release.dot
docpng = $(patsubst %.dot,%.png,$(docdot))
diff --git a/doc/design-2.1-lock-acquire.dot b/doc/design-2.1-lock-acquire.dot
new file mode 100644
index 0000000..b90808b
--- /dev/null
+++ b/doc/design-2.1-lock-acquire.dot
@@ -0,0 +1,34 @@
+digraph "design-2.1-lock-acquire" {
+ graph[fontsize=8, fontname="Helvetica"]
+ node[fontsize=8, fontname="Helvetica", width="0", height="0"]
+ edge[fontsize=8, fontname="Helvetica"]
+
+ /* Actions */
+ abort[label="Abort\n(couldn't acquire)"]
+ acquire[label="Acquire lock"]
+ add_to_queue[label="Add condition to queue"]
+ wait[label="Wait for notification"]
+ remove_from_queue[label="Remove from queue"]
+
+ /* Conditions */
+ alone[label="Empty queue and can acquire?", shape=diamond]
+ have_timeout[label="Do I have timeout?", shape=diamond]
+ top_of_queue_and_can_acquire[
+ label="On top of queue and can acquire lock?",
+ shape=diamond,
+ ]
+
+ /* Lines */
+ alone->acquire[label="Yes"]
+ alone->add_to_queue[label="No"]
+
+ have_timeout->abort[label="Yes"]
+ have_timeout->wait[label="No"]
+
+ top_of_queue_and_can_acquire->acquire[label="Yes"]
+ top_of_queue_and_can_acquire->have_timeout[label="No"]
+
+ add_to_queue->wait
+ wait->top_of_queue_and_can_acquire
+ acquire->remove_from_queue
+}
diff --git a/doc/design-2.1-lock-release.dot b/doc/design-2.1-lock-release.dot
new file mode 100644
index 0000000..17ca192
--- /dev/null
+++ b/doc/design-2.1-lock-release.dot
@@ -0,0 +1,31 @@
+digraph "design-2.1-lock-release" {
+ graph[fontsize=8, fontname="Helvetica"]
+ node[fontsize=8, fontname="Helvetica", width="0", height="0"]
+ edge[fontsize=8, fontname="Helvetica"]
+
+ /* Actions */
+ remove_from_owners[label="Remove from owner list"]
+ notify[label="Notify topmost"]
+ swap_shared[label="Swap shared conditions"]
+ success[label="Success"]
+
+ /* Conditions */
+ have_pending[label="Any pending acquires?", shape=diamond]
+ was_active_queue[
+ label="Was active condition\nfor shared acquires?",
+ shape=diamond,
+ ]
+
+ /* Lines */
+ remove_from_owners->have_pending
+
+ have_pending->notify[label="Yes"]
+ have_pending->success[label="No"]
+
+ notify->was_active_queue
+
+ was_active_queue->swap_shared[label="Yes"]
+ was_active_queue->success[label="No"]
+
+ swap_shared->success
+}
diff --git a/doc/design-2.1.rst b/doc/design-2.1.rst
index 2bc9810..adbba2f 100644
--- a/doc/design-2.1.rst
+++ b/doc/design-2.1.rst
@@ -161,6 +161,127 @@ contention and increased memory usage, with it. As this
would be an
extension of the changes proposed before it could be implemented at a later
point in time, but we decided to stay with the simpler solution for now.
+Implementation details
+++++++++++++++++++++++
+
+``SharedLock`` redesign
+^^^^^^^^^^^^^^^^^^^^^^^
+
+The current design of ``SharedLock`` does not lend itself to supporting
+timeouts when acquiring a lock, and it also has minor fairness issues. We
+plan to address both with a redesign. A proof-of-concept implementation was
+written and resulted in significantly simpler code.
+
+Currently ``SharedLock`` uses two separate queues for shared and exclusive
+acquires, and waiters get to run in turns: when an exclusive holder releases
+the lock, shared waiters are allowed to run, and vice versa. Although this
+is fair in the long run, the current implementation has a slight bias
+towards shared waiters. More importantly, an implementation with two
+separate queues cannot support timeouts without adding a lot of complexity.
+
+Our proposed redesign changes ``SharedLock`` to have only a single queue.
+There will be one condition (see Condition_ for more details) in the queue
+per exclusive acquire and two for all shared acquires (see below for an
+explanation). The maximum queue length will always be ``2 + (number of
+exclusive acquires waiting)``. The number of queue entries for shared
+acquires can vary from 0 to 2.
+
+The two conditions for shared acquires are special: they are used in turn.
+When the lock is instantiated, no conditions are in the queue. As soon as
+the first shared acquire arrives (and there are holders or waiting
+acquires; see Acquire_), the active condition is added to the queue. Until
+it becomes the topmost condition in the queue and has been notified, every
+shared acquire is added to this active condition. When the active condition
+is notified, the two conditions are swapped and further shared acquires are
+added to the previously inactive condition (which has now become the active
+condition). Once all waiters on the previously active (now inactive) and
+notified condition have received the notification, it is removed from the
+queue of pending acquires.
+
+This means shared acquires will skip any exclusive acquire in the queue. We
+believe it's better to improve parallelization of operations asking only
+for shared (or read-only) locks; exclusive operations holding the same lock
+cannot be parallelized.
+
+
+Acquire
+*******
+
+For exclusive acquires a new condition is created and appended to the queue.
+Shared acquires are added to the active shared condition; if that condition
+is not yet on the queue, it is appended.
+
+The next step is to wait for our condition to reach the top of the queue
+(to guarantee fairness). If the timeout expires before that happens, we
+return to the caller without acquiring the lock. On every notification we
+also check whether the lock has been deleted, in which case an error is
+returned to the caller.
+
+The lock can be acquired if we're on top of the queue (there is no one else
+ahead of us). For an exclusive acquire, there must not be other exclusive or
+shared holders. For a shared acquire, there must not be an exclusive holder.
+If these conditions are all true, the lock is acquired and we return to the
+caller. In any other case we wait again on the condition.
+
+If the caller was the last waiter on a condition, the condition is removed
+from the queue.
+
+Optimization: There's no need to touch the queue if there are no pending
+acquires and no current holders. The caller can have the lock immediately.
+
+.. image:: design-2.1-lock-acquire.png
+
+
+Release
+*******
+
+First the lock removes the caller from the internal owner list. If there are
+pending acquires in the queue, the first (the oldest) condition is notified.
+
+If the first condition was the active condition for shared acquires, the
+inactive condition will be made active. This ensures fairness with exclusive
+locks by forcing consecutive shared acquires to wait in the queue.
+
+.. image:: design-2.1-lock-release.png
+
+
+Delete
+******
+
+The caller must either hold the lock in exclusive mode already or the lock
+must be acquired in exclusive mode. Trying to delete a lock while it's held
+in shared mode must fail.
+
+After ensuring the lock is held in exclusive mode, the lock will mark itself
+as deleted and continue to notify all pending acquires. They will wake up,
+notice the deleted lock and return an error to the caller.
+
+
+Condition
+^^^^^^^^^
+
+Note: This is not necessary for the locking changes above, but it may be a
+good optimization (pending performance tests).
+
+The existing locking code in Ganeti 2.0 uses Python's built-in
+``threading.Condition`` class. Unfortunately ``Condition`` implements
+timeouts by sleeping for 1ms to 20ms between attempts to acquire the
+condition lock in non-blocking mode. This causes unnecessary context
+switches and contention on the CPython GIL (Global Interpreter Lock).
+
+By using POSIX pipes (see ``pipe(2)``) we can use the operating system's
+support for timeouts on file descriptors (see ``select(2)``). A custom
+condition class will have to be written for this.
+
+On instantiation the class creates a pipe. After each notification the
+previous pipe is abandoned and re-created (technically the old pipe needs to
+stay around until all notifications have been delivered).
+
+All waiting clients of the condition use ``select(2)`` or ``poll(2)`` to
+wait for notifications, optionally with a timeout. A notification is
+signalled to the waiting clients by closing the pipe. If the pipe is not
+closed before the timeout expires, the waiting function returns to its
+caller nonetheless.
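The core notification mechanism can be demonstrated in a few lines. This is
only a sketch of the idea, not the full condition class (``PipeNotifier`` is
our name; the real class would also re-create the pipe after each
notification and integrate with a lock):

```python
import os
import select


class PipeNotifier(object):
    """One-shot notification via a POSIX pipe.

    Waiters select() on the read end with a timeout; notify_all()
    signals them by closing the write end, which makes the read end
    readable (EOF) for every waiter at once.
    """

    def __init__(self):
        self._read_fd, self._write_fd = os.pipe()

    def wait(self, timeout=None):
        """Return True if notified before the timeout, False otherwise."""
        readable, _, _ = select.select([self._read_fd], [], [], timeout)
        return bool(readable)

    def notify_all(self):
        # EOF on the pipe wakes all waiters; no per-waiter bookkeeping
        # is needed to deliver the notification.
        os.close(self._write_fd)

    def close(self):
        os.close(self._read_fd)
```

Because the operating system handles the timeout inside ``select(2)``, the
busy-wait polling of ``threading.Condition`` is avoided entirely.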
+
Feature changes
---------------
--
1.6.4.3