From: Matthew Sakai <msa...@redhat.com>

In order to deduplicate concurrent writes of the same data (to
different locations), data_vios which are writing the same data are
grouped together in a "hash lock," named for and keyed by the hash of
the data being written. Each hash lock is assigned to a hash zone
based on a portion of its hash.
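
For illustration only (the helper below is a hypothetical sketch, not code
from this patch), assigning a lock to a zone amounts to reducing a portion
of the record name modulo the number of configured zones:

    static zone_count_t select_hash_zone(const struct uds_record_name *name,
                                         zone_count_t zone_count)
    {
            /* Any fixed byte of the hash spreads locks across the zones. */
            return name->name[0] % zone_count;
    }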

Co-developed-by: J. corwin Coburn <cor...@hurlbutnet.net>
Signed-off-by: J. corwin Coburn <cor...@hurlbutnet.net>
Co-developed-by: Michael Sclafani <vdo-de...@redhat.com>
Signed-off-by: Michael Sclafani <vdo-de...@redhat.com>
Co-developed-by: Sweet Tea Dorminy <sweettea-ker...@dorminy.me>
Signed-off-by: Sweet Tea Dorminy <sweettea-ker...@dorminy.me>
Signed-off-by: Matthew Sakai <msa...@redhat.com>
Signed-off-by: Mike Snitzer <snit...@kernel.org>
---
 drivers/md/dm-vdo/dedupe.c | 2450 ++++++++++++++++++++++++++++++++++++
 drivers/md/dm-vdo/dedupe.h |   93 ++
 2 files changed, 2543 insertions(+)
 create mode 100644 drivers/md/dm-vdo/dedupe.c
 create mode 100644 drivers/md/dm-vdo/dedupe.h

diff --git a/drivers/md/dm-vdo/dedupe.c b/drivers/md/dm-vdo/dedupe.c
new file mode 100644
index 000000000000..1d6774e8e546
--- /dev/null
+++ b/drivers/md/dm-vdo/dedupe.c
@@ -0,0 +1,2450 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright 2023 Red Hat
+ */
+
+/**
+ * DOC:
+ *
+ * Hash Locks:
+ *
+ * A hash_lock controls and coordinates writing, index access, and dedupe among groups of data_vios
+ * concurrently writing identical blocks, allowing them to deduplicate not only against advice but
+ * also against each other. This saves on index queries and allows those data_vios to concurrently
+ * deduplicate against a single block instead of being serialized through a PBN read lock. Only one
+ * index query is needed for each hash_lock, instead of one for every data_vio.
+ *
+ * A hash_lock acts more like a state machine than a lock. Other than the starting and ending
+ * states INITIALIZING and BYPASSING, every state represents and is held for the duration of an
+ * asynchronous operation. All state transitions are performed on the thread of the hash_zone
+ * containing the lock. An asynchronous operation is almost always performed upon entering a state,
+ * and the callback from that operation triggers exiting the state and entering a new state.
+ *
+ * In all states except DEDUPING, there is a single data_vio, called the lock agent, performing the
+ * asynchronous operations on behalf of the lock. The agent will change during the lifetime of the
+ * lock if the lock is shared by more than one data_vio. data_vios waiting to deduplicate are kept
+ * on a wait queue. Viewed a different way, the agent holds the lock exclusively until the lock
+ * enters the DEDUPING state, at which point it becomes a shared lock that all the waiters (and any
+ * new data_vios that arrive) use to share a PBN lock. In state DEDUPING, there is no agent. When
+ * the last data_vio in the lock calls back in DEDUPING, it becomes the agent and the lock becomes
+ * exclusive again. New data_vios that arrive in the lock will also go on the wait queue.
+ *
+ * The existence of lock waiters is a key factor controlling which state the lock transitions to
+ * next. When the lock is new or has waiters, it will always try to reach DEDUPING, and when it
+ * doesn't, it will try to clean up and exit.
+ *
+ * Deduping requires holding a PBN lock on a block that is known to contain data identical to the
+ * data_vios in the lock, so the lock will send the agent to the duplicate zone to acquire the PBN
+ * lock (LOCKING), to the kernel I/O threads to read and verify the data (VERIFYING), or to write a
+ * new copy of the data to a full data block or a slot in a compressed block (WRITING).
+ *
+ * Cleaning up consists of updating the index when the data location is different from the initial
+ * index query (UPDATING, triggered by stale advice, compression, and rollover), releasing the PBN
+ * lock on the duplicate block (UNLOCKING), and if the agent is the last data_vio referencing the
+ * lock, releasing the hash_lock itself back to the hash zone (BYPASSING).
+ *
+ * The shortest sequence of states is for non-concurrent writes of new data:
+ *   INITIALIZING -> QUERYING -> WRITING -> BYPASSING
+ * This sequence is short because no PBN read lock or index update is needed.
+ *
+ * The non-concurrent case that finds valid advice looks like this (endpoints elided):
+ *   -> QUERYING -> LOCKING -> VERIFYING -> DEDUPING -> UNLOCKING ->
+ * Or with stale advice (endpoints elided):
+ *   -> QUERYING -> LOCKING -> VERIFYING -> UNLOCKING -> WRITING -> UPDATING ->
+ *
+ * When there are not enough reference count increments available on a PBN for a data_vio to
+ * deduplicate, a new lock is forked and the excess waiters roll over to the new lock (which goes
+ * directly to WRITING). The new lock takes the place of the old lock in the lock map so new
+ * data_vios will be directed to it. The two locks will proceed independently, but only the new
+ * lock will have the right to update the index (unless it also forks).
+ *
+ * Since rollover happens in a lock instance, once a valid data location has been selected, it will
+ * not change. QUERYING and WRITING are only performed once per lock lifetime. All other
+ * non-endpoint states can be re-entered.
+ *
+ * The function names in this module follow a convention referencing the states and transitions in
+ * the state machine. For example, for the LOCKING state, there are start_locking() and
+ * finish_locking() functions. start_locking() is invoked by the finish function of the state (or
+ * states) that transition to LOCKING. It performs the actual lock state change and must be invoked
+ * on the hash zone thread. finish_locking() is called by (or continued via callback from) the
+ * code actually obtaining the lock. It does any bookkeeping or decision-making required and
+ * invokes the appropriate start function of the state being transitioned to after LOCKING. A
+ * sketch of this convention follows this comment block.
+ *
+ * ----------------------------------------------------------------------
+ *
+ * Index Queries:
+ *
+ * A query to the UDS index is handled asynchronously by the index's threads. When the query is
+ * complete, a callback supplied with the query will be called from one of those threads. Under
+ * heavy system load, the index may be slower to respond than is desirable for reasonable I/O
+ * throughput. Since deduplication of writes is not necessary for correct operation of a VDO
+ * device, it is acceptable to time out slow index queries and proceed to fulfill a write request
+ * without deduplicating. However, because the uds_request struct itself is supplied by the caller,
+ * we cannot simply reuse a uds_request object which we have chosen to time out. Hence, each
+ * hash_zone maintains a pool of dedupe_contexts which each contain a uds_request along with a
+ * reference to the data_vio on behalf of which they are performing a query.
+ *
+ * When a hash_lock needs to query the index, it attempts to acquire an unused dedupe_context from
+ * its hash_zone's pool. If one is available, that context is prepared, associated with the
+ * hash_lock's agent, added to the list of pending contexts, and then sent to the index. The
+ * context's state will be transitioned from DEDUPE_CONTEXT_IDLE to DEDUPE_CONTEXT_PENDING. If all
+ * goes well, the dedupe callback will be called by the index which will change the context's state
+ * to DEDUPE_CONTEXT_COMPLETE, and the associated data_vio will be enqueued to run back in the hash
+ * zone where the query results will be processed and the context will be put back in the idle
+ * state and returned to the hash_zone's available list.
+ *
+ * The first time an index query is launched from a given hash_zone, a timer is started. When the
+ * timer fires, the hash_zone's completion is enqueued to run in the hash_zone where the zone's
+ * pending list will be searched for any contexts in the pending state which have been running for
+ * too long. Those contexts are transitioned to the DEDUPE_CONTEXT_TIMED_OUT state and moved to the
+ * zone's timed_out list where they won't be examined again if there is a subsequent timeout. The
+ * data_vios associated with timed out contexts are sent to continue processing their write
+ * operation without deduplicating. The timer is also restarted.
+ *
+ * When the dedupe callback is run for a context which is in the timed out state, that context is
+ * moved to the DEDUPE_CONTEXT_TIMED_OUT_COMPLETE state. No other action need be taken as the
+ * associated data_vios have already been dispatched.
+ *
+ * If a hash_lock needs a dedupe context, and the available list is empty, the timed_out list will
+ * be searched for any contexts which are timed out and complete. One of these will be used
+ * immediately, and the rest will be returned to the available list and marked idle.
+ */
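+
+/*
+ * As an illustrative sketch of the naming convention described above (hypothetical code, not
+ * functions from this file), a state FOO would be implemented by a pair of functions shaped like:
+ *
+ *     static void finish_foo(struct vdo_completion *completion)
+ *     {
+ *             // Runs on the hash zone thread; do bookkeeping, then invoke the start
+ *             // function of the next state, e.g. start_bar(lock, agent).
+ *     }
+ *
+ *     static void start_foo(struct hash_lock *lock, struct data_vio *agent)
+ *     {
+ *             lock->state = VDO_HASH_LOCK_FOO;
+ *             // Launch the async operation whose callback continues in finish_foo().
+ *     }
+ */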
+
+#include "dedupe.h"
+
+#include <linux/atomic.h>
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/kobject.h>
+#include <linux/list.h>
+#include <linux/ratelimit.h>
+#include <linux/spinlock.h>
+#include <linux/timer.h>
+
+#include "logger.h"
+#include "memory-alloc.h"
+#include "numeric.h"
+#include "permassert.h"
+#include "string-utils.h"
+#include "uds.h"
+
+#include "action-manager.h"
+#include "admin-state.h"
+#include "completion.h"
+#include "constants.h"
+#include "data-vio.h"
+#include "io-submitter.h"
+#include "packer.h"
+#include "physical-zone.h"
+#include "pointer-map.h"
+#include "slab-depot.h"
+#include "statistics.h"
+#include "types.h"
+#include "vdo.h"
+#include "wait-queue.h"
+
+enum hash_lock_state {
+       /* State for locks that are not in use or are being initialized. */
+       VDO_HASH_LOCK_INITIALIZING,
+
+       /* This is the sequence of states typically used on the non-dedupe path. */
+       VDO_HASH_LOCK_QUERYING,
+       VDO_HASH_LOCK_WRITING,
+       VDO_HASH_LOCK_UPDATING,
+
+       /* The remaining states are typically used on the dedupe path in this order. */
+       VDO_HASH_LOCK_LOCKING,
+       VDO_HASH_LOCK_VERIFYING,
+       VDO_HASH_LOCK_DEDUPING,
+       VDO_HASH_LOCK_UNLOCKING,
+
+       /*
+        * Terminal state for locks returning to the pool. Must be last both because it's the final
+        * state, and also because it's used to count the states.
+        */
+       VDO_HASH_LOCK_BYPASSING,
+};
+
+static const char * const LOCK_STATE_NAMES[] = {
+       [VDO_HASH_LOCK_BYPASSING] = "BYPASSING",
+       [VDO_HASH_LOCK_DEDUPING] = "DEDUPING",
+       [VDO_HASH_LOCK_INITIALIZING] = "INITIALIZING",
+       [VDO_HASH_LOCK_LOCKING] = "LOCKING",
+       [VDO_HASH_LOCK_QUERYING] = "QUERYING",
+       [VDO_HASH_LOCK_UNLOCKING] = "UNLOCKING",
+       [VDO_HASH_LOCK_UPDATING] = "UPDATING",
+       [VDO_HASH_LOCK_VERIFYING] = "VERIFYING",
+       [VDO_HASH_LOCK_WRITING] = "WRITING",
+};
+
+struct hash_lock {
+       /* The block hash covered by this lock */
+       struct uds_record_name hash;
+
+       /* When the lock is unused, this list entry allows the lock to be pooled */
+       struct list_head pool_node;
+
+       /*
+        * A list containing the data VIOs sharing this lock, all having the same record name and
+        * data block contents, linked by their hash_lock_node fields.
+        */
+       struct list_head duplicate_ring;
+
+       /* The number of data_vios sharing this lock instance */
+       data_vio_count_t reference_count;
+
+       /* The maximum value of reference_count in the lifetime of this lock */
+       data_vio_count_t max_references;
+
+       /* The current state of this lock */
+       enum hash_lock_state state;
+
+       /* True if the UDS index should be updated with new advice */
+       bool update_advice;
+
+       /* True if the advice has been verified to be a true duplicate */
+       bool verified;
+
+       /* True if the lock has already accounted for an initial verification */
+       bool verify_counted;
+
+       /* True if this lock is registered in the lock map (cleared on rollover) */
+       bool registered;
+
+       /*
+        * If verified is false, this is the location of a possible duplicate. If verified is true,
+        * it is the verified location of a true duplicate.
+        */
+       struct zoned_pbn duplicate;
+
+       /* The PBN lock on the block containing the duplicate data */
+       struct pbn_lock *duplicate_lock;
+
+       /* The data_vio designated to act on behalf of the lock */
+       struct data_vio *agent;
+
+       /*
+        * Other data_vios with data identical to the agent that are currently waiting for the agent
+        * to get the information they all need to deduplicate--either against each other, or
+        * against an existing duplicate on disk.
+        */
+       struct wait_queue waiters;
+};
+
+enum {
+       LOCK_POOL_CAPACITY = MAXIMUM_VDO_USER_VIOS,
+};
+
+struct hash_zones {
+       struct action_manager *manager;
+       struct kobject dedupe_directory;
+       struct uds_parameters parameters;
+       struct uds_index_session *index_session;
+       struct ratelimit_state ratelimiter;
+       atomic64_t timeouts;
+       atomic64_t dedupe_context_busy;
+
+       /* This spinlock protects the state fields and the starting of dedupe requests. */
+       spinlock_t lock;
+
+       /* The fields in the next block are all protected by the lock */
+       struct vdo_completion completion;
+       enum index_state index_state;
+       enum index_state index_target;
+       struct admin_state state;
+       bool changing;
+       bool create_flag;
+       bool dedupe_flag;
+       bool error_flag;
+       u64 reported_timeouts;
+
+       /* The number of zones */
+       zone_count_t zone_count;
+       /* The hash zones themselves */
+       struct hash_zone zones[];
+};
+
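+/* Convert a generic vdo_completion to the hash_zone that contains it. */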
+static inline struct hash_zone *as_hash_zone(struct vdo_completion *completion)
+{
+       vdo_assert_completion_type(completion, VDO_HASH_ZONE_COMPLETION);
+       return container_of(completion, struct hash_zone, completion);
+}
+
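+/* Convert a generic vdo_completion to the hash_zones structure that contains it. */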
+static inline struct hash_zones *as_hash_zones(struct vdo_completion *completion)
+{
+       vdo_assert_completion_type(completion, VDO_HASH_ZONES_COMPLETION);
+       return container_of(completion, struct hash_zones, completion);
+}
+
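+/* Assert that the current callback is running on the thread of the given hash zone. */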
+static inline void assert_in_hash_zone(struct hash_zone *zone, const char *name)
+{
+       ASSERT_LOG_ONLY((vdo_get_callback_thread_id() == zone->thread_id),
+                       "%s called on hash zone thread",
+                       name);
+}
+
+/**
+ * return_hash_lock_to_pool() - (Re)initialize a hash lock and return it to its pool.
+ * @zone: The zone from which the lock was borrowed.
+ * @lock: The lock that is no longer in use.
+ */
+static void return_hash_lock_to_pool(struct hash_zone *zone, struct hash_lock *lock)
+{
+       memset(lock, 0, sizeof(*lock));
+       INIT_LIST_HEAD(&lock->pool_node);
+       INIT_LIST_HEAD(&lock->duplicate_ring);
+       vdo_initialize_wait_queue(&lock->waiters);
+       list_add_tail(&lock->pool_node, &zone->lock_pool);
+}
+
+/**
+ * vdo_get_duplicate_lock() - Get the PBN lock on the duplicate data location for a data_vio from
+ *                            the hash_lock the data_vio holds (if there is one).
+ * @data_vio: The data_vio to query.
+ *
+ * Return: The PBN lock on the data_vio's duplicate location.
+ */
+struct pbn_lock *vdo_get_duplicate_lock(struct data_vio *data_vio)
+{
+       if (data_vio->hash_lock == NULL)
+               return NULL;
+       return data_vio->hash_lock->duplicate_lock;
+}
+
+/**
+ * get_hash_lock_state_name() - Get the string representation of a hash lock state.
+ * @state: The hash lock state.
+ *
+ * Return: The short string representing the state
+ */
+static const char *get_hash_lock_state_name(enum hash_lock_state state)
+{
+       /* Catch if a state has been added without updating the name array. */
+       STATIC_ASSERT((VDO_HASH_LOCK_BYPASSING + 1) == ARRAY_SIZE(LOCK_STATE_NAMES));
+       return (state < ARRAY_SIZE(LOCK_STATE_NAMES)) ? LOCK_STATE_NAMES[state] : "INVALID";
+}
+
+/**
+ * assert_hash_lock_agent() - Assert that a data_vio is the agent of its hash lock, and that this
+ *                            is being called in the hash zone.
+ * @data_vio: The data_vio expected to be the lock agent.
+ * @where: A string describing the function making the assertion.
+ */
+static void assert_hash_lock_agent(struct data_vio *data_vio, const char *where)
+{
+       /* Not safe to access the agent field except from the hash zone. */
+       assert_data_vio_in_hash_zone(data_vio);
+       ASSERT_LOG_ONLY(data_vio == data_vio->hash_lock->agent,
+                       "%s must be for the hash lock agent", where);
+}
+
+/**
+ * set_duplicate_lock() - Set the duplicate lock held by a hash lock. May only be called in the
+ *                        physical zone of the PBN lock.
+ * @hash_lock: The hash lock to update.
+ * @pbn_lock: The PBN read lock to use as the duplicate lock.
+ */
+static void set_duplicate_lock(struct hash_lock *hash_lock, struct pbn_lock *pbn_lock)
+{
+       ASSERT_LOG_ONLY((hash_lock->duplicate_lock == NULL),
+                       "hash lock must not already hold a duplicate lock");
+
+       pbn_lock->holder_count += 1;
+       hash_lock->duplicate_lock = pbn_lock;
+}
+
+/**
+ * dequeue_lock_waiter() - Remove the first data_vio from the lock's wait queue and return it.
+ * @lock: The lock containing the wait queue.
+ *
+ * Return: The first (oldest) waiter in the queue, or NULL if the queue is empty.
+ */
+static inline struct data_vio *dequeue_lock_waiter(struct hash_lock *lock)
+{
+       return waiter_as_data_vio(vdo_dequeue_next_waiter(&lock->waiters));
+}
+
+/**
+ * set_hash_lock() - Set, change, or clear the hash lock a data_vio is using.
+ * @data_vio: The data_vio to update.
+ * @new_lock: The hash lock the data_vio is joining.
+ *
+ * Updates the hash lock (or locks) to reflect the change in membership.
+ */
+static void set_hash_lock(struct data_vio *data_vio, struct hash_lock *new_lock)
+{
+       struct hash_lock *old_lock = data_vio->hash_lock;
+
+       if (old_lock != NULL) {
+               ASSERT_LOG_ONLY(data_vio->hash_zone != NULL,
+                               "must have a hash zone when holding a hash 
lock");
+               ASSERT_LOG_ONLY(!list_empty(&data_vio->hash_lock_entry),
+                               "must be on a hash lock ring when holding a 
hash lock");
+               ASSERT_LOG_ONLY(old_lock->reference_count > 0,
+                               "hash lock reference must be counted");
+
+               if ((old_lock->state != VDO_HASH_LOCK_BYPASSING) &&
+                   (old_lock->state != VDO_HASH_LOCK_UNLOCKING))
+                       /*
+                        * If the reference count goes to zero in a non-terminal state, we're most
+                        * likely leaking this lock.
+                        */
+                       ASSERT_LOG_ONLY(old_lock->reference_count > 1,
+                                       "hash locks should only become unreferenced in a terminal state, not state %s",
+                                       get_hash_lock_state_name(old_lock->state));
+
+               list_del_init(&data_vio->hash_lock_entry);
+               old_lock->reference_count -= 1;
+
+               data_vio->hash_lock = NULL;
+       }
+
+       if (new_lock != NULL) {
+               /*
+                * Keep all data_vios sharing the lock on a ring since they can complete in any
+                * order and we'll always need a pointer to one to compare data.
+                */
+               list_move_tail(&data_vio->hash_lock_entry, &new_lock->duplicate_ring);
+               new_lock->reference_count += 1;
+               if (new_lock->max_references < new_lock->reference_count)
+                       new_lock->max_references = new_lock->reference_count;
+
+               data_vio->hash_lock = new_lock;
+       }
+}
+
+/* There are loops in the state diagram, so some forward decl's are needed. */
+static void start_deduping(struct hash_lock *lock, struct data_vio *agent, bool agent_is_done);
+static void start_locking(struct hash_lock *lock, struct data_vio *agent);
+static void start_writing(struct hash_lock *lock, struct data_vio *agent);
+static void unlock_duplicate_pbn(struct vdo_completion *completion);
+static void transfer_allocation_lock(struct data_vio *data_vio);
+
+/**
+ * exit_hash_lock() - Bottleneck for data_vios that have written or deduplicated and no longer
+ *                    need to be the agent for the hash lock.
+ * @data_vio: The data_vio to complete and send to be cleaned up.
+ */
+static void exit_hash_lock(struct data_vio *data_vio)
+{
+       /* Release the hash lock now, saving a thread transition in cleanup. */
+       vdo_release_hash_lock(data_vio);
+
+       /* Complete the data_vio and start the clean-up path to release any locks it still holds. */
+       data_vio->vio.completion.callback = complete_data_vio;
+
+       continue_data_vio(data_vio);
+}
+
+/**
+ * set_duplicate_location() - Set the location of the duplicate block for data_vio, updating the
+ *                            is_duplicate and duplicate fields from a zoned_pbn.
+ * @data_vio: The data_vio to modify.
+ * @source: The location of the duplicate.
+ */
+static void set_duplicate_location(struct data_vio *data_vio, const struct zoned_pbn source)
+{
+       data_vio->is_duplicate = (source.pbn != VDO_ZERO_BLOCK);
+       data_vio->duplicate = source;
+}
+
+/**
+ * retire_lock_agent() - Retire the active lock agent, replacing it with the first lock waiter, and
+ *                       make the retired agent exit the hash lock.
+ * @lock: The hash lock to update.
+ *
+ * Return: The new lock agent (which will be NULL if there was no waiter)
+ */
+static struct data_vio *retire_lock_agent(struct hash_lock *lock)
+{
+       struct data_vio *old_agent = lock->agent;
+       struct data_vio *new_agent = dequeue_lock_waiter(lock);
+
+       lock->agent = new_agent;
+       exit_hash_lock(old_agent);
+       if (new_agent != NULL)
+               set_duplicate_location(new_agent, lock->duplicate);
+       return new_agent;
+}
+
+/**
+ * wait_on_hash_lock() - Add a data_vio to the lock's queue of waiters.
+ * @lock: The hash lock on which to wait.
+ * @data_vio: The data_vio to add to the queue.
+ */
+static void wait_on_hash_lock(struct hash_lock *lock, struct data_vio *data_vio)
+{
+       vdo_enqueue_waiter(&lock->waiters, &data_vio->waiter);
+
+       /*
+        * Make sure the agent doesn't block indefinitely in the packer since it now has at least
+        * one other data_vio waiting on it.
+        */
+       if ((lock->state != VDO_HASH_LOCK_WRITING) || !cancel_data_vio_compression(lock->agent))
+               return;
+
+       /*
+        * Even though we're waiting, we also have to send ourselves as a one-way message to the
+        * packer to ensure the agent continues executing. This is safe because
+        * cancel_data_vio_compression() guarantees the agent won't continue executing until this
+        * message arrives in the packer, and because the wait queue link isn't used for sending
+        * the message.
+        */
+       data_vio->compression.lock_holder = lock->agent;
+       launch_data_vio_packer_callback(data_vio, vdo_remove_lock_holder_from_packer);
+}
+
+/**
+ * abort_waiter() - waiter_callback function that shunts waiters to write their blocks without
+ *                  optimization.
+ * @waiter: The data_vio's waiter link.
+ * @context: Not used.
+ */
+static void abort_waiter(struct waiter *waiter, void *context __always_unused)
+{
+       write_data_vio(waiter_as_data_vio(waiter));
+}
+
+/**
+ * start_bypassing() - Stop using the hash lock.
+ * @lock: The hash lock.
+ * @agent: The data_vio acting as the agent for the lock.
+ *
+ * Stops using the hash lock. This is the final transition for hash locks which did not get an
+ * error.
+ */
+static void start_bypassing(struct hash_lock *lock, struct data_vio *agent)
+{
+       lock->state = VDO_HASH_LOCK_BYPASSING;
+       exit_hash_lock(agent);
+}
+
+void vdo_clean_failed_hash_lock(struct data_vio *data_vio)
+{
+       struct hash_lock *lock = data_vio->hash_lock;
+
+       if (lock->state == VDO_HASH_LOCK_BYPASSING) {
+               exit_hash_lock(data_vio);
+               return;
+       }
+
+       if (lock->agent == NULL) {
+               lock->agent = data_vio;
+       } else if (data_vio != lock->agent) {
+               exit_hash_lock(data_vio);
+               return;
+       }
+
+       lock->state = VDO_HASH_LOCK_BYPASSING;
+
+       /* Ensure we don't attempt to update advice when cleaning up. */
+       lock->update_advice = false;
+
+       vdo_notify_all_waiters(&lock->waiters, abort_waiter, NULL);
+
+       if (lock->duplicate_lock != NULL) {
+               /* The agent must reference the duplicate zone to launch it. */
+               data_vio->duplicate = lock->duplicate;
+               launch_data_vio_duplicate_zone_callback(data_vio, unlock_duplicate_pbn);
+               return;
+       }
+
+       lock->agent = NULL;
+       data_vio->is_duplicate = false;
+       exit_hash_lock(data_vio);
+}
+
+/**
+ * finish_unlocking() - Handle the result of the agent for the lock releasing a read lock on the
+ *                      duplicate candidate.
+ * @completion: The completion of the data_vio acting as the lock's agent.
+ *
+ * This continuation is registered in unlock_duplicate_pbn().
+ */
+static void finish_unlocking(struct vdo_completion *completion)
+{
+       struct data_vio *agent = as_data_vio(completion);
+       struct hash_lock *lock = agent->hash_lock;
+
+       assert_hash_lock_agent(agent, __func__);
+
+       ASSERT_LOG_ONLY(lock->duplicate_lock == NULL,
+                       "must have released the duplicate lock for the hash 
lock");
+
+       if (!lock->verified) {
+               /*
+                * UNLOCKING -> WRITING transition: The lock we released was on an unverified
+                * block, so it must have been a lock on advice we were verifying, not on a
+                * location that was used for deduplication. Go write (or compress) the block to
+                * get a location to dedupe against.
+                */
+               start_writing(lock, agent);
+               return;
+       }
+
+       /*
+        * With the lock released, the verified duplicate block may already have changed and will
+        * need to be re-verified if a waiter arrived.
+        */
+       lock->verified = false;
+
+       if (vdo_has_waiters(&lock->waiters)) {
+               /*
+                * UNLOCKING -> LOCKING transition: A new data_vio entered the hash lock while the
+                * agent was releasing the PBN lock. The current agent exits and the waiter has to
+                * re-lock and re-verify the duplicate location.
+                *
+                * TODO: If we used the current agent to re-acquire the PBN lock we wouldn't need
+                * to re-verify.
+                */
+               agent = retire_lock_agent(lock);
+               start_locking(lock, agent);
+               return;
+       }
+
+       /*
+        * UNLOCKING -> BYPASSING transition: The agent is done with the lock and no other
+        * data_vios reference it, so remove it from the lock map and return it to the pool.
+        */
+       start_bypassing(lock, agent);
+}
+
+/**
+ * unlock_duplicate_pbn() - Release a read lock on the PBN of the block that may or may not have
+ *                          contained duplicate data.
+ * @completion: The completion of the data_vio acting as the lock's agent.
+ *
+ * This continuation is launched by start_unlocking(), and calls back to finish_unlocking() on the
+ * hash zone thread.
+ */
+static void unlock_duplicate_pbn(struct vdo_completion *completion)
+{
+       struct data_vio *agent = as_data_vio(completion);
+       struct hash_lock *lock = agent->hash_lock;
+
+       assert_data_vio_in_duplicate_zone(agent);
+       ASSERT_LOG_ONLY(lock->duplicate_lock != NULL, "must have a duplicate lock to release");
+
+       vdo_release_physical_zone_pbn_lock(agent->duplicate.zone,
+                                          agent->duplicate.pbn,
+                                          UDS_FORGET(lock->duplicate_lock));
+       if (lock->state == VDO_HASH_LOCK_BYPASSING) {
+               complete_data_vio(completion);
+               return;
+       }
+
+       launch_data_vio_hash_zone_callback(agent, finish_unlocking);
+}
+
+/**
+ * start_unlocking() - Release a read lock on the PBN of the block that may or may not have
+ *                     contained duplicate data.
+ * @lock: The hash lock.
+ * @agent: The data_vio currently acting as the agent for the lock.
+ */
+static void start_unlocking(struct hash_lock *lock, struct data_vio *agent)
+{
+       lock->state = VDO_HASH_LOCK_UNLOCKING;
+       launch_data_vio_duplicate_zone_callback(agent, unlock_duplicate_pbn);
+}
+
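+/* Return a dedupe_context to its hash_zone's available list, decrementing the active count. */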
+static void release_context(struct dedupe_context *context)
+{
+       struct hash_zone *zone = context->zone;
+
+       WRITE_ONCE(zone->active, zone->active - 1);
+       list_move(&context->list_entry, &zone->available);
+}
+
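+/*
+ * Release the agent's dedupe_context, but only if the UDS request actually completed; contexts
+ * which timed out are reclaimed separately from the timed_out list.
+ */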
+static void process_update_result(struct data_vio *agent)
+{
+       struct dedupe_context *context = agent->dedupe_context;
+
+       if ((context == NULL) ||
+           !change_context_state(context, DEDUPE_CONTEXT_COMPLETE, DEDUPE_CONTEXT_IDLE))
+               return;
+
+       release_context(context);
+}
+
+/**
+ * finish_updating() - Process the result of a UDS update performed by the agent for the lock.
+ * @completion: The completion of the data_vio that performed the update
+ *
+ * This continuation is registered in start_querying().
+ */
+static void finish_updating(struct vdo_completion *completion)
+{
+       struct data_vio *agent = as_data_vio(completion);
+       struct hash_lock *lock = agent->hash_lock;
+
+       assert_hash_lock_agent(agent, __func__);
+
+       process_update_result(agent);
+
+       /*
+        * UDS was updated successfully, so don't update again unless the duplicate location
+        * changes due to rollover.
+        */
+       lock->update_advice = false;
+
+       if (vdo_has_waiters(&lock->waiters)) {
+               /*
+                * UPDATING -> DEDUPING transition: A new data_vio arrived during the UDS update.
+                * Send it on the verified dedupe path. The agent is done with the lock, but the
+                * lock may still need to use it to clean up after rollover.
+                */
+               start_deduping(lock, agent, true);
+               return;
+       }
+
+       if (lock->duplicate_lock != NULL) {
+               /*
+                * UPDATING -> UNLOCKING transition: No one is waiting to dedupe, but we hold a
+                * duplicate PBN lock, so go release it.
+                */
+               start_unlocking(lock, agent);
+               return;
+       }
+
+       /*
+        * UPDATING -> BYPASSING transition: No one is waiting to dedupe and there's no lock to
+        * release.
+        */
+       start_bypassing(lock, agent);
+}
+
+static void query_index(struct data_vio *data_vio, enum uds_request_type operation);
+
+/**
+ * start_updating() - Continue deduplication with the last step, updating UDS with the location of
+ *                    the duplicate that should be returned as advice in the future.
+ * @lock: The hash lock.
+ * @agent: The data_vio currently acting as the agent for the lock.
+ */
+static void start_updating(struct hash_lock *lock, struct data_vio *agent)
+{
+       lock->state = VDO_HASH_LOCK_UPDATING;
+
+       ASSERT_LOG_ONLY(lock->verified, "new advice should have been verified");
+       ASSERT_LOG_ONLY(lock->update_advice, "should only update advice if needed");
+
+       agent->last_async_operation = VIO_ASYNC_OP_UPDATE_DEDUPE_INDEX;
+       set_data_vio_hash_zone_callback(agent, finish_updating);
+       query_index(agent, UDS_UPDATE);
+}
+
+/**
+ * finish_deduping() - Handle a data_vio that has finished deduplicating against the block locked
+ *                     by the hash lock.
+ * @lock: The hash lock.
+ * @data_vio: The lock holder that has finished deduplicating.
+ *
+ * If there are other data_vios still sharing the lock, this will just release the data_vio's share
+ * of the lock and finish processing the data_vio. If this is the last data_vio holding the lock,
+ * this makes the data_vio the lock agent and uses it to advance the state of the lock so it can
+ * eventually be released.
+ */
+static void finish_deduping(struct hash_lock *lock, struct data_vio *data_vio)
+{
+       struct data_vio *agent = data_vio;
+
+       ASSERT_LOG_ONLY(lock->agent == NULL, "shouldn't have an agent in DEDUPING");
+       ASSERT_LOG_ONLY(!vdo_has_waiters(&lock->waiters),
+                       "shouldn't have any lock waiters in DEDUPING");
+
+       /* Just release the lock reference if other data_vios are still deduping. */
+       if (lock->reference_count > 1) {
+               exit_hash_lock(data_vio);
+               return;
+       }
+
+       /* The hash lock must have an agent for all other lock states. */
+       lock->agent = agent;
+       if (lock->update_advice)
+               /*
+                * DEDUPING -> UPDATING transition: The location of the duplicate block changed
+                * since the initial UDS query because of compression, rollover, or because the
+                * query agent didn't have an allocation. The UDS update was delayed in case there
+                * was another change in location, but with only this data_vio using the hash lock,
+                * it's time to update the advice.
+                */
+               start_updating(lock, agent);
+       else
+               /*
+                * DEDUPING -> UNLOCKING transition: Release the PBN read lock on the duplicate
+                * location so the hash lock itself can be released (contingent on no new data_vios
+                * arriving in the lock before the agent returns).
+                */
+               start_unlocking(lock, agent);
+}
+
+/**
+ * acquire_lock() - Get the lock for a record name.
+ * @zone: The zone responsible for the hash.
+ * @hash: The hash to lock.
+ * @replace_lock: If non-NULL, the lock already registered for the hash which should be replaced by
+ *                the new lock.
+ * @lock_ptr: A pointer to receive the hash lock.
+ *
+ * Gets the lock for the hash (record name) of the data in a data_vio, or if one does not exist (or
+ * if we are explicitly rolling over), initializes a new lock for the hash and registers it in the
+ * zone. This must only be called in the correct thread for the zone.
+ *
+ * Return: VDO_SUCCESS or an error code.
+ */
+static int __must_check acquire_lock(struct hash_zone *zone,
+                                    const struct uds_record_name *hash,
+                                    struct hash_lock *replace_lock,
+                                    struct hash_lock **lock_ptr)
+{
+       struct hash_lock *lock, *new_lock;
+       int result;
+
+       /*
+        * Borrow and prepare a lock from the pool so we don't have to do two pointer_map accesses
+        * in the common case of no lock contention.
+        */
+       result = ASSERT(!list_empty(&zone->lock_pool), "never need to wait for a free hash lock");
+       if (result != VDO_SUCCESS)
+               return result;
+
+       new_lock = list_entry(zone->lock_pool.prev, struct hash_lock, pool_node);
+       list_del_init(&new_lock->pool_node);
+
+       /*
+        * Fill in the hash of the new lock so we can map it, since we have to use the hash as the
+        * map key.
+        */
+       new_lock->hash = *hash;
+
+       result = vdo_pointer_map_put(zone->hash_lock_map,
+                                    &new_lock->hash,
+                                    new_lock,
+                                    (replace_lock != NULL),
+                                    (void **) &lock);
+       if (result != VDO_SUCCESS) {
+               return_hash_lock_to_pool(zone, UDS_FORGET(new_lock));
+               return result;
+       }
+
+       if (replace_lock != NULL) {
+               /* On mismatch put the old lock back and return a severe error */
+               ASSERT_LOG_ONLY(lock == replace_lock, "old lock must have been in the lock map");
+               /* TODO: Check earlier and bail out? */
+               ASSERT_LOG_ONLY(replace_lock->registered,
+                               "old lock must have been marked registered");
+               replace_lock->registered = false;
+       }
+
+       if (lock == replace_lock) {
+               lock = new_lock;
+               lock->registered = true;
+       } else {
+               /* There's already a lock for the hash, so we don't need the borrowed lock. */
+               return_hash_lock_to_pool(zone, UDS_FORGET(new_lock));
+       }
+
+       *lock_ptr = lock;
+       return VDO_SUCCESS;
+}
+
+/**
+ * enter_forked_lock() - Bind the data_vio to a new hash lock.
+ *
+ * Implements waiter_callback. Binds the data_vio that was waiting to a new hash lock and waits on
+ * that lock.
+ */
+static void enter_forked_lock(struct waiter *waiter, void *context)
+{
+       struct data_vio *data_vio = waiter_as_data_vio(waiter);
+       struct hash_lock *new_lock = (struct hash_lock *) context;
+
+       set_hash_lock(data_vio, new_lock);
+       wait_on_hash_lock(new_lock, data_vio);
+}
+
+/**
+ * fork_hash_lock() - Fork a hash lock because it has run out of increments on the duplicate PBN.
+ * @old_lock: The hash lock to fork.
+ * @new_agent: The data_vio that will be the agent for the new lock.
+ *
+ * Transfers the new agent and any lock waiters to a new hash lock instance which takes the place
+ * of the old lock in the lock map. The old lock remains active, but will not update advice.
+ */
+static void fork_hash_lock(struct hash_lock *old_lock, struct data_vio *new_agent)
+{
+       struct hash_lock *new_lock;
+       int result;
+
+       result = acquire_lock(new_agent->hash_zone, &new_agent->record_name, old_lock, &new_lock);
+       if (result != VDO_SUCCESS) {
+               continue_data_vio_with_error(new_agent, result);
+               return;
+       }
+
+       /*
+        * Only one of the two locks should update UDS. The old lock is out of references, so it
+        * would be poor dedupe advice in the short term.
+        */
+       old_lock->update_advice = false;
+       new_lock->update_advice = true;
+
+       set_hash_lock(new_agent, new_lock);
+       new_lock->agent = new_agent;
+
+       vdo_notify_all_waiters(&old_lock->waiters, enter_forked_lock, new_lock);
+
+       new_agent->is_duplicate = false;
+       start_writing(new_lock, new_agent);
+}
+
+/**
+ * launch_dedupe() - Reserve a reference count increment for a data_vio and launch it on the dedupe
+ *                   path.
+ * @lock: The hash lock.
+ * @data_vio: The data_vio to deduplicate using the hash lock.
+ * @has_claim: true if the data_vio already has claimed an increment from the duplicate lock.
+ *
+ * If no increments are available, this will roll over to a new hash lock and launch the data_vio
+ * as the writing agent for that lock.
+ */
+static void launch_dedupe(struct hash_lock *lock, struct data_vio *data_vio, bool has_claim)
+{
+       if (!has_claim && !vdo_claim_pbn_lock_increment(lock->duplicate_lock)) {
+               /* Out of increments, so must roll over to a new lock. */
+               fork_hash_lock(lock, data_vio);
+               return;
+       }
+
+       /* Deduplicate against the lock's verified location. */
+       set_duplicate_location(data_vio, lock->duplicate);
+       data_vio->new_mapped = data_vio->duplicate;
+       update_metadata_for_data_vio_write(data_vio, lock->duplicate_lock);
+}
+
+/**
+ * start_deduping() - Enter the hash lock state where data_vios deduplicate in parallel against a
+ *                    true copy of their data on disk.
+ * @lock: The hash lock.
+ * @agent: The data_vio acting as the agent for the lock.
+ * @agent_is_done: true only if the agent has already written or deduplicated against its data.
+ *
+ * If the agent itself needs to deduplicate, an increment for it must already have been claimed
+ * from the duplicate lock, ensuring the hash lock will still have a data_vio holding it.
+ */
+static void start_deduping(struct hash_lock *lock, struct data_vio *agent, bool agent_is_done)
+{
+       lock->state = VDO_HASH_LOCK_DEDUPING;
+
+       /*
+        * We don't take the downgraded allocation lock from the agent unless we actually need to
+        * deduplicate against it.
+        */
+       if (lock->duplicate_lock == NULL) {
+               ASSERT_LOG_ONLY(!vdo_is_state_compressed(agent->new_mapped.state),
+                               "compression must have shared a lock");
+               ASSERT_LOG_ONLY(agent_is_done, "agent must have written the new duplicate");
+               transfer_allocation_lock(agent);
+       }
+
+       ASSERT_LOG_ONLY(vdo_is_pbn_read_lock(lock->duplicate_lock),
+                       "duplicate_lock must be a PBN read lock");
+
+       /*
+        * This state is not like any of the other states. There is no designated agent--the agent
+        * transitioning to this state and all the waiters will be launched to deduplicate in
+        * parallel.
+        */
+       lock->agent = NULL;
+
+       /*
+        * Launch the agent (if not already deduplicated) and as many lock waiters as we have
+        * available increments for on the dedupe path. If we run out of increments, rollover will
+        * be triggered and the remaining waiters will be transferred to the new lock.
+        */
+       if (!agent_is_done) {
+               launch_dedupe(lock, agent, true);
+               agent = NULL;
+       }
+       while (vdo_has_waiters(&lock->waiters))
+               launch_dedupe(lock, dequeue_lock_waiter(lock), false);
+
+       if (agent_is_done)
+               /*
+                * In the degenerate case where all the waiters rolled over to a new lock, this
+                * will continue to use the old agent to clean up this lock, and otherwise it just
+                * lets the agent exit the lock.
+                */
+               finish_deduping(lock, agent);
+}
+
+/**
+ * increment_stat() - Increment a statistic counter in a non-atomic yet thread-safe manner.
+ * @stat: The statistic field to increment.
+ */
+static void increment_stat(u64 *stat)
+{
+       /*
+        * Must only be mutated on the hash zone thread. Prevents any compiler shenanigans from
+        * affecting other threads reading stats.
+        */
+       WRITE_ONCE(*stat, *stat + 1);
+}
+
+/**
+ * finish_verifying() - Handle the result of the agent for the lock comparing its data to the
+ *                      duplicate candidate.
+ * @completion: The completion of the data_vio used to verify dedupe
+ *
+ * This continuation is registered in start_verifying().
+ */
+static void finish_verifying(struct vdo_completion *completion)
+{
+       struct data_vio *agent = as_data_vio(completion);
+       struct hash_lock *lock = agent->hash_lock;
+
+       assert_hash_lock_agent(agent, __func__);
+
+       lock->verified = agent->is_duplicate;
+
+       /*
+        * Only count the result of the initial verification of the advice as valid or stale, and
+        * not any re-verifications due to PBN lock releases.
+        */
+       if (!lock->verify_counted) {
+               lock->verify_counted = true;
+               if (lock->verified)
+                       increment_stat(&agent->hash_zone->statistics.dedupe_advice_valid);
+               else
+                       increment_stat(&agent->hash_zone->statistics.dedupe_advice_stale);
+       }
+
+       /*
+        * Even if the block is a verified duplicate, we can't start to deduplicate unless we can
+        * claim a reference count increment for the agent.
+        */
+       if (lock->verified && !vdo_claim_pbn_lock_increment(lock->duplicate_lock)) {
+               agent->is_duplicate = false;
+               lock->verified = false;
+       }
+
+       if (lock->verified) {
+               /*
+                * VERIFYING -> DEDUPING transition: The advice is for a true duplicate, so start
+                * deduplicating against it, if references are available.
+                */
+               start_deduping(lock, agent, false);
+       } else {
+               /*
+                * VERIFYING -> UNLOCKING transition: Either the verify failed or we'd try to
+                * dedupe and roll over immediately, which would fail because it would leave the
+                * lock without an agent to release the PBN lock. In both cases, the data will have
+                * to be written or compressed, but first the advice PBN must be unlocked by the
+                * VERIFYING agent.
+                */
+               lock->update_advice = true;
+               start_unlocking(lock, agent);
+       }
+}
+
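+/* Compare two data blocks one machine word at a time (the block buffers are u64-aligned). */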
+static bool blocks_equal(char *block1, char *block2)
+{
+       int i;
+
+       for (i = 0; i < VDO_BLOCK_SIZE; i += sizeof(u64))
+               if (*((u64 *) &block1[i]) != *((u64 *) &block2[i]))
+                       return false;
+
+       return true;
+}
+
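+/* Check whether the candidate data read into the scratch block matches the agent's data. */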
+static void verify_callback(struct vdo_completion *completion)
+{
+       struct data_vio *agent = as_data_vio(completion);
+
+       agent->is_duplicate = blocks_equal(agent->vio.data, agent->scratch_block);
+       launch_data_vio_hash_zone_callback(agent, finish_verifying);
+}
+
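+/*
+ * Uncompress the candidate fragment into the agent's scratch block before comparing; if it can't
+ * be uncompressed, it can't be a duplicate.
+ */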
+static void uncompress_and_verify(struct vdo_completion *completion)
+{
+       struct data_vio *agent = as_data_vio(completion);
+       int result;
+
+       result = uncompress_data_vio(agent, agent->duplicate.state, agent->scratch_block);
+       if (result == VDO_SUCCESS) {
+               verify_callback(completion);
+               return;
+       }
+
+       agent->is_duplicate = false;
+       launch_data_vio_hash_zone_callback(agent, finish_verifying);
+}
+
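+/*
+ * Handle completion of the read of a candidate duplicate block, dispatching to a CPU thread for
+ * decompression and/or comparison, or giving up on dedupe after an I/O error.
+ */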
+static void verify_endio(struct bio *bio)
+{
+       struct data_vio *agent = vio_as_data_vio(bio->bi_private);
+       int result = blk_status_to_errno(bio->bi_status);
+
+       vdo_count_completed_bios(bio);
+       if (result != VDO_SUCCESS) {
+               agent->is_duplicate = false;
+               launch_data_vio_hash_zone_callback(agent, finish_verifying);
+               return;
+       }
+
+       if (vdo_is_state_compressed(agent->duplicate.state)) {
+               launch_data_vio_cpu_callback(agent,
+                                            uncompress_and_verify,
+                                            CPU_Q_COMPRESS_BLOCK_PRIORITY);
+               return;
+       }
+
+       launch_data_vio_cpu_callback(agent, verify_callback, CPU_Q_COMPLETE_READ_PRIORITY);
+}
+
+/**
+ * start_verifying() - Begin the data verification phase.
+ * @lock: The hash lock (must be LOCKING).
+ * @agent: The data_vio to use to read and compare candidate data.
+ *
+ * Continue the deduplication path for a hash lock by using the agent to read (and possibly
+ * decompress) the data at the candidate duplicate location, comparing it to the data in the agent
+ * to verify that the candidate is identical to all the data_vios sharing the hash. If so, it can
+ * be deduplicated against; otherwise, a data_vio allocation will have to be written to and used for
+ * dedupe.
+ */
+static void start_verifying(struct hash_lock *lock, struct data_vio *agent)
+{
+       int result;
+       struct vio *vio = &agent->vio;
+       char *buffer = (vdo_is_state_compressed(agent->duplicate.state) ?
+                       (char *) agent->compression.block :
+                       agent->scratch_block);
+
+       lock->state = VDO_HASH_LOCK_VERIFYING;
+       ASSERT_LOG_ONLY(!lock->verified, "hash lock only verifies advice once");
+
+       agent->last_async_operation = VIO_ASYNC_OP_VERIFY_DUPLICATION;
+       result = vio_reset_bio(vio, buffer, verify_endio, REQ_OP_READ, agent->duplicate.pbn);
+       if (result != VDO_SUCCESS) {
+               set_data_vio_hash_zone_callback(agent, finish_verifying);
+               continue_data_vio_with_error(agent, result);
+               return;
+       }
+
+       set_data_vio_bio_zone_callback(agent, process_vio_io);
+       vdo_launch_completion_with_priority(&vio->completion, BIO_Q_VERIFY_PRIORITY);
+}
+
+/**
+ * finish_locking() - Handle the result of the agent for the lock attempting to obtain a PBN read
+ *                    lock on the candidate duplicate block.
+ * @completion: The completion of the data_vio that attempted to get the read lock.
+ *
+ * This continuation is registered in lock_duplicate_pbn().
+ */
+static void finish_locking(struct vdo_completion *completion)
+{
+       struct data_vio *agent = as_data_vio(completion);
+       struct hash_lock *lock = agent->hash_lock;
+
+       assert_hash_lock_agent(agent, __func__);
+
+       if (!agent->is_duplicate) {
+               ASSERT_LOG_ONLY(lock->duplicate_lock == NULL,
+                               "must not hold duplicate_lock if not flagged as 
a duplicate");
+               /*
+                * LOCKING -> WRITING transition: The advice block is being modified or has no
+                * available references, so try to write or compress the data, remembering to
+                * update UDS later with the new advice.
+                */
+               increment_stat(&agent->hash_zone->statistics.dedupe_advice_stale);
+               lock->update_advice = true;
+               start_writing(lock, agent);
+               return;
+       }
+
+       ASSERT_LOG_ONLY(lock->duplicate_lock != NULL,
+                       "must hold duplicate_lock if flagged as a duplicate");
+
+       if (!lock->verified) {
+               /*
+                * LOCKING -> VERIFYING transition: Continue on the unverified dedupe path, reading
+                * the candidate duplicate and comparing it to the agent's data to decide whether
+                * it is a true duplicate or stale advice.
+                */
+               start_verifying(lock, agent);
+               return;
+       }
+
+       if (!vdo_claim_pbn_lock_increment(lock->duplicate_lock)) {
+               /*
+                * LOCKING -> UNLOCKING transition: The verified block was re-locked, but has no
+                * available increments left. Must first release the useless PBN read lock before
+                * rolling over to a new copy of the block.
+                */
+               agent->is_duplicate = false;
+               lock->verified = false;
+               lock->update_advice = true;
+               start_unlocking(lock, agent);
+               return;
+       }
+
+       /*
+        * LOCKING -> DEDUPING transition: Continue on the verified dedupe path, deduplicating
+        * against a location that was previously verified or written to.
+        */
+       start_deduping(lock, agent, false);
+}
+
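+/*
+ * Take a provisional reference on the newly locked block so it can't be reallocated during
+ * verification; on failure, release the PBN lock and abort dedupe for the agent.
+ */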
+static bool acquire_provisional_reference(struct data_vio *agent,
+                                         struct pbn_lock *lock,
+                                         struct slab_depot *depot)
+{
+       /* Ensure that the newly-locked block is referenced. */
+       struct vdo_slab *slab = vdo_get_slab(depot, agent->duplicate.pbn);
+       int result = vdo_acquire_provisional_reference(slab, agent->duplicate.pbn, lock);
+
+       if (result == VDO_SUCCESS)
+               return true;
+
+       uds_log_warning_strerror(result,
+                                "Error acquiring provisional reference for 
dedupe candidate; aborting dedupe");
+       agent->is_duplicate = false;
+       vdo_release_physical_zone_pbn_lock(agent->duplicate.zone, agent->duplicate.pbn, lock);
+       continue_data_vio_with_error(agent, result);
+       return false;
+}
+
+/**
+ * lock_duplicate_pbn() - Acquire a read lock on the PBN of the block containing candidate
+ *                        duplicate data (compressed or uncompressed).
+ * @completion: The completion of the data_vio attempting to acquire the physical block lock on
+ *              behalf of its hash lock.
+ *
+ * If the PBN is already locked for writing, the lock attempt is abandoned and is_duplicate will be
+ * cleared before calling back. This continuation is launched from start_locking(), and calls back
+ * to finish_locking() on the hash zone thread.
+ */
+static void lock_duplicate_pbn(struct vdo_completion *completion)
+{
+       unsigned int increment_limit;
+       struct pbn_lock *lock;
+       int result;
+
+       struct data_vio *agent = as_data_vio(completion);
+       struct slab_depot *depot = vdo_from_data_vio(agent)->depot;
+       struct physical_zone *zone = agent->duplicate.zone;
+
+       assert_data_vio_in_duplicate_zone(agent);
+
+       set_data_vio_hash_zone_callback(agent, finish_locking);
+
+       /*
+        * While in the zone that owns it, find out how many additional references can be made to
+        * the block if it turns out to truly be a duplicate.
+        */
+       increment_limit = vdo_get_increment_limit(depot, agent->duplicate.pbn);
+       if (increment_limit == 0) {
+               /*
+                * We could deduplicate against it later if a reference happened to be released
+                * during verification, but it's probably better to bail out now.
+                */
+               agent->is_duplicate = false;
+               continue_data_vio(agent);
+               return;
+       }
+
+       result = vdo_attempt_physical_zone_pbn_lock(zone,
+                                                   agent->duplicate.pbn,
+                                                   VIO_READ_LOCK,
+                                                   &lock);
+       if (result != VDO_SUCCESS) {
+               continue_data_vio_with_error(agent, result);
+               return;
+       }
+
+       if (!vdo_is_pbn_read_lock(lock)) {
+               /*
+                * There are three cases of write locks: uncompressed data block writes, compressed
+                * (packed) block writes, and block map page writes. In all three cases, we give up
+                * on trying to verify the advice and don't bother to try to deduplicate against the
+                * data in the write lock holder.
+                *
+                * 1) We don't ever want to try to deduplicate against a block map page.
+                *
+                * 2a) It's very unlikely we'd deduplicate against an entire packed block, both
+                * because of the chance of matching it, and because we don't record advice for it,
+                * but for the uncompressed representation of all the fragments it contains. The
+                * only way we'd be getting lock contention is if we've written the same
+                * representation coincidentally before, had it become unreferenced, and it just
+                * happened to be packed together from compressed writes when we go to verify the
+                * lucky advice. Giving up is a minuscule loss of potential dedupe.
+                *
+                * 2b) If the advice is for a slot of a compressed block, it's about to get
+                * smashed, and the write smashing it cannot contain our data--it would have to be
+                * writing on behalf of our hash lock, but that's impossible since we're the lock
+                * agent.
+                *
+                * 3a) If the lock is held by a data_vio with different data, the advice is already
+                * stale or is about to become stale.
+                *
+                * 3b) If the lock is held by a data_vio that matches us, we may as well either
+                * write it ourselves (or reference the copy we already wrote) instead of
+                * potentially having many duplicates wait for the lock holder to write, journal,
+                * hash, and finally arrive in the hash lock. We lose a chance to avoid a UDS
+                * update in the very rare case of advice for a free block that just happened to be
+                * allocated to a data_vio with the same hash. There's also a chance to save on a
+                * block write, at the cost of a block verify. Saving on a full block compare in
+                * all stale advice cases almost certainly outweighs saving a UDS update and
+                * trading a write for a read in a lucky case where advice would have been saved
+                * from becoming stale.
+                */
+               agent->is_duplicate = false;
+               continue_data_vio(agent);
+               return;
+       }
+
+       if (lock->holder_count == 0) {
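+               /*
+                * This read lock has no other holders yet, so take a provisional
+                * reference to keep the duplicate block from being reallocated
+                * before any real references to it can be made.
+                */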
+               if (!acquire_provisional_reference(agent, lock, depot))
+                       return;
+
+               /*
+                * The increment limit we grabbed earlier is still valid. The lock now holds the
+                * rights to acquire all those references. Those rights will be claimed by hash
+                * locks sharing this read lock.
+                */
+               lock->increment_limit = increment_limit;
+       }
+
+       /*
+        * We've successfully acquired a read lock on behalf of the hash lock, so mark it as such.
+        */
+       set_duplicate_lock(agent->hash_lock, lock);
+
+       /*
+        * TODO: Optimization: We could directly launch the block verify, then switch to a hash
+        * thread.
+        */
+       continue_data_vio(agent);
+}
+
+/**
+ * start_locking() - Continue deduplication for a hash lock that has obtained valid advice of a
+ *                   potential duplicate through its agent.
+ * @lock: The hash lock (currently must be QUERYING).
+ * @agent: The data_vio bearing the dedupe advice.
+ */
+static void start_locking(struct hash_lock *lock, struct data_vio *agent)
+{
+       ASSERT_LOG_ONLY(lock->duplicate_lock == NULL,
+                       "must not acquire a duplicate lock when already holding 
it");
+
+       lock->state = VDO_HASH_LOCK_LOCKING;
+
+       /*
+        * TODO: Optimization: If we arrange to continue on the duplicate zone thread when
+        * accepting the advice, and don't explicitly change lock states (or use an agent-local
+        * state, or an atomic), we can avoid a thread transition here.
+        */
+       agent->last_async_operation = VIO_ASYNC_OP_LOCK_DUPLICATE_PBN;
+       launch_data_vio_duplicate_zone_callback(agent, lock_duplicate_pbn);
+}
+
+/**
+ * finish_writing() - Re-entry point for the lock agent after it has finished writing or
+ *                    compressing its copy of the data block.
+ * @lock: The hash lock, which must be in state WRITING.
+ * @agent: The data_vio that wrote its data for the lock.
+ *
+ * The agent will never need to dedupe against anything, so it's done with the lock, but the lock
+ * may not be finished with it, as a UDS update might still be needed.
+ *
+ * If there are other lock holders, the agent will hand the job to one of them and exit, leaving
+ * the lock to deduplicate against the just-written block. If there are no other lock holders, the
+ * agent either exits (and later tears down the hash lock), or it remains the agent and updates
+ * UDS.
+ */
+static void finish_writing(struct hash_lock *lock, struct data_vio *agent)
+{
+       /*
+        * Dedupe against the data block or compressed block slot the agent wrote. Since we know
+        * the write succeeded, there's no need to verify it.
+        */
+       lock->duplicate = agent->new_mapped;
+       lock->verified = true;
+
+       if (vdo_is_state_compressed(lock->duplicate.state) && lock->registered) {
+               /*
+                * Compression means the location we gave in the UDS query is not the location
+                * we're using to deduplicate.
+                */
+               lock->update_advice = true;
+       }
+
+       /* If there are any waiters, we need to start deduping them. */
+       if (vdo_has_waiters(&lock->waiters)) {
+               /*
+                * WRITING -> DEDUPING transition: an asynchronously-written block failed to
+                * compress, so the PBN lock on the written copy was already transferred. The agent
+                * is done with the lock, but the lock may still need to use it to clean up after
+                * rollover.
+                */
+               start_deduping(lock, agent, true);
+               return;
+       }
+
+       /*
+        * There are no waiters and the agent has successfully written, so take a step towards
+        * being able to release the hash lock (or just release it).
+        */
+       if (lock->update_advice) {
+               /*
+                * WRITING -> UPDATING transition: There's no waiter and a UDS update is needed, so
+                * retain the WRITING agent and use it to launch the update. This happens on
+                * compression, rollover, or the QUERYING agent not having an allocation.
+                */
+               start_updating(lock, agent);
+       } else if (lock->duplicate_lock != NULL) {
+               /*
+                * WRITING -> UNLOCKING transition: There's no waiter and no update needed, but the
+                * compressed write gave us a shared duplicate lock that we must release.
+                */
+               set_duplicate_location(agent, lock->duplicate);
+               start_unlocking(lock, agent);
+       } else {
+               /*
+                * WRITING -> BYPASSING transition: There's no waiter, no update needed, and no
+                * duplicate lock held, so both the agent and lock have no more work to do. The
+                * agent will release its allocation lock in cleanup.
+                */
+               start_bypassing(lock, agent);
+       }
+}
+
+/**
+ * select_writing_agent() - Search through the lock waiters for a data_vio that has an allocation.
+ * @lock: The hash lock to modify.
+ *
+ * If an allocation is found, swap agents, put the old agent at the head of the wait queue, then
+ * return the new agent. Otherwise, just return the current agent.
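+ *
+ * For example, if the current agent X has no allocation, the waiters are [A, B, C, D], and only C
+ * has an allocation, then C becomes the new agent and the waiters become [X, A, B, D].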
+ */
+static struct data_vio *select_writing_agent(struct hash_lock *lock)
+{
+       struct wait_queue temp_queue;
+       struct data_vio *data_vio;
+
+       vdo_initialize_wait_queue(&temp_queue);
+
+       /*
+        * Move waiters to the temp queue one-by-one until we find an allocation. Not ideal to
+        * search, but it only happens when nearly out of space.
+        */
+       while (((data_vio = dequeue_lock_waiter(lock)) != NULL) &&
+              !data_vio_has_allocation(data_vio)) {
+               /* Use the lower-level enqueue since we're just moving waiters around. */
+               vdo_enqueue_waiter(&temp_queue, &data_vio->waiter);
+       }
+
+       if (data_vio != NULL) {
+               /*
+                * Move the rest of the waiters over to the temp queue, preserving the order they
+                * arrived at the lock.
+                */
+               vdo_transfer_all_waiters(&lock->waiters, &temp_queue);
+
+               /*
+                * The current agent is being replaced and will have to wait to dedupe; make it the
+                * first waiter since it was the first to reach the lock.
+                */
+               vdo_enqueue_waiter(&lock->waiters, &lock->agent->waiter);
+               lock->agent = data_vio;
+       } else {
+               /* No one has an allocation, so keep the current agent. */
+               data_vio = lock->agent;
+       }
+
+       /* Swap all the waiters back onto the lock's queue. */
+       vdo_transfer_all_waiters(&temp_queue, &lock->waiters);
+       return data_vio;
+}
+
+/**
+ * start_writing() - Begin the non-duplicate write path.
+ * @lock: The hash lock (currently must be QUERYING).
+ * @agent: The data_vio currently acting as the agent for the lock.
+ *
+ * Begins the non-duplicate write path for a hash lock that had no advice, selecting a data_vio
+ * with an allocation as a new agent, if necessary, then resuming the agent on the data_vio write
+ * path.
+ */
+static void start_writing(struct hash_lock *lock, struct data_vio *agent)
+{
+       lock->state = VDO_HASH_LOCK_WRITING;
+
+       /*
+        * The agent might not have received an allocation and so can't be used for writing, but
+        * it's entirely possible that one of the waiters did.
+        */
+       if (!data_vio_has_allocation(agent)) {
+               agent = select_writing_agent(lock);
+               /* If none of the waiters had an allocation, the writes all have to fail. */
+               if (!data_vio_has_allocation(agent)) {
+                       /*
+                        * TODO: Should we keep a variant of BYPASSING that causes new arrivals to
+                        * fail immediately if they don't have an allocation? It might be possible
+                        * that on some path there would be non-waiters still referencing the lock,
+                        * so it would remain in the map as everything is currently spelled, even
+                        * if the agent and all waiters release.
+                        */
+                       continue_data_vio_with_error(agent, VDO_NO_SPACE);
+                       return;
+               }
+       }
+
+       /*
+        * If the agent compresses, it might wait indefinitely in the packer, which would be bad if
+        * there are any other data_vios waiting.
+        */
+       if (vdo_has_waiters(&lock->waiters))
+               cancel_data_vio_compression(agent);
+
+       /*
+        * Send the agent to the compress/pack/write path in vioWrite. If it succeeds, it will
+        * return to the hash lock via vdo_continue_hash_lock() and call finish_writing().
+        */
+       launch_compress_data_vio(agent);
+}
+
+/*
+ * Decode VDO duplicate advice from the old_metadata field of a UDS request.
+ * Returns true if valid advice was found and decoded.
+ */
+static bool decode_uds_advice(struct dedupe_context *context)
+{
+       const struct uds_request *request = &context->request;
+       struct data_vio *data_vio = context->requestor;
+       size_t offset = 0;
+       const struct uds_record_data *encoding = &request->old_metadata;
+       struct vdo *vdo = vdo_from_data_vio(data_vio);
+       struct zoned_pbn *advice = &data_vio->duplicate;
+       u8 version;
+       int result;
+
+       if ((request->status != UDS_SUCCESS) || !request->found)
+               return false;
+
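+       /*
+        * The encoding decoded below is one version byte, one mapping state byte, and a
+        * little-endian u64 PBN, which together make up the UDS_ADVICE_SIZE bytes checked by the
+        * BUG_ON below.
+        */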
+       version = encoding->data[offset++];
+       if (version != UDS_ADVICE_VERSION) {
+               uds_log_error("invalid UDS advice version code %u", version);
+               return false;
+       }
+
+       advice->state = encoding->data[offset++];
+       advice->pbn = get_unaligned_le64(&encoding->data[offset]);
+       offset += sizeof(u64);
+       BUG_ON(offset != UDS_ADVICE_SIZE);
+
+       /* Don't use advice that's clearly meaningless. */
+       if ((advice->state == VDO_MAPPING_STATE_UNMAPPED) || (advice->pbn == VDO_ZERO_BLOCK)) {
+               uds_log_debug("Invalid advice from deduplication server: pbn %llu, state %u. Giving up on deduplication of logical block %llu",
+                             (unsigned long long) advice->pbn,
+                             advice->state,
+                             (unsigned long long) data_vio->logical.lbn);
+               atomic64_inc(&vdo->stats.invalid_advice_pbn_count);
+               return false;
+       }
+
+       result = vdo_get_physical_zone(vdo, advice->pbn, &advice->zone);
+       if ((result != VDO_SUCCESS) || (advice->zone == NULL)) {
+               uds_log_debug("Invalid physical block number from deduplication 
server: %llu, giving up on deduplication of logical block %llu",
+                             (unsigned long long) advice->pbn,
+                             (unsigned long long) data_vio->logical.lbn);
+               atomic64_inc(&vdo->stats.invalid_advice_pbn_count);
+               return false;
+       }
+
+       return true;
+}
+
+static void process_query_result(struct data_vio *agent)
+{
+       struct dedupe_context *context = agent->dedupe_context;
+
+       if (context == NULL)
+               return;
+
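+       /*
+        * Claim the result only if the query actually completed; if the compare and swap fails,
+        * the request presumably timed out and the context is no longer this data_vio's to
+        * release.
+        */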
+       if (change_context_state(context, DEDUPE_CONTEXT_COMPLETE, DEDUPE_CONTEXT_IDLE)) {
+               agent->is_duplicate = decode_uds_advice(context);
+               release_context(context);
+       }
+}
+
+/**
+ * finish_querying() - Process the result of a UDS query performed by the agent for the lock.
+ * @completion: The completion of the data_vio that performed the query.
+ *
+ * This continuation is registered in start_querying().
+ */
+static void finish_querying(struct vdo_completion *completion)
+{
+       struct data_vio *agent = as_data_vio(completion);
+       struct hash_lock *lock = agent->hash_lock;
+
+       assert_hash_lock_agent(agent, __func__);
+
+       process_query_result(agent);
+
+       if (agent->is_duplicate) {
+               lock->duplicate = agent->duplicate;
+               /*
+                * QUERYING -> LOCKING transition: Valid advice was obtained from UDS. Use the
+                * QUERYING agent to start the hash lock on the unverified dedupe path, verifying
+                * that the advice can be used.
+                */
+               start_locking(lock, agent);
+       } else {
+               /*
+                * The agent will be used as the duplicate if it has an allocation; if it does,
+                * that location was posted to UDS, so no update will be needed.
+                */
+               lock->update_advice = !data_vio_has_allocation(agent);
+               /*
+                * QUERYING -> WRITING transition: There was no advice or the advice wasn't valid,
+                * so try to write or compress the data.
+                */
+               start_writing(lock, agent);
+       }
+}
+
+/**
+ * start_querying() - Start deduplication for a hash lock.
+ * @lock: The initialized hash lock.
+ * @data_vio: The data_vio that has just obtained the new lock.
+ *
+ * Starts deduplication for a hash lock that has finished initializing by making the data_vio that
+ * requested it the agent, entering the QUERYING state, and using the agent to perform the UDS
+ * query on behalf of the lock.
+ */
+static void start_querying(struct hash_lock *lock, struct data_vio *data_vio)
+{
+       lock->agent = data_vio;
+       lock->state = VDO_HASH_LOCK_QUERYING;
+       data_vio->last_async_operation = VIO_ASYNC_OP_CHECK_FOR_DUPLICATION;
+       set_data_vio_hash_zone_callback(data_vio, finish_querying);
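+       /*
+        * A data_vio with an allocation can post its own new location as advice, while one
+        * without an allocation can only query for existing advice.
+        */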
+       query_index(data_vio, (data_vio_has_allocation(data_vio) ? UDS_POST : UDS_QUERY));
+}
+
+/**
+ * report_bogus_lock_state() - Complain that a data_vio has entered a hash_lock that is in an
+ *                             unimplemented or unusable state and continue the data_vio with an
+ *                             error.
+ * @lock: The hash lock.
+ * @data_vio: The data_vio attempting to enter the lock.
+ */
+static void report_bogus_lock_state(struct hash_lock *lock, struct data_vio *data_vio)
+{
+       ASSERT_LOG_ONLY(false,
+                       "hash lock must not be in unimplemented state %s",
+                       get_hash_lock_state_name(lock->state));
+       continue_data_vio_with_error(data_vio, VDO_LOCK_ERROR);
+}
+
+/**
+ * vdo_continue_hash_lock() - Continue the processing state after writing, compressing, or
+ *                            deduplicating.
+ * @data_vio: The data_vio to continue processing in its hash lock.
+ *
+ * Asynchronously continue processing a data_vio in its hash lock after it has finished writing,
+ * compressing, or deduplicating, so it can share the result with any data_vios waiting in the hash
+ * lock, or update the UDS index, or simply release its share of the lock.
+ *
+ * Context: This must only be called in the correct thread for the hash zone.
+ */
+void vdo_continue_hash_lock(struct vdo_completion *completion)
+{
+       struct data_vio *data_vio = as_data_vio(completion);
+       struct hash_lock *lock = data_vio->hash_lock;
+
+       switch (lock->state) {
+       case VDO_HASH_LOCK_WRITING:
+               ASSERT_LOG_ONLY(data_vio == lock->agent,
+                               "only the lock agent may continue the lock");
+               finish_writing(lock, data_vio);
+               break;
+
+       case VDO_HASH_LOCK_DEDUPING:
+               finish_deduping(lock, data_vio);
+               break;
+
+       case VDO_HASH_LOCK_BYPASSING:
+               /* This data_vio has finished the write path and the lock doesn't need it. */
+               exit_hash_lock(data_vio);
+               break;
+
+       case VDO_HASH_LOCK_INITIALIZING:
+       case VDO_HASH_LOCK_QUERYING:
+       case VDO_HASH_LOCK_UPDATING:
+       case VDO_HASH_LOCK_LOCKING:
+       case VDO_HASH_LOCK_VERIFYING:
+       case VDO_HASH_LOCK_UNLOCKING:
+               /* A lock in this state should never be re-entered. */
+               report_bogus_lock_state(lock, data_vio);
+               break;
+
+       default:
+               report_bogus_lock_state(lock, data_vio);
+       }
+}
+
+/**
+ * is_hash_collision() - Check to see if a hash collision has occurred.
+ * @lock: The lock to check.
+ * @candidate: The data_vio seeking to share the lock.
+ *
+ * Check whether the data in the data_vios sharing a lock differs from that in a data_vio seeking
+ * to share the lock, which should only be possible in the extremely unlikely case of a hash
+ * collision.
+ *
+ * Return: true if the given data_vio must not share the lock because it doesn't have the same data
+ *         as the lock holders.
+ */
+static bool is_hash_collision(struct hash_lock *lock, struct data_vio *candidate)
+{
+       struct data_vio *lock_holder;
+       struct hash_zone *zone;
+       bool collides;
+
+       if (list_empty(&lock->duplicate_ring))
+               return false;
+
+       lock_holder = list_first_entry(&lock->duplicate_ring, struct data_vio, hash_lock_entry);
+       zone = candidate->hash_zone;
+       collides = !blocks_equal(lock_holder->vio.data, candidate->vio.data);
+       if (collides)
+               increment_stat(&zone->statistics.concurrent_hash_collisions);
+       else
+               increment_stat(&zone->statistics.concurrent_data_matches);
+
+       return collides;
+}
+
+static inline int assert_hash_lock_preconditions(const struct data_vio *data_vio)
+{
+       int result;
+
+       /* FIXME: BUG_ON() and/or enter read-only mode? */
+       result = ASSERT(data_vio->hash_lock == NULL, "must not already hold a hash lock");
+       if (result != VDO_SUCCESS)
+               return result;
+
+       result = ASSERT(list_empty(&data_vio->hash_lock_entry),
+                       "must not already be a member of a hash lock ring");
+       if (result != VDO_SUCCESS)
+               return result;
+
+       return ASSERT(data_vio->recovery_sequence_number == 0,
+                     "must not hold a recovery lock when getting a hash lock");
+}
+
+/**
+ * vdo_acquire_hash_lock() - Acquire or share a lock on a record name.
+ * @data_vio: The data_vio acquiring a lock on its record name.
+ *
+ * Acquire or share a lock on the hash (record name) of the data in a data_vio, updating the
+ * data_vio to reference the lock. This must only be called in the correct thread for the zone. In
+ * the unlikely case of a hash collision, this function will succeed, but the data_vio will not get
+ * a lock reference.
+ */
+void vdo_acquire_hash_lock(struct vdo_completion *completion)
+{
+       struct data_vio *data_vio = as_data_vio(completion);
+       struct hash_lock *lock;
+       int result;
+
+       assert_data_vio_in_hash_zone(data_vio);
+
+       result = assert_hash_lock_preconditions(data_vio);
+       if (result != VDO_SUCCESS) {
+               continue_data_vio_with_error(data_vio, result);
+               return;
+       }
+
+       result = acquire_lock(data_vio->hash_zone, &data_vio->record_name, NULL, &lock);
+       if (result != VDO_SUCCESS) {
+               continue_data_vio_with_error(data_vio, result);
+               return;
+       }
+
+       if (is_hash_collision(lock, data_vio)) {
+               /*
+                * Hash collisions are extremely unlikely, but the bogus dedupe would be a data
+                * corruption. Bypass optimization entirely. We can't compress a data_vio without
+                * a hash_lock as the compressed write depends on the hash_lock to manage the
+                * references for the compressed block.
+                */
+               write_data_vio(data_vio);
+               return;
+       }
+
+       set_hash_lock(data_vio, lock);
+       switch (lock->state) {
+       case VDO_HASH_LOCK_INITIALIZING:
+               start_querying(lock, data_vio);
+               return;
+
+       case VDO_HASH_LOCK_QUERYING:
+       case VDO_HASH_LOCK_WRITING:
+       case VDO_HASH_LOCK_UPDATING:
+       case VDO_HASH_LOCK_LOCKING:
+       case VDO_HASH_LOCK_VERIFYING:
+       case VDO_HASH_LOCK_UNLOCKING:
+               /* The lock is busy, and can't be shared yet. */
+               wait_on_hash_lock(lock, data_vio);
+               return;
+
+       case VDO_HASH_LOCK_BYPASSING:
+               /* We can't use this lock, so bypass optimization entirely. */
+               vdo_release_hash_lock(data_vio);
+               write_data_vio(data_vio);
+               return;
+
+       case VDO_HASH_LOCK_DEDUPING:
+               launch_dedupe(lock, data_vio, false);
+               return;
+
+       default:
+               /* A lock in this state should not be acquired by new VIOs. */
+               report_bogus_lock_state(lock, data_vio);
+       }
+}
+
+/**
+ * vdo_release_hash_lock() - Release a data_vio's share of a hash lock, if held, and null out the
+ *                           data_vio's reference to it.
+ * @data_vio: The data_vio releasing its hash lock.
+ *
+ * If the data_vio is the only one holding the lock, this also releases any resources or locks used
+ * by the hash lock (such as a PBN read lock on a block containing data with the same hash) and
+ * returns the lock to the hash zone's lock pool.
+ *
+ * Context: This must only be called in the correct thread for the hash zone.
+ */
+void vdo_release_hash_lock(struct data_vio *data_vio)
+{
+       struct hash_lock *lock = data_vio->hash_lock;
+       struct hash_zone *zone = data_vio->hash_zone;
+
+       if (lock == NULL)
+               return;
+
+       set_hash_lock(data_vio, NULL);
+
+       if (lock->reference_count > 0)
+               /* The lock is still in use by other data_vios. */
+               return;
+
+       if (lock->registered) {
+               struct hash_lock *removed;
+
+               removed = vdo_pointer_map_remove(zone->hash_lock_map, &lock->hash);
+               ASSERT_LOG_ONLY(lock == removed, "hash lock being released must have been mapped");
+       } else {
+               ASSERT_LOG_ONLY(lock != vdo_pointer_map_get(zone->hash_lock_map, &lock->hash),
+                               "unregistered hash lock must not be in the lock map");
+       }
+
+       ASSERT_LOG_ONLY(!vdo_has_waiters(&lock->waiters),
+                       "hash lock returned to zone must have no waiters");
+       ASSERT_LOG_ONLY((lock->duplicate_lock == NULL),
+                       "hash lock returned to zone must not reference a PBN 
lock");
+       ASSERT_LOG_ONLY((lock->state == VDO_HASH_LOCK_BYPASSING),
+                       "returned hash lock must not be in use with state %s",
+                       get_hash_lock_state_name(lock->state));
+       ASSERT_LOG_ONLY(list_empty(&lock->pool_node),
+                       "hash lock returned to zone must not be in a pool 
ring");
+       ASSERT_LOG_ONLY(list_empty(&lock->duplicate_ring),
+                       "hash lock returned to zone must not reference 
DataVIOs");
+
+       return_hash_lock_to_pool(zone, lock);
+}
+
+/**
+ * transfer_allocation_lock() - Transfer a data_vio's downgraded allocation PBN lock to the
+ *                              data_vio's hash lock, converting it to a duplicate PBN lock.
+ * @data_vio: The data_vio holding the allocation lock to transfer.
+ */
+static void transfer_allocation_lock(struct data_vio *data_vio)
+{
+       struct allocation *allocation = &data_vio->allocation;
+       struct hash_lock *hash_lock = data_vio->hash_lock;
+
+       ASSERT_LOG_ONLY(data_vio->new_mapped.pbn == allocation->pbn,
+                       "transferred lock must be for the block written");
+
+       allocation->pbn = VDO_ZERO_BLOCK;
+
+       ASSERT_LOG_ONLY(vdo_is_pbn_read_lock(allocation->lock),
+                       "must have downgraded the allocation lock before 
transfer");
+
+       hash_lock->duplicate = data_vio->new_mapped;
+       data_vio->duplicate = data_vio->new_mapped;
+
+       /*
+        * Since the lock is being transferred, the holder count doesn't change (and isn't even
+        * safe to examine on this thread).
+        */
+       hash_lock->duplicate_lock = UDS_FORGET(allocation->lock);
+}
+
+/**
+ * vdo_share_compressed_write_lock() - Make a data_vio's hash lock a shared holder of the PBN lock
+ *                                     on the compressed block to which its data was just written.
+ * @data_vio: The data_vio which was just compressed.
+ * @pbn_lock: The PBN lock on the compressed block.
+ *
+ * If the lock is still a write lock (as it will be for the first share), it will be converted to a
+ * read lock. This also reserves a reference count increment for the data_vio.
+ */
+void vdo_share_compressed_write_lock(struct data_vio *data_vio, struct pbn_lock *pbn_lock)
+{
+       bool claimed;
+
+       ASSERT_LOG_ONLY(vdo_get_duplicate_lock(data_vio) == NULL,
+                       "a duplicate PBN lock should not exist when writing");
+       ASSERT_LOG_ONLY(vdo_is_state_compressed(data_vio->new_mapped.state),
+                       "lock transfer must be for a compressed write");
+       assert_data_vio_in_new_mapped_zone(data_vio);
+
+       /* First sharer downgrades the lock. */
+       if (!vdo_is_pbn_read_lock(pbn_lock))
+               vdo_downgrade_pbn_write_lock(pbn_lock, true);
+
+       /*
+        * Get a share of the PBN lock, ensuring it cannot be released until after this data_vio
+        * has had a chance to journal a reference.
+        */
+       data_vio->duplicate = data_vio->new_mapped;
+       data_vio->hash_lock->duplicate = data_vio->new_mapped;
+       set_duplicate_lock(data_vio->hash_lock, pbn_lock);
+
+       /*
+        * Claim a reference for this data_vio. Necessary since another hash_lock might start
+        * deduplicating against it before our incRef.
+        */
+       claimed = vdo_claim_pbn_lock_increment(pbn_lock);
+       ASSERT_LOG_ONLY(claimed, "impossible to fail to claim an initial increment");
+}
+
+/** compare_keys() - Implements pointer_key_comparator. */
+static bool compare_keys(const void *this_key, const void *that_key)
+{
+       /* Null keys are not supported. */
+       return (memcmp(this_key, that_key, sizeof(struct uds_record_name)) == 0);
+}
+
+/** hash_key() - Implements pointer_key_hasher. */
+static u32 hash_key(const void *key)
+{
+       const struct uds_record_name *name = key;
+
+       /* Use a fragment of the record name as a hash code. */
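+       /*
+        * Byte 0 already selects the hash zone (see vdo_select_hash_zone()), so presumably a
+        * later fragment is used here to keep this hash independent of the zone choice.
+        */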
+       return get_unaligned_le32(&name->name[4]);
+}
+
+static int __must_check
+initialize_zone(struct vdo *vdo, struct hash_zones *zones, zone_count_t zone_number)
+{
+       int result;
+       data_vio_count_t i;
+       struct hash_zone *zone = &zones->zones[zone_number];
+
+       result = vdo_make_pointer_map(VDO_LOCK_MAP_CAPACITY,
+                                     0,
+                                     compare_keys,
+                                     hash_key,
+                                     &zone->hash_lock_map);
+       if (result != VDO_SUCCESS)
+               return result;
+
+       vdo_set_admin_state_code(&zone->state, VDO_ADMIN_STATE_NORMAL_OPERATION);
+       zone->zone_number = zone_number;
+       zone->thread_id = vdo->thread_config.hash_zone_threads[zone_number];
+       vdo_initialize_completion(&zone->completion, vdo, VDO_HASH_ZONE_COMPLETION);
+       vdo_set_completion_callback(&zone->completion,
+                                   timeout_index_operations_callback,
+                                   zone->thread_id);
+       INIT_LIST_HEAD(&zone->lock_pool);
+       result = UDS_ALLOCATE(LOCK_POOL_CAPACITY,
+                             struct hash_lock,
+                             "hash_lock array",
+                             &zone->lock_array);
+       if (result != VDO_SUCCESS)
+               return result;
+
+       for (i = 0; i < LOCK_POOL_CAPACITY; i++)
+               return_hash_lock_to_pool(zone, &zone->lock_array[i]);
+
+       INIT_LIST_HEAD(&zone->available);
+       INIT_LIST_HEAD(&zone->pending);
+       result = uds_make_funnel_queue(&zone->timed_out_complete);
+       if (result != VDO_SUCCESS)
+               return result;
+
+       timer_setup(&zone->timer, timeout_index_operations, 0);
+
+       for (i = 0; i < MAXIMUM_VDO_USER_VIOS; i++) {
+               struct dedupe_context *context = &zone->contexts[i];
+
+               context->zone = zone;
+               context->request.callback = finish_index_operation;
+               context->request.session = zones->index_session;
+               list_add(&context->list_entry, &zone->available);
+       }
+
+       return vdo_make_default_thread(vdo, zone->thread_id);
+}
+
+/** get_thread_id_for_zone() - Implements vdo_zone_thread_getter. */
+static thread_id_t get_thread_id_for_zone(void *context, zone_count_t zone_number)
+{
+       struct hash_zones *zones = context;
+
+       return zones->zones[zone_number].thread_id;
+}
+
+/**
+ * vdo_make_hash_zones() - Create the hash zones.
+ *
+ * @vdo: The vdo to which the zones will belong.
+ * @zones_ptr: A pointer to hold the zones.
+ *
+ * Return: VDO_SUCCESS or an error code.
+ */
+int vdo_make_hash_zones(struct vdo *vdo, struct hash_zones **zones_ptr)
+{
+       int result;
+       struct hash_zones *zones;
+       zone_count_t z;
+       zone_count_t zone_count = vdo->thread_config.hash_zone_count;
+
+       if (zone_count == 0)
+               return VDO_SUCCESS;
+
+       result = UDS_ALLOCATE_EXTENDED(struct hash_zones,
+                                      zone_count,
+                                      struct hash_zone,
+                                      __func__,
+                                      &zones);
+       if (result != VDO_SUCCESS)
+               return result;
+
+       result = initialize_index(vdo, zones);
+       if (result != VDO_SUCCESS) {
+               UDS_FREE(zones);
+               return result;
+       }
+
+       vdo_set_admin_state_code(&zones->state, VDO_ADMIN_STATE_NEW);
+
+       zones->zone_count = zone_count;
+       for (z = 0; z < zone_count; z++) {
+               result = initialize_zone(vdo, zones, z);
+               if (result != VDO_SUCCESS) {
+                       vdo_free_hash_zones(zones);
+                       return result;
+               }
+       }
+
+       result = vdo_make_action_manager(zones->zone_count,
+                                        get_thread_id_for_zone,
+                                        vdo->thread_config.admin_thread,
+                                        zones,
+                                        NULL,
+                                        vdo,
+                                        &zones->manager);
+       if (result != VDO_SUCCESS) {
+               vdo_free_hash_zones(zones);
+               return result;
+       }
+
+       *zones_ptr = zones;
+       return VDO_SUCCESS;
+}
+
+void vdo_finish_dedupe_index(struct hash_zones *zones)
+{
+       if (zones == NULL)
+               return;
+
+       uds_destroy_index_session(UDS_FORGET(zones->index_session));
+}
+
+/**
+ * vdo_free_hash_zones() - Free the hash zones.
+ * @zones: The zone to free.
+ */
+void vdo_free_hash_zones(struct hash_zones *zones)
+{
+       zone_count_t i;
+
+       if (zones == NULL)
+               return;
+
+       UDS_FREE(UDS_FORGET(zones->manager));
+
+       for (i = 0; i < zones->zone_count; i++) {
+               struct hash_zone *zone = &zones->zones[i];
+
+               uds_free_funnel_queue(UDS_FORGET(zone->timed_out_complete));
+               vdo_free_pointer_map(UDS_FORGET(zone->hash_lock_map));
+               UDS_FREE(UDS_FORGET(zone->lock_array));
+       }
+
+       if (zones->index_session != NULL)
+               vdo_finish_dedupe_index(zones);
+
+       ratelimit_state_exit(&zones->ratelimiter);
+       if (vdo_get_admin_state_code(&zones->state) == VDO_ADMIN_STATE_NEW)
+               UDS_FREE(zones);
+       else
+               kobject_put(&zones->dedupe_directory);
+}
+
+static void initiate_suspend_index(struct admin_state *state)
+{
+       struct hash_zones *zones = container_of(state, struct hash_zones, state);
+       enum index_state index_state;
+
+       spin_lock(&zones->lock);
+       index_state = zones->index_state;
+       spin_unlock(&zones->lock);
+
+       if (index_state != IS_CLOSED) {
+               bool save = vdo_is_state_saving(&zones->state);
+               int result;
+
+               result = uds_suspend_index_session(zones->index_session, save);
+               if (result != UDS_SUCCESS)
+                       uds_log_error_strerror(result, "Error suspending dedupe index");
+       }
+
+       vdo_finish_draining(state);
+}
+
+/**
+ * suspend_index() - Suspend the UDS index prior to draining hash zones.
+ *
+ * Implements vdo_action_preamble
+ */
+static void suspend_index(void *context, struct vdo_completion *completion)
+{
+       struct hash_zones *zones = context;
+
+       vdo_start_draining(&zones->state,
+                          vdo_get_current_manager_operation(zones->manager),
+                          completion,
+                          initiate_suspend_index);
+}
+
+/**
+ * initiate_drain() - Initiate a drain.
+ *
+ * Implements vdo_admin_initiator.
+ */
+static void initiate_drain(struct admin_state *state)
+{
+       check_for_drain_complete(container_of(state, struct hash_zone, state));
+}
+
+/**
+ * drain_hash_zone() - Drain a hash zone.
+ *
+ * Implements vdo_zone_action.
+ */
+static void drain_hash_zone(void *context, zone_count_t zone_number, struct vdo_completion *parent)
+{
+       struct hash_zones *zones = context;
+
+       vdo_start_draining(&zones->zones[zone_number].state,
+                          vdo_get_current_manager_operation(zones->manager),
+                          parent,
+                          initiate_drain);
+}
+
+/** vdo_drain_hash_zones() - Drain all hash zones. */
+void vdo_drain_hash_zones(struct hash_zones *zones, struct vdo_completion *parent)
+{
+       vdo_schedule_operation(zones->manager,
+                              parent->vdo->suspend_type,
+                              suspend_index,
+                              drain_hash_zone,
+                              NULL,
+                              parent);
+}
+
+static void launch_dedupe_state_change(struct hash_zones *zones)
+{
+       /* ASSERTION: We enter with the lock held. */
+       if (zones->changing || !vdo_is_state_normal(&zones->state))
+               /* Either a change is already in progress, or changes are not allowed. */
+               return;
+
+       if (zones->create_flag || (zones->index_state != zones->index_target)) {
+               zones->changing = true;
+               vdo_launch_completion(&zones->completion);
+               return;
+       }
+
+       /* ASSERTION: We exit with the lock held. */
+}
+
+/**
+ * resume_index() - Resume the UDS index prior to resuming hash zones.
+ *
+ * Implements vdo_action_preamble
+ */
+static void resume_index(void *context, struct vdo_completion *parent)
+{
+       struct hash_zones *zones = context;
+       struct device_config *config = parent->vdo->device_config;
+       int result;
+
+       zones->parameters.name = config->parent_device_name;
+       result = uds_resume_index_session(zones->index_session, zones->parameters.name);
+       if (result != UDS_SUCCESS)
+               uds_log_error_strerror(result, "Error resuming dedupe index");
+
+       spin_lock(&zones->lock);
+       vdo_resume_if_quiescent(&zones->state);
+
+       if (config->deduplication) {
+               zones->index_target = IS_OPENED;
+               WRITE_ONCE(zones->dedupe_flag, true);
+       } else {
+               zones->index_target = IS_CLOSED;
+       }
+
+       launch_dedupe_state_change(zones);
+       spin_unlock(&zones->lock);
+
+       vdo_finish_completion(parent);
+}
+
+/**
+ * resume_hash_zone() - Resume a hash zone.
+ *
+ * Implements vdo_zone_action.
+ */
+static void
+resume_hash_zone(void *context, zone_count_t zone_number, struct vdo_completion *parent)
+{
+       struct hash_zone *zone = &(((struct hash_zones *) context)->zones[zone_number]);
+
+       vdo_fail_completion(parent, vdo_resume_if_quiescent(&zone->state));
+}
+
+/**
+ * vdo_resume_hash_zones() - Resume a set of hash zones.
+ * @zones: The hash zones to resume.
+ * @parent: The object to notify when the zones have resumed.
+ */
+void vdo_resume_hash_zones(struct hash_zones *zones, struct vdo_completion *parent)
+{
+       if (vdo_is_read_only(parent->vdo)) {
+               vdo_launch_completion(parent);
+               return;
+       }
+
+       vdo_schedule_operation(zones->manager,
+                              VDO_ADMIN_STATE_RESUMING,
+                              resume_index,
+                              resume_hash_zone,
+                              NULL,
+                              parent);
+}
+
+/**
+ * get_hash_zone_statistics() - Add the statistics for this hash zone to the tally for all zones.
+ * @zone: The hash zone to query.
+ * @tally: The tally to which the zone's statistics are added.
+ */
+static void
+get_hash_zone_statistics(const struct hash_zone *zone, struct hash_lock_statistics *tally)
+{
+       const struct hash_lock_statistics *stats = &zone->statistics;
+
+       tally->dedupe_advice_valid += READ_ONCE(stats->dedupe_advice_valid);
+       tally->dedupe_advice_stale += READ_ONCE(stats->dedupe_advice_stale);
+       tally->concurrent_data_matches += READ_ONCE(stats->concurrent_data_matches);
+       tally->concurrent_hash_collisions += READ_ONCE(stats->concurrent_hash_collisions);
+       tally->curr_dedupe_queries += READ_ONCE(zone->active);
+}
+
+static void get_index_statistics(struct hash_zones *zones, struct index_statistics *stats)
+{
+       enum index_state state;
+       struct uds_index_stats index_stats;
+       int result;
+
+       spin_lock(&zones->lock);
+       state = zones->index_state;
+       spin_unlock(&zones->lock);
+
+       if (state != IS_OPENED)
+               return;
+
+       result = uds_get_index_session_stats(zones->index_session, &index_stats);
+       if (result != UDS_SUCCESS) {
+               uds_log_error_strerror(result, "Error reading index stats");
+               return;
+       }
+
+       stats->entries_indexed = index_stats.entries_indexed;
+       stats->posts_found = index_stats.posts_found;
+       stats->posts_not_found = index_stats.posts_not_found;
+       stats->queries_found = index_stats.queries_found;
+       stats->queries_not_found = index_stats.queries_not_found;
+       stats->updates_found = index_stats.updates_found;
+       stats->updates_not_found = index_stats.updates_not_found;
+       stats->entries_discarded = index_stats.entries_discarded;
+}
+
+/**
+ * vdo_get_dedupe_statistics() - Tally the statistics from all the hash zones and the UDS index.
+ * @zones: The hash zones to query.
+ * @stats: The statistics structure to receive the sum of the hash lock statistics from all hash
+ *         zones plus the statistics from the UDS index.
+ */
+void vdo_get_dedupe_statistics(struct hash_zones *zones, struct vdo_statistics *stats)
+{
+       zone_count_t zone;
+
+       for (zone = 0; zone < zones->zone_count; zone++)
+               get_hash_zone_statistics(&zones->zones[zone], &stats->hash_lock);
+
+       get_index_statistics(zones, &stats->index);
+
+       /*
+        * zones->timeouts gives the number of timeouts, and dedupe_context_busy gives the number
+        * of queries not made because of earlier timeouts.
+        */
+       stats->dedupe_advice_timeouts =
+               (atomic64_read(&zones->timeouts) + atomic64_read(&zones->dedupe_context_busy));
+}
+
+/**
+ * vdo_select_hash_zone() - Select the hash zone responsible for locking a given record name.
+ * @zones: The hash_zones from which to select.
+ * @name: The record name.
+ *
+ * Return: The hash zone responsible for the record name.
+ */
+struct hash_zone *
+vdo_select_hash_zone(struct hash_zones *zones, const struct uds_record_name *name)
+{
+       /*
+        * Use a fragment of the record name as a hash code. Eight bits of hash should suffice
+        * since the number of hash zones is small.
+        * TODO: Verify that the first byte is independent enough.
+        */
+       u32 hash = name->name[0];
+
+       /*
+        * Scale the 8-bit hash fragment to a zone index by treating it as a binary fraction and
+        * multiplying that by the zone count. If the hash is uniformly distributed over [0 ..
+        * 2^8-1], then (hash * count / 2^8) should be uniformly distributed over [0 .. count-1].
+        * The multiply and shift is much faster than a divide (modulus) on X86 CPUs.
+        */
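+       /*
+        * For example, with three zones, hash bytes 0-85 select zone 0, 86-170 select zone 1,
+        * and 171-255 select zone 2.
+        */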
+       hash = (hash * zones->zone_count) >> 8;
+       return &zones->zones[hash];
+}
+
+/**
+ * dump_hash_lock() - Dump a compact description of hash_lock to the log if the lock is not on the
+ *                    free list.
+ * @lock: The hash lock to dump.
+ */
+static void dump_hash_lock(const struct hash_lock *lock)
+{
+       const char *state;
+
+       if (!list_empty(&lock->pool_node))
+               /* This lock is on the free list. */
+               return;
+
+       /*
+        * Necessarily cryptic since we can log a lot of these. The first three characters of the
+        * state are unambiguous. 'U' indicates a lock not registered in the map.
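+        * A dump line such as "hl ...: DED D5342/0 rc=2 wc=3 agt=..." would, for example, show a
+        * registered DEDUPING lock on duplicate PBN 5342 (duplicate state 0) with two references,
+        * three waiters, and the agent's address.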
+        */
+       state = get_hash_lock_state_name(lock->state);
+       uds_log_info("  hl %px: %3.3s %c%llu/%u rc=%u wc=%zu agt=%px",
+                    (const void *) lock, state, (lock->registered ? 'D' : 'U'),
+                    (unsigned long long) lock->duplicate.pbn,
+                    lock->duplicate.state, lock->reference_count,
+                    vdo_count_waiters(&lock->waiters), (void *) lock->agent);
+}
+
+static const char *index_state_to_string(struct hash_zones *zones, enum index_state state)
+{
+       if (!vdo_is_state_normal(&zones->state))
+               return SUSPENDED;
+
+       switch (state) {
+       case IS_CLOSED:
+               return zones->error_flag ? ERROR : CLOSED;
+       case IS_CHANGING:
+               return zones->index_target == IS_OPENED ? OPENING : CLOSING;
+       case IS_OPENED:
+               return READ_ONCE(zones->dedupe_flag) ? ONLINE : OFFLINE;
+       default:
+               return UNKNOWN;
+       }
+}
+
+/**
+ * dump_hash_zone() - Dump information about a hash zone to the log for debugging.
+ * @zone: The zone to dump.
+ */
+static void dump_hash_zone(const struct hash_zone *zone)
+{
+       data_vio_count_t i;
+
+       if (zone->hash_lock_map == NULL) {
+               uds_log_info("struct hash_zone %u: NULL map", 
zone->zone_number);
+               return;
+       }
+
+       uds_log_info("struct hash_zone %u: mapSize=%zu",
+                    zone->zone_number,
+                    vdo_pointer_map_size(zone->hash_lock_map));
+       for (i = 0; i < LOCK_POOL_CAPACITY; i++)
+               dump_hash_lock(&zone->lock_array[i]);
+}
+
+/**
+ * vdo_dump_hash_zones() - Dump information about the hash zones to the log for debugging.
+ * @zones: The zones to dump.
+ */
+void vdo_dump_hash_zones(struct hash_zones *zones)
+{
+       const char *state, *target;
+       zone_count_t zone;
+
+       spin_lock(&zones->lock);
+       state = index_state_to_string(zones, zones->index_state);
+       target = (zones->changing ? index_state_to_string(zones, zones->index_target) : NULL);
+       spin_unlock(&zones->lock);
+
+       uds_log_info("UDS index: state: %s", state);
+       if (target != NULL)
+               uds_log_info("UDS index: changing to state: %s", target);
+
+       for (zone = 0; zone < zones->zone_count; zone++)
+               dump_hash_zone(&zones->zones[zone]);
+}
diff --git a/drivers/md/dm-vdo/dedupe.h b/drivers/md/dm-vdo/dedupe.h
new file mode 100644
index 000000000000..af329fb0fa68
--- /dev/null
+++ b/drivers/md/dm-vdo/dedupe.h
@@ -0,0 +1,93 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright 2023 Red Hat
+ */
+
+#ifndef VDO_DEDUPE_H
+#define VDO_DEDUPE_H
+
+#include <linux/list.h>
+#include <linux/timer.h>
+
+#include "uds.h"
+
+#include "admin-state.h"
+#include "constants.h"
+#include "statistics.h"
+#include "types.h"
+#include "wait-queue.h"
+
+struct dedupe_context {
+       struct hash_zone *zone;
+       struct uds_request request;
+       struct list_head list_entry;
+       struct funnel_queue_entry queue_entry;
+       u64 submission_jiffies;
+       struct data_vio *requestor;
+       atomic_t state;
+};
+
+struct hash_lock;
+
+struct hash_zone {
+       /* Which hash zone this is */
+       zone_count_t zone_number;
+
+       /* The administrative state of the zone */
+       struct admin_state state;
+
+       /* The thread ID for this zone */
+       thread_id_t thread_id;
+
+       /* Mapping from record name fields to hash_locks */
+       struct pointer_map *hash_lock_map;
+
+       /* List containing all unused hash_locks */
+       struct list_head lock_pool;
+
+       /*
+        * Statistics shared by all hash locks in this zone. Only modified on the hash zone thread,
+        * but queried by other threads.
+        */
+       struct hash_lock_statistics statistics;
+
+       /* Array of all hash_locks */
+       struct hash_lock *lock_array;
+
+       /* These fields are used to manage the dedupe contexts */
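+       /*
+        * Roughly: "available" holds idle contexts, "pending" holds contexts with outstanding
+        * index requests, and "timed_out_complete" collects contexts whose requests completed
+        * only after they had already timed out.
+        */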
+       struct list_head available;
+       struct list_head pending;
+       struct funnel_queue *timed_out_complete;
+       struct timer_list timer;
+       struct vdo_completion completion;
+       unsigned int active;
+       atomic_t timer_state;
+
+       /* The dedupe contexts for querying the index from this zone */
+       struct dedupe_context contexts[MAXIMUM_VDO_USER_VIOS];
+};
+
+struct hash_zones;
+
+struct pbn_lock * __must_check vdo_get_duplicate_lock(struct data_vio *data_vio);
+
+void vdo_acquire_hash_lock(struct vdo_completion *completion);
+void vdo_continue_hash_lock(struct vdo_completion *completion);
+void vdo_release_hash_lock(struct data_vio *data_vio);
+void vdo_clean_failed_hash_lock(struct data_vio *data_vio);
+void vdo_share_compressed_write_lock(struct data_vio *data_vio, struct pbn_lock *pbn_lock);
+
+int __must_check vdo_make_hash_zones(struct vdo *vdo, struct hash_zones **zones_ptr);
+
+void vdo_free_hash_zones(struct hash_zones *zones);
+
+void vdo_drain_hash_zones(struct hash_zones *zones, struct vdo_completion *parent);
+
+void vdo_get_dedupe_statistics(struct hash_zones *zones, struct vdo_statistics *stats);
+
+struct hash_zone * __must_check
+vdo_select_hash_zone(struct hash_zones *zones, const struct uds_record_name *name);
+
+void vdo_dump_hash_zones(struct hash_zones *zones);
+
+#endif /* VDO_DEDUPE_H */
-- 
2.40.0
