Hi!

I've implemented a new RCU trie data structure, and would welcome
your feedback while it is still in development.

It currently sits in a development branch of my personal
userspace-rcu repository. It targets userspace applications, but
could be ported to the Linux kernel if a use-case for it arises.

https://github.com/compudj/userspace-rcu-dev/tree/fractal-trie-dev

I started working on this project around 2012, and have recently
found time to pursue this work. (Many thanks to ISC.org for
funding this!)

Here are a few major use-cases I see as a good fit:

- DNS resolver,
- IP routing tables,
- Real-time application data structure lookups with strict
  real-time bounds (wait-free, guaranteed bounded number of
  cache-line loads per key byte).

I'm pasting the API header's top comment here as a summary of the
feature set:

/*
 * urcu/fractal-trie.h
 *
 * Userspace RCU library - Fractal Trie
 *
 * A concurrent, RCU-protected ordered trie mapping variable or
 * fixed-length byte keys to user-defined nodes. Keys are opaque
 * byte sequences with no reserved or sentinel values; unlike
 * tries that rely on NUL or other terminal characters, any byte
 * value may appear at any position in a key. Lookups and
 * traversals are wait-free under the RCU read-side lock and can
 * proceed concurrently with mutations. The internal structure is
 * acyclic by construction: no sequence of concurrent updates can
 * introduce a cycle, ensuring that all lookups and traversals
 * complete in bounded time. Supports exact lookup, partial
 * (prefix) match, ordered iteration, duplicate key chains, and
 * range queries (<=, >=, <, >). A longest-match lookup reports
 * the deepest trie position matching a prefix of the input key,
 * even if that position has no external node.
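 *
 * As an illustrative sketch of an exact lookup (the key-binding
 * helper cds_ft_iter_set_key() is hypothetical; the real API may
 * bind the key to the iterator differently):
 *
 * cds_ft_iter_set_key(iter, key, key_len); // hypothetical helper
 * rcu_read_lock();
 * cds_ft_lookup(ft, iter);
 * node = cds_ft_iter_node(iter);
 * if (node) {
 *         struct my_entry *e = cds_ft_entry(node,
 *                 struct my_entry, ft_node);
 *         use_entry(e);  // only valid under the RCU read-side lock
 * }
 * rcu_read_unlock();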
 *
 * Performance characteristics:
 *
 * - Lookup: At most 2 cache-line accesses per key byte.
 * - Ordered traversal: At most 3 cache-line accesses per node.
 *
 * Internal node configurations:
 *
 * Internal nodes self-adapt to the key population using several
 * configurations (linear, 1D pool, 2D pool, pigeon), each with
 * a different indexing strategy suited to its child density.
 * The appropriate configuration is chosen automatically based
 * on the number of children. Node sizes are powers of 2 between
 * 16 bytes and 2048 bytes on 64-bit architectures (1024 bytes
 * on 32-bit). Mutations use in-place updates when possible to
 * minimize node recompaction, and hysteresis at size thresholds
 * prevents repeated recompaction when the child count oscillates
 * near a boundary.
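 *
 * For instance (threshold values invented for this sketch, not the
 * library's actual numbers), a node in the linear configuration
 * could grow once its child count exceeds the linear capacity, yet
 * only shrink back once the count drops a few entries below it:
 *
 * #define LINEAR_MAX    8   // hypothetical linear capacity
 * #define LINEAR_SHRINK 6   // hypothetical shrink threshold
 *
 * if (config == LINEAR && nr_children > LINEAR_MAX)
 *         recompact(node, POOL_1D);  // grow
 * else if (config == POOL_1D && nr_children < LINEAR_SHRINK)
 *         recompact(node, LINEAR);   // shrink, with hysteresis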
 *
 * The 1D and 2D pool configurations partition the 8-bit key
 * byte space using one or two bit positions respectively,
 * distributing children across sub-nodes. This provides a
 * range of intermediate node sizes between the compact linear
 * configuration and the full 256-entry pigeon configuration,
 * allowing memory-efficient representation of medium-density
 * populations without requiring the full pigeon footprint.
 * The bit positions are selected to minimize the maximum
 * sub-node population, using a minimax criterion. The
 * transition thresholds and worst-case sub-node sizes were
 * empirically validated by brute-force enumeration of optimal
 * bit selections across millions of random populations.
 * Alternative approaches such as Judy use a population bitmap
 * with a dense child array, which requires recompacting the
 * array on every insertion or removal. This is incompatible
 * with wait-free RCU lookups, since a reader could observe a
 * partially recompacted array. The pool approach avoids this
 * by using fixed index positions derived from key bits,
 * allowing children to be added or removed with single-pointer
 * updates visible atomically to concurrent readers.
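 *
 * A simplified sketch of the 1D minimax bit selection (function
 * name invented; real code would also handle the 2D case, which
 * picks two bit positions):
 *
 * // Pick the key-byte bit whose set/clear split leaves the larger
 * // of the two sub-node populations as small as possible.
 * static int select_split_bit(const uint8_t *bytes, size_t nr)
 * {
 *         int best_bit = 0;
 *         size_t best_max = nr + 1;
 *
 *         for (int bit = 0; bit < 8; bit++) {
 *                 size_t set = 0, max;
 *
 *                 for (size_t i = 0; i < nr; i++)
 *                         set += (bytes[i] >> bit) & 1;
 *                 max = set > nr - set ? set : nr - set;
 *                 if (max < best_max) {
 *                         best_max = max;
 *                         best_bit = bit;
 *                 }
 *         }
 *         return best_bit;
 * }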
 *
 * Node type and configuration are encoded in the low bits of
 * child pointers (tagged pointers), so determining a node's
 * layout during lookup requires no extra memory access.
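 *
 * The decode step looks roughly like this (mask and tag layout
 * invented for the sketch; if nodes are at least 16-byte aligned,
 * the low 4 bits of a child pointer are available for tagging):
 *
 * #define EXAMPLE_TAG_MASK 0xFUL  // hypothetical
 *
 * unsigned int cfg = (uintptr_t)child & EXAMPLE_TAG_MASK;
 * struct node *n = (struct node *)
 *         ((uintptr_t)child & ~EXAMPLE_TAG_MASK);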
 *
 * The 2D pool and pigeon configurations use a 32-byte bitmap
 * to locate populated children via bit scanning, reducing the
 * number of cache-line accesses needed for ordered traversal.
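 *
 * A sketch of the forward scan over such a bitmap (four 64-bit
 * words covering the 256 possible key-byte values; helper name
 * invented):
 *
 * // Return the next populated child index >= from, or -1.
 * static int next_child(const uint64_t bitmap[4], unsigned int from)
 * {
 *         for (unsigned int w = from / 64; w < 4; w++) {
 *                 uint64_t word = bitmap[w];
 *
 *                 if (w == from / 64)
 *                         word &= ~0ULL << (from % 64);
 *                 if (word)
 *                         return w * 64 + __builtin_ctzll(word);
 *         }
 *         return -1;
 * }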
 *
 * Memory layout:
 *
 * Per-node metadata is stored in a separate page via a strided
 * allocator and reached by pointer offset from the node address,
 * keeping metadata out of the node's cache lines. This avoids
 * padding overhead within nodes and ensures that lookups and
 * traversals only touch the data they need. Internal nodes are
 * reclaimed via RCU (call_rcu), so readers never access freed
 * memory even for the trie's own internal structure.
 *
 * This provides memory efficiency comparable to adaptive radix
 * tree schemes without requiring user tuning or configuration.
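 *
 * The address computation is along these lines (names and layout
 * invented for the sketch; the actual stride arithmetic may
 * differ):
 *
 * static struct node_meta *node_meta(void *node)
 * {
 *         // Metadata slots live in a companion area at a fixed
 *         // offset from the node's page; the slot index follows
 *         // from the node's position within the page.
 *         uintptr_t page = (uintptr_t)node & ~(PAGE_SIZE - 1UL);
 *         size_t slot = ((uintptr_t)node - page) / NODE_STRIDE;
 *
 *         return (struct node_meta *)(page + META_OFFSET) + slot;
 * }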
 *
 * Graft, graft-swap, and detach:
 *
 * The graft, graft-swap, and detach operations move entire
 * sub-tries between trie instances within the same group. Each
 * operation is a single pointer store visible atomically to
 * concurrent RCU readers: a reader sees either the complete
 * sub-trie or nothing, never a partial state.
 *
 * - Graft (cds_ft_graft) attaches the content of a source trie
 *   at a key position in a destination trie. The destination
 *   must have no content at or below the graft point. On
 *   success the source trie becomes empty and can be reused or
 *   destroyed.
 *
 * - Graft-swap (cds_ft_graft_swap) exchanges the content at a
 *   key position in a destination trie with the content of
 *   another trie. Unlike plain graft, the destination does not
 *   need to be empty at the graft point: any pre-existing
 *   content is moved into the swap trie. Concurrent readers
 *   see either the old content or the new content, never an
 *   empty intermediate state.
 *
 * - Detach (cds_ft_detach) removes the sub-structure rooted at
 *   a key position and returns it as a new, independent trie
 *   instance whose key lengths are relative to the detach
 *   point.
 *
 * Root-level operations (key_len 0) work with both fixed-length
 * and variable-length key groups. Non-root operations (key_len
 * > 0) require a variable-length key group
 * (CDS_FT_LEN_VARIABLE): in a fixed-length group every key
 * must equal the group's fixed length, so a shorter prefix
 * needed to address an interior graft or detach point cannot be
 * expressed, and the call returns
 * CDS_FT_STATUS_INVALID_ARGUMENT_ERROR.
 *
 * Together, these operations serve as the transplant, exchange,
 * and split primitives for bulk operations. A typical bulk-load
 * pattern populates a staging trie offline — with no RCU or
 * locking overhead — and then grafts it into the live trie in
 * a single O(1) step, keeping the writer critical section
 * minimal regardless of the number of nodes being moved. A
 * complementary bulk-removal pattern detaches (or graft-swaps)
 * a sub-trie in O(1), waits for a single grace period, and
 * then drains the detached trie locally, reducing the cost
 * from one grace period per node to one grace period total.
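 *
 * A sketch of both patterns (creation and population calls are
 * abridged, and the graft/detach argument order is illustrative):
 *
 * // Bulk load: populate offline, publish in one O(1) graft.
 * staging = create_trie(group);          // abridged
 * populate(staging, records);            // no readers yet
 * lock(&writer_mutex);
 * cds_ft_graft(live, key, key_len, staging);
 * unlock(&writer_mutex);
 *
 * // Bulk removal: detach in O(1), then drain after a single
 * // grace period for the whole sub-trie instead of one per node.
 * lock(&writer_mutex);
 * detached = cds_ft_detach(live, key, key_len);
 * unlock(&writer_mutex);
 * synchronize_rcu();
 * drain_and_destroy(detached);           // abridged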
 *
 * Iterator Lifecycle and RCU Locking:
 *
 * The `cds_ft_iter` object can operate in two path modes, selected via
 * cds_ft_iter_set_path_mode():
 *
 * CDS_FT_ITER_PATH_CACHED (default):
 *
 * The iterator caches the traversal path (RCU protected pointers) to
 * accelerate subsequent sequential operations like cds_ft_next(),
 * cds_ft_prev(), or cds_ft_remove().
 *
 * Because of this caching, the RCU read-side lock must be held
 * CONTINUOUSLY between the operation that populates the iterator
 * (e.g., cds_ft_lookup) and the operation that consumes it. If the
 * RCU read-side lock is dropped, the cached path may point to freed
 * memory.
 *
 * If you need to drop the RCU read-side lock between operations, you
 * MUST invalidate the iterator's cached path before reusing it. This
 * is done by calling:
 * - cds_ft_iter_invalidate_path()
 *
 * Calling this function clears the internal path state while keeping
 * your key intact. The next operation (e.g., cds_ft_remove) will
 * automatically fall back to a safe, fresh top-down traversal.
 *
 * Example A (Continuous Lock - Fast):
 * rcu_read_lock();
 * cds_ft_lookup(ft, iter);
 * cds_ft_remove(ft, iter, node); // Uses cached path O(1)
 * rcu_read_unlock();
 *
 * Example B (Dropped Lock - CDS_FT_ITER_PATH_CACHED only):
 * rcu_read_lock();
 * cds_ft_lookup(ft, iter);
 * rcu_read_unlock();
 * // ... time passes, lock is dropped ...
 *
 * lock(&writer_mutex);
 * cds_ft_iter_invalidate_path(iter); // Flush the stale RCU cache
 * cds_ft_remove(ft, iter, node);     // Safe: fresh traversal
 *                                    // protected by writer lock
 * unlock(&writer_mutex);
 *
 * Note: CDS_FT_ITER_PATH_UNCACHED iterators do not require
 * cds_ft_iter_invalidate_path() — the path is discarded
 * automatically after each operation.
 *
 * Debug validation (URCU_FRACTAL_TRIE_DEBUG_PATH):
 *
 * Building with URCU_FRACTAL_TRIE_DEBUG_PATH defined enables run-time
 * detection of stale cached paths.  Each path population records an RCU
 * grace-period snapshot (via the flavor's
 * update_start_poll_synchronize_rcu); each path consumption polls it
 * (via update_poll_state_synchronize_rcu).  If a full grace period has
 * elapsed since the path was populated, the cached pointers may
 * reference freed memory — the program aborts with a diagnostic.
 * Since struct cds_ft_iter is opaque, this option does not affect the
 * application ABI — only the library needs to be rebuilt.
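 *
 * The check amounts to the following (iterator field name invented;
 * shown with liburcu's grace-period polling API):
 *
 * // On path population:
 * iter->dbg_gp = start_poll_synchronize_rcu();
 *
 * // On path consumption:
 * if (poll_state_synchronize_rcu(iter->dbg_gp))
 *         abort();  // grace period elapsed: cached path may be stale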
 *
 * CDS_FT_ITER_PATH_UNCACHED:
 *
 * The iterator automatically discards the traversal path after each
 * operation returns. Every subsequent operation performs a fresh
 * top-down traversal from the current key. The RCU read-side lock
 * only needs to be held during each individual operation and while
 * accessing the returned node — it may be dropped between operations.
 *
 * This mode is suited for iteration patterns where the caller needs
 * to drop the RCU read-side lock between steps (e.g. to perform
 * blocking work or acquire other locks).
 *
 * Example C (Uncached mode, per-step locking with refcount):
 * cds_ft_iter_set_path_mode(iter, CDS_FT_ITER_PATH_UNCACHED);
 * rcu_read_lock();
 * cds_ft_for_each_rcu(ft, iter) {
 *         struct my_entry *e = cds_ft_entry(
 *                 cds_ft_iter_node(iter),
 *                 struct my_entry, ft_node);
 *         refcount_inc(&e->refcount);
 *         rcu_read_unlock();
 *
 *         do_blocking_work(e);
 *         refcount_dec(&e->refcount);
 *
 *         rcu_read_lock();
 * }
 * rcu_read_unlock();
 */

Your feedback is welcome!

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
