Hi! I've implemented a new RCU trie data structure, and would welcome your feedback while it is still in development.
It's currently sitting in a development branch of my private
userspace-rcu repository:

https://github.com/compudj/userspace-rcu-dev/tree/fractal-trie-dev

It targets userspace applications, but could be ported to the Linux
kernel if there is a use-case for it.

I started working on this project around 2012, and have recently been
able to find time to pursue this work. (Many thanks to ISC.org for
funding this!)

Here are a few major use-cases I see as a good fit:

- DNS resolvers,
- IP routing tables,
- Real-time application data structure lookups with strict real-time
  bounds (wait-free, guaranteed bounded number of cache-line loads
  per key byte).

I'm pasting the API header top comment here as a summary of the
feature set:

/*
 * urcu/fractal-trie.h
 *
 * Userspace RCU library - Fractal Trie
 *
 * A concurrent, RCU-protected ordered trie mapping variable- or
 * fixed-length byte keys to user-defined nodes. Keys are opaque
 * byte sequences with no reserved or sentinel values; unlike
 * tries that rely on NUL or other terminal characters, any byte
 * value may appear at any position in a key. Lookups and
 * traversals are wait-free under the RCU read-side lock and can
 * proceed concurrently with mutations. The internal structure is
 * acyclic by construction: no sequence of concurrent updates can
 * introduce a cycle, ensuring that all lookups and traversals
 * complete in bounded time. The trie supports exact lookup,
 * partial (prefix) match, ordered iteration, duplicate key
 * chains, and range queries (<=, >=, <, >). A longest-match
 * lookup reports the deepest trie position matching a prefix of
 * the input key, even if that position has no external node.
 *
 * Performance characteristics:
 *
 * - Lookup: at most 2 cache-line accesses per key byte.
 * - Ordered traversal: at most 3 cache-line accesses per node.
 *
 * Internal node configurations:
 *
 * Internal nodes self-adapt to the key population using several
 * configurations (linear, 1D pool, 2D pool, pigeon), each with
 * a different indexing strategy suited to its child density.
 * The appropriate configuration is chosen automatically based
 * on the number of children. Node sizes are powers of 2 between
 * 16 bytes and 2048 bytes on 64-bit architectures (1024 bytes
 * on 32-bit). Mutations use in-place updates when possible to
 * minimize node recompaction, and hysteresis at size thresholds
 * prevents repeated recompaction when the child count oscillates
 * near a boundary.
 *
 * The 1D and 2D pool configurations partition the 8-bit key
 * byte space using one or two bit positions respectively,
 * distributing children across sub-nodes. This provides a
 * range of intermediate node sizes between the compact linear
 * configuration and the full 256-entry pigeon configuration,
 * allowing memory-efficient representation of medium-density
 * populations without requiring the full pigeon footprint.
 * The bit positions are selected to minimize the maximum
 * sub-node population, using a minimax criterion. The
 * transition thresholds and worst-case sub-node sizes were
 * empirically validated by brute-force enumeration of optimal
 * bit selections across millions of random populations.
 * Alternative approaches such as Judy use a population bitmap
 * with a dense child array, which requires recompacting the
 * array on every insertion or removal. This is incompatible
 * with wait-free RCU lookups, since a reader could observe a
 * partially recompacted array. The pool approach avoids this
 * by using fixed index positions derived from key bits,
 * allowing children to be added or removed with single-pointer
 * updates visible atomically to concurrent readers.
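 *
 * For illustration, a 1D pool split on key-byte bit b conceptually
 * indexes children as in the sketch below. This is illustrative
 * pseudo-code only; the field and helper names shown are not part
 * of the API nor of the actual internal layout:
 *
 *   // Select the sub-node from one key bit, then publish the
 *   // child at its fixed slot with a single atomic pointer
 *   // store. No array recompaction takes place, so a concurrent
 *   // RCU reader observes either the old child pointer or the
 *   // new one, never a half-compacted array.
 *   sub = node->sub[(key_byte >> b) & 1];
 *   rcu_assign_pointer(sub->slot[slot_index(key_byte)], child);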
 *
 * Node type and configuration are encoded in the low bits of
 * child pointers (tagged pointers), so determining a node's
 * layout during lookup requires no extra memory access.
 *
 * The 2D pool and pigeon configurations use a 32-byte bitmap
 * to locate populated children via bit scanning, reducing the
 * number of cache-line accesses needed for ordered traversal.
 *
 * Memory layout:
 *
 * Per-node metadata is stored in a separate page via a strided
 * allocator and reached by pointer offset from the node address,
 * keeping metadata out of the node's cache lines. This avoids
 * padding overhead within nodes and ensures that lookups and
 * traversals only touch the data they need. Internal nodes are
 * reclaimed via RCU (call_rcu), so readers never access freed
 * memory, even for the trie's own internal structure.
 *
 * This provides memory efficiency comparable to adaptive radix
 * tree schemes without requiring user tuning or configuration.
 *
 * Graft, graft-swap, and detach:
 *
 * The graft, graft-swap, and detach operations move entire
 * sub-tries between trie instances within the same group. Each
 * operation is a single pointer store visible atomically to
 * concurrent RCU readers: a reader sees either the complete
 * sub-trie or nothing, never a partial state.
 *
 * - Graft (cds_ft_graft) attaches the content of a source trie
 *   at a key position in a destination trie. The destination
 *   must have no content at or below the graft point. On
 *   success, the source trie becomes empty and can be reused or
 *   destroyed.
 *
 * - Graft-swap (cds_ft_graft_swap) exchanges the content at a
 *   key position in a destination trie with the content of
 *   another trie. Unlike plain graft, the destination does not
 *   need to be empty at the graft point: any pre-existing
 *   content is moved into the swap trie. Concurrent readers
 *   see either the old content or the new content, never an
 *   empty intermediate state.
 *
 * - Detach (cds_ft_detach) removes the sub-structure rooted at
 *   a key position and returns it as a new, independent trie
 *   instance whose key lengths are relative to the detach
 *   point.
 *
 * Root-level operations (key_len 0) work with both fixed-length
 * and variable-length key groups. Non-root operations (key_len
 * > 0) require a variable-length key group (CDS_FT_LEN_VARIABLE):
 * in a fixed-length group every key must equal the group's fixed
 * length, so a shorter prefix needed to address an interior graft
 * or detach point cannot be expressed, and the call returns
 * CDS_FT_STATUS_INVALID_ARGUMENT_ERROR.
 *
 * Together, these operations serve as the transplant, exchange,
 * and split primitives for bulk operations. A typical bulk-load
 * pattern populates a staging trie offline (with no RCU or
 * locking overhead) and then grafts it into the live trie in
 * a single O(1) step, keeping the writer critical section
 * minimal regardless of the number of nodes being moved. A
 * complementary bulk-removal pattern detaches (or graft-swaps)
 * a sub-trie in O(1), waits for a single grace period, and
 * then drains the detached trie locally, reducing the cost
 * from one grace period per node to one grace period total.
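 *
 * Both bulk patterns are sketched below. The exact graft and
 * detach prototypes, the locking scheme, and the populate/drain
 * helpers are illustrative assumptions; see the header's
 * declarations for the authoritative API:
 *
 *   // Bulk load: fill a staging trie offline, then splice it
 *   // into the live trie with a single O(1) graft.
 *   populate_offline(staging);        // hypothetical offline fill
 *   lock(&writer_mutex);
 *   ret = cds_ft_graft(live, key, key_len, staging);
 *   unlock(&writer_mutex);
 *   // On success, the staging trie is empty and can be reused.
 *
 *   // Bulk removal: detach a sub-trie in O(1), wait for a single
 *   // grace period, then drain it locally.
 *   lock(&writer_mutex);
 *   detached = cds_ft_detach(live, key, key_len);
 *   unlock(&writer_mutex);
 *   synchronize_rcu();                // one grace period total
 *   drain_locally(detached);          // hypothetical local teardown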
 *
 * Iterator lifecycle and RCU locking:
 *
 * The cds_ft_iter object can operate in two path modes, selected via
 * cds_ft_iter_set_path_mode():
 *
 * CDS_FT_ITER_PATH_CACHED (default):
 *
 * The iterator caches the traversal path (RCU-protected pointers) to
 * accelerate subsequent sequential operations like cds_ft_next(),
 * cds_ft_prev(), or cds_ft_remove().
 *
 * Because of this caching, the RCU read-side lock must be held
 * CONTINUOUSLY between the operation that populates the iterator
 * (e.g., cds_ft_lookup) and the operation that consumes it. If the
 * RCU read-side lock is dropped, the cached path may point to freed
 * memory.
 *
 * If you need to drop the RCU read-side lock between operations, you
 * MUST invalidate the iterator's cached path before reusing it, by
 * calling cds_ft_iter_invalidate_path().
 *
 * Calling this function clears the internal path state while keeping
 * your key intact. The next operation (e.g., cds_ft_remove) will
 * automatically fall back to a safe, fresh top-down traversal.
 *
 * Example A (continuous lock, fast):
 *
 *   rcu_read_lock();
 *   cds_ft_lookup(ft, iter);
 *   cds_ft_remove(ft, iter, node);     // Uses cached path: O(1)
 *   rcu_read_unlock();
 *
 * Example B (dropped lock, CDS_FT_ITER_PATH_CACHED only):
 *
 *   rcu_read_lock();
 *   cds_ft_lookup(ft, iter);
 *   rcu_read_unlock();
 *   // ... time passes, lock is dropped ...
 *
 *   lock(&writer_mutex);
 *   cds_ft_iter_invalidate_path(iter); // Flush the stale RCU cache
 *   cds_ft_remove(ft, iter, node);     // Safe: fresh traversal,
 *                                      // protected by writer lock
 *   unlock(&writer_mutex);
 *
 * Note: CDS_FT_ITER_PATH_UNCACHED iterators do not require
 * cds_ft_iter_invalidate_path(); the path is discarded
 * automatically after each operation.
 *
 * Debug validation (URCU_FRACTAL_TRIE_DEBUG_PATH):
 *
 * Building with URCU_FRACTAL_TRIE_DEBUG_PATH defined enables run-time
 * detection of stale cached paths. Each path population records an
 * RCU grace-period snapshot (via the flavor's
 * update_start_poll_synchronize_rcu); each path consumption polls it
 * (via update_poll_state_synchronize_rcu). If a full grace period
 * has elapsed since the path was populated, the cached pointers may
 * reference freed memory, and the program aborts with a diagnostic.
 * Since struct cds_ft_iter is opaque, this option does not affect
 * the application ABI; only the library needs to be rebuilt.
 *
 * CDS_FT_ITER_PATH_UNCACHED:
 *
 * The iterator automatically discards the traversal path after each
 * operation returns. Every subsequent operation performs a fresh
 * top-down traversal from the current key. The RCU read-side lock
 * only needs to be held during each individual operation and while
 * accessing the returned node; it may be dropped between operations.
 *
 * This mode is suited for iteration patterns where the caller needs
 * to drop the RCU read-side lock between steps (e.g., to perform
 * blocking work or acquire other locks).
 *
 * Example C (uncached mode, per-step locking with refcount):
 *
 *   cds_ft_iter_set_path_mode(iter, CDS_FT_ITER_PATH_UNCACHED);
 *   rcu_read_lock();
 *   cds_ft_for_each_rcu(ft, iter) {
 *           struct my_entry *e = cds_ft_entry(
 *                   cds_ft_iter_node(iter),
 *                   struct my_entry, ft_node);
 *
 *           refcount_inc(&e->refcount);
 *           rcu_read_unlock();
 *
 *           do_blocking_work(e);
 *           refcount_dec(&e->refcount);
 *
 *           rcu_read_lock();
 *   }
 *   rcu_read_unlock();
 */
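For reference, the URCU_FRACTAL_TRIE_DEBUG_PATH check boils down to
the liburcu grace-period polling API, roughly as sketched below
(minimal sketch assuming the urcu-memb flavor; the library actually
goes through the flavor's update_start_poll_synchronize_rcu and
update_poll_state_synchronize_rcu operations):

  #include <urcu/urcu-memb.h>

  struct urcu_gp_poll_state snapshot;

  /* When the iterator path is populated: */
  snapshot = urcu_memb_start_poll_synchronize_rcu();

  /* When the cached path is consumed: */
  if (urcu_memb_poll_state_synchronize_rcu(snapshot))
          abort();  /* A full grace period has elapsed: the cached
                       path may reference freed memory. */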
Your feedback is welcome!

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
