Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
Thanks for your response, Am 9/28/2024 um 1:33 PM schrieb Mathieu Desnoyers: This is a userspace prototype. This will behave similarly to a userspace spinlock in that case, which is not great in terms of CPU usage, but should eventually unblock the waiter, unless it has a RT priority that really prevents any progress from the emergency slot owner. On my TODO list, I have a bullet about integrating with sys_futex to block on wait, wake up on slot release. I would then use the wait/wakeup code based on sys_futex already present in liburcu. Oh, I see. I think if it is for userspace then it really should use wait/wakeup as you said. Best wishes, jonas
Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
On 2024-09-28 13:22, Jonas Oberhauser wrote: Two more questions below: Am 9/21/2024 um 6:42 PM schrieb Mathieu Desnoyers: +#define NR_PERCPU_SLOTS_BITS 3 Have you measured any advantage of this multi-slot version vs a version with just one normal slot and one emergency slot? No, I have not. That being said, I am taking a minimalistic approach that takes things even further in the "simple" direction for what I will send as RFC against the Linux kernel: there is just the one "emergency slot", irqs are disabled around use of the HP slot, and promotion to refcount is done before returning to the caller. With just one normal slot, the normal slot version would always be zero, and there'd be no need to increment etc., which might make the common case (no conflict) faster. The multi-slots allows preemption while holding the slot. It also allows HP slots users to keep it longer without doing to refcount right away. I even have a patch that dynamically adapts the scan depth (increase by reader, decrease by synchronize, with an hysteresis) in my userspace prototype. This splits the number of allocated slots from the scan depth. But I keep that for later and will focus on the simple case first (single HP slot, only used with irqoff). Either way I recommend stress testing with just one normal slot to increase the chance of conflict (and hence triggering corner cases) during stress testing. Good point. +retry: + node = uatomic_load(node_p, CMM_RELAXED); + if (!node) + return false; + /* Use rseq to try setting current slot hp. Store B. */ + if (rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID, + (intptr_t *) &slot->node, (intptr_t) NULL, + (intptr_t) node, cpu)) { + slot = &cpu_slots->slots[HPREF_EMERGENCY_SLOT]; + use_refcount = true; + /* + * This may busy-wait for another reader using the + * emergency slot to transition to refcount. + */ + caa_cpu_relax(); + goto retry; + } I'm not familiar with Linux' preemption model. Can this deadlock if a low-interrupt-level thread is occupying the EMERGENCY slot and a higher-interrupt-level thread is also trying to take it? This is a userspace prototype. This will behave similarly to a userspace spinlock in that case, which is not great in terms of CPU usage, but should eventually unblock the waiter, unless it has a RT priority that really prevents any progress from the emergency slot owner. On my TODO list, I have a bullet about integrating with sys_futex to block on wait, wake up on slot release. I would then use the wait/wakeup code based on sys_futex already present in liburcu. Thanks! Mathieu Best wishes, jonas -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
Two more questions below: Am 9/21/2024 um 6:42 PM schrieb Mathieu Desnoyers: +#define NR_PERCPU_SLOTS_BITS 3 Have you measured any advantage of this multi-slot version vs a version with just one normal slot and one emergency slot? With just one normal slot, the normal slot version would always be zero, and there'd be no need to increment etc., which might make the common case (no conflict) faster. Either way I recommend stress testing with just one normal slot to increase the chance of conflict (and hence triggering corner cases) during stress testing. +retry: + node = uatomic_load(node_p, CMM_RELAXED); + if (!node) + return false; + /* Use rseq to try setting current slot hp. Store B. */ + if (rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID, + (intptr_t *) &slot->node, (intptr_t) NULL, + (intptr_t) node, cpu)) { + slot = &cpu_slots->slots[HPREF_EMERGENCY_SLOT]; + use_refcount = true; + /* +* This may busy-wait for another reader using the +* emergency slot to transition to refcount. +*/ + caa_cpu_relax(); + goto retry; + } I'm not familiar with Linux' preemption model. Can this deadlock if a low-interrupt-level thread is occupying the EMERGENCY slot and a higher-interrupt-level thread is also trying to take it? Best wishes, jonas
Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
Am 9/25/2024 um 1:36 PM schrieb Mathieu Desnoyers: On 2024-09-25 12:06, Jonas Oberhauser wrote: Am 9/25/2024 um 8:35 AM schrieb Mathieu Desnoyers: On 2024-09-25 07:57, Jonas Oberhauser wrote: Hi Mathieu, I haven't read your code in detail but it seems to me you have an ABA bug: as I explained elsewhere, you could read the same pointer after ABA but you don't synchronize with the newer store that gave you node2, leaving you to speculatively read stale values through *ctx->hp. (I am assuming here that ctx->hp is essentially an out parameter used to let the caller know which node got protected). The following change should fix it: cmm_barrier(); - node2 = uatomic_load(node_p, CMM_RELAXED); /* Load A */ + node2 = rcu_dereference(*node_p); /* Load A */ I don't think this fixes it, because IIRC rcu_dereference relies on the address dependency (which we don't have here) to provide ordering. I would recommend either: - ctx->hp = node; + ctx->hp = node2; which fixes the problem under the perhaps too weak assumption that the compiler doesn't use its knowledge that node==node2 to just undo this fix, or more strictly, As stated in Documentation/RCU/rcu_dereference.rst from the Linux kernel, comparing the result of rcu_dereference against another non-NULL pointer is discouraged, as you rightly point out. + ctx->hp = READ_ONCE(node2); which I believe makes sure that the value of node2 is used. I am not entirely sure this extra READ_ONCE() would be sufficient to prevent the compiler from making assumptions about the content of node2 and thus use the result of the first load (node) instead. It would also not suffice to prevent the CPU from speculatively using the result of the first load to perform dependent loads AFAIU. The reason I think it should be sufficient is that it forces the compiler to assume that since the comparison, the value of node2 might have changed. Therefore, simply using the value of node at that point should be unsound from the compiler's POV. But I'm not a compiler expert... So I definitely support uneasiness about this construct :)) jonas
Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
On 2024-09-25 12:06, Jonas Oberhauser wrote: Am 9/25/2024 um 8:35 AM schrieb Mathieu Desnoyers: On 2024-09-25 07:57, Jonas Oberhauser wrote: Hi Mathieu, I haven't read your code in detail but it seems to me you have an ABA bug: as I explained elsewhere, you could read the same pointer after ABA but you don't synchronize with the newer store that gave you node2, leaving you to speculatively read stale values through *ctx->hp. (I am assuming here that ctx->hp is essentially an out parameter used to let the caller know which node got protected). The following change should fix it: cmm_barrier(); - node2 = uatomic_load(node_p, CMM_RELAXED); /* Load A */ + node2 = rcu_dereference(*node_p); /* Load A */ I don't think this fixes it, because IIRC rcu_dereference relies on the address dependency (which we don't have here) to provide ordering. I would recommend either: - ctx->hp = node; + ctx->hp = node2; which fixes the problem under the perhaps too weak assumption that the compiler doesn't use its knowledge that node==node2 to just undo this fix, or more strictly, As stated in Documentation/RCU/rcu_dereference.rst from the Linux kernel, comparing the result of rcu_dereference against another non-NULL pointer is discouraged, as you rightly point out. + ctx->hp = READ_ONCE(node2); which I believe makes sure that the value of node2 is used. I am not entirely sure this extra READ_ONCE() would be sufficient to prevent the compiler from making assumptions about the content of node2 and thus use the result of the first load (node) instead. It would also not suffice to prevent the CPU from speculatively using the result of the first load to perform dependent loads AFAIU. Alternatively you could always use an acquire load. Unless someone comes up with a sound alternate approach, I am tempted to go with an acquire load as the second load within hpref_hp_get(). This way, the compiler would not attempt to use the node value from the first load for dependent loads, and the and CPU won't try to speculate dependent loads either. Thanks, Mathieu Best wishes, jonas -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
Am 9/25/2024 um 8:35 AM schrieb Mathieu Desnoyers: On 2024-09-25 07:57, Jonas Oberhauser wrote: Hi Mathieu, I haven't read your code in detail but it seems to me you have an ABA bug: as I explained elsewhere, you could read the same pointer after ABA but you don't synchronize with the newer store that gave you node2, leaving you to speculatively read stale values through *ctx->hp. (I am assuming here that ctx->hp is essentially an out parameter used to let the caller know which node got protected). The following change should fix it: cmm_barrier(); - node2 = uatomic_load(node_p, CMM_RELAXED); /* Load A */ + node2 = rcu_dereference(*node_p); /* Load A */ I don't think this fixes it, because IIRC rcu_dereference relies on the address dependency (which we don't have here) to provide ordering. I would recommend either: -ctx->hp = node; +ctx->hp = node2; which fixes the problem under the perhaps too weak assumption that the compiler doesn't use its knowledge that node==node2 to just undo this fix, or more strictly, +ctx->hp = READ_ONCE(node2); which I believe makes sure that the value of node2 is used. Alternatively you could always use an acquire load. Best wishes, jonas
Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
On 2024-09-25 07:57, Jonas Oberhauser wrote: Hi Mathieu, interesting idea. Conceptually it looks good. There's another approach of using hazard pointer to optimize shared reference counting (to make it lock-free in common cases). https://github.com/cmuparlay/concurrent_deferred_rc It doesn't go as far as what you're doing, but they also use the hazard pointer to protect the refcount (like you do in the promote function). Thanks for the reference. Indeed, one use-case of this HP+refcount scheme would be to implement something that resembles C++ shared pointers. I'm glad to see I'm not alone in seeing potential in this approach. I haven't read your code in detail but it seems to me you have an ABA bug: as I explained elsewhere, you could read the same pointer after ABA but you don't synchronize with the newer store that gave you node2, leaving you to speculatively read stale values through *ctx->hp. (I am assuming here that ctx->hp is essentially an out parameter used to let the caller know which node got protected). You are of course absolutely right. The following change should fix it: cmm_barrier(); - node2 = uatomic_load(node_p, CMM_RELAXED); /* Load A */ + node2 = rcu_dereference(*node_p); /* Load A */ Please let me know if I'm missing anything. Thanks, Mathieu Have fun, jonas +/* + * hpref_hp_get: Obtain a reference to a stable object, protected either + * by hazard pointer (fast-path) or using reference + * counter as fall-back. + */ +static inline +bool hpref_hp_get(struct hpref_node **node_p, struct hpref_ctx *ctx) +{ + int cpu = rseq_current_cpu_raw(); + struct hpref_percpu_slots *cpu_slots = rseq_percpu_ptr(hpref_percpu_slots, cpu); + struct hpref_slot *slot = &cpu_slots->slots[cpu_slots->current_slot]; + bool use_refcount = false; + struct hpref_node *node, *node2; + unsigned int next_slot; + +retry: + node = uatomic_load(node_p, CMM_RELAXED); + if (!node) + return false; + /* Use rseq to try setting current slot hp. Store B. */ + if (rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID, + (intptr_t *) &slot->node, (intptr_t) NULL, + (intptr_t) node, cpu)) { + slot = &cpu_slots->slots[HPREF_EMERGENCY_SLOT]; + use_refcount = true; + /* + * This may busy-wait for another reader using the + * emergency slot to transition to refcount. + */ + caa_cpu_relax(); + goto retry; + } + /* Memory ordering: Store B before Load A. */ + cmm_smp_mb(); + node2 = uatomic_load(node_p, CMM_RELAXED); /* Load A */ + if (node != node2) { + uatomic_store(&slot->node, NULL, CMM_RELAXED); + if (!node2) + return false; + goto retry; + } + ctx->type = HPREF_TYPE_HP; + ctx->hp = node; ^ here + ctx->slot = slot; + if (use_refcount) { + hpref_promote_hp_to_ref(ctx); + return true; + } + /* + * Increment current slot (racy increment is OK because it is + * just a position hint). Skip the emergency slot. + */ + next_slot = uatomic_load(&cpu_slots->current_slot, CMM_RELAXED) + 1; + if (next_slot >= HPREF_EMERGENCY_SLOT) + next_slot = 0; + uatomic_store(&cpu_slots->current_slot, next_slot, CMM_RELAXED); + return true; +} + +static inline +void hpref_put(struct hpref_ctx *ctx) +{ + if (ctx->type == HPREF_TYPE_REF) { + urcu_ref_put(&ctx->hp->refcount, hpref_release); + } else { + /* Release HP. */ + uatomic_store(&ctx->slot->node, NULL, CMM_RELEASE); + } + ctx->hp = NULL; +} + +/* + * hpref_set_pointer: Store pointer @node to @ptr, with RCU publication + * guarantees. + */ +static inline +void hpref_set_pointer(struct hpref_node **ptr, struct hpref_node *node) +{ + if (__builtin_constant_p(node) && node == NULL) + uatomic_store(ptr, NULL, CMM_RELAXED); + else + uatomic_store(ptr, node, CMM_RELEASE); +} + +/* + * Return the content of the hpref context hazard pointer field. + */ +static inline +struct hpref_node *hpref_ctx_pointer(struct hpref_ctx *ctx) +{ + return ctx->hp; +} + +#ifdef __cplusplus +} +#endif + +#endif /* _URCU_HPREF_H */ diff --git a/src/Makefile.am b/src/Makefile.am index b555c818..7312c9f7 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -19,7 +19,8 @@ RCULFHASH = rculfhash.c rculfhash-mm-order.c rculfhash-mm-chunk.c \ lib_LTLIBRARIES = liburcu-common.la \ liburcu.la liburcu-qsbr.la \ liburcu-mb.la liburcu-bp.la \ - liburcu-memb.la liburcu-cds.la + liburcu-memb.la liburcu-cds.la \ + liburcu-hpref.la # # liburcu-common contains wait-free queues (needed by call_rcu) as well @@ -50,6 +51,9 @@ liburcu_cds_la_SOURCES = rculfqueue.c rculfstack.c lfstack.c \ workqueue.c workqueue.h $(RCULFHASH) $(COMPAT) liburcu_cds_la_LI
Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
Hi Mathieu, interesting idea. Conceptually it looks good. There's another approach of using hazard pointer to optimize shared reference counting (to make it lock-free in common cases). https://github.com/cmuparlay/concurrent_deferred_rc It doesn't go as far as what you're doing, but they also use the hazard pointer to protect the refcount (like you do in the promote function). I haven't read your code in detail but it seems to me you have an ABA bug: as I explained elsewhere, you could read the same pointer after ABA but you don't synchronize with the newer store that gave you node2, leaving you to speculatively read stale values through *ctx->hp. (I am assuming here that ctx->hp is essentially an out parameter used to let the caller know which node got protected). Have fun, jonas +/* + * hpref_hp_get: Obtain a reference to a stable object, protected either + * by hazard pointer (fast-path) or using reference + * counter as fall-back. + */ +static inline +bool hpref_hp_get(struct hpref_node **node_p, struct hpref_ctx *ctx) +{ + int cpu = rseq_current_cpu_raw(); + struct hpref_percpu_slots *cpu_slots = rseq_percpu_ptr(hpref_percpu_slots, cpu); + struct hpref_slot *slot = &cpu_slots->slots[cpu_slots->current_slot]; + bool use_refcount = false; + struct hpref_node *node, *node2; + unsigned int next_slot; + +retry: + node = uatomic_load(node_p, CMM_RELAXED); + if (!node) + return false; + /* Use rseq to try setting current slot hp. Store B. */ + if (rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID, + (intptr_t *) &slot->node, (intptr_t) NULL, + (intptr_t) node, cpu)) { + slot = &cpu_slots->slots[HPREF_EMERGENCY_SLOT]; + use_refcount = true; + /* +* This may busy-wait for another reader using the +* emergency slot to transition to refcount. +*/ + caa_cpu_relax(); + goto retry; + } + /* Memory ordering: Store B before Load A. */ + cmm_smp_mb(); + node2 = uatomic_load(node_p, CMM_RELAXED); /* Load A */ + if (node != node2) { + uatomic_store(&slot->node, NULL, CMM_RELAXED); + if (!node2) + return false; + goto retry; + } + ctx->type = HPREF_TYPE_HP; + ctx->hp = node; ^ here + ctx->slot = slot; + if (use_refcount) { + hpref_promote_hp_to_ref(ctx); + return true; + } + /* +* Increment current slot (racy increment is OK because it is +* just a position hint). Skip the emergency slot. +*/ + next_slot = uatomic_load(&cpu_slots->current_slot, CMM_RELAXED) + 1; + if (next_slot >= HPREF_EMERGENCY_SLOT) + next_slot = 0; + uatomic_store(&cpu_slots->current_slot, next_slot, CMM_RELAXED); + return true; +} + +static inline +void hpref_put(struct hpref_ctx *ctx) +{ + if (ctx->type == HPREF_TYPE_REF) { + urcu_ref_put(&ctx->hp->refcount, hpref_release); + } else { + /* Release HP. */ + uatomic_store(&ctx->slot->node, NULL, CMM_RELEASE); + } + ctx->hp = NULL; +} + +/* + * hpref_set_pointer: Store pointer @node to @ptr, with RCU publication + *guarantees. + */ +static inline +void hpref_set_pointer(struct hpref_node **ptr, struct hpref_node *node) +{ + if (__builtin_constant_p(node) && node == NULL) + uatomic_store(ptr, NULL, CMM_RELAXED); + else + uatomic_store(ptr, node, CMM_RELEASE); +} + +/* + * Return the content of the hpref context hazard pointer field. + */ +static inline +struct hpref_node *hpref_ctx_pointer(struct hpref_ctx *ctx) +{ + return ctx->hp; +} + +#ifdef __cplusplus +} +#endif + +#endif /* _URCU_HPREF_H */ diff --git a/src/Makefile.am b/src/Makefile.am index b555c818..7312c9f7 100644 --- a/src/Makefile.am +++ b/src/Makefile.am @@ -19,7 +19,8 @@ RCULFHASH = rculfhash.c rculfhash-mm-order.c rculfhash-mm-chunk.c \ lib_LTLIBRARIES = liburcu-common.la \ liburcu.la liburcu-qsbr.la \ liburcu-mb.la liburcu-bp.la \ - liburcu-memb.la liburcu-cds.la + liburcu-memb.la liburcu-cds.la \ + liburcu-hpref.la # # liburcu-common contains wait-free queues (needed by call_rcu) as well @@ -50,6 +51,9 @@ liburcu_cds_la_SOURCES = rculfqueue.c rculfstack.c lfstack.c \ workqueue.c workqueue.h $(RCULFHASH) $(COMPAT) liburcu_cds_la_LIBADD = liburcu-common.la +liburcu_hpref_la_SOURCES = hpref.c +liburcu_hpref_la_LIBADD = -lrseq + pkgconfigdir = $(libdir)/pkgconfig pkgconfig_DATA = liburcu-cds.pc liburcu.pc liburcu-bp.pc liburcu-qsbr.pc \ libu
Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
On 2024-09-21 23:07, Lai Jiangshan wrote: +/* + * hpref_hp_get: Obtain a reference to a stable object, protected either + * by hazard pointer (fast-path) or using reference + * counter as fall-back. + */ +static inline +bool hpref_hp_get(struct hpref_node **node_p, struct hpref_ctx *ctx) +{ + int cpu = rseq_current_cpu_raw(); + struct hpref_percpu_slots *cpu_slots = rseq_percpu_ptr(hpref_percpu_slots, cpu); + struct hpref_slot *slot = &cpu_slots->slots[cpu_slots->current_slot]; + bool use_refcount = false; + struct hpref_node *node, *node2; + unsigned int next_slot; + +retry: + node = uatomic_load(node_p, CMM_RELAXED); + if (!node) + return false; + /* Use rseq to try setting current slot hp. Store B. */ + if (rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID, + (intptr_t *) &slot->node, (intptr_t) NULL, + (intptr_t) node, cpu)) { + slot = &cpu_slots->slots[HPREF_EMERGENCY_SLOT]; Can @cpu be possibly changed? if it can, it seems @cpu and @cpu_slots should be updated first. Indeed, if migration happens between rseq_current_cpu_raw() and execution of rseq_load_cbne_store__ptr(), it will cause this second operation to fail. This in turn could cause the reader to retry for a long time (possibly forever) until it finally migrates back to the original CPU. As you suggest, we should re-load cpu and cpu_slots after failure here. More specifically, we should re-load those when the rseq c.s. fails with -1, which means it was abort or there was a cpu number mismatch. If the rseq c.s. returns 1, this means the slot did not contain NULL, so all we need to do is move over to the next slot. While applying your suggested change, I noticed that I can improve the fast-path by removing the notion of "current_slot" number, and thus remove increment of this hint from the fast path. The fast path will instead just scan all 8 slots trying to find a NULL one. This also lessens the odds that the fast-path must fallback to refcount in case of concurrent use. It provides a 50% performance improvement for the fast path with barrier/membarrier pairing. I've updated the https://github.com/compudj/userspace-rcu-dev/tree/hpref volatile feature branch with these changes. I'll give others some time to provide feedback on the overall idea before sending out an updated patch. Thanks for your feedback! Mathieu + use_refcount = true; + /* +* This may busy-wait for another reader using the +* emergency slot to transition to refcount. +*/ + caa_cpu_relax(); + goto retry; + } -- Mathieu Desnoyers EfficiOS Inc. https://www.efficios.com
Re: [RFC PATCH 1/1] hpref: Hazard Pointers with Reference Counter
> +/* > + * hpref_hp_get: Obtain a reference to a stable object, protected either > + * by hazard pointer (fast-path) or using reference > + * counter as fall-back. > + */ > +static inline > +bool hpref_hp_get(struct hpref_node **node_p, struct hpref_ctx *ctx) > +{ > + int cpu = rseq_current_cpu_raw(); > + struct hpref_percpu_slots *cpu_slots = > rseq_percpu_ptr(hpref_percpu_slots, cpu); > + struct hpref_slot *slot = &cpu_slots->slots[cpu_slots->current_slot]; > + bool use_refcount = false; > + struct hpref_node *node, *node2; > + unsigned int next_slot; > + > +retry: > + node = uatomic_load(node_p, CMM_RELAXED); > + if (!node) > + return false; > + /* Use rseq to try setting current slot hp. Store B. */ > + if (rseq_load_cbne_store__ptr(RSEQ_MO_RELAXED, RSEQ_PERCPU_CPU_ID, > + (intptr_t *) &slot->node, (intptr_t) NULL, > + (intptr_t) node, cpu)) { > + slot = &cpu_slots->slots[HPREF_EMERGENCY_SLOT]; Can @cpu be possibly changed? if it can, it seems @cpu and @cpu_slots should be updated first. > + use_refcount = true; > + /* > +* This may busy-wait for another reader using the > +* emergency slot to transition to refcount. > +*/ > + caa_cpu_relax(); > + goto retry; > + }