Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers wrote:
> > It is provided as a starting point only. Test cases should be ported
> > from Userspace RCU to kernel space and thoroughly run on a wide range of
> > architectures before considering this port production-ready.
>
> Hi Mathieu, any progress on this front?

Nope, I've been busy on other things.

> I'm not sure how to approach
> test cases in the kernel, perhaps taking hints from locking_selftest()
> would be a start.

I guess creating a wfcq_selftest() derived from userspace RCU
test_urcu_wfcq.c might be a good start.

> Fwiw, I've been running wfcqueue+epoll for a few weeks now on my
> personal machines without problems :)

Great!

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
Mathieu Desnoyers wrote:
> It is provided as a starting point only. Test cases should be ported
> from Userspace RCU to kernel space and thoroughly run on a wide range of
> architectures before considering this port production-ready.

Hi Mathieu, any progress on this front? I'm not sure how to approach
test cases in the kernel, perhaps taking hints from locking_selftest()
would be a start.

Fwiw, I've been running wfcqueue+epoll for a few weeks now on my
personal machines without problems :)
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
Mathieu Desnoyers wrote:
> Thanks for providing this detailed scenario. I think there is an
> important aspect in the use of splice I suggested on which we are not
> fully understanding each other. I will annotate your scenario below with
> clarifications:

Ah yes, I somehow thought splice would overwrite its destination.
I've confirmed everything works as expected and have an updated RFC
coming.
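For readers who hit the same misunderstanding: __wfcq_splice appends the source queue onto the destination queue's tail and leaves the source empty — it never overwrites the destination. The following is a toy single-threaded stand-in (illustrative names, not the wait-free kernel code) that demonstrates only the append semantics:

```c
#include <assert.h>
#include <stddef.h>

/* Toy queue mirroring the dummy-head-node layout. NOT the wait-free
 * implementation; it only shows that splice appends to dst's tail. */
struct node { struct node *next; int val; };
struct queue { struct node head; struct node *tail; };

static void q_init(struct queue *q)
{
	q->head.next = NULL;
	q->tail = &q->head;
}

static void q_enqueue(struct queue *q, struct node *n)
{
	n->next = NULL;
	q->tail->next = n;
	q->tail = n;
}

/* Move every node of src onto the tail of dst; src becomes empty. */
static void q_splice(struct queue *dst, struct queue *src)
{
	if (src->tail == &src->head)
		return;			/* source empty: nothing to do */
	dst->tail->next = src->head.next;
	dst->tail = src->tail;
	q_init(src);
}

static struct node *q_dequeue(struct queue *q)
{
	struct node *n = q->head.next;

	if (!n)
		return NULL;
	q->head.next = n->next;
	if (q->tail == n)
		q->tail = &q->head;	/* dequeued the last node */
	return n;
}
```

Splicing a queue holding {3, 4} onto one holding {1, 2} yields {1, 2, 3, 4}, preserving order.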
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
* Eric Wong (normalper...@yhbt.net) wrote:
> Eric Wong wrote:
> > Mathieu Desnoyers wrote:
> > > * Eric Wong (normalper...@yhbt.net) wrote:
> > > > Mathieu Desnoyers wrote:
> > > > > +/*
> > > > > + * Load data from shared memory.
> > > > > + */
> > > > > +#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
> > > >
> > > > When iterating through the queue by dequeueing, I needed a way
> > > > to get the tail at the start of the iteration and use that as
> > > > a sentinel while iterating, so I access the tail like this:
> > > >
> > > > struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > > >
> > > > I hope this is supported... it seems to work :)
> > >
> > > Ideally it would be good if users could try using the exposed APIs to do
> > > these things, or if it's really needed, maybe it's a sign that we need
> > > to extend the API.
> >
> > Right. If I can use splice, I will not need this. more comments below
> > on splice...
>
> Even with splice, I think I need to see the main tail at the start of
> iteration to maintain compatibility (for weird apps that might care).

Thanks for providing this detailed scenario. I think there is an
important aspect in the use of splice I suggested on which we are not
fully understanding each other. I will annotate your scenario below with
clarifications:

> Consider this scenario:
>
> 1) main.queue has 20 events
>
> 2) epoll_wait(maxevents=16) called by user
>
> 3) splice all 20 events into unconsumed.queue, main.queue is empty
>
> 4) put_user + dequeue on 16 events from unconsumed.queue
>    # unconsumed.queue has 4 left at this point
>
> 5) main.queue gets several more events enqueued at any point after 3.

Let's suppose 11 events are enqueued into main.queue after 3.

> 6) epoll_wait(maxevents=16) called by user again

Before 7), here is what should be done:

6.5) splice all new events from main.queue into unconsumed.queue.
     unconsumed.queue will now contain 4 + 11 = 15 events.

Note that splice will preserve the right order of events within
unconsumed.queue.

> 7) put_user + dequeue on 4 remaining items in unconsumed.queue
>
>    We can safely return 4 events back to the user at this point.

Step 7) will now return 15 events from unconsumed.queue.

>    However, this might break compatibility for existing users. I'm
>    not sure if there's any weird apps which know/expect the event
>    count they'll get from epoll_wait, but maybe there is one...

With the new step 6.5), I don't think the behavior will change compared
to what is already there.

> 8) We could perform a splice off main.queue to fill the remaining
>    slots the user requested, but we do not know if the things we
>    splice from main.queue at this point were just dequeued in 7.
>
>    If we loaded the main.queue.tail before 7, we could safely splice
>    into unconsumed.queue and know when to stop when repeating the
>    put_user + dequeue loop.

We can achieve the same thing by doing step 6.5) at the beginning of
epoll_wait(). It's important to do it at the beginning of epoll_wait()
for the reason you discuss in 8): if you wait until you notice that
unconsumed.queue is empty before refilling it from main.queue, you won't
be able to know if the events in main.queue were added after the first
event was dequeued.

Step 6.5) should be performed each time upon entry into epoll_wait(). It
does not matter whether unconsumed.queue happens to have enough events
to fill the maxevents request or not (and you don't want to iterate over
unconsumed.queue needlessly to count them): you can just do an O(1)
splice from main.queue into unconsumed.queue, and your original
semantics should be preserved.

Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
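The splice-on-entry discipline described above can be sketched in plain C. This is a hypothetical single-threaded model of the epoll_wait() flow (no locking, none of the wait-free machinery; all names are illustrative): `fake_epoll_wait()` always splices main.queue into unconsumed.queue at entry, then dequeues up to maxevents.

```c
#include <assert.h>
#include <stddef.h>

struct node { struct node *next; };
struct queue { struct node head; struct node *tail; };

static void q_init(struct queue *q)
{
	q->head.next = NULL;
	q->tail = &q->head;
}

static void q_enqueue(struct queue *q, struct node *n)
{
	n->next = NULL;
	q->tail->next = n;
	q->tail = n;
}

/* O(1) append of all of src onto dst's tail; src becomes empty. */
static void q_splice(struct queue *dst, struct queue *src)
{
	if (src->tail == &src->head)
		return;
	dst->tail->next = src->head.next;
	dst->tail = src->tail;
	q_init(src);
}

static struct node *q_dequeue(struct queue *q)
{
	struct node *n = q->head.next;

	if (!n)
		return NULL;
	q->head.next = n->next;
	if (q->tail == n)
		q->tail = &q->head;
	return n;
}

/* Step 6.5 happens unconditionally at entry, so events enqueued since
 * the last call land *behind* the leftover unconsumed events and
 * ordering is preserved. */
static int fake_epoll_wait(struct queue *main_q, struct queue *unconsumed,
			   int maxevents)
{
	int n = 0;

	q_splice(unconsumed, main_q);	/* always splice first */
	while (n < maxevents && q_dequeue(unconsumed))
		n++;			/* put_user() would go here */
	return n;
}
```

Running the 20-event scenario from the thread through this model: the first call (maxevents=16) returns 16 and leaves 4 unconsumed; after 11 more enqueues, the second call returns 4 + 11 = 15.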
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
Eric Wong wrote:
> Mathieu Desnoyers wrote:
> > * Eric Wong (normalper...@yhbt.net) wrote:
> > > Mathieu Desnoyers wrote:
> > > > +/*
> > > > + * Load data from shared memory.
> > > > + */
> > > > +#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
> > >
> > > When iterating through the queue by dequeueing, I needed a way
> > > to get the tail at the start of the iteration and use that as
> > > a sentinel while iterating, so I access the tail like this:
> > >
> > > struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > >
> > > I hope this is supported... it seems to work :)
> >
> > Ideally it would be good if users could try using the exposed APIs to do
> > these things, or if it's really needed, maybe it's a sign that we need
> > to extend the API.
>
> Right. If I can use splice, I will not need this. more comments below
> on splice...

Even with splice, I think I need to see the main tail at the start of
iteration to maintain compatibility (for weird apps that might care).

Consider this scenario:

1) main.queue has 20 events

2) epoll_wait(maxevents=16) called by user

3) splice all 20 events into unconsumed.queue, main.queue is empty

4) put_user + dequeue on 16 events from unconsumed.queue
   # unconsumed.queue has 4 left at this point

5) main.queue gets several more events enqueued at any point after 3.

6) epoll_wait(maxevents=16) called by user again

7) put_user + dequeue on 4 remaining items in unconsumed.queue

   We can safely return 4 events back to the user at this point.

   However, this might break compatibility for existing users. I'm
   not sure if there's any weird apps which know/expect the event
   count they'll get from epoll_wait, but maybe there is one...

8) We could perform a splice off main.queue to fill the remaining
   slots the user requested, but we do not know if the things we
   splice from main.queue at this point were just dequeued in 7.

   If we loaded the main.queue.tail before 7, we could safely splice
   into unconsumed.queue and know when to stop when repeating the
   put_user + dequeue loop.
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers wrote:
> > * Eric Wong (normalper...@yhbt.net) wrote:
> > > Mathieu Desnoyers wrote:
> > > > The advantage of using splice() over dequeue() is that you will reduce
> > > > the amount of interactions between concurrent enqueue and dequeue
> > > > operations on the head and tail of the same queue.
> > >
> > > I wanted to use splice here originally, but epoll_wait(2) may not
> > > consume the entire queue, as it's limited to maxevents specified by the
> > > user calling epoll_wait.
> >
> > I see,
> >
> > > With unconsumed elements, I need to preserve ordering of the queue to
> > > avoid starvation. So I would either need to:
> > >
> > > a) splice the unconsumed portion back to the head of the shared queue.
> > >    I'm not sure if this is possible while elements are being enqueued...
> >
> > That would be a double-ended queue. I haven't thought this problem
> > through yet.
> >
> > >    Using a mutex for splicing back to unconsumed elements is OK, and
> > >    probably required anyways since we need to keep EPOLLONESHOT
> > >    unmasking synced with dequeue.
> > >
> > > b) preserve the unconsumed spliced queue across epoll_wait invocations
> > >    but that requires checking both queues for event availability...
> >
> > I think b) should be preferred over a).
> >
> > Basically, instead of having the "unconsumed element" queue on the stack
> > as I suggested, you might want to keep it across epoll_wait invocations.
> >
> > Before you start digging in the unconsumed element queue, you just start
> > by splicing the content of the shared queue into the tail of the
> > unconsumed queue. Then, you simply dequeue the unconsumed queue elements
> > until you either reach the end or the max nr.
> >
> > You should note that with this scheme, you'll have to dequeue items from
> > the unconsumed queue rather than just iterating over the elements. The
> > nice side of consuming all the elements (in the temp queue on the local
> > stack) is that you don't care about dequeuing, since the entire queue
> > will vanish. However, in this case, since you care about keeping the
> > queue after a partial iteration, you need to dequeue from it.
> >
> > And yes, this approach involves checking both queues for event
> > availability. Hopefully none of this will be too much of an issue
> > performance-wise.
>
> Right. I will try this, I don't think the check will be too expensive.
>
> When dequeuing from the unconsumed queue, perhaps there should be a
> "dequeue_local" function which omits the normal barriers required
> for the shared queue.
>
> With a splice and without needing barriers for iteration, this sounds good.

Well actually, __wfcq_dequeue() is really not that expensive. In terms
of synchronization, here is what it typically does:

node = ___wfcq_node_sync_next(&head->node);
  -> busy wait if node->next is NULL. This check is needed even if we
     work on a "local" queue, because the O(1) splice operation does not
     walk over every node to check for NULL next pointer: this is left
     to the dequeue/iteration operations.

if ((next = CMM_LOAD_SHARED(node->next)) == NULL) {
  -> only taken if we are getting the last node of the queue. Happens
     at most once per batch.
}

head->node.next = next;
  -> a simple store.

smp_read_barrier_depends();
  -> no-op on everything but Alpha.

return node;

So my recommendation would be to be careful before trying to come up
with flavors that remove barriers if those are not actually hurting the
fast path significantly. By dequeue fast path, I mean what needs to be
executed for dequeue of each node.
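The step-by-step walk-through above maps onto very little code. Here is a hypothetical single-threaded rendition in userspace C, with the kernel's synchronization points reduced to comments (___wfcq_node_sync_next's busy-wait, CMM_LOAD_SHARED, and smp_read_barrier_depends() have no observable effect without concurrency):

```c
#include <assert.h>
#include <stddef.h>

struct wfcq_node { struct wfcq_node *next; };
struct wfcq_head { struct wfcq_node node; };
struct wfcq_tail { struct wfcq_node *p; };

static void queue_init(struct wfcq_head *head, struct wfcq_tail *tail)
{
	head->node.next = NULL;
	tail->p = &head->node;
}

static void enqueue_sketch(struct wfcq_head *head, struct wfcq_tail *tail,
			   struct wfcq_node *node)
{
	(void)head;
	node->next = NULL;
	tail->p->next = node;	/* kernel version: xchg on tail, then store */
	tail->p = node;
}

static struct wfcq_node *dequeue_sketch(struct wfcq_head *head,
					struct wfcq_tail *tail)
{
	struct wfcq_node *node, *next;

	/* ___wfcq_node_sync_next(&head->node): busy-waits if a concurrent
	 * enqueuer has swapped the tail but not yet linked ->next. */
	node = head->node.next;
	if (!node)
		return NULL;		/* queue empty */

	/* next = CMM_LOAD_SHARED(node->next): NULL only for the last
	 * node of the queue -- taken at most once per batch. */
	next = node->next;
	if (!next)
		tail->p = &head->node;	/* queue becomes empty */

	head->node.next = next;		/* a simple store */
	/* smp_read_barrier_depends() here: no-op except on Alpha */
	return node;
}
```

The point of the sketch is how short the per-node fast path is: one load, one NULL check, one store — the barriers cost essentially nothing on most architectures.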
> > Another approach could be to let you work directly on the shared queue:
> >
> > I could possibly implement a
> >
> > void __wfcq_snapshot(struct wfcq_head *head,
> > 		struct wfcq_tail *tail);
> >
> > That would save a tail snapshot that would then be used to stop
> > iteration, dequeue and splice at the location of the tail snapshot. And
> >
> > void __wfcq_snapshot_reset(struct wfcq_head *head,
> > 		struct wfcq_tail *tail);
> >
> > would set the tail snapshot pointer back to NULL.
> >
> > This would require a few extra checks, but nothing very expensive I
> > expect.
> >
> > Thoughts ?
>
> I'm not sure I follow, would using it be something like this?
>
> snapshot
> iterate (read-only, no dequeue)
> splice(discard_head, discard_tail, shared_head, iter_stop_point)
> snapshot_reset

My idea was more like:

snapshot
__wfcq_dequeue until returns NULL or reach max
snapshot_reset

But I would really prefer the splice approach unless there is a very
significant performance gain to do otherwise, because I don't want to
clutter the API with various ways of performing the same thing
needlessly.

> I also use an atomic state variable to prevent an item from being queued
> twice, and I unset that while iterating+dequeueing.
>
> I change the
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
* Peter Hurley (pe...@hurleysoftware.com) wrote:
> On Mon, 2013-03-11 at 17:36 -0400, Mathieu Desnoyers wrote:
> > +/*
> > + * Do not put head and tail on the same cache-line if concurrent
> > + * enqueue/dequeue are expected from many CPUs. This eliminates
> > + * false-sharing between enqueue and dequeue.
> > + */
> > +struct wfcq_head {
> > +	struct wfcq_node node;
> > +	struct mutex lock;
> > +};
> > +
> > +struct wfcq_tail {
> > +	struct wfcq_node *p;
> > +};
>
> If you want to force separate cachelines for SMP, this would be
>
> struct wfcq_head {
> 	struct wfcq_node node;
> 	struct mutex lock;
> } ____cacheline_aligned_in_smp;
>
> struct wfcq_tail {
> 	struct wfcq_node *p;
> } ____cacheline_aligned_in_smp;

Well, the thing is: I don't want to force it. The queue head and tail
can be used in a few ways:

1) tail used by frequent enqueue on one CPU, head used for frequent
   dequeues on another CPU. In this case, we want head/tail on
   different cache lines.

2) same scenario as 1), but head and tail are placed in per-cpu data
   of the two CPUs. We don't need to align each structure explicitly.

3) tail and head used locally, e.g. on the stack, as splice destination
   followed by iteration or dequeue from the same CPU. We don't want to
   waste precious memory space in this case, so no alignment.

So as you see, only (1) actually requires explicit alignment of head
and tail.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
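For case (1), the caller — rather than the queue header — can opt into the padding. Below is a hedged userspace sketch of that idea: the 64-byte line size is an assumption, the wrapper names are mine, and a kernel version would use ____cacheline_aligned_in_smp (which compiles away on !SMP) instead of a raw attribute:

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64	/* assumed L1 cache-line size */

struct wfcq_node { struct wfcq_node *next; };

/* Padded wrappers for the cross-CPU case only; the plain wfcq_head /
 * wfcq_tail stay compact for per-cpu (case 2) and on-stack (case 3)
 * use, exactly as argued above. */
struct padded_head {
	struct wfcq_node node;
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct padded_tail {
	struct wfcq_node *p;
} __attribute__((aligned(CACHE_LINE_SIZE)));

struct cross_cpu_queue {
	struct padded_head head;	/* touched by the dequeuing CPU */
	struct padded_tail tail;	/* touched by the enqueuing CPU */
};
```

The alignment attribute pads each wrapper to a full line, so head and tail can never share a cache line and enqueuers never invalidate the dequeuer's line.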
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
On Mon, 2013-03-11 at 17:36 -0400, Mathieu Desnoyers wrote:
> +/*
> + * Do not put head and tail on the same cache-line if concurrent
> + * enqueue/dequeue are expected from many CPUs. This eliminates
> + * false-sharing between enqueue and dequeue.
> + */
> +struct wfcq_head {
> +	struct wfcq_node node;
> +	struct mutex lock;
> +};
> +
> +struct wfcq_tail {
> +	struct wfcq_node *p;
> +};

If you want to force separate cachelines for SMP, this would be

struct wfcq_head {
	struct wfcq_node node;
	struct mutex lock;
} ____cacheline_aligned_in_smp;

struct wfcq_tail {
	struct wfcq_node *p;
} ____cacheline_aligned_in_smp;
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
Mathieu Desnoyers wrote:
> * Eric Wong (normalper...@yhbt.net) wrote:
> > Mathieu Desnoyers wrote:
> > > The advantage of using splice() over dequeue() is that you will reduce
> > > the amount of interactions between concurrent enqueue and dequeue
> > > operations on the head and tail of the same queue.
> >
> > I wanted to use splice here originally, but epoll_wait(2) may not
> > consume the entire queue, as it's limited to maxevents specified by the
> > user calling epoll_wait.
>
> I see,
>
> > With unconsumed elements, I need to preserve ordering of the queue to
> > avoid starvation. So I would either need to:
> >
> > a) splice the unconsumed portion back to the head of the shared queue.
> >    I'm not sure if this is possible while elements are being enqueued...
>
> That would be a double-ended queue. I haven't thought this problem
> through yet.
>
> >    Using a mutex for splicing back to unconsumed elements is OK, and
> >    probably required anyways since we need to keep EPOLLONESHOT
> >    unmasking synced with dequeue.
> >
> > b) preserve the unconsumed spliced queue across epoll_wait invocations
> >    but that requires checking both queues for event availability...
>
> I think b) should be preferred over a).
>
> Basically, instead of having the "unconsumed element" queue on the stack
> as I suggested, you might want to keep it across epoll_wait invocations.
>
> Before you start digging in the unconsumed element queue, you just start
> by splicing the content of the shared queue into the tail of the
> unconsumed queue. Then, you simply dequeue the unconsumed queue elements
> until you either reach the end or the max nr.
>
> You should note that with this scheme, you'll have to dequeue items from
> the unconsumed queue rather than just iterating over the elements. The
> nice side of consuming all the elements (in the temp queue on the local
> stack) is that you don't care about dequeuing, since the entire queue
> will vanish. However, in this case, since you care about keeping the
> queue after a partial iteration, you need to dequeue from it.
>
> And yes, this approach involves checking both queues for event
> availability. Hopefully none of this will be too much of an issue
> performance-wise.

Right. I will try this, I don't think the check will be too expensive.

When dequeuing from the unconsumed queue, perhaps there should be a
"dequeue_local" function which omits the normal barriers required
for the shared queue.

With a splice and without needing barriers for iteration, this sounds good.

> Another approach could be to let you work directly on the shared queue:
>
> I could possibly implement a
>
> void __wfcq_snapshot(struct wfcq_head *head,
> 		struct wfcq_tail *tail);
>
> That would save a tail snapshot that would then be used to stop
> iteration, dequeue and splice at the location of the tail snapshot. And
>
> void __wfcq_snapshot_reset(struct wfcq_head *head,
> 		struct wfcq_tail *tail);
>
> would set the tail snapshot pointer back to NULL.
>
> This would require a few extra checks, but nothing very expensive I
> expect.
>
> Thoughts ?

I'm not sure I follow, would using it be something like this?

snapshot
iterate (read-only, no dequeue)
splice(discard_head, discard_tail, shared_head, iter_stop_point)
snapshot_reset

I also use an atomic state variable to prevent an item from being queued
twice, and I unset that while iterating+dequeueing.

I change the state to IDLE while iterating; if ep_poll_callback sets the
state to READY on an item before it is spliced out, the event would be
lost when it's spliced out. So I think dequeueing and setting the state
to IDLE while iterating is required for epoll_wait. I'll try the
private queue.
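The atomic state variable Eric mentions can be modeled with a compare-and-swap: an item is enqueued only on an IDLE to READY transition, so it can never sit in the queue twice. This is a userspace C11 sketch of the pattern (the names are mine, not taken from the epoll patch, and the real code runs in the kernel with its own atomics):

```c
#include <assert.h>
#include <stdatomic.h>

enum item_state { ITEM_IDLE, ITEM_READY };

struct item {
	atomic_int state;
	/* struct wfcq_node node; ...queue linkage would go here */
};

/* Called from the wakeup path (ep_poll_callback in the real code):
 * returns 1 if the caller won the IDLE -> READY race and must
 * enqueue the item, 0 if it is already queued. */
static int item_mark_ready(struct item *it)
{
	int expected = ITEM_IDLE;

	return atomic_compare_exchange_strong(&it->state, &expected,
					      ITEM_READY);
}

/* Called while iterating+dequeueing: drop back to IDLE so a new
 * event re-queues the item instead of being silently lost. */
static void item_mark_idle(struct item *it)
{
	atomic_store(&it->state, ITEM_IDLE);
}
```

This is exactly why the dequeue-and-mark-IDLE step must happen during iteration: once the consumer owns the item, a concurrent wakeup flips it back to READY and re-enqueues it rather than racing with the splice.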
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers wrote:
> > * Eric Wong (normalper...@yhbt.net) wrote:
> > > Mathieu Desnoyers wrote:
> > > > +/*
> > > > + * Load data from shared memory.
> > > > + */
> > > > +#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
> > >
> > > When iterating through the queue by dequeueing, I needed a way
> > > to get the tail at the start of the iteration and use that as
> > > a sentinel while iterating, so I access the tail like this:
> > >
> > > struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > >
> > > I hope this is supported... it seems to work :)
> >
> > Ideally it would be good if users could try using the exposed APIs to do
> > these things, or if it's really needed, maybe it's a sign that we need
> > to extend the API.
>
> Right. If I can use splice, I will not need this. more comments below
> on splice...
>
> > > Unlike most queue users, I need to stop iteration to prevent the same
> > > item from appearing in the events returned by epoll_wait; since a
> > > dequeued item may appear back in the wfcqueue immediately.
> >
> > I think your use-case is very similar to our urcu call_rcu
> > implementation. I would recommend to use wfcq in the following way:
> >
> > When you want to dequeue, define, on the stack:
> >
> > struct wfcq_head qhead;
> > struct wfcq_tail qtail;
> > struct wfcq_node *node, *n;
> > enum wfcq_ret ret;
> >
> > wfcq_init(&qhead, &qtail);
> >
> > /*
> >  * Then use splice to move the entire source queue into the local queue.
> >  * Don't forget to grab the appropriate mutexes for epoll_q here.
> >  */
> > ret = __wfcq_splice(&qhead, &qtail, epoll_q_head, epoll_q_tail);
> >
> > switch (ret) {
> > case WFCQ_RET_SRC_EMPTY:
> > 	return -ENOENT;	/* No events to handle */
> > case WFCQ_RET_DEST_EMPTY:
> > case WFCQ_RET_DEST_NON_EMPTY:
> > 	break;
> > }
> >
> > /*
> >  * From this point, you can release the epoll_q lock and simply iterate
> >  * on the local queue using __wfcq_for_each() or __wfcq_for_each_safe()
> >  * if you need to free the nodes at the same time.
> >  */
> > __wfcq_for_each_safe(&qhead, &qtail, node, n) {
> > 	...
> > }
> >
> > The advantage of using splice() over dequeue() is that you will reduce
> > the amount of interactions between concurrent enqueue and dequeue
> > operations on the head and tail of the same queue.
>
> I wanted to use splice here originally, but epoll_wait(2) may not
> consume the entire queue, as it's limited to maxevents specified by the
> user calling epoll_wait.

I see,

> With unconsumed elements, I need to preserve ordering of the queue to
> avoid starvation. So I would either need to:
>
> a) splice the unconsumed portion back to the head of the shared queue.
>    I'm not sure if this is possible while elements are being enqueued...

That would be a double-ended queue. I haven't thought this problem
through yet.

>    Using a mutex for splicing back to unconsumed elements is OK, and
>    probably required anyways since we need to keep EPOLLONESHOT
>    unmasking synced with dequeue.
>
> b) preserve the unconsumed spliced queue across epoll_wait invocations
>    but that requires checking both queues for event availability...

I think b) should be preferred over a).

Basically, instead of having the "unconsumed element" queue on the stack
as I suggested, you might want to keep it across epoll_wait invocations.

Before you start digging in the unconsumed element queue, you just start
by splicing the content of the shared queue into the tail of the
unconsumed queue. Then, you simply dequeue the unconsumed queue elements
until you either reach the end or the max nr.

You should note that with this scheme, you'll have to dequeue items from
the unconsumed queue rather than just iterating over the elements. The
nice side of consuming all the elements (in the temp queue on the local
stack) is that you don't care about dequeuing, since the entire queue
will vanish. However, in this case, since you care about keeping the
queue after a partial iteration, you need to dequeue from it.

And yes, this approach involves checking both queues for event
availability. Hopefully none of this will be too much of an issue
performance-wise.

Another approach could be to let you work directly on the shared queue:

I could possibly implement a

void __wfcq_snapshot(struct wfcq_head *head,
		struct wfcq_tail *tail);

That would save a tail snapshot that would then be used to stop
iteration, dequeue and splice at the location of the tail snapshot. And

void __wfcq_snapshot_reset(struct wfcq_head *head,
		struct wfcq_tail *tail);

would set the tail snapshot pointer back to NULL.

This would require a few extra checks, but nothing very expensive I
expect.

Thoughts?

> > I'll send a new patch version soon,
> >
> > Thanks for your comments!
>
> Thank you for this data structure! I will update my patch tomorrow or
> the day after and do more testing.
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
Mathieu Desnoyers wrote:
> * Eric Wong (normalper...@yhbt.net) wrote:
> > Mathieu Desnoyers wrote:
> > > +/*
> > > + * Load data from shared memory.
> > > + */
> > > +#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
> >
> > When iterating through the queue by dequeueing, I needed a way
> > to get the tail at the start of the iteration and use that as
> > a sentinel while iterating, so I access the tail like this:
> >
> > struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> >
> > I hope this is supported... it seems to work :)
>
> Ideally it would be good if users could try using the exposed APIs to do
> these things, or if it's really needed, maybe it's a sign that we need
> to extend the API.

Right. If I can use splice, I will not need this. more comments below
on splice...

> > Unlike most queue users, I need to stop iteration to prevent the same
> > item from appearing in the events returned by epoll_wait; since a
> > dequeued item may appear back in the wfcqueue immediately.
>
> I think your use-case is very similar to our urcu call_rcu
> implementation. I would recommend to use wfcq in the following way:
>
> When you want to dequeue, define, on the stack:
>
> struct wfcq_head qhead;
> struct wfcq_tail qtail;
> struct wfcq_node *node, *n;
> enum wfcq_ret ret;
>
> wfcq_init(&qhead, &qtail);
>
> /*
>  * Then use splice to move the entire source queue into the local queue.
>  * Don't forget to grab the appropriate mutexes for epoll_q here.
>  */
> ret = __wfcq_splice(&qhead, &qtail, epoll_q_head, epoll_q_tail);
>
> switch (ret) {
> case WFCQ_RET_SRC_EMPTY:
> 	return -ENOENT;	/* No events to handle */
> case WFCQ_RET_DEST_EMPTY:
> case WFCQ_RET_DEST_NON_EMPTY:
> 	break;
> }
>
> /*
>  * From this point, you can release the epoll_q lock and simply iterate
>  * on the local queue using __wfcq_for_each() or __wfcq_for_each_safe()
>  * if you need to free the nodes at the same time.
>  */
> __wfcq_for_each_safe(&qhead, &qtail, node, n) {
> 	...
> }
>
> The advantage of using splice() over dequeue() is that you will reduce
> the amount of interactions between concurrent enqueue and dequeue
> operations on the head and tail of the same queue.

I wanted to use splice here originally, but epoll_wait(2) may not
consume the entire queue, as it's limited to maxevents specified by the
user calling epoll_wait.

With unconsumed elements, I need to preserve ordering of the queue to
avoid starvation. So I would either need to:

a) splice the unconsumed portion back to the head of the shared queue.
   I'm not sure if this is possible while elements are being enqueued...

   Using a mutex for splicing back to unconsumed elements is OK, and
   probably required anyways since we need to keep EPOLLONESHOT
   unmasking synced with dequeue.

b) preserve the unconsumed spliced queue across epoll_wait invocations
   but that requires checking both queues for event availability...

> I'll send a new patch version soon,
>
> Thanks for your comments!

Thank you for this data structure! I will update my patch tomorrow or
the day after and do more testing.
[RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation (v2)
Ported to the Linux kernel from the Userspace RCU library, at commit
108a92e5b97ee91b2b902dba2dd2e78aab42f420.

Ref: http://git.lttng.org/userspace-rcu.git

It is provided as a starting point only. Test cases should be ported
from Userspace RCU to kernel space and thoroughly run on a wide range of
architectures before considering this port production-ready.

Changelog since v1:
- Remove the internal mutex and helpers. Let the caller handle it.
- Disable preemption within the append operation, thus removing sleep
  from dequeue, therefore removing the blocking/nonblocking API
  distinction.

Signed-off-by: Mathieu Desnoyers
CC: Lai Jiangshan
CC: Paul E. McKenney
CC: Stephen Hemminger
CC: Davide Libenzi
CC: Eric Wong
---
 include/linux/wfcqueue.h | 444 +++
 1 file changed, 444 insertions(+)

Index: linux/include/linux/wfcqueue.h
===
--- /dev/null
+++ linux/include/linux/wfcqueue.h
@@ -0,0 +1,444 @@
+#ifndef _LINUX_WFCQUEUE_H
+#define _LINUX_WFCQUEUE_H
+
+/*
+ * linux/wfcqueue.h
+ *
+ * Concurrent Queue with Wait-Free Enqueue/Busy-Waiting Dequeue
+ *
+ * Copyright 2010-2013 - Mathieu Desnoyers
+ * Copyright 2011-2012 - Lai Jiangshan
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/*
+ * Concurrent Queue with Wait-Free Enqueue/Busy-Waiting Dequeue
+ *
+ * This queue has been designed and implemented collaboratively by
+ * Mathieu Desnoyers and Lai Jiangshan. Inspired from
+ * half-wait-free/half-blocking queue implementation done by Paul E.
+ * McKenney.
+ *
+ * Mutual exclusion of wfcq_* / __wfcq_* API
+ *
+ * Synchronization table:
+ *
+ * External synchronization techniques described in the API below are
+ * required between pairs marked with "X". No external synchronization
+ * is required between pairs marked with "-".
+ *
+ * Legend:
+ * [1] wfcq_enqueue
+ * [2] __wfcq_splice (destination queue)
+ * [3] __wfcq_dequeue
+ * [4] __wfcq_splice (source queue)
+ * [5] __wfcq_first
+ * [6] __wfcq_next
+ *
+ *     [1] [2] [3] [4] [5] [6]
+ * [1]  -   -   -   -   -   -
+ * [2]  -   -   -   -   -   -
+ * [3]  -   -   X   X   X   X
+ * [4]  -   -   X   -   X   X
+ * [5]  -   -   X   X   -   -
+ * [6]  -   -   X   X   -   -
+ *
+ * Besides locking, mutual exclusion of dequeue, splice and iteration
+ * can be ensured by performing all of those operations from a single
+ * thread, without requiring any lock.
+ */
+
+/*
+ * Load data from shared memory.
+ */
+#define CMM_LOAD_SHARED(p)	ACCESS_ONCE(p)
+
+/*
+ * Identify a shared store.
+ */
+#define CMM_STORE_SHARED(x, v)	({ ACCESS_ONCE(x) = (v); })
+
+enum wfcq_ret {
+	WFCQ_RET_DEST_EMPTY =		0,
+	WFCQ_RET_DEST_NON_EMPTY =	1,
+	WFCQ_RET_SRC_EMPTY =		2,
+};
+
+struct wfcq_node {
+	struct wfcq_node *next;
+};
+
+/*
+ * Do not put head and tail on the same cache-line if concurrent
+ * enqueue/dequeue are expected from many CPUs. This eliminates
+ * false-sharing between enqueue and dequeue.
+ */
+struct wfcq_head {
+	struct wfcq_node node;
+};
+
+struct wfcq_tail {
+	struct wfcq_node *p;
+};
+
+/*
+ * wfcq_node_init: initialize wait-free queue node.
+ */
+static inline void wfcq_node_init(struct wfcq_node *node)
+{
+	node->next = NULL;
+}
+
+/*
+ * wfcq_init: initialize wait-free queue.
+ */
+static inline void wfcq_init(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	/* Set queue head and tail */
+	wfcq_node_init(&head->node);
+	tail->p = &head->node;
+}
+
+/*
+ * wfcq_empty: return whether wait-free queue is empty.
+ *
+ * No memory barrier is issued. No mutual exclusion is required.
+ *
+ * We perform the test on head->node.next to check if the queue is
+ * possibly empty, but we confirm this by checking if the tail pointer
+ * points to the head node, because the tail pointer is the
+ * linearisation point of the enqueuers. Just checking the head next
+ * pointer could make a queue appear empty if an enqueuer is preempted
+ * for a long time between xchg() and setting the previous node's next
+ * pointer.
+ */
+static inline bool wfcq_empty(struct wfcq_head *head,
+		str
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers wrote:
> > Ported to the Linux kernel from Userspace RCU library, at commit
> > 108a92e5b97ee91b2b902dba2dd2e78aab42f420.
> >
> > Ref: http://git.lttng.org/userspace-rcu.git
> >
> > It is provided as a starting point only. Test cases should be ported
> > from Userspace RCU to kernel space and thoroughly run on a wide range
> > of architectures before considering this port production-ready.
>
> Thanks, this seems to work. Will post an early epoll patch in a minute.
>
> Minor comments below.
>
> > +/*
> > + * Load data from shared memory.
> > + */
> > +#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)
>
> When iterating through the queue by dequeueing, I needed a way
> to get the tail at the start of the iteration and use that as
> a sentinel while iterating, so I access the tail like this:
>
>	struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
>
> I hope this is supported... it seems to work :)

Ideally it would be good if users could use the exposed APIs to do
these things; or, if it's really needed, maybe it's a sign that we
need to extend the API.

> Unlike most queue users, I need to stop iteration to prevent the same
> item from appearing in the events returned by epoll_wait; since a
> dequeued item may appear back in the wfcqueue immediately.

I think your use-case is very similar to our urcu call_rcu
implementation. I would recommend using wfcq in the following way:

When you want to dequeue, define, on the stack:

	struct wfcq_head qhead;
	struct wfcq_tail qtail;
	struct wfcq_node *node, *n;
	enum wfcq_ret ret;

	wfcq_init(&qhead, &qtail);

	/*
	 * Then use splice to move the entire source queue into the local
	 * queue. Don't forget to grab the appropriate mutexes for epoll_q
	 * here.
	 */
	ret = __wfcq_splice(&qhead, &qtail, epoll_q_head, epoll_q_tail);
	switch (ret) {
	case WFCQ_RET_SRC_EMPTY:
		return -ENOENT;	/* No events to handle */
	case WFCQ_RET_DEST_EMPTY:
	case WFCQ_RET_DEST_NON_EMPTY:
		break;
	}

	/*
	 * From this point, you can release the epoll_q lock and simply
	 * iterate on the local queue using __wfcq_for_each() or
	 * __wfcq_for_each_safe() if you need to free the nodes at the
	 * same time.
	 */
	__wfcq_for_each_safe(&qhead, &qtail, node, n) {
		...
	}

The advantage of using splice() over dequeue() is that you will reduce
the amount of interaction between concurrent enqueue and dequeue
operations on the head and tail of the same queue.

> > +struct wfcq_head {
> > +	struct wfcq_node node;
> > +	struct mutex lock;
> > +};
>
> I'm not using this lock at all since I already have ep->mtx (which also
> protects the ep->rbr). Perhaps it should not be included; normal linked
> lists and most data structures I see in the kernel do not provide their
> own locks, either.

Good point. The Linux kernel habit is to leave locks outside of those
structures whenever possible, so users can pick the right lock for
their needs. I will remove the lock, and remove all the "helpers" that
use it as well.

> > +static inline void wfcq_init(struct wfcq_head *head,
> > +		struct wfcq_tail *tail)
> > +{
> > +	/* Set queue head and tail */
> > +	wfcq_node_init(&head->node);
> > +	tail->p = &head->node;
> > +	mutex_init(&head->lock);
> > +}
>
> There's no corresponding mutex_destroy, so I'm just destroying it
> myself...

Since I am removing the lock, this becomes a non-issue.

After thinking about it, I plan to disable preemption around the xchg
and store within __wfcq_append. This is a luxury we have at kernel
level that we don't have in user space, so it might be good to use it.
This will allow me to remove the 10ms sleep from the adaptive
busy-wait in ___wfcq_busy_wait() and replace it with a simple
busy-wait. This will eliminate those possible latency peaks from the
dequeue/splice/iteration side. It's a shame to have the possibility of
a 10ms sleep due to preemption within a 2-instruction window.
preempt_disable() will fix this, while adding very low overhead to the
enqueue path on preemptible kernels.

I'll send a new patch version soon.

Thanks for your comments!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
Mathieu Desnoyers wrote:
> Ported to the Linux kernel from Userspace RCU library, at commit
> 108a92e5b97ee91b2b902dba2dd2e78aab42f420.
>
> Ref: http://git.lttng.org/userspace-rcu.git
>
> It is provided as a starting point only. Test cases should be ported
> from Userspace RCU to kernel space and thoroughly run on a wide range
> of architectures before considering this port production-ready.

Thanks, this seems to work. Will post an early epoll patch in a minute.

Minor comments below.

> +/*
> + * Load data from shared memory.
> + */
> +#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)

When iterating through the queue by dequeueing, I needed a way to get
the tail at the start of the iteration and use that as a sentinel
while iterating, so I access the tail like this:

	struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);

I hope this is supported... it seems to work :)

Unlike most queue users, I need to stop iteration to prevent the same
item from appearing in the events returned by epoll_wait, since a
dequeued item may appear back in the wfcqueue immediately.

> +struct wfcq_head {
> +	struct wfcq_node node;
> +	struct mutex lock;
> +};

I'm not using this lock at all since I already have ep->mtx (which also
protects the ep->rbr). Perhaps it should not be included; normal linked
lists and most data structures I see in the kernel do not provide their
own locks, either.

> +static inline void wfcq_init(struct wfcq_head *head,
> +		struct wfcq_tail *tail)
> +{
> +	/* Set queue head and tail */
> +	wfcq_node_init(&head->node);
> +	tail->p = &head->node;
> +	mutex_init(&head->lock);
> +}

There's no corresponding mutex_destroy, so I'm just destroying it
myself...
[RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation
Ported to the Linux kernel from Userspace RCU library, at commit
108a92e5b97ee91b2b902dba2dd2e78aab42f420.

Ref: http://git.lttng.org/userspace-rcu.git

It is provided as a starting point only. Test cases should be ported
from Userspace RCU to kernel space and thoroughly run on a wide range
of architectures before considering this port production-ready.

Signed-off-by: Mathieu Desnoyers
CC: Lai Jiangshan
CC: Paul E. McKenney
CC: Stephen Hemminger
CC: Davide Libenzi
CC: Eric Wong
---
 include/linux/wfcqueue.h | 642 +++
 1 file changed, 642 insertions(+)

Index: linux/include/linux/wfcqueue.h
===
--- /dev/null
+++ linux/include/linux/wfcqueue.h
@@ -0,0 +1,642 @@
+#ifndef _LINUX_WFCQUEUE_H
+#define _LINUX_WFCQUEUE_H
+
+/*
+ * linux/wfcqueue.h
+ *
+ * Concurrent Queue with Wait-Free Enqueue/Blocking Dequeue
+ *
+ * Copyright 2010-2013 - Mathieu Desnoyers
+ * Copyright 2011-2012 - Lai Jiangshan
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
+ * 02110-1301 USA
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+/*
+ * Concurrent Queue with Wait-Free Enqueue/Blocking Dequeue
+ *
+ * This queue has been designed and implemented collaboratively by
+ * Mathieu Desnoyers and Lai Jiangshan. Inspired from the
+ * half-wait-free/half-blocking queue implementation done by Paul E.
+ * McKenney.
+ *
+ * Mutual exclusion of wfcq_* / __wfcq_* API
+ *
+ * Synchronization table:
+ *
+ * External synchronization techniques described in the API below are
+ * required between pairs marked with "X". No external synchronization
+ * is required between pairs marked with "-".
+ *
+ * Legend:
+ * [1] wfcq_enqueue
+ * [2] __wfcq_splice (destination queue)
+ * [3] __wfcq_dequeue
+ * [4] __wfcq_splice (source queue)
+ * [5] __wfcq_first
+ * [6] __wfcq_next
+ *
+ *     [1] [2] [3] [4] [5] [6]
+ * [1]  -   -   -   -   -   -
+ * [2]  -   -   -   -   -   -
+ * [3]  -   -   X   X   X   X
+ * [4]  -   -   X   -   X   X
+ * [5]  -   -   X   X   -   -
+ * [6]  -   -   X   X   -   -
+ *
+ * Mutual exclusion can be ensured by holding wfcq_dequeue_lock().
+ *
+ * For convenience, wfcq_dequeue_blocking() and
+ * wfcq_splice_blocking() hold the dequeue lock.
+ *
+ * Besides locking, mutual exclusion of dequeue, splice and iteration
+ * can be ensured by performing all of those operations from a single
+ * thread, without requiring any lock.
+ */
+
+/*
+ * Load data from shared memory.
+ */
+#define CMM_LOAD_SHARED(p)		ACCESS_ONCE(p)
+
+/*
+ * Identify a shared store.
+ */
+#define CMM_STORE_SHARED(x, v)	({ ACCESS_ONCE(x) = (v); })
+
+#define WFCQ_WOULDBLOCK		((void *) -1UL)
+#define WFCQ_ADAPT_ATTEMPTS	10	/* Retry if being set */
+#define WFCQ_WAIT		10	/* Wait 10 ms if being set */
+
+enum wfcq_ret {
+	WFCQ_RET_WOULDBLOCK	= -1,
+	WFCQ_RET_DEST_EMPTY	= 0,
+	WFCQ_RET_DEST_NON_EMPTY	= 1,
+	WFCQ_RET_SRC_EMPTY	= 2,
+};
+
+struct wfcq_node {
+	struct wfcq_node *next;
+};
+
+/*
+ * Do not put head and tail on the same cache-line if concurrent
+ * enqueue/dequeue are expected from many CPUs. This eliminates
+ * false-sharing between enqueue and dequeue.
+ */
+struct wfcq_head {
+	struct wfcq_node node;
+	struct mutex lock;
+};
+
+struct wfcq_tail {
+	struct wfcq_node *p;
+};
+
+/*
+ * wfcq_node_init: initialize wait-free queue node.
+ */
+static inline void wfcq_node_init(struct wfcq_node *node)
+{
+	node->next = NULL;
+}
+
+/*
+ * wfcq_init: initialize wait-free queue.
+ */
+static inline void wfcq_init(struct wfcq_head *head,
+		struct wfcq_tail *tail)
+{
+	/* Set queue head and tail */
+	wfcq_node_init(&head->node);
+	tail->p = &head->node;
+	mutex_init(&head->lock);
+}
+
+/*
+ * wfcq_empty: return whether wait-free queue is empty.
+ *
+ * No memory barrier is issued. No mutual exclusion is required.
+ *
+ * We perform the test on head->node.next to check if the queue is
+ * possibly empty, but we confirm this by checking if the tail pointer
+ * points to the head node, because the tail pointer is the
+ * linearisation point of the enqueuers. Just che