Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-04-11 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers  wrote:
> > It is provided as a starting point only. Test cases should be ported
> from Userspace RCU to kernel space and thoroughly run on a wide range of
> > architectures before considering this port production-ready.
> 
> Hi Mathieu, any progress on this front?

Nope, I've been busy on other things.

> I'm not sure how to approach
> test cases in the kernel; perhaps taking hints from locking_selftest()
> would be a start.

I guess creating a wfcq_selftest() derived from userspace RCU
test_urcu_wfcq.c might be a good start.
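
A minimal sketch of what that could look like, loosely following the
structure of test_urcu_wfcq.c (everything here, names, thread counts and
the stress duration, is an illustrative placeholder, not a tested design):

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/slab.h>
#include <linux/wfcqueue.h>

struct wfcq_test_node {
	struct wfcq_node node;
};

static struct wfcq_head test_head;
static struct wfcq_tail test_tail;
/* Dequeue [3] vs dequeue [3] requires external synchronization. */
static DEFINE_MUTEX(test_dequeue_mutex);

static int wfcq_test_enqueuer(void *arg)
{
	struct wfcq_test_node *tn;

	while (!kthread_should_stop()) {
		tn = kmalloc(sizeof(*tn), GFP_KERNEL);
		if (tn) {
			wfcq_node_init(&tn->node);
			wfcq_enqueue(&test_head, &test_tail, &tn->node);
		}
		cond_resched();
	}
	return 0;
}

static int wfcq_test_dequeuer(void *arg)
{
	struct wfcq_node *node;

	while (!kthread_should_stop()) {
		mutex_lock(&test_dequeue_mutex);
		node = __wfcq_dequeue(&test_head, &test_tail);
		mutex_unlock(&test_dequeue_mutex);
		if (node)
			kfree(container_of(node, struct wfcq_test_node,
					   node));
		cond_resched();
	}
	return 0;
}

static int __init wfcq_selftest(void)
{
	struct task_struct *enq, *deq;
	struct wfcq_node *node;

	wfcq_init(&test_head, &test_tail);
	/* Error handling elided for brevity. */
	enq = kthread_run(wfcq_test_enqueuer, NULL, "wfcq_test_enq");
	deq = kthread_run(wfcq_test_dequeuer, NULL, "wfcq_test_deq");
	msleep(1000);	/* arbitrary stress duration */
	kthread_stop(enq);
	kthread_stop(deq);

	/* Drain whatever the dequeuer did not consume. */
	while ((node = __wfcq_dequeue(&test_head, &test_tail)) != NULL)
		kfree(container_of(node, struct wfcq_test_node, node));
	return 0;
}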

> Fwiw, I've been running wfcqueue+epoll for a few weeks now on my
> personal machines without problems :)

Great!

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-04-11 Thread Eric Wong
Mathieu Desnoyers  wrote:
> It is provided as a starting point only. Test cases should be ported
> from Userspace RCU to kernel space and thoroughly run on a wide range of
> architectures before considering this port production-ready.

Hi Mathieu, any progress on this front?  I'm not sure how to approach
test cases in the kernel; perhaps taking hints from locking_selftest()
would be a start.

Fwiw, I've been running wfcqueue+epoll for a few weeks now on my
personal machines without problems :)


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-18 Thread Eric Wong
Mathieu Desnoyers  wrote:
> Thanks for providing this detailed scenario. I think there is an
> important aspect in the use of splice I suggested on which we are not
> fully understanding each other. I will annotate your scenario below with
> clarifications:

Ah yes, I somehow thought splice would overwrite its destination.
I've confirmed everything works as expected and have an updated
RFC coming.


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-16 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Eric Wong  wrote:
> > Mathieu Desnoyers  wrote:
> > > * Eric Wong (normalper...@yhbt.net) wrote:
> > > > Mathieu Desnoyers  wrote:
> > > > > +/*
> > > > > + * Load data from shared memory.
> > > > > + */
> > > > > +#define CMM_LOAD_SHARED(p)   ACCESS_ONCE(p)
> > > > 
> > > > When iterating through the queue by dequeueing, I needed a way
> > > > to get the tail at the start of the iteration and use that as
> > > > a sentinel while iterating, so I access the tail like this:
> > > > 
> > > > struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > > > 
> > > > I hope this is supported... it seems to work :)
> > > 
> > > Ideally it would be good if users could try using the exposed APIs to do
> > > these things, or if it's really needed, maybe it's a sign that we need
> > > to extend the API.
> > 
> > Right.  If I can use splice, I will not need this.  More comments below
> > on splice...
> 
> Even with splice, I think I need to see the main tail at the start of
> iteration to maintain compatibility (for weird apps that might care).

Thanks for providing this detailed scenario. I think there is an
important aspect in the use of splice I suggested on which we are not
fully understanding each other. I will annotate your scenario below with
clarifications:

> 
> Consider this scenario:
> 
>   1) main.queue has 20 events
> 
>   2) epoll_wait(maxevents=16) called by user
> 
>   3) splice all 20 events into unconsumed.queue, main.queue is empty
> 
>   4) put_user + dequeue on 16 events from unconsumed.queue
>  # unconsumed.queue has 4 left at this point
> 
>   5) main.queue gets several more events enqueued at any point after 3.

Let's suppose 11 events are enqueued into main.queue after 3.

> 
>   6) epoll_wait(maxevents=16) called by user again

Before 7), here is what should be done:

6.5) splice all new events from main.queue into unconsumed.queue.
 unconsumed.queue will now contain 4 + 11 = 15 events. Note
 that splice will preserve the right order of events within
 unconsumed.queue.

> 
>   7) put_user + dequeue on 4 remaining items in unconsumed.queue
> 
>  We can safely return 4 events back to the user at this point.

Step 7) will now return 15 events from unconsumed.queue.

> 
>  However, this might break compatibility for existing users.  I'm
>  not sure if there's any weird apps which know/expect the event
>  count they'll get from epoll_wait, but maybe there is one...

With the new step 6.5), I don't think the behavior will change compared
to what is already there.

> 
>   8) We could perform a splice off main.queue to fill the remaining
>  slots the user requested, but we do not know if the things we
>  splice from main.queue at this point were just dequeued in 7.
> 
>  If we loaded the main.queue.tail before 7, we could safely splice
>  into unconsumed.queue and know when to stop when repeating the
>  put_user + dequeue loop.

We can achieve the same thing by doing step 6.5) at the beginning of
epoll_wait(). It's important to do it at the beginning of epoll_wait for
the reason you discuss in 8): if you wait until you notice that
unconsumed.queue is empty before refilling it from main.queue, you won't
be able to know if the events in main.queue were added after the first
event was dequeued.

Step 6.5) should be performed each time upon entry into epoll_wait(). It
does not matter if unconsumed.queue would happen to have enough events
to fill in the maxevents request or not (and you don't want to iterate
on the unconsumed.queue needlessly to count them): you can just do a
O(1) splice from main.queue into unconsumed.queue, and your original
semantic should be preserved.
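
Sketched in code, the entry of epoll_wait() would then look roughly like
this (the queue and lock names are made up for the example; wakeup logic
and error handling are omitted):

int nr;

/* Step 6.5: O(1) move of everything enqueued so far behind the
 * leftovers, preserving order. */
mutex_lock(&ep->mtx);
(void) __wfcq_splice(&ep->unconsumed_head, &ep->unconsumed_tail,
		&ep->main_head, &ep->main_tail);

for (nr = 0; nr < maxevents; nr++) {
	struct wfcq_node *node;

	node = __wfcq_dequeue(&ep->unconsumed_head, &ep->unconsumed_tail);
	if (!node)
		break;
	/* ... put_user() the event associated with node ... */
}
mutex_unlock(&ep->mtx);
return nr;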

Thoughts?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-16 Thread Eric Wong
Eric Wong  wrote:
> Mathieu Desnoyers  wrote:
> > * Eric Wong (normalper...@yhbt.net) wrote:
> > > Mathieu Desnoyers  wrote:
> > > > +/*
> > > > + * Load data from shared memory.
> > > > + */
> > > > +#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
> > > 
> > > When iterating through the queue by dequeueing, I needed a way
> > > to get the tail at the start of the iteration and use that as
> > > a sentinel while iterating, so I access the tail like this:
> > > 
> > >   struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > > 
> > > I hope this is supported... it seems to work :)
> > 
> > Ideally it would be good if users could try using the exposed APIs to do
> > these things, or if it's really needed, maybe it's a sign that we need
> > to extend the API.
> 
> Right.  If I can use splice, I will not need this.  More comments below
> on splice...

Even with splice, I think I need to see the main tail at the start of
iteration to maintain compatibility (for weird apps that might care).

Consider this scenario:

  1) main.queue has 20 events

  2) epoll_wait(maxevents=16) called by user

  3) splice all 20 events into unconsumed.queue, main.queue is empty

  4) put_user + dequeue on 16 events from unconsumed.queue
 # unconsumed.queue has 4 left at this point

  5) main.queue gets several more events enqueued at any point after 3.

  6) epoll_wait(maxevents=16) called by user again

  7) put_user + dequeue on 4 remaining items in unconsumed.queue

 We can safely return 4 events back to the user at this point.

 However, this might break compatibility for existing users.  I'm
 not sure if there's any weird apps which know/expect the event
 count they'll get from epoll_wait, but maybe there is one...

  8) We could perform a splice off main.queue to fill the remaining
 slots the user requested, but we do not know if the things we
 splice from main.queue at this point were just dequeued in 7.

 If we loaded the main.queue.tail before 7, we could safely splice
 into unconsumed.queue and know when to stop when repeating the
 put_user + dequeue loop.


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers  wrote:
> > * Eric Wong (normalper...@yhbt.net) wrote:
> > > Mathieu Desnoyers  wrote:
> > > > The advantage of using splice() over dequeue() is that you will reduce
> > > > the amount of interactions between concurrent enqueue and dequeue
> > > > operations on the head and tail of the same queue.
> > > 
> > > I wanted to use splice here originally, but epoll_wait(2) may not
> > > consume the entire queue, as it's limited to maxevents specified by the
> > > user calling epoll_wait.
> > 
> > I see,
> > 
> > > 
> > > With unconsumed elements, I need to preserve ordering of the queue to
> > > avoid starvation.  So I would either need to:
> > > 
> > > a) splice the unconsumed portion back to the head of the shared queue.
> > >I'm not sure if this is possible while elements are being enqueued...
> > 
> > That would be a double-ended queue. I haven't thought this problem
> > through yet.
> > 
> > >Using a mutex for splicing back to unconsumed elements is OK, and
> > >probably required anyways since we need to keep EPOLLONESHOT
> > >unmasking synced with dequeue.
> > > 
> > > b) preserve the unconsumed spliced queue across epoll_wait invocations
> > >but that requires checking both queues for event availability...
> > 
> > I think b) should be preferred over a).
> > 
> > Basically, instead of having the "unconsumed element" queue on the stack
> > as I suggested, you might want to keep it across epoll_wait invocations.
> > 
> > Before you start digging in the unconsumed element queue, you just start
> > by splicing the content of the shared queue into the tail of unconsumed
> > queue. Then, you simply dequeue the unconsumed queue elements until you
> > either reach the end or the max nr.
> > 
> > You should note that with this scheme, you'll have to dequeue items from
> > the unconsumed queue rather than just iterating over the elements. The
> > nice side of consuming all the elements (in the temp queue on the local
> > stack) is that you don't care about dequeuing, since the entire queue
> > will vanish. However, in this case, since you care about keeping the
> > queue after a partial iteration, you need to dequeue from it.
> > 
> > And yes, this approach involves checking both queues for event
> > availability. Hopefully none of this will be too much of an issue
> > performance-wise.
> 
> Right.  I will try this, I don't think the check will be too expensive.
> 
> When dequeuing from the unconsumed queue, perhaps there should be a
> "dequeue_local" function which omits the normal barriers required
> for the shared queue.
> 
> With a splice and without needing barriers for iteration, this sounds good.

Well actually, __wfcq_dequeue() is really not that expensive. In terms
of synchronization, here is what it typically does:

node = ___wfcq_node_sync_next(&head->node);
  -> busy wait if node->next is NULL. This check is needed even if we
 work on a "local" queue, because the O(1) splice operation does not
 walk over every node to check for NULL next pointer: this is left
 to the dequeue/iteration operations.
if ((next = CMM_LOAD_SHARED(node->next)) == NULL) {
  -> only taken if we are getting the last node of the queue. Happens
 at most once per batch.
}

head->node.next = next;
  -> a simple store.

smp_read_barrier_depends();
  -> no-op on everything but Alpha.

return node;

So my recommendation would be to be careful before trying to come up
with flavors that remove barriers if those are not actually hurting the
fast-path significantly. By dequeue fast-path, I mean what needs to be
executed for dequeue of each node.

> 
> > Another approach could be to let you work directly on the shared queue:
> > 
> > I could possibly implement a
> > 
> > void __wfcq_snapshot(struct wfcq_head *head,
> > struct wfcq_tail *tail);
> > 
> > That would save a tail snapshot that would then be used to stop
> > iteration, dequeue and splice at the location of the tail snapshot. And
> > 
> > void __wfcq_snapshot_reset(struct wfcq_head *head,
> > struct wfcq_tail *tail);
> > 
> > would set the tail snapshot pointer back to NULL.
> > 
> > This would require a few extra checks, but nothing very expensive I
> > expect.
> > 
> > Thoughts?
> 
> I'm not sure I follow, would using it be something like this?
> 
>   snapshot
>   iterate (read-only, no dequeue)
>   splice(discard_head, discard_tail, shared_head, iter_stop_point)
>   snapshot_reset

My idea was more like:

snapshot
__wfcq_dequeue until returns NULL or reach max
snapshot_reset

But I would really prefer the splice approach unless there is a very
significant performance gain to do otherwise, because I don't want to
clutter the API with various ways of performing the same thing
needlessly.

> 
> I also use an atomic state variable to prevent an item from being queued
> twice, and I unset that while iterating+dequeueing.
> 
> I change the state to IDLE while iterating and ep_poll_callback sets
> state READY on any item before being spliced out, that would cause the
> event to be lost when it's spliced out.

Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Mathieu Desnoyers
* Peter Hurley (pe...@hurleysoftware.com) wrote:
> On Mon, 2013-03-11 at 17:36 -0400, Mathieu Desnoyers wrote:
> > +/*
> > + * Do not put head and tail on the same cache-line if concurrent
> > + * enqueue/dequeue are expected from many CPUs. This eliminates
> > + * false-sharing between enqueue and dequeue.
> > + */
> > +struct wfcq_head {
> > +   struct wfcq_node node;
> > +   struct mutex lock;
> > +};
> > +
> > +struct wfcq_tail {
> > +   struct wfcq_node *p;
> > +};
> 
> 
> If you want to force separate cachelines for SMP, this would be
> 
> struct wfcq_head {
>   struct wfcq_node node;
>   struct mutex lock;
> } ____cacheline_aligned_in_smp;
> 
> struct wfcq_tail {
>   struct wfcq_node *p;
> } ____cacheline_aligned_in_smp;

Well, the thing is: I don't want to force it. The queue head and tail
can be used in a few ways:

1) tail used by frequent enqueue on one CPU, head used for frequent
   dequeues on another CPU. In this case, we want head/tail on different
   cache lines.
2) same scenario as 1), but head and tail are placed in per-cpu data
   of the two CPUs. We don't need to align each structure explicitly.
3) tail and head used locally, e.g. on the stack, as splice destination
   followed by iteration or dequeue from the same CPU. We don't want to
   waste precious memory space in this case, so no alignment.

So as you see, only (1) actually requires explicit alignment of head and
tail.
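
If a particular user sits in case 1), nothing stops it from requesting
the alignment on its side, e.g. (illustrative sketch):

/* Caller-side alignment for scenario 1); wfcqueue itself stays
 * unpadded for scenarios 2) and 3). */
struct my_queue {
	struct wfcq_head head ____cacheline_aligned_in_smp;
	struct wfcq_tail tail ____cacheline_aligned_in_smp;
};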

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Peter Hurley
On Mon, 2013-03-11 at 17:36 -0400, Mathieu Desnoyers wrote:
> +/*
> + * Do not put head and tail on the same cache-line if concurrent
> + * enqueue/dequeue are expected from many CPUs. This eliminates
> + * false-sharing between enqueue and dequeue.
> + */
> +struct wfcq_head {
> + struct wfcq_node node;
> + struct mutex lock;
> +};
> +
> +struct wfcq_tail {
> + struct wfcq_node *p;
> +};


If you want to force separate cachelines for SMP, this would be

struct wfcq_head {
struct wfcq_node node;
struct mutex lock;
} ____cacheline_aligned_in_smp;

struct wfcq_tail {
struct wfcq_node *p;
} ____cacheline_aligned_in_smp;



Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Eric Wong
Mathieu Desnoyers  wrote:
> * Eric Wong (normalper...@yhbt.net) wrote:
> > Mathieu Desnoyers  wrote:
> > > The advantage of using splice() over dequeue() is that you will reduce
> > > the amount of interactions between concurrent enqueue and dequeue
> > > operations on the head and tail of the same queue.
> > 
> > I wanted to use splice here originally, but epoll_wait(2) may not
> > consume the entire queue, as it's limited to maxevents specified by the
> > user calling epoll_wait.
> 
> I see,
> 
> > 
> > With unconsumed elements, I need to preserve ordering of the queue to
> > avoid starvation.  So I would either need to:
> > 
> > a) splice the unconsumed portion back to the head of the shared queue.
> >I'm not sure if this is possible while elements are being enqueued...
> 
> That would be a double-ended queue. I haven't thought this problem
> through yet.
> 
> >Using a mutex for splicing back to unconsumed elements is OK, and
> >probably required anyways since we need to keep EPOLLONESHOT
> >unmasking synced with dequeue.
> > 
> > b) preserve the unconsumed spliced queue across epoll_wait invocations
> >but that requires checking both queues for event availability...
> 
> I think b) should be preferred over a).
> 
> Basically, instead of having the "unconsumed element" queue on the stack
> as I suggested, you might want to keep it across epoll_wait invocations.
> 
> Before you start digging in the unconsumed element queue, you just start
> by splicing the content of the shared queue into the tail of unconsumed
> queue. Then, you simply dequeue the unconsumed queue elements until you
> either reach the end or the max nr.
> 
> You should note that with this scheme, you'll have to dequeue items from
> the unconsumed queue rather than just iterating over the elements. The
> nice side of consuming all the elements (in the temp queue on the local
> stack) is that you don't care about dequeuing, since the entire queue
> will vanish. However, in this case, since you care about keeping the
> queue after a partial iteration, you need to dequeue from it.
> 
> And yes, this approach involves checking both queues for event
> availability. Hopefully none of this will be too much of an issue
> performance-wise.

Right.  I will try this, I don't think the check will be too expensive.

When dequeuing from the unconsumed queue, perhaps there should be a
"dequeue_local" function which omits the normal barriers required
for the shared queue.

With a splice and without needing barriers for iteration, this sounds good.

> Another approach could be to let you work directly on the shared queue:
> 
> I could possibly implement a
> 
> void __wfcq_snapshot(struct wfcq_head *head,
> struct wfcq_tail *tail);
> 
> That would save a tail snapshot that would then be used to stop
> iteration, dequeue and splice at the location of the tail snapshot. And
> 
> void __wfcq_snapshot_reset(struct wfcq_head *head,
> struct wfcq_tail *tail);
> 
> would set the tail snapshot pointer back to NULL.
> 
> This would require a few extra checks, but nothing very expensive I
> expect.
> 
> Thoughts?

I'm not sure I follow, would using it be something like this?

snapshot
iterate (read-only, no dequeue)
splice(discard_head, discard_tail, shared_head, iter_stop_point)
snapshot_reset

I also use an atomic state variable to prevent an item from being queued
twice, and I unset that while iterating+dequeueing.

I change the state to IDLE while iterating and ep_poll_callback sets
state READY on any item before being spliced out, that would cause the
event to be lost when it's spliced out.

So I think dequeue/state IDLE while iterating is required for epoll_wait,
I'll try the private queue.
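
Roughly, the guard I have looks like this (the EP_STATE_* values and the
field names here are illustrative, not the exact ones in my patch):

/* In ep_poll_callback: enqueue an item only on its IDLE -> READY
 * transition, so it can never sit in the queue twice. */
if (atomic_cmpxchg(&epi->state, EP_STATE_IDLE, EP_STATE_READY)
		== EP_STATE_IDLE)
	wfcq_enqueue(&ep->rdlhead, &ep->rdltail, &epi->rdllink);

/* In the epoll_wait loop: drop back to IDLE before reporting the
 * event, so a wakeup arriving while we copy it out can requeue it. */
atomic_set(&epi->state, EP_STATE_IDLE);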


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers  wrote:
> > * Eric Wong (normalper...@yhbt.net) wrote:
> > > Mathieu Desnoyers  wrote:
> > > > +/*
> > > > + * Load data from shared memory.
> > > > + */
> > > > +#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
> > > 
> > > When iterating through the queue by dequeueing, I needed a way
> > > to get the tail at the start of the iteration and use that as
> > > a sentinel while iterating, so I access the tail like this:
> > > 
> > >   struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > > 
> > > I hope this is supported... it seems to work :)
> > 
> > Ideally it would be good if users could try using the exposed APIs to do
> > these things, or if it's really needed, maybe it's a sign that we need
> > to extend the API.
> 
> Right.  If I can use splice, I will not need this.  More comments below
> on splice...
> 
> > > Unlike most queue users, I need to stop iteration to prevent the same
> > > item from appearing in the events returned by epoll_wait, since a
> > > dequeued item may appear back in the wfcqueue immediately.
> > 
> > I think your use-case is very similar to our urcu call_rcu
> > implementation. I would recommend to use wfcq in the following way:
> > 
> > When you want to dequeue, define, on the stack:
> > 
> > struct wfcq_head qhead;
> > struct wfcq_tail qtail;
> > struct wfcq_node *node, *n;
> > enum wfcq_ret ret;
> > 
> > wfcq_init(&qhead, &qtail);
> > 
> > /*
> >  * Then use splice to move the entire source queue into the local queue.
> >  * Don't forget to grab the appropriate mutexes for epoll_q here.
> >  */
> > ret = __wfcq_splice(&qhead, &qtail, epoll_q_head, epoll_q_tail);
> > 
> > switch (ret) {
> > case WFCQ_RET_SRC_EMPTY:
> > 	return -ENOENT; /* No events to handle */
> > case WFCQ_RET_DEST_EMPTY:
> > case WFCQ_RET_DEST_NON_EMPTY:
> > 	break;
> > }
> > 
> > /*
> >  * From this point, you can release the epoll_q lock and simply iterate
> >  * on the local queue using __wfcq_for_each() or __wfcq_for_each_safe()
> >  * if you need to free the nodes at the same time.
> >  */
> > __wfcq_for_each_safe(&qhead, &qtail, node, n) {
> > ...
> > }
> > 
> > The advantage of using splice() over dequeue() is that you will reduce
> > the amount of interactions between concurrent enqueue and dequeue
> > operations on the head and tail of the same queue.
> 
> I wanted to use splice here originally, but epoll_wait(2) may not
> consume the entire queue, as it's limited to maxevents specified by the
> user calling epoll_wait.

I see,

> 
> With unconsumed elements, I need to preserve ordering of the queue to
> avoid starvation.  So I would either need to:
> 
> a) splice the unconsumed portion back to the head of the shared queue.
>I'm not sure if this is possible while elements are being enqueued...

That would be a double-ended queue. I haven't thought this problem
through yet.

>Using a mutex for splicing back to unconsumed elements is OK, and
>probably required anyways since we need to keep EPOLLONESHOT
>unmasking synced with dequeue.
> 
> b) preserve the unconsumed spliced queue across epoll_wait invocations
>but that requires checking both queues for event availability...

I think b) should be preferred over a).

Basically, instead of having the "unconsumed element" queue on the stack
as I suggested, you might want to keep it across epoll_wait invocations.

Before you start digging in the unconsumed element queue, you just start
by splicing the content of the shared queue into the tail of unconsumed
queue. Then, you simply dequeue the unconsumed queue elements until you
either reach the end or the max nr.

You should note that with this scheme, you'll have to dequeue items from
the unconsumed queue rather than just iterating over the elements. The
nice side of consuming all the elements (in the temp queue on the local
stack) is that you don't care about dequeuing, since the entire queue
will vanish. However, in this case, since you care about keeping the
queue after a partial iteration, you need to dequeue from it.

And yes, this approach involves checking both queues for event
availability. Hopefully none of this will be too much of an issue
performance-wise.
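
The availability check itself can stay cheap, something like this
(names illustrative):

/* Events are available if either queue is non-empty. */
static bool ep_events_available(struct eventpoll *ep)
{
	return !wfcq_empty(&ep->unconsumed_head, &ep->unconsumed_tail) ||
		!wfcq_empty(&ep->main_head, &ep->main_tail);
}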

Another approach could be to let you work directly on the shared queue:

I could possibly implement a

void __wfcq_snapshot(struct wfcq_head *head,
struct wfcq_tail *tail);

That would save a tail snapshot that would then be used to stop
iteration, dequeue and splice at the location of the tail snapshot. And

void __wfcq_snapshot_reset(struct wfcq_head *head,
struct wfcq_tail *tail);

would set the tail snapshot pointer back to NULL.

This would require a few extra checks, but nothing very expensive I
expect.
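
To make the idea concrete, a tentative sketch, assuming a tail_snapshot
pointer were added to struct wfcq_head (none of this is implemented):

static inline void __wfcq_snapshot(struct wfcq_head *head,
		struct wfcq_tail *tail)
{
	/* Iteration, dequeue and splice would stop at this node. */
	head->tail_snapshot = CMM_LOAD_SHARED(tail->p);
}

static inline void __wfcq_snapshot_reset(struct wfcq_head *head,
		struct wfcq_tail *tail)
{
	head->tail_snapshot = NULL;
}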

Thoughts?

> 
> > I'll send a new patch version soon,
> > 
> > Thanks for your comments!
> 
> Thank you for this data structure!  I will update my patch tomorrow or
> the day after and do more testing.

Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Eric Wong
Mathieu Desnoyers  wrote:
> * Eric Wong (normalper...@yhbt.net) wrote:
> > Mathieu Desnoyers  wrote:
> > > +/*
> > > + * Load data from shared memory.
> > > + */
> > > +#define CMM_LOAD_SHARED(p)   ACCESS_ONCE(p)
> > 
> > When iterating through the queue by dequeueing, I needed a way
> > to get the tail at the start of the iteration and use that as
> > a sentinel while iterating, so I access the tail like this:
> > 
> > struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> > 
> > I hope this is supported... it seems to work :)
> 
> Ideally it would be good if users could try using the exposed APIs to do
> these things, or if it's really needed, maybe it's a sign that we need
> to extend the API.

Right.  If I can use splice, I will not need this.  More comments below
on splice...

> > Unlike most queue users, I need to stop iteration to prevent the same
> > item from appearing in the events returned by epoll_wait, since a
> > dequeued item may appear back in the wfcqueue immediately.
> 
> I think your use-case is very similar to our urcu call_rcu
> implementation. I would recommend to use wfcq in the following way:
> 
> When you want to dequeue, define, on the stack:
> 
> struct wfcq_head qhead;
> struct wfcq_tail qtail;
> struct wfcq_node *node, *n;
> enum wfcq_ret ret;
> 
> wfcq_init(&qhead, &qtail);
> 
> /*
>  * Then use splice to move the entire source queue into the local queue.
>  * Don't forget to grab the appropriate mutexes for epoll_q here.
>  */
> ret = __wfcq_splice(&qhead, &qtail, epoll_q_head, epoll_q_tail);
> 
> switch (ret) {
> case WFCQ_RET_SRC_EMPTY:
> 	return -ENOENT; /* No events to handle */
> case WFCQ_RET_DEST_EMPTY:
> case WFCQ_RET_DEST_NON_EMPTY:
> 	break;
> }
> 
> /*
>  * From this point, you can release the epoll_q lock and simply iterate
>  * on the local queue using __wfcq_for_each() or __wfcq_for_each_safe()
>  * if you need to free the nodes at the same time.
>  */
> __wfcq_for_each_safe(&qhead, &qtail, node, n) {
> ...
> }
> 
> The advantage of using splice() over dequeue() is that you will reduce
> the amount of interactions between concurrent enqueue and dequeue
> operations on the head and tail of the same queue.

I wanted to use splice here originally, but epoll_wait(2) may not
consume the entire queue, as it's limited to maxevents specified by the
user calling epoll_wait.

With unconsumed elements, I need to preserve ordering of the queue to
avoid starvation.  So I would either need to:

a) splice the unconsumed portion back to the head of the shared queue.
   I'm not sure if this is possible while elements are being enqueued...
   Using a mutex for splicing back to unconsumed elements is OK, and
   probably required anyways since we need to keep EPOLLONESHOT
   unmasking synced with dequeue.

b) preserve the unconsumed spliced queue across epoll_wait invocations
   but that requires checking both queues for event availability...

> I'll send a new patch version soon,
> 
> Thanks for your comments!

Thank you for this data structure!  I will update my patch tomorrow or
the day after and do more testing.


[RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation (v2)

2013-03-14 Thread Mathieu Desnoyers
Ported to the Linux kernel from Userspace RCU library, at commit
108a92e5b97ee91b2b902dba2dd2e78aab42f420.

Ref: http://git.lttng.org/userspace-rcu.git

It is provided as a starting point only. Test cases should be ported
from Userspace RCU to kernel space and thoroughly run on a wide range of
architectures before considering this port production-ready.

Changelog since v1:
- Remove the internal mutex and helpers. Let the caller handle it.
- Disable preemption within the append operation, thus removing the sleep
  from dequeue and, with it, the blocking/nonblocking API distinction.

Signed-off-by: Mathieu Desnoyers 
CC: Lai Jiangshan 
CC: Paul E. McKenney 
CC: Stephen Hemminger 
CC: Davide Libenzi 
CC: Eric Wong 
---
 include/linux/wfcqueue.h |  444 +++
 1 file changed, 444 insertions(+)

Index: linux/include/linux/wfcqueue.h
===
--- /dev/null
+++ linux/include/linux/wfcqueue.h
@@ -0,0 +1,444 @@
+#ifndef _LINUX_WFCQUEUE_H
+#define _LINUX_WFCQUEUE_H
+
+/*
+ * linux/wfcqueue.h
+ *
+ * Concurrent Queue with Wait-Free Enqueue/Busy-Waiting Dequeue
+ *
+ * Copyright 2010-2013 - Mathieu Desnoyers 
+ * Copyright 2011-2012 - Lai Jiangshan 
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Concurrent Queue with Wait-Free Enqueue/Busy-Waiting Dequeue
+ *
+ * This queue has been designed and implemented collaboratively by
+ * Mathieu Desnoyers and Lai Jiangshan. Inspired from
+ * half-wait-free/half-blocking queue implementation done by Paul E.
+ * McKenney.
+ *
+ * Mutual exclusion of wfcq_* / __wfcq_* API
+ *
+ * Synchronization table:
+ *
+ * The external synchronization techniques described in the API below are
+ * required between pairs marked with "X". No external synchronization
+ * is required between pairs marked with "-".
+ *
+ * Legend:
+ * [1] wfcq_enqueue
+ * [2] __wfcq_splice (destination queue)
+ * [3] __wfcq_dequeue
+ * [4] __wfcq_splice (source queue)
+ * [5] __wfcq_first
+ * [6] __wfcq_next
+ *
+ * [1] [2] [3] [4] [5] [6]
+ * [1]  -   -   -   -   -   -
+ * [2]  -   -   -   -   -   -
+ * [3]  -   -   X   X   X   X
+ * [4]  -   -   X   -   X   X
+ * [5]  -   -   X   X   -   -
+ * [6]  -   -   X   X   -   -
+ *
+ * Besides locking, mutual exclusion of dequeue, splice and iteration
+ * can be ensured by performing all of those operations from a single
+ * thread, without requiring any lock.
+ */
+
+/*
+ * Load data from shared memory.
+ */
+#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
+
+/*
+ * Identify a shared store.
+ */
+#define CMM_STORE_SHARED(x, v) ({ ACCESS_ONCE(x) = (v); })
+
+enum wfcq_ret {
+	WFCQ_RET_DEST_EMPTY =		0,
+	WFCQ_RET_DEST_NON_EMPTY =	1,
+	WFCQ_RET_SRC_EMPTY =		2,
+};
+
+struct wfcq_node {
+   struct wfcq_node *next;
+};
+
+/*
+ * Do not put head and tail on the same cache-line if concurrent
+ * enqueue/dequeue are expected from many CPUs. This eliminates
+ * false-sharing between enqueue and dequeue.
+ */
+struct wfcq_head {
+   struct wfcq_node node;
+};
+
+struct wfcq_tail {
+   struct wfcq_node *p;
+};
+
+/*
+ * wfcq_node_init: initialize wait-free queue node.
+ */
+static inline void wfcq_node_init(struct wfcq_node *node)
+{
+   node->next = NULL;
+}
+
+/*
+ * wfcq_init: initialize wait-free queue.
+ */
+static inline void wfcq_init(struct wfcq_head *head,
+   struct wfcq_tail *tail)
+{
+   /* Set queue head and tail */
+   wfcq_node_init(&head->node);
+   tail->p = &head->node;
+}
+
+/*
+ * wfcq_empty: return whether wait-free queue is empty.
+ *
+ * No memory barrier is issued. No mutual exclusion is required.
+ *
+ * We perform the test on head->node.next to check if the queue is
+ * possibly empty, but we confirm this by checking if the tail pointer
+ * points to the head node because the tail pointer is the linearisation
+ * point of the enqueuers. Just checking the head next pointer could
+ * make a queue appear empty if an enqueuer is preempted for a long time
+ * between xchg() and setting the previous node's next pointer.
+ */
+static inline bool wfcq_empty(struct wfcq_head *head,
+	struct wfcq_tail *tail)
+{
+	return CMM_LOAD_SHARED(head->node.next) == NULL
+		&& CMM_LOAD_SHARED(tail->p) == &head->node;
+}

Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-14 Thread Mathieu Desnoyers
* Eric Wong (normalper...@yhbt.net) wrote:
> Mathieu Desnoyers  wrote:
> > Ported to the Linux kernel from Userspace RCU library, at commit
> > 108a92e5b97ee91b2b902dba2dd2e78aab42f420.
> > 
> > Ref: http://git.lttng.org/userspace-rcu.git
> > 
> > It is provided as a starting point only. Test cases should be ported
> > from Userspace RCU to kernel space and thoroughly run on a wide range of
> > architectures before considering this port production-ready.
> 
> Thanks, this seems to work.  Will post an early epoll patch in a minute.
> 
> Minor comments below.
> 
> > +/*
> > + * Load data from shared memory.
> > + */
> > +#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
> 
> When iterating through the queue by dequeueing, I needed a way
> to get the tail at the start of the iteration and use that as
> a sentinel while iterating, so I access the tail like this:
> 
>   struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);
> 
> I hope this is supported... it seems to work :)

Ideally it would be good if users could try using the exposed APIs to do
these things, or if it's really needed, maybe it's a sign that we need
to extend the API.

> 
> Unlike most queue users, I need to stop iteration to prevent the same
> item from appearing in the events returned by epoll_wait, since a
> dequeued item may appear back in the wfcqueue immediately.

I think your use-case is very similar to our urcu call_rcu
implementation. I would recommend to use wfcq in the following way:

When you want to dequeue, define, on the stack:

struct wfcq_head qhead;
struct wfcq_tail qtail;
struct wfcq_node *node, *n;
enum wfcq_ret ret;

wfcq_init(&qhead, &qtail);

/*
 * Then use splice to move the entire source queue into the local queue.
 * Don't forget to grab the appropriate mutexes for epoll_q here.
 */
ret = __wfcq_splice(&qhead, &qtail, epoll_q_head, epoll_q_tail);

switch (ret) {
case WFCQ_RET_SRC_EMPTY:
	return -ENOENT; /* No events to handle */
case WFCQ_RET_DEST_EMPTY:
case WFCQ_RET_DEST_NON_EMPTY:
	break;
}

/*
 * From this point, you can release the epoll_q lock and simply iterate
 * on the local queue using __wfcq_for_each() or __wfcq_for_each_safe()
 * if you need to free the nodes at the same time.
 */
__wfcq_for_each_safe(&qhead, &qtail, node, n) {
...
}

The advantage of using splice() over dequeue() is that you will reduce
the amount of interactions between concurrent enqueue and dequeue
operations on the head and tail of the same queue.

> 
> > +struct wfcq_head {
> > +   struct wfcq_node node;
> > +   struct mutex lock;
> > +};
> 
> I'm not using this lock at all since I already have ep->mtx (which also
> protects the ep->rbr).  Perhaps it should not be included; normal linked
> list and most data structures I see in the kernel do not provide their
> own locks, either.

Good point. The Linux kernel habit is to leave locks outside of those
structures whenever possible, so users can pick the right lock for their
needs. I will remove the lock, along with all the "helpers" that use it.

> 
> > +static inline void wfcq_init(struct wfcq_head *head,
> > +   struct wfcq_tail *tail)
> > +{
> > +   /* Set queue head and tail */
> > +   wfcq_node_init(&head->node);
> > +   tail->p = &head->node;
> > +   mutex_init(&head->lock);
> > +}
> 
> There's no corresponding mutex_destroy, so I'm just destroying it
> myself...

Since I remove it, it will become a non-issue.

After thinking about it, I plan to disable preemption around the xchg
and store within __wfcq_append. This is a luxury we have at kernel-level
that we don't have in user-space, so it might be good to use it. This
will allow me to remove the 10ms sleep from the adaptive busy-wait in
___wfcq_busy_wait(), and replace it with a simple busy-wait. This will
eliminate those possible latency peaks from the dequeue/splice/iteration
side. It's a shame to have the possibility of a 10ms sleep due to
preemption within a 2-instruction window. preempt_disable() will fix
this, while adding very low overhead to the enqueue path in preemptible
kernels.
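
Concretely, the append path would become something like this (a sketch
based on the xchg/store structure described above, not the final code):

static inline void __wfcq_append(struct wfcq_head *head,
		struct wfcq_tail *tail,
		struct wfcq_node *new_head,
		struct wfcq_node *new_tail)
{
	struct wfcq_node *old_tail;

	/*
	 * An enqueuer cannot be preempted between publishing the new
	 * tail and linking the old tail to it, so a dequeuer
	 * busy-waiting on a NULL next pointer only ever spins across
	 * that 2-instruction window.
	 */
	preempt_disable();
	old_tail = xchg(&tail->p, new_tail);	/* implies full barrier */
	CMM_STORE_SHARED(old_tail->next, new_head);
	preempt_enable();
}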

I'll send a new patch version soon,

Thanks for your comments!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


Re: [RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-13 Thread Eric Wong
Mathieu Desnoyers  wrote:
> Ported to the Linux kernel from Userspace RCU library, at commit
> 108a92e5b97ee91b2b902dba2dd2e78aab42f420.
> 
> Ref: http://git.lttng.org/userspace-rcu.git
> 
> It is provided as a starting point only. Test cases should be ported
> from Userspace RCU to kernel space and thoroughly run on a wide range of
> architectures before considering this port production-ready.

Thanks, this seems to work.  Will post an early epoll patch in a minute.

Minor comments below.

> +/*
> + * Load data from shared memory.
> + */
> +#define CMM_LOAD_SHARED(p)   ACCESS_ONCE(p)

When iterating through the queue by dequeueing, I needed a way
to get the tail at the start of the iteration and use that as
a sentinel while iterating, so I access the tail like this:

struct wfcq_node *p = CMM_LOAD_SHARED(ep->rdltail.p);

I hope this is supported... it seems to work :)

Unlike most queue users, I need to stop iteration to prevent the same
item from appearing in the events returned by epoll_wait, since a
dequeued item may appear back in the wfcqueue immediately.

> +struct wfcq_head {
> + struct wfcq_node node;
> + struct mutex lock;
> +};

I'm not using this lock at all since I already have ep->mtx (which also
protects the ep->rbr).  Perhaps it should not be included; normal linked
list and most data structures I see in the kernel do not provide their
own locks, either.

> +static inline void wfcq_init(struct wfcq_head *head,
> + struct wfcq_tail *tail)
> +{
> + /* Set queue head and tail */
> + wfcq_node_init(&head->node);
> + tail->p = &head->node;
> + mutex_init(&head->lock);
> +}

There's no corresponding mutex_destroy, so I'm just destroying it
myself...


[RFC PATCH] Linux kernel Wait-Free Concurrent Queue Implementation

2013-03-11 Thread Mathieu Desnoyers
Ported to the Linux kernel from Userspace RCU library, at commit
108a92e5b97ee91b2b902dba2dd2e78aab42f420.

Ref: http://git.lttng.org/userspace-rcu.git

It is provided as a starting point only. Test cases should be ported
from Userspace RCU to kernel space and thoroughly run on a wide range of
architectures before considering this port production-ready.

Signed-off-by: Mathieu Desnoyers 
CC: Lai Jiangshan 
CC: Paul E. McKenney 
CC: Stephen Hemminger 
CC: Davide Libenzi 
CC: Eric Wong 
---
 include/linux/wfcqueue.h |  642 +++
 1 file changed, 642 insertions(+)

Index: linux/include/linux/wfcqueue.h
===
--- /dev/null
+++ linux/include/linux/wfcqueue.h
@@ -0,0 +1,642 @@
+#ifndef _LINUX_WFCQUEUE_H
+#define _LINUX_WFCQUEUE_H
+
+/*
+ * linux/wfcqueue.h
+ *
+ * Concurrent Queue with Wait-Free Enqueue/Blocking Dequeue
+ *
+ * Copyright 2010-2013 - Mathieu Desnoyers 
+ * Copyright 2011-2012 - Lai Jiangshan 
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/*
+ * Concurrent Queue with Wait-Free Enqueue/Blocking Dequeue
+ *
+ * This queue has been designed and implemented collaboratively by
+ * Mathieu Desnoyers and Lai Jiangshan. Inspired from
+ * half-wait-free/half-blocking queue implementation done by Paul E.
+ * McKenney.
+ *
+ * Mutual exclusion of wfcq_* / __wfcq_* API
+ *
+ * Synchronization table:
+ *
+ * The external synchronization techniques described in the API below are
+ * required between pairs marked with "X". No external synchronization
+ * is required between pairs marked with "-".
+ *
+ * Legend:
+ * [1] wfcq_enqueue
+ * [2] __wfcq_splice (destination queue)
+ * [3] __wfcq_dequeue
+ * [4] __wfcq_splice (source queue)
+ * [5] __wfcq_first
+ * [6] __wfcq_next
+ *
+ * [1] [2] [3] [4] [5] [6]
+ * [1]  -   -   -   -   -   -
+ * [2]  -   -   -   -   -   -
+ * [3]  -   -   X   X   X   X
+ * [4]  -   -   X   -   X   X
+ * [5]  -   -   X   X   -   -
+ * [6]  -   -   X   X   -   -
+ *
+ * Mutual exclusion can be ensured by holding wfcq_dequeue_lock().
+ *
+ * For convenience, wfcq_dequeue_blocking() and
+ * wfcq_splice_blocking() hold the dequeue lock.
+ *
+ * Besides locking, mutual exclusion of dequeue, splice and iteration
+ * can be ensured by performing all of those operations from a single
+ * thread, without requiring any lock.
+ */
+
+/*
+ * Load data from shared memory.
+ */
+#define CMM_LOAD_SHARED(p) ACCESS_ONCE(p)
+
+/*
+ * Identify a shared store.
+ */
+#define CMM_STORE_SHARED(x, v) ({ ACCESS_ONCE(x) = (v); })
+
+#define WFCQ_WOULDBLOCK		((void *) -1UL)
+#define WFCQ_ADAPT_ATTEMPTS	10	/* Retry if being set */
+#define WFCQ_WAIT		10	/* Wait 10 ms if being set */
+
+enum wfcq_ret {
+	WFCQ_RET_WOULDBLOCK =		-1,
+	WFCQ_RET_DEST_EMPTY =		0,
+	WFCQ_RET_DEST_NON_EMPTY =	1,
+	WFCQ_RET_SRC_EMPTY =		2,
+};
+
+struct wfcq_node {
+   struct wfcq_node *next;
+};
+
+/*
+ * Do not put head and tail on the same cache-line if concurrent
+ * enqueue/dequeue are expected from many CPUs. This eliminates
+ * false-sharing between enqueue and dequeue.
+ */
+struct wfcq_head {
+   struct wfcq_node node;
+   struct mutex lock;
+};
+
+struct wfcq_tail {
+   struct wfcq_node *p;
+};
+
+/*
+ * wfcq_node_init: initialize wait-free queue node.
+ */
+static inline void wfcq_node_init(struct wfcq_node *node)
+{
+   node->next = NULL;
+}
+
+/*
+ * wfcq_init: initialize wait-free queue.
+ */
+static inline void wfcq_init(struct wfcq_head *head,
+   struct wfcq_tail *tail)
+{
+   /* Set queue head and tail */
+   wfcq_node_init(&head->node);
+   tail->p = &head->node;
+   mutex_init(&head->lock);
+}
+
+/*
+ * wfcq_empty: return whether wait-free queue is empty.
+ *
+ * No memory barrier is issued. No mutual exclusion is required.
+ *
+ * We perform the test on head->node.next to check if the queue is
+ * possibly empty, but we confirm this by checking if the tail pointer
+ * points to the head node because the tail pointer is the linearisation
+ * point of the enqueuers. Just checking the head next pointer could
+ * make a queue appear empty if an enqueuer is preempted for a long time
+ * between xchg() and setting the previous node's next pointer.
+ */