Re: pserialize(9) vs. TAILQ

2014-11-18 Thread Dennis Ferguson

On 18 Nov, 2014, at 22:13 , Masao Uebayashi  wrote:
> In the TAILQ case, where readers iterate a list by TAILQ_FOREACH(),
> TAILQ_REMOVE() is safely used as the update operation, because:
> 
> - Readers only see tqe_next in TAILQ_FOREACH(), and
> - Pointer assignment (done in TAILQ_REMOVE()) is atomic.
> 
> If this is correct, pserialize(9) should be updated to be clearer;
> probably this TAILQ example is worth being added there.

I don't think TAILQ is a good example for this.  While TAILQ_REMOVE()
will work, none of the TAILQ_INSERT_*() macros can be reliably used
with concurrent readers so it isn't clear how the list you are
removing stuff from could have been built.

Also, the TAILQ macros depend on a type pun fairly similar to the one
that did in the CIRCLEQ macros and hence have a lifetime which extends
only to some future version of the compiler smart enough to recognize
that and miscompile them.  I think new code, and new man page text,
should probably avoid using the macros to keep from making the eventual
problem with them bigger than it already is.

Dennis Ferguson


Re: pserialize(9) vs. TAILQ

2014-11-18 Thread Masao Uebayashi
On Wed, Nov 19, 2014 at 1:05 AM, Dennis Ferguson
 wrote:
>
> On 18 Nov, 2014, at 22:13 , Masao Uebayashi  wrote:
>> In the TAILQ case, where readers iterate a list by TAILQ_FOREACH(),
>> TAILQ_REMOVE() is safely used as the update operation, because:
>>
>> - Readers only see tqe_next in TAILQ_FOREACH(), and
>> - Pointer assignment (done in TAILQ_REMOVE()) is atomic.
>>
>> If this is correct, pserialize(9) should be updated to be clearer;
>> probably this TAILQ example is worth being added there.
>
> I don't think TAILQ is a good example for this.  While TAILQ_REMOVE()
> will work, none of the TAILQ_INSERT_*() macros can be reliably used
> with concurrent readers so it isn't clear how the list you are
> removing stuff from could have been built.

If there is a promise that readers iterate the list only in the forward
direction (using tqe_next / tqh_first), the actual update operation by
writers is the pointer assignment that links the new element.  (Note that
the next pointer in the new element is not visible until it is linked.)
This is common to any list, and TAILQ is not an exception.
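
For concreteness, a minimal sketch of the publication step being
described (the names are illustrative, not the actual TAILQ fields):

	/*
	 * Insert 'new' after 'prev' in a forward-only-readable list.
	 * A concurrent reader following next pointers sees either the
	 * old list or the fully formed new element, never a half-linked
	 * state, provided the first store becomes visible before the
	 * second (the ordering question discussed later in this thread).
	 */
	new->next = prev->next;		/* not yet reachable by readers */
	prev->next = new;		/* single atomic store publishes it */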

> Also, the TAILQ macros depend on a type pun fairly similar to the one
> that did in the CIRCLEQ macros and hence have a lifetime which extends
> only to some future version of the compiler smart enough to recognize
> that and miscompile them.  I think new code, and new man page text,
> should probably avoid using the macros to keep from making the eventual
> problem with them bigger than it already is.

No comment about this part. :)  Any better suggestion?


Re: pserialize(9) vs. TAILQ

2014-11-18 Thread Taylor R Campbell
   Date: Tue, 18 Nov 2014 23:13:34 +0900
   From: Masao Uebayashi 

   In pserialize_perform(), context switches are made on all CPUs.  After
   pserialize_perform(), all readers on all CPUs see the updated data.
   The old data item is safely destroyed.

   In the TAILQ case, where readers iterate a list by TAILQ_FOREACH(),
   TAILQ_REMOVE() is safely used as the update operation, because:

   - Readers only see tqe_next in TAILQ_FOREACH(), and
   - Pointer assignment (done in TAILQ_REMOVE()) is atomic.

   If this is correct, pserialize(9) should be updated to be clearer;
   probably this TAILQ example is worth being added there.

That is correct.

The one tricky detail is that after a reader has fetched the tqe_next
pointer, it must issue a membar_consumer before dereferencing the
pointer: otherwise there is no guarantee about the order in which the
CPU will fetch the tqe_next pointer and its contents (which it may
have cached).  Whoever inserts entries must also issue a
membar_producer after initializing the entry and before inserting it,
to match the reader's membar_consumer.

Someone^TM should invent names for queue operations that are
pserialize-safe by virtue of automatically issuing these memory
barriers: TAILQ_FOREACH_PSZ, TAILQ_INSERT_HEAD_PSZ, &c., or something,
so that it is easier to use them correctly and spot incorrect use.
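
As a rough sketch of what such macros might look like (these do not
exist; the names and barrier placement are only an illustration of the
proposal, modeled on the stock TAILQ_INSERT_HEAD):

	/* Writer side: initialize fully, then publish. */
	#define TAILQ_INSERT_HEAD_PSZ(head, elm, field) do {		\
		(elm)->field.tqe_next = (head)->tqh_first;		\
		(elm)->field.tqe_prev = &(head)->tqh_first;		\
		if ((elm)->field.tqe_next != NULL)			\
			(elm)->field.tqe_next->field.tqe_prev =		\
			    &(elm)->field.tqe_next;			\
		else							\
			(head)->tqh_last = &(elm)->field.tqe_next;	\
		membar_producer();	/* init before publication */	\
		(head)->tqh_first = (elm); /* readers may now see elm */ \
	} while (/*CONSTCOND*/0)

	/* Reader side: barrier between fetching a pointer and using it. */
	#define TAILQ_FOREACH_PSZ(var, head, field)			\
		for ((var) = (head)->tqh_first;				\
		    (var) != NULL && (membar_consumer(), 1);		\
		    (var) = (var)->field.tqe_next)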


Re: pserialize(9) vs. TAILQ

2014-11-18 Thread Taylor R Campbell
   Date: Wed, 19 Nov 2014 00:05:05 +0800
   From: Dennis Ferguson 

   Also, the TAILQ macros depend on a type pun fairly similar to the one
   that did in the CIRCLEQ macros and hence have a lifetime which extends
   only to some future version of the compiler smart enough to recognize
   that and miscompile them.

What type pun is that?  I don't see it.


Re: pserialize(9) vs. TAILQ

2014-11-18 Thread Masao Uebayashi
On Wed, Nov 19, 2014 at 2:53 AM, Taylor R Campbell
 wrote:
> The one tricky detail is that after a reader has fetched the tqe_next
> pointer, it must issue a membar_consumer before dereferencing the
> pointer: otherwise there is no guarantee about the order in which the
> CPU will fetch the tqe_next pointer and its contents (which it may
> have cached).  Whoever inserts entries must also issue a
> membar_producer after initializing the entry and before inserting it,
> to match the reader's membar_consumer.

Good point.

> Someone^TM should invent names for queue operations that are
> pserialize-safe by virtue of automatically issuing these memory
> barriers: TAILQ_FOREACH_PSZ, TAILQ_INSERT_HEAD_PSZ, &c., or something,
> so that it is easier to use them correctly and spot incorrect use.

I think providing only TAILQ-for-pserialize(9) is enough.  Maybe
PSZQ()?  I don't really mind the name.


Re: pserialize(9) vs. TAILQ

2014-11-18 Thread Masao Uebayashi
On Wed, Nov 19, 2014 at 10:15 AM, Masao Uebayashi  wrote:
> On Wed, Nov 19, 2014 at 2:53 AM, Taylor R Campbell
>  wrote:
>> The one tricky detail is that after a reader has fetched the tqe_next
>> pointer, it must issue a membar_consumer before dereferencing the
>> pointer: otherwise there is no guarantee about the order in which the
>> CPU will fetch the tqe_next pointer and its contents (which it may
>> have cached).  Whoever inserts entries must also issue a
>> membar_producer after initializing the entry and before inserting it,
>> to match the reader's membar_consumer.
>
> Good point.
>
>> Someone^TM should invent names for queue operations that are
>> pserialize-safe by virtue of automatically issuing these memory
>> barriers: TAILQ_FOREACH_PSZ, TAILQ_INSERT_HEAD_PSZ, &c., or something,
>> so that it is easier to use them correctly and spot incorrect use.
>
> I think providing only TAILQ-for-pserialize(9) is enough.  Maybe
> PSZQ()?  I don't really mind the name.

Well, the struct doesn't need to change; only memory barriers need to be
placed in the read/write paths.  TAILQ_*_PSZ is better.


Re: pserialize(9) vs. TAILQ

2014-11-18 Thread Dennis Ferguson

On 19 Nov, 2014, at 00:34 , Masao Uebayashi  wrote:
> On Wed, Nov 19, 2014 at 1:05 AM, Dennis Ferguson
>  wrote:
>> 
>> On 18 Nov, 2014, at 22:13 , Masao Uebayashi  wrote:
>>> In the TAILQ case, where readers iterate a list by TAILQ_FOREACH(),
>>> TAILQ_REMOVE() is safely used as the update operation, because:
>>> 
>>> - Readers only see tqe_next in TAILQ_FOREACH(), and
>>> - Pointer assignment (done in TAILQ_REMOVE()) is atomic.
>>> 
>>> If this is correct, pserialize(9) should be updated to be clearer;
>>> probably this TAILQ example is worth being added there.
>> 
>> I don't think TAILQ is a good example for this.  While TAILQ_REMOVE()
>> will work, none of the TAILQ_INSERT_*() macros can be reliably used
>> with concurrent readers so it isn't clear how the list you are
>> removing stuff from could have been built.
> 
> If there is a promise that readers iterate list only in forward
> direction (using tqe_next / tqh_first), the actual update operation by
> writers is the pointer assignment to link the new element.  (Note that
> next pointer in the new element is not visible until it is linked.)
> This is common to any list, and TAILQ is not an exception.

That's very true, but the code is only correct if the writes occur
in the order the C code has them written (i.e. tqe_next in the
new element is written before writing the tqe_next of the existing
list element to point at it).  To ensure the compiler or the
cache hardware doesn't reorder those writes it needs a write barrier
inserted to enforce the order as written.

Making sure there are memory barriers where they are needed is the
hardest part of writing data structure update code designed to operate
with concurrent readers, since the C code still looks perfectly correct
without them.

Dennis Ferguson


Re: pserialize(9) vs. TAILQ

2014-11-19 Thread Dennis Ferguson

On 19 Nov, 2014, at 01:54 , Taylor R Campbell 
 wrote:

>   Date: Wed, 19 Nov 2014 00:05:05 +0800
>   From: Dennis Ferguson 
> 
>   Also, the TAILQ macros depend on a type pun fairly similar to the one
>   that did in the CIRCLEQ macros and hence have a lifetime which extends
>   only to some future version of the compiler smart enough to recognize
>   that and miscompile them.
> 
> What type pun is that?  I don't see it.

I think this one:

#define TAILQ_LAST(head, headname) \
(*(((struct headname *)((head)->tqh_last))->tqh_last))
#define TAILQ_PREV(elm, headname, field) \
(*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))

The (struct headname *) pointer obtained by the cast is sometimes
used to access data in something other than a 'struct headname'.
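
For reference, a sketch of the layout coincidence it relies on (field
names as in sys/queue.h):

	struct headname {		/* TAILQ_HEAD(headname, type) */
		struct type *tqh_first;	/* first element */
		struct type **tqh_last;	/* addr of last next element */
	};

	struct {			/* TAILQ_ENTRY(type) */
		struct type *tqe_next;	/* next element */
		struct type **tqe_prev;	/* addr of previous next element */
	};

	/*
	 * TAILQ_PREV() casts a pointer into the middle of an element
	 * (its tqe_next/tqe_prev pair) to 'struct headname *' and reads
	 * tqh_last through it.  That works only because the two structs
	 * happen to have identical layout; a compiler exploiting strict
	 * aliasing may assume a 'struct headname *' never refers to an
	 * entry, and miscompile the access.
	 */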

Dennis Ferguson

Re: pserialize(9) vs. TAILQ

2014-11-19 Thread Dennis Ferguson

On 19 Nov, 2014, at 01:53 , Taylor R Campbell 
 wrote:
>   Date: Tue, 18 Nov 2014 23:13:34 +0900
>   From: Masao Uebayashi 
> 
>   In pserialize_perform(), context switches are made on all CPUs.  After
>   pserialize_perform(), all readers on all CPUs see the updated data.
>   The old data item is safely destroyed.
> 
>   In the TAILQ case, where readers iterate a list by TAILQ_FOREACH(),
>   TAILQ_REMOVE() is safely used as the update operation, because:
> 
>   - Readers only see tqe_next in TAILQ_FOREACH(), and
>   - Pointer assignment (done in TAILQ_REMOVE()) is atomic.
> 
>   If this is correct, pserialize(9) should be updated to be clearer;
>   probably this TAILQ example is worth being added there.
> 
> That is correct.
> 
> The one tricky detail is that after a reader has fetched the tqe_next
> pointer, it must issue a membar_consumer before dereferencing the
> pointer: otherwise there is no guarantee about the order in which the
> CPU will fetch the tqe_next pointer and its contents (which it may
> have cached).

I don't think it is correct to use a membar_consumer().  The reads done
by dereferencing tqe_next are dependent reads, as in they cannot be
done until the read of tqe_next has completed.  This is sufficient for
the reader to observe the write ordering, without a read barrier, on
every machine except a multiprocessor DEC Alpha.  See, e.g., section 7.1
of this:

http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf

Since most machines don't need any barrier here it would be extremely
inefficient to add a membar_consumer() since that would make all machines
pay for the idiosyncrasies unique to the DEC Alpha.

So the question is, is concurrent reader code we write expected to run
on a multiprocessor DEC Alpha?  If it is then there is a missing memory
barrier primitive that is required for dependent reads, one that does
something on a DEC Alpha and nothing on any other machine.  If code
like this doesn't need to run on a multiprocessor DEC Alpha, however,
then the code is correct with no read barrier.

Dennis Ferguson

Re: pserialize(9) vs. TAILQ

2014-11-19 Thread Taylor R Campbell
   Date: Wed, 19 Nov 2014 17:11:18 +0800
   From: Dennis Ferguson 

   On 19 Nov, 2014, at 01:53 , Taylor R Campbell 
 wrote:
   > The one tricky detail is that after a reader has fetched the tqe_next
   > pointer, it must issue a membar_consumer before dereferencing the
   > pointer: otherwise there is no guarantee about the order in which the
   > CPU will fetch the tqe_next pointer and its contents (which it may
   > have cached).

   I don't think it is correct to use a membar_consumer().

It is certainly /correct/.  It may not be /necessary/ in some
architectures that provide more guarantees about load ordering than
are written in the program without membar_consumer.

   Since most machines don't need any barrier here it would be extremely
   inefficient to add a membar_consumer() since that would make all machines
   pay for the idiosyncrasies unique to the DEC Alpha.

We could invent a membar_datadep_consumer for this case, and make it a
no-op on everything other than Alpha until another CPU designer
relaxes the memory ordering.  The intent of the code is clearer with
the memory barrier, and it emphasizes the need for a paired
membar_producer elsewhere.


Re: pserialize(9) vs. TAILQ

2014-11-19 Thread Taylor R Campbell
   Date: Wed, 19 Nov 2014 16:28:41 +0800
   From: Dennis Ferguson 

   On 19 Nov, 2014, at 01:54 , Taylor R Campbell 
 wrote:

   > What type pun is that?  I don't see it.

   I think this one:

   #define TAILQ_LAST(head, headname) \
   (*(((struct headname *)((head)->tqh_last))->tqh_last))
   #define TAILQ_PREV(elm, headname, field) \
   (*(((struct headname *)((elm)->field.tqe_prev))->tqh_last))

Pooh.  Fortunately, it applies only to reverse iteration, which most
tailq users don't rely on.

Perhaps it's time to invent a new API where you have to declare the
queue type up front -- both TAILQ and CIRCLEQ use the same physical
structure for the head and entry, so if we had a struct tag declared
up front they could both use that.  Something like:

BIDIQ_TYPE(frobq, frob);
/*
 * struct frob;
 * struct frobq { struct frob *bqh_first; struct frob **bqh_last; };
 */

BIDIQ_HEAD(frobq, frob_head);
/* struct frob_head { struct frobq q; }; */

struct frob_head the_frobs = BIDIQ_HEAD_INITIALIZER(&the_frobs);

struct frob {
...
BIDIQ_ENTRY(frobq)  f_entry;
/* struct frobq f_entry; */
...
};


Re: pserialize(9) vs. TAILQ

2014-11-19 Thread Masao Uebayashi
On Wed, Nov 19, 2014 at 10:00 PM, Taylor R Campbell
 wrote:
>Date: Wed, 19 Nov 2014 17:11:18 +0800
>From: Dennis Ferguson 
>
>On 19 Nov, 2014, at 01:53 , Taylor R Campbell 
>  wrote:
>> The one tricky detail is that after a reader has fetched the tqe_next
>> pointer, it must issue a membar_consumer before dereferencing the
>> pointer: otherwise there is no guarantee about the order in which the
>> CPU will fetch the tqe_next pointer and its contents (which it may
>> have cached).
>
>I don't think it is correct to use a membar_consumer().
>
> It is certainly /correct/.  It may not be /necessary/ in some
> architectures that provide more guarantees about load ordering than
> are written in the program without membar_consumer.

By reading and comparing your opinions, I have come to think that a
membar is not necessary to safely traverse the protected data structure
(tqe_next pointers).  The minimal requirement of the pserialize(9)
"update" section (mutex_enter() ... pserialize_perform()) is that the
protected data is updated atomically and kept topologically consistent,
to prevent readers from seeing corrupted data (pointers).  Some
readers may see the stale/old tqe_next, but that is OK at this point.
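
For reference, the remove-side pattern in question looks roughly like
this (a sketch; psz, writer_lock, and the element pool are assumed to
exist and be set up as pserialize(9) describes):

	/* writer */
	mutex_enter(&writer_lock);
	TAILQ_REMOVE(&head, e, entry);	/* atomic unlink; readers in a
					 * read section may still hold e */
	pserialize_perform(psz);	/* wait for all readers to drain */
	mutex_exit(&writer_lock);
	pool_put(&e_pool, e);		/* now safe to destroy */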

In practice, there must be a membar somewhere for readers to see the
updated data; but such a membar is not always tied to tqe_next.
Maybe it is, maybe it is not; only the users of pserialize(9) know.
Putting membars around every TAILQ operation is excessive.

For example, consider struct ifnet.  It will have a reference count to
take a reference in a pserialize(9) critical section.  Reference count
is incremented by atomic_inc_*, which might imply membar on some
archs.

When adding a new ifnet, the parts of its content promised to be
constant while on the protected list (e.g. if_name) must be filled in
before it is inserted and becomes visible via the list.  In this case,
the write to if_name is ordered relative to tqe_next.
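
A sketch of that initialize-then-publish ordering, using the names from
the paragraph above (if_name standing in for whatever fields are
promised constant; the barrier placement is the point under discussion):

	/* writer */
	strlcpy(ifp->if_name, name, sizeof(ifp->if_name));
					/* fill in the constant parts */
	membar_producer();		/* if_name visible before linkage */
	TAILQ_INSERT_TAIL(&ifnet_list, ifp, if_list);
					/* ifp now reachable by readers */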

>Since most machines don't need any barrier here it would be extremely
>inefficient to add a membar_consumer() since that would make all machines
>pay for the idiosyncrasies unique to the DEC Alpha.
>
> We could invent a membar_datadep_consumer for this case, and make it a
> no-op on everything other than Alpha until another CPU designer
> relaxes the memory ordering.  The intent of the code is clearer with
> the memory barrier, and it emphasizes the need for a paired
> membar_producer elsewhere.

I thought alpha sync op implementations imply instruction ordering ...
but maybe I'm too optimistic. :)


Re: pserialize(9) vs. TAILQ

2014-11-19 Thread Masao Uebayashi
On Wed, Nov 19, 2014 at 2:49 PM, Dennis Ferguson
 wrote:
> That's very true, but the code is only correct if the writes occur
> in the order the C code has them written (i.e. tqe_next in the
> new element is written before writing the tqe_next of the existing
> list element to point at it).  To ensure the compiler or the
> cache hardware doesn't reorder those writes it needs a write barrier
> inserted to enforce the order as written.

Yes, in the pserialize(9) + TAILQ() case, everything except tqe_next
is only written by writers, and protected with a mutex, which ensures
(implies) ordering.  What is ordered by what has to be clearly
documented.

> Making sure there are memory barriers where they are needed is the
> hardest part of writing data structure update code designed to operate
> with concurrent readers, since the C code still looks perfectly correct
> without them.

Agreed.  The current manual (membar_*(9)) and existing code are not
good for learners (like me).

I always wonder why membar(9) functions don't take arguments to
clarify the ordering of objects.  Even if those object addresses have
no meaning for CPUs, declaring the intended objects, ordered either
before/after the barrier, should help the code to be clearer.
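
A sketch of what such a self-documenting variant could look like
(purely hypothetical; the arguments are documentation only and the
expansion ignores them):

	/*
	 * Order the store to 'before' ahead of the store to 'after'.
	 * The pointers are never evaluated; they only record intent
	 * for the human reading the code.
	 */
	#define membar_producer_order(before, after)	membar_producer()

	/* usage, in a list insertion: */
	elm->tqe_next = NULL;
	membar_producer_order(&elm->tqe_next, &prev->tqe_next);
	prev->tqe_next = elm;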


Re: pserialize(9) vs. TAILQ

2014-11-19 Thread Dennis Ferguson

On 19 Nov, 2014, at 21:00 , Taylor R Campbell 
 wrote:
>   Since most machines don't need any barrier here it would be extremely
>   inefficient to add a membar_consumer() since that would make all machines
>   pay for the idiosyncrasies unique to the DEC Alpha.
> 
> We could invent a membar_datadep_consumer for this case, and make it a
> no-op on everything other than Alpha until another CPU designer
> relaxes the memory ordering.  The intent of the code is clearer with
> the memory barrier, and it emphasizes the need for a paired
> membar_producer elsewhere.

The problem with this is that current kernel code already assumes it
isn't running on a multiprocessor Alpha and omits read barriers where
they aren't required for other machines (which likely explains why
there is no appropriate barrier operation for this currently).  As
a gross measure:

   b1$ cd /usr/src/sys/kern
   b1$ grep membar_producer * | wc -l
         25
   b1$ grep membar_consumer * | wc -l
          5

and it isn't very hard to find explicit examples.  Here's one from
subr_pcq.c

void *
pcq_get(pcq_t *pcq)
{
	uint32_t v, nv;
	u_int p, c;
	void *item;

	v = pcq->pcq_pc;
	pcq_split(v, &p, &c);
	if (p == c) {
		/* Queue is empty: nothing to return. */
		return NULL;
	}
	item = pcq->pcq_items[c];

and another from subr_ipi.c

static void
ipi_msg_cpu_handler(void *arg __unused)
{
	const struct cpu_info * const ci = curcpu();
	ipi_mbox_t *mbox = &ipi_mboxes[cpu_index(ci)];

	for (u_int i = 0; i < IPI_MSG_MAX; i++) {
		ipi_msg_t *msg;

		/* Get the message. */
		if ((msg = mbox->msg[i]) == NULL) {
			continue;
		}
		mbox->msg[i] = NULL;

		/* Execute the handler. */
		KASSERT(msg->func);
		msg->func(msg->arg);
I think the Alpha needs barriers before the last lines.
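
Concretely, the placement being suggested for pcq_get() would be
something like this (a sketch; the barrier shown is whatever
dependent-load barrier the Alpha requires, which at this point has no
NetBSD name):

	v = pcq->pcq_pc;
	pcq_split(v, &p, &c);
	if (p == c) {
		return NULL;
	}
	/*
	 * Alpha: a dependent-load barrier would be needed here, since
	 * the address of pcq_items[c] is computed from the value v.
	 */
	item = pcq->pcq_items[c];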

Adding a new barrier operation for this would only help if there
were some determination to also go back and fix the code already
written to use it, i.e. someone needs to decree that all code
now needs to be written to theoretically run on an Alpha (the
support would still only be theoretical since it is hard to
find hardware which behaves like this to test on).  I haven't
heard anything one way or the other concerning support for MP
Alphas, but the implicit message from the current state of
the code is that an MP Alpha isn't a NetBSD target.

In any case (and omitting arguments about why trying to support
the MP Alpha might not be a good idea), before you add the new
barrier I think there needs to be an explicit statement to the effect
that the MP Alpha needs to be included as a NetBSD target platform
so that there is an actual need for that barrier.  Without that I
think this code is correct the way it is; there is certainly
no problem with it that is fixed by adding a membar_consumer().

Dennis Ferguson

Re: pserialize(9) vs. TAILQ

2014-11-19 Thread Dennis Ferguson

On 20 Nov, 2014, at 11:07 , Masao Uebayashi  wrote:
> On Wed, Nov 19, 2014 at 2:49 PM, Dennis Ferguson
>  wrote:
>> That's very true, but the code is only correct if the writes occur
>> in the order the C code has them written (i.e. tqe_next in the
>> new element is written before writing the tqe_next of the existing
>> list element to point at it).  To ensure the compiler or the
>> cache hardware doesn't reorder those writes it needs a write barrier
>> inserted to enforce the order as written.
> 
> Yes, in the pserialize(9) + TAILQ() case, everything except tqe_next
> is only written by writers, and protected with a mutex, which ensures
> (implies) ordering.  What is ordered by what has to be clearly
> documented.

We might be on the same page, but your response is confusing me enough
that it might be useful to make this explicit.  The guts of the current
TAILQ_INSERT_TAIL() macro look like this:

(elm)->field.tqe_next = TAILQ_END(head);
(elm)->field.tqe_prev = (head)->tqh_last;
*(head)->tqh_last = (elm);
(head)->tqh_last = &(elm)->field.tqe_next;

In standard C (assuming no volatile variable declarations) this replacement
sequence is exactly equivalent to the above

*(head)->tqh_last = (elm);
(elm)->field.tqe_prev = (head)->tqh_last;
(head)->tqh_last = &(elm)->field.tqe_next;
(elm)->field.tqe_next = TAILQ_END(head);

because the end result of executing the 4 lines in either order is identical
and, in a single-threaded process, that is all that matters.  Since the
sequences are equivalent the compiler's optimizer and the machine's cache
hardware are free to turn one sequence into the other if there is a
performance advantage gained by doing so.

If (and only if) there may be a concurrent reader of the structure it is
pretty clear that the second sequence can't work; a concurrent reader may
see (elm)->field.tqe_next uninitialized.  What is also true, but not so
clear, is that while the first sequence looks right it is equally incorrect
since it is equivalent to the second.  The code is only correct if the required
ordering is enforced by a barrier, e.g.

(elm)->field.tqe_next = TAILQ_END(head);
(elm)->field.tqe_prev = (head)->tqh_last;
membar_producer();
*(head)->tqh_last = (elm);
(head)->tqh_last = &(elm)->field.tqe_next;

That tells the compiler to avoid reordering the stores done in code before
the barrier to after the barrier, and vice versa, and to do what is needed
(if anything) to tell the cache hardware to do the same (on an x86 only
the compiler barrier is necessary).  It makes the ordering you need
explicit.

The conclusion is that while the current TAILQ_REMOVE() macro will work
unchanged to maintain a list with concurrent readers (given no requirement
to run on an MP DEC Alpha) the TAILQ_INSERT_*() macros will not, even
though they look correct.  The mutex and/or the use of pserialize() does
nothing to change this.  If you want to insert things into a TAILQ list
without blocking concurrent readers you must use alternative macros that
have membar_producer() calls in the right spots.

Are we on the same page?

Dennis Ferguson

Re: pserialize(9) vs. TAILQ

2014-11-20 Thread Thor Lancelot Simon
On Thu, Nov 20, 2014 at 01:05:05PM +0800, Dennis Ferguson wrote:
> 
> find hardware which behaves like this to test on).  I haven't
> heard anything one way or the other concerning support for MP
> Alphas, but the implicit message from the current state of
> the code is that an MP Alpha isn't a NetBSD target.

We do run on a pretty good variety of multiprocessor Alpha systems.
Whether it's worth the effort to ensure we continue to do so... well,
it might be worth at least a policy of "do no harm" (in other words,
declaring the necessary barrier operation and using it where we notice
it is needed).

Thor


Re: pserialize(9) vs. TAILQ

2014-11-20 Thread Eduardo Horvath
On Thu, 20 Nov 2014, Thor Lancelot Simon wrote:

> On Thu, Nov 20, 2014 at 01:05:05PM +0800, Dennis Ferguson wrote:
> > 
> > find hardware which behaves like this to test on).  I haven't
> > heard anything one way or the other concerning support for MP
> > Alphas, but the implicit message from the current state of
> > the code is that an MP Alpha isn't a NetBSD target.
> 
> We do run on a pretty good variety of multiprocessor Alpha systems.
> Whether it's worth the effort to ensure we continue to do so... well,
> it might be worth at least a policy of "do no harm" (in other words,
> declaring the necessary barrier operation and using it where we notice
> it is needed).

Or you could try to get the kernel to run on a SPARC V9 machine running 
with RMO memory ordering.  There's a lot more of those around.  I'm not 
convinced the existing APIs are sufficient to get that working.

Eduardo


Re: pserialize(9) vs. TAILQ

2014-11-20 Thread Christos Zoulas
In article ,
Eduardo Horvath   wrote:
>On Thu, 20 Nov 2014, Thor Lancelot Simon wrote:
>
>> On Thu, Nov 20, 2014 at 01:05:05PM +0800, Dennis Ferguson wrote:
>> > 
>> > find hardware which behaves like this to test on).  I haven't
>> > heard anything one way or the other concerning support for MP
>> > Alphas, but the implicit message from the current state of
>> > the code is that an MP Alpha isn't a NetBSD target.
>> 
>> We do run on a pretty good variety of multiprocessor Alpha systems.
>> Whether it's worth the effort to ensure we continue to do so... well,
>> it might be worth at least a policy of "do no harm" (in other words,
>> declaring the necessary barrier operation and using it where we notice
>> it is needed).
>
>Or you could try to get the kernel to run on a SPARC V9 machine running 
>with RMO memory ordering.  There's a lot more of those around.  I'm not 

Isn't "RMO Memory Ordering" like "Windows/NT New Technology" or 
"Colgate/FP with Fluoride Protection"? :-)

christos



Re: pserialize(9) vs. TAILQ

2014-11-20 Thread Taylor R Campbell
   Date: Thu, 20 Nov 2014 13:05:05 +0800
   From: Dennis Ferguson 

   Adding a new barrier operation for this would only help if there
   were some determination to also go back and fix the code already
   written to use it, i.e. someone needs to decree that all code
   now needs to be written to theoretically run on an Alpha (the
   support would still only be theoretical since it is hard to
   find hardware which behaves like this to test on).

The attached patch adds membar_datadep_consumer with no changes needed
outside  and alpha code (and any other CPU in the future
that might have a similarly relaxed memory ordering).  I have such
hardware, which I've been planning to use for exactly this purpose.

Writing membar_datadep_consumer makes it clear where interprocessor
communication is happening and where you must tread carefully, so I
see it as a net win for code clarity whether you want to run the code
on alpha or not.
Index: common/lib/libc/arch/alpha/atomic/membar_ops.S
===
RCS file: /cvsroot/src/common/lib/libc/arch/alpha/atomic/membar_ops.S,v
retrieving revision 1.6
diff -p -u -r1.6 membar_ops.S
--- common/lib/libc/arch/alpha/atomic/membar_ops.S	25 May 2008 15:56:11 -0000	1.6
+++ common/lib/libc/arch/alpha/atomic/membar_ops.S	20 Nov 2014 17:50:25 -0000
@@ -87,3 +87,5 @@ ATOMIC_OP_ALIAS(membar_exit,_membar_sync
 STRONG_ALIAS(_membar_exit,_membar_sync)
 ATOMIC_OP_ALIAS(membar_consumer,_membar_sync)
 STRONG_ALIAS(_membar_consumer,_membar_sync)
+ATOMIC_OP_ALIAS(membar_datadep_consumer,_membar_sync)
+STRONG_ALIAS(_membar_datadep_consumer,_membar_sync)
Index: lib/libc/atomic/membar_ops.3
===
RCS file: /cvsroot/src/lib/libc/atomic/membar_ops.3,v
retrieving revision 1.3
diff -p -u -r1.3 membar_ops.3
--- lib/libc/atomic/membar_ops.3	28 Apr 2011 11:56:26 -0000	1.3
+++ lib/libc/atomic/membar_ops.3	20 Nov 2014 17:53:23 -0000
@@ -27,7 +27,7 @@
 .\" ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 .\" POSSIBILITY OF SUCH DAMAGE.
 .\"
-.Dd February 11, 2007
+.Dd November 20, 2014
 .Dt MEMBAR_OPS 3
 .Os
 .Sh NAME
@@ -52,6 +52,8 @@
 .Ft void
 .Fn membar_consumer "void"
 .Ft void
+.Fn membar_datadep_consumer "void"
+.Ft void
 .Fn membar_sync "void"
 .Sh DESCRIPTION
 The
@@ -83,6 +85,40 @@ before any stores after the memory barri
 .It Fn membar_consumer
 All loads preceding the memory barrier will complete before any loads
 after the memory barrier complete.
+.It Fn membar_datadep_consumer
+Same as
+.Fn membar_consumer ,
+but limited to loads of addresses dependent on prior loads, or
+.Sq data-dependent
+loads:
+.Bd -literal -offset indent
+int **pp, *p, v;
+
+p = *pp;
+membar_datadep_consumer();
+v = *p;
+consume(v);
+.Ed
+.Pp
+Does not guarantee ordering of loads in branches, or
+.Sq control-dependent
+loads -- you must use
+.Fn membar_consumer
+instead:
+.Bd -literal -offset indent
+int *ok, *p, v;
+
+if (*ok) {
+   membar_consumer();
+   v = *p;
+   consume(v);
+}
+.Ed
+.Pp
+Most CPUs do not reorder data-dependent loads (i.e., most CPUs
+guarantee that cached values are not stale in that case), so
+.Fn membar_datadep_consumer
+is a no-op on those CPUs.
 .It Fn membar_sync
 All loads and stores preceding the memory barrier will complete and
 reach global visibility before any loads and stores after the memory
@@ -95,3 +131,7 @@ The
 .Nm membar_ops
 functions first appeared in
 .Nx 5.0 .
+The data-dependent load barrier,
+.Fn membar_datadep_consumer ,
+first appeared in
+.Nx 7.0 .
Index: sys/arch/alpha/include/types.h
===
RCS file: /cvsroot/src/sys/arch/alpha/include/types.h,v
retrieving revision 1.49
diff -p -u -r1.49 types.h
--- sys/arch/alpha/include/types.h	25 Jan 2012 18:09:13 -0000	1.49
+++ sys/arch/alpha/include/types.h	20 Nov 2014 17:50:47 -0000
@@ -70,6 +70,7 @@ typedef volatile int __cpu_simple_lock_

#define	__HAVE_NEW_STYLE_BUS_H
#define	__HAVE_ATOMIC_OPERATIONS
+#define	__HAVE_MEMBAR_DATADEP_CONSUMER
#define	__HAVE_CPU_COUNTER
#define	__HAVE_SYSCALL_INTERN
#define	__HAVE_MINIMAL_EMUL
Index: sys/sys/atomic.h
===
RCS file: /cvsroot/src/sys/sys/atomic.h,v
retrieving revision 1.12
diff -p -u -r1.12 atomic.h
--- sys/sys/atomic.h	21 Feb 2014 15:52:53 -0000	1.12
+++ sys/sys/atomic.h	20 Nov 2014 17:50:53 -0000
@@ -156,6 +156,12 @@ void   membar_producer(void);
 void   membar_consumer(void);
 void   membar_sync(void);
 
+#ifdef __HAVE_MEMBAR_DATADEP_CONSUMER
+void   membar_datadep_consumer(void);
+#else
+#define	membar_datadep_consumer()	((void)0)
+#endif
+
 __END_DECLS
 
 #endif /* ! _SYS_ATOMIC_H_ */


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Dennis Ferguson

On 21 Nov, 2014, at 00:22 , Eduardo Horvath  wrote:
> Or you could try to get the kernel to run on a SPARC V9 machine running 
> with RMO memory ordering.  There's a lot more of those around.  I'm not 
> convinced the existing APIs are sufficient to get that working.

It would be worrying if the kernel wouldn't run that way.  The table from
the McKenney paper, reproduced here

http://en.wikipedia.org/wiki/Memory_ordering

indicates that SPARC RMO potentially reorders the same operations as
ARMv7 and POWER.  Someone else would have to comment on how well NetBSD
runs on multiprocessor versions of the latter processors, but it is supposed
to and I haven't noticed any code which omits required barriers for machines
like that (unlike the Alpha, where missing barriers for its unique quirk are
easy to find).  The API seems sufficient.

Beyond the Alpha barrier, the only thing I find lacking about the membar_*()
API is maybe a set of functions that would eliminate the #ifndef/#endif
around

#ifndef __HAVE_ATOMIC_AS_MEMBAR
membar_foo();
#endif

repeated over and over when synchronization is done via atomic_ops(3).
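
A sketch of the kind of helper that would remove that repetition
(hypothetical name; no such function exists at this point):

	static inline void
	membar_exit_unless_atomic(void)
	{
	#ifndef __HAVE_ATOMIC_AS_MEMBAR
		membar_exit();
	#endif
	}

so that each call site shrinks to a single line and the cpp conditional
lives in exactly one place.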

Dennis Ferguson




Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Taylor R Campbell
   Date: Fri, 21 Nov 2014 18:25:02 +0800
   From: Dennis Ferguson 

   Beyond the Alpha barrier, the only thing I find lacking about the membar_*()
   API is maybe a set of functions that would eliminate the #ifndef/#endif
   around

   #ifndef __HAVE_ATOMIC_AS_MEMBAR
   membar_foo();
   #endif

   repeated over and over when synchronization is done via atomic_ops(3).

I've been meaning to add preatomic_membar_foo/postatomic_membar_foo to
address this, since finding typos in the cpp conditional that went
undetected for years.

Not sure it's necessary to distinguish the pre/post cases -- Linux
does with smp_mb__before_atomic/smp_mb__after_atomic, but I'm not sure
that distinction is needed for all our membar_foos, and I haven't
taken the time to research the situation adequately.


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Masao Uebayashi
On Thu, Nov 20, 2014 at 4:45 PM, Dennis Ferguson
 wrote:
>
> On 20 Nov, 2014, at 11:07 , Masao Uebayashi  wrote:
>> On Wed, Nov 19, 2014 at 2:49 PM, Dennis Ferguson
>>  wrote:
>>> That's very true, but the code is only correct if the writes occur
>>> in the order the C code has them written (i.e. tqe_next in the
>>> new element is written before writing the tqe_next of the existing
>>> list element to point at it).  To ensure the compiler or the
>>> cache hardware doesn't reorder those writes it needs a write barrier
>>> inserted to enforce the order as written.
>>
>> Yes, in the pserialize(9) + TAILQ() case, everything except tqe_next
>> is only written by writers, and protected with a mutex, which ensures
>> (implies) ordering.  What is ordered by what has to be clearly
>> documented.
>
> We might be on the same page, but your response is confusing me enough
> that it might be useful to make this explicit.  The guts of the current
> TAILQ_INSERT_TAIL() macro look like this:
>
> (elm)->field.tqe_next = TAILQ_END(head);
> (elm)->field.tqe_prev = (head)->tqh_last;
> *(head)->tqh_last = (elm);
> (head)->tqh_last = &(elm)->field.tqe_next;
>
> In standard C (assuming no volatile variable declarations) this replacement
> sequence is exactly equivalent to the above
>
> *(head)->tqh_last = (elm);
> (elm)->field.tqe_prev = (head)->tqh_last;
> (head)->tqh_last = &(elm)->field.tqe_next;
> (elm)->field.tqe_next = TAILQ_END(head);
>
> because the end result of executing the 4 lines in either order is identical
> and, in a single-threaded process, that is all that matters.  Since the
> sequences are equivalent the compiler's optimizer and the machine's cache
> hardware are free to turn one sequence into the other if there is a
> performance advantage gained by doing so.
>
> If (and only if) there may be a concurrent reader of the structure it is
> pretty clear that the second sequence can't work; a concurrent reader may
> see (elm)->field.tqe_next uninitialized.  What is also true, but not so
> clear, is that the while the first sequence looks right it is equally 
> incorrect
> since it is equivalent to the second.  The code is only correct if the 
> required
> ordering is enforced by a barrier, e.g.
>
> (elm)->field.tqe_next = TAILQ_END(head);
> (elm)->field.tqe_prev = (head)->tqh_last;
> membar_producer();
> *(head)->tqh_last = (elm);
> (head)->tqh_last = &(elm)->field.tqe_next;
>
> That tells the compiler to avoid reordering the stores done in code before
> the barrier to after the barrier, and vice versa, and to do what is needed
> (if anything) to tell the cache hardware to do the same (on an x86 only
> the compiler barrier is necessary).  It makes the ordering you need
> explicit.

OK, I overlooked the obvious "new elm's next" -> "prev elm's next" ordering.

> The conclusion is that while the current TAILQ_REMOVE() macro will work
> unchanged to maintain a list with concurrent readers (given no requirement
> to run on an MP DEC Alpha) the TAILQ_INSERT_*() macros will not, even
> though they look correct.  The mutex and/or the use of pserialize() does
> nothing to change this.  If you want to insert things into a TAILQ list
> without blocking concurrent readers you must use alternative macros that
> have membar_producer() calls in the right spots.

I completely disagree with your conclusion.

Your basic idea seems to be that ``membar_{producer,consumer}() are cheaper
than membar_{enter,leave}(), let's use the cheaper ones''.  That way you end
up concluding that you need a memory barrier in every iteration.  That
is contradictory to what you previously said.

The point is, pserialize(9) is totally different from pcq(9).

In pcq(9), like pkgqueue(9), the data structure is updated all the
time by producers and consumers.  In pserialize(9), like ifnet_list,
the data structure is updated extremely rarely.  (How many times is
ifnet_list updated on your machine a day?)

The problem of TAILQ_INSERT_*() macros with respect to pserialize(9)
is that, as you are pointing out, they update new elm's next pointer
and prev elm's next pointer almost simultaneously.  So you have to
ensure that ordering in the reader's code path.  But consider that
this is pserialize(9), where the target elements (E_p and E_q) are
unlikely to move while inserting a new element (E_new).  There is no
point in updating E_new.next in the "update" section.  If E_new.next has
been updated *long enough before* E_p.next is updated, that ordering
is not a problem at all.

You are also ignoring that mutex_{enter,exit}() imply membar_{enter,leave}().

This pseudo code is what I have come up with.  Still WIP, but at least
solves E_new.next -> E_p.next ordering.
int
insert(E *E_new, ...)
{
	/* Insert E_new between E_p and E_q. */
	E *E_p, *E_q;

	/* Generation. */
	int gen;

retry:
	E_p = E_q = NULL;

	/*
	 * Find the target entries.
	 */
	pserialize_read_enter();
	TAILQ_FOREACH(..

Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Eduardo Horvath
On Fri, 21 Nov 2014, Dennis Ferguson wrote:

> On 21 Nov, 2014, at 00:22 , Eduardo Horvath  wrote:
> > Or you could try to get the kernel to run on a SPARC V9 machine running 
> > with RMO memory ordering.  There's a lot more of those around.  I'm not 
> > convinced the existing APIs are sufficient to get that working.
> 
> It would be worrying if the kernel wouldn't run that way.  The table from
> the McKenney paper, reproduced here
> 
> http://en.wikipedia.org/wiki/Memory_ordering
> 
> indicates that SPARC RMO potentially reorders the same operations as
> ARMv7 and POWER.  Someone else would have to comment on how well NetBSD
> runs on multiprocessor versions of the latter processors, but it is supposed
> to and I haven't noticed any code which omits required barriers for machines
> like that (unlike the Alpha, where missing barriers for its unique quirk are
> easy to find).  The API seems sufficient.

The last time I tried it the kernel would only run in TSO mode on SPARC 
machines.  Mutexes just didn't work.

Eduardo


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Taylor R Campbell
   Date: Sat, 22 Nov 2014 01:03:58 +0900
   From: Masao Uebayashi 

   The problem of TAILQ_INSERT_*() macros with respect to pserialize(9)
   is that, as you are pointing out, they update new elm's next pointer
   and prev elm's next pointer almost simultaneously.  So you have to
   ensure that ordering in the reader's code path.

Dennis's point was that on most CPUs this ordering is guaranteed
already.  My point was that that is not true on all CPUs, hence my
proposal for a membar_datadep_consumer which is a no-op on most CPUs.

pserialize_read_enter();
TAILQ_FOREACH(...) {
if (...) {
/* Ensure that E_p.next and generation are sync'ed. */
mutex_spin_enter();

It matters what's in the ...'s.  Are you doing this?

TAILQ_FOREACH(e, head, next) {
if (e->key == key) {
mutex_spin_enter(&e->lock);
...

If so, the reader may see a stale e->key -- the membar_enter implied
by mutex_spin_enter is too late.  The reader needs a memory barrier
after loading e, and before loading anything e points to:

TAILQ_FOREACH(e, head, next) {
membar_datadep_consumer();
if (e->key == key) {
mutex_spin_enter(&e->lock);
...

I'm not clear on what you're trying to solve by versioning the tailqs,
but I don't think it will be necessary in the end for the purposes we
have in mind.

I expanded the example code in pserialize(9).  Perhaps it will be a
little more instructive now.


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Masao Uebayashi
On Sat, Nov 22, 2014 at 1:22 AM, Taylor R Campbell
 wrote:
>Date: Sat, 22 Nov 2014 01:03:58 +0900
>From: Masao Uebayashi 
>
>The problem of TAILQ_INSERT_*() macros with respect to pserialize(9)
>is that, as you are pointing out, they update new elm's next pointer
>and prev elm's next pointer almost simultaneously.  So you have to
>ensure that ordering in the reader's code path.
>
> Dennis's point was that on most CPUs this ordering is guaranteed
> already.  My point was that that is not true on all CPUs, hence my
> proposal for a membar_datadep_consumer which is a no-op on most CPUs.
>
> pserialize_read_enter();
> TAILQ_FOREACH(...) {
> if (...) {
> /* Ensure that E_p.next and generation are sync'ed. */
> mutex_spin_enter();
>
> It matters what's in the ...'s.  Are you doing this?
>
> TAILQ_FOREACH(e, head, next) {
> if (e->key == key) {
> mutex_spin_enter(&e->lock);
> ...
>
> If so, the reader may see a stale e->key -- the membar_enter implied
> by mutex_spin_enter is too late.

The promise there is that e->key is constant while it's read under
pserialize_read_{enter,leave}(9).  If those members have to be
changed, the element has to be removed from the list, changed, and re-added.

> The reader needs a memory barrier
> after loading e, and before loading anything e points to:
>
> TAILQ_FOREACH(e, head, next) {
> membar_datadep_consumer();
> if (e->key == key) {
> mutex_spin_enter(&e->lock);
> ...
>
> I'm not clear on what you're trying to solve by versioning the tailqs,
> but I don't think it will be necessary in the end for the purposes we
> have in mind.

I added it because I think it is necessary.


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Masao Uebayashi
On Sat, Nov 22, 2014 at 1:31 AM, Masao Uebayashi  wrote:
> On Sat, Nov 22, 2014 at 1:22 AM, Taylor R Campbell
>  wrote:
>>Date: Sat, 22 Nov 2014 01:03:58 +0900
>>From: Masao Uebayashi 
>>
>>The problem of TAILQ_INSERT_*() macros with respect to pserialize(9)
>>is that, as you are pointing out, they update new elm's next pointer
>>and prev elm's next pointer almost simultaneously.  So you have to
>>ensure that ordering in the reader's code path.
>>
>> Dennis's point was that on most CPUs this ordering is guaranteed
>> already.  My point was that that is not true on all CPUs, hence my
>> proposal for a membar_datadep_consumer which is a no-op on most CPUs.
>>
>> pserialize_read_enter();
>> TAILQ_FOREACH(...) {
>> if (...) {
>> /* Ensure that E_p.next and generation are sync'ed. 
>> */
>> mutex_spin_enter();
>>
>> It matters what's in the ...'s.  Are you doing this?
>>
>> TAILQ_FOREACH(e, head, next) {
>> if (e->key == key) {
>> mutex_spin_enter(&e->lock);
>> ...
>>
>> If so, the reader may see a stale e->key -- the membar_enter implied
>> by mutex_spin_enter is too late.
>
> The promise there is that, e->key is constant while it's read with
> pserialize_read_{enter,leave}(9).  If those members have to be

This should have been read as:

e->key is constant while it's globally visible via a TAILQ protected
with pserialize(9)


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Taylor R Campbell
   Date: Sat, 22 Nov 2014 01:31:48 +0900
   From: Masao Uebayashi 

   On Sat, Nov 22, 2014 at 1:22 AM, Taylor R Campbell
wrote:
   > It matters what's in the ...'s.  Are you doing this?
   >
   > TAILQ_FOREACH(e, head, next) {
   > if (e->key == key) {
   > mutex_spin_enter(&e->lock);
   > ...
   >
   > If so, the reader may see a stale e->key -- the membar_enter implied
   > by mutex_spin_enter is too late.

   The promise there is that e->key is constant while it's read under
   pserialize_read_{enter,leave}(9).  If those members have to be
   changed, the element has to be removed from the list, changed, and re-added.

The problem is not that e->key may change while e is in the list -- it
won't (whoever wants to change it must, as you suggest, remove e from
the list, change it, and then re-add it).

The problem is that when CPU 0 does

e = pool_get(&e_pool, PR_WAITOK);

then e->key, at address 0xdeadbeef, will have some garbage
uninitialized value, say 0xa5a5a5a5, which CPU 1 may have cached.

Next, CPU 0 does:

e->key = 12345;

mutex_enter(lock);
e->next = eq;
membar_producer();
eq = e;
mutex_exit(lock);

From CPU 0's perspective, the stores have been issued to memory in the
correct order: nobody who loads e from the memory at eq, and then
loads e->key from memory at 0xdeadbeef, will see 0xa5a5a5a5 -- anyone
who issues loads to memory in that order will see 12345.

But CPU 1 already has the contents of 0xdeadbeef cached, so it won't
load from memory there.  Instead, when it does

s = pserialize_read_enter();
for (e = eq; e != NULL; e = e->next) {
if (e->key == 12345)
...
},

it will load e from memory at eq (or maybe from its cache), and then
use its cached value for the address 0xdeadbeef, which is 0xa5a5a5a5.
Nothing prevented CPU 1 from issuing the loads in the `wrong' order --
CPU 1 had already issued a load for 0xdeadbeef a long time ago, which
at the time correctly yielded 0xa5a5a5a5.

In order for CPU 1 to guarantee that, after it loads e, it will load
e->key from memory instead of using the cache, it must issue a
membar_datadep_consumer:

s = pserialize_read_enter();
for (e = eq; e != NULL; e = e->next) {
membar_datadep_consumer();
if (e->key == 12345)
...
}

This is a general heuristic for memory barriers: when communicating
from one CPU to another CPU, there should be a matched pair of memory
barriers.


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Masao Uebayashi
On Sat, Nov 22, 2014 at 5:38 AM, Taylor R Campbell
 wrote:
>Date: Sat, 22 Nov 2014 01:31:48 +0900
>From: Masao Uebayashi 
>
>On Sat, Nov 22, 2014 at 1:22 AM, Taylor R Campbell
> wrote:
>> It matters what's in the ...'s.  Are you doing this?
>>
>> TAILQ_FOREACH(e, head, next) {
>> if (e->key == key) {
>> mutex_spin_enter(&e->lock);
>> ...
>>
>> If so, the reader may see a stale e->key -- the membar_enter implied
>> by mutex_spin_enter is too late.
>
>The promise there is that e->key is constant while it's read under
>pserialize_read_{enter,leave}(9).  If those members have to be
>changed, the element has to be removed from the list, changed, and re-added.
>
> The problem is not that e->key may change while e is in the list -- it
> won't (whoever wants to change it must, as you suggest, remove e from
> the list, change it, and then re-add it).
>
> The problem is that when CPU 0 does
>
> e = pool_get(&e_pool, PR_WAITOK);
>
> then e->key, at address 0xdeadbeef, will have some garbage
> uninitialized value, say 0xa5a5a5a5, which CPU 1 may have cached.
>
> Next, CPU 0 does:
>
> e->key = 12345;
>
> mutex_enter(lock);
> e->next = eq;
> membar_producer();
> eq = e;
> mutex_exit(lock);
>
> From CPU 0's perspective, the stores have been issued to memory in the
> correct order: nobody who loads e from the memory at eq, and then
> loads e->key from memory at 0xdeadbeef, will see 0xa5a5a5a5 -- anyone
> who issues loads to memory in that order will see 12345.
>
> But CPU 1 already has the contents of 0xdeadbeef cached, so it won't
> load from memory there.  Instead, when it does
>
> s = pserialize_read_enter();
> for (e = eq; e != NULL; e = e->next) {
> if (e->key == 12345)
> ...
> },
>
> it will load e from memory at eq (or maybe from its cache), and then
> use its cached value for the address 0xdeadbeef, which is 0xa5a5a5a5.
> Nothing prevented CPU 1 from issuing the loads in the `wrong' order --
> CPU 1 had already issued a load for 0xdeadbeef a long time ago, which
> at the time correctly yielded 0xa5a5a5a5.
>
> In order for CPU 1 to guarantee that, after it loads e, it will load
> e->key from memory instead of using the cache, it must issue a
> membar_datadep_consumer:
>
> s = pserialize_read_enter();
> for (e = eq; e != NULL; e = e->next) {
> membar_datadep_consumer();
> if (e->key == 12345)
> ...
> }
>
> This is a general heuristic for memory barriers: when communicating
> from one CPU to another CPU, there should be a matched pair of memory
> barriers.

You are right that "key" has to be volatile.  But otherwise, you are
making things too complex, or too generic, which is not what I'm
expecting from a TAILQ specialized for pserialize(9).

Consider struct ifnet.  IFNET_FOREACH() iterates ifnet_list to
look up a wanted ifnet, most typically by matching the interface's
addresses.  How often does an interface's address change in practice?
Very rarely, or probably never, once the ifnet has been added to the
ifnet_list.

It is the user's choice whether to use fast-changing values as a key or
not, but if you decide to change values visible to pserialize(9)
readers, you'll end up having to ensure memory ordering.  It is your
choice, but beyond pserialize(9)'s responsibility.

I don't understand why you resort only to memory barriers.
pserialize(9) doesn't need those tricks, because pserialize(9) relies
on another trick: help from the scheduler.  Your approach might work even
without pserialize(9).  Too safe and too excessive.


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Masao Uebayashi
On Sat, Nov 22, 2014 at 11:49 AM, Masao Uebayashi  wrote:
> It is the user's choice whether to use fast-changing values as a key or
> not, but if you decide to change values visible to pserialize(9)
> readers, you'll end up having to ensure memory ordering.  It is your
> choice, but beyond pserialize(9)'s responsibility.

Correction: changing values can be read by volatile.  No memory
barrier is needed if there's no memory ordering constraint.


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Dennis Ferguson

On 22 Nov, 2014, at 00:14 , Eduardo Horvath  wrote:
> The last time I tried it the kernel would only run in TSO mode on SPARC 
> machines.  Mutexes just didn't work.

I'll guess one problem is in sparc/mutex.h, here:

#define MUTEX_RECEIVE(mtx)  /* nothing */
#define MUTEX_GIVE(mtx) /* nothing */

This works with TSO, but with RMO they need to somehow generate
hardware memory barriers.  See arm/mutex.h for an example with
them filled in.
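
For comparison, filled-in definitions would be something along these
lines (a sketch only; whether these particular barriers are the right
ones for SPARC V9 RMO would need checking against the architecture
manual):

	#define MUTEX_RECEIVE(mtx)	membar_enter()	/* after acquiring:
							 * keep later accesses
							 * inside the section */
	#define MUTEX_GIVE(mtx)		membar_exit()	/* before releasing:
							 * keep earlier accesses
							 * inside the section */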

Dennis Ferguson


Re: pserialize(9) vs. TAILQ

2014-11-21 Thread Dennis Ferguson

On 22 Nov, 2014, at 11:56 , Masao Uebayashi  wrote:
> On Sat, Nov 22, 2014 at 11:49 AM, Masao Uebayashi  wrote:
>> It is the user's choice whether to use fast-changing values as a key or
>> not, but if you decide to change values visible to pserialize(9)
>> readers, you'll end up having to ensure memory ordering.  It is your
>> choice, but beyond pserialize(9)'s responsibility.
> 
> Correction: changing values can be read by volatile.  No memory
> barrier is needed if there's no memory ordering constraint.

Using volatile is very frequently a bug.  You have two closely
related problems to deal with on a multiprocessor, that the
compiler may reorder or eliminate memory accesses and that the
processor cache hardware may reorder or eliminate memory accesses.
If you attempt to fix a problem with volatile (and you seem to
realize there is a problem if you are using volatile to fix it)
you may fix the compiler problem but do nothing to fix the cache
hardware problem.  A barrier fixes both at the same time.
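
A minimal sketch of the difference, assuming the usual
flag-publishes-data pattern:

	int data;		/* payload */
	volatile int ready;	/* publication flag */

	/* writer, buggy: */
	data = 42;
	ready = 1;	/* volatile only constrains accesses to 'ready'
			 * relative to other volatile accesses; both the
			 * compiler and the cache hardware may still let
			 * the store to 'data' drift after it */

	/* writer, correct: */
	data = 42;
	membar_producer();	/* orders the stores for the compiler
				 * and the hardware alike */
	ready = 1;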

Dennis Ferguson

Re: pserialize(9) vs. TAILQ

2014-11-22 Thread Martin Husemann
On Sat, Nov 22, 2014 at 01:24:42PM +0800, Dennis Ferguson wrote:
> I'll guess one problem is in sparc/mutex.h, here:
> 
> #define MUTEX_RECEIVE(mtx)  /* nothing */
> #define MUTEX_GIVE(mtx) /* nothing */
> 
> This works with TSO, but with RMO they need to somehow generate
> hardware memory barriers.  See arm/mutex.h for an example with
> them filled in.

Or src/sys/arch/sparc64/include/mutex.h - IIRC sparc v8 and earlier do not
have RMO.

Martin


Re: pserialize(9) vs. TAILQ

2014-11-22 Thread Taylor R Campbell
   Date: Sat, 22 Nov 2014 11:49:34 +0900
   From: Masao Uebayashi 

   You are right that "key" has to be volatile.  But otherwise, you are
   making things too complex, or too generic, which is not what I'm
   expecting from a TAILQ specialized for pserialize(9).

No, volatile is unrelated to multiprocessor memory ordering issues.
It doesn't prevent the CPU from reordering loads from RAM.  It only
forces the C compiler to compile

x = e->key;
f(e->key);

into code that instructs the CPU to read e->key twice.

The CPU may nevertheless reorder loads from RAM, or read it from its
cache instead of loading from RAM.  That is, when the CPU executes the
instructions

movq 16(%rax),%rbx  ; x = e->key
movq 16(%rax),%rdi  ; arg0 = e->key
callq f

then

(a) the CPU may load from RAM first into %rdi and next into %rbx, or

(b) the CPU may load once from RAM at %rax + 16 into its cache and
then execute both movq instructions from the cache, or

(c) the CPU may already have %rax + 16 cached and may not issue any
load requests to RAM at all.

In the case of pserialized queues,

for (e = eq; e != NULL; e = e->next) {
if (e->key == 12345)
...
}

the x86 code will look something more like this:

movq eq(%rip),%rax  ; Load eq into %rax (e).
testq %rax,%rax ; Test for null.
je out  ; Branch if null.
movq 8(%rax),%rdi   ; Load e->key into %rdi.

The CPU may have a cached value for the location %rax + 8 in RAM.  In
that case, even if it executes the instructions in order, and even if
it issues a load request to RAM for eq, it may not issue a new load
request to RAM for %rax + 8 because it already happens to have that
location cached[*].

Volatile does not help here: the C compiler generated instructions in
the right order.  It's the CPU that requires a memory barrier to
guarantee that it won't issue loads to RAM out of the order we need:

for (e = eq; e != NULL; e = e->next) {
membar_datadep_consumer();
if (e->key == 12345)
...
}

movq eq(%rip),%rax  ; Load eq into %rax (e).
testq %rax,%rax ; Test for null.
je out  ; Branch if null.
lfence  ; Force CPU to load e->key from RAM.
movq 8(%rax),%rdi   ; Load e->key into %rdi.


[*] The x86 architecture happens to guarantee that if whoever inserted
the entry issues a store barrier (membar_producer) after initializing
e->key and before setting eq = e, this situation won't happen.  But
that is not guaranteed on every architecture, and if the code had a
control-dependent load instead of a data-dependent load, say

int ok[128];
int value[128];

for (i = 0; i < 128; i++) {
if (ok[i])
return value[i];
},

then membar_consumer/lfence between ok[i] and value[i] would be
necessary on x86 in practice -- even if we qualified ok and value with
volatile.

   It is the user's choice whether to use fast-changing values as a key or
   not, but if you decide to change values visible to pserialize(9)
   readers, you'll end up having to ensure memory ordering.

As I said, this is not about changing e->key while e is in the queue.
It is about making sure readers read the correct value of e->key after
a writer puts e in the queue the first time.


Re: pserialize(9) vs. TAILQ

2014-11-22 Thread Dennis Ferguson

On 23 Nov, 2014, at 01:01 , Martin Husemann  wrote:

> On Sat, Nov 22, 2014 at 01:24:42PM +0800, Dennis Ferguson wrote:
>> I'll guess one problem is in sparc/mutex.h, here:
>> 
>>#define MUTEX_RECEIVE(mtx)  /* nothing */
>>#define MUTEX_GIVE(mtx) /* nothing */
>> 
>> This works with TSO, but with RMO they need to somehow generate
>> hardware memory barriers.  See arm/mutex.h for an example with
>> them filled in.
> 
> Or src/sys/arch/sparc64/include/mutex.h - IIRC sparc v8 and earlier do not
> have RMO.

Ah, got it.  I'd now guess that in

   src/common/lib/libc/arch/sparc64/atomic/membar_ops.S

this comment

/* These assume Total Store Order (TSO) */

suggests the problem.

Dennis Ferguson


Re: pserialize(9) vs. TAILQ

2014-11-22 Thread Dennis Ferguson

On 23 Nov, 2014, at 01:52 , Taylor R Campbell 
 wrote:
> [*] The x86 architecture happens to guarantee that if whoever inserted
> the entry issues a store barrier (membar_producer) after initializing
> e->key and before setting eq = e, this situation won't happen.  But
> that is not guaranteed on every architecture, and if the code had a
> control-dependent load instead of a data-dependent load, say
> 
>   int ok[128];
>   int value[128];
> 
>   for (i = 0; i < 128; i++) {
>   if (ok[i])
>   return value[i];
>   },
> 
> then membar_consumer/lfence between ok[i] and value[i] would be
> necessary on x86 in practice -- even if we qualified ok and value with
> volatile.

While this horse is certainly dead enough, the above isn't true for the
x86.  The x86 (Intel, and anything else recent enough to support
the 64 bit instruction set, at least) guarantees that if you start
off with

volatile int x = 0;
volatile int y = 0;

and then a writer does

 x = 1;
 y = 1;

while a reader does

 int my_x, my_y;

 my_y = y;
 my_x = x;

it is guaranteed that (my_y == 1 && my_x == 0) will never be true
even in the absence of explicit barrier instructions (this is
from section 7.2.3.2 of volume 3 of the Intel 64 and IA-32
Architectures Software Developer's Manual).  lfence and sfence
instructions are never needed for ordinary loads and stores; I think
they're required only for certain odd-ball SSE load and store
instructions.

Of course these are better written with membar_producer()
and membar_consumer(), rather than volatile, but on an x86
both functions can be

asm volatile("" ::: "memory")

(like their Linux equivalents are defined) since only a
compiler barrier is needed.
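
To make that concrete, hypothetical x86-only definitions might look
like this (a sketch, not NetBSD's actual implementation):

    /* On x86 (TSO) these need only restrain the compiler. */
    static inline void
    my_membar_producer(void)
    {
        asm volatile("" ::: "memory");  /* compiler barrier only */
    }

    static inline void
    my_membar_consumer(void)
    {
        asm volatile("" ::: "memory");  /* compiler barrier only */
    }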

The reason to (almost) always prefer membar_*() to volatile
is that the barrier functions do the right thing on all machines,
rather than producing buggy code that happens to test okay
on an x86, and the barriers also tend to impose more minimal
constraints on the compiler's optimizer so it can do a
better job on the code on either side of the barrier.

Dennis Ferguson

Re: pserialize(9) vs. TAILQ

2014-11-23 Thread Dennis Ferguson

On 22 Nov, 2014, at 00:03 , Masao Uebayashi  wrote:

> On Thu, Nov 20, 2014 at 4:45 PM, Dennis Ferguson
>  wrote:
> OK, I overlooked the obvious "new elm's next" -> "prev elm's next" ordering.
> 
>> The conclusion is that while the current TAILQ_REMOVE() macro will work
>> unchanged to maintain a list with concurrent readers (given no requirement
>> to run on an MP DEC Alpha) the TAILQ_INSERT_*() macros will not, even
>> though they look correct.  The mutex and/or the use of pserialize() does
>> nothing to change this.  If you want to insert things into a TAILQ list
>> without blocking concurrent readers you must use alternative macros that
>> have membar_producer() calls in the right spots.
> 
> I completely disagree with your conclusion.
> 
> Your basic idea seems to be that ``membar_{producer,consumer}() are cheaper
> than membar_{enter,exit}, let's use the cheaper ones''.  That way you end
> up concluding that you need a memory barrier in every iteration.  That
> is contradictory to what you previously said.

You need to fix the store ordering in TAILQ_INSERT_*().  If you want
to use a different barrier to do it that's fine with me.

I was happy with no barrier in the reader when I thought providing
support for the DEC Alpha architecture's unique cache quirk wasn't
necessary.  When it turned out that support for the Alpha was necessary,
the reader needed to have a barrier inserted that looks like it gets
executed after every pointer read in the reader.  Note, however, that this is
a barrier which does nothing at all on any machine other than an
Alpha. Nothing has changed, nothing has been contradicted, no cost
has been added to readers.  Taylor might like to see barrier
operations paired up in C code but there is no pairing at the
hardware instruction level.  The reader barrier has no cost on any
machine other than the Alpha, the only machine which has an
architectural requirement to generate an instruction there.

I understand why you are unhappy with the barrier.  It seems
stupid to have to execute a not-inexpensive hardware instruction
over and over and over again in the reader to deal with something
that a writer is almost never going to do, but since you have to do
this on an Alpha (and only an Alpha) to make this code correct,
why can't the conclusion be that it sucks to be a DEC Alpha,
and just get on with it?  This is probably a perfect example
of why no processor other than the Alpha has had this quirk in
its cache architecture, nor is there likely to be another in
the future; this quirk has a cost that is annoying.  Indeed, judging
from the fact that people think NetBSD runs okay on an MP Alpha
without this barrier, it may be that even later DEC Alpha
implementations omitted this misfeature.

> point to update E_new.next in the "update" section.  If E_new.next has
> been updated *long enough before* E_p.next is updated, that ordering
> is not a problem at all.

"long enough" is not a specific length of time.  If you can
find a spec for the amount of time which is "long enough" you
can wait that long instead.  Just calling pserialize_perform()
when inserting entries (where no pseralize_perform() is required)
because pserialize_perform() is a fantastically expensive function
and takes a long time, doesn't do anything if you don't know
how long is "long enough", not to mention that you'd be executing
this extra-expensive DELAY() on all archectures to eliminate a cost
elsewhere that is unique to the DEC Alpha.

This is only a problem for the Alpha.  The barrier fixes it for
the Alpha without costing other architectures anything.  That seems
like the right thing to do.

Dennis Ferguson



Re: pserialize(9) vs. TAILQ

2014-11-23 Thread Masao Uebayashi
On Sat, Nov 22, 2014 at 2:24 PM, Dennis Ferguson
 wrote:
> I'll guess one problem is in sparc/mutex.h, here:
>
> #define MUTEX_RECEIVE(mtx)  /* nothing */
> #define MUTEX_GIVE(mtx) /* nothing */

I think that these macros should be replaced with
membar_enter()/membar_exit() respectively in sys/kern/kern_mutex.c.

The only reason I can think of for these to exist is that they
predate the membar ops.
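
In other words, roughly this (a sketch of the substitution, not a
tested patch):

    #define MUTEX_RECEIVE(mtx)  membar_enter()  /* on acquisition */
    #define MUTEX_GIVE(mtx)     membar_exit()   /* on release */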


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Masao Uebayashi
On Sun, Nov 23, 2014 at 5:00 PM, Dennis Ferguson
 wrote:
> I was happy with no barrier in the reader when I thought providing
> support for the DEC Alpha architecture's unique cache quirk wasn't
> necessary.  When it turned out that support for the Alpha was necessary,
> the reader needed to have a barrier inserted that looks like it gets
> executed after every pointer read in the reader.  Note, however, that this is
> a barrier which does nothing at all on any machine other than an
> Alpha. Nothing has changed, nothing has been contradicted, no cost
> has been added to readers.  Taylor might like to see barrier
> operations paired up in C code but there is no pairing at the
> hardware instruction level.  The reader barrier has no cost on any
> machine other than the Alpha, the only machine which has an
> architectural requirement to generate an instruction there.
(snip)

I'm lost.  You wrote:

http://mail-index.netbsd.org/tech-kern/2014/11/20/msg018048.html

> The conclusion is that while the current TAILQ_REMOVE() macro will work
> unchanged to maintain a list with concurrent readers (given no requirement
> to run on an MP DEC Alpha) the TAILQ_INSERT_*() macros will not, even
> though they look correct.  The mutex and/or the use of pserialize() does
> nothing to change this.  If you want to insert things into a TAILQ list
> without blocking concurrent readers you must use alternative macros that
> have membar_producer() calls in the right spots.

- TAILQ_REMOVE() will work (except on Alpha)
- TAILQ_INSERT_*() will not (on Alpha and some others)
  - TAILQ_INSERT_*() need membar_producer()
  - TAILQ_FOREACH() needs membar_consumer()

Am I reading English correctly?


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Taylor R Campbell
   Date: Mon, 24 Nov 2014 21:44:11 +0900
   From: Masao Uebayashi 

   - TAILQ_REMOVE() will work (except on Alpha)
   - TAILQ_INSERT_*() will not (on Alpha and some others)
 - TAILQ_INSERT_*() need membar_producer()
 - TAILQ_FOREACH() needs membar_consumer()

- TAILQ_REMOVE works as is everywhere (unless you enable QUEUEDEBUG).
- TAILQ_INSERT_* need membar_producer everywhere.
- TAILQ_FOREACH needs membar_consumer on alpha.

To clarify this situation, I added _PSZ versions of all of these in
the patch I sent earlier that will

(a) do the right thing everywhere, and
(b) mark where you're sharing a queue with pserialize.
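
For illustration, an insert-head along those lines might look like
this (a hedged sketch with an assumed name; the actual patch may
differ):

    #define TAILQ_INSERT_HEAD_PSZ(head, elm, field) do {            \
        (elm)->field.tqe_next = (head)->tqh_first;                  \
        (elm)->field.tqe_prev = &(head)->tqh_first;                 \
        if ((head)->tqh_first != NULL)                              \
            (head)->tqh_first->field.tqe_prev =                     \
                &(elm)->field.tqe_next;                             \
        else                                                        \
            (head)->tqh_last = &(elm)->field.tqe_next;              \
        membar_producer();  /* elm fully formed before... */        \
        (head)->tqh_first = (elm);  /* ...this store publishes it */\
    } while (/*CONSTCOND*/0)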


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Masao Uebayashi
I think I understand the context.  membar_consumer() can be omitted
because forward list iteration happens to be a "dependent load"
(except on Alpha).

My confusion came from thinking that memory ordering of loads is more
flexible in general.  I also didn't quite understand "dependent load".
I have only been reading NetBSD kernel code to learn memory ordering,
and I believe nothing in the tree utilizes this topological
"dependent load".  I may be missing something though.

On Mon, Nov 24, 2014 at 11:21 PM, Taylor R Campbell
 wrote:
>Date: Mon, 24 Nov 2014 21:44:11 +0900
>From: Masao Uebayashi 
>
>- TAILQ_REMOVE() will work (except on Alpha)
>- TAILQ_INSERT_*() will not (on Alpha and some others)
>  - TAILQ_INSERT_*() need membar_producer()
>  - TAILQ_FOREACH() needs membar_consumer()
>
> - TAILQ_REMOVE works as is everywhere (unless you enable QUEUEDEBUG).
> - TAILQ_INSERT_* need membar_producer everywhere.
> - TAILQ_FOREACH needs membar_consumer on alpha.
>
> To clarify this situation, I added _PSZ versions of all of these in
> the patch I sent earlier that will
>
> (a) do the right thing everywhere, and
> (b) mark where you're sharing a queue with pserialize.

I'm still not sure what part is really specific to pserialize(9) or not.


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Eduardo Horvath
On Sun, 23 Nov 2014, Dennis Ferguson wrote:

> On 23 Nov, 2014, at 01:01 , Martin Husemann  wrote:
> 
> > On Sat, Nov 22, 2014 at 01:24:42PM +0800, Dennis Ferguson wrote:
> >> I'll guess one problem is in sparc/mutex.h, here:
> >> 
> >>#define MUTEX_RECEIVE(mtx)  /* nothing */
> >>#define MUTEX_GIVE(mtx) /* nothing */
> >> 
> >> This works with TSO, but with RMO they need to somehow generate
> >> hardware memory barriers.  See arm/mutex.h for an example with
> >> them filled in.
> > 
> > Or src/sys/arch/sparc64/include/mutex.h - IIRC sparc v8 and earlier do not
> > have RMO.
> 
> Ah, got it.  I'd now guess that in
> 
>src/common/lib/libc/arch/sparc64/atomic/membar_ops.S
> 
> this comment
> 
> /* These assume Total Store Order (TSO) */
> 
> suggests the problem.

Yes, the existing code assumes TSO.  A while back I was looking into
fixing that.

I enhanced membar_ops with proper memory barriers and then was looking at 
the mutex code.  Unfortunately, I didn't get very far.  It seemed at the 
time that the mutex code has two hooks for memory barriers after the 
atomic operations, however it's missing memory barrier hooks to ensure 
consistency before accessing the lock.

Eduardo


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Taylor R Campbell
   Date: Tue, 25 Nov 2014 00:33:11 +0900
   From: Masao Uebayashi 

   I think I understand the context.  membar_consumer() can be omitted
   because forward list iteration happens to be a "dependent load"
   (except on Alpha).

Just to clarify a tiny bit:  TAILQ_FOREACH uses data-dependent loads,
no matter what CPU architecture it runs on.  A data-dependent load is
anything like what CPU 2 is doing here:

CPU 1                           CPU 2

A = 0
A = 5
membar_producer()
*pp = &A
                                p = *pp
                                v = *p  <--- Data-dependent load:
                                             load address depends on
                                             data from previous load.

On most CPU architectures, v is guaranteed to be 5; on alpha, if you
want v to be 5, you must insert a membar_consumer between `p = *pp'
and `v = *p'.

Writing membar_datadep_consumer there would (a) add no overhead to
CPUs that provide the guarantee, (b) make the code work on CPUs that
don't, and (c) explain to anyone who is reading the code that
interprocessor memory ordering is important here.
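
In code, that reads (a sketch):

    p = *pp;
    membar_datadep_consumer();  /* no-op everywhere but alpha */
    v = *p;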

I say `data-dependent load', not just `dependent load', because there
are also control-dependent loads, as in

    if (ok[i])
        v = value[i];

which require membar_consumer even on CPUs other than alpha.

   My confusion came from thinking that memory ordering of loads is more
   flexible in general.  I also didn't quite understand "dependent load".
   I have only been reading NetBSD kernel code to learn memory ordering,
   and I believe nothing in the tree utilizes this topological
   "dependent load".  I may be missing something though.

We do have at least one use of nontrivial data-dependent loads
in-tree, in pcq(9), as Dennis noted.

   On Mon, Nov 24, 2014 at 11:21 PM, Taylor R Campbell
wrote:
   > To clarify this situation, I added _PSZ versions of all of these in
   > the patch I sent earlier that will
   >
   > (a) do the right thing everywhere, and
   > (b) mark where you're sharing a queue with pserialize.

   I'm still not sure what part is really specific to pserialize(9) or not.

You're right that it's not really specific to pserialize, but it is
designed to be safe to use with pserialize, and most uses would
probably be written with pserialize.
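
For instance, a typical reader might look like this (a sketch using
the documented pserialize(9) calls; the _PSZ iterator name is
assumed):

    int s;

    s = pserialize_read_enter();
    TAILQ_FOREACH_PSZ(e, &head, entry) {    /* assumed _PSZ name */
        if (e->key == key)
            break;
    }
    pserialize_read_exit(s);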


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Rhialto
On Mon 24 Nov 2014 at 16:56:44 +, Taylor R Campbell wrote:
> Just to clarify a tiny bit:  TAILQ_FOREACH uses data-dependent loads,
> no matter what CPU architecture it runs on.  A data-dependent load is
> anything like what CPU 2 is doing here:
> 
> CPU 1                           CPU 2
> 
> A = 0
> A = 5
> membar_producer()
> *pp = &A
>                                 p = *pp
>                                 v = *p  <--- Data-dependent load:
>                                              load address depends on
>                                              data from previous load.
> 
> On most CPU architectures, v is guaranteed to be 5; on alpha, if you

I must admit that it is not clear to me how on all these architectures
the CPU is to know in general that *p depends on p = *pp. I can imagine
that this is possible in trivial cases like this, when simply compiled
as (let's use PDP-11 syntax)

mov _pp,r0
mov (r0),r1 ; r1 is p
mov (r1),r2 ; r2 is v

but in more complex cases where r1 is moved to another register before
indirection, or has some constant or variable added to it or even more
complex calculations, this seems far from clear to me. It seems that the
Alpha case would actually be more realistic in general.

Structs can be very large, so p->far_away_field can address memory quite
far away from where p points. Simply discarding a cache line when p is
loaded into a register seems neither necessary (->inefficient) nor
sufficient.

> want v to be 5, you must insert a membar_consumer between `p = *pp'
> and `v = *p'.

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'




Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Taylor R Campbell
   Date: Mon, 24 Nov 2014 18:36:43 +0100
   From: Rhialto 

   I must admit that it is not clear to me how on all these architectures
   the CPU is to know in general that *p depends on p = *pp.

Issuing membar_producer on CPU 1 may flush its store buffer to RAM and
force all other CPUs to discard their caches for everything in its
store buffer.

Then when CPU 2 executes p = *pp, if it sees the new version of *pp,
it must have discarded its cache for *p, so when it executes v = *p,
it will see the value of *p that CPU 1 put in there before assigning
*pp = p.

That doesn't address control-dependent loads, which is why even CPUs
that do the above may require membar_consumer for control-dependent
loads.  Specifically, if CPU 1 does

    value[i] = 5;
    membar_producer();
    ok[i] = 1;

then because there's no data dependency, CPU 2 might still reorder

    if (ok[i])
        v = value[i];

into

    tmp = value[i];
    if (ok[i])
        v = tmp;

just by executing instructions out of order, with no cache involved.
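
The fix is a membar_consumer between the two loads (a sketch):

    if (ok[i]) {
        membar_consumer();  /* order ok[i] before value[i] */
        v = value[i];
    }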


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Mindaugas Rasiukevicius
Taylor R Campbell  wrote:
>Date: Fri, 21 Nov 2014 18:25:02 +0800
>From: Dennis Ferguson 
> 
>    Beyond the Alpha-barrier, the only thing I find lacking about the
> membar_*() API is maybe a set of functions that would eliminate the
> #ifndef/#endif around
> 
>#ifndef __HAVE_ATOMIC_AS_MEMBAR
>membar_foo();
>#endif
> 
>repeated over and over when synchronization is done via atomic_ops(3).
> 
> I've been meaning to add preatomic_membar_foo/postatomic_membar_foo to
> address this, since finding typos in the cpp conditional that went
> undetected for years.
> 
> Not sure it's necessary to distinguish the pre/post cases -- Linux
> does with smp_mb__before_atomic/smp_mb__after_atomic, but I'm not sure
> that distinction is needed for all our membar_foos, and I haven't
> taken the time to research the situation adequately.

I would rather extend the atomic_ops(3) API to have functions which
ensure a particular memory barrier, or a function which takes the
constraint as an extra argument, as in the C11 API.  Speaking of which,
it might be worth evaluating C11 atomics and considering whether we
should just move there.
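
For instance, C11 spells the constraint out at each call site (a
sketch; the names are illustrative):

    #include <stdatomic.h>

    struct entry { int key; struct entry *next; };
    static _Atomic(struct entry *) eq;

    void
    publish(struct entry *e)
    {
        e->key = 12345;        /* initialize first */
        /* release store orders the init before publication */
        atomic_store_explicit(&eq, e, memory_order_release);
    }

    int
    lookup(void)
    {
        /* consume suffices for the data-dependent load that
         * follows; most implementations promote it to acquire */
        struct entry *e =
            atomic_load_explicit(&eq, memory_order_consume);
        return (e != NULL) ? e->key : -1;
    }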

-- 
Mindaugas


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Masao Uebayashi
On Tue, Nov 25, 2014 at 1:44 AM, Eduardo Horvath  wrote:
> Yes, the existing code assumes TSO.  A while back I was looking into
> fixing that.
>
> I enhanced membar_ops with proper memory barriers and then was looking at
> the mutex code.  Unfortunately, I didn't get very far.  It seemed at the
> time that the mutex code has two hooks for memory barriers after the
> atomic operations, however it's missing memory barrier hooks to ensure
> consistency before accessing the lock.

My understanding is that checking mtx_owner, which is usually
"volatile uintptr_t mtxa_owner", doesn't need a memory barrier,
assuming "volatile" works as expected (it is always read from memory).


Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Dennis Ferguson

On 25 Nov, 2014, at 00:56 , Taylor R Campbell 
 wrote:
>   My confusion came from that I thought memory ordering of load is more
>   flexible in general.  I also didn't quite understand "dependent-load".
>   I have been only reading NetBSD kernel code to learn memory ordering;
>   and I believe, nothing in the tree utilize this topologycal
>   "dependent-load".  I may be missing something though.
> 
> We do have at least one use of nontrivial data-dependent loads
> in-tree, in pcq(9), as Dennis noted.

More than that, I think.  If I'm not mistaken there are also examples in

sys/kern/kern_descrip.c
sys/kern/kern_tc.c
sys/kern/subr_ipi.c
sys/kern/vfs_cache.c

When there is code which fills in a structure, followed by a membar_producer(),
followed by an assignment to a pointer to point at the structure, it is very
unlikely that there will be barriers in the places where the pointer is 
dereferenced.

Dennis Ferguson

Re: pserialize(9) vs. TAILQ

2014-11-24 Thread Masao Uebayashi
On Tue, Nov 25, 2014 at 3:55 PM, Dennis Ferguson
 wrote:
>
> On 25 Nov, 2014, at 00:56 , Taylor R Campbell 
>  wrote:
>>   My confusion came from that I thought memory ordering of load is more
>>   flexible in general.  I also didn't quite understand "dependent-load".
>>   I have been only reading NetBSD kernel code to learn memory ordering;
>>   and I believe, nothing in the tree utilize this topologycal
>>   "dependent-load".  I may be missing something though.
>>
>> We do have at least one use of nontrivial data-dependent loads
>> in-tree, in pcq(9), as Dennis noted.
>
> More than that, I think.  If I'm not mistaken there are also examples in
>
> sys/kern/kern_descrip.c
> sys/kern/kern_tc.c
> sys/kern/subr_ipi.c
> sys/kern/vfs_cache.c
>
> When there is code which fills in a structure, followed by a 
> membar_producer(),
> followed by an assignment to a pointer to point at the structure, it is very
> unlikely that there will be barriers in the places where the pointer is 
> dereferenced.

I'd *really* appreciate it if you left comments explaining those
(hidden) constraints.


Re: pserialize(9) vs. TAILQ

2014-11-25 Thread Masao Uebayashi
On Tue, Nov 25, 2014 at 9:38 AM, Mindaugas Rasiukevicius
 wrote:
> I would rather extend the atomic_ops(3) API to have functions which
> ensure a particular memory barrier, or a function which takes the
> constraint as an extra argument, as in the C11 API.  Speaking of which,
> it might be worth evaluating C11 atomics and considering whether we
> should just move there.

The more I learn, the more I like this approach.

So the direction is: code expresses its intent more clearly and gives
more information to the compiler.  The compiler will understand more
about memory ordering in the future and optimize accordingly.


mutex(9) on sparc64 RMO [was Re: pserialize(9) vs. TAILQ]

2014-11-24 Thread Taylor R Campbell
   Date: Mon, 24 Nov 2014 16:44:41 + (UTC)
   From: Eduardo Horvath 

   I enhanced membar_ops with proper memory barriers and then was looking at 
   the mutex code.  Unfortunately, I didn't get very far.  It seemed at the 
   time that the mutex code has two hooks for memory barriers after the 
   atomic operations, however it's missing memory barrier hooks to ensure 
   consistency before accessing the lock.

What exactly is the consistency you need before accessing the lock?


Re: mutex(9) on sparc64 RMO [was Re: pserialize(9) vs. TAILQ]

2014-11-24 Thread Masao Uebayashi
On Tue, Nov 25, 2014 at 2:19 AM, Taylor R Campbell
 wrote:
>Date: Mon, 24 Nov 2014 16:44:41 + (UTC)
>From: Eduardo Horvath 
>
>I enhanced membar_ops with proper memory barriers and then was looking at
>the mutex code.  Unfortunately, I didn't get very far.  It seemed at the
>time that the mutex code has two hooks for memory barriers after the
>atomic operations, however it's missing memory barrier hooks to ensure
>consistency before accessing the lock.
>
> What exactly is the consistency you need before accessing the lock?

Besides that, here are the other necessary memory barriers I have
figured out:

- cpu_simple_lock
  - lock/unlock should ensure membar_enter/membar_exit equivalent
- mutex
  - spin
    - enter/exit should ensure membar_enter/membar_exit equivalent
  - adaptive
    - Provide MUTEX_RECEIVE, MUTEX_GIVE -> membar_enter, membar_exit
- rw
  - enter/exit should ensure membar_enter/membar_exit equivalent
- membar_ops
  - Implement these
- Other synchronization
  - Relies on membar_ops, should work if membar_ops is done right

I'd start with using full-sync everywhere and see what happens.


Re: mutex(9) on sparc64 RMO [was Re: pserialize(9) vs. TAILQ]

2014-11-25 Thread Eduardo Horvath
On Mon, 24 Nov 2014, Taylor R Campbell wrote:

>Date: Mon, 24 Nov 2014 16:44:41 + (UTC)
>From: Eduardo Horvath 
> 
>I enhanced membar_ops with proper memory barriers and then was looking at 
>the mutex code.  Unfortunately, I didn't get very far.  It seemed at the 
>time that the mutex code has two hooks for memory barriers after the 
>atomic operations, however it's missing memory barrier hooks to ensure 
>consistency before accessing the lock.
> 
> What exactly is the consistency you need before accessing the lock?

I know you need a `membar #Lookaside' before accessing the atomic 
variable.  I don't recall if other memory barriers are needed since it's 
been a while since I last looked at the V9 architecture reference.

Eduardo