Re: svn commit: r230201 - head/lib/libc/gen

2012-01-20 Thread David Chisnall
On 20 Jan 2012, at 00:46, David Xu wrote:

 It depends on the hardware: on a large machine with many CPUs, a small
 conflict seen on a dual-core machine can become a large conflict, because
 more CPUs may be running the same code and that code becomes a bottleneck.
 On a machine with 1024 cores, a lot of code needs to be redesigned.

You'll also find that the relative cost of atomic instructions varies a lot 
between CPU models.  Between Core 2 and Sandy Bridge Core i7, the relative cost 
of an atomic add (full barrier) dropped by about two thirds.  The cache 
coherency logic has been significantly improved on the newer chips.  

For portable code, it's worth remembering that ARMv8 (which doesn't entirely 
exist yet) contains a set of barriers that closely match the semantics of the 
C[++]11 memory ordering.  They do this not for performance (directly), but for 
power efficiency - so using the least-restrictive required locking will 
eventually result in code for mobile devices that uses less battery power, if 
it's in a hot path.  
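As a concrete illustration (a minimal C11 sketch; the reference-count
scenario and names are illustrative, not anything in libc): a release-only
decrement plus an acquire fence on the freeing path is enough here, and on
hardware with fine-grained barriers it should be cheaper than making every
operation sequentially consistent.

#include <stdatomic.h>
#include <stdlib.h>

/* Drop a reference using the least-restrictive ordering that is still
 * correct: release on the decrement, acquire only for the thread that
 * ends up freeing the object. */
void
release_ref(_Atomic unsigned int *refs, void *obj)
{
	if (atomic_fetch_sub_explicit(refs, 1, memory_order_release) == 1) {
		atomic_thread_fence(memory_order_acquire);
		free(obj);
	}
}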

David


Re: svn commit: r230201 - head/lib/libc/gen

2012-01-20 Thread John Baldwin
On Thursday, January 19, 2012 7:36:33 pm David Xu wrote:
 On 2012/1/19 23:23, John Baldwin wrote:
  On Thursday, January 19, 2012 12:57:50 am David Xu wrote:
  rdtsc() may not work on SMP, so I have updated it to use clock_gettime
  to get total time.
  http://people.freebsd.org/~davidxu/bench/semaphore2/
 
  Still, lfence is a lot faster than atomic lock.
  http://www.freebsd.org/~jhb/patches/amd64_fence.patch
 
  This is the patch I've had for quite a while.  Can you retest with this?
  You'll probably have to install the updated header in /usr/include as well.
 
 The lines in atomic_load_acq() do not seem to be what I want:
 
 + v = *p; \
 + __asm __volatile("lfence" ::: "memory");\
 
 I think they should be swapped?

No, the point is that any subsequent loads cannot pass the '*p'.  If you swap 
the order, then the compiler (and CPU) are free to reorder '*p' to be later 
than some other load later in program order.
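Written out as a plain function (equivalent to the quoted patch hunk, just
not as a macro), the acquire load looks roughly like this:

#include <sys/types.h>

/* Sketch of the patched amd64 acquire load: the load happens first, then
 * the lfence (with the "memory" clobber) keeps later loads from being
 * hoisted above it. */
static __inline u_int
atomic_load_acq_uint_sketch(volatile u_int *p)
{
	u_int v;

	v = *p;					/* the acquiring load itself */
	__asm __volatile("lfence" ::: "memory");	/* later loads may not pass it */
	return (v);
}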

 + __asm __volatile("lfence" ::: "memory");\
 + v = *p; \
 
 What I need in the semaphore code is that a read cannot pass a write in such
 a special case.

Hmm, it seems you need the equivalent of an 'mfence'.  Note that for your
first change in your diff, it should not have made a difference on x86.  
atomic_add_rel_int() already has the equivalent of an 'mfence' (on x86 the 
non-load/store ops all end up with full fences since that is what 'lock' 
provides; the architecture doesn't let us do more fine-grained barriers).

It may be that you still have a race and that the barrier just changed the 
timing enough to fix your test case.  Specifically, note that an 'rmb'
(or 'lfence') does not force other CPUs to flush any pending writes, or
wait for other CPUs to flush pending writes.  Even with the lfence, you can
still read a stale value of _has_waiters.  This is why in-kernel locking
primitives encode this state in the lock cookie via contested flags and
use cmpset's to set them (and retry the loop if the cmpset fails).
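As a rough sketch of that pattern (names and the flag bit are illustrative,
not the actual kernel code): the waiters state lives in the lock word
itself and is set with a cmpset loop, so a stale read simply makes the
cmpset fail and the loop retries with a fresh value.

#include <sys/types.h>
#include <machine/atomic.h>

#define	CONTESTED	0x80000000u	/* illustrative flag bit */

/* Mark a lock word as contested before sleeping.  If another CPU changes
 * the word between the read and the cmpset, the cmpset fails and we
 * re-read, so we never act on a stale value. */
static void
mark_contested(volatile u_int *lockw)
{
	u_int old;

	for (;;) {
		old = *lockw;
		if (atomic_cmpset_acq_int(lockw, old, old | CONTESTED))
			break;
		/* lost the race; retry with the fresh value */
	}
}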

-- 
John Baldwin


Re: svn commit: r230201 - head/lib/libc/gen

2012-01-19 Thread John Baldwin
On Thursday, January 19, 2012 12:57:50 am David Xu wrote:
 rdtsc() may not work on SMP, so I have updated it to use clock_gettime 
 to get total time.
 http://people.freebsd.org/~davidxu/bench/semaphore2/ 
 
 Still, lfence is a lot faster than atomic lock.

http://www.freebsd.org/~jhb/patches/amd64_fence.patch

This is the patch I've had for quite a while.  Can you retest with this?  You'll 
probably have to install the updated header in /usr/include as well.

-- 
John Baldwin


Re: svn commit: r230201 - head/lib/libc/gen

2012-01-19 Thread Bruce Evans

On Thu, 19 Jan 2012, John Baldwin wrote:


On Thursday, January 19, 2012 12:57:50 am David Xu wrote:

rdtsc() may not work on SMP, so I have updated it to use clock_gettime
to get total time.
http://people.freebsd.org/~davidxu/bench/semaphore2/

Still, lfence is a lot faster than atomic lock.


I hope it does non-microbenchmarks.  IIRC, jhb found that it was
actually slower in some cases.  I only did micro-benchmarks on Athlon64.


http://www.freebsd.org/~jhb/patches/amd64_fence.patch

This is the patch I've had for quite a while.  Can you retest with this?  You'll
probably have to install the updated header in /usr/include as well.


% --- //depot/projects/smpng/sys/amd64/include/atomic.h	2011-01-05 17:06:25.0
% +++ //depot/user/jhb/ktrace/amd64/include/atomic.h	2011-01-05 22:08:56.0
% ...
% @@ -213,13 +213,12 @@
%  #if defined(_KERNEL) && !defined(SMP)
% 
%  /*

% - * We assume that a = b will do atomic loads and stores.  However, on a
% - * PentiumPro or higher, reads may pass writes, so for that case we have
% - * to use a serializing instruction (i.e. with LOCK) to do the load in
% - * SMP kernels.  For UP kernels, however, the cache of the single processor
% + * We assume that a = b will do atomic loads and stores.  However, reads
% + * may pass writes, so we have to use fences in SMP kernels to preserve
% + * ordering.  For UP kernels, however, the cache of the single processor
%   * is always consistent, so we only need to take care of compiler.
%   */
% -#define  ATOMIC_STORE_LOAD(TYPE, LOP, SOP)   \
% +#define  ATOMIC_STORE_LOAD(TYPE) \
%  static __inline u_##TYPE \
%  atomic_load_acq_##TYPE(volatile u_##TYPE *p) \
%  {\

It also has some simplifications from removing the use of different
operators for load and store.  These simplifications seem to be not
quite generic, since lock; cmpxchg* seems to be 2 cycles faster
than xchg* in Athlon64.  The ATOMIC_STORE_LOAD() macro is obfuscatory.
Better to have separate macros for load/store, like we do for
set/clear/add/subtract.  (The latter can be obfuscated even better
using 4 parameters for the ops.  Then better yet by making the type
a parameter.)
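Something along these lines, say (a sketch of the separate-macro style, not
a proposed patch; it assumes that on x86's ordering model a plain store
behind a compiler barrier is sufficient for release semantics):

#define	ATOMIC_LOAD(TYPE)					\
static __inline u_##TYPE					\
atomic_load_acq_##TYPE(volatile u_##TYPE *p)			\
{								\
	u_##TYPE v;						\
								\
	v = *p;							\
	__asm __volatile("lfence" : : : "memory");		\
	return (v);						\
}

#define	ATOMIC_STORE(TYPE)					\
static __inline void						\
atomic_store_rel_##TYPE(volatile u_##TYPE *p, u_##TYPE v)	\
{								\
	__asm __volatile("" : : : "memory");	/* compiler barrier */	\
	*p = v;							\
}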

% @@ -240,32 +239,22 @@
% 
%  #else /* !(_KERNEL && !SMP) */
% 
% -#define	ATOMIC_STORE_LOAD(TYPE, LOP, SOP)		\

% +#define  ATOMIC_STORE_LOAD(TYPE) \
%  static __inline u_##TYPE \
%  atomic_load_acq_##TYPE(volatile u_##TYPE *p) \
%  {\
% - u_##TYPE res;   \
% + u_##TYPE v; \
%   \
% - __asm __volatile(MPLOCKED LOP   \
% - : "=a" (res), /* 0 */ \
% -   "=m" (*p)   /* 1 */ \
% - : "m" (*p)    /* 2 */ \
% - : "memory", "cc");  \
% - \
% - return (res);   \
% + v = *p; \
% + __asm __volatile("lfence" ::: "memory");\

Style bug (missing spaces around binary operator `:') which becomes
a syntax error for C++.  Other places in this file use ` : : : '.

lfence() should be in cpufunc.h if it can be done separately
like this.  However, I think it can't be done separately --
it needs to be done in the same asm as the load/store, since
separate C statements may be reordered.  This is the same
problem that forces us to write __asm volatile("sti; hlt");
instead of sti(); hlt(); in too many places for idling in
machdep.c.  BTW, sti() and hlt() are bogusly not in cpufunc.h
either:
- I misnamed sti() as disable_intr() since disable_intr() was
  supposed to be MI and I didn't understand that any MI
  interface should not be direct in cpufunc.h.
- someone misnamed hlt() as halt().

Bruce


Re: svn commit: r230201 - head/lib/libc/gen

2012-01-19 Thread Bruce Evans

On Fri, 20 Jan 2012, Bruce Evans wrote:


...
% + v = *p; \
% + __asm __volatile("lfence" ::: "memory");\

Style bug (missing spaces around binary operator `:') which becomes
a syntax error for C++.  Other places in this file use ` : : : '.

lfence() should be in cpufunc.h if it can be done separately
like this.  However, I think it can't be done separately --
it needs to be done in the same asm as the load/store, since
separate C statements may be reordered.  This is the same
problem that forces us to write __asm volatile("sti; hlt");
instead of sti(); hlt(); in too many places for idling in
machdep.c.  BTW, sti() and hlt() are bogusly not in cpufunc.h
either:
- I misnamed sti() as disable_intr() since disable_intr() was
 supposed to be MI and I didn't understand that any MI
 interface should not be direct in cpufunc.h.
- someone misnamed hlt() as halt().


BTW2, there are 1 or 2 direct uses of halt() per arch, and all seem
to be wrong:
- cpu_halt() uses halt() for amd64, i386 and pc98.  It neither enables
  nor disables interrupts.  When it is called from ddb, enabling interrupts
  would be a bug.  Otherwise, it might want to ensure that interrupts
  are enabled (to allow i/o to complete), but it is safer to ensure that
  they are disabled (the caller, which is normally shutdown_halt(), should
  have waited).
- on arches that support apm (i386 and maybe pc98), apm_cpu_idle() uses
  halt().  It neither disables nor enables interrupts.  Apparently they
  are always enabled already.  But that seems to give races.  But this
  seems to have no effect, since apm_cpu_idle() seems to be never used.
  It seems to have been last used in FreeBSD-2, where it is called from
  swtch.s.  All references to apm were removed from swtch.s in 1997.  The
  call was replaced by a call through the function pointer _hlt_vector
  (aout spelling).  This always called a default which just executed the
  hlt instruction.  _hlt_vector was not connected to apm.  This later
  turned into cpu_idle() and the function pointer cpu_idle_hook that we
  have today.  The hook was apparently never connected to apm.  It is even
  misdescribed in its comment as being the ACPI idle hook.  Perhaps ACPI
  is the only thing that uses it, but its declaration shouldn't say that.

Bruce


Re: svn commit: r230201 - head/lib/libc/gen

2012-01-19 Thread Ed Schouten
Hi David,

* David Xu listlog2...@gmail.com, 20120116 08:53:
 but since atomic.h does not have a full barrier atomic operation
 interface, I intend to add a rmb() here.

In the very near future (after I switch SPARC64 and MIPS to
libcompiler_rt), it should be possible to safely use C11's stdatomic.h
on all supported architectures. The C11 interface allows any operation
to be combined with any type of barrier.
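For illustration, a minimal sketch of that style (the variable name is
illustrative, not the actual sem.c code): the ordering is attached to the
operation itself, so an acquire load needs no separate rmb().

#include <stdatomic.h>

static _Atomic unsigned int nwaiters;

static unsigned int
load_nwaiters(void)
{
	return (atomic_load_explicit(&nwaiters, memory_order_acquire));
}

static void
add_waiter(void)
{
	atomic_fetch_add_explicit(&nwaiters, 1, memory_order_acq_rel);
}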

Maybe we should simply migrate this code to use stdatomic.h then?

Greetings,
-- 
 Ed Schouten e...@80386.nl
 WWW: http://80386.nl/




Re: svn commit: r230201 - head/lib/libc/gen

2012-01-19 Thread David Chisnall
On 19 Jan 2012, at 18:09, Ed Schouten wrote:

 In the very near future (after I switch SPARC64 and MIPS to
 libcompiler_rt), it should be possible to safely use C11's stdatomic.h
 on all supported architectures. The C11 interface allows any operation
 to be combined with any type of barrier.
 
 Maybe we should simply migrate this code to use stdatomic.h then?

Currently, that will give worse code if we use gcc 4.2.1, but (I hope!) better 
code if we use clang.  With GCC, we are implementing atomic_thread_fence() as 
__sync_synchronize(), which is a full barrier, and ignoring its argument.  It 
would probably be worth postponing any such migration until:

1) Clang is the default compiler, and
2) The bugs in LLVM that cause the back end to fatal error on any nontrivial 
code using atomics are fixed.
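For reference, the gcc 4.2.1 fallback being described amounts to something
like this sketch (an assumption about the shape of the shim, not the actual
header):

/* Every fence becomes a full barrier; the requested order is ignored
 * because __sync_synchronize() is the only primitive available. */
static inline void
fence_fallback(int requested_order)
{
	(void)requested_order;		/* gcc 4.2 has no finer-grained hook */
	__sync_synchronize();		/* always a full barrier */
}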

David


Re: svn commit: r230201 - head/lib/libc/gen

2012-01-19 Thread David Xu

On 2012/1/19 23:23, John Baldwin wrote:

On Thursday, January 19, 2012 12:57:50 am David Xu wrote:

rdtsc() may not work on SMP, so I have updated it to use clock_gettime
to get total time.
http://people.freebsd.org/~davidxu/bench/semaphore2/

Still, lfence is a lot faster than atomic lock.

http://www.freebsd.org/~jhb/patches/amd64_fence.patch

This is the patch I've had for quite a while.  Can you retest with this?  You'll
probably have to install the updated header in /usr/include as well.


The lines in atomic_load_acq() do not seem to be what I want:

+   v = *p; \
+   __asm __volatile("lfence" ::: "memory");\

I think they should be swapped?

+   __asm __volatile("lfence" ::: "memory");\
+   v = *p; \

What I need in the semaphore code is that a read cannot pass a write in such a
special case.



Re: svn commit: r230201 - head/lib/libc/gen

2012-01-19 Thread David Xu

On 2012/1/20 0:55, Bruce Evans wrote:

On Thu, 19 Jan 2012, John Baldwin wrote:


On Thursday, January 19, 2012 12:57:50 am David Xu wrote:

rdtsc() may not work on SMP, so I have updated it to use clock_gettime
to get total time.
http://people.freebsd.org/~davidxu/bench/semaphore2/

Still, lfence is a lot faster than atomic lock.


I hope it does non-microbenchmarks.  IIRC, jhb found that it was
actually slower in some cases.  I only did micro-benchmarks on Athlon64.


It depends on the hardware: on a large machine with many CPUs, a small
conflict seen on a dual-core machine can become a large conflict, because
more CPUs may be running the same code and that code becomes a bottleneck.
On a machine with 1024 cores, a lot of code needs to be redesigned.



Re: svn commit: r230201 - head/lib/libc/gen

2012-01-18 Thread David Xu

On 2012/1/18 23:09, John Baldwin wrote:

On Tuesday, January 17, 2012 9:09:25 pm David Xu wrote:

On 2012/1/17 22:57, John Baldwin wrote:

On Monday, January 16, 2012 1:15:14 am David Xu wrote:

Author: davidxu
Date: Mon Jan 16 06:15:14 2012
New Revision: 230201
URL: http://svn.freebsd.org/changeset/base/230201

Log:
Insert read memory barriers.

I think using atomic_load_acq() on sem->nwaiters would be clearer as it would
indicate which variable you need to ensure is read after other operations.  In
general I think raw rmb/wmb usage should be avoided when possible as it
does not describe the programmer's intent as well.


Yes, I had considered using atomic_load_acq(), but at that time
I thought it emits a bus lock, right?  So I just picked rmb(), which
only affects the current CPU.  Maybe atomic_load_acq() does the same thing
as rmb()?  It is still unclear to me.

atomic_load_acq() is the same as rmb().  Right now it uses a locked
instruction on amd64, but it could easily switch to lfence/sfence instead.  I
had patches to do that but I think bde@ had done some benchmarks that showed
that change made no difference.


I wish there were a version of atomic_load_acq() that uses lfence.  I always
think bus locking is expensive on a multi-core machine.  Here we work on large
machines and found that even the current rwlock in libthr does not scale well
if most threads are readers; we had to implement a CSNZI-like rwlock to avoid
CPU conflict.
http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

I have just done a benchmark on my notebook, which has a 4-thread SMT Sandy
Bridge CPU (i3 2310M).
http://people.freebsd.org/~davidxu/bench/semaphore/ 


The load_acq that uses atomic locking is much slower than lfence:
http://people.freebsd.org/~davidxu/bench/semaphore/ministat.txt 


benchmark program:
http://people.freebsd.org/~davidxu/bench/semaphore/sem_test.c 



Regards,
David Xu



Re: svn commit: r230201 - head/lib/libc/gen

2012-01-18 Thread David Xu

On 2012/1/19 11:24, David Xu wrote:

On 2012/1/18 23:09, John Baldwin wrote:

On Tuesday, January 17, 2012 9:09:25 pm David Xu wrote:

On 2012/1/17 22:57, John Baldwin wrote:

On Monday, January 16, 2012 1:15:14 am David Xu wrote:

Author: davidxu
Date: Mon Jan 16 06:15:14 2012
New Revision: 230201
URL: http://svn.freebsd.org/changeset/base/230201

Log:
Insert read memory barriers.
I think using atomic_load_acq() on sem->nwaiters would be clearer as it
would indicate which variable you need to ensure is read after other
operations.  In general I think raw rmb/wmb usage should be avoided when
possible as it does not describe the programmer's intent as well.

Yes, I had considered using atomic_load_acq(), but at that time
I thought it emits a bus lock, right?  So I just picked rmb(), which
only affects the current CPU.  Maybe atomic_load_acq() does the same thing
as rmb()?  It is still unclear to me.

atomic_load_acq() is the same as rmb().  Right now it uses a locked
instruction on amd64, but it could easily switch to lfence/sfence instead.
I had patches to do that but I think bde@ had done some benchmarks that
showed that change made no difference.

I wish there were a version of atomic_load_acq() that uses lfence.  I always
think bus locking is expensive on a multi-core machine.  Here we work on large
machines and found that even the current rwlock in libthr does not scale well
if most threads are readers; we had to implement a CSNZI-like rwlock to avoid
CPU conflict.
http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

I have just done a benchmark on my notebook, which has a 4-thread SMT Sandy
Bridge CPU (i3 2310M).
http://people.freebsd.org/~davidxu/bench/semaphore/ 


The load_acq that uses atomic locking is much slower than lfence:
http://people.freebsd.org/~davidxu/bench/semaphore/ministat.txt 


benchmark program:
http://people.freebsd.org/~davidxu/bench/semaphore/sem_test.c 


rdtsc() may not work on SMP, so I have updated it to use clock_gettime 
to get total time.
http://people.freebsd.org/~davidxu/bench/semaphore2/ 


Still, lfence is a lot faster than atomic lock.




Re: svn commit: r230201 - head/lib/libc/gen

2012-01-17 Thread John Baldwin
On Monday, January 16, 2012 1:15:14 am David Xu wrote:
 Author: davidxu
 Date: Mon Jan 16 06:15:14 2012
 New Revision: 230201
 URL: http://svn.freebsd.org/changeset/base/230201
 
 Log:
   Insert read memory barriers.

I think using atomic_load_acq() on sem->nwaiters would be clearer as it would
indicate which variable you need to ensure is read after other operations.  In
general I think raw rmb/wmb usage should be avoided when possible as it
does not describe the programmer's intent as well.

-- 
John Baldwin


Re: svn commit: r230201 - head/lib/libc/gen

2012-01-17 Thread David Xu

On 2012/1/17 22:57, John Baldwin wrote:

On Monday, January 16, 2012 1:15:14 am David Xu wrote:

Author: davidxu
Date: Mon Jan 16 06:15:14 2012
New Revision: 230201
URL: http://svn.freebsd.org/changeset/base/230201

Log:
   Insert read memory barriers.

I think using atomic_load_acq() on sem->nwaiters would be clearer as it would
indicate which variable you need to ensure is read after other operations.  In
general I think raw rmb/wmb usage should be avoided when possible as it
does not describe the programmer's intent as well.


Yes, I had considered using atomic_load_acq(), but at that time
I thought it emits a bus lock, right?  So I just picked rmb(), which
only affects the current CPU.  Maybe atomic_load_acq() does the same thing
as rmb()?  It is still unclear to me.

Regards,
David Xu


Re: svn commit: r230201 - head/lib/libc/gen

2012-01-15 Thread Lawrence Stewart

On 01/16/12 17:15, David Xu wrote:

Author: davidxu
Date: Mon Jan 16 06:15:14 2012
New Revision: 230201
URL: http://svn.freebsd.org/changeset/base/230201

Log:
   Insert read memory barriers.

Modified:
   head/lib/libc/gen/sem.c
   head/lib/libc/gen/sem_new.c


Could you please provide a bit more information about why these are 
necessary and why they weren't there before (or how you figured out that 
there was a problem without them)? I learn a lot by reading commit mail, 
but only when the log message and any added code comments explain the 
what *and* the why.


Cheers,
Lawrence


Re: svn commit: r230201 - head/lib/libc/gen

2012-01-15 Thread David Xu

On 2012/1/16 15:39, Lawrence Stewart wrote:

On 01/16/12 17:15, David Xu wrote:

Author: davidxu
Date: Mon Jan 16 06:15:14 2012
New Revision: 230201
URL: http://svn.freebsd.org/changeset/base/230201

Log:
   Insert read memory barriers.

Modified:
   head/lib/libc/gen/sem.c
   head/lib/libc/gen/sem_new.c


Could you please provide a bit more information about why these are 
necessary and why they weren't there before (or how you figured out 
that there was a problem without them)? I learn a lot by reading 
commit mail, but only when the log message and any added code comments 
explain the what *and* the why.


Cheers,
Lawrence

I know this is rather obscure and difficult to understand.  The problem is
this: we have two variables m and n, and two threads A and B.  Thread A
wants to set m and then look at n, while B does it in the reverse order.
The accesses should be strictly ordered, but since atomic.h does not have a
full-barrier atomic operation interface, I intend to add a rmb() here.
x86 may not need this, but other arches may not work this way.
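A sketch of that case in C11 terms (m, n and the thread bodies are
illustrative, not the actual sem.c code): without the full fences, each
thread's read may pass its own earlier write and both threads can observe 0.

#include <stdatomic.h>

static atomic_int m, n;

void
thread_A(void)
{
	atomic_store_explicit(&m, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* full barrier needed */
	if (atomic_load_explicit(&n, memory_order_relaxed) == 0) {
		/* safe to conclude B has not started, only because of the fence */
	}
}

void
thread_B(void)
{
	atomic_store_explicit(&n, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* full barrier needed */
	if (atomic_load_explicit(&m, memory_order_relaxed) == 0) {
		/* likewise for A */
	}
}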

Regards,
David Xu

