Re: svn commit: r230201 - head/lib/libc/gen
On 20 Jan 2012, at 00:46, David Xu wrote: It depends on hardware, if it is a large machine with lots of cpu, a small conflict on dual-core machine can become a large conflict on large machine because it is possible more cpus are now running same code which becomes a bottleneck. On a large machine which has 1024 cores, many code need to be redesigned. You'll also find that the relative cost of atomic instructions varies a lot between CPU models. Between Core 2 and Sandy Bridge Core i7, the relative cost of an atomic add (full barrier) dropped by about two thirds. The cache coherency logic has been significantly improved on the newer chips. For portable code, it's worth remembering that ARMv8 (which doesn't entirely exist yet) contains a set of barriers that closely match the semantics of the C[++]11 memory ordering. They do this not for performance (directly), but for power efficiency - so using the least-restrictive required locking will eventually result in code for mobile devices that uses less battery power, if it's in a hot path. David___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On Thursday, January 19, 2012 7:36:33 pm David Xu wrote: On 2012/1/19 23:23, John Baldwin wrote: On Thursday, January 19, 2012 12:57:50 am David Xu wrote: rdtsc() may not work on SMP, so I have updated it to use clock_gettime to get total time. http://people.freebsd.org/~davidxu/bench/semaphore2/ http://people.freebsd.org/%7Edavidxu/bench/semaphore2/ Still, lfence is a lot faster than atomic lock. http://www.freebsd.org/~jhb/patches/amd64_fence.patch This the patch I've had for quite a while. Can you retest with this? You'll probably have to install the updated header in /usr/include as well. The lines in atomic_load_acq() seem not what I want: + v = *p; \ + __asm __volatile(lfence ::: memory);\ I think they should be swapped ? No, the point is that any subsequent loads cannot pass the '*p'. If you swap the order, then the compiler (and CPU) are free to reorder '*p' to be later than some other load later in program order. + __asm __volatile(lfence ::: memory);\ + v = *p; \ What I need in the semaphore code is read can not pass write in such a special case. Hmm, it seems you need the equivalent of an 'mfence'. Note that for your first change in your diff, it should not have made a difference on x86. atomic_add_rel_int() already has the equivalent of an 'mfence' (on x86 the non-load/store ops all end up with full fences since that is what 'lock' provides, the architecture doesn't let us do more fine-grained barriers). It may be that you still have a race and that the barrier just changed the timing enough to fix your test case. Specifically, note that an 'rmb' (or 'lfence') does not force other CPUs to flush any pending writes, or wait for other CPUs to flush pending writes. Even with the lfence, you can still read a stale value of _has_waiters. This is why in-kernel locking primitives encode this state in the lock cookie via contested flags and use cmpset's to set them (and retry the loop if the cmpset fails). -- John Baldwin ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On Thursday, January 19, 2012 12:57:50 am David Xu wrote: rdtsc() may not work on SMP, so I have updated it to use clock_gettime to get total time. http://people.freebsd.org/~davidxu/bench/semaphore2/ http://people.freebsd.org/%7Edavidxu/bench/semaphore2/ Still, lfence is a lot faster than atomic lock. http://www.freebsd.org/~jhb/patches/amd64_fence.patch This the patch I've had for quite a while. Can you retest with this? You'll probably have to install the updated header in /usr/include as well. -- John Baldwin ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On Thu, 19 Jan 2012, John Baldwin wrote: On Thursday, January 19, 2012 12:57:50 am David Xu wrote: rdtsc() may not work on SMP, so I have updated it to use clock_gettime to get total time. http://people.freebsd.org/~davidxu/bench/semaphore2/ http://people.freebsd.org/%7Edavidxu/bench/semaphore2/ Still, lfence is a lot faster than atomic lock. I hope it does non-microbenchmarks. IIRC, jhb found that it was actually slower in some cases. I only did micro-benchmarks on Athlon64. http://www.freebsd.org/~jhb/patches/amd64_fence.patch This the patch I've had for quite a while. Can you retest with this? You'll probably have to install the updated header in /usr/include as well. % --- //depot/projects/smpng/sys/amd64/include/atomic.h 2011-01-05 17:06:25.0 % +++ //depot/user/jhb/ktrace/amd64/include/atomic.h2011-01-05 22:08:56.0 % ... % @@ -213,13 +213,12 @@ % #if defined(_KERNEL) !defined(SMP) % % /* % - * We assume that a = b will do atomic loads and stores. However, on a % - * PentiumPro or higher, reads may pass writes, so for that case we have % - * to use a serializing instruction (i.e. with LOCK) to do the load in % - * SMP kernels. For UP kernels, however, the cache of the single processor % + * We assume that a = b will do atomic loads and stores. However, reads % + * may pass writes, so we have to use fences in SMP kernels to preserve % + * ordering. For UP kernels, however, the cache of the single processor % * is always consistent, so we only need to take care of compiler. % */ % -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP) \ % +#define ATOMIC_STORE_LOAD(TYPE) \ % static __inline u_##TYPE \ % atomic_load_acq_##TYPE(volatile u_##TYPE *p) \ % {\ It also has some simplifications from removing the use of different operators for load and store. These simplifications seem to be not quite generic, since lock; cmpxchg* seems to be 2 cycles faster than xchg* in Athlon64. The ATOMIC_STORE_LOAD() macro is obfuscatory. Better to have separate macros for load/store, like we do for set/clear/add/subtract. (The latter can be obfuscated even better using 4 parameters for the ops. Then better yet by making the type a parameter.) % @@ -240,32 +239,22 @@ % % #else /* !(_KERNEL !SMP) */ % % -#define ATOMIC_STORE_LOAD(TYPE, LOP, SOP) \ % +#define ATOMIC_STORE_LOAD(TYPE) \ % static __inline u_##TYPE \ % atomic_load_acq_##TYPE(volatile u_##TYPE *p) \ % {\ % - u_##TYPE res; \ % + u_##TYPE v; \ % \ % - __asm __volatile(MPLOCKED LOP \ % - : =a (res), /* 0 */ \ % - =m (*p) /* 1 */ \ % - : m (*p)/* 2 */ \ % - : memory, cc); \ % - \ % - return (res); \ % + v = *p; \ % + __asm __volatile(lfence ::: memory);\ Style bug (missing spaces around binary operator `:') which becomes a syntax error for C++. Other places in this file use ` : : : '. lfence() should be in cpufunc.h if it can be done separately like this. However, I think it can't be done separately -- it needs to be done in the same asm as the load/store, since separate C statements may be reordered. This is the same problem that forces us to write __asm volatile(sti; hlt); instead of sti(); hlt(); in too many places for idling in machdep.c. BTW, sti() and hlt() are bogusly not in cpufunc.h either: - I misnamed sti() as disable_intr() since disable_intr() was supposed to be MI and I didn't understand that any MI interface should not be direct in cpufunc.h. - someone misnamed hlt() as halt(). Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On Fri, 20 Jan 2012, Bruce Evans wrote: ... % + v = *p; \ % + __asm __volatile(lfence ::: memory);\ Style bug (missing spaces around binary operator `:') which becomes a syntax error for C++. Other places in this file use ` : : : '. lfence() should be in cpufunc.h if it can be done separately like this. However, I think it can't be done separately -- it needs to be done in the same asm as the load/store, since separate C statements may be reordered. This is the same problem that forces us to write __asm volatile(sti; hlt); instead of sti(); hlt(); in too many places for idling in machdep.c. BTW, sti() and hlt() are bogusly not in cpufunc.h either: - I misnamed sti() as disable_intr() since disable_intr() was supposed to be MI and I didn't understand that any MI interface should not be direct in cpufunc.h. - someone misnamed hlt() as halt(). BTW2, there are 1 or 2 direct uses of halt() per arch, and all seem to be wrong: - cpu_halt() uses halt() for amd64, i386 and pc98. It neither enables or disables interrupts. When it is called from ddb, enabling interrupts would be a bug. Otherwise, it might want to ensure that interrupts are enabled (to allow i/o to complete), but it is safer to ensure that they is disabled (the caller, which is normally shutdown_halt(), should have waited). - on arches that support apm (i386 and maybe pc98), apm_cpu_idle() uses halt(). It neither disables nor enables interrupts. Apparently they are always enabled already. But that seems to give races. But this seems to have no effect, since apm_cpu_idle() seems to be never used. It seems to have been last used in FreeBSD-2, where it is called from swtch.s. All references to apm were removed from swtch.s in 1997. The call was replaced by a call through the function pointer _hlt_vector (aout spelling). This always called a default which just executed the hlt instruction. _hlt_vector was not connected to apm. This later turned into cpu_idle() and the function pointer cpu_idle_hook that we have today. The hook was apparently never connected to apm. It is even misdescribed in its comment as being the ACPI idle hook. Perhaps ACPI is the only thing that uses it, but its declaration shouldn't say that. Bruce ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
Hi David, * David Xu listlog2...@gmail.com, 20120116 08:53: but since atomic.h does not have a full barrier atomic operation interface, I intend to add a rmb() here. In the very nearby future (after I switch SPARC64 and MIPS to libcompiler_rt), it should be possible to safely use C11's stdatomic.h on all supported architectures. The C11 interface allows any operation to be combined with any type of barrier. Maybe we should simply migrate this code to use stdatomic.h then? Greetings, -- Ed Schouten e...@80386.nl WWW: http://80386.nl/ pgpDC3F3wL7nx.pgp Description: PGP signature
Re: svn commit: r230201 - head/lib/libc/gen
On 19 Jan 2012, at 18:09, Ed Schouten wrote: In the very nearby future (after I switch SPARC64 and MIPS to libcompiler_rt), it should be possible to safely use C11's stdatomic.h on all supported architectures. The C11 interface allows any operation to be combined with any type of barrier. Maybe we should simply migrate this code to use stdatomic.h then? Currently, that will give worse code if we use gcc 4.2.1, but (I hope!) better code if we use clang. With GCC, we are implementing atomic_thread_fence() as __sync_synchronize(), which is a full barrier, and ignoring its argument. It would probably be worth postponing any such migration until: 1) Clang is the default compiler, and 2) The bugs in LLVM that cause the back end to fatal error on any nontrivial code using atomics are fixed. David___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On 2012/1/19 23:23, John Baldwin wrote: On Thursday, January 19, 2012 12:57:50 am David Xu wrote: rdtsc() may not work on SMP, so I have updated it to use clock_gettime to get total time. http://people.freebsd.org/~davidxu/bench/semaphore2/ http://people.freebsd.org/%7Edavidxu/bench/semaphore2/ Still, lfence is a lot faster than atomic lock. http://www.freebsd.org/~jhb/patches/amd64_fence.patch This the patch I've had for quite a while. Can you retest with this? You'll probably have to install the updated header in /usr/include as well. The lines in atomic_load_acq() seem not what I want: + v = *p; \ + __asm __volatile(lfence ::: memory);\ I think they should be swapped ? + __asm __volatile(lfence ::: memory);\ + v = *p; \ What I need in the semaphore code is read can not pass write in such a special case. ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On 2012/1/20 0:55, Bruce Evans wrote: On Thu, 19 Jan 2012, John Baldwin wrote: On Thursday, January 19, 2012 12:57:50 am David Xu wrote: rdtsc() may not work on SMP, so I have updated it to use clock_gettime to get total time. http://people.freebsd.org/~davidxu/bench/semaphore2/ http://people.freebsd.org/%7Edavidxu/bench/semaphore2/ Still, lfence is a lot faster than atomic lock. I hope it does non-microbenchmarks. IIRC, jhb found that it was actually slower in some cases. I only did micro-benchmarks on Athlon64. It depends on hardware, if it is a large machine with lots of cpu, a small conflict on dual-core machine can become a large conflict on large machine because it is possible more cpus are now running same code which becomes a bottleneck. On a large machine which has 1024 cores, many code need to be redesigned. ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On 2012/1/18 23:09, John Baldwin wrote: On Tuesday, January 17, 2012 9:09:25 pm David Xu wrote: On 2012/1/17 22:57, John Baldwin wrote: On Monday, January 16, 2012 1:15:14 am David Xu wrote: Author: davidxu Date: Mon Jan 16 06:15:14 2012 New Revision: 230201 URL: http://svn.freebsd.org/changeset/base/230201 Log: Insert read memory barriers. I think using atomic_load_acq() on sem-nwaiters would be clearer as it would indicate which variable you need to ensure is read after other operations. In general I think raw rmb/wmb usage should be avoided when possible as it is does not describe the programmer's intent as well. Yes, I had considered that I may use atomic_load_acq(), but at that time, I thought it emits a bus locking, right ? so I just picked up rmb() which only affects current cpu. maybe atomic_load_acq() does same thing with rmb() ? it is still unclear to me. atomic_load_acq() is the same as rmb(). Right now it uses a locked instruction on amd64, but it could easily switch to lfence/sfence instead. I had patches to do that but I think bde@ had done some benchmarks that showed that change made no difference. I wish there is a version uses lfence for atomic_load_acq(). I always think bus locking is expensive on a multiple-core machine. Here we work on large machine found that even current rwlock in libthr is not scale well if most threads are readers, we have to implement CSNZI-like rwlock to avoid CPU conflict. http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf I have just done a benchmark on my notebook which is a 4 SMT sandy bridge CPU i3 2310m. http://people.freebsd.org/~davidxu/bench/semaphore/ http://people.freebsd.org/%7Edavidxu/bench/semaphore/ The load_acq uses atomic locking is much slower than lfence: http://people.freebsd.org/~davidxu/bench/semaphore/ministat.txt http://people.freebsd.org/%7Edavidxu/bench/semaphore/ministat.txt benchmark program: http://people.freebsd.org/~davidxu/bench/semaphore/sem_test.c http://people.freebsd.org/%7Edavidxu/bench/semaphore/sem_test.c Regards, David Xu ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On 2012/1/19 11:24, David Xu wrote: On 2012/1/18 23:09, John Baldwin wrote: On Tuesday, January 17, 2012 9:09:25 pm David Xu wrote: On 2012/1/17 22:57, John Baldwin wrote: On Monday, January 16, 2012 1:15:14 am David Xu wrote: Author: davidxu Date: Mon Jan 16 06:15:14 2012 New Revision: 230201 URL: http://svn.freebsd.org/changeset/base/230201 Log: Insert read memory barriers. I think using atomic_load_acq() on sem-nwaiters would be clearer as it would indicate which variable you need to ensure is read after other operations. In general I think raw rmb/wmb usage should be avoided when possible as it is does not describe the programmer's intent as well. Yes, I had considered that I may use atomic_load_acq(), but at that time, I thought it emits a bus locking, right ? so I just picked up rmb() which only affects current cpu. maybe atomic_load_acq() does same thing with rmb() ? it is still unclear to me. atomic_load_acq() is the same as rmb(). Right now it uses a locked instruction on amd64, but it could easily switch to lfence/sfence instead. I had patches to do that but I think bde@ had done some benchmarks that showed that change made no difference. I wish there is a version uses lfence for atomic_load_acq(). I always think bus locking is expensive on a multiple-core machine. Here we work on large machine found that even current rwlock in libthr is not scale well if most threads are readers, we have to implement CSNZI-like rwlock to avoid CPU conflict. http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf I have just done a benchmark on my notebook which is a 4 SMT sandy bridge CPU i3 2310m. http://people.freebsd.org/~davidxu/bench/semaphore/ http://people.freebsd.org/%7Edavidxu/bench/semaphore/ The load_acq uses atomic locking is much slower than lfence: http://people.freebsd.org/~davidxu/bench/semaphore/ministat.txt http://people.freebsd.org/%7Edavidxu/bench/semaphore/ministat.txt benchmark program: http://people.freebsd.org/~davidxu/bench/semaphore/sem_test.c http://people.freebsd.org/%7Edavidxu/bench/semaphore/sem_test.c rdtsc() may not work on SMP, so I have updated it to use clock_gettime to get total time. http://people.freebsd.org/~davidxu/bench/semaphore2/ http://people.freebsd.org/%7Edavidxu/bench/semaphore2/ Still, lfence is a lot faster than atomic lock. ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On Monday, January 16, 2012 1:15:14 am David Xu wrote: Author: davidxu Date: Mon Jan 16 06:15:14 2012 New Revision: 230201 URL: http://svn.freebsd.org/changeset/base/230201 Log: Insert read memory barriers. I think using atomic_load_acq() on sem-nwaiters would be clearer as it would indicate which variable you need to ensure is read after other operations. In general I think raw rmb/wmb usage should be avoided when possible as it is does not describe the programmer's intent as well. -- John Baldwin ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On 2012/1/17 22:57, John Baldwin wrote: On Monday, January 16, 2012 1:15:14 am David Xu wrote: Author: davidxu Date: Mon Jan 16 06:15:14 2012 New Revision: 230201 URL: http://svn.freebsd.org/changeset/base/230201 Log: Insert read memory barriers. I think using atomic_load_acq() on sem-nwaiters would be clearer as it would indicate which variable you need to ensure is read after other operations. In general I think raw rmb/wmb usage should be avoided when possible as it is does not describe the programmer's intent as well. Yes, I had considered that I may use atomic_load_acq(), but at that time, I thought it emits a bus locking, right ? so I just picked up rmb() which only affects current cpu. maybe atomic_load_acq() does same thing with rmb() ? it is still unclear to me. Regards, David Xu ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On 01/16/12 17:15, David Xu wrote: Author: davidxu Date: Mon Jan 16 06:15:14 2012 New Revision: 230201 URL: http://svn.freebsd.org/changeset/base/230201 Log: Insert read memory barriers. Modified: head/lib/libc/gen/sem.c head/lib/libc/gen/sem_new.c Could you please provide a bit more information about why these are necessary and why they weren't there before (or how you figured out that there was a problem without them)? I learn a lot by reading commit mail, but only when the log message and any added code comments explain the what *and* the why. Cheers, Lawrence ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org
Re: svn commit: r230201 - head/lib/libc/gen
On 2012/1/16 15:39, Lawrence Stewart wrote: On 01/16/12 17:15, David Xu wrote: Author: davidxu Date: Mon Jan 16 06:15:14 2012 New Revision: 230201 URL: http://svn.freebsd.org/changeset/base/230201 Log: Insert read memory barriers. Modified: head/lib/libc/gen/sem.c head/lib/libc/gen/sem_new.c Could you please provide a bit more information about why these are necessary and why they weren't there before (or how you figured out that there was a problem without them)? I learn a lot by reading commit mail, but only when the log message and any added code comments explain the what *and* the why. Cheers, Lawrence I know this is rather obscure and diffcult to understand, the problem is here we have two variables m, n, and two threads A, B. the A wants to set m, than looks n, while B is in reverse order, it should be in strict order, but since atomic.h does not have a full barrier atomic operation interface, I intend to add a rmb() here. though X86 may not need this, but other arches may not work in this way. Regards, David Xu ___ svn-src-head@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to svn-src-head-unsubscr...@freebsd.org