Eric Dumazet wrote:
Jie Chen wrote:
Hi, there:
We have a simple pthread program that measures the synchronization
overheads of various synchronization mechanisms such as spin locks,
barriers (the barrier is implemented using a queue-based barrier
algorithm), and so on. We have clusters of dual quad-core AMD Opteron
(Barcelona) machines, currently running the 2.6.23.8 kernel on the
Fedora Core 7 distribution. Before we moved to this kernel, we had
kernel 2.6.21. The two kernels are configured identically and compiled
with the same gcc 4.1.2 compiler. Under the old kernel, we observed
that these synchronization overheads increase as the number of threads
increases from 2 to 8. The tables further down give the total time and
overhead for all threads acquiring a pthread spin lock and for all
threads executing a barrier synchronization call; a minimal sketch of
the measurement idea follows.
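For orientation, here is a sketch of the kind of measurement the
benchmark performs. This is not the actual qmt pthread_sync code; the
thread count, iteration count, names (worker, ITERS, nthreads), and
timing method here are simplified assumptions of mine.

/* Minimal sketch of a contended spin-lock timing loop; illustrative
 * only, NOT the actual qmt test code.
 * Compile with: gcc -O2 -std=gnu99 -pthread sketch.c -o sketch */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define ITERS 100000

static pthread_spinlock_t lock;
static int nthreads = 4;            /* qmt reads QMT_NUM_THREADS instead */

static double now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

static void *worker(void *arg)
{
    double t0 = now_usec();
    for (int i = 0; i < ITERS; i++) {
        pthread_spin_lock(&lock);       /* contended acquire */
        pthread_spin_unlock(&lock);
    }
    /* report the mean cost of one lock/unlock pair for this thread */
    *(double *)arg = (now_usec() - t0) / ITERS;
    return NULL;
}

int main(void)
{
    pthread_t tid[64];
    double per_thread[64];

    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, &per_thread[i]);
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        printf("thread %d: %.4f usec per lock/unlock\n", i, per_thread[i]);
    }
    return 0;
}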
Could you post the source of your test program?
Hi, Eric:
Thank you for the quick response. You can get the source code,
including the test code, from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz .
This is a data-parallel threading package for physics calculations. The
test code is pthread_sync in the src directory once you unpack the gz
file. Configuring and building the package is very simple: run
configure and make. The test program is built by make check. The number
of threads is controlled by QMT_NUM_THREADS. The package uses pthread
spin locks, but the barrier is implemented using a queue-based barrier
algorithm proposed by J. B. Carter of the University of Utah (2005).
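For context, barriers of this family have every thread spin in user
space until it is released. The sketch below is a standard
sense-reversing centralized barrier, NOT Carter's queue-based algorithm
(the real implementation is in the tarball); it only illustrates the
spin-until-released structure, and the type and function names are
mine.

/* Sense-reversing spin barrier -- a simpler stand-in for qmt's
 * queue-based barrier, shown only to illustrate the structure.
 * Needs C11 atomics: gcc -O2 -std=c11 -pthread */
#include <stdatomic.h>

typedef struct {
    atomic_int count;   /* threads still to arrive in this episode */
    atomic_int sense;   /* global sense, flipped by the last arriver */
    int nthreads;
} spin_barrier_t;

void spin_barrier_init(spin_barrier_t *b, int n)
{
    atomic_init(&b->count, n);
    atomic_init(&b->sense, 0);
    b->nthreads = n;
}

/* Each thread keeps its own local_sense, initially 0. */
void spin_barrier_wait(spin_barrier_t *b, int *local_sense)
{
    *local_sense = !*local_sense;
    if (atomic_fetch_sub(&b->count, 1) == 1) {
        /* last thread in: reset the count, then release everyone */
        atomic_store(&b->count, b->nthreads);
        atomic_store(&b->sense, *local_sense);
    } else {
        while (atomic_load(&b->sense) != *local_sense)
            ;  /* spin in user space; no scheduler involvement */
    }
}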
Spin locks are ... spinning, and should not call the Linux scheduler,
so I have no idea why a kernel change could modify your results.
Also, I suspect you will get better results with Fedora Core 8 (since
glibc was updated to use private futexes in version 2.7), at least for
the barrier ops.
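To illustrate the point: a user-space spin lock is essentially just the
loop below (a generic test-and-set lock, not glibc's actual
pthread_spin_lock implementation, and raw_spinlock_t is a name I made
up), so acquire and release never enter the kernel at all.

/* Generic test-and-set spinlock, to show why a spin lock never enters
 * the kernel: acquire and release are pure user-space atomics.
 * Initialize with: raw_spinlock_t l = ATOMIC_FLAG_INIT; */
#include <stdatomic.h>

typedef atomic_flag raw_spinlock_t;

static void raw_spin_lock(raw_spinlock_t *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* busy-wait: no futex(), no sched_yield(), no syscall */
}

static void raw_spin_unlock(raw_spinlock_t *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}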
I am not sure what the biggest change between kernels 2.6.21 and
2.6.22/23 is. Is the scheduler the biggest change between these
versions? Can the kernel scheduler somehow affect the performance? I
know the scheduler tries to do load balancing and so on. Can the
scheduler move threads to different cores according to the
load-balancing algorithm, even though the threads are bound to cores
with the pthread_setaffinity_np call, when the number of threads is
fewer than the number of cores? I am wondering about this because the
performance of our test code is roughly the same on both kernels when
the number of threads equals the number of cores. (A minimal sketch of
how we bind threads is below.)
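This is the kind of binding we do. A minimal sketch, not the exact qmt
code; it assumes one thread per core with core index equal to thread
index, and pin_self_to_core is a name I made up for illustration.

/* Pin the calling thread to one core via pthread_setaffinity_np.
 * Compile with: gcc -O2 -pthread pin.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static int pin_self_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* bind the calling thread: the scheduler may still run other
     * tasks on this core, but this thread cannot migrate away */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}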
Kernel 2.6.21 (all times in microseconds)

Number of threads           2          4          6          8
SpinLock time         10.5618   10.58538    10.5915    10.643
SpinLock overhead      0.073     0.05746     0.102805   0.154563
Barrier  time        11.020410  11.678125   11.9889    12.38002
Barrier  overhead     0.531660   1.1502      1.500112   1.891617
Each thread is bound to a particular core using pthread_setaffinity_np.
Kernel 2.6.23.8 (all times in microseconds)

Number of threads           2          4          6          8
SpinLock time        14.849915  17.117603   14.4496    10.5990
SpinLock overhead     4.345417   6.617207    3.949435   0.110985
Barrier  time        19.462255  20.285117   16.19395   12.37662
Barrier  overhead     8.957755   9.784722    5.699590   1.869518
It is clear that the synchronization overhead increases as the number
of threads increases on kernel 2.6.21. But the synchronization overhead
actually decreases as the number of threads increases on kernel
2.6.23.8 (we observed the same behavior on kernel 2.6.22 as well). This
is certainly not correct behavior. The kernels are configured with
CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC, CONFIG_PREEMPT_NONE, and
CONFIG_DISCONTIGMEM set. The complete kernel configuration file is in
the attachment of this e-mail.
From what we have read, a new scheduler (CFS) appeared starting with
2.6.22. We are not sure whether the above behavior is caused by the new
scheduler.
Finally, our machine's CPU information is listed below:
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 2
model name : Quad-Core AMD Opteron(tm) Processor 2347
stepping : 10
cpu MHz : 1909.801
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips : 3822.95
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
In addition, we have schedstat and sched_debug files in the /proc
directory.
Thank you for all your help in solving this puzzle. If you need more
information, please let us know.
P.S. I would like to be cc'ed on the discussions related to this
problem.
Thank you for your help, and happy Thanksgiving!
--
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [EMAIL PROTECTED]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################