Eric Dumazet wrote:
Jie Chen wrote:
Hi, there:

We have a simple pthread program that measures the synchronization overhead of various mechanisms such as spin locks, barriers (the barrier is implemented using a queue-based barrier algorithm), and so on. We have dual quad-core AMD Opteron (Barcelona) clusters running the 2.6.23.8 kernel at the moment, using the Fedora Core 7 distribution. Before we moved to this kernel, we had kernel 2.6.21. The two kernels are configured identically and compiled with the same gcc 4.1.2 compiler. Under the old kernel, we observed that these overheads increase as the number of threads grows from 2 to 8. The tables below give the total time and overhead for all threads acquiring a pthread spin lock and for all threads executing a barrier synchronization call.

Could you post the source of your test program?



Hi, Eric:

Thank you for the quick response. You can get the source code containing the test code from ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz . This is a data-parallel threading package for physics calculations. The test code is pthread_sync in the src directory once you unpack the tarball. Configuring and building the package is very simple: configure and make. The test program is built by make check, and the number of threads is controlled by QMT_NUM_THREADS. The package uses pthread spin locks, but the barrier is implemented using a queue-based barrier algorithm proposed by J. B. Carter of the University of Utah (2005).
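To give a flavor of the measurement, here is a minimal sketch of timing spin-lock overhead with pthreads. This is illustrative only, not the actual pthread_sync code; NUM_THREADS, N_ITERS and worker are made-up names.

/* Minimal sketch of a spin-lock overhead measurement (not the qmt code).
 * Each thread performs N_ITERS lock/unlock pairs; the average cost of one
 * pair is the total elapsed time divided by N_ITERS. */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define NUM_THREADS 2        /* illustrative; qmt reads QMT_NUM_THREADS */
#define N_ITERS     1000000

static pthread_spinlock_t lock;

static void *worker(void *arg)
{
    long i;
    (void)arg;
    for (i = 0; i < N_ITERS; i++) {
        pthread_spin_lock(&lock);
        pthread_spin_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    struct timeval t0, t1;
    double usec;
    int i;

    pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
    gettimeofday(&t0, NULL);
    for (i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);
    gettimeofday(&t1, NULL);

    usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("avg lock+unlock: %f usec\n", usec / N_ITERS);
    return 0;
}

Compile with gcc -O2 -o spin_sketch spin_sketch.c -lpthread; the printed value is the average cost of one lock/unlock pair while NUM_THREADS threads contend.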





Spin locks are ... spinning, and should not call the Linux scheduler, so I have no idea why a kernel change could modify your results.

Also, I suspect you'll get better results with Fedora Core 8 (since glibc was updated to use private futexes in version 2.7), at least for the barrier ops.
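For context, "private" means glibc passes FUTEX_PRIVATE_FLAG to the futex syscall, so the kernel can skip resolving the futex word to a global, cross-process key. A minimal sketch of what that looks like at the syscall level (it assumes kernel headers >= 2.6.22, which introduced the flag; this is not glibc's actual code):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

static int futex_word;   /* the 32-bit word the futex operates on */

int main(void)
{
    /* FUTEX_WAKE on a process-private word: the FUTEX_PRIVATE_FLAG bit
     * tells the kernel the word is not shared across processes, so it
     * can take a cheaper path. Returns the number of waiters woken
     * (0 here, since nobody is waiting). */
    long woken = syscall(SYS_futex, &futex_word,
                         FUTEX_WAKE | FUTEX_PRIVATE_FLAG, 1,
                         NULL, NULL, 0);
    printf("woken: %ld\n", woken);
    return 0;
}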



I am not sure what the biggest change between kernel 2.6.21 and 2.6.22 (or .23) is. Is the scheduler the biggest change between these versions, and can the kernel scheduler somehow affect the performance? I know the scheduler tries to do load balancing and so on. Can the scheduler move threads to different cores according to its load-balancing algorithm, even though the threads are bound to cores using the pthread_setaffinity_np call, when the number of threads is fewer than the number of cores? I am wondering about this because the performance of our test code is roughly the same for both kernels when the number of threads equals the number of cores.
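For reference, binding a thread to a single core with pthread_setaffinity_np looks roughly like this (a minimal sketch; the pin_to_core helper is illustrative, not taken from qmt):

#define _GNU_SOURCE           /* for pthread_setaffinity_np and CPU_* macros */
#include <pthread.h>
#include <sched.h>

/* Bind the calling thread to a single core. Once the affinity mask
 * contains only one CPU, the load balancer will not migrate the thread
 * off it. Returns 0 on success, an errno value on failure. */
static int pin_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}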


Kernel 2.6.21
Number of Threads             2           4           6           8
SpinLock time (microseconds)  10.5618     10.58538    10.5915     10.643
SpinLock overhead             0.073       0.05746     0.102805    0.154563
Barrier time (microseconds)   11.020410   11.678125   11.9889     12.38002
Barrier overhead              0.531660    1.1502      1.500112    1.891617

Each thread is bound to a particular core using pthread_setaffinity_np.

Kernel 2.6.23.8
Number of Threads             2           4           6           8
SpinLock time (microseconds)  14.849915   17.117603   14.4496     10.5990
SpinLock overhead             4.345417    6.617207    3.949435    0.110985
Barrier time (microseconds)   19.462255   20.285117   16.19395    12.37662
Barrier overhead              8.957755    9.784722    5.699590    1.869518

It is clear that the synchronization overhead increases as the number of threads increases under kernel 2.6.21. But the synchronization overhead actually decreases as the number of threads increases under kernel 2.6.23.8 (we observed the same behavior on kernel 2.6.22 as well). This is certainly not correct behavior. The kernels are configured with CONFIG_SMP, CONFIG_NUMA, CONFIG_SCHED_MC, CONFIG_PREEMPT_NONE, and CONFIG_DISCONTIGMEM set. The complete kernel configuration file is attached to this e-mail.

From what we have read, a new scheduler (CFS) was merged in 2.6.23. We are not sure whether the above behavior is caused by the new scheduler.

Finally, our machine's CPU information is listed below:

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 2
model name      : Quad-Core AMD Opteron(tm) Processor 2347
stepping        : 10
cpu MHz         : 1909.801
cache size      : 512 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good pni cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy altmovcr8 abm sse4a misalignsse 3dnowprefetch osvw
bogomips        : 3822.95
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

In addition, we have the schedstat and sched_debug files in the /proc directory.

Thank you for all your help in solving this puzzle. If you need more information, please let us know.


P.S. I'd like to be cc'ed on discussions related to this problem.


Thank you for your help, and happy Thanksgiving!

--
#############################################################################
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [EMAIL PROTECTED]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#############################################################################