Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
On Tue, 2007-12-11 at 17:11 -0500, Jie Chen wrote:
> Ingo Molnar wrote:
> > * Jie Chen <[EMAIL PROTECTED]> wrote:
> >
> >> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP
> >> kernel.
> >
> >> 2 threads:
> >
> >> PARALLEL time     = 11.106580 microseconds +/- 0.002460
> >> PARALLEL overhead =  0.617590 microseconds +/- 0.003409
> >
> >> Output for Kernel 2.6.24-rc4 #1 SMP
> >
> >> PARALLEL time     = 19.668450 microseconds +/- 0.015782
> >> PARALLEL overhead =  9.157945 microseconds +/- 0.018217
> >
> > ok, so the problem is that this PARALLEL time has an additional +9 usecs
> > overhead, right? I dont see this myself on a Core2 CPU:
> >
> >   PARALLEL time     = 10.446933 microseconds +/- 0.078849
> >   PARALLEL overhead =  0.751732 microseconds +/- 0.177446

On my dual-socket AMD Athlon MP:

2.6.20-13-generic
  PARALLEL time     = 22.751875 microseconds +/- 21.370942
  PARALLEL overhead =  7.046595 microseconds +/- 24.370040

2.6.24-rc5
  PARALLEL time     = 17.365543 microseconds +/- 3.295133
  PARALLEL overhead =  2.213722 microseconds +/- 4.797886
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP kernel.

2 threads:
PARALLEL time     = 11.106580 microseconds +/- 0.002460
PARALLEL overhead =  0.617590 microseconds +/- 0.003409

Output for Kernel 2.6.24-rc4 #1 SMP
PARALLEL time     = 19.668450 microseconds +/- 0.015782
PARALLEL overhead =  9.157945 microseconds +/- 0.018217

ok, so the problem is that this PARALLEL time has an additional +9 usecs overhead, right? I dont see this myself on a Core2 CPU:

PARALLEL time     = 10.446933 microseconds +/- 0.078849
PARALLEL overhead =  0.751732 microseconds +/- 0.177446

	Ingo

Hi, Ingo:

Yes, there is an extra 9 usecs of overhead when running two threads on the 2.6.24 kernel, with a total of 8 cores (two quad-core Opterons). What is the total number of cores you have? I do not have machines with dual quad-core Xeons here for a direct comparison. Thank you.

--
#
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [EMAIL PROTECTED]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP
> kernel.
>
> 2 threads:
>
> PARALLEL time     = 11.106580 microseconds +/- 0.002460
> PARALLEL overhead =  0.617590 microseconds +/- 0.003409
>
> Output for Kernel 2.6.24-rc4 #1 SMP
>
> PARALLEL time     = 19.668450 microseconds +/- 0.015782
> PARALLEL overhead =  9.157945 microseconds +/- 0.018217

ok, so the problem is that this PARALLEL time has an additional +9 usecs overhead, right? I dont see this myself on a Core2 CPU:

PARALLEL time     = 10.446933 microseconds +/- 0.078849
PARALLEL overhead =  0.751732 microseconds +/- 0.177446

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

Hi, Ingo:
I guess it is good news. I patched the 2.6.21.7 kernel with your cfs patch. The results of pthread_sync are the same as for the non-patched 2.6.21 kernel. This means the performance issue is not related to the scheduler. As for the overhead of gettimeofday, there is no difference between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us for both kernels.

could you please paste again the relevant portion of the output you get on a "good" .21 kernel versus the output you get on a "bad" .24 kernel?

So what changed between 2.6.21 and 2.6.22? Any hints :-). Thank you very much for all your help.

we'll figure it out i'm sure :)

	Ingo

Hi, Ingo:

The following is pthread_sync output for the 2.6.21.7-cfs-v24 #1 SMP kernel.

2 threads:
Computing reference time 1
Sample_size  Average     Min         Max         S.D.      Outliers
 20          10.489085   10.488800   10.491100   0.000539  1

Reference_time_1 = 10.489085 microseconds +/- 0.001057

Computing PARALLEL time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          11.106580   11.105650   11.109700   0.001255  0

PARALLEL time     = 11.106580 microseconds +/- 0.002460
PARALLEL overhead =  0.617590 microseconds +/- 0.003409

8 threads:
Computing reference time 1
Sample_size  Average     Min         Max         S.D.      Outliers
 20          10.488735   10.488500   10.490700   0.000484  1

Reference_time_1 = 10.488735 microseconds +/- 0.000948

Computing PARALLEL time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          13.000647   12.991050   13.052700   0.012592  1

PARALLEL time     = 13.000647 microseconds +/- 0.024680
PARALLEL overhead =  2.511907 microseconds +/- 0.025594

Output for Kernel 2.6.24-rc4 #1 SMP

2 threads:
Computing reference time 1
Sample_size  Average     Min         Max         S.D.      Outliers
 20          10.510535   10.508600   10.518200   0.002237  1

Reference_time_1 = 10.510535 microseconds +/- 0.004384

Computing PARALLEL time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          19.668450   19.650200   19.679650   0.008052  0

PARALLEL time     = 19.668450 microseconds +/- 0.015782
PARALLEL overhead =  9.157945 microseconds +/- 0.018217

8 threads:
Computing reference time 1
Sample_size  Average     Min         Max         S.D.      Outliers
 20          10.491285   10.490100   10.494900   0.001085  1

Reference_time_1 = 10.491285 microseconds +/- 0.002127

Computing PARALLEL time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          13.090080   13.079150   13.131450   0.010995  1

PARALLEL time     = 13.090080 microseconds +/- 0.021550
PARALLEL overhead =  2.598590 microseconds +/- 0.024534

For 8 threads, both kernels show similar performance numbers. But for 2 threads, 2.6.21 is much better than 2.6.24-rc4. Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> Hi, Ingo:
>
> I guess it is good news. I patched the 2.6.21.7 kernel with your cfs
> patch. The results of pthread_sync are the same as for the non-patched
> 2.6.21 kernel. This means the performance issue is not related to the
> scheduler. As for the overhead of gettimeofday, there is no difference
> between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us
> for both kernels.

could you please paste again the relevant portion of the output you get on a "good" .21 kernel versus the output you get on a "bad" .24 kernel?

> So what changed between 2.6.21 and 2.6.22? Any hints :-). Thank you
> very much for all your help.

we'll figure it out i'm sure :)

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

and then you use this in the measurement loop:

   for (k=0; k<=OUTERREPS; k++){
     start = getclock();
     for (j=0; j<innerreps; j++){
#ifdef _QMT_PUBLIC
       delay((void *)0, 0);
#else
       delay(0, 0, 0, (void *)0);
#endif
     }
     times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
   }

the problem is, this does not take the overhead of gettimeofday into account - which overhead can easily reach 10 usecs (the observed regression). Could you try to eliminate the gettimeofday overhead from your measurement?

gettimeofday overhead is something that might have changed from .21 to .22 on your box.

	Ingo

Hi, Ingo:

In my pthread_sync code, I first call the refer() subroutine, which establishes the elapsed time (reference time) for the non-synchronized delay() using gettimeofday. Each synchronization overhead value is then obtained by subtracting the reference time from the elapsed time with the synchronization introduced. The effect of gettimeofday() should be minimal if the time difference (overhead value) is what is of interest here, unless gettimeofday behaves differently when running 8 threads vs. running 2 threads.

I will try to replace gettimeofday with a lightweight timer call in my test code. Thank you very much.

gettimeofday overhead is around 10 usecs here:

 2740  1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.10>
 2740  1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.10>

and that's the only thing that is going on when computing the reference time - and i see a similar syscall pattern in the PARALLEL and BARRIER calculations as well (with no real scheduling going on).

	Ingo

Hi, Ingo:

I guess it is good news. I patched the 2.6.21.7 kernel with your cfs patch. The results of pthread_sync are the same as for the non-patched 2.6.21 kernel. This means the performance issue is not related to the scheduler. As for the overhead of gettimeofday, there is no difference between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us for both kernels.

So what changed between 2.6.21 and 2.6.22? Any hints :-). Thank you very much for all your help.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

>> and then you use this in the measurement loop:
>>
>>    for (k=0; k<=OUTERREPS; k++){
>>      start = getclock();
>>      for (j=0; j<innerreps; j++){
>> #ifdef _QMT_PUBLIC
>>        delay((void *)0, 0);
>> #else
>>        delay(0, 0, 0, (void *)0);
>> #endif
>>      }
>>      times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>>    }
>>
>> the problem is, this does not take the overhead of gettimeofday into
>> account - which overhead can easily reach 10 usecs (the observed
>> regression). Could you try to eliminate the gettimeofday overhead from
>> your measurement?
>>
>> gettimeofday overhead is something that might have changed from .21 to .22
>> on your box.
>>
>> Ingo
>
> Hi, Ingo:
>
> In my pthread_sync code, I first call the refer() subroutine, which
> establishes the elapsed time (reference time) for the non-synchronized
> delay() using gettimeofday. Each synchronization overhead value is then
> obtained by subtracting the reference time from the elapsed time with
> the synchronization introduced. The effect of gettimeofday() should be
> minimal if the time difference (overhead value) is what is of interest
> here, unless gettimeofday behaves differently when running 8 threads
> vs. running 2 threads.
>
> I will try to replace gettimeofday with a lightweight timer call in my
> test code. Thank you very much.

gettimeofday overhead is around 10 usecs here:

 2740  1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.10>
 2740  1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.10>

and that's the only thing that is going on when computing the reference time - and i see a similar syscall pattern in the PARALLEL and BARRIER calculations as well (with no real scheduling going on).

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

I did patch the header file and recompiled the kernel. I observed no difference (the two-thread overhead stays too high). Thank you.

ok, i think i found it. You do this in your qmt/pthread_sync.c test-code:

double get_time_of_day_()
{
  ...
  err = gettimeofday(&ts, NULL);
  ...
}

and then you use this in the measurement loop:

   for (k=0; k<=OUTERREPS; k++){
     start = getclock();
     for (j=0; j<innerreps; j++){
#ifdef _QMT_PUBLIC
       delay((void *)0, 0);
#else
       delay(0, 0, 0, (void *)0);
#endif
     }
     times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
   }

the problem is, this does not take the overhead of gettimeofday into account - which overhead can easily reach 10 usecs (the observed regression). Could you try to eliminate the gettimeofday overhead from your measurement?

gettimeofday overhead is something that might have changed from .21 to .22 on your box.

	Ingo

Hi, Ingo:

In my pthread_sync code, I first call the refer() subroutine, which establishes the elapsed time (reference time) for the non-synchronized delay() using gettimeofday. Each synchronization overhead value is then obtained by subtracting the reference time from the elapsed time with the synchronization introduced. The effect of gettimeofday() should be minimal if the time difference (overhead value) is what is of interest here, unless gettimeofday behaves differently when running 8 threads vs. running 2 threads.

I will try to replace gettimeofday with a lightweight timer call in my test code. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
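For illustration, a lightweight timer call of the kind mentioned above could be built on clock_gettime(CLOCK_MONOTONIC) rather than gettimeofday(); the sketch below is not the actual qmt code, and the getclock_monotonic() name is invented here:

    #include <time.h>

    /* Monotonic clock helper: not affected by settimeofday()/NTP jumps and
     * usually cheaper than a gettimeofday() syscall on kernels that expose
     * it through the vDSO.  Returns seconds as a double; older glibc needs
     * -lrt for clock_gettime(). */
    static double getclock_monotonic(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double) ts.tv_sec + (double) ts.tv_nsec * 1.0e-9;
    }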
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> I did patch the header file and recompiled the kernel. I observed no
> difference (two threads overhead stays too high). Thank you.

ok, i think i found it. You do this in your qmt/pthread_sync.c test-code:

double get_time_of_day_()
{
  ...
  err = gettimeofday(&ts, NULL);
  ...
}

and then you use this in the measurement loop:

   for (k=0; k<=OUTERREPS; k++){
     start = getclock();
     for (j=0; j<innerreps; j++){
#ifdef _QMT_PUBLIC
       delay((void *)0, 0);
#else
       delay(0, 0, 0, (void *)0);
#endif
     }
     times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
   }

the problem is, this does not take the overhead of gettimeofday into account - which overhead can easily reach 10 usecs (the observed regression). Could you try to eliminate the gettimeofday overhead from your measurement?

gettimeofday overhead is something that might have changed from .21 to .22 on your box.

	Ingo
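One straightforward way to follow this suggestion is to calibrate the per-call cost of the clock routine with back-to-back calls and subtract it from the timed intervals. The sketch below is illustrative only and not part of qmt; the helper names are invented here:

    #include <stdio.h>
    #include <sys/time.h>

    /* Same style of timing helper as the one quoted above. */
    static double getclock(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return (double) tv.tv_sec + (double) tv.tv_usec * 1.0e-6;
    }

    /* Estimate the per-call cost of getclock() from a run of back-to-back
     * calls; the result can be subtracted from each measured interval. */
    static double clock_call_overhead(int calls)
    {
        double start;
        int i;

        start = getclock();
        for (i = 0; i < calls; i++)
            (void) getclock();
        return (getclock() - start) / (double) calls;   /* seconds per call */
    }

    int main(void)
    {
        printf("getclock() costs about %.9f seconds per call\n",
               clock_call_overhead(100000));
        return 0;
    }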
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

not "BARRIER time". I've re-read the discussion and found no hint about how to build and run a barrier test. Either i missed it or it's so obvious to you that you didnt mention it :-)

	Ingo

Hi, Ingo:
Did you do configure --enable-public-release? My qmt is for qcd calculation (one type of physics code) [...]

yes, i did exactly as instructed.

[...]. Without the above flag one can only test PARALLEL overhead. Actually the PARALLEL benchmark has the same behavior as the BARRIER. Thanks.

hm, but PARALLEL does not seem to do that much context switching. So basically you create the threads and do a few short runs to establish overhead? Threads do not get fork-balanced at the moment - but turning it on would be easy. Could you try the patch below - how does it impact your results? (and please keep affinity setting off)

	Ingo

--->
Subject: sched: reactivate fork balancing
From: Ingo Molnar <[EMAIL PROTECTED]>

reactivate fork balancing.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/topology.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/include/linux/topology.h
===================================================================
--- linux.orig/include/linux/topology.h
+++ linux/include/linux/topology.h
@@ -103,6 +103,7 @@
 	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -134,6 +135,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -165,6 +167,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| BALANCE_FOR_PKG_POWER,\

Hi, Ingo:

I did patch the header file and recompiled the kernel. I observed no difference (two threads overhead stays too high). Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

>> not "BARRIER time". I've re-read the discussion and found no hint
>> about how to build and run a barrier test. Either i missed it or it's
>> so obvious to you that you didnt mention it :-)
>>
>> Ingo
>
> Hi, Ingo:
>
> Did you do configure --enable-public-release? My qmt is for qcd
> calculation (one type of physics code) [...]

yes, i did exactly as instructed.

> [...]. Without the above flag one can only test PARALLEL overhead.
> Actually the PARALLEL benchmark has the same behavior as the BARRIER.
> Thanks.

hm, but PARALLEL does not seem to do that much context switching. So basically you create the threads and do a few short runs to establish overhead? Threads do not get fork-balanced at the moment - but turning it on would be easy. Could you try the patch below - how does it impact your results? (and please keep affinity setting off)

	Ingo

--->
Subject: sched: reactivate fork balancing
From: Ingo Molnar <[EMAIL PROTECTED]>

reactivate fork balancing.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/topology.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/include/linux/topology.h
===================================================================
--- linux.orig/include/linux/topology.h
+++ linux/include/linux/topology.h
@@ -103,6 +103,7 @@
 	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -134,6 +135,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -165,6 +167,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| BALANCE_FOR_PKG_POWER,\
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

sorry to be dense, but could you give me instructions how i could remove the affinity mask and test the "barrier overhead" myself? I have built "pthread_sync" and it outputs numbers for me - which one would be the barrier overhead: Reference_time_1?

To disable affinity, do configure --enable-public-release --disable-thread_affinity. You should see barrier overhead like the following:

Computing BARRIER time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          19.486162   19.482250   19.491400   0.002740  0

BARRIER time     = 19.486162 microseconds +/- 0.005371
BARRIER overhead =  8.996257 microseconds +/- 0.006575

ok, i did that and rebuilt. I also did "make check" and got src/pthread_sync which i can run. The only thing i'm missing, if i run src/pthread_sync, it outputs "PARALLEL time":

PARALLEL time     = 22.486103 microseconds +/- 3.944821
PARALLEL overhead = 10.638658 microseconds +/- 10.854154

not "BARRIER time". I've re-read the discussion and found no hint about how to build and run a barrier test. Either i missed it or it's so obvious to you that you didnt mention it :-)

	Ingo

Hi, Ingo:

Did you do configure --enable-public-release? My qmt is for qcd calculation (one type of physics code). Without the above flag one can only test PARALLEL overhead. Actually the PARALLEL benchmark has the same behavior as the BARRIER. Thanks.

###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

>> sorry to be dense, but could you give me instructions how i could
>> remove the affinity mask and test the "barrier overhead" myself? I
>> have built "pthread_sync" and it outputs numbers for me - which one
>> would be the barrier overhead: Reference_time_1?
>
> To disable affinity, do configure --enable-public-release
> --disable-thread_affinity. You should see barrier overhead like the
> following:
>
> Computing BARRIER time
>
> Sample_size  Average     Min         Max         S.D.      Outliers
>  20          19.486162   19.482250   19.491400   0.002740  0
>
> BARRIER time     = 19.486162 microseconds +/- 0.005371
> BARRIER overhead =  8.996257 microseconds +/- 0.006575

ok, i did that and rebuilt. I also did "make check" and got src/pthread_sync which i can run. The only thing i'm missing, if i run src/pthread_sync, it outputs "PARALLEL time":

PARALLEL time     = 22.486103 microseconds +/- 3.944821
PARALLEL overhead = 10.638658 microseconds +/- 10.854154

not "BARRIER time". I've re-read the discussion and found no hint about how to build and run a barrier test. Either i missed it or it's so obvious to you that you didnt mention it :-)

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

I just disabled the affinity mask and reran the test. There were no significant changes for two threads (barrier overhead is around 9 microseconds). As for 8 threads, the barrier overhead actually drops a little, which is good. Let me know whether I can be of any help. Thank you very much.

sorry to be dense, but could you give me instructions how i could remove the affinity mask and test the "barrier overhead" myself? I have built "pthread_sync" and it outputs numbers for me - which one would be the barrier overhead: Reference_time_1?

	Ingo

Hi, Ingo:

To disable affinity, do configure --enable-public-release --disable-thread_affinity. You should see barrier overhead like the following:

Computing BARRIER time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          19.486162   19.482250   19.491400   0.002740  0

BARRIER time     = 19.486162 microseconds +/- 0.005371
BARRIER overhead =  8.996257 microseconds +/- 0.006575

Reference_time_1 is the elapsed time for a single thread doing the simple loop without any synchronization. Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
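The BARRIER numbers above come from qmt's own barrier implementation; purely for illustration, the same kind of measurement can be sketched with the plain POSIX barrier (this is not the qmt code, and the thread and iteration counts are arbitrary choices; compile with -pthread):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define NTHREADS 2
    #define NITERS   100000

    static pthread_barrier_t barrier;

    static double getclock(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return (double) tv.tv_sec + (double) tv.tv_usec * 1.0e-6;
    }

    /* Each thread meets the others at the barrier NITERS times. */
    static void *worker(void *arg)
    {
        int i;

        (void) arg;
        for (i = 0; i < NITERS; i++)
            pthread_barrier_wait(&barrier);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        double start, per_iter;
        int i;

        pthread_barrier_init(&barrier, NULL, NTHREADS);
        start = getclock();
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        per_iter = (getclock() - start) * 1.0e6 / NITERS;
        pthread_barrier_destroy(&barrier);

        /* As in pthread_sync, the barrier overhead would be this time minus
         * the reference time of the same loop with the barrier removed. */
        printf("time per barrier: %f microseconds (%d threads)\n",
               per_iter, NTHREADS);
        return 0;
    }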
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> I just disabled the affinity mask and reran the test. There were no
> significant changes for two threads (barrier overhead is around 9
> microseconds). As for 8 threads, the barrier overhead actually drops a
> little, which is good. Let me know whether I can be of any help. Thank
> you very much.

sorry to be dense, but could you give me instructions how i could remove the affinity mask and test the "barrier overhead" myself? I have built "pthread_sync" and it outputs numbers for me - which one would be the barrier overhead: Reference_time_1?

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

Since I am using the affinity flag to bind each thread to a different core, the synchronization overhead should increase as the number of cores/threads increases. But what we observed in the new kernel is the opposite. The barrier overhead of two threads is 8.93 microseconds vs 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This will confuse most people who study synchronization/communication scalability. I know my test code is not a real-world computation, which would usually use up all cores. I hope I have explained myself clearly. Thank you very much.

btw., could you try to not use the affinity mask and let the scheduler manage the spreading of tasks? It generally has better knowledge about how tasks interrelate.

	Ingo

Hi, Ingo:

I just disabled the affinity mask and reran the test. There were no significant changes for two threads (barrier overhead is around 9 microseconds). As for 8 threads, the barrier overhead actually drops a little, which is good. Let me know whether I can be of any help. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> Since I am using the affinity flag to bind each thread to a different
> core, the synchronization overhead should increase as the number of
> cores/threads increases. But what we observed in the new kernel is the
> opposite. The barrier overhead of two threads is 8.93 microseconds vs
> 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This
> will confuse most people who study synchronization/communication
> scalability. I know my test code is not a real-world computation,
> which would usually use up all cores. I hope I have explained myself
> clearly. Thank you very much.

btw., could you try to not use the affinity mask and let the scheduler manage the spreading of tasks? It generally has better knowledge about how tasks interrelate.

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

the moment you saturate the system a bit more, the numbers should improve even with such a ping-pong test.

You are right. If I manually do the load balance (bind unrelated processes on the other cores), my test code performs as well as it did in the kernel 2.6.21.

so right now the results dont seem to be too bad to me - the higher overhead comes from two threads running on two different cores and incurring the overhead of cross-core communications. In true spread-out workloads that synchronize occasionally you'd get the same kind of overhead, so in fact this behavior is more informative of the real overhead i guess. In 2.6.21 the two threads would stick on the same core and produce artificially low latency - which would only be true in a real spread-out workload if all tasks ran on the same core. (which is hardly the thing you want on openmp)

I use the pthread_setaffinity_np call to bind one thread to one core. Unless the kernel 2.6.21 does not honor the affinity, I do not see the difference between the new kernel and the old kernel when running two threads on two cores. My test code does not do any numerical calculation, but it does spin waiting on shared/non-shared flags. The reason I am using the affinity is to test synchronization overheads among different cores. In both the new and the old kernel, I do see 200% CPU usage when I run my test code with two threads. Does this mean two threads are running on two cores? Also, I verify that a thread is indeed bound to a core by using pthread_getaffinity_np.

In any case, if i misinterpreted your numbers or if you just disagree, or if you have a workload/test that shows worse performance than it could/should, let me know.

	Ingo

Hi, Ingo:

Since I am using the affinity flag to bind each thread to a different core, the synchronization overhead should increase as the number of cores/threads increases. But what we observed in the new kernel is the opposite. The barrier overhead of two threads is 8.93 microseconds vs 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This will confuse most people who study synchronization/communication scalability. I know my test code is not a real-world computation, which would usually use up all cores. I hope I have explained myself clearly. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
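For reference, a minimal sketch of the per-thread pinning described above, using pthread_setaffinity_np() and reading the mask back with pthread_getaffinity_np(); this mirrors the idea only and is not the qmt code (the pin_self_to_core() name is invented here):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Bind the calling thread to one core and verify the binding. */
    static int pin_self_to_core(int core)
    {
        cpu_set_t set;
        int err;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (err != 0)
            return err;

        /* Read the affinity mask back to confirm the binding took effect. */
        CPU_ZERO(&set);
        err = pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
        if (err == 0 && !CPU_ISSET(core, &set))
            fprintf(stderr, "core %d missing from affinity mask\n", core);
        return err;
    }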
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

>> the moment you saturate the system a bit more, the numbers should
>> improve even with such a ping-pong test.
>
> You are right. If I manually do the load balance (bind unrelated processes
> on the other cores), my test code performs as well as it did in the
> kernel 2.6.21.

so right now the results dont seem to be too bad to me - the higher overhead comes from two threads running on two different cores and incurring the overhead of cross-core communications. In true spread-out workloads that synchronize occasionally you'd get the same kind of overhead, so in fact this behavior is more informative of the real overhead i guess. In 2.6.21 the two threads would stick on the same core and produce artificially low latency - which would only be true in a real spread-out workload if all tasks ran on the same core. (which is hardly the thing you want on openmp)

In any case, if i misinterpreted your numbers or if you just disagree, or if you have a workload/test that shows worse performance than it could/should, let me know.

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Eric Dumazet <[EMAIL PROTECTED]> wrote:

$ gcc -O2 -o burner burner.c
$ ./burner
Time to perform the unit of work on one thread is 0.040328 s
Time to perform the unit of work on 2 threads is 0.040221 s

ok, but this actually suggests that scheduling is fine for this, correct?

	Ingo

Yes, but this machine runs an old kernel. I was just showing you how to run it :)
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Eric Dumazet <[EMAIL PROTECTED]> wrote: > $ gcc -O2 -o burner burner.c > $ ./burner > Time to perform the unit of work on one thread is 0.040328 s > Time to perform the unit of work on 2 threads is 0.040221 s ok, but this actually suggests that scheduling is fine for this, correct? Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

I just ran the same test on two 2.6.24-rc4 kernels: one with CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED off. The odd behavior I described in my previous e-mails was still there for both kernels. Let me know if I can be of any more help. Thank you.

ok, i had a look at your data, and i think this is the result of the scheduler balancing out to idle CPUs more aggressively than before. Doing that is almost always a good idea though - but indeed it can result in "bad" numbers if all you do is measure the ping-pong "performance" between two threads (with no real work done by any of them).

My test code is not doing much work; it measures the overhead of various synchronization mechanisms such as barriers and locks. I am trying to see the scalability of different implementations/algorithms on multi-core machines.

the moment you saturate the system a bit more, the numbers should improve even with such a ping-pong test.

You are right. If I manually do the load balance (bind unrelated processes on the other cores), my test code performs as well as it did in the kernel 2.6.21.

do you have testcode (or a modification of your testcase sourcecode) that simulates a real-life situation where 2.6.24-rc4 does not perform as well as you'd like it to? (or if qmt.tar.gz already contains that then please point me towards that portion of the test and how i should run it - thanks!)

The qmt.tar.gz code contains a simple test program called pthread_sync under the src directory. You can change the number of threads by setting the QMT_NUM_THREADS environment variable. You can build qmt by doing configure --enable-public-release. I do not have Intel quad-core machines, so I am not sure whether the behavior will show up on an Intel platform. Our cluster is dual quad-core Opteron, which has its own hardware problem :-).
http://hardware.slashdot.org/article.pl?sid=07/12/04/237248=rss

	Ingo

Hi, Ingo:

My test code qmt can be found at ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz. There is a minor performance issue in qmt pointed out by Eric; the fix is not in the tarball yet. If I can be of any help, please let me know. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> I just ran the same test on two 2.6.24-rc4 kernels: one with
>> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
>> off. The odd behavior I described in my previous e-mails was still
>> there for both kernels. Let me know if I can be any more help. Thank you.
>
> ok, i had a look at your data, and i think this is the result of the
> scheduler balancing out to idle CPUs more aggressively than before. Doing
> that is almost always a good idea though - but indeed it can result in
> "bad" numbers if all you do is measure the ping-pong "performance" between
> two threads (with no real work done by either of them).
>
> the moment you saturate the system a bit more, the numbers should improve
> even with such a ping-pong test.
>
> do you have testcode (or a modification of your testcase sourcecode) that
> simulates a real-life situation where 2.6.24-rc4 does not perform as well
> as you'd like it to? (or if qmt.tar.gz already contains that then please
> point me towards that portion of the test and how i should run it - thanks!)
>
> 	Ingo

I cooked up a program shorter than Jie's to try to understand what is going
on. It's a pure CPU-burner program, with no thread synchronisation (except
the pthread_join at the very end). As each thread is bound to a given CPU, I
am not sure the scheduler is allowed to balance to an idle CPU. Unfortunately
I don't have a 4-way idle SMP machine available to test it.

$ gcc -O2 -o burner burner.c
$ ./burner
Time to perform the unit of work on one thread is 0.040328 s
Time to perform the unit of work on 2 threads is 0.040221 s

I tried it on a 64-way machine (thanks David :) ) and noticed some strange
results that may be related to the Niagara hardware (the time for 64 threads
was nearly double the time for one thread).

#define _GNU_SOURCE		/* for sched_setaffinity() */
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>		/* for memcmp() */

int blockthemall = 1;

static inline void cpupause()
{
#if defined(i386)
	asm volatile("rep;nop" ::: "memory");
#else
	asm volatile("" ::: "memory");
#endif
}

/*
 * Determines the number of cpus.
 * Can be overridden by the NR_CPUS environment variable.
 */
int number_of_cpus()
{
	char line[1024], *p;
	int cnt = 0;
	FILE *F;

	p = getenv("NR_CPUS");
	if (p)
		return atoi(p);
	F = fopen("/proc/cpuinfo", "r");
	if (F == NULL) {
		perror("/proc/cpuinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), F) != NULL) {
		if (memcmp(line, "processor", 9) == 0)
			cnt++;
	}
	fclose(F);
	return cnt;
}

void compute_elapsed(struct timeval *delta, const struct timeval *t0)
{
	struct timeval t1;

	gettimeofday(&t1, NULL);
	delta->tv_sec = t1.tv_sec - t0->tv_sec;
	delta->tv_usec = t1.tv_usec - t0->tv_usec;
	if (delta->tv_usec < 0) {
		delta->tv_usec += 1000000;
		delta->tv_sec--;
	}
}

/* loop count was garbled in the archive; 20 million matches the ~0.04 s timings above */
int nr_loops = 20 * 1000000;
double incr = 0.3456;

void perform_work()
{
	int i;
	double t = 0.0;

	for (i = 0; i < nr_loops; i++) {
		t += incr;
	}
	if (t < 0.0)
		printf("well... should not happen\n");
}

void set_affinity(int cpu)
{
	long cpu_mask;
	int res;

	cpu_mask = 1L << cpu;
	/* the original passes the raw long mask; cast added so it builds on current glibc */
	res = sched_setaffinity(0, sizeof(cpu_mask), (cpu_set_t *)&cpu_mask);
	if (res)
		perror("sched_setaffinity");
}

void *thread_work(void *arg)
{
	int cpu = (int)(long)arg;

	set_affinity(cpu);
	while (blockthemall)	/* spin until the main thread releases everybody */
		cpupause();
	perform_work();
	return (void *)0;
}

int main(int argc, char *argv[])
{
	struct timeval t0, delta;
	int nr_cpus, i;
	pthread_t *tids;

	gettimeofday(&t0, NULL);
	perform_work();
	compute_elapsed(&delta, &t0);
	printf("Time to perform the unit of work on one thread is %d.%06d s\n",
	       (int)delta.tv_sec, (int)delta.tv_usec);

	nr_cpus = number_of_cpus();
	if (nr_cpus <= 1)
		return 0;
	tids = malloc(nr_cpus * sizeof(pthread_t));
	for (i = 1; i < nr_cpus; i++) {
		pthread_create(tids + i, NULL, thread_work, (void *)(long)i);
	}
	set_affinity(0);
	gettimeofday(&t0, NULL);
	blockthemall = 0;	/* release the spinners, then do the same unit of work on every cpu */
	perform_work();
	for (i = 1; i < nr_cpus; i++)
		pthread_join(tids[i], NULL);
	compute_elapsed(&delta, &t0);
	printf("Time to perform the unit of work on %d threads is %d.%06d s\n",
	       nr_cpus, (int)delta.tv_sec, (int)delta.tv_usec);
	return 0;
}
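A side note for anyone reusing set_affinity() above: it relies on the old
convention of passing a raw long bitmask to sched_setaffinity(). Current
glibc exposes the same binding through cpu_set_t and the CPU_* macros; a
minimal equivalent sketch (purely illustrative, the function name is simply
reused from the program above) would be:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Bind the calling thread/process to a single CPU via the cpu_set_t interface. */
void set_affinity(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}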
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> I just ran the same test on two 2.6.24-rc4 kernels: one with
> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
> off. The odd behavior I described in my previous e-mails was still
> there for both kernels. Let me know if I can be any more help. Thank you.

ok, i had a look at your data, and i think this is the result of the
scheduler balancing out to idle CPUs more aggressively than before. Doing
that is almost always a good idea though - but indeed it can result in
"bad" numbers if all you do is measure the ping-pong "performance" between
two threads (with no real work done by either of them).

the moment you saturate the system a bit more, the numbers should improve
even with such a ping-pong test.

do you have testcode (or a modification of your testcase sourcecode) that
simulates a real-life situation where 2.6.24-rc4 does not perform as well
as you'd like it to? (or if qmt.tar.gz already contains that then please
point me towards that portion of the test and how i should run it - thanks!)

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> Simon Holm Thøgersen wrote:
>>> Wed, 2007-11-21 at 20:52 -0500, Jie Chen wrote:
>>>
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>
>> Hi, Simon:
>>
>> I will try that after the Thanksgiving holiday to find out whether the
>> odd behavior will show up using 2.6.21 with the backported CFS.
>
> would also be nice to test this with 2.6.24-rc4.
>
> 	Ingo

Hi, Ingo:

I just ran the same test on two 2.6.24-rc4 kernels: one with
CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED off.
The odd behavior I described in my previous e-mails was still there for both
kernels. Let me know if I can be any more help. Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>>> the moment you saturate the system a bit more, the numbers should
>>> improve even with such a ping-pong test.
>>
>> You are right. If I manually do the load balancing (bind unrelated
>> processes to the other cores), my test code performs as well as it did
>> in kernel 2.6.21.
>
> so right now the results dont seem to be too bad to me - the higher
> overhead comes from two threads running on two different cores and
> incurring the overhead of cross-core communication. In a true spread-out
> workload that synchronizes occasionally you'd get the same kind of
> overhead, so in fact this behavior is more informative of the real
> overhead, i guess. In 2.6.21 the two threads would stick to the same core
> and produce artificially low latency - which would only be true in a real
> spread-out workload if all tasks ran on the same core. (which is hardly
> what you want with OpenMP)

I use the pthread_setaffinity_np call to bind one thread to one core. Unless
kernel 2.6.21 does not honor the affinity, I do not see a difference in
running two threads on two cores between the new kernel and the old kernel.
My test code does not do any numerical calculation, but it does spin-wait on
shared/non-shared flags. The reason I am using affinity is to test
synchronization overheads among different cores. In both the new and the old
kernel I do see 200% CPU usage when I run my test code with two threads.
Does this mean two threads are running on two cores? Also, I verify that a
thread is indeed bound to a core by using pthread_getaffinity_np.

> In any case, if i misinterpreted your numbers or if you just disagree, or
> if you have a workload/test that shows worse performance than it
> could/should, let me know.
>
> 	Ingo

Hi, Ingo:

Since I am using the affinity flag to bind each thread to a different core,
the synchronization overhead should increase as the number of cores/threads
increases. But what we observed in the new kernel is the opposite: the
barrier overhead for two threads is 8.93 microseconds vs 1.86 microseconds
for 8 threads (in the old kernel it is 0.49 vs 1.86). This will confuse most
people who study synchronization/communication scalability. I know my test
code is not a real-world computation, which would usually use up all the
cores. I hope I have explained myself clearly. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
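To make the setup described above concrete, here is a minimal sketch of what
binding one thread per core with pthread_setaffinity_np and spin-waiting on a
shared flag looks like. It is not the qmt source; the core numbers, the flag,
and the two-thread layout are made up for illustration.

#define _GNU_SOURCE		/* for pthread_setaffinity_np() */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static volatile int go;		/* shared flag the pinned worker spins on */

static void bind_to_core(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
		perror("pthread_setaffinity_np");
}

static void *worker(void *arg)
{
	bind_to_core((int)(long)arg);
	while (!go)		/* spin-wait, as the qmt threads do on their flags */
		;		/* a real benchmark would add a pause/cpu_relax hint here */
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, (void *)1L);	/* worker pinned to core 1 */
	bind_to_core(0);				/* main thread pinned to core 0 */
	go = 1;						/* this hand-off crosses cores by construction */
	pthread_join(tid, NULL);
	return 0;
}

With both threads pinned to different cores, every flag hand-off or barrier
crossing has to go through cross-core communication, which is exactly the
cost the benchmark tries to isolate, independent of where the scheduler would
otherwise have placed the threads.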
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Eric Dumazet <[EMAIL PROTECTED]> wrote:
>> $ gcc -O2 -o burner burner.c
>> $ ./burner
>> Time to perform the unit of work on one thread is 0.040328 s
>> Time to perform the unit of work on 2 threads is 0.040221 s
>
> ok, but this actually suggests that scheduling is fine for this, correct?
>
> 	Ingo

Yes. But this machine runs an old kernel. I was just showing you how to run
it :)
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
>> the moment you saturate the system a bit more, the numbers should
>> improve even with such a ping-pong test.
>
> You are right. If I manually do the load balancing (bind unrelated
> processes to the other cores), my test code performs as well as it did
> in kernel 2.6.21.

so right now the results dont seem to be too bad to me - the higher overhead
comes from two threads running on two different cores and incurring the
overhead of cross-core communication. In a true spread-out workload that
synchronizes occasionally you'd get the same kind of overhead, so in fact
this behavior is more informative of the real overhead, i guess. In 2.6.21
the two threads would stick to the same core and produce artificially low
latency - which would only be true in a real spread-out workload if all tasks
ran on the same core. (which is hardly what you want with OpenMP)

In any case, if i misinterpreted your numbers or if you just disagree, or if
you have a workload/test that shows worse performance than it could/should,
let me know.

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Eric Dumazet <[EMAIL PROTECTED]> wrote:
> $ gcc -O2 -o burner burner.c
> $ ./burner
> Time to perform the unit of work on one thread is 0.040328 s
> Time to perform the unit of work on 2 threads is 0.040221 s

ok, but this actually suggests that scheduling is fine for this, correct?

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
>> sorry to be dense, but could you give me instructions on how i could
>> remove the affinity mask and test the barrier overhead myself? I have
>> built pthread_sync and it outputs numbers for me - which one would be
>> the barrier overhead: Reference_time_1 ?
>
> To disable affinity, do configure --enable-public-release
> --disable-thread_affinity. You should see barrier overhead like the
> following:
>
> Computing BARRIER time
> Sample_size    Average       Min           Max           S.D.       Outliers
> 20             19.486162     19.482250     19.491400     0.002740   0
>
> BARRIER time     = 19.486162 microseconds +/- 0.005371
> BARRIER overhead =  8.996257 microseconds +/- 0.006575

ok, i did that and rebuilt. I also did make check and got src/pthread_sync,
which i can run. The only thing i'm missing: if i run src/pthread_sync, it
outputs PARALLEL time:

PARALLEL time     = 22.486103 microseconds +/- 3.944821
PARALLEL overhead = 10.638658 microseconds +/- 10.854154

not BARRIER time. I've re-read the discussion and found no hint about how to
build and run a barrier test. Either i missed it or it's so obvious to you
that you didn't mention it :-)

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> Since I am using the affinity flag to bind each thread to a different
>> core, the synchronization overhead should increase as the number of
>> cores/threads increases. But what we observed in the new kernel is the
>> opposite: the barrier overhead for two threads is 8.93 microseconds vs
>> 1.86 microseconds for 8 threads (in the old kernel it is 0.49 vs 1.86).
>> This will confuse most people who study synchronization/communication
>> scalability. I know my test code is not a real-world computation, which
>> would usually use up all the cores. I hope I have explained myself
>> clearly. Thank you very much.
>
> btw., could you try to not use the affinity mask and let the scheduler
> manage the spreading of tasks? It generally has better knowledge about
> how tasks interrelate.
>
> 	Ingo

Hi, Ingo:

I just disabled the affinity mask and reran the test. There were no
significant changes for two threads (the barrier overhead is around 9
microseconds). As for 8 threads, the barrier overhead actually drops a
little, which is good. Let me know whether I can be of any help. Thank you
very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
> Since I am using the affinity flag to bind each thread to a different
> core, the synchronization overhead should increase as the number of
> cores/threads increases. But what we observed in the new kernel is the
> opposite: the barrier overhead for two threads is 8.93 microseconds vs
> 1.86 microseconds for 8 threads (in the old kernel it is 0.49 vs 1.86).
> This will confuse most people who study synchronization/communication
> scalability. I know my test code is not a real-world computation, which
> would usually use up all the cores. I hope I have explained myself
> clearly. Thank you very much.

btw., could you try to not use the affinity mask and let the scheduler
manage the spreading of tasks? It generally has better knowledge about how
tasks interrelate.

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>>> sorry to be dense, but could you give me instructions on how i could
>>> remove the affinity mask and test the barrier overhead myself? I have
>>> built pthread_sync and it outputs numbers for me - which one would be
>>> the barrier overhead: Reference_time_1 ?
>>
>> To disable affinity, do configure --enable-public-release
>> --disable-thread_affinity. You should see barrier overhead like the
>> following:
>>
>> Computing BARRIER time
>> Sample_size    Average       Min           Max           S.D.       Outliers
>> 20             19.486162     19.482250     19.491400     0.002740   0
>>
>> BARRIER time     = 19.486162 microseconds +/- 0.005371
>> BARRIER overhead =  8.996257 microseconds +/- 0.006575
>
> ok, i did that and rebuilt. I also did make check and got
> src/pthread_sync, which i can run. The only thing i'm missing: if i run
> src/pthread_sync, it outputs PARALLEL time:
>
> PARALLEL time     = 22.486103 microseconds +/- 3.944821
> PARALLEL overhead = 10.638658 microseconds +/- 10.854154
>
> not BARRIER time. I've re-read the discussion and found no hint about how
> to build and run a barrier test. Either i missed it or it's so obvious to
> you that you didn't mention it :-)
>
> 	Ingo

Hi, Ingo:

Did you run configure --enable-public-release? My qmt is for QCD
calculations (one type of physics code). Without the above flag one can only
test the PARALLEL overhead. Actually, the PARALLEL benchmark has the same
behavior as the BARRIER one. Thanks.

###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> I just disabled the affinity mask and reran the test. There were no
>> significant changes for two threads (the barrier overhead is around 9
>> microseconds). As for 8 threads, the barrier overhead actually drops a
>> little, which is good. Let me know whether I can be of any help. Thank
>> you very much.
>
> sorry to be dense, but could you give me instructions on how i could
> remove the affinity mask and test the barrier overhead myself? I have
> built pthread_sync and it outputs numbers for me - which one would be
> the barrier overhead: Reference_time_1 ?
>
> 	Ingo

Hi, Ingo:

To disable affinity, do configure --enable-public-release
--disable-thread_affinity. You should see barrier overhead like the
following:

Computing BARRIER time
Sample_size    Average       Min           Max           S.D.       Outliers
20             19.486162     19.482250     19.491400     0.002740   0

BARRIER time     = 19.486162 microseconds +/- 0.005371
BARRIER overhead =  8.996257 microseconds +/- 0.006575

Reference_time_1 is the elapsed time for a single thread doing the simple
loop without any synchronization. Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
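(If it helps to read the output above: the overhead line is presumably the
barrier time minus that reference loop, which would put Reference_time_1 at
about 19.486162 - 8.996257 = 10.49 microseconds in this run. This reading is
inferred from the printed numbers, not from the qmt source.)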
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
> I just disabled the affinity mask and reran the test. There were no
> significant changes for two threads (the barrier overhead is around 9
> microseconds). As for 8 threads, the barrier overhead actually drops a
> little, which is good. Let me know whether I can be of any help. Thank
> you very much.

sorry to be dense, but could you give me instructions on how i could remove
the affinity mask and test the barrier overhead myself? I have built
pthread_sync and it outputs numbers for me - which one would be the barrier
overhead: Reference_time_1 ?

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> Simon Holm Thøgersen wrote:
>>> Wed, 2007-11-21 at 20:52 -0500, Jie Chen wrote:
>>>
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>
>> Hi, Simon:
>>
>> I will try that after the Thanksgiving holiday to find out whether the
>> odd behavior will show up using 2.6.21 with the backported CFS.
>
> would also be nice to test this with 2.6.24-rc4.
>
> 	Ingo

Hi, Ingo:

I will test 2.6.24-rc4 this week and let you know the result. Thanks.

--
#
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [EMAIL PROTECTED]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
> Simon Holm Thøgersen wrote:
>> Wed, 2007-11-21 at 20:52 -0500, Jie Chen wrote:
>>
>> There is a backport of the CFS scheduler to 2.6.21, see
>> http://lkml.org/lkml/2007/11/19/127
>
> Hi, Simon:
>
> I will try that after the Thanksgiving holiday to find out whether the
> odd behavior will show up using 2.6.21 with the backported CFS.

would also be nice to test this with 2.6.24-rc4.

	Ingo