Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
On Tue, 2007-12-11 at 17:11 -0500, Jie Chen wrote:
> Ingo Molnar wrote:
> > * Jie Chen <[EMAIL PROTECTED]> wrote:
> >
> >> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP
> >> kernel.
> >
> >> 2 threads:
> >
> >> PARALLEL time     = 11.106580 microseconds +/- 0.002460
> >> PARALLEL overhead =  0.617590 microseconds +/- 0.003409
> >
> >> Output for Kernel 2.6.24-rc4 #1 SMP
> >
> >> PARALLEL time     = 19.668450 microseconds +/- 0.015782
> >> PARALLEL overhead =  9.157945 microseconds +/- 0.018217
> >
> > ok, so the problem is that this PARALLEL time has an additional +9 usecs
> > overhead, right? I dont see this myself on a Core2 CPU:
> >
> >   PARALLEL time     = 10.446933 microseconds +/- 0.078849
> >   PARALLEL overhead =  0.751732 microseconds +/- 0.177446

On my dual-socket AMD Athlon MP:

2.6.20-13-generic
  PARALLEL time     = 22.751875 microseconds +/- 21.370942
  PARALLEL overhead =  7.046595 microseconds +/- 24.370040

2.6.24-rc5
  PARALLEL time     = 17.365543 microseconds +/- 3.295133
  PARALLEL overhead =  2.213722 microseconds +/- 4.797886
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP kernel.

2 threads:
PARALLEL time     = 11.106580 microseconds +/- 0.002460
PARALLEL overhead =  0.617590 microseconds +/- 0.003409

Output for Kernel 2.6.24-rc4 #1 SMP
PARALLEL time     = 19.668450 microseconds +/- 0.015782
PARALLEL overhead =  9.157945 microseconds +/- 0.018217

ok, so the problem is that this PARALLEL time has an additional +9 usecs overhead, right? I dont see this myself on a Core2 CPU:

PARALLEL time     = 10.446933 microseconds +/- 0.078849
PARALLEL overhead =  0.751732 microseconds +/- 0.177446

	Ingo

Hi, Ingo:

Yes, there is an extra 9 usecs of overhead when running two threads on the 2.6.24 kernel, with a total of 8 cores (two quad-core Opterons). What is the total number of cores you have? I do not have machines with dual quad-core Xeons here for a direct comparison. Thank you.

--
#
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [EMAIL PROTECTED]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> The following is pthread_sync output for 2.6.21.7-cfs-v24 #1 SMP
> kernel.
>
> 2 threads:
>
> PARALLEL time     = 11.106580 microseconds +/- 0.002460
> PARALLEL overhead =  0.617590 microseconds +/- 0.003409
>
> Output for Kernel 2.6.24-rc4 #1 SMP
>
> PARALLEL time     = 19.668450 microseconds +/- 0.015782
> PARALLEL overhead =  9.157945 microseconds +/- 0.018217

ok, so the problem is that this PARALLEL time has an additional +9 usecs overhead, right? I dont see this myself on a Core2 CPU:

PARALLEL time     = 10.446933 microseconds +/- 0.078849
PARALLEL overhead =  0.751732 microseconds +/- 0.177446

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

Hi, Ingo:
I guess it is good news. I patched the 2.6.21.7 kernel with your cfs patch. The results of pthread_sync are the same as for the non-patched 2.6.21 kernel. This means the performance issue is not related to the scheduler. As for the overhead of gettimeofday, there is no difference between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us for both kernels.

could you please paste again the relevant portion of the output you get on a "good" .21 kernel versus the output you get on a "bad" .24 kernel?

So what changed between 2.6.21 and 2.6.22? Any hints :-). Thank you very much for all your help.

we'll figure it out i'm sure :)

	Ingo

Hi, Ingo:

The following is pthread_sync output for the 2.6.21.7-cfs-v24 #1 SMP kernel.

2 threads:
Computing reference time 1
Sample_size  Average     Min         Max         S.D.      Outliers
 20          10.489085   10.488800   10.491100   0.000539  1

Reference_time_1 = 10.489085 microseconds +/- 0.001057

Computing PARALLEL time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          11.106580   11.105650   11.109700   0.001255  0

PARALLEL time     = 11.106580 microseconds +/- 0.002460
PARALLEL overhead =  0.617590 microseconds +/- 0.003409

8 threads:
Computing reference time 1
Sample_size  Average     Min         Max         S.D.      Outliers
 20          10.488735   10.488500   10.490700   0.000484  1

Reference_time_1 = 10.488735 microseconds +/- 0.000948

Computing PARALLEL time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          13.000647   12.991050   13.052700   0.012592  1

PARALLEL time     = 13.000647 microseconds +/- 0.024680
PARALLEL overhead =  2.511907 microseconds +/- 0.025594

Output for Kernel 2.6.24-rc4 #1 SMP

2 threads:
Computing reference time 1
Sample_size  Average     Min         Max         S.D.      Outliers
 20          10.510535   10.508600   10.518200   0.002237  1

Reference_time_1 = 10.510535 microseconds +/- 0.004384

Computing PARALLEL time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          19.668450   19.650200   19.679650   0.008052  0

PARALLEL time     = 19.668450 microseconds +/- 0.015782
PARALLEL overhead =  9.157945 microseconds +/- 0.018217

8 threads:
Computing reference time 1
Sample_size  Average     Min         Max         S.D.      Outliers
 20          10.491285   10.490100   10.494900   0.001085  1

Reference_time_1 = 10.491285 microseconds +/- 0.002127

Computing PARALLEL time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          13.090080   13.079150   13.131450   0.010995  1

PARALLEL time     = 13.090080 microseconds +/- 0.021550
PARALLEL overhead =  2.598590 microseconds +/- 0.024534

For 8 threads, both kernels show similar performance numbers. But for 2 threads, 2.6.21 is much better than 2.6.24-rc4. Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> Hi, Ingo:
>
> I guess it is good news. I patched the 2.6.21.7 kernel with your cfs
> patch. The results of pthread_sync are the same as for the non-patched
> 2.6.21 kernel. This means the performance issue is not related to the
> scheduler. As for the overhead of gettimeofday, there is no difference
> between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us
> for both kernels.

could you please paste again the relevant portion of the output you get on a "good" .21 kernel versus the output you get on a "bad" .24 kernel?

> So what changed between 2.6.21 and 2.6.22? Any hints :-). Thank you
> very much for all your help.

we'll figure it out i'm sure :)

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

and then you use this in the measurement loop:

   for (k=0; k<=OUTERREPS; k++){
     start = getclock();
     for (j=0; j<innerreps; j++){
#ifdef _QMT_PUBLIC
       delay((void *)0, 0);
#else
       delay(0, 0, 0, (void *)0);
#endif
     }
     times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
   }

the problem is, this does not take the overhead of gettimeofday into account - which overhead can easily reach 10 usecs (the observed regression). Could you try to eliminate the gettimeofday overhead from your measurement?

gettimeofday overhead is something that might have changed from .21 to .22 on your box.

	Ingo

Hi, Ingo:

In my pthread_sync code, I first call the refer() subroutine, which establishes the elapsed time (reference time) for the non-synchronized delay() using gettimeofday. Each synchronization overhead value is then obtained by subtracting the reference time from the elapsed time with the synchronization introduced. The effect of gettimeofday() should be minimal if the time difference (overhead value) is what is of interest here, unless gettimeofday behaves differently when running 8 threads vs. running 2 threads.

I will try to replace gettimeofday with a lightweight timer call in my test code. Thank you very much.

gettimeofday overhead is around 10 usecs here:

 2740  1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.10>
 2740  1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.10>

and that's the only thing that is going on when computing the reference time - and i see a similar syscall pattern in the PARALLEL and BARRIER calculations as well (with no real scheduling going on).

	Ingo

Hi, Ingo:

I guess it is good news. I patched the 2.6.21.7 kernel with your cfs patch. The results of pthread_sync are the same as for the non-patched 2.6.21 kernel. This means the performance issue is not related to the scheduler. As for the overhead of gettimeofday, there is no difference between 2.6.21 and 2.6.24-rc4. The reference time is around 10.5 us for both kernels.

So what changed between 2.6.21 and 2.6.22? Any hints :-). Thank you very much for all your help.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

>> and then you use this in the measurement loop:
>>
>>    for (k=0; k<=OUTERREPS; k++){
>>      start = getclock();
>>      for (j=0; j<innerreps; j++){
>> #ifdef _QMT_PUBLIC
>>        delay((void *)0, 0);
>> #else
>>        delay(0, 0, 0, (void *)0);
>> #endif
>>      }
>>      times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
>>    }
>>
>> the problem is, this does not take the overhead of gettimeofday into
>> account - which overhead can easily reach 10 usecs (the observed
>> regression). Could you try to eliminate the gettimeofday overhead from
>> your measurement?
>>
>> gettimeofday overhead is something that might have changed from .21 to .22
>> on your box.
>>
>> Ingo
>
> Hi, Ingo:
>
> In my pthread_sync code, I first call the refer() subroutine, which
> establishes the elapsed time (reference time) for the non-synchronized
> delay() using gettimeofday. Each synchronization overhead value is then
> obtained by subtracting the reference time from the elapsed time with
> the synchronization introduced. The effect of gettimeofday() should be
> minimal if the time difference (overhead value) is what is of interest
> here, unless gettimeofday behaves differently when running 8 threads
> vs. running 2 threads.
>
> I will try to replace gettimeofday with a lightweight timer call in my
> test code. Thank you very much.

gettimeofday overhead is around 10 usecs here:

 2740  1197359374.873214 gettimeofday({1197359374, 873225}, NULL) = 0 <0.10>
 2740  1197359374.970592 gettimeofday({1197359374, 970608}, NULL) = 0 <0.10>

and that's the only thing that is going on when computing the reference time - and i see a similar syscall pattern in the PARALLEL and BARRIER calculations as well (with no real scheduling going on).

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

I did patch the header file and recompiled the kernel. I observed no difference (the two-thread overhead stays too high). Thank you.

ok, i think i found it. You do this in your qmt/pthread_sync.c test-code:

double get_time_of_day_()
{
  ...
  err = gettimeofday(&ts, NULL);
  ...
}

and then you use this in the measurement loop:

   for (k=0; k<=OUTERREPS; k++){
     start = getclock();
     for (j=0; j<innerreps; j++){
#ifdef _QMT_PUBLIC
       delay((void *)0, 0);
#else
       delay(0, 0, 0, (void *)0);
#endif
     }
     times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
   }

the problem is, this does not take the overhead of gettimeofday into account - which overhead can easily reach 10 usecs (the observed regression). Could you try to eliminate the gettimeofday overhead from your measurement?

gettimeofday overhead is something that might have changed from .21 to .22 on your box.

	Ingo

Hi, Ingo:

In my pthread_sync code, I first call the refer() subroutine, which establishes the elapsed time (reference time) for the non-synchronized delay() using gettimeofday. Each synchronization overhead value is then obtained by subtracting the reference time from the elapsed time with the synchronization introduced. The effect of gettimeofday() should be minimal if the time difference (overhead value) is what is of interest here, unless gettimeofday behaves differently when running 8 threads vs. running 2 threads.

I will try to replace gettimeofday with a lightweight timer call in my test code. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
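For illustration, a lightweight timer call of the kind mentioned above could be built on clock_gettime(CLOCK_MONOTONIC) rather than gettimeofday(); the sketch below is not the actual qmt code, and the getclock_monotonic() name is invented here:

    #include <time.h>

    /* Monotonic clock helper: not affected by settimeofday()/NTP jumps and
     * usually cheaper than a gettimeofday() syscall on kernels that expose
     * it through the vDSO.  Returns seconds as a double; older glibc needs
     * -lrt for clock_gettime(). */
    static double getclock_monotonic(void)
    {
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (double) ts.tv_sec + (double) ts.tv_nsec * 1.0e-9;
    }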
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> I did patch the header file and recompiled the kernel. I observed no
> difference (two threads overhead stays too high). Thank you.

ok, i think i found it. You do this in your qmt/pthread_sync.c test-code:

double get_time_of_day_()
{
  ...
  err = gettimeofday(&ts, NULL);
  ...
}

and then you use this in the measurement loop:

   for (k=0; k<=OUTERREPS; k++){
     start = getclock();
     for (j=0; j<innerreps; j++){
#ifdef _QMT_PUBLIC
       delay((void *)0, 0);
#else
       delay(0, 0, 0, (void *)0);
#endif
     }
     times[k] = (getclock() - start) * 1.0e6 / (double) innerreps;
   }

the problem is, this does not take the overhead of gettimeofday into account - which overhead can easily reach 10 usecs (the observed regression). Could you try to eliminate the gettimeofday overhead from your measurement?

gettimeofday overhead is something that might have changed from .21 to .22 on your box.

	Ingo
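One straightforward way to follow this suggestion is to calibrate the per-call cost of the clock routine with back-to-back calls and subtract it from the timed intervals. The sketch below is illustrative only and not part of qmt; the helper names are invented here:

    #include <stdio.h>
    #include <sys/time.h>

    /* Same style of timing helper as the one quoted above. */
    static double getclock(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return (double) tv.tv_sec + (double) tv.tv_usec * 1.0e-6;
    }

    /* Estimate the per-call cost of getclock() from a run of back-to-back
     * calls; the result can be subtracted from each measured interval. */
    static double clock_call_overhead(int calls)
    {
        double start;
        int i;

        start = getclock();
        for (i = 0; i < calls; i++)
            (void) getclock();
        return (getclock() - start) / (double) calls;   /* seconds per call */
    }

    int main(void)
    {
        printf("getclock() costs about %.9f seconds per call\n",
               clock_call_overhead(100000));
        return 0;
    }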
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

not "BARRIER time". I've re-read the discussion and found no hint about how to build and run a barrier test. Either i missed it or it's so obvious to you that you didnt mention it :-)

	Ingo

Hi, Ingo:
Did you do configure --enable-public-release? My qmt is for qcd calculation (one type of physics code) [...]

yes, i did exactly as instructed.

[...]. Without the above flag one can only test PARALLEL overhead. Actually the PARALLEL benchmark has the same behavior as the BARRIER. Thanks.

hm, but PARALLEL does not seem to do that much context switching. So basically you create the threads and do a few short runs to establish overhead? Threads do not get fork-balanced at the moment - but turning it on would be easy. Could you try the patch below - how does it impact your results? (and please keep affinity setting off)

	Ingo

--->
Subject: sched: reactivate fork balancing
From: Ingo Molnar <[EMAIL PROTECTED]>

reactivate fork balancing.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/topology.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/include/linux/topology.h
===================================================================
--- linux.orig/include/linux/topology.h
+++ linux/include/linux/topology.h
@@ -103,6 +103,7 @@
 	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -134,6 +135,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -165,6 +167,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| BALANCE_FOR_PKG_POWER,\

Hi, Ingo:

I did patch the header file and recompiled the kernel. I observed no difference (two threads overhead stays too high). Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

>> not "BARRIER time". I've re-read the discussion and found no hint
>> about how to build and run a barrier test. Either i missed it or it's
>> so obvious to you that you didnt mention it :-)
>>
>> Ingo
>
> Hi, Ingo:
>
> Did you do configure --enable-public-release? My qmt is for qcd
> calculation (one type of physics code) [...]

yes, i did exactly as instructed.

> [...]. Without the above flag one can only test PARALLEL overhead.
> Actually the PARALLEL benchmark has the same behavior as the BARRIER.
> Thanks.

hm, but PARALLEL does not seem to do that much context switching. So basically you create the threads and do a few short runs to establish overhead? Threads do not get fork-balanced at the moment - but turning it on would be easy. Could you try the patch below - how does it impact your results? (and please keep affinity setting off)

	Ingo

--->
Subject: sched: reactivate fork balancing
From: Ingo Molnar <[EMAIL PROTECTED]>

reactivate fork balancing.

Signed-off-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 include/linux/topology.h |    3 +++
 1 file changed, 3 insertions(+)

Index: linux/include/linux/topology.h
===================================================================
--- linux.orig/include/linux/topology.h
+++ linux/include/linux/topology.h
@@ -103,6 +103,7 @@
 	.forkexec_idx		= 0,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -134,6 +135,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| SD_WAKE_IDLE		\
@@ -165,6 +167,7 @@
 	.forkexec_idx		= 1,			\
 	.flags			= SD_LOAD_BALANCE	\
 				| SD_BALANCE_NEWIDLE	\
+				| SD_BALANCE_FORK	\
 				| SD_BALANCE_EXEC	\
 				| SD_WAKE_AFFINE	\
 				| BALANCE_FOR_PKG_POWER,\
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

sorry to be dense, but could you give me instructions how i could remove the affinity mask and test the "barrier overhead" myself? I have built "pthread_sync" and it outputs numbers for me - which one would be the barrier overhead: Reference_time_1?

To disable affinity, do configure --enable-public-release --disable-thread_affinity. You should see barrier overhead like the following:

Computing BARRIER time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          19.486162   19.482250   19.491400   0.002740  0

BARRIER time     = 19.486162 microseconds +/- 0.005371
BARRIER overhead =  8.996257 microseconds +/- 0.006575

ok, i did that and rebuilt. I also did "make check" and got src/pthread_sync which i can run. The only thing i'm missing, if i run src/pthread_sync, it outputs "PARALLEL time":

PARALLEL time     = 22.486103 microseconds +/- 3.944821
PARALLEL overhead = 10.638658 microseconds +/- 10.854154

not "BARRIER time". I've re-read the discussion and found no hint about how to build and run a barrier test. Either i missed it or it's so obvious to you that you didnt mention it :-)

	Ingo

Hi, Ingo:

Did you do configure --enable-public-release? My qmt is for qcd calculation (one type of physics code). Without the above flag one can only test PARALLEL overhead. Actually the PARALLEL benchmark has the same behavior as the BARRIER. Thanks.

###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

>> sorry to be dense, but could you give me instructions how i could
>> remove the affinity mask and test the "barrier overhead" myself? I
>> have built "pthread_sync" and it outputs numbers for me - which one
>> would be the barrier overhead: Reference_time_1?
>
> To disable affinity, do configure --enable-public-release
> --disable-thread_affinity. You should see barrier overhead like the
> following:
>
> Computing BARRIER time
>
> Sample_size  Average     Min         Max         S.D.      Outliers
>  20          19.486162   19.482250   19.491400   0.002740  0
>
> BARRIER time     = 19.486162 microseconds +/- 0.005371
> BARRIER overhead =  8.996257 microseconds +/- 0.006575

ok, i did that and rebuilt. I also did "make check" and got src/pthread_sync which i can run. The only thing i'm missing, if i run src/pthread_sync, it outputs "PARALLEL time":

PARALLEL time     = 22.486103 microseconds +/- 3.944821
PARALLEL overhead = 10.638658 microseconds +/- 10.854154

not "BARRIER time". I've re-read the discussion and found no hint about how to build and run a barrier test. Either i missed it or it's so obvious to you that you didnt mention it :-)

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

I just disabled the affinity mask and reran the test. There were no significant changes for two threads (barrier overhead is around 9 microseconds). As for 8 threads, the barrier overhead actually drops a little, which is good. Let me know whether I can be of any help. Thank you very much.

sorry to be dense, but could you give me instructions how i could remove the affinity mask and test the "barrier overhead" myself? I have built "pthread_sync" and it outputs numbers for me - which one would be the barrier overhead: Reference_time_1?

	Ingo

Hi, Ingo:

To disable affinity, do configure --enable-public-release --disable-thread_affinity. You should see barrier overhead like the following:

Computing BARRIER time
Sample_size  Average     Min         Max         S.D.      Outliers
 20          19.486162   19.482250   19.491400   0.002740  0

BARRIER time     = 19.486162 microseconds +/- 0.005371
BARRIER overhead =  8.996257 microseconds +/- 0.006575

Reference_time_1 is the elapsed time for a single thread doing the simple loop without any synchronization. Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
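The BARRIER numbers above come from qmt's own barrier implementation; purely for illustration, the same kind of measurement can be sketched with the plain POSIX barrier (this is not the qmt code, and the thread and iteration counts are arbitrary choices; compile with -pthread):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define NTHREADS 2
    #define NITERS   100000

    static pthread_barrier_t barrier;

    static double getclock(void)
    {
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return (double) tv.tv_sec + (double) tv.tv_usec * 1.0e-6;
    }

    /* Each thread meets the others at the barrier NITERS times. */
    static void *worker(void *arg)
    {
        int i;

        (void) arg;
        for (i = 0; i < NITERS; i++)
            pthread_barrier_wait(&barrier);
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        double start, per_iter;
        int i;

        pthread_barrier_init(&barrier, NULL, NTHREADS);
        start = getclock();
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        per_iter = (getclock() - start) * 1.0e6 / NITERS;
        pthread_barrier_destroy(&barrier);

        /* As in pthread_sync, the barrier overhead would be this time minus
         * the reference time of the same loop with the barrier removed. */
        printf("time per barrier: %f microseconds (%d threads)\n",
               per_iter, NTHREADS);
        return 0;
    }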
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> I just disabled the affinity mask and reran the test. There were no
> significant changes for two threads (barrier overhead is around 9
> microseconds). As for 8 threads, the barrier overhead actually drops a
> little, which is good. Let me know whether I can be of any help. Thank
> you very much.

sorry to be dense, but could you give me instructions how i could remove the affinity mask and test the "barrier overhead" myself? I have built "pthread_sync" and it outputs numbers for me - which one would be the barrier overhead: Reference_time_1?

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

Since I am using the affinity flag to bind each thread to a different core, the synchronization overhead should increase as the number of cores/threads increases. But what we observed in the new kernel is the opposite. The barrier overhead of two threads is 8.93 microseconds vs 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This will confuse most people who study synchronization/communication scalability. I know my test code is not a real-world computation, which would usually use up all cores. I hope I have explained myself clearly. Thank you very much.

btw., could you try to not use the affinity mask and let the scheduler manage the spreading of tasks? It generally has better knowledge about how tasks interrelate.

	Ingo

Hi, Ingo:

I just disabled the affinity mask and reran the test. There were no significant changes for two threads (barrier overhead is around 9 microseconds). As for 8 threads, the barrier overhead actually drops a little, which is good. Let me know whether I can be of any help. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> Since I am using the affinity flag to bind each thread to a different
> core, the synchronization overhead should increase as the number of
> cores/threads increases. But what we observed in the new kernel is the
> opposite. The barrier overhead of two threads is 8.93 microseconds vs
> 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This
> will confuse most people who study synchronization/communication
> scalability. I know my test code is not a real-world computation,
> which would usually use up all cores. I hope I have explained myself
> clearly. Thank you very much.

btw., could you try to not use the affinity mask and let the scheduler manage the spreading of tasks? It generally has better knowledge about how tasks interrelate.

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

the moment you saturate the system a bit more, the numbers should improve even with such a ping-pong test.

You are right. If I manually do the load balance (bind unrelated processes on the other cores), my test code performs as well as it did in the kernel 2.6.21.

so right now the results dont seem to be too bad to me - the higher overhead comes from two threads running on two different cores and incurring the overhead of cross-core communications. In true spread-out workloads that synchronize occasionally you'd get the same kind of overhead, so in fact this behavior is more informative of the real overhead i guess. In 2.6.21 the two threads would stick on the same core and produce artificially low latency - which would only be true in a real spread-out workload if all tasks ran on the same core. (which is hardly the thing you want on openmp)

I use the pthread_setaffinity_np call to bind one thread to one core. Unless the kernel 2.6.21 does not honor the affinity, I do not see the difference between the new kernel and the old kernel when running two threads on two cores. My test code does not do any numerical calculation, but it does spin waiting on shared/non-shared flags. The reason I am using the affinity is to test synchronization overheads among different cores. In both the new and the old kernel, I do see 200% CPU usage when I run my test code with two threads. Does this mean two threads are running on two cores? Also, I verify that a thread is indeed bound to a core by using pthread_getaffinity_np.

In any case, if i misinterpreted your numbers or if you just disagree, or if you have a workload/test that shows worse performance than it could/should, let me know.

	Ingo

Hi, Ingo:

Since I am using the affinity flag to bind each thread to a different core, the synchronization overhead should increase as the number of cores/threads increases. But what we observed in the new kernel is the opposite. The barrier overhead of two threads is 8.93 microseconds vs 1.86 microseconds for 8 threads (the old kernel is 0.49 vs 1.86). This will confuse most people who study synchronization/communication scalability. I know my test code is not a real-world computation, which would usually use up all cores. I hope I have explained myself clearly. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
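For reference, a minimal sketch of the per-thread pinning described above, using pthread_setaffinity_np() and reading the mask back with pthread_getaffinity_np(); this mirrors the idea only and is not the qmt code (the pin_self_to_core() name is invented here):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Bind the calling thread to one core and verify the binding. */
    static int pin_self_to_core(int core)
    {
        cpu_set_t set;
        int err;

        CPU_ZERO(&set);
        CPU_SET(core, &set);
        err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (err != 0)
            return err;

        /* Read the affinity mask back to confirm the binding took effect. */
        CPU_ZERO(&set);
        err = pthread_getaffinity_np(pthread_self(), sizeof(set), &set);
        if (err == 0 && !CPU_ISSET(core, &set))
            fprintf(stderr, "core %d missing from affinity mask\n", core);
        return err;
    }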
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

>> the moment you saturate the system a bit more, the numbers should
>> improve even with such a ping-pong test.
>
> You are right. If I manually do the load balance (bind unrelated processes
> on the other cores), my test code performs as well as it did in the
> kernel 2.6.21.

so right now the results dont seem to be too bad to me - the higher overhead comes from two threads running on two different cores and incurring the overhead of cross-core communications. In true spread-out workloads that synchronize occasionally you'd get the same kind of overhead, so in fact this behavior is more informative of the real overhead i guess. In 2.6.21 the two threads would stick on the same core and produce artificially low latency - which would only be true in a real spread-out workload if all tasks ran on the same core. (which is hardly the thing you want on openmp)

In any case, if i misinterpreted your numbers or if you just disagree, or if you have a workload/test that shows worse performance than it could/should, let me know.

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Eric Dumazet <[EMAIL PROTECTED]> wrote:

$ gcc -O2 -o burner burner.c
$ ./burner
Time to perform the unit of work on one thread is 0.040328 s
Time to perform the unit of work on 2 threads is 0.040221 s

ok, but this actually suggests that scheduling is fine for this, correct?

	Ingo

Yes, but this machine runs an old kernel. I was just showing you how to run it :)
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Eric Dumazet <[EMAIL PROTECTED]> wrote: > $ gcc -O2 -o burner burner.c > $ ./burner > Time to perform the unit of work on one thread is 0.040328 s > Time to perform the unit of work on 2 threads is 0.040221 s ok, but this actually suggests that scheduling is fine for this, correct? Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
* Jie Chen <[EMAIL PROTECTED]> wrote:

I just ran the same test on two 2.6.24-rc4 kernels: one with CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED off. The odd behavior I described in my previous e-mails was still there for both kernels. Let me know if I can be of any more help. Thank you.

ok, i had a look at your data, and i think this is the result of the scheduler balancing out to idle CPUs more aggressively than before. Doing that is almost always a good idea though - but indeed it can result in "bad" numbers if all you do is measure the ping-pong "performance" between two threads (with no real work done by any of them).

My test code is not doing much work; it measures the overhead of various synchronization mechanisms such as barriers and locks. I am trying to see the scalability of different implementations/algorithms on multi-core machines.

the moment you saturate the system a bit more, the numbers should improve even with such a ping-pong test.

You are right. If I manually do the load balance (bind unrelated processes on the other cores), my test code performs as well as it did in the kernel 2.6.21.

do you have testcode (or a modification of your testcase sourcecode) that simulates a real-life situation where 2.6.24-rc4 does not perform as well as you'd like it to? (or if qmt.tar.gz already contains that then please point me towards that portion of the test and how i should run it - thanks!)

The qmt.tar.gz code contains a simple test program called pthread_sync under the src directory. You can change the number of threads by setting the QMT_NUM_THREADS environment variable. You can build qmt by doing configure --enable-public-release. I do not have Intel quad-core machines, so I am not sure whether the behavior will show up on an Intel platform. Our cluster is dual quad-core Opteron, which has its own hardware problem :-).
http://hardware.slashdot.org/article.pl?sid=07/12/04/237248=rss

	Ingo

Hi, Ingo:

My test code qmt can be found at ftp://ftp.jlab.org/pub/hpc/qmt.tar.gz. There is a minor performance issue in qmt pointed out by Eric; the fix is not in the tarball yet. If I can be of any help, please let me know. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000, Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office) (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> I just ran the same test on two 2.6.24-rc4 kernels: one with
>> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
>> off. The odd behavior I described in my previous e-mails was still
>> there for both kernels. Let me know if I can be any more help. Thank you.
>
> ok, i had a look at your data, and i think this is the result of the
> scheduler balancing out to idle CPUs more aggressively than before. Doing
> that is almost always a good idea though - but indeed it can result in
> "bad" numbers if all you do is measure the ping-pong "performance" between
> two threads (with no real work done by either of them).
>
> the moment you saturate the system a bit more, the numbers should improve
> even with such a ping-pong test.
>
> do you have testcode (or a modification of your testcase sourcecode) that
> simulates a real-life situation where 2.6.24-rc4 does not perform as well
> as you'd like it to? (or if qmt.tar.gz already contains that then please
> point me towards that portion of the test and how i should run it - thanks!)
>
> 	Ingo

I cooked up a program shorter than Jie's to try to understand what is going
on. It's a pure CPU-burner program, with no thread synchronisation (except
the pthread_join at the very end). As each thread is bound to a given CPU, I
am not sure the scheduler is allowed to balance to an idle CPU. Unfortunately
I don't have a 4-way idle SMP machine available to test it.

$ gcc -O2 -o burner burner.c
$ ./burner
Time to perform the unit of work on one thread is 0.040328 s
Time to perform the unit of work on 2 threads is 0.040221 s

I tried it on a 64-way machine (thanks David :) ) and noticed some strange
results that may be related to the Niagara hardware (the time for 64 threads
was nearly double the time for one thread).

#define _GNU_SOURCE		/* for sched_setaffinity() */
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>		/* for memcmp() */

int blockthemall = 1;

static inline void cpupause()
{
#if defined(i386)
	asm volatile("rep;nop" ::: "memory");
#else
	asm volatile("" ::: "memory");
#endif
}

/*
 * Determines the number of cpus.
 * Can be overridden by the NR_CPUS environment variable.
 */
int number_of_cpus()
{
	char line[1024], *p;
	int cnt = 0;
	FILE *F;

	p = getenv("NR_CPUS");
	if (p)
		return atoi(p);
	F = fopen("/proc/cpuinfo", "r");
	if (F == NULL) {
		perror("/proc/cpuinfo");
		return 1;
	}
	while (fgets(line, sizeof(line), F) != NULL) {
		if (memcmp(line, "processor", 9) == 0)
			cnt++;
	}
	fclose(F);
	return cnt;
}

void compute_elapsed(struct timeval *delta, const struct timeval *t0)
{
	struct timeval t1;

	gettimeofday(&t1, NULL);
	delta->tv_sec = t1.tv_sec - t0->tv_sec;
	delta->tv_usec = t1.tv_usec - t0->tv_usec;
	if (delta->tv_usec < 0) {
		delta->tv_usec += 1000000;
		delta->tv_sec--;
	}
}

/* loop count was garbled in the archive; 20 million matches the ~0.04 s timings above */
int nr_loops = 20 * 1000000;
double incr = 0.3456;

void perform_work()
{
	int i;
	double t = 0.0;

	for (i = 0; i < nr_loops; i++) {
		t += incr;
	}
	if (t < 0.0)
		printf("well... should not happen\n");
}

void set_affinity(int cpu)
{
	long cpu_mask;
	int res;

	cpu_mask = 1L << cpu;
	/* the original passes the raw long mask; cast added so it builds on current glibc */
	res = sched_setaffinity(0, sizeof(cpu_mask), (cpu_set_t *)&cpu_mask);
	if (res)
		perror("sched_setaffinity");
}

void *thread_work(void *arg)
{
	int cpu = (int)(long)arg;

	set_affinity(cpu);
	while (blockthemall)	/* spin until the main thread releases everybody */
		cpupause();
	perform_work();
	return (void *)0;
}

int main(int argc, char *argv[])
{
	struct timeval t0, delta;
	int nr_cpus, i;
	pthread_t *tids;

	gettimeofday(&t0, NULL);
	perform_work();
	compute_elapsed(&delta, &t0);
	printf("Time to perform the unit of work on one thread is %d.%06d s\n",
	       (int)delta.tv_sec, (int)delta.tv_usec);

	nr_cpus = number_of_cpus();
	if (nr_cpus <= 1)
		return 0;
	tids = malloc(nr_cpus * sizeof(pthread_t));
	for (i = 1; i < nr_cpus; i++) {
		pthread_create(tids + i, NULL, thread_work, (void *)(long)i);
	}
	set_affinity(0);
	gettimeofday(&t0, NULL);
	blockthemall = 0;	/* release the spinners, then do the same unit of work on every cpu */
	perform_work();
	for (i = 1; i < nr_cpus; i++)
		pthread_join(tids[i], NULL);
	compute_elapsed(&delta, &t0);
	printf("Time to perform the unit of work on %d threads is %d.%06d s\n",
	       nr_cpus, (int)delta.tv_sec, (int)delta.tv_usec);
	return 0;
}
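A side note for anyone reusing set_affinity() above: it relies on the old
convention of passing a raw long bitmask to sched_setaffinity(). Current
glibc exposes the same binding through cpu_set_t and the CPU_* macros; a
minimal equivalent sketch (purely illustrative, the function name is simply
reused from the program above) would be:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Bind the calling thread/process to a single CPU via the cpu_set_t interface. */
void set_affinity(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
}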
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:

> I just ran the same test on two 2.6.24-rc4 kernels: one with
> CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED
> off. The odd behavior I described in my previous e-mails was still
> there for both kernels. Let me know if I can be any more help. Thank you.

ok, i had a look at your data, and i think this is the result of the
scheduler balancing out to idle CPUs more aggressively than before. Doing
that is almost always a good idea though - but indeed it can result in
"bad" numbers if all you do is measure the ping-pong "performance" between
two threads (with no real work done by either of them).

the moment you saturate the system a bit more, the numbers should improve
even with such a ping-pong test.

do you have testcode (or a modification of your testcase sourcecode) that
simulates a real-life situation where 2.6.24-rc4 does not perform as well
as you'd like it to? (or if qmt.tar.gz already contains that then please
point me towards that portion of the test and how i should run it - thanks!)

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> Simon Holm Thøgersen wrote:
>>> Wed, 2007-11-21 at 20:52 -0500, Jie Chen wrote:
>>>
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>
>> Hi, Simon:
>>
>> I will try that after the Thanksgiving holiday to find out whether the
>> odd behavior will show up using 2.6.21 with the backported CFS.
>
> would also be nice to test this with 2.6.24-rc4.
>
> 	Ingo

Hi, Ingo:

I just ran the same test on two 2.6.24-rc4 kernels: one with
CONFIG_FAIR_GROUP_SCHED on and the other with CONFIG_FAIR_GROUP_SCHED off.
The odd behavior I described in my previous e-mails was still there for both
kernels. Let me know if I can be any more help. Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>>> the moment you saturate the system a bit more, the numbers should
>>> improve even with such a ping-pong test.
>>
>> You are right. If I manually do the load balancing (bind unrelated
>> processes to the other cores), my test code performs as well as it did
>> in kernel 2.6.21.
>
> so right now the results dont seem to be too bad to me - the higher
> overhead comes from two threads running on two different cores and
> incurring the overhead of cross-core communication. In a true spread-out
> workload that synchronizes occasionally you'd get the same kind of
> overhead, so in fact this behavior is more informative of the real
> overhead, i guess. In 2.6.21 the two threads would stick to the same core
> and produce artificially low latency - which would only be true in a real
> spread-out workload if all tasks ran on the same core. (which is hardly
> what you want with OpenMP)

I use the pthread_setaffinity_np call to bind one thread to one core. Unless
kernel 2.6.21 does not honor the affinity, I do not see a difference in
running two threads on two cores between the new kernel and the old kernel.
My test code does not do any numerical calculation, but it does spin-wait on
shared/non-shared flags. The reason I am using affinity is to test
synchronization overheads among different cores. In both the new and the old
kernel I do see 200% CPU usage when I run my test code with two threads.
Does this mean two threads are running on two cores? Also, I verify that a
thread is indeed bound to a core by using pthread_getaffinity_np.

> In any case, if i misinterpreted your numbers or if you just disagree, or
> if you have a workload/test that shows worse performance than it
> could/should, let me know.
>
> 	Ingo

Hi, Ingo:

Since I am using the affinity flag to bind each thread to a different core,
the synchronization overhead should increase as the number of cores/threads
increases. But what we observed in the new kernel is the opposite: the
barrier overhead for two threads is 8.93 microseconds vs 1.86 microseconds
for 8 threads (in the old kernel it is 0.49 vs 1.86). This will confuse most
people who study synchronization/communication scalability. I know my test
code is not a real-world computation, which would usually use up all the
cores. I hope I have explained myself clearly. Thank you very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
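To make the setup described above concrete, here is a minimal sketch of what
binding one thread per core with pthread_setaffinity_np and spin-waiting on a
shared flag looks like. It is not the qmt source; the core numbers, the flag,
and the two-thread layout are made up for illustration.

#define _GNU_SOURCE		/* for pthread_setaffinity_np() */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static volatile int go;		/* shared flag the pinned worker spins on */

static void bind_to_core(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set))
		perror("pthread_setaffinity_np");
}

static void *worker(void *arg)
{
	bind_to_core((int)(long)arg);
	while (!go)		/* spin-wait, as the qmt threads do on their flags */
		;		/* a real benchmark would add a pause/cpu_relax hint here */
	return NULL;
}

int main(void)
{
	pthread_t tid;

	pthread_create(&tid, NULL, worker, (void *)1L);	/* worker pinned to core 1 */
	bind_to_core(0);				/* main thread pinned to core 0 */
	go = 1;						/* this hand-off crosses cores by construction */
	pthread_join(tid, NULL);
	return 0;
}

With both threads pinned to different cores, every flag hand-off or barrier
crossing has to go through cross-core communication, which is exactly the
cost the benchmark tries to isolate, independent of where the scheduler would
otherwise have placed the threads.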
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Eric Dumazet <[EMAIL PROTECTED]> wrote:
>> $ gcc -O2 -o burner burner.c
>> $ ./burner
>> Time to perform the unit of work on one thread is 0.040328 s
>> Time to perform the unit of work on 2 threads is 0.040221 s
>
> ok, but this actually suggests that scheduling is fine for this, correct?
>
> 	Ingo

Yes. But this machine runs an old kernel. I was just showing you how to run
it :)
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
>> the moment you saturate the system a bit more, the numbers should
>> improve even with such a ping-pong test.
>
> You are right. If I manually do the load balancing (bind unrelated
> processes to the other cores), my test code performs as well as it did
> in kernel 2.6.21.

so right now the results dont seem to be too bad to me - the higher overhead
comes from two threads running on two different cores and incurring the
overhead of cross-core communication. In a true spread-out workload that
synchronizes occasionally you'd get the same kind of overhead, so in fact
this behavior is more informative of the real overhead, i guess. In 2.6.21
the two threads would stick to the same core and produce artificially low
latency - which would only be true in a real spread-out workload if all tasks
ran on the same core. (which is hardly what you want with OpenMP)

In any case, if i misinterpreted your numbers or if you just disagree, or if
you have a workload/test that shows worse performance than it could/should,
let me know.

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Eric Dumazet <[EMAIL PROTECTED]> wrote:
> $ gcc -O2 -o burner burner.c
> $ ./burner
> Time to perform the unit of work on one thread is 0.040328 s
> Time to perform the unit of work on 2 threads is 0.040221 s

ok, but this actually suggests that scheduling is fine for this, correct?

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
>> sorry to be dense, but could you give me instructions on how i could
>> remove the affinity mask and test the barrier overhead myself? I have
>> built pthread_sync and it outputs numbers for me - which one would be
>> the barrier overhead: Reference_time_1 ?
>
> To disable affinity, do configure --enable-public-release
> --disable-thread_affinity. You should see barrier overhead like the
> following:
>
> Computing BARRIER time
> Sample_size    Average       Min           Max           S.D.       Outliers
> 20             19.486162     19.482250     19.491400     0.002740   0
>
> BARRIER time     = 19.486162 microseconds +/- 0.005371
> BARRIER overhead =  8.996257 microseconds +/- 0.006575

ok, i did that and rebuilt. I also did make check and got src/pthread_sync,
which i can run. The only thing i'm missing: if i run src/pthread_sync, it
outputs PARALLEL time:

PARALLEL time     = 22.486103 microseconds +/- 3.944821
PARALLEL overhead = 10.638658 microseconds +/- 10.854154

not BARRIER time. I've re-read the discussion and found no hint about how to
build and run a barrier test. Either i missed it or it's so obvious to you
that you didn't mention it :-)

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> Since I am using the affinity flag to bind each thread to a different
>> core, the synchronization overhead should increase as the number of
>> cores/threads increases. But what we observed in the new kernel is the
>> opposite: the barrier overhead for two threads is 8.93 microseconds vs
>> 1.86 microseconds for 8 threads (in the old kernel it is 0.49 vs 1.86).
>> This will confuse most people who study synchronization/communication
>> scalability. I know my test code is not a real-world computation, which
>> would usually use up all the cores. I hope I have explained myself
>> clearly. Thank you very much.
>
> btw., could you try to not use the affinity mask and let the scheduler
> manage the spreading of tasks? It generally has better knowledge about
> how tasks interrelate.
>
> 	Ingo

Hi, Ingo:

I just disabled the affinity mask and reran the test. There were no
significant changes for two threads (the barrier overhead is around 9
microseconds). As for 8 threads, the barrier overhead actually drops a
little, which is good. Let me know whether I can be of any help. Thank you
very much.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
> Since I am using the affinity flag to bind each thread to a different
> core, the synchronization overhead should increase as the number of
> cores/threads increases. But what we observed in the new kernel is the
> opposite: the barrier overhead for two threads is 8.93 microseconds vs
> 1.86 microseconds for 8 threads (in the old kernel it is 0.49 vs 1.86).
> This will confuse most people who study synchronization/communication
> scalability. I know my test code is not a real-world computation, which
> would usually use up all the cores. I hope I have explained myself
> clearly. Thank you very much.

btw., could you try to not use the affinity mask and let the scheduler
manage the spreading of tasks? It generally has better knowledge about how
tasks interrelate.

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>>> sorry to be dense, but could you give me instructions on how i could
>>> remove the affinity mask and test the barrier overhead myself? I have
>>> built pthread_sync and it outputs numbers for me - which one would be
>>> the barrier overhead: Reference_time_1 ?
>>
>> To disable affinity, do configure --enable-public-release
>> --disable-thread_affinity. You should see barrier overhead like the
>> following:
>>
>> Computing BARRIER time
>> Sample_size    Average       Min           Max           S.D.       Outliers
>> 20             19.486162     19.482250     19.491400     0.002740   0
>>
>> BARRIER time     = 19.486162 microseconds +/- 0.005371
>> BARRIER overhead =  8.996257 microseconds +/- 0.006575
>
> ok, i did that and rebuilt. I also did make check and got
> src/pthread_sync, which i can run. The only thing i'm missing: if i run
> src/pthread_sync, it outputs PARALLEL time:
>
> PARALLEL time     = 22.486103 microseconds +/- 3.944821
> PARALLEL overhead = 10.638658 microseconds +/- 10.854154
>
> not BARRIER time. I've re-read the discussion and found no hint about how
> to build and run a barrier test. Either i missed it or it's so obvious to
> you that you didn't mention it :-)
>
> 	Ingo

Hi, Ingo:

Did you run configure --enable-public-release? My qmt is for QCD
calculations (one type of physics code). Without the above flag one can only
test the PARALLEL overhead. Actually, the PARALLEL benchmark has the same
behavior as the BARRIER one. Thanks.

###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> I just disabled the affinity mask and reran the test. There were no
>> significant changes for two threads (the barrier overhead is around 9
>> microseconds). As for 8 threads, the barrier overhead actually drops a
>> little, which is good. Let me know whether I can be of any help. Thank
>> you very much.
>
> sorry to be dense, but could you give me instructions on how i could
> remove the affinity mask and test the barrier overhead myself? I have
> built pthread_sync and it outputs numbers for me - which one would be
> the barrier overhead: Reference_time_1 ?
>
> 	Ingo

Hi, Ingo:

To disable affinity, do configure --enable-public-release
--disable-thread_affinity. You should see barrier overhead like the
following:

Computing BARRIER time
Sample_size    Average       Min           Max           S.D.       Outliers
20             19.486162     19.482250     19.491400     0.002740   0

BARRIER time     = 19.486162 microseconds +/- 0.005371
BARRIER overhead =  8.996257 microseconds +/- 0.006575

Reference_time_1 is the elapsed time for a single thread doing the simple
loop without any synchronization. Thank you.

--
###
Jie Chen
Scientific Computing Group
Thomas Jefferson National Accelerator Facility
12000 Jefferson Ave.
Newport News, VA 23606
(757)269-5046 (office)  (757)269-6248 (fax)
[EMAIL PROTECTED]
###
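(If it helps to read the output above: the overhead line is presumably the
barrier time minus that reference loop, which would put Reference_time_1 at
about 19.486162 - 8.996257 = 10.49 microseconds in this run. This reading is
inferred from the printed numbers, not from the qmt source.)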
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
> I just disabled the affinity mask and reran the test. There were no
> significant changes for two threads (the barrier overhead is around 9
> microseconds). As for 8 threads, the barrier overhead actually drops a
> little, which is good. Let me know whether I can be of any help. Thank
> you very much.

sorry to be dense, but could you give me instructions on how i could remove
the affinity mask and test the barrier overhead myself? I have built
pthread_sync and it outputs numbers for me - which one would be the barrier
overhead: Reference_time_1 ?

	Ingo
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
Ingo Molnar wrote:
> * Jie Chen <[EMAIL PROTECTED]> wrote:
>> Simon Holm Thøgersen wrote:
>>> Wed, 2007-11-21 at 20:52 -0500, Jie Chen wrote:
>>>
>>> There is a backport of the CFS scheduler to 2.6.21, see
>>> http://lkml.org/lkml/2007/11/19/127
>>
>> Hi, Simon:
>>
>> I will try that after the Thanksgiving holiday to find out whether the
>> odd behavior will show up using 2.6.21 with the backported CFS.
>
> would also be nice to test this with 2.6.24-rc4.
>
> 	Ingo

Hi, Ingo:

I will test 2.6.24-rc4 this week and let you know the result. Thanks.

--
#
# Jie Chen
# Scientific Computing Group
# Thomas Jefferson National Accelerator Facility
# Newport News, VA 23606
#
# [EMAIL PROTECTED]
# (757)269-5046 (office)
# (757)269-6248 (fax)
#
Re: Possible bug from kernel 2.6.22 and above, 2.6.24-rc4
* Jie Chen <[EMAIL PROTECTED]> wrote:
> Simon Holm Thøgersen wrote:
>> Wed, 2007-11-21 at 20:52 -0500, Jie Chen wrote:
>>
>> There is a backport of the CFS scheduler to 2.6.21, see
>> http://lkml.org/lkml/2007/11/19/127
>
> Hi, Simon:
>
> I will try that after the Thanksgiving holiday to find out whether the
> odd behavior will show up using 2.6.21 with the backported CFS.

would also be nice to test this with 2.6.24-rc4.

	Ingo