Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)

2007-07-30 Thread Andrea Arcangeli
On Mon, Jul 30, 2007 at 05:07:46PM -0400, Chris Snook wrote:
> [..]  It's spending a lot less time in %sys despite the 
> higher context switches, [..]

The workload takes 40% longer, so you have to add that additional 40%
into your math too. "A lot less time" sounds like an overstatement to
me. You also have to take into account the cache effects of executing
the scheduler so much, etc.

> [..] and there are far fewer tasks waiting for CPU 
> time. The real problem seems to be that volanomark is optimized for a 

It looks weird that there are a lot fewer tasks in R state. Could you
press SYSRQ+T to see where those hundreds of tasks are sleeping in the
CFS run?
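
For reference, the same task dump can be triggered without the keyboard
by writing 't' to /proc/sysrq-trigger; a minimal sketch (needs root and
kernel.sysrq enabled, the per-task backtraces land in the kernel log):

#include <stdio.h>

/* Minimal sketch: ask the kernel for the SysRq-T task dump from
 * userspace by writing 't' to /proc/sysrq-trigger, then read the
 * backtraces with dmesg. */
int main(void)
{
    FILE *f = fopen("/proc/sysrq-trigger", "w");

    if (!f) {
        perror("/proc/sysrq-trigger");
        return 1;
    }
    fputc('t', f);
    fclose(f);
    return 0;
}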

> That's not to say that we can't improve volanomark performance under CFS, 
> but simply that CFS isn't so fundamentally flawed that this is impossible.

Given the increase in context switches, not all of the context switches
are "userland mandated", so the first thing to try here is to increase
the granularity with the new tunable sysctl. Increasing the granularity
should reduce the context switch rate, and in turn reduce the slowdown
to less than 40%.
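
A minimal sketch of that experiment, assuming the 2.6.23-rc1-era name of
the knob, /proc/sys/kernel/sched_granularity_ns (later kernels renamed
and split it, so treat the exact path as an assumption); the value is in
nanoseconds and writing it needs root:

#include <stdio.h>

/* Sketch only: raise the CFS granularity (path assumed to be the
 * 2.6.23-rc1-era sched_granularity_ns sysctl, value in ns), then
 * re-run the benchmark while watching the cs column in vmstat. */
int main(void)
{
    const char *knob = "/proc/sys/kernel/sched_granularity_ns";
    FILE *f = fopen(knob, "w");

    if (!f) {
        perror(knob);
        return 1;
    }
    fprintf(f, "%llu\n", 20000000ULL);  /* e.g. 20 ms instead of the default */
    fclose(f);
    return 0;
}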

There's nothing necessarily flawed in CFS even if it's slower than
O(1) on this load no matter how you tune it. The higher context switch
rate needed to retain complete fairness is a feature, but fairness vs.
global performance is generally a tradeoff.


Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)

2007-07-30 Thread Chris Snook

Tim Chen wrote:

On Sat, 2007-07-28 at 02:51 -0400, Chris Snook wrote:


Tim --

	Since you're already set up to do this benchmarking, would you mind 
varying the parameters a bit and collecting vmstat data?  If you want to 
run oprofile too, that wouldn't hurt.




Here's the vmstat data.  The number of runnable processes is
lower and there are more context switches with CFS.
 
The vmstat for 2.6.22 looks like


procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs us sy id wa st
391  0  0 1722564  14416  9547200   16925   76 6520  3  3 89  5  0
400  0  0 1722372  14416  9549600 0 0  264 641685 47 53  0  0  0
368  0  0 1721504  14424  9549600 0 7  261 648493 46 51  3  0  0
438  0  0 1721504  14432  9549600 0 2  264 690834 46 54  0  0  0
400  0  0 1721380  14432  9549600 0 0  260 657157 46 53  1  0  0
393  0  0 1719892  14440  9549600 0 6  265 671599 45 53  2  0  0
423  0  0 1719892  14440  9549600 015  264 701626 44 56  0  0  0
375  0  0 1720240  14472  9550400 072  265 671795 43 53  3  0  0
393  0  0 1720140  14480  9550400 0 7  265 733561 45 55  0  0  0
355  0  0 1716052  14480  9550400 0 0  260 670676 43 54  3  0  0
419  0  0 1718900  14480  9550400 0 4  265 680690 43 55  2  0  0
396  0  0 1719148  14488  9550400 0 3  261 712307 43 56  0  0  0
395  0  0 1719148  14488  9550400 0 2  264 692781 44 54  1  0  0
387  0  0 1719148  14492  9550400 041  268 709579 43 57  0  0  0
420  0  0 1719148  14500  9550400 0 3  265 690862 44 54  2  0  0
429  0  0 1719396  14500  9550400 0 0  260 704872 46 54  0  0  0
460  0  0 1719396  14500  9550400 0 0  264 716272 46 54  0  0  0
419  0  0 1719396  14508  9550400 0 3  261 685864 43 55  2  0  0
455  0  0 1719396  14508  9550400 0 0  264 703718 44 56  0  0  0
395  0  0 1719372  14540  9551200 064  265 692785 45 54  1  0  0
424  0  0 1719396  14548  9551200 010  265 732866 45 55  0  0  0

While 2.6.23-rc1 looks like
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs us sy id wa st
23  0  0 1705992  17020  9572000 0 0  261 1010016 53 42  5  0  0
 7  0  0 1706116  17020  9572000 013  267 1060997 52 41  7  0  0
 5  0  0 1706116  17020  9572000 028  266 1313361 56 41  3  0  0
19  0  0 1706116  17028  9572000 0 8  265 1273669 55 41  4  0  0
18  0  0 1706116  17032  9572000 0 2  262 1403588 55 41  4  0  0
23  0  0 1706116  17032  9572000 0 0  264 1272561 56 40  4  0  0
14  0  0 1706116  17032  9572000 0 0  262 1046795 55 40  5  0  0
16  0  0 1706116  17032  9572000 0 0  260 1361102 58 39  4  0  0
 4  0  0 1706224  17120  9572400 0   126  273 1488711 56 41  3  0  0
24  0  0 1706224  17128  9572400 0 6  261 1408432 55 41  4  0  0
 3  0  0 1706240  17128  9572400 048  273 1299203 54 42  4  0  0
16  0  0 1706240  17132  9572400 0 3  261 1356609 54 42  4  0  0
 5  0  0 1706364  17132  9572400 0 0  264 1293198 58 39  3  0  0
 9  0  0 1706364  17132  9572400 0 0  261 1555153 56 41  3  0  0
13  0  0 1706364  17132  9572400 0 0  264 1160296 56 40  4  0  0
 8  0  0 1706364  17132  9572400 0 0  261 1388909 58 38  4  0  0
18  0  0 1706364  17132  9572400 0 0  264 1236774 56 39  5  0  0
11  0  0 1706364  17136  9572400 0 2  261 1360325 57 40  3  0  0
 5  0  0 1706364  17136  9572400 0 1  265 1201912 57 40  3  0  0
 8  0  0 1706364  17136  9572400 0 0  261 1104308 57 39  4  0  0
 7  0  0 1705976  17232  9572400 0   127  274 1205212 58 39  4  0  0

Tim



From a scheduler performance perspective, it looks like CFS is doing much 
better on this workload.  It's spending a lot less time in %sys despite the 
higher context switches, and there are far fewer tasks waiting for CPU time. 
The real problem seems to be that volanomark is optimized for a particular 
scheduler behavior.


That's not to say that we can't improve volanomark performance under CFS, but 
simply that CFS isn't so fundamentally flawed that this is impossible.


When I initially agreed with zeroing out wait time in sched_yield, I didn't 
realize that it could be negative and that this 

Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)

2007-07-30 Thread Tim Chen
On Sat, 2007-07-28 at 02:51 -0400, Chris Snook wrote:

> 
> Tim --
> 
>   Since you're already set up to do this benchmarking, would you mind 
> varying the parameters a bit and collecting vmstat data?  If you want to 
> run oprofile too, that wouldn't hurt.
> 

Here's the vmstat data.  The number of runnable processes is
lower and there are more context switches with CFS.
 
The vmstat for 2.6.22 looks like

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs us sy id wa st
391  0  0 1722564  14416  9547200   16925   76 6520  3  3 89  5  0
400  0  0 1722372  14416  9549600 0 0  264 641685 47 53  0  0  0
368  0  0 1721504  14424  9549600 0 7  261 648493 46 51  3  0  0
438  0  0 1721504  14432  9549600 0 2  264 690834 46 54  0  0  0
400  0  0 1721380  14432  9549600 0 0  260 657157 46 53  1  0  0
393  0  0 1719892  14440  9549600 0 6  265 671599 45 53  2  0  0
423  0  0 1719892  14440  9549600 015  264 701626 44 56  0  0  0
375  0  0 1720240  14472  9550400 072  265 671795 43 53  3  0  0
393  0  0 1720140  14480  9550400 0 7  265 733561 45 55  0  0  0
355  0  0 1716052  14480  9550400 0 0  260 670676 43 54  3  0  0
419  0  0 1718900  14480  9550400 0 4  265 680690 43 55  2  0  0
396  0  0 1719148  14488  9550400 0 3  261 712307 43 56  0  0  0
395  0  0 1719148  14488  9550400 0 2  264 692781 44 54  1  0  0
387  0  0 1719148  14492  9550400 041  268 709579 43 57  0  0  0
420  0  0 1719148  14500  9550400 0 3  265 690862 44 54  2  0  0
429  0  0 1719396  14500  9550400 0 0  260 704872 46 54  0  0  0
460  0  0 1719396  14500  9550400 0 0  264 716272 46 54  0  0  0
419  0  0 1719396  14508  9550400 0 3  261 685864 43 55  2  0  0
455  0  0 1719396  14508  9550400 0 0  264 703718 44 56  0  0  0
395  0  0 1719372  14540  9551200 064  265 692785 45 54  1  0  0
424  0  0 1719396  14548  9551200 010  265 732866 45 55  0  0  0

While 2.6.23-rc1 looks like
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs us sy id wa st
23  0  0 1705992  17020  9572000 0 0  261 1010016 53 42  5  0  0
 7  0  0 1706116  17020  9572000 013  267 1060997 52 41  7  0  0
 5  0  0 1706116  17020  9572000 028  266 1313361 56 41  3  0  0
19  0  0 1706116  17028  9572000 0 8  265 1273669 55 41  4  0  0
18  0  0 1706116  17032  9572000 0 2  262 1403588 55 41  4  0  0
23  0  0 1706116  17032  9572000 0 0  264 1272561 56 40  4  0  0
14  0  0 1706116  17032  9572000 0 0  262 1046795 55 40  5  0  0
16  0  0 1706116  17032  9572000 0 0  260 1361102 58 39  4  0  0
 4  0  0 1706224  17120  9572400 0   126  273 1488711 56 41  3  0  0
24  0  0 1706224  17128  9572400 0 6  261 1408432 55 41  4  0  0
 3  0  0 1706240  17128  9572400 048  273 1299203 54 42  4  0  0
16  0  0 1706240  17132  9572400 0 3  261 1356609 54 42  4  0  0
 5  0  0 1706364  17132  9572400 0 0  264 1293198 58 39  3  0  0
 9  0  0 1706364  17132  9572400 0 0  261 1555153 56 41  3  0  0
13  0  0 1706364  17132  9572400 0 0  264 1160296 56 40  4  0  0
 8  0  0 1706364  17132  9572400 0 0  261 1388909 58 38  4  0  0
18  0  0 1706364  17132  9572400 0 0  264 1236774 56 39  5  0  0
11  0  0 1706364  17136  9572400 0 2  261 1360325 57 40  3  0  0
 5  0  0 1706364  17136  9572400 0 1  265 1201912 57 40  3  0  0
 8  0  0 1706364  17136  9572400 0 0  261 1104308 57 39  4  0  0
 7  0  0 1705976  17232  9572400 0   127  274 1205212 58 39  4  0  0

Tim



Re: Volanomark slows by 80% under CFS

2007-07-28 Thread Dave Jones
On Fri, Jul 27, 2007 at 10:47:21PM -0400, Rik van Riel wrote:
 > Tim Chen wrote:
 > > Ingo,
 > > 
 > > Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  
 > > Benchmark was run on a 2 socket Core2 machine.
 > > 
 > > The change in scheduler treatment of sched_yield 
 > > could play a part in changing Volanomark behavior.
 > > In CFS, sched_yield is implemented
 > > by dequeueing and requeueing a process.  The time a process 
 > > has spent running probably reduced the cpu time due it 
 > > by only a bit. The process could get re-queued pretty close
 > > to head of the queue, and may get scheduled again pretty
 > > quickly if there is still a lot of cpu time due.  
 > 
 > I wonder if this explains the 30% drop in top performance
 > seen with the MySQL sysbench benchmark when the scheduler
 > changed to CFS...
 > 
 > See http://people.freebsd.org/~jeff/sysbench.png

 From the author's blog when he did that graph:
 http://jeffr-tech.livejournal.com/10103.html

"So I updated the image for the second time today to include Ingo's cfs
 scheduler. This kernel is from the rpm on his website. I double checked
 that it was not using tcmalloc at the time and switching back to a
 2.6.21 kernel returned to the expected perf.

 Basically, it has the same performance as the FreeBSD 4BSD scheduler
 now. Which is to say the peak is terrible but it has virtually no
 dropoff and performs better under load than the default 2.6.21
 scheduler. "



Dave

-- 
http://www.codemonkey.org.uk


RE: Volanomark slows by 80% under CFS

2007-07-28 Thread David Schwartz

> > Volanomark runs better
> > and is only 40% (instead of 80%) down from old scheduler
> > without CFS.

> 40 or 80 % is still a huge regression.
> Dmitry Adamushko

Can anyone explain precisely what Volanomark is doing? If it's something
dumb like "looping on sched_yield until the 'right' thread runs and finishes
what we're waiting for" then I think any regression can be ignored.

This applies if and only if CFS' sched_yield behavior is sane and Volano's
is insane.
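
For concreteness, a toy version of that suspected pattern -- one thread
spinning on sched_yield() until another thread finishes -- could look
like this (illustrative only, not Volanomark's actual code; compile
with -pthread):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Toy illustration of the suspected anti-pattern: spin on sched_yield()
 * until another thread sets a flag, instead of blocking on a condition
 * variable or futex.  Every spin is another trip through the scheduler. */
static volatile int done;

static void *worker(void *arg)
{
    (void)arg;
    usleep(1000);       /* stand-in for the work being waited on */
    done = 1;
    return NULL;
}

int main(void)
{
    pthread_t t;
    long yields = 0;

    pthread_create(&t, NULL, worker, NULL);
    while (!done) {
        sched_yield();
        yields++;
    }
    pthread_join(t, NULL);
    printf("spun through %ld sched_yield() calls\n", yields);
    return 0;
}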

A sane sched_yield implementation must do two things:

1) Reward processes that actually do yield most of their CPU time to another
process.

2) Make an effort to run every ready-to-run process at the same or higher
static priority level before re-scheduling this process. (That won't always
be possible due to SMP issues, but a reasonable effort is needed.)

If CFS is doing these two things, and Volanomark is looping on sched_yield
until the 'right thread' runs, then CFS is doing the right thing and Volanomark
isn't. Volanomark deserves to lose.

If CFS binds processes to processors more tightly and thus sched_yield can't
yield to a process that was planned to run on another CPU in the future,
that would be a legitimate complaint about CFS.

DS




Re: Volanomark slows by 80% under CFS

2007-07-28 Thread Dmitry Adamushko
On 28/07/07, Chris Snook <[EMAIL PROTECTED]> wrote:
> [ ... ]
>   Under CFS, the yielding process will still be leftmost in the rbtree,
> otherwise it would have already been scheduled out.

Not actually true. The position of the 'current' task within the
rb-tree is updated at the timer tick frequency. Being called
somewhere in between two ticks, sched_yield() may trigger a reschedule
that would otherwise only take place upon the next tick.

Moreover, 'scheduling granularity' may also take effect: e.g. the
yielding task's 'fair_key' may effectively already have been != the
'left_most' one at the previous timer tick, but the actual reschedule
was delayed due to the 'scheduling granularity' taking effect... and
sched_yield() may trigger it.


> Zeroing out wait_runtime on sched_yield strikes me as completely appropriate.
> If the process wanted to sleep a finite duration, it should actually call a
> sleep function, but sched_yield is essentially saying "I don't have anything
> else to do right now", so it's hardly fair to claim you've been waiting for 
> your
> chance when you just gave it up.

'wait_runtime' describes the dynamic behavior of a task on the rq. It
doesn't matter what the task is about to do, as 'wait_runtime' is
something it fully deserves. Note that 'wait_runtime' can be both
positive and negative, meaning credit or punishment respectively.

When it's negative, 'zeroing it out' effectively means the task gets
helped -- in the sense that it doesn't get 'punished' for some amount
of time it actually spent running. Which is wrong.
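
A back-of-the-envelope illustration of that point, assuming the
early-CFS relation that the nice-0 rbtree key is roughly
fair_clock - wait_runtime (so a more negative wait_runtime means a
larger key, i.e. the task runs later):

#include <stdio.h>

/* Toy numbers, not kernel code: with key ~= fair_clock - wait_runtime
 * (an assumption about the early-CFS nice-0 case), zeroing a negative
 * wait_runtime shrinks the key and moves the yielder *left* in the
 * rbtree, i.e. it gets helped rather than pushed further back. */
int main(void)
{
    long long fair_clock   = 10000000;  /* arbitrary reference, ns */
    long long wait_runtime = -2000000;  /* task over-ran by 2 ms   */

    long long key_requeued = fair_clock - wait_runtime;  /* 12000000 */
    long long key_zeroed   = fair_clock - 0;             /* 10000000 */

    printf("key if simply requeued:     %lld\n", key_requeued);
    printf("key if wait_runtime zeroed: %lld  (smaller => runs sooner)\n",
           key_zeroed);
    return 0;
}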

One more thing: we don't take the time accounted to 'wait_runtime'
out of thin air.
e.g. sleepers get an additional bonus to their 'wait_runtime' upon a
wakeup, _but_ the amount of "wait_runtime" == "a given bonus" will be
additionally subtracted from tasks which happen to run later on (grep
for "sleeper_bonus" in sched_fair.c). That is, the sum (of
additionally given/taken wait_runtime) is zero.

All in all, I doubt the "zeroing out wait_runtime on sched_yield"
thing is really appropriate.


>
> -- Chris
>

-- 
Best regards,
Dmitry Adamushko


Re: Volanomark slows by 80% under CFS

2007-07-28 Thread Dmitry Adamushko
On 28/07/07, Tim Chen <[EMAIL PROTECTED]> wrote:
> [ ... ]
> It may make sense to queue the
> yielding process a bit further behind in the queue.
> I made a slight change by zeroing out wait_runtime
> (i.e. have the process give
> up cpu time due for it to run) for experimentation.

But that's wrong. The 'wait_runtime' might have been negative at this
point (i.e. the task is in negative 'run-time' balance wrt the
reference nice-0 task). Your change ends up helping such a task to
actually stay closer to the 'left most' element of the tree (or to be
it), and not "further behind in the queue" as you intended.

I don't know Volanomark's details, so I'll refrain from speculating on
why this change "improves" benchmark results (maybe some affected tasks
have positive 'wait_runtime's on average for this setup).

If you want to make sure (just for a test) that a yielding task is not
the left-most (at least) for some short interval of time (likely to be
<= 1 tick), take a look at yield_task_fair() in e.g. cfs-v15.

> Volanomark runs better
> and is only 40% (instead of 80%) down from old scheduler
> without CFS.

40 or 80 % is still a huge regression.


>
> Regards,
> Tim
>

-- 
Best regards,
Dmitry Adamushko


Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)

2007-07-28 Thread Chris Snook

Andrea Arcangeli wrote:

On Fri, Jul 27, 2007 at 11:43:23PM -0400, Chris Snook wrote:
I'm pretty sure the point of posting a patch that triples CFS performance 
on a certain benchmark and arguably improves the semantics of sched_yield 
was to improve CFS.  You have a point, but it is a point for a different 
thread.  I have taken the liberty of starting this thread for you.


I've no real interest in starting or participating in flamewars
(especially the ones not backed by hard numbers). So I adjusted the
subject a bit in the hope the discussion will not degenerate as you
predicted, hope you don't mind.


Not at all.  I clearly misread your tone.


I'm pretty sure the point of posting that email was to show the
remaining performance regression with the sched_yield fix applied
too. Given you considered my post both offtopic and inflammatory, I
guess you think it's possible and reasonably easy to fix that
remaining regression without a pluggable scheduler, right? So please
enlighten us on how you intend to achieve it.


There are four possibilities that are immediately obvious to me:

a) The remaining difference is due mostly to the algorithmic complexity 
of the rbtree algorithm in CFS.


If this is the case, we should be able to vary the test parameters (CPU 
count, thread count, etc.) graph the results, and see a roughly 
logarithmic divergence between the schedulers as some parameter(s) vary. 
 If this is the problem, we may be able to fix it with data structure 
tweaks or optimized base cases, like how quicksort can be optimized by 
using insertion sort below a certain threshold.


b) The remaining difference is due mostly to how the scheduler handles 
volanomark.


vmstat can give us a comparison of context switches between O(1), CFS, 
and CFS+patch.  If the decrease in throughput correlates with an 
increase in context switches, we may be able to induce more O(1)-like 
behavior by charging tasks for context switch overhead.


c) The remaining difference is due mostly to how the scheduler handles 
something other than volanomark.


If context switch count is not the problem, context switch pattern still 
could be.  I doubt we'd see a 40% difference due to cache misses, but 
it's possible.  Fortunately, oprofile can sample based on cache misses, 
so we can debug this too.


d) The remaining difference is due mostly to some implementation detail 
in CFS.


It's possible there's some constant-factor overhead in CFS that is 
magnified heavily by the context switching volanomark deliberately 
induces.  If this is the case, oprofile sampling on clock cycles should 
catch it.


Tim --

	Since you're already set up to do this benchmarking, would you mind 
varying the parameters a bit and collecting vmstat data?  If you want to 
run oprofile too, that wouldn't hurt.



Also consider the other numbers likely used nptl so they shouldn't be
affected by sched_yield changes.

Sure there is.  We can run a fully-functional POSIX OS without using any 
block devices at all.  We cannot run a fully-functional POSIX OS without a 
scheduler. Any feature without which the OS cannot execute userspace code 
 is sufficiently primitive that somewhere there is a device on which it will 
be impossible to debug if that feature fails to initialize.  It is quite 
reasonable to insist on only having one implementation of such features in 
any given kernel build.


Sounds like a red-herring to me... There aren't just pluggable I/O
schedulers in the kernel, there are pluggable packet schedulers too
(see `tc qdisc`). And both are switchable at runtime (not just at boot
time).

Can you run your fully-functional POSIX OS without a packet scheduler
and without an I/O scheduler? I wonder where are you going to
read/write data without HD and network?


If I'm missing both, I'm pretty screwed, but if either one is 
functional, I can send something out.



Also those pluggable things don't increase the risk of crash much, if
compared to the complexity of the schedulers.

Whether or not these alternatives belong in the source tree as config-time 
options is a political question, but preserving boot-time debugging 
capability is a perfectly reasonable technical motivation.


The scheduler is invoked very late in the boot process (printk and
serial console, kdb are working for ages when scheduler kicks in), so
it's fully debuggable (no debugger depends on the scheduler, they run
inside the nmi handler...), I don't really see your point.


I'm more concerned about embedded systems.  These are the same people 
who want userspace character drivers to control their custom hardware. 
Having the robot point to where it hurts is a lot more convenient than 
hooking up a JTAG debugger.



And even if there would be a subtle bug in the scheduler you'll never
trigger it at boot with so few tasks and so few context switches.


Sure, but it's the non-subtle bugs that worry me.  These are usually 
related to low-level hardware setup, so they could miss the 



Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)

2007-07-27 Thread Andrea Arcangeli
On Fri, Jul 27, 2007 at 11:43:23PM -0400, Chris Snook wrote:
> I'm pretty sure the point of posting a patch that triples CFS performance 
> on a certain benchmark and arguably improves the semantics of sched_yield 
> was to improve CFS.  You have a point, but it is a point for a different 
> thread.  I have taken the liberty of starting this thread for you.

I've no real interest in starting or participating in flamewars
(especially the ones not backed by hard numbers). So I adjusted the
subject a bit in the hope the discussion will not degenerate as you
predicted, hope you don't mind.

I'm pretty sure the point of posting that email was to show the
remaining performance regression with the sched_yield fix applied
too. Given you considered my post both offtopic and inflammatory, I
guess you think it's possible and reasonably easy to fix that
remaining regression without a pluggable scheduler, right? So please
enlighten us on how you intend to achieve it.

Also consider the other numbers likely used nptl so they shouldn't be
affected by sched_yield changes.

> Sure there is.  We can run a fully-functional POSIX OS without using any 
> block devices at all.  We cannot run a fully-functional POSIX OS without a 
> scheduler. Any feature without which the OS cannot execute userspace code 
>  is sufficiently primitive that somewhere there is a device on which it will 
> be impossible to debug if that feature fails to initialize.  It is quite 
> reasonable to insist on only having one implementation of such features in 
> any given kernel build.

Sounds like a red-herring to me... There aren't just pluggable I/O
schedulers in the kernel, there are pluggable packet schedulers too
(see `tc qdisc`). And both are switchable at runtime (not just at boot
time).

Can you run your fully-functional POSIX OS without a packet scheduler
and without an I/O scheduler? I wonder where are you going to
read/write data without HD and network?

Also those pluggable things don't increase the risk of crash much, if
compared to the complexity of the schedulers.

> Whether or not these alternatives belong in the source tree as config-time 
> options is a political question, but preserving boot-time debugging 
> capability is a perfectly reasonable technical motivation.

The scheduler is invoked very late in the boot process (printk and
serial console, kdb are working for ages when scheduler kicks in), so
it's fully debuggable (no debugger depends on the scheduler, they run
inside the nmi handler...), I don't really see your point.

And even if there would be a subtle bug in the scheduler you'll never
trigger it at boot with so few tasks and so few context switches.


pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS)

2007-07-27 Thread Chris Snook

Andrea Arcangeli wrote:

On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote:
I think Volanomark is being pretty stupid, and deserves to run slowly, but 


Indeed, any app doing what volanomark does is pretty inefficient.

But this is not the point. I/O schedulers are pluggable to help
inefficient apps too. If apps were extremely smart they would all
use async I/O for their reads, and there wouldn't be any need for the
anticipatory scheduler, just as an example.


I'm pretty sure the point of posting a patch that triples CFS performance on a 
certain benchmark and arguably improves the semantics of sched_yield was to 
improve CFS.  You have a point, but it is a point for a different thread.  I 
have taken the liberty of starting this thread for you.



The fact is there's no technical explanation for why we're forbidden
from being able to choose between CFS and O(1), at least at boot time.


Sure there is.  We can run a fully-functional POSIX OS without using any block 
devices at all.  We cannot run a fully-functional POSIX OS without a scheduler. 
 Any feature without which the OS cannot execute userspace code is sufficiently 
primitive that somewhere there is a device on which it will be impossible to 
debug if that feature fails to initialize.  It is quite reasonable to insist on 
only having one implementation of such features in any given kernel build.


Whether or not these alternatives belong in the source tree as config-time 
options is a political question, but preserving boot-time debugging capability 
is a perfectly reasonable technical motivation.


-- Chris


Re: Volanomark slows by 80% under CFS

2007-07-27 Thread Rik van Riel

Tim Chen wrote:

Ingo,

Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  
Benchmark was run on a 2 socket Core2 machine.


The change in scheduler treatment of sched_yield 
could play a part in changing Volanomark behavior.

In CFS, sched_yield is implemented
by dequeueing and requeueing a process.  The time a process
has spent running probably reduced the cpu time due it
by only a bit. The process could get re-queued pretty close
to the head of the queue, and may get scheduled again pretty
quickly if there is still a lot of cpu time due.


I wonder if this explains the 30% drop in top performance
seen with the MySQL sysbench benchmark when the scheduler
changed to CFS...

See http://people.freebsd.org/~jeff/sysbench.png

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


Re: Volanomark slows by 80% under CFS

2007-07-27 Thread Andrea Arcangeli
On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote:
> I think Volanomark is being pretty stupid, and deserves to run slowly, but 

Indeed, any app doing what volanomark does is pretty inefficient.

But this is not the point. I/O schedulers are pluggable to help
inefficient apps too. If apps were extremely smart they would all
use async I/O for their reads, and there wouldn't be any need for the
anticipatory scheduler, just as an example.

The fact is there's no technical explanation for why we're forbidden
from being able to choose between CFS and O(1), at least at boot time.


Re: Volanomark slows by 80% under CFS

2007-07-27 Thread Chris Snook

Tim Chen wrote:

Ingo,

Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  
Benchmark was run on a 2 socket Core2 machine.


The change in scheduler treatment of sched_yield 
could play a part in changing Volanomark behavior.

In CFS, sched_yield is implemented
by dequeueing and requeueing a process.  The time a process
has spent running probably reduced the cpu time due it
by only a bit. The process could get re-queued pretty close
to the head of the queue, and may get scheduled again pretty
quickly if there is still a lot of cpu time due.


It may make sense to queue the
yielding process a bit further behind in the queue. 
I made a slight change by zeroing out wait_runtime 
(i.e. have the process give
up cpu time due for it to run) for experimentation.
Let's put aside gripes that Volanomark should have used a
better mechanism to coordinate threads instead of sched_yield for
a second.   Volanomark runs better
and is only 40% (instead of 80%) down from the old scheduler
without CFS.


Of course we should not tune for Volanomark and this is
reference data. 
What are your view on how CFS's sched_yield should behave?


Regards,
Tim


The primary purpose of sched_yield is for SCHED_FIFO realtime processes, where
nothing else will run, ever, unless the running thread blocks or yields the CPU.
Under CFS, the yielding process will still be leftmost in the rbtree,
otherwise it would have already been scheduled out.


Zeroing out wait_runtime on sched_yield strikes me as completely appropriate. 
If the process wanted to sleep a finite duration, it should actually call a 
sleep function, but sched_yield is essentially saying "I don't have anything 
else to do right now", so it's hardly fair to claim you've been waiting for your 
chance when you just gave it up.


As for the remaining 40% degradation, if Volanomark is using it for 
synchronization, the scheduler is probably cycling through threads until it gets 
to the one that actually wants to do work.  The O(1) scheduler will do this very 
 quickly, whereas CFS has a bit more overhead.  Interactivity boosting may have 
also helped the old scheduler find the right thread faster.


I think Volanomark is being pretty stupid, and deserves to run slowly, but there 
are legitimate reasons to want to call sched_yield in a non-SCHED_FIFO process. 
 If I'm performing multiple different calculations on the same set of data in 
multiple threads, and accessing the shared data in a linear fashion, I'd like to 
be able to have one thread give the other some CPU time so they can stay at the 
same point in the stream and improve cache hit rates, but this is only an 
optimization if I can do it without wasting CPU or gradually nicing myself into 
oblivion.  Having sched_yield zero out wait_runtime seems like an appropriate 
way to make this use case work to the extent possible.  Any user attempting such 
an optimization should have the good sense to do real work between sched_yield 
calls, to avoid calling the scheduler in a tight loop.
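
A sketch of that usage pattern (illustrative only; the function and
sizes are made up): do a real chunk of work on the shared data, then
yield once, rather than spinning on sched_yield:

#include <sched.h>
#include <stdio.h>

#define N      (1 << 20)
#define CHUNK  4096

/* Illustrative pattern only: process a real chunk of the shared stream,
 * then sched_yield() once so a sibling thread working on the same data
 * can keep pace, instead of calling the scheduler in a tight loop. */
static double checksum(const double *buf, long n)
{
    double s = 0.0;
    long i;

    for (i = 0; i < n; i++)
        s += buf[i];
    return s;
}

int main(void)
{
    static double stream[N];    /* stand-in for the shared data */
    double total = 0.0;
    long off;

    for (off = 0; off < N; off += CHUNK) {
        total += checksum(stream + off, CHUNK); /* real work...  */
        sched_yield();                          /* ...then yield */
    }
    printf("checksum %f after %d yields\n", total, N / CHUNK);
    return 0;
}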


-- Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Volanomark slows by 80% under CFS

2007-07-27 Thread Chris Snook

Tim Chen wrote:

Ingo,

Volanomark slows by 80% with CFS scheduler on 2.6.23-rc1.  
Benchmark was run on a 2 socket Core2 machine.


The change in scheduler treatment of sched_yield 
could play a part in changing Volanomark behavior.

In CFS, sched_yield is implemented
by dequeueing and requeueing a process .  The time a process 
has spent running probably reduced the the cpu time due it 
by only a bit. The process could get re-queued pretty close

to head of the queue, and may get scheduled again pretty
quickly if there is still a lot of cpu time due.  


It may make sense to queue the
yielding process a bit further behind in the queue. 
I made a slight change by zeroing out wait_runtime 
(i.e. have the process gives
up cpu time due for it to run) for experimentation. 
Let's put aside gripes that Volanomark should have used a 
better mechanism to coordinate threads instead sched_yield for 
a second.   Volanomark runs better
and is only 40% (instead of 80%) down from old scheduler 
without CFS.  


Of course we should not tune for Volanomark and this is
reference data. 
What are your view on how CFS's sched_yield should behave?


Regards,
Tim


The primary purpose of sched_yield is for SCHED_FIFO realtime processes.  Where 
nothing else will run, ever, unless the running thread blocks or yields the CPU. 
 Under CFS, the yielding process will still be leftmost in the rbtree, 
otherwise it would have already been scheduled out.


Zeroing out wait_runtime on sched_yield strikes me as completely appropriate. 
If the process wanted to sleep a finite duration, it should actually call a 
sleep function, but sched_yield is essentially saying I don't have anything 
else to do right now, so it's hardly fair to claim you've been waiting for your 
chance when you just gave it up.


As for the remaining 40% degradation, if Volanomark is using it for 
synchronization, the scheduler is probably cycling through threads until it gets 
to the one that actually wants to do work.  The O(1) scheduler will do this very 
 quickly, whereas CFS has a bit more overhead.  Interactivity boosting may have 
also helped the old scheduler find the right thread faster.


I think Volanomark is being pretty stupid and deserves to run slowly, but there 
are legitimate reasons to want to call sched_yield in a non-SCHED_FIFO process. 
If I'm performing multiple different calculations on the same set of data in 
multiple threads, accessing the shared data in a linear fashion, I'd like to be 
able to have one thread give the others some CPU time so they can stay at the 
same point in the stream and improve cache hit rates. That is only an 
optimization if I can do it without wasting CPU or gradually nicing myself into 
oblivion.  Having sched_yield zero out wait_runtime seems like an appropriate 
way to make this use case work to the extent possible.  Any user attempting such 
an optimization should have the good sense to do real work between sched_yield 
calls, to avoid calling the scheduler in a tight loop.
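
As a hypothetical sketch of that pattern (the buffer, chunk size, and the two 
calculations are made up purely for illustration): each thread processes one 
chunk of the shared data, then yields so its sibling can work on the same, 
still-cached region before both move on.  Real work happens between the 
sched_yield calls, so the scheduler isn't hammered in a tight loop.

/* Build with: gcc -O2 -pthread yield_pipeline.c */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define N      (1 << 20)
#define CHUNK  (1 << 14)

static double data[N];

/* First calculation: running sum over the shared buffer. */
static void *sum_thread(void *arg)
{
    double *out = arg, acc = 0.0;

    for (long i = 0; i < N; i += CHUNK) {
        for (long j = i; j < i + CHUNK; j++)
            acc += data[j];       /* real work on this chunk */
        sched_yield();            /* let the sibling catch up */
    }
    *out = acc;
    return NULL;
}

/* Second calculation: sum of squares over the same buffer. */
static void *sumsq_thread(void *arg)
{
    double *out = arg, acc = 0.0;

    for (long i = 0; i < N; i += CHUNK) {
        for (long j = i; j < i + CHUNK; j++)
            acc += data[j] * data[j];
        sched_yield();
    }
    *out = acc;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    double sum = 0.0, sumsq = 0.0;

    for (long i = 0; i < N; i++)
        data[i] = (double)i / N;

    pthread_create(&t1, NULL, sum_thread, &sum);
    pthread_create(&t2, NULL, sumsq_thread, &sumsq);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("sum=%f sumsq=%f\n", sum, sumsq);
    return 0;
}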


-- Chris


Re: Volanomark slows by 80% under CFS

2007-07-27 Thread Andrea Arcangeli
On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote:
> I think Volanomark is being pretty stupid, and deserves to run slowly, but

Indeed, any app doing what volanomark does is pretty inefficient.

But this is not the point. I/O schedulers are pluggable to help
inefficient apps too. If apps were extremely smart they would all
use async I/O for their reads, and there would be no need for the
anticipatory scheduler, to give just one example.

The fact is, there's no technical reason why we should be forbidden
from choosing between CFS and O(1), at least at boot time.


Re: Volanomark slows by 80% under CFS

2007-07-27 Thread Rik van Riel

Tim Chen wrote:

Ingo,

Volanomark slows by 80% with the CFS scheduler on 2.6.23-rc1.  
The benchmark was run on a 2-socket Core2 machine.


The change in the scheduler's treatment of sched_yield 
could play a part in changing Volanomark's behavior.

In CFS, sched_yield is implemented by dequeueing and requeueing the 
process.  The time the process has spent running probably reduces the 
CPU time due to it by only a bit, so the process can get re-queued 
pretty close to the head of the queue, and may get scheduled again 
pretty quickly if there is still a lot of CPU time due.


I wonder if this explains the 30% drop in top performance
seen with the MySQL sysbench benchmark when the scheduler
changed to CFS...

See http://people.freebsd.org/~jeff/sysbench.png

--
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.


pluggable scheduler flamewar thread (was Re: Volanomark slows by 80% under CFS)

2007-07-27 Thread Chris Snook

Andrea Arcangeli wrote:

> On Fri, Jul 27, 2007 at 08:31:19PM -0400, Chris Snook wrote:
>> I think Volanomark is being pretty stupid, and deserves to run slowly, but
>
> Indeed, any app doing what volanomark does is pretty inefficient.
>
> But this is not the point. I/O schedulers are pluggable to help
> inefficient apps too. If apps were extremely smart they would all
> use async I/O for their reads, and there would be no need for the
> anticipatory scheduler, to give just one example.


I'm pretty sure the point of posting a patch that triples CFS performance on a 
certain benchmark and arguably improves the semantics of sched_yield was to 
improve CFS.  You have a point, but it is a point for a different thread.  I 
have taken the liberty of starting this thread for you.



> The fact is, there's no technical reason why we should be forbidden
> from choosing between CFS and O(1), at least at boot time.


Sure there is.  We can run a fully-functional POSIX OS without using any block 
devices at all.  We cannot run a fully-functional POSIX OS without a scheduler. 
 Any feature without which the OS cannot execute userspace code is sufficiently 
primitive that somewhere there is a device on which it will be impossible to 
debug if that feature fails to initialize.  It is quite reasonable to insist on 
only having one implementation of such features in any given kernel build.


Whether or not these alternatives belong in the source tree as config-time 
options is a political question, but preserving boot-time debugging capability 
is a perfectly reasonable technical motivation.


-- Chris


Re: pluggable scheduler thread (was Re: Volanomark slows by 80% under CFS)

2007-07-27 Thread Andrea Arcangeli
On Fri, Jul 27, 2007 at 11:43:23PM -0400, Chris Snook wrote:
> I'm pretty sure the point of posting a patch that triples CFS performance
> on a certain benchmark and arguably improves the semantics of sched_yield
> was to improve CFS.  You have a point, but it is a point for a different
> thread.  I have taken the liberty of starting this thread for you.

I've no real interest in starting or participating in flamewars
(especially ones not backed by hard numbers), so I adjusted the
subject a bit in the hope that the discussion will not degenerate as
you predicted; I hope you don't mind.

I'm pretty sure the point of posting that email was to show the
remaining performance regression with the sched_yield fix applied
too. Given that you considered my post both offtopic and inflammatory,
I guess you think it's possible and reasonably easy to fix that
remaining regression without a pluggable scheduler, right? So please
enlighten us on how you intend to achieve it.

Also consider that the other numbers likely used NPTL, so they
shouldn't be affected by the sched_yield changes.

> Sure there is.  We can run a fully-functional POSIX OS without using any
> block devices at all.  We cannot run a fully-functional POSIX OS without a
> scheduler.  Any feature without which the OS cannot execute userspace code
> is sufficiently primitive that somewhere there is a device on which it will
> be impossible to debug if that feature fails to initialize.  It is quite
> reasonable to insist on only having one implementation of such features in
> any given kernel build.

Sounds like a red herring to me... There aren't just pluggable I/O
schedulers in the kernel, there are pluggable packet schedulers too
(see `tc qdisc`). And both are switchable at runtime (not just at boot
time).

Can you run your fully-functional POSIX OS without a packet scheduler
and without an I/O scheduler? I wonder where you are going to
read and write data without a disk or a network.

Also, those pluggable frameworks don't increase the risk of crashes
much, compared to the complexity of the schedulers themselves.

> Whether or not these alternatives belong in the source tree as config-time
> options is a political question, but preserving boot-time debugging
> capability is a perfectly reasonable technical motivation.

The scheduler is invoked very late in the boot process (printk, the
serial console, and kdb have been working for ages by the time the
scheduler kicks in), so it's fully debuggable (no debugger depends on
the scheduler; they run inside the NMI handler...). I don't really see
your point.

And even if there were a subtle bug in the scheduler, you'd never
trigger it at boot with so few tasks and so few context switches.