Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-28 Thread Bernd Eckenfels
In article <[EMAIL PROTECTED]> you wrote:
>  a) it may do so for a short and bound time, typically less than the
> maximum acceptable latency for other tasks

if you have n threads in runq and each of them can have m


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-26 Thread William Lee Irwin III
On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
>>> Adjustments to the lag computation for arrivals and departures
>>> during execution are among the missing pieces. Some algorithmic devices
>>> are also needed to account for the varying growth rates of lags of tasks
>>> waiting to run, which arise from differing priorities/weights.

On Wed, 2007-04-25 at 22:13 +0200, Willy Tarreau wrote:
>> that was the principle of my proposal of sorting tasks by expected completion
>> time and using +/- credit to compensate for too large/too short slice used.

On Thu, Apr 26, 2007 at 10:57:48AM -0700, Li, Tong N wrote:
> Yeah, it's a good algorithm. It's a variant of earliest deadline first
> (EDF). There are also similar ones in the literature such as earliest
> eligible virtual deadline first (EEVDF) and biased virtual finishing
> time (BVFT). Based on wli's explanation, I think Ingo's approach would
> also fall into this category. With careful design, all such algorithms
> that order tasks based on some notion of time can achieve good fairness.
> There are some subtle differences. Some algorithms of this type can
> achieve a constant lag bound, while some only have a constant positive lag
> bound but an O(N) negative lag bound, meaning some tasks could receive
> much more CPU time than they would under ideal fairness when the number of
> tasks is high.

The algorithm is in a bit of flux, but the virtual deadline computation
is rather readable. You may be able to tell whether cfs is affected by
the negative lag issue better than I. For the most part all I can smoke
out is that it's not apparent to me whether load balancing is done the
way it needs to be.


On Thu, Apr 26, 2007 at 10:57:48AM -0700, Li, Tong N wrote:
> On the other hand, the log(N) complexity of this type of algorithm has
> been a concern in the research community. This motivated O(1)
> round-robin based algorithms such as deficit round-robin (DRR) and
> smoothed round-robin (SRR) in networking, and virtual-time round-robin
> (VTRR), group ratio round-robin (GR3) and grouped distributed queues
> (GDQ) in OS scheduling, as well as the distributed weighted round-robin
> (DWRR) one I posted earlier.

I'm going to make a bold statement: I don't think O(lg(n)) is bad at
all. In real systems, per-task memory footprints severely restrict how
large the number of runnable tasks can get, which renders O(lg(n))
bounded by a rather reasonable constant (even a million runnable tasks
gives lg(n) of only about 20).

A larger concern to me is whether this affair actually achieves its
design goals and, to a lesser extent, in what contexts those design
goals are truly crucial or dominant as opposed to others, such as,
say, interactivity. It is clear, regardless of general applicability,
that the predictability of behavior with regard to strict fairness
is going to be useful in certain contexts.

Another concern which is in favor of the virtual deadline design is
that virtual deadlines can very effectively emulate a broad spectrum
of algorithms. For instance, the mainline "O(1) scheduler" can be
emulated using such a queueing mechanism. Even if the particular
policy cfs now implements is dumped, radically different policies
can be expressed with its queueing mechanism. This has maintenance
implications which are quite beneficial. That said, it's far from
an unqualified endorsement. I'd still like to see much done differently.


-- wli


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-26 Thread Willy Tarreau
On Thu, Apr 26, 2007 at 10:57:48AM -0700, Li, Tong N wrote:
> On Wed, 2007-04-25 at 22:13 +0200, Willy Tarreau wrote:
> > On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
> > 
> > > Adjustments to the lag computation for arrivals and departures
> > > during execution are among the missing pieces. Some algorithmic devices
> > > are also needed to account for the varying growth rates of lags of tasks
> > > waiting to run, which arise from differing priorities/weights.
> > 
> > that was the principle of my proposal of sorting tasks by expected 
> > completion
> > time and using +/- credit to compensate for too large/too short slice used.
> > 
> > Willy
> 
> Yeah, it's a good algorithm. It's a variant of earliest deadline first
> (EDF). There are also similar ones in the literature such as earliest
> eligible virtual deadline first (EEVDF) and biased virtual finishing
> time (BVFT). Based on wli's explanation, I think Ingo's approach would
> also fall into this category. With careful design, all such algorithms
> that order tasks based on some notion of time can achieve good fairness.
> There are some subtle differences. Some algorithms of this type can
> achieve a constant lag bound, while some only have a constant positive lag
> bound but an O(N) negative lag bound,

Anyway, we're working in discrete time, not continuous time. Lag is
unavoidable. At best it can be bounded and compensated for. The first time
I thought about this algorithm, I was looking at a line drawn using the
Bresenham algorithm. That algorithm is all about compensating the error
against an ideal expectation. Too high, too far, too high, too far... I
thought that the line could represent a task's progress as a function of
time, and the pixels the periods the task spends on the CPU. On short
intervals you lose; on large ones you're very close to the ideal case.

> meaning some tasks could receive
> much more CPU time than they would under ideal fairness when the number of
> tasks is high.

It's not a problem that a task receives much more CPU than it should,
provided that:

  a) it may do so only for a short and bounded time, typically less than
     the maximum acceptable latency for other tasks

  b) the excess CPU it received is accounted for so that it is deducted
     from subsequent passes (see the sketch below).
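
To make (b) concrete, here is a minimal userspace sketch of that kind of
credit accounting; the struct, the 'credit' field and the 10 ms slice are
invented for illustration and are not CFS or SD code:

#include <stdio.h>

/*
 * Hypothetical per-task accounting: 'credit' is the CPU time (in ns) the
 * task is still owed; negative credit means it overran earlier.
 */
struct task {
        const char *name;
        long long credit;
};

#define IDEAL_SLICE_NS 10000000LL       /* 10 ms ideal slice (arbitrary) */

/*
 * Called after the task ran for 'ran_ns': any excess over what it was
 * entitled to (one slice plus whatever it was owed) becomes negative
 * credit and is deducted from the next pass; a shortfall carries over
 * as positive credit.
 */
static void account_slice(struct task *t, long long ran_ns)
{
        long long entitled = IDEAL_SLICE_NS + t->credit;

        t->credit = entitled - ran_ns;
}

/* Next entitlement, clamped at 0 so one overrun cannot block a task forever. */
static long long next_entitlement(const struct task *t)
{
        long long e = IDEAL_SLICE_NS + t->credit;

        return e > 0 ? e : 0;
}

int main(void)
{
        struct task t = { "worker", 0 };

        account_slice(&t, 14000000LL);          /* overran by 4 ms ... */
        printf("%s next gets %lld ns\n", t.name, next_entitlement(&t));
        /* ... so the next pass grants it only 6 ms. */
        return 0;
}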


> On the other hand, the log(N) complexity of this type of algorithm has
> been a concern in the research community. This motivated O(1)

I was in favour of O(1) algorithms for a long time, but seeing how limited
the things you can do in O(1) are compared to O(logN), I definitely changed
my mind. Also, with O(logN) you often still have a lot of the common
operations in O(1). E.g. you insert in O(logN) into a time-ordered tree,
but you read from it in O(1), and you still have the ability to change its
contents in O(logN) if you need to.
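
As an illustration of that mix of costs (a self-contained sketch, not the
kernel's rbtree code): a binary min-heap keyed by expected completion time
gives O(logN) insertion while reading the most urgent entry stays O(1):

#include <stdio.h>

#define MAX_TASKS 1024

/* Min-heap of expected completion times (virtual deadlines). */
static unsigned long long heap[MAX_TASKS];
static int nr;

/* O(logN): bubble the new deadline up to its position. */
static void heap_insert(unsigned long long deadline)
{
        int i = nr++;

        while (i > 0 && heap[(i - 1) / 2] > deadline) {
                heap[i] = heap[(i - 1) / 2];
                i = (i - 1) / 2;
        }
        heap[i] = deadline;
}

/* O(1): the earliest deadline is always at the root. */
static unsigned long long heap_peek(void)
{
        return heap[0];
}

int main(void)
{
        heap_insert(300);
        heap_insert(100);
        heap_insert(200);
        printf("next deadline: %llu\n", heap_peek());   /* prints 100 */
        return 0;
}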

Last but not least, spending one hundred cycles a few thousand times
a second is nothing compared to electing the wrong task for a full
time-slice.

Willy



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-26 Thread Li, Tong N
On Wed, 2007-04-25 at 22:13 +0200, Willy Tarreau wrote:
> On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:
> 
> > Adjustments to the lag computation for arrivals and departures
> > during execution are among the missing pieces. Some algorithmic devices
> > are also needed to account for the varying growth rates of lags of tasks
> > waiting to run, which arise from differing priorities/weights.
> 
> that was the principle of my proposal of sorting tasks by expected completion
> time and using +/- credit to compensate for too large/too short slice used.
> 
> Willy

Yeah, it's a good algorithm. It's a variant of earliest deadline first
(EDF). There are also similar ones in the literature such as earliest
eligible virtual deadline first (EEVDF) and biased virtual finishing
time (BVFT). Based on wli's explanation, I think Ingo's approach would
also fall into this category. With careful design, all such algorithms
that order tasks based on some notion of time can achieve good fairness.
There are some subtle differences. Some algorithms of this type can
achieve a constant lag bound, while some only have a constant positive lag
bound but an O(N) negative lag bound, meaning some tasks could receive
much more CPU time than they would under ideal fairness when the number of
tasks is high.

On the other hand, the log(N) complexity of this type of algorithm has
been a concern in the research community. This motivated O(1)
round-robin based algorithms such as deficit round-robin (DRR) and
smoothed round-robin (SRR) in networking, and virtual-time round-robin
(VTRR), group ratio round-robin (GR3) and grouped distributed queues
(GDQ) in OS scheduling, as well as the distributed weighted round-robin
(DWRR) one I posted earlier.

  tong


SD renice recommendation was: Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-25 Thread Con Kolivas
On Tuesday 24 April 2007 16:36, Ingo Molnar wrote:

> So, my point is, the nice level of X for desktop users should not be set
> lower than a low limit suggested by that particular scheduler's author.
> That limit is scheduler-specific. Con i think recommends a nice level of
> -1 for X when using SD [Con, can you confirm?], while my tests show that
> if you want you can go as low as -10 under CFS, without any bad
> side-effects. (-19 was a bit too much)

Nice 0 as a default for X, but if renicing, nice -10 as the lower limit for X
on SD. The reason for that on SD is that the priority of freshly woken up
tasks (i.e. not fully cpu bound) for both nice 0 and nice -10 will still be
the same at PRIO 1 (see the prio_matrix). Therefore, there will _not_ be
preemption of the nice 0 task and a context switch _unless_ it is already cpu
bound, has consumed a certain number of cycles and has been demoted.
Contrary to popular belief, it is not universal that a less niced task will
preempt its more niced counterpart; whether it does depends entirely on the
implementation of nice. Yes, it is true that the context switch rate will go
up with a reniced X because the conditions that lead to preemption are more
likely to be met, but it is definitely not every single wakeup of the
reniced X.
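
A rough sketch of the wakeup test being described; the function names and
the toy priority mapping are purely illustrative, not SD's prio_matrix:

/*
 * Illustrative only: preemption on wakeup depends on the tasks' current
 * dynamic priorities, not directly on their nice values.  In this toy
 * mapping a freshly woken (not yet demoted) task gets the same dynamic
 * priority whether it is nice 0 or nice -10, so the running nice 0 task
 * is not preempted on every single wakeup of the reniced one.
 */
static int toy_dynamic_prio(int nice, int demotions)
{
        int base = 1;                   /* fresh, not yet cpu-bound tasks */

        return base + demotions + (nice > 0 ? nice : 0);
}

static int should_preempt(int woken_prio, int running_prio)
{
        return woken_prio < running_prio;       /* lower value is better */
}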

Alas, again, I am forced to spend as little time as possible at the pc for my 
health, so expect _very few_ responses via email from me. Luckily SD is in 
pretty fine shape with version 0.46.

-- 
-ck


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-25 Thread Willy Tarreau
On Wed, Apr 25, 2007 at 04:58:40AM -0700, William Lee Irwin III wrote:

> Adjustments to the lag computation for arrivals and departures
> during execution are among the missing pieces. Some algorithmic devices
> are also needed to account for the varying growth rates of lags of tasks
> waiting to run, which arise from differing priorities/weights.

that was the principle of my proposal of sorting tasks by expected completion
time and using +/- credit to compensate for too large/too short slice used.

Willy



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-25 Thread William Lee Irwin III
* Li, Tong N <[EMAIL PROTECTED]> wrote:
>> [...] A corollary of this is that if both threads i and j are 
>> continuously runnable with fixed weights in the time interval, then 
>> the ratio of their CPU time should be equal to the ratio of their 
>> weights. This definition is pretty restrictive since it requires the 
>> properties to hold for any thread in any interval, which is not 
>> feasible. [...]

On Wed, Apr 25, 2007 at 11:44:03AM +0200, Ingo Molnar wrote:
> yes, it's a pretty strong definition, but also note that while it is 
> definitely not easy to implement, the solution is nevertheless feasible 
> in my opinion and there exists a scheduler that implements it: CFS.

The feasibility comment refers to the unimplementability of schedulers
with infinitesimal timeslices/quanta/sched_granularity_ns. It's no
failing of cfs (or any other scheduler) if, say, the ratios are not
exact within a time interval of one nanosecond or one picosecond.

One of the reasons you get the results you do is that what you use for
->fair_key is very close to the definition of lag, which is used as a
metric of fairness. It differs in a couple of ways, but how it's
computed and used for queueing can be altered to more precisely match.

The basic concept you appear to be trying to implement is a greedy
algorithm: run the task with the largest lag first. As far as I can
tell, this is sound enough, though I have no formal proof. So with the
lag computation and queueing adjusted appropriately, it should work out.
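
As a concrete reading of that greedy rule, here is a minimal userspace
sketch (the structure and field names are invented for illustration):

#include <stddef.h>

/*
 * Invented for illustration: 'lag' is ideal service received minus actual
 * service received, so the most under-served runnable task has the
 * largest lag.
 */
struct entity {
        long long lag;          /* nanoseconds */
        int runnable;
};

/* Greedy rule: among runnable tasks, pick the one with the largest lag. */
static struct entity *pick_next(struct entity *tasks, int n)
{
        struct entity *best = NULL;
        int i;

        for (i = 0; i < n; i++) {
                if (!tasks[i].runnable)
                        continue;
                if (!best || tasks[i].lag > best->lag)
                        best = &tasks[i];
        }
        return best;            /* NULL if nothing is runnable */
}

The O(n) scan is only for clarity; keeping the entities in a tree ordered
by lag (or by the equivalent virtual deadline) gives the O(log n) pick
discussed elsewhere in the thread.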

Adjustments to the lag computation for arrivals and departures
during execution are among the missing pieces. Some algorithmic devices
are also needed to account for the varying growth rates of lags of tasks
waiting to run, which arise from differing priorities/weights.

There are no mysteries.


-- wli


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-25 Thread Alan Cox
> > it into some xorg.conf field. (It also makes sure that X isnt preempted 
> > by other userspace stuff while it does timing-sensitive operations like 
> > setting the video modes up or switching video modes, etc.)
> 
> X is privileged. It can just cli around the critical section.

Not really. X can use iopl3 but if it disables interrupts you get
priority inversions and hangs, so in practice it can't do that. 

Alan


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-25 Thread Ingo Molnar

* Li, Tong N <[EMAIL PROTECTED]> wrote:

> [...] A corollary of this is that if both threads i and j are 
> continuously runnable with fixed weights in the time interval, then 
> the ratio of their CPU time should be equal to the ratio of their 
> weights. This definition is pretty restrictive since it requires the 
> properties to hold for any thread in any interval, which is not 
> feasible. [...]

yes, it's a pretty strong definition, but also note that while it is 
definitely not easy to implement, the solution is nevertheless feasible 
in my opinion and there exists a scheduler that implements it: CFS.

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-25 Thread Ingo Molnar

* Ray Lee <[EMAIL PROTECTED]> wrote:

> It would seem like there should be a penalty associated with sending 
> those points as well, so that two processes communicating quickly with 
> each other won't get into a mutual love-fest that'll capture the 
> scheduler's attention.

it's not really "points", but "nanoseconds you are allowed to execute on 
the CPU". And thus two processes communicating with each other quickly 
and sending around this resource does get the attention of CFS: the 
resource is gradually consumed because the two processes are running on 
the CPU while they are communicating with each other. So it all works 
out fine.

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-25 Thread Ingo Molnar

* Rogan Dawes <[EMAIL PROTECTED]> wrote:

> My concern was that since Ingo said that this is a closed economy, 
> with a fixed sum/total, if we lose a nanosecond here and there, 
> eventually we'll lose them all.

it's not a closed economy - the CPU constantly produces a resource: "CPU 
cycles to be spent", and tasks constantly consume that resource. So in 
that sense small inaccuracies are not a huge issue. But you are correct 
that each and every such inaccuracy has to be justified. For example 
larger inaccuracies on the order of SCHED_LOAD_SCALE are a problem 
because they can indeed sum up, and i fixed up a couple of such 
inaccuracies in -v6-to-be.

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-25 Thread Ingo Molnar

* Pavel Machek <[EMAIL PROTECTED]> wrote:

> > it into some xorg.conf field. (It also makes sure that X isnt 
> > preempted by other userspace stuff while it does timing-sensitive 
> > operations like setting the video modes up or switching video modes, 
> > etc.)
> 
> X is privileged. It can just cli around the critical section.

yes, that is a tool that can be used too (and is used by most drivers) - 
my point was rather that besides the disadvantages, not preempting X can 
be an advantage too - not that there are no other (and often more 
suitable) tools to do the same.

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-25 Thread Pavel Machek
Hi!


> it into some xorg.conf field. (It also makes sure that X isnt preempted 
> by other userspace stuff while it does timing-sensitive operations like 
> setting the video modes up or switching video modes, etc.)

X is privileged. It can just cli around the critical section.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread William Lee Irwin III
On Tue, Apr 24, 2007 at 06:22:53PM -0700, Li, Tong N wrote:
> The goal of a proportional-share scheduling algorithm is to minimize the
> above metrics. If the lag function is bounded by a constant for any
> thread in any time interval, then the algorithm is considered to be
> fair. You may notice that the second metric is actually weaker than the
> first. In fact, if an algorithm achieves a constant lag bound, it must
> also achieve a constant bound for the second metric, but the reverse is
> not necessarily true. But in some settings, people have focused on the
> second metric and still consider an algorithm to be fair as long as the
> second metric is bounded by a constant.

Using these metrics it is possible to write benchmarks quantifying
fairness as a performance metric, provided weights for nice numbers.

Not so coincidentally, this also entails a test of whether nice numbers
are working as intended.


-- wli

P.S. Divide by the length of the time interval to rephrase in terms of
CPU bandwidth.
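
A sketch of what such a benchmark could look like, using the F metric from
the quoted definitions; the thread count, weights and run time are
arbitrary, a real test would additionally renice each thread according to
its weight, and this is a sketch rather than an existing tool (build with
-pthread):

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NR_THREADS 3
#define RUN_SECONDS 5

/* Stand-ins for the weights implied by nice levels. */
static const double weight[NR_THREADS] = { 1.0, 2.0, 4.0 };
static double cpu_time[NR_THREADS];     /* seconds of CPU each thread got */
static volatile int stop;

/* Burn CPU until told to stop, then record this thread's CPU time. */
static void *burn(void *arg)
{
        int id = (int)(long)arg;
        struct timespec ts;

        while (!stop)
                ;                               /* spin */
        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
        cpu_time[id] = ts.tv_sec + ts.tv_nsec / 1e9;
        return NULL;
}

int main(void)
{
        pthread_t tid[NR_THREADS];
        double worst = 0.0;
        int i, j;

        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&tid[i], NULL, burn, (void *)(long)i);
        sleep(RUN_SECONDS);
        stop = 1;
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(tid[i], NULL);

        /*
         * F(t1,t2) = S_i/w_i - S_j/w_j from the quoted definition, divided
         * by the interval length to express the error as a fraction of one
         * CPU's bandwidth.
         */
        for (i = 0; i < NR_THREADS; i++)
                for (j = 0; j < NR_THREADS; j++) {
                        double f = cpu_time[i] / weight[i] -
                                   cpu_time[j] / weight[j];
                        if (f > worst)
                                worst = f;
                }
        printf("worst normalized unfairness: %.4f of one CPU\n",
               worst / RUN_SECONDS);
        return 0;
}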


RE: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Li, Tong N
> Could you explain for the audience the technical definition of
fairness
> and what sorts of error metrics are commonly used? There seems to be
> some disagreement, and you're neutral enough of an observer that your
> statement would help.

The definition for proportional fairness assumes that each thread has a
weight, which, for example, can be specified by the user, or something mapped
from thread priorities, nice values, etc. A scheduler achieves ideal
proportional fairness if (1) it is work-conserving, i.e., it never
leaves a processor idle if there are runnable threads, and (2) for any
two threads, i and j, in any time interval, the ratio of their CPU time
is greater than or equal to the ratio of their weights, assuming that
thread i is continuously runnable in the entire interval and both
threads have fixed weights throughout the interval. A corollary of this
is that if both threads i and j are continuously runnable with fixed
weights in the time interval, then the ratio of their CPU time should be
equal to the ratio of their weights. This definition is pretty
restrictive since it requires the properties to hold for any thread in
any interval, which is not feasible. In practice, all algorithms try to
approximate this ideal scheduler (often referred to as Generalized
Processor Scheduling or GPS). Two error metrics are often used: 

(1) lag(t): for any interval [t1, t2], the lag of a thread at time t \in
[t1, t2] is S'(t1, t) - S(t1, t), where S' is the CPU time the thread
would receive in the interval [t1, t] under the ideal scheduler and S is
the actual CPU time it receives under the scheduler being evaluated.

(2) The second metric doesn't really have an agreed-upon name. Some call
it a fairness measure and some call it something else. Anyway, different from
lag, which is kind of an absolute measure for one thread, this metric
(call it F) defines a relative measure between two threads over any time
interval:

F(t1, t2) = S_i(t1, t2) / w_i - S_j(t1, t2) / w_j,

where S_i and S_j are the CPU time the two threads receive in the
interval [t1, t2] and w_i and w_j are their weights, assuming both
weights don't change throughout the interval.

The goal of a proportional-share scheduling algorithm is to minimize the
above metrics. If the lag function is bounded by a constant for any
thread in any time interval, then the algorithm is considered to be
fair. You may notice that the second metric is actually weaker than the
first. In fact, if an algorithm achieves a constant lag bound, it must
also achieve a constant bound for the second metric, but the reverse is
not necessarily true. But in some settings, people have focused on the
second metric and still consider an algorithm to be fair as long as the
second metric is bounded by a constant.
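
To make the two metrics concrete with made-up numbers: two always-runnable
threads with weights 1 and 2 share one CPU for a 30 ms interval, so GPS
would give them 10 ms and 20 ms. If the scheduler under test gives them
12 ms and 18 ms instead, then lag_1 = 10 - 12 = -2 ms, lag_2 = 20 - 18 =
+2 ms, and F = 12/1 - 18/2 = 3 ms. A fair algorithm keeps such values
bounded by a constant no matter how long the interval is or how many
threads are runnable.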

> 
> On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> > I understand that via experiments we can show a design is reasonably
> > fair in the common case, but IMHO, to claim that a design is fair,
there
> > needs to be some kind of formal analysis on the fairness bound, and
this
> > bound should be proven to be constant. Even if the bound is not
> > constant, at least this analysis can help us better understand and
> > predict the degree of fairness that users would experience (e.g.,
would
> > the system be less fair if the number of threads increases? What
happens
> > if a large number of threads dynamically join and leave the
system?).
> 
> Carrying out this sort of analysis on various policies would help, but
> I'd expect most of them to be difficult to analyze. cfs' current
> ->fair_key computation should be simple enough to analyze, at least
> ignoring nice numbers, though I've done nothing rigorous in this area.
> 

If we can derive some invariants from the algorithm, it'd help the
analysis. An example is the deficit round-robin (DRR) algorithm in
networking. Its analysis utilizes the fact that the number of rounds each
flow (in this case, each thread) goes through in any time interval differs
by at most one.
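
For reference, a compact userspace sketch of DRR with CPU slices standing
in for packets; the quanta and slice size are arbitrary:

#include <stdio.h>

#define NR 3
#define SLICE 15        /* a thread runs only in whole 15-unit slices */

/* Per-round quanta proportional to the (arbitrary) weights 1:2:4. */
static const long quantum[NR] = { 10, 20, 40 };
static long deficit[NR];        /* unused allowance carried between rounds */
static long got[NR];            /* CPU granted so far */

/*
 * One DRR round: top up each backlogged thread's deficit by its quantum,
 * let it run whole slices while the deficit allows, and carry the rest.
 * Every backlogged thread is visited exactly once per round, which is the
 * "differs by at most one" invariant used in the analysis.
 */
static void one_round(void)
{
        int i;

        for (i = 0; i < NR; i++) {
                deficit[i] += quantum[i];
                while (deficit[i] >= SLICE) {
                        got[i] += SLICE;
                        deficit[i] -= SLICE;
                }
        }
}

int main(void)
{
        int r, i;

        for (r = 0; r < 1000; r++)
                one_round();
        for (i = 0; i < NR; i++)
                printf("thread %d: %ld units\n", i, got[i]);
        /* the totals converge to the 1:2:4 ratio of the quanta */
        return 0;
}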

Hope you didn't get bored by all of this. :)

  tong


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Willy Tarreau wrote:
>On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
>> On Tuesday 24 April 2007, Ingo Molnar wrote:
>> >* David Lang <[EMAIL PROTECTED]> wrote:
>> >> > (Btw., to protect against such mishaps in the future i have changed
>> >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
>> >> > change real-time tasks to SCHED_OTHER, but to also renice negative
>> >> > nice levels back to 0 - this will show up in -v6. That way you'd
>> >> > only have had to hit SysRq-N to get the system out of the wedge.)
>> >>
>> >> if you are trying to unwedge a system it may be a good idea to renice
>> >> all tasks to 0, it could be that a task at +19 is holding a lock that
>> >> something else is waiting for.
>> >
>> >Yeah, that's possible too, but +19 tasks are getting a small but
>> >guaranteed share of the CPU so eventually it ought to release it. It's
>> >still a possibility, but i think i'll wait for a specific incident to
>> >happen first, and then react to that incident :-)
>> >
>> >Ingo
>>
>> In the instance I created, even the SysRq+b was ignored, and ISTR that's
>> supposed to initiate a reboot, is it not?  So it was well and truly wedged.
>
>On many machines I use this on, I have to release Alt while still holding B.
>Don't know why, but it works like this.
>
>Willy

Yeah, Willy, and pardon a slight bit of sarcasm here but that's how we get the 
reputation for needing virgins to sacrifice, regular experienced girls just 
wouldn't do.

This isn't APL running on an IBM 5120, so it should Just Work(TM) and not need
a seance or something to conjure up the right spell.  Besides, the reset
button is only about 6 feet away...  I get some exercise that way by getting
up to push it. :)

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
It is so soon that I am done for, I wonder what I was begun for.
-- Epitaph, Cheltenham Churchyard


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Peter Williams

Rogan Dawes wrote:

Chris Friesen wrote:

Rogan Dawes wrote:

I guess my point was if we somehow get to an odd number of 
nanoseconds, we'd end up with rounding errors. I'm not sure if your 
algorithm will ever allow that.


And Ingo's point was that when it takes thousands of nanoseconds for a 
single context switch, an error of half a nanosecond is down in the 
noise.


Chris


My concern was that since Ingo said that this is a closed economy, with 
a fixed sum/total, if we lose a nanosecond here and there, eventually 
we'll lose them all.


Some folks have uptimes of multiple years.

Of course, I could (very likely!) be full of it! ;-)


And they won't be using any new scheduler on these computers anyhow, as
that would involve bringing the system down to install the new kernel. :-)


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Bernd Eckenfels
In article <[EMAIL PROTECTED]> you wrote:
> Could you explain for the audience the technical definition of fairness
> and what sorts of error metrics are commonly used? There seems to be
> some disagreement, and you're neutral enough of an observer that your
> statement would help.

And while we are at it, why it is a good thing. I could understand that fair
means no misbehaving (intentionally or unintentionally) application can
harm the rest of the system. However, a responsive desktop might not
necessarily be very fair to compute jobs.

Even a simple thing such as "who gets accounted" can be quite different in
different workloads. (Larger multi-user systems tend to be fair based on the
user, on servers you balance more by thread or job, and single-user systems
should be as unfair as the user wants them to be, as long as no process can
"run away".)

Gruss
Bernd


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread William Lee Irwin III
On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> I don't know if we've discussed this or not. Since both CFS and SD claim
> to be fair, I'd like to hear more opinions on the fairness aspect of
> these designs. In areas such as OS, networking, and real-time, fairness,
> and its more general form, proportional fairness, are well-defined
> terms. In fact, perfect fairness is not feasible since it requires all
> runnable threads to be running simultaneously and scheduled with
> infinitesimally small quanta (like a fluid system). So to evaluate if a
> new scheduling algorithm is fair, the common approach is to take the
> ideal fair algorithm (often referred to as Generalized Processor
> Scheduling or GPS) as a reference model and analyze if the new algorithm
> can achieve a constant error bound (different error metrics also exist).

Could you explain for the audience the technical definition of fairness
and what sorts of error metrics are commonly used? There seems to be
some disagreement, and you're neutral enough of an observer that your
statement would help.


On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> I understand that via experiments we can show a design is reasonably
> fair in the common case, but IMHO, to claim that a design is fair, there
> needs to be some kind of formal analysis on the fairness bound, and this
> bound should be proven to be constant. Even if the bound is not
> constant, at least this analysis can help us better understand and
> predict the degree of fairness that users would experience (e.g., would
> the system be less fair if the number of threads increases? What happens
> if a large number of threads dynamically join and leave the system?).

Carrying out this sort of analysis on various policies would help, but
I'd expect most of them to be difficult to analyze. cfs' current
->fair_key computation should be simple enough to analyze, at least
ignoring nice numbers, though I've done nothing rigorous in this area.


-- wli


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Li, Tong N
On Mon, 2007-04-23 at 18:57 -0700, Bill Huey wrote: 
> On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> > I don't know if we've discussed this or not. Since both CFS and SD claim
> > to be fair, I'd like to hear more opinions on the fairness aspect of
> > these designs. In areas such as OS, networking, and real-time, fairness,
> > and its more general form, proportional fairness, are well-defined
> > terms. In fact, perfect fairness is not feasible since it requires all
> > runnable threads to be running simultaneously and scheduled with
> > infinitesimally small quanta (like a fluid system). So to evaluate if a
> 
> Unfortunately, fairness is rather non-formal in this context and probably
> isn't strictly desirable given how hackish much of Linux userspace is. Until
> there's a method of doing directed yields, like what Will has prescribed
> (a kind of allotment to a thread doing work for another), a completely
> strict mechanism is probably problematic with regard to corner cases.
> 
> X, for example, is largely non-thread-safe. Until they can get their xcb
> framework in place and additional thread infrastructure to do hand-off
> properly, it's going to be difficult to schedule for it. It's well known to
> be problematic.

I agree. I just think calling the designs "perfectly" or "completely"
fair is too strong. It might cause unnecessary confusion that
overshadows the actual merits of these designs. If we were to evaluate
specifically the fairness aspect of a design, then I'd suggest defining
it more formally.

> You announced your scheduler without CCing any of the relevant people here
> (and risk being completely ignored in lkml traffic):
> 
>   http://lkml.org/lkml/2007/4/20/286
> 
> What is your opinion of both CFS and SD? How can your work be useful
> to either scheduler mentioned or to the Linux kernel on its own ?

I like SD for its simplicity. My concern with CFS is the RB tree
structure. Log(n) seems high to me given the fact that we had an O(1)
scheduler. Many algorithms achieve strong fairness guarantees at the
cost of log(n) time. Thus, I tend to think that, if log(n) is acceptable, we
might also want to look at other algorithms (e.g., start-time first)
with better fairness properties and see if they could be extended to be
general purpose.

> > I understand that via experiments we can show a design is reasonably
> > fair in the common case, but IMHO, to claim that a design is fair, there
> > needs to be some kind of formal analysis on the fairness bound, and this
> > bound should be proven to be constant. Even if the bound is not
> > constant, at least this analysis can help us better understand and
> > predict the degree of fairness that users would experience (e.g., would
> > the system be less fair if the number of threads increases? What happens
> > if a large number of threads dynamically join and leave the system?).
> 
> Will has been thinking about this, but you have to also consider the
> practicalities of your approach versus Con's and Ingo's.

I consider my work an approach to extending an existing scheduler to
support proportional fairness. I see that many proportional-share designs
lack things such as the good interactive support that Linux does well.
This is why I designed it on top of the existing scheduler, so that it
can leverage things such as dynamic priorities. Regardless of the
underlying scheduler, SD or CFS, I think the algorithm I used would
still apply and thus we can extend the scheduler similarly.

> I'm all for things like proportional scheduling and the extensions
> needed to do it properly. It would be highly relevant to some version
> of the -rt patch if not that patch directly.

I'd love it to be considered for part of the -rt patch. I'm new to 
this, so would you please let me know what to do?

Thanks,

  tong


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Willy Tarreau
On Tue, Apr 24, 2007 at 10:38:32AM -0400, Gene Heskett wrote:
> On Tuesday 24 April 2007, Ingo Molnar wrote:
> >* David Lang <[EMAIL PROTECTED]> wrote:
> >> > (Btw., to protect against such mishaps in the future i have changed
> >> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
> >> > change real-time tasks to SCHED_OTHER, but to also renice negative
> >> > nice levels back to 0 - this will show up in -v6. That way you'd
> >> > only have had to hit SysRq-N to get the system out of the wedge.)
> >>
> >> if you are trying to unwedge a system it may be a good idea to renice
> >> all tasks to 0, it could be that a task at +19 is holding a lock that
> >> something else is waiting for.
> >
> >Yeah, that's possible too, but +19 tasks are getting a small but
> >guaranteed share of the CPU so eventually it ought to release it. It's
> >still a possibility, but i think i'll wait for a specific incident to
> >happen first, and then react to that incident :-)
> >
> > Ingo
> 
> In the instance I created, even the SysRq+b was ignored, and ISTR that's
> supposed to initiate a reboot, is it not?  So it was well and truly wedged.

On many machines I use this on, I have to release Alt while still holding B.
Don't know why, but it works like this.

Willy



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Chris Friesen

Rogan Dawes wrote:

My concern was that since Ingo said that this is a closed economy, with 
a fixed sum/total, if we lose a nanosecond here and there, eventually 
we'll lose them all.


I assume Ingo has set it up so that the system doesn't "lose" partial 
nanoseconds, but rather they'd just be accounted to the wrong task.


Chris


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ray Lee

On 4/23/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:

On Mon, 23 Apr 2007, Ingo Molnar wrote:
>
> The "give scheduler money" transaction can be both an "implicit
> transaction" (for example when writing to UNIX domain sockets or
> blocking on a pipe, etc.), or it could be an "explicit transaction":
> sched_yield_to(). This latter i've already implemented for CFS, but it's
> much less useful than the really significant implicit ones, the ones
> which will help X.

Yes. It would be wonderful to get it working automatically, so please say
something about the implementation..

The "perfect" situation would be that when somebody goes to sleep, any
extra points it had could be given to whoever it woke up last. Note that
for something like X, it means that the points are 100% ephemeral: it gets
points when a client sends it a request, but it would *lose* the points
again when it sends the reply!


It would seem like there should be a penalty associated with sending
those points as well, so that two processes communicating quickly with
each other won't get into a mutual love-fest that'll capture the
scheduler's attention.

Ray


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Rogan Dawes

Chris Friesen wrote:

Rogan Dawes wrote:

I guess my point was if we somehow get to an odd number of 
nanoseconds, we'd end up with rounding errors. I'm not sure if your 
algorithm will ever allow that.


And Ingo's point was that when it takes thousands of nanoseconds for a 
single context switch, an error of half a nanosecond is down in the noise.


Chris


My concern was that since Ingo said that this is a closed economy, with 
a fixed sum/total, if we lose a nanosecond here and there, eventually 
we'll lose them all.


Some folks have uptimes of multiple years.

Of course, I could (very likely!) be full of it! ;-)

Rogan


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Chris Friesen

Rogan Dawes wrote:

I guess my point was if we somehow get to an odd number of nanoseconds, 
we'd end up with rounding errors. I'm not sure if your algorithm will 
ever allow that.


And Ingo's point was that when it takes thousands of nanoseconds for a 
single context switch, an error of half a nanosecond is down in the noise.


Chris


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Ingo Molnar <[EMAIL PROTECTED]> wrote:
>> yeah, i guess this has little to do with X. I think in your scenario
>> it might have been smarter to either stop, or to renice the workloads
>> that took away CPU power from others to _positive_ nice levels.
>> Negative nice levels can indeed be dangerous.
>
>btw., was X itself at nice 0 or nice -10 when the lockup happened?
>
>   Ingo

Memory could be fuzzy, Ingo, but I think it was at 0 at the time.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I know it all.  I just can't remember it all at once.


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Ingo Molnar <[EMAIL PROTECTED]> wrote:
>> yeah, i guess this has little to do with X. I think in your scenario
>> it might have been smarter to either stop, or to renice the workloads
>> that took away CPU power from others to _positive_ nice levels.
>> Negative nice levels can indeed be dangerous.
>
>btw., was X itself at nice 0 or nice -10 when the lockup happened?
>
>   Ingo

Memory could be hazy, Ingo, but I think X was at 0 when that occurred.


-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I use technology in order to hate it more properly.
-- Nam June Paik


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
>* David Lang <[EMAIL PROTECTED]> wrote:
>> > (Btw., to protect against such mishaps in the future i have changed
>> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
>> > change real-time tasks to SCHED_OTHER, but to also renice negative
>> > nice levels back to 0 - this will show up in -v6. That way you'd
>> > only have had to hit SysRq-N to get the system out of the wedge.)
>>
>> if you are trying to unwedge a system it may be a good idea to renice
>> all tasks to 0, it could be that a task at +19 is holding a lock that
>> something else is waiting for.
>
>Yeah, that's possible too, but +19 tasks are getting a small but
>guaranteed share of the CPU so eventually it ought to release it. It's
>still a possibility, but i think i'll wait for a specific incident to
>happen first, and then react to that incident :-)
>
>   Ingo

In the instance I created, even the SysRq+b was ignored, and ISTR that's
supposed to initiate a reboot, is it not?  So it was well and truly wedged.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I use technology in order to hate it more properly.
-- Nam June Paik


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Gene Heskett <[EMAIL PROTECTED]> wrote:
>> > (Btw., to protect against such mishaps in the future i have changed
>> > the SysRq-N [SysRq-Nice] implementation in my tree to not only
>> > change real-time tasks to SCHED_OTHER, but to also renice negative
>> > nice levels back to 0 - this will show up in -v6. That way you'd
>> > only have had to hit SysRq-N to get the system out of the wedge.)
>>
>> That sounds handy, particularly with idiots like me at the wheel...
>
>by that standard i guess we tinkerers are all idiots ;)
>
>   Ingo

Eiyyyup!

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Man's horizons are bounded by his vision.


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Rogan Dawes

Ingo Molnar wrote:

* Rogan Dawes <[EMAIL PROTECTED]> wrote:


   if (p_to && p->wait_runtime > 0) {
           p->wait_runtime >>= 1;
           p_to->wait_runtime += p->wait_runtime;
   }

the above is the basic expression of: "charge a positive bank balance". 


[..]

[note, due to the nanoseconds unit there's no rounding loss to worry 
about.]

Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?


yes. But note that we'll only truly have to worry about that when we have
context-switching performance in that range - currently it's at
least 2-3 orders of magnitude above that. Microseconds seemed to me to
be too coarse already, that's why i picked nanoseconds and 64-bit
arithmetic for CFS.


Ingo


I guess my point was if we somehow get to an odd number of nanoseconds, 
we'd end up with rounding errors. I'm not sure if your algorithm will 
ever allow that.


Rogan


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> [...] That way you'd only have had to hit SysRq-N to get the system 
> out of the wedge.)

small correction: Alt-SysRq-N.

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Rogan Dawes <[EMAIL PROTECTED]> wrote:

> >        if (p_to && p->wait_runtime > 0) {
> >                p->wait_runtime >>= 1;
> >                p_to->wait_runtime += p->wait_runtime;
> >        }
> >
> >the above is the basic expression of: "charge a positive bank balance". 
> >
> 
> [..]
> 
> > [note, due to the nanoseconds unit there's no rounding loss to worry 
> > about.]
> 
> Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?

yes. But note that we'll only truly have to worry about that when we have
context-switching performance in that range - currently it's at
least 2-3 orders of magnitude above that. Microseconds seemed to me to
be too coarse already, that's why i picked nanoseconds and 64-bit
arithmetic for CFS.
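
For concreteness, a tiny illustration of that truncation, using an
arbitrary 5 ns balance and the same shift as the quoted yield-to snippet:

#include <stdio.h>

int main(void)
{
        long long p_wait = 5, p_to_wait = 0;    /* ns, arbitrary values */

        /* same shape as the quoted snippet */
        p_wait >>= 1;           /* 5 >> 1 == 2: the odd nanosecond is gone */
        p_to_wait += p_wait;

        printf("donor %lld ns, recipient %lld ns, lost %lld ns\n",
               p_wait, p_to_wait, 5 - p_wait - p_to_wait);
        return 0;
}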

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* David Lang <[EMAIL PROTECTED]> wrote:

> > (Btw., to protect against such mishaps in the future i have changed 
> > the SysRq-N [SysRq-Nice] implementation in my tree to not only 
> > change real-time tasks to SCHED_OTHER, but to also renice negative 
> > nice levels back to 0 - this will show up in -v6. That way you'd 
> > only have had to hit SysRq-N to get the system out of the wedge.)
> 
> if you are trying to unwedge a system it may be a good idea to renice 
> all tasks to 0, it could be that a task at +19 is holding a lock that 
> something else is waiting for.

Yeah, that's possible too, but +19 tasks are getting a small but 
guaranteed share of the CPU so eventually it ought to release it. It's 
still a possibility, but i think i'll wait for a specific incident to 
happen first, and then react to that incident :-)

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> yeah, i guess this has little to do with X. I think in your scenario 
> it might have been smarter to either stop, or to renice the workloads 
> that took away CPU power from others to _positive_ nice levels. 
> Negative nice levels can indeed be dangerous.

btw., was X itself at nice 0 or nice -10 when the lockup happened?

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Ingo Molnar wrote:


* Gene Heskett <[EMAIL PROTECTED]> wrote:


Gene has done some testing under CFS with X reniced to +10 and the
desktop still worked smoothly for him.


As a data point here, and probably nothing to do with X, but I did
manage to lock it up, solid, reset button time tonight, by wanting
'smart' to get done with an update session after amanda had started.
I took both smart processes I could see in htop all the way to -19,
but when it was about done about 3 minutes later, everything came to
an instant, frozen, reset button required lockup.  I should have
stopped at -17 I guess. :(


yeah, i guess this has little to do with X. I think in your scenario it
might have been smarter to either stop, or to renice the workloads that
took away CPU power from others to _positive_ nice levels. Negative nice
levels can indeed be dangerous.

(Btw., to protect against such mishaps in the future i have changed the
SysRq-N [SysRq-Nice] implementation in my tree to not only change
real-time tasks to SCHED_OTHER, but to also renice negative nice levels
back to 0 - this will show up in -v6. That way you'd only have had to
hit SysRq-N to get the system out of the wedge.)


if you are trying to unwedge a system it may be a good idea to renice all tasks 
to 0, it could be that a task at +19 is holding a lock that something else is 
waiting for.


David Lang


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Gene Heskett <[EMAIL PROTECTED]> wrote:

> > (Btw., to protect against such mishaps in the future i have changed 
> > the SysRq-N [SysRq-Nice] implementation in my tree to not only 
> > change real-time tasks to SCHED_OTHER, but to also renice negative 
> > nice levels back to 0 - this will show up in -v6. That way you'd 
> > only have had to hit SysRq-N to get the system out of the wedge.)
> 
> That sounds handy, particularly with idiots like me at the wheel...

by that standard i guess we tinkerers are all idiots ;)

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Gene Heskett <[EMAIL PROTECTED]> wrote:
>> > Gene has done some testing under CFS with X reniced to +10 and the
>> > desktop still worked smoothly for him.
>>
>> As a data point here, and probably nothing to do with X, but I did
>> manage to lock it up, solid, reset button time tonight, by wanting
>> 'smart' to get done with an update session after amanda had started.
>> I took both smart processes I could see in htop all the way to -19,
>> but when it was about done about 3 minutes later, everything came to
>> an instant, frozen, reset button required lockup.  I should have
>> stopped at -17 I guess. :(
>
>yeah, i guess this has little to do with X. I think in your scenario it
>might have been smarter to either stop, or to renice the workloads that
>took away CPU power from others to _positive_ nice levels. Negative nice
>levels can indeed be dangerous.
>
>(Btw., to protect against such mishaps in the future i have changed the
>SysRq-N [SysRq-Nice] implementation in my tree to not only change
>real-time tasks to SCHED_OTHER, but to also renice negative nice levels
>back to 0 - this will show up in -v6. That way you'd only have had to
>hit SysRq-N to get the system out of the wedge.)
>
>   Ingo

That sounds handy, particularly with idiots like me at the wheel...


-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
When a Banker jumps out of a window, jump after him--that's where the money 
is.
-- Robespierre


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Ingo Molnar

* Gene Heskett <[EMAIL PROTECTED]> wrote:

> > Gene has done some testing under CFS with X reniced to +10 and the 
> > desktop still worked smoothly for him.
> 
> As a data point here, and probably nothing to do with X, but I did 
> manage to lock it up, solid, reset button time tonight, by wanting 
> 'smart' to get done with an update session after amanda had started.  
> I took both smart processes I could see in htop all the way to -19, 
> but when it was about done about 3 minutes later, everything came to 
> an instant, frozen, reset button required lockup.  I should have 
> stopped at -17 I guess. :(

yeah, i guess this has little to do with X. I think in your scenario it 
might have been smarter to either stop, or to renice the workloads that 
took away CPU power from others to _positive_ nice levels. Negative nice 
levels can indeed be dangerous.

(Btw., to protect against such mishaps in the future i have changed the 
SysRq-N [SysRq-Nice] implementation in my tree to not only change 
real-time tasks to SCHED_OTHER, but to also renice negative nice levels 
back to 0 - this will show up in -v6. That way you'd only have had to 
hit SysRq-N to get the system out of the wedge.)

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Rogan Dawes

Ingo Molnar wrote:


static void
yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to)
{
	struct rb_node *curr, *next, *first;
	struct task_struct *p_next;

	/*
	 * yield-to support: if we are on the same runqueue then
	 * give half of our wait_runtime (if it's positive) to the other task:
	 */
	if (p_to && p->wait_runtime > 0) {
		p->wait_runtime >>= 1;
		p_to->wait_runtime += p->wait_runtime;
	}

the above is the basic expression of: "charge a positive bank balance". 



[..]

[note, due to the nanoseconds unit there's no rounding loss to worry 
about.]


Surely if you divide 5 nanoseconds by 2, you'll get a rounding loss?
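
A quick worked example of the concern, with the numbers from the question
(illustrative C only, not a quote from the patch):

	long long donor = 5, receiver = 0;	/* wait_runtime, in ns */

	donor >>= 1;		/* 5 ns becomes 2 ns */
	receiver += donor;	/* the other task gains 2 ns */
	/* donor + receiver == 4 ns: 1 ns of the original 5 ns has vanished,
	   so there is a loss, but it is bounded by 1 ns per yield. */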


Ingo


Rogan


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-24 Thread Gene Heskett
On Tuesday 24 April 2007, Ingo Molnar wrote:
>* Peter Williams <[EMAIL PROTECTED]> wrote:
>> > The cases are fundamentally different in behavior, because in the
>> > first case, X hardly consumes the time it would get in any scheme,
>> > while in the second case X really is CPU bound and will happily
>> > consume any CPU time it can get.
>>
>> Which still doesn't justify an elaborate "points" sharing scheme.
>> Whichever way you look at that that's just another way of giving X
>> more CPU bandwidth and there are simpler ways to give X more CPU if it
>> needs it.  However, I think there's something seriously wrong if it
>> needs the -19 nice that I've heard mentioned.
>
>Gene has done some testing under CFS with X reniced to +10 and the
>desktop still worked smoothly for him.

As a data point here, and probably nothing to do with X, but I did manage to 
lock it up, solid, reset button time tonight, by wanting 'smart' to get done 
with an update session after amanda had started.  I took both smart processes 
I could see in htop all the way to -19, but when it was about done about 3 
minutes later, everything came to an instant, frozen, reset button required 
lockup.  I should have stopped at -17 I guess. :(

>So CFS does not 'need' a reniced 
>X. There are simply advantages to negative nice levels: for example
>screen refreshes are smoother on any scheduler i tried. BUT, there is a
>caveat: on non-CFS schedulers i tried X is much more prone to get into
>'overscheduling' scenarios that visibly hurt X's performance, while on
>CFS there's a max of 1000-1500 context switches a second at nice -10.
>(which, considering the cost of a context switch is well under 1%
>overhead.)
>
>So, my point is, the nice level of X for desktop users should not be set
>lower than a low limit suggested by that particular scheduler's author.
>That limit is scheduler-specific. Con i think recommends a nice level of
>-1 for X when using SD [Con, can you confirm?], while my tests show that
>if you want you can go as low as -10 under CFS, without any bad
>side-effects. (-19 was a bit too much)
>
>> [...]  You might as well just run it as a real time process.
>
>hm, that would be a bad idea under any scheduler (including CFS),
>because real time processes can starve other processes indefinitely.
>
>   Ingo



-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I have discovered that all human evil comes from this, man's being unable
to sit still in a room.
-- Blaise Pascal


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Ingo Molnar

* Peter Williams <[EMAIL PROTECTED]> wrote:

> > The cases are fundamentally different in behavior, because in the 
> > first case, X hardly consumes the time it would get in any scheme, 
> > while in the second case X really is CPU bound and will happily 
> > consume any CPU time it can get.
> 
> Which still doesn't justify an elaborate "points" sharing scheme. 
> Whichever way you look at that that's just another way of giving X 
> more CPU bandwidth and there are simpler ways to give X more CPU if it 
> needs it.  However, I think there's something seriously wrong if it 
> needs the -19 nice that I've heard mentioned.

Gene has done some testing under CFS with X reniced to +10 and the 
desktop still worked smoothly for him. So CFS does not 'need' a reniced 
X. There are simply advantages to negative nice levels: for example 
screen refreshes are smoother on any scheduler i tried. BUT, there is a 
caveat: on non-CFS schedulers i tried X is much more prone to get into 
'overscheduling' scenarios that visibly hurt X's performance, while on 
CFS there's a max of 1000-1500 context switches a second at nice -10. 
(which, considering the cost of a context switch is well under 1% 
overhead.)

So, my point is, the nice level of X for desktop users should not be set 
lower than a low limit suggested by that particular scheduler's author. 
That limit is scheduler-specific. Con i think recommends a nice level of 
-1 for X when using SD [Con, can you confirm?], while my tests show that 
if you want you can go as low as -10 under CFS, without any bad 
side-effects. (-19 was a bit too much)

> [...]  You might as well just run it as a real time process.

hm, that would be a bad idea under any scheduler (including CFS), 
because real time processes can starve other processes indefinitely.

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Peter Williams

Arjan van de Ven wrote:
Within reason, it's not the number of clients that X has that causes its 
CPU bandwidth use to sky rocket and cause problems.  It's more to do 
with what type of clients they are.  Most GUIs (even ones that are 
constantly updating visual data (e.g. gkrellm -- I can open quite a 
large number of these without increasing X's CPU usage very much)) cause 
very little load on the X server.  The exceptions to this are the 



there are actually 2 and not just 1 "X server", and they are VERY VERY
different in behavior.

Case 1: Accelerated driver

If X talks to a decent enough card it supports well with acceleration,
it will be very rare for X itself to spend any kind of significant
amount of CPU time, all the really heavy stuff is done in hardware, and
asynchronously at that. A bit of batching will greatly improve system
performance in this case.

Case 2: Unaccelerated VESA

Some drivers in X, especially the VESA and NV drivers (which are quite
common, vesa is used on all hardware without a special driver nowadays),
have no or not enough acceleration to matter for modern desktops. This
means the CPU is doing all the heavy lifting, in the X program. In this
case even a simple "move the window a bit" becomes quite a bit of a CPU
hog already.


Mine's a:

SiS 661/741/760 PCI/AGP or 662/761Gx PCIE VGA Display adapter according 
to X's display settings tool.  Which category does that fall into?


It's not a special adapter and is just the one that came with the 
motherboard. It doesn't use much CPU unless I grab a window and wiggle 
it all over the screen or do something like "ls -lR /" in an xterm.




The cases are fundamentally different in behavior, because in the first
case, X hardly consumes the time it would get in any scheme, while in
the second case X really is CPU bound and will happily consume any CPU
time it can get.


Which still doesn't justify an elaborate "points" sharing scheme. 
Whichever way you look at that that's just another way of giving X more 
CPU bandwidth and there are simpler ways to give X more CPU if it needs 
it.  However, I think there's something seriously wrong if it needs the 
-19 nice that I've heard mentioned.  You might as well just run it as a 
real time process.


Peter
--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Arjan van de Ven

> Within reason, it's not the number of clients that X has that causes its 
> CPU bandwidth use to sky rocket and cause problems.  It's more to do 
> with what type of clients they are.  Most GUIs (even ones that are 
> constantly updating visual data (e.g. gkrellm -- I can open quite a 
> large number of these without increasing X's CPU usage very much)) cause 
> very little load on the X server.  The exceptions to this are the 


there are actually 2 and not just 1 "X server", and they are VERY VERY
different in behavior.

Case 1: Accelerated driver

If X talks to a decent enough card it supports well with acceleration,
it will be very rare for X itself to spend any kind of significant
amount of CPU time, all the really heavy stuff is done in hardware, and
asynchronously at that. A bit of batching will greatly improve system
performance in this case.

Case 2: Unaccelerated VESA

Some drivers in X, especially the VESA and NV drivers (which are quite
common, vesa is used on all hardware without a special driver nowadays),
have no or not enough acceleration to matter for modern desktops. This
means the CPU is doing all the heavy lifting, in the X program. In this
case even a simple "move the window a bit" becomes quite a bit of a CPU
hog already.

The cases are fundamentally different in behavior, because in the first
case, X hardly consumes the time it would get in any scheme, while in
the second case X really is CPU bound and will happily consume any CPU
time it can get.



-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via 
http://www.linuxfirmwarekit.org



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Peter Williams

Linus Torvalds wrote:


On Mon, 23 Apr 2007, Ingo Molnar wrote:
The "give scheduler money" transaction can be both an "implicit 
transaction" (for example when writing to UNIX domain sockets or 
blocking on a pipe, etc.), or it could be an "explicit transaction": 
sched_yield_to(). This latter i've already implemented for CFS, but it's 
much less useful than the really significant implicit ones, the ones 
which will help X.


Yes. It would be wonderful to get it working automatically, so please say 
something about the implementation..


The "perfect" situation would be that when somebody goes to sleep, any 
extra points it had could be given to whoever it woke up last. Note that 
for something like X, it means that the points are 100% ephemeral: it gets 
points when a client sends it a request, but it would *lose* the points 
again when it sends the reply!


So it would only accumulate "scheduling points" while multiple clients 
are actively waiting for it, which actually sounds like exactly the right 
thing. However, I don't really see how to do it well, especially since the 
kernel cannot actually match up the client that gave some scheduling 
points to the reply that X sends back.


There are subtle semantics with these kinds of things: especially if the 
scheduling points are only awarded when a process goes to sleep, if X is 
busy and continues to use the CPU (for another client), it wouldn't give 
any scheduling points back to clients and they really do accumulate with 
the server. Which again sounds like it would be exactly the right thing 
(both in the sense that the server that runs more gets more points, but 
also in the sense that we *only* give points at actual scheduling events).


But how do you actually *give/track* points? A simple "last woken up by 
this process" thing that triggers when it goes to sleep? It might work, 
but on the other hand, especially with more complex things (and networking 
tends to be pretty complex) the actual wakeup may be done by a software 
irq. Do we just say "it ran within the context of X, so we assume X was 
the one that caused it?" It probably would work, but we've generally tried 
very hard to avoid accessing "current" from interrupt context, including 
bh's.


Within reason, it's not the number of clients that X has that causes its 
CPU bandwidth use to sky rocket and cause problems.  It's more to do 
with what type of clients they are.  Most GUIs (even ones that are 
constantly updating visual data (e.g. gkrellm -- I can open quite a 
large number of these without increasing X's CPU usage very much)) cause 
very little load on the X server.  The exceptions to this are the 
various terminal emulators (e.g. xterm, gnome-terminal, etc.) when being 
used to run output intensive command line programs e.g. try "ls -lR /" 
in an xterm.  The other way (that I've noticed) that X's CPU bandwidth usage 
sky rockets is when you grab a large window and wiggle it about a lot; 
hopefully this doesn't happen often, so the problem that needs to be 
addressed is the one caused by text output on xterm and its ilk.


So I think that an elaborate scheme for distributing "points" between X 
and its clients would be overkill.  A good scheduler will make sure 
other tasks such as audio streamers get CPU when they need it with good 
responsiveness even when X takes off by giving them higher priority 
because their CPU bandwidth use is low.


The one problem that might still be apparent in these cases is the mouse 
becoming jerky while X is working like crazy to spew out text too fast 
for anyone to read.  But the only way to fix that is to give X more 
bandwidth but if it's already running at about 95% of a CPU that's 
unlikely to help.  To fix this you would probably need to modify X so 
that it knows re-rendering the cursor is more important than rendering 
text in an xterm.


In normal circumstances, the re-rendering of the mouse happens quickly 
enough for the user to experience good responsiveness because X's normal 
CPU use is low enough for it to be given high priority.


Just because the O(1) scheduler tried this model and failed doesn't mean that the 
model is bad.  O(1) was a flawed implementation of a good model.


Peter
PS Doing a kernel build in an xterm isn't an example of high enough 
output to cause a problem as (on my system) it only raises X's 
consumption from 0-2% to 2-5%.  The type of output that causes the 
problem is usually flying past too fast to read.

--
Peter Williams   [EMAIL PROTECTED]

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread hui
On Mon, Apr 23, 2007 at 05:59:06PM -0700, Li, Tong N wrote:
> I don't know if we've discussed this or not. Since both CFS and SD claim
> to be fair, I'd like to hear more opinions on the fairness aspect of
> these designs. In areas such as OS, networking, and real-time, fairness,
> and its more general form, proportional fairness, are well-defined
> terms. In fact, perfect fairness is not feasible since it requires all
> runnable threads to be running simultaneously and scheduled with
> infinitesimally small quanta (like a fluid system). So to evaluate if a

Unfortunately, fairness is rather non-formal in this context and probably
isn't strictly desirable given how hackish much of Linux userspace is. Until
there's a method of doing directed yields, like what Will has prescribed
(a kind of allotment to a thread doing work for another), a completely strict
mechanism is probably problematic with regards to corner cases.

X for example is largely non-thread-safe. Until they can get their xcb
framework in place and additional thread infrastructure to do hand-off
properly, it's going to be difficult to schedule for it. It's well known to
be problematic.

You announced your scheduler without CCing any of the relevant people here
(and risk being completely ignored in lkml traffic):

http://lkml.org/lkml/2007/4/20/286

What is your opinion of both CFS and SD? How can your work be useful
to either scheduler mentioned or to the Linux kernel on its own?

> I understand that via experiments we can show a design is reasonably
> fair in the common case, but IMHO, to claim that a design is fair, there
> needs to be some kind of formal analysis on the fairness bound, and this
> bound should be proven to be constant. Even if the bound is not
> constant, at least this analysis can help us better understand and
> predict the degree of fairness that users would experience (e.g., would
> the system be less fair if the number of threads increases? What happens
> if a large number of threads dynamically join and leave the system?).

Will has been thinking about this, but you have to also consider the
practicalities of your approach versus Con's and Ingo's.

I'm all for things like proportional scheduling and the extensions
needed to do it properly. It would be highly relevant to some version
of the -rt patch if not that patch directly.

bill



RE: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Li, Tong N
I don't know if we've discussed this or not. Since both CFS and SD claim
to be fair, I'd like to hear more opinions on the fairness aspect of
these designs. In areas such as OS, networking, and real-time, fairness,
and its more general form, proportional fairness, are well-defined
terms. In fact, perfect fairness is not feasible since it requires all
runnable threads to be running simultaneously and scheduled with
infinitesimally small quanta (like a fluid system). So to evaluate if a
new scheduling algorithm is fair, the common approach is to take the
ideal fair algorithm (often referred to as Generalized Processor
Scheduling or GPS) as a reference model and analyze if the new algorithm
can achieve a constant error bound (different error metrics also exist).
I understand that via experiments we can show a design is reasonably
fair in the common case, but IMHO, to claim that a design is fair, there
needs to be some kind of formal analysis on the fairness bound, and this
bound should be proven to be constant. Even if the bound is not
constant, at least this analysis can help us better understand and
predict the degree of fairness that users would experience (e.g., would
the system be less fair if the number of threads increases? What happens
if a large number of threads dynamically join and leave the system?).
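
For reference, the kind of bound being asked for is usually stated against
the GPS fluid model along the following lines (a standard textbook
formulation, not something taken from this thread):

\[
\mathrm{lag}_i(t) \;=\; S_i^{\mathrm{GPS}}(0,t) \;-\; S_i(0,t),
\qquad |\mathrm{lag}_i(t)| \;\le\; c \quad \text{for all tasks } i
\text{ and all times } t,
\]

where S_i(0,t) is the CPU time task i actually receives in [0,t],
S_i^{GPS}(0,t) is what it would receive under the ideal weighted fluid
schedule, and the scheduler is considered fair when the constant c does
not grow with the number of runnable tasks.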

  tong


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Jeremy Fitzhardinge
Linus Torvalds wrote:
> The "perfect" situation would be that when somebody goes to sleep, any 
> extra points it had could be given to whoever it woke up last. Note that 
> for something like X, it means that the points are 100% ephemeral: it gets 
> points when a client sends it a request, but it would *lose* the points 
> again when it sends the reply!
>
> So it would only accumulate "scheduling points" while multiple clients 
> are actively waiting for it, which actually sounds like exactly the right 
> thing. However, I don't really see how to do it well, especially since the 
> kernel cannot actually match up the client that gave some scheduling 
> points to the reply that X sends back.
>   

This works out in quite an interesting way.  If the economy is closed -
all clients and servers are managed by the same scheduler - then the
server could get no inherent CPU priority and live entirely on donated
shares.  If that were the case, you'd have to make sure that the server
used the donation from client A on client A's work, otherwise you'd get
freeloaders - but maybe it will all work out.

It gets more interesting when you have a non-closed system - the X
server is working on behalf of external clients over TCP.  Presumably
wakeups from incoming TCP connections wouldn't have any scheduler shares
associated with it, so the X server would have to use its inherent CPU
allocation to service those requests.  Or the external client could
effectively end up freeloading off portions of the local clients' donations.

J


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Guillaume Chazarain

2007/4/23, Ingo Molnar <[EMAIL PROTECTED]>:

p->wait_runtime >>= 1;
p_to->wait_runtime += p->wait_runtime;


I have no problem with clients giving some credit to X,
I am more concerned with X giving half of its credit to
a single client, a quarter of its credit to another client, etc...

For example, a client could set up a periodic wake up
from X, and then periodically get some credit for free.
Would that be possible?

Thanks.

--
Guillaume


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> sorry, i was a bit imprecise here. There is a case where CFS can give 
> out a 'loan' to tasks. The scheduler tick has a low resolution, so it 
> is fundamentally inevitable [*] that tasks will run a bit more than 
> they should, and at heavy context-switching rates these errors can 
> add up significantly. Furthermore, we want to batch up workloads.
> 
> So CFS has a "no loans larger than sched_granularity_ns" policy (which 
> defaults to 5msec), and it captures these sub-granularity 'loans' with 
> nanosec accounting. This too is a very sane economic policy and is 
> anti-inflationary :-)

at which point i guess i should rename CFS to 'EFS' (the Economic Fair 
Scheduler)? =B-)

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Ingo Molnar

* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> (we obviously dont want to allow people to 'share' their loans with 
> others ;), nor do we want to allow a net negative balance. CFS is 
> really brutally cold-hearted, it has a strict 'no loans' policy - the 
> easiest economic way to manage 'inflation', besides the basic act of 
> not printing new money, ever.)

sorry, i was a bit imprecise here. There is a case where CFS can give 
out a 'loan' to tasks. The scheduler tick has a low resolution, so it is 
fundamentally inevitable [*] that tasks will run a bit more than they 
should, and at a heavy context-switching rates these errors can add up 
significantly. Furthermore, we want to batch up workloads.

So CFS has a "no loans larger than sched_granularity_ns" policy (which 
defaults to 5msec), and it captures these sub-granularity 'loans' with 
nanosec accounting. This too is a very sane economic policy and is 
anti-inflationary :-)

Ingo

[*] i fundamentally hate 'fundamentally inevitable' conditions so i 
have plans to make the scheduler tick be fed from the rbtree and 
thus become a true high-resolution timer. This not only increases 
fairness (=='precision of scheduling') more, but it also decreases 
the number of timer interrupts on a running system - extending 
dynticks to sched-ticks too. Thomas and me shaped dynticks to enable 
that in an easy way: the scheduler tick is today already a high-res 
timer (but which is currently still driven via the jiffy mechanism).


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > The "give scheduler money" transaction can be both an "implicit 
> > transaction" (for example when writing to UNIX domain sockets or 
> > blocking on a pipe, etc.), or it could be an "explicit transaction": 
> > sched_yield_to(). This latter i've already implemented for CFS, but 
> > it's much less useful than the really significant implicit ones, the 
> > ones which will help X.
> 
> Yes. It would be wonderful to get it working automatically, so please 
> say something about the implementation..

i agree that the devil will be in the details, but so far it's really 
simple. I'll put all this into separate helper functions so that places 
can just use it in a natural way. The existing yield-to bit is this:

static void
yield_task_fair(struct rq *rq, struct task_struct *p, struct task_struct *p_to)
{
	struct rb_node *curr, *next, *first;
	struct task_struct *p_next;

	/*
	 * yield-to support: if we are on the same runqueue then
	 * give half of our wait_runtime (if it's positive) to the other task:
	 */
	if (p_to && p->wait_runtime > 0) {
		p->wait_runtime >>= 1;
		p_to->wait_runtime += p->wait_runtime;
	}

the above is the basic expression of: "charge a positive bank balance". 

(we obviously dont want to allow people to 'share' their loans with 
others ;), nor do we want to allow a net negative balance. CFS is really 
brutally cold-hearted, it has a strict 'no loans' policy - the easiest 
economic way to manage 'inflation', besides the basic act of not 
printing new money, ever.)

[note, due to the nanoseconds unit there's no rounding loss to worry 
about.]

that's all. No runqueue locking, no wakeup decisions even! [Note: see 
detail #1 below for cases where we need to touch the tree]. Really 
low-overhead. Accumulated 'new money' will be acted upon in the next 
schedule() call or in the next scheduler tick, whichever comes sooner. 
Note that in most cases when tasks communicate there will be a natural 
schedule() anyway, which drives this.

p->wait_runtime is also very finegrained: it is in nanoseconds, so a 
task can 'pay' at arbitrary granularity in essence, and there is in 
essence zero 'small coin overhead' and 'conversion loss' in this money 
system. (as you might remember, sharing p->timeslice had inherent 
rounding and sharing problems due to its low jiffy resolution)

detail #1: for decoupled workloads where there is no direct sleep/wake 
coupling between worker and producer, there should also be a way to 
update a task's position in the fairness tree, if it accumulates 
significant amount of new p->wait_runtime. I think this can be done by 
making this an extra field: p->new_wait_runtime, which gets picked up by 
the task if it runs, or which gets propagated into the task's tree 
position if the p->new_wait_runtime value goes above the 
sched_granularity_ns value. But it would work pretty well even without 
this, the server will take advantage of the p->new_wait_runtime 
immediately when it runs, so as long as enough clients 'feed' it with 
money, it will always have enough to keep going.
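
(a rough user-space sketch of that pick-up, with made-up names and types --
it only illustrates the lazy fold-in plus the above-granularity requeue, and
is not the actual CFS code:)

#include <stdbool.h>

struct task_sketch {
	long long wait_runtime;		/* earned right to run, in ns */
	long long new_wait_runtime;	/* donations not yet folded in */
};

#define GRANULARITY_NS	5000000LL	/* the 5 msec default mentioned above */

/* called when the task gets to run, or from the scheduler tick */
static bool fold_new_wait_runtime(struct task_sketch *p)
{
	bool requeue = p->new_wait_runtime > GRANULARITY_NS;

	p->wait_runtime += p->new_wait_runtime;
	p->new_wait_runtime = 0;

	return requeue;	/* if true, the caller repositions p in the fairness tree */
}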

detail #2: changes to p->wait_runtime are totally lockless, as long as 
they are 64-bit atomic. So the above code is a bit naive on 32-bit 
systems, but no locking is needed otherwise, other than having a stable 
reference to a task structure. (i designed CFS for 64-bit systems)

detail #3: i suspect i should rename p->wait_runtime to a more intuitive 
name - perhaps p->right_to_run? I want to avoid calling it p->timeslice 
because it's not really a timeslice, it's the thing you earned, the 
'timeslice' is a totally locally decided property that has no direct 
connection to this physical resource. I also dont want to call it 
p->cpu_credit, because it is _not_ a credit system: every positive value 
there has been earned the hard way: by 'working' for the system via 
waiting on the runqueue - scaled down to the 'expected fair runtime' - 
i.e. roughly scaled down by 1/rq->nr_running.
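
(in other words, roughly the accrual rule below -- an illustrative sketch
with invented names, not the actual accounting code:)

/* while a task waits on the runqueue, credit it its fair share of the wait */
static void account_wait(long long *wait_runtime, long long waited_ns,
			 unsigned long nr_running)
{
	if (nr_running)
		*wait_runtime += waited_ns / nr_running;
}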

detail #4: the scheduler is also a charity: when it has no other work 
left it will let tasks execute "for free" ;-) But otherwise, in any sort 
of saturated market situation CFS is very much a cold hearted 
capitalist.

about the 50% rule: it was a totally arbitrary case for yield_to(), and 
in other cases it should rather be: "give me _all_ the money you have, 
i'll make it work for you as much as i can". And the receiver should 
also perhaps record the amount of 'money' it got from the client, and 
_give back_ any unused proportion of it. (only where easily doable, in 
1:1 task relationships) I.e.:

p_to->wait_runtime = p->wait_runtime;
p->wait_runtime = 0;

schedule();

the former two lines put into a sched_pay(p) API perhaps?
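
(spelled out, such a helper might look like the sketch below -- sched_pay()
is only the name suggested above, and the guard against handing over a
negative balance is an added assumption:)

struct task_sketch {
	long long wait_runtime;		/* in ns */
};

/* hand the caller's whole positive balance to 'to' */
static void sched_pay(struct task_sketch *from, struct task_sketch *to)
{
	if (from->wait_runtime > 0) {
		to->wait_runtime += from->wait_runtime;
		from->wait_runtime = 0;
	}
}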

> The "perfect" situation would be that when somebody goes to sleep, any 
> extra points it had could be given to whoever it woke up last. Note 

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Linus Torvalds


On Mon, 23 Apr 2007, Ingo Molnar wrote:
> 
> The "give scheduler money" transaction can be both an "implicit 
> transaction" (for example when writing to UNIX domain sockets or 
> blocking on a pipe, etc.), or it could be an "explicit transaction": 
> sched_yield_to(). This latter i've already implemented for CFS, but it's 
> much less useful than the really significant implicit ones, the ones 
> which will help X.

Yes. It would be wonderful to get it working automatically, so please say 
something about the implementation..

The "perfect" situation would be that when somebody goes to sleep, any 
extra points it had could be given to whoever it woke up last. Note that 
for something like X, it means that the points are 100% ephemeral: it gets 
points when a client sends it a request, but it would *lose* the points 
again when it sends the reply!

So it would only accumulate "scheduling points" while multiple clients 
are actively waiting for it, which actually sounds like exactly the right 
thing. However, I don't really see how to do it well, especially since the 
kernel cannot actually match up the client that gave some scheduling 
points to the reply that X sends back.

There are subtle semantics with these kinds of things: especially if the 
scheduling points are only awarded when a process goes to sleep, if X is 
busy and continues to use the CPU (for another client), it wouldn't give 
any scheduling points back to clients and they really do accumulate with 
the server. Which again sounds like it would be exactly the right thing 
(both in the sense that the server that runs more gets more points, but 
also in the sense that we *only* give points at actual scheduling events).

But how do you actually *give/track* points? A simple "last woken up by 
this process" thing that triggers when it goes to sleep? It might work, 
but on the other hand, especially with more complex things (and networking 
tends to be pretty complex) the actual wakeup may be done by a software 
irq. Do we just say "it ran within the context of X, so we assume X was 
the one that caused it?" It probably would work, but we've generally tried 
very hard to avoid accessing "current" from interrupt context, including 
bh's..

Linus


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Willy Tarreau
Hi !

On Mon, Apr 23, 2007 at 09:11:43PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > but the point I'm trying to make is that X shouldn't get more CPU-time 
> > because it's "more important" (it's not: and as noted earlier, 
> > thinking that it's more important skews the problem and makes for too 
> > *much* scheduling). X should get more CPU time simply because it 
> > should get it's "fair CPU share" relative to the *sum* of the clients, 
> > not relative to any client individually.
> 
> yeah. And this is not a pipe dream and i think it does not need a 
> 'wakeup matrix' or other complexities.
> 
> I am --->.< this close to being able to do this very robustly under 
> CFS via simple rules of economy and trade: there the p->wait_runtime 
> metric is intentionally a "physical resource" of "hard-earned right to 
> execute on the CPU, by having waited on it" the sum of which is bound 
> for the whole system.
>
> So while with other, heuristic approaches we always had the problem of 
> creating a "hyper-inflation" of an uneconomic virtual currency that 
> could be freely printed by certain tasks, in CFS the economy of this is 
> strict and the finegrained plus/minus balance is strictly managed by a 
> conservative and independent central bank.
> 
> So we can actually let tasks "trade" in these very physical units of 
> "right to execute on the CPU". A task giving it to another task means 
> that this task _already gave up CPU time in the past_. So it's the 
> robust equivalent of an economy's "money earned" concept, and this 
> "money"'s distribution (and redistribution) is totally fair and totally 
> balanced and is not prone to "inflation".
> 
> The "give scheduler money" transaction can be both an "implicit 
> transaction" (for example when writing to UNIX domain sockets or 
> blocking on a pipe, etc.), or it could be an "explicit transaction": 
> sched_yield_to(). This latter i've already implemented for CFS, but it's 
> much less useful than the really significant implicit ones, the ones 
> which will help X.

I don't think that a task should _give_ its slice to the task it's waiting
on, but it should _lend_ it : if the second task (the server, eg: X) does
not eat everything, maybe the first one will need to use the remains.

We had a good example with glxgears. Glxgears may need more CPU than X on
some machines, less on others. But it needs CPU. So it must not give all
it has to X otherwise it will stop. But it can tell X "hey, if you need
some CPU, I have some here, help yourself". When X has exhausted its slice,
it can then use some from the client. Hmmm no, better, X first serves itself
in the client's share, and may then use (parts of) its own if it needs more.

This would be seen as some CPU resource buckets. Indeed, it's even a
problem of economy as you said. If you want someone to do something for you,
either it's very quick and simple and he can do it for free once in a while,
or you take all of his time and you have to pay him for this time.

Of course, you don't always know whom X is working for, and this will cause
X to sometimes run for one task on another one's resources. But as long as
the work is done, it's OK. Hey, after all, many of us sometimes work for
customers in real life and take some of their time to work on the kernel
and everyone is happy with it.

I think that if we could have a (small) list of CPU buckets per task, it
would permit us to do such a thing. We would then have to ensure that
pipes or unix sockets correctly present their buckets to their servers.
If we consider that each task only has its own bucket and can lend it to
one and only one server at a time, it should not look too horrible.

Basically, just something like this (thinking while typing) :

struct task_struct {
  ...
  struct {
     struct list_head list;   /* links this bucket into the server's list */
     int money_left;          /* lent CPU time not yet consumed */
  } cpu_bucket;
  ...
};

Then, waking up another process would consist in linking our bucket into
its own bucket list. The server can identify the task it's borrowing
from by looking at which task_struct the list belongs to.
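
(as a sketch of that lookup, something container_of()-style would do; the
code below is illustrative user-space C with invented names -- in the kernel
the list node would be a struct list_head -- and is not a proposed patch:)

#include <stddef.h>

struct list_node { struct list_node *next, *prev; };

struct task_sketch {
	struct {
		struct list_node list;	/* linked into the server's borrowed list */
		int money_left;		/* lent CPU time still unused */
	} cpu_bucket;
};

/* given a node on the server's borrowed-bucket list, find the lender */
static struct task_sketch *bucket_owner(struct list_node *node)
{
	return (struct task_sketch *)
		((char *)node - offsetof(struct task_sketch, cpu_bucket.list));
}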

Also, it creates some inheritance between processes. When doing such a
thing :


 $ fgrep DST=1.2.3.4 fw.log | sed -e 's/1.2/A.B/' | gzip -c3 >fw-anon.gz

Then fgrep would lend some CPU to sed which in turn would present them both
to gzip. Maybe we need two lists in order for the structures to be unstacked 
upon gzip's sleep() :-/

I don't know if I'm clear enough.

Cheers,
Willy



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> but the point I'm trying to make is that X shouldn't get more CPU-time 
> because it's "more important" (it's not: and as noted earlier, 
> thinking that it's more important skews the problem and makes for too 
> *much* scheduling). X should get more CPU time simply because it 
> should get it's "fair CPU share" relative to the *sum* of the clients, 
> not relative to any client individually.

yeah. And this is not a pipe dream and i think it does not need a 
'wakeup matrix' or other complexities.

I am --->.< this close to being able to do this very robustly under 
CFS via simple rules of economy and trade: there the p->wait_runtime 
metric is intentionally a "physical resource" of "hard-earned right to 
execute on the CPU, by having waited on it" the sum of which is bound 
for the whole system.

So while with other, heuristic approaches we always had the problem of 
creating a "hyper-inflation" of an uneconomic virtual currency that 
could be freely printed by certain tasks, in CFS the economy of this is 
strict and the finegrained plus/minus balance is strictly managed by a 
conservative and independent central bank.

So we can actually let tasks "trade" in these very physical units of 
"right to execute on the CPU". A task giving it to another task means 
that this task _already gave up CPU time in the past_. So it's the 
robust equivalent of an economy's "money earned" concept, and this 
"money"'s distribution (and redistribution) is totally fair and totally 
balanced and is not prone to "inflation".

The "give scheduler money" transaction can be both an "implicit 
transaction" (for example when writing to UNIX domain sockets or 
blocking on a pipe, etc.), or it could be an "explicit transaction": 
sched_yield_to(). This latter i've already implemented for CFS, but it's 
much less useful than the really significant implicit ones, the ones 
which will help X.

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-23 Thread Linus Torvalds


On Mon, 23 Apr 2007, Nick Piggin wrote:
> > If you have a single client, the X server is *not* more important than the 
> > client, and indeed, renicing the X server causes bad patterns: just 
> > because the client sends a request does not mean that the X server should 
> > immediately be given the CPU as being "more important". 
> 
> If the client is doing some processing, and the user moves the mouse, it
> feels much more interactive if the pointer moves rather than waits for
> the client to finish processing.

.. yes. However, that should be automatically true if the X process just 
has "enough CPU time" to merit being scheduled to.

Which it normally should always have, exactly because it's an 
"interactive" process (regardless of how the scheduler is done - any 
scheduler should always give sleepers good latency. The current one 
obviously does it by giving interactivity-bonuses, CFS does it by trying 
to be fair in giving out CPU time).

The problem tends to be the following scenario:

 - the X server is very CPU-busy, because it has lots of clients 
   connecting to it, and it's not getting any "bonus" for doing work for 
   those clients (ie it uses up its time-slice and thus becomes "less 
   important" than other processes, since it's already gotten its "fair" 
   slice of CPU - never mind that it was really unfair to not give it 
   more)

 - there is some process that is *not* dependent on X, that can (and does) 
   run, because X has spent its CPU time serving others.

but the point I'm trying to make is that X shouldn't get more CPU-time 
because it's "more important" (it's not: and as noted earlier, thinking 
that it's more important skews the problem and makes for too *much* 
scheduling). X should get more CPU time simply because it should get it's 
"fair CPU share" relative to the *sum* of the clients, not relative to any 
client individually.

Once you actually do give the X server "fair share" of the CPU, I'm sure
that you can still get into bad situations (trivial example: make clients 
that on purpose do X requests that are expensive for the server, but are 
cheap to generate). But it's likely not going to be an issue in practice 
any more.

Scheduling is not something you can do "perfectly". There's no point in 
even trying. To do "perfect" scheduling, you'd have to have ESP and know 
exactly what the user expects and know the future too. What you should aim 
for is the "obvious cases".

And I don't think anybody really disputes the fact that a process that 
does work for other processes "obviously" should get the CPU time skewed 
towards it (and away from the clients - not from non-clients!). I think 
the only real issue is that nobody really knows how to do it well (or at 
all).

I think the "schedule by user" would be reasonable in practice - not 
perfect by any means, but it *does* fall into the same class of issues: 
users are not in general "more important" than other users, but they 
should be treated fairly across the user, not on a per-process basis.
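
(one toy way to picture "schedule by user" is a two-level pick: be fair
across users first, then across that user's tasks. The sketch below is
purely illustrative, with invented names, and does not describe any
existing scheduler:)

struct utask  { long long wait_runtime; };
struct uclass { long long wait_runtime; struct utask *tasks; int ntasks; };

/* pick the most "deserving" user, then that user's most deserving task */
static struct utask *pick_next(struct uclass *users, int nusers)
{
	struct uclass *u = NULL;
	struct utask *t = NULL;
	int i;

	for (i = 0; i < nusers; i++)		/* fairness across users */
		if (users[i].ntasks &&
		    (!u || users[i].wait_runtime > u->wait_runtime))
			u = &users[i];

	for (i = 0; u && i < u->ntasks; i++)	/* fairness within that user */
		if (!t || u->tasks[i].wait_runtime > t->wait_runtime)
			t = &u->tasks[i];

	return t;
}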

Linus


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Nick Piggin
On Sun, Apr 22, 2007 at 04:24:47PM -0700, Linus Torvalds wrote:
> 
> 
> On Sun, 22 Apr 2007, Juliusz Chroboczek wrote:
> > 
> > Why not do it in the X server itself?  This will avoid controversial
> > policy in the kernel, and have the added advantage of working with
> > X servers that don't directly access hardware.
> 
> It's wrong *wherever* you do it.
> 
> The X server should not be re-niced. It was done in the past, and it was 
> wrong then (and caused problems - we had to tell people to undo it, 
> because some distros had started doing it by default).

The 2.6 scheduler can get very bad latency problems with the X server
reniced.


> If you have a single client, the X server is *not* more important than the 
> client, and indeed, renicing the X server causes bad patterns: just 
> because the client sends a request does not mean that the X server should 
> immediately be given the CPU as being "more important". 

If the client is doing some processing, and the user moves the mouse, it
feels much more interactive if the pointer moves rather than waits for
the client to finish processing.


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Rusty Russell
On Sun, 2007-04-22 at 09:16 -0700, Ulrich Drepper wrote:
> On 4/22/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
> > On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> > > For futex(), the extension is needed for the FUTEX_WAIT operation.  We
> > > need a new operation FUTEX_WAIT_FOR or so which takes another (the
> > > fourth) parameter which is the PID of the target.
> > > For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
> > > the current owner.  This is required for the whole interface to work
> > > in the first place.
> >
> > We'll have to send things out and see what sticks here. There seems to
> > be some pickiness above.
> 
> I know Rusty will shudder since it makes futexes yet more complicated
> (although only if the user wants it) but if you introduce the concept
> of "yield to" then this extension makes really sense and it is a quite
> simple extension.  Plus: I'm the most affected by the change since I
> have to change code to use it and I'm fine with it.

Hi Uli,

I wouldn't worry: futexes long ago jumped the shark.

I think it was inevitable that once we started endorsing programs
bypassing the kernel for IPC that we'd want some form of yield_to().
And yield_to(p) has much more sane semantics than yield().

Cheers,
Rusty.




Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Linus Torvalds


On Sun, 22 Apr 2007, Juliusz Chroboczek wrote:
> 
> Why not do it in the X server itself?  This will avoid controversial
> policy in the kernel, and have the added advantage of working with
> X servers that don't directly access hardware.

It's wrong *wherever* you do it.

The X server should not be re-niced. It was done in the past, and it was 
wrong then (and caused problems - we had to tell people to undo it, 
because some distros had started doing it by default).

If you have a single client, the X server is *not* more important than the 
client, and indeed, renicing the X server causes bad patterns: just 
because the client sends a request does not mean that the X server should 
immediately be given the CPU as being "more important". 

In other words, the things that make it important that the X server _can_ 
get CPU time if needed are all totally different from the X server being 
"more important". The X server is more important only in the presense of 
multiple clients, not on its own! Needing to renice it is a hack for a bad 
scheduler, and shows that somebody doesn't understand the problem!

Linus


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Juliusz Chroboczek
> Oh I definitely was not advocating against renicing X,

Why not do it in the X server itself?  This will avoid controversial
policy in the kernel, and have the added advantage of working with
X servers that don't directly access hardware.

Con, if you tell me ``if you're running under Linux and such and such
/sys variable has value so-and-so, then it's definitely a good idea to
call nice(42) at the X server's start up'', then I'll commit it into
X.Org.  (Please CC both me and the list, so I can point any people
complaining to the archives.)

Juliusz


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Ulrich Drepper

On 4/22/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> For futex(), the extension is needed for the FUTEX_WAIT operation.  We
> need a new operation FUTEX_WAIT_FOR or so which takes another (the
> fourth) parameter which is the PID of the target.
> For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
> the current owner.  This is required for the whole interface to work
> in the first place.

We'll have to send things out and see what sticks here. There seems to
be some pickiness above.


I know Rusty will shudder since it makes futexes yet more complicated
(although only if the user wants it) but if you introduce the concept
of "yield to" then this extension makes really sense and it is a quite
simple extension.  Plus: I'm the most affected by the change since I
have to change code to use it and I'm fine with it.

Oh, last time I didn't explicitly mention the cases of
waitpid()/wait4()/waitid() naming a specific process to wait on.  I
think it's clear that those cases also should be changed to use yield-to
if possible.  I don't have a good suggestion what to do when the
call waits for any child.  Perhaps yielding to the last created one is
fine.  If delays through reading on a pipe are recognized as well and
handled with yield-to, then the time slot will automatically be
forwarded to the first runnable process in the pipe sequence.  I.e.,
running

   grep foo /etc/passwd | cut -d: -f2 | crack

probably will create 'crack' last.  Giving the remainder of the time
slot should result is recognizing it waits for 'cut' which in turn
waits for 'grep'.  So in the end 'grep' gets the timeslot.  Seems
quite complicated from the outside but I can imagine quite good
results from this.
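
(to make the futex side of this concrete: a user-level lock could pass the
owner's PID along with the wait, roughly as below. FUTEX_WAIT_FOR is only
the proposal discussed in this thread -- the opcode value and the argument
slot used for the PID are assumptions for illustration, not a real kernel
interface:)

#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define FUTEX_WAIT_FOR	128	/* hypothetical opcode, not in any kernel */

/* wait while *uaddr == expected, donating the remaining timeslice to 'owner' */
static long futex_wait_for(uint32_t *uaddr, uint32_t expected, pid_t owner)
{
	return syscall(SYS_futex, uaddr, FUTEX_WAIT_FOR, expected,
		       (unsigned long)owner, (void *)0, 0);
}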


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Ingo Molnar

* Mark Lord <[EMAIL PROTECTED]> wrote:

> > i've not experienced a 'runaway X' personally, at most it would 
> > crash or lock up ;) The value is boot-time and sysctl configurable 
> > as well back to 0.
> 
> Mmmm.. I've had to kill off the odd X that was locking in 100% CPU 
> usage. In the past, this has happened maybe 1-3 times a year or so on 
> my notebook.
> 
> Now mind you, that usage could have been due to some client process, 
> but X is where the 100% showed up, so X is what I nuked.

well, i just simulated a runaway X at nice -19 on CFS (on a UP box), and 
while the box was a tad laggy, i was able to killall it without 
problems, within 2 seconds that also included a 'su'. So it's not an 
issue in CFS, it can be turned off, and because every distro has another 
way to renice Xorg, this is a convenience hack until Xorg standardizes 
it into some xorg.conf field. (It also makes sure that X isn't preempted 
by other userspace stuff while it does timing-sensitive operations like 
setting the video modes up or switching video modes, etc.)

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Mark Lord

Ingo Molnar wrote:


well, i just simulated a runaway X at nice -19 on CFS (on a UP box), and 
while the box was a tad laggy, i was able to killall it without 
problems, within 2 seconds that also included a 'su'. So it's not an 
issue in CFS, it can be turned off, and because every distro has another 
way to renice Xorg, this is a convenience hack until Xorg standardizes 
it into some xorg.conf field. (It also makes sure that X isn't preempted 
by other userspace stuff while it does timing-sensitive operations like 
setting the video modes up or switching video modes, etc.)


Good!


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Mark Lord

Ingo Molnar wrote:

* Jan Engelhardt <[EMAIL PROTECTED]> wrote:

i've attached it below in a standalone form, feel free to put it 
into SD! :)
Assume X went crazy (lacking any statistics, I make the unproven 
statement that this happens more often than kthreads going berserk), 
then having it niced with minus something is not too nice.


i've not experienced a 'runaway X' personally, at most it would crash or 
lock up ;) The value is boot-time and sysctl configurable as well back 
to 0.




Mmmm.. I've had to kill off the odd X that was locking in 100% CPU usage.
In the past, this has happened maybe 1-3 times a year or so on my notebook.

Now mind you, that usage could have been due to some client process,
but X is where the 100% showed up, so X is what I nuked.

Cheers


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Mark Lord

Con Kolivas wrote:


Oh I definitely was not advocating against renicing X, I just suspect that 
virtually all the users who gave glowing reports to CFS comparing it to SD 
had no idea it had reniced X to -19 behind their back and that they were 
comparing it to SD running X at nice 0.


I really do wish I wouldn't feel the need to keep stepping in here
to manually exclude my own results from such wide brush strokes.

I'm one of those "users", and I've never even tried CFS v4 (yet).
All prior versions did NOT do the renicing.

The renicing was in the CFS v4 announcement, right up front for all to see,
and the code for it has been posted separately with encouragement for RSDL
or whatever to also adopt it.

Now, with it in all of the various "me-too" schedulers,
maybe they'll all start to shine a little more on real users' systems.
So far, the stock 2.6.20 scheduler remains my own current preference,
despite really good results with CFS v1.

Cheers



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Gene Heskett
On Sunday 22 April 2007, William Lee Irwin III wrote:
>On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
>> CFS-v4 is quite smooth in terms of the user's experience but after
>> prolonged observations approaching 24 hours, it appears to choke the cpu
>> hog off a bit even when the system has nothing else to do.  My amanda runs
>> went from 1 to 1.5 hours depending on how much time it took gzip to handle
>> the amount of data tar handed it, up to about 165m & change, or nearly 3
>> hours pretty consistently over 5 runs.
>
>Welcome to infinite history. I'm not surprised, apart from the time
>scale of anomalies being much larger than I anticipated.

[...]

>Pardon my saying so but you appear to be describing anomalous behavior
>in terms of "scheduler warmups."

Well, that was what I saw, it took gzip about 4 or 5 minutes to get to the 
first 90% hit in htop's display, and it first hit the top of the display with 
only 5%.  And the next backup run took about 2h:21m, so we're back in the 
ballpark.  I'd reset amanda's schedule for a faster dumpcycle too, along with 
giving the old girl a new drive, all about the time we started playing with 
this, so the times I'm recording now may well be nominal.  I suppose I should 
boot a plain 2.6.21-rc7 and make a run & time that, but I don't enjoy 
masochism THAT much. :)

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I've enjoyed just about as much of this as I can stand.


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread William Lee Irwin III
On 4/22/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:
>> I'm just looking for what people want the API to be here. With that in
>> hand we can just go out and do whatever needs to be done.

On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> I think a sched_yield_to is one interface:
>   int sched_yield_to(pid_t);

All clear on that front.


On Sun, Apr 22, 2007 at 12:17:31AM -0700, Ulrich Drepper wrote:
> For futex(), the extension is needed for the FUTEX_WAIT operation.  We
> need a new operation FUTEX_WAIT_FOR or so which takes another (the
> fourth) parameter which is the PID of the target.
> For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
> the current owner.  This is required for the whole interface to work
> in the first place.

We'll have to send things out and see what sticks here. There seems to
be some pickiness above.


-- wli


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread William Lee Irwin III
On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> CFS-v4 is quite smooth in terms of the user's experience but after prolonged 
> observations approaching 24 hours, it appears to choke the cpu hog off a bit 
> even when the system has nothing else to do.  My amanda runs went from 1 to 
> 1.5 hours depending on how much time it took gzip to handle the amount of 
> data tar handed it, up to about 165m & change, or nearly 3 hours pretty 
> consistently over 5 runs.

Welcome to infinite history. I'm not surprised, apart from the time
scale of anomalies being much larger than I anticipated.


On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> sd-0.44 so far seems to be handling the same load (there's a backup running 
> right now) fairly well also, and possibly there's a bit more snap to the 
> system now.  A switch to screen 1 from this screen 8, and the loading of that 
> screen image, which is the Cassini shot of saturn from the backside, the one 
> showing that teeny dot to the left of Saturn that is actually us, took 10 
> seconds with the stock 2.6.21-rc7, 3 seconds with the best of Ingo's patches, 
> and now with Con's latest, is 1 second flat. Another screen however is 4 
> seconds, so maybe that first screen had been looked at since I rebooted. 
> However, amanda is still getting estimates so gzip hasn't put a tiewrap 
> around the kernel's neck just yet.

Not sure what you mean by gzip putting a tiewrap around the kernel's neck.
Could you clarify?


On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> Some minutes later, gzip is smunching /usr/src, and the machine doesn't even 
> know it's running as sd-0.44 isn't giving gzip more than 75% to gzip, and 
> probably averaging less than 50%. And it scared me a bit as it started out at 
> not over 5% for the first minute or so.  Running in the 70's now according to 
> gkrellm, with an occasional blip to 95%.  And the machine generally feels 
> good.

I wonder what's behind that sort of initial and steady-state behavior.


On Sat, Apr 21, 2007 at 02:17:02PM -0400, Gene Heskett wrote:
> I had previously given CFS-v4 a 95 score but that was before I saw
> the general slowdown, and I believe my first impression of this one
> is also a 95.  This on a scale of the best one of the earlier CFS
> patches being 100, and stock 2.6.21-rc7 gets a 0.0.  This scheduler
> seems to be giving gzip ever more cpu as time progresses, and the cpu
> is warming up quite nicely, from about 132F idling to 149.9F now.
> And my keyboard is still alive and well.
> Generally speaking, Con, I believe this one is also a keeper.  And we'll see 
> how long a backup run takes.

Pardon my saying so but you appear to be describing anomalous behavior
in terms of "scheduler warmups."


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread Ulrich Drepper

On 4/22/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

I'm just looking for what people want the API to be here. With that in
hand we can just go out and do whatever needs to be done.


I think a sched_yield_to is one interface:

  int sched_yield_to(pid_t);

For futex(), the extension is needed for the FUTEX_WAIT operation.  We
need a new operation FUTEX_WAIT_FOR or so which takes another (the
fourth) parameter which is the PID of the target.

For FUTEX_LOCK_PI we need no extension.  The futex value is the PID of
the current owner.  This is required for the whole interface to work
in the first place.
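
None of this exists yet, so purely as an illustration of the shape of the
proposed userspace interface (the FUTEX_WAIT_FOR opcode value, the argument
slot used for the owner PID, and the ENOSYS stub below are assumptions made
for this sketch, not anything merged):

#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical new futex op: like FUTEX_WAIT, but also names the thread
 * that currently owns the futex so the kernel could hand it the waiter's
 * remaining CPU time. The value 16 is a placeholder. */
#define FUTEX_WAIT_FOR 16

/* Proposed, not implemented: directed yield to a specific task. */
static int sched_yield_to(pid_t target)
{
        (void)target;           /* no syscall number allocated yet */
        errno = ENOSYS;
        return -1;
}

/* Proposed FUTEX_WAIT extension: here the owner TID rides in the last
 * syscall argument, purely to illustrate the idea. */
static int futex_wait_for(int *uaddr, int val, pid_t owner)
{
        return syscall(SYS_futex, uaddr, FUTEX_WAIT_FOR, val,
                       (void *)0 /* timeout */, (void *)0, (int)owner);
}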


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-22 Thread William Lee Irwin III
On 4/21/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:
>> And how the hell do you imagine you'd even *know* what thread holds the
>> futex?

On Sat, Apr 21, 2007 at 06:46:58PM -0700, Ulrich Drepper wrote:
> We know this in most cases.  This is information recorded, for
> instance, in the mutex data structure.  You might have missed my "the
> interface must be extended" part.  This means the PID of the owning
> thread will have to be passed down.  For PI mutexes this is not
> necessary since the kernel already has access to the information.

I'm just looking for what people want the API to be here. With that in
hand we can just go out and do whatever needs to be done.


-- wli


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Mike Galbraith
On Sun, 2007-04-22 at 10:08 +1000, Con Kolivas wrote:
> On Sunday 22 April 2007 08:54, Denis Vlasenko wrote:
> > On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> > > correct. Note that Willy reniced X back to 0 so it had no relevance on
> > > his test. Also note that i pointed this change out in the -v4 CFS
> > >
> > > announcement:
> > > || Changes since -v3:
> > > ||
> > > ||  - usability fix: automatic renicing of kernel threads such as
> > > ||keventd, OOM tasks and tasks doing privileged hardware access
> > > ||(such as Xorg).
> > >
> > > i've attached it below in a standalone form, feel free to put it into
> > > SD! :)
> >
> > But X problems have nothing to do with "privileged hardware access".
> > X problems are related to priority inversions between server and client
> > processes, and "one server process - many client processes" case.
> 
> It's not a privileged hardware access reason that this code is there. This is 
> obfuscation/advertising to make it look like there is a valid reason for X 
> getting negative nice levels somehow in the kernel to make interactive 
> testing of CFS better by default.

That's not a very nice thing to say, and it has no benefit unless you
specifically want to run multiple heavy X hitting clients.

I boot with that feature disabled specifically to be able to measure
fairness in a pure environment, and it's still _much_ smoother and
snappier than any RSDL/SD kernel I ever tried.

-Mike



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Al Boldi
Con Kolivas wrote:
> On Sunday 22 April 2007 02:00, Ingo Molnar wrote:
> > * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > > >   Feels even better, mouse movements are very smooth even under high
> > > >   load. I noticed that X gets reniced to -19 with this scheduler.
> > > >   I've not looked at the code yet but this looked suspicious to me.
> > > >   I've reniced it to 0 and it did not change any behaviour. Still
> > > >   very good.
> > >
> > > Looks like this code does it:
> > >
> > > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
> >
> > correct.
>
> Oh I definitely was not advocating against renicing X, I just suspect that
> virtually all the users who gave glowing reports to CFS comparing it to SD
> had no idea it had reniced X to -19 behind their back and that they were
> comparing it to SD running X at nice 0. I think had they been comparing
> CFS with X nice -19 to SD running nice -10 in this interactivity soft and
> squishy comparison land their thoughts might have been different. I missed
> it in the announcement and had to go looking in the code since Willy just
> kinda tripped over it unwittingly as well.

I tried this with the vesa driver of X, and reflect from the mesa-demos 
heavily starves new window creation on cfs-v4 with X niced -19.  X reniced 
to 0 removes these starves.  On SD, X reniced to -10 works great.


Thanks!

--
Al



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Gene Heskett
On Saturday 21 April 2007, Con Kolivas wrote:
>On Sunday 22 April 2007 04:17, Gene Heskett wrote:
>> More first impressions of sd-0.44 vs CFS-v4
>
>Thanks Gene.
>
>> CFS-v4 is quite smooth in terms of the user's experience but after
>> prolonged observations approaching 24 hours, it appears to choke the cpu
>> hog off a bit even when the system has nothing else to do.  My amanda runs
>> went from 1 to 1.5 hours depending on how much time it took gzip to handle
>> the amount of data tar handed it, up to about 165m & change, or nearly 3
>> hours pretty consistently over 5 runs.
>>
>> sd-0.44 so far seems to be handling the same load (there's a backup running
>> right now) fairly well also, and possibly there's a bit more snap to the
>> system now.  A switch to screen 1 from this screen 8, and the loading of
>> that screen image, which is the Cassini shot of saturn from the backside,
>> the one showing that teeny dot to the left of Saturn that is actually us,
>> took 10 seconds with the stock 2.6.21-rc7, 3 seconds with the best of
>> Ingo's patches, and now with Con's latest, is 1 second flat. Another
>> screen however is 4 seconds, so maybe that first screen had been looked at
>> since I rebooted. However, amanda is still getting estimates so gzip
>> hasn't put a tiewrap around the kernel's neck just yet.
>>
>> Some minutes later, gzip is smunching /usr/src, and the machine doesn't
>> even know it's running as sd-0.44 isn't giving gzip more than 75% to gzip,
>> and probably averaging less than 50%. And it scared me a bit as it started
>> out at not over 5% for the first minute or so.  Running in the 70's now
>> according to gkrellm, with an occasional blip to 95%.  And the machine
>> generally feels good.
>>
>> I had previously given CFS-v4 a 95 score but that was before I saw the
>> general slowdown, and I believe my first impression of this one is also a
>> 95.  This on a scale of the best one of the earlier CFS patches being 100,
>> and stock 2.6.21-rc7 gets a 0.0.  This scheduler seems to be giving gzip
>> ever more cpu as time progresses, and the cpu is warming up quite nicely,
>> from about 132F idling to 149.9F now.  And my keyboard is still alive and
>> well.
>
>I'm not sure how much weight to put on what you see as the measured cpu
> usage. I have a feeling it's being wrongly reported in SD currently.
> Concentrate more on the actual progress and behaviour of things as you've
> already done.
>
>> Generally speaking, Con, I believe this one is also a keeper.  And we'll
>> see how long a backup run takes.

It looks as if it could have been 10 minutes quicker according to amplot, but 
that's entirely within the expected variations that amanda's scheduler might 
do to it.  But the one that just finished, running under CFS-v5 was only 
1h:47m, not including the verify run.  The previous backup using sd-0.44, 
took 2h:28m for a similar but not identical operation according to amplot.  
That's a big enough diff to be an indicator I believe, but without knowing 
how much of that time was burned by gzip, its an apples and oranges compare.  
We'll see if it repeats, I coded 'catchup' to do 2 in a row.

>Great thanks for feedback.

You're quite welcome, Con.

ATM I'm doing the same thing again but booted to a CFS-v5 delta that Ingo sent 
me privately,  and except for the kmail lag/freezes everything is cool except 
the cpu, it managed to hit 150.7F during the height of one of the gzip -best 
smunching operations.  I believe the /dev/hdd writes are cranked well up from 
the earlier CFS patches also.  Unforch, this isn't something that's been 
coded into amplot, so I'm stuck watching the hdd display in gkrellm and 
making SWAG's.  And we all know what they are worth.  I've made a lot of them 
in my 72 years, and my track record, with some glaring exceptions like my 2nd 
wife that I won't bore you with the details of, has been fairly decent. :)

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
You will be run over by a beer truck.


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Con Kolivas
On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> 2) SD-0.44
>
>Feels good, but becomes jerky at moderately high loads. I've started
>64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
>always responds correctly but under X, mouse jumps quite a bit and
>typing in xterm or even text console feels slightly jerky. The CPU is
>not completely used, and the load varies a lot (see below). However,
>the load is shared equally between all 64 ocbench, and they do not
>deviate even after 4000 iterations. X uses less than 1% CPU during
>those tests.

Found it. I broke SMP balancing again so there is serious scope for 
improvement on SMP hardware. That explains the huge load variations. Expect 
yet another fix soon, which should improve behaviour further :)

-- 
-ck


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ulrich Drepper

On 4/21/07, Linus Torvalds <[EMAIL PROTECTED]> wrote:

And how the hell do you imagine you'd even *know* what thread holds the
futex?


We know this in most cases.  This is information recorded, for
instance, in the mutex data structure.  You might have missed my "the
interface must be extended" part.  This means the PID of the owning
thread will have to be passed down.  For PI mutexes this is not
necessary since the kernel already has access to the information.



The whole point of the "f" part of the mutex is that it's fast, and we
never see the non-contended case in the kernel.


See above.  Believe me, I know how futexes work.  But I also know what
additional information we collect.  For mutexes and in part for
rwlocks we know which thread owns the sync object.  In that case we
can easily provide the kernel with the information.


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Con Kolivas
On Sunday 22 April 2007 04:17, Gene Heskett wrote:
> More first impressions of sd-0.44 vs CFS-v4

Thanks Gene.
>
> CFS-v4 is quite smooth in terms of the user's experience but after prolonged
> observations approaching 24 hours, it appears to choke the cpu hog off a
> bit even when the system has nothing else to do.  My amanda runs went from
> 1 to 1.5 hours depending on how much time it took gzip to handle the amount
> of data tar handed it, up to about 165m & change, or nearly 3 hours pretty
> consistently over 5 runs.
>
> sd-0.44 so far seems to be handling the same load (there's a backup running
> right now) fairly well also, and possibly there's a bit more snap to the
> system now.  A switch to screen 1 from this screen 8, and the loading of
> that screen image, which is the Cassini shot of saturn from the backside,
> the one showing that teeny dot to the left of Saturn that is actually us,
> took 10 seconds with the stock 2.6.21-rc7, 3 seconds with the best of
> Ingo's patches, and now with Con's latest, is 1 second flat. Another screen
> however is 4 seconds, so maybe that first screen had been looked at since I
> rebooted. However, amanda is still getting estimates so gzip hasn't put a
> tiewrap around the kernel's neck just yet.
>
> Some minutes later, gzip is smunching /usr/src, and the machine doesn't
> even know it's running as sd-0.44 isn't giving gzip more than 75% to gzip,
> and probably averaging less than 50%. And it scared me a bit as it started
> out at not over 5% for the first minute or so.  Running in the 70's now
> according to gkrellm, with an occasional blip to 95%.  And the machine
> generally feels good.
>
> I had previously given CFS-v4 a 95 score but that was before I saw the
> general slowdown, and I believe my first impression of this one is also a
> 95.  This on a scale of the best one of the earlier CFS patches being 100,
> and stock 2.6.21-rc7 gets a 0.0.  This scheduler seems to be giving gzip
> ever more cpu as time progresses, and the cpu is warming up quite nicely,
> from about 132F idling to 149.9F now.  And my keyboard is still alive and
> well.

I'm not sure how much weight to put on what you see as the measured cpu usage. 
I have a feeling it's being wrongly reported in SD currently. Concentrate 
more on the actual progress and behaviour of things as you've already done.

> Generally speaking, Con, I believe this one is also a keeper.  And we'll
> see how long a backup run takes.

Great thanks for feedback.

-- 
-ck


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Con Kolivas
On Sunday 22 April 2007 08:54, Denis Vlasenko wrote:
> On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> > correct. Note that Willy reniced X back to 0 so it had no relevance on
> > his test. Also note that i pointed this change out in the -v4 CFS
> >
> > announcement:
> > || Changes since -v3:
> > ||
> > ||  - usability fix: automatic renicing of kernel threads such as
> > ||keventd, OOM tasks and tasks doing privileged hardware access
> > ||(such as Xorg).
> >
> > i've attached it below in a standalone form, feel free to put it into
> > SD! :)
>
> But X problems have nothing to do with "privileged hardware access".
> X problems are related to priority inversions between server and client
> processes, and "one server process - many client processes" case.

It's not a privileged hardware access reason that this code is there. This is 
obfuscation/advertising to make it look like there is a valid reason for X 
getting negative nice levels somehow in the kernel to make interactive 
testing of CFS better by default.

-- 
-ck


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Con Kolivas
On Sunday 22 April 2007 02:00, Ingo Molnar wrote:
> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> > >   Feels even better, mouse movements are very smooth even under high
> > >   load. I noticed that X gets reniced to -19 with this scheduler.
> > >   I've not looked at the code yet but this looked suspicious to me.
> > >   I've reniced it to 0 and it did not change any behaviour. Still
> > >   very good.
> >
> > Looks like this code does it:
> >
> > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
>
> correct. 

Oh I definitely was not advocating against renicing X, I just suspect that 
virtually all the users who gave glowing reports to CFS comparing it to SD 
had no idea it had reniced X to -19 behind their back and that they were 
comparing it to SD running X at nice 0. I think had they been comparing CFS 
with X nice -19 to SD running nice -10 in this interactivity soft and squishy 
comparison land their thoughts might have been different. I missed it in the 
announcement and had to go looking in the code since Willy just kinda tripped 
over it unwittingly as well.

> Note that Willy reniced X back to 0 so it had no relevance on 
> his test.

Oh yes I did notice that, but since the array swap is the remaining longest 
deadline in SD which would cause noticeable jerks, renicing X on SD by 
default would make the experience very different since reniced tasks do much 
better over array swaps compared to non niced tasks. I really should go and 
make the whole thing one circular list and blow away the array swap (if I can 
figure out how to do it). 

> Also note that i pointed this change out in the -v4 CFS 
>
> announcement:
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as
> ||keventd, OOM tasks and tasks doing privileged hardware access
> ||(such as Xorg).

Reading the changelog in the gloss-over fashion that I unfortunately did, even 
I missed it. 

> i've attached it below in a standalone form, feel free to put it into
> SD! :)

Hmm well I have tried my very best to do all the changes without 
changing "policy" as much as possible since that trips over so many emotive 
issues that no one can agree on, and I don't have a strong opinion on this as 
I thought it would be better for it to be a config option for X in userspace 
instead. Either way it needs to be turned on/off by admin and doing it by 
default in the kernel is... not universally accepted as good. What else 
accesses ioports that can get privileged nice levels? Does this make it 
relatively exploitable just by poking an ioport?
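
To make that concern concrete: with the quoted patch applied, any task that
already holds CAP_SYS_RAWIO could pick up the privileged nice level simply by
enabling one I/O port, whether or not it is Xorg. A minimal, hypothetical
x86-only illustration (port 0x80 chosen arbitrarily as a harmless example):

#include <stdio.h>
#include <sys/io.h>

int main(void)
{
        /*
         * With the quoted change, sys_ioperm() calls
         * sched_privileged_task(current) on success, so this task would
         * end up at sysctl_sched_privileged_nice_level (-19 by default)
         * without ever touching real hardware. Needs CAP_SYS_RAWIO.
         */
        if (ioperm(0x80, 1, 1) < 0) {
                perror("ioperm");
                return 1;
        }

        for (;;)
                ;       /* a plain CPU hog, now competing at nice -19 */
}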

>   Ingo
>
> ---
>  arch/i386/kernel/ioport.c   |   13 ++---
>  arch/x86_64/kernel/ioport.c |8 ++--
>  drivers/block/loop.c|5 -
>  include/linux/sched.h   |7 +++
>  kernel/sched.c  |   40

Thanks for the patch. I'll consider it. Since end users are testing this in 
fuzzy interactivity land I may simply be forced to do this just for 
comparisons to be meaningful between CFS and SD; otherwise they're not really 
comparing them on a level playing field. I had almost given up SD for dead 
meat with all the momentum CFS had gained... until recently.

-- 
-ck


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Linus Torvalds


On Sat, 21 Apr 2007, Ulrich Drepper wrote:
> 
> If you do this, and it has been requested many a times, then please
> generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
> call is issued, the remaining time in the slot should be given to the
> thread currently owning the futex.

And how the hell do you imagine you'd even *know* what thread holds the 
futex?

The whole point of the "f" part of the mutex is that it's fast, and we 
never see the non-contended case in the kernel. 

So we know who *blocks*, but we don't know who actually didn't block.

Linus


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread William Lee Irwin III
On 4/21/07, Kyle Moffett <[EMAIL PROTECTED]> wrote:
>> It might be nice if it was possible to actively contribute your CPU
>> time to a child process.  For example:
>> int sched_donate(pid_t pid, struct timeval *time, int percentage);

On Sat, Apr 21, 2007 at 12:49:52PM -0700, Ulrich Drepper wrote:
> If you do this, and it has been requested many a times, then please
> generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
> call is issued, the remaining time in the slot should be given to the
> thread currently owning the futex.  For non-PI futexes this needs an
> extension of the interface but I would be up for that.  It can have
> big benefits on the throughput of an application.

It's encouraging to hear support for a more full-featured API (or, for
that matter, any response at all) on this front.


-- wli


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Denis Vlasenko
On Saturday 21 April 2007 18:00, Ingo Molnar wrote:
> correct. Note that Willy reniced X back to 0 so it had no relevance on 
> his test. Also note that i pointed this change out in the -v4 CFS 
> announcement:
> 
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as 
> ||keventd, OOM tasks and tasks doing privileged hardware access
> ||(such as Xorg).
> 
> i've attached it below in a standalone form, feel free to put it into 
> SD! :)

But X problems have nothing to do with "privileged hardware access".
X problems are related to priority inversions between server and client
processes, and "one server process - many client processes" case.

I think the synchronous nature of Xlib (clients cannot fire-and-forget
their commands to the X server; with Xlib each command waits for an ACK
from the server) also adds some amount of pain.
--
vda


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ulrich Drepper

On 4/21/07, Kyle Moffett <[EMAIL PROTECTED]> wrote:

It might be nice if it was possible to actively contribute your CPU
time to a child process.  For example:
int sched_donate(pid_t pid, struct timeval *time, int percentage);


If you do this, and it has been requested many a times, then please
generalize it.  We have the same issue with futexes.  If a FUTEX_WAIT
call is issued, the remaining time in the slot should be given to the
thread currently owning the futex.  For non-PI futexes this needs an
extension of the interface but I would be up for that.  It can have
big benefits on the throughput of an application.


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ingo Molnar

* Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> > i've attached it below in a standalone form, feel free to put it 
> > into SD! :)
> 
> Assume X went crazy (lacking any statistics, I make the unproven 
> statement that this happens more often than kthreads going berserk), 
> then having it niced with minus something is not too nice.

i've not experienced a 'runaway X' personally, at most it would crash or 
lock up ;) The value is boot-time and sysctl configurable as well back 
to 0.

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Kyle Moffett

On Apr 21, 2007, at 12:42:41, William Lee Irwin III wrote:

On Sat, 21 Apr 2007, Willy Tarreau wrote:
If you remember, with 50/50, I noticed some difficulties to fork
many processes. I think that during a fork(), the parent has a
higher probability of forking other processes than the child. So
at least, we should use something like 67/33 or 75/25 for
parent/child.


On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:

It would be even better to simply have the rule:
 - child gets almost no points at startup
 - but when a parent does a "waitpid()" call and blocks, it will
   spread out its points to the children (the "vfork()" blocking is
   another case that is really the same).
This is a very special kind of "priority inversion" logic: you
give higher priority to the things you wait for. Not because of
holding any locks, but simply because a blocking waitpid really is
a damn big hint that "ok, the child now works for the parent".


An in-kernel scheduler API might help. void yield_to(struct  
task_struct *)?


A userspace API might be nice, too. e.g. int sched_yield_to(pid_t).


It might be nice if it was possible to actively contribute your CPU
time to a child process.  For example:

int sched_donate(pid_t pid, struct timeval *time, int percentage);

Maybe a way to pass CPU time over a UNIX socket (analogous to
SCM_RIGHTS), along with information on what process/user passed it.
That would make it possible to really fix X properly on a local
system.  You could make the X client library pass CPU time to the X
server whenever it requests a CPU-intensive rendering operation.
Ordinarily X would nice all of its client service threads to +10, but
when a client passes CPU time to its thread over the socket, then its
service thread temporarily gets the scheduling properties of the
client.  I'm not a scheduler guru, but that's what makes the most
sense from an application-programmer point of view.
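
As a sketch only (sched_donate() is just the proposal above; the stub, the
headers and the 5 ms / 50% budget are made-up values for illustration), an X
client library might wrap an expensive request like this:

#include <sys/time.h>
#include <sys/types.h>

/* Proposed interface from above; stubbed out so the sketch compiles. */
static int sched_donate(pid_t pid, struct timeval *time, int percentage)
{
        (void)pid; (void)time; (void)percentage;
        return -1;      /* would be a real syscall if this were adopted */
}

/* Hypothetical client-side helper: donate a small CPU budget to the X
 * server just before asking it to render something expensive for us. */
static void request_big_render(pid_t xserver_pid)
{
        struct timeval budget = { .tv_sec = 0, .tv_usec = 5000 }; /* 5 ms */

        sched_donate(xserver_pid, &budget, 50);
        /* ...submit the actual rendering request here... */
}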


Cheers,
Kyle Moffett



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Gene Heskett
On Saturday 21 April 2007, Willy Tarreau wrote:
>Hi Ingo, Hi Con,
>
>I promised to perform some tests on your code. I'm short in time right now,
>but I observed behaviours that should be commented on.
>
>1) machine : dual athlon 1533 MHz, 1G RAM, kernel 2.6.21-rc7 + either scheduler
>   Test:  ./ocbench -R 25 -S 75 -x 8 -y 8
>   ocbench: http://linux.1wt.eu/sched/
>
>2) SD-0.44
>
>   Feels good, but becomes jerky at moderately high loads. I've started
>   64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
>   always responds correctly but under X, mouse jumps quite a bit and
>   typing in xterm or even text console feels slightly jerky. The CPU is
>   not completely used, and the load varies a lot (see below). However,
>   the load is shared equally between all 64 ocbench, and they do not
>   deviate even after 4000 iterations. X uses less than 1% CPU during
>   those tests.
>
>   Here's the vmstat output :
>
>[EMAIL PROTECTED]:~$ vmstat 1
>   [the vmstat 1 sample (procs r/b/w, memory, swap, io, system in/cs and
>   cpu us/sy/id columns) is garbled by line wrapping in the archive and
>   is omitted here]
>
>3) CFS-v4
>
>  Feels even better, mouse movements are very smooth even under high load.
>  I noticed that X gets reniced to -19 with this scheduler. I've not looked
>  at the code yet but this looked suspicious to me. I've reniced it to 0 and
>  it did not change any behaviour. Still very good. The 64 ocbench share
>  equal CPU time and show exact same progress after 2000 iterations. The CPU
>  load is more smoothly spread according to vmstat, and there's no idle (see
>  below). BUT I now think it was wrong to let new processes start with no
>  timeslice at all, because it can take tens of seconds to start a new
> process when only 64 ocbench are there. Simply starting "killall ocbench"
> takes about 10 seconds. On a smaller machine (VIA C3-533), it took me more
> than one minute to do "su -", even from console, so that's not X. BTW, X
> uses less than 1% CPU during those tests.
>
>[EMAIL PROTECTED]:~$ vmstat 1
>   [the vmstat 1 sample (procs r/b/w, memory, swap, io, system in/cs and
>   cpu us/sy/id columns) is garbled by line wrapping in the archive and
>   is omitted here]

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ulrich Drepper

On 4/21/07, Ingo Molnar <[EMAIL PROTECTED]> wrote:

on a simple 'ls' command:

  21310 clone(child_stack=0,  ...) = 21399
  ...
  21399 execve("/bin/ls",
  ...
  21310 waitpid(-1, 

the PID is -1 so we don't actually know which task we are waiting for.


That's a special case.  Most programs don't do this.  In fact, in
multi-threaded code you better never do it since such an unqualified
wait might catch the child another thread waits for (particularly bad
if one thread uses system()).

And even in the case of bash, we can probably change the code to use a
qualified wait in case there are no other children.  This is known at
any time and I expect that most of the time there are no background
processes.  At least in shell scripts.
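
The difference is easy to show in miniature with plain waitpid(2), which
already supports both forms; only the qualified one names the task that a
directed-yield heuristic could act on:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        int status;
        pid_t child = fork();

        if (child < 0)
                return 1;
        if (child == 0) {
                execlp("ls", "ls", (char *)NULL);
                _exit(127);
        }

        /* Unqualified: "wait for any child" -- what bash does today,
         * which tells the scheduler nothing about whom we depend on. */
        /* waitpid(-1, &status, 0); */

        /* Qualified: names the exact child the parent is blocked on. */
        waitpid(child, &status, 0);
        return 0;
}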


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Jan Engelhardt

On Apr 21 2007 18:00, Ingo Molnar wrote:
>* Con Kolivas <[EMAIL PROTECTED]> wrote:
>
>> >   Feels even better, mouse movements are very smooth even under high 
>> >   load. I noticed that X gets reniced to -19 with this scheduler. 
>> >   I've not looked at the code yet but this looked suspicious to me. 
>> >   I've reniced it to 0 and it did not change any behaviour. Still 
>> >   very good.
>> 
>> Looks like this code does it:
>> 
>> +int sysctl_sched_privileged_nice_level __read_mostly = -19;
>
>correct. Note that Willy reniced X back to 0 so it had no relevance on 
>his test. Also note that i pointed this change out in the -v4 CFS 
>announcement:
>
>|| Changes since -v3:
>||
>||  - usability fix: automatic renicing of kernel threads such as 
>||keventd, OOM tasks and tasks doing privileged hardware access
>||(such as Xorg).
>
>i've attached it below in a standalone form, feel free to put it into 
>SD! :)

Assume X went crazy (lacking any statistics, I make the unproven
statement that this happens more often than kthreads going berserk),
then having it niced with minus something is not too nice.



Jan
-- 


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Geert Bosch


On Apr 21, 2007, at 12:18, Willy Tarreau wrote:
Also, I believe that (in shells), most forked processes do not even
consume a full timeslice (eg: $(uname -n) is very fast). This means
that assigning them with a shorter one will not hurt them while
preserving the shell's performance against CPU hogs.


On a fast machine, during regression testing of GCC, I've noticed we
create an average of 500 processes per second during an hour or so.
There are other work loads like this. So, most processes start, execute
and complete in 2ms.

How does fairness work in a situation like this?

  -Geert


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Willy Tarreau
On Sat, Apr 21, 2007 at 06:53:47PM +0200, Ingo Molnar wrote:
> 
> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
> 
> > It would be even better to simply have the rule:
> >  - child gets almost no points at startup
> >  - but when a parent does a "waitpid()" call and blocks, it will spread 
> >out its points to the children (the "vfork()" blocking is another case 
> >that is really the same).
> > 
> > This is a very special kind of "priority inversion" logic: you give 
> > higher priority to the things you wait for. Not because of holding any 
> > locks, but simply because a blocking waitpid really is a damn big hint 
> > that "ok, the child now works for the parent".
> 
> yeah. One problem i can see with the implementation of this though is 
> that shells typically do nonspecific waits - for example bash does this 
> on a simple 'ls' command:
> 
>   21310 clone(child_stack=0,  ...) = 21399
>   ...
>   21399 execve("/bin/ls", 
>   ...
>   21310 waitpid(-1, 
> 
> the PID is -1 so we don't actually know which task we are waiting for. We 
> could use the first entry from the p->children list, but that looks too 
> specific of a hack to me. It should catch most of the 
> synchronous-helper-task cases though.

The last one should be more appropriate IMHO. If you waitpid(), it's very
likely that you're waiting for the result of the very last fork().

Willy



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Willy Tarreau
On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
> 
> 
> On Sat, 21 Apr 2007, Willy Tarreau wrote:
> > 
> > If you remember, with 50/50, I noticed some difficulties to fork many
> > processes. I think that during a fork(), the parent has a higher probability
> > of forking other processes than the child. So at least, we should use
> > something like 67/33 or 75/25 for parent/child.
> 
> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread 
>out its points to the children (the "vfork()" blocking is another case 
>that is really the same).
> 
> This is a very special kind of "priority inversion" logic: you give higher 
> priority to the things you wait for. Not because of holding any locks, but 
> simply because a blocking waitpid really is a damn big hint that "ok, the 
> child now works for the parent".

I like this idea a lot. I don't know if it can be applied to pipes and unix
sockets, but it's clearly a way of saying "hurry up, I'm waiting for you"
which seems natural with inter-process communications. Also, if we can do
this on unix sockets, it would help a lot with X !

Willy



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread 
>out its points to the children (the "vfork()" blocking is another case 
>that is really the same).
> 
> This is a very special kind of "priority inversion" logic: you give 
> higher priority to the things you wait for. Not because of holding any 
> locks, but simply because a blocking waitpid really is a damn big hint 
> that "ok, the child now works for the parent".

yeah. One problem i can see with the implementation of this though is 
that shells typically do nonspecific waits - for example bash does this 
on a simple 'ls' command:

  21310 clone(child_stack=0,  ...) = 21399
  ...
  21399 execve("/bin/ls", 
  ...
  21310 waitpid(-1, 

the PID is -1 so we don't actually know which task we are waiting for. We 
could use the first entry from the p->children list, but that looks too 
specific of a hack to me. It should catch most of the 
synchronous-helper-task cases though.
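
Spelled out as a sketch only (yield_to() is the proposed in-kernel API and
does not exist, the children/sibling linkage is the 2.6.2x task_struct
layout, and all locking is omitted), the hack would amount to something like:

#include <linux/list.h>
#include <linux/sched.h>

extern void yield_to(struct task_struct *p);    /* hypothetical */

/* Called when a task blocks in an unqualified wait (pid == -1). */
static void boost_guessed_child(struct task_struct *parent)
{
        struct task_struct *child;

        if (list_empty(&parent->children))
                return;

        /*
         * copy_process() adds new children with list_add_tail(), so the
         * most recently forked child -- Willy's "probably the last
         * fork()" guess -- sits at the tail of ->children.
         */
        child = list_entry(parent->children.prev,
                           struct task_struct, sibling);

        yield_to(child);
}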

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread William Lee Irwin III
On Sat, 21 Apr 2007, Willy Tarreau wrote:
>> If you remember, with 50/50, I noticed some difficulties to fork many
>> processes. I think that during a fork(), the parent has a higher probability
>> of forking other processes than the child. So at least, we should use
>> something like 67/33 or 75/25 for parent/child.

On Sat, Apr 21, 2007 at 09:34:07AM -0700, Linus Torvalds wrote:
> It would be even better to simply have the rule:
>  - child gets almost no points at startup
>  - but when a parent does a "waitpid()" call and blocks, it will spread 
>out its points to the children (the "vfork()" blocking is another case 
>that is really the same).
> This is a very special kind of "priority inversion" logic: you give higher 
> priority to the things you wait for. Not because of holding any locks, but 
> simply because a blocking waitpid really is a damn big hint that "ok, the 
> child now works for the parent".

An in-kernel scheduler API might help. void yield_to(struct task_struct *)?

A userspace API might be nice, too. e.g. int sched_yield_to(pid_t).


-- wli


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread William Lee Irwin III
On Sat, Apr 21, 2007 at 06:00:08PM +0200, Ingo Molnar wrote:
>  arch/i386/kernel/ioport.c   |   13 ++---
>  arch/x86_64/kernel/ioport.c |8 ++--
>  drivers/block/loop.c|5 -
>  include/linux/sched.h   |7 +++
>  kernel/sched.c  |   40 
>  kernel/workqueue.c  |2 +-
>  mm/oom_kill.c   |4 +++-
>  7 files changed, 71 insertions(+), 8 deletions(-)

Yum. I'm going to see what this does for glxgears (I presume it's a
screensaver) on my dual G5 driving a 42" wall-mounted TV for a display. ;)

More seriously, there should be more portable ways of doing this. I
suspect even someone using fbdev on i386/x86-64 might be left out here.


-- wli


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Linus Torvalds


On Sat, 21 Apr 2007, Willy Tarreau wrote:
> 
> If you remember, with 50/50, I noticed some difficulties to fork many
> processes. I think that during a fork(), the parent has a higher probability
> of forking other processes than the child. So at least, we should use
> something like 67/33 or 75/25 for parent/child.

It would be even better to simply have the rule:
 - child gets almost no points at startup
 - but when a parent does a "waitpid()" call and blocks, it will spread 
   out its points to the children (the "vfork()" blocking is another case 
   that is really the same).

This is a very special kind of "priority inversion" logic: you give higher 
priority to the things you wait for. Not because of holding any locks, but 
simply because a blocking waitpid really is a damn big hint that "ok, the 
child now works for the parent".
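
A rough sketch of that rule, borrowing the time_slice field of the 2.6.20
O(1) scheduler purely for illustration (CFS and SD account time differently,
and this is not a patch):

#include <linux/list.h>
#include <linux/sched.h>

/* At fork(): the child starts with (almost) no points. */
static void sched_fork_points(struct task_struct *parent,
                              struct task_struct *child)
{
        child->time_slice = 1;
        /* the parent keeps whatever it has left */
}

/* When the parent blocks in waitpid()/vfork(): spread its remaining
 * points over the children it is now waiting for. */
static void sched_wait_donate(struct task_struct *parent)
{
        struct task_struct *child;
        unsigned int nchildren = 0, share;

        list_for_each_entry(child, &parent->children, sibling)
                nchildren++;
        if (!nchildren)
                return;

        share = parent->time_slice / nchildren;
        list_for_each_entry(child, &parent->children, sibling)
                child->time_slice += share;
        parent->time_slice = 0;         /* it is blocked anyway */
}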

Linus


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Willy Tarreau
On Sat, Apr 21, 2007 at 05:46:14PM +0200, Ingo Molnar wrote:
> 
> * Willy Tarreau <[EMAIL PROTECTED]> wrote:
> 
> > I promised to perform some tests on your code. I'm short in time right 
> > now, but I observed behaviours that should be commented on.
> 
> thanks for the feedback!
> 
> > 3) CFS-v4
> > 
> >   Feels even better, mouse movements are very smooth even under high 
> >   load. I noticed that X gets reniced to -19 with this scheduler. I've 
> >   not looked at the code yet but this looked suspicious to me. I've 
> >   reniced it to 0 and it did not change any behaviour. Still very 
> >   good. The 64 ocbench share equal CPU time and show exact same 
> >   progress after 2000 iterations. The CPU load is more smoothly spread 
> >   according to vmstat, and there's no idle (see below). BUT I now 
> >   think it was wrong to let new processes start with no timeslice at 
> >   all, because it can take tens of seconds to start a new process when 
> >   only 64 ocbench are there. [...]
> 
> ok, i'll modify that portion and add back the 50%/50% parent/child CPU 
> time sharing approach again. (which CFS had in -v1) That should not 
> change the rest of your test and should improve the task startup 
> characteristics.

If you remember, with 50/50, I noticed some difficulties to fork many
processes. I think that during a fork(), the parent has a higher probability
of forking other processes than the child. So at least, we should use
something like 67/33 or 75/25 for parent/child.
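
In single-slice terms (an illustration only: the sysctl name and the use of
a time_slice field are made up, and the real CFS/SD bookkeeping differs),
the tunable split being suggested is just:

#include <linux/sched.h>

int sysctl_sched_fork_parent_share = 67;        /* percent kept by the parent */

static void split_slice_on_fork(struct task_struct *parent,
                                struct task_struct *child)
{
        unsigned int slice = parent->time_slice;

        /* 50 gives the old -v1 50/50 behaviour; 67 or 75 biases the
         * split toward the parent, which is likely to fork again. */
        parent->time_slice = slice * sysctl_sched_fork_parent_share / 100;
        child->time_slice  = slice - parent->time_slice;
}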

There are many shell-scripts out there doing a lot of fork(), and it should
be reasonable to let them keep some CPU to continue to work.

Also, I believe that (in shells), most forked processes do not even consume
a full timeslice (eg: $(uname -n) is very fast). This means that assigning
them with a shorter one will not hurt them while preserving the shell's
performance against CPU hogs.

Willy



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Willy Tarreau
On Sat, Apr 21, 2007 at 06:00:08PM +0200, Ingo Molnar wrote:
> 
> * Con Kolivas <[EMAIL PROTECTED]> wrote:
> 
> > >   Feels even better, mouse movements are very smooth even under high 
> > >   load. I noticed that X gets reniced to -19 with this scheduler. 
> > >   I've not looked at the code yet but this looked suspicious to me. 
> > >   I've reniced it to 0 and it did not change any behaviour. Still 
> > >   very good.
> > 
> > Looks like this code does it:
> > 
> > +int sysctl_sched_privileged_nice_level __read_mostly = -19;
> 
> correct. Note that Willy reniced X back to 0 so it had no relevance on 
> his test.

Anyway, my X was mostly unused (below 1% CPU), which was my intent when
replacing glxgears by ocbench. We have not settled yet about how to handle
the special case for X. Let's at least try to get the best schedulers without
this problem, then see how to make them behave the best taking X into account.

> Also note that i pointed this change out in the -v4 CFS 
> announcement:
> 
> || Changes since -v3:
> ||
> ||  - usability fix: automatic renicing of kernel threads such as 
> ||keventd, OOM tasks and tasks doing privileged hardware access
> ||(such as Xorg).
> 
> i've attached it below in a standalone form, feel free to put it into 
> SD! :)

Con, I think it could be a good idea, since you already recommend renicing X
with SD. Most of the problem users face with renicing X is that they need
to change their configs or scripts. If the kernel can reliably detect X and
handle it differently, why not do it ?

It makes me think that this hint might be used to set some flags in the task
struct in order to apply different processing than just renicing. It is indeed
possible that nice is not the best solution and that something else would be
even better (eg: longer timeslices, but not changing priority in the queues).
Just an idea anyway.

OK, back to work ;-)
Willy



Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ingo Molnar

* Con Kolivas <[EMAIL PROTECTED]> wrote:

> >   Feels even better, mouse movements are very smooth even under high 
> >   load. I noticed that X gets reniced to -19 with this scheduler. 
> >   I've not looked at the code yet but this looked suspicious to me. 
> >   I've reniced it to 0 and it did not change any behaviour. Still 
> >   very good.
> 
> Looks like this code does it:
> 
> +int sysctl_sched_privileged_nice_level __read_mostly = -19;

correct. Note that Willy reniced X back to 0 so it had no relevance on 
his test. Also note that i pointed this change out in the -v4 CFS 
announcement:

|| Changes since -v3:
||
||  - usability fix: automatic renicing of kernel threads such as 
||keventd, OOM tasks and tasks doing privileged hardware access
||(such as Xorg).

i've attached it below in a standalone form, feel free to put it into 
SD! :)

Ingo

---
 arch/i386/kernel/ioport.c   |   13 ++---
 arch/x86_64/kernel/ioport.c |8 ++--
 drivers/block/loop.c|5 -
 include/linux/sched.h   |7 +++
 kernel/sched.c  |   40 
 kernel/workqueue.c  |2 +-
 mm/oom_kill.c   |4 +++-
 7 files changed, 71 insertions(+), 8 deletions(-)

Index: linux/arch/i386/kernel/ioport.c
===
--- linux.orig/arch/i386/kernel/ioport.c
+++ linux/arch/i386/kernel/ioport.c
@@ -64,9 +64,15 @@ asmlinkage long sys_ioperm(unsigned long
 
if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
return -EINVAL;
-   if (turn_on && !capable(CAP_SYS_RAWIO))
-   return -EPERM;
-
+   if (turn_on) {
+   if (!capable(CAP_SYS_RAWIO))
+   return -EPERM;
+   /*
+* Task will be accessing hardware IO ports,
+* mark it as special with the scheduler too:
+*/
+   sched_privileged_task(current);
+   }
/*
 * If it's the first ioperm() call in this thread's lifetime, set the
 * IO bitmap up. ioperm() is much less timing critical than clone(),
@@ -145,6 +151,7 @@ asmlinkage long sys_iopl(unsigned long u
if (level > old) {
if (!capable(CAP_SYS_RAWIO))
return -EPERM;
+   sched_privileged_task(current);
}
t->iopl = level << 12;
regs->eflags = (regs->eflags & ~X86_EFLAGS_IOPL) | t->iopl;
Index: linux/arch/x86_64/kernel/ioport.c
===
--- linux.orig/arch/x86_64/kernel/ioport.c
+++ linux/arch/x86_64/kernel/ioport.c
@@ -41,8 +41,11 @@ asmlinkage long sys_ioperm(unsigned long
 
if ((from + num <= from) || (from + num > IO_BITMAP_BITS))
return -EINVAL;
-   if (turn_on && !capable(CAP_SYS_RAWIO))
-   return -EPERM;
+   if (turn_on) {
+   if (!capable(CAP_SYS_RAWIO))
+   return -EPERM;
+   sched_privileged_task(current);
+   }
 
/*
 * If it's the first ioperm() call in this thread's lifetime, set the
@@ -113,6 +116,7 @@ asmlinkage long sys_iopl(unsigned int le
if (level > old) {
if (!capable(CAP_SYS_RAWIO))
return -EPERM;
+   sched_privileged_task(current);
}
regs->eflags = (regs->eflags &~ X86_EFLAGS_IOPL) | (level << 12);
return 0;
Index: linux/drivers/block/loop.c
===
--- linux.orig/drivers/block/loop.c
+++ linux/drivers/block/loop.c
@@ -588,7 +588,10 @@ static int loop_thread(void *data)
 */
current->flags |= PF_NOFREEZE;
 
-   set_user_nice(current, -20);
+   /*
+* The loop thread is important enough to be given a boost:
+*/
+   sched_privileged_task(current);
 
while (!kthread_should_stop() || lo->lo_bio) {
 
Index: linux/include/linux/sched.h
===
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1256,6 +1256,13 @@ static inline int rt_mutex_getprio(struc
 #endif
 
 extern void set_user_nice(struct task_struct *p, long nice);
+/*
+ * Task has special privileges, give it more CPU power:
+ */
+extern void sched_privileged_task(struct task_struct *p);
+
+extern int sysctl_sched_privileged_nice_level;
+
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
 extern int can_nice(const struct task_struct *p, const int nice);
Index: linux/kernel/sched.c
===
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -3251,6 +3251,46 @@ out_unlock:
 EXPORT_SYMBOL(set_user_nice);
 
 /*
+ * Nice level for privileged tasks. (can be set to 0 for this
+ * to be turned off)
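
(the rest of the kernel/sched.c hunk is cut off here; as a rough sketch,
the helper itself presumably does little more than renice the task when
the sysctl is non-zero -- the names below are taken from the patch, but
the body is a guess, not the actual code:)

void sched_privileged_task(struct task_struct *p)
{
        /*
         * Guessed implementation: a nice level of 0 turns the feature
         * off, otherwise the task is simply reniced to the privileged
         * level (-19 by default):
         */
        if (sysctl_sched_privileged_nice_level)
                set_user_nice(p, sysctl_sched_privileged_nice_level);
}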

Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Con Kolivas
On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> I promised to perform some tests on your code. I'm short in time right now,
> but I observed behaviours that should be commented on.

>   Feels even better, mouse movements are very smooth even under high load.
>   I noticed that X gets reniced to -19 with this scheduler. I've not looked
>   at the code yet but this looked suspicious to me.

Looks like this code does it:

+int sysctl_sched_privileged_nice_level __read_mostly = -19;

so anything that triggers sched_privileged_task() one way or another gets 
nice -19, and this is enabled by default.

--- linux-cfs-2.6.20.7.q.orig/arch/i386/kernel/ioport.c
+++ linux-cfs-2.6.20.7.q/arch/i386/kernel/ioport.c

+   if (turn_on) {
+   if (!capable(CAP_SYS_RAWIO))
+   return -EPERM;
+   /*
+* Task will be accessing hardware IO ports,
+* mark it as special with the scheduler too:
+*/
+   sched_privileged_task(current);
+   }

presumably that selects out X as a privileged task... and sets it to nice -19 
by default.
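
(presumably X hits exactly that path: the server asks for raw port access
via iopl()/ioperm() at startup, and with the patch applied that syscall now
also marks the caller. A tiny illustration -- not Xorg code -- of a task
that would end up marked and reniced:)

#include <stdio.h>
#include <sys/io.h>

int main(void)
{
        if (iopl(3) < 0) {              /* needs CAP_SYS_RAWIO, i.e. root */
                perror("iopl");
                return 1;
        }
        /*
         * With the patch applied, sys_iopl() has just called
         * sched_privileged_task(current), so this task is now at the
         * privileged nice level (-19 by default).
         */
        return 0;
}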

-- 
-ck


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Ingo Molnar

* Willy Tarreau <[EMAIL PROTECTED]> wrote:

> I promised to perform some tests on your code. I'm short in time right 
> now, but I observed behaviours that should be commented on.

thanks for the feedback!

> 3) CFS-v4
> 
>   Feels even better, mouse movements are very smooth even under high 
>   load. I noticed that X gets reniced to -19 with this scheduler. I've 
>   not looked at the code yet but this looked suspicious to me. I've 
>   reniced it to 0 and it did not change any behaviour. Still very 
>   good. The 64 ocbench share equal CPU time and show exact same 
>   progress after 2000 iterations. The CPU load is more smoothly spread 
>   according to vmstat, and there's no idle (see below). BUT I now 
>   think it was wrong to let new processes start with no timeslice at 
>   all, because it can take tens of seconds to start a new process when 
>   only 64 ocbench are there. [...]

ok, i'll modify that portion and add back the 50%/50% parent/child CPU 
time sharing approach again (which CFS had in -v1). That should not 
change the rest of your test, and it should improve the task startup 
characteristics.
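
(for the curious, a minimal sketch of what such a 50%/50% split at fork
time could look like -- illustrative only, the helper and field name are
not the actual CFS-v1 code:)

static void sched_fork_share_slice(struct task_struct *parent,
                                   struct task_struct *child)
{
        /*
         * Instead of the child starting with zero entitlement (which is
         * what makes process startup crawl under a 64-task load), give
         * it half of the parent's remaining slice and charge the parent
         * for it:
         */
        unsigned int half = parent->time_slice / 2;

        child->time_slice   = half;
        parent->time_slice -= half;
}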

Ingo


Re: [REPORT] cfs-v4 vs sd-0.44

2007-04-21 Thread Willy Tarreau
On Sat, Apr 21, 2007 at 10:40:18PM +1000, Con Kolivas wrote:
> On Saturday 21 April 2007 22:12, Willy Tarreau wrote:
> > Hi Ingo, Hi Con,
> >
> > I promised to perform some tests on your code. I'm short in time right now,
> > but I observed behaviours that should be commented on.
> >
> > 1) machine : dual athlon 1533 MHz, 1G RAM, kernel 2.6.21-rc7 + either
> > scheduler Test:  ./ocbench -R 25 -S 75 -x 8 -y 8
> >ocbench: http://linux.1wt.eu/sched/
> >
> > 2) SD-0.44
> >
> >Feels good, but becomes jerky at moderately high loads. I've started
> >64 ocbench with a 250 ms busy loop and 750 ms sleep time. The system
> >always responds correctly but under X, mouse jumps quite a bit and
> >typing in xterm or even text console feels slightly jerky. The CPU is
> >not completely used, and the load varies a lot (see below). However,
> >the load is shared equally between all 64 ocbench, and they do not
> >deviate even after 4000 iterations. X uses less than 1% CPU during
> >those tests.
> >
> >Here's the vmstat output :
> [snip]
> 
> > 3) CFS-v4
> >
> >   Feels even better, mouse movements are very smooth even under high load.
> >   I noticed that X gets reniced to -19 with this scheduler. I've not looked
> >   at the code yet but this looked suspicious to me. I've reniced it to 0
> > and it did not change any behaviour. Still very good. The 64 ocbench share
> > equal CPU time and show exact same progress after 2000 iterations. The CPU
> > load is more smoothly spread according to vmstat, and there's no idle (see
> > below). BUT I now think it was wrong to let new processes start with no
> > timeslice at all, because it can take tens of seconds to start a new
> > process when only 64 ocbench are there. Simply starting "killall ocbench"
> > takes about 10 seconds. On a smaller machine (VIA C3-533), it took me more
> > than one minute to do "su -", even from console, so that's not X. BTW, X
> > uses less than 1% CPU during those tests.
> >
> > [EMAIL PROTECTED]:~$ vmstat 1
> [snip]
> 
> > 4) first impressions
> >
> > I think that CFS is based on a more promising concept but is less mature
> > and is dangerous right now with certain workloads. SD shows some strange
> > behaviours like not using all CPU available and a little jerkyness, but is
> > more robust and may be the less risky solution for a first step towards
> > a better scheduler in mainline, but it may also probably be the last O(1)
> > scheduler, which may be replaced sometime later when CFS (or any other one)
> > shows at the same time the smoothness of CFS and the robustness of SD.
> 
> I assumed from your description that you were running X nice 0 during all
> this testing

Yes, that's what I did.

> and left the tunables from both SD and CFS at their defaults;

yes too, because I don't have enough time to try many combinations this weekend.

> this tends to have the effective equivalent of "timeslice" in CFS smaller
> than SD.

If you look at the CS column in vmstat, you'll see that there are about twice
as many context switches with CFS as with SD, meaning the average timeslice
would be about half as long with CFS. But my impression is that some tasks
occasionally get very long timeslices with SD, while this never happens with
CFS; hence the smooth-versus-jerky feeling, which cannot be explained by the
halved timeslices alone.
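
(a rough way to put numbers on that -- the figures below are made up for
illustration, since the vmstat output itself is snipped above: on a fully
loaded box, average timeslice ~= nr_cpus * 1000 ms / cs-per-second)

#include <stdio.h>

static double avg_slice_ms(int nr_cpus, double cs_per_sec)
{
        return nr_cpus * 1000.0 / cs_per_sec;   /* ms of CPU per switch */
}

int main(void)
{
        printf("SD : %.2f ms\n", avg_slice_ms(2, 1500.0));  /* ~1.33 ms */
        printf("CFS: %.2f ms\n", avg_slice_ms(2, 3000.0));  /* ~0.67 ms */
        return 0;
}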

> > I'm sorry not to spend more time on them right now, I hope that other
> > people will do.
> 
> Thanks for that interesting testing you've done. The fluctuating cpu load and
> the apparently high idle time mean there is almost certainly still a bug in
> the cpu accounting I do in update_cpu_clock. It already looks suspicious to
> me at first glance. Fortunately the throughput does not appear to be
> adversely affected on other benchmarks, so I suspect it's lying about the
> idle time and it's not really there. Which means it's likely also accounting
> the cpu time wrongly.

It is possible that only the measurement is wrong, because the time was evenly
distributed among the 64 processes. Maybe the fix could also prevent some
tasks from occasionally stealing a slice and reduce the jerky feeling.
Anyway, it's just a bit jerky; no more of the freezes we've known for years ;-)

> Which also means there's something I can fix and improve SD further.
> Great stuff, thanks! 

You're welcome!

Cheers,
Willy


