* Paul Mackerras <[EMAIL PROTECTED]> wrote:

> PowerPC's sched_clock() currently measures real time.  On POWER5 and 
> POWER6 machines we could change it to use a register called the "PURR" 
> (for Processor Utilization of Resources Register), which only measures 
> time spent while the partition is running.  But the PURR has another 
> function as well: it measures the distribution of dispatch cycles 
> between the two hardware threads on each core when running in SMT 
> mode.  That is, the cpu dispatches instructions from one thread or the 
> other (not both) on each CPU cycle, and each thread's PURR only gets 
> incremented on cycles where the cpu dispatches instructions for that 
> thread.  So the sum of the two threads' PURRs adds up to real time.
> 
> Do you think this makes the PURR more useful for CFS, or less?  To me 
> it looks like this would mean that CFS can make a more equitable 
> distribution of CPU time if, for example, you had 3 runnable tasks on 
> a 2-core x dual-threaded machine (4 virtual CPUs).

there's one complication: sched_clock() still needs to increase while 
the CPU (or thread) is idle, so that we can have a correct measurement 
of the CPU's utilization, for SMP load-balancing. CFS constructs another 
clock from sched_clock() [the rq->fair_clock] which does stop while the 
CPU is idle.

So perhaps a combination of the PURR and real time might work as 
sched_clock(): when a hardware thread is in cpu_idle(), it should 
advance its sched clock at _half_ the rate of real time [so that the 
clocks of both threads on a core, when both are idle, together advance 
at the rate of real time], and use the PURR when it is not idle. This 
would still correctly keep a meaningful load average if the physical 
CPU is under-utilized.
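
A minimal user-space sketch of the hybrid clock described above, assuming 
two hardware threads per core and a PURR reading already converted to 
nanoseconds. The struct and function names (hybrid_clock, 
hybrid_clock_init, hybrid_clock_update) are hypothetical and are not 
kernel APIs; the caller is assumed to know whether the thread spent the 
interval in the idle loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: two SMT threads per core, as on POWER5/POWER6. */
#define THREADS_PER_CORE 2

/* Hypothetical per-thread clock state; not a kernel structure. */
struct hybrid_clock {
    uint64_t clock_ns;      /* accumulated hybrid clock value */
    uint64_t last_real_ns;  /* real time at the previous update */
    uint64_t last_purr_ns;  /* PURR (in ns) at the previous update */
};

void hybrid_clock_init(struct hybrid_clock *hc,
                       uint64_t real_ns, uint64_t purr_ns)
{
    assert(hc != NULL);
    hc->clock_ns = 0;
    hc->last_real_ns = real_ns;
    hc->last_purr_ns = purr_ns;
}

/*
 * Advance the clock over the interval since the previous update.
 * If the thread was idle, advance by real time divided by the number
 * of threads per core; otherwise advance by the PURR delta.
 */
uint64_t hybrid_clock_update(struct hybrid_clock *hc,
                             uint64_t real_ns, uint64_t purr_ns,
                             bool was_idle)
{
    assert(hc != NULL);

    uint64_t real_delta = real_ns - hc->last_real_ns;
    uint64_t purr_delta = purr_ns - hc->last_purr_ns;

    hc->clock_ns += was_idle ? real_delta / THREADS_PER_CORE : purr_delta;
    hc->last_real_ns = real_ns;
    hc->last_purr_ns = purr_ns;
    return hc->clock_ns;
}
```

With both threads idle for 1000 ns, each clock advances by 500 ns, so 
the two clocks together still add up to real time, which is the same 
invariant the PURR provides while the threads are busy.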

If you make such a change you'll immediately see whether the approach 
is right: monitor the cpu_load[] values in /proc/sched_debug. They 
should match the intuitive 'load average' of that CPU (if divided by 
1024). Also check whether 'top' still works fine.
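
The divide-by-1024 check can be sketched as a small helper. It assumes 
that 1024 is the load weight of a single nice-0 task and that the 
cpu_load[] values have already been read from /proc/sched_debug by some 
other means; the function name and the NICE_0_WEIGHT macro are 
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one runnable nice-0 task contributes a weight of 1024. */
#define NICE_0_WEIGHT 1024UL

/*
 * Convert raw cpu_load[] values into "number of nice-0 tasks" so they
 * can be compared with the intuitive load average of the CPU.
 */
void cpu_load_to_tasks(const unsigned long *cpu_load, size_t n, double *out)
{
    assert(cpu_load != NULL && out != NULL);

    for (size_t i = 0; i < n; i++)
        out[i] = (double)cpu_load[i] / NICE_0_WEIGHT;
}
```

A cpu_load[] value of 3072 would then read as roughly three runnable 
nice-0 tasks on that CPU.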

> BTW, what does "time spent running during sleep" mean?  Does it mean 
> "time that other tasks are running while this task is sleeping"?

yeah. It's "the amount of fair runtime I missed out on while others were 
running".

> > still, CFS needs time measurement across idle periods as well, for 
> > another purpose: to be able to do precise task statistics for /proc. 
> > (for top, ps, etc.) So it's still true that sched_clock() should 
> > include idle periods too.
> 
> As with s390, 64-bit PowerPC also uses CONFIG_VIRT_CPU_ACCOUNTING. 
> That affects how tsk->utime and tsk->stime are accumulated (we call 
> account_user_time and account_system_time directly rather than calling 
> update_process_times) as well as the system hardirq/softirq time, idle 
> time, and stolen time.

tsk->utime and tsk->stime are only used for a single purpose: to 
determine the 'split' factor of how to split up the precise total time 
between user and system time.

> When you say "precise task statistics for /proc", where are they 
> accumulated?  I don't see any changes to the way that tsk->utime and 
> ctime are computed.

we now use p->se.sum_exec_runtime, which measures (in nanoseconds) the 
precise amount of time spent executing (sum of system and user time), 
and ->stime and ->utime are used to determine the 'split'. [this allows 
us to gather ->stime and ->utime via low-resolution sampling, while 
keeping the 'total' precise. Accounting at every system entry point 
would be quite expensive on most platforms.]
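
A rough user-space sketch of the 'split' idea, assuming the sampled 
->utime and ->stime are tick counts and the precise total is in 
nanoseconds. The function name split_exec_runtime and its parameters 
are hypothetical, and this is an illustration of the proportion, not 
the kernel's actual code. It uses floating point for brevity, so very 
large totals may round.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Divide a precise total runtime between user and system time in the
 * same proportion as the low-resolution sampled tick counts.
 */
void split_exec_runtime(uint64_t sum_exec_ns,
                        uint64_t utime_ticks, uint64_t stime_ticks,
                        uint64_t *user_ns, uint64_t *sys_ns)
{
    assert(user_ns != NULL && sys_ns != NULL);

    uint64_t total_ticks = utime_ticks + stime_ticks;

    if (total_ticks == 0) {
        /* No samples yet: attribute everything to user time.
         * This fallback is an arbitrary choice for the sketch. */
        *user_ns = sum_exec_ns;
        *sys_ns = 0;
        return;
    }

    /* user share = total * utime / (utime + stime) */
    *user_ns = (uint64_t)((long double)sum_exec_ns * utime_ticks
                          / total_ticks);
    /* The remainder goes to system time, so the two parts always
     * add up to the precise total. */
    *sys_ns = sum_exec_ns - *user_ns;
}
```

For example, a task with 1,000,000 ns of total runtime that was sampled 
3 times in user mode and once in system mode would be reported as 
750,000 ns user and 250,000 ns system.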

        Ingo