On Mon, Nov 21, 2016 at 11:17:28AM +0100, Martin Schwidefsky wrote:
> On Mon, 21 Nov 2016 07:59:56 +0100
> Martin Schwidefsky <schwidef...@de.ibm.com> wrote:
> 
> > On Fri, 18 Nov 2016 15:47:02 +0100
> > Frederic Weisbecker <fweis...@gmail.com> wrote:
> > 
> > > On Fri, Nov 18, 2016 at 01:08:46PM +0100, Martin Schwidefsky wrote:
> > > > On Thu, 17 Nov 2016 19:08:07 +0100
> > > > Frederic Weisbecker <fweis...@gmail.com> wrote:
> > > > 
> > > > Now it has been proposed to implement lazy accounting to accumulate 
> > > > deltas
> > > > and do the expensive conversions only infrequently. This is pretty 
> > > > straight-
> > > > forward for account_user_time but to do this for the account_system_time
> > > > function is more complicated. The function has to differentiate between
> > > > guest/hardirq/softirq and pure system time. We would need to keep sums 
> > > > for
> > > > each bucket and provide a separate function to add to each bucket. Like
> > > > account_guest_time(), account_hardirq_time(), account_softirq_time() and
> > > > account_system_time(). Then it is up to the arch code to sort out the 
> > > > details
> > > > and call the accounting code once per jiffy for each of the buckets.
> > > 
> > > That wouldn't be too hard really. The s390 code in vtime.c already does 
> > > that.
> > 
> > Yes, I agree that the accumulating change would not be too hard. Can I make 
> > the
> > request that we try to get that done first before doing the cleanup ?
> 
> Played with the idea a bit, here is a prototype patch to do the delay system 
> time
> accounting for s390. It applies against the latest s390 features tree which 
> you'll
> find here
> 
>       git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux.git features
> 
> The details probably needs some more work but it works.
> 
> --
> From 1b5ef9ddf899da81a48de826f783b15e6fc45d25 Mon Sep 17 00:00:00 2001
> From: Martin Schwidefsky <schwidef...@de.ibm.com>
> Date: Mon, 21 Nov 2016 10:44:10 +0100
> Subject: [PATCH] s390/cputime: delayed accounting of system time
> 
> The account_system_time() function is called with a cputime that
> occurred while running in the kernel. The function detects which
> context the CPU is currently running in and accounts the time to
> the correct bucket. This forces the arch code to account the
> cputime for hardirq and softirq immediately.
> 
> Make account_guest_time non-static and add account_sys_time,
> account_hardirq_time and account_softirq_time. With these functions
> the arch code can delay the accounting for system time. For s390
> the accounting is done once per timer tick and for each task switch.
> 
> Signed-off-by: Martin Schwidefsky <schwidef...@de.ibm.com>

Thanks a lot for taking care of that! I'll give a try to do the same
on powerpc.

A few comments below:

> ---
>  arch/s390/include/asm/lowcore.h   |  65 ++++++++++++-----------
>  arch/s390/include/asm/processor.h |   3 ++
>  arch/s390/kernel/vtime.c          | 106 
> ++++++++++++++++++++++----------------
>  include/linux/kernel_stat.h       |  13 +++--
>  kernel/sched/cputime.c            |  22 +++++++-
>  5 files changed, 129 insertions(+), 80 deletions(-)
> 
> diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
> index 62a5cf1..8a5b082 100644
> --- a/arch/s390/include/asm/lowcore.h
> +++ b/arch/s390/include/asm/lowcore.h
[...]
> @@ -110,34 +119,48 @@ static int do_account_vtime(struct task_struct *tsk, 
> int hardirq_offset)
>  #endif
>               : "=m" (S390_lowcore.last_update_timer),
>                 "=m" (S390_lowcore.last_update_clock));
> -     S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> -     S390_lowcore.steal_timer += S390_lowcore.last_update_clock - clock;
> +     clock = S390_lowcore.last_update_clock - clock;
> +     timer -= S390_lowcore.last_update_timer;
> +
> +     if ((tsk->flags & PF_VCPU) && (irq_count() - hardirq_offset == 0))
> +             S390_lowcore.guest_timer += timer;
> +     else if (hardirq_count() - hardirq_offset)
> +             S390_lowcore.hardirq_timer += timer;
> +     else if (in_serving_softirq())
> +             S390_lowcore.softirq_timer += timer;
> +     else
> +             S390_lowcore.system_timer += timer;

I initially thought that some code could be shared for that whole accumulation. 
Now I
don't know if it would be a good idea. An example would be to deal with the 
contexts above
in order to store the accumulation to the appropriate place.

>  
>       /* Update MT utilization calculation */
>       if (smp_cpu_mtid &&
>           time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
>               update_mt_scaling();
>  
> +     /* Calculate cputime delta */
>       user = S390_lowcore.user_timer - tsk->thread.user_timer;
> -     S390_lowcore.steal_timer -= user;
>       tsk->thread.user_timer = S390_lowcore.user_timer;
> -
> +     guest = S390_lowcore.guest_timer - tsk->thread.guest_timer;
> +     tsk->thread.guest_timer = S390_lowcore.guest_timer;
>       system = S390_lowcore.system_timer - tsk->thread.system_timer;
> -     S390_lowcore.steal_timer -= system;
>       tsk->thread.system_timer = S390_lowcore.system_timer;
> -
> -     user_scaled = user;
> -     system_scaled = system;
> -     /* Do MT utilization scaling */
> -     if (smp_cpu_mtid) {
> -             u64 mult = __this_cpu_read(mt_scaling_mult);
> -             u64 div = __this_cpu_read(mt_scaling_div);
> -
> -             user_scaled = (user_scaled * mult) / div;
> -             system_scaled = (system_scaled * mult) / div;
> -     }
> -     account_user_time(tsk, user, user_scaled);
> -     account_system_time(tsk, hardirq_offset, system, system_scaled);
> +     hardirq = S390_lowcore.hardirq_timer - tsk->thread.hardirq_timer;
> +     tsk->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> +     softirq = S390_lowcore.softirq_timer - tsk->thread.softirq_timer;
> +     tsk->thread.softirq_timer = S390_lowcore.softirq_timer;
> +     S390_lowcore.steal_timer +=
> +             clock - user - guest - system - hardirq - softirq;
> +
> +     /* Push account value */
> +     if (user)
> +             account_user_time(tsk, user, scale_vtime(user));
> +     if (guest)
> +             account_guest_time(tsk, guest, scale_vtime(guest));
> +     if (system)
> +             account_sys_time(tsk, system, scale_vtime(system));
> +     if (hardirq)
> +             account_hardirq_time(tsk, hardirq, scale_vtime(hardirq));
> +     if (softirq)
> +             account_softirq_time(tsk, softirq, scale_vtime(softirq));

And doing that would be another part of the shared code.

>  
>       steal = S390_lowcore.steal_timer;
>       if ((s64) steal > 0) {
> @@ -145,16 +168,22 @@ static int do_account_vtime(struct task_struct *tsk, 
> int hardirq_offset)
>               account_steal_time(steal);
>       }
>  
> -     return virt_timer_forward(user + system);
> +     return virt_timer_forward(user + guest + system + hardirq + softirq);
>  }
>  
>  void vtime_task_switch(struct task_struct *prev)
>  {
>       do_account_vtime(prev, 0);
>       prev->thread.user_timer = S390_lowcore.user_timer;
> +     prev->thread.guest_timer = S390_lowcore.guest_timer;
>       prev->thread.system_timer = S390_lowcore.system_timer;
> +     prev->thread.hardirq_timer = S390_lowcore.hardirq_timer;
> +     prev->thread.softirq_timer = S390_lowcore.softirq_timer;
>       S390_lowcore.user_timer = current->thread.user_timer;
> +     S390_lowcore.guest_timer = current->thread.guest_timer;
>       S390_lowcore.system_timer = current->thread.system_timer;
> +     S390_lowcore.hardirq_timer = current->thread.hardirq_timer;
> +     S390_lowcore.softirq_timer = current->thread.softirq_timer;
>  }

Ditto.

>  
>  /*
> @@ -174,31 +203,22 @@ void vtime_account_user(struct task_struct *tsk)
>   */
>  void vtime_account_irq_enter(struct task_struct *tsk)
>  {
> -     u64 timer, system, system_scaled;
> +     u64 timer;
>  
>       timer = S390_lowcore.last_update_timer;
>       S390_lowcore.last_update_timer = get_vtimer();
> -     S390_lowcore.system_timer += timer - S390_lowcore.last_update_timer;
> -
> -     /* Update MT utilization calculation */
> -     if (smp_cpu_mtid &&
> -         time_after64(jiffies_64, this_cpu_read(mt_scaling_jiffies)))
> -             update_mt_scaling();
> -
> -     system = S390_lowcore.system_timer - tsk->thread.system_timer;
> -     S390_lowcore.steal_timer -= system;
> -     tsk->thread.system_timer = S390_lowcore.system_timer;
> -     system_scaled = system;
> -     /* Do MT utilization scaling */
> -     if (smp_cpu_mtid) {
> -             u64 mult = __this_cpu_read(mt_scaling_mult);
> -             u64 div = __this_cpu_read(mt_scaling_div);
> -
> -             system_scaled = (system_scaled * mult) / div;
> -     }
> -     account_system_time(tsk, 0, system, system_scaled);
> -
> -     virt_timer_forward(system);
> +     timer -= S390_lowcore.last_update_timer;
> +
> +     if ((tsk->flags & PF_VCPU) && (irq_count() == 0))
> +             S390_lowcore.guest_timer += timer;
> +     else if (hardirq_count())
> +             S390_lowcore.hardirq_timer += timer;
> +     else if (in_serving_softirq())
> +             S390_lowcore.softirq_timer += timer;
> +     else
> +             S390_lowcore.system_timer += timer;

And Ditto.

We could put together the accumulation in a common struct in s390_lowcore,
and its mirror in thread struct then have helpers take care of the contexts.

How does that sound to you, would it help or hurt?

Thanks.

Reply via email to