RE: [PATCH] aio: make aio wait path to account iowait time

2020-08-28 Thread Tianxianting
Thanks peterz, Jan.
So enabling aio iowait time accounting is a bad idea :(
I learned a lot from you, thanks.

-Original Message-
From: pet...@infradead.org [mailto:pet...@infradead.org] 
Sent: Friday, August 28, 2020 6:52 PM
To: Jan Kara 
Cc: tianxianting (RD) ; v...@zeniv.linux.org.uk; 
b...@kvack.org; mi...@redhat.com; juri.le...@redhat.com; 
vincent.guit...@linaro.org; dietmar.eggem...@arm.com; rost...@goodmis.org; 
bseg...@google.com; mgor...@suse.de; linux-fsde...@vger.kernel.org; 
linux-...@kvack.org; linux-kernel@vger.kernel.org; Tejun Heo ; 
han...@cmpxchg.org
Subject: Re: [PATCH] aio: make aio wait path to account iowait time

On Fri, Aug 28, 2020 at 11:41:29AM +0200, Jan Kara wrote:
> On Fri 28-08-20 11:07:29, pet...@infradead.org wrote:
> > On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> > > As the normal aio wait path (read_events() ->
> > > wait_event_interruptible_hrtimeout()) doesn't account iowait time,
> > > use this patch to make it account iowait time, which can
> > > truly reflect the system IO situation when using a tool like 'top'.
> > 
> > Do be aware though that io_schedule() is potentially far more 
> > expensive than regular schedule() and io-wait accounting as a whole 
> > is a trainwreck.
> 
> Hum, I didn't know that io_schedule() is that much more expensive.
> Thanks for the info.

It's all relative, but it can add up under contention. And since these storage 
thingies are getting faster every year, I'm assuming these schedule rates are 
increasing along with it.

> > When in_iowait is set schedule() and ttwu() will have to do 
> > additional atomic ops, and (much) worse, PSI will take additional locks.
> > 
> > And all that for a number that, IMO, is mostly useless, see the 
> > comment with nr_iowait().
> 
> Well, I understand the limited usefulness of the system or even per 
> CPU percentage spent in IO wait. However whether a particular task is 
> sleeping waiting for IO or not

So strict per-task state is not a problem, and we could easily change
get_task_state() to distinguish between IO-wait or not, basically duplicate S/D 
state into an IO-wait variant of the same. Although even this has ABI 
implications :-(

> is IMO useful diagnostic information and there are several places in
> the kernel that take that into account (PSI, hangcheck timer, cpufreq,
> ...).

So PSI is the one I hate most. We spend an awful lot of time to not have to
take the old rq->lock on wakeup, and PSI reintroduced it for accounting
purposes -- I hate accounting overhead. :/

There are a number of high-frequency scheduling workloads where it really adds
up, which is the reason we got rid of it in the first place.

OTOH, PSI gives more sensible numbers, although it goes sideways when you
introduce affinity masks / cpusets.

The menu-cpufreq governor is known to be crazy and we're all working hard on replacing it.

And the tick-sched usage is, iirc, the nohz case of iowait.

> So I don't see how properly accounting that a task is waiting for IO
> is just an "expensive random number generator" as you mention below :).
> But I'm open to being educated...

It's the userspace iowait, and in particular the per-cpu iowait numbers that I 
hate. Only on UP does any of that make sense.

But we can't remove them because ABI :-(
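
For reference, a rough paraphrase of how that per-CPU number is put together
(based on kernel/sched/core.c; exact types and details vary by kernel version).
A sleeping task's iowait is charged to the runqueue it last ran on, which is
part of why the per-CPU attribution is fuzzy on SMP:

  /* Paraphrased sketch, not a verbatim kernel excerpt. */
  unsigned int nr_iowait_cpu_sketch(int cpu)
  {
          /* tasks that went to sleep via an io_schedule()-style path
           * while last queued on this CPU's runqueue */
          return atomic_read(&cpu_rq(cpu)->nr_iowait);
  }

  unsigned int nr_iowait_sketch(void)
  {
          unsigned int cpu, sum = 0;

          /* the system-wide figure is simply the sum of the per-rq counters */
          for_each_possible_cpu(cpu)
                  sum += nr_iowait_cpu_sketch(cpu);
          return sum;
  }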



Re: [PATCH] aio: make aio wait path to account iowait time

2020-08-28 Thread peterz
On Fri, Aug 28, 2020 at 11:41:29AM +0200, Jan Kara wrote:
> On Fri 28-08-20 11:07:29, pet...@infradead.org wrote:
> > On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> > > As the normal aio wait path (read_events() ->
> > > wait_event_interruptible_hrtimeout()) doesn't account iowait time, use
> > > this patch to make it account iowait time, which can truly reflect
> > > the system IO situation when using a tool like 'top'.
> > 
> > Do be aware though that io_schedule() is potentially far more expensive
> > than regular schedule() and io-wait accounting as a whole is a
> > trainwreck.
> 
> Hum, I didn't know that io_schedule() is that much more expensive. Thanks
> for the info.

It's all relative, but it can add up under contention. And since these
storage thingies are getting faster every year, I'm assuming these
schedule rates are increasing along with it.

> > When in_iowait is set schedule() and ttwu() will have to do additional
> > atomic ops, and (much) worse, PSI will take additional locks.
> > 
> > And all that for a number that, IMO, is mostly useless, see the comment
> > with nr_iowait().
> 
> Well, I understand the limited usefulness of the system or even per CPU
> percentage spent in IO wait. However whether a particular task is sleeping
> waiting for IO or not

So strict per-task state is not a problem, and we could easily change
get_task_state() to distinguish between IO-wait or not, basically
duplicate S/D state into an IO-wait variant of the same. Although even
this has ABI implications :-(

> is IMO useful diagnostic information and there are
> several places in the kernel that take that into account (PSI, hangcheck
> timer, cpufreq, ...).

So PSI is the one I hate most. We spend an awful lot of time to not
have to take the old rq->lock on wakeup, and PSI reintroduced it for
accounting purposes -- I hate accounting overhead. :/

There are a number of high-frequency scheduling workloads where it really
adds up, which is the reason we got rid of it in the first place.

OTOH, PSI gives more sensible numbers, although it goes sideways when
you introduce affinity masks / cpusets.

The menu-cpufreq governor is known to be crazy and we're all working hard on
replacing it.

And the tick-sched usage is, iirc, the nohz case of iowait.

> So I don't see how properly accounting that a task
> is waiting for IO is just an "expensive random number generator" as you
> mention below :). But I'm open to being educated...

It's the userspace iowait, and in particular the per-cpu iowait numbers
that I hate. Only on UP does any of that make sense.

But we can't remove them because ABI :-(
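
To make the per-task-state point above concrete, here is a purely hypothetical
sketch (not an existing or posted kernel change) of how a get_task_state()-style
lookup could report an IO-wait flavour of the S/D states based on
task->in_iowait. The function name and the new strings are invented for
illustration, and, as noted, even this would touch the /proc ABI:

  /* Hypothetical illustration only; the name and the new strings are made up. */
  static const char *task_state_with_iowait(struct task_struct *tsk)
  {
          /* task_state_index(): 1 == "S (sleeping)", 2 == "D (disk sleep)"
           * in the existing task_state_array[] of fs/proc/array.c */
          unsigned int idx = task_state_index(tsk);

          /* report an iowait variant of S/D when the task slept in iowait */
          if (tsk->in_iowait) {
                  if (idx == 1)
                          return "S (iowait)";
                  if (idx == 2)
                          return "D (iowait)";
          }
          return task_state_array[idx];   /* unchanged for everything else */
  }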



Re: [PATCH] aio: make aio wait path to account iowait time

2020-08-28 Thread Jan Kara
On Fri 28-08-20 11:07:29, pet...@infradead.org wrote:
> On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> > As the normal aio wait path (read_events() ->
> > wait_event_interruptible_hrtimeout()) doesn't account iowait time, use
> > this patch to make it account iowait time, which can truly reflect
> > the system IO situation when using a tool like 'top'.
> 
> Do be aware though that io_schedule() is potentially far more expensive
> than regular schedule() and io-wait accounting as a whole is a
> trainwreck.

Hum, I didn't know that io_schedule() is that much more expensive. Thanks
for the info.

> When in_iowait is set schedule() and ttwu() will have to do additional
> atomic ops, and (much) worse, PSI will take additional locks.
> 
> And all that for a number that, IMO, is mostly useless, see the comment
> with nr_iowait().

Well, I understand the limited usefulness of the system or even per CPU
percentage spent in IO wait. However whether a particular task is sleeping
waiting for IO or not is IMO useful diagnostic information and there are
several places in the kernel that take that into account (PSI, hangcheck
timer, cpufreq, ...). So I don't see how properly accounting that a task
is waiting for IO is just an "expensive random number generator" as you
mention below :). But I'm open to being educated...

> But, if you don't care about performance, and want to see a shiny random
> number generator, by all means, use io_schedule().

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] aio: make aio wait path to account iowait time

2020-08-28 Thread peterz
On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> As the normal aio wait path (read_events() ->
> wait_event_interruptible_hrtimeout()) doesn't account iowait time, use
> this patch to make it account iowait time, which can truly reflect
> the system IO situation when using a tool like 'top'.

Do be aware though that io_schedule() is potentially far more expensive
than regular schedule() and io-wait accounting as a whole is a
trainwreck.

When in_iowait is set schedule() and ttwu() will have to do additional
atomic ops, and (much) worse, PSI will take additional locks.
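
Roughly where that extra work lives (paraphrased from kernel/sched/core.c;
details vary by kernel version): io_schedule() brackets the sleep with
current->in_iowait, and schedule()/ttwu() then maintain a per-runqueue atomic
counter on top of delay accounting and, when enabled, PSI state updates:

  /* Paraphrased sketch of the in_iowait bookkeeping, not a verbatim excerpt. */
  int io_schedule_prepare(void)
  {
          int old_iowait = current->in_iowait;

          current->in_iowait = 1;         /* mark this sleep as iowait */
          /* the real helper also flushes the task's block plug here */
          return old_iowait;
  }

  void io_schedule_finish(int token)
  {
          current->in_iowait = token;     /* restore the previous state */
  }

  /* inside __schedule(), when a task blocks:
   *         if (prev->in_iowait)
   *                 atomic_inc(&rq->nr_iowait);
   * and inside try_to_wake_up(), when it is woken again:
   *         if (p->in_iowait)
   *                 atomic_dec(&task_rq(p)->nr_iowait);
   */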

And all that for a number that, IMO, is mostly useless, see the comment
with nr_iowait().

But, if you don't care about performance, and want to see a shiny random
number generator, by all means, use io_schedule().
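
For context, a minimal sketch of the kind of change being discussed for the
aio wait path (an illustration of the idea, not the posted patch): bracket the
existing wait in read_events() with the in_iowait helpers so the schedule()
underneath it is counted as iowait, with all of the costs described above:

  /* Sketch only; assumes the fs/aio.c context (ctx->wait, aio_read_events()). */
  static long read_events_iowait_sketch(struct kioctx *ctx, long min_nr, long nr,
                                        struct io_event __user *event,
                                        ktime_t until)
  {
          long ret = 0;
          int token;

          /* mark the wait as iowait; returns the previous in_iowait value */
          token = io_schedule_prepare();
          wait_event_interruptible_hrtimeout(ctx->wait,
                          aio_read_events(ctx, min_nr, nr, event, &ret),
                          until);
          io_schedule_finish(token);      /* restore the previous in_iowait */

          return ret;
  }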


Re: [PATCH] aio: make aio wait path to account iowait time

2020-08-28 Thread Jan Kara
On Fri 28-08-20 14:07:12, Xianting Tian wrote:
> As the normal aio wait path (read_events() ->
> wait_event_interruptible_hrtimeout()) doesn't account iowait time, use
> this patch to make it account iowait time, which can truly reflect
> the system IO situation when using a tool like 'top'.
> 
> The test result as below.
> 
> Test environment:
>   Architecture:          x86_64
>   CPU op-mode(s):        32-bit, 64-bit
>   Byte Order:            Little Endian
>   CPU(s):                32
>   On-line CPU(s) list:   0-31
>   Thread(s) per core:    2
>   Core(s) per socket:    8
>   Socket(s):             2
>   NUMA node(s):          2
>   Vendor ID:             GenuineIntel
>   CPU family:            6
>   Model:                 85
>   Model name:            Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz
>   Stepping:              4
>   CPU MHz:               801.660
> 
> AIO test command:
>   fio -ioengine=libaio -bs=8k -direct=1 -numjobs 32 -rw=read -size=10G
>   -filename=/dev/sda3 -name="Max throughput" -iodepth=128 -runtime=60
> 
> Before the test, set nr_requests to 512 (default is 256), aiming to keep the
> backend device busy handling IO requests so that io_getevents() has more
> chances to wait for IO completion:
>   echo 512 > /sys/block/sda/queue/nr_requests
> 
> The fio test result with the AIO iowait time accounting patch is shown below;
> almost all fio threads are in the 'D' state waiting for IO completion, and the
> iowait time is accounted as 48.0%:
>   top - 19:19:23 up 29 min,  2 users,  load average: 14.60, 9.45, 9.45
>   Tasks: 456 total,   4 running, 247 sleeping,   0 stopped,   0 zombie
>   %Cpu(s):  0.4 us,  1.0 sy,  0.0 ni, 50.6 id, 48.0 wa,  0.0 hi,  0.0 si,  0.0 st
>   KiB Mem : 19668915+total, 19515264+free,   866476 used,   670028 buff/cache
>   KiB Swap:  4194300 total,  4194300 free,        0 used. 19449948+avail Mem
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>   16135 root  20   0  294092  63724  63060 S   1.7  0.0   0:03.31 fio
>   16173 root  20   0  272352   3540   1792 D   1.7  0.0   0:03.85 fio
>   16175 root  20   0  272360   3544   1796 D   1.7  0.0   0:03.85 fio
>   16185 root  20   0  272400   3556   1808 D   1.7  0.0   0:03.84 fio
>   16187 root  20   0  272408   3552   1804 D   1.7  0.0   0:03.82 fio
>   16190 root  20   0  272420   3500   1804 R   1.7  0.0   0:03.88 fio
>   16169 root  20   0  272336   3444   1740 D   1.3  0.0   0:03.75 fio
>   16170 root  20   0  272340   3504   1804 R   1.3  0.0   0:03.80 fio
>   16172 root  20   0  272348   3500   1800 D   1.3  0.0   0:03.86 fio
>   16174 root  20   0  272356   3544   1796 D   1.3  0.0   0:03.77 fio
>   16179 root  20   0  272376   3528   1780 D   1.3  0.0   0:03.79 fio
>   16180 root  20   0  272380   3500   1800 D   1.3  0.0   0:03.85 fio
>   16181 root  20   0  272384   3552   1804 D   1.3  0.0   0:03.87 fio
>   16182 root  20   0  272388   3520   1772 D   1.3  0.0   0:03.80 fio
>   16183 root  20   0  272392   3552   1804 D   1.3  0.0   0:03.77 fio
>   16186 root  20   0  272404   3500   1804 D   1.3  0.0   0:03.88 fio
>   16188 root  20   0  272412   3500   1800 D   1.3  0.0   0:03.89 fio
>   16191 root  20   0  272424   3500   1800 D   1.3  0.0   0:03.92 fio
>   16192 root  20   0  272428   3500   1800 D   1.3  0.0   0:03.87 fio
>   16194 root  20   0  272436   3500   1804 D   1.3  0.0   0:03.82 fio
>   16195 root  20   0  272440   3500   1800 R   1.3  0.0   0:03.82 fio
>   16196 root  20   0  272444   3552   1804 D   1.3  0.0   0:03.84 fio
>   16198 root  20   0  272452   3500   1804 D   1.3  0.0   0:03.89 fio
>   16199 root  20   0  272456   3504   1800 D   1.3  0.0   0:03.84 fio
>   16200 root  20   0  272460   3552   1804 D   1.3  0.0   0:03.85 fio
>   16171 root  20   0  272344   3504   1800 D   1.0  0.0   0:03.84 fio
>   16176 root  20   0  272364   3520   1772 D   1.0  0.0   0:03.76 fio
>   16177 root  20   0  272368   3556   1808 D   1.0  0.0   0:03.74 fio
>   16178 root  20   0  272372   3500   1804 D   1.0  0.0   0:03.90 fio
>   16184 root  20   0  272396   3500   1800 D   1.0  0.0   0:03.83 fio
>   16189 root  20   0  272416   3500   1804 D   1.0  0.0   0:03.86 fio
>   16193 root  20   0  272432   3500   1804 D   1.0  0.0   0:03.85 fio
>   16197 root  20   0  272448   3556   1808 D   1.0  0.0   0:03.75 fio
> 
> The fio test result without the AIO iowait time accounting patch is shown below;
> almost all fio threads are in the 'S' state waiting for IO completion, and the
> iowait time is not accounted: iowait is 0.0%.
>   top - 19:20:44 up 31 min,  2 users,  load average: 12.50, 10.15,