RE: [PATCH] aio: make aio wait path to account iowait time
Thanks peterz, jan.

So, enabling aio iowait time accounting is a bad idea :( I learned a lot
from you, thanks.

-----Original Message-----
From: pet...@infradead.org [mailto:pet...@infradead.org]
Sent: Friday, August 28, 2020 6:52 PM
To: Jan Kara
Cc: tianxianting (RD); v...@zeniv.linux.org.uk; b...@kvack.org;
mi...@redhat.com; juri.le...@redhat.com; vincent.guit...@linaro.org;
dietmar.eggem...@arm.com; rost...@goodmis.org; bseg...@google.com;
mgor...@suse.de; linux-fsde...@vger.kernel.org; linux-...@kvack.org;
linux-kernel@vger.kernel.org; Tejun Heo; han...@cmpxchg.org
Subject: Re: [PATCH] aio: make aio wait path to account iowait time

[quoted reply trimmed; the same message appears in full below]
Re: [PATCH] aio: make aio wait path to account iowait time
On Fri, Aug 28, 2020 at 11:41:29AM +0200, Jan Kara wrote:
> On Fri 28-08-20 11:07:29, pet...@infradead.org wrote:
> > On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> > > As the normal aio wait path(read_events() ->
> > > wait_event_interruptible_hrtimeout()) doesn't account iowait time, so use
> > > this patch to make it to account iowait time, which can truely reflect
> > > the system io situation when using a tool like 'top'.
> >
> > Do be aware though that io_schedule() is potentially far more expensive
> > than regular schedule() and io-wait accounting as a whole is a
> > trainwreck.
>
> Hum, I didn't know that io_schedule() is that much more expensive.
> Thanks for info.

It's all relative, but it can add up under contention. And since these
storage thingies are getting faster every year, I'm assuming these
schedule rates are increasing along with it.

> > When in_iowait is set schedule() and ttwu() will have to do additional
> > atomic ops, and (much) worse, PSI will take additional locks.
> >
> > And all that for a number that, IMO, is mostly useless, see the comment
> > with nr_iowait().
>
> Well, I understand the limited usefulness of the system or even per CPU
> percentage spent in IO wait. However whether a particular task is
> sleeping waiting for IO or not

So strict per-task state is not a problem, and we could easily change
get_task_state() to distinguish between IO-wait or not, basically
duplicate S/D state into an IO-wait variant of the same. Although even
this has ABI implications :-(

> is IMO a useful diagnostic information and there are several places in
> the kernel that take that into account (PSI, hangcheck timer, cpufreq,
> ...).

So PSI is the one I hate most. We spend an awful lot of time to not have
to take the old rq->lock on wakeup, and PSI reintroduced it for
accounting purposes -- I hate accounting overhead. :/

There's a number of high frequency scheduling workloads where it really
adds up, which is the reason we got rid of it in the first place.

OTOH, PSI gives more sensible numbers, although it goes side-ways when
you introduce affinity masks / cpusets.

The menu-cpufreq gov is known crazy and we're all working hard on
replacing it.

And the tick-sched usage is, iirc, the nohz case of iowait.

> So I don't see that properly accounting that a task is waiting for IO
> is just "expensive random number generator" as you mention below :).
> But I'm open to being educated...

It's the userspace iowait, and in particular the per-cpu iowait numbers
that I hate. Only on UP does any of that make sense.

But we can't remove them because ABI :-(
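For reference, the per-cpu iowait number criticized here comes from a
per-runqueue counter; the helpers below paraphrase kernel/sched/core.c
from around that time (illustrative, not the exact upstream code) and
show why the per-CPU split is only really coherent on UP: the sleeping
task is charged to the CPU it slept on, not the CPU it will next run on.

/*
 * Paraphrase of kernel/sched/core.c, trimmed for illustration.
 * "iowait" as reported to userspace is just the count of tasks that
 * blocked with ->in_iowait set, charged to the CPU they slept on.
 * Nothing keeps the task on that CPU, so on SMP the per-CPU value
 * is essentially meaningless -- which is the complaint above.
 */
unsigned long nr_iowait_cpu(int cpu)
{
	return atomic_read(&cpu_rq(cpu)->nr_iowait);
}

unsigned long nr_iowait(void)
{
	unsigned long i, sum = 0;

	for_each_possible_cpu(i)
		sum += nr_iowait_cpu(i);

	return sum;
}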
Re: [PATCH] aio: make aio wait path to account iowait time
On Fri 28-08-20 11:07:29, pet...@infradead.org wrote:
> On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> > As the normal aio wait path(read_events() ->
> > wait_event_interruptible_hrtimeout()) doesn't account iowait time, so use
> > this patch to make it to account iowait time, which can truely reflect
> > the system io situation when using a tool like 'top'.
>
> Do be aware though that io_schedule() is potentially far more expensive
> than regular schedule() and io-wait accounting as a whole is a
> trainwreck.

Hum, I didn't know that io_schedule() is that much more expensive. Thanks
for the info.

> When in_iowait is set schedule() and ttwu() will have to do additional
> atomic ops, and (much) worse, PSI will take additional locks.
>
> And all that for a number that, IMO, is mostly useless, see the comment
> with nr_iowait().

Well, I understand the limited usefulness of the system-wide or even
per-CPU percentage spent in IO wait. However, whether a particular task
is sleeping waiting for IO or not is IMO useful diagnostic information,
and there are several places in the kernel that take that into account
(PSI, hangcheck timer, cpufreq, ...). So I don't see that properly
accounting that a task is waiting for IO is just an "expensive random
number generator" as you mention below :). But I'm open to being
educated...

> But, if you don't care about performance, and want to see a shiny random
> number generator, by all means, use io_schedule().

								Honza
--
Jan Kara
SUSE Labs, CR
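As one concrete example of the in-kernel consumers Jan lists, cpufreq's
schedutil governor is told about iowait wakeups from the fair-class
enqueue path. Roughly, paraphrasing kernel/sched/fair.c (trimmed and not
the exact upstream code):

/*
 * Paraphrase of enqueue_task_fair() in kernel/sched/fair.c: when a task
 * that slept in iowait becomes runnable again, the cpufreq governor is
 * notified so it can apply a temporary "iowait boost" to CPU frequency.
 */
static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	/* ... */
	if (p->in_iowait)
		cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);
	/* ... */
}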
Re: [PATCH] aio: make aio wait path to account iowait time
On Fri, Aug 28, 2020 at 02:07:12PM +0800, Xianting Tian wrote:
> As the normal aio wait path(read_events() ->
> wait_event_interruptible_hrtimeout()) doesn't account iowait time, so use
> this patch to make it to account iowait time, which can truely reflect
> the system io situation when using a tool like 'top'.

Do be aware though that io_schedule() is potentially far more expensive
than regular schedule() and io-wait accounting as a whole is a
trainwreck.

When in_iowait is set schedule() and ttwu() will have to do additional
atomic ops, and (much) worse, PSI will take additional locks.

And all that for a number that, IMO, is mostly useless, see the comment
with nr_iowait().

But, if you don't care about performance, and want to see a shiny random
number generator, by all means, use io_schedule().
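To make the cost concrete, the sketch below paraphrases the relevant
pieces of kernel/sched/core.c from around v5.9 (trimmed for illustration,
not the exact upstream code): io_schedule() is schedule() plus the
in_iowait bookkeeping, and once ->in_iowait is set the sleep and wakeup
paths each pay an extra atomic counter update (plus the PSI accounting
not shown here).

/* io_schedule() is schedule() plus iowait bookkeeping (paraphrased). */
int io_schedule_prepare(void)
{
	int old_iowait = current->in_iowait;

	current->in_iowait = 1;			/* mark this task as in iowait */
	blk_schedule_flush_plug(current);	/* push out any plugged IO first */
	return old_iowait;
}

void io_schedule_finish(int token)
{
	current->in_iowait = token;		/* restore the previous state */
}

void __sched io_schedule(void)
{
	int token = io_schedule_prepare();

	schedule();
	io_schedule_finish(token);
}

/* ...and the extra work in the scheduler proper when ->in_iowait is set: */

/* in __schedule(), as the task blocks: */
	if (prev->in_iowait) {
		atomic_inc(&rq->nr_iowait);		/* extra atomic op on sleep */
		delayacct_blkio_start();
	}

/* in try_to_wake_up(), as the task is woken: */
	if (p->in_iowait) {
		delayacct_blkio_end(p);
		atomic_dec(&task_rq(p)->nr_iowait);	/* extra atomic op on wakeup */
	}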
Re: [PATCH] aio: make aio wait path to account iowait time
On Fri 28-08-20 14:07:12, Xianting Tian wrote:
> As the normal aio wait path(read_events() ->
> wait_event_interruptible_hrtimeout()) doesn't account iowait time, so use
> this patch to make it to account iowait time, which can truely reflect
> the system io situation when using a tool like 'top'.
>
> The test result as below.
>
> Test environment:
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                32
> On-line CPU(s) list:   0-31
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             2
> NUMA node(s):          2
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 85
> Model name:            Intel(R) Xeon(R) Silver 4108 CPU @ 1.80GHz
> Stepping:              4
> CPU MHz:               801.660
>
> AIO test command:
> fio -ioengine=libaio -bs=8k -direct=1 -numjobs 32 -rw=read -size=10G
> -filename=/dev/sda3 -name="Max throughput" -iodepth=128 -runtime=60
>
> Before test, set nr_requests to 512(default is 256), aim to to make the
> backend device busy to handle io request, and make io_getevents() have
> more chances to wait for io completion:
> echo 512 > /sys/block/sda/queue/nr_requests
>
> Fio test result with the AIO iowait time accounting patch showed as below,
> almost all fio threads are in 'D' state to waiting for io completion, and
> the iowait time is accounted as 48.0%:
> top - 19:19:23 up 29 min,  2 users,  load average: 14.60, 9.45, 9.45
> Tasks: 456 total,   4 running, 247 sleeping,   0 stopped,   0 zombie
> %Cpu(s):  0.4 us,  1.0 sy,  0.0 ni, 50.6 id, 48.0 wa,  0.0 hi,  0.0 si,  0.0 st
> KiB Mem : 19668915+total, 19515264+free,   866476 used,   670028 buff/cache
> KiB Swap:  4194300 total,  4194300 free,        0 used. 19449948+avail Mem
>
>   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
> 16135 root      20   0  294092  63724  63060 S   1.7  0.0   0:03.31 fio
> 16173 root      20   0  272352   3540   1792 D   1.7  0.0   0:03.85 fio
> 16175 root      20   0  272360   3544   1796 D   1.7  0.0   0:03.85 fio
> 16185 root      20   0  272400   3556   1808 D   1.7  0.0   0:03.84 fio
> 16187 root      20   0  272408   3552   1804 D   1.7  0.0   0:03.82 fio
> 16190 root      20   0  272420   3500   1804 R   1.7  0.0   0:03.88 fio
> 16169 root      20   0  272336   3444   1740 D   1.3  0.0   0:03.75 fio
> 16170 root      20   0  272340   3504   1804 R   1.3  0.0   0:03.80 fio
> 16172 root      20   0  272348   3500   1800 D   1.3  0.0   0:03.86 fio
> 16174 root      20   0  272356   3544   1796 D   1.3  0.0   0:03.77 fio
> 16179 root      20   0  272376   3528   1780 D   1.3  0.0   0:03.79 fio
> 16180 root      20   0  272380   3500   1800 D   1.3  0.0   0:03.85 fio
> 16181 root      20   0  272384   3552   1804 D   1.3  0.0   0:03.87 fio
> 16182 root      20   0  272388   3520   1772 D   1.3  0.0   0:03.80 fio
> 16183 root      20   0  272392   3552   1804 D   1.3  0.0   0:03.77 fio
> 16186 root      20   0  272404   3500   1804 D   1.3  0.0   0:03.88 fio
> 16188 root      20   0  272412   3500   1800 D   1.3  0.0   0:03.89 fio
> 16191 root      20   0  272424   3500   1800 D   1.3  0.0   0:03.92 fio
> 16192 root      20   0  272428   3500   1800 D   1.3  0.0   0:03.87 fio
> 16194 root      20   0  272436   3500   1804 D   1.3  0.0   0:03.82 fio
> 16195 root      20   0  272440   3500   1800 R   1.3  0.0   0:03.82 fio
> 16196 root      20   0  272444   3552   1804 D   1.3  0.0   0:03.84 fio
> 16198 root      20   0  272452   3500   1804 D   1.3  0.0   0:03.89 fio
> 16199 root      20   0  272456   3504   1800 D   1.3  0.0   0:03.84 fio
> 16200 root      20   0  272460   3552   1804 D   1.3  0.0   0:03.85 fio
> 16171 root      20   0  272344   3504   1800 D   1.0  0.0   0:03.84 fio
> 16176 root      20   0  272364   3520   1772 D   1.0  0.0   0:03.76 fio
> 16177 root      20   0  272368   3556   1808 D   1.0  0.0   0:03.74 fio
> 16178 root      20   0  272372   3500   1804 D   1.0  0.0   0:03.90 fio
> 16184 root      20   0  272396   3500   1800 D   1.0  0.0   0:03.83 fio
> 16189 root      20   0  272416   3500   1804 D   1.0  0.0   0:03.86 fio
> 16193 root      20   0  272432   3500   1804 D   1.0  0.0   0:03.85 fio
> 16197 root      20   0  272448   3556   1808 D   1.0  0.0   0:03.75 fio
>
> Fio test result without the AIO iowait time accounting patch showed as below,
> almost all fio threads are in 'S' state to waiting for io completion, and the
> iowait time is not accounted, iowait is 0.0%.
> top - 19:20:44 up 31 min, 2 users, load average: 12.50, 10.15,
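The patch itself is not quoted in this excerpt. As a rough illustration
of the kind of change being debated -- a hypothetical sketch, not the
submitted patch -- the sleep in the aio read_events() path (fs/aio.c)
could be bracketed with the iowait helpers so that it is charged as
iowait, with exactly the schedule()/ttwu()/PSI costs Peter describes
earlier in the thread:

/*
 * Hypothetical sketch only -- NOT the submitted patch.  It assumes the
 * simplest approach: mark the read_events() sleep as iowait via
 * io_schedule_prepare()/io_schedule_finish(), so the wait shows up in
 * the 'wa' column of top as in the fio results above.
 */
static long read_events_iowait_sketch(struct kioctx *ctx, long min_nr, long nr,
				      struct io_event __user *event,
				      ktime_t until)
{
	long ret = 0;
	int token;

	token = io_schedule_prepare();		/* sets current->in_iowait */
	wait_event_interruptible_hrtimeout(ctx->wait,
			aio_read_events(ctx, min_nr, nr, event, &ret),
			until);
	io_schedule_finish(token);		/* restores the previous state */

	return ret;
}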