Is kern.sched.preempt_thresh=0 a sensible default? (was: Re: Extremely low disk throughput under high compute load)

2018-04-04 Thread Stefan Esser
On 02.04.18 at 00:18, Stefan Esser wrote:
> On 01.04.18 at 18:33, Warner Losh wrote:
>> On Sun, Apr 1, 2018 at 9:18 AM, Stefan Esser wrote:
>>
>> My i7-2600K based system with 24 GB RAM was in the midst of a buildworld -j8
>> (starting from a clean state) which caused a load average of 12 for more than
>> 1 hour, when I decided to move a directory structure holding some 10 GB to
>> its own ZFS file system. File sizes varied, but were mostly in the range of
>> 500 KB.
>>
>> I had just thrown away /usr/obj, but /usr/src was cached in ARC and thus
>> there was nearly no disk activity caused by the buildworld.
>>
>> The copying proceeded at a rate of at most 10 MB/s, but most of the time less
>> than 100 KB/s were transferred. The "cp" process had a PRIO of 20 and thus a
>> much better priority than the compute-bound compiler processes, but it got
>> just 0.2% to 0.5% of 1 CPU core. Apparently, the copy process was scheduled
>> at such a low rate that it only managed to issue a few controller writes per
>> second.
>>
>> The system is healthy and does not show any problems or anomalies under
>> normal use (e.g., file copies are fast, without the high compute load).
>>
>> This was with SCHED_ULE on a -CURRENT without WITNESS or malloc debugging.
>>
>> Is this a regression in -CURRENT?
>>
>> Does 'sync' push a lot of I/O to the disk?
> 
> Each sync takes 0.7 to 1.5 seconds to complete, but since reading is so
> slow, not much is written.
> 
> Normal gstat output for the 3 drives the RAIDZ1 consists of:
> 
> dT: 1.002s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0      2      2     84   39.1      0      0    0.0    7.8  ada0
>     0      4      4     92   66.6      0      0    0.0   26.6  ada1
>     0      6      6    259   66.9      0      0    0.0   36.2  ada3
> dT: 1.058s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0      1      1     60   70.6      0      0    0.0    6.7  ada0
>     0      3      3     68   71.3      0      0    0.0   20.2  ada1
>     0      6      6    242   65.5      0      0    0.0   28.8  ada3
> dT: 1.002s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>     0      5      5    192   44.8      0      0    0.0   22.4  ada0
>     0      6      6    160   61.9      0      0    0.0   26.5  ada1
>     0      6      6    172   43.7      0      0    0.0   26.2  ada3
> 
> This includes the copy process and the reads caused by "make -j 8 world"
> (but I assume that all the source files are already cached in ARC).

I have identified the cause of the extremely low I/O performance (2 to 6 read
operations scheduled per second).

With the default of kern.sched.preempt_thresh=0, the I/O-bound process does not
get any CPU until a (long) time slice expires (kern.sched.quantum=94488 on my
system with HZ=1000) or one of the CPU-bound processes voluntarily gives up the
CPU (or exits).

Any non-zero value of preempt_thresh lets the system perform I/O in parallel
with the CPU-bound processes again.
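
For reference, the threshold can be inspected and changed at run-time with
sysctl; the value 80 below is just what I am experimenting with, not a
recommendation:

  # show the current scheduler settings
  sysctl kern.sched.preempt_thresh kern.sched.quantum

  # let threads with a priority at or below the threshold preempt the
  # currently running thread
  sysctl kern.sched.preempt_thresh=80

  # make the setting persistent across reboots
  echo 'kern.sched.preempt_thresh=80' >> /etc/sysctl.conf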

I'm not sure about the bias between preempt_thresh and the PRI values displayed
by top, but in my opinion a process with a PRI above 72 (as shown by top) should
be eligible for preemption.

What value of preempt_thresh should I use to get that behavior?


And, more importantly: is preempt_thresh=0 a reasonable default???

This prevents I/O-bound processes from making reasonable progress while all CPU
cores/threads are busy. In my case, throughput dropped from more than 10 MB/s
to just a few hundred KB per second, i.e. by a factor of about 30. (The %busy
values in my previous mail are misleading: at 10 MB/s the disk was about 70%
busy ...)


Should preempt_thresh be set to some value, possibly a high one so that only
long-running processes get preempted?

Regards, STefan


Re: Extremely low disk throughput under high compute load

2018-04-01 Thread Stefan Esser
On 01.04.18 at 18:33, Warner Losh wrote:
> 
> 
> On Sun, Apr 1, 2018 at 9:18 AM, Stefan Esser wrote:
> 
> My i7-2600K based system with 24 GB RAM was in the midst of a buildworld -j8
> (starting from a clean state) which caused a load average of 12 for more than
> 1 hour, when I decided to move a directory structure holding some 10 GB to
> its own ZFS file system. File sizes varied, but were mostly in the range of
> 500 KB.
>
> I had just thrown away /usr/obj, but /usr/src was cached in ARC and thus
> there was nearly no disk activity caused by the buildworld.
>
> The copying proceeded at a rate of at most 10 MB/s, but most of the time less
> than 100 KB/s were transferred. The "cp" process had a PRIO of 20 and thus a
> much better priority than the compute-bound compiler processes, but it got
> just 0.2% to 0.5% of 1 CPU core. Apparently, the copy process was scheduled
> at such a low rate that it only managed to issue a few controller writes per
> second.
>
> The system is healthy and does not show any problems or anomalies under
> normal use (e.g., file copies are fast, without the high compute load).
>
> This was with SCHED_ULE on a -CURRENT without WITNESS or malloc debugging.
>
> Is this a regression in -CURRENT?
>
> Does 'sync' push a lot of I/O to the disk?

Each sync takes 0.7 to 1.5 seconds to complete, but since reading is so
slow, not much is written.

Normal gstat output for the 3 drives the RAIDZ1 consists of:

dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      2      2     84   39.1      0      0    0.0    7.8  ada0
    0      4      4     92   66.6      0      0    0.0   26.6  ada1
    0      6      6    259   66.9      0      0    0.0   36.2  ada3
dT: 1.058s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      1      1     60   70.6      0      0    0.0    6.7  ada0
    0      3      3     68   71.3      0      0    0.0   20.2  ada1
    0      6      6    242   65.5      0      0    0.0   28.8  ada3
dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      5      5    192   44.8      0      0    0.0   22.4  ada0
    0      6      6    160   61.9      0      0    0.0   26.5  ada1
    0      6      6    172   43.7      0      0    0.0   26.2  ada3

This includes the copy process and the reads caused by "make -j 8 world"
(but I assume that all the source files are already cached in ARC).
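
(The samples above are plain interactive gstat output; restricting it to the
pool members, e.g. with

  gstat -I 1s -f '^ada[013]$'

makes longer observation periods easier to read.)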

During sync:

dT: 1.002s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    1    101      9    132   14.6     90   1760    5.6   59.7  ada0
    2    110     16    267   15.0     92   1756    6.0   50.7  ada1
    2     82     13    291   17.8     67   1653    7.4   34.3  ada3

ZFS is configured to flush dirty buffers after 5 seconds, so there are
not many dirty buffers in RAM at any time, anyway.
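
(The 5 second interval is the ZFS transaction group timeout, visible as

  sysctl vfs.zfs.txg.timeout
  vfs.zfs.txg.timeout: 5

on this system.)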

> Is the effective throughput of cp tiny or large? It's tiny, if I read right,
> and the I/O is slow (as opposed to it all buffering in memory and being slow
> to drain down), right?

Yes, reading is very slow, with less than 10 read operations scheduled
per second.
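
(Per-process I/O rates could also be checked with top in I/O mode, e.g.

  top -m io -o total

but the gstat output above already shows that hardly any reads reach the
disks.)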

Top output taken at the same time as above gstat samples:

last pid: 24306;  load averages: 12.07, 11.51,  8.13   up 2+05:41:57  00:10:22
132 processes: 10 running, 122 sleeping
CPU: 98.2% user,  0.0% nice,  1.7% system,  0.1% interrupt,  0.0% idle
Mem: 1069M Active, 1411M Inact, 269M Laundry, 20G Wired, 1076M Free
ARC: 16G Total, 1234M MFU, 14G MRU, 83M Anon, 201M Header, 786M Other
 14G Compressed, 30G Uncompressed, 2.09:1 Ratio
Swap: 24G Total, 533M Used, 23G Free, 2% Inuse

  PID USERNAME   THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
24284 root         1  92    0   228M   199M CPU6    6   0:11 101.34% c++
24287 root         1  91    0   269M   241M CPU3    3   0:10 101.32% c++
24266 root         1  97    0   303M   276M CPU0    0   0:17 101.13% c++
24297 root         1  85    0   213M   184M CPU1    1   0:06  98.40% c++
24281 root         1  93    0   245M   217M CPU7    7   0:12  96.76% c++
24300 root         1  76    0   114M 89268K RUN     2   0:02  83.22% c++
24303 root         1  75    0   105M 79908K CPU4    4   0:01  59.94% c++
24302 root         1  52    0 74940K 47264K wait    4   0:00   0.35% c++
24299 root         1  52    0 74960K 47268K wait    2   0:00   0.33% c++
20954 root         1  20    0 15528K  4900K zio->i  3   0:02   0.11% cp

ARC is limited to 18 GB to leave 6 GB RAM for use by kernel and user programs.
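
The limit corresponds to a tunable along the lines of

  vfs.zfs.arc_max="19327352832"   # 18 GB, e.g. in /boot/loader.conf

The ARC related sysctl values: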

vfs.zfs.arc_meta_limit: 45
vfs.zfs.arc_free_target: 42339
vfs.zfs.compressed_arc_enabled: 1
vfs.zfs.arc_grow_retry: 60
vfs.zfs.arc_shrink_shift: 7
vfs.zfs.arc_average_blocksize: 8192
vfs.zfs.arc_no_grow_shift: 5
vfs.zfs.arc_min: 225000
vfs.zfs.arc_max

Re: Extremely low disk throughput under high compute load

2018-04-01 Thread Warner Losh
On Sun, Apr 1, 2018 at 9:18 AM, Stefan Esser wrote:

> My i7-2600K based system with 24 GB RAM was in the midst of a buildworld -j8
> (starting from a clean state) which caused a load average of 12 for more than
> 1 hour, when I decided to move a directory structure holding some 10 GB to
> its own ZFS file system. File sizes varied, but were mostly in the range of
> 500 KB.
>
> I had just thrown away /usr/obj, but /usr/src was cached in ARC and thus
> there was nearly no disk activity caused by the buildworld.
>
> The copying proceeded at a rate of at most 10 MB/s, but most of the time less
> than 100 KB/s were transferred. The "cp" process had a PRIO of 20 and thus a
> much better priority than the compute-bound compiler processes, but it got
> just 0.2% to 0.5% of 1 CPU core. Apparently, the copy process was scheduled
> at such a low rate that it only managed to issue a few controller writes per
> second.
>
> The system is healthy and does not show any problems or anomalies under
> normal use (e.g., file copies are fast, without the high compute load).
>
> This was with SCHED_ULE on a -CURRENT without WITNESS or malloc debugging.
>
> Is this a regression in -CURRENT?
>

Does 'sync' push a lot of I/O to the disk?

Is the effective throughput of cp tiny or large? It's tiny, if I read right,
and the I/O is slow (as opposed to it all buffering in memory and being slow
to drain down), right?

Warner


Extremely low disk throughput under high compute load

2018-04-01 Thread Stefan Esser
My i7-2600K based system with 24 GB RAM was in the midst of a buildworld -j8
(starting from a clean state) which caused a load average of 12 for more than
1 hour, when I decided to move a directory structure holding some 10 GB to its
own ZFS file system. File sizes varied, but were mostly in the range of 500 KB.

I had just thrown away /usr/obj, but /usr/src was cached in ARC and thus there
was nearly no disk activity caused by the buildworld.

The copying proceeded at a rate of at most 10 MB/s, but most of the time less
than 100 KB/s were transferred. The "cp" process had a PRIO of 20 and thus a
much better priority than the compute-bound compiler processes, but it got
just 0.2% to 0.5% of 1 CPU core. Apparently, the copy process was scheduled at
such a low rate that it only managed to issue a few controller writes per
second.

The system is healthy and does not show any problems or anomalies under
normal use (e.g., file copies are fast, without the high compute load).

This was with SCHED_ULE on a -CURRENT without WITNESS or malloc debugging.

Is this a regression in -CURRENT?

Regards, STefan