On 04.09.2023 05:56, Mark Millard wrote:
On Sep 4, 2023, at 02:00, Mark Millard <mark...@yahoo.com> wrote:
On Sep 3, 2023, at 23:35, Mark Millard <mark...@yahoo.com> wrote:
On Sep 3, 2023, at 22:06, Alexander Motin <m...@freebsd.org> wrote:
On 03.09.2023 22:54, Mark Millard wrote:
After that ^t produced the likes of:
load: 6.39  cmd: sh 4849 [tx->tx_quiesce_done_cv] 10047.33r 0.51u 121.32s 1% 13004k

So the full state is not "tx->tx", but is actually a "tx->tx_quiesce_done_cv", 
which means the thread is waiting for a new transaction to be opened, which requires 
some previous one to be quiesced and then synced.

#0 0xffffffff80b6f103 at mi_switch+0x173
#1 0xffffffff80bc0f24 at sleepq_switch+0x104
#2 0xffffffff80aec4c5 at _cv_wait+0x165
#3 0xffffffff82aba365 at txg_wait_open+0xf5
#4 0xffffffff82a11b81 at dmu_free_long_range+0x151

Here it seems the wait for transaction commit is due to a large amount of delete 
operations, which ZFS tries to spread between separate TXGs.

That fit the context: cleaning out /usr/local/poudriere/data/.m/

You should probably see a large and growing number in the sysctl 
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay.

After the reboot I started a -J64 example. It has avoided the
early "witness exhausted". Again I ^C'd about an hour after
the 2nd builder had started. So: again cleaning out
/usr/local/poudriere/data/.m/ . Only seconds elapsed between each of:

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276042

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 276427

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 277323

# sysctl kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay
kstat.zfs.misc.dmu_tx.dmu_tx_dirty_frees_delay: 278027

As expected, deletes trigger and wait for TXG commits.

I have found a measure of progress: zfs list's USED
for /usr/local/poudriere/data/.m is decreasing. So
ztop's d/s was a good classification: deletes.
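(Something like the following is what I mean by checking that; this is just a
sketch, assuming the path resolves to the mounted dataset, otherwise the dataset
name from a plain zfs list would be needed:)

# zfs list -o name,used,avail /usr/local/poudriere/data/.m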

#5 0xffffffff829a87d2 at zfs_rmnode+0x72
#6 0xffffffff829b658d at zfs_freebsd_reclaim+0x3d
#7 0xffffffff8113a495 at VOP_RECLAIM_APV+0x35
#8 0xffffffff80c5a7d9 at vgonel+0x3a9
#9 0xffffffff80c5af7f at vrecycle+0x3f
#10 0xffffffff829b643e at zfs_freebsd_inactive+0x4e
#11 0xffffffff80c598cf at vinactivef+0xbf
#12 0xffffffff80c590da at vput_final+0x2aa
#13 0xffffffff80c68886 at kern_funlinkat+0x2f6
#14 0xffffffff80c68588 at sys_unlink+0x28
#15 0xffffffff8106323f at amd64_syscall+0x14f
#16 0xffffffff8103512b at fast_syscall_common+0xf8

What we don't see here is what the quiesce and sync threads of the pool are 
actually doing.  The sync thread has plenty of different jobs, including async 
write, async destroy, scrub and others, all of which may delay each other.

Before you reboot the system, depending on how alive it is, could you save a 
number of outputs of `procstat -akk`, or at least `procstat -akk | 
grep txg_thread_enter` if the full output is too much?  Or somehow else observe what they 
are doing.
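
(For instance, a minimal loop along these lines could grab several snapshots a
few seconds apart; the output file names are just placeholders:)

    i=0
    while [ $i -lt 6 ]; do                      # six snapshots, adjust as needed
        procstat -akk > ~/procstat-snap$i.txt   # full kernel stacks of all threads
        sleep 5
        i=$((i+1))
    done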

# grep txg_thread_enter ~/mmjnk0[0-5].txt
/usr/home/root/mmjnk00.txt:    6 100881 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk00.txt:    6 100882 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt:    6 100881 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk01.txt:    6 100882 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt:    6 100881 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk02.txt:    6 100882 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt:    6 100881 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk03.txt:    6 100882 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk04.txt:    6 100881 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk04.txt:    6 100882 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk05.txt:    6 100881 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 _cv_wait+0x165 txg_thread_wait+0xeb 
txg_quiesce_thread+0x144 fork_exit+0x82 fork_trampoline+0xe
/usr/home/root/mmjnk05.txt:    6 100882 zfskern             txg_thread_enter    
mi_switch+0x173 sleepq_switch+0x104 sleepq_timedwait+0x4b 
_cv_timedwait_sbt+0x188 zio_wait+0x3c9 dsl_pool_sync+0x139 spa_sync+0xc68 
txg_sync_thread+0x2eb fork_exit+0x82 fork_trampoline+0xe

So the quiesce threads are idle, while the sync thread is waiting for TXG commit writes to complete. I see no crime here; we would see the same with merely slow storage.

`zpool status`, `zpool get all` and `sysctl -a` output would also not hurt.
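
(For example, captured to files for later reference; the file names here are
just placeholders:)

# zpool status -v > ~/zpool-status.txt
# zpool get all > ~/zpool-get-all.txt
# sysctl -a > ~/sysctl-a.txt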

It is a very simple zpool configuration: one partition.
I only use it for bectl BE reasons, not the general
range of reasons for using zfs. I created the media with
my normal content, then checkpointed before doing the
git fetch to start to set up the experiment.

OK. And I see no scrub or async destroy that could delay the sync thread; I don't see them in the above procstat output either.

/etc/sysctl.conf does have:

vfs.zfs.min_auto_ashift=12
vfs.zfs.per_txg_dirty_frees_percent=5

The vfs.zfs.per_txg_dirty_frees_percent setting is from prior
help from Mateusz Guzik, where after testing the change I
reported:

Result summary: Seems to have avoided the sustained periods
of low load average activity. Much better for the context.

But that was for a different machine (aarch64, 8 cores), though
also for poudriere bulk use.

It turns out the default of 30 was causing something like
what is seen here: I could have presented some of the same
information via the small load average figures here.

(Note: 5 is the old default, 30 is newer. Other contexts
have other problems with 5: no single right setting and
no automated configuration.)

per_txg_dirty_frees_percent is directly related to the delete delays we see here. You are forcing ZFS to commit transactions on each 5% of the dirty ARC limit, which is 5% of 10% of memory size. I haven't looked at that code recently, but I guess setting it too low can make ZFS commit transactions too often, increasing write inflation for the underlying storage. I would propose you restore the default and try again.
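
(For reference, restoring that default without waiting for a reboot would be
something like the following, together with removing or updating the
vfs.zfs.per_txg_dirty_frees_percent line in /etc/sysctl.conf so it persists;
30 here is the newer default mentioned above:)

# sysctl vfs.zfs.per_txg_dirty_frees_percent=30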

Other than those 2 items, zfs is untuned (defaults).

sysctl -a produces a lot more output (864930 bytes), so I'll skip
it for now.

PS: I may be wrong, but the USB in "USB3 NVMe SSD storage" makes me shiver. Make 
sure there are no storage problems, like huge delays, timeouts, etc., which can be 
seen, for example, as busy percentages regularly spiking far above 100% in your `gstat 
-spod`.

The "gstat -spod" output showed (and shows): around 0.8ms/w to 3ms/w,
mostly at the lower end of the range. < 98%busy, no spikes to > 100%.
It is a previously unused Samsung PSSD T7 Touch.

Is it at ~98% busy most of the time there? Unfortunately our umass driver does not support UASP, i.e. it supports only one command at a time, so many small I/Os may accumulate more latency than storage on other interfaces would. A higher number of small transaction commits may not help it either.

A little more context here: that is for the "kB" figures seen
during the cleanup/delete activity. During port builds into
packages, larger "kB" figures are seen, and the ms/w figures
tend to be larger as well. The larger sizes can also
lead to reaching somewhat above 100 %busy some of the time.

I'll also note that I've ended up doing a lot more write
activity in this exploring than I'd expected.

I was not prepared to replace the content of a PCIe slot's media
or an M.2 connection's media for this temporary purpose. I have
no spare supply of those, so no simple swapping.

Trying -J36 (so: 32+4) got to 470 built in about an hour
after [02] reached "Builder started". /usr/local/poudriere/data/.m
used a little under 40 GiBytes at that point. (I do not have
a file count.)

The cleanup seems to have gone somewhat faster after my ^C for
this context:

^C[01:20:20] Error: Signal SIGINT caught, cleaning up and exiting
[01:20:20] [27] [00:02:54] Finished math/p5-Data-Float | p5-Data-Float-0.013: 
Success
[main-amd64-bulk_a-default] [2023-09-04_00h30m42s] [sigint:] Queued: 34588 
Built: 502   Failed: 1     Skipped: 50    Ignored: 335   Fetched: 0     
Tobuild: 33700  Time: 01:20:12
[01:20:22] Logs: 
/usr/local/poudriere/data/logs/bulk/main-amd64-bulk_a-default/2023-09-04_00h30m42s
[01:20:23] [25] [00:04:46] Finished www/p5-HTML-TreeBuilder-XPath | 
p5-HTML-TreeBuilder-XPath-0.14_1: Success
[01:20:24] Cleaning up
[02:17:01] Unmounting file systems
Exiting with status 1

So it took about an hour to clean up after 502 port builds into
packages (not published, though).

( gstat -spod showed a fairly general, sustained lack of read activity,
instead of the comparatively small sustained amount I had not mentioned
for the previous explorations. Maybe that helped. )

I suppose more builders mean deleting more work directories at the same time? I don't know if it should cause a cumulative effect, but I suppose it should be at least linear.

--
Alexander Motin
