I've long argued that the VM system's interaction with ZFS's ARC cache and UMA has serious, even severe, issues. 12.x appeared to have addressed some of them, and as such I've yet to roll forward any part of the patch series found here [ https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=187594 ] or the Phabricator version referenced in the bug thread (which is more complex and attempts to dig more effectively at the root of the issue, particularly when UMA is involved, as it usually is).
Yesterday I decided to perform a fairly significant reorganization of the ZFS pools on one of my personal machines, including the root pool, which was on mirrored SSDs, changing it to a raidz2 (also on SSDs). This of course required booting single-user from a 12-STABLE memstick. A simple "zfs send -R zs/root-save/R | zfs recv -Fuev zsr/R" should have done it, no sweat. The root that was copied over before I started is uncomplicated: it's compressed but not de-duped, and while it has snapshots on it, it's by no means complex.

*The system failed to execute that command with an "out of swap space" error, killing the job; there was indeed no swap configured, since I had booted from a memstick.*

Huh? A simple *filesystem copy* managed to force a 16GB system into requiring swap backing store? I was able to complete the copy by temporarily adding the swap space back (where it would live once the move was complete), but that requirement is pure insanity. From what I was able to determine, it came about from the same root cause that has been plaguing VM/ZFS interaction since 2014, when I started working on this issue -- specifically, when RAM gets low, rather than evict ARC (or clean up UMA that is allocated but unused), the system attempts to page out working set. In this case there was nowhere to page working set out to, so the process involved got an OOM error and was terminated.

*I continue to argue that this decision is ALWAYS wrong.* It's wrong because if you invalidate cache and reclaim it, you *might* take a physical read in the future to bring that data back into the cache (since it's no longer in RAM), but in exchange for that *potential* I/O you perform a GUARANTEED physical I/O (to page out some amount of working set) and possibly TWO physical I/Os (to page said working set out and, later, page it back in). It has always appeared flat-out nonsensical to me to trade a possible physical I/O (if there is a future cache miss) for a guaranteed physical I/O and a possible second one.

It's even worse if the reason you make that decision is that UMA is allocated but unused; in that case you are paging when no physical I/O is required at all, because the "memory pressure" is a phantom! While UMA is a very material performance win in the general case, allowing allocated-but-unused UMA to force paging appears, from a performance perspective, to be flat-out insanity. I find it very difficult to come up with any reasonable scenario where releasing allocated-but-unused UMA, rather than paging out working set, is a net performance loser.

In this case, because the system was running single-user with no swap available, the process selected for termination when that circumstance arose was the copy itself -- and the copy did not require anywhere near all of the available non-kernel RAM.

I'm going to dig into this further, but IMHO the base issue still exists, even though its impact on my workloads with everything "running normally" has materially decreased with 12.x.
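For concreteness, the failure and the workaround looked roughly like this from the single-user shell (the swap device name below is hypothetical; use whichever partition the swap will live on after the move):

    # No swap configured at this point -- booted from the memstick.
    zfs send -R zs/root-save/R | zfs recv -Fuev zsr/R
    # ...killed with "out of swap space"

    # Temporarily re-enable the swap partition, then retry:
    swapon /dev/gpt/swap0       # device name hypothetical
    zfs send -R zs/root-save/R | zfs recv -Fuev zsr/R
    # ...now runs to completion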
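As for the phantom pressure: stock tools will show roughly how much UMA is allocated but idle. A rough sketch, not a precise accounting -- "vmstat -z" reports, per zone, items in use (USED) versus allocated-but-idle (FREE), and the arcstats sysctl reports the current ARC size:

    sysctl kstat.zfs.misc.arcstats.size    # current ARC size, bytes
    vmstat -z                              # per-zone SIZE/USED/FREE counts

    # Approximate bytes sitting idle in UMA; assumes the 12.x line
    # layout of "name: size, limit, used, free, ..."
    vmstat -z | awk -F'[:,] *' 'NR>1 { idle += $2 * $5 } \
        END { print idle, "bytes idle in UMA" }'

Every byte counted in those FREE items could be reclaimed without a single disk operation, which is exactly why paging out working set in that situation is a pure loss.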
-- 
Karl Denninger
k...@denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/