Re: [SOLVED] Re: Strange behavior after running under high load

2021-04-04 Thread Poul-Henning Kamp

Konstantin Belousov writes:

> > B) We lack a nuanced call-back to tell the subsystems to release some of 
> > their memory "without major delay".

> The delay in the wall clock sense does not drive the issue.

I didn't say anything about "wall clock", and you're missing my point by a wide
margin.

We need to make major memory consumers, like vnodes, take action *before*
shortages happen, so that *when* they happen, a lot of memory can be released
to relieve them.

> We cannot expect any I/O to proceed while we are low on memory [...]

Which is precisely why the top-level goal should be for that never to happen,
while still allowing the "freeable" memory to be used as a cache as much as
possible.

> > C) We have never attempted to enlist userland, where jemalloc often hangs on
> > to a lot of unused VM pages.
> > 
> The userland does not add to this problem, [...]

No, but userland can help solve it:  The unused pages from jemalloc/userland 
can very quickly be released to relieve any imminent shortage the kernel might 
have.

As can pages from vnodes, and for that matter socket buffers.

But there are always costs: actual costs, i.e. what it will take to release the
memory (locking, VM mappings, washing), and potential costs (lack of future
caching opportunities).

These costs need to be presented to the central memory allocator, so when it 
decides back-pressure is appropriate, it can decide who to punk for how much 
memory.
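
As a rough sketch of the kind of interface that could carry those costs, here
is a hypothetical design (for illustration only; none of these names exist in
FreeBSD today):

    /*
     * Hypothetical: each subsystem (vnodes, socket buffers, a userland
     * channel for jemalloc, ...) reports what it could give back and
     * roughly what releasing it would cost, per NUMA domain.  The central
     * allocator can then decide whom to squeeze, and by how much, before
     * an actual shortage develops.
     */
    struct lowmem_offer {
            u_long  pages_freeable;  /* releasable "without major delay" */
            u_long  pages_cached;    /* total pages currently held */
            int     release_cost;    /* locking, unmapping, washing */
            int     cache_value;     /* lost future caching opportunity */
    };

    typedef void    lowmem_query_fn(void *arg, int domain,
                        struct lowmem_offer *offer);
    typedef u_long  lowmem_release_fn(void *arg, int domain,
                        u_long pages_wanted);

    void    lowmem_register(const char *name, lowmem_query_fn *query,
                lowmem_release_fn *release, void *arg);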

> But normally the operating system does not have an issue with user pages.

Only if you disregard all non-UNIX operating systems.

Many other kernels have cooperated with userland to balance memory (and for 
that matter disk-space).

Just imagine how much better the desktop experience would be if we could send
SIGVM to firefox to tell it to stop being a memory pig.

(At least two of the major operating systems in the desktop world do
something like that today.)
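
A minimal userland sketch of that cooperation, assuming the process uses the
system jemalloc and that SIGUSR1 stands in for the hypothetical SIGVM;
mallctl() is declared in <malloc_np.h>, and MALLCTL_ARENAS_ALL is jemalloc's
"all arenas" selector (its visibility through that header is an assumption):

    #include <malloc_np.h>      /* mallctl(); FreeBSD's jemalloc interface */
    #include <signal.h>
    #include <stdio.h>

    static volatile sig_atomic_t want_purge;

    static void
    vm_pressure_handler(int sig)
    {
            (void)sig;
            want_purge = 1;     /* mallctl() is not async-signal-safe */
    }

    /* Called from the application's main/event loop. */
    static void
    maybe_purge_heap(void)
    {
            char cmd[64];

            if (!want_purge)
                    return;
            want_purge = 0;
            /* Hand unused dirty pages in all arenas back to the kernel. */
            snprintf(cmd, sizeof(cmd), "arena.%u.purge",
                (unsigned)MALLCTL_ARENAS_ALL);
            (void)mallctl(cmd, NULL, NULL, NULL, 0);
    }

    /* Setup, e.g. in main():  signal(SIGUSR1, vm_pressure_handler); */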

> I/O latency is not the factor there.  We must avoid situations where
> instantiating a vnode stalls waiting for KVA to appear; similarly, we
> must avoid a system state where vnode allocation has consumed so much kmem
> that other allocations stall.

My argument is the precise opposite:  We must make vnodes and the allocations
they cause responsive to the system's overall memory availability, well in
advance of the shortage happening in the first place.

> It is quite indicative that we do not shrink the vnode list on low-memory
> events.  Vnlru also does not account for memory pressure.

The only reason we do not is that we cannot tell definitively whether freeing a
vnode will cause disk I/O (which may not matter with SSDs) or even how much
memory it might free, if anything.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: [SOLVED] Re: Strange behavior after running under high load

2021-04-04 Thread Konstantin Belousov
On Sun, Apr 04, 2021 at 07:01:44PM +, Poul-Henning Kamp wrote:
> 
> Konstantin Belousov writes:
> 
> > But what would you provide as the input for the PID controller, and what
> > would be the targets?
> 
> Viewing this purely as a vnode-related issue is wrong; this is about memory
> allocation in general.
> 
> We may or may not want a PID regulator, but putting it on counts of vnodes
> would not improve things, precisely, as you point out, because the amount of
> memory a vnode ties up has enormous variance.
> 
Yes

> 
> We should focus on the end goal: To ensure "sufficient" memory can always be 
> allocated for any purpose "without major delay".
> 
and no

> 
> Architecturally there are three major problems:
> 
> A) While each subsystem generally has a good idea about memory that can be
> released "without major delay", the information does not trickle up through a
> summarizing NUMA-aware tree.
> 
> B) We lack a nuanced call-back to tell the subsystems to release some of 
> their memory "without major delay".
The delay in the wall clock sense does not drive the issue.
We cannot expect any I/O to proceed while we are low on memory, in the sense
that the allocators cannot respond right now.  More and more, our I/O subsystem
requires allocating memory to make any progress with I/O.  This is already
quite bad with GEOM, although some hacks keep it from being too glaring.

It is very bad with ZFS, where swap on zvols causes deadlocks almost
immediately.

> 
> C) We have never attempted to enlist userland, where jemalloc often hangs on
> to a lot of unused VM pages.
> 
The userland does not add to this problem, because pagedaemon typically has
enough processing power to convert user-allocated pages into usable clean
or free pages.  Of course, if there is no swap and dirty anon pages cannot
be laundered, the problem accumulates.

But normally the operating system does not have an issue with user pages.

> 
> As far as vnodes go:
> 
> 
> It used to be that "without major delay" meant "without disk-I/O", which in
> turn led to the "dirty buffers/VM pages" heuristic.
> 
> With microsecond SSD backing store, that heuristic is not only invalid, it is
> downright harmful in many cases.
> 
> GEOM maintains estimates of per-provider latency, and VM+VFS should use them
> to schedule write-back so that more of it happens outside rush hour, in order
> to increase the amount of memory which can be released "without major delay".
> 
> Today that happens largely as a side effect of the periodic syncer, which 
> does a really bad job at it, because it still expects VAX-era hardware 
> performance and workloads.
> 
I/O latency is not the factor there.  We must avoid situations where
instantiating a vnode stalls waiting for KVA to appear; similarly, we
must avoid a system state where vnode allocation has consumed so much kmem
that other allocations stall.

It is quite indicative that we do not shrink the vnode list on low-memory
events.  Vnlru also does not account for memory pressure.

The problem is that it is not clear how to express the relation between a
safe allocator state and our desire to cache file system data, which is
bound to the vnode identity.


Re: [SOLVED] Re: Strange behavior after running under high load

2021-04-04 Thread Poul-Henning Kamp

Konstantin Belousov writes:

> But what would you provide as the input for the PID controller, and what
> would be the targets?

Viewing this purely as a vnode-related issue is wrong; this is about memory
allocation in general.

We may or may not want a PID regulator, but putting it on counts of vnodes would
not improve things, precisely, as you point out, because the amount of memory a
vnode ties up has enormous variance.


We should focus on the end goal: To ensure "sufficient" memory can always be 
allocated for any purpose "without major delay".


Architecturally there are three major problems:

A) While each subsystem generally has a good idea about memory that can be
released "without major delay", the information does not trickle up through a
summarizing NUMA-aware tree.

B) We lack a nuanced call-back to tell the subsystems to release some of their 
memory "without major delay".

C) We have never attempted to enlist userland, where jemalloc often hangs on to
a lot of unused VM pages.


As far as vnodes go:


It used to be that "without major delay" meant "without disk-I/O", which in turn
led to the "dirty buffers/VM pages" heuristic.

With microsecond SSD backing store, that heuristic is not only invalid, it is
downright harmful in many cases.

GEOM maintains estimates of per-provider latency, and VM+VFS should use them to
schedule write-back so that more of it happens outside rush hour, in order to
increase the amount of memory which can be released "without major delay".
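
A sketch of that scheduling decision, assuming a per-provider latency estimate
is available to the write-back path (the helper name and both thresholds are
made up for illustration; GEOM's real per-provider statistics are collected
via devstat):

    /*
     * Hypothetical: flush dirty data opportunistically when the provider
     * is fast and idle ("outside rush-hour"), defer when it is busy,
     * unless the amount of dirty memory is getting out of hand.
     */
    static bool
    writeback_now(u_long provider_latency_us, u_long dirty_pages,
        u_long dirty_target)
    {
            if (provider_latency_us < 200)           /* fast and idle */
                    return (true);
            return (dirty_pages > 2 * dirty_target); /* busy: only if forced */
    }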

Today that happens largely as a side effect of the periodic syncer, which does 
a really bad job at it, because it still expects VAX-era hardware performance 
and workloads.


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: [SOLVED] Re: Strange behavior after running under high load

2021-04-04 Thread Konstantin Belousov
On Sun, Apr 04, 2021 at 08:45:41AM -0600, Warner Losh wrote:
> On Sun, Apr 4, 2021, 5:51 AM Mateusz Guzik  wrote:
> 
> > On 4/3/21, Poul-Henning Kamp  wrote:
> > > 
> > > Mateusz Guzik writes:
> > >
> > >> It is high because of this:
> > >> msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk",
> > >> hz);
> > >>
> > >> i.e. it literally sleeps for 1 second.
> > >
> > > Before the line looked like that, it slept on "lbolt" aka "lightning
> > > bolt" which was woken once a second.
> > >
> > > The calculations which come up with those "constants" have always
> > > been utterly bogus math, not quite "square-root of shoe-size
> > > times sun-angle in Patagonia", but close.
> > >
> > > The original heuristic came from university environments with tons of
> > > students doing assignments and nethack behind VT102 terminals, on
> > > filesystems where files only seldom grew past 100KB, so it made sense
> > > to scale number of vnodes to how much RAM was in the system, because
> > > that also scaled the size of the buffer-cache.
> > >
> > > With a merged VM buffer-cache, whatever validity that heuristic had
> > > was lost, and we tweaked the bogomath in various ways until it
> > > seemed to mostly work, trusting the users for which it did not, to
> > > tweak things themselves.
> > >
> > > Please don't tweak the Finagle Constants again.
> > >
> > > Rip all that crap out and come up with something fundamentally better.
> > >
> >
> > Some level of pacing is probably useful to control total memory use --
> > there can be A LOT of memory tied up in the mere fact that a vnode is
> > fully cached. IMO the thing to do is to come up with some watermarks to be
> > revisited every 1-2 years and to change the behavior when they get
> > exceeded -- try to whack some stuff, but in the face of trouble just go
> > ahead and allocate without the 1-second sleep. Should the load spike sort
> > itself out, vnlru will slowly get things back down to the watermark. If
> > the watermark is too low, maybe it can autotune. The bottom line is that
> > even with the current idea of limiting the preferred total vnode count,
> > the corner-case behavior can be drastically better: suffering SOME perf
> > loss from recycling vnodes, but not sleeping for a second for every
> > single one.
> >
> 
> I'd suggest that going directly to a PID controller would be better
> than the watermarks. That would give a smoother response than high/low
> watermarks would. While you'd still need some level to keep things at, the
> laundry code has shown that the precise value of that level is less critical
> than it is with watermarks.
But what would you provide as the input for the PID controller, and what
would be the targets?

The main reason for the (almost) hard cap on the number of vnodes is not
that an excessive number of vnodes is harmful by itself.  Each allocated
vnode typically implies the existence of several second-order allocations
that accumulate into significant KVA usage:
- filesystem inode
- vm object
- namecache entries
There are usually even more allocations, third-order; for instance, a UFS
inode carries a pointer to the dinode copy in RAM, and possibly to the EA
area.  And of course there is the fact that a vnode names pages in the page
cache owned by the corresponding file, i.e. the number of allocated vnodes
regulates the amount of work for pagedaemon.

We are currently trying to put a rational limit on the total number of vnodes,
estimating both the KVA and the physical memory consumed by them.  If you
remove that limit, you need to ensure that we do not create an OOM situation
for either KVA or physical memory just by creating too many vnodes; otherwise
the system cannot get out of it.

So there are some combinations of machine config (RAM) and load where the
default settings are arguably too low.  Raising the limits needs to account
for the indirect resource usage of each vnode.

I do not know how to write the feedback formula taking into account all
the consequences of a vnode's existence, and the effects also depend on
the underlying filesystem and the patterns of VM paging usage.  In this
sense ZFS is probably the simplest case, because its caching subsystem is
autonomous, while UFS and NFS are tightly integrated with the VM.

> 
> Warner
> 
> > I think the notion of 'struct vnode' being a separately allocated
> > object is not very useful and it comes with complexity (and happens to
> > suffer from several bugs).
> >
> > That said, the easiest and safest thing to do in the meantime is to
> > bump the limit. Perhaps the sleep can be whacked as it is which would
> > largely sort it out.
> >
> > --
> > Mateusz Guzik 

Re: [SOLVED] Re: Strange behavior after running under high load

2021-04-04 Thread Warner Losh
On Sun, Apr 4, 2021, 5:51 AM Mateusz Guzik  wrote:

> On 4/3/21, Poul-Henning Kamp  wrote:
> > 
> > Mateusz Guzik writes:
> >
> >> It is high because of this:
> >> msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk",
> >> hz);
> >>
> >> i.e. it literally sleeps for 1 second.
> >
> > Before the line looked like that, it slept on "lbolt" aka "lightning
> > bolt" which was woken once a second.
> >
> > The calculations which come up with those "constants" have always
> > been utterly bogus math, not quite "square-root of shoe-size
> > times sun-angle in Patagonia", but close.
> >
> > The original heuristic came from university environments with tons of
> > students doing assignments and nethack behind VT102 terminals, on
> > filesystems where files only seldom grew past 100KB, so it made sense
> > to scale number of vnodes to how much RAM was in the system, because
> > that also scaled the size of the buffer-cache.
> >
> > With a merged VM buffer-cache, whatever validity that heuristic had
> > was lost, and we tweaked the bogomath in various ways until it
> > seemed to mostly work, trusting the users for which it did not, to
> > tweak things themselves.
> >
> > Please don't tweak the Finagle Constants again.
> >
> > Rip all that crap out and come up with something fundamentally better.
> >
>
> Some level of pacing is probably useful to control total memory use --
> there can be A LOT of memory tied up in the mere fact that a vnode is
> fully cached. IMO the thing to do is to come up with some watermarks to be
> revisited every 1-2 years and to change the behavior when they get
> exceeded -- try to whack some stuff, but in the face of trouble just go
> ahead and allocate without the 1-second sleep. Should the load spike sort
> itself out, vnlru will slowly get things back down to the watermark. If
> the watermark is too low, maybe it can autotune. The bottom line is that
> even with the current idea of limiting the preferred total vnode count,
> the corner-case behavior can be drastically better: suffering SOME perf
> loss from recycling vnodes, but not sleeping for a second for every
> single one.
>

I'd suggest that going directly to a PID controller would be better
than the watermarks. That would give a smoother response than high/low
watermarks would. While you'd still need some level to keep things at, the
laundry code has shown that the precise value of that level is less critical
than it is with watermarks.
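
A sketch of what such a controller could look like, keeping in mind the
question raised above about inputs and targets: the process variable would
have to be the memory attributed to vnodes and their second-order allocations
rather than a raw vnode count.  All names and gains here are hypothetical,
purely to show the shape:

    /*
     * Hypothetical PID step, run periodically by vnlru.  Input: estimated
     * pages tied up by vnodes and their second-order allocations.
     * Output: pages worth of vnode-related memory to try to reclaim this
     * tick (vnlru would convert that into a vnode count using an average
     * per-vnode footprint).
     */
    struct vnode_pid {
            long    integ;          /* accumulated error */
            long    prev_err;       /* error from the previous tick */
    };

    static long
    vnode_pid_step(struct vnode_pid *p, long pages_used, long pages_target)
    {
            /* Illustrative gains; tuning them is the actual hard problem. */
            const long Kp = 4, Ki = 1, Kd = 2, scale = 16;
            long err, deriv, out;

            err = pages_used - pages_target;
            p->integ += err;        /* a real version would clamp this to
                                       avoid integral wind-up */
            deriv = err - p->prev_err;
            p->prev_err = err;

            out = (Kp * err + Ki * p->integ + Kd * deriv) / scale;
            return (out > 0 ? out : 0);
    }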

Warner

> I think the notion of 'struct vnode' being a separately allocated
> object is not very useful and it comes with complexity (and happens to
> suffer from several bugs).
>
> That said, the easiest and safest thing to do in the meantime is to
> bump the limit. Perhaps the sleep can be whacked as it is which would
> largely sort it out.
>
> --
> Mateusz Guzik 


Re: [SOLVED] Re: Strange behavior after running under high load

2021-04-04 Thread Mateusz Guzik
On 4/3/21, Poul-Henning Kamp  wrote:
> 
> Mateusz Guzik writes:
>
>> It is high because of this:
>> msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk",
>> hz);
>>
>> i.e. it literally sleeps for 1 second.
>
> Before the line looked like that, it slept on "lbolt" aka "lightning
> bolt" which was woken once a second.
>
> The calculations which come up with those "constants" have always
> been utterly bogus math, not quite "square-root of shoe-size
> times sun-angle in Patagonia", but close.
>
> The original heuristic came from university environments with tons of
> students doing assignments and nethack behind VT102 terminals, on
> filesystems where files only seldom grew past 100KB, so it made sense
> to scale number of vnodes to how much RAM was in the system, because
> that also scaled the size of the buffer-cache.
>
> With a merged VM buffer-cache, whatever validity that heuristic had
> was lost, and we tweaked the bogomath in various ways until it
> seemed to mostly work, trusting the users for which it did not, to
> tweak things themselves.
>
> Please don't tweak the Finagle Constants again.
>
> Rip all that crap out and come up with something fundamentally better.
>

Some level of pacing is probably useful to control total memory use --
there can be A LOT of memory tied up in the mere fact that a vnode is
fully cached. IMO the thing to do is to come up with some watermarks to be
revisited every 1-2 years and to change the behavior when they get
exceeded -- try to whack some stuff, but in the face of trouble just go
ahead and allocate without the 1-second sleep. Should the load spike sort
itself out, vnlru will slowly get things back down to the watermark. If
the watermark is too low, maybe it can autotune. The bottom line is that
even with the current idea of limiting the preferred total vnode count,
the corner-case behavior can be drastically better: suffering SOME perf
loss from recycling vnodes, but not sleeping for a second for every
single one.
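
A sketch of the allocation-side behavior being proposed, with hypothetical
watermarks and helpers (none of these names exist in the tree; the point is
only the shape of the logic: never sleep for a second, just reclaim harder as
the count climbs):

    /*
     * Hypothetical replacement for the sleep in the allocation slow path.
     * wm_low/wm_high would be derived from kern.maxvnodes and revisited
     * as typical hardware grows.
     */
    static void
    vn_alloc_pacing(u_long numvnodes, u_long wm_low, u_long wm_high)
    {
            if (numvnodes < wm_low)
                    return;                 /* plenty of headroom */
            if (numvnodes < wm_high) {
                    vnlru_kick();           /* assumed: wake the vnlru thread */
                    return;                 /* ...and allocate without sleeping */
            }
            /* Above the high watermark: do a small, bounded reclaim pass
             * inline, then allocate anyway instead of stalling for 1s. */
            vnlru_kick();
            (void)vnlru_free_batch(32);     /* hypothetical helper */
    }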

I think the notion of 'struct vnode' being a separately allocated
object is not very useful and it comes with complexity (and happens to
suffer from several bugs).

That said, the easiest and safest thing to do in the meantime is to
bump the limit. Perhaps the sleep can be whacked as it is which would
largely sort it out.

-- 
Mateusz Guzik 


Re: [SOLVED] Re: Strange behavior after running under high load

2021-04-03 Thread Poul-Henning Kamp

Mateusz Guzik writes:

> It is high because of this:
> msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk", hz);
>
> i.e. it literally sleeps for 1 second.

Before the line looked like that, it slept on "lbolt" aka "lightning
bolt" which was woken once a second.

The calculations which come up with those "constants" have always
been utterly bogus math, not quite "square-root of shoe-size
times sun-angle in Patagonia", but close.

The original heuristic came from university environments with tons of
students doing assignments and nethack behind VT102 terminals, on
filesystems where files only seldom grew past 100KB, so it made sense
to scale number of vnodes to how much RAM was in the system, because
that also scaled the size of the buffer-cache.

With a merged VM buffer-cache, whatever validity that heuristic had
was lost, and we tweaked the bogomath in various ways until it
seemed to mostly work, trusting the users for which it did not, to
tweak things themselves.

Please don't tweak the Finagle Constants again.

Rip all that crap out and come up with something fundamentally better.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: [SOLVED] Re: Strange behavior after running under high load

2021-04-02 Thread Mateusz Guzik
On 4/2/21, Stefan Esser  wrote:
> Am 28.03.21 um 16:39 schrieb Stefan Esser:
>> After a period of high load, my now idle system needs 4 to 10 seconds to
>> run any trivial command - even after 20 minutes of no load ...
>>
>>
>> I have run some Monte-Carlo simulations for a few hours, with initially
> 35
>> processes running in parallel for some 10 seconds each.
>>
>> The load decreased over time since some parameter sets were faster to
>> process.
>> All in all 63000 processes ran within some 3 hours.
>>
>> When the system became idle, interactive performance was very bad.
>> Running
>> any trivial command (e.g. uptime) takes some 5 to 10 seconds. Since I
>> have
>> to have this system working, I plan to reboot it later today, but will
>> keep
>> it in this state for some more time to see whether this state persists or
>> whether the system recovers from it.
>>
>> Any ideas what might cause such a system state???
>
> Seems that Mateusz Guzik was right to mention performance issues when
> the system is very low on vnodes. (Thanks!)
>
> I have been able to reproduce the issue and have checked vnode stats:
>
> kern.maxvnodes: 620370
> kern.minvnodes: 155092
> vm.stats.vm.v_vnodepgsout: 6890171
> vm.stats.vm.v_vnodepgsin: 18475530
> vm.stats.vm.v_vnodeout: 228516
> vm.stats.vm.v_vnodein: 1592444
> vfs.wantfreevnodes: 155092
> vfs.freevnodes: 47<- obviously too low ...
> vfs.vnodes_created: 19554702
> vfs.numvnodes: 621284
> vfs.cache.debug.vnodes_cel_3_failures: 0
> vfs.cache.stats.heldvnodes: 6412
>
> The freevnodes value stayed in this region over several minutes, with
> typical program start times (e.g. for "uptime") in the region of 10 to
> 15 seconds.
>
> After raising maxvnodes to 2,000,000 from 600,000 the system performance
> is restored and I get:
>
> kern.maxvnodes: 200
> kern.minvnodes: 50
> vm.stats.vm.v_vnodepgsout: 7875198
> vm.stats.vm.v_vnodepgsin: 20788679
> vm.stats.vm.v_vnodeout: 261179
> vm.stats.vm.v_vnodein: 1817599
> vfs.wantfreevnodes: 50
> vfs.freevnodes: 205988<- still a lot higher than wantfreevnodes
> vfs.vnodes_created: 19956502
> vfs.numvnodes: 912880
> vfs.cache.debug.vnodes_cel_3_failures: 0
> vfs.cache.stats.heldvnodes: 20702
>
> I do not know why the performance impact is so high - there are a few
> free vnodes (more than required for the shared libraries to start e.g.
> the uptime program). Most probably each attempt to get a vnode triggers
> a clean-up attempt that runs for a significant time, but has no chance
> to actually reach near the goal of 155k or 500k free vnodes.
>

It is high because of this:
msleep(&vnlruproc_sig, &vnode_list_mtx, PVFS, "vlruwk", hz);

i.e. it literally sleeps for 1 second.

The vnode limit is probably too conservative, and the behavior when the limit
is reached is rather broken. Probably the thing to do is to let allocations
go through while kicking vnlru to free some stuff up. I'll have to sleep on
it.


> Anyway, kern.maxvnodes can be changed at run-time and it is thus easy
> to fix. It seems that no message is logged to report this situation.
> A rate-limited hint to raise the limit should help other affected users.
>
> Regards, STefan
>
>


-- 
Mateusz Guzik 


[SOLVED] Re: Strange behavior after running under high load

2021-04-02 Thread Stefan Esser

Am 28.03.21 um 16:39 schrieb Stefan Esser:

After a period of high load, my now idle system needs 4 to 10 seconds to
run any trivial command - even after 20 minutes of no load ...


I have run some Monte-Carlo simulations for a few hours, with initially 35
processes running in parallel for some 10 seconds each.

The load decreased over time since some parameter sets were faster to process.
All in all 63000 processes ran within some 3 hours.

When the system became idle, interactive performance was very bad. Running
any trivial command (e.g. uptime) takes some 5 to 10 seconds. Since I have
to have this system working, I plan to reboot it later today, but will keep
it in this state for some more time to see whether this state persists or
whether the system recovers from it.

Any ideas what might cause such a system state???


Seems that Mateusz Guzik was right to mention performance issues when
the system is very low on vnodes. (Thanks!)

I have been able to reproduce the issue and have checked vnode stats:

kern.maxvnodes: 620370
kern.minvnodes: 155092
vm.stats.vm.v_vnodepgsout: 6890171
vm.stats.vm.v_vnodepgsin: 18475530
vm.stats.vm.v_vnodeout: 228516
vm.stats.vm.v_vnodein: 1592444
vfs.wantfreevnodes: 155092
vfs.freevnodes: 47  <- obviously too low ...
vfs.vnodes_created: 19554702
vfs.numvnodes: 621284
vfs.cache.debug.vnodes_cel_3_failures: 0
vfs.cache.stats.heldvnodes: 6412

The freevnodes value stayed in this region over several minutes, with
typical program start times (e.g. for "uptime") in the region of 10 to
15 seconds.

After raising maxvnodes to 2,000,000 from 600,000 the system performance
is restored and I get:

kern.maxvnodes: 200
kern.minvnodes: 50
vm.stats.vm.v_vnodepgsout: 7875198
vm.stats.vm.v_vnodepgsin: 20788679
vm.stats.vm.v_vnodeout: 261179
vm.stats.vm.v_vnodein: 1817599
vfs.wantfreevnodes: 50
vfs.freevnodes: 205988  <- still a lot higher than wantfreevnodes
vfs.vnodes_created: 19956502
vfs.numvnodes: 912880
vfs.cache.debug.vnodes_cel_3_failures: 0
vfs.cache.stats.heldvnodes: 20702

I do not know why the performance impact is so high - there are a few
free vnodes (more than required for the shared libraries to start e.g.
the uptime program). Most probably each attempt to get a vnode triggers
a clean-up attempt that runs for a significant time, but has no chance
to actually reach near the goal of 155k or 500k free vnodes.
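
For anyone wanting to spot this state on a running system, a small check along
these lines reads the same counters listed above (sysctlbyname(3) is the
standard interface; the counters are assumed to be unsigned longs, as on
recent -CURRENT, and the 1% threshold is arbitrary):

    #include <sys/types.h>
    #include <sys/sysctl.h>
    #include <stdio.h>

    int
    main(void)
    {
            u_long freevnodes = 0, wantfreevnodes = 0, maxvnodes = 0;
            size_t len;

            len = sizeof(freevnodes);
            (void)sysctlbyname("vfs.freevnodes", &freevnodes, &len, NULL, 0);
            len = sizeof(wantfreevnodes);
            (void)sysctlbyname("vfs.wantfreevnodes", &wantfreevnodes, &len,
                NULL, 0);
            len = sizeof(maxvnodes);
            (void)sysctlbyname("kern.maxvnodes", &maxvnodes, &len, NULL, 0);

            if (freevnodes < wantfreevnodes / 100)
                    printf("free vnodes exhausted (%lu free, %lu wanted); "
                        "consider raising kern.maxvnodes (now %lu)\n",
                        freevnodes, wantfreevnodes, maxvnodes);
            return (0);
    }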

Anyway, kern.maxvnodes can be changed at run-time and it is thus easy
to fix. It seems that no message is logged to report this situation.
A rate-limited hint to raise the limit should help other affected users.
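
A sketch of such a hint, to be emitted from the code path that currently
sleeps waiting for a vnode to be reclaimed (ratecheck() is an existing kernel
rate-limiting helper; the wording and the one-message-per-minute limit are
just suggestions):

    /*
     * Sketch: print a rate-limited console hint when vnode allocation has
     * to wait because the table is full and reclaim cannot keep up.
     */
    static void
    vn_alloc_warn(void)
    {
            static struct timeval lastwarn;
            static const struct timeval interval = { 60, 0 };  /* 1/minute */

            if (ratecheck(&lastwarn, &interval))
                    printf("vnode table is full and reclaim is slow; "
                        "processes may stall.  Consider raising "
                        "kern.maxvnodes.\n");
    }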

Regards, STefan





Re: Strange behavior after running under high load

2021-03-29 Thread Stefan Esser

Am 29.03.21 um 08:45 schrieb Andrea Venturoli:

On 3/28/21 4:39 PM, Stefan Esser wrote:

After a period of high load, my now idle system needs 4 to 10 seconds to
run any trivial command - even after 20 minutes of no load ...


High CPU load or high disk load?


High CPU load, 3 times the number of CPU threads in this particular
batch run.

Fewer than 10 files, each smaller than 100 KB, were written per second.


ZFS? Snapshots?


ZFS and automatic snapshots of the file system every hour.


12.x? 13.x?


-CURRENT as of some 24 hours before the issue occurred:

FreeBSD 14.0-CURRENT #33 main-n245694-90d2f7c413f9-dirty: Sat Mar 27 15:35:37 
CET 2021


I've seen something similar: after a high load period, system crawled so much 
that services were not answering in a reasonable time (e.g. mail would fail 
with "no such mailbox"!).


Program start-up was very slow, but interactive response once running was
normal (e.g. execution of internal shell commands like "echo *").


Even rebooting didn't fix it, until I deleted some autosnapshots.


Rebooting fixed it on my case.

top or other tools would show no disk activity, although the disks were
working like mad.


No disk activity in my case. The system was idle without any load, but the
issue persisted over many hours (up to the moment when I decided to reboot
the system to get it back into a usable state).


Not sure it's the same case you experienced, though.


Probably not, but you seem to have hit another case where a resource limit
was reached and the system did not gracefully deal with the situation.

Thanks for replying ...

Regards, STefan





Re: Strange behavior after running under high load

2021-03-29 Thread Stefan Esser

Am 29.03.21 um 03:11 schrieb Mateusz Guzik:

This may be the problem fixed in
e9272225e6bed840b00eef1c817b188c172338ee ("vfs: fix vnlru marker
handling for filtered/unfiltered cases").


My system was up for less than 24 hours and using a kernel and world
built on the latest -CURRENT of less than 1 hour before the reboot:

FreeBSD 14.0-CURRENT #33 main-n245694-90d2f7c413f9-dirty: Sat Mar 27 15:35:37 
CET 2021


The fix had been committed some 9 days before that kernel was built.


However, there is a long standing performance bug where if vnode limit
is hit, and there is nothing to reclaim, the code is just going to
sleep for one second.


There are no log entries that give any hint to what occurred.
But I do assume that these events are not logged ... (?)

Yes, I could have checked that and will do so if the issue occurs
again. I plan to generate more output files in the same way that
triggered the issue yesterday, and since the system is very slow
but still able to execute commands, I can try to debug it, just
have to know where to start looking ...

Thank you for your reply!

Regards, STefan





Re: Strange behavior after running under high load

2021-03-29 Thread Andrea Venturoli

On 3/28/21 4:39 PM, Stefan Esser wrote:

After a period of high load, my now idle system needs 4 to 10 seconds to
run any trivial command - even after 20 minutes of no load ...


High CPU load or high disk load?
ZFS? Snapshots?
12.x? 13.x?

I've seen something similar: after a high load period, system crawled so 
much that services were not answering in a reasonable time (e.g. mail 
would fail with "no such mailbox"!).

Even rebooting didn't fix it, until I deleted some autosnapshots.
top or other tools would show no disk activity, although the disks were
working like mad.


Not sure it's the same case you experienced, though.


Re: Strange behavior after running under high load

2021-03-28 Thread Mateusz Guzik
This may be the problem fixed in
e9272225e6bed840b00eef1c817b188c172338ee ("vfs: fix vnlru marker
handling for filtered/unfiltered cases").

However, there is a long standing performance bug where if vnode limit
is hit, and there is nothing to reclaim, the code is just going to
sleep for one second.

On 3/28/21, Stefan Esser  wrote:
> Am 28.03.21 um 17:44 schrieb Andriy Gapon:
>> On 28/03/2021 17:39, Stefan Esser wrote:
>>> After a period of high load, my now idle system needs 4 to 10 seconds to
>>> run any trivial command - even after 20 minutes of no load ...
>>>
>>>
>>> I have run some Monte-Carlo simulations for a few hours, with initially
>>> 35
>>> processes running in parallel for some 10 seconds each.
>>
>> I saw somewhat similar symptoms with 13-CURRENT some time ago.
>> To me it looked like even small kernel memory allocations took a very long
>> time.
>> But it was hard to properly diagnose that as my favorite tool, dtrace, was
>> also
>> affected by the same problem.
>
> That could have been the case - but I had to reboot to recover the system.
>
> I had let it sit idle for a few hours and the last "time uptime" before
> the reboot took 15 seconds real time to complete.
>
> Response from within the shell (e.g. "echo *") was instantaneous, though.
>
> I tried to trace the program execution of "uptime" with truss and found,
> that the loading of shared libraries proceeded at about one or two per
> second until all were attached and then the program quickly printed the
> expected results.
>
> I could probably recreate the issue by running the same set of programs
> that triggered it a few hours ago, but this is a production system and
> I need it to be operational through the week ...
>
> Regards, STefan
>
>


-- 
Mateusz Guzik 


Re: Strange behavior after running under high load

2021-03-28 Thread Stefan Esser

Am 28.03.21 um 17:44 schrieb Andriy Gapon:

On 28/03/2021 17:39, Stefan Esser wrote:

After a period of high load, my now idle system needs 4 to 10 seconds to
run any trivial command - even after 20 minutes of no load ...


I have run some Monte-Carlo simulations for a few hours, with initially 35
processes running in parallel for some 10 seconds each.


I saw somewhat similar symptoms with 13-CURRENT some time ago.
To me it looked like even small kernel memory allocations took a very long time.
But it was hard to properly diagnose that as my favorite tool, dtrace, was also
affected by the same problem.


That could have been the case - but I had to reboot to recover the system.

I had let it sit idle for a few hours and the last "time uptime" before
the reboot took 15 seconds real time to complete.

Response from within the shell (e.g. "echo *") was instantaneous, though.

I tried to trace the program execution of "uptime" with truss and found,
that the loading of shared libraries proceeded at about one or two per
second until all were attached and then the program quickly printed the
expected results.

I could probably recreate the issue by running the same set of programs
that triggered it a few hours ago, but this is a production system and
I need it to be operational through the week ...

Regards, STefan





Re: Strange behavior after running under high load

2021-03-28 Thread Andriy Gapon
On 28/03/2021 17:39, Stefan Esser wrote:
> After a period of high load, my now idle system needs 4 to 10 seconds to
> run any trivial command - even after 20 minutes of no load ...
> 
> 
> I have run some Monte-Carlo simulations for a few hours, with initially 35
> processes running in parallel for some 10 seconds each.

I saw somewhat similar symptoms with 13-CURRENT some time ago.
To me it looked like even small kernel memory allocations took a very long time.
But it was hard to properly diagnose that as my favorite tool, dtrace, was also
affected by the same problem.

> The load decreased over time since some parameter sets were faster to process.
> All in all 63000 processes ran within some 3 hours.
> 
> When the system became idle, interactive performance was very bad. Running
> any trivial command (e.g. uptime) takes some 5 to 10 seconds. Since I have
> to have this system working, I plan to reboot it later today, but will keep
> it in this state for some more time to see whether this state persists or
> whether the system recovers from it.
> 
> Any ideas what might cause such a system state???
> 
> 
> The system has a Ryzen 5 3600 CPU (6 core/12 threads) and 32 GB of RAM.
> 
> The following are a few commands that I have tried on this now practically
> idle system:
> 
> $ time vmstat -n 1
>   procs    memory    page  disks faults   cpu
>   r  b  w  avm  fre  flt  re  pi  po   fr   sr nv0   in   sy   cs us sy id
>   2  0  0  26G 922M 1.2K   1   4   0 1.4K  239   0  482 7.2K  934 11  1 88
> 
> real    0m9,357s
> user    0m0,001s
> sys    0m0,018
> 
>  wait 1 minute 
> 
> $ time vmstat -n 1
>   procs    memory    page  disks faults   cpu
>   r  b  w  avm  fre  flt  re  pi  po   fr   sr nv0   in   sy   cs us sy id
>   1  0  0  26G 925M 1.2K   1   4   0 1.4K  239   0  482 7.2K  933 11  1 88
> 
> real    0m9,821s
> user    0m0,003s
> sys    0m0,389s
> 
> $ systat -vm
> 
>  4 users    Load  0.10  0.72  3.57  Mar 28 16:15
>     Mem usage:  97%Phy 55%Kmem   VN PAGER   SWAP PAGER
> Mem:  REAL   VIRTUAL in   out in  out
>     Tot   Share Tot    Share Free   count
> Act  2387M    460K  26481M 460K 923M   pages
> All  2605M    218M  27105M 572M    ioflt  Interrupts
> Proc:  cow 132 total
>    r   p   d    s   w   Csw  Trp  Sys  Int  Sof  Flt    52 zfod 96 
> hpet0:t0
>   316   356   39  225  132   21   53   ozfod nvme0:admi
>   %ozfod nvme0:io0
>   0.1%Sys   0.0%Intr  0.0%User  0.0%Nice 99.9%Idle daefr nvme0:io1
> |    |    |    |    |    |    |    |    |    |    |    prcfr nvme0:io2
>    totfr nvme0:io3
>     dtbuf  react nvme0:io4
> Namei  Name-cache   Dir-cache    620370 maxvn  pdwak nvme0:io5
>     Calls    hits   %    hits   %    627486 numvn  168 pdpgs    27 xhci0 
> 66
>    18  14  78    65 frevn  intrn ahci0 67
>     17539M wire xhci1 68
> Disks  nvd0  ada0  ada1  ada2  ada3  ada4   cd0   430M act   9 re0 69
> KB/t   0.00  0.00  0.00  0.00  0.00  0.00  0.00 12696M inact hdac0 76
> tps   0 0 0 0 0 0 0 54276K laund vgapci0 78
> MB/s   0.00  0.00  0.00  0.00  0.00  0.00  0.00   923M free
> %busy 0 0 0 0 0 0 0  0 buf
> 
>  5 minutes later 
> 
> $ time vmstat -n 1
>  procs    memory    page  disks faults   cpu
>  r  b  w  avm  fre  flt  re  pi  po   fr   sr nv0   in   sy   cs us sy id
>  1  0  0  26G 922M 1.2K   1   4   0 1.4K  239   0  481 7.2K  931 11  1 88
> 
> real    0m4,270s
> user    0m0,000s
> sys    0m0,019s
> 
> $ time uptime
> 16:20  up 23:23, 4 users, load averages: 0,17 0,39 2,68
> 
> real    0m10,840s
> user    0m0,001s
> sys    0m0,374s
> 
> $ time uptime
> 16:37  up 23:40, 4 users, load averages: 0,29 0,27 0,96
> 
> real    0m9,273s
> user    0m0,000s
> sys    0m0,020s
> 


-- 
Andriy Gapon


Strange behavior after running under high load

2021-03-28 Thread Stefan Esser

After a period of high load, my now idle system needs 4 to 10 seconds to
run any trivial command - even after 20 minutes of no load ...


I have run some Monte-Carlo simulations for a few hours, with initially 35 
processes running in parallel for some 10 seconds each.


The load decreased over time since some parameter sets were faster to process.
All in all 63000 processes ran within some 3 hours.

When the system became idle, interactive performance was very bad. Running
any trivial command (e.g. uptime) takes some 5 to 10 seconds. Since I have
to have this system working, I plan to reboot it later today, but will keep
it in this state for some more time to see whether this state persists or
whether the system recovers from it.

Any ideas what might cause such a system state???


The system has a Ryzen 5 3600 CPU (6 core/12 threads) and 32 GB of RAM.

The following are a few commands that I have tried on this now practically
idle system:

$ time vmstat -n 1
  procsmemorypage  disks faults   cpu
  r  b  w  avm  fre  flt  re  pi  po   fr   sr nv0   in   sy   cs us sy id
  2  0  0  26G 922M 1.2K   1   4   0 1.4K  239   0  482 7.2K  934 11  1 88

real0m9,357s
user0m0,001s
sys 0m0,018

 wait 1 minute 

$ time vmstat -n 1
  procsmemorypage  disks faults   cpu
  r  b  w  avm  fre  flt  re  pi  po   fr   sr nv0   in   sy   cs us sy id
  1  0  0  26G 925M 1.2K   1   4   0 1.4K  239   0  482 7.2K  933 11  1 88

real0m9,821s
user0m0,003s
sys 0m0,389s

$ systat -vm

 4 usersLoad  0.10  0.72  3.57  Mar 28 16:15
Mem usage:  97%Phy 55%Kmem   VN PAGER   SWAP 
PAGER
Mem:  REAL   VIRTUAL in   out in  
out

Tot   Share TotShare Free   count
Act  2387M460K  26481M 460K 923M   pages
All  2605M218M  27105M 572Mioflt  Interrupts
Proc:  cow 132 total
   r   p   ds   w   Csw  Trp  Sys  Int  Sof  Flt52 zfod 96 hpet0:t0
  316   356   39  225  132   21   53   ozfod nvme0:admi
  %ozfod nvme0:io0
  0.1%Sys   0.0%Intr  0.0%User  0.0%Nice 99.9%Idle daefr nvme0:io1
|||||||||||prcfr nvme0:io2
   totfr nvme0:io3
dtbuf  react nvme0:io4
Namei  Name-cache   Dir-cache620370 maxvn  pdwak nvme0:io5
Callshits   %hits   %627486 numvn  168 pdpgs27 xhci0 66
   18  14  7865 frevn  intrn ahci0 67
17539M wire xhci1 68
Disks  nvd0  ada0  ada1  ada2  ada3  ada4   cd0   430M act   9 re0 69
KB/t   0.00  0.00  0.00  0.00  0.00  0.00  0.00 12696M inact hdac0 76
tps   0 0 0 0 0 0 0 54276K laund vgapci0 78
MB/s   0.00  0.00  0.00  0.00  0.00  0.00  0.00   923M free
%busy 0 0 0 0 0 0 0  0 buf

 5 minutes later 

$ time vmstat -n 1
 procsmemorypage  disks faults   cpu
 r  b  w  avm  fre  flt  re  pi  po   fr   sr nv0   in   sy   cs us sy id
 1  0  0  26G 922M 1.2K   1   4   0 1.4K  239   0  481 7.2K  931 11  1 88

real0m4,270s
user0m0,000s
sys 0m0,019s

$ time uptime
16:20  up 23:23, 4 users, load averages: 0,17 0,39 2,68

real0m10,840s
user0m0,001s
sys 0m0,374s

$ time uptime
16:37  up 23:40, 4 users, load averages: 0,29 0,27 0,96

real0m9,273s
user0m0,000s
sys 0m0,020s


