VNET lock reversal

2022-05-10 Thread Dustin Marquess
Just updated my -CURRENT box from a build almost exactly 14 days ago
to one just about an hour old.  When a VNET jail starts, I'm seeing a
lock reversal:

lock order reversal:
 1st 0x81e893a8 allprison (allprison, sx) @ /usr/src/sys/kern/kern_jail.c:1378
 2nd 0x81f99fe8 vnet_sysinit_sxlock (vnet_sysinit_sxlock, sx) @ /usr/src/sys/net/vnet.c:579
lock order allprison -> vnet_sysinit_sxlock attempted at:
#0 0x80c9b7c6 at witness_checkorder+0xbd6
#1 0x80c35c67 at _sx_slock_int+0x67
#2 0x80d92185 at vnet_alloc+0x115
#3 0x80be7e02 at kern_jail_set+0x1722
#4 0x80be92f0 at sys_jail_set+0x40
#5 0x811200aa at amd64_syscall+0x13a
#6 0x810f12eb at fast_syscall_common+0xf8

I'll try and see if I can get a bisect going if somebody else hasn't
seen this yet.
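For readers less familiar with this report format: roughly speaking, WITNESS prints a lock order reversal when a thread that holds the "1st" lock (allprison here) acquires the "2nd" one (vnet_sysinit_sxlock) in the opposite order to one it has already established. The toy model below is only a sketch of that idea, written for illustration. It is not witness(4)'s implementation (which also tracks transitive orders and much more), and the enum and function names are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy lock-order checker in the spirit of witness(4). NOT the kernel's
 * implementation: it only records direct "A held while B taken" pairs
 * (no transitive closure) and assumes locks are released in LIFO order.
 */
enum lock_id { VNET_SYSINIT_SX, ALLPRISON, NLOCKS };

static const char *const lock_name[NLOCKS] = {
    [VNET_SYSINIT_SX] = "vnet_sysinit_sxlock",
    [ALLPRISON] = "allprison",
};

/* learned[a][b] becomes true once b has been acquired while a was held. */
static bool learned[NLOCKS][NLOCKS];
static enum lock_id held[NLOCKS];
static int nheld;

/*
 * Acquire `lock`. Returns false (and prints a report) if this contradicts
 * an order learned earlier; otherwise records the new held -> lock pairs.
 */
static bool
toy_lock(enum lock_id lock)
{
    bool ok = true;

    assert(nheld < NLOCKS);
    for (int i = 0; i < nheld; i++) {
        enum lock_id h = held[i];

        if (learned[lock][h]) {
            printf("lock order reversal: 1st %s, 2nd %s\n",
                lock_name[h], lock_name[lock]);
            ok = false;
        } else {
            learned[h][lock] = true;
        }
    }
    held[nheld++] = lock;
    return ok;
}

/* Release `lock`; the toy requires it to be the most recently acquired. */
static void
toy_unlock(enum lock_id lock)
{
    assert(nheld > 0 && held[nheld - 1] == lock);
    nheld--;
}
```

Taking vnet_sysinit_sxlock and then allprison on one path, and later allprison and then vnet_sysinit_sxlock on another, makes the second toy_lock() call report "1st allprison, 2nd vnet_sysinit_sxlock", the same shape as the report above.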

-Dustin



Re: Chasing OOM Issues - good sysctl metrics to use?

2022-05-10 Thread Mark Millard
On 2022-May-10, at 17:49, Mark Millard  wrote:

> On 2022-May-10, at 11:49, Mark Millard  wrote:
> 
>> On 2022-May-10, at 08:47, Jan Mikkelsen  wrote:
>> 
>>> On 10 May 2022, at 10:01, Mark Millard  wrote:
 
 On 2022-Apr-29, at 13:57, Mark Millard  wrote:
 
> On 2022-Apr-29, at 13:41, Pete Wright  wrote:
>> 
>>> . . .
>> 
>> d'oh - went out for lunch and workstation locked up.  i *knew* i 
>> shouldn't have said anything lol.
> 
> Any interesting console messages ( or dmesg -a or /var/log/messages )?
> 
 
 I've been doing some testing of a patch by tijl at FreeBSD.org
 and have reproduced both hang-ups (ZFS/ARC context) and kills
 (UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
 memory", both with and without the patch. This is with only a
 tiny fraction of the swap partition(s) enabled being put to
 use. So far, the testing was deliberately with
 vm.pageout_oom_seq=12 (the default value). My testing has been
 with main [so: 14].
 
 But I also learned how to avoid the hang-ups that I got, at
 the cost of making kills more likely/quicker, other things
 being equal.
 
 I discovered that the hang-ups that I got came from all the
 processes that I use to interact with the system ending up
 with their kernel threads swapped out and not being swapped
 back in (including sshd, so no new ssh connections). In some
 contexts only escaping into the kernel debugger was
 available; not even ^T would work. Other times ^T did work.
 
 So, when I'm willing to risk kills in order to maintain
 the ability to interact normally, I now use in
 /etc/sysctl.conf :
 
 vm.swap_enabled=0
>>> 
>>> I have been looking at an OOM related issue. Ignoring the actual leak, the 
>>> problem leads to a process being killed because the system was out of 
>>> memory. This is fine. After that, however, the system console was black 
>>> with a single block cursor and the console keyboard was unresponsive. Caps 
>>> lock and num lock didn’t toggle their lights when pressed.
>>> 
>>> Using an ssh session, the system looked fine. USB events for the keyboard 
>>> being disconnected and reconnected appeared but the keyboard stayed 
>>> unresponsive.
>>> 
>>> Setting vm.swap_enabled=0, as you did above, resolved this problem. After 
>>> the process was killed a perfectly normal console returned.
>>> 
>>> The interesting thing is that this test system is configured with no swap 
>>> space.
>>> 
>>> This is on 13.1-RC5.
>>> 
 This disables swapping out of process kernel stacks. It
 is just that, with that option removed for gaining free RAM,
 there are fewer options tried before a kill is initiated. It
 is not a loader-time tunable but is writable, thus the
 /etc/sysctl.conf placement.
>>> 
>>> Is that really what it does? From a quick look at the code in 
>>> vm/vm_swapout.c, it seems a little more complex.
>> 
>> I was going by its description:
>> 
>> # sysctl -d vm.swap_enabled
>> vm.swap_enabled: Enable entire process swapout
>> 
>> Based on the below, it appears that the description
>> presumes vm.swap_idle_enabled==0 (the default). In
>> my context vm.swap_idle_enabled==0 . Looks like I
>> should also list:
>> 
>> vm.swap_idle_enabled=0
>> 
>> in my /etc/sysctl.conf with a reminder comment that the
>> pair of =0's are required for avoiding the observed
>> hang-ups.
>> 
>> 
>> The  analysis goes like . . .
>> 
>> I see in the code that vm.swap_enabled !=0 causes
>> VM_SWAP_NORMAL :
>> 
>> void
>> vm_swapout_run(void)
>> {
>> 
>>     if (vm_swap_enabled)
>>         vm_req_vmdaemon(VM_SWAP_NORMAL);
>> }
>> 
>> and that in turn leads vm_daemon to do:
>> 
>>     if (swapout_flags != 0) {
>>         /*
>>          * Drain the per-CPU page queue batches as a deadlock
>>          * avoidance measure.
>>          */
>>         if ((swapout_flags & VM_SWAP_NORMAL) != 0)
>>             vm_page_pqbatch_drain();
>>         swapout_procs(swapout_flags);
>>     }
>> 
>> Note: vm.swap_idle_enabled==0 && vm.swap_enabled==0 ends
>> up with swapout_flags==0. vm.swap_idle. . . defaults seem
>> to be (in my context):
>> 
>> # sysctl -a | grep swap_idle
>> vm.swap_idle_threshold2: 10
>> vm.swap_idle_threshold1: 2
>> vm.swap_idle_enabled: 0
>> 
>> For reference:
>> 
>> /*
>>  * Idle process swapout -- run once per second when pagedaemons are
>>  * reclaiming pages.
>>  */
>> void
>> vm_swapout_run_idle(void)
>> {
>>     static long lsec;
>> 
>>     if (!vm_swap_idle_enabled || time_second == lsec)
>>         return;
>>     vm_req_vmdaemon(VM_SWAP_IDLE);
>>     lsec = time_second;
>> }
>> 
>> [So vm.swap_idle_enabled==0 avoids VM_SWAP_IDLE status.]
>> 
>> static void
>> vm_req_vmdaemon(int req)
>> {
>>     static int lastrun = 0;
>> 
>>   

Re: Chasing OOM Issues - good sysctl metrics to use?

2022-05-10 Thread Mark Millard
On 2022-May-10, at 11:49, Mark Millard  wrote:

> On 2022-May-10, at 08:47, Jan Mikkelsen  wrote:
> 
>> On 10 May 2022, at 10:01, Mark Millard  wrote:
>>> 
>>> On 2022-Apr-29, at 13:57, Mark Millard  wrote:
>>> 
 On 2022-Apr-29, at 13:41, Pete Wright  wrote:
> 
>> . . .
> 
> d'oh - went out for lunch and workstation locked up.  i *knew* i 
> shouldn't have said anything lol.
 
 Any interesting console messages ( or dmesg -a or /var/log/messages )?
 
>>> 
>>> I've been doing some testing of a patch by tijl at FreeBSD.org
>>> and have reproduced both hang-ups (ZFS/ARC context) and kills
>>> (UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
>>> memory", both with and without the patch. This is with only a
>>> tiny fraction of the swap partition(s) enabled being put to
>>> use. So far, the testing was deliberately with
>>> vm.pageout_oom_seq=12 (the default value). My testing has been
>>> with main [so: 14].
>>> 
>>> But I also learned how to avoid the hang-ups that I got, at
>>> the cost of making kills more likely/quicker, other things
>>> being equal.
>>> 
>>> I discovered that the hang-ups that I got came from all the
>>> processes that I use to interact with the system ending up
>>> with their kernel threads swapped out and not being swapped
>>> back in (including sshd, so no new ssh connections). In some
>>> contexts only escaping into the kernel debugger was
>>> available; not even ^T would work. Other times ^T did work.
>>> 
>>> So, when I'm willing to risk kills in order to maintain
>>> the ability to interact normally, I now use in
>>> /etc/sysctl.conf :
>>> 
>>> vm.swap_enabled=0
>> 
>> I have been looking at an OOM related issue. Ignoring the actual leak, the 
>> problem leads to a process being killed because the system was out of 
>> memory. This is fine. After that, however, the system console was black with 
>> a single block cursor and the console keyboard was unresponsive. Caps lock 
>> and num lock didn’t toggle their lights when pressed.
>> 
>> Using an ssh session, the system looked fine. USB events for the keyboard 
>> being disconnected and reconnected appeared but the keyboard stayed 
>> unresponsive.
>> 
>> Setting vm.swap_enabled=0, as you did above, resolved this problem. After 
>> the process was killed a perfectly normal console returned.
>> 
>> The interesting thing is that this test system is configured with no swap 
>> space.
>> 
>> This is on 13.1-RC5.
>> 
>>> This disables swapping out of process kernel stacks. It
>>> is just that, with that option removed for gaining free RAM,
>>> there are fewer options tried before a kill is initiated. It
>>> is not a loader-time tunable but is writable, thus the
>>> /etc/sysctl.conf placement.
>> 
>> Is that really what it does? From a quick look at the code in 
>> vm/vm_swapout.c, it seems a little more complex.
> 
> I was going by its description:
> 
> # sysctl -d vm.swap_enabled
> vm.swap_enabled: Enable entire process swapout
> 
> Based on the below, it appears that the description
> presumes vm.swap_idle_enabled==0 (the default). In
> my context vm.swap_idle_enabled==0 . Looks like I
> should also list:
> 
> vm.swap_idle_enabled=0
> 
> in my /etc/sysctl.conf with a reminder comment that the
> pair of =0's are required for avoiding the observed
> hang-ups.
> 
> 
> The  analysis goes like . . .
> 
> I see in the code that vm.swap_enabled !=0 causes
> VM_SWAP_NORMAL :
> 
> void
> vm_swapout_run(void)
> {
> 
>     if (vm_swap_enabled)
>         vm_req_vmdaemon(VM_SWAP_NORMAL);
> }
> 
> and that in turn leads vm_daemon to do:
> 
>     if (swapout_flags != 0) {
>         /*
>          * Drain the per-CPU page queue batches as a deadlock
>          * avoidance measure.
>          */
>         if ((swapout_flags & VM_SWAP_NORMAL) != 0)
>             vm_page_pqbatch_drain();
>         swapout_procs(swapout_flags);
>     }
> 
> Note: vm.swap_idle_enabled==0 && vm.swap_enabled==0 ends
> up with swapout_flags==0. vm.swap_idle. . . defaults seem
> to be (in my context):
> 
> # sysctl -a | grep swap_idle
> vm.swap_idle_threshold2: 10
> vm.swap_idle_threshold1: 2
> vm.swap_idle_enabled: 0
> 
> For reference:
> 
> /*
>  * Idle process swapout -- run once per second when pagedaemons are
>  * reclaiming pages.
>  */
> void
> vm_swapout_run_idle(void)
> {
>     static long lsec;
> 
>     if (!vm_swap_idle_enabled || time_second == lsec)
>         return;
>     vm_req_vmdaemon(VM_SWAP_IDLE);
>     lsec = time_second;
> }
> 
> [So vm.swap_idle_enabled==0 avoids VM_SWAP_IDLE status.]
> 
> static void
> vm_req_vmdaemon(int req)
> {
>     static int lastrun = 0;
> 
>     mtx_lock(&vm_daemon_mtx);
>     vm_pageout_req_swapout |= req;
>     if ((ticks > (lastrun + hz)) || (ticks < lastrun)) {
>         wakeup(&vm_daemon_needed);
>         lastrun 

Re: Chasing OOM Issues - good sysctl metrics to use?

2022-05-10 Thread Mark Millard
On 2022-May-10, at 08:47, Jan Mikkelsen  wrote:

> On 10 May 2022, at 10:01, Mark Millard  wrote:
>> 
>> On 2022-Apr-29, at 13:57, Mark Millard  wrote:
>> 
>>> On 2022-Apr-29, at 13:41, Pete Wright  wrote:
 
> . . .
 
 d'oh - went out for lunch and workstation locked up.  i *knew* i shouldn't 
 have said anything lol.
>>> 
>>> Any interesting console messages ( or dmesg -a or /var/log/messages )?
>>> 
>> 
>> I've been doing some testing of a patch by tijl at FreeBSD.org
>> and have reproduced both hang-ups (ZFS/ARC context) and kills
>> (UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
>> memory", both with and without the patch. This is with only a
>> tiny fraction of the swap partition(s) enabled being put to
>> use. So far, the testing was deliberately with
>> vm.pageout_oom_seq=12 (the default value). My testing has been
>> with main [so: 14].
>> 
>> But I also learned how to avoid the hang-ups that I got, at
>> the cost of making kills more likely/quicker, other things
>> being equal.
>> 
>> I discovered that the hang-ups that I got came from all the
>> processes that I use to interact with the system ending up
>> with their kernel threads swapped out and not being swapped
>> back in (including sshd, so no new ssh connections). In some
>> contexts only escaping into the kernel debugger was
>> available; not even ^T would work. Other times ^T did work.
>> 
>> So, when I'm willing to risk kills in order to maintain
>> the ability to interact normally, I now use in
>> /etc/sysctl.conf :
>> 
>> vm.swap_enabled=0
> 
> I have been looking at an OOM related issue. Ignoring the actual leak, the 
> problem leads to a process being killed because the system was out of memory. 
> This is fine. After that, however, the system console was black with a single 
> block cursor and the console keyboard was unresponsive. Caps lock and num 
> lock didn’t toggle their lights when pressed.
> 
> Using an ssh session, the system looked fine. USB events for the keyboard 
> being disconnected and reconnected appeared but the keyboard stayed 
> unresponsive.
> 
> Setting vm.swap_enabled=0, as you did above, resolved this problem. After the 
> process was killed a perfectly normal console returned.
> 
> The interesting thing is that this test system is configured with no swap 
> space.
> 
> This is on 13.1-RC5.
> 
>> This disables swapping out of process kernel stacks. It
>> is just that, with that option removed for gaining free RAM,
>> there are fewer options tried before a kill is initiated. It
>> is not a loader-time tunable but is writable, thus the
>> /etc/sysctl.conf placement.
> 
> Is that really what it does? From a quick look at the code in 
> vm/vm_swapout.c, it seems a little more complex.

I was going by its description:

# sysctl -d vm.swap_enabled
vm.swap_enabled: Enable entire process swapout

Based on the below, it appears that the description
presumes vm.swap_idle_enabled==0 (the default). In
my context vm.swap_idle_enabled==0 . Looks like I
should also list:

vm.swap_idle_enabled=0

in my /etc/sysctl.conf with a reminder comment that the
pair of =0's are required for avoiding the observed
hang-ups.


The  analysis goes like . . .

I see in the code that vm.swap_enabled !=0 causes
VM_SWAP_NORMAL :

void
vm_swapout_run(void)
{

    if (vm_swap_enabled)
        vm_req_vmdaemon(VM_SWAP_NORMAL);
}

and that in turn leads vm_daemon to do:

    if (swapout_flags != 0) {
        /*
         * Drain the per-CPU page queue batches as a deadlock
         * avoidance measure.
         */
        if ((swapout_flags & VM_SWAP_NORMAL) != 0)
            vm_page_pqbatch_drain();
        swapout_procs(swapout_flags);
    }

Note: vm.swap_idle_enabled==0 && vm.swap_enabled==0 ends
up with swapout_flags==0. vm.swap_idle. . . defaults seem
to be (in my context):

# sysctl -a | grep swap_idle
vm.swap_idle_threshold2: 10
vm.swap_idle_threshold1: 2
vm.swap_idle_enabled: 0

For reference:

/*
 * Idle process swapout -- run once per second when pagedaemons are
 * reclaiming pages.
 */
void
vm_swapout_run_idle(void)
{
    static long lsec;

    if (!vm_swap_idle_enabled || time_second == lsec)
        return;
    vm_req_vmdaemon(VM_SWAP_IDLE);
    lsec = time_second;
}

[So vm.swap_idle_enabled==0 avoids VM_SWAP_IDLE status.]

static void
vm_req_vmdaemon(int req)
{
    static int lastrun = 0;

    mtx_lock(&vm_daemon_mtx);
    vm_pageout_req_swapout |= req;
    if ((ticks > (lastrun + hz)) || (ticks < lastrun)) {
        wakeup(&vm_daemon_needed);
        lastrun = ticks;
    }
    mtx_unlock(&vm_daemon_mtx);
}

[So VM_SWAP_IDLE and VM_SWAP_NORMAL are independent bits
in vm_pageout_req_swapout.]
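To make that conclusion concrete, here is a hypothetical user-space model of the request path quoted above, written for illustration and not taken from vm_swapout.c. The struct fields stand in for the two sysctls and for vm_pageout_req_swapout. The flag values, the function names, and the read-and-clear step on the vm_daemon side (the message is cut off before that code) are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values: the point is only that these are independent bits. */
#define VM_SWAP_NORMAL 0x1
#define VM_SWAP_IDLE   0x2

struct swapout_model {
    bool swap_enabled;      /* stands in for vm.swap_enabled */
    bool swap_idle_enabled; /* stands in for vm.swap_idle_enabled */
    int req_swapout;        /* stands in for vm_pageout_req_swapout */
    long idle_lsec;         /* second of the last idle request (lsec) */
    int lastrun;            /* tick of the last modeled wakeup */
    int wakeups;            /* number of modeled wakeup() calls */
};

/* Mirrors vm_req_vmdaemon(): OR in the request; wake only if > hz ticks passed. */
static void
model_req_vmdaemon(struct swapout_model *m, int req, int ticks, int hz)
{
    assert(hz > 0);
    m->req_swapout |= req;
    /* The second test restarts the throttle if ticks went backwards (wrap). */
    if (ticks > m->lastrun + hz || ticks < m->lastrun) {
        m->wakeups++;
        m->lastrun = ticks;
    }
}

/* Mirrors vm_swapout_run(): request VM_SWAP_NORMAL only if swap_enabled. */
static void
model_swapout_run(struct swapout_model *m, int ticks, int hz)
{
    if (m->swap_enabled)
        model_req_vmdaemon(m, VM_SWAP_NORMAL, ticks, hz);
}

/* Mirrors vm_swapout_run_idle(): at most one VM_SWAP_IDLE request per second. */
static void
model_swapout_run_idle(struct swapout_model *m, long time_second, int ticks,
    int hz)
{
    if (!m->swap_idle_enabled || time_second == m->idle_lsec)
        return;
    model_req_vmdaemon(m, VM_SWAP_IDLE, ticks, hz);
    m->idle_lsec = time_second;
}

/* Assumed consumer side: read and clear the pending bits (swapout_flags). */
static int
model_take_swapout_flags(struct swapout_model *m)
{
    int flags = m->req_swapout;

    m->req_swapout = 0;
    return flags;
}
```

With both booleans false, neither request function sets a bit, so the consumer sees 0, the swapout_flags==0 case noted above. Enabling either one sets only its own bit, and a second request within hz ticks adds its bit without another wakeup.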

vm_daemon does:

    mtx_lock(&vm_daemon_mtx);
 

Re: Upgrade automation

2022-05-10 Thread Miroslav Lachman

On 10/05/2022 17:46, Alan Somers wrote:

On Tue, May 10, 2022 at 9:08 AM Cristian Cardoso
 wrote:


Hi

I have some FreeBSD servers in my machine park and I would like to perform the 
version upgrade in an automated way with ansible.

In my example, I want to perform the upgrade from version 12.3 to 13. Is it 
possible to run the upgrade with the command below?

freebsd-update --not-running-from-cron upgrade -r 12.2-RELEASE

I ask this, because I don't know if it's the most correct way to execute this.

Grateful for any assistance.


Yes, that's perfect.  But there's another step too.  You'll have to do:
freebsd-update install
And _this_ step isn't easy to perfectly automate, because etcupdate
may ask for your input when it merges config files.  If you know
exactly which etc files you've modified, you can add them to
IgnorePaths.  That way etcupdate won't run interactively, it will
simply throw away changes from upstream.


Automation with etcupdate sounds very scary to me because etcupdate 
breaks real-life configuration files in place. Mergemaster did it on 
temporary copies. But if you let etcupdate leave something (after a 
merge conflict) in vital config file(s) which will have a syntax error on 
next boot, then you are out.
It would be much better if etcupdate did not edit the target file on merge 
conflicts.


Kind regards
Miroslav Lachman



Re: Upgrade automation

2022-05-10 Thread Cristian Cardoso
I currently update patches this way:


- name: Checking for updates on FreeBSD
  command: freebsd-update fetch
  when:
    - ansible_distribution == "FreeBSD"
  register: result_update
  changed_when: "'No updates needed' not in result_update.stdout"
  become: yes
  tags:
    - check-update

- name: Applying update on FreeBSD
  command: freebsd-update install
  when:
    - ansible_distribution == "FreeBSD" and result_update.changed
  register: result_update_install
  become: yes
  tags:
    - apply-update



Maybe to get around the situation after the version upgrade task, you can
do something like this:


- name: Reboot system to apply new kernel
  shell: "sleep 5 && reboot"
  async: 1
  poll: 0
  become: True

- name: Wait for reconnection to system to continue update
  wait_for_connection:
    connect_timeout: 20
    sleep: 20
    delay: 60
    timeout: 600

- name: Applying update on FreeBSD
  command: freebsd-update install
  when:
    - ansible_distribution == "FreeBSD" and result_update.changed
  register: result_update_install
  become: yes



On Tue, May 10, 2022 at 12:47, Alan Somers 
wrote:

> On Tue, May 10, 2022 at 9:08 AM Cristian Cardoso
>  wrote:
> >
> > Hi
> >
> > I have some FreeBSD servers in my machine park and I would like to
> perform the version upgrade in an automated way with ansible.
> >
> > In my example, I want to perform the upgrade from version 12.3 to 13. Is
> > it possible to run the upgrade with the command below?
> >
> > freebsd-update --not-running-from-cron upgrade -r 12.2-RELEASE
> >
> > I ask this, because I don't know if it's the most correct way to execute
> this.
> >
> > Grateful for any assistance.
>
> Yes, that's perfect.  But there's another step too.  You'll have to do:
> freebsd-update install
> And _this_ step isn't easy to perfectly automate, because etcupdate
> may ask for your input when it merges config files.  If you know
> exactly which etc files you've modified, you can add them to
> IgnorePaths.  That way etcupdate won't run interactively, it will
> simply throw away changes from upstream.
>
> Whenever I need to upgrade multiple machines at once, I start tmux,
> split it into multiple panes, ssh to each server from one pane, then
> do ":synchronize-panes on" so my input will be directed to multiple
> panes simultaneously.  Usually, that works for 90% of the upgrade.
> But invariably there are a few files that aren't synchronized between
> the servers, and I have to desynchronize my panes to deal with that.
>
> -Alan
>


Re: Chasing OOM Issues - good sysctl metrics to use?

2022-05-10 Thread Jan Mikkelsen
On 10 May 2022, at 10:01, Mark Millard  wrote:
> 
> On 2022-Apr-29, at 13:57, Mark Millard  wrote:
> 
>> On 2022-Apr-29, at 13:41, Pete Wright  wrote:
>>> 
 . . .
>>> 
>>> d'oh - went out for lunch and workstation locked up.  i *knew* i shouldn't 
>>> have said anything lol.
>> 
>> Any interesting console messages ( or dmesg -a or /var/log/messages )?
>> 
> 
> I've been doing some testing of a patch by tijl at FreeBSD.org
> and have reproduced both hang-ups (ZFS/ARC context) and kills
> (UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
> memory", both with and without the patch. This is with only a
> tiny fraction of the swap partition(s) enabled being put to
> use. So far, the testing was deliberately with
> vm.pageout_oom_seq=12 (the default value). My testing has been
> with main [so: 14].
> 
> But I also learned how to avoid the hang-ups that I got, at
> the cost of making kills more likely/quicker, other things
> being equal.
> 
> I discovered that the hang-ups that I got came from all the
> processes that I use to interact with the system ending up
> with their kernel threads swapped out and not being swapped
> back in (including sshd, so no new ssh connections). In some
> contexts only escaping into the kernel debugger was
> available; not even ^T would work. Other times ^T did work.
> 
> So, when I'm willing to risk kills in order to maintain
> the ability to interact normally, I now use in
> /etc/sysctl.conf :
> 
> vm.swap_enabled=0

I have been looking at an OOM related issue. Ignoring the actual leak, the 
problem leads to a process being killed because the system was out of memory. 
This is fine. After that, however, the system console was black with a single 
block cursor and the console keyboard was unresponsive. Caps lock and num lock 
didn’t toggle their lights when pressed.

Using an ssh session, the system looked fine. USB events for the keyboard being 
disconnected and reconnected appeared but the keyboard stayed unresponsive.

Setting vm.swap_enabled=0, as you did above, resolved this problem. After the 
process was killed a perfectly normal console returned.

The interesting thing is that this test system is configured with no swap space.

This is on 13.1-RC5.

> This disables swapping out of process kernel stacks. It
> is just that, with that option removed for gaining free RAM,
> there are fewer options tried before a kill is initiated. It
> is not a loader-time tunable but is writable, thus the
> /etc/sysctl.conf placement.

Is that really what it does? From a quick look at the code in vm/vm_swapout.c, 
it seems a little more complex.

Regards,

Jan M.





Re: Upgrade automation

2022-05-10 Thread Alan Somers
On Tue, May 10, 2022 at 9:08 AM Cristian Cardoso
 wrote:
>
> Hi
>
> I have some FreeBSD servers in my machine park and I would like to perform 
> the version upgrade in an automated way with ansible.
>
> In my example, I want to perform the upgrade from version 12.3 to 13. Is it 
> possible to run the upgrade with the command below?
>
> freebsd-update --not-running-from-cron upgrade -r 12.2-RELEASE
>
> I ask this, because I don't know if it's the most correct way to execute this.
>
> Grateful for any assistance.

Yes, that's perfect.  But there's another step too.  You'll have to do:
freebsd-update install
And _this_ step isn't easy to perfectly automate, because etcupdate
may ask for your input when it merges config files.  If you know
exactly which etc files you've modified, you can add them to
IgnorePaths.  That way etcupdate won't run interactively, it will
simply throw away changes from upstream.

Whenever I need to upgrade multiple machines at once, I start tmux,
split it into multiple panes, ssh to each server from one pane, then
do ":synchronize-panes on" so my input will be directed to multiple
panes simultaneously.  Usually, that works for 90% of the upgrade.
But invariably there are a few files that aren't synchronized between
the servers, and I have to desynchronize my panes to deal with that.

-Alan



Upgrade automation

2022-05-10 Thread Cristian Cardoso
Hi

I have some FreeBSD servers in my machine park and I would like to perform
the version upgrade in an automated way with ansible.

In my example, I want to perform the upgrade from version 12.3 to 13. Is it
possible to run the upgrade with the command below?

freebsd-update --not-running-from-cron upgrade -r 12.2-RELEASE

I ask this, because I don't know if it's the most correct way to execute
this.

Grateful for any assistance.


Re: Testing 14-CURRENT-f44280bf5fb on aarch64

2022-05-10 Thread Hans Petter Selasky

On 5/10/22 09:37, Daniel Morante wrote:
Updated to the latest (14.0-CURRENT #2 main-n255521-10f44229dcd: Tue May 
10 02:52:27 EDT 2022) and removed the sysctl option 
(hw.usb.disable_enumeration=1).


Still seeing the problem.  The below just endlessly prints out on the 
console:


```

FreeBSD/arm64 (mars.morante.com) (ttyu0)

login: ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub_attach: port 1 power on or off failed, USB_ERR_IOERROR
uhub_attach: port 2 power on or off failed, USB_ERR_IOERROR
uhub_attach: port 3 power on or off failed, USB_ERR_IOERROR
uhub_attach: port 4 power on or off failed, USB_ERR_IOERROR
uhub4: 4 ports with 4 removable, self powered
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 1
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 2
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 3
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 4
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
```



Hi,

Does it help to do a "usbconfig -d X.Y reset" of the parent USB HUB of 
the failing one? I guess this is ugen0.1.


--HPS




Re: Chasing OOM Issues - good sysctl metrics to use?

2022-05-10 Thread Mark Millard
On 2022-Apr-29, at 13:57, Mark Millard  wrote:

> On 2022-Apr-29, at 13:41, Pete Wright  wrote:
>> 
>>> . . .
>> 
>> d'oh - went out for lunch and workstation locked up.  i *knew* i shouldn't 
>> have said anything lol.
> 
> Any interesting console messages ( or dmesg -a or /var/log/messages )?
> 

I've been doing some testing of a patch by tijl at FreeBSD.org
and have reproduced both hang-ups (ZFS/ARC context) and kills
(UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
memory", both with and without the patch. This is with only a
tiny fraction of the swap partition(s) enabled being put to
use. So far, the testing was deliberately with
vm.pageout_oom_seq=12 (the default value). My testing has been
with main [so: 14].

But I also learned how to avoid the hang-ups that I got, at
the cost of making kills more likely/quicker, other things
being equal.

I discovered that the hang-ups that I got came from all the
processes that I use to interact with the system ending up
with their kernel threads swapped out and not being swapped
back in (including sshd, so no new ssh connections). In some
contexts only escaping into the kernel debugger was
available; not even ^T would work. Other times ^T did work.

So, when I'm willing to risk kills in order to maintain
the ability to interact normally, I now use in
/etc/sysctl.conf :

vm.swap_enabled=0

This disables swapping out of process kernel stacks. It
is just that, with that option removed for gaining free RAM,
there are fewer options tried before a kill is initiated. It
is not a loader-time tunable but is writable, thus the
/etc/sysctl.conf placement.

Note that I get kills both for vm.swap_enabled=0 and for
vm.swap_enabled=1 . It is just what looks like a hangup
that I'm trying to control via using =0 .

For now, I view my use as experimental. It might require
adjusting my vm.pageout_oom_seq=120 usage.

I've yet to use protect to also prevent kills of processes
needed for the interactions ( see: man 1 protect ). Most
likely I'd try to protect enough to allow the console
interactions to avoid being killed.


For reference . . .

The type of testing is to use the likes of:

# stress -m 2 --vm-bytes M --vm-keep

and part of the time with grep activity also
running, such as:

# grep -r nfreed /usr/*-src/sys/ | more

for specific values where the * is. (I have
13_0R , 13_1R , 13S , and main .) Varying
the value leads to reading new material
instead of referencing buffered/cached
material from the prior grep(s).

The  is roughly set up so that the system
ends up about where its initial Free RAM is
used up, so near (above or below) where some
sustained paging starts. I explore figures
that make the system land in this state.
I do not have a use-exactly-this computed
figure technique. But I run into the problems
fairly easily/quickly so far. As stress itself
uses some memory, the  need not be strictly
based on exactly 1/2 of the initial Free RAM
value, but that figure suggests where I explore
around.
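As a hedged illustration of where the 1/2 comes from: stress's -m 2 starts two vm workers and --vm-bytes is the amount each worker allocates, so half of the initial Free RAM per worker is the natural starting guess. The helper below is hypothetical; it only computes that starting point in MiB (assuming the M suffix is read as MiB), and the actual figure still has to be explored as described above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: a starting --vm-bytes figure in MiB for
 * "stress -m <workers> --vm-bytes <N>M --vm-keep", splitting the initial
 * free RAM evenly across the workers (each worker allocates --vm-bytes).
 * Integer division rounds down; this is a starting point, not an answer.
 */
static uint64_t
start_vm_mib(uint64_t free_bytes, unsigned workers)
{
    assert(workers > 0);
    return free_bytes / workers / (1024 * 1024);
}
```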

The kills sometimes are not during the grep
but somewhat after. Sometimes, after grep is
done, stopping stress and starting it again
leads to a fairly quick kill.

The system used for the testing is an aarch64
MACCHIATObin Double Shot (4 Cortex-A72s) with
16 GiBytes of RAM. I can boot either its ZFS
media or its UFS media. (The other OS media is
normally ignored by the system configuration.)

===
Mark Millard
marklmi at yahoo.com




Re: Testing 14-CURRENT-f44280bf5fb on aarch64

2022-05-10 Thread Daniel Morante
Updated to the latest (14.0-CURRENT #2 main-n255521-10f44229dcd: Tue May 
10 02:52:27 EDT 2022) and removed the sysctl option 
(hw.usb.disable_enumeration=1).


Still seeing the problem.  The below just endlessly prints out on the 
console:


```

FreeBSD/arm64 (mars.morante.com) (ttyu0)

login: ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub_attach: port 1 power on or off failed, USB_ERR_IOERROR
uhub_attach: port 2 power on or off failed, USB_ERR_IOERROR
uhub_attach: port 3 power on or off failed, USB_ERR_IOERROR
uhub_attach: port 4 power on or off failed, USB_ERR_IOERROR
uhub4: 4 ports with 4 removable, self powered
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 1
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 2
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 3
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 4
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2:  at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2:  at usbus0
uhub4 numa-domain 0 on uhub0
uhub4:  on usbus0
uhub4: 4 ports with 4 removable, self powered
```

On 5/4/2022 4:10 AM, Hans Petter Selasky wrote:

On 5/4/22 09:49, Daniel Morante wrote:
I'm still using the sysctl option "hw.usb.disable_enumeration=1" to 
prevent the USB devices from disconnecting/reconnecting every few 
seconds.  Other than that, stability with 14-CURRENT on aarch64 on 
this hardware is much better than the last time I tried, back in late 
February 2022.


Hi Daniel,

Could you try the very latest 14-current as of now? I've made a couple 
of USB fixes which may fix the issue you are seeing.


--HPS