VNET lock reversal
Just updated my -CURRENT box from a build almost exactly 14 days ago to
one just about an hour old. When a VNET jail starts, I'm seeing a lock
reversal:

lock order reversal:
 1st 0x81e893a8 allprison (allprison, sx) @ /usr/src/sys/kern/kern_jail.c:1378
 2nd 0x81f99fe8 vnet_sysinit_sxlock (vnet_sysinit_sxlock, sx) @ /usr/src/sys/net/vnet.c:579
lock order allprison -> vnet_sysinit_sxlock attempted at:
#0 0x80c9b7c6 at witness_checkorder+0xbd6
#1 0x80c35c67 at _sx_slock_int+0x67
#2 0x80d92185 at vnet_alloc+0x115
#3 0x80be7e02 at kern_jail_set+0x1722
#4 0x80be92f0 at sys_jail_set+0x40
#5 0x811200aa at amd64_syscall+0x13a
#6 0x810f12eb at fast_syscall_common+0xf8

I'll try and see if I can get a bisect going if somebody else hasn't
seen this yet.

-Dustin
Re: Chasing OOM Issues - good sysctl metrics to use?
On 2022-May-10, at 08:47, Jan Mikkelsen wrote:

> On 10 May 2022, at 10:01, Mark Millard wrote:
>>
>> On 2022-Apr-29, at 13:57, Mark Millard wrote:
>>
>>> On 2022-Apr-29, at 13:41, Pete Wright wrote:
>>>>
>>>>> . . .
>>>>
>>>> d'oh - went out for lunch and workstation locked up. i *knew* i
>>>> shouldn't have said anything lol.
>>>
>>> Any interesting console messages ( or dmesg -a or /var/log/messages )?
>>>
>>
>> I've been doing some testing of a patch by tijl at FreeBSD.org
>> and have reproduced both hang-ups (ZFS/ARC context) and kills
>> (UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
>> memory", both with and without the patch. This is with only a
>> tiny fraction of the swap partition(s) enabled being put to
>> use. So far, the testing was deliberately with
>> vm.pageout_oom_seq=12 (the default value). My testing has been
>> with main [so: 14].
>>
>> But I also learned how to avoid the hang-ups that I got --but
>> it costs making kills more likely/quicker, other things being
>> equal.
>>
>> I discovered that the hang-ups that I got were from all the
>> processes that I interact with the system via ending up with
>> the process's kernel threads swapped out and were not being
>> swapped in. (including sshd, so no new ssh connections). In
>> some contexts I only had escaping into the kernel debugger
>> available, not even ^T would work. Other times ^T did work.
>>
>> So, when I'm willing to risk kills in order to maintain
>> the ability to interact normally, I now use in
>> /etc/sysctl.conf :
>>
>> vm.swap_enabled=0
>
> I have been looking at an OOM related issue. Ignoring the actual leak, the
> problem leads to a process being killed because the system was out of memory.
> This is fine. After that, however, the system console was black with a single
> block cursor and the console keyboard was unresponsive. Caps lock and num
> lock didn't toggle their lights when pressed.
>
> Using an ssh session, the system looked fine. USB events for the keyboard
> being disconnected and reconnected appeared but the keyboard stayed
> unresponsive.
>
> Setting vm.swap_enabled=0, as you did above, resolved this problem. After the
> process was killed a perfectly normal console returned.
>
> The interesting thing is that this test system is configured with no swap
> space.
>
> This is on 13.1-RC5.
>
>> This disables swapping out of process kernel stacks. It
>> is just that, with that option removed for gaining free RAM,
>> there are fewer options tried before a kill is initiated. It
>> is not a loader-time tunable but is writable, thus the
>> /etc/sysctl.conf placement.
>
> Is that really what it does? From a quick look at the code in
> vm/vm_swapout.c, it seems a little more complex.

I was going by its description:

# sysctl -d vm.swap_enabled
vm.swap_enabled: Enable entire process swapout

Based on the below, it appears that the description
presumes vm.swap_idle_enabled==0 (the default). In
my context vm.swap_idle_enabled==0 . Looks like I
should also list:

vm.swap_idle_enabled=0

in my /etc/sysctl.conf with a reminder comment that the
pair of =0's are required for avoiding the observed
hang-ups.


The analysis goes like . . .

I see in the code that vm.swap_enabled !=0 causes
VM_SWAP_NORMAL :

void
vm_swapout_run(void)
{

	if (vm_swap_enabled)
		vm_req_vmdaemon(VM_SWAP_NORMAL);
}

and that in turn leads vm_daemon to:

	if (swapout_flags != 0) {
		/*
		 * Drain the per-CPU page queue batches as a deadlock
		 * avoidance measure.
		 */
		if ((swapout_flags & VM_SWAP_NORMAL) != 0)
			vm_page_pqbatch_drain();
		swapout_procs(swapout_flags);
	}

Note: vm.swap_idle_enabled==0 && vm.swap_enabled==0 ends
up with swapout_flags==0. vm.swap_idle. . . defaults seem
to be (in my context):

# sysctl -a | grep swap_idle
vm.swap_idle_threshold2: 10
vm.swap_idle_threshold1: 2
vm.swap_idle_enabled: 0

For reference:

/*
 * Idle process swapout -- run once per second when pagedaemons are
 * reclaiming pages.
 */
void
vm_swapout_run_idle(void)
{
	static long lsec;

	if (!vm_swap_idle_enabled || time_second == lsec)
		return;
	vm_req_vmdaemon(VM_SWAP_IDLE);
	lsec = time_second;
}

[So vm.swap_idle_enabled==0 avoids VM_SWAP_IDLE status.]

static void
vm_req_vmdaemon(int req)
{
	static int lastrun = 0;

	mtx_lock(&vm_daemon_mtx);
	vm_pageout_req_swapout |= req;
	if ((ticks > (lastrun + hz)) || (ticks < lastrun)) {
		wakeup(&vm_daemon_needed);
		lastrun = ticks;
	}
	mtx_unlock(&vm_daemon_mtx);
}

[So VM_SWAP_IDLE and VM_SWAP_NORMAL are independent bits
in vm_pageout_req_swapout.]

vm_daemon does:

	mtx_lock(&vm_daemon_mtx);
Re: Upgrade automation
On 10/05/2022 17:46, Alan Somers wrote:
> On Tue, May 10, 2022 at 9:08 AM Cristian Cardoso wrote:
>>
>> Hi
>>
>> I have some FreeBSD servers in my machine park and I would like to
>> perform the version upgrade in an automated way with ansible.
>>
>> In my example, I want to perform the upgrade from version 12.3 to 13,
>> it is possible to run the upgrade with the command below:
>>
>> freebsd-update --not-running-from-cron upgrade -r 12.2-RELEASE
>>
>> I ask this, because I don't know if it's the most correct way to
>> execute this.
>>
>> Grateful for any assistance.
>
> Yes, that's perfect. But there's another step too. You'll have to do:
> freebsd-update install
> And _this_ step isn't easy to perfectly automate, because etcupdate
> may ask for your input when it merges config files. If you know
> exactly which etc files you've modified, you can add them to
> IgnorePaths. That way etcupdate won't run interactively, it will
> simply throw away changes from upstream.

Automation with etcupdate sounds very scary to me because etcupdate
breaks real-life configuration files in place. Mergemaster did it on
temporary copies. But if you let etcupdate leave something (after a
merge conflict) in vital config file(s) which will then have a syntax
error on next boot, then you are out.
It would be much better if etcupdate did not edit the target file on
merge conflicts.

Kind regards
Miroslav Lachman
Re: Upgrade automation
I currently apply patch updates this way:

- name: Checking for updates on FreeBSD
  command: freebsd-update fetch
  when:
    - ansible_distribution == "FreeBSD"
  register: result_update
  changed_when: "'No updates needed' not in result_update.stdout"
  become: yes
  tags:
    - check-update

- name: Applying update on FreeBSD
  command: freebsd-update install
  when:
    - ansible_distribution == "FreeBSD" and result_update.changed
  register: result_update_install
  become: yes
  tags:
    - apply-update

Maybe to get around the situation after the version upgrade task, you
can do something like this:

- name: Reboot system to apply new kernel
  shell: "sleep 5 && reboot"
  async: 1
  poll: 0
  become: True

- name: Wait for reconnection to system to continue update
  wait_for_connection:
    connect_timeout: 20
    sleep: 20
    delay: 60
    timeout: 600

- name: Applying update on FreeBSD
  command: freebsd-update install
  when:
    - ansible_distribution == "FreeBSD" and result_update.changed
  register: result_update_install
  become: yes

On Tue, May 10, 2022 at 12:47, Alan Somers wrote:

> On Tue, May 10, 2022 at 9:08 AM Cristian Cardoso wrote:
> >
> > Hi
> >
> > I have some FreeBSD servers in my machine park and I would like to
> > perform the version upgrade in an automated way with ansible.
> >
> > In my example, I want to perform the upgrade from version 12.3 to 13,
> > it is possible to run the upgrade with the command below:
> >
> > freebsd-update --not-running-from-cron upgrade -r 12.2-RELEASE
> >
> > I ask this, because I don't know if it's the most correct way to
> > execute this.
> >
> > Grateful for any assistance.
>
> Yes, that's perfect. But there's another step too. You'll have to do:
> freebsd-update install
> And _this_ step isn't easy to perfectly automate, because etcupdate
> may ask for your input when it merges config files. If you know
> exactly which etc files you've modified, you can add them to
> IgnorePaths. That way etcupdate won't run interactively, it will
> simply throw away changes from upstream.
>
> Whenever I need to upgrade multiple machines at once, I start tmux,
> split it into multiple panes, ssh to each server from one pane, then
> do ":setw synchronize-panes on" so my input will be directed to
> multiple panes simultaneously. Usually, that works for 90% of the
> upgrade. But invariably there are a few files that aren't synchronized
> between the servers, and I have to desynchronize my panes to deal with
> that.
>
> -Alan
Re: Chasing OOM Issues - good sysctl metrics to use?
On 10 May 2022, at 10:01, Mark Millard wrote:
>
> On 2022-Apr-29, at 13:57, Mark Millard wrote:
>
>> On 2022-Apr-29, at 13:41, Pete Wright wrote:
>>>
>>>> . . .
>>>
>>> d'oh - went out for lunch and workstation locked up. i *knew* i shouldn't
>>> have said anything lol.
>>
>> Any interesting console messages ( or dmesg -a or /var/log/messages )?
>>
>
> I've been doing some testing of a patch by tijl at FreeBSD.org
> and have reproduced both hang-ups (ZFS/ARC context) and kills
> (UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
> memory", both with and without the patch. This is with only a
> tiny fraction of the swap partition(s) enabled being put to
> use. So far, the testing was deliberately with
> vm.pageout_oom_seq=12 (the default value). My testing has been
> with main [so: 14].
>
> But I also learned how to avoid the hang-ups that I got --but
> it costs making kills more likely/quicker, other things being
> equal.
>
> I discovered that the hang-ups that I got were from all the
> processes that I interact with the system via ending up with
> the process's kernel threads swapped out and were not being
> swapped in. (including sshd, so no new ssh connections). In
> some contexts I only had escaping into the kernel debugger
> available, not even ^T would work. Other times ^T did work.
>
> So, when I'm willing to risk kills in order to maintain
> the ability to interact normally, I now use in
> /etc/sysctl.conf :
>
> vm.swap_enabled=0

I have been looking at an OOM related issue. Ignoring the actual leak,
the problem leads to a process being killed because the system was out
of memory. This is fine. After that, however, the system console was
black with a single block cursor and the console keyboard was
unresponsive. Caps lock and num lock didn't toggle their lights when
pressed.

Using an ssh session, the system looked fine. USB events for the
keyboard being disconnected and reconnected appeared but the keyboard
stayed unresponsive.

Setting vm.swap_enabled=0, as you did above, resolved this problem.
After the process was killed a perfectly normal console returned.

The interesting thing is that this test system is configured with no
swap space.

This is on 13.1-RC5.

> This disables swapping out of process kernel stacks. It
> is just that, with that option removed for gaining free RAM,
> there are fewer options tried before a kill is initiated. It
> is not a loader-time tunable but is writable, thus the
> /etc/sysctl.conf placement.

Is that really what it does? From a quick look at the code in
vm/vm_swapout.c, it seems a little more complex.

Regards,

Jan M.
Re: Upgrade automation
On Tue, May 10, 2022 at 9:08 AM Cristian Cardoso wrote:
>
> Hi
>
> I have some FreeBSD servers in my machine park and I would like to perform
> the version upgrade in an automated way with ansible.
>
> In my example, I want to perform the upgrade from version 12.3 to 13, it is
> possible to run the upgrade with the command below:
>
> freebsd-update --not-running-from-cron upgrade -r 12.2-RELEASE
>
> I ask this, because I don't know if it's the most correct way to execute this.
>
> Grateful for any assistance.

Yes, that's perfect. But there's another step too. You'll have to do:
freebsd-update install
And _this_ step isn't easy to perfectly automate, because etcupdate
may ask for your input when it merges config files. If you know
exactly which etc files you've modified, you can add them to
IgnorePaths. That way etcupdate won't run interactively, it will
simply throw away changes from upstream.

Whenever I need to upgrade multiple machines at once, I start tmux,
split it into multiple panes, ssh to each server from one pane, then
do ":setw synchronize-panes on" so my input will be directed to
multiple panes simultaneously. Usually, that works for 90% of the
upgrade. But invariably there are a few files that aren't synchronized
between the servers, and I have to desynchronize my panes to deal with
that.

-Alan
Upgrade automation
Hi

I have some FreeBSD servers in my machine park and I would like to
perform the version upgrade in an automated way with ansible.

In my example, I want to perform the upgrade from version 12.3 to 13.
Is it possible to run the upgrade with the command below?

freebsd-update --not-running-from-cron upgrade -r 12.2-RELEASE

I ask this because I don't know if it's the most correct way to
execute this.

Grateful for any assistance.
Re: Testing 14-CURRENT-f44280bf5fb on aarch64
On 5/10/22 09:37, Daniel Morante wrote:
> Updated to the latest (14.0-CURRENT #2 main-n255521-10f44229dcd: Tue May
> 10 02:52:27 EDT 2022) and removed the sysctl option
> (hw.usb.disable_enumeration=1). Still seeing the problem.
>
> The below just endlessly prints out on the console:
>
> ```
> FreeBSD/arm64 (mars.morante.com) (ttyu0)
>
> login: ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub_attach: port 1 power on or off failed, USB_ERR_IOERROR
> uhub_attach: port 2 power on or off failed, USB_ERR_IOERROR
> uhub_attach: port 3 power on or off failed, USB_ERR_IOERROR
> uhub_attach: port 4 power on or off failed, USB_ERR_IOERROR
> uhub4: 4 ports with 4 removable, self powered
> uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 1
> uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 2
> uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 3
> uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 4
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ugen0.2: at usbus0 (disconnected)
> uhub4: at uhub0, port 1, addr 1 (disconnected)
> uhub4: detached
> ugen0.2: at usbus0
> uhub4 numa-domain 0 on uhub0
> uhub4: on usbus0
> uhub4: 4 ports with 4 removable, self powered
> ```

Hi,

Does it help to do a "usbconfig -d X.Y reset" of the parent USB HUB of
the failing one? I guess this is ugen0.1.

--HPS
Re: Chasing OOM Issues - good sysctl metrics to use?
On 2022-Apr-29, at 13:57, Mark Millard wrote:

> On 2022-Apr-29, at 13:41, Pete Wright wrote:
>>
>>> . . .
>>
>> d'oh - went out for lunch and workstation locked up. i *knew* i shouldn't
>> have said anything lol.
>
> Any interesting console messages ( or dmesg -a or /var/log/messages )?
>

I've been doing some testing of a patch by tijl at FreeBSD.org
and have reproduced both hang-ups (ZFS/ARC context) and kills
(UFS/noARC and ZFS/ARC) for "was killed: failed to reclaim
memory", both with and without the patch. This is with only a
tiny fraction of the swap partition(s) enabled being put to
use. So far, the testing was deliberately with
vm.pageout_oom_seq=12 (the default value). My testing has been
with main [so: 14].

But I also learned how to avoid the hang-ups that I got --but
it costs making kills more likely/quicker, other things being
equal.

I discovered that the hang-ups that I got were from all the
processes that I interact with the system via ending up with
the process's kernel threads swapped out and were not being
swapped in. (including sshd, so no new ssh connections). In
some contexts I only had escaping into the kernel debugger
available, not even ^T would work. Other times ^T did work.

So, when I'm willing to risk kills in order to maintain
the ability to interact normally, I now use in
/etc/sysctl.conf :

vm.swap_enabled=0

This disables swapping out of process kernel stacks. It
is just that, with that option removed for gaining free RAM,
there are fewer options tried before a kill is initiated. It
is not a loader-time tunable but is writable, thus the
/etc/sysctl.conf placement.

Note that I get kills both for vm.swap_enabled=0 and for
vm.swap_enabled=1 . It is just what looks like a hang-up
that I'm trying to control via using =0 .

For now, I view my use as experimental. It might require
adjusting my vm.pageout_oom_seq=120 usage.

I've yet to use protect to also prevent kills of processes
needed for the interactions ( see: man 1 protect ). Most
likely I'd try to protect enough to allow the console
interactions to avoid being killed.


For reference . . .

The type of testing is to use the likes of:

# stress -m 2 --vm-bytes <N>M --vm-keep

and part of the time with grep activity also running,
such as:

# grep -r nfreed /usr/*-src/sys/ | more

for specific values where the * is. (I have 13_0R , 13_1R ,
13S , and main .) Varying the value leads to reading new
material instead of referencing buffered/cached material
from the prior grep(s).

The <N> figure is roughly set up so that the system ends up
about where its initial Free RAM is used up, so near (above
or below) where some sustained paging starts. I explore
figures that make the system land in this state. I do not
have a use-exactly-this computed figure technique. But I run
into the problems fairly easily/quickly so far. As stress
itself uses some memory, the <N> figure need not be strictly
based on exactly 1/2 of the initial Free RAM value --but that
figure suggests where I explore around.

The kills sometimes are not during the grep but somewhat
after. Sometimes, after grep is done, stopping stress and
starting it again leads to a fairly quick kill.

The system used for the testing is an aarch64 MACCHIATObin
Double Shot (4 Cortex-A72s) with 16 GiBytes of RAM. I can
boot either its ZFS media or its UFS media. (The other OS
media is normally ignored by the system configuration.)

===
Mark Millard
marklmi at yahoo.com
Re: Testing 14-CURRENT-f44280bf5fb on aarch64
Updated to the latest (14.0-CURRENT #2 main-n255521-10f44229dcd: Tue May
10 02:52:27 EDT 2022) and removed the sysctl option
(hw.usb.disable_enumeration=1). Still seeing the problem.

The below just endlessly prints out on the console:

```
FreeBSD/arm64 (mars.morante.com) (ttyu0)

login: ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub_attach: port 1 power on or off failed, USB_ERR_IOERROR
uhub_attach: port 2 power on or off failed, USB_ERR_IOERROR
uhub_attach: port 3 power on or off failed, USB_ERR_IOERROR
uhub_attach: port 4 power on or off failed, USB_ERR_IOERROR
uhub4: 4 ports with 4 removable, self powered
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 1
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 2
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 3
uhub_reattach_port: device problem (USB_ERR_IOERROR), disabling port 4
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
ugen0.2: at usbus0 (disconnected)
uhub4: at uhub0, port 1, addr 1 (disconnected)
uhub4: detached
ugen0.2: at usbus0
uhub4 numa-domain 0 on uhub0
uhub4: on usbus0
uhub4: 4 ports with 4 removable, self powered
```

On 5/4/2022 4:10 AM, Hans Petter Selasky wrote:
> On 5/4/22 09:49, Daniel Morante wrote:
>> I'm still using the sysctl option "hw.usb.disable_enumeration=1" to
>> prevent the USB devices from disconnecting/reconnecting every few
>> seconds. Other than that, stability with 14-CURRENT on aarch64 on
>> this hardware is much improved since the last time I tried, back in
>> late February 2022.
>
> Hi Daniel,
>
> Could you try the very latest 14-current as of now? I've made a couple
> of USB fixes which may fix the issue you are seeing.
>
> --HPS