EFI framebuffer information, then nothing

2022-04-19 Thread Graham Perrin
Sometimes, boot gets no further than the EFI framebuffer information.
Whenever this occurs, the computer powers off immediately in response to
a single short press of the power button.


The two lines preceding the framebuffer information (copy-typed from a
photograph):


staging 0xadc0 (not copying) tramp 0xbecfe000 PT4 0xbecf5000
Start @ 0x80389000 ...


I cannot reproduce the issue consistently.

Originally, it seemed that only a minority of boots were affected. For a
while, probably more than a week, the majority of boots were affected
(sometimes five or more successive failures). More recently, only a
minority of boots are affected.


There's a strong sense of randomness, so I cannot be certain that, for
example, safe mode is a workaround.


At least once, boot failed whilst the notebook was undocked and
completely without peripherals.


Should I suspect any of the modules indicated below?

% cat /boot/loader.conf | grep -v \# | sort | tr -s '\n'

acpi_dock_load="YES"
acpi_hp_load="YES"
aesni_load="YES"
autofs_load="YES"
beastie_disable="NO"
boot_mute="NO"
boot_verbose="NO"
cpu_microcode_load="YES"
cpu_microcode_name="/boot/firmware/intel-ucode.bin"
cuse_load="YES"
efi_max_resolution="1600x900"
geom_eli_load="YES"
hw.pci.do_power_nodriver=3
hw.usb.no_boot_wait=0
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
kern.racct.enable=1
openzfs_load="NO"
screen.font="8x16"
sysctlbyname_improved_load="YES"
sysctlinfo_load="YES"
vboxdrv_load="YES"
verbose_loading="NO"
zfs_load="YES"
%
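
One way to narrow that down, rather than guessing: loader.conf values can
be overridden for a single boot from the loader prompt (Esc, or option 3
in the boot menu), so suspect modules can be excluded one at a time
without editing the file. A rough sketch, using vboxdrv purely as an
example of a non-essential module (geom_eli and zfs are obviously needed
to boot this layout):

OK show vboxdrv_load
OK set vboxdrv_load="NO"
OK boot

Given the apparent randomness, several boots per setting would be needed
before clearing or blaming any one module.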

HP EliteBook 8570p, probed a few minutes ago:


GELI encryption:

% lsblk ada0
DEVICE         MAJ:MIN  SIZE  TYPE          LABEL         MOUNT
ada0             0:119  932G  GPT           -             -
  ada0p1         0:121  260M  efi           gpt/efiboot0  -
                 -:-    1.0M  -             -             -
  ada0p2         0:123   16G  freebsd-swap  gpt/swap0     SWAP
  ada0p2.eli     2:52    16G  freebsd-swap  -             SWAP
  ada0p3         0:125  915G  freebsd-zfs   gpt/zfs0
  ada0p3.eli     0:131  915G  -             -             -
                 -:-    708K  -             -             -
%
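
For reference, the same layout can be cross-checked with base-system
tools alone (the lsblk above is the sysutils/lsblk port); these are
generic commands, not output from this machine:

% gpart show ada0
% gpart show -l ada0
% geli status
% zpool status -v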

Three photographs, the first of which is barely readable (sorry):

 (2022-02-18 21:31)
 (2022-03-18 15:44)
 (2022-04-17 18:58)



Daily black screen of death

2022-04-19 Thread Steve Kargl
FYI,

I'm experiencing an almost daily black screen of death panic.
Kernel, world, drm-current-kmod, and gpu-firmware-kmod were
all rebuilt and installed at the same time.  Uname shows

FreeBSD 14.0-CURRENT #0 main-n254360-eb9d205fa69: Tue Apr 5 13:49:47 PDT 2022

So, April 5th sources.

The panic results in a keyboard lock and no dump.  The system
does not have a serial console.  The only recourse is a hard reset.

Hand-transcribed from a photo:

_sleep() at _sleep+0x38a/frame 0xfe012b7c0680
buf_daemon_shutdown() at buf_daemon_shutdown+0x6b/frame 0xfe012b7c06a0
kern_reboot() at kern_reboot+0x2ae/frame 0xfe012b7c06e0
vpanic() at vpanic+0x1ee/frame 0xfe012b7c0730
panic() at panic+0x43/frame 0xfe012b7c0790

The above repeats hundreds of times, scrolling off the screen with an
ever-increasing frame pointer.

The final messages:

mi_switch() at mi_switch+0x18e/frame 0xfe012b7c14b0
__mtx_lock_sleep() at __mtx_lock_sleep+0x173/frame 0xfe012b7c1510
__mtx_lock_flags() at __mtx_lock_flags+0xc0/frame 0xfe012b7c1550
linux_wake_up() at linux_wake_up+0x38/frame 0xfe012b7c15a0
radeon_fence_is_signaled() at radeon_fence_is_signaled+0x99/frame 0xfe012b7c15f0
dma_resv_add_shared_fence() at dma_resv_add_shared_fence+0x99/frame 0xfe012b7c1640
ttm_eu_fence_buffer_objects() at ttm_eu_fence_buffer_objects+0x79/frame 0xfe012b7c1680
radeon_cs_parser_fini() at radeon_cs_parser_fini+0x53/frame 0xfe012b7c16b0
radeon_cs_ioctl() at radeon_cs_ioctl+0x75e/frame 0xfe012b7c1b30
drm_ioctl_kernel() at drm_ioctl_kernel+0xc7/frame 0xfe012b7c1b80
drm_ioctl() at drm_ioctl+0x2c3/frame 0xfe012b7c1c70
linux_file_ioctl() at linux_file_ioctl+0x309/frame 0xfe012b7c1cd0
kern_ioctl() at kern_ioctl+0x1dc/frame 0xfe012b7c1d40
sys_ioctl() at sys_ioctl+0x121/frame 0xfe012b7c1e10
amd64_syscall() at amd64_syscall+0x108/frame 0xfe012b7c1f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfe012b7c1f30
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x36a096c34ea, rsp = 0x3fa11e623eb8, rbp = 0x3fa11e623ee0 ---
panic: _sleep: curthread not running
cpuid = 4
time = 1650389478
KDB: stack backtrace:
panic: _sleep: curthread not running
cpuid = 4
time = 1650389478
KDB: stack backtrace:

One common trigger appears to be the use of firefox-99.0,2 from
the ports collection.  
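
No promises that anything survives this particular panic (the trace shows
it recursing through kern_reboot before a dump could be taken), but for
completeness the standard knobs for getting a crash dump are roughly:

# sysrc dumpdev=AUTO dumpdir=/var/crash
# service dumpon start
# sysctl debug.debugger_on_panic=0

The last one skips the debugger on panic so the dump is attempted
immediately; savecore(8) should then recover it from swap on the next
boot, if one was written.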

-- 
Steve



Re: advice sought: workflow with -CURRENT and amd GPU [Re: -CURRENT hangs since at least 2022-04-04]

2022-04-19 Thread Paul Mather
On Apr 19, 2022, at 4:49 AM, Michael Schuster  wrote:

> Hi,
> 
> I'm hijacking and re-purposing the previous thread, I hope that's OK
> (I did change the subject ;-)) - I'm keeping some of the previous
> contents for reference.
> 
> I have similar HW to OP (Ryzen 7 4700 w. Renoir Graphics), and have
> been using a similar approach to keep the machine up to date - or so I
> suspect. Still, after a while (several months), I end up with one or
> more of these:
> - I get some sort of panic in DRM (at startup or, currently, at shutdown)
> - when I boot into a previous BE to attempt a fix and then again
> reboot into the current one, I get tons of messages like this
> "... kernel: KLD iic.ko: depends on kernel - not available or
> version mismatch
>  ... kernel: linker_load_file: /boot/kernel/iic.ko - unsupported file 
> type"
>   and the computer refuses to accept input (let alone start X)
> 
> and some others I don't recall right now.
> 
> Before I ask for advice (see below), let me explain the approaches
> I've taken so far. I install with ZFS from the beginning, current boot
> env is "N". These are outlines, not exact commands:
> 
> I) never touch the current BE, always update a new one:
>  1) given current BE N, I create a new BE N+1 and mount it on /mnt,
>  2) 'cd /usr/src; git pull; sudo make DESTDIR=/mnt ... (build, install, etc)'
>  3) 'cd usr/ports/graphics/drm-devel-kmod; sudo make DESTDIR=/mnt install'
>  4) beadm activate BE N+1; reboot
> 
> II) keep a "new" BE as backup/fallback, update current BE:
>  1) given current BE N, I create a new BE N+1 (mounting not required)
> (this is the intended 'fallback')
>  2) 'cd /usr/src; git pull', then 'make' as described in the Handbook
> "24.6. Updating FreeBSD from Source"
>  3) 'cd usr/ports/graphics/drm-devel-kmod; sudo make install'
>  4) reboot
> 
> In both scenarios, I do "pkg update; pkg upgrade" from time to
> time (also following the resp. approach shown above).
> 
> I suspect that I'm missing something fundamental in my approaches -
> does anyone have a (for them) foolproof approach along these lines, or
> can someone show me what I'm missing in either of mine (in private, if
> you prefer)?


I don't know whether you're missing anything, but I wanted to mention a
tool I recently found useful for my own BE-based source upgrades:
/usr/src/tools/build/beinstall.sh

I've found it helps with the build/upgrade steps.  See the beinstall(8)
manual page for details.
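
For anyone who has not tried it, it slots into the usual source workflow
roughly like this (the build step is the ordinary one; beinstall.sh then
creates the new BE, installs kernel and world into it, merges /etc, and
should leave the new BE active for the next boot -- see beinstall(8) for
the exact behaviour and flags):

# cd /usr/src && git pull
# make -j 8 buildworld buildkernel
# tools/build/beinstall.sh
# shutdown -r now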

Cheers,

Paul.




Re: nullfs and ZFS issues

2022-04-19 Thread Doug Ambrisko
On Tue, Apr 19, 2022 at 11:47:22AM +0200, Mateusz Guzik wrote:
| Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff
| 
| This is not committable, but it should validate whether it works fine.

As a POC it's working.  I see the vnode counts for nullfs and
ZFS go up.  The ARC cache also goes up until it exceeds the ARC max
size, then the vnodes for nullfs and ZFS go down.  The ARC cache goes
down as well.  This all repeats over and over.  The system seems
healthy.  No excessive running of arc_prune or arc_evict.

My only comment is that the vnode freeing seems a bit aggressive,
going from ~15,000 to ~200 vnodes for nullfs and the same for ZFS.
The ARC drops from 70M to 7M (max is set at 64M) for this unit
test.
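
For anyone else reproducing this, the overall effect is easy to watch
from userland with stock sysctls, repeated every few seconds (this does
not break the counts down per mount, which is what the mount -v kernel
hack described elsewhere in this thread adds):

% sysctl vfs.numvnodes vfs.freevnodes kstat.zfs.misc.arcstats.size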

Thanks,

Doug A.
 
| On 4/19/22, Mateusz Guzik  wrote:
| > On 4/19/22, Mateusz Guzik  wrote:
| >> On 4/19/22, Doug Ambrisko  wrote:
| >>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
| >>> localhost NFS mounts instead of nullfs when nullfs would complain
| >>> that it couldn't mount.  Since that check has been removed, I've
| >>> switched to nullfs only.  However, every so often my laptop would
| >>> get slow and the ARC evict and prune thread would consume two
| >>> cores 100% until I rebooted.  I had a 1G max. ARC and have increased
| >>> it to 2G now.  Looking into this has uncovered some issues:
| >>>  -nullfs would prevent vnlru_free_vfsops from doing anything
| >>>   when called from ZFS arc_prune_task
| >>>  -nullfs would hang onto a bunch of vnodes unless mounted with
| >>>   nocache
| >>>  -nullfs and nocache would break untar.  This has been fixed now.
| >>>
| >>> With nullfs, nocache and setting max vnodes to a low number I can
| >>> keep the ARC around the max. without evict and prune consuming
| >>> 100% of 2 cores.  This doesn't seem like the best solution but it
| >>> is better than when the ARC starts spinning.
| >>>
| >>> Looking into this issue with bhyve and a md drive for testing I create
| >>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
| >>> I loop through untaring the Linux kernel into the nullfs mount, rm -rf
| >>> it
| >>> and repeat.  I set the ARC to the smallest value I can.  Untarring the
| >>> Linux kernel was enough to get the ARC evict and prune to spin since
| >>> they couldn't evict/prune anything.
| >>>
| >>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
| >>>   static int
| >>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
| >>>   {
| >>>   ...
| >>>
| >>> for (;;) {
| >>>   ...
| >>> vp = TAILQ_NEXT(vp, v_vnodelist);
| >>>   ...
| >>>
| >>> /*
| >>>  * Don't recycle if our vnode is from different type
| >>>  * of mount point.  Note that mp is type-safe, the
| >>>  * check does not reach unmapped address even if
| >>>  * vnode is reclaimed.
| >>>  */
| >>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
| >>> mp->mnt_op != mnt_op) {
| >>> continue;
| >>> }
| >>>   ...
| >>>
| >>> The vp ends up being the nullfs mount and then hits the continue
| >>> even though the passed in mvp is on ZFS.  If I do a hack to
| >>> comment out the continue then I see the ARC, nullfs vnodes and
| >>> ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
| >>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
| >>> The ARC cache usage also goes down.  Then they increase again until
| >>> the ARC gets full and then they go down again.  So with this hack
| >>> I don't need nocache passed to nullfs and I don't need to limit
| >>> the max vnodes.  Doing multiple untars in parallel over and over
| >>> doesn't seem to cause any issues for this test.  I'm not saying
| >>> commenting out continue is the fix but a simple POC test.
| >>>
| >>
| >> I don't see an easy way to say "this is a nullfs vnode holding onto a
| >> zfs vnode". Perhaps the routine can be extended to issue a nullfs
| >> callback, if the module is loaded.
| >>
| >> In the meantime I think a good enough(tm) fix would be to check that
| >> nothing was freed and fallback to good old regular clean up without
| >> filtering by vfsops. This would be very similar to what you are doing
| >> with your hack.
| >>
| >
| > Now that I wrote this perhaps an acceptable hack would be to extend
| > struct mount with a pointer to "lower layer" mount (if any) and patch
| > the vfsops check to also look there.
| >
| >>
| >>> It appears that when ZFS is asking for cached vnodes to be
| >>> free'd nullfs also needs to free some up as well so that
| >>> they are free'd on the VFS level.  It seems that vnlru_free_impl
| >>> should allow some of the related nullfs vnodes to be free'd so
| >>> the ZFS ones can be free'd and reduce the size of the ARC.
| >>>
| >>> BTW, I also hacked the kernel 

Re: 13-STABLE: cannot cross-build 13.0-RELENG and 13.1-RELENG, which works fine on CURRENT

2022-04-19 Thread Warner Losh
The vdso error is fixed by cherry-picking
b3b462229f972e2ed24d450d7d2f8855cdd58a87.
I'm not sure about the second error, though if it's a general profile
error, you may need WITHOUT_PROFILE=t in both the build and install phases.
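
Concretely, on the 13-STABLE build host that would look roughly like the
following (the DESTDIR path is only a placeholder for the NanoBSD world
directory, and WITHOUT_PROFILE can equally be set once in /etc/src.conf;
any value works for WITHOUT_ knobs):

# cd /usr/src
# git cherry-pick b3b462229f972e2ed24d450d7d2f8855cdd58a87
# make WITHOUT_PROFILE=yes buildworld
# make WITHOUT_PROFILE=yes installworld DESTDIR=/path/to/nanobsd/world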


On Tue, Apr 19, 2022 at 5:46 AM FreeBSD User  wrote:

> I regularly build 13-STABLE, 13.0-RELENG and 13.1-RELENG for a NanoBSD
> appliance, on either 14-CURRENT or 13-STABLE. Several days ago, some changes
> were made to /usr/src/Makefile.inc1, first on CURRENT and shortly after on 13.
>
> As of today, building either 13.1-RELENG or 13.0-RELENG from sources on a
> CURRENT host (recent 14-CURRENT!) works without problems.
>
> On 13-STABLE (FreeBSD 13.1-STABLE #26 stable/13-n250478-bb8e1dfbff3: Thu
> Apr 14
> 10:47:51 CEST 2022 amd64) building either 13.0-RELENG or 13.1-RELENG fails!
>
> Building 13.0-RELENG from sources, I receive the following during installworld:
>
> [...]
> --
> >>> Install check world
> --
> mkdir -p /tmp/install.6R4Ifq8o
> ...
> cp: [vdso]: No such file or directory
> *** Error code 1
> [...]
>
> and building 13.1-RELENG from sources, during installworld we get:
>
> [...]
> ===> usr.bin/lex/lib (install)
> install -l h -o root -g wheel -m 444 /home/ohartmann/Projects/TEST/fbsd13.1-dev/os-base/systemconf/../../world/amd64/TEST-DEV-amd64-13.1-RELENG/_.w/usr/lib/libln_p.a /home/ohartmann/Projects/TEST/fbsd13.1-dev/os-base/systemconf/../../world/amd64/TEST-DEV-amd64-13.1-RELENG/_.w/usr/lib/libl_p.a
> install: link /home/ohartmann/Projects/TEST/fbsd13.1-dev/os-base/systemconf/../../world/amd64/TEST-DEV-amd64-13.1-RELENG/_.w/usr/lib/libln_p.a -> /home/ohartmann/Projects/TEST/fbsd13.1-dev/os-base/systemconf/../../world/amd64/TEST-DEV-amd64-13.1-RELENG/_.w/usr/lib/libl_p.a: No such file or directory
> *** Error code 71
>
>
> Thanks in advance,
>
> O. Hartmann
>
>


Re: main-n254654-d4e8207317c results in "no pools available to import"

2022-04-19 Thread Thomas Laus

On 4/18/22 21:43, Graham Perrin wrote:

On 12/04/2022 23:35, Dennis Clarke wrote:

… at least two examples in the wild. …


Not the same symptom, but 
 caught my eye:



RC3 Guided ZFS on root with encryption unmountable


Beyond that, I have nothing useful to add (sorry). I do not get the "no 
pools available to import" symptom with any of the boot environments 
listed below (Git hash within the name of each BE).


Thanks for the link.  I agree that something unexpected has been
happening recently with booting a ZFS filesystem using EFI.  I noticed
that my system still flashes a "no pools available to import" message
when using an MBR boot record, at the same point where EFI boot just
hangs, but the boot process then completes soon afterward.  Putting EFI
boot back on that ada0p1 partition hangs at the same point again.  I am
fortunate that both systems can use either EFI or MBR boot records.
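
One thing worth ruling out in that situation: if the copy of loader.efi
on the ESP is older than the loader in /boot, it can lag behind the
pool's enabled feature flags while the MBR/gptzfsboot path still works.
Refreshing it is cheap to try; a sketch, assuming ada0p1 is the ESP and
the default layout (adjust the target if the firmware boots
EFI/BOOT/BOOTX64.EFI instead):

# mount_msdosfs /dev/ada0p1 /mnt
# cp /boot/loader.efi /mnt/efi/freebsd/loader.efi
# umount /mnt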


Tom


--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF



13-STABLE: cannot cross-build 13.0-RELENG and 13.1-RELENG, which works fine on CURRENT

2022-04-19 Thread FreeBSD User
I regularly build 13-STABLE, 13.0-RELENG and 13.1-RELENG for a NanoBSD appliance,
on either 14-CURRENT or 13-STABLE. Several days ago, some changes were made
to /usr/src/Makefile.inc1, first on CURRENT and shortly after on 13.

As of today, building either 13.1-RELENG or 13.0-RELENG from sources on a
CURRENT host (recent 14-CURRENT!) works without problems.

On 13-STABLE (FreeBSD 13.1-STABLE #26 stable/13-n250478-bb8e1dfbff3: Thu Apr 14
10:47:51 CEST 2022 amd64) building either 13.0-RELENG or 13.1-RELENG fails!

Building 13.0-RELENG from sources, I receive the following during installworld:

[...]
--
>>> Install check world
--
mkdir -p /tmp/install.6R4Ifq8o
...
cp: [vdso]: No such file or directory
*** Error code 1
[...]

and building 13.1-RELENG from sources, during installworld we get:

[...]
===> usr.bin/lex/lib (install)
install -l h -o root -g wheel -m 444 /home/ohartmann/Projects/TEST/fbsd13.1-dev/os-base/systemconf/../../world/amd64/TEST-DEV-amd64-13.1-RELENG/_.w/usr/lib/libln_p.a /home/ohartmann/Projects/TEST/fbsd13.1-dev/os-base/systemconf/../../world/amd64/TEST-DEV-amd64-13.1-RELENG/_.w/usr/lib/libl_p.a
install: link /home/ohartmann/Projects/TEST/fbsd13.1-dev/os-base/systemconf/../../world/amd64/TEST-DEV-amd64-13.1-RELENG/_.w/usr/lib/libln_p.a -> /home/ohartmann/Projects/TEST/fbsd13.1-dev/os-base/systemconf/../../world/amd64/TEST-DEV-amd64-13.1-RELENG/_.w/usr/lib/libl_p.a: No such file or directory
*** Error code 71


Thanks in advance,

O. Hartmann



Re: nullfs and ZFS issues

2022-04-19 Thread Mateusz Guzik
Try this: https://people.freebsd.org/~mjg/vnlru_free_pick.diff

This is not committable, but it should validate whether it works fine.

On 4/19/22, Mateusz Guzik  wrote:
> On 4/19/22, Mateusz Guzik  wrote:
>> On 4/19/22, Doug Ambrisko  wrote:
>>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
>>> localhost NFS mounts instead of nullfs when nullfs would complain
>>> that it couldn't mount.  Since that check has been removed, I've
>>> switched to nullfs only.  However, every so often my laptop would
>>> get slow and the ARC evict and prune thread would consume two
>>> cores 100% until I rebooted.  I had a 1G max. ARC and have increased
>>> it to 2G now.  Looking into this has uncovered some issues:
>>>  -  nullfs would prevent vnlru_free_vfsops from doing anything
>>> when called from ZFS arc_prune_task
>>>  -  nullfs would hang onto a bunch of vnodes unless mounted with
>>> nocache
>>>  -  nullfs and nocache would break untar.  This has been fixed now.
>>>
>>> With nullfs, nocache and setting max vnodes to a low number I can
>>> keep the ARC around the max. without evict and prune consuming
>>> 100% of 2 cores.  This doesn't seem like the best solution but it
>>> is better than when the ARC starts spinning.
>>>
>>> Looking into this issue with bhyve and a md drive for testing I create
>>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
>>> I loop through untaring the Linux kernel into the nullfs mount, rm -rf
>>> it
>>> and repeat.  I set the ARC to the smallest value I can.  Untarring the
>>> Linux kernel was enough to get the ARC evict and prune to spin since
>>> they couldn't evict/prune anything.
>>>
>>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
>>>   static int
>>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
>>>   {
>>> ...
>>>
>>> for (;;) {
>>> ...
>>> vp = TAILQ_NEXT(vp, v_vnodelist);
>>> ...
>>>
>>> /*
>>>  * Don't recycle if our vnode is from different type
>>>  * of mount point.  Note that mp is type-safe, the
>>>  * check does not reach unmapped address even if
>>>  * vnode is reclaimed.
>>>  */
>>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
>>> mp->mnt_op != mnt_op) {
>>> continue;
>>> }
>>> ...
>>>
>>> The vp ends up being the nullfs mount and then hits the continue
>>> even though the passed in mvp is on ZFS.  If I do a hack to
>>> comment out the continue then I see the ARC, nullfs vnodes and
>>> ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
>>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
>>> The ARC cache usage also goes down.  Then they increase again until
>>> the ARC gets full and then they go down again.  So with this hack
>>> I don't need nocache passed to nullfs and I don't need to limit
>>> the max vnodes.  Doing multiple untars in parallel over and over
>>> doesn't seem to cause any issues for this test.  I'm not saying
>>> commenting out continue is the fix but a simple POC test.
>>>
>>
>> I don't see an easy way to say "this is a nullfs vnode holding onto a
>> zfs vnode". Perhaps the routine can be extended to issue a nullfs
>> callback, if the module is loaded.
>>
>> In the meantime I think a good enough(tm) fix would be to check that
>> nothing was freed and fallback to good old regular clean up without
>> filtering by vfsops. This would be very similar to what you are doing
>> with your hack.
>>
>
> Now that I wrote this perhaps an acceptable hack would be to extend
> struct mount with a pointer to "lower layer" mount (if any) and patch
> the vfsops check to also look there.
>
>>
>>> It appears that when ZFS is asking for cached vnodes to be
>>> free'd nullfs also needs to free some up as well so that
>>> they are free'd on the VFS level.  It seems that vnlru_free_impl
>>> should allow some of the related nullfs vnodes to be free'd so
>>> the ZFS ones can be free'd and reduce the size of the ARC.
>>>
>>> BTW, I also hacked the kernel and mount to show the vnodes used
>>> per mount ie. mount -v:
>>>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid
>>> 2b23b2a1de21ed66,
>>> vnodes: count 13846 lazy 0)
>>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
>>> 11ff00292900, vnodes: count 13846 lazy 0)
>>>
>>> Now I can easily see how the vnodes are used without going into ddb.
>>> On my laptop I have various vnet jails and nullfs mount my homedir into
>>> them so pretty much everything goes through nullfs to ZFS.  I'm limping
>>> along with the nullfs nocache and small number of vnodes but it would be
>>> nice to not need that.
>>>
>>> Thanks,
>>>
>>> Doug A.
>>>
>>>
>>
>>
>> --
>> Mateusz Guzik 
>>
>
>
> --
> Mateusz Guzik 
>


-- 
Mateusz Guzik 



Re: nullfs and ZFS issues

2022-04-19 Thread Mateusz Guzik
On 4/19/22, Mateusz Guzik  wrote:
> On 4/19/22, Doug Ambrisko  wrote:
>> I've switched my laptop to use nullfs and ZFS.  Previously, I used
>> localhost NFS mounts instead of nullfs when nullfs would complain
>> that it couldn't mount.  Since that check has been removed, I've
>> switched to nullfs only.  However, every so often my laptop would
>> get slow and the ARC evict and prune thread would consume two
>> cores 100% until I rebooted.  I had a 1G max. ARC and have increased
>> it to 2G now.  Looking into this has uncovered some issues:
>>  -   nullfs would prevent vnlru_free_vfsops from doing anything
>>  when called from ZFS arc_prune_task
>>  -   nullfs would hang onto a bunch of vnodes unless mounted with
>>  nocache
>>  -   nullfs and nocache would break untar.  This has been fixed now.
>>
>> With nullfs, nocache and setting max vnodes to a low number I can
>> keep the ARC around the max. without evict and prune consuming
>> 100% of 2 cores.  This doesn't seem like the best solution but it
>> is better than when the ARC starts spinning.
>>
>> Looking into this issue with bhyve and a md drive for testing I create
>> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
>> I loop through untaring the Linux kernel into the nullfs mount, rm -rf it
>> and repeat.  I set the ARC to the smallest value I can.  Untarring the
>> Linux kernel was enough to get the ARC evict and prune to spin since
>> they couldn't evict/prune anything.
>>
>> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
>>   static int
>>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
>>   {
>>  ...
>>
>> for (;;) {
>>  ...
>> vp = TAILQ_NEXT(vp, v_vnodelist);
>>  ...
>>
>> /*
>>  * Don't recycle if our vnode is from different type
>>  * of mount point.  Note that mp is type-safe, the
>>  * check does not reach unmapped address even if
>>  * vnode is reclaimed.
>>  */
>> if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
>> mp->mnt_op != mnt_op) {
>> continue;
>> }
>>  ...
>>
>> The vp ends up being the nullfs mount and then hits the continue
>> even though the passed in mvp is on ZFS.  If I do a hack to
>> comment out the continue then I see the ARC, nullfs vnodes and
>> ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
>> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
>> The ARC cache usage also goes down.  Then they increase again until
>> the ARC gets full and then they go down again.  So with this hack
>> I don't need nocache passed to nullfs and I don't need to limit
>> the max vnodes.  Doing multiple untars in parallel over and over
>> doesn't seem to cause any issues for this test.  I'm not saying
>> commenting out continue is the fix but a simple POC test.
>>
>
> I don't see an easy way to say "this is a nullfs vnode holding onto a
> zfs vnode". Perhaps the routine can be extended to issue a nullfs
> callback, if the module is loaded.
>
> In the meantime I think a good enough(tm) fix would be to check that
> nothing was freed and fallback to good old regular clean up without
> filtering by vfsops. This would be very similar to what you are doing
> with your hack.
>

Now that I wrote this perhaps an acceptable hack would be to extend
struct mount with a pointer to "lower layer" mount (if any) and patch
the vfsops check to also look there.

>
>> It appears that when ZFS is asking for cached vnodes to be
>> free'd nullfs also needs to free some up as well so that
>> they are free'd on the VFS level.  It seems that vnlru_free_impl
>> should allow some of the related nullfs vnodes to be free'd so
>> the ZFS ones can be free'd and reduce the size of the ARC.
>>
>> BTW, I also hacked the kernel and mount to show the vnodes used
>> per mount ie. mount -v:
>>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid
>> 2b23b2a1de21ed66,
>> vnodes: count 13846 lazy 0)
>>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
>> 11ff00292900, vnodes: count 13846 lazy 0)
>>
>> Now I can easily see how the vnodes are used without going into ddb.
>> On my laptop I have various vnet jails and nullfs mount my homedir into
>> them so pretty much everything goes through nullfs to ZFS.  I'm limping
>> along with the nullfs nocache and small number of vnodes but it would be
>> nice to not need that.
>>
>> Thanks,
>>
>> Doug A.
>>
>>
>
>
> --
> Mateusz Guzik 
>


-- 
Mateusz Guzik 



Re: nullfs and ZFS issues

2022-04-19 Thread Mateusz Guzik
On 4/19/22, Doug Ambrisko  wrote:
> I've switched my laptop to use nullfs and ZFS.  Previously, I used
> localhost NFS mounts instead of nullfs when nullfs would complain
> that it couldn't mount.  Since that check has been removed, I've
> switched to nullfs only.  However, every so often my laptop would
> get slow and the ARC evict and prune thread would consume two
> cores 100% until I rebooted.  I had a 1G max. ARC and have increased
> it to 2G now.  Looking into this has uncovered some issues:
>  -nullfs would prevent vnlru_free_vfsops from doing anything
>   when called from ZFS arc_prune_task
>  -nullfs would hang onto a bunch of vnodes unless mounted with
>   nocache
>  -nullfs and nocache would break untar.  This has been fixed now.
>
> With nullfs, nocache and setting max vnodes to a low number I can
> keep the ARC around the max. without evict and prune consuming
> 100% of 2 cores.  This doesn't seem like the best solution but it
> is better than when the ARC starts spinning.
>
> Looking into this issue with bhyve and a md drive for testing I create
> a brand new zpool mounted as /test and then nullfs mount /test to /mnt.
> I loop through untaring the Linux kernel into the nullfs mount, rm -rf it
> and repeat.  I set the ARC to the smallest value I can.  Untarring the
> Linux kernel was enough to get the ARC evict and prune to spin since
> they couldn't evict/prune anything.
>
> Looking at vnlru_free_vfsops called from ZFS arc_prune_task I see it
>   static int
>   vnlru_free_impl(int count, struct vfsops *mnt_op, struct vnode *mvp)
>   {
>   ...
>
> for (;;) {
>   ...
> vp = TAILQ_NEXT(vp, v_vnodelist);
>   ...
>
> /*
>  * Don't recycle if our vnode is from different type
>  * of mount point.  Note that mp is type-safe, the
>  * check does not reach unmapped address even if
>  * vnode is reclaimed.
>  */
> if (mnt_op != NULL && (mp = vp->v_mount) != NULL &&
> mp->mnt_op != mnt_op) {
> continue;
> }
>   ...
>
> The vp ends up being the nullfs mount and then hits the continue
> even though the passed in mvp is on ZFS.  If I do a hack to
> comment out the continue then I see the ARC, nullfs vnodes and
> ZFS vnodes grow.  When the ARC calls arc_prune_task that calls
> vnlru_free_vfsops and now the vnodes go down for nullfs and ZFS.
> The ARC cache usage also goes down.  Then they increase again until
> the ARC gets full and then they go down again.  So with this hack
> I don't need nocache passed to nullfs and I don't need to limit
> the max vnodes.  Doing multiple untars in parallel over and over
> doesn't seem to cause any issues for this test.  I'm not saying
> commenting out continue is the fix but a simple POC test.
>

I don't see an easy way to say "this is a nullfs vnode holding onto a
zfs vnode". Perhaps the routine can be extrended with issuing a nullfs
callback, if the module is loaded.

In the meantime I think a good enough(tm) fix would be to check that
nothing was freed and fallback to good old regular clean up without
filtering by vfsops. This would be very similar to what you are doing
with your hack.


> It appears that when ZFS is asking for cached vnodes to be
> free'd nullfs also needs to free some up as well so that
> they are free'd on the VFS level.  It seems that vnlru_free_impl
> should allow some of the related nullfs vnodes to be free'd so
> the ZFS ones can be free'd and reduce the size of the ARC.
>
> BTW, I also hacked the kernel and mount to show the vnodes used
> per mount ie. mount -v:
>   test on /test (zfs, NFS exported, local, nfsv4acls, fsid 2b23b2a1de21ed66,
> vnodes: count 13846 lazy 0)
>   /test on /mnt (nullfs, NFS exported, local, nfsv4acls, fsid
> 11ff00292900, vnodes: count 13846 lazy 0)
>
> Now I can easily see how the vnodes are used without going into ddb.
> On my laptop I have various vnet jails and nullfs mount my homedir into
> them so pretty much everything goes through nullfs to ZFS.  I'm limping
> along with the nullfs nocache and small number of vnodes but it would be
> nice to not need that.
>
> Thanks,
>
> Doug A.
>
>


-- 
Mateusz Guzik 



advice sought: workflow with -CURRENT and amd GPU [Re: -CURRENT hangs since at least 2022-04-04]

2022-04-19 Thread Michael Schuster
Hi,

I'm hijacking and re-purposing the previous thread, I hope that's OK
(I did change the subject ;-)) - I'm keeping some of the previous
contents for reference.

I have similar HW to OP (Ryzen 7 4700 w. Renoir Graphics), and have
been using a similar approach to keep the machine up to date - or so I
suspect. Still, after a while (several months), I end up with one or
more of these:
- I get some sort of panic in DRM (at startup or, currently, at shutdown)
- when I boot into a previous BE to attempt a fix and then again
reboot into the current one, I get tons of messages like this
 "... kernel: KLD iic.ko: depends on kernel - not available or
version mismatch
  ... kernel: linker_load_file: /boot/kernel/iic.ko - unsupported file type"
   and the computer refuses to accept input (let alone start X)

and some others I don't recall right now.

Before I ask for advice (see below), let me explain the approaches
I've taken so far. I install with ZFS from the beginning, current boot
env is "N". These are outlines, not exact commands:

I) never touch the current BE, always update a new one:
  1) given current BE N, I create a new BE N+1 and mount it on /mnt,
  2) 'cd /usr/src; git pull; sudo make DESTDIR=/mnt ... (build, install, etc)'
  3) 'cd usr/ports/graphics/drm-devel-kmod; sudo make DESTDIR=/mnt install'
  4) beadm activate BE N+1; reboot

II) keep a "new" BE as backup/fallback, update current BE:
  1) given current BE N, I create a new BE N+1 (mounting not required)
(this is the intended 'fallback')
  2) 'cd /usr/src; git pull', then 'make' as described in the Handbook
"24.6. Updating FreeBSD from Source"
  3) 'cd usr/ports/graphics/drm-devel-kmod; sudo make install'
  4) reboot
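
A concrete (untested here) rendering of approach I with bectl(8); the BE
name is illustrative, and beinstall.sh, mentioned elsewhere in this
digest, automates most of the base-system part of it:

# bectl create n-plus-1
# bectl mount n-plus-1 /mnt
# cd /usr/src && git pull
# make -j 8 buildworld buildkernel
# make installkernel installworld DESTDIR=/mnt
# etcupdate -D /mnt
# make -C /usr/ports/graphics/drm-devel-kmod install DESTDIR=/mnt
# bectl activate n-plus-1 && shutdown -r now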

In both scenarios, I do "pkg update; pkg upgrade" from time to
time (also following the resp. approach shown above).

I suspect that I'm missing something fundamental in my approaches -
does anyone have a (for them) foolproof approach along these lines, or
can someone show me what I'm missing in either of mine (in private, if
you prefer)?

TIA for all and any advice
Michael

On Mon, Apr 18, 2022 at 9:33 PM Pete Wright  wrote:
>
>
>
> On 4/18/22 12:23, filis+fbsdcurr...@filis.org wrote:
> > Hi,
> >
> > I'm running -CURRENT on this one desktop box which is a "Ryzen 7 4800U
> > with Radeon Graphics", since it didn't work on 13R.
> > I use Boot environments and on 2022-04-04 I updated it and it started
> > to completely freeze under X (I haven't tried letting it run without
> > X) after a few dozen minutes.
> [...]
>
>
> After updating your CURRENT environment did you rebuild the drm-kmod
> package?  That's usually required, as the LKPI is much more of a moving
> target on that branch compared to STABLE or RELEASE.  I have a pretty
> much identical setup and building/installing drm-devel-kmod has been
> working flawlessly for quite a while.
>
> After building/installing my latest world I do the following (this is
> from a local script I use when rebuilding):
>
> cd $PORTS/graphics/drm-devel-kmod
> sudo pkg unlock -y drm-devel-kmod
> sudo make package
> sudo pkg upgrade -y work/pkg/*.pkg
> sudo pkg lock -y drm-devel-kmod
>
> -pete
>
> --
> Pete Wright
> p...@nomadlogic.org
> @nomadlogicLA
>
>


-- 
Michael Schuster
http://recursiveramblings.wordpress.com/
recursion, n: see 'recursion'



Re: -CURRENT hangs since at least 2022-04-04

2022-04-19 Thread Evilham

On Mon, Apr 18 2022, Pete Wright wrote:

> On 4/18/22 12:23, filis+fbsdcurr...@filis.org wrote:
>> Hi,
>>
>> I'm running -CURRENT on this one desktop box which is a "Ryzen 7 4800U
>> with Radeon Graphics", since it didn't work on 13R.
>> I use Boot environments and on 2022-04-04 I updated it and it started to
>> completely freeze under X (I haven't tried letting it run without X)
>> after a few dozen minutes.
>> I went on vacation and came back today and updated it again to see if
>> the issue went away, but it froze again. I went back to the latest BE
>> before 2022-04-04, which is from 2022-03-21, and so far it works fine
>> again. I use a different machine to build and then rsync /usr/src and
>> /usr/obj over and run make installworld, etc locally, and also pkg
>> upgrade (I use FreeBSD -latest packages) everything, so I can't quite
>> tell if this is related to base or drm-kmod, and I'm not too familiar
>> with changes in the timeframe between 2022-03-21 and 2022-04-04 that
>> would affect my setup.
>> Is there anything I can try and/or find or collect info to shed more
>> light on this?
>
> After updating your CURRENT environment did you rebuild the drm-kmod
> package?  That's usually required, as the LKPI is much more of a moving
> target on that branch compared to STABLE or RELEASE.  I have a pretty
> much identical setup and building/installing drm-devel-kmod has been
> working flawlessly for quite a while.
>
> After building/installing my latest world I do the following (this is
> from a local script I use when rebuilding):
>
> cd $PORTS/graphics/drm-devel-kmod
> sudo pkg unlock -y drm-devel-kmod
> sudo make package
> sudo pkg upgrade -y work/pkg/*.pkg
> sudo pkg lock -y drm-devel-kmod
>
> -pete


I too have recently noticed some freezes after a few hours on -CURRENT
that were not happening before.  This is with a matching drm-devel-kmod
package (built with matching source on a matching kernel).

The hw being: AMD Ryzen 7 PRO 2700U w/ Radeon Vega Mobile Gfx
--
Evilham