from:"john"


On 2/14/24 10:16 AM, Mark Millard wrote:

Top posting a related but separate item:

I looked up some old (2022-Dec-17) lspci -v output from
a Linux boot. Note the "Memory at" value 6 (in
the 35 bit BCM2711 address space) and the "(64-bit,
non-prefetchable)" (and "[size=4K]").

01:00.0 USB controller: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 
Controller (rev 01) (prog-if 30 [XHCI])
 Subsystem: VIA Technologies, Inc. VL805/806 xHCI USB 3.0 Controller
 Device tree node: 
/sys/firmware/devicetree/base/scb/pcie@7d50/pci@0,0/usb@0,0
 Flags: bus master, fast devsel, latency 0, IRQ 51
 Memory at 6 (64-bit, non-prefetchable) [size=4K]
 Capabilities: [80] Power Management version 3
 Capabilities: [90] MSI: Enable+ Count=1/4 Maskable- 64bit+
 Capabilities: [c4] Express Endpoint, MSI 00
 Capabilities: [100] Advanced Error Reporting
 Kernel driver in use: xhci_hcd


"Memory at 6 (64-bit, non-prefetchable)":
Violation of a PCIe standard?


No, this is a device BAR which can be 64-bit (memory BARs can either
be 32-bits or 64-bits).  However, the "window" in a PCI _bridge_ for
memory is only defined to be 32-bits.  Windows in PCI-PCI bridges
are a special type of BAR that defines the address ranges that the
bridge decodes on the parent side and passes down to child devices.
The prefetchable window in PCI-PCI bridges can optionally be 64-bit.

BAR == a range of memory or I/O port addresses decoded by a device,
usually mapped to a register bank, but sometimes mapped to internal
memory (e.g. a framebuffer)

Window == a range of memory or I/O port addresses decoded by a bridge
for which transactions are passed across the bridge to be handled by
a child device.

--
John Baldwin

Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic


On 2/14/24 9:57 AM, Mark Millard wrote:

On Feb 14, 2024, at 08:08, John Baldwin  wrote:


On 2/12/24 5:57 PM, Mark Millard wrote:

On Feb 12, 2024, at 16:36, Mark Millard  wrote:

On Feb 12, 2024, at 16:10, Mark Millard  wrote:


On Feb 12, 2024, at 12:00, Mark Millard  wrote:


[Gack: I was looking at the wrong vintage of source code, predating
your changes: wrong system used.]


On Feb 12, 2024, at 10:41, Mark Millard  wrote:


On Feb 12, 2024, at 09:32, John Baldwin  wrote:


On 2/9/24 8:13 PM, Mark Millard wrote:

Summary:
pcib0:  mem 0x7d50-0x7d50930f 
irq 80,81 on simplebus2
pcib0: parsing FDT for ECAM0:
pcib0:  PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000
. . .
rman_manage_region:  request: start 0x6, end 
0x6000f
panic: Failed to add resource to rman


Hmmm, I suspect this is due to the way that bus_translate_resource works which 
is
fundamentally broken.  It rewrites the start address of a resource in-situ 
instead
of keeping downstream resources separate from the upstream resources.   For 
example,
I don't see how you could ever release a resource in this design without 
completely
screwing up your rman.  That is, I expect trying to detach a PCI device behind a
translating bridge that uses the current approach should corrupt the allocated
resource ranges in an rman long before my changes.

That said, that doesn't really explain the panic.  Hmm, the panic might be 
because
for PCI bridge windows the driver now passes RF_ACTIVE and the 
bus_translate_resource
hack only kicks in the activate_resource method of pci_host_generic.c.


Detail:
. . .
pcib0:  mem 0x7d50-0x7d50930f 
irq 80,81 on simplebus2
pcib0: parsing FDT for ECAM0:
pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000


This indicates this is a translating bus.


pcib1:  irq 91 at device 0.0 on pci0
rman_manage_region:  request: start 0x1, end 0x1
pcib0: rman_reserve_resource: start=0xc000, end=0xc00f, count=0x10
rman_reserve_resource_bound:  request: [0xc000, 0xc00f], 
length 0x10, flags 102, device pcib1
rman_reserve_resource_bound: trying 0x <0xc000,0xf>
considering [0xc000, 0x]
truncated region: [0xc000, 0xc00f]; size 0x10 (requested 0x10)
candidate region: [0xc000, 0xc00f], size 0x10
allocating from the beginning
rman_manage_region:  request: start 0x6, end 
0x6000f


What you later typed does not match:

0x6
0x6000f

You later typed:

0x6000
0x600fff

This seems to have lead to some confusion from using the
wrong figure(s).


The fact that we are trying to reserve the CPU addresses in the rman is because
bus_translate_resource rewrote the start address in the resource after it was 
allocated.

That said, I can't see why rman_manage_region would actually fail.  At this 
point the
rman is empty (this is the first call to rman_manage_region for "pcib1 memory 
window"),
so only the check that should be failing are the checks against rm_start and
rm_end.  For the memory window, rm_start is always 0, and rm_end is always
0x, so both the old (0xc - 0xc00f) and new (0x6000 - 
0x600fff)
ranges are within those bounds.


No:

0x

.vs (actual):

0x6
0x6000f


Ok, then this explains the failure if the "raw" addresses are above 4G.  I have
access to an emag I'm currently using to test fixes to pci_host_generic.c to
avoid corrupting struct resource objects.  I'll post the diff once I've got
something verified to work.


It looks to me like in sys/dev/pci/pci_pci.c the:
static void
pcib_probe_windows(struct pcib_softc *sc)
{
. . .
 pcib_alloc_window(sc, >mem, SYS_RES_MEMORY, 0, 0x);
. . .
is just inappropriately restrictive about where in the system
address space a PCIe can validly be mapped to on the high end.
That, in turn, leads to the rejection on the RPi4B now that
the range use is checked.


No, the physical register in PCI-PCI bridges is only 32-bits.  Only the
prefetchable BAR supports 64-bit addresses.


Just for my edification . . .

As I understand, SYS_RES_MEMORY for the BCM2711
means the 35 bit addressing space in the BCM2711,
not a PCIe device internal address range that
corresponds. Am I wrong about that?

If I'm wrong, what does identify the 35 bit
addressing space in the BCM2711?

If I'm correct, then the 0..0x
seems to be from the wrong address space up
front. Or, may be, the SYS_RES_MEMORY and the
0x argments are not related as I
expected and the 0x is not a
SYS_RES_MEMORY value?


We use SYS_RES_MEMORY for both address spaces.  SYS_RES_MEMORY is more of
an address space "type" and doesn't necessarily name a single, unique
address space.  The way to think about these address spaces is instances
of 'struct rman'.  There's a global 'struct rman' in the arm64 nexus
driver that represents the CPU physical memory address space.  The
pci_host_

Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic


On 2/14/24 8:42 AM, Warner Losh wrote:

On Wed, Feb 14, 2024 at 9:08 AM John Baldwin  wrote:


On 2/12/24 5:57 PM, Mark Millard wrote:

On Feb 12, 2024, at 16:36, Mark Millard  wrote:


On Feb 12, 2024, at 16:10, Mark Millard  wrote:


On Feb 12, 2024, at 12:00, Mark Millard  wrote:


[Gack: I was looking at the wrong vintage of source code, predating
your changes: wrong system used.]


On Feb 12, 2024, at 10:41, Mark Millard  wrote:


On Feb 12, 2024, at 09:32, John Baldwin  wrote:


On 2/9/24 8:13 PM, Mark Millard wrote:

Summary:
pcib0:  mem

0x7d50-0x7d50930f irq 80,81 on simplebus2

pcib0: parsing FDT for ECAM0:
pcib0:  PCI addr: 0xc000, CPU addr: 0x6, Size:

0x4000

. . .
rman_manage_region:  request: start

0x6, end 0x6000f

panic: Failed to add resource to rman


Hmmm, I suspect this is due to the way that bus_translate_resource

works which is

fundamentally broken.  It rewrites the start address of a resource

in-situ instead

of keeping downstream resources separate from the upstream

resources.   For example,

I don't see how you could ever release a resource in this design

without completely

screwing up your rman.  That is, I expect trying to detach a PCI

device behind a

translating bridge that uses the current approach should corrupt

the allocated

resource ranges in an rman long before my changes.

That said, that doesn't really explain the panic.  Hmm, the panic

might be because

for PCI bridge windows the driver now passes RF_ACTIVE and the

bus_translate_resource

hack only kicks in the activate_resource method of

pci_host_generic.c.



Detail:
. . .
pcib0:  mem

0x7d50-0x7d50930f irq 80,81 on simplebus2

pcib0: parsing FDT for ECAM0:
pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size:

0x4000


This indicates this is a translating bus.


pcib1:  irq 91 at device 0.0 on pci0
rman_manage_region:  request: start 0x1, end 0x1
pcib0: rman_reserve_resource: start=0xc000, end=0xc00f,

count=0x10

rman_reserve_resource_bound:  request: [0xc000,

0xc00f], length 0x10, flags 102, device pcib1

rman_reserve_resource_bound: trying 0x <0xc000,0xf>
considering [0xc000, 0x]
truncated region: [0xc000, 0xc00f]; size 0x10

(requested 0x10)

candidate region: [0xc000, 0xc00f], size 0x10
allocating from the beginning
rman_manage_region:  request: start

0x6, end 0x6000f


What you later typed does not match:

0x6
0x6000f

You later typed:

0x6000
0x600fff

This seems to have lead to some confusion from using the
wrong figure(s).


The fact that we are trying to reserve the CPU addresses in the

rman is because

bus_translate_resource rewrote the start address in the resource

after it was allocated.


That said, I can't see why rman_manage_region would actually fail.

At this point the

rman is empty (this is the first call to rman_manage_region for

"pcib1 memory window"),

so only the check that should be failing are the checks against

rm_start and

rm_end.  For the memory window, rm_start is always 0, and rm_end is

always

0x, so both the old (0xc - 0xc00f) and new

(0x6000 - 0x600fff)

ranges are within those bounds.


No:

0x

.vs (actual):

0x6
0x6000f


Ok, then this explains the failure if the "raw" addresses are above 4G.  I
have
access to an emag I'm currently using to test fixes to pci_host_generic.c
to
avoid corrupting struct resource objects.  I'll post the diff once I've got
something verified to work.


It looks to me like in sys/dev/pci/pci_pci.c the:

static void
pcib_probe_windows(struct pcib_softc *sc)
{
. . .
  pcib_alloc_window(sc, >mem, SYS_RES_MEMORY, 0, 0x);
. . .

is just inappropriately restrictive about where in the system
address space a PCIe can validly be mapped to on the high end.
That, in turn, leads to the rejection on the RPi4B now that
the range use is checked.


No, the physical register in PCI-PCI bridges is only 32-bits.  Only the
prefetchable BAR supports 64-bit addresses.  This is why the host bridge
is doing a translation from the CPU side (0x6) to the PCI BAR
addresses (0xc000) so that the BAR addresses are down in the 32-bit
address range.  It's also true that many PCI devices only support 32-bit
addresses in memory BARs.  64-bit BARs are an optional extension not
universally supported.

The translation here is somewhat akin to a type of MMU where the CPU
addresses are mapped to PCI addresses.  The problem here is that the
PCI BAR resources need to "stay" as PCI addresses since we depend on
being able to use rman_get_start/end to get the PCI addresses of
allocated resources, but pci_host_generic.c currently rewrites the
addresses.

Probably I should remove rman_set_start/end entirely (Warner added them
back in 2004) as the methods don't do anything to deal with the fallout
that th

Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic


On 2/12/24 5:57 PM, Mark Millard wrote:

On Feb 12, 2024, at 16:36, Mark Millard  wrote:


On Feb 12, 2024, at 16:10, Mark Millard  wrote:


On Feb 12, 2024, at 12:00, Mark Millard  wrote:


[Gack: I was looking at the wrong vintage of source code, predating
your changes: wrong system used.]


On Feb 12, 2024, at 10:41, Mark Millard  wrote:


On Feb 12, 2024, at 09:32, John Baldwin  wrote:


On 2/9/24 8:13 PM, Mark Millard wrote:

Summary:
pcib0:  mem 0x7d50-0x7d50930f 
irq 80,81 on simplebus2
pcib0: parsing FDT for ECAM0:
pcib0:  PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000
. . .
rman_manage_region:  request: start 0x6, end 
0x6000f
panic: Failed to add resource to rman


Hmmm, I suspect this is due to the way that bus_translate_resource works which 
is
fundamentally broken.  It rewrites the start address of a resource in-situ 
instead
of keeping downstream resources separate from the upstream resources.   For 
example,
I don't see how you could ever release a resource in this design without 
completely
screwing up your rman.  That is, I expect trying to detach a PCI device behind a
translating bridge that uses the current approach should corrupt the allocated
resource ranges in an rman long before my changes.

That said, that doesn't really explain the panic.  Hmm, the panic might be 
because
for PCI bridge windows the driver now passes RF_ACTIVE and the 
bus_translate_resource
hack only kicks in the activate_resource method of pci_host_generic.c.


Detail:
. . .
pcib0:  mem 0x7d50-0x7d50930f 
irq 80,81 on simplebus2
pcib0: parsing FDT for ECAM0:
pcib0: PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000


This indicates this is a translating bus.


pcib1:  irq 91 at device 0.0 on pci0
rman_manage_region:  request: start 0x1, end 0x1
pcib0: rman_reserve_resource: start=0xc000, end=0xc00f, count=0x10
rman_reserve_resource_bound:  request: [0xc000, 0xc00f], 
length 0x10, flags 102, device pcib1
rman_reserve_resource_bound: trying 0x <0xc000,0xf>
considering [0xc000, 0x]
truncated region: [0xc000, 0xc00f]; size 0x10 (requested 0x10)
candidate region: [0xc000, 0xc00f], size 0x10
allocating from the beginning
rman_manage_region:  request: start 0x6, end 
0x6000f


What you later typed does not match:

0x6
0x6000f

You later typed:

0x6000
0x600fff

This seems to have lead to some confusion from using the
wrong figure(s).


The fact that we are trying to reserve the CPU addresses in the rman is because
bus_translate_resource rewrote the start address in the resource after it was 
allocated.

That said, I can't see why rman_manage_region would actually fail.  At this 
point the
rman is empty (this is the first call to rman_manage_region for "pcib1 memory 
window"),
so only the check that should be failing are the checks against rm_start and
rm_end.  For the memory window, rm_start is always 0, and rm_end is always
0x, so both the old (0xc - 0xc00f) and new (0x6000 - 
0x600fff)
ranges are within those bounds.


No:

0x

.vs (actual):

0x6
0x6000f


Ok, then this explains the failure if the "raw" addresses are above 4G.  I have
access to an emag I'm currently using to test fixes to pci_host_generic.c to
avoid corrupting struct resource objects.  I'll post the diff once I've got
something verified to work.


It looks to me like in sys/dev/pci/pci_pci.c the:

static void
pcib_probe_windows(struct pcib_softc *sc)
{
. . .
 pcib_alloc_window(sc, >mem, SYS_RES_MEMORY, 0, 0x);
. . .

is just inappropriately restrictive about where in the system
address space a PCIe can validly be mapped to on the high end.
That, in turn, leads to the rejection on the RPi4B now that
the range use is checked.


No, the physical register in PCI-PCI bridges is only 32-bits.  Only the
prefetchable BAR supports 64-bit addresses.  This is why the host bridge
is doing a translation from the CPU side (0x6) to the PCI BAR
addresses (0xc000) so that the BAR addresses are down in the 32-bit
address range.  It's also true that many PCI devices only support 32-bit
addresses in memory BARs.  64-bit BARs are an optional extension not
universally supported.

The translation here is somewhat akin to a type of MMU where the CPU
addresses are mapped to PCI addresses.  The problem here is that the
PCI BAR resources need to "stay" as PCI addresses since we depend on
being able to use rman_get_start/end to get the PCI addresses of
allocated resources, but pci_host_generic.c currently rewrites the
addresses.

Probably I should remove rman_set_start/end entirely (Warner added them
back in 2004) as the methods don't do anything to deal with the fallout
that the rman.rm_list linked-list is no longer sorted by address once
some addresses get rewritten, etc.

--
John Baldwin

Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic

2024-02-12 Thread John Baldwin


On 2/10/24 2:09 PM, Michael Butler wrote:

I have stability problems with anything at or after this commit
(b377ff8) on an amd64 laptop. While I see the following panic logged, no
crash dump is preserved :-( It happens after ~5-6 minutes running in KDE
(X).

Reverting to 36efc64 seems to work reliably (after ACPI changes but
before the problematic PCI one)

kernel: Fatal trap 12: page fault while in kernel mode
kernel: cpuid = 2; apic id = 02
kernel: fault virtual address = 0x48
kernel: fault code= supervisor read data, page not present
kernel: instruction pointer   = 0x20:0x80acb962
kernel: stack pointer = 0x28:0xfe00c4318d80
kernel: frame pointer = 0x28:0xfe00c4318d80
kernel: code segment  = base 0x0, limit 0xf, type 0x1b
kernel:   = DPL 0, pres 1, long 1, def32 0, gran 1
kernel: processor eflags  = interrupt enabled, resume, IOPL = 0
kernel: current process   = 2 (clock (0))
kernel: rdi: f802e460c000 rsi:  rdx: 0002
kernel: rcx:   r8: 001e  r9: fe00c4319000
kernel: rax: 0002 rbx: f802e460c000 rbp: fe00c4318d80
kernel: r10: 1388 r11: 7ffc765d r12: 000f
kernel: r13: 0002 r14: f8000193e740 r15: 
kernel: trap number   = 12
kernel: panic: page fault
kernel: cpuid = 2
kernel: time = 1707573802
kernel: Uptime: 6m19s
kernel: Dumping 942 out of 16242
MB:..2%..11%..21%..31%..41%..51%..62%..72%..82%..92%
kernel: Dump complete
kernel: Automatic reboot in 15 seconds - press a key on the console to abort


Without a stack trace it is pretty much impossible to debug a panic like this.
Do you have KDB_TRACE enabled in your kernel config?  I'm also not sure how the
PCI changes can result in a panic post-boot.  If you were going to have problems
they would be during device attach, not after you are booted and running X.

Short of a stack trace, you can at least use lldb or gdb to lookup the source
line associated with the faulting instruction pointer (as long as it isn't in
a kernel module), e.g. for gdb you would use 'gdb /boot/kernel/kernel' and then
'l *', e.g. from above: 'l *0x80acb962'

--
John Baldwin

Re: Recent commits reject RPi4B booting: pcib0 vs. pcib1 "rman_manage_region: request" leads to panic

2024-02-12 Thread John Baldwin


On 2/9/24 8:13 PM, Mark Millard wrote:

Summary:

pcib0:  mem 0x7d50-0x7d50930f 
irq 80,81 on simplebus2
pcib0: parsing FDT for ECAM0:
pcib0:  PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000
. . .
rman_manage_region:  request: start 0x6, end 
0x6000f
panic: Failed to add resource to rman


Hmmm, I suspect this is due to the way that bus_translate_resource works which 
is
fundamentally broken.  It rewrites the start address of a resource in-situ 
instead
of keeping downstream resources separate from the upstream resources.   For 
example,
I don't see how you could ever release a resource in this design without 
completely
screwing up your rman.  That is, I expect trying to detach a PCI device behind a
translating bridge that uses the current approach should corrupt the allocated
resource ranges in an rman long before my changes.

That said, that doesn't really explain the panic.  Hmm, the panic might be 
because
for PCI bridge windows the driver now passes RF_ACTIVE and the 
bus_translate_resource
hack only kicks in the activate_resource method of pci_host_generic.c.


Detail:


. . .
pcib0:  mem 0x7d50-0x7d50930f 
irq 80,81 on simplebus2
pcib0: parsing FDT for ECAM0:
pcib0:  PCI addr: 0xc000, CPU addr: 0x6, Size: 0x4000


This indicates this is a translating bus.


pcib1:  irq 91 at device 0.0 on pci0
rman_manage_region:  request: start 0x1, end 0x1
pcib0: rman_reserve_resource: start=0xc000, end=0xc00f, count=0x10
rman_reserve_resource_bound:  request: [0xc000, 0xc00f], 
length 0x10, flags 102, device pcib1
rman_reserve_resource_bound: trying 0x <0xc000,0xf>
considering [0xc000, 0x]
truncated region: [0xc000, 0xc00f]; size 0x10 (requested 0x10)
candidate region: [0xc000, 0xc00f], size 0x10
allocating from the beginning
rman_manage_region:  request: start 0x6, end 
0x6000f


The fact that we are trying to reserve the CPU addresses in the rman is because
bus_translate_resource rewrote the start address in the resource after it was 
allocated.

That said, I can't see why rman_manage_region would actually fail.  At this 
point the
rman is empty (this is the first call to rman_manage_region for "pcib1 memory 
window"),
so only the check that should be failing are the checks against rm_start and
rm_end.  For the memory window, rm_start is always 0, and rm_end is always
0x, so both the old (0xc - 0xc00f) and new (0x6000 - 
0x600fff)
ranges are within those bounds.  I would instead expect to see some other issue 
later
on where we fail to allocate a resource for a child BAR, but I wouldn't expect
rman_manage_region to fail.

Logging the return value from rman_manage_region would be the first step I think
to see which error value it is returning.

Probably I should fix pci_host_generic.c to handle translation properly however.
I can work on a patch for that.

--
John Baldwin

Re: Alder lake supported? (graphics)

2024-01-16 Thread John D Groenveld

In message , Chris writes:
>I upgraded to an alder lake based machine and installed 14.
>But I can't seem to get the intel graphics loaded (drm-515-kmod).
>It simply freezes at load.

Shot in the dark:
# pkg delete drm-515-kmod && pkg install drm-510-kmod && kldload i915kms

John
groenv...@acm.org

Re: How to upgrade an EOL FreeBSD release or how to make it working again

2024-01-15 Thread John F Carr

Judging by a commit message BSD on the ARM Chromebook didn't work
when support was removed in 2019.

>RK* Exynos* and Meson*/Odroid* don't even work with current
>source code, if someone wants to make them work again they
>better use the Linux DTS.
https://cgit.freebsd.org/src/commit?id=9dfa2a54684978d1d6cef67bbf6242e825801f18

I have one of the "snow" Chromebooks.  The warnings in the web page
https://wiki.freebsd.org/arm/Chromebook led me not to try FreeBSD.
None of the many bugs seemed likely to ever be fixed.  I'm not using it
so I could try an experiment, but fighting with u-boot is not how I want
to spend my days.  Even the popular Raspberry Pi takes skill or luck.

(So "build an arm6 world and copy X, Y, and Z to the DOS partition
on your USB drive" is the kind of advice I need to supplement the old
Chromebook wiki page.)

There is at least a little value in getting it to work because the armv6
code is bit rotting and will go away entirely unless people use it.

John Carr

> On Jan 15, 2024, at 10:59, Mario Marietto  wrote:
> 
> Hello to everyone.
> 
> I'm trying to install FreeBSD 14 natively on my ARM Chromebook model xe303c12 
> ; I've found only one tutorial that teaches how to do that,that's it :
> 
> https://wiki.freebsd.org/arm/Chromebook
> 
> The problem is that it ends with the installation of FreeBSD 11,that's very 
> EOL.
> I can't use it as is. I need to upgrade it to 14 (but I'm on arm 32 
> bit,that's TIER-2,so I can't upgrade it automatically using the 
> freebsd-update script. It is also true that I can't install 14 directly on 
> that machine,as you can read below :
> 
> 
> 
> 
> I've looked all around and I found the tool pkgbase,that I'm talking about on 
> the FreeBSD forum,to understand if it allows the 11 to be usable or 
> upgradable. It does not seem to be the proper tool to achieve my goal. Do you 
> have any suggestions that can help me ? Thanks.
> 
> -- 
> Mario.

Re: ZFS problems since recently ?

2024-01-10 Thread John Kennedy

On Tue, Jan 02, 2024 at 05:51:32PM -0500, Alexander Motin wrote:
> Please see/test: https://github.com/openzfs/zfs/pull/15732 .

  Looks like that has landed in current:

commit f552d7adebb13e24f65276a6c4822bffeeac3993
Merge: 13720136fbf a382e21194c
Author: Martin Matuska 
Date:   Wed Jan 10 09:07:45 2024 +0100

zfs: merge openzfs/zfs@a382e2119

Notable upstream pull request merges:
 #15693 a382e2119 Add Gotify notification support to ZED
-->  #15732 e78aca3b3 Fix livelist assertions for dedup and cloning
 #15733 7ecaa0758 make zdb_decompress_block check decompression 
reliably
 #15735 255741fc9 Improve block sizes checks during cloning

Obtained from:  OpenZFS
OpenZFS commit: a382e21194c1690951d2eee8ebd98bc096f01c83

Re: ZFS problems since recently ?

2024-01-04 Thread John Kennedy

On Tue, Jan 02, 2024 at 08:02:04PM -0800, John Kennedy wrote:
> On Tue, Jan 02, 2024 at 05:51:32PM -0500, Alexander Motin wrote:
> > On 01.01.2024 08:59, John Kennedy wrote:
> > >  ...
> > >My poudriere build did eventually fail as well:
> > >   ...
> > >   [05:40:24] [01] [00:17:20] Finished devel/gdb@py39 | gdb-13.2_1: Success
> > >   [05:40:24] Stopping 2 builders
> > >   panic: VERIFY(BP_GET_DEDUP(bp)) failed
> > 
> > Please see/test: https://github.com/openzfs/zfs/pull/15732 .
> 
>   It came back today at the end of my poudriere build.  Your patch has fixed
> it, so far at least.

  At the risk of conflating this with other ZFS issues, I beat on the VM a lot
more last night without triggering any panics.  My usual busy-workload is a
total kernel+world rebuild (with whatever pending patches might be out), then
a poudriere run (~230 or so packages).  It's weird that the first (much bigger)
run worked but later ones didn't (where maybe I had one port that failed to
build), triggering the panic.  Seemed repeatable, but don't have a feel for
the exact trigger like the sysctl issue.

Re: ZFS problems since recently ?

2024-01-02 Thread John Kennedy

On Tue, Jan 02, 2024 at 05:51:32PM -0500, Alexander Motin wrote:
> On 01.01.2024 08:59, John Kennedy wrote:
> >  ...
> >My poudriere build did eventually fail as well:
> > ...
> > [05:40:24] [01] [00:17:20] Finished devel/gdb@py39 | gdb-13.2_1: Success
> > [05:40:24] Stopping 2 builders
> > panic: VERIFY(BP_GET_DEDUP(bp)) failed
> 
> Please see/test: https://github.com/openzfs/zfs/pull/15732 .

  It came back today at the end of my poudriere build.  Your patch has fixed
it, so far at least.

Re: ZFS problems since recently ?

On Mon, Jan 01, 2024 at 02:27:17PM +0100, Kurt Jaeger wrote:
> > On Mon, Jan 01, 2024 at 06:43:58AM +0100, Kurt Jaeger wrote:
> > > markj@ pointed me in
> > > https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276039
> > > to
> > > https://github.com/openzfs/zfs/pull/15719 
> > > 
> > > So it will probably be fixed sooner or later.
> > > 
> > > The other ZFS crashes I've seen are still an issue.
> > 
> >   My poudriere build did eventually fail as well:
> > ...
> > [05:40:24] [01] [00:17:20] Finished devel/gdb@py39 | gdb-13.2_1: Success
> > [05:40:24] Stopping 2 builders
> > panic: VERIFY(BP_GET_DEDUP(bp)) failed
> 
> That's one of the panic messages I had as well.
> 
> See
> 
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276051
> 
> for additional crashes and dumps.
> 
> >   I didn't tweak this system off defaults for block-cloning.  I haven't 
> > been following
> > that issue 100%.
> 
> Do you have
>   vfs.zfs.dmu_offset_next_sync=0
> ?

  I reverted everything and reinstalled.  The VERIFY(BP_GET_DEDUP(bp)) panic
hasn't reoccurred (tended to happen on poudriere-build cleanup), which may
lean it more towards corruption, or maybe I just haven't been "lucky" with my
small random chance of corruption.

  I did set vfs.zfs.dmu_offset_next_sync=0 after the bsdinstall was complete
(maybe I could have loaded the zfs kernel module from the shell and set it
before things kicked off).

Re: ZFS problems since recently ?

On Mon, Jan 01, 2024 at 08:42:26AM -0800, John Kennedy wrote:
>   Applying the two ZFS kernel patches fixes that issue:

commit 09af4bf2c987f6f57804162cef8aeee05575ad1d (zfs: Fix SPA sysctl handlers) 
landed too.

root@bsd15:~ # sysctl -a | grep vfs.zfs.zio
vfs.zfs.zio.deadman_log_all: 0
vfs.zfs.zio.dva_throttle_enabled: 1
vfs.zfs.zio.requeue_io_start_cut_in_line: 1
vfs.zfs.zio.slow_io_ms: 3
vfs.zfs.zio.taskq_wr_iss_ncpus: 0
vfs.zfs.zio.taskq_write: sync fixed,1,5 scale fixed,1,5
vfs.zfs.zio.taskq_read: fixed,1,8 null scale null
vfs.zfs.zio.taskq_batch_tpq: 0
vfs.zfs.zio.taskq_batch_pct: 80
vfs.zfs.zio.exclude_metadata: 0

root@bsd15:~ # uname -aUK
FreeBSD bsd15 15.0-CURRENT FreeBSD 15.0-CURRENT #1 
main-n267336-09af4bf2c98: Mon Jan  1 12:04:15 PST 2024 
warlock@bsd15:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 158 158

Re: ZFS problems since recently ?

On Mon, Jan 01, 2024 at 06:43:58AM +0100, Kurt Jaeger wrote:
> > >   I can crash mine with "sysctl -a" as well.
> 
> markj@ pointed me in
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276039
> to
> https://github.com/openzfs/zfs/pull/15719 
> 
> So it will probably be fixed sooner or later.
> 
> The other ZFS crashes I've seen are still an issue.

  Applying the two ZFS kernel patches fixes that issue:

root@bsd15:~ # sysctl -a | grep vfs.zfs.zio
vfs.zfs.zio.deadman_log_all: 0
vfs.zfs.zio.dva_throttle_enabled: 1
vfs.zfs.zio.requeue_io_start_cut_in_line: 1
vfs.zfs.zio.slow_io_ms: 3
vfs.zfs.zio.taskq_wr_iss_ncpus: 0
vfs.zfs.zio.taskq_write: sync fixed,1,5 scale fixed,1,5
vfs.zfs.zio.taskq_read: fixed,1,8 null scale null
vfs.zfs.zio.taskq_batch_tpq: 0
vfs.zfs.zio.taskq_batch_pct: 80
vfs.zfs.zio.exclude_metadata: 0

root@bsd15:~ # uname -aUK
FreeBSD bsd15 15.0-CURRENT FreeBSD 15.0-CURRENT #2 
main-n267335-499e84e16f5-dirty: Mon Jan  1 08:04:59 PST 2024 
warlock@bsd15:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 158 158

Re: ZFS problems since recently ?

On Mon, Jan 01, 2024 at 02:27:17PM +0100, Kurt Jaeger wrote:
> Do you have
>vfs.zfs.dmu_offset_next_sync=0

  I didn't initially, I do now.  Like I said, I haven't been following that one
100%.  I know it isn't block-clone per say, so much as some underlying problem
it pokes with a pointy stick.  Small chance multiplied by a bunch of ZFS IOPS.

  Seems like I'd have to revert it all the way back to fresh install if I want
to get rid of all potential corruption unrelated to sysctl panic.

  But I'll do myh busy-work cycle (*) with that one and maybe another with it
off and see what happens.

  * full kernel+world, plus my local poudriere package build, currenly wedged
a bit with the heimdall build issue.

Re: ZFS problems since recently ?

On Mon, Jan 01, 2024 at 06:43:58AM +0100, Kurt Jaeger wrote:
> markj@ pointed me in
> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276039
> to
> https://github.com/openzfs/zfs/pull/15719 
> 
> So it will probably be fixed sooner or later.
> 
> The other ZFS crashes I've seen are still an issue.

  My poudriere build did eventually fail as well:

...
[05:40:24] [01] [00:17:20] Finished devel/gdb@py39 | gdb-13.2_1: Success
[05:40:24] Stopping 2 builders
panic: VERIFY(BP_GET_DEDUP(bp)) failed

cpuid = 2
time = 1704091946
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
0xfe00f62898c0
vpanic() at vpanic+0x131/frame 0xfe00f62899f0
spl_panic() at spl_panic+0x3a/frame 0xfe00f6289a50
dsl_livelist_iterate() at dsl_livelist_iterate+0x2de/frame 
0xfe00f6289b30
bpobj_iterate_blkptrs() at bpobj_iterate_blkptrs+0x235/frame 
0xfe00f6289bf0
bpobj_iterate_impl() at bpobj_iterate_impl+0x16e/frame 
0xfe00f6289c80
dsl_process_sub_livelist() at dsl_process_sub_livelist+0x5c/frame 
0xfe00f6289d00
spa_livelist_delete_cb() at spa_livelist_delete_cb+0xf6/frame 
0xfe00f6289ea0
zthr_procedure() at zthr_procedure+0xa5/frame 0xfe00f6289ef0
fork_exit() at fork_exit+0x82/frame 0xfe00f6289f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfe00f6289f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 9 tid 100223 ]
Stopped at  kdb_enter+0x33: movq$0,0xe3a582(%rip)
db>

  Trying to do another poudriere build fails almost immediatly with that verify 
error.

  Your verify errors don't match up exactly.  I've got snapshots from before I 
started
freaking it out with the sysctl calls and possibly inducing corruption.

  I didn't tweak this system off defaults for block-cloning.  I haven't been 
following
that issue 100%.

Re: ZFS problems since recently ?

2023-12-31 Thread John Kennedy

>   I can crash mine with "sysctl -a" as well.

  Smaller test, this is sufficient to crash things:

root@bsd15:~ # sysctl vfs.zfs.zio
vfs.zfs.zio.deadman_log_all: 0
vfs.zfs.zio.dva_throttle_enabled: 1
vfs.zfs.ziopanic: sbuf_clear makes no sense on sbuf 0xf8002c8dc300 
with drain
cpuid = 3
time = 1704069514
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
0xfe00fa502960
vpanic() at vpanic+0x131/frame 0xfe00fa502a90
panic() at panic+0x43/frame 0xfe00fa502af0
sbuf_clear() at sbuf_clear+0xa8/frame 0xfe00fa502b00
sbuf_cpy() at sbuf_cpy+0x56/frame 0xfe00fa502b20
spa_taskq_write_param() at spa_taskq_write_param+0x85/frame 
0xfe00fa502bd0
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x9c/frame 
0xfe00fa502c20
sysctl_root() at sysctl_root+0x21e/frame 0xfe00fa502ca0
userland_sysctl() at userland_sysctl+0x184/frame 0xfe00fa502d50
sys___sysctl() at sys___sysctl+0x60/frame 0xfe00fa502e00
amd64_syscall() at amd64_syscall+0x153/frame 0xfe00fa502f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 
0xfe00fa502f30
--- syscall (202, FreeBSD ELF64, __sysctl), rip = 0x3733c1e5619a, rsp = 
0x3733bf494538, rbp = 0x3733bf494570 ---
KDB: enter: panic
[ thread pid 780 tid 100237 ]
Stopped at  kdb_enter+0x33: movq$0,0xe3a582(%rip)
db>

Re: ZFS problems since recently ?

2023-12-31 Thread John Kennedy

On Sun, Dec 31, 2023 at 07:34:45PM +0100, Kurt Jaeger wrote:
> Hi!
> 
> Short overview:
> - Had CURRENT system from around September
> - Upgrade on the 23th of December
> - crashes in ZFS, see
>   https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=261538
>   for details
> - Reinstalled from scratch with new SSDs drives from
> https://download.freebsd.org/snapshots/amd64/amd64/ISO-IMAGES/15.0/
> freebsd-openzfs-amd64-2020081900-memstick.img.xz
> - Had one crash with
>   sysctl -a
>   https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276039
> - Still see crashes with ZFS (and other) when using poudriere to
>   build ports.
> 
> Problem:
> 
> I happen to run in several cases of crashes in ZFS, some of
> them fatal (zpool non-recoverable).

  I can crash mine with "sysctl -a" as well.

  I seeded my bhyve with:
FreeBSD-15.0-CURRENT-amd64-20231228-fb03f7f8e30d-267242-disc1.iso

  Rebuilt the kernel (so now at main-n267320-4d08b569a01) and started
crunching through poudriere package builds.  Sorta stock install of encrypted
ZFS.  I didn't get it to crash with poudriere (yet).  Mine lives in bhyve,
so maybe less possible destruction via crashes.

KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 
0xfe00fa5f3960
vpanic() at vpanic+0x131/frame 0xfe00fa5f3a90
panic() at panic+0x43/frame 0xfe00fa5f3af0
sbuf_clear() at sbuf_clear+0xa8/frame 0xfe00fa5f3b00
sbuf_cpy() at sbuf_cpy+0x56/frame 0xfe00fa5f3b20
spa_taskq_write_param() at spa_taskq_write_param+0x85/frame 
0xfe00fa5f3bd0
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x9c/frame 
0xfe00fa5f3c20
sysctl_root() at sysctl_root+0x21e/frame 0xfe00fa5f3ca0
userland_sysctl() at userland_sysctl+0x184/frame 0xfe00fa5f3d50
sys___sysctl() at sys___sysctl+0x60/frame 0xfe00fa5f3e00
amd64_syscall() at amd64_syscall+0x153/frame 0xfe00fa5f3f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 
0xfe00fa5f3f30
--- syscall (202, FreeBSD ELF64, __sysctl), rip = 0x22e42167019a, rsp = 
0x22e41ee72518, rbp = 0x22e41ee72550 ---
KDB: enter: panic

  The sysctl died at this point, but who knows if it had pending buffered 
output or anything...

...
vfs.zfs.zio.deadman_log_all: 0
vfs.zfs.zio.dva_throttle_enabled: 1
vfs.zfs.zio.requeue_io_start_cut_in_line: 1
vfs.zfs.zio.slow_io_ms: 3
vfs.zfs.zio.taskq_wr_iss_ncpus: 0

Re: make installworld fails because /usr/include/c++/v1/__tuple is a file

2023-12-13 Thread John Baldwin


On 12/10/23 8:43 AM, Dimitry Andric wrote:

On 10 Dec 2023, at 15:11, Herbert J. Skuhra  wrote:


On Sun, Dec 10, 2023 at 01:22:38PM +, John F Carr wrote:

On arm64 running CURRENT from two weeks ago I updated to

  c711af772782 Bump __FreeBSD_version for llvm 17.0.6 merge

and built and installed from source.  make installworld failed:

  install: target directory `/usr/include/c++/v1/__tuple/' does not exist

That pathname is a file:

  -r--r--r--  1 root wheel 20512 Feb 15  2023 /usr/include/c++/v1/__tuple

Early in make output is

  mtree -deU -i -f /usr/src/etc/mtree/BSD.include.dist -p /usr/include
  ./c++/v1/__algorithm/pstl_backends missing (created)
  [...]
  ./c++/v1/__tuple missing (not created: File exists)

Should I remove the file and try again, or is there a more elegant fix?

The word "tuple" does not appear in UPDATING.


'make delete-old' should have removed this file.

bdd1243df58e6 (Dimitry Andric  2023-04-14 23:41:27 +0200   965)
OLD_FILES+=usr/include/c++/v1/__tuple


Ah yes, that's it. The file was removed during the upgrade from libc++
15.0 to 16.0, while its contents was split into a subdirectory named
__tuple_dir. In libc++ 17.0.0 they renamed this subdirectory back to
just __tuple.

This means that apparently people are not running "make delete-old"
after installations. Please don't forget that. :)


Well, but if you have an old system with LLVM 15 that you upgrade directly
to LLVM 17 you will hit this even if you ran delete-old after your last
upgrading that used LLVM 15.  We might need something to cope with this
during the install target for libc++ in particular where this has occurred
multiple times historically.

--
John Baldwin

make installworld fails because /usr/include/c++/v1/__tuple is a file

2023-12-10 Thread John F Carr

On arm64 running CURRENT from two weeks ago I updated to

  c711af772782 Bump __FreeBSD_version for llvm 17.0.6 merge

and built and installed from source.  make installworld failed:

  install: target directory `/usr/include/c++/v1/__tuple/' does not exist

That pathname is a file:

  -r--r--r--  1 root wheel 20512 Feb 15  2023 /usr/include/c++/v1/__tuple

Early in make output is

  mtree -deU -i -f /usr/src/etc/mtree/BSD.include.dist -p /usr/include
  ./c++/v1/__algorithm/pstl_backends missing (created)
  [...]
  ./c++/v1/__tuple missing (not created: File exists)

Should I remove the file and try again, or is there a more elegant fix?

The word "tuple" does not appear in UPDATING.

Re: How do I update the kernel of FreeBSD-CURRENT

2023-11-29 Thread John Nielsen

On Nov 29, 2023, at 12:21 PM, Manoel Games  wrote:
> 
> I am a new FreeBSD user, and I am using FreeBSD-CURRENT. How do I update the 
> FreeBSD-CURRENT kernel, and is it done through pkg? I installed 
> FreeBSD-CURRENT without src.

As a new user you should probably run a supported release version, such as 
14.0.  Releases have binary updates available via freebsd-update. (Upgrading 
the base OS via pkg is still experimental.) Current has no such feature, so you 
need to download/update the source and recompile.

See the Handbook chapter on upgrading FreeBSD:
https://docs.freebsd.org/en/books/handbook/cutting-edge/

JN

Re: bhyve -G


On 11/15/23 3:06 PM, Bakul Shah wrote:




On Nov 15, 2023, at 7:57 AM, John Baldwin  wrote:

On 10/9/23 5:21 PM, Bakul Shah wrote:

Any hints on how to use bhyve's -G  option to debug a VM
kernel? I can connect to it from gdb with "target remote :"
& bhyve stops the VM initially but beyond that I am not sure.
Ideally this should work just like an in-circuit-emulator, not
requiring anything special in the VM or kernel itself.


step only works on Intel CPUs currently (and is a bit fragile
anyway due to interrupts firing while you try to step, but that
happens for me in QEMU as well).   Breakpoints should work fine.
I tend to use 'until' to do stepping (basically stepping via
temporary breakpoints) when debugging the kernel this way.


Thanks for your response!

I can ^C to stop the VM, examine the stack, set breakpoints,
continue etc. but when the breakpoint is hit, kgdb doesn't
regain control -- instead I get the usual

db> ...

prompt on the console. I guess I have to set some sysctl for
this?


Hmm, no, it shouldn't be breaking into DDB in the guest as the
breakpoint exception should be intercepted by the stub and
never made visible to the guest.

--
John Baldwin

Re: [HEADS-UP] Quick update to 14.0-RELEASE schedule


On 11/14/23 8:52 PM, Glen Barber wrote:

On Tue, Nov 14, 2023 at 08:10:23PM -0700, The Doctor wrote:

On Wed, Nov 15, 2023 at 02:27:01AM +, Glen Barber wrote:

On Tue, Nov 14, 2023 at 05:15:48PM -0700, The Doctor wrote:

On Tue, Nov 14, 2023 at 08:36:54PM +, Glen Barber wrote:

We are still waiting for a few (non-critical) things to complete before
the announcement of 14.0-RELEASE will be ready.

It should only be another day or so before these things complete.

Thank you for your understanding.



I always just installed my copy.



Ok.  I do not know what exactly is your point, but releases are never
official until there is a PGP-signed email sent.  The email is intended
for the general public of consumers of official releases, not "yeah,
but"s.



Howver if you do a freebsd-update upgrade, you can upgrade.

Is that suppose to happen?



That does not say that the freebsd-update bits will not change *until*
the official release announcement has been sent.

In my past 15 years involved in the Project, I think we have been very
clear on that.

A RELEASE IS NOT FINAL UNTIL THE PGP-SIGNED ANNOUNCEMENT IS SENT.

I mean, c'mon, dude.

We really, seriously, for all intents and purposes, cannot be any more
clear than that.

So, yes, *IF* an update necessitates a new freebsd-update build, what
you are running is *NOT* official.

For at least 15 years, we have all said the same entire thing.


Yes, but, if at this point we had to rebuild, it would have to be 14.0.1
or something (which we have done a few times in the past).  It would be
too confusing otherwise once the bits are built and published (where
published means "uploaded to our CDN").  It is the 14.0 release bits,
the only question is if for some reason we had a dire emergency that
meant we had to pull it at the last minute and publish different bits
(under a different release name).

Realistically, once the bits are available, we can't prevent people from
using them, it's just at their own risk to do so until the project says
"yes, we believe these are good".  Granted, they are under the same risk
if they are still running the last RC.  The best way to minimize that
risk going forward is to add more automation of testing/CI to go along
with the process of building release bits so that the build artifacts
from the release build run through CI and are only published if the CI
is green as that would give us greater confidence of "we believe these
are good" before they are uploaded for publishing.

--
John Baldwin

Re: bsdinstall/scriptedpart could not run ;-(


On 11/12/23 11:00 PM, KIRIYAMA Kazuhiko wrote:

Hi, all

I usually run bsdinstall by instllerconfig, but
bsdinstall/scriptedpart could not run ;-(

My installerconfig is:

PARTITIONS='nda0 gpt { 200M efi, 804G freebsd-ufs /, 128G freebsd-swap }'
DISTRIBUTIONS='base.txz kernel-dbg.txz kernel.txz lib32.txz tests.txz'
ZFSBOOT_DISKS=""

#!/bin/sh
/bin/mkdir -p /.dake
/bin/cp /usr/share/zoneinfo/Asia/Tokyo /etc/localtime
/bin/cp /root/.cshrc /root/.cshrc.org
/bin/cat <> /etc/fstab
192.168.1.17:/.dake /.dake  nfs rw  0   0
EOF
sed -i".bak" -Ee 
'/^#BDS_install.sh_added:start_line$/,/^#BDS_install.sh_added:end_line$/d' /root/.cshrc
/bin/cat <<'EOF' >> /root/.cshrc
#BDS_install.sh_added:start_line
setenv  PATH${PATH}:/.dake/bin
setenv  MGRHOME /usr/home/admin
setenv  OPENTOOLSDIR/.dake
setenv  DAKEDIR /.dake
#BDS_install.sh_added:end_line
EOF
   :
(snip)
   :

I investigated in bsdinstall script and found scriptedpart
which acutually run partedit with scriptedpart would not be
destroy existing partition. In fact scriptedpart -> partedit
changed in script as follows, then parttion editor run at
terminal.


My guess is something to do with commit 
23099099196548550461ba427dcf09dcfb01878d,
though I don't see how it could work any differently in this case as the only
change to part_config there was to return if if geom_gettree fails, and if it
fails, provider_for_name would presumably have failed anyway.

--
John Baldwin

Re: bhyve -G


On 10/9/23 5:21 PM, Bakul Shah wrote:

Any hints on how to use bhyve's -G  option to debug a VM
kernel? I can connect to it from gdb with "target remote :"
& bhyve stops the VM initially but beyond that I am not sure.
Ideally this should work just like an in-circuit-emulator, not
requiring anything special in the VM or kernel itself.


step only works on Intel CPUs currently (and is a bit fragile
anyway due to interrupts firing while you try to step, but that
happens for me in QEMU as well).   Breakpoints should work fine.
I tend to use 'until' to do stepping (basically stepping via
temporary breakpoints) when debugging the kernel this way.

--
John Baldwin

Re: KTLS thread on 14.0-RC3

2023-10-31 Thread John Baldwin


On 10/30/23 3:41 AM, Zhenlei Huang wrote:




On Oct 30, 2023, at 12:09 PM, Zhenlei Huang  wrote:




On Oct 29, 2023, at 5:43 PM, Gordon Bergling  wrote:

Hi,

I am currently building a new system, which should be based on 14.0-RELEASE.
Therefor I am tracking releng/14.0 since its creation and updating it currently
via the usualy buildworld steps.

What I have noticed recently is, that the [KTLS] is missing. I have a stable/13
system which shows the [KTLS] thread and a very recent -CURRENT that also shows
the [KTLS] thread.

The stable/13 and releng/14.0 systems both use the GENERIC kernel, without any
custom modifications.

Loaded KLDs are also the same.

Did I miss something, or is there something in releng/14.0 missing, which
is currenlty enabled in stable/13?


KTLS shall still work as intended, the creation of it threads is deferred.

See a72ee355646c (ktls: Defer creation of threads and zones until first use)

Run ktls_init() when the first KTLS session is created rather than
unconditionally during boot.  This avoids creating unused threads and
allocating unused resources on systems which do not use KTLS.


```
-SYSINIT(ktls, SI_SUB_SMP + 1, SI_ORDER_ANY, ktls_init, NULL);
```


Seems 14.0 only create one KTLS thread.

IIRC 13.2 create one thread per core.


That part should not be different.  There should always be one thread per core.

--
John Baldwin

15/14 upgrades break old sudo, maybe bump PAM's shlib?

2023-09-09 Thread John Baldwin


I upgraded my laptop from a late June current to current from yesterday
today, and after installworld sudo stopped working (dies with a SIGBUS).
After some debugging, the issue ended up being OpenSSL library version
mismatches as sudo uses PAM and PAM is linked agianst OpenSSL 3, but
sudo is linked against OpenSSL 1.1.1.  Both shlibs get mapped into the
the process and at some point sudo crosses the streams and the crash
occurs inside OpenSSL 3's libcrypto.

I realize that we do have a generate note about needing to update third
party packages after an upgrade, but I tend to use sudo as part of my
workflow for doing that sort of thing.  I generally build all my own
packages via poudriere and use sudo at various points in that process,
but even if I were using FreeBSD.org packages I would be using sudo to
try to run 'pkg upgrade'.

su(8) in base works fine, so that's my workaround for now on my laptop,
but I wonder if we want to make this particular bump on the upgrade
path a little less bumpy?  Either by being clear in our release notes
that tools like sudo (and I suspect any other third-party su wrappers
that also use PAM, xscreensaver's screen lock doesn't seem to be affected
since it probably doesn't use OpenSSL directly thankfully) can break,
or another route we could take would be to bump the DSO versions of
things that depend on libcrypto/libssl in base.  We did not do this
latter approach for the OpenSSL 1.0.2 -> 1.1.1 upgrade FWIW.

If we wanted to do the shlib bump approach, Enji had a good list from a
while back (though Enji wanted to make them all private rather than
bumping):

- kerberos
- libarchive
- libbsnmp
- libfetch
- libgeli
- libldns
- libmp
- libradius
- libunbound

From my research it seems that PAM (library and modules), gssapi libraries,
and libzfs would also need to be on the list.

libldns is already private as is libunbound, though bumping them might
be safter anyway.

There is on libgeli, instead there is geli_eli.so which has no version,
but hopefully is not widely used in ports the same as PAM.  Note also
that if we did this, we would want to do it for 14.0 as 13.x -> 14 upgrades
are affected in the same way.

--
John Baldwin

Re: user problems when upgrading to v15

2023-09-09 Thread John Baldwin


On 9/2/23 7:11 AM, Dimitry Andric wrote:

On 1 Sep 2023, at 03:42, brian whalen  wrote:


Repeating the entire process:

I created a 13.2 vm with 6 cores and 8GB of ram.

Ran freebsd-update fetch and install.

Ran pkg install git bash ccache open-vm-tools-nox11

Used git clone to get current and ports source files.

Edited /etc/make.conf to use ccache

Ran make -j6 buildworld && make -j6 kernel

I then rebooted in single user mode and did the next steps saving output to a file 
with > filename.

etcupdate -p was pretty uneventful. It did  show the below and did not prompt 
to edit.

root@f15:~ # less etcupdatep
   C /etc/group
   C /etc/master.passwd


This is a problem: the "C" characters mean there were conflicts, and it's 
indeed very unfortunate that etcupdate does not immediately force you to resolve them. 
Because now you basically have mangled group and master.passwd files, with conflict 
markers in them!


No, the conflicted files are in /var/db/etcupdate/conflicts, the files in /etc 
are still
the old ones at this point and won't be updated until you run 'etcupdate 
resolve' to
fix them.

I suspect what happened here is that Brian chose the 'tf' (theirs-full) option 
for
'etcupdate resolve' when he really wanted to do 'e' to edit the conflicted 
version.


Immediately after this, you should run "etcupdate resolve", and fix any 
conflicts that it has found.

Note that recently there was a lot of churn due to the removal of $FreeBSD$ 
keywords, and this almost always creates conflicts in the group and passwd 
files. For lots of other files in /etc, the conflicts are resolved 
automatically, but unfortunately not for the files that are essential to log in!



make installworld seemed mostly error free though I did see a nonzero status 
for a man page failed inn the man4 directory.

etcupdate -B only showed the below. This was my first build after install.

root@f15:~ # less etcupdateB
Conflicts remain from previous update, aborting.


Yes, that is indeed the problem. You must first resolve conflicts from any 
previous etcupdate run, before doing anything else. As to why it does not 
immediately forces you to do so, and delegates this to a separate step, which 
can easily be forgotten, I have no idea.


So that if you are doing scripted upgrades, you don't hang forever in a script.
The intention is that after doing a bunch of scripted installworld + etcupdate's
on various hosts you can use 'etcupdate status' to see if there are any 
remaining
steps requiring manual intervention.

There could be an option to request batched vs interactivate updates perhaps.


If I type exit in single user mode to go multi user mode, the local user still 
works. After a reboot the local user still works. This local user can also sudo 
as expected. This wasn't the case for the previous build when I first reported 
this. However, if I run etcupdate resolve it is still presenting /etc/group and 
/etc/master/passwd as problems.

If this is is expected behavior for current then no big deal. I just wasn't 
sure.


The conflicts themselves are expected, alas. But you _must_ resolve them, 
otherwise you can end up with a mostly-bricked system.


No, the conflict markers are not placed in the versions in /etc.  However, 
etucpdate
does refuse to do a "new" upgrade until you resolve all the conflicts from your
previous upgrade to ensure that conflicted upgrades aren't missed.

--
John Baldwin

sscanf change prevents build of CURRENT

2023-08-30 Thread John F Carr

I had a problem yesterday and today rebuilding a -CURRENT system from source:

  --- magic.mgc ---
  ./mkmagic magic
  magic, 4979: Warning: Current entry does not yet have a description for 
adding a MIME type
  mkmagic: could not find any valid magic files!

The cause was an sscanf call unexpectedly failing to parse the input.  This 
caused
the mkmagic program (internal tool used to build magic number table for file) 
to fail.

If I link mkmagic against the static libc.a in /usr/obj then it works.  So my 
installed
libc.so is broken and the latest source works.  I think.  My installed kernel 
is at
76edfabbecde, the end of the binary integer parsing commit series, so my libc
should be the same.

The program below demonstrates the bug.  See src/contrib/file/src for context.

I am trying to manually compile a working mkmagic and restart the build to get 
unstuck.

#include 
#include 

struct guid {
uint32_t data1;
uint16_t data2;
uint16_t data3;
uint8_t data4[8];
};

int main(int argc, char *argv[])
{
  struct guid g = {0, 0, 0, {0}};
  char *text = "75B22630-668E-11CF-A6D9-00AA0062CE6C";

  if (argc > 1)
text = argv[1];
  int count =
sscanf(text,
   "%8x-%4hx-%4hx-%2hhx%2hhx-%2hhx%2hhx%2hhx%2hhx%2hhx%2hhx",
   , , , [0], [1],
   [2], [3], [4], [5],
   [6], [7]);

  fprintf(stdout,
  
"[%d]:\n%08x-%04hx-%04hx-%02hhx%02hhx-%02hhx%02hhx%02hhx%02hhx%02hhx%02hhx\n",
  count,
  g.data1, g.data2, g.data3, g.data4[0], g.data4[1],
  g.data4[2], g.data4[3], g.data4[4], g.data4[5],
  g.data4[6], g.data4[7]);
  return count != 11;
}

Re: shell hung in fork system call

2023-07-10 Thread John F Carr




> On Jul 9, 2023, at 19:59, Konstantin Belousov  wrote:
> 
> On Sun, Jul 09, 2023 at 11:36:03PM +0000, John F Carr wrote:
>> 
>> 
>>> On Jul 9, 2023, at 19:25, Konstantin Belousov  wrote:
>>> 
>>> On Sun, Jul 09, 2023 at 10:41:27PM +, John F Carr wrote:
>>>> Kernel and system at a146207d66f320ed239c1059de9df854b66b55b7 plus some 
>>>> irrelevant local changes, four 64 bit ARM processors, make.conf sets 
>>>> CPUTYPE?=cortex-a57.
>>>> 
>>>> I typed ^C while /bin/sh was starting a pipeline and my shell got hung in 
>>>> the middle of fork().
>>>> 
>>>>> From the terminal:
>>>> 
>>>> # git log --oneline --|more
>>>> ^C^C^C
>>>> load: 3.26  cmd: sh 95505 [fork] 5308.67r 0.00u 0.03s 0% 2860k
>>>> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
>>>> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
>>>> load: 3.16  cmd: sh 95505 [fork] 5311.75r 0.00u 0.03s 0% 2860k
>>>> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
>>>> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
>>>> 
>>>> According to ps -d on another terminal the shell has no children:
>>>> 
>>>> PID TT  STAT   TIME COMMAND
>>>> [...]
>>>> 873 u0  IWs 0:00.00 `-- login [pam] (login)
>>>> 874 u0  I   0:00.17   `-- -sh (sh)
>>>> 95504 u0  I   0:00.01 `-- su -
>>>> 95505 u0  D+  0:00.05   `-- -su (sh)
>>>> [...]
>>>> 
>>>> Nothing on the (115200 bps serial) console.  No change in system 
>>>> performance.
>>>> 
>>>> The system is busy copying a large amount of data from the network to a 
>>>> ZFS pool on spinning disks.  The git|more pipeline could have taken some 
>>>> time to get going while I/O requests worked their way through the queue.  
>>>> It would not have touched the busy pool, only the zroot pool on an SSD.
>>>> 
>>>> Has anything changed recently that might cause this?
>>> 
>>> There was some change around fork, but your sleep seems to be not from
>>> that change.  Can you show the wait channel for the process?  Do something
>>> like
>>> $ ps alxww
>>> 
>> 
>> UID   PID  PPID  C PRI NI   VSZ   RSS MWCHAN   STAT TTTIME COMMAND
>>   0 95505 95504  2  20  0 13508  2876 fork D+   u0 0:00.13 -su (sh)
>> 
>> This is probably the same information displayed as [fork] in the output from 
>> ^T.
>> 
>> Does it correspond to the source line
>> 
>> pause("fork", hz / 2);
>> 
>> ?
> 
> Yes, it is rate-limiting code.  Still it is interesting to see the whole
> ps output.
> 
> Do you have 7a70f17ac4bd64dc1a5020f in your source?

No, I do not have that commit.

The comment mentions livelock.  CPU use as reported by iostat did not change 
after the process hung.

Re: shell hung in fork system call

2023-07-09 Thread John F Carr




> On Jul 9, 2023, at 19:25, Konstantin Belousov  wrote:
> 
> On Sun, Jul 09, 2023 at 10:41:27PM +0000, John F Carr wrote:
>> Kernel and system at a146207d66f320ed239c1059de9df854b66b55b7 plus some 
>> irrelevant local changes, four 64 bit ARM processors, make.conf sets 
>> CPUTYPE?=cortex-a57.
>> 
>> I typed ^C while /bin/sh was starting a pipeline and my shell got hung in 
>> the middle of fork().
>> 
>>> From the terminal:
>> 
>> # git log --oneline --|more
>> ^C^C^C
>> load: 3.26  cmd: sh 95505 [fork] 5308.67r 0.00u 0.03s 0% 2860k
>> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
>> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
>> load: 3.16  cmd: sh 95505 [fork] 5311.75r 0.00u 0.03s 0% 2860k
>> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
>> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
>> 
>> According to ps -d on another terminal the shell has no children:
>> 
>>  PID TT  STAT   TIME COMMAND
>> [...]
>>  873 u0  IWs 0:00.00 `-- login [pam] (login)
>>  874 u0  I   0:00.17   `-- -sh (sh)
>> 95504 u0  I   0:00.01 `-- su -
>> 95505 u0  D+  0:00.05   `-- -su (sh)
>> [...]
>> 
>> Nothing on the (115200 bps serial) console.  No change in system performance.
>> 
>> The system is busy copying a large amount of data from the network to a ZFS 
>> pool on spinning disks.  The git|more pipeline could have taken some time to 
>> get going while I/O requests worked their way through the queue.  It would 
>> not have touched the busy pool, only the zroot pool on an SSD.
>> 
>> Has anything changed recently that might cause this?
> 
> There was some change around fork, but your sleep seems to be not from
> that change.  Can you show the wait channel for the process?  Do something
> like
> $ ps alxww
> 

 UID   PID  PPID  C PRI NI   VSZ   RSS MWCHAN   STAT TTTIME COMMAND
   0 95505 95504  2  20  0 13508  2876 fork D+   u0 0:00.13 -su (sh)

This is probably the same information displayed as [fork] in the output from ^T.

Does it correspond to the source line

pause("fork", hz / 2);

?

shell hung in fork system call

2023-07-09 Thread John F Carr

Kernel and system at a146207d66f320ed239c1059de9df854b66b55b7 plus some 
irrelevant local changes, four 64 bit ARM processors, make.conf sets 
CPUTYPE?=cortex-a57.

I typed ^C while /bin/sh was starting a pipeline and my shell got hung in the 
middle of fork().

>From the terminal:

# git log --oneline --|more
^C^C^C
load: 3.26  cmd: sh 95505 [fork] 5308.67r 0.00u 0.03s 0% 2860k
mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
load: 3.16  cmd: sh 95505 [fork] 5311.75r 0.00u 0.03s 0% 2860k
mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 

According to ps -d on another terminal the shell has no children:

  PID TT  STAT   TIME COMMAND
[...]
  873 u0  IWs 0:00.00 `-- login [pam] (login)
  874 u0  I   0:00.17   `-- -sh (sh)
95504 u0  I   0:00.01 `-- su -
95505 u0  D+  0:00.05   `-- -su (sh)
[...]

Nothing on the (115200 bps serial) console.  No change in system performance.

The system is busy copying a large amount of data from the network to a ZFS 
pool on spinning disks.  The git|more pipeline could have taken some time to 
get going while I/O requests worked their way through the queue.  It would not 
have touched the busy pool, only the zroot pool on an SSD.

Has anything changed recently that might cause this?

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-07 Thread John F Carr

On Jul 6, 2023, at 20:42, Mike Karels  wrote:
> 
> 
> Thanks for isolating this.  Let me know when you have the bug number.
> I just tested a fix (the compat code drops the reference on the current
> address space an extra time, probably freeing it).
> 
> Mike

The bug was introduced in January, 2022.   It allows 32 bit binaries to crash a 
64 bit system when COMPAT_FREEBSD32 is on.  Test coverage of the buggy function 
(sysctl_kern_proc_vm_layout) was added at the same time.

There should be routine runs of 32 bit test suites on 64 bit systems.  Although 
i386 and armv7 are tier 2 systems, the tier 1 COMPAT_FREEBSD32 kernel code 
needs to be exercised.  This bug was only discovered by manually running tests 
in the right environment, 17 months after automated testing could have 
discovered it.

Re: For snapshot builds: armv7 chroot on aarch64 has kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin hung up [in getpid?], unkillable, prevents reboot

2023-07-06 Thread John F Carr




> On Jun 25, 2023, at 20:16, Mark Millard  wrote:
> 
> Using the likes of:
> 
> FreeBSD-14.0-CURRENT-arm64-aarch64-ROCK64-20230622-b95d2237af40-263748.img
> and:
> FreeBSD-14.0-CURRENT-arm-armv7-GENERICSD-20230622-b95d2237af40-263748.img
> 
> I have shown the following behavior after setting up storage
> media based on them. (This was a test that my builds were not
> odd for the issue.)
> 
> Boot the aarch64 media and log in. (Note: I logged in
> as root.)
> 
> mount the armv7 media (-noatime is just my habit)
> and then put it to use:
> 
> # mount -onoatime /dev/da1s2a /mnt
> 
> # chroot /mnt/
> 
> # kyua test -k /usr/tests/Kyuafile sys/kern/kern_copyin
> sys/kern/kern_copyin:kern_copyin  ->  
> 
> On the serial console:
> 
> # ps -xu
> USER  PID   %CPU %MEM   VSZ  RSS TT  STAT STARTED  TIME COMMAND
> root   11 1498.4  0.0 0  256  -  RNL  23:24   542:52.92 [idle]
> root 1174  100.0  0.0 0   16  -  Rs   23:37 0:00.00 
> /usr/tests/sys/kern/kern_copyin -vunprivileged-user=tests 
> -r/tmp/kyua.9YUttj/2/result.atf kern_copyin
> root00.0  0.0 0 1616  -  DLs  23:24 0:00.50 [kernel]
> root10.0  0.0 11704 1288  -  ILs  23:24 0:00.02 /sbin/init
> root20.0  0.0 0  256  -  WL   23:24 0:00.26 [clock]
> root30.0  0.0 0  272  -  DL   23:24 0:00.00 [crypto]
> root40.0  0.0 0   80  -  DL   23:24 0:00.95 [cam]
> root50.0  0.0 0   16  -  DL   23:24 0:00.00 [busdma]
> root60.0  0.0 0   16  -  DL   23:24 0:00.03 [rand_harvestq]
> root70.0  0.0 0   48  -  DL   23:24 0:00.06 [pagedaemon]
> root80.0  0.0 0   16  -  DL   23:24 0:00.00 [vmdaemon]
> root90.0  0.0 0  160  -  DL   23:24 0:00.38 [bufdaemon]
> root   100.0  0.0 0   16  -  DL   23:24 0:00.00 [audit]
> root   120.0  0.0 0  880  -  WL   23:24 0:11.81 [intr]
> root   130.0  0.0 0   48  -  DL   23:24 0:00.04 [geom]
> root   140.0  0.0 0   16  -  DL   23:24 0:00.00 [sequencer 00]
> root   150.0  0.0 0  160  -  DL   23:24 0:06.42 [usb]
> root   160.0  0.0 0   16  -  DL   23:24 0:00.10 [acpi_thermal]
> root   170.0  0.0 0   16  -  DL   23:24 0:00.00 [acpi_cooling0]
> root   180.0  0.0 0   16  -  DL   23:24 0:00.04 [syncer]
> root   190.0  0.0 0   16  -  DL   23:24 0:00.00 [vnlru]
> root  6710.0  0.0 13260 2600  -  Is   23:25 0:00.00 dhclient: 
> system.syslog (dhclient)
> root  6740.0  0.0 13260 2752  -  Is   23:25 0:00.00 dhclient: dpni0 
> [priv] (dhclient)
> root  7610.0  0.0 14572 3972  -  Ss   23:25 0:00.02 /sbin/devd
> root  9640.0  0.0 12832 2764  -  Is   23:25 0:00.02 /usr/sbin/syslogd 
> -s
> root 10330.0  0.0 13012 2604  -  Ss   23:25 0:00.01 /usr/sbin/cron -s
> root 10580.0  0.0 21052 8308  -  Is   23:25 0:00.01 sshd: 
> /usr/sbin/sshd [listener] 0 of 10-100 startups (sshd)
> root 10780.0  0.0 21288 9304  -  Is   23:26 0:00.09 sshd: root@pts/0 
> (sshd)
> root 11750.0  0.0 21288 9496  -  Is   23:37 0:00.04 sshd: root@pts/1 
> (sshd)
> root 10740.0  0.0 13380 3008 u0  Is   23:25 0:00.01 login [pam] 
> (login)
> root 10750.0  0.0 13460 3292 u0  S23:25 0:00.02 -sh (sh)
> root 12330.0  0.0 13588 3016 u0  R+   00:00 0:00.00 ps -xu
> root 10810.0  0.0 13460 3328  0  Is   23:26 0:00.02 -sh (sh)
> root 11700.0  0.0  5788 2884  0  I23:36 0:00.02 /bin/sh -i
> root 11720.0  0.0 10408 7192  0  I+   23:37 0:00.30 kyua test -k 
> /usr/tests/Kyuafile sys/kern/kern_copyin
> root 11780.0  0.0 13460 3320  1  Is+  23:38 0:00.01 -sh (sh)
> 
> 1174 is stuck, even if one waits for 30min+.
> kill and kill -9 will not kill 1174.
> 
> "shutdown -r now" hangs before the reboot happens
> and reports: "some processes would not die".
> 
> An interesting property is that ps and top disagree
> about 1174 CPU usage: ps 100%, top 0%. But top also
> indicates 1174 always has CPU0 "STATE". (Across
> tests CPUn varies but within a test it has
> a fixed n.)
> 
> I have also seen ps "STAT" being RXs.
> 
> The following is from my earlier activity with my own
> builds involved, here 1119, not the 1174 from above.
> truss reports as the last thing for the stuck process
> as "getpid()".
> 
> . . .
> 1119: 0.588983953 fstatat(AT_FDCWD,"/usr/tests/sys/kern/kern_copyin",{ 
> mode=-r-xr-xr-x ,inode=111756,size=9776,blksize=10240 },AT_SYMLINK_NOFOLLOW) 
> = 0 (0x0)
> 1119: 0.589065030 
> mmap(0x0,20480,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON|MAP_ALIGNED(12),-1,0x0)
>  = 1074188288 (0x4006d000)
> 1119: 0.589227544 
> openat(AT_FDCWD,"/tmp/kyua.aBQv6E/2/result.atf",O_WRONLY|O_CREAT|O_TRUNC,0644)
>  = 3 (0x3)
> 1119: 0.589276503 getpid()  = 1119 (0x45f)
> 
> 
> 
> For reference, from inside an armv7 chroot session
> before doing such a test:
> 
> # uname -apKU
> FreeBSD generic

Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit

2023-06-26 Thread John F Carr



> On Jun 26, 2023, at 04:32, Mark Millard  wrote:
> 
> On Jun 24, 2023, at 17:25, Mark Millard  wrote:
> 
>> On Jun 24, 2023, at 14:26, John F Carr  wrote:
>> 
>>> 
>>>> On Jun 24, 2023, at 13:00, Mark Millard  wrote:
>>>> 
>>>> The running system build is a non-debug build (but
>>>> with symbols not stripped).
>>>> 
>>>> The HoneyComb's console log shows:
>>>> 
>>>> . . .
>>>> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
>>>> GEOM_NOP: Device md0.nop created.
>>>> g_vfs_done():md0.nop[READ(offset=5885952, length=8192)]error = 5
>>>> GEOM_NOP: Device md0.nop removed.
>>>> GEOM_NOP: Device md0.nop created.
>>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>>> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
>>>> GEOM_NOP: Device md0.nop removed.
>>>> GEOM_NOP: Device md0.nop created.
>>>> GEOM_NOP: Device md0.nop removed.
>>>> Fatal data abort:
>>>> x0: a02506e64400
>>>> x1: 0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>> x2:   4b
>>>> x3: a343932b0b22fb30
>>>> x4:0
>>>> x5:  3310b0d062d0e1d
>>>> x6: 1d0e2d060d0b3103
>>>> x7:0
>>>> x8: ea325df8
>>>> x9: 0001eec946d0 ($d.6 + 0)
>>>> x10: 0001ea401880 (g_raid3_post_sync + 3a145f8)
>>>> x11:0
>>>> x12:0
>>>> x13: 00cd8960 (lock_class_mtx_sleep + 0)
>>>> x14:0
>>>> x15: a02506e64405
>>>> x16: 0001eec94860 (_DYNAMIC + 160)
>>>> x17: 0063a450 (ifc_attach_cloner + 0)
>>>> x18: 0001eb290400 (g_raid3_post_sync + 48a3178)
>>>> x19: 0001eec94600 (vnet_epair_init_vnet_init + 0)
>>>> x20: 00fa5b68 (vnet_sysinit_sxlock + 18)
>>>> x21: 00d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>> x22: 00d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>> x23: a042e500
>>>> x24: a042e500
>>>> x25: 00ce0788 (linker_lookup_set_desc + 0)
>>>> x26: a0203cdef780
>>>> x27: 0001eec94698 (__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
>>>> x28: 00d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
>>>> x29: 0001eb290430 (g_raid3_post_sync + 48a31a8)
>>>> sp: 0001eb290400
>>>> lr: 0001eec82a4c ($x.1 + 3c)
>>>> elr: 0001eec82a60 ($x.1 + 50)
>>>> spsr: 6045
>>>> far: 0002d8fba4c8
>>>> esr: 9646
>>>> panic: vm_fault failed: 0001eec82a60 error 1
>>>> cpuid = 14
>>>> time = 1687625470
>>>> KDB: stack backtrace:
>>>> db_trace_self() at db_trace_self
>>>> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
>>>> vpanic() at vpanic+0x13c
>>>> panic() at panic+0x44
>>>> data_abort() at data_abort+0x2fc
>>>> handle_el1h_sync() at handle_el1h_sync+0x14
>>>> --- exception, esr 0x9646
>>>> $x.1() at $x.1+0x50
>>>> vnet_register_sysinit() at vnet_register_sysinit+0x114
>>>> linker_load_module() at linker_load_module+0xae4
>>>> kern_kldload() at kern_kldload+0xfc
>>>> sys_kldload() at sys_kldload+0x60
>>>> do_el0_sync() at do_el0_sync+0x608
>>>> handle_el0_sync() at handle_el0_sync+0x44
>>>> --- exception, esr 0x5600
>>>> KDB: enter: panic
>>>> [ thread pid 70419 tid 101003 ]
>>>> Stopped at  kdb_enter+0x44: str xzr, [x19, #3200]
>>>> db> 
>>> 
>>> The failure appears to be initializing module if_epair.
>> 
>> Yep: trying:
>> 
>> # kldload if_epair.ko
>> 
>> was enough to cause the crash. (Just a HoneyComb context at
>> that point.)
>> 
>> I tried media dd'd from the recent main snapshot, booting the
>> same system. No crash. I moved my build boot media to some
>> other systems and tested them: crashes. I tried my boot media
>> built optimized for Cortex-A53 or Cortex-X1C/Cortex-A78C
>> instead of Cortex-A72: no crashes. (But only one system can
>> use the X1C/A78C code in that build.)
>> 
>> So variation testing only gets the crashes for my builds
>> that are code-optimized for Cortex-A72's. The same source
>> tree vinta

Re: twe(4) removed

2023-06-25 Thread John Nielsen

> On Jun 24, 2023, at 4:16 AM, Marcin Cieslak  wrote:
> 
> I just noticed that I had to remove "device twe"
> from my kernel configuration when rebuilding my -CURRENT today.
> 
> Is there any problem with this driver that makes it difficult
> to keep around?
> 
> Believe or not, I still rent a machine using it in JBOD mode
> (running 13 right now but I could switch it to -CURRENT for testing if 
> needed).

The deprecation notice and partial justification are here:
https://cgit.freebsd.org/src/commit/?id=4b22ce07306243d6641c93efcf315a787dd0876c

JN

Re: aarch64 main-n263493-4e8d558c9d1c-dirty (so: 2023-Jun-10) Kyuafile run: "Fatal data abort" crash during vnet_register_sysinit

2023-06-24 Thread John F Carr



> On Jun 24, 2023, at 13:00, Mark Millard  wrote:
> 
> The running system build is a non-debug build (but
> with symbols not stripped).
> 
> The HoneyComb's console log shows:
> 
> . . .
> GEOM_STRIPE: Device stripe.IMfBZr destroyed.
> GEOM_NOP: Device md0.nop created.
> g_vfs_done():md0.nop[READ(offset=5885952, length=8192)]error = 5
> GEOM_NOP: Device md0.nop removed.
> GEOM_NOP: Device md0.nop created.
> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
> g_vfs_done():md0.nop[READ(offset=5935104, length=4096)]error = 5
> GEOM_NOP: Device md0.nop removed.
> GEOM_NOP: Device md0.nop created.
> GEOM_NOP: Device md0.nop removed.
> Fatal data abort:
>  x0: a02506e64400
>  x1: 0001ea401880 (g_raid3_post_sync + 3a145f8)
>  x2:   4b
>  x3: a343932b0b22fb30
>  x4:0
>  x5:  3310b0d062d0e1d
>  x6: 1d0e2d060d0b3103
>  x7:0
>  x8: ea325df8
>  x9: 0001eec946d0 ($d.6 + 0)
> x10: 0001ea401880 (g_raid3_post_sync + 3a145f8)
> x11:0
> x12:0
> x13: 00cd8960 (lock_class_mtx_sleep + 0)
> x14:0
> x15: a02506e64405
> x16: 0001eec94860 (_DYNAMIC + 160)
> x17: 0063a450 (ifc_attach_cloner + 0)
> x18: 0001eb290400 (g_raid3_post_sync + 48a3178)
> x19: 0001eec94600 (vnet_epair_init_vnet_init + 0)
> x20: 00fa5b68 (vnet_sysinit_sxlock + 18)
> x21: 00d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
> x22: 00d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
> x23: a042e500
> x24: a042e500
> x25: 00ce0788 (linker_lookup_set_desc + 0)
> x26: a0203cdef780
> x27: 0001eec94698 (__set_sysinit_set_sym_if_epairmodule_sys_init + 0)
> x28: 00d8e000 (sdt_vfs_vop_vop_spare4_return + 0)
> x29: 0001eb290430 (g_raid3_post_sync + 48a31a8)
>  sp: 0001eb290400
>  lr: 0001eec82a4c ($x.1 + 3c)
> elr: 0001eec82a60 ($x.1 + 50)
> spsr: 6045
> far: 0002d8fba4c8
> esr: 9646
> panic: vm_fault failed: 0001eec82a60 error 1
> cpuid = 14
> time = 1687625470
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x13c
> panic() at panic+0x44
> data_abort() at data_abort+0x2fc
> handle_el1h_sync() at handle_el1h_sync+0x14
> --- exception, esr 0x9646
> $x.1() at $x.1+0x50
> vnet_register_sysinit() at vnet_register_sysinit+0x114
> linker_load_module() at linker_load_module+0xae4
> kern_kldload() at kern_kldload+0xfc
> sys_kldload() at sys_kldload+0x60
> do_el0_sync() at do_el0_sync+0x608
> handle_el0_sync() at handle_el0_sync+0x44
> --- exception, esr 0x5600
> KDB: enter: panic
> [ thread pid 70419 tid 101003 ]
> Stopped at  kdb_enter+0x44: str xzr, [x19, #3200]
> db> 

The failure appears to be initializing module if_epair.  I see no recent 
changes in that module that would be likely to break initialization.

a9bfd080d09a if_epair: do not transmit packets that exceed the interface MTU
4d846d260e2b spdx: The BSD-2-Clause-FreeBSD identifier is obsolete, drop 
-FreeBSD
a6b55ee6be15 net: replace IFF_KNOWSEPOCH with IFF_NEEDSEPOCH
c69ae8419734 if_epair: also remove vlan metadata from mbufs
29c9b1673305 epair: Remove unneeded includes and sort some of the rest

Re: Support for more than 256 CPU cores

2023-05-08 Thread John Baldwin


On 5/5/23 6:38 AM, Ed Maste wrote:

FreeBSD supports up to 256 CPU cores in the default kernel configuration
(on Tier-1 architectures).  Systems with more than 256 cores are
available now, and will become increasingly common over FreeBSD 14’s
lifetime.  The FreeBSD Foundation is supporting the effort to increase
MAXCPU, and PR269572[1] is open to track tasks and changes.

As a project we have scalability work ahead of us to make best use of
high core count machines, but at a minimum we should be able to boot a
GENERIC kernel on such systems, and have an ABI for the FreeBSD 14
release that supports such a configuration.

Some changes have already been committed in support of increased MAXCPU,
including increasing MAX_APIC_ID (commit c8113dad7ed4) and a number of
changes to reduce bloat (such as commits 42f722e721cd, e72f7ed43eef,
78cfa762ebf2 and 74ac712f72cf).

The next step is to increase the maximum cpuset size for userland.
I have this change open in review D39941[2] and an exp-run request in
PR271213[3].  Following that the kernel change for increasing MAXCPU is
in D36838[4].

Additional work on bloat reduction will continue after this change, and
looking forward FreeBSD is going to need ongoing effort from the
community and the FreeBSD Foundation to continue improving scalability.

[1] https://bugs.freebsd.org/269572
[2] https://reviews.freebsd.org/D39941
[3] https://bugs.freebsd.org/271213
[4] https://reviews.freebsd.org/D36838


FWIW, I think it will be useful for main to run with a larger userspace
MAXCPU than kernel for at least a while so that we have better testing
of that configuration and to give headroom for bumping MAXCPU in the
kernel during the 14.x branch.

The only other viable path I think which would be more work would be to
rework cpuset_t in userspace to always use a dynamically sized mask.
This could perhaps be done in an API-preserving manner by making cpuset_t
an opaque wrapper type in userland and requiring CPU_* to indirect to
functions in libc, etc.  That's a fair bit more work however.

--
John Baldwin

Re: morse(6) sound

2022-10-28 Thread John-Mark Gurney

Nuno Teixeira wrote this message on Fri, Oct 28, 2022 at 19:36 +0100:
> Is there any way to get sound from morse(6) without speaker(4) device?

I mean, I guess you could use sox (play command) and sed to make the audio..

morse -s converts it to . and -'s, so then you convert each one of those to
a frequency and necessary delay.

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 "All that I will do, has been done, All that I have, has not."

Re: How to Enable support for IPsec deprecated algorithms: 3DES, MD5-HMAC

2022-10-04 Thread John Baldwin


On 10/4/22 1:53 AM, alfadev wrote:

Hi, i am trying to move my gateway from FreeBSD 11.0 to FreeBSD 14.0 to use
newly added ipfw table lookup for mac addresses 
(https://reviews.freebsd.org/D35103)

Also I have too many IPSec connections between fortigate, cisco etc.
And their operators use only 3DES algorithms and they have no intention to 
change it for me.
So, now i have to enable 3DES support for FreeBSD 14.0 .

To add 3DES support again i changed some files shown below.
I am not sure what i did any help welcomes.


You do not want to just restore the files as-is.  You instead want to revert 
some of the
diffs from the first commit.  The second commit for /dev/crypto doesn't matter 
for IPsec
and you can ignore it.

However, you will need to also partially revert commit 
0e00c709d7f1cdaeb584d244df9534bcdd0ac527
which removes DES and 3DES from OCF itself.  This is what removed enc_xform_des 
for example.

--
John Baldwin

Re: pkg: Newer FreeBSD version for package... but why?

2022-07-13 Thread John Baldwin


On 7/13/22 3:17 AM, Andriy Gapon wrote:

On 2022-07-13 13:09, Michael Gmelin wrote:



On Wed, 13 Jul 2022 10:29:06 +0300
Andriy Gapon  wrote:


# uname -U
1400063

# uname -K
1400063

# pkg upgrade
Updating FreeBSD repository catalogue...
Fetching packagesite.pkg: 100%5 MiB   4.8MB/s00:01
Processing entries:   0%
Newer FreeBSD version for package zyre:
To ignore this error set IGNORE_OSVERSION=yes
- package: 1400063
- running kernel: 1400051
Ignore the mismatch and continue? [y/N]:

Does anyone know why this would happen?
Where does pkg get its notion of the running kernel version?



If I'm reading the sources correctly, it's determining the OS version
by looking at the elf headers of various files in this order:

  getenv("ABI_FILE")
  /usr/bin/uname
  /bin/sh

So I would assume that `file /usr/bin/uname` shows 1400051 on your
system.


Thank you very much!  That's it:
# file /usr/bin/uname
/usr/bin/uname: ELF 32-bit LSB executable, ARM, EABI5 version 1
(FreeBSD), dynamically linked, interpreter /libexec/ld-elf.so.1,
FreeBSD-style, for FreeBSD 14.0 (1400051), stripped


You can point it to checking another file by setting ABI_FILE[0] in the
environment or ignore the check by setting IGNORE_OSVERSION (like
advised). The "running kernel:" label seems a bit misleading.


Indeed.

Now the next thing (for me) to research is why the binaries were built
"for FreeBSD 14.0 (1400051)" when the source tree has 1400063 and uname
-U also reports 1400063.
FWIW, this was a cross-build, maybe that played a role too.


If you do a NO_CLEAN=yes build, we don't relink binaries just because
crt*.o changed (where the note is stored).

--
John Baldwin

Re: BLAKE3 unstability?

2022-07-12 Thread John Baldwin


On 7/12/22 1:41 AM, Evgeniy Khramtsov wrote:

I can reproduce via:

$ truncate -s 10G /tmp/test
$ mdconfig -f /tmp/test -S 4096
$ zpool create test /dev/md1
$ zfs create -o checksum=blake3 test/b
$ dd if=/dev/random of=/test/b/noise bs=1M count=4096
$ sync
$ zpool scrub test
$ zpool status


I cannot reproduce this on openzfs/zfs@cb01da68057 (the commit that was
most recently merged) built out of tree on either stable/13 70fd40edb86
or main 9aa02d5120a.

I'll update a system and see if I can reproduce it with the in-tree ZFS.

- Ryan


It did not reproduce for me with in-tree ZFS on main@3c9ad9398fcd either.

Could you share sysctl kstat.zfs.misc.chksum_bench, maybe we are using
different implementations?
I do see that blake3 went in with only a Linux module parameter for the
implementation selection, so I'll have to fix that. For now we can at least
see which was fastest, which should be the one selected. You just won't be
able to manually change it to see if that helps.

- Ryan


I found the culprit (kernel and base from download.FreeBSD.org
kernel.txz and base.txz respectively) (I forgot about local sysctl.conf...):

kern.sched.steal_thresh=1
kern.sched.preempt_thresh=121

Then

#!/bin/sh

truncate -s 10G /tmp/test
mdconfig -f /tmp/test -S 4096
zpool create test /dev/md0
zfs create -o checksum=blake3 test/b
dd if=/dev/random of=/test/b/noise bs=1M count=4096
sync
zpool scrub test
sleep 3
zpool status

zpool destroy test
mdconfig -d -u 0
rm /tmp/test

As for ULE "tuning", these values give me fine desktop interactivity
when building lang/rust when nice and idprio did not help, so I left
them in sysctl.conf. Not sure if scheduling parameters are worthy of
a ZFS PR, maybe something essential is preempted.


It could be missing fpu_kern_enter/leave that lack of preemption would
cover over.  I thought that missing that would give a panic in the
kernel though due to FPU instructions being disabled (including vector
instructions).  Maybe ZFS isn't using fpu_kern_enter(FPU_NOCTX) and is
instead trying to juggle contexts and it has a bug in how it manages
saved FPU contexts and reuses a context?  If so, I would just suggest
that ZFS switch to using FPU_KERN_NOCTX instead which runs all SSE
type code in a critical section to disable preemption but avoids
having to allocate and manage FPU contexts.

--
John Baldwin

Re: Accessibility in the FreeBSD installer and console

2022-07-07 Thread John Kennedy

On Thu, Jul 07, 2022 at 10:11:52PM +0200, Klaus Küchemann wrote:
> > Am 07.07.2022 um 19:32 schrieb Hans Petter Selasky :
> > The only argument I've heard from some non-sighted friends about not using 
> > FreeBSD natively is that ooh, MacOSX is so cool. It starts speaking from 
> > the start if I press this and this key. Is anyone here working on or 
> > wanting such a feature?
> 
> Possibly they didn’t want to  be rude and your friends didn't tell you the 
> other argument  :-)  : according to the corresponding wiki page FreeBSD 
> doesn't natively support any audio output at all on your friends current M1 
> Mac hardware.
>  since quite nothing is currently supported you probably will first take over 
> working on the Audio driver …..and of course USB  :-)

  I think a huge benefit that Apple would have is that they might be
able to guarantee some sort of audio speaker, period, since they
control the hardware that the software runs on.  That might be a big ask
on FreeBSD, but maybe if there was some relatively ubiquitous assistance
hardware, maybe doable.  But text-to-speech (and then WHAT language's
speech) is a big software chunk, audio layers seems large, and then
having to worry about the potential driver issues (while not being able
to see-to-hear any potential setup issues) seems huge.

  Everybody seems happy farming that out to the internet, except on
system setup you're not connected to the internet yet.

  Plus Apple has some deep hooks into the app-stack since you're
basically using their toolkit to make a graphical app, so they can
guarantee some potential for GUI-textbox-speech, where FreeBSD has a
hodgepodge of graphical toolkits (KDE, GTK, Gnome, etc).

Re: Accessibility in the FreeBSD installer and console

2022-07-07 Thread John Kennedy

On Thu, Jul 07, 2022 at 10:11:52PM +0200, Klaus Küchemann wrote:
> > Am 07.07.2022 um 19:32 schrieb Hans Petter Selasky :
> > The only argument I've heard from some non-sighted friends about not using 
> > FreeBSD natively is that ooh, MacOSX is so cool. It starts speaking from 
> > the start if I press this and this key. Is anyone here working on or 
> > wanting such a feature?
> 
> Possibly they didn’t want to  be rude and your friends didn't tell you the 
> other argument  :-)  : according to the corresponding wiki page FreeBSD 
> doesn't natively support any audio output at all on your friends current M1 
> Mac hardware.
>  since quite nothing is currently supported you probably will first take over 
> working on the Audio driver …..and of course USB  :-)

  I think a huge benefit that Apple would have is that they might be
able to guarantee some sort of audio speaker, period, since they
control the hardware that the software runs on.  That might be a big ask
on FreeBSD, but maybe if there was some relatively ubiqitous

Re: Posting Netiquette [ref: Threads "look definitely like" unreadable mess. Handbook project.]

2022-07-01 Thread John-Mark Gurney

Greg 'groggy' Lehey wrote this message on Thu, Jun 23, 2022 at 16:33 +1000:
> Does anybody have an opinion on character set recommendations?  I
> think we should ask for UTF-8 if at all possible.

I don't think there's any need for a recommendation.  All [modern] MUA
should tag the post appropriately and each MUA be able to convert as
needed between them.

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 "All that I will do, has been done, All that I have, has not."


signature.asc
Description: PGP signature

Re: Profiled libraries on freebsd-current

2022-05-04 Thread John Baldwin


On 5/4/22 1:38 PM, Steve Kargl wrote:

On Wed, May 04, 2022 at 01:22:57PM -0700, John Baldwin wrote:

On 5/4/22 12:53 PM, Steve Kargl wrote:

On Wed, May 04, 2022 at 11:12:55AM -0700, John Baldwin wrote:

I don't know the entire FreeBSD ecosystem.  Do people
use FreeBSD on embedded systems (e.g., nanobsd) where
libthr may be stripped out?  Thus, --enable-threads=no
is needed.


If they do, they are also using a constrained userland and
probably are not shipping a GCC binary either.  However, it's
not clear to me what --enable-threads means.

Does this enable -pthread as an option?  If so, that should
definitely just always be on.  It's still an option users have
to opt into via a command line flag and doesn't prevent
building non-threaded programs.

If it's enabling use of threads at runtime within GCC itself,
I'd say that also should probably just be allowed to be on.

I can't really imagine what else it might mean (and I doubt
it means the latter).



AFAICT, it controls whether -lpthread is automatically added to
the command line.  In the case of -pg, it is -lpthread_p.
The relevant lines are

#ifdef FBSD_NO_THREADS
#define FBSD_LIB_SPEC "\
   %{pthread: %eThe -pthread option is only supported on FreeBSD when gcc \
is built with the --enable-threads configure-time option.}  \
   %{!shared:   \
 %{!pg: -lc}
\
 %{pg:  -lc_p}  \
   }"
#else
#define FBSD_LIB_SPEC "\
   %{!shared:   \
 %{!pg: %{pthread:-lpthread} -lc}   \
 %{pg:  %{pthread:-lpthread_p} -lc_p}   \
   }\
   %{shared:\
 %{pthread:-lpthread} -lc   \
   }"
#endif

Ed is wondering if one can get rid of FBSD_NO_THREADS. With the
pending removal of WITH_PROFILE, the above reduces to

#define FBSD_LIB_SPEC " \
   %{!shared:\
 %{pthread:-lpthread} -lc\
   } \
   %{shared: \
 %{pthread:-lpthread} -lc\
   }"

If one can do the above, then freebsd-nthr.h is no longer needed
and can be deleted and config.gcc's handling of --enable-threads
can be updated/removed.


Ok, so it's just if -pthread is supported (%{pthread:-lpthread} only
adds -lpthread if -pthread was given on the command line).  That can just
be on all the time and Ed is correct that it is safe to remove the
FBSD_NO_THREADS case and assume it is always present instead.

--
John Baldwin

Re: Profiled libraries on freebsd-current

2022-05-04 Thread John Baldwin


On 5/4/22 12:53 PM, Steve Kargl wrote:

On Wed, May 04, 2022 at 11:12:55AM -0700, John Baldwin wrote:

On 5/2/22 10:37 AM, Steve Kargl wrote:

On Mon, May 02, 2022 at 12:32:25PM -0400, Ed Maste wrote:

On Sun, 1 May 2022 at 11:54, Steve Kargl
 wrote:


diff --git a/gcc/config/freebsd-spec.h b/gcc/config/freebsd-spec.h
index 594487829b5..1e8ab2e1827 100644
--- a/gcc/config/freebsd-spec.h
+++ b/gcc/config/freebsd-spec.h
@@ -93,14 +93,22 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
  (similar to the default, except no -lg, and no -p).  */

   #ifdef FBSD_NO_THREADS


I wonder if we can simplify things now, and remove this
`FBSD_NO_THREADS` case. I didn't see anything similar in other GCC
targets I looked at.


That I don't know.  FBSD_NO_THREADS is defined in freebsd-nthr.h.
In fact, it's the only thing in that header (except copyright
broilerplate).  freebsd-nthr.h only appears in config.gcc and
seems to only get added to the build if someone runs configure
with --enable-threads=no.  Looking at my last config.log for
gcc trunk, I see "Thread model: posix", which appears to be
the default case or if someone does --enable-threads=yes or
--enable-threads=posix.  So, I suppose it comes down to
two questions: (1) is libpthread.* available on all supported
targets and versions? (2) does anyone build gcc without
threads support?


libpthread is available on all supported architectures on all
supported versions.  libthr has been the default threading library
since 7.0 and the only supported library since 8.0.  In GDB I just
assume libthr style threads, and I think GCC can safely do the
same.



I don't know the entire FreeBSD ecosystem.  Do people
use FreeBSD on embedded systems (e.g., nanobsd) where
libthr may be stripped out?  Thus, --enable-threads=no
is needed.


If they do, they are also using a constrained userland and
probably are not shipping a GCC binary either.  However, it's
not clear to me what --enable-threads means.

Does this enable -pthread as an option?  If so, that should
definitely just always be on.  It's still an option users have
to opt into via a command line flag and doesn't prevent
building non-threaded programs.

If it's enabling use of threads at runtime within GCC itself,
I'd say that also should probably just be allowed to be on.

I can't really imagine what else it might mean (and I doubt
it means the latter).

--
John Baldwin

Re: Profiled libraries on freebsd-current

2022-05-04 Thread John Baldwin


On 5/2/22 10:37 AM, Steve Kargl wrote:

On Mon, May 02, 2022 at 12:32:25PM -0400, Ed Maste wrote:

On Sun, 1 May 2022 at 11:54, Steve Kargl
 wrote:


diff --git a/gcc/config/freebsd-spec.h b/gcc/config/freebsd-spec.h
index 594487829b5..1e8ab2e1827 100644
--- a/gcc/config/freebsd-spec.h
+++ b/gcc/config/freebsd-spec.h
@@ -93,14 +93,22 @@ see the files COPYING3 and COPYING.RUNTIME respectively.  
If not, see
 (similar to the default, except no -lg, and no -p).  */

  #ifdef FBSD_NO_THREADS


I wonder if we can simplify things now, and remove this
`FBSD_NO_THREADS` case. I didn't see anything similar in other GCC
targets I looked at.


That I don't know.  FBSD_NO_THREADS is defined in freebsd-nthr.h.
In fact, it's the only thing in that header (except copyright
broilerplate).  freebsd-nthr.h only appears in config.gcc and
seems to only get added to the build if someone runs configure
with --enable-threads=no.  Looking at my last config.log for
gcc trunk, I see "Thread model: posix", which appears to be
the default case or if someone does --enable-threads=yes or
--enable-threads=posix.  So, I suppose it comes down to
two questions: (1) is libpthread.* available on all supported
targets and versions? (2) does anyone build gcc without
threads support?


libpthread is available on all supported architectures on all
supported versions.  libthr has been the default threading library
since 7.0 and the only supported library since 8.0.  In GDB I just
assume libthr style threads, and I think GCC can safely do the
same.

--
John Baldwin

Re: 'set but unused' breaks drm-*-kmod

2022-04-27 Thread John Baldwin


On 4/21/22 6:45 AM, Emmanuel Vadot wrote:

On Thu, 21 Apr 2022 08:51:26 -0400
Michael Butler  wrote:


On 4/21/22 03:42, Emmanuel Vadot wrote:


   Hello Michael,

On Wed, 20 Apr 2022 23:39:12 -0400
Michael Butler  wrote:


Seems this new requirement breaks kmod builds too ..

The first of many errors was (I stopped chasing them all for lack of
time) ..

--- amdgpu_cs.o ---
/usr/ports/graphics/drm-devel-kmod/work/drm-kmod-drm_v5.7.19_3/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c:1210:26:
error: variable 'priority' set but not used
[-Werror,-Wunused-but-set-variable]
   enum drm_sched_priority priority;
   ^
1 error generated.
*** [amdgpu_cs.o] Error code 1



   How are you building the port, directly or with PORTS_MODULES ?
   I do make passes on the warning for drm and I did for set-but-not-used
case but unfortunately this option doesn't exists in 13.0 so I couldn't
apply those in every branch.


I build this directly on -current. I'm guessing that these are what
triggered this behaviour:

commit 8b83d7e0ee54416b0ee58bd85f9c0ae7fb3357a1
Author: John Baldwin 
Date:   Mon Apr 18 16:06:27 2022 -0700

  Make -Wunused-but-set-variable a fatal error for clang 13+ for
kernel builds.

  Reviewed by:imp, emaste
  Differential Revision:  https://reviews.freebsd.org/D34949

commit 615d289ffefe2b175f80caa9b1e113c975576472
Author: John Baldwin 
Date:   Mon Apr 18 16:06:14 2022 -0700

  Re-enable set but not used warnings for kernel builds.

  make tinderbox now passes with this warning enabled as a fatal error,
  so revert the change to hide it in preparation for making it fatal.

  This reverts commit e8e691983bb75e80153b802f47733f1531615fa2.

  Reviewed by:imp, emaste
  Differential Revision:  https://reviews.freebsd.org/D34948




  Ok I see,

  I won't have time until monday (maybe tuesday to fix this) but if
someone wants to beat me to it we should add some new CWARNFLAGS for
each problematic files in the 5.4-lts and 5.7-table branches of
drm-kmod (master which is following 5.10 is already good) only
if $ {COMPILER_VERSION} >= 13.


There is already a helper you can use that deals with compiler versions:

CWARNFLAGS+= ${NO_WUNUSED_BUT_SET_VARIABLE}

or some such.

--
John Baldwin

Can't build with INVARIANTS but not WITNESS

2022-04-27 Thread John F Carr

My -CURRENT kernel has INVARIANTS (inherited from GENERIC) but not WITNESS:

include GENERIC
ident   STRIATUS
nooptions   WITNESS
nooptions   WITNESS_SKIPSPIN

My kernel build fails:

/usr/home/jfc/freebsd/src/sys/kern/vfs_lookup.c:102:13: error: variable 'line' 
set but not used [-Werror,-Wunused-but-set-variable]
int flags, line __diagused;
   ^
/usr/home/jfc/freebsd/src/sys/kern/vfs_lookup.c:101:14: error: variable 'file' 
set but not used [-Werror,-Wunused-but-set-variable]
const char *file __diagused;

The problem is, __diagused expands to nothing if INVARIANTS _or_ WITNESS is 
defined, but the variable in vfs_lookup.c is only used if WITNESS is defined.

#if defined(INVARIANTS) || defined(WITNESS)
#define __diagused
#else
#define __diagused  __unused
#endif

I think this code is trying to be too clever and causing more trouble than it 
prevents.  Change the || to &&, or replace __diagused with __unused everywhere.

Re: ktrace on NFSroot failing?

2022-03-25 Thread John Baldwin


On 3/10/22 8:14 AM, Mateusz Guzik wrote:

On 3/10/22, Bjoern A. Zeeb  wrote:

Hi,

I am having a weird issue with ktrace on an nfsroot machine:

root:/tmp # ktrace sleep 1
root:/tmp # kdump
-559038242  Events dropped.
kdump: bogus length 0xdeadc0de

Anyone seen something like this before?



I just did a quick check and it definitely fails on nfs mounts:
# ktrace pwd
/root/mjg
# kdump
-559038242  Events dropped.
kdump: bogus length 0xdeadc0de

I don't have time to look into it this week though.


Possibly related: core dumps are no longer working for me on NFS
mounts.  I get a 0 byte foo.core instead of a valid core dump.

--
John Baldwin

Re: Buildworld fails with external GCC toolchain

2022-02-14 Thread John Baldwin

On 2/12/22 11:34 AM, Yasuhiro Kimura wrote:

From: Dimitry Andric 
Subject: Re: Buildworld fails with external GCC toolchain
Date: Fri, 11 Feb 2022 22:53:44 +0100

Not really, the gcc 9 build has been broken for months, as far as I know.

See also: https://ci.freebsd.org/job/FreeBSD-main-amd64-gcc9_build/

The last build(s) show a different error from yours, though:

/workspace/src/tests/sys/netinet/libalias/util.c: In function 'set_udp':
/workspace/src/tests/sys/netinet/libalias/util.c:112:2: error: converting a 
packed 'struct ip' pointer (alignment 2) to a 'uint32_t' {aka 'unsigned int'} 
pointer (alignment 4) may result in an unaligned pointer value 
[-Werror=address-of-packed-member]
   112 |  uint32_t *up = (void *)p;
   |  ^~~~
In file included from /workspace/src/tests/sys/netinet/libalias/util.h:37,
  from /workspace/src/tests/sys/netinet/libalias/util.c:39:
/workspace/src/sys/netinet/ip.h:51:8: note: defined here
51 | struct ip {
   |^~

-Dimitry

Thanks for information. I went back the commit history of main branch
about every month and check if buildworld succeeds with GCC. But it
didn't succeed even if I went back about a year. And devel/binutils
port was update to 2.37 on last August. So I suspect external GCC
toolchain doesn't work well after binutils is updated to current
version.

I have amd64 world + kernel building with GCC 9 and the only remaining
open review not merged yet is https://reviews.freebsd.org/D34147.

It is work to keep it working though and I hadn't worked on it again
until recently.

--
John Baldwin

Re: Dragonfly Mail Agent (dma) in the base system

2022-01-27 Thread John Baldwin


On 1/27/22 1:34 PM, Ed Maste wrote:

The Dragonfly Mail Agent (dma) is a small Mail Transport Agent (MTA)
which accepts mail from a local Mail User Agent (MUA) and delivers it
locally or to a smarthost for delivery. dma does not accept inbound
mail (i.e., it does not listen on port 25) and is not intended to
provide the same functionality as a full MTA like postfix or sendmail.
It is intended for use cases such as delivering cron(8) mail.

Since 2014 we have a copy of dma in the base system available as an
optional component, enabled via the WITH_DMAGENT src.conf knob.

I am interested in determining whether dma is a viable minimal base
system MTA, and if not what gaps remain. If you have enabled DMA on
your systems (or are willing to give it a try) and have any feedback
or are aware of issues please follow up or submit a PR as appropriate.


I've used DMA on systems without local mail accounts to forward cron
periodic e-mails just fine.  It even supports STARTTLS and SMTP AUTH.
I haven't tried using it for simple local delivery to /var/mail/root.

--
John Baldwin

Re: git: 5e6a2d6eb220 - main - Reapply: move libc++ from /usr/lib to /lib [add /usr/lib/libc++.so.1 -> ../../lib/libc++.so.1 ?]

2022-01-03 Thread John Baldwin


On 1/1/22 9:00 AM, Ed Maste wrote:

On Fri, 31 Dec 2021 at 18:04, John Baldwin  wrote:


However, your point about libcxxrt.so.1 is valid.  It needs to also be moved
to /lib if libc++.so.1 is moved to /lib.


libcxxrt.so.1 has always been in /lib.


Oh, I was thrown off by the .so indirection for libcxxrt in the linker script.

--
John Baldwin

Re: git: 5e6a2d6eb220 - main - Reapply: move libc++ from /usr/lib to /lib [add /usr/lib/libc++.so.1 -> ../../lib/libc++.so.1 ?]

2021-12-31 Thread John Baldwin

On 12/31/21 2:59 PM, Mark Millard wrote:

On 2021-Dec-31, at 14:28, Mark Millard wrote:

On 2021-Dec-30, at 14:04, John Baldwin wrote:

On 12/30/21 1:09 PM, Mark Millard wrote:

On 2021-Dec-30, at 13:05, Mark Millard wrote:

This asks a question in a different direction that my prior
reports about my builds vs. Cy's reported build.

Background:

/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libc++.so:GROUP
( /lib/libc++.so.1 /usr/lib/libcxxrt.so
and:
lrwxr-xr-x 1 root wheel23 Dec 29 13:17:01 2021 /usr/lib/libcxxrt.so
-> ../../lib/libcxxrt.so.1

Why did libc++.so.1 not get a:

/usr/lib/libc++.so.1 -> ../../lib/libc++.so.1

I forgot to remove the .1 on the left hand side:
/usr/lib/libc++.so -> ../../lib/libc++.so.1

Because for libc++.so we don't just symlink to the current version of the
library
(as we do for most other shared libraries) to tell the compiler what to link
against
for -lc++, instead we use a linker script that tells the compiler to link
against
both of those libraries when -lc++ is encountered.

A better identification of what looks odd to me is the
path variations in:

# more /usr/lib/libc++.so

Another not great day on my part: That path alone makes
the mix of /lib/ and /usr/lib/ use involved, given the
reference to /lib/libc++.so.1 . That would still be true
if the other path had been /lib/libcxxrt.so .

/usr/lib/libc++.so is only used by the compiler/linker when linking a binary.
The resulting binary has the associated paths (/lib/libc++.so.1 and
/usr/lib/libcxxrt.so.1) in its DT_NEEDED. So it is fine for the .so to be
in /usr/lib. This is the same with /usr/lib/libc.so vs /lib/libc.so.7.

However, your point about libcxxrt.so.1 is valid. It needs to also be moved
to /lib if libc++.so.1 is moved to /lib. Doing so will also require yet another
depend-clean.sh fixup (well, probably just adjusting the one I added to
check the libcxxrt path instead of libc++ path).

--
John Baldwin

Re: git: 5e6a2d6eb220 - main - Reapply: move libc++ from /usr/lib to /lib [add /usr/lib/libc++.so.1 -> ../../lib/libc++.so.1 ?]

2021-12-30 Thread John Baldwin


On 12/30/21 1:09 PM, Mark Millard wrote:



On 2021-Dec-30, at 13:05, Mark Millard  wrote:


This asks a question in a different direction that my prior
reports about my builds vs. Cy's reported build.

Background:

/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/tmp/usr/lib/libc++.so:GROUP
 ( /lib/libc++.so.1 /usr/lib/libcxxrt.so
and:
lrwxr-xr-x  1 root  wheel23 Dec 29 13:17:01 2021 /usr/lib/libcxxrt.so 
-> ../../lib/libcxxrt.so.1

Why did libc++.so.1 not get a:

/usr/lib/libc++.so.1 -> ../../lib/libc++.so.1


I forgot to remove the .1 on the left hand side:

/usr/lib/libc++.so -> ../../lib/libc++.so.1


Because for libc++.so we don't just symlink to the current version of the 
library
(as we do for most other shared libraries) to tell the compiler what to link 
against
for -lc++, instead we use a linker script that tells the compiler to link 
against
both of those libraries when -lc++ is encountered.

I have finally reproduced Cy's build error locally and am testing my fix.  If it
works I'll commit it.

--
John Baldwin

Re: smr inp breaks some jail use cases and panics with i915kms don't switch to the console anymore


On 12/14/21 9:40 AM, Gleb Smirnoff wrote:

On Tue, Dec 14, 2021 at 09:28:07AM -0800, John Baldwin wrote:
J> > AFAIK, today it will always panic only with WITNESS. Without WITNESS it 
would
J> > pass through mtx_lock as long as the mutex is not locked.
J>
J> Yes, but the default kernel on head is GENERIC which has witness enabled, 
hence
J> the out of the box kernel panics reliably. :)
J>
J> > So, do you suggest to push D33340 before finalizing D9?
J>
J> Yes, I think so.

Pushed. And I plan to post new version of D9 today.


Thanks!

--
John Baldwin

Re: smr inp breaks some jail use cases and panics with i915kms don't switch to the console anymore


On 12/13/21 12:25 PM, Gleb Smirnoff wrote:

On Mon, Dec 13, 2021 at 11:56:35AM -0800, John Baldwin wrote:
J> > J> So there are two things here.  The root issue is that the devel/apr1 
port
J> > J> runs a configure test for TCP_NDELAY being inherited by accepted 
sockets.
J> > J> This test panics because prison_check_ip4() tries to lock a prison mutex
J> > J> to walk the IPs assigned to a jail, but the caller 
(in_pcblookup_hash()) has
J> > J> done an smr_enter() which is a critical_enter():
J> >
J> > The first one is known, and I got a patch to fix it:
J> >
J> > https://reviews.freebsd.org/D33340
J> >
J> > However, a pre-requisite to this simple patch is more complex:
J> >
J> > https://reviews.freebsd.org/D9
J> >
J> > There is some discussion on how to improve that, and I decided to do that
J> > rather than stick to original version. So I takes a few extra days.
J> >
J> > We could push D33340 into main, if the negative effects (raciness of
J> > the prison check) is considered lesser evil then potentially contested
J> > mtx_lock in smr section.
J>
J> I think raciness is probably better than always panicking as it does today.

AFAIK, today it will always panic only with WITNESS. Without WITNESS it would
pass through mtx_lock as long as the mutex is not locked.


Yes, but the default kernel on head is GENERIC which has witness enabled, hence
the out of the box kernel panics reliably. :)


So, do you suggest to push D33340 before finalizing D9?


Yes, I think so.

--
John Baldwin

Re: smr inp breaks some jail use cases and panics with i915kms don't switch to the console anymore


On 12/14/21 2:14 AM, Alexey Dokuchaev wrote:

How do you mean?  Most FreeBSD people, not some random Twitter crowd, want
the bell to be on by default, but it's still off.


I don't know that that's true, and I myself am not sure that I want it back
on by default.  Previously my laptop had a rather annoying beep whose volume
I couldn't control that I actually prefer to have off normally.  On further
reflection, the beep I was looking for for bad input may actually be an
xscreensaver thing for an invalid character to unlock the screen vs a sysbeep
anyway.

--
John Baldwin

Re: RFC: What to do about Allocate in the NFS server for FreeBSD13?