[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-03-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

John Baldwin  changed:

   What|Removed |Added

   Assignee|b...@freebsd.org|j...@freebsd.org

-- 
You are receiving this mail because:
You are the assignee for the bug.


[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-03-04 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

--- Comment #8 from John Baldwin  ---
Created attachment 248936
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=248936&action=edit
fix.patch

Please try this patch.  Looking at the dmesg, the address was translated
incorrectly.  It matches this range which requires no translation:

pcib0: PCI addr: 0x101, CPU addr: 0x101, Size: 0x7f, Type: memory

(PCI addr == CPU addr), but it was matching on the wrong range and translating
the address as if it belonged to the first range:

pcib0: PCI addr: 0x0, CPU addr: 0x1001000, Size: 0x1, Type: I/O port

The code I changed in commit d79b6b8ec267 expected the end to be >= start, and
the end value of 0 in the old code violated this assumption.
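
To make the failure mode concrete, here is a minimal sketch of a range lookup that goes wrong when it is handed an end of 0.  It is not the code from commit d79b6b8ec267: struct range, find_range() and translate() are hypothetical names, resource types are ignored, and the addresses are copied as printed in the two dmesg lines above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of one host bridge range. */
struct range {
	uint64_t pci_base;	/* address as seen on the PCI bus */
	uint64_t phys_base;	/* address as seen by the CPU */
	uint64_t size;
};

/* Values as printed in the dmesg excerpt; resource type is ignored here. */
static const struct range ranges[] = {
	{ 0x0,   0x1001000, 0x1  },	/* I/O port */
	{ 0x101, 0x101,     0x7f },	/* memory, PCI addr == CPU addr */
};

#define	NRANGES	(sizeof(ranges) / sizeof(ranges[0]))

/*
 * Return the index of the first range containing [start, end], or -1.
 * The comparison silently assumes end >= start.
 */
int
find_range(uint64_t start, uint64_t end)
{
	size_t i;

	for (i = 0; i < NRANGES; i++) {
		if (start >= ranges[i].pci_base &&
		    end <= ranges[i].pci_base + ranges[i].size - 1)
			return ((int)i);
	}
	return (-1);
}

/* Translate a PCI bus address to a CPU address using range idx. */
uint64_t
translate(int idx, uint64_t addr)
{
	assert(idx >= 0 && (size_t)idx < NRANGES);
	return (addr - ranges[idx].pci_base + ranges[idx].phys_base);
}
```

With these values, find_range(0x101, 0) returns the first (I/O port) range because an end of 0 passes the upper-bound check, and translate() then adds that range's offset.  Passing an end that is >= start, for example find_range(0x101, 0x101), selects the second range, where the address is left unchanged.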



[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-02-28 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

--- Comment #7 from Dave Cottlehuber  ---
Created attachment 248817
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=248817&action=edit
dmesg after D44132

dmesg & panic after rebasing and applying D44132



[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-02-27 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

--- Comment #6 from Dave Cottlehuber  ---
Created attachment 248803
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=248803&action=edit
verbose dmesg with range_descr debugging

thanks, attached dmesg after patch.



[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-02-22 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

--- Comment #5 from John Baldwin  ---
Ah, looks like the dmesg from Dave does actually include this patch as it has
this line of output:

mlx5_core0: translate 0x1408200 -> 0x2408200

That looks correct, but unfortunately, we only display the ranges in
bootverbose for FDT, not ACPI.  The patch below changes the pcib driver to
always log the ranges when booting verbose, which would be useful to confirm
the window:

diff --git a/sys/dev/pci/pci_host_generic.c b/sys/dev/pci/pci_host_generic.c
index 386b8411d29a..46b84ff3004b 100644
--- a/sys/dev/pci/pci_host_generic.c
+++ b/sys/dev/pci/pci_host_generic.c
@@ -83,6 +83,7 @@ pci_host_generic_core_attach(device_t dev)
uint64_t phys_base;
uint64_t pci_base;
uint64_t size;
+   const char *range_descr;
char buf[64];
int domain, error;
int flags, rid, tuple, type;
@@ -179,6 +180,7 @@ pci_host_generic_core_attach(device_t dev)
switch (FLAG_TYPE(sc->ranges[tuple].flags)) {
case FLAG_TYPE_PMEM:
sc->has_pmem = true;
+   range_descr = "prefetch";
flags = RF_PREFETCHABLE;
type = SYS_RES_MEMORY;
error = rman_manage_region(&sc->pmem_rman,
@@ -186,12 +188,14 @@ pci_host_generic_core_attach(device_t dev)
break;
case FLAG_TYPE_MEM:
flags = 0;
+   range_descr = "memory";
type = SYS_RES_MEMORY;
error = rman_manage_region(&sc->mem_rman,
   pci_base, pci_base + size - 1);
break;
case FLAG_TYPE_IO:
flags = 0;
+   range_descr = "I/O port";
type = SYS_RES_IOPORT;
error = rman_manage_region(&sc->io_rman,
   pci_base, pci_base + size - 1);
@@ -219,6 +223,10 @@ pci_host_generic_core_attach(device_t dev)
error = ENXIO;
goto err_rman_manage;
}
+   if (bootverbose)
+   device_printf(dev,
+   "PCI addr: 0x%jx, CPU addr: 0x%jx, Size: 0x%jx, Type: %s\n",
+   pci_base, phys_base, size, range_descr);
}

return (0);

That said, it seems like the translation is correct given the prefetch window
used for the pcib1 bridge between pcib0 and the mlx5 device:

pcib1:  at device 0.0 on pci0
pcib1:   domain0
pcib1:   secondary bus 1
pcib1:   subordinate bus   1
pcib1:   memory decode 0x3000-0x301f
pcib1:   prefetched decode 0x1408000-0x14083ff

And this allocation of mlx5's BAR:

map[10]: type Prefetchable Memory, range 64, base 0x1408200, size 25, enabled
pcib1: allocated prefetch range (0x1408200-0x14083ff) for rid 10 of pci0:1:0:0


It is odd for a register BAR to be in a prefetch BAR.  It might be good to see
a verbose dmesg from before to see how the bridge and mlx5 BAR were
configured.
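
The consistency check described above can be sketched in a few lines of C.  This is only an illustration: window_contains() and translation_offset() are hypothetical helpers, not functions from the pcib or linuxkpi code, and the addresses are the ones printed in the dmesg excerpts in this comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if [start, end] lies inside [win_start, win_end]. */
bool
window_contains(uint64_t win_start, uint64_t win_end,
    uint64_t start, uint64_t end)
{
	assert(win_start <= win_end && start <= end);
	return (start >= win_start && end <= win_end);
}

/* Hypothetical helper: offset implied by a "translate from -> to" log line. */
uint64_t
translation_offset(uint64_t from, uint64_t to)
{
	return (to - from);
}
```

For the values shown, window_contains(0x1408000, 0x14083ff, 0x1408200, 0x14083ff) holds, so the BAR sits inside pcib1's prefetched decode window and not inside the 0x3000-0x301f memory window, and translation_offset(0x1408200, 0x2408200) gives the offset applied to the BAR.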



[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-02-22 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

John Baldwin  changed:

   What|Removed |Added

 Status|New |In Progress

--- Comment #4 from John Baldwin  ---
From what I could tell when I looked at this before for Dave, this appears to
be an issue specific to linuxkpi used by the mlx5 driver.  In particular, it
uses bus_translate_resource in its wrapper for pci_resource_start and I had
asked Dave to boot with an additional patch to try to trace what is going on
there to see if it is getting the wrong answer.  The mlx5 driver assumes that
pci_resource_start gives a valid physical address it can pass to ioremap to
create a memory mapping of the BAR.

The patch with the debugging trace is here: https://reviews.freebsd.org/P632



[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-02-22 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

--- Comment #3 from Michael Tuexen  ---
A version that is running fine is:
FreeBSD ampere32.nplab.de 15.0-CURRENT FreeBSD 15.0-CURRENT #76
main-n268036-d682c1eaa598-dirty: Sat Feb  3 23:32:31 CET 2024
r...@ampere32.nplab.de:/usr/obj/usr/home/tuexen/freebsd-src/arm64.aarch64/sys/TCP
arm64
Not sure which change introduced the problem. It has been there for about a week or so.



[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-02-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

Michael Tuexen  changed:

   What|Removed |Added

 CC||tue...@freebsd.org

--- Comment #2 from Michael Tuexen  ---
I also see it on my 32 core Ampere system:

CPU  0: APM eMAG 8180 r3p2 affinity:  0  0
   Cache Type = <64 byte D-cacheline,64 byte I-cacheline,PIPT ICache,64 byte ERG,64 byte CWG>
 Instruction Set Attributes 0 = 
 Instruction Set Attributes 1 = <>
 Instruction Set Attributes 2 = <>
Trying to mount root from ufs:/dev/ada0p2 [rw]...
 Processor Features 0 = 
 Processor Features 1 = <>
  Memory Model Features 0 = 
  Memory Model Features 1 = <8bit VMID>
  Memory Model Features 2 = <32bit CCIDX,48bit VA>
 Debug Features 0 = 
 Debug Features 1 = <>
 Auxiliary Features 0 = <>
 Auxiliary Features 1 = <>
AArch32 Instruction Set Attributes 5 = 
AArch32 Media and VFP Features 0 = 
AArch32 Media and VFP Features 1 = 
CPU  1: APM eMAG 8180 r3p2 affinity:  0  1
CPU  2: APM eMAG 8180 r3p2 affinity:  1  0
CPU  3: APM eMAG 8180 r3p2 affinity:  1  1
CPU  4: APM eMAG 8180 r3p2 affinity:  2  0
CPU  5: APM eMAG 8180 r3p2 affinity:  2  1
CPU  6: APM eMAG 8180 r3p2 affinity:  3  0
CPU  7: APM eMAG 8180 r3p2 affinity:  3  1
CPU  8: APM eMAG 8180 r3p2 affinity:  4  0
CPU  9: APM eMAG 8180 r3p2 affinity:  4  1
CPU 10: APM eMAG 8180 r3p2 affinity:  5  0
CPU 11: APM eMAG 8180 r3p2 affinity:  5  1
CPU 12: APM eMAG 8180 r3p2 affinity:  6  0
CPU 13: APM eMAG 8180 r3p2 affinity:  6  1
CPU 14: APM eMAG 8180 r3p2 affinity:  7  0
CPU 15: APM eMAG 8180 r3p2 affinity:  7  1
CPU 16: APM eMAG 8180 r3p2 affinity:  8  0
CPU 17: APM eMAG 8180 r3p2 affinity:  8  1
CPU 18: APM eMAG 8180 r3p2 affinity:  9  0
CPU 19: APM eMAG 8180 r3p2 affinity:  9  1
CPU 20: APM eMAG 8180 r3p2 affinity: 10  0
CPU 21: APM eMAG 8180 r3p2 affinity: 10  1
CPU 22: APM eMAG 8180 r3p2 affinity: 11  0
CPU 23: APM eMAG 8180 r3p2 affinity: 11  1
CPU 24: APM eMAG 8180 r3p2 affinity: 12  0
CPU 25: APM eMAG 8180 r3p2 affinity: 12  1
CPU 26: APM eMAG 8180 r3p2 affinity: 13  0
CPU 27: APM eMAG 8180 r3p2 affinity: 13  1
CPU 28: APM eMAG 8180 r3p2 affinity: 14  0
CPU 29: APM eMAG 8180 r3p2 affinity: 14  1
CPU 30: APM eMAG 8180 r3p2 affinity: 15  0
CPU 31: APM eMAG 8180 r3p2 affinity: 15  1



[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-02-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

Mark Linimon  changed:

   What|Removed |Added

   Keywords||crash
 CC||j...@freebsd.org

--- Comment #1 from Mark Linimon  ---
^Triage: notify jhb of possible problem with his change.



[Bug 277211] panic: Unhandled external data abort - handle_el1h_sync - --- exception, esr 0x96000410 - wait_fw_init - mlx5_load_one

2024-02-21 Thread bugzilla-noreply
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=277211

Bug ID: 277211
   Summary: panic: Unhandled external data abort -
handle_el1h_sync - --- exception, esr 0x96000410 -
wait_fw_init - mlx5_load_one
   Product: Base System
   Version: 15.0-CURRENT
  Hardware: arm64
OS: Any
Status: New
  Severity: Affects Only Me
  Priority: ---
 Component: kern
  Assignee: b...@freebsd.org
  Reporter: d...@freebsd.org

Created attachment 248657
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=248657&action=edit
dmesg + panic as of 58df49801d9d

panic from 58df49801d9d & jhb's "acpi: Defer reserving resources for ACPI
devices" patch.
