Hi Paul,

On 01/03/2024 19:37, Paul Leiber wrote:
Stopping xen-watchdog prevents the reboot. However, when triggering traffic on the VLAN, Dom0 and DomU become completely unresponsive. No error or kernel message is printed in the serial console.

Thanks for providing some logs. See some comments below. How long did you wait before concluding that dom0 was stuck?

IIRC, Linux may print some RCU stall logs after a few minutes.


Switching to Xen console works. Pressing '0' produces the following output:

(XEN) '0' pressed -> dumping Dom0's registers
(XEN) *** Dumping Dom0 vcpu#0 state: ***
(XEN) ----[ Xen-4.19-unstable  arm64  debug=y  Tainted:   C    ]----
(XEN) CPU:    0
(XEN) PC:     ffff800008027e50
(XEN) LR:     ffff800008027e44
(XEN) SP_EL0: ffff800009c78f80
(XEN) SP_EL1: ffff800008003b60
(XEN) CPSR:   00000000000003c5 MODE:64-bit EL1h (Guest Kernel, handler)

[...]

(XEN) *** Dumping Dom0 vcpu#1 state: ***
(XEN) ----[ Xen-4.19-unstable  arm64  debug=y  Tainted:   C    ]----
(XEN) CPU:    0
(XEN) PC:     ffff800008c5dc80
(XEN) LR:     ffff800008c5dc88
(XEN) SP_EL0: ffff000042272080
(XEN) SP_EL1: ffff80000800b0e0
(XEN) CPSR:   0000000080000305 MODE:64-bit EL1h (Guest Kernel, handler)

[...]

(XEN) *** Dumping Dom0 vcpu#2 state: ***
(XEN) ----[ Xen-4.19-unstable  arm64  debug=y  Tainted:   C    ]----
(XEN) CPU:    0
(XEN) PC:     ffff800008027e50
(XEN) LR:     ffff800008027e44
(XEN) SP_EL0: ffff000042271040
(XEN) SP_EL1: ffff800009fcbf20
(XEN) CPSR:   00000000000003c5 MODE:64-bit EL1h (Guest Kernel, handler)

[...]

(XEN) *** Dumping Dom0 vcpu#3 state: ***
(XEN) ----[ Xen-4.19-unstable  arm64  debug=y  Tainted:   C    ]----
(XEN) CPU:    0
(XEN) PC:     ffff800008027e50
(XEN) LR:     ffff800008027e44
(XEN) SP_EL0: ffff0000422730c0
(XEN) SP_EL1: ffff800009fd3f20
(XEN) CPSR:   00000000000003c5 MODE:64-bit EL1h (Guest Kernel, handler)

All the PCs but one (vcpu#1) are the same.
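
As a side note, the CPSR values can be decoded as well. Below is a rough sketch, not Xen code, assuming the standard AArch64 SPSR layout (mode in bits [3:0], the F/I/A/D mask bits in bits 6 to 9). The helper name and the mode table are only for illustration.

```python
# Sketch: decode the mode and the DAIF mask bits from the dumped CPSR values.
# Assumed layout (AArch64 SPSR_ELx): M[3:0] = bits 3:0, F = bit 6,
# I = bit 7, A = bit 8, D = bit 9.
MODES = {0x0: "EL0t", 0x4: "EL1t", 0x5: "EL1h"}  # subset, enough for this dump


def decode_cpsr(value):
    """Return the exception level/stack mode and which of D, A, I, F are masked."""
    masked = "".join(
        name
        for name, bit in (("D", 9), ("A", 8), ("I", 7), ("F", 6))
        if (value >> bit) & 1
    )
    mode = MODES.get(value & 0xF, hex(value & 0xF))
    return {"mode": mode, "masked": masked}


# CPSR values copied from the '0' dump above.
for vcpu, cpsr in ((0, 0x3C5), (1, 0x80000305), (2, 0x3C5), (3, 0x3C5)):
    d = decode_cpsr(cpsr)
    print(f"vcpu#{vcpu}: mode={d['mode']} masked={d['masked']}")
```

If the layout assumption holds, vCPU #0, #2 and #3 have D, A, I and F all masked, while vCPU #1 only has D and A masked, i.e. it is the only one that can still take IRQs.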

(XEN) 'q' pressed -> dumping domain info (now = 727929105981)
(XEN) General information for domain 0:
(XEN)     refcnt=3 dying=0 pause_count=0
(XEN)     nr_pages=262144 xenheap_pages=2 dirty_cpus={} max_pages=262144
(XEN)     handle=00000000-0000-0000-0000-000000000000 vm_assist=00000020
(XEN) p2m mappings for domain 0 (vmid 1):
(XEN)   1G mappings: 0 (shattered 1)
(XEN)   2M mappings: 422 (shattered 90)
(XEN)   4K mappings: 45372
(XEN) Rangesets belonging to domain 0:
(XEN)     Interrupts { 32-152, 154-255 }
(XEN)     I/O Memory { 0-fe200, fe203-ff841, ff849-ffffffffffffffff }
(XEN) NODE affinity for domain 0: [0]
(XEN) VCPU information and callbacks for domain 0:
(XEN)   UNIT0 affinities: hard={0-3} soft={0-3}
(XEN)     VCPU0: CPU3 [has=F] poll=0 upcall_pend=01 upcall_mask=01
(XEN)     pause_count=0 pause_flags=1

The vCPU is blocked. But...

(XEN) GICH_LRs (vcpu 0) mask=f
(XEN)    VCPU_LR[0]=2a000002
(XEN)    VCPU_LR[1]=1a00001b
(XEN)    VCPU_LR[2]=1a000001
(XEN)    VCPU_LR[3]=1a000010

... it looks like multiple IRQs are in flight. LR0 (holding IRQ2) is active but the others are pending. This is the same for vCPU #2 and #3. vCPU #1 still seems to "work".

AFAICT, Linux is using IRQ2 for the IPI CPU_STOP. So it sounds like dom0 may have panicked.
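
For reference, this is how I read the LR values. It is a minimal sketch, not Xen code, and it assumes the GICv2 list register layout (virtual ID in bits [9:0], priority in bits [27:23], state in bits [29:28], HW in bit 31). The function name is made up for the example.

```python
# Sketch: decode GICv2 list register (GICH_LR) values as dumped by Xen.
# Assumed field layout, per the GICv2 architecture specification:
#   [9:0] virtual ID, [27:23] priority, [29:28] state, [31] HW.
STATES = {0: "invalid", 1: "pending", 2: "active", 3: "pending+active"}


def decode_lr(value):
    """Split a 32-bit list register value into the fields of interest."""
    return {
        "virq": value & 0x3FF,
        "priority": (value >> 23) & 0x1F,
        "state": STATES[(value >> 28) & 0x3],
        "hw": bool((value >> 31) & 1),
    }


# Values copied from the vcpu 0 dump above.
lrs = [0x2A000002, 0x1A00001B, 0x1A000001, 0x1A000010]
for i, lr in enumerate(lrs):
    d = decode_lr(lr)
    print(f"LR{i}: virq={d['virq']} state={d['state']}")
```

With that layout, LR0 decodes to virtual IRQ 2 in the active state, and LR1 to LR3 decode to virtual IRQs 27, 1 and 16, all pending.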

Looking at the initial logs you posted, I see some messages from Xen but none at all from dom0 (not even the boot messages). Can you check if you have console=hvc0 on the Linux command line?

If not, please add it and retry.
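
On a running dom0, /proc/cmdline shows the command line the kernel was booted with. A small sketch of the check, with made-up helper names; it only reads /proc/cmdline if that file exists:

```python
# Sketch: check whether a Linux command line selects hvc0 as a console.
import os


def console_args(cmdline):
    """Return the values of all console= parameters, in order."""
    return [
        tok.split("=", 1)[1]
        for tok in cmdline.split()
        if tok.startswith("console=")
    ]


def has_hvc0(cmdline):
    """True if any console= parameter names hvc0 (options after ',' ignored)."""
    return any(arg.split(",")[0] == "hvc0" for arg in console_args(cmdline))


if os.path.exists("/proc/cmdline"):
    with open("/proc/cmdline") as f:
        current = f.read()
    print("console=hvc0 present:", has_hvc0(current))
```

Note that when several console= parameters are given, Linux uses the last one as /dev/console, so the order can matter as well.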

Cheers,

--
Julien Grall
