On 07.07.2022 17:36, G.R. wrote:
> On Thu, Jul 7, 2022 at 11:24 PM G.R. <firemet...@users.sourceforge.net> wrote:
>>
>> On Wed, Jul 6, 2022 at 2:33 PM Jan Beulich <jbeul...@suse.com> wrote:
>>>
>>>> Should I expect a debug build of the Xen hypervisor to give better
>>>> diagnostic messages, without the debug patch that Roger mentioned?
>>>
>>> Well, "expect" is perhaps too much to say, but with problems like
>>> yours (and even more so with multiple ones) using a debug
>>> hypervisor (or kernel, if such a build mode existed) is imo
>>> always a good idea. As is using as up-to-date a version as
>>> possible.
>>
>> I built both 4.14.3 debug version and 4.16.1 release version for
>> testing purposes.
>> Unfortunately they gave me absolutely zero information, since neither
>> of them is able to get past issue #1,
>> the FLR-related DPC / AER issue.
>> With 4.16.1 release, it actually can survive the 'xl
>> pci-assignable-add' which triggers the first AER failure.
>> But the 'xl pci-assignable-remove' will lead to xl segmentation fault...
>>> [  655.041442] xl[975]: segfault at 0 ip 00007f2cccdaf71f sp 
>>> 00007ffd73a3d4d0 error 4 in libxenlight.so.4.16.0[7f2cccd92000+7c000]
>>> [  655.041460] Code: 61 06 00 eb 13 66 0f 1f 44 00 00 83 c3 01 39 5c 24 2c 
>>> 0f 86 1b 01 00 00 48 8b 34 24 89 d8 4d 89 f9 4d 89 f0 4c 89 e9 4c 89 e2 
>>> <48> 8b 3c c6 31 c0 48 89 ee e8 53 44 fe ff 83 f8 04 75 ce 48 8b 44
>> Since I'll need a couple of pci-assignable-add &&
>> pci-assignable-remove cycles to get to a seemingly normal state, I
>> cannot proceed from here.
>>
>> With 4.14.3 debug build, the hypervisor / dom0 reboots on 'xl
>> pci-assignable-add'.
>>
>> [  574.623143] pciback 0000:05:00.0: xen_pciback: resetting (FLR, D3,
>> etc) the device
>> [  574.623203] pcieport 0000:00:1d.0: DPC: containment event,
>> status:0x1f11 source:0x0000
>> [  574.623204] pcieport 0000:00:1d.0: DPC: unmasked uncorrectable error 
>> detected
>> [  574.623209] pcieport 0000:00:1d.0: PCIe Bus Error:
>> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
>> ID)
>> [  574.623240] pcieport 0000:00:1d.0:   device [8086:a330] error
>> status/mask=00200000/00010000
>> [  574.623261] pcieport 0000:00:1d.0:    [21] ACSViol                (First)
>> [  575.855026] pciback 0000:05:00.0: not ready 1023ms after FLR; waiting
>> [  576.895015] pciback 0000:05:00.0: not ready 2047ms after FLR; waiting
>> [  579.028311] pciback 0000:05:00.0: not ready 4095ms after FLR; waiting
>> [  583.294910] pciback 0000:05:00.0: not ready 8191ms after FLR; waiting
>> [  591.614965] pciback 0000:05:00.0: not ready 16383ms after FLR; waiting
>> [  609.534502] pciback 0000:05:00.0: not ready 32767ms after FLR; waiting
>> [  643.667069] pciback 0000:05:00.0: not ready 65535ms after FLR; giving up
>> //<=======The reboot happens somewhere here, not immediately, but
>> after a while...
>> //Maybe I can get something from xl dmesg if I was quick enough and
>> have connected from a second terminal...
> 
> Unfortunately I didn't see anything from xl dmesg...
> I wish 'xl dmesg' supported a follow mode (dmesg -w) the way the
> Linux dmesg does.
> Here I have to manually repeat this command. The machine suddenly
> freezes after the 'giving up' message is out.
> I see nothing special in the log. Maybe I'm just not lucky enough to
> catch the output, not sure.

If the box reboots in the middle, I guess you really want to hook up
a serial console.
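
Until a serial console is available, the follow mode wished for above
can be approximated by polling. What follows is only a minimal sketch in
Python, assuming 'xl dmesg' is on the PATH in dom0 and prints the whole
console buffer on each call. new_lines() and follow() are hypothetical
helper names, not part of xl. Output produced just before a reset can
still be lost, so this is not a replacement for the serial console.

```python
import subprocess
import time


def new_lines(previous, current):
    """Return the lines of `current` that were not already in `previous`.

    The console buffer may drop old lines from its start, so look for the
    longest tail of `previous` that is a prefix of `current`.
    """
    for i in range(len(previous) + 1):
        tail = previous[i:]
        if current[:len(tail)] == tail:
            return current[len(tail):]


def follow(command=("xl", "dmesg"), interval=0.5):
    """Poll `command` forever and print only the lines that are new."""
    seen = []
    while True:
        # Assumption: each run prints the complete current buffer.
        result = subprocess.run(command, capture_output=True, text=True)
        lines = result.stdout.splitlines()
        for line in new_lines(seen, lines):
            # Flush right away so the text is out before a possible reset.
            print(line, flush=True)
        seen = lines
        time.sleep(interval)
```

Calling follow() from a second terminal before running 'xl
pci-assignable-add' would then show new hypervisor messages as they
appear, within the polling interval.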

Jan
