Re: 5.10-dovetail regression?

2022-04-09 Thread Philippe Gerum via Xenomai


Philippe Gerum  writes:

> Jan Kiszka  writes:
>
>> On 07.04.22 17:24, Philippe Gerum wrote:
>>> 
>>> Jan Kiszka  writes:
>>> 
>>>> Hi Philippe,
>>>>
>>>> does this already ring some bell?
>>>>
>>>> https://source.denx.de/Xenomai/xenomai-images/-/jobs/419210
>>>>
>>>> Only triggers with qemu-amd64, not on real HW and not with 5.15.
>>>>
>>> 
>>> I could not reproduce locally, but visual inspection revealed something
>>> fishy in #8e2c09ee5323. Could you try this on the failing kernel? TIA,
>>> 
>>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>>> index 2651c6cfd034..da6735d45a8a 100644
>>> --- a/kernel/time/clockevents.c
>>> +++ b/kernel/time/clockevents.c
>>> @@ -644,8 +644,8 @@ void clockevents_exchange_device(struct clock_event_device *old,
>>>  		 * to the release list, keep it around but mark it as
>>>  		 * reserved.
>>>  		 */
>>> +		list_del(&old->list);
>>>  		if (tick_check_is_proxy(new)) {
>>> -			list_del(&old->list);
>>>  			clockevents_switch_state(old, CLOCK_EVT_STATE_RESERVED);
>>>  		} else {
>>>  			clockevents_switch_state(old, CLOCK_EVT_STATE_DETACHED);
>>> 
>>
>> Didn't reproduce locally for me as well, though using the same image.
>> But the patch helped on the CI system.
>>
>
> It does not seem to be enough though, that patch fixes a different bug
> actually. So there are two of them:
>
> 1. lockup when running "corectl --stop" on 5.10/kvm_x86 configurations,
> not reproducible here on any other setup
>
> 2. list poisoning which triggers an assertion at boot on "some" x86
> configurations
>
> The patch above definitely fixes #1, makes sense. I managed to reproduce
> #2 on real hw, with kernel 5.15 this time. Same gremlin:
>
> [2.052096] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1152
> [2.052273] ------------[ cut here ]------------
> [2.053250] list_del corruption, 8881001ce0b8->next is LIST_POISON1 (dead0100)
> [2.053250] WARNING: CPU: 0 PID: 1 at lib/list_debug.c:45 __list_del_entry_valid+0x81/0xe0
> [2.053250] Modules linked in:
> [2.053250] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.32+ #100
> [2.053250] Hardware name: TQ-Group TQMxE39M/Type2 - Board Product Name, BIOS 5.12.09.16.05 07/26/2017
> [2.053250] IRQ stage: Linux
> [2.053250] RIP: 0010:__list_del_entry_valid+0x81/0xe0
> [2.053250] Code: 85 c5 ff 49 8b 55 08 4c 39 e2 75 5b b8 01 00 00 00 5d 41 5c 41 5d c3 4c 89 ea 48 8d 75 00 48 c7 c7 80 99 80 ad e8 ea fb 83 00 <0f> 0b 5d 41 5c 31 c0 41 5d c3 49 8d 14 24 48 8d 75 00 48 c7 c7 e0
> [2.053250] RSP: :888100287dc0 EFLAGS: 00010246
> [2.053250] RAX:  RBX: 8881001ce000 RCX:
> [2.053250] RDX: 0002 RSI: 0008 RDI: ed1020050fae
> [2.053250] RBP: 8881001ce0b8 R08: ac22b384 R09: ac279120
> [2.053250] R10: 888100287aaf R11: ed1020050f55 R12: dead0122
> [2.053250] R13: dead0100 R14: 0002 R15: adff62a0
> [2.053250] FS:  () GS:88815c80() knlGS:
> [2.053250] CS:  0010 DS:  ES:  CR0: 80050033
> [2.053250] CR2: 888104e01000 CR3: 000103e1 CR4: 003506f0
> [2.053250] Call Trace:
> [2.053250]  
> [2.053250]  clockevents_exchange_device+0x16c/0x2a0
> [2.053250]  tick_check_new_device+0x1c3/0x230
> [2.053250]  clockevents_register_device+0xc3/0x170
> [2.053250]  setup_boot_APIC_clock+0x526/0x553
> [2.053250]  ? default_ioapic_phys_id_map+0x40/0x40
> [2.053250]  native_smp_prepare_cpus+0x2cd/0x3ef
> [2.053250]  kernel_init_freeable+0xc0/0x290
> [2.053250]  ? rest_init+0xe0/0xe0
> [2.053250]  kernel_init+0x19/0x130
> [2.053250]  ret_from_fork+0x22/0x30
> [2.053250]  
>
> I'm on it.

Ok, so the first patch is not a fix, it's plain nonsense and is
responsible for the second issue in my test case. Back to square
#1. Still on it.
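
For readers unfamiliar with the warning in the trace above: with list debugging enabled, list_del() poisons the pointers of the entry it removes, so a later list_del() on the same entry is reported as "->next is LIST_POISON1". The following is a minimal userspace sketch of that mechanism only. All sketch_* names and the small poison constants are hypothetical stand-ins, not the kernel's definitions, and the thread does not establish which code path deletes the entry twice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

/* Illustrative poison values; the kernel adds an arch-specific offset. */
#define SKETCH_POISON1 ((struct sketch_list_head *)0x100)
#define SKETCH_POISON2 ((struct sketch_list_head *)0x122)

static void sketch_list_init(struct sketch_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after head. */
static void sketch_list_add(struct sketch_list_head *entry,
			    struct sketch_list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/*
 * Unlink entry and poison its pointers. Returns false, leaving the
 * list untouched, if the entry carries poison from an earlier delete.
 */
static bool sketch_list_del(struct sketch_list_head *entry)
{
	if (entry->next == SKETCH_POISON1 || entry->prev == SKETCH_POISON2) {
		fprintf(stderr, "list_del corruption, %p was already deleted\n",
			(void *)entry);
		return false;
	}
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = SKETCH_POISON1;
	entry->prev = SKETCH_POISON2;
	return true;
}

/* Delete the same entry twice; returns how many deletes were rejected. */
static int sketch_double_delete_demo(void)
{
	struct sketch_list_head head, dev;
	int rejected = 0;

	sketch_list_init(&head);
	sketch_list_add(&dev, &head);

	if (!sketch_list_del(&dev))	/* first delete succeeds */
		rejected++;
	assert(head.next == &head && head.prev == &head);

	if (!sketch_list_del(&dev))	/* second delete hits the poison */
		rejected++;

	return rejected;
}
```

Calling sketch_double_delete_demo() from a test program returns 1: the first delete unlinks the entry, the second one is caught by the poison check.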

-- 
Philippe.



Re: 5.10-dovetail regression?

2022-04-09 Thread Philippe Gerum via Xenomai


Jan Kiszka  writes:

> On 07.04.22 17:24, Philippe Gerum wrote:
>> 
>> Jan Kiszka  writes:
>> 
>>> Hi Philippe,
>>>
>>> does this already ring some bell?
>>>
>>> https://source.denx.de/Xenomai/xenomai-images/-/jobs/419210
>>>
>>> Only triggers with qemu-amd64, not on real HW and not with 5.15.
>>>
>> 
>> I could not reproduce locally, but visual inspection revealed something
>> fishy in #8e2c09ee5323. Could you try this on the failing kernel? TIA,
>> 
>> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
>> index 2651c6cfd034..da6735d45a8a 100644
>> --- a/kernel/time/clockevents.c
>> +++ b/kernel/time/clockevents.c
>> @@ -644,8 +644,8 @@ void clockevents_exchange_device(struct clock_event_device *old,
>>  		 * to the release list, keep it around but mark it as
>>  		 * reserved.
>>  		 */
>> +		list_del(&old->list);
>>  		if (tick_check_is_proxy(new)) {
>> -			list_del(&old->list);
>>  			clockevents_switch_state(old, CLOCK_EVT_STATE_RESERVED);
>>  		} else {
>>  			clockevents_switch_state(old, CLOCK_EVT_STATE_DETACHED);
>> 
>
> Didn't reproduce locally for me as well, though using the same image.
> But the patch helped on the CI system.
>

It does not seem to be enough though, that patch fixes a different bug
actually. So there are two of them:

1. lockup when running "corectl --stop" on 5.10/kvm_x86 configurations,
not reproducible here on any other setup

2. list poisoning which triggers an assertion at boot on "some" x86
configurations

The patch above definitely fixes #1, makes sense. I managed to reproduce
#2 on real hw, with kernel 5.15 this time. Same gremlin:

[2.052096] smpboot: Estimated ratio of average max frequency by base frequency (times 1024): 1152
[2.052273] ------------[ cut here ]------------
[2.053250] list_del corruption, 8881001ce0b8->next is LIST_POISON1 (dead0100)
[2.053250] WARNING: CPU: 0 PID: 1 at lib/list_debug.c:45 __list_del_entry_valid+0x81/0xe0
[2.053250] Modules linked in:
[2.053250] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.15.32+ #100
[2.053250] Hardware name: TQ-Group TQMxE39M/Type2 - Board Product Name, BIOS 5.12.09.16.05 07/26/2017
[2.053250] IRQ stage: Linux
[2.053250] RIP: 0010:__list_del_entry_valid+0x81/0xe0
[2.053250] Code: 85 c5 ff 49 8b 55 08 4c 39 e2 75 5b b8 01 00 00 00 5d 41 5c 41 5d c3 4c 89 ea 48 8d 75 00 48 c7 c7 80 99 80 ad e8 ea fb 83 00 <0f> 0b 5d 41 5c 31 c0 41 5d c3 49 8d 14 24 48 8d 75 00 48 c7 c7 e0
[2.053250] RSP: :888100287dc0 EFLAGS: 00010246
[2.053250] RAX:  RBX: 8881001ce000 RCX: 
[2.053250] RDX: 0002 RSI: 0008 RDI: ed1020050fae
[2.053250] RBP: 8881001ce0b8 R08: ac22b384 R09: ac279120
[2.053250] R10: 888100287aaf R11: ed1020050f55 R12: dead0122
[2.053250] R13: dead0100 R14: 0002 R15: adff62a0
[2.053250] FS:  () GS:88815c80() knlGS:
[2.053250] CS:  0010 DS:  ES:  CR0: 80050033
[2.053250] CR2: 888104e01000 CR3: 000103e1 CR4: 003506f0
[2.053250] Call Trace:
[2.053250]  
[2.053250]  clockevents_exchange_device+0x16c/0x2a0
[2.053250]  tick_check_new_device+0x1c3/0x230
[2.053250]  clockevents_register_device+0xc3/0x170
[2.053250]  setup_boot_APIC_clock+0x526/0x553
[2.053250]  ? default_ioapic_phys_id_map+0x40/0x40
[2.053250]  native_smp_prepare_cpus+0x2cd/0x3ef
[2.053250]  kernel_init_freeable+0xc0/0x290
[2.053250]  ? rest_init+0xe0/0xe0
[2.053250]  kernel_init+0x19/0x130
[2.053250]  ret_from_fork+0x22/0x30
[2.053250]  

I'm on it.

-- 
Philippe.



Re: 5.10-dovetail regression?

2022-04-07 Thread Jan Kiszka via Xenomai
On 07.04.22 17:24, Philippe Gerum wrote:
> 
> Jan Kiszka  writes:
> 
>> Hi Philippe,
>>
>> does this already ring some bell?
>>
>> https://source.denx.de/Xenomai/xenomai-images/-/jobs/419210
>>
>> Only triggers with qemu-amd64, not on real HW and not with 5.15.
>>
> 
> I could not reproduce locally, but visual inspection revealed something
> fishy in #8e2c09ee5323. Could you try this on the failing kernel? TIA,
> 
> diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
> index 2651c6cfd034..da6735d45a8a 100644
> --- a/kernel/time/clockevents.c
> +++ b/kernel/time/clockevents.c
> @@ -644,8 +644,8 @@ void clockevents_exchange_device(struct clock_event_device *old,
>  		 * to the release list, keep it around but mark it as
>  		 * reserved.
>  		 */
> +		list_del(&old->list);
>  		if (tick_check_is_proxy(new)) {
> -			list_del(&old->list);
>  			clockevents_switch_state(old, CLOCK_EVT_STATE_RESERVED);
>  		} else {
>  			clockevents_switch_state(old, CLOCK_EVT_STATE_DETACHED);
> 

Didn't reproduce locally for me as well, though using the same image.
But the patch helped on the CI system.

Thanks,
Jan

-- 
Siemens AG, Technology
Competence Center Embedded Linux



Re: 5.10-dovetail regression?

2022-04-07 Thread Philippe Gerum via Xenomai


Jan Kiszka  writes:

> Hi Philippe,
>
> does this already ring some bell?
>
> https://source.denx.de/Xenomai/xenomai-images/-/jobs/419210
>
> Only triggers with qemu-amd64, not on real HW and not with 5.15.
>

I could not reproduce locally, but visual inspection revealed something
fishy in #8e2c09ee5323. Could you try this on the failing kernel? TIA,

diff --git a/kernel/time/clockevents.c b/kernel/time/clockevents.c
index 2651c6cfd034..da6735d45a8a 100644
--- a/kernel/time/clockevents.c
+++ b/kernel/time/clockevents.c
@@ -644,8 +644,8 @@ void clockevents_exchange_device(struct clock_event_device *old,
 		 * to the release list, keep it around but mark it as
 		 * reserved.
 		 */
+		list_del(&old->list);
 		if (tick_check_is_proxy(new)) {
-			list_del(&old->list);
 			clockevents_switch_state(old, CLOCK_EVT_STATE_RESERVED);
 		} else {
 			clockevents_switch_state(old, CLOCK_EVT_STATE_DETACHED);

-- 
Philippe.



Re: 5.10-dovetail regression?

2022-04-07 Thread Philippe Gerum via Xenomai


Philippe Gerum  writes:

> Jan Kiszka  writes:
>
>> Hi Philippe,
>>
>> does this already ring some bell?
>>
>> https://source.denx.de/Xenomai/xenomai-images/-/jobs/419210
>>
>> Only triggers with qemu-amd64, not on real HW and not with 5.15.
>>
>> Jan
>
> 8e2c09ee5323 is most likely causing this. It's a backport of the fix
> developed for 5.15. I have a kvm-aarch64 setup which I routinely use
> too, I'll reproduce and fix this.

Sorry, I mean x86_64, not aarch64.

-- 
Philippe.



Re: 5.10-dovetail regression?

2022-04-07 Thread Philippe Gerum via Xenomai
Jan Kiszka  writes:

> Hi Philippe,
>
> does this already ring some bell?
>
> https://source.denx.de/Xenomai/xenomai-images/-/jobs/419210
>
> Only triggers with qemu-amd64, not on real HW and not with 5.15.
>
> Jan

8e2c09ee5323 is most likely causing this. It's a backport of the fix
developed for 5.15. I have a kvm-aarch64 setup which I routinely use
too, I'll reproduce and fix this.

-- 
Philippe.



5.10-dovetail regression?

2022-04-07 Thread Jan Kiszka via Xenomai
Hi Philippe,

does this already ring some bell?

https://source.denx.de/Xenomai/xenomai-images/-/jobs/419210

Only triggers with qemu-amd64, not on real HW and not with 5.15.

Jan

-- 
Siemens AG, Technology
Competence Center Embedded Linux