Re: Deadlock during debugging
On 11/21/19 4:52 PM, Jan Kiszka wrote:
> On 21.11.19 16:31, Philippe Gerum wrote:
>> On 11/19/19 6:39 PM, Philippe Gerum wrote:
>>> On 11/19/19 5:46 PM, Philippe Gerum via Xenomai wrote:
>>>> On 11/18/19 4:13 PM, Lange Norbert via Xenomai wrote:
>>>>> Hello,
>>>>>
>>>>> Here's one of my deadlocks, the output seems interleaved from 2 concurrent dumps,
>>>>> I ran the crashlog through decode_stacktrace.sh.
>>>>>
>>>> Ok, I can reproduce this one, including in a vm. The symptom can be either a lockup, or recursive faults. I'm on it.
>>>>
>>> #0b9e81807 seems to have introduced a regression. Something there may be subtly at odds with the core scheduler logic. More later.
>>>
>> Eh, no. #0b9e81807 only exposed a very serious and longstanding issue in the scheduler core, which may cause a CPU to pick threads from a remote runqueue in rare circumstances. And that bug is definitely mine. I need to review more code to make sure no more horror shows like this are waiting for prime time. This bug affects all Xenomai series: 3.1, 3.0, 2.x.
>>
> Uh... good that we surfaced this now. Curious to see the result!
>

Good news is that, unlike initially suspected, Xenomai 2.x is _not_ affected by the thread-suspend bug. The synchronous breakpoint feature increased the likelihood of triggering it, as XNDBGSTOP needs to be sent to multiple threads while single-stepping.

--
Philippe.
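(Illustration, not from the thread: the toy C program below only models the general class of failure described above — a CPU handing back a thread that sits on a runqueue it does not own. All names here (struct sched, this_sched, pick_next_buggy, pick_next_checked) are made up for the example; this is not Xenomai's scheduler code and makes no claim about where the actual defect lives, it merely makes the failure mode concrete.)

/*
 * Hypothetical illustration only -- NOT Xenomai scheduler code.
 * A toy per-CPU runqueue model showing how a pick path can hand back a
 * thread that belongs to a remote CPU's runqueue when the scheduler
 * pointer is cached across a window in which the current CPU may change.
 */
#include <stdio.h>

#define NR_CPUS 2

struct thread {
    const char *name;
    int cpu;                    /* CPU whose runqueue holds this thread */
};

struct sched {
    int cpu;
    struct thread *rq_head;     /* simplistic one-entry "runqueue" */
};

static struct sched sched_table[NR_CPUS];
static int current_cpu;         /* stand-in for smp_processor_id() */

static struct sched *this_sched(void)
{
    return &sched_table[current_cpu];
}

/* Buggy pattern: caches "our" runqueue, then the CPU changes under us. */
static struct thread *pick_next_buggy(void)
{
    struct sched *sched = this_sched();   /* snapshot taken on CPU 0 */

    current_cpu = 1;                      /* simulated migration window */

    return sched->rq_head;                /* thread from a remote runqueue */
}

/* Safer pattern: re-evaluate ownership at the point of the pick. */
static struct thread *pick_next_checked(void)
{
    struct sched *sched = this_sched();
    struct thread *t = sched->rq_head;

    return (t && t->cpu == sched->cpu) ? t : NULL;
}

int main(void)
{
    static struct thread t0 = { "queued-on-cpu0", 0 };
    static struct thread t1 = { "queued-on-cpu1", 1 };

    sched_table[0] = (struct sched){ 0, &t0 };
    sched_table[1] = (struct sched){ 1, &t1 };

    current_cpu = 0;
    printf("buggy pick   -> %s (while ending up on cpu%d)\n",
           pick_next_buggy()->name, current_cpu);

    printf("checked pick -> %s\n", pick_next_checked()->name);
    return 0;
}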
Re: Deadlock during debugging
On 21.11.19 16:31, Philippe Gerum wrote:
> On 11/19/19 6:39 PM, Philippe Gerum wrote:
>> On 11/19/19 5:46 PM, Philippe Gerum via Xenomai wrote:
>>> On 11/18/19 4:13 PM, Lange Norbert via Xenomai wrote:
>>>> Hello,
>>>>
>>>> Here's one of my deadlocks, the output seems interleaved from 2 concurrent dumps,
>>>> I ran the crashlog through decode_stacktrace.sh.
>>>>
>>> Ok, I can reproduce this one, including in a vm. The symptom can be either a lockup, or recursive faults. I'm on it.
>>>
>> #0b9e81807 seems to have introduced a regression. Something there may be subtly at odds with the core scheduler logic. More later.
>>
> Eh, no. #0b9e81807 only exposed a very serious and longstanding issue in the scheduler core, which may cause a CPU to pick threads from a remote runqueue in rare circumstances. And that bug is definitely mine. I need to review more code to make sure no more horror shows like this are waiting for prime time. This bug affects all Xenomai series: 3.1, 3.0, 2.x.
>

Uh... good that we surfaced this now. Curious to see the result!

Jan

--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
Re: Deadlock during debugging
On 11/19/19 6:39 PM, Philippe Gerum wrote:
> On 11/19/19 5:46 PM, Philippe Gerum via Xenomai wrote:
>> On 11/18/19 4:13 PM, Lange Norbert via Xenomai wrote:
>>> Hello,
>>>
>>> Here's one of my deadlocks, the output seems interleaved from 2 concurrent dumps,
>>> I ran the crashlog through decode_stacktrace.sh.
>>>
>> Ok, I can reproduce this one, including in a vm. The symptom can be either a lockup, or recursive faults. I'm on it.
>>
> #0b9e81807 seems to have introduced a regression. Something there may be subtly at odds with the core scheduler logic. More later.

Eh, no. #0b9e81807 only exposed a very serious and longstanding issue in the scheduler core, which may cause a CPU to pick threads from a remote runqueue in rare circumstances. And that bug is definitely mine. I need to review more code to make sure no more horror shows like this are waiting for prime time. This bug affects all Xenomai series: 3.1, 3.0, 2.x.

--
Philippe.
Re: Deadlock during debugging
On 11/19/19 5:46 PM, Philippe Gerum via Xenomai wrote:
> On 11/18/19 4:13 PM, Lange Norbert via Xenomai wrote:
>> Hello,
>>
>> Here's one of my deadlocks, the output seems interleaved from 2 concurrent dumps,
>> I ran the crashlog through decode_stacktrace.sh.
>>
> Ok, I can reproduce this one, including in a vm. The symptom can be either a lockup, or recursive faults. I'm on it.
>

#0b9e81807 seems to have introduced a regression. Something there may be subtly at odds with the core scheduler logic. More later.

--
Philippe.
Re: Deadlock during debugging
On 11/18/19 4:13 PM, Lange Norbert via Xenomai wrote:
> Hello,
>
> Here's one of my deadlocks, the output seems interleaved from 2 concurrent dumps,
> I ran the crashlog through decode_stacktrace.sh.
>

Ok, I can reproduce this one, including in a vm. The symptom can be either a lockup, or recursive faults. I'm on it.

--
Philippe.
RE: Deadlock during debugging
> -----Original Message-----
> From: Jan Kiszka
> Sent: Dienstag, 19. November 2019 07:51
> To: Lange Norbert ; Xenomai (xenomai@xenomai.org)
> Subject: Re: Deadlock during debugging
>
> On 18.11.19 18:31, Lange Norbert wrote:
> >
> >> -----Original Message-----
> >> From: Jan Kiszka
> >> Sent: Montag, 18. November 2019 18:22
> >> To: Lange Norbert ; Xenomai (xenomai@xenomai.org)
> >> Subject: Re: Deadlock during debugging
> >>
> >> On 18.11.19 17:24, Lange Norbert via Xenomai wrote:
> >>> One more,
> >>>
> >>> Note that there seem to be quite different reports, from a recursive fault to some threads getting marked as "runaway".
> >>> I can reproduce the issue now easily, but it's proprietary software I can't reach around.
> >>
> >> Understood. Will try to read something from the traces.
> >>
> >> This is a regression over 3.0 now, correct?
> >
> > No, can't say that. I had various recurring issues with 4.9, 4.14 and 4.19 kernels,
> > as well as 3.0 and 3.1.
> > It's hard to narrow down and often just vanished after a while, and my only
> > gut feeling is that condition variables are involved.
> > I also have a couple of cobalt threads *not* pinned to a single CPU.
>
> I'm only talking about the crash during debug - one issue after the other.

So did I. The crashes produce different logs (likely just the effect, after some memory corruption), and I occasionally had these crashes for a while. So this does *not* appear to be a 3.1 regression, which is why I brought these points up.

Compiling the kernel with another gcc and pinning the threads did not help, BTW.

Norbert
Re: Deadlock during debugging
On 18.11.19 18:31, Lange Norbert wrote:
>> -----Original Message-----
>> From: Jan Kiszka
>> Sent: Montag, 18. November 2019 18:22
>> To: Lange Norbert ; Xenomai (xenomai@xenomai.org)
>> Subject: Re: Deadlock during debugging
>>
>> On 18.11.19 17:24, Lange Norbert via Xenomai wrote:
>>> One more,
>>>
>>> Note that there seem to be quite different reports, from a recursive fault to some threads getting marked as "runaway".
>>> I can reproduce the issue now easily, but it's proprietary software I can't reach around.
>>
>> Understood. Will try to read something from the traces.
>>
>> This is a regression over 3.0 now, correct?
>
> No, can't say that. I had various recurring issues with 4.9, 4.14 and 4.19 kernels,
> as well as 3.0 and 3.1.
> It's hard to narrow down and often just vanished after a while, and my only
> gut feeling is that condition variables are involved.
> I also have a couple of cobalt threads *not* pinned to a single CPU.

I'm only talking about the crash during debug - one issue after the other.

Jan

> At least I can now say it's a single app causing the issue, not using rtnet or having additional cobalt applications running.
> Since I can easily reproduce the issue, I will now try using debian's gcc-8, to rule out troubles with the toolchain.
>
> Norbert.

--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
RE: Deadlock during debugging
> -----Original Message-----
> From: Jan Kiszka
> Sent: Montag, 18. November 2019 18:22
> To: Lange Norbert ; Xenomai (xenomai@xenomai.org)
> Subject: Re: Deadlock during debugging
>
> On 18.11.19 17:24, Lange Norbert via Xenomai wrote:
> > One more,
> >
> > Note that there seem to be quite different reports, from a recursive fault to some threads getting marked as "runaway".
> > I can reproduce the issue now easily, but it's proprietary software I can't reach around.
>
> Understood. Will try to read something from the traces.
>
> This is a regression over 3.0 now, correct?

No, can't say that. I had various recurring issues with 4.9, 4.14 and 4.19 kernels, as well as 3.0 and 3.1.
It's hard to narrow down and often just vanished after a while, and my only gut feeling is that condition variables are involved.
I also have a couple of cobalt threads *not* pinned to a single CPU.

At least I can now say it's a single app causing the issue, not using rtnet or having additional cobalt applications running.
Since I can easily reproduce the issue, I will now try using debian's gcc-8, to rule out troubles with the toolchain.

Norbert.
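(Illustration, not from the thread: a minimal, self-contained sketch of the two things mentioned above — pinning a thread to one CPU and blocking on a condition variable — written with plain POSIX calls. Under Cobalt these are the libcobalt-wrapped services, and the condvar wait path is what later shows up as CoBaLt_cond_wait_prologue in the crash traces below. The names worker, data_ready and the CPU number are made up for the example.)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int data_ready;

/* Restrict the calling thread to a single CPU. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    int err;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    err = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (err)
        fprintf(stderr, "pthread_setaffinity_np: error %d\n", err);
}

static void *worker(void *arg)
{
    pin_to_cpu((int)(long)arg);

    pthread_mutex_lock(&lock);
    while (!data_ready)                   /* guard against spurious wakeups */
        pthread_cond_wait(&cond, &lock);  /* blocks in the cond-wait service */
    pthread_mutex_unlock(&lock);

    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, worker, (void *)1L);

    pthread_mutex_lock(&lock);
    data_ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return 0;
}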
Re: Deadlock during debugging
On 18.11.19 17:24, Lange Norbert via Xenomai wrote:
> One more,
>
> Note that there seem to be quite different reports, from a recursive fault to some threads getting marked as "runaway".
> I can reproduce the issue now easily, but it's proprietary software I can't reach around.

Understood. Will try to read something from the traces.

This is a regression over 3.0 now, correct?

Thanks,
Jan

--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
RE: Deadlock during debugging
One more,

Note that there seem to be quite different reports, from a recursive fault to some threads getting marked as "runaway".
I can reproduce the issue now easily, but it's proprietary software I can't reach around.

Norbert

[ 226.354729] I-pipe: Detected stalled head domain, probably caused by a bug.
[ 226.354729] A critical section may have been left unterminated.
[ 226.370156] CPU: 1 PID: 0 Comm: swapper/2 Tainted: GW 4.19.84-xenod8-static #1
[ 226.370160] CPU: 2 PID: 732 Comm: fup.fast Tainted: GW 4.19.84-xenod8-static #1
[ 226.378775] Hardware name: TQ-Group TQMxE39M/Type2 - Board Product Name, BIOS 5.12.30.21.16 01/31/2019
[ 226.387475] Hardware name: TQ-Group TQMxE39M/Type2 - Board Product Name, BIOS 5.12.30.21.16 01/31/2019
[ 226.396782] I-pipe domain: Linux
[ 226.406089] I-pipe domain: Linux
[ 226.409320] RIP: 0010:do_idle+0xaf/0x140
[ 226.412549] Call Trace:
[ 226.416476] Code: 85 92 00 00 00 e8 51 f5 04 00 e8 bc 65 03 00 e8 77 36 7c 00 f0 80 4d 02 20 9c 58 f6 c4 02 74 7e e8 66 2d 07 00 48 85 c0 74 6b <0f> 0b e8 0a 42 07 00 e8 45 68 03 00 9c 58 f6 c4 02 0f 85 79 ff ff
[ 226.418936] dump_stack+0x8c/0xc0
[ 226.437687] RSP: 0018:932cc00afef8 EFLAGS: 00010002
[ 226.441009] ipipe_root_only.cold+0x11/0x32
[ 226.446240] ipipe_stall_root+0xe/0x60
[ 226.450424] RAX: 0001 RBX: 0002 RCX: 000b
[ 226.454182] __ipipe_trap_prologue+0x2ae/0x2f0
[ 226.461319] RDX: a3fc RSI: 8f63f99c8208 RDI:
[ 226.465767] ? __ipipe_complete_domain_migration+0x40/0x40
[ 226.472899] RBP: 8f63f815a7c0 R08: R09: 0002e248
[ 226.478386] invalid_op+0x26/0x51
[ 226.485518] R10: 00015800 R11: 003480cf3801 R12: 8f63f815a7c0
[ 226.488839] RIP: 0010:xnthread_suspend+0x3ef/0x540
[ 226.495973] R13: R14: R15:
[ 226.500766] Code: 58 12 00 00 4c 89 e7 e8 ef ca ff ff 41 83 8c 24 c4 11 00 00 01 e9 82 fd ff ff 0f 0b 48 83 bf 58 12 00 00 00 0f 84 49 fc ff ff <0f> 0b 0f 0b 9c 58 f6 c4 02 0f 84 85 fd ff ff fa bf 00 00 00 80 e8
[ 226.507900] FS: () GS:8f63f980() knlGS:
[ 226.52] RSP: 0018:932cc083bd60 EFLAGS: 00010082
[ 226.534755] CS: 0010 DS: ES: CR0: 80050033
[ 226.539986] CR2: 7ff8dca27000 CR3: 000174c54000 CR4: 003406e0
[ 226.545738] RAX: 932cc0617e30 RBX: 00025090 RCX:
[ 226.552870] Call Trace:
[ 226.560005] RDX: RSI: 0002 RDI: 932cc0616240
[ 226.562461] cpu_startup_entry+0x6f/0x80
[ 226.569590] RBP: 932cc0617e08 R08: 932cc0617e08 R09: 0005cc88
[ 226.573520] start_secondary+0x169/0x1b0
[ 226.580655] R10: R11: R12: 932cc0616240
[ 226.584585] secondary_startup_64+0xa4/0xb0
[ 226.591716] R13: R14: R15: 932cc0617e08
[ 226.595905] ---[ end trace aa5dc96dbf303c58 ]---
[ 226.603042] xnsynch_sleep_on+0x117/0x2d0
[ 226.611670] __cobalt_cond_wait_prologue+0x29f/0x950
[ 226.616647] ? __cobalt_cond_wait_prologue+0x950/0x950
[ 226.621798] CoBaLt_cond_wait_prologue+0x23/0x30
[ 226.626425] handle_head_syscall+0xe1/0x370
[ 226.630618] ipipe_fastcall_hook+0x14/0x20
[ 226.634724] ipipe_handle_syscall+0x57/0xe0
[ 226.638920] do_syscall_64+0x4b/0x500
[ 226.642598] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 226.647660] RIP: 0033:0x77f9c134
[ 226.651244] Code: 8b 73 04 49 89 dc e8 fb ef ff ff 48 89 de 48 8b 5c 24 10 45 31 c0 b9 23 00 00 10 48 8d 54 24 44 45 31 d2 48 89 df 89 c8 0f 05 <8b> 7c 24 2c 31 f6 49 89 c5 89 c5 e8 cc ef ff ff 4c 89 ff e8 74 e9
[ 226.670014] RSP: 002b:7fffe1a6bb10 EFLAGS: 0246 ORIG_RAX: 1023
[ 226.677599] RAX: ffda RBX: 74d91c78 RCX: 77f9c134
[ 226.684744] RDX: 7fffe1a6bb54 RSI: 74d91c48 RDI: 74d91c78
[ 226.691885] RBP: 7fffe1a6bc30 R08: R09:
[ 226.699027] R10: R11: 0246 R12: 74d91c48
[ 226.706166] R13: R14: 0001 R15: 7fffe1a6bb60
[ 226.713325] I-pipe tracer log (100 points):
[ 226.717520] |*+func0 ipipe_trace_panic_freeze+0x0 (ipipe_root_only+0xcf)
[ 226.726114] |*+func0 ipipe_root_only+0x0 (ipipe_stall_root+0xe)
[ 226.733926] |*+func -1 ipipe_stall_root+0x0 (__ipipe_trap_prologue+0x2ae)
[ 226.742431] |# func -2 ipipe_trap_hook+0x0 (__ipipe_notify_trap+0x98)
[ 226.750590] |# func -3 __ipipe_notify_trap+0x0 (__ipipe_trap_prologue+0x7f)
[ 226.759268] |# func -3 __ipipe_trap_prologue+0x0 (invalid_op+0x26)
[ 226.767167] |# func -5 xnthread_suspend+0x0 (xnsynch_sleep_on+0x117)
[ 226.7752
RE: Deadlock during debugging
New crash, same thing with ipipe panic trace (the decoded log does not add information to the relevant parts).
Is the dump_stack function itself trashing the stack?

[ 168.411205] [Xenomai] watchdog triggered on CPU #1 -- runaway thread 'main' signaled
[ 209.176742] [ cut here ]
[ 209.181381] xnthread_relax() failed for thread aboard_runner[790]
[ 209.181389] BUG: Unhandled exception over domain Xenomai at 0x7fed - switching to ROOT
[ 209.196451] CPU: 0 PID: 790 Comm: aboard_runner Tainted: GW 4.19.84-xenod8-static #1
[ 209.205588] Hardware name: TQ-Group TQMxE39M/Type2 - Board Product Name, BIOS 5.12.30.21.16 01/31/2019
[ 209.214900] I-pipe domain: Linux
[ 209.218137] Call Trace:
[ 209.220593] dump_stack+0x8c/0xc0
[ 209.223919] __ipipe_trap_prologue.cold+0x1f/0x5e
[ 209.228629] invalid_op+0x26/0x51
[ 209.231952] RIP: 0010:xnthread_relax+0x46d/0x4a0
[ 209.236576] Code: f6 83 c2 11 00 00 01 75 0e 48 8b 03 48 85 c0 74 33 8b 90 c0 04 00 00 48 8d b3 5c 14 00 00 48 c7 c7 90 00 8b 9a e8 02 02 ef ff <0f> 0b e9 42 fd ff ff 89 c6 48 c7 c7 c4 f8 a3 9a e8 2e 71 f3 ff e9
[ 209.255347] RSP: 0018:9a0e4074fd90 EFLAGS: 00010286
[ 209.260586] RAX: RBX: 9a0e4065aa40 RCX: 000b
[ 209.267728] RDX: 5129 RSI: 902a794791f8 RDI: 007800c0
[ 209.274869] RBP: 9a0e4074fe68 R08: 007800c0 R09: 0002e248
[ 209.282013] R10: 9bb72040 R11: 9bb3209c R12: 9bbfdc80
[ 209.289157] R13: 902a76da8000 R14: 0001 R15: 0292
[ 209.296299] ? xnthread_prepare_wait+0x20/0x20
[ 209.300752] ? trace+0x59/0x8d
[ 209.303814] ? __cobalt_clock_nanosleep+0x540/0x540
[ 209.308700] handle_head_syscall+0x307/0x370
[ 209.312979] ipipe_fastcall_hook+0x14/0x20
[ 209.317083] ipipe_handle_syscall+0x57/0xe0
[ 209.321280] do_syscall_64+0x4b/0x500
[ 209.324950] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 209.330011] RIP: 0033:0x77f9bd68
[ 209.333598] Code: 89 fb bf 01 00 00 00 48 83 ec 18 48 8d 74 24 0c e8 bd f3 ff ff b9 19 00 00 10 48 63 f5 48 63 fb 4d 89 ea 4c 89 e2 89 c8 0f 05 <8b> 7c 24 0c 48 89 c3 31 f6 e8 9a f3 ff ff 48 83 c4 18 89 d8 f7 d8
[ 209.352370] RSP: 002b:7fffe7d0 EFLAGS: 0246 ORIG_RAX: 1019
[ 209.359954] RAX: fe00 RBX: 0001 RCX: 77f9bd68
[ 209.367098] RDX: 7fffe820 RSI: 0001 RDI: 0001
[ 209.374237] RBP: 0001 R08: 0001 R09: 0014
[ 209.381381] R10: 7fffe820 R11: 0246 R12: 7fffe820
[ 209.388524] R13: 7fffe820 R14: R15:
[ 209.395665] I-pipe tracer log (100 points):
[ 209.399857] | #func0 ipipe_trace_panic_freeze+0x0 (__ipipe_trap_prologue+0x237)
[ 209.409056] | +func0 ipipe_root_only+0x0 (ipipe_stall_root+0xe)
[ 209.416862] | +func -1 ipipe_stall_root+0x0 (__ipipe_trap_prologue+0x2ae)
[ 209.425365] |+ func -2 ipipe_trap_hook+0x0 (__ipipe_notify_trap+0x98)
[ 209.433523] |+ func -3 __ipipe_notify_trap+0x0 (__ipipe_trap_prologue+0x7f)
[ 209.442199] |+ func -4 __ipipe_trap_prologue+0x0 (invalid_op+0x26)
[ 209.450097] |+ end 0x8001 -5 __ipipe_spin_unlock_irqrestore+0x4f (<>)
[ 209.458425] |# func -6 __ipipe_spin_unlock_irqrestore+0x0 (__ipipe_log_printk+0x69)
[ 209.467797] |+ begin 0x8001-10 __ipipe_spin_lock_irqsave+0x5e (<>)
[ 209.475693] + func -10 __ipipe_spin_lock_irqsave+0x0 (__ipipe_log_printk+0x22)
[ 209.484630] + func -10 __ipipe_log_printk+0x0 (__warn_printk+0x6c)
[ 209.492525] |+ end 0x8001-11 do_vprintk+0xf6 (<>)
[ 209.499120] |+ begin 0x8001-11 do_vprintk+0x106 (<>)
[ 209.505799] + func -12 do_vprintk+0x0 (__warn_printk+0x6c)
[ 209.513000] + func -12 vprintk+0x0 (__warn_printk+0x6c)
[ 209.519939] |+ end 0x8001-12 ipipe_raise_irq+0x70 (<>)
[ 209.526969] |+ func -13 __ipipe_set_irq_pending+0x0 (__ipipe_dispatch_irq+0xad)
[ 209.535905] |+ func -14 __ipipe_dispatch_irq+0x0 (ipipe_raise_irq+0x7e)
[ 209.544148] |+ begin 0x8001-14 ipipe_raise_irq+0x64 (<>)
[ 209.551178] + func -15 ipipe_raise_irq+0x0 (__ipipe_log_printk+0x84)
[ 209.559250] |+ end 0x8001-15 __ipipe_spin_unlock_irqrestore+0x4f (<>)
[ 209.567581] |# func -15 __ipipe_spin_unlock_irqrestore+0x0 (__ipipe_log_printk+0x69)
[ 209.576951] |+ begin 0x8001-17 __ipipe_spin_lock_irqsave+0x5e (<>)
[ 209.584847] + func -18 __ipipe_spin_lock_irqsave+0x0 (__ipipe_log_printk+0x22)
[ 2