RE: Still getting Deadlocks with condition variables

2020-06-15 Thread Lange Norbert via Xenomai


> -Original Message-
> From: Philippe Gerum 
> Sent: Montag, 15. Juni 2020 12:03
> To: Lange Norbert ; Xenomai
> (xenomai@xenomai.org) ;
> 'jan.kis...@siemens.com' 
> Subject: Re: Still getting Deadlocks with condition variables
>
>
>
> On 6/15/20 11:06 AM, Lange Norbert wrote:
> >>
> >> This code does not take away any protection, on the contrary this
> >> ensures that PROT_EXEC is set for all stacks along with read and
> >> write access, which is glibc's default for the x86_64 architecture.
> >
> > I meant that it might have to do some non-atomic procedure, for
> > example when splitting up a contiguous bigger mapping with the stack
> > in the middle, as the protection flags are now different.
> >
>
> We are talking about mprotect(), not mmap().

My bad, yes.

>
> >> The fault is likely due to mm code fixing up those protections for
> >> the relevant page(s). It looks like such pages are force faulted-in,
> >> which would explain the #PF, and the SIGXCPU notification as a
> >> consequence. These are minor faults in the MMU management sense, so
> >> this is transparent for common applications.
> >
> > I don’t know enough about the x86 (and don’t want to know), but this
> > needs some explanation. First, the DSOs don’t need executable stack
> > (the build system did not care to add the .note.GNU-stack everywhere), so
> > this specific issue can be worked around.
> >
> > -   I don’t understand why this is very timing sensitive. If a page is
> > marked to #PF (or removed), then it should fault predictably on the next
> > access (I don’t share data on the stack that Linux threads could run
> > into a #PF instead).
>
> Faulting is what it does. It is predictable and synchronous, you seem to be
> assuming that the fault is somehow async or proxied, it is not.

It does happen rather sparsely, affected by small changes in unrelated code:
loading a DSO (which requires executable stack) might trigger a #PF or not.
Nothing but the respective RT thread is accessing its *private* stack, and the
page fault happens at a callq.

If that stack access is to cause a #PF, then the code would run into it
*every* time. Yet it does not; it is rather really hard to reproduce.

> > -   If that’s a non-atomic operation (perhaps only if the sparse tables
> > need modification at a higher level), then I would expect some sort of
> > lazy locking (RCU?). Is this ending up in chaos as cores running Xenomai
> > are "idle" for Linux, and pick up outdated data?
>
> I have no idea why you would bring up RCU in this picture, there is no
> convoluted aspect in what happens. There is no chaos, only a plain simple
> #PF event which unfortunately occurs as a result of running an apparently
> innocuous regular operation which is loading a DSO. The reason for the #PF
> can be explained, how it is dealt with is fine, the rt loop in your app
> just does not like observing it for a legitimate reason.

I am asking a question: I assume page tables need to be reallocated under
some circumstances. I understand a #PF *has to happen* according to your
explanation, and I don’t know *where it happens if I don’t observe the RT
task switching*.

The faults are very timing sensitive.

So I could imagine the multilevel page-table map looking like this (I don’t
know how many levels x86 is using nowadays, but that’s beside the point):
[first level] -> [second level] -> [stack mapping]

If the mprotect syscall changes just the *private* stack mapping, then the RT
thread will always fault (not what I observe).
If the syscall modifies lower levels, then any thread can hit the #PF, and if
it's not an RT thread then no #PF will be observed (by WARNSW).

This is my conjecture; if it's true, then the question becomes:
-   under what circumstances can this appear?
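If that conjecture holds, it should be testable in isolation: give a thread a
stack at a known address, keep it touching that stack, and change the
protection of the range from another thread - roughly what glibc does when a
DSO demands an executable stack. A minimal sketch (plain POSIX, no Xenomai
specifics, error handling elided; whether the spinning thread took a fault
can be checked via the minor-fault counter of getrusage(RUSAGE_THREAD)):

#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define STACK_SZ (1024 * 1024)

static void *spin(void *arg)
{
        volatile char probe[64];        /* lives on this thread's stack */
        for (;;)
                probe[0]++;             /* keep generating stack accesses */
        return NULL;
}

int main(void)                          /* build: gcc -pthread ... */
{
        pthread_t tid;
        pthread_attr_t attr;

        /* Place the stack at a known address so it can be mprotect()ed. */
        void *stk = mmap(NULL, STACK_SZ, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

        pthread_attr_init(&attr);
        pthread_attr_setstack(&attr, stk, STACK_SZ);
        pthread_create(&tid, &attr, spin, NULL);
        sleep(1);

        /* Mimic the glibc fix-up applied when a loaded DSO requires an
           executable stack: add PROT_EXEC to the whole stack range while
           the other thread keeps using it. */
        mprotect(stk, STACK_SZ, PROT_READ | PROT_WRITE | PROT_EXEC);

        pthread_join(tid, NULL);        /* never returns */
        return 0;
}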

>
> > -   Can such minor faults be handled in Xenomai? In other words, is the
> > WARNSW correct, or is this actually just the check causing the issues?
> > Would it make sense to handle such minor faults in Xenomai (only
> > demoting to Linux if necessary)?
> >
> >> Not for those of us who do not want the application code to run into
> >> any page fault unfortunately.
> >>
> >> Loading DSOs while the real-time system is running just proved to be
> >> a bad idea it seems (did not check how other *libc implementations
> >> behave on
> >> dlopen() though).
> >
> > glibc dlopens files on its own BTW, for nss plugins and encodings.
> > Practically that means you would need to chec

Re: Still getting Deadlocks with condition variables

2020-06-15 Thread Philippe Gerum via Xenomai
On 6/15/20 11:06 AM, Lange Norbert wrote:
>>
>> This code does not take away any protection, on the contrary this ensures
>> that
>> PROT_EXEC is set for all stacks along with read and write access, which is
>> glibc's default for the x86_64 architecture.
> 
> I meant that it might have to do some non-atomic procedure, for example
> when splitting up a contiguous bigger mapping with the stack in the middle,
> as the protection flags are now different.
>

We are talking about mprotect(), not mmap().

>> The fault is likely due to mm code fixing up those protections for the
>> relevant page(s). It looks like such pages are force faulted-in, which would
>> explain the #PF, and the SIGXCPU notification as a consequence. These are
>> minor faults in the MMU management sense, so this is transparent for
>> common applications.
> 
> I don’t know enough about the x86 (and don’t want to know), but this needs
> some explanation. First, the DSOs don’t need executable stack (the build
> system did not care to add the .note.GNU-stack everywhere), so this
> specific issue can be worked around.
> 
> -   I don’t understand why this is very timing sensitive. If a page is
> marked to #PF (or removed), then it should fault predictably on the next
> access (I don’t share data on the stack that Linux threads could run into
> a #PF instead).

Faulting is what it does. It is predictable and synchronous, you seem to be
assuming that the fault is somehow async or proxied, it is not.

> -   If that’s a non-atomic operation (perhaps only if the sparse tables
> need modification at a higher level), then I would expect some sort of
> lazy locking (RCU?). Is this ending up in chaos as cores running Xenomai
> are "idle" for Linux, and pick up outdated data?

I have no idea why you would bring up RCU in this picture, there is no
convoluted aspect in what happens. There is no chaos, only a plain simple #PF
event which unfortunately occurs as a result of running an apparently
innocuous regular operation which is loading a DSO. The reason for the #PF can
be explained, how it is dealt with is fine, the rt loop in your app just does
not like observing it for a legitimate reason.

> -   Can such minor faults be handled in Xenomai? In other words, is the
> WARNSW correct, or is this actually just the check causing the issues?
> Would it make sense to handle such minor faults in Xenomai (only demoting
> to Linux if necessary)?
> 
>> Not for those of us who do not want the application code to run
>> into any page fault unfortunately.
>>
>> Loading DSOs while the real-time system is running just proved to be a bad
>> idea it seems (did not check how other *libc implementations behave on
>> dlopen() though).
> 
> glibc dlopens files on its own BTW, for nss plugins and encodings.
> Practically that means you would need to check everything running (in
> non-rt threads) for dlopen and various calls that could resolve names to
> uid/gid, do dns lookups, use iconv etc.
> 

The glibc is fortunately not dlopening DSOs at every corner. You mention very
specific features that would have to take place during the app init chores
instead, or at the very least in a way which is synchronized with a quiescent
state of the rt portion of the process.
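One way to follow that advice is to drag those lazy glibc features through
their first use during the init chores, so the hidden dlopen() calls happen
before any rt thread runs. A sketch - which calls matter depends on what the
application actually uses:

#include <iconv.h>
#include <netdb.h>
#include <pwd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Call once during init, before the rt threads start, so glibc loads its
   NSS and gconv plugins while dlopen()/mprotect() side effects can still
   be tolerated. */
static void preload_glibc_plugins(void)
{
        struct addrinfo *ai;
        iconv_t cd;

        getpwuid(getuid());                     /* NSS passwd backends */
        if (getaddrinfo("localhost", NULL, NULL, &ai) == 0)
                freeaddrinfo(ai);               /* NSS hosts/dns backends */
        cd = iconv_open("UTF-8", "ISO-8859-1"); /* gconv encoding modules */
        if (cd != (iconv_t)-1)
                iconv_close(cd);
}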

> This is an issue of changing protection on existing mappings, or the mmap 
> call in broader terms.

This is an issue with some of mprotect() side-effects.

> Knowing why and under what circumstances this causes trouble would be
> rather important (would have been important some years ago when we
> *started* porting to Xenomai and picked solutions).
> 

I believe the whole thread has already framed the how and why fairly
precisely. With respect to knowing those things in advance, I can only
recommend that people who based their project on Xenomai over the past 15
years share their knowledge and experience by contributing documentation and
participating in the mailing list.

> I could, for example, expect the kernel option CONFIG_COMPACTION to cause
> similar issues (and pretty impossible to triage).
> 

CONFIG_COMPACTION is a known source of latency spots, and transparent huge
pages are not going to be helpful either, regardless of the underlying rt
infrastructure.
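Transparent huge pages at least can be opted out of per process (prctl
PR_SET_THP_DISABLE, Linux >= 3.15), which keeps THP collapse/split activity
away from the process's page tables; a sketch:

#include <stdio.h>
#include <sys/prctl.h>

/* Ask the kernel not to back this process with transparent huge pages,
   so khugepaged cannot rework our page tables behind the rt threads'
   back. */
static int disable_thp(void)
{
        if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0) != 0) {
                perror("PR_SET_THP_DISABLE");
                return -1;
        }
        return 0;
}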

To sum up what we have all been saying, the problem is not about causing a PTE
miss due to mprotect() altering permissions on pages, but how we can handle
the minor fault from primary mode. On arm, arm64 and ppc, such a fault can be
handled directly from primary mode. On x86, the inner MMU code which is in
charge of handling faults does not allow that; there is the need to switch to
secondary mode.

Regarding handling PTE misses directly from primary mode, this would be a mess
with x86: sharing the MMU management logic between Cobalt and the regular
linux mm sub-system which otherwise run totally asynchronously is something I
for one won't even try.

Without Xenomai, you would take a fault the same way, the difference is that

RE: Still getting Deadlocks with condition variables

2020-06-15 Thread Lange Norbert via Xenomai


> -Original Message-
> From: Philippe Gerum 
> Sent: Mittwoch, 10. Juni 2020 18:48
> To: Lange Norbert ; Xenomai
> (xenomai@xenomai.org) ;
> 'jan.kis...@siemens.com' 
> Subject: Re: Still getting Deadlocks with condition variables
>
>
>
> On 6/9/20 7:10 PM, Lange Norbert wrote:
> >
> >
> >> -Original Message-
> >> From: Philippe Gerum 
> >> Sent: Montag, 8. Juni 2020 16:17
> >> To: Lange Norbert ; Xenomai
> >> (xenomai@xenomai.org) 
> >> Subject: Re: Still getting Deadlocks with condition variables
> >>
> >>
> >>
> >> On 6/8/20 12:08 PM, Lange Norbert wrote:
> >>>
> >>>> This kernel message tells a different story, thread pid 681 received
> >>>> a #PF, maybe due to accessing its own stack (cond.c, line 316). This
> >>>> may be a minor fault though, nothing invalid. Such fault is not
> >>>> supposed to occur for Xenomai threads on x86, but that would be
> >>>> another issue. Code-wise, I'm referring to the current state of the
> >>>> master branch for lib/cobalt/cond.c, which seems to match your
> >> description.
> >>>
> >>> I don't know what you mean by 'minor fault', from the perspective of
> >>> Linux?
> >>> An RT thread getting demoted to Linux is rather serious to me.
> >>>
> >>
> >> Minor from an MMU standpoint: the memory the CPU dereferenced is valid
> >> but no page table entry currently maps it. So, #PF in this case seems
> >> to be a 'minor' fault in MMU lingo, but it is still not expected.
> >>
> >>> Also, the thing is that I would not know how a PF should be possible
> >>> in the long-running thread, with locked memory, with the call being
> >>> close to the thread entry point in a wait-for-condvar loop, never
> >>> using more than an insignificant amount of stack at this time.
> >>
> >> Except if mapping an executable segment via dlopen() comes into play,
> >> affecting the page table. Only an assumption at this stage.
> >>
> >>> On the other hand, the non-RT thread loads a DSO and is stuck
> somewhere
> >> after allocating memory.
> >>> My guess would be that the PF ends up at the wrong thread.
> >>>
> >>
> >> As Jan pointed out, #PF are synchronously taken, synchronously handled.
> >> I really don't see how #PF handling could ever wander.
> >>
> >>>>>
> >>>>
> >>>> You refer to an older post describing a lockup, but this post
> >>>> describes an application crashing with a core dump. What made you
> >>>> draw the conclusion that the same bug would be at work?
> >>>
> >>> Same bug, different PTHREAD_WARNSW setting is my guess.
> >>> The underlying issue is that an unrelated signal ends up at an RT thread.
> >>>
> >>>> Also, could you give some details
> >>>> regarding the
> >>>> following:
> >>>>
> >>>> - what do you mean by 'lockup' in this case? Can you still access the
> >>>> board or is there some runaway real-time code locking out everything
> >>>> else when this happens? My understanding is that this is no hard lock
> >>>> up otherwise the watchdog would have triggered. If this is a softer
> >>>> kind of lockup instead, what does /proc/xenomai/sched/stat tell you
> >>>> about the thread states after the problem occurred?
> >>>
> >>> This was a post-mortem, no access to /proc/xenomai/sched/stat anymore.
> >>> Lockup means deadlock (the thread getting the signal holds a mutex,
> >>> but is stuck). Coredump happens if PTHREAD_WARNSW is enabled (meaning
> >>> it asserts out before).
> >>>
> >>>> - did you determine that using the dynamic linker is required to
> >>>> trigger the bug yet? Or could you observe it without such interaction
> with
> >> dl?
> >>>
> >>> AFAIK, always occurred at the stage where we load a "configuration",
> and
> >> load DSOs.
> >>>
> >>>>
> >>>> - what is the typical size of your Xenomai thread stack? It defaults
> >>>> t

Re: Still getting Deadlocks with condition variables

2020-06-10 Thread Philippe Gerum via Xenomai
On 6/9/20 7:10 PM, Lange Norbert wrote:
> 
> 
>> -Original Message-
>> From: Philippe Gerum 
>> Sent: Montag, 8. Juni 2020 16:17
>> To: Lange Norbert ; Xenomai
>> (xenomai@xenomai.org) 
>> Subject: Re: Still getting Deadlocks with condition variables
>>
>>
>>
>> On 6/8/20 12:08 PM, Lange Norbert wrote:
>>>
>>>> This kernel message tells a different story, thread pid 681 received
>>>> a #PF, maybe due to accessing its own stack (cond.c, line 316). This
>>>> may be a minor fault though, nothing invalid. Such fault is not
>>>> supposed to occur for Xenomai threads on x86, but that would be
>>>> another issue. Code-wise, I'm referring to the current state of the
>>>> master branch for lib/cobalt/cond.c, which seems to match your
>> description.
>>>
>>> I don't know what you mean by 'minor fault', from the perspective of
>>> Linux?
>>> An RT thread getting demoted to Linux is rather serious to me.
>>>
>>
>> Minor from an MMU standpoint: the memory the CPU dereferenced is valid
>> but no page table entry currently maps it. So, #PF in this case seems to
>> be a 'minor' fault in MMU lingo, but it is still not expected.
>>
>>> Also, the thing is that I would not know how a PF should be possible in
>>> the long-running thread, with locked memory, with the call being close
>>> to the thread entry point in a wait-for-condvar loop, never using more
>>> than an insignificant amount of stack at this time.
>>
>> Except if mapping an executable segment via dlopen() comes into play,
>> affecting the page table. Only an assumption at this stage.
>>
>>> On the other hand, the non-RT thread loads a DSO and is stuck somewhere
>> after allocating memory.
>>> My guess would be that the PF ends up at the wrong thread.
>>>
>>
>> As Jan pointed out, #PF are synchronously taken, synchronously handled. I
>> really don't see how #PF handling could ever wander.
>>
>>>>>
>>>>
>>>> You refer to an older post describing a lockup, but this post
>>>> describes an application crashing with a core dump. What made you
>>>> draw the conclusion that the same bug would be at work?
>>>
>>> Same bug, different PTHREAD_WARNSW setting is my guess.
>>> The underlying issue is that an unrelated signal ends up at an RT thread.
>>>
>>>> Also, could you give some details
>>>> regarding the
>>>> following:
>>>>
>>>> - what do you mean by 'lockup' in this case? Can you still access the
>>>> board or is there some runaway real-time code locking out everything
>>>> else when this happens? My understanding is that this is no hard lock
>>>> up otherwise the watchdog would have triggered. If this is a softer
>>>> kind of lockup instead, what does /proc/xenomai/sched/stat tell you
>>>> about the thread states after the problem occurred?
>>>
>>> This was a post-mortem, no access to /proc/xenomai/sched/stat anymore.
>>> Lockup means deadlock (the thread getting the signal holds a mutex,
>>> but is stuck). Coredump happens if PTHREAD_WARNSW is enabled (meaning
>>> it asserts out before).
>>>
>>>> - did you determine that using the dynamic linker is required to
>>>> trigger the bug yet? Or could you observe it without such interaction with
>> dl?
>>>
>>> AFAIK, always occurred at the stage where we load a "configuration", and
>> load DSOs.
>>>
>>>>
>>>> - what is the typical size of your Xenomai thread stack? It defaults
>>>> to 64k min with Xenomai 3.1.
>>>
>>> 1MB
>>
>> I would dig the following distinct issues:
>>
>> - why is #PF taken on an apparently innocuous instruction. dlopen(3)->mmap(2)
>> might be involved. With a simple test case, you could check the
>> impact of loading/unloading DSOs on memory management for real-time
>> threads running in parallel. Setting the WARNSW bit on for these threads
>> would be required.
>>
>> - whether dealing with a signal adversely affects the wait-side of a Xenomai
>> condvar. There is a specific trick to handle this in the Cobalt and libcobalt
>> code, which is the reason for the wait_prologue / wait_epilogue dance in the
>> implementation IIRC. Understanding why that thread receives a si

RE: Still getting Deadlocks with condition variables

2020-06-09 Thread Lange Norbert via Xenomai


> -Original Message-
> From: Philippe Gerum 
> Sent: Montag, 8. Juni 2020 16:17
> To: Lange Norbert ; Xenomai
> (xenomai@xenomai.org) 
> Subject: Re: Still getting Deadlocks with condition variables
>
>
>
> On 6/8/20 12:08 PM, Lange Norbert wrote:
> >
> >> This kernel message tells a different story, thread pid 681 received
> >> a #PF, maybe due to accessing its own stack (cond.c, line 316). This
> >> may be a minor fault though, nothing invalid. Such fault is not
> >> supposed to occur for Xenomai threads on x86, but that would be
> >> another issue. Code-wise, I'm referring to the current state of the
> >> master branch for lib/cobalt/cond.c, which seems to match your
> description.
> >
> > I don't know what you mean by 'minor fault', from the perspective of
> > Linux?
> > An RT thread getting demoted to Linux is rather serious to me.
> >
>
> Minor from an MMU standpoint: the memory the CPU dereferenced is valid
> but no page table entry currently maps it. So, #PF in this case seems to
> be a 'minor' fault in MMU lingo, but it is still not expected.
>
> > Also, the thing is that I would not know how a PF should be possible in
> > the long-running thread, with locked memory, with the call being close
> > to the thread entry point in a wait-for-condvar loop, never using more
> > than an insignificant amount of stack at this time.
>
> Except if mapping an executable segment via dlopen() comes into play,
> affecting the page table. Only an assumption at this stage.
>
> > On the other hand, the non-RT thread loads a DSO and is stuck somewhere
> after allocating memory.
> > My guess would be that the PF ends up at the wrong thread.
> >
>
> As Jan pointed out, #PF are synchronously taken, synchronously handled. I
> really don't see how #PF handling could ever wander.
>
> >>>
> >>
> >> You refer to an older post describing a lockup, but this post
> >> describes an application crashing with a core dump. What made you
> >> draw the conclusion that the same bug would be at work?
> >
> > Same bug, different PTHREAD_WARNSW setting is my guess.
> > The underlying issue is that an unrelated signal ends up at an RT thread.
> >
> >> Also, could you give some details
> >> regarding the
> >> following:
> >>
> >> - what do you mean by 'lockup' in this case? Can you still access the
> >> board or is there some runaway real-time code locking out everything
> >> else when this happens? My understanding is that this is no hard lock
> >> up otherwise the watchdog would have triggered. If this is a softer
> >> kind of lockup instead, what does /proc/xenomai/sched/stat tell you
> >> about the thread states after the problem occurred?
> >
> > This was a post-mortem, no access to /proc/xenomai/sched/stat anymore.
> > Lockup means deadlock (the thread getting the signal holds a mutex,
> > but is stuck). Coredump happens if PTHREAD_WARNSW is enabled (meaning
> > it asserts out before).
> >
> >> - did you determine that using the dynamic linker is required to
> >> trigger the bug yet? Or could you observe it without such interaction with
> dl?
> >
> > AFAIK, always occurred at the stage where we load a "configuration", and
> load DSOs.
> >
> >>
> >> - what is the typical size of your Xenomai thread stack? It defaults
> >> to 64k min with Xenomai 3.1.
> >
> > 1MB
>
> I would dig the following distinct issues:
>
> - why is #PF taken on an apparently innocuous instruction. dlopen(3)->mmap(2)
> might be involved. With a simple test case, you could check the
> impact of loading/unloading DSOs on memory management for real-time
> threads running in parallel. Setting the WARNSW bit on for these threads
> would be required.
>
> - whether dealing with a signal adversely affects the wait-side of a Xenomai
> condvar. There is a specific trick to handle this in the Cobalt and libcobalt
> code, which is the reason for the wait_prologue / wait_epilogue dance in the
> implementation IIRC. Understanding why that thread receives a signal in the
> first place would help too. According to your description, this may not be
> directly due to taking #PF, but may be an indirect consequence of that event
> on sibling threads (propagation of a debug condition of some sort, such as
> those detected by CONFIG_XENO_OPT_DEBUG_MUTEX*).
>
> At any rate, you may want to enable the function ftracer, enabling

Re: Still getting Deadlocks with condition variables

2020-06-08 Thread Philippe Gerum via Xenomai
On 6/8/20 12:08 PM, Lange Norbert wrote:
> 
>> This kernel message tells a different story, thread pid 681 received a #PF,
>> maybe due to accessing its own stack (cond.c, line 316). This may be a minor
>> fault though, nothing invalid. Such fault is not supposed to occur for 
>> Xenomai
>> threads on x86, but that would be another issue. Code-wise, I'm referring to
>> the current state of the master branch for lib/cobalt/cond.c, which seems to
>> match your description.
> 
> I don't know what you mean by 'minor fault', from the perspective of Linux?
> An RT thread getting demoted to Linux is rather serious to me.
>

Minor from an MMU standpoint: the memory the CPU dereferenced is valid but no
page table entry currently maps it. So, #PF in this case seems to be a 'minor'
fault in MMU lingo, but it is still not expected.

> Also, the thing is that I would not know how a PF should be possible in the
> long-running thread, with locked memory, with the call being close to the
> thread entry point in a wait-for-condvar loop, never using more than an
> insignificant amount of stack at this time.

Except if mapping an executable segment via dlopen() comes into play,
affecting the page table. Only an assumption at this stage.

> On the other hand, the non-RT thread loads a DSO and is stuck somewhere after 
> allocating memory.
> My guess would be that the PF ends up at the wrong thread.
> 

As Jan pointed out, #PF are synchronously taken, synchronously handled. I
really don't see how #PF handling could ever wander.

>>>
>>
>> You refer to an older post describing a lockup, but this post describes an
>> application crashing with a core dump. What made you draw the conclusion
>> that the same bug would be at work?
> 
> Same bug, different PTHREAD_WARNSW setting is my guess.
> The underlying issue is that an unrelated signal ends up at an RT thread.
> 
>> Also, could you give some details
>> regarding the
>> following:
>>
>> - what do you mean by 'lockup' in this case? Can you still access the board 
>> or
>> is there some runaway real-time code locking out everything else when this
>> happens? My understanding is that this is no hard lock up otherwise the
>> watchdog would have triggered. If this is a softer kind of lockup instead, 
>> what
>> does /proc/xenomai/sched/stat tell you about the thread states after the
>> problem occurred?
> 
> This was a post-mortem, no access to /proc/xenomai/sched/stat anymore.
> Lockup means deadlock (the thread getting the signal holds a mutex, but is
> stuck). Coredump happens if PTHREAD_WARNSW is enabled (meaning it asserts
> out before).
> 
>> - did you determine that using the dynamic linker is required to trigger the
>> bug yet? Or could you observe it without such interaction with dl?
> 
> AFAIK, always occurred at the stage where we load a "configuration", and load 
> DSOs.
> 
>>
>> - what is the typical size of your Xenomai thread stack? It defaults to 64k 
>> min
>> with Xenomai 3.1.
> 
> 1MB

I would dig the following distinct issues:

- why is #PF taken on an apparently innocuous instruction. dlopen(3)->mmap(2)
might be involved. With a simple test case, you could check the impact of
loading/unloading DSOs on memory management for real-time threads running in
parallel. Setting the WARNSW bit on for these threads would be required.

- whether dealing with a signal adversely affects the wait-side of a Xenomai
condvar. There is a specific trick to handle this in the Cobalt and libcobalt
code, which is the reason for the wait_prologue / wait_epilogue dance in the
implementation IIRC. Understanding why that thread receives a signal in the
first place would help too. According to your description, this may not be
directly due to taking #PF, but may be an indirect consequence of that event
on sibling threads (propagation of a debug condition of some sort, such as
those detected by CONFIG_XENO_OPT_DEBUG_MUTEX*).
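A minimal version of the test case suggested in the first item might look
like the following sketch (built against libcobalt so the pthread_* calls map
to the Cobalt services, linked with -ldl; "./dummy.so" is a placeholder for
any shared object, ideally one carrying an executable-stack requirement;
error handling elided):

#include <dlfcn.h>
#include <pthread.h>
#include <sched.h>
#include <time.h>
#include <unistd.h>

/* rt side: ask to be notified (SIGXCPU) of any migration to secondary
   mode, then periodically generate some stack traffic. */
static void *rt_loop(void *arg)
{
        struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000000 };

        pthread_setmode_np(0, PTHREAD_WARNSW, NULL);
        for (;;) {
                volatile char buf[256];
                buf[0] = 1;                     /* stack access */
                clock_nanosleep(CLOCK_MONOTONIC, 0, &delay, NULL);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid;
        pthread_attr_t attr;
        struct sched_param sp = { .sched_priority = 80 };

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &sp);
        pthread_create(&tid, &attr, rt_loop, NULL);

        /* non-rt side: load/unload a DSO in a loop and watch whether the
           rt thread gets kicked to secondary mode. */
        for (;;) {
                void *h = dlopen("./dummy.so", RTLD_NOW);
                if (h)
                        dlclose(h);
                usleep(10000);
        }
        return 0;
}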

At any rate, you may want to enable the function ftracer, enabling conditional
snapshots, e.g. when SIGXCPU is sent by the cobalt core. Guesswork with such a
bug is unlikely to uncover every aspect of the issue; hard data would be
required to go to the bottom of it. With a bit of luck, that bug is not
time-sensitive in a way that the overhead due to ftracing would paper over it.

-- 
Philippe.



RE: Still getting Deadlocks with condition variables

2020-06-08 Thread Lange Norbert via Xenomai


> -Original Message-
> From: Jan Kiszka 
> Sent: Montag, 8. Juni 2020 12:09
> To: Lange Norbert ; Xenomai
> (xenomai@xenomai.org) 
> Subject: Re: Still getting Deadlocks with condition variables
>
>
>
> On 08.06.20 11:48, Lange Norbert wrote:
> >
> >
> >> -Original Message-
> >> From: Jan Kiszka 
> >> Sent: Freitag, 5. Juni 2020 17:40
> >> To: Lange Norbert ; Xenomai
> >> (xenomai@xenomai.org) 
> >> Subject: Re: Still getting Deadlocks with condition variables
> >>
> >>
> >>
> >> On 05.06.20 16:36, Lange Norbert via Xenomai wrote:
> >>> Hello,
> >>>
> >>> I brought this up once or twice at this ML [1], I am still getting
> >>> some occasional lockups. Now for the first time without running under
> >>> a debugger.
> >>>
> >>> Hardware is a TQMxE39M (Goldmont Atom)
> >>> Kernel: 4.19.124-cip27-xeno12-static x86_64
> >>> I-pipe Version: 12
> >>> Xenomai Version: 3.1
> >>> Glibc Version 2.28
> >>>
> >>> What happens (as far as I understand it):
> >>>
> >>> The setup is a project with several cobalt threads (no "native" Linux
> >>> thread as far as I can tell, apart maybe from cobalt's printf thread).
> >>> They mostly sleep, and are triggered if work is available. The project
> >>> can also load DSOs (specialized maths) during the configuration stage -
> >>> this stage is when the exceptions occur.
> >>>
> >>>
> >>> 1.   Linux Thread LWP 682 calls SYS_futex "wake"
> >>>
> >>> Code immediately before syscall, file x86_64/lowlevellock.S:
> >>> movl    $0, (%rdi)
> >>> LOAD_FUTEX_WAKE (%esi)
> >>> movl    $1, %edx        /* Wake one thread.  */
> >>> movl    $SYS_futex, %eax
> >>> syscall
> >>>
> >>> 2. Xenomai switches a cobalt thread to secondary, potentially because all
> >> threads are in primary:
> >>>
> >>> Jun 05 12:35:19 buildroot kernel: [Xenomai] switching dispatcher to
> >>> secondary mode after exception #14 from user-space at 0x7fd731299115
> >>> (pid 681)
> >>
> >> #14 means page fault, fixable or real. What is at that address? What
> >> address was accessed by that instruction?
> >>
> >>>
> >>> Note that most threads are stuck waiting for a condvar in
> >> sc_cobalt_cond_wait_prologue (cond.c:313), LWP 681 is at the next
> >> instruction.
> >>>
> >>
> >> Stuck at what? Waiting for the condvar itself or getting the enclosing
> >> mutex again? What are the states of the involved synchronization objects?
> >
> > All mutexes are free. There is one task (Thread 2) pulling the mutexes
> > for the duration of signaling the condvars; this task should never block
> > outside of a sleep function giving it a 1ms cycle.
> > No deadlock is possible.
> >
> > What happens is that for some weird reason, Thread 1 got a sporadic
> > wakeup (handling a #PF from another thread?),
>
> PFs are synchronous, not proxied.
>
> As Philippe also pointed out, understanding that PF is the first step.
> Afterwards, we may look into the secondary issue, if there is still one,
> and that would be the behavior around the condvars after that PF.

As I told Philippe, there is no way I can imagine the thread holding the
condvar running into a PF.
This code is run several thousand times before the issue happens; it's
basically the outer loop that just waits for work. Stack is plenty (1M).

The dlopen call triggers it; it's always stuck in the same position.

Thread 9 (LWP 682):
#0  __lll_unlock_wake () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:339
#1  0x7fd731275d65 in __pthread_mutex_unlock_usercnt (mutex=0x7fd7312f6968 
<_rtld_local+2312>, decr=1) at pthread_mutex_unlock.c:54
#2  0x7fd7312e0442 in _dl_open (file=, mode=-2147483647, 
caller_dlopen=0x460864 , nsid=-2, argc=7, 
argv=0x7fd728680f90, env=0x7ffc1e972b88) at dl-open.c:627
#3  0x7fd7312c72ac in dlopen_doit (a=a@entry=0x7fd7286811d0) at dlopen.c:66
#4  0x7fd73104211f in __GI__dl_catch_exception 
(exception=exception@entry=0x7fd728681170, operate=operate@entry=0x7fd7312c7250 
, args=args@entry=0x7fd7286811d0) at dl-error-skeleton.c:196
#5  0x7fd731042190 in __GI__dl_catch_error 
(objname=objname@entry=0x7fd72005c010, 
errstring=errstri

Re: Still getting Deadlocks with condition variables

2020-06-08 Thread Jan Kiszka via Xenomai
On 08.06.20 11:48, Lange Norbert wrote:
> 
> 
>> -Original Message-
>> From: Jan Kiszka 
>> Sent: Freitag, 5. Juni 2020 17:40
>> To: Lange Norbert ; Xenomai
>> (xenomai@xenomai.org) 
>> Subject: Re: Still getting Deadlocks with condition variables
>>
>>
>>
>> On 05.06.20 16:36, Lange Norbert via Xenomai wrote:
>>> Hello,
>>>
>>> I brought this up once or twice at this ML [1], I am still getting
>>> some occasional lockups. Now for the first time without running under
>>> a debugger.
>>>
>>> Hardware is a TQMxE39M (Goldmont Atom)
>>> Kernel: 4.19.124-cip27-xeno12-static x86_64
>>> I-pipe Version: 12
>>> Xenomai Version: 3.1
>>> Glibc Version 2.28
>>>
>>> What happens (as far as I understand it):
>>>
>>> The setup is a project with several cobalt threads (no "native" Linux
>>> thread as far as I can tell, apart maybe from cobalt's printf thread).
>>> They mostly sleep, and are triggered if work is available. The project
>>> can also load DSOs (specialized maths) during the configuration stage -
>>> this stage is when the exceptions occur.
>>>
>>>
>>> 1.   Linux Thread LWP 682 calls SYS_futex "wake"
>>>
>>> Code immediately before syscall, file x86_64/lowlevellock.S:
>>> movl    $0, (%rdi)
>>> LOAD_FUTEX_WAKE (%esi)
>>> movl    $1, %edx        /* Wake one thread.  */
>>> movl    $SYS_futex, %eax
>>> syscall
>>>
>>> 2. Xenomai switches a cobalt thread to secondary, potentially because all
>> threads are in primary:
>>>
>>> Jun 05 12:35:19 buildroot kernel: [Xenomai] switching dispatcher to
>>> secondary mode after exception #14 from user-space at 0x7fd731299115
>>> (pid 681)
>>
>> #14 means page fault, fixable or real. What is at that address? What address
>> was accessed by that instruction?
>>
>>>
>>> Note that most threads are stuck waiting for a condvar in
>> sc_cobalt_cond_wait_prologue (cond.c:313), LWP 681 is at the next
>> instruction.
>>>
>>
>> Stuck at what? Waiting for the condvar itself or getting the enclosing mutex
>> again? What are the states of the involved synchronization objects?
> 
> All mutexes are free. There is one task (Thread 2) pulling the mutexes for
> the duration of signaling the condvars; this task should never block
> outside of a sleep function giving it a 1ms cycle.
> No deadlock is possible.
> 
> What happens is that for some weird reason, Thread 1 got a sporadic wakeup
> (handling a #PF from another thread?),

PFs are synchronous, not proxied.

As Philippe also pointed out, understanding that PF is the first step.
Afterwards, we may look into the secondary issue, if there is still one,
and that would be the behavior around the condvars after that PF.

Jan

> acquires the mutex and then either gets demoted to Linux and causes an XCPU
> signal (if that check is enabled),
> or is stuck at sc_cobalt_cond_wait_epilogue infinitely.
> 
> Then Thread 2 will logically be stuck at re-acquiring the mutex.
> 
> I have an alternative implementation using semaphores instead of condvars; I
> think I have never seen this issue crop up there.
> 
>>
>> Jan
>>
>>> 3. Xenomai gets XCPU signal -> coredump
>>>
>>> gdb) thread apply all bt 3
>>>
>>> Thread 9 (LWP 682):
>>> #0  __lll_unlock_wake () at
>>> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:339
>>> #1  0x7fd731275d65 in __pthread_mutex_unlock_usercnt
>>> (mutex=0x7fd7312f6968 <_rtld_global+2312>, decr=1) at
>>> pthread_mutex_unlock.c:54
>>> #2  0x7fd7312e0442 in ?? () from
>>> /home/lano/Downloads/bugcrash/lib64/ld-linux-x86-64.so.2
>>> #3  0x7fd7312c72ac in ?? () from /lib/libdl.so.2
>>> #4  0x7fd73104211f in _dl_catch_exception () from /lib/libc.so.6
>>> #5  0x7fd731042190 in _dl_catch_error () from /lib/libc.so.6
>>> #6  0x7fd7312c7975 in ?? () from /lib/libdl.so.2
>>> #7  0x7fd7312c7327 in dlopen () from /lib/libdl.so.2 (More stack
>>> frames follow...)
>>>
>>> Thread 8 (LWP 686):
>>> #0  0x7fd731298d48 in __cobalt_clock_nanosleep (clock_id=0,
>>> flags=0, rqtp=0x7fd727e3ad10, rmtp=0x0) at
>>> /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/clock.c:312
>>> #1  0x7fd731298d81 in __cobalt_nanosleep (rqtp=,
>>> rmtp=) at
>>> /opt/hipase2/src/xenomai-3.1.

RE: Still getting Deadlocks with condition variables

2020-06-08 Thread Lange Norbert via Xenomai


> -Original Message-
> From: Philippe Gerum 
> Sent: Sonntag, 7. Juni 2020 22:16
> To: Lange Norbert ; Xenomai
> (xenomai@xenomai.org) 
> Subject: Re: Still getting Deadlocks with condition variables
>
>
>
> On 6/5/20 4:36 PM, Lange Norbert wrote:
> > Hello,
> >
> > I brought this up once or twice at this ML [1], I am still getting
> > some occasional lockups. Now for the first time without running under
> > a debugger.
> >
> > Hardware is a TQMxE39M (Goldmont Atom)
> > Kernel: 4.19.124-cip27-xeno12-static x86_64
> > I-pipe Version: 12
> > Xenomai Version: 3.1
> > Glibc Version 2.28
> >
> > What happens (as far as I understand it):
> >
> > The setup is a project with several cobalt threads (no "native" Linux
> > thread as far as I can tell, apart maybe from cobalt's printf thread).
> > They mostly sleep, and are triggered if work is available. The project
> > can also load DSOs (specialized maths) during the configuration stage -
> > this stage is when the exceptions occur.
> >
> >
> > 1.   Linux Thread LWP 682 calls SYS_futex "wake"
> >
> > Code immediately before syscall, file x86_64/lowlevellock.S:
> > movl    $0, (%rdi)
> > LOAD_FUTEX_WAKE (%esi)
> > movl    $1, %edx        /* Wake one thread.  */
> > movl    $SYS_futex, %eax
> > syscall
> >
> > 2. Xenomai switches a cobalt thread to secondary, potentially because all
> threads are in primary:
> >
> > Jun 05 12:35:19 buildroot kernel: [Xenomai] switching dispatcher to
> > secondary mode after exception #14 from user-space at 0x7fd731299115
> > (pid 681)
> >
>
> This kernel message tells a different story, thread pid 681 received a #PF,
> maybe due to accessing its own stack (cond.c, line 316). This may be a minor
> fault though, nothing invalid. Such fault is not supposed to occur for Xenomai
> threads on x86, but that would be another issue. Code-wise, I'm referring to
> the current state of the master branch for lib/cobalt/cond.c, which seems to
> match your description.

I don't know what you mean by 'minor fault', from the perspective of Linux?
An RT thread getting demoted to Linux is rather serious to me.

Also, the thing is that I would not know how a PF should be possible in the
long-running thread, with locked memory, with the call being close to the
thread entry point in a wait-for-condvar loop, never using more than an
insignificant amount of stack at this time.

On the other hand, the non-RT thread loads a DSO and is stuck somewhere after 
allocating memory.
My guess would be that the PF ends up at the wrong thread.

Note that both tasks are locked to the same CPU core.

>
> > Note that most threads are stuck waiting for a condvar in
> sc_cobalt_cond_wait_prologue (cond.c:313), LWP 681 is at the next
> instruction.
> >
> > 3. Xenomai gets XCPU signal -> coredump
> >
>
> More precisely, Xenomai is likely sending this signal to your application, 
> since
> it had to switch pid 681 to secondary mode for fixing up the #PF event.
> You may have set PTHREAD_WARNSW with pthread_setmode_np() for that
> thread.

Yes, I use PTHREAD_WARNSW. If I did not, then chances are that the code
would run to the sc_cobalt_cond_wait_epilogue, never freeing the mutex, and
the other thread trying to send a signal would never be able to acquire the
mutex.
I.e. identical to my previous reports (where PTHREAD_WARNSW was disabled).
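For what it's worth, the SIGXCPU can also be caught instead of letting it
core-dump, which makes it easier to see where a migration happened while the
process keeps running. A sketch (backtrace() is not strictly
async-signal-safe, hence the priming call at install time):

#include <execinfo.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Handler for the SIGXCPU that Cobalt sends when a PTHREAD_WARNSW thread
   migrates to secondary mode: dump the offending thread's backtrace to
   stderr instead of core-dumping. */
static void warnsw_handler(int sig)
{
        void *frames[32];
        int n = backtrace(frames, 32);
        backtrace_symbols_fd(frames, n, STDERR_FILENO);
}

static void install_warnsw_handler(void)
{
        struct sigaction sa;
        void *prime[1];

        backtrace(prime, 1);            /* pre-load libgcc before rt runs */
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = warnsw_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGXCPU, &sa, NULL);
}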

>
> > gdb) thread apply all bt 3
> >
> > Thread 9 (LWP 682):
> > #0  __lll_unlock_wake () at
> > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:339
> > #1  0x7fd731275d65 in __pthread_mutex_unlock_usercnt
> > (mutex=0x7fd7312f6968 <_rtld_global+2312>, decr=1) at
> > pthread_mutex_unlock.c:54
> > #2  0x7fd7312e0442 in ?? () from
> > /home/lano/Downloads/bugcrash/lib64/ld-linux-x86-64.so.2
> > #3  0x7fd7312c72ac in ?? () from /lib/libdl.so.2
> > #4  0x7fd73104211f in _dl_catch_exception () from /lib/libc.so.6
> > #5  0x7fd731042190 in _dl_catch_error () from /lib/libc.so.6
> > #6  0x7fd7312c7975 in ?? () from /lib/libdl.so.2
> > #7  0x7fd7312c7327 in dlopen () from /lib/libdl.so.2 (More stack
> > frames follow...)
> >
> > Thread 8 (LWP 686):
> > #0  0x7fd731298d48 in __cobalt_clock_nanosleep (clock_id=0,
> > flags=0, rqtp=0x7fd727e3ad10, rmtp=0x0) at
> > /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/clock.c:312
> > #1  0x7fd731298d81 in __cobalt_nanosleep (rqtp=,
> > rmtp=) at
> > /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/clock.c:354
> > #2  0x00434590

RE: Still getting Deadlocks with condition variables

2020-06-08 Thread Lange Norbert via Xenomai


> -Original Message-
> From: Jan Kiszka 
> Sent: Freitag, 5. Juni 2020 17:40
> To: Lange Norbert ; Xenomai
> (xenomai@xenomai.org) 
> Subject: Re: Still getting Deadlocks with condition variables
>
>
>
> On 05.06.20 16:36, Lange Norbert via Xenomai wrote:
> > Hello,
> >
> > I brought this up once or twice at this ML [1], I am still getting
> > some occasional lockups. Now for the first time without running under
> > a debugger.
> >
> > Hardware is a TQMxE39M (Goldmont Atom)
> > Kernel: 4.19.124-cip27-xeno12-static x86_64
> > I-pipe Version: 12
> > Xenomai Version: 3.1
> > Glibc Version 2.28
> >
> > What happens (as far as I understand it):
> >
> > The setup is a project with several cobalt threads (no "native" Linux
> > thread as far as I can tell, apart maybe from cobalt's printf thread).
> > They mostly sleep, and are triggered if work is available. The project
> > can also load DSOs (specialized maths) during the configuration stage -
> > this stage is when the exceptions occur.
> >
> >
> > 1.   Linux Thread LWP 682 calls SYS_futex "wake"
> >
> > Code immediately before syscall, file x86_64/lowlevellock.S:
> > movl    $0, (%rdi)
> > LOAD_FUTEX_WAKE (%esi)
> > movl    $1, %edx        /* Wake one thread.  */
> > movl    $SYS_futex, %eax
> > syscall
> >
> > 2. Xenomai switches a cobalt thread to secondary, potentially because all
> threads are in primary:
> >
> > Jun 05 12:35:19 buildroot kernel: [Xenomai] switching dispatcher to
> > secondary mode after exception #14 from user-space at 0x7fd731299115
> > (pid 681)
>
> #14 means page fault, fixable or real. What is at that address? What address
> was accessed by that instruction?
>
> >
> > Note that most threads are stuck waiting for a condvar in
> sc_cobalt_cond_wait_prologue (cond.c:313), LWP 681 is at the next
> instruction.
> >
>
> Stuck at what? Waiting for the condvar itself or getting the enclosing mutex
> again? What are the states of the involved synchronization objects?

All mutexes are free. There is one task (Thread 2) pulling the mutexes for
the duration of signaling the condvars; this task should never block outside
of a sleep function giving it a 1ms cycle.
No deadlock is possible.

What happens is that for some weird reason, Thread 1 got a sporadic wakeup
(handling a #PF from another thread?), acquires the mutex, and then either
gets demoted to Linux and causes an XCPU signal (if that check is enabled),
or is stuck at sc_cobalt_cond_wait_epilogue infinitely.

Then Thread 2 will logically be stuck at re-acquiring the mutex.

I have an alternative implementation using semaphores instead of condvars; I
think I have never seen this issue crop up there.
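That would fit: the semaphore variant has no mutex hand-off at all, since
sem_post() never blocks on state the waiter holds. A sketch of the pattern
(names illustrative, not the actual project code):

#include <errno.h>
#include <semaphore.h>

static sem_t work_ready;

static void init_work(void)
{
        sem_init(&work_ready, 0, 0);    /* process-private, count 0 */
}

/* producer side (1ms cycle): posting cannot be stalled by whatever state
   the waiting thread happens to be in */
static void signal_work(void)
{
        sem_post(&work_ready);
}

/* consumer side (rt thread): block until work has been posted */
static void wait_for_work(void)
{
        while (sem_wait(&work_ready) == -1 && errno == EINTR)
                ;                       /* retry on signal interruption */
}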

>
> Jan
>
> > 3. Xenomai gets XCPU signal -> coredump
> >
> > gdb) thread apply all bt 3
> >
> > Thread 9 (LWP 682):
> > #0  __lll_unlock_wake () at
> > ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:339
> > #1  0x7fd731275d65 in __pthread_mutex_unlock_usercnt
> > (mutex=0x7fd7312f6968 <_rtld_global+2312>, decr=1) at
> > pthread_mutex_unlock.c:54
> > #2  0x7fd7312e0442 in ?? () from
> > /home/lano/Downloads/bugcrash/lib64/ld-linux-x86-64.so.2
> > #3  0x7fd7312c72ac in ?? () from /lib/libdl.so.2
> > #4  0x7fd73104211f in _dl_catch_exception () from /lib/libc.so.6
> > #5  0x7fd731042190 in _dl_catch_error () from /lib/libc.so.6
> > #6  0x7fd7312c7975 in ?? () from /lib/libdl.so.2
> > #7  0x7fd7312c7327 in dlopen () from /lib/libdl.so.2 (More stack
> > frames follow...)
> >
> > Thread 8 (LWP 686):
> > #0  0x7fd731298d48 in __cobalt_clock_nanosleep (clock_id=0,
> > flags=0, rqtp=0x7fd727e3ad10, rmtp=0x0) at
> > /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/clock.c:312
> > #1  0x7fd731298d81 in __cobalt_nanosleep (rqtp=,
> > rmtp=) at
> > /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/clock.c:354
> > #2  0x00434590 in operator() (__closure=0x7fd720006fb8) at
> > ../../acpu.runner/asim/asim_com.cpp:685
> > (More stack frames follow...)
> >
> > Thread 7 (LWP 677):
> > #0  0x7fd73127b6c6 in __GI___nanosleep
> > (requested_time=requested_time@entry=0x7fd7312b1fb0 ,
> > remaining=remaining@entry=0x0) at
> > ../sysdeps/unix/sysv/linux/nanosleep.c:28
> > #1  0x7fd73129b746 in printer_loop (arg=) at
> > /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/printf.c:635
> > #2  0x7fd7312720f7 in start_thread (arg=) at
> > pthread_create.c:486 (More stack 

Re: Still getting Deadlocks with condition variables

2020-06-07 Thread Philippe Gerum via Xenomai
On 6/5/20 4:36 PM, Lange Norbert wrote:
> Hello,
> 
> I brought this up once or twice at this ML [1], I am still getting some
> occasional lockups. Now for the first time without running under a debugger.
>
> Hardware is a TQMxE39M (Goldmont Atom)
> Kernel: 4.19.124-cip27-xeno12-static x86_64
> I-pipe Version: 12
> Xenomai Version: 3.1
> Glibc Version 2.28
> 
> What happens (as far as I understand it):
> 
> The setup is a project with several cobalt threads (no "native" Linux thread
> as far as I can tell, apart maybe from cobalt's printf thread).
> They mostly sleep, and are triggered if work is available. The project can
> also load DSOs (specialized maths) during the configuration stage - this
> stage is when the exceptions occur.
> 
> 
> 1.   Linux Thread LWP 682 calls SYS_futex "wake"
> 
> Code immediately before syscall, file x86_64/lowlevellock.S:
> movl    $0, (%rdi)
> LOAD_FUTEX_WAKE (%esi)
> movl    $1, %edx        /* Wake one thread.  */
> movl    $SYS_futex, %eax
> syscall
> 
> 2. Xenomai switches a cobalt thread to secondary, potentially because all 
> threads are in primary:
> 
> Jun 05 12:35:19 buildroot kernel: [Xenomai] switching dispatcher to secondary 
> mode after exception #14 from user-space at 0x7fd731299115 (pid 681)
> 

This kernel message tells a different story, thread pid 681 received a #PF,
maybe due to accessing its own stack (cond.c, line 316). This may be a minor
fault though, nothing invalid. Such fault is not supposed to occur for Xenomai
threads on x86, but that would be another issue. Code-wise, I'm referring to
the current state of the master branch for lib/cobalt/cond.c, which seems to
match your description.

> Note that most threads are stuck waiting for a condvar in 
> sc_cobalt_cond_wait_prologue (cond.c:313), LWP 681 is at the next instruction.
> 
> 3. Xenomai gets XCPU signal -> coredump
> 

More precisely, Xenomai is likely sending this signal to your application,
since it had to switch pid 681 to secondary mode for fixing up the #PF event.
You may have set PTHREAD_WARNSW with pthread_setmode_np() for that thread.

> gdb) thread apply all bt 3
> 
> Thread 9 (LWP 682):
> #0  __lll_unlock_wake () at 
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:339
> #1  0x7fd731275d65 in __pthread_mutex_unlock_usercnt 
> (mutex=0x7fd7312f6968 <_rtld_global+2312>, decr=1) at 
> pthread_mutex_unlock.c:54
> #2  0x7fd7312e0442 in ?? () from 
> /home/lano/Downloads/bugcrash/lib64/ld-linux-x86-64.so.2
> #3  0x7fd7312c72ac in ?? () from /lib/libdl.so.2
> #4  0x7fd73104211f in _dl_catch_exception () from /lib/libc.so.6
> #5  0x7fd731042190 in _dl_catch_error () from /lib/libc.so.6
> #6  0x7fd7312c7975 in ?? () from /lib/libdl.so.2
> #7  0x7fd7312c7327 in dlopen () from /lib/libdl.so.2
> (More stack frames follow...)
> 
> Thread 8 (LWP 686):
> #0  0x7fd731298d48 in __cobalt_clock_nanosleep (clock_id=0, flags=0, 
> rqtp=0x7fd727e3ad10, rmtp=0x0) at 
> /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/clock.c:312
> #1  0x7fd731298d81 in __cobalt_nanosleep (rqtp=, 
> rmtp=) at /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/clock.c:354
> #2  0x00434590 in operator() (__closure=0x7fd720006fb8) at 
> ../../acpu.runner/asim/asim_com.cpp:685
> (More stack frames follow...)
> 
> Thread 7 (LWP 677):
> #0  0x7fd73127b6c6 in __GI___nanosleep 
> (requested_time=requested_time@entry=0x7fd7312b1fb0 , 
> remaining=remaining@entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
> #1  0x7fd73129b746 in printer_loop (arg=) at 
> /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/printf.c:635
> #2  0x7fd7312720f7 in start_thread (arg=) at 
> pthread_create.c:486
> (More stack frames follow...)
> 
> Thread 6 (LWP 685):
> #0  0x7fd73129910a in __cobalt_pthread_cond_wait (cond=0x7fd72f269660, 
> mutex=0x7fd72f269630) at /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/cond.c:313
> #1  0x0046377c in conditionvar_wait (pData=0x7fd72f269660, 
> pMutex=0x7fd72f269630) at ../../alib/src/alib/posix/conditionvar.c:66
> #2  0x0040a620 in HIPASE::Posix::CAlib_ConditionVariable::wait 
> (this=0x7fd72f269660, lock=...) at 
> ../../alib/include/alib/alib_conditionvar_posix.h:67
> (More stack frames follow...)
> 
> Thread 5 (LWP 684):
> #0  0x7fd73129910a in __cobalt_pthread_cond_wait (cond=0x7fd72f267790, 
> mutex=0x7fd72f267760) at /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/cond.c:313
> #1  0x0046377c in conditionvar_wait (pData=0x7fd72f267790, 
> pMutex=0x7fd72f267760) at ../../alib/src/alib/posix/conditionvar.c:66
> #2  0x0040a620 in HIPASE::Posix::CAlib_ConditionVariable::wait 
> (this=0x7fd72f267790, lock=...) at 
> ../../alib/include/alib/alib_conditionvar_posix.h:67
> (More stack frames follow...)
> 
> Thread 4 (LWP 680):
> #0  0x7fd73129910a in __cobalt_pthread_cond_wait (cond=0xfeafa0 
> <(anonymous namespace)::m_MainTaskStart>, mutex=0xfeaf60 <(anonymous 
> namespace)::m_TaskMutex>) at 
> /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/cond.c:313
> #1 

Re: Still getting Deadlocks with condition variables

2020-06-05 Thread Jan Kiszka via Xenomai
On 05.06.20 16:36, Lange Norbert via Xenomai wrote:
> Hello,
> 
> I brought this up once or twice at this ML [1], I am still getting some
> occasional lockups. Now for the first time without running under a debugger.
>
> Hardware is a TQMxE39M (Goldmont Atom)
> Kernel: 4.19.124-cip27-xeno12-static x86_64
> I-pipe Version: 12
> Xenomai Version: 3.1
> Glibc Version 2.28
> 
> What happens (as far as I understand it):
> 
> The setup is a project with several cobalt threads (no "native" Linux thread
> as far as I can tell, apart maybe from cobalt's printf thread).
> They mostly sleep, and are triggered if work is available. The project can
> also load DSOs (specialized maths) during the configuration stage - this
> stage is when the exceptions occur.
> 
> 
> 1.   Linux Thread LWP 682 calls SYS_futex "wake"
> 
> Code immediately before syscall, file x86_64/lowlevellock.S:
> movl    $0, (%rdi)
> LOAD_FUTEX_WAKE (%esi)
> movl    $1, %edx        /* Wake one thread.  */
> movl    $SYS_futex, %eax
> syscall
> 
> 2. Xenomai switches a cobalt thread to secondary, potentially because all 
> threads are in primary:
> 
> Jun 05 12:35:19 buildroot kernel: [Xenomai] switching dispatcher to secondary 
> mode after exception #14 from user-space at 0x7fd731299115 (pid 681)

#14 means page fault, fixable or real. What is at that address? What
address was accessed by that instruction?

> 
> Note that most threads are stuck waiting for a condvar in 
> sc_cobalt_cond_wait_prologue (cond.c:313), LWP 681 is at the next instruction.
> 

Stuck at what? Waiting for the condvar itself or getting the enclosing
mutex again? What are the states of the involved synchronization objects?

Jan

> 3. Xenomai gets XCPU signal -> coredump
> 
> gdb) thread apply all bt 3
> 
> Thread 9 (LWP 682):
> #0  __lll_unlock_wake () at 
> ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:339
> #1  0x7fd731275d65 in __pthread_mutex_unlock_usercnt 
> (mutex=0x7fd7312f6968 <_rtld_global+2312>, decr=1) at 
> pthread_mutex_unlock.c:54
> #2  0x7fd7312e0442 in ?? () from 
> /home/lano/Downloads/bugcrash/lib64/ld-linux-x86-64.so.2
> #3  0x7fd7312c72ac in ?? () from /lib/libdl.so.2
> #4  0x7fd73104211f in _dl_catch_exception () from /lib/libc.so.6
> #5  0x7fd731042190 in _dl_catch_error () from /lib/libc.so.6
> #6  0x7fd7312c7975 in ?? () from /lib/libdl.so.2
> #7  0x7fd7312c7327 in dlopen () from /lib/libdl.so.2
> (More stack frames follow...)
> 
> Thread 8 (LWP 686):
> #0  0x7fd731298d48 in __cobalt_clock_nanosleep (clock_id=0, flags=0, 
> rqtp=0x7fd727e3ad10, rmtp=0x0) at 
> /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/clock.c:312
> #1  0x7fd731298d81 in __cobalt_nanosleep (rqtp=, 
> rmtp=) at /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/clock.c:354
> #2  0x00434590 in operator() (__closure=0x7fd720006fb8) at 
> ../../acpu.runner/asim/asim_com.cpp:685
> (More stack frames follow...)
> 
> Thread 7 (LWP 677):
> #0  0x7fd73127b6c6 in __GI___nanosleep 
> (requested_time=requested_time@entry=0x7fd7312b1fb0 , 
> remaining=remaining@entry=0x0) at ../sysdeps/unix/sysv/linux/nanosleep.c:28
> #1  0x7fd73129b746 in printer_loop (arg=) at 
> /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/printf.c:635
> #2  0x7fd7312720f7 in start_thread (arg=) at 
> pthread_create.c:486
> (More stack frames follow...)
> 
> Thread 6 (LWP 685):
> #0  0x7fd73129910a in __cobalt_pthread_cond_wait (cond=0x7fd72f269660, 
> mutex=0x7fd72f269630) at /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/cond.c:313
> #1  0x0046377c in conditionvar_wait (pData=0x7fd72f269660, 
> pMutex=0x7fd72f269630) at ../../alib/src/alib/posix/conditionvar.c:66
> #2  0x0040a620 in HIPASE::Posix::CAlib_ConditionVariable::wait 
> (this=0x7fd72f269660, lock=...) at 
> ../../alib/include/alib/alib_conditionvar_posix.h:67
> (More stack frames follow...)
> 
> Thread 5 (LWP 684):
> #0  0x7fd73129910a in __cobalt_pthread_cond_wait (cond=0x7fd72f267790, 
> mutex=0x7fd72f267760) at /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/cond.c:313
> #1  0x0046377c in conditionvar_wait (pData=0x7fd72f267790, 
> pMutex=0x7fd72f267760) at ../../alib/src/alib/posix/conditionvar.c:66
> #2  0x0040a620 in HIPASE::Posix::CAlib_ConditionVariable::wait 
> (this=0x7fd72f267790, lock=...) at 
> ../../alib/include/alib/alib_conditionvar_posix.h:67
> (More stack frames follow...)
> 
> Thread 4 (LWP 680):
> #0  0x7fd73129910a in __cobalt_pthread_cond_wait (cond=0xfeafa0 
> <(anonymous namespace)::m_MainTaskStart>, mutex=0xfeaf60 <(anonymous 
> namespace)::m_TaskMutex>) at 
> /opt/hipase2/src/xenomai-3.1.0/lib/cobalt/cond.c:313
> #1  0x0046377c in conditionvar_wait (pData=0xfeafa0 <(anonymous 
> namespace)::m_MainTaskStart>, pMutex=0xfeaf60 <(anonymous 
> namespace)::m_TaskMutex>) at ../../alib/src/alib/posix/conditionvar.c:66
> #2  0x0040a620 in HIPASE::Posix::CAlib_ConditionVariable::wait 
> (this=0xfeafa0 <(anonymous namespace)::m_MainTaskStart>, lock=...) at 
> ../../alib/