> -----Original Message-----
> From: Philippe Gerum <r...@xenomai.org>
> Sent: Monday, June 8, 2020 16:17
> To: Lange Norbert <norbert.la...@andritz.com>; Xenomai
> (xenomai@xenomai.org) <xenomai@xenomai.org>
> Subject: Re: Still getting Deadlocks with condition variables
>
> On 6/8/20 12:08 PM, Lange Norbert wrote:
> >
> >> This kernel message tells a different story, thread pid 681 received
> >> a #PF, maybe due to accessing its own stack (cond.c, line 316). This
> >> may be a minor fault though, nothing invalid. Such fault is not
> >> supposed to occur for Xenomai threads on x86, but that would be
> >> another issue. Code-wise, I'm referring to the current state of the
> >> master branch for lib/cobalt/cond.c, which seems to match your
> >> description.
> >
> > I don't know what you mean with minor fault, from the perspective of
> > Linux? A RT thread getting demoted to Linux is rather serious to me.
>
> Minor from a MMU standpoint: the memory the CPU dereferenced is valid
> but no page table entry currently maps it. So, #PF in this case seems
> to be a 'minor' fault in MMU lingo, but it is still not expected.
>
> > Also, the thing is that I would not know how a PF in the long-running
> > thread should be possible: memory is locked, the call is close to the
> > thread entry point in a wait-for-condvar loop, and no more than an
> > insignificant amount of stack is in use at this time.
>
> Except if mapping an executable segment via dlopen() comes into play,
> affecting the page table. Only an assumption at this stage.
>
> > On the other hand, the non-RT thread loads a DSO and is stuck
> > somewhere after allocating memory. My guess would be that the PF
> > ends up at the wrong thread.
>
> As Jan pointed out, #PF are synchronously taken, synchronously
> handled. I really don't see how #PF handling could ever wander.
>
> >> You refer to an older post describing a lockup, but this post
> >> describes an application crashing with a core dump. What made you
> >> draw the conclusion that the same bug would be at work?
> >
> > Same bug, different PTHREAD_WARNSW setting is my guess.
> > The underlying issue is that an unrelated signal ends up at a RT
> > thread.
> >
> >> Also, could you give some details regarding the following:
> >>
> >> - what do you mean by 'lockup' in this case? Can you still access
> >> the board or is there some runaway real-time code locking out
> >> everything else when this happens? My understanding is that this is
> >> no hard lockup, otherwise the watchdog would have triggered. If this
> >> is a softer kind of lockup instead, what does
> >> /proc/xenomai/sched/stat tell you about the thread states after the
> >> problem occurred?
> >
> > This was a post-mortem, no access to /proc/xenomai/sched/stat
> > anymore. Lockup means deadlock (the thread getting the signal holds
> > a mutex, but is stuck). Coredump happens if PTHREAD_WARNSW is
> > enabled (meaning it asserts out before).
> >
> >> - did you determine that using the dynamic linker is required to
> >> trigger the bug yet? Or could you observe it without such
> >> interaction with dl?
> >
> > AFAIK, it always occurred at the stage where we load a
> > "configuration" and load DSOs.
> >
> >> - what is the typical size of your Xenomai thread stack? It
> >> defaults to 64k min with Xenomai 3.1.
> >
> > 1MB
>
> I would dig into the following distinct issues:
>
> - why a #PF is taken on an apparently innocuous instruction.
> dlopen(3)->mmap(2) might be involved. With a simple test case, you
> could check the impact of loading/unloading DSOs on memory management
> for real-time threads running in parallel. Setting the WARNSW bit on
> for these threads would be required.
>
> - whether dealing with a signal adversely affects the wait-side of a
> Xenomai condvar. There is a specific trick to handle this in the
> Cobalt and libcobalt code, which is the reason for the wait_prologue /
> wait_epilogue dance in the implementation IIRC. Understanding why that
> thread receives a signal in the first place would help too. According
> to your description, this may not be directly due to taking #PF, but
> may be an indirect consequence of that event on sibling threads
> (propagation of a debug condition of some sort, such as those detected
> by CONFIG_XENO_OPT_DEBUG_MUTEX*).
>
> At any rate, you may want to enable the function ftracer with
> conditional snapshots, e.g. when SIGXCPU is sent by the cobalt core.
> Guesswork with such a bug is unlikely to uncover every aspect of the
> issue; hard data would be required to get to the bottom of it. With a
> bit of luck, that bug is not time-sensitive in a way that the overhead
> due to ftracing would paper over it.
>
> --
> Philippe.
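Picking up your test-case suggestion first, below is roughly what I am
running: a minimal, untested sketch. A few SCHED_FIFO threads with the
WARNSW bit set keep touching their stacks while the main thread cycles
dlopen()/dlclose() on a DSO. The ./plugin.so path and thread count are
placeholders; pthread_setmode_np()/PTHREAD_WARNSW are the libcobalt
extensions, and SIGDEBUG is cobalt's alias for SIGXCPU.

/* dso-stress.c: build with
 *   gcc -o dso-stress dso-stress.c \
 *       $(xeno-config --posix --cflags --ldflags) -ldl
 * RT threads with PTHREAD_WARNSW set touch their stacks in a loop
 * while the main (non-RT) thread loads/unloads a DSO. Any migration
 * to secondary mode gets us SIGDEBUG and a core dump.
 */
#include <pthread.h>
#include <dlfcn.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef SIGDEBUG
#define SIGDEBUG SIGXCPU	/* cobalt maps SIGDEBUG onto SIGXCPU */
#endif

static void sigdebug_handler(int sig)
{
	abort();		/* dump core at the offending spot */
}

static void *rt_thread(void *arg)
{
	char scratch[512];

	/* Ask the cobalt core to warn about switches to secondary mode. */
	pthread_setmode_np(0, PTHREAD_WARNSW, NULL);

	for (;;) {
		/* Touch the stack; if the loader flipped our stack page
		   permissions behind our back, this is where the #PF
		   should bite. */
		memset(scratch, 0xa5, sizeof(scratch));
		usleep(1000);	/* stay mostly idle, like the condvar waiters */
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	const char *dso = argc > 1 ? argv[1] : "./plugin.so"; /* placeholder */
	struct sched_param sp = { .sched_priority = 80 };
	pthread_attr_t attr;
	pthread_t tid[4];
	int i;

	mlockall(MCL_CURRENT | MCL_FUTURE);
	signal(SIGDEBUG, sigdebug_handler);

	pthread_attr_init(&attr);
	pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
	pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
	pthread_attr_setschedparam(&attr, &sp);

	for (i = 0; i < 4; i++)
		pthread_create(&tid[i], &attr, rt_thread, NULL);

	for (;;) {
		void *h = dlopen(dso, RTLD_NOW);
		if (h)
			dlclose(h);
		usleep(10000);
	}
	return 0;
}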
This isn't exactly easy to reproduce, but I just now managed to get
something that reproduces it fairly often. Tracing, however, hides the
issue, as does disabling PTHREAD_WARNSW (could be just that this
changes the timing enough to make a difference).

I got a few instances where the thread loading DSOs is stuck in an
ominous __make_stacks_executable, which seemingly iterates through
*all* thread stacks and calls __mprotect on them. If that's the cause,
and if the cobalt threads use those same stacks, and if the syscall
does something funny like taking away write protection in between,
then this could be the explanation (I don't know how this could ever
be valid, though).

int
__make_stacks_executable (void **stack_endp)
{
  /* First the main thread's stack.  */
  int err = _dl_make_stack_executable (stack_endp);
  if (err != 0)
    return err;

#ifdef NEED_SEPARATE_REGISTER_STACK
  const size_t pagemask = ~(__getpagesize () - 1);
#endif

  lll_lock (stack_cache_lock, LLL_PRIVATE);

  list_t *runp;
  list_for_each (runp, &stack_used)
    {
      err = change_stack_perm (list_entry (runp, struct pthread, list)
#ifdef NEED_SEPARATE_REGISTER_STACK
                               , pagemask
#endif
                               );
      if (err != 0)
        break;
    }

  /* [snip: the function goes on with the same loop over the cached,
     unused stacks, then lll_unlock and return.] */

regards, Norbert
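PS: if I read glibc right, this path should only ever be entered when
a loaded object (or one of its dependencies) carries an executable
PT_GNU_STACK, e.g. something built with -z execstack. A quick way to
confirm the permission flip from userland is the untested sketch below;
it only checks the main thread's stack (the RT thread stacks are what
matters in our case, but _dl_make_stack_executable hits the main stack
first, so it serves as a tell-tale). ./plugin.so is again a
placeholder.

/* stackperm.c: dump the /proc/self/maps entry containing a stack
 * address before and after dlopen(), to see whether the loader turned
 * the stack rwx. Build: gcc -o stackperm stackperm.c -ldl
 */
#include <dlfcn.h>
#include <stdio.h>

static void dump_mapping_of(const void *addr)
{
	unsigned long p = (unsigned long)addr, lo, hi;
	char line[256];
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "%lx-%lx", &lo, &hi) == 2
		    && p >= lo && p < hi) {
			fputs(line, stderr);	/* perms column: rw-p vs rwxp */
			break;
		}
	fclose(f);
}

int main(int argc, char *argv[])
{
	int probe;			/* lives on the main thread's stack */

	dump_mapping_of(&probe);	/* expect rw-p here */
	void *h = dlopen(argc > 1 ? argv[1] : "./plugin.so", RTLD_NOW);
	dump_mapping_of(&probe);	/* rwxp here means execstack was forced */
	if (h)
		dlclose(h);
	return 0;
}

If the mapping goes from rw-p to rwxp across the dlopen(), the culprit
DSO is identified.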