Re: Cobalt deadlock for no apparent reason

Jan Kiszka via Xenomai Wed, 22 Jan 2020 09:21:16 -0800

On 22.01.20 11:11, Lange Norbert wrote:

-----Original Message-----
From: Jan Kiszka <jan.kis...@siemens.com>
Sent: Tuesday, 21 January 2020 18:46
To: Lange Norbert <norbert.la...@andritz.com>; Xenomai
(xenomai@xenomai.org) <xenomai@xenomai.org>
Subject: Re: Cobalt deadlock for no apparent reason


On 20.01.20 19:03, Lange Norbert via Xenomai wrote:

Hello,

I got a deadlock while running through gdbserver. This is an
implementation of a synchronized queue: the Fup side waits via a
condition variable, main wants to push data, but main fails to acquire
the mutex. The mutex is an error-checking type, without priority
inheritance, and not used elsewhere.

The tasks are as follows:

CPU  PID    CLASS  TYPE      PRI   TIMEOUT       STAT       NAME
  1  1686   rt     cobalt      4   -             Wt         main
  3  1690   rt     cobalt      2   -             Wt         fup.medium

main is stuck in this function:

int mutex_lock(struct mutex_data *pData) {
      pthread_t threadId = pthread_self();
      // assert(pthread_equal(threadId, pData->m_LockId) == 0);
->    int r = pthread_mutex_lock(&pData->m_Mutex);
      assert(r == 0);
      pData->m_LockId = threadId;
      return r;
}

In libcobalt:
    do
      ret = XENOMAI_SYSCALL1(sc_cobalt_mutex_lock, _mutex);
    while (ret == -EINTR);

fup.medium is stuck in:

int conditionvar_wait(struct conditionvar_data *pData, struct mutex_data *pMutex) {
      pthread_t sid = pthread_self();
      assert(pthread_equal(sid, pMutex->m_LockId) != 0);
      pMutex->m_LockId = 0;
->    int r = pthread_cond_wait(&pData->m_CondVar, &pMutex->m_Mutex);
      assert(r == 0);
      pMutex->m_LockId = sid;
      return r;
}

In libcobalt:
    while (err == -EINTR)
      err = XENOMAI_SYSCALL2(sc_cobalt_cond_wait_epilogue, _cnd, _mx);


This is likely tricky to debug by just looking at things. Can you factor out a
reproducer?


Well, the "no apparent reason" is key here; it's not easily reproducible either.
It might help if you could tell me how I could end up in this situation. AFAIK
the pthread_cond_wait function got interrupted; when can this occur, for
example?

Well, we have the cond_wait apparently being signaled (condition met)
and on its way back, "just" trying to reacquire the mutex it dropped
while waiting. On the other hand, that mutex is also not available for
the other context trying to call mutex_lock. Now, we either have one of
those two instances actually holding the lock while not noticing it, or
there is a third instance in possession of the lock. From the code and
information you sent, this is impossible to guess.
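
For illustration, the wait/reacquire sequence described above can be sketched with plain POSIX calls. This is a minimal, hypothetical one-slot queue (the names slot_queue, slot_queue_push and slot_queue_pop are invented here and are not the poster's code), assuming default mutex and condition variable attributes. It shows that pthread_cond_wait drops the mutex while blocked and must reacquire it before returning, and that the predicate is re-checked in a loop so a spurious wakeup is harmless.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical one-slot queue, for illustration only. */
struct slot_queue {
    pthread_mutex_t lock;
    pthread_cond_t  ready;
    bool            has_item;
    int             item;
};

int slot_queue_init(struct slot_queue *q)
{
    int r = pthread_mutex_init(&q->lock, NULL);
    if (r != 0)
        return r;
    r = pthread_cond_init(&q->ready, NULL);
    if (r != 0) {
        pthread_mutex_destroy(&q->lock);
        return r;
    }
    q->has_item = false;
    q->item = 0;
    return 0;
}

/* Producer side: stores a value (overwriting any unconsumed one)
 * and signals the waiter while holding the mutex. */
int slot_queue_push(struct slot_queue *q, int value)
{
    int r = pthread_mutex_lock(&q->lock);
    if (r != 0)
        return r;
    q->item = value;
    q->has_item = true;
    pthread_cond_signal(&q->ready);
    return pthread_mutex_unlock(&q->lock);
}

/* Consumer side: blocks until a value is available. */
int slot_queue_pop(struct slot_queue *q, int *out)
{
    int r = pthread_mutex_lock(&q->lock);
    if (r != 0)
        return r;
    /* pthread_cond_wait releases q->lock while blocked and reacquires
     * it before returning; the loop tolerates spurious wakeups. */
    while (!q->has_item) {
        r = pthread_cond_wait(&q->ready, &q->lock);
        if (r != 0)
            return r; /* not expected with default attributes */
    }
    /* Here the predicate holds and the mutex is owned again. */
    assert(q->has_item);
    *out = q->item;
    q->has_item = false;
    return pthread_mutex_unlock(&q->lock);
}
```

In the reported hang, the waiter would be stuck inside the reacquire step of pthread_cond_wait, while the producer would be stuck in the pthread_mutex_lock call of the push path.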

It seems limited to running under a debugger (or it just happens a lot less
without one). For example, when the process dlopen's libraries, this pauses
the process. At that time fup.medium is supposed to be blocked in
pthread_cond_wait (meaning the dlopens might cause spurious wakeups)
and to be notified after everything is ready to run.

The debugger might be the catalyst for the issue, or it is actually
causing it. Again, impossible to guess from the given information:


So, if you have any idea how I could narrow it down, I might be able to build a
reproducer. Right now I am going to use a timed mutex to at least detect the
issue.


- identify the owner of the lock at the time of the deadlock, based on
  the data structures - maybe that is already telling the story

- take an ftrace of the situation so that the flow of context switches
  and debugger interceptions can be reconstructed
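
As a rough sketch of how the timed-mutex idea and the first suggestion could be combined at application level: the wrapper below uses pthread_mutex_timedlock and, on timeout, reports the last owner recorded in m_LockId. The struct only mirrors the two fields visible in the post, and the name mutex_lock_checked and its timeout parameter are hypothetical. It assumes pthread_t is an integer type (as the posted code does when assigning 0) and that the mutex was not created with a clock other than CLOCK_REALTIME. Reading m_LockId without holding the lock is racy, so the value is a diagnostic hint only; it does not replace inspecting the owner recorded by Cobalt itself. In a real-time thread, the logging call would likely need a real-time-safe alternative.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* Mirrors the fields used in the post; the real struct may hold more. */
struct mutex_data {
    pthread_mutex_t m_Mutex;
    pthread_t       m_LockId;
};

/* Hypothetical variant of mutex_lock() that gives up after timeout_sec
 * seconds and reports who was last recorded as the owner. */
int mutex_lock_checked(struct mutex_data *pData, time_t timeout_sec)
{
    struct timespec deadline;

    /* pthread_mutex_timedlock takes an absolute CLOCK_REALTIME deadline. */
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0)
        return errno;
    deadline.tv_sec += timeout_sec;

    int r = pthread_mutex_timedlock(&pData->m_Mutex, &deadline);
    if (r == ETIMEDOUT) {
        /* Unsynchronized read: may be stale, but useful as a first hint. */
        fprintf(stderr,
                "mutex %p: lock timed out, last recorded owner %lu, caller %lu\n",
                (void *)pData,
                (unsigned long)pData->m_LockId,
                (unsigned long)pthread_self());
        return r;
    }
    if (r == 0)
        pData->m_LockId = pthread_self();
    return r;
}
```

A recorded owner of 0 at the time of the timeout would point at the window in conditionvar_wait where m_LockId is cleared before pthread_cond_wait, while any other thread id would hint at a third party holding the lock.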

Jan

--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux
