> -----Original Message-----
> From: Jan Kiszka <jan.kis...@siemens.com>
> Sent: Tuesday, 21 January 2020 18:46
> To: Lange Norbert <norbert.la...@andritz.com>; Xenomai
> (xenomai@xenomai.org) <xenomai@xenomai.org>
> Subject: Re: Cobalt deadlock for no apparent reason
>
> On 20.01.20 19:03, Lange Norbert via Xenomai wrote:
> > Hello,
> >
> > I got a deadlock while running through gdbserver, this is an
> > implementation of a synchronized queue, Fup side waits via condition
> > variable, main wants to push data, but main fails to acquire the mutex.
> > The mutex is an errorchecking type, without priority inheritance, and not
> > used elsewhere.
> >
> >
> > The tasks are as follows:
> >
> > CPU  PID    CLASS  TYPE      PRI   TIMEOUT       STAT       NAME
> >    1  1686   rt     cobalt      4   -             Wt         main
> >    3  1690   rt     cobalt      2   -             Wt         fup.medium
> >
> > main is stuck in this function
> >
> > int mutex_lock(struct mutex_data *pData) {
> >      pthread_t threadId = pthread_self();
> >      // assert(pthread_equal(threadId, pData->m_LockId) == 0);
> > ->    int r = pthread_mutex_lock(&pData->m_Mutex);
> >      assert(r == 0);
> >      pData->m_LockId = threadId;
> >      return r;
> > }
> >
> > In libcobalt:
> >    do
> >      ret = XENOMAI_SYSCALL1(sc_cobalt_mutex_lock, _mutex);
> >    while (ret == -EINTR);
> >
> > fup.medium is stuck in:
> >
> > int conditionvar_wait(struct conditionvar_data *pData, struct
> > mutex_data *pMutex) {
> >      pthread_t sid = pthread_self();
> >      assert(pthread_equal(sid, pMutex->m_LockId) != 0);
> >      pMutex->m_LockId = 0;
> > ->    int r = pthread_cond_wait(&pData->m_CondVar, &pMutex->m_Mutex);
> >      assert(r == 0);
> >      pMutex->m_LockId = sid;
> >      return r;
> > }
> >
> > In libcobalt:
> >    while (err == -EINTR)
> >      err = XENOMAI_SYSCALL2(sc_cobalt_cond_wait_epilogue, _cnd, _mx);
> >
>
> This is likely tricky to debug by just looking at things. Can you factor out a
> reproducer?

Well, the "no apparent reason" is the key point here, and it is not easily
reproducible either.
It would help if you could tell me how I could end up in this situation.
AFAIK the pthread_cond_wait function got interrupted; when can that occur,
for example?
It seems limited to running under a debugger (or it just happens a lot less
often without one). For example, when the process dlopens libraries, this
pauses the process. During that time fup.medium is supposed to stay blocked in
pthread_cond_wait (meaning the dlopens might cause spurious wakeups) and to be
notified only after everything is ready to run.
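
For reference, a minimal sketch of the wait pattern that tolerates spurious
wakeups, in plain POSIX. The struct and function names (sync_queue,
queue_wait_and_pop, queue_push) are made up for illustration and are not the
real queue code; the point is only that the predicate is rechecked in a loop
while the mutex is held.

```c
#include <pthread.h>

/* Hypothetical queue state; these names are illustrative only. */
struct sync_queue {
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
    int             count;   /* number of queued items */
};

/* Consumer side: block until at least one item is queued, then take it. */
void queue_wait_and_pop(struct sync_queue *q)
{
    pthread_mutex_lock(&q->lock);
    /* Recheck the predicate after every wakeup; a spurious wakeup
     * simply sends the thread back into pthread_cond_wait. */
    while (q->count == 0)
        pthread_cond_wait(&q->nonempty, &q->lock);
    q->count--;
    pthread_mutex_unlock(&q->lock);
}

/* Producer side: queue one item and wake one waiter. */
void queue_push(struct sync_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->count++;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}
```

With this shape a spurious wakeup costs one extra predicate check and nothing
else.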

So, if you have any idea how I could narrow it down, I might be able to build a
reproducer. Right now I am going to use a timed mutex lock to at least detect
the issue.

Norbert