Re: [Xenomai] Xenomai-forge: thread using 100% cpu load

Philippe Gerum Tue, 05 Mar 2013 06:47:26 -0800

On 03/05/2013 03:25 PM, Ronny Meeus wrote:

On Tue, Mar 5, 2013 at 3:08 PM, Philippe Gerum <r...@xenomai.org> wrote:

On 03/05/2013 01:43 PM, Ronny Meeus wrote:


On Sat, Mar 2, 2013 at 12:13 PM, Ronny Meeus <ronny.me...@gmail.com>
wrote:


On Fri, Mar 1, 2013 at 9:41 AM, Philippe Gerum <r...@xenomai.org> wrote:


On 03/01/2013 09:30 AM, Gilles Chanteperdrix wrote:



On 03/01/2013 09:30 AM, Philippe Gerum wrote:

On 03/01/2013 09:26 AM, Gilles Chanteperdrix wrote:



On 03/01/2013 09:22 AM, Philippe Gerum wrote:

On 02/28/2013 09:22 PM, Thomas De Schampheleire wrote:



On Thu, Feb 28, 2013 at 9:10 PM, Gilles Chanteperdrix
<gilles.chanteperd...@xenomai.org> wrote:



On 02/28/2013 08:19 PM, Ronny Meeus wrote:

Hello

we are using the PSOS interface of Xenomai forge, running
completely
in user-space using the mercury code.
We deploy our application on different processors, one product is
running on PPC multicore (P4040, P4080, P4034) and another one on
Cavium (8 core device).
The Linux version we use is 2.6.32 but I would assume that this
is
not
so relevant.

Our Xenomai application is running on one of the cores (affinity
is
set), while the other cores are running other code.

On both architectures we recently start to see issues that one
thread
is consuming 100% of the core on which the application is pinned.
The thread that monopolizes the core is the thread internally
used
to
manage the timers, running at the highest priority.
The trigger for running into this behavior is currently unclear.
If we only start a part of the application (platform management
only),
the issue is not observed.
We see this on both an old version of Xenomai and a very recent
one
(pulled from the git repo yesterday).

I will continue to debug this issue in the coming days and try
isolate
the code that is triggering it, but I can use hints from the
community.
Debugging is complex since once the load starts, the debugger is
not
reacting anymore.
If I put breakpoints in the functions that are called when the
timer
expires (both oneshot and periodic), the process starts to clone
itself and I endup with tens of them.

Has anybody seen an issue like this before or does somebody has
some
hints on how to debug this problem?





First enable the watchdog. It will send a signal to the
application
when
detecting a problem, then you can use the watchdog to trigger an
I-pipe
tracer trace when the bug happens. You will probably have to
increase
the watchdog polling frequency, in order to have a meaningful
trace.


I don't think an I-pipe tracer will be possible when using the
Mercury
core, right (xenomai-forge) ?


Correct.





I do not think so. The way I see it, you can enable the I-pipe tracer
without CONFIG_XENOMAI.


Mercury has NO pipeline in the kernel.


You mean mercury can not run with an I-pipe kernel?


I mean it does not care about the pipeline, it does not need it. So if
this
is about observing kernel activity, then ftrace should be fine, or
possibly
perf to find out where userland spends time.

--
Philippe.


_______________________________________________
Xenomai mailing list
Xenomai@xenomai.org
http://www.xenomai.org/mailman/listinfo/xenomai



Hello

An update on the investigation:
I was able to make this issue disappear by changing the timeout value
of the smallest timers we use.
We use a couple of timers with a timeout of 25ms. By enlarging these
to 25sec and the problem is gone.

Yesterday I was also able to see (using the"strace" tool) the process
executing constantly "clone" system calls.
Note that the process we use is large (2Gb) and uses an mlockall call.

In
http://stackoverflow.com/questions/4263958/some-information-on-timer-helper-thread-of-librt-so-1/4935895#4935895
I see that a new thread is created when the timer_create is called for
the first time. This thread stays alive until the program exits and is
used to process the timer expiries.
I have the feeling that there is an issue during the creation of this
thread. For example what would happen if the clone operation takes
longer than the time needed to perform the clone operation?
In the past we already observed issues with the clone call that we
could not explain (creation of the clone simply failed on our
application while it was working fine on a smaller application).

Do you guys know whether there is an impact on the clone operation by
this mlockall call?

I will try to make a small test application on which the issue can be
reproduced.

---
Ronny



I'm able to reproduce the issue on a small test build:

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <psos.h>
#include <copperplate/init.h>
#include <stdlib.h>
#include <string.h>

static void foo (u_long a0, u_long a1, u_long a2, u_long a3)
{
          u_long ret, ev = 0, tmid,tmid2;

          ret = tm_evevery(1,1,&tmid);
          ret = tm_evafter(30000,4,&tmid2);
          while (1) {
                  ret = ev_receive(0xFF,EV_ANY|EV_WAIT,0,&ev);
                  if (ev & 4) {
                          printf("%lx Restarting one-shot timer.
ev=%lx\n",ret,ev);
                          tm_evafter(30000,4,&tmid2);
                  }
                  ev = 0;
          }
          tm_wkafter(100);
}

int main(int argc, char * const *argv)
{
          u_long ret, tid = 0, args[4];

          mlockall(MCL_CURRENT | MCL_FUTURE);
          copperplate_init(&argc,&argv);

          ret = t_create("TEST",97, 0, 0, 0, &tid);
          printf("t_create(tid=%lu) = %lu\n", tid, ret);
          args[0] = 1;
          args[1] = 2;
          args[2] = 3;
          args[3] = 4;
          ret = t_start(tid, 0, foo, args);
          printf("t_start(tid=%lu) = %lu\n", tid, ret);

          while (1)
                  tm_wkafter(100);
          return 0;
}

The TEST task starts 2 timers: one periodic and one 1shot timer.
Each time the one-shot timer expires, a print is done and the timer is
restarted.

Observation is that once the one-shot timer expires, the application
starts to use 100% cpuload on one core and the application code is not
executed anymore. So it looks like there is constant processing in
either Xenomai or the library code to process the timer handling. If
periodic timers are used the issue is not observed.


I can't reproduce this bug using that test code, over glibc 2.15/x86. We
might have a problem with SIGEV_THREAD. Which glibc release are you running?

Also, do you observe the same issue with a larger event interval for the
periodic timer (e.g. 1000 ticks)?

--
Philippe.


Philippe,
this is the output I see:

# taskset 4 /tmp/simple_tm_cancel.exe  &
#    0"000.506| WARNING: [main] Xenomai compiled with partial debug enabled,
                               high latencies expected [--enable-debug=partial]
t_create(tid=273617536) = 0
t_start(tid=273617536) = 0
0 Restarting one-shot timer. ev=6

6? We are asking for events 4 or 1, so at best, we might get 0x5 if bothare pending at the same time. Or maybe is it a different test?



--
Philippe.

_______________________________________________
Xenomai mailing list
Xenomai@xenomai.org
http://www.xenomai.org/mailman/listinfo/xenomai

Re: [Xenomai] Xenomai-forge: thread using 100% cpu load

Reply via email to