On Sat, Mar 2, 2013 at 12:13 PM, Ronny Meeus <[email protected]> wrote:
> On Fri, Mar 1, 2013 at 9:41 AM, Philippe Gerum <[email protected]> wrote:
>> On 03/01/2013 09:30 AM, Gilles Chanteperdrix wrote:
>>>
>>> On 03/01/2013 09:30 AM, Philippe Gerum wrote:
>>>
>>>> On 03/01/2013 09:26 AM, Gilles Chanteperdrix wrote:
>>>>>
>>>>> On 03/01/2013 09:22 AM, Philippe Gerum wrote:
>>>>>
>>>>>> On 02/28/2013 09:22 PM, Thomas De Schampheleire wrote:
>>>>>>>
>>>>>>> On Thu, Feb 28, 2013 at 9:10 PM, Gilles Chanteperdrix
>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 02/28/2013 08:19 PM, Ronny Meeus wrote:
>>>>>>>>
>>>>>>>>> Hello
>>>>>>>>>
>>>>>>>>> we are using the PSOS interface of Xenomai forge, running completely
>>>>>>>>> in user-space using the mercury code.
>>>>>>>>> We deploy our application on different processors, one product is
>>>>>>>>> running on PPC multicore (P4040, P4080, P4034) and another one on
>>>>>>>>> Cavium (8 core device).
>>>>>>>>> The Linux version we use is 2.6.32, but I assume this is not very
>>>>>>>>> relevant.
>>>>>>>>>
>>>>>>>>> Our Xenomai application is running on one of the cores (affinity is
>>>>>>>>> set), while the other cores are running other code.
>>>>>>>>>
>>>>>>>>> On both architectures we recently started to see an issue where one
>>>>>>>>> thread consumes 100% of the core on which the application is pinned.
>>>>>>>>> The thread that monopolizes the core is the one used internally to
>>>>>>>>> manage the timers, running at the highest priority.
>>>>>>>>> The trigger for running into this behavior is currently unclear.
>>>>>>>>> If we only start a part of the application (platform management
>>>>>>>>> only),
>>>>>>>>> the issue is not observed.
>>>>>>>>> We see this on both an old version of Xenomai and a very recent one
>>>>>>>>> (pulled from the git repo yesterday).
>>>>>>>>>
>>>>>>>>> I will continue to debug this issue in the coming days and try to
>>>>>>>>> isolate the code that is triggering it, but I could use hints from
>>>>>>>>> the community.
>>>>>>>>> Debugging is complex since once the load starts, the debugger does
>>>>>>>>> not react anymore.
>>>>>>>>> If I put breakpoints in the functions that are called when the timer
>>>>>>>>> expires (both one-shot and periodic), the process starts to clone
>>>>>>>>> itself and I end up with tens of copies.
>>>>>>>>>
>>>>>>>>> Has anybody seen an issue like this before, or does somebody have
>>>>>>>>> hints on how to debug this problem?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> First enable the watchdog. It will send a signal to the application
>>>>>>>> when
>>>>>>>> detecting a problem, then you can use the watchdog to trigger an
>>>>>>>> I-pipe
>>>>>>>> tracer trace when the bug happens. You will probably have to increase
>>>>>>>> the watchdog polling frequency, in order to have a meaningful trace.
>>>>>>>>
>>>>>>>
>>>>>>> I don't think an I-pipe tracer will be possible when using the Mercury
>>>>>>> core, right (xenomai-forge) ?
>>>>>>>
>>>>>>
>>>>>> Correct.
>>>>>
>>>>>
>>>>>
>>>>> I do not think so. The way I see it, you can enable the I-pipe tracer
>>>>> without CONFIG_XENOMAI.
>>>>>
>>>>
>>>> Mercury has NO pipeline in the kernel.
>>>>
>>>
>>> You mean mercury can not run with an I-pipe kernel?
>>>
>>
>> I mean it does not care about the pipeline, it does not need it. So if this
>> is about observing kernel activity, then ftrace should be fine, or possibly
>> perf to find out where userland spends time.
>>
>> --
>> Philippe.
>>
>>
>> _______________________________________________
>> Xenomai mailing list
>> [email protected]
>> http://www.xenomai.org/mailman/listinfo/xenomai
>
> Hello
>
> An update on the investigation:
> I was able to make this issue disappear by changing the timeout value
> of the smallest timers we use.
> We use a couple of timers with a timeout of 25 ms. By enlarging these
> to 25 s, the problem is gone.
>
> Yesterday I was also able to see (using the "strace" tool) that the
> process constantly executes "clone" system calls.
> Note that the process we use is large (2 GB) and uses an mlockall call.
>
> In
> http://stackoverflow.com/questions/4263958/some-information-on-timer-helper-thread-of-librt-so-1/4935895#4935895
> I see that a new thread is created when timer_create is called for
> the first time. This thread stays alive until the program exits and is
> used to process the timer expiries.
> I have the feeling that there is an issue during the creation of this
> thread. For example, what would happen if the first timer expires
> before the clone operation that creates this thread has completed?
> In the past we already observed unexplained issues with the clone call
> (the clone simply failed in our application while it worked fine in a
> smaller application).
>
> Do you guys know whether there is an impact on the clone operation by
> this mlockall call?
>
> I will try to make a small test application on which the issue can be
> reproduced.
>
> ---
> Ronny
I'm able to reproduce the issue on a small test build:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <psos.h>
#include <copperplate/init.h>
#include <stdlib.h>
#include <string.h>

static void foo(u_long a0, u_long a1, u_long a2, u_long a3)
{
	u_long ret, ev = 0, tmid, tmid2;

	ret = tm_evevery(1, 1, &tmid);
	ret = tm_evafter(30000, 4, &tmid2);

	while (1) {
		ret = ev_receive(0xFF, EV_ANY|EV_WAIT, 0, &ev);
		if (ev & 4) {
			printf("%lx Restarting one-shot timer. ev=%lx\n", ret, ev);
			tm_evafter(30000, 4, &tmid2);
		}
		ev = 0;
	}

	tm_wkafter(100);
}

int main(int argc, char * const *argv)
{
	u_long ret, tid = 0, args[4];

	mlockall(MCL_CURRENT | MCL_FUTURE);
	copperplate_init(&argc, &argv);

	ret = t_create("TEST", 97, 0, 0, 0, &tid);
	printf("t_create(tid=%lu) = %lu\n", tid, ret);

	args[0] = 1;
	args[1] = 2;
	args[2] = 3;
	args[3] = 4;

	ret = t_start(tid, 0, foo, args);
	printf("t_start(tid=%lu) = %lu\n", tid, ret);

	while (1)
		tm_wkafter(100);

	return 0;
}
The TEST task starts two timers: one periodic and one one-shot.
Each time the one-shot timer expires, a message is printed and the timer
is restarted.
The observation is that once the one-shot timer expires, the application
starts to use 100% CPU load on one core and the application code is not
executed anymore. So it looks like there is constant processing in
either Xenomai or the library code to handle the timer. If only
periodic timers are used, the issue is not observed.
Best regards,
Ronny