On Tue, Mar 5, 2013 at 3:08 PM, Philippe Gerum <r...@xenomai.org> wrote:
> On 03/05/2013 01:43 PM, Ronny Meeus wrote:
>>
>> On Sat, Mar 2, 2013 at 12:13 PM, Ronny Meeus <ronny.me...@gmail.com> wrote:
>>>
>>> On Fri, Mar 1, 2013 at 9:41 AM, Philippe Gerum <r...@xenomai.org> wrote:
>>>>
>>>> On 03/01/2013 09:30 AM, Gilles Chanteperdrix wrote:
>>>>>
>>>>> On 03/01/2013 09:30 AM, Philippe Gerum wrote:
>>>>>
>>>>>> On 03/01/2013 09:26 AM, Gilles Chanteperdrix wrote:
>>>>>>>
>>>>>>> On 03/01/2013 09:22 AM, Philippe Gerum wrote:
>>>>>>>
>>>>>>>> On 02/28/2013 09:22 PM, Thomas De Schampheleire wrote:
>>>>>>>>>
>>>>>>>>> On Thu, Feb 28, 2013 at 9:10 PM, Gilles Chanteperdrix
>>>>>>>>> <gilles.chanteperd...@xenomai.org> wrote:
>>>>>>>>>>
>>>>>>>>>> On 02/28/2013 08:19 PM, Ronny Meeus wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello,
>>>>>>>>>>>
>>>>>>>>>>> we are using the PSOS interface of Xenomai forge, running completely
>>>>>>>>>>> in user space on the Mercury core.
>>>>>>>>>>> We deploy our application on different processors: one product runs
>>>>>>>>>>> on PPC multicore (P4040, P4080, P4034) and another one on Cavium
>>>>>>>>>>> (8-core device).
>>>>>>>>>>> The Linux version we use is 2.6.32, but I would assume that this is
>>>>>>>>>>> not so relevant.
>>>>>>>>>>>
>>>>>>>>>>> Our Xenomai application runs on one of the cores (its affinity is
>>>>>>>>>>> set), while the other cores run other code.
>>>>>>>>>>>
>>>>>>>>>>> On both architectures we recently started to see issues where one
>>>>>>>>>>> thread consumes 100% of the core on which the application is pinned.
>>>>>>>>>>> The thread that monopolizes the core is the thread internally used
>>>>>>>>>>> to manage the timers, running at the highest priority.
>>>>>>>>>>> The trigger for running into this behavior is currently unclear.
>>>>>>>>>>> If we only start a part of the application (platform management
>>>>>>>>>>> only), the issue is not observed.
>>>>>>>>>>> We see this on both an old version of Xenomai and a very recent one
>>>>>>>>>>> (pulled from the git repo yesterday).
>>>>>>>>>>>
>>>>>>>>>>> I will continue to debug this issue in the coming days and try to
>>>>>>>>>>> isolate the code that is triggering it, but I could use hints from
>>>>>>>>>>> the community.
>>>>>>>>>>> Debugging is complex since once the load starts, the debugger does
>>>>>>>>>>> not react anymore.
>>>>>>>>>>> If I put breakpoints in the functions that are called when the timer
>>>>>>>>>>> expires (both one-shot and periodic), the process starts to clone
>>>>>>>>>>> itself and I end up with tens of them.
>>>>>>>>>>>
>>>>>>>>>>> Has anybody seen an issue like this before, or does somebody have
>>>>>>>>>>> hints on how to debug this problem?
>>>>>>>>>>
>>>>>>>>>> First enable the watchdog. It will send a signal to the application
>>>>>>>>>> when it detects a problem; you can then use the watchdog to trigger
>>>>>>>>>> an I-pipe trace when the bug happens. You will probably have to
>>>>>>>>>> increase the watchdog polling frequency in order to get a meaningful
>>>>>>>>>> trace.
>>>>>>>>>
>>>>>>>>> I don't think an I-pipe trace will be possible when using the Mercury
>>>>>>>>> core (xenomai-forge), right?
>>>>>>>>
>>>>>>>> Correct.
>>>>>>>
>>>>>>> I do not think so. The way I see it, you can enable the I-pipe tracer
>>>>>>> without CONFIG_XENOMAI.
>>>>>>
>>>>>> Mercury has NO pipeline in the kernel.
>>>>>
>>>>> You mean Mercury cannot run with an I-pipe kernel?
>>>>
>>>> I mean it does not care about the pipeline; it does not need it. So if
>>>> this is about observing kernel activity, then ftrace should be fine, or
>>>> possibly perf to find out where userland spends its time.
>>>>
>>>> --
>>>> Philippe.
>>>>
>>>> _______________________________________________
>>>> Xenomai mailing list
>>>> Xenomai@xenomai.org
>>>> http://www.xenomai.org/mailman/listinfo/xenomai
>>>
>>> Hello
>>>
>>> An update on the investigation:
>>> I was able to make this issue disappear by changing the timeout value of
>>> the smallest timers we use. We use a couple of timers with a timeout of
>>> 25 ms; after enlarging these to 25 s, the problem is gone.
>>>
>>> Yesterday I was also able to see (using the "strace" tool) that the
>>> process constantly executes "clone" system calls.
>>> Note that the process we use is large (2 GB) and uses an mlockall call.
>>>
>>> In
>>> http://stackoverflow.com/questions/4263958/some-information-on-timer-helper-thread-of-librt-so-1/4935895#4935895
>>> I see that a new thread is created when timer_create is called for the
>>> first time. This thread stays alive until the program exits and is used
>>> to process the timer expiries.
>>> I have the feeling that there is an issue during the creation of this
>>> thread. For example, what would happen if the clone operation takes
>>> longer than the timer period?
>>> In the past we already observed issues with the clone call that we could
>>> not explain (creation of the clone simply failed in our application
>>> while it worked fine in a smaller application).
>>>
>>> Do you guys know whether the mlockall call has an impact on the clone
>>> operation?
>>>
>>> I will try to make a small test application on which the issue can be
>>> reproduced.
>>>
>>> ---
>>> Ronny
>>
>> I'm able to reproduce the issue with a small test program:
>>
>> #include <stdio.h>
>> #include <unistd.h>
>> #include <sys/types.h>
>> #include <sys/mman.h>
>> #include <psos.h>
>> #include <copperplate/init.h>
>> #include <stdlib.h>
>> #include <string.h>
>>
>> static void foo(u_long a0, u_long a1, u_long a2, u_long a3)
>> {
>>         u_long ret, ev = 0, tmid, tmid2;
>>
>>         ret = tm_evevery(1, 1, &tmid);
>>         ret = tm_evafter(30000, 4, &tmid2);
>>         while (1) {
>>                 ret = ev_receive(0xFF, EV_ANY|EV_WAIT, 0, &ev);
>>                 if (ev & 4) {
>>                         printf("%lx Restarting one-shot timer. ev=%lx\n", ret, ev);
>>                         tm_evafter(30000, 4, &tmid2);
>>                 }
>>                 ev = 0;
>>         }
>>         tm_wkafter(100);
>> }
>>
>> int main(int argc, char * const *argv)
>> {
>>         u_long ret, tid = 0, args[4];
>>
>>         mlockall(MCL_CURRENT | MCL_FUTURE);
>>         copperplate_init(&argc, &argv);
>>
>>         ret = t_create("TEST", 97, 0, 0, 0, &tid);
>>         printf("t_create(tid=%lu) = %lu\n", tid, ret);
>>         args[0] = 1;
>>         args[1] = 2;
>>         args[2] = 3;
>>         args[3] = 4;
>>         ret = t_start(tid, 0, foo, args);
>>         printf("t_start(tid=%lu) = %lu\n", tid, ret);
>>
>>         while (1)
>>                 tm_wkafter(100);
>>         return 0;
>> }
>>
>> The TEST task starts two timers: one periodic and one one-shot timer.
>> Each time the one-shot timer expires, a message is printed and the timer
>> is restarted.
>>
>> The observation is that once the one-shot timer expires, the application
>> starts to use 100% CPU load on one core and the application code is not
>> executed anymore. So it looks like there is constant processing, in
>> either Xenomai or the library code, to handle the timer. If periodic
>> timers are used, the issue is not observed.
>
> I can't reproduce this bug using that test code, over glibc 2.15/x86. We
> might have a problem with SIGEV_THREAD. Which glibc release are you
> running?
>
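
My suspicion is indeed the librt helper thread described in the
stackoverflow post quoted above: as far as I understand it, glibc creates
one helper thread on the first timer_create() call and then spawns an extra
thread for every SIGEV_THREAD expiry, which would match the constant clone()
calls that strace shows. To rule out the copperplate/PSOS layer, I intend to
try the same pattern with the plain POSIX timer API. Below is only a rough
sketch of what I have in mind (my own code; the 1 ms period and
CLOCK_MONOTONIC are arbitrary choices, not a copy of what copperplate does
internally), built with "gcc -o timer-test timer-test.c -lrt":

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <time.h>
#include <unistd.h>
#include <sys/mman.h>

static volatile unsigned long expiries;

/* glibc runs this in a (new) thread for every SIGEV_THREAD expiry. */
static void expiry_handler(union sigval sv)
{
        (void)sv;
        expiries++;
}

int main(void)
{
        struct sigevent sev;
        struct itimerspec its;
        timer_t tmid;

        /* Same locking as in the real application. */
        mlockall(MCL_CURRENT | MCL_FUTURE);

        memset(&sev, 0, sizeof(sev));
        sev.sigev_notify = SIGEV_THREAD;
        sev.sigev_notify_function = expiry_handler;

        if (timer_create(CLOCK_MONOTONIC, &sev, &tmid)) {
                perror("timer_create");
                return EXIT_FAILURE;
        }

        /* 1 ms period, roughly the 1-tick rate of the PSOS test. */
        memset(&its, 0, sizeof(its));
        its.it_value.tv_nsec = 1000000;
        its.it_interval.tv_nsec = 1000000;

        if (timer_settime(tmid, 0, &its, NULL)) {
                perror("timer_settime");
                return EXIT_FAILURE;
        }

        while (1) {
                sleep(1);
                printf("expiries so far: %lu\n", expiries);
        }

        return 0;
}

If this already shows the runaway clone behaviour with our glibc, the
problem is below Xenomai; if it does not, I will look closer at the
copperplate timer handling.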

Philippe, do you have a reference to the issue you are suspecting, and a
view on which glibc version we would need to use to solve it?

Ronny

> Also, do you observe the same issue with a larger event interval for the
> periodic timer (e.g. 1000 ticks)?
>
> --
> Philippe.
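
PS: to answer the glibc question, I still need to check the exact release on
both targets; a trivial check along these lines should do
(gnu_get_libc_version() is a GNU extension from <gnu/libc-version.h>;
running "ldd --version" on the target reports the same information):

#include <stdio.h>
#include <gnu/libc-version.h>

int main(void)
{
        printf("glibc version: %s\n", gnu_get_libc_version());
        printf("glibc release status: %s\n", gnu_get_libc_release());
        return 0;
}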