On Tue, Mar 5, 2013 at 3:47 PM, Philippe Gerum <r...@xenomai.org> wrote:
> On 03/05/2013 03:25 PM, Ronny Meeus wrote:
>>
>> On Tue, Mar 5, 2013 at 3:08 PM, Philippe Gerum <r...@xenomai.org> wrote:
>>>
>>> On 03/05/2013 01:43 PM, Ronny Meeus wrote:
>>>>
>>>> On Sat, Mar 2, 2013 at 12:13 PM, Ronny Meeus <ronny.me...@gmail.com> wrote:
>>>>>
>>>>> On Fri, Mar 1, 2013 at 9:41 AM, Philippe Gerum <r...@xenomai.org> wrote:
>>>>>>
>>>>>> On 03/01/2013 09:30 AM, Gilles Chanteperdrix wrote:
>>>>>>>
>>>>>>> On 03/01/2013 09:30 AM, Philippe Gerum wrote:
>>>>>>>
>>>>>>>> On 03/01/2013 09:26 AM, Gilles Chanteperdrix wrote:
>>>>>>>>>
>>>>>>>>> On 03/01/2013 09:22 AM, Philippe Gerum wrote:
>>>>>>>>>
>>>>>>>>>> On 02/28/2013 09:22 PM, Thomas De Schampheleire wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Feb 28, 2013 at 9:10 PM, Gilles Chanteperdrix
>>>>>>>>>>> <gilles.chanteperd...@xenomai.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 02/28/2013 08:19 PM, Ronny Meeus wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> we are using the pSOS interface of Xenomai forge, running
>>>>>>>>>>>>> completely in user space using the Mercury core.
>>>>>>>>>>>>> We deploy our application on different processors: one product
>>>>>>>>>>>>> runs on PPC multicore (P4040, P4080, P4034) and another one on
>>>>>>>>>>>>> Cavium (an 8-core device).
>>>>>>>>>>>>> The Linux version we use is 2.6.32, but I assume that is not
>>>>>>>>>>>>> very relevant here.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Our Xenomai application runs on one of the cores (the affinity
>>>>>>>>>>>>> is set), while the other cores run other code.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On both architectures we recently started to see issues where
>>>>>>>>>>>>> one thread consumes 100% of the core the application is pinned
>>>>>>>>>>>>> to. The thread that monopolizes the core is the thread used
>>>>>>>>>>>>> internally to manage the timers, running at the highest
>>>>>>>>>>>>> priority.
>>>>>>>>>>>>> The trigger for running into this behavior is currently
>>>>>>>>>>>>> unclear. If we only start a part of the application (platform
>>>>>>>>>>>>> management only), the issue is not observed.
>>>>>>>>>>>>> We see this on both an old version of Xenomai and a very
>>>>>>>>>>>>> recent one (pulled from the git repo yesterday).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I will continue to debug this issue in the coming days and try
>>>>>>>>>>>>> to isolate the code that is triggering it, but I could use
>>>>>>>>>>>>> hints from the community.
>>>>>>>>>>>>> Debugging is complex since once the load starts, the debugger
>>>>>>>>>>>>> stops reacting.
>>>>>>>>>>>>> If I put breakpoints in the functions that are called when the
>>>>>>>>>>>>> timer expires (both one-shot and periodic), the process starts
>>>>>>>>>>>>> to clone itself and I end up with tens of them.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Has anybody seen an issue like this before, or does somebody
>>>>>>>>>>>>> have hints on how to debug this problem?
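(For illustration only: "the affinity is set" above presumably means something along the lines of the standard Linux call sketched below. This is not code from the thread, and the helper name and core number are made up.)

#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: pin the calling process to a single core.
 * Threads created afterwards inherit the affinity mask. */
static int pin_to_core(int core)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(core, &set);
	return sched_setaffinity(0, sizeof(set), &set); /* 0 = current process */
}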
>>>>>>>>>>>>
>>>>>>>>>>>> First enable the watchdog. It will send a signal to the
>>>>>>>>>>>> application when detecting a problem; you can then use the
>>>>>>>>>>>> watchdog to trigger an I-pipe tracer trace when the bug
>>>>>>>>>>>> happens. You will probably have to increase the watchdog
>>>>>>>>>>>> polling frequency in order to get a meaningful trace.
>>>>>>>>>>>
>>>>>>>>>>> I don't think an I-pipe tracer will be possible when using the
>>>>>>>>>>> Mercury core, right (xenomai-forge)?
>>>>>>>>>>
>>>>>>>>>> Correct.
>>>>>>>>>
>>>>>>>>> I do not think so. The way I see it, you can enable the I-pipe
>>>>>>>>> tracer without CONFIG_XENOMAI.
>>>>>>>>
>>>>>>>> Mercury has NO pipeline in the kernel.
>>>>>>>
>>>>>>> You mean Mercury cannot run with an I-pipe kernel?
>>>>>>
>>>>>> I mean it does not care about the pipeline; it does not need it. So
>>>>>> if this is about observing kernel activity, then ftrace should be
>>>>>> fine, or possibly perf to find out where userland spends time.
>>>>>>
>>>>>> --
>>>>>> Philippe.
>>>>>
>>>>> Hello,
>>>>>
>>>>> An update on the investigation:
>>>>> I was able to make this issue disappear by changing the timeout value
>>>>> of the smallest timers we use.
>>>>> We use a couple of timers with a timeout of 25 ms. After enlarging
>>>>> these to 25 s, the problem is gone.
>>>>>
>>>>> Yesterday I was also able to see (using the "strace" tool) that the
>>>>> process constantly executes "clone" system calls.
>>>>> Note that the process we use is large (2 GB) and uses an mlockall
>>>>> call.
>>>>>
>>>>> In
>>>>> http://stackoverflow.com/questions/4263958/some-information-on-timer-helper-thread-of-librt-so-1/4935895#4935895
>>>>> I see that a new thread is created when timer_create is called for
>>>>> the first time. This thread stays alive until the program exits and
>>>>> is used to process the timer expiries.
>>>>> I have the feeling that there is an issue during the creation of this
>>>>> thread. For example, what would happen if the clone operation takes
>>>>> longer than the timer period?
>>>>> In the past we already observed issues with the clone call that we
>>>>> could not explain (creation of the clone simply failed in our
>>>>> application while it was working fine in a smaller application).
>>>>>
>>>>> Do you guys know whether the mlockall call has an impact on the clone
>>>>> operation?
>>>>>
>>>>> I will try to make a small test application on which the issue can be
>>>>> reproduced.
>>>>>
>>>>> ---
>>>>> Ronny
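(For context — a minimal standalone sketch, assuming plain glibc/librt POSIX timers and no Xenomai, of the mechanism the stackoverflow answer describes: the first timer_create() in a process spawns librt's helper thread, and each SIGEV_THREAD expiry is then delivered on a freshly created thread, which is what shows up as clone() calls under strace. All names and values below are illustrative.)

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static void expired(union sigval sv)
{
	/* glibc runs this on a thread created per notification. */
	printf("timer %d fired\n", sv.sival_int);
}

int main(void)
{
	timer_t tid;
	struct sigevent sev;
	struct itimerspec its;

	memset(&sev, 0, sizeof(sev));
	sev.sigev_notify = SIGEV_THREAD;
	sev.sigev_notify_function = expired;
	sev.sigev_value.sival_int = 1;

	/* The first timer_create() starts librt's internal helper thread. */
	if (timer_create(CLOCK_MONOTONIC, &sev, &tid))
		return 1;

	memset(&its, 0, sizeof(its));
	its.it_value.tv_nsec = 25 * 1000 * 1000; /* 25 ms, like the failing timers */
	its.it_interval = its.it_value;          /* periodic */
	timer_settime(tid, 0, &its, NULL);

	sleep(1); /* expect ~40 expiries; watch the clone() calls with strace -f */
	return 0; /* build with: gcc sketch.c -lrt */
}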
>>>>
>>>> I'm able to reproduce the issue with a small test build:
>>>>
>>>> #include <stdio.h>
>>>> #include <unistd.h>
>>>> #include <sys/types.h>
>>>> #include <sys/mman.h>
>>>> #include <psos.h>
>>>> #include <copperplate/init.h>
>>>> #include <stdlib.h>
>>>> #include <string.h>
>>>>
>>>> static void foo(u_long a0, u_long a1, u_long a2, u_long a3)
>>>> {
>>>>     u_long ret, ev = 0, tmid, tmid2;
>>>>
>>>>     /* Periodic timer: post event 1 every tick. */
>>>>     ret = tm_evevery(1, 1, &tmid);
>>>>     /* One-shot timer: post event 4 after 30000 ticks. */
>>>>     ret = tm_evafter(30000, 4, &tmid2);
>>>>     while (1) {
>>>>         ret = ev_receive(0xFF, EV_ANY | EV_WAIT, 0, &ev);
>>>>         if (ev & 4) {
>>>>             printf("%lx Restarting one-shot timer. ev=%lx\n", ret, ev);
>>>>             tm_evafter(30000, 4, &tmid2);
>>>>         }
>>>>         ev = 0;
>>>>     }
>>>>     tm_wkafter(100);
>>>> }
>>>>
>>>> int main(int argc, char * const *argv)
>>>> {
>>>>     u_long ret, tid = 0, args[4];
>>>>
>>>>     mlockall(MCL_CURRENT | MCL_FUTURE);
>>>>     copperplate_init(&argc, &argv);
>>>>
>>>>     ret = t_create("TEST", 97, 0, 0, 0, &tid);
>>>>     printf("t_create(tid=%lu) = %lu\n", tid, ret);
>>>>     args[0] = 1;
>>>>     args[1] = 2;
>>>>     args[2] = 3;
>>>>     args[3] = 4;
>>>>     ret = t_start(tid, 0, foo, args);
>>>>     printf("t_start(tid=%lu) = %lu\n", tid, ret);
>>>>
>>>>     while (1)
>>>>         tm_wkafter(100);
>>>>     return 0;
>>>> }
>>>>
>>>> The TEST task starts two timers: one periodic and one one-shot.
>>>> Each time the one-shot timer expires, a message is printed and the
>>>> timer is restarted.
>>>>
>>>> The observation is that once the one-shot timer expires, the
>>>> application starts to use 100% CPU load on one core and the
>>>> application code is not executed anymore. So it looks like there is
>>>> constant processing of the timer handling in either Xenomai or the
>>>> library code. If only periodic timers are used, the issue is not
>>>> observed.
>>>
>>> I can't reproduce this bug using that test code, over glibc 2.15/x86.
>>> We might have a problem with SIGEV_THREAD. Which glibc release are you
>>> running?
>>>
>>> Also, do you observe the same issue with a larger event interval for
>>> the periodic timer (e.g. 1000 ticks)?
>>>
>>> --
>>> Philippe.
>>
>> Philippe,
>> this is the output I see:
>>
>> # taskset 4 /tmp/simple_tm_cancel.exe &
>> # 0"000.506| WARNING: [main] Xenomai compiled with partial debug
>> enabled, high latencies expected [--enable-debug=partial]
>> t_create(tid=273617536) = 0
>> t_start(tid=273617536) = 0
>> 0 Restarting one-shot timer. ev=6
>
> 6? We are asking for events 4 or 1, so at best we might get 0x5 if both
> are pending at the same time. Or maybe it is a different test?
>
> --
> Philippe.
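(A quick worked note on the mask arithmetic behind that question, assuming only the two timers in the test code post events; Ronny's clarification follows below.)

/* tm_evevery(1, 1, &tmid)      -> posts event 0x1 every tick
 * tm_evafter(30000, 4, &tmid2) -> posts event 0x4 once
 *
 * ev_receive(0xFF, EV_ANY | EV_WAIT, 0, &ev) stores the satisfied events
 * in ev, so the values one would expect from this test are:
 *   0x1 = periodic only, 0x4 = one-shot only, 0x5 = both pending at once.
 * ev = 0x6 (0x2 | 0x4) implies some other source posted event 0x2. */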
In the meantime I indeed added another timer. The events I receive are as
expected; I just wanted to show the return value of the ev_receive call.
Sorry for the confusion.

---
Ronny

_______________________________________________
Xenomai mailing list
Xenomai@xenomai.org
http://www.xenomai.org/mailman/listinfo/xenomai