On Fri, Oct 3, 2014 at 1:13 AM, Gilles Chanteperdrix
<[email protected]> wrote:
> On 10/02/2014 05:52 PM, GP Orcullo wrote:
>> On Thu, Oct 2, 2014 at 9:36 PM, Gilles Chanteperdrix
>> <[email protected]> wrote:
>>> On 10/02/2014 03:27 PM, GP Orcullo wrote:
>>>> On Wed, Oct 1, 2014 at 5:20 PM, Gilles Chanteperdrix
>>>> <[email protected]> wrote:
>>>>> On 10/01/2014 11:12 AM, GP Orcullo wrote:
>>>>>> On Oct 1, 2014 3:54 PM, "Gilles Chanteperdrix" <
>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>> On 10/01/2014 01:32 AM, GP Orcullo wrote:
>>>>>>>> On Sep 30, 2014 8:16 PM, "Gilles Chanteperdrix" <
>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On 09/30/2014 02:04 PM, GP Orcullo wrote:
>>>>>>>>>> On Sep 30, 2014 7:30 PM, "Gilles Chanteperdrix" <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 09/30/2014 07:31 AM, GP Orcullo wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Running the switchtest for extended periods (>10 mins) causes the
>>>>>>>>>>>> machine to lock up.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm running a modified xeno-regression-test which contains only the
>>>>>>>>>>>> following tests:
>>>>>>>>>>>>
>>>>>>>>>>>> check_alive /usr/lib/xenomai/testsuite/switchtest
>>>>>>>>>>>> check_alive /usr/lib/xenomai/testsuite/switchtest -s 1000
>>>>>>>>>>>> check_alive /usr/lib/xenomai/testsuite/latency ${1+"$@"}
>>>>>>>>>>>>
>>>>>>>>>>>> The script is invoked with the following arguments:
>>>>>>>>>>>>
>>>>>>>>>>>> nohup sudo ./xeno-regression-test -l
>>>>>>>>>>>> "/usr/lib/xenomai/testsuite/dohell -m /media/work 36000" -t 2 >
>>>>>>>>>>>> /dev/null & top -d0.5
>>>>>>>>>>>>
>>>>>>>>>>>> The kernel dumps the OOPS information intermittently so it's difficult
>>>>>>>>>>>> to diagnose the issue.
>>>>>>>>>>>>
>>>>>>>>>>>> Attached is the kernel config and the logfile.
>>>>>>>>>>>
>>>>>>>>>>> Ok, this is an exynos. Sorry, but I have never seen the patch for
>>>>>>>>>>> exynos, so I do not know what is inside. You should direct your
>>>>>>>>>>> questions to whoever provided you with this support.
>>>>>>>>>>
>>>>>>>>>> I'm in the process of porting xenomai to run on exynos.
>>>>>>>>>>
>>>>>>>>>> The ipipe-core-3.8.13-arm-3.patch applies cleanly to the 3.8.13.11 kernel
>>>>>>>>>> used by the odroid U3 board.
>>>>>>>>>>
>>>>>>>>>> Attached is the ipipe patch that I've made.
>>>>>>>>>>
>>>>>>>>>> I was just wondering what would cause switchtest to fail. The error that I
>>>>>>>>>> can see is that the system is running out of memory and I don't know
>>>>>>>>>> exactly what is causing this.
>>>>>>>>>
>>>>>>>>> Certainly not switchtest as it does not do any memory allocation.
>>>>>>>>> However, the dohell script has a loop creating a large file and removing
>>>>>>>>> it. So, could you try and run the dohell script with an unpatched kernel
>>>>>>>>> and see if you have the error?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Running dohell on a patched and unpatched kernel doesn't trigger the lockup.
>>>>>>>>
>>>>>>>> Running switchtest without dohell works OK.
>>>>>>>
>>>>>>> Is the problem a lockup, or an OOM?
>>>>>>>
>>>>>>
>>>>>> It's a lockup.
>>>>>>
>>>>>> The OOM message is the only one that I've captured so far.  Most of the
>>>>>> time the kernel doesn't spew any messages before the lockup.
>>>>>>
>>>>>> The lockups are repeatable but generating any error messages isn't.
>>>>>
>>>>> Are you running the tests on the serial console, or with ssh? Do you
>>>>> have unlocked context switch enabled? Have you tried enabling some debug
>>>>> options?
>>>>>
>>>>
>>>> I'm using the serial console to log the kernel messages and ssh to run
>>>> the command. Using purely the serial console has the same results.
>>>
>>> The main point was to avoid redirecting standard error to /dev/null to
>>> see any application error message. Doing this on the serial console may
>>> be a better idea than on ssh, because it means you are less likely to
>>> miss a message that would be sent just prior to the system dying.
>>>
>>>>
>>>> Is this the context switch?: "CONFIG_XENO_HW_UNLOCKED_SWITCH=y"
>>>
>>> Yes, please try to disable it if you have it enabled.
>>>
>>>>
>>>> I will try playing again with the debug options and see if I can get
>>>> something useful.
>>>>
>>>>> Also note that xeno-regression-test puts the system under a lot of
>>>>> stress, so it may happen that there is no output for some time (several
>>>>> minutes), normally the test should stop by itself if there is no output
>>>>> for something like 30 minutes. So, I would recommend not redirecting
>>>>> xeno-test output to see if there is any error before the lockup, and
>>>>> when you see the lockup, leave the system for 30 minutes to see if it
>>>>> does not restart or if xeno-regression-test can exit gracefully.
>>>>>
>>>>
>>>> This is a total lockup. There's a heartbeat LED that dies when it occurs.
>>>
>>> Well, the heartbeat LED does not prove anything: some Linux kernel
>>> activity can very well prevent it from being toggled, for instance if
>>> it is toggled by a thread and the activity that hogs the kernel is a
>>> softirq that never ends.
>>>
>>>>
>>>> Attached is one error log that I had captured previously, and this one
>>>> had CONFIG_CPU_IDLE enabled. I've lost track of which kernel this
>>>> trace came from but maybe the error looks familiar.
>>>
>>> This trace is missing an important piece of information: the reason for
>>> the error. So, please capture the serial console to a file, and post the
>>> complete file, from boot up to the error.
>>>
>>> Anyway, you did not answer my question: did you try to leave the
>>> system on for, say, 30 minutes or 1 hour after the lockup to see if it
>>> does not recover?
>>>
>>>
>>
>> The system never recovered.
>>
>> With the context switch disabled, I was able to capture this error:
>>
>> [  210.482299] INFO: rcu_preempt detected stalls on CPUs/tasks:)
>> [  210.487790] Task dump for CPU 2:
>> [  210.490995] switchtest      R running      0  3915   3639 0x00000002
>> [  210.497340] [<c0453ddc>] (__schedule+0x1fc/0x5f8) from [<00000010>] (0x10)
>> [  390.507943] INFO: rcu_preempt detected stalls on CPUs/tasks: { 2} (detected )
>> [  390.513510] Task dump for CPU 2:
>> [  390.516716] switchtest      R running      0  3915   3639 0x00000002
>> [  390.523065] [<c0453ddc>] (__schedule+0x1fc/0x5f8) from [<00000010>] (0x10)
>>
>> <c0453ddc> points to the following section:
>>
>> #ifndef __ARCH_WANT_UNLOCKED_CTXSW
>>         spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
>> c0453dc8:       ebf04b13        bl      c0066a1c <lock_release>
>> #endif
>>
>>         context_tracking_task_switch(prev, next);
>
> You do not have context tracking enabled, right?
>

# CONFIG_XENO_HW_UNLOCKED_SWITCH is not set
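
For reference, the state of an option can be read straight from the kernel config file. The helper below is only a minimal sketch: `config_state` is a name made up for illustration, the `.config` path in the usage line is hypothetical, and it assumes context tracking is controlled by an option called CONFIG_CONTEXT_TRACKING, which this thread does not confirm.

```shell
# Report how a kernel option appears in a .config-style file.
# Usage: config_state OPTION FILE
# (config_state is a hypothetical helper, not part of Xenomai or the kernel.)
config_state() {
    opt="$1"
    file="$2"
    if grep -q "^${opt}=" "$file"; then
        # Option is set: print the first matching assignment, e.g. FOO=y
        grep "^${opt}=" "$file" | head -n 1
    elif grep -q "^# ${opt} is not set" "$file"; then
        # Option is known to this config but explicitly disabled
        echo "# ${opt} is not set"
    else
        # Option does not appear in the file at all
        echo "${opt} not present"
    fi
}
```

It could be called as `config_state CONFIG_CONTEXT_TRACKING .config`. An option that is absent from the file altogether is reported separately from one that is present but disabled.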

Getting this board to spew out error messages is tough.
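
Following the earlier advice not to send the test output to /dev/null, one possible way to keep application errors visible while also logging them is sketched below. `run_logged` is a hypothetical helper and the log path in the usage line is made up. A log on the board's own storage may lose its last lines when the machine dies, so this would not replace capturing the serial console on another machine.

```shell
# Run a command, showing stdout and stderr on the terminal while also
# appending both to a log file.
# Usage: run_logged LOGFILE COMMAND [ARGS...]
# (run_logged is a hypothetical helper name used only for illustration.)
run_logged() {
    log="$1"
    shift
    # Merge stderr into stdout so application errors are not lost,
    # then duplicate the combined stream to the terminal and the log.
    # Note: the return status is that of tee, not of the command.
    "$@" 2>&1 | tee -a "$log"
}
```

For example: `run_logged /tmp/xeno-test.log sudo ./xeno-regression-test -t 2`, with the `-l` load command added as in the original invocation.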

>
> --
>                                                                 Gilles.



-- 
GP Orcullo

_______________________________________________
Xenomai mailing list
[email protected]
http://www.xenomai.org/mailman/listinfo/xenomai
