On 6 Apr 2014, at 22:04, Gilles Chanteperdrix wrote:

On 04/06/2014 10:57 PM, Andreas Glatz wrote:

On 6 Apr 2014, at 16:28, Gilles Chanteperdrix wrote:

On 04/06/2014 05:22 PM, Andreas Glatz wrote:

On 6 Apr 2014, at 15:44, Gilles Chanteperdrix wrote:

On 04/06/2014 01:21 PM, Andreas Glatz wrote:

On 4 Apr 2014, at 11:44, Gilles Chanteperdrix wrote:

On 04/04/2014 12:27 PM, Andreas Glatz wrote:
Hi Gilles,

I'm finally back to my original problem below:

On 6 Jan 2014, at 17:39, Gilles Chanteperdrix wrote:

On 01/06/2014 04:30 PM, Andreas Glatz wrote:
Hi,

I managed to produce a kernel (v3.8.13) with the xenomai 2.6.3 ipipe
patch and a rootfs (debian wheezy) with the xenomai 2.6.3 libraries for
my Pandaboard ES (omap4460). The simple regression test, which only
calls dd during the switchtest, works fine. However, the regression
test with the linux test project (ltp-full-20130904) scripts causes
some sort of system lock-up. After that I can only ctrl-c
xeno-regression-test (i.e. switchtest), which, however, doesn't help
to regain console access (neither over ethernet nor serial).

Here's what I did:

-- Building --
As recommended in the Xenomai 2.6 readme I followed the instructions
in [1] to produce a kernel and filesystem. To get a xenomai kernel I
had to do three things differently:

*) I used: git checkout origin/v3.8.x -b tmp
*) I applied ipipe-core-3.8.13-arm-3.patch from the xenomai-2.6 git
   tree as described in the Xenomai 2.6 readme
*) I disabled KGDB and TIDSPBRIDGE since those produced compile errors
   (see config [2])

After a while I obtained the following messages from dmesg [3] and
from the command prompt:

root@arm:~# cat /proc/version
Linux version 3.8.13-x3.6 (aglatz@linuxvbox) (gcc version 4.7.3 20130328 (prerelease) (crosstool-NG linaro-1.13.1-4.7-2013.04-20130415 - Linaro GCC 2013.04) ) #4 SMP Sat Jan 4 15:54:20 GMT 2014

-- Testing Linux --
To see if everything works I downloaded and cross-compiled
ltp-full-20130904 [4] with the same toolchain and flags
(-march=armv7-a -mfpu=vfp3) as the xenomai libs and runtime. I started
ltp with "./runltp -p -l dohell-2014-01-06-1.log -S xenomai.skiplist"
and after a while it finished with a few failed tests [5]. The console
access, however, worked fine.

-- Testing Xenomai --
First I could successfully run the simple xenomai regression test:
xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp 100" -t 2
which produced the output in [6] and the following additional messages
with dmesg:

[  476.215057] Xenomai: RTDM: closing file descriptor 1.
[  477.434936] Xenomai: Posix: destroying semaphore f0069c00.
[  477.440887] Xenomai: Posix: destroying mutex f0069a00.
[  477.475372] xnheap: destroying shared heap 'rt_heap: heap' with 16384 bytes still in use.
[  479.008453] Xenomai: Switching rt_task to secondary mode after exception #0 from user-space at 0x9620 (pid 2145)
[  480.574462] Xenomai: watchdog triggered -- signaling runaway thread 'rt_task'
[  480.582061] [sched_delayed] sched: RT throttling activated
[  557.336425] Xenomai: Posix: closing message queue descriptor 3.

and  "cat /proc/xenomai/*" produced [7].

When I started the realistic xenomai regression test:
xeno-regression-test -l "/usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp" -t 2
everything seemed fine at first - I could log on and start top to
inspect the running processes. However, the command line (over serial
and ethernet) consistently freezes after a while (at different ltp
tests though). First I thought it was the massive system load which
doesn't leave CPU for the console... however, ctrl-c of
xeno-regression-test does not help to regain console access...

That is because killing xeno-regression-test does not kill all the
script's children. So, basically, the load tasks are still running.
Also, what filesystem is /tmp? dohell is using dd to alternately write
to /tmp, then erase the file. If /tmp is some flash, it will become
slow after a while. If it is a tmpfs, it will eat RAM.
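
Both points can be checked from a shell. What follows is only a minimal
sketch, assuming a Linux target with /proc/mounts, a POSIX sh and a
procps-style ps; the function names fs_type_of and kill_group are made
up for illustration and are not part of the Xenomai testsuite.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the Xenomai testsuite.

# Print the filesystem type backing DIR, using the longest matching
# mount point in MOUNTS_FILE (defaults to /proc/mounts).
fs_type_of() {
    dir=$1
    mounts=${2:-/proc/mounts}
    awk -v d="$dir" '
        {
            mp = $2
            if (mp == "/" || d == mp || index(d, mp "/") == 1) {
                # ">=" lets a later mount on the same mount point win
                if (length(mp) >= best) { best = length(mp); type = $3 }
            }
        }
        END { print type }
    ' "$mounts"
}

# Send SIGTERM to the whole process group of PID, so that children
# started by a script are signalled as well.
kill_group() {
    pgid=$(ps -o pgid= -p "$1" | tr -d ' ')
    [ -n "$pgid" ] && kill -s TERM -- "-$pgid"
}
```

"fs_type_of /tmp" would print e.g. tmpfs or ext4. "kill_group PID" only
reaches children that stayed in the same process group, and it also
signals the calling shell if that shell shares the group, so it is
best run from a separate login.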



The described problem is _very_ reproducible on my PandaBoard ES
(omap4460), where I boot from an SD card partition and the rootfs is
also on the SD card partition. I tried it with several kernel versions
(3.8.13, 3.10.18, and 3.10.34) with the latest ipipe and xenomai from
the git repos. Every time I start the regression test (see command
above) the following happens: Everything works fine until the
switch/latency tests start. Then I see that there is heavy access to
the SD card, which is expected, as the status LED 2 is blinking. After
~5mins this status LED is constantly on. That's when I know that
everything is over. On the console I can only execute commands that
are already in RAM, such as the bash things like ps, mount, ...
However, if I try a simple 'touch new' it blocks forever, and I know
that it blocks in the syscall where the file should be created,
because I looked at it with strace. I tried several things: I turned
off CONFIG_PM (which was on by default), turned on the MMC debugging,
and put extra printk's in the omap_hsmmc.c ISR. However, everything
seems to work on this level: DMA requests are started and do finish,
and the ISR is called regularly (because at first I thought that
Xenomai would starve it).
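
One more cheap check for the blocked 'touch' case is to list the tasks
sitting in uninterruptible sleep (state D) together with the kernel
function they wait in. This is just a sketch, assuming a procps-style
ps that understands the pid, stat, wchan and comm columns; the helper
name blocked_tasks is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: keep the header line plus every task whose STAT
# column starts with D (uninterruptible sleep).
# Expects `ps -eo pid,stat,wchan:32,comm` style output on stdin.
blocked_tasks() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

Usage would be "ps -eo pid,stat,wchan:32,comm | blocked_tasks", run
from a shell that was started before the hang, since loading a new
binary from the SD card may itself block.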

Have you ever run Xenomai on this _specific_ board (since everything
is running smoothly on the omap5 board)?
Any more ideas how to debug it?

Currently, I'm compiling in the ipipe tracer in the hope that it will
tell me something useful...

Oh yes, the best bit is that the regression test works perfectly fine
if I boot from an external USB HD _AND_ unmount (!) all MMC partitions.

So, the MMC driver has a problem. Have you tried:
- running the exact same kernel configuration only with CONFIG_XENOMAI
  disabled (and stress with dohell)
- then with CONFIG_XENOMAI and CONFIG_IPIPE disabled.

Also, do you have this patch in the tree you tried?
http://git.xenomai.org/ipipe.git/commit/?h=stable/ipipe-3.10.18&id=c26e7ad5679f9391cd8ea1db001bf301d2f6bc88


First I mounted tmpfs on /tmp so I don't wear out the SD card too much:
mount -t tmpfs -osize=192M tmpfs /tmp

Then I used the following line to start the test (substitute MYTEST
below with the following line):
/usr/lib/xenomai/testsuite/dohell -m /tmp -l /opt/ltp

Note: I always monitored the test over wifi with 'top' so I also had
some network load...

I got the following results with the 3.10.34 kernel, which includes
everything up to the current ipipe-3.10 tag (it also included the
patch you mentioned):

- xeno-regression-test "MYTEST" -> FAIL if booted from SD card (see
  description above); OK if booted from ext USB HD _AND_ no mmc
  partitions mounted
- CONFIG_IPIPE && CONFIG_XENOMAI && MYTEST -> FAIL (got status LED 2
  constantly on as described above)
- CONFIG_IPIPE && MYTEST -> OK (see attached config file and ltp test
  log)

Anything else I should try?

Is the current LTP test when the failure happens always the same?



I went through all the logfiles on my pandaboard and identified the
last tests that ltp logged before the error occurred (I'm assuming
that ltp writes to the file in /opt/ltp/results after completing the
test, since the PASS/FAIL note is there as well, which logically
should only be available after completing the test):

test                 count
==========================
rt_sigqueueinfo01        1
clock_nanosleep01       10
munmap02                 1
semget06                 1
epoll_create1_01         5
splice01                 1
clock_getres01           1
rename13                 1
BindMounts               1
utimes01                 1

So the test after 'clock_nanosleep01', which is 'clone01' according to
the LTP log file I sent you, seems to be the prime hotspot of failure,
followed by 'epoll01', which comes after 'epoll_create1_01'.
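
A tally like the one above can be produced with a few lines of shell.
This is a minimal sketch under the assumption that each log file has
the test name in the first column of its last non-empty line; the
helper name last_test_tally is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: for each log file given, take the first column of
# its last non-empty line, then count how often each name occurs,
# most frequent first.
last_test_tally() {
    for f in "$@"; do
        awk 'NF { last = $1 } END { if (last != "") print last }' "$f"
    done | sort | uniq -c | sort -rn
}
```

It would be called with the result logs as arguments, e.g.
"last_test_tally /opt/ltp/results/*", and prints one "count name" line
per test.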

I'm using the standard LTP version 'ltp-full-20130904', which I
downloaded and compiled on the target with gcc 4.6.3 (default debian
wheezy).

Ok. I am not sure it is meaningful. Anyway, the only difference
between CONFIG_XENOMAI + CONFIG_IPIPE and CONFIG_IPIPE alone, provided
that you are not running any program using Xenomai, is the host tick
emulation.

So, could you please try to turn off
CONFIG_NO_HZ_IDLE
CONFIG_NO_HZ
CONFIG_HIGH_RES_TIMERS

And see if it works better?
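
Before rerunning the test it may be worth confirming that the booted
kernel really has these options off. A minimal sketch, assuming the
kernel config is available as a plain text file (for instance the
build tree's .config, or /proc/config.gz unpacked if CONFIG_IKCONFIG_PROC
is set); the helper name config_state is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the value of a kernel config OPTION from
# CONFIG_FILE: y, m, a literal value, or n if it is unset or absent.
config_state() {
    file=$1
    opt=$2
    # The trailing "=" keeps CONFIG_NO_HZ from matching CONFIG_NO_HZ_IDLE.
    val=$(sed -n "s/^${opt}=//p" "$file" | head -n 1)
    echo "${val:-n}"
}
```

For example:

for o in CONFIG_NO_HZ_IDLE CONFIG_NO_HZ CONFIG_HIGH_RES_TIMERS; do
    echo "$o=$(config_state .config "$o")"
done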


As I wrote before, I recompiled the kernel with your timer options and
CONFIG_XENOMAI, installed it, synced, and rebooted after cutting the
power to the board for ~10secs.

It seems that with those options it got much further with the tests.
However, eventually all ssh connections broke and the last messages
on the console where I started dohell were:

[...]
102400000 bytes (102 MB) copied, 2.97674 s, 34.4 MB/s
100+0 records in
100+0 records out
102400000 bytes (102 MB) copied, 1.97433 s, 51.9 MB/s
100+0 records in
100+0 records out
102400000 bytes (102 MB) copied, 2.68371 s, 38.2 MB/s
100+0 records in
100+0 records out
102400000 bytes (102 MB) copied, 2.57073 s, 39.8 MB/s
dd: writing `/tmp/bigfile': No space left on device
7+0 records in
6+0 records out
6164480 bytes (6.2 MB) copied, 0.189001 s, 32.6 MB/s
/usr/lib/xenomai/testsuite/dohell: 62: /usr/lib/xenomai/testsuite/dohell: Cannot fork

This may simply be due to some LTP test which forks a lot and prevents
the system from being able to fork. This should be a temporary situation.

Write failed: Host is down

... and as usual status LED 2 is permanently on.

As you suspect there's something wrong with the timer subsystem, I
looked around a bit to see what extra patches went into the 3.10.14
kernel of RobertCNelson, which I used as a base to merge the ipipe git
tree. Here is the list:

0001-panda-fix-wl12xx-regulator.patch
0002-ti-st-st-kim-fixing-firmware-path.patch
0003-Panda-expansion-add-spidev.patch
0004-HACK-PandaES-disable-cpufreq-so-board-will-boot.patch
0005-HACK-panda-enable-OMAP4_ERRATA_I688.patch
0006-ARM-hw_breakpoint-Enable-debug-powerdown-only-if-sys.patch
0007-Revert-regulator-twl-Remove-TWL6030_FIXED_RESOURCE.patch
0008-Revert-regulator-twl-Remove-another-unused-variable-.patch
0009-Revert-regulator-twl-Remove-references-to-the-twl403.patch
0010-Revert-regulator-twl-Remove-references-to-32kHz-cloc.patch
0011-panda-spidev-setup-pinmux.patch

Do you think those may have something to do with it?

I do not think so. When the LED is still on, can you use the serial
console to run cat /proc/interrupts to see if the timer is still ticking?
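
Reading /proc/interrupts twice and comparing the counters makes the
"still ticking" check less error-prone. A minimal sketch follows; the
helper names irq_count and timer_ticking are made up for illustration,
and the pattern that selects the timer lines is an assumption that has
to be adapted to what the board's own /proc/interrupts shows.

```shell
#!/bin/sh
# Hypothetical helper: sum the per-CPU counters of all lines in an
# interrupts FILE whose text matches PATTERN (lowercase regex, matched
# against the lowercased line).
irq_count() {
    file=$1
    pat=$2
    awk -v pat="$pat" '
        tolower($0) ~ pat {
            # fields 2.. are per-CPU counts up to the first non-numeric field
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) sum += $i
        }
        END { printf "%.0f\n", sum }
    ' "$file"
}

# Report whether the matching counters advance within SECONDS (default 2).
timer_ticking() {
    before=$(irq_count /proc/interrupts "$1")
    sleep "${2:-2}"
    after=$(irq_count /proc/interrupts "$1")
    if [ "$after" -gt "$before" ]; then echo ticking; else echo stalled; fi
}
```

With a suitable pattern, "timer_ticking 'timer' 2" prints either
ticking or stalled.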


I ran the test again with the same kernel and traced the messages from the serial console with minicom. Again, the test ran for quite some time until I got stacktraces similar to [1] (which might be just related to the ltp memcg test).

However, after these stacktraces I got the following message on the serial console (LED2 also went on and stayed on):

[...]
[ 6674.540000] omap_hsmmc omap_hsmmc.0: MMC start dma failure
[ 6674.540000] mmcblk0: unknown error -22 sending read/write command, card status 0x900
[ 6674.550000] end_request: I/O error, dev mmcblk0, sector 12751744
[ 6674.560000] EXT4-fs warning (device mmcblk0p2): __ext4_read_dirblock:908: error reading directory block (ino 397703, block 0)
[...]
[ 6932.610000] omap_hsmmc omap_hsmmc.0: MMC start dma failure
[ 6932.610000] mmcblk0: unknown error -22 sending read/write command, card status 0x900
[ 6932.620000] end_request: I/O error, dev mmcblk0, sector 21142904
[ 6932.630000] EXT4-fs warning (device mmcblk0p2): __ext4_read_dirblock:908: error reading directory block (ino 657554, block 0)
[...]
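
For reference, the -22 in these messages is -EINVAL. To see whether the
failures cluster on particular sectors, the affected device and sector
can be pulled out of a saved console log. This is only a sketch that
assumes the "end_request: I/O error, dev ..., sector ..." wording shown
above; the helper name io_error_sectors is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "device sector" for every block-layer I/O
# error line found in kernel log text on stdin.
io_error_sectors() {
    awk '/end_request: I\/O error/ {
        dev = ""; sec = ""
        for (i = 1; i < NF; i++) {
            if ($i == "dev")    dev = $(i + 1)
            if ($i == "sector") sec = $(i + 1)
        }
        sub(/,$/, "", dev)   # drop the comma after the device name
        print dev, sec
    }'
}
```

Usage would be "dmesg | io_error_sectors", or the same with a minicom
capture file on stdin.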

Although dd is still running on minicom, I lost the ssh connection over Ethernet (and I couldn't get it back even after disconnecting and reconnecting the cable, which didn't produce any PHY interrupt message in dmesg either) and I cannot Ctrl-C or do anything on the serial console... I just see dd, which was started by dohell, getting invoked.

So with the periodic timer ltp runs for much longer; however, I can't get the console back after the mmc (?) failure, which I was able to do with the original timer subsystem config.

... and xeno-regression-test "MYTEST" fails as usual after ~ 5mins.

A.



[1] memcg related stacktrace:
=======================
[ 6606.000000] memcg_process invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[ 6606.010000] memcg_process cpuset=/ mems_allowed=0
[ 6606.010000] CPU: 0 PID: 26237 Comm: memcg_process Tainted: G W 3.10.32-x3.4 #26
[ 6606.020000] [<c0014e0c>] (unwind_backtrace+0x0/0xe8) from [<c00122ac>] (show_stack+0x20/0x24)
[ 6606.030000] [<c00122ac>] (show_stack+0x20/0x24) from [<c081e0b0>] (dump_stack+0x20/0x28)
[ 6606.040000] [<c081e0b0>] (dump_stack+0x20/0x28) from [<c081a610>] (dump_header.isra.11+0x98/0x1ac)
[ 6606.050000] [<c081a610>] (dump_header.isra.11+0x98/0x1ac) from [<c01948e8>] (oom_kill_process+0x6c/0x3a0)
[ 6606.060000] [<c01948e8>] (oom_kill_process+0x6c/0x3a0) from [<c01d0fe8>] (__mem_cgroup_try_charge+0xb00/0xb50)
[ 6606.070000] [<c01d0fe8>] (__mem_cgroup_try_charge+0xb00/0xb50) from [<c01d14f0>] (mem_cgroup_charge_common+0x44/0x6c)
[ 6606.080000] [<c01d14f0>] (mem_cgroup_charge_common+0x44/0x6c) from [<c01d2958>] (mem_cgroup_newpage_charge+0x34/0x3c)
[ 6606.090000] [<c01d2958>] (mem_cgroup_newpage_charge+0x34/0x3c) from [<c01b5718>] (handle_pte_fault+0x718/0x878)
[ 6606.100000] [<c01b5718>] (handle_pte_fault+0x718/0x878) from [<c01b5968>] (handle_mm_fault+0xf0/0x144)
[ 6606.110000] [<c01b5968>] (handle_mm_fault+0xf0/0x144) from [<c01b5c7c>] (__get_user_pages.part.72+0x2c0/0x434)
[ 6606.120000] [<c01b5c7c>] (__get_user_pages.part.72+0x2c0/0x434) from [<c01b5e38>] (__get_user_pages+0x48/0x50)
[ 6606.130000] [<c01b5e38>] (__get_user_pages+0x48/0x50) from [<c01b6b24>] (__mlock_vma_pages_range+0x74/0x7c)
[ 6606.140000] [<c01b6b24>] (__mlock_vma_pages_range+0x74/0x7c) from [<c01b6fc4>] (__mm_populate+0xd8/0x13c)
[ 6606.150000] [<c01b6fc4>] (__mm_populate+0xd8/0x13c) from [<c01a9930>] (vm_mmap_pgoff+0xac/0xb8)
[ 6606.160000] [<c01a9930>] (vm_mmap_pgoff+0xac/0xb8) from [<c01b8dd8>] (SyS_mmap_pgoff+0xb0/0xec)
[ 6606.160000] [<c01b8dd8>] (SyS_mmap_pgoff+0xb0/0xec) from [<c000e020>] (ret_fast_syscall+0x0/0x50)
[ 6606.170000] Task in /1/subgroup killed as a result of limit of /1
[ 6606.180000] memory: usage 4kB, limit 4kB, failcnt 6
[ 6606.190000] memory+swap: usage 4kB, limit 9007199254740991kB, failcnt 0
[ 6606.190000] kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
[ 6606.200000] Memory cgroup stats for /1: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[ 6606.220000] Memory cgroup stats for /1/subgroup: cache:0KB rss:4KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:4KB
[ 6606.230000] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 6606.240000] [26237] 0 26237 404 84 3 0 0 memcg_process
[ 6606.250000] Memory cgroup out of memory: Kill process 26237 (memcg_process) score 85000 or sacrifice child
[ 6606.260000] Killed process 26237 (memcg_process) total-vm:1616kB, anon-rss:68kB, file-rss:268kB




