Hi Richard, Ian,

Any progress on the issue? In case not, I am adding a few Linaro folks
who work on aarch64 qemu. Maybe they can give some insight.

I was able to reproduce the hang on my system and look at it under
gdb. It seems that some strange aarch64 peculiarity might be in play.
Details are inline; the root cause is still not clear.

On Sat, 3 Mar 2018, Ian Arkver wrote:

On 03/03/18 10:51, Ian Arkver wrote:
On 03/03/18 09:00, Richard Purdie wrote:
Hi,

I need some help with a problem we keep seeing:

https://autobuilder.yocto.io/builders/nightly-arm64/builds/798

Basically, now and again, for reasons we don't understand, all the
sanity tests fail for qemuarm64.

I've poked at this a bit and if I go onto the failed machine and run
them again, they work, using the same image, kernel and qemu binaries.
We've seen this on two different autobuilder infrastructures on varying
host OSs. They always seem to fail all three at once.

Whilst this was a mut build, I saw this repeat three builds in a row on
the new autobuilder we're setting up with master.

The kernels always seem to hang somewhere around the:

| [    0.766079] raid6: int64x1  xor()   302 MB/s
| [    0.844597] raid6: int64x2  gen()   675 MB/s

I believe this is related to btrfs and comes from having btrfs compiled into the kernel. You could maybe side-step the problem (and hence leave it lurking) by changing btrfs to a module.

Actually, this comes from a library (lib/raid6), and in 4.14.y's arm64 defconfig BTRFS is already a module, so please disregard my hack suggestion.
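
(For anyone checking their own tree: the boot-time benchmark comes from
the lib/raid6 library itself, so a quick grep of the generated .config
in the kernel build directory, e.g.

   grep -E 'CONFIG_BTRFS_FS|CONFIG_RAID6_PQ|CONFIG_MD_RAID456' .config

shows whether CONFIG_RAID6_PQ ends up =y, which is enough to run the
benchmark at boot even when BTRFS itself is =m.)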

Indeed, in my case, when I run qemu with the remote gdbserver enabled
and, in the hung-boot case, press Ctrl-C to drop into gdb, I see the
following backtrace:

(gdb) bt
#0  vectors ()
    at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/arch/arm64/kernel/entry.S:376
#1  0xffffff80089a2ff4 in raid6_choose_gen (disks=<optimized out>, 
dptrs=<optimized out>)
    at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c:190
#2  raid6_select_algo ()
    at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c:253
#3  0xffffff8008083b8c in do_one_initcall (fn=0xffffff80089a2e64 
<raid6_select_algo>)
    at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:832
#4  0xffffff8008970e80 in do_initcall_level (level=<optimized out>)
    at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:898
#5  do_initcalls () at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:906
#6  do_basic_setup () at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:924
#7  kernel_init_freeable ()
    at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:1073
#8  0xffffff80087a2e00 in kernel_init (unused=<optimized out>)
    at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/init/main.c:999
#9  0xffffff80080850ec in ret_from_fork ()
    at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/arch/arm64/kernel/entry.S:994
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) x /10i $pc - 12
   0xffffff8008082274 <vectors+628>:      nop
   0xffffff8008082278 <vectors+632>:      nop
   0xffffff800808227c <vectors+636>:      nop
=> 0xffffff8008082280 <vectors+640>:   sub     sp, sp, #0x140
   0xffffff8008082284 <vectors+644>:      add     sp, sp, x0
   0xffffff8008082288 <vectors+648>:      sub     x0, sp, x0
   0xffffff800808228c <vectors+652>:      tbnz    w0, #14, 0xffffff800808229c <vectors+668>
   0xffffff8008082290 <vectors+656>:      sub     x0, sp, x0
   0xffffff8008082294 <vectors+660>:      sub     sp, sp, x0
   0xffffff8008082298 <vectors+664>:      b       0xffffff8008082fc0 <el1_irq>
(gdb) f 1
#1  0xffffff80089a2ff4 in raid6_choose_gen (disks=<optimized out>, 
dptrs=<optimized out>)
    at 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c:190
190                             preempt_disable();
(gdb) x /12i $pc - 12
   0xffffff80089a2fe8 <raid6_select_algo+388>:    cbz     x0, 0xffffff80089a3098 <raid6_select_algo+564>
   0xffffff80089a2fec <raid6_select_algo+392>:    mov     w0, #0x1                        // #1
   0xffffff80089a2ff0 <raid6_select_algo+396>:    bl      0xffffff80080cc498 <preempt_count_add>
=> 0xffffff80089a2ff4 <raid6_select_algo+400>: ldr     x0, [x23, #2688]
   0xffffff80089a2ff8 <raid6_select_algo+404>:    ldr     x5, [x23, #2688]
   0xffffff80089a2ffc <raid6_select_algo+408>:    cmp     x0, x5
   0xffffff80089a3000 <raid6_select_algo+412>:    b.ne    0xffffff80089a300c <raid6_select_algo+424>  // b.any
   0xffffff80089a3004 <raid6_select_algo+416>:    yield
   0xffffff80089a3008 <raid6_select_algo+420>:    b       0xffffff80089a2ff8 <raid6_select_algo+404>
   0xffffff80089a300c <raid6_select_algo+424>:    mov     x25, #0x0                       // #0
   0xffffff80089a3010 <raid6_select_algo+428>:    ldr     x0, [x23, #2688]
   0xffffff80089a3014 <raid6_select_algo+432>:    mov     x4, x27
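
For reference, my reading of that loop (hand annotations, offsets as in
the dump above):

   +400: ldr  x0, [x23, #2688]   // j0 = jiffies
   +404: ldr  x5, [x23, #2688]   // j1 = jiffies
   +408: cmp  x0, x5
   +412: b.ne +424               // leave the loop once jiffies changes
   +416: yield                   // cpu_relax()
   +420: b    +404               // otherwise re-read jiffies
   +424: ...                     // code after the wait loop

So the compiled code reloads jiffies from memory on every iteration; it
is not a case of the value being cached in a register by the compiler.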

(gdb) b *0xffffff80089a2ff4
Breakpoint 8 at 0xffffff80089a2ff4: file 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work-shared/qemuarm64/kernel-source/lib/raid6/algos.c,
 line 191.

This corresponds to this code in lib/raid6/algos.c:

   190                          preempt_disable();
   191                          j0 = jiffies;
   192                          while ((j1 = jiffies) == j0)
   193                                  cpu_relax();
   194                          while (time_before(jiffies,
   195                                              j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
   196                                  (*algo)->xor_syndrome(disks, start, stop,
   197                                                        PAGE_SIZE, *dptrs);
   198                                  perf++;
   199                          }
   200                          preempt_enable();
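
If I read the constants right (RAID6_TIME_JIFFIES_LG2 is 4 in
include/linux/raid/pq.h, and assuming the usual arm64 HZ=250 here),
each measurement window should only be

   (1 << 4) jiffies / 250 HZ = 64 ms

per algorithm, plus at most one jiffy spent waiting for the edge at
lines 192-193, so staying in this code indefinitely only makes sense if
jiffies stops advancing while this CPU spins with preemption disabled.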

If, for experiment's sake, I disable the loop that waits for a jiffies
transition, i.e. have something like this:

diff --git a/lib/raid6/algos.c b/lib/raid6/algos.c
index 4769947..e0199fc 100644
--- a/lib/raid6/algos.c
+++ b/lib/raid6/algos.c
@@ -166,8 +166,12 @@ static inline const struct raid6_calls *raid6_choose_gen(

                        preempt_disable();
                        j0 = jiffies;
+#if 0
                        while ((j1 = jiffies) == j0)
                                cpu_relax();
+#else
+                        j1 = jiffies;
+#endif /* 0 */
                        while (time_before(jiffies,
                                            j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
                                (*algo)->gen_syndrome(disks, PAGE_SIZE, *dptrs);
@@ -189,8 +193,12 @@ static inline const struct raid6_calls *raid6_choose_gen(

                        preempt_disable();
                        j0 = jiffies;
+#if 0
                        while ((j1 = jiffies) == j0)
                                cpu_relax();
+#else
+                        j1 = jiffies;
+#endif /* 0 */
                        while (time_before(jiffies,
                                            j1 + (1<<RAID6_TIME_JIFFIES_LG2))) {
                                (*algo)->xor_syndrome(disks, start, stop,

Image boots fine after that.

I.e. it looks like some strange effect in aarch64 qemu where jiffies
does not progress and the code gets stuck.
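
One way to double-check this from the same gdb session (just a sketch;
I have not captured the actual numbers) is to sample the counter twice,
letting the guest run for a second or so in between:

   (gdb) p jiffies_64
   (gdb) continue
   ^C
   (gdb) p jiffies_64

If jiffies really is frozen, both reads should show the same value.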

Another observation is that if I put a breakpoint, for example in
do_timer, it actually hits the breakpoint, i.e. the timer interrupt
does happen in this case, and strangely the raid6_choose_gen sequence
then does progress, i.e. debugger breakpoints get this case unstuck.
Actually, pressing Ctrl-C several times to interrupt the target, each
followed by continue in gdb, eventually lets the code get out of
raid6_choose_gen.
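
Along the same lines, a data watchpoint on the counter (again only a
sketch, I have not tried it over a long run) might show whether the
update ever happens without gdb poking the target:

   (gdb) watch jiffies_64
   (gdb) continue

If the watchpoint never fires in the stalled case, the timer interrupt
is simply not being delivered (or is not reaching do_timer).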

Also, whenever I press Ctrl-C in gdb to stop the target in the stalled
case, it always drops in with $pc at the first instruction of el1_irq;
I never saw the hang interrupted at a different $pc. Does that mean
qemu hung on the first instruction of the el1_irq handler? Note that
once I do stepi after that, it is able to proceed. If I continue
stepping, it eventually gets to arch_timer_handler_virt and do_timer.
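
One more thing I can check the next time it wedges (just a sketch, I
have not done it yet): whether the vCPU has interrupts masked at that
point, e.g.

   (gdb) info registers cpsr

in the attached session, or "info registers" in the QEMU monitor (which
would need e.g. -monitor stdio instead of the -monitor null used below)
to look at PSTATE/DAIF.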

For the Linaro qemu aarch64 folks, more details:

The situation happens on latest openembedded-core, for the qemuarm64
MACHINE. It does not always happen, i.e. sometimes it works.

The qemu version is 2.11.1 and it is invoked like this (through the
regular oe runqemu helper utility):

/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/work/x86_64-linux/qemu-helper-native/1.0-r1/recipe-sysroot-native/usr/bin/qemu-system-aarch64
 -device virtio-net-device,netdev=net0,mac=52:54:00:12:34:02 -netdev 
tap,id=net0,ifname=tap0,script=no,downscript=no -drive 
id=disk0,file=/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/deploy/images/qemuarm64/core-image-minimal-qemuarm64-20180305025002.rootfs.ext4,if=none,format=raw
 -device virtio-blk-device,drive=disk0 -show-cursor -device virtio-rng-pci 
-monitor null -machine virt -cpu cortex-a57 -m 512 -serial mon:vc -serial null 
-kernel 
/wd6/oe/20180304/systemtap-oe-sysroot/build/tmp-glibc/deploy/images/qemuarm64/Image
 -append root=/dev/vda rw highres=off  mem=512M 
ip=192.168.7.2::192.168.7.1:255.255.255.0 console=ttyAMA0,38400
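
To get the gdb session shown above I add QEMU's gdbstub options on top
of the same command line (runqemu passes extras through qemuparams) and
attach a cross-gdb to the vmlinux from the same build, roughly:

   ... qemu-system-aarch64 <same options as above> -s -S
   $ aarch64-linux-gnu-gdb vmlinux
   (gdb) target remote :1234
   (gdb) continue

(-s is shorthand for -gdb tcp::1234 and -S starts the guest halted
until gdb connects.)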

My host system is Ubuntu 16.04.

Please let me know if you need additional info and/or want me to
enable additional debug/trace options.

Thanks,
Victor

Regards,
Ian

raid timing measurements.

In the past we've dived in and handled these kinds of things but I've
run out of people to lean on and I need help from the wider community.

Can anyone help look into and fix this?

This is serious: if nobody cares, I'll have to simply stop boot
testing qemuarm64.

Not sure if there is an open bug yet either :/.

Cheers,

Richard

--
_______________________________________________
Openembedded-core mailing list
Openembedded-core@lists.openembedded.org
http://lists.openembedded.org/mailman/listinfo/openembedded-core