To add weight to the timing issue theory, we are seeing the crash much more often (est. 1 in 20) when :
- The -verbose OSv startup argument is supplied or - We TCP connect to the OSv instance whilst it is starting up Is there somewhere I can raise a bug report for this? On Friday, February 26, 2021 at 12:36:43 PM UTC David Smith wrote: > The power-off-on-abort flag worked, thanks for that tip. > > On the crash issue, I built an OSv image with the standard native example: > *./scripts/build > image=native-example* > Then ran continuous launches under firecracker using a bash loop: *while > true; do ./scripts/firecracker.py; done* > After 1249 iterations, I got the same Assert at startup. > > Then I ran continuous launches under qemu: *while true; do > ./scripts/run.py; done* > So far, I'm over 6,500 iterations without failure, so I think this issue > is occurring under firecracker only, but that may be due to startup timing. > > I tried the patch supplied for a similar issue ("[PATCH] pthread_create() > - prefault top of stack") in case it helped, but the Assert still happens > as described in my original post. > > On Wednesday, February 17, 2021 at 3:31:35 PM UTC David Smith wrote: > >> I ran the same test using firecracker. After 1865 successful executions >> (OSv VM startup, my app dumping usage info. to stdout then quitting, OSv VM >> shutdown/poweroff) the 1866'th failed with the same Assert. I'll keep >> trying to find a more easily reproducible test (I think I'll try the sample >> go app. next, which cuts out anything I'm developing). >> With a 1 in 1866 failure, I don't think that's a clear cut "It only >> happens under firecracker", so I'll also repeat both firecracker and Qemu >> tests to get a more certain picture. >> >> On Wednesday, February 17, 2021 at 12:10:27 PM UTC David Smith wrote: >> >>> Thanks for the power-off tip, I'll give that a try. >>> I ran 2,500 cycles of starting the VM under Qemu yesterday and saw no >>> occurrences of this issue. >>> I'll repeat the same test using firecracker and see if I can reproduce. >>> Last time I saw the issue, it happened once in 12 VM launches. >>> >>> On Tuesday, February 16, 2021 at 10:26:29 PM UTC jwkoz...@gmail.com >>> wrote: >>> >>>> The #1123 issue has been reported in aarch64 port and only (as far as I >>>> have seen) happens in SMP mode (#vCPUs >=2). In your case, you seem to be >>>> getting it with 1 cpu. But who knows if it is related. >>>> >>>> As far as automatic power off, have you tried to prepend your command >>>> line with '--power-off-on-abort'? >>>> >>>> Waldek >>>> On Tuesday, February 16, 2021 at 3:59:22 AM UTC-5 d787...@gmail.com >>>> wrote: >>>> >>>>> Thanks for the quick reply. I haven't seen this failure when running >>>>> under Qemu, but since I'm almost always running under Firecracker and it >>>>> happens so infrequently under Firecracker, that's no surprise. I will try >>>>> to reproduce under Qemu and hopefully collect more information. >>>>> >>>>> On a related point, when this occurs (and I assume the behaviour would >>>>> be the same with any kernel assert), the OS seems to be left in a hung >>>>> state. What I'd like to ensure with my build is that if a fatal error >>>>> occurs, the VM is powered off. Is there an option I can use to achieve >>>>> this, or would it require code modification (perhaps injecting a call to >>>>> poweroff() at an appropriate location)? >>>>> >>>>> On Monday, February 15, 2021 at 12:16:55 PM UTC Nadav Har'El wrote: >>>>> >>>>>> On Mon, Feb 15, 2021 at 1:51 PM David Smith <d787...@gmail.com> >>>>>> wrote: >>>>>> >>>>>>> I am running a Golang application on OSv under firecracker. >>>>>>> Occasionally (it's proving difficult to reproduce reliably) I see an >>>>>>> assertion failure early in the OSv startup sequence. The console output >>>>>>> for >>>>>>> my release build is as follows: >>>>>>> >>>>>>> OSv v0.55.0-168-g31a04239 >>>>>>> 1 CPUs detected >>>>>>> Firmware vendor: Unknown >>>>>>> bsd: initializingAssertion failed: sched::preemptable() >>>>>>> (arch/x64/mmu.cc: page_fault: 37) >>>>>>> >>>>>>> [backtrace] >>>>>>> >>>>>> >>>>>> I wonder why the backtrace is empty. If you run with qemu, not >>>>>> firecracker, I you can perhaps get a better backtrace with gdb - see >>>>>> instructions in >>>>>> https://github.com/cloudius-systems/osv/wiki/Debugging-OSv#debugging-osv-with-gdb. >>>>>> >>>>>> Does it fail in a similar way in qemu too, or just in firecracker? >>>>>> >>>>>> >>>>>>> It appears to be hitting the assert well before my go application is >>>>>>> launched, so my impression is that this isn't being caused by anything >>>>>>> my >>>>>>> application is doing, >>>>>>> >>>>>> >>>>>> That's my impression too - bsd_init() finishes (and prints "done()") >>>>>> before starting to run the application. >>>>>> >>>>>> >>>>>>> though I haven't been able to prove that the same behaviour occurs >>>>>>> with a base image (no apps included). >>>>>>> >>>>>> >>>>>> This assert appears to happen in bsd_init() (although it's hard to >>>>>> know without a backtrace) - and that runs with IRQs disabled so indeed >>>>>> should not have had page faults and the fact we got a page fault might >>>>>> indicate a bug. Hopefully the debugger or backtrace can explain >>>>>> why/where >>>>>> this page fault is happening. >>>>>> >>>>>> >>>>>> As you can see the backtrace is empty (perhaps that's the norm for a >>>>>>> release build?), so there are no additional clues. If I can find a way >>>>>>> to >>>>>>> reproduce with a debug build and/or with a base image, I will update >>>>>>> this >>>>>>> thread. >>>>>>> >>>>>>> In most cases, my OSv image starts up without issue and in those >>>>>>> cases, the console output around the same area looks as follows: >>>>>>> >>>>>>> OSv v0.55.0-168-g31a04239 >>>>>>> 1 CPUs detected >>>>>>> Firmware vendor: Unknown >>>>>>> bsd: initializing - done >>>>>>> VFS: mounting ramfs at / >>>>>>> VFS: mounting devfs at /dev >>>>>>> net: initializing - done >>>>>>> >>>>>>> I can see there are numerous previous discussions around the page >>>>>>> fault when preemption is disabled assertion, but these seem to be tied >>>>>>> to >>>>>>> application behaviour. >>>>>>> Is an assert during "bsd: Initializing" a known issue for which >>>>>>> there is already a patch? >>>>>>> >>>>>> >>>>>> Unfortunately, I don't remember this specific case was seen. We had >>>>>> recently https://github.com/cloudius-systems/osv/issues/1123 where >>>>>> bsd_init() hangs, though, I wonder if it's related. >>>>>> >>>>>> >>>>>>> Otherwise, if there is additional info. that I can capture to help >>>>>>> isolate the cause, let me know. >>>>>>> >>>>>>> -- >>>>>>> You received this message because you are subscribed to the Google >>>>>>> Groups "OSv Development" group. >>>>>>> To unsubscribe from this group and stop receiving emails from it, >>>>>>> send an email to osv-dev+u...@googlegroups.com. >>>>>>> To view this discussion on the web visit >>>>>>> https://groups.google.com/d/msgid/osv-dev/17c2a0f9-970f-4b8a-b2c4-18b3c678f027n%40googlegroups.com >>>>>>> >>>>>>> <https://groups.google.com/d/msgid/osv-dev/17c2a0f9-970f-4b8a-b2c4-18b3c678f027n%40googlegroups.com?utm_medium=email&utm_source=footer> >>>>>>> . >>>>>>> >>>>>> -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to osv-dev+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/44377247-a022-4bec-ac08-d2affa1311c7n%40googlegroups.com.