On 2016-05-18 14:00, Mats Karrman wrote:
>
>
> On 2016-05-18 13:01, Felix Fietkau wrote:
>> On 2016-05-18 11:38, Mats Karrman wrote:
>>>
>>> On 2016-05-17 17:31, Mats Karrman wrote:
>>>> On 2016-05-17 13:29, Felix Fietkau wrote:
>>>>> I just took a look at the code and uloop's processing of signals looked
>>>>> a bit racy to me. I've pushed a commit that makes it use signalfd if
>>>>> available. I also found that waitpid wasn't being retried on signal
>>>>> interrupt, so I added an extra check there. The changes are in libubox
>>>>> git, but not in OpenWrt/LEDE yet.
>>>>> Please test if this fixes your issue.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> - Felix
>>>> Tried that but no immediate success, but it might have provided
>>>> some additional clues. Now the boot hangs early on *every* boot
>>>> but after logging in I found something different in the ps list.
>>>> There is a Broadcom utility (smd) that is called from one of the
>>>> start scripts (S10environment). Its purpose is to set scheduling
>>>> priority and cpu affinity for some of the Broadcom proprietary
>>>> processes. The smd program handles fork rather ugly. The
>>>> parent only loops until it receives SIGCHLD and then exits without
>>>> any wait. With the modified libubox I get a zombie smd child and
>>>> sleeping smd parent and S11environment (no other zombie).
>>>>
>>>> Not sure exactly how this happened but I got to think about
>>>> something written in the wait man page:
>>>>
>>>> """
>>>> If a parent process terminates, then its "zombie" children (if any)
>>>> are adopted by init(8), which automatically performs a wait to
>>>> remove the zombies.
>>>> """
>>>>
>>>> Is this wait really (unconditionally) implemented in procd or could
>>>> that be what I accomplished with the "forced timeout" patch?
>>>>
>>>> I fixed the ugly fork and got the system to boot once.
>>>> Then tried the original libubox with the fixed smd program but
>>>> this was not enough to get things working (25 reboots to hang).
>>>>
>>>> Now I'm running reboot tests with your new libubox and fixed smd...
>>> More than 250 reboots without problem :)
>>>
>>> Clearly the smd program is broken, but still it doesn't feel good that it
>>> manages to hang the init process. Considering that timing is involved
>>> it's difficult to make any certain conclusions but it seems like having
>>> uloop epoll_wait to time out occasionally isn't such a bad idea?
>> I agree, that definitely needs fixing. What kernel are you using?
> It's the 3.4.11-rt19 from the Broadcom SDK v4.16, so very old...
>
> Now I also noticed, with your libubox fixes (and my fixed smd) I still get
> some zombies, even though the system seems to boot OK all the way
> (the corresponding services being defunct though).
> With my epoll_wait timeout fix on the original libubox, this does not
> happen.

Can you try backporting this to your kernel?
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=128dd1759d96ad36c379240f8b9463e8acfd37a1
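
For reference, the waitpid retry mentioned earlier in the thread can be sketched roughly as follows. This is only an illustration of the idea, not the actual libubox change; the helper name reap_children is hypothetical, and it assumes the caller wants to collect every child that has already exited without blocking.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not taken from libubox): reap every child that
 * has already exited, retrying waitpid() if a signal interrupts it.
 * Returns the number of children reaped by this call.
 */
static int reap_children(void)
{
    int reaped = 0;

    for (;;) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);

        if (pid > 0) {
            /* One zombie collected; loop again, there may be more. */
            reaped++;
            continue;
        }
        if (pid < 0 && errno == EINTR)
            continue;   /* interrupted by a signal: retry the call */

        /* pid == 0: remaining children are still running.
         * pid < 0 with ECHILD: no children left at all. */
        break;
    }
    return reaped;
}
```

Looping until waitpid() reports nothing more to collect matters because several SIGCHLDs that arrive close together can be merged into a single notification, so one wakeup may correspond to more than one exited child.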
- Felix
_______________________________________________
Lede-dev mailing list
Lede-dev@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/lede-dev