Re: arm64 lock: no userland progress, several procs in wchan "vp"
On 2017/05/04 00:31, Mark Kettenis wrote: > > Date: Wed, 3 May 2017 21:05:24 +0100 > > From: Stuart Henderson> > > > On 2017/05/03 15:12, Mark Kettenis wrote: > > > > Date: Wed, 3 May 2017 13:51:22 +0100 > > > > From: Stuart Henderson > > > > > > > > On 2017/05/01 22:18, Mark Kettenis wrote: > > > > > > Date: Mon, 1 May 2017 20:58:29 +0100 > > > > > > From: Stuart Henderson > > > > > > > > > > > > Userland is non-responsive, machine is pingable, tcp connections > > > > > > open > > > > > > but no banner from ssh. No failed pool requests. This kernel is from > > > > > > today's snapshot but I saw the same with one from a couple of days > > > > > > ago. Is there anything else I can get that might be useful? > > > > > > > > > > .. > > > > > > 71034 186155 65198 0 30x11 vpperl > > > > .. > > > > > > > > > > The diff below might fix thise. Or it might actually turn this into a > > > > > hard hang... > > > > > > > > > > Nevertheless, could you try running with it? > > > > > > > > I haven't seen this happen again with your diff, and haven't seen any > > > > hangs. Probably still too early to say for sure that it fixes things, > > > > but it looks promising so far. > > > > > > Thanks. Since Dale ok'ed it and I had been running with it for a > > > while already, I committed it last night. > > > > > > > Ha. As is traditional, not long after sending that message I've hit > > a hard lock - no DDB. > > I'm sure it wouldn't have happened if I hand't committed it ;). > > Could you change the PR_NOWAIT back into P_WAITOK and see if you can > reproduce the hang and break into ddb? Meanwhile I'll think about > what information to print once you've hit it ;). > I've hit it: [halt sent] Stopped at pluart_intr+0x164: ddb> ps PID TID PPIDUID S FLAGS WAIT COMMAND 625 437040 78755 55 3 0x2 vpcc 78755 385259 7516 55 30x82 wait cc 7516 136325 5903 55 30x82 wait bash 19959 322579 19904 0 30x11 vpperl 5903 139810 38260 55 30x82 wait gmake 38260 218618 2854 55 30x82 wait bash 2854 90787 9324 55 30x82 wait gmake 9324 170231 48868 55 30x82 wait gmake 48868 117527 14396 55 30x80 wait bash 14396 515415 1643 55 30x82 wait bash 1643 416463 68987 55 30x82 wait gmake 68987 461754 3515 55 30x82 wait bash 3515 418174 60280 55 30x82 wait gmake 60280 191595 96060 55 30x10008a pause sh 96060 102351 39653 55 30x10008a pause sh 39653 519956 69874 55 30x10008a pause make 69874 105490 67943 55 30x10008a pause make 679438933 3490 55 30x10008a pause sh 3490 32658 91100 55 30x10008a pause make 91100 232620 7725 55 30x10008a pause sh 7725 345849 19904 55 30x10008a pause make 74896 137698 47027 1000 30x100083 kqreadtmux 47027 296497 97822 1000 30x10008b pause ksh 97822 245510 69858 1000 30x10 vpsshd 69858 304169 7139 0 30x92 poll sshd 52214 79445 1 0 30x13 vpgetty 65809 151855 1 77 30x100090 poll dhclient 69118 177062 1 0 30x80 poll dhclient 74448 202290 83042 1000 30x100083 ttyin ksh 19904 489750 99750 0 30x93 wait perl 99750 481138 83042 1000 30x10008b pause ksh 83042 493811 1 1000 30x100080 kqreadtmux 60522 271756 1 0 30x100010 vpcron 49702 312331 1110 30x100090 poll sndiod 81577 93 1 99 30x100090 poll sndiod 3727 393780 19238 95 30x100092 kqreadsmtpd 88472 480120 19238103 30x100092 kqreadsmtpd 87719 273623 19238 95 30x100092 kqreadsmtpd 8252 227745 19238 95 30x100092 kqreadsmtpd 30445 236541 19238 95 30x100092 kqreadsmtpd 23622 99340 19238 95 30x100092 kqreadsmtpd 19238 382063 1 0 30x100080 kqreadsmtpd 7139 54914 1 0 30x80 selectsshd 63805 211190 0 0 3 0x14280 nfsidlnfsio 56273 341955 0 0 3 0x14280 nfsidlnfsio 13129 36316 0 0 3 0x14280 nfsidlnfsio 43056 41917 0 0 3 0x14280 nfsidlnfsio
Re: arm64 lock: no userland progress, several procs in wchan "vp"
> Date: Wed, 3 May 2017 21:05:24 +0100 > From: Stuart Henderson> > On 2017/05/03 15:12, Mark Kettenis wrote: > > > Date: Wed, 3 May 2017 13:51:22 +0100 > > > From: Stuart Henderson > > > > > > On 2017/05/01 22:18, Mark Kettenis wrote: > > > > > Date: Mon, 1 May 2017 20:58:29 +0100 > > > > > From: Stuart Henderson > > > > > > > > > > Userland is non-responsive, machine is pingable, tcp connections open > > > > > but no banner from ssh. No failed pool requests. This kernel is from > > > > > today's snapshot but I saw the same with one from a couple of days > > > > > ago. Is there anything else I can get that might be useful? > > > > > > > > .. > > > > > 71034 186155 65198 0 30x11 vpperl > > > .. > > > > > > > > The diff below might fix thise. Or it might actually turn this into a > > > > hard hang... > > > > > > > > Nevertheless, could you try running with it? > > > > > > I haven't seen this happen again with your diff, and haven't seen any > > > hangs. Probably still too early to say for sure that it fixes things, > > > but it looks promising so far. > > > > Thanks. Since Dale ok'ed it and I had been running with it for a > > while already, I committed it last night. > > > > Ha. As is traditional, not long after sending that message I've hit > a hard lock - no DDB. I'm sure it wouldn't have happened if I hand't committed it ;). Could you change the PR_NOWAIT back into P_WAITOK and see if you can reproduce the hang and break into ddb? Meanwhile I'll think about what information to print once you've hit it ;).
Re: arm64 lock: no userland progress, several procs in wchan "vp"
On 2017/05/03 15:12, Mark Kettenis wrote: > > Date: Wed, 3 May 2017 13:51:22 +0100 > > From: Stuart Henderson> > > > On 2017/05/01 22:18, Mark Kettenis wrote: > > > > Date: Mon, 1 May 2017 20:58:29 +0100 > > > > From: Stuart Henderson > > > > > > > > Userland is non-responsive, machine is pingable, tcp connections open > > > > but no banner from ssh. No failed pool requests. This kernel is from > > > > today's snapshot but I saw the same with one from a couple of days > > > > ago. Is there anything else I can get that might be useful? > > > > > > .. > > > > 71034 186155 65198 0 30x11 vpperl > > .. > > > > > > The diff below might fix thise. Or it might actually turn this into a > > > hard hang... > > > > > > Nevertheless, could you try running with it? > > > > I haven't seen this happen again with your diff, and haven't seen any > > hangs. Probably still too early to say for sure that it fixes things, > > but it looks promising so far. > > Thanks. Since Dale ok'ed it and I had been running with it for a > while already, I committed it last night. > Ha. As is traditional, not long after sending that message I've hit a hard lock - no DDB.
Re: arm64 lock: no userland progress, several procs in wchan "vp"
> Date: Wed, 3 May 2017 13:51:22 +0100 > From: Stuart Henderson> > On 2017/05/01 22:18, Mark Kettenis wrote: > > > Date: Mon, 1 May 2017 20:58:29 +0100 > > > From: Stuart Henderson > > > > > > Userland is non-responsive, machine is pingable, tcp connections open > > > but no banner from ssh. No failed pool requests. This kernel is from > > > today's snapshot but I saw the same with one from a couple of days > > > ago. Is there anything else I can get that might be useful? > > > > .. > > > 71034 186155 65198 0 30x11 vpperl > .. > > > > The diff below might fix thise. Or it might actually turn this into a > > hard hang... > > > > Nevertheless, could you try running with it? > > I haven't seen this happen again with your diff, and haven't seen any > hangs. Probably still too early to say for sure that it fixes things, > but it looks promising so far. Thanks. Since Dale ok'ed it and I had been running with it for a while already, I committed it last night.
Re: arm64 lock: no userland progress, several procs in wchan "vp"
On 2017/05/01 22:18, Mark Kettenis wrote: > > Date: Mon, 1 May 2017 20:58:29 +0100 > > From: Stuart Henderson> > > > Userland is non-responsive, machine is pingable, tcp connections open > > but no banner from ssh. No failed pool requests. This kernel is from > > today's snapshot but I saw the same with one from a couple of days > > ago. Is there anything else I can get that might be useful? > > .. > > 71034 186155 65198 0 30x11 vpperl .. > > The diff below might fix thise. Or it might actually turn this into a > hard hang... > > Nevertheless, could you try running with it? I haven't seen this happen again with your diff, and haven't seen any hangs. Probably still too early to say for sure that it fixes things, but it looks promising so far.
Re: arm64 lock: no userland progress, several procs in wchan "vp"
On Mon, May 01, 2017 at 10:18:58PM +0200, Mark Kettenis wrote: > > Date: Mon, 1 May 2017 20:58:29 +0100 > > From: Stuart Henderson> > > > Userland is non-responsive, machine is pingable, tcp connections open > > but no banner from ssh. No failed pool requests. This kernel is from > > today's snapshot but I saw the same with one from a couple of days > > ago. Is there anything else I can get that might be useful? > > > > ddb> ps > >PID TID PPIDUID S FLAGS WAIT COMMAND > > 99554 86967 57409 55 3 0x2 vpcc > > 57409 23557 97377 55 30x82 wait cc > > 97377 51254 49407 55 30x10008a pause sh > > 71034 186155 65198 0 30x11 vpperl > > 49407 183801 58608 55 30x82 wait gmake > > 58608 251568 90720 55 30x10008a pause sh > > 90720 294385 26849 55 30x82 wait gmake > > 26849 434857 31480 55 30x100088 pause sh > > 31480 479316 1945 55 30x10008a pause sh > > 1945 53261 1392 55 30x82 wait gmake > > 1392 297593 28991 55 30x100088 pause sh > > 28991 101756 11650 55 30x10008a pause sh > > 11650 273060 70062 55 30x82 wait gmake > > 70062 380995 21324 55 30x82 wait gmake > > 21324 380494 20357 55 30x10008a pause make > > 20357 495141 79040 55 30x10008a pause sh > > 79040 411698 40069 55 30x10008a pause make > > 40069 407214 61289 55 30x10008a pause sh > > 61289 440156 65198 55 30x10008a pause make > > 16484 143829 63578 55 30x82 nanosleep perl > > 63578 247597 69857 55 30x10008a pause sh > > 698573708 28018 55 30x10008a pause make > > 28018 161747 1 55 30x10008a pause sh > > 78305 185109 40308 1000 30x100083 ttyin ksh > > 65198 454438 69872 0 30x93 wait perl > > 69872 91535 40308 1000 30x10008b pause ksh > > 40308 108204 1 1000 30x100080 kqreadtmux > > 72632 510504 69073 1000 30x100083 kqreadtmux > > 69073 166246 39096 1000 30x10008b pause ksh > > 39096 474432 39165 1000 30x10 vpsshd > > 39165 380864 95218 0 30x92 poll sshd > > 19837 75515 1 0 30x13 vpgetty > > 61 140725 1 0 30x100010 vpcron > > 33247 144573 1110 30x100090 poll sndiod > > 85245 294054 1 99 30x100090 poll sndiod > > 20071 339430 77361 95 30x100092 kqreadsmtpd > > 31714 216717 77361103 30x100092 kqreadsmtpd > > 38145 373966 77361 95 30x100092 kqreadsmtpd > > 73235 449750 77361 95 30x100092 kqreadsmtpd > > 52512 523411 77361 95 30x100092 kqreadsmtpd > > 25217 17706 77361 95 30x100092 kqreadsmtpd > > 77361 512649 1 0 30x100080 kqreadsmtpd > > 95218 352524 1 0 30x80 selectsshd > > 28640 338771 0 0 3 0x14280 nfsidlnfsio > > 30707 131410 0 0 3 0x14280 nfsidlnfsio > > 26109 142203 0 0 3 0x14280 nfsidlnfsio > > 61054 453416 0 0 3 0x14280 nfsidlnfsio > > 20679 124381 1 0 30x80 poll rpc.statd > > 75142 494960 1 28 30x100090 poll portmap > > 13394 497677 1 0 30x10 vpntpd > > 56991 117256 27035 83 30x100092 poll ntpd > > 27035 498377 1 83 30x100092 poll ntpd > > 19071 360785 2016 74 30x100090 bpf pflogd > > 2016 326372 1 0 30x80 netio pflogd > > 75485 263260 29155 73 30x100090 kqreadsyslogd > > 29155 379800 1 0 30x100082 netio syslogd > > 9314 271265 1 77 30x100090 poll dhclient > > 77002 87 1 0 30x80 poll dhclient > > 4332 479844 1 0 30x80 mfsidlmount_mfs > > 58334 330646 0 0 3 0x14200 pgzerozerothread > > 15557 331142 0 0 3 0x14200 aiodoned aiodoned > > 34557 432814 0 0 3 0x14200 syncerupdate > > 82663 208419 0 0 3 0x14200 cleaner cleaner > > 51853 347618 0 0 3 0x14200 reaperreaper > >
Re: arm64 lock: no userland progress, several procs in wchan "vp"
> Date: Mon, 1 May 2017 20:58:29 +0100 > From: Stuart Henderson> > Userland is non-responsive, machine is pingable, tcp connections open > but no banner from ssh. No failed pool requests. This kernel is from > today's snapshot but I saw the same with one from a couple of days > ago. Is there anything else I can get that might be useful? > > ddb> ps >PID TID PPIDUID S FLAGS WAIT COMMAND > 99554 86967 57409 55 3 0x2 vpcc > 57409 23557 97377 55 30x82 wait cc > 97377 51254 49407 55 30x10008a pause sh > 71034 186155 65198 0 30x11 vpperl > 49407 183801 58608 55 30x82 wait gmake > 58608 251568 90720 55 30x10008a pause sh > 90720 294385 26849 55 30x82 wait gmake > 26849 434857 31480 55 30x100088 pause sh > 31480 479316 1945 55 30x10008a pause sh > 1945 53261 1392 55 30x82 wait gmake > 1392 297593 28991 55 30x100088 pause sh > 28991 101756 11650 55 30x10008a pause sh > 11650 273060 70062 55 30x82 wait gmake > 70062 380995 21324 55 30x82 wait gmake > 21324 380494 20357 55 30x10008a pause make > 20357 495141 79040 55 30x10008a pause sh > 79040 411698 40069 55 30x10008a pause make > 40069 407214 61289 55 30x10008a pause sh > 61289 440156 65198 55 30x10008a pause make > 16484 143829 63578 55 30x82 nanosleep perl > 63578 247597 69857 55 30x10008a pause sh > 698573708 28018 55 30x10008a pause make > 28018 161747 1 55 30x10008a pause sh > 78305 185109 40308 1000 30x100083 ttyin ksh > 65198 454438 69872 0 30x93 wait perl > 69872 91535 40308 1000 30x10008b pause ksh > 40308 108204 1 1000 30x100080 kqreadtmux > 72632 510504 69073 1000 30x100083 kqreadtmux > 69073 166246 39096 1000 30x10008b pause ksh > 39096 474432 39165 1000 30x10 vpsshd > 39165 380864 95218 0 30x92 poll sshd > 19837 75515 1 0 30x13 vpgetty > 61 140725 1 0 30x100010 vpcron > 33247 144573 1110 30x100090 poll sndiod > 85245 294054 1 99 30x100090 poll sndiod > 20071 339430 77361 95 30x100092 kqreadsmtpd > 31714 216717 77361103 30x100092 kqreadsmtpd > 38145 373966 77361 95 30x100092 kqreadsmtpd > 73235 449750 77361 95 30x100092 kqreadsmtpd > 52512 523411 77361 95 30x100092 kqreadsmtpd > 25217 17706 77361 95 30x100092 kqreadsmtpd > 77361 512649 1 0 30x100080 kqreadsmtpd > 95218 352524 1 0 30x80 selectsshd > 28640 338771 0 0 3 0x14280 nfsidlnfsio > 30707 131410 0 0 3 0x14280 nfsidlnfsio > 26109 142203 0 0 3 0x14280 nfsidlnfsio > 61054 453416 0 0 3 0x14280 nfsidlnfsio > 20679 124381 1 0 30x80 poll rpc.statd > 75142 494960 1 28 30x100090 poll portmap > 13394 497677 1 0 30x10 vpntpd > 56991 117256 27035 83 30x100092 poll ntpd > 27035 498377 1 83 30x100092 poll ntpd > 19071 360785 2016 74 30x100090 bpf pflogd > 2016 326372 1 0 30x80 netio pflogd > 75485 263260 29155 73 30x100090 kqreadsyslogd > 29155 379800 1 0 30x100082 netio syslogd > 9314 271265 1 77 30x100090 poll dhclient > 77002 87 1 0 30x80 poll dhclient > 4332 479844 1 0 30x80 mfsidlmount_mfs > 58334 330646 0 0 3 0x14200 pgzerozerothread > 15557 331142 0 0 3 0x14200 aiodoned aiodoned > 34557 432814 0 0 3 0x14200 syncerupdate > 82663 208419 0 0 3 0x14200 cleaner cleaner > 51853 347618 0 0 3 0x14200 reaperreaper > 18753 499821 0 0 3 0x14200 pgdaemon pagedaemon > 59831 415568 0 0 3 0x14200 bored crynlk > 29354 478337 0 0 3 0x14200 bored crypto >
arm64 lock: no userland progress, several procs in wchan "vp"
Userland is non-responsive, machine is pingable, tcp connections open but no banner from ssh. No failed pool requests. This kernel is from today's snapshot but I saw the same with one from a couple of days ago. Is there anything else I can get that might be useful? ddb> ps PID TID PPIDUID S FLAGS WAIT COMMAND 99554 86967 57409 55 3 0x2 vpcc 57409 23557 97377 55 30x82 wait cc 97377 51254 49407 55 30x10008a pause sh 71034 186155 65198 0 30x11 vpperl 49407 183801 58608 55 30x82 wait gmake 58608 251568 90720 55 30x10008a pause sh 90720 294385 26849 55 30x82 wait gmake 26849 434857 31480 55 30x100088 pause sh 31480 479316 1945 55 30x10008a pause sh 1945 53261 1392 55 30x82 wait gmake 1392 297593 28991 55 30x100088 pause sh 28991 101756 11650 55 30x10008a pause sh 11650 273060 70062 55 30x82 wait gmake 70062 380995 21324 55 30x82 wait gmake 21324 380494 20357 55 30x10008a pause make 20357 495141 79040 55 30x10008a pause sh 79040 411698 40069 55 30x10008a pause make 40069 407214 61289 55 30x10008a pause sh 61289 440156 65198 55 30x10008a pause make 16484 143829 63578 55 30x82 nanosleep perl 63578 247597 69857 55 30x10008a pause sh 698573708 28018 55 30x10008a pause make 28018 161747 1 55 30x10008a pause sh 78305 185109 40308 1000 30x100083 ttyin ksh 65198 454438 69872 0 30x93 wait perl 69872 91535 40308 1000 30x10008b pause ksh 40308 108204 1 1000 30x100080 kqreadtmux 72632 510504 69073 1000 30x100083 kqreadtmux 69073 166246 39096 1000 30x10008b pause ksh 39096 474432 39165 1000 30x10 vpsshd 39165 380864 95218 0 30x92 poll sshd 19837 75515 1 0 30x13 vpgetty 61 140725 1 0 30x100010 vpcron 33247 144573 1110 30x100090 poll sndiod 85245 294054 1 99 30x100090 poll sndiod 20071 339430 77361 95 30x100092 kqreadsmtpd 31714 216717 77361103 30x100092 kqreadsmtpd 38145 373966 77361 95 30x100092 kqreadsmtpd 73235 449750 77361 95 30x100092 kqreadsmtpd 52512 523411 77361 95 30x100092 kqreadsmtpd 25217 17706 77361 95 30x100092 kqreadsmtpd 77361 512649 1 0 30x100080 kqreadsmtpd 95218 352524 1 0 30x80 selectsshd 28640 338771 0 0 3 0x14280 nfsidlnfsio 30707 131410 0 0 3 0x14280 nfsidlnfsio 26109 142203 0 0 3 0x14280 nfsidlnfsio 61054 453416 0 0 3 0x14280 nfsidlnfsio 20679 124381 1 0 30x80 poll rpc.statd 75142 494960 1 28 30x100090 poll portmap 13394 497677 1 0 30x10 vpntpd 56991 117256 27035 83 30x100092 poll ntpd 27035 498377 1 83 30x100092 poll ntpd 19071 360785 2016 74 30x100090 bpf pflogd 2016 326372 1 0 30x80 netio pflogd 75485 263260 29155 73 30x100090 kqreadsyslogd 29155 379800 1 0 30x100082 netio syslogd 9314 271265 1 77 30x100090 poll dhclient 77002 87 1 0 30x80 poll dhclient 4332 479844 1 0 30x80 mfsidlmount_mfs 58334 330646 0 0 3 0x14200 pgzerozerothread 15557 331142 0 0 3 0x14200 aiodoned aiodoned 34557 432814 0 0 3 0x14200 syncerupdate 82663 208419 0 0 3 0x14200 cleaner cleaner 51853 347618 0 0 3 0x14200 reaperreaper 18753 499821 0 0 3 0x14200 pgdaemon pagedaemon 59831 415568 0 0 3 0x14200 bored crynlk 29354 478337 0 0 3 0x14200 bored crypto 338986377 0 0 3 0x14200 pftm pfpurge 33736 115197 0 0 3 0x14200 usbtskusbtask 54044 43212 0 0 3 0x14200 usbatsk usbatsk 57344 273215 0 0 3 0x14200