Re: arm64 lock: no userland progress, several procs in wchan "vp"

2017-05-08 Thread Stuart Henderson
On 2017/05/04 00:31, Mark Kettenis wrote:
> > Date: Wed, 3 May 2017 21:05:24 +0100
> > From: Stuart Henderson 
> > 
> > On 2017/05/03 15:12, Mark Kettenis wrote:
> > > > Date: Wed, 3 May 2017 13:51:22 +0100
> > > > From: Stuart Henderson 
> > > > 
> > > > On 2017/05/01 22:18, Mark Kettenis wrote:
> > > > > > Date: Mon, 1 May 2017 20:58:29 +0100
> > > > > > From: Stuart Henderson 
> > > > > > 
> > > > > > Userland is non-responsive, machine is pingable, tcp connections 
> > > > > > open
> > > > > > but no banner from ssh. No failed pool requests. This kernel is from
> > > > > > today's snapshot but I saw the same with one from a couple of days
> > > > > > ago.  Is there anything else I can get that might be useful?
> > > > > > 
> > > > ..
> > > > > >  71034  186155  65198  0  30x11  vpperl
> > > > ..
> > > > > 
> > > > > The diff below might fix thise.  Or it might actually turn this into a
> > > > > hard hang...
> > > > > 
> > > > > Nevertheless, could you try running with it?
> > > > 
> > > > I haven't seen this happen again with your diff, and haven't seen any
> > > > hangs. Probably still too early to say for sure that it fixes things,
> > > > but it looks promising so far.
> > > 
> > > Thanks.  Since Dale ok'ed it and I had been running with it for a
> > > while already, I committed it last night.
> > > 
> > 
> > Ha. As is traditional, not long after sending that message I've hit
> > a hard lock - no DDB.
> 
> I'm sure it wouldn't have happened if I hand't committed it ;).
> 
> Could you change the PR_NOWAIT back into P_WAITOK and see if you can
> reproduce the hang and break into ddb?  Meanwhile I'll think about
> what information to print once you've hit it ;).
> 

I've hit it:

[halt sent]
Stopped at  pluart_intr+0x164:
ddb> ps   
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
   625  437040  78755 55  3 0x2  vpcc 
 78755  385259   7516 55  30x82  wait  cc
  7516  136325   5903 55  30x82  wait  bash
 19959  322579  19904  0  30x11  vpperl
  5903  139810  38260 55  30x82  wait  gmake
 38260  218618   2854 55  30x82  wait  bash 
  2854   90787   9324 55  30x82  wait  gmake
  9324  170231  48868 55  30x82  wait  gmake
 48868  117527  14396 55  30x80  wait  bash 
 14396  515415   1643 55  30x82  wait  bash
  1643  416463  68987 55  30x82  wait  gmake
 68987  461754   3515 55  30x82  wait  bash 
  3515  418174  60280 55  30x82  wait  gmake
 60280  191595  96060 55  30x10008a  pause sh   
 96060  102351  39653 55  30x10008a  pause sh
 39653  519956  69874 55  30x10008a  pause make
 69874  105490  67943 55  30x10008a  pause make
 679438933   3490 55  30x10008a  pause sh  
  3490   32658  91100 55  30x10008a  pause make
 91100  232620   7725 55  30x10008a  pause sh  
  7725  345849  19904 55  30x10008a  pause make
 74896  137698  47027   1000  30x100083  kqreadtmux
 47027  296497  97822   1000  30x10008b  pause ksh 
 97822  245510  69858   1000  30x10  vpsshd
 69858  304169   7139  0  30x92  poll  sshd
 52214   79445  1  0  30x13  vpgetty
 65809  151855  1 77  30x100090  poll  dhclient
 69118  177062  1  0  30x80  poll  dhclient
 74448  202290  83042   1000  30x100083  ttyin ksh 
 19904  489750  99750  0  30x93  wait  perl
 99750  481138  83042   1000  30x10008b  pause ksh 
 83042  493811  1   1000  30x100080  kqreadtmux
 60522  271756  1  0  30x100010  vpcron
 49702  312331  1110  30x100090  poll  sndiod
 81577  93  1 99  30x100090  poll  sndiod
  3727  393780  19238 95  30x100092  kqreadsmtpd 
 88472  480120  19238103  30x100092  kqreadsmtpd
 87719  273623  19238 95  30x100092  kqreadsmtpd
  8252  227745  19238 95  30x100092  kqreadsmtpd
 30445  236541  19238 95  30x100092  kqreadsmtpd
 23622   99340  19238 95  30x100092  kqreadsmtpd
 19238  382063  1  0  30x100080  kqreadsmtpd
  7139   54914  1  0  30x80  selectsshd 
 63805  211190  0  0  3 0x14280  nfsidlnfsio
 56273  341955  0  0  3 0x14280  nfsidlnfsio
 13129   36316  0  0  3 0x14280  nfsidlnfsio
 43056   41917  0  0  3 0x14280  nfsidlnfsio

Re: arm64 lock: no userland progress, several procs in wchan "vp"

2017-05-03 Thread Mark Kettenis
> Date: Wed, 3 May 2017 21:05:24 +0100
> From: Stuart Henderson 
> 
> On 2017/05/03 15:12, Mark Kettenis wrote:
> > > Date: Wed, 3 May 2017 13:51:22 +0100
> > > From: Stuart Henderson 
> > > 
> > > On 2017/05/01 22:18, Mark Kettenis wrote:
> > > > > Date: Mon, 1 May 2017 20:58:29 +0100
> > > > > From: Stuart Henderson 
> > > > > 
> > > > > Userland is non-responsive, machine is pingable, tcp connections open
> > > > > but no banner from ssh. No failed pool requests. This kernel is from
> > > > > today's snapshot but I saw the same with one from a couple of days
> > > > > ago.  Is there anything else I can get that might be useful?
> > > > > 
> > > ..
> > > > >  71034  186155  65198  0  30x11  vpperl
> > > ..
> > > > 
> > > > The diff below might fix thise.  Or it might actually turn this into a
> > > > hard hang...
> > > > 
> > > > Nevertheless, could you try running with it?
> > > 
> > > I haven't seen this happen again with your diff, and haven't seen any
> > > hangs. Probably still too early to say for sure that it fixes things,
> > > but it looks promising so far.
> > 
> > Thanks.  Since Dale ok'ed it and I had been running with it for a
> > while already, I committed it last night.
> > 
> 
> Ha. As is traditional, not long after sending that message I've hit
> a hard lock - no DDB.

I'm sure it wouldn't have happened if I hand't committed it ;).

Could you change the PR_NOWAIT back into P_WAITOK and see if you can
reproduce the hang and break into ddb?  Meanwhile I'll think about
what information to print once you've hit it ;).



Re: arm64 lock: no userland progress, several procs in wchan "vp"

2017-05-03 Thread Stuart Henderson
On 2017/05/03 15:12, Mark Kettenis wrote:
> > Date: Wed, 3 May 2017 13:51:22 +0100
> > From: Stuart Henderson 
> > 
> > On 2017/05/01 22:18, Mark Kettenis wrote:
> > > > Date: Mon, 1 May 2017 20:58:29 +0100
> > > > From: Stuart Henderson 
> > > > 
> > > > Userland is non-responsive, machine is pingable, tcp connections open
> > > > but no banner from ssh. No failed pool requests. This kernel is from
> > > > today's snapshot but I saw the same with one from a couple of days
> > > > ago.  Is there anything else I can get that might be useful?
> > > > 
> > ..
> > > >  71034  186155  65198  0  30x11  vpperl
> > ..
> > > 
> > > The diff below might fix thise.  Or it might actually turn this into a
> > > hard hang...
> > > 
> > > Nevertheless, could you try running with it?
> > 
> > I haven't seen this happen again with your diff, and haven't seen any
> > hangs. Probably still too early to say for sure that it fixes things,
> > but it looks promising so far.
> 
> Thanks.  Since Dale ok'ed it and I had been running with it for a
> while already, I committed it last night.
> 

Ha. As is traditional, not long after sending that message I've hit
a hard lock - no DDB.



Re: arm64 lock: no userland progress, several procs in wchan "vp"

2017-05-03 Thread Mark Kettenis
> Date: Wed, 3 May 2017 13:51:22 +0100
> From: Stuart Henderson 
> 
> On 2017/05/01 22:18, Mark Kettenis wrote:
> > > Date: Mon, 1 May 2017 20:58:29 +0100
> > > From: Stuart Henderson 
> > > 
> > > Userland is non-responsive, machine is pingable, tcp connections open
> > > but no banner from ssh. No failed pool requests. This kernel is from
> > > today's snapshot but I saw the same with one from a couple of days
> > > ago.  Is there anything else I can get that might be useful?
> > > 
> ..
> > >  71034  186155  65198  0  30x11  vpperl
> ..
> > 
> > The diff below might fix thise.  Or it might actually turn this into a
> > hard hang...
> > 
> > Nevertheless, could you try running with it?
> 
> I haven't seen this happen again with your diff, and haven't seen any
> hangs. Probably still too early to say for sure that it fixes things,
> but it looks promising so far.

Thanks.  Since Dale ok'ed it and I had been running with it for a
while already, I committed it last night.



Re: arm64 lock: no userland progress, several procs in wchan "vp"

2017-05-03 Thread Stuart Henderson
On 2017/05/01 22:18, Mark Kettenis wrote:
> > Date: Mon, 1 May 2017 20:58:29 +0100
> > From: Stuart Henderson 
> > 
> > Userland is non-responsive, machine is pingable, tcp connections open
> > but no banner from ssh. No failed pool requests. This kernel is from
> > today's snapshot but I saw the same with one from a couple of days
> > ago.  Is there anything else I can get that might be useful?
> > 
..
> >  71034  186155  65198  0  30x11  vpperl
..
> 
> The diff below might fix thise.  Or it might actually turn this into a
> hard hang...
> 
> Nevertheless, could you try running with it?

I haven't seen this happen again with your diff, and haven't seen any
hangs. Probably still too early to say for sure that it fixes things,
but it looks promising so far.



Re: arm64 lock: no userland progress, several procs in wchan "vp"

2017-05-01 Thread Dale Rahn
On Mon, May 01, 2017 at 10:18:58PM +0200, Mark Kettenis wrote:
> > Date: Mon, 1 May 2017 20:58:29 +0100
> > From: Stuart Henderson 
> > 
> > Userland is non-responsive, machine is pingable, tcp connections open
> > but no banner from ssh. No failed pool requests. This kernel is from
> > today's snapshot but I saw the same with one from a couple of days
> > ago.  Is there anything else I can get that might be useful?
> > 
> > ddb> ps
> >PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
> >  99554   86967  57409 55  3 0x2  vpcc 
> >  57409   23557  97377 55  30x82  wait  cc
> >  97377   51254  49407 55  30x10008a  pause sh
> >  71034  186155  65198  0  30x11  vpperl
> >  49407  183801  58608 55  30x82  wait  gmake
> >  58608  251568  90720 55  30x10008a  pause sh   
> >  90720  294385  26849 55  30x82  wait  gmake
> >  26849  434857  31480 55  30x100088  pause sh   
> >  31480  479316   1945 55  30x10008a  pause sh
> >   1945   53261   1392 55  30x82  wait  gmake
> >   1392  297593  28991 55  30x100088  pause sh   
> >  28991  101756  11650 55  30x10008a  pause sh
> >  11650  273060  70062 55  30x82  wait  gmake
> >  70062  380995  21324 55  30x82  wait  gmake
> >  21324  380494  20357 55  30x10008a  pause make 
> >  20357  495141  79040 55  30x10008a  pause sh  
> >  79040  411698  40069 55  30x10008a  pause make
> >  40069  407214  61289 55  30x10008a  pause sh  
> >  61289  440156  65198 55  30x10008a  pause make
> >  16484  143829  63578 55  30x82  nanosleep perl
> >  63578  247597  69857 55  30x10008a  pause sh  
> >  698573708  28018 55  30x10008a  pause make
> >  28018  161747  1 55  30x10008a  pause sh  
> >  78305  185109  40308   1000  30x100083  ttyin ksh
> >  65198  454438  69872  0  30x93  wait  perl
> >  69872   91535  40308   1000  30x10008b  pause ksh 
> >  40308  108204  1   1000  30x100080  kqreadtmux
> >  72632  510504  69073   1000  30x100083  kqreadtmux
> >  69073  166246  39096   1000  30x10008b  pause ksh 
> >  39096  474432  39165   1000  30x10  vpsshd
> >  39165  380864  95218  0  30x92  poll  sshd
> >  19837   75515  1  0  30x13  vpgetty
> > 61  140725  1  0  30x100010  vpcron 
> >  33247  144573  1110  30x100090  poll  sndiod
> >  85245  294054  1 99  30x100090  poll  sndiod
> >  20071  339430  77361 95  30x100092  kqreadsmtpd 
> >  31714  216717  77361103  30x100092  kqreadsmtpd
> >  38145  373966  77361 95  30x100092  kqreadsmtpd
> >  73235  449750  77361 95  30x100092  kqreadsmtpd
> >  52512  523411  77361 95  30x100092  kqreadsmtpd
> >  25217   17706  77361 95  30x100092  kqreadsmtpd
> >  77361  512649  1  0  30x100080  kqreadsmtpd
> >  95218  352524  1  0  30x80  selectsshd 
> >  28640  338771  0  0  3 0x14280  nfsidlnfsio
> >  30707  131410  0  0  3 0x14280  nfsidlnfsio
> >  26109  142203  0  0  3 0x14280  nfsidlnfsio
> >  61054  453416  0  0  3 0x14280  nfsidlnfsio
> >  20679  124381  1  0  30x80  poll  rpc.statd
> >  75142  494960  1 28  30x100090  poll  portmap  
> >  13394  497677  1  0  30x10  vpntpd   
> >  56991  117256  27035 83  30x100092  poll  ntpd
> >  27035  498377  1 83  30x100092  poll  ntpd
> >  19071  360785   2016 74  30x100090  bpf   pflogd
> >   2016  326372  1  0  30x80  netio pflogd
> >  75485  263260  29155 73  30x100090  kqreadsyslogd
> >  29155  379800  1  0  30x100082  netio syslogd
> >   9314  271265  1 77  30x100090  poll  dhclient
> >  77002  87  1  0  30x80  poll  dhclient
> >   4332  479844  1  0  30x80  mfsidlmount_mfs
> >  58334  330646  0  0  3 0x14200  pgzerozerothread
> >  15557  331142  0  0  3 0x14200  aiodoned  aiodoned  
> >  34557  432814  0  0  3 0x14200  syncerupdate  
> >  82663  208419  0  0  3 0x14200  cleaner   cleaner
> >  51853  347618  0  0  3 0x14200  reaperreaper 
> >  

Re: arm64 lock: no userland progress, several procs in wchan "vp"

2017-05-01 Thread Mark Kettenis
> Date: Mon, 1 May 2017 20:58:29 +0100
> From: Stuart Henderson 
> 
> Userland is non-responsive, machine is pingable, tcp connections open
> but no banner from ssh. No failed pool requests. This kernel is from
> today's snapshot but I saw the same with one from a couple of days
> ago.  Is there anything else I can get that might be useful?
> 
> ddb> ps
>PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
>  99554   86967  57409 55  3 0x2  vpcc 
>  57409   23557  97377 55  30x82  wait  cc
>  97377   51254  49407 55  30x10008a  pause sh
>  71034  186155  65198  0  30x11  vpperl
>  49407  183801  58608 55  30x82  wait  gmake
>  58608  251568  90720 55  30x10008a  pause sh   
>  90720  294385  26849 55  30x82  wait  gmake
>  26849  434857  31480 55  30x100088  pause sh   
>  31480  479316   1945 55  30x10008a  pause sh
>   1945   53261   1392 55  30x82  wait  gmake
>   1392  297593  28991 55  30x100088  pause sh   
>  28991  101756  11650 55  30x10008a  pause sh
>  11650  273060  70062 55  30x82  wait  gmake
>  70062  380995  21324 55  30x82  wait  gmake
>  21324  380494  20357 55  30x10008a  pause make 
>  20357  495141  79040 55  30x10008a  pause sh  
>  79040  411698  40069 55  30x10008a  pause make
>  40069  407214  61289 55  30x10008a  pause sh  
>  61289  440156  65198 55  30x10008a  pause make
>  16484  143829  63578 55  30x82  nanosleep perl
>  63578  247597  69857 55  30x10008a  pause sh  
>  698573708  28018 55  30x10008a  pause make
>  28018  161747  1 55  30x10008a  pause sh  
>  78305  185109  40308   1000  30x100083  ttyin ksh
>  65198  454438  69872  0  30x93  wait  perl
>  69872   91535  40308   1000  30x10008b  pause ksh 
>  40308  108204  1   1000  30x100080  kqreadtmux
>  72632  510504  69073   1000  30x100083  kqreadtmux
>  69073  166246  39096   1000  30x10008b  pause ksh 
>  39096  474432  39165   1000  30x10  vpsshd
>  39165  380864  95218  0  30x92  poll  sshd
>  19837   75515  1  0  30x13  vpgetty
> 61  140725  1  0  30x100010  vpcron 
>  33247  144573  1110  30x100090  poll  sndiod
>  85245  294054  1 99  30x100090  poll  sndiod
>  20071  339430  77361 95  30x100092  kqreadsmtpd 
>  31714  216717  77361103  30x100092  kqreadsmtpd
>  38145  373966  77361 95  30x100092  kqreadsmtpd
>  73235  449750  77361 95  30x100092  kqreadsmtpd
>  52512  523411  77361 95  30x100092  kqreadsmtpd
>  25217   17706  77361 95  30x100092  kqreadsmtpd
>  77361  512649  1  0  30x100080  kqreadsmtpd
>  95218  352524  1  0  30x80  selectsshd 
>  28640  338771  0  0  3 0x14280  nfsidlnfsio
>  30707  131410  0  0  3 0x14280  nfsidlnfsio
>  26109  142203  0  0  3 0x14280  nfsidlnfsio
>  61054  453416  0  0  3 0x14280  nfsidlnfsio
>  20679  124381  1  0  30x80  poll  rpc.statd
>  75142  494960  1 28  30x100090  poll  portmap  
>  13394  497677  1  0  30x10  vpntpd   
>  56991  117256  27035 83  30x100092  poll  ntpd
>  27035  498377  1 83  30x100092  poll  ntpd
>  19071  360785   2016 74  30x100090  bpf   pflogd
>   2016  326372  1  0  30x80  netio pflogd
>  75485  263260  29155 73  30x100090  kqreadsyslogd
>  29155  379800  1  0  30x100082  netio syslogd
>   9314  271265  1 77  30x100090  poll  dhclient
>  77002  87  1  0  30x80  poll  dhclient
>   4332  479844  1  0  30x80  mfsidlmount_mfs
>  58334  330646  0  0  3 0x14200  pgzerozerothread
>  15557  331142  0  0  3 0x14200  aiodoned  aiodoned  
>  34557  432814  0  0  3 0x14200  syncerupdate  
>  82663  208419  0  0  3 0x14200  cleaner   cleaner
>  51853  347618  0  0  3 0x14200  reaperreaper 
>  18753  499821  0  0  3 0x14200  pgdaemon  pagedaemon
>  59831  415568  0  0  3 0x14200  bored crynlk
>  29354  478337  0  0  3 0x14200  bored crypto
>  

arm64 lock: no userland progress, several procs in wchan "vp"

2017-05-01 Thread Stuart Henderson
Userland is non-responsive, machine is pingable, tcp connections open
but no banner from ssh. No failed pool requests. This kernel is from
today's snapshot but I saw the same with one from a couple of days
ago.  Is there anything else I can get that might be useful?

ddb> ps
   PID TID   PPIDUID  S   FLAGS  WAIT  COMMAND
 99554   86967  57409 55  3 0x2  vpcc 
 57409   23557  97377 55  30x82  wait  cc
 97377   51254  49407 55  30x10008a  pause sh
 71034  186155  65198  0  30x11  vpperl
 49407  183801  58608 55  30x82  wait  gmake
 58608  251568  90720 55  30x10008a  pause sh   
 90720  294385  26849 55  30x82  wait  gmake
 26849  434857  31480 55  30x100088  pause sh   
 31480  479316   1945 55  30x10008a  pause sh
  1945   53261   1392 55  30x82  wait  gmake
  1392  297593  28991 55  30x100088  pause sh   
 28991  101756  11650 55  30x10008a  pause sh
 11650  273060  70062 55  30x82  wait  gmake
 70062  380995  21324 55  30x82  wait  gmake
 21324  380494  20357 55  30x10008a  pause make 
 20357  495141  79040 55  30x10008a  pause sh  
 79040  411698  40069 55  30x10008a  pause make
 40069  407214  61289 55  30x10008a  pause sh  
 61289  440156  65198 55  30x10008a  pause make
 16484  143829  63578 55  30x82  nanosleep perl
 63578  247597  69857 55  30x10008a  pause sh  
 698573708  28018 55  30x10008a  pause make
 28018  161747  1 55  30x10008a  pause sh  
 78305  185109  40308   1000  30x100083  ttyin ksh
 65198  454438  69872  0  30x93  wait  perl
 69872   91535  40308   1000  30x10008b  pause ksh 
 40308  108204  1   1000  30x100080  kqreadtmux
 72632  510504  69073   1000  30x100083  kqreadtmux
 69073  166246  39096   1000  30x10008b  pause ksh 
 39096  474432  39165   1000  30x10  vpsshd
 39165  380864  95218  0  30x92  poll  sshd
 19837   75515  1  0  30x13  vpgetty
61  140725  1  0  30x100010  vpcron 
 33247  144573  1110  30x100090  poll  sndiod
 85245  294054  1 99  30x100090  poll  sndiod
 20071  339430  77361 95  30x100092  kqreadsmtpd 
 31714  216717  77361103  30x100092  kqreadsmtpd
 38145  373966  77361 95  30x100092  kqreadsmtpd
 73235  449750  77361 95  30x100092  kqreadsmtpd
 52512  523411  77361 95  30x100092  kqreadsmtpd
 25217   17706  77361 95  30x100092  kqreadsmtpd
 77361  512649  1  0  30x100080  kqreadsmtpd
 95218  352524  1  0  30x80  selectsshd 
 28640  338771  0  0  3 0x14280  nfsidlnfsio
 30707  131410  0  0  3 0x14280  nfsidlnfsio
 26109  142203  0  0  3 0x14280  nfsidlnfsio
 61054  453416  0  0  3 0x14280  nfsidlnfsio
 20679  124381  1  0  30x80  poll  rpc.statd
 75142  494960  1 28  30x100090  poll  portmap  
 13394  497677  1  0  30x10  vpntpd   
 56991  117256  27035 83  30x100092  poll  ntpd
 27035  498377  1 83  30x100092  poll  ntpd
 19071  360785   2016 74  30x100090  bpf   pflogd
  2016  326372  1  0  30x80  netio pflogd
 75485  263260  29155 73  30x100090  kqreadsyslogd
 29155  379800  1  0  30x100082  netio syslogd
  9314  271265  1 77  30x100090  poll  dhclient
 77002  87  1  0  30x80  poll  dhclient
  4332  479844  1  0  30x80  mfsidlmount_mfs
 58334  330646  0  0  3 0x14200  pgzerozerothread
 15557  331142  0  0  3 0x14200  aiodoned  aiodoned  
 34557  432814  0  0  3 0x14200  syncerupdate  
 82663  208419  0  0  3 0x14200  cleaner   cleaner
 51853  347618  0  0  3 0x14200  reaperreaper 
 18753  499821  0  0  3 0x14200  pgdaemon  pagedaemon
 59831  415568  0  0  3 0x14200  bored crynlk
 29354  478337  0  0  3 0x14200  bored crypto
 338986377  0  0  3 0x14200  pftm  pfpurge
 33736  115197  0  0  3 0x14200  usbtskusbtask
 54044   43212  0  0  3 0x14200  usbatsk   usbatsk
 57344  273215  0  0  3 0x14200