Re: [9fans] That deadlock, again

2010-11-18 Thread Lucio De Re
On Thu, Nov 18, 2010 at 12:53:52AM -0500, erik quanstrom wrote:
 
 you must be in process context to qlock, because only
 processes can sleep.
 
There's obviously at least one exception, because otherwise I would not
have got a panic at startup.  Or, for that matter there would not be
active code ahead of the

/sys/src/9/port/qlock.c:35,36
if(up == 0)
	panic("qlock");

in qlock().  Or maybe that's where things are going wrong, but I doubt
that the code is mistaken, I know my understanding is inadequate :-)

++L



Re: [9fans] That deadlock, again

2010-11-18 Thread cinap_lenrek
was 0xf01e739e really the code that accesses up->qpctry?

--
cinap


Re: [9fans] That deadlock, again

2010-11-18 Thread cinap_lenrek
hm...  thinking about it...  does the kernel assume (maybe in early
initialization) that calling qlock() without a proc is ok as long as
it can make sure it will not be held by another proc?

--
cinap


Re: [9fans] That deadlock, again

2010-11-18 Thread Lucio De Re
On Thu, Nov 18, 2010 at 10:20:33AM +0100, cinap_len...@gmx.de wrote:
 
 hm...  thinking about it...  does the kernel assume (maybe in early
 initialization) that calling qlock() without a proc is ok as long as
 it can make sure it will not be held by another proc?
 
That's a question for Bell Labs, I suppose, but that's precisely what I
believe.  There is no other explanation for the panic.  Moving the
up == 0 test earlier will invalidate this assumption and cause the panic
we have already seen.

The issue here is whether there is a situation where qlock() is
intentionally invoked where up == 0 (suggested by the positioning of the
up == 0 test _after_ setting the locked condition).  This is improbable,
though, and needs sorting out: whereas setting the lock can be done with
up == 0 - and we can also clear the lock - we cannot _fail_ to set the
lock, because then the absence of up will trigger a panic.

Now, we know that qlock() is called with up == 0: we have seen a panic
generated by such a call.  Will it suffice to locate the invocation
and somehow deal with it, or should we make qlock() more robust and cause
it to reject a request from a space where up == 0?  Definitely, if qlock()
no longer allows invocations with up == 0 there will be simplifications
in its implementation.  For example, the line

if(up != nil && up->nlocks.ref)
	print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), 
		up->nlocks.ref);

will no longer need the up != nil test.

But I'm convinced there's more here than meets the eye.  Unfortunately,
while I have a Plan 9 distribution at my fingertips, I'm not going to
try to fix this problem in a 9vx environment, I'll wait until I get home
to deal with the native stuff.  But one can speculate...

++L



Re: [9fans] That deadlock, again

2010-11-18 Thread erik quanstrom
   if(up != nil && up->nlocks.ref)
   	print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), 
 		up->nlocks.ref);
 
 will no longer need the up != nil test.

that's just wrong.  if the kernel is qlocking without
an up, there's a bug.  stack dumps are your friend.

but i have a feeling that there is a mistake in your
modification to qlock.  you didn't have this panic
before you modified qlock.

- erik



Re: [9fans] That deadlock, again

2010-11-18 Thread erik quanstrom
On Thu Nov 18 10:23:20 EST 2010, quans...@quanstro.net wrote:
  if(up != nil && up->nlocks.ref)
  	print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), 
  		up->nlocks.ref);
  
  will no longer need the up != nil test.
 
 that's just wrong.  if the kernel is qlocking without
 an up, there's a bug.  stack dumps are your friend.
 
 but i have a feeling that there is a mistake in your
 modification to qlock.  you didn't have this panic
 before you modified qlock.

and i'm just wrong.  intentionally or not, devsd does
qlock things with no up from sdreset().  ether82598
does too (my fault).

- erik



Re: [9fans] That deadlock, again

2010-11-18 Thread Lucio De Re
 and i'm just wrong.  intentionally or not, devsd does
 qlock things with no up from sdreset().  ether82598
 does too (my fault).

I suggest you fix ether82598: it is OK to call qlock() and qunlock()
without up, but only if sure that the qlock() will succeed.  If it
has to wait, it will panic.  Given that, why do the locking at all?

++L




Re: [9fans] That deadlock, again

2010-11-18 Thread Lucio De Re
 but i have a feeling that there is a mistake in your
 modification to qlock.  you didn't have this panic
 before you modified qlock.

qlock() is broken, or at the very least ambivalent.  Someone ought to
put it out of its misery: is it legal or is it not to call qlock() in
an up == 0 context?

++L




Re: [9fans] That deadlock, again

2010-11-18 Thread C H Forsyth
If it has to wait, it will panic.  Given that, why do the locking at all?

i assume the intention is along these lines:

it's to allow the use during reset of a given driver's
standard functions that normally must qlock, to avoid requiring two copies
of them, with and without the qlock.

after reset, it's illegal to call qlock without a process (notably
in an interrupt function), as it previously was.



Re: [9fans] That deadlock, again

2010-11-18 Thread Lucio De Re
 it's to allow the use during reset of a given driver's
 standard functions that normally must qlock, to avoid requiring two copies
 of them, with and without the qlock.
 
 after reset, it's illegal to call qlock without a process (notably
 in an interrupt function), as it previously was.

I'm willing to credit the validity of this, but I believe then that it
ought to be more explicit.  It seems to me that having a situation
where a panic can ensue if a lock is already taken is too risky.  Is
it possible to count the instances of such qlock() invocations in the
present kernel code and find out how common the problem really is?

Or should one simply treat such invocations as innocuous and just omit
connecting a user process to the queue when no user process is
specified, if the lock is taken?  That sounds positively explosive!

++L




Re: [9fans] That deadlock, again

2010-11-18 Thread Lucio De Re
 after reset, it's illegal to call qlock without a process (notably
 in an interrupt function), as it previously was.

That suggests that the (hopefully) few instances of qlock()
invocations that may occur in this space should be burdened with the
need to check for the value of up and altogether skip the call if
it's nil.

Mind you, this is not the problem we set out to fix anymore, although
no doubt there is a relationship, however tenuous.

++L




Re: [9fans] That deadlock, again

2010-11-18 Thread erik quanstrom
 I suggest you fix ether82598: it is OK to call qlock() and qunlock()
 without up, but only if sure that the qlock() will succeed.  If it
 has to wait, it will panic.

yes.  that's it.

 If it has to wait, it will panic.  Given that, why do the locking at all?
 
 i assume the intention is along these lines:
 
 it's to allow the use during reset of a given driver's
 standard functions that normally must qlock, to avoid requiring two copies
 of them, with and without the qlock.
 
 after reset, it's illegal to call qlock without a process (notably
 in an interrupt function), as it previously was.

i think lucio has stated the current restriction more exactly,
but this is closer to the assumed intent.  perhaps we should make
this what qlock actually does with

diff -c /n/dump/2010/1118/sys/src/9/pc/main.c pc/main.c
/n/dump/2010/1118/sys/src/9/pc/main.c:201,206 - pc/main.c:201,207
poperror();
}
kproc(alarm, alarmkproc, 0);
+   conf.postdawn = 1;
touser(sp);
  }
  
diff -c /n/dump/2010/1118/sys/src/9/pc/dat.h pc/dat.h
/n/dump/2010/1118/sys/src/9/pc/dat.h:107,112 - pc/dat.h:107,113
ulong   ialloc; /* max interrupt time allocation in bytes */
ulong   pipeqsize;  /* size in bytes of pipe queues */
int nuart;  /* number of uart devices */
+   int postdawn;   /* multiprogramming on */
  };
  
  /*
diff -c /n/dump/2010/1118/sys/src/9/port/qlock.c port/qlock.c
/n/dump/2010/1118/sys/src/9/port/qlock.c:18,23 - port/qlock.c:18,25
  {
Proc *p;
  
+	if(up == nil && conf.postdawn)
+		panic("qlock: %#p: postdawn up nil\n", getcallerpc(&q));
	if(m->ilockdepth != 0)
		print("qlock: %#p: ilockdepth %d\n", getcallerpc(&q), 
			m->ilockdepth);
	if(up != nil && up->nlocks.ref)

a test kernel i've got does boot with this test.

- erik



Re: [9fans] That deadlock, again

2010-11-18 Thread erik quanstrom
on second thought, conf.postdawn should be set in schedinit().

- erik



Re: [9fans] That deadlock, again

2010-11-18 Thread Lucio De Re
 /n/dump/2010/1118/sys/src/9/port/qlock.c:18,23 - port/qlock.c:18,25
   {
   Proc *p;
   
 + 	if(up == nil && conf.postdawn)
 + 		panic("qlock: %#p: postdawn up nil\n", getcallerpc(&q));
   	if(m->ilockdepth != 0)
   		print("qlock: %#p: ilockdepth %d\n", getcallerpc(&q), 
 			m->ilockdepth);
   	if(up != nil && up->nlocks.ref)

Yes, this is the type of explicitness I was thinking of.  Note that
you can now drop further tests for up == 0 later in the qlock() text.

++L




Re: [9fans] That deadlock, again

2010-11-18 Thread Lucio De Re
 Yes, this is the type of explicit-ness I was thinking of.  Note that
 you can now drop further tests for up == 0 later in the qlock() text.

Hm, spoke too quickly.  The tests on up have to remain, sadly.
Sorry about the misleading noise.

++L




Re: [9fans] That deadlock, again

2010-11-17 Thread Russ Cox
lock loops are about Locks (spin locks), not QLocks.
the relative ordering of any two calls to qlock and qunlock
is irrelevant.

russ



Re: [9fans] That deadlock, again

2010-11-17 Thread Lucio De Re
 one could move:
 
  up->qpc = getcallerpc(&q);
 
 from qlock() before the lock(&q->use); so we can see from where that
qlock gets called that hangs the exportfs call, or add another magic
debug pointer (qpctry) to the proc structure and print it in dumpaproc().


Cinap, I tried your debugging code and got an odd panic at boot time.
Consistently:

panic: kernel fault: no user process pc=0xf01e739e addr=0x09e8

Having a look with acid, this seems to be caused by an attempt at
setting the debug PC (your up->qpctry) at a time when up has no
value yet.

Strangely, later in the qlock() code up is checked and a panic
issued if zero.  I'm missing something here: it is possible to execute
this code

/sys/src/9/port/qlock.c:29,37 (more or less)
lock(&q->use);
rwstats.qlock++;
if(!q->locked) {
	q->locked = 1;
	unlock(&q->use);
	return;
}

which is immediately followed by

if(up == 0)
	panic("qlock");

so if up is nil at that point it panics, which looks like a bit of a
one-way operation.

Anyway, I have moved the assignment to qpctry to after up is
tested.  Let's see what happens.  I'll have to get back to you once
the system is back up.

++L




Re: [9fans] That deadlock, again

2010-11-17 Thread Lucio De Re
 Anyway, I have moved the assignment to qpctry to after up is
 tested.  Let's see what happens.  I'll have to get back to you once
 the system is back up.

The system is working now.  I have to wait for a problem to arise, next.

++L




Re: [9fans] That deadlock, again

2010-11-17 Thread erik quanstrom
 Strangely, later in the qlock() code up is checked and a panic
 issued if zero.  I'm missing something here: it is possible to execute
 this code

you must be in process context to qlock, because only
processes can sleep.

- erik



Re: [9fans] That deadlock, again

2010-11-16 Thread erik quanstrom
 I tried acid, but I'm just not familiar enough with it to make it
 work.  I tried
 
   rumble% acid 2052 /bin/exportfs
   /bin/exportfs:386 plan 9 executable
   /sys/lib/acid/port
   /sys/lib/acid/386
   acid: src(0xf01e7377)
   no source for ?file?

cinap is right, the bug is in the kernel.  we know
that because it's a lock loop.  that can only happen
if the kernel screws up.  also, the address is a kernel
address (starts with 0xf).

if you didn't know the lock loop message was from
the kernel, you could grep for it

; cd /sys/src
; g 'lock (%#?[luxp]+ )loop' .
./9/port/taslock.c:61:  print("lock %#p loop key %#lux pc %#lux held by pc 
%#lux proc %lud\n",

- erik



Re: [9fans] That deadlock, again

2010-11-16 Thread lucio
 cinap is right, the bug is in the kernel.  we know
 that because it's a lock loop.  that can only happen
 if the kernel screws up.  also, the address is a kernel
 address (starts with 0xf).

Well, here is an acid dump, I'll inspect it in detail, but I'm hoping
someone will beat me to it (not hard at all, I have to confess):

rumble# acid /sys/src/9/pc/9pccpuf
/sys/src/9/pc/9pccpuf:386 plan 9 boot image
/sys/lib/acid/port
/sys/lib/acid/386

# lock 0xf0057d74 loop key 0xdeaddead pc 0xf01e736a held by pc 0xf01e736a proc 
2052

acid: src(0xf01e736a)
/sys/src/9/port/qlock.c:29
 24 	print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), 
 		up->nlocks.ref);
 25 
 26 	if(q->use.key == 0x55555555)
 27 		panic("qlock: q %#p, key 5*\n", q);
 28 	lock(&q->use);
29 	rwstats.qlock++;
 30 	if(!q->locked) {
 31 		q->locked = 1;
 32 		unlock(&q->use);
 33 		return;
 34 	}

# 61:etherread4 pc f01ef8a0 dbgpc 0   Running (Running) ut 2923 st 0 bss 
0 qpc f0148c8a nl 0 nd 0 lpc f0100f6e pri 13

acid: src(0xf01ef8a0)
/sys/src/9/port/taslock.c:96
 91 lockstats.glare++;
 92 for(;;){
 93 lockstats.inglare++;
 94 i = 0;
 95 while(l-key){
96 	if(conf.nmach < 2 && up && up->edf && 
(up->edf->flags & Admitted)){
 97 		/*
 98 		 * Priority inversion, yield on a 
uniprocessor; on a
 99 		 * multiprocessor, the other processor 
will unlock
 100		 */
 101		print("inversion %#p pc %#lux proc %lud 
held by pc %#lux proc %lud\n",
acid: src(0xf0148c8a)
/sys/src/9/ip/tcp.c:2096
 2091   if(waserror()){
 2092   qunlock(s);
 2093   nexterror();
 2094   }
 2095   qlock(s);
2096   qunlock(tcp);
 2097   
 2098   /* fix up window */
 2099   seg.wnd <<= tcb->rcv.scale;
 2100   
 2101   /* every input packet in puts off the keep alive time out */
acid: src(0xf0100f6e)
/sys/src/9/pc/cga.c:112
 107return;
 108}
 109else
 110		lock(&cgascreenlock);
 111
112	while(n-- > 0)
 113		cgascreenputc(*s++);
 114
 115	unlock(&cgascreenlock);
 116}
 117

# 2052:  exportfs pc f01e7377 dbgpc 94ad Pwrite (Ready) ut 55 st 270 bss 
4 qpc f0145b62 nl 1 nd 0 lpc f01e2c60 pri 10

acid: src(0xf01e7377)
/sys/src/9/port/qlock.c:30
 25 
 26 	if(q->use.key == 0x55555555)
 27 		panic("qlock: q %#p, key 5*\n", q);
 28 	lock(&q->use);
 29 	rwstats.qlock++;
30 	if(!q->locked) {
 31 		q->locked = 1;
 32 		unlock(&q->use);
 33 		return;
 34 	}
 35 if(up == 0)
acid: src(0xf0145b62)
/sys/src/9/ip/tcp.c:704
 699tcpgo(Tcppriv *priv, Tcptimer *t)
 700{
 701	if(t == nil || t->start == 0)
 702		return;
 703
704	qlock(&priv->tl);
 705	t->count = t->start;
 706	timerstate(priv, t, TcptimerON);
 707	qunlock(&priv->tl);
 708}
 709
acid: src(0xf01e2c60)
/sys/src/9/port/proc.c:345
 340queueproc(Schedq *rq, Proc *p)
 341{
 342int pri;
 343
 344pri = rq - runq;
345	lock(runq);
 346	p->priority = pri;
 347	p->rnext = 0;
 348	if(rq->tail)
 349		rq->tail->rnext = p;
 350else
acid: 




Re: [9fans] That deadlock, again

2010-11-16 Thread Lucio De Re
 Well, here is an acid dump, I'll inspect it in detail, but I'm hoping
 someone will beat me to it (not hard at all, I have to confess):
 
 rumble# acid /sys/src/9/pc/9pccpuf
 /sys/src/9/pc/9pccpuf:386 plan 9 boot image
 /sys/lib/acid/port
 /sys/lib/acid/386
 
[ ... ]

This bit looks suspicious to me, but I'm really not an authority and I
may easily be missing something:

 acid: src(0xf0148c8a)
 /sys/src/9/ip/tcp.c:2096
  2091 if(waserror()){
  2092 qunlock(s);
  2093 nexterror();
  2094 }
  2095 qlock(s);
2096  qunlock(tcp);
  2097 
  2098 /* fix up window */
 2099  seg.wnd <<= tcb->rcv.scale;
  2100 
  2101 /* every input packet in puts off the keep alive time out */

The source actually says (to be pedantic):

/* The rest of the input state machine is run with the control block
 * locked and implements the state machine directly out of the RFC.
 * Out-of-band data is ignored - it was always a bad idea.
 */
tcb = (Tcpctl*)s->ptcl;
if(waserror()){
qunlock(s);
nexterror();
}
qlock(s);
qunlock(tcp);

Now, the qunlock(s) should not precede the qlock(s), this is the first
case in this procedure:

/sys/src/9/ip/tcp.c:1941,2424
void
tcpiput(Proto *tcp, Ipifc*, Block *bp)
{
...
}

and unlocking an unlocked item should not be permitted, right?  If s
(the conversation) is returned locked by either of

/sys/src/9/ip/tcp.c:2048
s = iphtlook(&tpriv->ht, source, seg.source, dest, seg.dest);

or

/sys/src/9/ip/tcp.c:2081
s = tcpincoming(s, seg, source, dest, version);

then the qlock(s) is an issue, but I'm sure that is not the case here.

Finally, locking an item ahead of unlocking another is often cause for
a deadlock, although I understand the possible necessity.

qlock(s);
qunlock(tcp);

Even though acid points to a different location, I'd be curious to
have the details above expanded on by somebody familiar with the code.

++L




Re: [9fans] That deadlock, again

2010-11-16 Thread erik quanstrom
  acid: src(0xf0148c8a)
  /sys/src/9/ip/tcp.c:2096
   2091   if(waserror()){
   2092   qunlock(s);
   2093   nexterror();
   2094   }
   2095   qlock(s);
 2096qunlock(tcp);
   2097   
   2098   /* fix up window */
   2099   seg.wnd <<= tcb->rcv.scale;
   2100   
   2101   /* every input packet in puts off the keep alive time 
  out */
 
 The source actually says (to be pedantic):
 
   /* The rest of the input state machine is run with the control block
* locked and implements the state machine directly out of the RFC.
* Out-of-band data is ignored - it was always a bad idea.
*/
   tcb = (Tcpctl*)s->ptcl;
   if(waserror()){
   qunlock(s);
   nexterror();
   }
   qlock(s);
   qunlock(tcp);
 
 Now, the qunlock(s) should not precede the qlock(s), this is the first
 case in this procedure:

it doesn't.  waserror() can't be executed before the code
following it.  perhaps it could be more carefully written
as

   2095   qlock(s);
   2091   if(waserror()){
   2092   qunlock(s);
   2093   nexterror();
   2094   }
 2096qunlock(tcp);

but it really wouldn't make any difference.

i'm not completely convinced that tcp's to blame.
and if it is, i think the problem is probably tcp
timers.

- erik



Re: [9fans] That deadlock, again

2010-11-16 Thread Lucio De Re
 Now, the qunlock(s) should not precede the qlock(s), this is the first
 case in this procedure:
 
 it doesn't.  waserror() can't be executed before the code
 following it.  perhaps it could be more carefully written
 as
 
   2095  qlock(s);
   2091  if(waserror()){
   2092  qunlock(s);
   2093  nexterror();
   2094  }
 2096   qunlock(tcp);

Hm, I thought I understood waserror(), but now I'm sure I don't.  What
condition is waserror() attempting to handle here?

++L




Re: [9fans] That deadlock, again

2010-11-16 Thread erik quanstrom
 Hm, I thought I understood waserror(), but now I'm sure I don't.  What
 condition is waserror() attempting to handle here?

waserror() sets up an entry in the error stack.
if there is a call to error() before poperror(),
then that entry is popped and waserror() returns
1.  it's just like setjmp or sched().

- erik



Re: [9fans] That deadlock, again

2010-11-16 Thread cinap_lenrek
qpc is just the caller of the last successfully *acquired* qlock.
what we know is that the exportfs proc spins in the q->use taslock
called by qlock(), right?  this already seems weird...  q->use is held
just long enough to test q->locked and manipulate the queue.  also
sched() will avoid switching to another proc while we are holding tas
locks.

i would like to know which qlock the kernel is trying to acquire
on behalf of exportfs that is also reachable from the etherread4
code.

one could move:

up->qpc = getcallerpc(&q);

from qlock() before the lock(&q->use); so we can see from where that
qlock gets called that hangs the exportfs call, or add another magic
debug pointer (qpctry) to the proc structure and print it in dumpaproc().

--
cinap


Re: [9fans] That deadlock, again

2010-11-16 Thread cinap_lenrek
sorry for not being clear.  what i meant was that qpc is for the last
qlock we succeeded to acquire.  it's *not* the one we are spinning on.
also, qpc is not set to nil on unlock.

--
cinap


Re: [9fans] That deadlock, again

2010-11-16 Thread Lucio De Re
On Wed, Nov 17, 2010 at 06:22:33AM +0100, cinap_len...@gmx.de wrote:
 
 qpc is just the caller of the last successfully *acquired* qlock.
 what we know is that the exportfs proc spins in the q->use taslock
 called by qlock(), right?  this already seems weird...  q->use is held
 just long enough to test q->locked and manipulate the queue.  also
 sched() will avoid switching to another proc while we are holding tas
 locks.
 
I think I'm with you, probably not quite to the same depth of understanding.

 i would like to know which qlock the kernel is trying to acquire
 on behalf of exportfs that is also reachable from the etherread4
 code.
 
... and from whatever the other proc is that also contributes to this
jam. I don't have the name right in front of me, but I will post it
separately.  As far as I know it's always those two that interfere with
exportfs and usually together, only a short time apart.

 one could move:
 
   up->qpc = getcallerpc(&q);
 
 from qlock() before the lock(&q->use); so we can see from where that
 qlock gets called that hangs the exportfs call, or add another magic
 debug pointer (qpctry) to the proc structure and print it in dumpaproc().
 
I think I'll do the latter; even though it's more complex, it can be a
useful debugging tool in future.  I wouldn't leave it in the kernel code,
but it would be worth being able to refer to it when the occasion arises.

How do you expect this qpctry to be initialised/set?

++L



Re: [9fans] That deadlock, again

2010-11-16 Thread Lucio De Re
On Wed, Nov 17, 2010 at 06:33:13AM +0100, cinap_len...@gmx.de wrote:
 sorry for not being clear.  what i ment was that qpc is for the last
 qlock we succeeded to acquire.  its *not* the one we are spinning on.
 also, qpc is not set to nil on unlock.
 
Ok, so we set qpctry (qpcdbg?) to qpc before changing qpc?  Irrespective
of whether qpc is set or nil?  And should qunlock() clear qpc for safety,
or would this just make debugging more difficult?

++L



Re: [9fans] That deadlock, again

2010-11-16 Thread Lucio De Re
On Wed, Nov 17, 2010 at 08:45:00AM +0200, Lucio De Re wrote:
 ... and from whatever the other proc is that also contributes to this
 jam. I don't have the name right in front of me, but I will post it
 separately.  As far as I know it's always those two that interfere with
 exportfs and usually together, only a short time apart.
 
#I0tcpack pc f01ff12a dbgpc ...

That's the other common factor.

++L



Re: [9fans] That deadlock, again

2010-11-16 Thread erik quanstrom
 On Wed, Nov 17, 2010 at 06:33:13AM +0100, cinap_len...@gmx.de wrote:
  sorry for not being clear.  what i meant was that qpc is for the last
  qlock we succeeded to acquire.  it's *not* the one we are spinning on.
  also, qpc is not set to nil on unlock.
  
 Ok, so we set qpctry (qpcdbg?) to qpc before changing qpc?  Irrespective
 of whether qpc is set or nil?  And should qunlock() clear qpc for safety,
 or would this just make debugging more difficult?

no.  you're very likely to nil out the new qpc if you do that.

- erik



[9fans] That deadlock, again

2010-11-15 Thread lucio
Regarding the deadlock report that I occasionally see on my CPU
server console, I won't bore anyone with PC addresses or anything like
that, but I will recommend something I believe to be a possible
trigger: the failure always seems to occur within exportfs, which in
this case is used exclusively to run stats(1) remotely from my
workstation.  So the recommendation is that somebody like Erik, who is
infinitely more clued up than I am in the kernel arcana should run one
or more stats sessions into a cpu server (I happen to be running
fossil, so maybe Erik won't see this) and see if he can also trigger this 
behaviour.  I'm hoping that it is not platform specific.

Right now, I'm short of skills as well as a serial console :-(

++L

PS: here is a kmesg from the server:

Plan 9
E820:  0009fc00 memory
E820: 0009fc00 000a reserved
E820: 000e 0010 reserved
E820: 0010 4774 memory
E820: 4774 4775 acpi reclaim
E820: 4775 4780 acpi nvs
126 holes free
00018000 0009f000 552960
00468000 0642b000 100413440
100966400 bytes free
cpu0: 2599MHz GenuineIntel PentiumIV/Xeon (cpuid: AX 0x0F29 DX 
0xBFEBFBFF)
ELCR: 0E28
#l0: i82557: 100Mbps port 0xDC00 irq 11: 0004e0b6
1143M memory: 100M kernel data, 1043M user, 1668M swap
root is from (tcp, local)[local!#S/sdC0/fossil]: time...
venti...2010/1115 17:36:16 venti: conf.../boot/venti: mem 31,972,556 
bcmem 63,945,112 icmem 95,917,670...httpd tcp!127.1!8000...init...icache 
95,917,670 bytes = 1,498,714 entries; 16 scache
sync...announce tcp!127.1!17034...serving.
fossil(#S/sdC0/fossil)...fsys: dialing venti at tcp!127.1!17034
version...time...

init: starting /bin/rc

which also supplies:

lock 0xf09d8980 loop key 0xdeaddead pc 0xf01e736a held by pc 0xf01e736a proc 
2052
 17: #I0tcpack pc f01ff12a dbgpc 0   Running (Running) ut 530 st 0 bss 0 
qpc f014583c nl 0 nd 0 lpc f01e2cc8 pri 13
2052:  exportfs pc f01efc9f dbgpc 94ad Pwrite (Ready) ut 43 st 209 bss 
4 qpc f0145b62 nl 1 nd 0 lpc f01e2c60 pri 10

and, a bit later:

lock 0xf0057d74 loop key 0xdeaddead pc 0xf01e736a held by pc 0xf01e736a proc 
2052
 61:etherread4 pc f01ef8a0 dbgpc 0   Running (Running) ut 2923 st 0 bss 
0 qpc f0148c8a nl 0 nd 0 lpc f0100f6e pri 13
2052:  exportfs pc f01e7377 dbgpc 94ad Pwrite (Ready) ut 55 st 270 bss 
4 qpc f0145b62 nl 1 nd 0 lpc f01e2c60 pri 10

to my surprise.




Re: [9fans] That deadlock, again

2010-11-15 Thread lucio
 i assume you've fixed this?  (not yet fixed on sources.)

Yes, before I did that the errors occurred much more frequently;
there's definitely something in that, but as Russ points out, the fix
prevents panics and I have yet to see a panic.

I have a suspicion that we're looking at the wrong problem, but I
don't know how to approach it.

I tried acid, but I'm just not familiar enough with it to make it
work.  I tried

rumble% acid 2052 /bin/exportfs
/bin/exportfs:386 plan 9 executable
/sys/lib/acid/port
/sys/lib/acid/386
acid: src(0xf01e7377)
no source for ?file?

and that's where I have to leave it for the time being.  But
suggestions are welcome, I'll be back here at the end of the day (it's
07:00 now, I expect to be back after 19:00).

++L




Re: [9fans] That deadlock, again

2010-11-15 Thread cinap_lenrek
the pc's printed by the lock loop message are kernel code. you
have to load a debug kernel into acid.

--
cinap


Re: [9fans] That deadlock, again

2010-11-15 Thread lucio
 the pc's printed by the lock loop message are kernel code. you
 have to load a debug kernel into acid.

Thanks, I'll do some digging in the Acid document(s), try to
familiarise myself with the details...

++L




Re: [9fans] That deadlock, again

2010-11-15 Thread cinap_lenrek
you almost had it with your acid approach :)

if your kernel image is uncompressed and unstripped, you can
just load it with acid:

acid /386/9pc

if you build it yourself, then there should be such a kernel in /sys/src/9/pc

--
cinap


Re: [9fans] That deadlock, again

2010-11-15 Thread Lucio De Re
On Tue, Nov 16, 2010 at 06:28:07AM +0100, cinap_len...@gmx.de wrote:
 
 if your kernel image is uncompressed and unstripped, you can
 just load it with acid:
 
 acid /386/9pc
 
 if you build it yourself, then there should be such a kernel in /sys/src/9/pc
 
OK, will try this evening, I specifically left the machines on to do so,
I'm very keen to help put this issue to rest.  And to get to know the
kernel more intimately.  Pity Nemo isn't updating his commentary.

++L