Re: [9fans] That deadlock, again
On Thu, Nov 18, 2010 at 12:53:52AM -0500, erik quanstrom wrote:
> you must be in process context to qlock, because only processes can
> sleep.

There's obviously at least one exception, because otherwise I would not have got a panic at startup. Or, for that matter, there would not be active code ahead of the test at /sys/src/9/port/qlock.c:35,36

	if(up == 0)
		panic("qlock");

in qlock(). Or maybe that's where things are going wrong; I doubt that the code is mistaken, but I know my understanding is inadequate :-) ++L
Re: [9fans] That deadlock, again
was 0xf01e739e really the code that accesses up->qpctry? -- cinap
Re: [9fans] That deadlock, again
hm... thinking about it... does the kernel assume (maybe in early initialization) that calling qlock() without a proc is ok as long as it can make sure it will not be held by another proc? -- cinap
Re: [9fans] That deadlock, again
On Thu, Nov 18, 2010 at 10:20:33AM +0100, cinap_len...@gmx.de wrote:
> hm... thinking about it... does the kernel assume (maybe in early
> initialization) that calling qlock() without a proc is ok as long as
> it can make sure it will not be held by another proc?

That's a question for Bell Labs, I suppose, but that's precisely what I believe. There is no other explanation for the panic. Moving the up == 0 test earlier would invalidate this assumption and cause the panic we have already seen.

The issue here is whether there is a situation where qlock() is intentionally invoked where up == 0 (suggested by the positioning of the up == 0 test _after_ setting the locked condition). This is improbable, though, and needs sorting out: whereas setting the lock can be done with up == 0 - and we can also clear the lock - we cannot _fail_ to set the lock, because then the absence of up will trigger a panic.

Now, we know that qlock() is called with up == 0; we have seen a panic generated by such a call. Will it suffice to locate the invocation and somehow deal with it, or should we make qlock() more robust and have it reject a request from a context where up == 0?

Definitely, if qlock() no longer allows invocations with up == 0 there will be simplifications in its implementation. For example, the line

	if(up != nil && up->nlocks.ref)
		print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), up->nlocks.ref);

will no longer need the up != nil test. But I'm convinced there's more here than meets the eye. Unfortunately, while I have a Plan 9 distribution at my fingertips, I'm not going to try to fix this problem in a 9vx environment; I'll wait until I get home to deal with the native stuff. But one can speculate... ++L
Re: [9fans] That deadlock, again
> 	if(up != nil && up->nlocks.ref)
> 		print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), up->nlocks.ref);
> 
> will no longer need the up != nil test.

that's just wrong. if the kernel is qlocking without an up, there's a bug. stack dumps are your friend. but i have a feeling that there is a mistake in your modification to qlock. you didn't have this panic before you modified qlock. - erik
Re: [9fans] That deadlock, again
On Thu Nov 18 10:23:20 EST 2010, quans...@quanstro.net wrote:
> that's just wrong. if the kernel is qlocking without an up, there's a
> bug. stack dumps are your friend. but i have a feeling that there is
> a mistake in your modification to qlock. you didn't have this panic
> before you modified qlock.

and i'm just wrong. intentionally or not, devsd does qlock things with no up from sdreset(). ether82598 does too (my fault). - erik
Re: [9fans] That deadlock, again
> and i'm just wrong. intentionally or not, devsd does qlock things
> with no up from sdreset(). ether82598 does too (my fault).

I suggest you fix ether82598: it is OK to call qlock() and qunlock() without up, but only if one is sure that the qlock() will succeed. If it has to wait, it will panic. Given that, why do the locking at all? ++L
Re: [9fans] That deadlock, again
> but i have a feeling that there is a mistake in your modification to
> qlock. you didn't have this panic before you modified qlock.

qlock() is broken, or at the very least ambivalent. Someone ought to put it out of its misery: is it legal or is it not to call qlock() in an up == 0 context? ++L
Re: [9fans] That deadlock, again
> If it has to wait, it will panic. Given that, why do the locking at
> all?

i assume the intention is along these lines: it's to allow the use, during reset, of a given driver's standard functions that normally must qlock, to avoid requiring two copies of them, with and without the qlock. after reset, it's illegal to call qlock without a process (notably in an interrupt function), as it previously was.
Re: [9fans] That deadlock, again
> it's to allow the use during reset of a given driver's standard
> functions that normally must qlock, to avoid requiring two copies of
> them, with and without the qlock. after reset, it's illegal to call
> qlock without a process (notably in an interrupt function), as it
> previously was.

I'm willing to credit the validity of this, but I believe then that it ought to be more explicit. It seems to me that having a situation where a panic can ensue if a lock is already taken is too risky. Is it possible to count the instances of such qlock() invocations in the present kernel code and find out how common the problem really is? Or should one simply treat such invocations as innocuous and just omit connecting a user process to the queue when no user process is specified, if the lock is taken? That sounds positively explosive! ++L
Re: [9fans] That deadlock, again
> after reset, it's illegal to call qlock without a process (notably in
> an interrupt function), as it previously was.

That suggests that the (hopefully) few instances of qlock() invocations that may occur in this space should be burdened with the need to check the value of up and skip the call altogether if it's nil. Mind you, this is no longer the problem we set out to fix, although no doubt there is a relationship, however tenuous. ++L
Re: [9fans] That deadlock, again
> I suggest you fix ether82598: it is OK to call qlock() and qunlock()
> without up, but only if sure that the qlock() will succeed.
> If it has to wait, it will panic.

yes. that's it.

> If it has to wait, it will panic. Given that, why do the locking at
> all?

i assume the intention is along these lines: it's to allow the use during reset of a given driver's standard functions that normally must qlock, to avoid requiring two copies of them, with and without the qlock. after reset, it's illegal to call qlock without a process (notably in an interrupt function), as it previously was.

i think lucio has stated the current restriction more exactly, but this is closer to the assumed intent. perhaps we should make this what qlock actually does with

	diff -c /n/dump/2010/1118/sys/src/9/pc/main.c pc/main.c
	/n/dump/2010/1118/sys/src/9/pc/main.c:201,206 - pc/main.c:201,207
	  		poperror();
	  	}
	  	kproc("alarm", alarmkproc, 0);
	+ 	conf.postdawn = 1;
	  	touser(sp);
	  }

	diff -c /n/dump/2010/1118/sys/src/9/pc/dat.h pc/dat.h
	/n/dump/2010/1118/sys/src/9/pc/dat.h:107,112 - pc/dat.h:107,113
	  	ulong	ialloc;		/* max interrupt time allocation in bytes */
	  	ulong	pipeqsize;	/* size in bytes of pipe queues */
	  	int	nuart;		/* number of uart devices */
	+ 	int	postdawn;	/* multiprogramming on */
	  };

	diff -c /n/dump/2010/1118/sys/src/9/port/qlock.c port/qlock.c
	/n/dump/2010/1118/sys/src/9/port/qlock.c:18,23 - port/qlock.c:18,25
	  {
	  	Proc *p;

	+ 	if(up == nil && conf.postdawn)
	+ 		panic("qlock: %#p: postdawn up nil\n", getcallerpc(&q));
	  	if(m->ilockdepth != 0)
	  		print("qlock: %#p: ilockdepth %d\n", getcallerpc(&q), m->ilockdepth);
	  	if(up != nil && up->nlocks.ref)

a test kernel i've got does boot with this test. - erik
Re: [9fans] That deadlock, again
on second thought, conf.postdawn should be set in schedinit(). - erik
Re: [9fans] That deadlock, again
> /n/dump/2010/1118/sys/src/9/port/qlock.c:18,23 - port/qlock.c:18,25
>   {
>   	Proc *p;
> 
> + 	if(up == nil && conf.postdawn)
> + 		panic("qlock: %#p: postdawn up nil\n", getcallerpc(&q));
>   	if(m->ilockdepth != 0)
>   		print("qlock: %#p: ilockdepth %d\n", getcallerpc(&q), m->ilockdepth);
>   	if(up != nil && up->nlocks.ref)

Yes, this is the type of explicitness I was thinking of. Note that you can now drop further tests for up == 0 later in the qlock() text. ++L
Re: [9fans] That deadlock, again
> Yes, this is the type of explicitness I was thinking of. Note that
> you can now drop further tests for up == 0 later in the qlock() text.

Hm, spoke too quickly. The tests on up have to remain, sadly. Sorry about the misleading noise. ++L
Re: [9fans] That deadlock, again
lock loops are about Locks (spin locks), not QLocks. the relative ordering of any two calls to qlock and qunlock is irrelevant. russ
Re: [9fans] That deadlock, again
> one could move:
> 
> 	up->qpc = getcallerpc(&q);
> 
> from qlock() before the lock(&q->use); so we can see from where that
> qlock gets called that hangs the exportfs call, or add another magic
> debug pointer (qpctry) to the proc structure and print it in
> dumpaproc().

Cinap, I tried your debugging code and got an odd panic at boot time. Consistently:

	panic: kernel fault: no user process pc=0xf01e739e addr=0x09e8

Having a look with acid, this seems to be caused by an attempt at setting the debug PC (your up->qpctry) at a time when up has no value yet. Strangely, later in the qlock() code up is checked and a panic issued if it is zero. I'm missing something here: it is possible to execute this code at /sys/src/9/port/qlock.c:29,37 (more or less)

	lock(&q->use);
	rwstats.qlock++;
	if(!q->locked) {
		q->locked = 1;
		unlock(&q->use);
		return;
	}

which is immediately followed by

	if(up == 0)
		panic("qlock");

with up nil, but it looks like a bit of a one-way operation. Anyway, I have moved the assignment to qpctry to after up is tested. Let's see what happens. I'll have to get back to you once the system is back up. ++L
Re: [9fans] That deadlock, again
> Anyway, I have moved the assignment to qpctry to after up is tested.
> Let's see what happens. I'll have to get back to you once the system
> is back up.

The system is working now. I have to wait for a problem to arise, next. ++L
Re: [9fans] That deadlock, again
> Strangely, later in the qlock() code up is checked and a panic issued
> if it is zero. I'm missing something here: it is possible to execute
> this code

you must be in process context to qlock, because only processes can sleep. - erik
Re: [9fans] That deadlock, again
> I tried acid, but I'm just not familiar enough with it to make it
> work. I tried
> 
> 	rumble% acid 2052 /bin/exportfs
> 	/bin/exportfs:386 plan 9 executable
> 	/sys/lib/acid/port
> 	/sys/lib/acid/386
> 	acid: src(0xf01e7377)
> 	no source for ?file?

cinap is right, the bug is in the kernel. we know that because it's a lock loop. that can only happen if the kernel screws up. also, the address is a kernel address (starts with 0xf). if you didn't know the lock loop message was from the kernel, you could grep for it

	; cd /sys/src
	; g 'lock (%#?[luxp]+ )loop' .
	./9/port/taslock.c:61: 		print("lock %#p loop key %#lux pc %#lux held by pc %#lux proc %lud\n",

- erik
Re: [9fans] That deadlock, again
> cinap is right, the bug is in the kernel. we know that because it's a
> lock loop. that can only happen if the kernel screws up. also, the
> address is a kernel address (starts with 0xf).

Well, here is an acid dump. I'll inspect it in detail, but I'm hoping someone will beat me to it (not hard at all, I have to confess):

	rumble# acid /sys/src/9/pc/9pccpuf
	/sys/src/9/pc/9pccpuf:386 plan 9 boot image
	/sys/lib/acid/port
	/sys/lib/acid/386

	# lock 0xf0057d74 loop key 0xdeaddead pc 0xf01e736a held by pc 0xf01e736a proc 2052
	acid: src(0xf01e736a)
	/sys/src/9/port/qlock.c:29
	 24			print("qlock: %#p: nlocks %lud\n", getcallerpc(&q), up->nlocks.ref);
	 25	
	 26		if(q->use.key == 0x55555555)
	 27			panic("qlock: q %#p, key 5*\n", q);
	 28		lock(&q->use);
	 29		rwstats.qlock++;
	 30		if(!q->locked) {
	 31			q->locked = 1;
	 32			unlock(&q->use);
	 33			return;
	 34		}

	# 61: etherread4 pc f01ef8a0 dbgpc 0 Running (Running) ut 2923 st 0 bss 0 qpc f0148c8a nl 0 nd 0 lpc f0100f6e pri 13
	acid: src(0xf01ef8a0)
	/sys/src/9/port/taslock.c:96
	 91		lockstats.glare++;
	 92		for(;;){
	 93			lockstats.inglare++;
	 94			i = 0;
	 95			while(l->key){
	 96				if(conf.nmach < 2 && up && up->edf && (up->edf->flags & Admitted)){
	 97					/*
	 98					 * Priority inversion, yield on a uniprocessor; on a
	 99					 * multiprocessor, the other processor will unlock
	100					 */
	101					print("inversion %#p pc %#lux proc %lud held by pc %#lux proc %lud\n",

	acid: src(0xf0148c8a)
	/sys/src/9/ip/tcp.c:2096
	2091		if(waserror()){
	2092			qunlock(s);
	2093			nexterror();
	2094		}
	2095		qlock(s);
	2096		qunlock(tcp);
	2097	
	2098		/* fix up window */
	2099		seg.wnd <<= tcb->rcv.scale;
	2100	
	2101		/* every input packet in puts off the keep alive time out */

	acid: src(0xf0100f6e)
	/sys/src/9/pc/cga.c:112
	107			return;
	108		}
	109		else
	110			lock(&cgascreenlock);
	111	
	112		while(n-- > 0)
	113			cgascreenputc(*s++);
	114	
	115		unlock(&cgascreenlock);
	116	}
	117	

	# 2052: exportfs pc f01e7377 dbgpc 94ad Pwrite (Ready) ut 55 st 270 bss 4 qpc f0145b62 nl 1 nd 0 lpc f01e2c60 pri 10
	acid: src(0xf01e7377)
	/sys/src/9/port/qlock.c:30
	 25	
	 26		if(q->use.key == 0x55555555)
	 27			panic("qlock: q %#p, key 5*\n", q);
	 28		lock(&q->use);
	 29		rwstats.qlock++;
	 30		if(!q->locked) {
	 31			q->locked = 1;
	 32			unlock(&q->use);
	 33			return;
	 34		}
	 35		if(up == 0)

	acid: src(0xf0145b62)
	/sys/src/9/ip/tcp.c:704
	699	tcpgo(Tcppriv *priv, Tcptimer *t)
	700	{
	701		if(t == nil || t->start == 0)
	702			return;
	703	
	704		qlock(&priv->tl);
	705		t->count = t->start;
	706		timerstate(priv, t, TcptimerON);
	707		qunlock(&priv->tl);
	708	}
	709	

	acid: src(0xf01e2c60)
	/sys/src/9/port/proc.c:345
	340	queueproc(Schedq *rq, Proc *p)
	341	{
	342		int pri;
	343	
	344		pri = rq - runq;
	345		lock(runq);
	346		p->priority = pri;
	347		p->rnext = 0;
	348		if(rq->tail)
	349			rq->tail->rnext = p;
	350		else

	acid:
Re: [9fans] That deadlock, again
> Well, here is an acid dump. I'll inspect it in detail, but I'm hoping
> someone will beat me to it (not hard at all, I have to confess):
> 
> 	rumble# acid /sys/src/9/pc/9pccpuf
> 	/sys/src/9/pc/9pccpuf:386 plan 9 boot image
> 	/sys/lib/acid/port
> 	/sys/lib/acid/386
> [ ... ]

This bit looks suspicious to me, but I'm really not an authority and I may easily be missing something:

	acid: src(0xf0148c8a)
	/sys/src/9/ip/tcp.c:2096
	2091		if(waserror()){
	2092			qunlock(s);
	2093			nexterror();
	2094		}
	2095		qlock(s);
	2096		qunlock(tcp);
	2097	
	2098		/* fix up window */
	2099		seg.wnd <<= tcb->rcv.scale;
	2100	
	2101		/* every input packet in puts off the keep alive time out */

The source actually says (to be pedantic):

	/* The rest of the input state machine is run with the control block
	 * locked and implements the state machine directly out of the RFC.
	 * Out-of-band data is ignored - it was always a bad idea.
	 */
	tcb = (Tcpctl*)s->ptcl;
	if(waserror()){
		qunlock(s);
		nexterror();
	}
	qlock(s);
	qunlock(tcp);

Now, the qunlock(s) should not precede the qlock(s); this is the first such case in this procedure:

	/sys/src/9/ip/tcp.c:1941,2424
	void
	tcpiput(Proto *tcp, Ipifc*, Block *bp)
	{
		...
	}

and unlocking an unlocked item should not be permitted, right? Unless s (the conversation) is returned locked by either of

	/sys/src/9/ip/tcp.c:2048
	s = iphtlook(&tpriv->ht, source, seg.source, dest, seg.dest);

or

	/sys/src/9/ip/tcp.c:2081
	s = tcpincoming(s, &seg, source, dest, version);

then the qlock(s) is an issue, but I'm sure that is not the case here. Finally, locking an item ahead of unlocking another is often cause for a deadlock, although I understand the possible necessity:

	qlock(s);
	qunlock(tcp);

Even though acid points to a different location, I'd be curious to have the details above expanded on by somebody familiar with the code. ++L
Re: [9fans] That deadlock, again
> 	acid: src(0xf0148c8a)
> 	/sys/src/9/ip/tcp.c:2096
> 	2091		if(waserror()){
> 	2092			qunlock(s);
> 	2093			nexterror();
> 	2094		}
> 	2095		qlock(s);
> 	2096		qunlock(tcp);
> 
> Now, the qunlock(s) should not precede the qlock(s); this is the
> first such case in this procedure:

it doesn't. waserror() can't be executed before the code following it. perhaps it could be more carefully written as

	2095		qlock(s);
	2091		if(waserror()){
	2092			qunlock(s);
	2093			nexterror();
	2094		}
	2096		qunlock(tcp);

but it really wouldn't make any difference. i'm not completely convinced that tcp's to blame. and if it is, i think the problem is probably tcp timers. - erik
Re: [9fans] That deadlock, again
> it doesn't. waserror() can't be executed before the code following
> it. perhaps it could be more carefully written as
> 
> 	2095		qlock(s);
> 	2091		if(waserror()){
> 	2092			qunlock(s);
> 	2093			nexterror();
> 	2094		}
> 	2096		qunlock(tcp);

Hm, I thought I understood waserror(), but now I'm sure I don't. What condition is waserror() attempting to handle here? ++L
Re: [9fans] That deadlock, again
> Hm, I thought I understood waserror(), but now I'm sure I don't.
> What condition is waserror() attempting to handle here?

waserror() sets up an entry in the error stack. if there is a call to error() before poperror(), then that entry is popped and waserror() returns 1. it's just like setjmp or sched(). - erik
Re: [9fans] That deadlock, again
qpc is just the caller of the last successfully *acquired* qlock. what we know is that the exportfs proc spins in the q->use taslock called by qlock(), right? this already seems weird... q->use is held just long enough to test q->locked and manipulate the queue. also sched() will avoid switching to another proc while we are holding tas locks.

i would like to know which qlock the kernel is trying to acquire on behalf of exportfs that is also reachable from the etherread4 code.

one could move:

	up->qpc = getcallerpc(&q);

from qlock() before the lock(&q->use); so we can see from where that qlock gets called that hangs the exportfs call, or add another magic debug pointer (qpctry) to the proc structure and print it in dumpaproc(). -- cinap
Re: [9fans] That deadlock, again
sorry for not being clear. what i meant was that qpc is for the last qlock we succeeded to acquire. it's *not* the one we are spinning on. also, qpc is not set to nil on unlock. -- cinap
Re: [9fans] That deadlock, again
On Wed, Nov 17, 2010 at 06:22:33AM +0100, cinap_len...@gmx.de wrote:
> qpc is just the caller of the last successfully *acquired* qlock.
> what we know is that the exportfs proc spins in the q->use taslock
> called by qlock(), right? this already seems weird... q->use is held
> just long enough to test q->locked and manipulate the queue. also
> sched() will avoid switching to another proc while we are holding
> tas locks.

I think I'm with you, probably not quite to the same depth of understanding.

> i would like to know which qlock the kernel is trying to acquire on
> behalf of exportfs that is also reachable from the etherread4 code.

... and from whatever the other proc is that also contributes to this jam. I don't have the name right in front of me, but I will post it separately. As far as I know it's always those two that interfere with exportfs, and usually together, only a short time apart.

> one could move:
> 
> 	up->qpc = getcallerpc(&q);
> 
> from qlock() before the lock(&q->use); so we can see from where that
> qlock gets called that hangs the exportfs call, or add another magic
> debug pointer (qpctry) to the proc structure and print it in
> dumpaproc().

I think I'll do the latter; even though it's more complex, it can be a useful debugging tool in future. I wouldn't leave it in the kernel code, but it would be worth being able to refer to it when the occasion arises. How do you expect this qpctry to be initialised/set? ++L
Re: [9fans] That deadlock, again
On Wed, Nov 17, 2010 at 06:33:13AM +0100, cinap_len...@gmx.de wrote:
> sorry for not being clear. what i meant was that qpc is for the last
> qlock we succeeded to acquire. it's *not* the one we are spinning on.
> also, qpc is not set to nil on unlock.

Ok, so we set qpctry (qpcdbg?) to qpc before changing qpc? Irrespective of whether qpc is set or nil? And should qunlock() clear qpc for safety, or would this just make debugging more difficult? ++L
Re: [9fans] That deadlock, again
On Wed, Nov 17, 2010 at 08:45:00AM +0200, Lucio De Re wrote:
> ... and from whatever the other proc is that also contributes to this
> jam. I don't have the name right in front of me, but I will post it
> separately. As far as I know it's always those two that interfere
> with exportfs, and usually together, only a short time apart.

	#I0tcpack pc f01ff12a dbgpc ...

That's the other common factor. ++L
Re: [9fans] That deadlock, again
> Ok, so we set qpctry (qpcdbg?) to qpc before changing qpc?
> Irrespective of whether qpc is set or nil? And should qunlock() clear
> qpc for safety, or would this just make debugging more difficult?

no. you're very likely to nil out the new qpc if you do that. - erik
[9fans] That deadlock, again
Regarding the deadlock report that I occasionally see on my CPU server console, I won't bore anyone with PC addresses or anything like that, but I will recommend something I believe to be a possible trigger: the failure always seems to occur within exportfs, which in this case is used exclusively to run stats(1) remotely from my workstation.

So the recommendation is that somebody like Erik, who is infinitely more clued up than I am in kernel arcana, should run one or more stats sessions into a cpu server (I happen to be running fossil, so maybe Erik won't see this) and see if he can also trigger this behaviour. I'm hoping that it is not platform specific. Right now, I'm short of skills as well as a serial console :-( ++L

PS: here is a kmesg from the server:

	Plan 9
	E820: 0009fc00 memory
	E820: 0009fc00 000a reserved
	E820: 000e 0010 reserved
	E820: 0010 4774 memory
	E820: 4774 4775 acpi reclaim
	E820: 4775 4780 acpi nvs
	126 holes free
	00018000 0009f000 552960
	00468000 0642b000 100413440
	100966400 bytes free
	cpu0: 2599MHz GenuineIntel PentiumIV/Xeon (cpuid: AX 0x0F29 DX 0xBFEBFBFF)
	ELCR: 0E28
	#l0: i82557: 100Mbps port 0xDC00 irq 11: 0004e0b6
	1143M memory: 100M kernel data, 1043M user, 1668M swap
	root is from (tcp, local)[local!#S/sdC0/fossil]:
	time...
	venti...2010/1115 17:36:16 venti: conf.../boot/venti: mem 31,972,556 bcmem 63,945,112 icmem 95,917,670...httpd tcp!127.1!8000...init...icache 95,917,670 bytes = 1,498,714 entries; 16 scache
	sync...announce tcp!127.1!17034...serving.
	fossil(#S/sdC0/fossil)...fsys: dialing venti at tcp!127.1!17034
	version...time...
	init: starting /bin/rc

which also supplies:

	lock 0xf09d8980 loop key 0xdeaddead pc 0xf01e736a held by pc 0xf01e736a proc 2052
	17: #I0tcpack pc f01ff12a dbgpc 0 Running (Running) ut 530 st 0 bss 0 qpc f014583c nl 0 nd 0 lpc f01e2cc8 pri 13
	2052: exportfs pc f01efc9f dbgpc 94ad Pwrite (Ready) ut 43 st 209 bss 4 qpc f0145b62 nl 1 nd 0 lpc f01e2c60 pri 10

and, a bit later:

	lock 0xf0057d74 loop key 0xdeaddead pc 0xf01e736a held by pc 0xf01e736a proc 2052
	61: etherread4 pc f01ef8a0 dbgpc 0 Running (Running) ut 2923 st 0 bss 0 qpc f0148c8a nl 0 nd 0 lpc f0100f6e pri 13
	2052: exportfs pc f01e7377 dbgpc 94ad Pwrite (Ready) ut 55 st 270 bss 4 qpc f0145b62 nl 1 nd 0 lpc f01e2c60 pri 10

to my surprise.
Re: [9fans] That deadlock, again
> i assume you've fixed this? (not yet fixed on sources.)

Yes, before I did that the errors occurred much more frequently; there's definitely something in that, but as Russ points out, the fix prevents panics and I have yet to see a panic. I have a suspicion that we're looking at the wrong problem, but I don't know how to approach it.

I tried acid, but I'm just not familiar enough with it to make it work. I tried

	rumble% acid 2052 /bin/exportfs
	/bin/exportfs:386 plan 9 executable
	/sys/lib/acid/port
	/sys/lib/acid/386
	acid: src(0xf01e7377)
	no source for ?file?

and that's where I have to leave it for the time being. But suggestions are welcome; I'll be back here at the end of the day (it's 07:00 now, I expect to be back after 19:00). ++L
Re: [9fans] That deadlock, again
the pc's printed by the lock loop message are kernel code. you have to load a debug kernel into acid. -- cinap
Re: [9fans] That deadlock, again
> the pc's printed by the lock loop message are kernel code. you have
> to load a debug kernel into acid.

Thanks, I'll do some digging in the Acid document(s), try to familiarise myself with the details... ++L
Re: [9fans] That deadlock, again
you almost had it with your acid approach :) if your kernel image is uncompressed and unstripped, you can just load it with acid:

	acid /386/9pc

if you build it yourself, then there should be such a kernel in /sys/src/9/pc -- cinap
Re: [9fans] That deadlock, again
On Tue, Nov 16, 2010 at 06:28:07AM +0100, cinap_len...@gmx.de wrote:
> if your kernel image is uncompressed and unstripped, you can just
> load it with acid:
> 
> 	acid /386/9pc
> 
> if you build it yourself, then there should be such a kernel in
> /sys/src/9/pc

OK, will try this evening; I specifically left the machines on to do so. I'm very keen to help put this issue to rest. And to get to know the kernel more intimately. Pity Nemo isn't updating his commentary. ++L