Re: [9fans] 9vx (is this the right list)? import issue
So just need to fix oserrstr() or fix this in devip itself? I vote oserrstr, lucho votes fix this little bit of code. how many other errors are lurking in osstrerror()? there are lots of assumptions about the exact errstrs. - erik
Re: [9fans] 9vx (is this the right list)? import issue
sent in via codereview but I missed a debug print I left in. Sorry. I assume you can reject it so I'm sending the correct patch. ron
Re: [9fans] 9vx (is this the right list)? import issue
not having done this before, the reference is URL: http://codereview.appspot.com/122046 ron
Re: [9fans] 9vx (is this the right list)? import issue
On Wed Sep 23 14:54:47 EDT 2009, r...@swtch.com wrote: how sure are we that 1 holds? couldn't there be other, legitimate and transient errors? could a user-delivered note sneak in and confuse the issue? no. at least not if the kernel is working properly. that's why i said devmnt should enforce the assumption. it's at most a couple lines of extra code, whereas the diff you posted was quite a bit longer. it seems to be a big assumption that the whole ip stack and the ethernet driver know the difference between being interrupted before sending or queuing the packet or after. the comment in qbwrite seems to say that you can get interrupted after the pkt has been queued. my approach has the advantage of sidestepping this problem. i don't think 13 changed lines is unreasonable. (without verbose debugging.) - erik
Re: [9fans] 9vx (is this the right list)? import issue
On Wednesday, September 23, 2009, ron minnich rminn...@gmail.com wrote: not having done this before, the reference is URL: http://codereview.appspot.com/122046 Thanks, Ron. Fix applied. Russ
Re: [9fans] 9vx (is this the right list)? import issue
On Monday, September 21, 2009, ron minnich rminn...@gmail.com wrote: If I am in 9vx and have imported a file system from somewhere, and do an ls, and get impatient and hit del, the import dies. term% grep full *.c grep: can't open *.c: '*.c' mount rpc error term% ls ls: .: clone failed term% mnt: proc grep 91: mismatch from /net/tcp/0/data /n/o/usr/rminnich/9k/bgp rep 0x93daf0 tag 1 fid 432 T110 R117 rp 1 mnt: proc rc 93: mismatch from /net/tcp/0/data /n/o/usr/rminnich/9k/bgp rep 0x93daf0 tag 1 fid 432 T110 R121 rp 1 I'm wondering if this is just a simple matter of the note going too far. I am going to look but figure somebody might immediately say ah ha! and have a fix. I think it is. In fact I think drawterm has the same bug except in drawterm it is for some reason harder to trigger. I have no aha! fix for you. Russ
Re: [9fans] 9vx (is this the right list)? import issue
OK, a little more fooling around. term% grep full *.c (wait about a second, hit DEL) grep: can't open apic.c: 'apic.c' './sys/src/9/pc/grep' does not exist grep: can't open apm.c: 'apm.c' mount rpc error grep: can't open archmp.c: 'archmp.c' mount rpc error grep: can't open bios32.c: 'bios32.c' mount rpc error grep: can't open cga.c: 'cga.c' mount rpc error etc. in fact I get an error for each file in the directory. I also get one of these for each file. mnt: proc grep 90: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc/devlml.c rep 0x16d7010 tag 3 fid 533 T112 R111 rp 3 mnt: proc grep 90: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc/devlml.c rep 0x16d7010 tag 3 fid 533 T120 R113 rp 3 mnt: proc grep 90: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc rep 0x16d7010 tag 3 fid 524 T110 R121 rp 3 mnt: proc grep 90: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc/devmoipv6.c rep 0x16d7010 tag 3 fid 534 T112 R111 rp 3 mnt: proc grep 90: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc/devmoipv6.c rep 0x16d7010 tag 3 fid 534 T120 R113 rp 3 mnt: proc grep 90: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc rep 0x16d7010 tag 3 fid 524 T110 R121 rp 3 mnt: proc grep 90: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc/devrtc.c rep 0x16d7010 tag 3 fid 535 T112 R111 rp 3 so what is interesting is that the grep did not die. It keeps trying to open files and keeps failing. And then, finally: term% ls ls: .: clone failed term% So, we see lots of guys with the same tag, with a T and R mismatch, with T and R like open and clunk, with the apparent problem that they are all using tag 3. I'm wondering if there is not a problem with the way tags are flushed when there is an interrupt. As an experiment I commented out the free in the freetag code. Better: now I just get this: mnt: proc grep 78: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc/apic.c rep 0x1491c60 tag 2 fid 454 T120 R117 rp 2 after the DEL, TCLUNK gets an RREAD (which makes a sort of sense) and the when I ls . mnt: proc rc 80: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc rep 0x1491c60 tag 2 fid 421 T110 R121 rp 2 Twalk for the ls gets the RCLUNK And from that point on it's all over. Interesting. ron
Re: [9fans] 9vx (is this the right list)? import issue
OK, I did this in mntralloc, in the code path in which we reuse the rpc from the mtnalloc.rpcfree: I just simply always allocated a new tag, even if we found an old rpc which was nominally free: I always allocated a new tag, not reusing the old tag. That fixed the problem. So, basically, the way I see it is, grep proc gets an interrupt, kernel will try to flush RPCs which we initiated, we drop the (we think) flushed rpc struct onto the rpcfree list, but the reply from the server is still in flight. We reuse the rpc from rpcfree list, we send out a new T, with the same tag as the previous one which we think we flushed, we get the reply from the earlier RPC, tags match, R does not match T, bad day. This is the same code as is in plan 9. There's harder and harder ways to deal with this, I have some ideas but I expect some of the folks on this list to have better ones. The simplest one that would probably work is to avoid reusing tags quite so quickly. Free tags, yes, but use a counter to indicate next tag to use, so that it's a relatively long time before a tag for a mount point is reused again. So, e.g., we free tag 3, but the next tag we allocate is tag 4, and so on. Sooner or later we'll use tag 3 again, likely long after any messages in flight have been retired. (weirdly enough, I saw a trick like this used in hardware many years ago ...) That may not be good enough, not sure. It's definitely pretty easy to implement. What I observed is that when this is all working, tag 3 is the only tag ever used! ron
Re: [9fans] 9vx (is this the right list)? import issue
2009/9/22 ron minnich rminn...@gmail.com: So, basically, the way I see it is, grep proc gets an interrupt, kernel will try to flush RPCs which we initiated, we drop the (we think) flushed rpc struct onto the rpcfree list, but the reply from the server is still in flight. We reuse the rpc from rpcfree list, we send out a new T, with the same tag as the previous one which we think we flushed, we get the reply from the earlier RPC, tags match, R does not match T, bad day. surely the correct way to go about this (caveat: i haven't looked at the code) is to drop the rpc struct onto the rpcfree list only when the Rflush is received? from experience with writing heavily used 9p services, getting flush properly right is a bitch.
Re: [9fans] 9vx (is this the right list)? import issue
On Tue, Sep 22, 2009 at 11:35 AM, roger peppe rogpe...@gmail.com wrote: surely the correct way to go about this (caveat: i haven't looked at the code) is to drop the rpc struct onto the rpcfree list only when the Rflush is received? you just got an Eintr. Did the request get sent? I don't know. I am not sure the code does either. Since this is only seen so far in 9vx I am guess it is a 9vx thing. ron
Re: [9fans] 9vx (is this the right list)? import issue
I don't know. I am not sure the code does either. Since this is only seen so far in 9vx I am guess it is a 9vx thing. i see this now than then on regular plan 9 Tue Sep 1 18:51:15: unexpected reply tag 51; type 109 Tue Sep 1 18:51:15: unexpected reply tag 16; type 109 Tue Sep 1 18:51:15: unexpected reply tag 39; type 109 Tue Sep 1 18:51:15: unexpected reply tag 51; type 109 Tue Sep 1 18:51:16: unexpected reply tag 51; type 109 Tue Sep 1 18:51:17: unexpected reply tag 39; type 109 - erik
Re: [9fans] 9vx (is this the right list)? import issue
2009/9/22 ron minnich rminn...@gmail.com: On Tue, Sep 22, 2009 at 11:35 AM, roger peppe rogpe...@gmail.com wrote: surely the correct way to go about this (caveat: i haven't looked at the code) is to drop the rpc struct onto the rpcfree list only when the Rflush is received? you just got an Eintr. Did the request get sent? doesn't Eintr mean that the write did not complete? if it's ambiguous, then the tag should indeed be put on hold, because there's no way to get it right. it would be useful to see a log of the actual 9p messages.
Re: [9fans] 9vx (is this the right list)? import issue
if it's ambiguous, then the tag should indeed be put on hold, because there's no way to get it right. how do we prevent all tags from being on hold? there's no way to get that right, either. - erik
Re: [9fans] 9vx (is this the right list)? import issue
On Sep 22, 2009, at 1:47 PM, ron minnich wrote: On Tue, Sep 22, 2009 at 11:35 AM, roger peppe rogpe...@gmail.com wrote: surely the correct way to go about this (caveat: i haven't looked at the code) is to drop the rpc struct onto the rpcfree list only when the Rflush is received? you just got an Eintr. Did the request get sent? I don't know. I am not sure the code does either. Since this is only seen so far in 9vx I am guess it is a 9vx thing. I believe it should - at least that's the behavior we went for in v9fs. -eric
Re: [9fans] 9vx (is this the right list)? import issue
2009/9/22 erik quanstrom quans...@quanstro.net: if it's ambiguous, then the tag should indeed be put on hold, because there's no way to get it right. how do we prevent all tags from being on hold? there's no way to get that right, either. well, it's legal to send several flushes for the same tag, and it's also legal to send a flush of a non-existent tag, so if there's a case of ambiguity, we could resend the flush and drop the original rpc struct when the reply to that comes back (or the original reply comes back).
Re: [9fans] 9vx (is this the right list)? import issue
here is a 9p trace of the problem. See line 43. Topen and Tclunk go out with same tag. This is with a print in the rpc code as suggested by Russ. ron y Description: Binary data
Re: [9fans] 9vx (is this the right list)? import issue
here's one last caught in the act scenario. I have a print in mntralloc when I reuse something. The fid is being read and clunked. But the Tclunk goes out before the Rread comes in. Oops. Reuse 1 Tread tag 1 fid 454 offset 0 count 8192 Reuse 1 Tclunk tag 1 fid 454 reply Rread tag 1 count 8192 '#include u.h #include ../port/lib.h #include mem.h #includ' mnt: proc grep 78: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc/apic.c rep 0x168ed20 tag 1 fid 454 T120 R117 rp 1
Re: [9fans] 9vx (is this the right list)? import issue
On Tue Sep 22 17:22:08 EDT 2009, rminn...@gmail.com wrote: here's one last caught in the act scenario. I have a print in mntralloc when I reuse something. The fid is being read and clunked. But the Tclunk goes out before the Rread comes in. Oops. Reuse 1 Tread tag 1 fid 454 offset 0 count 8192 Reuse 1 Tclunk tag 1 fid 454 reply Rread tag 1 count 8192 '#include u.h #include ../port/lib.h #include mem.h #includ' mnt: proc grep 78: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc/apic.c rep 0x168ed20 tag 1 fid 454 T120 R117 rp 1 you're still triggering this with a note? - erik
Re: [9fans] 9vx (is this the right list)? import issue
On Tue, Sep 22, 2009 at 2:23 PM, erik quanstrom quans...@quanstro.net wrote: On Tue Sep 22 17:22:08 EDT 2009, rminn...@gmail.com wrote: here's one last caught in the act scenario. I have a print in mntralloc when I reuse something. The fid is being read and clunked. But the Tclunk goes out before the Rread comes in. Oops. Reuse 1 Tread tag 1 fid 454 offset 0 count 8192 Reuse 1 Tclunk tag 1 fid 454 reply Rread tag 1 count 8192 '#include u.h #include ../port/lib.h #include mem.h #includ' mnt: proc grep 78: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc/apic.c rep 0x168ed20 tag 1 fid 454 T120 R117 rp 1 you're still triggering this with a note? DEL in rio. ron
Re: [9fans] 9vx (is this the right list)? import issue
ron, this works for me but my symptoms were a little different than yours. before: mnt: proc cat 290: mismatch from #D/ssl/1/data /n/coraid/lib/unicode rep 0x7fcd8c04e190 tag 4 fid 1603 T120 R117 rp 4 after: WOOT! caught stale reply 6; type 117 note: the poor organization of this patch is geared toward keeping the diff small. tagallocd and freetag should be moved above mountmux. i didn't use russ' ed scripts because i was too lazy. - erik 9vx version ; ; diff -c devmnt.c devmnt.c~ devmnt.c:945,954 - devmnt.c~:945,951 void mountmux(Mnt *m, Mntrpc *r) { - int bad; Mntrpc **l, *q; - int tagallocd(int); - void freetag(int); lock(m-lk); l = m-queue; devmnt.c:977,992 - devmnt.c~:974,981 } l = q-list; } - bad = 1; - if(tagallocd(r-reply.tag)){ - freetag(r-reply.tag); - bad = 0; - } unlock(m-lk); - if(bad) - print(unexpected reply tag %ud; type %d\n, r-reply.tag, r-reply.type); - else - print(WOOT! caught stale reply %ud; type %d\n, r-reply.tag, r-reply.type); + print(unexpected reply tag %ud; type %d\n, r-reply.tag, r-reply.type); } /* devmnt.c:1054,1065 - devmnt.c~:1043,1048 return NOTAG; } - int - tagallocd(int t) - { - return mntalloc.tagmask[tTAGSHIFT] 1(tTAGMASK); - } - void freetag(int t) { devmnt.c:1125,1136 - devmnt.c~:1108,1116 if(mntalloc.nrpcfree = 10){ free(r-rpc); free(r); - if(r-done != 2) - freetag(r-request.tag); + freetag(r-request.tag); } else{ - if(r-done == 2) - r-request.tag = alloctag(); r-list = mntalloc.rpcfree; mntalloc.rpcfree = r; mntalloc.nrpcfree++; devmnt.c:1145,1151 - devmnt.c~:1125,1131 Mntrpc **l, *f; lock(m-lk); - r-done = 2; + r-done = 1; l = m-queue; for(f = *l; f; f = f-list) { plan 9 version ; diffy -c devmnt.c /n/dump/2009/0922/sys/src/9/port/devmnt.c:932,938 - devmnt.c:932,941 void mountmux(Mnt *m, Mntrpc *r) { + int bad; Mntrpc **l, *q; + int tagallocd(int); + void freetag(int); lock(m); l = m-queue; /n/dump/2009/0922/sys/src/9/port/devmnt.c:961,968 - devmnt.c:964,977 } l = q-list; } + bad = 1; + if(tagallocd(r-reply.tag)){ + freetag(r-reply.tag); + bad = 0; + } unlock(m); - print(unexpected reply tag %ud; type %d\n, r-reply.tag, r-reply.type); + if(bad) + print(unexpected reply tag %ud; type %d\n, r-reply.tag, r-reply.type); } /* /n/dump/2009/0922/sys/src/9/port/devmnt.c:1030,1035 - devmnt.c:1039,1050 return NOTAG; } + int + tagallocd(int t) + { + return mntalloc.tagmask[tTAGSHIFT] 1(tTAGMASK); + } + void freetag(int t) { /n/dump/2009/0922/sys/src/9/port/devmnt.c:1095,1103 - devmnt.c:1110,1121 if(mntalloc.nrpcfree = 10){ free(r-rpc); free(r); - freetag(r-request.tag); + if(r-done != 2) + freetag(r-request.tag); } else{ + if(r-done == 2) + r-request.tag = alloctag(); r-list = mntalloc.rpcfree; mntalloc.rpcfree = r; mntalloc.nrpcfree++; /n/dump/2009/0922/sys/src/9/port/devmnt.c:1112,1118 - devmnt.c:1130,1136 Mntrpc **l, *f; lock(m); - r-done = 1; + r-done = 2; l = m-queue; for(f = *l; f; f = f-list) { -- plan 9 version ; - diffy diffy -c devmnt.c /n/dump/2009/0922/sys/src/9/port/devmnt.c:932,938 - devmnt.c:932,941 void mountmux(Mnt *m, Mntrpc *r) { + int bad; Mntrpc **l, *q; + int tagallocd(int); + void freetag(int); lock(m); l = m-queue; /n/dump/2009/0922/sys/src/9/port/devmnt.c:961,968 - devmnt.c:964,979 } l = q-list; } + bad = 1; + if(tagallocd(r-reply.tag)){ + freetag(r-reply.tag); + bad = 0; + } unlock(m); - print(unexpected reply tag %ud; type %d\n, r-reply.tag, r-reply.type); + if(bad) + print(unexpected reply tag %ud; type %d\n, r-reply.tag, r-reply.type); + else + print(WOOT! caught stale reply %ud; type %d\n, r-reply.tag, r-reply.type); } /* /n/dump/2009/0922/sys/src/9/port/devmnt.c:1030,1035 - devmnt.c:1041,1052 return NOTAG; } + int + tagallocd(int t) + { + return mntalloc.tagmask[tTAGSHIFT] 1(tTAGMASK); + } + void freetag(int t) {
Re: [9fans] 9vx (is this the right list)? import issue
full versions in /n/sources/contrib/quanstro/devmnt.c /n/sources/contrib/quanstro/vx32devmnt.c - erik
Re: [9fans] 9vx (is this the right list)? import issue
On Tue Sep 22 23:12:27 EDT 2009, r...@swtch.com wrote: The extra tracking that has been proposed is unnecessary, and waiting for the Rflush doesn't make sense. The assumption is that the Rflush isn't ever going to arrive, because the connection is dead. what do you mean by dead? i/o to the same channel works fine. - erik
Re: [9fans] 9vx (is this the right list)? import issue
On Tuesday, September 22, 2009, erik quanstrom quans...@quanstro.net wrote: On Tue Sep 22 23:12:27 EDT 2009, r...@swtch.com wrote: The extra tracking that has been proposed is unnecessary, and waiting for the Rflush doesn't make sense. The assumption is that the Rflush isn't ever going to arrive, because the connection is dead. what do you mean by dead? i/o to the same channel works fine. I mean that the code as written is assuming that if a read or write errors out, it can only happen for one of two reasons: 1) there was an interrupt note, in which case strcmp(error, Eintr) == 0 2) there has been an error on the 9P connection, in which case strcmp(error, Eintr) != 0 and the connection will never work again. My suggestion is to enforce #2: if a non-interrupt error happens, mark the connection so that the kernel won't even try to use it again. Separately, you might investigate what error is happening that violates the assumption above. In 9vx, it is easy: case #1 happened but the error was spelled wrong. Russ
Re: [9fans] 9vx (is this the right list)? import issue
I mean that the code as written is assuming that if a read or write errors out, it can only happen for one of two reasons: 1) there was an interrupt note, in which case strcmp(error, Eintr) == 0 2) there has been an error on the 9P connection, in which case strcmp(error, Eintr) != 0 and the connection will never work again. My suggestion is to enforce #2: if a non-interrupt error happens, mark the connection so that the kernel won't even try to use it again. Separately, you might investigate what error is happening that violates the assumption above. In 9vx, it is easy: case #1 happened but the error was spelled wrong. how sure are we that 1 holds? couldn't there be other, legitimate and transient errors? could a user-delivered note sneak in and confuse the issue? the problem with my solution is that it could leak tags. i don't see this as a significant problem, but i could be wrong. i think the connection would need to be pretty broken for tags to be leaked. marking connections dead also adds tracking, but in a new place. it could have trouble if ever a transient error happens when strcmp(error, Eintr) == 0, which can happen in 9vx or dt. - erik