Re: [9fans] 9vx (is this the right list)? import issue

2009-09-23 Thread erik quanstrom
 So just need to fix oserrstr() or fix this in devip itself? I vote
 oserrstr, lucho votes fix this little bit of
 code.

how many other errors are lurking in osstrerror()?
there are lots of assumptions about the exact
errstrs.

- erik



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-23 Thread ron minnich
sent in via codereview but I missed a debug print I left in. Sorry. I
assume you can reject it so I'm sending the correct patch.

ron



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-23 Thread ron minnich
not having done this before, the reference is URL:
http://codereview.appspot.com/122046

ron



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-23 Thread erik quanstrom
On Wed Sep 23 14:54:47 EDT 2009, r...@swtch.com wrote:
  how sure are we that 1 holds?  couldn't there be other,
  legitimate and transient errors?  could a user-delivered
  note sneak in and confuse the issue?
 
 no.  at least not if the kernel is working properly.
 that's why i said devmnt should enforce the assumption.
 it's at most a couple lines of extra code,
 whereas the diff you posted was quite a bit longer.

it seems to be a big assumption that the whole ip stack
and the ethernet driver know the difference between
being interrupted before sending or queuing the packet
or after.

the comment in qbwrite seems to say that you can
get interrupted after the pkt has been queued.

my approach has the advantage of sidestepping this
problem.

i don't think 13 changed lines is unreasonable.
(without verbose debugging.)

- erik



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-23 Thread Russ Cox
On Wednesday, September 23, 2009, ron minnich rminn...@gmail.com wrote:
 not having done this before, the reference is URL:
 http://codereview.appspot.com/122046

Thanks, Ron.  Fix applied.

Russ



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread Russ Cox
On Monday, September 21, 2009, ron minnich rminn...@gmail.com wrote:
 If I am in 9vx and have imported a file system from somewhere, and do
 an ls, and get impatient and hit del, the import dies.

 term% grep full *.c
  grep: can't open *.c: '*.c' mount rpc error
 term% ls
 ls: .: clone failed
 term%

 mnt: proc grep 91: mismatch from /net/tcp/0/data
 /n/o/usr/rminnich/9k/bgp rep 0x93daf0 tag 1 fid 432 T110 R117 rp 1
 mnt: proc rc 93: mismatch from /net/tcp/0/data
 /n/o/usr/rminnich/9k/bgp rep 0x93daf0 tag 1 fid 432 T110 R121 rp 1

 I'm wondering if this is just a simple matter of the note going too
 far. I am going to look but figure somebody might immediately say ah
 ha! and have a fix.

I think it is.  In fact I think drawterm has the same bug
except in drawterm it is for some reason harder to trigger.
I have no aha! fix for you.

Russ



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread ron minnich
OK, a little more fooling around.

term% grep full *.c

(wait about a second, hit DEL)
grep: can't open apic.c: 'apic.c' './sys/src/9/pc/grep' does not exist
grep: can't open apm.c: 'apm.c' mount rpc error
grep: can't open archmp.c: 'archmp.c' mount rpc error
grep: can't open bios32.c: 'bios32.c' mount rpc error
grep: can't open cga.c: 'cga.c' mount rpc error

etc. in fact I get an error for each file in the directory.

I also get one of these for each file.
mnt: proc grep 90: mismatch from /net/tcp/0/data
/n/o/sys/src/9/pc/devlml.c rep 0x16d7010 tag 3 fid 533 T112 R111 rp 3
mnt: proc grep 90: mismatch from /net/tcp/0/data
/n/o/sys/src/9/pc/devlml.c rep 0x16d7010 tag 3 fid 533 T120 R113 rp 3
mnt: proc grep 90: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc rep
0x16d7010 tag 3 fid 524 T110 R121 rp 3
mnt: proc grep 90: mismatch from /net/tcp/0/data
/n/o/sys/src/9/pc/devmoipv6.c rep 0x16d7010 tag 3 fid 534 T112 R111 rp
3
mnt: proc grep 90: mismatch from /net/tcp/0/data
/n/o/sys/src/9/pc/devmoipv6.c rep 0x16d7010 tag 3 fid 534 T120 R113 rp
3
mnt: proc grep 90: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc rep
0x16d7010 tag 3 fid 524 T110 R121 rp 3
mnt: proc grep 90: mismatch from /net/tcp/0/data
/n/o/sys/src/9/pc/devrtc.c rep 0x16d7010 tag 3 fid 535 T112 R111 rp 3

so what is interesting is that the grep did not die. It keeps trying
to open files and keeps failing. And then, finally:

term% ls
ls: .: clone failed
term%

So, we see lots of guys with the same tag, with a T and R mismatch,
with T and R like open and clunk, with the apparent problem that they
are all using tag 3.

I'm wondering if there is not a problem with the way tags are flushed
when there is an interrupt.

As an experiment I commented out the free in the freetag code.

Better: now I just get this:
mnt: proc grep 78: mismatch from /net/tcp/0/data
/n/o/sys/src/9/pc/apic.c rep 0x1491c60 tag 2 fid 454 T120 R117 rp 2
after the DEL, TCLUNK gets an RREAD (which makes a sort of sense)

and the when I ls .

mnt: proc rc 80: mismatch from /net/tcp/0/data /n/o/sys/src/9/pc rep
0x1491c60 tag 2 fid 421 T110 R121 rp 2

Twalk for the ls gets the RCLUNK

And from that point on it's all over.


Interesting.

ron



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread ron minnich
OK, I did this in mntralloc, in the code path in which we reuse the
rpc from the mtnalloc.rpcfree: I just simply always allocated a new
tag, even if we found an old rpc which was nominally free: I always
allocated a new tag, not reusing the old tag.

That fixed the problem.

So, basically, the way I see it is, grep proc gets an interrupt,
kernel will try to flush RPCs which we initiated, we drop the (we
think) flushed rpc struct onto the rpcfree list, but the reply from
the server is still in flight. We reuse the rpc from rpcfree list, we
send out a new T, with the same tag as the previous one which we think
we flushed, we get the reply from the earlier RPC, tags match, R does
not match T, bad day.

This is the same code as is in plan 9.

There's harder and harder ways to deal with this, I have some ideas
but I expect some of the folks on this list to have better ones. The
simplest one that would probably work is to avoid reusing tags quite
so quickly. Free tags, yes, but use a counter to indicate next tag to
use, so that it's a relatively long time before a tag for a mount
point is reused again. So, e.g., we free tag 3, but the next tag we
allocate is tag 4, and so on. Sooner or later we'll use tag 3 again,
likely long after any messages in flight have been retired. (weirdly
enough, I saw a trick like this used in hardware many years ago ...)

That may not be good enough, not sure. It's definitely pretty easy to
implement.

What I observed is that when this is all working, tag 3 is the only
tag ever used!

ron



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread roger peppe
2009/9/22 ron minnich rminn...@gmail.com:
 So, basically, the way I see it is, grep proc gets an interrupt,
 kernel will try to flush RPCs which we initiated, we drop the (we
 think) flushed rpc struct onto the rpcfree list, but the reply from
 the server is still in flight. We reuse the rpc from rpcfree list, we
 send out a new T, with the same tag as the previous one which we think
 we flushed, we get the reply from the earlier RPC, tags match, R does
 not match T, bad day.

surely the correct way to go about this (caveat: i haven't looked at the code)
is to drop the rpc struct onto the rpcfree list only when the Rflush is
received?

from experience with writing heavily used 9p services, getting flush properly
right is a bitch.



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread ron minnich
On Tue, Sep 22, 2009 at 11:35 AM, roger peppe rogpe...@gmail.com wrote:

 surely the correct way to go about this (caveat: i haven't looked at the code)
 is to drop the rpc struct onto the rpcfree list only when the Rflush is
 received?

you just got an Eintr. Did the request get sent?

I don't know. I am not sure the code does either. Since this is only
seen so far in 9vx I am guess it is a 9vx thing.

ron



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread erik quanstrom
 I don't know. I am not sure the code does either. Since this is only
 seen so far in 9vx I am guess it is a 9vx thing.

i see this now than then on regular plan 9

Tue Sep  1 18:51:15: unexpected reply tag 51; type 109
Tue Sep  1 18:51:15: unexpected reply tag 16; type 109
Tue Sep  1 18:51:15: unexpected reply tag 39; type 109
Tue Sep  1 18:51:15: unexpected reply tag 51; type 109
Tue Sep  1 18:51:16: unexpected reply tag 51; type 109
Tue Sep  1 18:51:17: unexpected reply tag 39; type 109

- erik



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread roger peppe
2009/9/22 ron minnich rminn...@gmail.com:
 On Tue, Sep 22, 2009 at 11:35 AM, roger peppe rogpe...@gmail.com wrote:

 surely the correct way to go about this (caveat: i haven't looked at the 
 code)
 is to drop the rpc struct onto the rpcfree list only when the Rflush is
 received?

 you just got an Eintr. Did the request get sent?

doesn't Eintr mean that the write did not complete?

if it's ambiguous, then the tag should indeed be put on hold,
because there's no way to get it right.

it would be useful to see a log of the actual 9p messages.



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread erik quanstrom
 if it's ambiguous, then the tag should indeed be put on hold,
 because there's no way to get it right.

how do we prevent all tags from being on hold?
there's no way to get that right, either.

- erik



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread Eric Van Hensbergen

On Sep 22, 2009, at 1:47 PM, ron minnich wrote:

On Tue, Sep 22, 2009 at 11:35 AM, roger peppe rogpe...@gmail.com  
wrote:


surely the correct way to go about this (caveat: i haven't looked  
at the code)
is to drop the rpc struct onto the rpcfree list only when the  
Rflush is

received?


you just got an Eintr. Did the request get sent?

I don't know. I am not sure the code does either. Since this is only
seen so far in 9vx I am guess it is a 9vx thing.



I believe it should - at least that's the behavior we went for in v9fs.

   -eric




Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread roger peppe
2009/9/22 erik quanstrom quans...@quanstro.net:
 if it's ambiguous, then the tag should indeed be put on hold,
 because there's no way to get it right.

 how do we prevent all tags from being on hold?
 there's no way to get that right, either.

well, it's legal to send several flushes for the same tag,
and it's also legal to send a flush of a non-existent tag,
so if there's a case of ambiguity, we could resend the flush
and drop the original rpc struct when the reply to that comes
back (or the original reply comes back).



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread ron minnich
here is a 9p trace of the problem. See line 43. Topen and Tclunk go
out with same tag. This is with a print in the rpc code as suggested
by Russ.

ron


y
Description: Binary data


Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread ron minnich
here's one last caught in the act scenario. I have a print in
mntralloc when I reuse something.

The fid is being read and clunked. But the Tclunk goes out before the
Rread comes in. Oops.


Reuse 1
Tread tag 1 fid 454 offset 0 count 8192
Reuse 1
Tclunk tag 1 fid 454
reply Rread tag 1 count 8192 '#include u.h
#include ../port/lib.h
#include mem.h
#includ'
mnt: proc grep 78: mismatch from /net/tcp/0/data
/n/o/sys/src/9/pc/apic.c rep 0x168ed20 tag 1 fid 454 T120 R117 rp 1



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread erik quanstrom
On Tue Sep 22 17:22:08 EDT 2009, rminn...@gmail.com wrote:
 here's one last caught in the act scenario. I have a print in
 mntralloc when I reuse something.
 
 The fid is being read and clunked. But the Tclunk goes out before the
 Rread comes in. Oops.
 
 
 Reuse 1
 Tread tag 1 fid 454 offset 0 count 8192
 Reuse 1
 Tclunk tag 1 fid 454
 reply Rread tag 1 count 8192 '#include u.h
 #include ../port/lib.h
 #include mem.h
 #includ'
 mnt: proc grep 78: mismatch from /net/tcp/0/data
 /n/o/sys/src/9/pc/apic.c rep 0x168ed20 tag 1 fid 454 T120 R117 rp 1

you're still triggering this with a note?

- erik



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread ron minnich
On Tue, Sep 22, 2009 at 2:23 PM, erik quanstrom quans...@quanstro.net wrote:
 On Tue Sep 22 17:22:08 EDT 2009, rminn...@gmail.com wrote:
 here's one last caught in the act scenario. I have a print in
 mntralloc when I reuse something.

 The fid is being read and clunked. But the Tclunk goes out before the
 Rread comes in. Oops.


 Reuse 1
 Tread tag 1 fid 454 offset 0 count 8192
 Reuse 1
 Tclunk tag 1 fid 454
 reply Rread tag 1 count 8192 '#include u.h
 #include ../port/lib.h
 #include mem.h
 #includ'
 mnt: proc grep 78: mismatch from /net/tcp/0/data
 /n/o/sys/src/9/pc/apic.c rep 0x168ed20 tag 1 fid 454 T120 R117 rp 1

 you're still triggering this with a note?

DEL in rio.

ron



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread erik quanstrom
ron,

this works for me but my symptoms were a little different than yours.
before:
mnt: proc cat 290: mismatch from #D/ssl/1/data /n/coraid/lib/unicode 
rep 0x7fcd8c04e190 tag 4 fid 1603 T120 R117 rp 4
after:
WOOT! caught stale reply 6; type 117

note: the poor organization of this patch is geared toward keeping the
diff small. tagallocd and freetag should be moved above mountmux.

i didn't use russ' ed scripts because i was too lazy.

- erik

9vx version

; ; diff -c devmnt.c devmnt.c~
devmnt.c:945,954 - devmnt.c~:945,951
  void
  mountmux(Mnt *m, Mntrpc *r)
  {
-   int bad;
Mntrpc **l, *q;
-   int tagallocd(int);
-   void freetag(int);
  
lock(m-lk);
l = m-queue;
devmnt.c:977,992 - devmnt.c~:974,981
}
l = q-list;
}
-   bad = 1;
-   if(tagallocd(r-reply.tag)){
-   freetag(r-reply.tag);
-   bad = 0;
-   }
unlock(m-lk);
-   if(bad)
-   print(unexpected reply tag %ud; type %d\n, r-reply.tag, 
r-reply.type);
-   else
-   print(WOOT! caught stale reply %ud; type %d\n, r-reply.tag, 
r-reply.type);
+   print(unexpected reply tag %ud; type %d\n, r-reply.tag, 
r-reply.type);
  }
  
  /*
devmnt.c:1054,1065 - devmnt.c~:1043,1048
return NOTAG;
  }
  
- int
- tagallocd(int t)
- {
-   return mntalloc.tagmask[tTAGSHIFT]  1(tTAGMASK);
- }
- 
  void
  freetag(int t)
  {
devmnt.c:1125,1136 - devmnt.c~:1108,1116
if(mntalloc.nrpcfree = 10){
free(r-rpc);
free(r);
-   if(r-done != 2)
-   freetag(r-request.tag);
+   freetag(r-request.tag);
}
else{
-   if(r-done == 2)
-   r-request.tag = alloctag();
r-list = mntalloc.rpcfree;
mntalloc.rpcfree = r;
mntalloc.nrpcfree++;
devmnt.c:1145,1151 - devmnt.c~:1125,1131
Mntrpc **l, *f;
  
lock(m-lk);
-   r-done = 2;
+   r-done = 1;
  
l = m-queue;
for(f = *l; f; f = f-list) {

plan 9 version

; diffy -c devmnt.c
/n/dump/2009/0922/sys/src/9/port/devmnt.c:932,938 - devmnt.c:932,941
  void
  mountmux(Mnt *m, Mntrpc *r)
  {
+   int bad;
Mntrpc **l, *q;
+   int tagallocd(int);
+   void freetag(int);
  
lock(m);
l = m-queue;
/n/dump/2009/0922/sys/src/9/port/devmnt.c:961,968 - devmnt.c:964,977
}
l = q-list;
}
+   bad = 1;
+   if(tagallocd(r-reply.tag)){
+   freetag(r-reply.tag);
+   bad = 0;
+   }
unlock(m);
-   print(unexpected reply tag %ud; type %d\n, r-reply.tag, 
r-reply.type);
+   if(bad)
+   print(unexpected reply tag %ud; type %d\n, r-reply.tag, 
r-reply.type);
  }
  
  /*
/n/dump/2009/0922/sys/src/9/port/devmnt.c:1030,1035 - devmnt.c:1039,1050
return NOTAG;
  }
  
+ int
+ tagallocd(int t)
+ {
+   return mntalloc.tagmask[tTAGSHIFT]  1(tTAGMASK);
+ }
+ 
  void
  freetag(int t)
  {
/n/dump/2009/0922/sys/src/9/port/devmnt.c:1095,1103 - devmnt.c:1110,1121
if(mntalloc.nrpcfree = 10){
free(r-rpc);
free(r);
-   freetag(r-request.tag);
+   if(r-done != 2)
+   freetag(r-request.tag);
}
else{
+   if(r-done == 2)
+   r-request.tag = alloctag();
r-list = mntalloc.rpcfree;
mntalloc.rpcfree = r;
mntalloc.nrpcfree++;
/n/dump/2009/0922/sys/src/9/port/devmnt.c:1112,1118 - devmnt.c:1130,1136
Mntrpc **l, *f;
  
lock(m);
-   r-done = 1;
+   r-done = 2;
  
l = m-queue;
for(f = *l; f; f = f-list) {


--
plan 9 version

; - diffy
diffy -c devmnt.c
/n/dump/2009/0922/sys/src/9/port/devmnt.c:932,938 - devmnt.c:932,941
  void
  mountmux(Mnt *m, Mntrpc *r)
  {
+   int bad;
Mntrpc **l, *q;
+   int tagallocd(int);
+   void freetag(int);
  
lock(m);
l = m-queue;
/n/dump/2009/0922/sys/src/9/port/devmnt.c:961,968 - devmnt.c:964,979
}
l = q-list;
}
+   bad = 1;
+   if(tagallocd(r-reply.tag)){
+   freetag(r-reply.tag);
+   bad = 0;
+   }
unlock(m);
-   print(unexpected reply tag %ud; type %d\n, r-reply.tag, 
r-reply.type);
+   if(bad)
+   print(unexpected reply tag %ud; type %d\n, r-reply.tag, 
r-reply.type);
+   else
+   print(WOOT! caught stale reply %ud; type %d\n, r-reply.tag, 
r-reply.type);
  }
  
  /*
/n/dump/2009/0922/sys/src/9/port/devmnt.c:1030,1035 - devmnt.c:1041,1052
return NOTAG;
  }
  
+ int
+ tagallocd(int t)
+ {
+   return mntalloc.tagmask[tTAGSHIFT]  1(tTAGMASK);
+ }
+ 
  void
  freetag(int t)
  {

Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread erik quanstrom
full versions in
/n/sources/contrib/quanstro/devmnt.c
/n/sources/contrib/quanstro/vx32devmnt.c

- erik



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread erik quanstrom
On Tue Sep 22 23:12:27 EDT 2009, r...@swtch.com wrote:
 The extra tracking that has been proposed is unnecessary,
 and waiting for the Rflush doesn't make sense.  The assumption
 is that the Rflush isn't ever going to arrive, because the connection
 is dead.

what do you mean by dead?  i/o to the same channel works
fine.

- erik



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread Russ Cox
On Tuesday, September 22, 2009, erik quanstrom quans...@quanstro.net wrote:
 On Tue Sep 22 23:12:27 EDT 2009, r...@swtch.com wrote:
 The extra tracking that has been proposed is unnecessary,
 and waiting for the Rflush doesn't make sense.  The assumption
 is that the Rflush isn't ever going to arrive, because the connection
 is dead.

 what do you mean by dead?  i/o to the same channel works
 fine.

I mean that the code as written is assuming that if a read or write
errors out, it can only happen for one of two reasons:
1) there was an interrupt note, in which case strcmp(error, Eintr) == 0
2) there has been an error on the 9P connection, in which case
   strcmp(error, Eintr) != 0 and the connection will never work again.

My suggestion is to enforce #2: if a non-interrupt error happens,
mark the connection so that the kernel won't even try to use it
again.

Separately, you might investigate what error is happening that
violates the assumption above.  In 9vx, it is easy: case #1 happened
but the error was spelled wrong.

Russ



Re: [9fans] 9vx (is this the right list)? import issue

2009-09-22 Thread erik quanstrom
 I mean that the code as written is assuming that if a read or write
 errors out, it can only happen for one of two reasons:
 1) there was an interrupt note, in which case strcmp(error, Eintr) == 0
 2) there has been an error on the 9P connection, in which case
strcmp(error, Eintr) != 0 and the connection will never work again.
 
 My suggestion is to enforce #2: if a non-interrupt error happens,
 mark the connection so that the kernel won't even try to use it
 again.
 
 Separately, you might investigate what error is happening that
 violates the assumption above.  In 9vx, it is easy: case #1 happened
 but the error was spelled wrong.

how sure are we that 1 holds?  couldn't there be other,
legitimate and transient errors?  could a user-delivered
note sneak in and confuse the issue?

the problem with my solution is that it could leak tags.
i don't see this as a significant problem, but i could be
wrong.  i think the connection would need to be pretty
broken for tags to be leaked.

marking connections dead also adds tracking, but in
a new place.  it could have trouble if ever a transient error
happens when strcmp(error, Eintr) == 0, which can
happen in 9vx or dt.

- erik