--- Dave Grothe <[EMAIL PROTECTED]> wrote:

> -- Now I remember -- I was using poll as you are and I was over-managing
> the "read bits", thinking that I knew when to set them and when to clear
> them.  That was what got the program out of phase with itself.  The fix
> was simply to always keep the "read bits" set so that a returned packet
> could arrive at any time and have its time stamp evaluated.
I don't think that this is getting me, because I just leave the poll bits
set.

This brings me to another question though...  I was looking through the
poll code and it looks to me like lis_poll_bits() returns all of the
possible flags, read and write, no matter what events were requested.
What I do is have two separate processes reading and writing at the same
time, both potentially calling poll() at the same time.  The write side
has POLLOUT|POLLPRI and the read side has POLLIN|POLLPRI.  Now when
lis_poll_bits() returns to lis_poll_2_1(), lis_poll_2_1() just returns
whatever was returned from lis_poll_bits().  The comment in the code says
that any non-zero return will wake up all of the waiters.  Does this mean
that if lis_poll_bits() returns just POLLIN, both the write poll and the
read poll will be woken up, and that I need to explicitly check revents
against events?  Or is this masking of revents hidden inside the Linux
poll() system call somehow, so that only the thread which requested the
returned events is woken?  There doesn't seem to be any way to pass the
requested events to lis_poll_bits(), so it's a bit of a mystery to me how
this works (mostly because I'm too lazy to go digging through the Linux
poll() code....).  In the meantime I just check revents against events
explicitly on each side; there is a quick sketch of what I mean further
down in this message.

> One of the things that I put into LiS to assist with this was the spin
> lock contention and semaphore contention code.  Check out "streams -L"
> and "streams -T" in conjunction with "streams -D0x28".

I did have a look at this and didn't see anything massively out of the
ordinary.  I think the highest that delay time got was around the 10ms
range, nothing that comes close to explaining the 200ms round trip delays
that I am getting for 40-byte frames on a 155Mbps link.

> I also found that there was a highly significant difference between
> Red Hat 9 and a stock 2.4 kernel of the same nominal version.  RH 9 has
> the 2.6 O(1) process scheduler back-ported to it.  It makes a huge
> difference if you have a number of threads active.

I think that I am using the stock RH9 kernel, 2.4.20-8, but some of our
customers are using Enterprise Linux 3.0 (or 3.1, I can't remember).  Do
you know if the EL series has this 2.6 O(1) process scheduler?  I also
have a 2.4.23 kernel that I downloaded from kernel.org to use with kgdb,
but I have not tried that yet to see if there is a difference in
performance.

The test that I am doing is just one connection trying to write and read
as fast as it can.  There are only two threads, one reading, one writing,
and I am pushing 40-byte frames through it.  On my creaky old dual PIII
1GHz Dell I am getting around 86000 frames per second in UP mode and
43000 frames per second in SMP mode, with the same CPU utilization in
both cases.  In the UP case the total round trip delay is less than 30ms
(I stole your semaphore histogram idea to get a histogram of the rtt).
In the SMP case the total round trip delay is usually as high as
100-200ms, and in some cases, like if I type "sync", it can be as high as
600-900ms.  Yikes!

I measure the rtt by using gettimeofday() on Linux to get a usec
timestamp and just stick that in the first word of data on the putmsg
side.  On the getmsg side I do a getmsg(), then gettimeofday(), and
subtract the value in the first word from the new timestamp.  There
aren't any issues with gettimeofday() returning bogus data on SMP, are
there?
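For reference, here is roughly how I take the measurement.  The real code
sticks just the usec value in the first word; in this sketch I copy the
whole struct timeval into the front of the frame so the subtraction is
unambiguous, and fd, FRAMELEN and the error handling are simplified (the
real test drives the two ends from two separate threads):

    #include <string.h>
    #include <stropts.h>
    #include <sys/time.h>

    #define FRAMELEN 40                 /* payload size used in the test */

    /* putmsg side: stamp the send time into the front of the frame. */
    static int
    send_frame(int fd, char *buf)
    {
        struct strbuf  data;
        struct timeval now;

        gettimeofday(&now, NULL);
        memcpy(buf, &now, sizeof(now)); /* timestamp rides in the data */

        data.buf = buf;
        data.len = FRAMELEN;
        return putmsg(fd, NULL, &data, 0);
    }

    /* getmsg side: returns the round trip time in usec, or -1 on error. */
    static long
    recv_frame_rtt(int fd, char *buf)
    {
        struct strbuf  data;
        struct timeval then, now;
        int            flags = 0;

        data.buf    = buf;
        data.maxlen = FRAMELEN;
        if (getmsg(fd, NULL, &data, &flags) < 0)
            return -1;

        gettimeofday(&now, NULL);
        memcpy(&then, buf, sizeof(then));
        return (now.tv_sec - then.tv_sec) * 1000000L
               + (now.tv_usec - then.tv_usec);
    }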
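And going back to the poll question at the top of this reply: whatever
the answer turns out to be, the defensive check I mean on my side is
cheap.  Roughly (user-space sketch, read thread only; the write thread is
the mirror image with POLLOUT):

    #include <poll.h>

    /* Ask only for the read-side events and act only on the bits that
     * actually come back in revents, rather than assuming any wakeup
     * means there is something to read. */
    static int
    wait_readable(int fd)
    {
        struct pollfd pfd;

        pfd.fd      = fd;
        pfd.events  = POLLIN | POLLPRI;
        pfd.revents = 0;

        if (poll(&pfd, 1, -1) < 0)
            return -1;                  /* error (EINTR etc.) */

        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL))
            return -1;                  /* stream error or hangup */

        if (pfd.revents & (POLLIN | POLLPRI))
            return 1;                   /* safe to go call getmsg() */

        return 0;                       /* woken up, but not for us */
    }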
> I played with oprofile but found it not to be the correct tool for
> diagnosing latencies.  It only tells you where the CPU *is* executing,
> not where it *isn't* executing.  If CPU bottlenecks are your problem
> oprofile is a really cool tool.

I have been wanting to check out oprofile, but just have not yet had the
time.  I am hoping to do that sometime soon.

> Once I was able to get lock and semaphore contention to a minimum I was
> able to get some real pipelining going among the 4 CPUs.  This allowed
> me to tune the number of messages queued before waking up another LiS
> queue runner thread.  I found in my testing that 12-13 messages seemed
> to be about right, so I went for the Fibonacci 13.
>
> Oh, another tip: in an SMP system you are best off if you can run your
> driver with qlock=0, that is, no locking of your queue by LiS prior to
> driver entry.  This is what allows your put procedure to enqueue a
> message on one CPU at the same time that the service procedure is
> operating on another message on another CPU.  This is what gets the
> pipeline going.

This brings up yet another issue.  I set my driver to use qlock=0 and saw
no real improvement.  This is because I modified my driver to always
write the data out from the wput routine and only queue data for the wsrv
routine if there is write-side flow control.  Since my line rate is
really fast, I don't get write-side flow control very often, so it's
actually very rare that my wput and wsrv routines would be running
simultaneously; my wsrv routine runs fewer than 1000 times out of
10,000,000 frames sent.  On the read side it's the same thing: there is
no rput routine in a bottom level driver, so rsrv should have a clear
shot.
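Just so the shape is clear, the write side of the driver now looks
roughly like this.  This is only a sketch: atmii_tx_room() and
atmii_tx_frame() stand in for my real hardware interface and are not LiS
or DDI calls:

    static int
    atmiiwput(queue_t *q, mblk_t *mp)
    {
        /* Fast path: nothing already queued and the hardware has room,
         * so send the frame straight from the put routine. */
        if (q->q_first == NULL && atmii_tx_room(q)) {
            atmii_tx_frame(q, mp);
            return 0;
        }

        putq(q, mp);            /* flow controlled: let wsrv drain it */
        return 0;
    }

    static int
    atmiiwsrv(queue_t *q)
    {
        mblk_t *mp;

        while ((mp = getq(q)) != NULL) {
            if (!atmii_tx_room(q)) {
                putbq(q, mp);   /* still no room; retry when re-enabled */
                break;
            }
            atmii_tx_frame(q, mp);
        }
        return 0;
    }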
What I did do this morning that made a massive difference, a 50% increase
in throughput, was to change the stream head to have qlock=0.  I was a
little surprised that I had to do this, because looking at
lis_alloc_stdata() it looks like it tries to do just that:

    /*
     * Allocate and initialize an stdata structure.  Do not get any of
     * the locks, leave that up to the caller.
     */
    static stdata_t *
    lis_alloc_stdata(void)
    {
        queue_t  *q = lis_allocq("stream-head");
        stdata_t *head ;

        if (!q)
        {
            if (LIS_DEBUG_OPEN)
                printk("lis_alloc_stdata() - "
                       "failed to allocate queues for stream head\n");
            return NULL;
        }

        if (lis_set_q_sync(q, LIS_QLOCK_NONE) < 0)
        {
            lis_freeq(q) ;
            return(NULL) ;
        }

        head = (stdata_t*) LIS_HEAD_ALLOC(sizeof(stdata_t), "stream-head ") ;
        if (head == NULL)
        {
            lis_freeq(q) ;
            return(NULL) ;
        }

I also added a debugging statement to the bottom of lis_set_q_sync():

        q->q_qlock_option = qlock_option ;
        q->q_other->q_qlock_option = qlock_option ;
        printk("lis_set_q_sync q=0x%p q->qsp=0x%p otherq=0x%p "
               "otherq->qsp=0x%p option=%d\n",
               q, q->q_qsp, q->q_other, q->q_other->q_qsp, qlock_option);
        return(0) ;
    }

Here is the output when I open my driver:

    Oct 29 13:48:45 ramon kernel: LiS-RunQ-2.18.0 running on CPU 0 pid=4990
    Oct 29 13:48:45 ramon kernel: LiS-RunQ-2.18.0 running on CPU 1 pid=4991
    Oct 29 13:48:45 ramon kernel: STREAMS driver "atmii" registered, major 252
    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60c80 option=1
    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60c80 q->qsp=0xcd7f3e80 otherq=0xcdc60d5c otherq->qsp=0xcd7f3d80 option=1
    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60c80 option=0
    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60c80 q->qsp=0x00000000 otherq=0xcdc60d5c otherq->qsp=0x00000000 option=0

==== Here it had just set the stream head to be option 0 ====

    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60a80 option=1
    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60a80 q->qsp=0xcd7f3e80 otherq=0xcdc60b5c otherq->qsp=0xcd7f3d80 option=1
    Oct 29 13:48:45 ramon kernel: atmiiopen: take 10. readq=0xcdc60a80 writeq=0xcdc60b5c headrq=0xcdc60c80 headwq=0xcdc60d5c
    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60d5c option=1
    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60d5c q->qsp=0xcd7f3c80 otherq=0xcdc60c80 otherq->qsp=0xcd7f3b80 option=1

==== But here it changes it back to option 1.... ====

    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60b5c option=0
    Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60b5c q->qsp=0x00000000 otherq=0xcdc60a80 otherq->qsp=0x00000000 option=0

So is the stream head supposed to be option 0 or option 1?  For me it
works _much_ better as option 0, which I forced by exporting
lis_set_q_sync() and calling it explicitly on the stream head queue from
my wput routine.  Doing it from the open routine didn't work, so it must
get set back to option 1 after the open() routine returns.

This brings me to another question that I had about the queue locking
documentation in:

file://localhost/C:/tmp/LiS/LiS-2.18.0.B/htdocs/config.html#QueueLockingSpecification

It says:

===========
The utility of using locking style 0 has to do with achieving higher
throughput in multiple CPU environments.  If all put and service
procedures can be entered simultaneously then the put procedure can be
calling putq() on one CPU while the service procedure is calling getq()
and processing messages on another, setting up a pipeline execution
effect.  Experiments have shown significant throughput improvements when
this style of locking can be used.

In such a circumstance the driver service procedure needs to maintain
some kind of a "running" flag, manipulated under spin lock control, to
assure single threaded execution of the messages removed from the queue.
Such flag manipulation is good practice even though LiS will not
naturally tend to enter a driver's service procedure for the same queue
on multiple CPUs simultaneously.
===========
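For what it's worth, I read that last paragraph as suggesting something
along these lines in the read service routine.  This is only a sketch:
atmii_priv_t, its spin lock and the rsrv_running flag are driver-private
things I made up for the example, not LiS facilities:

    static int
    atmiirsrv(queue_t *q)
    {
        atmii_priv_t *priv = (atmii_priv_t *) q->q_ptr;
        unsigned long flags;
        mblk_t       *mp;

        /* The "running" flag from the documentation: make sure only one
         * CPU drains this queue at a time. */
        spin_lock_irqsave(&priv->lock, flags);
        if (priv->rsrv_running) {
            spin_unlock_irqrestore(&priv->lock, flags);
            return 0;
        }
        priv->rsrv_running = 1;
        spin_unlock_irqrestore(&priv->lock, flags);

        while ((mp = getq(q)) != NULL) {
            if (!canput(q->q_next)) {
                putbq(q, mp);           /* stream head is flow controlled */
                break;
            }
            putnext(q, mp);
        }

        spin_lock_irqsave(&priv->lock, flags);
        priv->rsrv_running = 0;
        spin_unlock_irqrestore(&priv->lock, flags);
        return 0;
    }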
Is this true????  From looking at the code, lis_putq() and lis_getq() are
both surrounded by LIS_QISRLOCK() macros, which are supposed to be the
irqsave() versions of the read/write lock.  Why do you need another spin
lock to protect getq()/putq()?  Also, how are you going to have the same
service routine running on two different CPUs at the same time?  That
implies to me that it would be possible for a queue to be on the run
queue twice, and from looking at lis_run_queues() that does not look
possible to me, even on SMP machines.

This brings up another point about the queue locking.  I think that we
need to improve the documentation for it.  The documentation is pretty
misleading in that the names are similar to locking schemes on other OSs
(HPUX and Solaris) which also provide no locking, per-queue locking,
queue-pair locking or "other".  However, particularly in the QPAIR case,
the locking does not behave like the Solaris or HPUX versions.

For example, if you do an ioctl from user space to a driver which
processes the ioctl in its wput routine, through a module which is set
for QPAIR locking (the module just passes the ioctl through to the driver
via its wput routine and back to the stream head via its rput routine),
you get a different call stack on Linux than you do on either Solaris or
HPUX.  To test this I put debugging messages in my QPAIR module (sscop)
at the start and end of wput and rput, and at the start and end of the
driver (atmii) wput.

Solaris 8 Sparc 64 bit (QPAIR synchronization in SSCOP, no
synchronization in ATMII):
==============
Oct 4 11:05:18 gravitron a_sscop: [ID 963018 kern.notice] sscopwput: ioctl doing putnext q=0x30000e58d20 pm=0x3000007f6c0
Oct 4 11:05:18 gravitron atmii: [ID 682503 kern.notice] atmiiioctl: about to qreply q=0x30000ede3a8 pm=0x3000007f6c0
Oct 4 11:05:18 gravitron atmii: [ID 997160 kern.notice] atmiiioctl: back from qreply q=0x30000ede3a8 pm=0x3000007f6c0
Oct 4 11:05:18 gravitron a_sscop: [ID 801744 kern.notice] sscopwput: ioctl back from putnext returning. q=0x30000e58d20 pm=0x3000007f6c0
Oct 4 11:05:18 gravitron a_sscop: [ID 609763 kern.notice] sscoprput: ioctl doing putnext q=0x30000e58c40 pm=0x3000007f6c0
Oct 4 11:05:18 gravitron a_sscop: [ID 760784 kern.notice] sscoprput: ioctl back from putnext returning. q=0x30000e58c40 pm=0x3000007f6c0

The same code on Linux with LiS:
==============
Oct 4 15:52:13 STATIC kernel: sscopwput: ioctl doing putnext q=0xf77ac6c0 pm=0xf7970100
Oct 4 15:52:13 STATIC kernel: atmiiioctl: about to qreply q=0xf77ac440 pm=0xf7970100
Oct 4 15:52:13 STATIC kernel: sscoprput: ioctl doing putnext q=0xf77ac580 pm=0xf7970100
Oct 4 15:52:13 STATIC kernel: sscoprput: ioctl back from putnext returning. q=0xf77ac580 pm=0xf7970100
Oct 4 15:52:13 STATIC kernel: atmiiioctl: back from qreply q=0xf77ac440 pm=0xf7970100
Oct 4 15:52:13 STATIC kernel: sscopwput: ioctl back from putnext returning. q=0xf77ac6c0 pm=0xf7970100

As you can see, in the Solaris case the sscopwput routine returns from
the putnext and finishes running before the sscoprput routine is entered.
In LiS, the sscoprput routine runs to completion before sscopwput returns
from the putnext and finishes.

So I think that the thing that needs to be pointed out more forcefully is
that QPAIR only protects _two different threads_ from entering the two
queues' put and service routines at the same time.  This is a key
distinction between LiS QPAIR locking and Solaris.  In fact a single
thread _can_ enter the read and write put routines at the same time if an
external driver or module turns the message around with qreply(), so that
the put routine is entered on the same thread.  This is not true on
Solaris or HPUX.
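Just to make the failure mode concrete, the shape of code that gets
bitten is roughly this (hypothetical module state, not the real sscop
code):

    static int
    sscopwput(queue_t *q, mblk_t *mp)
    {
        sscop_state_t *sp = (sscop_state_t *) q->q_ptr;  /* hypothetical */

        if (mp->b_datap->db_type == M_IOCTL) {
            sp->ioctl_pending = 1;
            /* On LiS, if the driver below turns this around with
             * qreply(), sscoprput() runs nested inside this putnext()
             * call on the *same* thread; QPAIR only keeps other threads
             * out of the queue pair. */
            putnext(q, mp);
            /* So by the time putnext() returns here, sscoprput() may
             * already have run and cleared ioctl_pending.  Under Solaris
             * or HPUX QPAIR, the rput side would have been deferred
             * until this routine returned. */
            return 0;
        }

        putnext(q, mp);         /* everything else just passes through */
        return 0;
    }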
This was the crux of the QPAIR problem that I had mentioned a month or so
ago...  This needs to be documented more explicitly so that it does not
bite other people.  It took me two weeks of extensive debugging to figure
it out, and I can't believe that I will be the only person fooled by it,
particularly if the developer is coming from a Solaris or HPUX
background.

Ideally I'd like to see the QPAIR locking implemented the same way as it
is on Solaris; it is much more useful that way.  However, I know that
that is going to be a lot of work.  This is what I was talking about way
back when you were first talking about introducing this, and I mentioned
all the work that went into getting the Solaris sync queues right.  That
is how Solaris does it: if it cannot enter the module's perimeter, it
puts the message on a sync queue and lets the module finish so that it
can re-enter the perimeter later.  LiS would have to do something
similar, deferring entry to the put routine after a qreply() until the
other put routine had exited.  Not fun, and I know that your appetite for
big LiS projects is not very strong.  I'm tempted to go for it, but I
just don't have time, what with my day job and all...

So anyways, thanks for your time and quick response to my last questions.

dan

=====
Dan Gora
Software Engineer
Adax, Inc.
Tel: +55 12-3845-3572
email: [EMAIL PROTECTED]

_______________________________________________
Linux-streams mailing list
[EMAIL PROTECTED]
http://gsyc.escet.urjc.es/mailman/listinfo/linux-streams