--- Dave Grothe <[EMAIL PROTECTED]> wrote:

> -- Now I remember -- I was using poll as you are and I was
> over-managing the "read bits", thinking that I knew when to set them
> and when to clear them.  That was what got the program out of phase
> with itself.  The fix was simply to always keep the "read bits" set
> so that a returned packet could arrive at any time and have its time
> stamp evaluated.

I don't think that this is what is getting me, because I just leave
the poll bits set.  It brings me to another question, though.  I was
looking through the poll code, and it looks to me like
lis_poll_bits() returns all of the possible flags, read and write, no
matter what events were requested.  What I do is have two separate
processes, one reading and one writing, both of which may call poll()
at the same time.  The write side has POLLOUT|POLLPRI and the read
side has POLLIN|POLLPRI.  Now, when lis_poll_bits() returns to
lis_poll_2_1(), lis_poll_2_1() just returns whatever came back from
lis_poll_bits().  The comment in the code says that any non-zero
return will wake up all of the waiters.  Does this mean that if
lis_poll_bits() returns just POLLIN, both the write poll and the read
poll will be woken up, and that I need to explicitly check the
revents against the events?  Or is this masking of the revents hidden
within the Linux poll() system call somehow, so that only the thread
which requested the returned events is awoken?  It doesn't look like
there is any way to pass the requested events to lis_poll_bits(), so
it's kind of a mystery to me how this works (mostly because I'm too
lazy to go digging through the Linux poll() code....)
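
In other words, does the read side need a defensive check along these
lines?  (Just a sketch of what I mean; stream_fd and the getmsg()
handling are placeholders.)

#include <poll.h>
#include <errno.h>

void read_loop(int stream_fd)
{
    struct pollfd pfd;

    pfd.fd = stream_fd;
    pfd.events = POLLIN | POLLPRI;

    for (;;) {
        pfd.revents = 0;
        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR)
                continue;
            break;
        }
        /* Only call getmsg() if the wakeup really was for a read
         * event; otherwise just go back to sleep in poll(). */
        if (pfd.revents & (POLLIN | POLLPRI)) {
            /* ... getmsg() and process the frame here ... */
        }
    }
}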

> One of the things that I put into LiS to assist with this was the
> spin lock contention and semaphore contention code.  Check out
> "streams -L" and "streams -T" in conjunction with "streams -D0x28".

I did have a look at this and didn't see anything massively out of
the ordinary.  I think the highest that the delay time got was around
the 10ms range.  Nothing there comes close to explaining the 200ms
round trip delays that I am getting for 40-byte frames on a 155Mbps
link.


> I also found that there was a highly significant difference between
> Red Hat 9 and a stock 2.4 kernel of the same nominal version.  RH 9
> has the 2.6 O1 processor scheduler back-ported to it.  It makes a
> huge difference if you have a number of threads active.

I think that I am using the stock RH9 kernel, 2.4.20-8, but some of
our customers are using Enterprise Linux 3.0 (or 3.1, I can't
remember).  Do you know if the EL series has this 2.6 O1 process
scheduler?

I also have a 2.4.23 kernel that I downloaded from kernel.org to use
with kgdb, but I have not tried that yet to see if there is a
difference in performance.

The test that I am doing is just one connection trying to write and
read as fast as it can.  There are only two threads, one reading, one
writing, and I am pushing 40-byte frames through it.  On my creaky
old dual 1GHz PIII Dell I am getting around 86000 frames per second
in UP mode and 43000 frames per second in SMP mode, with the same CPU
utilization in both cases.  In the UP case the total round trip delay
is less than 30ms (I stole your semaphore histogram idea to get a
histogram of the rtt).  In the SMP case the total round trip delay is
usually as high as 100-200ms, and in some cases, like if I type
"sync", it can be as high as 600-900ms.  Yikes!

I measure the rtt by using gettimeofday() on Linux to get a usec
timestamp and just sticking that in the first word of data on the
putmsg side.  On the getmsg side I do a getmsg(), then a
gettimeofday(), and subtract the value in the first word from that.
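
Roughly like this (a sketch of the idea, not the actual test code; I
use an 8-byte usec stamp at the front of the buffer here rather than
a single word, and most of the error handling is dropped):

#include <stropts.h>
#include <string.h>
#include <sys/time.h>

/* Sender: stamp the frame with the current time in usec and putmsg it. */
static int send_stamped(int fd, char *buf, int len)
{
    struct strbuf data;
    struct timeval tv;
    unsigned long long now;

    gettimeofday(&tv, NULL);
    now = (unsigned long long)tv.tv_sec * 1000000ULL + tv.tv_usec;
    memcpy(buf, &now, sizeof(now));     /* stamp at the front of the frame */

    data.maxlen = len;
    data.len = len;
    data.buf = buf;
    return putmsg(fd, NULL, &data, 0);
}

/* Receiver: getmsg a frame and return the round trip time in usec. */
static long long recv_rtt_usec(int fd, char *buf, int maxlen)
{
    struct strbuf data;
    struct timeval tv;
    unsigned long long sent, now;
    int flags = 0;

    data.maxlen = maxlen;
    data.len = 0;
    data.buf = buf;
    if (getmsg(fd, NULL, &data, &flags) < 0)
        return -1;

    gettimeofday(&tv, NULL);
    now = (unsigned long long)tv.tv_sec * 1000000ULL + tv.tv_usec;
    memcpy(&sent, buf, sizeof(sent));
    return (long long)(now - sent);
}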

There are no issues with gettimeofday() returning bogus data on SMP,
are there?

> I played with oprofile but found it not to be the correct tool for
> diagnosing latencies.  It only tells you where the CPU *is*
> executing, not where it *isn't* executing.  If CPU bottlenecks are
> your problem oprofile is a really cool tool.

I have been wanting to check out oprofile, but just have not yet had
the time.  I am hoping to do that sometime soon.

> Once I was able to get lock and semaphore contention to a minimum I
> was able to get some real pipelining going among the 4 CPUs.  This
> allowed me to tune the number of messages queued before waking up
> another LiS queue runner thread.  I found in my testing that 12-13
> messages seemed to be about right, so I went for the Fibonacci 13.
> 
> Oh, another tip:  In an SMP system you are best off if you can run
> your driver with qlock=0, that is, no locking of your queue by LiS
> prior to driver entry.  This is what allows your put procedure to
> enqueue a message on one CPU at the same time that the same service
> procedure is operating on another one on another CPU.  This is what
> gets the pipeline going.


This brings up yet another issue.  I set my driver to use qlock=0 and
saw no real improvement.  This is because I modified my driver to
always write the data out directly from the wput routine and only
queue data for the wsrv routine if there was write side flow control.
Since my line rate is really fast, I don't get write side flow
control very often, so it's actually very rare that my wput and wsrv
routines would be running simultaneously; my wsrv routine runs fewer
than 1000 times out of 10,000,000 frames sent.  On the read side it
is the same story: there is no rput routine in a bottom level driver,
so rsrv should have a clear shot.
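
The shape of the write side is roughly this (a sketch only; hw_tx()
and the xmit_* names are made up, and the real put routine handles
more message types than M_DATA):

#include <sys/stream.h>

/* hw_tx() stands in for the actual transmit path: returns 0 if the
 * frame was handed to the hardware, non-zero if the device is busy. */
extern int hw_tx(queue_t *q, mblk_t *mp);

static int
xmit_wput(queue_t *q, mblk_t *mp)
{
    if (mp->b_datap->db_type == M_DATA) {
        /* Fast path: send directly from wput unless we are already
         * queuing (flow controlled) or the hardware is busy. */
        if (q->q_first == NULL && hw_tx(q, mp) == 0)
            return 0;
        putq(q, mp);            /* fall back to wsrv */
        return 0;
    }
    putq(q, mp);                /* everything else goes through wsrv */
    return 0;
}

static int
xmit_wsrv(queue_t *q)
{
    mblk_t *mp;

    /* Only runs on the rare flow-controlled path. */
    while ((mp = getq(q)) != NULL) {
        if (hw_tx(q, mp) != 0) {
            putbq(q, mp);       /* still busy; retry when re-enabled */
            break;
        }
    }
    return 0;
}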

What I did do this morning that made a massive difference, a 50%
increase in throughput, was to change the stream head to have
qlock=0.

I was a little surprised that I had to do this, because looking at
lis_alloc_stdata() it looks like it tries to do just that:

/*
 * Allocate and initialize an stdata structure. Do not get any of
 * the locks, leave that up to the caller.
 */
static stdata_t *
lis_alloc_stdata(void)
{
    queue_t     *q = lis_allocq("stream-head");
    stdata_t    *head ;

    if (!q)
    {
        if (LIS_DEBUG_OPEN)
            printk("lis_alloc_stdata() - "
                   "failed to allocate queues for stream head\n");
        return NULL;
    }

    if (lis_set_q_sync(q, LIS_QLOCK_NONE) < 0)
    {
        lis_freeq(q) ;
        return(NULL) ;
    }

    head = (stdata_t*) LIS_HEAD_ALLOC(sizeof(stdata_t), "stream-head") ;
    if (head == NULL)
    {
        lis_freeq(q) ;
        return(NULL) ;
    }


I also added a debugging statement to the bottom of lis_set_q_sync():


    q->q_qlock_option = qlock_option ;
    q->q_other->q_qlock_option = qlock_option ;

printk("lis_set_q_sync q=0x%p q->qsp=0x%p otherq=0x%p
otherq->qsp=0x%p option=%d
\n", q, q->q_qsp, q->q_other, q->q_other->q_qsp, qlock_option);
    return(0) ;
}


Here is the output when I open my driver:

Oct 29 13:48:45 ramon kernel: LiS-RunQ-2.18.0 running on CPU 0 pid=4990
Oct 29 13:48:45 ramon kernel: LiS-RunQ-2.18.0 running on CPU 1 pid=4991
Oct 29 13:48:45 ramon kernel: STREAMS driver "atmii" registered, major 252
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60c80 option=1
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60c80 q->qsp=0xcd7f3e80 otherq=0xcdc60d5c otherq->qsp=0xcd7f3d80 option=1
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60c80 option=0
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60c80 q->qsp=0x00000000 otherq=0xcdc60d5c otherq->qsp=0x00000000 option=0
====
Here it had just set the stream head to be option 0
====
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60a80 option=1
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60a80 q->qsp=0xcd7f3e80 otherq=0xcdc60b5c otherq->qsp=0xcd7f3d80 option=1
Oct 29 13:48:45 ramon kernel: atmiiopen: take 10.  readq=0xcdc60a80 writeq=0xcdc60b5c headrq=0xcdc60c80 headwq=0xcdc60d5c
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60d5c option=1
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60d5c q->qsp=0xcd7f3c80 otherq=0xcdc60c80 otherq->qsp=0xcd7f3b80 option=1
====
But here it changes it back to option 1....
====
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60b5c option=0
Oct 29 13:48:45 ramon kernel: lis_set_q_sync q=0xcdc60b5c q->qsp=0x00000000 otherq=0xcdc60a80 otherq->qsp=0x00000000 option=0

So is the stream head supposed to be option 0 or option 1?  For me it
works _much_ better as option 0, which I forced by exporting
lis_set_q_sync() and calling it explicitly on the stream head queue
from my wput routine.  (Doing it from the open routine didn't work,
so it must get set back to option 1 after the open() routine
returns.)

This brings me to another question that I had about the queue locking
documentation in:

file://localhost/C:/tmp/LiS/LiS-2.18.0.B/htdocs/config.html#QueueLockingSpecification

It says:
===========
The utility of using locking style 0 has to do with achieving higher
throughput in multiple CPU environments. If all put and service
procedures can be entered simultaneously then the put procedure can
be calling putq() on one CPU while the service procedure is calling
getq() and processing messages on another, setting up a pipeline
execution effect. Experiments have shown significant throughput
improvements when this style of locking can be used.

In such a circumstance the driver service procedure needs to maintain
some kind of a "running" flag, manipulated under spin lock control,
to assure single threaded execution of the messages removed from the
queue. Such flag manipulation is good practice even though LiS will
not naturally tend to enter a driver's service procedure for the same
queue on multiple CPUs simultaneously.
===========

Is this true????  From looking at the code, lis_putq() and lis_getq()
are both surrounded by LIS_QISRLOCK() macros, which are supposed to
be the irqsave() versions of the read/write lock.  Why do you need
another spin lock to protect getq()/putq()?
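
For reference, the pattern I read that paragraph as describing is
something like the following (a sketch with made-up names; the state
would live in q->q_ptr, set up by the open routine, and a real
version would re-check the queue after clearing the flag):

#include <linux/spinlock.h>
#include <sys/stream.h>

/* Hypothetical per-queue private state. */
struct srv_state {
    spinlock_t lock;
    int        running;    /* nonzero while a CPU is draining this queue */
};

static int
my_rsrv(queue_t *q)        /* hypothetical service procedure */
{
    struct srv_state *st = (struct srv_state *) q->q_ptr;
    unsigned long flags;
    mblk_t *mp;

    spin_lock_irqsave(&st->lock, flags);
    if (st->running) {     /* another CPU is already in here */
        spin_unlock_irqrestore(&st->lock, flags);
        return 0;
    }
    st->running = 1;
    spin_unlock_irqrestore(&st->lock, flags);

    while ((mp = getq(q)) != NULL) {
        /* ... process mp single threaded ... */
        freemsg(mp);
    }

    spin_lock_irqsave(&st->lock, flags);
    st->running = 0;
    spin_unlock_irqrestore(&st->lock, flags);
    return 0;
}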

Also, how are you going to have the same service routine running on
two different CPUs at the same time?  That implies to me that it
would be possible for a queue to be on the run queue twice, and from
looking at lis_run_queues() that does not look possible to me, even
on SMP machines.

This brings up another point about the queue locking.  I think that
we need to improve the documentation for the queue locking.  It is
pretty misleading in that the names are similar to locking schemes on
other OSs (HPUX and Solaris), which also provide no locking, per
queue locking, queue pair locking or "other".  However, particularly
in the QPAIR case, the locking does not behave like the Solaris or
HPUX versions.  For example, if you do an ioctl from user space to a
driver which processes the ioctl in its wput routine, through a
module which is set for QPAIR locking (the module just passes the
ioctl through to the driver via its wput routine and back to the
stream head via its rput routine), you get a different call stack on
Linux than you do on either Solaris or HPUX.  To test this I put
debugging messages in my QPAIR module (sscop) at the start and end of
wput and rput, and at the start and end of the driver (atmii) wput.

Solaris 8 Sparc 64 bit (QPAIR synchronization in SSCOP, no
synchronization in ATMII):
==============

Oct  4 11:05:18 gravitron a_sscop: [ID 963018 kern.notice] sscopwput: ioctl doing putnext q=0x30000e58d20 pm=0x3000007f6c0
Oct  4 11:05:18 gravitron atmii: [ID 682503 kern.notice] atmiiioctl: about to qreply q=0x30000ede3a8 pm=0x3000007f6c0
Oct  4 11:05:18 gravitron atmii: [ID 997160 kern.notice] atmiiioctl: back from qreply q=0x30000ede3a8 pm=0x3000007f6c0
Oct  4 11:05:18 gravitron a_sscop: [ID 801744 kern.notice] sscopwput: ioctl back from putnext returning.  q=0x30000e58d20 pm=0x3000007f6c0
Oct  4 11:05:18 gravitron a_sscop: [ID 609763 kern.notice] sscoprput: ioctl doing putnext q=0x30000e58c40 pm=0x3000007f6c0
Oct  4 11:05:18 gravitron a_sscop: [ID 760784 kern.notice] sscoprput: ioctl back from putnext returning.  q=0x30000e58c40 pm=0x3000007f6c0

The same code on Linux with LiS
==============

Oct  4 15:52:13 STATIC kernel: sscopwput: ioctl doing putnext q=0xf77ac6c0 pm=0xf7970100
Oct  4 15:52:13 STATIC kernel: atmiiioctl: about to qreply q=0xf77ac440 pm=0xf7970100
Oct  4 15:52:13 STATIC kernel: sscoprput: ioctl doing putnext q=0xf77ac580 pm=0xf7970100
Oct  4 15:52:13 STATIC kernel: sscoprput: ioctl back from putnext returning.  q=0xf77ac580 pm=0xf7970100
Oct  4 15:52:13 STATIC kernel: atmiiioctl: back from qreply q=0xf77ac440 pm=0xf7970100
Oct  4 15:52:13 STATIC kernel: sscopwput: ioctl back from putnext returning.  q=0xf77ac6c0 pm=0xf7970100

As you can see in the Solaris case, the sscopwput routine returns
from the putnext and finishes running before the sscoprput routine is
entered.  In LiS, the sscoprput routine runs to completion before
sscopwput returns from the putnext and finishes.

So I think that the thing that needs to be pointed out more
forcefully is that QPAIR only protects _two different threads_ from
entering the queue pair's put and service routines at the same time.
This is a key distinction between LiS QPAIR locking and Solaris.  In
fact a single thread _can_ enter the read and write put routines at
the same time if an external driver or module turns the message
around with qreply() so that the other put routine is entered by the
same thread.  This is not true on Solaris or HPUX.
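
To make the turnaround concrete, the driver end of the sequence in
the logs boils down to something like this (illustrative names only,
not the real atmii code):

#include <sys/stream.h>

static int
drv_wput(queue_t *q, mblk_t *mp)
{
    if (mp->b_datap->db_type == M_IOCTL) {
        /* ... process the ioctl, turn it into an M_IOCACK
         * (details omitted) ... */
        mp->b_datap->db_type = M_IOCACK;
        /* qreply() is putnext() on the other queue of the pair.  On
         * LiS this calls the upstream module's rput in the SAME
         * thread, while that module's wput is still on the stack. */
        qreply(q, mp);
        return 0;
    }
    putq(q, mp);
    return 0;
}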

This was the crux of the QPAIR problem that I had mentioned a month
or so ago...

This needs to be documented more explicitly so that it does not bite
other people.  It took me two weeks of extensive debugging to figure
this out, and I can't believe that I will be the only person fooled
by this, particularly if the developer is coming from a Solaris or
HPUX background.

Ideally I'd like to see the QPAIR locking implemented the same way as
it is on Solaris; it is much more useful that way.  However, I know
that that would be a lot of work.  This is what I was talking about
way back when you first mentioned introducing this, and I brought up
all the work that went into getting the Solaris sync queues right.
This is how Solaris does it: if it cannot enter the module's
perimeter, it puts the message on a sync queue and lets the module
finish so that it can re-enter the perimeter.  LiS would have to do
something similar, deferring entry into the put routine after a
qreply() until the other put routine had exited.  Not fun, and I know
that your appetite for big LiS projects is not very strong.

I'm tempted to go for it, but I just don't have time, what with my
day job and all...

So anyways, thanks for your time and quick response to my last
questions.

dan

=====
Dan Gora
Software Engineer
Adax, Inc.

Tel: +55 12-3845-3572
email: [EMAIL PROTECTED]
_______________________________________________
Linux-streams mailing list
[EMAIL PROTECTED]
http://gsyc.escet.urjc.es/mailman/listinfo/linux-streams
