Hi Greg:

I think we now both understand the root cause and have a solution for the
intermittent timeout problem in our Digi USB serial driver.  I'm pretty
sure that this problem affects the other usb-serial drivers under 2.3.X
and 2.4.X kernels too.  We're just merging in some other changes and
will send you a patch against 2.4.0-test1 shortly.  Meanwhile, here's
our assessment of the problem:

As I mentioned in my earlier note, we found that large file transfers
would sometimes hang with timeout errors on some machines.  It turned
out that in these circumstances write_chan() (in n_tty.c) was asleep
waiting for our wakeup call.  Even though we call wake_up_interruptible()
in digi_write_bulk_callback(), there is a race condition that could cause
the wakeup to fail:  if our wake_up_interruptible() call occurs between
the time that our driver write routine finishes and when write_chan() sets
current->state to TASK_INTERRUPTIBLE, the effect of our wakeup setting
the state to TASK_RUNNING will be lost and write_chan's subsequent call
to schedule() will never return (unless it catches a signal).

This race condition occurs because write_bulk_callback() (and thus the
wakeup) are called asynchonously from an interrupt, rather than from the
scheduler.  We can avoid the race by calling the wakeup from the scheduler
queue and that's our fix:  Now, at the end of bulk_write_callback()
we queue up a wakeup call on the scheduler task queue.  We still also
invoke the wakeup directly since that squeezes a bit more performance
out of the driver, and any lost race conditions will get cleaned up at
the next scheduler run.

I noticed from my test logs that I never seemed to get the timeout errors
on 2.2.14 kernels, so I compared the 2.2.14 and 2.4.0-test1 versions of
write_chan() in n_tty.c.  It turns out that the 2.2.14 version just sets
current->state to TASK_INTERRUPTIBLE at the top of each loop rather
than trying to closely bracket the driver.write() call with code setting
it first to TASK_RUNNING, then to TASK_INTERRUPTIBLE, so there's no
race condition in those kernels.  I can't immediately see any need
for that new code in n_tty.c anyway, so I tried a test with the
vanilla 2.4.0-test1 sources where, in n_tty.c, I commented out those two
lines (lines 1145 and 1147).  With those two changes our old driver
(as included in the 2.4.0-test1 distribution) works fine.

So, that would be a quick way for you to see if this is the same problem
you've been seeing with Visor -- try commenting out those two lines
in write_chan() and try your test again.  If it is, then you may want
to consider adding a scheduler-time wakeup to your write_bulk_callback
routine too.   (Then again, unless someone can explain why it's useful
to have those lines in n_tty.c, maybe they should be removed.)



...
> > Maybe this is the same problem you're seeing? 
> 
> This sounds like the problem that I have been seeing (and lots of other
> people) with the Visor on OHCI seeming to just go to sleep in the middle
> of a large transfer. When I slow the OHCI driver down by turning on all
> of it's debugging, the problem goes away, which makes sense given what
> you just described. 
> 
> Thanks!
...

Yup.  That sounds likely.  Here's hoping that this quick fix will fix
those problems for you...

     --Peter
--
Peter E. Berger                               Tel:  (218) 848-2885
Brimson Laboratories                          Fax:  (218) 848-2433
1549 Hiironen Rd.                             Email: [EMAIL PROTECTED]
Brimson, MN 55602

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to