Garrett D'Amore writes:
 > Roch - PAE wrote:
 > > Garrett D'Amore writes:
 > >  > Roch - PAE wrote:
 > >  > > jason jiang writes:
 > >  > >  > From my experience, using a softintr to distribute the packets to the
 > >  > >  > upper layer gives worse latency and throughput than handling them in a
 > >  > >  > single interrupt thread. And you want to make sure you do not handle
 > >  > >  > too many packets in one interrupt.
 > >  > >  >  
 > >  > >  >  
 > >  > >
 > >  > >
 > >  > > I see that both interrupt schemes suffer from the same
 > >  > > drawback of pinning whatever thread happens to be running on 
 > >  > > the interrupt/softintr CPU. The problem gets really annoying
 > >  > > when the incoming inter-packet interval is smaller than the
 > >  > > handling time under the interrupt. Even if the code is set
 > >  > > to return after handling N packets, a new interrupt will be
 > >  > > _immediately_ signified and the pinning will keep on going.
 > >  > >   
 > >  > 
 > >  > That depends on the device and the driver.  It's fully possible to 
 > >  > acknowledge multiple interrupts.  With most hardware I've worked with, 
 > >  > if multiple packets arrive and the interrupt on the device is not 
 > >  > acknowledged, then multiple interrupts are not received.  So if you 
 > >  > don't acknowledge the interrupt until you think you're done processing, 
 > >  > you probably won't take another interrupt when you exit.  (You do need 
 > >  > to check one last time after acknowledging the interrupt to prevent a 
 > >  > lost-packet race, though.)
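
(Side note on the ack-then-recheck point above: in a driver the pattern
would look roughly like the sketch below. The xx_* names and the softc
layout are made up for illustration, not taken from any particular driver.)

    /*
     * Process, acknowledge, then make one last pass over the ring so a
     * packet that slips in between the last pass and the ack is not
     * stranded until the next interrupt.
     */
    static uint_t
    xx_intr(caddr_t arg)
    {
            xx_softc_t *sc = (xx_softc_t *)arg;

            mutex_enter(&sc->sc_rx_lock);

            xx_rx_ring_drain(sc);           /* handle what has accumulated */
            xx_ack_rx_intr(sc);             /* now acknowledge the device */
            xx_rx_ring_drain(sc);           /* close the lost-packet race */

            mutex_exit(&sc->sc_rx_lock);
            return (DDI_INTR_CLAIMED);
    }
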
 > >
 > >
 > > I agree about the interrupts being coalesced. The problem
 > > that needs to be dealt with is when there is an endless
 > > stream of inbound data with an inter-packet gap which is
 > > smaller than the handling time (the ill-defined entity).
 > >   
 > 
 > In this case, the driver/stack simply cannot keep up with the inbound 
 > packets.  This is the problem that solutions like 802.3x flow control 
 > and RED (random early drop) are supposed to address.  Otherwise you wind 
 > up just losing packets.
 > 

802.3x operates at the HW level, so it's not a solution to
my issue. We're concerned here with dealing with a stream of
packets that, when handled directly in the interrupt, causes
endless pinning of some poor thread... you suggest an
alternative below.
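
To put rough numbers on the imbalance (illustrative only; the 2 usec
per-packet cost below is an assumption, not a measurement):

    #include <stdio.h>

    int
    main(void)
    {
            /* 64B frame + 8B preamble + 12B inter-frame gap; at 1 Gb/s one
             * bit takes 1 ns, so the whole frame slot is ~672 ns. */
            double gap_ns = (64 + 8 + 12) * 8;
            double handling_ns = 2000.0;    /* assumed per-packet cost under interrupt */

            (void) printf("inter-packet gap: %.0f ns\n", gap_ns);
            (void) printf("packets arriving per packet handled: %.1f\n",
                handling_ns / gap_ns);
            return (0);
    }

At that rate roughly three packets arrive for every one handled, so the
handler never drains the ring and the pinned thread never gets the CPU
back for as long as the stream lasts.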


 > Hopefully, you don't get so far behind that the system winds up 
 > processing packets which are later discarded as being "stale".  If that 
 > happens then you have a serious problem.  Modern CPUs with modern 
 > devices/drivers shouldn't have that problem.
 > 
 > >
 > >  > > Now, the per-packet handling time is not a well-defined
 > >  > > entity. The software stack can choose to do more (say, push up
 > >  > > through TCP/IP) or less work (just queue and wake a kernel
 > >  > > thread) on each packet. All this needs to be managed based
 > >  > > on the load, and we're moving in that direction.
 > >  > >   
 > >  > 
 > >  > There are other changes in the process... when the stack can't keep up 
 > >  > with the inbound packets at _interrupt_ rate, the stack will have the 
 > >  > ability to turn off interrupts on the device (if it supports it), and 
 > >  > run the receive thread in "polling mode".  This means that you have no 
 > >  > interpacket context switches.  It will stay in this mode until the 
 > >  > poller empties the receive ring.
 > >  > 
 > >
 > > Perfect.
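
(My mental model of that interrupt-off/poll-until-empty switch, again with
made-up names; this is only a sketch, not the actual squeue/crossbow
interface:)

    static uint_t
    xx_rx_intr(caddr_t arg)
    {
            xx_softc_t *sc = (xx_softc_t *)arg;

            xx_disable_rx_intr(sc);         /* device stops interrupting */
            xx_wake_poll_thread(sc);        /* a kernel thread takes over */
            return (DDI_INTR_CLAIMED);
    }

    static void
    xx_poll_loop(xx_softc_t *sc)
    {
            mblk_t *mp;

            /* No per-packet interrupts, no per-packet context switches. */
            while ((mp = xx_rx_ring_get(sc)) != NULL)
                    xx_deliver_up(sc, mp);

            /*
             * Ring is empty: re-arm the interrupt.  (The real code must
             * re-check the ring afterwards -- same lost-packet race as
             * with the ack above.)
             */
            xx_enable_rx_intr(sc);
    }
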
 > >
 > >  > > At the driver level, if you reach a point where you have a
 > >  > > large queue in the HW receive rings, that is a nice
 > >  > > indication that deferring the processing to a non-interrupt
 > >  > > kernel thread would be good. Under this condition the thread 
 > >  > > wakeup cost is amortized over the handling of many packets.
 > >  > >   
 > >  > 
 > >  > Hmm... but you still have the initial latency for the first packet in 
 > >  > the ring.  It's not fatal, but it's not nice to add 10 msec of latency 
 > >  > if you don't have to, either.  
 > >
 > > Absolutely. The first packets are handled quickly as soon as 
 > > they arrive. The handling of the initial packets is exactly
 > > what can cause a backlog to build up in the HW. Once the HW
 > > has a backlog built up, there is no sense in continuing to
 > > process them under the interrupt; latency is not the critical
 > > metric at that point.
 > >   
 > 
 > Why not deal with them under the interrupt?  Assuming you can process 
 > them all in the same interrupt context (i.e. you do not want to have to 
 > service multiple interrupts), once you're already in interrupt context 
 > and the packets keep coming, there is no real reason not to deal with 
 > them there, as long as you can do so quickly, without keeping the CPU 
 > from processing other system-critical tasks.  In multi-CPU systems, 
 > the idea of dedicating a processor to handling rx interrupts from a 
 > high-traffic NIC is actually very reasonable.
 > 
 > To a certain extent, the question of the "context" that you are handling 
 > the traffic in is one of resource allocation... interrupt context is a 
 > bad place for a shared CPU to be, at least for very long.   But if you 
 > have the CPU to dedicate to the job, and the traffic to justify it, 
 > leaving the CPU running in interrupt context works pretty well.
 > 

For benchmarking I agree. For certain systems, possibly. But
for many servers the incoming traffic might not be sustained
24x7, and at times there are multiple interrupt sources all
active at once, at other times not. Limiting the interrupt
load to a segregated set of CPUs means that set can sit idle
when there is high non-network-related demand for CPU. It
seems a pain to manage.

My preferred solution to this complex issue is more along
the lines of deciding, based on the load dynamics, whether or
not to run code under the interrupt, without segregating CPUs
for interrupt handling. If a certain interrupt routine runs
for a long time, or if the backlog on a HW ring indicates
that it will, then I want to defer the work to kernel
context.
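
Concretely, something like this in the rx handler (a variant of the
sketches above; the threshold and the xx_* names are invented, and a real
implementation would also keep the rx interrupt masked until the worker
has caught up):

    #define XX_RX_DEFER_THRESHOLD   32      /* packets sitting in the HW ring */

    static uint_t
    xx_rx_intr(caddr_t arg)
    {
            xx_softc_t *sc = (xx_softc_t *)arg;

            if (xx_rx_ring_depth(sc) < XX_RX_DEFER_THRESHOLD) {
                    /* Light load: latency matters, handle it right here. */
                    xx_rx_ring_drain(sc);
            } else {
                    /*
                     * A backlog has formed: the thread wakeup is amortized
                     * over many packets, so push the work to a kernel
                     * thread and give the interrupted CPU back.
                     */
                    (void) ddi_taskq_dispatch(sc->sc_rx_tq, xx_rx_worker,
                        sc, DDI_NOSLEEP);
            }
            return (DDI_INTR_CLAIMED);
    }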

 > In fact, lately I've been doing a lot of performance testing with IP 
 > forwarding of very small (64-byte) packets.  I've found that the best 
 > way to get good performance on systems with multiple CPUs is to allocate 
 > a CPU to the task of interrupt handling for each high-traffic NIC.  
 > Right now this requires some finagling with psrset and psradm -i, and 
 > looking at bindings with mdb "::interrupts", but it is really 
 > worthwhile.  The performance boost you get by doing this kind of 
 > tweaking can be nearly 100%.

That makes sense, and I agree this should be made easier to
achieve. I'm also arguing that for the general-purpose market
we need a more generic answer that does not involve
workload-specific tuning.

 > 
 > (E.g. I've been able to process ~500,000 inbound packets per second on a 
 > single 2.4GHz core using this technique.  I'm looking at ways to 
 > increase this number even further.)
 > >
 > >  > The details of this decision are moving up-stack 
 > >  > though, in the form of squeues and polling with crossbow.
 > >
 > > Right, I just suggest that the HW backlog might be one of the variables
 > > involved in the decision.
 > >   
 > 
 > If you have a HW backlog, you probably also have an upstream backlog.  
 > But this should be taken into account.
 > 

I don't agree here. The HW backlog forms because of an
imbalance between the incoming inter-packet interval and the
per-packet on-interrupt CPU handling time. If packet handling
means pushing data all the way through, there might not be an
upstream backlog. Sometimes there will be backlogs in both
places, sometimes in either one. But the two are fairly
independent IMO.

 > Again, note that a lot of NICs on the market these days have support for 
 > features like 802.3x, which is intended to help manage the backlog at 
 > the link layer (in this case by providing flow control information to 
 > the peer systems in the network.)
 > 

Yep. But again this deals with another, unrelated imbalance.

-r
