On Tue, Aug 30, 2016 at 08:31 +0200, Mark Kettenis wrote:
> > Date: Tue, 30 Aug 2016 07:48:09 +0200
> > From: Mike Belopuhov <m...@belopuhov.com>
> > 
> > On Tue, Aug 30, 2016 at 09:58 +1000, David Gwynne wrote:
> > > On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
> > > > On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
> > > > > Due to a recent change in -current the socket sending routine
> > > > > has started producing small data packets crossing memory page
> > > > > boundary.  This is not supported by Xen and kernels with this
> > > > > change will experience broken bulk TCP transmit behaviour.
> > > > > We're working on fixing it.
> > > > 
> > > > For the same reason some old i386 machines from 2006 and 2005 have
> > > > performance problems when sending data with tcpbench.
> > > > 
> > > > em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
> > > > 10 MBit anymore.
> > > > 
> > > > With the patch below I get 946, 642, 422 MBit/sec output performance
> > > > over these chips respectively.
> > > > 
> > > > Don't know whether PAGE_SIZE is the correct fix as I think the problem
> > > > is more related to the network chip than to the processor's page
> > > > size.
> > > 
> > > does this diff help those chips?
> > >
> > 
> > This diff defeats the purpose of the sosend change by punishing
> > every other chip not suffering from the aforementioned problem.
> > Lots of packets from the bulk TCP transfer will have to be
> > defragmented for no good reason.
> 
> No, this em diff will still do proper scatter/gather.  It might
> consume more descriptors as it will use two descriptors for packets
> crossing a page boundary.  But the fact that we collect more data into
> an mbuf will actually reduce the number of descriptors in other cases.
>

Right, my bad.  I didn't think this through.

> Regarding the xnf(4) issue; I think any driver that can't properly
> deal with an mbuf crossing a page boundary is broken.  I can't think
> of any modern dma engine that can't handle that properly, or doesn't
> at least support scatter/gather of some sort.

To set things straight: xnf does support and utilize fragmented packets.

This functionality is limited in several ways, however.  First of all,
it may not be supported at all: some (old?) NetBSD-based setups don't
do scatter-gather and require that the whole packet fit into a single
4k buffer.  With the sosend change this now requires a bcopy into a
temporary buffer, while previously it didn't.
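
For illustration, the no-SG fallback boils down to something like the
sketch below.  xnf_tx_bounce() and the sc_tx_bounce[] array are made-up
names; only m_copydata() and the mbuf fields are real kernel APIs.

/*
 * Flatten a packet into one pre-granted 4k bounce buffer when the
 * backend doesn't do scatter-gather.  Sketch only, not the driver code.
 */
int
xnf_tx_bounce(struct xnf_softc *sc, struct mbuf *m, int slot)
{
        caddr_t buf = sc->sc_tx_bounce[slot];   /* one wired 4k buffer */

        if (m->m_pkthdr.len > 4096)
                return (EFBIG);         /* can't fit, defragment or drop */

        /* Copy the whole chain into the contiguous buffer... */
        m_copydata(m, 0, m->m_pkthdr.len, buf);

        /* ...and the single ring descriptor then points at that page. */
        return (0);
}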

The second limitation: when scatter-gather I/O is supported, Netfront
provides us with 256 general purpose ring descriptors, each describing
either a complete packet or one of up to 18 chunks of said packet.
There is therefore no traditional fragment SGL attached to a descriptor;
the whole 256-entry ring is one big fragment SGL itself.
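
To make the "ring is the SGL" point concrete, a Tx slot looks roughly
like this in the Xen netif protocol; the names used by xnf(4) itself
may differ, so treat this as illustrative:

struct netif_tx_request {
        uint32_t        gref;           /* grant reference of one 4k buffer */
        uint16_t        offset;         /* data offset within that buffer */
        uint16_t        flags;          /* "more data" flag chains the next slot */
        uint16_t        id;             /* echoed back in the Tx response */
        uint16_t        size;           /* packet length in the first slot,
                                           chunk length in the following ones */
};

A packet occupies consecutive slots; every slot except the last one
sets the "more data" flag, which is how up to 18 chunks get chained
without any per-descriptor fragment list.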

Furthermore, each of the 256 entries references a single 4k buffer.
That reference is a limited resource itself, as it's an entry in an
IOMMU-like structure called the grant table.  Typically there are only
32 grant table pages and each page holds 512 entries (IIRC).  One xnf
device uses 256 entries for the Rx ring, several entries for the ring
header and 256 * NFRAG entries for the Tx ring.  Right now NFRAG is 1.
Bumping it to 2 is probably not a problem.  However, if we want (and we
do) to support jumbo frames (9000 bytes, no less), we'd have to bump it
to 4 entries per packet, which makes the Tx ring eat up two whole grant
table pages (1024 entries).  That's roughly 3 pages per xnf in a
*typical* setup.  Since the grant table is a resource shared by all Xen
PV drivers, this limits the number of xnf interfaces to about 9.  If
the disk driver appears, we might be limited to far fewer supported
interfaces, but at the moment that's speculation at best.
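
To put the arithmetic in one place, here's a tiny standalone program
using the constants above (the "several" ring header refs are left out;
the numbers are the ones quoted in this mail, not pulled from the
driver source):

#include <stdio.h>

#define GNT_PAGES       32      /* grant table pages in a typical setup */
#define GNT_PER_PAGE    512     /* entries per grant table page (IIRC) */
#define XNF_RX_SLOTS    256     /* one grant ref per Rx ring slot */
#define XNF_TX_SLOTS    256     /* Tx ring slots */

int
main(void)
{
        int nfrag;

        for (nfrag = 1; nfrag <= 4; nfrag++) {
                int entries = XNF_RX_SLOTS + XNF_TX_SLOTS * nfrag;
                int pages = (entries + GNT_PER_PAGE - 1) / GNT_PER_PAGE;

                printf("NFRAG=%d: %4d entries, ~%d of %d grant table "
                    "pages per xnf\n", nfrag, entries, pages, GNT_PAGES);
        }
        return (0);
}

With NFRAG=4 that comes out to 1280 entries, i.e. the roughly 3 pages
per xnf mentioned above.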

Now that the limitations of the interface have been spelled out, we can
see that bus_dmamap_load_mbuf would be a tremendously wasteful interface:
256 descriptors * 18 fragments = 4608 grant table entries, and 4608 /
512 entries per grant table page = 9 pages per xnf out of the 32
available in total per system.  This is the reason for the manual mbuf
chain traversal code that does a bus_dmamap_load on each single buffer.
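
In code that traversal is roughly the following; xnf_load_m() is a
made-up stand-in for the per-buffer bus_dmamap_load(), this is not the
actual driver code:

int
xnf_encap_sketch(struct xnf_softc *sc, struct mbuf *m_head, uint32_t *prod)
{
        struct mbuf *m;

        for (m = m_head; m != NULL; m = m->m_next) {
                if (m->m_len == 0)
                        continue;
                /*
                 * One ring slot and one grant reference per mbuf,
                 * assuming m->m_data doesn't cross a page boundary.
                 */
                if (xnf_load_m(sc, m, prod))
                        return (ENOBUFS);
        }
        return (0);
}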

At the same time, this is how fragments are used right now: every
m->m_data within a chain is its own fragment.  The sosend change
requires an additional change to support multiple segments per
m->m_data and the use of additional descriptors to cover for them.
While that is possible to do, it's a requirement that was pushed on me
without any notice.
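
The shape of that additional change would be roughly this, again with
hypothetical names (xnf_load_chunk() doesn't exist, it stands in for
loading one page-sized piece into one ring slot):

int
xnf_load_frag(struct xnf_softc *sc, struct mbuf *m, uint32_t *prod)
{
        vaddr_t va = mtod(m, vaddr_t);
        int resid = m->m_len;

        while (resid > 0) {
                /* Clamp each chunk so it never spans two pages. */
                int chunk = MIN(resid,
                    (int)(PAGE_SIZE - (va & (PAGE_SIZE - 1))));

                if (xnf_load_chunk(sc, va, chunk, prod))
                        return (ENOBUFS);
                va += chunk;
                resid -= chunk;
        }
        return (0);
}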

Hope this clears the situation up.

> There may be old crufty
> stuff though that can't deal with it, but those probably already have
> "bcopy" drivers.  Now there may be drivers that don't enforce the
> boundary properly.  Those will mysteriously stop working.  Will we be
> able to fix all of those before 6.1 gets released?
> 

Since it depends on users providing test coverage, I wouldn't bet on it.
