Re: Heads up to Xen users following -current

2016-09-12 Thread Mike Belopuhov
On 29 August 2016 at 19:10, Mike Belopuhov  wrote:
> Due to a recent change in -current the socket sending routine
> has started producing small data packets crossing memory page
> boundary.  This is not supported by Xen and kernels with this
> change will experience broken bulk TCP transmit behaviour.
> We're working on fixing it.
>
> Cheers,
> Mike

Update: both TCP stacks and xnf(4) were fixed.



Re: Heads up to Xen users following -current

2016-08-30 Thread Mike Belopuhov
On Aug 30, 2016 10:56 AM, "Claudio Jeker"  wrote:
>
> On Tue, Aug 30, 2016 at 10:48:53AM +0200, Mike Belopuhov wrote:
> > On Tue, Aug 30, 2016 at 08:31 +0200, Mark Kettenis wrote:
> > > > Date: Tue, 30 Aug 2016 07:48:09 +0200
> > > > From: Mike Belopuhov 
> > > >
> > > > On Tue, Aug 30, 2016 at 09:58 +1000, David Gwynne wrote:
> > > > > On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
> > > > > > On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
> > > > > > > Due to a recent change in -current the socket sending routine
> > > > > > > has started producing small data packets crossing memory page
> > > > > > > boundary.  This is not supported by Xen and kernels with this
> > > > > > > change will experience broken bulk TCP transmit behaviour.
> > > > > > > We're working on fixing it.
> > > > > >
> > > > > > For the same reason some old i386 machines from 2006 and 2005 have
> > > > > > performance problems when sending data with tcpbench.
> > > > > >
> > > > > > em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
> > > > > > 10 MBit anymore.
> > > > > >
> > > > > > With the patch below I get 946, 642, 422 MBit/sec output performance
> > > > > > over these chips respectively.
> > > > > >
> > > > > > Don't know whether PAGE_SIZE is the correct fix as I think the problem
> > > > > > is more related to the network chip than to the processor's page
> > > > > > size.
> > > > >
> > > > > does this diff help those chips?
> > > > >
> > > >
> > > > This diff defeats the purpose of the sosend change by punishing
> > > > every other chip not suffering from the aforementioned problem.
> > > > Lots of packets from the bulk TCP transfer will have to be
> > > > defragmented for no good reason.
> > >
> > > No, this em diff will still do proper scatter/gather.  It might
> > > consume more descriptors as it will use two descriptors for packets
> > > crossing a page boundary.  But the fact that we collect more data into
> > > an mbuf will actually reduce the number of descriptors in other cases.
> > >
> >
> > Right, my bad.  I didn't think this through.
> >
> > > Regarding the xnf(4) issue; I think any driver that can't properly
> > > deal with an mbuf crossing a page boundary is broken.  I can't think
> > > of any modern dma engine that can't handle that properly, or doesn't
> > > at least support scatter/gather of some sort.
> >
> > To set things straight: xnf does support and utilize fragmented packets.
> >
> > This functionality is limited in several ways, however.  First of all
> > it may not be supported: some (old?) NetBSD-based setups don't support
> > scatter-gather at all and require that the packet fit in a single 4k buffer.
> > This now requires a bcopy into a temporary buffer, while previously it
> > didn't.
>
> I think not much changes here. You get more than one segment, you lose.
> Bad HW, bad performance...
>

Yes, but the code won't write itself.

> > The second limitation is that when scatter-gather i/o is supported,
> > Netfront provides us with 256 general-purpose ring descriptors that
> > describe either a complete packet or one of up to 18 chunks of said packet.
> > Therefore there's no traditional fragment SGL attached to a descriptor,
> > but the whole 256 entry ring is one big fragment SGL itself.
> >
> > Furthermore each one of 256 entries has a reference to a single 4k
> > buffer.  This reference is a limited resource itself as it's an entry
> > in an IOMMU-like structure called Grant Tables.  Typically there are
> > only 32 grant table pages and each page holds 512 entries (IIRC).  One
> > xnf device uses 256 entries for Rx ring, several entries for the ring
> > header and 256 * NFRAG entries for the Tx ring.  Right now this NFRAG
> > is 1.  Bumping it to 2 is probably not a problem.  However if we want
> > (and we do) to support jumbo frames (9000 no less) we'd have to bump
> > it up to 4 entries to fit one jumbo frame which eats up two whole
> > grant table pages (1024 entries).  That's roughly 3 pages per xnf
> > in a *typical* setup.  Since it's a shared resource for all Xen PV
> > drivers, this limits the number of xnf interfaces to about 9.  If the
> > disk driver appears, we might be limited to far fewer supported
> > interfaces.  But at the moment that's speculation at best.
> >
> > Now that limitations of the interface are specified, we can see that
> > bus_dmamap_load_mbuf would be a tremendously wasteful interface:
> > 256 descriptors * 18 fragments = 4608 grant table entries
> > 4608 / 512 entries per grant table page = 9 pages per xnf out of 32
> > in total per system.  This is the reason for the manual mbuf chain
> > traversal code that does a bus_dmamap_load into a single buffer.
>
> This is roughly how almost all HW rings work.

Not really. Normally you've got a tx descriptor handling a single packet.
This is not the case here.

> The driver creates the DMA map with nsegments = 18, maxsegsz = 

Re: Heads up to Xen users following -current

2016-08-30 Thread Alexander Bluhm
On Tue, Aug 30, 2016 at 09:58:59AM +1000, David Gwynne wrote:
> On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
> > em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
> > 10 MBit anymore.
> 
> does this diff help those chips?

No, it does not change anything.

> 
> Index: if_em.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_em.c,v
> retrieving revision 1.331
> diff -u -p -r1.331 if_em.c
> --- if_em.c   13 Apr 2016 10:34:32 -  1.331
> +++ if_em.c   29 Aug 2016 23:52:07 -
> @@ -2134,7 +2134,7 @@ em_setup_transmit_structures(struct em_s
>   pkt = &sc->sc_tx_pkts_ring[i];
>   error = bus_dmamap_create(sc->sc_dmat, MAX_JUMBO_FRAME_SIZE,
>   EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1),
> - MAX_JUMBO_FRAME_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
> + MAX_JUMBO_FRAME_SIZE, 4096, BUS_DMA_NOWAIT, &pkt->pkt_map);
>   if (error != 0) {
>   printf("%s: Unable to create TX DMA map\n",
>   DEVNAME(sc));



Re: Heads up to Xen users following -current

2016-08-30 Thread Claudio Jeker
On Tue, Aug 30, 2016 at 10:48:53AM +0200, Mike Belopuhov wrote:
> On Tue, Aug 30, 2016 at 08:31 +0200, Mark Kettenis wrote:
> > > Date: Tue, 30 Aug 2016 07:48:09 +0200
> > > From: Mike Belopuhov 
> > > 
> > > On Tue, Aug 30, 2016 at 09:58 +1000, David Gwynne wrote:
> > > > On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
> > > > > On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
> > > > > > Due to a recent change in -current the socket sending routine
> > > > > > has started producing small data packets crossing memory page
> > > > > > boundary.  This is not supported by Xen and kernels with this
> > > > > > change will experience broken bulk TCP transmit behaviour.
> > > > > > We're working on fixing it.
> > > > > 
> > > > > For the same reason some old i386 machines from 2006 and 2005 have
> > > > > performance problems when sending data with tcpbench.
> > > > > 
> > > > > em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
> > > > > 10 MBit anymore.
> > > > > 
> > > > > With the patch below I get 946, 642, 422 MBit/sec output performance
> > > > > over these chips respectively.
> > > > > 
> > > > > Don't know whether PAGE_SIZE is the correct fix as I think the problem
> > > > > is more related to the network chip than to the processor's page
> > > > > size.
> > > > 
> > > > does this diff help those chips?
> > > >
> > > 
> > > This diff defeats the purpose of the sosend change by punishing
> > > every other chip not suffering from the aforementioned problem.
> > > Lots of packets from the bulk TCP transfer will have to be
> > > defragmented for no good reason.
> > 
> > No, this em diff will still do proper scatter/gather.  It might
> > consume more descriptors as it will use two descriptors for packets
> > crossing a page boundary.  But the fact that we collect more data into
> > an mbuf will actually reduce the number of descriptors in other cases.
> >
> 
> Right, my bad.  I didn't think this through.
> 
> > Regarding the xnf(4) issue; I think any driver that can't properly
> > deal with an mbuf crossing a page boundary is broken.  I can't think
> > of any modern dma engine that can't handle that properly, or doesn't
> > at least support scatter/gather of some sort.
> 
> To set things straight: xnf does support and utilize fragmented packets.
> 
> This functionality is limited in several ways, however.  First of all
> it may not be supported: some (old?) NetBSD-based setups don't support
> scatter-gather at all and require that the packet fit in a single 4k buffer.
> This now requires a bcopy into a temporary buffer, while previously it
> didn't.

I think not much changes here. You get more than one segment, you lose.
Bad HW, bad performance...
 
> The second limitation is that when scatter-gather i/o is supported,
> Netfront provides us with 256 general-purpose ring descriptors that
> describe either a complete packet or one of up to 18 chunks of said packet.
> Therefore there's no traditional fragment SGL attached to a descriptor,
> but the whole 256 entry ring is one big fragment SGL itself.
> 
> Furthermore each one of 256 entries has a reference to a single 4k
> buffer.  This reference is a limited resource itself as it's an entry
> in an IOMMU-like structure called Grant Tables.  Typically there are
> only 32 grant table pages and each page holds 512 entries (IIRC).  One
> xnf device uses 256 entries for Rx ring, several entries for the ring
> header and 256 * NFRAG entries for the Tx ring.  Right now this NFRAG
> is 1.  Bumping it to 2 is probably not a problem.  However if we want
> (and we do) to support jumbo frames (9000 no less) we'd have to bump
> it up to 4 entries to fit one jumbo frame which eats up two whole
> grant table pages (1024 entries).  That's roughly 3 pages per xnf
> in a *typical* setup.  Since it's a shared resource for all Xen PV
> drivers, this limits the number of xnf interfaces to about 9.  If the
> disk driver appears, we might be limited to far fewer supported
> interfaces.  But at the moment that's speculation at best.
> 
> Now that limitations of the interface are specified, we can see that
> bus_dmamap_load_mbuf would be a tremendously wasteful interface:
> 256 descriptors * 18 fragments = 4608 grant table entries
> 4608 / 512 entries per grant table page = 9 pages per xnf out of 32
> in total per system.  This is the reason for the manual mbuf chain
> traversal code that does a bus_dmamap_load into a single buffer.

This is roughly how almost all HW rings work.
The driver creates the DMA map with nsegments = 18, maxsegsz = PAGE_SIZE and
boundary = PAGE_SIZE. This will waste a few resources in the bus_dmamap_t
but that's it. bus_dmamap_load_mbuf is used to load the mbuf chain into
the dma map and then the driver loops over the dma map dm_segs and fills
the 256-entry ring. You can check that you have at least 18 free entries in the
SGL before doing the work and if bus_dmamap_load_mbuf fails because the
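
As a rough illustration of the approach described above (a sketch only, not
the actual xnf(4) code; XNF_TX_NSEGS and xnf_ring_put() are hypothetical
placeholders), the create/load/walk sequence could look like this:

#include <sys/param.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

#define XNF_TX_NSEGS    18      /* per-packet fragment limit from above */

void    xnf_ring_put(bus_addr_t, bus_size_t, int);     /* hypothetical helper */

int
xnf_txmap_create(bus_dma_tag_t dmat, bus_dmamap_t *mapp)
{
    /* nsegments = 18, maxsegsz = PAGE_SIZE, boundary = PAGE_SIZE */
    return (bus_dmamap_create(dmat, XNF_TX_NSEGS * PAGE_SIZE,
        XNF_TX_NSEGS, PAGE_SIZE, PAGE_SIZE, BUS_DMA_NOWAIT, mapp));
}

int
xnf_encap(bus_dma_tag_t dmat, bus_dmamap_t map, struct mbuf *m)
{
    int error, i;

    /* EFBIG here means the chain needs m_defrag() or must be dropped */
    error = bus_dmamap_load_mbuf(dmat, map, m, BUS_DMA_NOWAIT);
    if (error != 0)
        return (error);

    bus_dmamap_sync(dmat, map, 0, map->dm_mapsize, BUS_DMASYNC_PREWRITE);

    /* one ring slot (and one grant reference) per DMA segment */
    for (i = 0; i < map->dm_nsegs; i++)
        xnf_ring_put(map->dm_segs[i].ds_addr, map->dm_segs[i].ds_len,
            i == map->dm_nsegs - 1);

    return (0);
}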

Re: Heads up to Xen users following -current

2016-08-30 Thread Mike Belopuhov
On Tue, Aug 30, 2016 at 08:31 +0200, Mark Kettenis wrote:
> > Date: Tue, 30 Aug 2016 07:48:09 +0200
> > From: Mike Belopuhov 
> > 
> > On Tue, Aug 30, 2016 at 09:58 +1000, David Gwynne wrote:
> > > On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
> > > > On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
> > > > > Due to a recent change in -current the socket sending routine
> > > > > has started producing small data packets crossing memory page
> > > > > boundary.  This is not supported by Xen and kernels with this
> > > > > change will experience broken bulk TCP transmit behaviour.
> > > > > We're working on fixing it.
> > > > 
> > > > For the same reason some old i386 machines from 2006 and 2005 have
> > > > performance problems when sending data with tcpbench.
> > > > 
> > > > em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
> > > > 10 MBit anymore.
> > > > 
> > > > With the patch below I get 946, 642, 422 MBit/sec output performance
> > > > over these chips respectively.
> > > > 
> > > > Don't know whether PAGE_SIZE is the correct fix as I think the problem
> > > > is more related to the network chip than to the processor's page
> > > > size.
> > > 
> > > does this diff help those chips?
> > >
> > 
> > This diff defeats the purpose of the sosend change by punishing
> > every other chip not suffering from the aforementioned problem.
> > Lots of packets from the bulk TCP transfer will have to be
> > defragmented for no good reason.
> 
> No, this em diff will still do proper scatter/gather.  It might
> consume more descriptors as it will use two descriptors for packets
> crossing a page boundary.  But the fact that we collect more data into
> an mbuf will actually reduce the number of descriptors in other cases.
>

Right, my bad.  I didn't think this through.

> Regarding the xnf(4) issue; I think any driver that can't properly
> deal with an mbuf crossing a page boundary is broken.  I can't think
> of any modern dma engine that can't handle that properly, or doesn't
> at least support scatter/gather of some sort.

To set things straight: xnf does support and utilize fragmented packets.

This functionality is limited in several ways, however.  First of all
it may not be supported: some (old?) NetBSD-based setups don't support
scatter-gather at all and require that the packet fit in a single 4k buffer.
This now requires a bcopy into a temporary buffer, while previously it
didn't.

The second limitation is that when scatter-gather i/o is supported,
Netfront provides us with 256 general-purpose ring descriptors that
describe either a complete packet or one of up to 18 chunks of said packet.
Therefore there's no traditional fragment SGL attached to a descriptor,
but the whole 256 entry ring is one big fragment SGL itself.

Furthermore each one of 256 entries has a reference to a single 4k
buffer.  This reference is a limited resource itself as it's an entry
in an IOMMU-like structure called Grant Tables.  Typically there are
only 32 grant table pages and each page holds 512 entries (IIRC).  One
xnf device uses 256 entries for Rx ring, several entries for the ring
header and 256 * NFRAG entries for the Tx ring.  Right now this NFRAG
is 1.  Bumping it to 2 is probably not a problem.  However if we want
(and we do) to support jumbo frames (9000 no less) we'd have to bump
it up to 4 entries to fit one jumbo frame which eats up two whole
grant table pages (1024 entries).  That's roughly 3 pages per xnf
in a *typical* setup.  Since it's a shared resource for all Xen PV
drivers, this limits the number of xnf interfaces to about 9.  If the
disk driver appears, we might be limited to far fewer supported
interfaces.  But at the moment that's speculation at best.

Now that limitations of the interface are specified, we can see that
bus_dmamap_load_mbuf would be a tremendously wasteful interface:
256 descriptors * 18 fragments = 4608 grant table entries
4608 / 512 entries per grant table page = 9 pages per xnf out of 32
in total per system.  This is the reason for the manual mbuf chain
traversal code that does a bus_dmamap_load into a single buffer.
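
For reference, the arithmetic above can be double-checked with a trivial
standalone program; the constants below are simply the figures quoted in
this mail, not values taken from the driver source:

#include <stdio.h>

#define RING_SLOTS            256   /* Rx ring size, Tx ring size */
#define TX_NFRAG              4     /* grants per Tx slot for a 9000-byte frame */
#define GNT_ENTRIES_PER_PAGE  512   /* grant table entries per page */
#define GNT_PAGES             32    /* typical grant table pages per system */

int
main(void)
{
    /* jumbo-capable xnf: Rx ring plus 4 grants per Tx slot */
    int per_xnf = RING_SLOTS + RING_SLOTS * TX_NFRAG;       /* 1280 entries */
    int pages = (per_xnf + GNT_ENTRIES_PER_PAGE - 1) / GNT_ENTRIES_PER_PAGE;

    /* worst case if every Tx slot carried a full 18-fragment SGL */
    int worst = RING_SLOTS * 18;                            /* 4608 entries */
    int worst_pages = worst / GNT_ENTRIES_PER_PAGE;         /* 9 pages */

    printf("per xnf: %d entries, about %d grant table pages\n",
        per_xnf, pages);
    printf("load_mbuf worst case: %d entries, %d of %d pages\n",
        worst, worst_pages, GNT_PAGES);
    return 0;
}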

At the same time, this is how fragments are used right now: every
m->m_data within a chain is its own fragment.  The sosend change
requires an additional change to support multiple segments for
each m->m_data and the use of additional descriptors to cover that.
While it's possible to do, this is a requirement that was pushed
on me without any notice.
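
As a sketch of the extra work this would involve (illustrative only, not
the xnf(4) code), one way to count the ring slots a chain needs once every
m->m_data region is split at page boundaries:

#include <sys/param.h>
#include <sys/mbuf.h>

/*
 * Count the ring slots (grant references) an mbuf chain would consume
 * if every m_data region were cut at 4k boundaries; anything above the
 * 18-chunk ceiling mentioned earlier would not fit one packet's SGL.
 */
int
xnf_chain_slots(struct mbuf *m0)
{
    struct mbuf *m;
    vaddr_t va, end, next;
    int slots = 0;

    for (m = m0; m != NULL; m = m->m_next) {
        if (m->m_len == 0)
            continue;
        va = mtod(m, vaddr_t);
        end = va + m->m_len;
        while (va < end) {
            /* cut the region at the next page boundary */
            next = trunc_page(va) + PAGE_SIZE;
            if (next > end)
                next = end;
            slots++;
            va = next;
        }
    }
    return (slots);
}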

Hope it clears the situation up.

> There may be old crufty
> stuff though that can't deal with it, but those probably already have
> "bcopy" drivers.  Now there may be drivers that don't enforce the
> boundary properly.  Those will mysteriously stop working.  Will we be
> able to fix all of those before 6.1 gets released?
> 

Since it depends on users providing test coverage, I wouldn't bet on it.



Re: Heads up to Xen users following -current

2016-08-30 Thread David Gwynne

> On 30 Aug 2016, at 16:31, Mark Kettenis  wrote:
> 
>> Date: Tue, 30 Aug 2016 07:48:09 +0200
>> From: Mike Belopuhov 
>> 
>> On Tue, Aug 30, 2016 at 09:58 +1000, David Gwynne wrote:
>>> On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
>>>> On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
>>>>> Due to a recent change in -current the socket sending routine
>>>>> has started producing small data packets crossing memory page
>>>>> boundary.  This is not supported by Xen and kernels with this
>>>>> change will experience broken bulk TCP transmit behaviour.
>>>>> We're working on fixing it.
>>>> 
>>>> For the same reason some old i386 machines from 2006 and 2005 have
>>>> performance problems when sending data with tcpbench.
>>>> 
>>>> em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
>>>> 10 MBit anymore.
>>>> 
>>>> With the patch below I get 946, 642, 422 MBit/sec output performance
>>>> over these chips respectively.
>>>> 
>>>> Don't know whether PAGE_SIZE is the correct fix as I think the problem
>>>> is more related to the network chip than to the processor's page
>>>> size.
>>> 
>>> does this diff help those chips?
>>> 
>> 
>> This diff defeats the purpose of the sosend change by punishing
>> every other chip not suffering from the aforementioned problem.
>> Lots of packets from the bulk TCP transfer will have to be
>> defragmented for no good reason.

the sosend change that demonstrated the performance difference was going to 
punish all chips instead of just the em(4) chips.

the em diff is quick and simple so we can see if the driver could be fixed 
without having to revert sosend. if it does work on bluhm's test systems, i was 
going to make the change only apply to the specific chips in question.

> 
> No, this em diff will still do proper scatter/gather.  It might
> consume more descriptors as it will use two descriptors for packets
> crossing a page boundary.  But the fact that we collect more data into
> an mbuf will actually reduce the number of descriptors in other cases.
> 
> Regarding the xnf(4) issue; I think any driver that can't properly
> deal with an mbuf crossing a page boundary is broken.  I can't think
> of any modern dma engine that can't handle that properly, or doesn't
> at least support scatter/gather of some sort.  There may be old crufty
> stuff though that can't deal with it, but those probably already have
> "bcopy" drivers.  Now there may be drivers that don't enforce the
> boundary properly.  Those will mysteriously stop working.  Will we be
> able to fix all of those before 6.1 gets released?

if a single packet can use multiple descriptors but each descriptor cannot 
cross a page boundary, then bus_dma is able to represent that just fine. if xnf 
can only use a single descriptor per packet, then it deserves bcopy.

dlg

> 
>>> Index: if_em.c
>>> ===
>>> RCS file: /cvs/src/sys/dev/pci/if_em.c,v
>>> retrieving revision 1.331
>>> diff -u -p -r1.331 if_em.c
>>> --- if_em.c 13 Apr 2016 10:34:32 -  1.331
>>> +++ if_em.c 29 Aug 2016 23:52:07 -
>>> @@ -2134,7 +2134,7 @@ em_setup_transmit_structures(struct em_s
>>> pkt = &sc->sc_tx_pkts_ring[i];
>>> error = bus_dmamap_create(sc->sc_dmat, MAX_JUMBO_FRAME_SIZE,
>>> EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1),
>>> -   MAX_JUMBO_FRAME_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
>>> +   MAX_JUMBO_FRAME_SIZE, 4096, BUS_DMA_NOWAIT, &pkt->pkt_map);
>>> if (error != 0) {
>>> printf("%s: Unable to create TX DMA map\n",
>>> DEVNAME(sc));



Re: Heads up to Xen users following -current

2016-08-30 Thread Mark Kettenis
> Date: Tue, 30 Aug 2016 07:48:09 +0200
> From: Mike Belopuhov 
> 
> On Tue, Aug 30, 2016 at 09:58 +1000, David Gwynne wrote:
> > On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
> > > On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
> > > > Due to a recent change in -current the socket sending routine
> > > > has started producing small data packets crossing memory page
> > > > boundary.  This is not supported by Xen and kernels with this
> > > > change will experience broken bulk TCP transmit behaviour.
> > > > We're working on fixing it.
> > > 
> > > For the same reason some old i386 machines from 2006 and 2005 have
> > > performance problems when sending data with tcpbench.
> > > 
> > > em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
> > > 10 MBit anymore.
> > > 
> > > With the patch below I get 946, 642, 422 MBit/sec output performance
> > > over these chips respectively.
> > > 
> > > Don't know whether PAGE_SIZE is the correct fix as I think the problem
> > > is more related to the network chip than to the processor's page
> > > size.
> > 
> > does this diff help those chips?
> >
> 
> This diff defeats the purpose of the sosend change by punishing
> every other chip not suffering from the aforementioned problem.
> Lots of packets from the bulk TCP transfer will have to be
> defragmented for no good reason.

No, this em diff will still do proper scatter/gather.  It might
consume more descriptors as it will use two descriptors for packets
crossing a page boundary.  But the fact that we collect more data into
an mbuf will actually reduce the number of descriptors in other cases.
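
As a worked example of that effect (standalone code with made-up addresses,
not anything from em(4)): with boundary = 4096 no DMA segment may cross a
4k line, so only buffers that actually straddle a page cost an extra
descriptor.

#include <stdio.h>

#define BOUNDARY 4096UL

/* number of segments a buffer [addr, addr+len) needs when no segment
 * may cross a BOUNDARY-aligned line */
static unsigned long
nsegs(unsigned long addr, unsigned long len)
{
    unsigned long first = BOUNDARY - (addr & (BOUNDARY - 1));

    if (len <= first)
        return 1;
    return 1 + (len - first + BOUNDARY - 1) / BOUNDARY;
}

int
main(void)
{
    /* a 1448-byte TCP payload starting 128 bytes before a page line */
    printf("%lu\n", nsegs(0x12f80UL, 1448));    /* -> 2 descriptors */
    /* the same payload starting exactly on a page line */
    printf("%lu\n", nsegs(0x13000UL, 1448));    /* -> 1 descriptor */
    return 0;
}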

Regarding the xnf(4) issue; I think any driver that can't properly
deal with an mbuf crossing a page boundary is broken.  I can't think
of any modern dma engine that can't handle that properly, or doesn't
at least support scatter/gather of some sort.  There may be old crufty
stuff though that can't deal with it, but those probably already have
"bcopy" drivers.  Now there may be drivers that don't enforce the
boundary properly.  Those will mysteriously stop working.  Will we be
able to fix all of those before 6.1 gets released?

> > Index: if_em.c
> > ===
> > RCS file: /cvs/src/sys/dev/pci/if_em.c,v
> > retrieving revision 1.331
> > diff -u -p -r1.331 if_em.c
> > --- if_em.c 13 Apr 2016 10:34:32 -  1.331
> > +++ if_em.c 29 Aug 2016 23:52:07 -
> > @@ -2134,7 +2134,7 @@ em_setup_transmit_structures(struct em_s
> > pkt = &sc->sc_tx_pkts_ring[i];
> > error = bus_dmamap_create(sc->sc_dmat, MAX_JUMBO_FRAME_SIZE,
> > EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1),
> > -   MAX_JUMBO_FRAME_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
> > +   MAX_JUMBO_FRAME_SIZE, 4096, BUS_DMA_NOWAIT, &pkt->pkt_map);
> > if (error != 0) {
> > printf("%s: Unable to create TX DMA map\n",
> > DEVNAME(sc));
> 
> 



Re: Heads up to Xen users following -current

2016-08-29 Thread David Gwynne
On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
> On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
> > Due to a recent change in -current the socket sending routine
> > has started producing small data packets crossing memory page
> > boundary.  This is not supported by Xen and kernels with this
> > change will experience broken bulk TCP transmit behaviour.
> > We're working on fixing it.
> 
> For the same reason some old i386 machines from 2006 and 2005 have
> performance problems when sending data with tcpbench.
> 
> em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
> 10 MBit anymore.
> 
> With the patch below I get 946, 642, 422 MBit/sec output performance
> over these chips respectively.
> 
> Don't know whether PAGE_SIZE is the correct fix as I think the problem
> is more related to the network chip than to the processor's page
> size.

does this diff help those chips?

Index: if_em.c
===
RCS file: /cvs/src/sys/dev/pci/if_em.c,v
retrieving revision 1.331
diff -u -p -r1.331 if_em.c
--- if_em.c 13 Apr 2016 10:34:32 -  1.331
+++ if_em.c 29 Aug 2016 23:52:07 -
@@ -2134,7 +2134,7 @@ em_setup_transmit_structures(struct em_s
pkt = &sc->sc_tx_pkts_ring[i];
error = bus_dmamap_create(sc->sc_dmat, MAX_JUMBO_FRAME_SIZE,
EM_MAX_SCATTER / (sc->pcix_82544 ? 2 : 1),
-   MAX_JUMBO_FRAME_SIZE, 0, BUS_DMA_NOWAIT, &pkt->pkt_map);
+   MAX_JUMBO_FRAME_SIZE, 4096, BUS_DMA_NOWAIT, &pkt->pkt_map);
if (error != 0) {
printf("%s: Unable to create TX DMA map\n",
DEVNAME(sc));



Re: Heads up to Xen users following -current

2016-08-29 Thread Alexander Bluhm
On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
> Due to a recent change in -current the socket sending routine
> has started producing small data packets crossing memory page
> boundary.  This is not supported by Xen and kernels with this
> change will experience broken bulk TCP transmit behaviour.
> We're working on fixing it.

For the same reason some old i386 machines from 2006 and 2005 have
performance problems when sending data with tcpbench.

em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
10 MBit anymore.

With the patch below I get 946, 642, 422 MBit/sec output performance
over these chips respectively.

Don't know whether PAGE_SIZE is the correct fix as I think the problem
is more related to the network chip than to the processor's page
size.

bluhm

Index: kern/uipc_socket.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/kern/uipc_socket.c,v
retrieving revision 1.155
diff -u -p -r1.155 uipc_socket.c
--- kern/uipc_socket.c  25 Aug 2016 14:13:19 -  1.155
+++ kern/uipc_socket.c  29 Aug 2016 18:02:24 -
@@ -544,7 +544,7 @@ m_getuio(struct mbuf **mp, int atomic, l
 
resid = ulmin(resid, space);
if (resid >= MINCLSIZE) {
-   MCLGETI(m, M_NOWAIT, NULL, ulmin(resid, MAXMCLBYTES));
+   MCLGETI(m, M_NOWAIT, NULL, ulmin(resid, PAGE_SIZE));
if ((m->m_flags & M_EXT) == 0)
goto nopages;
mlen = m->m_ext.ext_size;