Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-10-08 Thread Hans Petter Selasky

Hi,

I've now MFC'ed r287775 to 10-stable and 9-stable. I hope this will 
resolve the issues with m_defrag() being called on too long mbuf chains 
due to an off-by-one in the driver TSO parameters and that it will be 
easier to maintain these parameters in the future.


Some comments were made that we might want an option to select whether
the IP header should be counted or not. Certain network drivers require
copying the whole ETH/TCP/IP header into a separate memory area and can
then handle one more data payload mbuf for TSO; others require DMA-ing
the whole mbuf TSO chain. I think it is acceptable to leave one TX-DMA
segment slot free in the case where 2K mbuf clusters are used for TSO.
From my experience the limitation typically kicks in when 2K mbuf
clusters are used for TSO instead of 4K ones: 65536 / 4096 = 16
segments, whereas 65536 / 2048 = 32. If an ethernet hardware driver has
a limitation of 24 data segments (mlxen), and assuming that each mbuf
represents a single segment, then if the majority of mbufs being
transmitted are 2K clusters we may see a small, 1/24 = 4.2%, loss of TX
capability per TSO packet. From what I've seen using iperf, which in
turn calls m_uiotombuf(), which in turn calls m_getm2(),
MJUMPPAGESIZE'ed mbuf clusters are preferred for large data transfers,
so this issue should only show up when TCP_NODELAY is set on the socket
and the writes are small from the application's point of view. An
application writing small amounts of data per send() system call is
expected to degrade system performance anyway.
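
To make the arithmetic above concrete, here is a small user-space sketch
(illustrative only; the constants are hard-coded rather than taken from
the kernel headers):

	/* TSO segment-count arithmetic for 2K vs. 4K mbuf clusters. */
	#include <stdio.h>

	#define TSO_MAX_BYTES	65536	/* classic TSO burst limit */
	#define CLUSTER_2K	2048	/* MCLBYTES */
	#define CLUSTER_4K	4096	/* MJUMPPAGESIZE on 4K-page machines */
	#define HW_SEG_LIMIT	24	/* e.g. the mlxen data segment limit */

	int
	main(void)
	{
		printf("2K clusters: %d segments needed, hw allows %d\n",
		    TSO_MAX_BYTES / CLUSTER_2K, HW_SEG_LIMIT);	/* 32 > 24 */
		printf("4K clusters: %d segments needed, hw allows %d\n",
		    TSO_MAX_BYTES / CLUSTER_4K, HW_SEG_LIMIT);	/* 16 < 24 */
		printf("keeping one slot free costs at most 1/%d = %.1f%%\n",
		    HW_SEG_LIMIT, 100.0 / HW_SEG_LIMIT);
		return (0);
	}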


Please file a PR if it becomes an issue.

Someone asked me to MFC r287775 to the 10.X release branch as well. Is
this still required?


--HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-10-08 Thread Rick Macklem
Hans Petter Selasky wrote:
> Hi,
> 
> I've now MFC'ed r287775 to 10-stable and 9-stable. I hope this will
> resolve the issues with m_defrag() being called on too long mbuf chains
> due to an off-by-one in the driver TSO parameters and that it will be
> easier to maintain these parameters in the future.
> 
> [ ... discussion of 2K vs. 4K clusters and TSO segment limits trimmed;
>   see Hans Petter's message above for the full text ... ]
> 
Btw, last year I did some testing with NFS generating chains of 4K (page
size) clusters instead of 2K (MCLBYTES) ones. Although it was not easily
reproduced, I was able to fragment the KVM used for the cluster pool badly
enough that allocations would fail. (I could only get it to happen when
the code used 4K clusters for large NFS requests/replies and 2K clusters
otherwise, resulting in a mix of allocations of both sizes.) As such, I
never committed the changes to head.

Any kernel change that does 4K cluster allocations needs to be tested
carefully on a machine with limited KVM (a small i386 like I have), imho.
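
In case it helps anyone reading the archive, here is a minimal sketch of
what "generating chains of 4K clusters" looks like with m_getjcl(9); this
is illustrative only, not the actual experimental NFS code:

	/* Build an mbuf chain out of MJUMPPAGESIZE (4K) clusters. */
	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/mbuf.h>

	static struct mbuf *
	alloc_4k_chain(int len)
	{
		struct mbuf *top = NULL, *m, **mp = &top;

		while (len > 0) {
			m = m_getjcl(M_NOWAIT, MT_DATA,
			    top == NULL ? M_PKTHDR : 0, MJUMPPAGESIZE);
			if (m == NULL) {
				/* 4K allocations can fail once KVM is
				 * fragmented; the caller must cope. */
				m_freem(top);
				return (NULL);
			}
			m->m_len = min(len, MJUMPPAGESIZE);
			len -= m->m_len;
			*mp = m;
			mp = &m->m_next;
		}
		return (top);
	}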

> Please file a PR if it becomes an issue.
> 
> Someone asked me to MFC r287775 to the 10.X release branch as well. Is
> this still required?
> 
> --HPS

Thanks for doing this, rick



Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-25 Thread Daniel Braniss

On Aug 24, 2015, at 3:25 PM, Rick Macklem rmack...@uoguelph.ca wrote:
> Daniel Braniss wrote:
>> On 24 Aug 2015, at 10:22, Hans Petter Selasky h...@selasky.org wrote:
>>> On 08/24/15 01:02, Rick Macklem wrote:
>>>> The other thing is the degradation seems to cut the rate by about
>>>> half each time: 300 -> 150 -> 70. I have no idea if this helps to
>>>> explain it.
>>> 
>>> Might be a NUMA binding issue for the processes involved.
>>> 
>>> man cpuset
>>> 
>>> --HPS
>> 
>> I can't see how this is relevant, given that the same host, using the
>> mellanox/mlxen, behaves much better.
> Well, the ix driver has a bunch of tunables for things like number of
> queues, and although I'll admit I don't understand how these queues are
> used, I think they are related to CPUs and their caches. There is also
> something called IXGBE_FDIR, which others have recommended be disabled.
> (The code is #ifdef IXGBE_FDIR, but I don't know if it is defined for
> your kernel?) There are also tunables for interrupt rate and something
> called hw.ixgbe_tx_process_limit, which appears to limit the number of
> packets to send or something like that?
> (I suspect Hans would understand this stuff much better than I do,
> since I don't understand it at all. ;-)
 
but how does this explain the fact that, at the same time, the throughput
to the NetApp is about 70MB/s while to a FreeBSD server it's above
150MB/s? (window size negotiation?)
switching off TSO evens out this difference.

> At a glance, the mellanox driver looks very different.
> 
>> I'm getting different results with the intel/ix depending who is the
>> nfs server
> 
> Who knows until you figure out what is actually going on. It could just
> be the timing of handling the write RPCs, or when the different servers
> send acks for the TCP segments, or ... that causes this for one server
> and not another.
> 
> [ ... switch-debugging war story trimmed; quoted in full in Rick's
>   message of 2015-08-24 below ... ]
> 
> -- I am not suggesting you have a broken network switch, just don't
>    take anything off the table until you know what is actually going
>    on.
> 
> And to be honest, you may never know, but it is fun to try and solve
> these puzzles.

one needs to find the clues …
at the moment:
- when things go bad, they stay bad: ix/nfs/tcp/tso and the NetApp
- when things are ok, the numbers fluctuate, which is probably due to
  loads on the system, but they are far above the 70MB/s (100 to 200)

> Beyond what I already suggested, I'd look at the ix driver's stats and
> tunables and see if any of the tunables has an effect. (And, yes, it
> will take time to work through these.)
> Good luck with it, rick

danny

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-25 Thread Hans Petter Selasky

Hi,

I've made some minor modifications to the patch from Rick, and created
this review:

https://reviews.freebsd.org/D3477

--HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-24 Thread Rick Macklem
Daniel Braniss wrote:
> On 24 Aug 2015, at 10:22, Hans Petter Selasky h...@selasky.org wrote:
>> On 08/24/15 01:02, Rick Macklem wrote:
>>> The other thing is the degradation seems to cut the rate by about
>>> half each time: 300 -> 150 -> 70. I have no idea if this helps to
>>> explain it.
>> 
>> Might be a NUMA binding issue for the processes involved.
>> 
>> man cpuset
>> 
>> --HPS
> 
> I can't see how this is relevant, given that the same host, using the
> mellanox/mlxen, behaves much better.
Well, the ix driver has a bunch of tunables for things like number of
queues, and although I'll admit I don't understand how these queues are
used, I think they are related to CPUs and their caches. There is also
something called IXGBE_FDIR, which others have recommended be disabled.
(The code is #ifdef IXGBE_FDIR, but I don't know if it is defined for
your kernel?) There are also tunables for interrupt rate and something
called hw.ixgbe_tx_process_limit, which appears to limit the number of
packets to send or something like that?
(I suspect Hans would understand this stuff much better than I do, since
I don't understand it at all. ;-)

At a glance, the mellanox driver looks very different.

> I'm getting different results with the intel/ix depending who is the
> nfs server

Who knows until you figure out what is actually going on. It could just
be the timing of handling the write RPCs, or when the different servers
send acks for the TCP segments, or ... that causes this for one server
and not another.

One of the principles used when investigating airplane accidents is to
never assume anything and just try to collect the facts until the pieces
of the puzzle fall into place. I think the same principle works for this
kind of stuff.
I once had a case where a specific read of one NFS file would fail on
certain machines. I won't bore you with the details, but after weeks we
got to the point where we had a lab of identical machines (exactly the
same hardware and exactly the same software loaded on them) and we could
reproduce this problem on about half the machines and not the other
half. We (myself and the guy I worked with) finally noticed the failing
machines were on network ports of a given switch. We moved the net
cables to another switch and the problem went away.
-- This particular network switch was broken in such a way that it would
   garble one specific packet consistently, but worked fine for
   everything else.
My point here is that, if someone had suggested the network switch might
be broken at the beginning of investigating this, I would have probably
dismissed it, based on "the network is working just fine", but in the
end, that was the problem.
-- I am not suggesting you have a broken network switch, just don't take
   anything off the table until you know what is actually going on.

And to be honest, you may never know, but it is fun to try and solve
these puzzles.
Beyond what I already suggested, I'd look at the ix driver's stats and
tunables and see if any of the tunables has an effect. (And, yes, it
will take time to work through these.)

Good luck with it, rick

 
> danny

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-24 Thread Hans Petter Selasky

On 08/24/15 01:02, Rick Macklem wrote:
> The other thing is the degradation seems to cut the rate by about half
> each time: 300 -> 150 -> 70. I have no idea if this helps to explain it.

Might be a NUMA binding issue for the processes involved.

man cpuset
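
For example (an illustrative cpuset(1) invocation only; the CPU list and
pid are made up):

	# pin the process under test to CPUs 0-7, e.g. those local to the NIC
	cpuset -l 0-7 -p 1234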

--HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-24 Thread Daniel Braniss

On 24 Aug 2015, at 10:22, Hans Petter Selasky h...@selasky.org wrote:
> On 08/24/15 01:02, Rick Macklem wrote:
>> The other thing is the degradation seems to cut the rate by about half
>> each time: 300 -> 150 -> 70. I have no idea if this helps to explain
>> it.
> 
> Might be a NUMA binding issue for the processes involved.
> 
> man cpuset
> 
> --HPS

I can't see how this is relevant, given that the same host, using the
mellanox/mlxen, behaves much better.
I'm getting different results with the intel/ix depending who is the nfs
server.


danny


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-24 Thread Daniel Braniss

On 24 Aug 2015, at 02:02, Rick Macklem rmack...@uoguelph.ca wrote:
> 
> [ ... long quoted discussion of the TSO limits trimmed; see Rick's
>   messages of 2015-08-21/22 below for the full text ... ]
> 
>> send me the patch and I'll test it ASAP.
>>     danny
> 
> Patch is attached. The one for head will also include an update to the
> comment in sys/net/if_var.h, but that isn't needed for testing.

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-24 Thread Adrian Chadd
Hi,

Some hand-waving suggestions:

* if you're running something before 10.2, please disable IXGBE_FDIR
in sys/conf/options and sys/modules/ixgbe/Makefile. It's buggy and it
caused a lot of issues.
* It sounds like some extra latency is happening, so I'd fiddle around
with interrupt settings. By default the driver does something called
adaptive interrupt moderation (AIM) and it may be getting in the way of
what you're trying to do. There's a way to disable AIM in
/boot/loader.conf and manually set the interrupt rate (see the sketch
after this list).
* As others have said, TSO has been a bit of a problem - hps has been
working on solidifying the TSO configuration side of things so NICs
advertise to the stack what their maximum offload capability is, so
things like NFS and TCP don't exceed the segment count. I don't know
if it's tunable without hacking the driver, but maybe hack the driver
to reduce the count a little to make sure you're not overflowing
things and causing it to fall back to a slower path (where it copies
all the mbufs into a single larger one to send to the NIC.)
* Disable software LRO and see if it helps. Since you're doing lots of
little non-streaming operations, it may actually be hindering.
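
A hedged example of the AIM/interrupt-rate knobs mentioned above (these
are the tunable names I believe the 10.x-era ix(4) driver exposes; please
verify with "sysctl hw.ix" on your system, and treat the values as
starting points only):

	# /boot/loader.conf
	hw.ix.enable_aim=0              # turn off adaptive interrupt moderation
	hw.ix.max_interrupt_rate=31250  # fixed per-queue interrupt rate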

HTH,


-adrian


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-23 Thread Daniel Braniss

On 22 Aug 2015, at 14:59, Rick Macklem rmack...@uoguelph.ca wrote:
> 
> [ ... long quoted discussion of the TSO limits trimmed; see Rick's
>   message of 2015-08-21 below for the full text ... ]
> 
>> send me the patch and I'll test it ASAP.
>>     danny
> 
> Patch is attached. The one for head will also include an update to the
> comment in sys/net/if_var.h, but that isn't needed for testing.

well, the plot thickens.

Yesterday, before running the new kernel, I decided to re-run my test,
and to my surprise I was getting good numbers, about 300MB/s with and
without TSO.

this morning, the numbers were again bad, around 70MB/s, what the ^%$#@!

so, after some coffee, I ran some more tests, and some conclusions:
using a netapp(*) as the nfs client:
  - doing
	ifconfig ix0 tso or -tso
    does some magic and numbers are back to normal - for a while

using another FreeBSD/zfs as client all is nifty, actually a bit faster
than the netapp (not a fair comparison, since the zfs client is not
heavily used) and I can't see any degradation.

btw, this is with the patch applied, but I was seeing similar numbers
before the patch.

running with tso, initially I get around 300MB/s, but after a while
(sorry, I can't be more scientific) it drops down to about half, and
finally to a pathetic 70MB/s.

*: while running the tests I monitored the NetApp, and nothing out of
the ordinary there.

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-23 Thread Slawa Olhovchenkov
On Sun, Aug 23, 2015 at 02:08:56PM +0300, Daniel Braniss wrote:

> [ ... test results trimmed; quoted in full in Daniel's message above ... ]

Can you do this
https://lists.freebsd.org/pipermail/freebsd-stable/2015-August/083138.html ?


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-23 Thread Rick Macklem
Daniel Braniss wrote:
> 
> On 22 Aug 2015, at 14:59, Rick Macklem rmack...@uoguelph.ca wrote:
>> 
>> [ ... long quoted discussion of the TSO limits trimmed; see Rick's
>>   messages of 2015-08-21/22 below for the full text ... ]
> 
> send me the patch and I'll test it ASAP.

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-22 Thread Rick Macklem
Daniel Braniss wrote:
> 
> On Aug 22, 2015, at 12:46 AM, Rick Macklem rmack...@uoguelph.ca wrote:
>> 
>> [ ... long quoted discussion of the TSO limits trimmed; see Rick's
>>   message of 2015-08-21 below for the full text ... ]
>> 
>> I am hoping Daniel Braniss will be able to test the patch and let us
>> know if it improves performance with TSO enabled?
> 
> send me the patch and I'll test it ASAP.
>     danny
 
Patch is attached. The one for head will also include an update to the comment
in sys/net/if_var.h, but that isn't needed for testing.

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-22 Thread Daniel Braniss


On Aug 22, 2015, at 12:46 AM, Rick Macklem rmack...@uoguelph.ca wrote:
> 
> [ ... long quoted discussion of the TSO limits trimmed; see Rick's
>   message of 2015-08-21 below for the full text ... ]
> 
> I am hoping Daniel Braniss will be able to test the patch and let us
> know if it improves performance with TSO enabled?

send me the patch and I’ll test it ASAP.
danny

 
> rick

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-21 Thread Rick Macklem
Yonghyeon PYUN wrote:
> On Wed, Aug 19, 2015 at 09:00:35AM -0400, Rick Macklem wrote:
>> Hans Petter Selasky wrote:
>>> 
>>> [ ... earlier exchange between Rick, Hans Petter and Pyun trimmed;
>>>   see the messages of 2015-08-19 below for the full text ... ]
>>> 
>>> Don't forget that not all drivers in the tree set the TSO limits
>>> before if_attach(), so possibly the subtraction of one TSO fragment
>>> needs to go into ip_output() ...
>>> 
>> Ok, I realized that some drivers may not know the answers before
>> ether_ifattach(), due to the way they are configured/written (I saw
>> the use of if_hw_tsomax_update() in the patch).
> 
> I was not able to find an interface that configures TSO parameters
> after the if_t conversion.  I'm under the impression
> if_hw_tsomax_update() is not designed to be used this way.  Probably
> we need a better one? (CCed to Gleb).
> 
>> If it is subtracted as a part of the assignment to
>> if_hw_tsomaxsegcount at line #791 in tcp_output() like the following,
>> I don't think it should matter if the values are set before
>> ether_ifattach()?
>> 
>> 	/*
>> 	 * Subtract 1 for the tcp/ip header mbuf that
>> 	 * will be prepended to the mbuf chain in this
>> 	 * function in the code below this block.
>> 	 */
>> 	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1;
>> 
>> I don't have a good solution for the case where a driver doesn't plan
>> on using the tcp/ip header provided by tcp_output() except to say the
>> driver can add one to the setting to compensate for that (and if they
>> fail to do so, it still works, although somewhat suboptimally). When
>> I now read the comment in sys/net/if_var.h it is clear what it means,
>> but for some reason I didn't read it that way before? (I think it was
>> the part that said the driver didn't have to subtract for the headers
>> that confused me?)
>> In any case, we need to try and come up with a clear definition of
>> what they need to be set to.
>> 
>> I can now think of two ways to deal with this:
>> 1 - Leave tcp_output() as is, but provide a macro for the device
>>     driver authors to use that sets if_hw_tsomaxsegcount with a flag
>>     for "driver uses tcp/ip header mbuf", documenting that this flag
>>     should normally be true.
>> OR
>> 2 - Change tcp_output() as above, noting that this is a workaround
>>     for confusion w.r.t. whether or not if_hw_tsomaxsegcount should
>>     include the tcp/ip header mbuf, and update the comment in
>>     if_var.h to reflect this. Then drivers that don't use the tcp/ip
>>     header mbuf can increase their value for if_hw_tsomaxsegcount
>>     by 1.
>>     (The comment should also mention that a value of 35 or greater is
>>     much preferred to 32 if the hardware will support that.)
>> 
> Both works for me.  My preference is 2 just because it's very
> common for most drivers that use tcp/ip header mbuf.
Thanks for this comment. I tend to agree, both for the reason you state and also
because the patch is simple enough that it might qualify as an errata for 10.2.

I am hoping Daniel Braniss will be able to test the patch and let us know if it
improves performance with TSO enabled?

rick


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-20 Thread Gleb Smirnoff
  Yonghyeon,

On Thu, Aug 20, 2015 at 11:30:24AM +0900, Yonghyeon PYUN wrote:
Y> [ ... earlier discussion trimmed; see the messages of 2015-08-19/20
Y>   below for the full text ... ]
Y> 
Y> I was not able to find an interface that configures TSO parameters
Y> after the if_t conversion.  I'm under the impression
Y> if_hw_tsomax_update() is not designed to be used this way.  Probably
Y> we need a better one? (CCed to Gleb).

Yes. In projects/ifnet all the TSO stuff is configured differently.

I'd really appreciate it if other developers would look there and review
it, try it, give some input.

Here is a snippet from net/if.h in projects/ifnet:

/*
 * Structure describing TSO properties of an interface.  Known both to
 * ifnet layer and TCP.  Most interfaces point to a static tsomax in the
 * ifdriver definition.  However, vlan(4) and lagg(4) require a dynamic
 * tsomax.
 */
struct iftsomax {
	uint32_t tsomax_bytes;    /* TSO total burst length limit in bytes */
	uint32_t tsomax_segcount; /* TSO maximum segment count */
	uint32_t tsomax_segsize;  /* TSO maximum segment size in bytes */
};

Now closer to your original question. I haven't yet converted lagg(4), so
haven't yet worked on if_hw_tsomax_update(). I am convinced that it
shouldn't be needed for a regular driver (save lagg(4)).

A proper driver should first study its hardware and only then call
if_attach(). Correct me if I am wrong, please.

Also, I suppose, that a piece of hardware can't change its TSO maximums
at runtime, so I don't see a reason for changing them at runtime (save
lagg(4)).
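
As a purely illustrative example of the "static tsomax in the ifdriver
definition" idea (the driver name is hypothetical and the hookup into
struct ifdriver is omitted, since its layout isn't shown here):

	/* Static TSO limits for a hypothetical foo(4) driver. */
	static const struct iftsomax foo_tsomax = {
		.tsomax_bytes = 65536,		/* total burst limit */
		.tsomax_segcount = 32,		/* max segments per burst */
		.tsomax_segsize = 4096,		/* max bytes per segment */
	};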

-- 
Totus tuus, Glebius.


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-20 Thread Rick Macklem
Yonghyeon PYUN wrote:
> On Wed, Aug 19, 2015 at 08:13:59AM -0400, Rick Macklem wrote:
>> Yonghyeon PYUN wrote:
>>> On Wed, Aug 19, 2015 at 09:51:44AM +0200, Hans Petter Selasky wrote:
>>>> 
>>>> [ ... earlier exchange trimmed; see the messages of 2015-08-19 below
>>>>   for the full text ... ]
>>>> 
>>>> Hi,
>>>> 
>>>> If you change the behaviour don't forget to update and/or add
>>>> comments describing it. Maybe the amount of subtraction could be
>>>> defined by some macro? Then drivers which inline the headers can
>>>> subtract it?
>>> 
>>> I'm also ok with your suggestion.
>>> 
>>>> Your suggestion is fine by me.
>>>> 
>>>> The initial TSO limits were tried to be preserved, and I believe
>>>> that TSO limits never accounted for IP/TCP/ETHERNET/VLAN headers!
>>> 
>>> I guess FreeBSD used to follow the MS LSOv1 specification with a
>>> minor exception in pseudo checksum computation. If I recall
>>> correctly the specification says the upper stack can generate up to
>>> an IP_MAXPACKET sized packet.  Other L2 headers like the
>>> ethernet/vlan header are not included in the packet and it's the
>>> driver's responsibility to allocate additional DMA buffers/segments
>>> for L2 headers.
>>> 
>> Yep. The default for if_hw_tsomax was reduced from IP_MAXPACKET to
>> 32 * MCLBYTES - max_ethernet_header_size as a workaround/hack so that
>> devices limited to 32 transmit segments would work (ie. the entire
>> packet, including the MAC header, would fit in 32 MCLBYTE clusters).
>> This implied that many drivers did end up using m_defrag() to copy
>> the mbuf list to one made up of 32 MCLBYTE clusters.
>> 
>> If a driver sets if_hw_tsomaxsegcount correctly, then it can set
>> if_hw_tsomax to whatever it can handle as the largest TSO packet
>> (without MAC header) the hardware can handle. If it can handle
>> > IP_MAXPACKET, then it can set it to that.
>> 
> I thought the upper limit was still IP_MAXPACKET. If a driver
> increases it (i.e. > IP_MAXPACKET), the length field in the IP
> header would overflow, which in turn may break firewalls and other
> packet handling in the IPv4/IPv6 code path.
I have no idea if a bogus value in the ip_len field of the TSO segment
would break something in ip_output() or not. This would need to be
checked before anyone configures if_hw_tsomax > IP_MAXPACKET. I didn't
think of any effect this would have in ip_output(); I just knew that the
hardware would be replacing ip_len when it generated the TCP/IP segments
from the TSO segment. As you note, I vaguely recall some hardware being
able to handle a TSO segment > IP_MAXPACKET (presumably getting the TSO
segment's length some other way).

It would be nice if this were checked, but yes, the comment should
specify an upper bound on if_hw_tsomax of IP_MAXPACKET until then.
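
Until that is checked, a conservative driver could simply clamp its
setting (a hedged sketch against the stable/10 ifnet fields discussed in
this thread, not committed code):

	/* Stay within IP_MAXPACKET until >64K TSO is verified safe. */
	if (ifp->if_hw_tsomax > IP_MAXPACKET)
		ifp->if_hw_tsomax = IP_MAXPACKET;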

rick

> If the limit no longer applies to the network stack, that's great.
> Some controllers can handle up to 256KB TCP/UDP segmentation and
> supporting that feature wouldn't be hard.


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Rick Macklem
Hans Petter Selasky wrote:
> On 08/19/15 09:42, Yonghyeon PYUN wrote:
>> On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
>>> On 08/18/15 23:54, Rick Macklem wrote:
>>>> Ouch! Yes, I now see that the code that counts the # of mbufs is
>>>> before the code that adds the tcp/ip header mbuf.
>>>> 
>>>> In my opinion, this should be fixed by setting if_hw_tsomaxsegcount
>>>> to whatever the driver provides - 1. It is not the driver's
>>>> responsibility to know if a tcp/ip header mbuf will be added, and
>>>> it is a lot less confusing than expecting the driver author to know
>>>> to subtract one. (I had mistakenly thought that tcp_output() had
>>>> added the tcp/ip header mbuf before the loop that counts mbufs in
>>>> the list. Btw, this tcp/ip header mbuf also has leading space for
>>>> the MAC layer header.)
>>> 
>>> Hi Rick,
>>> 
>>> Your question is good. With the Mellanox hardware we have separate
>>> so-called inline data space for the TCP/IP headers, so if the TCP
>>> stack subtracts something, then we would need to add something to
>>> the limit, because then the scatter gather list is only used for the
>>> data part.
>> 
>> I think all drivers in tree don't subtract 1 for
>> if_hw_tsomaxsegcount.  Probably touching the Mellanox driver would be
>> simpler than fixing all other drivers in tree.
>> 
>>> Maybe it can be controlled by some kind of flag, if all the three
>>> TSO limits should include the TCP/IP/ethernet headers too. I'm
>>> pretty sure we want both versions.
>> 
>> Hmm, I'm afraid it's already complex.  Drivers have to tell almost
>> the same information to both bus_dma(9) and the network stack.
> 
> Don't forget that not all drivers in the tree set the TSO limits
> before if_attach(), so possibly the subtraction of one TSO fragment
> needs to go into ip_output() ...

I think setting them before a call to ether_ifattach() should be
required and any driver that doesn't do that needs to be fixed.
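
To make the ordering concrete, a minimal sketch of a driver publishing
its limits before ether_ifattach() (stable/10-era ifnet fields;
FOO_HW_NSEGS and sc->eaddr are made-up names, and whether to subtract one
for the prepended tcp/ip header mbuf is exactly the question being
discussed in this thread):

	/* In the attach routine, before ether_ifattach(). */
	ifp->if_hw_tsomax = IP_MAXPACKET;		/* burst length limit */
	ifp->if_hw_tsomaxsegcount = FOO_HW_NSEGS - 1;	/* one slot reserved for
							   the tcp/ip header mbuf */
	ifp->if_hw_tsomaxsegsize = MJUMPPAGESIZE;	/* per-segment DMA limit */
	ether_ifattach(ifp, sc->eaddr);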

Also, I notice that 32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN)
is getting written as 65536 - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN),
which obscures the reason it is the default. It probably isn't the
correct default for any driver that sets if_hw_tsomaxsegcount, but it is
close to IP_MAXPACKET, so the breakage is mostly theoretical.
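
Spelled out with its named constants, the default being discussed would
read something like this (illustrative; the macro name may differ in
your tree):

	#define IF_HW_TSOMAX \
	    (32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN))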

rick

> --HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Daniel Braniss

 On 19 Aug 2015, at 16:00, Rick Macklem rmack...@uoguelph.ca wrote:
 
 Hans Petter Selasky wrote:
 On 08/19/15 09:42, Yonghyeon PYUN wrote:
 On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
 On 08/18/15 23:54, Rick Macklem wrote:
 Ouch! Yes, I now see that the code that counts the # of mbufs is before
 the
 code that adds the tcp/ip header mbuf.
 
 In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
 whatever
 the driver provides - 1. It is not the driver's responsibility to know if
 a tcp/ip
 header mbuf will be added and is a lot less confusing than expecting the
 driver
 author to know to subtract one. (I had mistakenly thought that
 tcp_output() had
 added the tcp/ip header mbuf before the loop that counts mbufs in the
 list.
 Btw,
 this tcp/ip header mbuf also has leading space for the MAC layer header.)
 
 
 Hi Rick,
 
 Your question is good. With the Mellanox hardware we have separate
 so-called inline data space for the TCP/IP headers, so if the TCP stack
 subtracts something, then we would need to add something to the limit,
 because then the scatter gather list is only used for the data part.
 
 
 I think all drivers in tree don't subtract 1 for
 if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
 simpler than fixing all other drivers in tree.
 
 Maybe it can be controlled by some kind of flag, if all the three TSO
 limits should include the TCP/IP/ethernet headers too. I'm pretty sure
 we want both versions.
 
 
 Hmm, I'm afraid it's already complex.  Drivers have to tell almost
 the same information to both bus_dma(9) and network stack.
 
 Don't forget that not all drivers in the tree set the TSO limits before
 if_attach(), so possibly the subtraction of one TSO fragment needs to go
 into ip_output() 
 
 Ok, I realized that some drivers may not know the answers before 
 ether_ifattach(),
 due to the way they are configured/written (I saw the use of 
 if_hw_tsomax_update()
 in the patch).
 
 If it is subtracted as a part of the assignment to if_hw_tsomaxsegcount
 at line #791 in tcp_output(), like the following, I don't think it should 
 matter if the
 values are set before ether_ifattach()?
	/*
	 * Subtract 1 for the tcp/ip header mbuf that
	 * will be prepended to the mbuf chain in this
	 * function in the code below this block.
	 */
	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1;
 
 I don't have a good solution for the case where a driver doesn't plan on 
 using the
 tcp/ip header provided by tcp_output() except to say the driver can add one 
 to the
 setting to compensate for that (and if they fail to do so, it still works, 
 although
 somewhat suboptimally). When I now read the comment in sys/net/if_var.h it is 
 clear
 what it means, but for some reason I didn't read it that way before? (I think 
 it was
 the part that said the driver didn't have to subtract for the headers that 
 confused me?)
 In any case, we need to try and come up with a clear definition of what they 
 need to
 be set to.
 
 I can now think of two ways to deal with this:
 1 - Leave tcp_output() as is, but provide a macro for the device driver 
 authors to use
that sets if_hw_tsomaxsegcount with a flag for driver uses tcp/ip header 
 mbuf,
documenting that this flag should normally be true.
 OR
 2 - Change tcp_output() as above, noting that this is a workaround for 
 confusion w.r.t.
whether or not if_hw_tsomaxsegcount should include the tcp/ip header mbuf 
 and
update the comment in if_var.h to reflect this. Then drivers that don't 
 use the
tcp/ip header mbuf can increase their value for if_hw_tsomaxsegcount by 1.
(The comment should also mention that a value of 35 or greater is much 
 preferred to
 32 if the hardware will support that.)
 
 Also, I'd like to apologize for some of my emails getting a little blunt. I 
 just find
 it frustrating that this problem is still showing up and is even in 10.2. 
 This is partly
 my fault for not making it clearer to driver authors what 
 if_hw_tsomaxsegcount should be
 set to, because I had it incorrect.
 
 Hopefully we can come up with a solution that everyone is comfortable with, 
 rick


ok guys,
when you have some code for me to try just let me know.

danny



Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Rick Macklem
Hans Petter Selasky wrote:
 On 08/19/15 09:42, Yonghyeon PYUN wrote:
  On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
  On 08/18/15 23:54, Rick Macklem wrote:
  Ouch! Yes, I now see that the code that counts the # of mbufs is before
  the
  code that adds the tcp/ip header mbuf.
 
  In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
  whatever
  the driver provides - 1. It is not the driver's responsibility to know if
  a tcp/ip
  header mbuf will be added and is a lot less confusing than expecting the
  driver
  author to know to subtract one. (I had mistakenly thought that
  tcp_output() had
  added the tcp/ip header mbuf before the loop that counts mbufs in the
  list.
  Btw,
  this tcp/ip header mbuf also has leading space for the MAC layer header.)
 
 
  Hi Rick,
 
  Your question is good. With the Mellanox hardware we have separate
  so-called inline data space for the TCP/IP headers, so if the TCP stack
  subtracts something, then we would need to add something to the limit,
  because then the scatter gather list is only used for the data part.
 
 
  I think all drivers in tree don't subtract 1 for
  if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
  simpler than fixing all other drivers in tree.
 
  Maybe it can be controlled by some kind of flag, if all the three TSO
  limits should include the TCP/IP/ethernet headers too. I'm pretty sure
  we want both versions.
 
 
  Hmm, I'm afraid it's already complex.  Drivers have to tell almost
  the same information to both bus_dma(9) and network stack.
 
 Don't forget that not all drivers in the tree set the TSO limits before
 if_attach(), so possibly the subtraction of one TSO fragment needs to go
 into ip_output() 
 
Ok, I realized that some drivers may not know the answers before 
ether_ifattach(),
due to the way they are configured/written (I saw the use of 
if_hw_tsomax_update()
in the patch).

If it is subtracted as a part of the assignment to if_hw_tsomaxsegcount
at line #791 in tcp_output(), like the following, I don't think it should matter 
if the
values are set before ether_ifattach()?
	/*
	 * Subtract 1 for the tcp/ip header mbuf that
	 * will be prepended to the mbuf chain in this
	 * function in the code below this block.
	 */
	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1;

I don't have a good solution for the case where a driver doesn't plan on using 
the
tcp/ip header provided by tcp_output() except to say the driver can add one to 
the
setting to compensate for that (and if they fail to do so, it still works, 
although
somewhat suboptimally). When I now read the comment in sys/net/if_var.h it is 
clear
what it means, but for some reason I didn't read it that way before? (I think 
it was
the part that said the driver didn't have to subtract for the headers that 
confused me?)
In any case, we need to try and come up with a clear definition of what they 
need to
be set to.

I can now think of two ways to deal with this:
1 - Leave tcp_output() as is, but provide a macro for the device driver authors 
to use
that sets if_hw_tsomaxsegcount with a flag for driver uses tcp/ip header 
mbuf,
documenting that this flag should normally be true.
OR
2 - Change tcp_output() as above, noting that this is a workaround for 
confusion w.r.t.
whether or not if_hw_tsomaxsegcount should include the tcp/ip header mbuf 
and
update the comment in if_var.h to reflect this. Then drivers that don't use 
the
tcp/ip header mbuf can increase their value for if_hw_tsomaxsegcount by 1.
(The comment should also mention that a value of 35 or greater is much 
preferred to
 32 if the hardware will support that.)

Also, I'd like to apologize for some of my emails getting a little blunt. I 
just find
it frustrating that this problem is still showing up and is even in 10.2. This 
is partly
my fault for not making it clearer to driver authors what if_hw_tsomaxsegcount 
should be
set to, because I had it incorrect.

Hopefully we can come up with a solution that everyone is comfortable with, rick

 --HPS
 


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Rick Macklem
Hans Petter Selasky wrote:
 On 08/19/15 09:42, Yonghyeon PYUN wrote:
  On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
  On 08/18/15 23:54, Rick Macklem wrote:
  Ouch! Yes, I now see that the code that counts the # of mbufs is before
  the
  code that adds the tcp/ip header mbuf.
 
  In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
  whatever
  the driver provides - 1. It is not the driver's responsibility to know if
  a tcp/ip
  header mbuf will be added and is a lot less confusing than expecting the
  driver
  author to know to subtract one. (I had mistakenly thought that
  tcp_output() had
  added the tcp/ip header mbuf before the loop that counts mbufs in the
  list.
  Btw,
  this tcp/ip header mbuf also has leading space for the MAC layer header.)
 
 
  Hi Rick,
 
  Your question is good. With the Mellanox hardware we have separate
  so-called inline data space for the TCP/IP headers, so if the TCP stack
  subtracts something, then we would need to add something to the limit,
  because then the scatter gather list is only used for the data part.
 
 
  I think all drivers in tree don't subtract 1 for
  if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
  simpler than fixing all other drivers in tree.
 
  Maybe it can be controlled by some kind of flag, if all the three TSO
  limits should include the TCP/IP/ethernet headers too. I'm pretty sure
  we want both versions.
 
 
  Hmm, I'm afraid it's already complex.  Drivers have to tell almost
  the same information to both bus_dma(9) and network stack.
 
 Don't forget that not all drivers in the tree set the TSO limits before
 if_attach(), so possibly the subtraction of one TSO fragment needs to go
 into ip_output() 
 
I don't really care where it gets subtracted, so long as it is subtracted
at least by default, so all the drivers that don't subtract it get fixed.

However, I might argue that tcp_output() is the correct place, since 
tcp_output()
is where the tcp/ip header mbuf is prepended to the list.
The subtraction is just taking into account the mbuf that tcp_output() will be
adding to the head of the list and it should count that in the while() loop.
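
Concretely, the whole change would amount to one line in tcp_output() (a
sketch; the exact line number varies between branches):

	/* Reserve one segment for the TCP/IP header mbuf prepended below. */
	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1;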

rick

 --HPS
 


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Rick Macklem
Yonghyeon PYUN wrote:
 On Wed, Aug 19, 2015 at 09:51:44AM +0200, Hans Petter Selasky wrote:
  On 08/19/15 09:42, Yonghyeon PYUN wrote:
  On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
  On 08/18/15 23:54, Rick Macklem wrote:
  Ouch! Yes, I now see that the code that counts the # of mbufs is before
  the
  code that adds the tcp/ip header mbuf.
  
  In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
  whatever
  the driver provides - 1. It is not the driver's responsibility to know
  if
  a tcp/ip
  header mbuf will be added and is a lot less confusing than expecting the
  driver
  author to know to subtract one. (I had mistakenly thought that
  tcp_output() had
  added the tcp/ip header mbuf before the loop that counts mbufs in the
  list.
  Btw,
  this tcp/ip header mbuf also has leading space for the MAC layer
  header.)
  
  
  Hi Rick,
  
  Your question is good. With the Mellanox hardware we have separate
  so-called inline data space for the TCP/IP headers, so if the TCP stack
  subtracts something, then we would need to add something to the limit,
  because then the scatter gather list is only used for the data part.
  
  
  I think all drivers in tree don't subtract 1 for
  if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
  simpler than fixing all other drivers in tree.
  
  Hi,
  
  If you change the behaviour don't forget to update and/or add comments
  describing it. Maybe the amount of subtraction could be defined by some
  macro? Then drivers which inline the headers can subtract it?
  
 
 I'm also ok with your suggestion.
 
  Your suggestion is fine by me.
  
 
  An attempt was made to preserve the initial TSO limits, and I believe that
  TSO limits never accounted for IP/TCP/ETHERNET/VLAN headers!
  
 
 I guess FreeBSD used to follow MS LSOv1 specification with minor
 exception in pseudo checksum computation. If I recall correctly the
 specification says upper stack can generate up to IP_MAXPACKET sized
 packet.  L2 headers like the ethernet/vlan header are not
 included in that size, and it is the driver's responsibility to allocate
 additional DMA buffers/segments for L2 headers.
 
Yep. The default for if_hw_tsomax was reduced from IP_MAXPACKET to
  32 * MCLBYTES - max_ethernet_header_size as a workaround/hack so that
devices limited to 32 transmit segments would work (ie. the entire packet,
including MAC header would fit in 32 MCLBYTE clusters).
This implied that many drivers did end up using m_defrag() to copy the mbuf
list to one made up of 32 MCLBYTE clusters.

If a driver sets if_hw_tsomaxsegcount correctly, then it can set if_hw_tsomax
to whatever it can handle as the largest TSO packet (without MAC header) the
hardware can handle. If it can handle > IP_MAXPACKET, then it can set it to 
that.
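
As a sketch, a hypothetical controller with a 24-entry gather list that
prefers 4K clusters might then set (illustrative values, not from any
in-tree driver):

	ifp->if_hw_tsomaxsegcount = 24 - 1;	/* reserve the header mbuf */
	ifp->if_hw_tsomaxsegsize = MJUMPAGESIZE;	/* 4K payload segments */
	ifp->if_hw_tsomax = IP_MAXPACKET;	/* largest TSO payload, in bytes */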

rick

  
  Maybe it can be controlled by some kind of flag, if all the three TSO
  limits should include the TCP/IP/ethernet headers too. I'm pretty sure
  we want both versions.
  
  
  Hmm, I'm afraid it's already complex.  Drivers have to tell almost
  the same information to both bus_dma(9) and network stack.
  
  You're right it's complicated. Not sure if bus_dma can provide an API
  for this though.
  
  --HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Rick Macklem
Daniel Braniss wrote:
 
  On 19 Aug 2015, at 16:00, Rick Macklem rmack...@uoguelph.ca wrote:
  
  Hans Petter Selasky wrote:
  On 08/19/15 09:42, Yonghyeon PYUN wrote:
  On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
  On 08/18/15 23:54, Rick Macklem wrote:
  Ouch! Yes, I now see that the code that counts the # of mbufs is before
  the
  code that adds the tcp/ip header mbuf.
  
  In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
  whatever
  the driver provides - 1. It is not the driver's responsibility to know
  if
  a tcp/ip
  header mbuf will be added and is a lot less confusing than expecting
  the
  driver
  author to know to subtract one. (I had mistakenly thought that
  tcp_output() had
  added the tcp/ip header mbuf before the loop that counts mbufs in the
  list.
  Btw,
  this tcp/ip header mbuf also has leading space for the MAC layer
  header.)
  
  
  Hi Rick,
  
  Your question is good. With the Mellanox hardware we have separate
  so-called inline data space for the TCP/IP headers, so if the TCP stack
  subtracts something, then we would need to add something to the limit,
  because then the scatter gather list is only used for the data part.
  
  
  I think all drivers in tree don't subtract 1 for
  if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
  simpler than fixing all other drivers in tree.
  
  Maybe it can be controlled by some kind of flag, if all the three TSO
  limits should include the TCP/IP/ethernet headers too. I'm pretty sure
  we want both versions.
  
  
  Hmm, I'm afraid it's already complex.  Drivers have to tell almost
  the same information to both bus_dma(9) and network stack.
  
  Don't forget that not all drivers in the tree set the TSO limits before
  if_attach(), so possibly the subtraction of one TSO fragment needs to go
  into ip_output() 
  
  Ok, I realized that some drivers may not know the answers before
  ether_ifattach(),
  due to the way they are configured/written (I saw the use of
  if_hw_tsomax_update()
  in the patch).
  
  If it is subtracted as a part of the assignment to if_hw_tsomaxsegcount
  at line #791 in tcp_output(), like the following, I don't think it should
  matter if the
  values are set before ether_ifattach()?
	/*
	 * Subtract 1 for the tcp/ip header mbuf that
	 * will be prepended to the mbuf chain in this
	 * function in the code below this block.
	 */
	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1;
  
Well, you can replace the line in sys/netinet/tcp_output.c that looks like:
if_hw_tsomaxsegcount = tp->t_tsomaxsegcount;
with the above line (at line #797 in head).

Any other patch for this will have the same effect, rick

  I don't have a good solution for the case where a driver doesn't plan on
  using the
  tcp/ip header provided by tcp_output() except to say the driver can add one
  to the
  setting to compensate for that (and if they fail to do so, it still works,
  although
  somewhat suboptimally). When I now read the comment in sys/net/if_var.h it
  is clear
  what it means, but for some reason I didn't read it that way before? (I
  think it was
  the part that said the driver didn't have to subtract for the headers that
  confused me?)
  In any case, we need to try and come up with a clear definition of what
  they need to
  be set to.
  
  I can now think of two ways to deal with this:
  1 - Leave tcp_output() as is, but provide a macro for the device driver
  authors to use
 that sets if_hw_tsomaxsegcount with a flag for driver uses tcp/ip
 header mbuf,
 documenting that this flag should normally be true.
  OR
  2 - Change tcp_output() as above, noting that this is a workaround for
  confusion w.r.t.
 whether or not if_hw_tsomaxsegcount should include the tcp/ip header
 mbuf and
 update the comment in if_var.h to reflect this. Then drivers that don't
 use the
 tcp/ip header mbuf can increase their value for if_hw_tsomaxsegcount by
 1.
 (The comment should also mention that a value of 35 or greater is much
 preferred to
  32 if the hardware will support that.)
  
  Also, I'd like to apologize for some of my emails getting a little blunt.
  I just find
  it frustrating that this problem is still showing up and is even in 10.2.
  This is partly
  my fault for not making it clearer to driver authors what
  if_hw_tsomaxsegcount should be
  set to, because I had it incorrect.
  
  Hopefully we can come up with a solution that everyone is comfortable with,
  rick
 
 
 ok guys,
 when you have some code for me to try just let me know.
 
 danny
 

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Yonghyeon PYUN
On Wed, Aug 19, 2015 at 09:00:35AM -0400, Rick Macklem wrote:
 Hans Petter Selasky wrote:
  On 08/19/15 09:42, Yonghyeon PYUN wrote:
   On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
   On 08/18/15 23:54, Rick Macklem wrote:
   Ouch! Yes, I now see that the code that counts the # of mbufs is before
   the
   code that adds the tcp/ip header mbuf.
  
   In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
   whatever
   the driver provides - 1. It is not the driver's responsibility to know 
   if
   a tcp/ip
   header mbuf will be added and is a lot less confusing than expecting the
   driver
   author to know to subtract one. (I had mistakenly thought that
   tcp_output() had
   added the tcp/ip header mbuf before the loop that counts mbufs in the
   list.
   Btw,
   this tcp/ip header mbuf also has leading space for the MAC layer 
   header.)
  
  
   Hi Rick,
  
   Your question is good. With the Mellanox hardware we have separate
   so-called inline data space for the TCP/IP headers, so if the TCP stack
   subtracts something, then we would need to add something to the limit,
   because then the scatter gather list is only used for the data part.
  
  
   I think all drivers in tree don't subtract 1 for
   if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
   simpler than fixing all other drivers in tree.
  
   Maybe it can be controlled by some kind of flag, if all the three TSO
   limits should include the TCP/IP/ethernet headers too. I'm pretty sure
   we want both versions.
  
  
   Hmm, I'm afraid it's already complex.  Drivers have to tell almost
   the same information to both bus_dma(9) and network stack.
  
  Don't forget that not all drivers in the tree set the TSO limits before
  if_attach(), so possibly the subtraction of one TSO fragment needs to go
  into ip_output() 
  
 Ok, I realized that some drivers may not know the answers before 
 ether_ifattach(),
 due to the way they are configured/written (I saw the use of 
 if_hw_tsomax_update()
 in the patch).

I was not able to find an interface that configures TSO parameters
after if_t conversion.  I'm under the impression
if_hw_tsomax_update() is not designed to be used this way.  Probably we
need a better one? (CCed to Gleb.)
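
For reference, that interface takes a struct carrying the three limits,
roughly like this (see sys/net/if_var.h for the authoritative definition;
the values below are made up):

	struct ifnet_hw_tsomax tsomax;

	tsomax.tsomaxbytes = 65518;
	tsomax.tsomaxsegcount = 32;
	tsomax.tsomaxsegsize = 2048;
	if_hw_tsomax_update(ifp, &tsomax);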

 
 If it is subtracted as a part of the assignment to if_hw_tsomaxsegcount
 at line #791 in tcp_output(), like the following, I don't think it should 
 matter if the
 values are set before ether_ifattach()?
	/*
	 * Subtract 1 for the tcp/ip header mbuf that
	 * will be prepended to the mbuf chain in this
	 * function in the code below this block.
	 */
	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1;
 
 I don't have a good solution for the case where a driver doesn't plan on 
 using the
 tcp/ip header provided by tcp_output() except to say the driver can add one 
 to the
 setting to compensate for that (and if they fail to do so, it still works, 
 although
 somewhat suboptimally). When I now read the comment in sys/net/if_var.h it is 
 clear
 what it means, but for some reason I didn't read it that way before? (I think 
 it was
 the part that said the driver didn't have to subtract for the headers that 
 confused me?)
 In any case, we need to try and come up with a clear definition of what they 
 need to
 be set to.
 
 I can now think of two ways to deal with this:
 1 - Leave tcp_output() as is, but provide a macro for the device driver 
 authors to use
 that sets if_hw_tsomaxsegcount with a flag for driver uses tcp/ip header 
 mbuf,
 documenting that this flag should normally be true.
 OR
 2 - Change tcp_output() as above, noting that this is a workaround for 
 confusion w.r.t.
 whether or not if_hw_tsomaxsegcount should include the tcp/ip header mbuf 
 and
 update the comment in if_var.h to reflect this. Then drivers that don't 
 use the
 tcp/ip header mbuf can increase their value for if_hw_tsomaxsegcount by 1.
 (The comment should also mention that a value of 35 or greater is much 
 preferred to
  32 if the hardware will support that.)
 

Both work for me.  My preference is 2 just because it's very
common for most drivers to use the tcp/ip header mbuf.


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Hans Petter Selasky

On 08/18/15 23:54, Rick Macklem wrote:

Ouch! Yes, I now see that the code that counts the # of mbufs is before the
code that adds the tcp/ip header mbuf.

In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever
the driver provides - 1. It is not the driver's responsibility to know if a 
tcp/ip
header mbuf will be added and is a lot less confusing than expecting the driver
author to know to subtract one. (I had mistakenly thought that tcp_output() had
added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw,
this tcp/ip header mbuf also has leading space for the MAC layer header.)



Hi Rick,

Your question is good. With the Mellanox hardware we have separate 
so-called inline data space for the TCP/IP headers, so if the TCP stack 
subtracts something, then we would need to add something to the limit, 
because then the scatter gather list is only used for the data part.


Maybe it can be controlled by some kind of flag, if all the three TSO 
limits should include the TCP/IP/ethernet headers too. I'm pretty sure 
we want both versions.


--HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Yonghyeon PYUN
On Tue, Aug 18, 2015 at 06:04:25PM -0400, Rick Macklem wrote:
 Hans Petter Selasky wrote:
  On 08/18/15 14:53, Rick Macklem wrote:
   If this is just a test machine, maybe you could test with these lines (at
   about #880)
   in sys/netinet/tcp_output.c commented out? (It looks to me like this will
   disable TSO
   for almost all the NFS writes.)
   - around line #880 in sys/netinet/tcp_output.c:
 /*
  * In case there are too many small fragments
  * don't use TSO:
  */
 if (len <= max_len) {
 	len = max_len;
 	sendalot = 1;
 	tso = 0;
 }
  
   This was added along with the other stuff that did the
   if_hw_tsomaxsegcount, etc and I
   never noticed it until now (not my patch).
  
  FYI:
  
  These lines are needed by other hardware, like the mlxen driver. If you
  remove them mlxen will start doing m_defrag(). I believe if you set the
  correct parameters in the struct ifnet for the TSO size/count limits
  this problem will go away. If you print the len and max_len and also
  the cases where TSO limits are reached, you'll see what parameter is
  triggering it and needs to be increased.
  
 Well, if the driver isn't setting if_hw_tsomaxsegcount correctly, then it
 is the driver that needs to be fixed.
 Having the above code block disable TSO for all of the NFS writes, including
 the ones that set if_hw_tsomaxsegcount correctly doesn't make sense to me.
 If the driver authors don't set these, the drivers do lots of m_defrag()
 calls. I have posted more than once to freebsd-net@ asking the driver authors
 to set these and some now have. (I can't do it, because I don't have the
 hardware to test it with.)
 

Thanks for the reminder.  I have generated a diff against HEAD.
https://people.freebsd.org/~yongari/tso.param.diff
The diff restores optimal TSO parameters which were lost in r271946
for drivers that relied on sane default values.  I'll commit it
after some testing.

 I do think that most/all of them don't subtract 1 for the tcp/ip header and
 I don't think they should be expected to, since the driver isn't supposed to
 worry about the protocol at that level.

I agree.

 -- I think tcp_output() should subtract one from the if_hw_tsomaxsegcount
 provided by the driver to handle this, since it chooses to count mbufs
 (the while() loop at around line #825 in sys/netinet/tcp_output.c.)
 before it prepends the tcp/ip header mbuf.
 
 rick
 
  --HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Yonghyeon PYUN
On Wed, Aug 19, 2015 at 09:51:44AM +0200, Hans Petter Selasky wrote:
 On 08/19/15 09:42, Yonghyeon PYUN wrote:
 On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
 On 08/18/15 23:54, Rick Macklem wrote:
 Ouch! Yes, I now see that the code that counts the # of mbufs is before 
 the
 code that adds the tcp/ip header mbuf.
 
 In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
 whatever
 the driver provides - 1. It is not the driver's responsibility to know if
 a tcp/ip
 header mbuf will be added and is a lot less confusing than expecting the
 driver
 author to know to subtract one. (I had mistakenly thought that
 tcp_output() had
 added the tcp/ip header mbuf before the loop that counts mbufs in the 
 list.
 Btw,
 this tcp/ip header mbuf also has leading space for the MAC layer header.)
 
 
 Hi Rick,
 
 Your question is good. With the Mellanox hardware we have separate
 so-called inline data space for the TCP/IP headers, so if the TCP stack
 subtracts something, then we would need to add something to the limit,
 because then the scatter gather list is only used for the data part.
 
 
 I think all drivers in tree don't subtract 1 for
 if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
 simpler than fixing all other drivers in tree.
 
 Hi,
 
 If you change the behaviour don't forget to update and/or add comments 
 describing it. Maybe the amount of subtraction could be defined by some 
 macro? Then drivers which inline the headers can subtract it?
 

I'm also ok with your suggestion.

 Your suggestion is fine by me.
 

 An attempt was made to preserve the initial TSO limits, and I believe that 
 TSO limits never accounted for IP/TCP/ETHERNET/VLAN headers!
 

I guess FreeBSD used to follow MS LSOv1 specification with minor
exception in pseudo checksum computation. If I recall correctly the
specification says upper stack can generate up to IP_MAXPACKET sized
packet.  L2 headers like the ethernet/vlan header are not
included in that size, and it is the driver's responsibility to allocate
additional DMA buffers/segments for L2 headers.
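
A driver that DMAs the whole chain would size its TSO DMA tag to match; a
rough sketch (the softc, tag, and segment-count names are hypothetical):

	/*
	 * TSO tag sized for the payload segments plus one extra
	 * segment for the prepended L2/L3/L4 header mbuf.
	 */
	error = bus_dma_tag_create(sc->parent_tag, 1, 0,
	    BUS_SPACE_MAXADDR, BUS_SPACE_MAXADDR, NULL, NULL,
	    IP_MAXPACKET + sizeof(struct ether_vlan_header), /* maxsize */
	    HW_TSO_NSEGS + 1,	/* payload segments + header mbuf */
	    PAGE_SIZE,		/* maxsegsize */
	    0, NULL, NULL, &sc->tso_tag);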

 
 Maybe it can be controlled by some kind of flag, if all the three TSO
 limits should include the TCP/IP/ethernet headers too. I'm pretty sure
 we want both versions.
 
 
 Hmm, I'm afraid it's already complex.  Drivers have to tell almost
 the same information to both bus_dma(9) and network stack.
 
 You're right it's complicated. Not sure if bus_dma can provide an API 
 for this though.
 
 --HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Yonghyeon PYUN
On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
 On 08/18/15 23:54, Rick Macklem wrote:
 Ouch! Yes, I now see that the code that counts the # of mbufs is before the
 code that adds the tcp/ip header mbuf.
 
 In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to 
 whatever
 the driver provides - 1. It is not the driver's responsibility to know if 
 a tcp/ip
 header mbuf will be added and is a lot less confusing than expecting the 
 driver
 author to know to subtract one. (I had mistakenly thought that 
 tcp_output() had
 added the tcp/ip header mbuf before the loop that counts mbufs in the list. 
 Btw,
 this tcp/ip header mbuf also has leading space for the MAC layer header.)
 
 
 Hi Rick,
 
 Your question is good. With the Mellanox hardware we have separate 
 so-called inline data space for the TCP/IP headers, so if the TCP stack 
 subtracts something, then we would need to add something to the limit, 
 because then the scatter gather list is only used for the data part.
 

I think all drivers in tree don't subtract 1 for
if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
simpler than fixing all other drivers in tree.

 Maybe it can be controlled by some kind of flag, if all the three TSO 
 limits should include the TCP/IP/ethernet headers too. I'm pretty sure 
 we want both versions.
 

Hmm, I'm afraid it's already complex.  Drivers have to tell almost
the same information to both bus_dma(9) and network stack.


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Hans Petter Selasky

On 08/19/15 09:42, Yonghyeon PYUN wrote:

On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:

On 08/18/15 23:54, Rick Macklem wrote:

Ouch! Yes, I now see that the code that counts the # of mbufs is before the
code that adds the tcp/ip header mbuf.

In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
whatever
the driver provides - 1. It is not the driver's responsibility to know if
a tcp/ip
header mbuf will be added and is a lot less confusing than expecting the
driver
author to know to subtract one. (I had mistakenly thought that
tcp_output() had
added the tcp/ip header mbuf before the loop that counts mbufs in the list.
Btw,
this tcp/ip header mbuf also has leading space for the MAC layer header.)



Hi Rick,

Your question is good. With the Mellanox hardware we have separate
so-called inline data space for the TCP/IP headers, so if the TCP stack
subtracts something, then we would need to add something to the limit,
because then the scatter gather list is only used for the data part.



I think all drivers in tree don't subtract 1 for
if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
simpler than fixing all other drivers in tree.


Maybe it can be controlled by some kind of flag, if all the three TSO
limits should include the TCP/IP/ethernet headers too. I'm pretty sure
we want both versions.



Hmm, I'm afraid it's already complex.  Drivers have to tell almost
the same information to both bus_dma(9) and network stack.


Don't forget that not all drivers in the tree set the TSO limits before 
if_attach(), so possibly the subtraction of one TSO fragment needs to go 
into ip_output() 


--HPS



Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Hans Petter Selasky

On 08/19/15 09:42, Yonghyeon PYUN wrote:

On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:

On 08/18/15 23:54, Rick Macklem wrote:

Ouch! Yes, I now see that the code that counts the # of mbufs is before the
code that adds the tcp/ip header mbuf.

In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
whatever
the driver provides - 1. It is not the driver's responsibility to know if
a tcp/ip
header mbuf will be added and is a lot less confusing than expecting the
driver
author to know to subtract one. (I had mistakenly thought that
tcp_output() had
added the tcp/ip header mbuf before the loop that counts mbufs in the list.
Btw,
this tcp/ip header mbuf also has leading space for the MAC layer header.)



Hi Rick,

Your question is good. With the Mellanox hardware we have separate
so-called inline data space for the TCP/IP headers, so if the TCP stack
subtracts something, then we would need to add something to the limit,
because then the scatter gather list is only used for the data part.



I think all drivers in tree don't subtract 1 for
if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
simpler than fixing all other drivers in tree.


Hi,

If you change the behaviour don't forget to update and/or add comments 
describing it. Maybe the amount of subtraction could be defined by some 
macro? Then drivers which inline the headers can subtract it?


Your suggestion is fine by me.

An attempt was made to preserve the initial TSO limits, and I believe that 
TSO limits never accounted for IP/TCP/ETHERNET/VLAN headers!





Maybe it can be controlled by some kind of flag, if all the three TSO
limits should include the TCP/IP/ethernet headers too. I'm pretty sure
we want both versions.



Hmm, I'm afraid it's already complex.  Drivers have to tell almost
the same information to both bus_dma(9) and network stack.


You're right it's complicated. Not sure if bus_dma can provide an API 
for this though.


--HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-19 Thread Yonghyeon PYUN
On Wed, Aug 19, 2015 at 08:13:59AM -0400, Rick Macklem wrote:
 Yonghyeon PYUN wrote:
  On Wed, Aug 19, 2015 at 09:51:44AM +0200, Hans Petter Selasky wrote:
   On 08/19/15 09:42, Yonghyeon PYUN wrote:
   On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
   On 08/18/15 23:54, Rick Macklem wrote:
   Ouch! Yes, I now see that the code that counts the # of mbufs is before
   the
   code that adds the tcp/ip header mbuf.
   
   In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to
   whatever
   the driver provides - 1. It is not the driver's responsibility to know
   if
   a tcp/ip
   header mbuf will be added and is a lot less confusing than expecting 
   the
   driver
   author to know to subtract one. (I had mistakenly thought that
   tcp_output() had
   added the tcp/ip header mbuf before the loop that counts mbufs in the
   list.
   Btw,
   this tcp/ip header mbuf also has leading space for the MAC layer
   header.)
   
   
   Hi Rick,
   
   Your question is good. With the Mellanox hardware we have separate
   so-called inline data space for the TCP/IP headers, so if the TCP stack
   subtracts something, then we would need to add something to the limit,
   because then the scatter gather list is only used for the data part.
   
   
   I think all drivers in tree don't subtract 1 for
   if_hw_tsomaxsegcount.  Probably touching Mellanox driver would be
   simpler than fixing all other drivers in tree.
   
   Hi,
   
   If you change the behaviour don't forget to update and/or add comments
   describing it. Maybe the amount of subtraction could be defined by some
   macro? Then drivers which inline the headers can subtract it?
   
  
  I'm also ok with your suggestion.
  
   Your suggestion is fine by me.
   
  
   An attempt was made to preserve the initial TSO limits, and I believe that
   TSO limits never accounted for IP/TCP/ETHERNET/VLAN headers!
   
  
  I guess FreeBSD used to follow MS LSOv1 specification with minor
  exception in pseudo checksum computation. If I recall correctly the
  specification says upper stack can generate up to IP_MAXPACKET sized
  packet.  L2 headers like the ethernet/vlan header are not
  included in that size, and it is the driver's responsibility to allocate
  additional DMA buffers/segments for L2 headers.
  
 Yep. The default for if_hw_tsomax was reduced from IP_MAXPACKET to
   32 * MCLBYTES - max_ethernet_header_size as a workaround/hack so that
 devices limited to 32 transmit segments would work (ie. the entire packet,
 including MAC header would fit in 32 MCLBYTE clusters).
 This implied that many drivers did end up using m_defrag() to copy the mbuf
 list to one made up of 32 MCLBYTE clusters.
 
 If a driver sets if_hw_tsomaxsegcount correctly, then it can set if_hw_tsomax
 to whatever it can handle as the largest TSO packet (without MAC header) the
 hardware can handle. If it can handle > IP_MAXPACKET, then it can set it to 
 that.
 

I thought the upper limit was still IP_MAXPACKET. If a driver
increases it (i.e. > IP_MAXPACKET), the length field in the IP
header would overflow, which in turn may break firewalls and other
packet handling in the IPv4/IPv6 code path.
If the limit no longer applies to the network stack, that's great.  Some
controllers can handle up to 256KB TCP/UDP segmentation and
supporting that feature wouldn't be hard.


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-18 Thread Daniel Braniss

 On Aug 18, 2015, at 12:49 AM, Rick Macklem rmack...@uoguelph.ca wrote:
 
 Daniel Braniss wrote:
 
 On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote:
 
 Daniel Braniss wrote:
 
 On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com
 wrote:
 
 FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
 10.1. Before 10.1 it was less.
 
 
 this is NOT iperf/3 where i do get close to wire speed,
 it’s NFS writes, i.e., almost real work :-)
 
 I used to tweak the card settings, but now it's just stock. You may want
 to
 check your settings, the Mellanox may just have better defaults for your
 switch.
 
 Have you tried disabling TSO for the Intel? With TSO enabled, it will be
 copying
 every transmitted mbuf chain to a new chain of mbuf clusters via
 m_defrag() when
 TSO is enabled. (Assuming you aren't an 82598 chip. Most seem to be the
 82599 chip
 these days?)
 
 
 hi Rick
 
 how can i check the chip?
 
 Haven't a clue. Does dmesg tell you? (To be honest, since disabling TSO 
 helped,
 I'll bet you don't have a 82598.)
 
 This has been fixed in the driver very recently, but those fixes won't be
 in 10.1.
 
 rick
 ps: If you could test with 10.2, it would be interesting to see how the ix
 does with
   the current driver fixes in it?
 
 I knew TSO was involved!
 ok, firstly, it’s 10.2 stable.
 with TSO enabled, ix is bad, around 64MGB/s.
 disabling TSO it’s better, around 130
 
 Hmm, could you check to see if these lines are in sys/dev/ixgbe/if_ix.c at 
 around
 line#2500?
  /* TSO parameters */
 2572	ifp->if_hw_tsomax = 65518;
 2573	ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER;
 2574	ifp->if_hw_tsomaxsegsize = 2048;
 
 They are in stable/10. I didn't look at releng/10.2. (And if they're in a 
 #ifdef
 for FreeBSD11, take the #ifdef away.)
 If they are there and not ifdef'd, I can't explain why disabling TSO would 
 help.
 Once TSO is fixed so that it handles the 64K transmit segments without 
 copying all
 the mbufs, I suspect you might get better perf. with it enabled?
 

this is 10.2:
they are on lines 2509-2511 and I don’t see any #ifdefs around them.

the plot thickens :-)

danny

 Good luck with it, rick
 
 still, mlxen0 is about 250! with and without TSO
 
 
 
 On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
 On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
 
 hi,
 I have a host (Dell R730) with both cards, connected to an HP8200
 switch at 10Gb.
 when writing to the same storage (netapp) this is what I get:
 ix0:~130MGB/s
 mlxen0  ~330MGB/s
 this is via nfs/tcpv3
 
 I can get similar (bad) performance with the mellanox if I increase
 the file size
 to 512MGB.
 
 Looks like mellanox has an internal buffer for caching and does ACK
 acceleration.
 
 so at face value, it seems the mlxen does a better use of resources
 than the intel.
 Any ideas how to improve ix/intel's performance?
 
 Are you sure about netapp performance?

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-18 Thread Rick Macklem
Daniel Braniss wrote:
 
  On Aug 18, 2015, at 12:49 AM, Rick Macklem rmack...@uoguelph.ca wrote:
  
  Daniel Braniss wrote:
  
  On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote:
  
  Daniel Braniss wrote:
  
  On Aug 17, 2015, at 1:41 PM, Christopher Forgeron
  csforge...@gmail.com
  wrote:
  
  FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
  10.1. Before 10.1 it was less.
  
  
  this is NOT iperf/3 where i do get close to wire speed,
  it’s NFS writes, i.e., almost real work :-)
  
  I used to tweak the card settings, but now it's just stock. You may
  want
  to
  check your settings, the Mellanox may just have better defaults for
  your
  switch.
  
  Have you tried disabling TSO for the Intel? With TSO enabled, it will be
  copying
  every transmitted mbuf chain to a new chain of mbuf clusters via
  m_defrag() when
  TSO is enabled. (Assuming you aren't an 82598 chip. Most seem to be the
  82599 chip
  these days?)
  
  
  hi Rick
  
  how can i check the chip?
  
  Haven't a clue. Does dmesg tell you? (To be honest, since disabling TSO
  helped,
  I'll bet you don't have a 82598.)
  
  This has been fixed in the driver very recently, but those fixes won't be
  in 10.1.
  
  rick
  ps: If you could test with 10.2, it would be interesting to see how the
  ix
  does with
the current driver fixes in it?
  
  I knew TSO was involved!
  ok, firstly, it’s 10.2 stable.
  with TSO enabled, ix is bad, around 64MGB/s.
  disabling TSO it’s better, around 130
  
  Hmm, could you check to see if these lines are in sys/dev/ixgbe/if_ix.c at
  around
  line#2500?
   /* TSO parameters */
  2572	ifp->if_hw_tsomax = 65518;
  2573	ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER;
  2574	ifp->if_hw_tsomaxsegsize = 2048;
  
  They are in stable/10. I didn't look at releng/10.2. (And if they're in a
  #ifdef
  for FreeBSD11, take the #ifdef away.)
  If they are there and not ifdef'd, I can't explain why disabling TSO would
  help.
  Once TSO is fixed so that it handles the 64K transmit segments without
  copying all
  the mbufs, I suspect you might get better perf. with it enabled?
  
 
 this is 10.2:
 they are on lines 2509-2511 and I don’t see any #ifdefs around them.
 
 the plot thickens :-)
 
If this is just a test machine, maybe you could test with these lines (at about 
#880)
in sys/netinet/tcp_output.c commented out? (It looks to me like this will 
disable TSO
for almost all the NFS writes.)
- around line #880 in sys/netinet/tcp_output.c:
/*
 * In case there are too many small fragments
 * don't use TSO:
 */
if (len <= max_len) {
	len = max_len;
	sendalot = 1;
	tso = 0;
}

This was added along with the other stuff that did the if_hw_tsomaxsegcount, 
etc and I
never noticed it until now (not my patch).

rick

 danny
 
  Good luck with it, rick
  
  still, mlxen0 is about 250! with and without TSO
  
  
  
  On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
  On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
  
  hi,
  I have a host (Dell R730) with both cards, connected to an HP8200
  switch at 10Gb.
  when writing to the same storage (netapp) this is what I get:
  ix0:~130MGB/s
  mlxen0  ~330MGB/s
  this is via nfs/tcpv3
  
  I can get similar (bad) performance with the mellanox if I
  increase
  the file size
  to 512MGB.
  
  Looks like mellanox has an internal buffer for caching and does ACK
  acceleration.
  
  so at face value, it seems the mlxen does a better use of
  resources
  than the intel.
  Any ideas how to improve ix/intel's performance?
  
  Are you sure about netapp performance?

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-18 Thread Rick Macklem
Hans Petter Selasky wrote:
 On 08/18/15 14:53, Rick Macklem wrote:
  If this is just a test machine, maybe you could test with these lines (at
  about #880)
  in sys/netinet/tcp_output.c commented out? (It looks to me like this will
  disable TSO
  for almost all the NFS writes.)
  - around line #880 in sys/netinet/tcp_output.c:
  /*
   * In case there are too many small fragments
   * don't use TSO:
   */
  if (len <= max_len) {
  	len = max_len;
  	sendalot = 1;
  	tso = 0;
  }
 
  This was added along with the other stuff that did the
  if_hw_tsomaxsegcount, etc and I
  never noticed it until now (not my patch).
 
 FYI:
 
 These lines are needed by other hardware, like the mlxen driver. If you
 remove them mlxen will start doing m_defrag(). I believe if you set the
 correct parameters in the struct ifnet for the TSO size/count limits
 this problem will go away. If you print the len and max_len and also
 the cases where TSO limits are reached, you'll see what parameter is
 triggering it and needs to be increased.
 
Well, if the driver isn't setting if_hw_tsomaxsegcount correctly, then it
is the driver that needs to be fixed.
Having the above code block disable TSO for all of the NFS writes, including
the ones that set if_hw_tsomaxsegcount correctly doesn't make sense to me.
If the driver authors don't set these, the drivers do lots of m_defrag()
calls. I have posted more than once to freebsd-net@ asking the driver authors
to set these and some now have. (I can't do it, because I don't have the
hardware to test it with.)
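
For context, the m_defrag() fallback in question is the usual transmit-path
pattern; a rough sketch, with assumed tag/map names:

	error = bus_dmamap_load_mbuf_sg(txtag, txmap, m_head, segs, &nsegs,
	    BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		/* Too many segments: compact the chain and retry once. */
		struct mbuf *m = m_defrag(m_head, M_NOWAIT);
		if (m == NULL) {
			m_freem(m_head);
			return (ENOBUFS);
		}
		m_head = m;
		error = bus_dmamap_load_mbuf_sg(txtag, txmap, m_head, segs,
		    &nsegs, BUS_DMA_NOWAIT);
	}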

I do think that most/all of them don't subtract 1 for the tcp/ip header and
I don't think they should be expected to, since the driver isn't supposed to
worry about the protocol at that level.
-- I think tcp_output() should subtract one from the if_hw_tsomaxsegcount
provided by the driver to handle this, since it chooses to count mbufs
(the while() loop at around line #825 in sys/netinet/tcp_output.c.)
before it prepends the tcp/ip header mbuf.

rick

 --HPS
 


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-18 Thread Rick Macklem
Daniel Braniss wrote:
 
  On Aug 18, 2015, at 12:49 AM, Rick Macklem rmack...@uoguelph.ca wrote:
  
  Daniel Braniss wrote:
  
  On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote:
  
  Daniel Braniss wrote:
  
  On Aug 17, 2015, at 1:41 PM, Christopher Forgeron
  csforge...@gmail.com
  wrote:
  
  FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
  10.1. Before 10.1 it was less.
  
  
  this is NOT iperf/3 where i do get close to wire speed,
  it’s NFS writes, i.e., almost real work :-)
  
  I used to tweak the card settings, but now it's just stock. You may
  want
  to
  check your settings, the Mellanox may just have better defaults for
  your
  switch.
  
  Have you tried disabling TSO for the Intel? With TSO enabled, it will be
  copying
  every transmitted mbuf chain to a new chain of mbuf clusters via
  m_defrag() when
  TSO is enabled. (Assuming you aren't an 82598 chip. Most seem to be the
  82599 chip
  these days?)
  
Oops, I think I screwed up. It looks like t_maxopd is limited to somewhat less
than the mtu.

If that is the case, the code block wouldn't do what I thought it would do.

However, if_hw_tsomaxsegcount does need to be one less than the limit for the
driver, since the tcp/ip header isn't yet prepended when it is counted.

I think the code in tcp_output() should subtract 1, but you can change it in
the driver to test this.

Thanks for doing this, rick

  
  hi Rick
  
  how can i check the chip?
  
  Haven't a clue. Does dmesg tell you? (To be honest, since disabling TSO
  helped,
  I'll bet you don't have a 82598.)
  
  This has been fixed in the driver very recently, but those fixes won't be
  in 10.1.
  
  rick
  ps: If you could test with 10.2, it would be interesting to see how the
  ix
  does with
the current driver fixes in it?
  
  I knew TSO was involved!
  ok, firstly, it’s 10.2 stable.
  with TSO enabled, ix is bad, around 64MGB/s.
  disabling TSO it’s better, around 130
  
  Hmm, could you check to see if these lines are in sys/dev/ixgbe/if_ix.c at
  around
  line#2500?
   /* TSO parameters */
  2572	ifp->if_hw_tsomax = 65518;
  2573	ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER;
  2574	ifp->if_hw_tsomaxsegsize = 2048;
  
  They are in stable/10. I didn't look at releng/10.2. (And if they're in a
  #ifdef
  for FreeBSD11, take the #ifdef away.)
  If they are there and not ifdef'd, I can't explain why disabling TSO would
  help.
  Once TSO is fixed so that it handles the 64K transmit segments without
  copying all
  the mbufs, I suspect you might get better perf. with it enabled?
  
 
 this is 10.2:
 they are on lines 2509-2511 and I don’t see any #ifdefs around them.
 
 the plot thickens :-)
 
 danny
 
  Good luck with it, rick
  
  still, mlxen0 is about 250! with and without TSO
  
  
  
  On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
  On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
  
  hi,
  I have a host (Dell R730) with both cards, connected to an HP8200
  switch at 10Gb.
  when writing to the same storage (netapp) this is what I get:
  ix0:~130MGB/s
  mlxen0  ~330MGB/s
  this is via nfs/tcpv3
  
  I can get similar (bad) performance with the mellanox if I
  increase
  the file size
  to 512MGB.
  
  Looks like mellanox has an internal buffer for caching and does ACK
  acceleration.
  
  so at face value, it seems the mlxen does a better use of
  resources
  than the intel.
  Any ideas how to improve ix/intel's performance?
  
  Are you sure about netapp performance?

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-18 Thread Hans Petter Selasky

On 08/18/15 14:53, Rick Macklem wrote:

2572	ifp->if_hw_tsomax = 65518;
2573	ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER;
2574	ifp->if_hw_tsomaxsegsize = 2048;


Hi,

If IXGBE_82599_SCATTER is the maximum scatter/gather entries the 
hardware can do, remember to subtract one fragment for the TCP/IP-header 
mbuf!


I think there is an off-by-one here:

ifp->if_hw_tsomax = 65518;
ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER - 1;
ifp->if_hw_tsomaxsegsize = 2048;

Refer to:


 *
 * NOTE: The TSO limits only apply to the data payload part of
 * a TCP/IP packet. That means there is no need to subtract
 * space for ethernet-, vlan-, IP- or TCP- headers from the
 * TSO limits unless the hardware driver in question requires
 * so.


In sys/net/if_var.h

Thank you!

--HPS



Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-18 Thread Hans Petter Selasky

On 08/18/15 14:53, Rick Macklem wrote:

If this is just a test machine, maybe you could test with these lines (at about 
#880)
in sys/netinet/tcp_output.c commented out? (It looks to me like this will 
disable TSO
for almost all the NFS writes.)
- around line #880 in sys/netinet/tcp_output.c:
/*
 * In case there are too many small fragments
 * don't use TSO:
 */
if (len <= max_len) {
        len = max_len;
        sendalot = 1;
        tso = 0;
}

This was added along with the other stuff that did the if_hw_tsomaxsegcount,
etc. and I never noticed it until now (not my patch).


FYI:

These lines are needed by other hardware, like the mlxen driver. If you 
remove them, mlxen will start doing m_defrag(). I believe if you set the 
correct parameters in the struct ifnet for the TSO size/count limits 
this problem will go away. If you print the len and max_len and also 
the cases where TSO limits are reached, you'll see what parameter is 
triggering it and needs to be increased.
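
A sketch of that instrumentation, patterned on the tcp_output() block quoted
earlier in the thread (the printf placement, wording and casts are
assumptions, not a proposed patch):

	if (len <= max_len) {
		/* Debug sketch: print the values HPS suggests inspecting. */
		printf("TSO clamp: len=%ld max_len=%ld (check if_hw_tsomax, "
		    "if_hw_tsomaxsegcount, if_hw_tsomaxsegsize)\n",
		    (long)len, (long)max_len);
		len = max_len;
		sendalot = 1;
		tso = 0;
	}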


--HPS


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-18 Thread Slawa Olhovchenkov
On Tue, Aug 18, 2015 at 05:09:41PM +0300, Daniel Braniss wrote:

 sorry, it's been a tough day, we had a major meltdown, caused by a faulty 
 gbic :-(
 anyways, could you tell me what to do?
  comment out, or fix the off-by-one?
 
 the machine is not yet production.

Can you collect this information?
https://lists.freebsd.org/pipermail/freebsd-stable/2015-August/083113.html

And 'show interface' (or equivalent: error/collsion/events counters)
from both ports from HP8200.


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-18 Thread Daniel Braniss
sorry, it’s been a tough day, we had a major meltdown, caused by a faulty gbic 
:-(
anyways, could you tell me what to do?
comment out, or fix the off-by-one?

the machine is not yet production.

thanks,
danny

 On 18 Aug 2015, at 16:32, Hans Petter Selasky h...@selasky.org wrote:
 
 On 08/18/15 14:53, Rick Macklem wrote:
  2572 ifp->if_hw_tsomax = 65518;
  2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER;
  2574 ifp->if_hw_tsomaxsegsize = 2048;
 
 Hi,
 
 If IXGBE_82599_SCATTER is the maximum scatter/gather entries the hardware can 
 do, remember to subtract one fragment for the TCP/IP-header mbuf!
 
 I think there is an off-by-one here:
 
 ifp->if_hw_tsomax = 65518;
 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER - 1;
 ifp->if_hw_tsomaxsegsize = 2048;
 
 Refer to:
 
 *
 * NOTE: The TSO limits only apply to the data payload part of
 * a TCP/IP packet. That means there is no need to subtract
 * space for ethernet-, vlan-, IP- or TCP- headers from the
 * TSO limits unless the hardware driver in question requires
 * so.
 
 In sys/net/if_var.h
 
 Thank you!
 
 --HPS
 


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-18 Thread Rick Macklem
Hans Petter Selasky wrote:
 On 08/18/15 14:53, Rick Macklem wrote:
   2572 ifp->if_hw_tsomax = 65518;
   2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER;
   2574 ifp->if_hw_tsomaxsegsize = 2048;
 
 Hi,
 
 If IXGBE_82599_SCATTER is the maximum scatter/gather entries the
 hardware can do, remember to subtract one fragment for the TCP/IP-header
 mbuf!
 
Ouch! Yes, I now see that the code that counts the # of mbufs is before the
code that adds the tcp/ip header mbuf.

In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever
the driver provides - 1. It is not the driver's responsibility to know if a
tcp/ip header mbuf will be added, and that is a lot less confusing than
expecting the driver author to know to subtract one. (I had mistakenly thought
that tcp_output() had added the tcp/ip header mbuf before the loop that counts
mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the
MAC layer header.)
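
A hedged sketch of that stack-side alternative (where exactly the adjustment
would live in tcp_output() is an assumption; only if_hw_tsomaxsegcount is from
if_var.h):

	u_int max_segs;

	/* Reserve the header-mbuf slot once, centrally, not in every driver. */
	max_segs = ifp->if_hw_tsomaxsegcount;
	if (max_segs > 1)
		max_segs--;	/* slot for the prepended ETH/IP/TCP header mbuf */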

 I think there is an off-by-one here:
 
 ifp->if_hw_tsomax = 65518;
 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER - 1;
 ifp->if_hw_tsomaxsegsize = 2048;
 
 Refer to:
 
   *
   * NOTE: The TSO limits only apply to the data payload part of
   * a TCP/IP packet. That means there is no need to subtract
   * space for ethernet-, vlan-, IP- or TCP- headers from the
   * TSO limits unless the hardware driver in question requires
   * so.
 
This comment suggests that the driver author doesn't need to do this.

However, unless this is fixed in tcp_output(), the above patch should be
applied to the driver.
 In sys/net/if_var.h
 
 Thank you!
 
 --HPS
 
The problem I see is that, after doing the calculation of how many mbufs can
be in the TSO segment, the code in tcp_output() will have calculated a value
for len that will always be less than tp->t_maxopd - optlen when the
if_hw_tsomaxsegcount limit has been hit (see where it does a "break" out of
the while loop).
-- This does not imply "too many small fragments" for NFS, just that the
driver's transmit segment limit has been reached, where most of them
are mbuf clusters, but not the first ones.
As such the code:
/*
 * In case there are too many small fragments
 * don't use TSO:
 */
if (len <= max_len) {
        len = max_len;
        sendalot = 1;
        tso = 0;
}
Will always happen for this case and tso gets set to 0. Not what we want to
happen, imho.
The above code block was what I suggested should be commented out or deleted
for the test.
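
In patch form, the suggested experiment might look like this (a test-only
sketch of commenting the block out, not a proposed commit):

#if 0	/* XXX test: don't disable TSO when the segment limit clips len */
	if (len <= max_len) {
		len = max_len;
		sendalot = 1;
		tso = 0;
	}
#endif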

It appears you should also add the - 1 in the driver sys/dev/ixgbe/if_ix.c.

rick



Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Rick Macklem
Daniel Braniss wrote:
 
  On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote:
  
  Daniel Braniss wrote:
  
  On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com
  wrote:
  
  FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
  10.1. Before 10.1 it was less.
  
  
  this is NOT iperf/3 where i do get close to wire speed,
  it’s NFS writes, i.e., almost real work :-)
  
  I used to tweak the card settings, but now it's just stock. You may want
  to
  check your settings, the Mellanox may just have better defaults for your
  switch.
  
   Have you tried disabling TSO for the Intel? With TSO enabled, it will be
   copying every transmitted mbuf chain to a new chain of mbuf clusters via
   m_defrag(). (Assuming you don't have an 82598 chip. Most seem to be the
   82599 chip these days?)
  
 
 hi Rick
 
 how can i check the chip?
 
Haven't a clue. Does dmesg tell you? (To be honest, since disabling TSO
helped, I'll bet you don't have an 82598.)

  This has been fixed in the driver very recently, but those fixes won't be
  in 10.1.
  
  rick
  ps: If you could test with 10.2, it would be interesting to see how the ix
  does with
 the current driver fixes in it?
 
 I knew TSO was involved!
 ok, firstly, it’s 10.2 stable.
 with TSO enabled, ix is bad, around 64MGB/s.
 disabling TSO it’s better, around 130
 
Hmm, could you check to see if these lines are in sys/dev/ixgbe/if_ix.c at
around line #2500?
  /* TSO parameters */
2572 ifp->if_hw_tsomax = 65518;
2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER;
2574 ifp->if_hw_tsomaxsegsize = 2048;

They are in stable/10. I didn't look at releng/10.2. (And if they're in an
#ifdef for FreeBSD11, take the #ifdef away.)
If they are there and not ifdef'd, I can't explain why disabling TSO would help.
Once TSO is fixed so that it handles the 64K transmit segments without copying
all the mbufs, I suspect you might get better perf. with it enabled?

Good luck with it, rick

 still, mlxen0 is about 250! with and without TSO
 
 
  
   On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
  On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
  
  hi,
   I have a host (Dell R730) with both cards, connected to an HP8200
   switch at 10Gb.
   when writing to the same storage (netapp) this is what I get:
   ix0:~130MGB/s
   mlxen0  ~330MGB/s
   this is via nfs/tcpv3
  
   I can get similar (bad) performance with the mellanox if I increase
   the file size
   to 512MGB.
  
   Looks like the mellanox has an internal buffer for caching and does ACK
   acceleration.
   
    so at face value, it seems the mlxen makes better use of resources
    than the intel.
   Any ideas how to improve ix/intel's performance?
  
  Are you sure about netapp performance?

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Rick Macklem
Daniel Braniss wrote:
 
  On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com
  wrote:
  
  FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
  10.1. Before 10.1 it was less.
  
 
 this is NOT iperf/3 where i do get close to wire speed,
 it’s NFS writes, i.e., almost real work :-)
 
  I used to tweak the card settings, but now it's just stock. You may want to
  check your settings, the Mellanox may just have better defaults for your
  switch.
  
Have you tried disabling TSO for the Intel? With TSO enabled, it will be
copying every transmitted mbuf chain to a new chain of mbuf clusters via
m_defrag(). (Assuming you don't have an 82598 chip. Most seem to be the 82599
chip these days?)

This has been fixed in the driver very recently, but those fixes won't be in 
10.1.

rick
ps: If you could test with 10.2, it would be interesting to see how the ix does 
with
the current driver fixes in it?

   On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
  On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
  
   hi,
 I have a host (Dell R730) with both cards, connected to an HP8200
 switch at 10Gb.
 when writing to the same storage (netapp) this is what I get:
 ix0:~130MGB/s
 mlxen0  ~330MGB/s
 this is via nfs/tcpv3
  
 I can get similar (bad) performance with the mellanox if I increase
 the file size
 to 512MGB.
  
   Looks like the mellanox has an internal buffer for caching and does ACK acceleration.
   
  so at face value, it seems the mlxen makes better use of resources
  than the intel.
 Any ideas how to improve ix/intel's performance?
  
  Are you sure about netapp performance?

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Alban Hertroys
On 17 August 2015 at 13:39, Slawa Olhovchenkov s...@zxy.spb.ru wrote:

 In any case, for 10Gb expect about 1200MGB/s.

Your usage of units is confusing. Above you claim you expect 1200
million gigabytes per second, or 1.2 * 10^18 Bytes/s. I don't think
any known network interface can do that, including highly experimental
ones.

I suspect you intended to claim that you expect 1.2GB/s (Gigabytes per
second) over that 10Gb/s (Gigabits per second) network.
That's still on the high side of what's possible. On TCP/IP there is
some TCP overhead, so 1.0 GB/s is probably more realistic.

WRT the actual problem you're trying to solve, I'm no help there.
-- 
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Daniel Braniss

 On Aug 17, 2015, at 12:41 PM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
 
 On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
 
 hi,
  I have a host (Dell R730) with both cards, connected to an HP8200 
 switch at 10Gb.
  when writing to the same storage (netapp) this is what I get:
  ix0:~130MGB/s
  mlxen0  ~330MGB/s
  this is via nfs/tcpv3
 
  I can get similar (bad) performance with the mellanox if I increase the 
 file size
  to 512MGB.
 
 Looks like the mellanox has an internal buffer for caching and does ACK acceleration.
whatever they are doing, it's impressive :-)

 
  so at face value, it seems the mlxen makes better use of resources
 than the intel.
  Any ideas how to improve ix/intel's performance?
 
 Are you sure about netapp performance?

yes, and why should it act differently if the request is coming from the same
host? In any case, the numbers are quite consistent, since I have measured them
from several hosts and at different times.

danny


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Slawa Olhovchenkov
On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:

 hi,
   I have a host (Dell R730) with both cards, connected to an HP8200 
 switch at 10Gb.
   when writing to the same storage (netapp) this is what I get:
   ix0:~130MGB/s
   mlxen0  ~330MGB/s
   this is via nfs/tcpv3
 
   I can get similar (bad) performance with the mellanox if I increase the 
 file size
   to 512MGB.

Looks like the mellanox has an internal buffer for caching and does ACK acceleration.

   so at face value, it seems the mlxen makes better use of resources
 than the intel.
   Any ideas how to improve ix/intel's performance?

Are you sure about netapp performance?


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Christopher Forgeron
FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
10.1. Before 10.1 it was less.

I used to tweak the card settings, but now it's just stock. You may want to
check your settings, the Mellanox may just have better defaults for your
switch.

On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:

 On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:

  hi,
I have a host (Dell R730) with both cards, connected to an HP8200
 switch at 10Gb.
when writing to the same storage (netapp) this is what I get:
ix0:~130MGB/s
mlxen0  ~330MGB/s
this is via nfs/tcpv3
 
I can get similar (bad) performance with the mellanox if I
 increase the file size
to 512MGB.

 Looks like the mellanox has an internal buffer for caching and does ACK acceleration.

so at face value, it seems the mlxen makes better use of
 resources than the intel.
Any ideas how to improve ix/intel's performance?

 Are you sure about netapp performance?


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Daniel Braniss

 On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com 
 wrote:
 
 FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD 10.1. 
 Before 10.1 it was less.
 

this is NOT iperf/3 where i do get close to wire speed,
it’s NFS writes, i.e., almost real work :-)

 I used to tweak the card settings, but now it's just stock. You may want to 
 check your settings, the Mellanox may just have better defaults for your 
 switch. 
 
 On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
 On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
 
  hi,
I have a host (Dell R730) with both cards, connected to an HP8200 
  switch at 10Gb.
when writing to the same storage (netapp) this is what I get:
ix0:~130MGB/s
mlxen0  ~330MGB/s
this is via nfs/tcpv3
 
I can get similar (bad) performance with the mellanox if I increase 
  the file size
to 512MGB.
 
 Looks like the mellanox has an internal buffer for caching and does ACK acceleration.
 
so at face value, it seems the mlxen makes better use of resources
  than the intel.
Any ideas how to improve ix/intel's performance?
 
 Are you sure about netapp performance?

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Slawa Olhovchenkov
On Mon, Aug 17, 2015 at 01:35:06PM +0300, Daniel Braniss wrote:

 
  On Aug 17, 2015, at 12:41 PM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
  
  On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
  
  hi,
 I have a host (Dell R730) with both cards, connected to an HP8200 
  switch at 10Gb.
 when writing to the same storage (netapp) this is what I get:
 ix0:~130MGB/s
 mlxen0  ~330MGB/s
 this is via nfs/tcpv3
  
 I can get similar (bad) performance with the mellanox if I increase the 
  file size
 to 512MGB.
  
   Looks like the mellanox has an internal buffer for caching and does ACK acceleration.
  whatever they are doing, it's impressive :-)
  
   
  so at face value, it seems the mlxen makes better use of resources
   than the intel.
 Any ideas how to improve ix/intel's performance?
  
  Are you sure about netapp performance?
 
  yes, and why should it act differently if the request is coming from the same
  host? In any case, the numbers are quite consistent, since I have measured
  them from several hosts and at different times.

In any case, for 10Gb expect about 1200MGB/s.
I see a lower speed.
What is the netapp's maximum performance? From other hosts, or locally?


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Slawa Olhovchenkov
On Mon, Aug 17, 2015 at 01:49:27PM +0200, Alban Hertroys wrote:

 On 17 August 2015 at 13:39, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
 
  In any case, for 10Gb expect about 1200MGB/s.
 
 Your usage of units is confusing. Above you claim you expect 1200

I use the same units as the topic starter and mean MegaBytes per second

 million gigabytes per second, or 1.2 * 10^18 Bytes/s. I don't think
 any known network interface can do that, including highly experimental
 ones.
 
 I suspect you intended to claim that you expect 1.2GB/s (Gigabytes per
 second) over that 10Gb/s (Gigabits per second) network.
 That's still on the high side of what's possible. On TCP/IP there is
 some TCP overhead, so 1.0 GB/s is probably more realistic.

TCP gives 5-7% overhead (including retransmits).
10^10/8 * 0.97 = 1.2125 * 10^9 B/s, i.e. about 1212.5 MGB/s
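
For reference, a small self-contained program reproducing the arithmetic in
this sub-thread (the 3%/7% overhead figures are the ones quoted here):

#include <stdio.h>

/* Back-of-envelope check of the throughput figures in this thread. */
int
main(void)
{
	double link = 1e10;		/* 10 Gb/s, decimal, as for Ethernet */
	double Bps = link / 8.0;	/* bytes per second on the wire */

	printf("raw:          %7.1f MB/s\n", Bps / 1e6);
	printf("3%% overhead:  %7.1f MB/s\n", Bps * 0.97 / 1e6);
	printf("7%% overhead:  %7.1f MB/s\n", Bps * 0.93 / 1e6);
	printf("7%%, in MiB/s: %7.1f\n", Bps * 0.93 / 1048576.0);
	return (0);
}

This prints 1250.0, 1212.5, 1162.5 and 1108.6 respectively, matching the
figures discussed below.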



Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Slawa Olhovchenkov
On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:

 hi,
   I have a host (Dell R730) with both cards, connected to an HP8200 
 switch at 10Gb.
   when writing to the same storage (netapp) this is what I get:
   ix0:~130MGB/s
   mlxen0  ~330MGB/s
   this is via nfs/tcpv3
 
   I can get similar (bad) performance with the mellanox if I increase the 
 file size
   to 512MGB.
   so at face value, it seems the mlxen does a better use of resources 
 than the intel.
   Any ideas how to improve ix/intel's performance?

Anyway, please show

OS version
/var/run/dmesg.boot
What tuning was performed (loader.conf, sysctl.conf)?
top -PHS in both cases
ifconfig -a in both cases
netstat -rn in both cases
I don't know the netapp -- what is its hardware configuration (disks,
etc.) and software tuning (MTU?).



Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Daniel Braniss

 On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote:
 
 Daniel Braniss wrote:
 
 On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com
 wrote:
 
 FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
 10.1. Before 10.1 it was less.
 
 
 this is NOT iperf/3 where i do get close to wire speed,
 it’s NFS writes, i.e., almost real work :-)
 
 I used to tweak the card settings, but now it's just stock. You may want to
 check your settings, the Mellanox may just have better defaults for your
 switch.
 
 Have you tried disabling TSO for the Intel? With TSO enabled, it will be
 copying every transmitted mbuf chain to a new chain of mbuf clusters via
 m_defrag(). (Assuming you don't have an 82598 chip. Most seem to be the 82599
 chip these days?)
 

hi Rick

how can i check the chip?

 This has been fixed in the driver very recently, but those fixes won't be in 
 10.1.
 
 rick
 ps: If you could test with 10.2, it would be interesting to see how the ix 
 does with
the current driver fixes in it?

I knew TSO was involved!
ok, firstly, it’s 10.2 stable.
with TSO enabled, ix is bad, around 64MGB/s.
disabling TSO it’s better, around 130

still, mlxen0 is about 250! with and without TSO


 
 On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
 On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
 
 hi,
  I have a host (Dell R730) with both cards, connected to an HP8200
  switch at 10Gb.
  when writing to the same storage (netapp) this is what I get:
  ix0:~130MGB/s
  mlxen0  ~330MGB/s
  this is via nfs/tcpv3
 
  I can get similar (bad) performance with the mellanox if I increase
  the file size
  to 512MGB.
 
 Looks like the mellanox has an internal buffer for caching and does ACK acceleration.
 
  so at face value, it seems the mlxen makes better use of resources
  than the intel.
  Any ideas how to improve ix/intel's performance?
 
 Are you sure about netapp performance?

Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Alban Hertroys
On 17 August 2015 at 13:54, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
 On Mon, Aug 17, 2015 at 01:49:27PM +0200, Alban Hertroys wrote:

 On 17 August 2015 at 13:39, Slawa Olhovchenkov s...@zxy.spb.ru wrote:

  In any case, for 10Gb expect about 1200MGB/s.

 Your usage of units is confusing. Above you claim you expect 1200

 I am use as topic starter and expect MeGaBytes per second

That's a highly unusual way of writing MB/s.

There are standards for unit prefixes: k means kilo, M means Mega, G
means Giga, etc. See:
https://en.wikipedia.org/wiki/International_System_of_Units#Prefixes

 million gigabytes per second, or 1.2 * 10^18 Bytes/s. I don't think
 any known network interface can do that, including highly experimental
 ones.

 I suspect you intended to claim that you expect 1.2GB/s (Gigabytes per
 second) over that 10Gb/s (Gigabits per second) network.
 That's still on the high side of what's possible. On TCP/IP there is
 some TCP overhead, so 1.0 GB/s is probably more realistic.

 TCP gives 5-7% overhead (including retransmits).
 10^10/8 * 0.97 = 1.2125 * 10^9 B/s, i.e. about 1212.5 MGB/s

In information science, Bytes are counted in multiples of 2, not 10. A
kb is 1024 bits or 2^10 b. So 10 Gb is 10 * 2^30 bits.

It's also not unusual to be more specific about that 2-base and use
kib, Mib and Gib instead.

Apparently you didn't know that...

Also, if you take 5% off, you are left with (0.95 * 10 * 2^30) / 8 =
1.1875 GiB/s, not 0.97 * ... Your calculations were a bit optimistic.

Now I have to admit I'm used to using a factor of 10 to convert from b/s
to B/s (that's 20%!), but that's probably no longer correct, what with
jumbo frames and all.

-- 
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


Re: ix(intel) vs mlxen(mellanox) 10Gb performance

2015-08-17 Thread Slawa Olhovchenkov
On Mon, Aug 17, 2015 at 05:44:37PM +0200, Alban Hertroys wrote:

 On 17 August 2015 at 13:54, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
  On Mon, Aug 17, 2015 at 01:49:27PM +0200, Alban Hertroys wrote:
 
  On 17 August 2015 at 13:39, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
 
   In any case, for 10Gb expect about 1200MGB/s.
 
  Your usage of units is confusing. Above you claim you expect 1200
 
  I am use as topic starter and expect MeGaBytes per second
 
 That's a highly unusual way of writing MB/s.

I know. I don't care about that.

 There are standards for unit prefixes: k means kilo, M means Mega, G
 means Giga, etc. See:
 https://en.wikipedia.org/wiki/International_System_of_Units#Prefixes
 
  million gigabytes per second, or 1.2 * 10^18 Bytes/s. I don't think
  any known network interface can do that, including highly experimental
  ones.
 
  I suspect you intended to claim that you expect 1.2GB/s (Gigabytes per
  second) over that 10Gb/s (Gigabits per second) network.
  That's still on the high side of what's possible. On TCP/IP there is
  some TCP overhead, so 1.0 GB/s is probably more realistic.
 
  TCP gives 5-7% overhead (including retransmits).
  10^10/8 * 0.97 = 1.2125 * 10^9 B/s, i.e. about 1212.5 MGB/s
 
 In information science, Bytes are counted in multiples of 2, not 10. A
 kb is 1024 bits or 2^10 b. So 10 Gb is 10 * 2^30 bits.

Interface speeds are counted in multiples of 10.
10Mbit ethernet has a speed of 10^7 bit/s.
64Kbit ISDN has a speed of 64000 bit/s, not 65536.

 It's also not unusual to be more specific about that 2-base and use
 kib, Mib and Gib instead.
 
 Apparently you didn't know that...
 
 Also, if you take 5% off, you are left with (0.95 * 10 * 2^30) / 8 =
 1.1875 GiB/s, not 0.97 * ... Your calculations were a bit optimistic.

My bug.
10^10/8 * 0.93 = 1.1625 * 10^9 B/s = 1162.5 MGB/s

 Now I have to admit I'm used to use a factor of 10 to convert from b/s
 to B/s (that's 20%!), but that's probably no longer correct, what with
 jumbo frames and all.

OK, maybe the topic starter's software meters speed in MGB/s as 1048576
bytes per second: 1.1625 * 10^9 / 1048576 = 1108.6.