Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Hi,

I've now MFC'ed r287775 to 10-stable and 9-stable. I hope this will resolve the issues with m_defrag() being called on too long mbuf chains due to an off-by-one in the driver TSO parameters, and that it will be easier to maintain these parameters in the future.

Some comments were made that we might want to have an option to select whether the IP header should be counted or not. Certain network drivers require copying the whole ETH/TCP/IP header into a separate memory area, and can then handle one more data payload mbuf for TSO. Others require DMA-ing of the whole mbuf TSO chain. I think it is acceptable to have one TX-DMA segment slot free in the case of 2K mbuf clusters being used for TSO. From my experience the limitation typically kicks in when 2K mbuf clusters are used for TSO instead of 4K mbuf clusters: 65536 / 4096 = 16, whereas 65536 / 2048 = 32. If an ethernet hardware driver has a limitation of 24 data segments (mlxen), and assuming that each mbuf represents a single segment, then if the majority of mbufs being transmitted are 2K clusters we may have a small, 1/24 = 4.2%, loss of TX capability per TSO packet.

From what I've seen using iperf, which in turn calls m_uiotombuf(), which in turn calls m_getm2(), MJUMPPAGESIZE'd mbuf clusters are preferred for large data transfers, so this issue might only happen if NODELAY is used on the socket and the writes are small from the application's point of view. If an application is writing small amounts of data per send() system call, it is expected to degrade system performance.

Please file a PR if it becomes an issue.

Someone asked me to MFC r287775 to 10.X release as well. Is this still required?

--HPS
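[To make the arithmetic above concrete, a minimal userland sketch; this is an illustration only, with the 24-segment limit being the mlxen figure quoted in the mail, not something queried from hardware:]

	#include <stdio.h>

	int
	main(void)
	{
		const int tso_max = 65536;	/* TSO payload burst size used above */
		const int seg_limit = 24;	/* mlxen data segment limit */

		/* Worst-case number of payload mbufs in one TSO burst: */
		printf("4K clusters: %d mbufs\n", tso_max / 4096);	/* 16 <= 24: fits */
		printf("2K clusters: %d mbufs\n", tso_max / 2048);	/* 32 >  24: clamped */

		/*
		 * With 2K clusters the burst is clamped to the 24-segment limit,
		 * so keeping one slot free costs 1/24 of each TSO packet:
		 */
		printf("per-packet TX loss: %.1f%%\n", 100.0 / seg_limit);
		return (0);
	}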
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Hans Petter Selasky wrote:
> Hi,
>
> I've now MFC'ed r287775 to 10-stable and 9-stable. I hope this will
> resolve the issues with m_defrag() being called on too long mbuf chains
> due to an off-by-one in the driver TSO parameters, and that it will be
> easier to maintain these parameters in the future.
>
> [...]
>
> From what I've seen using iperf, which in turn calls m_uiotombuf(),
> which in turn calls m_getm2(), MJUMPPAGESIZE'd mbuf clusters are
> preferred for large data transfers, so this issue might only happen if
> NODELAY is used on the socket and the writes are small from the
> application's point of view. If an application is writing small amounts
> of data per send() system call, it is expected to degrade system
> performance.

Btw, last year I did some testing with NFS generating chains of 4K (page size) clusters instead of 2K (MCLBYTES) ones. Although not easily reproduced, I was able to fragment the KVM used for the clusters enough that allocations would fail. (I could only get it to happen when the code used 4K clusters for large NFS requests/replies and 2K clusters otherwise, resulting in a mix of allocations of both sizes.) As such, I never committed the changes to head. Any kernel change that does 4K cluster allocations needs to be tested carefully (on a small i386 like I have), imho.

> Please file a PR if it becomes an issue.
>
> Someone asked me to MFC r287775 to 10.X release as well. Is this still
> required?
>
> --HPS

Thanks for doing this, rick
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Aug 24, 2015, at 3:25 PM, Rick Macklem rmack...@uoguelph.ca wrote:

> Daniel Braniss wrote:
>> On 24 Aug 2015, at 10:22, Hans Petter Selasky h...@selasky.org wrote:
>>> On 08/24/15 01:02, Rick Macklem wrote:
>>>> The other thing is the degradation seems to cut the rate by about
>>>> half each time. 300 -> 150 -> 70
>>>> I have no idea if this helps to explain it.
>>>
>>> Might be a NUMA binding issue for the processes involved.
>>>
>>> man cpuset
>>>
>>> --HPS
>>
>> I can't see how this is relevant, given that the same host, using the
>> mellanox/mlxen, behaves much better.
>
> Well, the ix driver has a bunch of tunables for things like the number
> of queues, and although I'll admit I don't understand how these queues
> are used, I think they are related to CPUs and their caches. There is
> also something called IXGBE_FDIR, which others have recommended be
> disabled. (The code is #ifdef IXGBE_FDIR, but I don't know if it is
> defined for your kernel?) There are also tunables for the interrupt
> rate and something called hw.ixgbe_tx_process_limit, which appears to
> limit the number of packets to send or something like that? (I suspect
> Hans would understand this stuff much better than I do, since I don't
> understand it at all. ;-)

but how does this explain the fact that, at the same time, the throughput to the NetApp is about 70MB/s while to a FreeBSD server it's above 150MB/s? (window size negotiation?) switching off TSO evens out this diff.

> At a glance, the mellanox driver looks very different.
>
>> I'm getting different results with the intel/ix depending on who is
>> the nfs server.
>
> Who knows until you figure out what is actually going on. It could just
> be the timing of handling the write RPCs or when the different servers
> send acks for the TCP segments or ... that causes this for one server
> and not another.
>
> [...]
>
> I am not suggesting you have a broken network switch, just don't take
> anything off the table until you know what is actually going on. And to
> be honest, you may never know, but it is fun to try and solve these
> puzzles.

one needs to find the clues … at the moment:
when things go bad, they stay bad: ix/nfs/tcp/tso and the NetApp
when things are ok, the numbers fluctuate, which is probably due to loads on the system, but they are far above the 70MB/s (100 to 200)

> Beyond what I already suggested, I'd look at the ix driver's stats and
> tunables and see if any of the tunables has an effect. (And, yes, it
> will take time to work through these.)
>
> Good luck with it, rick

danny
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Hi,

I've made some minor modifications to the patch from Rick, and made this review:

https://reviews.freebsd.org/D3477

--HPS
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Daniel Braniss wrote:
> On 24 Aug 2015, at 10:22, Hans Petter Selasky h...@selasky.org wrote:
>> On 08/24/15 01:02, Rick Macklem wrote:
>>> The other thing is the degradation seems to cut the rate by about
>>> half each time. 300 -> 150 -> 70
>>> I have no idea if this helps to explain it.
>>
>> Might be a NUMA binding issue for the processes involved.
>>
>> man cpuset
>>
>> --HPS
>
> I can't see how this is relevant, given that the same host, using the
> mellanox/mlxen, behaves much better.

Well, the ix driver has a bunch of tunables for things like the number of queues, and although I'll admit I don't understand how these queues are used, I think they are related to CPUs and their caches. There is also something called IXGBE_FDIR, which others have recommended be disabled. (The code is #ifdef IXGBE_FDIR, but I don't know if it is defined for your kernel?) There are also tunables for the interrupt rate and something called hw.ixgbe_tx_process_limit, which appears to limit the number of packets to send or something like that? (I suspect Hans would understand this stuff much better than I do, since I don't understand it at all. ;-)

At a glance, the mellanox driver looks very different.

> I'm getting different results with the intel/ix depending on who is the
> nfs server.

Who knows until you figure out what is actually going on. It could just be the timing of handling the write RPCs or when the different servers send acks for the TCP segments or ... that causes this for one server and not another.

One of the principles used when investigating airplane accidents is to never assume anything and just try to collect the facts until the pieces of the puzzle fall into place. I think the same principle works for this kind of stuff. I once had a case where a specific read of one NFS file would fail on certain machines. I won't bore you with the details, but after weeks we got to the point where we had a lab of identical machines (exactly the same hardware and exactly the same software loaded on them) and we could reproduce the problem on about half the machines and not the other half. We (myself and the guy I worked with) finally noticed that the failing machines were on network ports of a given switch. We moved the net cables to another switch and the problem went away. -- This particular network switch was broken in such a way that it would garble one specific packet consistently, but worked fine for everything else.

My point here is that, if someone had suggested the network switch might be broken at the beginning of investigating this, I would probably have dismissed it, on the basis that "the network is working just fine", but in the end, that was the problem. -- I am not suggesting you have a broken network switch, just don't take anything off the table until you know what is actually going on. And to be honest, you may never know, but it is fun to try and solve these puzzles.

Beyond what I already suggested, I'd look at the ix driver's stats and tunables and see if any of the tunables has an effect. (And, yes, it will take time to work through these.)

Good luck with it, rick

> danny
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 08/24/15 01:02, Rick Macklem wrote:
> The other thing is the degradation seems to cut the rate by about half
> each time. 300 -> 150 -> 70
> I have no idea if this helps to explain it.

Might be a NUMA binding issue for the processes involved.

man cpuset

--HPS
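[For readers unfamiliar with cpuset(1), pinning the processes involved to a fixed set of CPUs is one way to test the NUMA-binding theory; the CPU list and commands below are examples, not a recommendation for this particular hardware:]

	# run the benchmark pinned to CPUs 0-7 (e.g. the NIC's local NUMA domain)
	cpuset -l 0-7 iperf -c <server>

	# or re-bind an already-running process by its pid
	cpuset -l 0-7 -p <pid>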
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 24 Aug 2015, at 10:22, Hans Petter Selasky h...@selasky.org wrote:

> On 08/24/15 01:02, Rick Macklem wrote:
>> The other thing is the degradation seems to cut the rate by about half
>> each time. 300 -> 150 -> 70
>> I have no idea if this helps to explain it.
>
> Might be a NUMA binding issue for the processes involved.
>
> man cpuset
>
> --HPS

I can't see how this is relevant, given that the same host, using the mellanox/mlxen, behaves much better.

I'm getting different results with the intel/ix depending on who is the nfs server.

danny
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 24 Aug 2015, at 02:02, Rick Macklem rmack...@uoguelph.ca wrote:

> Daniel Braniss wrote:
>
> [...]
>
>> send me the patch and I'll test it ASAP.
>> danny
>
> Patch is attached. The one for head will also include an update to the
> comment in sys/net/if_var.h, but that isn't needed for testing.
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Hi,

Some hand-waving suggestions:

* If you're running something before 10.2, please disable IXGBE_FDIR in sys/conf/options and sys/modules/ixgbe/Makefile. It's buggy and it caused a lot of issues.

* It sounds like some extra latency is happening, so I'd fiddle around with interrupt settings. By default the driver does something called adaptive interrupt moderation and it may be getting in the way of what you're trying to do. There's a way to disable AIM in /boot/loader.conf and manually set the interrupt rate.

* As others have said, TSO has been a bit of a problem - hps has been working on solidifying the TSO configuration side of things, so NICs advertise to the stack what their maximum offload capability is and things like NFS and TCP don't exceed the segment count. I don't know if it's tunable without hacking the driver, but maybe hack the driver to reduce the count a little to make sure you're not overflowing things and causing it to fall back to a slower path (where it copies all the mbufs into a single larger one to send to the NIC.)

* Disable software LRO and see if it helps. Since you're doing lots of little non-streaming operations, it may actually be hindering.

HTH,

-adrian
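[As a concrete starting point for the loader.conf route Adrian mentions, a sketch; note the tunable names below are assumptions from memory of the 10.x ix/ixgbe driver and must be verified on the machine in question (e.g. sysctl -aN | grep -E 'hw\.(ix|ixgbe)') before relying on them:]

	# /boot/loader.conf (illustrative values, names unverified)
	hw.ix.enable_aim="0"              # disable adaptive interrupt moderation
	hw.ix.max_interrupt_rate="31250"  # fix the per-queue interrupt rate
	hw.ixgbe.tx_process_limit="256"   # TX cleanup batch size mentioned above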
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 22 Aug 2015, at 14:59, Rick Macklem rmack...@uoguelph.ca wrote:

> [...]
>
> Patch is attached. The one for head will also include an update to the
> comment in sys/net/if_var.h, but that isn't needed for testing.

well, the plot thickens. Yesterday, before running the new kernel, I decided to re-run my test, and to my surprise I was getting good numbers, about 300MB/s with and without TSO. This morning, the numbers were again bad, around 70MB/s, what the ^%$#@!

so, after some coffee, I ran some more tests, and some conclusions:

using a netapp(*) as the nfs client:
- doing ifconfig ix0 tso or -tso does some magic and the numbers are back to normal - for a while

using another Fbsd/zfs as client: all is nifty, actually a bit faster than the netapp (not a fair comparison, since the zfs client is not heavily used), and I can't see any degradation.

btw, this is with the patch applied, but I was seeing similar numbers before the patch.

running with tso, initially I get around 300MB/s, but after a while (sorry, can't be more scientific) it drops down to about half, and finally to a pathetic 70MB/s.

*: while running the tests I monitored the NetApp, and nothing out of the ordinary there.
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Sun, Aug 23, 2015 at 02:08:56PM +0300, Daniel Braniss wrote:

>>> send me the patch and I'll test it ASAP.
>>> danny
>>
>> Patch is attached. The one for head will also include an update to the
>> comment in sys/net/if_var.h, but that isn't needed for testing.
>
> well, the plot thickens. Yesterday, before running the new kernel, I
> decided to re-run my test, and to my surprise I was getting good
> numbers, about 300MB/s with and without TSO. This morning, the numbers
> were again bad, around 70MB/s, what the ^%$#@!
>
> [...]
>
> *: while running the tests I monitored the NetApp, and nothing out of
> the ordinary there.

Can you do this
https://lists.freebsd.org/pipermail/freebsd-stable/2015-August/083138.html
?
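[For anyone reproducing the toggle Danny describes above, the commands are just the following; ix0 is the interface name used throughout this thread:]

	ifconfig ix0 -tso   # disable TSO
	ifconfig ix0 tso    # re-enable TSO
	ifconfig ix0        # the options= line shows TSO4/TSO6 when enabled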
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Daniel Braniss wrote:
> On 22 Aug 2015, at 14:59, Rick Macklem rmack...@uoguelph.ca wrote:
>
> [...]
>
>> Thanks for this comment. I tend to agree, both for the reason you
>> state and also because the patch is simple enough that it might
>> qualify as an errata for 10.2. I am hoping Daniel Braniss will be able
>> to test the patch and let us know if it improves performance with TSO
>> enabled?
>
> send me the patch and I'll test it ASAP.
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Daniel Braniss wrote:
> On Aug 22, 2015, at 12:46 AM, Rick Macklem rmack...@uoguelph.ca wrote:
>
> [...]
>
>> Thanks for this comment. I tend to agree, both for the reason you
>> state and also because the patch is simple enough that it might
>> qualify as an errata for 10.2. I am hoping Daniel Braniss will be able
>> to test the patch and let us know if it improves performance with TSO
>> enabled?
>
> send me the patch and I'll test it ASAP.
> danny

Patch is attached. The one for head will also include an update to the comment in sys/net/if_var.h, but that isn't needed for testing.
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Aug 22, 2015, at 12:46 AM, Rick Macklem rmack...@uoguelph.ca wrote:

> [...]
>
> Thanks for this comment. I tend to agree, both for the reason you state
> and also because the patch is simple enough that it might qualify as an
> errata for 10.2. I am hoping Daniel Braniss will be able to test the
> patch and let us know if it improves performance with TSO enabled?

send me the patch and I'll test it ASAP.

danny
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Yonghyeon PYUN wrote:
> On Wed, Aug 19, 2015 at 09:00:35AM -0400, Rick Macklem wrote:
>> Hans Petter Selasky wrote:
>>> On 08/19/15 09:42, Yonghyeon PYUN wrote:
>>>> On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote:
>>>>> On 08/18/15 23:54, Rick Macklem wrote:
>>>>>> Ouch! Yes, I now see that the code that counts the # of mbufs is
>>>>>> before the code that adds the tcp/ip header mbuf. In my opinion,
>>>>>> this should be fixed by setting if_hw_tsomaxsegcount to whatever
>>>>>> the driver provides - 1. It is not the driver's responsibility to
>>>>>> know if a tcp/ip header mbuf will be added, and it is a lot less
>>>>>> confusing than expecting the driver author to know to subtract
>>>>>> one. (I had mistakenly thought that tcp_output() had added the
>>>>>> tcp/ip header mbuf before the loop that counts mbufs in the list.
>>>>>> Btw, this tcp/ip header mbuf also has leading space for the MAC
>>>>>> layer header.)
>>>>>
>>>>> Hi Rick,
>>>>>
>>>>> Your question is good. With the Mellanox hardware we have separate
>>>>> so-called inline data space for the TCP/IP headers, so if the TCP
>>>>> stack subtracts something, then we would need to add something to
>>>>> the limit, because then the scatter/gather list is only used for
>>>>> the data part.
>>>>
>>>> I think all drivers in tree don't subtract 1 for
>>>> if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be
>>>> simpler than fixing all other drivers in tree.
>>>
>>> Maybe it can be controlled by some kind of flag, if all the three TSO
>>> limits should include the TCP/IP/ethernet headers too. I'm pretty
>>> sure we want both versions.
>
> Hmm, I'm afraid it's already complex. Drivers have to tell almost the
> same information to both bus_dma(9) and the network stack.
>
>>> Don't forget that not all drivers in the tree set the TSO limits
>>> before if_attach(), so possibly the subtraction of one TSO fragment
>>> needs to go into ip_output().
>>
>> Ok, I realized that some drivers may not know the answers before
>> ether_ifattach(), due to the way they are configured/written (I saw
>> the use of if_hw_tsomax_update() in the patch).
>
> I was not able to find an interface that configures TSO parameters
> after the if_t conversion. I'm under the impression
> if_hw_tsomax_update() is not designed to be used this way. Probably we
> need a better one? (CCed to Gleb.)
>
>> If it is subtracted as a part of the assignment to
>> if_hw_tsomaxsegcount at line#791 in tcp_output() like the following, I
>> don't think it should matter if the values are set before
>> ether_ifattach()?
>>
>>	/*
>>	 * Subtract 1 for the tcp/ip header mbuf that
>>	 * will be prepended to the mbuf chain in this
>>	 * function in the code below this block.
>>	 */
>>	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1;
>>
>> I don't have a good solution for the case where a driver doesn't plan
>> on using the tcp/ip header provided by tcp_output() except to say the
>> driver can add one to the setting to compensate for that (and if they
>> fail to do so, it still works, although somewhat suboptimally). When I
>> now read the comment in sys/net/if_var.h it is clear what it means,
>> but for some reason I didn't read it that way before? (I think it was
>> the part that said the driver didn't have to subtract for the headers
>> that confused me?) In any case, we need to try and come up with a
>> clear definition of what they need to be set to.
>>
>> I can now think of two ways to deal with this:
>> 1 - Leave tcp_output() as is, but provide a macro for the device
>>     driver authors to use that sets if_hw_tsomaxsegcount with a flag
>>     for "driver uses tcp/ip header mbuf", documenting that this flag
>>     should normally be true.
>> OR
>> 2 - Change tcp_output() as above, noting that this is a workaround for
>>     confusion w.r.t. whether or not if_hw_tsomaxsegcount should
>>     include the tcp/ip header mbuf, and update the comment in if_var.h
>>     to reflect this. Then drivers that don't use the tcp/ip header
>>     mbuf can increase their value for if_hw_tsomaxsegcount by 1. (The
>>     comment should also mention that a value of 35 or greater is much
>>     preferred to 32 if the hardware will support that.)
>
> Both work for me. My preference is 2 just because it's very common for
> most drivers that use the tcp/ip header mbuf.

Thanks for this comment. I tend to agree, both for the reason you state and also because the patch is simple enough that it might qualify as an errata for 10.2.

I am hoping Daniel Braniss will be able to test the patch and let us know if it improves performance with TSO enabled?

rick
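[To make option (2) concrete, a sketch of the two halves of the proposal; this is my illustration of the idea as discussed, not the committed patch (see D3477 for what was actually done), and HW_MAX_SG_ENTRIES is a hypothetical constant:]

	/*
	 * In tcp_output(): compensate for the tcp/ip header mbuf that this
	 * function will prepend, so drivers can report the raw hardware limit.
	 */
	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1;

	/*
	 * In a driver like mlxen that stores the ETH/TCP/IP headers in separate
	 * "inline" space, the header mbuf does not consume a scatter/gather
	 * entry, so such a driver would add the one back when attaching:
	 */
	ifp->if_hw_tsomaxsegcount = HW_MAX_SG_ENTRIES + 1;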
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Yonghyeon,

On Thu, Aug 20, 2015 at 11:30:24AM +0900, Yonghyeon PYUN wrote:

>>> Maybe it can be controlled by some kind of flag, if all the three TSO
>>> limits should include the TCP/IP/ethernet headers too. I'm pretty
>>> sure we want both versions.
>
> Hmm, I'm afraid it's already complex. Drivers have to tell almost the
> same information to both bus_dma(9) and the network stack.
>
>>> Don't forget that not all drivers in the tree set the TSO limits
>>> before if_attach(), so possibly the subtraction of one TSO fragment
>>> needs to go into ip_output().
>>
>> Ok, I realized that some drivers may not know the answers before
>> ether_ifattach(), due to the way they are configured/written (I saw
>> the use of if_hw_tsomax_update() in the patch).
>
> I was not able to find an interface that configures TSO parameters
> after the if_t conversion. I'm under the impression
> if_hw_tsomax_update() is not designed to be used this way. Probably we
> need a better one? (CCed to Gleb.)

Yes. In the projects/ifnet branch all the TSO stuff is configured differently. I'd really appreciate it if other developers look there and review it, try it, give some input. Here is a snippet from net/if.h in projects/ifnet:

	/*
	 * Structure describing TSO properties of an interface. Known both to
	 * ifnet layer and TCP. Most interfaces point to a static tsomax in
	 * ifdriver definition. However, vlan(4) and lagg(4) require a dynamic
	 * tsomax.
	 */
	struct iftsomax {
		uint32_t tsomax_bytes;    /* TSO total burst length limit in bytes */
		uint32_t tsomax_segcount; /* TSO maximum segment count */
		uint32_t tsomax_segsize;  /* TSO maximum segment size in bytes */
	};

Now closer to your original question. I haven't yet converted lagg(4), so haven't yet worked on if_hw_tsomax_update(). I am convinced that it shouldn't be needed for a regular driver (save lagg(4)). A proper driver should first study its hardware and only then call if_attach(). Correct me if I am wrong, please.

Also, I suppose that a piece of hardware can't change its TSO maximums at runtime, so I don't see a reason for changing them at runtime (save lagg(4)).

-- 
Totus tuus, Glebius.
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Yonghyeon PYUN wrote:
> On Wed, Aug 19, 2015 at 08:13:59AM -0400, Rick Macklem wrote:
>> Yonghyeon PYUN wrote:
>>> On Wed, Aug 19, 2015 at 09:51:44AM +0200, Hans Petter Selasky wrote:
>>>> On 08/19/15 09:42, Yonghyeon PYUN wrote:
>>>>
>>>> [...]
>>>>
>>>> Hi,
>>>>
>>>> If you change the behaviour don't forget to update and/or add
>>>> comments describing it. Maybe the amount of subtraction could be
>>>> defined by some macro? Then drivers which inline the headers can
>>>> subtract it?
>>>
>>> I'm also ok with your suggestion.
>>
>> Your suggestion is fine by me. The initial TSO limits were tried to be
>> preserved, and I believe that the TSO limits never accounted for
>> IP/TCP/ETHERNET/VLAN headers!
>
> I guess FreeBSD used to follow the MS LSOv1 specification, with a minor
> exception in pseudo checksum computation. If I recall correctly, the
> specification says the upper stack can generate up to an
> IP_MAXPACKET-sized packet. Other L2 headers like the ethernet/vlan
> header size are not included in the packet, and it's the driver's
> responsibility to allocate additional DMA buffers/segments for L2
> headers.
>
>> Yep. The default for if_hw_tsomax was reduced from IP_MAXPACKET to
>> 32 * MCLBYTES - max_ethernet_header_size as a workaround/hack so that
>> devices limited to 32 transmit segments would work (ie. the entire
>> packet, including the MAC header, would fit in 32 MCLBYTE clusters).
>> This implied that many drivers did end up using m_defrag() to copy the
>> mbuf list to one made up of 32 MCLBYTE clusters.
>>
>> If a driver sets if_hw_tsomaxsegcount correctly, then it can set
>> if_hw_tsomax to whatever it can handle as the largest TSO packet
>> (without MAC header) the hardware can handle. If it can handle
>> IP_MAXPACKET, then it can set it to that.
>
> I thought the upper limit was still IP_MAXPACKET. If a driver increases
> it (i.e. > IP_MAXPACKET), the length field in the IP header would
> overflow, which in turn may break firewalls and other packet handling
> in the IPv4/IPv6 code path.
>
> I have no idea if a bogus value in the ip_len field of the TSO segment
> would break something in ip_output() or not. This would need to be
> checked before anyone configures if_hw_tsomax > IP_MAXPACKET.

I didn't think of any effect this would have in ip_output(); I just knew that the hardware would be replacing ip_len when it generated the TCP/IP segments from the TSO segment. As you note, I vaguely recall some hardware being able to handle a TSO segment > IP_MAXPACKET (presumably getting the TSO segment's length some other way).

It would be nice if this was checked, but yes, the comment should specify an upper bound on if_hw_tsomax of IP_MAXPACKET until then.

rick

> If the limit no longer applies to the network stack, that's great. Some
> controllers can handle up to 256KB TCP/UDP segmentation, and supporting
> that feature wouldn't be hard.
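[The constraint being debated is just the 16-bit total-length field in the IP header; as a reminder, using only standard definitions:]

	#include <sys/types.h>
	#include <netinet/in.h>
	#include <netinet/ip.h>	/* defines IP_MAXPACKET */

	/*
	 * struct ip's ip_len is 16 bits, so a pre-segmentation TSO "packet"
	 * handed down with ip_len filled in cannot describe more than
	 * IP_MAXPACKET bytes without overflowing; hence the suggested upper
	 * bound on if_hw_tsomax until the ip_output() path is audited.
	 */
	_Static_assert(IP_MAXPACKET == 65535, "16-bit IP total length field");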
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Hans Petter Selasky wrote:
> On 08/19/15 09:42, Yonghyeon PYUN wrote:
>
> [...]
>
> Maybe it can be controlled by some kind of flag, if all the three TSO
> limits should include the TCP/IP/ethernet headers too. I'm pretty sure
> we want both versions.
>
> Don't forget that not all drivers in the tree set the TSO limits before
> if_attach(), so possibly the subtraction of one TSO fragment needs to
> go into ip_output().

I think setting them before a call to ether_ifattach() should be required, and any driver that doesn't do that needs to be fixed.

Also, I notice that

	32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN)

is getting written as

	65536 - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN)

which obscures the reason it is the default. It probably isn't the correct default for any driver that sets if_hw_tsomaxsegcount, but it is close to IP_MAXPACKET, so the breakage is mostly theoretical.

rick

> --HPS
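[Spelled out the way Rick suggests, the default would read as below; the macro name is illustrative, and the stack's actual spelling, if it uses a macro at all, may differ:]

	/*
	 * Default TSO burst limit: 32 transmit segments of MCLBYTES (2K) each,
	 * less room for an ethernet + VLAN header, so a full TSO packet plus
	 * MAC header fits in 32 mbuf clusters after m_defrag().
	 */
	#define IF_HW_TSOMAX_DEFAULT \
		(32 * MCLBYTES - (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN))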
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 19 Aug 2015, at 16:00, Rick Macklem rmack...@uoguelph.ca wrote: Hans Petter Selasky wrote: On 08/19/15 09:42, Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing that expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tc/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think all drivers in tree don't subtract 1 for if_hw_tsomaxsegcount. Probably touching Mellanox driver would be simpler than fixing all other drivers in tree. Maybe it can be controlled by some kind of flag, if all the three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. Hmm, I'm afraid it's already complex. Drivers have to tell almost the same information to both bus_dma(9) and network stack. Don't forget that not all drivers in the tree set the TSO limits before if_attach(), so possibly the subtraction of one TSO fragment needs to go into ip_output() Ok, I realized that some drivers may not know the answers before ether_ifattach(), due to the way they are configured/written (I saw the use of if_hw_tsomax_update() in the patch). If it is subtracted as a part of the assignment to if_hw_tsomaxsegcount in tcp_output() at line#791 in tcp_output() like the following, I don't think it should matter if the values are set before ether_ifattach()? /* * Subtract 1 for the tcp/ip header mbuf that * will be prepended to the mbuf chain in this * function in the code below this block. */ if_hw_tsomaxsegcount = tp-t_tsomaxsegcount - 1; I don't have a good solution for the case where a driver doesn't plan on using the tcp/ip header provided by tcp_output() except to say the driver can add one to the setting to compensate for that (and if they fail to do so, it still works, although somewhat suboptimally). When I now read the comment in sys/net/if_var.h it is clear what it means, but for some reason I didn't read it that way before? (I think it was the part that said the driver didn't have to subtract for the headers that confused me?) In any case, we need to try and come up with a clear definition of what they need to be set to. I can now think of two ways to deal with this: 1 - Leave tcp_output() as is, but provide a macro for the device driver authors to use that sets if_hw_tsomaxsegcount with a flag for driver uses tcp/ip header mbuf, documenting that this flag should normally be true. OR 2 - Change tcp_output() as above, noting that this is a workaround for confusion w.r.t. whether or not if_hw_tsomaxsegcount should include the tcp/ip header mbuf and update the comment in if_var.h to reflect this. Then drivers that don't use the tcp/ip header mbuf can increase their value for if_hw_tsomaxsegcount by 1. 
(The comment should also mention that a value of 35 or greater is much preferred to 32 if the hardware will support that.) Also, I'd like to apologize for some of my emails getting a little blunt. I just find it frustrating that this problem is still showing up and is even in 10.2. This is partly my fault for not making it clearer to driver authors what if_hw_tsomaxsegcount should be set to, because I had it incorrect. Hopefully we can come up with a solution that everyone is comfortable with, rick ok guys, when you have some code for me to try just let me know. danny ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
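As a sketch of what option 1 above might look like: the macro and flag names here are hypothetical (nothing like this exists in the tree), and the mlxen limit of 24 data segments is the figure quoted earlier in the thread.

	/*
	 * Hypothetical helper for option 1: the driver passes its raw
	 * hardware scatter/gather limit plus a flag saying whether the
	 * tcp/ip header mbuf shares the DMA segment list (the normal
	 * case) or is copied into separate inline space (mlxen).
	 */
	#define	IF_TSO_HDR_INLINED	0x1	/* hdrs use separate inline space */

	#define	IF_HW_TSOMAXSEGCOUNT_SET(ifp, hwsegs, flags) do {		\
		(ifp)->if_hw_tsomaxsegcount =					\
		    ((flags) & IF_TSO_HDR_INLINED) ?				\
		    (hwsegs) : (hwsegs) - 1;					\
	} while (0)

	/* ix: the header mbuf is DMA'd with the data, so reserve one slot: */
	IF_HW_TSOMAXSEGCOUNT_SET(ifp, IXGBE_82599_SCATTER, 0);
	/* mlxen: headers go to inline space, all 24 slots carry data: */
	IF_HW_TSOMAXSEGCOUNT_SET(ifp, 24, IF_TSO_HDR_INLINED);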
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Hans Petter Selasky wrote: On 08/19/15 09:42, Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think no driver in the tree subtracts 1 for if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be simpler than fixing all other drivers in the tree. Maybe it can be controlled by some kind of flag, indicating whether all three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. Hmm, I'm afraid it's already complex. Drivers have to tell almost the same information to both bus_dma(9) and the network stack. Don't forget that not all drivers in the tree set the TSO limits before if_attach(), so possibly the subtraction of one TSO fragment needs to go into ip_output() I don't really care where it gets subtracted, so long as it is subtracted at least by default, so all the drivers that don't subtract it get fixed. However, I might argue that tcp_output() is the correct place, since tcp_output() is where the tcp/ip header mbuf is prepended to the list. The subtraction is just taking into account the mbuf that tcp_output() will be adding to the head of the list, and it should count that in the while() loop. rick --HPS ___ freebsd-...@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:51:44AM +0200, Hans Petter Selasky wrote: On 08/19/15 09:42, Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think no driver in the tree subtracts 1 for if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be simpler than fixing all other drivers in the tree. Hi, If you change the behaviour don't forget to update and/or add comments describing it. Maybe the amount of subtraction could be defined by some macro? Then drivers which inline the headers can subtract it? I'm also ok with your suggestion. Your suggestion is fine by me. The intent was to preserve the initial TSO limits, and I believe those limits never accounted for IP/TCP/ETHERNET/VLAN headers! I guess FreeBSD used to follow the MS LSOv1 specification with a minor exception in pseudo checksum computation. If I recall correctly, the specification says the upper stack can generate up to an IP_MAXPACKET sized packet. L2 headers like the ethernet/vlan header are not included in that packet and it is the driver's responsibility to allocate additional DMA buffers/segments for them. Yep. The default for if_hw_tsomax was reduced from IP_MAXPACKET to 32 * MCLBYTES - max_ethernet_header_size as a workaround/hack so that devices limited to 32 transmit segments would work (ie. the entire packet, including the MAC header, would fit in 32 MCLBYTE clusters). This implied that many drivers did end up using m_defrag() to copy the mbuf list to one made up of 32 MCLBYTE clusters. If a driver sets if_hw_tsomaxsegcount correctly, then it can set if_hw_tsomax to whatever it can handle as the largest TSO packet (without MAC header) the hardware can handle. If it can handle IP_MAXPACKET, then it can set it to that. rick Maybe it can be controlled by some kind of flag, indicating whether all three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. Hmm, I'm afraid it's already complex. Drivers have to tell almost the same information to both bus_dma(9) and the network stack. You're right it's complicated. Not sure if bus_dma can provide an API for this though. --HPS ___ freebsd-...@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
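The workaround default Rick describes works out as follows, assuming MCLBYTES is 2048 and an 18-byte ethernet + VLAN header, which matches the 65518 the ix driver sets:

	/*
	 * 32 transmit segments of MCLBYTES each, minus room for the MAC
	 * header, gives the workaround default for if_hw_tsomax:
	 */
	ifp->if_hw_tsomax = 32 * MCLBYTES -
	    (ETHER_HDR_LEN + ETHER_VLAN_ENCAP_LEN);
	/*	= 32 * 2048 - (14 + 4) = 65518 bytes */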
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Daniel Braniss wrote: On 19 Aug 2015, at 16:00, Rick Macklem rmack...@uoguelph.ca wrote: Hans Petter Selasky wrote: On 08/19/15 09:42, Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think no driver in the tree subtracts 1 for if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be simpler than fixing all other drivers in the tree. Maybe it can be controlled by some kind of flag, indicating whether all three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. Hmm, I'm afraid it's already complex. Drivers have to tell almost the same information to both bus_dma(9) and the network stack. Don't forget that not all drivers in the tree set the TSO limits before if_attach(), so possibly the subtraction of one TSO fragment needs to go into ip_output() Ok, I realized that some drivers may not know the answers before ether_ifattach(), due to the way they are configured/written (I saw the use of if_hw_tsomax_update() in the patch). If it is subtracted as part of the assignment to if_hw_tsomaxsegcount at line #791 in tcp_output() like the following, I don't think it should matter if the values are set before ether_ifattach()? /* * Subtract 1 for the tcp/ip header mbuf that * will be prepended to the mbuf chain in this * function in the code below this block. */ if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1; Well, you can replace the line in sys/netinet/tcp_output.c that looks like: if_hw_tsomaxsegcount = tp->t_tsomaxsegcount; with the above line (at line #797 in head). Any other patch for this will have the same effect, rick I don't have a good solution for the case where a driver doesn't plan on using the tcp/ip header provided by tcp_output() except to say the driver can add one to the setting to compensate for that (and if they fail to do so, it still works, although somewhat suboptimally). When I now read the comment in sys/net/if_var.h it is clear what it means, but for some reason I didn't read it that way before? (I think it was the part that said the driver didn't have to subtract for the headers that confused me?) In any case, we need to try and come up with a clear definition of what they need to be set to. I can now think of two ways to deal with this: 1 - Leave tcp_output() as is, but provide a macro for the device driver authors to use that sets if_hw_tsomaxsegcount, with a flag for whether the driver uses the tcp/ip header mbuf, documenting that this flag should normally be true. OR 2 - Change tcp_output() as above, noting that this is a workaround for confusion w.r.t.
whether or not if_hw_tsomaxsegcount should include the tcp/ip header mbuf, and update the comment in if_var.h to reflect this. Then drivers that don't use the tcp/ip header mbuf can increase their value for if_hw_tsomaxsegcount by 1. (The comment should also mention that a value of 35 or greater is much preferred to 32 if the hardware will support that.) Also, I'd like to apologize for some of my emails getting a little blunt. I just find it frustrating that this problem is still showing up and is even in 10.2. This is partly my fault for not making it clearer to driver authors what if_hw_tsomaxsegcount should be set to, because I had it incorrect. Hopefully we can come up with a solution that everyone is comfortable with, rick ok guys, when you have some code for me to try just let me know. danny ___ freebsd-...@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
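For clarity, the one-line change Rick describes, shown before and after as a sketch against stable/10 (the surrounding code is as quoted above):

	/* sys/netinet/tcp_output.c, around line #797 -- before: */
	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount;

	/*
	 * after (option 2): account for the tcp/ip header mbuf that
	 * tcp_output() prepends to the chain below this point
	 */
	if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1;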
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Wed, Aug 19, 2015 at 09:00:35AM -0400, Rick Macklem wrote: Hans Petter Selasky wrote: On 08/19/15 09:42, Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think no driver in the tree subtracts 1 for if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be simpler than fixing all other drivers in the tree. Maybe it can be controlled by some kind of flag, indicating whether all three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. Hmm, I'm afraid it's already complex. Drivers have to tell almost the same information to both bus_dma(9) and the network stack. Don't forget that not all drivers in the tree set the TSO limits before if_attach(), so possibly the subtraction of one TSO fragment needs to go into ip_output() Ok, I realized that some drivers may not know the answers before ether_ifattach(), due to the way they are configured/written (I saw the use of if_hw_tsomax_update() in the patch). I was not able to find an interface that configures TSO parameters after the if_t conversion. I'm under the impression that if_hw_tsomax_update() is not designed to be used this way. Probably we need a better one? (CCed to Gleb). If it is subtracted as part of the assignment to if_hw_tsomaxsegcount at line #791 in tcp_output() like the following, I don't think it should matter if the values are set before ether_ifattach()? /* * Subtract 1 for the tcp/ip header mbuf that * will be prepended to the mbuf chain in this * function in the code below this block. */ if_hw_tsomaxsegcount = tp->t_tsomaxsegcount - 1; I don't have a good solution for the case where a driver doesn't plan on using the tcp/ip header provided by tcp_output() except to say the driver can add one to the setting to compensate for that (and if they fail to do so, it still works, although somewhat suboptimally). When I now read the comment in sys/net/if_var.h it is clear what it means, but for some reason I didn't read it that way before? (I think it was the part that said the driver didn't have to subtract for the headers that confused me?) In any case, we need to try and come up with a clear definition of what they need to be set to. I can now think of two ways to deal with this: 1 - Leave tcp_output() as is, but provide a macro for the device driver authors to use that sets if_hw_tsomaxsegcount, with a flag for whether the driver uses the tcp/ip header mbuf, documenting that this flag should normally be true. OR 2 - Change tcp_output() as above, noting that this is a workaround for confusion w.r.t.
whether or not if_hw_tsomaxsegcount should include the tcp/ip header mbuf, and update the comment in if_var.h to reflect this. Then drivers that don't use the tcp/ip header mbuf can increase their value for if_hw_tsomaxsegcount by 1. (The comment should also mention that a value of 35 or greater is much preferred to 32 if the hardware will support that.) Both work for me. My preference is 2, just because using the tcp/ip header mbuf is the common case for drivers. ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. Maybe it can be controlled by some kind of flag, indicating whether all three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. --HPS ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Tue, Aug 18, 2015 at 06:04:25PM -0400, Rick Macklem wrote: Hans Petter Selasky wrote: On 08/18/15 14:53, Rick Macklem wrote: If this is just a test machine, maybe you could test with these lines (at about #880) in sys/netinet/tcp_output.c commented out? (It looks to me like this will disable TSO for almost all the NFS writes.) - around line #880 in sys/netinet/tcp_output.c: /* * In case there are too many small fragments * don't use TSO: */ if (len <= max_len) { len = max_len; sendalot = 1; tso = 0; } This was added along with the other stuff that did the if_hw_tsomaxsegcount, etc and I never noticed it until now (not my patch). FYI: These lines are needed by other hardware, like the mlxen driver. If you remove them mlxen will start doing m_defrag(). I believe if you set the correct parameters in the struct ifnet for the TSO size/count limits this problem will go away. If you print the len and max_len and also the cases where TSO limits are reached, you'll see what parameter is triggering it and needs to be increased. Well, if the driver isn't setting if_hw_tsomaxsegcount correctly, then it is the driver that needs to be fixed. Having the above code block disable TSO for all of the NFS writes, including the ones that set if_hw_tsomaxsegcount correctly, doesn't make sense to me. If the driver authors don't set these, the drivers do lots of m_defrag() calls. I have posted more than once to freebsd-net@ asking the driver authors to set these and some now have. (I can't do it, because I don't have the hardware to test it with.) Thanks for the reminder. I have generated a diff against HEAD. https://people.freebsd.org/~yongari/tso.param.diff The diff restores optimal TSO parameters which were lost in r271946 for drivers that relied on sane default values. I'll commit it after some testing. I do think that most/all of them don't subtract 1 for the tcp/ip header and I don't think they should be expected to, since the driver isn't supposed to worry about the protocol at that level. I agree. -- I think tcp_output() should subtract one from the if_hw_tsomaxsegcount provided by the driver to handle this, since it chooses to count mbufs (the while() loop at around line #825 in sys/netinet/tcp_output.c) before it prepends the tcp/ip header mbuf. rick --HPS ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Wed, Aug 19, 2015 at 09:51:44AM +0200, Hans Petter Selasky wrote: On 08/19/15 09:42, Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think no driver in the tree subtracts 1 for if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be simpler than fixing all other drivers in the tree. Hi, If you change the behaviour don't forget to update and/or add comments describing it. Maybe the amount of subtraction could be defined by some macro? Then drivers which inline the headers can subtract it? I'm also ok with your suggestion. Your suggestion is fine by me. The intent was to preserve the initial TSO limits, and I believe those limits never accounted for IP/TCP/ETHERNET/VLAN headers! I guess FreeBSD used to follow the MS LSOv1 specification with a minor exception in pseudo checksum computation. If I recall correctly, the specification says the upper stack can generate up to an IP_MAXPACKET sized packet. L2 headers like the ethernet/vlan header are not included in that packet and it is the driver's responsibility to allocate additional DMA buffers/segments for them. Maybe it can be controlled by some kind of flag, indicating whether all three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. Hmm, I'm afraid it's already complex. Drivers have to tell almost the same information to both bus_dma(9) and the network stack. You're right it's complicated. Not sure if bus_dma can provide an API for this though. --HPS ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think no driver in the tree subtracts 1 for if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be simpler than fixing all other drivers in the tree. Maybe it can be controlled by some kind of flag, indicating whether all three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. Hmm, I'm afraid it's already complex. Drivers have to tell almost the same information to both bus_dma(9) and the network stack. ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 08/19/15 09:42, Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think no driver in the tree subtracts 1 for if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be simpler than fixing all other drivers in the tree. Maybe it can be controlled by some kind of flag, indicating whether all three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. Hmm, I'm afraid it's already complex. Drivers have to tell almost the same information to both bus_dma(9) and the network stack. Don't forget that not all drivers in the tree set the TSO limits before if_attach(), so possibly the subtraction of one TSO fragment needs to go into ip_output() --HPS ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 08/19/15 09:42, Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think no driver in the tree subtracts 1 for if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be simpler than fixing all other drivers in the tree. Hi, If you change the behaviour don't forget to update and/or add comments describing it. Maybe the amount of subtraction could be defined by some macro? Then drivers which inline the headers can subtract it? Your suggestion is fine by me. The intent was to preserve the initial TSO limits, and I believe those limits never accounted for IP/TCP/ETHERNET/VLAN headers! Maybe it can be controlled by some kind of flag, indicating whether all three TSO limits should include the TCP/IP/ethernet headers too. I'm pretty sure we want both versions. Hmm, I'm afraid it's already complex. Drivers have to tell almost the same information to both bus_dma(9) and the network stack. You're right it's complicated. Not sure if bus_dma can provide an API for this though. --HPS ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Wed, Aug 19, 2015 at 08:13:59AM -0400, Rick Macklem wrote: Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:51:44AM +0200, Hans Petter Selasky wrote: On 08/19/15 09:42, Yonghyeon PYUN wrote: On Wed, Aug 19, 2015 at 09:00:52AM +0200, Hans Petter Selasky wrote: On 08/18/15 23:54, Rick Macklem wrote: Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) Hi Rick, Your question is good. With the Mellanox hardware we have separate so-called inline data space for the TCP/IP headers, so if the TCP stack subtracts something, then we would need to add something to the limit, because then the scatter gather list is only used for the data part. I think no driver in the tree subtracts 1 for if_hw_tsomaxsegcount. Probably touching the Mellanox driver would be simpler than fixing all other drivers in the tree. Hi, If you change the behaviour don't forget to update and/or add comments describing it. Maybe the amount of subtraction could be defined by some macro? Then drivers which inline the headers can subtract it? I'm also ok with your suggestion. Your suggestion is fine by me. The intent was to preserve the initial TSO limits, and I believe those limits never accounted for IP/TCP/ETHERNET/VLAN headers! I guess FreeBSD used to follow the MS LSOv1 specification with a minor exception in pseudo checksum computation. If I recall correctly, the specification says the upper stack can generate up to an IP_MAXPACKET sized packet. L2 headers like the ethernet/vlan header are not included in that packet and it is the driver's responsibility to allocate additional DMA buffers/segments for them. Yep. The default for if_hw_tsomax was reduced from IP_MAXPACKET to 32 * MCLBYTES - max_ethernet_header_size as a workaround/hack so that devices limited to 32 transmit segments would work (ie. the entire packet, including the MAC header, would fit in 32 MCLBYTE clusters). This implied that many drivers did end up using m_defrag() to copy the mbuf list to one made up of 32 MCLBYTE clusters. If a driver sets if_hw_tsomaxsegcount correctly, then it can set if_hw_tsomax to whatever it can handle as the largest TSO packet (without MAC header) the hardware can handle. If it can handle IP_MAXPACKET, then it can set it to that. I thought the upper limit was still IP_MAXPACKET. If a driver increases it (i.e. beyond IP_MAXPACKET), the length field in the IP header would overflow, which in turn may break firewalls and other packet handling in the IPv4/IPv6 code path. If the limit no longer applies to the network stack, that's great. Some controllers can handle up to 256KB TCP/UDP segmentation and supporting that feature wouldn't be hard. ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
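For reference, the 16-bit length field Yonghyeon refers to is why IP_MAXPACKET is 65535; a minimal illustration (self-contained, compile-checked in spirit only):

	#include <sys/types.h>
	#include <netinet/in.h>
	#include <netinet/ip.h>		/* struct ip, IP_MAXPACKET */

	/*
	 * The IPv4 total-length field (ip_len in struct ip) is 16 bits
	 * wide, so a single IP packet -- TSO or not -- can never describe
	 * more than 65535 bytes:
	 */
	_Static_assert(IP_MAXPACKET == 65535, "IPv4 total length is 16 bits");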
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Aug 18, 2015, at 12:49 AM, Rick Macklem rmack...@uoguelph.ca wrote: Daniel Braniss wrote: On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote: Daniel Braniss wrote: On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com wrote: FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD 10.1. Before 10.1 it was less. this is NOT iperf/3 where i do get close to wire speed, it’s NFS writes, i.e., almost real work :-) I used to tweak the card settings, but now it's just stock. You may want to check your settings, the Mellanox may just have better defaults for your switch. Have you tried disabling TSO for the Intel? With TSO enabled, it will be copying every transmitted mbuf chain to a new chain of mbuf clusters via m_defrag(). (Assuming you aren't an 82598 chip. Most seem to be the 82599 chip these days?) hi Rick how can i check the chip? Haven't a clue. Does dmesg tell you? (To be honest, since disabling TSO helped, I'll bet you don't have a 82598.) This has been fixed in the driver very recently, but those fixes won't be in 10.1. rick ps: If you could test with 10.2, it would be interesting to see how the ix does with the current driver fixes in it? I knew TSO was involved! ok, firstly, it’s 10.2 stable. with TSO enabled, ix is bad, around 64MGB/s. disabling TSO it’s better, around 130 Hmm, could you check to see if these lines are in sys/dev/ixgbe/if_ix.c at around line #2500? /* TSO parameters */ 2572 ifp->if_hw_tsomax = 65518; 2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER; 2574 ifp->if_hw_tsomaxsegsize = 2048; They are in stable/10. I didn't look at releng/10.2. (And if they're in a #ifdef for FreeBSD11, take the #ifdef away.) If they are there and not ifdef'd, I can't explain why disabling TSO would help. Once TSO is fixed so that it handles the 64K transmit segments without copying all the mbufs, I suspect you might get better perf. with it enabled? this is 10.2 : they are on lines 2509-2511 and I don’t see any #ifdefs around it. the plot thickens :-) danny Good luck with it, rick still, mlxen0 is about 250! with and without TSO On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote: On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote: hi, I have a host (Dell R730) with both cards, connected to an HP8200 switch at 10Gb. when writing to the same storage (netapp) this is what I get: ix0:~130MGB/s mlxen0 ~330MGB/s this is via nfs/tcpv3 I can get similar (bad) performance with the mellanox if I increase the file size to 512MGB. Looks like the mellanox has an internal buffer for caching and does ACK accelerating. so at face value, it seems the mlxen does a better use of resources than the intel. Any ideas how to improve ix/intel's performance? Are you sure about netapp performance?
___ freebsd-...@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Daniel Braniss wrote: On Aug 18, 2015, at 12:49 AM, Rick Macklem rmack...@uoguelph.ca wrote: Daniel Braniss wrote: On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote: Daniel Braniss wrote: On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com wrote: FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD 10.1. Before 10.1 it was less. this is NOT iperf/3 where i do get close to wire speed, it’s NFS writes, i.e., almost real work :-) I used to tweak the card settings, but now it's just stock. You may want to check your settings, the Mellanox may just have better defaults for your switch. Have you tried disabling TSO for the Intel? With TSO enabled, it will be copying every transmitted mbuf chain to a new chain of mbuf clusters via m_defrag(). (Assuming you aren't an 82598 chip. Most seem to be the 82599 chip these days?) hi Rick how can i check the chip? Haven't a clue. Does dmesg tell you? (To be honest, since disabling TSO helped, I'll bet you don't have a 82598.) This has been fixed in the driver very recently, but those fixes won't be in 10.1. rick ps: If you could test with 10.2, it would be interesting to see how the ix does with the current driver fixes in it? I knew TSO was involved! ok, firstly, it’s 10.2 stable. with TSO enabled, ix is bad, around 64MGB/s. disabling TSO it’s better, around 130 Hmm, could you check to see if these lines are in sys/dev/ixgbe/if_ix.c at around line #2500? /* TSO parameters */ 2572 ifp->if_hw_tsomax = 65518; 2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER; 2574 ifp->if_hw_tsomaxsegsize = 2048; They are in stable/10. I didn't look at releng/10.2. (And if they're in a #ifdef for FreeBSD11, take the #ifdef away.) If they are there and not ifdef'd, I can't explain why disabling TSO would help. Once TSO is fixed so that it handles the 64K transmit segments without copying all the mbufs, I suspect you might get better perf. with it enabled? this is 10.2 : they are on lines 2509-2511 and I don’t see any #ifdefs around it. the plot thickens :-) If this is just a test machine, maybe you could test with these lines (at about #880) in sys/netinet/tcp_output.c commented out? (It looks to me like this will disable TSO for almost all the NFS writes.) - around line #880 in sys/netinet/tcp_output.c: /* * In case there are too many small fragments * don't use TSO: */ if (len <= max_len) { len = max_len; sendalot = 1; tso = 0; } This was added along with the other stuff that did the if_hw_tsomaxsegcount, etc and I never noticed it until now (not my patch). rick danny Good luck with it, rick still, mlxen0 is about 250! with and without TSO On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote: On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote: hi, I have a host (Dell R730) with both cards, connected to an HP8200 switch at 10Gb. when writing to the same storage (netapp) this is what I get: ix0:~130MGB/s mlxen0 ~330MGB/s this is via nfs/tcpv3 I can get similar (bad) performance with the mellanox if I increase the file size to 512MGB. Looks like the mellanox has an internal buffer for caching and does ACK accelerating. so at face value, it seems the mlxen does a better use of resources than the intel. Any ideas how to improve ix/intel's performance? Are you sure about netapp performance?
___ freebsd-...@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Hans Petter Selasky wrote: On 08/18/15 14:53, Rick Macklem wrote: If this is just a test machine, maybe you could test with these lines (at about #880) in sys/netinet/tcp_output.c commented out? (It looks to me like this will disable TSO for almost all the NFS writes.) - around line #880 in sys/netinet/tcp_output.c: /* * In case there are too many small fragments * don't use TSO: */ if (len <= max_len) { len = max_len; sendalot = 1; tso = 0; } This was added along with the other stuff that did the if_hw_tsomaxsegcount, etc and I never noticed it until now (not my patch). FYI: These lines are needed by other hardware, like the mlxen driver. If you remove them mlxen will start doing m_defrag(). I believe if you set the correct parameters in the struct ifnet for the TSO size/count limits this problem will go away. If you print the len and max_len and also the cases where TSO limits are reached, you'll see what parameter is triggering it and needs to be increased. Well, if the driver isn't setting if_hw_tsomaxsegcount correctly, then it is the driver that needs to be fixed. Having the above code block disable TSO for all of the NFS writes, including the ones that set if_hw_tsomaxsegcount correctly, doesn't make sense to me. If the driver authors don't set these, the drivers do lots of m_defrag() calls. I have posted more than once to freebsd-net@ asking the driver authors to set these and some now have. (I can't do it, because I don't have the hardware to test it with.) I do think that most/all of them don't subtract 1 for the tcp/ip header and I don't think they should be expected to, since the driver isn't supposed to worry about the protocol at that level. -- I think tcp_output() should subtract one from the if_hw_tsomaxsegcount provided by the driver to handle this, since it chooses to count mbufs (the while() loop at around line #825 in sys/netinet/tcp_output.c) before it prepends the tcp/ip header mbuf. rick --HPS ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Daniel Braniss wrote: On Aug 18, 2015, at 12:49 AM, Rick Macklem rmack...@uoguelph.ca wrote: Daniel Braniss wrote: On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote: Daniel Braniss wrote: On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com wrote: FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD 10.1. Before 10.1 it was less. this is NOT iperf/3 where i do get close to wire speed, it’s NFS writes, i.e., almost real work :-) I used to tweak the card settings, but now it's just stock. You may want to check your settings, the Mellanox may just have better defaults for your switch. Have you tried disabling TSO for the Intel? With TSO enabled, it will be copying every transmitted mbuf chain to a new chain of mbuf clusters via m_defrag(). (Assuming you aren't an 82598 chip. Most seem to be the 82599 chip these days?) Oops, I think I screwed up. It looks like t_maxopd is limited to somewhat less than the mtu. If that is the case, the code block wouldn't do what I thought it would do. However, if_hw_tsomaxsegcount does need to be one less than the limit for the driver, since the tcp/ip header isn't yet prepended when it is counted. I think the code in tcp_output() should subtract 1, but you can change it in the driver to test this. Thanks for doing this, rick hi Rick how can i check the chip? Haven't a clue. Does dmesg tell you? (To be honest, since disabling TSO helped, I'll bet you don't have a 82598.) This has been fixed in the driver very recently, but those fixes won't be in 10.1. rick ps: If you could test with 10.2, it would be interesting to see how the ix does with the current driver fixes in it? I knew TSO was involved! ok, firstly, it’s 10.2 stable. with TSO enabled, ix is bad, around 64MGB/s. disabling TSO it’s better, around 130 Hmm, could you check to see if these lines are in sys/dev/ixgbe/if_ix.c at around line #2500? /* TSO parameters */ 2572 ifp->if_hw_tsomax = 65518; 2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER; 2574 ifp->if_hw_tsomaxsegsize = 2048; They are in stable/10. I didn't look at releng/10.2. (And if they're in a #ifdef for FreeBSD11, take the #ifdef away.) If they are there and not ifdef'd, I can't explain why disabling TSO would help. Once TSO is fixed so that it handles the 64K transmit segments without copying all the mbufs, I suspect you might get better perf. with it enabled? this is 10.2 : they are on lines 2509-2511 and I don’t see any #ifdefs around it. the plot thickens :-) danny Good luck with it, rick still, mlxen0 is about 250! with and without TSO On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote: On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote: hi, I have a host (Dell R730) with both cards, connected to an HP8200 switch at 10Gb. when writing to the same storage (netapp) this is what I get: ix0:~130MGB/s mlxen0 ~330MGB/s this is via nfs/tcpv3 I can get similar (bad) performance with the mellanox if I increase the file size to 512MGB. Looks like the mellanox has an internal buffer for caching and does ACK accelerating. so at face value, it seems the mlxen does a better use of resources than the intel. Any ideas how to improve ix/intel's performance? Are you sure about netapp performance?
___ freebsd-...@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 08/18/15 14:53, Rick Macklem wrote: 2572 ifp->if_hw_tsomax = 65518; 2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER; 2574 ifp->if_hw_tsomaxsegsize = 2048; Hi, If IXGBE_82599_SCATTER is the maximum scatter/gather entries the hardware can do, remember to subtract one fragment for the TCP/IP-header mbuf! I think there is an off-by-one here: ifp->if_hw_tsomax = 65518; ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER - 1; ifp->if_hw_tsomaxsegsize = 2048; Refer to: * * NOTE: The TSO limits only apply to the data payload part of * a TCP/IP packet. That means there is no need to subtract * space for ethernet-, vlan-, IP- or TCP- headers from the * TSO limits unless the hardware driver in question requires * so. In sys/net/if_var.h Thank you! --HPS ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
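To see why the off-by-one matters numerically, assuming IXGBE_82599_SCATTER is 32 (the hardware scatter/gather limit the ix driver advertises):

	/*
	 * With if_hw_tsomax = 65518 and 2K mbuf clusters, the stack may
	 * build a payload chain of up to 32 mbufs (65518 / 2048 = 31.99,
	 * rounded up).  tcp_output() then prepends one more mbuf for the
	 * tcp/ip header, i.e. 33 DMA segments against a hardware limit of
	 * 32, so the driver falls back to m_defrag() on every such packet.
	 * Reserving one slot for the header mbuf avoids that:
	 */
	ifp->if_hw_tsomax = 65518;
	ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER - 1; /* 31 data mbufs */
	ifp->if_hw_tsomaxsegsize = 2048;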
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 08/18/15 14:53, Rick Macklem wrote: If this is just a test machine, maybe you could test with these lines (at about #880) in sys/netinet/tcp_output.c commented out? (It looks to me like this will disable TSO for almost all the NFS writes.) - around line #880 in sys/netinet/tcp_output.c: /* * In case there are too many small fragments * don't use TSO: */ if (len <= max_len) { len = max_len; sendalot = 1; tso = 0; } This was added along with the other stuff that did the if_hw_tsomaxsegcount, etc and I never noticed it until now (not my patch). FYI: These lines are needed by other hardware, like the mlxen driver. If you remove them mlxen will start doing m_defrag(). I believe if you set the correct parameters in the struct ifnet for the TSO size/count limits this problem will go away. If you print the len and max_len and also the cases where TSO limits are reached, you'll see what parameter is triggering it and needs to be increased. --HPS ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Tue, Aug 18, 2015 at 05:09:41PM +0300, Daniel Braniss wrote: sorry, it's been a tough day, we had a major meltdown, caused by a faulty gbic :-( anyways, could you tell me what to do? comment out, fix the off by one? the machine is not yet in production. Can you collect this information? https://lists.freebsd.org/pipermail/freebsd-stable/2015-August/083113.html And 'show interface' (or equivalent: error/collision/events counters) from both ports from the HP8200. ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
sorry, it’s been a tough day, we had a major meltdown, caused by a faulty gbic :-( anyways, could you tell me what to do? comment out, fix the off by one? the machine is not yet in production. thanks, danny On 18 Aug 2015, at 16:32, Hans Petter Selasky h...@selasky.org wrote: On 08/18/15 14:53, Rick Macklem wrote: 2572 ifp->if_hw_tsomax = 65518; 2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER; 2574 ifp->if_hw_tsomaxsegsize = 2048; Hi, If IXGBE_82599_SCATTER is the maximum scatter/gather entries the hardware can do, remember to subtract one fragment for the TCP/IP-header mbuf! I think there is an off-by-one here: ifp->if_hw_tsomax = 65518; ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER - 1; ifp->if_hw_tsomaxsegsize = 2048; Refer to: * * NOTE: The TSO limits only apply to the data payload part of * a TCP/IP packet. That means there is no need to subtract * space for ethernet-, vlan-, IP- or TCP- headers from the * TSO limits unless the hardware driver in question requires * so. In sys/net/if_var.h Thank you! --HPS ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Hans Petter Selasky wrote: On 08/18/15 14:53, Rick Macklem wrote: 2572 ifp->if_hw_tsomax = 65518; 2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER; 2574 ifp->if_hw_tsomaxsegsize = 2048; Hi, If IXGBE_82599_SCATTER is the maximum scatter/gather entries the hardware can do, remember to subtract one fragment for the TCP/IP-header mbuf! Ouch! Yes, I now see that the code that counts the # of mbufs is before the code that adds the tcp/ip header mbuf. In my opinion, this should be fixed by setting if_hw_tsomaxsegcount to whatever the driver provides - 1. It is not the driver's responsibility to know if a tcp/ip header mbuf will be added and is a lot less confusing than expecting the driver author to know to subtract one. (I had mistakenly thought that tcp_output() had added the tcp/ip header mbuf before the loop that counts mbufs in the list. Btw, this tcp/ip header mbuf also has leading space for the MAC layer header.) I think there is an off-by-one here: ifp->if_hw_tsomax = 65518; ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER - 1; ifp->if_hw_tsomaxsegsize = 2048; Refer to: * * NOTE: The TSO limits only apply to the data payload part of * a TCP/IP packet. That means there is no need to subtract * space for ethernet-, vlan-, IP- or TCP- headers from the * TSO limits unless the hardware driver in question requires * so. This comment suggests that the driver author doesn't need to do this. However, unless this is fixed in tcp_output(), the above patch should be applied to the driver. In sys/net/if_var.h Thank you! --HPS The problem I see is that, after doing the calculation of how many mbufs can be in the TSO segment, the code in tcp_output() will have calculated a value for len that will always be less than tp->t_maxopd - optlen when the if_hw_tsomaxsegcount limit has been hit (see where it does a break; out of the while loop). -- This does not imply too many small fragments for NFS, just that the driver's transmit segment limit has been reached; most of the mbufs are clusters, but not the first ones. As such the code: /* * In case there are too many small fragments * don't use TSO: */ if (len <= max_len) { len = max_len; sendalot = 1; tso = 0; } Will always happen for this case and tso gets set to 0. Not what we want to happen, imho. The above code block was what I suggested should be commented out or deleted for the test. It appears you should also add the - 1 in the driver sys/dev/ixgbe/if_ix.c. rick ___ freebsd-...@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
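Taken together, the test Rick is proposing for Danny amounts to two local changes, sketched here against stable/10 (not a committed fix; the line numbers are the ones quoted in the thread):

	/*
	 * sys/dev/ixgbe/if_ix.c (lines 2509-2511 in Danny's 10.2 tree):
	 * reserve one scatter/gather slot for the prepended tcp/ip
	 * header mbuf.
	 */
	ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER - 1;

	/*
	 * sys/netinet/tcp_output.c, around line #880: disable the check
	 * that turns TSO off whenever the segment-count clamp reduced
	 * len to a single segment's worth.
	 */
	#if 0
		if (len <= max_len) {
			len = max_len;
			sendalot = 1;
			tso = 0;
		}
	#endif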
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Daniel Braniss wrote: On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote: Daniel Braniss wrote: On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com wrote: FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD 10.1. Before 10.1 it was less. this is NOT iperf/3 where i do get close to wire speed, it’s NFS writes, i.e., almost real work :-) I used to tweak the card settings, but now it's just stock. You may want to check your settings, the Mellanox may just have better defaults for your switch. Have you tried disabling TSO for the Intel? With TSO enabled, it will be copying every transmitted mbuf chain to a new chain of mbuf clusters via m_defrag(). (Assuming you aren't an 82598 chip. Most seem to be the 82599 chip these days?) hi Rick how can i check the chip? Haven't a clue. Does dmesg tell you? (To be honest, since disabling TSO helped, I'll bet you don't have a 82598.) This has been fixed in the driver very recently, but those fixes won't be in 10.1. rick ps: If you could test with 10.2, it would be interesting to see how the ix does with the current driver fixes in it? I knew TSO was involved! ok, firstly, it’s 10.2 stable. with TSO enabled, ix is bad, around 64MGB/s. disabling TSO it’s better, around 130 Hmm, could you check to see if these lines are in sys/dev/ixgbe/if_ix.c at around line #2500? /* TSO parameters */ 2572 ifp->if_hw_tsomax = 65518; 2573 ifp->if_hw_tsomaxsegcount = IXGBE_82599_SCATTER; 2574 ifp->if_hw_tsomaxsegsize = 2048; They are in stable/10. I didn't look at releng/10.2. (And if they're in a #ifdef for FreeBSD11, take the #ifdef away.) If they are there and not ifdef'd, I can't explain why disabling TSO would help. Once TSO is fixed so that it handles the 64K transmit segments without copying all the mbufs, I suspect you might get better perf. with it enabled? Good luck with it, rick still, mlxen0 is about 250! with and without TSO On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote: On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote: hi, I have a host (Dell R730) with both cards, connected to an HP8200 switch at 10Gb. when writing to the same storage (netapp) this is what I get: ix0:~130MGB/s mlxen0 ~330MGB/s this is via nfs/tcpv3 I can get similar (bad) performance with the mellanox if I increase the file size to 512MGB. Looks like the mellanox has an internal buffer for caching and does ACK accelerating. so at face value, it seems the mlxen does a better use of resources than the intel. Any ideas how to improve ix/intel's performance? Are you sure about netapp performance? ___ freebsd-...@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-net To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org ___ freebsd-stable@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
Daniel Braniss wrote:
>> On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com wrote:
>>
>> FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
>> 10.1. Before 10.1 it was less.
>>
> this is NOT iperf/3 where i do get close to wire speed, it's NFS writes,
> i.e., almost real work :-)
>
>> I used to tweak the card settings, but now it's just stock. You may want
>> to check your settings, the Mellanox may just have better defaults for
>> your switch.
>>
Have you tried disabling TSO for the Intel? With TSO enabled, it will be
copying every transmitted mbuf chain to a new chain of mbuf clusters via
m_defrag(). (Assuming you aren't using an 82598 chip. Most seem to be the
82599 chip these days?) This has been fixed in the driver very recently, but
those fixes won't be in 10.1.

rick
ps: If you could test with 10.2, it would be interesting to see how the ix
does with the current driver fixes in it?

>> On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
>>> On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
>>>> hi,
>>>> I have a host (Dell R730) with both cards, connected to an HP8200
>>>> switch at 10Gb. when writing to the same storage (netapp) this is what
>>>> I get:
>>>>     ix0:    ~130MGB/s
>>>>     mlxen0: ~330MGB/s
>>>> this is via nfs/tcpv3.
>>>> I can get similar (bad) performance with the mellanox if I increase the
>>>> file size to 512MGB.
>>> Looks like the mellanox has an internal buffer for caching and does ACK
>>> accelerating.
>>>> so at face value, it seems the mlxen does a better use of resources
>>>> than the intel. Any ideas how to improve ix/intel's performance?
>>> Are you sure about the netapp performance?
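To illustrate the copying Rick describes, a simplified sketch of the common
FreeBSD driver transmit pattern (not the actual ixgbe code; the variable
names and error handling are assumptions):

	/*
	 * If the mbuf chain maps to more DMA segments than the hardware
	 * allows, bus_dmamap_load_mbuf_sg() fails with EFBIG and the whole
	 * chain is copied into a fresh chain of clusters by m_defrag()
	 * before retrying.  With the old TSO limits this happened for
	 * essentially every 64K TSO burst.
	 */
	error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
	    segs, &nsegs, BUS_DMA_NOWAIT);
	if (error == EFBIG) {
		struct mbuf *m;

		m = m_defrag(m_head, M_NOWAIT);	/* copies every mbuf in the chain */
		if (m == NULL) {
			m_freem(m_head);
			return (ENOBUFS);
		}
		m_head = m;
		error = bus_dmamap_load_mbuf_sg(txr->txtag, map, m_head,
		    segs, &nsegs, BUS_DMA_NOWAIT);
	}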
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 17 August 2015 at 13:39, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
> In any case, for 10Gb expect about 1200MGB/s.

Your usage of units is confusing. Above you claim you expect 1200 million
gigabytes per second, or 1.2 * 10^18 Bytes/s. I don't think any known network
interface can do that, including highly experimental ones.

I suspect you intended to claim that you expect 1.2 GB/s (Gigabytes per
second) over that 10 Gb/s (Gigabits per second) network. That's still on the
high side of what's possible. On TCP/IP there is some TCP overhead, so
1.0 GB/s is probably more realistic.

WRT the actual problem you're trying to solve, I'm no help there.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Aug 17, 2015, at 12:41 PM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
> On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
>> hi,
>> I have a host (Dell R730) with both cards, connected to an HP8200 switch
>> at 10Gb. when writing to the same storage (netapp) this is what I get:
>>     ix0:    ~130MGB/s
>>     mlxen0: ~330MGB/s
>> this is via nfs/tcpv3.
>> I can get similar (bad) performance with the mellanox if I increase the
>> file size to 512MGB.
> Looks like the mellanox has an internal buffer for caching and does ACK
> accelerating.

what ever they are doing, it's impressive :-)

>> so at face value, it seems the mlxen does a better use of resources than
>> the intel. Any ideas how to improve ix/intel's performance?
> Are you sure about the netapp performance?

yes, and why should it act differently if the request is coming from the
same host? in any case the numbers are quite consistent since I have
measured it from several hosts, and at different times.

danny
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
> hi,
> I have a host (Dell R730) with both cards, connected to an HP8200 switch
> at 10Gb. when writing to the same storage (netapp) this is what I get:
>     ix0:    ~130MGB/s
>     mlxen0: ~330MGB/s
> this is via nfs/tcpv3.
> I can get similar (bad) performance with the mellanox if I increase the
> file size to 512MGB.

Looks like the mellanox has an internal buffer for caching and does ACK
accelerating.

> so at face value, it seems the mlxen does a better use of resources than
> the intel. Any ideas how to improve ix/intel's performance?

Are you sure about the netapp performance?
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD 10.1.
Before 10.1 it was less.

I used to tweak the card settings, but now it's just stock. You may want to
check your settings; the Mellanox may just have better defaults for your
switch.

On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
> On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
>> hi,
>> I have a host (Dell R730) with both cards, connected to an HP8200 switch
>> at 10Gb. when writing to the same storage (netapp) this is what I get:
>>     ix0:    ~130MGB/s
>>     mlxen0: ~330MGB/s
>> this is via nfs/tcpv3.
>> I can get similar (bad) performance with the mellanox if I increase the
>> file size to 512MGB.
> Looks like the mellanox has an internal buffer for caching and does ACK
> accelerating.
>> so at face value, it seems the mlxen does a better use of resources than
>> the intel. Any ideas how to improve ix/intel's performance?
> Are you sure about the netapp performance?
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com wrote:
> FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
> 10.1. Before 10.1 it was less.

this is NOT iperf/3 where i do get close to wire speed, it's NFS writes,
i.e., almost real work :-)

> I used to tweak the card settings, but now it's just stock. You may want
> to check your settings, the Mellanox may just have better defaults for
> your switch.
>
> On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
>> On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
>>> hi,
>>> I have a host (Dell R730) with both cards, connected to an HP8200 switch
>>> at 10Gb. when writing to the same storage (netapp) this is what I get:
>>>     ix0:    ~130MGB/s
>>>     mlxen0: ~330MGB/s
>>> this is via nfs/tcpv3.
>>> I can get similar (bad) performance with the mellanox if I increase the
>>> file size to 512MGB.
>> Looks like the mellanox has an internal buffer for caching and does ACK
>> accelerating.
>>> so at face value, it seems the mlxen does a better use of resources than
>>> the intel. Any ideas how to improve ix/intel's performance?
>> Are you sure about the netapp performance?
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Mon, Aug 17, 2015 at 01:35:06PM +0300, Daniel Braniss wrote:
>> On Aug 17, 2015, at 12:41 PM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
>>
>> On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
>>> hi,
>>> I have a host (Dell R730) with both cards, connected to an HP8200 switch
>>> at 10Gb. when writing to the same storage (netapp) this is what I get:
>>>     ix0:    ~130MGB/s
>>>     mlxen0: ~330MGB/s
>>> this is via nfs/tcpv3.
>>> I can get similar (bad) performance with the mellanox if I increase the
>>> file size to 512MGB.
>> Looks like the mellanox has an internal buffer for caching and does ACK
>> accelerating.
> what ever they are doing, it's impressive :-)
>>> so at face value, it seems the mlxen does a better use of resources than
>>> the intel. Any ideas how to improve ix/intel's performance?
>> Are you sure about the netapp performance?
> yes, and why should it act differently if the request is coming from the
> same host? in any case the numbers are quite consistent since I have
> measured it from several hosts, and at different times.

In any case, for 10Gb expect about 1200MGB/s. I see a lesser speed. What is
the netapp's maximum performance, from other hosts or locally?
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Mon, Aug 17, 2015 at 01:49:27PM +0200, Alban Hertroys wrote:
> On 17 August 2015 at 13:39, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
>> In any case, for 10Gb expect about 1200MGB/s.
>
> Your usage of units is confusing. Above you claim you expect 1200

I use the same units as the topic starter and mean MegaBytes per second.

> million gigabytes per second, or 1.2 * 10^18 Bytes/s. I don't think any
> known network interface can do that, including highly experimental ones.
>
> I suspect you intended to claim that you expect 1.2 GB/s (Gigabytes per
> second) over that 10 Gb/s (Gigabits per second) network. That's still on
> the high side of what's possible. On TCP/IP there is some TCP overhead, so
> 1.0 GB/s is probably more realistic.

TCP gives 5-7% overhead (including retransmits): 10^10 / 8 * 0.97 = 1.2125 *
10^9 bytes/s, i.e. about 1212 MB/s.
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
> hi,
> I have a host (Dell R730) with both cards, connected to an HP8200 switch
> at 10Gb. when writing to the same storage (netapp) this is what I get:
>     ix0:    ~130MGB/s
>     mlxen0: ~330MGB/s
> this is via nfs/tcpv3.
> I can get similar (bad) performance with the mellanox if I increase the
> file size to 512MGB.
> so at face value, it seems the mlxen does a better use of resources than
> the intel. Any ideas how to improve ix/intel's performance?

Anyway, please show:
- the OS version and /var/run/dmesg.boot
- what tuning was performed (loader.conf, sysctl.conf)
- top -PHS in both cases
- ifconfig -a in both cases
- netstat -rn in both cases

I don't know netapp: what is the hardware configuration (disks etc.) and
software tuning (MTU?)?
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Aug 17, 2015, at 3:21 PM, Rick Macklem rmack...@uoguelph.ca wrote:
> Daniel Braniss wrote:
>>> On Aug 17, 2015, at 1:41 PM, Christopher Forgeron csforge...@gmail.com wrote:
>>>
>>> FYI, I can regularly hit 9.3 Gib/s with my Intel X520-DA2's and FreeBSD
>>> 10.1. Before 10.1 it was less.
>>>
>> this is NOT iperf/3 where i do get close to wire speed, it's NFS writes,
>> i.e., almost real work :-)
>>
>>> I used to tweak the card settings, but now it's just stock. You may want
>>> to check your settings, the Mellanox may just have better defaults for
>>> your switch.
>>>
> Have you tried disabling TSO for the Intel? With TSO enabled, it will be
> copying every transmitted mbuf chain to a new chain of mbuf clusters via
> m_defrag(). (Assuming you aren't using an 82598 chip. Most seem to be the
> 82599 chip these days?)

hi Rick
how can i check the chip?

> This has been fixed in the driver very recently, but those fixes won't be
> in 10.1.
>
> rick
> ps: If you could test with 10.2, it would be interesting to see how the ix
> does with the current driver fixes in it?

I knew TSO was involved!
ok, firstly, it's 10.2 stable.
with TSO enabled, ix is bad, around 64MGB/s. disabling TSO it's better,
around 130. still, mlxen0 is about 250! with and without TSO.

>>> On Mon, Aug 17, 2015 at 6:41 AM, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
>>>> On Mon, Aug 17, 2015 at 10:27:41AM +0300, Daniel Braniss wrote:
>>>>> hi,
>>>>> I have a host (Dell R730) with both cards, connected to an HP8200
>>>>> switch at 10Gb. when writing to the same storage (netapp) this is what
>>>>> I get:
>>>>>     ix0:    ~130MGB/s
>>>>>     mlxen0: ~330MGB/s
>>>>> this is via nfs/tcpv3.
>>>>> I can get similar (bad) performance with the mellanox if I increase
>>>>> the file size to 512MGB.
>>>> Looks like the mellanox has an internal buffer for caching and does ACK
>>>> accelerating.
>>>>> so at face value, it seems the mlxen does a better use of resources
>>>>> than the intel. Any ideas how to improve ix/intel's performance?
>>>> Are you sure about the netapp performance?
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On 17 August 2015 at 13:54, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
> On Mon, Aug 17, 2015 at 01:49:27PM +0200, Alban Hertroys wrote:
>> On 17 August 2015 at 13:39, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
>>> In any case, for 10Gb expect about 1200MGB/s.
>>
>> Your usage of units is confusing. Above you claim you expect 1200
>
> I use the same units as the topic starter and mean MegaBytes per second.

That's a highly unusual way of writing MB/s. There are standards for unit
prefixes: k means kilo, M means Mega, G means Giga, etc. See:
https://en.wikipedia.org/wiki/International_System_of_Units#Prefixes

>> million gigabytes per second, or 1.2 * 10^18 Bytes/s. I don't think any
>> known network interface can do that, including highly experimental ones.
>>
>> I suspect you intended to claim that you expect 1.2 GB/s (Gigabytes per
>> second) over that 10 Gb/s (Gigabits per second) network. That's still on
>> the high side of what's possible. On TCP/IP there is some TCP overhead,
>> so 1.0 GB/s is probably more realistic.
>
> TCP gives 5-7% overhead (including retransmits): 10^10 / 8 * 0.97 =
> 1.2125 * 10^9 bytes/s, i.e. about 1212 MB/s.

In information science, Bytes are counted in multiples of 2, not 10. A kb is
1024 bits, or 2^10 b. So 10 Gb is 10 * 2^30 bits. It's also not unusual to
be more specific about that 2-base and use kib, Mib and Gib instead.
Apparently you didn't know that...

Also, if you take 5% off, you are left with (0.95 * 10 * 2^30) / 8 =
1.1875 * 2^30 B/s, not 0.97 * ... Your calculations were a bit optimistic.

Now I have to admit I'm used to using a factor of 10 to convert from b/s to
B/s (that's 20%!), but that's probably no longer correct, what with jumbo
frames and all.

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.
Re: ix(intel) vs mlxen(mellanox) 10Gb performance
On Mon, Aug 17, 2015 at 05:44:37PM +0200, Alban Hertroys wrote:
> On 17 August 2015 at 13:54, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
>> On Mon, Aug 17, 2015 at 01:49:27PM +0200, Alban Hertroys wrote:
>>> On 17 August 2015 at 13:39, Slawa Olhovchenkov s...@zxy.spb.ru wrote:
>>>> In any case, for 10Gb expect about 1200MGB/s.
>>>
>>> Your usage of units is confusing. Above you claim you expect 1200
>>
>> I use the same units as the topic starter and mean MegaBytes per second.
>
> That's a highly unusual way of writing MB/s.

I know. That does not matter to me.

> There are standards for unit prefixes: k means kilo, M means Mega, G means
> Giga, etc. See:
> https://en.wikipedia.org/wiki/International_System_of_Units#Prefixes
>
>>> million gigabytes per second, or 1.2 * 10^18 Bytes/s. I don't think any
>>> known network interface can do that, including highly experimental ones.
>>>
>>> I suspect you intended to claim that you expect 1.2 GB/s (Gigabytes per
>>> second) over that 10 Gb/s (Gigabits per second) network. That's still on
>>> the high side of what's possible. On TCP/IP there is some TCP overhead,
>>> so 1.0 GB/s is probably more realistic.
>>
>> TCP gives 5-7% overhead (including retransmits): 10^10 / 8 * 0.97 =
>> 1.2125 * 10^9 bytes/s, i.e. about 1212 MB/s.
>
> In information science, Bytes are counted in multiples of 2, not 10. A kb
> is 1024 bits, or 2^10 b. So 10 Gb is 10 * 2^30 bits.

Interface speeds are counted in multiples of 10: 10Mbit ethernet runs at
10^7 bit/s, and 64Kbit ISDN runs at 64000 bit/s, not 65536.

> It's also not unusual to be more specific about that 2-base and use kib,
> Mib and Gib instead. Apparently you didn't know that...
>
> Also, if you take 5% off, you are left with (0.95 * 10 * 2^30) / 8 =
> 1.1875 * 2^30 B/s, not 0.97 * ... Your calculations were a bit optimistic.

My bug: 10^10 / 8 * 0.93 = 1.1625 * 10^9 bytes/s = 1162.5 MB/s.

> Now I have to admit I'm used to using a factor of 10 to convert from b/s
> to B/s (that's 20%!), but that's probably no longer correct, what with
> jumbo frames and all.

Ok, maybe the topic starter's software meters speed with MGB meaning
1048576 bytes: 1.1625 * 10^9 / 1048576 = 1108.6 per second.
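A small self-contained program to sanity-check the arithmetic in this
exchange; the 10^10 bit/s link rate, the 7% overhead figure, and both
readings of "MGB" are taken from the messages above:

	#include <stdio.h>

	int
	main(void)
	{
		const double link_bps = 1e10;	/* 10 Gbit/s, decimal prefix */
		const double overhead = 0.93;	/* 7% TCP/IP overhead */
		double Bps = link_bps / 8.0 * overhead;

		/* ~1162.5 with 1 MB = 10^6 bytes */
		printf("payload: %.1f MB/s\n", Bps / 1e6);
		/* ~1108.6 with 1 MiB = 2^20 bytes */
		printf("payload: %.1f MiB/s\n", Bps / 1048576.0);
		return (0);
	}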