RE: Flow ID, LACP, and igb

2013-09-02 Thread Joe Holden
Your argument is horseshit, on the basis that many usable x86 and non-x86
(especially MIPS) NICs will happily do line rate on stock FreeBSD without
any tuning whatsoever (I see you don't understand how network interfaces
actually work: what matters is pps and frame size, not throughput).
Also: a modern Realtek will do higher pps before becoming useless than a
two- or three-generation-old Intel PRO/1000 GT/CT.

This is *with* 64-bit PCI-X at 133MHz as well as PCIe gen2.  You should
also consider the people buying interfaces from vendors like Chelsio (who
support FreeBSD rather well considering their customer base includes
basically zero FreeBSD users), who sell 20/80G PCIe interface cards.

In reality CPU load is entirely irrelevant, since 10G won't bother a
decent CPU even with the glaring inefficiencies of the FreeBSD stack - as
long as it isn't livelocked, who cares?

Ultimately there are very few driver problems and some quite serious stack
design problems which driver behaviour exacerbates.


> -Original Message-
> From: owner-freebsd-...@freebsd.org [mailto:owner-freebsd-
> n...@freebsd.org] On Behalf Of Barney Cordoba
> Sent: 02 September 2013 13:47
> To: Adrian Chadd
> Cc: Andre Oppermann; Alan Somers; n...@freebsd.org; Jack F Vogel; Justin T.
> Gibbs; Luigi Rizzo; T.C. Gubatayao
> Subject: Re: Flow ID, LACP, and igb
>
> Are you using a pcie3 bus? Of course this is only an issue for 10g; what
> pct of FreeBSD users have a load over 9.5Gb/s? It's completely unnecessary
> for the igb or em driver, so why is it used? Because it's there.
>
> Here's my argument against it. The handful of brains capable of doing
> driver development become consumed with BS like LRO, and the things that
> need to be fixed, like buffer management and basic driver design flaws,
> never get fixed. The offload code makes the driver code a virtual mess
> that can only be maintained by Jack and 1 other guy in the entire world.
> And it takes 10 times longer to make a simple change or to add support
> for a new NIC.
>
> In a week I ripped out the offload crap and the 9000 sysctls, eliminated
> the "consumer buffer" problem, reduced locking by 40% and now the igb
> driver uses 20% less cpu with a full gig load.
>
> And the code is cleaner and more easily maintained.
>
> BC
>
>
> 
>  From: Adrian Chadd 
> To: Barney Cordoba 
> Cc: Andre Oppermann ; Alan Somers
> ; "n...@freebsd.org" ; Jack F
> Vogel ; Justin T. Gibbs ; Luigi Rizzo
> ; T.C. Gubatayao 
> Sent: Sunday, September 1, 2013 4:51 PM
> Subject: Re: Flow ID, LACP, and igb
>
>
> Yo,
>
> LRO is an interesting hack that seems to do a good trick of hiding the
> ridiculous locking and unfriendly cache behaviour that we do per-packet.
>
> It helps with LAN test traffic where things are going out in batches from
> the TCP layer so the RX layer "sees" these frames in-order and can do LRO.
> When you disable it, I don't easily get 10GE LAN TCP performance. That has
> to be fixed. Given how fast the CPU cores, bus interconnect and memory
> interconnects are, I don't think there should be any reason why we can't
> hit 10GE traffic on a LAN with LRO disabled (in both software and hardware.)
>
> Now that I have the PMC sandy bridge stuff working right (but no PEBS, I
> have to talk to Intel about that in a bit more detail before I think about
> hacking that in) we can get actual live information about this stuff. But
> the last time I looked, there's just too much per-packet latency going on.
> The root cause looks like it's a toss up between scheduling, locking and
> just lots of code running to completion per-frame. As I said, that all has
> to die somehow.
>
> 2c,
>
>
>
> -adrian
>
>
>
> On 1 September 2013 08:45, Barney Cordoba wrote:
>
> >
> >
> > Comcast sends packets OOO. With any decent number of internet hops
> > you're likely to encounter a load balancer or packet shaper that sends
> > packets OOO, so you just can't be worried about it. In fact, your
> > designs MUST work with OOO packets.
> >
> > Getting balance on your load balanced lines is certainly a bigger
> > upside than the additional CPU used.
> > You can buy a faster processor for your "stack" for a lot less than
> > you can buy bandwidth.
> >
> > Frankly my opinion of LRO is that it's a science project suitable for
> > labs only. It's a trick to get more bandwidth than your bus capacity;
> > the answer is to not run PCIe2 if you need pcie3.
> > You can use it internally if you have
> > control of all of the machines. When I modify a driver the first thing
> > that I do is rip it out.

Re: Flow ID, LACP, and igb

2013-09-02 Thread Olivier Cochard-Labbé
On Mon, Sep 2, 2013 at 2:47 PM, Barney Cordoba  wrote:
>
> In a week I ripped out the offload crap and the 9000 sysctls, eliminated the
> "consumer buffer" problem, reduced locking by 40% and now the igb driver
> uses 20% less cpu with a full gig load.
>

Wow!

Where is the patch? I would like to test it too.

Thanks,

Olivier


Re: Flow ID, LACP, and igb

2013-09-02 Thread Barney Cordoba
Are you using a pcie3 bus? Of course this is only an issue for 10g; what
pct of FreeBSD users have a load over 9.5Gb/s? It's completely unnecessary
for the igb or em driver, so why is it used? Because it's there.

Here's my argument against it. The handful of brains capable of doing
driver development become consumed with BS like LRO, and the things that
need to be fixed, like buffer management and basic driver design flaws,
never get fixed. The offload code makes the driver code a virtual mess
that can only be maintained by Jack and 1 other guy in the entire world.
And it takes 10 times longer to make a simple change or to add support for
a new NIC.

In a week I ripped out the offload crap and the 9000 sysctls, eliminated the 
"consumer buffer" problem, reduced locking by 40% and now the igb driver
uses 20% less cpu with a full gig load.

And the code is cleaner and more easily maintained.

BC



 From: Adrian Chadd 
To: Barney Cordoba  
Cc: Andre Oppermann ; Alan Somers ; 
"n...@freebsd.org" ; Jack F Vogel ; Justin 
T. Gibbs ; Luigi Rizzo ; T.C. Gubatayao 
 
Sent: Sunday, September 1, 2013 4:51 PM
Subject: Re: Flow ID, LACP, and igb
 

Yo,

LRO is an interesting hack that seems to do a good trick of hiding the
ridiculous locking and unfriendly cache behaviour that we do per-packet.

It helps with LAN test traffic where things are going out in batches from
the TCP layer so the RX layer "sees" these frames in-order and can do LRO.
When you disable it, I don't easily get 10GE LAN TCP performance. That has
to be fixed. Given how fast the CPU cores, bus interconnect and memory
interconnects are, I don't think there should be any reason why we can't
hit 10GE traffic on a LAN with LRO disabled (in both software and hardware.)

Now that I have the PMC sandy bridge stuff working right (but no PEBS, I
have to talk to Intel about that in a bit more detail before I think about
hacking that in) we can get actual live information about this stuff. But
the last time I looked, there's just too much per-packet latency going on.
The root cause looks like it's a toss up between scheduling, locking and
just lots of code running to completion per-frame. As I said, that all has
to die somehow.

2c,



-adrian



On 1 September 2013 08:45, Barney Cordoba  wrote:

>
>
> Comcast sends packets OOO. With any decent number of internet hops you're
> likely to encounter a load
> balancer or packet shaper that sends packets OOO, so you just can't be
> worried about it. In fact, your
> designs MUST work with OOO packets.
>
> Getting balance on your load balanced lines is certainly a bigger upside
> than the additional CPU used.
> You can buy a faster processor for your "stack" for a lot less than you
> can buy bandwidth.
>
> Frankly my opinion of LRO is that it's a science project suitable for labs
> only. It's a trick to get more bandwidth
> than your bus capacity; the answer is to not run PCIe2 if you need pcie3.
> You can use it internally if you have
> control of all of the machines. When I modify a driver the first thing
> that I do is rip it out.
>
> BC
>
>
> 
>  From: Luigi Rizzo 
> To: Barney Cordoba 
> Cc: Andre Oppermann ; Alan Somers ;
> "n...@freebsd.org" ; Jack F Vogel ;
> Justin T. Gibbs ; T.C. Gubatayao <
> tgubata...@barracuda.com>
> Sent: Saturday, August 31, 2013 10:27 PM
> Subject: Re: Flow ID, LACP, and igb
>
>
> On Sun, Sep 1, 2013 at 4:15 AM, Barney Cordoba wrote:
>
> > ...
> >
>
> [your point on testing with realistic assumptions is surely a valid one]
>
>
> >
> > Of course there's nothing really wrong with OOO packets. We had this
> > discussion before; lots of people
> > have round robin dual homing without any ill effects. It's just not an
> > issue.
> >
>
> It depends on where you are.
> It may not be an issue if the reordering is not large enough to
> trigger retransmissions, but even then it is annoying as it causes
> more work in the endpoint -- it prevents LRO from working, and even
> on the host stack it takes more work to sort where an out of order
> segment goes than appending an in-order one to the socket buffer.
>
> cheers
> luigi

Re: Flow ID, LACP, and igb

2013-09-01 Thread Alexander V. Chernikov
On 02.09.2013 00:45, Adrian Chadd wrote:
> 
> 
> Not sure about igb, but ixgbe (according to advanced RX descriptor 
> format, 7.1.6.2 @ 82599 datasheet) can provide 'real' RSS value
> which can be used in m_flowid instead of NIC queue id.
> 
> (And, by the way, another RSS-related problem: there are cases when
> setting flowid does more harm, for example - PPPoE frames always
> being received at Q0. Partially this can be solved by analyzing RSS
> type from the same RX descriptor format (e.g. don't set flowid for
> RSS type 0x0), but there are other cases like GRE tunneling (where
> you probably want to perform deeper inspection in SW).
> 
> So, can we have some kind of per-NIC sysctl disabling setting
> flowid on given port?
> 
> (Yes, this should be in some kind of `ethtool` binary but we still 
> don't have it..) )
> 
> 
> What specifically are you asking for? Disabling the flowid tagging
> of
I'm talking about some small ixgbe (and maybe igb?) changes related to
the generation of mbuf flowid.

More specifically:
1) As far as I understand, ixgbe generates a u16 hash which is then used
to compute the receive queue number. It seems that this value can be set
by the NIC in the per-packet advanced RX descriptor, so it can be used as
a better flowid value (which should be optional).
2) There are cases where we shouldn't simply mark all packets as received
by q0 (since the NIC doesn't know how to hash them), so being able to
disable setting m_flowid for a given port can help a lot.

> mbufs? Or changing the LACP hashing?
It is not related to lagg directly.

Actually, it is, but only for the very special case of 'routing on a
stick', where we forward packets back out the same lagg interface.

In this case, we can set (pre-computed) static per-RX-queue flowids to
force forwarded packets onto the same NIC and the same TX queue id.
This approach minimizes egress mutex contention (I forgot to mention the
patch implementing this in my 'Network stack changes' letter).
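
A minimal sketch of that flowid-to-TX-queue step (illustrative only:
sketch_mq_start(), the softc fields and the enqueue helper are made-up
names, not the patch mentioned above; M_FLOWID is the 9.x-era mbuf flag
saying m_pkthdr.flowid is valid):

/*
 * If the RX path stored a flowid that maps back to the ingress queue,
 * a forwarded packet stays on one queue end to end and never touches
 * the other TX ring locks.
 */
static int
sketch_mq_start(struct ifnet *ifp, struct mbuf *m)
{
    struct sketch_softc *sc = ifp->if_softc;    /* hypothetical softc */
    u_int qid = 0;

    if (m->m_flags & M_FLOWID)
        qid = m->m_pkthdr.flowid % sc->num_queues;
    return (sketch_txq_enqueue(&sc->queues[qid], m));   /* hypothetical */
}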

> 
> You can configure LACP to not use the flowid from the NIC and do
> the hashing yourself.
Yup, I'm aware of that :)
> 
> 
> -adrian
> 



Re: Flow ID, LACP, and igb

2013-09-01 Thread Adrian Chadd
Yo,

LRO is an interesting hack that seems to do a good trick of hiding the
ridiculous locking and unfriendly cache behaviour that we do per-packet.

It helps with LAN test traffic where things are going out in batches from
the TCP layer so the RX layer "sees" these frames in-order and can do LRO.
When you disable it, I don't easily get 10GE LAN TCP performance. That has
to be fixed. Given how fast the CPU cores, bus interconnect and memory
interconnects are, I don't think there should be any reason why we can't
hit 10GE traffic on a LAN with LRO disabled (in both software and hardware.)

Now that I have the PMC sandy bridge stuff working right (but no PEBS, I
have to talk to Intel about that in a bit more detail before I think about
hacking that in) we can get actual live information about this stuff. But
the last time I looked, there's just too much per-packet latency going on.
The root cause looks like it's a toss up between scheduling, locking and
just lots of code running to completion per-frame. As I said, that all has
to die somehow.

2c,



-adrian



On 1 September 2013 08:45, Barney Cordoba  wrote:

>
>
> Comcast sends packets OOO. With any decent number of internet hops you're
> likely to encounter a load
> balancer or packet shaper that sends packets OOO, so you just can't be
> worried about it. In fact, your
> designs MUST work with OOO packets.
>
> Getting balance on your load balanced lines is certainly a bigger upside
> than the additional CPU used.
> You can buy a faster processor for your "stack" for a lot less than you
> can buy bandwidth.
>
> Frankly my opinion of LRO is that it's a science project suitable for labs
> only. It's a trick to get more bandwidth
> than your bus capacity; the answer is to not run PCIe2 if you need pcie3.
> You can use it internally if you have
> control of all of the machines. When I modify a driver the first thing
> that I do is rip it out.
>
> BC
>
>
> 
>  From: Luigi Rizzo 
> To: Barney Cordoba 
> Cc: Andre Oppermann ; Alan Somers ;
> "n...@freebsd.org" ; Jack F Vogel ;
> Justin T. Gibbs ; T.C. Gubatayao <
> tgubata...@barracuda.com>
> Sent: Saturday, August 31, 2013 10:27 PM
> Subject: Re: Flow ID, LACP, and igb
>
>
> On Sun, Sep 1, 2013 at 4:15 AM, Barney Cordoba wrote:
>
> > ...
> >
>
> [your point on testing with realistic assumptions is surely a valid one]
>
>
> >
> > Of course there's nothing really wrong with OOO packets. We had this
> > discussion before; lots of people
> > have round robin dual homing without any ill effects. It's just not an
> > issue.
> >
>
> It depends on where you are.
> It may not be an issue if the reordering is not large enough to
> trigger retransmissions, but even then it is annoying as it causes
> more work in the endpoint -- it prevents LRO from working, and even
> on the host stack it takes more work to sort where an out of order
> segment goes than appending an in-order one to the socket buffer.
>
> cheers
> luigi


Re: Flow ID, LACP, and igb

2013-09-01 Thread Adrian Chadd
> Not sure about igb, but ixgbe (according to advanced RX descriptor
> format, 7.1.6.2 @ 82599 datasheet) can provide 'real' RSS value which
> can be used in m_flowid instead of NIC queue id.
>
> (And, by the way, another RSS-related problem:
> there are cases when setting flowid does more harm, for example -
> PPPoE frames always being received at Q0.
> Partially this can be solved by analyzing RSS type from the same RX
> descriptor format (e.g. don't set flowid for RSS type 0x0), but there
> are other cases like GRE tunneling (where you probably want to perform
> deeper inspection in SW).
>
> So, can we have some kind of per-NIC sysctl disabling setting flowid
> on given port?
>
> (Yes, this should be in some kind of `ethtool` binary but we still
> don't have it..)
> )
>

What specifically are you asking for? Disabling the flowid tagging of
mbufs? Or changing the LACP hashing?

You can configure LACP to not use the flowid from the NIC and do the
hashing yourself.


-adrian


Re: Flow ID, LACP, and igb

2013-09-01 Thread Barney Cordoba


Comcast sends packets OOO. With any decent number of internet hops you're
likely to encounter a load balancer or packet shaper that sends packets
OOO, so you just can't be worried about it. In fact, your designs MUST
work with OOO packets.

Getting balance on your load balanced lines is certainly a bigger upside
than the additional CPU used. You can buy a faster processor for your
"stack" for a lot less than you can buy bandwidth.

Frankly my opinion of LRO is that it's a science project suitable for labs
only. It's a trick to get more bandwidth than your bus capacity; the
answer is to not run PCIe2 if you need pcie3. You can use it internally if
you have control of all of the machines. When I modify a driver the first
thing that I do is rip it out.

BC



 From: Luigi Rizzo 
To: Barney Cordoba  
Cc: Andre Oppermann ; Alan Somers ; 
"n...@freebsd.org" ; Jack F Vogel ; Justin 
T. Gibbs ; T.C. Gubatayao  
Sent: Saturday, August 31, 2013 10:27 PM
Subject: Re: Flow ID, LACP, and igb
 

On Sun, Sep 1, 2013 at 4:15 AM, Barney Cordoba wrote:

> ...
>

[your point on testing with realistic assumptions is surely a valid one]


>
> Of course there's nothing really wrong with OOO packets. We had this
> discussion before; lots of people
> have round robin dual homing without any ill effects. It's just not an
> issue.
>

It depends on where you are.
It may not be an issue if the reordering is not large enough to
trigger retransmissions, but even then it is annoying as it causes
more work in the endpoint -- it prevents LRO from working, and even
on the host stack it takes more work to sort where an out of order
segment goes than appending an in-order one to the socket buffer.

cheers
luigi


Re: Flow ID, LACP, and igb

2013-09-01 Thread Alexander V. Chernikov
On 26.08.2013 21:18, Justin T. Gibbs wrote:
> Hi Net,
> 
> I'm an infrequent traveler through the networking code and would 
> appreciate some feedback on some proposed solutions to issues
> Spectra has seen with outbound LACP traffic.
> 
> lacp_select_tx_port() uses the flow ID if it is available in the
> outbound mbuf to select the outbound port.  The igb driver uses the
> msix queue of the inbound packet to set a packet's flow ID.  This
> doesn't provide enough
Not sure about igb, but ixgbe (according to advanced RX descriptor
format, 7.1.6.2 @ 82599 datasheet) can provide 'real' RSS value which
can be used in m_flowid instead of NIC queue id.

(And, by the way, another RSS-related problem:
there are cases when setting flowid does more harm, for example -
PPPoE frames always being received at Q0.
Partially this can be solved by analyzing RSS type from the same RX
descriptor format (e.g. don't set flowid for RSS type 0x0), but there
are other cases like GRE tunneling (where you probably want to perform
deeper inspection in SW).

So, can we have some kind of per-NIC sysctl disabling setting flowid
on given port?

(Yes, this should be in some kind of `ethtool` binary but we still
don't have it..)
)
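
A sketch of (1) and (2) combined (illustrative only: the sc->flowid_*
knobs and the variable names don't exist anywhere; the descriptor field
paths follow the 82599 advanced RX descriptor layout, so check them
against ixgbe_type.h before trusting this):

    /* in the RX completion loop, per received packet */
    uint32_t rsshash = le32toh(rxd->wb.lower.hi_dword.rss);
    uint16_t pktinfo = le16toh(rxd->wb.lower.lo_dword.hs_rss.pkt_info);
    uint8_t  rsstype = pktinfo & 0x0f;  /* type 0 == no valid RSS hash */

    if (sc->flowid_enable && rsstype != 0) {
        /* the real 32-bit hash instead of the RX queue id */
        m->m_pkthdr.flowid = sc->flowid_use_rss ? rsshash : que->msix;
        m->m_flags |= M_FLOWID;
    }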

Jack, what do you think?

> bits of information to yield a high quality flow ID.  If, for
> example, the switch controlling inbound packet distribution does a
> poor job, the outbound packet distribution will also be poorly
> distributed.
> 
> The majority of the adapters supported by this driver will compute 
> the Toeplitz RSS hash.  Using this data seems to work quite well in
> our tests (3 member LAGG group).  Is there any reason we shouldn't 
> use the RSS hash for flow ID?
> 
> We also tried disabling the use of flow ID and doing the hash
> directly in the driver.  Unfortunately, the current hash is pretty
> weak.  It multiplies by 33, which yields very poor distributions if
> you need to mod the result by 3 (e.g. LAGG group with 3 members).
> Alan modified the driver to use the FNV hash, which is already in
> the kernel, and this yielded much better results.  He is still
> benchmarking the impact of this change.  Assuming we can get decent
> flow ID data, this should only impact outbound UDP, since the stack
> doesn't provide a flow ID in this case.
> 
> Are there other checksums we should be looking at in addition to
> FNV?
> 
> Thanks, Justin
> 



Re: Flow ID, LACP, and igb

2013-08-31 Thread Luigi Rizzo
On Sun, Sep 1, 2013 at 4:15 AM, Barney Cordoba wrote:

> ...
>

[your point on testing with realistic assumptions is surely a valid one]


>
> Of course there's nothing really wrong with OOO packets. We had this
> discussion before; lots of people
> have round robin dual homing without any ill effects. It's just not an
> issue.
>

It depends on where you are.
It may not be an issue if the reordering is not large enough to
trigger retransmissions, but even then it is annoying as it causes
more work in the endpoint -- it prevents LRO from working, and even
on the host stack it takes more work to sort where an out of order
segment goes than appending an in-order one to the socket buffer.

cheers
luigi


Re: Flow ID, LACP, and igb

2013-08-31 Thread Barney Cordoba
No, no. The entire point of the hash is to separate the "connections".
But when testing you should use realistic assumptions. You're not
splitting packets, so the big packets will mess up your distribution if
you don't get it right.

Of course there's nothing really wrong with OOO packets. We had this
discussion before; lots of people have round robin dual homing without
any ill effects. It's just not an issue.


BC


 From: T.C. Gubatayao 
To: Barney Cordoba ; Luigi Rizzo 
; Alan Somers  
Cc: Jack F Vogel ; Justin T. Gibbs ; Andre 
Oppermann ; "n...@freebsd.org"  
Sent: Saturday, August 31, 2013 9:38 PM
Subject: RE: Flow ID, LACP, and igb
 

On Sat, Aug 31, 2013 at 8:41 AM, Barney Cordoba wrote:

> Also, the *most* important thing is distribution with realistic data. The goal
> should be to use the most trivial function that gives the most balanced
> distribution with real numbers. Faster is not better if the result is an
> unbalanced distribution.

Agreed, with a caveat.  It's critical that this distribution be by "flow", so
that out of order packet delivery is minimized.

> Many of your ports will be 80 and 53, and if you're going through a router
> your ethernets may not be very unique, so why even bother to include them?
> Does getting a good distribution require that you hash every element
> individually, or can you get the same distribution with a faster, simpler way
> of creating the seed?
>
> There's also the other consideration of packet size. Packets on port 53 are
> likely to be smaller than packets on port 80. What you want is equal
> distribution PER PORT on the ports that will carry that vast majority of your
> traffic.

Unfortunately, trying to evenly distribute traffic per port based on packet
size will likely result in the reordering of packets, and bandwidth wasted on
TCP retransmissions.

> Or better yet, use the same number of queues on igb as you have LAGG ports,
> and use the queue id (or RSS) as the hash, so that your traffic is sync'd
> between the ethernet adapter queues and the LAGG ports. The card has already
> done the work for you.

Isn't this hash for selecting an outbound link?  The ingress adapter hash (RSS)
won't help for packets originating from the host, or for packets that may have
been translated or otherwise modified while traversing the stack.

T.C.


RE: Flow ID, LACP, and igb

2013-08-31 Thread T.C. Gubatayao
On Sat, Aug 31, 2013 at 8:41 AM, Barney Cordoba wrote:

> Also, the *most* important thing is distribution with realistic data. The goal
> should be to use the most trivial function that gives the most balanced
> distribution with real numbers. Faster is not better if the result is an
> unbalanced distribution.

Agreed, with a caveat.  It's critical that this distribution be by "flow", so
that out of order packet delivery is minimized.

> Many of your ports will be 80 and 53, and if you're going through a router
> your ethernets may not be very unique, so why even bother to include them?
> Does getting a good distribution require that you hash every element
> individually, or can you get the same distribution with a faster, simpler way
> of creating the seed?
>
> There's also the other consideration of packet size. Packets on port 53 are
> likely to be smaller than packets on port 80. What you want is equal
> distribution PER PORT on the ports that will carry that vast majority of your
> traffic.

Unfortunately, trying to evenly distribute traffic per port based on packet
size will likely result in the reordering of packets, and bandwidth wasted on
TCP retransmissions.

> Or better yet, use the same number of queues on igb as you have LAGG ports,
> and use the queue id (or RSS) as the hash, so that your traffic is sync'd
> between the ethernet adapter queues and the LAGG ports. The card has already
> done the work for you.

Isn't this hash for selecting an outbound link?  The ingress adapter hash (RSS)
won't help for packets originating from the host, or for packets that may have
been translated or otherwise modified while traversing the stack.

T.C.


Re: Flow ID, LACP, and igb

2013-08-31 Thread Barney Cordoba
And another thing; the use of modulo is very expensive, and the number of
ports used in LAGG is *usually* a power of 2. foo&(SLOTS-1) is a lot
faster than (foo%SLOTS).

if (SLOTS == 2 || SLOTS == 4 || SLOTS == 8)
    hash = hash&(SLOTS-1);
else
    hash = hash % SLOTS;

is more than twice as fast as 

hash % SLOTS;
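
The same dispatch without enumerating the cases, as a sketch: any power
of two satisfies (n & (n - 1)) == 0, so the test generalizes to every
port count:

if ((SLOTS & (SLOTS - 1)) == 0)     /* power of two: mask */
    hash = hash & (SLOTS - 1);
else                                /* 3, 5, 6, ... ports: divide */
    hash = hash % SLOTS;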

BC



 From: Luigi Rizzo 
To: Alan Somers  
Cc: Jack F Vogel ; "n...@freebsd.org" ; 
Justin T. Gibbs ; Andre Oppermann ; T.C. 
Gubatayao  
Sent: Friday, August 30, 2013 8:04 PM
Subject: Re: Flow ID, LACP, and igb
 

Alan,


On Thu, Aug 29, 2013 at 6:45 PM, Alan Somers  wrote:
>
>
> ...
> I pulled all four hash functions out into userland and microbenchmarked
> them.  The upshot is that hash32 and fnv_hash are the fastest, jenkins_hash
> is slower, and siphash24 is the slowest.  Also, Clang resulted in much
> faster code than gcc.
>
>
i missed this part of your message, but if i read your code well,
you are running 100M iterations and the numbers below are in seconds,
so if you multiply the numbers by 10 you have the cost per hash in
nanoseconds.

What CPU did you use for your tests ?

Also some of the numbers (FNV and hash32) are suspiciously low.

I believe that the compilers (both of them) have figured out that
everything in these functions is constant; since fnv_32_buf() and
hash32_buf() are inline, they can be optimized to just return a constant.
This does not happen for siphash and jenkins because they are defined
externally.

Can you please re-run the tests in a way that defeats the optimization?
(e.g. pass a non-constant argument to the hashes so you actually need to
run the code).

cheers
luigi


http://people.freebsd.org/~asomers/lagg_hash/
>
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash-gcc-4.8
> FNV: 0.76
> hash32: 1.18
> SipHash24: 44.39
> Jenkins: 6.20
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash-gcc-4.2.1
> FNV: 0.74
> hash32: 1.35
> SipHash24: 55.25
> Jenkins: 7.37
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash.clang-3.3
> FNV: 0.30
> hash32: 0.30
> SipHash24: 55.97
> Jenkins: 6.45
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash.clang-3.2
> FNV: 0.30
> hash32: 0.30
> SipHash24: 44.52
> Jenkins: 6.48
>
>
>
> > T.C.
> >
> > [1]
> >
> http://svnweb.freebsd.org/base/head/sys/libkern/jenkins_hash.c?view=markup


Re: Flow ID, LACP, and igb

2013-08-31 Thread Barney Cordoba
May I express my glee and astonishment that you're debating the use of
complicated hash functions for something that's likely to have from 2-8
slots?

Also, the *most* important thing is distribution with realistic data. The
goal should be to use the most trivial function that gives the most
balanced distribution with real numbers. Faster is not better if the
result is an unbalanced distribution.

Many of your ports will be 80 and 53, and if you're going through a router
your ethernets may not be very unique, so why even bother to include them?
Does getting a good distribution require that you hash every element
individually, or can you get the same distribution with a faster, simpler
way of creating the seed?

There's also the other consideration of packet size. Packets on port 53
are likely to be smaller than packets on port 80. What you want is equal
distribution PER PORT on the ports that will carry the vast majority of
your traffic.

When designing efficient systems, you must not assume that ports and IPs
are random, because they're not. 99% of your load will be on a small
number of destination ports and a limited range of source ports.

For a web server application, getting a perfect distribution on the http
ports is most crucial.

The hash function in if_lagg.c looks like more of a classroom exercise
than a practical implementation. If you're going to consider 100M
iterations, consider that much of the time is wasted parsing the packet
(again). Why not add a simple sysctl that enables a hash that is created
in the ip parser, when all of the pieces are available without having to
re-parse the mbuf?

Or better yet, use the same number of queues on igb as you have LAGG
ports, and use the queue id (or RSS) as the hash, so that your traffic is
sync'd between the ethernet adapter queues and the LAGG ports. The card
has already done the work for you.

BC






 From: Luigi Rizzo 
To: Alan Somers  
Cc: Jack F Vogel ; "n...@freebsd.org" ; 
Justin T. Gibbs ; Andre Oppermann ; T.C. 
Gubatayao  
Sent: Friday, August 30, 2013 8:04 PM
Subject: Re: Flow ID, LACP, and igb
 

Alan,


On Thu, Aug 29, 2013 at 6:45 PM, Alan Somers  wrote:
>
>
> ...
> I pulled all four hash functions out into userland and microbenchmarked
> them.  The upshot is that hash32 and fnv_hash are the fastest, jenkins_hash
> is slower, and siphash24 is the slowest.  Also, Clang resulted in much
> faster code than gcc.
>
>
i missed this part of your message, but if i read your code well,
you are running 100M iterations and the numbers below are in seconds,
so if you multiply the numbers by 10 you have the cost per hash in
nanoseconds.

What CPU did you use for your tests ?

Also some of the numbers (FNV and hash32) are suspiciously low.

I believe that the compilers (both of them) have figured out that
everything in these functions is constant; since fnv_32_buf() and
hash32_buf() are inline, they can be optimized to just return a constant.
This does not happen for siphash and jenkins because they are defined
externally.

Can you please re-run the tests in a way that defeats the optimization?
(e.g. pass a non-constant argument to the hashes so you actually need to
run the code).

cheers
luigi


http://people.freebsd.org/~asomers/lagg_hash/
>
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash-gcc-4.8
> FNV: 0.76
> hash32: 1.18
> SipHash24: 44.39
> Jenkins: 6.20
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash-gcc-4.2.1
> FNV: 0.74
> hash32: 1.35
> SipHash24: 55.25
> Jenkins: 7.37
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash.clang-3.3
> FNV: 0.30
> hash32: 0.30
> SipHash24: 55.97
> Jenkins: 6.45
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash.clang-3.2
> FNV: 0.30
> hash32: 0.30
> SipHash24: 44.52
> Jenkins: 6.48
>
>
>
> > T.C.
> >
> > [1]
> >
> http://svnweb.freebsd.org/base/head/sys/libkern/jenkins_hash.c?view=markup


Re: Flow ID, LACP, and igb

2013-08-30 Thread Luigi Rizzo
Alan,


On Thu, Aug 29, 2013 at 6:45 PM, Alan Somers  wrote:
>
>
> ...
> I pulled all four hash functions out into userland and microbenchmarked
> them.  The upshot is that hash32 and fnv_hash are the fastest, jenkins_hash
> is slower, and siphash24 is the slowest.  Also, Clang resulted in much
> faster code than gcc.
>
>
i missed this part of your message, but if i read your code well,
you are running 100M iterations and the numbers below are in seconds,
so if you multiply the numbers by 10 you have the cost per hash in
nanoseconds.

What CPU did you use for your tests ?

Also some of the numbers (FNV and hash32) are suspiciously low.

I believe that the compilers (both of them) have figured out that
everything in these functions is constant; since fnv_32_buf() and
hash32_buf() are inline, they can be optimized to just return a constant.
This does not happen for siphash and jenkins because they are defined
externally.

Can you please re-run the tests in a way that defeats the optimization?
(e.g. pass a non-constant argument to the hashes so you actually need to
run the code).
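
A sketch of the kind of harness needed (assumptions: do_fnv() stands for
any of the four wrappers from Alan's lagg_hash.c, changed to take a seed
argument; the runtime-supplied seed and the volatile sink are what keep
the compiler from folding the loop away):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

uint32_t do_fnv(uint32_t seed);     /* hash under test, defined elsewhere */

int
main(int argc, char **argv)
{
    /* the seed comes from argv, so it is unknown at compile time */
    uint32_t seed = (argc > 1) ? (uint32_t)strtoul(argv[1], NULL, 0) : 1;
    volatile uint32_t sink = 0;     /* the result must be materialized */
    int i;

    for (i = 0; i < 100000000; i++)
        sink = do_fnv(seed + i);    /* varying input on every call */
    printf("%u\n", (unsigned)sink);
    return (0);
}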

cheers
luigi


http://people.freebsd.org/~asomers/lagg_hash/
>
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash-gcc-4.8
> FNV: 0.76
> hash32: 1.18
> SipHash24: 44.39
> Jenkins: 6.20
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash-gcc-4.2.1
> FNV: 0.74
> hash32: 1.35
> SipHash24: 55.25
> Jenkins: 7.37
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash.clang-3.3
> FNV: 0.30
> hash32: 0.30
> SipHash24: 55.97
> Jenkins: 6.45
> [root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash.clang-3.2
> FNV: 0.30
> hash32: 0.30
> SipHash24: 44.52
> Jenkins: 6.48
>
>
>
> > T.C.
> >
> > [1]
> >
> http://svnweb.freebsd.org/base/head/sys/libkern/jenkins_hash.c?view=markup


Re: Flow ID, LACP, and igb

2013-08-30 Thread Alan Somers
On Thu, Aug 29, 2013 at 3:40 PM, T.C. Gubatayao wrote:
> On Aug 29, 2013, at 4:21 PM, Alan Somers  wrote:
>
>> They're faster, but even with this change, jenkins_hash is still 6 times
>> slower than FNV hash.
>
> Actually, I think your test isn't accurately simulating memory access, which
> might be skewing the results.
>
> For example, from net/if_lagg.c:
>
> p = hash32_buf(&eh->ether_shost, ETHER_ADDR_LEN, p);
> p = hash32_buf(&eh->ether_dhost, ETHER_ADDR_LEN, p);
>
> These two calls can't both be aligned, since ETHER_ADDR_LEN is 6 octets.  The
> same is true for the other hashed fields in the IP and TCP/UDP headers.
> Assuming the mbuf data pointer is aligned, the IP addresses and ports are both
> on 2-byte alignments (without VLAN or IP options).  In your test, they're all
> aligned and in the same cache line.
>
> When I modify the test to simulate an mbuf, lookup3 beats FNV and hash32, and
> SipHash is only 2-3 times slower.


Indeed, in your latest version FNV and hash32 are significantly slower.  It
isn't due to alignment issues though; those hashes don't care about alignment
because they access data 8 bits at a time.  The problem was that Clang was too
smart for me.  In my version, Clang was computing FNV hash and hash32 entirely
at compile time.  All the functions did at runtime was return the correct
answer.  Your mbuf simulation defeats that optimization.

I think that your latest version is fairly accurate, and it shows that Jenkins
is the fastest when compiled with Clang and when all three layers are hashed.
However, FNV is faster when compiled with GCC, or when only one or two layers
are hashed.  In any case, the difference between FNV and Jenkins is about
4ns/packet, which is about as significant as whether to paint the roof cyan or
aquamarine.  As far as I'm concerned, FNV still has two major advantages: it's
available in stable/9, and it's a drop-in replacement for hash32.  Using
Jenkins would require refactoring lagg_hashmbuf to copy the hashable fields
into a stack buffer.  I'm loath to do that, because then I would have to test
lagg_hashmbuf with IPv6 and VLAN packets.  My network isn't currently set up for
those.  Using FNV is a simple enough change that I would feel comfortable
committing it without testing VLANs and IPv6.

We have a three day weekend in my country, but hopefully I'll be able to wrap
up my testing on Tuesday.


>
>> Also, your technique of copying the hashable fields into a separate buffer
>> would need modification to work with different types of packet and different
>> LAGG_F_HASH[234] flags.  Because different packets have different hashable
>> fields, struct key would need to be expanded to include the vlan tag, IPV6
>> addresses, and IPv6 flowid.  lagg_hashmbuf would then have to zero the unused
>> fields.
>
> Agreed, but this is relatively simple with a buffer on the stack, and does not
> require zeroes or padding.  See my modified test, attached.
>
> T.C.


Re: Flow ID, LACP, and igb

2013-08-29 Thread T.C. Gubatayao
On Aug 29, 2013, at 5:40 PM, T.C. Gubatayao  wrote:
> On Aug 29, 2013, at 4:21 PM, Alan Somers  wrote:
>
>> They're faster, but even with this change, jenkins_hash is still 6 times
>> slower than FNV hash.
>
> Actually, I think your test isn't accurately simulating memory access, which
> might be skewing the results.
>
> For example, from net/if_lagg.c:
>
>p = hash32_buf(&eh->ether_shost, ETHER_ADDR_LEN, p);
>p = hash32_buf(&eh->ether_dhost, ETHER_ADDR_LEN, p);
>
> These two calls can't both be aligned, since ETHER_ADDR_LEN is 6 octets.  The
> same is true for the other hashed fields in the IP and TCP/UDP headers.
> Assuming the mbuf data pointer is aligned, the IP addresses and ports are both
> on 2-byte alignments (without VLAN or IP options).  In your test, they're all
> aligned and in the same cache line.
>
> When I modify the test to simulate an mbuf, lookup3 beats FNV and hash32, and
> SipHash is only 2-3 times slower.
>
>> Also, your technique of copying the hashable fields into a separate buffer
>> would need modification to work with different types of packet and different
>> LAGG_F_HASH[234] flags.  Because different packets have different hashable
>> fields, struct key would need to be expanded to include the vlan tag, IPV6
>> addresses, and IPv6 flowid.  lagg_hashmbuf would then have to zero the unused
>> fields.
>
> Agreed, but this is relatively simple with a buffer on the stack, and does not
> require zeroes or padding.  See my modified test, attached.
>
> T.C.

Attachment was stripped.

--- a/lagg_hash.c   2013-08-29 14:21:17.255307349 -0400
+++ b/lagg_hash.c   2013-08-29 17:26:14.055404918 -0400
@@ -7,35 +7,63 @@
 #include 
 #include 
 #include 
-
-uint32_t jenkins_hash32(const uint32_t *, size_t, uint32_t);
+#include 
+#include 
+#include 
+#include 
 
 #define ITERATIONS 100000000
 
-typedef uint32_t do_hash_t(void);
+typedef uint32_t do_hash_t(uint32_t);
+
+/*
+ * Simulate mbuf data for a packet.
+ * No VLAN tagging and no IP options.
+ */
+struct _mbuf {
+   struct ether_header eh;
+   struct ip ip;
+   struct tcphdr th;
+} __attribute__((packed)) m = {
+   {
+   .ether_dhost = { 181, 16, 73, 9, 219, 22 },
+   .ether_shost = { 69, 170, 210, 111, 24, 120 },
+   .ether_type = 0x008
+   },
+   {
+   .ip_src.s_addr = 1329258245,
+   .ip_dst.s_addr = 1319097119,
+   .ip_p = 0x06
+   },
+   {
+   .th_sport = 12506,
+   .th_dport = 47804
+   }
+};
 
-// Pad the MACs with 0s because jenkins_hash operates on 32-bit inputs
-const uint8_t ether_shost[] = {181, 16, 73, 9, 219, 22, 0, 0};
-const uint8_t ether_dhost[] = {69, 170, 210, 111, 24, 120, 0, 0};
-const struct in_addr ip_src = {.s_addr = 1329258245};
-const struct in_addr ip_dst = {.s_addr = 1319097119};
-const uint32_t ports = 3132895450;
 const uint8_t sipkey[16] = {7, 239, 255, 43, 68, 53, 56, 225,
98, 81, 177, 80, 92, 235, 242, 39};
 
+#define LAGG_F_HASHL2  0x1
+#define LAGG_F_HASHL3  0x2
+#define LAGG_F_HASHL4  0x4
+#define LAGG_F_HASHALL (LAGG_F_HASHL2|LAGG_F_HASHL3|LAGG_F_HASHL4)
+
 /*
  * Simulate how lagg_hashmbuf uses FNV hash for a TCP/IP packet
  * No VLAN tagging
  */
-uint32_t do_fnv(void)
+uint32_t do_fnv(uint32_t flags)
 {
uint32_t p = FNV1_32_INIT;
 
-   p = fnv_32_buf(ether_shost, 6, p);
-   p = fnv_32_buf(ether_dhost, 6, p);
-   p = fnv_32_buf(&ip_src, sizeof(struct in_addr), p);
-   p = fnv_32_buf(&ip_dst, sizeof(struct in_addr), p);
-   p = fnv_32_buf(&ports, sizeof(ports), p);
+   if (flags & LAGG_F_HASHL2)
+   p = fnv_32_buf(&m.eh.ether_dhost, 12, p);
+   if (flags & LAGG_F_HASHL3)
+   p = fnv_32_buf(&m.ip.ip_src, 8, p);
+   if (flags & LAGG_F_HASHL4)
+   p = fnv_32_buf(&m.th.th_sport, 4, p);
+
return (p);
 }
 
@@ -43,59 +71,74 @@
  * Simulate how lagg_hashmbuf uses hash32 for a TCP/IP packet
  * No VLAN tagging
  */
-uint32_t do_hash32(void)
+uint32_t do_hash32(uint32_t flags)
 {
// Actually, if_lagg used a pseudorandom number determined at interface
// creation time.  But this should have the same timing
// characteristics.
uint32_t p = HASHINIT;
 
-   p = hash32_buf(ether_shost, 6, p);
-   p = hash32_buf(ether_dhost, 6, p);
-   p = hash32_buf(&ip_src, sizeof(struct in_addr), p);
-   p = hash32_buf(&ip_dst, sizeof(struct in_addr), p);
-   p = hash32_buf(&ports, sizeof(ports), p);
+   if (flags & LAGG_F_HASHL2)
+   p = hash32_buf(&m.eh.ether_dhost, 12, p);
+   if (flags & LAGG_F_HASHL3)
+   p = hash32_buf(&m.ip.ip_src, 8, p);
+   if (flags & LAGG_F_HASHL4)
+   p = hash32_buf(&m.th.th_sport, 4, p);
+
return (p);
 }
 
+/* Simulate copying the info out of the mbuf. */
+static __inline size_t init_key(char *key, uint32_t flags)
+{
+

Re: Flow ID, LACP, and igb

2013-08-29 Thread T.C. Gubatayao
On Aug 29, 2013, at 4:21 PM, Alan Somers  wrote:

> They're faster, but even with this change, jenkins_hash is still 6 times
> slower than FNV hash.

Actually, I think your test isn't accurately simulating memory access, which
might be skewing the results.

For example, from net/if_lagg.c:

p = hash32_buf(&eh->ether_shost, ETHER_ADDR_LEN, p);
p = hash32_buf(&eh->ether_dhost, ETHER_ADDR_LEN, p);

These two calls can't both be aligned, since ETHER_ADDR_LEN is 6 octets.  The
same is true for the other hashed fields in the IP and TCP/UDP headers.
Assuming the mbuf data pointer is aligned, the IP addresses and ports are both
on 2-byte alignments (without VLAN or IP options).  In your test, they're all  
aligned and in the same cache line.

When I modify the test to simulate an mbuf, lookup3 beats FNV and hash32, and
SipHash is only 2-3 times slower.

> Also, your technique of copying the hashable fields into a separate buffer
> would need modification to work with different types of packet and different
> LAGG_F_HASH[234] flags.  Because different packets have different hashable
> fields, struct key would need to be expanded to include the vlan tag, IPV6
> addresses, and IPv6 flowid.  lagg_hashmbuf would then have to zero the unused
> fields.

Agreed, but this is relatively simple with a buffer on the stack, and does not
require zeroes or padding.  See my modified test, attached.

T.C.

Re: Flow ID, LACP, and igb

2013-08-29 Thread Alan Somers
On Thu, Aug 29, 2013 at 1:33 PM, T.C. Gubatayao wrote:

> On Aug 29, 2013, at 12:45 PM, Alan Somers  wrote:
>
> > I pulled all four hash functions out into userland and microbenchmarked
> them.
> > The upshot is that hash32 and fnv_hash are the fastest, jenkins_hash is
> > slower, and siphash24 is the slowest.  Also, Clang resulted in much
> faster
> > code than gcc.
>
> I didn't realize that you were testing incremental hashing with 4 and 6
> byte
> keys.
>
> There might be advantages to conditionally filling out a contiguous key
> and then performing the hash on that.  You could guarantee key alignment,
> for
> one, and this would benefit the hashes which perform word-sized reads.
>
> Based on my quick tests, lookup3 and SipHash improve significantly.
>

They're faster, but even with this change, jenkins_hash is still 6 times
slower than FNV hash.  Also, your technique of copying the hashable fields
into a separate buffer would need modification to work with different types
of packet and different LAGG_F_HASH[234] flags.  Because different packets
have different hashable fields, struct key would need to be expanded to
include the vlan tag, IPV6 addresses, and IPv6 flowid.  lagg_hashmbuf would
then have to zero the unused fields.  In any case, that's not going to make
Jenkins and SipHash24 more likely to beat FNV.


>
> T.C.
>
> diff -u a/lagg_hash.c b/lagg_hash.c
> --- a/lagg_hash.c   2013-08-29 14:21:17.255307349 -0400
> +++ b/lagg_hash.c   2013-08-29 15:16:31.135653259 -0400
> @@ -7,22 +7,30 @@
>  #include 
>  #include 
>  #include 
> -
> -uint32_t jenkins_hash32(const uint32_t *, size_t, uint32_t);
> +#include 
>
>  #define ITERATIONS 100000000
>
>  typedef uint32_t do_hash_t(void);
>
> -// Pad the MACs with 0s because jenkins_hash operates on 32-bit inputs
> -const uint8_t ether_shost[] = {181, 16, 73, 9, 219, 22, 0, 0};
> -const uint8_t ether_dhost[] = {69, 170, 210, 111, 24, 120, 0, 0};
> +const uint8_t ether_shost[] = {181, 16, 73, 9, 219, 22};
> +const uint8_t ether_dhost[] = {69, 170, 210, 111, 24, 120};
> +const uint8_t ether_hosts[] = { 181, 16, 73, 9, 219, 22,
> +   69, 170, 210, 111, 24, 120 };
>  const struct in_addr ip_src = {.s_addr = 1329258245};
>  const struct in_addr ip_dst = {.s_addr = 1319097119};
> +const struct in_addr ips[2] = { { .s_addr = 1329258245 },
> +   { .s_addr = 1319097119 } };
>  const uint32_t ports = 3132895450;
>  const uint8_t sipkey[16] = {7, 239, 255, 43, 68, 53, 56, 225,
> 98, 81, 177, 80, 92, 235, 242, 39};
>
> +struct key {
> +   uint8_t ether_hosts[12];
> +   struct in_addr ips[2];
> +   uint16_t ports[2];
> +} __attribute__((packed));
> +
>  /*
>   * Simulate how lagg_hashmbuf uses FNV hash for a TCP/IP packet
>   * No VLAN tagging
> @@ -58,6 +66,15 @@
> return (p);
>  }
>
> +static __inline void init_key(struct key *key)
> +{
> +
> +   /* Simulate copying the info out of the mbuf. */
> +   memcpy(key->ether_hosts, ether_hosts, sizeof(ether_hosts));
> +   memcpy(key->ips, ips, sizeof(ips));
> +   memcpy(key->ports, &ports, sizeof(ports));
> +}
> +
>  /*
>   * Simulate how lagg_hashmbuf would use siphash24 for a TCP/IP packet
>   * No VLAN tagging
> @@ -65,16 +82,11 @@
>  uint32_t do_siphash24(void)
>  {
> SIPHASH_CTX ctx;
> +   struct key key;
>
> -   SipHash24_Init(&ctx);
> -   SipHash_SetKey(&ctx, sipkey);
> +   init_key(&key);
>
> -   SipHash_Update(&ctx, ether_shost, 6);
> -   SipHash_Update(&ctx, ether_dhost, 6);
> -   SipHash_Update(&ctx, &ip_src, sizeof(struct in_addr));
> -   SipHash_Update(&ctx, &ip_dst, sizeof(struct in_addr));
> -   SipHash_Update(&ctx, &ports, sizeof(ports));
> -   return (SipHash_End(&ctx) & 0xffffffff);
> +   return (SipHash24(&ctx, sipkey, &key, sizeof(key)) & 0xffffffff);
>  }
>
>  /*
> @@ -83,19 +95,11 @@
>   */
>  uint32_t do_jenkins(void)
>  {
> -   /* Jenkins hash does not recommend any specific initializer */
> -   uint32_t p = FNV1_32_INIT;
> +   struct key key;
>
> -   /*
> -* jenkins_hash uses 32-bit inputs, so we need to present the MACs
> as
> -* arrays of 2 32-bit values
> -*/
> -   p = jenkins_hash32((uint32_t*)ether_shost, 2, p);
> -   p = jenkins_hash32((uint32_t*)ether_dhost, 2, p);
> -   p = jenkins_hash32((uint32_t*)&ip_src, sizeof(struct in_addr) / 4,
> p);
> -   p = jenkins_hash32((uint32_t*)&ip_dst, sizeof(struct in_addr) / 4,
> p);
> -   p = jenkins_hash32(&ports, sizeof(ports) / 4, p);
> -   return (p);
> +   init_key(&key);
> +
> +   return (jenkins_hash(&key, sizeof(key), FNV1_32_INIT));
>  }
>
>
> diff -u a/siphash.h b/siphash.h
> --- a/siphash.h 2013-08-29 14:21:21.851306417 -0400
> +++ b/siphash.h 2013-08-29 14:26:44.470240137 -0400
> @@ -73,8 +73,8 @@
>  void SipHash_Final(void *, SIPHASH_CTX *);
>  uint64_t SipHash_End(SIPHASH_CTX *);
>
> -#define Sip

Re: Flow ID, LACP, and igb

2013-08-29 Thread T.C. Gubatayao
On Aug 29, 2013, at 12:45 PM, Alan Somers  wrote:

> I pulled all four hash functions out into userland and microbenchmarked them.
> The upshot is that hash32 and fnv_hash are the fastest, jenkins_hash is
> slower, and siphash24 is the slowest.  Also, Clang resulted in much faster
> code than gcc.

I didn't realize that you were testing incremental hashing with 4 and 6 byte
keys.

There might be advantages to conditionally filling out a contiguous key
and then performing the hash on that.  You could guarantee key alignment, for  
one, and this would benefit the hashes which perform word-sized reads. 

Based on my quick tests, lookup3 and SipHash improve significantly.

T.C.

diff -u a/lagg_hash.c b/lagg_hash.c
--- a/lagg_hash.c   2013-08-29 14:21:17.255307349 -0400
+++ b/lagg_hash.c   2013-08-29 15:16:31.135653259 -0400
@@ -7,22 +7,30 @@
 #include 
 #include 
 #include 
-
-uint32_t jenkins_hash32(const uint32_t *, size_t, uint32_t);
+#include 
 
 #define ITERATIONS 100000000
 
 typedef uint32_t do_hash_t(void);
 
-// Pad the MACs with 0s because jenkins_hash operates on 32-bit inputs
-const uint8_t ether_shost[] = {181, 16, 73, 9, 219, 22, 0, 0};
-const uint8_t ether_dhost[] = {69, 170, 210, 111, 24, 120, 0, 0};
+const uint8_t ether_shost[] = {181, 16, 73, 9, 219, 22};
+const uint8_t ether_dhost[] = {69, 170, 210, 111, 24, 120};
+const uint8_t ether_hosts[] = { 181, 16, 73, 9, 219, 22,
+   69, 170, 210, 111, 24, 120 };
 const struct in_addr ip_src = {.s_addr = 1329258245};
 const struct in_addr ip_dst = {.s_addr = 1319097119};
+const struct in_addr ips[2] = { { .s_addr = 1329258245 },
+   { .s_addr = 1319097119 } };
 const uint32_t ports = 3132895450;
 const uint8_t sipkey[16] = {7, 239, 255, 43, 68, 53, 56, 225,
98, 81, 177, 80, 92, 235, 242, 39};
 
+struct key {
+   uint8_t ether_hosts[12];
+   struct in_addr ips[2];
+   uint16_t ports[2];
+} __attribute__((packed));
+
 /*
  * Simulate how lagg_hashmbuf uses FNV hash for a TCP/IP packet
  * No VLAN tagging
@@ -58,6 +66,15 @@
return (p);
 }
 
+static __inline void init_key(struct key *key)
+{
+
+   /* Simulate copying the info out of the mbuf. */
+   memcpy(key->ether_hosts, ether_hosts, sizeof(ether_hosts));
+   memcpy(key->ips, ips, sizeof(ips));
+   memcpy(key->ports, &ports, sizeof(ports));
+}
+
 /*
  * Simulate how lagg_hashmbuf would use siphash24 for a TCP/IP packet
  * No VLAN tagging
@@ -65,16 +82,11 @@
 uint32_t do_siphash24(void)
 {
SIPHASH_CTX ctx;
+   struct key key;
 
-   SipHash24_Init(&ctx);
-   SipHash_SetKey(&ctx, sipkey);
+   init_key(&key);
 
-   SipHash_Update(&ctx, ether_shost, 6);
-   SipHash_Update(&ctx, ether_dhost, 6);
-   SipHash_Update(&ctx, &ip_src, sizeof(struct in_addr));
-   SipHash_Update(&ctx, &ip_dst, sizeof(struct in_addr));
-   SipHash_Update(&ctx, &ports, sizeof(ports));
-   return (SipHash_End(&ctx) & 0xffffffff);
+   return (SipHash24(&ctx, sipkey, &key, sizeof(key)) & 0xffffffff);
 }
 
 /*
@@ -83,19 +95,11 @@
  */
 uint32_t do_jenkins(void)
 {
-   /* Jenkins hash does not recommend any specific initializer */
-   uint32_t p = FNV1_32_INIT;
+   struct key key;
 
-   /* 
-* jenkins_hash uses 32-bit inputs, so we need to present the MACs as
-* arrays of 2 32-bit values
-*/
-   p = jenkins_hash32((uint32_t*)ether_shost, 2, p);
-   p = jenkins_hash32((uint32_t*)ether_dhost, 2, p);
-   p = jenkins_hash32((uint32_t*)&ip_src, sizeof(struct in_addr) / 4, p);
-   p = jenkins_hash32((uint32_t*)&ip_dst, sizeof(struct in_addr) / 4, p);
-   p = jenkins_hash32(&ports, sizeof(ports) / 4, p);
-   return (p);
+   init_key(&key);
+
+   return (jenkins_hash(&key, sizeof(key), FNV1_32_INIT));
 }
 
 
diff -u a/siphash.h b/siphash.h
--- a/siphash.h 2013-08-29 14:21:21.851306417 -0400
+++ b/siphash.h 2013-08-29 14:26:44.470240137 -0400
@@ -73,8 +73,8 @@
 void SipHash_Final(void *, SIPHASH_CTX *);
 uint64_t SipHash_End(SIPHASH_CTX *);
 
-#define SipHash24(x, y, z, i)  SipHashX((x), 2, 4, (y), (z), (i));
-#define SipHash48(x, y, z, i)  SipHashX((x), 4, 8, (y), (z), (i));
+#define SipHash24(x, y, z, i)  SipHashX((x), 2, 4, (y), (z), (i))
+#define SipHash48(x, y, z, i)  SipHashX((x), 4, 8, (y), (z), (i))
 uint64_t SipHashX(SIPHASH_CTX *, int, int, const uint8_t [16], const void *,
 size_t);
 


Re: Flow ID, LACP, and igb

2013-08-29 Thread Luigi Rizzo
On Thu, Aug 29, 2013 at 1:42 AM, Alan Somers  wrote:

> On Mon, Aug 26, 2013 at 2:40 PM, Andre Oppermann 
> wrote:
>
> > On 26.08.2013 19:18, Justin T. Gibbs wrote:
> >
> ...
>
> >> Are there other checksums we should be looking at in addition to FNV?
> >>
> >
> > siphash24() is fast, keyed and strong.
> >
> I benchmarked hash32 (the existing hash function) vs fnv_hash using both
> TCP and UDP, with 1500 and 9000 byte MTUs.  At 10Gbps, I couldn't measure
> any difference in either throughput or cpu utilization.  Given that
> siphash24 is definitely slower than hash32, there's no way that I'll find
>

with these large MTUs the packet rate is too low to see the
difference between the various functions.
Just as a data point, the jenkins hash used in the
netmap code takes at most 10-15ns (with data in cache) on the
i7-2600 CPUs I was using in my tests.

I think the way to tell which hash is faster is to run the
function in a tight loop, rather than relying on input traffic.

Then of course there are cache misses that heavily impact
the cost of the function, but that is an orthogonal issue
that exists for all hashes.
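
A minimal sketch of such a tight-loop harness, assuming a 24-byte lagg-style
key (12 bytes of MACs, 8 of IPs, 4 of ports); FNV-1a is only a placeholder
for whatever hash is under test, and the iteration count is arbitrary:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define KEYLEN  24              /* MACs (12) + IPs (8) + ports (4) */
#define ITERS   10000000UL

/* Placeholder hash: FNV-1a, 32-bit. */
static uint32_t
hash_under_test(const uint8_t *p, int len)
{
        uint32_t h = 2166136261u;

        while (len--)
                h = (h ^ *p++) * 16777619u;
        return (h);
}

int
main(void)
{
        uint8_t key[KEYLEN];
        volatile uint32_t sink = 0;     /* defeat dead-code elimination */
        struct timespec t0, t1;
        unsigned long i;
        double ns;

        memset(key, 0xa5, sizeof(key));
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < ITERS; i++) {
                key[0] = (uint8_t)i;    /* perturb so calls aren't identical */
                sink ^= hash_under_test(key, sizeof(key));
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.2f ns/hash (sink %u)\n", ns / ITERS, (unsigned)sink);
        return (0);
}

Note that, as Luigi says, this measures the hot-cache cost only; a hash that
reads large tables may look better here than it behaves under real traffic.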

cheers
luigi


Re: Flow ID, LACP, and igb

2013-08-29 Thread Alan Somers
On Thu, Aug 29, 2013 at 1:27 AM, T.C. Gubatayao wrote:

> > No problem with fnv_hash().
>
> Doesn't it have bad mixing?  Good distribution is important since this
> code is for load balancing.
>

The poor mixing in FNV hash comes from the 8-bit XOR operation.  But that
provides fine mixing of the last 8 bits, which should be sufficient for
lagg_hash unless people are lagging together > 256 ports.
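
That claim is easy to sanity-check in userland: hash a batch of sequential
synthetic keys with FNV-1a and count how the results distribute mod 3, the
way lagg_hash would spread flows over a 3-port LAGG (a sketch, not the
kernel code):

#include <stdint.h>
#include <stdio.h>

/* FNV-1a, 32-bit. */
static uint32_t
fnv32(const uint8_t *p, int len)
{
        uint32_t h = 2166136261u;

        while (len--)
                h = (h ^ *p++) * 16777619u;
        return (h);
}

int
main(void)
{
        unsigned long buckets[3] = { 0, 0, 0 };
        uint32_t i;

        /* One million sequential "flows", bucketed as hash % 3. */
        for (i = 0; i < 1000000; i++)
                buckets[fnv32((const uint8_t *)&i, sizeof(i)) % 3]++;
        printf("%lu %lu %lu\n", buckets[0], buckets[1], buckets[2]);
        return (0);
}

If the three counts come out close to 333333 each, the mixing is good
enough for this use.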


>
> FNV is also slower compared to most of the newer non-cryptographic hashes,
> certainly on large keys, but even on small ones.  Of course, performance
> will vary with the architecture.
>
> > While I agree that it is likely that siphash24() is slower, if you could
> > afford the time to do a test run it would be great to go from guessing
> > to knowing.
>
> +1


> You might want to consider lookup3 too, since it's also readily available
> in the kernel [1].
>

I pulled all four hash functions out into userland and microbenchmarked
them.  The upshot is that hash32 and fnv_hash are the fastest, jenkins_hash
is slower, and siphash24 is the slowest.  Also, Clang resulted in much
faster code than gcc.

http://people.freebsd.org/~asomers/lagg_hash/

[root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash-gcc-4.8
FNV: 0.76
hash32: 1.18
SipHash24: 44.39
Jenkins: 6.20
[root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash-gcc-4.2.1
FNV: 0.74
hash32: 1.35
SipHash24: 55.25
Jenkins: 7.37
[root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash.clang-3.3
FNV: 0.30
hash32: 0.30
SipHash24: 55.97
Jenkins: 6.45
[root@sm4u-4 /usr/home/alans/ctest/lagg_hash]# ./lagg_hash.clang-3.2
FNV: 0.30
hash32: 0.30
SipHash24: 44.52
Jenkins: 6.48



> T.C.
>
> [1]
> http://svnweb.freebsd.org/base/head/sys/libkern/jenkins_hash.c?view=markup


RE: Flow ID, LACP, and igb

2013-08-29 Thread T.C. Gubatayao
> No problem with fnv_hash().

Doesn't it have bad mixing?  Good distribution is important since this code is
for load balancing.

FNV is also slower compared to most of the newer non-cryptographic hashes,
certainly on large keys, but even on small ones.  Of course, performance will
vary with the architecture.

> While I agree that it is likely that siphash24() is slower, if you could afford
> the time to do a test run it would be great to go from guessing to knowing.

+1

You might want to consider lookup3 too, since it's also readily available in the
kernel [1].

T.C.

[1] http://svnweb.freebsd.org/base/head/sys/libkern/jenkins_hash.c?view=markup


Re: Flow ID, LACP, and igb

2013-08-28 Thread Andre Oppermann

On 29.08.2013 01:42, Alan Somers wrote:

On Mon, Aug 26, 2013 at 2:40 PM, Andre Oppermann  wrote:


On 26.08.2013 19:18, Justin T. Gibbs wrote:


Hi Net,

I'm an infrequent traveler through the networking code and would
appreciate some feedback on some proposed solutions to issues Spectra
has seen with outbound LACP traffic.

lacp_select_tx_port() uses the flow ID if it is available in the outbound
mbuf to select the outbound port.  The igb driver uses the msix queue of
the inbound packet to set a packet's flow ID.  This doesn't provide enough
bits of information to yield a high quality flow ID.  If, for example, the
switch controlling inbound packet distribution does a poor job, the outbound
packet distribution will also be poorly distributed.



Please note that inbound and outbound flow ID do not need to be the same
or symmetric.  It only should stay the same for all packets in a single
connection to prevent reordering.

Generally it doesn't matter if in- and outbound packets do not use the
same queue.  Only in sophisticated setups with full affinity, which we
don't support yet, it could matter.


The majority of the adapters supported by this driver will compute
the Toeplitz RSS hash.  Using this data seems to work quite well
in our tests (3 member LAGG group).  Is there any reason we shouldn't
use the RSS hash for flow ID?



Using the RSS hash is the idea.  The infrastructure and driver adjustments
haven't been implemented throughout yet.


We also tried disabling the use of flow ID and doing the hash directly in
the driver.  Unfortunately, the current hash is pretty weak.  It multiplies
by 33, which yields very poor distributions if you need to mod the result
by 3 (e.g. LAGG group with 3 members).  Alan modified the driver to use
the FNV hash, which is already in the kernel, and this yielded much better
results.  He is still benchmarking the impact of this change.  Assuming we
can get decent flow ID data, this should only impact outbound UDP, since the
stack doesn't provide a flow ID in this case.

Are there other checksums we should be looking at in addition to FNV?



siphash24() is fast, keyed and strong.


I benchmarked hash32 (the existing hash function) vs fnv_hash using both
TCP and UDP, with 1500 and 9000 byte MTUs.  At 10Gbps, I couldn't measure
any difference in either throughput or cpu utilization.  Given that
siphash24 is definitely slower than hash32, there's no way that I'll find
it to be significantly faster than fnv_hash for this application.  In fact,
I'm guessing that it will be slower due to the function call overhead and
the fact that lagg_hashmbuf calls the hash function on very short buffers.


No problem with fnv_hash().  While I agree that it is likely that siphash24()
is slower, if you could afford the time to do a test run it would be great to
go from guessing to knowing.


Therefore I'm going to commit the change using fnv_hash in the next few
days if no one objects.  Here's the diff:

 //SpectraBSD/stable/sys/net/ieee8023ad_lacp.c#4 (text) 

@@ -763,7 +763,6 @@
  sc->sc_psc = (caddr_t)lsc;
  lsc->lsc_softc = sc;

-lsc->lsc_hashkey = arc4random();
  lsc->lsc_active_aggregator = NULL;
  LACP_LOCK_INIT(lsc);
  TAILQ_INIT(&lsc->lsc_aggregators);
@@ -841,7 +840,7 @@
  if (sc->use_flowid && (m->m_flags & M_FLOWID))
  hash = m->m_pkthdr.flowid;
  else
-hash = lagg_hashmbuf(sc, m, lsc->lsc_hashkey);
+hash = lagg_hashmbuf(sc, m);
  hash %= pm->pm_count;
  lp = pm->pm_map[hash];


The reason for the hashkey was to prevent directed "attacks" on the load
balancing by choosing/predicting the outcome of it.  This is good and bad
as it is nondeterministic between runs, which makes debugging particular
situations harder.  To work around the lack of a key for fnv_hash(), XOR'ing
the hash output with a pre-initialized random value is likely sufficient.
The true importance of this randomization is debatable; I just point out
why it was there, not to object to you removing it.
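
A sketch of that workaround, with invented names; in the kernel the salt
would come from arc4random() at attach time, while here a time-seeded
random() stands in:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static uint32_t lagg_hash_salt;         /* set once, as lsc_hashkey was */

int
main(void)
{
        uint32_t h = 0x12345678;        /* pretend fnv_32_buf() output */

        /* In the kernel: lagg_hash_salt = arc4random(); at attach time. */
        srandom((unsigned)time(NULL));
        lagg_hash_salt = (uint32_t)random();

        /*
         * XOR'ing a boot-time random value into the hash output
         * re-randomizes the port choice across boots without touching
         * fnv_hash itself.
         */
        printf("port %u\n", (unsigned)((h ^ lagg_hash_salt) % 3));
        return (0);
}

Note the limitation implicit in "likely sufficient": XOR with a constant is
a bijection, so two flows that collide on the raw hash still collide after
salting; the salt only makes the mapping unpredictable, it adds no mixing.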

--
Andre



Re: Flow ID, LACP, and igb

2013-08-28 Thread Alan Somers
On Mon, Aug 26, 2013 at 2:40 PM, Andre Oppermann  wrote:

> On 26.08.2013 19:18, Justin T. Gibbs wrote:
>
>> Hi Net,
>>
>> I'm an infrequent traveler through the networking code and would
>> appreciate some feedback on some proposed solutions to issues Spectra
>> has seen with outbound LACP traffic.
>>
>> lacp_select_tx_port() uses the flow ID if it is available in the outbound
>> mbuf to select the outbound port.  The igb driver uses the msix queue of
>> the inbound packet to set a packet's flow ID.  This doesn't provide enough
>> bits of information to yield a high quality flow ID.  If, for example, the
>> switch controlling inbound packet distribution does a poor job, the outbound
>> packet distribution will also be poorly distributed.
>>
>
> Please note that inbound and outbound flow ID do not need to be the same
> or symmetric.  It only should stay the same for all packets in a single
> connection to prevent reordering.
>
> Generally it doesn't matter if in- and outbound packets do not use the
> same queue.  Only in sophisticated setups with full affinity, which we
> don't support yet, it could matter.
>
>
>> The majority of the adapters supported by this driver will compute
>> the Toeplitz RSS hash.  Using this data seems to work quite well
>> in our tests (3 member LAGG group).  Is there any reason we shouldn't
>> use the RSS hash for flow ID?
>>
>
> Using the RSS hash is the idea.  The infrastructure and driver adjustments
> haven't been implemented throughout yet.
>
>
>> We also tried disabling the use of flow ID and doing the hash directly in
>> the driver.  Unfortunately, the current hash is pretty weak.  It multiplies
>> by 33, which yields very poor distributions if you need to mod the result
>> by 3 (e.g. LAGG group with 3 members).  Alan modified the driver to use
>> the FNV hash, which is already in the kernel, and this yielded much better
>> results.  He is still benchmarking the impact of this change.  Assuming we
>> can get decent flow ID data, this should only impact outbound UDP, since the
>> stack doesn't provide a flow ID in this case.
>>
>> Are there other checksums we should be looking at in addition to FNV?
>>
>
> siphash24() is fast, keyed and strong.
>
I benchmarked hash32 (the existing hash function) vs fnv_hash using both
TCP and UDP, with 1500 and 9000 byte MTUs.  At 10Gbps, I couldn't measure
any difference in either throughput or cpu utilization.  Given that
siphash24 is definitely slower than hash32, there's no way that I'll find
it to be significantly faster than fnv_hash for this application.  In fact,
I'm guessing that it will be slower due to the function call overhead and
the fact that lagg_hashmbuf calls the hash function on very short buffers.
Therefore I'm going to commit the change using fnv_hash in the next few
days if no one objects.  Here's the diff:

 //SpectraBSD/stable/sys/net/ieee8023ad_lacp.c#4 (text) 

@@ -763,7 +763,6 @@
 sc->sc_psc = (caddr_t)lsc;
 lsc->lsc_softc = sc;

-lsc->lsc_hashkey = arc4random();
 lsc->lsc_active_aggregator = NULL;
 LACP_LOCK_INIT(lsc);
 TAILQ_INIT(&lsc->lsc_aggregators);
@@ -841,7 +840,7 @@
 if (sc->use_flowid && (m->m_flags & M_FLOWID))
 hash = m->m_pkthdr.flowid;
 else
-hash = lagg_hashmbuf(sc, m, lsc->lsc_hashkey);
+hash = lagg_hashmbuf(sc, m);
 hash %= pm->pm_count;
 lp = pm->pm_map[hash];


 //SpectraBSD/stable/sys/net/ieee8023ad_lacp.h#2 (text) 

@@ -244,7 +244,6 @@
LIST_HEAD(, lacp_port)  lsc_ports;
struct lacp_portmap     lsc_pmap[2];
volatile u_int          lsc_activemap;
-u_int32_t              lsc_hashkey;
};

#define LACP_TYPE_ACTORINFO     1

 //SpectraBSD/stable/sys/net/if_lagg.c#9 (text) 

@@ -35,7 +35,7 @@
 #include 
 #include 
 #include 
-#include <sys/hash.h>
+#include <sys/fnv_hash.h>
 #include 
 #include 
 #include 
@@ -1588,10 +1588,10 @@
 }

 uint32_t
-lagg_hashmbuf(struct lagg_softc *sc, struct mbuf *m, uint32_t key)
+lagg_hashmbuf(struct lagg_softc *sc, struct mbuf *m)
 {
 uint16_t etype;
-uint32_t p = key;
+uint32_t p = FNV1_32_INIT;
 int off;
 struct ether_header *eh;
 const struct ether_vlan_header *vlan;
@@ -1622,13 +1622,13 @@
 eh = mtod(m, struct ether_header *);
 etype = ntohs(eh->ether_type);
 if (sc->sc_flags & LAGG_F_HASHL2) {
-p = hash32_buf(&eh->ether_shost, ETHER_ADDR_LEN, p);
-p = hash32_buf(&eh->ether_dhost, ETHER_ADDR_LEN, p);
+p = fnv_32_buf(&eh->ether_shost, ETHER_ADDR_LEN, p);
+p = fnv_32_buf(&eh->ether_dhost, ETHER_ADDR_LEN, p);
 }

 /* Special handling for encapsulating VLAN frames */
 if ((m->m_flags & M_VLANTAG) && (sc->sc_flags & LAGG_F_HASHL2)) {
-p = hash32_buf(&m->m_pkthdr.ether_vtag,
+p = fnv_32_buf(&m->m_pkthdr.ether_vtag,
 sizeof(m->m_pkthdr.ether_vtag), p);
 } else if (etype == ETHERTYPE_VLAN) {
 vlan = lagg_gethdr(m, off,  sizeof(*vlan), &buf);
@@ -1636,7 +1636,7 @@
   

Re: Flow ID, LACP, and igb

2013-08-27 Thread Andre Oppermann

On 27.08.2013 01:30, Adrian Chadd wrote:

... is there any reason we wouldn't want to have the TX and RX for a given flow 
mapped to the same core?


They are.  Thing is the inbound and outbound packet flow id's are totally
independent from each other.  The inbound one determines the RX ring it
will take to go up the stack.  If that's bound to a core that's fine and
gives affinity.  If the socket and user-space application are bound to
the same core as well, there is full affinity.

Now on the way down, the core doing the write to the socket is what matters
on entering the kernel.  It stays there until the packet is generated (in tcp_output
for example).  The flow id of the packet doesn't matter at all so far because
it is filled only then.  Now the packet goes down the stack and the flow id
is only used at the end when it has to choose an outbound TX queue based
on it.  This outbound TX ring doesn't have to be the same one it came in on,
as long as it stays the same to prevent reordering.
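
In other words, the selection at the bottom of the stack can be as simple as
the sketch below (select_txq is an invented name, not the actual code): all
packets of a connection carry one flow id, so they always map to one ring,
regardless of which ring they arrived on.

#include <stdint.h>
#include <stdio.h>

/* Invented helper: map a flow ID onto one of ntxq transmit rings. */
static int
select_txq(uint32_t flowid, int ntxq)
{

        return ((int)(flowid % (uint32_t)ntxq));
}

int
main(void)
{
        uint32_t flowid = 0x9e3779b9;   /* constant for the connection */
        int i;

        /* Every packet of the flow lands on the same ring. */
        for (i = 0; i < 4; i++)
                printf("packet %d -> txq %d\n", i, select_txq(flowid, 8));
        return (0);
}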

This fixes Justin's issue with if_lagg and poor balancing.  He can simply
choose a good hash for the packets going out and stop worrying about it.
More important he's no longer hostage to random switches with poor hashing.

Ultimately you could try to bind the TX ring to a particular CPU as well and
try to run it lockless.  That is fraught with some difficult problems though.
First you must have exactly as many RX/TX queues as cores.  That's often not
the case as there are many cards that only support a limited number of rings.
Then for packets generated locally (think DNS query over UDP) you either simply
stick to the local cpu-assigned queue to send without looking at the computed
flow id or you have to switch cores to send the packet on the correct queue.
Such a very strong core binding is typically only really useful in embarrassingly
parallel applications that only do packet pushing.  If your application is also
compute-intensive you may want to have some more flexibility to schedule threads
to prevent stalls from busy cores.  In that case not binding TX to a core is
a win.  So we will pretty much end up with one lock per TX ring to protect the
DMA descriptor structures.

We're still far away from having to worry about this TX issue.  The big win
is the RX queue - socket - application affinity (to the same core).

--
Andre



Re: Flow ID, LACP, and igb

2013-08-26 Thread Scott Long
On Aug 26, 2013, at 5:30 PM, Adrian Chadd  wrote:

> ... is there any reason we wouldn't want to have the TX and RX for a given
> flow mapped to the same core?
> 


Given that an inbound ACK is likely to be turned into an outbound segment
from within the same execution context and CPU instance, I can't imagine why
it would be useful for these flows to be different.  However, I'm still a n00b
at this networking stuff, so please correct me if I'm wrong.

Scott



Re: Flow ID, LACP, and igb

2013-08-26 Thread Jack Vogel
None that I can think of.


On Mon, Aug 26, 2013 at 4:30 PM, Adrian Chadd  wrote:

> ... is there any reason we wouldn't want to have the TX and RX for a given
> flow mapped to the same core?
>
>
>
>
> -adrian
>


Re: Flow ID, LACP, and igb

2013-08-26 Thread Adrian Chadd
... is there any reason we wouldn't want to have the TX and RX for a given
flow mapped to the same core?




-adrian


Re: Flow ID, LACP, and igb

2013-08-26 Thread Andre Oppermann

On 26.08.2013 19:18, Justin T. Gibbs wrote:

Hi Net,

I'm an infrequent traveler through the networking code and would
appreciate some feedback on some proposed solutions to issues Spectra
has seen with outbound LACP traffic.

lacp_select_tx_port() uses the flow ID if it is available in the outbound
mbuf to select the outbound port.  The igb driver uses the msix queue of
the inbound packet to set a packet's flow ID.  This doesn't provide enough
bits of information to yield a high quality flow ID.  If, for example, the
switch controlling inbound packet distribution does a poor job, the outbound
packet distribution will also be poorly distributed.


Please note that inbound and outbound flow ID do not need to be the same
or symmetric.  It only should stay the same for all packets in a single
connection to prevent reordering.

Generally it doesn't matter if in- and outbound packets do not use the
same queue.  Only in sophisticated setups with full affinity, which we
don't support yet, it could matter.


The majority of the adapters supported by this driver will compute
the Toeplitz RSS hash.  Using this data seems to work quite well
in our tests (3 member LAGG group).  Is there any reason we shouldn't
use the RSS hash for flow ID?


Using the RSS hash is the idea.  The infrastructure and driver adjustments
haven't been implemented throughout yet.


We also tried disabling the use of flow ID and doing the hash directly in
the driver.  Unfortunately, the current hash is pretty weak.  It multiplies
by 33, which yields very poor distributions if you need to mod the result
by 3 (e.g. LAGG group with 3 members).  Alan modified the driver to use
the FNV hash, which is already in the kernel, and this yielded much better
results.  He is still benchmarking the impact of this change.  Assuming we
can get decent flow ID data, this should only impact outbound UDP, since the
stack doesn't provide a flow ID in this case.

Are there other checksums we should be looking at in addition to FNV?


siphash24() is fast, keyed and strong.

--
Andre
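
As an aside on the multiply-by-33 weakness quoted above: 33 is divisible by
3, so each step h = h*33 + c leaves h congruent to c modulo 3, and the final
hash mod 3 therefore depends only on the last input byte.  A small
demonstration, using my own harness around a hash32_buf-style step:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* hash32_buf-style step: h = h*33 + byte. */
static uint32_t
hash33(const uint8_t *p, int len, uint32_t h)
{

        while (len--)
                h = (h << 5) + h + *p++;
        return (h);
}

int
main(void)
{
        uint8_t key[24];
        int i;

        /*
         * Vary everything except the last byte: the 3-port bucket
         * never changes, because 33 == 0 (mod 3) wipes out the
         * contribution of every earlier byte.
         */
        for (i = 0; i < 8; i++) {
                memset(key, i * 37, sizeof(key));
                key[23] = 7;            /* fixed last byte */
                printf("key %d -> port %u\n", i,
                    (unsigned)(hash33(key, 24, 5381) % 3));
        }
        return (0);
}

All eight keys land on the same port, which is exactly the degenerate
distribution Justin describes for a 3-member LAGG group.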



Re: Flow ID, LACP, and igb

2013-08-26 Thread Alan Somers
On Mon, Aug 26, 2013 at 11:18 AM, Justin T. Gibbs  wrote:

> Hi Net,
>
> I'm an infrequent traveler through the networking code and would
> appreciate some feedback on some proposed solutions to issues Spectra
> has seen with outbound LACP traffic.
>
> lacp_select_tx_port() uses the flow ID if it is available in the outbound
> mbuf to select the outbound port.  The igb driver uses the msix queue of
> the inbound packet to set a packet's flow ID.  This doesn't provide enough
> bits of information to yield a high quality flow ID.  If, for example, the
> switch controlling inbound packet distribution does a poor job, the outbound
> packet distribution will also be poorly distributed.
>
It's actually worse than this.  If two inbound TCP packets get sent to the
same queue on different igb ports, then they will have the same flowid.
That could happen even if the switch is distributing packets just fine.

>
> The majority of the adapters supported by this driver will compute
> the Toeplitz RSS hash.  Using this data seems to work quite well
> in our tests (3 member LAGG group).  Is there any reason we shouldn't
> use the RSS hash for flow ID?
>
> We also tried disabling the use of flow ID and doing the hash directly in
> the driver.  Unfortunately, the current hash is pretty weak.  It multiplies
>> by 33, which yields very poor distributions if you need to mod the result
> by 3 (e.g. LAGG group with 3 members).  Alan modified the driver to use
> the FNV hash, which is already in the kernel, and this yielded much better
> results.  He is still benchmarking the impact of this change.  Assuming we
> can get decent flow ID data, this should only impact outbound UDP, since the
> stack doesn't provide a flow ID in this case.
>
 It also affects outbound TCP packets for streams that originated on the
host.  For example, it affects tcp-mounted NFS clients.

>
> Are there other checksums we should be looking at in addition to FNV?
>
s/checksums/hashes/

>
> Thanks,
> Justin
>
>