[RFC 1/2] bridge: avoid ptype_all packet handling
On Fri, 02 Mar 2007 13:26:38 -0800 (PST) David Miller [EMAIL PROTECTED] wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 28 Feb 2007 17:18:46 -0800 I was measuring bridging/routing performance and noticed this. The current code runs the all packet type handlers before calling the bridge hook. If an application (like some DHCP clients) is using AF_PACKET, this means that each received packet gets run through the Berkeley Packet Filter code in sk_run_filter (slow). I know we closed this out by saying that even though performance sucks, we can't really apply this without breaking things. wrong. What would be broken is if the DHCP client isn't specifying a device ifindex when it binds the AF_PACKET socket. That would be an easy way to fix this performance problem at the application level. The DHCP client should only care about a particular interface's traffic, the one it wants to listen on. My assumption is that when bridging, the normal stack path only has to receive those packets that it would receive if it was not doing bridging. A better version of the patch is: == The current code runs the all packet type handlers before calling the bridge hook. If an application (like some DHCP clients) is using AF_PACKET, this means that each received packet gets run through the Berkeley Packet Filter code in sk_run_filter. This is significant overhead. By moving the bridging hook to run first, the packets flowing through the bridge get filtered out there first. This results in a 14% improvement in performance. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- net/core/dev.c | 24 1 file changed, 12 insertions(+), 12 deletions(-) --- netem.orig/net/core/dev.c +++ netem/net/core/dev.c @@ -1702,9 +1702,12 @@ struct net_bridge_fdb_entry *(*br_fdb_ge unsigned char *addr); void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent); -static __inline__ int handle_bridge(struct sk_buff **pskb, - struct packet_type **pt_prev, int *ret, - struct net_device *orig_dev) +/* + * If bridge module is loaded call bridging hook. + * when it returns 1, this is a non-local packet + */ +int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb) __read_mostly; +static int handle_bridge(struct sk_buff **pskb) { struct net_bridge_port *port; @@ -1712,15 +1715,10 @@ static __inline__ int handle_bridge(stru (port = rcu_dereference((*pskb)-dev-br_port)) == NULL) return 0; - if (*pt_prev) { - *ret = deliver_skb(*pskb, *pt_prev, orig_dev); - *pt_prev = NULL; - } - return br_handle_frame_hook(port, pskb); } #else -#define handle_bridge(skb, pt_prev, ret, orig_dev) (0) +#define handle_bridge(pskb)0 #endif #ifdef CONFIG_NET_CLS_ACT @@ -1799,6 +1797,9 @@ int netif_receive_skb(struct sk_buff *sk } #endif + if (handle_bridge(skb)) + goto out; + list_for_each_entry_rcu(ptype, ptype_all, list) { if (!ptype-dev || ptype-dev == skb-dev) { if (pt_prev) @@ -1826,9 +1827,6 @@ int netif_receive_skb(struct sk_buff *sk ncls: #endif - if (handle_bridge(skb, pt_prev, ret, orig_dev)) - goto out; - type = skb-protocol; list_for_each_entry_rcu(ptype, ptype_base[ntohs(type)15], list) { if (ptype-type == type -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 1/2] bridge: avoid ptype_all packet handling
On Fri, 02 Mar 2007 13:26:38 -0800 (PST) David Miller [EMAIL PROTECTED] wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 28 Feb 2007 17:18:46 -0800 I was measuring bridging/routing performance and noticed this. The current code runs the all packet type handlers before calling the bridge hook. If an application (like some DHCP clients) is using AF_PACKET, this means that each received packet gets run through the Berkeley Packet Filter code in sk_run_filter (slow). I know we closed this out by saying that even though performance sucks, we can't really apply this without breaking things. wrong. What would be broken is if the DHCP client isn't specifying a device ifindex when it binds the AF_PACKET socket. That would be an easy way to fix this performance problem at the application level. The DHCP client should only care about a particular interface's traffic, the one it wants to listen on. My assumption is that when bridging, the normal stack path only has to receive those packets that it would receive if it was not doing bridging. A better version of the patch is: == The current code runs the all packet type handlers before calling the bridge hook. If an application (like some DHCP clients) is using AF_PACKET, this means that each received packet gets run through the Berkeley Packet Filter code in sk_run_filter. This is significant overhead. By moving the bridging hook to run first, the packets flowing through the bridge get filtered out there first. This results in a 14% improvement in performance. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- net/core/dev.c | 24 1 file changed, 12 insertions(+), 12 deletions(-) --- netem.orig/net/core/dev.c +++ netem/net/core/dev.c @@ -1702,9 +1702,12 @@ struct net_bridge_fdb_entry *(*br_fdb_ge unsigned char *addr); void (*br_fdb_put_hook)(struct net_bridge_fdb_entry *ent); -static __inline__ int handle_bridge(struct sk_buff **pskb, - struct packet_type **pt_prev, int *ret, - struct net_device *orig_dev) +/* + * If bridge module is loaded call bridging hook. + * when it returns 1, this is a non-local packet + */ +int (*br_handle_frame_hook)(struct net_bridge_port *p, struct sk_buff **pskb) __read_mostly; +static int handle_bridge(struct sk_buff **pskb) { struct net_bridge_port *port; @@ -1712,15 +1715,10 @@ static __inline__ int handle_bridge(stru (port = rcu_dereference((*pskb)-dev-br_port)) == NULL) return 0; - if (*pt_prev) { - *ret = deliver_skb(*pskb, *pt_prev, orig_dev); - *pt_prev = NULL; - } - return br_handle_frame_hook(port, pskb); } #else -#define handle_bridge(skb, pt_prev, ret, orig_dev) (0) +#define handle_bridge(pskb)0 #endif #ifdef CONFIG_NET_CLS_ACT @@ -1799,6 +1797,9 @@ int netif_receive_skb(struct sk_buff *sk } #endif + if (handle_bridge(skb)) + goto out; + list_for_each_entry_rcu(ptype, ptype_all, list) { if (!ptype-dev || ptype-dev == skb-dev) { if (pt_prev) @@ -1826,9 +1827,6 @@ int netif_receive_skb(struct sk_buff *sk ncls: #endif - if (handle_bridge(skb, pt_prev, ret, orig_dev)) - goto out; - type = skb-protocol; list_for_each_entry_rcu(ptype, ptype_base[ntohs(type)15], list) { if (ptype-type == type -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/2] bridge: avoid ptype_all packet handling
From: Stephen Hemminger [EMAIL PROTECTED] Date: Fri, 2 Mar 2007 14:09:29 -0800 On Fri, 02 Mar 2007 13:26:38 -0800 (PST) David Miller [EMAIL PROTECTED] wrote: From: Stephen Hemminger [EMAIL PROTECTED] Date: Wed, 28 Feb 2007 17:18:46 -0800 I was measuring bridging/routing performance and noticed this. The current code runs the all packet type handlers before calling the bridge hook. If an application (like some DHCP clients) is using AF_PACKET, this means that each received packet gets run through the Berkeley Packet Filter code in sk_run_filter (slow). I know we closed this out by saying that even though performance sucks, we can't really apply this without breaking things. wrong. I disagee, and your patch is still broken because as Jamal pointed out (which you didn't address in any way) this breaks traffic classification of bridged traffic as well. If someone wants their network tap to hear all traffic, they do mean all traffic, and this includes potentially seeing it multiple times when things like bridging and virtual devices decap incoming frames. We can't apply this. Back to a workable solution, why doesn't DHCP specify a specific device? It would fix this performance problem completely, at the application level. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/2] bridge: avoid ptype_all packet handling
From: David Miller [EMAIL PROTECTED] Date: Fri, 02 Mar 2007 14:48:18 -0800 (PST) Back to a workable solution, why doesn't DHCP specify a specific device? It would fix this performance problem completely, at the application level. Since nobody seems to be able to be bothered to actually look at what DHCP clients are doing, I actually did and it's no surprise that broken stuff is happening here. Here is how dhcp3-3.0.3 binds AF_PACKET sockets, in common/lpf.c: struct sockaddr sa; ... /* Bind to the interface name */ memset (sa, 0, sizeof sa); sa.sa_family = AF_PACKET; strncpy (sa.sa_data, (const char *)info - ifp, sizeof sa.sa_data); if (bind (sock, sa, sizeof sa)) { if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT || errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT || errno == EAFNOSUPPORT || errno == EINVAL) { log_error (socket: %m - make sure); log_error (CONFIG_PACKET (Packet socket) %s, and CONFIG_FILTER); log_error ((Socket Filtering) are enabled %s, in your kernel); log_fatal (configuration!); } log_fatal (Bind socket to interface: %m); } So it puts a string into the sockaddr data, and there is no mention of sockaddr_ll, which is what is supposed to be provided as the socket address here, in the entire DHCP tree. I'm tempted to say I must be missing something here, since I can't see how this could possible work at all. The string passed in should be interpreted as the ifindex value, and thus trigger a -ENODEV return from AF_PACKET's bind() implementation. My suspicions are confirmed by the patch here: http://kernel.org/pub/linux/kernel/people/chuyee/patches/dhcp-3.0/dhcp-3.0-linux_cooked_packet.patch Really, this bogus bind() explains everything. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/2] bridge: avoid ptype_all packet handling
On Fri, 02 Mar 2007 15:18:03 -0800 (PST) David Miller [EMAIL PROTECTED] wrote: From: David Miller [EMAIL PROTECTED] Date: Fri, 02 Mar 2007 14:48:18 -0800 (PST) Back to a workable solution, why doesn't DHCP specify a specific device? It would fix this performance problem completely, at the application level. Since nobody seems to be able to be bothered to actually look at what DHCP clients are doing, I actually did and it's no surprise that broken stuff is happening here. I was in middle of checking that.. Here is how dhcp3-3.0.3 binds AF_PACKET sockets, in common/lpf.c: struct sockaddr sa; ... /* Bind to the interface name */ memset (sa, 0, sizeof sa); sa.sa_family = AF_PACKET; strncpy (sa.sa_data, (const char *)info - ifp, sizeof sa.sa_data); if (bind (sock, sa, sizeof sa)) { if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT || errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT || errno == EAFNOSUPPORT || errno == EINVAL) { log_error (socket: %m - make sure); log_error (CONFIG_PACKET (Packet socket) %s, and CONFIG_FILTER); log_error ((Socket Filtering) are enabled %s, in your kernel); log_fatal (configuration!); } log_fatal (Bind socket to interface: %m); } So it puts a string into the sockaddr data, and there is no mention of sockaddr_ll, which is what is supposed to be provided as the socket address here, in the entire DHCP tree. I'm tempted to say I must be missing something here, since I can't see how this could possible work at all. The string passed in should be interpreted as the ifindex value, and thus trigger a -ENODEV return from AF_PACKET's bind() implementation. My suspicions are confirmed by the patch here: http://kernel.org/pub/linux/kernel/people/chuyee/patches/dhcp-3.0/dhcp-3.0-linux_cooked_packet.patch Can you get FC fixed? Really, this bogus bind() explains everything. Should we add a warning to kernel log, to make distro's fix it? It might make sense to add a per-device ptype_dev list in network device? -- Stephen Hemminger [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/2] bridge: avoid ptype_all packet handling
From: Stephen Hemminger [EMAIL PROTECTED] Date: Fri, 2 Mar 2007 15:34:14 -0800 Can you get FC fixed? I am not the DHCP package maintainer. :-) I'm up to my earfulls already dealing with people trying to slug broken patches into the kernel networking that paper around application bugs. ;) Should we add a warning to kernel log, to make distro's fix it? Unfortunately it looks like a properly formed sockaddr_ll, the ifindex is in fact zero, so there is nothing we can do to warn about this case. The sockaddr_ll sits after the first sockaddr string in the ifreq, and the rest remains initialized to zeros, thus the bind() succeeds. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/2] bridge: avoid ptype_all packet handling
David Miller [EMAIL PROTECTED] wrote: I'm tempted to say I must be missing something here, since I can't see how this could possible work at all. The string passed in should be interpreted as the ifindex value, and thus trigger a -ENODEV return from AF_PACKET's bind() implementation. This is using packet_bind_spkt which uses a name instead of ifindex. As you may recall, I've made a patch to convert it to use the new (actually it's not-so-new anymore) AF_PACKET interface. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 1/2] bridge: avoid ptype_all packet handling
From: Herbert Xu [EMAIL PROTECTED] Date: Sat, 03 Mar 2007 16:38:45 +1100 This is using packet_bind_spkt which uses a name instead of ifindex. So it should be just fine, it should be binding to a specific device (by name instead of ifindex) and therefore it should only trigger the pt_all hook when the packet arrives on that specific device. As you may recall, I've made a patch to convert it to use the new (actually it's not-so-new anymore) AF_PACKET interface. That's right. So it's still a mystery why dhcp is causing bridge devices to trigger the network tap paths on Stephen's machine. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html