Re: [PATCH RESEND] 8021q:Add missing statements to switch case in the function, register_vlan_device

2016-03-07 Thread Patrick McHardy
On 06.03, Nicholas Krause wrote:
> This adds the proper snprintf and break statement for formatting
> the vlan_net structure pointer, vn's name using snprintf for if
> the switch case, VLAN_NAME_TYPE_PLUS_VID occurs for this particular
> switch statement inside the function, register_vlan_device.
> 
> Signed-off-by: Nicholas Krause 
> ---
>  net/8021q/vlan.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
> index 59555f0..91ef50e 100644
> --- a/net/8021q/vlan.c
> +++ b/net/8021q/vlan.c
> @@ -246,8 +246,11 @@ static int register_vlan_device(struct net_device 
> *real_dev, u16 vlan_id)
>   /* Put our vlan.VID in the name.
>* Name will look like:  vlan0005
>*/
> + snprintf(name, IFNAMSIZ, "%s%i", real_dev->name, vlan_id);
> + break;

As the comment indicates, the fall-through is intended like this.

>   default:
>   snprintf(name, IFNAMSIZ, "vlan%.4i", vlan_id);
> + break;
>   }
>  
>   new_dev = alloc_netdev(sizeof(struct vlan_dev_priv), name,
> -- 
> 2.1.4
> 


Re: [PATCH 3/4] netfilter: ipv4: use preferred kernel types

2016-01-30 Thread Patrick McHardy
On 30.01, Lucas Tanure wrote:
> As suggested by checkpatch.pl:
> CHECK: Prefer kernel type 'uX' over 'uintX_t'

You might have noticed we have literally hundreds of them spread over 100
files in the netfilter code. We'll gradually change them when the code is
touched anyways.

>  net/ipv4/netfilter/ip_tables.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)


Re: [PATCH nf 2/2] netfilter: nf_tables: add clone interface to expression operations

2015-11-10 Thread Patrick McHardy
On 10.11, Pablo Neira Ayuso wrote:
> On Tue, Nov 10, 2015 at 06:58:05PM +0000, Patrick McHardy wrote:
> > On 10.11, Pablo Neira Ayuso wrote:
> > > On Tue, Nov 10, 2015 at 06:30:34PM +0000, Patrick McHardy wrote:
> > > > >   __module_get(src->ops->type->owner);
> > > > > - memcpy(dst, src, src->ops->size);
> > > > > + if (src->ops->clone) {
> > > > > + memcpy(dst, src, sizeof(*src));
> > > > 
> > > > Why copy if we clone? The function should do a full initialization if 
> > > > it is
> > > > present I would say.
> > > 
> > > This is not copying the variable length data area of the expression,
> > > just the expression head.
> > 
> > Ah right. But that is only ->ops. We can set this directly, should generate
> > better code and be easier to understand.
> 
> I left the memcpy just to avoid that we forget in case we ever get
> more data there (unlikely). So I'll set the pointer instead.
> 
> If no further objections, will make those two changes locally and will
> push this upstream.

No further objections :)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nf 2/2] netfilter: nf_tables: add clone interface to expression operations

2015-11-10 Thread Patrick McHardy
On 10.11, Pablo Neira Ayuso wrote:
> On Tue, Nov 10, 2015 at 06:30:34PM +0000, Patrick McHardy wrote:
> > >   __module_get(src->ops->type->owner);
> > > - memcpy(dst, src, src->ops->size);
> > > + if (src->ops->clone) {
> > > + memcpy(dst, src, sizeof(*src));
> > 
> > Why copy if we clone? The function should do a full initialization if it is
> > present I would say.
> 
> This is not copying the variable length data area of the expression,
> just the expression head.

Ah right. But that is only ->ops. We can set this directly, should generate
better code and be easier to understand.

> 
> > > + err = src->ops->clone(dst, src);
> > > + if (err < 0)
> > > + return err;
> > > + } else {
> > > + memcpy(dst, src, src->ops->size);
> > > + }
> > > + return 0;
> > >  }
> > >  
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nf 2/2] netfilter: nf_tables: add clone interface to expression operations

2015-11-10 Thread Patrick McHardy
On 10.11, Pablo Neira Ayuso wrote:
> With the conversion of the counter expressions to make it percpu, we
> need to clone the percpu memory area, otherwise we crash when using
> counters from flow tables.
> 
> Signed-off-by: Pablo Neira Ayuso 
> ---
>  include/net/netfilter/nf_tables.h | 16 +++--
>  net/netfilter/nft_counter.c   | 49 
> ---
>  net/netfilter/nft_dynset.c|  5 ++--
>  3 files changed, 58 insertions(+), 12 deletions(-)
> 
> diff --git a/include/net/netfilter/nf_tables.h 
> b/include/net/netfilter/nf_tables.h
> index c9149cc..c186457 100644
> --- a/include/net/netfilter/nf_tables.h
> +++ b/include/net/netfilter/nf_tables.h
> @@ -630,6 +630,8 @@ struct nft_expr_ops {
>   int (*validate)(const struct nft_ctx *ctx,
>   const struct nft_expr *expr,
>   const struct nft_data 
> **data);
> + int (*clone)(struct nft_expr *dst,
> +  const struct nft_expr *src);

The functions and data needed during runtime are deliberately kept together
at the beginning of the structure to avoid having to read the entire thing.
So I'd say this shoud go after ->eval().

> @@ -660,10 +662,20 @@ void nft_expr_destroy(const struct nft_ctx *ctx, struct 
> nft_expr *expr);
>  int nft_expr_dump(struct sk_buff *skb, unsigned int attr,
> const struct nft_expr *expr);
>  
> -static inline void nft_expr_clone(struct nft_expr *dst, struct nft_expr *src)
> +static inline int nft_expr_clone(struct nft_expr *dst, struct nft_expr *src)
>  {
> + int err;
> +
>   __module_get(src->ops->type->owner);
> - memcpy(dst, src, src->ops->size);
> + if (src->ops->clone) {
> + memcpy(dst, src, sizeof(*src));

Why copy if we clone? The function should do a full initialization if it is
present I would say.

> + err = src->ops->clone(dst, src);
> + if (err < 0)
> + return err;
> + } else {
> + memcpy(dst, src, src->ops->size);
> + }
> + return 0;
>  }
>  
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH net-next v2 7/8] openvswitch: Delay conntrack helper call for new connections.

2015-11-09 Thread Patrick McHardy
On 06.11, Jarno Rajahalme wrote:
> There is no need to help connections that are not confirmed, so we can
> delay helping new connections to the time when they are confirmed.
> This change is needed for NAT support, and having this as a separate
> patch will make the following NAT patch a bit easier to review.

For the first packet a helper receives the connection is always unconfirmed.
It makes no sense to confirm it if the helper drops the packet.

> Signed-off-by: Jarno Rajahalme 
> ---
>  net/openvswitch/conntrack.c | 20 +++-
>  1 file changed, 15 insertions(+), 5 deletions(-)
> 
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index 7aa38fa..ba44287 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -458,6 +458,7 @@ static bool skb_nfct_cached(struct net *net,
>  /* Pass 'skb' through conntrack in 'net', using zone configured in 'info', if
>   * not done already.  Update key with new CT state after passing the packet
>   * through conntrack.
> + * Note that invalid packets are accepted while the skb->nfct remains unset!
>   */
>  static int __ovs_ct_lookup(struct net *net, struct sw_flow_key *key,
>  const struct ovs_conntrack_info *info,
> @@ -468,7 +469,11 @@ static int __ovs_ct_lookup(struct net *net, struct 
> sw_flow_key *key,
>* actually run the packet through conntrack twice unless it's for a
>* different zone.
>*/
> - if (!skb_nfct_cached(net, key, info, skb)) {
> + bool cached = skb_nfct_cached(net, key, info, skb);
> + enum ip_conntrack_info ctinfo;
> + struct nf_conn *ct;
> +
> + if (!cached) {
>   struct nf_conn *tmpl = info->ct;
>   int err;
>  
> @@ -491,11 +496,16 @@ static int __ovs_ct_lookup(struct net *net, struct 
> sw_flow_key *key,
>   return -ENOENT;
>  
>   ovs_ct_update_key(skb, key, true);
> + }
>  
> - if (ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
> - WARN_ONCE(1, "helper rejected packet");
> - return -EINVAL;
> - }
> + /* Call the helper right after nf_conntrack_in() for confirmed
> +  * connections, but only when commiting for unconfirmed connections.
> +  */
> + ct = nf_ct_get(skb, &ctinfo);
> + if (ct && (nf_ct_is_confirmed(ct) ? !cached : info->commit)
> + && ovs_ct_helper(skb, info->family) != NF_ACCEPT) {
> + WARN_ONCE(1, "helper rejected packet");
> + return -EINVAL;
>   }
>  
>   return 0;
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH net-next v2 0/8] openvswitch: NAT support.

2015-11-09 Thread Patrick McHardy
On 06.11, Jarno Rajahalme wrote:
> This series adds NAT support to openvswitch kernel module.  A few
> changes are needed to the netfilter code to facilitate this (patches
> 1-3/8).  Patches 4-7 make the openvswitch kernel module ready for the
> patch 8 that adds the NAT support for calling into netfilter NAT code
> from the openvswitch conntrack action.

I'm missing some high level description, especially how it is invoked, how
it makes sure expectations of the NAT code about its invocation are met
(it is my understanding that OVS simply invokes this based on actions
specified by the user) and how it interacts with the remaining netfilter
features.

> Jarno Rajahalme (8):
>   netfilter: Remove IP_CT_NEW_REPLY definition.
>   netfilter: Factor out nf_ct_get_info().
>   netfilter: Allow calling into nat helper without skb_dst.
>   openvswitch: Update the CT state key only after nf_conntrack_in().
>   openvswitch: Find existing conntrack entry after upcall.
>   openvswitch: Handle NF_REPEAT in conntrack action.
>   openvswitch: Delay conntrack helper call for new connections.
>   openvswitch: Interface with NAT.
> 
>  include/net/netfilter/nf_conntrack.h   |  15 +
>  include/uapi/linux/netfilter/nf_conntrack_common.h |  12 +-
>  include/uapi/linux/openvswitch.h   |  47 ++
>  net/ipv4/netfilter/nf_nat_l3proto_ipv4.c   |  29 +-
>  net/ipv6/netfilter/nf_nat_l3proto_ipv6.c   |  29 +-
>  net/netfilter/nf_conntrack_core.c  |  22 +-
>  net/openvswitch/conntrack.c| 632 
> +++--
>  net/openvswitch/conntrack.h|   3 +-
>  8 files changed, 686 insertions(+), 103 deletions(-)
> 
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] netfilter: ip6t_SYNPROXY: fix sending window update to client

2015-08-10 Thread Patrick McHardy
On 09.08, Phil Sutter wrote:
> This is the identical fix as "netfilter: ipt_SYNPROXY: fix sending
> window update to client" but for the IPv6 variant which obviously
> suffers from the same issue.

Looks fine to me.

Acked-by: Patrick McHardy 

(Also for the IPv4 version, which for some reason I didn't receive)

> 
> Signed-off-by: Phil Sutter 
> ---
> Changes since v1:
> - Adjust for v2 changes of first patch.
> ---
>  net/ipv6/netfilter/ip6t_SYNPROXY.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/netfilter/ip6t_SYNPROXY.c 
> b/net/ipv6/netfilter/ip6t_SYNPROXY.c
> index bcebc24..ebbb754 100644
> --- a/net/ipv6/netfilter/ip6t_SYNPROXY.c
> +++ b/net/ipv6/netfilter/ip6t_SYNPROXY.c
> @@ -243,7 +243,8 @@ synproxy_send_client_ack(const struct synproxy_net *snet,
>  
>   synproxy_build_options(nth, opts);
>  
> - synproxy_send_tcp(snet, skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
> + synproxy_send_tcp(snet, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
> +   niph, nth, tcp_hdr_size);
>  }
>  
>  static bool
> -- 
> 2.1.2
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/2] netfilter: ip6t_SYNPROXY: fix NULL pointer dereference

2015-08-10 Thread Patrick McHardy
On 09.08, Phil Sutter wrote:
> This happens when networking namespaces are enabled.
> 
> Suggested-by: Patrick McHardy 
> Signed-off-by: Phil Sutter 

Acked-by: Patrick McHardy 

> ---
> Changes since v1:
> - Moved snet param to first place.
> - Constify snet param.
> ---
>  net/ipv6/netfilter/ip6t_SYNPROXY.c | 18 ++
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/net/ipv6/netfilter/ip6t_SYNPROXY.c 
> b/net/ipv6/netfilter/ip6t_SYNPROXY.c
> index 6edb7b1..bcebc24 100644
> --- a/net/ipv6/netfilter/ip6t_SYNPROXY.c
> +++ b/net/ipv6/netfilter/ip6t_SYNPROXY.c
> @@ -37,12 +37,13 @@ synproxy_build_ip(struct sk_buff *skb, const struct 
> in6_addr *saddr,
>  }
>  
>  static void
> -synproxy_send_tcp(const struct sk_buff *skb, struct sk_buff *nskb,
> +synproxy_send_tcp(const struct synproxy_net *snet,
> +   const struct sk_buff *skb, struct sk_buff *nskb,
> struct nf_conntrack *nfct, enum ip_conntrack_info ctinfo,
> struct ipv6hdr *niph, struct tcphdr *nth,
> unsigned int tcp_hdr_size)
>  {
> - struct net *net = nf_ct_net((struct nf_conn *)nfct);
> + struct net *net = nf_ct_net(snet->tmpl);
>   struct dst_entry *dst;
>   struct flowi6 fl6;
>  
> @@ -83,7 +84,8 @@ free_nskb:
>  }
>  
>  static void
> -synproxy_send_client_synack(const struct sk_buff *skb, const struct tcphdr 
> *th,
> +synproxy_send_client_synack(const struct synproxy_net *snet,
> + const struct sk_buff *skb, const struct tcphdr *th,
>   const struct synproxy_options *opts)
>  {
>   struct sk_buff *nskb;
> @@ -119,7 +121,7 @@ synproxy_send_client_synack(const struct sk_buff *skb, 
> const struct tcphdr *th,
>  
>   synproxy_build_options(nth, opts);
>  
> - synproxy_send_tcp(skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
> + synproxy_send_tcp(snet, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
> niph, nth, tcp_hdr_size);
>  }
>  
> @@ -163,7 +165,7 @@ synproxy_send_server_syn(const struct synproxy_net *snet,
>  
>   synproxy_build_options(nth, opts);
>  
> - synproxy_send_tcp(skb, nskb, &snet->tmpl->ct_general, IP_CT_NEW,
> + synproxy_send_tcp(snet, skb, nskb, &snet->tmpl->ct_general, IP_CT_NEW,
> niph, nth, tcp_hdr_size);
>  }
>  
> @@ -203,7 +205,7 @@ synproxy_send_server_ack(const struct synproxy_net *snet,
>  
>   synproxy_build_options(nth, opts);
>  
> - synproxy_send_tcp(skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
> + synproxy_send_tcp(snet, skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
>  }
>  
>  static void
> @@ -241,7 +243,7 @@ synproxy_send_client_ack(const struct synproxy_net *snet,
>  
>   synproxy_build_options(nth, opts);
>  
> - synproxy_send_tcp(skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
> + synproxy_send_tcp(snet, skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
>  }
>  
>  static bool
> @@ -301,7 +303,7 @@ synproxy_tg6(struct sk_buff *skb, const struct 
> xt_action_param *par)
> XT_SYNPROXY_OPT_SACK_PERM |
> XT_SYNPROXY_OPT_ECN);
>  
> - synproxy_send_client_synack(skb, th, &opts);
> + synproxy_send_client_synack(snet, skb, th, &opts);
>   return NF_DROP;
>  
>   } else if (th->ack && !(th->fin || th->rst || th->syn)) {
> -- 
> 2.1.2
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] netfilter: ip6t_SYNPROXY: fix NULL pointer dereference

2015-08-08 Thread Patrick McHardy
On 06.08, Phil Sutter wrote:
> This happens when networking namespaces are enabled.

Thanks, just one minor request:

>  synproxy_send_tcp(const struct sk_buff *skb, struct sk_buff *nskb,
> struct nf_conntrack *nfct, enum ip_conntrack_info ctinfo,
> struct ipv6hdr *niph, struct tcphdr *nth,
> -   unsigned int tcp_hdr_size)
> +   unsigned int tcp_hdr_size, struct synproxy_net *snet)

Logically the synproxy_net pointer should come before all other arguments
since its the container for a lot of the following arguments.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ip6t_SYNPROXY crashes kernel

2015-08-01 Thread Patrick McHardy
On 28.07, Phil Sutter wrote:
> Hi,
> 
> When synproxy_send_server_ack() calls synproxy_send_tcp(), it passes
> NULL as third parameter (struct nf_conntrack *nfct). And the first thing
> synproxy_send_tcp() does, is dereference it:
> 
> | struct net *net = nf_ct_net((struct nf_conn *)nfct);
> 
> I could not find a commit leading to this breakage in the commit log,
> which makes me doubt ip6t_SYNPROXY has ever worked at all.
> 
> If you need one, I have a reproducer at hand. (Though I would want to
> strip it down a bit first.) Just let me know.

Thanks, looks like I never tested this with netns enabled. Would you
care to provide a patch? An easy fix seems to be to pass the synproxy_net
struct to synproxy_send_tcp() and use nf_ct_net(snet->tmpl) instead.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] COLO Proxy Module

2015-06-30 Thread Patrick McHardy
On 30.06, Li Zhijian wrote:
> |ping...
> 
> and i have another question:
> can i add a new |||nf_ct_ext_id simply without touching the exiting kernel
> code?|

No, the kernel needs to know the highest extension ID in order to
allocate space for the offsets.

> in order to support COLO-Proxy, i need a extra nf_ct_ext_id called
> "NF_CT_EXT_COLO"
> to store some message of COLO related. This message can help COLO-Proxy to
> buffer packet and compare packet for each connection.
> 
> Thanks
> Li Zhijian

Cheers,
Patrick
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net] netfilter: nf_qeueue: Drop queue entries on nf_unregister_hook

2015-06-20 Thread Patrick McHardy
On 20.06, Eric W. Biederman wrote:
> Patrick McHardy  writes:
> 
> > On 20.06, Pablo Neira Ayuso wrote:
> >> On Fri, Jun 19, 2015 at 02:03:39PM -0500, Eric W. Biederman wrote:
> >> > 
> >> > Add code to nf_unregister_hook to flush the nf_queue when a hook is
> >> > unregistered.  This guarantees that the pointer that the nf_queue code
> >> > retains into the nf_hook list will remain valid while a packet is
> >> > queued.
> >> 
> >> I think the real problem is that struct nf_queue_entry holds a pointer
> >> to struct nf_hook_ops, which will be gone after removal. So you
> >> uncovered a long standing problem that will amplify by when pernet
> >> hooks are in place.
> >> 
> >> Regarding the pointer to nf_hook_list, now that new netdevice variant
> >> doesn't support nf_queue yet, so that nf_hook_list will be always
> >> valid since it will point to the global nf_hooks in the core.
> >
> > I think Eric's patch is the right thing to do. I'm not sure I get
> > your netdev comment, but we certainly do want to drop packets once
> > a hook is gone.
> >
> >> > +{
> >> > +const struct nf_queue_handler *qh;
> >> > +struct net *net;
> >> > +
> >> > +rtnl_lock();
> >> 
> >> Why rtnl_lock() here?
> >
> > for_each_net(). Would actually be nice to have a variant that doesn't
> > need the rtnl since it makes locking order analysis a lot harder.
> 
> Someone added a for_each_net_rcu.  But right now I am not at all certain
> I trust an rcu variant not to miss something, in a weird corner case.
> When missing something translates to an unprivileged user triggerable
> kernel oops I am not ready to play games.
> 
> As for the lock analysis.  Except for nf_tables nf_unregister_hook is
> called by module removal routines where rtnl_lock() is safe.
> 
> With nftables we seem to do everything under some version of the
> nfnl_lock.  Does the nfnl_lock have any problems with taking the
> rtnl_lock to nest underneath it?

No, its fine, we have almost none interactions except for network
namespaces and device lookups.

Main reason why I'd prefer a non-RTNL version is because your
callbacks introduce bigger chunks of code under the RTNL, so
it might complicate things in the future. But your reasoning is
sound and for now this is perfectly fine.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in


Re: [PATCH net] netfilter: nf_qeueue: Drop queue entries on nf_unregister_hook

2015-06-20 Thread Patrick McHardy
On 20.06, Pablo Neira Ayuso wrote:
> On Fri, Jun 19, 2015 at 02:03:39PM -0500, Eric W. Biederman wrote:
> > 
> > Add code to nf_unregister_hook to flush the nf_queue when a hook is
> > unregistered.  This guarantees that the pointer that the nf_queue code
> > retains into the nf_hook list will remain valid while a packet is
> > queued.
> 
> I think the real problem is that struct nf_queue_entry holds a pointer
> to struct nf_hook_ops, which will be gone after removal. So you
> uncovered a long standing problem that will amplify by when pernet
> hooks are in place.
> 
> Regarding the pointer to nf_hook_list, now that new netdevice variant
> doesn't support nf_queue yet, so that nf_hook_list will be always
> valid since it will point to the global nf_hooks in the core.

I think Eric's patch is the right thing to do. I'm not sure I get
your netdev comment, but we certainly do want to drop packets once
a hook is gone.

> > +{
> > +   const struct nf_queue_handler *qh;
> > +   struct net *net;
> > +
> > +   rtnl_lock();
> 
> Why rtnl_lock() here?

for_each_net(). Would actually be nice to have a variant that doesn't
need the rtnl since it makes locking order analysis a lot harder.

> > +   rcu_read_lock();
> > +   qh = rcu_dereference(queue_handler);
> > +   if (qh) {
> > +   for_each_net(net) {
> > +   qh->nf_hook_drop(net, ops);
> > +   }
> > +   }
> > +   rcu_read_unlock();
> > +   rtnl_unlock();
--
To unsubscribe from this list: send the line "unsubscribe netdev" in


Re: [PATCH net-next 43/43] netfilter: Skip unnecessary calls to synchronize_net

2015-06-17 Thread Patrick McHardy
On 17.06, Eric W. Biederman wrote:
> From: Eric W Biederman 
> 
> Signed-off-by: "Eric W. Biederman" 
> ---
>  net/netfilter/core.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> index 95456c09cf69..1b4eadc9c030 100644
> --- a/net/netfilter/core.c
> +++ b/net/netfilter/core.c
> @@ -134,7 +134,9 @@ void nf_unregister_hook(struct net *net, const struct 
> nf_hook_ops *reg)
>  #ifdef HAVE_JUMP_LABEL
>   static_key_slow_dec(&nf_hooks_needed[reg->pf][reg->hooknum]);
>  #endif
> - synchronize_net();
> + /* Don't wait if there are no packets in flight */
> + if (net->loopback_dev)
> + synchronize_net();

I don't get this, could you please explain why there wouldn't be any packets
in flight if there is no loopback_dev?

>   kfree(elem);
>  }
>  EXPORT_SYMBOL(nf_unregister_hook);
> -- 
> 2.2.1
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH nft] src: add netdev family support

2015-05-26 Thread Patrick McHardy
On 25.05, Pablo Neira Ayuso wrote:
> diff --git a/include/rule.h b/include/rule.h
> index 97959f7..06ec2ff 100644
> --- a/include/rule.h
> +++ b/include/rule.h
> @@ -72,6 +72,7 @@ enum table_flags {
>   *
>   * @list:list node
>   * @handle:  table handle
> + * @dev: network device name (only for netdev family)
>   * @location:location the table was defined at
>   * @chains:  chains contained in the table
>   * @sets:sets contained in the table
> @@ -80,6 +81,7 @@ enum table_flags {
>  struct table {
>   struct list_headlist;
>   struct handle   handle;
> + const char  *dev;

I think this logically belongs into struct handle itself.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] net: move qdisc ingress filtering on top of netfilter ingress hooks

2015-04-29 Thread Patrick McHardy
On 30.04, Daniel Borkmann wrote:
> On 04/30/2015 02:37 AM, Patrick McHardy wrote:
> >On 30.04, Pablo Neira Ayuso wrote:
> >>On the bugfix front, the illegal mangling of shared skb from actions
> >>like stateless nat and bpf look also important to be addressed to me.
> >>David already suggested to propagate some state object that keeps a
> >>pointer to the skb that is passed to the action. Thus, the action can
> >>clone it and get the skb back to the ingress path. I started a
> >>patchset to do so here, it's a bit large since it requires quite a lot
> >>of function signature adjustment.
> >
> >Jumping in on this point - the fact that roughly 2/3 of TC actions will
> >simply BUG under not unlikely circumstances when used in ingress (I went
> >through them one by one with Pablo a week ago) is also telling. Nobody
> >seems to be using that. All packet mangling actions will BUG while any
> >tap is active. Its nothing easily fixed, but apparently nobody has cared
> >in ten years. ipt is trivial to crash differently, connmark is as well.
> >
> >So I'm wondering what are we actually arguing about here. Whether we are
> >affecting the performance how fast TC will crash? We *do* actually care
> >about these thing, in TC apparently nobody has for the past ten years.
> 
> Totally agree with you that the situation is quite a mess. From tc ingress/
> egress side, at least my use case is to have an as minimal as possible entry
> point for cls_bpf/act_bpf, which is what we were working on recently. That
> is rather ``fresh'' compared to the remaining history of cls/act in tc.

It's more than a mess. Leaving aside the fully broken code at ingress,
just look at the TC action APIs. Its "a failed state". TC is as well, but
on egress only on the userspace side - Thomas's presentation 'TC - -EWTF'
IIRC put it quite well. Our long term goal is to fix both, but ingress is
actually simply, for any realistic case (though my experience is limited),
meaning you have more than a single classifier, some structure in your rules
and more than a single CPU, we'll do better right now.

But Alexey's point is well taken. If we can't manange to add our hooks
without impacting existing use cases, we'll try better until we can.
The single hook case seems to be possible to optimize at the benefit of
all the layers quite easily, and for people using both, we'll try to
compete on a different level.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] net: move qdisc ingress filtering on top of netfilter ingress hooks

2015-04-29 Thread Patrick McHardy
On 30.04, Daniel Borkmann wrote:
> >
> >I can also see there were also intentions to support userspace
> >queueing at some point since TC_ACT_QUEUED has been there since the
> >beginning.  That should be possible at some point using this
> >infrastructure (once there are no further concerns on the
> >netif_receive_core_finish() patch as soon as gcc 4.9 and follow up
> >versions keep inlining this new function).
> 
> Wrt the other mail, just thinking out loud ... do you see a longer-term
> possibility of further generalizing the gen hooks infrastructure, so that
> actually classifier from tc could attach (disregarding the nf_* naming
> scheme for now) ...
> 
> `-> nf_hook_slow()
>  `-> [for each entry in hook list]   <-- here as an entry
>   `-> nf_iterate()
>`-> (*elemp)->hook()
> 
> ... as well?

Jumping in there since I'm probably the one thinking the TC ingress
abstraction is wrong the strongest - yes, it's an interesting idea.
Unlike egress qdiscs, ingress only has a single classifier chain
anyways, so there is no qdisc internal classification structure to
be observed. It should be possible to skip the ingress invocation
for classification purposes completely and only use it to expose
it to userspace for management purposes, while invoking the chain
directly.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/6] net: move qdisc ingress filtering on top of netfilter ingress hooks

2015-04-29 Thread Patrick McHardy
On 29.04, Cong Wang wrote:
> On Wed, Apr 29, 2015 at 11:53 AM, Pablo Neira Ayuso  
> wrote:
> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index 2274e72..23b57da 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -312,6 +312,7 @@ config NET_SCH_PIE
> >  config NET_SCH_INGRESS
> > tristate "Ingress Qdisc"
> > depends on NET_CLS_ACT
> > +   select NETFILTER_INGRESS
> > ---help---
> >   Say Y here if you want to use classifiers for incoming packets.
> >   If unsure, say Y.
> 
> 
> So now it impossible to compile ingress Qdics without netfilters...
> 
> (I know you moved them into net/core/, but still they are netfilter API's.)
> 
> Why do we have to mix different layers? IOW, why not just keep TC at L2
> and netfilters at L3 even just w.r.t. API?

Call it what you want, netfilter for most people is the entire infrastructure
built on top of the hooks. We can call it packet hooks if that makes you
happy. And the are infrastructure and not affiliated to any particular
layer.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting

2015-04-16 Thread Patrick McHardy
On 17.04, Hannes Frederic Sowa wrote:
> 
> 
> On Thu, Apr 16, 2015, at 22:56, Patrick McHardy wrote:
> > On 17.04, Herbert Xu wrote:
> > > On Thu, Apr 16, 2015 at 06:13:25PM +0200, Hannes Frederic Sowa wrote:
> > > >
> > > > So currently we have one fast path, that is: we are not fragmented, we
> > > > get out non-fragmented, none of this code is ever touched and no
> > > > problem.
> > > > 
> > > > We don't want to mak this more complex, but
> > > 
> > > You should read Dave's other email where he gives you an obvious
> > > solution.  If you have to modify the skb then you don't have to
> > > worry about the original fragments.
> > > 
> > > But if you only read the skb then don't linearise it completely
> > > and keep the original fragments.
> > 
> > Yes, I was just responding to that. We need an additional member
> > in the skb, but then this part is quite simple.
> 
> I guess you refer to the solution of using the fragmented list of
> packets to do checks. I don't think it will be that simple to peek into
> all the different skbs if the fragments are split in the header.
> Splitting fragments in the header is the 1x1 of firewall by-passing
> attacks, so we should certainly handle it correctly.

Also correct, that argument hasn't come up so far. Its actually not
too hard to handle, you need to keep track of which parts have been
inspected to make a classification decision and make sure those
parts are not overwritten. The TCP match does this statically, but
we certainly could do this dynamically. The question again is how
to keep geometry if things are actually overwritten, especially
partially. Apparently all this troubly is being seen as worthwhile.

> Logic off peeking into fragments and splitting fragments must be
> logically consent all the time, with all netfilter helpers in the tree.
> 
> I don't think the additional checks in fast-path are in any way
> justified.

I fully agree.

> 
> Bye,
> Hannes
> 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting

2015-04-16 Thread Patrick McHardy
On 17.04, Herbert Xu wrote:
> On Thu, Apr 16, 2015 at 06:13:25PM +0200, Hannes Frederic Sowa wrote:
> >
> > So currently we have one fast path, that is: we are not fragmented, we
> > get out non-fragmented, none of this code is ever touched and no
> > problem.
> > 
> > We don't want to mak this more complex, but
> 
> You should read Dave's other email where he gives you an obvious
> solution.  If you have to modify the skb then you don't have to
> worry about the original fragments.
> 
> But if you only read the skb then don't linearise it completely
> and keep the original fragments.

Yes, I was just responding to that. We need an additional member
in the skb, but then this part is quite simple.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting

2015-04-16 Thread Patrick McHardy
On 16.04, David Miller wrote:
> > Netfilter may change the contents of the packet, even change its size.
> > It is *really* hard to do this while keeping the original fragments
> > intact.
> 
> I keep hearing a lot of "it's hard" as the only reason we shouldn't do
> this properly, and that frankly sucks.  People aren't looking for a
> solution and to be honest it's quite tiring.
> 
> The common case is that the rules processed are simple, the size of
> the overall packet does _not_ change, and therefore the best thing
> to do is pass the entire thing as a unit with the frags in tact.
> 
> That's the fundamental fact.  It's also the fastest way to process
> these packets and avoids all of these stupid max frag garbage.
> 
> Only at the point where netfilter makes changes to the size of the
> packet does it take "ownership" and get to take on the responsibility
> of making sure the new resulting fragments are sane.
> 
> But only at that point.

Agreed, that part shouldn't be hard. We need to pass the defragmented
skb through the ruleset, meaning we need to pass it through the stack.
That's needed since the rules depend on this.

If we don't make changes, we can spit out the original fragments, but
for this we need to keep a reference to them from the skb. We still
need the max_frag_size thing, once a modification is made we drop the
frag list reference and just regulary refragment the modified skb
according to the limits.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting

2015-04-16 Thread Patrick McHardy
Am 16. April 2015 17:43:23 MESZ, schrieb David Miller :
>From: Hannes Frederic Sowa 
>Date: Thu, 16 Apr 2015 14:11:42 +0200
>
>> On Thu, Apr 16, 2015, at 07:29, Herbert Xu wrote:
>>> On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
>>> >
>>> > Netfilter may change the contents of the packet, even change its
>size.
>>> > It is *really* hard to do this while keeping the original
>fragments
>>> > intact.
>>> 
>>> Perhaps we should provide better helpers to facilitate this?
>>> 
>>> So instead of directly manipulating the content of the skb you
>>> would so so through helpers and the helpers can then try to do
>>> sensible things with the fragments.
>> 
>> When Florian and me started discussing how to solve this we also
>wanted
>> to be as transparent as possible. But looking at all possible
>> fragmentation scenarios, this seems to be too complicated.
>> 
>> Even imagine a fragment with overlapping offsets and some of the
>> fragments got duplicated. If we had to keep this in frag_list and now
>> netfilter has to change any of this contents, this will become a
>total
>> mess, like changing one port in multiple skbs at different offsets.
>> 
>> I doubt it is worth the effort.
>
>You guys keep talking about exceptional cases rather than what is
>unquestionable the common case, and the one worth handling in the
>fast paths.

True. But its not like we haven't tried. What I keep hearing is lots of well 
meaning opinions, and the fact is, I've looked into this countless times, you 
can find my first attempt when googling from 11 years ago, and the resolution 
was always - it's not worth it. It's not "too hard", I wouldn't mind. And I 
fail to see the problem with that. We're not talking about a functional defect, 
it's something people (me included) think is "not nice".

I will try to recollect the problems the different approaches have in detail. 
I'm pretty sure you will come to the same conclusion as I have.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] net: refactor __netif_receive_skb_core

2015-04-15 Thread Patrick McHardy
On 15.04, Jesper Dangaard Brouer wrote:
> > Out of curiosity, what is actually the performance impact on all
> > of this? We were just arguing on a different matter on two more
> > instructions in the fast-path, here it's refactoring the whole
> > function into several ones, I presume gcc won't inline it.
> 
> Pablo asked me to performance test this change.  Full test report below.
> 
> The performance effect (of this patch) depend on the Gcc compiler
> version.
> 
> Two tests:
>  1. IP-forwarding (unloaded netfilter modules)
>  2. Early drop in iptables "raw" table
> 
> With GCC 4.4.7, which does not inline the new functions
> (__netif_receive_skb_ingress and __netif_receive_skb_finish) the
> performance impact/regression is definitly measurable.
> 
> With GCC 4.4.7:
>  1. IP-forwarding: +25.18 ns (slower) (-27776 pps)
>  2. Early-drop   :  +7.55 ns (slower) (-66577 pps)
> 
> With GCC 4.9.1, the new functions gets inlined, thus the refactor
> splitup of __netif_receive_skb_core() is basically "cancled".
> Strangly there is a small improvement for forwarding, likely due to
> some lucky assember reordering that give less icache/fetch-misses.
> The early-drop improvement is below accuracy levels, can cannot be
> trusted.
> 
> With GCC 4.9.1:
>  1. IP-forwarding: -10.05ns (faster) (+17532 pps)
>  2. Early-drop   :  -1.54ns (faster) (+16216 pps) below accuracy levels
> 
> I don't know what to conclude, as the result depend on the compiler
> version... but these kind of change do affect performance, and should
> be tested/measured.

Thanks Jesper. This effect without inlinging was to be expected I guess.
The interesting question would be a patch that uses nf_hook() without okfn
callback, moves the ingress qdisc to register with the netfilter ingress
hook and moves the TTL tracking of ing_filter() to the ingress qdisc,
where it belongs.

My expectation would be that this would result in an overall performance
gain.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting

2015-04-15 Thread Patrick McHardy
On 16.04, Herbert Xu wrote:
> On Thu, Apr 16, 2015 at 06:24:00AM +0100, Patrick McHardy wrote:
> >
> > Netfilter may change the contents of the packet, even change its size.
> > It is *really* hard to do this while keeping the original fragments
> > intact.
> 
> Perhaps we should provide better helpers to facilitate this?
> 
> So instead of directly manipulating the content of the skb you
> would so so through helpers and the helpers can then try to do
> sensible things with the fragments.

Yeah, but its going to get pretty complicated. There can be multiple
size changes and modifications for every packet, so we would need to
keep track of the size modifications that have already occured to
map to the correct position in the frag_list. The modifications
would then have to be performed on both the reassembled skb since
they might again be matched against later on and on the frag_list,
potentially split over multiple skbs.

When enlarging the packet, I guess we would insert a new (small)
fragment to keep the geometry of the original fragments. In extreme
cases like H.323 NAT this might result in a huge amount of new very
small fragments. And it might still break the geometry when the
size difference isn't representable as a multiple of 8.

When reducing the skb size, we might have to choice but to change
the geometry of at least on fragment and would have to shift a lot
of data around.

This is very complicated, it we really want to do this, its a lot
easier to just keep note of the full original geometry and refragment
to those exact sizes while potentially adding or removing things
at the end.

My personal opinion is, why bother. The only thing that cares about
the fragment sizes is PMTUD, and that's what this patch is fixing in
a much simpler fashion.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH -next 0/3] net: cap size to original frag size when refragmenting

2015-04-15 Thread Patrick McHardy
On 16.04, Herbert Xu wrote:
> David Miller  wrote:
> > 
> > Because then there is no ambiguity at all, you preserve on output
> > exactly what you had on input.  The same geometry, the same
> > everything.  No special checks, no max frag len, none of this crap.
> > Those are all hacks trying to work around the _fundamental_ issue
> > which is that we potentially change the thing when we refrag.
> 
> Agreed.  Doing anything other than preserving the original geometry
> is simply wrong.
> 
> However, this doesn't mean that netfilter has to process each
> fragment.  What we could do is to preserve the original fragments
> in frag_list and then process the overall skb as a unit in netfilter.
> 
> On output we simply fragment according to the original frag_list.
> 
> The only thing to watch out for is to eliminate anything in the
> middle that tries to linearise the skb.

Netfilter may change the contents of the packet, even change its size.
It is *really* hard to do this while keeping the original fragments
intact.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[MACVLAN]: Update Kconfig to refer to iproute

2008-02-25 Thread Patrick McHardy
 [MACVLAN]: Update Kconfig to refer to iproute

Since the macvlan release I had at least 5 users asking how to configure
it since the old userspace tool doesn't work with the version in the
kernel. Add a pointer to the Kconfig help.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index f337800..f333c11 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -90,6 +90,11 @@ config MACVLAN
  This allows one to create virtual interfaces that map packets to
  or from specific MAC addresses to a particular interface.
 
+ Macvlan devices can be added using the "ip" command from the
+ iproute2 package starting with the iproute2-2.6.23 release:
+
+ "ip link add link  [ address MAC ] [ NAME ] type macvlan"
+
  To compile this driver as a module, choose M here: the module
  will be called macvlan.
 


Re: [PATCH] nf_conntrack: less hairy ifdefs around proc and sysctl

2008-02-25 Thread Patrick McHardy

Alexey Dobriyan wrote:

Patch splits creation of /proc/net/nf_conntrack, /proc/net/stat/nf_conntrack
and net.netfilter hierarchy into their own functions with dummy ones
if PROC_FS or SYSCTL is not set. Also, remove dead "ret = 0" write
while I'm at it.


Queued for 2.6.26, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-25 Thread Patrick McHardy

David Woodhouse wrote:

On Mon, 2008-02-25 at 13:20 +0100, Patrick McHardy wrote:

Right, I missed that. In that case the current headers should match
the kernel headers (with the compiler.h part removed).


They don't. When you run 'make headers_install' there are some missing.



We don't need all of them, but I'll do a proper resync. Thanks.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-25 Thread Patrick McHardy

David Woodhouse wrote:

On Mon, 2008-02-25 at 13:12 +0100, Patrick McHardy wrote:

Yes, the kernel headers need to be fixed as well to not include
linux/compiler.h outside of #ifdef __KERNEL__. I'll take care
of that.


No. When you run 'make headers_install' that's already taken care of.



Right, I missed that. In that case the current headers should match
the kernel headers (with the compiler.h part removed).



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-25 Thread Patrick McHardy

David Woodhouse wrote:

On Fri, 2008-02-22 at 16:44 +0100, Patrick McHardy wrote:

Pablo Neira Ayuso wrote:

Patrick McHardy wrote:

Yes, that was a bug in the lastest release. We need to
release a 1.4.1 version or something like that, but I'm
not too familiar with the release process, so I haven't
done this so far.

I can schedule one for this weekend, just send me an ACK.


That would be great. I think we had another issue in 1.4.0 with
some header files, but I can't remeber the details.


If you are going to include header files in the release (which makes a
certain amount of sense), it would be best if those are simply the
result of the kernel's 'make headers_install', without any manual
changes.



Yes, the kernel headers need to be fixed as well to not include
linux/compiler.h outside of #ifdef __KERNEL__. I'll take care
of that.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New sparse warning in net/mac80211/debugfs_sta.c

2008-02-25 Thread Patrick McHardy

David Miller wrote:

From: Patrick McHardy <[EMAIL PROTECTED]>
Date: Thu, 21 Feb 2008 19:00:03 +0100

  

And adds back the overhead of two completely unnecessary
function calls to the VLAN fastpath. How about just
stopping this idiocy and reverting the appropriate patches
to bring back MAC_FMT and use it where appropriate?



Agreed, I'll do that.
  


It would be good if Joe could go through the remaining print_mac users
and convert the remaining unintended function calls in fastpaths back
to MAC_FMT. Grepping for "start_xmit" in commit 0795af5729b shows that
at least 10 hard_start_xmit functions are affected and I expect that
some of the changes in the wireless code affect fastpaths as well.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-22 Thread Patrick McHardy

Pablo Neira Ayuso wrote:

Patrick McHardy wrote:

Pablo Neira Ayuso wrote:

Patrick McHardy wrote:

Yes, that was a bug in the lastest release. We need to
release a 1.4.1 version or something like that, but I'm
not too familiar with the release process, so I haven't
done this so far.

I can schedule one for this weekend, just send me an ACK.


That would be great. I think we had another issue in 1.4.0 with
some header files, but I can't remeber the details.

Jan, I recall we talked about this some time ago, do you remember?


Was it related with kernel 2.4.x compilation?



Yes, I just found the old mail, the error was:

In file included from include/linux/netfilter/nf_nat.h:4,
 from extensions/libipt_DNAT.c:9:
include/linux/netfilter/nf_conntrack_tuple.h:29: error: syntax error 
before "__be32"



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-22 Thread Patrick McHardy

Pablo Neira Ayuso wrote:

Patrick McHardy wrote:

Yes, that was a bug in the lastest release. We need to
release a 1.4.1 version or something like that, but I'm
not too familiar with the release process, so I haven't
done this so far.


I can schedule one for this weekend, just send me an ACK.



That would be great. I think we had another issue in 1.4.0 with
some header files, but I can't remeber the details.

Jan, I recall we talked about this some time ago, do you remember?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-22 Thread Patrick McHardy

David Woodhouse wrote:

On Tue, 2008-02-19 at 15:45 +0100, Patrick McHardy wrote:

That would break iptables compilation, which already includes
linux/in.h in some files. I guess the best fix for now is to
include netinet/in.h in busybox and long-term clean this up
properly.


It looks like iptables is fairly broken anyway:

make[1]: Entering directory 
`/home/dwmw2/working/extras/iptables/devel/iptables-1.4.0'
Unable to resolve dependency on linux/compiler.h. Try 'make clean'.
Extensions found:
make[1]: Leaving directory 
`/home/dwmw2/working/extras/iptables/devel/iptables-1.4.0'
error: Bad exit status from /var/tmp/rpm-tmp.32057 (%build)



Yes, that was a bug in the lastest release. We need to
release a 1.4.1 version or something like that, but I'm
not too familiar with the release process, so I haven't
done this so far.

Anyway, I just committed this patch to iptables to remove
the compiler.h inclusions.

Index: include/linux/netfilter_ipv6/ip6_tables.h
===
--- include/linux/netfilter_ipv6/ip6_tables.h   (Revision 7376)
+++ include/linux/netfilter_ipv6/ip6_tables.h   (Arbeitskopie)
@@ -15,7 +15,6 @@
 #ifndef _IP6_TABLES_H
 #define _IP6_TABLES_H
 
-#include 
 #include 
 
 #include 
Index: include/linux/netfilter.h
===
--- include/linux/netfilter.h   (Revision 7376)
+++ include/linux/netfilter.h   (Arbeitskopie)
@@ -1,8 +1,6 @@
 #ifndef __LINUX_NETFILTER_H
 #define __LINUX_NETFILTER_H
 
-#include 
-
 /* Responses from hook functions. */
 #define NF_DROP 0
 #define NF_ACCEPT 1
Index: include/linux/netfilter_ipv4/ip_tables.h
===
--- include/linux/netfilter_ipv4/ip_tables.h(Revision 7376)
+++ include/linux/netfilter_ipv4/ip_tables.h(Arbeitskopie)
@@ -15,7 +15,6 @@
 #ifndef _IPTABLES_H
 #define _IPTABLES_H
 
-#include 
 #include 
 
 #include 


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-22 Thread Patrick McHardy

David Woodhouse wrote:

On Fri, 2008-02-22 at 16:52 +0900, David Woodhouse wrote:

It looks like iptables is fairly broken anyway:

make[1]: Entering directory
`/home/dwmw2/working/extras/iptables/devel/iptables-1.4.0'
Unable to resolve dependency on linux/compiler.h. Try 'make clean'.


And if I move away the contents of the local include/linux/ directory
and replace it with proper headers generated by 'make
headers_install' (which won't be trying to include compiler.h), then I
get more failures:

Unable to resolve dependency on ../include/linux/netfilter/xt_u32.h. Try 'make 
clean'.
Unable to resolve dependency on linux/netfilter/xt_time.h. Try 'make clean'.
Unable to resolve dependency on linux/netfilter/xt_quota.h. Try 'make clean'.
Unable to resolve dependency on ../include/linux/netfilter/xt_connlimit.h. Try 
'make clean'.
Unable to resolve dependency on linux/netfilter_ipv6/ip6t_mh.h. Try 'make 
clean'.
Unable to resolve dependency on linux/netfilter/nf_nat.h. Try 'make clean'.



Yes, the dependency tracking has always been a bit strange/broken.
The next release will use autoconf, which will hopefully behave
better.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RTNL]: Validate hardware and broadcast address attribute for RTM_NEWLINK

2008-02-22 Thread Patrick McHardy

Thomas Graf wrote:

RTM_NEWLINK allows for already existing links to be modified. For this
purpose do_setlink() is called which expects address attributes with a
payload length of at least dev->addr_len. This patch adds the necessary
validation for the RTM_NEWLINK case.

The address length for links to be created is not checked for now as the
actual attribute length is used when copying the address to the netdevice
structure. It might make sense to report an error if less than addr_len
bytes are provided but enforcing this might break drivers trying to be
smart with not transmitting all zero addresses.

Signed-off-by: Thomas Graf <[EMAIL PROTECTED]>

Index: net-2.6.26/net/core/rtnetlink.c
===
--- net-2.6.26.orig/net/core/rtnetlink.c2008-02-22 01:50:53.0 
+0100
+++ net-2.6.26/net/core/rtnetlink.c 2008-02-22 11:28:59.0 +0100
@@ -726,6 +726,21 @@
return net;
 }
 
+static int validate_linkmsg(struct net_device *dev, struct nlattr *tb[])

+{
+   if (dev) {
+   if (tb[IFLA_ADDRESS] &&
+   nla_len(tb[IFLA_ADDRESS]) < dev->addr_len)
+   return -EINVAL;
+
+   if (tb[IFLA_BROADCAST] &&
+   nla_len(tb[IFLA_BROADCAST]) < dev->addr_len)
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 static int do_setlink(struct net_device *dev, struct ifinfomsg *ifm,
  struct nlattr **tb, char *ifname, int modified)
 {
@@ -910,12 +925,7 @@
goto errout;
}
 
-	if (tb[IFLA_ADDRESS] &&

-   nla_len(tb[IFLA_ADDRESS]) < dev->addr_len)
-   goto errout_dev;
-
-   if (tb[IFLA_BROADCAST] &&
-   nla_len(tb[IFLA_BROADCAST]) < dev->addr_len)
+   if ((err = validate_linkmsg(dev, tb)) < 0)
goto errout_dev;
 
 	err = do_setlink(dev, ifm, tb, ifname, 0);

@@ -1036,6 +1046,9 @@
else
dev = NULL;
 
+	if ((err = validate_linkmsg(dev, tb)) < 0)

+   return err;
+


Minor nitpick: it would be more logical to put this in the

if (dev) {
 ...

branch a bit below since thats the only path that leads to
do_setlink(). That would also allow to remove the
if (dev) check from validate_linkmsg().
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Fix multicast on VLAN interfaces

2008-02-22 Thread Patrick McHardy

Phil Oester wrote:

In commit 56addd6eeeb4e11f5a0af7093ca078e0f29140e0 the VLAN multicast list
handling was reworked.  Unfortunately, a variable initialization was missed,
and multicast over vlan devices has been broken since 2.6.23-rc1 (apparently
not too many people use this).  Trivial fix below, which might be a candidate
for -stable.


Thanks Phil, a similar patch is already sitting in net-2.6.git.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New sparse warning in net/mac80211/debugfs_sta.c

2008-02-21 Thread Patrick McHardy

Patrick McHardy wrote:

Joe Perches wrote:


This removes the __pure from print_mac, so reject as appropriate...

Add some type safety to print_mac by using struct print_mac_buf * 
instead of char *.


And adds back the overhead of two completely unnecessary
function calls to the VLAN fastpath. 


BTW, this also affects ATM, with 3 calls in hard_start_xmit,
3 calls in lec_xmit and 1 cakk in lec_atm_send.


How about just
stopping this idiocy and reverting the appropriate patches
to bring back MAC_FMT and use it where appropriate? 


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New sparse warning in net/mac80211/debugfs_sta.c

2008-02-21 Thread Patrick McHardy

Joe Perches wrote:

On Thu, 2008-02-21 at 11:17 +0100, Johannes Berg wrote:
  

Yeah, I saw that discussion. I think it's fine, it's just something we
need to be aware of. In fact, I Joe had a patch (that seems to have
gotten lost?) to make DECLARE_MAC_BUF() declare a structure with the u8
pointer in it instead to get type checking for the args, which would
make our code there not even compile, and imho rightfully so. I'll send
in a patch to fix this (via John) and Joe can resend his patch to get
typechecking there.



This removes the __pure from print_mac, so reject as appropriate...

Add some type safety to print_mac by using 
struct print_mac_buf * instead of char *.


And adds back the overhead of two completely unnecessary
function calls to the VLAN fastpath. How about just
stopping this idiocy and reverting the appropriate patches
to bring back MAC_FMT and use it where appropriate?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: New sparse warning in net/mac80211/debugfs_sta.c

2008-02-21 Thread Patrick McHardy

David Miller wrote:

From: Harvey Harrison <[EMAIL PROTECTED]>
Date: Thu, 21 Feb 2008 02:01:19 -0800


In this case, it's being passed to a debugfs create function, could it
instead use sysfs_format_mac?


Just assigning print_mac() to a local variable then passing that to
debugfs_create_dir() will make the warning go away.


From another perspective adding that __pure attribute to print_mac()

might not have been the best idea.  But I can't think of another
way to elimitate the "passing print_mac() args to pr_debug()
still generates calls to print_mac() even when DEBUG is not
defined" problem :-/



Frankly, I think the main problem is that MAC_FMT got removed
for no real reason and is forcing us to come up with lots of
workarounds. We have NIPQUAD_FMT, NIP6_FMT, so I don't see
whats wrong with MAC_FMT. In fact I think a simple

printk(MAC_FMT, addr);

is much nicer than all this temporary buffer and function
call stuff.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.25-rc2-mm1 - several bugs and a crash

2008-02-21 Thread Patrick McHardy

Stephen Hemminger wrote:

On Thu, 21 Feb 2008 12:28:50 +0100
Patrick McHardy <[EMAIL PROTECTED]> wrote:


  

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 327e847..b77eb56 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -256,13 +256,19 @@ __nf_conntrack_find(const struct nf_conntrack_tuple 
*tuple)
struct hlist_node *n;
unsigned int hash = hash_conntrack(tuple);
 
+	/* Disable BHs the entire time since we normally need to disable them

+* at least once for the stats anyway.
+*/
+   local_bh_disable();



Use rcu_read_lock instead. local_bh_disable() won't work with some of the other 
forms
of RCU alternatives.
  


The caller already calls rcu_read_lock(). This is for the per-cpu
statistics.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Don't limit the number of tunnels with generic name explicitly.

2008-02-21 Thread Patrick McHardy

Pavel Emelyanov wrote:

Patrick McHardy wrote:

Pavel Emelyanov wrote:

Patrick McHardy wrote:


It would be nicer to replace the entire hand-made name
allocation to remove the 100 device limit.

Actually, I thought the same, but fixing % in names looks like a 
BUG-fix for 2.6.25, while removing the hand-made name allocation 
looks like an enhancement for 2.6.26. No?


Well, its so closely related that I guess it would still look
like a bugfix :) But changing this in 2.6.26 is also fine of
course, your patch just reminded me since I wanted to change
this for a long time and repeatedly forgot about it again.


Ok, point taken ;) Here's the 2nd patch that does so. If David 
decides it can go to 2.6.25, that would be good, otherwise this 
patch will fit the 2.6.26 as well.


Changelog:

Use the added dev_alloc_name() call to create tunnel device name,
rather than iterate in a hand-made loop with an artificial limit.

Thanks Patrick for noticing this.

Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>



Looks good to me, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Don't create tunnels with '%' in name.

2008-02-21 Thread Patrick McHardy

Pavel Emelyanov wrote:

Patrick McHardy wrote:


It would be nicer to replace the entire hand-made name
allocation to remove the 100 device limit.



Actually, I thought the same, but fixing % in names looks like a 
BUG-fix for 2.6.25, while removing the hand-made name allocation 
looks like an enhancement for 2.6.26. No?



Well, its so closely related that I guess it would still look
like a bugfix :) But changing this in 2.6.26 is also fine of
course, your patch just reminded me since I wanted to change
this for a long time and repeatedly forgot about it again.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Don't create tunnels with '%' in name.

2008-02-21 Thread Patrick McHardy

Pavel Emelyanov wrote:

Four tunnel drivers (ip_gre, ipip, ip6_tunnel and sit) can
receive a pre-defined name for a device from the userspace.
Since these drivers call the register_netdevice() after this 
(rtnl_lock is held), the device's name may contain a '%' 
character.


Not sure how bad is this to have a device with a '%' in its
name, but all the other places either use the register_netdev(),
or explicitly call dev_alloc_name() before registering, i.e. 
do not allow for such names.


Signed-off-by: Pavel Emelyanov <[EMAIL PROTECTED]>

---

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 63f6917..6b9744f 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -274,19 +274,24 @@ static struct ip_tunnel * ipgre_tunnel_locate(struct 
ip_tunnel_parm *parms, int
if (!dev)
  return NULL;
 
+	if (strchr(name, '%')) {

+   if (dev_alloc_name(dev, name) < 0)
+   goto failed_free;
+   }
+



It would be nicer to replace the entire hand-made name
allocation to remove the 100 device limit.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-21 Thread Patrick McHardy

David Woodhouse wrote:

On Tue, 2008-02-19 at 15:45 +0100, Patrick McHardy wrote:

That would break iptables compilation, which already includes
linux/in.h in some files. I guess the best fix for now is to
include netinet/in.h in busybox and long-term clean this up
properly.


Yeah, that makes sense.

Can we push the change to __u32 (or uint32_t) for 2.6.24? Or is there
something obvious we should be doing in busybox which we aren't? I don't
quite understand why this u_int32_t crap doesn't work at _all_ when it
evidently used to at least in some environments.



I already sent it to Dave for 2.6.25 (I assume that what you meant,
it was introduced after 2.6.24), its currently sitting in net-2.6
and should hit Linus' tree next time he pulls from Dave.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.25-rc2-mm1 - several bugs and a crash

2008-02-21 Thread Patrick McHardy

Tilman Schmidt wrote:

Still, X came up fine, I could log in (Gnome feeling subjectively
a bit sluggish), call up a web page from the Internet in Firefox,
and start perusing the logs, when the whole system froze: neither
mouse nor keyboard would react anymore, and only the Wind^Wreset
button would put me back in control. After rebooting into the
previous, non-mm kernel I found this in the syslog:

Feb 20 17:22:40 xenon kernel: [   48.180297] BUG: using smp_processor_id() in 
preemptible [] code: ntpdate/3562
Feb 20 17:22:40 xenon kernel: [   48.180297] caller is 
__nf_conntrack_find+0x9b/0xeb [nf_conntrack]
Feb 20 17:22:40 xenon kernel: [   48.180297] Pid: 3562, comm: ntpdate Not 
tainted 2.6.25-rc2-mm1-testing #1
Feb 20 17:22:40 xenon kernel: [   48.180297]  [] 
debug_smp_processor_id+0x99/0xb0



Could you test whether this patch fixes the netfilter
warnings please?

commit 736b33102292be0d75be1e950ca9bcd5361db7dd
Author: Patrick McHardy <[EMAIL PROTECTED]>
Date:   Thu Feb 21 12:26:01 2008 +0100

[NETFILTER]: nf_conntrack: fix smp_processor_id() in preemptible code 
warning

Since we're using RCU for the conntrack hash now, we need to avoid
getting preempted or interrupted by BHs while changing the stats.

Fixes warning reported by Tilman Schmidt <[EMAIL PROTECTED]> when using
preemptible RCU:

[   48.180297] BUG: using smp_processor_id() in preemptible [] 
code: ntpdate/3562
[   48.180297] caller is __nf_conntrack_find+0x9b/0xeb [nf_conntrack]
[   48.180297] Pid: 3562, comm: ntpdate Not tainted 2.6.25-rc2-mm1-testing 
#1
[   48.180297]  [] debug_smp_processor_id+0x99/0xb0
[   48.180297]  [] __nf_conntrack_find+0x9b/0xeb [nf_conntrack]

    Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

diff --git a/net/netfilter/nf_conntrack_core.c 
b/net/netfilter/nf_conntrack_core.c
index 327e847..b77eb56 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -256,13 +256,19 @@ __nf_conntrack_find(const struct nf_conntrack_tuple 
*tuple)
struct hlist_node *n;
unsigned int hash = hash_conntrack(tuple);
 
+   /* Disable BHs the entire time since we normally need to disable them
+* at least once for the stats anyway.
+*/
+   local_bh_disable();
hlist_for_each_entry_rcu(h, n, &nf_conntrack_hash[hash], hnode) {
if (nf_ct_tuple_equal(tuple, &h->tuple)) {
NF_CT_STAT_INC(found);
+   local_bh_enable();
return h;
}
NF_CT_STAT_INC(searched);
}
+   local_bh_enable();
 
return NULL;
 }
@@ -400,17 +406,20 @@ nf_conntrack_tuple_taken(const struct nf_conntrack_tuple 
*tuple,
struct hlist_node *n;
unsigned int hash = hash_conntrack(tuple);
 
-   rcu_read_lock();
+   /* Disable BHs the entire time since we need to disable them at
+* least once for the stats anyway.
+*/
+   rcu_read_lock_bh();
hlist_for_each_entry_rcu(h, n, &nf_conntrack_hash[hash], hnode) {
if (nf_ct_tuplehash_to_ctrack(h) != ignored_conntrack &&
nf_ct_tuple_equal(tuple, &h->tuple)) {
NF_CT_STAT_INC(found);
-   rcu_read_unlock();
+   rcu_read_unlock_bh();
return 1;
}
NF_CT_STAT_INC(searched);
}
-   rcu_read_unlock();
+   rcu_read_unlock_bh();
 
return 0;
 }


Re: [PATCH] [NETFILTER]: fix ebtable targets return

2008-02-21 Thread Patrick McHardy

David Miller wrote:

From: Joonwoo Park <[EMAIL PROTECTED]>
Date: Thu, 21 Feb 2008 14:36:32 +0900


The function ebt_do_table doesn't take NF_DROP as a verdict from the targets.

Signed-off-by: Joonwoo Park <[EMAIL PROTECTED]>


Whoops, good catch :-)

Patrick, if you want you can just signoff on this and I can
put it directly into my tree.


Good catch indeed, thanks.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.25 2/4]S2io: Multiqueue network device support - FIFO selection based on L4 ports

2008-02-20 Thread Patrick McHardy

David Miller wrote:

From: Patrick McHardy <[EMAIL PROTECTED]>
Date: Thu, 21 Feb 2008 00:15:13 +0100

  

I missed those discussions, but this has already been agreed on,
fine by me. It would still be preferable to use queue_mapping
instead of priority IMO, even if its activated by a module
parameter, since that leaves the option to use classifiers.



I think the driver should compute the value in it's transmit
routine and keep the value local.  It shouldn't be setting
skb->foo values at all, they don't belong to the driver
and this could confuse other code.
  


Yes, I didn't meant it should set anything, but on of the options
this patch introduces is to _use_ skb->priority for classification,
so the classification is already external, in which case it might
as well offer the full flexibility (as one option) to use qdisc
classifiers.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.25 2/4]S2io: Multiqueue network device support - FIFO selection based on L4 ports

2008-02-20 Thread Patrick McHardy

David Miller wrote:

From: Patrick McHardy <[EMAIL PROTECTED]>
Date: Thu, 21 Feb 2008 00:08:52 +0100

  

Ramkrishna Vepa wrote:


Sreenivasa Honnur wrote:



- Resubmit #2
- Transmit fifo selection based on TCP/UDP ports.
- Added tx_steering_type loadable parameter for transmit fifo
  
  

selection.
  
  

  0x0 NO_STEERING: Default FIFO is selected.
  0x1 TX_PRIORITY_STEERING: FIFO is selected based on skb->priority.
  0x2 TX_DEFAULT_STEERING: FIFO is selected based on L4 Ports.

  
  

Why duplicate the generic multiqueue classification?



[Ram] Could you be more specific?
  
  

The generic multiqueue support classifies packets by setting
skb->queue_mapping using qdisc classifiers, which is more
flexible and avoids using module parameters.



But it doesn't do what these multiqueue TX queue hardware devices
want.  These devices don't want packet scheduler "classification",
they want load balancing using some key (current cpu number,
hashing on the packet headers, etc.)  And that's not what our
packet scheduler classifiers do or should do.
  


Its what the flow classifier does :)

We don't want to have to tell people "you have to run 'tc' magic
foo to use all of the TX queues on your network card."  That's
completely unreasonable and stupid.

We have to resolve this somehow, and there have been many discussions
about this a month or so ago.
  


I see. I missed those discussions, but this has already been
agreed on, fine by me. It would still be preferable to use
queue_mapping instead of priority IMO, even if its activated
by a module parameter, since that leaves the option to use
classifiers.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.25 2/4]S2io: Multiqueue network device support - FIFO selection based on L4 ports

2008-02-20 Thread Patrick McHardy

Ramkrishna Vepa wrote:

Sreenivasa Honnur wrote:


- Resubmit #2
- Transmit fifo selection based on TCP/UDP ports.
- Added tx_steering_type loadable parameter for transmit fifo
  

selection.
  

  0x0 NO_STEERING: Default FIFO is selected.
  0x1 TX_PRIORITY_STEERING: FIFO is selected based on skb->priority.
  0x2 TX_DEFAULT_STEERING: FIFO is selected based on L4 Ports.

  

Why duplicate the generic multiqueue classification?


[Ram] Could you be more specific?
  


The generic multiqueue support classifies packets by setting
skb->queue_mapping using qdisc classifiers, which is more
flexible and avoids using module parameters.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2.6.25 2/4]S2io: Multiqueue network device support - FIFO selection based on L4 ports

2008-02-20 Thread Patrick McHardy

Sreenivasa Honnur wrote:

- Resubmit #2
- Transmit fifo selection based on TCP/UDP ports.
- Added tx_steering_type loadable parameter for transmit fifo selection.
  0x0 NO_STEERING: Default FIFO is selected.
  0x1 TX_PRIORITY_STEERING: FIFO is selected based on skb->priority.
  0x2 TX_DEFAULT_STEERING: FIFO is selected based on L4 Ports.
  


Why duplicate the generic multiqueue classification?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 2.6.25-rc2-mm1 - several bugs and a crash

2008-02-20 Thread Patrick McHardy

Tilman Schmidt wrote:

Still, X came up fine, I could log in (Gnome feeling subjectively
a bit sluggish), call up a web page from the Internet in Firefox,
and start perusing the logs, when the whole system froze: neither
mouse nor keyboard would react anymore, and only the Wind^Wreset
button would put me back in control. After rebooting into the
previous, non-mm kernel I found this in the syslog:

Feb 20 17:22:40 xenon kernel: [   48.180297] BUG: using smp_processor_id() in 
preemptible [] code: ntpdate/3562
Feb 20 17:22:40 xenon kernel: [   48.180297] caller is 
__nf_conntrack_find+0x9b/0xeb [nf_conntrack]
Feb 20 17:22:40 xenon kernel: [   48.180297] Pid: 3562, comm: ntpdate Not 
tainted 2.6.25-rc2-mm1-testing #1
Feb 20 17:22:40 xenon kernel: [   48.180297]  [] 
debug_smp_processor_id+0x99/0xb0
Feb 20 17:22:40 xenon kernel: [   48.180297]  [] 
__nf_conntrack_find+0x9b/0xeb [nf_conntrack]


I guess the cause for this is a combination of preemtible
RCU and conntrack using RCU since 2.6.25-rc. Using
NF_CT_STAT_INC_ATOMIC should fix it, but I'd prefer
to have a fix that doesn't increase overhead when regular
RCU is used.

I'll see if I can find a better way to fix this tommorrow.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 3/8] [NET]: uninline dev_alloc_skb, de-bloats a lot

2008-02-20 Thread Patrick McHardy

Jan Engelhardt wrote:

On Feb 20 2008 15:47, Ilpo Järvinen wrote:

-23668  392 funcs, 104 +, 23772 -, diff: -23668 --- dev_alloc_skb

-static inline struct sk_buff *dev_alloc_skb(unsigned int length)
-{
-   return __dev_alloc_skb(length, GFP_ATOMIC);
-}
+extern struct sk_buff *dev_alloc_skb(unsigned int length);


Striking. How can this even happen? A callsite which calls

dev_alloc_skb(n)

is just equivalent to

__dev_alloc_skb(n, GFP_ATOMIC);

which means there's like 4 (or 8 if it's long) bytes more on the
stack. For a worst case, count in another 8 bytes for push and pop or mov on
the stack. But that still does not add up to 23 kb.



__dev_alloc_skb() is also an inline function which performs
some extra work. Which raises the question - if dev_alloc_skb()
is uninlined, shouldn't __dev_alloc_skb() be uninline as well?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] SUNRPC: Mark buffer used for debug printks with __maybe_unused

2008-02-20 Thread Patrick McHardy

Joe Perches wrote:

On Wed, 2008-02-20 at 07:29 -0800, Joe Perches wrote:
  

fs/nfsd/nfsproc.c:  char buf[RPC_MAX_ADDRBUFLEN];
Perhaps there should be a DECLARE_RPC_BUF(buf) macro?
#define DECLARE_RPC_BUF(var) char var[MAC_BUF_SIZE] __maybe_unused



Make that:

#define DECLARE_RPC_BUF(var) char var[RPC_MAX_ADDRBUFLEN] __maybe_unuse


Alternatively change the dprintk macro to behave similar like
pr_debug() and mark things like svc_print_addr() __pure, which
has the advantage that is still performs format checking even
if debugging is disabled.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 1/8] [NET]: uninline skb_put, de-bloats a lot

2008-02-20 Thread Patrick McHardy

Ilpo Järvinen wrote:

~500 files changed
...
kernel/uninlined.c:
  skb_put   | +104
 1 function changed, 104 bytes added, diff: +104

vmlinux.o:
 869 functions changed, 198 bytes added, 111003 bytes removed, diff: -110805

This change is INCOMPLETE, I think that the call to current_text_addr()
should be rethought but I don't have a clue how to do that.



I guess __builtin_return_address(0) would work.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [VLAN] vlan_skb_recv

2008-02-20 Thread Patrick McHardy

Ben Greear wrote:

Stephen Anderson wrote:

Hello,

To help increase throughput and bypass the backlog queue, I changed the
netif_rx() to netif_receive_skb() in vlan_skb_recv().  What's the
argument for using netif_rx() other than legacy maintenance?  At this
point, interrupt context should not be an issue.  Layer 2 performance
has been a big focus in my area of development.



I guess the only point is to reduce stack usage. Its probably not
a problem with only VLAN, but it might be with further tunnels,
IPsec, ...


I'm sure you have seen many attempts to implement a single VLAN aware
IVL FDB in the past and I was wondering which attempt do you feel was
the best?  Have you ever considered integrating your VLAN support
natively into the bridging code base or know of any attempts to do just
that?



Without having thought about this much, it seems to me that
it needs to be integrated in the bridge fdb to work properly.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[NET]: Messed multicast lists after dev_mc_sync/unsync

2008-02-19 Thread Patrick McHardy
 commit 6548b91f39381b2c5f02f99c14734546354bff89
Author: Jorge Boncompte [DTI2] <[EMAIL PROTECTED]>
Date:   Mon Feb 18 16:02:01 2008 +0100

[NET]: Messed multicast lists after dev_mc_sync/unsync

Commit a0a400d79e3dd7843e7e81baa3ef2957bdc292d0 from you
introduced a new field "da_synced" to struct dev_addr_list that is
not properly initialized to 0. So when any of the current users (8021q,
macvlan, mac80211) calls dev_mc_sync/unsync they mess the address
list for both devices.

The attached patch fixed it for me and avoid future problems.

Signed-off-by: Jorge Boncompte [DTI2] <[EMAIL PROTECTED]>
Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

diff --git a/net/core/dev.c b/net/core/dev.c
index 6cfc123..9516105 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2900,7 +2900,7 @@ int __dev_addr_add(struct dev_addr_list **list, int 
*count,
}
}
 
-   da = kmalloc(sizeof(*da), GFP_ATOMIC);
+   da = kzalloc(sizeof(*da), GFP_ATOMIC);
if (da == NULL)
return -ENOMEM;
memcpy(da->da_addr, addr, alen);


Re: AW: AW: AW: Problem receiving multicast with kernel.2.6.24, REAL solution

2008-02-19 Thread Patrick McHardy

Reither Robert wrote:

Visit AVD on prolight+sound in Frankfurt from 12.-15. March 2008 - Hall 8.0, 
Stand G16




Hurray, your patch does the trick for me !!!

Now join/leave und receiving of multicasts work like it should !!

May i kiss you Patrick, this saves my week ;-)


The credit goes to Jorge Boncompte, I only broke it :|


Hope we see a minor kernel release with this patch very soon ...


Yes, I'll send it to -stable as soon as it hits upstream.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-19 Thread Patrick McHardy

David Woodhouse wrote:

On Tue, 2008-02-19 at 15:01 +0100, Patrick McHardy wrote:

David Woodhouse wrote:

+union nf_inet_addr {
+   u_int32_t   all[4];
+   __be32  ip;
+   __be32  ip6[4];
+};
+
 #ifdef __KERNEL__
 #ifdef CONFIG_NETFILTER

This breaks the busybox build:

CC  ipsvd/tcpudp.o
In file included from /usr/include/linux/netfilter_ipv4.h:8,
 from ipsvd/tcpudp.c:33:
/usr/include/linux/netfilter.h:40: error: expected specifier-qualifier-list 
before 'u_int32_t'

What is this 'u_int32_t' nonsense anyway?

If a user-visible header is likely to be included by libc directly from
a 'standard' header, it may not require . Therefore it should
use the system-specific types such as '__u32'.

Right, I queued this patch to fix it.


That does the trick -- but are we using u_int32_t elsewhere in
user-visible headers? Does it work there? How?



Its used in nearly every ip_tables header file. Those are most likely
not included by glibc and iptables includes sys/types.h. Besides
iptables I'm only aware of a perl module that uses these files,
which is probably also including sys/types.h.


A later commit adds struct in_addr and struct in6_addr to this union
too, which breaks busybox even harder.

Thats odd, the iptables headers have always used struct in_addr and
struct in6_addr in struct ipt_ip/struct ip6t_ip6, which are also
used by userspace. What is "ipsvd/tcpudp.c"? I couldn't find it in
the Debian busybox source.


It's in busybox 1.9.1. Just including  seems to be
sufficient to make it happy again. I wonder if netfilter.h should
include that for itself?



That would break iptables compilation, which already includes
linux/in.h in some files. I guess the best fix for now is to
include netinet/in.h in busybox and long-term clean this up
properly.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ANNOUNCE] ESFQ --> SFQ patches for Linux 2.6.24

2008-02-19 Thread Patrick McHardy

David Miller wrote:

From: "Brock Noland" <[EMAIL PROTECTED]>
Date: Sat, 9 Feb 2008 20:30:58 -0600


Is this going to be merged anytime soon?


If it gets submitted to the proper mailing list, it might.
'linux-net' is for user questions, it is not where the networking
developers hang out, 'netdev' is.

And you have to post patches for review, not URL's point to
the patches.  It has to be int he email, in an applyable form
so people can review the thing properly.



Since SFQ is not exactly simple and I needed something like this
myself, I followed Paul's suggestion and added a new scheduler
(DRR) for this with more flexible limits.

I'll rediff against net-2.6.26 within the next days and send
a final version for review (anyone interested is welcome to
already review this version of course :).

commit 13d0cc64d0f7fed945c357cf4ca43330c8f95ad2
Author: Patrick McHardy <[EMAIL PROTECTED]>
Date:   Mon Feb 18 22:21:55 2008 +0100

    [NET_SCHED]: Add DRR scheduler

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index dbb7ac3..2fca9c4 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -482,4 +482,20 @@ struct tc_netem_corrupt
 
 #define NETEM_DIST_SCALE   8192
 
+/* DRR */
+
+enum
+{
+   TCA_DRR_UNSPEC,
+   TCA_DRR_QUANTUM,
+   __TCA_DRR_MAX
+};
+
+#define TCA_DRR_MAX(__TCA_DRR_MAX - 1)
+
+struct tc_drr_stats
+{
+   s32 deficit;
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 82adfe6..7e1ab99 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -196,6 +196,9 @@ config NET_SCH_NETEM
 
  If unsure, say N.
 
+config NET_SCH_DRR
+   tristate "DRR scheduler"
+
 config NET_SCH_INGRESS
tristate "Ingress Qdisc"
depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 1d2b0f7..b055f74 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -28,6 +28,7 @@ obj-$(CONFIG_NET_SCH_TEQL)+= sch_teql.o
 obj-$(CONFIG_NET_SCH_PRIO) += sch_prio.o
 obj-$(CONFIG_NET_SCH_ATM)  += sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)+= sch_netem.o
+obj-$(CONFIG_NET_SCH_DRR)  += sch_drr.o
 obj-$(CONFIG_NET_CLS_U32)  += cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)   += cls_route.o
 obj-$(CONFIG_NET_CLS_FW)   += cls_fw.o
diff --git a/net/sched/sch_drr.c b/net/sched/sch_drr.c
new file mode 100644
index 000..aa241b5
--- /dev/null
+++ b/net/sched/sch_drr.c
@@ -0,0 +1,534 @@
+/*
+ * net/sched/sch_drr.c Deficit Round Robin scheduler
+ *
+ * Copyright (c) 2008 Patrick McHardy <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+struct drr_class {
+   struct hlist_node   hlist;
+   u32 classid;
+   unsigned intrefcnt;
+
+   struct gnet_stats_basic bstats;
+   struct gnet_stats_queue qstats;
+   struct gnet_stats_rate_est  rate_est;
+   struct list_headalist;
+   struct Qdisc *  qdisc;
+
+   u32 quantum;
+   s32 deficit;
+};
+
+#define DRR_HSIZE  16
+
+struct drr_sched {
+   struct list_headactive;
+   struct tcf_proto *  filter_list;
+   unsigned intfilter_cnt;
+   struct hlist_head   clhash[DRR_HSIZE];
+   struct sk_buff *requeue;
+};
+
+static unsigned int drr_hash(u32 h)
+{
+   h ^= h >> 8;
+   h ^= h >> 4;
+
+   return h & (DRR_HSIZE - 1);
+}
+
+static struct drr_class *drr_find_class(struct Qdisc *sch, u32 classid)
+{
+   struct drr_sched *q = qdisc_priv(sch);
+   struct drr_class *cl;
+   struct hlist_node *n;
+
+   hlist_for_each_entry(cl, n, &q->clhash[drr_hash(classid)], hlist) {
+   if (cl->classid == classid)
+   return cl;
+   }
+   return NULL;
+}
+
+static void drr_purge_queue(struct drr_class *cl)
+{
+   unsigned int len = cl->qdisc->q.qlen;
+
+   qdisc_reset(cl->qdisc);
+   qdisc_tree_decrease_qlen(cl->qdisc, len);
+}
+
+static const struct nla_policy drr_policy[TCA_DRR_MAX + 1] = {
+   [TCA_DRR_QUANTUM]   = { .type = NLA_U32 },
+};
+
+static int drr_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
+   struct nlattr **tca, unsigned long *arg)
+{
+   struct drr_sched *q = qdisc_priv(sch);
+   struct drr_class 

Re: [PATCH 3/17 net-2.6.26] [NETFILTER]: Consolidate masq_inet_event and masq_device_event.

2008-02-19 Thread Patrick McHardy

Denis V. Lunev wrote:

They do exactly the same job.

Signed-off-by: Denis V. Lunev <[EMAIL PROTECTED]>
---
 net/ipv4/netfilter/ipt_MASQUERADE.c |   14 ++
 1 files changed, 2 insertions(+), 12 deletions(-)


Looks fine.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: AW: AW: Problem receiving multicast with kernel.2.6.24 #3

2008-02-19 Thread Patrick McHardy

Reither Robert wrote:

Visit AVD on prolight+sound in Frankfurt from 12.-15. March 2008 - Hall 8.0, 
Stand G16




OK, found one solution (but i think, this should not the normal way)

MC joining first the VLAN device (eth0.3) and than the physical (eth0) does the 
trick, 100% success ...

Doing it the opposite way, gives the same problem as before ...

So somehow joining the VLAN interface disrupts the join to the physical one ...

Ideas ?


Does this patch help?
commit 6548b91f39381b2c5f02f99c14734546354bff89
Author: Jorge Boncompte [DTI2] <[EMAIL PROTECTED]>
Date:   Mon Feb 18 16:02:01 2008 +0100

[NET]: Messed multicast lists after dev_mc_sync/unsync

Commit a0a400d79e3dd7843e7e81baa3ef2957bdc292d0 from you
introduced a new field "da_synced" to struct dev_addr_list that is
not properly initialized to 0. So when any of the current users (8021q,
macvlan, mac80211) calls dev_mc_sync/unsync they mess the address
list for both devices.

The attached patch fixed it for me and avoid future problems.

Signed-off-by: Jorge Boncompte [DTI2] <[EMAIL PROTECTED]>
Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

diff --git a/net/core/dev.c b/net/core/dev.c
index 6cfc123..9516105 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2900,7 +2900,7 @@ int __dev_addr_add(struct dev_addr_list **list, int 
*count,
}
}
 
-   da = kmalloc(sizeof(*da), GFP_ATOMIC);
+   da = kzalloc(sizeof(*da), GFP_ATOMIC);
if (da == NULL)
return -ENOMEM;
memcpy(da->da_addr, addr, alen);


Re: [NETFILTER]: Introduce nf_inet_address

2008-02-19 Thread Patrick McHardy

David Woodhouse wrote:

+union nf_inet_addr {
+   u_int32_t   all[4];
+   __be32  ip;
+   __be32  ip6[4];
+};
+
 #ifdef __KERNEL__
 #ifdef CONFIG_NETFILTER


This breaks the busybox build:

CC  ipsvd/tcpudp.o
In file included from /usr/include/linux/netfilter_ipv4.h:8,
 from ipsvd/tcpudp.c:33:
/usr/include/linux/netfilter.h:40: error: expected specifier-qualifier-list 
before 'u_int32_t'

What is this 'u_int32_t' nonsense anyway?

If a user-visible header is likely to be included by libc directly from
a 'standard' header, it may not require . Therefore it should
use the system-specific types such as '__u32'.


Right, I queued this patch to fix it.


If it isn't likely to be included by libc, which is the case for
netfilter, then it might as well just use the proper C types. Those who
are stuck on C89 or earlier might still prefer to use '__u32' even when
there's no need for it, but 'u_int32_t' is just silly. I suspect we
should eradicate it.


Yes, some more consitency would be nice. So far the consensus was
to not use it in new code, but keep using it in subsystems like
netfilter that (almost) consistently use it everywhere.


I couldn't make busybox work with it --
__BIT_TYPES_DEFINED__ is defined in  and prevents the
definitions of u_int32_t et al from appearing in . And if
I include  first, other things break.

A later commit adds struct in_addr and struct in6_addr to this union
too, which breaks busybox even harder.


Thats odd, the iptables headers have always used struct in_addr and
struct in6_addr in struct ipt_ip/struct ip6t_ip6, which are also
used by userspace. What is "ipsvd/tcpudp.c"? I couldn't find it in
the Debian busybox source.


How is this supposed to be used in userspace? Or is it even supposed to
be exposed?


Yes, its meant to replace many self-made "AF-independant" address
representations.
commit 6f2e68f81457d72bdb14a7ead305a1da06e78893
Author: Patrick McHardy <[EMAIL PROTECTED]>
Date:   Tue Feb 19 14:54:36 2008 +0100

[NETFILTER]: Use __u32 in struct nf_inet_addr

As reported by David Woodhouse <[EMAIL PROTECTED]>, using u_int32_t in
struct nf_inet_addr breaks the busybox build. Fix by using __u32.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index d74e79b..b74b615 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -51,7 +51,7 @@ enum nf_inet_hooks {
 };
 
 union nf_inet_addr {
-   u_int32_t   all[4];
+   __u32   all[4];
__be32  ip;
__be32  ip6[4];
struct in_addr  in;


Re: [Bugme-new] [Bug 9920] New: kernel panic when using ebtables redirect target

2008-02-19 Thread Patrick McHardy

David Miller wrote:

From: Joonwoo Park <[EMAIL PROTECTED]>
Date: Tue, 19 Feb 2008 11:53:24 +0900


[PATCH] netfilter: fix incorrect use of skb_make_writable

http://bugzilla.kernel.org/show_bug.cgi?id=9920
The function skb_make_writable returns true or false.

Signed-off-by: Joonwoo Park <[EMAIL PROTECTED]>


I'll let Patrick pull this in, thanks!



Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net/8021q/vlan_dev.c - Use print_mac

2008-02-19 Thread Patrick McHardy

David Miller wrote:

From: David Miller <[EMAIL PROTECTED]>
Date: Mon, 18 Feb 2008 16:43:05 -0800 (PST)


I think we can fix this easily by using __attribute_const_
on the print_mac() declaration.  Let me play with that.


Actually it seems the 'pure' attribute is more important
here.  Although it's not semantically a perfect match,
what we need to tell the compiler is basically that:

1) the return value depends upon the inputs
2) if the input is not used, it's safe to avoid the call

and 'pure' accomplishes that without any unwanted side-effects.

I think this will not result in any unwanted over-optimization.
Because if the inputs change in any way GCC has to emit the
call.

Any objections?



This seems fine to me, thanks Dave.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net/8021q/vlan_dev.c - Use print_mac

2008-02-18 Thread Patrick McHardy

Joe Perches wrote:

On Mon, 2008-02-18 at 16:19 +0100, Patrick McHardy wrote:
  

@@ -404,11 +405,8 @@ static int vlan_dev_hard_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
 	pr_debug("%s: about to send skb: %p to dev: %s\n",

__FUNCTION__, skb, skb->dev->name);
-   pr_debug("  " MAC_FMT " " MAC_FMT " %4hx %4hx %4hx\n",
-veth->h_dest[0], veth->h_dest[1], veth->h_dest[2],
-veth->h_dest[3], veth->h_dest[4], veth->h_dest[5],
-veth->h_source[0], veth->h_source[1], veth->h_source[2],
-veth->h_source[3], veth->h_source[4], veth->h_source[5],
+   pr_debug("  %s %s %4hx %4hx %4hx\n",
+print_mac(mac, veth->h_dest), print_mac(mac2, veth->h_source),
  

This results in print_mac getting called twice per packet even without
debugging. Whats the problem with MAC_FMT?



It's just a consistency thing.
It identifies code where MAC addresses are used.
  
an allyesconfig is a bit smaller (~.1%).

pr_debug is a noop when not debugging, print_mac is optimized away.
  


No its not, which I also stated in the commit message that restored
it.

0x60244313 :  callq  
0x60161dbd 
0x60244318 :  lea
-0x50(%rbp),%rdi

0x6024431c :  mov%r15,%rsi
0x6024431f :  callq  
0x60161dbd 



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net/8021q/vlan_dev.c - Use print_mac

2008-02-18 Thread Patrick McHardy

David Miller wrote:

From: Patrick McHardy <[EMAIL PROTECTED]>
Date: Mon, 18 Feb 2008 16:19:40 +0100

  

Joe Perches wrote:




We specifically removed this sort of thing, please don't
add it back.


Why?



We converted the entire tree over the print_mac(), and since
the MAC_FMT stuff was therefore no longer used we could
remove it.

Some references slipped back in somehow, and thus MAC_FMT
did too.

There is no reason to keep around a global interface for
_one_ user when that user can use the recommended interface
just as equally as the rest of the tree which we converted.

This is a pr_debug() statement we're talking about here.
:-)
  


The way pr_debug is implemented it still results in two function
calls per packet since the compiler doesn't know that it doesn't
have visible side-effects besides modifying the (unused) buffer.
I confirmed this using codiff.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] net/8021q/vlan_dev.c - Use print_mac

2008-02-18 Thread Patrick McHardy

Joe Perches wrote:

On Fri, 2008-02-15 at 02:58 -0800, David Miller wrote:

From: Bruno Randolf <[EMAIL PROTECTED]>
Date: Fri, 15 Feb 2008 19:48:05 +0900

is there any chance to include a macro like this for printing mac

addresses?

its advantage is that it can be used without the need to declare

buffers for

print_mac(), for example:

We specifically removed this sort of thing, please don't
add it back.


Why?


@@ -404,11 +405,8 @@ static int vlan_dev_hard_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
 	pr_debug("%s: about to send skb: %p to dev: %s\n",

__FUNCTION__, skb, skb->dev->name);
-   pr_debug("  " MAC_FMT " " MAC_FMT " %4hx %4hx %4hx\n",
-veth->h_dest[0], veth->h_dest[1], veth->h_dest[2],
-veth->h_dest[3], veth->h_dest[4], veth->h_dest[5],
-veth->h_source[0], veth->h_source[1], veth->h_source[2],
-veth->h_source[3], veth->h_source[4], veth->h_source[5],
+   pr_debug("  %s %s %4hx %4hx %4hx\n",
+print_mac(mac, veth->h_dest), print_mac(mac2, veth->h_source),



This results in print_mac getting called twice per packet even without
debugging. Whats the problem with MAC_FMT?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2.6 patch] remove include/linux/netfilter_ipv4/ipt_SAME.h

2008-02-18 Thread Patrick McHardy

Adrian Bunk wrote:

This patch removes the no longer used include/linux/netfilter_ipv4/ipt_SAME.h



We kept it around because old iptables binaries need it to build.
The kernel no longer supports it, but people might still wish to
use a distributor-built iptables binary with old kernels. It will
be removed with a number of other headers kept for compatibility
in 1-2 years.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Netfilter fixes to 2.6.24-git

2008-02-10 Thread Patrick McHardy

David Miller wrote:

From: Jan Engelhardt <[EMAIL PROTECTED]>
Date: Sun, 10 Feb 2008 22:02:35 +0100 (CET)

I have been unable to reach the netfilter and net maintainers the past 
week regarding inclusion of patches, but most importantly a group of 
fixes at [0]-[3]. I am kind of at a loss here but to turn up the volume 
and write to more people on how to proceed.


You just need to be patient and wait for Patrick to get to your
patches, it's as simple as that :-)

Patrick is normally very responsive, so whatever it is it has to be a
temporary issue.



Indeed, I'm currently ill and not really up for much working,
but this will hopefully get better soon.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[IPV4]: route: fix crash ip_route_input

2008-02-06 Thread Patrick McHardy
 commit dad61a4af7d23146ce67ec2f069f6ba9b75a578d
Author: Patrick McHardy <[EMAIL PROTECTED]>
Date:   Wed Feb 6 14:35:11 2008 +0100

[IPV4]: route: fix crash ip_route_input

ip_route_me_harder() may call ip_route_input() with skbs that don't
have skb->dev set for skbs rerouted in LOCAL_OUT and TCP resets
generated by the REJECT target, resulting in a crash when dereferencing
skb->dev->nd_net. Since ip_route_input() has an input device argument,
it seems correct to use that one anyway.

Bug introduced in b5921910a1 (Routing cache virtualization).

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 8842ecb..525787b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2041,7 +2041,7 @@ int ip_route_input(struct sk_buff *skb, __be32 daddr, __be32 saddr,
 	int iif = dev->ifindex;
 	struct net *net;
 
-	net = skb->dev->nd_net;
+	net = dev->nd_net;
 	tos &= IPTOS_RT_MASK;
 	hash = rt_hash(daddr, saddr, iif);
 


[IPROUTE]: cls_flow: add vlan-tag support

2008-02-05 Thread Patrick McHardy
 commit 94e9cba778cb97d77d9146dc3bd38ff195bc2c8a
Author: Patrick McHardy <[EMAIL PROTECTED]>
Date:   Sat Feb 2 18:22:16 2008 +0100

[IPROUTE]: cls_flow: add vlan-tag support

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index 16869c2..e3e9e25 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -348,6 +348,7 @@ enum
FLOW_KEY_RTCLASSID,
FLOW_KEY_SKUID,
FLOW_KEY_SKGID,
+   FLOW_KEY_VLAN_TAG,
__FLOW_KEY_MAX,
 };
 
diff --git a/tc/f_flow.c b/tc/f_flow.c
index eca05cd..1537ade 100644
--- a/tc/f_flow.c
+++ b/tc/f_flow.c
@@ -32,7 +32,8 @@ static void explain(void)
 "KEY-LIST := [ KEY-LIST , ] KEY\n"
 "KEY  := [ src | dst | proto | proto-src | proto-dst | iif | priority | \n"
 "  mark | nfct | nfct-src | nfct-dst | nfct-proto-src | \n"
-"  nfct-proto-dst | rt-classid | sk-uid | sk-gid ]\n"
+"  nfct-proto-dst | rt-classid | sk-uid | sk-gid |\n"
+"  vlan-tag ]\n"
 "OPS  := [ or NUM | and NUM | xor NUM | rshift NUM | addend NUM ]\n"
 "ID   := X:Y\n"
);
@@ -55,6 +56,7 @@ static const char *flow_keys[FLOW_KEY_MAX+1] = {
[FLOW_KEY_RTCLASSID]= "rt-classid",
[FLOW_KEY_SKUID]= "sk-uid",
[FLOW_KEY_SKGID]= "sk-gid",
+   [FLOW_KEY_VLAN_TAG] = "vlan-tag",
 };
 
 static int flow_parse_keys(__u32 *keys, __u32 *nkeys, char *argv)


[NET_SCHED 04/04]: cls_flow: support classification based on VLAN tag

2008-02-05 Thread Patrick McHardy
[NET_SCHED]: cls_flow: support classification based on VLAN tag

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit 104092e6d90cba5fa00902a3154155872d693f42
tree 0e3fd5871a861fa022bbc2f34d314bb8672b556a
parent 03faf81b8195be455c3c7592d76d712ea9d80b13
author Patrick McHardy <[EMAIL PROTECTED]> Tue, 05 Feb 2008 15:22:23 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Tue, 05 Feb 2008 15:22:23 +0100

 include/linux/pkt_cls.h |1 +
 net/sched/cls_flow.c|   12 
 2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index 40fac8c..28dfc61 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -348,6 +348,7 @@ enum
FLOW_KEY_RTCLASSID,
FLOW_KEY_SKUID,
FLOW_KEY_SKGID,
+   FLOW_KEY_VLAN_TAG,
__FLOW_KEY_MAX,
 };
 
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index eeb223c..971b867 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -270,6 +271,15 @@ static u32 flow_get_skgid(const struct sk_buff *skb)
return 0;
 }
 
+static u32 flow_get_vlan_tag(const struct sk_buff *skb)
+{
+   u16 uninitialized_var(tag);
+
+   if (vlan_get_tag(skb, &tag) < 0)
+   return 0;
+   return tag & VLAN_VID_MASK;
+}
+
 static u32 flow_key_get(const struct sk_buff *skb, int key)
 {
switch (key) {
@@ -305,6 +315,8 @@ static u32 flow_key_get(const struct sk_buff *skb, int key)
return flow_get_skuid(skb);
case FLOW_KEY_SKGID:
return flow_get_skgid(skb);
+   case FLOW_KEY_VLAN_TAG:
+   return flow_get_vlan_tag(skb);
default:
WARN_ON(1);
return 0;
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[VLAN 03/04]: Constify skb argument to vlan_get_tag()

2008-02-05 Thread Patrick McHardy
[VLAN]: Constify skb argument to vlan_get_tag()

Required by next patch to use it from the flow classifier.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit 03faf81b8195be455c3c7592d76d712ea9d80b13
tree 9da377d2bd44421494a4c4d7cf6c705d199c26ce
parent 2e5915ef51e55135522e59e041bb176432857d82
author Patrick McHardy <[EMAIL PROTECTED]> Tue, 05 Feb 2008 15:22:23 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Tue, 05 Feb 2008 15:22:23 +0100

 include/linux/if_vlan.h |7 ---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 34f40ef..79504b2 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -327,7 +327,7 @@ static inline struct sk_buff *vlan_put_tag(struct sk_buff 
*skb, unsigned short t
  * 
  * Returns error if the skb is not of VLAN type
  */
-static inline int __vlan_get_tag(struct sk_buff *skb, unsigned short *tag)
+static inline int __vlan_get_tag(const struct sk_buff *skb, unsigned short 
*tag)
 {
struct vlan_ethhdr *veth = (struct vlan_ethhdr *)skb->data;
 
@@ -347,7 +347,8 @@ static inline int __vlan_get_tag(struct sk_buff *skb, 
unsigned short *tag)
  * 
  * Returns error if @skb->cb[] is not set correctly
  */
-static inline int __vlan_hwaccel_get_tag(struct sk_buff *skb, unsigned short 
*tag)
+static inline int __vlan_hwaccel_get_tag(const struct sk_buff *skb,
+unsigned short *tag)
 {
struct vlan_skb_tx_cookie *cookie;
 
@@ -370,7 +371,7 @@ static inline int __vlan_hwaccel_get_tag(struct sk_buff 
*skb, unsigned short *ta
  * 
  * Returns error if the skb is not VLAN tagged
  */
-static inline int vlan_get_tag(struct sk_buff *skb, unsigned short *tag)
+static inline int vlan_get_tag(const struct sk_buff *skb, unsigned short *tag)
 {
if (skb->dev->features & NETIF_F_HW_VLAN_TX) {
return __vlan_hwaccel_get_tag(skb, tag);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[NET_SCHED 01/04]: em_meta: fix compile warning

2008-02-05 Thread Patrick McHardy
[NET_SCHED]: em_meta: fix compile warning

net/sched/em_meta.c: In function 'meta_int_vlan_tag':
net/sched/em_meta.c:179: warning: 'tag' may be used uninitialized in this 
function

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit adfab462c5e0a32274927bba4eec3afc6e35
tree bfac456798152d7ea6bedcca4a03b4f045605d3d
parent fb1c15ba9ebcab6478a409051ad26d7d180fe286
author Patrick McHardy <[EMAIL PROTECTED]> Tue, 05 Feb 2008 15:22:21 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Tue, 05 Feb 2008 15:22:21 +0100

 net/sched/em_meta.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 9c2ec19..2a7e648 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -176,7 +176,7 @@ META_COLLECTOR(var_dev)
 
 META_COLLECTOR(int_vlan_tag)
 {
-   unsigned short tag;
+   unsigned short uninitialized_var(tag);
if (vlan_get_tag(skb, &tag) < 0)
*err = -1;
else
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[NET_SCHED 02/04]: cls_flow: fix key mask validity check

2008-02-05 Thread Patrick McHardy
[NET_SCHED]: cls_flow: fix key mask validity check

Since we're using fls(), we need to check whether the value is non-zero first.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit 2e5915ef51e55135522e59e041bb176432857d82
tree 9a42fac3d1646a378acdc91b55642b68c9d97dde
parent adfab462c5e0a32274927bba4eec3afc6e35
author Patrick McHardy <[EMAIL PROTECTED]> Tue, 05 Feb 2008 15:22:23 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Tue, 05 Feb 2008 15:22:23 +0100

 net/sched/cls_flow.c |5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
index 8d76986..eeb223c 100644
--- a/net/sched/cls_flow.c
+++ b/net/sched/cls_flow.c
@@ -402,12 +402,13 @@ static int flow_change(struct tcf_proto *tp, unsigned 
long base,
 
if (tb[TCA_FLOW_KEYS]) {
keymask = nla_get_u32(tb[TCA_FLOW_KEYS]);
-   if (fls(keymask) - 1 > FLOW_KEY_MAX)
-   return -EOPNOTSUPP;
 
nkeys = hweight32(keymask);
if (nkeys == 0)
return -EINVAL;
+
+   if (fls(keymask) - 1 > FLOW_KEY_MAX)
+   return -EOPNOTSUPP;
}
 
err = tcf_exts_validate(tp, tb, tca[TCA_RATE], &e, &flow_ext_map);
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[NET_SCHED 00/04]: Two fixes + cls_flow VLAN tag based classification

2008-02-05 Thread Patrick McHardy
These patches fix a compile warning in em_meta, an invalid check in
the flow classifier and add VLAN tag based classification to cls_flow.
Please apply, thanks.


 include/linux/if_vlan.h |7 ---
 include/linux/pkt_cls.h |1 +
 net/sched/cls_flow.c|   17 +++--
 net/sched/em_meta.c |2 +-
 4 files changed, 21 insertions(+), 6 deletions(-)

Patrick McHardy (4):
  [NET_SCHED]: em_meta: fix compile warning
  [NET_SCHED]: cls_flow: fix key mask validity check
  [VLAN]: Constify skb argument to vlan_get_tag()
  [NET_SCHED]: cls_flow: support classification based on VLAN tag
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vlan tag match

2008-02-05 Thread Patrick McHardy

David Miller wrote:

From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Thu, 31 Jan 2008 21:50:25 -0800


On Fri, 01 Feb 2008 06:34:34 +0100
Patrick McHardy <[EMAIL PROTECTED]> wrote:


Stephen Hemminger wrote:

Provide a way to use tc filters on vlan tag even if tag is buried in
skb due to hardware acceleration.

Looks reasonable. Would you like to add the same feature to the
flow classifier?

Yes, that would be good.


Did that patch get posted?  I didn't see it.



No, but I already added it myself. I'll post it in a new thread.

BTW, Stephen, I noticed your ematch patch doesn't mask out the
VLAN priority which is also included in the return value of
vlan_get_tag(), is that intentional?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NET_SCHED 00/04]: External SFQ classifiers/flow classifier

2008-02-04 Thread Patrick McHardy

Corey Hickey wrote:

Patrick McHardy wrote:

These patches add support for external classifiers to SFQ and add a
new "flow" classifier, which can do hashing based on user-specified
keys or deterministic mapping of keys to classes. Additionally there
is a patch to make the SFQ queues visisble as classes to verify that
the hash is indeed doing something useful and a patch to consifiy
struct tcf_ext_map, which I had queued in the same tree.


Excellent! I'm glad this is applied. I'm having trouble figuring out how
it works, though. As a test, I'm trying to set up SFQ equivalent to
ESFQ's "hash dst". Here's what I do, and this is what I get:


# ./tc qdisc add dev eth0 root handle 1: sfq
# ./tc filter add dev eth0 parent 1: flow hash keys dst
RTNETLINK answers: Invalid argument
We have an error talking to the kernel


I've tried a few different keys with the same results. I don't know what
I'm doing wrong, or even where to start figuring it out. Can you point
me in the right direction?



You're missing protocol, handle etc. Try something like this:

tc filter add dev eth0 protocol ip pref 1 parent 1: handle 1 \ 


flow hash keys dst divisor 1024

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Still oopsing in nf_nat_move_storage()

2008-02-02 Thread Patrick McHardy

Chuck Ebbert wrote:

On 01/31/2008 01:03 PM, Chuck Ebbert wrote:

On 01/29/2008 12:18 PM, Patrick McHardy wrote:

Chuck Ebbert wrote:

nf_nat_move_storage():
/usr/src/debug/kernel-2.6.23/linux-2.6.23.i686/net/ipv4/netfilter/nf_nat_core.c:612

  87:   f7 47 64 80 01 00 00testl  $0x180,0x64(%edi)
  8e:   74 39   je c9


line 612:
if (!(ct->status & IPS_NAT_DONE_MASK))
return;

ct is NULL

The current kernel (and 2.6.23-stable) have:

if (!ct || !(ct->status & IPS_NAT_DONE_MASK))
return;

so it seems you're using an old version.


So, it is now oopsing after the test for NULL and only x86_64 is
catching the invalid address because it is non-canonical. Checking
for NULL is obviously not enough...



Could you try whether this patch fixes it please?

commit 6953954cc566c19a84b7ca9647c16dabe4646c03
Author: Patrick McHardy <[EMAIL PROTECTED]>
Date:   Sat Feb 2 12:01:03 2008 +0100

[NETFILTER]: nf_conntrack: fix ct_extend ->move operation

The ->move operation has two bugs:

- It is called with the same extension as source and destination,
  so it doesn't update the new extension.

- The address of the old extension is calculated incorrectly,
  instead of (void *)ct->ext + ct->ext->offset[i] it uses
  ct->ext + ct->ext->offset[i].

Should fix a crash on x86_64 reported by Chuck Ebbert <[EMAIL PROTECTED]>
and Thomas Woerner <[EMAIL PROTECTED]>.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

diff --git a/include/net/netfilter/nf_conntrack_extend.h 
b/include/net/netfilter/nf_conntrack_extend.h
index 73b5711..49aac63 100644
--- a/include/net/netfilter/nf_conntrack_extend.h
+++ b/include/net/netfilter/nf_conntrack_extend.h
@@ -67,7 +67,7 @@ struct nf_ct_ext_type
void (*destroy)(struct nf_conn *ct);
/* Called when realloacted (can be NULL).
   Contents has already been moved. */
-   void (*move)(struct nf_conn *ct, void *old);
+   void (*move)(void *new, void *old);
 
enum nf_ct_ext_id id;
 
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index dd07362..0d5fa3a 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -600,10 +600,10 @@ static void nf_nat_cleanup_conntrack(struct nf_conn *ct)
spin_unlock_bh(&nf_nat_lock);
 }
 
-static void nf_nat_move_storage(struct nf_conn *conntrack, void *old)
+static void nf_nat_move_storage(void *new, void *old)
 {
-   struct nf_conn_nat *new_nat = nf_ct_ext_find(conntrack, NF_CT_EXT_NAT);
-   struct nf_conn_nat *old_nat = (struct nf_conn_nat *)old;
+   struct nf_conn_nat *new_nat = new;
+   struct nf_conn_nat *old_nat = old;
struct nf_conn *ct = old_nat->ct;
 
if (!ct || !(ct->status & IPS_NAT_DONE_MASK))
diff --git a/net/netfilter/nf_conntrack_extend.c 
b/net/netfilter/nf_conntrack_extend.c
index cf6ba66..8b9be1e 100644
--- a/net/netfilter/nf_conntrack_extend.c
+++ b/net/netfilter/nf_conntrack_extend.c
@@ -109,7 +109,8 @@ void *__nf_ct_ext_add(struct nf_conn *ct, enum nf_ct_ext_id 
id, gfp_t gfp)
rcu_read_lock();
t = rcu_dereference(nf_ct_ext_types[i]);
if (t && t->move)
-   t->move(ct, ct->ext + ct->ext->offset[i]);
+   t->move((void *)new + new->offset[i],
+   (void *)ct->ext + ct->ext->offset[i]);
rcu_read_unlock();
}
kfree(ct->ext);


Re: Still oopsing in nf_nat_move_storage()

2008-02-02 Thread Patrick McHardy

Chuck Ebbert wrote:

On 01/31/2008 01:03 PM, Chuck Ebbert wrote:

On 01/29/2008 12:18 PM, Patrick McHardy wrote:

Chuck Ebbert wrote:

nf_nat_move_storage():
/usr/src/debug/kernel-2.6.23/linux-2.6.23.i686/net/ipv4/netfilter/nf_nat_core.c:612

  87:   f7 47 64 80 01 00 00testl  $0x180,0x64(%edi)
  8e:   74 39   je c9


line 612:
if (!(ct->status & IPS_NAT_DONE_MASK))
return;

ct is NULL

The current kernel (and 2.6.23-stable) have:

if (!ct || !(ct->status & IPS_NAT_DONE_MASK))
return;

so it seems you're using an old version.


So, it is now oopsing after the test for NULL and only x86_64 is
catching the invalid address because it is non-canonical. Checking
for NULL is obviously not enough...



The addresses passed to ->move seems to be bogus, we're doing:

  t->move(ct, ct->ext + ct->ext->offset[i]);

without assigning the new ct->ext first, which is wrong for
two reasons:

- the new ext hasn't been assigned to the conntrack yet,
  so its moving within the same extension

- ct->ext + ct->ext->offset[i] should be (void *)ct->ext + ...

I'll fix it and send a patch after some testing. Still wondering
why this wasn't noticed before.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Disable TSO for non standard qdiscs

2008-02-01 Thread Patrick McHardy

Waskiewicz Jr, Peter P wrote:
Indeed. As an example of an unknowing user, this discussion 
made me check whether my cablemodem device (on which I'm 
using HFSC) uses TSO :)


The TSO defer logic is based on your congestion window and current
window size.  So the actual frame sizes hitting your NIC attached to
your DSL probably aren't anywhere near 64KB, but probably more in line
with whatever your window size is for DSL.

The bottom line is TSO saves CPU cycles.  If we want to make it go away
because of a traffic shaping qdisc interfering, then that's fine.  I
just don't think a TSO option should be added to the scheduler layer,
since it already exists in the ethtool layer.  Asking a user to type
'ethtool -k  tso off' is probably going to be much easier
than setting an option on your qdisc through tc to turn TSO back on.

I think we're having more of a disagreement of what is considered the
"normal case" user.  If you are on a slow link, such as a DSL/cable
line, your TCP window/congestion window aren't going to be big enough to
generate large TSO's, so what is the issue?  But disabling TSO, say on a
10 GbE link, can cut throughput by half (I have data on 8-core machines
with 10 GbE with/without TSO if you're interested).  Even on a
single-core machine with a 1GbE link can have bad performance hits.  So
this is why I'm so concerned about a proposal to turn off TSO outside of
the current established methods of using ethtool.  Rather than educating
the user about how to turn TSO back on using tc if they want it, educate
them why they may want to consider turning TSO off in certain
configurations.  And I don't consider any user effectively using a TBF
qdisc someone incapable of understanding how to use ethtool.



We don't want to disable TSO for cases where it makes sense, but
who is using TBF on 10GbE? The point is that most users of qdiscs
which are incapable of dealing with TSO without hacks or special
configuration probably don't care, and 10GbE users know about
ethtool *and* don't use TBF or HTB (which are probably the only
qdiscs which actually have problems, maybe also CBQ).

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Andi Kleen wrote:

The problem with ethtool is that it's a non-obvious nerd knob.  At
the least the ethtool documentation should be updated to indicate that 
activating TSO effects tc accuracy.


TSO tends to be activated by default in the driver; very few people who use it
do even know that ethtool exist or what TSO is.



Indeed. As an example of an unknowing user, this discussion made me
check whether my cablemodem device (on which I'm using HFSC) uses
TSO :)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [2.6 patch] rtnetlink.c: #if 0 no longer used functions

2008-01-31 Thread Patrick McHardy

David Miller wrote:

From: Patrick McHardy <[EMAIL PROTECTED]>
Date: Wed, 30 Jan 2008 21:04:33 +0100


Adrian Bunk wrote:

This patch #if 0's the following no longer used functions:
- rtattr_parse()
- rtattr_strlcpy()
- __rtattr_parse_nested_compat()
  

Please remove them instead.


Agreed.


The rtattr_parse_nested_compat macro can also go away.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Glen Turner wrote:

On Thu, 2008-01-31 at 20:34 +0100, Andi Kleen wrote:


The philosophical problem I have with this suggestion is that I expect
that the large majority of users will be more happy with disabled TSO
if they use non standard qdiscs and defaults that do not fit 
the majority use case are bad.


I wouldn't be so fast to assume that all users need an exact playout
rate, as people seem to do fine with the 8Kbps playout steps in Cisco
IOS.  A nerd-knob which expresses user's preference in the
accuracy/performance trade-off would be nice.

The problem with ethtool is that it's a non-obvious nerd knob.  At
the least the ethtool documentation should be updated to indicate that 
activating TSO effects tc accuracy.



I agree with Andi, most user neither know nor care about TSO.
It should work properly by default and optimizations should
be explicitly configured. This is especially true if you
consider the common userbase of qdiscs - which is mostly
slow DSL lines, cablemodems etc.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] vlan tag match

2008-01-31 Thread Patrick McHardy

Stephen Hemminger wrote:

Provide a way to use tc filters on vlan tag even if tag is buried in
skb due to hardware acceleration.


Looks reasonable. Would you like to add the same feature to the
flow classifier?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Waskiewicz Jr, Peter P wrote:

Well, it could be just that when using such qdiscs TSO would be
disabled, but the user could override this by using ethtool after
loading the qdiscs.


I still disagree with this.  The qdisc should not cause anything to happen to 
feature flags on the device. It's the scheduling layer and really shouldn't 
care about what features the device supports or not.  If someone has an issue 
with a feature hurting performance or causing odd behavior when using a qdisc, 
then they should disable the feature on the device using the appropriate tools 
provided.  If it's the qdisc causing issues, then either the qdisc needs to be 
fixed, or it should be documented what features are recommended to be on and 
off with the qdisc.  I don't agree that the scheduling layer should affect 
features on an underlying device.


Andi's patch made the TSO capable flag a property of the
qdisc (not the ops), so it could still be explicitly
configured by the user.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Andi Kleen wrote:

On Thu, Jan 31, 2008 at 07:21:20PM +0100, Patrick McHardy wrote:

Andi Kleen wrote:

Then change TBF to use skb_gso_segment?  Be careful, the fact that

That doesn't help because it wants to interleave packets
>from different streams to get everything fair and smooth. The only 
good way to handle that is to split it up and the simplest way to do 
this is to just tell TCP to not do GSO in the first place.


Thats not correct, TBF keeps packets strictly ordered unless


My point was that without TSO different submitters will interleave
their streams (because they compete about the qdisc submission) 
and then you end up with a smooth rate over time for all of them.


If you submit in large chunks only (as TSO does) it will always 
be more bursty and that works against the TBF goal.


For a single submitter you would be correct.



The TBF goal is not really fairness among different flows, but
I agree, avoiding TSO in the first place seems to make more sense.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: hard hang through qdisc

2008-01-31 Thread Patrick McHardy

Patrick McHardy wrote:

Andi Kleen wrote:


Can you please try with the above config?



I'll give it a try later.



I took all options from that config that seemed possibly
related (qdiscs, no hrtimers, no nohz, slab, ...), but
still can't reproduce it.

Does it also crash if you use more reasonable parameters?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Andi Kleen wrote:

Then change TBF to use skb_gso_segment?  Be careful, the fact that


That doesn't help because it wants to interleave packets
from different streams to get everything fair and smooth. The only 
good way to handle that is to split it up and the simplest way to do 
this is to just tell TCP to not do GSO in the first place.



Thats not correct, TBF keeps packets strictly ordered unless
an inner qdisc does reordering. But even then (let say you use
SFQ) packets of a single flow will stay ordered. Segmenting
TSO packets is no different than having them arrive independantly
for other reasons.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Stephen Hemminger wrote:

On Thu, 31 Jan 2008 19:37:35 +0100
Andi Kleen <[EMAIL PROTECTED]> wrote:


On Thu, Jan 31, 2008 at 07:01:00PM +0100, Patrick McHardy wrote:

Andi Kleen wrote:

Fix the broken qdisc instead.

What do you mean? I don't think the qdiscs are broken.
I cannot think of any way how e.g. TBF can do anything useful
with large TSO packets.


Someone posted a patch some time ago to calculate the amount
of tokens needed in max_size portions and use that, but IMO
people should just configure TBF with the proper MTU for TSO.

TBF with 64k atomic units will always be chunky and uneven. I don't
think that's a useful goal. 


-Andi


Then change TBF to use skb_gso_segment?  Be careful, the fact that
one skb ends up queueing multiple skb's would cause issues to parent
qdisc (ie work generating qdisc).



How about keeping the TSO-capable flag on qdiscs, propagating
the non-capability up the tree and perform segmentation before
queueing to the root?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[IPROUTE 02/02]: Add flow classifier support

2008-01-31 Thread Patrick McHardy
 [IPROUTE]: Add flow classifier support

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit ac3df2d7e37826b06cc9093f50d829a9da1873a4
tree b33a2b29abdcea0267fe7a357d282a4c2f67124b
parent 196870f762ee393438c42115425f4af69e5b5186
author Patrick McHardy <[EMAIL PROTECTED]> Thu, 31 Jan 2008 18:52:47 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 31 Jan 2008 18:52:47 +0100

 include/linux/pkt_cls.h |   50 +++
 tc/Makefile |1 
 tc/f_flow.c |  347 +++
 3 files changed, 398 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index afb79d0..16869c2 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -328,6 +328,56 @@ enum
 
 #define TCA_TCINDEX_MAX (__TCA_TCINDEX_MAX - 1)
 
+/* Flow filter */
+
+enum
+{
+	FLOW_KEY_SRC,
+	FLOW_KEY_DST,
+	FLOW_KEY_PROTO,
+	FLOW_KEY_PROTO_SRC,
+	FLOW_KEY_PROTO_DST,
+	FLOW_KEY_IIF,
+	FLOW_KEY_PRIORITY,
+	FLOW_KEY_MARK,
+	FLOW_KEY_NFCT,
+	FLOW_KEY_NFCT_SRC,
+	FLOW_KEY_NFCT_DST,
+	FLOW_KEY_NFCT_PROTO_SRC,
+	FLOW_KEY_NFCT_PROTO_DST,
+	FLOW_KEY_RTCLASSID,
+	FLOW_KEY_SKUID,
+	FLOW_KEY_SKGID,
+	__FLOW_KEY_MAX,
+};
+
+#define FLOW_KEY_MAX	(__FLOW_KEY_MAX - 1)
+
+enum
+{
+	FLOW_MODE_MAP,
+	FLOW_MODE_HASH,
+};
+
+enum
+{
+	TCA_FLOW_UNSPEC,
+	TCA_FLOW_KEYS,
+	TCA_FLOW_MODE,
+	TCA_FLOW_BASECLASS,
+	TCA_FLOW_RSHIFT,
+	TCA_FLOW_ADDEND,
+	TCA_FLOW_MASK,
+	TCA_FLOW_XOR,
+	TCA_FLOW_DIVISOR,
+	TCA_FLOW_ACT,
+	TCA_FLOW_POLICE,
+	TCA_FLOW_EMATCHES,
+	__TCA_FLOW_MAX
+};
+
+#define TCA_FLOW_MAX	(__TCA_FLOW_MAX - 1)
+
 /* Basic filter */
 
 enum
diff --git a/tc/Makefile b/tc/Makefile
index 0facc88..7ece958 100644
--- a/tc/Makefile
+++ b/tc/Makefile
@@ -18,6 +18,7 @@ TCMODULES += f_u32.o
 TCMODULES += f_route.o
 TCMODULES += f_fw.o
 TCMODULES += f_basic.o
+TCMODULES += f_flow.o
 TCMODULES += q_dsmark.o
 TCMODULES += q_gred.o
 TCMODULES += f_tcindex.o
diff --git a/tc/f_flow.c b/tc/f_flow.c
new file mode 100644
index 000..eca05cd
--- /dev/null
+++ b/tc/f_flow.c
@@ -0,0 +1,347 @@
+/*
+ * f_flow.c		Flow filter
+ *
+ * 		This program is free software; you can redistribute it and/or
+ * 		modify it under the terms of the GNU General Public License
+ * 		as published by the Free Software Foundation; either version
+ * 		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Patrick McHardy <[EMAIL PROTECTED]>
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "utils.h"
+#include "tc_util.h"
+#include "m_ematch.h"
+
+static void explain(void)
+{
+	fprintf(stderr,
+"Usage: ... flow ...\n"
+"\n"
+" [mapping mode]: map key KEY [ OPS ] ...\n"
+" [hashing mode]: hash keys KEY-LIST ...\n"
+"\n"
+" [ divisor NUM ] [ baseclass ID ] [ match EMATCH_TREE ]\n"
+" [ police POLICE_SPEC ] [ action ACTION_SPEC ]\n"
+"\n"
+"KEY-LIST := [ KEY-LIST , ] KEY\n"
+"KEY  := [ src | dst | proto | proto-src | proto-dst | iif | priority | \n"
+"  mark | nfct | nfct-src | nfct-dst | nfct-proto-src | \n"
+"  nfct-proto-dst | rt-classid | sk-uid | sk-gid ]\n"
+"OPS  := [ or NUM | and NUM | xor NUM | rshift NUM | addend NUM ]\n"
+"ID   := X:Y\n"
+	);
+}
+
+static const char *flow_keys[FLOW_KEY_MAX+1] = {
+	[FLOW_KEY_SRC]			= "src",
+	[FLOW_KEY_DST]			= "dst",
+	[FLOW_KEY_PROTO]		= "proto",
+	[FLOW_KEY_PROTO_SRC]		= "proto-src",
+	[FLOW_KEY_PROTO_DST]		= "proto-dst",
+	[FLOW_KEY_IIF]			= "iif",
+	[FLOW_KEY_PRIORITY]		= "priority",
+	[FLOW_KEY_MARK]			= "mark",
+	[FLOW_KEY_NFCT]			= "nfct",
+	[FLOW_KEY_NFCT_SRC]		= "nfct-src",
+	[FLOW_KEY_NFCT_DST]		= "nfct-dst",
+	[FLOW_KEY_NFCT_PROTO_SRC]	= "nfct-proto-src",
+	[FLOW_KEY_NFCT_PROTO_DST]	= "nfct-proto-dst",
+	[FLOW_KEY_RTCLASSID]		= "rt-classid",
+	[FLOW_KEY_SKUID]		= "sk-uid",
+	[FLOW_KEY_SKGID]		= "sk-gid",
+};
+
+static int flow_parse_keys(__u32 *keys, __u32 *nkeys, char *argv)
+{
+	char *s, *sep;
+	unsigned int i;
+
+	*keys = 0;
+	*nkeys = 0;
+	s = argv;
+	while (s != NULL) {
+		sep = strchr(s, ',');
+		if (sep)
+			*sep = '\0';
+
+		for (i = 0; i <= FLOW_KEY_MAX; i++) {
+			if (matches(s, flow_keys[i]) == 0) {
+*keys |= 1 << i;
+(*nkeys)++;
+break;
+			}
+		}
+		if (i > FLOW_KEY_MAX) {
+			fprintf(stderr, "Unknown flow key \"%s\"\n", s);
+			return -1;
+		}
+		s = sep ? sep + 1 : NULL;
+	}
+	return 0;
+}
+
+static void transfer_bitop(__u32 *mask, __u32 *xor, __u32 m, __u32 x)
+{
+	*xor = x ^ (*xor & m);
+	*mask &= m;
+}
+
+static int get_addend(__u32 *addend, char *argv, __u32 keys)
+{
+	inet_prefi

Re: [PATCH] Disable TSO for non standard qdiscs

2008-01-31 Thread Patrick McHardy

Andi Kleen wrote:

Fix the broken qdisc instead.


What do you mean? I don't think the qdiscs are broken.
I cannot think of any way how e.g. TBF can do anything useful
with large TSO packets.



Someone posted a patch some time ago to calculate the amount
of tokens needed in max_size portions and use that, but IMO
people should just configure TBF with the proper MTU for TSO.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[IPROUTE 01/02]: Add support for SFQ xstats

2008-01-31 Thread Patrick McHardy
 [IPROUTE]: Add support for SFQ xstats

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit 196870f762ee393438c42115425f4af69e5b5186
tree 5650c1f93cc58886f8f97a0e55e374c157b96e2e
parent 54bb35c69cec6c730a4ac95530a1d2ca6670f73b
author Patrick McHardy <[EMAIL PROTECTED]> Thu, 31 Jan 2008 15:10:07 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 31 Jan 2008 15:10:07 +0100

 include/linux/pkt_sched.h |5 +
 tc/q_sfq.c|   17 +
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 3276135..4ccd684 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -150,6 +150,11 @@ struct tc_sfq_qopt
 	unsigned	flows;		/* Maximal number of flows  */
 };
 
+struct tc_sfq_xstats
+{
+	__u32		allot;
+};
+
 /*
  *  NOTE: limit, divisor and flows are hardwired to code at the moment.
  *
diff --git a/tc/q_sfq.c b/tc/q_sfq.c
index 05385cf..ce4dade 100644
--- a/tc/q_sfq.c
+++ b/tc/q_sfq.c
@@ -100,8 +100,25 @@ static int sfq_print_opt(struct qdisc_util *qu, FILE *f, struct rtattr *opt)
 	return 0;
 }
 
+static int sfq_print_xstats(struct qdisc_util *qu, FILE *f,
+			struct rtattr *xstats)
+{
+	struct tc_sfq_xstats *st;
+
+	if (xstats == NULL)
+		return 0;
+	if (RTA_PAYLOAD(xstats) < sizeof(*st))
+		return -1;
+	st = RTA_DATA(xstats);
+
+	fprintf(f, " allot %d ", st->allot);
+	fprintf(f, "\n");
+	return 0;
+}
+
 struct qdisc_util sfq_qdisc_util = {
 	.id		= "sfq",
 	.parse_qopt	= sfq_parse_opt,
 	.print_qopt	= sfq_print_opt,
+	.print_xstats	= sfq_print_xstats,
 };


[NET_SCHED 04/04]: Add flow classifier

2008-01-31 Thread Patrick McHardy
[NET_SCHED]: Add flow classifier

Add new "flow" classifier, which is meant to extend the SFQ hashing
capabilities without hard-coding new hash functions and also allows
deterministic mappings of keys to classes, replacing some out of tree
iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps
IPs to marks, with fw filters to classes), ...

Some examples:

- Classic SFQ hash:

  tc filter add ... flow hash \
keys src,dst,proto,proto-src,proto-dst divisor 1024

- Classic SFQ hash, but using information from conntrack to work properly in
  combination with NAT:

  tc filter add ... flow hash \
keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024

- Map destination IPs of 192.168.0.0/24 to classids 1-257:

  tc filter add ... flow map \
key dst addend -192.168.0.0 divisor 256

- alternatively:

  tc filter add ... flow map \
key dst and 0xff

- similar, but reverse ordered:

  tc filter add ... flow map \
key dst and 0xff xor 0xff

Perturbation is currently not supported because we can't reliable kill the
timer on destruction.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit 91a3a09ce63cba8df30ac42133a40dd64c0a7259
tree 2572feb8ffd88e6abf9270d2137af2a4cf7f542a
parent 7a281f8ef334a35d699682315e9f80a3e006376c
author Patrick McHardy <[EMAIL PROTECTED]> Wed, 30 Jan 2008 21:59:31 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 31 Jan 2008 18:52:56 +0100

 include/linux/pkt_cls.h |   50 
 net/sched/Kconfig   |   11 +
 net/sched/Makefile  |1 
 net/sched/cls_flow.c|  660 +++
 4 files changed, 722 insertions(+), 0 deletions(-)

diff --git a/include/linux/pkt_cls.h b/include/linux/pkt_cls.h
index 30b8571..1c1dba9 100644
--- a/include/linux/pkt_cls.h
+++ b/include/linux/pkt_cls.h
@@ -328,6 +328,56 @@ enum
 
 #define TCA_TCINDEX_MAX (__TCA_TCINDEX_MAX - 1)
 
+/* Flow filter */
+
+enum
+{
+   FLOW_KEY_SRC,
+   FLOW_KEY_DST,
+   FLOW_KEY_PROTO,
+   FLOW_KEY_PROTO_SRC,
+   FLOW_KEY_PROTO_DST,
+   FLOW_KEY_IIF,
+   FLOW_KEY_PRIORITY,
+   FLOW_KEY_MARK,
+   FLOW_KEY_NFCT,
+   FLOW_KEY_NFCT_SRC,
+   FLOW_KEY_NFCT_DST,
+   FLOW_KEY_NFCT_PROTO_SRC,
+   FLOW_KEY_NFCT_PROTO_DST,
+   FLOW_KEY_RTCLASSID,
+   FLOW_KEY_SKUID,
+   FLOW_KEY_SKGID,
+   __FLOW_KEY_MAX,
+};
+
+#define FLOW_KEY_MAX   (__FLOW_KEY_MAX - 1)
+
+enum
+{
+   FLOW_MODE_MAP,
+   FLOW_MODE_HASH,
+};
+
+enum
+{
+   TCA_FLOW_UNSPEC,
+   TCA_FLOW_KEYS,
+   TCA_FLOW_MODE,
+   TCA_FLOW_BASECLASS,
+   TCA_FLOW_RSHIFT,
+   TCA_FLOW_ADDEND,
+   TCA_FLOW_MASK,
+   TCA_FLOW_XOR,
+   TCA_FLOW_DIVISOR,
+   TCA_FLOW_ACT,
+   TCA_FLOW_POLICE,
+   TCA_FLOW_EMATCHES,
+   __TCA_FLOW_MAX
+};
+
+#define TCA_FLOW_MAX   (__TCA_FLOW_MAX - 1)
+
 /* Basic filter */
 
 enum
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 87af7c9..bccf42b 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -307,6 +307,17 @@ config NET_CLS_RSVP6
  To compile this code as a module, choose M here: the
  module will be called cls_rsvp6.
 
+config NET_CLS_FLOW
+   tristate "Flow classifier"
+   select NET_CLS
+   ---help---
+ If you say Y here, you will be able to classify packets based on
+ a configurable combination of packet keys. This is mostly useful
+ in combination with SFQ.
+
+ To compile this code as a module, choose M here: the
+ module will be called cls_flow.
+
 config NET_EMATCH
bool "Extended Matches"
select NET_CLS
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 81ecbe8..1d2b0f7 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -35,6 +35,7 @@ obj-$(CONFIG_NET_CLS_RSVP)+= cls_rsvp.o
 obj-$(CONFIG_NET_CLS_TCINDEX)  += cls_tcindex.o
 obj-$(CONFIG_NET_CLS_RSVP6)+= cls_rsvp6.o
 obj-$(CONFIG_NET_CLS_BASIC)+= cls_basic.o
+obj-$(CONFIG_NET_CLS_FLOW) += cls_flow.o
 obj-$(CONFIG_NET_EMATCH)   += ematch.o
 obj-$(CONFIG_NET_EMATCH_CMP)   += em_cmp.o
 obj-$(CONFIG_NET_EMATCH_NBYTE) += em_nbyte.o
diff --git a/net/sched/cls_flow.c b/net/sched/cls_flow.c
new file mode 100644
index 000..5a7f6a3
--- /dev/null
+++ b/net/sched/cls_flow.c
@@ -0,0 +1,660 @@
+/*
+ * net/sched/cls_flow.cGeneric flow classifier
+ *
+ * Copyright (c) 2007, 2008 Patrick McHardy <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#if defined(CONFIG_NF_CONNTRACK

[NET_SCHED 02/04]: sch_sfq: add support for external classifiers

2008-01-31 Thread Patrick McHardy
[NET_SCHED]: sch_sfq: add support for external classifiers

Add support for external classifiers to allow using different flow hash
functions similar to ESFQ. When no classifier is attached the built-in
hash is used as before.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit 6049892cc4acca9af393e134e4cdaf6b3e1ccad9
tree 9a8347d45808de2aef14486e5792fcab58baf3fe
parent 12e33ddf57910b685501df10bd92223ea9b98fd6
author Patrick McHardy <[EMAIL PROTECTED]> Wed, 30 Jan 2008 21:59:27 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 31 Jan 2008 18:52:55 +0100

 net/sched/sch_sfq.c |   95 +--
 1 files changed, 91 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index 91af539..d818d19 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -95,6 +95,7 @@ struct sfq_sched_data
int limit;
 
 /* Variables */
+   struct tcf_proto *filter_list;
struct timer_list perturb_timer;
u32 perturbation;
sfq_index   tail;   /* Index of current slot in round */
@@ -155,6 +156,39 @@ static unsigned sfq_hash(struct sfq_sched_data *q, struct 
sk_buff *skb)
return sfq_fold_hash(q, h, h2);
 }
 
+static unsigned int sfq_classify(struct sk_buff *skb, struct Qdisc *sch,
+int *qerr)
+{
+   struct sfq_sched_data *q = qdisc_priv(sch);
+   struct tcf_result res;
+   int result;
+
+   if (TC_H_MAJ(skb->priority) == sch->handle &&
+   TC_H_MIN(skb->priority) > 0 &&
+   TC_H_MIN(skb->priority) <= SFQ_HASH_DIVISOR)
+   return TC_H_MIN(skb->priority);
+
+   if (!q->filter_list)
+   return sfq_hash(q, skb) + 1;
+
+   *qerr = NET_XMIT_BYPASS;
+   result = tc_classify(skb, q->filter_list, &res);
+   if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+   switch (result) {
+   case TC_ACT_STOLEN:
+   case TC_ACT_QUEUED:
+   *qerr = NET_XMIT_SUCCESS;
+   case TC_ACT_SHOT:
+   return 0;
+   }
+#endif
+   if (TC_H_MIN(res.classid) <= SFQ_HASH_DIVISOR)
+   return TC_H_MIN(res.classid);
+   }
+   return 0;
+}
+
 static inline void sfq_link(struct sfq_sched_data *q, sfq_index x)
 {
sfq_index p, n;
@@ -245,8 +279,18 @@ static int
 sfq_enqueue(struct sk_buff *skb, struct Qdisc *sch)
 {
struct sfq_sched_data *q = qdisc_priv(sch);
-   unsigned hash = sfq_hash(q, skb);
+   unsigned int hash;
sfq_index x;
+   int ret;
+
+   hash = sfq_classify(skb, sch, &ret);
+   if (hash == 0) {
+   if (ret == NET_XMIT_BYPASS)
+   sch->qstats.drops++;
+   kfree_skb(skb);
+   return ret;
+   }
+   hash--;
 
x = q->ht[hash];
if (x == SFQ_DEPTH) {
@@ -289,8 +333,18 @@ static int
 sfq_requeue(struct sk_buff *skb, struct Qdisc *sch)
 {
struct sfq_sched_data *q = qdisc_priv(sch);
-   unsigned hash = sfq_hash(q, skb);
+   unsigned int hash;
sfq_index x;
+   int ret;
+
+   hash = sfq_classify(skb, sch, &ret);
+   if (hash == 0) {
+   if (ret == NET_XMIT_BYPASS)
+   sch->qstats.drops++;
+   kfree_skb(skb);
+   return ret;
+   }
+   hash--;
 
x = q->ht[hash];
if (x == SFQ_DEPTH) {
@@ -465,6 +519,8 @@ static int sfq_init(struct Qdisc *sch, struct nlattr *opt)
 static void sfq_destroy(struct Qdisc *sch)
 {
struct sfq_sched_data *q = qdisc_priv(sch);
+
+   tcf_destroy_chain(q->filter_list);
del_timer(&q->perturb_timer);
 }
 
@@ -490,9 +546,40 @@ nla_put_failure:
return -1;
 }
 
+static int sfq_change_class(struct Qdisc *sch, u32 classid, u32 parentid,
+   struct nlattr **tca, unsigned long *arg)
+{
+   return -EOPNOTSUPP;
+}
+
+static unsigned long sfq_get(struct Qdisc *sch, u32 classid)
+{
+   return 0;
+}
+
+static struct tcf_proto **sfq_find_tcf(struct Qdisc *sch, unsigned long cl)
+{
+   struct sfq_sched_data *q = qdisc_priv(sch);
+
+   if (cl)
+   return NULL;
+   return &q->filter_list;
+}
+
+static void sfq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+   return;
+}
+
+static const struct Qdisc_class_ops sfq_class_ops = {
+   .get=   sfq_get,
+   .change =   sfq_change_class,
+   .tcf_chain  =   sfq_find_tcf,
+   .walk   =   sfq_walk,
+};
+
 static struct Qdisc_ops sfq_qdisc_ops __read_mostly = {
-   .next   =   NULL,
-   .cl_ops =   NULL,
+   .cl_ops =   &sfq_class_ops,
.id =   "sf

[NET_SCHED 03/04]: sch_sfq: make internal queues visible as classes

2008-01-31 Thread Patrick McHardy
[NET_SCHED]: sch_sfq: make internal queues visible as classes

Add support for dumping statistics and make internal queues visible
as classes.

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit 7a281f8ef334a35d699682315e9f80a3e006376c
tree 0a2cbd55e22f1913e9cf0cc28da2956952110243
parent 6049892cc4acca9af393e134e4cdaf6b3e1ccad9
author Patrick McHardy <[EMAIL PROTECTED]> Wed, 30 Jan 2008 21:59:29 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 31 Jan 2008 18:52:56 +0100

 include/linux/pkt_sched.h |5 +
 net/sched/sch_sfq.c   |   41 -
 2 files changed, 45 insertions(+), 1 deletions(-)

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 3276135..dbb7ac3 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -150,6 +150,11 @@ struct tc_sfq_qopt
unsignedflows;  /* Maximal number of flows  */
 };
 
+struct tc_sfq_xstats
+{
+   __s32   allot;
+};
+
 /*
  *  NOTE: limit, divisor and flows are hardwired to code at the moment.
  *
diff --git a/net/sched/sch_sfq.c b/net/sched/sch_sfq.c
index d818d19..a20e2ef 100644
--- a/net/sched/sch_sfq.c
+++ b/net/sched/sch_sfq.c
@@ -566,15 +566,54 @@ static struct tcf_proto **sfq_find_tcf(struct Qdisc *sch, 
unsigned long cl)
return &q->filter_list;
 }
 
+static int sfq_dump_class(struct Qdisc *sch, unsigned long cl,
+ struct sk_buff *skb, struct tcmsg *tcm)
+{
+   tcm->tcm_handle |= TC_H_MIN(cl);
+   return 0;
+}
+
+static int sfq_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+   struct gnet_dump *d)
+{
+   struct sfq_sched_data *q = qdisc_priv(sch);
+   sfq_index idx = q->ht[cl-1];
+   struct gnet_stats_queue qs = { .qlen = q->qs[idx].qlen };
+   struct tc_sfq_xstats xstats = { .allot = q->allot[idx] };
+
+   if (gnet_stats_copy_queue(d, &qs) < 0)
+   return -1;
+   return gnet_stats_copy_app(d, &xstats, sizeof(xstats));
+}
+
 static void sfq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
 {
-   return;
+   struct sfq_sched_data *q = qdisc_priv(sch);
+   unsigned int i;
+
+   if (arg->stop)
+   return;
+
+   for (i = 0; i < SFQ_HASH_DIVISOR; i++) {
+   if (q->ht[i] == SFQ_DEPTH ||
+   arg->count < arg->skip) {
+   arg->count++;
+   continue;
+   }
+   if (arg->fn(sch, i + 1, arg) < 0) {
+   arg->stop = 1;
+   break;
+   }
+   arg->count++;
+   }
 }
 
 static const struct Qdisc_class_ops sfq_class_ops = {
.get=   sfq_get,
.change =   sfq_change_class,
.tcf_chain  =   sfq_find_tcf,
+   .dump   =   sfq_dump_class,
+   .dump_stats =   sfq_dump_class_stats,
.walk   =   sfq_walk,
 };
 
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[NET_SCHED 00/04]: External SFQ classifiers/flow classifier

2008-01-31 Thread Patrick McHardy
These patches add support for external classifiers to SFQ and add a
new "flow" classifier, which can do hashing based on user-specified
keys or deterministic mapping of keys to classes. Additionally there
is a patch to make the SFQ queues visisble as classes to verify that
the hash is indeed doing something useful and a patch to consifiy
struct tcf_ext_map, which I had queued in the same tree.

Please apply, thanks.


 include/linux/pkt_cls.h   |   50 
 include/linux/pkt_sched.h |5 +
 include/net/pkt_cls.h |6 +-
 net/sched/Kconfig |   11 +
 net/sched/Makefile|1 +
 net/sched/cls_api.c   |6 +-
 net/sched/cls_basic.c |2 +-
 net/sched/cls_flow.c  |  660 +
 net/sched/cls_fw.c|2 +-
 net/sched/cls_route.c |2 +-
 net/sched/cls_tcindex.c   |2 +-
 net/sched/cls_u32.c   |2 +-
 net/sched/sch_sfq.c   |  134 +-
 13 files changed, 868 insertions(+), 15 deletions(-)
 create mode 100644 net/sched/cls_flow.c

Patrick McHardy (4):
  [NET_SCHED]: Constify struct tcf_ext_map
  [NET_SCHED]: sch_sfq: add support for external classifiers
  [NET_SCHED]: sch_sfq: make internal queues visible as classes
  [NET_SCHED]: Add flow classifier
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[NET_SCHED 01/04]: Constify struct tcf_ext_map

2008-01-31 Thread Patrick McHardy
[NET_SCHED]: Constify struct tcf_ext_map

Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]>

---
commit 12e33ddf57910b685501df10bd92223ea9b98fd6
tree 1ce47c7b6b6b968940f3dc28f9d7839e78c85089
parent 8af03e782cae1e0a0f530ddd22301cdd12cf9dc0
author Patrick McHardy <[EMAIL PROTECTED]> Wed, 30 Jan 2008 21:59:26 +0100
committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 31 Jan 2008 18:52:55 +0100

 include/net/pkt_cls.h   |6 +++---
 net/sched/cls_api.c |6 +++---
 net/sched/cls_basic.c   |2 +-
 net/sched/cls_fw.c  |2 +-
 net/sched/cls_route.c   |2 +-
 net/sched/cls_tcindex.c |2 +-
 net/sched/cls_u32.c |2 +-
 7 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 8716eb7..d349c66 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -131,14 +131,14 @@ tcf_exts_exec(struct sk_buff *skb, struct tcf_exts *exts,
 
 extern int tcf_exts_validate(struct tcf_proto *tp, struct nlattr **tb,
 struct nlattr *rate_tlv, struct tcf_exts *exts,
-struct tcf_ext_map *map);
+const struct tcf_ext_map *map);
 extern void tcf_exts_destroy(struct tcf_proto *tp, struct tcf_exts *exts);
 extern void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts *dst,
 struct tcf_exts *src);
 extern int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts,
-struct tcf_ext_map *map);
+const struct tcf_ext_map *map);
 extern int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts,
-  struct tcf_ext_map *map);
+  const struct tcf_ext_map *map);
 
 /**
  * struct tcf_pkt_info - packet information
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3377ca0..0fbedca 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -482,7 +482,7 @@ EXPORT_SYMBOL(tcf_exts_destroy);
 
 int tcf_exts_validate(struct tcf_proto *tp, struct nlattr **tb,
  struct nlattr *rate_tlv, struct tcf_exts *exts,
- struct tcf_ext_map *map)
+ const struct tcf_ext_map *map)
 {
memset(exts, 0, sizeof(*exts));
 
@@ -535,7 +535,7 @@ void tcf_exts_change(struct tcf_proto *tp, struct tcf_exts 
*dst,
 EXPORT_SYMBOL(tcf_exts_change);
 
 int tcf_exts_dump(struct sk_buff *skb, struct tcf_exts *exts,
- struct tcf_ext_map *map)
+ const struct tcf_ext_map *map)
 {
 #ifdef CONFIG_NET_CLS_ACT
if (map->action && exts->action) {
@@ -571,7 +571,7 @@ EXPORT_SYMBOL(tcf_exts_dump);
 
 
 int tcf_exts_dump_stats(struct sk_buff *skb, struct tcf_exts *exts,
-   struct tcf_ext_map *map)
+   const struct tcf_ext_map *map)
 {
 #ifdef CONFIG_NET_CLS_ACT
if (exts->action)
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index bfb4342..956915c 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -35,7 +35,7 @@ struct basic_filter
struct list_headlink;
 };
 
-static struct tcf_ext_map basic_ext_map = {
+static const struct tcf_ext_map basic_ext_map = {
.action = TCA_BASIC_ACT,
.police = TCA_BASIC_POLICE
 };
diff --git a/net/sched/cls_fw.c b/net/sched/cls_fw.c
index 436a6e7..b0f90e5 100644
--- a/net/sched/cls_fw.c
+++ b/net/sched/cls_fw.c
@@ -47,7 +47,7 @@ struct fw_filter
struct tcf_exts exts;
 };
 
-static struct tcf_ext_map fw_ext_map = {
+static const struct tcf_ext_map fw_ext_map = {
.action = TCA_FW_ACT,
.police = TCA_FW_POLICE
 };
diff --git a/net/sched/cls_route.c b/net/sched/cls_route.c
index f7e7d39..784dcb8 100644
--- a/net/sched/cls_route.c
+++ b/net/sched/cls_route.c
@@ -62,7 +62,7 @@ struct route4_filter
 
 #define ROUTE4_FAILURE ((struct route4_filter*)(-1L))
 
-static struct tcf_ext_map route_ext_map = {
+static const struct tcf_ext_map route_ext_map = {
.police = TCA_ROUTE4_POLICE,
.action = TCA_ROUTE4_ACT
 };
diff --git a/net/sched/cls_tcindex.c b/net/sched/cls_tcindex.c
index ee60b2d..7a7bff5 100644
--- a/net/sched/cls_tcindex.c
+++ b/net/sched/cls_tcindex.c
@@ -55,7 +55,7 @@ struct tcindex_data {
int fall_through;   /* 0: only classify if explicit match */
 };
 
-static struct tcf_ext_map tcindex_ext_map = {
+static const struct tcf_ext_map tcindex_ext_map = {
.police = TCA_TCINDEX_POLICE,
.action = TCA_TCINDEX_ACT
 };
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index e8a7756..b18fa95 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -82,7 +82,7 @@ struct tc_u_common
u32 hgenerator;
 };
 
-static struct tcf_ext_map u32_ext_map = {
+static const struct tcf_ext_map u32_ext_map = {
.action = TCA_U32_ACT,
.police = TCA_U32_POLICE
 };
--
To unsubscribe from this list: send the line 

  1   2   3   4   5   6   7   8   9   10   >