Re: e100 problems in .23rc8 ?
Kok, Auke [EMAIL PROTECTED] wrote:
> Dave Jones wrote:
> > Last night, I hit this bug during boot up..
> > http://www.codemonkey.org.uk/junk/e100-2.jpg
> >
> > This morning, I got a mail from a Fedora user of the same .23-rc8
> > based kernel that has seen a different trace also implicating e100..
> > http://www.codemonkey.org.uk/junk/e100.jpg
> >
> > It may be that the two problems are unrelated, and it's just
> > coincidence that both reports happen to be on an e100, but the
> > timing is odd. Have there been other reports of similar problems
> > recently ?
>
> there hasn't been a change to e100 in two months now - perhaps
> something slipped into the stack that broke it? If this reproduces,
> could you bisect?

Well this looks exactly like the e1000 race that we fixed around the
time of the last kernel release. That fix never made it into e100 so
it's no surprise that we get a similar crash here.

The problem is that if a spurious interrupt comes in between
request_irq and netif_poll_enable then you'll get a crash at the next
netif_rx_complete.

It'd be good if this were reproducible as it would allow us to identify
the source of the spurious interrupt, which may well be caused by an
unrelated bug somewhere else.

In any case, e100 should be prepared to deal with spurious interrupts
as e1000 has been fixed to do.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
FUJITA Tomonori wrote:
> Yeah, we could nicely handle lld's restrictions (especially with
> stacking devices). But iommu code needs only max_segment_size and
> seg_boundary_mask, right? If so, the first simple approach to add two
> values to device structure is not so bad, I think.

(replying to slightly older email in the thread)
(added benh, since we've discussed this issue in the past)

dumb question, what happened to seg_boundary_mask?

If you look at drivers/ata/libata-core.c:ata_fill_sg(), you will note
that we split s/g segments after DMA-mapping. Looking at libata LLDD's,
you will also note judicious use of ATA_DMA_BOUNDARY (0x).

It was drilled into my head by James and benh that I cannot rely on the
DMA boundary + block/scsi + dma_map_sg() to ensure that my S/G segments
never cross a 64K boundary, a legacy IDE requirement. Thus the
additional code in ata_fill_sg() to split S/G segments straddling 64K,
in addition to setting dma boundary to 0x.

A key problem I was hoping would be solved with your work here was the
elimination of that post dma_map_sg() split.

If I understood James and Ben correctly, one of the key problems was
always in communicating libata's segment boundary needs to the IOMMU
layers?

	Jeff
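For readers following along, the post-mapping split Jeff describes can be sketched in plain userspace C. This is only an illustration of the idea, not the code in ata_fill_sg(): the struct, the function name, and the 0xffff mask are assumptions made for the example (the mask is chosen to match the 64K boundary mentioned in the message).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One DMA-mapped scatter/gather element: bus address and length in bytes.
 * Hypothetical stand-in for the kernel's struct scatterlist. */
struct sg_seg {
	uint64_t addr;
	uint32_t len;
};

/* Assumed mask for a 64K window; illustrative only. */
#define BOUNDARY_MASK 0xffffu

/*
 * Split one segment so that no piece crosses a 64K boundary.
 * Writes at most max_out pieces to out[] and returns the number of
 * pieces the segment needs, which may exceed max_out if the caller's
 * table is too small.
 */
static size_t split_at_boundary(struct sg_seg in, struct sg_seg *out,
				size_t max_out)
{
	size_t n = 0;

	assert(out != NULL || max_out == 0);

	while (in.len > 0) {
		/* Bytes left before the next 64K boundary (1..65536). */
		uint32_t room = (uint32_t)(BOUNDARY_MASK + 1u -
					   (in.addr & BOUNDARY_MASK));
		uint32_t chunk = in.len < room ? in.len : room;

		if (n < max_out) {
			out[n].addr = in.addr;
			out[n].len = chunk;
		}
		n++;
		in.addr += chunk;
		in.len -= chunk;
	}
	return n;
}
```

Returning the required count even when the output table is too small lets a caller detect that it must grow the table, which is the sizing problem raised later in this thread.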
[PKT_SCHED]: Add stateless NAT
Hi:

[PKT_SCHED]: Add stateless NAT

Stateless NAT is useful in controlled environments where restrictions
are placed on through traffic such that we don't need connection
tracking to correctly NAT protocol-specific data.

In particular, this is of interest when the number of flows or the
number of addresses being NATed is large, or if connection tracking
information has to be replicated and where it is not practical to do so.

Previously we had stateless NAT functionality which was integrated into
the IPv4 routing subsystem. This was a great solution as long as the
NAT worked on a subnet to subnet basis such that the number of NAT
rules was relatively small. The reason is that for SNAT the routing
based system had to perform a linear scan through the rules.

If the number of rules is large then major renovations would have to
take place in the routing subsystem to make this practical.

For the time being, the least intrusive way of achieving this is to use
the u32 classifier written by Alexey Kuznetsov along with the actions
infrastructure implemented by Jamal Hadi Salim.

The following patch is an attempt at this problem by creating a new nat
action that can be invoked from u32 hash tables which would allow a
large number of stateless NAT rules that can be used/updated in
constant time.

The actual NAT code is mostly based on the previous stateless NAT code
written by Alexey. In future we might be able to utilise the protocol
NAT code from netfilter to improve support for other protocols.

Signed-off-by: Herbert Xu [EMAIL PROTECTED]

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/include/linux/tc_act/tc_nat.h b/include/linux/tc_act/tc_nat.h
new file mode 100644
index 000..9280c6f
--- /dev/null
+++ b/include/linux/tc_act/tc_nat.h
@@ -0,0 +1,29 @@
+#ifndef __LINUX_TC_NAT_H
+#define __LINUX_TC_NAT_H
+
+#include <linux/pkt_cls.h>
+#include <linux/types.h>
+
+#define TCA_ACT_NAT 9
+
+enum
+{
+	TCA_NAT_UNSPEC,
+	TCA_NAT_PARMS,
+	TCA_NAT_TM,
+	__TCA_NAT_MAX
+};
+#define TCA_NAT_MAX (__TCA_NAT_MAX - 1)
+
+#define TCA_NAT_FLAG_EGRESS 1
+
+struct tc_nat
+{
+	tc_gen;
+	__be32 old_addr;
+	__be32 new_addr;
+	__be32 mask;
+	__u32 flags;
+};
+
+#endif
diff --git a/include/net/tc_act/tc_nat.h b/include/net/tc_act/tc_nat.h
new file mode 100644
index 000..4a691f3
--- /dev/null
+++ b/include/net/tc_act/tc_nat.h
@@ -0,0 +1,21 @@
+#ifndef __NET_TC_NAT_H
+#define __NET_TC_NAT_H
+
+#include <linux/types.h>
+#include <net/act_api.h>
+
+struct tcf_nat {
+	struct tcf_common common;
+
+	__be32 old_addr;
+	__be32 new_addr;
+	__be32 mask;
+	u32 flags;
+};
+
+static inline struct tcf_nat *to_tcf_nat(struct tcf_common *pc)
+{
+	return container_of(pc, struct tcf_nat, common);
+}
+
+#endif /* __NET_TC_NAT_H */
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 8a74cac..22b34f2 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -447,6 +447,17 @@ config NET_ACT_IPT
 	  To compile this code as a module, choose M here: the
 	  module will be called ipt.
 
+config NET_ACT_NAT
+        tristate "Stateless NAT"
+        depends on NET_CLS_ACT
+        select NETFILTER
+        ---help---
+	  Say Y here to do stateless NAT on IPv4 packets. You should use
+	  netfilter for NAT unless you know what you are doing.
+
+	  To compile this code as a module, choose M here: the
+	  module will be called ipt.
+
 config NET_ACT_PEDIT
         tristate "Packet Editing"
         depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index b67c36f..81ecbe8 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_NET_ACT_POLICE)	+= act_police.o
 obj-$(CONFIG_NET_ACT_GACT)	+= act_gact.o
 obj-$(CONFIG_NET_ACT_MIRRED)	+= act_mirred.o
 obj-$(CONFIG_NET_ACT_IPT)	+= act_ipt.o
+obj-$(CONFIG_NET_ACT_NAT)	+= act_nat.o
 obj-$(CONFIG_NET_ACT_PEDIT)	+= act_pedit.o
 obj-$(CONFIG_NET_ACT_SIMP)	+= act_simple.o
 obj-$(CONFIG_NET_SCH_FIFO)	+= sch_fifo.o
diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c
new file mode 100644
index 000..efd6d7d
--- /dev/null
+++ b/net/sched/act_nat.c
@@ -0,0 +1,313 @@
+/*
+ * Stateless NAT actions
+ *
+ * Copyright (c) 2007 Herbert Xu [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/netfilter.h>
+#include <linux/rtnetlink.h>
+#include
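As a rough userspace illustration of what the patch description above means by stateless NAT: an address matches a rule when its masked bits equal the rule's old address, the masked bits are then swapped for the new address, and the IPv4 header checksum is patched incrementally instead of being recomputed. The function names below are made up for this sketch and are not the kernel's helpers; the incremental update follows RFC 1624 (equation 3), and all values are handled as host-order integers for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rule match: true when addr and old_addr agree on every masked bit. */
static int nat_matches(uint32_t addr, uint32_t old_addr, uint32_t mask)
{
	return ((old_addr ^ addr) & mask) == 0;
}

/* Take the masked (network) bits from new_addr, keep the host bits. */
static uint32_t nat_rewrite(uint32_t addr, uint32_t new_addr, uint32_t mask)
{
	return (new_addr & mask) | (addr & ~mask);
}

/* Fold a 32-bit accumulator into a 16-bit one's-complement sum. */
static uint16_t csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Full checksum over n 16-bit words (checksum field must be zero). */
static uint16_t csum16(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;

	assert(words != NULL);
	for (size_t i = 0; i < n; i++)
		sum += words[i];
	return (uint16_t)~csum_fold(sum);
}

/* Incremental update after one 32-bit field changes from `from` to `to`:
 * HC' = ~(~HC + ~m + m'), applied to both 16-bit halves of the field. */
static uint16_t csum_replace32(uint16_t check, uint32_t from, uint32_t to)
{
	uint32_t sum = (uint16_t)~check;

	sum += (uint16_t)~(from >> 16);
	sum += (uint16_t)~from;
	sum += to >> 16;
	sum += to & 0xffff;
	return (uint16_t)~csum_fold(sum);
}
```

Because nothing here depends on per-flow state, a rule lookup plus this arithmetic is all a packet needs, which is why the approach scales with the number of flows.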
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
On Thu, 2007-09-27 at 03:31 -0400, Jeff Garzik wrote:
> A key problem I was hoping would be solved with your work here was the
> elimination of that post dma_map_sg() split.
>
> If I understood James and Ben correctly, one of the key problems was
> always in communicating libata's segment boundary needs to the IOMMU
> layers?

Yup. If we can put some constraint in struct device that the dma
mapping code can then look at ... we also need to ensure that what's
passed in for DMA'ing already matches those constraints as well since
no-iommu platforms will basically just keep the dma table as-is.

Ben.
Re: [PATCH 1/4] net: Dynamically allocate the per cpu counters for the loopback device.
David Miller [EMAIL PROTECTED] writes:
> From: [EMAIL PROTECTED] (Eric W. Biederman)
> Date: Wed, 26 Sep 2007 17:53:40 -0600
>
> > This patch adds support for dynamically allocating the statistics
> > counters for the loopback device and adds appropriate device methods
> > for allocating and freeing the loopback device.
> >
> > This completes support for creating multiple instances of the
> > loopback device, in preparation for creating per network namespace
> > instances.
> >
> > Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
>
> Applied to net-2.6.24, thanks.
>
> > @@ -155,7 +154,8 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
> >  	dev->last_rx = jiffies;
> >
> >  	/* it's OK to use __get_cpu_var() because BHs are off */
> > -	lb_stats = &__get_cpu_var(pcpu_lstats);
> > +	pcpu_lstats = netdev_priv(dev);
> > +	lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
> >  	lb_stats->bytes += skb->len;
> >  	lb_stats->packets++;
>
> I'm going to add a followon change that gets rid of that comment about
> __get_cpu_var() since it is no longer relevant.

Good point. I'm not doing get_cpu/put_cpu so does the comment make
sense in relationship to per_cpu_ptr?

Eric
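The shape of the change under discussion (one dynamically allocated set of counters per device, with one slot per CPU, summed when statistics are read) can be sketched in userspace C as below. This is only an analogy: a plain calloc'd array stands in for the kernel's per-CPU allocation, the caller passes the CPU index explicitly, and every name here is invented for the example.

```c
#include <assert.h>
#include <stdlib.h>

/* Per-CPU slot: each CPU only ever touches its own entry. */
struct pcpu_lstats {
	unsigned long packets;
	unsigned long bytes;
};

/* Hypothetical device: owns its own counter array, so several
 * instances can exist side by side. */
struct loop_dev {
	unsigned int ncpus;
	struct pcpu_lstats *lstats;
};

static int loop_dev_init(struct loop_dev *dev, unsigned int ncpus)
{
	assert(dev != NULL && ncpus > 0);
	dev->lstats = calloc(ncpus, sizeof(*dev->lstats));
	if (!dev->lstats)
		return -1;
	dev->ncpus = ncpus;
	return 0;
}

static void loop_dev_free(struct loop_dev *dev)
{
	free(dev->lstats);
	dev->lstats = NULL;
	dev->ncpus = 0;
}

/* Transmit path: bump the slot belonging to the current CPU. */
static void loop_count(struct loop_dev *dev, unsigned int cpu,
		       unsigned long len)
{
	assert(cpu < dev->ncpus);
	dev->lstats[cpu].bytes += len;
	dev->lstats[cpu].packets++;
}

/* Read path: sum all slots into one total. */
static struct pcpu_lstats loop_get_stats(const struct loop_dev *dev)
{
	struct pcpu_lstats total = { 0, 0 };

	for (unsigned int i = 0; i < dev->ncpus; i++) {
		total.packets += dev->lstats[i].packets;
		total.bytes += dev->lstats[i].bytes;
	}
	return total;
}
```

In the kernel the update is safe without a lock only while the running task cannot migrate or be interrupted mid-update on that CPU, which is the point of the comment Eric is asking about.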
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
Benjamin Herrenschmidt wrote:
> On Thu, 2007-09-27 at 03:31 -0400, Jeff Garzik wrote:
> > A key problem I was hoping would be solved with your work here was
> > the elimination of that post dma_map_sg() split.
> >
> > If I understood James and Ben correctly, one of the key problems was
> > always in communicating libata's segment boundary needs to the IOMMU
> > layers?
>
> Yup. If we can put some constraint in struct device that the dma
> mapping code can then look at ... we also need to ensure that what's
> passed in for DMA'ing already matches those constraints as well since
> no-iommu platforms will basically just keep the dma table as-is.

That's a good point... no-iommu platforms would need to be updated to
do the split for me. I suppose we can steal that code from swiotlb or
somewhere.

	Jeff
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
CC'ed Jens, James, and linux-scsi.

On Thu, 27 Sep 2007 03:31:55 -0400
Jeff Garzik [EMAIL PROTECTED] wrote:

> FUJITA Tomonori wrote:
> > Yeah, we could nicely handle lld's restrictions (especially with
> > stacking devices). But iommu code needs only max_segment_size and
> > seg_boundary_mask, right? If so, the first simple approach to add
> > two values to device structure is not so bad, I think.
>
> (replying to slightly older email in the thread)
> (added benh, since we've discussed this issue in the past)
>
> dumb question, what happened to seg_boundary_mask?

I'll work on it too after finishing max_seg_size.

> If you look at drivers/ata/libata-core.c:ata_fill_sg(), you will note
> that we split s/g segments after DMA-mapping. Looking at libata
> LLDD's, you will also note judicious use of ATA_DMA_BOUNDARY (0x).

I know the workaround since I fixed libata's sg chaining patch.

> It was drilled into my head by James and benh that I cannot rely on
> the DMA boundary + block/scsi + dma_map_sg() to ensure that my S/G
> segments never cross a 64K boundary, a legacy IDE requirement. Thus
> the additional code in ata_fill_sg() to split S/G segments straddling
> 64K, in addition to setting dma boundary to 0x.

I think that the block layer can handle both max_segment_size and
seg_boundary_mask properly (and SCSI-ml just uses the block layer). So
if we fix iommu, then we can remove a workaround to fix sg lists in
llds.

> A key problem I was hoping would be solved with your work here was the
> elimination of that post dma_map_sg() split.

Yeah, that's my goal too.

> If I understood James and Ben correctly, one of the key problems was
> always in communicating libata's segment boundary needs to the IOMMU
> layers?
>
> 	Jeff
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
On Thu, 2007-09-27 at 03:49 -0400, Jeff Garzik wrote:
> Benjamin Herrenschmidt wrote:
> > On Thu, 2007-09-27 at 03:31 -0400, Jeff Garzik wrote:
> > > A key problem I was hoping would be solved with your work here was
> > > the elimination of that post dma_map_sg() split.
> > >
> > > If I understood James and Ben correctly, one of the key problems
> > > was always in communicating libata's segment boundary needs to the
> > > IOMMU layers?
> >
> > Yup. If we can put some constraint in struct device that the dma
> > mapping code can then look at ... we also need to ensure that what's
> > passed in for DMA'ing already matches those constraints as well
> > since no-iommu platforms will basically just keep the dma table
> > as-is.
>
> That's a good point... no-iommu platforms would need to be updated to
> do the split for me. I suppose we can steal that code from swiotlb or
> somewhere.

Doing the split means being able to grow the sglist... which the dma_*
calls can't do at least not in their current form.

Ben.
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
Benjamin Herrenschmidt wrote:
> On Thu, 2007-09-27 at 03:49 -0400, Jeff Garzik wrote:
> > Benjamin Herrenschmidt wrote:
> > > On Thu, 2007-09-27 at 03:31 -0400, Jeff Garzik wrote:
> > > > A key problem I was hoping would be solved with your work here
> > > > was the elimination of that post dma_map_sg() split.
> > > >
> > > > If I understood James and Ben correctly, one of the key problems
> > > > was always in communicating libata's segment boundary needs to
> > > > the IOMMU layers?
> > >
> > > Yup. If we can put some constraint in struct device that the dma
> > > mapping code can then look at ... we also need to ensure that
> > > what's passed in for DMA'ing already matches those constraints as
> > > well since no-iommu platforms will basically just keep the dma
> > > table as-is.
> >
> > That's a good point... no-iommu platforms would need to be updated
> > to do the split for me. I suppose we can steal that code from
> > swiotlb or somewhere.
>
> Doing the split means being able to grow the sglist... which the dma_*
> calls can't do at least not in their current form.

IMO one straightforward approach is for the struct scatterlist owner to
provide a table large enough to accommodate the possible splits
(perhaps along with communicating that table's max size to the
IOMMU/dma layers).

	Jeff
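To make "a table large enough to accommodate the possible splits" concrete, here is one way an owner could bound the table size before mapping, when only the segment lengths are known and the mapped addresses are not. It is a back-of-the-envelope sketch assuming a 64K boundary; the names are hypothetical. For a segment of len bytes starting at offset o inside a 64K window, the piece count is ceil((o + len) / 64K), and the worst case is o = 64K - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SEG_BOUNDARY 0x10000ull	/* assumed 64K boundary */

/* Upper bound on the pieces one segment of `len` bytes can be cut
 * into, whatever address it ends up mapped at. */
static uint64_t worst_case_pieces(uint32_t len)
{
	if (len == 0)
		return 0;
	/* ceil((SEG_BOUNDARY - 1 + len) / SEG_BOUNDARY) */
	return ((uint64_t)len + 2 * (SEG_BOUNDARY - 1)) / SEG_BOUNDARY;
}

/* Table size that is always sufficient for n segments. */
static uint64_t worst_case_table_size(const uint32_t *lens, size_t n)
{
	uint64_t total = 0;

	assert(lens != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		total += worst_case_pieces(lens[i]);
	return total;
}
```

The bound is pessimistic for well-aligned buffers, so an owner that knows its alignment could size the table more tightly.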
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
FUJITA Tomonori wrote:
> CC'ed Jens, James, and linux-scsi.
>
> On Thu, 27 Sep 2007 03:31:55 -0400
> Jeff Garzik [EMAIL PROTECTED] wrote:
>
> > FUJITA Tomonori wrote:
> > > Yeah, we could nicely handle lld's restrictions (especially with
> > > stacking devices). But iommu code needs only max_segment_size and
> > > seg_boundary_mask, right? If so, the first simple approach to add
> > > two values to device structure is not so bad, I think.
> >
> > (replying to slightly older email in the thread)
> > (added benh, since we've discussed this issue in the past)
> >
> > dumb question, what happened to seg_boundary_mask?
>
> I'll work on it too after finishing max_seg_size.
>
> > If you look at drivers/ata/libata-core.c:ata_fill_sg(), you will
> > note that we split s/g segments after DMA-mapping. Looking at libata
> > LLDD's, you will also note judicious use of ATA_DMA_BOUNDARY (0x).
>
> I know the workaround since I fixed libata's sg chaining patch.
>
> > It was drilled into my head by James and benh that I cannot rely on
> > the DMA boundary + block/scsi + dma_map_sg() to ensure that my S/G
> > segments never cross a 64K boundary, a legacy IDE requirement. Thus
> > the additional code in ata_fill_sg() to split S/G segments
> > straddling 64K, in addition to setting dma boundary to 0x.
>
> I think that the block layer can handle both max_segment_size and
> seg_boundary_mask properly (and SCSI-ml just uses the block layer). So
> if we fix iommu, then we can remove a workaround to fix sg lists in
> llds.
>
> > A key problem I was hoping would be solved with your work here was
> > the elimination of that post dma_map_sg() split.
>
> Yeah, that's my goal too.

Great :) Well, I'm generally happy with your max-seg-size stuff (sans
the minor nits I pointed out in another email).

Thanks for pursuing this,

	Jeff
Re: [PATCH v3 2/2][BNX2]: Add iSCSI support to BNX2 devices.
CC'ed Jens, James, and linux-scsi again.

On Thu, 27 Sep 2007 04:22:15 -0400
Jeff Garzik [EMAIL PROTECTED] wrote:

> Benjamin Herrenschmidt wrote:
> > On Thu, 2007-09-27 at 03:49 -0400, Jeff Garzik wrote:
> > > Benjamin Herrenschmidt wrote:
> > > > On Thu, 2007-09-27 at 03:31 -0400, Jeff Garzik wrote:
> > > > > A key problem I was hoping would be solved with your work here
> > > > > was the elimination of that post dma_map_sg() split.
> > > > >
> > > > > If I understood James and Ben correctly, one of the key
> > > > > problems was always in communicating libata's segment boundary
> > > > > needs to the IOMMU layers?
> > > >
> > > > Yup. If we can put some constraint in struct device that the dma
> > > > mapping code can then look at ... we also need to ensure that
> > > > what's passed in for DMA'ing already matches those constraints
> > > > as well since no-iommu platforms will basically just keep the
> > > > dma table as-is.
> > >
> > > That's a good point... no-iommu platforms would need to be updated
> > > to do the split for me. I suppose we can steal that code from
> > > swiotlb or somewhere.
> >
> > Doing the split means being able to grow the sglist... which the
> > dma_* calls can't do at least not in their current form.
>
> IMO one straightforward approach is for the struct scatterlist owner
> to provide a table large enough to accommodate the possible splits
> (perhaps along with communicating that table's max size to the
> IOMMU/dma layers).

As I said in another mail, the block layer and scsi-ml work properly, I
think. So there is no need to split sg lists for no-iommu platforms.

For the segment size restriction we need to fix only the iommu code
that merges sglists (already done), but for the segment boundary
restriction we need to fix all iommu code and swiotlb.

Splitting the sg list might be useful for the case where the iommu
can't find a memory area with a proper boundary. But I think that it
rarely happens (and few llds have the boundary restriction).
Re: [PATCH] net: Add network namespace clone unshare support.
Eric W. Biederman wrote:
> This patch allows you to create a new network namespace
> using sys_clone, or sys_unshare.
>
> As the network namespace is still experimental and under development
> clone and unshare support is only made available when CONFIG_NET_NS is
> selected at compile time.
>
> As this patch introduces network namespace support into code paths
> that exist when the CONFIG_NET is not selected there are a few
> additions made to net_namespace.h to allow a few more functions
> to be used when the networking stack is not compiled in.
>
> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
> ---
>  include/linux/sched.h       |    1 +
>  include/net/net_namespace.h |   18 ++
>  kernel/fork.c               |    3 ++-
>  kernel/nsproxy.c            |   15 +--
>  net/Kconfig                 |    8
>  net/core/net_namespace.c    |   43 +--
>  6 files changed, 83 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a01ac6d..e10a0a8 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -27,6 +27,7 @@
>  #define CLONE_NEWUTS		0x0400	/* New utsname group? */
>  #define CLONE_NEWIPC		0x0800	/* New ipcs */
>  #define CLONE_NEWUSER		0x1000	/* New user namespace */
> +#define CLONE_NEWNET		0x2000	/* New network namespace */

This new flag is going to conflict with the pid namespace flag
CLONE_NEWPID in -mm. It might be worth changing it to:

	#define CLONE_NEWNET		0x4000

The changes in nsproxy.c and fork.c will also conflict but I don't
think we can do much about it for now.

C.

>
>  /*
>   * Scheduling policies
> diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
> index ac8f830..3ea4194 100644
> --- a/include/net/net_namespace.h
> +++ b/include/net/net_namespace.h
> @@ -38,11 +38,23 @@ extern struct net init_net;
>  extern struct list_head net_namespace_list;
>
> +#ifdef CONFIG_NET
> +extern struct net *copy_net_ns(unsigned long flags, struct net *net_ns);
> +#else
> +static inline struct net *copy_net_ns(unsigned long flags, struct net *net_ns)
> +{
> +	/* There is nothing to copy so this is a noop */
> +	return net_ns;
> +}
> +#endif
> +
>  extern void __put_net(struct net *net);
>
>  static inline struct net *get_net(struct net *net)
>  {
> +#ifdef CONFIG_NET
>  	atomic_inc(&net->count);
> +#endif
>  	return net;
>  }
>
> @@ -60,19 +72,25 @@ static inline struct net *maybe_get_net(struct net *net)
>
>  static inline void put_net(struct net *net)
>  {
> +#ifdef CONFIG_NET
>  	if (atomic_dec_and_test(&net->count))
>  		__put_net(net);
> +#endif
>  }
>
>  static inline struct net *hold_net(struct net *net)
>  {
> +#ifdef CONFIG_NET
>  	atomic_inc(&net->use_count);
> +#endif
>  	return net;
>  }
>
>  static inline void release_net(struct net *net)
>  {
> +#ifdef CONFIG_NET
>  	atomic_dec(&net->use_count);
> +#endif
>  }
>
>  extern void net_lock(void);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 33f12f4..5e67f90 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1608,7 +1608,8 @@ asmlinkage long sys_unshare(unsigned long unshare_flags)
>  	err = -EINVAL;
>  	if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
>  				CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
> -				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER))
> +				CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|
> +				CLONE_NEWNET))
>  		goto bad_unshare_out;
>
>  	if ((err = unshare_thread(unshare_flags)))
> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
> index a4fb7d4..f1decd2 100644
> --- a/kernel/nsproxy.c
> +++ b/kernel/nsproxy.c
> @@ -20,6 +20,7 @@
>  #include <linux/mnt_namespace.h>
>  #include <linux/utsname.h>
>  #include <linux/pid_namespace.h>
> +#include <net/net_namespace.h>
>
>  static struct kmem_cache *nsproxy_cachep;
>
> @@ -98,8 +99,17 @@ static struct nsproxy *create_new_namespaces(unsigned long flags,
>  		goto out_user;
>  	}
>
> +	new_nsp->net_ns = copy_net_ns(flags, tsk->nsproxy->net_ns);
> +	if (IS_ERR(new_nsp->net_ns)) {
> +		err = PTR_ERR(new_nsp->net_ns);
> +		goto out_net;
> +	}
> +
>  	return new_nsp;
>
> +out_net:
> +	if (new_nsp->user_ns)
> +		put_user_ns(new_nsp->user_ns);
>  out_user:
>  	if (new_nsp->pid_ns)
>  		put_pid_ns(new_nsp->pid_ns);
> @@ -132,7 +142,7 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
>
>  	get_nsproxy(old_ns);
>
> -	if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER)))
> +	if (!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER | CLONE_NEWNET)))
>  		return 0;
>
>  	if (!capable(CAP_SYS_ADMIN)) {
> @@ -164,6 +174,7 @@ void free_nsproxy(struct nsproxy *ns)
>  		put_pid_ns(ns->pid_ns);
>  	if (ns->user_ns)
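The two flag checks touched by the quoted patch follow a common pattern: reject any bit outside the supported set, then take the fast path when no namespace bit is requested. A stripped-down sketch is below. The flag names and bit values are placeholders invented for the example, deliberately not the kernel's CLONE_* values, which is also why a clash like the one C. points out matters: two flags sharing a bit cannot be told apart by these tests.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder flag bits for illustration only (not the real CLONE_*). */
#define X_NEWNS		(1UL << 0)
#define X_NEWUTS	(1UL << 1)
#define X_NEWIPC	(1UL << 2)
#define X_NEWUSER	(1UL << 3)
#define X_NEWNET	(1UL << 4)

#define X_NAMESPACE_FLAGS \
	(X_NEWNS | X_NEWUTS | X_NEWIPC | X_NEWUSER | X_NEWNET)

/* Like the sys_unshare() test: any unknown bit is an error. */
static int check_unshare_flags(unsigned long flags)
{
	if (flags & ~X_NAMESPACE_FLAGS)
		return -EINVAL;
	return 0;
}

/* Like the copy_namespaces() test: nothing to do unless at least one
 * namespace flag is set. */
static int needs_new_nsproxy(unsigned long flags)
{
	return (flags & X_NAMESPACE_FLAGS) != 0;
}
```

Adding a new namespace type therefore means extending both masks, which is what the fork.c and nsproxy.c hunks do for CLONE_NEWNET.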
Re: [PATCH] sky2: sky2 FE+ receive status workaround
Hi Stephen,

On 27 Sep 2007, at 01:58, Stephen Hemminger wrote:
> +	/* This chip has hardware problems that generates bogus status.
> +	 * So do only marginal checking and expect higher level protocols
> +	 * to handle crap frames.
> +	 */
> +	if (sky2->hw->chip_id == CHIP_ID_YUKON_FE_P &&
> +	    sky2->hw->chip_rev == CHIP_REV_YU_FE2_A0 &&
> +	    length != count)
> +		goto okay;

Shouldn't the condition be length == count?

I hope this helps,
Jochen
-- 
http://seehuhn.de/
Re: [PKT_SCHED]: Add stateless NAT
Hi Herbert.

On Thu, Sep 27, 2007 at 03:34:47PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
> Hi:
>
> [PKT_SCHED]: Add stateless NAT
>
> Stateless NAT is useful in controlled environments where restrictions
> are placed on through traffic such that we don't need connection
> tracking to correctly NAT protocol-specific data.

Couple of comments below.

> --- a/net/sched/Kconfig
> +++ b/net/sched/Kconfig
> @@ -447,6 +447,17 @@ config NET_ACT_IPT
>  	  To compile this code as a module, choose M here: the
>  	  module will be called ipt.
>
> +config NET_ACT_NAT
> +        tristate "Stateless NAT"
> +        depends on NET_CLS_ACT
> +        select NETFILTER

Argh... People usually do not understand such jokes :)
What about not using netfilter helpers and just moving them to an
accessible header so that no additional slow path would ever be
enabled?

> +        ---help---
> +	  Say Y here to do stateless NAT on IPv4 packets. You should use
> +	  netfilter for NAT unless you know what you are doing.
> +
> +	  To compile this code as a module, choose M here: the
> +	  module will be called ipt.
> +

Module will be called 'nat' I believe.

> +++ b/net/sched/act_nat.c

...

> +#define NAT_TAB_MASK	15

This really wants to be configurable at least via module parameter.

> +static struct tcf_common *tcf_nat_ht[NAT_TAB_MASK + 1];
> +static u32 nat_idx_gen;
> +static DEFINE_RWLOCK(nat_lock);
> +static struct tcf_hashinfo nat_hash_info = {
> +	.htab	=	tcf_nat_ht,
> +	.hmask	=	NAT_TAB_MASK,
> +	.lock	=	&nat_lock,
> +};

When I read this I swear I heard 'I want to be RCU'.
But that is another task.

> +static int tcf_nat(struct sk_buff *skb, struct tc_action *a,
> +		   struct tcf_result *res)
> +{
> +	struct tcf_nat *p = a->priv;
> +	struct iphdr *iph;
> +	__be32 old_addr;
> +	__be32 new_addr;
> +	__be32 mask;
> +	__be32 addr;
> +	int egress;
> +	int action;
> +	int ihl;
> +
> +	spin_lock(&p->tcf_lock);
> +
> +	p->tcf_tm.lastuse = jiffies;
> +	old_addr = p->old_addr;
> +	new_addr = p->new_addr;
> +	mask = p->mask;
> +	egress = p->flags & TCA_NAT_FLAG_EGRESS;
> +	action = p->tcf_action;
> +
> +	p->tcf_bstats.bytes += skb->len;
> +	p->tcf_bstats.packets++;
> +
> +	spin_unlock(&p->tcf_lock);
> +
> +	if (!pskb_may_pull(skb, sizeof(*iph)))
> +		return TC_ACT_SHOT;
> +
> +	iph = ip_hdr(skb);
> +
> +	if (egress)
> +		addr = iph->saddr;
> +	else
> +		addr = iph->daddr;
> +
> +	if (!((old_addr ^ addr) & mask)) {
> +		if (skb_cloned(skb) &&
> +		    !skb_clone_writable(skb, sizeof(*iph)) &&
> +		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
> +			return TC_ACT_SHOT;
> +
> +		new_addr &= mask;
> +		new_addr |= addr & ~mask;
> +
> +		/* Rewrite IP header */
> +		iph = ip_hdr(skb);
> +		if (egress)
> +			iph->saddr = new_addr;
> +		else
> +			iph->daddr = new_addr;
> +
> +		nf_csum_replace4(&iph->check, addr, new_addr);
> +	}
> +
> +	ihl = iph->ihl * 4;
> +
> +	/* It would be nice to share code with stateful NAT. */
> +	switch (iph->frag_off & htons(IP_OFFSET) ? 0 : iph->protocol) {
> +	case IPPROTO_TCP:
> +	{
> +		struct tcphdr *tcph;
> +
> +		if (!pskb_may_pull(skb, ihl + sizeof(*tcph)) ||
> +		    (skb_cloned(skb) &&
> +		     !skb_clone_writable(skb, ihl + sizeof(*tcph)) &&
> +		     pskb_expand_head(skb, 0, 0, GFP_ATOMIC)))
> +			return TC_ACT_SHOT;
> +
> +		tcph = (void *)(skb_network_header(skb) + ihl);

Were you too lazy to write struct tcphdr here and in other places? :)

-- 
	Evgeniy Polyakov
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 01:25:12PM +0400, Evgeniy Polyakov wrote:
>
> Couple of comments below.

Thanks Evgeniy :)

> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -447,6 +447,17 @@ config NET_ACT_IPT
> >  	  To compile this code as a module, choose M here: the
> >  	  module will be called ipt.
> >
> > +config NET_ACT_NAT
> > +        tristate "Stateless NAT"
> > +        depends on NET_CLS_ACT
> > +        select NETFILTER
>
> Argh... People usually do not understand such jokes :)
> What about not using netfilter helpers and just moving them to an
> accessible header so that no additional slow path would ever be
> enabled?

Sure. However, as it is it's just including the netfilter core which
does not mean the inclusion of connection tracking. It's only
connection tracking that *may* (so don't flame me for this :) pose a
scalability problem.

> > +        ---help---
> > +	  Say Y here to do stateless NAT on IPv4 packets. You should use
> > +	  netfilter for NAT unless you know what you are doing.
> > +
> > +	  To compile this code as a module, choose M here: the
> > +	  module will be called ipt.
> > +
>
> Module will be called 'nat' I believe.

Good catch, now you know where I copied it from :)

> > +++ b/net/sched/act_nat.c
>
> ...
>
> > +#define NAT_TAB_MASK	15
>
> This really wants to be configurable at least via module parameter.
>
> > +static struct tcf_common *tcf_nat_ht[NAT_TAB_MASK + 1];
> > +static u32 nat_idx_gen;
> > +static DEFINE_RWLOCK(nat_lock);
> > +static struct tcf_hashinfo nat_hash_info = {
> > +	.htab	=	tcf_nat_ht,
> > +	.hmask	=	NAT_TAB_MASK,
> > +	.lock	=	&nat_lock,
> > +};
>
> When I read this I swear I heard 'I want to be RCU'.
> But that is another task.

Yes there are a lot of clean-ups that can be done for all actions.
You're most welcome to send patches in this area.

> > +		tcph = (void *)(skb_network_header(skb) + ihl);
>
> Were you too lazy to write struct tcphdr here and in other places? :)

Unfortunately it doesn't work. For prerouting, we've not entered the IP
stack yet so the transport header isn't set.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 05:33:58PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
> > > +config NET_ACT_NAT
> > > +        tristate "Stateless NAT"
> > > +        depends on NET_CLS_ACT
> > > +        select NETFILTER
> >
> > Argh... People usually do not understand such jokes :)
> > What about not using netfilter helpers and just moving them to an
> > accessible header so that no additional slow path would ever be
> > enabled?
>
> Sure. However, as it is it's just including the netfilter core which
> does not mean the inclusion of connection tracking. It's only
> connection tracking that *may* (so don't flame me for this :) pose a
> scalability problem.

It forces all input/pre/post/forward hooks to be enabled not as a
direct function call, but as additional lookups. And it makes it
impossible to remove netfilter from the config. And all that just
because of a couple of checksum helpers...

> > > +++ b/net/sched/act_nat.c
> >
> > ...
> >
> > > +#define NAT_TAB_MASK	15
> >
> > This really wants to be configurable at least via module parameter.
> >
> > > +static struct tcf_common *tcf_nat_ht[NAT_TAB_MASK + 1];
> > > +static u32 nat_idx_gen;
> > > +static DEFINE_RWLOCK(nat_lock);
> > > +static struct tcf_hashinfo nat_hash_info = {
> > > +	.htab	=	tcf_nat_ht,
> > > +	.hmask	=	NAT_TAB_MASK,
> > > +	.lock	=	&nat_lock,
> > > +};
> >
> > When I read this I swear I heard 'I want to be RCU'.
> > But that is another task.
>
> Yes there are a lot of clean-ups that can be done for all actions.
> You're most welcome to send patches in this area.
>
> > > +		tcph = (void *)(skb_network_header(skb) + ihl);
> >
> > Were you too lazy to write struct tcphdr here and in other places? :)
>
> Unfortunately it doesn't work. For prerouting, we've not entered the
> IP stack yet so the transport header isn't set.

I meant that instead of casting to void * it should be cast to
struct tcphdr *.

-- 
	Evgeniy Polyakov
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 02:07:53PM +0400, Evgeniy Polyakov wrote:
>
> It forces all input/pre/post/forward hooks to be enabled not as a
> direct function call, but as additional lookups. And it makes it
> impossible to remove netfilter from the config. And all that just
> because of a couple of checksum helpers...

I'm certainly not against patches moving that code out of netfilter.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 3/4] net ipv4: When possible test for IFF_LOOPBACK and not dev == loopback_dev
Eric W. Biederman wrote:
> Now that multiple loopback devices are becoming possible it makes the code a little cleaner and more maintainable to test if a device is a loopback device by testing dev->flags & IFF_LOOPBACK instead of dev == loopback_dev.
>
> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

Urs Thuermann posted the patch "[PATCH 5/7] CAN: Add virtual CAN netdevice driver". That network driver sets the IFF_LOOPBACK flag for testing. Could this collide with your patch?
[PATCH] fixed broken bootp compilation
Compilation fix. Extra bracket removed. Broken by "[NET]: Wrap netdevice hardware header creation" from Stephen Hemminger [EMAIL PROTECTED]

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
--- ./net/ipv4/ipconfig.c.compile	2007-09-27 13:32:35.0 +0400
+++ ./net/ipv4/ipconfig.c	2007-09-27 14:36:19.0 +0400
@@ -758,7 +758,7 @@ static void __init ic_bootp_send_if(stru
 	skb->dev = dev;
 	skb->protocol = htons(ETH_P_IP);
 	if (dev_hard_header(skb, dev, ntohs(skb->protocol),
-			dev->broadcast, dev->dev_addr, skb->len) < 0) ||
+			dev->broadcast, dev->dev_addr, skb->len) < 0 ||
 	    dev_queue_xmit(skb) < 0)
 		printk("E");
 }
Re: Linux networking implementation and packet capture
Gaurav Aggarwal wrote:
> Hi,
>
> I am trying to understand the implementation of Linux 2.4 and Linux 2.6 networking (IPv4). Can anyone give me some pointers to good resources or whitepapers for understanding it? Is there any document that describes the changes in the networking implementation between 2.4 and 2.6?
>
> I am also trying to write a simple program (preferably a userspace application) which captures all the incoming and outgoing packets of a particular machine (preferably at the PREROUTING stage), then, according to the SRC/DST addresses, changes the IP address of some of the packets and reinjects them into the local IP stack. I am able to do that on a 2.4 kernel by using libipq and ip_tables, but that program does not run on a 2.6 kernel (it hits ip_route_BUG). Any idea or code snippet will be really appreciated.
>
> --
> Regards,
> Gaurav Aggarwal

You may also see the information on the linux-net wiki: http://linux-net.osdl.org/index.php/Main_Page. Also, read some threads on the netdev list. There is a lot of information there too.

--
Best Regards
Alan Menegotto
ax88796: add 93cx6 eeprom support
ax88796: add 93cx6 eeprom support This patch hooks up the 93cx6 eeprom code to the ax88796 driver and modifies the ax88796 driver to read out the mac address from the eeprom. We need this for the ax88796 on certain SuperH boards. The pin configuration used to connect the eeprom to the ax88796 on these boards is the same as pointed out by the ax88796 datasheet, so we can probably reuse this code for multiple platforms in the future. Signed-off-by: Magnus Damm [EMAIL PROTECTED] --- This is a broken out version of the larger patch recently posted to netdev: http://www.mail-archive.com/netdev@vger.kernel.org/msg47278.html drivers/net/Kconfig |7 ++ drivers/net/ax88796.c| 49 ++ include/linux/eeprom_93cx6.h |3 +- include/net/ax88796.h|1 4 files changed, 59 insertions(+), 1 deletion(-) --- 0001/drivers/net/Kconfig +++ work/drivers/net/Kconfig2007-09-27 19:32:10.0 +0900 @@ -225,6 +225,13 @@ config AX88796 AX88796 driver, using platform bus to provide chip detection and resources +config AX88796_93CX6 + bool ASIX AX88796 external 93CX6 eeprom support + depends on AX88796 + select EEPROM_93CX6 + help + Select this if your platform comes with an external 93CX6 eeprom. 
+ config MACE tristate MACE (Power Mac ethernet) support depends on PPC_PMAC PPC32 --- 0001/drivers/net/ax88796.c +++ work/drivers/net/ax88796.c 2007-09-27 19:17:44.0 +0900 @@ -24,6 +24,7 @@ #include linux/etherdevice.h #include linux/ethtool.h #include linux/mii.h +#include linux/eeprom_93cx6.h #include net/ax88796.h @@ -582,6 +583,37 @@ static const struct ethtool_ops ax_ethto .get_link = ax_get_link, }; +#ifdef CONFIG_AX88796_93CX6 +static void ax_eeprom_register_read(struct eeprom_93cx6 *eeprom) +{ + struct ei_device *ei_local = eeprom-data; + u8 reg = ei_inb(ei_local-mem + AX_MEMR); + + eeprom-reg_data_in = reg AX_MEMR_EEI; + eeprom-reg_data_out = reg AX_MEMR_EEO; /* Input pin */ + eeprom-reg_data_clock = reg AX_MEMR_EECLK; + eeprom-reg_chip_select = reg AX_MEMR_EECS; +} + +static void ax_eeprom_register_write(struct eeprom_93cx6 *eeprom) +{ + struct ei_device *ei_local = eeprom-data; + u8 reg = ei_inb(ei_local-mem + AX_MEMR); + + reg = ~(AX_MEMR_EEI | AX_MEMR_EECLK | AX_MEMR_EECS); + + if (eeprom-reg_data_in) + reg |= AX_MEMR_EEI; + if (eeprom-reg_data_clock) + reg |= AX_MEMR_EECLK; + if (eeprom-reg_chip_select) + reg |= AX_MEMR_EECS; + + ei_outb(reg, ei_local-mem + AX_MEMR); + udelay(10); +} +#endif + /* setup code */ static void ax_initial_setup(struct net_device *dev, struct ei_device *ei_local) @@ -640,6 +672,23 @@ static int ax_init_dev(struct net_device memcpy(dev-dev_addr, SA_prom, 6); } +#ifdef CONFIG_AX88796_93CX6 + if (first_init ax-plat-flags AXFLG_HAS_93CX6) { + unsigned char mac_addr[6]; + struct eeprom_93cx6 eeprom; + + eeprom.data = ei_local; + eeprom.register_read = ax_eeprom_register_read; + eeprom.register_write = ax_eeprom_register_write; + eeprom.width = PCI_EEPROM_WIDTH_93C56; + + eeprom_93cx6_multiread(eeprom, 0, + (__le16 __force *)mac_addr, + sizeof(mac_addr) 1); + + memcpy(dev-dev_addr, mac_addr, 6); + } +#endif if (ax-plat-wordlength == 2) { /* We must set the 8390 for word mode. 
*/ ei_outb(ax-plat-dcr_val, ei_local-mem + EN0_DCFG); --- 0001/include/linux/eeprom_93cx6.h +++ work/include/linux/eeprom_93cx6.h 2007-09-27 19:17:44.0 +0900 @@ -21,13 +21,14 @@ /* Module: eeprom_93cx6 Abstract: EEPROM reader datastructures for 93cx6 chipsets. - Supported chipsets: 93c46 93c66. + Supported chipsets: 93c46, 93c56 and 93c66. */ /* * EEPROM operation defines. */ #define PCI_EEPROM_WIDTH_93C46 6 +#define PCI_EEPROM_WIDTH_93C56 8 #define PCI_EEPROM_WIDTH_93C66 8 #define PCI_EEPROM_WIDTH_OPCODE3 #define PCI_EEPROM_WRITE_OPCODE0x05 --- 0001/include/net/ax88796.h +++ work/include/net/ax88796.h 2007-09-27 19:17:44.0 +0900 @@ -14,6 +14,7 @@ #define AXFLG_HAS_EEPROM (10) #define AXFLG_MAC_FROMDEV (11) /* device already has MAC */ +#define AXFLG_HAS_93CX6(12) /* use eeprom_93cx6 driver */ struct ax_plat_data { unsigned int flags; - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
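In the patch above, the MAC address is fetched with eeprom_93cx6_multiread() as sizeof(mac_addr) / 2 sixteen-bit words and stored through a __le16 pointer. A rough sketch of what that layout implies is given below; the helper name is hypothetical, and the little-endian byte order is an assumption taken from the __le16 cast rather than from the ax88796 datasheet.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Unpack nwords 16-bit EEPROM words into bytes, low byte first.
 * Hypothetical helper: assumes the words are little-endian, as the
 * __le16 cast in the patch suggests.
 */
static void eeprom_words_to_bytes(const uint16_t *words, size_t nwords,
				  uint8_t *out)
{
	for (size_t i = 0; i < nwords; i++) {
		out[2 * i]     = (uint8_t)(words[i] & 0xff);	/* low byte */
		out[2 * i + 1] = (uint8_t)(words[i] >> 8);	/* high byte */
	}
}
```

Under that assumption, three words 0x1100, 0x3322, 0x5544 would yield the address 00:11:22:33:44:55.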
Re: 2.6.23-rc8-mm2 - drivers/net/ibm_newemac/mal - broken
Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc8/2.6.23-rc8-mm2/

Hi Andrew,

drivers/net/ibm_newemac/mal also seems to be broken with 2.6.23-rc8-mm2; it was already reported on 2.6.23-rc8-mm1 (http://lkml.org/lkml/2007/9/25/173).

--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center, IBM, ISTL.
Re: [Devel] [PATCH 4/4] net: Make the loopback device per network namespace
Eric W. Biederman wrote:
> This patch makes loopback_dev per network namespace, adding code to create a different loopback device for each network namespace and code to free a loopback device when a network namespace exits.
>
> This patch modifies all users of loopback_dev so they access it as init_net.loopback_dev, keeping all of the code compiling and working. A later pass will be needed to update the users to use something other than the initial network namespace.

A pity that an important bit of explanation is missing. The initialization of loopback_dev is moved from the device initialization chain (init_module) to subsystem initialization to keep the proper order, i.e. we must be sure that the initialization order is correct.

Regards,
Den
[PATCH] proper comment for loopback initialization order
The loopback device is special. It should be initialized at the very beginning. The initialization order has been changed by Eric W. Biederman [EMAIL PROTECTED], and this change is non-obvious and important enough to deserve a proper comment.

Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]
---
--- ./drivers/net/loopback.c.loopcomment	2007-08-26 19:30:38.0 +0400
+++ ./drivers/net/loopback.c	2007-09-27 16:08:06.0 +0400
@@ -293,4 +293,6 @@ static int __init loopback_init(void)
 	return register_pernet_device(&loopback_net_ops);
 }
 
+/* Loopback is special. It should be initialized before any other network
+   device and network subsystem */
 fs_initcall(loopback_init);
Re: [PKT_SCHED]: Add stateless NAT
On Thu, 2007-27-09 at 16:41 +0400, Evgeniy Polyakov wrote:
> I've attached a simple patch which moves the checksum helpers out of the CONFIG_NETFILTER option but keeps them in the same linux/netfilter.h header. This should be enough for removing 'select NETFILTER' in your patch.

Is there any point in keeping the code inside netfilter or keeping the nf_ prefix? Something in net/utilities/ or net/core maybe? The nf_* names can still exist in netfilter as aliases to wherever this is moved to.

cheers,
jamal
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 08:52:03AM -0400, jamal ([EMAIL PROTECTED]) wrote:
> On Thu, 2007-27-09 at 16:41 +0400, Evgeniy Polyakov wrote:
> > I've attached a simple patch which moves the checksum helpers out of the CONFIG_NETFILTER option but keeps them in the same linux/netfilter.h header. This should be enough for removing 'select NETFILTER' in your patch.
>
> Is there any point in keeping the code inside netfilter or keeping the nf_ prefix? Something in net/utilities/ or net/core maybe? The nf_* names can still exist in netfilter as aliases to wherever this is moved to.

No, there is no point in keeping that, I just wanted the smallest possible change :)

--
	Evgeniy Polyakov
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 08:39:45AM -0400, jamal wrote: Do you have plans to do the iproute bits? If you do it will be nice to also update the doc/examples with some simple example(s). Oh yes, I didn't test this by poking bits in the kernel you know :) Here are the iproute bits. Cheers, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- diff --git a/include/utils.h b/include/utils.h index a3fd335..b0dc03e 100644 --- a/include/utils.h +++ b/include/utils.h @@ -36,7 +36,7 @@ extern char * _SL_; extern void incomplete_command(void) __attribute__((noreturn)); -#define NEXT_ARG() do { argv++; if (--argc = 0) incomplete_command(); } while(0) +#define NEXT_ARG() do { argv++; if (--argc 0) incomplete_command(); } while(0) #define NEXT_ARG_OK() (argc - 1 0) #define PREV_ARG() do { argv--; argc++; } while(0) diff --git a/tc/Makefile b/tc/Makefile index 22cd437..cd5a69e 100644 --- a/tc/Makefile +++ b/tc/Makefile @@ -26,6 +26,7 @@ TCMODULES += q_htb.o TCMODULES += m_gact.o TCMODULES += m_mirred.o TCMODULES += m_ipt.o +TCMODULES += m_nat.o TCMODULES += m_pedit.o TCMODULES += p_ip.o TCMODULES += p_icmp.o diff --git a/tc/m_nat.c b/tc/m_nat.c new file mode 100644 index 000..9a6c7da --- /dev/null +++ b/tc/m_nat.c @@ -0,0 +1,208 @@ +/* + * m_nat.c NAT module + * + * This program is free software; you can distribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. 
+ * + * Authors:Herbert Xu [EMAIL PROTECTED] + * + */ + +#include stdio.h +#include stdlib.h +#include unistd.h +#include syslog.h +#include fcntl.h +#include sys/socket.h +#include netinet/in.h +#include arpa/inet.h +#include string.h +#include dlfcn.h +#include utils.h +#include tc_util.h +#include linux/tc_act/tc_nat.h + +static void +explain(void) +{ + fprintf(stderr, Usage: ... nat NAT\n + NAT := DIRECTION OLD NEW\n + DIRECTION := { ingress | egress }\n + OLD := PREFIX\n + NEW := ADDRESS\n); +} + +static void +usage(void) +{ + explain(); + exit(-1); +} + +static int +parse_nat_args(int *argc_p, char ***argv_p,struct tc_nat *sel) +{ + int argc = *argc_p; + char **argv = *argv_p; + inet_prefix addr; + + if (argc = 0) + return -1; + + if (matches(*argv, egress) == 0) + sel-flags |= TCA_NAT_FLAG_EGRESS; + else if (matches(*argv, ingress) != 0) + goto bad_val; + + NEXT_ARG(); + + if (get_prefix_1(addr, *argv, AF_INET)) + goto bad_val; + + sel-old_addr = addr.data[0]; + sel-mask = htonl(~0u (32 - addr.bitlen)); + + NEXT_ARG(); + + if (get_prefix_1(addr, *argv, AF_INET)) + goto bad_val; + + sel-new_addr = addr.data[0]; + + NEXT_ARG(); + + *argc_p = argc; + *argv_p = argv; + return 0; + +bad_val: + return -1; +} + +static int +parse_nat(struct action_util *a, int *argc_p, char ***argv_p, int tca_id, struct nlmsghdr *n) +{ + struct tc_nat sel; + + int argc = *argc_p; + char **argv = *argv_p; + int ok = 0, iok = 0; + struct rtattr *tail; + + memset(sel, 0, sizeof(sel)); + + while (argc 0) { + if (matches(*argv, nat) == 0) { + NEXT_ARG(); + if (parse_nat_args(argc, argv, sel)) { + fprintf(stderr, Illegal nat construct (%s) \n, + *argv); + explain(); + return -1; + } + ok++; + continue; + } else if (matches(*argv, help) == 0) { + usage(); + } else { + break; + } + + } + + if (!ok) { + explain(); + return -1; + } + + if (argc) { + if (matches(*argv, reclassify) == 0) { + sel.action = TC_ACT_RECLASSIFY; + NEXT_ARG(); + } else if (matches(*argv, pipe) == 0) { + sel.action = 
TC_ACT_PIPE; + NEXT_ARG(); + } else if (matches(*argv, drop) == 0 || + matches(*argv, shot) == 0) { + sel.action = TC_ACT_SHOT; + NEXT_ARG(); + } else if (matches(*argv, continue) == 0) { + sel.action = TC_ACT_UNSPEC; + NEXT_ARG(); + } else if
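A note on the prefix handling in parse_nat_args() above: the mask is derived from the prefix length, presumably as htonl(~0u << (32 - addr.bitlen)) before the shift operator was lost in this archive. In C, shifting a 32-bit value by 32 is undefined, so a prefix length of 0 needs care. The following is a minimal sketch of a guarded variant; prefix_to_mask is a hypothetical helper name and is not part of the iproute2 patch.

```c
#include <stdint.h>
#include <arpa/inet.h>	/* htonl */

/*
 * Convert an IPv4 prefix length (0..32) into a network-byte-order mask.
 * Hypothetical helper: handles /0 separately because a shift by 32 bits
 * is undefined behaviour for a 32-bit operand.
 */
static uint32_t prefix_to_mask(unsigned int bitlen)
{
	if (bitlen == 0)
		return 0;
	if (bitlen >= 32)
		return htonl(0xffffffffu);
	return htonl(0xffffffffu << (32 - bitlen));
}
```

For example, a /24 prefix gives 255.255.255.0 and a /8 prefix gives 255.0.0.0, both in network byte order.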
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 08:39:45AM -0400, jamal wrote:
> You also need to p->tcf_qstats.drops++ for all packets that get shot.

I was rather hoping that my packets wouldn't get shot :) But yeah let's increment the drops counter for consistency.

[PKT_SCHED]: Add stateless NAT

Stateless NAT is useful in controlled environments where restrictions are placed on through traffic such that we don't need connection tracking to correctly NAT protocol-specific data. In particular, this is of interest when the number of flows or the number of addresses being NATed is large, or if connection tracking information has to be replicated and where it is not practical to do so.

Previously we had stateless NAT functionality which was integrated into the IPv4 routing subsystem. This was a great solution as long as the NAT worked on a subnet to subnet basis such that the number of NAT rules was relatively small. The reason is that for SNAT the routing based system had to perform a linear scan through the rules.

If the number of rules is large then major renovations would have to take place in the routing subsystem to make this practical. For the time being, the least intrusive way of achieving this is to use the u32 classifier written by Alexey Kuznetsov along with the actions infrastructure implemented by Jamal Hadi Salim.

The following patch is an attempt at this problem by creating a new nat action that can be invoked from u32 hash tables, which would allow a large number of stateless NAT rules that can be used/updated in constant time.

The actual NAT code is mostly based on the previous stateless NAT code written by Alexey. In future we might be able to utilise the protocol NAT code from netfilter to improve support for other protocols.
Signed-off-by: Herbert Xu [EMAIL PROTECTED] Thanks, -- Visit Openswan at http://www.openswan.org/ Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED] Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt -- diff --git a/include/linux/tc_act/tc_nat.h b/include/linux/tc_act/tc_nat.h new file mode 100644 index 000..9280c6f --- /dev/null +++ b/include/linux/tc_act/tc_nat.h @@ -0,0 +1,29 @@ +#ifndef __LINUX_TC_NAT_H +#define __LINUX_TC_NAT_H + +#include linux/pkt_cls.h +#include linux/types.h + +#define TCA_ACT_NAT 9 + +enum +{ + TCA_NAT_UNSPEC, + TCA_NAT_PARMS, + TCA_NAT_TM, + __TCA_NAT_MAX +}; +#define TCA_NAT_MAX (__TCA_NAT_MAX - 1) + +#define TCA_NAT_FLAG_EGRESS 1 + +struct tc_nat +{ + tc_gen; + __be32 old_addr; + __be32 new_addr; + __be32 mask; + __u32 flags; +}; + +#endif diff --git a/include/net/tc_act/tc_nat.h b/include/net/tc_act/tc_nat.h new file mode 100644 index 000..4a691f3 --- /dev/null +++ b/include/net/tc_act/tc_nat.h @@ -0,0 +1,21 @@ +#ifndef __NET_TC_NAT_H +#define __NET_TC_NAT_H + +#include linux/types.h +#include net/act_api.h + +struct tcf_nat { + struct tcf_common common; + + __be32 old_addr; + __be32 new_addr; + __be32 mask; + u32 flags; +}; + +static inline struct tcf_nat *to_tcf_nat(struct tcf_common *pc) +{ + return container_of(pc, struct tcf_nat, common); +} + +#endif /* __NET_TC_NAT_H */ diff --git a/net/sched/Kconfig b/net/sched/Kconfig index 8a74cac..92435a8 100644 --- a/net/sched/Kconfig +++ b/net/sched/Kconfig @@ -447,6 +447,17 @@ config NET_ACT_IPT To compile this code as a module, choose M here: the module will be called ipt. +config NET_ACT_NAT +tristate Stateless NAT +depends on NET_CLS_ACT +select NETFILTER +---help--- + Say Y here to do stateless NAT on IPv4 packets. You should use + netfilter for NAT unless you know what you are doing. + + To compile this code as a module, choose M here: the + module will be called nat. 
+ config NET_ACT_PEDIT tristate Packet Editing depends on NET_CLS_ACT diff --git a/net/sched/Makefile b/net/sched/Makefile index b67c36f..81ecbe8 100644 --- a/net/sched/Makefile +++ b/net/sched/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_NET_ACT_POLICE) += act_police.o obj-$(CONFIG_NET_ACT_GACT) += act_gact.o obj-$(CONFIG_NET_ACT_MIRRED) += act_mirred.o obj-$(CONFIG_NET_ACT_IPT) += act_ipt.o +obj-$(CONFIG_NET_ACT_NAT) += act_nat.o obj-$(CONFIG_NET_ACT_PEDIT)+= act_pedit.o obj-$(CONFIG_NET_ACT_SIMP) += act_simple.o obj-$(CONFIG_NET_SCH_FIFO) += sch_fifo.o diff --git a/net/sched/act_nat.c b/net/sched/act_nat.c new file mode 100644 index 000..1bce750 --- /dev/null +++ b/net/sched/act_nat.c @@ -0,0 +1,322 @@ +/* + * Stateless NAT actions + * + * Copyright (c) 2007 Herbert Xu [EMAIL PROTECTED] + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the Free + * Software Foundation; either
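The body of act_nat.c is cut off in this archive, so the rewrite step itself is not visible here. As a sketch only, assuming the old_addr, new_addr and mask fields of struct tc_nat are combined in the usual prefix-to-prefix way, a stateless rewrite might look like the following. The function name is hypothetical and host byte order is used for readability, whereas the patch stores __be32 values.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Rewrite *addr if it falls inside old_addr/mask, keeping the host bits
 * and replacing the network bits with those of new_addr.
 * Hypothetical sketch: not taken from the patch, whose code is truncated.
 */
static bool nat_rewrite(uint32_t *addr, uint32_t old_addr,
			uint32_t new_addr, uint32_t mask)
{
	if ((*addr ^ old_addr) & mask)
		return false;		/* outside the old prefix: leave as is */

	*addr = (new_addr & mask) | (*addr & ~mask);
	return true;
}
```

Because the decision depends only on the packet's address and the rule, no per-flow state is kept, which is the property the patch description relies on.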
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 08:45:15PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote: On Thu, Sep 27, 2007 at 04:41:21PM +0400, Evgeniy Polyakov wrote: I've attached simple patch which moves checksum helpers out of CONFIG_NETFILTER option but still in the same linux/netfilter.h header. This should be enough for removing 'select NETFILTER' in your patch. Close but no cigar :) :) take 2. diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index 1dd075e..5313739 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -40,6 +40,41 @@ #endif #ifdef __KERNEL__ + +static inline void nf_csum_replace4(__sum16 *sum, __be32 from, __be32 to) +{ + __be32 diff[] = { ~from, to }; + + *sum = csum_fold(csum_partial((char *)diff, sizeof(diff), ~csum_unfold(*sum))); +} + +static inline void nf_csum_replace2(__sum16 *sum, __be16 from, __be16 to) +{ + nf_csum_replace4(sum, (__force __be32)from, (__force __be32)to); +} + +static inline void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb, + __be32 from, __be32 to, int pseudohdr) +{ + __be32 diff[] = { ~from, to }; + if (skb-ip_summed != CHECKSUM_PARTIAL) { + *sum = csum_fold(csum_partial(diff, sizeof(diff), + ~csum_unfold(*sum))); + if (skb-ip_summed == CHECKSUM_COMPLETE pseudohdr) + skb-csum = ~csum_partial(diff, sizeof(diff), + ~skb-csum); + } else if (pseudohdr) + *sum = ~csum_fold(csum_partial(diff, sizeof(diff), + csum_unfold(*sum))); +} + +static inline void nf_proto_csum_replace2(__sum16 *sum, struct sk_buff *skb, + __be16 from, __be16 to, int pseudohdr) +{ + nf_proto_csum_replace4(sum, skb, (__force __be32)from, + (__force __be32)to, pseudohdr); +} + #ifdef CONFIG_NETFILTER extern void netfilter_init(void); @@ -289,28 +324,6 @@ extern void nf_invalidate_cache(int pf); Returns true or false. 
*/ extern int skb_make_writable(struct sk_buff **pskb, unsigned int writable_len); -static inline void nf_csum_replace4(__sum16 *sum, __be32 from, __be32 to) -{ - __be32 diff[] = { ~from, to }; - - *sum = csum_fold(csum_partial((char *)diff, sizeof(diff), ~csum_unfold(*sum))); -} - -static inline void nf_csum_replace2(__sum16 *sum, __be16 from, __be16 to) -{ - nf_csum_replace4(sum, (__force __be32)from, (__force __be32)to); -} - -extern void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb, - __be32 from, __be32 to, int pseudohdr); - -static inline void nf_proto_csum_replace2(__sum16 *sum, struct sk_buff *skb, - __be16 from, __be16 to, int pseudohdr) -{ - nf_proto_csum_replace4(sum, skb, (__force __be32)from, - (__force __be32)to, pseudohdr); -} - struct nf_afinfo { unsigned short family; __sum16 (*checksum)(struct sk_buff *skb, unsigned int hook, diff --git a/net/netfilter/core.c b/net/netfilter/core.c index 381a77c..9ffbbe2 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -226,22 +226,6 @@ copy_skb: } EXPORT_SYMBOL(skb_make_writable); -void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb, - __be32 from, __be32 to, int pseudohdr) -{ - __be32 diff[] = { ~from, to }; - if (skb-ip_summed != CHECKSUM_PARTIAL) { - *sum = csum_fold(csum_partial(diff, sizeof(diff), - ~csum_unfold(*sum))); - if (skb-ip_summed == CHECKSUM_COMPLETE pseudohdr) - skb-csum = ~csum_partial(diff, sizeof(diff), - ~skb-csum); - } else if (pseudohdr) - *sum = ~csum_fold(csum_partial(diff, sizeof(diff), - csum_unfold(*sum))); -} -EXPORT_SYMBOL(nf_proto_csum_replace4); - #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) /* This does not belong here, but locally generated errors need it if connection tracking in use: without this, connection may not be in hash table, and hence -- Evgeniy Polyakov - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at 
http://vger.kernel.org/majordomo-info.html
RTL8111 PCI Express Gigabit driver r8169 produces slow file transfers
Dear Linux r8169 crew,

I got your e-mail address from the modinfo output of the r8169 module. I am not sure if this is the right way to contact you, but I hope you can help me.

The current driver in kernel 2.6.22 produces very bad network speeds. I only get 100 kb/s. Maybe you could take a look at this bug report at launchpad.net:
https://bugs.launchpad.net/ubuntu/+source/linux-ubuntu-modules-2.6.22/+bug/114171

The latest driver from Realtek is working very well:
ftp://210.51.181.211/cn/nic/r8168-8.003.00.tar.bz2

What I would like to know is whether the latest Realtek driver will make it into the kernel, or whether the problems with the r8169 module are already solved.

If there are any questions, feel free to contact me.

Thanks in advance,
Achim Frase
Re: [PKT_SCHED]: Add stateless NAT
On Thu, 2007-27-09 at 21:01 +0800, Herbert Xu wrote:
> On Thu, Sep 27, 2007 at 08:39:45AM -0400, jamal wrote:
> > Do you have plans to do the iproute bits? If you do it will be nice to also update the doc/examples with some simple example(s).
>
> Oh yes, I didn't test this by poking bits in the kernel you know :)

Trust me - it has been done before ;->

Thanks Herbert, looks good to me.

cheers,
jamal
Re: [PKT_SCHED]: Add stateless NAT
Evgeniy Polyakov wrote: +static inline void nf_csum_replace4(__sum16 *sum, __be32 from, __be32 to) +{ + __be32 diff[] = { ~from, to }; + + *sum = csum_fold(csum_partial((char *)diff, sizeof(diff), ~csum_unfold(*sum))); +} + +static inline void nf_csum_replace2(__sum16 *sum, __be16 from, __be16 to) +{ + nf_csum_replace4(sum, (__force __be32)from, (__force __be32)to); +} + +static inline void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb, + __be32 from, __be32 to, int pseudohdr) +{ + __be32 diff[] = { ~from, to }; + if (skb-ip_summed != CHECKSUM_PARTIAL) { + *sum = csum_fold(csum_partial(diff, sizeof(diff), + ~csum_unfold(*sum))); + if (skb-ip_summed == CHECKSUM_COMPLETE pseudohdr) + skb-csum = ~csum_partial(diff, sizeof(diff), + ~skb-csum); + } else if (pseudohdr) + *sum = ~csum_fold(csum_partial(diff, sizeof(diff), + csum_unfold(*sum))); +} + +static inline void nf_proto_csum_replace2(__sum16 *sum, struct sk_buff *skb, + __be16 from, __be16 to, int pseudohdr) +{ + nf_proto_csum_replace4(sum, skb, (__force __be32)from, + (__force __be32)to, pseudohdr); +} These are way too large to get inlined, please move somewhere below net/core. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
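The helpers quoted in this message update a checksum incrementally when a 32-bit field changes, by summing the complement of the old value and the new value into the existing checksum. A userspace sketch of the same idea, following RFC 1624, is shown below. The function names are hypothetical and plain integers replace the kernel's __sum16/__be32 types and csum_partial().

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator to 16 bits with end-around carry. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Full ones'-complement checksum over n 16-bit words. */
static uint16_t csum_words(const uint16_t *w, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += w[i];
	return (uint16_t)~fold16(sum);
}

/* Incremental update for a 32-bit field: HC' = ~(~HC + ~m + m'). */
static uint16_t csum_replace4(uint16_t check, uint32_t from, uint32_t to)
{
	uint32_t sum = (uint16_t)~check;

	sum += (uint16_t)~(from >> 16);		/* subtract old high word */
	sum += (uint16_t)~(from & 0xffff);	/* subtract old low word */
	sum += to >> 16;			/* add new high word */
	sum += to & 0xffff;			/* add new low word */
	return (uint16_t)~fold16(sum);
}
```

The incremental form touches only the changed words, so its cost does not depend on packet length, which is why NAT code prefers it over recomputing the whole checksum.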
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 05:10:08PM +0400, Evgeniy Polyakov wrote:
> +static inline void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
> +				      __be32 from, __be32 to, int pseudohdr)
> +{
> +	__be32 diff[] = { ~from, to };
> +	if (skb->ip_summed != CHECKSUM_PARTIAL) {
> +		*sum = csum_fold(csum_partial(diff, sizeof(diff),
> +				~csum_unfold(*sum)));
> +		if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
> +			skb->csum = ~csum_partial(diff, sizeof(diff),
> +						~skb->csum);
> +	} else if (pseudohdr)
> +		*sum = ~csum_fold(csum_partial(diff, sizeof(diff),
> +				csum_unfold(*sum)));
> +}

The embedded people are going to hate you for this :)

How about putting it in net/core/utils.c?

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 03:16:48PM +0200, Patrick McHardy ([EMAIL PROTECTED]) wrote: Evgeniy Polyakov wrote: +static inline void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb, + __be32 from, __be32 to, int pseudohdr) +{ + __be32 diff[] = { ~from, to }; + if (skb-ip_summed != CHECKSUM_PARTIAL) { + *sum = csum_fold(csum_partial(diff, sizeof(diff), + ~csum_unfold(*sum))); + if (skb-ip_summed == CHECKSUM_COMPLETE pseudohdr) + skb-csum = ~csum_partial(diff, sizeof(diff), + ~skb-csum); + } else if (pseudohdr) + *sum = ~csum_fold(csum_partial(diff, sizeof(diff), + csum_unfold(*sum))); +} + +static inline void nf_proto_csum_replace2(__sum16 *sum, struct sk_buff *skb, + __be16 from, __be16 to, int pseudohdr) +{ + nf_proto_csum_replace4(sum, skb, (__force __be32)from, + (__force __be32)to, pseudohdr); +} These are way too large to get inlined, please move somewhere below net/core. I knew that... :) I'm pretty sure new files called net/core/helpers.c which will host that helper is not a good solution too? 
diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index 1dd075e..624d78b 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -40,6 +40,29 @@ #endif #ifdef __KERNEL__ + +static inline void nf_csum_replace4(__sum16 *sum, __be32 from, __be32 to) +{ + __be32 diff[] = { ~from, to }; + + *sum = csum_fold(csum_partial((char *)diff, sizeof(diff), ~csum_unfold(*sum))); +} + +static inline void nf_csum_replace2(__sum16 *sum, __be16 from, __be16 to) +{ + nf_csum_replace4(sum, (__force __be32)from, (__force __be32)to); +} + +extern void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb, + __be32 from, __be32 to, int pseudohdr); + +static inline void nf_proto_csum_replace2(__sum16 *sum, struct sk_buff *skb, + __be16 from, __be16 to, int pseudohdr) +{ + nf_proto_csum_replace4(sum, skb, (__force __be32)from, + (__force __be32)to, pseudohdr); +} + #ifdef CONFIG_NETFILTER extern void netfilter_init(void); @@ -289,28 +312,6 @@ extern void nf_invalidate_cache(int pf); Returns true or false. 
*/ extern int skb_make_writable(struct sk_buff **pskb, unsigned int writable_len); -static inline void nf_csum_replace4(__sum16 *sum, __be32 from, __be32 to) -{ - __be32 diff[] = { ~from, to }; - - *sum = csum_fold(csum_partial((char *)diff, sizeof(diff), ~csum_unfold(*sum))); -} - -static inline void nf_csum_replace2(__sum16 *sum, __be16 from, __be16 to) -{ - nf_csum_replace4(sum, (__force __be32)from, (__force __be32)to); -} - -extern void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb, - __be32 from, __be32 to, int pseudohdr); - -static inline void nf_proto_csum_replace2(__sum16 *sum, struct sk_buff *skb, - __be16 from, __be16 to, int pseudohdr) -{ - nf_proto_csum_replace4(sum, skb, (__force __be32)from, - (__force __be32)to, pseudohdr); -} - struct nf_afinfo { unsigned short family; __sum16 (*checksum)(struct sk_buff *skb, unsigned int hook, diff --git a/net/core/Makefile b/net/core/Makefile index 4751613..5757323 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -3,7 +3,7 @@ # obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \ -gen_stats.o gen_estimator.o +gen_stats.o gen_estimator.o helpers.o obj-$(CONFIG_SYSCTL) += sysctl_net_core.o diff --git a/net/core/helpers.c b/net/core/helpers.c new file mode 100644 index 000..d3c8d97 --- /dev/null +++ b/net/core/helpers.c @@ -0,0 +1,23 @@ +/* + * Generic helper functions. 
+ */ + +#include linux/types.h +#include linux/skbuff.h + +#include net/checksum.h + +void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb, + __be32 from, __be32 to, int pseudohdr) +{ + __be32 diff[] = { ~from, to }; + if (skb-ip_summed != CHECKSUM_PARTIAL) { + *sum = csum_fold(csum_partial(diff, sizeof(diff), + ~csum_unfold(*sum))); + if (skb-ip_summed == CHECKSUM_COMPLETE pseudohdr) + skb-csum = ~csum_partial(diff, sizeof(diff), + ~skb-csum); + } else if (pseudohdr) + *sum = ~csum_fold(csum_partial(diff, sizeof(diff), + csum_unfold(*sum))); +} diff --git a/net/netfilter/core.c b/net/netfilter/core.c index 381a77c..9ffbbe2 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -226,22 +226,6 @@ copy_skb: }
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 09:20:37PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
> How about putting it in net/core/utils.c?

I knew, that was a bad idea to try to fix netfilter dependency :)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 1dd075e..51b5a22 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -40,6 +40,35 @@
 #endif
 
 #ifdef __KERNEL__
+
+static inline void nf_csum_replace4(__sum16 *sum, __be32 from, __be32 to)
+{
+	__be32 diff[] = { ~from, to };
+
+	*sum = csum_fold(csum_partial((char *)diff, sizeof(diff), ~csum_unfold(*sum)));
+}
+
+static inline void nf_csum_replace2(__sum16 *sum, __be16 from, __be16 to)
+{
+	nf_csum_replace4(sum, (__force __be32)from, (__force __be32)to);
+}
+
+extern void proto_csum_replace(__sum16 *sum, struct sk_buff *skb,
+			       __be32 from, __be32 to, int pseudohdr);
+
+static inline void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
+				      __be32 from, __be32 to, int pseudohdr)
+{
+	proto_csum_replace(sum, skb, from, to, pseudohdr);
+}
+
+static inline void nf_proto_csum_replace2(__sum16 *sum, struct sk_buff *skb,
+				      __be16 from, __be16 to, int pseudohdr)
+{
+	nf_proto_csum_replace4(sum, skb, (__force __be32)from,
+				(__force __be32)to, pseudohdr);
+}
+
 #ifdef CONFIG_NETFILTER
 
 extern void netfilter_init(void);
@@ -289,28 +318,6 @@ extern void nf_invalidate_cache(int pf);
    Returns true or false. */
 extern int skb_make_writable(struct sk_buff **pskb, unsigned int writable_len);
 
-static inline void nf_csum_replace4(__sum16 *sum, __be32 from, __be32 to)
-{
-	__be32 diff[] = { ~from, to };
-
-	*sum = csum_fold(csum_partial((char *)diff, sizeof(diff), ~csum_unfold(*sum)));
-}
-
-static inline void nf_csum_replace2(__sum16 *sum, __be16 from, __be16 to)
-{
-	nf_csum_replace4(sum, (__force __be32)from, (__force __be32)to);
-}
-
-extern void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
-				      __be32 from, __be32 to, int pseudohdr);
-
-static inline void nf_proto_csum_replace2(__sum16 *sum, struct sk_buff *skb,
-				      __be16 from, __be16 to, int pseudohdr)
-{
-	nf_proto_csum_replace4(sum, skb, (__force __be32)from,
-				(__force __be32)to, pseudohdr);
-}
-
 struct nf_afinfo {
 	unsigned short	family;
 	__sum16		(*checksum)(struct sk_buff *skb, unsigned int hook,
diff --git a/net/core/utils.c b/net/core/utils.c
index 0bf17da..2f6d4d2 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -293,3 +293,20 @@ out:
 }
 
 EXPORT_SYMBOL(in6_pton);
+
+void proto_csum_replace(__sum16 *sum, struct sk_buff *skb,
+			__be32 from, __be32 to, int pseudohdr)
+{
+	__be32 diff[] = { ~from, to };
+	if (skb->ip_summed != CHECKSUM_PARTIAL) {
+		*sum = csum_fold(csum_partial(diff, sizeof(diff),
+				~csum_unfold(*sum)));
+		if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
+			skb->csum = ~csum_partial(diff, sizeof(diff),
+						~skb->csum);
+	} else if (pseudohdr)
+		*sum = ~csum_fold(csum_partial(diff, sizeof(diff),
+				csum_unfold(*sum)));
+}
+
+EXPORT_SYMBOL(proto_csum_replace);
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 381a77c..9ffbbe2 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -226,22 +226,6 @@ copy_skb:
 }
 EXPORT_SYMBOL(skb_make_writable);
 
-void nf_proto_csum_replace4(__sum16 *sum, struct sk_buff *skb,
-			    __be32 from, __be32 to, int pseudohdr)
-{
-	__be32 diff[] = { ~from, to };
-	if (skb->ip_summed != CHECKSUM_PARTIAL) {
-		*sum = csum_fold(csum_partial(diff, sizeof(diff),
-				~csum_unfold(*sum)));
-		if (skb->ip_summed == CHECKSUM_COMPLETE && pseudohdr)
-			skb->csum = ~csum_partial(diff, sizeof(diff),
-						~skb->csum);
-	} else if (pseudohdr)
-		*sum = ~csum_fold(csum_partial(diff, sizeof(diff),
-				csum_unfold(*sum)));
-}
-EXPORT_SYMBOL(nf_proto_csum_replace4);
-
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 /* This does not belong here, but locally generated errors need it if connection
    tracking in use: without this, connection may not be in hash table, and hence

--
Evgeniy Polyakov
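For readers unfamiliar with the trick behind nf_csum_replace4(), here is a minimal userspace sketch of an incremental Internet checksum update in the style of RFC 1624 (HC' = ~(~HC + ~m + m')). It is not the kernel code: the names sketch_fold16, sketch_csum_full and sketch_csum_replace32 are made up for illustration, and the sketch works on host-order 16-bit words instead of going through csum_partial() on network-order data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits. */
static uint16_t sketch_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Full checksum over n 16-bit words (host order, for simplicity). */
static uint16_t sketch_csum_full(const uint16_t *words, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += words[i];
	return (uint16_t)~sketch_fold16(sum);
}

/*
 * Incremental update when one 32-bit field changes from 'from' to 'to':
 * add the complement of the old value and the new value to the
 * complemented checksum, then fold and complement again.
 */
static uint16_t sketch_csum_replace32(uint16_t check, uint32_t from, uint32_t to)
{
	uint32_t sum = (uint16_t)~check;

	sum += (uint16_t)~(from >> 16);
	sum += (uint16_t)~(from & 0xffff);
	sum += to >> 16;
	sum += to & 0xffff;
	return (uint16_t)~sketch_fold16(sum);
}
```

The point of the technique is that a NAT-style rewrite of one address only touches the old and new values, not the whole packet; for a non-zero header like the one in a typical IP packet, the result matches a full recomputation.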
Re: [PKT_SCHED]: Add stateless NAT
Evgeniy Polyakov wrote:
> On Thu, Sep 27, 2007 at 03:16:48PM +0200, Patrick McHardy ([EMAIL PROTECTED]) wrote:
>> These are way too large to get inlined, please move somewhere below net/core.
> I knew that... :) I'm pretty sure new files called net/core/helpers.c which will host that helper is not a good solution too?

I like Herbert's suggestion of net/core/utils.c better (and without the nf_ prefix please).
Re: [PKT_SCHED]: Add stateless NAT
On Thu, Sep 27, 2007 at 03:30:12PM +0200, Patrick McHardy ([EMAIL PROTECTED]) wrote:
> Evgeniy Polyakov wrote:
>> On Thu, Sep 27, 2007 at 03:16:48PM +0200, Patrick McHardy ([EMAIL PROTECTED]) wrote:
>>> These are way too large to get inlined, please move somewhere below net/core.
>> I knew that... :) I'm pretty sure new files called net/core/helpers.c which will host that helper is not a good solution too?
> I like Herbert's suggestion of net/core/utils.c better (and without the nf_ prefix please).

I've put it there without the nf_ prefix and updated the netfilter header to create a new inline function with that prefix for private netfilter usage.

--
Evgeniy Polyakov
Re: [PKT_SCHED]: Add stateless NAT
On Thu, 2007-27-09 at 15:30 +0200, Patrick McHardy wrote:
> I like Herbert's suggestion of net/core/utils.c better (and without the nf_ prefix please).

me too. Evgeniy, you are the man if you finish the whole cow as some wise Africans would say;-

cheers, jamal
Re: [PKT_SCHED]: Add stateless NAT
Evgeniy Polyakov wrote:
> On Thu, Sep 27, 2007 at 09:20:37PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
>> How about putting it in net/core/utils.c?
> I knew, that was a bad idea to try to fix netfilter dependency :)
> diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h

This looks good to me.
[PATCH] various dst_ifdown routines to catch refcounting bugs
Moving dst entries into init_net.loopback_dev is not a good thing. This hides obvious and non-obvious ref-counting bugs. This patch uses net_ns loopback instead of init_net loopback. This allowes to catch various bugs like recent one in IPv6 DAD handling. Signed-off-by: Denis V. Lunev [EMAIL PROTECTED] --- ./net/core/dst.c.loop 2007-08-26 19:30:38.0 +0400 +++ ./net/core/dst.c2007-08-26 19:30:38.0 +0400 @@ -279,11 +279,11 @@ static inline void dst_ifdown(struct dst if (!unregister) { dst-input = dst-output = dst_discard; } else { - dst-dev = init_net.loopback_dev; + dst-dev = dst-dev-nd_net-loopback_dev; dev_hold(dst-dev); dev_put(dev); if (dst-neighbour dst-neighbour-dev == dev) { - dst-neighbour-dev = init_net.loopback_dev; + dst-neighbour-dev = dst-dev; dev_put(dev); dev_hold(dst-neighbour-dev); } --- ./net/ipv4/route.c.loop 2007-08-26 19:30:38.0 +0400 +++ ./net/ipv4/route.c 2007-08-26 19:30:38.0 +0400 @@ -1402,8 +1402,9 @@ static void ipv4_dst_ifdown(struct dst_e { struct rtable *rt = (struct rtable *) dst; struct in_device *idev = rt-idev; - if (dev != init_net.loopback_dev idev idev-dev == dev) { - struct in_device *loopback_idev = in_dev_get(init_net.loopback_dev); + if (dev != dev-nd_net-loopback_dev idev idev-dev == dev) { + struct in_device *loopback_idev = + in_dev_get(dev-nd_net-loopback_dev); if (loopback_idev) { rt-idev = loopback_idev; in_dev_put(idev); --- ./net/ipv4/xfrm4_policy.c.loop 2007-08-26 19:30:38.0 +0400 +++ ./net/ipv4/xfrm4_policy.c 2007-08-26 19:30:38.0 +0400 @@ -306,7 +306,8 @@ static void xfrm4_dst_ifdown(struct dst_ xdst = (struct xfrm_dst *)dst; if (xdst-u.rt.idev-dev == dev) { - struct in_device *loopback_idev = in_dev_get(init_net.loopback_dev); + struct in_device *loopback_idev = + in_dev_get(dev-nd_net-loopback_dev); BUG_ON(!loopback_idev); do { --- ./net/ipv6/route.c.loop 2007-08-26 19:30:38.0 +0400 +++ ./net/ipv6/route.c 2007-08-26 19:30:38.0 +0400 @@ -220,9 +220,12 @@ static void ip6_dst_ifdown(struct dst_en { struct rt6_info 
*rt = (struct rt6_info *)dst; struct inet6_dev *idev = rt-rt6i_idev; + struct net_device *loopback_dev = + dev-nd_net-loopback_dev; - if (dev != init_net.loopback_dev idev != NULL idev-dev == dev) { - struct inet6_dev *loopback_idev = in6_dev_get(init_net.loopback_dev); + if (dev != loopback_dev idev != NULL idev-dev == dev) { + struct inet6_dev *loopback_idev = + in6_dev_get(loopback_dev); if (loopback_idev != NULL) { rt-rt6i_idev = loopback_idev; in6_dev_put(idev); @@ -1185,12 +1188,12 @@ int ip6_route_add(struct fib6_config *cf if ((cfg-fc_flags RTF_REJECT) || (dev (dev-flagsIFF_LOOPBACK) !(addr_typeIPV6_ADDR_LOOPBACK))) { /* hold loopback dev/idev if we haven't done so. */ - if (dev != init_net.loopback_dev) { + if (dev != dev-nd_net-loopback_dev) { if (dev) { dev_put(dev); in6_dev_put(idev); } - dev = init_net.loopback_dev; + dev = dev-nd_net-loopback_dev; dev_hold(dev); idev = in6_dev_get(dev); if (!idev) { @@ -1894,13 +1897,13 @@ struct rt6_info *addrconf_dst_alloc(stru if (rt == NULL) return ERR_PTR(-ENOMEM); - dev_hold(init_net.loopback_dev); + dev_hold(idev-dev-nd_net-loopback_dev); in6_dev_hold(idev); rt-u.dst.flags = DST_HOST; rt-u.dst.input = ip6_input; rt-u.dst.output = ip6_output; - rt-rt6i_dev = init_net.loopback_dev; + rt-rt6i_dev = idev-dev-nd_net-loopback_dev; rt-rt6i_idev = idev; rt-u.dst.metrics[RTAX_MTU-1] = ipv6_get_mtu(rt-rt6i_dev); rt-u.dst.metrics[RTAX_ADVMSS-1] = ipv6_advmss(dst_mtu(rt-u.dst)); --- ./net/ipv6/xfrm6_policy.c.loop 2007-08-26 19:30:38.0 +0400 +++ ./net/ipv6/xfrm6_policy.c 2007-08-26 19:30:38.0 +0400 @@ -375,7 +375,8 @@ static void xfrm6_dst_ifdown(struct dst_ xdst = (struct xfrm_dst *)dst; if (xdst-u.rt6.rt6i_idev-dev == dev) { - struct inet6_dev *loopback_idev = in6_dev_get(init_net.loopback_dev); + struct inet6_dev
Re: [PATCH] sky2: sky2 FE+ receive status workaround
On Thu, 27 Sep 2007 09:14:11 +0100 Jochen Voß [EMAIL PROTECTED] wrote:
> Hi Stephen,
> On 27 Sep 2007, at 01:58, Stephen Hemminger wrote:
>> +	/* This chip has hardware problems that generates bogus status.
>> +	 * So do only marginal checking and expect higher level protocols
>> +	 * to handle crap frames.
>> +	 */
>> +	if (sky2->hw->chip_id == CHIP_ID_YUKON_FE_P &&
>> +	    sky2->hw->chip_rev == CHIP_REV_YU_FE2_A0 &&
>> +	    length != count)
>> +		goto okay;
> Shouldn't the condition be length == count?

No, the code is correct as is. Basically if length == count, then the status field is correct, and the driver can go ahead and use it. If length != count, then the status is bogus but the data is okay.

--
Stephen Hemminger [EMAIL PROTECTED]
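The decision Stephen describes can be restated as a small standalone sketch. This is only a model of the explanation in this thread, not the sky2 driver: struct sketch_rx, the quirky_fe_plus_a0 flag and the two verdict names are hypothetical stand-ins for the chip id/revision test and the "goto okay" path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one received frame. */
struct sketch_rx {
	bool quirky_fe_plus_a0;	/* stands in for the chip_id/chip_rev test */
	unsigned int length;	/* length reported in the status word */
	unsigned int count;	/* byte count the driver saw */
};

enum sketch_verdict {
	SKETCH_CHECK_STATUS,	/* status is trustworthy, validate it as usual */
	SKETCH_ACCEPT_UNCHECKED	/* status is bogus, pass the data up anyway */
};

static enum sketch_verdict sketch_rx_verdict(const struct sketch_rx *rx)
{
	/* On the affected chip a length mismatch means "bogus status, good data". */
	if (rx->quirky_fe_plus_a0 && rx->length != rx->count)
		return SKETCH_ACCEPT_UNCHECKED;
	return SKETCH_CHECK_STATUS;
}
```

So the mismatch case is exactly the one where the normal status checks have to be skipped, which is why the condition is length != count and not length == count.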
IPSec on Linux Kernel
Hi, I'm currently doing some research work and thought that you might be able to help me. I'm trying to find out where I can learn more about the IPSec implementation in the current Linux kernel (2.6.22). I need to find where the AH calls are made so that I can reroute those function calls to an external module, for safer AH generation. It would be helpful to know which source files to study in order to understand the IPSec stack in detail and reroute the function calls. Any hints on these topics?

Thanks in advance,
Fabio Souto
Portugal
Re: TCP Spike
On Thu, 27 Sep 2007 11:58:01 +0800 Majumder, Rajib [EMAIL PROTECTED] wrote:
> Hi, We have observed 40ms latency spikes in TCP connections in burst type of traffic. This affects regular TCP sockets. We observed this issue in kernels of 2.4.21 and kernel 2.6.5.

Unfortunately, 2.6.5 is out of my short term memory at this point. I do remember that 2.6.5 used BIC for congestion control, and there were some math errors in the congestion control logic that caused it to be way too aggressive.

> Apparently, this seems to be fixed in 2.6.19. Can someone throw some light on this?

My guess is that the addition of the SACK hinting might be the major win. The code takes 3 passes over the SACK list, so with large outstanding data that was a major bottleneck, not sure if it was 4ms worth though.

> Is this a congestion control/avoidance issue? What congestion control algorithm is used before 2.6.8?

Default congestion control in early 2.6 was BIC, then after CUBIC stabilized it was made the default in 2.6.19. Another thing that may cause changes in latency is Appropriate Byte Counting (ABC). It was added in 2.6.14, but then turned off by default in 2.6.18. The problem is that ABC caused performance problems with some applications that sent messages as many small writes.

--
Stephen Hemminger [EMAIL PROTECTED]
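One plausible way to see why byte counting can hurt applications that send many small writes is the toy model below. It is a simplified illustration in the spirit of RFC 3465, not the kernel's slow-start code; the function names are hypothetical, the window is counted in packets, and the ABC variant assumes each ACK covers at most one MSS.

```c
#include <assert.h>

/* Classic per-ACK slow start: the window grows by one packet per ACK. */
static unsigned int sketch_slow_start_per_ack(unsigned int cwnd, unsigned int acks)
{
	return cwnd + acks;
}

/*
 * ABC-style slow start: the window grows by one packet only after a
 * full MSS worth of bytes has been acknowledged.
 * Assumes bytes_per_ack <= mss.
 */
static unsigned int sketch_slow_start_abc(unsigned int cwnd, unsigned int acks,
					  unsigned int bytes_per_ack,
					  unsigned int mss)
{
	unsigned int pending = 0;
	unsigned int i;

	for (i = 0; i < acks; i++) {
		pending += bytes_per_ack;
		if (pending >= mss) {
			pending -= mss;
			cwnd++;
		}
	}
	return cwnd;
}
```

With full-sized segments both variants grow the window at the same rate; with 100-byte messages and a 1000-byte MSS, the byte-counting variant needs ten ACKs for each increment, so a sender of small writes stays at a small window much longer in this model.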
Re: [RFC] af_packet: allow disabling timestamps
This small modification to Stephen's patch timestamps the skb when needed, so the timestamp can be reused by other af_packet sockets. Signed-off-by: Unai Uribarri [EMAIL PROTECTED] --- a/net/core/sock.c +++ b/net/core/sock.c @@ -259,7 +259,8 @@ static void sock_disable_timestamp(struct sock *sk) { if (sock_flag(sk, SOCK_TIMESTAMP)) { sock_reset_flag(sk, SOCK_TIMESTAMP); - net_disable_timestamp(); + if (sk-sk_family != PF_PACKET) + net_disable_timestamp(); } } @@ -1655,7 +1656,8 @@ void sock_enable_timestamp(struct sock *sk) { if (!sock_flag(sk, SOCK_TIMESTAMP)) { sock_set_flag(sk, SOCK_TIMESTAMP); - net_enable_timestamp(); + if (sk-sk_family != PF_PACKET) + net_enable_timestamp(); } } EXPORT_SYMBOL(sock_enable_timestamp); --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -570,7 +570,6 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe unsigned long status = TP_STATUS_LOSING|TP_STATUS_USER; unsigned short macoff, netoff; struct sk_buff *copy_skb = NULL; - struct timeval tv; if (dev-nd_net != init_net) goto drop; @@ -648,12 +647,18 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe h-tp_snaplen = snaplen; h-tp_mac = macoff; h-tp_net = netoff; - if (skb-tstamp.tv64) + + if (sock_flag(sk, SOCK_TIMESTAMP)) { + struct timeval tv; + if (skb-tstamp.tv64 == 0) + __net_timestamp(skb); tv = ktime_to_timeval(skb-tstamp); - else - do_gettimeofday(tv); - h-tp_sec = tv.tv_sec; - h-tp_usec = tv.tv_usec; + h-tp_sec = tv.tv_sec; + h-tp_usec = tv.tv_usec; + } else { + h-tp_sec = 0; + h-tp_usec = 0; + } sll = (struct sockaddr_ll*)((u8*)h + TPACKET_ALIGN(sizeof(*h))); sll-sll_halen = dev_parse_header(skb, sll-sll_addr); @@ -1004,6 +1009,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol) sock-ops = packet_ops_spkt; sock_init_data(sock, sk); + sock_set_flag(sk, SOCK_TIMESTAMP); po = pkt_sk(sk); sk-sk_family = PF_PACKET; On jue, 2007-09-13 at 12:42 +0200, Stephen Hemminger wrote: 
Currently, af_packet does not allow disabling timestamps. This patch changes that but doesn't force global timestamps on. This shows up in bugzilla as: http://bugzilla.kernel.org/show_bug.cgi?id=4809 Patch against net-2.6.24 tree. Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- a/net/core/sock.c 2007-09-12 15:08:43.0 +0200 +++ b/net/core/sock.c 2007-09-13 12:10:19.0 +0200 @@ -259,7 +259,8 @@ static void sock_disable_timestamp(struc { if (sock_flag(sk, SOCK_TIMESTAMP)) { sock_reset_flag(sk, SOCK_TIMESTAMP); - net_disable_timestamp(); + if (sk-sk_family != PF_PACKET) + net_disable_timestamp(); } } @@ -1645,7 +1646,8 @@ void sock_enable_timestamp(struct sock * { if (!sock_flag(sk, SOCK_TIMESTAMP)) { sock_set_flag(sk, SOCK_TIMESTAMP); - net_enable_timestamp(); + if (sk-sk_family != PF_PACKET) + net_enable_timestamp(); } } EXPORT_SYMBOL(sock_enable_timestamp); --- a/net/packet/af_packet.c 2007-09-12 17:07:00.0 +0200 +++ b/net/packet/af_packet.c 2007-09-13 12:09:10.0 +0200 @@ -572,7 +572,6 @@ static int tpacket_rcv(struct sk_buff *s unsigned long status = TP_STATUS_LOSING|TP_STATUS_USER; unsigned short macoff, netoff; struct sk_buff *copy_skb = NULL; - struct timeval tv; if (dev-nd_net != init_net) goto drop; @@ -650,12 +649,19 @@ static int tpacket_rcv(struct sk_buff *s h-tp_snaplen = snaplen; h-tp_mac = macoff; h-tp_net = netoff; - if (skb-tstamp.tv64) - tv = ktime_to_timeval(skb-tstamp); - else - do_gettimeofday(tv); - h-tp_sec = tv.tv_sec; - h-tp_usec = tv.tv_usec; + + if (sock_flag(sk, SOCK_TIMESTAMP)) { + struct timeval tv; + if (skb-tstamp.tv64) + tv = ktime_to_timeval(skb-tstamp); + else + do_gettimeofday(tv); + h-tp_sec = tv.tv_sec; + h-tp_usec = tv.tv_usec; + } else { + h-tp_sec = 0; + h-tp_usec = 0; + } sll = (struct sockaddr_ll*)((u8*)h + TPACKET_ALIGN(sizeof(*h))); sll-sll_halen = 0; @@ -1014,6 +1020,7 @@ static int packet_create(struct net *net sock-ops = packet_ops_spkt; sock_init_data(sock, sk); +
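The difference between the two patches above is where the timestamp ends up: Unai's variant stores it in the packet the first time any socket asks for it, so later sockets reuse the same value. A minimal userspace sketch of that idea follows; struct sketch_pkt, sketch_read_clock() and sketch_deliver() are made-up names, and the fake clock only exists to count how often the expensive time source is read.

```c
#include <assert.h>
#include <stdint.h>

struct sketch_pkt {
	int64_t tstamp;		/* 0 means "not stamped yet" */
};

static int sketch_clock_reads;	/* how often the clock was consulted */

/* Fake clock: returns a different, non-zero value on every call. */
static int64_t sketch_read_clock(void)
{
	sketch_clock_reads++;
	return 1000 + sketch_clock_reads;
}

/*
 * Deliver one packet to one socket. Returns the timestamp that socket
 * would see, or 0 if the socket has timestamps disabled.
 */
static int64_t sketch_deliver(struct sketch_pkt *pkt, int want_tstamp)
{
	if (!want_tstamp)
		return 0;
	if (pkt->tstamp == 0)
		pkt->tstamp = sketch_read_clock();	/* stamp once, keep it */
	return pkt->tstamp;
}
```

Two sockets that both want timestamps then see the same value and the clock is read only once per packet, while a socket with timestamps disabled never triggers a clock read at all.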
Re: [RFC] af_packet: allow disabling timestamps
On Fri, 2007-09-14 at 12:26 +0200, Stephen Hemminger wrote:
> On Thu, 13 Sep 2007 14:24:06 +0200 Eric Dumazet [EMAIL PROTECTED] wrote:
>> On Thu, 13 Sep 2007 12:42:53 +0200 Stephen Hemminger [EMAIL PROTECTED] wrote:
>>> Currently, af_packet does not allow disabling timestamps. This patch changes that but doesn't force global timestamps on. This shows up in bugzilla as: http://bugzilla.kernel.org/show_bug.cgi?id=4809
>>> Patch against net-2.6.24 tree.
>>
>> I am not sure I understood this patch. This means that tcpdump/ethereal wont get precise timestamps (gathered when packet is received), but imprecise ones (gathered when the sniffer reads the packet)
>>
>> I added some time ago ktime infrastructure to eventually get nanosecond precision in libpcap, so I would prefer a step in the right direction :)
>>
>> Should'nt we use something like :
>>
>> [PATCH] af_packet : allow disabling timestamps, or requesting nanosecond precision.
>>
>> Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
>>
>> diff --git a/net/core/sock.c b/net/core/sock.c
>> index 5a16e38..1c10b9d 100644
>> --- a/net/core/sock.c
>> +++ b/net/core/sock.c
>> @@ -563,6 +563,7 @@ set_rcvbuf:
>>  		} else {
>>  			sock_reset_flag(sk, SOCK_RCVTSTAMP);
>>  			sock_reset_flag(sk, SOCK_RCVTSTAMPNS);
>> +			sock_disable_timestamp(sk);
>>  		}
>>  		break;
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 745e2cb..409de44 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -650,12 +650,27 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, struct packe
>>  	h->tp_snaplen = snaplen;
>>  	h->tp_mac = macoff;
>>  	h->tp_net = netoff;
>> -	if (skb->tstamp.tv64)
>> -		tv = ktime_to_timeval(skb->tstamp);
>> -	else
>> -		do_gettimeofday(&tv);
>> -	h->tp_sec = tv.tv_sec;
>> -	h->tp_usec = tv.tv_usec;
>> +	h->tp_sec = 0;
>> +	h->tp_usec = 0;
>> +	if ((sock_flag(sk, SOCK_TIMESTAMP))) {
>> +		if (sock_flag(sk, SOCK_RCVTSTAMPNS)) {
>> +			struct timespec ts;
>> +			if (skb->tstamp.tv64)
>> +				ts = ktime_to_timespec(skb->tstamp);
>> +			else
>> +				getnstimeofday(&ts);
>> +			h->tp_sec = ts.tv_sec;
>> +			h->tp_usec = ts.tv_nsec; /* cheat a litle bit */
>> +		}
>> +		else {
>> +			if (skb->tstamp.tv64)
>> +				tv = ktime_to_timeval(skb->tstamp);
>> +			else
>> +				do_gettimeofday(&tv);
>> +			h->tp_sec = tv.tv_sec;
>> +			h->tp_usec = tv.tv_usec;
>> +		}
>> +	}
>>
>>  	sll = (struct sockaddr_ll*)((u8*)h + TPACKET_ALIGN(sizeof(*h)));
>>  	sll->sll_halen = 0;
>> @@ -1014,6 +1029,7 @@ static int packet_create(struct net *net, struct socket *sock, int protocol)
>>  		sock->ops = &packet_ops_spkt;
>>
>>  	sock_init_data(sock, sk);
>> +	sock_enable_timestamp(sk);
>>
>>  	po = pkt_sk(sk);
>>  	sk->sk_family = PF_PACKET;
>
> No, then we end up timestamping all the packets, even if they get dropped by packet filter. The change in 2.6.24 allows dhclient (and rstp) to only call hires clock source for packets they want, not all packets.
>
> Perhaps the timestamping needs to change into a tristate flag?

Eric's patch has a feature your previous patch doesn't have: a way to disable timestamping from userspace (the changes in net/core/sock.c). But it changes the userspace API.

I really think that any developer who sets SO_TIMESTAMP to 0 and still expects to receive a valid timestamp is terminally insane and doesn't deserve any mercy. But we should take pity on those poor souls who use (suffer) closed software, and find another way that doesn't change the API.

Bye.
Re: [PATCH] sky2: sky2 FE+ receive status workaround
Hi,

On Thu, Sep 27, 2007 at 06:58:07AM -0700, Stephen Hemminger wrote:
> On Thu, 27 Sep 2007 09:14:11 +0100 Jochen Voß [EMAIL PROTECTED] wrote:
>> On 27 Sep 2007, at 01:58, Stephen Hemminger wrote:
>>> +	/* This chip has hardware problems that generates bogus status.
>>> +	 * So do only marginal checking and expect higher level protocols
>>> +	 * to handle crap frames.
>>> +	 */
>>> +	if (sky2->hw->chip_id == CHIP_ID_YUKON_FE_P &&
>>> +	    sky2->hw->chip_rev == CHIP_REV_YU_FE2_A0 &&
>>> +	    length != count)
>>> +		goto okay;
>> Shouldn't the condition be length == count?
> No, the code is correct as is. Basically if length == count, then the status field is correct, and the driver can go ahead and use it. If length != count, then the status is bogus but the data is okay.

Oh, I see. Thanks for the explanation.

All the best, Jochen
--
http://seehuhn.de/
Re: IPSec on Linux Kernel
Hi Fabio,

- Assuming that you intend to deal with IPv4, I suggest that you start by looking at the ah4.ko module sources, which are in net/ipv4/ah.c, especially at the ah_output() and ah_input() methods. (For IPv6 there is ah6.c in net/ipv6.)
- May I ask: are you aware that the Authentication Header protocol deals only with authentication and not with encryption, and that as a result the ESP protocol, which supports authentication and also, when needed, encryption, is much more widely used?

Regards,
Rosen Rami
Re: [PATCH 5/7] CAN: Add virtual CAN netdevice driver
Urs Thuermann [EMAIL PROTECTED] writes:
> This patch adds the virtual CAN bus (vcan) network driver. The vcan device is just a loopback device for CAN frames, no real CAN hardware is involved.

I'm trying to wrap my head around the CAN use of IFF_LOOPBACK.

> 6.2 loopback
>
> As described in chapter 3.2 the CAN network device driver should support a local loopback functionality. In this case the driver flag IFF_LOOPBACK has to be set to cause the PF_CAN core to not perform the loopback as fallback solution:
>
>   dev->flags = (IFF_NOARP | IFF_LOOPBACK);

Currently IFF_LOOPBACK set in dev->flags means we are dealing with drivers/net/loopback.c. In other networking layers loopback functionality (i.e. for broadcast) is never expected to be provided by the drivers and is instead always provided by the networking layer, which keeps the drivers simpler. Further, you already have this functionality in the generic CAN layer for doing loopback without driver support.

So at first glance the CAN usage of IFF_LOOPBACK looks completely broken, and likely to confuse other networking layers if they see a CAN device, say if someone attempts to run IP over CAN or something like that.

Do you think you can remove this incompatible usage of IFF_LOOPBACK from the CAN code?

If I have read your documentation properly, the only reason you are doing this is so that the timing of frames to cansniffer more accurately reflects when the frame hits the wire. If CAN runs over a very slow medium I guess I can see where that can be a concern. But the usage of IFF_LOOPBACK to do this still feels fairly hackish to me.

Eric
Re: [PATCH 3/4] net ipv4: When possible test for IFF_LOOPBACK and not dev == loopback_dev
Daniel Lezcano [EMAIL PROTECTED] writes:
> Eric W. Biederman wrote:
>> Now that multiple loopback devices are becoming possible it makes the code a little cleaner and more maintainable to test if a device is a loopback device by testing dev->flags & IFF_LOOPBACK instead of dev == loopback_dev.
>>
>> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
>
> Urs Thuermann posted the patch: [PATCH 5/7] CAN: Add virtual CAN netdevice driver
> This network driver set its flag to IFF_LOOPBACK for testing. Is it possible this can be a collision with your patch ?

I have brought it up on that thread. As best as I can tell, the CAN usage of IFF_LOOPBACK will be a problem even without my patch, assuming something other than the CAN layer will see the CAN devices.

The CAN documentation says IFF_LOOPBACK should be set on all CAN devices. It seems that the people who want high performance, predictable CAN don't want this, and the people who want something they can trace easily do. It sounds to me like CAN routers don't exist.

Anyway, hopefully that usage can be resolved as that code is reviewed and made ready to merge.

Eric
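The two ways of recognising a loopback device that this thread compares can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not kernel code: struct sketch_dev and both helpers are hypothetical, and the SKETCH_IFF_* constants are local copies that mirror the usual values of IFF_LOOPBACK (0x8) and IFF_NOARP (0x80).

```c
#include <assert.h>

#define SKETCH_IFF_LOOPBACK	0x8	/* mirrors IFF_LOOPBACK */
#define SKETCH_IFF_NOARP	0x80	/* mirrors IFF_NOARP */

struct sketch_dev {
	const char *name;
	unsigned int flags;
};

/* Old style: only the one global loopback device matches. */
static int sketch_is_loopback_by_pointer(const struct sketch_dev *dev,
					 const struct sketch_dev *loopback_dev)
{
	return dev == loopback_dev;
}

/* New style: any device carrying the flag matches. */
static int sketch_is_loopback_by_flag(const struct sketch_dev *dev)
{
	return (dev->flags & SKETCH_IFF_LOOPBACK) != 0;
}
```

The flag test copes with a second loopback device in another namespace, which the pointer comparison misses. It also shows the collision Daniel asks about: a device set up with (IFF_NOARP | IFF_LOOPBACK), as the CAN documentation asks for, is classified as a loopback device by the flag test.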
Re: [PATCH 5/7] CAN: Add virtual CAN netdevice driver
I guess in particular IFF_LOOPBACK means that all packets from a device will come right back to the current machine, and go nowhere else. That usage sounds completely different than the CAN usage, which appears to mean: broadcast packets will be returned to this machine as well as being sent out onto the wire.

Eric
Re: [PATCH] various dst_ifdown routines to catch refcounting bugs
Denis V. Lunev [EMAIL PROTECTED] writes:
> Moving dst entries into init_net.loopback_dev is not a good thing. This hides obvious and non-obvious ref-counting bugs.

Acked-by: Eric W. Biederman [EMAIL PROTECTED]

To be clear, using init_net.loopback is currently safe because we don't have any destination cache entries for anything except the initial network namespace. I have not yet made this change simply because I haven't gotten around to this part in my patches.

I do have a question I would like to bring up, because I like avoiding explicit references to loopback_dev when I can.

	/* Dirty hack. We did it in 2.2 (in __dst_free),
	 * we have _very_ good reasons not to repeat
	 * this mistake in 2.3, but we have no choice
	 * now. _It_ _is_ _explicit_ _deliberate_
	 * _race_ _condition_.
	 *
	 * Commented and originally written by Alexey.
	 */

What is the race that is talked about in that comment? Can we just assign NULL instead of the loopback device when we bring a route down? My gut feeling is that something like:

	dst->input = dst->output = dst_discard;

may be enough. But I don't know where the deliberate race is.

I haven't traced this all of the way through, but from the obvious parts I just get this nagging feeling that something isn't quite right.

Eric
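The reference hand-off that Denis's patch changes can be illustrated with a small userspace model. Everything here is hypothetical (struct sketch_net, struct sketch_dev, struct sketch_dst and the helpers); plain integer counters stand in for dev_hold()/dev_put(), and the only point is which loopback device ends up holding the reference.

```c
#include <assert.h>

struct sketch_net;

struct sketch_dev {
	int refcnt;
	struct sketch_net *net;		/* namespace the device lives in */
};

struct sketch_net {
	struct sketch_dev *loopback;	/* that namespace's loopback device */
};

struct sketch_dst {
	struct sketch_dev *dev;		/* device this route entry points at */
};

static void sketch_hold(struct sketch_dev *dev) { dev->refcnt++; }
static void sketch_put(struct sketch_dev *dev) { dev->refcnt--; }

/*
 * Device going away: repoint the entry at the loopback device of the
 * device's own namespace, taking the new reference before dropping
 * the old one.
 */
static void sketch_dst_ifdown(struct sketch_dst *dst)
{
	struct sketch_dev *old = dst->dev;

	dst->dev = old->net->loopback;
	sketch_hold(dst->dev);
	sketch_put(old);
}
```

In this model a route in a second namespace leaves its reference on that namespace's loopback device, so a leaked reference shows up there when the namespace is torn down instead of being absorbed by the initial namespace's long-lived loopback device.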
Re: [Devel] [PATCH 4/4] net: Make the loopback device per network namespace
Denis V. Lunev [EMAIL PROTECTED] writes:
> Eric W. Biederman wrote:
>> This patch makes loopback_dev per network namespace. Adding code to create a different loopback device for each network namespace and adding the code to free a loopback device when a network namespace exits.
>>
>> This patch modifies all users of loopback_dev so they access it as init_net.loopback_dev, keeping all of the code compiling and working. A later pass will be needed to update the users to use something other than the initial network namespace.
>
> A pity that an important bit of explanation is missed. The initialization of loopback_dev is moved from a chain of devices (init_module) to a subsystem initialization to keep proper order, i.e. we must be sure that the initialization order is correct.

That didn't happen in the patch you mentioned. That happened when we started dynamically allocating the loopback device. That was the patch Daniel sent out a bit ago.

There are certainly some ordering issues and it may have helped to talk about them. But they are because things assume the loopback device is present.

We have various bits of code around, such as the dst_ifdown case, that assume that if another network device is present the loopback device is present. To fulfill that assumption I guess that means we have both an initialization order dependency and a destruction order dependency.

The fact that we were using module_init before actually appears to me to have been racy, but we got away with it because the actual data structure was statically allocated.

Since it appears that for a dynamically allocated loopback device registering it first and unregistering it last is necessary for routing, it is likely worth looking at this a little more closely and making a guarantee, so we can make it easier for networking layers like ipv6 that want to memorize which device is the loopback device.

Eric
Re: RTL8111 PCI Express Gigabit driver r8169 produces slow file transfers
Achim Frase [EMAIL PROTECTED] :
[...]
> but I hope you could help me.

Yes. Please try any of:
- current 2.6.23-git
- 2.6.23-rc8 + patch below

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index b85ab4a..c921ec3 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -1228,7 +1228,10 @@ static void rtl8169_hw_phy_config(struct net_device *dev)
 		return;
 	}
 
-	/* phy config for RTL8169s mac_version C chip */
+	if ((tp->mac_version != RTL_GIGA_MAC_VER_02) &&
+	    (tp->mac_version != RTL_GIGA_MAC_VER_03))
+		return;
+
 	mdio_write(ioaddr, 31, 0x0001);		//w 31 2 0 1
 	mdio_write(ioaddr, 21, 0x1000);		//w 21 15 0 1000
 	mdio_write(ioaddr, 24, 0x65c7);		//w 24 15 0 65c7
@@ -2567,6 +2570,15 @@ static void rtl8169_tx_interrupt(struct net_device *dev,
 		    (TX_BUFFS_AVAIL(tp) >= MAX_SKB_FRAGS)) {
 			netif_wake_queue(dev);
 		}
+		/*
+		 * 8168 hack: TxPoll requests are lost when the Tx packets are
+		 * too close. Let's kick an extra TxPoll request when a burst
+		 * of start_xmit activity is detected (if it is not detected,
+		 * it is slow enough). -- FR
+		 */
+		smp_rmb();
+		if (tp->cur_tx != dirty_tx)
+			RTL_W8(TxPoll, NPQ);
 	}
 }
 
--
Ueimor
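The second hunk of the patch boils down to "after reclaiming completed descriptors, poke the hardware again if anything is still queued". A minimal userspace model of that check is sketched below; struct sketch_ring and its polls counter are hypothetical, with the counter standing in for the RTL_W8(TxPoll, NPQ) register write, and no memory barriers are modelled.

```c
#include <assert.h>

struct sketch_ring {
	unsigned int cur_tx;	/* next descriptor start_xmit will fill */
	unsigned int dirty_tx;	/* next descriptor waiting to be reclaimed */
	unsigned int polls;	/* how many extra TxPoll kicks were issued */
};

/*
 * Model of the Tx completion path: reclaim 'completed' descriptors
 * (assumed not to exceed what is outstanding), then kick the chip
 * again if packets queued in the meantime are still pending.
 */
static void sketch_tx_interrupt(struct sketch_ring *ring, unsigned int completed)
{
	ring->dirty_tx += completed;
	if (ring->cur_tx != ring->dirty_tx)
		ring->polls++;	/* stands in for RTL_W8(TxPoll, NPQ) */
}
```

If the ring is fully drained no extra kick is issued, which matches the comment in the patch: without a detected burst, transmission is slow enough for the original TxPoll request to be seen.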
Re: [PATCH] net: Add network namespace clone unshare support.
Cedric Le Goater [EMAIL PROTECTED] writes:
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index a01ac6d..e10a0a8 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -27,6 +27,7 @@
>>  #define CLONE_NEWUTS	0x0400	/* New utsname group? */
>>  #define CLONE_NEWIPC	0x0800	/* New ipcs */
>>  #define CLONE_NEWUSER	0x1000	/* New user namespace */
>> +#define CLONE_NEWNET	0x2000	/* New network namespace */
>
> This new flag is going to conflict with the pid namespace flag CLONE_NEWPID in -mm. It might be worth changing it to:
>
>   #define CLONE_NEWNET 0x4000

Interesting, it would have been nice if someone had caught this detail earlier. Oh well. Thanks for pointing this out, it's on my todo list to look into and ensure we resolve.

I'm confused because my notes have 0x8000 for the pid namespace, and 0x4000 for the time namespace.

> The changes in nsproxy.c and fork.c will also conflict but I don't think we can do much about it for now.

They should also be fairly easy conflicts to resolve. I guess we are likely to hit this conflict in the next -mm or the merge window, whichever comes first.

Eric
Re: [RFC/PATCH 0/3] UDP memory usage accounting
Hello,

Apologies for the late response.

Evgeniy Polyakov wrote:
> Hi.
> On Fri, Sep 21, 2007 at 09:18:07PM +0900, Satoshi OSHIMA ([EMAIL PROTECTED]) wrote:
>> This patch set tries to introduce memory usage accounting for UDP (currently ipv4 only). Currently, memory usage of UDP can be observed as the sum of usage of tx_queue and rx_queue. But I believe that system-wide accounting is useful under heavily loaded conditions. In the next step, I would like to add a memory usage quota for UDP to avoid the unlimited memory consumption problem under DDoS attack.
> Could you please describe such an attack in more detail? Each UDP socket has its queue length which can not be exceeded (roughly), no new sockets are created when remote side sends a packet (like after special steps in TCP), so where is possibility to eat all the mem?

I think Satoshi will answer this question soon.

>> This patch set is for 2.6.23-rc7.
> I seriously doubt you want to put udp specific hacks and zillions of atomic ops all around the code just to know exact number of bytes eaten for UDP.

I'll revise the patch to reduce the number of atomic operations.

> Please use udp specific code (like udp_sendmsg()) for proper accounting if you need that, but not hacks in generic ip code.

As far as I know, Satoshi is improving this part right now. Please wait for his response.

Many thanks for your comments.

Best regards,
Hideo Aoki
--
Hitachi Computer Products (America) Inc.
Re: [PATCH 1/4] net: Dynamically allocate the per cpu counters for the loopback device.
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Thu, 27 Sep 2007 01:48:00 -0600

> I'm not doing get_cpu/put_cpu so does the comment make sense in relationship to per_cpu_ptr?

It is possible. But someone would need to go check for sure.
Re: [RFC/PATCH 2/3] UDP memory usage accounting: accounting unit and variable
Hello,

I apologize for not replying sooner.

Andi Kleen wrote:
> Satoshi OSHIMA [EMAIL PROTECTED] writes:
>> This patch introduces global variable for UDP memory accounting. The unit is page.
>
> The global variable doesn't seem to be very MP scalable, especially if you change it for each packet. This will be a very hot cache line, in the worst case bouncing around a large machine.
>
> Possible alternatives:
> - Per CPU variables
> - You only change the global on socket creation time (by pre allocating a large amount) or when the system comes under memory pressure.
> - Batching of the global updates for multiple packets [that's a variant of the previous one, might be still too costly though]
>
> Also for such variables it's usually good to cache line pad them on SMP to avoid false sharing with something else.
>
> -Andi

Thank you so much for your suggestions. The implementation of the patch basically followed the implementation of tcp_memory_allocated. However, I agree that the patch introduces too many atomic operations. I'll try batching to reduce the number of atomic operations.

Best regards,
Hideo Aoki
--
Hitachi Computer Products (America) Inc.
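The batching alternative discussed above can be sketched in userspace C11. This is only an illustration under stated assumptions, not the UDP patch: `struct local_acct`, `acct_charge()`, `acct_flush()` and the threshold `ACCT_BATCH_PAGES` are hypothetical names. Each context accumulates a private delta and touches the shared atomic counter only when the delta reaches the threshold.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define ACCT_BATCH_PAGES 32   /* hypothetical flush threshold, in pages */

/* Shared counter: the hot cache line we want to touch rarely. */
static atomic_long global_pages;

/* Private (per-thread or per-CPU) pending delta, not yet published. */
struct local_acct {
    long pending;
};

/* Publish the pending delta with a single atomic operation. */
void acct_flush(struct local_acct *la)
{
    assert(la != NULL);
    if (la->pending != 0) {
        atomic_fetch_add(&global_pages, la->pending);
        la->pending = 0;
    }
}

/* Charge (pages > 0) or uncharge (pages < 0) locally; flush only when
 * the accumulated delta reaches the batch threshold in either direction. */
void acct_charge(struct local_acct *la, long pages)
{
    assert(la != NULL);
    la->pending += pages;
    if (la->pending >= ACCT_BATCH_PAGES || la->pending <= -ACCT_BATCH_PAGES)
        acct_flush(la);
}

/* The global value can lag by up to ACCT_BATCH_PAGES - 1 per context. */
long acct_read_global(void)
{
    return atomic_load(&global_pages);
}
```

The trade-off is accuracy: a limit check against the global value can be off by the sum of all unpublished deltas, so any quota built on it has to tolerate that slack.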
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
Sean, What is the model on how client connects, say for iSCSI, when client and server both support, iWARP and 10GbE or 1GbE, and would like to setup most performant connection for ULP? Thanks, Arkady Kanevsky email: [EMAIL PROTECTED] Network Appliance Inc. phone: 781-768-5395 1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195 Waltham, MA 02451 central phone: 781-768-5300 -Original Message- From: Sean Hefty [mailto:[EMAIL PROTECTED] Sent: Thursday, September 27, 2007 2:39 PM To: Steve Wise Cc: netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfacesto avoid 4-tuple conflicts. The sysadmin creates for iwarp use only alias interfaces of the form devname:iw* where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with iw. The iw immediately after the ':' is the key used by the iw_cxgb3 driver. I'm still not sure about this, but haven't come up with anything better myself. And if there's a good chance of other rnic's needing the same support, I'd rather see the common code separated out, even if just encapsulated within this module for easy re-use. As for the code, I have a couple of questions about whether deadlock and a race condition are possible, plus a few minor comments. +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) +{ + struct iwch_addrlist *addr; + + addr = kmalloc(sizeof *addr, GFP_KERNEL); + if (!addr) { + printk(KERN_ERR MOD %s - failed to alloc memory!\n, + __FUNCTION__); + return; + } + addr-ifa = ifa; + mutex_lock(rnicp-mutex); + list_add_tail(addr-entry, rnicp-addrlist); + mutex_unlock(rnicp-mutex); +} Should this return success/failure? 
+static int nb_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct in_ifaddr *ifa = ctx; + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); + + PDBG(%s rnicp %p event %lx\n, __FUNCTION__, rnicp, event); + + switch (event) { + case NETDEV_UP: + if (netdev_is_ours(rnicp, ifa-ifa_dev-dev) + is_iwarp_label(ifa-ifa_label)) { + PDBG(%s label %s addr 0x%x added\n, + __FUNCTION__, ifa-ifa_label, ifa-ifa_address); + insert_ifa(rnicp, ifa); + iwch_listeners_add_addr(rnicp, ifa-ifa_address); If insert_ifa() fails, what will iwch_listeners_add_addr() do? (I'm not easily seeing the relationship between the address list and the listen list at this point.) + } + break; + case NETDEV_DOWN: + if (netdev_is_ours(rnicp, ifa-ifa_dev-dev) + is_iwarp_label(ifa-ifa_label)) { + PDBG(%s label %s addr 0x%x deleted\n, + __FUNCTION__, ifa-ifa_label, ifa-ifa_address); + iwch_listeners_del_addr(rnicp, ifa-ifa_address); + remove_ifa(rnicp, ifa); + } + break; + default: + break; + } + return 0; +} + +static void delete_addrlist(struct iwch_dev *rnicp) { + struct iwch_addrlist *addr, *tmp; + + mutex_lock(rnicp-mutex); + list_for_each_entry_safe(addr, tmp, rnicp-addrlist, entry) { + list_del(addr-entry); + kfree(addr); + } + mutex_unlock(rnicp-mutex); +} + +static void populate_addrlist(struct iwch_dev *rnicp) { + int i; + struct in_device *indev; + + for (i = 0; i rnicp-rdev.port_info.nports; i++) { + indev = in_dev_get(rnicp-rdev.port_info.lldevs[i]); + if (!indev) + continue; + for_ifa(indev) + if (is_iwarp_label(ifa-ifa_label)) { + PDBG(%s label %s addr 0x%x added\n, +__FUNCTION__, ifa-ifa_label, +ifa-ifa_address); + insert_ifa(rnicp, ifa); + } + endfor_ifa(indev); + } +} + static void rnic_init(struct iwch_dev *rnicp) { PDBG(%s iwch_dev %p\n, __FUNCTION__, rnicp); @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r idr_init(rnicp-qpidr); idr_init(rnicp-mmidr); spin_lock_init(rnicp-lock); + INIT_LIST_HEAD(rnicp-addrlist); + 
INIT_LIST_HEAD(rnicp-listen_eps); + mutex_init(rnicp-mutex); + rnicp-nb.notifier_call = nb_callback; + populate_addrlist(rnicp); + register_inetaddr_notifier(rnicp-nb); rnicp-attr.vendor_id = 0x168; rnicp-attr.vendor_part_id = 7; @@ -148,6 +271,8 @@ static void
Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
The sysadmin creates for iwarp use only alias interfaces of the form devname:iw* where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with iw. The iw immediately after the ':' is the key used by the iw_cxgb3 driver. I'm still not sure about this, but haven't come up with anything better myself. And if there's a good chance of other rnic's needing the same support, I'd rather see the common code separated out, even if just encapsulated within this module for easy re-use. As for the code, I have a couple of questions about whether deadlock and a race condition are possible, plus a few minor comments. +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa) +{ + struct iwch_addrlist *addr; + + addr = kmalloc(sizeof *addr, GFP_KERNEL); + if (!addr) { + printk(KERN_ERR MOD %s - failed to alloc memory!\n, + __FUNCTION__); + return; + } + addr-ifa = ifa; + mutex_lock(rnicp-mutex); + list_add_tail(addr-entry, rnicp-addrlist); + mutex_unlock(rnicp-mutex); +} Should this return success/failure? +static int nb_callback(struct notifier_block *self, unsigned long event, + void *ctx) +{ + struct in_ifaddr *ifa = ctx; + struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb); + + PDBG(%s rnicp %p event %lx\n, __FUNCTION__, rnicp, event); + + switch (event) { + case NETDEV_UP: + if (netdev_is_ours(rnicp, ifa-ifa_dev-dev) + is_iwarp_label(ifa-ifa_label)) { + PDBG(%s label %s addr 0x%x added\n, + __FUNCTION__, ifa-ifa_label, ifa-ifa_address); + insert_ifa(rnicp, ifa); + iwch_listeners_add_addr(rnicp, ifa-ifa_address); If insert_ifa() fails, what will iwch_listeners_add_addr() do? (I'm not easily seeing the relationship between the address list and the listen list at this point.) 
+ } + break; + case NETDEV_DOWN: + if (netdev_is_ours(rnicp, ifa-ifa_dev-dev) + is_iwarp_label(ifa-ifa_label)) { + PDBG(%s label %s addr 0x%x deleted\n, + __FUNCTION__, ifa-ifa_label, ifa-ifa_address); + iwch_listeners_del_addr(rnicp, ifa-ifa_address); + remove_ifa(rnicp, ifa); + } + break; + default: + break; + } + return 0; +} + +static void delete_addrlist(struct iwch_dev *rnicp) +{ + struct iwch_addrlist *addr, *tmp; + + mutex_lock(rnicp-mutex); + list_for_each_entry_safe(addr, tmp, rnicp-addrlist, entry) { + list_del(addr-entry); + kfree(addr); + } + mutex_unlock(rnicp-mutex); +} + +static void populate_addrlist(struct iwch_dev *rnicp) +{ + int i; + struct in_device *indev; + + for (i = 0; i rnicp-rdev.port_info.nports; i++) { + indev = in_dev_get(rnicp-rdev.port_info.lldevs[i]); + if (!indev) + continue; + for_ifa(indev) + if (is_iwarp_label(ifa-ifa_label)) { + PDBG(%s label %s addr 0x%x added\n, +__FUNCTION__, ifa-ifa_label, +ifa-ifa_address); + insert_ifa(rnicp, ifa); + } + endfor_ifa(indev); + } +} + static void rnic_init(struct iwch_dev *rnicp) { PDBG(%s iwch_dev %p\n, __FUNCTION__, rnicp); @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r idr_init(rnicp-qpidr); idr_init(rnicp-mmidr); spin_lock_init(rnicp-lock); + INIT_LIST_HEAD(rnicp-addrlist); + INIT_LIST_HEAD(rnicp-listen_eps); + mutex_init(rnicp-mutex); + rnicp-nb.notifier_call = nb_callback; + populate_addrlist(rnicp); + register_inetaddr_notifier(rnicp-nb); rnicp-attr.vendor_id = 0x168; rnicp-attr.vendor_part_id = 7; @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev mutex_lock(dev_mutex); list_for_each_entry_safe(dev, tmp, dev_list, entry) { if (dev-rdev.t3cdev_p == tdev) { + unregister_inetaddr_notifier(dev-nb); + delete_addrlist(dev); list_del(dev-entry); iwch_unregister_device(dev); cxio_rdev_close(dev-rdev); diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h index caf4e60..7fa0a47 100644 --- a/drivers/infiniband/hw/cxgb3/iwch.h +++ 
b/drivers/infiniband/hw/cxgb3/iwch.h @@ -36,6 +36,8 @@ #include linux/mutex.h #include linux/list.h #include linux/spinlock.h
Re: [PATCH] fixed broken bootp compilation
From: Denis V. Lunev [EMAIL PROTECTED]
Date: Thu, 27 Sep 2007 14:46:22 +0400

> Compilation fix. Extra bracket removed. Broken by
> [NET]: Wrap netdevice hardware header creation
> from Stephen Hemminger [EMAIL PROTECTED]
>
> Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

Applied, thanks Denis.
Re: [PATCH] proper comment for loopback initialization order
From: Denis V. Lunev [EMAIL PROTECTED]
Date: Thu, 27 Sep 2007 16:25:27 +0400

> Loopback device is special. It should be initialized at the very beginning. Initialization order has been changed by Eric W. Biederman [EMAIL PROTECTED] and this change is non-obvious and important enough to add proper comment.
>
> Signed-off-by: Denis V. Lunev [EMAIL PROTECTED]

Applied, but I had to fix the coding style of your comment, please do it like this in the future:

/* Loopback is special. It should be initialized before any other network
 * device and network subsystem.
 */

Thanks.
Re: [PKT_SCHED]: Add stateless NAT
From: jamal [EMAIL PROTECTED]
Date: Thu, 27 Sep 2007 08:39:45 -0400

> nice work. I like the egress flag idea;-> and who would have thunk stateless nat could be written in such a few lines ;-> I would have put the checksum as a separate action but it is fine the way you did it since it simplifies config. more comments below.
>
> On Thu, 2007-27-09 at 15:34 +0800, Herbert Xu wrote:
>> +config NET_ACT_NAT
>> +        tristate "Stateless NAT"
>> +        depends on NET_CLS_ACT
>> +        select NETFILTER
>
> I am gonna have to agree with Evgeniy on this Herbert;-> The rewards are it will improve performance for people who dont need netfilter. Ok, who is gonna move the csum utility functions out? /me looks at Evgeniy;-> I could do it realsoonnow if noone raises their hands. In any case, it would be real nice to have but i dont see it as a show stopper for inclusion.

I agree that we should move the functions out. However...

You have to realize that this logic is a complete crock of shit for %99. of Linux users out there who get and only use distribution compiled kernels which are going to enable everything anyways.

So we better make sure there are zero performance implications at some point just for compiling netfilter into the tree. I know that isn't the case currently, but that means that we aren't helping out the majority of Linux users and are thus only adding these optimizations for such a small sliver of users and that is totally pointless and sucks.
Re: [PATCH 5/7] CAN: Add virtual CAN netdevice driver
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Thu, 27 Sep 2007 10:16:37 -0600

> I guess in particular IFF_LOOPBACK means that all packets from a device will come right back to the current machine, and go nowhere else.
>
> That usage sounds completely different than the CAN usage, which appears to mean: broadcast packets will be returned to this machine as well as being sent out onto the wire.

It's bogus and it should be removed from the CAN code, they can add some other attribute to achieve their goals.
Re: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
Sean Hefty wrote:
>> The sysadmin creates for iwarp use only alias interfaces of the form devname:iw* where devname is the native interface name (eg eth0) for the iwarp netdev device. The alias label can be anything starting with iw. The iw immediately after the ':' is the key used by the iw_cxgb3 driver.
>
> I'm still not sure about this, but haven't come up with anything better myself. And if there's a good chance of other rnic's needing the same support, I'd rather see the common code separated out, even if just encapsulated within this module for easy re-use.
>
> As for the code, I have a couple of questions about whether deadlock and a race condition are possible, plus a few minor comments.

Thanks for reviewing! Responses are in-line below.

>> +static void insert_ifa(struct iwch_dev *rnicp, struct in_ifaddr *ifa)
>> +{
>> +        struct iwch_addrlist *addr;
>> +
>> +        addr = kmalloc(sizeof *addr, GFP_KERNEL);
>> +        if (!addr) {
>> +                printk(KERN_ERR MOD "%s - failed to alloc memory!\n",
>> +                       __FUNCTION__);
>> +                return;
>> +        }
>> +        addr->ifa = ifa;
>> +        mutex_lock(&rnicp->mutex);
>> +        list_add_tail(&addr->entry, &rnicp->addrlist);
>> +        mutex_unlock(&rnicp->mutex);
>> +}
>
> Should this return success/failure?

I think so. See below...

>> +static int nb_callback(struct notifier_block *self, unsigned long event,
>> +                       void *ctx)
>> +{
>> +        struct in_ifaddr *ifa = ctx;
>> +        struct iwch_dev *rnicp = container_of(self, struct iwch_dev, nb);
>> +
>> +        PDBG("%s rnicp %p event %lx\n", __FUNCTION__, rnicp, event);
>> +
>> +        switch (event) {
>> +        case NETDEV_UP:
>> +                if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) &&
>> +                    is_iwarp_label(ifa->ifa_label)) {
>> +                        PDBG("%s label %s addr 0x%x added\n",
>> +                             __FUNCTION__, ifa->ifa_label, ifa->ifa_address);
>> +                        insert_ifa(rnicp, ifa);
>> +                        iwch_listeners_add_addr(rnicp, ifa->ifa_address);
>
> If insert_ifa() fails, what will iwch_listeners_add_addr() do? (I'm not easily seeing the relationship between the address list and the listen list at this point.)

I guess insert_ifa() needs to return success/failure. Then if we failed to add the ifa to the list we won't update the listeners.

The relationship is this:
- when a listen is done on addr 0.0.0.0, the code walks the list of addresses to do specific listens on each address.
- when an address is added or deleted, then the list of current listeners is walked and updated accordingly.

>> +                }
>> +                break;
>> +        case NETDEV_DOWN:
>> +                if (netdev_is_ours(rnicp, ifa->ifa_dev->dev) &&
>> +                    is_iwarp_label(ifa->ifa_label)) {
>> +                        PDBG("%s label %s addr 0x%x deleted\n",
>> +                             __FUNCTION__, ifa->ifa_label, ifa->ifa_address);
>> +                        iwch_listeners_del_addr(rnicp, ifa->ifa_address);
>> +                        remove_ifa(rnicp, ifa);
>> +                }
>> +                break;
>> +        default:
>> +                break;
>> +        }
>> +        return 0;
>> +}
>> +
>> +static void delete_addrlist(struct iwch_dev *rnicp)
>> +{
>> +        struct iwch_addrlist *addr, *tmp;
>> +
>> +        mutex_lock(&rnicp->mutex);
>> +        list_for_each_entry_safe(addr, tmp, &rnicp->addrlist, entry) {
>> +                list_del(&addr->entry);
>> +                kfree(addr);
>> +        }
>> +        mutex_unlock(&rnicp->mutex);
>> +}
>> +
>> +static void populate_addrlist(struct iwch_dev *rnicp)
>> +{
>> +        int i;
>> +        struct in_device *indev;
>> +
>> +        for (i = 0; i < rnicp->rdev.port_info.nports; i++) {
>> +                indev = in_dev_get(rnicp->rdev.port_info.lldevs[i]);
>> +                if (!indev)
>> +                        continue;
>> +                for_ifa(indev)
>> +                        if (is_iwarp_label(ifa->ifa_label)) {
>> +                                PDBG("%s label %s addr 0x%x added\n",
>> +                                     __FUNCTION__, ifa->ifa_label,
>> +                                     ifa->ifa_address);
>> +                                insert_ifa(rnicp, ifa);
>> +                        }
>> +                endfor_ifa(indev);
>> +        }
>> +}
>> +
>>  static void rnic_init(struct iwch_dev *rnicp)
>>  {
>>          PDBG("%s iwch_dev %p\n", __FUNCTION__, rnicp);
>> @@ -70,6 +187,12 @@ static void rnic_init(struct iwch_dev *r
>>          idr_init(&rnicp->qpidr);
>>          idr_init(&rnicp->mmidr);
>>          spin_lock_init(&rnicp->lock);
>> +        INIT_LIST_HEAD(&rnicp->addrlist);
>> +        INIT_LIST_HEAD(&rnicp->listen_eps);
>> +        mutex_init(&rnicp->mutex);
>> +        rnicp->nb.notifier_call = nb_callback;
>> +        populate_addrlist(rnicp);
>> +        register_inetaddr_notifier(&rnicp->nb);
>>
>>          rnicp->attr.vendor_id = 0x168;
>>          rnicp->attr.vendor_part_id = 7;
>> @@ -148,6 +271,8 @@ static void close_rnic_dev(struct t3cdev
>>          mutex_lock(&dev_mutex);
>>          list_for_each_entry_safe(dev, tmp, &dev_list, entry) {
>>                  if (dev->rdev.t3cdev_p == tdev) {
>> +                        unregister_inetaddr_notifier(&dev->nb);
>> +                        delete_addrlist(dev);
>>                          list_del(&dev->entry);
>>                          iwch_unregister_device(dev);
>>                          cxio_rdev_close(&dev->rdev);
>> diff --git a/drivers/infiniband/hw/cxgb3/iwch.h b/drivers/infiniband/hw/cxgb3/iwch.h
>> index caf4e60..7fa0a47 100644
>> --- a/drivers/infiniband/hw/cxgb3/iwch.h
>> +++ b/drivers/infiniband/hw/cxgb3/iwch.h
>> @@ -36,6 +36,8 @@
>>  #include <linux/mutex.h>
>>  #include <linux/list.h>
>>  #include
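The change agreed in the review exchange above (make the insert helper report failure, and skip the listener update when it fails) can be sketched in plain userspace C. This is not the iw_cxgb3 code: the list type, the injected allocator and the `listeners_updated` counter are hypothetical stand-ins for the driver's address list, kmalloc() and iwch_listeners_add_addr().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

struct addr_node {
    unsigned int addr;
    struct addr_node *next;
};

struct addr_list {
    struct addr_node *head;
    size_t count;
};

/* Allocator is injected so the failure path can be exercised. */
typedef void *(*alloc_fn)(size_t);

/* Returns 0 on success or -ENOMEM if the node cannot be allocated. */
int insert_addr(struct addr_list *list, unsigned int addr, alloc_fn alloc)
{
    struct addr_node *node;

    assert(list != NULL && alloc != NULL);
    node = alloc(sizeof *node);
    if (!node)
        return -ENOMEM;
    node->addr = addr;
    node->next = list->head;
    list->head = node;
    list->count++;
    return 0;
}

/* "Address up" handler: only touch the listeners if the address was
 * actually recorded, so the two lists cannot drift apart. */
int handle_addr_up(struct addr_list *list, unsigned int addr,
                   alloc_fn alloc, int *listeners_updated)
{
    int err = insert_addr(list, addr, alloc);

    if (err)
        return err;
    (*listeners_updated)++;   /* stand-in for updating the listen list */
    return 0;
}

/* Test double that always fails, to simulate memory pressure. */
static void *failing_alloc(size_t size)
{
    (void)size;
    return NULL;
}
```

Passing `malloc` exercises the success path; passing `failing_alloc` shows that neither the address list nor the listener count changes when the allocation fails.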
[PATCH] Added Ethernet PHY support for the Realtek 821x
From: Joe D'Abbraccio [EMAIL PROTECTED]

The MPC837xERDB platform uses the RTL8211B Ethernet PHY on the WAN port (on eth0). Also added the kernel configuration options for selecting the PHY.

Signed-off-by: Johnson Leung [EMAIL PROTECTED]
Signed-off-by: Kevin Lam [EMAIL PROTECTED]
Signed-off-by: Joe D'Abbraccio [EMAIL PROTECTED]
---
 arch/powerpc/configs/mpc837x_rdb_defconfig |    1 +
 drivers/net/phy/Kconfig                    |    5 ++
 drivers/net/phy/Makefile                   |    1 +
 drivers/net/phy/realtek.c                  |   84
 4 files changed, 91 insertions(+), 0 deletions(-)
 create mode 100644 drivers/net/phy/realtek.c

diff --git a/arch/powerpc/configs/mpc837x_rdb_defconfig b/arch/powerpc/configs/mpc837x_rdb_defconfig
index e398e9f..9837493 100644
--- a/arch/powerpc/configs/mpc837x_rdb_defconfig
+++ b/arch/powerpc/configs/mpc837x_rdb_defconfig
@@ -411,6 +411,7 @@ CONFIG_MARVELL_PHY=y
 # CONFIG_SMSC_PHY is not set
 # CONFIG_BROADCOM_PHY is not set
 # CONFIG_ICPLUS_PHY is not set
+CONFIG_REALTEK_PHY=y
 # CONFIG_FIXED_PHY is not set
 CONFIG_NET_ETHERNET=y
 CONFIG_MII=y
diff --git a/drivers/net/phy/Kconfig b/drivers/net/phy/Kconfig
index dd09011..9ce95c9 100644
--- a/drivers/net/phy/Kconfig
+++ b/drivers/net/phy/Kconfig
@@ -60,6 +60,11 @@ config ICPLUS_PHY
         ---help---
           Currently supports the IP175C PHY.
 
+config REALTEK_PHY
+        tristate "Drivers for Realtek PHYs"
+        ---help---
+          Supports the Realtek 821x PHY.
+
 config FIXED_PHY
         tristate "Drivers for PHY emulation on fixed speed/link"
         ---help---
diff --git a/drivers/net/phy/Makefile b/drivers/net/phy/Makefile
index 8885650..d7bfa4e 100644
--- a/drivers/net/phy/Makefile
+++ b/drivers/net/phy/Makefile
@@ -12,4 +12,5 @@ obj-$(CONFIG_SMSC_PHY) += smsc.o
 obj-$(CONFIG_VITESSE_PHY) += vitesse.o
 obj-$(CONFIG_BROADCOM_PHY) += broadcom.o
 obj-$(CONFIG_ICPLUS_PHY) += icplus.o
+obj-$(CONFIG_REALTEK_PHY) += realtek.o
 obj-$(CONFIG_FIXED_PHY) += fixed.o
diff --git a/drivers/net/phy/realtek.c b/drivers/net/phy/realtek.c
new file mode 100644
index 000..546c25f
--- /dev/null
+++ b/drivers/net/phy/realtek.c
@@ -0,0 +1,84 @@
+/*
+ * drivers/net/phy/realtek.c
+ *
+ * Driver for Realtek PHYs
+ *
+ * Author: Johnson Leung [EMAIL PROTECTED]
+ *
+ * Copyright (c) 2004 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or (at your
+ * option) any later version.
+ *
+ */
+#include <linux/phy.h>
+
+#define RTL821x_PHYSR           0x11
+#define RTL821x_PHYSR_DUPLEX    0x2000
+#define RTL821x_PHYSR_SPEED     0xc000
+#define RTL821x_INER            0x12
+#define RTL821x_INER_INIT       0x6400
+#define RTL821x_INSR            0x13
+
+MODULE_DESCRIPTION("Realtek PHY driver");
+MODULE_AUTHOR("Johnson Leung");
+MODULE_LICENSE("GPL");
+
+static int rtl821x_config_init(struct phy_device *phydev)
+{
+        return 0;
+}
+
+static int rtl821x_ack_interrupt(struct phy_device *phydev)
+{
+        int err;
+
+        err = phy_read(phydev, RTL821x_INSR);
+        return (err < 0) ? err : 0;
+}
+
+static int rtl821x_config_intr(struct phy_device *phydev)
+{
+        int err;
+
+        if (phydev->interrupts == PHY_INTERRUPT_ENABLED)
+                err = phy_write(phydev, RTL821x_INER,
+                                RTL821x_INER_INIT);
+        else
+                err = phy_write(phydev, RTL821x_INER, 0);
+
+        return err;
+}
+
+/* RTL8211B */
+static struct phy_driver rtl821x_driver = {
+        .phy_id         = 0x0001cc912,
+        .name           = "RTL821x Gigabit Ethernet",
+        .phy_id_mask    = 0x001f,
+        .features       = PHY_GBIT_FEATURES,
+        .flags          = PHY_HAS_INTERRUPT,
+        .config_init    = rtl821x_config_init,
+        .config_aneg    = genphy_config_aneg,
+        .read_status    = genphy_read_status,
+        .ack_interrupt  = rtl821x_ack_interrupt,
+        .config_intr    = rtl821x_config_intr,
+        .driver         = { .owner = THIS_MODULE,},
+};
+
+static int __init realtek_init(void)
+{
+        int ret;
+
+        ret = phy_driver_register(&rtl821x_driver);
+        return ret;
+}
+
+static void __exit realtek_exit(void)
+{
+        phy_driver_unregister(&rtl821x_driver);
+}
+
+module_init(realtek_init);
+module_exit(realtek_exit);
--
1.5.2.2
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
> What is the model on how client connects, say for iSCSI, when client and server both support, iWARP and 10GbE or 1GbE, and would like to setup most performant connection for ULP?

For the most performant connection, the ULP would use IB, and all these problems go away. :)

This proposal is for each iwarp interface to have its own IP address. Clients would need an iwarp usable address of the server and would connect using rdma_connect(). If that call (or rdma_resolve_addr/route) fails, the client could try connecting using sockets, aoi, or some other interface. I don't see that Steve's proposal changes anything from the client's perspective.

- Sean
[PATCH 1/2] sky2: FE+ vlan workaround
The FE+ workaround means the driver can no longer trust the status register to indicate VLAN tagged frames. The fix for this is to just disable VLAN acceleration for that chip version. Tested and works fine.

This patch applies to 2.6.23-rc8 after yesterday's patch: sky2 FE+ receive status workaround

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- a/drivers/net/sky2.c 2007-09-27 08:45:13.0 -0700
+++ b/drivers/net/sky2.c 2007-09-27 09:43:15.0 -0700
@@ -3970,8 +3970,12 @@ static __devinit struct net_device *sky2
         dev->features |= NETIF_F_HIGHDMA;
 
 #ifdef SKY2_VLAN_TAG_USED
-        dev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
-        dev->vlan_rx_register = sky2_vlan_rx_register;
+        /* The workaround for FE+ status conflicts with VLAN tag detection. */
+        if (!(sky2->hw->chip_id == CHIP_ID_YUKON_FE_P
+              && sky2->hw->chip_rev == CHIP_REV_YU_FE2_A0)) {
+                dev->features |= NETIF_F_HW_VLAN_TX | NETIF_F_HW_VLAN_RX;
+                dev->vlan_rx_register = sky2_vlan_rx_register;
+        }
 #endif
 
         /* read the mac address */
Re: TCP Spike
On Thu, 27 Sep 2007, Majumder, Rajib wrote:

> We have observed 40ms latency spikes in TCP connections in burst type of traffic. This affects regular TCP sockets.

Are segments being sent full-sized, or is there perhaps some Nagle component in it as well? I.e., are the applications using TCP_NODELAY?

> We observed this issue in kernels of 2.4.21 and kernel 2.6.5. Apparently, this seems to be fixed in 2.6.19. Can someone throw some light on this?

I think somebody, probably Alexey, enabled sending of ACK on every 2nd segment. Previously small segment senders playing with Nagle were complaining every now and then about performance because two small segments did not generate ACKs but one had to accumulate, IIRC, half MSS worth of data before ACK was sent. Could this be related to your case?

...In case you have too much time on your hands, you can always try bisecting it, which finds the causing commit... :-)

> Is this a congestion control/avoidance issue?

Congestion control is basically ACK-clocked math for cwnd, ssthresh, etc. state, which then results in permission to send new segments out etc. (except for the RTO part of course, which I'll ignore in the next statement). Any delay gap between receiving the ACK that triggered the state-changing math and sending the resulting packet is not caused by congestion control but by other factors! 40ms is well below MIN_RTO (200ms), so it shouldn't be due to RTO either... Note that delayed ACKs are also an exception to the general rule.

Congestion control is clocked like your CPU is. In your CPU there's this whatever-GHz clock which determines when the state-changing events take place; state changes don't just happen arbitrarily but are _clocked_ (ACK _clocked_ in the case of congestion control). Of course there will be some propagation delay after the change to put into effect all the state changes that result from what occurred at the clock edge (this delay is analogous to processing delay in the context of congestion control).

--
i.
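The TCP_NODELAY question above refers to the standard socket option that turns off the Nagle algorithm. A minimal sketch of setting and reading it back on a POSIX system follows; the two wrapper names are illustrative, while setsockopt(), getsockopt(), IPPROTO_TCP and TCP_NODELAY are the standard interfaces.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Enable (1) or disable (0) TCP_NODELAY on a TCP socket.
 * Returns 0 on success, -1 on error with errno set by setsockopt(). */
int set_tcp_nodelay(int fd, int enable)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &enable, sizeof enable);
}

/* Returns 1 if TCP_NODELAY is set, 0 if not, -1 on error. */
int get_tcp_nodelay(int fd)
{
    int value = 0;
    socklen_t len = sizeof value;

    if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &value, &len) < 0)
        return -1;
    return value != 0;
}
```

Checking the option on the sending application's sockets is a quick way to tell whether Nagle can be involved at all before bisecting kernels; it does not by itself explain the 40ms figure.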
[PATCH 2/2] sky2: fix transmit state on resume
This should fix http://bugzilla.kernel.org/show_bug.cgi?id=8667

After resume, the driver has reset the chip, so the current state of the transmit checksum offload state machine and DMA state machine will be undefined. The fix is to set the state so that the first Tx will set MSS and offset values.

Patch is against 2.6.23-rc8 after last patch: sky2: FE+ vlan workaround (Should also work on older releases with minor fuzz).

Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]

--- a/drivers/net/sky2.c 2007-09-27 08:45:13.0 -0700
+++ b/drivers/net/sky2.c 2007-09-27 09:39:49.0 -0700
@@ -910,6 +910,20 @@ static inline struct sky2_tx_le *get_tx_
         return le;
 }
 
+static void tx_init(struct sky2_port *sky2)
+{
+        struct sky2_tx_le *le;
+
+        sky2->tx_prod = sky2->tx_cons = 0;
+        sky2->tx_tcpsum = 0;
+        sky2->tx_last_mss = 0;
+
+        le = get_tx_le(sky2);
+        le->addr = 0;
+        le->opcode = OP_ADDR64 | HW_OWNER;
+        sky2->tx_addr64 = 0;
+}
+
 static inline struct tx_ring_info *tx_le_re(struct sky2_port *sky2,
                                             struct sky2_tx_le *le)
 {
@@ -1320,7 +1334,8 @@ static int sky2_up(struct net_device *de
                                 GFP_KERNEL);
         if (!sky2->tx_ring)
                 goto err_out;
-        sky2->tx_prod = sky2->tx_cons = 0;
+
+        tx_init(sky2);
 
         sky2->rx_le = pci_alloc_consistent(hw->pdev, RX_LE_BYTES,
                                            &sky2->rx_le_map);
Re: [PATCH] various dst_ifdown routines to catch refcounting bugs
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Thu, 27 Sep 2007 10:27:43 -0600

> Denis V. Lunev [EMAIL PROTECTED] writes:
>
>> Moving dst entries into init_net.loopback_dev is not a good thing. This hides obvious and non-obvious ref-counting bugs.
>
> Acked-by: Eric W. Biederman [EMAIL PROTECTED]

Patch applied.

> I do have a question I would like to bring up, because I like avoiding explicit references to loopback_dev when I can.
>
> /* Dirty hack. We did it in 2.2 (in __dst_free),
>  * we have _very_ good reasons not to repeat
>  * this mistake in 2.3, but we have no choice
>  * now. _It_ _is_ _explicit_ _deliberate_
>  * _race_ _condition_.
>  *
>  * Commented and originally written by Alexey.
>  */
>
> What is the race that is talked about in that comment? Can we just assign NULL instead of the loopback device when we bring a route down? My gut feeling is that something like:
>
>     dst->input = dst->output = dst_discard;
>
> may be enough. But I don't know where the deliberate race is.

The packet output path accesses the cached route device asynchronously, and we are resetting the device to be loopback without any synchronization whatsoever. None is in fact possible, and we don't want to add it because that would be way too expensive.

So another thread on the system can either see the original device or the loopback one.

It all works out because as the device goes down we'll purge any packets queued into the transmit queue and packet scheduler for that device.
Re: [PATCH] net: Add network namespace clone unshare support.
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Thu, 27 Sep 2007 11:14:33 -0600

> Thanks for pointing this out, it's on my todo list to look into, and ensure we resolve. I'm confused because my notes have 0x8000 for the pid namespace, and 0x4000 for the time namespace.

Eric, pick an appropriate new non-conflicting number NOW. This adds unnecessary extra work for Andrew Morton, which he has enough of already.
Re: [PKT_SCHED]: Add stateless NAT
From: Herbert Xu [EMAIL PROTECTED]
Date: Thu, 27 Sep 2007 20:58:01 +0800

> On Thu, Sep 27, 2007 at 08:39:45AM -0400, jamal wrote:
>>
>> You also need to p->tcf_qstats.drops++ for all packets that get shot.
>
> I was rather hoping that my packets wouldn't get shot :) But yeah let's increment the drops counter for consistency.
>
> [PKT_SCHED]: Add stateless NAT

Applied to net-2.6.24, thanks Herbert!
Re: [PKT_SCHED]: Add stateless NAT
From: Patrick McHardy [EMAIL PROTECTED]
Date: Thu, 27 Sep 2007 15:39:34 +0200

> Evgeniy Polyakov wrote:
>> On Thu, Sep 27, 2007 at 09:20:37PM +0800, Herbert Xu ([EMAIL PROTECTED]) wrote:
>>> How about putting it in net/core/utils.c?
>>
>> I knew, that was a bad idea to try to fix netfilter dependency :)
>>
>> diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
>
> This looks good to me.

I still think the nf_*() prefixes should all go and the extern prototypes should go into an independent header file. These are not netfilter routines, they are INET helpers.

And we should make similar treatment for all of the ipv6 packet parser helper functions that ipv6 netfilter needs.
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
Sean,
IB aside, it looks like an ULP which is capable of being both RDMA aware and RDMA not-aware, like iSER and iSCSI, NFS-RDMA and NFS, SDP and sockets, will be treated as two separate ULPs. Each has its own IP address, since there is a different IP address for iWARP port and regular Ethernet port. So it falls on the users of ULPs to handle it via DNS or some other services.
Is this acceptable to users? I doubt it.

Recall that ULPs are going in opposite directions by having a different port number for RDMA aware and RDMA unaware versions of the ULP. This way, ULP connection manager handles RDMA-ness under the covers, while users plug an IP address for a server to connect to.
Thanks,

Arkady Kanevsky                 email: [EMAIL PROTECTED]
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300

> -----Original Message-----
> From: Sean Hefty [mailto:[EMAIL PROTECTED]
> Sent: Thursday, September 27, 2007 3:12 PM
> To: Kanevsky, Arkady; Sean Hefty; Steve Wise
> Cc: netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
>
> > What is the model on how client connects, say for iSCSI, when client
> > and server both support, iWARP and 10GbE or 1GbE, and would like to
> > setup most performant connection for ULP?
>
> For the most performance connection, the ULP would use IB, and all these problems go away. :)
>
> This proposal is for each iwarp interface to have its own IP address. Clients would need an iwarp usable address of the server and would connect using rdma_connect(). If that call (or rdma_resolve_addr/route) fails, the client could try connecting using sockets, aoi, or some other interface. I don't see that Steve's proposal changes anything from the client's perspective.
> - Sean
_______________________________________________
general mailing list
[EMAIL PROTECTED]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
RE: [ofa-general] [PATCH v3] iw_cxgb3: Support iwarp-only interfaces to avoid 4-tuple conflicts.
> It is ok to block while holding a mutex, yes?

It's okay, I just didn't try to trace through the code to see if it ever tries to acquire the same mutex in the thread that needs to signal the event.

- Sean
Re: [PATCH 1/4] net: Dynamically allocate the per cpu counters for the loopback device.
David Miller [EMAIL PROTECTED] writes:

> From: [EMAIL PROTECTED] (Eric W. Biederman)
> Date: Thu, 27 Sep 2007 01:48:00 -0600
>
> > I'm not doing get_cpu/put_cpu so does the comment make sense in
> > relationship to per_cpu_ptr?
>
> It is possible. But someone would need to go check for sure.

Verified.

hard_start_xmit is called inside of a rcu_read_lock_bh(), rcu_read_unlock_bh() pair. Which means the code will only run on one cpu. Therefore we do not need get_cpu/put_cpu.

In addition per_cpu_ptr is valid. As it is just a lookup into a NR_CPUS sized array by smp_processor_id() to return the address of the specific cpu.

The only difference between per_cpu_ptr and __get_cpu_var() are the implementation details between statically allocated and dynamically allocated per cpu state.

So the comment is still valid, and still interesting it just should say per_cpu_ptr instead of __get_cpu_var.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 0f9d8c6..756e267 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -154,7 +154,7 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 #endif
 	dev->last_rx = jiffies;

-	/* it's OK to use __get_cpu_var() because BHs are off */
+	/* it's OK to use per_cpu_ptr() because BHs are off */
 	pcpu_lstats = netdev_priv(dev);
 	lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
 	lb_stats->bytes += skb->len;
Re: [PATCH 1/4] net: Dynamically allocate the per cpu counters for the loopback device.
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Thu, 27 Sep 2007 14:44:37 -0600

> David Miller [EMAIL PROTECTED] writes:
>
> > From: [EMAIL PROTECTED] (Eric W. Biederman)
> > Date: Thu, 27 Sep 2007 01:48:00 -0600
> >
> > > I'm not doing get_cpu/put_cpu so does the comment make sense in
> > > relationship to per_cpu_ptr?
> >
> > It is possible. But someone would need to go check for sure.
>
> Verified.
>
> hard_start_xmit is called inside of a rcu_read_lock_bh(), rcu_read_unlock_bh() pair. Which means the code will only run on one cpu. Therefore we do not need get_cpu/put_cpu.
>
> In addition per_cpu_ptr is valid. As it is just a lookup into a NR_CPUS sized array by smp_processor_id() to return the address of the specific cpu.
>
> The only difference between per_cpu_ptr and __get_cpu_var() are the implementation details between statically allocated and dynamically allocated per cpu state.
>
> So the comment is still valid, and still interesting it just should say per_cpu_ptr instead of __get_cpu_var.
>
> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

I've already removed the comment, so you'll have to give me a patch that adds it back with the new content :-)
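The description of per_cpu_ptr() in this thread ("just a lookup into a NR_CPUS sized array") can be illustrated with a small user-space sketch. This is not the kernel implementation: the names demo_lstats, demo_alloc_percpu, demo_per_cpu_ptr, demo_account and demo_sum are hypothetical, and the CPU number is passed in explicitly instead of coming from smp_processor_id(). It only shows why a writer that stays on one CPU can update its own slot without locking, while a reader folds all slots together.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's per-CPU statistics. */
struct demo_lstats {
	unsigned long packets;
	unsigned long bytes;
};

/* Allocate one zeroed slot per CPU (user-space analogue of a per-CPU allocation). */
static struct demo_lstats *demo_alloc_percpu(unsigned int ncpus)
{
	assert(ncpus > 0);
	return calloc(ncpus, sizeof(struct demo_lstats));
}

/* Analogue of per_cpu_ptr(): index the array by CPU number. */
static struct demo_lstats *demo_per_cpu_ptr(struct demo_lstats *base,
					    unsigned int cpu)
{
	return &base[cpu];
}

/* Transmit path: touch only the slot of the CPU we are running on. */
static void demo_account(struct demo_lstats *base, unsigned int cpu,
			 unsigned long len)
{
	struct demo_lstats *s = demo_per_cpu_ptr(base, cpu);

	s->bytes += len;
	s->packets++;
}

/* Reader: fold every CPU's slot into one total. */
static struct demo_lstats demo_sum(const struct demo_lstats *base,
				   unsigned int ncpus)
{
	struct demo_lstats total = { 0, 0 };
	unsigned int i;

	for (i = 0; i < ncpus; i++) {
		total.packets += base[i].packets;
		total.bytes += base[i].bytes;
	}
	return total;
}
```

A caller would allocate once with demo_alloc_percpu(), call demo_account() from each "CPU" with its own index, and call demo_sum() when reporting statistics.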
[PATCH] rfkill: Move rfkill_switch_all out of global header
rfkill_switch_all shouldn't be called by drivers directly, instead they should send a signal over the input device. To prevent confusion for driver developers, move the function into a rfkill private header.

Signed-off-by: Ivo van Doorn [EMAIL PROTECTED]
---
diff --git a/include/linux/rfkill.h b/include/linux/rfkill.h
index f9a50da..67096b5 100644
--- a/include/linux/rfkill.h
+++ b/include/linux/rfkill.h
@@ -2,7 +2,7 @@
 #define __RFKILL_H

 /*
- * Copyright (C) 2006 Ivo van Doorn
+ * Copyright (C) 2006 - 2007 Ivo van Doorn
  * Copyright (C) 2007 Dmitry Torokhov
  *
  * This program is free software; you can redistribute it and/or modify
@@ -84,6 +84,4 @@ void rfkill_free(struct rfkill *rfkill);
 int rfkill_register(struct rfkill *rfkill);
 void rfkill_unregister(struct rfkill *rfkill);

-void rfkill_switch_all(enum rfkill_type type, enum rfkill_state state);
-
 #endif /* RFKILL_H */
diff --git a/net/rfkill/rfkill-input.c b/net/rfkill/rfkill-input.c
index 8e4516a..eaabf08 100644
--- a/net/rfkill/rfkill-input.c
+++ b/net/rfkill/rfkill-input.c
@@ -17,6 +17,8 @@
 #include <linux/init.h>
 #include <linux/rfkill.h>

+#include "rfkill-input.h"
+
 MODULE_AUTHOR("Dmitry Torokhov [EMAIL PROTECTED]");
 MODULE_DESCRIPTION("Input layer to RF switch connector");
 MODULE_LICENSE("GPL");
diff --git a/net/rfkill/rfkill-input.h b/net/rfkill/rfkill-input.h
new file mode 100644
index 000..4dae500
--- /dev/null
+++ b/net/rfkill/rfkill-input.h
@@ -0,0 +1,16 @@
+/*
+ * Copyright (C) 2007 Ivo van Doorn
+ */
+
+/*
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ */
+
+#ifndef __RFKILL_INPUT_H
+#define __RFKILL_INPUT_H
+
+void rfkill_switch_all(enum rfkill_type type, enum rfkill_state state);
+
+#endif /* __RFKILL_INPUT_H */
diff --git a/net/rfkill/rfkill.c b/net/rfkill/rfkill.c
index 03ed7fd..00ee534 100644
--- a/net/rfkill/rfkill.c
+++ b/net/rfkill/rfkill.c
@@ -1,5 +1,5 @@
 /*
- * Copyright (C) 2006 Ivo van Doorn
+ * Copyright (C) 2006 - 2007 Ivo van Doorn
  * Copyright (C) 2007 Dmitry Torokhov
  *
  * This program is free software; you can redistribute it and/or modify
Re: [PATCH] rfkill: Move rfkill_switch_all out of global header
From: Ivo van Doorn [EMAIL PROTECTED]
Date: Fri, 28 Sep 2007 00:07:41 +0200

> rfkill_switch_all shouldn't be called by drivers directly, instead they should send a signal over the input device.
>
> To prevent confusion for driver developers, move the function into a rfkill private header.
>
> Signed-off-by: Ivo van Doorn [EMAIL PROTECTED]

Applied to net-2.6.24, thanks.
Re: ax88796: add 93cx6 eeprom support
On Thu, 27 Sep 2007 19:51:19 +0900 Magnus Damm [EMAIL PROTECTED] wrote:

> ax88796: add 93cx6 eeprom support
>
> This patch hooks up the 93cx6 eeprom code to the ax88796 driver and modifies the ax88796 driver to read out the mac address from the eeprom. We need this for the ax88796 on certain SuperH boards. The pin configuration used to connect the eeprom to the ax88796 on these boards is the same as pointed out by the ax88796 datasheet, so we can probably reuse this code for multiple platforms in the future.

I'm showing a minor reject between this and Francois's git-r8169.patch.

***************
*** 21,33 ****
  /*
  	Module: eeprom_93cx6
  	Abstract: EEPROM reader datastructures for 93cx6 chipsets.
- 	Supported chipsets: 93c46 & 93c66.
   */

  /*
   * EEPROM operation defines.
   */
  #define PCI_EEPROM_WIDTH_93C46	6
  #define PCI_EEPROM_WIDTH_93C66	8
  #define PCI_EEPROM_WIDTH_OPCODE	3
  #define PCI_EEPROM_WRITE_OPCODE	0x05
--- 21,34 ----
  /*
  	Module: eeprom_93cx6
  	Abstract: EEPROM reader datastructures for 93cx6 chipsets.
+ 	Supported chipsets: 93c46/93c56/93c66.
   */

  /*
   * EEPROM operation defines.
   */
  #define PCI_EEPROM_WIDTH_93C46	6
+ #define PCI_EEPROM_WIDTH_93C56	8
  #define PCI_EEPROM_WIDTH_93C66	8
  #define PCI_EEPROM_WIDTH_OPCODE	3
  #define PCI_EEPROM_WRITE_OPCODE	0x05

You both made the same change to eeprom_93cx6.h.

That all sounds good but it would be comforting if you could review each other's work, please...
Re: Please pull 'upstream-davem' branch of wireless-2.6 (2007-09-27)
John W. Linville wrote:
> Dave & Jeff,
>
> Here are some more wireless stack and driver updates for 2.6.24. Please pull at your earliest convenience.

ACK (I presume davem will pull)

it looks like this includes my adm feedback, thanks!
[RFC] Make TCP prequeue configurable
Hi all

I am sure some of you are going to tell me that prequeue is not all black :)

Thank you

[RFC] Make TCP prequeue configurable

The TCP prequeue thing is based on old facts, and has drawbacks.

1) It adds 48 bytes per 'struct tcp_sock'
2) It adds some ugly code in hot paths
3) It has a small hit ratio on typical servers using many sockets
4) It may have a high hit ratio on UP machines running one process, where the prequeue adds little gain. (In fact, letting the user do the copy after being woken up is better for cache reuse)
5) Doing a copy to user in softirq handler is not good, because of potential page faults :(
6) Maybe the NET_DMA thing is the only thing that might need prequeue.

This patch introduces a CONFIG_TCP_PREQUEUE, automatically selected if CONFIG_NET_DMA is on.

Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 8f670da..14e3f01 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -16,6 +16,7 @@ comment "DMA Clients"
 config NET_DMA
 	bool "Network: TCP receive copy offload"
 	depends on DMA_ENGINE && NET
+	select TCP_PREQUEUE
 	default y
 	---help---
 	  This enables the use of DMA engines in the network stack to
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c6b9f92..844a05e 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -268,11 +268,13 @@ struct tcp_sock {

 	/* Data for direct copy to user */
 	struct {
+#ifdef CONFIG_TCP_PREQUEUE
 		struct sk_buff_head	prequeue;
 		struct task_struct	*task;
 		struct iovec		*iov;
 		int			memory;
 		int			len;
+#endif
 #ifdef CONFIG_NET_DMA
 		/* members for async copy */
 		struct dma_chan		*dma_chan;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 185c7ec..3430d8e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -835,10 +835,12 @@ static inline int tcp_checksum_complete(struct sk_buff *skb)

 static inline void tcp_prequeue_init(struct tcp_sock *tp)
 {
+#ifdef CONFIG_TCP_PREQUEUE
 	tp->ucopy.task = NULL;
 	tp->ucopy.len = 0;
 	tp->ucopy.memory = 0;
 	skb_queue_head_init(&tp->ucopy.prequeue);
+#endif
 #ifdef CONFIG_NET_DMA
 	tp->ucopy.dma_chan = NULL;
 	tp->ucopy.wakeup = 0;
@@ -857,6 +859,7 @@ static inline void tcp_prequeue_init(struct tcp_sock *tp)
  */
 static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 {
+#ifdef CONFIG_TCP_PREQUEUE
 	struct tcp_sock *tp = tcp_sk(sk);

 	if (!sysctl_tcp_low_latency && tp->ucopy.task) {
@@ -882,6 +885,7 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 		}
 		return 1;
 	}
+#endif

 	return 0;
 }
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index fb79097..b770829 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -616,5 +616,20 @@ config TCP_MD5SIG

 	  If unsure, say N.

+config TCP_PREQUEUE
+	bool "Enable TCP prequeue"
+	default n
+	---help---
+	  TCP PREQUEUE is an 'optimization' loosely based on the famous
+	  30 instruction TCP receive Van Jacobson mail.
+	  Van's trick is to deposit buffers into socket queue
+	  on a device interrupt, to call tcp_recv function
+	  on the receive process context and checksum and copy
+	  the buffer to user space. smart...
+
+	  Some people believe this 'optimization' is not really needed
+	  but for some benchmarks. Also, taking potential pagefaults in
+	  softirq handler seems a high price to pay.
+
 source "net/ipv4/ipvs/Kconfig"

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7e74011..8659533 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -994,6 +994,7 @@ void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		tcp_send_ack(sk);
 }

+#ifdef CONFIG_TCP_PREQUEUE
 static void tcp_prequeue_process(struct sock *sk)
 {
 	struct sk_buff *skb;
@@ -1011,6 +1012,7 @@ static void tcp_prequeue_process(struct sock *sk)
 	/* Clear memory counter. */
 	tp->ucopy.memory = 0;
 }
+#endif

 static inline struct sk_buff *tcp_recv_skb(struct sock *sk, u32 seq, u32 *off)
 {
@@ -1251,6 +1253,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,

 		tcp_cleanup_rbuf(sk, copied);

+#ifdef CONFIG_TCP_PREQUEUE
 		if (!sysctl_tcp_low_latency && tp->ucopy.task == user_recv) {
 			/* Install new reader */
 			if (!user_recv && !(flags & (MSG_TRUNC | MSG_PEEK))) {
@@ -1295,7 +1298,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,

 			/* __ Set realtime policy in scheduler __ */
 		}
-
+#endif
 		if (copied >= target) {
 			/* Do not sleep, just
[PATCH] netns: CLONE_NEWNET don't use the same clone flag as the pid namespace.
Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 include/linux/sched.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e10a0a8..d82c1f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -27,7 +27,7 @@
 #define CLONE_NEWUTS		0x0400	/* New utsname group? */
 #define CLONE_NEWIPC		0x0800	/* New ipcs */
 #define CLONE_NEWUSER		0x1000	/* New user namespace */
-#define CLONE_NEWNET		0x2000	/* New network namespace */
+#define CLONE_NEWNET		0x4000	/* New network namespace */

 /*
  * Scheduling policies
--
1.5.3.rc6.17.g1911
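The collision this patch fixes (two namespace flags sharing one bit) is the kind of mistake a small check can catch. The sketch below is only an illustration: demo_flags_disjoint is a hypothetical helper, and the constants in the usage note are copied as they appear in the archived patch text, which may have lost digits in archiving, so they should not be read as the authoritative header values.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if no two entries of flags[] share a bit, 0 otherwise.
 * Hypothetical helper, not part of any kernel or libc API.
 */
static int demo_flags_disjoint(const unsigned long *flags, size_t n)
{
	unsigned long seen = 0;
	size_t i;

	assert(flags != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (seen & flags[i])
			return 0;	/* this flag overlaps an earlier one */
		seen |= flags[i];
	}
	return 1;
}
```

Fed the values as printed above, with the pid namespace flag taken to be the old CLONE_NEWNET value as the patch title implies, the set { 0x0400, 0x0800, 0x1000, 0x2000, 0x2000 } is rejected, while the same set with the last entry changed to 0x4000 passes.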
[PATCH] Bring comments in loopback.c uptodate.
A hint as to why it is safe to use per cpu variables, and note that we actually can have multiple instances of the loopback device now.

Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]
---
 drivers/net/loopback.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 2617320..cba5c76 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -154,6 +154,7 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 #endif
 	dev->last_rx = jiffies;

+	/* it's OK to use per_cpu_ptr() because BHs are off */
 	pcpu_lstats = netdev_priv(dev);
 	lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
 	lb_stats->bytes += skb->len;
@@ -221,7 +222,8 @@ static void loopback_dev_free(struct net_device *dev)
 }

 /*
- * The loopback device is special. There is only one instance.
+ * The loopback device is special. There is only one instance
+ * per network namespace.
  */
 static void loopback_setup(struct net_device *dev)
 {
--
1.5.3.rc6.17.g1911
Re: [RFC] Zero-length write() does not generate a datagram on connected socket
On Thu, 27 Sep 2007 13:53:34 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:

> From: Stephen Hemminger [EMAIL PROTECTED]
> Date: Mon, 24 Sep 2007 15:34:35 -0700
>
> > The bug http://bugzilla.kernel.org/show_bug.cgi?id=5731 describes an issue where write() can't be used to generate a zero-length datagram (but send, and sendto do work).
> >
> > I think the following is needed:
> >
> > --- a/net/socket.c	2007-08-20 09:54:28.0 -0700
> > +++ b/net/socket.c	2007-09-24 15:31:25.0 -0700
> > @@ -777,8 +777,11 @@ static ssize_t sock_aio_write(struct kio
> >  	if (pos != 0)
> >  		return -ESPIPE;
> >
> > -	if (iocb->ki_left == 0)	/* Match SYS5 behaviour */
> > -		return 0;
> > +	if (unlikely(iocb->ki_left == 0)) {
> > +		struct socket *sock = iocb->ki_filp->private_data;
> > +		if (sock->type == SOCK_STREAM)
> > +			return 0;
> > +	}
> >
> >  	x = alloc_sock_iocb(iocb, &siocb);
> >  	if (!x)
>
> We should simply remove the check completely.
>
> There is no need to add special code for different types of protocols and sockets.
>
> As is hinted in the bugzilla, the exact same thing can happen with a suitably constructed sendto() or sendmsg() call.
>
> write() on a socket is a sendmsg() with a NULL msg_control and a single entry iovec, plain and simple. It's how BSD and many other systems behave, and I double checked Stevens' Volume 2 just to make sure.
>
> So I'm going to check in the following to fix this bugzilla. There is a similarly ugly test for len==0 in sys_read() on sockets. If someone would do some research on the validity of that thing I'd really appreciate it :-)

Read of zero length should be a no-op for SOCK_STREAM but for SOCK_DGRAM or SOCK_SEQPACKET it might be useful as a remote wait for event.

--
Stephen Hemminger [EMAIL PROTECTED]
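For readers who want to see what a zero-length datagram looks like from user space, here is a minimal sketch. It makes two assumptions that are not from the thread: it uses an AF_UNIX datagram socketpair instead of UDP so that it needs no addresses or ports, and it sends with send(), which the bug report says already works, rather than write(). The function name demo_zero_len_datagram is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Queue one empty datagram on a connected AF_UNIX datagram pair and read
 * it back. Returns what recv() returned (0 for an empty datagram), or -1
 * if any step failed or nothing was queued.
 */
static int demo_zero_len_datagram(void)
{
	int sv[2];
	char buf[16];
	ssize_t n;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
		return -1;

	/* A zero-byte send() on a datagram socket still queues a datagram. */
	if (send(sv[0], "", 0, 0) != 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	/* MSG_DONTWAIT: fail with -1 instead of blocking if nothing arrived. */
	n = recv(sv[1], buf, sizeof(buf), MSG_DONTWAIT);

	close(sv[0]);
	close(sv[1]);
	assert(n <= (ssize_t)sizeof(buf));
	return (int)n;
}
```

Replacing the send() call with write(sv[0], "", 0) is a way to check whether a given kernel still has the early return discussed above: with that check in place the recv() would find nothing queued and the function would return -1.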
Re: [RFC] Make TCP prequeue configurable
On Fri, 28 Sep 2007 00:08:33 +0200 Eric Dumazet [EMAIL PROTECTED] wrote:

> Hi all
>
> I am sure some of you are going to tell me that prequeue is not all black :)
>
> Thank you
>
> [RFC] Make TCP prequeue configurable
>
> The TCP prequeue thing is based on old facts, and has drawbacks.
>
> 1) It adds 48 bytes per 'struct tcp_sock'
> 2) It adds some ugly code in hot paths
> 3) It has a small hit ratio on typical servers using many sockets
> 4) It may have a high hit ratio on UP machines running one process, where the prequeue adds little gain. (In fact, letting the user do the copy after being woken up is better for cache reuse)
> 5) Doing a copy to user in softirq handler is not good, because of potential page faults :(
> 6) Maybe the NET_DMA thing is the only thing that might need prequeue.
>
> This patch introduces a CONFIG_TCP_PREQUEUE, automatically selected if CONFIG_NET_DMA is on.
>
> Signed-off-by: Eric Dumazet [EMAIL PROTECTED]

Rather than having two more compile cases and test cases to deal with, if you can prove it is useless, make a case for killing it completely.

--
Stephen Hemminger [EMAIL PROTECTED]
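The objection about extra compile cases is easier to see with the pattern in miniature. The sketch below is not the kernel code: DEMO_PREQUEUE is a hypothetical stand-in for CONFIG_TCP_PREQUEUE, and demo_sock and demo_prequeue are invented for illustration. It mirrors the shape of the RFC patch, where the fast path is compiled out and the function falls through to "return 0", so every caller has to behave correctly in both builds.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical build-time switch standing in for CONFIG_TCP_PREQUEUE. */
#ifndef DEMO_PREQUEUE
#define DEMO_PREQUEUE 0
#endif

/* Invented stand-in for the socket state the real code inspects. */
struct demo_sock {
	int reader_waiting;	/* a task is blocked in receive */
	int queued;		/* packets handed to the prequeue */
};

/* Returns 1 if the packet was prequeued, 0 if the caller must process it. */
static inline int demo_prequeue(struct demo_sock *sk)
{
	assert(sk != NULL);
#if DEMO_PREQUEUE
	if (sk->reader_waiting) {
		sk->queued++;
		return 1;
	}
#endif
	return 0;
}
```

Building once with -DDEMO_PREQUEUE=1 and once without gives two different behaviours for the same call site, which is the doubling of build and test configurations the reply above is concerned about.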
Re: [PATCH] net: Add network namespace clone unshare support.
David Miller [EMAIL PROTECTED] writes:

> Eric, pick an appropriate new non-conflicting number NOW.

Done. My apologies for the confusion. I thought the way Cedric and the IBM guys were testing someone would have shouted at me long before now.

> This adds unnecessary extra work for Andrew Morton, which he has enough of already.

Cedric made a good point that we will have conflicts of code being added to the same place in nsproxy.c and the like. So I copied Andrew to give him a heads up.

I will gladly do what I can, to help. Working against 3 trees development at the moment is a bit of a development challenge.

Eric
Re: [RFC] Make TCP prequeue configurable
From: Eric Dumazet [EMAIL PROTECTED]
Date: Fri, 28 Sep 2007 00:08:33 +0200

> 1) It adds 48 bytes per 'struct tcp_sock'
> 2) It adds some ugly code in hot paths
> 3) It has a small hit ratio on typical servers using many sockets
> 4) It may have a high hit ratio on UP machines running one process, where the prequeue adds little gain. (In fact, letting the user do the copy after being woken up is better for cache reuse)
> 5) Doing a copy to user in softirq handler is not good, because of potential page faults :(
> 6) Maybe the NET_DMA thing is the only thing that might need prequeue.

If you want to make changes at least get your facts straight in your changelog message :-)

The prequeue doesn't do copies in softirqs, it acquires the user side socket lock and runs the packet input path directly from there, copying into userspace along the way.

You are making claims about performance based upon your understanding of the code and your understanding of typical workloads, rather than from actual measurements. In scientific communities this would make you a quack at best :-)
Re: [RFC] Make TCP prequeue configurable
Stephen Hemminger wrote:
> On Fri, 28 Sep 2007 00:08:33 +0200 Eric Dumazet [EMAIL PROTECTED] wrote:
>
> > Hi all
> >
> > I am sure some of you are going to tell me that prequeue is not all black :)
> >
> > Thank you
> >
> > [RFC] Make TCP prequeue configurable
> >
> > The TCP prequeue thing is based on old facts, and has drawbacks.
> >
> > 1) It adds 48 bytes per 'struct tcp_sock'
> > 2) It adds some ugly code in hot paths
> > 3) It has a small hit ratio on typical servers using many sockets
> > 4) It may have a high hit ratio on UP machines running one process, where the prequeue adds little gain. (In fact, letting the user do the copy after being woken up is better for cache reuse)
> > 5) Doing a copy to user in softirq handler is not good, because of potential page faults :(
> > 6) Maybe the NET_DMA thing is the only thing that might need prequeue.
> >
> > This patch introduces a CONFIG_TCP_PREQUEUE, automatically selected if CONFIG_NET_DMA is on.
> >
> > Signed-off-by: Eric Dumazet [EMAIL PROTECTED]
>
> Rather than having two more compile cases and test cases to deal with, if you can prove it is useless, make a case for killing it completely.

I think it really does help in case (4) with old NICs that don't do rx checksumming. I'm not sure how many people really care about this anymore, but probably some...?

OTOH, it would be nice to get rid of sysctl_tcp_low_latency.

-John
Re: [PATCH] netns: CLONE_NEWNET don't use the same clone flag as the pid namespace.
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Thu, 27 Sep 2007 16:40:31 -0600

> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

Applied.
Re: Please pull 'upstream-davem' branch of wireless-2.6 (2007-09-27)
From: Jeff Garzik [EMAIL PROTECTED]
Date: Thu, 27 Sep 2007 18:21:54 -0400

> John W. Linville wrote:
> > Dave & Jeff,
> >
> > Here are some more wireless stack and driver updates for 2.6.24. Please pull at your earliest convenience.
>
> ACK (I presume davem will pull)
>
> it looks like this includes my adm feedback, thanks!

Pulled into net-2.6.24 and pushed back out, thanks!
Re: [PATCH] Bring comments in loopback.c uptodate.
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Thu, 27 Sep 2007 16:39:53 -0600

> A hint as to why it is safe to use per cpu variables, and note that we actually can have multiple instances of the loopback device now.
>
> Signed-off-by: Eric W. Biederman [EMAIL PROTECTED]

Applied.
Re: [PATCH] net: Add network namespace clone unshare support.
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Thu, 27 Sep 2007 17:00:23 -0600

> I will gladly do what I can, to help. Working against 3 trees development at the moment is a bit of a development challenge.

Andrew has to work against 30 or so, so multiply your pain by 10 to understand what he has to deal with :-)
Re: [PATCH] net: Add network namespace clone unshare support.
On Thu, 27 Sep 2007 17:10:53 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:

> > I will gladly do what I can, to help. Working against 3 trees development at the moment is a bit of a development challenge.
>
> Andrew has to work against 30 or so

I wish! A remerge presently involves pulling and merging 73 git trees, 9 quilt trees and maybe 1,500 -mm patches.
2.6.23-rc8 network problem. Mem leak? ip1000a?
Uniprocessor Athlon 64, 64-bit kernel, 2G ECC RAM, 2.6.23-rc8 + linuxpps (5.0.0) + ip1000a driver. (patch from http://marc.info/?l=linux-netdev&m=118980588419882)

After a few hours of operation, ntp loses the ability to send packets. sendto() returns -EAGAIN to everything, including the 24-byte UDP packet that is a response to ntpq.

-EAGAIN on a sendto() makes me think of memory problems, so here's meminfo at the time:

### FAILED state ###
# cat /proc/meminfo
MemTotal:        2059384 kB
MemFree:           15332 kB
Buffers:          665608 kB
Cached:            18212 kB
SwapCached:            0 kB
Active:           380384 kB
Inactive:         355020 kB
SwapTotal:       5855208 kB
SwapFree:        5854552 kB
Dirty:             28504 kB
Writeback:             0 kB
AnonPages:         51608 kB
Mapped:            11852 kB
Slab:            1285348 kB
SReclaimable:     152968 kB
SUnreclaim:      1132380 kB
PageTables:         3888 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
CommitLimit:     6884900 kB
Committed_AS:     590528 kB
VmallocTotal: 34359738367 kB
VmallocUsed:      265628 kB
VmallocChunk: 34359472059 kB

Killing and restarting ntpd gets it running again for a few hours. Here's after about two hours of successful operation. (I'll try to remember to run slabinfo before killing ntpd next time.)
### WORKING state ### # cat /proc/meminfo MemTotal: 2059384 kB MemFree: 20252 kB Buffers:242688 kB Cached: 41556 kB SwapCached:200 kB Active: 285012 kB Inactive: 147348 kB SwapTotal: 5855208 kB SwapFree: 5854212 kB Dirty: 36 kB Writeback: 0 kB AnonPages: 148052 kB Mapped: 12756 kB Slab: 1582512 kB SReclaimable: 134348 kB SUnreclaim:1448164 kB PageTables: 4500 kB NFS_Unstable:0 kB Bounce: 0 kB CommitLimit: 6884900 kB Committed_AS: 689956 kB VmallocTotal: 34359738367 kB VmallocUsed:265628 kB VmallocChunk: 34359472059 kB # /usr/src/linux/Documentation/vm/slabinfo Name Objects ObjsizeSpace Slabs/Part/Cpu O/S O %Fr %Ef Flg :016 1478 1624.5K 6/3/1 256 0 50 96 * :024 170 24 4.0K 1/0/1 170 0 0 99 * :032 1339 3245.0K 11/2/1 128 0 18 95 * :040 102 40 4.0K 1/0/1 102 0 0 99 * :064 5937 64 413.6K 101/15/1 64 0 14 91 * :07256 72 4.0K 1/0/1 56 0 0 98 * :088 6946 88 618.4K151/0/1 46 0 0 98 * :096 23851 96 2.5M 616/144/1 42 0 23 90 * :128 730 128 114.6K 28/6/1 32 0 21 81 * :136 232 13636.8K 9/6/1 30 0 66 85 * :192 474 19298.3K 24/4/1 21 0 16 92 * :256 1385376 256 354.6M 86587/0/1 16 0 0 99 * :32012 304 4.0K 1/0/1 12 0 0 89 *A :384 359 384 180.2K44/23/1 10 0 52 76 *A :512 1384316 512 708.7M 173040/1/18 0 0 99 * :64072 61653.2K 13/5/16 0 38 83 *A :704 1870 696 1.3M170/0/1 11 1 0 93 *A :0001024 4271024 454.6K111/9/14 0 8 96 * :0001472 1501472 245.7K 30/0/15 1 0 89 * :00020481589912048 325.7M 39759/25/14 1 0 99 * :0004096514096 245.7K 30/9/12 1 30 85 * Acpi-State 51 80 4.0K 1/0/1 51 0 0 99 anon_vma 1032 1628.6K 7/5/1 170 0 71 57 bdev_cache 43 72036.8K 9/1/15 0 11 83 Aa blkdev_requests 42 28812.2K 3/0/1 14 0 0 98 buffer_head 59173 10411.1M2734/1690/1 39 0 61 54 a cfq_io_context 223 15240.9K 10/6/1 26 0 60 82 dentry 98641 19219.7M 4813/274/1 21 0 5 96 a ext3_inode_cache115690 68886.3M 10545/77/1 11 1 0 92 a file_lock_cache 23 168 4.0K 1/0/1 23 0 0 94 idr_layer_cache118 52869.6K 17/1/17 0 5 89 inode_cache 1365 528 798.7K195/0/17 0 0 90 a kmalloc-131072 1 131072 131.0K 1/0/11 5 0 100 
kmalloc-163848 16384 131.0K 8/0/11 2 0 100 kmalloc-327681 3276832.7K 1/0/11 3 0 100 kmalloc-8 1535 812.2K 3/1/1 512 0 33 99 kmalloc-819210
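One way to put a number on the meminfo dumps in this report is to compute how much of MemTotal sits in unreclaimable slab. The sketch below is an illustration only: demo_meminfo_kb and demo_percent are hypothetical helpers, and they parse a text buffer in /proc/meminfo format rather than opening the file, so the same code can be run against the figures pasted above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the kB value for `key` (for example "SUnreclaim") in
 * meminfo-style text, or -1 if no line starts with "key:".
 */
static long demo_meminfo_kb(const char *text, const char *key)
{
	size_t klen;
	const char *p;

	assert(text != NULL && key != NULL);
	klen = strlen(key);
	for (p = text; *p != '\0'; ) {
		/* Match only at the start of a line, followed by ':'. */
		if (strncmp(p, key, klen) == 0 && p[klen] == ':')
			return strtol(p + klen + 1, NULL, 10);
		p = strchr(p, '\n');
		if (p == NULL)
			break;
		p++;	/* step past the newline to the next line */
	}
	return -1;
}

/* Integer percentage of `part` in `whole`, or -1 if `whole` is not positive. */
static long demo_percent(long part, long whole)
{
	return whole > 0 ? part * 100 / whole : -1;
}
```

Applied to the FAILED-state figures, demo_percent(SUnreclaim, MemTotal) shows that more than half of RAM is held in unreclaimable slab, which fits the suspicion of a kernel-side leak; it does not by itself say which cache is responsible.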
Re: [PATCH] net: Add network namespace clone unshare support.
Andrew Morton [EMAIL PROTECTED] writes:

> On Thu, 27 Sep 2007 17:10:53 -0700 (PDT) David Miller [EMAIL PROTECTED] wrote:
>
> > > I will gladly do what I can, to help. Working against 3 trees development at the moment is a bit of a development challenge.
> >
> > Andrew has to work against 30 or so
>
> I wish! A remerge presently involves pulling and merging 73 git trees, 9 quilt trees and maybe 1,500 -mm patches.

Yep. There is a lot of chaos and keeping on top of it all is a pain, and nobody has it easy. Andrew probably wins the award for the biggest challenge.

My todo list pales in comparison. I only have 80+ patches in my queue that I need to review and then push upstream. 50 sysfs patches to review and get a handle on so hopefully we can get out of the sysfs quagmire. Plus I don't know how many little gotchas that need to be fixed with a new patch of their own.

It's coming together but it takes time.

David, Andrew, thanks, you both really are good upstream maintainers to work with.

Eric
[git patches] net driver fixes
And an e1000 id patch.

Please pull from 'upstream-linus' branch of
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git upstream-linus

to receive the following updates:

 drivers/net/e1000/e1000_ethtool.c |    1 +
 drivers/net/e1000/e1000_hw.c      |    1 +
 drivers/net/e1000/e1000_hw.h      |    1 +
 drivers/net/e1000/e1000_main.c    |    2 +
 drivers/net/sky2.c                |   53 +++--
 5 files changed, 44 insertions(+), 14 deletions(-)

Auke Kok (1):
      e1000: Add device IDs of blade version of the 82571 quad port

Stephen Hemminger (3):
      sky2: sky2 FE+ receive status workaround
      sky2: FE+ vlan workaround
      sky2: fix transmit state on resume

diff --git a/drivers/net/e1000/e1000_ethtool.c b/drivers/net/e1000/e1000_ethtool.c
index 4c3785c..9ecc3ad 100644
--- a/drivers/net/e1000/e1000_ethtool.c
+++ b/drivers/net/e1000/e1000_ethtool.c
@@ -1726,6 +1726,7 @@ static int e1000_wol_exclusion(struct e1000_adapter *adapter, struct ethtool_wol
 	case E1000_DEV_ID_82571EB_QUAD_COPPER:
 	case E1000_DEV_ID_82571EB_QUAD_FIBER:
 	case E1000_DEV_ID_82571EB_QUAD_COPPER_LOWPROFILE:
+	case E1000_DEV_ID_82571PT_QUAD_COPPER:
 	case E1000_DEV_ID_82546GB_QUAD_COPPER_KSP3:
 		/* quad port adapters only support WoL on port A */
 		if (!adapter->quad_port_a) {
diff --git a/drivers/net/e1000/e1000_hw.c b/drivers/net/e1000/e1000_hw.c
index ba120f7..8604adb 100644
--- a/drivers/net/e1000/e1000_hw.c
+++ b/drivers/net/e1000/e1000_hw.c
@@ -387,6 +387,7 @@ e1000_set_mac_type(struct e1000_hw *hw)
 	case E1000_DEV_ID_82571EB_SERDES_DUAL:
 	case E1000_DEV_ID_82571EB_SERDES_QUAD:
 	case E1000_DEV_ID_82571EB_QUAD_COPPER:
+	case E1000_DEV_ID_82571PT_QUAD_COPPER:
 	case E1000_DEV_ID_82571EB_QUAD_FIBER:
 	case E1000_DEV_ID_82571EB_QUAD_COPPER_LOWPROFILE:
 		hw->mac_type = e1000_82571;
diff --git a/drivers/net/e1000/e1000_hw.h b/drivers/net/e1000/e1000_hw.h
index fe87146..07f0ea7 100644
--- a/drivers/net/e1000/e1000_hw.h
+++ b/drivers/net/e1000/e1000_hw.h
@@ -475,6 +475,7 @@ int32_t e1000_check_phy_reset_block(struct e1000_hw *hw);
 #define E1000_DEV_ID_82571EB_FIBER                  0x105F
 #define E1000_DEV_ID_82571EB_SERDES                 0x1060
 #define E1000_DEV_ID_82571EB_QUAD_COPPER            0x10A4
+#define E1000_DEV_ID_82571PT_QUAD_COPPER            0x10D5
 #define E1000_DEV_ID_82571EB_QUAD_FIBER             0x10A5
 #define E1000_DEV_ID_82571EB_QUAD_COPPER_LOWPROFILE 0x10BC
 #define E1000_DEV_ID_82571EB_SERDES_DUAL            0x10D9
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 4a22595..e7c8951 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -108,6 +108,7 @@ static struct pci_device_id e1000_pci_tbl[] = {
 	INTEL_E1000_ETHERNET_DEVICE(0x10BC),
 	INTEL_E1000_ETHERNET_DEVICE(0x10C4),
 	INTEL_E1000_ETHERNET_DEVICE(0x10C5),
+	INTEL_E1000_ETHERNET_DEVICE(0x10D5),
 	INTEL_E1000_ETHERNET_DEVICE(0x10D9),
 	INTEL_E1000_ETHERNET_DEVICE(0x10DA),
 	/* required last entry */
@@ -1101,6 +1102,7 @@ e1000_probe(struct pci_dev *pdev,
 	case E1000_DEV_ID_82571EB_QUAD_COPPER:
 	case E1000_DEV_ID_82571EB_QUAD_FIBER:
 	case E1000_DEV_ID_82571EB_QUAD_COPPER_LOWPROFILE:
+	case E1000_DEV_ID_82571PT_QUAD_COPPER:
 		/* if quad port adapter, disable WoL on all but port A */
 		if (global_quad_port_a != 0)
 			adapter->eeprom_wol = 0;
diff --git a/drivers/net/sky2.c b/drivers/net/sky2.c
index 0792031..162489b 100644
--- a/drivers/net/sky2.c
+++ b/drivers/net/sky2.c
@@ -910,6 +910,20 @@ static inline struct sky2_tx_le *get_tx_le(struct sky2_port *sky2)
 	return le;
 }

+static void tx_init(struct sky2_port *sky2)
+{
+	struct sky2_tx_le *le;
+
+	sky2->tx_prod = sky2->tx_cons = 0;
+	sky2->tx_tcpsum = 0;
+	sky2->tx_last_mss = 0;
+
+	le = get_tx_le(sky2);
+	le->addr = 0;
+	le->opcode = OP_ADDR64 | HW_OWNER;
+	sky2->tx_addr64 = 0;
+}
+
 static inline struct tx_ring_info *tx_le_re(struct sky2_port *sky2,
 					    struct sky2_tx_le *le)
 {
@@ -1320,7 +1334,8 @@ static int sky2_up(struct net_device *dev)
 				GFP_KERNEL);
 	if (!sky2->tx_ring)
 		goto err_out;
-	sky2->tx_prod = sky2->tx_cons = 0;
+
+	tx_init(sky2);

 	sky2->rx_le = pci_alloc_consistent(hw->pdev, RX_LE_BYTES,
 					   &sky2->rx_le_map);
@@ -2148,6 +2163,18 @@ static struct sk_buff *sky2_receive(struct net_device *dev,
 	sky2->rx_next = (sky2->rx_next + 1) % sky2->rx_pending;
 	prefetch(sky2->rx_ring + sky2->rx_next);
+	if (length < ETH_ZLEN || length > sky2->rx_data_size)
+		goto len_error;
+
+	/* This chip has hardware problems that generates bogus status.
+	 * So do only marginal checking and expect
Re: [PATCH] net: Add network namespace clone unshare support.
From: [EMAIL PROTECTED] (Eric W. Biederman)
Date: Thu, 27 Sep 2007 21:28:45 -0600

> David, Andrew, thanks. You both really are good upstream maintainers
> to work with.

Just keep the coffee flowing :-)
Re: [PATCH] Clean up redundant PHY write line for ULi526x Ethernet driver
On Thu, Sep 27, 2007 at 11:36:58PM -0400, Jeff Garzik wrote:
> Zang Roy-r61911 wrote:
> > From: Roy Zang [EMAIL PROTECTED]
> >
> > Clean up redundant PHY write line for ULi526x Ethernet Driver.
> >
> > Signed-off-by: Roy Zang [EMAIL PROTECTED]
> > ---
> >  drivers/net/tulip/uli526x.c |    1 -
> >  1 files changed, 0 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/net/tulip/uli526x.c b/drivers/net/tulip/uli526x.c
> > index ca2548e..53a8e65 100644
> > --- a/drivers/net/tulip/uli526x.c
> > +++ b/drivers/net/tulip/uli526x.c
> > @@ -1512,7 +1512,6 @@ static void uli526x_process_mode(struct uli526x_board_info *db)
> >  		case ULI526X_100MFD: phy_reg = 0x2100; break;
> >  		}
> >  		phy_write(db->ioaddr, db->phy_addr, 0, phy_reg, db->chip_id);
> > -		phy_write(db->ioaddr, db->phy_addr, 0, phy_reg, db->chip_id);
>
> Kyle and Grant, I'll queue this up, unless ya'll object...

please do, I've already Ack'd it for akpm's tree when he sent out the
initial cc.

thanks,
grant

> 	Jeff
Re: e1000 tcp checksum incorrect, x86 64b
Jon Smirl [EMAIL PROTECTED] wrote:
> App is writing seven bytes to the socket. Socket write timeout expires
> and the seven bytes are sent. The checksum is not getting inserted
> into the packet. It is set to a constant 0x8389 instead of the right
> value.
>
> App is gmpc 0.15.4.95, Revision: 6794
>
> Attached Wireshark packet trace show the problem. e1000 is 192.168.1.4
>
> 64bit, Q6600. Dell Dimension 9200

Wireshark is broken. It needs to know TP_STATUS_CSUMNOTREADY means
that the checksum is partial and will only be completed when the
hardware sends the packet out.

Alternatively disable checksum offload with ethtool.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
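
The check Herbert describes can be sketched roughly as below. This is only
an illustration, assuming a capture tool that reads frames from a packet
socket and has the per-frame status word at hand. classify_csum(),
csum_verdict_name() and the csum_verdict values are hypothetical names made
up for the example; only TP_STATUS_CSUMNOTREADY is real, and its value is
repeated from <linux/if_packet.h> so the sketch builds without kernel
headers. It is not Wireshark's or libpcap's actual code.

```c
#include <assert.h>

/*
 * Same value as in <linux/if_packet.h>; defined here only as a fallback
 * so the sketch compiles on its own.
 */
#ifndef TP_STATUS_CSUMNOTREADY
#define TP_STATUS_CSUMNOTREADY (1 << 3)
#endif

/* Hypothetical verdicts a capture tool could show for a TCP/UDP checksum. */
enum csum_verdict {
	CSUM_OK,
	CSUM_BAD,
	CSUM_OFFLOADED	/* partial checksum, the NIC completes it on transmit */
};

/*
 * Hypothetical helper.
 * tp_status: status word the kernel attached to the captured frame.
 * matches:   non-zero if the checksum found in the captured bytes equals
 *            the one the tool computed itself.
 */
static enum csum_verdict classify_csum(unsigned long tp_status, int matches)
{
	/* Offloaded frames carry only a partial sum: do not call them bad. */
	if (tp_status & TP_STATUS_CSUMNOTREADY)
		return CSUM_OFFLOADED;
	return matches ? CSUM_OK : CSUM_BAD;
}

/* Human-readable label for a verdict (names are assumptions as well). */
static const char *csum_verdict_name(enum csum_verdict v)
{
	switch (v) {
	case CSUM_OK:
		return "ok";
	case CSUM_BAD:
		return "bad";
	case CSUM_OFFLOADED:
		return "offloaded";
	}
	assert(!"unknown checksum verdict");
	return "unknown";
}
```

With the flag set, a tool written this way would label the locally
generated frame as "offloaded" instead of reporting a checksum error,
whatever value happens to sit in the checksum field. The ethtool route is
typically something like `ethtool -K <iface> tx off`, after which the stack
computes the checksum in software before the capture point.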