[RTNETLINK]: Fix use of wrong skb in do_getlink()
[RTNETLINK]: Fix use of wrong skb in do_getlink() skb is the netlink query, nskb is the reply message. Signed-off-by: Patrick McHardy <[EMAIL PROTECTED]> --- commit 7d6fc0796ea33cd86cf20ccd51c351538903dcd9 tree 15bd1793a030f471f4a70762b079cf187bca8207 parent 587b522ced8a8ca3bddc99b05ec5118f11cfe164 author Patrick McHardy <[EMAIL PROTECTED]> Thu, 12 Oct 2006 08:38:03 +0200 committer Patrick McHardy <[EMAIL PROTECTED]> Thu, 12 Oct 2006 08:38:03 +0200 net/core/rtnetlink.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 221e403..02f3c79 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -602,7 +602,7 @@ #endif /* CONFIG_NET_WIRELESS_RTNETLINK goto errout; } - err = rtnl_unicast(skb, NETLINK_CB(skb).pid); + err = rtnl_unicast(nskb, NETLINK_CB(skb).pid); errout: kfree(iw_buf); dev_put(dev);
Re: [RFC] Question about potential problem in net/ipv4/route.c
David Miller a écrit : From: Eric Dumazet <[EMAIL PROTECTED]> Date: Thu, 12 Oct 2006 07:48:20 +0200 Not on my gcc here (gcc version 3.4.4) : It wont zeros out the padding bytes Patrick just proved this too :) Well, on this machine I have these oprofile numbers : : /* rt_intern_hash total: 993464 0.3619 */ 31751 0.0116 :803e8d26: repz cmpsb %es:(%rdi),%ds:(%rsi) 433438 0.1579 :803e8d28: jne803e8f80 Indeed, numbers talk bullshit walks :) How about something like this as a start? diff --git a/net/ipv4/route.c b/net/ipv4/route.c index c41ddba..4d4b1bd 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -566,9 +566,13 @@ static inline u32 rt_score(struct rtable static inline int compare_keys(struct flowi *fl1, struct flowi *fl2) { - return memcmp(&fl1->nl_u.ip4_u, &fl2->nl_u.ip4_u, sizeof(fl1->nl_u.ip4_u)) == 0 && - fl1->oif == fl2->oif && - fl1->iif == fl2->iif; + return ((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) | + (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) | + (fl1->nl_u.ip4_u.fwmark ^ fl2->nl_u.ip4_u.fwmark) | + (*(u16 *)&fl1->nl_u.ip4_u.tos ^ +*(u16 *)&fl2->nl_u.ip4_u.tos) | + (fl1->oif ^ fl2->oif) | + (fl1->iif ^ fl2->iif)) == 0; } #ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED Yes it seems a nice start :) How about avoiding the fwmark thing if !CONFIG_IP_ROUTE_FWMARK > + return ((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) | > + (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) | #ifdef CONFIG_IP_ROUTE_FWMARK > + (fl1->nl_u.ip4_u.fwmark ^ fl2->nl_u.ip4_u.fwmark) | #endif > + (*(u16 *)&fl1->nl_u.ip4_u.tos ^ > + *(u16 *)&fl2->nl_u.ip4_u.tos) | > + (fl1->oif ^ fl2->oif) | > + (fl1->iif ^ fl2->iif)) == 0; Eric - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Question about potential problem in net/ipv4/route.c
From: Patrick McHardy <[EMAIL PROTECTED]> Date: Thu, 12 Oct 2006 08:10:43 +0200 > David Miller wrote: > > Indeed, numbers talk bullshit walks :) > > :) > > > How about something like this as a start? > > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > > The same problem is also present in dn_route.c (3 uninitialized > bytes in dn_u). This should take care of that case too: diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c index dd0761e..33ccc56 100644 --- a/net/decnet/dn_route.c +++ b/net/decnet/dn_route.c @@ -267,9 +267,12 @@ static void dn_dst_link_failure(struct s static inline int compare_keys(struct flowi *fl1, struct flowi *fl2) { - return memcmp(&fl1->nl_u.dn_u, &fl2->nl_u.dn_u, sizeof(fl1->nl_u.dn_u)) == 0 && - fl1->oif == fl2->oif && - fl1->iif == fl2->iif; + return ((fl1->nl_u.dn_u.daddr ^ fl2->nl_u.dn_u.daddr) | + (fl1->nl_u.dn_u.saddr ^ fl2->nl_u.dn_u.saddr) | + (fl1->nl_u.dn_u.fwmark ^ fl2->nl_u.dn_u.fwmark) | + (fl1->nl_u.dn_u.scope ^ fl2->nl_u.dn_u.scope) | + (fl1->oif ^ fl2->oif) | + (fl1->iif ^ fl2->iif)) == 0; } static int dn_insert_route(struct dn_route *rt, unsigned hash, struct dn_route **rp) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index c41ddba..4d4b1bd 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -566,9 +566,13 @@ static inline u32 rt_score(struct rtable static inline int compare_keys(struct flowi *fl1, struct flowi *fl2) { - return memcmp(&fl1->nl_u.ip4_u, &fl2->nl_u.ip4_u, sizeof(fl1->nl_u.ip4_u)) == 0 && - fl1->oif == fl2->oif && - fl1->iif == fl2->iif; + return ((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) | + (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) | + (fl1->nl_u.ip4_u.fwmark ^ fl2->nl_u.ip4_u.fwmark) | + (*(u16 *)&fl1->nl_u.ip4_u.tos ^ +*(u16 *)&fl2->nl_u.ip4_u.tos) | + (fl1->oif ^ fl2->oif) | + (fl1->iif ^ fl2->iif)) == 0; } #ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Question about potential problem in net/ipv4/route.c
David Miller wrote: > Indeed, numbers talk bullshit walks :) :) > How about something like this as a start? > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c The same problem is also present in dn_route.c (3 uninitialized bytes in dn_u). - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Question about potential problem in net/ipv4/route.c
From: Eric Dumazet <[EMAIL PROTECTED]> Date: Thu, 12 Oct 2006 07:48:20 +0200 > Not on my gcc here (gcc version 3.4.4) : It wont zeros out the padding bytes Patrick just proved this too :) > Well, on this machine I have these oprofile numbers : > > : /* rt_intern_hash total: 993464 0.3619 */ > > 31751 0.0116 :803e8d26: repz cmpsb %es:(%rdi),%ds:(%rsi) > 433438 0.1579 :803e8d28: jne803e8f80 > Indeed, numbers talk bullshit walks :) How about something like this as a start? diff --git a/net/ipv4/route.c b/net/ipv4/route.c index c41ddba..4d4b1bd 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -566,9 +566,13 @@ static inline u32 rt_score(struct rtable static inline int compare_keys(struct flowi *fl1, struct flowi *fl2) { - return memcmp(&fl1->nl_u.ip4_u, &fl2->nl_u.ip4_u, sizeof(fl1->nl_u.ip4_u)) == 0 && - fl1->oif == fl2->oif && - fl1->iif == fl2->iif; + return ((fl1->nl_u.ip4_u.daddr ^ fl2->nl_u.ip4_u.daddr) | + (fl1->nl_u.ip4_u.saddr ^ fl2->nl_u.ip4_u.saddr) | + (fl1->nl_u.ip4_u.fwmark ^ fl2->nl_u.ip4_u.fwmark) | + (*(u16 *)&fl1->nl_u.ip4_u.tos ^ +*(u16 *)&fl2->nl_u.ip4_u.tos) | + (fl1->oif ^ fl2->oif) | + (fl1->iif ^ fl2->iif)) == 0; } #ifdef CONFIG_IP_ROUTE_MULTIPATH_CACHED - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Question about potential problem in net/ipv4/route.c
From: Patrick McHardy <[EMAIL PROTECTED]> Date: Thu, 12 Oct 2006 07:31:12 +0200 > This program demonstrates the effect, it doesn't output the expected > "1 2" but "1 4294967042" on my x86_64 (gcc-Version 4.1.2 20060901 > (prerelease) (Debian 4.1.1-13)). The initialization doesn't touch > the padding bytes: > > 0x00400494 :movl $0x1,0xfff0(%rbp) > 0x0040049b : movb $0x2,0xfff4(%rbp) Crap, ok we need to fix this. I even looked at the assembler for a little test foo.c bit of code... must have misread the sparc64 asm :) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Question about potential problem in net/ipv4/route.c
David Miller a écrit : From: Eric Dumazet <[EMAIL PROTECTED]> Date: Wed, 11 Oct 2006 15:11:18 +0200 Using memcmp(ptr1, ptr2, sizeof(SOMEFIELD)) is dangerous because sizeof(SOMEFIELD) can be larger than the underlying object, because of alignment constraints. In this case, sizeof(fl1->nl_u.ip4_u) is 16, while fl1->nl_u.ip4_u is : struct { __u32 daddr; __u32 saddr; __u32 fwmark; __u8tos; __u8scope; } ip4_u; So 14 bytes are really initialized, and 2 padding bytes might contain random values... We always explicitly initialize the flows, and even for local stack assignment based initialization, gcc zeros out the padding bytes always. For non-stack-local cases we do explicit memset()'s over the entire object. So I think while not perfect, we're very much safe here. Not on my gcc here (gcc version 3.4.4) : It wont zeros out the padding bytes # cat bug.c struct s1 { long d; char c; }; void bar() { struct s1 s = { .d = 123, .c = 'a'}; foo(&s, sizeof(s)); } # gcc -O2 -S bug.c # more bug.s .globl bar .type bar, @function bar: .LFB2: subq$24, %rsp .LCFI0: movl$16, %esi xorl%eax, %eax movq%rsp, %rdi movq$123, (%rsp) movb$97, 8(%rsp) callfoo addq$24, %rsp ret So 9(%rsp) -> 15(%rsp) are not initialized Same on more recent gcc (4.1.1) fast comparison, we should do some clever long/int XOR operations to avoid many test/branches, like the optim we did in compare_ether_addr()) As shown in profiles, "repz cmpsb" is really a dog... (and its use of esi/edi/ecx registers a high pressure for the compiler/optimizer) Yes, for the fast comparisons it is almost certainly worth it to do something saner in compare_keys(). But looking at this further, compare_keys() is only used in hotpath situations where we are optimizing for a miss, such as during hash insert. The optimization therefore might be less justified as a result. Well, on this machine I have these oprofile numbers : : /* rt_intern_hash total: 993464 0.3619 */ 31751 0.0116 :803e8d26: repz cmpsb %es:(%rdi),%ds:(%rsi) 433438 0.1579 :803e8d28: jne803e8f80 Eric - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Question about potential problem in net/ipv4/route.c
David Miller wrote: > We always explicitly initialize the flows, and even for local stack > assignment based initialization, gcc zeros out the padding bytes > always. I thought so too until I added the iptables compat functions recently and noticed uninitialized padding of on-stack structures, which confused iptables since it also uses memcmp. This program demonstrates the effect, it doesn't output the expected "1 2" but "1 4294967042" on my x86_64 (gcc-Version 4.1.2 20060901 (prerelease) (Debian 4.1.1-13)). The initialization doesn't touch the padding bytes: 0x00400494 :movl $0x1,0xfff0(%rbp) 0x0040049b : movb $0x2,0xfff4(%rbp) #include struct x1 { unsigned int x; char y; }; struct x2 { unsigned int x; unsigned int y; }; void pollute(void) { struct x2 x = { .x = ~0, .y = ~0, }; } void test(void) { struct x1 x1 = { .x = 1, .y = 2, }; struct x2 *x2 = (struct x2 *)&x1; printf("%u %u\n", x2->x, x2->y); } int main(int argc, char **argv) { pollute(); test(); return 0; }
Re: [RFC] Question about potential problem in net/ipv4/route.c
From: Eric Dumazet <[EMAIL PROTECTED]> Date: Wed, 11 Oct 2006 15:11:18 +0200 > Using memcmp(ptr1, ptr2, sizeof(SOMEFIELD)) is dangerous because > sizeof(SOMEFIELD) can be larger than the underlying object, because of > alignment constraints. > > In this case, sizeof(fl1->nl_u.ip4_u) is 16, while fl1->nl_u.ip4_u is : > > struct { > __u32 daddr; > __u32 saddr; > __u32 fwmark; > __u8tos; > __u8scope; > } ip4_u; > > So 14 bytes are really initialized, and 2 padding bytes might contain random > values... We always explicitly initialize the flows, and even for local stack assignment based initialization, gcc zeros out the padding bytes always. For non-stack-local cases we do explicit memset()'s over the entire object. So I think while not perfect, we're very much safe here. > fast comparison, we should do some clever long/int XOR operations to avoid > many test/branches, like the optim we did in compare_ether_addr()) > > As shown in profiles, "repz cmpsb" is really a dog... (and its use of > esi/edi/ecx registers a high pressure for the compiler/optimizer) Yes, for the fast comparisons it is almost certainly worth it to do something saner in compare_keys(). But looking at this further, compare_keys() is only used in hotpath situations where we are optimizing for a miss, such as during hash insert. The optimization therefore might be less justified as a result. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: What is current sundance.c status
On Thu, 12 Oct 2006 10:29:37 +0800 "Jesse Huang" <[EMAIL PROTECTED]> wrote: > Would you tell me what is the current IP100A status? Should I re-generate > patches again. Would it put into kernel or not? I'm sitting on a copy of them. I didn't send them to Jeff last time because: sundance-remove-txstartthresh-and-rxearlythresh.patch There's no description of what this patent issue is. sundance-fix-tx-pause-bug-reset_tx-intr_handler.patch There's no description of the bug which got fixed, nor how this patch fixes it. sundance-change-phy-address-search-from-phy=1-to-phy=0.patch There's a (small) possibility that this will break on hardware which _doesn't_ have a phy at address 0. sundance-correct-initial-and-close-hardware-step.patch There's no real description of the bug which is being fixed, nor of how this patch fixes it. sundance-solve-host-error-problem-in-low-performance-embedded.patch No description of what the "host error problem" is, nor of what causes it, nor of how this patch fixes it. So generally these patches are a bit worrying, and it is hard to gauge what their risk factor is. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] NET/atm: handle sysfs errors
Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> --- net/atm/atm_sysfs.c | 15 --- diff --git a/net/atm/atm_sysfs.c b/net/atm/atm_sysfs.c index c0a4ae2..62f6ed1 100644 --- a/net/atm/atm_sysfs.c +++ b/net/atm/atm_sysfs.c @@ -141,7 +141,7 @@ static struct class atm_class = { int atm_register_sysfs(struct atm_dev *adev) { struct class_device *cdev = &adev->class_dev; - int i, err; + int i, j, err; cdev->class = &atm_class; class_set_devdata(cdev, adev); @@ -151,10 +151,19 @@ int atm_register_sysfs(struct atm_dev *a if (err < 0) return err; - for (i = 0; atm_attrs[i]; i++) - class_device_create_file(cdev, atm_attrs[i]); + for (i = 0; atm_attrs[i]; i++) { + err = class_device_create_file(cdev, atm_attrs[i]); + if (err) + goto err_out; + } return 0; + +err_out: + for (j = 0; j < i; j++) + class_device_remove_file(cdev, atm_attrs[j]); + class_device_del(cdev); + return err; } void atm_unregister_sysfs(struct atm_dev *adev) - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] NET/bluetooth: handle sysfs errors
Signed-off-by: Jeff Garzik <[EMAIL PROTECTED]> --- net/bluetooth/l2cap.c | 19 ++- net/bluetooth/rfcomm/core.c | 21 ++--- net/bluetooth/rfcomm/sock.c |7 ++- net/bluetooth/sco.c | 17 - 4 files changed, 50 insertions(+), 14 deletions(-) diff --git a/net/bluetooth/l2cap.c b/net/bluetooth/l2cap.c index d56f60b..7dda7ce 100644 --- a/net/bluetooth/l2cap.c +++ b/net/bluetooth/l2cap.c @@ -2206,24 +2206,33 @@ static int __init l2cap_init(void) err = bt_sock_register(BTPROTO_L2CAP, &l2cap_sock_family_ops); if (err < 0) { BT_ERR("L2CAP socket registration failed"); - goto error; + goto err_proto; } err = hci_register_proto(&l2cap_hci_proto); if (err < 0) { BT_ERR("L2CAP protocol registration failed"); - bt_sock_unregister(BTPROTO_L2CAP); - goto error; + goto err_sock; } - class_create_file(bt_class, &class_attr_l2cap); + err = class_create_file(bt_class, &class_attr_l2cap); + if (err) { + BT_ERR("L2CAP class registration failed"); + goto err_proto2; + } BT_INFO("L2CAP ver %s", VERSION); BT_INFO("L2CAP socket layer initialized"); return 0; -error: +err_proto2: + if (hci_unregister_proto(&l2cap_hci_proto) < 0) + BT_ERR("L2CAP protocol unregistration failed"); +err_sock: + if (bt_sock_unregister(BTPROTO_L2CAP) < 0) + BT_ERR("L2CAP socket unregistration failed"); +err_proto: proto_unregister(&l2cap_proto); return err; } diff --git a/net/bluetooth/rfcomm/core.c b/net/bluetooth/rfcomm/core.c index 468df3b..102d668 100644 --- a/net/bluetooth/rfcomm/core.c +++ b/net/bluetooth/rfcomm/core.c @@ -2052,13 +2052,21 @@ static CLASS_ATTR(rfcomm_dlc, S_IRUGO, r /* Initialization */ static int __init rfcomm_init(void) { + int rc; + l2cap_load(); - hci_register_cb(&rfcomm_cb); + rc = hci_register_cb(&rfcomm_cb); + if (rc) + goto err; - kernel_thread(rfcomm_run, NULL, CLONE_KERNEL); + rc = class_create_file(bt_class, &class_attr_rfcomm_dlc); + if (rc) + goto err_hcireg; - class_create_file(bt_class, &class_attr_rfcomm_dlc); + rc = kernel_thread(rfcomm_run, NULL, CLONE_KERNEL); + if (rc < 0) + goto err_file; rfcomm_init_sockets(); @@ -2069,6 +2077,13 @@ #endif BT_INFO("RFCOMM ver %s", VERSION); return 0; + +err_file: + class_remove_file(bt_class, &class_attr_rfcomm_dlc); +err_hcireg: + hci_unregister_cb(&rfcomm_cb); +err: + return rc; } static void __exit rfcomm_exit(void) diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c index 220fee0..ff9b3f1 100644 --- a/net/bluetooth/rfcomm/sock.c +++ b/net/bluetooth/rfcomm/sock.c @@ -944,12 +944,17 @@ int __init rfcomm_init_sockets(void) if (err < 0) goto error; - class_create_file(bt_class, &class_attr_rfcomm); + err = class_create_file(bt_class, &class_attr_rfcomm); + if (err) + goto error_sock; BT_INFO("RFCOMM socket layer initialized"); return 0; +error_sock: + if (bt_sock_unregister(BTPROTO_RFCOMM) < 0) + BT_ERR("RFCOMM socket layer unregistration failed"); error: BT_ERR("RFCOMM socket layer registration failed"); proto_unregister(&rfcomm_proto); diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c index 7714a2e..f28392a 100644 --- a/net/bluetooth/sco.c +++ b/net/bluetooth/sco.c @@ -957,24 +957,31 @@ static int __init sco_init(void) err = bt_sock_register(BTPROTO_SCO, &sco_sock_family_ops); if (err < 0) { BT_ERR("SCO socket registration failed"); - goto error; + goto err_proto; } err = hci_register_proto(&sco_hci_proto); if (err < 0) { BT_ERR("SCO protocol registration failed"); - bt_sock_unregister(BTPROTO_SCO); - goto error; + goto err_sock; } - class_create_file(bt_class, &class_attr_sco); + err = class_create_file(bt_class, &class_attr_sco); + if (err) + goto err_proto2; BT_INFO("SCO (Voice Link) ver %s", VERSION); BT_INFO("SCO socket layer initialized"); return 0; -error: +err_proto2: + if (hci_unregister_proto(&sco_hci_proto) < 0) + BT_ERR("SCO protocol unregistration failed"); +err_sock: + if (bt_sock_unregister(BTPROTO_SCO) < 0) + BT_ERR("SCO socket unregistration failed"); +err_proto: proto_unregister(&sco_proto); return err; } - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PR
Re: [PATCH] Customizable TCP backoff patch
In article <[EMAIL PROTECTED]> (at Wed, 11 Oct 2006 17:51:24 -0700), Ben Woodard <[EMAIL PROTECTED]> says: > diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c > linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c > --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.0 > -0700 > +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-10-11 > 16:00:37.0 -0700 > @@ -128,6 +128,8 @@ > return ret; > } > > +static unsigned long tcp_rto_min=0; > +static unsigned long tcp_rto_max=65535; > > ctl_table ipv4_table[] = { > { > @@ -697,6 +699,26 @@ > .mode = 0644, > .proc_handler = &proc_dointvec > }, > + { > + .ctl_name = NET_TCP_RTO_MAX, > + .procname = "tcp_rto_max", > + .data = &sysctl_tcp_rto_max, > + .maxlen = sizeof(unsigned), sizeof(unsigned long) > + .mode = 0644, > + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, > + .extra1 = &tcp_rto_min_constant, > + .extra2 = &tcp_rto_max_constant, > + }, > + { > + .ctl_name = NET_TCP_RTO_INIT, > + .procname = "tcp_rto_init", > + .data = &sysctl_tcp_rto_init, > + .maxlen = sizeof(unsigned), sizeof(unsigned long) > + .mode = 0644, > + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, > + .extra1 = &tcp_rto_min_constant, > + .extra2 = &tcp_rto_max_constant, > + }, > { .ctl_name = 0 } > }; > > diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c > --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.0 -0700 > +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-11 16:00:37.0 -0700 > @@ -1764,6 +1764,8 @@ > return err; > } > > +#define TCP_BACKOFF_MAXVAL 65535 > + > /* > * Socket option code for TCP. > */ > @@ -1939,6 +1941,20 @@ > } > break; > > + case TCP_BACKOFF_MAX: > + if (val < 1 || val > TCP_BACKOFF_MAXVAL) > + err = -EINVAL; > + else > + tp->rto_max = val; > + break; > + > + case TCP_BACKOFF_INIT: > + if (val < 1 || val > TCP_BACKOFF_MAXVAL) > + err = -EINVAL; > + else > + tp->rto_init = val; > + break; > + > default: > err = -ENOPROTOOPT; > break; > @@ -2110,6 +2126,12 @@ > if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) > return -EFAULT; > return 0; > + case TCP_BACKOFF_MAX: > + val = jiffies_to_msecs(tcp_rto_max(tp)); > + break; tp->rto_max > + case TCP_BACKOFF_INIT: > + val = jiffies_to_msecs(tcp_rto_init(tp)); > + break; > default: tp->rto_init --yoshfuji - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Customizable TCP backoff patch
Here we go again. The changes are as follows: 1) the spaces are gone and now it is tabs. 2) used msecs_to_jiffies() and jiffies_to_msecs(). I'm much happier with that. I didn't know those functions existed. 3) changed over to proc_doulongvec_ms_jiffies_minmax(). Last night's test compile failed because I was out of disk. I figured that the code I changed had compiled. I guess not. I really didn't understand Yoshifuji Hideaki's comment. (sorry) -ben Vlad Yasevich wrote: Ben Woodard wrote: diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h --- linux-2.6.18/include/net/tcp.h 2006-09-19 20:42:06.0 -0700 +++ linux-2.6.18.new/include/net/tcp.h 2006-10-10 18:42:00.0 -0700 @@ -227,11 +227,23 @@ extern int sysctl_tcp_base_mss; extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; +extern unsigned long sysctl_tcp_rto_max; +extern unsigned long sysctl_tcp_rto_init; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; extern int tcp_memory_pressure; +static inline unsigned long tcp_rto_max(struct tcp_sock *tp) +{ +return tp->rto_max ? tp->rto_max*HZ/1000 : sysctl_tcp_rto_max; ^^^ That should probably be msecs_to_jiffies(tp->rto_max) +} + +static inline unsigned long tcp_rto_init(struct tcp_sock *tp) +{ +return tp->rto_init ? tp->rto_init*HZ/1000 : sysctl_tcp_rto_init; Ditto. +} + /* * The next routines deal with comparing 32 bit unsigned ints * and worry about wraparound (automatic with unsigned arithmetic). diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.0 -0700 +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-10-10 16:32:08.0 -0700 @@ -128,6 +128,8 @@ return ret; } +static unsigned long tcp_rto_min=0; +static unsigned long tcp_rto_max=65535; ctl_table ipv4_table[] = { { @@ -697,6 +699,26 @@ .mode = 0644, .proc_handler = &proc_dointvec }, + { + .ctl_name = NET_TCP_RTO_MAX, + .procname = "tcp_rto_max", + .data = &sysctl_tcp_rto_max, + .maxlen = sizeof(unsigned), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies, + .extra1 = &tcp_rto_min_constant, + .extra2 = &tcp_rto_max_constant, + }, + { + .ctl_name = NET_TCP_RTO_INIT, + .procname = "tcp_rto_init", + .data = &sysctl_tcp_rto_init, + .maxlen = sizeof(unsigned), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies, + .extra1 = &tcp_rto_min_constant, + .extra2 = &tcp_rto_max_constant, + }, { .ctl_name = 0 } }; Try as I might, I can't find proc_doulongvec_ms_jiffies() anywhere. I think you meant proc_doulongvec_ms_jiffies_minmax()? Also, this function has a bug in that it doesn't do corrections for potentially different HZ values. Look at the msecs_to_jiffies... When I've used proc_doulongvec_ms_jiffies_minmax() before, I would seen rather interesting truncation errors such that sysctl input didn't match sysctl output. diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.0 -0700 +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-10 18:37:40.0 -0700 @@ -1764,6 +1764,8 @@ return err; } +#define TCP_BACKOFF_MAXVAL 65535 + /* * Socket option code for TCP. */ @@ -1939,6 +1941,21 @@ } break; +case TCP_BACKOFF_MAX: +if (val < 1 || val > TCP_BACKOFF_MAXVAL) +err = -EINVAL; +else +tp->rto_max = val; +break; + +case TCP_BACKOFF_INIT: +if (val < 1 || val > TCP_BACKOFF_MAXVAL) +err = -EINVAL; +else +tp->rto_init = val; +break; + + default: err = -ENOPROTOOPT; break; @@ -2110,6 +2127,12 @@ if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) return -EFAULT; return 0; +case TCP_BACKOFF_MAX: + val = tcp_rto_max(tp)*1000/HZ; +break; +case TCP_BACKOFF_INIT: + val = tcp_rto_init(tp)*1000/HZ; The two divisions above introduce truncation errors such the input doesn't match the output. Better to use jiffies_t
Re: [PATCH] sotftmac: fix a slab corruption in WEP restricted key association
Laurent Riffard wrote: Fix a slab corruption in ieee80211softmac_auth(). The size of a buffer was miscomputed. see http://bugzilla.kernel.org/show_bug.cgi?id=7245 Acked-by: Daniel Drake <[EMAIL PROTECTED]>" Signed-off-by: Laurent Riffard <[EMAIL PROTECTED]> Apparently the CC list didn't work out. Adding Johannes and netdev. net/ieee80211/softmac/ieee80211softmac_io.c |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6-mm/net/ieee80211/softmac/ieee80211softmac_io.c === --- linux-2.6-mm.orig/net/ieee80211/softmac/ieee80211softmac_io.c +++ linux-2.6-mm/net/ieee80211/softmac/ieee80211softmac_io.c @@ -304,7 +304,7 @@ ieee80211softmac_auth(struct ieee80211_a 2 +/* Auth Transaction Seq */ 2 +/* Status Code */ /* Challenge Text IE */ -is_shared_response ? 0 : 1 + 1 + net->challenge_len +(is_shared_response ? 1 + 1 + net->challenge_len : 0) ); if (unlikely((*pkt) == NULL)) return 0; - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] NetLabel: only deref the CIPSOv4 standard map fields when using standard mapping
From: Paul Moore <[EMAIL PROTECTED]> Fix several places in the CIPSO code where it was dereferencing fields which did not have valid pointers by moving those pointer dereferences into code blocks where the pointers are valid. Signed-off-by: Paul Moore <[EMAIL PROTECTED]> --- net/ipv4/cipso_ipv4.c | 18 -- 1 files changed, 12 insertions(+), 6 deletions(-) Index: net-2.6_bugfix_2/net/ipv4/cipso_ipv4.c === --- net-2.6_bugfix_2.orig/net/ipv4/cipso_ipv4.c +++ net-2.6_bugfix_2/net/ipv4/cipso_ipv4.c @@ -773,13 +773,15 @@ static int cipso_v4_map_cat_rbm_valid(co { int cat = -1; u32 bitmap_len_bits = bitmap_len * 8; - u32 cipso_cat_size = doi_def->map.std->cat.cipso_size; - u32 *cipso_array = doi_def->map.std->cat.cipso; + u32 cipso_cat_size; + u32 *cipso_array; switch (doi_def->type) { case CIPSO_V4_MAP_PASS: return 0; case CIPSO_V4_MAP_STD: + cipso_cat_size = doi_def->map.std->cat.cipso_size; + cipso_array = doi_def->map.std->cat.cipso; for (;;) { cat = cipso_v4_bitmap_walk(bitmap, bitmap_len_bits, @@ -825,8 +827,8 @@ static int cipso_v4_map_cat_rbm_hton(con u32 net_spot_max = 0; u32 host_clen_bits = host_cat_len * 8; u32 net_clen_bits = net_cat_len * 8; - u32 host_cat_size = doi_def->map.std->cat.local_size; - u32 *host_cat_array = doi_def->map.std->cat.local; + u32 host_cat_size; + u32 *host_cat_array; switch (doi_def->type) { case CIPSO_V4_MAP_PASS: @@ -838,6 +840,8 @@ static int cipso_v4_map_cat_rbm_hton(con memcpy(net_cat, host_cat, net_spot_max); return net_spot_max; case CIPSO_V4_MAP_STD: + host_cat_size = doi_def->map.std->cat.local_size; + host_cat_array = doi_def->map.std->cat.local; for (;;) { host_spot = cipso_v4_bitmap_walk(host_cat, host_clen_bits, @@ -893,8 +897,8 @@ static int cipso_v4_map_cat_rbm_ntoh(con int net_spot = -1; u32 net_clen_bits = net_cat_len * 8; u32 host_clen_bits = host_cat_len * 8; - u32 net_cat_size = doi_def->map.std->cat.cipso_size; - u32 *net_cat_array = doi_def->map.std->cat.cipso; + u32 net_cat_size; + u32 *net_cat_array; switch (doi_def->type) { case CIPSO_V4_MAP_PASS: @@ -903,6 +907,8 @@ static int cipso_v4_map_cat_rbm_ntoh(con memcpy(host_cat, net_cat, net_cat_len); return net_cat_len; case CIPSO_V4_MAP_STD: + net_cat_size = doi_def->map.std->cat.cipso_size; + net_cat_array = doi_def->map.std->cat.cipso; for (;;) { net_spot = cipso_v4_bitmap_walk(net_cat, net_clen_bits, -- paul moore linux security @ hp - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] NetLabel: better error handling involving mls_export_cat()
From: Paul Moore <[EMAIL PROTECTED]> Upon inspection it looked like the error handling for mls_export_cat() was rather poor. This patch addresses this by NULL'ing out kfree()'d pointers before returning and checking the return value of the function everywhere it is called. Signed-off-by: Paul Moore <[EMAIL PROTECTED]> --- security/selinux/ss/ebitmap.c |8 ++-- security/selinux/ss/mls.c | 17 ++--- security/selinux/ss/services.c | 18 ++ 3 files changed, 30 insertions(+), 13 deletions(-) Index: net-2.6_bugfix_2/security/selinux/ss/ebitmap.c === --- net-2.6_bugfix_2.orig/security/selinux/ss/ebitmap.c +++ net-2.6_bugfix_2/security/selinux/ss/ebitmap.c @@ -93,11 +93,15 @@ int ebitmap_export(const struct ebitmap size_t bitmap_byte; unsigned char bitmask; + if (src->highbit == 0) { + *dst = NULL; + *dst_len = 0; + return 0; + } + bitmap_len = src->highbit / 8; if (src->highbit % 7) bitmap_len += 1; - if (bitmap_len == 0) - return -EINVAL; bitmap = kzalloc((bitmap_len & ~(sizeof(MAPTYPE) - 1)) + sizeof(MAPTYPE), Index: net-2.6_bugfix_2/security/selinux/ss/mls.c === --- net-2.6_bugfix_2.orig/security/selinux/ss/mls.c +++ net-2.6_bugfix_2/security/selinux/ss/mls.c @@ -640,8 +640,13 @@ int mls_export_cat(const struct context { int rc = -EPERM; - if (!selinux_mls_enabled) + if (!selinux_mls_enabled) { + *low = NULL; + *low_len = 0; + *high = NULL; + *high_len = 0; return 0; + } if (low != NULL) { rc = ebitmap_export(&context->range.level[0].cat, @@ -661,10 +666,16 @@ int mls_export_cat(const struct context return 0; export_cat_failure: - if (low != NULL) + if (low != NULL) { kfree(*low); - if (high != NULL) + *low = NULL; + *low_len = 0; + } + if (high != NULL) { kfree(*high); + *high = NULL; + *high_len = 0; + } return rc; } Index: net-2.6_bugfix_2/security/selinux/ss/services.c === --- net-2.6_bugfix_2.orig/security/selinux/ss/services.c +++ net-2.6_bugfix_2/security/selinux/ss/services.c @@ -2399,31 +2399,33 @@ static int selinux_netlbl_socket_setsid( if (!ss_initialized) return 0; + netlbl_secattr_init(&secattr); + POLICY_RDLOCK; ctx = sidtab_search(&sidtab, sid); if (ctx == NULL) goto netlbl_socket_setsid_return; - netlbl_secattr_init(&secattr); secattr.domain = kstrdup(policydb.p_type_val_to_name[ctx->type - 1], GFP_ATOMIC); mls_export_lvl(ctx, &secattr.mls_lvl, NULL); secattr.mls_lvl_vld = 1; - mls_export_cat(ctx, - &secattr.mls_cat, - &secattr.mls_cat_len, - NULL, - NULL); + rc = mls_export_cat(ctx, + &secattr.mls_cat, + &secattr.mls_cat_len, + NULL, + NULL); + if (rc != 0) + goto netlbl_socket_setsid_return; rc = netlbl_socket_setattr(sock, &secattr); if (rc == 0) sksec->nlbl_state = NLBL_LABELED; - netlbl_secattr_destroy(&secattr); - netlbl_socket_setsid_return: POLICY_RDUNLOCK; + netlbl_secattr_destroy(&secattr); return rc; } -- paul moore linux security @ hp - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] NetLabel: the CIPSOv4 passthrough mapping does not pass categories correctly
From: Paul Moore <[EMAIL PROTECTED]> The CIPSO passthrough mapping had a problem when sending categories which would cause no or incorrect categories to be sent on the wire with a packet. This patch fixes the problem which was a simple off-by-one bug. Signed-off-by: Paul Moore <[EMAIL PROTECTED]> --- net/ipv4/cipso_ipv4.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) Index: net-2.6_bugfix_2/net/ipv4/cipso_ipv4.c === --- net-2.6_bugfix_2.orig/net/ipv4/cipso_ipv4.c +++ net-2.6_bugfix_2/net/ipv4/cipso_ipv4.c @@ -832,8 +832,8 @@ static int cipso_v4_map_cat_rbm_hton(con switch (doi_def->type) { case CIPSO_V4_MAP_PASS: - net_spot_max = host_cat_len - 1; - while (net_spot_max > 0 && host_cat[net_spot_max] == 0) + net_spot_max = host_cat_len; + while (net_spot_max > 0 && host_cat[net_spot_max - 1] == 0) net_spot_max--; if (net_spot_max > net_cat_len) return -EINVAL; -- paul moore linux security @ hp - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] Collection of small NetLabel bugfixes
When doing some more testing today I ran into a few bugs, this patchset addresses those bugs. This patchset is backed against today's net-2.6 git tree. Please apply these patches for 2.6.19, thanks. -- paul moore linux security @ hp - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/10] sky2: MSI test is only a warning
Some motherboards don't implement MSI correctly. The driver handles this but the warning is too verbose and overly cautious. Signed-off-by: Stephe Hemminger <[EMAIL PROTECTED]> --- sky2.orig/drivers/net/sky2.c2006-10-11 11:43:08.0 -0700 +++ sky2/drivers/net/sky2.c 2006-10-11 11:43:33.0 -0700 @@ -3326,9 +3326,8 @@ if (!hw->msi_detected) { /* MSI test failed, go back to INTx mode */ - printk(KERN_WARNING PFX "%s: No interrupt was generated using MSI, " - "switching to INTx mode. Please report this failure to " - "the PCI maintainer and include system chipset information.\n", + printk(KERN_INFO PFX "%s: No interrupt generated using MSI, " + "switching to INTx mode.\n", pci_name(pdev)); err = -EOPNOTSUPP; @@ -3336,6 +3335,7 @@ } sky2_write32(hw, B0_IMSK, 0); + sky2_read32(hw, B0_IMSK); free_irq(pdev->irq, hw); -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
O > > > > You might want to try ignoring the check in dev.c and testing > > to see if there is a performance gain. It wouldn't be hard to test > > a modified version and validate the performance change. > > Yes. With my patch, there is a huge performance gain by increasing MTU to 64K. > And it seems the only way to do this is by S/G. > > > You could even do what I suggested and use skb_checksum_help() > > to do inplace checksumming, as a performance test. > > I can. But as network algorithmics says (chapter 5) > "Since such bus reads are expensive, the CPU might as well piggyback > the checksum computation with the copy process". > > It speaks about onboard the adapter buffers, but memory bus reads are also > much slower > than CPU nowdays. So I think even if this works well in benchmark in real > life > single copy should better. > The other alternative might be to make copy/checksum code smarter about using fragments rather than allocating a large buffer. It should avoid second order allocations (effective size > PAGESIZE). -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 09/10] sky2: no message on rx fifo overflow
Under high load it is possible to make the receiver FIFO get overloaded. The driver/hardware recover properly, so there is no reason to fill the log with lots of extra messages, just update counter. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- drivers/net/sky2.c |6 -- 1 file changed, 4 insertions(+), 2 deletions(-) --- sky2.orig/drivers/net/sky2.c2006-10-11 12:05:46.0 -0700 +++ sky2/drivers/net/sky2.c 2006-10-11 12:05:54.0 -0700 @@ -2014,6 +2014,10 @@ error: ++sky2->net_stats.rx_errors; + if (status & GMR_FS_RX_FF_OV) { + sky2->net_stats.rx_fifo_errors++; + goto resubmit; + } if (netif_msg_rx_err(sky2) && net_ratelimit()) printk(KERN_INFO PFX "%s: rx error, status 0x%x length %d\n", @@ -2025,8 +2029,6 @@ sky2->net_stats.rx_frame_errors++; if (status & GMR_FS_CRC_ERR) sky2->net_stats.rx_crc_errors++; - if (status & GMR_FS_RX_FF_OV) - sky2->net_stats.rx_fifo_errors++; goto resubmit; } -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/10] sky2: use duplex result bits
The result of duplex negotiation is avaliable in the phy status register, so use that to simplify code and avoid rereading the PHY. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- drivers/net/sky2.c | 21 + 1 file changed, 1 insertion(+), 20 deletions(-) --- sky2.orig/drivers/net/sky2.c2006-10-11 11:43:47.0 -0700 +++ sky2/drivers/net/sky2.c 2006-10-11 11:43:50.0 -0700 @@ -1714,26 +1714,7 @@ } sky2->speed = sky2_phy_speed(hw, aux); - if (sky2->speed == SPEED_1000) { - u16 ctl2 = gm_phy_read(hw, port, PHY_MARV_1000T_CTRL); - u16 lpa2 = gm_phy_read(hw, port, PHY_MARV_1000T_STAT); - if (lpa2 & PHY_B_1000S_MSF) { - printk(KERN_ERR PFX "%s: master/slave fault", - sky2->netdev->name); - return -1; - } - - if ((ctl2 & PHY_M_1000C_AFD) && (lpa2 & PHY_B_1000S_LP_FD)) - sky2->duplex = DUPLEX_FULL; - else - sky2->duplex = DUPLEX_HALF; - } else { - u16 adv = gm_phy_read(hw, port, PHY_MARV_AUNE_ADV); - if ((aux & adv) & PHY_AN_FULL) - sky2->duplex = DUPLEX_FULL; - else - sky2->duplex = DUPLEX_HALF; - } + sky2->duplex = (aux & PHY_M_PS_FULL_DUP) ? DUPLEX_FULL : DUPLEX_HALF; /* Pause bits are offset (9..8) */ if (hw->chip_id == CHIP_ID_YUKON_XL || hw->chip_id == CHIP_ID_YUKON_EC_U) -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
Quoting r. Stephen Hemminger <[EMAIL PROTECTED]>: > Subject: Re: Dropping NETIF_F_SG since no checksum feature. > > On Wed, 11 Oct 2006 21:11:38 +0100 > Steven Whitehouse <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > On Wed, Oct 11, 2006 at 05:01:03PM +0200, Michael S. Tsirkin wrote: > > > Quoting Steven Whitehouse <[EMAIL PROTECTED]>: > > > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int > > > > > offset, > > > > > size_t size, int flags) > > > > > { > > > > > ssize_t res; > > > > > struct sock *sk = sock->sk; > > > > > > > > > > if (!(sk->sk_route_caps & NETIF_F_SG) || > > > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) > > > > > return sock_no_sendpage(sock, page, offset, size, > > > > > flags); > > > > > > > > > > > > > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > > > > > data will be copied over rather than sent directly. > > > > > So why does dev.c have to force set NETIF_F_SG to off then? > > > > > > > > > I agree with that analysis, > > > > > > So, would you Ack something like the following then? > > > > > > > In so far as I'm able to ack it, then yes, but with the following > > caveats: that you also need to look at the tcp code's checks for > > NETIF_F_SG (aside from the interface to tcp_sendpage which I think > > we've agreed is ok) and ensure that this patch will not change their > > behaviour, and here I'm thinking of the test in net/ipv4/tcp.c:select_size() > > in particular - there may be others but thats the only one I can think > > of off the top of my head. I think this is what davem was getting at > > with his comment about copy & sum for smaller packets. > > > > Also all subject to approval by davem and shemminger of course :-) > > > > My general feeling is that devices should advertise the features that > > they actually have and that the protocols should make the decision > > as to which ones to use or not depending on the combinations available > > (which I think is pretty much your argument). > > > > Steve. > > > > You might want to try ignoring the check in dev.c and testing > to see if there is a performance gain. It wouldn't be hard to test > a modified version and validate the performance change. Yes. With my patch, there is a huge performance gain by increasing MTU to 64K. And it seems the only way to do this is by S/G. > You could even do what I suggested and use skb_checksum_help() > to do inplace checksumming, as a performance test. I can. But as network algorithmics says (chapter 5) "Since such bus reads are expensive, the CPU might as well piggyback the checksum computation with the copy process". It speaks about onboard the adapter buffers, but memory bus reads are also much slower than CPU nowdays. So I think even if this works well in benchmark in real life single copy should better. -- MST - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 03/10] sky2: phy irq on shutdown
When PHY is turned off on shutdown, it causes the IRQ to get stuck on. Make sure and disable the IRQ first, and if IRQ occurs when device is not running, don't access PHY because that will hang. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- sky2.orig/drivers/net/sky2.c2006-10-11 11:43:40.0 -0700 +++ sky2/drivers/net/sky2.c 2006-10-11 11:43:43.0 -0700 @@ -1499,6 +1499,11 @@ /* Stop more packets from being queued */ netif_stop_queue(dev); + /* Disable port IRQ */ + imask = sky2_read32(hw, B0_IMSK); + imask &= ~portirq_msk[port]; + sky2_write32(hw, B0_IMSK, imask); + sky2_gmac_reset(hw, port); /* Stop transmitter */ @@ -1549,11 +1554,6 @@ sky2_write8(hw, SK_REG(port, RX_GMF_CTRL_T), GMF_RST_SET); sky2_write8(hw, SK_REG(port, TX_GMF_CTRL_T), GMF_RST_SET); - /* Disable port IRQ */ - imask = sky2_read32(hw, B0_IMSK); - imask &= ~portirq_msk[port]; - sky2_write32(hw, B0_IMSK, imask); - sky2_phy_power(hw, port, 0); /* turn off LED's */ @@ -1750,13 +1750,13 @@ struct sky2_port *sky2 = netdev_priv(dev); u16 istatus, phystat; + if (!netif_running(dev)) + return; + spin_lock(&sky2->phy_lock); istatus = gm_phy_read(hw, port, PHY_MARV_INT_STAT); phystat = gm_phy_read(hw, port, PHY_MARV_PHY_STAT); - if (!netif_running(dev)) - goto out; - if (netif_msg_intr(sky2)) printk(KERN_INFO PFX "%s: phy interrupt status 0x%x 0x%x\n", sky2->netdev->name, istatus, phystat); -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/10] sky2: flow control setting fixes
The result of flow control negotiation should not limit the next negotiatition. If board is plugged into an old half duplex 10Mbit port, without pause, then replugged into a gigabit port, it should negotiate what is desired, not inherit that last negotiation. P.s: many other drivers have this bug. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- drivers/net/sky2.c | 132 ++--- drivers/net/sky2.h | 13 - 2 files changed, 86 insertions(+), 59 deletions(-) --- sky2.orig/drivers/net/sky2.c2006-10-11 12:00:37.0 -0700 +++ sky2/drivers/net/sky2.c 2006-10-11 12:05:46.0 -0700 @@ -284,6 +284,31 @@ gma_write16(hw, port, GM_RX_CTRL, reg); } +/* flow control to advertise bits */ +static const u16 copper_fc_adv[] = { + [FC_NONE] = 0, + [FC_TX] = PHY_M_AN_ASP, + [FC_RX] = PHY_M_AN_PC, + [FC_BOTH] = PHY_M_AN_PC | PHY_M_AN_ASP, +}; + +/* flow control to advertise bits when using 1000BaseX */ +static const u16 fiber_fc_adv[] = { + [FC_BOTH] = PHY_M_P_BOTH_MD_X, + [FC_TX] = PHY_M_P_ASYM_MD_X, + [FC_RX] = PHY_M_P_SYM_MD_X, + [FC_NONE] = PHY_M_P_NO_PAUSE_X, +}; + +/* flow control to GMA disable bits */ +static const u16 gm_fc_disable[] = { + [FC_NONE] = GM_GPCR_FC_RX_DIS | GM_GPCR_FC_TX_DIS, + [FC_TX] = GM_GPCR_FC_RX_DIS, + [FC_RX] = GM_GPCR_FC_TX_DIS, + [FC_BOTH] = 0, +}; + + static void sky2_phy_init(struct sky2_hw *hw, unsigned port) { struct sky2_port *sky2 = netdev_priv(hw->dev[port]); @@ -376,29 +401,14 @@ if (sky2->advertising & ADVERTISED_10baseT_Half) adv |= PHY_M_AN_10_HD; - /* desired flow control */ - if (sky2->tx_pause && sky2->rx_pause) /* both */ - adv |= PHY_M_AN_PC | PHY_M_AN_ASP; - else if (sky2->tx_pause) - adv |= PHY_M_AN_ASP; - else if (sky2->rx_pause) - adv |= PHY_M_AN_PC; - - + adv |= copper_fc_adv[sky2->flow_mode]; } else {/* special defines for FIBER (88E1040S only) */ if (sky2->advertising & ADVERTISED_1000baseT_Full) adv |= PHY_M_AN_1000X_AFD; if (sky2->advertising & ADVERTISED_1000baseT_Half) adv |= PHY_M_AN_1000X_AHD; - if (sky2->tx_pause && sky2->rx_pause) /* both */ - adv |= PHY_M_P_BOTH_MD_X; - else if (sky2->tx_pause) - adv |= PHY_M_P_ASYM_MD_X; - else if (sky2->rx_pause) - adv |= PHY_M_P_SYM_MD_X; - else - adv |= PHY_M_P_NO_PAUSE_X; + adv |= fiber_fc_adv[sky2->flow_mode]; } /* Restart Auto-negotiation */ @@ -424,20 +434,14 @@ if (sky2->duplex == DUPLEX_FULL) { reg |= GM_GPCR_DUP_FULL; ctrl |= PHY_CT_DUP_MD; - } else if (sky2->speed != SPEED_1000 && hw->chip_id != CHIP_ID_YUKON_EC_U) { - /* Turn off flow control for 10/100mbps */ - sky2->rx_pause = 0; - sky2->tx_pause = 0; - } + } else if (sky2->speed < SPEED_1000) + sky2->flow_mode = FC_NONE; - if (!sky2->rx_pause) - reg |= GM_GPCR_FC_RX_DIS; - if (!sky2->tx_pause) - reg |= GM_GPCR_FC_TX_DIS; + reg |= gm_fc_disable[sky2->flow_mode]; /* Forward pause packets to GMAC? */ - if (sky2->tx_pause || sky2->rx_pause) + if (sky2->flow_mode & FC_RX) sky2_write8(hw, SK_REG(port, GMAC_CTRL), GMC_PAUSE_ON); else sky2_write8(hw, SK_REG(port, GMAC_CTRL), GMC_PAUSE_OFF); @@ -1605,6 +1609,12 @@ struct sky2_hw *hw = sky2->hw; unsigned port = sky2->port; u16 reg; + static const char *fc_name[] = { + [FC_NONE] = "none", + [FC_TX] = "tx", + [FC_RX] = "rx", + [FC_BOTH] = "both", + }; /* enable Rx/Tx */ reg = gma_read16(hw, port, GM_GP_CTRL); @@ -1648,8 +1658,7 @@ "%s: Link is up at %d Mbps, %s duplex, flow control %s\n", sky2->netdev->name, sky2->speed, sky2->duplex == DUPLEX_FULL ? "full" : "half", - (sky2->tx_pause && sky2->rx_pause) ? "both" : - sky2->tx_paus
Re: [Bug 7296] New: Connected networks persist in routing table despite no IF_RUNNING
From: Stephen Hemminger <[EMAIL PROTECTED]> Date: Wed, 11 Oct 2006 14:12:13 -0700 > I rejected this because I don't consider "sticky routes" when link is down > a bug. This definitely sounds like it is in the realm of userspace policy. The routing daemon should adjust the routes, as appropriate, if a link loses it's IF_RUNNING and it wants to use a backup route. IFF_RUNNING does not cause routes to get flushed, only IFF_UP events do. Changing routes based upon IFF_RUNNING state is a userland policy issue. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/10] sky2: advertising register 16 bits
The advertising bits (from ethtool.h) fit in 16 bits. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- sky2.orig/drivers/net/sky2.h2006-10-11 11:56:25.0 -0700 +++ sky2/drivers/net/sky2.h 2006-10-11 11:56:42.0 -0700 @@ -1860,7 +1860,7 @@ dma_addr_t rx_le_map; dma_addr_t tx_le_map; - u32 advertising; /* ADVERTISED_ bits */ + u16 advertising; /* ADVERTISED_ bits */ u16 speed; /* SPEED_1000, SPEED_100, ... */ u8 autoneg; /* AUTONEG_ENABLE, AUTONEG_DISABLE */ u8 duplex;/* DUPLEX_HALF, DUPLEX_FULL */ -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/10] sky2: version 1.9 patches
Several little patches to sky2 driver. Mostly having to do with PHY handling and flow control. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/10] sky2: fiber pause bits
The advertisement bits for flow control are located in different location on fiber (1000baseX) Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- sky2.orig/drivers/net/sky2.c2006-10-11 11:43:43.0 -0700 +++ sky2/drivers/net/sky2.c 2006-10-11 11:43:47.0 -0700 @@ -384,20 +384,31 @@ adv |= PHY_M_AN_10_FD; if (sky2->advertising & ADVERTISED_10baseT_Half) adv |= PHY_M_AN_10_HD; + + /* desired flow control */ + if (sky2->tx_pause && sky2->rx_pause) /* both */ + adv |= PHY_M_AN_PC | PHY_M_AN_ASP; + else if (sky2->tx_pause) + adv |= PHY_M_AN_ASP; + else if (sky2->rx_pause) + adv |= PHY_M_AN_PC; + + } else {/* special defines for FIBER (88E1040S only) */ if (sky2->advertising & ADVERTISED_1000baseT_Full) adv |= PHY_M_AN_1000X_AFD; if (sky2->advertising & ADVERTISED_1000baseT_Half) adv |= PHY_M_AN_1000X_AHD; - } - /* Set Flow-control capabilities */ - if (sky2->tx_pause && sky2->rx_pause) - adv |= PHY_AN_PAUSE_CAP;/* symmetric */ - else if (sky2->rx_pause && !sky2->tx_pause) - adv |= PHY_AN_PAUSE_ASYM | PHY_AN_PAUSE_CAP; - else if (!sky2->rx_pause && sky2->tx_pause) - adv |= PHY_AN_PAUSE_ASYM; /* local */ + if (sky2->tx_pause && sky2->rx_pause) /* both */ + adv |= PHY_M_P_BOTH_MD_X; + else if (sky2->tx_pause) + adv |= PHY_M_P_ASYM_MD_X; + else if (sky2->rx_pause) + adv |= PHY_M_P_SYM_MD_X; + else + adv |= PHY_M_P_NO_PAUSE_X; + } /* Restart Auto-negotiation */ ctrl |= PHY_CT_ANE | PHY_CT_RE_CFG; -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/10] sky2: version 1.9
Mark version, this has been a lot of patches. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- sky2.orig/drivers/net/sky2.c2006-10-11 14:19:14.0 -0700 +++ sky2/drivers/net/sky2.c 2006-10-11 14:19:23.0 -0700 @@ -50,7 +50,7 @@ #include "sky2.h" #define DRV_NAME "sky2" -#define DRV_VERSION"1.9" +#define DRV_VERSION"1.10" #define PFXDRV_NAME " " /* -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Fw: [Bug 7296] New: Connected networks persist in routing table despite no IF_RUNNING
I rejected this because I don't consider "sticky routes" when link is down a bug. Date: Tue, 10 Oct 2006 02:17:26 -0700 From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [Bug 7296] New: Connected networks persist in routing table despite no IF_RUNNING http://bugzilla.kernel.org/show_bug.cgi?id=7296 Summary: Connected networks persist in routing table despite no IF_RUNNING Kernel Version: 2.6.18 Status: NEW Severity: normal Owner: [EMAIL PROTECTED] Submitter: [EMAIL PROTECTED] Most recent kernel where this bug did not occur: All affected AFAIK Distribution: Debian Sarge Hardware Environment: Dual (Intel) Ethernet, PPro, SATA RAID, etc :02:06.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0d) :02:07.0 Ethernet controller: Intel Corp. 82557/8/9 [Ethernet Pro 100] (rev 0d) Software Environment: K:2.6.18 and more. (Quagga: 0.98.3-7.2) Problem Description: Introduction Our servers have two ethernet interfaces to offer resiliency. They run a routing protocol (OSPF) in order to learn exit paths via either of these interfaces. These routing protocols are independant to this problem, however from experience the problem curtails the effectiveness of any such protocols in resilient architectures. Therefore with growing interest shown in utilising Linux as a router, IMO this should be considered a fairly serious bug. Problem --- Whilst the nature of my specific topology is rather complex, I can define the problem generally as: A "connected" IP network whose interface subsequently loses its IF_RUNNING flag (ie unusable for routing), persists in the Linux routing table apparently as a "kernel" route. In the case of running routing protocols on dual interface servers this connected network is likely, following the destruction of the connected interface, to be reachable via the other. If the route persists however, it will never use this alternate path and black hole any traffic by routing it via a dead interface. See below example: -- # ifconfig eth0 Link encap:Ethernet HWaddr 00:20:ED:35:D4:C8 inet addr:192.168.0.143 Bcast:192.168.0.191 Mask:255.255.255.192 UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 eth1 Link encap:Ethernet HWaddr 00:20:ED:35:D4:C9 inet addr:192.168.0.207 Bcast:192.168.0.255 Mask:255.255.255.192 UP BROADCAST MULTICAST MTU:1500 Metric:1 # ip route show 192.168.0.128/26 dev eth0 proto kernel scope link src 192.168.0.143 192.168.0.192/26 dev eth1 proto kernel scope link src 192.168.0.207 <-- NO!! 192.168.0.192/26 via 192.168.0.130 dev eth0 proto zebra metric 60 equalize # ping {anything else on 192.168.0.192/26} The path for 192.168.0.192 is learned via 192.168.0.130 (current ospf dr - irrelevant), but it'll never use it because of the kernel sourced directly connected route still sitting in there. If I then IFDOWN eth1, everything is fine and the route withdraws correctly, but this kinda defeats the object of dynamic routing =:/ Not sure whether this is a "driver tells the kernel" or a "kernel checks the driver at {n} intervals" issue - I would suggest the former would be more correct, but it is a problem regardless. Steps to reproduce: Ingredients --- 1 x Linux (mine is Debian sarge but it may affect all kernels) 1 x Dual Ethernet server (see Intel cards above for my exact build) 1 x Switch or Hub or similar (cheap will do) Connect the Server ports to the switch ports so the server interfaces are RUNNING in ifconfig On the server, configure: eth0 as 192.168.0.1/24 eth1 as 192.168.1.1/24 NOW "ip route show" Pull the plug on the eth0 interface, or admin shut it on the SWITCH end if you're a ciscoite. If buggy then "ip route show" will still show the network 192.168.0.0/24 in the routing table. It really shouldn't do this - if a path to this network had been learned via a protocol via the other interface, it would never use that path because the network persists as directly connected and thus supercedes any dynamic path in true cisco style. Try ifdown, the route will withdraw correctly, it needs to do that if IF_RUNNING goes away. Rgds Shaun Kemp --- You are receiving this mail because: --- You are the assignee for the bug, or are watching the assignee. -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 07/10] sky2: dont reset PHY twice
Don't need to reset PHY twice on startup. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- sky2.orig/drivers/net/sky2.c2006-10-11 11:59:45.0 -0700 +++ sky2/drivers/net/sky2.c 2006-10-11 14:21:36.0 -0700 @@ -356,16 +356,7 @@ gm_phy_write(hw, port, PHY_MARV_EXT_ADR, pg); } - ctrl = gm_phy_read(hw, port, PHY_MARV_CTRL); - if (sky2->autoneg == AUTONEG_DISABLE) - ctrl &= ~PHY_CT_ANE; - else - ctrl |= PHY_CT_ANE; - - ctrl |= PHY_CT_RESET; - gm_phy_write(hw, port, PHY_MARV_CTRL, ctrl); - - ctrl = 0; + ctrl = PHY_CT_RESET; ct1000 = 0; adv = PHY_AN_CSMA; reg = 0; @@ -450,8 +441,6 @@ sky2_write8(hw, SK_REG(port, GMAC_CTRL), GMC_PAUSE_ON); else sky2_write8(hw, SK_REG(port, GMAC_CTRL), GMC_PAUSE_OFF); - - ctrl |= PHY_CT_RESET; } gma_write16(hw, port, GM_GP_CTRL, reg); -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
Quoting r. David Miller <[EMAIL PROTECTED]>: > Subject: Re: Dropping NETIF_F_SG since no checksum feature. > > From: "Michael S. Tsirkin" <[EMAIL PROTECTED]> > Date: Wed, 11 Oct 2006 17:01:03 +0200 > > > Quoting Steven Whitehouse <[EMAIL PROTECTED]>: > > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, > > > > size_t size, int flags) > > > > { > > > > ssize_t res; > > > > struct sock *sk = sock->sk; > > > > > > > > if (!(sk->sk_route_caps & NETIF_F_SG) || > > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) > > > > return sock_no_sendpage(sock, page, offset, size, > > > > flags); > > > > > > > > > > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > > > > data will be copied over rather than sent directly. > > > > So why does dev.c have to force set NETIF_F_SG to off then? > > > > > > > I agree with that analysis, > > > > So, would you Ack something like the following then? > > I certainly don't. > Dave, sorry, you lost me. Would a new flag NETIF_F_SW_CSUM be better, to tell network core that device computes checksums in software, so we should piggyback the checksum computation with the copy process if possible? Or is there some other issue? -- MST - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
Quoting r. Stephen Hemminger <[EMAIL PROTECTED]>: > Subject: Re: Dropping NETIF_F_SG since no checksum feature. > > O > > > > > > You might want to try ignoring the check in dev.c and testing > > > to see if there is a performance gain. It wouldn't be hard to test > > > a modified version and validate the performance change. > > > > Yes. With my patch, there is a huge performance gain by increasing MTU to > > 64K. > > And it seems the only way to do this is by S/G. > > > > > You could even do what I suggested and use skb_checksum_help() > > > to do inplace checksumming, as a performance test. > > > > I can. But as network algorithmics says (chapter 5) > > "Since such bus reads are expensive, the CPU might as well piggyback > > the checksum computation with the copy process". > > > > It speaks about onboard the adapter buffers, but memory bus reads are also > > much slower > > than CPU nowdays. So I think even if this works well in benchmark in real > > life > > single copy should better. > > > > The other alternative might be to make copy/checksum code smarter about using > fragments rather than allocating a large buffer. It should avoid second order > allocations (effective size > PAGESIZE). In my testing, it seems quite smart already - once I declare F_SG and clear F_...CSUM I start getting properly checksummed packets with lots of s/g fragments. But I'm not catching the drift - an alternative to what this would be? -- MST - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
From: "Michael S. Tsirkin" <[EMAIL PROTECTED]> Date: Wed, 11 Oct 2006 23:23:39 +0200 > With my patch, there is a huge performance gain by increasing MTU to 64K. > And it seems the only way to do this is by S/G. Numbers? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/10] sky2: turn of workaround timer
The workaround timer is not needed in most systems with proper IRQ routing and by perodically waking up it adds to laptop power consumption. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- drivers/net/sky2.c |4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- sky2.orig/drivers/net/sky2.c2006-10-11 11:43:33.0 -0700 +++ sky2/drivers/net/sky2.c 2006-10-11 11:43:40.0 -0700 @@ -96,9 +96,9 @@ module_param(disable_msi, int, 0); MODULE_PARM_DESC(disable_msi, "Disable Message Signaled Interrupt (MSI)"); -static int idle_timeout = 100; +static int idle_timeout = 0; module_param(idle_timeout, int, 0); -MODULE_PARM_DESC(idle_timeout, "Idle timeout workaround for lost interrupts (ms)"); +MODULE_PARM_DESC(idle_timeout, "Watchdog timer for lost interrupts (ms)"); static const struct pci_device_id sky2_id_table[] = { { PCI_DEVICE(PCI_VENDOR_ID_SYSKONNECT, 0x9000) }, -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing
> I started writingthe patch thinking it will have some huge effect on > performance, based on a false assumption on how i/o was done on this > machine > > *If* this were another pSeries system, then each call to > pci_map_single() chews up an actual hardware "translation > control entry" (TCE) that maps pci bus addresses into > system RAM addresses. These are somewhat limited resources, > and so one shouldn't squander them. Furthermore, I thouhght > TCE's have TLB's associated with them (similar to how virtual > memory page tables are backed by hardware page TLB's), of which > there are even less of. I was thinking that TLB thrashing would > have a big hit on performance. > > Turns out that there was no difference to performance at all, > and a quick look at "cell_map_single()" in arch/powerpc/platforms/cell > made it clear why: there's no fancy i/o address mapping. > > Thus, the patch has only mrginal benefit; I submit it only in the > name of "its the right thing to do anyway". Well, there is no fancy iommu mapping ... yet. It's been implemented and is coming after we put together some workarounds for various other spider hardware issues that trigger when using it (bogus prefetches and bogus pci ordering). I think the hypervisor based platforms will be happy with that patch too. Ben. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
On Wed, 11 Oct 2006 21:11:38 +0100 Steven Whitehouse <[EMAIL PROTECTED]> wrote: > Hi, > > On Wed, Oct 11, 2006 at 05:01:03PM +0200, Michael S. Tsirkin wrote: > > Quoting Steven Whitehouse <[EMAIL PROTECTED]>: > > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, > > > > size_t size, int flags) > > > > { > > > > ssize_t res; > > > > struct sock *sk = sock->sk; > > > > > > > > if (!(sk->sk_route_caps & NETIF_F_SG) || > > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) > > > > return sock_no_sendpage(sock, page, offset, size, > > > > flags); > > > > > > > > > > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > > > > data will be copied over rather than sent directly. > > > > So why does dev.c have to force set NETIF_F_SG to off then? > > > > > > > I agree with that analysis, > > > > So, would you Ack something like the following then? > > > > In so far as I'm able to ack it, then yes, but with the following > caveats: that you also need to look at the tcp code's checks for > NETIF_F_SG (aside from the interface to tcp_sendpage which I think > we've agreed is ok) and ensure that this patch will not change their > behaviour, and here I'm thinking of the test in net/ipv4/tcp.c:select_size() > in particular - there may be others but thats the only one I can think > of off the top of my head. I think this is what davem was getting at > with his comment about copy & sum for smaller packets. > > Also all subject to approval by davem and shemminger of course :-) > > My general feeling is that devices should advertise the features that > they actually have and that the protocols should make the decision > as to which ones to use or not depending on the combinations available > (which I think is pretty much your argument). > > Steve. > You might want to try ignoring the check in dev.c and testing to see if there is a performance gain. It wouldn't be hard to test a modified version and validate the performance change. You could even do what I suggested and use skb_checksum_help() to do inplace checksumming, as a performance test. -- Stephen Hemminger <[EMAIL PROTECTED]> - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
Quoting r. Steven Whitehouse <[EMAIL PROTECTED]>: > Subject: Re: Dropping NETIF_F_SG since no checksum feature. > > Hi, > > On Wed, Oct 11, 2006 at 05:01:03PM +0200, Michael S. Tsirkin wrote: > > Quoting Steven Whitehouse <[EMAIL PROTECTED]>: > > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, > > > > size_t size, int flags) > > > > { > > > > ssize_t res; > > > > struct sock *sk = sock->sk; > > > > > > > > if (!(sk->sk_route_caps & NETIF_F_SG) || > > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) > > > > return sock_no_sendpage(sock, page, offset, size, > > > > flags); > > > > > > > > > > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > > > > data will be copied over rather than sent directly. > > > > So why does dev.c have to force set NETIF_F_SG to off then? > > > > > > > I agree with that analysis, > > > > So, would you Ack something like the following then? > > > > In so far as I'm able to ack it, then yes, but with the following > caveats: that you also need to look at the tcp code's checks for > NETIF_F_SG (aside from the interface to tcp_sendpage which I think > we've agreed is ok) and ensure that this patch will not change their > behaviour, and here I'm thinking of the test in net/ipv4/tcp.c:select_size() > in particular - there may be others but thats the only one I can think > of off the top of my head. I think this is what davem was getting at > with his comment about copy & sum for smaller packets. Will do - thanks for the tips. > Also all subject to approval by davem and shemminger of course :-) Goes without saying :) > My general feeling is that devices should advertise the features that > they actually have and that the protocols should make the decision > as to which ones to use or not depending on the combinations available > (which I think is pretty much your argument). > > Steve. -- MST - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
From: "Michael S. Tsirkin" <[EMAIL PROTECTED]> Date: Wed, 11 Oct 2006 17:01:03 +0200 > Quoting Steven Whitehouse <[EMAIL PROTECTED]>: > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, > > > size_t size, int flags) > > > { > > > ssize_t res; > > > struct sock *sk = sock->sk; > > > > > > if (!(sk->sk_route_caps & NETIF_F_SG) || > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) > > > return sock_no_sendpage(sock, page, offset, size, flags); > > > > > > > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > > > data will be copied over rather than sent directly. > > > So why does dev.c have to force set NETIF_F_SG to off then? > > > > > I agree with that analysis, > > So, would you Ack something like the following then? I certainly don't. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] d80211: extend extra_hdr_room to be a bytecount
On Wednesday 11 October 2006 16:59, David Kimdon wrote: > Perhaps rename it to extra_tx_headroom? > - existing users would then need to take notice of the change > - the name 'extra_tx_headroom' is more descriptive of what it actually is Oh, yeah. This was lazyman's solution ;) I will send an updated patch. -- Greetings Michael. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
Hi, On Wed, Oct 11, 2006 at 05:01:03PM +0200, Michael S. Tsirkin wrote: > Quoting Steven Whitehouse <[EMAIL PROTECTED]>: > > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, > > > size_t size, int flags) > > > { > > > ssize_t res; > > > struct sock *sk = sock->sk; > > > > > > if (!(sk->sk_route_caps & NETIF_F_SG) || > > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) > > > return sock_no_sendpage(sock, page, offset, size, flags); > > > > > > > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > > > data will be copied over rather than sent directly. > > > So why does dev.c have to force set NETIF_F_SG to off then? > > > > > I agree with that analysis, > > So, would you Ack something like the following then? > In so far as I'm able to ack it, then yes, but with the following caveats: that you also need to look at the tcp code's checks for NETIF_F_SG (aside from the interface to tcp_sendpage which I think we've agreed is ok) and ensure that this patch will not change their behaviour, and here I'm thinking of the test in net/ipv4/tcp.c:select_size() in particular - there may be others but thats the only one I can think of off the top of my head. I think this is what davem was getting at with his comment about copy & sum for smaller packets. Also all subject to approval by davem and shemminger of course :-) My general feeling is that devices should advertise the features that they actually have and that the protocols should make the decision as to which ones to use or not depending on the combinations available (which I think is pretty much your argument). Steve. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH 00/11] The _entire_ secid reconciliation patchset (tada!)
> From an initial review of this patchset, it doesn't look > quite ready to > queue for 2.6.20 (which I plan to to via git once it is). > > Outstanding items include resolving the igmp skb hook issue > generally, > testing to verify both the design and implementation, and > ensuring that > all the related policy changes are merged upstream first. > Regarding the igmp hook issue, we could do a generic hook like Paul suggested. Would that be more palatable you think? - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [openib-general] Dropping NETIF_F_SG since no checksum feature.
At 02:46 AM 10/11/2006, Michael S. Tsirkin wrote: Quoting r. David Miller <[EMAIL PROTECTED]>: > Subject: Re: Dropping NETIF_F_SG since no checksum feature. > > From: "Michael S. Tsirkin" <[EMAIL PROTECTED]> > Date: Wed, 11 Oct 2006 11:05:04 +0200 > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > > data will be copied over rather than sent directly. > > So why does dev.c have to force set NETIF_F_SG to off then? > > Because it's more efficient to copy into a linear destination > buffer of an SKB than page sub-chunks when doing checksum+copy. > Thanks for the explanation. Obviously its true as long as you can allocate the skb that big. I think you won't realistically be able to get 64K in a linear SKB on a busy system, though, is not that right? OTOH, having large MTU (e.g. 64K) helps performance a lot since it reduces receive side processing overhead. One thing to keep in mind is while it may help performance in a micro-benchmark, the system performance or the QoS impacts to other flows can be negatively impacted depending upon implementation. For example, consider multiple messages interleaving (heaven help implementations that are not able to interleave multiple messages) on either the transmit or receive HCA / RNIC and how the time-to-completion of any message is extended out in time as a result of the interleave. The effective throughput in terms of useful units of work can be lower as a result. The same effect can be observed when there are a significant number connections in a device being simultaneously processed. Also, if the copy-checksum is not performed on the processor where the application resides, then the performance can also be negatively impacted (want to have the right cache hot when initiated or concluded). While the aggregate computational performance of systems may be increasing at a significant rate (set aside the per core vs. aggregate core debate), the memory performance gains are much less. If you examine the longer term trends, there may be a flattening out of memory performance improvements by 2009/10 without some major changes in the way controllers and subsystems are designed. Mike - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] e1000: Real time packets and bytes statistics
On 10/11/06, Jean Delvare <[EMAIL PROTECTED]> wrote: Hi all, This patch is posted for review and comments. Let the e1000 driver report the most important statistics (rx/tx_bytes and rx/tx_packets) in real time, rather than every other second. This is similar to what the e100 driver is doing. The current asynchronous statistics refresh model makes it impossible to monitor the network traffic with an interval which isn't a multiple of 2 seconds. For example, an interval of 5 seconds would result in a sawtooth diagram (+20%, -20%) for a constant transfer rate. With a 1 second interval it's even worse (0, 200%) of course. This has been annoying users for years, but was never actually fixed: I think the idea is good, however, see below. rx/tx_bytes will show slightly lower values than before, because the hardware appears to include the 4-byte ethernet frame CRC into the frame length, while the driver doesn't. It's probably OK as the e100, 3c59x and 8139too drivers don't include it either. this is okay. I additionally noted a difference of 6 bytes on some TX frames, which I am not able to explain. It's probably small and rare enough not to be considered a problem, but if someone can explain it, I would be grateful. now, that sounds odd, however, once again, see below. Signed-off-by: Jean Delvare <[EMAIL PROTECTED]> --- drivers/net/e1000/e1000_main.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) --- linux-2.6.19-rc1.orig/drivers/net/e1000/e1000_main.c2006-10-11 10:53:49.0 +0200 +++ linux-2.6.19-rc1/drivers/net/e1000/e1000_main.c 2006-10-11 11:34:41.0 +0200 @@ -3118,6 +3118,8 @@ e1000_tx_map(adapter, tx_ring, skb, first, max_per_txd, nr_frags, mss)); + adapter->net_stats.tx_packets++; + adapter->net_stats.tx_bytes += skb->len; netdev->trans_start = jiffies; this is the part I'm most worried about. as I believe it to be incorrect for TSO packets. Maybe something like? + if (skb_shinfo(skb)->gso_segs) + adapter->net_stats.tx_packets += skb_shinfo(skb)->gso_segs; + else + adapter->net_stats.tx_packets++; + adapter->net_stats.tx_bytes += skb->len; netdev->trans_start = jiffies; skb len will still be off by some amount, because the skb->data (header) is replicated across each gso segment but only counted once this way, but hopefully someone will pipe up with a good way to compute that. The rest of the patch seems fine, barring any other comments. Jesse - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/21]: powerpc/cell spidernet bugfixes, etc.
On Tuesday 10 October 2006 22:49, Linas Vepstas wrote: > Andrew, please apply/forward upstream. > > The following set of 21 patches (!) are all aimed at the the > spidernet ethernet device driver. The spidernet is an etherenet > controller built into the Toshiba southbridge for the PowerPC Cell > processor. (This is the only device in existance that with this > ethernet hardware in it). > > These patches re-package/re-order/re-cleanup a previous > set of patches I've previously mailed. Thus, some have > been previously Acked-by lines, most do not. Most of > these patches are tiny, and handle problems that cropped > up during testing. Sorry about there being so many of them. > > The first set of 12 patches fix a large variety of mostly > minor bugs. > > The important patches are 13 through 17: these overcome a > debilitating performance problem on transmit (6 megabits > per second !!) on transmit of patches 500 bytes or larger. > After applying these, I am able to get the following: > > pkt sz speed (100K buffs) speed (4M buffs) > -- - > 1500 700 Mbits/sec 951 Mbits/sec > 1000 658 Mbits/sec 770 > 800 600 648 > 500 500 500 > 300 372 372 > 60 70 70 > > Above buf size refers to /proc/sys/net/core/wmem_default Excellent work! I guess this the best tx performance we've seen so far on this hardware. Consider this as an Acked-by: for all the patches, I'll save the effort of replying to each one of them separately. Jeff, do you plan on merging these fixes for 2.6.19? Arnd <>< - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing
Linas Vepstas wrote: > On Tue, Oct 10, 2006 at 06:46:08PM -0700, Geoff Levand wrote: >> > Linas Vepstas wrote: >> >> The current driver code performs 512 DMA mappns of a bunch of >> >> 32-byte structures. This is silly, as they are all in contiguous >> >> memory. Ths patch changes the code to DMA map the entie area >> >> with just one call. >> >> Linas, >> >> Is the motivation for this change to improve performance by reducing the >> overhead >> of the mapping calls? > > Yes. > >> If so, there may be some benefit for some systems. Could >> you please elaborate? > > I started writingthe patch thinking it will have some huge effect on > performance, based on a false assumption on how i/o was done on this > machine > > *If* this were another pSeries system, then each call to > pci_map_single() chews up an actual hardware "translation > control entry" (TCE) that maps pci bus addresses into > system RAM addresses. These are somewhat limited resources, > and so one shouldn't squander them. Furthermore, I thouhght > TCE's have TLB's associated with them (similar to how virtual > memory page tables are backed by hardware page TLB's), of which > there are even less of. I was thinking that TLB thrashing would > have a big hit on performance. > > Turns out that there was no difference to performance at all, > and a quick look at "cell_map_single()" in arch/powerpc/platforms/cell > made it clear why: there's no fancy i/o address mapping. OK, thanks for the explanation. Actually, the current cell DMA mapping implementation uses a simple 'linear' mapping, in that, all of RAM is mapped into the bus DMA address space at once, and in fact, it is all just done at system startup. There is ongoing work to implement 'dynamic' mapping, where DMA pages are mapped into the bus DMA address space on demand. I think a key point to understand the benefit to this is that the cell processor's I/O controller maps pages per device, so you can map one DMA page to one device. I currently have this working for my platform, but have not released that work. There is some overhead to managing the mapped buffers and to request pages be mapped by the hypervisor, etc., so I was thinking that is this work of yours to consolidate the memory buffers prior to requesting the mapping could be of benefit if it was in an often executed code path. -Geoff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing
On Tue, Oct 10, 2006 at 06:46:08PM -0700, Geoff Levand wrote: > > Linas Vepstas wrote: > >> The current driver code performs 512 DMA mappns of a bunch of > >> 32-byte structures. This is silly, as they are all in contiguous > >> memory. Ths patch changes the code to DMA map the entie area > >> with just one call. > > Linas, > > Is the motivation for this change to improve performance by reducing the > overhead > of the mapping calls? Yes. > If so, there may be some benefit for some systems. Could > you please elaborate? I started writingthe patch thinking it will have some huge effect on performance, based on a false assumption on how i/o was done on this machine *If* this were another pSeries system, then each call to pci_map_single() chews up an actual hardware "translation control entry" (TCE) that maps pci bus addresses into system RAM addresses. These are somewhat limited resources, and so one shouldn't squander them. Furthermore, I thouhght TCE's have TLB's associated with them (similar to how virtual memory page tables are backed by hardware page TLB's), of which there are even less of. I was thinking that TLB thrashing would have a big hit on performance. Turns out that there was no difference to performance at all, and a quick look at "cell_map_single()" in arch/powerpc/platforms/cell made it clear why: there's no fancy i/o address mapping. Thus, the patch has only mrginal benefit; I submit it only in the name of "its the right thing to do anyway". --linas - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
Quoting Steven Whitehouse <[EMAIL PROTECTED]>: > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, > > size_t size, int flags) > > { > > ssize_t res; > > struct sock *sk = sock->sk; > > > > if (!(sk->sk_route_caps & NETIF_F_SG) || > > !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) > > return sock_no_sendpage(sock, page, offset, size, flags); > > > > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > > data will be copied over rather than sent directly. > > So why does dev.c have to force set NETIF_F_SG to off then? > > > I agree with that analysis, So, would you Ack something like the following then? == Enabling NETIF_F_SG without NETIF_F_ALL_CSUM actually seems to work fine by doing an old-fashioned data copy in tcp_sendpage. And for devices that do not calculate IP checksum in hardware (e.g. InfiniBand) calculating the checksum for all packets in network driver is worse than have the CPU piggyback the checksum compitation with the copy process. Finally, note that NETIF_F_SG is necessary to be able to allocate skbs > PAGE_SIZE on busy systems. So, let's allow that combination, again, for drivers that want it. Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]> --- diff --git a/net/core/dev.c b/net/core/dev.c index d4a1ec3..2d731a0 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2930,14 +2930,6 @@ #endif } } - /* Fix illegal SG+CSUM combinations. */ - if ((dev->features & NETIF_F_SG) && - !(dev->features & NETIF_F_ALL_CSUM)) { - printk(KERN_NOTICE "%s: Dropping NETIF_F_SG since no checksum feature.\n", - dev->name); - dev->features &= ~NETIF_F_SG; - } - /* TSO requires that SG is present as well. */ if ((dev->features & NETIF_F_TSO) && !(dev->features & NETIF_F_SG)) { -- MST - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] d80211: extend extra_hdr_room to be a bytecount
Perhaps rename it to extra_tx_headroom? - existing users would then need to take notice of the change - the name 'extra_tx_headroom' is more descriptive of what it actually is -David On Wed, Oct 11, 2006 at 11:58:28AM +0200, Michael Buesch wrote: > Extend ieee80211_hw's extra_hdr_room to be a bytecount for > a device specific TX header instead of being a hardcoded > 0/2 byte choice. > > Signed-off-by: Michael Buesch <[EMAIL PROTECTED]> > > diff --git a/include/net/d80211.h b/include/net/d80211.h > index a80f48b..9e9709f 100644 > --- a/include/net/d80211.h > +++ b/include/net/d80211.h > @@ -476,10 +476,6 @@ struct ieee80211_hw { > /* Force software encryption for TKIP packets if WMM is enabled. */ > unsigned int no_tkip_wmm_hwaccel:1; > > - /* set if the payload needs to be padded at even boundaries after the > - * header */ > - unsigned int extra_hdr_room:1; > - > /* Some devices handle Michael MIC internally and do not include MIC in >* the received packets passed up. device_strips_mic must be set >* for such devices. The 'encryption' frame control bit is expected to > @@ -496,6 +492,9 @@ struct ieee80211_hw { >* i.e. more than one skb per frame */ > unsigned int fraglist:1; > > + /* Set to the size of a needed device specific skb headroom for TX > skbs. */ > + unsigned int extra_hdr_room; > + > /* This is the time in us to change channels > */ > int channel_change_time; > diff --git a/net/d80211/ieee80211.c b/net/d80211/ieee80211.c > index 1ef2707..7d52c3b 100644 > --- a/net/d80211/ieee80211.c > +++ b/net/d80211/ieee80211.c > @@ -1551,7 +1551,7 @@ static int ieee80211_subif_start_xmit(st >* build in headroom in __dev_alloc_skb() (linux/skbuff.h) and >* alloc_skb() (net/core/skbuff.c) >*/ > - head_need = hdrlen + encaps_len + (local->hw->extra_hdr_room ? 2 : 0); > + head_need = hdrlen + encaps_len + local->hw->extra_hdr_room; > head_need -= skb_headroom(skb); > > /* We are going to modify skb data, so make a copy of it if happens to > > > -- > Greetings Michael. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Customizable TCP backoff patch
Ben Woodard wrote: > diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h > --- linux-2.6.18/include/net/tcp.h2006-09-19 20:42:06.0 -0700 > +++ linux-2.6.18.new/include/net/tcp.h2006-10-10 18:42:00.0 > -0700 > @@ -227,11 +227,23 @@ > extern int sysctl_tcp_base_mss; > extern int sysctl_tcp_workaround_signed_windows; > extern int sysctl_tcp_slow_start_after_idle; > +extern unsigned long sysctl_tcp_rto_max; > +extern unsigned long sysctl_tcp_rto_init; > > extern atomic_t tcp_memory_allocated; > extern atomic_t tcp_sockets_allocated; > extern int tcp_memory_pressure; > > +static inline unsigned long tcp_rto_max(struct tcp_sock *tp) > +{ > +return tp->rto_max ? tp->rto_max*HZ/1000 : sysctl_tcp_rto_max; ^^^ That should probably be msecs_to_jiffies(tp->rto_max) > +} > + > +static inline unsigned long tcp_rto_init(struct tcp_sock *tp) > +{ > +return tp->rto_init ? tp->rto_init*HZ/1000 : sysctl_tcp_rto_init; Ditto. > +} > + > /* > * The next routines deal with comparing 32 bit unsigned ints > * and worry about wraparound (automatic with unsigned arithmetic). > diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c > linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c > --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.0 > -0700 > +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-10-10 > 16:32:08.0 -0700 > @@ -128,6 +128,8 @@ > return ret; > } > > +static unsigned long tcp_rto_min=0; > +static unsigned long tcp_rto_max=65535; > > ctl_table ipv4_table[] = { > { > @@ -697,6 +699,26 @@ > .mode = 0644, > .proc_handler = &proc_dointvec > }, > + { > + .ctl_name = NET_TCP_RTO_MAX, > + .procname = "tcp_rto_max", > + .data = &sysctl_tcp_rto_max, > + .maxlen = sizeof(unsigned), > + .mode = 0644, > + .proc_handler = &proc_doulongvec_ms_jiffies, > + .extra1 = &tcp_rto_min_constant, > + .extra2 = &tcp_rto_max_constant, > + }, > + { > + .ctl_name = NET_TCP_RTO_INIT, > + .procname = "tcp_rto_init", > + .data = &sysctl_tcp_rto_init, > + .maxlen = sizeof(unsigned), > + .mode = 0644, > + .proc_handler = &proc_doulongvec_ms_jiffies, > + .extra1 = &tcp_rto_min_constant, > + .extra2 = &tcp_rto_max_constant, > + }, > { .ctl_name = 0 } > }; Try as I might, I can't find proc_doulongvec_ms_jiffies() anywhere. I think you meant proc_doulongvec_ms_jiffies_minmax()? Also, this function has a bug in that it doesn't do corrections for potentially different HZ values. Look at the msecs_to_jiffies... When I've used proc_doulongvec_ms_jiffies_minmax() before, I would seen rather interesting truncation errors such that sysctl input didn't match sysctl output. > > diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c > --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.0 -0700 > +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-10 18:37:40.0 -0700 > @@ -1764,6 +1764,8 @@ > return err; > } > > +#define TCP_BACKOFF_MAXVAL 65535 > + > /* > * Socket option code for TCP. > */ > @@ -1939,6 +1941,21 @@ > } > break; > > +case TCP_BACKOFF_MAX: > +if (val < 1 || val > TCP_BACKOFF_MAXVAL) > +err = -EINVAL; > +else > +tp->rto_max = val; > +break; > + > +case TCP_BACKOFF_INIT: > +if (val < 1 || val > TCP_BACKOFF_MAXVAL) > +err = -EINVAL; > +else > +tp->rto_init = val; > +break; > + > + > default: > err = -ENOPROTOOPT; > break; > @@ -2110,6 +2127,12 @@ > if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) > return -EFAULT; > return 0; > +case TCP_BACKOFF_MAX: > + val = tcp_rto_max(tp)*1000/HZ; > +break; > +case TCP_BACKOFF_INIT: > + val = tcp_rto_init(tp)*1000/HZ; The two divisions above introduce truncation errors such the input doesn't match the output. Better to use jiffies_to_msecs(). Regards -vlad - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC] Question about potential problem in net/ipv4/route.c
Hi David While browsing net/ipv4/route.c I discovered compare_keys() function, and a potential bug in it. static inline int compare_keys(struct flowi *fl1, struct flowi *fl2) { return memcmp(&fl1->nl_u.ip4_u, &fl2->nl_u.ip4_u, sizeof(fl1->nl_u.ip4_u)) == 0 && fl1->oif == fl2->oif && fl1->iif == fl2->iif; } Using memcmp(ptr1, ptr2, sizeof(SOMEFIELD)) is dangerous because sizeof(SOMEFIELD) can be larger than the underlying object, because of alignment constraints. In this case, sizeof(fl1->nl_u.ip4_u) is 16, while fl1->nl_u.ip4_u is : struct { __u32 daddr; __u32 saddr; __u32 fwmark; __u8tos; __u8scope; } ip4_u; So 14 bytes are really initialized, and 2 padding bytes might contain random values... So at the very minimum, we should avoid doing the memcmp() including those last two bytes : It would be less bugy, and faster too... (But to get really fast comparison, we should do some clever long/int XOR operations to avoid many test/branches, like the optim we did in compare_ether_addr()) As shown in profiles, "repz cmpsb" is really a dog... (and its use of esi/edi/ecx registers a high pressure for the compiler/optimizer) Eric - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] drivers/net/sis900.c: Replacing yield() with a better alternative
*Please check if timeout should be increased* Signed-off-by: Amol Lad <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.19-rc1-orig/Documentation/dontdiff linux-2.6.19-rc1-orig/drivers/net/wireless/hostap/hostap_hw.c linux-2.6.19-rc1/drivers/net/wireless/hostap/hostap_hw.c --- linux-2.6.19-rc1-orig/drivers/net/wireless/hostap/hostap_hw.c 2006-09-21 10:15:37.0 +0530 +++ linux-2.6.19-rc1/drivers/net/wireless/hostap/hostap_hw.c2006-10-11 17:41:25.0 +0530 @@ -993,7 +993,7 @@ static u16 hfa384x_allocate_fid(struct n delay = jiffies + HFA384X_ALLOC_COMPL_TIMEOUT; while (!(HFA384X_INW(HFA384X_EVSTAT_OFF) & HFA384X_EV_ALLOC) && time_before(jiffies, delay)) - yield(); + schedule_timeout_interruptible(1); if (!(HFA384X_INW(HFA384X_EVSTAT_OFF) & HFA384X_EV_ALLOC)) { printk("%s: fid allocate, len=%d - timeout\n", dev->name, len); return 0x; @@ -1333,7 +1333,7 @@ static int prism2_hw_init(struct net_dev delay = jiffies + HFA384X_INIT_TIMEOUT; while (!(HFA384X_INW(HFA384X_EVSTAT_OFF) & HFA384X_EV_CMD) && time_before(jiffies, delay)) - yield(); + schedule_timeout_interruptible(1); if (!(HFA384X_INW(HFA384X_EVSTAT_OFF) & HFA384X_EV_CMD)) { printk(KERN_DEBUG "%s: assuming no Primary image in " "flash - card initialization not completed\n", - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] drivers/net/depca.c: Replacing yield() with a better alternative
In 2.6, the semantics of calling yield() changed from "sleep for a bit" to "I really don't want to run for a while". This matches POSIX better, but there's a lot of drivers still using yield() when they mean cond_resched(), schedule() or even schedule_timeout(). For this driver schedule_timeout_interruptible seems to be a better alternative Tested compile only Signed-off-by: Amol Lad <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.19-rc1-orig/Documentation/dontdiff linux-2.6.19-rc1-orig/drivers/net/depca.c linux-2.6.19-rc1/drivers/net/depca.c --- linux-2.6.19-rc1-orig/drivers/net/depca.c 2006-10-05 14:00:47.0 +0530 +++ linux-2.6.19-rc1/drivers/net/depca.c2006-10-11 17:57:02.0 +0530 @@ -738,7 +738,7 @@ static int __init depca_hw_init (struct interrupts. For now we will always get a DMA error. */ if (dev->irq < 2) { unsigned char irqnum; - unsigned long irq_mask, delay; + unsigned long irq_mask; irq_mask = probe_irq_on(); @@ -767,9 +767,7 @@ static int __init depca_hw_init (struct /* Trigger an initialization just for the interrupt. */ outw(INEA | INIT, DEPCA_DATA); - delay = jiffies + HZ/50; - while (time_before(jiffies, delay)) - yield(); + schedule_timeout_interruptible(jiffies + HZ/50); irqnum = probe_irq_off(irq_mask); - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] drivers/net/sb1000.c: Replacing yield() with a better alternative
cond_resched() seems to be a better alternative Signed-off-by: Amol Lad <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.19-rc1-orig/Documentation/dontdiff linux-2.6.19-rc1-orig/drivers/net/sb1000.c linux-2.6.19-rc1/drivers/net/sb1000.c --- linux-2.6.19-rc1-orig/drivers/net/sb1000.c 2006-10-05 14:00:48.0 +0530 +++ linux-2.6.19-rc1/drivers/net/sb1000.c 2006-10-11 17:57:02.0 +0530 @@ -264,7 +264,7 @@ card_wait_for_busy_clear(const int ioadd timeout = jiffies + TimeOutJiffies; while (a & 0x80 || a & 0x40) { /* a little sleep */ - yield(); + cond_resched(); a = inb(ioaddr[0] + 7); if (time_after_eq(jiffies, timeout)) { @@ -288,7 +288,7 @@ card_wait_for_ready(const int ioaddr[], timeout = jiffies + TimeOutJiffies; while (a & 0x80 || !(a & 0x40)) { /* a little sleep */ - yield(); + cond_resched(); a = inb(ioaddr[1] + 6); if (time_after_eq(jiffies, timeout)) { - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] drivers/net/sis900.c: Replacing yield() with a better alternative
*Please check if the timout value is ok or it should be increased* Signed-off-by: Amol Lad <[EMAIL PROTECTED]> --- diff -uprN -X linux-2.6.19-rc1-orig/Documentation/dontdiff linux-2.6.19-rc1-orig/drivers/net/sis900.c linux-2.6.19-rc1/drivers/net/sis900.c --- linux-2.6.19-rc1-orig/drivers/net/sis900.c 2006-10-05 14:00:48.0 +0530 +++ linux-2.6.19-rc1/drivers/net/sis900.c 2006-10-11 17:57:02.0 +0530 @@ -664,7 +664,7 @@ static int __init sis900_mii_probe(struc if(status & MII_STAT_LINK){ while (poll_bit) { - yield(); + schedule_timeout_interruptible(1); poll_bit ^= (mdio_read(net_dev, sis_priv->cur_phy, MII_STATUS) & poll_bit); if (time_after_eq(jiffies, timeout)) { - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 7278] New: forcedeth slowed down by traffic shaping
Jarek Poplawski wrote: > On 07-10-2006 02:22, Andrew Morton wrote: > >>On Fri, 6 Oct 2006 17:18:13 -0700 >>[EMAIL PROTECTED] wrote: >> >> >>>http://bugzilla.kernel.org/show_bug.cgi?id=7278 >>> >>> Summary: forcedeth slowed down by traffic shaping >>>Kernel Version: 2.6.16 >>>Status: NEW >>> Severity: normal >>> Owner: [EMAIL PROTECTED] >>> Submitter: [EMAIL PROTECTED] >>> >>> >>>When assigninig a simple Tocket Bucket Filter (TBF) classless trafic shaping >>>policy to the eth0 device (provided by forcedeth), the device's upstream >>>speed >>>drops to ~12kbyte/s, regardless of the traffic limit I set in the shaping >>>rule. >>>In fact, the traffic limit values could be larger than the connection's >>>throughput and it'll still limit to mere 12kbyte/s. Once I revert the rule to >>>pfifo_fast, the rate resumes to the connection's real one (64kbyte/s). >>> >>>If I set the TBF lower than 96kbit on eth0, it indeed applies; in other >>>words, I >>>can make the connection slower than 12kbyte/s but not faster. >>> >>>If I assign the same rule to ppp0, it works just as intended, for rules >>>higher >>>than 12kbyte/s, leading me to believe forcedeth has something to do with it. >>> >>>(Is there any module parameter I can try tweaking?) > > > Isn't it to unpolite to expect from the submitter some details > like this tbf rule or testing with other kernel version? The TBF command would be a good start to make sure its not a configuration error. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 7278] New: forcedeth slowed down by traffic shaping
On 07-10-2006 02:22, Andrew Morton wrote: > On Fri, 6 Oct 2006 17:18:13 -0700 > [EMAIL PROTECTED] wrote: > >> http://bugzilla.kernel.org/show_bug.cgi?id=7278 >> >>Summary: forcedeth slowed down by traffic shaping >> Kernel Version: 2.6.16 >> Status: NEW >> Severity: normal >> Owner: [EMAIL PROTECTED] >> Submitter: [EMAIL PROTECTED] >> >> >> When assigninig a simple Tocket Bucket Filter (TBF) classless trafic shaping >> policy to the eth0 device (provided by forcedeth), the device's upstream >> speed >> drops to ~12kbyte/s, regardless of the traffic limit I set in the shaping >> rule. >> In fact, the traffic limit values could be larger than the connection's >> throughput and it'll still limit to mere 12kbyte/s. Once I revert the rule to >> pfifo_fast, the rate resumes to the connection's real one (64kbyte/s). >> >> If I set the TBF lower than 96kbit on eth0, it indeed applies; in other >> words, I >> can make the connection slower than 12kbyte/s but not faster. >> >> If I assign the same rule to ppp0, it works just as intended, for rules >> higher >> than 12kbyte/s, leading me to believe forcedeth has something to do with it. >> >> (Is there any module parameter I can try tweaking?) Isn't it to unpolite to expect from the submitter some details like this tbf rule or testing with other kernel version? Jarek P. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] e1000: Real time packets and bytes statistics
Hi all, This patch is posted for review and comments. Let the e1000 driver report the most important statistics (rx/tx_bytes and rx/tx_packets) in real time, rather than every other second. This is similar to what the e100 driver is doing. The current asynchronous statistics refresh model makes it impossible to monitor the network traffic with an interval which isn't a multiple of 2 seconds. For example, an interval of 5 seconds would result in a sawtooth diagram (+20%, -20%) for a constant transfer rate. With a 1 second interval it's even worse (0, 200%) of course. This has been annoying users for years, but was never actually fixed: LKML thread "e1000 statistics timer", August 2003: http://lkml.org/lkml/2003/8/2/127 Bug filled against SUSE LINUX 9.0 Professional, September 2003: https://bugzilla.novell.com/show_bug.cgi?id=46464 (access is restricted, sorry about that) LKML thread "minor e1000 bug", December 2003: http://marc.theaimsgroup.com/?t=10719095701 Bug filled against Ubuntu Dapper Drake, August 2006: https://launchpad.net/distros/ubuntu/+source/linux-source-2.6.15/+bug/55989 Kernel bug #6986, August 2006: http://bugzilla.kernel.org/show_bug.cgi?id=6986 rx/tx_bytes will show slightly lower values than before, because the hardware appears to include the 4-byte ethernet frame CRC into the frame length, while the driver doesn't. It's probably OK as the e100, 3c59x and 8139too drivers don't include it either. I additionally noted a difference of 6 bytes on some TX frames, which I am not able to explain. It's probably small and rare enough not to be considered a problem, but if someone can explain it, I would be grateful. Signed-off-by: Jean Delvare <[EMAIL PROTECTED]> --- drivers/net/e1000/e1000_main.c | 14 ++ 1 file changed, 10 insertions(+), 4 deletions(-) --- linux-2.6.19-rc1.orig/drivers/net/e1000/e1000_main.c2006-10-11 10:53:49.0 +0200 +++ linux-2.6.19-rc1/drivers/net/e1000/e1000_main.c 2006-10-11 11:34:41.0 +0200 @@ -3118,6 +3118,8 @@ e1000_tx_map(adapter, tx_ring, skb, first, max_per_txd, nr_frags, mss)); + adapter->net_stats.tx_packets++; + adapter->net_stats.tx_bytes += skb->len; netdev->trans_start = jiffies; /* Make sure there is space in the ring for the next send. */ @@ -3384,10 +3386,8 @@ /* Fill out the OS statistics structure */ - adapter->net_stats.rx_packets = adapter->stats.gprc; - adapter->net_stats.tx_packets = adapter->stats.gptc; - adapter->net_stats.rx_bytes = adapter->stats.gorcl; - adapter->net_stats.tx_bytes = adapter->stats.gotcl; + /* rx/tx_packets and rx/tx_bytes are computed in real time by + the driver */ adapter->net_stats.multicast = adapter->stats.mprc; adapter->net_stats.collisions = adapter->stats.colc; @@ -3833,6 +3833,7 @@ ((uint32_t)(rx_desc->errors) << 24), le16_to_cpu(rx_desc->csum), skb); + length = skb->len; /* Save actual length for stats */ skb->protocol = eth_type_trans(skb, netdev); #ifdef CONFIG_E1000_NAPI if (unlikely(adapter->vlgrp && @@ -3853,6 +3854,8 @@ netif_rx(skb); } #endif /* CONFIG_E1000_NAPI */ + adapter->net_stats.rx_packets++; + adapter->net_stats.rx_bytes += length; netdev->last_rx = jiffies; next_desc: @@ -4009,6 +4012,7 @@ copydone: e1000_rx_checksum(adapter, staterr, le16_to_cpu(rx_desc->wb.lower.hi_dword.csum_ip.csum), skb); + length = skb->len; /* Save actual length for stats */ skb->protocol = eth_type_trans(skb, netdev); if (likely(rx_desc->wb.upper.header_status & @@ -4031,6 +4035,8 @@ netif_rx(skb); } #endif /* CONFIG_E1000_NAPI */ + adapter->net_stats.rx_packets++; + adapter->net_stats.rx_bytes += length; netdev->last_rx = jiffies; next_desc: Thanks, -- Jean Delvare - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Bug ? IF_RUNNING/routing table updates
On 09-10-2006 12:55, Shaun Kemp wrote: ... > An interface (+ connected IP network) which loses its IF_RUNNING flag (ie > unusable for routing) persists in the routing table as a kernel route. > Thus rather than responding to a dynamically announced route to this > connected network (the connected being unreachable due to the interface > being down, but the dynamic offering an alternate path), the box insists on > trying to route it out of the broken interface via this ?kernel? sourced > route. ... > The path for 192.168.0.192 is learned via 192.168.0.130 (current ospf dr - > irrelevant), but it'll never use it presumably (from Cisco experience) > because of the kernel sourced directly connected route still sitting in > there. Furthermore, if I then IFDOWN eth1, everything is fine but I don't > want to do this manually everytime there's an interface problem because > that's why we run ospf ! =:D > > Not sure whether this is a "driver tells the kernel" or a "kernel checks the > driver at {n} intervals" issue - I would suggest the former would be more > correct, but it is a problem regardless. > > Maybe it's just these Intel drivers ? :/ IMHO it should be done by the same part which changed the flag - for this purpose routing frontend is listening for notify events. Jarek P. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Pull request for 'jg-20061009-00' tag
(removed the forcedeth people from the Cc:) Jeff Garzik <[EMAIL PROTECTED]> : [...] > I merged the forcedeth change via another source. Could I pull from a > tag with just the r8169 change? Yes, you can pull from: git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6.git jg-20061011-00 -- Ueimor - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] d80211: extend extra_hdr_room to be a bytecount
Extend ieee80211_hw's extra_hdr_room to be a bytecount for a device specific TX header instead of being a hardcoded 0/2 byte choice. Signed-off-by: Michael Buesch <[EMAIL PROTECTED]> diff --git a/include/net/d80211.h b/include/net/d80211.h index a80f48b..9e9709f 100644 --- a/include/net/d80211.h +++ b/include/net/d80211.h @@ -476,10 +476,6 @@ struct ieee80211_hw { /* Force software encryption for TKIP packets if WMM is enabled. */ unsigned int no_tkip_wmm_hwaccel:1; - /* set if the payload needs to be padded at even boundaries after the -* header */ - unsigned int extra_hdr_room:1; - /* Some devices handle Michael MIC internally and do not include MIC in * the received packets passed up. device_strips_mic must be set * for such devices. The 'encryption' frame control bit is expected to @@ -496,6 +492,9 @@ struct ieee80211_hw { * i.e. more than one skb per frame */ unsigned int fraglist:1; + /* Set to the size of a needed device specific skb headroom for TX skbs. */ + unsigned int extra_hdr_room; + /* This is the time in us to change channels */ int channel_change_time; diff --git a/net/d80211/ieee80211.c b/net/d80211/ieee80211.c index 1ef2707..7d52c3b 100644 --- a/net/d80211/ieee80211.c +++ b/net/d80211/ieee80211.c @@ -1551,7 +1551,7 @@ static int ieee80211_subif_start_xmit(st * build in headroom in __dev_alloc_skb() (linux/skbuff.h) and * alloc_skb() (net/core/skbuff.c) */ - head_need = hdrlen + encaps_len + (local->hw->extra_hdr_room ? 2 : 0); + head_need = hdrlen + encaps_len + local->hw->extra_hdr_room; head_need -= skb_headroom(skb); /* We are going to modify skb data, so make a copy of it if happens to -- Greetings Michael. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
Quoting r. David Miller <[EMAIL PROTECTED]>: > Subject: Re: Dropping NETIF_F_SG since no checksum feature. > > From: "Michael S. Tsirkin" <[EMAIL PROTECTED]> > Date: Wed, 11 Oct 2006 11:05:04 +0200 > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > > data will be copied over rather than sent directly. > > So why does dev.c have to force set NETIF_F_SG to off then? > > Because it's more efficient to copy into a linear destination > buffer of an SKB than page sub-chunks when doing checksum+copy. > Thanks for the explanation. Obviously its true as long as you can allocate the skb that big. I think you won't realistically be able to get 64K in a linear SKB on a busy system, though, is not that right? OTOH, having large MTU (e.g. 64K) helps performance a lot since it reduces receive side processing overhead. So, if I understand what you are saying correctly, things do work correctly (just slower for small skb) if NETIF_F_SG is set bug clear, it seems that all we need to do is drop the following in dev.c: /* Fix illegal SG+CSUM combinations. */ if ((dev->features & NETIF_F_SG) && !(dev->features & NETIF_F_ALL_CSUM)) { printk(KERN_NOTICE "%s: Dropping NETIF_F_SG since no checksum feature.\n", dev->name); dev->features &= ~NETIF_F_SG; } is that right? -- MST - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing
On Wednesday 11 October 2006 03:46, Geoff Levand wrote: > > > > The others look good, but this one complicates the code and doesn't have > > any benefit. 20 > > for 21 isn't bad. > > Is the motivation for this change to improve performance by reducing the > overhead > of the mapping calls? If so, there may be some benefit for some systems. > Could > you please elaborate? > >From what I understand, this patch drastically reduces the number of I/O PTEs that are needed in the iommu. With the current static IOMMU mapping, it should only make a difference during initialization, but any platform that uses a dynamic mapping of iommu entries will benefit a lot from it, because: - the card can do better prefetching of consecutive memory - there are more I/O ptes available for other drivers. Arnd <>< - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
From: "Michael S. Tsirkin" <[EMAIL PROTECTED]> Date: Wed, 11 Oct 2006 11:05:04 +0200 > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > data will be copied over rather than sent directly. > So why does dev.c have to force set NETIF_F_SG to off then? Because it's more efficient to copy into a linear destination buffer of an SKB than page sub-chunks when doing checksum+copy. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
Hi, On Wed, Oct 11, 2006 at 11:05:04AM +0200, Michael S. Tsirkin wrote: > Quoting r. David Miller <[EMAIL PROTECTED]>: > > Subject: Re: Dropping NETIF_F_SG since no checksum feature. > > > > From: "Michael S. Tsirkin" <[EMAIL PROTECTED]> > > Date: Wed, 11 Oct 2006 02:13:38 +0200 > > > > > Maybe I can patch linux to allow SG without checksum? > > > Dave, maybe you could drop a hint or two on whether this is worthwhile > > > and what are the issues that need addressing to make this work? > > > > > > I imagine it's not just the matter of changing net/core/dev.c :). > > > > You can't, it's a quality of implementation issue. We sendfile() > > pages directly out of the filesystem page cache without any > > blocking of modifications to the page contents, and the only way > > that works is if the card computes the checksum for us. > > > > If we sendfile() a page directly, we must compute a correct checksum > > no matter what the contents. We can't do this on the cpu before the > > data hits the device because another thread of execution can go in and > > modify the page contents which would invalidate the checksum and thus > > invalidating the packet. We cannot allow this. > > > > Blocking modifications is too expensive, so that's not an option > > either. > > > I would argue that SG does make sense without checksum for protocols that don't need/use a checksum. DECnet for example could do zero-copy without caring about the checksum since it doesn't have one. One of these days I'll get around to writing that bit of code :-) > But copying still works fine, does it not? > Dave, could you clarify this please? > > ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, > size_t size, int flags) > { > ssize_t res; > struct sock *sk = sock->sk; > > if (!(sk->sk_route_caps & NETIF_F_SG) || > !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) > return sock_no_sendpage(sock, page, offset, size, flags); > > > So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, > data will be copied over rather than sent directly. > So why does dev.c have to force set NETIF_F_SG to off then? > I agree with that analysis, Steve. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Dropping NETIF_F_SG since no checksum feature.
Quoting r. David Miller <[EMAIL PROTECTED]>: > Subject: Re: Dropping NETIF_F_SG since no checksum feature. > > From: "Michael S. Tsirkin" <[EMAIL PROTECTED]> > Date: Wed, 11 Oct 2006 02:13:38 +0200 > > > Maybe I can patch linux to allow SG without checksum? > > Dave, maybe you could drop a hint or two on whether this is worthwhile > > and what are the issues that need addressing to make this work? > > > > I imagine it's not just the matter of changing net/core/dev.c :). > > You can't, it's a quality of implementation issue. We sendfile() > pages directly out of the filesystem page cache without any > blocking of modifications to the page contents, and the only way > that works is if the card computes the checksum for us. > > If we sendfile() a page directly, we must compute a correct checksum > no matter what the contents. We can't do this on the cpu before the > data hits the device because another thread of execution can go in and > modify the page contents which would invalidate the checksum and thus > invalidating the packet. We cannot allow this. > > Blocking modifications is too expensive, so that's not an option > either. > But copying still works fine, does it not? Dave, could you clarify this please? ssize_t tcp_sendpage(struct socket *sock, struct page *page, int offset, size_t size, int flags) { ssize_t res; struct sock *sk = sock->sk; if (!(sk->sk_route_caps & NETIF_F_SG) || !(sk->sk_route_caps & NETIF_F_ALL_CSUM)) return sock_no_sendpage(sock, page, offset, size, flags); So, it seems that if I set NETIF_F_SG but clear NETIF_F_ALL_CSUM, data will be copied over rather than sent directly. So why does dev.c have to force set NETIF_F_SG to off then? -- MST - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/4] beginning of 8390 fixes - generic and arm/etherh
applied 1-4 to #upstream (queued for 2.6.20). will let it simmer in -mm for a cycle, to make sure it gets some wide testing. Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Pull request for 'jg-20061009-00' tag
Francois Romieu wrote: Please pull from tag 'jg-20061009-00' in repository git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6.git jg-20061009-00 to get the changes below. Distance from 'upstream-fixes' - $ git rev-list 2f614fe04f4463ff22234133319067d7361f54e5..jg-20061009-00 91a6ba7162852092080b0e710dc0ba0f35496308 73f5e28b336772c4b08ee82e5bf28ab872898ee1 Diffstat drivers/net/forcedeth.c | 43 +++ drivers/net/r8169.c |1 + 2 files changed, 44 insertions(+), 0 deletions(-) I merged the forcedeth change via another source. Could I pull from a tag with just the r8169 change? Jeff - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-update][RFC] net: consolidated UDP / UDP-Lite code
| > csum_copy_err: | > - UDP_INC_STATS_BH(UDP_MIB_INERRORS); | > + UDP_INC_STATS_BH(UDP_MIB_INERRORS, is_udplite); | > + UDP_DEC_STATS_BH(UDP_MIB_INDATAGRAMS, is_udplite); | | I'm not a big fan at all of these "statistic corrections" | we're starting to place in various spots. I am not really fond of this solution either. It evolved via discussion, from a previous suggestion to place the increment of InDatagrams into udp_recvmsg(). The problem with that alternative was in dealing with applications which use the data_ready handler (such as sunrpc). | I really don't think it's the end of the world if we count as | INDATAGRAMS a packet that we later discover has a bad checksum. It would have been nice to say "all these counters count correctly". Maybe that has been over-ambitious - in tcp_ipv4.c I found that tcp_v4_rcv also counts incoming segments even if they are bad. I will restore the original state and remove the counter decrements in the next upcoming version of the patch. Thank you, -- Gerrit - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: is each frag of a skb always less than 1 page?
On 11-10-2006 01:26, Ronghua Zhang wrote: ... > The reason I asked this is that I saw the following code in forthdeth > drvier: > > #define NV_TX2_TSO_MAX_SHIFT) 14 > /* add fragments to entries count */ > for (i = 0; i < fragments; i++) { >entries += (skb_shinfo(skb)->frags[i].size >> NV_TX2_TSO_MAX_SHIFT) + >((skb_shinfo(skb)->frags[i].size & > (NV_TX2_TSO_MAX_SIZE-1)) ? 1 : 0); > } > > This looks unnecessary if each frag is guaranteed not to span pages. Even if PAGE_SIZE > NV_TX2_TSO_MAX_SIZE? Jarek P. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 21/21]: powerpc/cell spidernet DMA coalescing
On Tue, 2006-10-10 at 18:20 -0500, jschopp wrote: > Linas Vepstas wrote: > > The current driver code performs 512 DMA mappns of a bunch of > > 32-byte structures. This is silly, as they are all in contiguous > > memory. Ths patch changes the code to DMA map the entie area > > with just one call. > > > > Signed-off-by: Linas Vepstas <[EMAIL PROTECTED]> > > Cc: James K Lewis <[EMAIL PROTECTED]> > > Cc: Arnd Bergmann <[EMAIL PROTECTED]> > > The others look good, but this one complicates the code and doesn't have any > benefit. 20 > for 21 isn't bad. Hi Joel ! I'm not sure what you mean here (especially your 20 to 21 comment). The patch looks perfectly fine to me, and in fact removes more lines of code than it adds :) Cheers, Ben. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html