Re: [git patches] net driver fixes
On Friday 24 February 2006 06:22, Jeff Garzik wrote:
> Please pull from 'upstream-fixes' branch of
> master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git
>
> [...]
> Stephen Hemminger:
>       sky2: yukon-ec-u chipset initialization
>       sky2: limit coalescing values to ring size
>       sky2: poke coalescing timer to fix hang
>       sky2: force early transmit status
>       sky2: use device iomem to access PCI config
>       sky2: close race on IRQ mask update.
> [...]

Thanks for the update. I'm still seeing reproducible hangs with this version of sky2 (as reported in bugzilla 6084 and discussed on netdev).

Stephen, if there is anything I can do to narrow down my hangs a bit more systematically, please let me know, I'd be happy to help.

Wolfgang

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[git patches] net driver fixes
Please pull from 'upstream-fixes' branch of master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6.git to receive the following updates: drivers/net/r8169.c | 189 drivers/net/skge.c | 75 drivers/net/skge.h |1 drivers/net/sky2.c | 173 --- drivers/net/sky2.h | 85 --- drivers/net/tlan.c |2 6 files changed, 371 insertions(+), 154 deletions(-) Adrian Bunk: drivers/net/tlan.c: #ifdef CONFIG_PCI the PCI specific code Francois Romieu: r8169: fix broken ring index handling in suspend/resume r8169: enable wake on lan Stephen Hemminger: sky2: yukon-ec-u chipset initialization sky2: limit coalescing values to ring size sky2: poke coalescing timer to fix hang sky2: force early transmit status sky2: use device iomem to access PCI config sky2: close race on IRQ mask update. skge: NAPI/irq race fix skge: genesis phy initialzation skge: protect interrupt mask diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c index 6e10184..8cc0d0b 100644 --- a/drivers/net/r8169.c +++ b/drivers/net/r8169.c @@ -287,6 +287,20 @@ enum RTL8169_register_content { TxInterFrameGapShift = 24, TxDMAShift = 8, /* DMA burst value (0-7) is shift this many bits */ + /* Config1 register p.24 */ + PMEnable= (1 << 0), /* Power Management Enable */ + + /* Config3 register p.25 */ + MagicPacket = (1 << 5), /* Wake up when receives a Magic Packet */ + LinkUp = (1 << 4), /* Wake up when the cable connection is re-established */ + + /* Config5 register p.27 */ + BWF = (1 << 6), /* Accept Broadcast wakeup frame */ + MWF = (1 << 5), /* Accept Multicast wakeup frame */ + UWF = (1 << 4), /* Accept Unicast wakeup frame */ + LanWake = (1 << 1), /* LanWake enable/disable */ + PMEStatus = (1 << 0), /* PME status can be reset by PCI RST# */ + /* TBICSR p.28 */ TBIReset= 0x8000, TBILoopback = 0x4000, @@ -433,6 +447,7 @@ struct rtl8169_private { unsigned int (*phy_reset_pending)(void __iomem *); unsigned int (*link_ok)(void __iomem *); struct work_struct task; + unsigned wol_enabled : 1; }; MODULE_AUTHOR("Realtek and 
the Linux r8169 crew "); @@ -607,6 +622,80 @@ static void rtl8169_link_option(int idx, *duplex = p->duplex; } +static void rtl8169_get_wol(struct net_device *dev, struct ethtool_wolinfo *wol) +{ + struct rtl8169_private *tp = netdev_priv(dev); + void __iomem *ioaddr = tp->mmio_addr; + u8 options; + + wol->wolopts = 0; + +#define WAKE_ANY (WAKE_PHY | WAKE_MAGIC | WAKE_UCAST | WAKE_BCAST | WAKE_MCAST) + wol->supported = WAKE_ANY; + + spin_lock_irq(&tp->lock); + + options = RTL_R8(Config1); + if (!(options & PMEnable)) + goto out_unlock; + + options = RTL_R8(Config3); + if (options & LinkUp) + wol->wolopts |= WAKE_PHY; + if (options & MagicPacket) + wol->wolopts |= WAKE_MAGIC; + + options = RTL_R8(Config5); + if (options & UWF) + wol->wolopts |= WAKE_UCAST; + if (options & BWF) + wol->wolopts |= WAKE_BCAST; + if (options & MWF) + wol->wolopts |= WAKE_MCAST; + +out_unlock: + spin_unlock_irq(&tp->lock); +} + +static int rtl8169_set_wol(struct net_device *dev, struct ethtool_wolinfo *wol) +{ + struct rtl8169_private *tp = netdev_priv(dev); + void __iomem *ioaddr = tp->mmio_addr; + int i; + static struct { + u32 opt; + u16 reg; + u8 mask; + } cfg[] = { + { WAKE_ANY, Config1, PMEnable }, + { WAKE_PHY, Config3, LinkUp }, + { WAKE_MAGIC, Config3, MagicPacket }, + { WAKE_UCAST, Config5, UWF }, + { WAKE_BCAST, Config5, BWF }, + { WAKE_MCAST, Config5, MWF }, + { WAKE_ANY, Config5, LanWake } + }; + + spin_lock_irq(&tp->lock); + + RTL_W8(Cfg9346, Cfg9346_Unlock); + + for (i = 0; i < ARRAY_SIZE(cfg); i++) { + u8 options = RTL_R8(cfg[i].reg) & ~cfg[i].mask; + if (wol->wolopts & cfg[i].opt) + options |= cfg[i].mask; + RTL_W8(cfg[i].reg, options); + } + + RTL_W8(Cfg9346, Cfg9346_Lock); + + tp->wol_enabled = (wol->wolopts) ? 1 : 0; + + spin_unlock_irq(&tp->lock); + + return 0; +} + static void rtl8169_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *info) { @@ -1025,6 +1114,8 @@ static struct ethtool_ops rtl8169_ethtoo .get_tso
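The table-driven loop in rtl8169_set_wol above can be tried out in user space. The following is only a sketch under stated assumptions: the register file is a plain in-memory array (the REG_* indices and the wol_apply name are hypothetical stand-ins for RTL_R8/RTL_W8 on Config1/Config3/Config5), the register bit values are copied from the patch, and the WAKE_* values are the ones found in <linux/ethtool.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wake-on-LAN option bits, same values as in <linux/ethtool.h>. */
#define WAKE_PHY    (1u << 0)
#define WAKE_UCAST  (1u << 1)
#define WAKE_MCAST  (1u << 2)
#define WAKE_BCAST  (1u << 3)
#define WAKE_MAGIC  (1u << 5)
#define WAKE_ANY    (WAKE_PHY | WAKE_MAGIC | WAKE_UCAST | WAKE_BCAST | WAKE_MCAST)

/* Register bits copied from the patch. */
enum {
	PMEnable    = 1 << 0,	/* Config1 */
	MagicPacket = 1 << 5,	/* Config3 */
	LinkUp      = 1 << 4,	/* Config3 */
	BWF         = 1 << 6,	/* Config5 */
	MWF         = 1 << 5,	/* Config5 */
	UWF         = 1 << 4,	/* Config5 */
	LanWake     = 1 << 1	/* Config5 */
};

/* Hypothetical stand-in for the chip: three 8-bit registers in memory. */
enum wol_reg { REG_CONFIG1, REG_CONFIG3, REG_CONFIG5, REG_COUNT };

struct wol_map {
	uint32_t opt;		/* ethtool option(s) that turn the bit on */
	enum wol_reg reg;	/* register holding the bit */
	uint8_t mask;		/* the bit itself */
};

static const struct wol_map wol_cfg[] = {
	{ WAKE_ANY,   REG_CONFIG1, PMEnable },
	{ WAKE_PHY,   REG_CONFIG3, LinkUp },
	{ WAKE_MAGIC, REG_CONFIG3, MagicPacket },
	{ WAKE_UCAST, REG_CONFIG5, UWF },
	{ WAKE_BCAST, REG_CONFIG5, BWF },
	{ WAKE_MCAST, REG_CONFIG5, MWF },
	{ WAKE_ANY,   REG_CONFIG5, LanWake }
};

/* Clear every mapped bit, then set it again only if its option was requested. */
static void wol_apply(uint8_t regs[REG_COUNT], uint32_t wolopts)
{
	size_t i;

	for (i = 0; i < sizeof(wol_cfg) / sizeof(wol_cfg[0]); i++) {
		uint8_t v = regs[wol_cfg[i].reg] & (uint8_t)~wol_cfg[i].mask;

		if (wolopts & wol_cfg[i].opt)
			v |= wol_cfg[i].mask;
		regs[wol_cfg[i].reg] = v;
	}
}
```

Because PMEnable and LanWake are keyed on WAKE_ANY, requesting any single wake source also turns on power management and the LanWake pin, and requesting none clears both; unrelated register bits are left alone.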
[Patch 1/1] updated: TCP/UDP getpeersec
Hi,

Updated as per Herbert's comment.

Catherine

---
From: [EMAIL PROTECTED]

This patch implements an application of the LSM-IPSec networking controls whereby an application can determine the label of the security association its TCP or UDP sockets are currently connected to via getsockopt and the auxiliary data mechanism of recvmsg.

Patch purpose:

This patch enables a security-aware application to retrieve the security context of an IPSec security association a particular TCP or UDP socket is using. The application can then use this security context to determine the security context for processing on behalf of the peer at the other end of this connection. In the case of UDP, the security context is for each individual packet. An example application is the inetd daemon, which could be modified to start daemons running at security contexts dependent on the remote client.

Patch design approach:

- Design for TCP

The patch enables the SELinux LSM to set the peer security context for a socket based on the security context of the IPSec security association. The application may retrieve this context using getsockopt. When called, the kernel determines if the socket is a connected (TCP_ESTABLISHED) TCP socket and, if so, uses the dst_entry cache on the socket to retrieve the security associations. If a security association has a security context, the context string is returned, as for UNIX domain sockets.

- Design for UDP

Unlike TCP, UDP is connectionless. This requires a somewhat different API to retrieve the peer security context. With TCP, the peer security context stays the same throughout the connection, thus it can be retrieved at any time between when the connection is established and when it is torn down. With UDP, each read/write can have a different peer, and thus the security context might change every time. As a result, the security context retrieval must be done TOGETHER with the packet retrieval.
The solution is to build upon the existing Unix domain socket API for retrieving user credentials. Linux offers the API for obtaining user credentials via ancillary messages (i.e., out of band/control messages that are bundled together with a normal message).

Patch implementation details:

- Implementation for TCP

The security context can be retrieved by applications using getsockopt with the existing SO_PEERSEC flag. As an example (ignoring error checking):

    getsockopt(sockfd, SOL_SOCKET, SO_PEERSEC, optbuf, &optlen);
    printf("Socket peer context is: %s\n", optbuf);

The SELinux function, selinux_socket_getpeersec, is extended to check for labeled security associations for connected (TCP_ESTABLISHED == sk->sk_state) TCP sockets only. If so, the socket has a dst_cache of struct dst_entry values that may refer to security associations. If these have security associations with security contexts, the security context is returned. getsockopt returns a buffer that contains a security context string or the buffer is unmodified.

- Implementation for UDP

To retrieve the security context, the application first tells the kernel that it wants it by setting the IP_PASSSEC option via setsockopt. Then the application retrieves the security context using the auxiliary data mechanism. An example server application for UDP should look like this:

    toggle = 1;
    toggle_len = sizeof(toggle);

    setsockopt(sockfd, SOL_IP, IP_PASSSEC, &toggle, toggle_len);
    recvmsg(sockfd, &msg_hdr, 0);
    if (msg_hdr.msg_controllen > sizeof(struct cmsghdr)) {
        cmsg_hdr = CMSG_FIRSTHDR(&msg_hdr);
        if (cmsg_hdr->cmsg_len <= CMSG_LEN(sizeof(scontext)) &&
            cmsg_hdr->cmsg_level == SOL_IP &&
            cmsg_hdr->cmsg_type == SCM_SECURITY) {
            memcpy(&scontext, CMSG_DATA(cmsg_hdr), sizeof(scontext));
        }
    }

ip_setsockopt is enhanced with a new socket option, IP_PASSSEC, to allow a server socket to receive the security context of the peer. A new ancillary message type, SCM_SECURITY, is also added.
When the packet is received we get the security context from the sec_path pointer which is contained in the sk_buff, and copy it to the ancillary message space. An additional LSM hook, selinux_socket_getpeersec_udp, is defined to retrieve the security context from the SELinux space. The existing function, selinux_socket_getpeersec does not suit our purpose, because the security context is copied directly to user space, rather than to kernel space. Testing: We have tested the patch by setting up TCP and UDP connections between applications on two machines using the IPSec policies that result in labeled security associations being built. For TCP, we can then extract the peer security context using getsockopt on either end. For UDP, the receiving end can retrieve the security context using the auxiliary data mechanism of recvmsg. --- include/linux/in.h |1 include/linux/security.h| 25 +++--- include/linux/socket.h |1 net/core/sock.c |2 - net/ipv4/ip_sockglue
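As a slightly more defensive variant of the UDP receive example above, the following sketch walks all control messages rather than only the first one and bounds the copy by the length the kernel reported. It is an illustration under assumptions, not part of the patch: the helper names enable_passsec and peersec_from_cmsg are hypothetical, and the fallback values for IP_PASSSEC (18) and SCM_SECURITY (0x03) are the ones used by current Linux headers, assumed here only for libcs that do not define them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

#ifndef SOL_IP
#define SOL_IP IPPROTO_IP
#endif
#ifndef IP_PASSSEC
#define IP_PASSSEC 18		/* assumed: value from Linux <linux/in.h> */
#endif
#ifndef SCM_SECURITY
#define SCM_SECURITY 0x03	/* assumed: value from Linux <linux/socket.h> */
#endif

/* Ask the kernel to attach the peer security context to received datagrams. */
static int enable_passsec(int sockfd)
{
	int on = 1;

	/* The option length is passed by value, not by pointer. */
	return setsockopt(sockfd, SOL_IP, IP_PASSSEC, &on, sizeof(on));
}

/*
 * Hypothetical helper: find the SCM_SECURITY control message in msg and
 * copy its payload into out as a NUL-terminated string.
 * Returns the payload length, or -1 if absent or too large for out.
 */
static int peersec_from_cmsg(struct msghdr *msg, char *out, size_t outlen)
{
	struct cmsghdr *c;

	for (c = CMSG_FIRSTHDR(msg); c != NULL; c = CMSG_NXTHDR(msg, c)) {
		size_t len;

		if (c->cmsg_level != SOL_IP || c->cmsg_type != SCM_SECURITY)
			continue;
		if (c->cmsg_len < CMSG_LEN(0))
			return -1;
		len = c->cmsg_len - CMSG_LEN(0);
		if (len >= outlen)
			return -1;
		memcpy(out, CMSG_DATA(c), len);
		out[len] = '\0';
		return (int)len;
	}
	return -1;
}
```

A caller would invoke enable_passsec() once after creating the socket and peersec_from_cmsg() after each successful recvmsg(), since with UDP the context can differ per datagram.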
Re: [PATCH] iproute2 -- add fwmarkmask
Michael Richardson wrote:
>
> >>>"Patrick" == Patrick McHardy <[EMAIL PROTECTED]> writes:
>
>     Patrick> The normal way to display masks is with a "/". Also I think
>     Patrick> it shouldn't display the default mask to avoid breaking
>     Patrick> scripts that parse the output.
>
> I generally dislike the /VALUE, since I expect /PREFIX-LEN.
> I agree that it shouldn't show if it is default.
>
>     Patrick> ip should be able to parse its own output, and it would
>     Patrick> also look nicer if I could just say "fwmark
>     Patrick> 0x1/32". fwmarkmask is really an incredibly ugly expression
>     Patrick> :)
>
> Sure. Is that a 32-bit long mask (0xfff), or is it a 0x0020?
> fwmark is not an address.
>
> Or would you like /32 to be a prefix-based mask, and &value and/or
> fwmarkmask to be a value?

That was not the greatest example :) I think it should be a bitmask.
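To make the bitmask reading concrete, here is a small sketch of parsing a "VALUE/MASK" selector and matching a packet mark against it. It is not taken from the iproute2 patch under discussion: the struct and function names are hypothetical, and the choice that a missing "/MASK" means an all-ones mask is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical selector: match when (mark & mask) == (value & mask). */
struct fwmark_sel {
	uint32_t value;
	uint32_t mask;
};

/*
 * Parse "VALUE" or "VALUE/MASK", both accepted in decimal, octal or hex.
 * The part after '/' is taken as a bitmask, not as a prefix length.
 * Returns 0 on success, -1 on malformed input.
 */
static int fwmark_parse(const char *s, struct fwmark_sel *out)
{
	char *end;
	unsigned long v, m = 0xFFFFFFFFul;	/* assumed default: compare all bits */

	errno = 0;
	v = strtoul(s, &end, 0);
	if (end == s || errno || v > 0xFFFFFFFFul)
		return -1;
	if (*end == '/') {
		const char *ms = end + 1;

		m = strtoul(ms, &end, 0);
		if (end == ms || errno || m > 0xFFFFFFFFul)
			return -1;
	}
	if (*end != '\0')
		return -1;
	out->value = (uint32_t)v;
	out->mask = (uint32_t)m;
	return 0;
}

/* Nonzero when the packet mark matches the selector under its mask. */
static int fwmark_match(const struct fwmark_sel *sel, uint32_t mark)
{
	return (mark & sel->mask) == (sel->value & sel->mask);
}
```

Under this reading, "0x1/32" yields the mask 0x20 rather than a 32-bit prefix, which is exactly the ambiguity raised in the quoted text; writing the mask in hex ("0x1/0xff") avoids the confusion.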
Re: [PATCH 02/02] add mask options to fwmark masking code
Michael Richardson wrote:
>
> >>>"Patrick" == Patrick McHardy <[EMAIL PROTECTED]> writes:
>
>     >> #define RTA_FWMARK RTA_PROTOINFO +#define RTA_FWMARK_MASK
>     >> RTA_CACHEINFO
>
>     Patrick> Please introduce a new attribute for this instead of
>     Patrick> overloading RTA_CACHEINFO.
>
> I would be happy to do that.
> Should I also un-overload FWMARK, with backwards compatibility?

No, that one is fine since it doesn't already have a different meaning.
Re: Problem with Ipsec transport mode over NAT
Chinh Nguyen wrote: > Patrick McHardy wrote: > >>Netfilter recalculates the checksum when NATing it. > > > The NATing is not done by netfilter but by the NAT device between the IPsec > peers. I see, so the TCP checksum includes the wrong IPs. > [Linux ipsec client C] -- [NAT device] -- [Linux ipsec server S] > > C negotiates a IPsec Transport Mode with S. Because of Transport Mode/NAT-T, 2 > things happen to an IPsec packet. > > 1. It is UDP-encapsulated, typically on port 4500/udp. > 2. Transport Mode traffic leaves the original IP header alone whereas tunnel > mode wraps the entire traffic in a second IP header. As such, when the packet > passes through the NAT device, the source IP is N. However, the original > unencrypted packet had source IP C. > > S rips off the UDP-encap header, decrypts the payload, and "joins" the content > back to the IP header. If the decrypted content is UDP or TCP, the UDP/TCP > checksum is now incorrect because the source IP is now N not C. > > (In tunnel mode, we would ignore the NAT-ted outer IP header because the > decrypted content has an entire IP header + UDP/TCP etc) > > This is a well-known problem with transport mode/NAT. One solution is to use > NAT-OA and NAT-OR to recalculate the checksum. The linux kernel does the > simpler > thing of ignoring the UDP/TCP checksum altogether in this particular case: > > function esp_post_input (net/ipv4/esp4.c) > 290 /* > 291 * 2) ignore UDP/TCP checksums in case > 292 *of NAT-T in Transport Mode, or > 293 *perform other post-processing fixes > 294 *as per * draft-ietf-ipsec-udp-encaps-06, > 295 *section 3.1.2 > 296 */ > 297 if (!x->props.mode) > 298 skb->ip_summed = CHECKSUM_UNNECESSARY; > 299 > 300 break; > > > As noted, esp_post_input is called in xfrm4_policy_check. Decrypted UDP > traffic > through transport mode/nat also has bad checksums. 
However, since it is passed > through udp_queue_rcv_skb after decryption, and this function calls > xfrm4_policy_check before checking the UDP checksum, line 298 means the kernel > ignores the bad checksum. > > Decrypted TCP traffic has bad checksums too. But since tcp_v4_rcv checks the > TCP > checksum before calling xfrm4_policy_check, the bad checksum means the TCP > packet is dropped as a bad segment. > > The end result is that UDP and other traffic (eg, ICMP) can pass through > transport mode/nat but not TCP. > > I don't know what correct fix is. Adding an extra call to xfrm4_policy_check > in > tcp_v4_rcv before the checksum check fixes this problem and doesn't seem to > break anything else. On the other hand, moving some of the code in > esp_post_input into esp_input (especially line 298) will work, too.

So we could either move checksum validation behind xfrm4_policy_check, or set ip_summed to CHECKSUM_UNNECESSARY already in esp_input. Setting ip_summed in esp_input looks easier.

But this still leaves one problem. With netfilter and local NAT, a decapsulated transport mode packet might be forwarded to another host. In that case the checksum contained in the packet is invalid. Any ideas how to fix this, anyone?
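The reason the decrypted TCP/UDP checksum goes bad is that it covers a pseudo-header containing the source and destination IP addresses. The following user-space sketch computes the standard Internet checksum over an IPv4 pseudo-header plus segment, to show that a segment checksummed with source C no longer verifies once the header says N. The function names and the addresses used to exercise it are illustrative assumptions, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes to a running one's-complement sum, as big-endian 16-bit words. */
static uint32_t csum_add(uint32_t sum, const uint8_t *p, size_t len)
{
	while (len > 1) {
		sum += ((uint32_t)p[0] << 8) | p[1];
		p += 2;
		len -= 2;
	}
	if (len)			/* odd trailing byte is padded with zero */
		sum += (uint32_t)p[0] << 8;
	return sum;
}

/* Fold the carries back in and return the one's complement. */
static uint16_t csum_finish(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}

/*
 * Checksum of a TCP or UDP segment including the IPv4 pseudo-header
 * (source address, destination address, zero, protocol, length).
 * Over a segment whose checksum field is already filled in correctly,
 * the result is 0.
 */
static uint16_t l4_checksum(const uint8_t src[4], const uint8_t dst[4],
			    uint8_t proto, const uint8_t *seg, size_t len)
{
	uint8_t pseudo[12] = {
		src[0], src[1], src[2], src[3],
		dst[0], dst[1], dst[2], dst[3],
		0, proto, (uint8_t)(len >> 8), (uint8_t)(len & 0xFF)
	};
	uint32_t sum = csum_add(0, pseudo, sizeof(pseudo));

	return csum_finish(csum_add(sum, seg, len));
}
```

Recomputing with the address the sender actually used gives 0 (valid), while recomputing with the NAT address gives a nonzero result. That is why the receiver has to either skip the check (the CHECKSUM_UNNECESSARY approach) or know the original address, and why forwarding the decapsulated packet unchanged to another host passes the stale checksum along.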
Re: [TCP 2.6.16-rc3] window scaling disabled issue?
On 2/21/06, David S. Miller <[EMAIL PROTECTED]> wrote:
> From: Rick Jones <[EMAIL PROTECTED]>
> Date: Tue, 21 Feb 2006 17:21:30 -0800
>
> > My point (perhaps not as well expressed as the one on the top of my
> > head :) was that if 2.4 is "OK" with extending the window beyond
> > 32767 without adding additional semantics on those options, why
> > should 2.6 need to?
>
> 2.4.x has the same window limiting code, if it isn't limiting the
> window it's either a bug or a local change the person reporting
> that made.

It's definitely not a local change that *I* made. Unless Red Hat made that change to their kernel for some reason. I'm running that 2.4.21-27 kernel from Red Hat Enterprise on a power system. The 2.4 machine had window scaling enabled but didn't advertise or use it when tcp_window_scaling was off on the 2.6 side.

I finally got 2.4.32 to compile and it still ramps nicely to a 64k receive window, while the 2.6 kernel limits itself to 32767 when receiving. Keep in mind this is with tcp_window_scaling = 0 and tcp_adv_win_scale = 0 on the 2.6 kernel side. I made no stack config changes on the 2.4.32 side.

Just for grins I left the window scaling settings at default and I noticed that the 2.4.32 kernel replies (and advertises with SYN) with wscale 0 in the SYNACK. Is that correct?

So I would say the 2.6 kernel with default settings is working okay but is *not* the same as vanilla 2.4.32 when window scaling is disabled.

Jesse

PS here are the mini-dumps

*** 2.6 sending to 2.4
19:04:50.431251 arp who-has 10.0.1.7 tell 10.0.1.9 19:04:50.431500 arp reply 10.0.1.7 is-at 00:07:e9:03:68:61 19:04:50.431514 IP 10.0.1.9.56210 > 10.0.1.7.12865: S 946995500:946995500(0) win 5840 19:04:50.431873 IP 10.0.1.7.12865 > 10.0.1.9.56210: S 3054767463:3054767463(0) ack 946995501 win 5792 19:04:50.431914 IP 10.0.1.9.56210 > 10.0.1.7.12865: . 
ack 1 win 5840 19:04:50.443776 IP 10.0.1.9.56210 > 10.0.1.7.12865: P 1:257(256) ack 1 win 5840 19:04:50.444119 IP 10.0.1.7.12865 > 10.0.1.9.56210: . ack 257 win 6432 19:04:50.447120 IP 10.0.1.7.12865 > 10.0.1.9.56210: P 1:257(256) ack 257 win 6432 19:04:50.447129 IP 10.0.1.9.56210 > 10.0.1.7.12865: . ack 257 win 6432 19:04:50.447159 IP 10.0.1.9.53371 > 10.0.1.7.32777: S 938580246:938580246(0) win 5840 19:04:50.447369 IP 10.0.1.7.32777 > 10.0.1.9.53371: S 3061241349:3061241349(0) ack 938580247 win 5792 19:04:50.447380 IP 10.0.1.9.53371 > 10.0.1.7.32777: . ack 1 win 5840 19:04:50.447422 IP 10.0.1.9.53371 > 10.0.1.7.32777: . 1:2897(2896) ack 1 win 5840 19:04:50.447619 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 1449 win 8688 19:04:50.447630 IP 10.0.1.9.53371 > 10.0.1.7.32777: P 2897:5793(2896) ack 1 win 5840 19:04:50.447638 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 2897 win 11584 19:04:50.447645 IP 10.0.1.9.53371 > 10.0.1.7.32777: . 5793:8689(2896) ack 1 win 5840 19:04:50.447869 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 4345 win 14480 19:04:50.447877 IP 10.0.1.9.53371 > 10.0.1.7.32777: P 8689:11585(2896) ack 1 win 5840 19:04:50.447883 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 5793 win 17376 19:04:50.447890 IP 10.0.1.9.53371 > 10.0.1.7.32777: P 11585:14481(2896) ack 1 win 5840 19:04:50.447897 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 7241 win 20272 19:04:50.447902 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 8689 win 23168 19:04:50.447921 IP 10.0.1.9.53371 > 10.0.1.7.32777: . 14481:15929(1448) ack 1 win 5840 19:04:50.447927 IP 10.0.1.9.53371 > 10.0.1.7.32777: P 15929:16385(456) ack 1 win 5840 19:04:50.447944 IP 10.0.1.9.53371 > 10.0.1.7.32777: . 16385:19281(2896) ack 1 win 5840 19:04:50.448118 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 10137 win 26064 19:04:50.448126 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 11585 win 28960 19:04:50.448135 IP 10.0.1.9.53371 > 10.0.1.7.32777: . 19281:25073(5792) ack 1 win 5840 19:04:50.448142 IP 10.0.1.7.32777 > 10.0.1.9.53371: . 
ack 13033 win 31856 19:04:50.448147 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 14481 win 34752 19:04:50.448157 IP 10.0.1.9.53371 > 10.0.1.7.32777: . 25073:30865(5792) ack 1 win 5840 19:04:50.448163 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 15929 win 37648 19:04:50.448245 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 16385 win 37648 19:04:50.448255 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 17833 win 40544 19:04:50.448261 IP 10.0.1.9.53371 > 10.0.1.7.32777: . 30865:38105(7240) ack 1 win 5840 19:04:50.448269 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 19281 win 43440 19:04:50.448285 IP 10.0.1.9.53371 > 10.0.1.7.32777: . 38105:41001(2896) ack 1 win 5840 19:04:50.448372 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 20729 win 46336 19:04:50.448381 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 22177 win 49232 19:04:50.448493 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 23625 win 52128 19:04:50.448502 IP 10.0.1.9.53371 > 10.0.1.7.32777: . 41001:49689(8688) ack 1 win 5840 19:04:50.448508 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 25073 win 55024 19:04:50.448515 IP 10.0.1.7.32777 > 10.0.1.9.53371: . ack 26521 win
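For readers following the numbers in the dump: the "win" value is the 16-bit window field of the TCP header, so without window scaling it can never exceed 65535, and a stack that additionally caps it at 32767 (as reported for the 2.6 receiver in this thread) stops growing earlier. The sketch below only illustrates that arithmetic; the function and macro names are hypothetical, and it is not the kernel's window selection code.

```c
#include <assert.h>
#include <stdint.h>

#define WIN_FIELD_MAX   65535u	/* largest value the 16-bit window field holds */
#define WIN_CAUTIOUS    32767u	/* cap reported for the 2.6 side in this thread */

/* Value placed in the header for `space` bytes of buffer, given shift and cap. */
static uint16_t win_field(uint32_t space, unsigned int shift, uint32_t cap)
{
	uint32_t w = space >> shift;

	if (w > cap)
		w = cap;
	return (uint16_t)w;
}

/* Window the peer may actually use: the header field scaled back up. */
static uint32_t win_effective(uint16_t field, unsigned int shift)
{
	return (uint32_t)field << shift;
}
```

With 128 KB of buffer space and no scaling, the uncapped field saturates at 65535 and the cautious one at 32767; with a negotiated shift of 2 the same buffer is advertised as 32768, which the peer scales back to 131072.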
Re: [PATCH]IPv4 UDP does not discard the datagram with invalid checksum
Under IPv4, when I send a UDP packet with an invalid checksum, the kernel uses udp_rcv() to pass the packet up to the UDP layer, and the application uses udp_recvmsg() to receive the message. So when a UDP packet with an invalid checksum arrives at the host, UDP_MIB_INDATAGRAMS is incremented by 1, and UDP_MIB_INERRORS should be incremented by 1 as well.

    int udp_rcv(struct sk_buff *skb)
    {
        ...
        udp_queue_rcv_skb();
        ...
    }

    static int udp_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
    {
        ...
        if (sk->sk_filter && skb->ip_summed != CHECKSUM_UNNECESSARY) {
            if (__udp_checksum_complete(skb)) {
                UDP_INC_STATS_BH(UDP_MIB_INERRORS);
                kfree_skb(skb);
                return -1;
            }
            skb->ip_summed = CHECKSUM_UNNECESSARY;
        }
        UDP_INC_STATS_BH(UDP_MIB_INDATAGRAMS);
        ...
    }

    static int udp_recvmsg(...)
    {
        ...
    csum_copy_err:
        UDP_INC_STATS_BH(UDP_MIB_INERRORS);
        ...
    }

In my test, I sent an IPv4 UDP packet with an invalid checksum to echo-udp, and found the following message in /var/log/messages:

    xinetd[4468]: service echo-dgram, recvfrom: Resource temporarily unavailable (errno = 11)

UDP_MIB_INDATAGRAMS increased by 1 and UDP_MIB_INERRORS increased by 0. Does xinetd use some function other than udp_recvmsg() to receive the message?

The other question is: why is a packet with an invalid checksum discarded early only when sk->sk_filter is set?

By the way, under IPv6 a packet with an invalid checksum is discarded in udpv6_rcv(), so when such a packet arrives at an IPv6 host, UDP_MIB_INDATAGRAMS is incremented by 0 and UDP_MIB_INERRORS should be incremented by 1.

    static int udpv6_rcv(struct sk_buff **pskb, unsigned int *nhoffp)
    {
        ...
        udpv6_queue_rcv_skb();
        ...
    }

    static inline int udpv6_queue_rcv_skb(struct sock * sk, struct sk_buff *skb)
    {
        ...
        if (skb->ip_summed != CHECKSUM_UNNECESSARY) {
            if ((unsigned short)csum_fold(skb_checksum(skb, 0, skb->len, skb->csum))) {
                UDP6_INC_STATS_BH(UDP_MIB_INERRORS);
                kfree_skb(skb);
                return 0;
            }
            skb->ip_summed = CHECKSUM_UNNECESSARY;
        }
        ...
        UDP6_INC_STATS_BH(UDP_MIB_INDATAGRAMS);
        ...
    }

So when the same packet with an invalid checksum arrives at an IPv4 host and at an IPv6 host, UDP_MIB_INDATAGRAMS and UDP_MIB_INERRORS are incremented differently. Are the two counters defined differently for IPv4 and IPv6?

> > IPv4 UDP does not discard the datagram with invalid checksum. UDP can
> > validate UDP checksums correctly only when socket filtering instructions
> > is set. If socket filtering instructions is not set, datagram with
> > invalid checksum will be passed to the application.
>
> We check the checksum later, in parallel with the copy of
> the packet data into userspace.
>
> See udp_recvmsg(), where we do this:
>
>       if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
>               err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
>                                             copied);
>       } else if (msg->msg_flags&MSG_TRUNC) {
>               if (__udp_checksum_complete(skb))
>                       goto csum_copy_err;
>               err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
>                                             copied);
>       } else {
>               err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov);
>
>               if (err == -EINVAL)
>                       goto csum_copy_err;
>       }
[PATCH] Some state changes not be counted to TCP_MIB_ATTEMPTFAILS
According to RFC 2012, tcpAttemptFails is defined as follows:

tcpAttemptFails OBJECT-TYPE
    SYNTAX      Counter32
    MAX-ACCESS  read-only
    STATUS      current
    DESCRIPTION
            "The number of times TCP connections have made a direct
            transition to the CLOSED state from either the SYN-SENT
            state or the SYN-RCVD state, plus the number of times TCP
            connections have made a direct transition to the LISTEN
            state from the SYN-RCVD state."
    ::= { tcp 7 }

State changes from SYN-RCVD to CLOSED, from SYN-SENT to CLOSED and from SYN-RCVD to LISTEN should be counted in TCP_MIB_ATTEMPTFAILS. The following state changes are not counted in TCP_MIB_ATTEMPTFAILS by the kernel.

SYN-SENT state => CLOSED

    TCP A                TCP B
1.  LISTEN               CLOSED
2.  <-- --> SYN-SENT
3.  --> SEQ=X> --> CLOSED

SYN-RECEIVED state(came from SYN-SENT state) => CLOSED

    TCP A                TCP B
1.  LISTEN               CLOSED
2.  <-- --> SYN-SENT
3.  --> SYN-SENT
4.  <----> SYN-RECEIVED
3.  -->--> CLOSED

SYN-RECEIVED state(came from SYN-SENT state) => CLOSED

    TCP A                TCP B
1.  LISTEN               CLOSED
2.  <-- --> SYN-SENT
3.  --> SYN-SENT
4.  <----> SYN-RECEIVED
3.  -->--> CLOSED

SYN-RECEIVED state => LISTEN

    TCP A                TCP B
1.  LISTEN               LISTEN
2.  ... --> SYN-RECEIVED
3.  (??) <--<-- SYN-RECEIVED
4.  --> --> (return to LISTEN!)
5.  LISTEN               LISTEN

SYN-RECEIVED state => LISTEN

    TCP A                TCP B
1.  LISTEN               LISTEN
2.  ... --> SYN-RECEIVED
3.  (??) <--<-- SYN-RECEIVED
4.  --> --> (return to LISTEN!)
5.  LISTEN               LISTEN

Patch against kernel 2.6.15.4 follows:

diff -Nur a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c 2006-02-23 09:20:24.659262056 +0900
+++ b/net/ipv4/tcp_input.c 2006-02-23 09:28:50.772321176 +0900
@@ -4003,6 +4003,7 @@
 		 */
 		if (th->rst) {
+			TCP_INC_STATS_BH(TCP_MIB_ATTEMPTFAILS);
 			tcp_reset(sk);
 			goto discard;
 		}
@@ -4290,6 +4291,8 @@
 	/* step 2: check RST bit */
 	if(th->rst) {
+		if(sk->sk_state == TCP_SYN_RECV)
+			TCP_INC_STATS_BH(TCP_MIB_ATTEMPTFAILS);
 		tcp_reset(sk);
 		goto discard;
 	}
@@ -4303,6 +4306,8 @@
 	 *	Check for a SYN in window.
 	 */
 	if (th->syn && !before(TCP_SKB_CB(skb)->seq, tp->rcv_nxt)) {
+		if(sk->sk_state == TCP_SYN_RECV)
+			TCP_INC_STATS_BH(TCP_MIB_ATTEMPTFAILS);
 		NET_INC_STATS_BH(LINUX_MIB_TCPABORTONSYN);
 		tcp_reset(sk);
 		return 1;
diff -Nur a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
--- a/net/ipv4/tcp_minisocks.c 2006-02-23 09:20:24.660261904 +0900
+++ b/net/ipv4/tcp_minisocks.c 2006-02-23 09:26:07.432152656 +0900
@@ -591,8 +591,10 @@
 	/* RFC793: "second check the RST bit" and
 	 *	   "fourth, check the SYN bit"
 	 */
-	if (flg & (TCP_FLAG_RST|TCP_FLAG_SYN))
+	if (flg & (TCP_FLAG_RST|TCP_FLAG_SYN)) {
+		TCP_INC_STATS_BH(TCP_MIB_ATTEMPTFAILS);
 		goto embryonic_reset;
+	}
 
 	/* ACK sequence verified above, just make sure ACK is
 	 * set. If ACK not set, just silently drop the packet.
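The RFC 2012 wording quoted at the top of this message reduces to a small predicate over (old state, new state) pairs. The sketch below spells it out in user-space C; the enum and function names are hypothetical and do not correspond to kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state names; only the ones relevant to the definition. */
enum conn_state {
	ST_CLOSED,
	ST_LISTEN,
	ST_SYN_SENT,
	ST_SYN_RCVD,
	ST_ESTABLISHED
};

/*
 * True when a direct transition from `from` to `to` should bump
 * tcpAttemptFails according to the RFC 2012 description:
 *   SYN-SENT -> CLOSED, SYN-RCVD -> CLOSED, SYN-RCVD -> LISTEN.
 */
static bool counts_as_attempt_fail(enum conn_state from, enum conn_state to)
{
	if (to == ST_CLOSED)
		return from == ST_SYN_SENT || from == ST_SYN_RCVD;
	if (to == ST_LISTEN)
		return from == ST_SYN_RCVD;
	return false;
}
```

Each hunk of the patch corresponds to one of the three true cases; a reset of an established connection, for instance, is not an attempt failure under this definition.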
Re: [PATCH 0/3] skge: patches for 2.6.16
Francois Romieu wrote:
Stephen Hemminger <[EMAIL PROTECTED]> :
Bug fix patches to skge driver that need to go in 2.6.16.
Some of them are in -mm and some have already been sent (and ignored).

#1..#3 Applied to branch 'for-jeff' at
git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6.git

Shortlog

$ git rev-list --pretty master..HEAD | git shortlog
Francois Romieu:
      r8169: fix broken ring index handling in suspend/resume
      r8169: enable wake on lan

Stephen Hemminger:
      sky2: yukon-ec-u chipset initialization
      sky2: limit coalescing values to ring size
      sky2: poke coalescing timer to fix hang
      sky2: force early transmit status
      sky2: use device iomem to access PCI config
      sky2: close race on IRQ mask update.
      skge: NAPI/irq race fix
      skge: genesis phy initialzation
      skge: protect interrupt mask

pulled, thanks. It definitely makes things easier, if the patches are rolled up like this.

Jeff
[PATCH] ip6_tunnel: release cached dst on change of tunnel params
Hi,

The included patch fixes ip6_tunnel to release the cached dst entry when the tunnel parameters (such as tunnel endpoints) are changed, so that the new parameters are used immediately for the next encapsulated packets.

Signed-off-by: Hugo Santos <[EMAIL PROTECTED]>

--- linux-2.6.16-rc4/net/ipv6/ip6_tunnel.c 2006-02-17 22:23:45.0 +
+++ linux-2.6.16-rc4-new/net/ipv6/ip6_tunnel.c 2006-02-24 01:40:17.0 +
@@ -884,6 +884,7 @@ ip6ip6_tnl_change(struct ip6_tnl *t, str
 	t->parms.encap_limit = p->encap_limit;
 	t->parms.flowinfo = p->flowinfo;
 	t->parms.link = p->link;
+	ip6_tnl_dst_reset(t);
 	ip6ip6_tnl_link_config(t);
 	return 0;
 }
Re: [PATCH] pktgen: fix races between control/worker threads
From: Robert Olsson <[EMAIL PROTECTED]>
Date: Wed, 22 Feb 2006 19:47:13 +0100

>
> Jesse Brandeburg writes:
>  >
>  > I looked quickly at this on a couple different machines and wasn't
>  > able to reproduce, so don't let me block the patch. I think it's a
>  > good patch FWIW
>
> OK!
> We ask Dave to apply it.

Applied to net-2.6.17, thanks.
Re: [PATCH 00/01] pktgen: Lindent run.
From: Luiz Fernando Capitulino <[EMAIL PROTECTED]>
Date: Mon, 23 Jan 2006 13:44:19 -0200

>
> This patch is not in-lined because it's 120K bytes long, you can find it at:
>
> http://www.cpu.eti.br/patches/pktgen_lindent_1.patch

Not found:

[EMAIL PROTECTED]:~/src/GIT/net-2.6.17$ wget http://www.cpu.eti.br/patches/pktgen_lindent_1.patch
--17:16:50-- http://www.cpu.eti.br/patches/pktgen_lindent_1.patch
           => `pktgen_lindent_1.patch'
Resolving www.cpu.eti.br... 209.59.143.183
Connecting to www.cpu.eti.br|209.59.143.183|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
17:16:50 ERROR 404: Not Found.

Anyways, can you please regenerate these 4 patches against net-2.6.17, as I put in Arthur's race fix and it will certainly conflict with these.

Sorry for taking so long to get to this :-(
Re: (usagi-users 03614) Re: IPv6 setsockopt software MTU patch
From: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>
Date: Fri, 24 Feb 2006 00:23:51 +0900 (JST)

> David, please apply. Thank you.

Can you please resend the patch with a full changelog entry and Signed-off-by lines for me? Thank you.

This is for net-2.6 right? Or net-2.6.17?

Thanks again.
ip6_tunnel keeping dst_cache after change of params
Hi,

ip6_tunnel keeps a cached dst (dst_cache in ip6_tnl) per tunnel instance. This cached dst is re-used as long as it is not marked obsolete. A change of the tunnel's parameters (via SIOCCHGTUNNEL) does not invalidate the dst_cache directly, which results in it still being used by ip6ip6_tnl_xmit after the tunnel has been configured with new parameters.

Shouldn't ip6ip6_tnl_change dst_release() the cached dst and leave ip6ip6_tnl_xmit to pick a new one based on the new local/remote addresses etc?

I can provide a patch to fix this; meanwhile I just wanted to confirm the expected behaviour.

Thanks,
Hugo
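The stale-cache behaviour described above can be reproduced with a toy model. This is a deliberately simplified sketch: the toy_* names are hypothetical, the "route" is reduced to the remote endpoint it was computed for, and nothing here is the real ip6_tunnel code.

```c
#include <assert.h>

/* Toy tunnel: a configured remote plus a cached "route" computed from it. */
struct toy_tunnel {
	int remote;		/* currently configured endpoint */
	int cache_valid;	/* nonzero while the cached route may be reused */
	int cached_remote;	/* endpoint the cached route was computed for */
};

/* Transmit path: reuse the cached route if present, else look one up. */
static int toy_xmit_remote(struct toy_tunnel *t)
{
	if (!t->cache_valid) {
		t->cached_remote = t->remote;	/* stands in for a route lookup */
		t->cache_valid = 1;
	}
	return t->cached_remote;
}

/* Parameter change; reset_cache selects the behaviour with or without the fix. */
static void toy_change(struct toy_tunnel *t, int remote, int reset_cache)
{
	t->remote = remote;
	if (reset_cache)
		t->cache_valid = 0;
}
```

Without the reset, packets keep going to the endpoint the cache was built for even though the configuration already names a different one; with the reset, the next transmit performs a fresh lookup.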
Re: Fw: [Bugme-new] [Bug 6121] New: TCP_DEFER_ACCEPT is reset on listen() call
On 2/23/06, Arnaldo Carvalho de Melo <[EMAIL PROTECTED]> wrote:
> On 2/23/06, Andrew Morton <[EMAIL PROTECTED]> wrote:
> >
> > Starting from 2.6.14, defer_accept is moved to request_sock_queue structure,
> > which is re-initialized in inet_csk_listen_start().
>
> Oops, looking into it...

culprit:

http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=295f7324ff8d9ea58b4d3ec93b1aaa1d80e048a9

Alexandra, can you please test by just removing the zeroing from reqsk_queue_alloc() in net/core/request_sock.c? Just remove this line:

    queue->rskq_defer_accept = 0;

icsk->icsk_accept_queue (that maps to the queue-> above) is zeroed at sk alloc time, so just removing this one should restore the previous behaviour.

Thanks,

- Arnaldo
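A user-space check for the reported behaviour could look like the sketch below: set TCP_DEFER_ACCEPT, call listen(), and read the option back before and after. The function name is hypothetical, and the sketch assumes a Linux system where listen() on an unbound TCP socket auto-binds to an ephemeral port. Note that the kernel may report a different number of seconds than was set, since the value is converted to retransmission periods internally, so the interesting comparison is whether the value read after listen() has dropped to 0.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Set TCP_DEFER_ACCEPT to `secs`, then read it back before and after
 * listen(). Returns 0 and fills *before / *after on success, -1 if any
 * socket call fails.
 */
static int defer_accept_across_listen(int secs, int *before, int *after)
{
	socklen_t len;
	int rc = -1;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, &secs, sizeof(secs)) < 0)
		goto out;
	len = sizeof(*before);
	if (getsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, before, &len) < 0)
		goto out;
	if (listen(fd, 8) < 0)		/* unbound socket: kernel picks a port */
		goto out;
	len = sizeof(*after);
	if (getsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT, after, &len) < 0)
		goto out;
	rc = 0;
out:
	close(fd);
	return rc;
}
```

On a kernel showing the bug one would expect *before to be nonzero and *after to be 0; with the zeroing removed, both should be nonzero.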
Re: [Patch 1/6] IPSEC: core updates
From: jamal <[EMAIL PROTECTED]>
Date: Tue, 21 Feb 2006 08:31:49 -0500

> Ok. Patch attached against net-2617
>
> Yoshifuji-san you should probably write a little doc that should be
> available in the Doc/ directory.

If we write this, please ask Andi Kleen to review it. His arch has the most problems in this area, making him an expert on this topic :-)

> struct xfrm_aevent_id needs to be 32-bit + 64-bit align friendly.
>
> Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]>

Applied, thanks everyone.
Re: RFC: fix first packet goes out with MAC 00:00:00:00:00:00
From: jamal <[EMAIL PROTECTED]>
Date: Thu, 23 Feb 2006 10:06:46 -0500

> Ok, patch attached. Dave this also is needed for 2.6.16-rcXX.
>
> Tested against a standard eth device (e1000) and tuntap.

Applied to net-2.6, thanks a lot.
Re: tg3 losing promisc rx_mode bit
From: "Michael Chan" <[EMAIL PROTECTED]>
Date: Thu, 23 Feb 2006 13:12:38 -0800

> On Fri, 2006-02-24 at 11:48 +1300, Ian McDonald wrote:
>
> > Thinking out loud here without reading source... - can you check the
> > version of the firmware and make noise if they have a version like
> > this one?
>
> Probably yes. Will put this on my queue if there is no other objection.

No objection here.
Re: tg3 losing promisc rx_mode bit
On Fri, 2006-02-24 at 11:48 +1300, Ian McDonald wrote:

> Thinking out loud here without reading source... - can you check the
> version of the firmware and make noise if they have a version like
> this one?

Probably yes. Will put this on my queue if there is no other objection.
Re: tg3 losing promisc rx_mode bit
On 2/24/06, Michael Chan <[EMAIL PROTECTED]> wrote:
> This is a known problem caused by ASF or IPMI firmware overwriting the
> promiscuous mode bit. I will have someone contact you to get the
> firmware upgraded.
>
> Thanks.

Thinking out loud here without reading source... - can you check the version of the firmware and make noise if they have a version like this one?

Ian
--
Ian McDonald
Web: http://wand.net.nz/~iam4
Blog: http://imcdnzl.blogspot.com
WAND Network Research Group
Department of Computer Science
University of Waikato
New Zealand
Re: tg3 losing promisc rx_mode bit
On Thu, 2006-02-23 at 14:31 -0800, Jim Westfall wrote:

> I am seeing the following issue on only the first onboard nic on each of
> the servers. If the nic is put into promisc mode too soon after the nic
> is brought up, the promisc bit in the rx_mode register is somehow getting
> reset to 0;

This is a known problem caused by ASF or IPMI firmware overwriting the promiscuous mode bit. I will have someone contact you to get the firmware upgraded.

Thanks.
tg3 losing promisc rx_mode bit
Hi

I have a number of ibm x336 servers that have the following 2 onboard nics (eth0/1 are 2 other bcm57xx nics on a PCIX card). kernel version is 2.6.15.4, though I have tried 2.4.32/2.6.14-rc4 and they both have the issue below.

ACPI: PCI Interrupt :06:00.0[A] -> GSI 16 (level, low) -> IRQ 169
PCI: Enabling bus mastering for device :06:00.0
PCI: Setting latency timer of device :06:00.0 to 64
eth2: Tigon3 [partno(BCM95721) rev 4101 PHY(5750)] (PCI Express) 10/100/1000BaseT Ethernet 00:0d:60:9a:81:be
eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[1] TSOcap[1]
eth2: dma_rwctrl[7618]
ACPI: PCI Interrupt :07:00.0[A] -> GSI 16 (level, low) -> IRQ 169
PCI: Setting latency timer of device :07:00.0 to 64
eth3: Tigon3 [partno(BCM95721) rev 4101 PHY(5750)] (PCI Express) 10/100/1000BaseT Ethernet 00:0d:60:9a:81:bf
eth3: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] Split[0] WireSpeed[1] TSOcap[1]
eth3: dma_rwctrl[7618]

I am seeing the following issue on only the first onboard nic on each of the servers. If the nic is put into promisc mode too soon after the nic is brought up, the promisc bit in the rx_mode register is somehow getting reset to 0;

Test cases are

non-sleep case (promisc bit gets lost)
ifconfig eth2 up;tcpdump -n -i eth2

and sleep case (promisc is set)
ifconfig eth2 up;sleep 1;tcpdump -n -i eth2

I added some additional debug statements to the driver to printk when it updates the rx_mode register. It dumps what the change is, who changed it (via stack dump) and re-reads the register and prints the value out.

This is the output of the parts that set the device into promisc mode, which is the same for both test cases.

ADDRCONF(NETDEV_UP): eth2: link is not ready
eth2: setting rx_mode register to 0102
[] _tw32_flush+0x30/0xac [tg3]
[] __tg3_set_rx_mode+0x194/0x1ac [tg3]
[] wakeme_after_rcu+0x0/0x10
[] tg3_set_rx_mode+0x25/0x3c [tg3]
[] __dev_mc_upload+0x21/0x28
[] dev_mc_upload+0x19/0x28
[] dev_set_promiscuity+0x37/0x5c
[] packet_dev_mc+0x67/0x7c
[] packet_mc_add+0x126/0x13c
[] packet_setsockopt+0xa5/0xd4
[] sys_setsockopt+0x69/0x84
[] sys_socketcall+0x1b6/0x208
[] syscall_call+0x7/0xb
eth2: read 0102 from rx_mode register
device eth2 entered promiscuous mode
tg3: eth2: Link is up at 100 Mbps, full duplex.
tg3: eth2: Flow control is off for TX and off for RX.
ADDRCONF(NETDEV_CHANGE): eth2: link becomes ready

both indicate they are setting the promisc bit 0x00100, and reading it back as being set, but ethtool and tcpdump show otherwise in the non-sleep case.

non-sleep case
# ethtool -d eth2 | egrep -A4 1128 | egrep -A4 1128 | head -4
1128 0x02
1129 0x00
1130 0x00
1131 0x00

sleep case
# ethtool -d eth2 | egrep -A4 1128 | egrep -A4 1128 | head -4
1128 0x02
1129 0x01
1130 0x00
1131 0x00

I recently got burned by this because we use eth2/3 in a bridge: eth2 wasn't seeing any stp related broadcasts, which triggered a loop.

thanks
jim
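The two register dumps above can be compared mechanically. A minimal sketch, assuming the four dumped bytes at offsets 1128..1131 (0x468 in hex) are the rx_mode register in little-endian byte order; `RX_MODE_PROMISC_BIT`, `reg_from_bytes` and `promisc_is_set` are names made up for this example, with the bit value 0x100 taken from the message.

```c
#include <assert.h>
#include <stdint.h>

/* Promiscuous bit as quoted in the message (0x00100); the name is hypothetical. */
#define RX_MODE_PROMISC_BIT 0x00000100u

/* Assemble a 32-bit register value from four bytes, lowest offset first. */
static uint32_t reg_from_bytes(const uint8_t b[4])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Non-zero when the promiscuous bit is set in an rx_mode value. */
static int promisc_is_set(uint32_t rx_mode)
{
	return (rx_mode & RX_MODE_PROMISC_BIT) != 0;
}
```

Under that byte-order assumption the sleep case decodes to 0x0102, matching what the driver printk reported, while the non-sleep case decodes to 0x0002 with the promiscuous bit cleared.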
Re: [PATCH 0/3] skge: patches for 2.6.16
Stephen Hemminger <[EMAIL PROTECTED]> :
> Bug fix patches to skge driver that need to go in 2.6.16.
> Some of them are in -mm and some have already been sent (and ignored).

#1..#3 Applied to branch 'for-jeff' at
git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6.git

Shortlog

$ git rev-list --pretty master..HEAD | git shortlog
Francois Romieu:
      r8169: fix broken ring index handling in suspend/resume
      r8169: enable wake on lan

Stephen Hemminger:
      sky2: yukon-ec-u chipset initialization
      sky2: limit coalescing values to ring size
      sky2: poke coalescing timer to fix hang
      sky2: force early transmit status
      sky2: use device iomem to access PCI config
      sky2: close race on IRQ mask update.
      skge: NAPI/irq race fix
      skge: genesis phy initialzation
      skge: protect interrupt mask

--
Ueimor
Re: [PATCH] Uninline kfree_skb and allow NULL argument
From: Jörn Engel <[EMAIL PROTECTED]>
Date: Thu, 23 Feb 2006 13:52:59 +0100

> +void kfree_skb(struct sk_buff *skb);
> extern void __kfree_skb(struct sk_buff *skb);

If you wish to contribute to a software project, you should adhere to the coding style and conventions of that project when submitting changes. It doesn't matter what the reasons are for those conventions, you should follow them until the project decides to change them.

If you wish to discuss the merits of putting extern there or not in function declarations, you can start a thread about that and make proposals on linux-kernel. Patch submissions are not the place to do that.

So please add extern here, thanks a lot.
Re: Problem with Ipsec transport mode over NAT
Patrick McHardy wrote:
> Chinh Nguyen wrote:
>
>>Patrick McHardy wrote:
>>
>>>What values does skb->ip_summed have before that?
>>
>>the skb->ip_summed value before the checksum check in tcp_v4_rcv is
>>CHECKSUM_NONE. Hence tcp_v4_rcv checks its value, which is incorrect
>>because the checksum is with regards to the private IP but the NAT
>>device has modified the source IP.
>
> Netfilter recalculates the checksum when NATing it.

The NATing is not done by netfilter but by the NAT device between the IPsec peers.

>>I believe that skb->ip_summed is set to CHECKSUM_NONE by esp_input
>>(net/ipv4/esp4.c:180) which is called by xfrm4_rcv_encap
>>(net/ipv4/xfrm4_input.c:101).
>
> The question is why the checksum is invalid. Please start by describing
> what you're trying to do.

[Linux ipsec client C] -- [NAT device] -- [Linux ipsec server S]

C negotiates IPsec Transport Mode with S. Because of Transport Mode/NAT-T, 2 things happen to an IPsec packet.

1. It is UDP-encapsulated, typically on port 4500/udp.

2. Transport Mode traffic leaves the original IP header alone whereas tunnel mode wraps the entire traffic in a second IP header.

As such, once the packet has passed through the NAT device, the source IP is N. However, the original unencrypted packet had source IP C. S rips off the UDP-encap header, decrypts the payload, and "joins" the content back to the IP header. If the decrypted content is UDP or TCP, the UDP/TCP checksum is now incorrect because the source IP is now N, not C.

(In tunnel mode, we would ignore the NAT-ted outer IP header because the decrypted content has an entire IP header + UDP/TCP etc)

This is a well-known problem with transport mode/NAT. One solution is to use NAT-OA and NAT-OR to recalculate the checksum.

The linux kernel does the simpler thing of ignoring the UDP/TCP checksum altogether in this particular case:

function esp_post_input (net/ipv4/esp4.c)

290 /*
291  * 2) ignore UDP/TCP checksums in case
292  *    of NAT-T in Transport Mode, or
293  *    perform other post-processing fixes
294  *    as per * draft-ietf-ipsec-udp-encaps-06,
295  *    section 3.1.2
296  */
297 if (!x->props.mode)
298         skb->ip_summed = CHECKSUM_UNNECESSARY;
299
300 break;

As noted, esp_post_input is called in xfrm4_policy_check.

Decrypted UDP traffic through transport mode/nat also has bad checksums. However, since it is passed through udp_queue_rcv_skb after decryption, and this function calls xfrm4_policy_check before checking the UDP checksum, line 298 means the kernel ignores the bad checksum.

Decrypted TCP traffic has bad checksums too. But since tcp_v4_rcv checks the TCP checksum before calling xfrm4_policy_check, the bad checksum means the TCP packet is dropped as a bad segment.

The end result is that UDP and other traffic (eg, ICMP) can pass through transport mode/nat but not TCP.

I don't know what the correct fix is. Adding an extra call to xfrm4_policy_check in tcp_v4_rcv before the checksum check fixes this problem and doesn't seem to break anything else. On the other hand, moving some of the code in esp_post_input into esp_input (especially line 298) will work, too.
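Why the rewritten source address invalidates the TCP/UDP checksum can be shown with a small userspace calculation. This is a sketch of the standard Internet checksum over the IPv4 pseudo header plus segment, not the kernel's csum helpers; `sum_words`, `csum_fold_model` and `l4_checksum` are names invented for the example, and the addresses used with it are arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add 16-bit big-endian words from a buffer to a running sum. */
static uint32_t sum_words(uint32_t sum, const uint8_t *p, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;	/* pad odd byte with zero */
	return sum;
}

/* Fold the carries back in and take the one's complement. */
static uint16_t csum_fold_model(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Checksum over pseudo header (src, dst, protocol, length) and segment. */
static uint16_t l4_checksum(const uint8_t src[4], const uint8_t dst[4],
			    uint8_t proto, const uint8_t *seg, size_t len)
{
	uint32_t sum = 0;

	sum = sum_words(sum, src, 4);
	sum = sum_words(sum, dst, 4);
	sum += proto;
	sum += (uint32_t)len;
	sum = sum_words(sum, seg, len);
	return csum_fold_model(sum);
}
```

A segment whose checksum field was filled in by C using its private address verifies to zero when recomputed with that same address, and to a non-zero value when S recomputes it with the NAT address N in the pseudo header, which is the failure tcp_v4_rcv trips over.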
Re: [PATCH] Uninline kfree_skb and allow NULL argument
On Thu, Feb 23, 2006 at 02:21:46PM +0100, Sven Schuster wrote:
>
> static inline void kfree_skb(struct sk_buff *skb)
> {
>         if (unlikely(!skb))
>                 return;
>         _kfree_skb(skb);
> }
>
> This way the kernel with the new inlined kfree_skb should still become
> smaller while not calling the un-inlined _kfree_skb if skb is
> NULL...?? (_should_ become smaller is a claim I make without any
> proof, sorry...)

This is pointless because most callers of kfree_skb expect skb to be non-NULL.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] iproute2 -- add fwmarkmask
> "Patrick" == Patrick McHardy <[EMAIL PROTECTED]> writes:

    Patrick> The normal way to display masks is with a "/". Also I think
    Patrick> it shouldn't display the default mask to avoid breaking
    Patrick> scripts that parse the output.

I generally dislike the /VALUE, since I expect /PREFIX-LEN.
I agree that it shouldn't show if it is default.

    Patrick> ip should be able to parse its own output, and it would
    Patrick> also look nicer if I could just say "fwmark
    Patrick> 0x1/32". fwmarkmask is really an incredible ugly expression
    Patrick> :)

Sure. Is that a 32-bit long mask (0xfff), or is it a 0x0020?
fwmark is not an address.

Or would you like /32 to be a prefix-based mask, and &value and/or fwmarkmask to be a value?

--
] ON HUMILITY: to err is human. To moo, bovine. | firewalls [
] Michael Richardson,Xelerance Corporation, Ottawa, ON|net architect[
] [EMAIL PROTECTED] http://www.sandelman.ottawa.on.ca/mcr/ |device driver[
] panic("Just another Debian GNU/Linux using, kernel hacking, security guy"); [
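The ambiguity raised here is easy to make concrete. A minimal sketch, assuming the mask is applied as a masked comparison of the packet mark against the rule mark (the thread does not spell out the exact semantics); `mask_from_prefix` and `fwmark_match` are hypothetical helpers, not iproute2 or kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Reading "/N" as a prefix length: N leading one bits of a 32-bit word. */
static uint32_t mask_from_prefix(unsigned int n)
{
	if (n == 0)
		return 0;
	if (n >= 32)
		return 0xffffffffu;
	return ~(0xffffffffu >> n);
}

/* Masked comparison: only the bits selected by mask take part. */
static int fwmark_match(uint32_t pkt_mark, uint32_t rule_mark, uint32_t mask)
{
	return ((pkt_mark ^ rule_mark) & mask) == 0;
}
```

With rule "0x1/32", a packet marked 0x3 does not match if /32 is read as a prefix length (all bits compared), but does match if 32 is read as the literal mask value 0x20, since only bit 5 is then compared. The two readings give different routing decisions for the same rule text.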
Re: [PATCH 02/02] add mask options to fwmark masking code
> "Patrick" == Patrick McHardy <[EMAIL PROTECTED]> writes:

    >> #define RTA_FWMARK RTA_PROTOINFO
    >> +#define RTA_FWMARK_MASK RTA_CACHEINFO

    Patrick> Please introduce a new attribute for this instead of
    Patrick> overloading RTA_CACHEINFO.

I would be happy to do that.
Should I also un-overload FWMARK, with backwards compatibility?

    >> diff --git a/net/ipv4/fib_rules.c b/net/ipv4/fib_rules.c index
    >> de327b3..69eed89 100644 --- a/net/ipv4/fib_rules.c +++
    >> b/net/ipv4/fib_rules.c @@ -68,6 +68,7 @@ struct fib_rule u8
    >> r_tos; #ifdef CONFIG_IP_ROUTE_FWMARK u32 r_fwmark; + u32
    >> r_fwmark_mask;

    Patrick> Both patches have whitespace issues. You should also change

uhm. okay. I'm surprised, since I produced it with git-format-patch.
Maybe there are tabs that emacs screwed up.

--
] ON HUMILITY: to err is human. To moo, bovine. | firewalls [
] Michael Richardson,Xelerance Corporation, Ottawa, ON|net architect[
] [EMAIL PROTECTED] http://www.sandelman.ottawa.on.ca/mcr/ |device driver[
] panic("Just another Debian GNU/Linux using, kernel hacking, security guy"); [
Re: Fw: [Bugme-new] [Bug 6121] New: TCP_DEFER_ACCEPT is reset on listen() call
On 2/23/06, Andrew Morton <[EMAIL PROTECTED]> wrote:

> Starting from 2.6.14, defer_accept is moved to request_sock_queue structure,
> which is re-initialized in inet_csk_listen_start().

Oops, looking into it...

- Arnaldo
Fw: [Bugme-new] [Bug 6121] New: TCP_DEFER_ACCEPT is reset on listen() call
Begin forwarded message:

Date: Thu, 23 Feb 2006 07:26:28 -0800
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: [Bugme-new] [Bug 6121] New: TCP_DEFER_ACCEPT is reset on listen() call

http://bugzilla.kernel.org/show_bug.cgi?id=6121

Summary: TCP_DEFER_ACCEPT is reset on listen() call
Kernel Version: 2.6.14, 2.6.15
Status: NEW
Severity: normal
Owner: [EMAIL PROTECTED]
Submitter: [EMAIL PROTECTED]

Most recent kernel where this bug did not occur: 2.6.13
Distribution:
Hardware Environment:
Software Environment:
Problem Description:
Value of TCP_DEFER_ACCEPT socket option is reset to zero when listen() is called.

Steps to reproduce:
Following program shows the problem:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

main()
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int val = 1;
    int len = sizeof(val);
    setsockopt(s, SOL_TCP, TCP_DEFER_ACCEPT, &val, len);
    listen(s, 1);
    getsockopt(s, SOL_TCP, TCP_DEFER_ACCEPT, &val, &len);
    printf("get TCP_DEFER_ACCEPT = %d\n", val);
}

On <=2.6.13 output is "get TCP_DEFER_ACCEPT = 3";
On >=2.6.14 output is "get TCP_DEFER_ACCEPT = 0".

Starting from 2.6.14, defer_accept is moved to request_sock_queue structure, which is re-initialized in inet_csk_listen_start().
[PATCH 2.6.16-rc4] e1000: revert to single descriptor for legacy receive path
A recent patch attempted to enable more efficient memory usage by using only 2kB descriptors for jumbo frames. The method used to implement this has since been commented upon as "illegal" and in recent kernels even causes a BUG when receiving ip fragments while using jumbo frames. This patch simply goes back to the way things were. We expect some complaints to reoccur due to order 3 allocations failing due to this change. Signed-off-by: Jesse Brandeburg <[EMAIL PROTECTED]> --- drivers/net/e1000/e1000.h |3 - drivers/net/e1000/e1000_main.c | 117 +++- 2 files changed, 45 insertions(+), 75 deletions(-) diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h index 27c7730..99baf0e 100644 --- a/drivers/net/e1000/e1000.h +++ b/drivers/net/e1000/e1000.h @@ -225,9 +225,6 @@ struct e1000_rx_ring { struct e1000_ps_page *ps_page; struct e1000_ps_page_dma *ps_page_dma; - struct sk_buff *rx_skb_top; - struct sk_buff *rx_skb_prev; - /* cpu for rx queue */ int cpu; diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index 31e3329..5b7d0f4 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -103,7 +103,7 @@ static char e1000_driver_string[] = "Int #else #define DRIVERNAPI "-NAPI" #endif -#define DRV_VERSION "6.3.9-k2"DRIVERNAPI +#define DRV_VERSION "6.3.9-k4"DRIVERNAPI char e1000_driver_version[] = DRV_VERSION; static char e1000_copyright[] = "Copyright (c) 1999-2005 Intel Corporation."; @@ -1635,8 +1635,6 @@ setup_rx_desc_die: rxdr->next_to_clean = 0; rxdr->next_to_use = 0; - rxdr->rx_skb_top = NULL; - rxdr->rx_skb_prev = NULL; return 0; } @@ -1713,8 +1711,23 @@ e1000_setup_rctl(struct e1000_adapter *a rctl |= adapter->rx_buffer_len << 0x11; } else { rctl &= ~E1000_RCTL_SZ_4096; - rctl &= ~E1000_RCTL_BSEX; - rctl |= E1000_RCTL_SZ_2048; + rctl |= E1000_RCTL_BSEX; + switch (adapter->rx_buffer_len) { + case E1000_RXBUFFER_2048: + default: + rctl |= E1000_RCTL_SZ_2048; + rctl &= ~E1000_RCTL_BSEX; + break; + case 
E1000_RXBUFFER_4096: + rctl |= E1000_RCTL_SZ_4096; + break; + case E1000_RXBUFFER_8192: + rctl |= E1000_RCTL_SZ_8192; + break; + case E1000_RXBUFFER_16384: + rctl |= E1000_RCTL_SZ_16384; + break; + } } #ifndef CONFIG_E1000_DISABLE_PACKET_SPLIT @@ -2107,16 +2120,6 @@ e1000_clean_rx_ring(struct e1000_adapter } } - /* there also may be some cached data in our adapter */ - if (rx_ring->rx_skb_top) { - dev_kfree_skb(rx_ring->rx_skb_top); - - /* rx_skb_prev will be wiped out by rx_skb_top */ - rx_ring->rx_skb_top = NULL; - rx_ring->rx_skb_prev = NULL; - } - - size = sizeof(struct e1000_buffer) * rx_ring->count; memset(rx_ring->buffer_info, 0, size); size = sizeof(struct e1000_ps_page) * rx_ring->count; @@ -3106,24 +3109,27 @@ e1000_change_mtu(struct net_device *netd break; } - /* since the driver code now supports splitting a packet across -* multiple descriptors, most of the fifo related limitations on -* jumbo frame traffic have gone away. -* simply use 2k descriptors for everything. -* -* NOTE: dev_alloc_skb reserves 16 bytes, and typically NET_IP_ALIGN -* means we reserve 2 more, this pushes us to allocate from the next -* larger slab size -* i.e. RXBUFFER_2048 --> size-4096 slab */ - /* recent hardware supports 1KB granularity */ if (adapter->hw.mac_type > e1000_82547_rev_2) { - adapter->rx_buffer_len = - ((max_frame < E1000_RXBUFFER_2048) ? - max_frame : E1000_RXBUFFER_2048); + adapter->rx_buffer_len = max_frame; E1000_ROUNDUP(adapter->rx_buffer_len, 1024); - } else - adapter->rx_buffer_len = E1000_RXBUFFER_2048; + } else { + if(unlikely((adapter->hw.mac_type < e1000_82543) && + (max_frame > MAXIMUM_ETHERNET_FRAME_SIZE))) { + DPRINTK(PROBE, ERR, "Jumbo Frames not supported " + "on 82542\n"); + return -EINVAL; + } else { + if(max_frame <= E1000_RXBUFFER_2048) + adapter->rx_buffer_len = E1000_RXBUFFER_2048; + else if(max_frame <= E1000_RXBUFFER_4096) + adapter->rx_buffer_len = E10
Re: Problem with Ipsec transport mode over NAT
Chinh Nguyen wrote:
> Patrick McHardy wrote:
>
>>What values does skb->ip_summed have before that?
>
> the skb->ip_summed value before the checksum check in tcp_v4_rcv is
> CHECKSUM_NONE. Hence tcp_v4_rcv checks its value, which is incorrect
> because the checksum is with regards to the private IP but the NAT
> device has modified the source IP.

Netfilter recalculates the checksum when NATing it.

> I believe that skb->ip_summed is set to CHECKSUM_NONE by esp_input
> (net/ipv4/esp4.c:180) which is called by xfrm4_rcv_encap
> (net/ipv4/xfrm4_input.c:101).

The question is why the checksum is invalid. Please start by describing what you're trying to do.
Re: pktgen + napi == kaboom
On 2/22/06, Simon Kirby <[EMAIL PROTECTED]> wrote:
> Of course, now it doesn't send as fast. Hrmph. :) On this older Xeon
> 2.4 Ghz w/533 FSB and e1000 & tg3 @ PCI-X 133 Mhz 64 bit, SMP kernel,
> single pktgen thread, I'm only seeing:
>
> clone_skb=0, 802.1Q tagging, 60 byte:
> e1000: 558526pps 268Mb/sec (268092480bps) errors: 0
> tg3: 621260pps 298Mb/sec (298204800bps) errors: 0
>
> clone_skb=0, no 802.1Q, 60 byte:
> e1000: 664558pps 318Mb/sec (318987840bps) errors: 0
> tg3: 772650pps 370Mb/sec (370872000bps) errors: 0
>
> clone_skb=16384, no 802.1Q, 60 byte:
> e1000: 684206pps 328Mb/sec (328418880bps) errors: 0
> tg3: 1069830pps 513Mb/sec (513518400bps) errors: 0
>
> I tried on an Opteron 140 box and it was faster for both cards, but not
> by much. oprofile showed a lot of do_getttimeofday, so I hacked a bunch
> of calls out of pktgen -- I noticed the CPU time shifted around but the
> throughput was still the same as before, as if it's card or bus limited.
>
> Why is it so difficult to actually get 1 Gbps of small packets? I also
> tried changing ring buffer sizes, txqueuelen, interrupt coalescing
> settings, etc... all I was able to do was make it slower or very
> slightly faster.

It's difficult because you have *loads* of transactions going over the bus. Linux's single transmit packet at a time methodology also exacerbates this, as we're unable to coalesce transmit tail (TDT) writes to the bus and are probably interrupting the previous DMA *a lot* (see thread below)

You'll probably be able to get a little better throughput by switching to a UP kernel, but from my experience you're getting pretty close to the max for pci-x e1000 adapters.

There are some previous messages about this like:
http://oss.sgi.com/projects/netdev/archive/2004-12/msg00017.html

beware the hardware bug when enabling TXDMAC!

Jesse
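For scale, the quoted numbers can be set against the theoretical ceiling. A small sketch, assuming the Mb/sec figures count only the 60 packet bytes and that each frame on the wire costs a further 24 bytes (4-byte FCS, 8-byte preamble/SFD, 12-byte inter-frame gap); `payload_bps` and `line_rate_pps` are names made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bit rate counting only the packet bytes: pps * bytes * 8. */
static uint64_t payload_bps(uint64_t pps, uint64_t pkt_bytes)
{
	return pps * pkt_bytes * 8;
}

/* Upper bound on packets per second once per-frame wire overhead is counted. */
static uint64_t line_rate_pps(uint64_t link_bps, uint64_t pkt_bytes,
			      uint64_t overhead_bytes)
{
	return link_bps / ((pkt_bytes + overhead_bytes) * 8);
}
```

`payload_bps(558526, 60)` reproduces the 268092480 bps quoted for e1000, and `line_rate_pps(1000000000, 60, 24)` gives 1488095, so under these assumptions even the best tg3 result (1069830 pps) is roughly 72% of gigabit line rate for minimum-size frames.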
Re: ipw2200 tester needed
On Thursday 23 February 2006 17:17, you wrote:
> In reviewing the ieee80211 stack in order to add additional geographic
> support for wireless drivers, I have studied all the in-kernel wireless
> drivers for their interactions with the routines in ieee80211_geo.c. As
> clearly stated in the comments, ipw2200.c duplicates most of those
> routines, even though ieee80211 is required to use ipw2200. Obviously,
> this bloats both the source code and the binaries for any user of
> ipw2200. I am planning to develop a patch to have ipw2200 use the
> ieee80211 code; however, I do not have the necessary hardware to test
> the result.
>
> Is anyone interested in testing this patch for me? Are there any
> comments regarding this change?

Provide the patch, and I will see what I can do.

--
Greetings Michael.
ipw2200 tester needed
In reviewing the ieee80211 stack in order to add additional geographic support for wireless drivers, I have studied all the in-kernel wireless drivers for their interactions with the routines in ieee80211_geo.c. As clearly stated in the comments, ipw2200.c duplicates most of those routines, even though ieee80211 is required to use ipw2200. Obviously, this bloats both the source code and the binaries for any user of ipw2200. I am planning to develop a patch to have ipw2200 use the ieee80211 code; however, I do not have the necessary hardware to test the result.

Is anyone interested in testing this patch for me? Are there any comments regarding this change?

Thanks,

Larry
Re: RFC: fix first packet goes out with MAC 00:00:00:00:00:00
On Thu, 2006-23-02 at 17:41 +0300, Alexey Kuznetsov wrote:

> After some thinking I suspect the deletion of this chunk could change
> behaviour of some parts which do not use neighbour cache f.e. packet
> socket.

Thanks Alexey, this was what i was worried about ;->

> I think safer approach would be to move this chunk after if (daddr).
> And the possibility to remove this completely could be analyzed later.

Ok, patch attached. Dave this also is needed for 2.6.16-rcXX.

Tested against a standard eth device (e1000) and tuntap.

cheers,
jamal

For ethernet-like netdevices, don't overwrite first packet's dst MAC address when it is already resolved

Signed-off-by: Jamal Hadi Salim <[EMAIL PROTECTED]>

---

diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 9890fd9..c971f14 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -95,6 +95,12 @@ int eth_header(struct sk_buff *skb, stru
 		saddr = dev->dev_addr;
 	memcpy(eth->h_source,saddr,dev->addr_len);
 
+	if(daddr)
+	{
+		memcpy(eth->h_dest,daddr,dev->addr_len);
+		return ETH_HLEN;
+	}
+
 	/*
 	 * Anyway, the loopback-device should never use this function...
 	 */
@@ -105,12 +111,6 @@ int eth_header(struct sk_buff *skb, stru
 		return ETH_HLEN;
 	}
 
-	if(daddr)
-	{
-		memcpy(eth->h_dest,daddr,dev->addr_len);
-		return ETH_HLEN;
-	}
-
 	return -ETH_HLEN;
 }
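The effect of the reordering can be checked with a userspace model of just the destination-address logic. This is a sketch, not the kernel function: `fill_dest_original` and `fill_dest_patched` are hypothetical names, and the `MODEL_` constants stand in for ETH_ALEN, ETH_HLEN, IFF_LOOPBACK and IFF_NOARP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_ALEN		6	/* stand-in for ETH_ALEN */
#define MODEL_HLEN		14	/* stand-in for ETH_HLEN */
#define MODEL_IFF_LOOPBACK	0x8	/* stand-in for IFF_LOOPBACK */
#define MODEL_IFF_NOARP		0x80	/* stand-in for IFF_NOARP */

/* Order before the patch: NOARP/loopback zeroes the address first. */
static int fill_dest_original(unsigned char *h_dest,
			      const unsigned char *daddr, unsigned int flags)
{
	if (flags & (MODEL_IFF_LOOPBACK | MODEL_IFF_NOARP)) {
		memset(h_dest, 0, MODEL_ALEN);
		return MODEL_HLEN;
	}
	if (daddr) {
		memcpy(h_dest, daddr, MODEL_ALEN);
		return MODEL_HLEN;
	}
	return -MODEL_HLEN;
}

/* Order after the patch: a supplied destination address wins. */
static int fill_dest_patched(unsigned char *h_dest,
			     const unsigned char *daddr, unsigned int flags)
{
	if (daddr) {
		memcpy(h_dest, daddr, MODEL_ALEN);
		return MODEL_HLEN;
	}
	if (flags & (MODEL_IFF_LOOPBACK | MODEL_IFF_NOARP)) {
		memset(h_dest, 0, MODEL_ALEN);
		return MODEL_HLEN;
	}
	return -MODEL_HLEN;
}
```

On a NOARP device with a known destination, the original order produces an all-zero destination MAC while the patched order copies the supplied address. With no destination supplied, both orders still zero the address on NOARP and loopback devices, which is the behaviour Alexey wanted preserved.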
Re: Problem with Ipsec transport mode over NAT
Patrick McHardy wrote:
> Chinh Nguyen wrote:
>
>>I discovered that the "bug" is in the function tcp_v4_rcv for kernel
>>2.6.16-rc1.
>>
>>After the ESP packet is decapped and decrypted in
>>xfrm4_rcv_encap_finish, the unencrypted packet is pushed back through
>>ip_local_deliver. For a UDP packet, it goes (back) to function
>>udp_queue_rcv_skb. The first thing this function does is called
>>xfrm4_policy_check. As noted previously, in xfrm4_policy_check, if the
>>skb->sp != NULL, the esp_post_input function is called. The post input
>>function sets skb->ip_summed to CHECKSUM_UNNECESSARY if we are in
>>transport mode. Therefore, further down in udp_queue_rcv_skb, we skip
>>the checksum check and the packet is passed up the stack.
>>
>>However, for a decrypted TCP packet, the packet goes to tcp_v4_rcv.
>>This function does the checksum check right away if skb->ip_summed !=
>>CHECKSUM_UNNECESSARY while xfrm4_policy_check is called a little later
>>in the function. Therefore, the esp post input has not yet set the
>>ip_summed to unnecessary. The decrypted packet fails the checksum and
>>is discarded.
>>
>>To confirm this, I added another call to xfrm4_policy_check before the
>>checksum check in tcp_v4_rcv (to call esp post input). Once patched,
>>my systems were able to initiate TCP connections using Transport
>>Mode/NAT.
>
> What values does skb->ip_summed have before that?

the skb->ip_summed value before the checksum check in tcp_v4_rcv is CHECKSUM_NONE. Hence tcp_v4_rcv checks its value, which is incorrect because the checksum is with regards to the private IP but the NAT device has modified the source IP.

I believe that skb->ip_summed is set to CHECKSUM_NONE by esp_input (net/ipv4/esp4.c:180) which is called by xfrm4_rcv_encap (net/ipv4/xfrm4_input.c:101).
Re: RFC: fix first packet goes out with MAC 00:00:00:00:00:00
Hello!

> All devices including loopback pass a daddr. loopback in fact passes
> a 0 all the time ;->
> This means i can delete the check totaly or i can remove the IFF_NOARP
...
> Anyone knows the history?

I think, it was me who did this crap. It was so long ago I do not remember why it was made.

I remember some troubles with dummy device. It tried to resolve addresses, apparently, without success and generated errors instead of blackholing. I think the problem was eventually solved at neighbour level.

After some thinking I suspect the deletion of this chunk could change behaviour of some parts which do not use neighbour cache f.e. packet socket.

I think safer approach would be to move this chunk after if (daddr). And the possibility to remove this completely could be analyzed later.

Alexey
RFC: fix first packet goes out with MAC 00:00:00:00:00:00
This drove me nuts this morning and i find it hard to believe that no-one has reported this before because i went back as far back as 2.4.2 and it is there ;->. I am ccing the three people who may possibly have made this change (no records whatsoever in git);->

When you turn off ARP on a netdevice then the first packet always goes out with a dstMAC of all zeroes. This is because the first packet is used to resolve ARP entries. Even though the ARP entry may be resolved (I tried by setting a static ARP entry for a host i was pinging from), it gets overwritten by virtue of having the netdevice disabling ARP. Subsequent packets go out fine with correct dstMAC address (which may be why people have ignored reporting this issue).

To cut the story short: the culprit code is in net/ethernet/eth.c::eth_header()

	/*
	 * Anyway, the loopback-device should never use this function...
	 */

	if (dev->flags & (IFF_LOOPBACK|IFF_NOARP))
	{
		memset(eth->h_dest, 0, dev->addr_len);
		return ETH_HLEN;
	}

	if(daddr)
	{
		memcpy(eth->h_dest,daddr,dev->addr_len);
		return ETH_HLEN;
	}

Note how the h_dest is being reset when device has IFF_NOARP.

The only reason i am asking is that this small piece of code has some huge impact and i don't understand the history of the IFF_NOARP check being put there.

As a note: All devices including loopback pass a daddr. loopback in fact passes a 0 all the time ;->
This means i can delete the check totally or i can remove the IFF_NOARP

Anyone knows the history?

cheers,
jamal
Re: [PATCH] Uninline kfree_skb and allow NULL argument
Hello,

> --- kfree_skb/include/linux/skbuff.h~kfree_skb_uninline_null 2006-02-23 13:35:05.0 +0100
> +++ kfree_skb/include/linux/skbuff.h 2006-02-23 13:36:23.0 +0100
> @@ -306,6 +306,7 @@ struct sk_buff {
>
>  #include
>
> +void kfree_skb(struct sk_buff *skb);
>  extern void __kfree_skb(struct sk_buff *skb);
>  extern struct sk_buff *__alloc_skb(unsigned int size,
>  				   gfp_t priority, int fclone);
> @@ -406,22 +407,6 @@ static inline struct sk_buff *skb_get(st
>   */
>
>  /**
> - *	kfree_skb - free an sk_buff
> - *	@skb: buffer to free
> - *
> - *	Drop a reference to the buffer and free it if the usage count has
> - *	hit zero.
> - */
> -static inline void kfree_skb(struct sk_buff *skb)
> -{
> -	if (likely(atomic_read(&skb->users) == 1))
> -		smp_rmb();
> -	else if (likely(!atomic_dec_and_test(&skb->users)))
> -		return;
> -	__kfree_skb(skb);
> -}
> -
> -/**
>   *	skb_cloned - is the buffer a clone
>   *	@skb: buffer to check
>   *
> --- kfree_skb/net/core/skbuff.c~kfree_skb_uninline_null 2006-02-23 13:35:05.0 +0100
> +++ kfree_skb/net/core/skbuff.c 2006-02-23 13:37:01.0 +0100
> @@ -355,6 +355,24 @@ void __kfree_skb(struct sk_buff *skb)
>  }
>
>  /**
> + *	kfree_skb - free an sk_buff
> + *	@skb: buffer to free
> + *
> + *	Drop a reference to the buffer and free it if the usage count has
> + *	hit zero.
> + */
> +void kfree_skb(struct sk_buff *skb)
> +{
> +	if (unlikely(!skb))
> +		return;
> +	if (likely(atomic_read(&skb->users) == 1))
> +		smp_rmb();
> +	else if (likely(!atomic_dec_and_test(&skb->users)))
> +		return;
> +	__kfree_skb(skb);
> +}
> +

just thinking about it a little bit, why not un-inline the current
kfree_skb to, say, _kfree_skb, and make a new inlined kfree_skb which
just does

        static inline void kfree_skb(struct sk_buff *skb)
        {
                if (unlikely(!skb))
                        return;
                _kfree_skb(skb);
        }

This way the kernel with the new inlined kfree_skb should still become
smaller while not calling the un-inlined _kfree_skb if skb is NULL...??

(_should_ become smaller is a claim I make without any proof, sorry...)

Sven

--
Linux zion.homelinux.com 2.6.16-rc3-mm1_27 #27 Wed Feb 15 17:51:36 CET 2006 i686 athlon i386 GNU/Linux
14:14:50 up 5 days, 18:30, 1 user, load average: 0.14, 0.11, 0.17
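The split Sven describes can be tried out in user space. This is only a sketch under stated assumptions: `struct demo_skb`, `demo_alloc_skb()`, `demo_kfree_skb()`, `demo_kfree_skb_slow()` and the `demo_frees` counter are hypothetical names, the reference count is a plain int rather than an atomic, and nothing here measures code size.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for struct sk_buff. */
struct demo_skb {
    int users; /* reference count; a plain int, no atomics in this sketch */
};

int demo_frees; /* number of buffers actually released */

/* Out-of-line part: drop one reference, release on the last one. */
void demo_kfree_skb_slow(struct demo_skb *skb)
{
    if (--skb->users > 0)
        return;
    free(skb);
    demo_frees++;
}

/* Inline wrapper: only the cheap NULL test is duplicated at each call
 * site; a NULL argument never reaches the out-of-line function. */
static inline void demo_kfree_skb(struct demo_skb *skb)
{
    if (!skb)
        return;
    demo_kfree_skb_slow(skb);
}

/* Allocate a buffer holding a single reference. */
struct demo_skb *demo_alloc_skb(void)
{
    struct demo_skb *skb = malloc(sizeof(*skb));

    if (skb)
        skb->users = 1;
    return skb;
}
```

Callers can then write `demo_kfree_skb(ptr)` without testing `ptr` first. Whether the inlined NULL test costs more object code than it saves in function calls is exactly the open question in the mail, and this sketch does not answer it.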
Re: [RFC] Some infrastructure for interrupt-less TX
On Thu, 23 February 2006 02:00:50 -0800, David S. Miller wrote:
>
> > This breaks socket buffer accounting.
>
> That's why he's dropping the SKB sans the data.

There doesn't appear to be any fundamental opposition. David, should I
turn this mess into a decent patch, convert one driver, do some
extensive testing and send it to you [1]?

[1] Provided that I get bored on a rainy day and actually sit down and
do it, of course. It's not a high priority.

Jörn
[PATCH] Uninline kfree_skb and allow NULL argument
On Thu, 23 February 2006 22:26:01 +1100, Herbert Xu wrote:
> On Thu, Feb 23, 2006 at 12:22:31PM +0100, Jörn Engel wrote:
> >
> > Should I merge the two patches into one and resend?
>
> Sounds good.

Here it is.

Jörn

--
Fancy algorithms are buggier than simple ones, and they're much harder
to implement. Use simple algorithms as well as simple data structures.
-- Rob Pike

o Uninline kfree_skb, which saves some 15k of object code on my notebook.
o Allow kfree_skb to be called with a NULL argument.

Subsequent patches can remove conditional from drivers and further
reduce source and object size.

Signed-off-by: Jörn Engel <[EMAIL PROTECTED]>
---

 include/linux/skbuff.h | 17 +
 net/core/skbuff.c | 18 ++
 2 files changed, 19 insertions(+), 16 deletions(-)

--- kfree_skb/include/linux/skbuff.h~kfree_skb_uninline_null 2006-02-23 13:35:05.0 +0100
+++ kfree_skb/include/linux/skbuff.h 2006-02-23 13:36:23.0 +0100
@@ -306,6 +306,7 @@ struct sk_buff {

 #include

+void kfree_skb(struct sk_buff *skb);
 extern void __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   gfp_t priority, int fclone);
@@ -406,22 +407,6 @@ static inline struct sk_buff *skb_get(st
  */

 /**
- *	kfree_skb - free an sk_buff
- *	@skb: buffer to free
- *
- *	Drop a reference to the buffer and free it if the usage count has
- *	hit zero.
- */
-static inline void kfree_skb(struct sk_buff *skb)
-{
-	if (likely(atomic_read(&skb->users) == 1))
-		smp_rmb();
-	else if (likely(!atomic_dec_and_test(&skb->users)))
-		return;
-	__kfree_skb(skb);
-}
-
-/**
  *	skb_cloned - is the buffer a clone
  *	@skb: buffer to check
  *
--- kfree_skb/net/core/skbuff.c~kfree_skb_uninline_null 2006-02-23 13:35:05.0 +0100
+++ kfree_skb/net/core/skbuff.c 2006-02-23 13:37:01.0 +0100
@@ -355,6 +355,24 @@ void __kfree_skb(struct sk_buff *skb)
 }

 /**
+ *	kfree_skb - free an sk_buff
+ *	@skb: buffer to free
+ *
+ *	Drop a reference to the buffer and free it if the usage count has
+ *	hit zero.
+ */
+void kfree_skb(struct sk_buff *skb)
+{
+	if (unlikely(!skb))
+		return;
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return;
+	__kfree_skb(skb);
+}
+
+/**
  *	skb_clone - duplicate an sk_buff
  *	@skb: buffer to clone
  *	@gfp_mask: allocation priority
Re: [PATCH] Uninline kfree_skb and allow NULL argument
On Thu, 23 February 2006 13:52:59 +0100, Jörn Engel wrote:
>
> +void kfree_skb(struct sk_buff *skb);
>  extern void __kfree_skb(struct sk_buff *skb);
>  extern struct sk_buff *__alloc_skb(unsigned int size,

And while we're in the area... is there a good reason why all function
declarations have the "extern" added?

Jörn

--
He that composes himself is wiser than he that composes a book.
-- B. Franklin
Re: [PATCH] Allow kfree_skb to be called with a NULL argument
On Thu, Feb 23, 2006 at 12:22:31PM +0100, Jörn Engel wrote:
>
> Should I merge the two patches into one and resend?

Sounds good.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] Allow kfree_skb to be called with a NULL argument
On Thu, 23 February 2006 03:11:12 -0800, David S. Miller wrote:
>
> > Now there's a good idea. After all, the great majority of callers
> > of kfree_skb expect to free the skb. Dave, what do you think?
>
> Absolutely.

Should I merge the two patches into one and resend?

Jörn

--
If you're willing to restrict the flexibility of your approach,
you can almost always do something better.
-- John Carmack
Re: [PATCH] Allow kfree_skb to be called with a NULL argument
From: Herbert Xu <[EMAIL PROTECTED]>
Date: Thu, 23 Feb 2006 21:55:43 +1100

> Now there's a good idea. After all, the great majority of callers
> of kfree_skb expect to free the skb. Dave, what do you think?

Absolutely.
Re: [PATCH] Allow kfree_skb to be called with a NULL argument
On Thu, Feb 23, 2006 at 11:50:41AM +0100, Jörn Engel wrote:
>
> For my kernel, there would be 92 removals of the condition at the
> price of 135 bytes of extra object code. Some of the removals would
> be in modules, so the numbers are not exactly fair.

IMHO source saving is cheap while binary bloat isn't.

> Another interesting question is: Why is kfree_skb inline in the first
> place? After uninlining it, my patch would debloat both source and
> object code by a bit:
>
> -rwxr-xr-x 1 joern src 4824435 Feb 23 11:46 vmlinux
>
> 12157 bytes gained. Plus a bit more when the 92 conditionals are
> removed.

Now there's a good idea. After all, the great majority of callers
of kfree_skb expect to free the skb. Dave, what do you think?

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] Allow kfree_skb to be called with a NULL argument
On Thu, 23 February 2006 21:10:37 +1100, Herbert Xu wrote:
> On Thu, Feb 23, 2006 at 10:54:46AM +0100, Jörn Engel wrote:
> >
> > Wrt. the binary, you have a point. For source code, my patch does not
> > add any new bloat and allows removal of the existing. Lemme do a quick
>
> Well I just did a grep in net/*/*.c and it seems that the number of
> calls to kfree_skb preceded by a NULL check is a small minority. So
> I don't see the point of this as we'll be trading a very small amount
> of source code savings for the bloating (albeit small) of the binary.

For my kernel, there would be 92 removals of the condition at the
price of 135 bytes of extra object code. Some of the removals would
be in modules, so the numbers are not exactly fair.

Another interesting question is: Why is kfree_skb inline in the first
place? After uninlining it, my patch would debloat both source and
object code by a bit:

-rwxr-xr-x 1 joern src 4824435 Feb 23 11:46 vmlinux

12157 bytes gained. Plus a bit more when the 92 conditionals are
removed.

Jörn

--
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall

--- linux-2.6.14-rc3cow/include/linux/skbuff.h~uninline_kfree_skb 2006-02-23 11:40:30.0 +0100
+++ linux-2.6.14-rc3cow/include/linux/skbuff.h 2006-02-23 11:41:38.0 +0100
@@ -302,6 +302,7 @@ struct sk_buff {

 #include

+void kfree_skb(struct sk_buff *skb);
 extern void __kfree_skb(struct sk_buff *skb);
 extern struct sk_buff *__alloc_skb(unsigned int size,
 				   unsigned int __nocast priority, int fclone);
@@ -397,24 +398,6 @@ static inline struct sk_buff *skb_get(st
  */

 /**
- *	kfree_skb - free an sk_buff
- *	@skb: buffer to free
- *
- *	Drop a reference to the buffer and free it if the usage count has
- *	hit zero.
- */
-static inline void kfree_skb(struct sk_buff *skb)
-{
-	if (unlikely(!skb))
-		return;
-	if (likely(atomic_read(&skb->users) == 1))
-		smp_rmb();
-	else if (likely(!atomic_dec_and_test(&skb->users)))
-		return;
-	__kfree_skb(skb);
-}
-
-/**
  *	skb_cloned - is the buffer a clone
  *	@skb: buffer to check
  *
--- linux-2.6.14-rc3cow/net/core/skbuff.c~uninline_kfree_skb 2006-01-18 14:56:05.0 +0100
+++ linux-2.6.14-rc3cow/net/core/skbuff.c 2006-02-23 11:41:56.0 +0100
@@ -350,6 +350,24 @@ void __kfree_skb(struct sk_buff *skb)
 }

 /**
+ *	kfree_skb - free an sk_buff
+ *	@skb: buffer to free
+ *
+ *	Drop a reference to the buffer and free it if the usage count has
+ *	hit zero.
+ */
+void kfree_skb(struct sk_buff *skb)
+{
+	if (unlikely(!skb))
+		return;
+	if (likely(atomic_read(&skb->users) == 1))
+		smp_rmb();
+	else if (likely(!atomic_dec_and_test(&skb->users)))
+		return;
+	__kfree_skb(skb);
+}
+
+/**
  *	skb_clone - duplicate an sk_buff
  *	@skb: buffer to clone
  *	@gfp_mask: allocation priority
Re: [PATCH] Allow kfree_skb to be called with a NULL argument
On Thu, Feb 23, 2006 at 10:54:46AM +0100, Jörn Engel wrote:
>
> Wrt. the binary, you have a point. For source code, my patch does not
> add any new bloat and allows removal of the existing. Lemme do a quick

Well I just did a grep in net/*/*.c and it seems that the number of
calls to kfree_skb preceded by a NULL check is a small minority. So
I don't see the point of this as we'll be trading a very small amount
of source code savings for the bloating (albeit small) of the binary.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [RFC] Some infrastructure for interrupt-less TX
From: Lennert Buytenhek <[EMAIL PROTECTED]>
Date: Thu, 23 Feb 2006 10:55:21 +0100

> This breaks socket buffer accounting.

That's why he's dropping the SKB sans the data.
Re: [PATCH] Allow kfree_skb to be called with a NULL argument
On Thu, 23 February 2006 19:28:49 +1100, Herbert Xu wrote:
> On Thu, Feb 23, 2006 at 07:53:36AM +0100, Jörn Engel wrote:
> >
> > How is that argument special for kfree_skb? Both libc free and kfree
> > ignore NULL arguments and do so for good reasons.
>
> Well with kfree there is actually a slight gain in that you are doing
> the check in one place.
>
> kfree_skb on the other hand is inlined so you're actually adding
> bloat to many places that simply don't need it.

Wrt. the binary, you have a point. For source code, my patch does not
add any new bloat and allows removal of the existing. Lemme do a quick
measurement for the kernel I run on my machine:

-rwxr-xr-x 1 joern src 4836592 Feb 23 10:43 vmlinux
-rwxr-xr-x 1 joern src 4836727 Feb 23 10:19 vmlinux.kfree_null

135 bytes added by my patch. Not that much.

Jörn

--
He who knows others is wise.
He who knows himself is enlightened.
-- Lao Tsu
Re: [RFC] Some infrastructure for interrupt-less TX
On Thu, 23 February 2006 10:55:21 +0100, Lennert Buytenhek wrote:
> On Thu, Feb 23, 2006 at 08:00:32AM +0100, Jörn Engel wrote:
>
> > > I am assuming the real goal is avoiding interrupts when
> > > transmit completions can be reported without them on a
> > > reasonably periodic basis.
> >
> > Not necessarily on a periodic basis. For some network driver I once
> > worked on, the hardware simply had a ring buffer of n frames.
> > Whenever a n+1th frame was transmitted, the first would be checked for
> > completion. If it was completed, it was freed, else the new frame was
> > dropped (and freed).
>
> This breaks socket buffer accounting.

Only if you keep the skb as well. Read the patch. The point is to
free the skb but keep the packet data.

Jörn

--
Optimizations always bust things, because all optimizations are, in
the long haul, a form of cheating, and cheaters eventually get caught.
-- Larry Wall
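The ring scheme quoted above (reclaim a slot only when the next frame needs it) can be sketched in user space roughly as follows. Everything here is an assumption made for illustration: `struct demo_frame`, `struct demo_tx_ring`, `demo_xmit()` and `DEMO_RING_SIZE` are hypothetical names, the `done` flag stands in for whatever completion status the hardware exposes, and the skb/data split from the RFC patch is not modelled at all.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define DEMO_RING_SIZE 4

/* Hypothetical frame descriptor. */
struct demo_frame {
    bool done; /* set by the (simulated) hardware once the frame was sent */
};

struct demo_tx_ring {
    struct demo_frame *slot[DEMO_RING_SIZE];
    unsigned int head;    /* next slot to fill */
    unsigned int freed;   /* old frames reclaimed lazily */
    unsigned int dropped; /* new frames dropped because the slot was busy */
};

/*
 * Queue a frame without any TX-completion interrupt. The slot about to
 * be reused is checked lazily: a completed old frame is freed, while an
 * old frame still in flight causes the new frame to be dropped.
 * Ownership of 'frame' passes to the ring in both cases.
 */
bool demo_xmit(struct demo_tx_ring *ring, struct demo_frame *frame)
{
    struct demo_frame *old = ring->slot[ring->head];

    if (old) {
        if (!old->done) {
            free(frame);
            ring->dropped++;
            return false;
        }
        free(old);
        ring->freed++;
    }
    ring->slot[ring->head] = frame;
    ring->head = (ring->head + 1) % DEMO_RING_SIZE;
    return true;
}
```

The cost of this design is visible in the drop path: if the hardware falls behind by a full ring, new frames are discarded instead of queued, and completed frames stay allocated until their slot comes around again.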
Re: [RFC] Some infrastructure for interrupt-less TX
On Thu, Feb 23, 2006 at 08:00:32AM +0100, Jörn Engel wrote:

> > I am assuming the real goal is avoiding interrupts when
> > transmit completions can be reported without them on a
> > reasonably periodic basis.
>
> Not necessarily on a periodic basis. For some network driver I once
> worked on, the hardware simply had a ring buffer of n frames.
> Whenever a n+1th frame was transmitted, the first would be checked for
> completion. If it was completed, it was freed, else the new frame was
> dropped (and freed).

This breaks socket buffer accounting.
Re: [PATCH] prism54usb: compile fix
On Mon, 20 Feb 2006 20:39:16 +0100, Carlos Martin <[EMAIL PROTECTED]> wrote:

> diff --git a/drivers/net/wireless/prism54usb/isl_sm.h b/drivers/net/wireless/prism54usb/isl_sm.h
> index 9e41587..c39bb48 100644
> --- a/drivers/net/wireless/prism54usb/isl_sm.h
> +++ b/drivers/net/wireless/prism54usb/isl_sm.h
> @@ -249,7 +249,7 @@ extern int islsm_wait_timeo
>
>  /* now the helper functions, for sending packets */
>  int islsm_outofband_msg(struct net_device *netdev,
> -			void *buf, unsigned int size);
> +			void *buf, size_t size);

I have it in my tree already. Something is inconsistent somewhere. Weird.

-- Pete
Re: no carrier detection after resume from swsusp (8139too)
On Wednesday 22 February 2006 16:19, Robert Love wrote:
> e100 or e1000?

8139cp here. Seems to have picked up this behaviour since SL10.1beta2
or so, still in beta4. See
https://bugzilla.novell.com/show_bug.cgi?id=151892

> `carrier' returns EINVAL if the device is not UP. It might be a bug in
> NM if the device is not UP after a resume. What does `ifconfig eth1`
> show before and after a resume?

carrier is 1 before and after, eth0 is UP before and after, restarting
NM doesn't help, nor does stopping NM, rmmod, modprobe, and starting NM.

I didn't think it is NM related, as I could not configure the network
by hand after resume. However, I just switched my network config to
SUSE traditional ifup+ifplugd (joys of flexibility), and although eth0
does not work on resume, rcnetwork restart fixes it, whereas when NM is
in charge, this does not help. So my understanding is NM is not the
direct cause but is a contributing factor.

Will

eth0      Link encap:Ethernet  HWaddr 00:02:3F:67:0A:E3
          inet addr:169.254.137.164  Bcast:169.254.255.255  Mask:255.255.0.0
          inet6 addr: fe80::202:3fff:fe67:ae3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2769 errors:0 dropped:3144 overruns:0 frame:0
          TX packets:180 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:285833 (279.1 Kb)  TX bytes:14752 (14.4 Kb)
          Interrupt:10 Base address:0x2000

eth1      Link encap:Ethernet  HWaddr 00:0C:F1:13:76:CB
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:5 Memory:9000-9fff

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:277 errors:0 dropped:0 overruns:0 frame:0
          TX packets:277 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:21838 (21.3 Kb)  TX bytes:21838 (21.3 Kb)

eth0      Link encap:Ethernet  HWaddr 00:02:3F:67:0A:E3
          inet addr:10.10.101.143  Bcast:10.10.255.255  Mask:255.255.0.0
          inet6 addr: 2001:780:101:a00:202:3fff:fe67:ae3/64 Scope:Global
          inet6 addr: fe80::202:3fff:fe67:ae3/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2210 errors:0 dropped:0 overruns:0 frame:0
          TX packets:177 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:251842 (245.9 Kb)  TX bytes:14494 (14.1 Kb)
          Interrupt:10 Base address:0x2000

eth1      Link encap:Ethernet  HWaddr 00:0C:F1:13:76:CB
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:5 Memory:9000-9fff

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:149 errors:0 dropped:0 overruns:0 frame:0
          TX packets:149 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12784 (12.4 Kb)  TX bytes:12784 (12.4 Kb)
Re: [PATCH]IPv4 UDP does not discard the datagram with invalid checksum
From: Wei Yongjun <[EMAIL PROTECTED]>
Date: Thu, 23 Feb 2006 16:03:18 -0500

> IPv4 UDP does not discard the datagram with invalid checksum. UDP can
> validate UDP checksums correctly only when socket filtering instructions
> is set. If socket filtering instructions is not set, datagram with
> invalid checksum will be passed to the application.

We check the checksum later, in parallel with the copy of the packet
data into userspace.

See udp_recvmsg(), where we do this:

        if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
                err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
                                              copied);
        } else if (msg->msg_flags&MSG_TRUNC) {
                if (__udp_checksum_complete(skb))
                        goto csum_copy_err;
                err = skb_copy_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov,
                                              copied);
        } else {
                err = skb_copy_and_csum_datagram_iovec(skb, sizeof(struct udphdr), msg->msg_iov);

                if (err == -EINVAL)
                        goto csum_copy_err;
        }
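The idea of verifying the checksum in the same pass as the copy can be shown with a small user-space sketch. This is not the kernel helper: `demo_copy_and_csum()` is a hypothetical function, it works on a flat buffer instead of an skb and iovec, it returns -1 rather than -EINVAL, and the UDP pseudo-header is reduced to a caller-supplied `seed`. The sum itself is the usual 16-bit ones'-complement Internet checksum.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst while accumulating the 16-bit
 * ones'-complement sum over the same bytes. The data must include its
 * own checksum field, so a valid buffer folds to 0xffff.
 * Returns 0 if the checksum verifies, -1 otherwise. The bytes are
 * copied either way, so on failure the caller must discard dst.
 * The 32-bit accumulator is sufficient for len up to 64 KiB.
 */
int demo_copy_and_csum(uint8_t *dst, const uint8_t *src, size_t len,
                       uint32_t seed)
{
    uint32_t sum = seed;
    size_t i;

    /* Copy and sum one big-endian 16-bit word at a time. */
    for (i = 0; i + 1 < len; i += 2) {
        dst[i] = src[i];
        dst[i + 1] = src[i + 1];
        sum += ((uint32_t)src[i] << 8) | src[i + 1];
    }
    /* An odd trailing byte is padded with a zero low byte. */
    if (i < len) {
        dst[i] = src[i];
        sum += (uint32_t)src[i] << 8;
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (sum == 0xffff) ? 0 : -1;
}
```

Because the data is touched only once, a corrupted datagram is detected only after it has been copied, which is why the caller has to treat a failing return as "throw the copy away", much like the csum_copy_err path in the snippet above.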
Re: [PATCH] Allow kfree_skb to be called with a NULL argument
On Thu, Feb 23, 2006 at 07:53:36AM +0100, Jörn Engel wrote:
>
> How is that argument special for kfree_skb? Both libc free and kfree
> ignore NULL arguments and do so for good reasons.

Well with kfree there is actually a slight gain in that you are doing
the check in one place.

kfree_skb on the other hand is inlined so you're actually adding
bloat to many places that simply don't need it.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[EMAIL PROTECTED]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt