[GIT PATCH] IPv6 Update for net-2.6.19, Take 2
Hello.

Here is the update for IPv6 for net-2.6.19, take 2.

Changes:
 - Fix IsRouter flag in NAs.
 - Add HA/MN Support.

With these changesets, net-2.6.19 will be able to handle not only
fundamental CN (Correspondent Node) operation but also HA (Home Agent) /
MN (Mobile Node) operations.

Please pull them from
  git://git.skbuff.net/yoshfuji/net-2.6.19-20060920-inet6/

Regards,

HEADLINES
---------

    [IPV6] NDISC: Handle NDP messages to proxied addresses.
    [IPV6]: Don't forward packets to proxied link-local address.
    [IPV6] NDISC: Avoid updating neighbor cache for proxied address in receiving NA.
    [IPV6] NDISC: Set per-entry is_router flag in Proxy NA.
    [IPV6] NDISC: Add proxy_ndp sysctl.
    [IPV6] ADDRCONF: Convert addrconf_lock to RCU.
    [IPV6] NDISC: Fix is_router flag setting.
    [IPV6] ADDRCONF: Allow non-DAD'able addresses.
    [IPV6] ADDRCONF: Mobile IPv6 Home Address support.

DIFFSTAT
--------

 Documentation/networking/ip-sysctl.txt |    3 +
 include/linux/if_addr.h                |    2 +
 include/linux/ipv6.h                   |    2 +
 include/linux/sysctl.h                 |    1
 include/net/addrconf.h                 |   16 +---
 include/net/if_inet6.h                 |    1
 include/net/neighbour.h                |    1
 net/core/neighbour.c                   |   11 ++-
 net/core/pktgen.c                      |    4 +
 net/ipv6/addrconf.c                    |  128 +++-
 net/ipv6/anycast.c                     |    4 +
 net/ipv6/ip6_output.c                  |   62
 net/ipv6/ipv6_syms.c                   |    1
 net/ipv6/ndisc.c                       |   28 ++-
 net/sctp/ipv6.c                        |    6 +-
 15 files changed, 206 insertions(+), 64 deletions(-)

CHANGESETS
----------

commit 6ef7db482e882f77a17afc9f8fef8a91790b4a7a
Author: Masahide NAKAMURA <[EMAIL PROTECTED]>
Date:   Sun Sep 17 13:55:07 2006 +0900

    [IPV6] NDISC: Handle NDP messages to proxied addresses.

    It is required to respond to NDP messages sent directly to the "target"
    unicast address. Proxying node (router) is required to handle such
    messages. To achieve this, check if the packet in forwarding path is
    an NDP message.

    With this patch, the proxy neighbor entries are always looked up in
    forwarding path. We may want to optimize further.

    Based on MIPL2 kernel patch.

    Signed-off-by: Ville Nuorvala <[EMAIL PROTECTED]>
    Signed-off-by: Masahide NAKAMURA <[EMAIL PROTECTED]>
    Signed-off-by: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index c14ea1e..0f56e9e 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -308,6 +308,46 @@ static int ip6_call_ra_chain(struct sk_b
 	return 0;
 }
 
+static int ip6_forward_proxy_check(struct sk_buff *skb)
+{
+	struct ipv6hdr *hdr = skb->nh.ipv6h;
+	u8 nexthdr = hdr->nexthdr;
+	int offset;
+
+	if (ipv6_ext_hdr(nexthdr)) {
+		offset = ipv6_skip_exthdr(skb, sizeof(*hdr), &nexthdr);
+		if (offset < 0)
+			return 0;
+	} else
+		offset = sizeof(struct ipv6hdr);
+
+	if (nexthdr == IPPROTO_ICMPV6) {
+		struct icmp6hdr *icmp6;
+
+		if (!pskb_may_pull(skb, skb->nh.raw + offset + 1 - skb->data))
+			return 0;
+
+		icmp6 = (struct icmp6hdr *)(skb->nh.raw + offset);
+
+		switch (icmp6->icmp6_type) {
+		case NDISC_ROUTER_SOLICITATION:
+		case NDISC_ROUTER_ADVERTISEMENT:
+		case NDISC_NEIGHBOUR_SOLICITATION:
+		case NDISC_NEIGHBOUR_ADVERTISEMENT:
+		case NDISC_REDIRECT:
+			/* For reaction involving unicast neighbor discovery
+			 * message destined to the proxied address, pass it to
+			 * input function.
+			 */
+			return 1;
+		default:
+			break;
+		}
+	}
+
+	return 0;
+}
+
 static inline int ip6_forward_finish(struct sk_buff *skb)
 {
 	return dst_output(skb);
@@ -362,6 +402,11 @@ int ip6_forward(struct sk_buff *skb)
 		return -ETIMEDOUT;
 	}
 
+	if (pneigh_lookup(&nd_tbl, &hdr->daddr, skb->dev, 0)) {
+		if (ip6_forward_proxy_check(skb))
+			return ip6_input(skb);
+	}
+
 	if (!xfrm6_route_forward(skb)) {
 		IP6_INC_STATS(IPSTATS_MIB_INDISCARDS);
 		goto drop;

---
commit aa4c21e2fffb50159fdc2c3e787b582de825
Author: Masahide NAKAMURA <[EMAIL PROTECTED]>
Date:   Sun Sep 17 13:55:09 2006 +0900

    [IPV6]: Don't forward packets to proxied link-local address.

    Proxying router can't forward traffic sent
Re: [patch 3/3] Add tsi108 On Chip Ethernet device driver support
Zang Roy-r61911 wrote:
> On Thu, 2006-09-21 at 12:26, Jeff Garzik wrote:
> > Zang Roy-r61911 wrote:
> > > +#define TSI108_ETH_WRITE_REG(offset, val) \
> > > +	writel(le32_to_cpu(val),data->regs + (offset))
> > > +
> > > +#define TSI108_ETH_READ_REG(offset) \
> > > +	le32_to_cpu(readl(data->regs + (offset)))
> > > +
> > > +#define TSI108_ETH_WRITE_PHYREG(offset, val) \
> > > +	writel(le32_to_cpu(val), data->phyregs + (offset))
> > > +
> > > +#define TSI108_ETH_READ_PHYREG(offset) \
> > > +	le32_to_cpu(readl(data->phyregs + (offset)))
> >
> > NAK:
> >
> > 1) writel() and readl() are defined to be little endian.
> >
> > If your platform is different, then your platform should have its own
> > foobus_writel() and foobus_readl().
>
> Tsi108 bridge is designed for powerpc platform. Originally, I use
> out_be32() and in_be32(). While there is no obvious reason to object
> using this bridge in a little endian system. Maybe some extra hardware
> logic needed for the bus interface. le32_to_cpu() can be aware the
> endian difference.

To restate, readl() should read a little endian value, and return a
CPU-endian value. writel() should receive a CPU-endian value, and write
a little endian value.

If your platform's readl/writel doesn't do that, it's broken.

That's why normal PCI drivers can use readl() and writel() on either
big-endian or little-endian machines, without needing to use
le32_to_cpu().

	Jeff

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
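The contract restated above can be modelled in plain userspace C. This is only a sketch: model_readl() and model_writel() are hypothetical names, and a byte array stands in for a device register, so it is not the kernel's implementation. It shows why the caller never needs le32_to_cpu(): the accessor itself fixes the byte order on the bus side and always hands back a CPU-endian value.

```c
#include <stdint.h>

/* Userspace model only: a 4-byte array stands in for a little-endian
 * device register.  These helpers are hypothetical and are not the
 * kernel's readl()/writel(). */

/* Read a little-endian register and return a CPU-endian value. */
static inline uint32_t model_readl(const volatile uint8_t *reg)
{
	return (uint32_t)reg[0] |
	       ((uint32_t)reg[1] << 8) |
	       ((uint32_t)reg[2] << 16) |
	       ((uint32_t)reg[3] << 24);
}

/* Take a CPU-endian value and store it as little-endian bytes. */
static inline void model_writel(uint32_t val, volatile uint8_t *reg)
{
	reg[0] = (uint8_t)(val & 0xff);
	reg[1] = (uint8_t)((val >> 8) & 0xff);
	reg[2] = (uint8_t)((val >> 16) & 0xff);
	reg[3] = (uint8_t)((val >> 24) & 0xff);
}
```

Because the helpers work on individual bytes, the same source gives the same register contents on big-endian and little-endian hosts, which is the property Jeff describes for PCI drivers.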
Re: [RFC 0/6] TCP socket splice
On Wed, Sep 20, 2006 at 02:07:11PM -0700, Ashwini Kulkarni ([EMAIL PROTECTED]) wrote:
> Using TCP socket splice:
>
> [ASCII diagram: under application control, TCP socket splice forms a
> direct path in the kernel from the Network Buffer (filled by DMA from
> the NIC) to the File System Buffer (drained by DMA to SATA).]
>
> In this method, the objective is to use TCP socket splicing to create a direct
> path in the kernel from the network buffer to the file system buffer via a pipe
> buffer. The pages will migrate from the network buffer (which is associated
> with the socket) into the pipe buffer for an optimized path. From the pipe
> buffer, the pages will then be migrated to the output file address space page
> cache. This will enable to create a LAN to file-system API which will avoid the
> memcpy operations in user space and thus create a fast path from the network
> buffer to the storage buffer.
>
> Open Issues (currently being addressed):
> There is a performance drop when transferring bigger files (usually larger than
> 65536 bytes in size). Performance drop increases with the size of the file.
> Work is in progress to identify the source of this issue.
>
> We encourage the community to review our TCP socket splice project. Feedback
> would be greatly appreciated.

First of all, it is not zero-copy: most of the time, when the MTU is not
changed, the skb does not have fragments, which means that you need to
copy, and you do it in skb_splice_bits() after the skb_headlen() check.
In addition to the copy you add kmap/kunmap overhead, which can be very
noticeable. I would not be surprised if exactly that part introduces the
performance drop described above compared to the copy_*_user() approach.

Did you check it with (hacked) drivers which put data into the fragment
list? Could you post your benchmarks?

And your coding style is noticeably broken...

Also, do not check for every possible error case: a negative return
value always means an error; otherwise it is OK in your case to proceed.

> --
> Ashwini Kulkarni

-- 
	Evgeniy Polyakov
Re: [patch 3/3] Add tsi108 On Chip Ethernet device driver support
On Thu, 2006-09-21 at 12:26, Jeff Garzik wrote:
> Zang Roy-r61911 wrote:
> > +#define TSI108_ETH_WRITE_REG(offset, val) \
> > +	writel(le32_to_cpu(val),data->regs + (offset))
> > +
> > +#define TSI108_ETH_READ_REG(offset) \
> > +	le32_to_cpu(readl(data->regs + (offset)))
> > +
> > +#define TSI108_ETH_WRITE_PHYREG(offset, val) \
> > +	writel(le32_to_cpu(val), data->phyregs + (offset))
> > +
> > +#define TSI108_ETH_READ_PHYREG(offset) \
> > +	le32_to_cpu(readl(data->phyregs + (offset)))
>
> NAK:
>
> 1) writel() and readl() are defined to be little endian.
>
> If your platform is different, then your platform should have its own
> foobus_writel() and foobus_readl().

The Tsi108 bridge is designed for the powerpc platform. Originally, I
used out_be32() and in_be32(). However, there is no obvious reason to
object to using this bridge in a little-endian system; maybe some extra
hardware logic would be needed for the bus interface. le32_to_cpu() can
be aware of the endian difference.

Any comment?

> 2) TSI108_ETH_WRITE_REG() is just way too long. TSI_READ(),
> TSI_WRITE(), TSI_READ_PHY() and TSI_WRITE_PHY() would be far more
> readable.
>
> More in next email.

I will modify the name.

Roy
Re: [GIT PATCH] NET: Fixes for net-2.6.19
Hello.

In article <[EMAIL PROTECTED]> (at Mon, 18 Sep 2006 20:57:46 +0200), Thomas Graf <[EMAIL PROTECTED]> says:

> * YOSHIFUJI Hideaki <[EMAIL PROTECTED]> 2006-09-19 00:08
:
> >     [NET]: Include new rtnetlink headers for userspace backward
> >     compatibility.
:
> > diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
> > index 3a18add..8ec375c 100644
> > --- a/include/linux/rtnetlink.h
> > +++ b/include/linux/rtnetlink.h
> > @@ -2,7 +2,12 @@ #ifndef __LINUX_RTNETLINK_H
> >  #define __LINUX_RTNETLINK_H
> >
> >  #include
> > +#ifndef __KERNEL__
> > +/* Backward compatibility */
> >  #include
> > +#include
> > +#include
> > +#endif
> >
> >  /
> >   * Routing/neighbour discovery messages.
>
> Still acceptable but this gets ugly at some point. Applications using
> the interface should start making copies of the header version they
> use.

I understand, but I feel that is even uglier.

> > commit 55a08a9078b243a06223222735580df9e11a5fa6
> > Author: YOSHIFUJI Hideaki <[EMAIL PROTECTED]>
> > Date:   Sun Sep 17 13:55:02 2006 +0900
> >
> >     [NET]: Put {IFLA,IFA,NDA,NDTA}_{RTA,PAYLOAD}() macro back.
> >
> >     These macros are still used by userspace applications.
>
> Same here, it doesn't make sense to export macros only of functional
> value and used by userspace only. The same issue will pop up once
> all users have been converted to use the new netlink interface.
> Keeping the old interface around just so userspace doesn't have to
> make copies doesn't make sense. I think it's better to start fixing
> userspace than to try and keep headers source compatible.

Backward compatibility is one of the most important factors.
A careless breakage is, say, a "brain freeze."

About these macros: we have other similar *_{RTA,PAYLOAD}() macros in
the kernel which are not used by the kernel but are exported to
userspace. They form part of our API, and they are good examples of how
to use the netlink interface.

Yes, we could change the interface, but we definitely need to give
userspace a grace period at least; we can do it in 2.7, or at least
after 2-3 stable releases.

Regards,

-- 
YOSHIFUJI Hideaki @ USAGI Project <[EMAIL PROTECTED]>
GPG-FP : 9022 65EB 1ECF 3AD1 0BDF 80D8 4807 F894 E062 0EEA
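For readers unfamiliar with the macros under discussion, here is a minimal userspace sketch of how IFA_RTA() and IFA_PAYLOAD() are typically used to walk the attributes of an address message. It assumes a Linux system with the exported headers installed; build_newaddr() and find_ifa_attr() are hypothetical helpers written for this illustration, and the message is built in a local buffer rather than received from a netlink socket.

```c
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_addr.h>

/* Hypothetical helper: build a minimal RTM_NEWADDR message carrying a
 * single IFA_ADDRESS attribute with a 4-byte payload.  buf must be
 * 4-byte aligned.  Returns the message length, or 0 if buf is too small. */
static unsigned int build_newaddr(void *buf, size_t buflen,
				  const uint8_t addr[4])
{
	struct nlmsghdr *nlh = buf;
	struct ifaddrmsg *ifa;
	struct rtattr *rta;
	size_t need = NLMSG_SPACE(sizeof(*ifa)) + RTA_SPACE(4);

	if (buflen < need)
		return 0;
	memset(buf, 0, need);

	nlh->nlmsg_type = RTM_NEWADDR;
	nlh->nlmsg_len = need;

	ifa = NLMSG_DATA(nlh);
	ifa->ifa_family = AF_INET;

	rta = IFA_RTA(ifa);	/* first attribute after struct ifaddrmsg */
	rta->rta_type = IFA_ADDRESS;
	rta->rta_len = RTA_LENGTH(4);
	memcpy(RTA_DATA(rta), addr, 4);

	return nlh->nlmsg_len;
}

/* Hypothetical helper: return a pointer to the payload of the first
 * attribute of the given type, or NULL if there is none. */
static const void *find_ifa_attr(struct nlmsghdr *nlh, unsigned short type,
				 int *payload_len)
{
	struct ifaddrmsg *ifa = NLMSG_DATA(nlh);
	struct rtattr *rta = IFA_RTA(ifa);
	int len = IFA_PAYLOAD(nlh);	/* bytes of attributes that follow */

	for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
		if (rta->rta_type == type) {
			*payload_len = RTA_PAYLOAD(rta);
			return RTA_DATA(rta);
		}
	}
	return NULL;
}
```

Userspace code written in this style stops compiling if the macros disappear from the exported headers, which is the compatibility concern raised above.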
Re: [patch 3/3] Add tsi108 On Chip Ethernet device driver support
Zang Roy-r61911 wrote:
> +struct tsi108_prv_data {
> +	void __iomem *regs;	/* Base of normal regs */
> +	void __iomem *phyregs;	/* Base of register bank used for PHY access */
> +
> +	int phy;	/* Index of PHY for this interface */
> +	int irq_num;
> +	int id;
> +
> +	struct timer_list timer;	/* Timer that triggers the check phy function */
> +	int rxtail;	/* Next entry in rxring to read */
> +	int rxhead;	/* Next entry in rxring to give a new buffer */
> +	int rxfree;	/* Number of free, allocated RX buffers */
> +
> +	int rxpending;	/* Non-zero if there are still descriptors
> +			 * to be processed from a previous descriptor
> +			 * interrupt condition that has been cleared */
> +
> +	int txtail;	/* Next TX descriptor to check status on */
> +	int txhead;	/* Next TX descriptor to use */

most of these should be unsigned, to prevent bugs.

> +	/* Number of free TX descriptors. This could be calculated from
> +	 * rxhead and rxtail if one descriptor were left unused to disambiguate
> +	 * full and empty conditions, but it's simpler to just keep track
> +	 * explicitly. */
> +
> +	int txfree;
> +
> +	int phy_ok;	/* The PHY is currently powered on. */
> +
> +	/* PHY status (duplex is 1 for half, 2 for full,
> +	 * so that the default 0 indicates that neither has
> +	 * yet been configured). */
> +
> +	int link_up;
> +	int speed;
> +	int duplex;
> +
> +	tx_desc *txring;
> +	rx_desc *rxring;
> +	struct sk_buff *txskbs[TSI108_TXRING_LEN];
> +	struct sk_buff *rxskbs[TSI108_RXRING_LEN];
> +
> +	dma_addr_t txdma, rxdma;
> +
> +	/* txlock nests in misclock and phy_lock */
> +
> +	spinlock_t txlock, misclock;
> +
> +	/* stats is used to hold the upper bits of each hardware counter,
> +	 * and tmpstats is used to hold the full values for returning
> +	 * to the caller of get_stats(). They must be separate in case
> +	 * an overflow interrupt occurs before the stats are consumed.
> +	 */
> +
> +	struct net_device_stats stats;
> +	struct net_device_stats tmpstats;
> +
> +	/* These stats are kept separate in hardware, thus require individual
> +	 * fields for handling carry. They are combined in get_stats.
> +	 */
> +
> +	unsigned long rx_fcs;	/* Add to rx_frame_errors */
> +	unsigned long rx_short_fcs;	/* Add to rx_frame_errors */
> +	unsigned long rx_long_fcs;	/* Add to rx_frame_errors */
> +	unsigned long rx_underruns;	/* Add to rx_length_errors */
> +	unsigned long rx_overruns;	/* Add to rx_length_errors */
> +
> +	unsigned long tx_coll_abort;	/* Add to tx_aborted_errors/collisions */
> +	unsigned long tx_pause_drop;	/* Add to tx_aborted_errors */
> +
> +	unsigned long mc_hash[16];
> +};
> +
> +/* Structure for a device driver */
> +
> +static struct platform_driver tsi_eth_driver = {
> +	.probe = tsi108_init_one,
> +	.remove = tsi108_ether_remove,
> +	.driver = {
> +		.name = "tsi-ethernet",
> +	},
> +};
> +
> +static void tsi108_timed_checker(unsigned long dev_ptr);
> +
> +static void dump_eth_one(struct net_device *dev)
> +{
> +	struct tsi108_prv_data *data = netdev_priv(dev);
> +
> +	printk("Dumping %s...\n", dev->name);
> +	printk("intstat %x intmask %x phy_ok %d"
> +	       " link %d speed %d duplex %d\n",
> +	       TSI108_ETH_READ_REG(TSI108_EC_INTSTAT),
> +	       TSI108_ETH_READ_REG(TSI108_EC_INTMASK), data->phy_ok,
> +	       data->link_up, data->speed, data->duplex);
> +
> +	printk("TX: head %d, tail %d, free %d, stat %x, estat %x, err %x\n",
> +	       data->txhead, data->txtail, data->txfree,
> +	       TSI108_ETH_READ_REG(TSI108_EC_TXSTAT),
> +	       TSI108_ETH_READ_REG(TSI108_EC_TXESTAT),
> +	       TSI108_ETH_READ_REG(TSI108_EC_TXERR));
> +
> +	printk("RX: head %d, tail %d, free %d, stat %x,"
> +	       " estat %x, err %x, pending %d\n\n",
> +	       data->rxhead, data->rxtail, data->rxfree,
> +	       TSI108_ETH_READ_REG(TSI108_EC_RXSTAT),
> +	       TSI108_ETH_READ_REG(TSI108_EC_RXESTAT),
> +	       TSI108_ETH_READ_REG(TSI108_EC_RXERR), data->rxpending);
> +}
> +
> +/* Synchronization is needed between the thread and up/down events.
> + * Note that the PHY is accessed through the same registers for both
> + * interfaces, so this can't be made interface-specific.
> + */
> +
> +static DEFINE_SPINLOCK(phy_lock);

you should have a chip structure, that contains two structs (one for
each interface/port)

> +static u16 tsi108_read_mii(struct tsi108_prv_data *data, int reg, int *status)
> +{
> +	int i;
> +	u16 ret;
> +
> +	TSI108_ETH_WRITE_PHYREG(TSI108_MAC_MII_ADDR,
> +				(data->phy << TSI108_MAC_MII_ADDR_PHY) |
> +				(reg << TSI108_MAC_MII_
Re: [patch 3/3] Add tsi108 On Chip Ethernet device driver support
Zang Roy-r61911 wrote:
> +#define TSI108_ETH_WRITE_REG(offset, val) \
> +	writel(le32_to_cpu(val),data->regs + (offset))
> +
> +#define TSI108_ETH_READ_REG(offset) \
> +	le32_to_cpu(readl(data->regs + (offset)))
> +
> +#define TSI108_ETH_WRITE_PHYREG(offset, val) \
> +	writel(le32_to_cpu(val), data->phyregs + (offset))
> +
> +#define TSI108_ETH_READ_PHYREG(offset) \
> +	le32_to_cpu(readl(data->phyregs + (offset)))

NAK:

1) writel() and readl() are defined to be little endian.

If your platform is different, then your platform should have its own
foobus_writel() and foobus_readl().

2) TSI108_ETH_WRITE_REG() is just way too long. TSI_READ(),
TSI_WRITE(), TSI_READ_PHY() and TSI_WRITE_PHY() would be far more
readable.

More in next email.
[patch 3/3 v2] Add tsi108 On Chip Ethernet device driver support
The Tundra Semiconductor Corporation (Tundra) Tsi108/9 is a host bridge for
PowerPC processors that offers numerous system interconnect options for
embedded application designers. The Tsi108/9 can interconnect 60x or
MPX processors to PCI/X peripherals, DDR2-400 memory, Gigabit Ethernet,
and Flash. Tsi108/109 is used on powerpc/mpc7448hpc2 platform.

The following patch provides Tsi108/9 on chip Ethernet chip driver support.

Signed-off-by: Alexandre Bounine <[EMAIL PROTECTED]>
Signed-off-by: Roy Zang <[EMAIL PROTECTED]>

---
 drivers/net/tsi108_eth.c | 1700 ++
 1 files changed, 1700 insertions(+), 0 deletions(-)

diff --git a/drivers/net/tsi108_eth.c b/drivers/net/tsi108_eth.c
new file mode 100644
index 000..5714f78
--- /dev/null
+++ b/drivers/net/tsi108_eth.c
@@ -0,0 +1,1700 @@
+/***
+
+  Copyright(c) 2006 Tundra Semiconductor Corporation.
+
+  This program is free software; you can redistribute it and/or modify it
+  under the terms of the GNU General Public License as published by the Free
+  Software Foundation; either version 2 of the License, or (at your option)
+  any later version.
+
+  This program is distributed in the hope that it will be useful, but WITHOUT
+  ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+  FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+  more details.
+
+  You should have received a copy of the GNU General Public License along with
+  this program; if not, write to the Free Software Foundation, Inc., 59
+  Temple Place - Suite 330, Boston, MA  02111-1307, USA.
+
+***/
+
+/* This driver is based on the driver code originally developed
+ * for the Intel IOC80314 (ForestLake) Gigabit Ethernet by
+ * [EMAIL PROTECTED]
+ * Copyright (C) 2003 TimeSys Corporation
+ *
+ * Currently changes from original version are:
+ * - porting to Tsi108-based platform and kernel 2.6 ([EMAIL PROTECTED])
+ * - modifications to handle two ports independently and support for
+ *   additional PHY devices ([EMAIL PROTECTED])
+ * - Get hardware information from platform device. ([EMAIL PROTECTED])
+ *
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+#include
+#include
+
+#include "tsi108_eth.h"
+
+#define MII_READ_DELAY 1	/* max link wait time in msec */
+
+#define TSI108_RXRING_LEN 256
+
+/* NOTE: The driver currently does not support receiving packets
+ * larger than the buffer size, so don't decrease this (unless you
+ * want to add such support).
+ */
+#define TSI108_RXBUF_SIZE 1536
+
+#define TSI108_TXRING_LEN 256
+
+#define TSI108_TX_INT_FREQ 64
+
+/* Check the phy status every half a second. */
+#define CHECK_PHY_INTERVAL (HZ/2)
+
+static int tsi108_init_one(struct platform_device *pdev);
+static int tsi108_ether_remove(struct platform_device *pdev);
+
+struct tsi108_prv_data {
+	void __iomem *regs;	/* Base of normal regs */
+	void __iomem *phyregs;	/* Base of register bank used for PHY access */
+
+	int phy;	/* Index of PHY for this interface */
+	int irq_num;
+	int id;
+
+	struct timer_list timer;	/* Timer that triggers the check phy function */
+	int rxtail;	/* Next entry in rxring to read */
+	int rxhead;	/* Next entry in rxring to give a new buffer */
+	int rxfree;	/* Number of free, allocated RX buffers */
+
+	int rxpending;	/* Non-zero if there are still descriptors
+			 * to be processed from a previous descriptor
+			 * interrupt condition that has been cleared */
+
+	int txtail;	/* Next TX descriptor to check status on */
+	int txhead;	/* Next TX descriptor to use */
+
+	/* Number of free TX descriptors. This could be calculated from
+	 * rxhead and rxtail if one descriptor were left unused to disambiguate
+	 * full and empty conditions, but it's simpler to just keep track
+	 * explicitly. */
+
+	int txfree;
+
+	int phy_ok;	/* The PHY is currently powered on. */
+
+	/* PHY status (duplex is 1 for half, 2 for full,
+	 * so that the default 0 indicates that neither has
+	 * yet been configured). */
+
+	int link_up;
+	int speed;
+	int duplex;
+
+	tx_desc *txring;
+	rx_desc *rxring;
+	struct sk_buff *txskbs[TSI108_TXRING_LEN];
+	struct sk_buff *rxskbs[TSI108_RXRING_LEN];
+
+	dma_addr_t txdma, rxdma;
+
+	/* txlock nests in misclock and phy_lock */
+
+	spinlock_t txlock, misclock
[patch 1/3 v2] Add tsi108 On Chip Ethernet device driver support
The Tundra Semiconductor Corporation (Tundra) Tsi108/9 is a host bridge for
PowerPC processors that offers numerous system interconnect options for
embedded application designers. The Tsi108/9 can interconnect 60x or
MPX processors to PCI/X peripherals, DDR2-400 memory, Gigabit Ethernet,
and Flash. Tsi108/109 is used on powerpc/mpc7448hpc2 platform.

The following patch provides Tsi108/9 on chip Ethernet chip driver
config and Makefile.

Signed-off-by: Alexandre Bounine <[EMAIL PROTECTED]>
Signed-off-by: Roy Zang <[EMAIL PROTECTED]>

---
 drivers/net/Kconfig  |    8
 drivers/net/Makefile |    1 +
 2 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index a2bd811..eb17060 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -2221,6 +2221,14 @@ config SPIDER_NET
 	  This driver supports the Gigabit Ethernet chips present on the
 	  Cell Processor-Based Blades from IBM.
 
+config TSI108_ETH
+	tristate "Tundra TSI108 gigabit Ethernet support"
+	depends on TSI108_BRIDGE
+	help
+	  This driver supports Tundra TSI108 gigabit Ethernet ports.
+	  To compile this driver as a module, choose M here: the module
+	  will be called tsi108_eth.
+
 config GIANFAR
 	tristate "Gianfar Ethernet"
 	depends on 85xx || 83xx || PPC_86xx
diff --git a/drivers/net/Makefile b/drivers/net/Makefile
index 8427bf9..da199e7 100644
--- a/drivers/net/Makefile
+++ b/drivers/net/Makefile
@@ -112,6 +112,7 @@ obj-$(CONFIG_B44) += b44.o
 obj-$(CONFIG_FORCEDETH) += forcedeth.o
 obj-$(CONFIG_NE_H8300) += ne-h8300.o 8390.o
 
+obj-$(CONFIG_TSI108_ETH) += tsi108_eth.o
 obj-$(CONFIG_MV643XX_ETH) += mv643xx_eth.o
 
 obj-$(CONFIG_PPP) += ppp_generic.o slhc.o
-- 
1.4.0
[patch 0/3 v2] Add tsi108 On chip Ethernet device driver support
The Tundra Semiconductor Corporation (Tundra) Tsi108/9 is a host bridge for
PowerPC processors that offers numerous system interconnect options for
embedded application designers. The Tsi108/9 can interconnect 60x or
MPX processors to PCI/X peripherals, DDR2-400 memory, Gigabit Ethernet,
and Flash. Tsi108/109 is used on the powerpc/mpc7448hpc2 platform.

The following patch series provides Tsi108/9 on-chip Ethernet support.

1/3 : Config and Makefile modification.
2/3 : Header file
3/3 : C body file

This series fixes the issues raised in the feedback on the previous
patches. Feedback is welcome.

Roy
Re: UDP Out 0f Sequence
On 9/21/06, Majumder, Rajib <[EMAIL PROTECTED]> wrote:
> Does this mean if we have 2 hosts connected back to back (there's no
> network device in between), sequence is guaranteed even in UDP?

I think if you're trying to make the packets appear in order you need
to untie the Gordian knot:
http://en.wikipedia.org/wiki/Gordian_Knot

In other words, you should fix the application rather than attempt the
near-impossible task of making the packets arrive in order...

Ian
-- 
Ian McDonald
Web: http://wand.net.nz/~iam4
Blog: http://imcdnzl.blogspot.com
WAND Network Research Group
Department of Computer Science
University of Waikato
New Zealand
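One common way to "fix the application" is to carry a sequence number in every datagram and let the receiver classify what it sees. The following is a minimal sketch under that assumption; struct seq_tracker and seq_track() are hypothetical names, and a real application would still have to decide whether to buffer, drop, or re-request late datagrams.

```c
#include <stdint.h>

/* Hypothetical receiver-side tracker for an application-level 32-bit
 * sequence number carried in each UDP datagram. */
enum seq_verdict { SEQ_IN_ORDER, SEQ_GAP, SEQ_LATE };

struct seq_tracker {
	int started;		/* set once the first datagram is seen */
	uint32_t next_expected;	/* sequence number expected next */
	unsigned long late;	/* datagrams that arrived behind a newer one */
	unsigned long gaps;	/* times one or more numbers were skipped */
};

static enum seq_verdict seq_track(struct seq_tracker *t, uint32_t seq)
{
	int32_t delta;

	if (!t->started) {
		t->started = 1;
		t->next_expected = seq + 1;
		return SEQ_IN_ORDER;
	}

	/* Signed difference so that wrap-around of the counter is handled. */
	delta = (int32_t)(seq - t->next_expected);

	if (delta == 0) {
		t->next_expected = seq + 1;
		return SEQ_IN_ORDER;
	}
	if (delta > 0) {
		/* Skipped ahead: earlier datagrams are lost or still in flight. */
		t->gaps++;
		t->next_expected = seq + 1;
		return SEQ_GAP;
	}

	/* Older than expected: reordered or duplicated. */
	t->late++;
	return SEQ_LATE;
}
```

With arrivals 1, 3, 2 the tracker reports in-order, gap, late, so the application can see the reordering no matter where on the path it happened.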
RE: UDP Out 0f Sequence
Let's say we have 2 uniprocessor hosts connected back to back. Is there
any possibility of an out-of-order scenario on recv?

Is this the same for all kernels (Linux/Solaris)?

-----Original Message-----
From: David Miller [mailto:[EMAIL PROTECTED]
Sent: 21 September 2006 11:51
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org
Subject: Re: UDP Out 0f Sequence

From: "Majumder, Rajib" <[EMAIL PROTECTED]>
Date: Thu, 21 Sep 2006 10:50:17 +0800

> Does this mean if we have 2 hosts connected back to back (there's no
> network device in between), sequence is guaranteed even in UDP?

Not true. Even for back to back systems SMP can cause packets to be
delivered out of order even locally within the system on receive.
Re: UDP Out 0f Sequence
From: "Majumder, Rajib" <[EMAIL PROTECTED]>
Date: Thu, 21 Sep 2006 10:50:17 +0800

> Does this mean if we have 2 hosts connected back to back (there's no
> network device in between), sequence is guaranteed even in UDP?

Not true. Even for back to back systems SMP can cause packets to be
delivered out of order even locally within the system on receive.
RE: UDP Out 0f Sequence
Does this mean if we have 2 hosts connected back to back (there's no
network device in between), sequence is guaranteed even in UDP?

-----Original Message-----
From: Rick Jones [mailto:[EMAIL PROTECTED]
Sent: 21 September 2006 00:47
To: Majumder, Rajib
Cc: 'netdev@vger.kernel.org'
Subject: Re: UDP Out 0f Sequence

Majumder, Rajib wrote:
> Hi,
>
> If I write UDP datagrams 1,2 and 3 to network and if the receiver
> receives in order 2,1, and 3, where can the sequence get changed? Is it
> at the source stack, network transit or destination stack?

Yes. :) Although network transit is by far the most likely case.
Destination stack is a distant second and source stack an even more
distant third. Generally stack writers try to avoid having places in
their stacks where things can reorder, but it isn't completely unknown.

rick jones
RE: [PATCH 4/7] secid reconciliation-v02: Invoke LSM hook for out bound traffic
On Wed, 20 Sep 2006, Venkat Yekkirala wrote:

> > Quite a lot of logic has changed here.
> >
> > With the original code, we only restored a secmark once for the
> > lifetime of a packet or connection (to make behavior deterministic
> > and security marks immutable in the face of arbitrarily complex
> > iptables rules).
> >
> > With your patch, secmarks are always writable.
>
> Hopefully the following thread addressed these concerns.
> http://marc.theaimsgroup.com/?l=selinux&m=115870100405571&w=2

Ok, but can we preserve existing behavior when packets are only being
labeled internally?

(We should probably settle on the use of 'external' for cipso/xfrm
labeling and 'internal' for iptables only).

> > Also, we did not restore a 'null' (zero) secmark to the skb (while
> > this should never happen with the current SECMARK target, there may
> > be non-SELinux extensions later which set a null marking).
>
> How do you envision this (i.e. restoring a null secmark) being useful?
> secmark is anyway zero by default (when no labeling rules exist for the
> connection) right?

Actually, don't worry about this. The implementation can decide what a
'null' mark might be and manage it themselves.

> > You've also changed the logic for the dummy case of
> > security_skb_netfilter_check()
>
> I am not getting this. This is a new function. Did you mean
> to point to a different function?
>
> > +static inline int security_skb_netfilter_check(struct sk_buff *skb,
> > +						u32 nf_secid)
> > +{
> > +	return 1;
> > +}
> > +
> >
> > This code does not now behave as it did originally. Keep in mind
> > that SELinux is not the only user of SECMARK.

I'm talking about the code as a whole and the way this hook does not
preserve existing behavior in the default case.

Look at the original code:

static void secmark_restore(struct sk_buff *skb)
{
	if (!skb->secmark) {
		u32 *connsecmark;
		enum ip_conntrack_info ctinfo;

		connsecmark = nf_ct_get_secmark(skb, &ctinfo);
		if (connsecmark && *connsecmark)
			if (skb->secmark != *connsecmark)
				skb->secmark = *connsecmark;
	}
}

Now, you have added an LSM hook in here:

+	/* Set secmark on inbound and filter it on outbound */
+	if (hooknum == NF_IP_POST_ROUTING || hooknum == NF_IP6_POST_ROUTING) {
+		if (!security_skb_netfilter_check(skb, secmark))
+			return NF_DROP;
+	} else
+		if (skb->secmark != secmark)
+			skb->secmark = secmark;

The dummy hook does not restore the secmark in the way that the original
code does, now depending on the hooknum.

When LSM is not configured or no LSM module is active, the behavior of
the code must be identical to the original version.

> > I really don't know if connection tracking is the right place to be
> > doing policy enforcement, either. Perhaps you should just do the
> > relabeling here and enforcement later.
>
> We could have done enforcement, in the SELinux postroute_last
> hook for example, if only there were a place to hold onto the
> "exit point context", separate from the label already associated
> with the skb in the secmark field. postroute_last would need BOTH
> the label of the skb (available in the secmark field) and the
> "exit point context" to do enforcement.

Ok, it's not pretty, but I guess it's much better than adding another
field to the skb or similar.

- James
-- 
James Morris <[EMAIL PROTECTED]>
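The restore-once rule of the original secmark_restore() quoted above can be modelled in a few lines of userspace C. This is only a sketch of the quoted logic: struct pkt_model and restore_once() are hypothetical stand-ins for struct sk_buff and the conntrack lookup, and no netfilter or LSM code is involved.

```c
#include <stdint.h>

/* Hypothetical stand-in for the one sk_buff field that matters here. */
struct pkt_model {
	uint32_t secmark;
};

/* Model of the quoted original logic: the connection's mark is copied
 * to the packet only when the packet has no mark yet and the connection
 * has a non-zero one.  An existing packet mark is never overwritten and
 * a zero ("null") connection mark is never restored. */
static void restore_once(struct pkt_model *pkt, uint32_t conn_secmark)
{
	if (!pkt->secmark && conn_secmark)
		pkt->secmark = conn_secmark;
}
```

Whatever replaces this code has to give the same three outcomes (copy, keep, leave zero) on every netfilter hook when no LSM is active, which is the requirement stated above.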
Re: [PATCH 08/23] e1000: add multicast stats counters
cramerj wrote:
> > Williams, Mitch A wrote:
> > >>> +	{ "rx_broadcast", E1000_STAT(stats.bprc) },
> > >>> +	{ "tx_broadcast", E1000_STAT(stats.bptc) },
> > >>> +	{ "rx_multicast", E1000_STAT(stats.mprc) },
> > >>> +	{ "tx_multicast", E1000_STAT(stats.mptc) },
> > >>>  	{ "rx_errors", E1000_STAT(net_stats.rx_errors) },
> > >>>  	{ "tx_errors", E1000_STAT(net_stats.tx_errors) },
> > >>>  	{ "tx_dropped", E1000_STAT(net_stats.tx_dropped) },
> > >> NAK -- you also need to remove the standard net stats, which are
> > >> exported elsewhere
> > >
> > > Jeff, can you please explain the reason for this NAK a little more?
> > > Neither Auke nor I understand why you rejected the patch.
> > >
> > > This patch just adds the display of a few more stats in Ethtool. It
> > > doesn't affect any other counters, and is really just a convenience
> > > feature. I added this to the driver because of a customer request.
> >
> > Adding those stats is fine. You guys just need to remove the existing
> > mess first.
>
> Since we have 1-to-1 mapping of some of our statistics registers to the
> net_stats, we could s/net_stats/stats/. However, there are a few
> net_stats (e.g. net_stats.rx_errors) that encapsulate more than one
> e1000 statistic register of which we don't have a private stat member
> defined. For those statistics, is it really necessary to add another
> stat structure just to rm "net_stats" from that list we pass to ethtool?
> At best, it would look something like this...
>
>  	{ "foo_count", E1000_STAT(stats.foo) },
> -	{ "rx_errors", E1000_STAT(net_stats.rx_errors) },
> +	{ "rx_errors", E1000_STAT(eth_stats.rx_errors) },
>  	{ "bar_count", E1000_STAT(stats.bar) },
>
> If so, well, OK. I'm just scratching my head as to why it's a "mess"
> as-is.

The ethtool get-stats sub ioctl has _always_ been for exporting _only_
NIC-private statistics.

So, no, there is no inherent connection between adding multicast stats
and removing ones that should have never been in the list. But if I
don't put my foot down, this will never get corrected.

	Jeff
RE: [PATCH 08/23] e1000: add multicast stats counters
> Williams, Mitch A wrote:
> >>> +	{ "rx_broadcast", E1000_STAT(stats.bprc) },
> >>> +	{ "tx_broadcast", E1000_STAT(stats.bptc) },
> >>> +	{ "rx_multicast", E1000_STAT(stats.mprc) },
> >>> +	{ "tx_multicast", E1000_STAT(stats.mptc) },
> >>>  	{ "rx_errors", E1000_STAT(net_stats.rx_errors) },
> >>>  	{ "tx_errors", E1000_STAT(net_stats.tx_errors) },
> >>>  	{ "tx_dropped", E1000_STAT(net_stats.tx_dropped) },
> >> NAK -- you also need to remove the standard net stats, which are
> >> exported elsewhere
> >
> > Jeff, can you please explain the reason for this NAK a little more?
> > Neither Auke nor I understand why you rejected the patch.
> >
> > This patch just adds the display of a few more stats in Ethtool. It
> > doesn't affect any other counters, and is really just a convenience
> > feature. I added this to the driver because of a customer request.
>
> Adding those stats is fine. You guys just need to remove the existing
> mess first.
>
> Jeff
>

Since we have 1-to-1 mapping of some of our statistics registers to the
net_stats, we could s/net_stats/stats/. However, there are a few net_stats
(e.g. net_stats.rx_errors) that encapsulate more than one e1000 statistic
register of which we don't have a private stat member defined. For those
statistics, is it really necessary to add another stat structure just to rm
"net_stats" from that list we pass to ethtool? At best, it would look
something like this...

 	{ "foo_count", E1000_STAT(stats.foo) },
-	{ "rx_errors", E1000_STAT(net_stats.rx_errors) },
+	{ "rx_errors", E1000_STAT(eth_stats.rx_errors) },
 	{ "bar_count", E1000_STAT(stats.bar) },

If so, well, OK. I'm just scratching my head as to why it's a "mess" as-is.
I've missed obvious alternatives before; care to enlighten?

-Jeb
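For readers unfamiliar with the table being argued about, the following userspace sketch shows the general shape of a name/size/offset stats table whose entries all point into driver-private storage, including a private aggregate for an error counter that spans several hardware registers. Everything here is hypothetical: `struct adapter`, `ADAPTER_STAT()`, `rx_errors_total`, `update_aggregates()` and `get_stats()` are invented for illustration and are not the e1000 driver's actual definitions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified adapter: only the fields this sketch needs. */
struct hw_stats {
	uint64_t bprc;		/* broadcast packets received */
	uint64_t mprc;		/* multicast packets received */
	uint64_t rxerrc;	/* receive error count */
	uint64_t crcerrs;	/* CRC error count */
};

struct adapter {
	struct hw_stats stats;		/* NIC-private register counters */
	uint64_t rx_errors_total;	/* hypothetical private aggregate */
};

struct stat_desc {
	const char *name;
	size_t size;
	size_t offset;
};

/* Expands to "size, offset" of a member of struct adapter. */
#define ADAPTER_STAT(m) \
	sizeof(((struct adapter *)0)->m), offsetof(struct adapter, m)

static const struct stat_desc stat_table[] = {
	{ "rx_broadcast", ADAPTER_STAT(stats.bprc) },
	{ "rx_multicast", ADAPTER_STAT(stats.mprc) },
	{ "rx_errors",    ADAPTER_STAT(rx_errors_total) },
};

#define STAT_COUNT (sizeof(stat_table) / sizeof(stat_table[0]))

/* Fold several register counters into one private aggregate. */
static void update_aggregates(struct adapter *a)
{
	a->rx_errors_total = a->stats.rxerrc + a->stats.crcerrs;
}

/* Copy every table entry into out[], which must hold STAT_COUNT values. */
static void get_stats(const struct adapter *a, uint64_t *out)
{
	size_t i;

	for (i = 0; i < STAT_COUNT; i++) {
		const char *p = (const char *)a + stat_table[i].offset;

		/* All members in this sketch are 64-bit wide. */
		memcpy(&out[i], p, sizeof(out[i]));
	}
}
```

The point of the sketch is only that an aggregate such as `rx_errors` can live in private storage next to the register counters, so the table does not need to reference the generic net stats structure at all.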
Re: [PATCH][RFC] Re: high latency with TCP connections
On Wed, 20 Sep 2006 15:47:56 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

> From: Stephen Hemminger <[EMAIL PROTECTED]>
> Date: Wed, 20 Sep 2006 15:44:06 -0700
>
> > On Mon, 18 Sep 2006 06:56:55 -0700 (PDT)
> > David Miller <[EMAIL PROTECTED]> wrote:
> >
> > > Ok, I'll put this into net-2.6.19 for now. Thanks.
> >
> > Did you try this on a desktop system? Something is wrong with net-2.6.19
> > basic web browsing seems slower.
>
> It might be due to other changes, please verify that it's
> truly caused by Alexey's change by backing it out and
> retesting.
>
> Note that I had to use an updated version of Alexey's change,
> which he sent me privately, because the first version didn't
> compile :)

It might be something else.. there are a lot of changes from 2.6.18 to
net-2.6.19.

--
Stephen Hemminger <[EMAIL PROTECTED]>
Re: [PATCH][RFC] Re: high latency with TCP connections
From: Stephen Hemminger <[EMAIL PROTECTED]>
Date: Wed, 20 Sep 2006 15:44:06 -0700

> On Mon, 18 Sep 2006 06:56:55 -0700 (PDT)
> David Miller <[EMAIL PROTECTED]> wrote:
>
> > Ok, I'll put this into net-2.6.19 for now. Thanks.
>
> Did you try this on a desktop system? Something is wrong with net-2.6.19
> basic web browsing seems slower.

It might be due to other changes, please verify that it's
truly caused by Alexey's change by backing it out and
retesting.

Note that I had to use an updated version of Alexey's change,
which he sent me privately, because the first version didn't
compile :)
Re: [PATCH][RFC] Re: high latency with TCP connections
On Mon, 18 Sep 2006 06:56:55 -0700 (PDT)
David Miller <[EMAIL PROTECTED]> wrote:

> From: Alexey Kuznetsov <[EMAIL PROTECTED]>
> Date: Mon, 18 Sep 2006 14:37:05 +0400
>
> > > It looks perfectly fine to me, would you like me to apply it
> > > Alexey?
> >
> > Yes, I think it is safe.
>
> Ok, I'll put this into net-2.6.19 for now. Thanks.

Did you try this on a desktop system? Something is wrong with net-2.6.19;
basic web browsing seems slower.

--
Stephen Hemminger <[EMAIL PROTECTED]>
Re: [PATCH 7/7] secid reconciliation-v02: Enforcement for SELinux
Venkat Yekkirala wrote:
>>>+static int selinux_skb_policy_check(struct sk_buff *skb, unsigned short family)
>>>+{
>>>+	u32 xfrm_sid, trans_sid;
>>>+	int err;
>>>+
>>>+	if (selinux_compat_net)
>>>+		return 1;
>>>+
>>>+	err = selinux_xfrm_decode_session(skb, &xfrm_sid, 0);
>>>+	BUG_ON(err);
>>
>>First, any reason against including the "struct sock *" in the LSM hook?
>>At a quick glance it looks like it is available at each place
>>security_skb_policy_check() is invoked? If there are no objections I would
>>like to see it included in the hook.
>
> There's no sock available (NULL) for forward, no-sock, time-wait cases, etc.

... which would be why I should have taken a closer look :)

> What are you trying to accomplish with the sock here anyway?

Actually this is no longer an issue because of something else - you can
ignore this now.

--
paul moore
linux security @ hp
Re: [PATCH] sky2: process tx pause frames.
On Wed, 20 Sep 2006 17:49:41 -0400
Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger wrote:
> > This patch already is in 2.6.17 stable, but the bigger version was pushed
> > off till 2.6.19. Here is a less intrusive version that needs to go into
> > 2.6.18 (or I'll end up sending it for 2.6.18.1). The driver was telling
> > the GMAC to flush (not process) pause frames. Manually disabling pause
> > wasn't working because of problems in the setup.
> >
> > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>
>
> You'll need to send this to [EMAIL PROTECTED]
>
> Jeff

I did already, thanks.

--
Stephen Hemminger <[EMAIL PROTECTED]>
Re: [PATCH] sky2: process tx pause frames.
Stephen Hemminger wrote:
> This patch already is in 2.6.17 stable, but the bigger version was pushed
> off till 2.6.19. Here is a less intrusive version that needs to go into
> 2.6.18 (or I'll end up sending it for 2.6.18.1). The driver was telling
> the GMAC to flush (not process) pause frames. Manually disabling pause
> wasn't working because of problems in the setup.
>
> Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

You'll need to send this to [EMAIL PROTECTED]

Jeff
Re: Update of the r8169 branch
(adding netdev to Cc: so that the patch gets publicly known)

Boris B. Zhmurov <[EMAIL PROTECTED]> :
[...]
> Hello Francois. I've figured out, that this patch wasn't merged in
> linux-2.6.18 :(

Bad timing. Patches are available.

> Is there any plans to merge it in mainline ?

Jeff pulled most of the r8169 branch. I can't answer for him but I guess
that the answer is RSN.

> And is there any patches available against linux-2.6.18 release?

The content of the r8169 branch in the git repository can be retrieved at:
http://www.fr.zoreil.com/people/francois/misc/20060920-2.6.18-r8169-test.patch

It should support 8167 as is.

--
Ueimor
Re: [PATCH] Remove powerpc specific parts of 3c509 driver
Segher Boessenkool wrote:

Sure, PCI busses are little-endian. But is readX()/writeX() for PCI only?

Yes. For other buses, use foo_writel(), etc.

Can this please be documented then? Never heard this before...

You have come late to the party.

What do you mean here? Could you please explain?

This has been the case for many, many years.

No, it was never documented AFAICS.

A de facto standard does not need to be documented, to be a de facto
standard. A lot of Linux "standards" are often based on emails from Linus
buried halfway down a thread. A decision gets made, and people follow.

And there is no point in a massive rename to pci_writel(), either.

That would be really inconvenient, sure. It's also inconvenient that all
the nice short names are PCI-only.

Only to you, a decided minority of developers.

Jeff
RE: [PATCH 7/7] secid reconciliation-v02: Enforcement for SELinux
> > +static int selinux_skb_policy_check(struct sk_buff *skb, unsigned short family)
> > +{
> > +	u32 xfrm_sid, trans_sid;
> > +	int err;
> > +
> > +	if (selinux_compat_net)
> > +		return 1;
> > +
> > +	err = selinux_xfrm_decode_session(skb, &xfrm_sid, 0);
> > +	BUG_ON(err);
>
> First, any reason against including the "struct sock *" in the LSM hook?
> At a quick glance it looks like it is available at each place
> security_skb_policy_check() is invoked? If there are no objections I would
> like to see it included in the hook.

There's no sock available (NULL) for forward, no-sock, time-wait cases, etc.
What are you trying to accomplish with the sock here anyway?

> Second, I wonder if it would be better to do a NetLabel/CIPSO query here
> using the xfrm_sid as the NetLabel "base_sid" instead of at the end of the
> function (see your comment)? This way we wouldn't have to duplicate the
> avc_has_perm() and security_transition_sid() calls for both xfrm and
> NetLabel.

There's a need for an additional avc_has_perm check anyway between the cipso
label and the ipsec/transition label, to check to make sure the cipso level
falls within the range on the IPSec/transition SA. No need for a new
transition between ipsec/transition label and the cipso label since the cipso
label would be sharing the TE portion with the ipsec/transition label (this
could change in the future, when you get round to doing entire SELinux
contexts over the wire). For now, you would just set the secmark to the cipso
label if the label could come thru (i.e. if the avc_has_perm succeeds).
[RFC 6/6] Move i_size_read part from do_splice_to() to __generic_file_splice_read() in splice.c
--- fs/splice.c | 18 -- 1 files changed, 8 insertions(+), 10 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 3a4202d..2f8f42a 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -271,7 +271,7 @@ __generic_file_splice_read(struct file * struct partial_page partial[PIPE_BUFFERS]; struct page *page; pgoff_t index, end_index; - loff_t isize; + loff_t isize, left; size_t total_len; int error, page_nr; struct splice_pipe_desc spd = { @@ -421,6 +421,13 @@ __generic_file_splice_read(struct file * * i_size must be checked after ->readpage(). */ isize = i_size_read(mapping->host); + if (unlikely(*ppos >= isize)) + return 0; + + left = isize - *ppos; + if (unlikely(left < len)) + len = left; + end_index = (isize - 1) >> PAGE_CACHE_SHIFT; if (unlikely(!isize || index > end_index)) break; @@ -903,7 +910,6 @@ static long do_splice_to(struct file *in struct pipe_inode_info *pipe, size_t len, unsigned int flags) { - loff_t isize, left; int ret; if (unlikely(!in->f_op || !in->f_op->splice_read)) @@ -916,14 +922,6 @@ static long do_splice_to(struct file *in if (unlikely(ret < 0)) return ret; - isize = i_size_read(in->f_mapping->host); - if (unlikely(*ppos >= isize)) - return 0; - - left = isize - *ppos; - if (unlikely(left < len)) - len = left; - return in->f_op->splice_read(in, ppos, pipe, len, flags); } - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
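The bounds check that this patch relocates is small enough to model in userspace. The sketch below is an illustration only: `clamp_splice_len()` is a hypothetical helper name, and plain `long long` stands in for the kernel's `loff_t`. It mirrors the three steps in the diff: return nothing when the position is at or past end of file, compute the bytes left, and shorten the request to fit.

```c
#include <stddef.h>

/*
 * Userspace model of the i_size check moved by the patch.
 * Returns how many bytes may be spliced starting at pos, or 0 when
 * pos is at or beyond the end of a file of isize bytes.
 */
static size_t clamp_splice_len(long long isize, long long pos, size_t len)
{
	long long left;

	if (pos >= isize)
		return 0;

	left = isize - pos;	/* strictly positive here */
	if ((unsigned long long)left < len)
		len = (size_t)left;

	return len;
}
```

The explicit cast makes the signed/unsigned comparison between the remaining bytes and the requested length visible; it is safe in this model because `left` is known to be positive at that point.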
[RFC 5/6] Add skb_splice_bits to skbuff.c
--- include/linux/skbuff.h |2 + net/core/skbuff.c | 137 2 files changed, 139 insertions(+), 0 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 755e9cd..8f4b90e 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1338,6 +1338,8 @@ extern unsigned intskb_checksum(cons int len, unsigned int csum); extern intskb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len); +extern intskb_splice_bits(const struct sk_buff *skb, int offset, +struct pipe_inode_info *pipe, int len, unsigned int flags); extern intskb_store_bits(const struct sk_buff *skb, int offset, void *from, int len); extern unsigned intskb_copy_and_csum_bits(const struct sk_buff *skb, diff --git a/net/core/skbuff.c b/net/core/skbuff.c index c54f366..a92d165 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -53,6 +53,7 @@ #endif #include #include +#include #include #include #include @@ -70,6 +71,17 @@ static kmem_cache_t *skbuff_head_cache __read_mostly; static kmem_cache_t *skbuff_fclone_cache __read_mostly; +/* Pipe buffer operations for a socket. */ +static struct pipe_buf_operations sock_buf_ops = { + .can_merge = 0, + .map = generic_pipe_buf_map, + .unmap = generic_pipe_buf_unmap, + .pin = generic_pipe_buf_pin, + .release = generic_sock_buf_release, + .steal = generic_pipe_buf_steal, + .get = generic_pipe_buf_get, +}; + /* * Keep out-of-line to prevent kernel bloat. * __builtin_return_address is not used because it is not always @@ -1148,6 +1160,131 @@ fault: return -EFAULT; } +/* Move specified number of bytes from the source skb to the + * destination pipe buffer. This function even handles all the + * bits of traversing fragment lists. 
+ */ +int skb_splice_bits(const struct sk_buff *skb, int offset, struct pipe_inode_info *pipe, int len, unsigned int flags) +{ + struct page *page; + struct partial_page partial[PIPE_BUFFERS]; + struct page *pages[PIPE_BUFFERS]; + int buflen, available_len; + int pg_nr = 0; + int i, nfrags; + void *address; + size_t ret = 0; + struct splice_pipe_desc spd = { + .pages = pages, + .partial = partial, + .flags = flags, + .ops = &sock_buf_ops, + }; + + buflen = skb_headlen(skb); + + if ((available_len = buflen - offset) >0) { + if (available_len > len) + available_len = len; + + page = alloc_page(GFP_KERNEL); + if (!page) + return -ENOMEM; + + address = kmap(page); + memcpy(address, skb->data + offset, available_len); + /* Push page into splice pipe desc. */ + spd.pages[pg_nr] = page; + pg_nr++; + kunmap(page); + + /* If entire length has been consumed or number of pages pushed into +* splice pipe desc(pipe buffer) equals 16, then call splice_to_pipe. +*/ + if (((len -= available_len) == 0) || pg_nr == PIPE_BUFFERS) { + spd.nr_pages = pg_nr; + offset += available_len; +ret = splice_to_pipe(pipe, &spd); +if (ret == -EPIPE) +return -EPIPE; +else if (ret == -EAGAIN) +return -EAGAIN; +else if (ret == -ERESTARTSYS) +return -ERESTARTSYS; +else goto frags; + } + } + frags: + if (skb_shinfo(skb)->nr_frags != 0) { + nfrags = skb_shinfo(skb)->nr_frags; + + for (i = 0; i < nfrags; i++) { + int total; + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + get_page(skb_shinfo(skb)->frags[i].page); + + total = buflen + skb_shinfo(skb)->frags[i].size; + + if ((available_len = total - offset) > 0) { + +
[RFC 0/6] TCP socket splice
My name is Ashwini Kulkarni and I have been working at Intel Corporation for
the past 4 months as an engineering intern. I have been working on the 'TCP
socket splice' project with Chris Leech. This is a work-in-progress version
of the project with scope for further modifications.

TCP socket splicing:

It allows a TCP socket to be spliced to a file via a pipe buffer. First, to
splice data from a socket to a pipe buffer, up to 16 source page(s) are
pulled into the pipe buffer. Then, to splice data from the pipe buffer to a
file, those pages are migrated into the address space of the target file. It
takes place entirely within the kernel and thus results in zero memory
copies. It is the receive side complement to sendfile(), but unlike
sendfile() it is possible to splice from a socket as well and not just to a
socket.

Current Method:

        +------------> Application Buffer ------------+
        |                                             |
  ______|_____________________________________________|______
        |                                             |
  Receive or                                          | Write
  I/OAT DMA                                           |
        |                                             |
        |                                             V
     Network                                     File System
     Buffer                                        Buffer
        ^                                             |
        |                                             |
  ______|_____________________________________________|______
   DMA  |                                             | DMA
        |                 Hardware                    |
        |                                             V
       NIC                                          SATA

In the current method, the packet is DMA'd from the NIC into the network
buffer. A read on the socket then copies the packet data from the network
buffer to the application buffer in user space. A write operation then moves
the data from the application buffer to the file system buffer, which is then
DMA'd to the disk. Thus, in the current method there will be one full copy of
all the data to user space.

Using TCP socket splice:

                  Application Control
                          |
   _______________________|________________________
                          |
                          | TCP socket splice
                          |
            +-------------+-------------+
            |        Direct path        |
            |                           V
         Network                   File System
         Buffer                      Buffer
            ^                           |
            |                           |
   _________|___________________________|__________
        DMA |                           | DMA
            |         Hardware          |
            |                           V
           NIC                        SATA

In this method, the objective is to use TCP socket splicing to create a
direct path in the kernel from the network buffer to the file system buffer
via a pipe buffer. The pages will migrate from the network buffer (which is
associated with the socket) into the pipe buffer for an optimized path. From
the pipe buffer, the pages will then be migrated to the page cache of the
output file's address space. This makes it possible to create a
LAN-to-file-system API that avoids the memcpy operations in user space and
thus provides a fast path from the network buffer to the storage buffer.

Open Issues (currently being addressed):

There is a performance drop when transferring bigger files (usually larger
than 65536 bytes in size). The performance drop increases with the size of
the file. Work is in progress to identify the source of this issue.

We encourage the community to review our TCP socket splice project. Feedback
would be greatly appreciated.

--
Ashwini Kulkarni
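From the application's point of view, the two-stage path described above (source to pipe, then pipe to file) would look roughly like the sketch below. It uses the existing splice(2) system call; `splice_via_pipe()` is a hypothetical helper name. The function works for any source descriptor the running kernel can splice from. Whether a TCP socket is accepted as the source depends on the kernel, which is the support this series proposes, so treat the socket case as an assumption.

```c
#define _GNU_SOURCE	/* must precede the includes to expose splice() */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move up to len bytes from in_fd to out_fd through a pipe using
 * splice(2), so the payload is never copied into a user buffer.
 * Returns the number of bytes moved, or -1 on error (errno is set).
 */
static ssize_t splice_via_pipe(int in_fd, int out_fd, size_t len)
{
	int pfd[2];
	ssize_t total = 0;

	if (pipe(pfd) < 0)
		return -1;

	while (len > 0) {
		/* Stage 1: source -> pipe. */
		ssize_t in = splice(in_fd, NULL, pfd[1], NULL, len,
				    SPLICE_F_MOVE | SPLICE_F_MORE);
		if (in < 0) {
			total = -1;
			break;
		}
		if (in == 0)
			break;	/* end of stream */
		len -= (size_t)in;

		/* Stage 2: pipe -> destination; drain what was staged. */
		while (in > 0) {
			ssize_t out = splice(pfd[0], NULL, out_fd, NULL,
					     (size_t)in,
					     SPLICE_F_MOVE | SPLICE_F_MORE);
			if (out <= 0) {
				total = -1;
				goto done;
			}
			in -= out;
			total += out;
		}
	}
done:
	close(pfd[0]);
	close(pfd[1]);
	return total;
}
```

A caller would pass a connected socket as `in_fd` and an open file as `out_fd`; only descriptors and lengths cross the user/kernel boundary, which matches the "Application Control" box in the second diagram.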
[RFC 4/6] Add TCP socket splicing (tcp_splice_read) support
--- fs/splice.c | 16 include/linux/net.h |2 ++ include/linux/pipe_fs_i.h |1 + include/net/tcp.h |3 +++ net/socket.c | 13 + 5 files changed, 35 insertions(+), 0 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index c6a880b..3a4202d 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -123,6 +123,12 @@ error: return err; } +void generic_sock_buf_release(struct pipe_inode_info *pipe, + struct pipe_buffer *buf) +{ + put_page(buf->page); +} + static struct pipe_buf_operations page_cache_pipe_buf_ops = { .can_merge = 0, .map = generic_pipe_buf_map, @@ -133,6 +139,16 @@ static struct pipe_buf_operations page_c .get = generic_pipe_buf_get, }; +static struct pipe_buf_operations sock_buf_ops = { + .can_merge = 0, + .map = generic_pipe_buf_map, + .unmap = generic_pipe_buf_unmap, + .pin = generic_pipe_buf_pin, + .release = generic_sock_buf_release, + .steal = generic_pipe_buf_steal, + .get = generic_pipe_buf_get, +}; + static int user_page_pipe_buf_steal(struct pipe_inode_info *pipe, struct pipe_buffer *buf) { diff --git a/include/linux/net.h b/include/linux/net.h index b20c53c..65dfe0c 100644 --- a/include/linux/net.h +++ b/include/linux/net.h @@ -164,6 +164,8 @@ struct proto_ops { struct vm_area_struct * vma); ssize_t (*sendpage) (struct socket *sock, struct page *page, int offset, size_t size, int flags); + ssize_t (*splice_read)(struct socket *sock, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, unsigned int flags); }; struct net_proto_family { diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index 9067985..f7f439b 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -72,6 +72,7 @@ void generic_pipe_buf_get(struct pipe_in int generic_pipe_buf_pin(struct pipe_inode_info *, struct pipe_buffer *); int generic_pipe_buf_steal(struct pipe_inode_info *, struct pipe_buffer *); +void generic_sock_buf_release(struct pipe_inode_info *, struct pipe_buffer *); /* * splice is tied to pipes as a transport (at least for now), so we'll just * 
add the splice flags here. diff --git a/include/net/tcp.h b/include/net/tcp.h index 7a093d0..5032501 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -300,6 +300,9 @@ extern void tcp_cleanup_rbuf(struct so extern int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp); +extern ssize_t tcp_splice_read(struct socket *sk, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, unsigned int flags); + static inline void tcp_dec_quickack_mode(struct sock *sk, const unsigned int pkts) { diff --git a/net/socket.c b/net/socket.c index 6d261bf..8a4f602 100644 --- a/net/socket.c +++ b/net/socket.c @@ -117,6 +117,8 @@ static ssize_t sock_writev(struct file * unsigned long count, loff_t *ppos); static ssize_t sock_sendpage(struct file *file, struct page *page, int offset, size_t size, loff_t *ppos, int more); +static ssize_t sock_splice_read(struct file *file, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, unsigned int flags); /* * Socket files have a set of 'special' operations as well as the generic file ones. These don't appear @@ -141,6 +143,7 @@ static struct file_operations socket_fil .writev = sock_writev, .sendpage = sock_sendpage, .splice_write = generic_splice_sendpage, + .splice_read = sock_splice_read, }; /* @@ -701,6 +704,16 @@ static ssize_t sock_sendpage(struct file return sock->ops->sendpage(sock, page, offset, size, flags); } +static ssize_t sock_splice_read(struct file *file, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, unsigned int flags) +{ + struct socket *sock; + + sock = file->private_data; + + return sock->ops->splice_read(sock, ppos, pipe, len, flags); +} + static struct sock_iocb *alloc_sock_iocb(struct kiocb *iocb, char __user *ubuf, size_t size, struct sock_iocb *siocb) { - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 2/6] Make sock_def_wakeup non-static
--- include/net/sock.h |1 + net/core/sock.c|3 ++- 2 files changed, 3 insertions(+), 1 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 324b3ea..3a64262 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -497,6 +497,7 @@ extern void sk_stream_wait_close(struct extern int sk_stream_error(struct sock *sk, int flags, int err); extern void sk_stream_kill_queues(struct sock *sk); +extern void sock_def_wakeup(struct sock *sk); extern int sk_wait_data(struct sock *sk, long *timeo); struct request_sock_ops; diff --git a/net/core/sock.c b/net/core/sock.c index 51fcfbc..8496854 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1400,7 +1400,7 @@ ssize_t sock_no_sendpage(struct socket * * Default Socket Callbacks */ -static void sock_def_wakeup(struct sock *sk) +void sock_def_wakeup(struct sock *sk) { read_lock(&sk->sk_callback_lock); if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) @@ -1961,6 +1961,7 @@ EXPORT_SYMBOL(sock_no_poll); EXPORT_SYMBOL(sock_no_recvmsg); EXPORT_SYMBOL(sock_no_sendmsg); EXPORT_SYMBOL(sock_no_sendpage); +EXPORT_SYMBOL(sock_def_wakeup); EXPORT_SYMBOL(sock_no_setsockopt); EXPORT_SYMBOL(sock_no_shutdown); EXPORT_SYMBOL(sock_no_socketpair); - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 3/6] Add in TCP related part of splice read to ipv4
--- net/ipv4/af_inet.c |1 net/ipv4/tcp.c | 135 2 files changed, 136 insertions(+), 0 deletions(-) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index c84a320..3c0d245 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -807,6 +807,7 @@ const struct proto_ops inet_stream_ops = .recvmsg = sock_common_recvmsg, .mmap = sock_no_mmap, .sendpage = tcp_sendpage, + .splice_read = tcp_splice_read, #ifdef CONFIG_COMPAT .compat_setsockopt = compat_sock_common_setsockopt, .compat_getsockopt = compat_sock_common_getsockopt, diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 934396b..d4c02a1 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -254,6 +254,10 @@ #include #include #include +#include +#include +#include +#include #include #include #include @@ -264,6 +268,7 @@ #include #include #include +#include #include #include @@ -291,6 +296,23 @@ EXPORT_SYMBOL(tcp_memory_allocated); EXPORT_SYMBOL(tcp_sockets_allocated); /* + * Create a TCP splice context. + */ +struct tcp_splice_state { + struct pipe_inode_info *pipe; + void (*original_data_ready)(struct sock*, int); + size_t len; + size_t offset; + unsigned int flags; +}; + +int __tcp_splice_read(struct sock *sk, loff_t *ppos, struct pipe_inode_info *pipe, + size_t len, unsigned int flags, struct tcp_splice_state *tss); +int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, +unsigned int offset, size_t len); +void tcp_splice_data_ready(struct sock *sk, int flag); + +/* * Pressure flag: try to collapse. * Technical note: it is used by multiple contexts non atomically. * All the sk_stream_mem_schedule() is of this nature: accounting @@ -499,6 +521,118 @@ static inline void tcp_push(struct sock } } +/* + * tcp_splice_read - splice data from TCP socket to a pipe + * @sock: socket to splice from + * @pipe: pipe to splice to + * @len: number of bytes to splice + * @flags: splice modifier flags + * + * Will read pages from given socket and fill them into a pipe. 
+ */ +ssize_t tcp_splice_read(struct socket *sock, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags) +{ + struct tcp_splice_state tss = { + .pipe = pipe, + .len = len, + .flags = flags, + }; + struct sock *sk = sock->sk; + ssize_t spliced; + int ret; + + ret = 0; + spliced = 0; + + if (*ppos != 0) + return -EINVAL; + + while(tss.len) { + ret = __tcp_splice_read(sk, ppos, tss.pipe, tss.len, tss.flags, &tss); + + if(ret < 0) + break; + else if (!ret) { + if (spliced) + break; + if (flags & SPLICE_F_NONBLOCK) { + ret = -EAGAIN; + break; + } + } + tss.len -= ret; + spliced += ret; + } + if (spliced) + return spliced; + + return ret; +} + +int __tcp_splice_read(struct sock *sk, loff_t *ppos, struct pipe_inode_info *pipe, size_t len, unsigned int flags, struct tcp_splice_state *tss) +{ + read_descriptor_t rd_desc; + int copied; + + tss->original_data_ready = sk->sk_data_ready; + + sk->sk_user_data = tss; + + /* Store TCP splice context information in read_descriptor_t. */ + rd_desc.arg.data = tss; + + copied = tcp_read_sock(sk, &rd_desc, tcp_splice_data_recv); + + if (copied != 0) { + if (flags & SPLICE_F_MORE) { + /* Setup new sk_data_ready as tcp_splice_data_ready. */ + sk->sk_data_ready = tcp_splice_data_ready; + return sk_wait_data(sk, &sk->sk_rcvtimeo); + } + else if(flags & SPLICE_F_NONBLOCK) + return -EAGAIN; + else return copied; + } + else + return copied; +} + +int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb, unsigned int offset, size_t len) +{ + /* +* Restore TCP splice context from read_descriptor_t +*/ + struct tcp_splice_state *tss = rd_desc->arg.data; + + return skb_splice_bits(skb, offset, tss->pipe, tss->len, tss->flags); +} + +void tcp_splice_data_ready(struct sock *sk, int flag) +{ + /* +* Restore splice context/ read_descriptor_t from sk->sk_user_data +*/ + struct tcp_splice_state *tss = sk->sk_user_data; + read_descriptor_t rd_desc; + + read_lock(&sk->sk_callback_lock); + + rd_desc.arg.da
[RFC 1/6] Make splice_to_pipe non-static and move structure definitions to a header file
--- fs/splice.c | 18 +- include/linux/pipe_fs_i.h | 18 ++ 2 files changed, 19 insertions(+), 17 deletions(-) diff --git a/fs/splice.c b/fs/splice.c index 684bca3..c6a880b 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -29,22 +29,6 @@ #include #include -struct partial_page { - unsigned int offset; - unsigned int len; -}; - -/* - * Passed to splice_to_pipe - */ -struct splice_pipe_desc { - struct page **pages;/* page map */ - struct partial_page *partial; /* pages[] may not be contig */ - int nr_pages; /* number of pages in map */ - unsigned int flags; /* splice flags */ - struct pipe_buf_operations *ops;/* ops associated with output pipe */ -}; - /* * Attempt to steal a page from a pipe buffer. This should perhaps go into * a vm helper function, it's already simplified quite a bit by the @@ -173,7 +157,7 @@ static struct pipe_buf_operations user_p * Pipe output worker. This sets up our pipe format with the page cache * pipe buffer operations. Otherwise very similar to the regular pipe_writev(). 
*/ -static ssize_t splice_to_pipe(struct pipe_inode_info *pipe, +ssize_t splice_to_pipe(struct pipe_inode_info *pipe, struct splice_pipe_desc *spd) { int ret, do_wakeup, page_nr; diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h index ea4f7cd..9067985 100644 --- a/include/linux/pipe_fs_i.h +++ b/include/linux/pipe_fs_i.h @@ -100,4 +100,22 @@ extern ssize_t splice_from_pipe(struct p loff_t *, size_t, unsigned int, splice_actor *); +struct partial_page { + unsigned int offset; + unsigned int len; +}; + +/* + * Passed to splice_to_pipe + */ +struct splice_pipe_desc { + struct page **pages;/* page map */ + struct partial_page *partial; /* pages[] may not be contig */ + int nr_pages; /* number of pages in map */ + unsigned int flags; /* splice flags */ + struct pipe_buf_operations *ops;/* ops associated with output pipe */ +}; + +ssize_t splice_to_pipe(struct pipe_inode_info *, struct splice_pipe_desc *); + #endif - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] tcp: make cubic the default
Change default congestion control used from BIC to the newer CUBIC,
which is the successor to BIC but has better properties over long delay links.

Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]>

---
 net/ipv4/Kconfig | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

--- net-test.orig/net/ipv4/Kconfig	2006-09-20 12:22:06.0 -0700
+++ net-test/net/ipv4/Kconfig	2006-09-20 13:31:21.0 -0700
@@ -454,7 +454,7 @@
 	  modules.
 
 	  Nearly all users can safely say no here, and a safe default
-	  selection will be made (BIC-TCP with new Reno as a fallback).
+	  selection will be made (CUBIC with new Reno as a fallback).
 
 	  If unsure, say N.
 
@@ -462,7 +462,7 @@
 
 config TCP_CONG_BIC
 	tristate "Binary Increase Congestion (BIC) control"
-	default y
+	default m
 	---help---
 	BIC-TCP is a sender-side only change that ensures a linear RTT
 	fairness under large windows while offering both scalability and
@@ -476,7 +476,7 @@
 
 config TCP_CONG_CUBIC
 	tristate "CUBIC TCP"
-	default m
+	default y
 	---help---
 	This is version 2.0 of BIC-TCP which uses a cubic growth function
 	among other techniques.
@@ -573,7 +573,7 @@
 
 choice
 	prompt "Default TCP congestion control"
-	default DEFAULT_BIC
+	default DEFAULT_CUBIC
 	help
 	  Select the TCP congestion control that will be used by default
 	  for all connections.
@@ -600,7 +600,7 @@
 
 endif
 
-config TCP_CONG_BIC
+config TCP_CONG_CUBIC
 	tristate
 	depends on !TCP_CONG_ADVANCED
 	default y
@@ -613,7 +613,7 @@
 	default "vegas" if DEFAULT_VEGAS
 	default "westwood" if DEFAULT_WESTWOOD
 	default "reno" if DEFAULT_RENO
-	default "bic"
+	default "cubic"
 
 source "net/ipv4/ipvs/Kconfig"
[PATCH] tcp: default congestion control menu
Change how the default TCP congestion control is chosen. Don't just use the last installed module; instead, allow selection during configuration, and make sure to use the default regardless of load order. Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> --- net/ipv4/Kconfig | 45 - net/ipv4/sysctl_net_ipv4.c |6 ++ net/ipv4/tcp_cong.c|2 +- 3 files changed, 47 insertions(+), 6 deletions(-) --- net-2.6.19.orig/net/ipv4/Kconfig2006-09-19 16:13:02.0 -0700 +++ net-2.6.19/net/ipv4/Kconfig 2006-09-20 11:17:45.0 -0700 @@ -447,7 +447,7 @@ depends on INET_DIAG def_tristate INET_DIAG -config TCP_CONG_ADVANCED +menuconfig TCP_CONG_ADVANCED bool "TCP: advanced congestion control" ---help--- Support for selection of various TCP congestion control @@ -458,9 +458,7 @@ If unsure, say N. -# TCP Reno is builtin (required as fallback) -menu "TCP congestion control" - depends on TCP_CONG_ADVANCED +if TCP_CONG_ADVANCED config TCP_CONG_BIC tristate "Binary Increase Congestion (BIC) control" @@ -573,12 +571,49 @@ loss packets. See http://www.ntu.edu.sg/home5/ZHOU0022/papers/CPFu03a.pdf -endmenu +choice + prompt "Default TCP congestion control" + default DEFAULT_BIC + help + Select the TCP congestion control that will be used by default + for all connections. 
+ + config DEFAULT_BIC + bool "Bic" if TCP_CONG_BIC=y + + config DEFAULT_CUBIC + bool "Cubic" if TCP_CONG_CUBIC=y + + config DEFAULT_HTCP + bool "Htcp" if TCP_CONG_HTCP=y + + config DEFAULT_VEGAS + bool "Vegas" if TCP_CONG_VEGAS=y + + config DEFAULT_WESTWOOD + bool "Westwood" if TCP_CONG_WESTWOOD=y + + config DEFAULT_RENO + bool "Reno" + +endchoice + +endif config TCP_CONG_BIC tristate depends on !TCP_CONG_ADVANCED default y +config DEFAULT_TCP_CONG + string + default "bic" if DEFAULT_BIC + default "cubic" if DEFAULT_CUBIC + default "htcp" if DEFAULT_HTCP + default "vegas" if DEFAULT_VEGAS + default "westwood" if DEFAULT_WESTWOOD + default "reno" if DEFAULT_RENO + default "bic" + source "net/ipv4/ipvs/Kconfig" --- net-2.6.19.orig/net/ipv4/sysctl_net_ipv4.c 2006-09-19 16:13:02.0 -0700 +++ net-2.6.19/net/ipv4/sysctl_net_ipv4.c 2006-09-19 16:13:05.0 -0700 @@ -129,6 +129,12 @@ return ret; } +static int __init tcp_congestion_default(void) +{ + return tcp_set_default_congestion_control(CONFIG_DEFAULT_TCP_CONG); +} + +late_initcall(tcp_congestion_default); ctl_table ipv4_table[] = { { --- net-2.6.19.orig/net/ipv4/tcp_cong.c 2006-09-19 16:13:02.0 -0700 +++ net-2.6.19/net/ipv4/tcp_cong.c 2006-09-19 16:13:05.0 -0700 @@ -48,7 +48,7 @@ printk(KERN_NOTICE "TCP %s already registered\n", ca->name); ret = -EEXIST; } else { - list_add_rcu(&ca->list, &tcp_cong_list); + list_add_tail_rcu(&ca->list, &tcp_cong_list); printk(KERN_INFO "TCP %s registered\n", ca->name); } spin_unlock(&tcp_cong_list_lock); - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
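The effect of the two code changes above (list_add_tail_rcu() at registration, plus a late_initcall that sets the configured default) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the list head is treated as "the default", and setting the default is assumed to move the named entry to the head. The struct and function names are hypothetical and no RCU or locking is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a registered congestion control algorithm. */
struct cong_ops {
    const char *name;
    struct cong_ops *next;
};

/* Assumption of this sketch: the head of the list is the default. */
struct cong_ops *cong_list;

/* Tail insertion: a later registration never displaces the current head. */
void cong_register_tail(struct cong_ops *ca)
{
    struct cong_ops **pp = &cong_list;

    assert(ca != NULL);
    while (*pp)
        pp = &(*pp)->next;
    ca->next = NULL;
    *pp = ca;
}

/* Move the named entry to the head; returns -1 if it is not registered. */
int cong_set_default(const char *name)
{
    struct cong_ops **pp;

    for (pp = &cong_list; *pp; pp = &(*pp)->next) {
        if (strcmp((*pp)->name, name) == 0) {
            struct cong_ops *ca = *pp;

            *pp = ca->next;         /* unlink */
            ca->next = cong_list;   /* relink at the head */
            cong_list = ca;
            return 0;
        }
    }
    return -1;
}

const char *cong_default_name(void)
{
    return cong_list ? cong_list->name : NULL;
}
```

With head insertion, whichever module loaded last would end up as the default. With tail insertion plus an explicit late "set default" step, the result no longer depends on load order.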
Re: [PATCH] Remove powerpc specific parts of 3c509 driver
Sure, PCI busses are little-endian. But is readX()/writeX() for PCI only? Yes. For other buses, use foo_writel(), etc. Can this please be documented then? Never heard this before... You have come late to the party. What do you mean here? Could you please explain? This has been the case for many, many years. No, it was never documented AFAICS. And there is no point in a massive rename to pci_writel(), either. That would be really inconvenient, sure. It's also inconvenient that all the nice short names are PCI-only. Segher
RE: [PATCH 4/7] secid reconciliation-v02: Invoke LSM hook for out bound traffic
See below. > -Original Message- > From: James Morris [mailto:[EMAIL PROTECTED] > Sent: Monday, September 18, 2006 2:12 PM > To: Venkat Yekkirala > Cc: netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; > [EMAIL PROTECTED] > Subject: Re: [PATCH 4/7] secid reconciliation-v02: Invoke LSM hook for > outbound traffic > > > On Fri, 8 Sep 2006, Venkat Yekkirala wrote: > > > -static void secmark_restore(struct sk_buff *skb) > > +static unsigned int secmark_restore(struct sk_buff *skb, > unsigned int > > hooknum, > > + const struct xt_target *target) > > { > > - if (!skb->secmark) { > > - u32 *connsecmark; > > - enum ip_conntrack_info ctinfo; > > + u32 *psecmark; > > + u32 secmark = 0; > > + enum ip_conntrack_info ctinfo; > > > > - connsecmark = nf_ct_get_secmark(skb, &ctinfo); > > - if (connsecmark && *connsecmark) > > - if (skb->secmark != *connsecmark) > > - skb->secmark = *connsecmark; > > - } > > + psecmark = nf_ct_get_secmark(skb, &ctinfo); > > + if (psecmark) > > + secmark = *psecmark; > > + > > + if (!secmark) > > + return XT_CONTINUE; > > + > > + /* Set secmark on inbound and filter it on outbound */ > > + if (hooknum == NF_IP_POST_ROUTING || hooknum == > NF_IP6_POST_ROUTING) { > > + if (!security_skb_netfilter_check(skb, secmark)) > > + return NF_DROP; > > + } else > > + if (skb->secmark != secmark) > > + skb->secmark = secmark; > > + > > + return XT_CONTINUE; > > } > > Quite a lot of logic has changed here. > > With the original code, we only restored a secmark once for > the lifetime > of a packet or connetcion (to make behavior deterministic and > security > marks immutable in the face of arbitrarily complex iptables rules). > > With your patch, secmarks are always writable. Hopefully the following thread addressed these concerns. http://marc.theaimsgroup.com/?l=selinux&m=115870100405571&w=2 > > What about packets on the OUTPUT hook? I will check for OUTPUT as well as POSTROUTING to kickoff skb_flow_out(). 
> > Also, we did not restore a 'null' (zero) secmark to the skb > (while this > should never happen with the current SECMARK target, there may be > non-SELinux extensions later which set a null marking). How do you envision this (i.e. restoring a null secmark) being useful? secmark is anyway zero by default (when no labeling rules exist for the connection), right? > > Why not just do something like: > > > psecmark = nf_ct_get_secmark(skb, &ctinfo); > if (psecmark && *psecmark) { > > ... core of function ... > > } > > return XT_CONTINUE; > > I don't think you need the new secmark variable. Will do. > > You've also changed the logic for the dummy case of > security_skb_netfilter_check() I am not getting this. This is a new function. Did you mean to point to a different function? > > > +static inline int security_skb_netfilter_check(struct sk_buff *skb, > + u32 nf_secid) > +{ > + return 1; > +} > + > > This code does not now behave as it did originally. Keep in > mind that > SELinux is not the only user of SECMARK. Missed this as well (this is a new function in this patch). Please elaborate. > > (The documentation of the hook in security.h doesn't match > the behavior, > either -- it's (re-)labeling, not just filtering). Will fix this. > > I really don't know if connection tracking is the right place > to be doing > policy enforcement, either. Perhaps you should just do the > relabeling here > and enforcement later. We could have done enforcement, in the SELinux postroute_last hook for example, if only there were a place to hold onto the "exit point context", separate from the label already associated with the skb in the secmark field. postroute_last would need BOTH the label of the skb (available in the secmark field) and the "exit point context" to do enforcement.
[NET] GT96100: Delete bitrotting ethernet driver
Code for the EV96100 evaluation board hasn't compiled since at least November 15, 2003, so it is being deleted as of 2.6.18 due to lack of a user base. Signed-off-by: Ralf Baechle <[EMAIL PROTECTED]> drivers/net/Kconfig |6 drivers/net/Makefile |1 drivers/net/gt96100eth.c | 1566 -- drivers/net/gt96100eth.h | 346 -- 4 files changed, 0 insertions(+), 1919 deletions(-) diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index 778fbae..0ee6d60 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -446,12 +446,6 @@ config GALILEO_64240_ETH This is the driver for the ethernet interfaces integrated into the Galileo (now Marvell) GT64240 chipset. -config MIPS_GT96100ETH - bool "MIPS GT96100 Ethernet support" - depends on NET_ETHERNET && MIPS_GT96100 - help - Say Y here to support the Ethernet subsystem on your GT96100 card. - config MIPS_AU1X00_ENET bool "MIPS AU1000 Ethernet support" depends on NET_ETHERNET && SOC_AU1X00 diff --git a/drivers/net/Makefile b/drivers/net/Makefile index faf24de..eb48c55 100644 --- a/drivers/net/Makefile +++ b/drivers/net/Makefile @@ -179,7 +179,6 @@ obj-$(CONFIG_HPLANCE) += hplance.o 7990. obj-$(CONFIG_MVME147_NET) += mvme147.o 7990.o obj-$(CONFIG_EQUALIZER) += eql.o obj-$(CONFIG_MIPS_JAZZ_SONIC) += jazzsonic.o -obj-$(CONFIG_MIPS_GT96100ETH) += gt96100eth.o obj-$(CONFIG_MIPS_AU1X00_ENET) += au1000_eth.o obj-$(CONFIG_MIPS_SIM_NET) += mipsnet.o obj-$(CONFIG_SGI_IOC3_ETH) += ioc3-eth.o diff --git a/drivers/net/gt96100eth.c b/drivers/net/gt96100eth.c deleted file mode 100644 index 2b4db74..000 --- a/drivers/net/gt96100eth.c +++ /dev/null @@ -1,1566 +0,0 @@ -/* - * Copyright 2000, 2001 MontaVista Software Inc. - * Author: MontaVista Software, Inc. - * [EMAIL PROTECTED] or [EMAIL PROTECTED] - * - * This program is free software; you can distribute it and/or modify it - * under the terms of the GNU General Public License (Version 2) as - * published by the Free Software Foundation. 
- * - * This program is distributed in the hope it will be useful, but WITHOUT - * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or - * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License - * for more details. - * - * You should have received a copy of the GNU General Public License along - * with this program; if not, write to the Free Software Foundation, Inc., - * 59 Temple Place - Suite 330, Boston MA 02111-1307, USA. - * - * Ethernet driver for the MIPS GT96100 Advanced Communication Controller. - * - * Revision history - * - *11.11.2001 Moved to 2.4.14, [EMAIL PROTECTED] Modified driver to add - *proper gt96100A support. - *12.05.2001 Moved eth port 0 to irq 3 (mapped to GT_SERINT0 on EV96100A) - *in order for both ports to work. Also cleaned up boot - *option support (mac address string parsing), fleshed out - *gt96100_cleanup_module(), and other general code cleanups - *<[EMAIL PROTECTED]>. - */ -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include -#include - -#include -#include - -#define DESC_BE 1 -#define DESC_DATA_BE 1 - -#define GT96100_DEBUG 2 - -#include "gt96100eth.h" - -// prototypes -static void* dmaalloc(size_t size, dma_addr_t *dma_handle); -static void dmafree(size_t size, void *vaddr); -static void gt96100_delay(int msec); -static int gt96100_add_hash_entry(struct net_device *dev, - unsigned char* addr); -static void read_mib_counters(struct gt96100_private *gp); -static int read_MII(int phy_addr, u32 reg); -static int write_MII(int phy_addr, u32 reg, u16 data); -static int gt96100_init_module(void); -static void gt96100_cleanup_module(void); -static void dump_MII(int dbg_lvl, struct net_device *dev); -static void dump_tx_desc(int dbg_lvl, struct net_device *dev, int i); -static void dump_rx_desc(int dbg_lvl, struct net_device *dev, int i); -static void dump_skb(int dbg_lvl, struct net_device *dev, 
-struct sk_buff *skb); -static void update_stats(struct gt96100_private *gp); -static void abort(struct net_device *dev, u32 abort_bits); -static void hard_stop(struct net_device *dev); -static void enable_ether_irq(struct net_device *dev); -static void disable_ether_irq(struct net_device *dev); -static int gt96100_probe1(struct pci_dev *pci, int port_num); -static void reset_tx(struct net_device *dev); -static void reset_rx(struct net_device *dev); -static int gt96100_check_tx_consistent(struct gt96100_private *gp); -static int gt96100_init(struct net_device *dev); -static int gt96100_open(struct net_device *dev); -static int gt96100_close(struct net_device *dev); -static int gt96100_tx(struct sk_buff *skb,
[PATCH 2/2] mv643xx_eth: Fix typo: RX_SKB_SIZE ==> ETH_RX_SKB_SIZE
From: Dale Farnsworth <[EMAIL PROTECTED]> Bug was introduced in commit 71d28725548be203e8b8f6ad63b1f64fd7f02d4d. How embarrassing. It wasn't caught because dma_unmap_single() is defined away on arch/ppc and 32-bit arch/powerpc. Signed-off-by: Dale Farnsworth <[EMAIL PROTECTED]> --- Arggh. (And that's not pirate talk.) This isn't urgent since dma_unmap_single() is defined away for ppc32 both in arch/ppc and arch/powerpc. It was caught on ppc64 arch/powerpc, but isn't needed by any ppc64 platforms. diff --git a/drivers/net/mv643xx_eth.c b/drivers/net/mv643xx_eth.c index eeab1df..59de3e7 100644 --- a/drivers/net/mv643xx_eth.c +++ b/drivers/net/mv643xx_eth.c @@ -385,7 +385,7 @@ static int mv643xx_eth_receive_queue(str struct pkt_info pkt_info; while (budget-- > 0 && eth_port_receive(mp, &pkt_info) == ETH_OK) { - dma_unmap_single(NULL, pkt_info.buf_ptr, RX_SKB_SIZE, + dma_unmap_single(NULL, pkt_info.buf_ptr, ETH_RX_SKB_SIZE, DMA_FROM_DEVICE); mp->rx_desc_count--; received_packets++;
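Why "defined away" hides this kind of typo can be shown with a small standalone sketch. The macro body below is an assumption in the spirit of what the message describes for 32-bit ppc, not a copy of the kernel header, and DMA_FROM_DEVICE is given a placeholder value. The key point is that a function-like macro which discards its arguments removes them during preprocessing, so the compiler never sees the undefined name.

```c
#include <assert.h>

/* Assumed no-op definition: all four arguments are thrown away. */
#define dma_unmap_single(dev, addr, size, dir) ((void)0)

#define DMA_FROM_DEVICE 2   /* placeholder value for this sketch */

/*
 * RX_SKB_SIZE is deliberately not defined anywhere in this file.
 * The call below still compiles, because the preprocessor drops the
 * argument list before the compiler proper ever looks at it.
 */
int receive_one(unsigned long buf_ptr)
{
    assert(buf_ptr != 0);   /* sketch-only sanity check */
    dma_unmap_single(0, buf_ptr, RX_SKB_SIZE, DMA_FROM_DEVICE);
    return buf_ptr != 0;
}
```

On a platform where dma_unmap_single() is a real function or an inline that uses its arguments, the same line fails to build with an "undeclared identifier" error, which matches the report that the bug only showed up on ppc64.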
[PATCH 1/2] mv643xx_eth: restrict to 32-bit PPC_MULTIPLATFORM
From: Dale Farnsworth <[EMAIL PROTECTED]> No 64-bit PPC_MULTIPLATFORM platforms use the mv643xx_eth driver, so build it only on PPC32. Signed-off-by: Dale Farnsworth <[EMAIL PROTECTED]> Acked-by: Sven Luther <[EMAIL PROTECTED]> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig index a2bd811..2154ae2 100644 --- a/drivers/net/Kconfig +++ b/drivers/net/Kconfig @@ -2262,7 +2262,7 @@ config UGETH_HAS_GIGA config MV643XX_ETH tristate "MV-643XX Ethernet support" - depends on MOMENCO_OCELOT_C || MOMENCO_JAGUAR_ATX || MV64360 || MOMENCO_OCELOT_3 || PPC_MULTIPLATFORM + depends on MOMENCO_OCELOT_C || MOMENCO_JAGUAR_ATX || MV64360 || MOMENCO_OCELOT_3 || (PPC_MULTIPLATFORM && PPC32) select MII help This driver supports the gigabit Ethernet on the Marvell MV643XX
Re: [PATCH 08/23] e1000: add multicast stats counters
Williams, Mitch A wrote: + { "rx_broadcast", E1000_STAT(stats.bprc) }, + { "tx_broadcast", E1000_STAT(stats.bptc) }, + { "rx_multicast", E1000_STAT(stats.mprc) }, + { "tx_multicast", E1000_STAT(stats.mptc) }, { "rx_errors", E1000_STAT(net_stats.rx_errors) }, { "tx_errors", E1000_STAT(net_stats.tx_errors) }, { "tx_dropped", E1000_STAT(net_stats.tx_dropped) }, NAK -- you also need to remove the standard net stats, which are exported elsewhere Jeff, can you please explain the reason for this NAK a little more? Neither Auke nor I understand why you rejected the patch. This patch just adds the display of a few more stats in Ethtool. It doesn't affect any other counters, and is really just a convenience feature. I added this to the driver because of a customer request. Adding those stats is fine. You guys just need to remove the existing mess first. Jeff
Pull request for 'r8169-20060920-00' tag
Please pull from tag 'r8169-20060920-00' in repository git://electric-eye.fr.zoreil.com/home/romieu/linux-2.6.git to get the change below. Note: Since something went wrong last time I submitted a pull request, here goes the trace from the start of the branch $ git rev-list $(git merge-base v2.6.18 r8169-20060920-00).. d81bf551103cc3bc9e4f7ddf337511d6da0d088f b39fe41f481d20c201012e4483e76c203802dda7 (<- r8169-20060912-00) d2eed8cff9a1a5d7e12ec9ddf71432c466b104d0 5f787a1aca3705bdc6adbda36f8d6446380e85a6 64e4bfb40c9d07a48c1c7e5b8556e92e7cd7406a 5b0384f4fd079c24b976ee333e6d1f0c95cf14de b518fa8eac2d0ac497c0fdb27e4cec68d0249bb7 188f4af04618b32b8ec7c630a3f18201c81ce70c bcf0bf90cd9e9242b66e0563b6a8c8db2e4c262c 4ff96fa67379c31ced69f193c7ffba17051f38e8 623a1593c84afb86b2f496a56fb4ec37f82b5c78 9dccf61112e6755f4e6f154c1794bab3c509bc71 a2b98a697fa4e7564f78905b83db122824916cf9 Shortlog Francois Romieu : r8169: the MMIO region of the 8167 stands behin BAR#1 Patch - diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c index 805562b..93cd1f4 100644 --- a/drivers/net/r8169.c +++ b/drivers/net/r8169.c @@ -210,7 +210,7 @@ static const struct { static struct pci_device_id rtl8169_pci_tbl[] = { { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8129), 0, 0, RTL_CFG_0 }, { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8136), 0, 0, RTL_CFG_2 }, - { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8167), 0, 0, RTL_CFG_1 }, + { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8167), 0, 0, RTL_CFG_0 }, { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8168), 0, 0, RTL_CFG_2 }, { PCI_DEVICE(PCI_VENDOR_ID_REALTEK, 0x8169), 0, 0, RTL_CFG_0 }, { PCI_DEVICE(PCI_VENDOR_ID_DLINK, 0x4300), 0, 0, RTL_CFG_0 }, -- Ueimor - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: UDP Out 0f Sequence
Majumder, Rajib wrote: Hi, If I write UDP datagrams 1,2 and 3 to network and if the receiver receives in order 2,1, and 3, where can the sequence get changed? Is it at the source stack, network transit or destination stack? Yes. :) Although network transit is by far the most likely case. Destination stack is a distant second and source stack an even more distant third. Generally stack writers try to avoid having places in their stacks where things can reorder, but it isn't completely unknown. rick jones
Re: [PATCH 08/23] e1000: add multicast stats counters
>> +{ "rx_broadcast", E1000_STAT(stats.bprc) }, >> +{ "tx_broadcast", E1000_STAT(stats.bptc) }, >> +{ "rx_multicast", E1000_STAT(stats.mprc) }, >> +{ "tx_multicast", E1000_STAT(stats.mptc) }, >> { "rx_errors", E1000_STAT(net_stats.rx_errors) }, >> { "tx_errors", E1000_STAT(net_stats.tx_errors) }, >> { "tx_dropped", E1000_STAT(net_stats.tx_dropped) }, > >NAK -- you also need to remove the standard net stats, which are >exported elsewhere Jeff, can you please explain the reason for this NAK a little more? Neither Auke nor I understand why you rejected the patch. This patch just adds the display of a few more stats in Ethtool. It doesn't affect any other counters, and is really just a convenience feature. I added this to the driver because of a customer request. Thank you in advance for edifying us. -Mitch
Re: [PATCH 4/9] network namespaces: socket hashes
Hi, On Mon, Sep 18, 2006 at 05:12:49PM +0200, Daniel Lezcano wrote: > Andrey Savochkin wrote: > > Socket hash lookups are made within namespace. > > Hash tables are common for all namespaces, with > > additional permutation of indexes. > > Hi Andrey, > > why is the hash table common and not instantiated multiple times for > each namespace like the routes ? The main reason is that socket hash tables should be large enough to work efficiently, but it isn't good to waste a lot of memory for each namespace. Namespaces should be cheap enough to allow hundreds of them. This memory-efficiency argument, of course, takes priority unless/until socket hash tables start to resize automatically. Another point is that routing lookup is much more complicated than socket lookup, which makes adding another search key harder. Routing also has additional routines for deleting entries matching some patterns, and so on. In short, routing is much more complicated, and it is already quite efficient for various sizes of routing tables. Andrey
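One way to picture "a common hash table with an additional permutation of indexes" is sketched below. This is not the code from the patch set: the table size, the per-namespace hash_mix field and all function names are assumptions made up for illustration. The idea shown is that the namespace contributes to the bucket index and is also compared during lookup, so one shared table serves every namespace.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SOCK_HASH_SIZE 256u   /* power of two; hypothetical size */

/* Hypothetical per-namespace value that permutes bucket indexes. */
struct net_ns {
    uint32_t hash_mix;
};

struct sock_entry {
    uint16_t port;
    const struct net_ns *ns;
    struct sock_entry *next;
};

/* One table shared by all namespaces. */
struct sock_entry *sock_table[SOCK_HASH_SIZE];

unsigned int sock_bucket(const struct net_ns *ns, uint16_t port)
{
    return (port ^ ns->hash_mix) & (SOCK_HASH_SIZE - 1u);
}

void sock_insert(struct sock_entry *e)
{
    unsigned int b;

    assert(e != NULL && e->ns != NULL);
    b = sock_bucket(e->ns, e->port);
    e->next = sock_table[b];
    sock_table[b] = e;
}

/* The namespace is part of the key, so equal ports never collide logically. */
struct sock_entry *sock_lookup(const struct net_ns *ns, uint16_t port)
{
    struct sock_entry *e;

    assert(ns != NULL);
    for (e = sock_table[sock_bucket(ns, port)]; e; e = e->next)
        if (e->port == port && e->ns == ns)
            return e;
    return NULL;
}
```

The memory cost per namespace is then a single word for the mix value rather than a whole table, which is the trade-off argued for in the reply above.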
Re: [PATCH 11/23] e1000: Jumbo frames fixes for 82573
Jeff Garzik wrote: Kok, Auke wrote: Disable jumbo frames for 82573L altogether and when ASPM is enabled since the hardware has problems with it. For the NICs that do support this in the 82573 series we set ERT_2048 to attempt to receive as much traffic as early as we can. Signed-off-by: Bruce Allan <[EMAIL PROTECTED]> Signed-off-by: Auke Kok <[EMAIL PROTECTED]> --- drivers/net/e1000/e1000_main.c | 10 +++--- 1 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c index e81aa03..2ecec51 100644 --- a/drivers/net/e1000/e1000_main.c +++ b/drivers/net/e1000/e1000_main.c @@ -3138,11 +3138,13 @@ e1000_change_mtu(struct net_device *netd } break; case e1000_82573: -/* only enable jumbo frames if ASPM is disabled completely - * this means both bits must be zero in 0x1A bits 3:2 */ +/* Jumbo Frames not supported if: + * - this is not an 82573L device + * - ASPM is enabled in any way (0x1A bits 3:2) */ e1000_read_eeprom(&adapter->hw, EEPROM_INIT_3GIO_3, 1, &eeprom_data); -if (eeprom_data & EEPROM_WORD1A_ASPM_MASK) { +if ((adapter->hw.device_id != E1000_DEV_ID_82573L) || +(eeprom_data & EEPROM_WORD1A_ASPM_MASK)) { if (max_frame > MAXIMUM_ETHERNET_FRAME_SIZE) { DPRINTK(PROBE, ERR, "Jumbo Frames not supported.\n"); NAK. At probe time, set a jumbo-frames-enabled bit, then test it in e1000_change_mtu(). Don't include all this chip-checking code into the change_mtu function. I agree with the concept, not with the NAK. This workaround was already there and is not a significant new introduction of out-of-band workarounds in code. Cleaning e1000 up is a major task that will take a few more months. This workaround changes 3 lines of code and will help today. Cheers, Auke
Re: [PATCH 12/23] e1000: Maybe stop TX if not enough free descriptors
Jeff Garzik wrote: Actually, I rescind the ACK. The code should be inside a spinlock, and therefore not need this additional check. If this check were truly needed, then SMP code all over the kernel would be broken. I will drop the patch for now. Once Jesse is back next week he gets to explain it all to me :) Auke
RE: [redhat-lspp] ipsec acquire has security context although I a m not using it.
Venkat, >This doesn't look right since kzalloc would already have zeroed the >structure out. Are you sure you are getting garbage in the acquire >from the kernel? If you are, I strongly doubt that this would be the >one causing it (unless kzalloc on this arch misbehaved). >Or is this a racoon bug? Yes, you are correct! Thanks for pointing this out to me as I missed it! It is racoon that has the bug. Will fix and post correct fix shortly. Please ignore attached fix as it is incorrect. Again, thanks! Regards, Joy >> When using ipsec while selinux is enabled in my kernel, >> my racoon daemon fails to establish an SA. I believe the >> ACQUIRE sent from kernel has a security context although I >> am not using this feature with ipsec. As a result, racoon >> fails to establish the SA, because it is looking for a policy >> with security context. I noticed the security context >> contains garbage. >> >> I am using a pseries, power5, ppc64 box, and it appears >> that since policy->security structure is not really initialized >> or zero'd out when not using, it is possible it may contain garbage >> on my pseries and a call such as "if (policy->security)" may >> come back as true such that security context is included in >> my acquire message although I believe it should not be. >> >> Hopefully, the below patch is acceptable. I have compiled and >> tested it. >> >> Regards, >> Joy Latten >> >> >> diff -urpN linux-2.6.17.orig/net/xfrm/xfrm_policy.c >> linux-2.6.17.patch/net/xfrm/xfrm_policy.c >> --- linux-2.6.17.orig/net/xfrm/xfrm_policy.c 2006-09-19 >> 02:11:33.0 -0500 >> +++ linux-2.6.17.patch/net/xfrm/xfrm_policy.c2006-09-19 >> 04:33:50.0 -0500 >> @@ -319,6 +319,7 @@ struct xfrm_policy *xfrm_policy_alloc(gf >> init_timer(&policy->timer); >> policy->timer.data = (unsigned long)policy; >> policy->timer.function = xfrm_policy_timer; >> +policy->security = NULL; >> } >> return policy; >> } >> -- This message was distributed to subscribers of the selinux mailing list. 
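Venkat's point, that a zeroing allocator already leaves policy->security as NULL, can be illustrated with calloc() as a userspace stand-in for kzalloc(). The struct layout and names below are hypothetical. The sketch also relies on all-bits-zero being a null pointer, which holds on the platforms Linux supports but is not guaranteed by the C standard in general.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct sec_ctx;   /* opaque; stands in for the security context */

/* Hypothetical, heavily reduced stand-in for struct xfrm_policy. */
struct policy {
    uintptr_t timer_data;
    struct sec_ctx *security;
};

/* calloc() plays the role of kzalloc(): the memory comes back zeroed. */
struct policy *policy_alloc(void)
{
    struct policy *p = calloc(1, sizeof(*p));

    if (p)
        p->timer_data = (uintptr_t)p;
    /* No "p->security = NULL;" needed: the field is already zero. */
    return p;
}

/* Mirrors the "if (policy->security)" style of test from the report. */
int policy_has_secctx(const struct policy *p)
{
    assert(p != NULL);
    return p->security != NULL;
}
```

If such a check ever reports a context on a freshly allocated policy, the garbage has to come from somewhere after allocation, which is consistent with the bug turning out to be in racoon.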
Re: ipvs locahost client patch for 2.6?
On Fri, 11 Aug 2006 01:18:38, Ryan Nowakowski wrote: > I found this patch for 2.4 that allows the host running ipvs to act > as its own client via loopback connection. Does anyone have a similar > patch for 2.6? Not that I am aware of, though that kind of approach may well work for 2.6 with little effort. -- Horms H: http://www.vergenet.net/~horms/ W: http://www.valinux.co.jp/en/
Re: [patch 3/4] Make sure ip_vs_ftp ports are valid
On Wed, Sep 20, 2006 at 12:29:45PM +0200, Patrick McHardy wrote: > Horms wrote: > > Here is the revised patch. > > > > > > [IPVS] Make sure ip_vs_ftp ports are valid > > > > I'm not entirely sure what happens in the case of a valid port, > > at best it'll be silently ignored. This patch ensures that > > the port values are unsigned short values, and thus always valid. > > > > Cc: Patrick McHardy <[EMAIL PROTECTED]> > > Signed-Off-By: Simon Horman <[EMAIL PROTECTED]> > > > > Index: linux-2.6/net/ipv4/ipvs/ip_vs_ftp.c > > === > > --- linux-2.6.orig/net/ipv4/ipvs/ip_vs_ftp.c2006-09-04 > 10:47:09.0 +0900 > > +++ linux-2.6/net/ipv4/ipvs/ip_vs_ftp.c 2006-09-04 10:59:30.0 > +0900 > > @@ -44,8 +44,8 @@ > > * List of ports (up to IP_VS_APP_MAX_PORTS) to be handled by helper > > * First port is set to the default port. > > */ > > -static int ports[IP_VS_APP_MAX_PORTS] = {21, 0}; > > -module_param_array(ports, int, NULL, 0); > > +static unsigned short ports[IP_VS_APP_MAX_PORTS] = {21, 0}; > > +module_param_array(ports, ushort, NULL, 0); > > MODULE_PARM_DESC(ports, "Ports to monitor for FTP control commands"); > > > > /* > > It looks like the wrong patch went in: > > http://marc.theaimsgroup.com/?l=git-commits-head&m=115862407021941&w=2 Thanks for pointing that out. I'll send out patches to reverse the committed change, and add the newer incarnation. -- Horms H: http://www.vergenet.net/~horms/ W: http://www.valinux.co.jp/en/
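The quoted patch gets its range check for free by declaring the parameter as ushort, so the module parameter parser rejects anything that does not fit. For comparison, here is what an explicit check looks like in plain userspace C. This is only an illustration of the idea with a hypothetical parse_port() helper; it is not how the kernel's parameter code is written.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parse a decimal port number; accept only values in 0..65535. */
int parse_port(const char *s, unsigned short *out)
{
    char *end;
    unsigned long v;

    assert(s != NULL && out != NULL);
    /* strtoul() would accept signs and leading spaces; refuse them. */
    if (*s < '0' || *s > '9')
        return -EINVAL;
    errno = 0;
    v = strtoul(s, &end, 10);
    if (errno != 0 || *end != '\0' || v > 65535UL)
        return -EINVAL;
    *out = (unsigned short)v;
    return 0;
}
```

With an int parameter, a value such as 70000 or -1 would be stored as given and only misbehave later; with a type-limited parameter the bad value is refused when the module is loaded.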
Re: UDP Out 0f Sequence
Network transit. Different datagrams might go through different routes, hence the out-of-sequence arrival. On 9/20/06, Majumder, Rajib <[EMAIL PROTECTED]> wrote: Hi, If I write UDP datagrams 1,2 and 3 to network and if the receiver receives in order 2,1, and 3, where can the sequence get changed? Is it at the source stack, network transit or destination stack? Any reply is highly appreciated. Thanks Rajib
Re: 2.6.18-rc7-mm1
On Wednesday, 20 September 2006 16:23, Mike Galbraith wrote: > On Tue, 2006-09-19 at 13:36 -0700, Andrew Morton wrote: > > On Tue, 19 Sep 2006 22:25:21 +0200 > > "Rafael J. Wysocki" <[EMAIL PROTECTED]> wrote: > > > > > > - It took maybe ten hours solid work to get this dogpile vaguely > > > > compiling and limping to a login prompt on x86, x86_64 and powerpc. > > > > I guess it's worth briefly testing if you're keen. > > > > > > It's not that bad, but unfortunately the networking doesn't work on my > > > system > > > (HPC nx6325 + SUSE 10.1 w/ updates, 64-bit). Apparently, the interfaces > > > don't > > > get configured (both tg3 and bcm43xx are affected). > > > > Is there anything interesting in the dmesg output? > > > > Perhaps an `strace -f ifup' or whatever would tell us what's failing. > > FYI, it's SuSE's /sbin/getcfg binary that doesn't like the changes. It > sees /sys/class/net/eth0 as a symlink, and reels off into sys/block (?) > looking for a directory. I have filed a report in the SUSE bugzilla. Let's see what happens. Greetings, Rafael -- You never change things by fighting the existing reality. R. Buckminster Fuller
Re: 2.6.18-rc7-mm1
On Tue, 2006-09-19 at 13:36 -0700, Andrew Morton wrote:
> On Tue, 19 Sep 2006 22:25:21 +0200
> "Rafael J. Wysocki" <[EMAIL PROTECTED]> wrote:
>
> > > - It took maybe ten hours solid work to get this dogpile vaguely
> > >   compiling and limping to a login prompt on x86, x86_64 and powerpc.
> > >   I guess it's worth briefly testing if you're keen.
> >
> > It's not that bad, but unfortunately the networking doesn't work on my
> > system (HPC nx6325 + SUSE 10.1 w/ updates, 64-bit).  Apparently, the
> > interfaces don't get configured (both tg3 and bcm43xx are affected).
>
> Is there anything interesting in the dmesg output?
>
> Perhaps an `strace -f ifup' or whatever would tell us what's failing.

FYI, it's SuSE's /sbin/getcfg binary that doesn't like the changes.  It
sees /sys/class/net/eth0 as a symlink, and reels off into sys/block (?)
looking for a directory.

lstat64("/sys/class/net/eth0", {st_dev=makedev(0, 0), st_ino=5968, st_mode=S_IFLNK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2006/09/20-13:59:13, st_mtime=2006/09/20-13:58:57, st_ctime=2006/09/20-13:58:57}) = 0
lstat64("/sys/block/eth0", 0xbf9e432c) = -1 ENOENT (No such file or directory)
open("/proc/mounts", O_RDONLY) = 3
fstat64(3, {st_dev=makedev(0, 3), st_ino=22711, st_mode=S_IFREG|0444, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2006/09/20-14:00:35, st_mtime=2006/09/20-14:00:35, st_ctime=2006/09/20-14:00:35}) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f59000
read(3, "rootfs / rootfs rw 0 0\nudev /dev"..., 4096) = 601
close(3) = 0
munmap(0xb7f59000, 4096) = 0
lstat64("/sys/block", {st_dev=makedev(0, 0), st_ino=256, st_mode=S_IFDIR|0755, st_nlink=18, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2006/09/20-14:00:17, st_mtime=2006/09/20-13:58:59, st_ctime=2006/09/20-13:58:59}) = 0
lstat64("/sys/block", {st_dev=makedev(0, 0), st_ino=256, st_mode=S_IFDIR|0755, st_nlink=18, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2006/09/20-14:00:17, st_mtime=2006/09/20-13:58:59, st_ctime=2006/09/20-13:58:59}) = 0
open("/dev/null", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = -1 ENOTDIR (Not a directory)
open("/sys/block", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_dev=makedev(0, 0), st_ino=256, st_mode=S_IFDIR|0755, st_nlink=18, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2006/09/20-14:00:17, st_mtime=2006/09/20-13:58:59, st_ctime=2006/09/20-13:58:59}) = 0
fcntl64(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(3, {{d_ino=256, d_off=1, d_type=DT_DIR, d_reclen=24, d_name="."} {d_ino=1, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=".."} {d_ino=11521, d_off=3, d_type=DT_DIR, d_reclen=24, d_name="sde"} {d_ino=11455, d_off=4, d_type=DT_DIR, d_reclen=24, d_name="sdd"} {d_ino=11416, d_off=5, d_type=DT_DIR, d_reclen=24, d_name="sdc"} {d_ino=11358, d_off=6, d_type=DT_DIR, d_reclen=24, d_name="sdb"} {d_ino=11311, d_off=7, d_type=DT_DIR, d_reclen=24, d_name="sda"} {d_ino=1784, d_off=8, d_type=DT_DIR, d_reclen=24, d_name="hdd"} {d_ino=1770, d_off=9, d_type=DT_DIR, d_reclen=24, d_name="hdc"} {d_ino=1757, d_off=10, d_type=DT_DIR, d_reclen=24, d_name="hda"} {d_ino=1725, d_off=11, d_type=DT_DIR, d_reclen=32, d_name="loop7"} {d_ino=1722, d_off=12, d_type=DT_DIR, d_reclen=32, d_name="loop6"} {d_ino=1719, d_off=13, d_type=DT_DIR, d_reclen=32, d_name="loop5"} {d_ino=1716, d_off=14, d_type=DT_DIR, d_reclen=32, d_name="loop4"} {d_ino=1713, d_off=15, d_type=DT_DIR, d_reclen=32, d_name="loop3"} {d_ino=1710, d_off=16, d_type=DT_DIR, d_reclen=32, d_name="loop2"} {d_ino=1707, d_off=17, d_type=DT_DIR, d_reclen=32, d_name="loop1"} {d_ino=1704, d_off=18, d_type=DT_DIR, d_reclen=32, d_name="loop0"}}, 4096) = 496
lstat64("/sys/block/sde", {st_dev=makedev(0, 0), st_ino=11521, st_mode=S_IFDIR|0755, st_nlink=5, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2006/09/20-13:59:14, st_mtime=2006/09/20-13:58:59, st_ctime=2006/09/20-13:58:59}) = 0
lstat64("/sys/block/sde", {st_dev=makedev(0, 0), st_ino=11521, st_mode=S_IFDIR|0755, st_nlink=5, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2006/09/20-13:59:14, st_mtime=2006/09/20-13:58:59, st_ctime=2006/09/20-13:58:59}) = 0
lstat64("/sys/block/sdd", {st_dev=makedev(0, 0), st_ino=11455, st_mode=S_IFDIR|0755, st_nlink=5, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2006/09/20-13:59:14, st_mtime=2006/09/20-13:58:59, st_ctime=2006/09/20-13:58:59}) = 0
lstat64("/sys/block/sdd", {st_dev=makedev(0, 0), st_ino=11455, st_mode=S_IFDIR|0755, st_nlink=5, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=2006/09/20-13:59:14, st_mtime=2006/09/20-13:58:59, st_ctime=2006/09/20-13:58:59}) = 0
lstat64("/sys/block/sdc", {st_dev=makedev(0, 0), st_ino=11416,
RE: [patch 3/3] Add tsi108 On Chip Ethernet device driver support
On Tue, 2006-09-19 at 15:39 +0800, Zang Roy-r61911 wrote:
> >
> > > +	spin_unlock_irq(&phy_lock);
> > > +	msleep(10);
> > > +	spin_lock_irq(&phy_lock);
> > > + }
> >
> > hmm some places take phy_lock with disabling interrupts, while others
> > don't. I sort of fear "the others" may be buggy are you sure those
> > are ok?
> Could you interpret your comments in detail?
> Roy

Hi,

sorry for being unclear/too short in the review.

The phy_lock lock is sometimes taken as spin_lock() and sometimes as
spin_lock_irq(). It looks like it can be used in interrupt context, in
which case the spin_lock_irq() version is correct and the places where
spin_lock() is used would be a deadlock bug (just think what happens if
the interrupt happens while spin_lock(&phy_lock) is held, and the
interrupt handler then tries to take the lock again!)

If there is no way this lock is used in interrupt context, then the
spin_lock_irq() version is doing something which is not needed and also
a bit expensive; so it could be optimized.

But my impression is that the _irq() is needed. Also, please consider
switching from spin_lock_irq() to the spin_lock_irqsave() version
instead; spin_unlock_irq() has some side effects (interrupts get enabled
unconditionally) so it is generally safer to use the
spin_lock_irqsave()/spin_unlock_irqrestore() API.

If you have more questions please do not hesitate to ask!

Greetings,
   Arjan van de Ven

-- 
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
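The side effect Arjan mentions (spin_unlock_irq() enabling interrupts unconditionally) can be illustrated with a toy userspace model. This is a sketch under loud assumptions: it is not kernel code, there is no real lock, and a single boolean stands in for the CPU's interrupt-enable state. All `model_*` and `helper_*` names are hypothetical.

```c
#include <stdbool.h>

/* Toy model only: one flag stands in for the CPU interrupt-enable state. */
static bool irqs_enabled = true;

/* Models spin_lock_irq(): disables interrupts, forgets the previous state. */
static void model_lock_irq(void)
{
	irqs_enabled = false;
}

/* Models spin_unlock_irq(): enables interrupts unconditionally. */
static void model_unlock_irq(void)
{
	irqs_enabled = true;
}

/* Models spin_lock_irqsave(): remembers the previous state, then disables. */
static unsigned long model_lock_irqsave(void)
{
	unsigned long flags = irqs_enabled ? 1UL : 0UL;

	irqs_enabled = false;
	return flags;
}

/* Models spin_unlock_irqrestore(): puts back whatever state was saved. */
static void model_unlock_irqrestore(unsigned long flags)
{
	irqs_enabled = (flags != 0);
}

/* A helper that may be called with interrupts already disabled. */
static bool helper_with_lock_irq(void)
{
	model_lock_irq();
	/* ... critical section ... */
	model_unlock_irq();
	return irqs_enabled;	/* always true, even if the caller had them off */
}

static bool helper_with_lock_irqsave(void)
{
	unsigned long flags = model_lock_irqsave();

	/* ... critical section ... */
	model_unlock_irqrestore(flags);
	return irqs_enabled;	/* same as on entry */
}
```

In this model, a caller that enters with interrupts disabled gets them back enabled after `helper_with_lock_irq()`, while `helper_with_lock_irqsave()` leaves the caller's state untouched. That is the reason the irqsave variant is the safer default when the calling context is not known.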
source ip selection for local route
Hello,

I have a question regarding the subject. In the old threads it seemed to
have been concluded that ignoring the preferred source in a local route
was a bug, and a patch was proposed. See

http://marc.theaimsgroup.com/?l=linux-netdev&m=99985580920599&w=2

There, the following patch is proposed:

--- net/ipv4/route.c.x	Fri Sep 7 02:30:54 2001
+++ net/ipv4/route.c	Fri Sep 7 02:30:54 2001
@@ -1795,14 +1795,13 @@
 	if (res.type == RTN_LOCAL) {
 		if (!key.src)
-			key.src = key.dst;
+			key.src = res.fi->fib_prefsrc ? : key.dst;
 		if (dev_out)
 			dev_put(dev_out);
 		dev_out = &loopback_dev;
 		dev_hold(dev_out);
 		key.oif = dev_out->ifindex;
-		if (res.fi)
-			fib_info_put(res.fi);
+		fib_info_put(res.fi);
 		res.fi = NULL;
 		flags |= RTCF_LOCAL;
 		goto make_route;

However, in a relatively recent kernel (2.6.17.9) it seems that the
patch hasn't been applied.

net/ipv4/route.c:

	if (res.type == RTN_LOCAL) {
		if (!fl.fl4_src)
			fl.fl4_src = fl.fl4_dst;
		if (dev_out)
			dev_put(dev_out);
		dev_out = &loopback_dev;
		dev_hold(dev_out);
		fl.oif = dev_out->ifindex;
		if (res.fi)
			fib_info_put(res.fi);
		res.fi = NULL;
		flags |= RTCF_LOCAL;
		goto make_route;
	}

And actually the source IP of the communication between two local
interfaces is always that of the destination.

So the question is why the patch hasn't been applied to the mainline
kernel, although deciding the source IP for a local route based on the
routing table seemed more natural than making it the IP of the
destination.

The actual problem I have is the following (if you are interested).

We have an nfs server whose IP address is shared among two hosts using
vrrp, and those two hosts also act as nfs clients. When host1 is the nfs
server, the two hosts have the following IPs:

host1	IP1, VRIP (nfs server's IP shared using vrrp)
host2	IP2

The nfs server and client IPs become as follows, because on host1 the
source IP of an nfs packet becomes that of the destination, i.e. VRIP:

		nfs server IP	nfs client IP
on host1	VRIP		VRIP
on host2	VRIP		IP2

We also share the content of the rmtab, which is something like

VRIP:/filesystem:0x0001
IP2:/filesystem:0x0001

(the first column is the nfs client, the second is the shared file
system and the third column is the mount count.)

When something happens to host1, the failover is triggered and the VRIP
is moved to host2:

host1	IP1
host2	IP2, VRIP (nfs server's IP shared using vrrp)

		nfs server IP	nfs client IP
on host1	VRIP		IP1
on host2	VRIP		VRIP

Accesses to the nfs mounted filesystem on host1 will be denied if the
content of the rmtab doesn't change, because the nfs server on host2
thinks the clients are only VRIP and IP2.

Of course, it can be avoided if we unmount and remount the file system
on the client, or if we appropriately change the content of the rmtab
when the failover occurs.

However, I think it would be much nicer if the source IP of the client
was always the primary IP of an interface. This is realized if the
source IP is determined by the preferred source in the routing table.
Then the IPs of the nfs server and the clients are always like this, and
this doesn't cause any problem when the failover happens:

		nfs server IP	nfs client IP
on host1	VRIP		IP1
on host2	VRIP		IP2

Thanks in advance,

Kimitoshi Takahashi,
Cluster Computing Inc., Japan
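The difference between the current behaviour and the 2001 proposal can be sketched in plain C. This is only an illustration under stated assumptions, not kernel code: `struct local_route` and both function names are hypothetical stand-ins, a `prefsrc` of 0 means "no preferred source configured", and the addresses in the usage note are made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the routing information the patch consults. */
struct local_route {
	uint32_t prefsrc;	/* preferred source address, 0 if not set */
};

/* Current behaviour as quoted: an unset source becomes the destination. */
static uint32_t pick_src_current(uint32_t requested_src, uint32_t dst)
{
	return requested_src ? requested_src : dst;
}

/*
 * Proposed behaviour: an unset source becomes the route's preferred
 * source when one exists, and only then falls back to the destination.
 * The NULL check is this sketch's own addition.
 */
static uint32_t pick_src_proposed(uint32_t requested_src, uint32_t dst,
				  const struct local_route *rt)
{
	if (requested_src)
		return requested_src;
	if (rt != NULL && rt->prefsrc)
		return rt->prefsrc;
	return dst;
}
```

With VRIP as the destination and IP1 configured as the preferred source, the first function yields VRIP and the second yields IP1, which matches the last table in the mail above.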
Re: [patch 3/4] Make sure ip_vs_ftp ports are valid
Horms wrote:
> Here is the revised patch.
>
>
> [IPVS] Make sure ip_vs_ftp ports are valid
>
> I'm not entirely sure what happens in the case of a valid port,
> at best it'll be silently ignored. This patch ensures that
> the port values are unsigned short values, and thus always valid.
>
> Cc: Patrick McHardy <[EMAIL PROTECTED]>
> Signed-Off-By: Simon Horman <[EMAIL PROTECTED]>
>
> Index: linux-2.6/net/ipv4/ipvs/ip_vs_ftp.c
> ===
> --- linux-2.6.orig/net/ipv4/ipvs/ip_vs_ftp.c 2006-09-04 10:47:09.0 +0900
> +++ linux-2.6/net/ipv4/ipvs/ip_vs_ftp.c 2006-09-04 10:59:30.0 +0900
> @@ -44,8 +44,8 @@
>   * List of ports (up to IP_VS_APP_MAX_PORTS) to be handled by helper
>   * First port is set to the default port.
>   */
> -static int ports[IP_VS_APP_MAX_PORTS] = {21, 0};
> -module_param_array(ports, int, NULL, 0);
> +static unsigned short ports[IP_VS_APP_MAX_PORTS] = {21, 0};
> +module_param_array(ports, ushort, NULL, 0);
>  MODULE_PARM_DESC(ports, "Ports to monitor for FTP control commands");
>
>  /*

It looks like the wrong patch went in:

http://marc.theaimsgroup.com/?l=git-commits-head&m=115862407021941&w=2
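The quoted patch relies on the parameter type (ushort) to keep port values in range. As a rough userspace analogue, and only as a sketch, the same constraint can be checked explicitly when parsing text. `parse_port()` is a hypothetical helper, not part of ip_vs_ftp or of the module parameter code.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a decimal port number.
 * Returns 0 and stores the value on success, -1 if the text is not a
 * plain number or does not fit the 0..65535 range of an unsigned short.
 */
static int parse_port(const char *text, unsigned short *port)
{
	char *end;
	long value;

	errno = 0;
	value = strtol(text, &end, 10);
	if (errno != 0 || end == text || *end != '\0')
		return -1;	/* overflow, empty input or trailing junk */
	if (value < 0 || value > 65535)
		return -1;	/* outside the valid port range */

	*port = (unsigned short)value;
	return 0;
}
```

Rejecting out-of-range input up front avoids the situation the commit message worries about, where a bad value is silently ignored or wrapped.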
[take19 3/4] kevent: Socket notifications.
Socket notifications. This patch include socket send/recv/accept notifications. Using trivial web server based on kevent and this features instead of epoll it's performance increased more than noticebly. More details about benchmark and server itself (evserver_kevent.c) can be found on project's homepage. Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]> diff --git a/fs/inode.c b/fs/inode.c index 0bf9f04..181521d 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -21,6 +21,7 @@ #include #include #include #include +#include #include /* @@ -165,12 +166,18 @@ #endif } memset(&inode->u, 0, sizeof(inode->u)); inode->i_mapping = mapping; +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_init(inode, &inode->st); +#endif } return inode; } void destroy_inode(struct inode *inode) { +#if defined CONFIG_KEVENT_SOCKET + kevent_storage_fini(&inode->st); +#endif BUG_ON(inode_has_buffers(inode)); security_inode_free(inode); if (inode->i_sb->s_op->destroy_inode) diff --git a/include/linux/fs.h b/include/linux/fs.h index 2561020..a697930 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -236,6 +236,7 @@ #include #include #include #include +#include #include #include @@ -546,6 +547,10 @@ #ifdef CONFIG_INOTIFY struct mutexinotify_mutex; /* protects the watches list */ #endif +#ifdef CONFIG_KEVENT_SOCKET + struct kevent_storage st; +#endif + unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -698,6 +703,9 @@ #ifdef CONFIG_EPOLL struct list_headf_ep_links; spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ +#ifdef CONFIG_KEVENT_POLL + struct kevent_storage st; +#endif struct address_space*f_mapping; }; extern spinlock_t files_lock; diff --git a/include/net/sock.h b/include/net/sock.h index 324b3ea..5d71ed7 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -48,6 +48,7 @@ #include #include #include /* struct sk_buff */ #include +#include #include @@ -450,6 +451,21 @@ static inline int sk_stream_memory_free( extern void 
sk_stream_rfree(struct sk_buff *skb); +struct socket_alloc { + struct socket socket; + struct inode vfs_inode; +}; + +static inline struct socket *SOCKET_I(struct inode *inode) +{ + return &container_of(inode, struct socket_alloc, vfs_inode)->socket; +} + +static inline struct inode *SOCK_INODE(struct socket *socket) +{ + return &container_of(socket, struct socket_alloc, socket)->vfs_inode; +} + static inline void sk_stream_set_owner_r(struct sk_buff *skb, struct sock *sk) { skb->sk = sk; @@ -477,6 +493,7 @@ static inline void sk_add_backlog(struct sk->sk_backlog.tail = skb; } skb->next = NULL; + kevent_socket_notify(sk, KEVENT_SOCKET_RECV); } #define sk_wait_event(__sk, __timeo, __condition) \ @@ -679,21 +696,6 @@ static inline struct kiocb *siocb_to_kio return si->kiocb; } -struct socket_alloc { - struct socket socket; - struct inode vfs_inode; -}; - -static inline struct socket *SOCKET_I(struct inode *inode) -{ - return &container_of(inode, struct socket_alloc, vfs_inode)->socket; -} - -static inline struct inode *SOCK_INODE(struct socket *socket) -{ - return &container_of(socket, struct socket_alloc, socket)->vfs_inode; -} - extern void __sk_stream_mem_reclaim(struct sock *sk); extern int sk_stream_mem_schedule(struct sock *sk, int size, int kind); diff --git a/include/net/tcp.h b/include/net/tcp.h index 7a093d0..69f4ad2 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -857,6 +857,7 @@ static inline int tcp_prequeue(struct so tp->ucopy.memory = 0; } else if (skb_queue_len(&tp->ucopy.prequeue) == 1) { wake_up_interruptible(sk->sk_sleep); + kevent_socket_notify(sk, KEVENT_SOCKET_RECV|KEVENT_SOCKET_SEND); if (!inet_csk_ack_scheduled(sk)) inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK, (3 * TCP_RTO_MIN) / 4, diff --git a/kernel/kevent/kevent_socket.c b/kernel/kevent/kevent_socket.c new file mode 100644 index 000..1ddd2a1 --- /dev/null +++ b/kernel/kevent/kevent_socket.c @@ -0,0 +1,126 @@ +/* + * kevent_socket.c + * + * 2006 Copyright (c) Evgeniy Polyakov 
<[EMAIL PROTECTED]> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, +
[take19 4/4] kevent: Timer notifications.
Timer notifications. Timer notifications can be used for fine grained per-process time management, since interval timers are very inconvenient to use, and they are limited. Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]> diff --git a/kernel/kevent/kevent_timer.c b/kernel/kevent/kevent_timer.c new file mode 100644 index 000..04acc46 --- /dev/null +++ b/kernel/kevent/kevent_timer.c @@ -0,0 +1,113 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#include +#include +#include +#include +#include +#include +#include +#include + +struct kevent_timer +{ + struct hrtimer ktimer; + struct kevent_storage ktimer_storage; + struct kevent *ktimer_event; +}; + +static int kevent_timer_func(struct hrtimer *timer) +{ + struct kevent_timer *t = container_of(timer, struct kevent_timer, ktimer); + struct kevent *k = t->ktimer_event; + + kevent_storage_ready(&t->ktimer_storage, NULL, KEVENT_MASK_ALL); + hrtimer_forward(timer, timer->base->softirq_time, + ktime_set(k->event.id.raw[0], k->event.id.raw[1])); + return HRTIMER_RESTART; +} + +static struct lock_class_key kevent_timer_key; + +static int kevent_timer_enqueue(struct kevent *k) +{ + int err; + struct kevent_timer *t; + + t = kmalloc(sizeof(struct kevent_timer), GFP_KERNEL); + if (!t) + return -ENOMEM; + + hrtimer_init(&t->ktimer, CLOCK_MONOTONIC, HRTIMER_REL); + t->ktimer.expires = ktime_set(k->event.id.raw[0], k->event.id.raw[1]); + t->ktimer.function = kevent_timer_func; + t->ktimer_event = k; + + err = kevent_storage_init(&t->ktimer, &t->ktimer_storage); + if (err) + goto err_out_free; + lockdep_set_class(&t->ktimer_storage.lock, &kevent_timer_key); + + err = kevent_storage_enqueue(&t->ktimer_storage, k); + if (err) + goto err_out_st_fini; + + printk("%s: jiffies: %lu, timer: %p.\n", __func__, jiffies, &t->ktimer); + hrtimer_start(&t->ktimer, t->ktimer.expires, HRTIMER_REL); + + return 0; + +err_out_st_fini: + kevent_storage_fini(&t->ktimer_storage); +err_out_free: + kfree(t); + + return err; +} + +static int kevent_timer_dequeue(struct kevent *k) +{ + struct kevent_storage *st = k->st; + struct kevent_timer *t = container_of(st, struct kevent_timer, ktimer_storage); + + hrtimer_cancel(&t->ktimer); + kevent_storage_dequeue(st, k); + kfree(t); + + 
return 0; +} + +static int kevent_timer_callback(struct kevent *k) +{ + k->event.ret_data[0] = jiffies_to_msecs(jiffies); + return 1; +} + +static int __init kevent_init_timer(void) +{ + struct kevent_callbacks tc = { + .callback = &kevent_timer_callback, + .enqueue = &kevent_timer_enqueue, + .dequeue = &kevent_timer_dequeue}; + + return kevent_add_callbacks(&tc, KEVENT_TIMER); +} +module_init(kevent_init_timer); + - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[take19 1/4] kevent: Core files.
Core files. This patch includes core kevent files: - userspace controlling - kernelspace interfaces - initialization - notification state machines Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]> diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S index dd63d47..c10698e 100644 --- a/arch/i386/kernel/syscall_table.S +++ b/arch/i386/kernel/syscall_table.S @@ -317,3 +317,6 @@ ENTRY(sys_call_table) .long sys_tee /* 315 */ .long sys_vmsplice .long sys_move_pages + .long sys_kevent_get_events + .long sys_kevent_ctl + .long sys_kevent_wait /* 320 */ diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S index 5d4a7d1..a06b76f 100644 --- a/arch/x86_64/ia32/ia32entry.S +++ b/arch/x86_64/ia32/ia32entry.S @@ -710,7 +710,10 @@ #endif .quad compat_sys_get_robust_list .quad sys_splice .quad sys_sync_file_range - .quad sys_tee + .quad sys_tee /* 315 */ .quad compat_sys_vmsplice .quad compat_sys_move_pages + .quad sys_kevent_get_events + .quad sys_kevent_ctl + .quad sys_kevent_wait /* 320 */ ia32_syscall_end: diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h index fc1c8dd..68072b5 100644 --- a/include/asm-i386/unistd.h +++ b/include/asm-i386/unistd.h @@ -323,10 +323,13 @@ #define __NR_sync_file_range 314 #define __NR_tee 315 #define __NR_vmsplice 316 #define __NR_move_pages317 +#define __NR_kevent_get_events 318 +#define __NR_kevent_ctl319 +#define __NR_kevent_wait 320 #ifdef __KERNEL__ -#define NR_syscalls 318 +#define NR_syscalls 321 /* * user-visible error numbers are in the range -1 - -128: see diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h index 94387c9..ee907ad 100644 --- a/include/asm-x86_64/unistd.h +++ b/include/asm-x86_64/unistd.h @@ -619,10 +619,16 @@ #define __NR_vmsplice 278 __SYSCALL(__NR_vmsplice, sys_vmsplice) #define __NR_move_pages279 __SYSCALL(__NR_move_pages, sys_move_pages) +#define __NR_kevent_get_events 280 +__SYSCALL(__NR_kevent_get_events, sys_kevent_get_events) 
+#define __NR_kevent_ctl281 +__SYSCALL(__NR_kevent_ctl, sys_kevent_ctl) +#define __NR_kevent_wait 282 +__SYSCALL(__NR_kevent_wait, sys_kevent_wait) #ifdef __KERNEL__ -#define __NR_syscall_max __NR_move_pages +#define __NR_syscall_max __NR_kevent_wait #ifndef __NO_STUBS diff --git a/include/linux/kevent.h b/include/linux/kevent.h new file mode 100644 index 000..24ced10 --- /dev/null +++ b/include/linux/kevent.h @@ -0,0 +1,195 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#ifndef __KEVENT_H +#define __KEVENT_H +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define KEVENT_MIN_BUFFS_ALLOC 3 + +struct kevent; +struct kevent_storage; +typedef int (* kevent_callback_t)(struct kevent *); + +/* @callback is called each time new event has been caught. */ +/* @enqueue is called each time new event is queued. */ +/* @dequeue is called each time event is dequeued. 
*/ + +struct kevent_callbacks { + kevent_callback_t callback, enqueue, dequeue; +}; + +#define KEVENT_READY 0x1 +#define KEVENT_STORAGE 0x2 +#define KEVENT_USER0x4 + +struct kevent +{ + /* Used for kevent freeing.*/ + struct rcu_head rcu_head; + struct ukevent event; + /* This lock protects ukevent manipulations, e.g. ret_flags changes. */ + spinlock_t ulock; + + /* Entry of user's tree. */ + struct rb_node kevent_node; + /* Entry of origin's queue. */ + struct list_headstorage_entry; + /* Entry of user's ready. */ + struct list_headready_entry; + + u32 flags; + + /* User who requested this kevent. */ + struct kevent_user *user; + /* Kevent container. */ + struct kevent_storage *st; + + struct kevent_callbacks callbac
[take19 2/4] kevent: poll/select() notifications.
poll/select() notifications. This patch includes generic poll/select notifications. kevent_poll works simialr to epoll and has the same issues (callback is invoked not from internal state machine of the caller, but through process awake, a lot of allocations and so on.). Signed-off-by: Evgeniy Polyakov <[EMAIL PROTECTED]> diff --git a/include/linux/fs.h b/include/linux/fs.h index 2561020..a697930 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -236,6 +236,7 @@ #include #include #include #include +#include #include #include @@ -546,6 +547,10 @@ #ifdef CONFIG_INOTIFY struct mutexinotify_mutex; /* protects the watches list */ #endif +#ifdef CONFIG_KEVENT_SOCKET + struct kevent_storage st; +#endif + unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -698,6 +703,9 @@ #ifdef CONFIG_EPOLL struct list_headf_ep_links; spinlock_t f_ep_lock; #endif /* #ifdef CONFIG_EPOLL */ +#ifdef CONFIG_KEVENT_POLL + struct kevent_storage st; +#endif struct address_space*f_mapping; }; extern spinlock_t files_lock; diff --git a/kernel/kevent/kevent_poll.c b/kernel/kevent/kevent_poll.c new file mode 100644 index 000..fb74e0f --- /dev/null +++ b/kernel/kevent/kevent_poll.c @@ -0,0 +1,222 @@ +/* + * 2006 Copyright (c) Evgeniy Polyakov <[EMAIL PROTECTED]> + * All rights reserved. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static kmem_cache_t *kevent_poll_container_cache; +static kmem_cache_t *kevent_poll_priv_cache; + +struct kevent_poll_ctl +{ + struct poll_table_structpt; + struct kevent *k; +}; + +struct kevent_poll_wait_container +{ + struct list_headcontainer_entry; + wait_queue_head_t *whead; + wait_queue_twait; + struct kevent *k; +}; + +struct kevent_poll_private +{ + struct list_headcontainer_list; + spinlock_t container_lock; +}; + +static int kevent_poll_enqueue(struct kevent *k); +static int kevent_poll_dequeue(struct kevent *k); +static int kevent_poll_callback(struct kevent *k); + +static int kevent_poll_wait_callback(wait_queue_t *wait, + unsigned mode, int sync, void *key) +{ + struct kevent_poll_wait_container *cont = + container_of(wait, struct kevent_poll_wait_container, wait); + struct kevent *k = cont->k; + struct file *file = k->st->origin; + u32 revents; + + revents = file->f_op->poll(file, NULL); + + kevent_storage_ready(k->st, NULL, revents); + + return 0; +} + +static void kevent_poll_qproc(struct file *file, wait_queue_head_t *whead, + struct poll_table_struct *poll_table) +{ + struct kevent *k = + container_of(poll_table, struct kevent_poll_ctl, pt)->k; + struct kevent_poll_private *priv = k->priv; + struct kevent_poll_wait_container *cont; + unsigned long flags; + + cont = kmem_cache_alloc(kevent_poll_container_cache, SLAB_KERNEL); + if (!cont) { + kevent_break(k); + return; + } + + cont->k = k; + init_waitqueue_func_entry(&cont->wait, kevent_poll_wait_callback); + cont->whead = whead; + + spin_lock_irqsave(&priv->container_lock, flags); + list_add_tail(&cont->container_entry, &priv->container_list); + spin_unlock_irqrestore(&priv->container_lock, flags); + + add_wait_queue(whead, &cont->wait); +} + +static int kevent_poll_enqueue(struct kevent *k) +{ + struct file *file; + int err, ready = 0; + unsigned int revents; + struct kevent_poll_ctl ctl; 
+ struct kevent_poll_private *priv; + + file = fget(k->event.id.raw[0]); + if (!file) + return -ENODEV; + + err = -EINVAL; + if (!file->f_op || !file->f_op->poll) + goto err_out_fput; + + err = -ENOMEM; + priv = kmem_cache_alloc(kevent_poll_priv_cache, SLAB_KERNEL); + if (!priv) + goto err_out_fput; + + spin_lock_init(&priv->container_lock); + INIT_LIST_HEAD(&priv->container_list); + + k->priv = priv; + + ctl.k = k; + init_poll_funcptr(&ctl.pt, &kevent_poll_qproc);
[take19 0/4] kevent: Generic event handling mechanism.
Generic event handling mechanism.

Consider for inclusion.

Changes from 'take18' patchset:
 * use __init instead of __devinit
 * removed 'default N' from config for user statistic
 * removed kevent_user_fini() since kevent can not be unloaded
 * use KERN_INFO for statistic output

Changes from 'take17' patchset:
 * Use RB tree instead of hash table.
	At least for a web server, frequency of addition/deletion of new
	kevent is comparable with number of search access, i.e. most of
	the time events are added, accessed only a couple of times and
	then removed, so it justifies RB tree usage over AVL tree, since
	the latter does have much slower deletion time (max O(log(N))
	compared to 3 ops), although faster search time (1.44*O(log(N))
	vs. 2*O(log(N))). So for kevents I use RB tree for now and
	later, when my AVL tree implementation is ready, it will be
	possible to compare them.
 * Changed readiness check for socket notifications.

With both above changes it is possible to achieve more than 3380
req/second compared to 2200, sometimes 2500 req/second for epoll() for
trivial web-server and httperf client on the same hardware.
It is possible that above kevent limit is due to maximum allowed kevents
in a time limit, which is 4096 events.

Changes from 'take16' patchset:
 * misc cleanups (__read_mostly, const ...)
 * created special macro which is used for mmap size (number of pages)
   calculation
 * export kevent_socket_notify(), since it is used in network protocols
   which can be built as modules (IPv6 for example)

Changes from 'take15' patchset:
 * converted kevent_timer to high-resolution timers, this forces timer
   API update at http://linux-net.osdl.org/index.php/Kevent
 * use struct ukevent* instead of void * in syscalls (documentation has
   been updated)
 * added warning in kevent_add_ukevent() if ring has broken index (for
   testing)

Changes from 'take14' patchset:
 * added kevent_wait()
    This syscall waits until either timeout expires or at least one
    event becomes ready. It also commits that @num events from @start
    are processed by userspace and thus can be removed or rearmed
    (depending on its flags). It can be used for commit events read by
    userspace through mmap interface. Example userspace code (evtest.c)
    can be found on project's homepage.
 * added socket notifications (send/recv/accept)

Changes from 'take13' patchset:
 * do not get lock around user data check in __kevent_search()
 * fail early if there were no registered callbacks for given type of
   kevent
 * trailing whitespace cleanup

Changes from 'take12' patchset:
 * remove non-chardev interface for initialization
 * use pointer to kevent_mring instead of unsigned longs
 * use aligned 64bit type in raw user data (can be used by high-res
   timer if needed)
 * simplified enqueue/dequeue callbacks and kevent initialization
 * use nanoseconds for timeout
 * put number of milliseconds into timer's return data
 * move some definitions into user-visible header
 * removed filenames from comments

Changes from 'take11' patchset:
 * include missing headers into patchset
 * some trivial code cleanups (use goto instead of if/else games and so
   on)
 * some whitespace cleanups
 * check for ready_callback() callback before main loop which should
   save us some ticks

Changes from 'take10' patchset:
 * removed non-existent prototypes
 * added helper function for kevent_registered_callbacks
 * fixed 80 lines comments issues
 * added shared between userspace and kernelspace header instead of
   embedding them in one
 * core restructuring to remove forward declarations
 * s o m e w h i t e s p a c e c o d y n g s t y l e c l e a n u p
 * use vm_insert_page() instead of remap_pfn_range()

Changes from 'take9' patchset:
 * fixed ->nopage method

Changes from 'take8' patchset:
 * fixed mmap release bug
 * use module_init() instead of late_initcall()
 * use better structures for timer notifications

Changes from 'take7' patchset:
 * new mmap interface (not tested, waiting for other changes to be
   acked)
	- use nopage() method to dynamically substitute pages
	- allocate new page for events only when new added kevent
	  requires it
	- do not use ugly index dereferencing, use structure instead
	- reduced amount of data in the ring (id and flags),
	  maximum 12 pages on x86 per kevent fd

Changes from 'take6' patchset:
 * a lot of comments!
 * do not use list poisoning for detection of the fact, that entry is
   in the list
 * return number of ready kevents even if copy*user() fails
 * strict check for number of kevents in syscall
 * use ARRAY_SIZE for array size calculation
 * changed superblock magic number
 * use SLAB_PANIC instead of direct panic() call
 * changed -E* return values
 * a lot of small cleanups and indent fixes

Changes from 'take5' patchset:
 * removed compilation warnings about unused variables when lockdep is not turne
Re: [PATCH] tcp: simpler bic default
On Tue, Sep 19, 2006 at 04:23:55PM -0700, Stephen Hemminger wrote:
> Okay, build testing all the possibilities now, answer by morning..

Please boot some of them as well - I can see a kernel that really wants
to load "bic" at boot time but can't find it.

	Bert

-- 
http://www.PowerDNS.com      Open source, database driven DNS Software
http://netherlabs.nl         Open and Closed source services
Suspend/Resume: IPv6 default route gets lost
Hello,

I'm using FC5 w/ 2.6.17-1.2174_FC5.

When the laptop resumes from suspend, it does not re-send an IPv6
router solicitation. So, if the IPv6 default route expired while you
were in suspend, you'll have to wait for the next multicast unsolicited
RA.

A workaround is to cycle the interface in the ACPI resume scripts.

Maybe triggering an RS is a missing feature in the suspend/resume
kernel functionality?

There has also been discussion (years ago) about a user-space interface
for triggering an RS, but AFAIK, none exists right now.

-- 
Pekka Savola                 "You each name yourselves king, yet the
Netcore Oy                    kingdom bleeds."
Systems. Networks. Security. -- George R.R. Martin: A Clash of Kings
UDP Out of Sequence
Hi, If I write UDP datagrams 1, 2, and 3 to the network and the receiver receives them in the order 2, 1, 3, where can the sequence get changed? Is it in the source stack, in network transit, or in the destination stack? Any reply is highly appreciated. Thanks, Rajib
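UDP itself gives no ordering guarantee, so a receiver that cares about order has to detect reordering on its own. The following is only a minimal sketch of one way to do that, assuming the sender places a 32-bit sequence number in every datagram. That sequence number and the helper name count_late_datagrams() are hypothetical and do not come from the post. The sketch does not say where the reordering happened, only that it did, and it ignores sequence number wraparound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: given sequence numbers in the order the
 * datagrams arrived, count how many arrived after a datagram with a
 * higher sequence number had already been seen.
 * Assumption: sequence numbers do not wrap around.
 */
static size_t count_late_datagrams(const uint32_t *seq, size_t n)
{
	size_t late = 0;
	uint32_t highest = 0;
	size_t i;

	assert(seq != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (i > 0 && seq[i] < highest)
			late++;			/* arrived behind a newer datagram */
		else
			highest = seq[i];	/* new highest sequence seen so far */
	}
	return late;
}
```

For the arrival order in the question (2, 1, 3), the helper reports one late datagram, namely datagram 1.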
Re: [RFC 3/3] cfg80211 thoughts on configuration
* Johannes Berg <[EMAIL PROTECTED]> 2006-09-20 09:03 > > Just use a nested attribute here, this new array format you introduce > > having 1 byte ID, 1 byte len is equivalent to using a set of nested > > attributes with nla_type=id, nla_len=len. > > No, it is only validated, it is then supposed to be copied verbatim into > some 802.11 frames. I thought validating it would be a good idea to not > send out totally bogus frames, but I didn't want to have to mangle it in > the kernel. I see, fair enough, wasn't able to get that from your code.
Re: [RFC 3/3] cfg80211 thoughts on configuration
On Wed, 2006-09-20 at 08:33 +0200, Thomas Graf wrote: > I think I brought this up already, it's a lot easier to understand > things if you keep it symmetric, i.e. NL80211_CMD_GET_CONFIG triggers > sending a NL80211_CMD_NEW_CONFIG. Yes, I think you did :) I'll do that as soon as I get around to reworking it (hoping for more comments...) > Just use a nested attribute here, this new array format you introduce > having 1 byte ID, 1 byte len is equivalent to using a set of nested > attributes with nla_type=id, nla_len=len. No, it is only validated, it is then supposed to be copied verbatim into some 802.11 frames. I thought validating it would be a good idea to not send out totally bogus frames, but I didn't want to have to mangle it in the kernel. johannes
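The "validate, then copy verbatim" approach described above can be sketched as follows. This is not the code from the cfg80211 RFC patch. It is a minimal illustration that assumes the buffer is a sequence of elements, each with a 1 byte ID, a 1 byte length, and a payload of that length, as in the format discussed in this thread. The function name ie_buffer_is_valid() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical validator, not taken from the cfg80211 patch: walk a
 * buffer of (1-byte ID, 1-byte length, payload) elements and check
 * that every element fits inside the buffer. The buffer is only read,
 * so a caller can copy it into a frame unchanged afterwards.
 *
 * Returns 1 if the buffer is well formed, 0 otherwise.
 */
static int ie_buffer_is_valid(const uint8_t *buf, size_t len)
{
	size_t pos = 0;

	assert(buf != NULL || len == 0);

	while (pos < len) {
		size_t elem_len;

		if (len - pos < 2)
			return 0;	/* truncated ID/length header */

		elem_len = buf[pos + 1];
		pos += 2;

		if (len - pos < elem_len)
			return 0;	/* payload runs past the end */

		pos += elem_len;
	}
	return 1;
}
```

Because the check never rewrites the buffer, the kernel would not need to unpack and re-pack the elements the way it would with nested netlink attributes. Whether that trade-off is the right one is the point under discussion in the thread.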