Re: net-next closure?
From: Jeff Kirsher
Date: Wed, 02 Sep 2015 22:50:35 -0700

> I was just about to send out my last series of patches and noticed you
> sent Linus your pull request. So I am guessing that your net-next tree
> is now closed, correct? Just want to make sure before sending anything
> out and did not want to dump patches on you right before the closure of
> your net-next.

Yeah, I already applied too much crap after the merge window opened up so net-next is definitely closed now.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] net: irda: pxaficp_ir: dmaengine conversion
Hi,

This series aims at converting pxaficp_ir to dmaengine. This is almost the last driver to be converted, and once this is gone, legacy DMA support in the pxa architecture can be removed.

Nothing fancy here: a standard readl/writel conversion, then dmaengine support.

The main trouble is that I cannot test it. I only compiled and inserted the module, which works on lubbock, but I have no way to attempt an actual communication.

Petr, Dmitry, once the review is advanced enough, i.e. in a couple of weeks, do you have a way to test it on corgi/magician if I give you a git tree to pull from?

Cheers

--
Robert

Robert Jarzmik (3):
  net: irda: pxaficp_ir: use sched_clock() for time management
  net: irda: pxaficp_ir: convert to readl and writel
  net: irda: pxaficp_ir: dmaengine conversion

 drivers/net/irda/pxaficp_ir.c | 366 +++---
 1 file changed, 233 insertions(+), 133 deletions(-)

--
2.1.4
[PATCH 2/3] net: irda: pxaficp_ir: convert to readl and writel
Convert the pxa IRDA driver to readl and writel primitives, and remove another set of direct registers access. This leaves only the DMA registers access, which will be dealt with dmaengine conversion. Signed-off-by: Robert Jarzmik --- drivers/net/irda/pxaficp_ir.c | 210 +- 1 file changed, 126 insertions(+), 84 deletions(-) diff --git a/drivers/net/irda/pxaficp_ir.c b/drivers/net/irda/pxaficp_ir.c index b1794998c68e..519f6b0568a8 100644 --- a/drivers/net/irda/pxaficp_ir.c +++ b/drivers/net/irda/pxaficp_ir.c @@ -29,15 +29,16 @@ #include #include +#undef __REG +#define __REG(x) (x) #include -#define FICP __REG(0x4080) /* Start of FICP area */ -#define ICCR0 __REG(0x4080) /* ICP Control Register 0 */ -#define ICCR1 __REG(0x4084) /* ICP Control Register 1 */ -#define ICCR2 __REG(0x4088) /* ICP Control Register 2 */ -#define ICDR __REG(0x408c) /* ICP Data Register */ -#define ICSR0 __REG(0x40800014) /* ICP Status Register 0 */ -#define ICSR1 __REG(0x40800018) /* ICP Status Register 1 */ +#define ICCR0 0x /* ICP Control Register 0 */ +#define ICCR1 0x0004 /* ICP Control Register 1 */ +#define ICCR2 0x0008 /* ICP Control Register 2 */ +#define ICDR 0x000c /* ICP Data Register */ +#define ICSR0 0x0014 /* ICP Status Register 0 */ +#define ICSR1 0x0018 /* ICP Status Register 1 */ #define ICCR0_AME (1 << 7)/* Address match enable */ #define ICCR0_TIE (1 << 6)/* Transmit FIFO interrupt enable */ @@ -55,9 +56,7 @@ #define ICCR2_TRIG_16 (1 << 0) /* >= 16 bytes */ #define ICCR2_TRIG_32 (2 << 0) /* >= 32 bytes */ -#ifdef CONFIG_PXA27x #define ICSR0_EOC (1 << 6)/* DMA End of Descriptor Chain */ -#endif #define ICSR0_FRE (1 << 5)/* Framing error */ #define ICSR0_RFS (1 << 4)/* Receive FIFO service request */ #define ICSR0_TFS (1 << 3)/* Transnit FIFO service request */ @@ -98,11 +97,50 @@ IrSR_RCVEIR_UART_MODE | \ IrSR_XMITIR_IR_MODE) +/* macros for registers read/write */ +#define ficp_writel(irda, val, off)\ + do {\ + dev_vdbg(irda->dev, \ +"%s():%d ficp_writel(0x%x, %s)\n", \ 
+__func__, __LINE__, (val), #off); \ + writel_relaxed((val), (irda)->irda_base + (off)); \ + } while (0) + +#define ficp_readl(irda, off) \ + ({ \ + unsigned int _v;\ + _v = readl_relaxed((irda)->irda_base + (off)); \ + dev_vdbg(irda->dev, \ +"%s():%d ficp_readl(%s): 0x%x\n", \ +__func__, __LINE__, #off, _v); \ + _v; \ + }) + +#define stuart_writel(irda, val, off) \ + do {\ + dev_vdbg(irda->dev, \ +"%s():%d stuart_writel(0x%x, %s)\n", \ +__func__, __LINE__, (val), #off); \ + writel_relaxed((val), (irda)->stuart_base + (off)); \ + } while (0) + +#define stuart_readl(irda, off) \ + ({ \ + unsigned int _v;\ + _v = readl_relaxed((irda)->stuart_base + (off));\ + dev_vdbg(irda->dev, \ +"%s():%d stuart_readl(%s): 0x%x\n",\ +__func__, __LINE__, #off, _v); \ + _v; \ + }) + struct pxa_irda { int speed; int newspeed; unsigned long long last_clk; + void __iomem*stuart_base; + void __iomem*irda_base; unsigned char *dma_rx_buff; unsigned char *dma_tx_buff; dma_addr_t dma_rx_buff_phy; @@ -153,7 +191,7 @@ static inline void pxa_irda_enable_sirclk(struct pxa_irda *si) inline static void pxa_irda_fir_dma_rx_start(struct pxa_irda *si) { DCSR(si->rxd
[PATCH 3/3] net: irda: pxaficp_ir: dmaengine conversion
Convert pxaficp_ir to dmaengine. As pxa architecture is shifting from raw DMA registers access to pxa_dma dmaengine driver, convert this driver to dmaengine. Signed-off-by: Robert Jarzmik --- drivers/net/irda/pxaficp_ir.c | 145 +- 1 file changed, 102 insertions(+), 43 deletions(-) diff --git a/drivers/net/irda/pxaficp_ir.c b/drivers/net/irda/pxaficp_ir.c index 519f6b0568a8..42318fb2c95a 100644 --- a/drivers/net/irda/pxaficp_ir.c +++ b/drivers/net/irda/pxaficp_ir.c @@ -19,6 +19,9 @@ #include #include #include +#include +#include +#include #include #include @@ -146,8 +149,12 @@ struct pxa_irda { dma_addr_t dma_rx_buff_phy; dma_addr_t dma_tx_buff_phy; unsigned intdma_tx_buff_len; - int txdma; - int rxdma; + struct dma_chan *txdma; + struct dma_chan *rxdma; + dma_cookie_trx_cookie; + dma_cookie_ttx_cookie; + int drcmr_rx; + int drcmr_tx; int uart_irq; int icp_irq; @@ -165,6 +172,8 @@ struct pxa_irda { struct clk *cur_clk; }; +static int pxa_irda_set_speed(struct pxa_irda *si, int speed); + static inline void pxa_irda_disable_clk(struct pxa_irda *si) { if (si->cur_clk) @@ -188,22 +197,41 @@ static inline void pxa_irda_enable_sirclk(struct pxa_irda *si) #define IS_FIR(si) ((si)->speed >= 400) #define IRDA_FRAME_SIZE_LIMIT 2047 +static void pxa_irda_fir_dma_rx_irq(void *data); +static void pxa_irda_fir_dma_tx_irq(void *data); + inline static void pxa_irda_fir_dma_rx_start(struct pxa_irda *si) { - DCSR(si->rxdma) = DCSR_NODESC; - DSADR(si->rxdma) = (unsigned long)si->irda_base + ICDR; - DTADR(si->rxdma) = si->dma_rx_buff_phy; - DCMD(si->rxdma) = DCMD_INCTRGADDR | DCMD_FLOWSRC | DCMD_WIDTH1 | DCMD_BURST32 | IRDA_FRAME_SIZE_LIMIT; - DCSR(si->rxdma) |= DCSR_RUN; + struct dma_async_tx_descriptor *tx; + + tx = dmaengine_prep_slave_single(si->rxdma, si->dma_rx_buff_phy, +IRDA_FRAME_SIZE_LIMIT, DMA_FROM_DEVICE, +DMA_PREP_INTERRUPT); + if (!tx) { + dev_err(si->dev, "prep_slave_sg() failed\n"); + return; + } + tx->callback = pxa_irda_fir_dma_rx_irq; + tx->callback_param = si; + 
si->rx_cookie = dmaengine_submit(tx); + dma_async_issue_pending(si->rxdma); } inline static void pxa_irda_fir_dma_tx_start(struct pxa_irda *si) { - DCSR(si->txdma) = DCSR_NODESC; - DSADR(si->txdma) = si->dma_tx_buff_phy; - DTADR(si->txdma) = (unsigned long)si->irda_base + ICDR; - DCMD(si->txdma) = DCMD_INCSRCADDR | DCMD_FLOWTRG | DCMD_ENDIRQEN | DCMD_WIDTH1 | DCMD_BURST32 | si->dma_tx_buff_len; - DCSR(si->txdma) |= DCSR_RUN; + struct dma_async_tx_descriptor *tx; + + tx = dmaengine_prep_slave_single(si->txdma, si->dma_tx_buff_phy, +si->dma_tx_buff_len, DMA_TO_DEVICE, +DMA_PREP_INTERRUPT); + if (!tx) { + dev_err(si->dev, "prep_slave_sg() failed\n"); + return; + } + tx->callback = pxa_irda_fir_dma_tx_irq; + tx->callback_param = si; + si->tx_cookie = dmaengine_submit(tx); + dma_async_issue_pending(si->rxdma); } /* @@ -242,7 +270,7 @@ static int pxa_irda_set_speed(struct pxa_irda *si, int speed) if (IS_FIR(si)) { /* stop RX DMA */ - DCSR(si->rxdma) &= ~DCSR_RUN; + dmaengine_terminate_all(si->rxdma); /* disable FICP */ ficp_writel(si, 0, ICCR0); pxa_irda_disable_clk(si); @@ -388,30 +416,27 @@ static irqreturn_t pxa_irda_sir_irq(int irq, void *dev_id) } /* FIR Receive DMA interrupt handler */ -static void pxa_irda_fir_dma_rx_irq(int channel, void *data) +static void pxa_irda_fir_dma_rx_irq(void *data) { - int dcsr = DCSR(channel); - - DCSR(channel) = dcsr & ~DCSR_RUN; + struct net_device *dev = data; + struct pxa_irda *si = netdev_priv(dev); - printk(KERN_DEBUG "pxa_ir: fir rx dma bus error %#x\n", dcsr); + dmaengine_terminate_all(si->rxdma); + netdev_dbg(dev, "pxa_ir: fir rx dma bus error\n"); } /* FIR Transmit DMA interrupt handler */ -static void pxa_irda_fir_dma_tx_irq(int channel, void *data) +static void pxa_irda_fir_dma_tx_irq(void *data) { struct net_device *dev = data; struct pxa_irda *si = netdev_priv(dev); - int dcsr; - - dcsr = DCSR(channel); - DCSR(channel) = dcsr & ~DCSR_RUN; - if (dcsr & DCSR_ENDINTR) { + d
[PATCH 1/3] net: irda: pxaficp_ir: use sched_clock() for time management
Instead of using directly the OS timer through direct register access, use the standard sched_clock(), which will end up in OSCR reading anyway. This is a first step for direct access register removal and machine specific code removal from this driver. Signed-off-by: Robert Jarzmik --- drivers/net/irda/pxaficp_ir.c | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/drivers/net/irda/pxaficp_ir.c b/drivers/net/irda/pxaficp_ir.c index 100454662e4b..b1794998c68e 100644 --- a/drivers/net/irda/pxaficp_ir.c +++ b/drivers/net/irda/pxaficp_ir.c @@ -29,7 +29,6 @@ #include #include -#include #include #define FICP __REG(0x4080) /* Start of FICP area */ @@ -102,7 +101,7 @@ struct pxa_irda { int speed; int newspeed; - unsigned long last_oscr; + unsigned long long last_clk; unsigned char *dma_rx_buff; unsigned char *dma_tx_buff; @@ -292,7 +291,7 @@ static irqreturn_t pxa_irda_sir_irq(int irq, void *dev_id) } lsr = STLSR; } - si->last_oscr = readl_relaxed(OSCR); + si->last_clk = sched_clock(); break; case 0x04: /* Received Data Available */ @@ -303,7 +302,7 @@ static irqreturn_t pxa_irda_sir_irq(int irq, void *dev_id) dev->stats.rx_bytes++; async_unwrap_char(dev, &dev->stats, &si->rx_buff, STRBR); } while (STLSR & LSR_DR); - si->last_oscr = readl_relaxed(OSCR); + si->last_clk = sched_clock(); break; case 0x02: /* Transmit FIFO Data Request */ @@ -319,7 +318,7 @@ static irqreturn_t pxa_irda_sir_irq(int irq, void *dev_id) /* We need to ensure that the transmitter has finished. */ while ((STLSR & LSR_TEMT) == 0) cpu_relax(); - si->last_oscr = readl_relaxed(OSCR); + si->last_clk = sched_clock(); /* * Ok, we've finished transmitting. Now enable @@ -373,7 +372,7 @@ static void pxa_irda_fir_dma_tx_irq(int channel, void *data) while (ICSR1 & ICSR1_TBY) cpu_relax(); - si->last_oscr = readl_relaxed(OSCR); + si->last_clk = sched_clock(); /* * HACK: It looks like the TBY bit is dropped too soon. 
@@ -473,8 +472,8 @@ static irqreturn_t pxa_irda_fir_irq(int irq, void *dev_id)
 	/* stop RX DMA */
 	DCSR(si->rxdma) &= ~DCSR_RUN;
-	si->last_oscr = readl_relaxed(OSCR);
 	icsr0 = ICSR0;
+	si->last_clk = sched_clock();

 	if (icsr0 & (ICSR0_FRE | ICSR0_RAB)) {
 		if (icsr0 & ICSR0_FRE) {
@@ -549,7 +548,7 @@ static int pxa_irda_hard_xmit(struct sk_buff *skb, struct net_device *dev)
 		skb_copy_from_linear_data(skb, si->dma_tx_buff, skb->len);

 		if (mtt)
-			while ((unsigned)(readl_relaxed(OSCR) - si->last_oscr)/4 < mtt)
+			while ((sched_clock() - si->last_clk) / 4 < mtt)
 				cpu_relax();

 		/* stop RX DMA, disable FICP */
--
2.1.4
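The turnaround wait in the last hunk can be sketched in userspace as follows. This is only an illustration under stated assumptions: `now_ns()` is a hypothetical stand-in for sched_clock() (which reports nanoseconds), `turnaround_elapsed()` and `wait_turnaround()` are made-up helper names, and `mtt` is assumed to be in microseconds. Under those assumptions the nanosecond delta is divided by 1000 here; the `/ 4` in the hunk was the scaling used for OSCR ticks, so whether it still fits the new clock is worth checking in review.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_USEC 1000ULL

/* Hypothetical stand-in for sched_clock(): monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* True once at least mtt_us microseconds separate last_ns from cur_ns. */
static bool turnaround_elapsed(uint64_t cur_ns, uint64_t last_ns,
			       unsigned int mtt_us)
{
	assert(cur_ns >= last_ns);	/* a monotonic clock is assumed */
	return (cur_ns - last_ns) / NSEC_PER_USEC >= mtt_us;
}

/* Busy-wait until the turnaround time has passed (no cpu_relax() here). */
static void wait_turnaround(uint64_t last_ns, unsigned int mtt_us)
{
	while (!turnaround_elapsed(now_ns(), last_ns, mtt_us))
		;
}
```

Keeping the unit conversion in one small helper makes the intended unit of the comparison explicit, which the open-coded division in the loop condition does not.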
Re: [GIT] Networking
Hi David, On Wed, 02 Sep 2015 22:35:22 -0700 (PDT) David Miller wrote: > > The following changes since commit 4941b8f0c2b9d88e8a6dacebf8b7faf603b98368: > > Merge tag 'powerpc-4.2-4' of > git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux (2015-08-27 > 17:59:17 -0700) > > are available in the git repository at: > > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next > > for you to fetch changes up to 62da98656b62a5ca57f22263705175af8ded5aa1: > > netfilter: nf_conntrack: make nf_ct_zone_dflt built-in (2015-09-02 16:32:56 > -0700) [just for consistency ...] This has 80 commits that have first been in linux-next on Sept 1 or later (and 5 that have not made it to linux-next yet). I understand that this is part of Dave's work flow and most of these have been queued for a while. Not judging, just noting. -- Cheers, Stephen Rothwells...@canb.auug.org.au -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[net-next PATCH v2] drivers: net: cpsw: Add support to make gpio drive which slave connected to phy
On the DRA72x EVM, slave 1 is connected to the onboard phy by default, but the slave 2 pins are also muxed with the video input module, which is controlled by a pcf857x gpio. Currently gpio hogging is used to select slave 0 to connect to the phy, but with omap2plus_defconfig the pcf857x gpio driver is built as a module. So when using NFS on the DRA72x EVM the board doesn't boot: gpio hogging does not set the proper gpio state to connect slave 0 to the phy because the driver is built as a module, and no error is reported for the unset gpio; the only symptom is that no dhcp reply is received.

To solve this issue, introduce "mode-gpio" in DT for the case where gpio based muxing is required. This throws a warning when the gpio get fails and returns probe defer. When the gpio-pcf857x module is installed, cpsw probes again and ethernet becomes functional. Verified this on DRA72x with pcf as module and ramdisk.

Signed-off-by: Mugunthan V N
---
Changes from initial version:
* Updated the gpio dt naming to be more generic.

This patch is tested on DRA72x; logs are at [1] and a branch is pushed at [2].

[1]: http://pastebin.ubuntu.com/12260767/
[2]: git://git.ti.com/~mugunthanvnm/ti-linux-kernel/linux.git cpsw-gpio-optional-v2
---
 Documentation/devicetree/bindings/net/cpsw.txt | 7 +++++++
 drivers/net/ethernet/ti/cpsw.c                 | 9 +++++++++
 2 files changed, 16 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
index 33fe846..dfe3e0b 100644
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -26,6 +26,13 @@ Optional properties:
 - dual_emac		: Specifies Switch to act as Dual EMAC
 - syscon		: Phandle to the system control device node, which is
 			  the control module device of the am33x
+- mode-gpio		: Should be added if a gpio line is required to
+			  be driven so that cpsw data lines can be
+			  connected to the phy via selective mux. For
+			  example in dra72x-evm, pcf gpio has to be
+			  driven low so that cpsw slave 0 and phy
+			  data lines are connected via mux.
+

 Slave Properties:
 Required properties:
diff --git a/drivers/net/ethernet/ti/cpsw.c b/drivers/net/ethernet/ti/cpsw.c
index 8fc90f1..90ae3f9 100644
--- a/drivers/net/ethernet/ti/cpsw.c
+++ b/drivers/net/ethernet/ti/cpsw.c
@@ -29,6 +29,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -2207,6 +2208,7 @@ static int cpsw_probe(struct platform_device *pdev)
 	void __iomem			*ss_regs;
 	struct resource			*res, *ss_res;
 	const struct of_device_id	*of_id;
+	struct gpio_desc		*mode;
 	u32 slave_offset, sliver_offset, slave_size;
 	int ret = 0, i;
 	int irq;
@@ -2232,6 +2234,13 @@ static int cpsw_probe(struct platform_device *pdev)
 		goto clean_ndev_ret;
 	}

+	mode = devm_gpiod_get_optional(&pdev->dev, "mode", GPIOD_OUT_LOW);
+	if (IS_ERR(mode)) {
+		ret = PTR_ERR(mode);
+		dev_err(&pdev->dev, "gpio request failed, ret %d\n", ret);
+		goto clean_ndev_ret;
+	}
+
 	/*
 	 * This may be required here for child devices.
 	 */
--
2.5.1.522.g7aa67f6
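The probe hunk relies on the kernel's ERR_PTR convention: for an optional GPIO, NULL means "not described in DT, carry on", while an encoded errno (for instance -EPROBE_DEFER when the provider is not loaded yet) aborts the probe. A minimal userspace sketch of that three-way outcome follows. The macros are re-implemented here to mimic the kernel's err.h helpers, and `get_optional_gpio()`, `enum gpio_state` and `probe()` are hypothetical names for illustration only, not cpsw code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EPROBE_DEFER	517	/* kernel-internal errno, defined locally for the sketch */
#define MAX_ERRNO	4095

/* Userspace imitations of the kernel's ERR_PTR helpers. */
static inline void *ERR_PTR(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error codes live in the top MAX_ERRNO values of the address space. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct gpio_desc { int value; };

enum gpio_state { GPIO_ABSENT, GPIO_PROVIDER_MISSING, GPIO_READY };

static struct gpio_desc mode_gpio;

/* Hypothetical lookup modelling an "optional" getter. */
static struct gpio_desc *get_optional_gpio(enum gpio_state state)
{
	switch (state) {
	case GPIO_ABSENT:
		return NULL;			/* property missing: not an error */
	case GPIO_PROVIDER_MISSING:
		return ERR_PTR(-EPROBE_DEFER);	/* provider not registered yet */
	default:
		mode_gpio.value = 0;		/* requested as output, driven low */
		return &mode_gpio;
	}
}

/* Mirrors the shape of the probe hunk: only a real error aborts. */
static int probe(enum gpio_state state)
{
	struct gpio_desc *mode = get_optional_gpio(state);

	if (IS_ERR(mode))
		return (int)PTR_ERR(mode);
	return 0;
}
```

Boards without the mux simply omit the property and take the NULL path, so the new check only affects boards that describe the GPIO.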
Re: [PATCH] net: eth: altera: fix napi poll_list corruption
On Wed, 2 Sep 2015 22:32:54 -0700, David Miller wrote:
>> I think napi_gro_flush() can be called with irq enabled, so moving the
>> spin_lock_irqsave() just before the __napi_complete() (or moving the
>> __napi_complete() just after the spin_lock_irqsave()) would be better,
>> right?
>
> It should work, yes.

Thank you. But I agree with Eric's last comment ("Calling napi_gro_flush() and __napi_complete() looks error prone."), and found that napi_complete_done() also checks NAPI_STATE_NPSVC to support NETPOLL.

These checks look somewhat redundant, but I prefer the simple way unless it is really critical to performance.

So, please take the original fix as is.
[PATCH] net: wan: sbni: fix device usage count
dev_get_by_name() will increment the usage count if the matching device is found. But we were not decrementing the count when the device was found but is not active.

Signed-off-by: Sudip Mukherjee
---
 drivers/net/wan/sbni.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/wan/sbni.c b/drivers/net/wan/sbni.c
index 758c4ba..8fef8d8 100644
--- a/drivers/net/wan/sbni.c
+++ b/drivers/net/wan/sbni.c
@@ -1358,6 +1358,8 @@ sbni_ioctl( struct net_device *dev, struct ifreq *ifr, int cmd )
 		if( !slave_dev || !(slave_dev->flags & IFF_UP) ) {
 			netdev_err(dev, "trying to enslave non-active device %s\n",
 				   slave_name);
+			if (slave_dev)
+				dev_put(slave_dev);
 			return -EPERM;
 		}

--
1.9.1
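The leak fixed above is the usual "lookup takes a reference, error path forgets to drop it" pattern. A toy userspace model, with a plain counter standing in for the netdev usage count, might look like this. All names (`toy_dev`, `toy_get_by_name()`, `toy_put()`, `toy_enslave()`) are hypothetical and only mirror the shape of the sbni code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define IFF_UP 0x1

struct toy_dev {
	const char *name;
	unsigned int flags;
	int refcnt;
};

static struct toy_dev devices[] = {
	{ "eth0", IFF_UP, 0 },
	{ "eth1", 0, 0 },	/* present but not up */
};

/* Like dev_get_by_name(): returns with a reference held when found. */
static struct toy_dev *toy_get_by_name(const char *name)
{
	for (size_t i = 0; i < sizeof(devices) / sizeof(devices[0]); i++) {
		if (strcmp(devices[i].name, name) == 0) {
			devices[i].refcnt++;
			return &devices[i];
		}
	}
	return NULL;
}

/* Like dev_put(): drops one reference. */
static void toy_put(struct toy_dev *dev)
{
	assert(dev->refcnt > 0);
	dev->refcnt--;
}

/* Keeps the reference on success, drops it on every failure path. */
static int toy_enslave(const char *name, struct toy_dev **slave)
{
	struct toy_dev *dev = toy_get_by_name(name);

	if (!dev || !(dev->flags & IFF_UP)) {
		if (dev)
			toy_put(dev);	/* found but inactive: undo the get */
		return -EPERM;
	}
	*slave = dev;
	return 0;
}
```

Without the `toy_put()` on the failure path, each rejected attempt on an existing but inactive device would leave the count one higher, which in the kernel keeps the device from ever being unregistered.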
net-next closure?
I was just about to send out my last series of patches and noticed you sent Linus your pull request. So I am guessing that your net-next tree is now closed, correct?

Just want to make sure before sending anything out and did not want to dump patches on you right before the closure of your net-next.

Cheers,
Jeff
Re: [PATCH] net: eth: altera: fix napi poll_list corruption
From: Atsushi Nemoto
Date: Thu, 3 Sep 2015 09:52:57 +0900

> On Wed, 2 Sep 2015 11:25:00 -0700, David Miller wrote:
>> Two lines below this change you are disabling interrupts anyways,
>> so I would suggest just moving the spin_lock_irqsave() before the
>> napi_gro_flush() to fix this.
>>
>> Many of the checks done by napi_complete_done() (invoked by
>> napi_complete()) are completely redundant in this context. For
>> example, the direct __napi_complete() call is a really nice
>> optimization because we know we are on the poll list and therefore
>> it is not empty.
>
> Thank you for your suggestion.
>
> I think napi_gro_flush() can be called with irq enabled, so moving the
> spin_lock_irqsave() just before the __napi_complete() (or moving the
> __napi_complete() just after the spin_lock_irqsave()) would be better,
> right?

It should work, yes.
Re: [PATCH 1/1] net/ipv6: Correct PIM6 mrt_lock handling
On Wed, Sep 2, 2015 at 6:52 PM, Richard Laing wrote:
> In the IPv6 multicast routing code the mrt_lock was not being released
> correctly in the MFC iterator, as a result adding or deleting a MIF would
> cause a hang because the mrt_lock could not be acquired.
>
> This fix is a copy of the code for the IPv4 case and ensures that the lock
> is released correctly.
>
> Signed-off-by: Richard Laing

Good catch!

Acked-by: Cong Wang

Needs to go to -stable too.
Re: [PATCH] net: eth: altera: fix napi poll_list corruption
On Thu, 2015-09-03 at 09:52 +0900, Atsushi Nemoto wrote:
> On Wed, 2 Sep 2015 11:25:00 -0700, David Miller wrote:
> > Two lines below this change you are disabling interrupts anyways,
> > so I would suggest just moving the spin_lock_irqsave() before the
> > napi_gro_flush() to fix this.
> >
> > Many of the checks done by napi_complete_done() (invoked by
> > napi_complete()) are completely redundant in this context. For
> > example, the direct __napi_complete() call is a really nice
> > optimization because we know we are on the poll list and therefore
> > it is not empty.
>
> Thank you for your suggestion.
>
> I think napi_gro_flush() can be called with irq enabled, so moving the
> spin_lock_irqsave() just before the __napi_complete() (or moving the
> __napi_complete() just after the spin_lock_irqsave()) would be better,
> right?

Unless masking irqs is damn slow on hosts supporting this NIC, I would rather use napi_complete_done() and add the possibility of aggregating more frames per GRO packet, setting a non zero gro_flush_timeout.

Calling napi_gro_flush() and __napi_complete() looks error prone.
[PATCH 1/1] net/ipv6: Correct PIM6 mrt_lock handling
In the IPv6 multicast routing code the mrt_lock was not being released correctly in the MFC iterator; as a result, adding or deleting a MIF would cause a hang because the mrt_lock could not be acquired.

This fix is a copy of the code for the IPv4 case and ensures that the lock is released correctly.

Signed-off-by: Richard Laing
---
 net/ipv6/ip6mr.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6mr.c b/net/ipv6/ip6mr.c
index 74ceb73..5f36266 100644
--- a/net/ipv6/ip6mr.c
+++ b/net/ipv6/ip6mr.c
@@ -550,7 +550,7 @@ static void ipmr_mfc_seq_stop(struct seq_file *seq, void *v)
 	if (it->cache == &mrt->mfc6_unres_queue)
 		spin_unlock_bh(&mfc_unres_lock);
-	else if (it->cache == mrt->mfc6_cache_array)
+	else if (it->cache == &mrt->mfc6_cache_array[it->ct])
 		read_unlock(&mrt_lock);
 }
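Why the one-line change matters can be shown with a small userspace model. It assumes, as the diff suggests, that the iterator's `cache` pointer is set to the bucket `&cache_array[ct]` while the read lock is held. The old test compares against the array base, which equals only bucket 0, so a walk that stops in any other bucket never unlocks. All names here (`table`, `iter`, `seq_start()`, `seq_stop_old()`, `seq_stop_fixed()`, `lock_depth`) are hypothetical, and a counter stands in for mrt_lock.

```c
#include <assert.h>
#include <stddef.h>

#define LINES 4

struct list_head { struct list_head *next, *prev; };

struct table {
	struct list_head cache_array[LINES];
	struct list_head unres_queue;
};

struct iter {
	struct list_head *cache;	/* bucket currently being walked */
	int ct;				/* index of that bucket */
};

static int lock_depth;	/* stands in for the read side of mrt_lock */

/* Take the "lock" and point the iterator at one bucket. */
static void seq_start(struct table *t, struct iter *it, int bucket)
{
	assert(bucket >= 0 && bucket < LINES);
	lock_depth++;
	it->ct = bucket;
	it->cache = &t->cache_array[bucket];
}

/* Old check: the array name decays to &cache_array[0] only. */
static void seq_stop_old(struct table *t, struct iter *it)
{
	if (it->cache == t->cache_array)
		lock_depth--;
}

/* Fixed check: compare against the bucket actually in use. */
static void seq_stop_fixed(struct table *t, struct iter *it)
{
	if (it->cache == &t->cache_array[it->ct])
		lock_depth--;
}
```

The old form happens to work whenever the walk ends in bucket 0, which may be why the imbalance went unnoticed for a while.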
Re: [PATCH] net: eth: altera: fix napi poll_list corruption
On Wed, 2 Sep 2015 11:25:00 -0700, David Miller wrote:
> Two lines below this change you are disabling interrupts anyways,
> so I would suggest just moving the spin_lock_irqsave() before the
> napi_gro_flush() to fix this.
>
> Many of the checks done by napi_complete_done() (invoked by
> napi_complete()) are completely redundant in this context. For
> example, the direct __napi_complete() call is a really nice
> optimization because we know we are on the poll list and therefore
> it is not empty.

Thank you for your suggestion.

I think napi_gro_flush() can be called with irq enabled, so moving the spin_lock_irqsave() just before the __napi_complete() (or moving the __napi_complete() just after the spin_lock_irqsave()) would be better, right?
Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload
> On Sep 2, 2015, at 4:21 PM, Tom Herbert wrote:
>
> Mark, another question in this area of code. Looking at ixgbe_tx_csum,
> I'm wondering what happens with those default cases for the switch
> statements. If those are hit for whatever reason does that mean the
> checksum is never resolved? It seems like if the device couldn't
> handle these cases then skb_checksum_help should be called to set the
> checksum. In particular I am wondering what happens in the case that a
> TCP or UDP packet is sent in IPv6 with an extension header present (so
> default is taken in switch (l4_hdr)). Would the checksum be properly
> set in this case?

I will look further into this, but in a first look it appears that you are right and that it has been this way for some time.

--
Mark Rustad, Networking Division, Intel Corporation
Re: ip_rcv_finish() NULL pointer and possibly related Oopses
On 09/02/2015 06:39 PM, Shaun Crampton wrote: Make sure you backported commit 10e2eb878f3ca07ac2f05fa5ca5e6c4c9174a27a ("udp: fix dst races with multicast early demux") I just tried the latest CoreOS alpha, which had that patch. Sadly, I saw just as many reboots. Here's a sample of the different types of Oopses I see (I've put the rest up in a gist: https://gist.github.com/fasaxc/d801ced5608f2657abd8): [ 4024.564479] BUG: unable to handle kernel NULL pointer dereference at (null) [ 4024.565452] IP: [< (null)>] (null) [ 4024.565452] PGD 2297067 PUD 2296067 PMD 0 [ 4024.565452] Oops: 0010 [#1] SMP [ 4024.565452] Modules linked in: xt_mac xt_mark veth ip_set_hash_net nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_set ip_set_hash_ip ip_set nfnetlink ipip tunnel4 ip_tunnel ip6table_filter ip6_tables xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter br_netfilter nf_nat nf_conntrack bridge stp llc overlay nls_ascii nls_cp437 vfat fat ext4 crc16 mbcache jbd2 sd_mod crc32c_intel virtio_scsi scsi_mod aesni_intel virtio_net mousedev aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd microcode firmware_class virtio_pci virtio_ring psmouse virtio i2c_piix4 i2c_core acpi_cpufreq button evdev sch_fq_codel ip_tables autofs4 [ 4024.565452] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.6-coreos-r1 #2 [ 4024.565452] Hardware name: Google Google, BIOS Google 01/01/2011 [ 4024.565452] task: 81a154c0 ti: 81a0 task.ti: 81a0 [ 4024.565452] RIP: 0010:[<>] [< (null)>] (null) [ 4024.565452] RSP: 0018:88021fc03c00 EFLAGS: 00010246 [ 4024.565452] RAX: 880003375d00 RBX: 880003375d00 RCX: 0001 [ 4024.565452] RDX: 88000306c000 RSI: RDI: 880003375d00 [ 4024.565452] RBP: 88021fc03c28 R08: 5608 R09: bb84 [ 4024.565452] R10: 0003 R11: 880215a30dc0 R12: 880214bfb000 [ 4024.565452] R13: 88000306c000 R14: 88000306c000 R15: 0008 [ 4024.565452] FS: () GS:88021fc0() knlGS: [ 4024.565452] CS: 0010 DS: ES: CR0: 80050033 
[ 4024.565452] CR2: CR3: 01d92000 CR4: 001406f0 [ 4024.600761] Stack: [ 4024.601081] 814ac9dc 8802 88000306c000 880003375d00 [ 4024.601081] 88008cbba84e 88021fc03c58 81486628 88021690a000 [ 4024.601081] 88008cbba84e 880003375d00 88000306c000 88021fc03cb8 [ 4024.601081] Call Trace: [ 4024.601081] [ 4024.601081] [] ? tcp_v4_early_demux+0x11c/0x160 [ 4024.601081] [] ip_rcv_finish+0xb8/0x360 [ 4024.601081] [] ip_rcv+0x2a4/0x400 [ 4024.601081] [] ? inet_del_offload+0x40/0x40 [ 4024.601081] [] __netif_receive_skb_core+0x6c3/0x9a0 [ 4024.601081] [] ? build_skb+0x17/0x90 [ 4024.601081] [] __netif_receive_skb+0x18/0x60 [ 4024.601081] [] netif_receive_skb_internal+0x33/0xa0 [ 4024.601081] [] netif_receive_skb_sk+0x1c/0x70 [ 4024.601081] [] 0xa008772b [ 4024.601081] [] ? check_preempt_curr+0x80/0xa0 [ 4024.601081] [] 0xa0087d81 Looking at this one, I am still puzzeled where 0xa008772b and 0xa008772b comes from ... some driver, bridge ...? Also the call to inet_del_offload() seems a bit odd. Even in 4.1, there's only one (buggy) instance that calls inet_del_offload(), which is ipv6_exthdrs_offload_init(), but IPPROTO_ROUTING shouldn't have much of an effect on the v4 table as far as I can see. Maybe rather a false positive that address, hmm? Perhaps some callback/infrastructure vanished underneath us as ip/rip is both null ... maybe due to that also 0xa008772b / 0xa008772b don't resolve? [ 4024.601081] [] net_rx_action+0x159/0x340 [ 4024.601081] [] __do_softirq+0xf4/0x290 [ 4024.601081] [] irq_exit+0xad/0xc0 [ 4024.601081] [] do_IRQ+0x5a/0xf0 [ 4024.601081] [] common_interrupt+0x6e/0x6e [ 4024.601081] [ 4024.601081] [] ? native_safe_halt+0x6/0x10 [ 4024.601081] [] default_idle+0x1e/0xc0 [ 4024.601081] [] arch_cpu_idle+0xf/0x20 [ 4024.601081] [] cpu_startup_entry+0x314/0x3e0 [ 4024.601081] [] rest_init+0x7c/0x80 [ 4024.601081] [] start_kernel+0x483/0x490 [ 4024.601081] [] ? set_init_arg+0x55/0x55 [ 4024.601081] [] ? 
early_idt_handler_array+0x120/0x120 [ 4024.601081] [] x86_64_start_reservations+0x2a/0x2c [ 4024.601081] [] x86_64_start_kernel+0x138/0x147 [ 4024.601081] Code: Bad RIP value. [ 4024.601081] RIP [< (null)>] (null) [ 4024.601081] RSP [ 4024.601081] CR2: [ 4024.601081] ---[ end trace cdabfe9d7380aaab ]--- [ 4024.601081] Kernel panic - not syncing: Fatal exception in interrupt [ 4024.601081] Kernel Offset: disabled [ 4024.601081] Rebooting in 60 seconds.. [ 4024.601081] ACPI MEMORY or I/O RESET_REG. -- To unsubscribe from this list: send the line "unsubs
Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
On Wed, 2015-09-02 at 16:10 -0700, Martin KaFai Lau wrote:
> On Wed, Sep 02, 2015 at 03:48:57PM -0700, Eric Dumazet wrote:
> > On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
> > > On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> > > > Object cannot be freed until all cpus have exited their RCU sections.
> > > You meant the dst_destroy() here will wait for all cpus exited their RCU
> > > sections?
> > >
> > > static inline void dst_free(struct dst_entry *dst)
> > > {
> > > 	if (dst->obsolete > 0)
> > > 		return;
> > > 	if (!atomic_read(&dst->__refcnt)) {
> > > 		dst = dst_destroy(dst);
> > > 		if (!dst)
> > > 			return;
> > > 	}
> > > 	__dst_free(dst);
> > > }
> >
> > dst_free() is called after RCU grace period, in the case you are
> > interested in.
> >
> > Look at dst_rcu_free() and rt_free()
> Yes for IPv4 FIB
>
> Not for IPv6 FIB. F.e. rt6_release()
> The IPv6 FIB is protected by rwlock now.

Oh well. I gave you a hint. I was not saying that it was currently used in IPv6.

Are you telling me that IPv6 needs to continue to use techniques from 1990 ?

Surely we can use modern stuff, like proper RCU and/or seqlocks.

Since you are fixing a day-0 bug, I do not believe there is a particular hurry to be conservative.
[PATCH net-next v2] ipv6: fix multipath route replace error recovery
From: Roopa Prabhu

Problem:
The ecmp route replace support for ipv6 in the kernel deletes the existing ecmp route too early, i.e. when it installs the first nexthop. If there is an error in installing the subsequent nexthops, it's too late to recover the already deleted existing route.

This patch fixes the problem with the following:
a) Changes the existing multipath route add code to a two stage process: build rt6_infos + insert them. The ip6_route_add rt6_info creation code is moved into ip6_route_info_create.
b) This ensures that all errors are caught during building rt6_infos and we fail early.
c) Separates multipath add and del code, because add needs the special two stage mode in a) and delete essentially does not care.
d) In any event, if the code fails during inserting a route again, a warning is printed (this should be unlikely).

Before the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
/* previously added ecmp route 3000:1000:1000:1000::2 disappears from
 * kernel */

After the patch:
$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024
3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024

/* Try replacing the route with a duplicate nexthop */
$ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1
RTNETLINK answers: File exists

$ip -6 route show
3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 Fixes: 4a287eba2de3 ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag") Signed-off-by: Roopa Prabhu --- v2 - fix a rt6_info leak in cleanup on error This bug is present in 4.1 kernel and 4.2 too. Since 4.2 is out or almost out, I am submitting the patch against net-next. I can respin against net if needed. I have tried to keep the changes local to route.c closer to the netlink message handling. Most of the changes move code into separate functions. net/ipv6/route.c | 209 --- 1 file changed, 183 insertions(+), 26 deletions(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index f45cac6..ecbb974 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1748,7 +1748,7 @@ static int ip6_convert_metrics(struct mx6_config *mxc, return -EINVAL; } -int ip6_route_add(struct fib6_config *cfg) +int ip6_route_info_create(struct fib6_config *cfg, struct rt6_info **rt_ret) { int err; struct net *net = cfg->fc_nlinfo.nl_net; @@ -1756,7 +1756,6 @@ int ip6_route_add(struct fib6_config *cfg) struct net_device *dev = NULL; struct inet6_dev *idev = NULL; struct fib6_table *table; - struct mx6_config mxc = { .mx = NULL, }; int addr_type; if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128) @@ -1981,6 +1980,32 @@ install_route: cfg->fc_nlinfo.nl_net = dev_net(dev); + *rt_ret = rt; + + return 0; +out: + if (dev) + dev_put(dev); + if (idev) + in6_dev_put(idev); + if (rt) + dst_free(&rt->dst); + + *rt_ret = NULL; + + return err; +} + +int ip6_route_add(struct fib6_config *cfg) +{ + struct mx6_config mxc = { .mx = NULL, }; + struct rt6_info *rt = NULL; + int err; + + err = ip6_route_info_create(cfg, &rt); + if (err) + goto out; + err = ip6_convert_metrics(&mxc, cfg); if (err) goto out; @@ -1988,14 +2013,12 @@ install_route: err = 
__ip6_ins_rt(rt, &cfg->fc_nlinfo, &mxc); kfree(mxc.mx); + return err; out: - if (dev) - dev_put(dev); - if (idev) - in6_dev_put(idev); if (rt) dst_free(&rt->dst); + return err; } @@ -2776,19 +2799,79 @@ errout: return err; } -static int ip6_route_multipath(struct fib6_config *cfg, int add) +struct rt6_nh { + struct rt6_info *rt6_info; + struct fib6_config r_cfg; + struct mx6_config mxc; + struct list_head next; +}; + +static void ip6_print_replace_route_err(struct list_head *rt6_nh_list) +{ + struct rt6_nh *nh; + char
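The two-stage idea from the commit message of the multipath replace fix (build and validate every nexthop first, only then touch the installed route) can be illustrated outside the kernel. What follows is a minimal user-space sketch, not the patch itself: struct route, route_build() and route_replace() are hypothetical names, nexthops are reduced to plain integers, and the commit step here cannot fail, whereas the real insert can (which is why the patch prints a warning in that case).

```c
#include <errno.h>
#include <stddef.h>

#define MAX_NH 8

/* Hypothetical stand-in for an installed multipath route. */
struct route {
	int nh[MAX_NH];	/* nexthop identifiers */
	size_t n;	/* number of valid entries in nh[] */
};

/* Stage 1: build and validate into a staging object; nothing live is touched. */
static int route_build(struct route *staged, const int *nh, size_t n)
{
	if (n == 0 || n > MAX_NH)
		return -EINVAL;

	for (size_t i = 0; i < n; i++) {
		/* A duplicate nexthop is rejected, mirroring "File exists". */
		for (size_t j = 0; j < i; j++)
			if (nh[j] == nh[i])
				return -EEXIST;
		staged->nh[i] = nh[i];
	}
	staged->n = n;
	return 0;
}

/* Stage 2: replace the live route only after stage 1 fully succeeded. */
static int route_replace(struct route *live, const int *nh, size_t n)
{
	struct route staged;
	int err = route_build(&staged, nh, n);

	if (err)
		return err;	/* live route is left exactly as it was */

	*live = staged;
	return 0;
}
```

With a live route holding nexthops b, d and f, calling route_replace() with the list b, d, d returns -EEXIST and leaves all three original nexthops in place, which is the behaviour shown in the "After the patch" transcript.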
Re: [PATCH nf-next] netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
From: Daniel Borkmann Date: Thu, 3 Sep 2015 01:26:07 +0200 > Fengguang reported, that some randconfig generated the following linker > issue with nf_ct_zone_dflt object involved: > > [...] > CC init/version.o > LD init/built-in.o > net/built-in.o: In function `ipv4_conntrack_defrag': > nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt' > net/built-in.o: In function `ipv6_defrag': > nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to > `nf_ct_zone_dflt' > make: *** [vmlinux] Error 1 > > Given that configurations exist where we have a built-in part, which is > accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user() > and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a > module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in > area when netfilter is configured in general. > > Therefore, split the more generic parts into a common header under > include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in > section that already holds parts related to CONFIG_NF_CONNTRACK in the > netfilter core. This fixes the issue on my side. > > Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into > functions") > Reported-by: Fengguang Wu > Signed-off-by: Daniel Borkmann > --- > [ Here's the 2nd one for either nf-next or net-next. I've tried various >Kconfig combinations including the one Fengguang reported, seems to be >okay from my side. ] Ok I'll apply this directly too, thanks Daniel. If Pablo and others want to fix this another way, they can send me a relative patch. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH nf-next] netfilter: nf_dup{4,6}: fix build error when nf_conntrack disabled
From: Daniel Borkmann Date: Wed, 2 Sep 2015 20:54:02 +0200 > While testing various Kconfig options on another issue, I found that > the following one triggers as well on allmodconfig and nf_conntrack > disabled: > > net/ipv4/netfilter/nf_dup_ipv4.c: In function ‘nf_dup_ipv4’: > net/ipv4/netfilter/nf_dup_ipv4.c:72:20: error: ‘nf_skb_duplicated’ > undeclared (first use in this function) > if (this_cpu_read(nf_skb_duplicated)) > [...] > net/ipv6/netfilter/nf_dup_ipv6.c: In function ‘nf_dup_ipv6’: > net/ipv6/netfilter/nf_dup_ipv6.c:66:20: error: ‘nf_skb_duplicated’ > undeclared (first use in this function) > if (this_cpu_read(nf_skb_duplicated)) > > Fix it by including directly the header where it is defined. > > Fixes: bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6") > Signed-off-by: Daniel Borkmann I'll take this directly to simplify things. Thanks Daniel.
Re: [PATCH] tipc: fix stall during bclink wakeup procedure
From: Kolmakov Dmitriy
Date: Wed, 2 Sep 2015 15:33:00 +

> If an attempt to wake up users of broadcast link is made when there
> is no enough place in send queue than it may hang up inside the
> tipc_sk_rcv() function since the loop breaks only after the wake up
> queue becomes empty. This can lead to complete CPU stall with the
> following message generated by RCU:

I don't understand how it can loop forever.

It should either successfully deliver each packet to the socket,
or respond with a TIPC_ERR_OVERLOAD.

In both cases, the SKB is dequeued from the queue and forward
progress is made.

If there really is a problem somewhere in here, then two things:

1) You need to describe exactly the sequence of tests and conditions
   that lead to the endless loop in this code, because I cannot see
   it.

2) I suspect the fix is more likely to be appropriate in tipc_sk_rcv()
   or similar, rather than creating a dummy queue to work around its
   behavior.

Thanks.
[PATCH nf-next] netfilter: nf_conntrack: make nf_ct_zone_dflt built-in
Fengguang reported, that some randconfig generated the following linker issue with nf_ct_zone_dflt object involved: [...] CC init/version.o LD init/built-in.o net/built-in.o: In function `ipv4_conntrack_defrag': nf_defrag_ipv4.c:(.text+0x93e95): undefined reference to `nf_ct_zone_dflt' net/built-in.o: In function `ipv6_defrag': nf_defrag_ipv6_hooks.c:(.text+0xe3ffe): undefined reference to `nf_ct_zone_dflt' make: *** [vmlinux] Error 1 Given that configurations exist where we have a built-in part, which is accessing nf_ct_zone_dflt such as the two handlers nf_ct_defrag_user() and nf_ct6_defrag_user(), and a part that configures nf_conntrack as a module, we must move nf_ct_zone_dflt into a fixed, guaranteed built-in area when netfilter is configured in general. Therefore, split the more generic parts into a common header under include/linux/netfilter/ and move nf_ct_zone_dflt into the built-in section that already holds parts related to CONFIG_NF_CONNTRACK in the netfilter core. This fixes the issue on my side. Fixes: 308ac9143ee2 ("netfilter: nf_conntrack: push zone object into functions") Reported-by: Fengguang Wu Signed-off-by: Daniel Borkmann --- [ Here's the 2nd one for either nf-next or net-next. I've tried various Kconfig combinations including the one Fengguang reported, seems to be okay from my side. 
] include/linux/netfilter.h | 2 ++ .../linux/netfilter/nf_conntrack_zones_common.h| 23 ++ include/net/netfilter/nf_conntrack_zones.h | 19 +- net/netfilter/core.c | 6 ++ net/netfilter/nf_conntrack_core.c | 7 --- 5 files changed, 32 insertions(+), 25 deletions(-) create mode 100644 include/linux/netfilter/nf_conntrack_zones_common.h diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h index d788ce6..36a6525 100644 --- a/include/linux/netfilter.h +++ b/include/linux/netfilter.h @@ -368,6 +368,8 @@ nf_nat_decode_session(struct sk_buff *skb, struct flowi *fl, u_int8_t family) #endif /*CONFIG_NETFILTER*/ #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) +#include + extern void (*ip_ct_attach)(struct sk_buff *, const struct sk_buff *) __rcu; void nf_ct_attach(struct sk_buff *, const struct sk_buff *); extern void (*nf_ct_destroy)(struct nf_conntrack *) __rcu; diff --git a/include/linux/netfilter/nf_conntrack_zones_common.h b/include/linux/netfilter/nf_conntrack_zones_common.h new file mode 100644 index 000..5d7cf36 --- /dev/null +++ b/include/linux/netfilter/nf_conntrack_zones_common.h @@ -0,0 +1,23 @@ +#ifndef _NF_CONNTRACK_ZONES_COMMON_H +#define _NF_CONNTRACK_ZONES_COMMON_H + +#include + +#define NF_CT_DEFAULT_ZONE_ID 0 + +#define NF_CT_ZONE_DIR_ORIG(1 << IP_CT_DIR_ORIGINAL) +#define NF_CT_ZONE_DIR_REPL(1 << IP_CT_DIR_REPLY) + +#define NF_CT_DEFAULT_ZONE_DIR (NF_CT_ZONE_DIR_ORIG | NF_CT_ZONE_DIR_REPL) + +#define NF_CT_FLAG_MARK1 + +struct nf_conntrack_zone { + u16 id; + u8 flags; + u8 dir; +}; + +extern const struct nf_conntrack_zone nf_ct_zone_dflt; + +#endif /* _NF_CONNTRACK_ZONES_COMMON_H */ diff --git a/include/net/netfilter/nf_conntrack_zones.h b/include/net/netfilter/nf_conntrack_zones.h index 5316c7b..4e32512 100644 --- a/include/net/netfilter/nf_conntrack_zones.h +++ b/include/net/netfilter/nf_conntrack_zones.h @@ -1,24 +1,7 @@ #ifndef _NF_CONNTRACK_ZONES_H #define _NF_CONNTRACK_ZONES_H -#include - -#define 
NF_CT_DEFAULT_ZONE_ID 0 - -#define NF_CT_ZONE_DIR_ORIG(1 << IP_CT_DIR_ORIGINAL) -#define NF_CT_ZONE_DIR_REPL(1 << IP_CT_DIR_REPLY) - -#define NF_CT_DEFAULT_ZONE_DIR (NF_CT_ZONE_DIR_ORIG | NF_CT_ZONE_DIR_REPL) - -#define NF_CT_FLAG_MARK1 - -struct nf_conntrack_zone { - u16 id; - u8 flags; - u8 dir; -}; - -extern const struct nf_conntrack_zone nf_ct_zone_dflt; +#include #if IS_ENABLED(CONFIG_NF_CONNTRACK) #include diff --git a/net/netfilter/core.c b/net/netfilter/core.c index 0b939b7..8e47f81 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -388,6 +388,12 @@ EXPORT_SYMBOL(nf_conntrack_destroy); struct nfq_ct_hook __rcu *nfq_ct_hook __read_mostly; EXPORT_SYMBOL_GPL(nfq_ct_hook); +/* Built-in default zone used e.g. by modules. */ +const struct nf_conntrack_zone nf_ct_zone_dflt = { + .id = NF_CT_DEFAULT_ZONE_ID, + .dir= NF_CT_DEFAULT_ZONE_DIR, +}; +EXPORT_SYMBOL_GPL(nf_ct_zone_dflt); #endif /* CONFIG_NF_CONNTRACK */ #ifdef CONFIG_NF_NAT_NEEDED diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c index ac3be9b..eedf049 100644 --- a/net/netfilter/nf_conntrack_core.c +++ b/net/netfilter/nf_conntrack_core.c @@ -1286,13 +1286,6 @@ bool __nf_ct_kill_acct(struct nf_conn *ct, } EXPORT_SYMBOL_GPL(__nf_ct_kill_acct); -/* Built-in default zone used e.g. by modules. */ -const struct nf_
Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload
On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D wrote: >> On Sep 1, 2015, at 8:17 PM, Tom Herbert wrote: >> >> I suspect this is not UDP-encapsulation specific, will it work with >> TCP/IP/IP, TCP/IP/GRE etc.? > Mark, another question in this area of code. Looking at ixgbe_tx_csum, I'm wondering what happens with those default cases for the switch statements. If those are hit for whatever reason does that mean the checksum is never resolved? It seems like if the device couldn't handle these cases then skb_checksum_help should be called to set the checksum. In particular I am wondering what happens in the case that a TCP or UDP packet is sent in IPv6 with an extension header present (so default is taken in switch (l4_hdr)). Would the checksum be properly set in this case? Thanks, Tom > It could do more, but this is what has been tested up to this point. > >> Isn't there anyway the ixgbe could just be made to NETIF_HW_CSUM? That >> would be so much more straightforward and support nearly all use cases >> without needing to jump through all these hoops. > > Well, the description says: > > --- > Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. > It means that device can fill TCP/UDP-like checksum anywhere in the packets > whatever headers there might be. > --- > > The device can't do whatever, wherever. There is always a limit to the offset > to the inner headers that can be handled, for instance. > > -- > Mark Rustad, Networking Division, Intel Corporation > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 1/1] net: fec: clear receive interrupts before processing a packet
From: Fugang Duan
Date: Wed, 2 Sep 2015 17:24:14 +0800

> From: Russell King
>
> The patch just to re-submit the patch "db3421c114cfa6326" because the
> patch "4d494cdc92b3b9a0" remove the change.
>
> Clear any pending receive interrupt before we process a pending packet.
> This helps to avoid any spurious interrupts being raised after we have
> fully cleaned the receive ring, while still allowing an interrupt to be
> raised if we receive another packet.
>
> The position of this is critical: we must do this prior to reading the
> next packet status to avoid potentially dropping an interrupt when a
> packet is still pending.
>
> Acked-by: Fugang Duan
> Signed-off-by: Russell King

Applied and queued up for -stable, thanks.
Re: [PATCH net-next] Revert "net/ipv6: add sysctl option accept_ra_min_hop_limit"
From: Sabrina Dubroca Date: Wed, 2 Sep 2015 11:43:01 +0200 > This reverts commit 8013d1d7eafb0589ca766db6b74026f76b7f5cb4. > > There are several issues with this patch. > It completely cancels the security changes introduced by 6fd99094de2b > ("ipv6: Don't reduce hop limit for an interface"). > The current default value (min hop limit = 1) can result in the same > denial of service that 6fd99094de2b prevents, but it is hard to define > a correct and sane default value. > More generally, it is yet another IPv6 sysctl, and we already have too > many. > > This was introduced to satisfy a TAHI test case which, in my opinion, is > too strict, turning the RFC's "SHOULD" into a "MUST": > > If the received Cur Hop Limit value is non-zero, the host > SHOULD set its CurHopLimit variable to the received value. > > The behavior of this sysctl is wrong in multiple ways. Some are > fixable, but let's not rush this commit into mainline, and revert this > while we still can, then we can come up with a better solution. > > Signed-off-by: Sabrina Dubroca I don't agree with this revert. If you look at the original commit, the quoted RFC recommends adding a configurable method to protect against this. And that's exactly what the commit you are trying to revert is doing. The only thing I would entertain is potentially an adjustment of the default, working in concert with the TAHI folks to make sure their tests still pass with any new default. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
From: Martin KaFai Lau
Date: Wed, 2 Sep 2015 16:10:31 -0700

> On Wed, Sep 02, 2015 at 03:48:57PM -0700, Eric Dumazet wrote:
>> dst_free() is called after RCU grace period, in the case you are
>> interested in.
>>
>> Look at dst_rcu_free() and rt_free()
> Yes for IPv4 FIB
>
> Not for IPv6 FIB. F.e. rt6_release()
> The IPv6 FIB is protected by rwlock now.

The FIB tree can use whatever locking scheme it wants, but the
actual route objects need to be released via RCU to fix the problems
you are seeing.

Converting the entire ipv6 FIB tree handling to RCU is not a
prerequisite for this.
Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
On Wed, Sep 02, 2015 at 03:48:57PM -0700, Eric Dumazet wrote:
> On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
> > On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> > > Object cannot be freed until all cpus have exited their RCU sections.
> > You meant the dst_destroy() here will wait for all cpus exited their RCU
> > sections?
> >
> > static inline void dst_free(struct dst_entry *dst)
> > {
> > 	if (dst->obsolete > 0)
> > 		return;
> > 	if (!atomic_read(&dst->__refcnt)) {
> > 		dst = dst_destroy(dst);
> > 		if (!dst)
> > 			return;
> > 	}
> > 	__dst_free(dst);
> > }
>
> dst_free() is called after RCU grace period, in the case you are
> interested in.
>
> Look at dst_rcu_free() and rt_free()
Yes for IPv4 FIB

Not for IPv6 FIB. F.e. rt6_release()
The IPv6 FIB is protected by rwlock now.
Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
From: Eric Dumazet
Date: Wed, 02 Sep 2015 15:48:57 -0700

> On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
>> On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
>> > Object cannot be freed until all cpus have exited their RCU sections.
>> You meant the dst_destroy() here will wait for all cpus exited their RCU
>> sections?
>>
>> static inline void dst_free(struct dst_entry *dst)
>> {
>> 	if (dst->obsolete > 0)
>> 		return;
>> 	if (!atomic_read(&dst->__refcnt)) {
>> 		dst = dst_destroy(dst);
>> 		if (!dst)
>> 			return;
>> 	}
>> 	__dst_free(dst);
>> }
>
> dst_free() is called after RCU grace period, in the case you are
> interested in.
>
> Look at dst_rcu_free() and rt_free()

For ipv4, this is true, but in ipv6, it is not necessarily done in
this way. And I think that is the point Martin is trying to make.

If you look, the dst_free() calls in ipv6 are basically synchronous,
it does not use dst_rcu_free().

And thus, the fix is to make ipv6 properly RCU free route entries.
Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
On Wed, 2015-09-02 at 14:52 -0700, Martin KaFai Lau wrote:
> On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> > Object cannot be freed until all cpus have exited their RCU sections.
> You meant the dst_destroy() here will wait for all cpus exited their RCU
> sections?
>
> static inline void dst_free(struct dst_entry *dst)
> {
> 	if (dst->obsolete > 0)
> 		return;
> 	if (!atomic_read(&dst->__refcnt)) {
> 		dst = dst_destroy(dst);
> 		if (!dst)
> 			return;
> 	}
> 	__dst_free(dst);
> }

dst_free() is called after RCU grace period, in the case you are
interested in.

Look at dst_rcu_free() and rt_free()
Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload
On Wed, Sep 2, 2015 at 2:07 PM, Or Gerlitz wrote:
> On Wed, Sep 2, 2015 at 8:38 PM, Tom Herbert wrote:
>> On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D wrote:
>
>>> Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
>>> It means that device can fill TCP/UDP-like checksum anywhere in the packets
>>> whatever headers there might be.
>
>>> The device can't do whatever, wherever. There is always a limit to the
>>> offset to the inner headers that can be handled, for instance.
>
>> If the device does NETIF_F_HW_CSUM then inner/outer headers are
>> irrelevant at least in the non-GSO case. All the device needs to do is
>> compute the checksum from start and write the answer at the given
>> offset. No protocol awareness needed in the device, no need to parse
>> headers on transmit.
>
> Tom, could you elaborate a little further on the
> semantics/requirements for devices supporting NETIF_F_HW_CSUM,
> specifically, AFAIU this isn't a TX equivalent of supporting checksum
> complete on RX, right? when you say "write the answer at the given
> offset" what non-common answers are you expecting devices to produce?
> how the kernel is hinting to the device on the nature on the expected
> answer beyond the offset?
>
NETIF_F_HW_CSUM indicates that the device/driver is willing to implement
CHECKSUM_PARTIAL on output for the general case. CHECKSUM_PARTIAL is
described in skbuff.h as:

  The device is required to checksum the packet as seen by
  hard_start_xmit() from skb->csum_start up to the end, and to
  record/write the checksum at offset skb->csum_start + skb->csum_offset.

For instance, if we want to offload an inner checksum the stack would
set csum_start to the offset of the inner transport packet and
csum_offset to the relative offset of the checksum field. The stack
takes care of priming the checksum field with the bitwise NOT of the
pseudo-header checksum if the transport protocol needs that.

Tom

> Or.
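The CHECKSUM_PARTIAL contract quoted from skbuff.h can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions: model_hw_csum() and its helpers are hypothetical names, the packet is a flat byte buffer rather than an skb, the packet is assumed to be shorter than 64 KiB so the 32-bit accumulator cannot overflow, and the checksum field is assumed to already hold whatever seed the stack primed it with. The point it illustrates is that the device needs no protocol knowledge: it sums from csum_start to the end and stores the complemented result at csum_start + csum_offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into a 16-bit ones' complement sum. */
static uint16_t csum_fold32(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Add buf[0..len) to sum as big-endian 16-bit words (odd tail zero-padded). */
static uint32_t csum_add_bytes(uint32_t sum, const uint8_t *buf, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (i < len)
		sum += (uint32_t)buf[i] << 8;
	return sum;
}

/*
 * Hypothetical model of a device honouring CHECKSUM_PARTIAL: checksum the
 * bytes from csum_start to the end of the packet (including the primed
 * checksum field) and write the complemented result into that field.
 * Everything before csum_start, e.g. outer headers, is never inspected.
 */
static void model_hw_csum(uint8_t *pkt, size_t len,
			  size_t csum_start, size_t csum_offset)
{
	uint16_t csum;

	assert(csum_start + csum_offset + 2 <= len);

	csum = (uint16_t)~csum_fold32(csum_add_bytes(0, pkt + csum_start,
						     len - csum_start));
	pkt[csum_start + csum_offset] = (uint8_t)(csum >> 8);
	pkt[csum_start + csum_offset + 1] = (uint8_t)(csum & 0xff);
}
```

A receiver that adds the same seed to the sum of the region after the write gets 0xffff, the usual "checksum verifies" result for the Internet checksum.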
RE: [Intel-wired-lan] [PATCH 3/6] ethernet/ixgbe: advertise LRO support in vlan_features
-Original Message- From: Intel-wired-lan [mailto:intel-wired-lan-boun...@lists.osuosl.org] On Behalf Of Jarod Wilson Sent: Thursday, August 13, 2015 11:03 AM To: linux-ker...@vger.kernel.org Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Jarod Wilson Subject: [Intel-wired-lan] [PATCH 3/6] ethernet/ixgbe: advertise LRO support in vlan_features Without this, the presence of a ixgbe device in a bond will not trigger LRO support to be enabled at the bond level, even while it is enabled on the slave itself. This change becomes necessary when NETIF_F_LRO is added to netdev_features.h's NETIF_F_ONE_FOR_ALL. CC: Jeff Kirsher CC: intel-wired-...@lists.osuosl.org CC: netdev@vger.kernel.org Signed-off-by: Jarod Wilson --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 3e6a931..0a6e4e1 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -8659,8 +8659,10 @@ skip_sriov: if (adapter->flags2 & IXGBE_FLAG2_RSC_CAPABLE) netdev->hw_features |= NETIF_F_LRO; - if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) + if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED) { netdev->features |= NETIF_F_LRO; + netdev->vlan_features |= NETIF_F_LRO; + } /* make sure the EEPROM is good */ if (hw->eeprom.ops.validate_checksum(hw, NULL) < 0) { -- 1.8.3.1 ___ Intel-wired-lan mailing list intel-wired-...@lists.osuosl.org http://lists.osuosl.org/mailman/listinfo/intel-wired-lan While Validating this patch we have run in to a call trace if we have forwarding (net.ipv4.ip_forward = 1) and LRO enabled on interface prior to creating VLAN interface. With the patch reverted we don't see this failure. Validation setup: sysctl net.ipv4.ip_forward=1 ethtool -K ethX lro on ip link set ethX up ip link add link ethX name ethX.10 type vlan id 10. 
CALL TRACE: [582992.985245] ixgbe :83:00.0 eth6: NIC Link is Up 10 Gbps, Flow Control: RX/TX [582992.985400] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready [582995.764828] ixgbe :83:00.1 eth7: NIC Link is Up 10 Gbps, Flow Control: RX/TX [582995.764964] IPv6: ADDRCONF(NETDEV_CHANGE): eth7: link becomes ready [583027.588991] ixgbe :04:00.0 eth2: NIC Link is Up 10 Gbps, Flow Control: RX/TX [583044.365523] [ cut here ] [583044.366181] WARNING: CPU: 20 PID: 56879 at net/core/dev.c:1472 dev_disable_lro+0x95/0xa0() [583044.366711] netdevice: eth2.10 failed to disable LRO! [583044.367876] Modules linked in: ixgbe ixgb igb e100 mii e1000 e1000e 8021q garp mrp tcp_lp bnep bluetooth rfkill fuse btrfs xor raid6_pq vfat msdos fat ext4 mbcache jbd2 binfmt_misc xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 iptable_filter ip_tables tun bridge stp llc x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel nfsd ghash_clmulni_intel mei_me aesni_intel mei lrw auth_rpcgss gf128mul shpchp iTCO_wdt ioatdma iTCO_vendor_support glue_helper nfs_acl ablk_helper lockd cryptd i2c_i801 lpc_ich mfd_core ipmi_si sb_edac edac_core grace dm_mirror dm_region_hash ipmi_msghandler pcspkr wmi dm_log dm_mod sunrpc uinput [583044.371142] xfs libcrc32c sr_mod cdrom sd_mod mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper ttm drm ahci libahci libata mdio vxlan firewire_ohci ip6_udp_tunnel firewire_core udp_tunnel ptp i2c_algo_bit pps_core i2c_core crc_itu_t dca [last unloaded: ixgbe] [583044.372915] CPU: 20 PID: 56879 Comm: ip Tainted: GW IOE 4.2.0-rc7-Ustream-8-26-15+ #1 [583044.373511] Hardware name: Intel Corporation S2600CO/S2600CO, BIOS SE5C600.86B.01.06.0001.090720121056 09/07/2012 [583044.374126] e9a2d4dc 8803ce16b5b8 8166b4e9 [583044.374752] 8803ce16b610 8803ce16b5f8 8107b06a [583044.375380] 
8803ce16b608 880428041000 818dc1eb 0005 [583044.376033] Call Trace: [583044.376662] [] dump_stack+0x45/0x57 [583044.377290] [] warn_slowpath_common+0x8a/0xc0 [583044.377950] [] warn_slowpath_fmt+0x55/0x70 [583044.378570] [] ? netdev_update_features+0x25/0x60 [583044.379218] [] dev_disable_lro+0x95/0xa0 [583044.379841] [] inetdev_init+0x17d/0x230 [583044.380458] [] inetdev_event+0x37f/0x4f0 [583044.381079] [] notifier_call_chain+0x4d/0x80 [583044.381697] [] raw_notifier_call_chain+0x16/0x20 [583044.382343] [] call_netdevice_notifiers_info+0x39/0x70 [583044.382971] [] register_netdevice+0x2ae/0x430 [583044.383595] [] ? dev_get_nest_level+0x64/0xa0 [583044.384226] [] register_vlan_dev+0xd
Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response
From: Andy Gospodarek Date: Wed, 2 Sep 2015 15:43:27 -0400 > On Wed, Sep 02, 2015 at 09:08:36PM +0200, Thomas Graf wrote: >> On 09/02/15 at 12:51pm, David Ahern wrote: >> > On 9/2/15 12:49 PM, David Miller wrote: >> > >From: Thomas Graf >> > >Date: Wed, 2 Sep 2015 20:43:46 +0200 >> > > >> > >>On 09/02/15 at 09:40am, David Ahern wrote: >> > >>>rt_fill_info which is called for 'route get' requests hardcodes the >> > >>>table id as RT_TABLE_MAIN which is not correct when multiple tables >> > >>>are used. Use the newly added table id in the rtable to send back >> > >>>the correct table. >> > >>> >> > >>>Signed-off-by: David Ahern >> > >> >> > >>What RTM_GETROUTE returns is not the actual route but a description >> > >>of the routing decision which is why table id, scope, protocol, and >> > >>prefix length are hardcoded. This is indicated by the RTM_F_CLONED >> > >>flag. What you propose would break userspace ABI. >> > > >> > >Agreed, I don't think we can do this. >> > > >> > >> > Doesn't the table used to come up with the decision matter for IPv4? ie., >> > hardcoding to MAIN is misleading when there is absolutely no way the >> > decision comes from that table. IPv6 already returns the table id. >> > >> > Or is your response that it breaks ABI and hence not going to fix. >> >> This behaviour comes back from when we still had the IPv4 routing cache >> which was flat. > > So before the routing cache was removed, was the response always > RTA_TABLE_MAIN since there was no way to indicate which table may have > route if it came from the cache? Right. In fact, it was possible for routes from multiple tables to end up evaluating to the same routing cache entry. So there could be a many to one relationship back then. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net] ipv6: fix exthdrs offload registration in out_rt path
From: Daniel Borkmann
Date: Thu, 3 Sep 2015 00:29:07 +0200

> We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
> but in error path, we try to unregister it with inet_del_offload(). This
> doesn't seem correct, it should actually be inet6_del_offload(), also
> ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
> also uses rthdr_offload twice), but it got removed entirely later on.
>
> Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
> Signed-off-by: Daniel Borkmann

Applied and queued up for -stable, thanks Daniel.
[PATCH net] ipv6: fix exthdrs offload registration in out_rt path
We previously register IPPROTO_ROUTING offload under inet6_add_offload(),
but in error path, we try to unregister it with inet_del_offload(). This
doesn't seem correct, it should actually be inet6_del_offload(), also
ipv6_exthdrs_offload_exit() from that commit seems rather incorrect (it
also uses rthdr_offload twice), but it got removed entirely later on.

Fixes: 3336288a9fea ("ipv6: Switch to using new offload infrastructure.")
Signed-off-by: Daniel Borkmann
---
 (Found during code review.)

 net/ipv6/exthdrs_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/exthdrs_offload.c b/net/ipv6/exthdrs_offload.c
index 447a7fb..f5e2ba1 100644
--- a/net/ipv6/exthdrs_offload.c
+++ b/net/ipv6/exthdrs_offload.c
@@ -36,6 +36,6 @@ out:
 	return ret;
 
 out_rt:
-	inet_del_offload(&rthdr_offload, IPPROTO_ROUTING);
+	inet6_del_offload(&rthdr_offload, IPPROTO_ROUTING);
 	goto out;
 }
-- 
1.9.3
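The class of bug fixed by the exthdrs offload patch (an error path that undoes a registration through the wrong family's helper) is easy to reproduce in a small user-space model. The sketch below is an illustration only: v4_offloads, v6_offloads, offload_add(), offload_del() and exthdrs_init_model() are hypothetical stand-ins, not the kernel's inet_*/inet6_* functions, and the second registration (PROTO_DSTOPTS) is assumed purely to give the error path something to unwind from.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical protocol slots; only the names echo the patch. */
enum { PROTO_ROUTING, PROTO_DSTOPTS, PROTO_MAX };

/* Two separate registries standing in for the IPv4 and IPv6 offload tables. */
static bool v4_offloads[PROTO_MAX];
static bool v6_offloads[PROTO_MAX];

static int offload_add(bool *table, int proto)
{
	if (proto < 0 || proto >= PROTO_MAX || table[proto])
		return -EINVAL;
	table[proto] = true;
	return 0;
}

static int offload_del(bool *table, int proto)
{
	if (proto < 0 || proto >= PROTO_MAX || !table[proto])
		return -EINVAL;
	table[proto] = false;
	return 0;
}

/* Same goto layout as the patched function: out returns, out_rt unwinds. */
static int exthdrs_init_model(bool fail_dstopts)
{
	int ret;

	ret = offload_add(v6_offloads, PROTO_ROUTING);
	if (ret)
		goto out;

	ret = fail_dstopts ? -ENOMEM : offload_add(v6_offloads, PROTO_DSTOPTS);
	if (ret)
		goto out_rt;
out:
	return ret;

out_rt:
	/*
	 * Undo on the same table the add used. Deleting from v4_offloads
	 * here would fail and leave PROTO_ROUTING registered in v6_offloads.
	 */
	offload_del(v6_offloads, PROTO_ROUTING);
	goto out;
}
```

In this model a delete against the wrong table returns -EINVAL and changes nothing, so a failed init would leave a stale IPv6 registration behind.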
Re: [PATCHv1 net-next 0/5] netlink: mmap: kernel panic and some issues
On Wed, Sep 02, 2015 at 05:56:36PM +0200, Daniel Borkmann wrote:
> you suggest or not), for two reasons: I think (will start experimenting
> more with it tomorrow), you would get an out of bounds access here in
> case the skb->data is the last slot in the ring buffer and reaches
> exactly to the ring buffer end. And (despite that), it's also hard

I thought accessing it as a value, not a pointer, in that wrong shared
info would not be a big problem, but

> to maintain - the next one adding a new shared info member will very
> likely oversee this special case in netlink here, thus the issue would
> then simply be reintroduced over and over.

I agree with you. Thank you for taking your time. I think I have
learned a lot.

Thanks,
Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
On Wed, Sep 02, 2015 at 02:30:45PM -0700, Eric Dumazet wrote:
> Object cannot be freed until all cpus have exited their RCU sections.
You meant the dst_destroy() here will wait for all cpus exited their RCU
sections?

static inline void dst_free(struct dst_entry *dst)
{
	if (dst->obsolete > 0)
		return;
	if (!atomic_read(&dst->__refcnt)) {
		dst = dst_destroy(dst);
		if (!dst)
			return;
	}
	__dst_free(dst);
}
Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
On Wed, 2015-09-02 at 13:58 -0700, Martin KaFai Lau wrote:
> On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > > 2. Use a spinlock to protect the dst_cache operations
> >
> > Well, a seqlock would be better : No need for an atomic operation in
> > fast path.
> >
> seqlock can ensure consistency between idst->dst and idst->cookie.
> However, IPv6 dst destruction is not protected by rcu. dst_free() is
> directly called, like in ip6_fib.c and a few other places.
> Hence, atomic_inc_not_zero() cannot be used here because the dst may
> have already been kmem_cache_free() when refcnt is 0.

Really ? What about basic rcu rules ?

Object cannot be freed until all cpus have exited their RCU sections.

> A spinlock is
> needed to stop the ip6_tnl_dst_set() side from removing the refcnt.

Are you telling me RCU should be banished from the kernel ? ;)
Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload
On Wed, Sep 2, 2015 at 8:38 PM, Tom Herbert wrote:
> On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D wrote:

>> Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
>> It means that device can fill TCP/UDP-like checksum anywhere in the packets
>> whatever headers there might be.

>> The device can't do whatever, wherever. There is always a limit to the
>> offset to the inner headers that can be handled, for instance.

> If the device does NETIF_F_HW_CSUM then inner/outer headers are
> irrelevant at least in the non-GSO case. All the device needs to do is
> compute the checksum from start and write the answer at the given
> offset. No protocol awareness needed in the device, no need to parse
> headers on transmit.

Tom, could you elaborate a little further on the semantics/requirements
for devices supporting NETIF_F_HW_CSUM? Specifically, AFAIU this isn't a
TX equivalent of supporting checksum complete on RX, right? When you say
"write the answer at the given offset", what non-common answers are you
expecting devices to produce? How does the kernel hint to the device about
the nature of the expected answer beyond the offset?

Or.
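As an illustration of the "compute the checksum from start and write the answer at the given offset" behaviour quoted above, here is a minimal userspace sketch. It is not driver or kernel code: the function names and the byte-buffer model are hypothetical, and it assumes the checksum field has already been seeded by the stack (for example with a pseudo-header sum) so that the device only needs to sum bytes and store the complemented result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit ones-complement accumulator down to 16 bits. */
uint16_t csum_fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/*
 * Hypothetical model of protocol-agnostic TX checksum offload:
 * sum every byte from 'start' to the end of the packet as 16-bit
 * big-endian words, then store the complemented result at
 * start + offset. No header parsing is involved.
 */
void hw_csum_fill(uint8_t *pkt, size_t len, size_t start, size_t offset)
{
	uint32_t sum = 0;
	size_t i;

	for (i = start; i + 1 < len; i += 2)
		sum += ((uint32_t)pkt[i] << 8) | pkt[i + 1];
	if (i < len)	/* odd trailing byte, padded with a zero byte */
		sum += (uint32_t)pkt[i] << 8;

	uint16_t csum = (uint16_t)~csum_fold16(sum);

	pkt[start + offset]     = (uint8_t)(csum >> 8);
	pkt[start + offset + 1] = (uint8_t)(csum & 0xff);
}
```

In this model the only per-packet inputs are the two numbers `start` and `offset`, which is the sense in which no encapsulation awareness is needed; any limit a real device places on those values is outside the sketch.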
Re: [PATCH net 3/3] ipv6: Fix dst_entry refcnt bugs in ip6_tunnel
On Tue, Sep 01, 2015 at 01:14:20PM -0700, Eric Dumazet wrote:
> > 2. Use a spinlock to protect the dst_cache operations
>
> Well, a seqlock would be better : No need for an atomic operation in
> fast path.
>
seqlock can ensure consistency between idst->dst and idst->cookie.
However, IPv6 dst destruction is not protected by rcu. dst_free() is
directly called, like in ip6_fib.c and a few other places.
Hence, atomic_inc_not_zero() cannot be used here because the dst may
have already been kmem_cache_free() when refcnt is 0. A spinlock is
needed to stop the ip6_tnl_dst_set() side from removing the refcnt.
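For readers following the refcount argument, the "increment unless zero" pattern under discussion can be sketched in userspace with C11 atomics. This is only an analogy to the kernel's atomic_inc_not_zero(); `struct obj` and `obj_get_unless_zero()` are hypothetical names. The sketch shows the compare-and-swap loop only and deliberately says nothing about memory lifetime, which is exactly the point in dispute: the pattern is safe only if the object's memory is still valid when the counter is read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a dst entry. */
struct obj {
	atomic_int refcnt;
};

/*
 * Take a reference only if the count is currently non-zero.
 * Returns true on success, false if the count was already zero.
 * The caller must guarantee that *o is still valid memory
 * (in the kernel, typically via RCU); this sketch does not.
 */
bool obj_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->refcnt);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&o->refcnt, &old, old + 1))
			return true;
	}
	return false;
}
```

A reader that gets `false` back has to treat the cached pointer as gone and fall back to a fresh lookup; it must not touch the object further.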
[PATCH net-next 3/3 v2] net: Allow user to get table id from route lookup
rt_fill_info which is called for 'route get' requests hardcodes the
table id as RT_TABLE_MAIN which is not correct when multiple tables
are used. Use the newly added table id in the rtable to send back
the correct table similar to what is done for IPv6.

To maintain current ABI a new request flag, RTM_F_LOOKUP_TABLE, is
added to indicate the actual table is wanted versus the hardcoded
response.

Signed-off-by: David Ahern
---
v2
- use a new request flag to indicate the real table id is wanted
  (suggested by Thomas)

 include/uapi/linux/rtnetlink.h |  1 +
 net/ipv4/route.c               | 12 ++++++++----
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 702024769c74..06625b401422 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -270,6 +270,7 @@ enum rt_scope_t {
 #define RTM_F_CLONED		0x200	/* This route is cloned		*/
 #define RTM_F_EQUALIZE		0x400	/* Multipath equalizer: NI	*/
 #define RTM_F_PREFIX		0x800	/* Prefix addresses		*/
+#define RTM_F_LOOKUP_TABLE	0x1000	/* set rtm_table to FIB lookup result */
 
 /* Reserved table identifiers */
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 92acc95b7578..da427a4a33fe 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2305,7 +2305,7 @@ struct rtable *ip_route_output_flow(struct net *net, struct flowi4 *flp4,
 }
 EXPORT_SYMBOL_GPL(ip_route_output_flow);
 
-static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
+static int rt_fill_info(struct net *net, __be32 dst, __be32 src, u32 table_id,
 			struct flowi4 *fl4, struct sk_buff *skb, u32 portid,
 			u32 seq, int event, int nowait, unsigned int flags)
 {
@@ -2325,8 +2325,8 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 	r->rtm_dst_len	= 32;
 	r->rtm_src_len	= 0;
 	r->rtm_tos	= fl4->flowi4_tos;
-	r->rtm_table	= RT_TABLE_MAIN;
-	if (nla_put_u32(skb, RTA_TABLE, RT_TABLE_MAIN))
+	r->rtm_table	= table_id;
+	if (nla_put_u32(skb, RTA_TABLE, table_id))
 		goto nla_put_failure;
 	r->rtm_type	= rt->rt_type;
 	r->rtm_scope	= RT_SCOPE_UNIVERSE;
@@ -2431,6 +2431,7 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 	int err;
 	int mark;
 	struct sk_buff *skb;
+	u32 table_id = RT_TABLE_MAIN;
 
 	err = nlmsg_parse(nlh, sizeof(*rtm), tb, RTA_MAX, rtm_ipv4_policy);
 	if (err < 0)
@@ -2500,7 +2501,10 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh)
 	if (rtm->rtm_flags & RTM_F_NOTIFY)
 		rt->rt_flags |= RTCF_NOTIFY;
 
-	err = rt_fill_info(net, dst, src, &fl4, skb,
+	if (rtm->rtm_flags & RTM_F_LOOKUP_TABLE)
+		table_id = rt->rt_table_id;
+
+	err = rt_fill_info(net, dst, src, table_id, &fl4, skb,
 			   NETLINK_CB(in_skb).portid, nlh->nlmsg_seq,
 			   RTM_NEWROUTE, 0, 0);
 	if (err < 0)
-- 
1.9.1
[PATCH net-next 1/3] net: Refactor rtable initialization
All callers to rt_dst_alloc have nearly the same initialization following a successful allocation. Consolidate it into rt_dst_alloc. Signed-off-by: David Ahern --- net/ipv4/route.c | 85 ++-- 1 file changed, 33 insertions(+), 52 deletions(-) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 5f4a5565ad8b..eaefeadce07c 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1438,12 +1438,33 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr, } static struct rtable *rt_dst_alloc(struct net_device *dev, + unsigned int flags, u16 type, bool nopolicy, bool noxfrm, bool will_cache) { - return dst_alloc(&ipv4_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK, -(will_cache ? 0 : (DST_HOST | DST_NOCACHE)) | -(nopolicy ? DST_NOPOLICY : 0) | -(noxfrm ? DST_NOXFRM : 0)); + struct rtable *rt; + + rt = dst_alloc(&ipv4_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK, + (will_cache ? 0 : (DST_HOST | DST_NOCACHE)) | + (nopolicy ? DST_NOPOLICY : 0) | + (noxfrm ? DST_NOXFRM : 0)); + + if (rt) { + rt->rt_genid = rt_genid_ipv4(dev_net(dev)); + rt->rt_flags = flags; + rt->rt_type = type; + rt->rt_is_input = 0; + rt->rt_iif = 0; + rt->rt_pmtu = 0; + rt->rt_gateway = 0; + rt->rt_uses_gateway = 0; + INIT_LIST_HEAD(&rt->rt_uncached); + + rt->dst.output = ip_output; + if (flags & RTCF_LOCAL) + rt->dst.input = ip_local_deliver; + } + + return rt; } /* called in rcu_read_lock() section */ @@ -1452,6 +1473,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr, { struct rtable *rth; struct in_device *in_dev = __in_dev_get_rcu(dev); + unsigned int flags = RTCF_MULTICAST; u32 itag = 0; int err; @@ -1477,7 +1499,10 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr, if (err < 0) goto e_err; } - rth = rt_dst_alloc(dev_net(dev)->loopback_dev, + if (our) + flags |= RTCF_LOCAL; + + rth = rt_dst_alloc(dev_net(dev)->loopback_dev, flags, RTN_MULTICAST, IN_DEV_CONF_GET(in_dev, NOPOLICY), false, false); if (!rth) goto e_nobufs; @@ -1486,20 +1511,7 @@ 
static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr, rth->dst.tclassid = itag; #endif rth->dst.output = ip_rt_bug; - - rth->rt_genid = rt_genid_ipv4(dev_net(dev)); - rth->rt_flags = RTCF_MULTICAST; - rth->rt_type= RTN_MULTICAST; rth->rt_is_input= 1; - rth->rt_iif = 0; - rth->rt_pmtu= 0; - rth->rt_gateway = 0; - rth->rt_uses_gateway = 0; - INIT_LIST_HEAD(&rth->rt_uncached); - if (our) { - rth->dst.input= ip_local_deliver; - rth->rt_flags |= RTCF_LOCAL; - } #ifdef CONFIG_IP_MROUTE if (!ipv4_is_local_multicast(daddr) && IN_DEV_MFORWARD(in_dev)) @@ -1608,7 +1620,7 @@ static int __mkroute_input(struct sk_buff *skb, } } - rth = rt_dst_alloc(out_dev->dev, + rth = rt_dst_alloc(out_dev->dev, 0, res->type, IN_DEV_CONF_GET(in_dev, NOPOLICY), IN_DEV_CONF_GET(out_dev, NOXFRM), do_cache); if (!rth) { @@ -1616,19 +1628,10 @@ static int __mkroute_input(struct sk_buff *skb, goto cleanup; } - rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev)); - rth->rt_flags = 0; - rth->rt_type = res->type; rth->rt_is_input = 1; - rth->rt_iif = 0; - rth->rt_pmtu= 0; - rth->rt_gateway = 0; - rth->rt_uses_gateway = 0; - INIT_LIST_HEAD(&rth->rt_uncached); RT_CACHE_STAT_INC(in_slow_tot); rth->dst.input = ip_forward; - rth->dst.output = ip_output; rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag); if (lwtunnel_output_redirect(rth->dst.lwtstate)) { @@ -1795,26 +1798,16 @@ out:return err; } } - rth = rt_dst_alloc(net->loopback_dev, + rth = rt_dst_alloc(net->loopback_dev, flags | RTCF_LOCAL, res.type, IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache); if (!rth) goto e_nobufs; - rth->dst.input= ip_local_deliver; rth->dst.output= ip_rt_bug; #ifdef CONFIG_IP_ROUTE_CLASSID rth->dst.tclassid = itag; #endif - - rth->rt_genid = rt_genid_ipv4(net); - rth->rt_flags = flags|RTCF_LOCAL; - rth->rt_type= res.typ
[PATCH net-next 2/3] net: Add FIB table id to rtable
Add the FIB table id to rtable to make the information available for IPv4 as it is for IPv6. Signed-off-by: David Ahern --- drivers/net/vrf.c | 2 ++ include/net/route.h | 2 ++ net/ipv4/route.c| 8 net/ipv4/xfrm4_policy.c | 1 + 4 files changed, 13 insertions(+) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index e7094fbd7568..8c9ab5ebea23 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -320,6 +320,7 @@ static void vrf_rtable_destroy(struct net_vrf *vrf) static struct rtable *vrf_rtable_create(struct net_device *dev) { + struct net_vrf *vrf = netdev_priv(dev); struct rtable *rth; rth = dst_alloc(&vrf_dst_ops, dev, 2, @@ -335,6 +336,7 @@ static struct rtable *vrf_rtable_create(struct net_device *dev) rth->rt_pmtu= 0; rth->rt_gateway = 0; rth->rt_uses_gateway = 0; + rth->rt_table_id = vrf->tb_id; INIT_LIST_HEAD(&rth->rt_uncached); rth->rt_uncached_list = NULL; } diff --git a/include/net/route.h b/include/net/route.h index cc61cb95f059..10a7d21a211c 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -64,6 +64,8 @@ struct rtable { /* Miscellaneous cached information */ u32 rt_pmtu; + u32 rt_table_id; + struct list_headrt_uncached; struct uncached_list*rt_uncached_list; }; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index eaefeadce07c..92acc95b7578 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1457,6 +1457,7 @@ static struct rtable *rt_dst_alloc(struct net_device *dev, rt->rt_pmtu = 0; rt->rt_gateway = 0; rt->rt_uses_gateway = 0; + rt->rt_table_id = 0; INIT_LIST_HEAD(&rt->rt_uncached); rt->dst.output = ip_output; @@ -1629,6 +1630,8 @@ static int __mkroute_input(struct sk_buff *skb, } rth->rt_is_input = 1; + if (res->table) + rth->rt_table_id = res->table->tb_id; RT_CACHE_STAT_INC(in_slow_tot); rth->dst.input = ip_forward; @@ -1808,6 +1811,8 @@ out: return err; rth->dst.tclassid = itag; #endif rth->rt_is_input = 1; + if (res.table) + rth->rt_table_id = res.table->tb_id; RT_CACHE_STAT_INC(in_slow_tot); if (res.type == 
RTN_UNREACHABLE) { @@ -1988,6 +1993,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res, return ERR_PTR(-ENOBUFS); rth->rt_iif = orig_oif ? : 0; + if (res->table) + rth->rt_table_id = res->table->tb_id; + RT_CACHE_STAT_INC(out_slow_tot); if (flags & (RTCF_BROADCAST | RTCF_MULTICAST)) { diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c index bb919b28619f..671011055ad5 100644 --- a/net/ipv4/xfrm4_policy.c +++ b/net/ipv4/xfrm4_policy.c @@ -95,6 +95,7 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev, xdst->u.rt.rt_gateway = rt->rt_gateway; xdst->u.rt.rt_uses_gateway = rt->rt_uses_gateway; xdst->u.rt.rt_pmtu = rt->rt_pmtu; + xdst->u.rt.rt_table_id = rt->rt_table_id; INIT_LIST_HEAD(&xdst->u.rt.rt_uncached); return 0; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload
On Wed, Sep 2, 2015 at 8:38 PM, Tom Herbert wrote: > On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D > wrote: >>> On Sep 1, 2015, at 8:17 PM, Tom Herbert wrote: >>> >>> I suspect this is not UDP-encapsulation specific, will it work with >>> TCP/IP/IP, TCP/IP/GRE etc.? >> >> It could do more, but this is what has been tested up to this point. >> > Well, please test the those other encapsulations too! It's nice and > all if they get the benefit, but it's really bad news if these changes > were to screw them up (i.e. you don't want users of the GRE, IPIP to > find out that they're now broken). > >>> Isn't there anyway the ixgbe could just be made to NETIF_HW_CSUM? That >>> would be so much more straightforward and support nearly all use cases >>> without needing to jump through all these hoops. >> >> Well, the description says: >> >> --- >> Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. >> It means that device can fill TCP/UDP-like checksum anywhere in the packets >> whatever headers there might be. >> --- >> >> The device can't do whatever, wherever. There is always a limit to the >> offset to the inner headers that can be handled, for instance. >> > If the device does NETIF_F_HW_CSUM then inner/outer headers are > irrelevant at least in the non-GSO case. All the device needs to do is > compute the checksum from start and write the answer at the given > offset. No protocol awareness needed in the device, no need to parse > headers on transmit. Tom, could you elaborate a little further on the semantics/requirements for devices supporting NETIF_F_HW_CSUM, clearly (as mentioned in > I have the same complaint that ixgbe requires a bunch of driver logic > to offload VXLAN checksum unnecessary instead of just providing > CHECKSUM_COMPLETE which would work with any encapsulation protocol, > require no encapsulation awareness in the device, and should be a much > simpler driver implementation. 
Re: [PATCH net-next v2] net: Add table id from route lookup to route response
On Wed, 2 Sep 2015 13:16:20 -0700 David Ahern wrote:
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 702024769c74..5add1468350a 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -310,6 +310,7 @@ enum rtattr_type_t {
>  	RTA_PREF,
>  	RTA_ENCAP_TYPE,
>  	RTA_ENCAP,
> +	RTA_TABLE_LOOKUP, /* table hit for fib lookup */

Why add a comment here? There is nothing special that needs a comment.
Re: [PATCH net-next v2] net: Add table id from route lookup to route response
On 9/2/15 2:41 PM, Alexander Duyck wrote:
> Why not implement this the same for IPv4 and IPv6? It looks like it is
> only included if it is non-zero and not MAIN in the above case, and then
> below as long as a table ID is non-zero you are setting the value. Why
> not just include the value in all cases where it is defined just like
> for IPv6?

I like Thomas' suggestion to add an rtm_flag better. We only need to fix
IPv4, which hardcodes the table id. Adding a flag, e.g.,

+#define RTM_F_LOOKUP_TABLE 0x1000 /* set rtm_table to FIB lookup result */

signifies the caller wants the real table. When set, rt_fill_info sets
rtm_table to the actual table id. This allows updated tools to work
properly for both ipv4 and ipv6 without breaking existing userspace.
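The opt-in behaviour described here can be sketched as a small standalone function. This is an illustration only, not the kernel implementation: `reply_table_id()` is a hypothetical helper, the flag value is the one proposed in this thread, and RT_TABLE_MAIN is given its value from linux/rtnetlink.h so the snippet compiles on its own.

```c
#include <assert.h>
#include <stdint.h>

#define RT_TABLE_MAIN		254	/* as defined in linux/rtnetlink.h */
#define RTM_F_LOOKUP_TABLE	0x1000	/* flag value proposed in this thread */

/*
 * Hypothetical helper: pick the table id to report in a 'route get'
 * reply. Callers that do not set the flag keep seeing RT_TABLE_MAIN,
 * so existing userspace is unaffected.
 */
uint32_t reply_table_id(unsigned int rtm_flags, uint32_t lookup_table_id)
{
	if (rtm_flags & RTM_F_LOOKUP_TABLE)
		return lookup_table_id;
	return RT_TABLE_MAIN;
}
```

The design choice is that the default stays bit-for-bit identical to the old reply, and only a request that explicitly asks for it gets the table that the FIB lookup actually hit.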
Re: [PATCH net-next v2] net: Add table id from route lookup to route response
On 09/02/2015 01:16 PM, David Ahern wrote:
> IPv4 ABI has the table hardcoded as RT_TABLE_MAIN regardless of the table
> hit for the route lookup. Add the table using a new attribute,
> RTA_TABLE_LOOKUP, to maintain the ABI yet return the right table id.
>
> Signed-off-by: David Ahern
> ---
>
> Thomas: Something like this?
>
> The current ABI is returning wrong data in some cases; that seems worse
> to me than breaking the ABI.
>
>  include/uapi/linux/rtnetlink.h | 1 +
>  net/ipv4/route.c               | 5 +++++
>  net/ipv6/route.c               | 4 ++++
>  3 files changed, 10 insertions(+)
>
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 702024769c74..5add1468350a 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -310,6 +310,7 @@ enum rtattr_type_t {
>  	RTA_PREF,
>  	RTA_ENCAP_TYPE,
>  	RTA_ENCAP,
> +	RTA_TABLE_LOOKUP, /* table hit for fib lookup */
>  	__RTA_MAX
>  };
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 92acc95b7578..95454c368e66 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2328,6 +2328,11 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
>  	r->rtm_table	= RT_TABLE_MAIN;
>  	if (nla_put_u32(skb, RTA_TABLE, RT_TABLE_MAIN))
>  		goto nla_put_failure;
> +
> +	if (rt->rt_table_id && rt->rt_table_id != RT_TABLE_MAIN &&
> +	    nla_put_u32(skb, RTA_TABLE_LOOKUP, rt->rt_table_id))
> +		goto nla_put_failure;
> +
>  	r->rtm_type	= rt->rt_type;
>  	r->rtm_scope	= RT_SCOPE_UNIVERSE;
>  	r->rtm_protocol = RTPROT_UNSPEC;

Why not implement this the same for IPv4 and IPv6? It looks like it is
only included if it is non-zero and not MAIN in the above case, and then
below as long as a table ID is non-zero you are setting the value. Why
not just include the value in all cases where it is defined just like
for IPv6?

> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index f45cac6f8356..3c5d3a50bb7b 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -2922,6 +2922,10 @@ static int rt6_fill_node(struct net *net,
>  	rtm->rtm_table = table;
>  	if (nla_put_u32(skb, RTA_TABLE, table))
>  		goto nla_put_failure;
> +
> +	if (table && nla_put_u32(skb, RTA_TABLE_LOOKUP, table))
> +		goto nla_put_failure;
> +
>  	if (rt->rt6i_flags & RTF_REJECT) {
>  		switch (rt->dst.error) {
>  		case -EINVAL:
Re: [PATCH net-next v2] net: Add table id from route lookup to route response
On 9/2/15 2:23 PM, Thomas Graf wrote:
> On 09/02/15 at 01:16pm, David Ahern wrote:
>> IPv4 ABI has the table hardcoded as RT_TABLE_MAIN regardless of the table
>> hit for the route lookup. Add the table using a new attribute,
>> RTA_TABLE_LOOKUP, to maintain the ABI yet return the right table id.
>>
>> Signed-off-by: David Ahern
>> ---
>>
>> Thomas: Something like this?
>>
>> The current ABI is returning wrong data in some cases; that seems worse
>> to me than breaking the ABI.
>
> Another option is to introduce a new flag bundled with RTM_GETROUTE
> which fixes RTM_GETROUTE altogether and makes it return the actual
> route instead of a simulated cache entry.

I like that better; at least then information is not duplicated. Thanks
for the suggestion.
Re: [PATCH net-next 1/4] net: qdisc: add op to run filters/actions before enqueue
On 09/02/15 02:22, Cong Wang wrote:
> (Why not Cc'ing Jamal for net_sched patches?)
>
> On Tue, Sep 1, 2015 at 9:34 AM, Daniel Borkmann wrote:
>> From: John Fastabend
>>
>> Add a new ->preclassify() op to allow multiqueue queuing disciplines
>> to call tc_classify() or perform other work before dev_pick_tx().
>>
>> This helps, for example, with mqprio queueing discipline that has
>> offload support by most popular 10G NICs, where the txq effectively
>> picks the qdisc.
>>
>> Once traffic is being directed to a specific queue then hardware TX
>> rings may be tuned to support this traffic type. mqprio already
>> gives the ability to do this via skb->priority where the ->preclassify()
>> provides more control over packet steering, it can classify the skb
>> and set the priority, for example, from an eBPF classifier (or action).
>>
>> Also this allows traffic classifiers to be run without holding the
>> qdisc lock and gives one place to attach filters when mqprio is
>> in use. ->preclassify() could also be added to other mq qdiscs later
>> on: f.e. most classful qdiscs first check major/minor numbers of
>> skb->priority before actually consulting a more complex classifier.
>>
>> For mqprio case today, a filter has to be attached to each txq qdisc
>> to have all traffic hit the filter. Since ->preclassify() is currently
>> only used by mqprio, the __dev_queue_xmit() fast path is guarded by
>> a generic, hidden Kconfig option (NET_CLS_PRECLASSIFY) that is only
>> selected by mqprio, otherwise it defaults to off. Also, the Qdisc
>> structure size will stay the same, we move __parent, used by cbq only
>> into a write-mostly hole. If actions are enabled, __parent is written
>> on every enqueue, and only read, rewritten in reshape_fail() phase.
>> Therefore, this place in the read-mostly cacheline could be used by
>> preclassify, which is written only once.
>
> I don't like this approach. Ideally, qdisc layer should be totally
> on top of tx queues, which means tx queue selection should happen
> after dequeue. I looked at this before, the change is not trivial at all
> given the fact that qdisc ties too much with tx queue probably due to
> historical reasons, especially the tx softirq part. But that is really a
> long-term solution for me.
>
> I have no big objection for this as a short-term solution, however,
> once we add these filters before enqueue, we can't remove them any
> more. We really need to think twice about it.
>
> Jamal, do you have any better idea?

Sorry for the top quote:

Given the rcu-fication of classifiers I believe the idea will mostly work;
expect users will go nuts sticking all kinds of classifiers and actions
that won't work (example, I don't think connmark action would work nicely
here). Could we strive to do proper offload ala switchdev?

The comment on the patch on reshape_fail + __parent: for the record, that
is an extremely useful feature (allows an inner qdisc to provide an
opportunity for a classful parent qdisc to reclassify and therefore
reschedule). Yes, CBQ is the only user - but maybe if it was properly
documented more schedulers could put it to good use.

cheers,
jamal
Re: [PATCH net-next v2] net: Add table id from route lookup to route response
On 09/02/15 at 01:16pm, David Ahern wrote:
> IPv4 ABI has the table hardcoded as RT_TABLE_MAIN regardless of the table
> hit for the route lookup. Add the table using a new attribute,
> RTA_TABLE_LOOKUP, to maintain the ABI yet return the right table id.
>
> Signed-off-by: David Ahern
> ---
>
> Thomas: Something like this?
>
> The current ABI is returning wrong data in some cases; that seems worse
> to me than breaking the ABI.

Another option is to introduce a new flag bundled with RTM_GETROUTE
which fixes RTM_GETROUTE altogether and makes it return the actual
route instead of a simulated cache entry.
Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response
On 09/02/15 at 03:43pm, Andy Gospodarek wrote:
> On Wed, Sep 02, 2015 at 09:08:36PM +0200, Thomas Graf wrote:
> > This behaviour comes back from when we still had the IPv4 routing cache
> > which was flat.
>
> So before the routing cache was removed, was the response always
> RTA_TABLE_MAIN since there was no way to indicate which table may have
> route if it came from the cache?

Yes, from that perspective, get and list are very different in
behaviour. Again, I'm not against including this information but
we can't break compatibility.
Re: [PATCH] x86: Wire up 32-bit direct socket calls
On 09/02/2015 02:48 AM, Geert Uytterhoeven wrote:
>
> Should all other architectures follow suit?
> Or should we follow the s390 approach:
>

It is up to the maintainer(s), largely dependent on how likely you are
going to want to support this in your libc, but in general, socketcall
is an abomination which there is no reason not to bypass.

So follow suit unless you have a strong reason not to.

	-hpa
[PATCH net-next v2] net: Add table id from route lookup to route response
IPv4 ABI has the table hardcoded as RT_TABLE_MAIN regardless of the table
hit for the route lookup. Add the table using a new attribute,
RTA_TABLE_LOOKUP, to maintain the ABI yet return the right table id.

Signed-off-by: David Ahern
---

Thomas: Something like this?

The current ABI is returning wrong data in some cases; that seems worse
to me than breaking the ABI.

 include/uapi/linux/rtnetlink.h | 1 +
 net/ipv4/route.c               | 5 +++++
 net/ipv6/route.c               | 4 ++++
 3 files changed, 10 insertions(+)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 702024769c74..5add1468350a 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -310,6 +310,7 @@ enum rtattr_type_t {
 	RTA_PREF,
 	RTA_ENCAP_TYPE,
 	RTA_ENCAP,
+	RTA_TABLE_LOOKUP, /* table hit for fib lookup */
 	__RTA_MAX
 };
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 92acc95b7578..95454c368e66 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2328,6 +2328,11 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 	r->rtm_table	= RT_TABLE_MAIN;
 	if (nla_put_u32(skb, RTA_TABLE, RT_TABLE_MAIN))
 		goto nla_put_failure;
+
+	if (rt->rt_table_id && rt->rt_table_id != RT_TABLE_MAIN &&
+	    nla_put_u32(skb, RTA_TABLE_LOOKUP, rt->rt_table_id))
+		goto nla_put_failure;
+
 	r->rtm_type	= rt->rt_type;
 	r->rtm_scope	= RT_SCOPE_UNIVERSE;
 	r->rtm_protocol = RTPROT_UNSPEC;
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f45cac6f8356..3c5d3a50bb7b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2922,6 +2922,10 @@ static int rt6_fill_node(struct net *net,
 	rtm->rtm_table = table;
 	if (nla_put_u32(skb, RTA_TABLE, table))
 		goto nla_put_failure;
+
+	if (table && nla_put_u32(skb, RTA_TABLE_LOOKUP, table))
+		goto nla_put_failure;
+
 	if (rt->rt6i_flags & RTF_REJECT) {
 		switch (rt->dst.error) {
 		case -EINVAL:
-- 
1.9.1
Re: [PATCH net-next] net: Support ip route get via given table
On 9/2/15 1:38 PM, Thomas Graf wrote:
> On 09/02/15 at 01:22pm, David Ahern wrote:
>> On 9/2/15 1:12 PM, Thomas Graf wrote:
>>> On 09/02/15 at 12:03pm, David Ahern wrote:
>>>> Add support for 'ip [-6] route get table X' where the user wants to
>>>> force the FIB lookup from a given table.
>>>>
>>>> Signed-off-by: David Ahern
>>>
>>> Will you use this outside of 'ip route get' as well? If so, how? I'm
>>> asking because you propose to add the check and new behaviour to bypass
>>> the routing rules to the routing fastpath, wouldn't it be better to
>>> handle this in inet_rtm_getroute()?
>>
>> The way IPv6 code is structured it seemed more appropriate to pass in a
>> table id as part of the flow. I made IPv4 consistent with that approach.
>
> The question is: Are you planning to use the new table_id in flowi in
> the actual datapath as well? It seems entirely wrong to add weight to
> the fast path for a control plane feature.

No plans at the moment.
Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response
On Wed, Sep 02, 2015 at 09:08:36PM +0200, Thomas Graf wrote:
> On 09/02/15 at 12:51pm, David Ahern wrote:
> > On 9/2/15 12:49 PM, David Miller wrote:
> > >From: Thomas Graf
> > >Date: Wed, 2 Sep 2015 20:43:46 +0200
> > >
> > >>On 09/02/15 at 09:40am, David Ahern wrote:
> > >>>rt_fill_info which is called for 'route get' requests hardcodes the
> > >>>table id as RT_TABLE_MAIN which is not correct when multiple tables
> > >>>are used. Use the newly added table id in the rtable to send back
> > >>>the correct table.
> > >>>
> > >>>Signed-off-by: David Ahern
> > >>
> > >>What RTM_GETROUTE returns is not the actual route but a description
> > >>of the routing decision which is why table id, scope, protocol, and
> > >>prefix length are hardcoded. This is indicated by the RTM_F_CLONED
> > >>flag. What you propose would break userspace ABI.
> > >
> > >Agreed, I don't think we can do this.
> > >
> >
> > Doesn't the table used to come up with the decision matter for IPv4? ie.,
> > hardcoding to MAIN is misleading when there is absolutely no way the
> > decision comes from that table. IPv6 already returns the table id.
> >
> > Or is your response that it breaks ABI and hence not going to fix.
>
> This behaviour comes back from when we still had the IPv4 routing cache
> which was flat.

So before the routing cache was removed, was the response always
RTA_TABLE_MAIN since there was no way to indicate which table may have
route if it came from the cache?
Re: [PATCH net-next] net: Support ip route get via given table
On 09/02/15 at 01:22pm, David Ahern wrote:
> On 9/2/15 1:12 PM, Thomas Graf wrote:
> >On 09/02/15 at 12:03pm, David Ahern wrote:
> >>Add support for 'ip [-6] route get table X' where the user wants to
> >>force the FIB lookup from a given table.
> >>
> >>Signed-off-by: David Ahern
> >
> >Will you use this outside of 'ip route get' as well? If so, how? I'm
> >asking because you propose to add the check and new behaviour to bypass
> >the routing rules to the routing fastpath, wouldn't it be better to
> >handle this in inet_rtm_getroute()?
> >
>
> The way IPv6 code is structured it seemed more appropriate to pass in a
> table id as part of the flow. I made IPv4 consistent with that approach.

The question is: Are you planning to use the new table_id in flowi in
the actual datapath as well? It seems entirely wrong to add weight to
the fast path for a control plane feature.
Re: [PATCH net-next] net: Support ip route get via given table
On 9/2/15 1:12 PM, Thomas Graf wrote: On 09/02/15 at 12:03pm, David Ahern wrote: Add support for 'ip [-6] route get table X' where the user wants to force the FIB lookup from a given table. Signed-off-by: David Ahern Will you use this outside of 'ip route get' as well? If so, how? I'm asking because you propose to add the check and new behaviour to bypass the routing rules to the routing fastpath, wouldn't it be better to handle this in inet_rtm_getroute()? The way IPv6 code is structured it seemed more appropriate to pass in a table id as part of the flow. I made IPv4 consistent with that approach. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 2/2] sctp: add routing output fallback
Commit 0ca50d12fe46 added a restriction that the address must belong to the output interface, so that sctp will use the right interface even when using secondary addresses. But it breaks IPVS setups, in which people commonly attach VIP addresses to the loopback interface on real servers. Attaching to the interface actually in use is preferred, but this is a very common setup that used to work. This patch therefore saves the first good routing result, even if it would go out through an interface that doesn't have that address. If no better hit is found, that result is used. This effectively restores the original behavior if no better interface could be found. Fixes: 0ca50d12fe46 ("sctp: fix src address selection if using secondary addresses") Signed-off-by: Marcelo Ricardo Leitner --- net/sctp/protocol.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c index 4abf94d4cce769371260b42d13c38dbe5776c809..b7143337e4fa025fdb473732fdc064503e731dd4 100644 --- a/net/sctp/protocol.c +++ b/net/sctp/protocol.c @@ -506,16 +506,22 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr, if (IS_ERR(rt)) continue; + if (!dst) + dst = &rt->dst; + /* Ensure the src address belongs to the output * interface. */ odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr, false); if (!odev || odev->ifindex != fl4->flowi4_oif) { - dst_release(&rt->dst); + if (&rt->dst != dst) + dst_release(&rt->dst); continue; } + if (dst != &rt->dst) + dst_release(dst); dst = &rt->dst; break; } -- 2.4.3
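The selection and reference handling in this patch can be modelled in userspace. The following is only a sketch under stated assumptions: struct dst_model, dst_hold(), dst_put() and pick_dst() are hypothetical stand-ins, not kernel APIs, and the two booleans replace the real route lookup and the __ip_dev_find() check. It shows the first usable result being kept as a fallback, a later exact match replacing it, and every skipped entry being released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a routing result with a reference count. */
struct dst_model {
	int refcnt;
	bool lookup_ok;   /* models a successful route lookup */
	bool addr_on_oif; /* models "src address belongs to the output interface" */
};

static void dst_hold(struct dst_model *d)
{
	d->refcnt++;
}

static void dst_put(struct dst_model *d)
{
	assert(d->refcnt > 0);
	d->refcnt--;
}

/*
 * Walk the candidates in order. The first usable result is remembered as
 * a fallback; a later candidate whose address is on the output interface
 * wins and the fallback is released. Skipped candidates are released so
 * that at most one reference is still held on return.
 */
static struct dst_model *pick_dst(struct dst_model *cand, size_t n)
{
	struct dst_model *dst = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		struct dst_model *rt = &cand[i];

		if (!rt->lookup_ok)
			continue;
		dst_hold(rt);		/* a successful lookup hands back a held entry */

		if (!dst)
			dst = rt;	/* first good result becomes the fallback */

		if (!rt->addr_on_oif) {
			if (rt != dst)
				dst_put(rt);	/* skipped and not the fallback */
			continue;
		}

		if (dst != rt)
			dst_put(dst);	/* exact match found, drop the fallback */
		dst = rt;
		break;
	}
	return dst;
}
```

With candidates where only the last one has its address on the output interface, pick_dst() returns that last one and leaves every other count at zero. With no exact match it returns the first usable candidate, which is the fallback behaviour the commit message describes.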
[PATCH net 1/2] sctp: fix dst leak
Commit 0ca50d12fe46 failed to release the reference to dst entries that it decided to skip. Fixes: 0ca50d12fe46 ("sctp: fix src address selection if using secondary addresses") Signed-off-by: Marcelo Ricardo Leitner --- net/sctp/protocol.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c index 4345790ad3266c353eeac5398593c2a9ce4effda..4abf94d4cce769371260b42d13c38dbe5776c809 100644 --- a/net/sctp/protocol.c +++ b/net/sctp/protocol.c @@ -511,8 +511,10 @@ static void sctp_v4_get_dst(struct sctp_transport *t, union sctp_addr *saddr, */ odev = __ip_dev_find(sock_net(sk), laddr->a.v4.sin_addr.s_addr, false); - if (!odev || odev->ifindex != fl4->flowi4_oif) + if (!odev || odev->ifindex != fl4->flowi4_oif) { + dst_release(&rt->dst); continue; + } dst = &rt->dst; break; -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 0/2] couple of sctp fixes for 0ca50d12fe46
These are two fixes for sctp after my patch 0ca50d12fe46 ("sctp: fix src address selection if using secondary addresses"). The first fixes a dst leak on the entries it decided to skip. The second adds the fallback on src selection that Vlad had asked about. Unfortunately a lot of ipvs setups rely on the old behavior and I don't see a better fix for it. Please consider both for the -stable tree. Thanks! Marcelo Ricardo Leitner (2): sctp: fix dst leak sctp: add routing output fallback net/sctp/protocol.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) -- 2.4.3
Re: [PATCH net-next] net: Support ip route get via given table
On 09/02/15 at 12:03pm, David Ahern wrote: > Add support for 'ip [-6] route get table X' where the user wants to > force the FIB lookup from a given table. > > Signed-off-by: David Ahern Will you use this outside of 'ip route get' as well? If so, how? I'm asking because you propose to add the check and new behaviour to bypass the routing rules to the routing fastpath, wouldn't it be better to handle this in inet_rtm_getroute()?
Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response
On 09/02/15 at 12:51pm, David Ahern wrote: > On 9/2/15 12:49 PM, David Miller wrote: > >From: Thomas Graf > >Date: Wed, 2 Sep 2015 20:43:46 +0200 > > > >>On 09/02/15 at 09:40am, David Ahern wrote: > >>>rt_fill_info which is called for 'route get' requests hardcodes the > >>>table id as RT_TABLE_MAIN which is not correct when multiple tables > >>>are used. Use the newly added table id in the rtable to send back > >>>the correct table. > >>> > >>>Signed-off-by: David Ahern > >> > >>What RTM_GETROUTE returns is not the actual route but a description > >>of the routing decision which is why table id, scope, protocol, and > >>prefix length are hardcoded. This is indicated by the RTM_F_CLONED > >>flag. What you propose would break userspace ABI. > > > >Agreed, I don't think we can do this. > > > > Doesn't the table used to come up with the decision matter for IPv4? ie., > hardcoding to MAIN is misleading when there is absolutely no way the > decision comes from that table. IPv6 already returns the table id. > > Or is your response that it breaks ABI and hence not going to fix. This behaviour comes back from when we still had the IPv4 routing cache which was flat. I'm not against exposing the table id but you have to use a new attribute for it. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next] net: Support ip route get via given table
Add support for 'ip [-6] route get table X' where the user wants to force the FIB lookup from a given table. Signed-off-by: David Ahern --- include/net/flow.h | 4 include/net/ip_fib.h| 15 +++ net/ipv4/fib_frontend.c | 2 ++ net/ipv4/route.c| 2 ++ net/ipv6/route.c| 24 5 files changed, 47 insertions(+) diff --git a/include/net/flow.h b/include/net/flow.h index acd6a096250e..910f2dcaab78 100644 --- a/include/net/flow.h +++ b/include/net/flow.h @@ -36,6 +36,7 @@ struct flowi_common { #define FLOWI_FLAG_KNOWN_NH0x02 #define FLOWI_FLAG_VRFSRC 0x04 __u32 flowic_secid; + __u32 flowic_table_id; struct flowi_tunnel flowic_tun_key; }; @@ -74,6 +75,7 @@ struct flowi4 { #define flowi4_flags __fl_common.flowic_flags #define flowi4_secid __fl_common.flowic_secid #define flowi4_tun_key __fl_common.flowic_tun_key +#define flowi4_table_id__fl_common.flowic_table_id /* (saddr,daddr) must be grouped, same order as in IP header */ __be32 saddr; @@ -103,6 +105,7 @@ static inline void flowi4_init_output(struct flowi4 *fl4, int oif, fl4->flowi4_proto = proto; fl4->flowi4_flags = flags; fl4->flowi4_secid = 0; + fl4->flowi4_table_id = 0; fl4->flowi4_tun_key.tun_id = 0; fl4->daddr = daddr; fl4->saddr = saddr; @@ -132,6 +135,7 @@ struct flowi6 { #define flowi6_flags __fl_common.flowic_flags #define flowi6_secid __fl_common.flowic_secid #define flowi6_tun_key __fl_common.flowic_tun_key +#define flowi6_table_id__fl_common.flowic_table_id struct in6_addr daddr; struct in6_addr saddr; __be32 flowlabel; diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index a37d0432bebd..c7024094726d 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -233,6 +233,9 @@ static inline int fib_lookup(struct net *net, const struct flowi4 *flp, struct fib_table *tb; int err = -ENETUNREACH; + if (flp->flowi4_table_id && flp->flowi4_table_id != RT_TABLE_MAIN) + return -ENETUNREACH; + rcu_read_lock(); tb = fib_get_table(net, RT_TABLE_MAIN); @@ -261,6 +264,18 @@ static inline int fib_lookup(struct net 
*net, struct flowi4 *flp, int err; flags |= FIB_LOOKUP_NOREF; + if (flp->flowi4_table_id) { + err = -ENETUNREACH; + + rcu_read_lock(); + tb = fib_get_table(net, flp->flowi4_table_id); + if (tb) + err = fib_table_lookup(tb, flp, res, flags); + rcu_read_unlock(); + + return err; + } + if (net->ipv4.fib_has_custom_rules) return __fib_lookup(net, flp, res, flags); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 6fcbd215cdbc..65519445ca0d 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -129,6 +129,7 @@ struct fib_table *fib_get_table(struct net *net, u32 id) } return NULL; } +EXPORT_SYMBOL(fib_get_table); #endif /* CONFIG_IP_MULTIPLE_TABLES */ static void fib_replace_table(struct net *net, struct fib_table *old, @@ -339,6 +340,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst, fl4.saddr = dst; fl4.flowi4_tos = tos; fl4.flowi4_scope = RT_SCOPE_UNIVERSE; + fl4.flowi4_table_id = 0; fl4.flowi4_tun_key.tun_id = 0; no_addr = idev->ifa_list == NULL; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 5f4a5565ad8b..b3e5ee821450 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2476,6 +2476,8 @@ static int inet_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh) fl4.flowi4_tos = rtm->rtm_tos; fl4.flowi4_oif = tb[RTA_OIF] ? 
nla_get_u32(tb[RTA_OIF]) : 0; fl4.flowi4_mark = mark; + if (tb[RTA_TABLE]) + fl4.flowi4_table_id = nla_get_u32(tb[RTA_TABLE]); if (iif) { struct net_device *dev; diff --git a/net/ipv6/route.c b/net/ipv6/route.c index f45cac6f8356..f605c8ea5a16 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -61,6 +61,7 @@ #include #include #include +#include #include @@ -1142,6 +1143,20 @@ static struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table, } } +static struct dst_entry *ip6_route_table(struct net *net, int flags, +struct flowi6 *fl6, +pol_lookup_t lookup) +{ + struct rt6_info *rt = NULL; + struct fib6_table *table; + + table = fib6_get_table(net, fl6->flowi6_table_id); + if (table) + rt = lookup(net, table, fl6, FIB_LOOKUP_NOREF | flags); + + return (struct dst_entr
[PATCH nf-next] netfilter: nf_dup{4,6}: fix build error when nf_conntrack disabled
While testing various Kconfig options on another issue, I found that the following one triggers as well on allmodconfig and nf_conntrack disabled: net/ipv4/netfilter/nf_dup_ipv4.c: In function ‘nf_dup_ipv4’: net/ipv4/netfilter/nf_dup_ipv4.c:72:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function) if (this_cpu_read(nf_skb_duplicated)) [...] net/ipv6/netfilter/nf_dup_ipv6.c: In function ‘nf_dup_ipv6’: net/ipv6/netfilter/nf_dup_ipv6.c:66:20: error: ‘nf_skb_duplicated’ undeclared (first use in this function) if (this_cpu_read(nf_skb_duplicated)) Fix it by including directly the header where it is defined. Fixes: bbde9fc1824a ("netfilter: factor out packet duplication for IPv4/IPv6") Signed-off-by: Daniel Borkmann --- [ Don't know whether Dave wants to take it directly, or if it should go via nf-next. I have one more build fix coming later tonight. Also applies to net-next. ] net/ipv4/netfilter/nf_dup_ipv4.c | 1 + net/ipv6/netfilter/nf_dup_ipv6.c | 1 + 2 files changed, 2 insertions(+) diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c index b5bb375..2d79e6e 100644 --- a/net/ipv4/netfilter/nf_dup_ipv4.c +++ b/net/ipv4/netfilter/nf_dup_ipv4.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include diff --git a/net/ipv6/netfilter/nf_dup_ipv6.c b/net/ipv6/netfilter/nf_dup_ipv6.c index c5c87e9..c8ab626 100644 --- a/net/ipv6/netfilter/nf_dup_ipv6.c +++ b/net/ipv6/netfilter/nf_dup_ipv6.c @@ -11,6 +11,7 @@ #include #include #include +#include #include #include #include -- 1.9.3 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response
On 9/2/15 12:49 PM, David Miller wrote: From: Thomas Graf Date: Wed, 2 Sep 2015 20:43:46 +0200 On 09/02/15 at 09:40am, David Ahern wrote: rt_fill_info which is called for 'route get' requests hardcodes the table id as RT_TABLE_MAIN which is not correct when multiple tables are used. Use the newly added table id in the rtable to send back the correct table. Signed-off-by: David Ahern What RTM_GETROUTE returns is not the actual route but a description of the routing decision which is why table id, scope, protocol, and prefix length are hardcoded. This is indicated by the RTM_F_CLONED flag. What you propose would break userspace ABI. Agreed, I don't think we can do this. Doesn't the table used to come up with the decision matter for IPv4? ie., hardcoding to MAIN is misleading when there is absolutely no way the decision comes from that table. IPv6 already returns the table id. Or is your response that it breaks ABI and hence not going to fix. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response
From: Thomas Graf Date: Wed, 2 Sep 2015 20:43:46 +0200 > On 09/02/15 at 09:40am, David Ahern wrote: >> rt_fill_info which is called for 'route get' requests hardcodes the >> table id as RT_TABLE_MAIN which is not correct when multiple tables >> are used. Use the newly added table id in the rtable to send back >> the correct table. >> >> Signed-off-by: David Ahern > > What RTM_GETROUTE returns is not the actual route but a description > of the routing decision which is why table id, scope, protocol, and > prefix length are hardcoded. This is indicated by the RTM_F_CLONED > flag. What you propose would break userspace ABI. Agreed, I don't think we can do this.
Re: [net-next PATCH] net: ipv6: use common fib_default_rule_pref
On 09/02/15 at 11:34am, David Miller wrote: > From: Phil Sutter > Date: Wed, 2 Sep 2015 15:03:12 +0200 > > > This switches IPv6 policy routing to use the shared > > fib_default_rule_pref() function of IPv4 and DECnet. It is also used in > > multicast routing for IPv4 as well as IPv6. > > > > The motivation for this patch is a complaint about iproute2 behaving > > inconsistent between IPv4 and IPv6 when adding policy rules: Formerly, > > IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the > > assigned priority value was decreased with each rule added. > > > > Signed-off-by: Phil Sutter > > All ->default_pref() methods are therefore going to be set to the > default, so just kill off the method entirely and call > fib_default_rule_pref() directly. How strict are we with regard to compatibility here? New IPv6 rules with no pref specified currently get appended at the end of the list whereas this would start inserting at the head. I'm absolutely in favour of the new behaviour but this could break scripts which do not have proper prefs specified. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 net-next] xen-netback: add support for multicast control
From: Paul Durrant Date: Wed, 2 Sep 2015 17:58:36 +0100 > Xen's PV network protocol includes messages to add/remove ethernet > multicast addresses to/from a filter list in the backend. This allows > the frontend to request the backend only forward multicast packets > which are of interest thus preventing unnecessary noise on the shared > ring. > > The canonical netif header in git://xenbits.xen.org/xen.git specifies > the message format (two more XEN_NETIF_EXTRA_TYPEs) so the minimal > necessary changes have been pulled into include/xen/interface/io/netif.h. > > To prevent the frontend from extending the multicast filter list > arbitrarily a limit (XEN_NETBK_MCAST_MAX) has been set to 64 entries. > This limit is not specified by the protocol and so may change in future. > If the limit is reached then the next XEN_NETIF_EXTRA_TYPE_MCAST_ADD > sent by the frontend will be failed with NETIF_RSP_ERROR. > > Signed-off-by: Paul Durrant Applied. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next 3/3] net: Add table id from route lookup to route response
On 09/02/15 at 09:40am, David Ahern wrote: > rt_fill_info which is called for 'route get' requests hardcodes the > table id as RT_TABLE_MAIN which is not correct when multiple tables > are used. Use the newly added table id in the rtable to send back > the correct table. > > Signed-off-by: David Ahern What RTM_GETROUTE returns is not the actual route but a description of the routing decision which is why table id, scope, protocol, and prefix length are hardcoded. This is indicated by the RTM_F_CLONED flag. What you propose would break userspace ABI.
Re: [PATCH] flow_dissector: Use 'const' where possible.
From: Jiri Pirko Date: Wed, 2 Sep 2015 18:39:34 +0200 > Wed, Sep 02, 2015 at 06:33:34AM CEST, t...@herbertland.com wrote: >>> @@ -19,14 +19,14 @@ >>> #include >>> #include >>> >>> -static bool skb_flow_dissector_uses_key(struct flow_dissector >>> *flow_dissector, >>> - enum flow_dissector_key_id key_id) >>> +static bool dissector_uses_key(const struct flow_dissector *flow_dissector, >>> + enum flow_dissector_key_id key_id) >>> { >>> return flow_dissector->used_keys & (1 << key_id); >>> } >>> >>> -static void skb_flow_dissector_set_key(struct flow_dissector >>> *flow_dissector, >>> - enum flow_dissector_key_id key_id) >>> +static void dissector_set_key(struct flow_dissector *flow_dissector, >>> + enum flow_dissector_key_id key_id) >>> { >>> flow_dissector->used_keys |= (1 << key_id); >>> } >>> @@ -51,20 +51,20 @@ void skb_flow_dissector_init(struct flow_dissector >>> *flow_dissector, >> >>I suppose we should drop skb_ from skb_flow_dissector_init and >>skb_flow_dissector_target as well. > > I like to have "namespaces" by function prefixes. Code is easier to read > then... I completely disagree. These are static, local functions, they can use whatever names they want and the shorter the better. Long function names drive me absolutely insane and make keeping the argument lists under ~80 columns a royal pain in the ass. So I will continue to trim function names down to something more reasonable when they are static and local to a source file. And I encourage you to do so as well.
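For readers following the quoted hunk, here is a standalone sketch of the two helpers with the shortened names. The struct and enum are trimmed-down stand-ins assumed for illustration (the real definitions live in the kernel headers and carry more fields and keys), and the asserts are additions of this sketch. The point is that the query helper can take a const pointer because it only reads used_keys, while the setter cannot.

```c
#include <assert.h>
#include <stdbool.h>

/* Trimmed-down stand-ins, assumed for illustration only. */
enum flow_dissector_key_id {
	FLOW_DISSECTOR_KEY_CONTROL,
	FLOW_DISSECTOR_KEY_BASIC,
	FLOW_DISSECTOR_KEY_PORTS,
	FLOW_DISSECTOR_KEY_MAX,
};

struct flow_dissector {
	unsigned int used_keys; /* one bit per key id */
};

/* Read-only query: the dissector is not modified, so it is const. */
static bool dissector_uses_key(const struct flow_dissector *flow_dissector,
			       enum flow_dissector_key_id key_id)
{
	assert(key_id < FLOW_DISSECTOR_KEY_MAX);
	return flow_dissector->used_keys & (1u << key_id);
}

/* Setter: writes used_keys, so the pointer stays non-const. */
static void dissector_set_key(struct flow_dissector *flow_dissector,
			      enum flow_dissector_key_id key_id)
{
	assert(key_id < FLOW_DISSECTOR_KEY_MAX);
	flow_dissector->used_keys |= (1u << key_id);
}
```

Passing a const struct flow_dissector pointer to dissector_set_key() would be rejected by the compiler, which is the kind of accidental write the const annotation is meant to catch.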
Re: [PATCH] bgmac: Update fixed_phy_register()
From: Fabio Estevam Date: Wed, 2 Sep 2015 13:25:59 -0300 > From: Fabio Estevam > > Commit a5597008dbc2 ("phy: fixed_phy: Add gpio to determine link up/down.") > added a new argument to fixed_phy_register(), but missed to update bgmac > driver, causing the following build failure: > > drivers/net/ethernet/broadcom/bgmac.c:1450:2: error: too few arguments to > function 'fixed_phy_register' > > Add the missing argument. > > Reported-by: Mark Brown > Signed-off-by: Fabio Estevam Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next PATCH] net: ipv6: use common fib_default_rule_pref
From: Phil Sutter Date: Wed, 2 Sep 2015 15:03:12 +0200 > This switches IPv6 policy routing to use the shared > fib_default_rule_pref() function of IPv4 and DECnet. It is also used in > multicast routing for IPv4 as well as IPv6. > > The motivation for this patch is a complaint about iproute2 behaving > inconsistent between IPv4 and IPv6 when adding policy rules: Formerly, > IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the > assigned priority value was decreased with each rule added. > > Signed-off-by: Phil Sutter All ->default_pref() methods are therefore going to be set to the default, so just kill off the method entirely and call fib_default_rule_pref() directly. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
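To make the "priority decreases with each rule added" behaviour concrete, here is a small userspace model. It is a sketch under assumptions: default_rule_pref() is a hypothetical function operating on a plain array of priorities in list order, and the rule it applies (take the second rule's priority and subtract one) is one way to obtain the described IPv4 behaviour, not a copy of the kernel helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * prefs[] holds the priorities of the installed rules in list order
 * (ascending). If there is a rule after the first one and its priority
 * is non-zero, the new rule gets that priority minus one, so it lands
 * just in front of it. Otherwise fall back to 0.
 */
static uint32_t default_rule_pref(const uint32_t *prefs, size_t n)
{
	assert(prefs != NULL || n == 0);

	if (n > 1 && prefs[1] != 0)
		return prefs[1] - 1;
	return 0;
}
```

Starting from priorities 0, 32766 and 32767, the model picks 32765 for the first rule added without an explicit pref and 32764 for the next one, whereas a fixed default would give every such rule the same priority.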
Re: [PATCH net] sock, diag: fix panic in sock_diag_put_filterinfo
From: Daniel Borkmann Date: Wed, 2 Sep 2015 14:00:36 +0200 > diag socket's sock_diag_put_filterinfo() dumps classic BPF programs > upon request to user space (ss -0 -b). However, native eBPF programs > attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method: > > Their orig_prog is always NULL. However, sock_diag_put_filterinfo() > unconditionally tries to access its filter length resp. wants to copy > the filter insns from there. Internal cBPF to eBPF transformations > attached to sockets don't have this issue, as orig_prog state is kept. > > It's currently only used by packet sockets. If we would want to add > native eBPF support in the future, this needs to be done through > a different attribute than PACKET_DIAG_FILTER to not confuse possible > user space disassemblers that work on diag data. > > Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to > sockets") > Signed-off-by: Daniel Borkmann Applied and queued up for -stable, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net: eth: altera: fix napi poll_list corruption
From: Atsushi Nemoto Date: Wed, 2 Sep 2015 17:49:29 +0900 > tse_poll() calls __napi_complete() with irq enabled. This leads napi > poll_list corruption and may stop all napi drivers working. > Use napi_complete() instead of __napi_complete(). > > Signed-off-by: Atsushi Nemoto Two lines below this change you are disabling interrupts anyways, so I would suggest just moving the spin_lock_irqsave() before the napi_gro_flush() to fix this. Many of the checks done by napi_complete_done() (invoked by napi_complete()) are completely redundant in this context. For example, the direct __napi_complete() call is a really nice optimization because we know we are on the poll list and therefore it is not empty. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload
On Wed, Sep 2, 2015 at 9:46 AM, Rustad, Mark D wrote: >> On Sep 1, 2015, at 8:17 PM, Tom Herbert wrote: >> >> I suspect this is not UDP-encapsulation specific, will it work with >> TCP/IP/IP, TCP/IP/GRE etc.? > > It could do more, but this is what has been tested up to this point. > Well, please test those other encapsulations too! It's nice and all if they get the benefit, but it's really bad news if these changes were to screw them up (i.e. you don't want users of the GRE, IPIP to find out that they're now broken). >> Isn't there anyway the ixgbe could just be made to NETIF_HW_CSUM? That >> would be so much more straightforward and support nearly all use cases >> without needing to jump through all these hoops. > > Well, the description says: > > --- > Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. > It means that device can fill TCP/UDP-like checksum anywhere in the packets > whatever headers there might be. > --- > > The device can't do whatever, wherever. There is always a limit to the offset > to the inner headers that can be handled, for instance. > If the device does NETIF_F_HW_CSUM then inner/outer headers are irrelevant at least in the non-GSO case. All the device needs to do is compute the checksum from start and write the answer at the given offset. No protocol awareness needed in the device, no need to parse headers on transmit. I have the same complaint that ixgbe requires a bunch of driver logic to offload VXLAN checksum unnecessarily instead of just providing CHECKSUM_COMPLETE which would work with any encapsulation protocol, require no encapsulation awareness in the device, and should be a much simpler driver implementation. So my input to NIC vendors will continue to be they provide general protocol agnostic solutions and *stop* perpetuating these narrow protocol specific and unnecessarily complicated solutions. 
If you don't believe me, see the similar longstanding comments in skbuff.h about NIC capabilities and checksums and what choices vendors make. Tom > -- > Mark Rustad, Networking Division, Intel Corporation >
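The "compute the checksum from start and write the answer at the given offset" model can be sketched in userspace as follows. This is an illustration under assumptions, not driver code: hw_csum_model(), ones_sum() and csum_fold32() are hypothetical names, the checksum is the 16-bit one's-complement Internet checksum, and the field at csum_start + csum_offset is assumed to already hold whatever seed the stack placed there (zero in the simplest case). Nothing in the function parses headers, which is the protocol-agnostic property being argued for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit running sum into 16 bits (one's-complement addition). */
static uint16_t csum_fold32(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* One's-complement sum of buf[0..len), read as big-endian 16-bit words. */
static uint16_t ones_sum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
	if (len & 1)
		sum += (uint32_t)buf[len - 1] << 8;	/* pad odd byte with zero */
	return csum_fold32(sum);
}

/*
 * Hypothetical model of a protocol-agnostic transmit checksum: sum
 * everything from csum_start to the end of the packet (including the
 * current contents of the checksum field) and store the complement at
 * csum_start + csum_offset. Bytes before csum_start are never examined.
 */
static void hw_csum_model(uint8_t *pkt, size_t len,
			  size_t csum_start, size_t csum_offset)
{
	uint16_t csum;

	assert(csum_start + csum_offset + 2 <= len);

	csum = (uint16_t)~ones_sum(pkt + csum_start, len - csum_start);
	pkt[csum_start + csum_offset] = (uint8_t)(csum >> 8);
	pkt[csum_start + csum_offset + 1] = (uint8_t)(csum & 0xff);
}
```

When the field starts out as zero, re-summing the covered region after the call yields 0xffff, which is how a receiver validates this kind of checksum. The same routine works whether the covered region is a plain TCP segment or the inner packet of a tunnel, since only the two offsets change.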
[PATCH] ip route: Print table id for 'ip route get'
Table id is not dumped for 'ip route get' requests because the RTM_F_CLONED flag is set in rt_fill_info. Move it out from the check and show user the table id any time it is not MAIN. Example: $ ip ru ls 0: from all lookup local 32765: from all to 10.2.1.0/24 lookup 10 32766: from all lookup main 32767: from all lookup default $ ip route ls default via 10.0.0.254 dev eth0 10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.2 10.2.2.0/24 dev eth2 proto kernel scope link src 10.2.2.2 10.2.3.0/24 dev eth3 proto kernel scope link src 10.2.3.2 10.2.4.0/24 dev eth4 proto kernel scope link src 10.2.4.2 $ ip route ls table 10 10.2.1.0/24 dev eth1 scope link Currently: $ ip route get 10.2.1.240 10.2.1.240 dev eth1 src 10.2.1.2 cache With this patch: $ ip route get 10.2.1.240 10.2.1.240 dev eth1 table 10 src 10.2.1.2 cache Signed-off-by: David Ahern --- ip/iproute.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/ip/iproute.c b/ip/iproute.c index 8f49e6289003..9e6148d68e6c 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -421,9 +421,10 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (tb[RTA_OIF] && filter.oifmask != -1) fprintf(fp, "dev %s ", ll_index_to_name(*(int*)RTA_DATA(tb[RTA_OIF]))); + if ((table != RT_TABLE_MAIN || show_details > 0) && !filter.tb) + fprintf(fp, " table %s ", rtnl_rttable_n2a(table, b1, sizeof(b1))); + if (!(r->rtm_flags&RTM_F_CLONED)) { - if ((table != RT_TABLE_MAIN || show_details > 0) && !filter.tb) - fprintf(fp, " table %s ", rtnl_rttable_n2a(table, b1, sizeof(b1))); if ((r->rtm_protocol != RTPROT_BOOT || show_details > 0) && filter.protocolmask != -1) fprintf(fp, " proto %s ", rtnl_rtprot_n2a(r->rtm_protocol, b1, sizeof(b1))); if ((r->rtm_scope != RT_SCOPE_UNIVERSE || show_details > 0) && filter.scopemask != -1) -- 1.9.1
[PATCH] cfg80211: regulatory: restore proper user alpha2
restore_regulatory_settings() should restore alpha2 as computed in restore_alpha2(), not raw user_alpha2 to behave as described in the comment just above that code. This fixes endless loop of calling CRDA for "00" and "97" countries after resume from suspend on my laptop. Looks like others had the same problem, too: http://ath9k-devel.ath9k.narkive.com/knY5W6St/ath9k-and-crda-messages-in-logs https://bugs.launchpad.net/ubuntu/+source/linux/+bug/899335 https://forum.porteus.org/viewtopic.php?t=4975&p=36436 https://forums.opensuse.org/showthread.php/483356-Authentication-Regulatory-Domain-issues-ath5k-12-2 Signed-off-by: Maciej Szmigiero --- net/wireless/reg.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/wireless/reg.c b/net/wireless/reg.c index 70aef72..7258246 100644 --- a/net/wireless/reg.c +++ b/net/wireless/reg.c @@ -2625,7 +2625,7 @@ static void restore_regulatory_settings(bool reset_user) * settings, user regulatory settings takes precedence. */ if (is_an_alpha2(alpha2)) - regulatory_hint_user(user_alpha2, NL80211_USER_REG_HINT_USER); + regulatory_hint_user(alpha2, NL80211_USER_REG_HINT_USER); spin_lock(&reg_requests_lock); list_splice_tail_init(&tmp_reg_req_list, &reg_requests_list);
Re: [PATCH v2 net-next] xen-netback: add support for multicast control
On Wed, Sep 02, 2015 at 05:58:36PM +0100, Paul Durrant wrote: > Xen's PV network protocol includes messages to add/remove ethernet > multicast addresses to/from a filter list in the backend. This allows > the frontend to request the backend only forward multicast packets > which are of interest thus preventing unnecessary noise on the shared > ring. > > The canonical netif header in git://xenbits.xen.org/xen.git specifies > the message format (two more XEN_NETIF_EXTRA_TYPEs) so the minimal > necessary changes have been pulled into include/xen/interface/io/netif.h. > > To prevent the frontend from extending the multicast filter list > arbitrarily a limit (XEN_NETBK_MCAST_MAX) has been set to 64 entries. > This limit is not specified by the protocol and so may change in future. > If the limit is reached then the next XEN_NETIF_EXTRA_TYPE_MCAST_ADD > sent by the frontend will be failed with NETIF_RSP_ERROR. > > Signed-off-by: Paul Durrant > Cc: Ian Campbell > Cc: Wei Liu Acked-by: Wei Liu -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] flow_dissector: Use 'const' where possible.
On Wed, Sep 2, 2015 at 9:39 AM, Jiri Pirko wrote: > Wed, Sep 02, 2015 at 06:33:34AM CEST, t...@herbertland.com wrote: >>On Tue, Sep 1, 2015 at 9:19 PM, David Miller wrote: >>> >>> Signed-off-by: David S. Miller >>> --- >>> include/linux/skbuff.h| 8 ++--- >>> include/net/flow.h| 8 ++--- >>> net/core/flow_dissector.c | 79 >>> --- >>> 3 files changed, 49 insertions(+), 46 deletions(-) >>> > > > > >>> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c >>> index 345a040..d79699c 100644 >>> --- a/net/core/flow_dissector.c >>> +++ b/net/core/flow_dissector.c >>> @@ -19,14 +19,14 @@ >>> #include >>> #include >>> >>> -static bool skb_flow_dissector_uses_key(struct flow_dissector >>> *flow_dissector, >>> - enum flow_dissector_key_id key_id) >>> +static bool dissector_uses_key(const struct flow_dissector *flow_dissector, >>> + enum flow_dissector_key_id key_id) >>> { >>> return flow_dissector->used_keys & (1 << key_id); >>> } >>> >>> -static void skb_flow_dissector_set_key(struct flow_dissector >>> *flow_dissector, >>> - enum flow_dissector_key_id key_id) >>> +static void dissector_set_key(struct flow_dissector *flow_dissector, >>> + enum flow_dissector_key_id key_id) >>> { >>> flow_dissector->used_keys |= (1 << key_id); >>> } >>> @@ -51,20 +51,20 @@ void skb_flow_dissector_init(struct flow_dissector >>> *flow_dissector, >> >>I suppose we should drop skb_ from skb_flow_dissector_init and >>skb_flow_dissector_target as well. > > I like to have "namespaces" by function prefixes. Code is easier to read > then... Right, these functions now are independent of sk_buff. Conceptually someone could use these for a non-skbuff application-- so it's good design! -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 net-next] xen-netback: add support for multicast control
Xen's PV network protocol includes messages to add/remove ethernet multicast addresses to/from a filter list in the backend. This allows the frontend to request the backend only forward multicast packets which are of interest thus preventing unnecessary noise on the shared ring. The canonical netif header in git://xenbits.xen.org/xen.git specifies the message format (two more XEN_NETIF_EXTRA_TYPEs) so the minimal necessary changes have been pulled into include/xen/interface/io/netif.h. To prevent the frontend from extending the multicast filter list arbitrarily a limit (XEN_NETBK_MCAST_MAX) has been set to 64 entries. This limit is not specified by the protocol and so may change in future. If the limit is reached then the next XEN_NETIF_EXTRA_TYPE_MCAST_ADD sent by the frontend will be failed with NETIF_RSP_ERROR. Signed-off-by: Paul Durrant Cc: Ian Campbell Cc: Wei Liu --- v2: - Fix commit comment - Cosmetic change requested by Wei --- drivers/net/xen-netback/common.h| 15 ++ drivers/net/xen-netback/interface.c | 10 drivers/net/xen-netback/netback.c | 99 +++ drivers/net/xen-netback/xenbus.c| 13 + include/xen/interface/io/netif.h|8 ++- 5 files changed, 144 insertions(+), 1 deletion(-) diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h index c6cb85a..6dc76c1 100644 --- a/drivers/net/xen-netback/common.h +++ b/drivers/net/xen-netback/common.h @@ -210,12 +210,22 @@ enum state_bit_shift { VIF_STATUS_CONNECTED, }; +struct xenvif_mcast_addr { + struct list_head entry; + struct rcu_head rcu; + u8 addr[6]; +}; + +#define XEN_NETBK_MCAST_MAX 64 + struct xenvif { /* Unique identifier for this interface. */ domid_t domid; unsigned int handle; u8 fe_dev_addr[6]; + struct list_head fe_mcast_addr; + unsigned int fe_mcast_count; /* Frontend feature information. */ int gso_mask; @@ -224,6 +234,7 @@ struct xenvif { u8 can_sg:1; u8 ip_csum:1; u8 ipv6_csum:1; + u8 multicast_control:1; /* Is this interface disabled? 
True when backend discovers * frontend is rogue. @@ -341,4 +352,8 @@ void xenvif_skb_zerocopy_prepare(struct xenvif_queue *queue, struct sk_buff *skb); void xenvif_skb_zerocopy_complete(struct xenvif_queue *queue); +/* Multicast control */ +bool xenvif_mcast_match(struct xenvif *vif, const u8 *addr); +void xenvif_mcast_addr_list_free(struct xenvif *vif); + #endif /* __XEN_NETBACK__COMMON_H__ */ diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c index 28577a3..e7bd63e 100644 --- a/drivers/net/xen-netback/interface.c +++ b/drivers/net/xen-netback/interface.c @@ -171,6 +171,13 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev) !xenvif_schedulable(vif)) goto drop; + if (vif->multicast_control && skb->pkt_type == PACKET_MULTICAST) { + struct ethhdr *eth = (struct ethhdr *)skb->data; + + if (!xenvif_mcast_match(vif, eth->h_dest)) + goto drop; + } + cb = XENVIF_RX_CB(skb); cb->expires = jiffies + vif->drain_timeout; @@ -427,6 +434,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid, vif->num_queues = 0; spin_lock_init(&vif->lock); + INIT_LIST_HEAD(&vif->fe_mcast_addr); dev->netdev_ops = &xenvif_netdev_ops; dev->hw_features = NETIF_F_SG | @@ -661,6 +669,8 @@ void xenvif_disconnect(struct xenvif *vif) xenvif_unmap_frontend_rings(queue); } + + xenvif_mcast_addr_list_free(vif); } /* Reverse the relevant parts of xenvif_init_queue(). diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c index 3f44b52..42569b9 100644 --- a/drivers/net/xen-netback/netback.c +++ b/drivers/net/xen-netback/netback.c @@ -1157,6 +1157,80 @@ static bool tx_credit_exceeded(struct xenvif_queue *queue, unsigned size) return false; } +/* No locking is required in xenvif_mcast_add/del() as they are + * only ever invoked from NAPI poll. An RCU list is used because + * xenvif_mcast_match() is called asynchronously, during start_xmit. 
+ */ + +static int xenvif_mcast_add(struct xenvif *vif, const u8 *addr) +{ + struct xenvif_mcast_addr *mcast; + + if (vif->fe_mcast_count == XEN_NETBK_MCAST_MAX) { + if (net_ratelimit()) + netdev_err(vif->dev, + "Too many multicast addresses\n"); + return -ENOSPC; + } + + mcast = kzalloc(sizeof(*mcast), GFP_ATOMIC); + if (!mcast) + return -ENOMEM; + + ether_addr_copy(mcast->addr, addr); + list_add_tail_rcu(&mcast->entry, &vif->fe_mcast_addr); + vif->fe_mcast
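The add/remove/match behaviour described in the commit message can be sketched in plain userspace C as follows. This is only an illustration under stated assumptions: it uses a fixed array instead of the RCU list in the patch, and the names mcast_filter, mcast_add, mcast_del and mcast_match are hypothetical. The 64-entry cap and the -ENOSPC result on overflow follow the patch text.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define MCAST_MAX 64	/* mirrors XEN_NETBK_MCAST_MAX from the patch */
#define MAC_LEN   6

/* Hypothetical array-based filter; the patch itself uses an RCU list. */
struct mcast_filter {
	unsigned char addr[MCAST_MAX][MAC_LEN];
	unsigned int count;
};

/* Return true if addr is on the filter list. */
static bool mcast_match(const struct mcast_filter *f, const unsigned char *addr)
{
	for (unsigned int i = 0; i < f->count; i++)
		if (memcmp(f->addr[i], addr, MAC_LEN) == 0)
			return true;
	return false;
}

/* Append addr, refusing with -ENOSPC once the cap is reached. */
static int mcast_add(struct mcast_filter *f, const unsigned char *addr)
{
	assert(f->count <= MCAST_MAX);
	if (f->count == MCAST_MAX)
		return -ENOSPC;
	memcpy(f->addr[f->count++], addr, MAC_LEN);
	return 0;
}

/* Remove the first entry equal to addr by moving the last entry into its slot. */
static void mcast_del(struct mcast_filter *f, const unsigned char *addr)
{
	for (unsigned int i = 0; i < f->count; i++) {
		if (memcmp(f->addr[i], addr, MAC_LEN) == 0) {
			/* memmove: source and destination coincide when i is last */
			memmove(f->addr[i], f->addr[--f->count], MAC_LEN);
			return;
		}
	}
}
```

In the patch, the match runs in the start_xmit path, so a multicast frame whose destination is not on the list is dropped before it reaches the shared ring.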
Re: [net-next 06/19] ixgbe: Add support for VXLAN RX offloads
> On Sep 1, 2015, at 8:31 PM, Tom Herbert wrote:
>
>> @@ -7240,6 +7267,10 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
>>  		struct ipv6hdr *ipv6;
>>  	} hdr;
>>  	struct tcphdr *th;
>> +	struct sk_buff *skb;
>> +#ifdef CONFIG_IXGBE_VXLAN
>> +	u8 encap = false;
>> +#endif /* CONFIG_IXGBE_VXLAN */
>>  	__be16 vlan_id;
>>
>>  	/* if ring doesn't have a interrupt vector, cannot perform ATR */
>> @@ -7253,16 +7284,36 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
>
> Isn't this a function in the transmit path? What do these changes have
> to do with RX offloads?

Yes, it is in the transmit path. ATR sets up rules for delivering received packets to the correct queue. So this is setting things up for the receive side to work better. New hardware capabilities now make it possible to do this for VXLAN traffic.

--
Mark Rustad, Networking Division, Intel Corporation
Re: [net-next 05/19] ixgbe: Add support for UDP-encapsulated tx checksum offload
> On Sep 1, 2015, at 8:17 PM, Tom Herbert wrote:
>
> I suspect this is not UDP-encapsulation specific, will it work with
> TCP/IP/IP, TCP/IP/GRE etc.?

It could do more, but this is what has been tested up to this point.

> Isn't there anyway the ixgbe could just be made to NETIF_HW_CSUM? That
> would be so much more straightforward and support nearly all use cases
> without needing to jump through all these hoops.

Well, the description says:

---
Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM.
It means that device can fill TCP/UDP-like checksum anywhere in the packets
whatever headers there might be.
---

The device can't do whatever, wherever. There is always a limit to the offset to the inner headers that can be handled, for instance.

--
Mark Rustad, Networking Division, Intel Corporation
Re: ip_rcv_finish() NULL pointer and possibly related Oopses
> Make sure you backported commit > 10e2eb878f3ca07ac2f05fa5ca5e6c4c9174a27a > ("udp: fix dst races with multicast early demux") I just tried the latest CoreOS alpha, which had that patch. Sadly, I saw just as many reboots. Here's a sample of the different types of Oopses I see (I've put the rest up in a gist: https://gist.github.com/fasaxc/d801ced5608f2657abd8): [ 4024.564479] BUG: unable to handle kernel NULL pointer dereference at (null) [ 4024.565452] IP: [< (null)>] (null) [ 4024.565452] PGD 2297067 PUD 2296067 PMD 0 [ 4024.565452] Oops: 0010 [#1] SMP [ 4024.565452] Modules linked in: xt_mac xt_mark veth ip_set_hash_net nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_set ip_set_hash_ip ip_set nfnetlink ipip tunnel4 ip_tunnel ip6table_filter ip6_tables xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter br_netfilter nf_nat nf_conntrack bridge stp llc overlay nls_ascii nls_cp437 vfat fat ext4 crc16 mbcache jbd2 sd_mod crc32c_intel virtio_scsi scsi_mod aesni_intel virtio_net mousedev aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd microcode firmware_class virtio_pci virtio_ring psmouse virtio i2c_piix4 i2c_core acpi_cpufreq button evdev sch_fq_codel ip_tables autofs4 [ 4024.565452] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.6-coreos-r1 #2 [ 4024.565452] Hardware name: Google Google, BIOS Google 01/01/2011 [ 4024.565452] task: 81a154c0 ti: 81a0 task.ti: 81a0 [ 4024.565452] RIP: 0010:[<>] [< (null)>] (null) [ 4024.565452] RSP: 0018:88021fc03c00 EFLAGS: 00010246 [ 4024.565452] RAX: 880003375d00 RBX: 880003375d00 RCX: 0001 [ 4024.565452] RDX: 88000306c000 RSI: RDI: 880003375d00 [ 4024.565452] RBP: 88021fc03c28 R08: 5608 R09: bb84 [ 4024.565452] R10: 0003 R11: 880215a30dc0 R12: 880214bfb000 [ 4024.565452] R13: 88000306c000 R14: 88000306c000 R15: 0008 [ 4024.565452] FS: () GS:88021fc0() knlGS: [ 4024.565452] CS: 0010 DS: ES: CR0: 80050033 [ 4024.565452] CR2: CR3: 01d92000 CR4: 
001406f0 [ 4024.600761] Stack: [ 4024.601081] 814ac9dc 8802 88000306c000 880003375d00 [ 4024.601081] 88008cbba84e 88021fc03c58 81486628 88021690a000 [ 4024.601081] 88008cbba84e 880003375d00 88000306c000 88021fc03cb8 [ 4024.601081] Call Trace: [ 4024.601081] [ 4024.601081] [] ? tcp_v4_early_demux+0x11c/0x160 [ 4024.601081] [] ip_rcv_finish+0xb8/0x360 [ 4024.601081] [] ip_rcv+0x2a4/0x400 [ 4024.601081] [] ? inet_del_offload+0x40/0x40 [ 4024.601081] [] __netif_receive_skb_core+0x6c3/0x9a0 [ 4024.601081] [] ? build_skb+0x17/0x90 [ 4024.601081] [] __netif_receive_skb+0x18/0x60 [ 4024.601081] [] netif_receive_skb_internal+0x33/0xa0 [ 4024.601081] [] netif_receive_skb_sk+0x1c/0x70 [ 4024.601081] [] 0xa008772b [ 4024.601081] [] ? check_preempt_curr+0x80/0xa0 [ 4024.601081] [] 0xa0087d81 [ 4024.601081] [] net_rx_action+0x159/0x340 [ 4024.601081] [] __do_softirq+0xf4/0x290 [ 4024.601081] [] irq_exit+0xad/0xc0 [ 4024.601081] [] do_IRQ+0x5a/0xf0 [ 4024.601081] [] common_interrupt+0x6e/0x6e [ 4024.601081] [ 4024.601081] [] ? native_safe_halt+0x6/0x10 [ 4024.601081] [] default_idle+0x1e/0xc0 [ 4024.601081] [] arch_cpu_idle+0xf/0x20 [ 4024.601081] [] cpu_startup_entry+0x314/0x3e0 [ 4024.601081] [] rest_init+0x7c/0x80 [ 4024.601081] [] start_kernel+0x483/0x490 [ 4024.601081] [] ? set_init_arg+0x55/0x55 [ 4024.601081] [] ? early_idt_handler_array+0x120/0x120 [ 4024.601081] [] x86_64_start_reservations+0x2a/0x2c [ 4024.601081] [] x86_64_start_kernel+0x138/0x147 [ 4024.601081] Code: Bad RIP value. [ 4024.601081] RIP [< (null)>] (null) [ 4024.601081] RSP [ 4024.601081] CR2: [ 4024.601081] ---[ end trace cdabfe9d7380aaab ]--- [ 4024.601081] Kernel panic - not syncing: Fatal exception in interrupt [ 4024.601081] Kernel Offset: disabled [ 4024.601081] Rebooting in 60 seconds.. [ 4024.601081] ACPI MEMORY or I/O RESET_REG. 
[ 4811.261621] NULL pointer dereference at 0020 [ 4811.261621] IP: [] tcp_current_mss+0x2a/0x80 [ 4811.261621] PGD 214af5067 PUD 210de8067 PMD 0 [ 4811.261621] Oops: [#2] SMP [ 4811.261621] Modules linked in: xt_mac xt_mark veth ip_set_hash_net nf_conntrack_ipv6 nf_defrag_ipv6 xt_comment xt_set ip_set_hash_ip ip_set nfnetlink ipip tunnel4 ip_tunnel ip6table_filter ip6_tables xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter br_netfilter nf_nat nf_conntrack bridge stp llc overlay nls_ascii nls_cp437 vfat fat ext4 crc16 mbcache jbd2 sd_mod virtio_scsi scsi_mod virtio_net mousedev crc32c_intel ae
[PATCH net-next 3/3] net: Add table id from route lookup to route response
rt_fill_info which is called for 'route get' requests hardcodes the
table id as RT_TABLE_MAIN which is not correct when multiple tables
are used. Use the newly added table id in the rtable to send back
the correct table.

Signed-off-by: David Ahern
---
 net/ipv4/route.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 92acc95b7578..2738bf4132db 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2325,8 +2325,8 @@ static int rt_fill_info(struct net *net, __be32 dst, __be32 src,
 	r->rtm_dst_len	= 32;
 	r->rtm_src_len	= 0;
 	r->rtm_tos	= fl4->flowi4_tos;
-	r->rtm_table	= RT_TABLE_MAIN;
-	if (nla_put_u32(skb, RTA_TABLE, RT_TABLE_MAIN))
+	r->rtm_table	= rt->rt_table_id;
+	if (nla_put_u32(skb, RTA_TABLE, rt->rt_table_id))
 		goto nla_put_failure;
 	r->rtm_type	= rt->rt_type;
 	r->rtm_scope	= RT_SCOPE_UNIVERSE;
--
1.9.1
[PATCH net-next 2/3] net: Add FIB table id to rtable
Add the FIB table id to rtable to make the information available for IPv4 as it is for IPv6. Signed-off-by: David Ahern --- drivers/net/vrf.c | 2 ++ include/net/route.h | 2 ++ net/ipv4/route.c| 8 net/ipv4/xfrm4_policy.c | 1 + 4 files changed, 13 insertions(+) diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c index e7094fbd7568..8c9ab5ebea23 100644 --- a/drivers/net/vrf.c +++ b/drivers/net/vrf.c @@ -320,6 +320,7 @@ static void vrf_rtable_destroy(struct net_vrf *vrf) static struct rtable *vrf_rtable_create(struct net_device *dev) { + struct net_vrf *vrf = netdev_priv(dev); struct rtable *rth; rth = dst_alloc(&vrf_dst_ops, dev, 2, @@ -335,6 +336,7 @@ static struct rtable *vrf_rtable_create(struct net_device *dev) rth->rt_pmtu= 0; rth->rt_gateway = 0; rth->rt_uses_gateway = 0; + rth->rt_table_id = vrf->tb_id; INIT_LIST_HEAD(&rth->rt_uncached); rth->rt_uncached_list = NULL; } diff --git a/include/net/route.h b/include/net/route.h index cc61cb95f059..10a7d21a211c 100644 --- a/include/net/route.h +++ b/include/net/route.h @@ -64,6 +64,8 @@ struct rtable { /* Miscellaneous cached information */ u32 rt_pmtu; + u32 rt_table_id; + struct list_headrt_uncached; struct uncached_list*rt_uncached_list; }; diff --git a/net/ipv4/route.c b/net/ipv4/route.c index eaefeadce07c..92acc95b7578 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1457,6 +1457,7 @@ static struct rtable *rt_dst_alloc(struct net_device *dev, rt->rt_pmtu = 0; rt->rt_gateway = 0; rt->rt_uses_gateway = 0; + rt->rt_table_id = 0; INIT_LIST_HEAD(&rt->rt_uncached); rt->dst.output = ip_output; @@ -1629,6 +1630,8 @@ static int __mkroute_input(struct sk_buff *skb, } rth->rt_is_input = 1; + if (res->table) + rth->rt_table_id = res->table->tb_id; RT_CACHE_STAT_INC(in_slow_tot); rth->dst.input = ip_forward; @@ -1808,6 +1811,8 @@ out: return err; rth->dst.tclassid = itag; #endif rth->rt_is_input = 1; + if (res.table) + rth->rt_table_id = res.table->tb_id; RT_CACHE_STAT_INC(in_slow_tot); if (res.type == 
RTN_UNREACHABLE) { @@ -1988,6 +1993,9 @@ static struct rtable *__mkroute_output(const struct fib_result *res, return ERR_PTR(-ENOBUFS); rth->rt_iif = orig_oif ? : 0; + if (res->table) + rth->rt_table_id = res->table->tb_id; + RT_CACHE_STAT_INC(out_slow_tot); if (flags & (RTCF_BROADCAST | RTCF_MULTICAST)) { diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c index bb919b28619f..671011055ad5 100644 --- a/net/ipv4/xfrm4_policy.c +++ b/net/ipv4/xfrm4_policy.c @@ -95,6 +95,7 @@ static int xfrm4_fill_dst(struct xfrm_dst *xdst, struct net_device *dev, xdst->u.rt.rt_gateway = rt->rt_gateway; xdst->u.rt.rt_uses_gateway = rt->rt_uses_gateway; xdst->u.rt.rt_pmtu = rt->rt_pmtu; + xdst->u.rt.rt_table_id = rt->rt_table_id; INIT_LIST_HEAD(&xdst->u.rt.rt_uncached); return 0; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-next 1/3] net: Refactor rtable initialization
All callers to rt_dst_alloc have nearly the same initialization following a successful allocation. Consolidate it into rt_dst_alloc. Signed-off-by: David Ahern --- net/ipv4/route.c | 85 ++-- 1 file changed, 33 insertions(+), 52 deletions(-) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 5f4a5565ad8b..eaefeadce07c 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1438,12 +1438,33 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr, } static struct rtable *rt_dst_alloc(struct net_device *dev, + unsigned int flags, u16 type, bool nopolicy, bool noxfrm, bool will_cache) { - return dst_alloc(&ipv4_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK, -(will_cache ? 0 : (DST_HOST | DST_NOCACHE)) | -(nopolicy ? DST_NOPOLICY : 0) | -(noxfrm ? DST_NOXFRM : 0)); + struct rtable *rt; + + rt = dst_alloc(&ipv4_dst_ops, dev, 1, DST_OBSOLETE_FORCE_CHK, + (will_cache ? 0 : (DST_HOST | DST_NOCACHE)) | + (nopolicy ? DST_NOPOLICY : 0) | + (noxfrm ? DST_NOXFRM : 0)); + + if (rt) { + rt->rt_genid = rt_genid_ipv4(dev_net(dev)); + rt->rt_flags = flags; + rt->rt_type = type; + rt->rt_is_input = 0; + rt->rt_iif = 0; + rt->rt_pmtu = 0; + rt->rt_gateway = 0; + rt->rt_uses_gateway = 0; + INIT_LIST_HEAD(&rt->rt_uncached); + + rt->dst.output = ip_output; + if (flags & RTCF_LOCAL) + rt->dst.input = ip_local_deliver; + } + + return rt; } /* called in rcu_read_lock() section */ @@ -1452,6 +1473,7 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr, { struct rtable *rth; struct in_device *in_dev = __in_dev_get_rcu(dev); + unsigned int flags = RTCF_MULTICAST; u32 itag = 0; int err; @@ -1477,7 +1499,10 @@ static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr, if (err < 0) goto e_err; } - rth = rt_dst_alloc(dev_net(dev)->loopback_dev, + if (our) + flags |= RTCF_LOCAL; + + rth = rt_dst_alloc(dev_net(dev)->loopback_dev, flags, RTN_MULTICAST, IN_DEV_CONF_GET(in_dev, NOPOLICY), false, false); if (!rth) goto e_nobufs; @@ -1486,20 +1511,7 @@ 
static int ip_route_input_mc(struct sk_buff *skb, __be32 daddr, __be32 saddr, rth->dst.tclassid = itag; #endif rth->dst.output = ip_rt_bug; - - rth->rt_genid = rt_genid_ipv4(dev_net(dev)); - rth->rt_flags = RTCF_MULTICAST; - rth->rt_type= RTN_MULTICAST; rth->rt_is_input= 1; - rth->rt_iif = 0; - rth->rt_pmtu= 0; - rth->rt_gateway = 0; - rth->rt_uses_gateway = 0; - INIT_LIST_HEAD(&rth->rt_uncached); - if (our) { - rth->dst.input= ip_local_deliver; - rth->rt_flags |= RTCF_LOCAL; - } #ifdef CONFIG_IP_MROUTE if (!ipv4_is_local_multicast(daddr) && IN_DEV_MFORWARD(in_dev)) @@ -1608,7 +1620,7 @@ static int __mkroute_input(struct sk_buff *skb, } } - rth = rt_dst_alloc(out_dev->dev, + rth = rt_dst_alloc(out_dev->dev, 0, res->type, IN_DEV_CONF_GET(in_dev, NOPOLICY), IN_DEV_CONF_GET(out_dev, NOXFRM), do_cache); if (!rth) { @@ -1616,19 +1628,10 @@ static int __mkroute_input(struct sk_buff *skb, goto cleanup; } - rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev)); - rth->rt_flags = 0; - rth->rt_type = res->type; rth->rt_is_input = 1; - rth->rt_iif = 0; - rth->rt_pmtu= 0; - rth->rt_gateway = 0; - rth->rt_uses_gateway = 0; - INIT_LIST_HEAD(&rth->rt_uncached); RT_CACHE_STAT_INC(in_slow_tot); rth->dst.input = ip_forward; - rth->dst.output = ip_output; rt_set_nexthop(rth, daddr, res, fnhe, res->fi, res->type, itag); if (lwtunnel_output_redirect(rth->dst.lwtstate)) { @@ -1795,26 +1798,16 @@ out:return err; } } - rth = rt_dst_alloc(net->loopback_dev, + rth = rt_dst_alloc(net->loopback_dev, flags | RTCF_LOCAL, res.type, IN_DEV_CONF_GET(in_dev, NOPOLICY), false, do_cache); if (!rth) goto e_nobufs; - rth->dst.input= ip_local_deliver; rth->dst.output= ip_rt_bug; #ifdef CONFIG_IP_ROUTE_CLASSID rth->dst.tclassid = itag; #endif - - rth->rt_genid = rt_genid_ipv4(net); - rth->rt_flags = flags|RTCF_LOCAL; - rth->rt_type= res.typ
Re: [PATCH] flow_dissector: Use 'const' where possible.
Wed, Sep 02, 2015 at 06:33:34AM CEST, t...@herbertland.com wrote:
>On Tue, Sep 1, 2015 at 9:19 PM, David Miller wrote:
>>
>> Signed-off-by: David S. Miller
>> ---
>>  include/linux/skbuff.h    |  8 ++---
>>  include/net/flow.h        |  8 ++---
>>  net/core/flow_dissector.c | 79 ---
>>  3 files changed, 49 insertions(+), 46 deletions(-)
>>
>> diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
>> index 345a040..d79699c 100644
>> --- a/net/core/flow_dissector.c
>> +++ b/net/core/flow_dissector.c
>> @@ -19,14 +19,14 @@
>>  #include
>>  #include
>>
>> -static bool skb_flow_dissector_uses_key(struct flow_dissector *flow_dissector,
>> -					enum flow_dissector_key_id key_id)
>> +static bool dissector_uses_key(const struct flow_dissector *flow_dissector,
>> +			       enum flow_dissector_key_id key_id)
>>  {
>>  	return flow_dissector->used_keys & (1 << key_id);
>>  }
>>
>> -static void skb_flow_dissector_set_key(struct flow_dissector *flow_dissector,
>> -				       enum flow_dissector_key_id key_id)
>> +static void dissector_set_key(struct flow_dissector *flow_dissector,
>> +			      enum flow_dissector_key_id key_id)
>>  {
>>  	flow_dissector->used_keys |= (1 << key_id);
>>  }
>> @@ -51,20 +51,20 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector,
>
>I suppose we should drop skb_ from skb_flow_dissector_init and
>skb_flow_dissector_target as well.

I like to have "namespaces" by function prefixes. Code is easier to read then...
[PATCH] bgmac: Update fixed_phy_register()
From: Fabio Estevam

Commit a5597008dbc2 ("phy: fixed_phy: Add gpio to determine link up/down.")
added a new argument to fixed_phy_register(), but did not update the
bgmac driver, causing the following build failure:

drivers/net/ethernet/broadcom/bgmac.c:1450:2: error: too few arguments to function 'fixed_phy_register'

Add the missing argument.

Reported-by: Mark Brown
Signed-off-by: Fabio Estevam
---
 drivers/net/ethernet/broadcom/bgmac.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bgmac.c b/drivers/net/ethernet/broadcom/bgmac.c
index d043746..28f7610 100644
--- a/drivers/net/ethernet/broadcom/bgmac.c
+++ b/drivers/net/ethernet/broadcom/bgmac.c
@@ -1447,7 +1447,7 @@ static int bgmac_fixed_phy_register(struct bgmac *bgmac)
 	struct phy_device *phy_dev;
 	int err;

-	phy_dev = fixed_phy_register(PHY_POLL, &fphy_status, NULL);
+	phy_dev = fixed_phy_register(PHY_POLL, &fphy_status, -1, NULL);
 	if (!phy_dev || IS_ERR(phy_dev)) {
 		bgmac_err(bgmac, "Failed to register fixed PHY device\n");
 		return -ENODEV;
--
1.9.1
Re: Issue with /etc/netns/${nsname}/hosts
On 02/09/2015 01:23, James Loosli wrote:
> I seem to have an issue with using namespace-specific hosts files.
>
> Here's an example. I have different entries for foo.com in my hosts file
> for the namespace and the system-wide hosts file:
>
> root@server-01 Tue Sep 01 04:15:02pm cat /etc/netns/nsXX-XXX-240-3/hosts | grep foo
> 1.2.3.4 foo.com
> root@server-01 Tue Sep 01 04:15:15pm ip netns exec nsXX-XXX-240-3 cat /etc/hosts | grep foo
> 1.2.3.4 foo.com
> root@server-01 Tue Sep 01 04:15:19pm cat /etc/hosts | grep foo
> 0.0.0.0 foo.com
>
> But when I try to get curl, ping or other utilities to use that hosts
> file entry, they ignore the namespace-specific file.
>
> root@server-01 Tue Sep 01 04:16:02pm ip netns exec ns91-227-240-3 curl -vv foo.com

Probably a copy and paste error, but the netns name was nsXX-XXX-240-3 in your example above. Can you confirm?
Re: [PATCH net-next] ipv6: fix multipath route replace error recovery
On 9/1/15, 10:54 PM, Roopa Prabhu wrote: From: Roopa Prabhu Problem: The ecmp route replace support for ipv6 in the kernel, deletes the existing ecmp route too early, ie when it installs the first nexthop. If there is an error in installing the subsequent nexthops, its too late to recover the already deleted existing route This patch fixes the problem with the following: a) Changes the existing multipath route add code to a two stage process: build rt6_infos + insert them ip6_route_add rt6_info creation code is moved into ip6_route_info_create. b) This ensures that all errors are caught during building rt6_infos and we fail early c) Separates multipath add and del code. Because add needs the special two stage mode in a) and delete essentially does not care. d) In any event if the code fails during inserting a route again, a warning is printed (This should be unlikely) Before the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: File exists $ip -6 route show /* previously added ecmp route 3000:1000:1000:1000::2 dissappears from * kernel */ After the patch: $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 /* Try replacing the route with a duplicate nexthop */ $ip -6 route change 3000:1000:1000:1000::2/128 nexthop via fe80::202:ff:fe00:b dev swp49s0 nexthop via fe80::202:ff:fe00:d dev swp49s1 nexthop via fe80::202:ff:fe00:d dev swp49s1 RTNETLINK answers: 
File exists $ip -6 route show 3000:1000:1000:1000::2 via fe80::202:ff:fe00:b dev swp49s0 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:d dev swp49s1 metric 1024 3000:1000:1000:1000::2 via fe80::202:ff:fe00:f dev swp49s2 metric 1024 Fixes: 4a287eba2de3 ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag") Signed-off-by: Roopa Prabhu --- This bug is present in 4.1 kernel and 4.2 too. Since 4.2 is out or almost out, I am submitting the patch against net-next. I can respin against net if needed. The part of the patch that I would appreciate more eyes on is the cleanup of the rt6_infos in ip_route_multipath_add. And I have tried to keep the changes local to route.c closer to the netlink message handling. Most of the code changes are moving code into separate functions. net/ipv6/route.c | 205 --- 1 file changed, 179 insertions(+), 26 deletions(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index f45cac6..b1b8c96 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -1748,7 +1748,7 @@ static int ip6_convert_metrics(struct mx6_config *mxc, return -EINVAL; } -int ip6_route_add(struct fib6_config *cfg) +int ip6_route_info_create(struct fib6_config *cfg, struct rt6_info **rt_ret) { int err; struct net *net = cfg->fc_nlinfo.nl_net; @@ -1756,7 +1756,6 @@ int ip6_route_add(struct fib6_config *cfg) struct net_device *dev = NULL; struct inet6_dev *idev = NULL; struct fib6_table *table; - struct mx6_config mxc = { .mx = NULL, }; int addr_type; if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128) @@ -1981,6 +1980,32 @@ install_route: cfg->fc_nlinfo.nl_net = dev_net(dev); + *rt_ret = rt; + + return 0; +out: + if (dev) + dev_put(dev); + if (idev) + in6_dev_put(idev); + if (rt) + dst_free(&rt->dst); + + *rt_ret = NULL; + + return err; +} + +int ip6_route_add(struct fib6_config *cfg) +{ + struct mx6_config mxc = { .mx = NULL, }; + struct rt6_info *rt = NULL; + int err; + + err = ip6_route_info_create(cfg, &rt); + if (err) 
+ goto out; + err = ip6_convert_metrics(&mxc, cfg); if (err) goto out; @@ -1988,14 +2013,12 @@ install_route: err = __ip6_ins_rt(rt, &cfg->fc_nlinfo, &mxc); kfree(mxc.mx); + return err; out: - if (dev) - dev_put(dev); - if (idev) - in6_dev_put(idev); if (rt) dst_free(&rt->dst); + return err; } @@ -2776,19 +2799,79 @@ errout: return err; } -static int ip6_route_multipath(struct fib6_config *cfg, int add) +struct rt6_nh { + struct rt6_info *rt6_info; + struct fib6_config r_cfg; + struct mx6_config mxc; + struct li
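The two-stage idea from the commit message (build and validate everything first, touch the installed route only afterwards) can be sketched in userspace C as below. This is a simplified illustration, not the patch's code: struct route, route_build and route_replace are hypothetical names, nexthops are reduced to integer ids, and rejecting duplicates during the build stage is an assumption made to mirror the "File exists" case in the example output.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a multipath route: a set of nexthop ids. */
struct route {
	int *nh;
	size_t n;
};

/* Stage 1: build and validate the complete new set without touching the live route. */
static int route_build(const int *nh, size_t n, struct route *out)
{
	int *copy;

	assert(n > 0);
	copy = malloc(n * sizeof(*copy));
	if (!copy)
		return -ENOMEM;

	for (size_t i = 0; i < n; i++) {
		/* A duplicate nexthop fails the whole request up front. */
		for (size_t j = 0; j < i; j++) {
			if (copy[j] == nh[i]) {
				free(copy);
				return -EEXIST;
			}
		}
		copy[i] = nh[i];
	}

	out->nh = copy;
	out->n = n;
	return 0;
}

/* Stage 2: only a fully built set is allowed to replace the live route. */
static int route_replace(struct route *live, const int *nh, size_t n)
{
	struct route staged;
	int err = route_build(nh, n, &staged);

	if (err)
		return err;	/* live route is left exactly as it was */

	free(live->nh);
	*live = staged;
	return 0;
}
```

With this shape, a failed replace returns an error while the previously installed nexthops remain visible, which is the behaviour shown in the "After the patch" output.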
Re: [PATCHv1 net-next 0/5] netlink: mmap: kernel panic and some issues
On 09/02/2015 01:35 PM, Ken-ichirou MATSUZAWA wrote:
> Thank you for the reply.
>
> On Wed, Sep 02, 2015 at 11:47:26AM +0200, Daniel Borkmann wrote:
>> On 09/02/2015 02:04 AM, Ken-ichirou MATSUZAWA wrote:
>>> Talking about skb_copy path, original skb's shared info is accessed
>>> only in copy_skb_header, to get gso related field. As a result of
>>
>> It's still not correct. The thing is you can neither call skb_copy() nor
>> skb_clone() on netlink mmaped skbs. For example, skb_copy_bits() would
>
> I am sorry for the lack of explanation. And I am afraid I
> misunderstand...
>
> Updated pointers to its data area in a mmaped netlink skb is only its
> tail. Head, data and end will not be updated. skb_copy() calls
>
>   int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len)
>
> as its argument, "offset" is always 0 and "len" is skb->len. In
> skb_copy_bits() both "start" and "copy" are skb->len, which means
> "len - copy" is always 0 so that returns 0 before accessing shared info.
> I don't know the situation is intended or not, it seems that skb_copy()
> for a mmaped skb will not access its shared info.

Okay, right, since it's all linear, but ...

> After that, copy_skb_header() will set newly allocate skb's (wrong) gso
> fields, I asked we should clear it or not.

... here still we access skb_shinfo() from the mmap'ed skb, which we are simply not allowed (despite whether resetting fields later on as you suggest or not), for two reasons: I think (will start experimenting more with it tomorrow), you would get an out of bounds access here in case the skb->data is the last slot in the ring buffer and reaches exactly to the ring buffer end. And (despite that), it's also hard to maintain - the next one adding a new shared info member will very likely overlook this special case in netlink here, thus the issue would then simply be reintroduced over and over.

Thanks,
Daniel
[PATCH] tipc: fix stall during bclink wakeup procedure
From: Dmitry S Kolmakov If an attempt to wake up users of broadcast link is made when there is no enough place in send queue than it may hang up inside the tipc_sk_rcv() function since the loop breaks only after the wake up queue becomes empty. This can lead to complete CPU stall with the following message generated by RCU: Aug 17 15:11:28 [kernel] INFO: rcu_sched self-detected stall on CPU { 0} (t=2101 jiffies g=54225 c=54224 q=11465) Aug 17 15:11:28 [kernel] Task dump for CPU 0: Aug 17 15:11:28 [kernel] tpchR running task0 39949 39948 0x000a Aug 17 15:11:28 [kernel] 818536c0 88181fa037a0 8106a4be Aug 17 15:11:28 [kernel] 818536c0 88181fa037c0 8106d8a8 88181fa03800 Aug 17 15:11:28 [kernel] 0001 88181fa037f0 81094a50 88181fa15680 Aug 17 15:11:28 [kernel] Call Trace: Aug 17 15:11:28 [kernel][] sched_show_task+0xae/0x120 Aug 17 15:11:28 [kernel] [] dump_cpu_task+0x38/0x40 Aug 17 15:11:28 [kernel] [] rcu_dump_cpu_stacks+0x90/0xd0 Aug 17 15:11:28 [kernel] [] rcu_check_callbacks+0x3eb/0x6e0 Aug 17 15:11:28 [kernel] [] ? account_system_time+0x7f/0x170 Aug 17 15:11:28 [kernel] [] update_process_times+0x34/0x60 Aug 17 15:11:28 [kernel] [] tick_sched_handle.isra.18+0x31/0x40 Aug 17 15:11:28 [kernel] [] tick_sched_timer+0x3c/0x70 Aug 17 15:11:28 [kernel] [] __run_hrtimer.isra.34+0x3d/0xc0 Aug 17 15:11:28 [kernel] [] hrtimer_interrupt+0xc5/0x1e0 Aug 17 15:11:28 [kernel] [] ? native_smp_send_reschedule+0x42/0x60 Aug 17 15:11:28 [kernel] [] local_apic_timer_interrupt+0x34/0x60 Aug 17 15:11:28 [kernel] [] smp_apic_timer_interrupt+0x3c/0x60 Aug 17 15:11:28 [kernel] [] apic_timer_interrupt+0x6b/0x70 Aug 17 15:11:28 [kernel] [] ? _raw_spin_unlock_irqrestore+0x9/0x10 Aug 17 15:11:28 [kernel] [] __wake_up_sync_key+0x4f/0x60 Aug 17 15:11:28 [kernel] [] tipc_write_space+0x31/0x40 [tipc] Aug 17 15:11:28 [kernel] [] filter_rcv+0x31f/0x520 [tipc] Aug 17 15:11:28 [kernel] [] ? tipc_sk_lookup+0xc9/0x110 [tipc] Aug 17 15:11:28 [kernel] [] ? 
_raw_spin_lock_bh+0x19/0x30 Aug 17 15:11:28 [kernel] [] tipc_sk_rcv+0x2dc/0x3e0 [tipc] Aug 17 15:11:28 [kernel] [] tipc_bclink_wakeup_users+0x2f/0x40 [tipc] Aug 17 15:11:28 [kernel] [] tipc_node_unlock+0x186/0x190 [tipc] Aug 17 15:11:28 [kernel] [] ? kfree_skb+0x2c/0x40 Aug 17 15:11:28 [kernel] [] tipc_rcv+0x2ac/0x8c0 [tipc] Aug 17 15:11:28 [kernel] [] tipc_l2_rcv_msg+0x38/0x50 [tipc] Aug 17 15:11:28 [kernel] [] __netif_receive_skb_core+0x5a3/0x950 Aug 17 15:11:28 [kernel] [] __netif_receive_skb+0x13/0x60 Aug 17 15:11:28 [kernel] [] netif_receive_skb_internal+0x1e/0x90 Aug 17 15:11:28 [kernel] [] napi_gro_receive+0x78/0xa0 Aug 17 15:11:28 [kernel] [] tg3_poll_work+0xc54/0xf40 [tg3] Aug 17 15:11:28 [kernel] [] ? consume_skb+0x2c/0x40 Aug 17 15:11:28 [kernel] [] tg3_poll_msix+0x41/0x160 [tg3] Aug 17 15:11:28 [kernel] [] net_rx_action+0xe2/0x290 Aug 17 15:11:28 [kernel] [] __do_softirq+0xda/0x1f0 Aug 17 15:11:28 [kernel] [] irq_exit+0x76/0xa0 Aug 17 15:11:28 [kernel] [] do_IRQ+0x55/0xf0 Aug 17 15:11:28 [kernel] [] common_interrupt+0x6b/0x6b Aug 17 15:12:31 [kernel] This issue was happened on quite big networks of 32-64 sockets which send several multicast messages all-to-all at the same time. The patch fixes the issue by reusing the link_prepare_wakeup() procedure which moves users as permitted by space available in send queue to a separate queue which in its turn is conveyed to tipc_sk_rcv(). The link_prepare_wakeup() procedure was also modified a bit: 1. Firstly to enable its reuse some actions related to unicast link were moved out of the function. 2. And secondly the internal loop doesn't break now when only one send queue is exhausted but it continues up to the end of wake up queue so all send queues can be refilled. 
Signed-off-by: Dmitry S Kolmakov
---
diff --git a/net/tipc/bcast.c b/net/tipc/bcast.c
index c5cbdcb..b56f74a 100644
--- a/net/tipc/bcast.c
+++ b/net/tipc/bcast.c
@@ -176,8 +176,12 @@ static void bclink_retransmit_pkt(struct tipc_net *tn, u32 after, u32 to)
 void tipc_bclink_wakeup_users(struct net *net)
 {
 	struct tipc_net *tn = net_generic(net, tipc_net_id);
+	struct tipc_link *bcl = tn->bcl;
+	struct sk_buff_head resultq;

-	tipc_sk_rcv(net, &tn->bclink->link.wakeupq);
+	skb_queue_head_init(&resultq);
+	link_prepare_wakeup(bcl, &resultq);
+	tipc_sk_rcv(net, &resultq);
 }

 /**
diff --git a/net/tipc/link.c b/net/tipc/link.c
index 43a515d..467edbc 100644
--- a/net/tipc/link.c
+++ b/net/tipc/link.c
@@ -372,10 +372,11 @@ err:
 /**
  * link_prepare_wakeup - prepare users for wakeup after congestion
  * @link: congested link
+ * @resultq: queue for users which can be woken up
  * Move a number of waiting users, as permitted by available space in
- * the send queue, from link wait queue to node wait queue for wakeup
+ * the send queue, f
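
The loop change described in the commit message (continue scanning instead of stopping at the first exhausted send queue) can be pictured with a small userspace model. This is only a sketch under assumptions: `struct waiter`, `NUM_IMP`, the `room[]` array and `prepare_wakeup()` are hypothetical names invented for illustration, and the accounting is simplified compared with whatever the real link_prepare_wakeup() does.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_IMP 4	/* hypothetical number of importance levels */

/* Hypothetical stand-in for one blocked user on the wakeup queue. */
struct waiter {
	int imp;	/* importance level of the pending message chain */
	int chain_sz;	/* number of buffers the user wants to send */
	int woken;	/* set to 1 once moved to the result queue */
};

/*
 * Walk the whole wakeup queue and mark every waiter whose importance
 * level still has room in the send queue. room[imp] is the free space
 * for that level. A full level is skipped with "continue", so waiters
 * of other levels further down the queue are still considered.
 * Returns the number of waiters selected for wakeup.
 */
size_t prepare_wakeup(struct waiter *w, size_t n, const int room[NUM_IMP])
{
	int pending[NUM_IMP] = {0};
	size_t moved = 0;

	for (size_t i = 0; i < n; i++) {
		int imp = w[i].imp;

		assert(imp >= 0 && imp < NUM_IMP);
		pending[imp] += w[i].chain_sz;
		if (pending[imp] > room[imp])
			continue;	/* this level is exhausted, try the rest */
		w[i].woken = 1;
		moved++;
	}
	return moved;
}
```

With two waiters of 3 buffers each at level 0, one waiter of 2 buffers at level 1, and room of 4 and 5 respectively, the model wakes the first and third waiter. A loop that used `break` at the first exhausted level would have woken only the first.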
Re: [PATCH net] sock, diag: fix panic in sock_diag_put_filterinfo
On 9/2/15 5:00 AM, Daniel Borkmann wrote:
> diag socket's sock_diag_put_filterinfo() dumps classic BPF programs
> upon request to user space (ss -0 -b). However, native eBPF programs
> attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method:
> ...
> Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
> Signed-off-by: Daniel Borkmann

good catch. thanks

Acked-by: Alexei Starovoitov
RE: [PATCH net-next] xen-netback: add support for multicast control
> -----Original Message-----
> From: Wei Liu [mailto:wei.l...@citrix.com]
> Sent: 02 September 2015 15:01
> To: Paul Durrant
> Cc: netdev@vger.kernel.org; xen-de...@lists.xenproject.org; Ian Campbell;
> Wei Liu
> Subject: Re: [PATCH net-next] xen-netback: add support for multicast control
>
> On Wed, Sep 02, 2015 at 01:19:53PM +0100, Paul Durrant wrote:
> > Xen's PV network protocol includes messages to add/remove ethernet
> > multicast addresses to/from a filter list in the backend. This allows
> > the frontend to request the backend only forward multicast packets
> > which are off interest thus preventing unnecessary noise on the shared
>
> "of interest"

Ah yes :-)

>
> > ring.
> >
> [...]
> > +
> > +void xenvif_mcast_flush(struct xenvif *vif)
>
> Only one cosmetic comment.
>
> My first impression of this function by looking at the name is that it
> flushes queued multicast packets. Maybe we can rename it to
> xenvif_mcast_addr_list_free ?
>

Sure.

  Paul

> Wei.
Re: [PATCH net-next] xen-netback: add support for multicast control
On Wed, Sep 02, 2015 at 01:19:53PM +0100, Paul Durrant wrote:
> Xen's PV network protocol includes messages to add/remove ethernet
> multicast addresses to/from a filter list in the backend. This allows
> the frontend to request the backend only forward multicast packets
> which are off interest thus preventing unnecessary noise on the shared

"of interest"

> ring.
>
[...]
> +
> +void xenvif_mcast_flush(struct xenvif *vif)

Only one cosmetic comment.

My first impression of this function by looking at the name is that it
flushes queued multicast packets. Maybe we can rename it to
xenvif_mcast_addr_list_free ?

Wei.
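
For readers following the naming discussion, here is a rough userspace model of a bounded multicast filter list of the kind the patch under review describes: an add operation that fails once a fixed limit is reached, a match operation, and a function that frees the stored addresses (not queued packets). It is a sketch under assumptions, not the xen-netback code: all identifiers (`mcast_filter`, `mcast_filter_add()`, `MCAST_MAX`, and so on) are hypothetical, it uses a plain singly linked list with malloc() rather than kernel RCU lists, and the limit of 64 is taken from the patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MCAST_MAX 64	/* limit quoted in the patch description */
#define ADDR_LEN  6	/* length of an Ethernet address in bytes */

/* Hypothetical list node holding one multicast address. */
struct mcast_addr {
	struct mcast_addr *next;
	unsigned char addr[ADDR_LEN];
};

/* Hypothetical filter: a list head plus an entry count. */
struct mcast_filter {
	struct mcast_addr *head;
	unsigned int count;
};

/* Add an address. Returns 0 on success, -1 if the list is full or
 * memory could not be allocated. */
int mcast_filter_add(struct mcast_filter *f, const unsigned char *addr)
{
	struct mcast_addr *m;

	assert(f != NULL && addr != NULL);
	if (f->count == MCAST_MAX)
		return -1;

	m = malloc(sizeof(*m));
	if (!m)
		return -1;

	memcpy(m->addr, addr, ADDR_LEN);
	m->next = f->head;
	f->head = m;
	f->count++;
	return 0;
}

/* Return true if addr is present in the filter list. */
bool mcast_filter_match(const struct mcast_filter *f, const unsigned char *addr)
{
	for (const struct mcast_addr *m = f->head; m; m = m->next)
		if (memcmp(m->addr, addr, ADDR_LEN) == 0)
			return true;
	return false;
}

/* Free every stored address. Nothing is "flushed" in the sense of
 * sending queued packets, which is the point of the rename suggestion. */
void mcast_filter_addr_list_free(struct mcast_filter *f)
{
	while (f->head) {
		struct mcast_addr *m = f->head;

		f->head = m->next;
		free(m);
	}
	f->count = 0;
}
```

In this model the 65th add is rejected, which corresponds to the behaviour the patch description states for a frontend that exceeds the limit.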
Re: Unexpected loss recovery in TLP
On Wed, 2015-09-02 at 10:54 +0200, Mohammad Rajiullah wrote:
> Hi Eric!
>
> Thanks for the direction. I tried packet drill locally (with the same kernel
> Linux 3.18.5 to start with)
> with the following script. And it doesn’t show the problem I mentioned.
> So the fast retransmit happens after getting the dupack.
> It would be good if I could get some information from the calls
> from the TCP stack (I have some printk there), but using packet drill I don’t
> know at the moment,
> how to get that.
>

Please do not top post on netdev mailing list.

You could try nstat before/after the failure and report anomalies here.

> \
> Mohammad
>
>
> // Establish a connection.
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> +0 setsockopt(3, SOL_SOCKET, TCP_NODELAY, [1], 4) = 0
>
> +0 bind(3, ..., ...) = 0
> +0 listen(3, 1) = 0
>
> +0 < S 0:0(0) win 32792
> +0 > S. 0:0(0) ack 1 <...>
>
> +.03 < . 1:1(0) ack 1 win 257
> +0 accept(3, ..., ...) = 4
>
> // Send 1 data segment and get an ACK with DATA
> +0 write(4, ..., 1000) = 1000

Note the original tcpdump you gave seemed to use len=250, could you try
the exact same lengths ?

Thanks !
Re: [PATCH net] sock, diag: fix panic in sock_diag_put_filterinfo
On 02/09/2015 14:00, Daniel Borkmann wrote:
> diag socket's sock_diag_put_filterinfo() dumps classic BPF programs
> upon request to user space (ss -0 -b). However, native eBPF programs
> attached to sockets (SO_ATTACH_BPF) cannot be dumped with this method:
>
> Their orig_prog is always NULL. However, sock_diag_put_filterinfo()
> unconditionally tries to access its filter length resp. wants to copy
> the filter insns from there. Internal cBPF to eBPF transformations
> attached to sockets don't have this issue, as orig_prog state is kept.
>
> It's currently only used by packet sockets. If we would want to add
> native eBPF support in the future, this needs to be done through
> a different attribute than PACKET_DIAG_FILTER to not confuse possible
> user space disassemblers that work on diag data.
>
> Fixes: 89aa075832b0 ("net: sock: allow eBPF programs to be attached to sockets")
> Signed-off-by: Daniel Borkmann

Acked-by: Nicolas Dichtel
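
The failure mode quoted above (dereferencing an orig_prog that is NULL for native eBPF programs) can be shown with a small userspace model. This is a sketch under assumptions rather than the patch itself: `struct insn`, `struct orig_prog`, `struct prog` and `dump_filter()` are hypothetical stand-ins for the kernel structures, and the guard only illustrates the general idea of checking the pointer before reading the filter length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one classic BPF instruction. */
struct insn {
	unsigned short code;
	unsigned char jt;
	unsigned char jf;
	unsigned int k;
};

/* Hypothetical stand-in for the saved classic program. */
struct orig_prog {
	unsigned short len;		/* number of instructions */
	const struct insn *filter;	/* the instructions themselves */
};

/* Hypothetical stand-in for an attached program. In this model
 * orig_prog is NULL when the program was loaded as native eBPF. */
struct prog {
	const struct orig_prog *orig_prog;
};

/*
 * Copy the classic instructions into out[] (capacity cap).
 * Returns the number of instructions copied, or 0 when there is no
 * classic program to dump or it does not fit.
 */
size_t dump_filter(const struct prog *p, struct insn *out, size_t cap)
{
	const struct orig_prog *fprog;

	assert(out != NULL);
	if (!p)
		return 0;

	fprog = p->orig_prog;
	if (!fprog)
		return 0;	/* native eBPF: nothing classic to dump */

	if (fprog->len > cap)
		return 0;

	memcpy(out, fprog->filter, fprog->len * sizeof(*out));
	return fprog->len;
}
```

Without the `if (!fprog)` check, the `fprog->len` read would dereference a NULL pointer for the native eBPF case, which is the panic the patch title refers to.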
[net-next PATCH] net: ipv6: use common fib_default_rule_pref
This switches IPv6 policy routing to use the shared
fib_default_rule_pref() function of IPv4 and DECnet. It is also used in
multicast routing for IPv4 as well as IPv6.

The motivation for this patch is a complaint about iproute2 behaving
inconsistently between IPv4 and IPv6 when adding policy rules: Formerly,
IPv6 rules were assigned a fixed priority of 0x3FFF whereas for IPv4 the
assigned priority value was decreased with each rule added.

Signed-off-by: Phil Sutter
---
 net/ipv6/fib6_rules.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index 2367a16..a859ad2 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -258,11 +258,6 @@ nla_put_failure:
 	return -ENOBUFS;
 }

-static u32 fib6_rule_default_pref(struct fib_rules_ops *ops)
-{
-	return 0x3FFF;
-}
-
 static size_t fib6_rule_nlmsg_payload(struct fib_rule *rule)
 {
 	return nla_total_size(16) /* dst */
@@ -279,7 +274,7 @@ static const struct fib_rules_ops __net_initconst fib6_rules_ops_template = {
 	.configure		= fib6_rule_configure,
 	.compare		= fib6_rule_compare,
 	.fill			= fib6_rule_fill,
-	.default_pref		= fib6_rule_default_pref,
+	.default_pref		= fib_default_rule_pref,
 	.nlmsg_payload		= fib6_rule_nlmsg_payload,
 	.nlgroup		= RTNLGRP_IPV6_RULE,
 	.policy			= fib6_rule_policy,
--
2.1.2
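
To make the "decreased with each rule added" behaviour concrete, here is a small userspace model of how a shared default-priority helper could behave. It is a sketch under assumptions: it uses a sorted array instead of the kernel's rule list, `default_rule_pref()` and `add_rule_default()` are hypothetical names, and the logic reflects my reading of fib_default_rule_pref() (take the priority of the second rule in the list and subtract one), which should be checked against the kernel source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * prefs[] holds the priorities of the installed rules in ascending
 * order. The default for a new rule is one less than the priority of
 * the second rule, or 0 if there is no second rule or its priority
 * is already 0.
 */
uint32_t default_rule_pref(const uint32_t *prefs, size_t n)
{
	assert(prefs != NULL || n == 0);
	if (n > 1 && prefs[1] != 0)
		return prefs[1] - 1;
	return 0;
}

/*
 * Insert a rule with the default priority, keeping prefs[] sorted.
 * cap is the capacity of prefs[]. Returns the new number of rules.
 */
size_t add_rule_default(uint32_t *prefs, size_t n, size_t cap)
{
	uint32_t pref = default_rule_pref(prefs, n);
	size_t i = 0;

	assert(n < cap);
	/* New rule goes after all rules with an equal or lower priority. */
	while (i < n && prefs[i] <= pref)
		i++;
	memmove(&prefs[i + 1], &prefs[i], (n - i) * sizeof(prefs[0]));
	prefs[i] = pref;
	return n + 1;
}
```

Starting from the usual IPv4 rule set with priorities 0, 32766 and 32767, the first added rule gets 32765 and the next one 32764 in this model, whereas the removed IPv6 helper returned the same constant every time.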
Re: [PATCH net-next 1/4] net: qdisc: add op to run filters/actions before enqueue
On 09/02/2015 08:22 AM, Cong Wang wrote:
> (Why not Cc'ing Jamal for net_sched pathes?)

Sorry, forgot about it, and thanks for the Cc!
[PATCH net-next] xen-netback: add support for multicast control
Xen's PV network protocol includes messages to add/remove ethernet
multicast addresses to/from a filter list in the backend. This allows
the frontend to request the backend only forward multicast packets
which are off interest thus preventing unnecessary noise on the shared
ring.

The canonical netif header in git://xenbits.xen.org/xen.git specifies
the message format (two more XEN_NETIF_EXTRA_TYPEs) so the minimal
necessary changes have been pulled into include/xen/interface/io/netif.h.

To prevent the frontend from extending the multicast filter list
arbitrarily a limit (XEN_NETBK_MCAST_MAX) has been set to 64 entries.
This limit is not specified by the protocol and so may change in future.
If the limit is reached then the next XEN_NETIF_EXTRA_TYPE_MCAST_ADD
sent by the frontend will be failed with NETIF_RSP_ERROR.

Signed-off-by: Paul Durrant
Cc: Ian Campbell
Cc: Wei Liu
---
 drivers/net/xen-netback/common.h    | 15 ++
 drivers/net/xen-netback/interface.c | 10
 drivers/net/xen-netback/netback.c   | 99 +++
 drivers/net/xen-netback/xenbus.c    | 13 +
 include/xen/interface/io/netif.h    |  8 ++-
 5 files changed, 144 insertions(+), 1 deletion(-)

diff --git a/drivers/net/xen-netback/common.h b/drivers/net/xen-netback/common.h
index c6cb85a..9d1eb28 100644
--- a/drivers/net/xen-netback/common.h
+++ b/drivers/net/xen-netback/common.h
@@ -210,12 +210,22 @@ enum state_bit_shift {
 	VIF_STATUS_CONNECTED,
 };

+struct xenvif_mcast_addr {
+	struct list_head entry;
+	struct rcu_head rcu;
+	u8 addr[6];
+};
+
+#define XEN_NETBK_MCAST_MAX 64
+
 struct xenvif {
 	/* Unique identifier for this interface. */
 	domid_t domid;
 	unsigned int handle;

 	u8 fe_dev_addr[6];
+	struct list_head fe_mcast_addr;
+	unsigned int fe_mcast_count;

 	/* Frontend feature information. */
 	int gso_mask;
@@ -224,6 +234,7 @@ struct xenvif {
 	u8 can_sg:1;
 	u8 ip_csum:1;
 	u8 ipv6_csum:1;
+	u8 multicast_control:1;

 	/* Is this interface disabled? True when backend discovers
 	 * frontend is rogue.
@@ -341,4 +352,8 @@ void xenvif_skb_zerocopy_prepare(struct xenvif_queue *queue,
 				 struct sk_buff *skb);
 void xenvif_skb_zerocopy_complete(struct xenvif_queue *queue);

+/* Multicast control */
+bool xenvif_mcast_match(struct xenvif *vif, const u8 *addr);
+void xenvif_mcast_flush(struct xenvif *vif);
+
 #endif /* __XEN_NETBACK__COMMON_H__ */
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 28577a3..a7785f2 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -171,6 +171,13 @@ static int xenvif_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	    !xenvif_schedulable(vif))
 		goto drop;

+	if (vif->multicast_control && skb->pkt_type == PACKET_MULTICAST) {
+		struct ethhdr *eth = (struct ethhdr *)skb->data;
+
+		if (!xenvif_mcast_match(vif, eth->h_dest))
+			goto drop;
+	}
+
 	cb = XENVIF_RX_CB(skb);
 	cb->expires = jiffies + vif->drain_timeout;

@@ -427,6 +434,7 @@ struct xenvif *xenvif_alloc(struct device *parent, domid_t domid,
 	vif->num_queues = 0;

 	spin_lock_init(&vif->lock);
+	INIT_LIST_HEAD(&vif->fe_mcast_addr);

 	dev->netdev_ops = &xenvif_netdev_ops;
 	dev->hw_features = NETIF_F_SG |
@@ -661,6 +669,8 @@ void xenvif_disconnect(struct xenvif *vif)

 		xenvif_unmap_frontend_rings(queue);
 	}
+
+	xenvif_mcast_flush(vif);
 }

 /* Reverse the relevant parts of xenvif_init_queue().
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 3f44b52..e0324eb 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1157,6 +1157,80 @@ static bool tx_credit_exceeded(struct xenvif_queue *queue, unsigned size)
 	return false;
 }

+/* No locking is required in xenvif_mcast_add/del() as they are
+ * only ever invoked from NAPI poll. An RCU list is used because
+ * xenvif_mcast_match() is called asynchronously, during start_xmit.
+ */
+
+static int xenvif_mcast_add(struct xenvif *vif, const u8 *addr)
+{
+	struct xenvif_mcast_addr *mcast;
+
+	if (vif->fe_mcast_count == XEN_NETBK_MCAST_MAX) {
+		if (net_ratelimit())
+			netdev_err(vif->dev,
+				   "Too many multicast addresses\n");
+		return -ENOSPC;
+	}
+
+	mcast = kzalloc(sizeof(*mcast), GFP_ATOMIC);
+	if (!mcast)
+		return -ENOMEM;
+
+	ether_addr_copy(mcast->addr, addr);
+	list_add_tail_rcu(&mcast->entry, &vif->fe_mcast_addr);
+	vif->fe_mcast_count++;
+
+	return 0;
+}
+
+static void xenvif_mcast_del(struct xenvif *vif,