Re: Fwd: u32 ht filters
Thu, Feb 08, 2018 at 12:08:36AM CET, xiyou.wangc...@gmail.com wrote: >On Tue, Feb 6, 2018 at 11:01 PM, Jiri Pirko wrote: >> Wed, Feb 07, 2018 at 06:09:15AM CET, xiyou.wangc...@gmail.com wrote: >>>Hi, Jiri >>> >>>Your commit 7fa9d974f3c2a016b9accb18f4ee2ed2a738585c >>>breaks the tc script by Paweł. Please find below for details. >> >> Did you do the bisection? >> The commit just uses block struct instead of q, but since they >> are in 1:1 relation, that should be equvivalent. So basically you still >> have per-qdisc hashtables for u32. > >Well, at least the following fixes the problem here. But I am not sure >if it is expected too for shared block among multiple qdiscs. For shared block, block->q is null. > > >@@ -338,7 +330,7 @@ static struct hlist_head *tc_u_common_hash; > > static unsigned int tc_u_hash(const struct tcf_proto *tp) > { >- return hash_ptr(tp->chain->block, U32_HASH_SHIFT); >+ return hash_ptr(tp->chain->block->q, U32_HASH_SHIFT); > } > > static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp) >@@ -348,7 +340,7 @@ static struct tc_u_common *tc_u_common_find(const >struct tcf_proto *tp) > >h = tc_u_hash(tp); >hlist_for_each_entry(tc, &tc_u_common_hash[h], hnode) { >- if (tc->block == tp->chain->block) >+ if (tc->block->q == tp->chain->block->q) :O I don't get it. tc->block is pointer, tc->block->q is pointer. And they are different at the same time for non-shared block. >return tc; >} >return NULL;
Re: [PATCH net V3 2/2] ptr_ring: fail on large queue size (>64K)
On 2018年02月08日 12:52, Michael S. Tsirkin wrote: On Thu, Feb 08, 2018 at 11:59:25AM +0800, Jason Wang wrote: We need limit the maximum size of queue, otherwise it may cause several side effects e.g slab will warn when the size exceeds KMALLOC_MAX_SIZE. Using KMALLOC_MAX_SIZE still looks too so this patch tries to limit it to 64K. This value could be revisited if we found a real case that needs more. Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") Signed-off-by: Jason Wang --- include/linux/ptr_ring.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 2af71a7..5858d48 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -44,6 +44,8 @@ struct ptr_ring { void **queue; }; Seems like a weird location for a define. Either put defines on top of the file, or near where they are used. I prefer the second option. Ok. +#define PTR_RING_MAX_ALLOC 65536 + I guess it's an arbitrary number. Seems like a sufficiently large one, but pls add a comment so readers don't wonder. And please explain what it does: /* Callers can create ptr_ring structures with userspace-supplied * parameters. This sets a limit on the size to make that usecase * safe. If you ever change this, make sure to audit all callers. */ Also I think we should generally use either hex 0x1 or (1 << 16). I agree the number is arbitrary, so I still prefer the KMALLOC_MAX_SIZE especially consider it was used by pfifo_fast now. Try to limit it to an arbitrary may break lots of exist setups. E.g just google "txqueuelen 10" can give me a lots of search results. We can do any kind of optimization on top but not for -net now. Thanks /* Note: callers invoking this in a loop must use a compiler barrier, * for example cpu_relax(). * @@ -466,6 +468,8 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp) { + if (size > PTR_RING_MAX_ALLOC) + return NULL; return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO); } -- 2.7.4
Re: [PATCH net V3 1/2] ptr_ring: try vmalloc() when kmalloc() fails
On 2018年02月08日 12:45, Michael S. Tsirkin wrote: On Thu, Feb 08, 2018 at 11:59:24AM +0800, Jason Wang wrote: This patch switch to use kvmalloc_array() for using a vmalloc() fallback to help in case kmalloc() fails. Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") I guess the actual patch is the one that switches tun to ptr_ring. I think not, since the issue was large allocation. In fact, I think the actual bugfix is patch 2/2. This specific one just makes kmalloc less likely to fail but that's not what syzbot reported. Agree. Then I would add this patch on top to make kmalloc less likely to fail. Ok. Signed-off-by: Jason Wang --- include/linux/ptr_ring.h | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 1883d61..2af71a7 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp) { - return kcalloc(size, sizeof(void *), gfp); + return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO); } static inline void __ptr_ring_set_size(struct ptr_ring *r, int size) This implies a bunch of limitations on the flags. From kvmalloc_node docs: * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is * preferable to the vmalloc fallback, due to visible performance drawbacks. Fine with all the current users, but if we go this way, please add documentation so future users don't misuse this API. I suspect this is somehow a overkill since this means we need sync with mm/vmalloc changes in the future to keep it synced. Alternatively, test flags and call kvmalloc or kcalloc? Similar to the above issue, I would rather leave it as is. Thanks @@ -601,7 +601,7 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp, spin_unlock(&(r)->producer_lock); spin_unlock_irqrestore(&(r)->consumer_lock, flags); - kfree(old); + kvfree(old); return 0; } @@ -641,7 +641,7 @@ static inline int ptr_ring_resize_multiple(struct ptr_ring **rings, } for (i = 0; i < nrings; ++i) - kfree(queues[i]); + kvfree(queues[i]); kfree(queues); @@ -649,7 +649,7 @@ static inline int ptr_ring_resize_multiple(struct ptr_ring **rings, nomem: while (--i >= 0) - kfree(queues[i]); + kvfree(queues[i]); kfree(queues); @@ -664,7 +664,7 @@ static inline void ptr_ring_cleanup(struct ptr_ring *r, void (*destroy)(void *)) if (destroy) while ((ptr = ptr_ring_consume(r))) destroy(ptr); - kfree(r->queue); + kvfree(r->queue); } #endif /* _LINUX_PTR_RING_H */ -- 2.7.4
[PATCH net-queue 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"
This partially reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d. We keep the fix for the first part of the problem (1) described in the log of that commit, that is to read ICR in the other interrupt handler. We remove the fix for the second part of the problem (2), Other interrupt throttling. Bursts of "Other" interrupts may once again occur during rxo (receive overflow) traffic conditions. This is deemed acceptable in the interest of avoiding unforeseen fallout from changes that are not strictly necessary. As discussed, the e1000e driver should be in "maintenance mode". Link: https://www.spinics.net/lists/netdev/msg480675.html Signed-off-by: Benjamin Poirier --- drivers/net/ethernet/intel/e1000e/netdev.c | 16 ++-- 1 file changed, 2 insertions(+), 14 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 153ad406c65e..3b36efa6228d 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1915,21 +1915,10 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) struct e1000_adapter *adapter = netdev_priv(netdev); struct e1000_hw *hw = &adapter->hw; u32 icr; - bool enable = true; icr = er32(ICR); ew32(ICR, E1000_ICR_OTHER); - if (icr & E1000_ICR_RXO) { - ew32(ICR, E1000_ICR_RXO); - enable = false; - /* napi poll will re-enable Other, make sure it runs */ - if (napi_schedule_prep(&adapter->napi)) { - adapter->total_rx_bytes = 0; - adapter->total_rx_packets = 0; - __napi_schedule(&adapter->napi); - } - } if (icr & E1000_ICR_LSC) { ew32(ICR, E1000_ICR_LSC); hw->mac.get_link_status = true; @@ -1938,7 +1927,7 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) mod_timer(&adapter->watchdog_timer, jiffies + 1); } - if (enable && !test_bit(__E1000_DOWN, &adapter->state)) + if (!test_bit(__E1000_DOWN, &adapter->state)) ew32(IMS, E1000_IMS_OTHER); return IRQ_HANDLED; @@ -2708,8 +2697,7 @@ static int e1000e_poll(struct napi_struct *napi, int weight) napi_complete_done(napi, work_done); if (!test_bit(__E1000_DOWN, &adapter->state)) { if (adapter->msix_entries) - ew32(IMS, adapter->rx_ring->ims_val | -E1000_IMS_OTHER); + ew32(IMS, adapter->rx_ring->ims_val); else e1000_irq_enable(adapter); } -- 2.16.1
[PATCH net-queue 2/3] e1000e: Fix queue interrupt re-raising in Other interrupt.
restores the ICS write for rx/tx queue interrupts which was present before commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1) but was not restored in commit 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts", v4.15-rc1). This re-raises the queue interrupts in case the txq or rxq bits were set in ICR and the Other interrupt handler read and cleared ICR before the queue interrupt was raised. Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts") Signed-off-by: Benjamin Poirier --- drivers/net/ethernet/intel/e1000e/netdev.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 3b36efa6228d..2c9609bee2ae 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1919,6 +1919,9 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) icr = er32(ICR); ew32(ICR, E1000_ICR_OTHER); + if (icr & adapter->eiac_mask) + ew32(ICS, (icr & adapter->eiac_mask)); + if (icr & E1000_ICR_LSC) { ew32(ICR, E1000_ICR_LSC); hw->mac.get_link_status = true; -- 2.16.1
[PATCH v2 1/2] net, can, ifi: fix "write buffer full" error
the driver reads in the ISR first the IRQpending register, and clears after that in a write *all* bits in it. It could happen that the isr register raise bits between this 2 register accesses, which leads in lost bits ... In case it clears "TX message sent successfully", the driver never sends any Tx data, and buffers to userspace run over. Fixed this: clear only the bits in the IRQpending register, the driver had read. Signed-off-by: Heiko Schocher Reviewed-by: Marek Vasut --- Changes in v2: - add Reviewed-by from Marek drivers/net/can/ifi_canfd/ifi_canfd.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/can/ifi_canfd/ifi_canfd.c b/drivers/net/can/ifi_canfd/ifi_canfd.c index 2772d05ff11c..05feb8177936 100644 --- a/drivers/net/can/ifi_canfd/ifi_canfd.c +++ b/drivers/net/can/ifi_canfd/ifi_canfd.c @@ -607,7 +607,7 @@ static irqreturn_t ifi_canfd_isr(int irq, void *dev_id) return IRQ_NONE; /* Clear all pending interrupts but ErrWarn */ - writel(clr_irq_mask, priv->base + IFI_CANFD_INTERRUPT); + writel(isr & clr_irq_mask, priv->base + IFI_CANFD_INTERRUPT); /* RX IRQ or bus warning, start NAPI */ if (isr & rx_irq_mask) { -- 2.14.3
[PATCH v2 2/2] net, can, ifi: loopback Tx message in IFI block
Current ifi driver reads first Rx messages, than loopback the Tx message, if the IFI_CANFD_INTERRUPT_TXFIFO_REMOVE bit is set. This can lead into the case, that Rx messages overhelm Tx messages! Fixed this in the following way: Set in the IFI_CANFD_TXFIFO_DLC register the FN value to 1, so the IFI block loopsback itself the Tx message when sended correctly on the canfd bus. Only the IFI block can insert the Tx message in the correct place. The linux driver now needs only to free the skb, when the Tx message was sended correctly. Signed-off-by: Heiko Schocher Reviewed-by: Marek Vasut --- Changes in v2: - add Reviewed-by from Marek, fixed comment into one liner drivers/net/can/ifi_canfd/ifi_canfd.c | 9 - 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/net/can/ifi_canfd/ifi_canfd.c b/drivers/net/can/ifi_canfd/ifi_canfd.c index 05feb8177936..ee74ee8f9b38 100644 --- a/drivers/net/can/ifi_canfd/ifi_canfd.c +++ b/drivers/net/can/ifi_canfd/ifi_canfd.c @@ -211,6 +211,7 @@ struct ifi_canfd_priv { struct napi_struct napi; struct net_device *ndev; void __iomem*base; + unsigned inttx_len; }; static void ifi_canfd_irq_enable(struct net_device *ndev, bool enable) @@ -617,8 +618,10 @@ static irqreturn_t ifi_canfd_isr(int irq, void *dev_id) /* TX IRQ */ if (isr & IFI_CANFD_INTERRUPT_TXFIFO_REMOVE) { - stats->tx_bytes += can_get_echo_skb(ndev, 0); + can_free_echo_skb(ndev, 0); + stats->tx_bytes += priv->tx_len; stats->tx_packets++; + priv->tx_len = 0; can_led_event(ndev, CAN_LED_EVENT_TX); } @@ -889,6 +892,7 @@ static netdev_tx_t ifi_canfd_start_xmit(struct sk_buff *skb, } txdlc = can_len2dlc(cf->len); + priv->tx_len = txdlc; if ((priv->can.ctrlmode & CAN_CTRLMODE_FD) && can_is_canfd_skb(skb)) { txdlc |= IFI_CANFD_TXFIFO_DLC_EDL; if (cf->flags & CANFD_BRS) @@ -898,6 +902,9 @@ static netdev_tx_t ifi_canfd_start_xmit(struct sk_buff *skb, if (cf->can_id & CAN_RTR_FLAG) txdlc |= IFI_CANFD_TXFIFO_DLC_RTR; + /* set FNR to 1, so we get our Tx Message looped back into RxFIFO */ + txdlc += (1 << IFI_CANFD_TXFIFO_DLC_FNR_OFFSET); + /* message ram configuration */ writel(txid, priv->base + IFI_CANFD_TXFIFO_ID); writel(txdlc, priv->base + IFI_CANFD_TXFIFO_DLC); -- 2.14.3
[PATCH net-queue 3/3] e1000e: Avoid missed interrupts following ICR read.
The 82574 specification update errata 12 states that interrupts may be missed if ICR is read while INT_ASSERTED is not set. Avoid that problem by setting all bits related to events that can trigger the Other interrupt in IMS. The Other interrupt is raised for such events regardless of whether or not they are set in IMS. However, only when they are set is the INT_ASSERTED bit also set in ICR. By doing this, we ensure that INT_ASSERTED is always set when we read ICR in e1000_msix_other() and steer clear of the errata. This also ensures that ICR will automatically be cleared on read, therefore we no longer need to clear bits explicitly. Signed-off-by: Benjamin Poirier --- drivers/net/ethernet/intel/e1000e/defines.h | 21 - drivers/net/ethernet/intel/e1000e/netdev.c | 11 --- 2 files changed, 24 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h index afb7ebe20b24..824fd44e25f0 100644 --- a/drivers/net/ethernet/intel/e1000e/defines.h +++ b/drivers/net/ethernet/intel/e1000e/defines.h @@ -400,6 +400,10 @@ #define E1000_ICR_RXDMT00x0010 /* Rx desc min. threshold (0) */ #define E1000_ICR_RXO 0x0040 /* Receiver Overrun */ #define E1000_ICR_RXT0 0x0080 /* Rx timer intr (ring 0) */ +#define E1000_ICR_MDAC 0x0200 /* MDIO Access Complete */ +#define E1000_ICR_SRPD 0x0001 /* Small Receive Packet Detected */ +#define E1000_ICR_ACK 0x0002 /* Receive ACK Frame Detected */ +#define E1000_ICR_MNG 0x0004 /* Manageability Event Detected */ #define E1000_ICR_ECCER 0x0040 /* Uncorrectable ECC Error */ /* If this bit asserted, the driver should claim the interrupt */ #define E1000_ICR_INT_ASSERTED 0x8000 @@ -407,7 +411,7 @@ #define E1000_ICR_RXQ1 0x0020 /* Rx Queue 1 Interrupt */ #define E1000_ICR_TXQ0 0x0040 /* Tx Queue 0 Interrupt */ #define E1000_ICR_TXQ1 0x0080 /* Tx Queue 1 Interrupt */ -#define E1000_ICR_OTHER 0x0100 /* Other Interrupts */ +#define E1000_ICR_OTHER 0x0100 /* Other Interrupt */ /* PBA ECC Register */ #define E1000_PBA_ECC_COUNTER_MASK 0xFFF0 /* ECC counter mask */ @@ -431,12 +435,27 @@ E1000_IMS_RXSEQ |\ E1000_IMS_LSC) +/* These are all of the events related to the OTHER interrupt. + */ +#define IMS_OTHER_MASK ( \ + E1000_IMS_LSC | \ + E1000_IMS_RXO | \ + E1000_IMS_MDAC | \ + E1000_IMS_SRPD | \ + E1000_IMS_ACK | \ + E1000_IMS_MNG) + /* Interrupt Mask Set */ #define E1000_IMS_TXDW E1000_ICR_TXDW /* Transmit desc written back */ #define E1000_IMS_LSC E1000_ICR_LSC /* Link Status Change */ #define E1000_IMS_RXSEQ E1000_ICR_RXSEQ /* Rx sequence error */ #define E1000_IMS_RXDMT0E1000_ICR_RXDMT0/* Rx desc min. threshold */ +#define E1000_IMS_RXO E1000_ICR_RXO /* Receiver Overrun */ #define E1000_IMS_RXT0 E1000_ICR_RXT0 /* Rx timer intr */ +#define E1000_IMS_MDAC E1000_ICR_MDAC /* MDIO Access Complete */ +#define E1000_IMS_SRPD E1000_ICR_SRPD /* Small Receive Packet */ +#define E1000_IMS_ACK E1000_ICR_ACK /* Receive ACK Frame Detected */ +#define E1000_IMS_MNG E1000_ICR_MNG /* Manageability Event */ #define E1000_IMS_ECCER E1000_ICR_ECCER /* Uncorrectable ECC Error */ #define E1000_IMS_RXQ0 E1000_ICR_RXQ0 /* Rx Queue 0 Interrupt */ #define E1000_IMS_RXQ1 E1000_ICR_RXQ1 /* Rx Queue 1 Interrupt */ diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 2c9609bee2ae..9fd4050a91ca 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -1914,16 +1914,12 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) struct net_device *netdev = data; struct e1000_adapter *adapter = netdev_priv(netdev); struct e1000_hw *hw = &adapter->hw; - u32 icr; - - icr = er32(ICR); - ew32(ICR, E1000_ICR_OTHER); + u32 icr = er32(ICR); if (icr & adapter->eiac_mask) ew32(ICS, (icr & adapter->eiac_mask)); if (icr & E1000_ICR_LSC) { - ew32(ICR, E1000_ICR_LSC); hw->mac.get_link_status = true; /* guard against interrupt when we're going down */ if (!test_bit(__E1000_DOWN, &adapter->state)) @@ -1931,7 +1927,7 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data) } if (!test_bit(__E1000_DOWN, &adapter->state)) - ew32(IMS, E1000_IMS_OTHER); + ew32(IMS, E1000_IMS_OTHER | IMS_OTHER_MASK); return IRQ_HANDLED; } @@ -2258,7 +2254,8 @@ static void e1000_irq_enable(struct e1000_adapter *adapter) if (
[PATCH] mpls, nospec: Sanitize array index in mpls_label_ok()
mpls_label_ok() validates that the 'platform_label' array index from a userspace netlink message payload is valid. Under speculation the mpls_label_ok() result may not resolve in the CPU pipeline until after the index is used to access an array element. Sanitize the index to zero to prevent userspace-controlled arbitrary out-of-bounds speculation, a precursor for a speculative execution side channel vulnerability. Cc: Cc: "David S. Miller" Cc: Eric W. Biederman Signed-off-by: Dan Williams --- net/mpls/af_mpls.c | 24 ++-- 1 file changed, 14 insertions(+), 10 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 8ca9915befc8..aae3565c3a92 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -935,24 +936,27 @@ static int mpls_nh_build_multi(struct mpls_route_config *cfg, return err; } -static bool mpls_label_ok(struct net *net, unsigned int index, +static bool mpls_label_ok(struct net *net, unsigned int *index, struct netlink_ext_ack *extack) { + bool is_ok = true; + /* Reserved labels may not be set */ - if (index < MPLS_LABEL_FIRST_UNRESERVED) { + if (*index < MPLS_LABEL_FIRST_UNRESERVED) { NL_SET_ERR_MSG(extack, "Invalid label - must be MPLS_LABEL_FIRST_UNRESERVED or higher"); - return false; + is_ok = false; } /* The full 20 bit range may not be supported. */ - if (index >= net->mpls.platform_labels) { + if (is_ok && *index >= net->mpls.platform_labels) { NL_SET_ERR_MSG(extack, "Label >= configured maximum in platform_labels"); - return false; + is_ok = false; } - return true; + *index = array_index_nospec(*index, net->mpls.platform_labels); + return is_ok; } static int mpls_route_add(struct mpls_route_config *cfg, @@ -975,7 +979,7 @@ static int mpls_route_add(struct mpls_route_config *cfg, index = find_free_label(net); } - if (!mpls_label_ok(net, index, extack)) + if (!mpls_label_ok(net, &index, extack)) goto errout; /* Append makes no sense with mpls */ @@ -1052,7 +1056,7 @@ static int mpls_route_del(struct mpls_route_config *cfg, index = cfg->rc_label; - if (!mpls_label_ok(net, index, extack)) + if (!mpls_label_ok(net, &index, extack)) goto errout; mpls_route_update(net, index, NULL, &cfg->rc_nlinfo); @@ -1810,7 +1814,7 @@ static int rtm_to_route_config(struct sk_buff *skb, goto errout; if (!mpls_label_ok(cfg->rc_nlinfo.nl_net, - cfg->rc_label, extack)) + &cfg->rc_label, extack)) goto errout; break; } @@ -2137,7 +2141,7 @@ static int mpls_getroute(struct sk_buff *in_skb, struct nlmsghdr *in_nlh, goto errout; } - if (!mpls_label_ok(net, in_label, extack)) { + if (!mpls_label_ok(net, &in_label, extack)) { err = -EINVAL; goto errout; }
Re: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"
On 2018/01/29 09:22, Alexander Duyck wrote: [...] > > > Consequently, we must clear OTHER manually from ICR, otherwise the > > interrupt is immediately re-raised after exiting the handler. > > > > These observations are the same whether the interrupt is triggered via a > > write to ICS or in hardware. > > > > Furthermore, I tested that this behavior is the same for other Other > > events (MDAC, SRPD, ACK, MNG). Those were tested via a write to ICS > > only, not in hardware. > > > > This is a version of the test patch that I used to trigger lsc and rxo in > > software and hardware. It applies over this patch series. > > I plan to look into this some more over the next few days. Ideally if > we could mask these "OTHER" interrupts besides the LSC we could comply > with all the needed bits for MSI-X. My concern is that we are still > stuck reading the ICR at this point because of this and it is going to > make dealing with MSI-X challenging on 82574 since it seems like the > intention was that you weren't supposed to be reading the ICR when > MSI-X is enabled based on the list of current issues and HW errata. I totally agree with you that it looks like the msi-x interface was designed so you don't need to read icr. That's also why I was happy to go that direction with the (now infamous) commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt", v4.5-rc1). However, we looked at it before and there seems to be no way to mask individual Other interrupt causes (masking rxo but getting lsc). Because of that, I think we have to keep reading icr in the Other interrupt handler. > > At this point it seems like the interrupts is firing and the > INT_ASSERTED is all we really need to be checking for if I understand > this all correctly. Basically if LSC is set it will trigger OTHER and > INT_ASSERTED, if any of the other causes are set they are only setting > OTHER. I think that's right and it's related to the fact that currently LSC is set in IMS but not the other causes. Since we have to read icr (as I wrote above) but we want to avoid reading it without INT_ASSERTED set (as per errata 12) the solution will be to set all of the causes related to Other in IMS. Patches incoming...
Re: [PATCH net v2] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
Am 08.02.2018 um 00:00 schrieb Florian Fainelli: > On 02/07/2018 11:44 AM, Heiner Kallweit wrote: >> This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added >> long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates >> also PHY state changes and we should do what the symbol says. >> >> Signed-off-by: Heiner Kallweit >> --- >> v2: >> - use phy_interrupt_is_valid() instead of checking for irq > 0 > > Thanks, could you identify which Fixes: tag we should be using for that? > It would be great to see this backported to -stable > Before submitting the change I had a look at the phylib history for when the wrong behavior was introduced. However it has been there since introduction of phylib in 2005. Different patches fixed this in few, but not all places. Latest one was 84a527a41f38 "net: phylib: fix interrupts re-enablement in phy_start" which just partially fixed the issue. So we could declare the change now to fix this fix. >> --- >> drivers/net/phy/phy.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c >> index f3313a129..50ed35a45 100644 >> --- a/drivers/net/phy/phy.c >> +++ b/drivers/net/phy/phy.c >> @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev) >> phy_resume(phydev); >> >> /* make sure interrupts are re-enabled for the PHY */ >> -if (phydev->irq != PHY_POLL) { >> +if (phy_interrupt_is_valid(phydev)) { >> err = phy_enable_interrupts(phydev); >> if (err < 0) >> break; >> > >
[Patch net] ipt_CLUSTERIP: fix a refcount bug in clusterip_config_find_get()
In clusterip_config_find_get() we hold RCU read lock so it could run concurrently with clusterip_config_entry_put(), as a result, the refcnt could go back to 1 from 0, which leads to a double list_del()... Just replace refcount_inc() with refcount_inc_not_zero(), as for c->refcount. Fixes: d73f33b16883 ("netfilter: CLUSTERIP: RCU conversion") Cc: Eric Dumazet Cc: Pablo Neira Ayuso Signed-off-by: Cong Wang --- net/ipv4/netfilter/ipt_CLUSTERIP.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c index 1ff72b87a066..4537b1686c7c 100644 --- a/net/ipv4/netfilter/ipt_CLUSTERIP.c +++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c @@ -154,8 +154,10 @@ clusterip_config_find_get(struct net *net, __be32 clusterip, int entry) #endif if (unlikely(!refcount_inc_not_zero(&c->refcount))) c = NULL; - else if (entry) - refcount_inc(&c->entries); + else if (entry) { + if (unlikely(!refcount_inc_not_zero(&c->entries))) + c = NULL; + } } rcu_read_unlock_bh(); -- 2.13.0
[Patch net] ipt_CLUSTERIP: fix a race condition of proc file creation
There is a race condition between clusterip_config_entry_put() and clusterip_config_init(), after we release the spinlock in clusterip_config_entry_put(), a new proc file with a same IP could be created immediately since it is already removed from the configs list, therefore it triggers this warning: [ cut here ] proc_dir_entry 'ipt_CLUSTERIP/172.20.0.170' already registered WARNING: CPU: 1 PID: 4152 at fs/proc/generic.c:330 proc_register+0x2a4/0x370 fs/proc/generic.c:329 Kernel panic - not syncing: panic_on_warn set ... As a quick fix, just move the proc_remove() inside the spinlock. Reported-by: Fixes: 6c5d5cfbe3c5 ("netfilter: ipt_CLUSTERIP: check duplicate config when initializing") Tested-by: Paolo Abeni Cc: Xin Long Cc: Pablo Neira Ayuso Signed-off-by: Cong Wang --- net/ipv4/netfilter/ipt_CLUSTERIP.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c index 3a84a60f6b39..1ff72b87a066 100644 --- a/net/ipv4/netfilter/ipt_CLUSTERIP.c +++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c @@ -107,12 +107,6 @@ clusterip_config_entry_put(struct net *net, struct clusterip_config *c) local_bh_disable(); if (refcount_dec_and_lock(&c->entries, &cn->lock)) { - list_del_rcu(&c->list); - spin_unlock(&cn->lock); - local_bh_enable(); - - unregister_netdevice_notifier(&c->notifier); - /* In case anyone still accesses the file, the open/close * functions are also incrementing the refcount on their own, * so it's safe to remove the entry even if it's in use. */ @@ -120,6 +114,12 @@ clusterip_config_entry_put(struct net *net, struct clusterip_config *c) if (cn->procdir) proc_remove(c->pde); #endif + list_del_rcu(&c->list); + spin_unlock(&cn->lock); + local_bh_enable(); + + unregister_netdevice_notifier(&c->notifier); + return; } local_bh_enable(); -- 2.13.0
[PATCH] ath9k: turn on btcoex_enable as default
Without btcoex_enable, WiFi activies make both WiFi and Bluetooth unstable if there's a bluetooth connection. Enable this option when bt_ant_diversity is disabled. BugLink: https://bugs.launchpad.net/bugs/1746164 Signed-off-by: Kai-Heng Feng --- drivers/net/wireless/ath/ath9k/init.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/net/wireless/ath/ath9k/init.c b/drivers/net/wireless/ath/ath9k/init.c index e479fae5aab9..f8f6b091a077 100644 --- a/drivers/net/wireless/ath/ath9k/init.c +++ b/drivers/net/wireless/ath/ath9k/init.c @@ -56,7 +56,7 @@ static int ath9k_led_active_high = -1; module_param_named(led_active_high, ath9k_led_active_high, int, 0444); MODULE_PARM_DESC(led_active_high, "Invert LED polarity"); -static int ath9k_btcoex_enable; +static int ath9k_btcoex_enable = 1; module_param_named(btcoex_enable, ath9k_btcoex_enable, int, 0444); MODULE_PARM_DESC(btcoex_enable, "Enable wifi-BT coexistence"); @@ -693,7 +693,6 @@ static int ath9k_init_softc(u16 devid, struct ath_softc *sc, common->hw = sc->hw; common->priv = sc; common->debug_mask = ath9k_debug; - common->btcoex_enabled = ath9k_btcoex_enable == 1; common->disable_ani = false; /* @@ -715,14 +714,17 @@ static int ath9k_init_softc(u16 devid, struct ath_softc *sc, /* * Enable WLAN/BT RX Antenna diversity only when: * -* - BTCOEX is disabled. * - the user manually requests the feature. * - the HW cap is set using the platform data. */ - if (!common->btcoex_enabled && ath9k_bt_ant_diversity && + if (ath9k_bt_ant_diversity && (pCap->hw_caps & ATH9K_HW_CAP_BT_ANT_DIV)) common->bt_ant_diversity = 1; + /* Enable btcoex when ant_diversity is disabled */ + if (!common->bt_ant_diversity && ath9k_btcoex_enable) + common->btcoex_enabled = 1; + spin_lock_init(&common->cc_lock); spin_lock_init(&sc->intr_lock); spin_lock_init(&sc->sc_serial_rw); -- 2.15.1
[PATCH net 2/5] nfp: don't advertise hw-tc-offload on non-port netdevs
nfp_port is a structure which represents an ASIC port, both PCIe vNIC (on a PF or a VF) or the external MAC port. vNIC netdev (struct nfp_net) and pure representor netdev (struct nfp_repr) both have a pointer to this structure. nfp_reprs always have a port associated. nfp_nets, however, only represent a device port in legacy mode, where they are considered the MAC port. In switchdev mode they are just the CPU's side of the PCIe link. By definition TC offloads only apply to device ports. Don't set the flag on vNICs without a port (i.e. in switchdev mode). Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman Tested-by: Pieter Jansen van Vuuren --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index c0fd351c86b1..fe77ea8b656c 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3734,7 +3734,7 @@ static void nfp_net_netdev_init(struct nfp_net *nn) netdev->features = netdev->hw_features; - if (nfp_app_has_tc(nn->app)) + if (nfp_app_has_tc(nn->app) && nn->port) netdev->hw_features |= NETIF_F_HW_TC; /* Advertise but disable TSO by default. */ -- 2.15.1
[PATCH net 5/5] nfp: populate MODULE_VERSION
DKMS and similar out-of-tree module replacement services use module version to make sure the out-of-tree software is not older than the module shipped with the kernel. We use the kernel version in ethtool -i output, put it into MODULE_VERSION as well. Reported-by: Jan Gutter Signed-off-by: Jakub Kicinski Reviewed-by: Dirk van der Merwe --- drivers/net/ethernet/netronome/nfp/nfp_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c index cc570bb6563c..ab301d56430b 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c @@ -649,3 +649,4 @@ MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_2x25.nffw"); MODULE_AUTHOR("Netronome Systems "); MODULE_LICENSE("GPL"); MODULE_DESCRIPTION("The Netronome Flow Processor (NFP) driver."); +MODULE_VERSION(UTS_RELEASE); -- 2.15.1
[PATCH net 1/5] nfp: bpf: require ETH table
Upcoming changes will require all netdevs supporting TC offloads to have a full struct nfp_port. Require those for BPF offload. The operation without management FW reporting information about Ethernet ports is something we only support for very old and very basic NIC firmwares anyway. Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman Tested-by: Pieter Jansen van Vuuren --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 12 1 file changed, 12 insertions(+) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index 322027792fe8..61898dda11cf 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -35,6 +35,7 @@ #include "../nfpcore/nfp_cpp.h" #include "../nfpcore/nfp_nffw.h" +#include "../nfpcore/nfp_nsp.h" #include "../nfp_app.h" #include "../nfp_main.h" #include "../nfp_net.h" @@ -87,9 +88,20 @@ static const char *nfp_bpf_extra_cap(struct nfp_app *app, struct nfp_net *nn) static int nfp_bpf_vnic_alloc(struct nfp_app *app, struct nfp_net *nn, unsigned int id) { + struct nfp_pf *pf = app->pf; struct nfp_bpf_vnic *bv; int err; + if (!pf->eth_tbl) { + nfp_err(pf->cpp, "No ETH table\n"); + return -EINVAL; + } + if (pf->max_data_vnics != pf->eth_tbl->count) { + nfp_err(pf->cpp, "ETH entries don't match vNICs (%d vs %d)\n", + pf->max_data_vnics, pf->eth_tbl->count); + return -EINVAL; + } + bv = kzalloc(sizeof(*bv), GFP_KERNEL); if (!bv) return -ENOMEM; -- 2.15.1
[PATCH net 3/5] nfp: forbid disabling hw-tc-offload on representors while offload active
All netdevs which can accept TC offloads must implement .ndo_set_features(). nfp_reprs currently do not do that, which means hw-tc-offload can be turned on and off even when offloads are active. Whether the offloads are active is really a question to nfp_ports, so remove the per-app tc_busy callback indirection thing, and simply count the number of offloaded items in nfp_port structure. Fixes: 8a2768732a4d ("nfp: provide infrastructure for offloading flower based TC filters") Signed-off-by: Jakub Kicinski Reviewed-by: Simon Horman Tested-by: Pieter Jansen van Vuuren --- drivers/net/ethernet/netronome/nfp/bpf/main.c | 9 + drivers/net/ethernet/netronome/nfp/flower/offload.c | 4 drivers/net/ethernet/netronome/nfp/nfp_app.h| 9 - drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 7 +++ drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 1 + drivers/net/ethernet/netronome/nfp/nfp_port.c | 18 ++ drivers/net/ethernet/netronome/nfp/nfp_port.h | 6 ++ 7 files changed, 33 insertions(+), 21 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c index 61898dda11cf..34e98aa6b956 100644 --- a/drivers/net/ethernet/netronome/nfp/bpf/main.c +++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c @@ -182,6 +182,7 @@ static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type type, return err; bv->tc_prog = cls_bpf->prog; + nn->port->tc_offload_cnt = !!bv->tc_prog; return 0; } @@ -219,13 +220,6 @@ static int nfp_bpf_setup_tc(struct nfp_app *app, struct net_device *netdev, } } -static bool nfp_bpf_tc_busy(struct nfp_app *app, struct nfp_net *nn) -{ - struct nfp_bpf_vnic *bv = nn->app_priv; - - return !!bv->tc_prog; -} - static int nfp_bpf_change_mtu(struct nfp_app *app, struct net_device *netdev, int new_mtu) { @@ -429,7 +423,6 @@ const struct nfp_app_type app_bpf = { .ctrl_msg_rx= nfp_bpf_ctrl_msg_rx, .setup_tc = nfp_bpf_setup_tc, - .tc_busy= nfp_bpf_tc_busy, .bpf= nfp_ndo_bpf, .xdp_offload= nfp_bpf_xdp_offload, }; diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c index 08c4c6dc5f7f..eb5c13dea8f5 100644 --- a/drivers/net/ethernet/netronome/nfp/flower/offload.c +++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c @@ -349,6 +349,7 @@ nfp_flower_add_offload(struct nfp_app *app, struct net_device *netdev, struct tc_cls_flower_offload *flow, bool egress) { enum nfp_flower_tun_type tun_type = NFP_FL_TUNNEL_NONE; + struct nfp_port *port = nfp_port_from_netdev(netdev); struct nfp_flower_priv *priv = app->priv; struct nfp_fl_payload *flow_pay; struct nfp_fl_key_ls *key_layer; @@ -390,6 +391,7 @@ nfp_flower_add_offload(struct nfp_app *app, struct net_device *netdev, INIT_HLIST_NODE(&flow_pay->link); flow_pay->tc_flower_cookie = flow->cookie; hash_add_rcu(priv->flow_table, &flow_pay->link, flow->cookie); + port->tc_offload_cnt++; /* Deallocate flow payload when flower rule has been destroyed. */ kfree(key_layer); @@ -421,6 +423,7 @@ static int nfp_flower_del_offload(struct nfp_app *app, struct net_device *netdev, struct tc_cls_flower_offload *flow) { + struct nfp_port *port = nfp_port_from_netdev(netdev); struct nfp_fl_payload *nfp_flow; int err; @@ -442,6 +445,7 @@ nfp_flower_del_offload(struct nfp_app *app, struct net_device *netdev, err_free_flow: hash_del_rcu(&nfp_flow->link); + port->tc_offload_cnt--; kfree(nfp_flow->action_data); kfree(nfp_flow->mask_data); kfree(nfp_flow->unmasked_data); diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.h b/drivers/net/ethernet/netronome/nfp/nfp_app.h index 437964afa8ee..20546ae67909 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_app.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_app.h @@ -92,7 +92,6 @@ extern const struct nfp_app_type app_flower; * @stop: stop application logic * @ctrl_msg_rx:control message handler * @setup_tc: setup TC ndo - * @tc_busy: TC HW offload busy (rules loaded) * @bpf: BPF ndo offload-related calls * @xdp_offload:offload an XDP program * @eswitch_mode_get:get SR-IOV eswitch mode @@ -135,7 +134,6 @@ struct nfp_app_type { int (*setup_tc)(struct nfp_app *app, struct net_device *netdev, enum tc_setup_type type, void *type_data); - bool (*tc_busy)(struct nfp_app *app, struct nfp_net *nn); int (*bpf)(struct nfp_app *app, struct nfp_net *nn, struct netdev_bpf *xdp); int (*xdp_offload)(struct nfp_app *app, struct nfp_net *nn, @@ -301,13 +299,6 @@ static inline bool nfp_app_has_tc(struc
[PATCH net 4/5] nfp: limit the number of TSO segments
Most FWs limit the number of TSO segments a frame can produce to 64. This is for fairness and efficiency (of FW datapath) reasons. If a frame with larger number of segments is submitted the FW will drop it. Signed-off-by: Jakub Kicinski --- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 ++ drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 5 - 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 19e989239af7..a05be0ab2713 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -3750,6 +3750,8 @@ static void nfp_net_netdev_init(struct nfp_net *nn) netdev->min_mtu = ETH_MIN_MTU; netdev->max_mtu = nn->max_mtu; + netdev->gso_max_segs = NFP_NET_LSO_MAX_SEGS; + netif_carrier_off(netdev); nfp_net_set_ethtool_ops(netdev); diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h index eeecef2caac6..4499a7333078 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h @@ -59,9 +59,12 @@ #define NFP_NET_RX_OFFSET 32 /** - * Maximum header size supported for LSO frames + * LSO parameters + * %NFP_NET_LSO_MAX_HDR_SZ:Maximum header size supported for LSO frames + * %NFP_NET_LSO_MAX_SEGS: Maximum number of segments LSO frame can produce */ #define NFP_NET_LSO_MAX_HDR_SZ 255 +#define NFP_NET_LSO_MAX_SEGS 64 /** * Prepend field types -- 2.15.1
[PATCH net 0/5] nfp: fix disabling TC offloads in flower, max TSO segs and module version
Hi! This set corrects the way nfp deals with the NETIF_F_HW_TC flag. It has slipped the review that flower offload does not currently refuse disabling this flag when filter offload is active. nfp's flower offload does not actually keep track of how many filters for each port are offloaded. The accounting of the number of filters is added to the nfp core structures, and BPF moved to use these structures as well. If users are allowed to disable TC offloads while filters are active, not only is it incorrect behaviour, but actually the NFP will never be told to remove the flows, leading to use-after-free when stats arrive. Fourth patch makes sure we declare the max number of TSO segments. FW should drop longer packets cleanly (otherwise this would be a security problem for untrusted VFs) but dropping longer TSO frames is not nice and driver should prevent them from being generated. Last small addition populates MODULE_VERSION with kernel version. Jakub Kicinski (5): nfp: bpf: require ETH table nfp: don't advertise hw-tc-offload on non-port netdevs nfp: forbid disabling hw-tc-offload on representors while offload active nfp: limit the number of TSO segments nfp: populate MODULE_VERSION drivers/net/ethernet/netronome/nfp/bpf/main.c | 21 + drivers/net/ethernet/netronome/nfp/flower/offload.c | 4 drivers/net/ethernet/netronome/nfp/nfp_app.h| 9 - drivers/net/ethernet/netronome/nfp/nfp_main.c | 1 + drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 11 ++- drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 5 - drivers/net/ethernet/netronome/nfp/nfp_net_repr.c | 1 + drivers/net/ethernet/netronome/nfp/nfp_port.c | 18 ++ drivers/net/ethernet/netronome/nfp/nfp_port.h | 6 ++ 9 files changed, 53 insertions(+), 23 deletions(-) -- 2.15.1
Re: [PATCH net V3 2/2] ptr_ring: fail on large queue size (>64K)
On Thu, Feb 08, 2018 at 11:59:25AM +0800, Jason Wang wrote: > We need limit the maximum size of queue, otherwise it may cause > several side effects e.g slab will warn when the size exceeds > KMALLOC_MAX_SIZE. Using KMALLOC_MAX_SIZE still looks too so this patch > tries to limit it to 64K. This value could be revisited if we found a > real case that needs more. > > Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com > Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") > Signed-off-by: Jason Wang > --- > include/linux/ptr_ring.h | 4 > 1 file changed, 4 insertions(+) > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > index 2af71a7..5858d48 100644 > --- a/include/linux/ptr_ring.h > +++ b/include/linux/ptr_ring.h > @@ -44,6 +44,8 @@ struct ptr_ring { > void **queue; > }; > Seems like a weird location for a define. Either put defines on top of the file, or near where they are used. I prefer the second option. > +#define PTR_RING_MAX_ALLOC 65536 > + I guess it's an arbitrary number. Seems like a sufficiently large one, but pls add a comment so readers don't wonder. And please explain what it does: /* Callers can create ptr_ring structures with userspace-supplied * parameters. This sets a limit on the size to make that usecase * safe. If you ever change this, make sure to audit all callers. */ Also I think we should generally use either hex 0x1 or (1 << 16). > /* Note: callers invoking this in a loop must use a compiler barrier, > * for example cpu_relax(). > * > @@ -466,6 +468,8 @@ static inline int ptr_ring_consume_batched_bh(struct > ptr_ring *r, > > static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t > gfp) > { > + if (size > PTR_RING_MAX_ALLOC) > + return NULL; > return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO); > } > > -- > 2.7.4
Re: [PATCH net V3 1/2] ptr_ring: try vmalloc() when kmalloc() fails
On Thu, Feb 08, 2018 at 11:59:24AM +0800, Jason Wang wrote: > This patch switch to use kvmalloc_array() for using a vmalloc() > fallback to help in case kmalloc() fails. > > Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com > Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") I guess the actual patch is the one that switches tun to ptr_ring. In fact, I think the actual bugfix is patch 2/2. This specific one just makes kmalloc less likely to fail but that's not what syzbot reported. Then I would add this patch on top to make kmalloc less likely to fail. > Signed-off-by: Jason Wang > --- > include/linux/ptr_ring.h | 10 +- > 1 file changed, 5 insertions(+), 5 deletions(-) > > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h > index 1883d61..2af71a7 100644 > --- a/include/linux/ptr_ring.h > +++ b/include/linux/ptr_ring.h > @@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct > ptr_ring *r, > > static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t > gfp) > { > - return kcalloc(size, sizeof(void *), gfp); > + return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO); > } > > static inline void __ptr_ring_set_size(struct ptr_ring *r, int size) This implies a bunch of limitations on the flags. From kvmalloc_node docs: * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported. * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is * preferable to the vmalloc fallback, due to visible performance drawbacks. Fine with all the current users, but if we go this way, please add documentation so future users don't misuse this API. Alternatively, test flags and call kvmalloc or kcalloc? > @@ -601,7 +601,7 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int > size, gfp_t gfp, > spin_unlock(&(r)->producer_lock); > spin_unlock_irqrestore(&(r)->consumer_lock, flags); > > - kfree(old); > + kvfree(old); > > return 0; > } > @@ -641,7 +641,7 @@ static inline int ptr_ring_resize_multiple(struct > ptr_ring **rings, > } > > for (i = 0; i < nrings; ++i) > - kfree(queues[i]); > + kvfree(queues[i]); > > kfree(queues); > > @@ -649,7 +649,7 @@ static inline int ptr_ring_resize_multiple(struct > ptr_ring **rings, > > nomem: > while (--i >= 0) > - kfree(queues[i]); > + kvfree(queues[i]); > > kfree(queues); > > @@ -664,7 +664,7 @@ static inline void ptr_ring_cleanup(struct ptr_ring *r, > void (*destroy)(void *)) > if (destroy) > while ((ptr = ptr_ring_consume(r))) > destroy(ptr); > - kfree(r->queue); > + kvfree(r->queue); > } > > #endif /* _LINUX_PTR_RING_H */ > -- > 2.7.4
[PATCH bpf 1/6] nfp: bpf: fix immed relocation for larger offsets
Immed relocation is missing a shift which means for larger offsets the lower and higher part of the address would be ORed together. Fixes: ce4ebfd859c3 ("nfp: bpf: add helpers for updating immediate instructions") Signed-off-by: Jakub Kicinski Reviewed-by: Jiong Wang --- drivers/net/ethernet/netronome/nfp/nfp_asm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_asm.c b/drivers/net/ethernet/netronome/nfp/nfp_asm.c index 3f6952b66a49..1e597600c693 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_asm.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_asm.c @@ -107,7 +107,7 @@ u16 immed_get_value(u64 instr) if (!unreg_is_imm(reg)) reg = FIELD_GET(OP_IMMED_B_SRC, instr); - return (reg & 0xff) | FIELD_GET(OP_IMMED_IMM, instr); + return (reg & 0xff) | FIELD_GET(OP_IMMED_IMM, instr) << 8; } void immed_set_value(u64 *instr, u16 immed) -- 2.15.1
[PATCH bpf 2/6] libbpf: complete list of strings for guessing program type
From: Quentin Monnet It seems that the type guessing feature for libbpf, based on the name of the ELF section the program is located in, was inspired from samples/bpf/prog_load.c, which was not used by any sample for loading programs of certain types such as TC actions and classifiers, or LWT-related types. As a consequence, libbpf is not able to guess the type of such programs and to load them automatically if type is not provided to the `bpf_load_prog()` function. Add ELF section names associated to those eBPF program types so that they can be loaded with e.g. bpftool as well. Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski --- tools/lib/bpf/libbpf.c | 5 + 1 file changed, 5 insertions(+) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 71ddc481f349..c64840365433 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -1816,12 +1816,17 @@ static const struct { BPF_PROG_SEC("socket", BPF_PROG_TYPE_SOCKET_FILTER), BPF_PROG_SEC("kprobe/", BPF_PROG_TYPE_KPROBE), BPF_PROG_SEC("kretprobe/", BPF_PROG_TYPE_KPROBE), + BPF_PROG_SEC("classifier", BPF_PROG_TYPE_SCHED_CLS), + BPF_PROG_SEC("action", BPF_PROG_TYPE_SCHED_ACT), BPF_PROG_SEC("tracepoint/", BPF_PROG_TYPE_TRACEPOINT), BPF_PROG_SEC("xdp", BPF_PROG_TYPE_XDP), BPF_PROG_SEC("perf_event", BPF_PROG_TYPE_PERF_EVENT), BPF_PROG_SEC("cgroup/skb", BPF_PROG_TYPE_CGROUP_SKB), BPF_PROG_SEC("cgroup/sock", BPF_PROG_TYPE_CGROUP_SOCK), BPF_PROG_SEC("cgroup/dev", BPF_PROG_TYPE_CGROUP_DEVICE), + BPF_PROG_SEC("lwt_in", BPF_PROG_TYPE_LWT_IN), + BPF_PROG_SEC("lwt_out", BPF_PROG_TYPE_LWT_OUT), + BPF_PROG_SEC("lwt_xmit",BPF_PROG_TYPE_LWT_XMIT), BPF_PROG_SEC("sockops", BPF_PROG_TYPE_SOCK_OPS), BPF_PROG_SEC("sk_skb", BPF_PROG_TYPE_SK_SKB), }; -- 2.15.1
[PATCH bpf 4/6] tools: bpftool: make syntax for program map update explicit in man page
From: Quentin Monnet Specify in the documentation that when using bpftool to update a map of type BPF_MAP_TYPE_PROG_ARRAY, the syntax for the program used as a value should use the "id|tag|pinned" keywords convention, as used with "bpftool prog" commands. Fixes: ff69c21a85a4 ("tools: bpftool: add documentation") Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski --- tools/bpf/bpftool/Documentation/bpftool-map.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst b/tools/bpf/bpftool/Documentation/bpftool-map.rst index 0ab32b312aec..457e868bd32f 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-map.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst @@ -31,7 +31,8 @@ MAP COMMANDS | **bpftool** **map help** | | *MAP* := { **id** *MAP_ID* | **pinned** *FILE* } -| *VALUE* := { *BYTES* | *MAP* | *PROGRAM* } +| *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* } +| *VALUE* := { *BYTES* | *MAP* | *PROG* } | *UPDATE_FLAGS* := { **any** | **exist** | **noexist** } DESCRIPTION -- 2.15.1
[PATCH bpf 6/6] tools: bpftool: add bash completion for cgroup commands
From: Quentin Monnet Add bash completion for "bpftool cgroup" command family. While at it, also fix the formatting of some keywords in the man page for cgroups. Fixes: 5ccda64d38cc ("bpftool: implement cgroup bpf operations") Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski --- tools/bpf/bpftool/Documentation/bpftool-cgroup.rst | 4 +- tools/bpf/bpftool/bash-completion/bpftool | 64 -- 2 files changed, 62 insertions(+), 6 deletions(-) diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst index 2fe2a1bdbe3e..0e4e923235b6 100644 --- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst +++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst @@ -26,8 +26,8 @@ MAP COMMANDS | **bpftool** **cgroup help** | | *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* } -| *ATTACH_TYPE* := { *ingress* | *egress* | *sock_create* | *sock_ops* | *device* } -| *ATTACH_FLAGS* := { *multi* | *override* } +| *ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | **sock_ops** | **device** } +| *ATTACH_FLAGS* := { **multi** | **override** } DESCRIPTION === diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index 17d246418348..08719c54a614 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -52,16 +52,24 @@ _bpftool_once_attr() done } -# Takes a list of words in argument; adds them all to COMPREPLY if none of them -# is already present on the command line. Returns no value. -_bpftool_one_of_list() +# Takes a list of words as argument; if any of those words is present on the +# command line, return 0. Otherwise, return 1. +_bpftool_search_list() { local w idx for w in $*; do for (( idx=3; idx < ${#words[@]}-1; idx++ )); do -[[ $w == ${words[idx]} ]] && return 1 +[[ $w == ${words[idx]} ]] && return 0 done done +return 1 +} + +# Takes a list of words in argument; adds them all to COMPREPLY if none of them +# is already present on the command line. Returns no value. +_bpftool_one_of_list() +{ +_bpftool_search_list $* && return 1 COMPREPLY+=( $( compgen -W "$*" -- "$cur" ) ) } @@ -351,6 +359,54 @@ _bpftool() ;; esac ;; +cgroup) +case $command in +show|list) +_filedir +return 0 +;; +attach|detach) +local ATTACH_TYPES='ingress egress sock_create sock_ops \ +device' +local ATTACH_FLAGS='multi override' +local PROG_TYPE='id pinned tag' +case $prev in +$command) +_filedir +return 0 +;; +ingress|egress|sock_create|sock_ops|device) +COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \ +"$cur" ) ) +return 0 +;; +id) +_bpftool_get_prog_ids +return 0 +;; +*) +if ! _bpftool_search_list "$ATTACH_TYPES"; then +COMPREPLY=( $( compgen -W "$ATTACH_TYPES" -- \ +"$cur" ) ) +elif [[ "$command" == "attach" ]]; then +# We have an attach type on the command line, +# but it is not the previous word, or +# "id|pinned|tag" (we already checked for +# that). This should only leave the case when +# we need attach flags for "attach" commamnd. +_bpftool_one_of_list "$ATTACH_FLAGS" +fi +return 0 +;; +esac +;; +*) +[[ $prev == $object ]] && \ +COMPREPLY=( $( compgen -W 'help attach detach \ +show list' -- "$cur" ) ) +;; +esac +;; esac } && complete -F _bpftool bpftool -- 2.15.1
[PATCH bpf 3/6] tools: bpftool: exit doc Makefile early if rst2man is not available
From: Quentin Monnet If rst2man is not available on the system, running `make doc` from the bpftool directory fails with an error message. However, it creates empty manual pages (.8 files in this case). A subsequent call to `make doc-install` would then succeed and install those empty man pages on the system. To prevent this, raise a Makefile error and exit immediately if rst2man is not available before generating the pages from the rst documentation. Fixes: ff69c21a85a4 ("tools: bpftool: add documentation") Reported-by: Jason van Aaardt Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski --- tools/bpf/bpftool/Documentation/Makefile | 5 + 1 file changed, 5 insertions(+) diff --git a/tools/bpf/bpftool/Documentation/Makefile b/tools/bpf/bpftool/Documentation/Makefile index c462a928e03d..a9d47c1558bb 100644 --- a/tools/bpf/bpftool/Documentation/Makefile +++ b/tools/bpf/bpftool/Documentation/Makefile @@ -23,7 +23,12 @@ DOC_MAN8 = $(addprefix $(OUTPUT),$(_DOC_MAN8)) man: man8 man8: $(DOC_MAN8) +RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null) + $(OUTPUT)%.8: %.rst +ifndef RST2MAN_DEP + $(error "rst2man not found, but required to generate man pages") +endif $(QUIET_GEN)rst2man $< > $@ clean: -- 2.15.1
[PATCH bpf 5/6] tools: bpftool: add bash completion for `bpftool prog load`
From: Quentin Monnet Add bash completion for bpftool command `prog load`. Completion for this command is easy, as it only takes existing file paths as arguments. Fixes: 49a086c201a9 ("bpftool: implement prog load command") Signed-off-by: Quentin Monnet Reviewed-by: Jakub Kicinski --- tools/bpf/bpftool/bash-completion/bpftool | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/tools/bpf/bpftool/bash-completion/bpftool b/tools/bpf/bpftool/bash-completion/bpftool index 0137866bb8f6..17d246418348 100644 --- a/tools/bpf/bpftool/bash-completion/bpftool +++ b/tools/bpf/bpftool/bash-completion/bpftool @@ -230,10 +230,14 @@ _bpftool() fi return 0 ;; +load) +_filedir +return 0 +;; *) [[ $prev == $object ]] && \ -COMPREPLY=( $( compgen -W 'dump help pin show list' -- \ -"$cur" ) ) +COMPREPLY=( $( compgen -W 'dump help pin load \ +show list' -- "$cur" ) ) ;; esac ;; -- 2.15.1
[PATCH bpf 0/6] nfp fix for calls, bpftool completions and doc fixes
Hi! First patch in this series fixes applying the relocation to immediate load instructions in the NFP JIT. The remaining patches come from Quentin. Small addition to libbpf makes sure it recognizes all standard section names. Makefile in bpftool/Documentation is improved to explicitly check for rst2man being installed on the system, otherwise we risk installing empty files. Man page for bpftool-map is corrected to include program as a potential value for map of programs. Last two patches are slightly longer, those update bash completions to include this release cycle's additions from Roman. Maybe the use of Fixes tags is slightly frivolous there, but having bash completions which don't cover all commands and options could be disruptive to work flow for users. Jakub Kicinski (1): nfp: bpf: fix immed relocation for larger offsets Quentin Monnet (5): libbpf: complete list of strings for guessing program type tools: bpftool: exit doc Makefile early if rst2man is not available tools: bpftool: make syntax for program map update explicit in man page tools: bpftool: add bash completion for `bpftool prog load` tools: bpftool: add bash completion for cgroup commands drivers/net/ethernet/netronome/nfp/nfp_asm.c | 2 +- tools/bpf/bpftool/Documentation/Makefile | 5 ++ tools/bpf/bpftool/Documentation/bpftool-cgroup.rst | 4 +- tools/bpf/bpftool/Documentation/bpftool-map.rst| 3 +- tools/bpf/bpftool/bash-completion/bpftool | 72 -- tools/lib/bpf/libbpf.c | 5 ++ 6 files changed, 81 insertions(+), 10 deletions(-) -- 2.15.1
[PATCH net V3 2/2] ptr_ring: fail on large queue size (>64K)
We need limit the maximum size of queue, otherwise it may cause several side effects e.g slab will warn when the size exceeds KMALLOC_MAX_SIZE. Using KMALLOC_MAX_SIZE still looks too so this patch tries to limit it to 64K. This value could be revisited if we found a real case that needs more. Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") Signed-off-by: Jason Wang --- include/linux/ptr_ring.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 2af71a7..5858d48 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -44,6 +44,8 @@ struct ptr_ring { void **queue; }; +#define PTR_RING_MAX_ALLOC 65536 + /* Note: callers invoking this in a loop must use a compiler barrier, * for example cpu_relax(). * @@ -466,6 +468,8 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp) { + if (size > PTR_RING_MAX_ALLOC) + return NULL; return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO); } -- 2.7.4
[PATCH net V3 1/2] ptr_ring: try vmalloc() when kmalloc() fails
This patch switch to use kvmalloc_array() for using a vmalloc() fallback to help in case kmalloc() fails. Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") Signed-off-by: Jason Wang --- include/linux/ptr_ring.h | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 1883d61..2af71a7 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp) { - return kcalloc(size, sizeof(void *), gfp); + return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO); } static inline void __ptr_ring_set_size(struct ptr_ring *r, int size) @@ -601,7 +601,7 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp, spin_unlock(&(r)->producer_lock); spin_unlock_irqrestore(&(r)->consumer_lock, flags); - kfree(old); + kvfree(old); return 0; } @@ -641,7 +641,7 @@ static inline int ptr_ring_resize_multiple(struct ptr_ring **rings, } for (i = 0; i < nrings; ++i) - kfree(queues[i]); + kvfree(queues[i]); kfree(queues); @@ -649,7 +649,7 @@ static inline int ptr_ring_resize_multiple(struct ptr_ring **rings, nomem: while (--i >= 0) - kfree(queues[i]); + kvfree(queues[i]); kfree(queues); @@ -664,7 +664,7 @@ static inline void ptr_ring_cleanup(struct ptr_ring *r, void (*destroy)(void *)) if (destroy) while ((ptr = ptr_ring_consume(r))) destroy(ptr); - kfree(r->queue); + kvfree(r->queue); } #endif /* _LINUX_PTR_RING_H */ -- 2.7.4
linux-next: Signed-off-by missing for commit in the net tree
Hi all, Commit 043e337f555e ("sch_netem: Bug fixing in calculating Netem interval") is missing a Signed-off-by from its author. -- Cheers, Stephen Rothwell
Re: [PATCH net V2 1/2] ptr_ring: try vmalloc() when kmalloc() fails
On 2018年02月08日 11:21, Jason Wang wrote: This patch switch to use kvmalloc_array() for using a vmalloc() fallback to help in case kmalloc() fails. Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") Signed-off-by: Jason Wang --- include/linux/ptr_ring.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 1883d61..9c3b748 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp) { - return kcalloc(size, sizeof(void *), gfp); + return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO); } static inline void __ptr_ring_set_size(struct ptr_ring *r, int size) Forget the kvfree() ... Will post V3. Sorry.
[PATCH net V2 1/2] ptr_ring: try vmalloc() when kmalloc() fails
This patch switch to use kvmalloc_array() for using a vmalloc() fallback to help in case kmalloc() fails. Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") Signed-off-by: Jason Wang --- include/linux/ptr_ring.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 1883d61..9c3b748 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp) { - return kcalloc(size, sizeof(void *), gfp); + return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO); } static inline void __ptr_ring_set_size(struct ptr_ring *r, int size) -- 2.7.4
[PATCH net V2 2/2] ptr_ring: fail on large queue size (>64K)
We need limit the maximum size of queue, otherwise it may cause several side effects e.g slab will warn when the size exceeds KMALLOC_MAX_SIZE. Using KMALLOC_MAX_SIZE still looks too so this patch tries to limit it to 64K. This value could be revisited if we found a real case that needs more. Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") Signed-off-by: Jason Wang --- include/linux/ptr_ring.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 9c3b748..2520daa 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -44,6 +44,8 @@ struct ptr_ring { void **queue; }; +#define PTR_RING_MAX_ALLOC 65536 + /* Note: callers invoking this in a loop must use a compiler barrier, * for example cpu_relax(). * @@ -466,6 +468,8 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp) { + if (size > PTR_RING_MAX_ALLOC) + return NULL; return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO); } -- 2.7.4
Re: [PATCH net] ptr_ring: fail early if queue occupies more than KMALLOC_MAX_SIZE
On 2018年02月07日 23:15, Michael S. Tsirkin wrote: On Wed, Feb 07, 2018 at 05:17:59PM +0800, Jason Wang wrote: On 2018年02月07日 16:08, Jason Wang wrote: To avoid slab to warn about exceeded size, fail early if queue occupies more than KMALLOC_MAX_SIZE. Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com Signed-off-by: Jason Wang --- include/linux/ptr_ring.h | 2 ++ 1 file changed, 2 insertions(+) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 1883d61..4b862da 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -466,6 +466,8 @@ static inline int ptr_ring_consume_batched_bh(struct ptr_ring *r, static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp) { + if (size > KMALLOC_MAX_SIZE) + return NULL; return kcalloc(size, sizeof(void *), gfp); } Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers") That's probably not enough. How about switching to kvmalloc and limiting this drastically to e.g. 64K. I vaguely remember that some wise man said that should be enough for everybody :) Okay, let me post a V2 to see if there's any objection. Thanks
Re: [PATCH net] tcp: tracepoint: only call trace_tcp_send_reset with full socket
From: Song Liu Date: Tue, 6 Feb 2018 20:50:23 -0800 > tracepoint tcp_send_reset requires a full socket to work. However, it > may be called when in TCP_TIME_WAIT: > > case TCP_TW_RST: > tcp_v6_send_reset(sk, skb); > inet_twsk_deschedule_put(inet_twsk(sk)); > goto discard_it; > > To avoid this problem, this patch checks the socket with sk_fullsock() > before calling trace_tcp_send_reset(). > > Fixes: c24b14c46bb8 ("tcp: add tracepoint trace_tcp_send_reset") > Signed-off-by: Song Liu > Reviewed-by: Lawrence Brakmo Applied and queued up for -stable, thank you.
Re: [PATCH net-next] sch_netem: Bug fixing in calculating Netem interval
From: "Md. Islam" Date: Tue, 6 Feb 2018 23:14:18 -0500 > In Kernel 4.15.0+, Netem does not work properly. > > Netem setup: > > tc qdisc add dev h1-eth0 root handle 1: netem delay 10ms 2ms > > Result: ... > (rnd % (2 * sigma)) - sigma was overflowing s32. After applying the > patch, I found following output which is desirable. Applied.
Re: [PATCH] net: ethernet: ti: cpsw: fix net watchdog timeout
From: Grygorii Strashko Date: Tue, 6 Feb 2018 19:17:06 -0600 > It was discovered that simple program which indefinitely sends 200b UDP > packets and runs on TI AM574x SoC (SMP) under RT Kernel triggers network > watchdog timeout in TI CPSW driver (<6 hours run). The network watchdog > timeout is triggered due to race between cpsw_ndo_start_xmit() and > cpsw_tx_handler() [NAPI] > > cpsw_ndo_start_xmit() > if (unlikely(!cpdma_check_free_tx_desc(txch))) { > txq = netdev_get_tx_queue(ndev, q_idx); > netif_tx_stop_queue(txq); > > ^^ as per [1] barier has to be used after set_bit() otherwise new value > might not be visible to other cpus > } > > cpsw_tx_handler() > if (unlikely(netif_tx_queue_stopped(txq))) > netif_tx_wake_queue(txq); > > and when it happens ndev TX queue became disabled forever while driver's HW > TX queue is empty. > > Fix this, by adding smp_mb__after_atomic() after netif_tx_stop_queue() > calls and double check for free TX descriptors after stopping ndev TX queue > - if there are free TX descriptors wake up ndev TX queue. > > [1] https://www.kernel.org/doc/html/latest/core-api/atomic_ops.html > Signed-off-by: Grygorii Strashko Applied, thanks.
Re: [PATCH net] net/ipv6: Handle reject routes with onlink flag
From: David Ahern Date: Tue, 6 Feb 2018 12:14:12 -0800 > Verification of nexthops with onlink flag need to handle unreachable > routes. The lookup is only intended to validate the gateway address > is not a local address and if the gateway resolves the egress device > must match the given device. Hence, hitting any default reject route > is ok. > > Fixes: fc1e64e1092f ("net/ipv6: Add support for onlink flag") > Signed-off-by: David Ahern Applied.
Re: [PATCH net] net/ipv6: onlink nexthop checks should default to main table
From: David Ahern Date: Tue, 6 Feb 2018 13:17:06 -0800 > Because of differences in how ipv4 and ipv6 handle fib lookups, > verification of nexthops with onlink flag need to default to the main > table rather than the local table used by IPv4. As it stands an > address within a connected route on device 1 can be used with > onlink on device 2. Updating the table properly rejects the route > due to the egress device mismatch. > > Update the extack message as well to show it could be a device > mismatch for the nexthop spec. > > Fixes: fc1e64e1092f ("net/ipv6: Add support for onlink flag") > Signed-off-by: David Ahern Applied, thanks David.
Re: [PATCH net-next] sun: Add SPDX license tags to Sun network drivers
From: Shannon Nelson Date: Tue, 6 Feb 2018 11:34:23 -0800 > Add the appropriate SPDX license tags to the Sun network drivers > as outlined in Documentation/process/license-rules.rst. > > Signed-off-by: Shannon Nelson I've decided to apply this to the net tree, thank you.
Re: [PATCH v1 1/1] spi_ks8995: use regmap to access chip registers.
From: Sven Van Asbroeck Date: Tue, 6 Feb 2018 10:13:56 -0500 > The register map layouts used in this driver are well suited to > being accessed through a regmap. This makes the driver simpler > and shorter, by eliminating some spi boilerplate code. > > Testing: > - tested on a ksz8785. > - not tested on the other supported chips (ks8995, ksz8864) > because I don't have access to them. > However, I instrumented the spi layer to verify that the > correct spi transactions are generated to read the ID > registers on those chips. > > Signed-off-by: Sven Van Asbroeck Please resubmit this simplification when the net-next tree opens back up. Thank you.
Re: [PATCH net] rxrpc: Fix received abort handling
From: David Howells Date: Tue, 06 Feb 2018 16:44:12 + > AF_RXRPC is incorrectly sending back to the server any abort it receives > for a client connection. This is due to the final-ACK offload to the > connection event processor patch. The abort code is copied into the > last-call information on the connection channel and then the event > processor is set. > > Instead, the following should be done: ... > Fixes: 3136ef49a14c ("rxrpc: Delay terminal ACK transmission on a client > call") > Signed-off-by: David Howells Applied, thanks David.
Re: [PATCH] cxgb4: Fix error handling path in 'init_one()'
From: Christophe JAILLET Date: Tue, 6 Feb 2018 21:17:17 +0100 > Commit baf5086840ab1 ("cxgb4: restructure VF mgmt code") has reordered > some code but an error handling label has not been updated accordingly. > So fix it and free 'adapter' if 't4_wait_dev_ready()' fails. > > Fixes: baf5086840ab1 ("cxgb4: restructure VF mgmt code") > Signed-off-by: Christophe JAILLET Applied, thank you.
[PATCH] net: Whitelist the skbuff_head_cache "cb" field
Most callers of put_cmsg() use a "sizeof(foo)" for the length argument. Within put_cmsg(), a copy_to_user() call is made with a dynamic size, as a result of the cmsg header calculations. This means that hardened usercopy will examine the copy, even though it was technically a fixed size and should be implicitly whitelisted. All the put_cmsg() calls being built from values in skbuff_head_cache are coming out of the protocol-defined "cb" field, so whitelist this field entirely instead of creating per-use bounce buffers, for which there are concerns about performance. Original report was: Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from SLAB object 'skbuff_head_cache' (offset 64, size 16)! WARNING: CPU: 0 PID: 3663 at mm/usercopy.c:81 usercopy_warn+0xdb/0x100 mm/usercopy.c:76 ... __check_heap_object+0x89/0xc0 mm/slab.c:4426 check_heap_object mm/usercopy.c:236 [inline] __check_object_size+0x272/0x530 mm/usercopy.c:259 check_object_size include/linux/thread_info.h:112 [inline] check_copy_size include/linux/thread_info.h:143 [inline] copy_to_user include/linux/uaccess.h:154 [inline] put_cmsg+0x233/0x3f0 net/core/scm.c:242 sock_recv_errqueue+0x200/0x3e0 net/core/sock.c:2913 packet_recvmsg+0xb2e/0x17a0 net/packet/af_packet.c:3296 sock_recvmsg_nosec net/socket.c:803 [inline] sock_recvmsg+0xc9/0x110 net/socket.c:810 ___sys_recvmsg+0x2a4/0x640 net/socket.c:2179 __sys_recvmmsg+0x2a9/0xaf0 net/socket.c:2287 SYSC_recvmmsg net/socket.c:2368 [inline] SyS_recvmmsg+0xc4/0x160 net/socket.c:2352 entry_SYSCALL_64_fastpath+0x29/0xa0 Reported-by: syzbot+e2d6cfb305e9f3911...@syzkaller.appspotmail.com Fixes: 6d07d1cd300f ("usercopy: Restrict non-usercopy caches to size 0") Signed-off-by: Kees Cook --- I tried the inlining, it was awful. Splitting put_cmsg() was awful. So, instead, whitelist the "cb" field as the least bad option if bounce buffers are unacceptable. Dave, do you want to take this through net, or should I take it through the usercopy tree? --- net/core/skbuff.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 6b0ff396fa9d..201b96c8f414 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3889,10 +3889,12 @@ EXPORT_SYMBOL_GPL(skb_gro_receive); void __init skb_init(void) { - skbuff_head_cache = kmem_cache_create("skbuff_head_cache", + skbuff_head_cache = kmem_cache_create_usercopy("skbuff_head_cache", sizeof(struct sk_buff), 0, SLAB_HWCACHE_ALIGN|SLAB_PANIC, + offsetof(struct sk_buff, cb), + sizeof_field(struct sk_buff, cb), NULL); skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache", sizeof(struct sk_buff_fclones), -- 2.7.4 -- Kees Cook Pixel Security
Re: [PATCH RFC 2/4] netlink: add generic object description infrastructure
On 02/06/2018 05:37 PM, Pablo Neira Ayuso wrote: > This patch allows netlink busses to provide object descriptions to > userspace, in terms of supported attributes and its corresponding > datatypes. > > Userspace sends a requests that looks like: > > netlink header > NLA_DESC_REQ_BUS > NLA_DESC_REQ_DATA > > Where NLA_DESC_REQ_BUS is the netlink bus/protocol number, eg. > NETLINK_NETFILTER, and NLA_DESC_REQ_DATA is an attribute layout is > specific to the bus that you are inspecting, this is useful for both > nfnetlink and genetlink since they need to what subsystem in the bus > specifically you're targeting to. > > Then, the netlink description subsystem response via netlink dump looks > like this: > > netlink header > NLA_DESC_NUM_OBJS > NLA_DESC_OBJS (nest) > NLA_DESC_LIST_ITEM (nest) > NLA_DESC_OBJ_ID > NLA_DESC_OBJ_ATTRS_MAX > NLA_DESC_OBJ_ATTRS (nest) > NLA_DESC_LIST_ITEM (nest) > NLA_DESC_ATTR_NUM > NLA_DESC_ATTR_TYPE > NLA_DESC_ATTR_LEN > NLA_DESC_ATTR_MAXVAL > NLA_DESC_ATTR_NEST_ID > NLA_DESC_LIST_ITEM (nest) > ... > > Each object definition is composed of an unique ID, the number of > attributes and the list of attribute definitions. > > The NETLINK_DESC bus provides a generic interface to retrieve the list > of existing objects and its attributes via netlink dump. This new > description family autoloads module dependencies based on what userspace > requests. > > Each bus needs to register a struct nl_desc_subsys definition, that > provides the lookup and parse callbacks. These route the description > requests to the corresponding backend subsystem for genetlink and > nfnetlink. The lookup callback returns struct nl_desc_objs that provides > the array of object descriptions. > > Signed-off-by: Pablo Neira Ayuso > --- > include/net/net_namespace.h | 1 + > include/net/nldesc.h | 160 ++ > include/uapi/linux/netlink.h | 67 ++ > net/netlink/Makefile | 2 +- > net/netlink/desc.c | 499 > +++ > 5 files changed, 728 insertions(+), 1 deletion(-) > create mode 100644 include/net/nldesc.h > create mode 100644 net/netlink/desc.c > > diff --git a/include/net/nldesc.h b/include/net/nldesc.h > new file mode 100644 > index ..19306a648f10 > --- /dev/null > +++ b/include/net/nldesc.h > @@ -0,0 +1,160 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef __NET_NLDESC_H > +#define __NET_NLDESC_H > + > +#include > + > +struct nl_desc_cmd; > +struct nl_desc_obj; > + > +struct nl_desc_cmds { > + int max; > + const struct nl_desc_cmd*table; > +}; > + > +struct nl_desc_objs { > + int max; > + const struct nl_desc_obj**table; > +}; > + > +struct nl_desc_req { > + u32 bus; > +}; > + > +struct net; > +struct sk_buff; > +struct nlmsghdr; > +struct nlattr; > + > + > +/** > + * struct nl_desc_obj - netlink object description > + * @id: unique ID to identify this netlink object > + * @max: number of attributes to describe this object @attr_max: > + * @attrs: array of attribute descriptions > + */ > +struct nl_desc_obj { > + u16 id; > + u16 attr_max; > + const struct nl_desc_attr *attrs; > +}; Is there a test program for this? Maybe add it to tools/testing/ ? thanks, -- ~Randy
Re: [PATCH] rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c
On Wed, 2018-02-07 at 12:51 -0800, Matthias Kaehlcke wrote: > El Wed, Feb 07, 2018 at 02:35:59PM -0600 Larry Finger ha dit: > > > On 02/07/2018 02:26 PM, Matthias Kaehlcke wrote: > > > In _rtl92c_get_txpower_writeval_by_regulatory() the variable writeVal > > > is assigned to itself in an if ... else statement, apparently only to > > > document that the branch condition is handled and that a previously read > > > value should be returned unmodified. The self-assignment causes clang to > > > raise the following warning: > > > > > > drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c:304:13: > > >error: explicitly assigning value of variable of type 'u32' > > > (aka 'unsigned int') to itself [-Werror,-Wself-assign] > > >writeVal = writeVal; > > > > > > Replace the self-assignment with a semicolon, which still serves to > > > document the 'handling' of the branch condition. > > > > > > Signed-off-by: Matthias Kaehlcke > > > --- > > > drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c > b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c > > > index 9cff6bc4049c..4db92496c122 100644 > > > --- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c > > > +++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c > > > @@ -301,7 +301,7 @@ static void > > > _rtl92c_get_txpower_writeval_by_regulatory(struct ieee80211_hw > *hw, > > > writeVal = writeVal - 0x06060606; > > > else if (rtlpriv->dm.dynamic_txhighpower_lvl == > > > TXHIGHPWRLEVEL_BT2) > > > - writeVal = writeVal; > > > + ; > > > *(p_outwriteval + rf) = writeVal; > > > } > > > } > > > > > > > As the branch condition does nothing, why not remove it and save the > > compiler's optimizer a bit of work? The code looks strange, but it matches > > the rest of Realtek's USB drivers. Agree Larry's comment. > > Sure, I am happy to change it to whatever the authors/maintainers prefer. > > I'll wait a bit before respinning for if others feel strongly about > keeping the branch. > > --Please consider the environment before printing this e-mail.
Re: [kbuild-all] [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0
On 02/08/2018 01:37 AM, Philip Li wrote: > On Thu, Feb 08, 2018 at 01:25:39AM +0100, Daniel Borkmann wrote: >> On 02/08/2018 01:20 AM, Philip Li wrote: >>> On Thu, Feb 08, 2018 at 02:09:41AM +0200, Michael S. Tsirkin wrote: On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote: > tree: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git > vhost > head: 96bcd04462b99e2c80e09f6537770a0ca6b288d0 > commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support > reporting free page blocks > config: x86_64-rhel > compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 > reproduce: > git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8 > # save the attached .config to linux build tree > make ARCH=x86_64 > > All warnings (new ones prefixed by >>): I'm sorry, what exactly does this mean? >>> sorry that the script has issue to provide empty warnings here, which >>> was supposed to list warnings with >> added to new ones like below. >>> >>>In file included from kernel//bpf/core.c:29:0: > include/linux/bpf.h:72:8: error: duplicate member 'security' >>> void *security; >>>^ >>> >>> We will follow up to check what goes wrong. >> >> Certainly odd, even in that tree/commit/header [0] I cannot see a duplicate >> security member in struct bpf_map. > that is one example i try to show what looks like for the >> prefix, and i > copy > it from https://lists.01.org/pipermail/kbuild-all/2018-February/042706.html, > which > is not for vhost repo. Ok, indeed, some merge resolution issue in android stable kernel tree [1]. I didn't parse from your prior mail that you just meant this as a possible example. Anyway, thanks for clarifying. [1] https://android.googlesource.com/kernel/common/+/6bbd2da69c5c2b8ab57cc2171b9abe2c9b82f53e%5E%21/ > Here for this report, the issue is as title "Warning: > arch/x86/tools/test_get_len found difference at > :811aa5f0". > >> [0] >> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/tree/include/linux/bpf.h?h=vhost&id=cc1d1dc07885803981520a5303ef5b130f2ca2e8#n45 Thanks, Daniel
Re: [kbuild-all] [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0
On Thu, Feb 08, 2018 at 01:25:39AM +0100, Daniel Borkmann wrote: > On 02/08/2018 01:20 AM, Philip Li wrote: > > On Thu, Feb 08, 2018 at 02:09:41AM +0200, Michael S. Tsirkin wrote: > >> On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote: > >>> tree: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git > >>> vhost > >>> head: 96bcd04462b99e2c80e09f6537770a0ca6b288d0 > >>> commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support > >>> reporting free page blocks > >>> config: x86_64-rhel > >>> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 > >>> reproduce: > >>> git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8 > >>> # save the attached .config to linux build tree > >>> make ARCH=x86_64 > >>> > >>> All warnings (new ones prefixed by >>): > >> > >> I'm sorry, what exactly does this mean? > > sorry that the script has issue to provide empty warnings here, which > > was supposed to list warnings with >> added to new ones like below. > > > >In file included from kernel//bpf/core.c:29:0: > >>> include/linux/bpf.h:72:8: error: duplicate member 'security' > > void *security; > >^ > > > > We will follow up to check what goes wrong. > > Certainly odd, even in that tree/commit/header [0] I cannot see a duplicate > security member in struct bpf_map. that is one example i try to show what looks like for the >> prefix, and i copy it from https://lists.01.org/pipermail/kbuild-all/2018-February/042706.html, which is not for vhost repo. Here for this report, the issue is as title "Warning: arch/x86/tools/test_get_len found difference at :811aa5f0". > > [0] > https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/tree/include/linux/bpf.h?h=vhost&id=cc1d1dc07885803981520a5303ef5b130f2ca2e8#n45
Re: [PATCH iproute2-next 0/3] route code cleanups
On 2/7/18 5:25 PM, Stephen Hemminger wrote: > Make code in iproute conform more to current coding style > conventions. Also split flush out into separate function. > > Stephen Hemminger (3): > iproute: whitespace fixes > iproute: don't do assignment in condition > iproute: make flush a separate function > > ip/iproute.c | 165 > --- > 1 file changed, 91 insertions(+), 74 deletions(-) > applied to iproute2-next
Re: [kbuild-all] [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0
On 02/08/2018 01:20 AM, Philip Li wrote: > On Thu, Feb 08, 2018 at 02:09:41AM +0200, Michael S. Tsirkin wrote: >> On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote: >>> tree: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost >>> head: 96bcd04462b99e2c80e09f6537770a0ca6b288d0 >>> commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support >>> reporting free page blocks >>> config: x86_64-rhel >>> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 >>> reproduce: >>> git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8 >>> # save the attached .config to linux build tree >>> make ARCH=x86_64 >>> >>> All warnings (new ones prefixed by >>): >> >> I'm sorry, what exactly does this mean? > sorry that the script has issue to provide empty warnings here, which > was supposed to list warnings with >> added to new ones like below. > >In file included from kernel//bpf/core.c:29:0: >>> include/linux/bpf.h:72:8: error: duplicate member 'security' > void *security; >^ > > We will follow up to check what goes wrong. Certainly odd, even in that tree/commit/header [0] I cannot see a duplicate security member in struct bpf_map. [0] https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/tree/include/linux/bpf.h?h=vhost&id=cc1d1dc07885803981520a5303ef5b130f2ca2e8#n45
[PATCH iproute2-next 1/3] iproute: whitespace fixes
Add whitespace around operators for consistency. Use tabs for indentation. Signed-off-by: Stephen Hemminger --- ip/iproute.c | 23 +-- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/ip/iproute.c b/ip/iproute.c index 91d2b1a61993..e9c4093fa3eb 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -162,7 +162,7 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) if (r->rtm_family == AF_INET6 && table != RT_TABLE_MAIN) ip6_multiple_tables = 1; - if (filter.cloned == !(r->rtm_flags&RTM_F_CLONED)) + if (filter.cloned == !(r->rtm_flags & RTM_F_CLONED)) return 0; if (r->rtm_family == AF_INET6 && !ip6_multiple_tables) { @@ -566,7 +566,8 @@ static void print_rta_multipath(FILE *fp, const struct rtmsg *r, if (nh->rtnh_len > len) break; - if (r->rtm_flags&RTM_F_CLONED && r->rtm_type == RTN_MULTICAST) { + if ((r->rtm_flags & RTM_F_CLONED) && + r->rtm_type == RTN_MULTICAST) { if (first) { fprintf(fp, "Oifs: "); first = 0; @@ -594,7 +595,8 @@ static void print_rta_multipath(FILE *fp, const struct rtmsg *r, print_rta_flow(fp, tb[RTA_FLOW]); } - if (r->rtm_flags&RTM_F_CLONED && r->rtm_type == RTN_MULTICAST) { + if ((r->rtm_flags & RTM_F_CLONED) && + r->rtm_type == RTN_MULTICAST) { fprintf(fp, "%s", ll_index_to_name(nh->rtnh_ifindex)); if (nh->rtnh_hops != 1) fprintf(fp, "(ttl>%d)", nh->rtnh_hops); @@ -676,7 +678,7 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (r->rtm_dst_len != host_len) { fprintf(fp, "%s/%u ", rt_addr_n2a_rta(family, tb[RTA_DST]), - r->rtm_dst_len); + r->rtm_dst_len); } else { fprintf(fp, "%s ", format_host_rta(family, tb[RTA_DST])); @@ -691,7 +693,7 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (r->rtm_src_len != host_len) { fprintf(fp, "from %s/%u ", rt_addr_n2a_rta(family, tb[RTA_SRC]), - r->rtm_src_len); + r->rtm_src_len); } else { fprintf(fp, "from %s ", format_host_rta(family, tb[RTA_SRC])); @@ -722,7 +724,7 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (table && (table != RT_TABLE_MAIN || show_details > 0) && !filter.tb) fprintf(fp, "table %s ", rtnl_rttable_n2a(table, b1, sizeof(b1))); - if (!(r->rtm_flags&RTM_F_CLONED)) { + if (!(r->rtm_flags & RTM_F_CLONED)) { if ((r->rtm_protocol != RTPROT_BOOT || show_details > 0) && filter.protocolmask != -1) fprintf(fp, "proto %s ", rtnl_rtprot_n2a(r->rtm_protocol, b1, sizeof(b1))); if ((r->rtm_scope != RT_SCOPE_UNIVERSE || show_details > 0) && filter.scopemask != -1) @@ -764,7 +766,6 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) print_rta_cacheinfo(fp, RTA_DATA(tb[RTA_CACHEINFO])); } else if (r->rtm_family == AF_INET6) { - if (r->rtm_flags & RTM_F_CLONED) fprintf(fp, "%scache ", _SL_); @@ -1577,8 +1578,8 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) invarg("invalid mark value", *argv); filter.markmask = -1; } else if (matches(*argv, "metric") == 0 || - matches(*argv, "priority") == 0 || - strcmp(*argv, "preference") == 0) { + matches(*argv, "priority") == 0 || + strcmp(*argv, "preference") == 0) { __u32 metric; NEXT_ARG(); @@ -2117,6 +2118,8 @@ int do_iproute(int argc, char **argv) return iproute_showdump(); if (matches(*argv, "help") == 0) usage(); - fprintf(stderr, "Command \"%s\" is unknown, try \"ip route help\".\n", *argv); + + fprintf(stderr, + "Command \"%s\" is unknown, try \"ip route help\".\n", *argv); exit(-1); } -- 2.15.1
[PATCH iproute2-next 3/3] iproute: make flush a separate function
Minor refactoring to move flush into separate function to improve readability and reduce depth of nesting. Signed-off-by: Stephen Hemminger --- ip/iproute.c | 121 +++ 1 file changed, 64 insertions(+), 57 deletions(-) diff --git a/ip/iproute.c b/ip/iproute.c index 22ed113e890a..3c56240f1291 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -1473,6 +1473,68 @@ static int save_route_prep(void) return 0; } +static int iproute_flush(int do_ipv6, rtnl_filter_t filter_fn) +{ + time_t start = time(0); + char flushb[4096-512]; + int round = 0; + int ret; + + if (filter.cloned) { + if (do_ipv6 != AF_INET6) { + iproute_flush_cache(); + if (show_stats) + printf("*** IPv4 routing cache is flushed.\n"); + } + if (do_ipv6 == AF_INET) + return 0; + } + + filter.flushb = flushb; + filter.flushp = 0; + filter.flushe = sizeof(flushb); + + for (;;) { + if (rtnl_wilddump_request(&rth, do_ipv6, RTM_GETROUTE) < 0) { + perror("Cannot send dump request"); + return -2; + } + filter.flushed = 0; + if (rtnl_dump_filter(&rth, filter_fn, stdout) < 0) { + fprintf(stderr, "Flush terminated\n"); + return -2; + } + if (filter.flushed == 0) { + if (show_stats) { + if (round == 0 && + (!filter.cloned || do_ipv6 == AF_INET6)) + printf("Nothing to flush.\n"); + else + printf("*** Flush is complete after %d round%s ***\n", + round, round > 1 ? "s" : ""); + } + fflush(stdout); + return 0; + } + round++; + ret = flush_update(); + if (ret < 0) + return ret; + + if (time(0) - start > 30) { + printf("\n*** Flush not completed after %ld seconds, %d entries remain ***\n", + (long)(time(0) - start), filter.flushed); + return -1; + } + + if (show_stats) { + printf("\n*** Round %d, deleting %d entries ***\n", + round, filter.flushed); + fflush(stdout); + } + } +} + static int iproute_list_flush_or_save(int argc, char **argv, int action) { int do_ipv6 = preferred_family; @@ -1480,7 +1542,6 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) char *od = NULL; unsigned int mark = 0; rtnl_filter_t filter_fn; - int ret; if (action == IPROUTE_SAVE) { if (save_route_prep()) @@ -1680,62 +1741,8 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) } filter.mark = mark; - if (action == IPROUTE_FLUSH) { - int round = 0; - char flushb[4096-512]; - time_t start = time(0); - - if (filter.cloned) { - if (do_ipv6 != AF_INET6) { - iproute_flush_cache(); - if (show_stats) - printf("*** IPv4 routing cache is flushed.\n"); - } - if (do_ipv6 == AF_INET) - return 0; - } - - filter.flushb = flushb; - filter.flushp = 0; - filter.flushe = sizeof(flushb); - - for (;;) { - if (rtnl_wilddump_request(&rth, do_ipv6, RTM_GETROUTE) < 0) { - perror("Cannot send dump request"); - return -2; - } - filter.flushed = 0; - if (rtnl_dump_filter(&rth, filter_fn, stdout) < 0) { - fprintf(stderr, "Flush terminated\n"); - return -2; - } - if (filter.flushed == 0) { - if (show_stats) { - if (round == 0 && (!filter.cloned || do_ipv6 == AF_INET6)) - printf("Nothing to flush.\n"); - else - printf("*** Flush is complete after %d round%s ***\n", round, round > 1?"s":""); -
[PATCH iproute2-next 2/3] iproute: don't do assignment in condition
Fix checkpatch complaints about assignment in conditions. Signed-off-by: Stephen Hemminger --- ip/iproute.c | 25 - 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/ip/iproute.c b/ip/iproute.c index e9c4093fa3eb..22ed113e890a 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -653,7 +653,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) struct nlmsghdr *fn; if (NLMSG_ALIGN(filter.flushp) + n->nlmsg_len > filter.flushe) { - if ((ret = flush_update()) < 0) + ret = flush_update(); + if (ret < 0) return ret; } fn = (struct nlmsghdr *)(filter.flushb + NLMSG_ALIGN(filter.flushp)); @@ -827,7 +828,8 @@ static int parse_one_nh(struct nlmsghdr *n, struct rtmsg *r, } } else if (strcmp(*argv, "dev") == 0) { NEXT_ARG(); - if ((rtnh->rtnh_ifindex = ll_name_to_index(*argv)) == 0) { + rtnh->rtnh_ifindex = ll_name_to_index(*argv); + if (rtnh->rtnh_ifindex == 0) { fprintf(stderr, "Cannot find device \"%s\"\n", *argv); return -1; } @@ -1326,9 +1328,9 @@ static int iproute_modify(int cmd, unsigned int flags, int argc, char **argv) usage(); if (d) { - int idx; + int idx = ll_name_to_index(d); - if ((idx = ll_name_to_index(d)) == 0) { + if (idx == 0) { fprintf(stderr, "Cannot find device \"%s\"\n", d); return -1; } @@ -1658,7 +1660,8 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) int idx; if (id) { - if ((idx = ll_name_to_index(id)) == 0) { + idx = ll_name_to_index(id); + if (idx == 0) { fprintf(stderr, "Cannot find device \"%s\"\n", id); return -1; } @@ -1666,7 +1669,8 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) filter.iifmask = -1; } if (od) { - if ((idx = ll_name_to_index(od)) == 0) { + idx = ll_name_to_index(od); + if (idx == 0) { fprintf(stderr, "Cannot find device \"%s\"\n", od); return -1; } @@ -1716,7 +1720,8 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) return 0; } round++; - if ((ret = flush_update()) < 0) + ret = flush_update(); + if (ret < 0) return ret; if (time(0) - start > 30) { @@ -1867,14 +1872,16 @@ static int iproute_get(int argc, char **argv) int idx; if (idev) { - if ((idx = ll_name_to_index(idev)) == 0) { + idx = ll_name_to_index(idev); + if (idx == 0) { fprintf(stderr, "Cannot find device \"%s\"\n", idev); return -1; } addattr32(&req.n, sizeof(req), RTA_IIF, idx); } if (odev) { - if ((idx = ll_name_to_index(odev)) == 0) { + idx = ll_name_to_index(odev); + if (idx == 0) { fprintf(stderr, "Cannot find device \"%s\"\n", odev); return -1; } -- 2.15.1
[PATCH iproute2-next 0/3] route code cleanups
Make code in iproute conform more to current coding style conventions. Also split flush out into separate function. Stephen Hemminger (3): iproute: whitespace fixes iproute: don't do assignment in condition iproute: make flush a separate function ip/iproute.c | 165 --- 1 file changed, 91 insertions(+), 74 deletions(-) -- 2.15.1
Re: [PATCH iproute2-next v2 0/6] ip: Use netlink to walk through network device list
On 2/6/18 11:30 PM, Serhey Popovych wrote: > In this seris I replace /proc/net/dev and /sys/class/net usage for walk > through network device list in iptunnel/ip6tunnel and iptuntap with > netlink dump. > > Following changed since RFC was sent: > > 1) Treat @struct rtnl_link_stats and @struct rtnl_link_stats64 as > array with __u32 and __u64 elements respectively in > copy_rtnl_link_stats64() as suggested by Stephen Hemminger. > > 2) Remove @name and @size parameters from @struct tnl_print_nlmsg_info > since we can get them easily from other data. > > Testing. > > > Following script is used to ensure I didn't broke things too much: > > \#!/bin/bash > > iproute2_dir="$1" > iface='gre1' > > pushd "$iproute2_dir" &>/dev/null > > for i in new old; do > DIR="/tmp/$i" > mkdir -p "$DIR" > > ln -snf ip.$i ip/ip > > for o in '' -s -d; do > ip/ip $o tunnel show >"$DIR/ip${o}-tunnel-show" > ip/ip -4 $o tunnel show>"$DIR/ip-4${o}-tunnel-show" > ip/ip -6 $o tunnel show>"$DIR/ip-6${o}-tunnel-show" > ip/ip $o tunnel show dev "$iface" \ > >"$DIR/ip${o}-tunnel-show-$iface" > ip/ip $o tuntap show >"$DIR/ip${o}-tuntap-show" > done > done > rm -f ip/ip > > diff -urN /tmp/{old,new} |sed -n -Ee'/^(-{3}|\+{3})[[:space:]]+/!p' > rc=$? > > popd &>/dev/null > exit $rc > > Results: > > > ... > fopen /sys/class/net/ipip1/tun_flags: No such file or directory > fopen /sys/class/net/ipip2/tun_flags: No such file or directory > fopen /sys/class/net/gre10/tun_flags: No such file or directory > ^^^ > note that this comes from ip.old > ... > diff -urN /tmp/old/ip-d-tuntap-show /tmp/new/ip-d-tuntap-show > @@ -1,4 +1,4 @@ > -tun1: tap user 1004 group 27 > - Attached to processes: > tun0: tun user 1000 group 27 > Attached to processes: > +tun1: tap user 1004 group 27 > + Attached to processes: > diff -urN /tmp/old/ip-s-tuntap-show /tmp/new/ip-s-tuntap-show > @@ -1,2 +1,2 @@ > -tun1: tap user 1004 group 27 > tun0: tun user 1000 group 27 > +tun1: tap user 1004 group 27 > diff -urN /tmp/old/ip-tuntap-show /tmp/new/ip-tuntap-show > @@ -1,2 +1,2 @@ > -tun1: tap user 1004 group 27 > tun0: tun user 1000 group 27 > +tun1: tap user 1004 group 27 > > So basically only print order for ip tuntap get changes. Rest is intact. > > v2 > Fix build failure in 0/4 patch ("iptunnel/ip6tunnel: Code cleanups") > and update it's description showing why this cleanup is necessary. > > Update cover letter to explain origins of fopen /sys/class/net/... > error message sources. > > Thanks, > Serhii > > Serhey Popovych (6): > ipaddress: Unify print_link_stats() and print_link_stats64() > ip: Introduce get_rtnl_link_stats_rta() to get link statistics > tunnel: Split statistic getting and printing > iptunnel/ip6tunnel: Code cleanups > iptunnel/ip6tunnel: Use netlink to walk through tunnels list > tuntap: Use netlink to walk through tuntap list > > include/utils.h |3 + > ip/ip6tunnel.c | 115 +++-- > ip/ipaddress.c | 189 > --- > ip/iptunnel.c | 93 +-- > ip/iptuntap.c | 121 ++- > ip/tunnel.c | 114 ++--- > ip/tunnel.h | 17 - > lib/utils.c | 45 + > 8 files changed, 324 insertions(+), 373 deletions(-) > series applied to iproute2-next
Re: [PATCH iproute2-next 0/9] print refactoring
On 2/7/18 10:10 AM, Stephen Hemminger wrote: > This patch set breaks up the big print_route function into > smaller pieces for readability and to make later changes > to support JSON and color output easier. > > Stephen Hemminger (9): > iproute: refactor printing flags > iproute: make printing icmpv6 a function > iproute: make printing IPv4 cache flags a function > iproute: refactor cacheinfo printing > iproute: refactor metrics print > iproute: refactor printing flow info > iproute: refactor newdst, gateway and via printing > iproute: refactor multipath print > iproute: refactor printing of interface > > ip/iproute.c | 571 > +++ > 1 file changed, 297 insertions(+), 274 deletions(-) > series applied to iproute2-next.
Re: [kbuild-all] [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0
On Thu, Feb 08, 2018 at 02:09:41AM +0200, Michael S. Tsirkin wrote: > On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote: > > tree: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost > > head: 96bcd04462b99e2c80e09f6537770a0ca6b288d0 > > commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support > > reporting free page blocks > > config: x86_64-rhel > > compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 > > reproduce: > > git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8 > > # save the attached .config to linux build tree > > make ARCH=x86_64 > > > > All warnings (new ones prefixed by >>): > > I'm sorry, what exactly does this mean? sorry that the script has issue to provide empty warnings here, which was supposed to list warnings with >> added to new ones like below. In file included from kernel//bpf/core.c:29:0: >> include/linux/bpf.h:72:8: error: duplicate member 'security' void *security; ^ We will follow up to check what goes wrong. > > > --- > > 0-DAY kernel test infrastructureOpen Source Technology > > Center > > https://lists.01.org/pipermail/kbuild-all Intel > > Corporation > ___ > kbuild-all mailing list > kbuild-...@lists.01.org > https://lists.01.org/mailman/listinfo/kbuild-all
Re: [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0
On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote: > tree: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost > head: 96bcd04462b99e2c80e09f6537770a0ca6b288d0 > commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support > reporting free page blocks > config: x86_64-rhel > compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 > reproduce: > git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8 > # save the attached .config to linux build tree > make ARCH=x86_64 > > All warnings (new ones prefixed by >>): I'm sorry, what exactly does this mean? > --- > 0-DAY kernel test infrastructureOpen Source Technology Center > https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [suricata PATCH 1/3] suricata/ebpf: take clang -target bpf include issue of stdint.h into account
Hi, On Wed, 2018-02-07 at 23:21 +0100, Jesper Dangaard Brouer wrote: > From: Jesper Dangaard Brouer > > This patch prepares code before enabling the clang -target bpf. > > The clang compiler does not like #include when > using '-target bpf' it will fail with: > > fatal error: 'gnu/stubs-32.h' file not found ... > This can be worked around by installing the 32-bit version of > glibc-devel.i686 on your distribution. > > But the BPF programs does not really need to include stdint.h, > if converting: > uint64_t -> __u64 > uint32_t -> __u32 > uint16_t -> __u16 > uint8_t -> __u8 > > This patch does this type syntax conversion. There is an issue for system like Debian because they don't have a asm/types.h in the include path if the architecture is not defined which is the case due to target bpf. This results in: clang-5.0 -Wall -Iinclude -O2 \ -D__KERNEL__ -D__ASM_SYSREG_H \ -target bpf -S -emit-llvm vlan_filter.c -o vlan_filter.ll In file included from vlan_filter.c:19: In file included from include/linux/bpf.h:11: /usr/include/linux/types.h:5:10: fatal error: 'asm/types.h' file not found #include ^ 1 error generated. Makefile:523: recipe for target 'vlan_filter.bpf' failed To go into details, the Debian package providing the 'asm/typs.h' include is the the headers or linux-libc-dev. But this package comes with a flavor and thus we have a prefix: linux-libc-dev:amd64: /usr/include/x86_64-linux-gnu/asm/types.h "Fun" part here is that if you build a debian package of the via make in Linux tree then the linux-libc-dev package is correct. So I propose the following patch that fixes the issue for me: diff --git a/ebpf/Makefile.am b/ebpf/Makefile.am index 89a3304e9..712b05343 100644 --- a/ebpf/Makefile.am +++ b/ebpf/Makefile.am @@ -16,6 +16,7 @@ all: $(BPF_TARGETS) $(BPF_TARGETS): %.bpf: %.c # From C-code to LLVM-IR format suffix .ll (clang -S -emit-llvm) ${CLANG} -Wall $(BPF_CFLAGS) -O2 \ + -I/usr/include/$(host_cpu)-$(host_os)/ \ -D__KERNEL__ -D__ASM_SYSREG_H \ -target bpf -S -emit-llvm $< -o ${@:.bpf=.ll} # From LLVM-IR to BPF-bytecode in ELF-obj file Let me know if it is ok for you. Best regards, -- Eric Leblond
Re: [PATCH v2] selftests: bpf: test_kmod.sh: check the module path before insmod
On 02/07/2018 07:15 PM, Naresh Kamboju wrote: > test_kmod.sh reported false failure when module not present. > Check test_bpf.ko is present in the path before loading it. > > Two cases to be addressed here, > In the development process of test_bpf.c unit testing will be done by > developers by using "insmod $SRC_TREE/lib/test_bpf.ko" > > On the other hand testers run full tests by installing modules on device > under test (DUT) and followed by modprobe to insert the modules accordingly. > > Signed-off-by: Naresh Kamboju Applied to bpf tree, thanks Naresh!
Re: [PATCH] net: ethernet: ti: cpsw: fix net watchdog timeout
Rechecked once again, seems it covers every case, at this moment, so Reviewed-by: Ivan Khoronzhuk -- Regards, Ivan Khoronzhuk
Re: Fwd: u32 ht filters
On Tue, Feb 6, 2018 at 11:01 PM, Jiri Pirko wrote: > Wed, Feb 07, 2018 at 06:09:15AM CET, xiyou.wangc...@gmail.com wrote: >>Hi, Jiri >> >>Your commit 7fa9d974f3c2a016b9accb18f4ee2ed2a738585c >>breaks the tc script by Paweł. Please find below for details. > > Did you do the bisection? > The commit just uses block struct instead of q, but since they > are in 1:1 relation, that should be equvivalent. So basically you still > have per-qdisc hashtables for u32. Well, at least the following fixes the problem here. But I am not sure if it is expected too for shared block among multiple qdiscs. @@ -338,7 +330,7 @@ static struct hlist_head *tc_u_common_hash; static unsigned int tc_u_hash(const struct tcf_proto *tp) { - return hash_ptr(tp->chain->block, U32_HASH_SHIFT); + return hash_ptr(tp->chain->block->q, U32_HASH_SHIFT); } static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp) @@ -348,7 +340,7 @@ static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp) h = tc_u_hash(tp); hlist_for_each_entry(tc, &tc_u_common_hash[h], hnode) { - if (tc->block == tp->chain->block) + if (tc->block->q == tp->chain->block->q) return tc; } return NULL;
Re: [PATCH] net: ethernet: ti: cpsw: fix net watchdog timeout
On Wed, Feb 07, 2018 at 12:19:11PM -0600, Grygorii Strashko wrote: On 02/07/2018 07:31 AM, Ivan Khoronzhuk wrote: On Wed, Feb 07, 2018 at 05:03:19AM +0200, Ivan Khoronzhuk wrote: On Tue, Feb 06, 2018 at 07:17:06PM -0600, Grygorii Strashko wrote: It was discovered that simple program which indefinitely sends 200b UDP packets and runs on TI AM574x SoC (SMP) under RT Kernel triggers network watchdog timeout in TI CPSW driver (<6 hours run). The network watchdog timeout is triggered due to race between cpsw_ndo_start_xmit() and cpsw_tx_handler() [NAPI] cpsw_ndo_start_xmit() if (unlikely(!cpdma_check_free_tx_desc(txch))) { txq = netdev_get_tx_queue(ndev, q_idx); netif_tx_stop_queue(txq); ^^ as per [1] barrier has to be used after set_bit() otherwise new value might not be visible to other cpus } cpsw_tx_handler() if (unlikely(netif_tx_queue_stopped(txq))) netif_tx_wake_queue(txq); and when it happens ndev TX queue became disabled forever while driver's HW TX queue is empty. I'm sure it fixes test case somehow but there is some strangeness. (I've thought about this some X months ago): 1. If no free desc, then there is bunch of descs on the queue ready to be sent 2. If one of this desc while this process was missed then next will wake queue, because there is bunch of them on the fly. So, if desc on top of the sent queue missed to enable the queue, then next one more likely will enable it anyway.. then how it could happen? The described race is possible only on last descriptor, yes, packets are small the speed is hight, possibility is very small .but then next situation is also possible: - packets are sent fast - all packets were sent, but no any descriptors are freed now by sw interrupt (NAPI) - when interrupt had started NAPI, the queue was enabled, all other next interrupts are throttled once NAPI not finished it's work yet. - when new packet submitted, no free descs are present yet (NAPI has not freed any yet), but all packets are sent, so no one can awake tx queue, as interrupt will not arise when NAPI is started to free first descriptor interrupts are disabled.because h/w queue to be sent is empty... - how it can happen as submitting packet and handling packet operations is under channel lock? Not exactly, a period between handling and freeing the descriptor to the pool is not under channel lock, here: spin_unlock_irqrestore(&chan->lock, flags); if (unlikely(status & CPDMA_DESC_TD_COMPLETE)) cb_status = -ENOSYS; else cb_status = status; __cpdma_chan_free(chan, desc, outlen, cb_status); return status; unlock_ret: spin_unlock_irqrestore(&chan->lock, flags); return status; And: __cpdma_chan_free(chan, desc, outlen, cb_status); -> cpdma_desc_free(pool, desc, 1); As result, queue deadlock as you've described. Just thought, not checked, but theoretically possible. What do you think? Yep. And if this happens desc_alloc_fail should be >0 while i usually see it 0 when watchdog triggers. It has to be 0 for situation I trying to describe. See below re example for keeping all in one place I was able to reproduce this situation once, but with bunch of debug code added which changes timings: I also unintentionally observed it several times but usually was thinking that it was connected with my other experiments. It was with am572x. But now seems it actually can happen with plane sources. And maybe here not only 1 fix is needed, that's my concern. Prerequisite: only one free desc available. cpsw_ndo_start_xmit1cpsw_tx_poll prepare and send packet ->Free desc queue is empty at this moment if (unlikely(!cpdma_check_free_tx_desc(txch))) netif_tx_stop_queue(txq); --- tx queue stopped cpdma_chan_process() spin_lock_irqsave(&chan->lock, flags); chan->count-- spin_unlock_irqrestore(&chan->lock, flags) cpsw_tx_handler() if (unlikely(netif_tx_queue_stopped(txq))) netif_tx_wake_queue(txq); --- tx queue is woken up cpsw_ndo_start_xmit2 prepare packet ret = cpsw_tx_packet_submit(priv, skb, txch); //fail as desc not returned to the pool yet if (unlikely(ret != 0)) { cpsw_err(priv, tx_err, "desc submit failed\n"); goto fail; } cpdma_desc_free() fail: ndev->stats.tx_dropped++; netif_tx_stop_queue(txq); // oops. That's why I added double check and barrier in fail path also Seems to
Re: handling of phy_stop() and phy_stop_machine() in phylib
On 02/07/2018 01:13 PM, Russell King - ARM Linux wrote: > On Wed, Feb 07, 2018 at 09:56:37PM +0100, Heiner Kallweit wrote: >> Am 04.02.2018 um 03:48 schrieb Florian Fainelli: >>> >>> >>> On 02/03/2018 03:58 PM, Heiner Kallweit wrote: Am 03.02.2018 um 21:17 schrieb Andrew Lunn: > On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote: >> This commit forces callers of phy_resume() and phy_suspend() to hold >> mutex phydev->lock. This was done for calls to phy_resume() and >> phy_suspend() in phylib, however there are more callers in network >> drivers. I'd assume that these other calls issue a warning now >> because of the lock not being held. >> So is there something I miss or would this have to be fixed? > > Hi Heiner > > This is a good point. > > Yes, it looks like some fixes are needed. But what exactly? > > The phy state machine will suspend and resume the phy is you call > phy_stop() and phy_start() in the MAC suspend and resume functions. > AFAICS phy_stop() doesn't suspend the PHY. It just sets the state to PHY_HALTED and (at least if we're not in polling mode) doesn't call phy_suspend(). Maybe a call to phy_trigger_machine() is needed like in phy_start() ? Then the state machine would call phy_suspend(), provided the link is still up. >>> >>> Right, phy_stop() merely just moves the state machine to PHY_HALTED and >>> this is actually a great source of problems which I tried to address here: >>> >>> https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html >>> >>> because phy_stop() is not a synchronous call, so when it returns the >>> state machine might still be running (it can take up to a 1 HZ depending >>> on when you called phy_stop()) and so if you took that as a >>> synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks, >>> you will likely see problems. phy_stop_machine() would provide that >>> synchronization point, but is not currently exported, despite being a >>> global symbol. This patch series above is all well and good, except that >>> Geert reported issues with suspend/resume interactions which I have not >>> been able to track down. >>> >> To not confuse readers I changed the subject of the mail to reflect that >> the discussion isn't about the original topic any longer. >> >> It seems to me that (at least one) reason for the issues is that pm >> callbacks for the phy device and the network device interfere. >> phy device pm (mdio_bus_phy_suspend et al) feels responsible for dealing >> with suspending/resuming the PHY, and the network driver pm callbacks >> as well. >> >> Maybe, if the network driver takes care, it should inform phy device pm >> to do nothing. For this we could add a flag to phydev.mdio.flags. >> If the network driver sets this flag then mdio_bus_phy_suspend() >> and mdio_bus_phy_resume() could turn into no-ops. >> Not totally sure yet about mdio_bus_phy_restore() .. > > What if the MDIO bus is handled by a separate device and the MDIO bus > is suspended prior to the network driver, thereby making the PHY > inaccessible? Indeed. We can know that in the PHY library though, because there is logic to hold the module reference count, see phy_attach_direct(), we cannot quite trust whether the Ethernet controller does the right thing though, as I can think of several ways for things to be done wrong like: CONFIG_PM is enabled, but the Ethernet driver does not implement any suspend/resume callback, or does it wrong etc... -- Florian
Re: [PATCH net v2] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
On 02/07/2018 11:44 AM, Heiner Kallweit wrote: > This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added > long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates > also PHY state changes and we should do what the symbol says. > > Signed-off-by: Heiner Kallweit > --- > v2: > - use phy_interrupt_is_valid() instead of checking for irq > 0 Thanks, could you identify which Fixes: tag we should be using for that? It would be great to see this backported to -stable > --- > drivers/net/phy/phy.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c > index f3313a129..50ed35a45 100644 > --- a/drivers/net/phy/phy.c > +++ b/drivers/net/phy/phy.c > @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev) > phy_resume(phydev); > > /* make sure interrupts are re-enabled for the PHY */ > - if (phydev->irq != PHY_POLL) { > + if (phy_interrupt_is_valid(phydev)) { > err = phy_enable_interrupts(phydev); > if (err < 0) > break; > -- Florian
Re: [suricata PATCH 0/3] Suricata cleanup makefile
Hello Jesper, On Wed, 2018-02-07 at 23:21 +0100, Jesper Dangaard Brouer wrote: > Hi Eric, > > I've improved the Suricata ebpf makefile, in-order to avoid > generating > the .eh_frame sections. This required changing the code a bit, to > allow using clang -target bpf. > > The makefile have also been improved to stop on clang compile errors, > instead of generating an almost empty BPF ELF file. > > Could I ask you to get these changes into Suricata, through correct > process for this Open Source project? Sure, I'm reviewing the code, testing it and I will do a Pull Request on github. Thanks a lot for that, that's a really valuable help! BR, -- Eric Leblond
[suricata PATCH 3/3] suricata/ebpf: improving the ebpf makefile
From: Jesper Dangaard Brouer The current ebpf/Makefile.am have the problem that clang compile errors still result in an ELF .bpf output file. This is obviously problematic as the the problem is first seen runtime when loading the bpf-prog. This this is cause by the uses of a pipe from clang to llc. To address this problem, split up the clang and llc invocations up into two separate commands, to get proper reaction based on the compiler exit code. The clang compiler is used as a frontend (+ optimizer) and instructed (via -S -emit-llvm) to generate LLVM IR (Intermediate Representation) with suffix .ll. The LLVM llc command is used as a compiler backend taking IR and producing BPF machine bytecode, and storing this into a ELF object. In the last step the IR .ll suffix code it removed. The official documentation of the IR language: http://llvm.org/docs/LangRef.html Also fix the previous make portability warning: '%-style pattern rules are a GNU make extension' I instead use some static pattern rules: https://www.gnu.org/software/make/manual/html_node/Static-Usage.html Signed-off-by: Jesper Dangaard Brouer --- ebpf/Makefile.am | 22 ++ 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/ebpf/Makefile.am b/ebpf/Makefile.am index d0ee44ae668e..89a3304e953b 100644 --- a/ebpf/Makefile.am +++ b/ebpf/Makefile.am @@ -3,11 +3,25 @@ if BUILD_EBPF # Maintaining a local copy of UAPI linux/bpf.h BPF_CFLAGS = -Iinclude -all: lb.bpf filter.bpf bypass_filter.bpf xdp_filter.bpf vlan_filter.bpf +CLANG = ${CC} -%.bpf: %.c - ${CC} -Wall $(BPF_CFLAGS) -O2 -D__KERNEL__ -D__ASM_SYSREG_H -target bpf -emit-llvm -c $< -o - | ${LLC} -march=bpf -filetype=obj -o $@ +BPF_TARGETS = lb.bpf +BPF_TARGETS += filter.bpf +BPF_TARGETS += bypass_filter.bpf +BPF_TARGETS += xdp_filter.bpf +BPF_TARGETS += vlan_filter.bpf -CLEANFILES = *.bpf +all: $(BPF_TARGETS) + +$(BPF_TARGETS): %.bpf: %.c +# From C-code to LLVM-IR format suffix .ll (clang -S -emit-llvm) + ${CLANG} -Wall $(BPF_CFLAGS) -O2 \ + -D__KERNEL__ -D__ASM_SYSREG_H \ + -target bpf -S -emit-llvm $< -o ${@:.bpf=.ll} +# From LLVM-IR to BPF-bytecode in ELF-obj file + ${LLC} -march=bpf -filetype=obj ${@:.bpf=.ll} -o $@ + ${RM} ${@:.bpf=.ll} + +CLEANFILES = *.bpf *.ll endif
[suricata PATCH 0/3] Suricata cleanup makefile
Hi Eric, I've improved the Suricata ebpf makefile, in-order to avoid generating the .eh_frame sections. This required changing the code a bit, to allow using clang -target bpf. The makefile have also been improved to stop on clang compile errors, instead of generating an almost empty BPF ELF file. Could I ask you to get these changes into Suricata, through correct process for this Open Source project? --Jesper --- Jesper Dangaard Brouer (3): suricata/ebpf: take clang -target bpf include issue of stdint.h into account suricata/ebpf: compile with clang -target bpf suricata/ebpf: improving the ebpf makefile ebpf/Makefile.am | 22 ++ ebpf/bypass_filter.c | 27 +-- ebpf/filter.c|3 +-- ebpf/hash_func01.h | 12 ++-- ebpf/lb.c| 11 +-- ebpf/vlan_filter.c |5 ++--- ebpf/xdp_filter.c| 42 -- 7 files changed, 65 insertions(+), 57 deletions(-) --
[suricata PATCH 2/3] suricata/ebpf: compile with clang -target bpf
From: Jesper Dangaard Brouer Enable compiling eBPF programs with clang -target bpf. This is mostly to workaround a bug in libbpf, where clang > ver 4.0.0 generates some ELF sections (.eh_frame) when -target bpf is NOT specified, and libbpf fails loading such files. Notice libbpf is provided by the kernel, and in kernel v4.16 the library will contain the needed function for attatching to the XDP hook. Kernel commit 949abbe88436 ("libbpf: add function to setup XDP") https://git.kernel.org/torvalds/c/949abbe88436 As it looks now, the library fix will not get into kernel v4.16. Thus, we need this workaround for Suricata. In-order to recommend people installing the library libbpf from kernel v4.16. Signed-off-by: Jesper Dangaard Brouer --- ebpf/Makefile.am |2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/ebpf/Makefile.am b/ebpf/Makefile.am index 66db35f80c7e..d0ee44ae668e 100644 --- a/ebpf/Makefile.am +++ b/ebpf/Makefile.am @@ -6,7 +6,7 @@ BPF_CFLAGS = -Iinclude all: lb.bpf filter.bpf bypass_filter.bpf xdp_filter.bpf vlan_filter.bpf %.bpf: %.c - ${CC} -Wall $(BPF_CFLAGS) -O2 -D__KERNEL__ -D__ASM_SYSREG_H -emit-llvm -c $< -o - | ${LLC} -march=bpf -filetype=obj -o $@ + ${CC} -Wall $(BPF_CFLAGS) -O2 -D__KERNEL__ -D__ASM_SYSREG_H -target bpf -emit-llvm -c $< -o - | ${LLC} -march=bpf -filetype=obj -o $@ CLEANFILES = *.bpf
[suricata PATCH 1/3] suricata/ebpf: take clang -target bpf include issue of stdint.h into account
From: Jesper Dangaard Brouer This patch prepares code before enabling the clang -target bpf. The clang compiler does not like #include when using '-target bpf' it will fail with: fatal error: 'gnu/stubs-32.h' file not found This is because using clang -target bpf, then clang will have '__bpf__' defined instead of '__x86_64__' hence the gnu/stubs-32.h include attempt as /usr/include/gnu/stubs.h contains, on x86_64: #if !defined __x86_64__ # include #endif #if defined __x86_64__ && defined __LP64__ # include #endif #if defined __x86_64__ && defined __ILP32__ # include #endif This can be worked around by installing the 32-bit version of glibc-devel.i686 on your distribution. But the BPF programs does not really need to include stdint.h, if converting: uint64_t -> __u64 uint32_t -> __u32 uint16_t -> __u16 uint8_t -> __u8 This patch does this type syntax conversion. Signed-off-by: Jesper Dangaard Brouer --- ebpf/bypass_filter.c | 27 +-- ebpf/filter.c|3 +-- ebpf/hash_func01.h | 12 ++-- ebpf/lb.c| 11 +-- ebpf/vlan_filter.c |5 ++--- ebpf/xdp_filter.c| 42 -- 6 files changed, 47 insertions(+), 53 deletions(-) diff --git a/ebpf/bypass_filter.c b/ebpf/bypass_filter.c index d2ce12aa1cd5..be81032b16cf 100644 --- a/ebpf/bypass_filter.c +++ b/ebpf/bypass_filter.c @@ -15,7 +15,6 @@ * 02110-1301, USA. */ -#include #include #include @@ -51,9 +50,9 @@ struct flowv6_keys { } __attribute__((__aligned__(8))); struct pair { -uint64_t time; -uint64_t packets; -uint64_t bytes; +__u64 time; +__u64 packets; +__u64 bytes; } __attribute__((__aligned__(8))); struct bpf_map_def SEC("maps") flow_table_v4 = { @@ -77,10 +76,10 @@ struct bpf_map_def SEC("maps") flow_table_v6 = { */ static __always_inline int ipv4_filter(struct __sk_buff *skb) { -uint32_t nhoff, verlen; +__u32 nhoff, verlen; struct flowv4_keys tuple; struct pair *value; -uint16_t port; +__u16 port; nhoff = skb->cb[0]; @@ -107,8 +106,8 @@ static __always_inline int ipv4_filter(struct __sk_buff *skb) #if 0 if ((tuple.port16[0] == 22) || (tuple.port16[1] == 22)) { -uint16_t sp = tuple.port16[0]; -//uint16_t dp = tuple.port16[1]; +__u16 sp = tuple.port16[0]; +//__u16 dp = tuple.port16[1]; char fmt[] = "Parsed SSH flow: %u %d -> %u\n"; bpf_trace_printk(fmt, sizeof(fmt), tuple.src, sp, tuple.dst); } @@ -118,8 +117,8 @@ static __always_inline int ipv4_filter(struct __sk_buff *skb) if (value) { #if 0 { -uint16_t sp = tuple.port16[0]; -//uint16_t dp = tuple.port16[1]; +__u16 sp = tuple.port16[0]; +//__u16 dp = tuple.port16[1]; char bfmt[] = "Found flow: %u %d -> %u\n"; bpf_trace_printk(bfmt, sizeof(bfmt), tuple.src, sp, tuple.dst); } @@ -139,11 +138,11 @@ static __always_inline int ipv4_filter(struct __sk_buff *skb) */ static __always_inline int ipv6_filter(struct __sk_buff *skb) { -uint32_t nhoff; -uint8_t nhdr; +__u32 nhoff; +__u8 nhdr; struct flowv6_keys tuple; struct pair *value; -uint16_t port; +__u16 port; nhoff = skb->cb[0]; @@ -223,4 +222,4 @@ int SEC("filter") hashfilter(struct __sk_buff *skb) { char __license[] SEC("license") = "GPL"; -uint32_t __version SEC("version") = LINUX_VERSION_CODE; +__u32 __version SEC("version") = LINUX_VERSION_CODE; diff --git a/ebpf/filter.c b/ebpf/filter.c index 1976ffcf188f..4fe95d4fb005 100644 --- a/ebpf/filter.c +++ b/ebpf/filter.c @@ -15,7 +15,6 @@ * 02110-1301, USA. */ -#include #include #include @@ -56,4 +55,4 @@ int SEC("filter") hashfilter(struct __sk_buff *skb) { char __license[] SEC("license") = "GPL"; -uint32_t __version SEC("version") = LINUX_VERSION_CODE; +__u32 __version SEC("version") = LINUX_VERSION_CODE; diff --git a/ebpf/hash_func01.h b/ebpf/hash_func01.h index 060346f67a6a..38255812e376 100644 --- a/ebpf/hash_func01.h +++ b/ebpf/hash_func01.h @@ -4,12 +4,12 @@ * From: http://www.azillionmonkeys.com/qed/hash.html */ -#define get16bits(d) (*((const uint16_t *) (d))) +#define get16bits(d) (*((const __u16 *) (d))) static __always_inline -uint32_t SuperFastHash (const char *data, int len, uint32_t initval) { - uint32_t hash = initval; - uint32_t tmp; +__u32 SuperFastHash (const char *data, int len, __u32 initval) { + __u32 hash = initval; + __u32 tmp; int rem; if (len <= 0 || data == NULL) return 0; @@ -23,7 +23,7 @@ uint32_t SuperFastHash (const char *data, int len, uint32_t initval) { hash += get16bits (data); tmp= (get16bits (data+2) << 11) ^ hash; hash = (hash << 16) ^ tmp; - data += 2*sizeof (uint16_t); + d
Re: [RFC PATCH 02/24] xsk: add user memory registration sockopt
2018-02-07 17:00 GMT+01:00 Willem de Bruijn : > On Wed, Jan 31, 2018 at 8:53 AM, Björn Töpel wrote: >> From: Björn Töpel >> >> The XDP_MEM_REG socket option allows a process to register a window of >> user space memory to the kernel. This memory will later be used as >> frame data buffer. >> >> Signed-off-by: Björn Töpel >> --- > >> +static struct xsk_umem *xsk_mem_reg(u64 addr, u64 size, u32 frame_size, >> + u32 data_headroom) >> +{ >> + unsigned long lock_limit, locked, npages; >> + int ret = 0; >> + struct xsk_umem *umem; >> + >> + if (!can_do_mlock()) >> + return ERR_PTR(-EPERM); >> + >> + umem = xsk_umem_create(addr, size, frame_size, data_headroom); >> + if (IS_ERR(umem)) >> + return umem; >> + >> + npages = PAGE_ALIGN(umem->nframes * umem->frame_size) >> PAGE_SHIFT; >> + >> + down_write(¤t->mm->mmap_sem); >> + >> + locked = npages + current->mm->pinned_vm; >> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; >> + >> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { >> + ret = -ENOMEM; >> + goto out; >> + } >> + >> + if (npages == 0 || npages > UINT_MAX) { >> + ret = -EINVAL; >> + goto out; >> + } >> + umem->npgs = npages; >> + >> + ret = xsk_umem_pin_pages(umem); >> + >> +out: >> + if (ret < 0) { >> + put_pid(umem->pid); >> + kfree(umem); >> + } else { >> + current->mm->pinned_vm = locked; >> + } >> + >> + up_write(¤t->mm->mmap_sem); > > This limits per process. You may want to limit per user. See also > mm_account_pinned_pages. Ah, noted! Thanks for pointing that out!
Re: [RFC PATCH 00/24] Introducing AF_XDP support
2018-02-07 18:59 GMT+01:00 Tom Herbert : > On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel wrote: [...] >> >> Below are the results in Mpps of the I40E NIC benchmark runs for 64 >> byte packets, generated by commercial packet generator HW that is >> generating packets at full 40 Gbit/s line rate. >> >> XDP baseline numbers without this RFC: >> xdp_rxq_info --action XDP_DROP 31.3 Mpps >> xdp_rxq_info --action XDP_TX 16.7 Mpps >> >> XDP performance with this RFC i.e. with the buffer allocator: >> XDP_DROP 21.0 Mpps >> XDP_TX 11.9 Mpps >> >> AF_PACKET V4 performance from previous RFC on 4.14-rc7: >> Benchmark V2 V3 V4 V4+ZC >> rxdrop 0.67 0.73 0.74 33.7 >> txpush 0.98 0.98 0.91 19.6 >> l2fwd 0.66 0.71 0.67 15.5 >> >> AF_XDP performance: >> Benchmark XDP_SKB XDP_DRVXDP_DRV_ZC (all in Mpps) >> rxdrop 3.311.6 16.9 >> txpush 2.2 NA* 21.8 >> l2fwd 1.7 NA* 10.4 >> > Hi Bjorn, > > This is very impressive work, thank you for contributing it! > Thank you for looking at it! :-) > For these benchmarks how are the AF_PACKET and AF_XDP numbers to be > compared. For instance is rxdpop comparable to XDP_DROP and > "xdp_rxq_info --action XDP_DROP"? Given your explanation below, I > believe they are, but it might be better to make that clear up front. > Ah, yeah, that was a bit confusing: "xdp_rxq_info --action XDP_DROP" is doing an XDP_DROP (no buffer touching) and should be compared to "XDP_DROP". I meant to write "xdp_rxq_info --action XDP_DROP" instead of "XDP_DROP" for the second case. So, what this shows is that the buffer allocation scheme in the patch set (buff_pool) has a pretty hard performance regression (21.0 vs 31.3) on the regular XDP (and skb!) path. Not acceptable. "rxdrop" from AF_PACKET V4 should be compared to "rxdrop" from AF_XDP. This is dropping a packet in user space, whereas the former is dropping a packet in XDP/kernel space. Less confusing? Björn
Re: [RFC PATCH 00/24] Introducing AF_XDP support
2018-02-07 16:54 GMT+01:00 Willem de Bruijn : >> We realized, a bit late maybe, that 24 patches is a bit mouthful, so >> let me try to make it more palatable. > > Overall, this approach looks great to me. > Yay! :-) > The patch set incorporates all the feedback from AF_PACKET V4. > At this point I don't have additional high-level interface comments. > I have a thought on the socket API. Now, we're registering buffer memory *to* the kernel, but mmap:ing the Rx/Tx rings *from* the kernel. I'm leaning towards removing the mmap call, in favor of registering the rings to kernel analogous to the XDP_MEM_REG socket option. We wont guarantee physical contiguous memory for the rings, but I think we can live with that. Thoughts? > As you point out, 24 patches and nearly 6000 changed lines is > quite a bit to ingest. Splitting up in smaller patch sets will help > give more detailed implementation feedback. > > The frame pool and device driver changes are largely independent > from AF_XDP and probably should be resolved first (esp. the > observed regresssion even without AF_XDP). > Yeah, the regression is unacceptable. Another way is starting with the patches without zero-copy first (i.e. the copy path), and later add the driver modifications. That would be the first 7 patches. > As you suggest, it would be great if the need for a separate > xsk_packet_array data structure can be avoided. > Yes, we'll address that! > Since frames from the same frame pool can be forwarded between > multiple device ports and thus AF_XDP sockets, that should perhaps > be a separate object independent from the sockets. This comment > hints at the awkward situation if tied to a descriptor pair: > >> + /* Check if umem is from this socket, if so do not make >> +* circular references. >> +*/ > > Since this is in principle just a large shared memory area, could > it reuse existing BPF map logic? > Hmm, care to elaborate on your thinking here? > More extreme, and perhaps unrealistic, is if the descriptor ring > could similarly be a BPF map and the Rx XDP program directly > writes the descriptor, instead of triggering xdp_do_xsk_redirect. > As we discussed before, this would avoid the need to specify a > descriptor format upfront. Having the XDP program writeback the descriptor to user space ring is really something that would be useful (writing a virtio-net descriptors...). I need to think a bit more about this. :-) Please share your ideas! Thanks for looking into the patches! Björn
Re: [PATCH net-next] sch_netem: Bug fixing in calculating Netem interval
On Tue, 6 Feb 2018 23:14:18 -0500 "Md. Islam" wrote: > In Kernel 4.15.0+, Netem does not work properly. > > Netem setup: > > tc qdisc add dev h1-eth0 root handle 1: netem delay 10ms 2ms > > Result: > > PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data. > 64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=22.8 ms > 64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=10.9 ms > 64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=10.9 ms > 64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=11.4 ms > 64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms > 64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=4303 ms > 64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.2 ms > 64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.3 ms > 64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=4304 ms > 64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=4303 ms > > Patch: > > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c > index 7bbc13b..7c179ad 100644 > --- a/net/sched/sch_netem.c > +++ b/net/sched/sch_netem.c > @@ -327,7 +327,7 @@ static s64 tabledist(s64 mu, s32 sigma, > > /* default uniform distribution */ > if (dist == NULL) > -return (rnd % (2 * sigma)) - sigma + mu; > +return ((rnd % (2 * sigma)) + mu) - sigma; > > t = dist->table[rnd % dist->size]; > x = (sigma % NETEM_DIST_SCALE) * t; > > > (rnd % (2 * sigma)) - sigma was overflowing s32. After applying the > patch, I found following output which is desirable. > > PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data. > 64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=21.1 ms > 64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=8.46 ms > 64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=9.00 ms > 64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=11.8 ms > 64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=8.36 ms > 64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms > 64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=8.11 ms > 64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=10.0 ms > 64 bytes from 172.16.101.2: icmp_seq=9 ttl=64 time=11.3 ms > 64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.5 ms > 64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.2 ms Thanks for reporting the problem. The original code used 32 bit values and got converted to 64 bit time values when it went to using ktime. Your patch fixes the problem, and lets go with it. The other alternaive would be to make sigma s64 but then it would could cause unnecessary 64 bit modulus here. Reviewed-by: Stephen Hemminger
Re: handling of phy_stop() and phy_stop_machine() in phylib
On Wed, Feb 07, 2018 at 09:56:37PM +0100, Heiner Kallweit wrote: > Am 04.02.2018 um 03:48 schrieb Florian Fainelli: > > > > > > On 02/03/2018 03:58 PM, Heiner Kallweit wrote: > >> Am 03.02.2018 um 21:17 schrieb Andrew Lunn: > >>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote: > This commit forces callers of phy_resume() and phy_suspend() to hold > mutex phydev->lock. This was done for calls to phy_resume() and > phy_suspend() in phylib, however there are more callers in network > drivers. I'd assume that these other calls issue a warning now > because of the lock not being held. > So is there something I miss or would this have to be fixed? > >>> > >>> Hi Heiner > >>> > >>> This is a good point. > >>> > >>> Yes, it looks like some fixes are needed. But what exactly? > >>> > >>> The phy state machine will suspend and resume the phy is you call > >>> phy_stop() and phy_start() in the MAC suspend and resume functions. > >>> > >> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state > >> to PHY_HALTED and (at least if we're not in polling mode) doesn't > >> call phy_suspend(). Maybe a call to phy_trigger_machine() is > >> needed like in phy_start() ? Then the state machine would call > >> phy_suspend(), provided the link is still up. > > > > Right, phy_stop() merely just moves the state machine to PHY_HALTED and > > this is actually a great source of problems which I tried to address here: > > > > https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html > > > > because phy_stop() is not a synchronous call, so when it returns the > > state machine might still be running (it can take up to a 1 HZ depending > > on when you called phy_stop()) and so if you took that as a > > synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks, > > you will likely see problems. phy_stop_machine() would provide that > > synchronization point, but is not currently exported, despite being a > > global symbol. This patch series above is all well and good, except that > > Geert reported issues with suspend/resume interactions which I have not > > been able to track down. > > > To not confuse readers I changed the subject of the mail to reflect that > the discussion isn't about the original topic any longer. > > It seems to me that (at least one) reason for the issues is that pm > callbacks for the phy device and the network device interfere. > phy device pm (mdio_bus_phy_suspend et al) feels responsible for dealing > with suspending/resuming the PHY, and the network driver pm callbacks > as well. > > Maybe, if the network driver takes care, it should inform phy device pm > to do nothing. For this we could add a flag to phydev.mdio.flags. > If the network driver sets this flag then mdio_bus_phy_suspend() > and mdio_bus_phy_resume() could turn into no-ops. > Not totally sure yet about mdio_bus_phy_restore() .. What if the MDIO bus is handled by a separate device and the MDIO bus is suspended prior to the network driver, thereby making the PHY inaccessible? -- RMK's Patch system: http://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up According to speedtest.net: 8.21Mbps down 510kbps up
Re: [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect
2018-02-05 14:42 GMT+01:00 Jesper Dangaard Brouer : > On Wed, 31 Jan 2018 14:53:37 +0100 Björn Töpel wrote: > >> The bpf_xdpsk_redirect call redirects the XDP context to the XDP >> socket bound to the receiving queue (if any). > > As I explained in-person at FOSDEM, my suggestion is to use the > bpf-map infrastructure for AF_XDP redirect, but people on this > upstream mailing also need a chance to validate my idea ;-) > > The important thing to keep in-mind is how we can still maintain a > SPSC (Single producer Single Consumer) relationship between an > RX-queue and a userspace consumer-process. > > This AF_XDP "FOSDEM" patchset, store the "xsk" (xdp_sock) pointer > directly in the net_device (_rx[].netdev_rx_queue.xs) structure. This > limit each RX-queue to service a single xdp_sock. It sounds good from > a SPSC pov, but not very flexible. With a "xdp_sock_map" we can get > the flexibility to select among multiple xdp_sock'ets (via XDP > pre-filter selecting a different map), and still maintain a SPSC > relationship. The RX-queue will just have several SPSC relationships > with the individual xdp_sock's. > > This is true for the AF_XDP-copy mode, and require less driver change. > For the AF_XDP-zero-copy (ZC) mode drivers need significant changes > anyhow, and in ZC case we will have to disallow this multiple > xdp_sock's, which is simply achieved by checking if the xdp_sock > pointer returned from the map lookup match the one that userspace > requested driver to register it's memory for RX-rings from. > > The "xdp_sock_map" is an array, where the index correspond to the > queue_index. The bpf_redirect_map() ignore the specified index and > instead use xdp_rxq_info->queue_index in the lookup. > > Notice that a bpf-map have no pinned relationship with the device or > XDP prog loaded. Thus, userspace need to bind() this map to the > device before traffic can flow, like the proposed bind() on the > xdp_sock. This is to establish the SPSC binding. My proposal is that > userspace insert the xdp_sock file-descriptor(s) in the map at the > queue-index, and the map (which is also just a file-descriptor) is > bound maybe via bind() to a specific device (via the ifindex). Kernel > side will walk the map and do required actions xdp_sock's in find in > map. > As we discussed at FOSDEM, I like the idea of using a map. This also opens up for configuring the AF_XDP sockets via bpf code, like sockmap does. I'll have a stab at adding an "xdp_sock_map/xskmap" or similar, and also extending the cgroup sock_ops to support AF_XDP sockets, so that the xskmap can be configured from bpf-land. Björn > TX-side is harder, as now multiple xdp_sock's can have the same > queue-pair ID with the same net_device. But Magnus propose that this > can be solved with hardware. As newer NICs have many TX-queue, and the > queue-pair ID is just an external visible number, while the kernel > internal structure can point to a dedicated TX-queue per xdp_sock. > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer
Re: handling of phy_stop() and phy_stop_machine() in phylib
Am 04.02.2018 um 03:48 schrieb Florian Fainelli: > > > On 02/03/2018 03:58 PM, Heiner Kallweit wrote: >> Am 03.02.2018 um 21:17 schrieb Andrew Lunn: >>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote: This commit forces callers of phy_resume() and phy_suspend() to hold mutex phydev->lock. This was done for calls to phy_resume() and phy_suspend() in phylib, however there are more callers in network drivers. I'd assume that these other calls issue a warning now because of the lock not being held. So is there something I miss or would this have to be fixed? >>> >>> Hi Heiner >>> >>> This is a good point. >>> >>> Yes, it looks like some fixes are needed. But what exactly? >>> >>> The phy state machine will suspend and resume the phy is you call >>> phy_stop() and phy_start() in the MAC suspend and resume functions. >>> >> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state >> to PHY_HALTED and (at least if we're not in polling mode) doesn't >> call phy_suspend(). Maybe a call to phy_trigger_machine() is >> needed like in phy_start() ? Then the state machine would call >> phy_suspend(), provided the link is still up. > > Right, phy_stop() merely just moves the state machine to PHY_HALTED and > this is actually a great source of problems which I tried to address here: > > https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html > > because phy_stop() is not a synchronous call, so when it returns the > state machine might still be running (it can take up to a 1 HZ depending > on when you called phy_stop()) and so if you took that as a > synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks, > you will likely see problems. phy_stop_machine() would provide that > synchronization point, but is not currently exported, despite being a > global symbol. This patch series above is all well and good, except that > Geert reported issues with suspend/resume interactions which I have not > been able to track down. > To not confuse readers I changed the subject of the mail to reflect that the discussion isn't about the original topic any longer. It seems to me that (at least one) reason for the issues is that pm callbacks for the phy device and the network device interfere. phy device pm (mdio_bus_phy_suspend et al) feels responsible for dealing with suspending/resuming the PHY, and the network driver pm callbacks as well. Maybe, if the network driver takes care, it should inform phy device pm to do nothing. For this we could add a flag to phydev.mdio.flags. If the network driver sets this flag then mdio_bus_phy_suspend() and mdio_bus_phy_resume() could turn into no-ops. Not totally sure yet about mdio_bus_phy_restore() .. > We should most definitively try to consolidate the different PHY > suspend/resume within the Ethernet MAC suspend/resume implementation and > document exactly what the appropriate behavior must be under the > following circumstances: > > - when to call phy_stop() + phy_stop_machine() > - when to call phy_suspend() (if the network interface does do not WoL) > - when to call phy_resume() (if needed, actually, it usually is not) > - when to call phy_start() > > I don't unfortunately have the time to code this myself at the moment, > but I will happily review patches if you have the opportunity to do so. > >> >> However, if the link is down already (due to whatever calls >> around phy_stop() in the driver) then phy_suspend() wouldn't be >> called. > > Correct, there is an implicit assumption that when the link is down, > there is an opportunity for the Ethernet MAC driver to put things in low > power, and the PHY itself, should be in a lower power mode where only > link/energy detection might be utilizing power. At least this is the theory. > >> >> Heiner >> >>> A few examples: >>> >>> tc35815_suspend(), ravb_suspend() via ravb_close(), sh_eth_suspend() >>> via sh_eth_close(), fec_suspend(), mpc52xx_fec_of_suspend() via >>> mpc52xx_fec_close(), ucc_geth_suspend(), etc... >>> >>> So i suspect those drivers which call phy_suspend()/phy_resume() >>> should really be modified to call phy_stop()/phy_start(). >>> >>> hns_nic_config_phy_loopback() is just funky, and probably needs the >>> help of the hns guys to fix. >>> >>> dsa_slave_suspend() already does a phy_stop(), so the phy_suspend() >>> can be removed. >>> >>> The comments in lpc_eth_open() suggest the phy_resume() is needed, so >>> locks should be added. socfpga_dwmac_resume() seems to be the same. >>> >>> Andrew >>> >> >
Re: [PATCH] rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c
El Wed, Feb 07, 2018 at 02:35:59PM -0600 Larry Finger ha dit: > On 02/07/2018 02:26 PM, Matthias Kaehlcke wrote: > > In _rtl92c_get_txpower_writeval_by_regulatory() the variable writeVal > > is assigned to itself in an if ... else statement, apparently only to > > document that the branch condition is handled and that a previously read > > value should be returned unmodified. The self-assignment causes clang to > > raise the following warning: > > > > drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c:304:13: > >error: explicitly assigning value of variable of type 'u32' > > (aka 'unsigned int') to itself [-Werror,-Wself-assign] > >writeVal = writeVal; > > > > Replace the self-assignment with a semicolon, which still serves to > > document the 'handling' of the branch condition. > > > > Signed-off-by: Matthias Kaehlcke > > --- > > drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c > > b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c > > index 9cff6bc4049c..4db92496c122 100644 > > --- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c > > +++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c > > @@ -301,7 +301,7 @@ static void > > _rtl92c_get_txpower_writeval_by_regulatory(struct ieee80211_hw *hw, > > writeVal = writeVal - 0x06060606; > > else if (rtlpriv->dm.dynamic_txhighpower_lvl == > > TXHIGHPWRLEVEL_BT2) > > - writeVal = writeVal; > > + ; > > *(p_outwriteval + rf) = writeVal; > > } > > } > > > > As the branch condition does nothing, why not remove it and save the > compiler's optimizer a bit of work? The code looks strange, but it matches > the rest of Realtek's USB drivers. Sure, I am happy to change it to whatever the authors/maintainers prefer. I'll wait a bit before respinning for if others feel strongly about keeping the branch.
Re: [PATCH] rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c
On 02/07/2018 02:26 PM, Matthias Kaehlcke wrote: In _rtl92c_get_txpower_writeval_by_regulatory() the variable writeVal is assigned to itself in an if ... else statement, apparently only to document that the branch condition is handled and that a previously read value should be returned unmodified. The self-assignment causes clang to raise the following warning: drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c:304:13: error: explicitly assigning value of variable of type 'u32' (aka 'unsigned int') to itself [-Werror,-Wself-assign] writeVal = writeVal; Replace the self-assignment with a semicolon, which still serves to document the 'handling' of the branch condition. Signed-off-by: Matthias Kaehlcke --- drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c index 9cff6bc4049c..4db92496c122 100644 --- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c +++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c @@ -301,7 +301,7 @@ static void _rtl92c_get_txpower_writeval_by_regulatory(struct ieee80211_hw *hw, writeVal = writeVal - 0x06060606; else if (rtlpriv->dm.dynamic_txhighpower_lvl == TXHIGHPWRLEVEL_BT2) - writeVal = writeVal; + ; *(p_outwriteval + rf) = writeVal; } } As the branch condition does nothing, why not remove it and save the compiler's optimizer a bit of work? The code looks strange, but it matches the rest of Realtek's USB drivers. Larry
[PATCH] rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c
In _rtl92c_get_txpower_writeval_by_regulatory() the variable writeVal is assigned to itself in an if ... else statement, apparently only to document that the branch condition is handled and that a previously read value should be returned unmodified. The self-assignment causes clang to raise the following warning: drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c:304:13: error: explicitly assigning value of variable of type 'u32' (aka 'unsigned int') to itself [-Werror,-Wself-assign] writeVal = writeVal; Replace the self-assignment with a semicolon, which still serves to document the 'handling' of the branch condition. Signed-off-by: Matthias Kaehlcke --- drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c index 9cff6bc4049c..4db92496c122 100644 --- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c +++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c @@ -301,7 +301,7 @@ static void _rtl92c_get_txpower_writeval_by_regulatory(struct ieee80211_hw *hw, writeVal = writeVal - 0x06060606; else if (rtlpriv->dm.dynamic_txhighpower_lvl == TXHIGHPWRLEVEL_BT2) - writeVal = writeVal; + ; *(p_outwriteval + rf) = writeVal; } } -- 2.16.0.rc1.238.g530d649a79-goog
Re: Regression for ip6-in-ip4 IPsec tunnel in 4.14.16
On Wed, 2018-02-07 at 13:50 -0500, Mike Maloney wrote: > On Wed, Feb 7, 2018 at 12:23 PM, Yves-Alexis Perez > > Hi Yves-Alexis - > > I apologize for the problem. It seems to me that tunneling with an > outer MTU that causes the inner MTU to be smaller than the min, is > potentially problematic in other ways as well. Maybe. I tried with removing the MTU setting, and I get (on ping again) févr. 07 20:44:01 scapa kernel: mtu: 1266 which means I would get -EINVAL on standards kernels, which is not really good either. > But also it could seem unfortunate that the code with my fix does not > look at actual packet size, but instead only looks at the MTU and then > fails, even if no packet was going to be so large. The intention of > my patch was to prevent a negative number while calculating the > maxfraglen in __ip6_append_data(). An alternative fix maybe to > instead return an error only if the mtu is less than or equal to the > fragheaderlen. Something like: > > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > index 3763dc01e374..5d912a289b95 100644 > --- a/net/ipv6/ip6_output.c > +++ b/net/ipv6/ip6_output.c > @@ -1214,8 +1214,6 @@ static int ip6_setup_cork(struct sock *sk, > struct inet_cork_full *cork, > if (np->frag_size) > mtu = np->frag_size; > } > - if (mtu < IPV6_MIN_MTU) > - return -EINVAL; > cork->base.fragsize = mtu; > if (dst_allfrag(rt->dst.path)) > cork->base.flags |= IPCORK_ALLFRAG; > @@ -1264,6 +1262,8 @@ static int __ip6_append_data(struct sock *sk, > > fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len + > (opt ? opt->opt_nflen : 0); > + if (mtu < fragheaderlen + 8) > + return -EINVAL; > maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - > sizeof(struct frag_hdr); > (opt ? opt->opt_nflen : 0); > > But then we also have to convince ourselves that maxfraglen can never > be <= 0. I'd have to think about that. > > I am not sure if others have thoughts on supporting MTUs configured > below the min in the spec. > Here, the MTU is not below, so I'm not sure what happens. Regards, -- Yves-Alexis signature.asc Description: This is a digitally signed message part
[PATCH net v2] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates also PHY state changes and we should do what the symbol says. Signed-off-by: Heiner Kallweit --- v2: - use phy_interrupt_is_valid() instead of checking for irq > 0 --- drivers/net/phy/phy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index f3313a129..50ed35a45 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev) phy_resume(phydev); /* make sure interrupts are re-enabled for the PHY */ - if (phydev->irq != PHY_POLL) { + if (phy_interrupt_is_valid(phydev)) { err = phy_enable_interrupts(phydev); if (err < 0) break; -- 2.16.1
Re: [PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
Am 07.02.2018 um 20:34 schrieb Florian Fainelli: > > > On 02/07/2018 11:26 AM, Heiner Kallweit wrote: >> Am 07.02.2018 um 20:06 schrieb Florian Fainelli: >>> >>> >>> On 02/07/2018 10:44 AM, Heiner Kallweit wrote: This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates also PHY state changes and we should do what the symbol says. >>> >>> Do you use phy_enable_interrupts() to configure how the PHY interrupts >>> will be flowing through the Ethernet MAC? >>> >> No. And I'm not sure I understand your question correctly. > > No wonder, my question does not make sense, I read the test wrong. > >> The change applies the same behavior as e.g. in phy_connect_direct() >> where phy_start_interrupts() is called only if phy_dev->irq > 0. > > Not enough coffee, your change is fine, could you consider using > phy_interrupt_is_valid() instead for this test? > Sure. I was considering this already however wasn't sure because currently both ways to check for a valid interrupt are used in phylib. >> Signed-off-by: Heiner Kallweit --- drivers/net/phy/phy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index f3313a129..50ed35a45 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev) phy_resume(phydev); /* make sure interrupts are re-enabled for the PHY */ - if (phydev->irq != PHY_POLL) { + if (phydev->irq > 0) { err = phy_enable_interrupts(phydev); if (err < 0) break; >>> >> >
[PATCH] net: Extra '_get' in declaration of arch_get_platform_mac_address
In commit c7f5d105495a ("net: Add eth_platform_get_mac_address() helper."), two declarations were added: int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr); unsigned char *arch_get_platform_get_mac_address(void); An extra '_get' was introduced in arch_get_platform_get_mac_address, remove it. Fix compile warning using W=1: CC net/ethernet/eth.o net/ethernet/eth.c:523:24: warning: no previous prototype for ‘arch_get_platform_mac_address’ [-Wmissing-prototypes] unsigned char * __weak arch_get_platform_mac_address(void) ^ AR net/ethernet/built-in.o Signed-off-by: Mathieu Malaterre --- include/linux/etherdevice.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h index 263dbcad22fc..79563840c295 100644 --- a/include/linux/etherdevice.h +++ b/include/linux/etherdevice.h @@ -31,7 +31,7 @@ #ifdef __KERNEL__ struct device; int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr); -unsigned char *arch_get_platform_get_mac_address(void); +unsigned char *arch_get_platform_mac_address(void); u32 eth_get_headlen(void *data, unsigned int max_len); __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev); extern const struct header_ops eth_header_ops; -- 2.11.0
Re: [PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
On 02/07/2018 11:26 AM, Heiner Kallweit wrote: > Am 07.02.2018 um 20:06 schrieb Florian Fainelli: >> >> >> On 02/07/2018 10:44 AM, Heiner Kallweit wrote: >>> This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added >>> long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates >>> also PHY state changes and we should do what the symbol says. >> >> Do you use phy_enable_interrupts() to configure how the PHY interrupts >> will be flowing through the Ethernet MAC? >> > No. And I'm not sure I understand your question correctly. No wonder, my question does not make sense, I read the test wrong. > The change applies the same behavior as e.g. in phy_connect_direct() > where phy_start_interrupts() is called only if phy_dev->irq > 0. Not enough coffee, your change is fine, could you consider using phy_interrupt_is_valid() instead for this test? > >>> >>> Signed-off-by: Heiner Kallweit >>> --- >>> drivers/net/phy/phy.c | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c >>> index f3313a129..50ed35a45 100644 >>> --- a/drivers/net/phy/phy.c >>> +++ b/drivers/net/phy/phy.c >>> @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev) >>> phy_resume(phydev); >>> >>> /* make sure interrupts are re-enabled for the PHY */ >>> - if (phydev->irq != PHY_POLL) { >>> + if (phydev->irq > 0) { >>> err = phy_enable_interrupts(phydev); >>> if (err < 0) >>> break; >>> >> > -- Florian
Re: [PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
Am 07.02.2018 um 20:06 schrieb Florian Fainelli: > > > On 02/07/2018 10:44 AM, Heiner Kallweit wrote: >> This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added >> long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates >> also PHY state changes and we should do what the symbol says. > > Do you use phy_enable_interrupts() to configure how the PHY interrupts > will be flowing through the Ethernet MAC? > No. And I'm not sure I understand your question correctly. The change applies the same behavior as e.g. in phy_connect_direct() where phy_start_interrupts() is called only if phy_dev->irq > 0. >> >> Signed-off-by: Heiner Kallweit >> --- >> drivers/net/phy/phy.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c >> index f3313a129..50ed35a45 100644 >> --- a/drivers/net/phy/phy.c >> +++ b/drivers/net/phy/phy.c >> @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev) >> phy_resume(phydev); >> >> /* make sure interrupts are re-enabled for the PHY */ >> -if (phydev->irq != PHY_POLL) { >> +if (phydev->irq > 0) { >> err = phy_enable_interrupts(phydev); >> if (err < 0) >> break; >> >
[PATCH net-next] ibmvnic: queue reset when CRQ gets closed during reset
While handling a driver reset we get a H_CLOSED return trying to send a CRQ event. When this occurs we need to queue up another reset attempt. Without doing this we see instances where the driver is left in a closed state because the reset failed and there is no further attempts to reset the driver. Signed-off-by: Nathan Fontenot --- drivers/net/ethernet/ibm/ibmvnic.c |6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index 5caaa9033841..0f7e1d016207 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -2908,8 +2908,12 @@ static int ibmvnic_send_crq(struct ibmvnic_adapter *adapter, cpu_to_be64(u64_crq[1])); if (rc) { - if (rc == H_CLOSED) + if (rc == H_CLOSED) { dev_warn(dev, "CRQ Queue closed\n"); + if (adapter->resetting) + ibmvnic_reset(adapter, VNIC_RESET_FATAL); + } + dev_warn(dev, "Send error (rc=%d)\n", rc); }
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
On Wed, Feb 07, 2018 at 11:06:42AM -0800, Andrew Morton wrote: > From: Michal Hocko > Subject: net/netfilter/x_tables.c: remove size check > > Back in 2002 vmalloc used to BUG on too large sizes. We are much better > behaved these days and vmalloc simply returns NULL for those. Remove the > check as it simply not needed and the comment is even misleading. Applied, thanks!
Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)
On Wed, 7 Feb 2018 18:44:39 +0100 Pablo Neira Ayuso wrote: > Hi, > > On Wed, Jan 31, 2018 at 09:19:16AM +0100, Michal Hocko wrote: > [...] > > Yeah, we do not BUG but rather fail instead. See __vmalloc_node_range. > > My excavation tools pointed me to "VM: Rework vmalloc code to support > > mapping of arbitray pages" > > by Christoph back in 2002. So yes, we can safely remove it finally. Se > > below. > > > > > > From 8d52e1d939d101b0dafed6ae5c3c1376183e65bb Mon Sep 17 00:00:00 2001 > > From: Michal Hocko > > Date: Wed, 31 Jan 2018 09:16:56 +0100 > > Subject: [PATCH] net/netfilter/x_tables.c: remove size check > > > > Back in 2002 vmalloc used to BUG on too large sizes. We are much better > > behaved these days and vmalloc simply returns NULL for those. Remove > > the check as it simply not needed and the comment even misleading. > > > > Suggested-by: Andrew Morton > > Signed-off-by: Michal Hocko > > --- > > net/netfilter/x_tables.c | 4 > > 1 file changed, 4 deletions(-) > > > > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c > > index b55ec5aa51a6..48a6ff620493 100644 > > --- a/net/netfilter/x_tables.c > > +++ b/net/netfilter/x_tables.c > > @@ -999,10 +999,6 @@ struct xt_table_info *xt_alloc_table_info(unsigned int > > size) > > if (sz < sizeof(*info)) > > return NULL; > > > > - /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */ > > - if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages) > > - return NULL; > > - > > /* __GFP_NORETRY is not fully supported by kvmalloc but it should > > * work reasonably well if sz is too large and bail out rather > > * than shoot all processes down before realizing there is nothing > > Patchwork didn't catch this patch for some reason, would you mind to > resend? From: Michal Hocko Subject: net/netfilter/x_tables.c: remove size check Back in 2002 vmalloc used to BUG on too large sizes. We are much better behaved these days and vmalloc simply returns NULL for those. Remove the check as it simply not needed and the comment is even misleading. Link: http://lkml.kernel.org/r/20180131081916.go21...@dhcp22.suse.cz Suggested-by: Andrew Morton Signed-off-by: Michal Hocko Reviewed-by: Andrew Morton Cc: Florian Westphal Cc: David S. Miller Signed-off-by: Andrew Morton --- net/netfilter/x_tables.c |4 1 file changed, 4 deletions(-) diff -puN net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check net/netfilter/x_tables.c --- a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check +++ a/net/netfilter/x_tables.c @@ -1004,10 +1004,6 @@ struct xt_table_info *xt_alloc_table_inf if (sz < sizeof(*info)) return NULL; - /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */ - if ((size >> PAGE_SHIFT) + 2 > totalram_pages) - return NULL; - /* __GFP_NORETRY is not fully supported by kvmalloc but it should * work reasonably well if sz is too large and bail out rather * than shoot all processes down before realizing there is nothing _
Re: [PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
On 02/07/2018 10:44 AM, Heiner Kallweit wrote: > This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added > long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates > also PHY state changes and we should do what the symbol says. Do you use phy_enable_interrupts() to configure how the PHY interrupts will be flowing through the Ethernet MAC? > > Signed-off-by: Heiner Kallweit > --- > drivers/net/phy/phy.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c > index f3313a129..50ed35a45 100644 > --- a/drivers/net/phy/phy.c > +++ b/drivers/net/phy/phy.c > @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev) > phy_resume(phydev); > > /* make sure interrupts are re-enabled for the PHY */ > - if (phydev->irq != PHY_POLL) { > + if (phydev->irq > 0) { > err = phy_enable_interrupts(phydev); > if (err < 0) > break; > -- Florian
Re: [PATCH 00/11] Netfilter fixes for net
From: Pablo Neira Ayuso Date: Wed, 7 Feb 2018 18:42:18 +0100 > The following patchset contains Netfilter fixes for you net tree, they > are: ... > You can pull these changes from: > > git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git Pulled, thanks Pablo. > P.S: Again more fixes cooking on netfilter-de...@vger.kernel.org, so > another round is likely coming up soon. Ok, no problem.
Re: Regression for ip6-in-ip4 IPsec tunnel in 4.14.16
On Wed, Feb 7, 2018 at 12:23 PM, Yves-Alexis Perez wrote: > On Wed, 2018-02-07 at 18:05 +0100, Yves-Alexis Perez wrote: >> I'll try to printk the mtu before returning EINVAL to see why it's lower than >> 1280, but maybe the IP encapsulation is not correctly handled? > > I did: > > diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c > index 3763dc01e374..d3c651158d35 100644 > --- a/net/ipv6/ip6_output.c > +++ b/net/ipv6/ip6_output.c > @@ -1215,7 +1215,7 @@ static int ip6_setup_cork(struct sock *sk, struct > inet_cork_full *cork, > mtu = np->frag_size; > } > if (mtu < IPV6_MIN_MTU) > - return -EINVAL; > + printk("mtu: %d\n", mtu); > cork->base.fragsize = mtu; > if (dst_allfrag(rt->dst.path)) > cork->base.flags |= IPCORK_ALLFRAG; > > and I get: > > févr. 07 18:19:50 scapa kernel: mtu: 1218 > > and it doesn't depend on the original packet size (same thing happens with > ping -s 100). It also happens with UDP (DNS) traffic, but apparently not with > TCP. > > Regards, > -- > Yves-Alexis Hi Yves-Alexis - I apologize for the problem. It seems to me that tunneling with an outer MTU that causes the inner MTU to be smaller than the min, is potentially problematic in other ways as well. But also it could seem unfortunate that the code with my fix does not look at actual packet size, but instead only looks at the MTU and then fails, even if no packet was going to be so large. The intention of my patch was to prevent a negative number while calculating the maxfraglen in __ip6_append_data(). An alternative fix maybe to instead return an error only if the mtu is less than or equal to the fragheaderlen. Something like: diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c index 3763dc01e374..5d912a289b95 100644 --- a/net/ipv6/ip6_output.c +++ b/net/ipv6/ip6_output.c @@ -1214,8 +1214,6 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork, if (np->frag_size) mtu = np->frag_size; } - if (mtu < IPV6_MIN_MTU) - return -EINVAL; cork->base.fragsize = mtu; if (dst_allfrag(rt->dst.path)) cork->base.flags |= IPCORK_ALLFRAG; @@ -1264,6 +1262,8 @@ static int __ip6_append_data(struct sock *sk, fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len + (opt ? opt->opt_nflen : 0); + if (mtu < fragheaderlen + 8) + return -EINVAL; maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen - sizeof(struct frag_hdr); (opt ? opt->opt_nflen : 0); But then we also have to convince ourselves that maxfraglen can never be <= 0. I'd have to think about that. I am not sure if others have thoughts on supporting MTUs configured below the min in the spec. Thanks. -- Mike Maloney
[PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT
This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates also PHY state changes and we should do what the symbol says. Signed-off-by: Heiner Kallweit --- drivers/net/phy/phy.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c index f3313a129..50ed35a45 100644 --- a/drivers/net/phy/phy.c +++ b/drivers/net/phy/phy.c @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev) phy_resume(phydev); /* make sure interrupts are re-enabled for the PHY */ - if (phydev->irq != PHY_POLL) { + if (phydev->irq > 0) { err = phy_enable_interrupts(phydev); if (err < 0) break; -- 2.16.1
[Secunia Research] Linux Kernel Vulnerability - Sending information
Hello, Secunia Research at Flexera has discovered a vulnerability in Linux Kernel, which can be exploited by malicious, local users to cause a DoS (Denial of Service). Details: - After a bit of fuzzing and some debugging, I've prepared a program that triggers a BUG() failure at net/core/skbuff.c:104. It happens when an SCTP ABORT message is about to be sent. The main problem seems to be with the data size/length. This becomes a problem when the flow reaches the "skb_put()" function (net/core/skbuff.c) and the "unlikenly()" condition is met. I have just checked the reproducer against the current David Miller net-tree and it doesn't seem to be addressed yet. Proof-of-Concept: - I wasn't sure if I should share the reproducer via this email. Please let me know what's the preferred channel. Kernel crash message: [ 31.900829] skbuff: skb_over_panic: text:d6dff053 len:68556 put:68544 head:1a927f7f data:01696ac8 tail:0x10c84 end:0xec0 dev: [ 31.902421] [ cut here ] [ 31.902968] kernel BUG at net/core/skbuff.c:104! [ 31.903559] invalid opcode: [#1] SMP KASAN PTI [ 31.904159] Modules linked in: [ 31.904541] CPU: 1 PID: 3458 Comm: repro Not tainted 4.15.0+ #2 [ 31.905416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [ 31.906749] RIP: 0010:skb_panic+0x152/0x1d0 [ 31.907211] RSP: 0018:880066f766a0 EFLAGS: 00010282 [ 31.907762] RAX: 008f RBX: 8800641ae2c0 RCX: [ 31.908527] RDX: 008f RSI: 11000cdeec99 RDI: ed000cdeecc8 [ 31.909287] RBP: 84a2efc0 R08: 11000cdeec6d R09: [ 31.910022] R10: 0001 R11: R12: 83c920f0 [ 31.910770] R13: 00010bc0 R14: 84a2e860 R15: 0ec0 [ 31.911514] FS: 7f4face87700() GS:88006cf0() knlGS: [ 31.912367] CS: 0010 DS: ES: CR0: 80050033 [ 31.912973] CR2: 20020fe5 CR3: 69288000 CR4: 06e0 [ 31.913708] Call Trace: [ 31.914534] skb_put+0x178/0x1c0 [ 31.914890] sctp_packet_transmit+0x1120/0x3740 [ 31.924671] sctp_outq_flush+0x113a/0x3b90 [ 31.963822] sctp_do_sm+0x4a5c/0x65c0 [ 31.973685] sctp_primitive_ABORT+0x99/0xc0 [ 31.974457] sctp_sendmsg+0x1bb4/0x33a0 [ 31.987812] inet_sendmsg+0x125/0x580 [ 31.991609] sock_sendmsg+0xc0/0x100 [ 31.992320] ___sys_sendmsg+0x714/0x900 [ 31.03] __sys_sendmsg+0xbd/0x1e0 [ 32.002942] SyS_sendmsg+0x27/0x40 [ 32.003585] entry_SYSCALL_64_fastpath+0x18/0x85 [ 32.004461] RIP: 0033:0x7f4fada2472d [ 32.005130] RSP: 002b:7f4face86ec0 EFLAGS: 0293 [ 32.005138] Code: 03 0f b6 04 01 84 c0 74 04 3c 03 7e 20 8b 4b 78 41 56 45 89 e8 41 57 56 48 c7 c7 a0 e8 a2 84 52 48 89 ee 4c 89 e2 e8 00 bd 2d fe <0f> 0b 4c 89 4c 24 10 48 89 54 24 08 48 89 34 24 e8 b9 15 72 fe [ 32.009658] RIP: skb_panic+0x152/0x1d0 RSP: 880066f766a0 [ 32.010756] ---[ end trace 239ba69b984ccf99 ]--- Closing Comments: - We have assigned the vulnerability Secunia Advisory SA81331. A preliminary release date has been set to February 21st, 2018 for the publication of our advisory. However, we are naturally prepared to push the disclosure date in accordance with the Secunia Research Disclosure Policy [1], if you need more time to address the vulnerability as long as you keep us updated on the status. Please don't hesitate to contact us for assistance with confirming/reproducing the reported vulnerability. Credits should go to: "Jakub Jirasek, Secunia Research at Flexera" In the case that a HTTPS URL is allowed within the mentioning of the credits on e.g. your web site, then please utilize the link [2], which could be made to trigger by clicking on the "Secunia Research" parts of the credits for example. We highly appreciate the effort. Please acknowledge receiving this e-mail and let us know when you expect to fix the vulnerability. References: [1] https://secuniaresearch.flexerasoftware.com/community/research/policy/ [2] https://www.flexerasoftware.com/enterprise/company/about/secunia-research/ --- Kind Regards, Jakub Jirasek Team Lead Information Security Analyst Secunia Research at Flexera Arne Jacobsens Allé 7, 5th floor 2300 Copenhagen S Denmark Phone +45 7020 5144 http://www.flexera.com
Re: [PATCH] net: ethernet: ti: cpsw: fix net watchdog timeout
On 02/07/2018 07:31 AM, Ivan Khoronzhuk wrote: > On Wed, Feb 07, 2018 at 05:03:19AM +0200, Ivan Khoronzhuk wrote: >> On Tue, Feb 06, 2018 at 07:17:06PM -0600, Grygorii Strashko wrote: >>> It was discovered that simple program which indefinitely sends 200b UDP >>> packets and runs on TI AM574x SoC (SMP) under RT Kernel triggers network >>> watchdog timeout in TI CPSW driver (<6 hours run). The network watchdog >>> timeout is triggered due to race between cpsw_ndo_start_xmit() and >>> cpsw_tx_handler() [NAPI] >>> >>> cpsw_ndo_start_xmit() >>> if (unlikely(!cpdma_check_free_tx_desc(txch))) { >>> txq = netdev_get_tx_queue(ndev, q_idx); >>> netif_tx_stop_queue(txq); >>> >>> ^^ as per [1] barrier has to be used after set_bit() otherwise new value >>> might not be visible to other cpus >>> } >>> >>> cpsw_tx_handler() >>> if (unlikely(netif_tx_queue_stopped(txq))) >>> netif_tx_wake_queue(txq); >>> >>> and when it happens ndev TX queue became disabled forever while driver's HW >>> TX queue is empty. >> I'm sure it fixes test case somehow but there is some strangeness. >> (I've thought about this some X months ago): >> 1. If no free desc, then there is bunch of descs on the queue ready to be >> sent >> 2. If one of this desc while this process was missed then next will wake >> queue, >> because there is bunch of them on the fly. So, if desc on top of the sent >> queue >> missed to enable the queue, then next one more likely will enable it anyway.. >> then how it could happen? The described race is possible only on last >> descriptor, yes, packets are small the speed is hight, possibility is very >> small >> .but then next situation is also possible: >> - packets are sent fast >> - all packets were sent, but no any descriptors are freed now by sw >> interrupt (NAPI) >> - when interrupt had started NAPI, the queue was enabled, all other next >> interrupts are throttled once NAPI not finished it's work yet. >> - when new packet submitted, no free descs are present yet (NAPI has not >> freed >> any yet), but all packets are sent, so no one can awake tx queue, as >> interrupt >> will not arise when NAPI is started to free first descriptor interrupts are >> disabled.because h/w queue to be sent is empty... >> - how it can happen as submitting packet and handling packet operations is >> under >> channel lock? Not exactly, a period between handling and freeing the >> descriptor >> to the pool is not under channel lock, here: >> >> spin_unlock_irqrestore(&chan->lock, flags); >> if (unlikely(status & CPDMA_DESC_TD_COMPLETE)) >> cb_status = -ENOSYS; >> else >> cb_status = status; >> >> __cpdma_chan_free(chan, desc, outlen, cb_status); >> return status; >> >> unlock_ret: >> spin_unlock_irqrestore(&chan->lock, flags); >> return status; >> >> And: >> __cpdma_chan_free(chan, desc, outlen, cb_status); >> -> cpdma_desc_free(pool, desc, 1); >> >> As result, queue deadlock as you've described. >> Just thought, not checked, but theoretically possible. >> What do you think? Yep. And if this happens desc_alloc_fail should be >0 while i usually see it 0 when watchdog triggers. I was able to reproduce this situation once, but with bunch of debug code added which changes timings: Prerequisite: only one free desc available. cpsw_ndo_start_xmit1 cpsw_tx_poll prepare and send packet ->Free desc queue is empty at this moment if (unlikely(!cpdma_check_free_tx_desc(txch))) netif_tx_stop_queue(txq); --- tx queue stopped cpdma_chan_process() spin_lock_irqsave(&chan->lock, flags); chan->count-- spin_unlock_irqrestore(&chan->lock, flags) cpsw_tx_handler() if (unlikely(netif_tx_queue_stopped(txq))) netif_tx_wake_queue(txq); --- tx queue is woken up cpsw_ndo_start_xmit2 prepare packet ret = cpsw_tx_packet_submit(priv, skb, txch); //fail as desc not returned to the pool yet if (unlikely(ret != 0)) { cpsw_err(priv, tx_err, "desc submit failed\n"); goto fail; } cpdma_desc_free() fail: ndev->stats.tx_dropped++; netif_tx_stop_queue(txq); // oops. That's why I added double check and barrier in fail path also > > Better explanation, for rare race: > > start conditions: > - all descs are submitted, except last one, and queue is not stopped > - any desc was returned to the pool yet (but packets can be sent) > > time > || > \/ > > submit pr
Re: [PATCH] selftests: bpf: test_kmod.sh: check the module path before insmod
Hi Daniel, On 7 February 2018 at 19:02, Daniel Borkmann wrote: > Hi Naresh, > > On 02/06/2018 10:07 PM, Naresh Kamboju wrote: >> test_kmod.sh reported false failure when module not present. >> Check test_bpf.ko is present in the path before loading it. >> >> Stop using "insmod $SRC_TREE/lib/test_bpf.ko" instead use >> "modprobe test_bpf" >> >> Signed-off-by: Naresh Kamboju > > Thanks for looking into this! Could we have a way to be able to > support both? Say, when test_bpf.ko from SRC_TREE is not present, > we try with modprobe -q fallback? I would still like to support > the case where you can make local changes and add new tests to > test_bpf.c, recompile and then just rerun the test_kmod.sh w/o > having to install it first. Thanks for the review comments. I have sent patch v2 with your comments addressed. - Naresh > > Thanks, > Daniel > >> tools/testing/selftests/bpf/test_kmod.sh | 11 +++ >> 1 file changed, 7 insertions(+), 4 deletions(-) >> >> diff --git a/tools/testing/selftests/bpf/test_kmod.sh >> b/tools/testing/selftests/bpf/test_kmod.sh >> index ed4774d..54177b1 100755 >> --- a/tools/testing/selftests/bpf/test_kmod.sh >> +++ b/tools/testing/selftests/bpf/test_kmod.sh >> @@ -1,8 +1,6 @@ >> #!/bin/sh >> # SPDX-License-Identifier: GPL-2.0 >> >> -SRC_TREE=../../../../ >> - >> test_run() >> { >> sysctl -w net.core.bpf_jit_enable=$1 2>&1 > /dev/null >> @@ -10,8 +8,13 @@ test_run() >> >> echo "[ JIT enabled:$1 hardened:$2 ]" >> dmesg -C >> - insmod $SRC_TREE/lib/test_bpf.ko 2> /dev/null >> - if [ $? -ne 0 ]; then >> + # Use modprobe dry run to check for missing test_bpf module >> + if ! /sbin/modprobe -q -n test_bpf; then >> + echo "test_bpf: [SKIP]" >> + elif /sbin/modprobe -q test_bpf; then >> + echo "test_bpf: ok" >> + else >> + echo "test_bpf: [FAIL]" >> rc=1 >> fi >> rmmod test_bpf 2> /dev/null >> >
[PATCH v2] selftests: bpf: test_kmod.sh: check the module path before insmod
test_kmod.sh reported false failure when module not present. Check test_bpf.ko is present in the path before loading it. Two cases to be addressed here, In the development process of test_bpf.c unit testing will be done by developers by using "insmod $SRC_TREE/lib/test_bpf.ko" On the other hand testers run full tests by installing modules on device under test (DUT) and followed by modprobe to insert the modules accordingly. Signed-off-by: Naresh Kamboju --- tools/testing/selftests/bpf/test_kmod.sh | 18 +++--- 1 file changed, 15 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/bpf/test_kmod.sh b/tools/testing/selftests/bpf/test_kmod.sh index ed4774d..35669cc 100755 --- a/tools/testing/selftests/bpf/test_kmod.sh +++ b/tools/testing/selftests/bpf/test_kmod.sh @@ -10,9 +10,21 @@ test_run() echo "[ JIT enabled:$1 hardened:$2 ]" dmesg -C - insmod $SRC_TREE/lib/test_bpf.ko 2> /dev/null - if [ $? -ne 0 ]; then - rc=1 + if [ -f ${SRC_TREE}/lib/test_bpf.ko ]; then + insmod ${SRC_TREE}/lib/test_bpf.ko 2> /dev/null + if [ $? -ne 0 ]; then + rc=1 + fi + else + # Use modprobe dry run to check for missing test_bpf module + if ! /sbin/modprobe -q -n test_bpf; then + echo "test_bpf: [SKIP]" + elif /sbin/modprobe -q test_bpf; then + echo "test_bpf: ok" + else + echo "test_bpf: [FAIL]" + rc=1 + fi fi rmmod test_bpf 2> /dev/null dmesg | grep FAIL -- 2.7.4
Re: [RFC PATCH 00/24] Introducing AF_XDP support
On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel wrote: > From: Björn Töpel > > This RFC introduces a new address family called AF_XDP that is > optimized for high performance packet processing and zero-copy > semantics. Throughput improvements can be up to 20x compared to V2 and > V3 for the micro benchmarks included. Would be great to get your > feedback on it. Note that this is the follow up RFC to AF_PACKET V4 > from November last year. The feedback from that RFC submission and the > presentation at NetdevConf in Seoul was to create a new address family > instead of building on top of AF_PACKET. AF_XDP is this new address > family. > > The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor > level is that TX and RX descriptors are separated from packet > buffers. An RX or TX descriptor points to a data buffer in a packet > buffer area. RX and TX can share the same packet buffer so that a > packet does not have to be copied between RX and TX. Moreover, if a > packet needs to be kept for a while due to a possible retransmit, then > the descriptor that points to that packet buffer can be changed to > point to another buffer and reused right away. This again avoids > copying data. > > The RX and TX descriptor rings are registered with the setsockopts > XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer > area is allocated by user space and registered with the kernel using > the new XDP_MEM_REG setsockopt. All these three areas are shared > between user space and kernel space. The socket is then bound with a > bind() call to a device and a specific queue id on that device, and it > is not until bind is completed that traffic starts to flow. > > An XDP program can be loaded to direct part of the traffic on that > device and queue id to user space through a new redirect action in an > XDP program called bpf_xdpsk_redirect that redirects a packet up to > the socket in user space. All the other XDP actions work just as > before. Note that the current RFC requires the user to load an XDP > program to get any traffic to user space (for example all traffic to > user space with the one-liner program "return > bpf_xdpsk_redirect();"). We plan on introducing a patch that removes > this requirement and sends all traffic from a queue to user space if > an AF_XDP socket is bound to it. > > AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and > XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there > is no specific mode called XDP_DRV_ZC). If the driver does not have > support for XDP, or XDP_SKB is explicitly chosen when loading the XDP > program, XDP_SKB mode is employed that uses SKBs together with the > generic XDP support and copies out the data to user space. A fallback > mode that works for any network device. On the other hand, if the > driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and > ndo_xdp_flush), these NDOs, without any modifications, will be used by > the AF_XDP code to provide better performance, but there is still a > copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP > driver support with the zero-copy user space allocator that provides > even better performance. In this mode, the networking HW (or SW driver > if it is a virtual driver like veth) DMAs/puts packets straight into > the packet buffer that is shared between user space and kernel > space. The RX and TX descriptor queues of the networking HW are NOT > shared to user space. Only the kernel can read and write these and it > is the kernel driver's responsibility to translate these HW specific > descriptors to the HW agnostic ones in the virtual descriptor rings > that user space sees. This way, a malicious user space program cannot > mess with the networking HW. This mode though requires some extensions > to XDP. > > To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a > buffer pool concept so that the same XDP driver code can be used for > buffers allocated using the page allocator (XDP_DRV), the user-space > zero-copy allocator (XDP_DRV_ZC), or some internal driver specific > allocator/cache/recycling mechanism. The ndo_bpf call has also been > extended with two commands for registering and unregistering an XSK > socket and is in the RX case mainly used to communicate some > information about the user-space buffer pool to the driver. > > For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush, > but we run into problems with this (further discussion in the > challenges section) and had to introduce a new NDO called > ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice > and an explicit queue id that packets should be sent out on. In > contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be > sent from the xdp socket (associated with the dev and queue > combination that was provided with the NDO call) using a callback > (get_tx_packet), and when they have been transmitted it uses another >