Re: Fwd: u32 ht filters

2018-02-07 Thread Jiri Pirko
Thu, Feb 08, 2018 at 12:08:36AM CET, xiyou.wangc...@gmail.com wrote:
>On Tue, Feb 6, 2018 at 11:01 PM, Jiri Pirko  wrote:
>> Wed, Feb 07, 2018 at 06:09:15AM CET, xiyou.wangc...@gmail.com wrote:
>>>Hi, Jiri
>>>
>>>Your  commit 7fa9d974f3c2a016b9accb18f4ee2ed2a738585c
>>>breaks the tc script by Paweł. Please find below for details.
>>
>> Did you do the bisection?
>> The commit just uses block struct instead of q, but since they
>> are in 1:1 relation, that should be equvivalent. So basically you still
>> have per-qdisc hashtables for u32.
>
>Well, at least the following fixes the problem here. But I am not sure
>if it is expected too for shared block among multiple qdiscs.

For shared block, block->q is null.


>
>
>@@ -338,7 +330,7 @@ static struct hlist_head *tc_u_common_hash;
>
> static unsigned int tc_u_hash(const struct tcf_proto *tp)
> {
>-   return hash_ptr(tp->chain->block, U32_HASH_SHIFT);
>+   return hash_ptr(tp->chain->block->q, U32_HASH_SHIFT);
> }
>
> static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp)
>@@ -348,7 +340,7 @@ static struct tc_u_common *tc_u_common_find(const
>struct tcf_proto *tp)
>
>h = tc_u_hash(tp);
>hlist_for_each_entry(tc, _u_common_hash[h], hnode) {
>-   if (tc->block == tp->chain->block)
>+   if (tc->block->q == tp->chain->block->q)

:O I don't get it. tc->block is pointer, tc->block->q is pointer. And
they are different at the same time for non-shared block.


>return tc;
>}
>return NULL;


Re: [PATCH net V3 2/2] ptr_ring: fail on large queue size (>64K)

2018-02-07 Thread Jason Wang



On 2018年02月08日 12:52, Michael S. Tsirkin wrote:

On Thu, Feb 08, 2018 at 11:59:25AM +0800, Jason Wang wrote:

We need limit the maximum size of queue, otherwise it may cause
several side effects e.g slab will warn when the size exceeds
KMALLOC_MAX_SIZE. Using KMALLOC_MAX_SIZE still looks too so this patch
tries to limit it to 64K. This value could be revisited if we found a
real case that needs more.

Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")
Signed-off-by: Jason Wang 
---
  include/linux/ptr_ring.h | 4 
  1 file changed, 4 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 2af71a7..5858d48 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -44,6 +44,8 @@ struct ptr_ring {
void **queue;
  };
  

Seems like a weird location for a define. Either put defines on
top of the file, or near where they are used. I prefer the
second option.


Ok.




+#define PTR_RING_MAX_ALLOC 65536
+

I guess it's an arbitrary number. Seems like a sufficiently large one,
but pls add a comment so readers don't wonder. And please explain what
it does:

/* Callers can create ptr_ring structures with userspace-supplied
  * parameters. This sets a limit on the size to make that usecase
  * safe. If you ever change this, make sure to audit all callers.
  */

Also I think we should generally use either hex 0x1 or (1 << 16).


I agree the number is arbitrary, so I still prefer the KMALLOC_MAX_SIZE 
especially consider it was used by pfifo_fast now. Try to limit it to an 
arbitrary may break lots of exist setups. E.g just google "txqueuelen 
10" can give me a lots of search results.


We can do any kind of optimization on top but not for -net now.

Thanks




  /* Note: callers invoking this in a loop must use a compiler barrier,
   * for example cpu_relax().
   *
@@ -466,6 +468,8 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
  
  static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)

  {
+   if (size > PTR_RING_MAX_ALLOC)
+   return NULL;
return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
  }
  
--

2.7.4




Re: [PATCH net V3 1/2] ptr_ring: try vmalloc() when kmalloc() fails

2018-02-07 Thread Jason Wang



On 2018年02月08日 12:45, Michael S. Tsirkin wrote:

On Thu, Feb 08, 2018 at 11:59:24AM +0800, Jason Wang wrote:

This patch switch to use kvmalloc_array() for using a vmalloc()
fallback to help in case kmalloc() fails.

Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")

I guess the actual patch is the one that switches tun to ptr_ring.


I think not, since the issue was large allocation.



In fact, I think the actual bugfix is patch 2/2. This specific one
just makes kmalloc less likely to fail but that's
not what syzbot reported.


Agree.



Then I would add this patch on top to make kmalloc less likely to fail.


Ok.


Signed-off-by: Jason Wang 




---
  include/linux/ptr_ring.h | 10 +-
  1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 1883d61..2af71a7 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
  
  static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)

  {
-   return kcalloc(size, sizeof(void *), gfp);
+   return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
  }
  
  static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)

This implies a bunch of limitations on the flags. From kvmalloc_node
docs:

  * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
  * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
  * preferable to the vmalloc fallback, due to visible performance drawbacks.

Fine with all the current users, but if we go this way, please add
documentation so future users don't misuse this API.


I suspect this is somehow a overkill since this means we need sync with 
mm/vmalloc changes in the future to keep it synced.



Alternatively, test flags and call kvmalloc or kcalloc?


Similar to the above issue, I would rather leave it as is.

Thanks





@@ -601,7 +601,7 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int 
size, gfp_t gfp,
spin_unlock(&(r)->producer_lock);
spin_unlock_irqrestore(&(r)->consumer_lock, flags);
  
-	kfree(old);

+   kvfree(old);
  
  	return 0;

  }
@@ -641,7 +641,7 @@ static inline int ptr_ring_resize_multiple(struct ptr_ring 
**rings,
}
  
  	for (i = 0; i < nrings; ++i)

-   kfree(queues[i]);
+   kvfree(queues[i]);
  
  	kfree(queues);
  
@@ -649,7 +649,7 @@ static inline int ptr_ring_resize_multiple(struct ptr_ring **rings,
  
  nomem:

while (--i >= 0)
-   kfree(queues[i]);
+   kvfree(queues[i]);
  
  	kfree(queues);
  
@@ -664,7 +664,7 @@ static inline void ptr_ring_cleanup(struct ptr_ring *r, void (*destroy)(void *))

if (destroy)
while ((ptr = ptr_ring_consume(r)))
destroy(ptr);
-   kfree(r->queue);
+   kvfree(r->queue);
  }
  
  #endif /* _LINUX_PTR_RING_H  */

--
2.7.4




[PATCH net-queue 1/3] Partial revert "e1000e: Avoid receiver overrun interrupt bursts"

2018-02-07 Thread Benjamin Poirier
This partially reverts commit 4aea7a5c5e940c1723add439f4088844cd26196d.

We keep the fix for the first part of the problem (1) described in the log
of that commit, that is to read ICR in the other interrupt handler. We
remove the fix for the second part of the problem (2), Other interrupt
throttling.

Bursts of "Other" interrupts may once again occur during rxo (receive
overflow) traffic conditions. This is deemed acceptable in the interest of
avoiding unforeseen fallout from changes that are not strictly necessary.
As discussed, the e1000e driver should be in "maintenance mode".

Link: https://www.spinics.net/lists/netdev/msg480675.html
Signed-off-by: Benjamin Poirier 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 16 ++--
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 153ad406c65e..3b36efa6228d 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1915,21 +1915,10 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = >hw;
u32 icr;
-   bool enable = true;
 
icr = er32(ICR);
ew32(ICR, E1000_ICR_OTHER);
 
-   if (icr & E1000_ICR_RXO) {
-   ew32(ICR, E1000_ICR_RXO);
-   enable = false;
-   /* napi poll will re-enable Other, make sure it runs */
-   if (napi_schedule_prep(>napi)) {
-   adapter->total_rx_bytes = 0;
-   adapter->total_rx_packets = 0;
-   __napi_schedule(>napi);
-   }
-   }
if (icr & E1000_ICR_LSC) {
ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
@@ -1938,7 +1927,7 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
mod_timer(>watchdog_timer, jiffies + 1);
}
 
-   if (enable && !test_bit(__E1000_DOWN, >state))
+   if (!test_bit(__E1000_DOWN, >state))
ew32(IMS, E1000_IMS_OTHER);
 
return IRQ_HANDLED;
@@ -2708,8 +2697,7 @@ static int e1000e_poll(struct napi_struct *napi, int 
weight)
napi_complete_done(napi, work_done);
if (!test_bit(__E1000_DOWN, >state)) {
if (adapter->msix_entries)
-   ew32(IMS, adapter->rx_ring->ims_val |
-E1000_IMS_OTHER);
+   ew32(IMS, adapter->rx_ring->ims_val);
else
e1000_irq_enable(adapter);
}
-- 
2.16.1



[PATCH net-queue 2/3] e1000e: Fix queue interrupt re-raising in Other interrupt.

2018-02-07 Thread Benjamin Poirier
restores the ICS write for rx/tx queue interrupts which was present before
commit 16ecba59bc33 ("e1000e: Do not read ICR in Other interrupt",
v4.5-rc1) but was not restored in commit 4aea7a5c5e94 ("e1000e: Avoid
receiver overrun interrupt bursts", v4.15-rc1).

This re-raises the queue interrupts in case the txq or rxq bits were set in
ICR and the Other interrupt handler read and cleared ICR before the queue
interrupt was raised.

Fixes: 4aea7a5c5e94 ("e1000e: Avoid receiver overrun interrupt bursts")
Signed-off-by: Benjamin Poirier 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 3b36efa6228d..2c9609bee2ae 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1919,6 +1919,9 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
icr = er32(ICR);
ew32(ICR, E1000_ICR_OTHER);
 
+   if (icr & adapter->eiac_mask)
+   ew32(ICS, (icr & adapter->eiac_mask));
+
if (icr & E1000_ICR_LSC) {
ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
-- 
2.16.1



[PATCH v2 1/2] net, can, ifi: fix "write buffer full" error

2018-02-07 Thread Heiko Schocher
the driver reads in the ISR first the IRQpending register,
and clears after that in a write *all* bits in it.

It could happen that the isr register raise bits between
this 2 register accesses, which leads in lost bits ...

In case it clears "TX message sent successfully", the driver
never sends any Tx data, and buffers to userspace run over.

Fixed this:
clear only the bits in the IRQpending register, the
driver had read.

Signed-off-by: Heiko Schocher 
Reviewed-by: Marek Vasut 
---

Changes in v2:
- add Reviewed-by from Marek

 drivers/net/can/ifi_canfd/ifi_canfd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/can/ifi_canfd/ifi_canfd.c 
b/drivers/net/can/ifi_canfd/ifi_canfd.c
index 2772d05ff11c..05feb8177936 100644
--- a/drivers/net/can/ifi_canfd/ifi_canfd.c
+++ b/drivers/net/can/ifi_canfd/ifi_canfd.c
@@ -607,7 +607,7 @@ static irqreturn_t ifi_canfd_isr(int irq, void *dev_id)
return IRQ_NONE;
 
/* Clear all pending interrupts but ErrWarn */
-   writel(clr_irq_mask, priv->base + IFI_CANFD_INTERRUPT);
+   writel(isr & clr_irq_mask, priv->base + IFI_CANFD_INTERRUPT);
 
/* RX IRQ or bus warning, start NAPI */
if (isr & rx_irq_mask) {
-- 
2.14.3



[PATCH v2 2/2] net, can, ifi: loopback Tx message in IFI block

2018-02-07 Thread Heiko Schocher
Current ifi driver reads first Rx messages, than loopback
the Tx message, if the IFI_CANFD_INTERRUPT_TXFIFO_REMOVE
bit is set. This can lead into the case, that Rx messages
overhelm Tx messages!

Fixed this in the following way:

Set in the IFI_CANFD_TXFIFO_DLC register the FN value to
1, so the IFI block loopsback itself the Tx message when
sended correctly on the canfd bus. Only the IFI block can
insert the Tx message in the correct place.

The linux driver now needs only to free the skb, when
the Tx message was sended correctly.

Signed-off-by: Heiko Schocher 
Reviewed-by: Marek Vasut 
---

Changes in v2:
- add Reviewed-by from Marek, fixed comment into one liner

 drivers/net/can/ifi_canfd/ifi_canfd.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/can/ifi_canfd/ifi_canfd.c 
b/drivers/net/can/ifi_canfd/ifi_canfd.c
index 05feb8177936..ee74ee8f9b38 100644
--- a/drivers/net/can/ifi_canfd/ifi_canfd.c
+++ b/drivers/net/can/ifi_canfd/ifi_canfd.c
@@ -211,6 +211,7 @@ struct ifi_canfd_priv {
struct napi_struct  napi;
struct net_device   *ndev;
void __iomem*base;
+   unsigned inttx_len;
 };
 
 static void ifi_canfd_irq_enable(struct net_device *ndev, bool enable)
@@ -617,8 +618,10 @@ static irqreturn_t ifi_canfd_isr(int irq, void *dev_id)
 
/* TX IRQ */
if (isr & IFI_CANFD_INTERRUPT_TXFIFO_REMOVE) {
-   stats->tx_bytes += can_get_echo_skb(ndev, 0);
+   can_free_echo_skb(ndev, 0);
+   stats->tx_bytes += priv->tx_len;
stats->tx_packets++;
+   priv->tx_len = 0;
can_led_event(ndev, CAN_LED_EVENT_TX);
}
 
@@ -889,6 +892,7 @@ static netdev_tx_t ifi_canfd_start_xmit(struct sk_buff *skb,
}
 
txdlc = can_len2dlc(cf->len);
+   priv->tx_len = txdlc;
if ((priv->can.ctrlmode & CAN_CTRLMODE_FD) && can_is_canfd_skb(skb)) {
txdlc |= IFI_CANFD_TXFIFO_DLC_EDL;
if (cf->flags & CANFD_BRS)
@@ -898,6 +902,9 @@ static netdev_tx_t ifi_canfd_start_xmit(struct sk_buff *skb,
if (cf->can_id & CAN_RTR_FLAG)
txdlc |= IFI_CANFD_TXFIFO_DLC_RTR;
 
+   /* set FNR to 1, so we get our Tx Message looped back into RxFIFO */
+   txdlc += (1 << IFI_CANFD_TXFIFO_DLC_FNR_OFFSET);
+
/* message ram configuration */
writel(txid, priv->base + IFI_CANFD_TXFIFO_ID);
writel(txdlc, priv->base + IFI_CANFD_TXFIFO_DLC);
-- 
2.14.3



[PATCH net-queue 3/3] e1000e: Avoid missed interrupts following ICR read.

2018-02-07 Thread Benjamin Poirier
The 82574 specification update errata 12 states that interrupts may be
missed if ICR is read while INT_ASSERTED is not set. Avoid that problem by
setting all bits related to events that can trigger the Other interrupt in
IMS.

The Other interrupt is raised for such events regardless of whether or not
they are set in IMS. However, only when they are set is the INT_ASSERTED
bit also set in ICR.

By doing this, we ensure that INT_ASSERTED is always set when we read ICR
in e1000_msix_other() and steer clear of the errata. This also ensures that
ICR will automatically be cleared on read, therefore we no longer need to
clear bits explicitly.

Signed-off-by: Benjamin Poirier 
---
 drivers/net/ethernet/intel/e1000e/defines.h | 21 -
 drivers/net/ethernet/intel/e1000e/netdev.c  | 11 ---
 2 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h 
b/drivers/net/ethernet/intel/e1000e/defines.h
index afb7ebe20b24..824fd44e25f0 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -400,6 +400,10 @@
 #define E1000_ICR_RXDMT00x0010 /* Rx desc min. threshold (0) */
 #define E1000_ICR_RXO   0x0040 /* Receiver Overrun */
 #define E1000_ICR_RXT0  0x0080 /* Rx timer intr (ring 0) */
+#define E1000_ICR_MDAC  0x0200 /* MDIO Access Complete */
+#define E1000_ICR_SRPD  0x0001 /* Small Receive Packet Detected */
+#define E1000_ICR_ACK   0x0002 /* Receive ACK Frame Detected */
+#define E1000_ICR_MNG   0x0004 /* Manageability Event Detected */
 #define E1000_ICR_ECCER 0x0040 /* Uncorrectable ECC Error */
 /* If this bit asserted, the driver should claim the interrupt */
 #define E1000_ICR_INT_ASSERTED 0x8000
@@ -407,7 +411,7 @@
 #define E1000_ICR_RXQ1  0x0020 /* Rx Queue 1 Interrupt */
 #define E1000_ICR_TXQ0  0x0040 /* Tx Queue 0 Interrupt */
 #define E1000_ICR_TXQ1  0x0080 /* Tx Queue 1 Interrupt */
-#define E1000_ICR_OTHER 0x0100 /* Other Interrupts */
+#define E1000_ICR_OTHER 0x0100 /* Other Interrupt */
 
 /* PBA ECC Register */
 #define E1000_PBA_ECC_COUNTER_MASK  0xFFF0 /* ECC counter mask */
@@ -431,12 +435,27 @@
E1000_IMS_RXSEQ  |\
E1000_IMS_LSC)
 
+/* These are all of the events related to the OTHER interrupt.
+ */
+#define IMS_OTHER_MASK ( \
+   E1000_IMS_LSC  | \
+   E1000_IMS_RXO  | \
+   E1000_IMS_MDAC | \
+   E1000_IMS_SRPD | \
+   E1000_IMS_ACK  | \
+   E1000_IMS_MNG)
+
 /* Interrupt Mask Set */
 #define E1000_IMS_TXDW  E1000_ICR_TXDW  /* Transmit desc written back 
*/
 #define E1000_IMS_LSC   E1000_ICR_LSC   /* Link Status Change */
 #define E1000_IMS_RXSEQ E1000_ICR_RXSEQ /* Rx sequence error */
 #define E1000_IMS_RXDMT0E1000_ICR_RXDMT0/* Rx desc min. threshold */
+#define E1000_IMS_RXO   E1000_ICR_RXO   /* Receiver Overrun */
 #define E1000_IMS_RXT0  E1000_ICR_RXT0  /* Rx timer intr */
+#define E1000_IMS_MDAC  E1000_ICR_MDAC  /* MDIO Access Complete */
+#define E1000_IMS_SRPD  E1000_ICR_SRPD  /* Small Receive Packet */
+#define E1000_IMS_ACK   E1000_ICR_ACK   /* Receive ACK Frame Detected 
*/
+#define E1000_IMS_MNG   E1000_ICR_MNG   /* Manageability Event */
 #define E1000_IMS_ECCER E1000_ICR_ECCER /* Uncorrectable ECC Error */
 #define E1000_IMS_RXQ0  E1000_ICR_RXQ0  /* Rx Queue 0 Interrupt */
 #define E1000_IMS_RXQ1  E1000_ICR_RXQ1  /* Rx Queue 1 Interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 2c9609bee2ae..9fd4050a91ca 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1914,16 +1914,12 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
struct net_device *netdev = data;
struct e1000_adapter *adapter = netdev_priv(netdev);
struct e1000_hw *hw = >hw;
-   u32 icr;
-
-   icr = er32(ICR);
-   ew32(ICR, E1000_ICR_OTHER);
+   u32 icr = er32(ICR);
 
if (icr & adapter->eiac_mask)
ew32(ICS, (icr & adapter->eiac_mask));
 
if (icr & E1000_ICR_LSC) {
-   ew32(ICR, E1000_ICR_LSC);
hw->mac.get_link_status = true;
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, >state))
@@ -1931,7 +1927,7 @@ static irqreturn_t e1000_msix_other(int __always_unused 
irq, void *data)
}
 
if (!test_bit(__E1000_DOWN, >state))
-   ew32(IMS, E1000_IMS_OTHER);
+   ew32(IMS, E1000_IMS_OTHER | IMS_OTHER_MASK);
 
return IRQ_HANDLED;
 }
@@ -2258,7 +2254,8 @@ static void e1000_irq_enable(struct e1000_adapter 
*adapter)
 
if 

[PATCH] mpls, nospec: Sanitize array index in mpls_label_ok()

2018-02-07 Thread Dan Williams
mpls_label_ok() validates that the 'platform_label' array index from a
userspace netlink message payload is valid. Under speculation the
mpls_label_ok() result may not resolve in the CPU pipeline until after
the index is used to access an array element. Sanitize the index to zero
to prevent userspace-controlled arbitrary out-of-bounds speculation, a
precursor for a speculative execution side channel vulnerability.

Cc: 
Cc: "David S. Miller" 
Cc: Eric W. Biederman 
Signed-off-by: Dan Williams 
---
 net/mpls/af_mpls.c |   24 ++--
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 8ca9915befc8..aae3565c3a92 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -935,24 +936,27 @@ static int mpls_nh_build_multi(struct mpls_route_config 
*cfg,
return err;
 }
 
-static bool mpls_label_ok(struct net *net, unsigned int index,
+static bool mpls_label_ok(struct net *net, unsigned int *index,
  struct netlink_ext_ack *extack)
 {
+   bool is_ok = true;
+
/* Reserved labels may not be set */
-   if (index < MPLS_LABEL_FIRST_UNRESERVED) {
+   if (*index < MPLS_LABEL_FIRST_UNRESERVED) {
NL_SET_ERR_MSG(extack,
   "Invalid label - must be 
MPLS_LABEL_FIRST_UNRESERVED or higher");
-   return false;
+   is_ok = false;
}
 
/* The full 20 bit range may not be supported. */
-   if (index >= net->mpls.platform_labels) {
+   if (is_ok && *index >= net->mpls.platform_labels) {
NL_SET_ERR_MSG(extack,
   "Label >= configured maximum in 
platform_labels");
-   return false;
+   is_ok = false;
}
 
-   return true;
+   *index = array_index_nospec(*index, net->mpls.platform_labels);
+   return is_ok;
 }
 
 static int mpls_route_add(struct mpls_route_config *cfg,
@@ -975,7 +979,7 @@ static int mpls_route_add(struct mpls_route_config *cfg,
index = find_free_label(net);
}
 
-   if (!mpls_label_ok(net, index, extack))
+   if (!mpls_label_ok(net, , extack))
goto errout;
 
/* Append makes no sense with mpls */
@@ -1052,7 +1056,7 @@ static int mpls_route_del(struct mpls_route_config *cfg,
 
index = cfg->rc_label;
 
-   if (!mpls_label_ok(net, index, extack))
+   if (!mpls_label_ok(net, , extack))
goto errout;
 
mpls_route_update(net, index, NULL, >rc_nlinfo);
@@ -1810,7 +1814,7 @@ static int rtm_to_route_config(struct sk_buff *skb,
goto errout;
 
if (!mpls_label_ok(cfg->rc_nlinfo.nl_net,
-  cfg->rc_label, extack))
+  >rc_label, extack))
goto errout;
break;
}
@@ -2137,7 +2141,7 @@ static int mpls_getroute(struct sk_buff *in_skb, struct 
nlmsghdr *in_nlh,
goto errout;
}
 
-   if (!mpls_label_ok(net, in_label, extack)) {
+   if (!mpls_label_ok(net, _label, extack)) {
err = -EINVAL;
goto errout;
}



Re: [PATCH 3/3] Revert "e1000e: Do not read ICR in Other interrupt"

2018-02-07 Thread Benjamin Poirier
On 2018/01/29 09:22, Alexander Duyck wrote:
[...]
> 
> > Consequently, we must clear OTHER manually from ICR, otherwise the
> > interrupt is immediately re-raised after exiting the handler.
> >
> > These observations are the same whether the interrupt is triggered via a
> > write to ICS or in hardware.
> >
> > Furthermore, I tested that this behavior is the same for other Other
> > events (MDAC, SRPD, ACK, MNG). Those were tested via a write to ICS
> > only, not in hardware.
> >
> > This is a version of the test patch that I used to trigger lsc and rxo in
> > software and hardware. It applies over this patch series.
> 
> I plan to look into this some more over the next few days. Ideally if
> we could mask these "OTHER" interrupts besides the LSC we could comply
> with all the needed bits for MSI-X. My concern is that we are still
> stuck reading the ICR at this point because of this and it is going to
> make dealing with MSI-X challenging on 82574 since it seems like the
> intention was that you weren't supposed to be reading the ICR when
> MSI-X is enabled based on the list of current issues and HW errata.

I totally agree with you that it looks like the msi-x interface was
designed so you don't need to read icr. That's also why I was happy to
go that direction with the (now infamous) commit 16ecba59bc33 ("e1000e:
Do not read ICR in Other interrupt", v4.5-rc1).

However, we looked at it before and there seems to be no way to mask
individual Other interrupt causes (masking rxo but getting lsc). Because
of that, I think we have to keep reading icr in the Other interrupt
handler.

> 
> At this point it seems like the interrupts is firing and the
> INT_ASSERTED is all we really need to be checking for if I understand
> this all correctly. Basically if LSC is set it will trigger OTHER and
> INT_ASSERTED, if any of the other causes are set they are only setting
> OTHER.

I think that's right and it's related to the fact that currently LSC is
set in IMS but not the other causes. Since we have to read icr (as I
wrote above) but we want to avoid reading it without INT_ASSERTED set
(as per errata 12) the solution will be to set all of the causes related
to Other in IMS. Patches incoming...


Re: [PATCH net v2] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT

2018-02-07 Thread Heiner Kallweit
Am 08.02.2018 um 00:00 schrieb Florian Fainelli:
> On 02/07/2018 11:44 AM, Heiner Kallweit wrote:
>> This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added
>> long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates
>> also PHY state changes and we should do what the symbol says.
>>
>> Signed-off-by: Heiner Kallweit 
>> ---
>> v2:
>> - use phy_interrupt_is_valid() instead of checking for irq > 0
> 
> Thanks, could you identify which Fixes: tag we should be using for that?
> It would be great to see this backported to -stable
> 
Before submitting the change I had a look at the phylib history for
when the wrong behavior was introduced. However it has been there since
introduction of phylib in 2005. Different patches fixed this in few,
but not all places.
Latest one was 84a527a41f38 "net: phylib: fix interrupts re-enablement
in phy_start" which just partially fixed the issue. So we could
declare the change now to fix this fix.

>> ---
>>  drivers/net/phy/phy.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index f3313a129..50ed35a45 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev)
>>  phy_resume(phydev);
>>  
>>  /* make sure interrupts are re-enabled for the PHY */
>> -if (phydev->irq != PHY_POLL) {
>> +if (phy_interrupt_is_valid(phydev)) {
>>  err = phy_enable_interrupts(phydev);
>>  if (err < 0)
>>  break;
>>
> 
> 



[Patch net] ipt_CLUSTERIP: fix a refcount bug in clusterip_config_find_get()

2018-02-07 Thread Cong Wang
In clusterip_config_find_get() we hold RCU read lock so it could
run concurrently with clusterip_config_entry_put(), as a result,
the refcnt could go back to 1 from 0, which leads to a double
list_del()... Just replace refcount_inc() with
refcount_inc_not_zero(), as for c->refcount.

Fixes: d73f33b16883 ("netfilter: CLUSTERIP: RCU conversion")
Cc: Eric Dumazet 
Cc: Pablo Neira Ayuso 
Signed-off-by: Cong Wang 
---
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c 
b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 1ff72b87a066..4537b1686c7c 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -154,8 +154,10 @@ clusterip_config_find_get(struct net *net, __be32 
clusterip, int entry)
 #endif
if (unlikely(!refcount_inc_not_zero(>refcount)))
c = NULL;
-   else if (entry)
-   refcount_inc(>entries);
+   else if (entry) {
+   if (unlikely(!refcount_inc_not_zero(>entries)))
+   c = NULL;
+   }
}
rcu_read_unlock_bh();
 
-- 
2.13.0



[Patch net] ipt_CLUSTERIP: fix a race condition of proc file creation

2018-02-07 Thread Cong Wang
There is a race condition between clusterip_config_entry_put()
and clusterip_config_init(), after we release the spinlock in
clusterip_config_entry_put(), a new proc file with a same IP could
be created immediately since it is already removed from the configs
list, therefore it triggers this warning:

[ cut here ]
proc_dir_entry 'ipt_CLUSTERIP/172.20.0.170' already registered
WARNING: CPU: 1 PID: 4152 at fs/proc/generic.c:330 proc_register+0x2a4/0x370 
fs/proc/generic.c:329
Kernel panic - not syncing: panic_on_warn set ...

As a quick fix, just move the proc_remove() inside the spinlock.

Reported-by: 
Fixes: 6c5d5cfbe3c5 ("netfilter: ipt_CLUSTERIP: check duplicate config when 
initializing")
Tested-by: Paolo Abeni 
Cc: Xin Long 
Cc: Pablo Neira Ayuso 
Signed-off-by: Cong Wang 
---
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c 
b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 3a84a60f6b39..1ff72b87a066 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -107,12 +107,6 @@ clusterip_config_entry_put(struct net *net, struct 
clusterip_config *c)
 
local_bh_disable();
if (refcount_dec_and_lock(>entries, >lock)) {
-   list_del_rcu(>list);
-   spin_unlock(>lock);
-   local_bh_enable();
-
-   unregister_netdevice_notifier(>notifier);
-
/* In case anyone still accesses the file, the open/close
 * functions are also incrementing the refcount on their own,
 * so it's safe to remove the entry even if it's in use. */
@@ -120,6 +114,12 @@ clusterip_config_entry_put(struct net *net, struct 
clusterip_config *c)
if (cn->procdir)
proc_remove(c->pde);
 #endif
+   list_del_rcu(>list);
+   spin_unlock(>lock);
+   local_bh_enable();
+
+   unregister_netdevice_notifier(>notifier);
+
return;
}
local_bh_enable();
-- 
2.13.0



[PATCH] ath9k: turn on btcoex_enable as default

2018-02-07 Thread Kai-Heng Feng
Without btcoex_enable, WiFi activies make both WiFi and Bluetooth
unstable if there's a bluetooth connection.

Enable this option when bt_ant_diversity is disabled.

BugLink: https://bugs.launchpad.net/bugs/1746164
Signed-off-by: Kai-Heng Feng 
---
 drivers/net/wireless/ath/ath9k/init.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/init.c 
b/drivers/net/wireless/ath/ath9k/init.c
index e479fae5aab9..f8f6b091a077 100644
--- a/drivers/net/wireless/ath/ath9k/init.c
+++ b/drivers/net/wireless/ath/ath9k/init.c
@@ -56,7 +56,7 @@ static int ath9k_led_active_high = -1;
 module_param_named(led_active_high, ath9k_led_active_high, int, 0444);
 MODULE_PARM_DESC(led_active_high, "Invert LED polarity");
 
-static int ath9k_btcoex_enable;
+static int ath9k_btcoex_enable = 1;
 module_param_named(btcoex_enable, ath9k_btcoex_enable, int, 0444);
 MODULE_PARM_DESC(btcoex_enable, "Enable wifi-BT coexistence");
 
@@ -693,7 +693,6 @@ static int ath9k_init_softc(u16 devid, struct ath_softc *sc,
common->hw = sc->hw;
common->priv = sc;
common->debug_mask = ath9k_debug;
-   common->btcoex_enabled = ath9k_btcoex_enable == 1;
common->disable_ani = false;
 
/*
@@ -715,14 +714,17 @@ static int ath9k_init_softc(u16 devid, struct ath_softc 
*sc,
/*
 * Enable WLAN/BT RX Antenna diversity only when:
 *
-* - BTCOEX is disabled.
 * - the user manually requests the feature.
 * - the HW cap is set using the platform data.
 */
-   if (!common->btcoex_enabled && ath9k_bt_ant_diversity &&
+   if (ath9k_bt_ant_diversity &&
(pCap->hw_caps & ATH9K_HW_CAP_BT_ANT_DIV))
common->bt_ant_diversity = 1;
 
+   /* Enable btcoex when ant_diversity is disabled */
+   if (!common->bt_ant_diversity && ath9k_btcoex_enable)
+   common->btcoex_enabled = 1;
+
spin_lock_init(>cc_lock);
spin_lock_init(>intr_lock);
spin_lock_init(>sc_serial_rw);
-- 
2.15.1



[PATCH net 2/5] nfp: don't advertise hw-tc-offload on non-port netdevs

2018-02-07 Thread Jakub Kicinski
nfp_port is a structure which represents an ASIC port, both
PCIe vNIC (on a PF or a VF) or the external MAC port.  vNIC
netdev (struct nfp_net) and pure representor netdev (struct
nfp_repr) both have a pointer to this structure.  nfp_reprs
always have a port associated.  nfp_nets, however, only represent
a device port in legacy mode, where they are considered the
MAC port. In switchdev mode they are just the CPU's side of
the PCIe link.

By definition TC offloads only apply to device ports.  Don't
set the flag on vNICs without a port (i.e. in switchdev mode).

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Tested-by: Pieter Jansen van Vuuren 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index c0fd351c86b1..fe77ea8b656c 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3734,7 +3734,7 @@ static void nfp_net_netdev_init(struct nfp_net *nn)
 
netdev->features = netdev->hw_features;
 
-   if (nfp_app_has_tc(nn->app))
+   if (nfp_app_has_tc(nn->app) && nn->port)
netdev->hw_features |= NETIF_F_HW_TC;
 
/* Advertise but disable TSO by default. */
-- 
2.15.1



[PATCH net 5/5] nfp: populate MODULE_VERSION

2018-02-07 Thread Jakub Kicinski
DKMS and similar out-of-tree module replacement services use
module version to make sure the out-of-tree software is not
older than the module shipped with the kernel.  We use the
kernel version in ethtool -i output, put it into MODULE_VERSION
as well.

Reported-by: Jan Gutter 
Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index cc570bb6563c..ab301d56430b 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -649,3 +649,4 @@ MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_2x25.nffw");
 MODULE_AUTHOR("Netronome Systems ");
 MODULE_LICENSE("GPL");
 MODULE_DESCRIPTION("The Netronome Flow Processor (NFP) driver.");
+MODULE_VERSION(UTS_RELEASE);
-- 
2.15.1



[PATCH net 1/5] nfp: bpf: require ETH table

2018-02-07 Thread Jakub Kicinski
Upcoming changes will require all netdevs supporting TC offloads
to have a full struct nfp_port.  Require those for BPF offload.
The operation without management FW reporting information about
Ethernet ports is something we only support for very old and very
basic NIC firmwares anyway.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Tested-by: Pieter Jansen van Vuuren 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 322027792fe8..61898dda11cf 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -35,6 +35,7 @@
 
 #include "../nfpcore/nfp_cpp.h"
 #include "../nfpcore/nfp_nffw.h"
+#include "../nfpcore/nfp_nsp.h"
 #include "../nfp_app.h"
 #include "../nfp_main.h"
 #include "../nfp_net.h"
@@ -87,9 +88,20 @@ static const char *nfp_bpf_extra_cap(struct nfp_app *app, 
struct nfp_net *nn)
 static int
 nfp_bpf_vnic_alloc(struct nfp_app *app, struct nfp_net *nn, unsigned int id)
 {
+   struct nfp_pf *pf = app->pf;
struct nfp_bpf_vnic *bv;
int err;
 
+   if (!pf->eth_tbl) {
+   nfp_err(pf->cpp, "No ETH table\n");
+   return -EINVAL;
+   }
+   if (pf->max_data_vnics != pf->eth_tbl->count) {
+   nfp_err(pf->cpp, "ETH entries don't match vNICs (%d vs %d)\n",
+   pf->max_data_vnics, pf->eth_tbl->count);
+   return -EINVAL;
+   }
+
bv = kzalloc(sizeof(*bv), GFP_KERNEL);
if (!bv)
return -ENOMEM;
-- 
2.15.1



[PATCH net 3/5] nfp: forbid disabling hw-tc-offload on representors while offload active

2018-02-07 Thread Jakub Kicinski
All netdevs which can accept TC offloads must implement
.ndo_set_features().  nfp_reprs currently do not do that, which
means hw-tc-offload can be turned on and off even when offloads
are active.

Whether the offloads are active is really a question to nfp_ports,
so remove the per-app tc_busy callback indirection thing, and
simply count the number of offloaded items in nfp_port structure.

Fixes: 8a2768732a4d ("nfp: provide infrastructure for offloading flower based 
TC filters")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
Tested-by: Pieter Jansen van Vuuren 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c   |  9 +
 drivers/net/ethernet/netronome/nfp/flower/offload.c |  4 
 drivers/net/ethernet/netronome/nfp/nfp_app.h|  9 -
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c |  7 +++
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c   |  1 +
 drivers/net/ethernet/netronome/nfp/nfp_port.c   | 18 ++
 drivers/net/ethernet/netronome/nfp/nfp_port.h   |  6 ++
 7 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 61898dda11cf..34e98aa6b956 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -182,6 +182,7 @@ static int nfp_bpf_setup_tc_block_cb(enum tc_setup_type 
type,
return err;
 
bv->tc_prog = cls_bpf->prog;
+   nn->port->tc_offload_cnt = !!bv->tc_prog;
return 0;
 }
 
@@ -219,13 +220,6 @@ static int nfp_bpf_setup_tc(struct nfp_app *app, struct 
net_device *netdev,
}
 }
 
-static bool nfp_bpf_tc_busy(struct nfp_app *app, struct nfp_net *nn)
-{
-   struct nfp_bpf_vnic *bv = nn->app_priv;
-
-   return !!bv->tc_prog;
-}
-
 static int
 nfp_bpf_change_mtu(struct nfp_app *app, struct net_device *netdev, int new_mtu)
 {
@@ -429,7 +423,6 @@ const struct nfp_app_type app_bpf = {
.ctrl_msg_rx= nfp_bpf_ctrl_msg_rx,
 
.setup_tc   = nfp_bpf_setup_tc,
-   .tc_busy= nfp_bpf_tc_busy,
.bpf= nfp_ndo_bpf,
.xdp_offload= nfp_bpf_xdp_offload,
 };
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 08c4c6dc5f7f..eb5c13dea8f5 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -349,6 +349,7 @@ nfp_flower_add_offload(struct nfp_app *app, struct 
net_device *netdev,
   struct tc_cls_flower_offload *flow, bool egress)
 {
enum nfp_flower_tun_type tun_type = NFP_FL_TUNNEL_NONE;
+   struct nfp_port *port = nfp_port_from_netdev(netdev);
struct nfp_flower_priv *priv = app->priv;
struct nfp_fl_payload *flow_pay;
struct nfp_fl_key_ls *key_layer;
@@ -390,6 +391,7 @@ nfp_flower_add_offload(struct nfp_app *app, struct 
net_device *netdev,
INIT_HLIST_NODE(_pay->link);
flow_pay->tc_flower_cookie = flow->cookie;
hash_add_rcu(priv->flow_table, _pay->link, flow->cookie);
+   port->tc_offload_cnt++;
 
/* Deallocate flow payload when flower rule has been destroyed. */
kfree(key_layer);
@@ -421,6 +423,7 @@ static int
 nfp_flower_del_offload(struct nfp_app *app, struct net_device *netdev,
   struct tc_cls_flower_offload *flow)
 {
+   struct nfp_port *port = nfp_port_from_netdev(netdev);
struct nfp_fl_payload *nfp_flow;
int err;
 
@@ -442,6 +445,7 @@ nfp_flower_del_offload(struct nfp_app *app, struct 
net_device *netdev,
 
 err_free_flow:
hash_del_rcu(_flow->link);
+   port->tc_offload_cnt--;
kfree(nfp_flow->action_data);
kfree(nfp_flow->mask_data);
kfree(nfp_flow->unmasked_data);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_app.h 
b/drivers/net/ethernet/netronome/nfp/nfp_app.h
index 437964afa8ee..20546ae67909 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_app.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_app.h
@@ -92,7 +92,6 @@ extern const struct nfp_app_type app_flower;
  * @stop:  stop application logic
  * @ctrl_msg_rx:control message handler
  * @setup_tc:  setup TC ndo
- * @tc_busy:   TC HW offload busy (rules loaded)
  * @bpf:   BPF ndo offload-related calls
  * @xdp_offload:offload an XDP program
  * @eswitch_mode_get:get SR-IOV eswitch mode
@@ -135,7 +134,6 @@ struct nfp_app_type {
 
int (*setup_tc)(struct nfp_app *app, struct net_device *netdev,
enum tc_setup_type type, void *type_data);
-   bool (*tc_busy)(struct nfp_app *app, struct nfp_net *nn);
int (*bpf)(struct nfp_app *app, struct nfp_net *nn,
   struct netdev_bpf *xdp);
int (*xdp_offload)(struct nfp_app 

[PATCH net 4/5] nfp: limit the number of TSO segments

2018-02-07 Thread Jakub Kicinski
Most FWs limit the number of TSO segments a frame can produce
to 64.  This is for fairness and efficiency (of FW datapath)
reasons.  If a frame with larger number of segments is submitted
the FW will drop it.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 ++
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h   | 5 -
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 19e989239af7..a05be0ab2713 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3750,6 +3750,8 @@ static void nfp_net_netdev_init(struct nfp_net *nn)
netdev->min_mtu = ETH_MIN_MTU;
netdev->max_mtu = nn->max_mtu;
 
+   netdev->gso_max_segs = NFP_NET_LSO_MAX_SEGS;
+
netif_carrier_off(netdev);
 
nfp_net_set_ethtool_ops(netdev);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
index eeecef2caac6..4499a7333078 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
@@ -59,9 +59,12 @@
 #define NFP_NET_RX_OFFSET   32
 
 /**
- * Maximum header size supported for LSO frames
+ * LSO parameters
+ * %NFP_NET_LSO_MAX_HDR_SZ:Maximum header size supported for LSO frames
+ * %NFP_NET_LSO_MAX_SEGS:  Maximum number of segments LSO frame can produce
  */
 #define NFP_NET_LSO_MAX_HDR_SZ 255
+#define NFP_NET_LSO_MAX_SEGS   64
 
 /**
  * Prepend field types
-- 
2.15.1



[PATCH net 0/5] nfp: fix disabling TC offloads in flower, max TSO segs and module version

2018-02-07 Thread Jakub Kicinski
Hi!

This set corrects the way nfp deals with the NETIF_F_HW_TC flag.
It has slipped the review that flower offload does not currently
refuse disabling this flag when filter offload is active.

nfp's flower offload does not actually keep track of how many filters
for each port are offloaded.  The accounting of the number of filters
is added to the nfp core structures, and BPF moved to use these
structures as well.

If users are allowed to disable TC offloads while filters are active,
not only is it incorrect behaviour, but actually the NFP will never
be told to remove the flows, leading to use-after-free when stats
arrive.

Fourth patch makes sure we declare the max number of TSO segments.
FW should drop longer packets cleanly (otherwise this would be a
security problem for untrusted VFs) but dropping longer TSO frames
is not nice and driver should prevent them from being generated.

Last small addition populates MODULE_VERSION with kernel version.

Jakub Kicinski (5):
  nfp: bpf: require ETH table
  nfp: don't advertise hw-tc-offload on non-port netdevs
  nfp: forbid disabling hw-tc-offload on representors while offload
active
  nfp: limit the number of TSO segments
  nfp: populate MODULE_VERSION

 drivers/net/ethernet/netronome/nfp/bpf/main.c   | 21 +
 drivers/net/ethernet/netronome/nfp/flower/offload.c |  4 
 drivers/net/ethernet/netronome/nfp/nfp_app.h|  9 -
 drivers/net/ethernet/netronome/nfp/nfp_main.c   |  1 +
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 11 ++-
 drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h   |  5 -
 drivers/net/ethernet/netronome/nfp/nfp_net_repr.c   |  1 +
 drivers/net/ethernet/netronome/nfp/nfp_port.c   | 18 ++
 drivers/net/ethernet/netronome/nfp/nfp_port.h   |  6 ++
 9 files changed, 53 insertions(+), 23 deletions(-)

-- 
2.15.1



Re: [PATCH net V3 2/2] ptr_ring: fail on large queue size (>64K)

2018-02-07 Thread Michael S. Tsirkin
On Thu, Feb 08, 2018 at 11:59:25AM +0800, Jason Wang wrote:
> We need limit the maximum size of queue, otherwise it may cause
> several side effects e.g slab will warn when the size exceeds
> KMALLOC_MAX_SIZE. Using KMALLOC_MAX_SIZE still looks too so this patch
> tries to limit it to 64K. This value could be revisited if we found a
> real case that needs more.
> 
> Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
> Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")
> Signed-off-by: Jason Wang 
> ---
>  include/linux/ptr_ring.h | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 2af71a7..5858d48 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -44,6 +44,8 @@ struct ptr_ring {
>   void **queue;
>  };
>  

Seems like a weird location for a define. Either put defines on
top of the file, or near where they are used. I prefer the
second option.

> +#define PTR_RING_MAX_ALLOC 65536
> +

I guess it's an arbitrary number. Seems like a sufficiently large one,
but pls add a comment so readers don't wonder. And please explain what
it does:

/* Callers can create ptr_ring structures with userspace-supplied
 * parameters. This sets a limit on the size to make that usecase
 * safe. If you ever change this, make sure to audit all callers.
 */

Also I think we should generally use either hex 0x1 or (1 << 16).

>  /* Note: callers invoking this in a loop must use a compiler barrier,
>   * for example cpu_relax().
>   *
> @@ -466,6 +468,8 @@ static inline int ptr_ring_consume_batched_bh(struct 
> ptr_ring *r,
>  
>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t 
> gfp)
>  {
> + if (size > PTR_RING_MAX_ALLOC)
> + return NULL;
>   return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
>  }
>  
> -- 
> 2.7.4


Re: [PATCH net V3 1/2] ptr_ring: try vmalloc() when kmalloc() fails

2018-02-07 Thread Michael S. Tsirkin
On Thu, Feb 08, 2018 at 11:59:24AM +0800, Jason Wang wrote:
> This patch switch to use kvmalloc_array() for using a vmalloc()
> fallback to help in case kmalloc() fails.
> 
> Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
> Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")

I guess the actual patch is the one that switches tun to ptr_ring.

In fact, I think the actual bugfix is patch 2/2. This specific one
just makes kmalloc less likely to fail but that's
not what syzbot reported.

Then I would add this patch on top to make kmalloc less likely to fail.

> Signed-off-by: Jason Wang 



> ---
>  include/linux/ptr_ring.h | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> index 1883d61..2af71a7 100644
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct 
> ptr_ring *r,
>  
>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t 
> gfp)
>  {
> - return kcalloc(size, sizeof(void *), gfp);
> + return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
>  }
>  
>  static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)

This implies a bunch of limitations on the flags. From kvmalloc_node
docs:

 * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
 * __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
 * preferable to the vmalloc fallback, due to visible performance drawbacks.

Fine with all the current users, but if we go this way, please add
documentation so future users don't misuse this API.
Alternatively, test flags and call kvmalloc or kcalloc?


> @@ -601,7 +601,7 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int 
> size, gfp_t gfp,
>   spin_unlock(&(r)->producer_lock);
>   spin_unlock_irqrestore(&(r)->consumer_lock, flags);
>  
> - kfree(old);
> + kvfree(old);
>  
>   return 0;
>  }
> @@ -641,7 +641,7 @@ static inline int ptr_ring_resize_multiple(struct 
> ptr_ring **rings,
>   }
>  
>   for (i = 0; i < nrings; ++i)
> - kfree(queues[i]);
> + kvfree(queues[i]);
>  
>   kfree(queues);
>  
> @@ -649,7 +649,7 @@ static inline int ptr_ring_resize_multiple(struct 
> ptr_ring **rings,
>  
>  nomem:
>   while (--i >= 0)
> - kfree(queues[i]);
> + kvfree(queues[i]);
>  
>   kfree(queues);
>  
> @@ -664,7 +664,7 @@ static inline void ptr_ring_cleanup(struct ptr_ring *r, 
> void (*destroy)(void *))
>   if (destroy)
>   while ((ptr = ptr_ring_consume(r)))
>   destroy(ptr);
> - kfree(r->queue);
> + kvfree(r->queue);
>  }
>  
>  #endif /* _LINUX_PTR_RING_H  */
> -- 
> 2.7.4


[PATCH bpf 1/6] nfp: bpf: fix immed relocation for larger offsets

2018-02-07 Thread Jakub Kicinski
Immed relocation is missing a shift which means for larger
offsets the lower and higher part of the address would be
ORed together.

Fixes: ce4ebfd859c3 ("nfp: bpf: add helpers for updating immediate 
instructions")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Jiong Wang 
---
 drivers/net/ethernet/netronome/nfp/nfp_asm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_asm.c 
b/drivers/net/ethernet/netronome/nfp/nfp_asm.c
index 3f6952b66a49..1e597600c693 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_asm.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_asm.c
@@ -107,7 +107,7 @@ u16 immed_get_value(u64 instr)
if (!unreg_is_imm(reg))
reg = FIELD_GET(OP_IMMED_B_SRC, instr);
 
-   return (reg & 0xff) | FIELD_GET(OP_IMMED_IMM, instr);
+   return (reg & 0xff) | FIELD_GET(OP_IMMED_IMM, instr) << 8;
 }
 
 void immed_set_value(u64 *instr, u16 immed)
-- 
2.15.1



[PATCH bpf 2/6] libbpf: complete list of strings for guessing program type

2018-02-07 Thread Jakub Kicinski
From: Quentin Monnet 

It seems that the type guessing feature for libbpf, based on the name of
the ELF section the program is located in, was inspired from
samples/bpf/prog_load.c, which was not used by any sample for loading
programs of certain types such as TC actions and classifiers, or
LWT-related types. As a consequence, libbpf is not able to guess the
type of such programs and to load them automatically if type is not
provided to the `bpf_load_prog()` function.

Add ELF section names associated to those eBPF program types so that
they can be loaded with e.g. bpftool as well.

Signed-off-by: Quentin Monnet 
Reviewed-by: Jakub Kicinski 
---
 tools/lib/bpf/libbpf.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 71ddc481f349..c64840365433 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1816,12 +1816,17 @@ static const struct {
BPF_PROG_SEC("socket",  BPF_PROG_TYPE_SOCKET_FILTER),
BPF_PROG_SEC("kprobe/", BPF_PROG_TYPE_KPROBE),
BPF_PROG_SEC("kretprobe/",  BPF_PROG_TYPE_KPROBE),
+   BPF_PROG_SEC("classifier",  BPF_PROG_TYPE_SCHED_CLS),
+   BPF_PROG_SEC("action",  BPF_PROG_TYPE_SCHED_ACT),
BPF_PROG_SEC("tracepoint/", BPF_PROG_TYPE_TRACEPOINT),
BPF_PROG_SEC("xdp", BPF_PROG_TYPE_XDP),
BPF_PROG_SEC("perf_event",  BPF_PROG_TYPE_PERF_EVENT),
BPF_PROG_SEC("cgroup/skb",  BPF_PROG_TYPE_CGROUP_SKB),
BPF_PROG_SEC("cgroup/sock", BPF_PROG_TYPE_CGROUP_SOCK),
BPF_PROG_SEC("cgroup/dev",  BPF_PROG_TYPE_CGROUP_DEVICE),
+   BPF_PROG_SEC("lwt_in",  BPF_PROG_TYPE_LWT_IN),
+   BPF_PROG_SEC("lwt_out", BPF_PROG_TYPE_LWT_OUT),
+   BPF_PROG_SEC("lwt_xmit",BPF_PROG_TYPE_LWT_XMIT),
BPF_PROG_SEC("sockops", BPF_PROG_TYPE_SOCK_OPS),
BPF_PROG_SEC("sk_skb",  BPF_PROG_TYPE_SK_SKB),
 };
-- 
2.15.1



[PATCH bpf 4/6] tools: bpftool: make syntax for program map update explicit in man page

2018-02-07 Thread Jakub Kicinski
From: Quentin Monnet 

Specify in the documentation that when using bpftool to update a map of
type BPF_MAP_TYPE_PROG_ARRAY, the syntax for the program used as a value
should use the "id|tag|pinned" keywords convention, as used with
"bpftool prog" commands.

Fixes: ff69c21a85a4 ("tools: bpftool: add documentation")
Signed-off-by: Quentin Monnet 
Reviewed-by: Jakub Kicinski 
---
 tools/bpf/bpftool/Documentation/bpftool-map.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-map.rst 
b/tools/bpf/bpftool/Documentation/bpftool-map.rst
index 0ab32b312aec..457e868bd32f 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-map.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-map.rst
@@ -31,7 +31,8 @@ MAP COMMANDS
 |  **bpftool** **map help**
 |
 |  *MAP* := { **id** *MAP_ID* | **pinned** *FILE* }
-|  *VALUE* := { *BYTES* | *MAP* | *PROGRAM* }
+|  *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
+|  *VALUE* := { *BYTES* | *MAP* | *PROG* }
 |  *UPDATE_FLAGS* := { **any** | **exist** | **noexist** }
 
 DESCRIPTION
-- 
2.15.1



[PATCH bpf 6/6] tools: bpftool: add bash completion for cgroup commands

2018-02-07 Thread Jakub Kicinski
From: Quentin Monnet 

Add bash completion for "bpftool cgroup" command family. While at it,
also fix the formatting of some keywords in the man page for cgroups.

Fixes: 5ccda64d38cc ("bpftool: implement cgroup bpf operations")
Signed-off-by: Quentin Monnet 
Reviewed-by: Jakub Kicinski 
---
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst |  4 +-
 tools/bpf/bpftool/bash-completion/bpftool  | 64 --
 2 files changed, 62 insertions(+), 6 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst 
b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
index 2fe2a1bdbe3e..0e4e923235b6 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
@@ -26,8 +26,8 @@ MAP COMMANDS
 |  **bpftool** **cgroup help**
 |
 |  *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
-|  *ATTACH_TYPE* := { *ingress* | *egress* | *sock_create* | *sock_ops* | 
*device* }
-|  *ATTACH_FLAGS* := { *multi* | *override* }
+|  *ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | 
**sock_ops** | **device** }
+|  *ATTACH_FLAGS* := { **multi** | **override** }
 
 DESCRIPTION
 ===
diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index 17d246418348..08719c54a614 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -52,16 +52,24 @@ _bpftool_once_attr()
 done
 }
 
-# Takes a list of words in argument; adds them all to COMPREPLY if none of them
-# is already present on the command line. Returns no value.
-_bpftool_one_of_list()
+# Takes a list of words as argument; if any of those words is present on the
+# command line, return 0. Otherwise, return 1.
+_bpftool_search_list()
 {
 local w idx
 for w in $*; do
 for (( idx=3; idx < ${#words[@]}-1; idx++ )); do
-[[ $w == ${words[idx]} ]] && return 1
+[[ $w == ${words[idx]} ]] && return 0
 done
 done
+return 1
+}
+
+# Takes a list of words in argument; adds them all to COMPREPLY if none of them
+# is already present on the command line. Returns no value.
+_bpftool_one_of_list()
+{
+_bpftool_search_list $* && return 1
 COMPREPLY+=( $( compgen -W "$*" -- "$cur" ) )
 }
 
@@ -351,6 +359,54 @@ _bpftool()
 ;;
 esac
 ;;
+cgroup)
+case $command in
+show|list)
+_filedir
+return 0
+;;
+attach|detach)
+local ATTACH_TYPES='ingress egress sock_create sock_ops \
+device'
+local ATTACH_FLAGS='multi override'
+local PROG_TYPE='id pinned tag'
+case $prev in
+$command)
+_filedir
+return 0
+;;
+ingress|egress|sock_create|sock_ops|device)
+COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \
+"$cur" ) )
+return 0
+;;
+id)
+_bpftool_get_prog_ids
+return 0
+;;
+*)
+if ! _bpftool_search_list "$ATTACH_TYPES"; then
+COMPREPLY=( $( compgen -W "$ATTACH_TYPES" -- \
+"$cur" ) )
+elif [[ "$command" == "attach" ]]; then
+# We have an attach type on the command line,
+# but it is not the previous word, or
+# "id|pinned|tag" (we already checked for
+# that). This should only leave the case when
+# we need attach flags for "attach" commamnd.
+_bpftool_one_of_list "$ATTACH_FLAGS"
+fi
+return 0
+;;
+esac
+;;
+*)
+[[ $prev == $object ]] && \
+COMPREPLY=( $( compgen -W 'help attach detach \
+show list' -- "$cur" ) )
+;;
+esac
+;;
 esac
 } &&
 complete -F _bpftool bpftool
-- 
2.15.1



[PATCH bpf 3/6] tools: bpftool: exit doc Makefile early if rst2man is not available

2018-02-07 Thread Jakub Kicinski
From: Quentin Monnet 

If rst2man is not available on the system, running `make doc` from the
bpftool directory fails with an error message. However, it creates empty
manual pages (.8 files in this case). A subsequent call to `make
doc-install` would then succeed and install those empty man pages on the
system.

To prevent this, raise a Makefile error and exit immediately if rst2man
is not available before generating the pages from the rst documentation.

Fixes: ff69c21a85a4 ("tools: bpftool: add documentation")
Reported-by: Jason van Aaardt 
Signed-off-by: Quentin Monnet 
Reviewed-by: Jakub Kicinski 
---
 tools/bpf/bpftool/Documentation/Makefile | 5 +
 1 file changed, 5 insertions(+)

diff --git a/tools/bpf/bpftool/Documentation/Makefile 
b/tools/bpf/bpftool/Documentation/Makefile
index c462a928e03d..a9d47c1558bb 100644
--- a/tools/bpf/bpftool/Documentation/Makefile
+++ b/tools/bpf/bpftool/Documentation/Makefile
@@ -23,7 +23,12 @@ DOC_MAN8 = $(addprefix $(OUTPUT),$(_DOC_MAN8))
 man: man8
 man8: $(DOC_MAN8)
 
+RST2MAN_DEP := $(shell command -v rst2man 2>/dev/null)
+
 $(OUTPUT)%.8: %.rst
+ifndef RST2MAN_DEP
+   $(error "rst2man not found, but required to generate man pages")
+endif
$(QUIET_GEN)rst2man $< > $@
 
 clean:
-- 
2.15.1



[PATCH bpf 5/6] tools: bpftool: add bash completion for `bpftool prog load`

2018-02-07 Thread Jakub Kicinski
From: Quentin Monnet 

Add bash completion for bpftool command `prog load`. Completion for this
command is easy, as it only takes existing file paths as arguments.

Fixes: 49a086c201a9 ("bpftool: implement prog load command")
Signed-off-by: Quentin Monnet 
Reviewed-by: Jakub Kicinski 
---
 tools/bpf/bpftool/bash-completion/bpftool | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index 0137866bb8f6..17d246418348 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -230,10 +230,14 @@ _bpftool()
 fi
 return 0
 ;;
+load)
+_filedir
+return 0
+;;
 *)
 [[ $prev == $object ]] && \
-COMPREPLY=( $( compgen -W 'dump help pin show list' -- 
\
-"$cur" ) )
+COMPREPLY=( $( compgen -W 'dump help pin load \
+show list' -- "$cur" ) )
 ;;
 esac
 ;;
-- 
2.15.1



[PATCH bpf 0/6] nfp fix for calls, bpftool completions and doc fixes

2018-02-07 Thread Jakub Kicinski
Hi!

First patch in this series fixes applying the relocation to immediate
load instructions in the NFP JIT.

The remaining patches come from Quentin.  Small addition to libbpf
makes sure it recognizes all standard section names.  Makefile in
bpftool/Documentation is improved to explicitly check for rst2man
being installed on the system, otherwise we risk installing empty
files.  Man page for bpftool-map is corrected to include program
as a potential value for map of programs.

Last two patches are slightly longer, those update bash completions to
include this release cycle's additions from Roman.  Maybe the use of
Fixes tags is slightly frivolous there, but having bash completions
which don't cover all commands and options could be disruptive to work
flow for users.

Jakub Kicinski (1):
  nfp: bpf: fix immed relocation for larger offsets

Quentin Monnet (5):
  libbpf: complete list of strings for guessing program type
  tools: bpftool: exit doc Makefile early if rst2man is not available
  tools: bpftool: make syntax for program map update explicit in man
page
  tools: bpftool: add bash completion for `bpftool prog load`
  tools: bpftool: add bash completion for cgroup commands

 drivers/net/ethernet/netronome/nfp/nfp_asm.c   |  2 +-
 tools/bpf/bpftool/Documentation/Makefile   |  5 ++
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst |  4 +-
 tools/bpf/bpftool/Documentation/bpftool-map.rst|  3 +-
 tools/bpf/bpftool/bash-completion/bpftool  | 72 --
 tools/lib/bpf/libbpf.c |  5 ++
 6 files changed, 81 insertions(+), 10 deletions(-)

-- 
2.15.1



[PATCH net V3 2/2] ptr_ring: fail on large queue size (>64K)

2018-02-07 Thread Jason Wang
We need limit the maximum size of queue, otherwise it may cause
several side effects e.g slab will warn when the size exceeds
KMALLOC_MAX_SIZE. Using KMALLOC_MAX_SIZE still looks too so this patch
tries to limit it to 64K. This value could be revisited if we found a
real case that needs more.

Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")
Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 2af71a7..5858d48 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -44,6 +44,8 @@ struct ptr_ring {
void **queue;
 };
 
+#define PTR_RING_MAX_ALLOC 65536
+
 /* Note: callers invoking this in a loop must use a compiler barrier,
  * for example cpu_relax().
  *
@@ -466,6 +468,8 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
 
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
+   if (size > PTR_RING_MAX_ALLOC)
+   return NULL;
return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
 }
 
-- 
2.7.4



[PATCH net V3 1/2] ptr_ring: try vmalloc() when kmalloc() fails

2018-02-07 Thread Jason Wang
This patch switch to use kvmalloc_array() for using a vmalloc()
fallback to help in case kmalloc() fails.

Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")
Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 1883d61..2af71a7 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
 
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
-   return kcalloc(size, sizeof(void *), gfp);
+   return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
 }
 
 static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)
@@ -601,7 +601,7 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int 
size, gfp_t gfp,
spin_unlock(&(r)->producer_lock);
spin_unlock_irqrestore(&(r)->consumer_lock, flags);
 
-   kfree(old);
+   kvfree(old);
 
return 0;
 }
@@ -641,7 +641,7 @@ static inline int ptr_ring_resize_multiple(struct ptr_ring 
**rings,
}
 
for (i = 0; i < nrings; ++i)
-   kfree(queues[i]);
+   kvfree(queues[i]);
 
kfree(queues);
 
@@ -649,7 +649,7 @@ static inline int ptr_ring_resize_multiple(struct ptr_ring 
**rings,
 
 nomem:
while (--i >= 0)
-   kfree(queues[i]);
+   kvfree(queues[i]);
 
kfree(queues);
 
@@ -664,7 +664,7 @@ static inline void ptr_ring_cleanup(struct ptr_ring *r, 
void (*destroy)(void *))
if (destroy)
while ((ptr = ptr_ring_consume(r)))
destroy(ptr);
-   kfree(r->queue);
+   kvfree(r->queue);
 }
 
 #endif /* _LINUX_PTR_RING_H  */
-- 
2.7.4



linux-next: Signed-off-by missing for commit in the net tree

2018-02-07 Thread Stephen Rothwell
Hi all,

Commit

  043e337f555e ("sch_netem: Bug fixing in calculating Netem interval")

is missing a Signed-off-by from its author.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH net V2 1/2] ptr_ring: try vmalloc() when kmalloc() fails

2018-02-07 Thread Jason Wang



On 2018年02月08日 11:21, Jason Wang wrote:

This patch switch to use kvmalloc_array() for using a vmalloc()
fallback to help in case kmalloc() fails.

Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")
Signed-off-by: Jason Wang 
---
  include/linux/ptr_ring.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 1883d61..9c3b748 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
  
  static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)

  {
-   return kcalloc(size, sizeof(void *), gfp);
+   return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
  }
  
  static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)


Forget the kvfree() ...

Will post V3.

Sorry.


[PATCH net V2 1/2] ptr_ring: try vmalloc() when kmalloc() fails

2018-02-07 Thread Jason Wang
This patch switch to use kvmalloc_array() for using a vmalloc()
fallback to help in case kmalloc() fails.

Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")
Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 1883d61..9c3b748 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -466,7 +466,7 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
 
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
-   return kcalloc(size, sizeof(void *), gfp);
+   return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
 }
 
 static inline void __ptr_ring_set_size(struct ptr_ring *r, int size)
-- 
2.7.4



[PATCH net V2 2/2] ptr_ring: fail on large queue size (>64K)

2018-02-07 Thread Jason Wang
We need limit the maximum size of queue, otherwise it may cause
several side effects e.g slab will warn when the size exceeds
KMALLOC_MAX_SIZE. Using KMALLOC_MAX_SIZE still looks too so this patch
tries to limit it to 64K. This value could be revisited if we found a
real case that needs more.

Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")
Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 9c3b748..2520daa 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -44,6 +44,8 @@ struct ptr_ring {
void **queue;
 };
 
+#define PTR_RING_MAX_ALLOC 65536
+
 /* Note: callers invoking this in a loop must use a compiler barrier,
  * for example cpu_relax().
  *
@@ -466,6 +468,8 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
 
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
+   if (size > PTR_RING_MAX_ALLOC)
+   return NULL;
return kvmalloc_array(size, sizeof(void *), gfp | __GFP_ZERO);
 }
 
-- 
2.7.4



Re: [PATCH net] ptr_ring: fail early if queue occupies more than KMALLOC_MAX_SIZE

2018-02-07 Thread Jason Wang



On 2018年02月07日 23:15, Michael S. Tsirkin wrote:

On Wed, Feb 07, 2018 at 05:17:59PM +0800, Jason Wang wrote:


On 2018年02月07日 16:08, Jason Wang wrote:

To avoid slab to warn about exceeded size, fail early if queue
occupies more than KMALLOC_MAX_SIZE.

Reported-by: syzbot+e4d4f9ddd42955397...@syzkaller.appspotmail.com
Signed-off-by: Jason Wang 
---
   include/linux/ptr_ring.h | 2 ++
   1 file changed, 2 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 1883d61..4b862da 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -466,6 +466,8 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
   static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t 
gfp)
   {
+   if (size > KMALLOC_MAX_SIZE)
+   return NULL;
return kcalloc(size, sizeof(void *), gfp);
   }

Fixes: 2e0ab8ca83c12 ("ptr_ring: array based FIFO for pointers")

That's probably not enough. How about switching to kvmalloc and limiting
this drastically to e.g. 64K. I vaguely remember that some wise man said
that should be enough for everybody :)



Okay, let me post a V2 to see if there's any objection.

Thanks


Re: [PATCH net] tcp: tracepoint: only call trace_tcp_send_reset with full socket

2018-02-07 Thread David Miller
From: Song Liu 
Date: Tue, 6 Feb 2018 20:50:23 -0800

> tracepoint tcp_send_reset requires a full socket to work. However, it
> may be called when in TCP_TIME_WAIT:
> 
> case TCP_TW_RST:
> tcp_v6_send_reset(sk, skb);
> inet_twsk_deschedule_put(inet_twsk(sk));
> goto discard_it;
> 
> To avoid this problem, this patch checks the socket with sk_fullsock()
> before calling trace_tcp_send_reset().
> 
> Fixes: c24b14c46bb8 ("tcp: add tracepoint trace_tcp_send_reset")
> Signed-off-by: Song Liu 
> Reviewed-by: Lawrence Brakmo 

Applied and queued up for -stable, thank you.


Re: [PATCH net-next] sch_netem: Bug fixing in calculating Netem interval

2018-02-07 Thread David Miller
From: "Md. Islam" 
Date: Tue, 6 Feb 2018 23:14:18 -0500

> In Kernel 4.15.0+, Netem does not work properly.
> 
> Netem setup:
> 
> tc qdisc add dev h1-eth0 root handle 1: netem delay 10ms 2ms
> 
> Result:
 ...
> (rnd % (2 * sigma)) - sigma was overflowing s32. After applying the
> patch, I found following output which is desirable.

Applied.


Re: [PATCH] net: ethernet: ti: cpsw: fix net watchdog timeout

2018-02-07 Thread David Miller
From: Grygorii Strashko 
Date: Tue, 6 Feb 2018 19:17:06 -0600

> It was discovered that simple program which indefinitely sends 200b UDP
> packets and runs on TI AM574x SoC (SMP) under RT Kernel triggers network
> watchdog timeout in TI CPSW driver (<6 hours run). The network watchdog
> timeout is triggered due to race between cpsw_ndo_start_xmit() and
> cpsw_tx_handler() [NAPI]
> 
> cpsw_ndo_start_xmit()
>   if (unlikely(!cpdma_check_free_tx_desc(txch))) {
>   txq = netdev_get_tx_queue(ndev, q_idx);
>   netif_tx_stop_queue(txq);
> 
> ^^ as per [1] barier has to be used after set_bit() otherwise new value
> might not be visible to other cpus
>   }
> 
> cpsw_tx_handler()
>   if (unlikely(netif_tx_queue_stopped(txq)))
>   netif_tx_wake_queue(txq);
> 
> and when it happens ndev TX queue became disabled forever while driver's HW
> TX queue is empty.
> 
> Fix this, by adding smp_mb__after_atomic() after netif_tx_stop_queue()
> calls and double check for free TX descriptors after stopping ndev TX queue
> - if there are free TX descriptors wake up ndev TX queue.
> 
> [1] https://www.kernel.org/doc/html/latest/core-api/atomic_ops.html
> Signed-off-by: Grygorii Strashko 

Applied, thanks.


Re: [PATCH net] net/ipv6: Handle reject routes with onlink flag

2018-02-07 Thread David Miller
From: David Ahern 
Date: Tue,  6 Feb 2018 12:14:12 -0800

> Verification of nexthops with onlink flag need to handle unreachable
> routes. The lookup is only intended to validate the gateway address
> is not a local address and if the gateway resolves the egress device
> must match the given device. Hence, hitting any default reject route
> is ok.
> 
> Fixes: fc1e64e1092f ("net/ipv6: Add support for onlink flag")
> Signed-off-by: David Ahern 

Applied.


Re: [PATCH net] net/ipv6: onlink nexthop checks should default to main table

2018-02-07 Thread David Miller
From: David Ahern 
Date: Tue,  6 Feb 2018 13:17:06 -0800

> Because of differences in how ipv4 and ipv6 handle fib lookups,
> verification of nexthops with onlink flag need to default to the main
> table rather than the local table used by IPv4. As it stands an
> address within a connected route on device 1 can be used with
> onlink on device 2. Updating the table properly rejects the route
> due to the egress device mismatch.
> 
> Update the extack message as well to show it could be a device
> mismatch for the nexthop spec.
> 
> Fixes: fc1e64e1092f ("net/ipv6: Add support for onlink flag")
> Signed-off-by: David Ahern 

Applied, thanks David.


Re: [PATCH net-next] sun: Add SPDX license tags to Sun network drivers

2018-02-07 Thread David Miller
From: Shannon Nelson 
Date: Tue,  6 Feb 2018 11:34:23 -0800

> Add the appropriate SPDX license tags to the Sun network drivers
> as outlined in Documentation/process/license-rules.rst.
> 
> Signed-off-by: Shannon Nelson 

I've decided to apply this to the net tree, thank you.


Re: [PATCH v1 1/1] spi_ks8995: use regmap to access chip registers.

2018-02-07 Thread David Miller
From: Sven Van Asbroeck 
Date: Tue, 6 Feb 2018 10:13:56 -0500

> The register map layouts used in this driver are well suited to
> being accessed through a regmap. This makes the driver simpler
> and shorter, by eliminating some spi boilerplate code.
> 
> Testing:
> - tested on a ksz8785.
> - not tested on the other supported chips (ks8995, ksz8864)
>   because I don't have access to them.
>   However, I instrumented the spi layer to verify that the
>   correct spi transactions are generated to read the ID
>   registers on those chips.
> 
> Signed-off-by: Sven Van Asbroeck 

Please resubmit this simplification when the net-next tree opens
back up.

Thank you.


Re: [PATCH net] rxrpc: Fix received abort handling

2018-02-07 Thread David Miller
From: David Howells 
Date: Tue, 06 Feb 2018 16:44:12 +

> AF_RXRPC is incorrectly sending back to the server any abort it receives
> for a client connection.  This is due to the final-ACK offload to the
> connection event processor patch.  The abort code is copied into the
> last-call information on the connection channel and then the event
> processor is set.
> 
> Instead, the following should be done:
 ...
> Fixes: 3136ef49a14c ("rxrpc: Delay terminal ACK transmission on a client 
> call")
> Signed-off-by: David Howells 

Applied, thanks David.


Re: [PATCH] cxgb4: Fix error handling path in 'init_one()'

2018-02-07 Thread David Miller
From: Christophe JAILLET 
Date: Tue,  6 Feb 2018 21:17:17 +0100

> Commit baf5086840ab1 ("cxgb4: restructure VF mgmt code") has reordered
> some code but an error handling label has not been updated accordingly.
> So fix it and free 'adapter' if 't4_wait_dev_ready()' fails.
> 
> Fixes: baf5086840ab1 ("cxgb4: restructure VF mgmt code")
> Signed-off-by: Christophe JAILLET 

Applied, thank you.


[PATCH] net: Whitelist the skbuff_head_cache "cb" field

2018-02-07 Thread Kees Cook
Most callers of put_cmsg() use a "sizeof(foo)" for the length argument.
Within put_cmsg(), a copy_to_user() call is made with a dynamic size, as a
result of the cmsg header calculations. This means that hardened usercopy
will examine the copy, even though it was technically a fixed size and
should be implicitly whitelisted. All the put_cmsg() calls being built
from values in skbuff_head_cache are coming out of the protocol-defined
"cb" field, so whitelist this field entirely instead of creating per-use
bounce buffers, for which there are concerns about performance.

Original report was:

Bad or missing usercopy whitelist? Kernel memory exposure attempt detected from 
SLAB object 'skbuff_head_cache' (offset 64, size 16)!
WARNING: CPU: 0 PID: 3663 at mm/usercopy.c:81 usercopy_warn+0xdb/0x100 
mm/usercopy.c:76
...
 __check_heap_object+0x89/0xc0 mm/slab.c:4426
 check_heap_object mm/usercopy.c:236 [inline]
 __check_object_size+0x272/0x530 mm/usercopy.c:259
 check_object_size include/linux/thread_info.h:112 [inline]
 check_copy_size include/linux/thread_info.h:143 [inline]
 copy_to_user include/linux/uaccess.h:154 [inline]
 put_cmsg+0x233/0x3f0 net/core/scm.c:242
 sock_recv_errqueue+0x200/0x3e0 net/core/sock.c:2913
 packet_recvmsg+0xb2e/0x17a0 net/packet/af_packet.c:3296
 sock_recvmsg_nosec net/socket.c:803 [inline]
 sock_recvmsg+0xc9/0x110 net/socket.c:810
 ___sys_recvmsg+0x2a4/0x640 net/socket.c:2179
 __sys_recvmmsg+0x2a9/0xaf0 net/socket.c:2287
 SYSC_recvmmsg net/socket.c:2368 [inline]
 SyS_recvmmsg+0xc4/0x160 net/socket.c:2352
 entry_SYSCALL_64_fastpath+0x29/0xa0

Reported-by: syzbot+e2d6cfb305e9f3911...@syzkaller.appspotmail.com
Fixes: 6d07d1cd300f ("usercopy: Restrict non-usercopy caches to size 0")
Signed-off-by: Kees Cook 
---
I tried the inlining, it was awful. Splitting put_cmsg() was awful. So,
instead, whitelist the "cb" field as the least bad option if bounce
buffers are unacceptable. Dave, do you want to take this through net, or
should I take it through the usercopy tree?
---
 net/core/skbuff.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6b0ff396fa9d..201b96c8f414 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3889,10 +3889,12 @@ EXPORT_SYMBOL_GPL(skb_gro_receive);
 
 void __init skb_init(void)
 {
-   skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
+   skbuff_head_cache = kmem_cache_create_usercopy("skbuff_head_cache",
  sizeof(struct sk_buff),
  0,
  SLAB_HWCACHE_ALIGN|SLAB_PANIC,
+ offsetof(struct sk_buff, cb),
+ sizeof_field(struct sk_buff, cb),
  NULL);
skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",
sizeof(struct sk_buff_fclones),
-- 
2.7.4


-- 
Kees Cook
Pixel Security


Re: [PATCH RFC 2/4] netlink: add generic object description infrastructure

2018-02-07 Thread Randy Dunlap
On 02/06/2018 05:37 PM, Pablo Neira Ayuso wrote:
> This patch allows netlink busses to provide object descriptions to
> userspace, in terms of supported attributes and its corresponding
> datatypes.
> 
> Userspace sends a requests that looks like:
> 
>   netlink header
>   NLA_DESC_REQ_BUS
>   NLA_DESC_REQ_DATA
> 
> Where NLA_DESC_REQ_BUS is the netlink bus/protocol number, eg.
> NETLINK_NETFILTER, and NLA_DESC_REQ_DATA is an attribute layout is
> specific to the bus that you are inspecting, this is useful for both
> nfnetlink and genetlink since they need to what subsystem in the bus
> specifically you're targeting to.
> 
> Then, the netlink description subsystem response via netlink dump looks
> like this:
> 
>   netlink header
>   NLA_DESC_NUM_OBJS
>   NLA_DESC_OBJS (nest)
>   NLA_DESC_LIST_ITEM (nest)
>   NLA_DESC_OBJ_ID
>   NLA_DESC_OBJ_ATTRS_MAX
>   NLA_DESC_OBJ_ATTRS (nest)
>   NLA_DESC_LIST_ITEM (nest)
>   NLA_DESC_ATTR_NUM
>   NLA_DESC_ATTR_TYPE
>   NLA_DESC_ATTR_LEN
>   NLA_DESC_ATTR_MAXVAL
>   NLA_DESC_ATTR_NEST_ID
>   NLA_DESC_LIST_ITEM (nest)
>   ...
> 
> Each object definition is composed of an unique ID, the number of
> attributes and the list of attribute definitions.
> 
> The NETLINK_DESC bus provides a generic interface to retrieve the list
> of existing objects and its attributes via netlink dump. This new
> description family autoloads module dependencies based on what userspace
> requests.
> 
> Each bus needs to register a struct nl_desc_subsys definition, that
> provides the lookup and parse callbacks. These route the description
> requests to the corresponding backend subsystem for genetlink and
> nfnetlink. The lookup callback returns struct nl_desc_objs that provides
> the array of object descriptions.
> 
> Signed-off-by: Pablo Neira Ayuso 
> ---
>  include/net/net_namespace.h  |   1 +
>  include/net/nldesc.h | 160 ++
>  include/uapi/linux/netlink.h |  67 ++
>  net/netlink/Makefile |   2 +-
>  net/netlink/desc.c   | 499 
> +++
>  5 files changed, 728 insertions(+), 1 deletion(-)
>  create mode 100644 include/net/nldesc.h
>  create mode 100644 net/netlink/desc.c
> 

> diff --git a/include/net/nldesc.h b/include/net/nldesc.h
> new file mode 100644
> index ..19306a648f10
> --- /dev/null
> +++ b/include/net/nldesc.h
> @@ -0,0 +1,160 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __NET_NLDESC_H
> +#define __NET_NLDESC_H
> +
> +#include 
> +
> +struct nl_desc_cmd;
> +struct nl_desc_obj;
> +
> +struct nl_desc_cmds {
> + int max;
> + const struct nl_desc_cmd*table;
> +};
> +
> +struct nl_desc_objs {
> + int max;
> + const struct nl_desc_obj**table;
> +};
> +
> +struct nl_desc_req {
> + u32 bus;
> +};
> +
> +struct net;
> +struct sk_buff;
> +struct nlmsghdr;
> +struct nlattr;
> +

> +
> +/**
> + * struct nl_desc_obj - netlink object description
> + * @id: unique ID to identify this netlink object
> + * @max: number of attributes to describe this object

  @attr_max:

> + * @attrs: array of attribute descriptions
> + */
> +struct nl_desc_obj {
> + u16 id;
> + u16 attr_max;
> + const struct nl_desc_attr   *attrs;
> +};


Is there a test program for this?
Maybe add it to tools/testing/ ?

thanks,
-- 
~Randy


Re: [PATCH] rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c

2018-02-07 Thread Pkshih
On Wed, 2018-02-07 at 12:51 -0800, Matthias Kaehlcke wrote:
> El Wed, Feb 07, 2018 at 02:35:59PM -0600 Larry Finger ha dit:
> 
> > On 02/07/2018 02:26 PM, Matthias Kaehlcke wrote:
> > > In _rtl92c_get_txpower_writeval_by_regulatory() the variable writeVal
> > > is assigned to itself in an if ... else statement, apparently only to
> > > document that the branch condition is handled and that a previously read
> > > value should be returned unmodified. The self-assignment causes clang to
> > > raise the following warning:
> > > 
> > > drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c:304:13:
> > >error: explicitly assigning value of variable of type 'u32'
> > >  (aka 'unsigned int') to itself [-Werror,-Wself-assign]
> > >writeVal = writeVal;
> > > 
> > > Replace the self-assignment with a semicolon, which still serves to
> > > document the 'handling' of the branch condition.
> > > 
> > > Signed-off-by: Matthias Kaehlcke 
> > > ---
> > >   drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
> b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
> > > index 9cff6bc4049c..4db92496c122 100644
> > > --- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
> > > +++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
> > > @@ -301,7 +301,7 @@ static void 
> > > _rtl92c_get_txpower_writeval_by_regulatory(struct ieee80211_hw
> *hw,
> > >   writeVal = writeVal - 0x06060606;
> > >   else if (rtlpriv->dm.dynamic_txhighpower_lvl ==
> > >    TXHIGHPWRLEVEL_BT2)
> > > - writeVal = writeVal;
> > > + ;
> > >   *(p_outwriteval + rf) = writeVal;
> > >   }
> > >   }
> > > 
> > 
> > As the branch condition does nothing, why not remove it and save the
> > compiler's optimizer a bit of work? The code looks strange, but it matches
> > the rest of Realtek's USB drivers.

Agree Larry's comment.

> 
> Sure, I am happy to change it to whatever the authors/maintainers prefer.
> 
> I'll wait a bit before respinning for if others feel strongly about
> keeping the branch.
> 
> --Please consider the environment before printing this e-mail.

Re: [kbuild-all] [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0

2018-02-07 Thread Daniel Borkmann
On 02/08/2018 01:37 AM, Philip Li wrote:
> On Thu, Feb 08, 2018 at 01:25:39AM +0100, Daniel Borkmann wrote:
>> On 02/08/2018 01:20 AM, Philip Li wrote:
>>> On Thu, Feb 08, 2018 at 02:09:41AM +0200, Michael S. Tsirkin wrote:
 On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git 
> vhost
> head:   96bcd04462b99e2c80e09f6537770a0ca6b288d0
> commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support 
> reporting free page blocks
> config: x86_64-rhel
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
> git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8
> # save the attached .config to linux build tree
> make ARCH=x86_64 
>
> All warnings (new ones prefixed by >>):

 I'm sorry, what exactly does this mean?
>>> sorry that the script has issue to provide empty warnings here, which
>>> was supposed to list warnings with >> added to new ones like below.
>>>
>>>In file included from kernel//bpf/core.c:29:0:
> include/linux/bpf.h:72:8: error: duplicate member 'security'
>>>  void *security;
>>>^
>>>
>>> We will follow up to check what goes wrong.
>>
>> Certainly odd, even in that tree/commit/header [0] I cannot see a duplicate
>> security member in struct bpf_map.
> that is one example i try to show what looks like for the >> prefix, and i 
> copy
> it from https://lists.01.org/pipermail/kbuild-all/2018-February/042706.html, 
> which
> is not for vhost repo.

Ok, indeed, some merge resolution issue in android stable kernel tree [1].
I didn't parse from your prior mail that you just meant this as a possible
example. Anyway, thanks for clarifying.

  [1] 
https://android.googlesource.com/kernel/common/+/6bbd2da69c5c2b8ab57cc2171b9abe2c9b82f53e%5E%21/

> Here for this report, the issue is as title "Warning: 
> arch/x86/tools/test_get_len found difference at 
> :811aa5f0".
> 
>>   [0] 
>> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/tree/include/linux/bpf.h?h=vhost=cc1d1dc07885803981520a5303ef5b130f2ca2e8#n45

Thanks,
Daniel


Re: [kbuild-all] [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0

2018-02-07 Thread Philip Li
On Thu, Feb 08, 2018 at 01:25:39AM +0100, Daniel Borkmann wrote:
> On 02/08/2018 01:20 AM, Philip Li wrote:
> > On Thu, Feb 08, 2018 at 02:09:41AM +0200, Michael S. Tsirkin wrote:
> >> On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote:
> >>> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git 
> >>> vhost
> >>> head:   96bcd04462b99e2c80e09f6537770a0ca6b288d0
> >>> commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support 
> >>> reporting free page blocks
> >>> config: x86_64-rhel
> >>> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> >>> reproduce:
> >>> git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8
> >>> # save the attached .config to linux build tree
> >>> make ARCH=x86_64 
> >>>
> >>> All warnings (new ones prefixed by >>):
> >>
> >> I'm sorry, what exactly does this mean?
> > sorry that the script has issue to provide empty warnings here, which
> > was supposed to list warnings with >> added to new ones like below.
> > 
> >In file included from kernel//bpf/core.c:29:0:
> >>> include/linux/bpf.h:72:8: error: duplicate member 'security'
> >  void *security;
> >^
> > 
> > We will follow up to check what goes wrong.
> 
> Certainly odd, even in that tree/commit/header [0] I cannot see a duplicate
> security member in struct bpf_map.
that is one example i try to show what looks like for the >> prefix, and i copy
it from https://lists.01.org/pipermail/kbuild-all/2018-February/042706.html, 
which
is not for vhost repo.

Here for this report, the issue is as title "Warning: 
arch/x86/tools/test_get_len found difference at 
:811aa5f0".

> 
>   [0] 
> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/tree/include/linux/bpf.h?h=vhost=cc1d1dc07885803981520a5303ef5b130f2ca2e8#n45


Re: [PATCH iproute2-next 0/3] route code cleanups

2018-02-07 Thread David Ahern
On 2/7/18 5:25 PM, Stephen Hemminger wrote:
> Make code in iproute conform more to current coding style
> conventions. Also split flush out into separate function.
> 
> Stephen Hemminger (3):
>   iproute: whitespace fixes
>   iproute: don't do assignment in condition
>   iproute: make flush a separate function
> 
>  ip/iproute.c | 165 
> ---
>  1 file changed, 91 insertions(+), 74 deletions(-)
> 

applied to iproute2-next



Re: [kbuild-all] [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0

2018-02-07 Thread Daniel Borkmann
On 02/08/2018 01:20 AM, Philip Li wrote:
> On Thu, Feb 08, 2018 at 02:09:41AM +0200, Michael S. Tsirkin wrote:
>> On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote:
>>> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost
>>> head:   96bcd04462b99e2c80e09f6537770a0ca6b288d0
>>> commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support 
>>> reporting free page blocks
>>> config: x86_64-rhel
>>> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
>>> reproduce:
>>> git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8
>>> # save the attached .config to linux build tree
>>> make ARCH=x86_64 
>>>
>>> All warnings (new ones prefixed by >>):
>>
>> I'm sorry, what exactly does this mean?
> sorry that the script has issue to provide empty warnings here, which
> was supposed to list warnings with >> added to new ones like below.
> 
>In file included from kernel//bpf/core.c:29:0:
>>> include/linux/bpf.h:72:8: error: duplicate member 'security'
>  void *security;
>^
> 
> We will follow up to check what goes wrong.

Certainly odd, even in that tree/commit/header [0] I cannot see a duplicate
security member in struct bpf_map.

  [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git/tree/include/linux/bpf.h?h=vhost=cc1d1dc07885803981520a5303ef5b130f2ca2e8#n45


[PATCH iproute2-next 1/3] iproute: whitespace fixes

2018-02-07 Thread Stephen Hemminger
Add whitespace around operators for consistency.
Use tabs for indentation.

Signed-off-by: Stephen Hemminger 
---
 ip/iproute.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 91d2b1a61993..e9c4093fa3eb 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -162,7 +162,7 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
if (r->rtm_family == AF_INET6 && table != RT_TABLE_MAIN)
ip6_multiple_tables = 1;
 
-   if (filter.cloned == !(r->rtm_flags_F_CLONED))
+   if (filter.cloned == !(r->rtm_flags & RTM_F_CLONED))
return 0;
 
if (r->rtm_family == AF_INET6 && !ip6_multiple_tables) {
@@ -566,7 +566,8 @@ static void print_rta_multipath(FILE *fp, const struct 
rtmsg *r,
if (nh->rtnh_len > len)
break;
 
-   if (r->rtm_flags_F_CLONED && r->rtm_type == RTN_MULTICAST) {
+   if ((r->rtm_flags & RTM_F_CLONED) &&
+   r->rtm_type == RTN_MULTICAST) {
if (first) {
fprintf(fp, "Oifs: ");
first = 0;
@@ -594,7 +595,8 @@ static void print_rta_multipath(FILE *fp, const struct 
rtmsg *r,
print_rta_flow(fp, tb[RTA_FLOW]);
}
 
-   if (r->rtm_flags_F_CLONED && r->rtm_type == RTN_MULTICAST) {
+   if ((r->rtm_flags & RTM_F_CLONED) &&
+   r->rtm_type == RTN_MULTICAST) {
fprintf(fp, "%s", ll_index_to_name(nh->rtnh_ifindex));
if (nh->rtnh_hops != 1)
fprintf(fp, "(ttl>%d)", nh->rtnh_hops);
@@ -676,7 +678,7 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (r->rtm_dst_len != host_len) {
fprintf(fp, "%s/%u ",
rt_addr_n2a_rta(family, tb[RTA_DST]),
-   r->rtm_dst_len);
+   r->rtm_dst_len);
} else {
fprintf(fp, "%s ",
format_host_rta(family, tb[RTA_DST]));
@@ -691,7 +693,7 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
if (r->rtm_src_len != host_len) {
fprintf(fp, "from %s/%u ",
rt_addr_n2a_rta(family, tb[RTA_SRC]),
-   r->rtm_src_len);
+   r->rtm_src_len);
} else {
fprintf(fp, "from %s ",
format_host_rta(family, tb[RTA_SRC]));
@@ -722,7 +724,7 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
 
if (table && (table != RT_TABLE_MAIN || show_details > 0) && !filter.tb)
fprintf(fp, "table %s ", rtnl_rttable_n2a(table, b1, 
sizeof(b1)));
-   if (!(r->rtm_flags_F_CLONED)) {
+   if (!(r->rtm_flags & RTM_F_CLONED)) {
if ((r->rtm_protocol != RTPROT_BOOT || show_details > 0) && 
filter.protocolmask != -1)
fprintf(fp, "proto %s ", 
rtnl_rtprot_n2a(r->rtm_protocol, b1, sizeof(b1)));
if ((r->rtm_scope != RT_SCOPE_UNIVERSE || show_details > 0) && 
filter.scopemask != -1)
@@ -764,7 +766,6 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
print_rta_cacheinfo(fp, RTA_DATA(tb[RTA_CACHEINFO]));
 
} else if (r->rtm_family == AF_INET6) {
-
if (r->rtm_flags & RTM_F_CLONED)
fprintf(fp, "%scache ", _SL_);
 
@@ -1577,8 +1578,8 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
invarg("invalid mark value", *argv);
filter.markmask = -1;
} else if (matches(*argv, "metric") == 0 ||
-  matches(*argv, "priority") == 0 ||
-  strcmp(*argv, "preference") == 0) {
+  matches(*argv, "priority") == 0 ||
+  strcmp(*argv, "preference") == 0) {
__u32 metric;
 
NEXT_ARG();
@@ -2117,6 +2118,8 @@ int do_iproute(int argc, char **argv)
return iproute_showdump();
if (matches(*argv, "help") == 0)
usage();
-   fprintf(stderr, "Command \"%s\" is unknown, try \"ip route help\".\n", 
*argv);
+
+   fprintf(stderr,
+   "Command \"%s\" is unknown, try \"ip route help\".\n", *argv);
exit(-1);
 }
-- 
2.15.1



[PATCH iproute2-next 3/3] iproute: make flush a separate function

2018-02-07 Thread Stephen Hemminger
Minor refactoring to move flush into separate function to improve
readability and reduce depth of nesting.

Signed-off-by: Stephen Hemminger 
---
 ip/iproute.c | 121 +++
 1 file changed, 64 insertions(+), 57 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 22ed113e890a..3c56240f1291 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1473,6 +1473,68 @@ static int save_route_prep(void)
return 0;
 }
 
+static int iproute_flush(int do_ipv6, rtnl_filter_t filter_fn)
+{
+   time_t start = time(0);
+   char flushb[4096-512];
+   int round = 0;
+   int ret;
+
+   if (filter.cloned) {
+   if (do_ipv6 != AF_INET6) {
+   iproute_flush_cache();
+   if (show_stats)
+   printf("*** IPv4 routing cache is flushed.\n");
+   }
+   if (do_ipv6 == AF_INET)
+   return 0;
+   }
+
+   filter.flushb = flushb;
+   filter.flushp = 0;
+   filter.flushe = sizeof(flushb);
+
+   for (;;) {
+   if (rtnl_wilddump_request(, do_ipv6, RTM_GETROUTE) < 0) {
+   perror("Cannot send dump request");
+   return -2;
+   }
+   filter.flushed = 0;
+   if (rtnl_dump_filter(, filter_fn, stdout) < 0) {
+   fprintf(stderr, "Flush terminated\n");
+   return -2;
+   }
+   if (filter.flushed == 0) {
+   if (show_stats) {
+   if (round == 0 &&
+   (!filter.cloned || do_ipv6 == AF_INET6))
+   printf("Nothing to flush.\n");
+   else
+   printf("*** Flush is complete after %d 
round%s ***\n",
+  round, round > 1 ? "s" : "");
+   }
+   fflush(stdout);
+   return 0;
+   }
+   round++;
+   ret = flush_update();
+   if (ret < 0)
+   return ret;
+
+   if (time(0) - start > 30) {
+   printf("\n*** Flush not completed after %ld seconds, %d 
entries remain ***\n",
+  (long)(time(0) - start), filter.flushed);
+   return -1;
+   }
+
+   if (show_stats) {
+   printf("\n*** Round %d, deleting %d entries ***\n",
+  round, filter.flushed);
+   fflush(stdout);
+   }
+   }
+}
+
 static int iproute_list_flush_or_save(int argc, char **argv, int action)
 {
int do_ipv6 = preferred_family;
@@ -1480,7 +1542,6 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
char *od = NULL;
unsigned int mark = 0;
rtnl_filter_t filter_fn;
-   int ret;
 
if (action == IPROUTE_SAVE) {
if (save_route_prep())
@@ -1680,62 +1741,8 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
}
filter.mark = mark;
 
-   if (action == IPROUTE_FLUSH) {
-   int round = 0;
-   char flushb[4096-512];
-   time_t start = time(0);
-
-   if (filter.cloned) {
-   if (do_ipv6 != AF_INET6) {
-   iproute_flush_cache();
-   if (show_stats)
-   printf("*** IPv4 routing cache is 
flushed.\n");
-   }
-   if (do_ipv6 == AF_INET)
-   return 0;
-   }
-
-   filter.flushb = flushb;
-   filter.flushp = 0;
-   filter.flushe = sizeof(flushb);
-
-   for (;;) {
-   if (rtnl_wilddump_request(, do_ipv6, RTM_GETROUTE) 
< 0) {
-   perror("Cannot send dump request");
-   return -2;
-   }
-   filter.flushed = 0;
-   if (rtnl_dump_filter(, filter_fn, stdout) < 0) {
-   fprintf(stderr, "Flush terminated\n");
-   return -2;
-   }
-   if (filter.flushed == 0) {
-   if (show_stats) {
-   if (round == 0 && (!filter.cloned || 
do_ipv6 == AF_INET6))
-   printf("Nothing to flush.\n");
-   else
-   printf("*** Flush is complete 
after %d round%s ***\n", round, round > 1?"s":"");
-  

[PATCH iproute2-next 2/3] iproute: don't do assignment in condition

2018-02-07 Thread Stephen Hemminger
Fix checkpatch complaints about assignment in conditions.

Signed-off-by: Stephen Hemminger 
---
 ip/iproute.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index e9c4093fa3eb..22ed113e890a 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -653,7 +653,8 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
struct nlmsghdr *fn;
 
if (NLMSG_ALIGN(filter.flushp) + n->nlmsg_len > filter.flushe) {
-   if ((ret = flush_update()) < 0)
+   ret = flush_update();
+   if (ret < 0)
return ret;
}
fn = (struct nlmsghdr *)(filter.flushb + 
NLMSG_ALIGN(filter.flushp));
@@ -827,7 +828,8 @@ static int parse_one_nh(struct nlmsghdr *n, struct rtmsg *r,
}
} else if (strcmp(*argv, "dev") == 0) {
NEXT_ARG();
-   if ((rtnh->rtnh_ifindex = ll_name_to_index(*argv)) == 
0) {
+   rtnh->rtnh_ifindex = ll_name_to_index(*argv);
+   if (rtnh->rtnh_ifindex == 0) {
fprintf(stderr, "Cannot find device \"%s\"\n", 
*argv);
return -1;
}
@@ -1326,9 +1328,9 @@ static int iproute_modify(int cmd, unsigned int flags, 
int argc, char **argv)
usage();
 
if (d) {
-   int idx;
+   int idx = ll_name_to_index(d);
 
-   if ((idx = ll_name_to_index(d)) == 0) {
+   if (idx == 0) {
fprintf(stderr, "Cannot find device \"%s\"\n", d);
return -1;
}
@@ -1658,7 +1660,8 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
int idx;
 
if (id) {
-   if ((idx = ll_name_to_index(id)) == 0) {
+   idx = ll_name_to_index(id);
+   if (idx == 0) {
fprintf(stderr, "Cannot find device \"%s\"\n", 
id);
return -1;
}
@@ -1666,7 +1669,8 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
filter.iifmask = -1;
}
if (od) {
-   if ((idx = ll_name_to_index(od)) == 0) {
+   idx = ll_name_to_index(od);
+   if (idx == 0) {
fprintf(stderr, "Cannot find device \"%s\"\n", 
od);
return -1;
}
@@ -1716,7 +1720,8 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
return 0;
}
round++;
-   if ((ret = flush_update()) < 0)
+   ret = flush_update();
+   if (ret < 0)
return ret;
 
if (time(0) - start > 30) {
@@ -1867,14 +1872,16 @@ static int iproute_get(int argc, char **argv)
int idx;
 
if (idev) {
-   if ((idx = ll_name_to_index(idev)) == 0) {
+   idx = ll_name_to_index(idev);
+   if (idx == 0) {
fprintf(stderr, "Cannot find device \"%s\"\n", 
idev);
return -1;
}
addattr32(, sizeof(req), RTA_IIF, idx);
}
if (odev) {
-   if ((idx = ll_name_to_index(odev)) == 0) {
+   idx = ll_name_to_index(odev);
+   if (idx == 0) {
fprintf(stderr, "Cannot find device \"%s\"\n", 
odev);
return -1;
}
-- 
2.15.1



[PATCH iproute2-next 0/3] route code cleanups

2018-02-07 Thread Stephen Hemminger
Make code in iproute conform more to current coding style
conventions. Also split flush out into separate function.

Stephen Hemminger (3):
  iproute: whitespace fixes
  iproute: don't do assignment in condition
  iproute: make flush a separate function

 ip/iproute.c | 165 ---
 1 file changed, 91 insertions(+), 74 deletions(-)

-- 
2.15.1



Re: [PATCH iproute2-next v2 0/6] ip: Use netlink to walk through network device list

2018-02-07 Thread David Ahern
On 2/6/18 11:30 PM, Serhey Popovych wrote:
> In this seris I replace /proc/net/dev and /sys/class/net usage for walk
> through network device list in iptunnel/ip6tunnel and iptuntap with
> netlink dump.
> 
> Following changed since RFC was sent:
> 
>   1) Treat @struct rtnl_link_stats and @struct rtnl_link_stats64 as
>  array with __u32 and __u64 elements respectively in
>  copy_rtnl_link_stats64() as suggested by Stephen Hemminger.
> 
>   2) Remove @name and @size parameters from @struct tnl_print_nlmsg_info
>  since we can get them easily from other data.
> 
> Testing.
> 
> 
> Following script is used to ensure I didn't broke things too much:
> 
> \#!/bin/bash
> 
> iproute2_dir="$1"
> iface='gre1'
> 
> pushd "$iproute2_dir" &>/dev/null
> 
> for i in new old; do
>   DIR="/tmp/$i"
>   mkdir -p "$DIR"
> 
>   ln -snf ip.$i ip/ip
> 
>   for o in '' -s -d; do
>   ip/ip $o tunnel show   >"$DIR/ip${o}-tunnel-show"
>   ip/ip -4 $o tunnel show>"$DIR/ip-4${o}-tunnel-show"
>   ip/ip -6 $o tunnel show>"$DIR/ip-6${o}-tunnel-show"
>   ip/ip $o tunnel show dev "$iface" \
>   >"$DIR/ip${o}-tunnel-show-$iface"
>   ip/ip $o tuntap show   >"$DIR/ip${o}-tuntap-show"
>   done
> done
> rm -f ip/ip
> 
> diff -urN /tmp/{old,new} |sed -n -Ee'/^(-{3}|\+{3})[[:space:]]+/!p'
> rc=$?
> 
> popd &>/dev/null
> exit $rc
> 
> Results:
> 
> 
> ...
> fopen /sys/class/net/ipip1/tun_flags: No such file or directory
> fopen /sys/class/net/ipip2/tun_flags: No such file or directory
> fopen /sys/class/net/gre10/tun_flags: No such file or directory
> ^^^
> note that this comes from ip.old
> ...
> diff -urN /tmp/old/ip-d-tuntap-show /tmp/new/ip-d-tuntap-show
> @@ -1,4 +1,4 @@
> -tun1: tap user 1004 group 27
> - Attached to processes:
>  tun0: tun user 1000 group 27
>   Attached to processes:
> +tun1: tap user 1004 group 27
> + Attached to processes:
> diff -urN /tmp/old/ip-s-tuntap-show /tmp/new/ip-s-tuntap-show
> @@ -1,2 +1,2 @@
> -tun1: tap user 1004 group 27
>  tun0: tun user 1000 group 27
> +tun1: tap user 1004 group 27
> diff -urN /tmp/old/ip-tuntap-show /tmp/new/ip-tuntap-show
> @@ -1,2 +1,2 @@
> -tun1: tap user 1004 group 27
>  tun0: tun user 1000 group 27
> +tun1: tap user 1004 group 27
> 
> So basically only print order for ip tuntap get changes. Rest is intact.
> 
> v2
>   Fix build failure in 0/4 patch ("iptunnel/ip6tunnel: Code cleanups")
>   and update it's description showing why this cleanup is necessary.
> 
>   Update cover letter to explain origins of fopen /sys/class/net/...
>   error message sources.
> 
> Thanks,
> Serhii
> 
> Serhey Popovych (6):
>   ipaddress: Unify print_link_stats() and print_link_stats64()
>   ip: Introduce get_rtnl_link_stats_rta() to get link statistics
>   tunnel: Split statistic getting and printing
>   iptunnel/ip6tunnel: Code cleanups
>   iptunnel/ip6tunnel: Use netlink to walk through tunnels list
>   tuntap: Use netlink to walk through tuntap list
> 
>  include/utils.h |3 +
>  ip/ip6tunnel.c  |  115 +++--
>  ip/ipaddress.c  |  189 
> ---
>  ip/iptunnel.c   |   93 +--
>  ip/iptuntap.c   |  121 ++-
>  ip/tunnel.c |  114 ++---
>  ip/tunnel.h |   17 -
>  lib/utils.c |   45 +
>  8 files changed, 324 insertions(+), 373 deletions(-)
> 

series applied to iproute2-next



Re: [PATCH iproute2-next 0/9] print refactoring

2018-02-07 Thread David Ahern
On 2/7/18 10:10 AM, Stephen Hemminger wrote:
> This patch set breaks up the big print_route function into
> smaller pieces for readability and to make later changes
> to support JSON and color output easier.
> 
> Stephen Hemminger (9):
>   iproute: refactor printing flags
>   iproute: make printing icmpv6 a function
>   iproute: make printing IPv4 cache flags a function
>   iproute: refactor cacheinfo printing
>   iproute: refactor metrics print
>   iproute: refactor printing flow info
>   iproute: refactor newdst, gateway and via printing
>   iproute: refactor multipath print
>   iproute: refactor printing of interface
> 
>  ip/iproute.c | 571 
> +++
>  1 file changed, 297 insertions(+), 274 deletions(-)
> 


series applied to iproute2-next.


Re: [kbuild-all] [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0

2018-02-07 Thread Philip Li
On Thu, Feb 08, 2018 at 02:09:41AM +0200, Michael S. Tsirkin wrote:
> On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote:
> > tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost
> > head:   96bcd04462b99e2c80e09f6537770a0ca6b288d0
> > commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support 
> > reporting free page blocks
> > config: x86_64-rhel
> > compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> > reproduce:
> > git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8
> > # save the attached .config to linux build tree
> > make ARCH=x86_64 
> > 
> > All warnings (new ones prefixed by >>):
> 
> I'm sorry, what exactly does this mean?
sorry that the script has issue to provide empty warnings here, which
was supposed to list warnings with >> added to new ones like below.

   In file included from kernel//bpf/core.c:29:0:
>> include/linux/bpf.h:72:8: error: duplicate member 'security'
 void *security;
   ^

We will follow up to check what goes wrong.

> 
> > ---
> > 0-DAY kernel test infrastructureOpen Source Technology 
> > Center
> > https://lists.01.org/pipermail/kbuild-all   Intel 
> > Corporation
> ___
> kbuild-all mailing list
> kbuild-...@lists.01.org
> https://lists.01.org/mailman/listinfo/kbuild-all


Re: [vhost:vhost 19/20] Warning: arch/x86/tools/test_get_len found difference at :ffffffff811aa5f0

2018-02-07 Thread Michael S. Tsirkin
On Wed, Feb 07, 2018 at 03:02:57PM +0800, kbuild test robot wrote:
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost
> head:   96bcd04462b99e2c80e09f6537770a0ca6b288d0
> commit: cc1d1dc07885803981520a5303ef5b130f2ca2e8 [19/20] mm: support 
> reporting free page blocks
> config: x86_64-rhel
> compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
> reproduce:
> git checkout cc1d1dc07885803981520a5303ef5b130f2ca2e8
> # save the attached .config to linux build tree
> make ARCH=x86_64 
> 
> All warnings (new ones prefixed by >>):

I'm sorry, what exactly does this mean?

> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: [suricata PATCH 1/3] suricata/ebpf: take clang -target bpf include issue of stdint.h into account

2018-02-07 Thread Eric Leblond
Hi,

On Wed, 2018-02-07 at 23:21 +0100, Jesper Dangaard Brouer wrote:
> From: Jesper Dangaard Brouer 
> 
> This patch prepares code before enabling the clang -target bpf.
> 
> The clang compiler does not like #include  when
> using '-target bpf' it will fail with:
> 
>  fatal error: 'gnu/stubs-32.h' file not found
...
> This can be worked around by installing the 32-bit version of
> glibc-devel.i686 on your distribution.
> 
> But the BPF programs does not really need to include stdint.h,
> if converting:
>   uint64_t -> __u64
>   uint32_t -> __u32
>   uint16_t -> __u16
>   uint8_t  -> __u8
> 
> This patch does this type syntax conversion.

There is an issue for system like Debian because they don't have a
asm/types.h in the include path if the architecture is not defined
which is the case due to target bpf. This results in:

clang-5.0 -Wall -Iinclude -O2 \
-D__KERNEL__ -D__ASM_SYSREG_H \
-target bpf -S -emit-llvm vlan_filter.c -o vlan_filter.ll
In file included from vlan_filter.c:19:
In file included from include/linux/bpf.h:11:
/usr/include/linux/types.h:5:10: fatal error: 'asm/types.h' file not
found
#include 
 ^
1 error generated.
Makefile:523: recipe for target 'vlan_filter.bpf' failed

To go into details, the Debian package providing the 'asm/typs.h'
include is the the headers or linux-libc-dev. But this package comes
with a flavor and thus we have a prefix: 
 linux-libc-dev:amd64: /usr/include/x86_64-linux-gnu/asm/types.h

"Fun" part here is that if you build a debian package of the via make
in Linux tree then the linux-libc-dev package is correct.

So I propose the following patch that fixes the issue for me:

diff --git a/ebpf/Makefile.am b/ebpf/Makefile.am
index 89a3304e9..712b05343 100644
--- a/ebpf/Makefile.am
+++ b/ebpf/Makefile.am
@@ -16,6 +16,7 @@ all: $(BPF_TARGETS)
 $(BPF_TARGETS): %.bpf: %.c
 #  From C-code to LLVM-IR format suffix .ll (clang -S -emit-llvm)
${CLANG} -Wall $(BPF_CFLAGS) -O2 \
+   -I/usr/include/$(host_cpu)-$(host_os)/ \
-D__KERNEL__ -D__ASM_SYSREG_H \
-target bpf -S -emit-llvm $< -o ${@:.bpf=.ll}
 #  From LLVM-IR to BPF-bytecode in ELF-obj file

Let me know if it is ok for you.

Best regards,
-- 
Eric Leblond 


Re: [PATCH v2] selftests: bpf: test_kmod.sh: check the module path before insmod

2018-02-07 Thread Daniel Borkmann
On 02/07/2018 07:15 PM, Naresh Kamboju wrote:
> test_kmod.sh reported false failure when module not present.
> Check test_bpf.ko is present in the path before loading it.
> 
> Two cases to be addressed here,
> In the development process of test_bpf.c unit testing will be done by
> developers by using "insmod $SRC_TREE/lib/test_bpf.ko"
> 
> On the other hand testers run full tests by installing modules on device
> under test (DUT) and followed by modprobe to insert the modules accordingly.
> 
> Signed-off-by: Naresh Kamboju 

Applied to bpf tree, thanks Naresh!


Re: [PATCH] net: ethernet: ti: cpsw: fix net watchdog timeout

2018-02-07 Thread Ivan Khoronzhuk
Rechecked once again, seems it covers every case, at this moment, so

Reviewed-by: Ivan Khoronzhuk 

-- 
Regards,
Ivan Khoronzhuk


Re: Fwd: u32 ht filters

2018-02-07 Thread Cong Wang
On Tue, Feb 6, 2018 at 11:01 PM, Jiri Pirko  wrote:
> Wed, Feb 07, 2018 at 06:09:15AM CET, xiyou.wangc...@gmail.com wrote:
>>Hi, Jiri
>>
>>Your  commit 7fa9d974f3c2a016b9accb18f4ee2ed2a738585c
>>breaks the tc script by Paweł. Please find below for details.
>
> Did you do the bisection?
> The commit just uses block struct instead of q, but since they
> are in 1:1 relation, that should be equvivalent. So basically you still
> have per-qdisc hashtables for u32.

Well, at least the following fixes the problem here. But I am not sure
if it is expected too for shared block among multiple qdiscs.


@@ -338,7 +330,7 @@ static struct hlist_head *tc_u_common_hash;

 static unsigned int tc_u_hash(const struct tcf_proto *tp)
 {
-   return hash_ptr(tp->chain->block, U32_HASH_SHIFT);
+   return hash_ptr(tp->chain->block->q, U32_HASH_SHIFT);
 }

 static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp)
@@ -348,7 +340,7 @@ static struct tc_u_common *tc_u_common_find(const
struct tcf_proto *tp)

h = tc_u_hash(tp);
hlist_for_each_entry(tc, _u_common_hash[h], hnode) {
-   if (tc->block == tp->chain->block)
+   if (tc->block->q == tp->chain->block->q)
return tc;
}
return NULL;


Re: [PATCH] net: ethernet: ti: cpsw: fix net watchdog timeout

2018-02-07 Thread Ivan Khoronzhuk

On Wed, Feb 07, 2018 at 12:19:11PM -0600, Grygorii Strashko wrote:

On 02/07/2018 07:31 AM, Ivan Khoronzhuk wrote:

On Wed, Feb 07, 2018 at 05:03:19AM +0200, Ivan Khoronzhuk wrote:

On Tue, Feb 06, 2018 at 07:17:06PM -0600, Grygorii Strashko wrote:

It was discovered that simple program which indefinitely sends 200b UDP
packets and runs on TI AM574x SoC (SMP) under RT Kernel triggers network
watchdog timeout in TI CPSW driver (<6 hours run). The network watchdog
timeout is triggered due to race between cpsw_ndo_start_xmit() and
cpsw_tx_handler() [NAPI]

cpsw_ndo_start_xmit()
if (unlikely(!cpdma_check_free_tx_desc(txch))) {
txq = netdev_get_tx_queue(ndev, q_idx);
netif_tx_stop_queue(txq);

^^ as per [1] barrier has to be used after set_bit() otherwise new value
might not be visible to other cpus
}

cpsw_tx_handler()
if (unlikely(netif_tx_queue_stopped(txq)))
netif_tx_wake_queue(txq);

and when it happens ndev TX queue became disabled forever while driver's HW
TX queue is empty.

I'm sure it fixes test case somehow but there is some strangeness.
(I've thought about this some X months ago):
1. If no free desc, then there is bunch of descs on the queue ready to be sent
2. If one of this desc while this process was missed then next will wake queue,
because there is bunch of them on the fly. So, if desc on top of the sent queue
missed to enable the queue, then next one more likely will enable it anyway..
then how it could happen? The described race is possible only on last
descriptor, yes, packets are small the speed is hight, possibility is very small
.but then next situation is also possible:
- packets are sent fast
- all packets were sent, but no any descriptors are freed now by sw interrupt 
(NAPI)
- when interrupt had started NAPI, the queue was enabled, all other next
interrupts are throttled once NAPI not finished it's work yet.
- when new packet submitted, no free descs are present yet (NAPI has not freed
any yet), but all packets are sent, so no one can awake tx queue, as interrupt
will not arise when NAPI is started to free first descriptor interrupts are
disabled.because h/w queue to be sent is empty...
- how it can happen as submitting packet and handling packet operations is under
channel lock? Not exactly, a period between handling and freeing the descriptor
to the pool is not under channel lock, here:

spin_unlock_irqrestore(>lock, flags);
if (unlikely(status & CPDMA_DESC_TD_COMPLETE))
cb_status = -ENOSYS;
else
cb_status = status;

__cpdma_chan_free(chan, desc, outlen, cb_status);
return status;

unlock_ret:
spin_unlock_irqrestore(>lock, flags);
return status;

And:
__cpdma_chan_free(chan, desc, outlen, cb_status);
-> cpdma_desc_free(pool, desc, 1);

As result, queue deadlock as you've described.
Just thought, not checked, but theoretically possible.
What do you think?


Yep. And if this happens desc_alloc_fail should be >0 while i usually see it 0 
when
watchdog triggers.

It has to be 0 for situation I trying to describe.
See below re example for keeping all in one place


I was able to reproduce this situation once, but with bunch of debug code added
which changes timings:

I also unintentionally observed it several times but usually was thinking
that it was connected with my other experiments. It was with am572x.
But now seems it actually can happen with plane sources. And maybe here not
only 1 fix is needed, that's my concern.



Prerequisite: only one free desc available.
cpsw_ndo_start_xmit1cpsw_tx_poll
 prepare and send packet
 ->Free desc queue is empty at this moment
if (unlikely(!cpdma_check_free_tx_desc(txch)))
netif_tx_stop_queue(txq);
--- tx queue stopped
cpdma_chan_process()

spin_lock_irqsave(>lock, flags);
chan->count--

spin_unlock_irqrestore(>lock, flags)

cpsw_tx_handler()
   if 
(unlikely(netif_tx_queue_stopped(txq)))

netif_tx_wake_queue(txq);
--- tx queue is woken up
cpsw_ndo_start_xmit2
 prepare packet
ret = cpsw_tx_packet_submit(priv, skb, txch);
//fail as desc not returned to the pool yet
if (unlikely(ret != 0)) {
cpsw_err(priv, tx_err, "desc submit failed\n");
goto fail;
}
cpdma_desc_free()   

fail:
ndev->stats.tx_dropped++;
netif_tx_stop_queue(txq);
// oops.

That's why I added double check and barrier in fail path also

Seems to be reasonable. It 

Re: handling of phy_stop() and phy_stop_machine() in phylib

2018-02-07 Thread Florian Fainelli
On 02/07/2018 01:13 PM, Russell King - ARM Linux wrote:
> On Wed, Feb 07, 2018 at 09:56:37PM +0100, Heiner Kallweit wrote:
>> Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
>>>
>>>
>>> On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
 Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
>> This commit forces callers of phy_resume() and phy_suspend() to hold
>> mutex phydev->lock. This was done for calls to phy_resume() and
>> phy_suspend() in phylib, however there are more callers in network
>> drivers. I'd assume that these other calls issue a warning now
>> because of the lock not being held.
>> So is there something I miss or would this have to be fixed?
>
> Hi Heiner
>
> This is a good point.
>
> Yes, it looks like some fixes are needed. But what exactly?
>
> The phy state machine will suspend and resume the phy is you call
> phy_stop() and phy_start() in the MAC suspend and resume functions.
>
 AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
 to PHY_HALTED and (at least if we're not in polling mode) doesn't
 call phy_suspend(). Maybe a call to phy_trigger_machine() is
 needed like in phy_start() ? Then the state machine would call
 phy_suspend(), provided the link is still up.
>>>
>>> Right, phy_stop() merely just moves the state machine to PHY_HALTED and
>>> this is actually a great source of problems which I tried to address here:
>>>
>>> https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
>>>
>>> because phy_stop() is not a synchronous call, so when it returns the
>>> state machine might still be running (it can take up to a 1 HZ depending
>>> on when you called phy_stop()) and so if you took that as a
>>> synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
>>> you will likely see problems. phy_stop_machine() would provide that
>>> synchronization point, but is not currently exported, despite being a
>>> global symbol. This patch series above is all well and good, except that
>>> Geert reported issues with suspend/resume interactions which I have not
>>> been able to track down.
>>>
>> To not confuse readers I changed the subject of the mail to reflect that
>> the discussion isn't about the original topic any longer.
>>
>> It seems to me that (at least one) reason for the issues is that pm
>> callbacks for the phy device and the network device interfere.
>> phy device pm (mdio_bus_phy_suspend et al) feels responsible for dealing
>> with suspending/resuming the PHY, and the network driver pm callbacks
>> as well.
>>
>> Maybe, if the network driver takes care, it should inform phy device pm
>> to do nothing. For this we could add a flag to phydev.mdio.flags.
>> If the network driver sets this flag then mdio_bus_phy_suspend()
>> and mdio_bus_phy_resume() could turn into no-ops.
>> Not totally sure yet about mdio_bus_phy_restore() ..
> 
> What if the MDIO bus is handled by a separate device and the MDIO bus
> is suspended prior to the network driver, thereby making the PHY
> inaccessible?

Indeed. We can know that in the PHY library though, because there is
logic to hold the module reference count, see phy_attach_direct(), we
cannot quite trust whether the Ethernet controller does the right thing
though, as I can think of several ways for things to be done wrong like:
CONFIG_PM is enabled, but the Ethernet driver does not implement any
suspend/resume callback, or does it wrong etc...
-- 
Florian


Re: [PATCH net v2] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT

2018-02-07 Thread Florian Fainelli
On 02/07/2018 11:44 AM, Heiner Kallweit wrote:
> This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added
> long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates
> also PHY state changes and we should do what the symbol says.
> 
> Signed-off-by: Heiner Kallweit 
> ---
> v2:
> - use phy_interrupt_is_valid() instead of checking for irq > 0

Thanks, could you identify which Fixes: tag we should be using for that?
It would be great to see this backported to -stable

> ---
>  drivers/net/phy/phy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index f3313a129..50ed35a45 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev)
>   phy_resume(phydev);
>  
>   /* make sure interrupts are re-enabled for the PHY */
> - if (phydev->irq != PHY_POLL) {
> + if (phy_interrupt_is_valid(phydev)) {
>   err = phy_enable_interrupts(phydev);
>   if (err < 0)
>   break;
> 


-- 
Florian


Re: [suricata PATCH 0/3] Suricata cleanup makefile

2018-02-07 Thread Eric Leblond
Hello Jesper,

On Wed, 2018-02-07 at 23:21 +0100, Jesper Dangaard Brouer wrote:
> Hi Eric,
> 
> I've improved the Suricata ebpf makefile, in-order to avoid
> generating
> the .eh_frame sections.  This required changing the code a bit, to
> allow using clang -target bpf.
> 
> The makefile have also been improved to stop on clang compile errors,
> instead of generating an almost empty BPF ELF file.
> 
> Could I ask you to get these changes into Suricata, through correct
> process for this Open Source project?

Sure, I'm reviewing the code, testing it and I will do a Pull Request
on github.

Thanks a lot for that, that's a really valuable help!

BR,
-- 
Eric Leblond 


[suricata PATCH 3/3] suricata/ebpf: improving the ebpf makefile

2018-02-07 Thread Jesper Dangaard Brouer
From: Jesper Dangaard Brouer 

The current ebpf/Makefile.am have the problem that clang compile
errors still result in an ELF .bpf output file.  This is obviously
problematic as the the problem is first seen runtime when loading
the bpf-prog.  This this is cause by the uses of a pipe from
clang to llc.

To address this problem, split up the clang and llc invocations
up into two separate commands, to get proper reaction based on
the compiler exit code. The clang compiler is used as a
frontend (+ optimizer) and instructed (via -S -emit-llvm) to
generate LLVM IR (Intermediate Representation) with suffix .ll.
The LLVM llc command is used as a compiler backend taking IR and
producing BPF machine bytecode, and storing this into a ELF
object.  In the last step the IR .ll suffix code it removed.

The official documentation of the IR language:
 http://llvm.org/docs/LangRef.html

Also fix the previous make portability warning:
 '%-style pattern rules are a GNU make extension'
I instead use some static pattern rules:
 https://www.gnu.org/software/make/manual/html_node/Static-Usage.html

Signed-off-by: Jesper Dangaard Brouer 
---
 ebpf/Makefile.am |   22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/ebpf/Makefile.am b/ebpf/Makefile.am
index d0ee44ae668e..89a3304e953b 100644
--- a/ebpf/Makefile.am
+++ b/ebpf/Makefile.am
@@ -3,11 +3,25 @@ if BUILD_EBPF
 # Maintaining a local copy of UAPI linux/bpf.h
 BPF_CFLAGS = -Iinclude
 
-all: lb.bpf filter.bpf bypass_filter.bpf xdp_filter.bpf vlan_filter.bpf
+CLANG = ${CC}
 
-%.bpf: %.c
-   ${CC} -Wall $(BPF_CFLAGS) -O2 -D__KERNEL__ -D__ASM_SYSREG_H -target bpf 
-emit-llvm -c $< -o - | ${LLC} -march=bpf -filetype=obj -o $@
+BPF_TARGETS  = lb.bpf
+BPF_TARGETS += filter.bpf
+BPF_TARGETS += bypass_filter.bpf
+BPF_TARGETS += xdp_filter.bpf
+BPF_TARGETS += vlan_filter.bpf
 
-CLEANFILES = *.bpf
+all: $(BPF_TARGETS)
+
+$(BPF_TARGETS): %.bpf: %.c
+#  From C-code to LLVM-IR format suffix .ll (clang -S -emit-llvm)
+   ${CLANG} -Wall $(BPF_CFLAGS) -O2 \
+   -D__KERNEL__ -D__ASM_SYSREG_H \
+   -target bpf -S -emit-llvm $< -o ${@:.bpf=.ll}
+#  From LLVM-IR to BPF-bytecode in ELF-obj file
+   ${LLC} -march=bpf -filetype=obj ${@:.bpf=.ll} -o $@
+   ${RM} ${@:.bpf=.ll}
+
+CLEANFILES = *.bpf *.ll
 
 endif



[suricata PATCH 0/3] Suricata cleanup makefile

2018-02-07 Thread Jesper Dangaard Brouer
Hi Eric,

I've improved the Suricata ebpf makefile, in-order to avoid generating
the .eh_frame sections.  This required changing the code a bit, to
allow using clang -target bpf.

The makefile have also been improved to stop on clang compile errors,
instead of generating an almost empty BPF ELF file.

Could I ask you to get these changes into Suricata, through correct
process for this Open Source project?

--Jesper

---

Jesper Dangaard Brouer (3):
  suricata/ebpf: take clang -target bpf include issue of stdint.h into 
account
  suricata/ebpf: compile with clang -target bpf
  suricata/ebpf: improving the ebpf makefile


 ebpf/Makefile.am |   22 ++
 ebpf/bypass_filter.c |   27 +--
 ebpf/filter.c|3 +--
 ebpf/hash_func01.h   |   12 ++--
 ebpf/lb.c|   11 +--
 ebpf/vlan_filter.c   |5 ++---
 ebpf/xdp_filter.c|   42 --
 7 files changed, 65 insertions(+), 57 deletions(-)

--


[suricata PATCH 2/3] suricata/ebpf: compile with clang -target bpf

2018-02-07 Thread Jesper Dangaard Brouer
From: Jesper Dangaard Brouer 

Enable compiling eBPF programs with clang -target bpf.

This is mostly to workaround a bug in libbpf, where clang > ver 4.0.0
generates some ELF sections (.eh_frame) when -target bpf is NOT specified,
and libbpf fails loading such files.

Notice libbpf is provided by the kernel, and in kernel v4.16 the library
will contain the needed function for attatching to the XDP hook.

Kernel commit 949abbe88436 ("libbpf: add function to setup XDP")
 https://git.kernel.org/torvalds/c/949abbe88436

As it looks now, the library fix will not get into kernel v4.16. Thus, we
need this workaround for Suricata.  In-order to recommend people installing
the library libbpf from kernel v4.16.

Signed-off-by: Jesper Dangaard Brouer 
---
 ebpf/Makefile.am |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/ebpf/Makefile.am b/ebpf/Makefile.am
index 66db35f80c7e..d0ee44ae668e 100644
--- a/ebpf/Makefile.am
+++ b/ebpf/Makefile.am
@@ -6,7 +6,7 @@ BPF_CFLAGS = -Iinclude
 all: lb.bpf filter.bpf bypass_filter.bpf xdp_filter.bpf vlan_filter.bpf
 
 %.bpf: %.c
-   ${CC} -Wall $(BPF_CFLAGS) -O2 -D__KERNEL__ -D__ASM_SYSREG_H -emit-llvm 
-c $< -o - | ${LLC} -march=bpf -filetype=obj -o $@
+   ${CC} -Wall $(BPF_CFLAGS) -O2 -D__KERNEL__ -D__ASM_SYSREG_H -target bpf 
-emit-llvm -c $< -o - | ${LLC} -march=bpf -filetype=obj -o $@
 
 CLEANFILES = *.bpf
 



[suricata PATCH 1/3] suricata/ebpf: take clang -target bpf include issue of stdint.h into account

2018-02-07 Thread Jesper Dangaard Brouer
From: Jesper Dangaard Brouer 

This patch prepares code before enabling the clang -target bpf.

The clang compiler does not like #include  when
using '-target bpf' it will fail with:

 fatal error: 'gnu/stubs-32.h' file not found

This is because using clang -target bpf, then clang will have '__bpf__'
defined instead of '__x86_64__' hence the gnu/stubs-32.h include
attempt as /usr/include/gnu/stubs.h contains, on x86_64:

  #if !defined __x86_64__
  # include 
  #endif
  #if defined __x86_64__ && defined __LP64__
  # include 
  #endif
  #if defined __x86_64__ && defined __ILP32__
  # include 
  #endif

This can be worked around by installing the 32-bit version of
glibc-devel.i686 on your distribution.

But the BPF programs does not really need to include stdint.h,
if converting:
  uint64_t -> __u64
  uint32_t -> __u32
  uint16_t -> __u16
  uint8_t  -> __u8

This patch does this type syntax conversion.

Signed-off-by: Jesper Dangaard Brouer 
---
 ebpf/bypass_filter.c |   27 +--
 ebpf/filter.c|3 +--
 ebpf/hash_func01.h   |   12 ++--
 ebpf/lb.c|   11 +--
 ebpf/vlan_filter.c   |5 ++---
 ebpf/xdp_filter.c|   42 --
 6 files changed, 47 insertions(+), 53 deletions(-)

diff --git a/ebpf/bypass_filter.c b/ebpf/bypass_filter.c
index d2ce12aa1cd5..be81032b16cf 100644
--- a/ebpf/bypass_filter.c
+++ b/ebpf/bypass_filter.c
@@ -15,7 +15,6 @@
  * 02110-1301, USA.
  */
 
-#include 
 #include 
 #include 
 
@@ -51,9 +50,9 @@ struct flowv6_keys {
 } __attribute__((__aligned__(8)));
 
 struct pair {
-uint64_t time;
-uint64_t packets;
-uint64_t bytes;
+__u64 time;
+__u64 packets;
+__u64 bytes;
 } __attribute__((__aligned__(8)));
 
 struct bpf_map_def SEC("maps") flow_table_v4 = {
@@ -77,10 +76,10 @@ struct bpf_map_def SEC("maps") flow_table_v6 = {
  */
 static __always_inline int ipv4_filter(struct __sk_buff *skb)
 {
-uint32_t nhoff, verlen;
+__u32 nhoff, verlen;
 struct flowv4_keys tuple;
 struct pair *value;
-uint16_t port;
+__u16 port;
 
 nhoff = skb->cb[0];
 
@@ -107,8 +106,8 @@ static __always_inline int ipv4_filter(struct __sk_buff 
*skb)
 #if 0
 if ((tuple.port16[0] == 22) || (tuple.port16[1] == 22))
 {
-uint16_t sp = tuple.port16[0];
-//uint16_t dp = tuple.port16[1];
+__u16 sp = tuple.port16[0];
+//__u16 dp = tuple.port16[1];
 char fmt[] = "Parsed SSH flow: %u %d -> %u\n";
 bpf_trace_printk(fmt, sizeof(fmt), tuple.src, sp, tuple.dst);
 }
@@ -118,8 +117,8 @@ static __always_inline int ipv4_filter(struct __sk_buff 
*skb)
 if (value) {
 #if 0
 {
-uint16_t sp = tuple.port16[0];
-//uint16_t dp = tuple.port16[1];
+__u16 sp = tuple.port16[0];
+//__u16 dp = tuple.port16[1];
 char bfmt[] = "Found flow: %u %d -> %u\n";
 bpf_trace_printk(bfmt, sizeof(bfmt), tuple.src, sp, tuple.dst);
 }
@@ -139,11 +138,11 @@ static __always_inline int ipv4_filter(struct __sk_buff 
*skb)
  */
 static __always_inline int ipv6_filter(struct __sk_buff *skb)
 {
-uint32_t nhoff;
-uint8_t nhdr;
+__u32 nhoff;
+__u8 nhdr;
 struct flowv6_keys tuple;
 struct pair *value;
-uint16_t port;
+__u16 port;
 
 nhoff = skb->cb[0];
 
@@ -223,4 +222,4 @@ int SEC("filter") hashfilter(struct __sk_buff *skb) {
 
 char __license[] SEC("license") = "GPL";
 
-uint32_t __version SEC("version") = LINUX_VERSION_CODE;
+__u32 __version SEC("version") = LINUX_VERSION_CODE;
diff --git a/ebpf/filter.c b/ebpf/filter.c
index 1976ffcf188f..4fe95d4fb005 100644
--- a/ebpf/filter.c
+++ b/ebpf/filter.c
@@ -15,7 +15,6 @@
  * 02110-1301, USA.
  */
 
-#include 
 #include 
 #include 
 
@@ -56,4 +55,4 @@ int SEC("filter") hashfilter(struct __sk_buff *skb) {
 
 char __license[] SEC("license") = "GPL";
 
-uint32_t __version SEC("version") = LINUX_VERSION_CODE;
+__u32 __version SEC("version") = LINUX_VERSION_CODE;
diff --git a/ebpf/hash_func01.h b/ebpf/hash_func01.h
index 060346f67a6a..38255812e376 100644
--- a/ebpf/hash_func01.h
+++ b/ebpf/hash_func01.h
@@ -4,12 +4,12 @@
  * From: http://www.azillionmonkeys.com/qed/hash.html
  */
 
-#define get16bits(d) (*((const uint16_t *) (d)))
+#define get16bits(d) (*((const __u16 *) (d)))
 
 static __always_inline
-uint32_t SuperFastHash (const char *data, int len, uint32_t initval) {
-   uint32_t hash = initval;
-   uint32_t tmp;
+__u32 SuperFastHash (const char *data, int len, __u32 initval) {
+   __u32 hash = initval;
+   __u32 tmp;
int rem;
 
if (len <= 0 || data == NULL) return 0;
@@ -23,7 +23,7 @@ uint32_t SuperFastHash (const char *data, int len, uint32_t 
initval) {
hash  += get16bits (data);
tmp= (get16bits (data+2) << 11) ^ hash;
hash   = (hash << 16) ^ tmp;
-

Re: [RFC PATCH 02/24] xsk: add user memory registration sockopt

2018-02-07 Thread Björn Töpel
2018-02-07 17:00 GMT+01:00 Willem de Bruijn :
> On Wed, Jan 31, 2018 at 8:53 AM, Björn Töpel  wrote:
>> From: Björn Töpel 
>>
>> The XDP_MEM_REG socket option allows a process to register a window of
>> user space memory to the kernel. This memory will later be used as
>> frame data buffer.
>>
>> Signed-off-by: Björn Töpel 
>> ---
>
>> +static struct xsk_umem *xsk_mem_reg(u64 addr, u64 size, u32 frame_size,
>> +   u32 data_headroom)
>> +{
>> +   unsigned long lock_limit, locked, npages;
>> +   int ret = 0;
>> +   struct xsk_umem *umem;
>> +
>> +   if (!can_do_mlock())
>> +   return ERR_PTR(-EPERM);
>> +
>> +   umem = xsk_umem_create(addr, size, frame_size, data_headroom);
>> +   if (IS_ERR(umem))
>> +   return umem;
>> +
>> +   npages = PAGE_ALIGN(umem->nframes * umem->frame_size) >> PAGE_SHIFT;
>> +
>> +   down_write(>mm->mmap_sem);
>> +
>> +   locked = npages + current->mm->pinned_vm;
>> +   lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> +
>> +   if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>> +   ret = -ENOMEM;
>> +   goto out;
>> +   }
>> +
>> +   if (npages == 0 || npages > UINT_MAX) {
>> +   ret = -EINVAL;
>> +   goto out;
>> +   }
>> +   umem->npgs = npages;
>> +
>> +   ret = xsk_umem_pin_pages(umem);
>> +
>> +out:
>> +   if (ret < 0) {
>> +   put_pid(umem->pid);
>> +   kfree(umem);
>> +   } else {
>> +   current->mm->pinned_vm = locked;
>> +   }
>> +
>> +   up_write(>mm->mmap_sem);
>
> This limits per process. You may want to limit per user. See also
> mm_account_pinned_pages.

Ah, noted! Thanks for pointing that out!


Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-02-07 Thread Björn Töpel
2018-02-07 18:59 GMT+01:00 Tom Herbert :
> On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel  wrote:
[...]
>>
>> Below are the results in Mpps of the I40E NIC benchmark runs for 64
>> byte packets, generated by commercial packet generator HW that is
>> generating packets at full 40 Gbit/s line rate.
>>
>> XDP baseline numbers without this RFC:
>> xdp_rxq_info --action XDP_DROP 31.3 Mpps
>> xdp_rxq_info --action XDP_TX   16.7 Mpps
>>
>> XDP performance with this RFC i.e. with the buffer allocator:
>> XDP_DROP 21.0 Mpps
>> XDP_TX   11.9 Mpps
>>
>> AF_PACKET V4 performance from previous RFC on 4.14-rc7:
>> Benchmark   V2 V3 V4 V4+ZC
>> rxdrop  0.67   0.73   0.74   33.7
>> txpush  0.98   0.98   0.91   19.6
>> l2fwd   0.66   0.71   0.67   15.5
>>
>> AF_XDP performance:
>> Benchmark   XDP_SKB   XDP_DRVXDP_DRV_ZC (all in Mpps)
>> rxdrop  3.311.6 16.9
>> txpush  2.2 NA* 21.8
>> l2fwd   1.7 NA* 10.4
>>
> Hi Bjorn,
>
> This is very impressive work, thank you for contributing it!
>

Thank you for looking at it! :-)

> For these benchmarks how are the AF_PACKET and AF_XDP numbers to be
> compared. For instance is rxdpop comparable to XDP_DROP and
> "xdp_rxq_info --action XDP_DROP"? Given your explanation below, I
> believe they are, but it might be better to make that clear up front.
>

Ah, yeah, that was a bit confusing:

"xdp_rxq_info --action XDP_DROP" is doing an XDP_DROP (no buffer
touching) and should be compared to "XDP_DROP". I meant to write
"xdp_rxq_info --action XDP_DROP" instead of "XDP_DROP" for the
second case.

So, what this shows is that the buffer allocation scheme in the patch
set (buff_pool) has a pretty hard performance regression (21.0 vs
31.3) on the regular XDP (and skb!) path. Not acceptable.

"rxdrop" from AF_PACKET V4 should be compared to "rxdrop" from
AF_XDP. This is dropping a packet in user space, whereas the former is
dropping a packet in XDP/kernel space.

Less confusing?


Björn


Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-02-07 Thread Björn Töpel
2018-02-07 16:54 GMT+01:00 Willem de Bruijn :
>> We realized, a bit late maybe, that 24 patches is a bit mouthful, so
>> let me try to make it more palatable.
>
> Overall, this approach looks great to me.
>

Yay! :-)

> The patch set incorporates all the feedback from AF_PACKET V4.
> At this point I don't have additional high-level interface comments.
>

I have a thought on the socket API. Now, we're registering buffer
memory *to* the kernel, but mmap:ing the Rx/Tx rings *from* the
kernel. I'm leaning towards removing the mmap call, in favor of
registering the rings to kernel analogous to the XDP_MEM_REG socket
option. We wont guarantee physical contiguous memory for the rings,
but I think we can live with that. Thoughts?

> As you point out, 24 patches and nearly 6000 changed lines is
> quite a bit to ingest. Splitting up in smaller patch sets will help
> give more detailed implementation feedback.
>
> The frame pool and device driver changes are largely independent
> from AF_XDP and probably should be resolved first (esp. the
> observed regresssion even without AF_XDP).
>

Yeah, the regression is unacceptable.

Another way is starting with the patches without zero-copy first
(i.e. the copy path), and later add the driver modifications. That
would be the first 7 patches.

> As you suggest, it would be great if the need for a separate
> xsk_packet_array data structure can be avoided.
>

Yes, we'll address that!

> Since frames from the same frame pool can be forwarded between
> multiple device ports and thus AF_XDP sockets, that should perhaps
> be a separate object independent from the sockets. This comment
> hints at the awkward situation if tied to a descriptor pair:
>
>> +   /* Check if umem is from this socket, if so do not make
>> +* circular references.
>> +*/
>
> Since this is in principle just a large shared memory area, could
> it reuse existing BPF map logic?
>

Hmm, care to elaborate on your thinking here?

> More extreme, and perhaps unrealistic, is if the descriptor ring
> could similarly be a BPF map and the Rx XDP program directly
> writes the descriptor, instead of triggering xdp_do_xsk_redirect.
> As we discussed before, this would avoid the need to specify a
> descriptor format upfront.

Having the XDP program writeback the descriptor to user space ring is
really something that would be useful (writing a virtio-net
descriptors...). I need to think a bit more about this. :-) Please
share your ideas!

Thanks for looking into the patches!


Björn


Re: [PATCH net-next] sch_netem: Bug fixing in calculating Netem interval

2018-02-07 Thread Stephen Hemminger
On Tue, 6 Feb 2018 23:14:18 -0500
"Md. Islam"  wrote:

> In Kernel 4.15.0+, Netem does not work properly.
> 
> Netem setup:
> 
> tc qdisc add dev h1-eth0 root handle 1: netem delay 10ms 2ms
> 
> Result:
> 
> PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
> 64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=22.8 ms
> 64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=10.9 ms
> 64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=10.9 ms
> 64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=11.4 ms
> 64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
> 64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=4303 ms
> 64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.2 ms
> 64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.3 ms
> 64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=4304 ms
> 64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=4303 ms
> 
> Patch:
> 
> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index 7bbc13b..7c179ad 100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -327,7 +327,7 @@ static s64 tabledist(s64 mu, s32 sigma,
> 
>  /* default uniform distribution */
>  if (dist == NULL)
> -return (rnd % (2 * sigma)) - sigma + mu;
> +return ((rnd % (2 * sigma)) + mu) - sigma;
> 
>  t = dist->table[rnd % dist->size];
>  x = (sigma % NETEM_DIST_SCALE) * t;
> 
> 
> (rnd % (2 * sigma)) - sigma was overflowing s32. After applying the
> patch, I found following output which is desirable.
> 
> PING 172.16.101.2 (172.16.101.2) 56(84) bytes of data.
> 64 bytes from 172.16.101.2: icmp_seq=1 ttl=64 time=21.1 ms
> 64 bytes from 172.16.101.2: icmp_seq=2 ttl=64 time=8.46 ms
> 64 bytes from 172.16.101.2: icmp_seq=3 ttl=64 time=9.00 ms
> 64 bytes from 172.16.101.2: icmp_seq=4 ttl=64 time=11.8 ms
> 64 bytes from 172.16.101.2: icmp_seq=5 ttl=64 time=8.36 ms
> 64 bytes from 172.16.101.2: icmp_seq=6 ttl=64 time=11.8 ms
> 64 bytes from 172.16.101.2: icmp_seq=7 ttl=64 time=8.11 ms
> 64 bytes from 172.16.101.2: icmp_seq=8 ttl=64 time=10.0 ms
> 64 bytes from 172.16.101.2: icmp_seq=9 ttl=64 time=11.3 ms
> 64 bytes from 172.16.101.2: icmp_seq=10 ttl=64 time=11.5 ms
> 64 bytes from 172.16.101.2: icmp_seq=11 ttl=64 time=10.2 ms

Thanks for reporting the problem.  The original code used 32 bit values
and got converted to 64 bit time values when it went to using ktime.

Your patch fixes the problem, and lets go with it.
The other alternaive would be to make sigma s64 but then it would could
cause unnecessary 64 bit modulus here.

Reviewed-by: Stephen Hemminger 



Re: handling of phy_stop() and phy_stop_machine() in phylib

2018-02-07 Thread Russell King - ARM Linux
On Wed, Feb 07, 2018 at 09:56:37PM +0100, Heiner Kallweit wrote:
> Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
> > 
> > 
> > On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
> >> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
> >>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
>  This commit forces callers of phy_resume() and phy_suspend() to hold
>  mutex phydev->lock. This was done for calls to phy_resume() and
>  phy_suspend() in phylib, however there are more callers in network
>  drivers. I'd assume that these other calls issue a warning now
>  because of the lock not being held.
>  So is there something I miss or would this have to be fixed?
> >>>
> >>> Hi Heiner
> >>>
> >>> This is a good point.
> >>>
> >>> Yes, it looks like some fixes are needed. But what exactly?
> >>>
> >>> The phy state machine will suspend and resume the phy is you call
> >>> phy_stop() and phy_start() in the MAC suspend and resume functions.
> >>>
> >> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
> >> to PHY_HALTED and (at least if we're not in polling mode) doesn't
> >> call phy_suspend(). Maybe a call to phy_trigger_machine() is
> >> needed like in phy_start() ? Then the state machine would call
> >> phy_suspend(), provided the link is still up.
> > 
> > Right, phy_stop() merely just moves the state machine to PHY_HALTED and
> > this is actually a great source of problems which I tried to address here:
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
> > 
> > because phy_stop() is not a synchronous call, so when it returns the
> > state machine might still be running (it can take up to a 1 HZ depending
> > on when you called phy_stop()) and so if you took that as a
> > synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
> > you will likely see problems. phy_stop_machine() would provide that
> > synchronization point, but is not currently exported, despite being a
> > global symbol. This patch series above is all well and good, except that
> > Geert reported issues with suspend/resume interactions which I have not
> > been able to track down.
> > 
> To not confuse readers I changed the subject of the mail to reflect that
> the discussion isn't about the original topic any longer.
> 
> It seems to me that (at least one) reason for the issues is that pm
> callbacks for the phy device and the network device interfere.
> phy device pm (mdio_bus_phy_suspend et al) feels responsible for dealing
> with suspending/resuming the PHY, and the network driver pm callbacks
> as well.
> 
> Maybe, if the network driver takes care, it should inform phy device pm
> to do nothing. For this we could add a flag to phydev.mdio.flags.
> If the network driver sets this flag then mdio_bus_phy_suspend()
> and mdio_bus_phy_resume() could turn into no-ops.
> Not totally sure yet about mdio_bus_phy_restore() ..

What if the MDIO bus is handled by a separate device and the MDIO bus
is suspended prior to the network driver, thereby making the PHY
inaccessible?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: [RFC PATCH 05/24] bpf: added bpf_xdpsk_redirect

2018-02-07 Thread Björn Töpel
2018-02-05 14:42 GMT+01:00 Jesper Dangaard Brouer :
> On Wed, 31 Jan 2018 14:53:37 +0100 Björn Töpel  wrote:
>
>> The bpf_xdpsk_redirect call redirects the XDP context to the XDP
>> socket bound to the receiving queue (if any).
>
> As I explained in-person at FOSDEM, my suggestion is to use the
> bpf-map infrastructure for AF_XDP redirect, but people on this
> upstream mailing also need a chance to validate my idea ;-)
>
> The important thing to keep in-mind is how we can still maintain a
> SPSC (Single producer Single Consumer) relationship between an
> RX-queue and a userspace consumer-process.
>
> This AF_XDP "FOSDEM" patchset, store the "xsk" (xdp_sock) pointer
> directly in the net_device (_rx[].netdev_rx_queue.xs) structure.  This
> limit each RX-queue to service a single xdp_sock.  It sounds good from
> a SPSC pov, but not very flexible.  With a "xdp_sock_map" we can get
> the flexibility to select among multiple xdp_sock'ets (via XDP
> pre-filter selecting a different map), and still maintain a SPSC
> relationship.  The RX-queue will just have several SPSC relationships
> with the individual xdp_sock's.
>
> This is true for the AF_XDP-copy mode, and require less driver change.
> For the AF_XDP-zero-copy (ZC) mode drivers need significant changes
> anyhow, and in ZC case we will have to disallow this multiple
> xdp_sock's, which is simply achieved by checking if the xdp_sock
> pointer returned from the map lookup match the one that userspace
> requested driver to register it's memory for RX-rings from.
>
> The "xdp_sock_map" is an array, where the index correspond to the
> queue_index.  The bpf_redirect_map() ignore the specified index and
> instead use xdp_rxq_info->queue_index in the lookup.
>
> Notice that a bpf-map have no pinned relationship with the device or
> XDP prog loaded.  Thus, userspace need to bind() this map to the
> device before traffic can flow, like the proposed bind() on the
> xdp_sock.  This is to establish the SPSC binding.  My proposal is that
> userspace insert the xdp_sock file-descriptor(s) in the map at the
> queue-index, and the map (which is also just a file-descriptor) is
> bound maybe via bind() to a specific device (via the ifindex).  Kernel
> side will walk the map and do required actions xdp_sock's in find in
> map.
>

As we discussed at FOSDEM, I like the idea of using a map. This also
opens up for configuring the AF_XDP sockets via bpf code, like sockmap
does.

I'll have a stab at adding an "xdp_sock_map/xskmap" or similar, and
also extending the cgroup sock_ops to support AF_XDP sockets, so that
the xskmap can be configured from bpf-land.


Björn

> TX-side is harder, as now multiple xdp_sock's can have the same
> queue-pair ID with the same net_device. But Magnus propose that this
> can be solved with hardware. As newer NICs have many TX-queue, and the
> queue-pair ID is just an external visible number, while the kernel
> internal structure can point to a dedicated TX-queue per xdp_sock.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: handling of phy_stop() and phy_stop_machine() in phylib

2018-02-07 Thread Heiner Kallweit
Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
> 
> 
> On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
>> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
>>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
 This commit forces callers of phy_resume() and phy_suspend() to hold
 mutex phydev->lock. This was done for calls to phy_resume() and
 phy_suspend() in phylib, however there are more callers in network
 drivers. I'd assume that these other calls issue a warning now
 because of the lock not being held.
 So is there something I miss or would this have to be fixed?
>>>
>>> Hi Heiner
>>>
>>> This is a good point.
>>>
>>> Yes, it looks like some fixes are needed. But what exactly?
>>>
>>> The phy state machine will suspend and resume the phy is you call
>>> phy_stop() and phy_start() in the MAC suspend and resume functions.
>>>
>> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
>> to PHY_HALTED and (at least if we're not in polling mode) doesn't
>> call phy_suspend(). Maybe a call to phy_trigger_machine() is
>> needed like in phy_start() ? Then the state machine would call
>> phy_suspend(), provided the link is still up.
> 
> Right, phy_stop() merely just moves the state machine to PHY_HALTED and
> this is actually a great source of problems which I tried to address here:
> 
> https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
> 
> because phy_stop() is not a synchronous call, so when it returns the
> state machine might still be running (it can take up to a 1 HZ depending
> on when you called phy_stop()) and so if you took that as a
> synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
> you will likely see problems. phy_stop_machine() would provide that
> synchronization point, but is not currently exported, despite being a
> global symbol. This patch series above is all well and good, except that
> Geert reported issues with suspend/resume interactions which I have not
> been able to track down.
> 
To not confuse readers I changed the subject of the mail to reflect that
the discussion isn't about the original topic any longer.

It seems to me that (at least one) reason for the issues is that pm
callbacks for the phy device and the network device interfere.
phy device pm (mdio_bus_phy_suspend et al) feels responsible for dealing
with suspending/resuming the PHY, and the network driver pm callbacks
as well.

Maybe, if the network driver takes care, it should inform phy device pm
to do nothing. For this we could add a flag to phydev.mdio.flags.
If the network driver sets this flag then mdio_bus_phy_suspend()
and mdio_bus_phy_resume() could turn into no-ops.
Not totally sure yet about mdio_bus_phy_restore() ..


> We should most definitively try to consolidate the different PHY
> suspend/resume within the Ethernet MAC suspend/resume implementation and
> document exactly what the appropriate behavior must be under the
> following circumstances:
> 
> - when to call phy_stop() + phy_stop_machine()
> - when to call phy_suspend() (if the network interface does do not WoL)
> - when to call phy_resume() (if needed, actually, it usually is not)
> - when to call phy_start()
> 
> I don't unfortunately have the time to code this myself at the moment,
> but I will happily review patches if you have the opportunity to do so.
> 
>>
>> However, if the link is down already (due to whatever calls
>> around phy_stop() in the driver) then phy_suspend() wouldn't be
>> called.
> 
> Correct, there is an implicit assumption that when the link is down,
> there is an opportunity for the Ethernet MAC driver to put things in low
> power, and the PHY itself, should be in a lower power mode where only
> link/energy detection might be utilizing power. At least this is the theory.
> 
>>
>> Heiner
>>
>>> A few examples:
>>>
>>> tc35815_suspend(), ravb_suspend() via ravb_close(), sh_eth_suspend()
>>> via sh_eth_close(), fec_suspend(), mpc52xx_fec_of_suspend() via
>>> mpc52xx_fec_close(), ucc_geth_suspend(), etc...
>>>
>>> So i suspect those drivers which call phy_suspend()/phy_resume()
>>> should really be modified to call phy_stop()/phy_start().
>>>
>>> hns_nic_config_phy_loopback() is just funky, and probably needs the
>>> help of the hns guys to fix.
>>>
>>> dsa_slave_suspend() already does a phy_stop(), so the phy_suspend()
>>> can be removed.
>>>
>>> The comments in lpc_eth_open() suggest the phy_resume() is needed, so
>>> locks should be added. socfpga_dwmac_resume() seems to be the same.
>>>
>>> Andrew
>>>
>>
> 



Re: [PATCH] rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c

2018-02-07 Thread Matthias Kaehlcke
El Wed, Feb 07, 2018 at 02:35:59PM -0600 Larry Finger ha dit:

> On 02/07/2018 02:26 PM, Matthias Kaehlcke wrote:
> > In _rtl92c_get_txpower_writeval_by_regulatory() the variable writeVal
> > is assigned to itself in an if ... else statement, apparently only to
> > document that the branch condition is handled and that a previously read
> > value should be returned unmodified. The self-assignment causes clang to
> > raise the following warning:
> > 
> > drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c:304:13:
> >error: explicitly assigning value of variable of type 'u32'
> >  (aka 'unsigned int') to itself [-Werror,-Wself-assign]
> >writeVal = writeVal;
> > 
> > Replace the self-assignment with a semicolon, which still serves to
> > document the 'handling' of the branch condition.
> > 
> > Signed-off-by: Matthias Kaehlcke 
> > ---
> >   drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c 
> > b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
> > index 9cff6bc4049c..4db92496c122 100644
> > --- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
> > +++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
> > @@ -301,7 +301,7 @@ static void 
> > _rtl92c_get_txpower_writeval_by_regulatory(struct ieee80211_hw *hw,
> > writeVal = writeVal - 0x06060606;
> > else if (rtlpriv->dm.dynamic_txhighpower_lvl ==
> >  TXHIGHPWRLEVEL_BT2)
> > -   writeVal = writeVal;
> > +   ;
> > *(p_outwriteval + rf) = writeVal;
> > }
> >   }
> > 
> 
> As the branch condition does nothing, why not remove it and save the
> compiler's optimizer a bit of work? The code looks strange, but it matches
> the rest of Realtek's USB drivers.

Sure, I am happy to change it to whatever the authors/maintainers prefer.

I'll wait a bit before respinning for if others feel strongly about
keeping the branch.


Re: [PATCH] rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c

2018-02-07 Thread Larry Finger

On 02/07/2018 02:26 PM, Matthias Kaehlcke wrote:

In _rtl92c_get_txpower_writeval_by_regulatory() the variable writeVal
is assigned to itself in an if ... else statement, apparently only to
document that the branch condition is handled and that a previously read
value should be returned unmodified. The self-assignment causes clang to
raise the following warning:

drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c:304:13:
   error: explicitly assigning value of variable of type 'u32'
 (aka 'unsigned int') to itself [-Werror,-Wself-assign]
   writeVal = writeVal;

Replace the self-assignment with a semicolon, which still serves to
document the 'handling' of the branch condition.

Signed-off-by: Matthias Kaehlcke 
---
  drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c 
b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
index 9cff6bc4049c..4db92496c122 100644
--- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
+++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
@@ -301,7 +301,7 @@ static void 
_rtl92c_get_txpower_writeval_by_regulatory(struct ieee80211_hw *hw,
writeVal = writeVal - 0x06060606;
else if (rtlpriv->dm.dynamic_txhighpower_lvl ==
 TXHIGHPWRLEVEL_BT2)
-   writeVal = writeVal;
+   ;
*(p_outwriteval + rf) = writeVal;
}
  }



As the branch condition does nothing, why not remove it and save the compiler's 
optimizer a bit of work? The code looks strange, but it matches the rest of 
Realtek's USB drivers.


Larry



[PATCH] rtlwifi: rtl8192cu: Remove variable self-assignment in rf.c

2018-02-07 Thread Matthias Kaehlcke
In _rtl92c_get_txpower_writeval_by_regulatory() the variable writeVal
is assigned to itself in an if ... else statement, apparently only to
document that the branch condition is handled and that a previously read
value should be returned unmodified. The self-assignment causes clang to
raise the following warning:

drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c:304:13:
  error: explicitly assigning value of variable of type 'u32'
(aka 'unsigned int') to itself [-Werror,-Wself-assign]
  writeVal = writeVal;

Replace the self-assignment with a semicolon, which still serves to
document the 'handling' of the branch condition.

Signed-off-by: Matthias Kaehlcke 
---
 drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c 
b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
index 9cff6bc4049c..4db92496c122 100644
--- a/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
+++ b/drivers/net/wireless/realtek/rtlwifi/rtl8192cu/rf.c
@@ -301,7 +301,7 @@ static void 
_rtl92c_get_txpower_writeval_by_regulatory(struct ieee80211_hw *hw,
writeVal = writeVal - 0x06060606;
else if (rtlpriv->dm.dynamic_txhighpower_lvl ==
 TXHIGHPWRLEVEL_BT2)
-   writeVal = writeVal;
+   ;
*(p_outwriteval + rf) = writeVal;
}
 }
-- 
2.16.0.rc1.238.g530d649a79-goog



Re: Regression for ip6-in-ip4 IPsec tunnel in 4.14.16

2018-02-07 Thread Yves-Alexis Perez
On Wed, 2018-02-07 at 13:50 -0500, Mike Maloney wrote:
> On Wed, Feb 7, 2018 at 12:23 PM, Yves-Alexis Perez 
> 
> Hi Yves-Alexis -
> 
> I apologize for the problem.  It seems to me that tunneling with an
> outer MTU that causes the inner MTU to be smaller than the min, is
> potentially problematic in other ways as well.

Maybe. I tried with removing the MTU setting, and I get (on ping again)

févr. 07 20:44:01 scapa kernel: mtu: 1266

which means I would get -EINVAL on standards kernels, which is not really good
either.

> But also it could seem unfortunate that the code with my fix does not
> look at actual packet size, but instead only looks at the MTU and then
> fails, even if no packet was going to be so large.  The intention of
> my patch was to prevent a negative number while calculating the
> maxfraglen in  __ip6_append_data().  An alternative fix maybe to
> instead return an error only if the mtu is less than or equal to the
> fragheaderlen.   Something like:
> 
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 3763dc01e374..5d912a289b95 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1214,8 +1214,6 @@ static int ip6_setup_cork(struct sock *sk,
> struct inet_cork_full *cork,
> if (np->frag_size)
> mtu = np->frag_size;
> }
> -   if (mtu < IPV6_MIN_MTU)
> -   return -EINVAL;
> cork->base.fragsize = mtu;
> if (dst_allfrag(rt->dst.path))
> cork->base.flags |= IPCORK_ALLFRAG;
> @@ -1264,6 +1262,8 @@ static int __ip6_append_data(struct sock *sk,
> 
> fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len +
> (opt ? opt->opt_nflen : 0);
> +   if (mtu < fragheaderlen + 8)
> +   return -EINVAL;
> maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen -
>  sizeof(struct frag_hdr);
> (opt ? opt->opt_nflen : 0);
> 
> But then we also have to convince ourselves that maxfraglen can never
> be <= 0.  I'd have to think about that.
> 
> I am not sure if others have thoughts on supporting MTUs configured
> below the min in the spec.
> 
Here, the MTU is not below, so I'm not sure what happens.

Regards,
-- 
Yves-Alexis

signature.asc
Description: This is a digitally signed message part


[PATCH net v2] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT

2018-02-07 Thread Heiner Kallweit
This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added
long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates
also PHY state changes and we should do what the symbol says.

Signed-off-by: Heiner Kallweit 
---
v2:
- use phy_interrupt_is_valid() instead of checking for irq > 0
---
 drivers/net/phy/phy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index f3313a129..50ed35a45 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev)
phy_resume(phydev);
 
/* make sure interrupts are re-enabled for the PHY */
-   if (phydev->irq != PHY_POLL) {
+   if (phy_interrupt_is_valid(phydev)) {
err = phy_enable_interrupts(phydev);
if (err < 0)
break;
-- 
2.16.1



Re: [PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT

2018-02-07 Thread Heiner Kallweit
Am 07.02.2018 um 20:34 schrieb Florian Fainelli:
> 
> 
> On 02/07/2018 11:26 AM, Heiner Kallweit wrote:
>> Am 07.02.2018 um 20:06 schrieb Florian Fainelli:
>>>
>>>
>>> On 02/07/2018 10:44 AM, Heiner Kallweit wrote:
 This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added
 long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates
 also PHY state changes and we should do what the symbol says.
>>>
>>> Do you use phy_enable_interrupts() to configure how the PHY interrupts
>>> will be flowing through the Ethernet MAC?
>>>
>> No. And I'm not sure I understand your question correctly.
> 
> No wonder, my question does not make sense, I read the test wrong.
> 
>> The change applies the same behavior as e.g. in phy_connect_direct()
>> where phy_start_interrupts() is called only if phy_dev->irq > 0.
> 
> Not enough coffee, your change is fine, could you consider using
> phy_interrupt_is_valid() instead for this test?
> 
Sure. I was considering this already however wasn't sure because
currently both ways to check for a valid interrupt are used in
phylib.

>>

 Signed-off-by: Heiner Kallweit 
 ---
  drivers/net/phy/phy.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
 index f3313a129..50ed35a45 100644
 --- a/drivers/net/phy/phy.c
 +++ b/drivers/net/phy/phy.c
 @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev)
phy_resume(phydev);
  
/* make sure interrupts are re-enabled for the PHY */
 -  if (phydev->irq != PHY_POLL) {
 +  if (phydev->irq > 0) {
err = phy_enable_interrupts(phydev);
if (err < 0)
break;

>>>
>>
> 



[PATCH] net: Extra '_get' in declaration of arch_get_platform_mac_address

2018-02-07 Thread Mathieu Malaterre
In commit c7f5d105495a ("net: Add eth_platform_get_mac_address() helper."),
two declarations were added:

  int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr);
  unsigned char *arch_get_platform_get_mac_address(void);

An extra '_get' was introduced in arch_get_platform_get_mac_address, remove
it. Fix compile warning using W=1:

  CC  net/ethernet/eth.o
net/ethernet/eth.c:523:24: warning: no previous prototype for 
‘arch_get_platform_mac_address’ [-Wmissing-prototypes]
 unsigned char * __weak arch_get_platform_mac_address(void)
^
  AR  net/ethernet/built-in.o

Signed-off-by: Mathieu Malaterre 
---
 include/linux/etherdevice.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 263dbcad22fc..79563840c295 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -31,7 +31,7 @@
 #ifdef __KERNEL__
 struct device;
 int eth_platform_get_mac_address(struct device *dev, u8 *mac_addr);
-unsigned char *arch_get_platform_get_mac_address(void);
+unsigned char *arch_get_platform_mac_address(void);
 u32 eth_get_headlen(void *data, unsigned int max_len);
 __be16 eth_type_trans(struct sk_buff *skb, struct net_device *dev);
 extern const struct header_ops eth_header_ops;
-- 
2.11.0



Re: [PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT

2018-02-07 Thread Florian Fainelli


On 02/07/2018 11:26 AM, Heiner Kallweit wrote:
> Am 07.02.2018 um 20:06 schrieb Florian Fainelli:
>>
>>
>> On 02/07/2018 10:44 AM, Heiner Kallweit wrote:
>>> This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added
>>> long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates
>>> also PHY state changes and we should do what the symbol says.
>>
>> Do you use phy_enable_interrupts() to configure how the PHY interrupts
>> will be flowing through the Ethernet MAC?
>>
> No. And I'm not sure I understand your question correctly.

No wonder, my question does not make sense, I read the test wrong.

> The change applies the same behavior as e.g. in phy_connect_direct()
> where phy_start_interrupts() is called only if phy_dev->irq > 0.

Not enough coffee, your change is fine, could you consider using
phy_interrupt_is_valid() instead for this test?

> 
>>>
>>> Signed-off-by: Heiner Kallweit 
>>> ---
>>>  drivers/net/phy/phy.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>>> index f3313a129..50ed35a45 100644
>>> --- a/drivers/net/phy/phy.c
>>> +++ b/drivers/net/phy/phy.c
>>> @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev)
>>> phy_resume(phydev);
>>>  
>>> /* make sure interrupts are re-enabled for the PHY */
>>> -   if (phydev->irq != PHY_POLL) {
>>> +   if (phydev->irq > 0) {
>>> err = phy_enable_interrupts(phydev);
>>> if (err < 0)
>>> break;
>>>
>>
> 

-- 
Florian


Re: [PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT

2018-02-07 Thread Heiner Kallweit
Am 07.02.2018 um 20:06 schrieb Florian Fainelli:
> 
> 
> On 02/07/2018 10:44 AM, Heiner Kallweit wrote:
>> This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added
>> long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates
>> also PHY state changes and we should do what the symbol says.
> 
> Do you use phy_enable_interrupts() to configure how the PHY interrupts
> will be flowing through the Ethernet MAC?
> 
No. And I'm not sure I understand your question correctly.
The change applies the same behavior as e.g. in phy_connect_direct()
where phy_start_interrupts() is called only if phy_dev->irq > 0.

>>
>> Signed-off-by: Heiner Kallweit 
>> ---
>>  drivers/net/phy/phy.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index f3313a129..50ed35a45 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev)
>>  phy_resume(phydev);
>>  
>>  /* make sure interrupts are re-enabled for the PHY */
>> -if (phydev->irq != PHY_POLL) {
>> +if (phydev->irq > 0) {
>>  err = phy_enable_interrupts(phydev);
>>  if (err < 0)
>>  break;
>>
> 



[PATCH net-next] ibmvnic: queue reset when CRQ gets closed during reset

2018-02-07 Thread Nathan Fontenot
While handling a driver reset we get a H_CLOSED return trying
to send a CRQ event. When this occurs we need to queue up another
reset attempt. Without doing this we see instances where the driver
is left in a closed state because the reset failed and there is no
further attempts to reset the driver.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index 5caaa9033841..0f7e1d016207 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2908,8 +2908,12 @@ static int ibmvnic_send_crq(struct ibmvnic_adapter 
*adapter,
cpu_to_be64(u64_crq[1]));
 
if (rc) {
-   if (rc == H_CLOSED)
+   if (rc == H_CLOSED) {
dev_warn(dev, "CRQ Queue closed\n");
+   if (adapter->resetting)
+   ibmvnic_reset(adapter, VNIC_RESET_FATAL);
+   }
+
dev_warn(dev, "Send error (rc=%d)\n", rc);
}
 



Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-02-07 Thread Pablo Neira Ayuso
On Wed, Feb 07, 2018 at 11:06:42AM -0800, Andrew Morton wrote:
> From: Michal Hocko 
> Subject: net/netfilter/x_tables.c: remove size check
> 
> Back in 2002 vmalloc used to BUG on too large sizes.  We are much better
> behaved these days and vmalloc simply returns NULL for those.  Remove the
> check as it simply not needed and the comment is even misleading.

Applied, thanks!


Re: [netfilter-core] kernel panic: Out of memory and no killable processes... (2)

2018-02-07 Thread Andrew Morton
On Wed, 7 Feb 2018 18:44:39 +0100 Pablo Neira Ayuso  wrote:

> Hi,
> 
> On Wed, Jan 31, 2018 at 09:19:16AM +0100, Michal Hocko wrote:
> [...]
> > Yeah, we do not BUG but rather fail instead. See __vmalloc_node_range.
> > My excavation tools pointed me to "VM: Rework vmalloc code to support 
> > mapping of arbitray pages"
> > by Christoph back in 2002. So yes, we can safely remove it finally. Se
> > below.
> > 
> > 
> > From 8d52e1d939d101b0dafed6ae5c3c1376183e65bb Mon Sep 17 00:00:00 2001
> > From: Michal Hocko 
> > Date: Wed, 31 Jan 2018 09:16:56 +0100
> > Subject: [PATCH] net/netfilter/x_tables.c: remove size check
> > 
> > Back in 2002 vmalloc used to BUG on too large sizes. We are much better
> > behaved these days and vmalloc simply returns NULL for those. Remove
> > the check as it simply not needed and the comment even misleading.
> > 
> > Suggested-by: Andrew Morton 
> > Signed-off-by: Michal Hocko 
> > ---
> >  net/netfilter/x_tables.c | 4 
> >  1 file changed, 4 deletions(-)
> > 
> > diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
> > index b55ec5aa51a6..48a6ff620493 100644
> > --- a/net/netfilter/x_tables.c
> > +++ b/net/netfilter/x_tables.c
> > @@ -999,10 +999,6 @@ struct xt_table_info *xt_alloc_table_info(unsigned int 
> > size)
> > if (sz < sizeof(*info))
> > return NULL;
> >  
> > -   /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
> > -   if ((SMP_ALIGN(size) >> PAGE_SHIFT) + 2 > totalram_pages)
> > -   return NULL;
> > -
> > /* __GFP_NORETRY is not fully supported by kvmalloc but it should
> >  * work reasonably well if sz is too large and bail out rather
> >  * than shoot all processes down before realizing there is nothing
> 
> Patchwork didn't catch this patch for some reason, would you mind to
> resend?

From: Michal Hocko 
Subject: net/netfilter/x_tables.c: remove size check

Back in 2002 vmalloc used to BUG on too large sizes.  We are much better
behaved these days and vmalloc simply returns NULL for those.  Remove the
check as it simply not needed and the comment is even misleading.

Link: http://lkml.kernel.org/r/20180131081916.go21...@dhcp22.suse.cz
Suggested-by: Andrew Morton 
Signed-off-by: Michal Hocko 
Reviewed-by: Andrew Morton 
Cc: Florian Westphal 
Cc: David S. Miller 
Signed-off-by: Andrew Morton 
---

 net/netfilter/x_tables.c |4 
 1 file changed, 4 deletions(-)

diff -puN net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check 
net/netfilter/x_tables.c
--- a/net/netfilter/x_tables.c~net-netfilter-x_tablesc-remove-size-check
+++ a/net/netfilter/x_tables.c
@@ -1004,10 +1004,6 @@ struct xt_table_info *xt_alloc_table_inf
if (sz < sizeof(*info))
return NULL;
 
-   /* Pedantry: prevent them from hitting BUG() in vmalloc.c --RR */
-   if ((size >> PAGE_SHIFT) + 2 > totalram_pages)
-   return NULL;
-
/* __GFP_NORETRY is not fully supported by kvmalloc but it should
 * work reasonably well if sz is too large and bail out rather
 * than shoot all processes down before realizing there is nothing
_



Re: [PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT

2018-02-07 Thread Florian Fainelli


On 02/07/2018 10:44 AM, Heiner Kallweit wrote:
> This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added
> long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates
> also PHY state changes and we should do what the symbol says.

Do you use phy_enable_interrupts() to configure how the PHY interrupts
will be flowing through the Ethernet MAC?

> 
> Signed-off-by: Heiner Kallweit 
> ---
>  drivers/net/phy/phy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index f3313a129..50ed35a45 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev)
>   phy_resume(phydev);
>  
>   /* make sure interrupts are re-enabled for the PHY */
> - if (phydev->irq != PHY_POLL) {
> + if (phydev->irq > 0) {
>   err = phy_enable_interrupts(phydev);
>   if (err < 0)
>   break;
> 

-- 
Florian


Re: [PATCH 00/11] Netfilter fixes for net

2018-02-07 Thread David Miller
From: Pablo Neira Ayuso 
Date: Wed,  7 Feb 2018 18:42:18 +0100

> The following patchset contains Netfilter fixes for you net tree, they
> are:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thanks Pablo.

> P.S: Again more fixes cooking on netfilter-de...@vger.kernel.org, so
>  another round is likely coming up soon.

Ok, no problem.


Re: Regression for ip6-in-ip4 IPsec tunnel in 4.14.16

2018-02-07 Thread Mike Maloney
On Wed, Feb 7, 2018 at 12:23 PM, Yves-Alexis Perez  wrote:
> On Wed, 2018-02-07 at 18:05 +0100, Yves-Alexis Perez wrote:
>> I'll try to printk the mtu before returning EINVAL to see why it's lower than
>> 1280, but maybe the IP encapsulation is not correctly handled?
>
> I did:
>
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 3763dc01e374..d3c651158d35 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -1215,7 +1215,7 @@ static int ip6_setup_cork(struct sock *sk, struct 
> inet_cork_full *cork,
> mtu = np->frag_size;
> }
> if (mtu < IPV6_MIN_MTU)
> -   return -EINVAL;
> +   printk("mtu: %d\n", mtu);
> cork->base.fragsize = mtu;
> if (dst_allfrag(rt->dst.path))
> cork->base.flags |= IPCORK_ALLFRAG;
>
> and I get:
>
> févr. 07 18:19:50 scapa kernel: mtu: 1218
>
> and it doesn't depend on the original packet size (same thing happens with
> ping -s 100). It also happens with UDP (DNS) traffic, but apparently not with
> TCP.
>
> Regards,
> --
> Yves-Alexis

Hi Yves-Alexis -

I apologize for the problem.  It seems to me that tunneling with an
outer MTU that causes the inner MTU to be smaller than the min, is
potentially problematic in other ways as well.

But also it could seem unfortunate that the code with my fix does not
look at actual packet size, but instead only looks at the MTU and then
fails, even if no packet was going to be so large.  The intention of
my patch was to prevent a negative number while calculating the
maxfraglen in  __ip6_append_data().  An alternative fix maybe to
instead return an error only if the mtu is less than or equal to the
fragheaderlen.   Something like:

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3763dc01e374..5d912a289b95 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1214,8 +1214,6 @@ static int ip6_setup_cork(struct sock *sk,
struct inet_cork_full *cork,
if (np->frag_size)
mtu = np->frag_size;
}
-   if (mtu < IPV6_MIN_MTU)
-   return -EINVAL;
cork->base.fragsize = mtu;
if (dst_allfrag(rt->dst.path))
cork->base.flags |= IPCORK_ALLFRAG;
@@ -1264,6 +1262,8 @@ static int __ip6_append_data(struct sock *sk,

fragheaderlen = sizeof(struct ipv6hdr) + rt->rt6i_nfheader_len +
(opt ? opt->opt_nflen : 0);
+   if (mtu < fragheaderlen + 8)
+   return -EINVAL;
maxfraglen = ((mtu - fragheaderlen) & ~7) + fragheaderlen -
 sizeof(struct frag_hdr);
(opt ? opt->opt_nflen : 0);

But then we also have to convince ourselves that maxfraglen can never
be <= 0.  I'd have to think about that.

I am not sure if others have thoughts on supporting MTUs configured
below the min in the spec.


Thanks.
--
Mike Maloney


[PATCH net] net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT

2018-02-07 Thread Heiner Kallweit
This condition wasn't adjusted when PHY_IGNORE_INTERRUPT (-2) was added
long ago. In case of PHY_IGNORE_INTERRUPT the MAC interrupt indicates
also PHY state changes and we should do what the symbol says.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/phy/phy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index f3313a129..50ed35a45 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -822,7 +822,7 @@ void phy_start(struct phy_device *phydev)
phy_resume(phydev);
 
/* make sure interrupts are re-enabled for the PHY */
-   if (phydev->irq != PHY_POLL) {
+   if (phydev->irq > 0) {
err = phy_enable_interrupts(phydev);
if (err < 0)
break;
-- 
2.16.1



[Secunia Research] Linux Kernel Vulnerability - Sending information

2018-02-07 Thread Secunia Research
Hello,

Secunia Research at Flexera has discovered a vulnerability in Linux Kernel,
which can be exploited by malicious, local users to cause a DoS (Denial of
Service).

Details:
-

After a bit of fuzzing and some debugging, I've prepared a program that
triggers a BUG() failure at net/core/skbuff.c:104.
It happens when an SCTP ABORT message is about to be sent. The main problem
seems to be with the data size/length. This becomes a problem when the flow
reaches the "skb_put()" function (net/core/skbuff.c) and the "unlikenly()"
condition is met.

I have just checked the reproducer against the current David Miller net-tree
and it doesn't seem to be addressed yet. 

Proof-of-Concept:
-

I wasn't sure if I should share the reproducer via this email. Please let me
know what's the preferred channel.

Kernel crash message:

[   31.900829] skbuff: skb_over_panic: text:d6dff053 len:68556
put:68544 head:1a927f7f data:01696ac8 tail:0x10c84 end:0xec0
dev:
[   31.902421] [ cut here ]
[   31.902968] kernel BUG at net/core/skbuff.c:104!
[   31.903559] invalid opcode:  [#1] SMP KASAN PTI
[   31.904159] Modules linked in:
[   31.904541] CPU: 1 PID: 3458 Comm: repro Not tainted 4.15.0+ #2
[   31.905416] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
[   31.906749] RIP: 0010:skb_panic+0x152/0x1d0
[   31.907211] RSP: 0018:880066f766a0 EFLAGS: 00010282
[   31.907762] RAX: 008f RBX: 8800641ae2c0 RCX:

[   31.908527] RDX: 008f RSI: 11000cdeec99 RDI:
ed000cdeecc8
[   31.909287] RBP: 84a2efc0 R08: 11000cdeec6d R09:

[   31.910022] R10: 0001 R11:  R12:
83c920f0
[   31.910770] R13: 00010bc0 R14: 84a2e860 R15:
0ec0
[   31.911514] FS:  7f4face87700() GS:88006cf0()
knlGS:
[   31.912367] CS:  0010 DS:  ES:  CR0: 80050033
[   31.912973] CR2: 20020fe5 CR3: 69288000 CR4:
06e0
[   31.913708] Call Trace:
[   31.914534]  skb_put+0x178/0x1c0
[   31.914890]  sctp_packet_transmit+0x1120/0x3740
[   31.924671]  sctp_outq_flush+0x113a/0x3b90
[   31.963822]  sctp_do_sm+0x4a5c/0x65c0
[   31.973685]  sctp_primitive_ABORT+0x99/0xc0
[   31.974457]  sctp_sendmsg+0x1bb4/0x33a0
[   31.987812]  inet_sendmsg+0x125/0x580
[   31.991609]  sock_sendmsg+0xc0/0x100
[   31.992320]  ___sys_sendmsg+0x714/0x900
[   31.03]  __sys_sendmsg+0xbd/0x1e0
[   32.002942]  SyS_sendmsg+0x27/0x40
[   32.003585]  entry_SYSCALL_64_fastpath+0x18/0x85
[   32.004461] RIP: 0033:0x7f4fada2472d
[   32.005130] RSP: 002b:7f4face86ec0 EFLAGS: 0293
[   32.005138] Code: 03 0f b6 04 01 84 c0 74 04 3c 03 7e 20 8b 4b 78 41 56
45 89 e8 41 57 56 48 c7 c7 a0 e8 a2 84 52 48 89 ee 4c 89 e2 e8 00 bd 2d fe
<0f> 0b 4c 89 4c 24 10 48 89 54 24 08 48 89 34 24 e8 b9 15 72 fe 
[   32.009658] RIP: skb_panic+0x152/0x1d0 RSP: 880066f766a0
[   32.010756] ---[ end trace 239ba69b984ccf99 ]---

Closing Comments:
-

We have assigned the vulnerability Secunia Advisory SA81331.
 
A preliminary release date has been set to February 21st, 2018 for the
publication of our advisory. However, we are naturally prepared to push the
disclosure date in accordance with the Secunia Research Disclosure Policy
[1], if you need more time to address the vulnerability as long as you keep
us updated on the status.

Please don't hesitate to contact us for assistance with
confirming/reproducing the reported vulnerability.

Credits should go to:
"Jakub Jirasek, Secunia Research at Flexera"

In the case that a HTTPS URL is allowed within the mentioning of the credits
on e.g. your web site, then please utilize the link [2], which could be made
to trigger by clicking on the "Secunia Research" parts of the credits for
example. We highly appreciate the effort.

Please acknowledge receiving this e-mail and let us know when you expect to
fix the vulnerability.

References:
[1] https://secuniaresearch.flexerasoftware.com/community/research/policy/
[2]
https://www.flexerasoftware.com/enterprise/company/about/secunia-research/

---

Kind Regards,

Jakub Jirasek
Team Lead Information Security Analyst

Secunia Research at Flexera

Arne Jacobsens Allé 7, 5th floor
2300 Copenhagen S
Denmark

Phone +45 7020 5144

http://www.flexera.com




Re: [PATCH] net: ethernet: ti: cpsw: fix net watchdog timeout

2018-02-07 Thread Grygorii Strashko
On 02/07/2018 07:31 AM, Ivan Khoronzhuk wrote:
> On Wed, Feb 07, 2018 at 05:03:19AM +0200, Ivan Khoronzhuk wrote:
>> On Tue, Feb 06, 2018 at 07:17:06PM -0600, Grygorii Strashko wrote:
>>> It was discovered that simple program which indefinitely sends 200b UDP
>>> packets and runs on TI AM574x SoC (SMP) under RT Kernel triggers network
>>> watchdog timeout in TI CPSW driver (<6 hours run). The network watchdog
>>> timeout is triggered due to race between cpsw_ndo_start_xmit() and
>>> cpsw_tx_handler() [NAPI]
>>>
>>> cpsw_ndo_start_xmit()
>>> if (unlikely(!cpdma_check_free_tx_desc(txch))) {
>>> txq = netdev_get_tx_queue(ndev, q_idx);
>>> netif_tx_stop_queue(txq);
>>>
>>> ^^ as per [1] barrier has to be used after set_bit() otherwise new value
>>> might not be visible to other cpus
>>> }
>>>
>>> cpsw_tx_handler()
>>> if (unlikely(netif_tx_queue_stopped(txq)))
>>> netif_tx_wake_queue(txq);
>>>
>>> and when it happens ndev TX queue became disabled forever while driver's HW
>>> TX queue is empty.
>> I'm sure it fixes test case somehow but there is some strangeness.
>> (I've thought about this some X months ago):
>> 1. If no free desc, then there is bunch of descs on the queue ready to be 
>> sent
>> 2. If one of this desc while this process was missed then next will wake 
>> queue,
>> because there is bunch of them on the fly. So, if desc on top of the sent 
>> queue
>> missed to enable the queue, then next one more likely will enable it anyway..
>> then how it could happen? The described race is possible only on last
>> descriptor, yes, packets are small the speed is hight, possibility is very 
>> small
>> .but then next situation is also possible:
>> - packets are sent fast
>> - all packets were sent, but no any descriptors are freed now by sw 
>> interrupt (NAPI)
>> - when interrupt had started NAPI, the queue was enabled, all other next
>> interrupts are throttled once NAPI not finished it's work yet.
>> - when new packet submitted, no free descs are present yet (NAPI has not 
>> freed
>> any yet), but all packets are sent, so no one can awake tx queue, as 
>> interrupt
>> will not arise when NAPI is started to free first descriptor interrupts are
>> disabled.because h/w queue to be sent is empty...
>> - how it can happen as submitting packet and handling packet operations is 
>> under
>> channel lock? Not exactly, a period between handling and freeing the 
>> descriptor
>> to the pool is not under channel lock, here:
>>
>>  spin_unlock_irqrestore(>lock, flags);
>>  if (unlikely(status & CPDMA_DESC_TD_COMPLETE))
>>  cb_status = -ENOSYS;
>>  else
>>  cb_status = status;
>>
>>  __cpdma_chan_free(chan, desc, outlen, cb_status);
>>  return status;
>>
>> unlock_ret:
>>  spin_unlock_irqrestore(>lock, flags);
>>  return status;
>>
>> And:
>> __cpdma_chan_free(chan, desc, outlen, cb_status);
>>  -> cpdma_desc_free(pool, desc, 1);
>>
>> As result, queue deadlock as you've described.
>> Just thought, not checked, but theoretically possible.
>> What do you think?

Yep. And if this happens desc_alloc_fail should be >0 while i usually see it 0 
when 
watchdog triggers.
I was able to reproduce this situation once, but with bunch of debug code added
which changes timings:

Prerequisite: only one free desc available.
 cpsw_ndo_start_xmit1   cpsw_tx_poll
  prepare and send packet 
  ->Free desc queue is empty at this moment
 if (unlikely(!cpdma_check_free_tx_desc(txch)))
netif_tx_stop_queue(txq);
 --- tx queue stopped
cpdma_chan_process()

spin_lock_irqsave(>lock, flags);
chan->count--

spin_unlock_irqrestore(>lock, flags)

cpsw_tx_handler()
   if 
(unlikely(netif_tx_queue_stopped(txq)))

netif_tx_wake_queue(txq);
 --- tx queue is woken up
 cpsw_ndo_start_xmit2
  prepare packet
 ret = cpsw_tx_packet_submit(priv, skb, txch);
 //fail as desc not returned to the pool yet
if (unlikely(ret != 0)) {
cpsw_err(priv, tx_err, "desc submit failed\n");
goto fail;
}
cpdma_desc_free()   

fail:
ndev->stats.tx_dropped++;
netif_tx_stop_queue(txq);
// oops.

That's why I added double check and barrier in fail path also
 

> 
> Better explanation, for rare race:
> 
> start conditions:
> - all descs are submitted, except last one, and queue is not stopped
> - any desc was returned to the pool yet (but packets can be sent)
> 
>  time
>   ||
>   \/
> 
> submit process   

Re: [PATCH] selftests: bpf: test_kmod.sh: check the module path before insmod

2018-02-07 Thread Naresh Kamboju
Hi Daniel,

On 7 February 2018 at 19:02, Daniel Borkmann  wrote:
> Hi Naresh,
>
> On 02/06/2018 10:07 PM, Naresh Kamboju wrote:
>> test_kmod.sh reported false failure when module not present.
>> Check test_bpf.ko is present in the path before loading it.
>>
>> Stop using "insmod $SRC_TREE/lib/test_bpf.ko" instead use
>> "modprobe test_bpf"
>>
>> Signed-off-by: Naresh Kamboju 
>
> Thanks for looking into this! Could we have a way to be able to
> support both? Say, when test_bpf.ko from SRC_TREE is not present,
> we try with modprobe -q fallback? I would still like to support
> the case where you can make local changes and add new tests to
> test_bpf.c, recompile and then just rerun the test_kmod.sh w/o
> having to install it first.

Thanks for the review comments.
I have sent patch v2 with your comments addressed.

- Naresh

>
> Thanks,
> Daniel
>
>>  tools/testing/selftests/bpf/test_kmod.sh | 11 +++
>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/testing/selftests/bpf/test_kmod.sh 
>> b/tools/testing/selftests/bpf/test_kmod.sh
>> index ed4774d..54177b1 100755
>> --- a/tools/testing/selftests/bpf/test_kmod.sh
>> +++ b/tools/testing/selftests/bpf/test_kmod.sh
>> @@ -1,8 +1,6 @@
>>  #!/bin/sh
>>  # SPDX-License-Identifier: GPL-2.0
>>
>> -SRC_TREE=../../../../
>> -
>>  test_run()
>>  {
>>   sysctl -w net.core.bpf_jit_enable=$1 2>&1 > /dev/null
>> @@ -10,8 +8,13 @@ test_run()
>>
>>   echo "[ JIT enabled:$1 hardened:$2 ]"
>>   dmesg -C
>> - insmod $SRC_TREE/lib/test_bpf.ko 2> /dev/null
>> - if [ $? -ne 0 ]; then
>> + # Use modprobe dry run to check for missing test_bpf module
>> + if ! /sbin/modprobe -q -n test_bpf; then
>> + echo "test_bpf: [SKIP]"
>> + elif /sbin/modprobe -q test_bpf; then
>> + echo "test_bpf: ok"
>> + else
>> + echo "test_bpf: [FAIL]"
>>   rc=1
>>   fi
>>   rmmod  test_bpf 2> /dev/null
>>
>


[PATCH v2] selftests: bpf: test_kmod.sh: check the module path before insmod

2018-02-07 Thread Naresh Kamboju
test_kmod.sh reported false failure when module not present.
Check test_bpf.ko is present in the path before loading it.

Two cases to be addressed here,
In the development process of test_bpf.c unit testing will be done by
developers by using "insmod $SRC_TREE/lib/test_bpf.ko"

On the other hand testers run full tests by installing modules on device
under test (DUT) and followed by modprobe to insert the modules accordingly.

Signed-off-by: Naresh Kamboju 
---
 tools/testing/selftests/bpf/test_kmod.sh | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_kmod.sh 
b/tools/testing/selftests/bpf/test_kmod.sh
index ed4774d..35669cc 100755
--- a/tools/testing/selftests/bpf/test_kmod.sh
+++ b/tools/testing/selftests/bpf/test_kmod.sh
@@ -10,9 +10,21 @@ test_run()
 
echo "[ JIT enabled:$1 hardened:$2 ]"
dmesg -C
-   insmod $SRC_TREE/lib/test_bpf.ko 2> /dev/null
-   if [ $? -ne 0 ]; then
-   rc=1
+   if [ -f ${SRC_TREE}/lib/test_bpf.ko ]; then
+   insmod ${SRC_TREE}/lib/test_bpf.ko 2> /dev/null
+   if [ $? -ne 0 ]; then
+   rc=1
+   fi
+   else
+   # Use modprobe dry run to check for missing test_bpf module
+   if ! /sbin/modprobe -q -n test_bpf; then
+   echo "test_bpf: [SKIP]"
+   elif /sbin/modprobe -q test_bpf; then
+   echo "test_bpf: ok"
+   else
+   echo "test_bpf: [FAIL]"
+   rc=1
+   fi
fi
rmmod  test_bpf 2> /dev/null
dmesg | grep FAIL
-- 
2.7.4



Re: [RFC PATCH 00/24] Introducing AF_XDP support

2018-02-07 Thread Tom Herbert
On Wed, Jan 31, 2018 at 5:53 AM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and zero-copy
> semantics. Throughput improvements can be up to 20x compared to V2 and
> V3 for the micro benchmarks included. Would be great to get your
> feedback on it. Note that this is the follow up RFC to AF_PACKET V4
> from November last year. The feedback from that RFC submission and the
> presentation at NetdevConf in Seoul was to create a new address family
> instead of building on top of AF_PACKET. AF_XDP is this new address
> family.
>
> The main difference between AF_XDP and AF_PACKET V2/V3 on a descriptor
> level is that TX and RX descriptors are separated from packet
> buffers. An RX or TX descriptor points to a data buffer in a packet
> buffer area. RX and TX can share the same packet buffer so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, then
> the descriptor that points to that packet buffer can be changed to
> point to another buffer and reused right away. This again avoids
> copying data.
>
> The RX and TX descriptor rings are registered with the setsockopts
> XDP_RX_RING and XDP_TX_RING, similar to AF_PACKET. The packet buffer
> area is allocated by user space and registered with the kernel using
> the new XDP_MEM_REG setsockopt. All these three areas are shared
> between user space and kernel space. The socket is then bound with a
> bind() call to a device and a specific queue id on that device, and it
> is not until bind is completed that traffic starts to flow.
>
> An XDP program can be loaded to direct part of the traffic on that
> device and queue id to user space through a new redirect action in an
> XDP program called bpf_xdpsk_redirect that redirects a packet up to
> the socket in user space. All the other XDP actions work just as
> before. Note that the current RFC requires the user to load an XDP
> program to get any traffic to user space (for example all traffic to
> user space with the one-liner program "return
> bpf_xdpsk_redirect();"). We plan on introducing a patch that removes
> this requirement and sends all traffic from a queue to user space if
> an AF_XDP socket is bound to it.
>
> AF_XDP can operate in three different modes: XDP_SKB, XDP_DRV, and
> XDP_DRV_ZC (shorthand for XDP_DRV with a zero-copy allocator as there
> is no specific mode called XDP_DRV_ZC). If the driver does not have
> support for XDP, or XDP_SKB is explicitly chosen when loading the XDP
> program, XDP_SKB mode is employed that uses SKBs together with the
> generic XDP support and copies out the data to user space. A fallback
> mode that works for any network device. On the other hand, if the
> driver has support for XDP (all three NDOs: ndo_bpf, ndo_xdp_xmit and
> ndo_xdp_flush), these NDOs, without any modifications, will be used by
> the AF_XDP code to provide better performance, but there is still a
> copy of the data into user space. The last mode, XDP_DRV_ZC, is XDP
> driver support with the zero-copy user space allocator that provides
> even better performance. In this mode, the networking HW (or SW driver
> if it is a virtual driver like veth) DMAs/puts packets straight into
> the packet buffer that is shared between user space and kernel
> space. The RX and TX descriptor queues of the networking HW are NOT
> shared to user space. Only the kernel can read and write these and it
> is the kernel driver's responsibility to translate these HW specific
> descriptors to the HW agnostic ones in the virtual descriptor rings
> that user space sees. This way, a malicious user space program cannot
> mess with the networking HW. This mode though requires some extensions
> to XDP.
>
> To get the XDP_DRV_ZC mode to work for RX, we chose to introduce a
> buffer pool concept so that the same XDP driver code can be used for
> buffers allocated using the page allocator (XDP_DRV), the user-space
> zero-copy allocator (XDP_DRV_ZC), or some internal driver specific
> allocator/cache/recycling mechanism. The ndo_bpf call has also been
> extended with two commands for registering and unregistering an XSK
> socket and is in the RX case mainly used to communicate some
> information about the user-space buffer pool to the driver.
>
> For the TX path, our plan was to use ndo_xdp_xmit and ndo_xdp_flush,
> but we run into problems with this (further discussion in the
> challenges section) and had to introduce a new NDO called
> ndo_xdp_xmit_xsk (xsk = XDP socket). It takes a pointer to a netdevice
> and an explicit queue id that packets should be sent out on. In
> contrast to ndo_xdp_xmit, it is asynchronous and pulls packets to be
> sent from the xdp socket (associated with the dev and queue
> combination that was provided with the NDO call) using a callback
> (get_tx_packet), and 

  1   2   >