RE: [PATCH 1/1] net: ethernet: qlogic: set error code on failure

2016-12-03 Thread Mintz, Yuval
> From: Pan Bian 
> 
> When calling dma_mapping_error(), the value of return variable rc is 0.
> And when the call returns an unexpected value, rc is not set to a negative
> errno. Thus, it will return 0 on the error path, and its callers cannot detect
> the bug. This patch fixes the bug, assigning "-ENOMEM" to rc.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=189041
> 
> Signed-off-by: Pan Bian 

The title should have been "[PATCH net 1/1] qed: Set error code on failure".

But the fix itself is sound. Thanks.
BTW, is -ENOMEM the right return code in case of DMA mapping errors?

Acked-by: Yuval Mintz 



Re: [Intel-wired-lan] [PATCH v2] e1000e: free IRQ regardless of __E1000_DOWN

2016-12-03 Thread Neftin, Sasha
On 12/2/2016 7:02 PM, Baicar, Tyler wrote:
> Hello Sasha,
> 
> Were you able to reproduce this issue?
> 
> Do you have a patch fixing the close function inconsistencies that you
> mentioned which I could try out?
> 
> Thanks,
> Tyler
> 
> On 11/21/2016 1:40 PM, Baicar, Tyler wrote:
>> On 11/17/2016 6:31 AM, Neftin, Sasha wrote:
>>> On 11/13/2016 10:34 AM, Neftin, Sasha wrote:
 On 11/11/2016 12:35 AM, Baicar, Tyler wrote:
> Hello Sasha,
>
> On 11/9/2016 11:19 PM, Neftin, Sasha wrote:
>> On 11/9/2016 11:41 PM, Tyler Baicar wrote:
>>> Move IRQ free code so that it will happen regardless of the
>>> __E1000_DOWN bit. Currently the e1000e driver only releases its IRQ
>>> if the __E1000_DOWN bit is cleared. This is not sufficient because
>>> it is possible for __E1000_DOWN to be set without releasing the IRQ.
>>> In such a situation, we will hit a kernel bug later in e1000_remove
>>> because the IRQ still has action since it was never freed. A
>>> secondary bus reset can cause this case to happen.
>>>
>>> Signed-off-by: Tyler Baicar 
>>> ---
>>>drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
>>>1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
>>> index 7017281..36cfcb0 100644
>>> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
>>> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
>>> @@ -4679,12 +4679,13 @@ int e1000e_close(struct net_device *netdev)
>>>  if (!test_bit(__E1000_DOWN, &adapter->state)) {
>>>e1000e_down(adapter, true);
>>> -e1000_free_irq(adapter);
>>>  /* Link status message must follow this format */
>>>pr_info("%s NIC Link is Down\n", adapter->netdev->name);
>>>}
>>>+e1000_free_irq(adapter);
>>> +
>>>napi_disable(&adapter->napi);
>>>  e1000e_free_tx_resources(adapter->tx_ring);
>>>
>> I would not recommend applying this change. It is related to the driver
>> state machine, and we are afraid of a lot of synchronization problems
>> and issues.
>> We need to keep e1000_free_irq in the loop and check for 'test_bit' ready.
> What do you mean here? There is no loop. If __E1000_DOWN is set then
> we will never free the IRQ.
>
>> Another point: before executing the secondary bus reset, does your SW
>> properly back up the PCIe configuration space?
> After a secondary bus reset, the link needs to recover and go back to
> a working state after 1 second.
>
> From the call stack, the issue happens while removing the endpoint
> from the system, before applying the secondary bus reset.
>
> The order of events is
> 1. remove the drivers
> 2. cause a secondary bus reset
> 3. wait 1 second
 Actually, this is too much; the link usually comes up in less than 100 ms. You can
 check the Data Link Layer indication.
> 4. recover the link
>
> callstack:
> free_msi_irqs+0x6c/0x1a8
> pci_disable_msi+0xb0/0x148
> e1000e_reset_interrupt_capability+0x60/0x78
> e1000_remove+0xc8/0x180
> pci_device_remove+0x48/0x118
> __device_release_driver+0x80/0x108
> device_release_driver+0x2c/0x40
> pci_stop_bus_device+0xa0/0xb0
> pci_stop_bus_device+0x3c/0xb0
> pci_stop_root_bus+0x54/0x80
> acpi_pci_root_remove+0x28/0x64
> acpi_bus_trim+0x6c/0xa4
> acpi_device_hotplug+0x19c/0x3f4
> acpi_hotplug_work_fn+0x28/0x3c
> process_one_work+0x150/0x460
> worker_thread+0x50/0x4b8
> kthread+0xd4/0xe8
> ret_from_fork+0x10/0x50
>
> Thanks,
> Tyler
>
 Hello Tyler,
 Okay, we need to discuss this suggestion further.
 May I ask what setup you are running? Is it a NIC or on-board LAN? I would
 like to try to reproduce this issue in our lab too.
 Also, is the same issue observed in the same scenario with other NICs?
 Sasha
 ___
 Intel-wired-lan mailing list
 intel-wired-...@lists.osuosl.org
 http://lists.osuosl.org/mailman/listinfo/intel-wired-lan

>>> Hello Tyler,
>>> I see some inconsistent implementations of the __*_close methods in our
>>> drivers. Do you have any igb NIC to check whether the same problem persists there?
>>> Thanks,
>>> Sasha
>> Hello Sasha,
>>
>> I couldn't find an igb NIC to test with, but I did find another e1000e
>> card that does not cause the same issue. That card is:
>>
>> 0004:01:00.0 Ethernet controller: Intel Corporation 82574L Gigabit
>> Network Connection
>> Subsystem: Intel Corporation Gigabit CT Desktop Adapter
>> Physical Slot: 5-1
>> Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
>> ParErr- Stepping- SERR+ FastB2B- DisINTx+
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast 

Re: [PATCH net-next] net_sched: gen_estimator: complete rewrite of rate estimators

2016-12-03 Thread Eric Dumazet
On Sat, 2016-12-03 at 23:07 -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> 1) Old code was hard to maintain, due to complex lock chains.
>(We probably will be able to remove some kfree_rcu() in callers)
> 
> 2) Using a single timer to update all estimators does not scale.
> 
> 3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
>is not supposed to work well)


> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index c7adcb57654ea57d1ba6702c91743cb7d2c74d28..859b60bfa86712031186fffc09c65bc43aa065dd 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2081,6 +2081,14 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
>   limit <<= factor;
>  
>   if (atomic_read(&sk->sk_wmem_alloc) > limit) {
> + /* Special case where TX completion is delayed too much :
> +  * If the skb we try to send is the first skb in write queue,
> +  * then send it !
> +  * No need to wait for TX completion to call us back.
> +  */
> + if (skb == sk->sk_write_queue.next)
> + return false;
> +

Oups, this has nothing to do here. I will send a v2, sorry.




[PATCH net-next] net_sched: gen_estimator: complete rewrite of rate estimators

2016-12-03 Thread Eric Dumazet
From: Eric Dumazet 

1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)
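The lockless read described in the list above can be modeled in userspace. This is a sketch under stated assumptions, not the code from the patch: it uses C11 atomics in place of the kernel's seqcount_t and RCU helpers, and `struct est_sketch`, `est_write()` and `est_read()` are hypothetical names. The 64-bit rate is deliberately kept as two 32-bit halves to mimic a 32-bit kernel, where one 64-bit store is not atomic and an unsynchronized reader could see a torn value.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical model of an estimator whose 64-bit rate must be read
 * consistently on a machine without atomic 64-bit stores. */
struct est_sketch {
	atomic_uint seq;          /* even: stable, odd: writer in progress */
	_Atomic uint32_t bps_lo;  /* low 32 bits of bytes per second */
	_Atomic uint32_t bps_hi;  /* high 32 bits of bytes per second */
};

/* Single writer (e.g. a per-estimator timer) publishes a new rate. */
static void est_write(struct est_sketch *e, uint64_t bps)
{
	unsigned int s = atomic_load_explicit(&e->seq, memory_order_relaxed);

	atomic_store_explicit(&e->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);  /* odd seq ordered before data */
	atomic_store_explicit(&e->bps_lo, (uint32_t)bps, memory_order_relaxed);
	atomic_store_explicit(&e->bps_hi, (uint32_t)(bps >> 32), memory_order_relaxed);
	atomic_store_explicit(&e->seq, s + 2, memory_order_release); /* data before even seq */
}

/* Lockless reader: retry until the same even sequence number is seen
 * before and after copying both halves. */
static uint64_t est_read(struct est_sketch *e)
{
	unsigned int s1, s2;
	uint32_t lo, hi;

	do {
		s1 = atomic_load_explicit(&e->seq, memory_order_acquire);
		lo = atomic_load_explicit(&e->bps_lo, memory_order_relaxed);
		hi = atomic_load_explicit(&e->bps_hi, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&e->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	return ((uint64_t)hi << 32) | lo;
}
```

A reader that overlaps a write sees an odd or changed sequence number and simply retries, so it never needs the stats spinlock.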

Signed-off-by: Eric Dumazet 
---
 include/net/act_api.h  |2 
 include/net/gen_stats.h|   17 -
 include/net/netfilter/xt_rateest.h |   10 
 include/net/sch_generic.h  |2 
 net/core/gen_estimator.c   |  298 +--
 net/core/gen_stats.c   |   17 -
 net/ipv4/tcp_output.c  |8 
 net/netfilter/xt_RATEEST.c |4 
 net/netfilter/xt_rateest.c |   28 +-
 net/sched/act_api.c|9 
 net/sched/act_police.c |   21 +
 net/sched/sch_api.c|2 
 net/sched/sch_cbq.c|6 
 net/sched/sch_drr.c|6 
 net/sched/sch_generic.c|2 
 net/sched/sch_hfsc.c   |6 
 net/sched/sch_htb.c|6 
 net/sched/sch_qfq.c|8 
 18 files changed, 188 insertions(+), 264 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9dddf77a69ccbcb003cfa66bcc0de337f78f3dae..1d716449209e4753a297c61a287077a1eb96e6d8 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -36,7 +36,7 @@ struct tc_action {
struct tcf_t tcfa_tm;
struct gnet_stats_basic_packed  tcfa_bstats;
struct gnet_stats_queue tcfa_qstats;
-   struct gnet_stats_rate_est64 tcfa_rate_est;
+   struct net_rate_estimator __rcu *tcfa_rate_est;
spinlock_t  tcfa_lock;
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 231e121cc7d9c72075e7e6dde3655d631f64a1c4..8b7aa370e7a4af61fcb71ed751dba72ebead6143 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -11,6 +11,8 @@ struct gnet_stats_basic_cpu {
struct u64_stats_sync syncp;
 };
 
+struct net_rate_estimator;
+
 struct gnet_dump {
spinlock_t *  lock;
struct sk_buff *  skb;
@@ -42,8 +44,7 @@ void __gnet_stats_copy_basic(const seqcount_t *running,
 struct gnet_stats_basic_cpu __percpu *cpu,
 struct gnet_stats_basic_packed *b);
 int gnet_stats_copy_rate_est(struct gnet_dump *d,
-const struct gnet_stats_basic_packed *b,
-struct gnet_stats_rate_est64 *r);
+struct net_rate_estimator __rcu **ptr);
 int gnet_stats_copy_queue(struct gnet_dump *d,
  struct gnet_stats_queue __percpu *cpu_q,
  struct gnet_stats_queue *q, __u32 qlen);
@@ -53,16 +54,16 @@ int gnet_stats_finish_copy(struct gnet_dump *d);
 
 int gen_new_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **rate_est,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-void gen_kill_estimator(struct gnet_stats_basic_packed *bstats,
-   struct gnet_stats_rate_est64 *rate_est);
+void gen_kill_estimator(struct net_rate_estimator __rcu **ptr);
 int gen_replace_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **ptr,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-bool gen_estimator_active(const struct gnet_stats_basic_packed *bstats,
- const struct gnet_stats_rate_est64 *rate_est);
+bool gen_estimator_active(struct net_rate_estimator __rcu **ptr);
+bool gen_estimator_read(struct net_rate_estimator __rcu **ptr,
+   

[PATCH v2 net-next] net_sched: gen_estimator: complete rewrite of rate estimators

2016-12-03 Thread Eric Dumazet
From: Eric Dumazet 

1) Old code was hard to maintain, due to complex lock chains.
   (We probably will be able to remove some kfree_rcu() in callers)

2) Using a single timer to update all estimators does not scale.

3) Code was buggy on 32bit kernel (WRITE_ONCE() on 64bit quantity
   is not supposed to work well)

In this rewrite :

- I removed the RB tree that had to be scanned in
gen_estimator_active(). qdisc dumps should be much faster.

- Each estimator has its own timer.

- Estimations are maintained in net_rate_estimator structure,
  instead of dirtying the qdisc. Minor, but part of the simplification.

- Reading the estimator uses RCU and a seqcount to provide proper
  support for 32bit kernels.

- We reduce memory need when estimators are not used, since
  we store a pointer, instead of the bytes/packets counters.

- xt_rateest_mt() no longer has to grab a spinlock.
  (In the future, xt_rateest_tg() could be switched to per cpu counters)

Signed-off-by: Eric Dumazet 
---
v2: removed unwanted changes to tcp_output.c
Renamed some parameters to please htmldoc

 include/net/act_api.h  |2 
 include/net/gen_stats.h|   17 -
 include/net/netfilter/xt_rateest.h |   10 
 include/net/sch_generic.h  |2 
 net/core/gen_estimator.c   |  299 +--
 net/core/gen_stats.c   |   17 -
 net/netfilter/xt_RATEEST.c |4 
 net/netfilter/xt_rateest.c |   28 +-
 net/sched/act_api.c|9 
 net/sched/act_police.c |   21 +
 net/sched/sch_api.c|2 
 net/sched/sch_cbq.c|6 
 net/sched/sch_drr.c|6 
 net/sched/sch_generic.c|2 
 net/sched/sch_hfsc.c   |6 
 net/sched/sch_htb.c|6 
 net/sched/sch_qfq.c|8 
 17 files changed, 181 insertions(+), 264 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9dddf77a69ccbcb003cfa66bcc0de337f78f3dae..1d716449209e4753a297c61a287077a1eb96e6d8 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -36,7 +36,7 @@ struct tc_action {
struct tcf_t tcfa_tm;
struct gnet_stats_basic_packed  tcfa_bstats;
struct gnet_stats_queue tcfa_qstats;
-   struct gnet_stats_rate_est64 tcfa_rate_est;
+   struct net_rate_estimator __rcu *tcfa_rate_est;
spinlock_t  tcfa_lock;
struct rcu_head tcfa_rcu;
struct gnet_stats_basic_cpu __percpu *cpu_bstats;
diff --git a/include/net/gen_stats.h b/include/net/gen_stats.h
index 231e121cc7d9c72075e7e6dde3655d631f64a1c4..8b7aa370e7a4af61fcb71ed751dba72ebead6143 100644
--- a/include/net/gen_stats.h
+++ b/include/net/gen_stats.h
@@ -11,6 +11,8 @@ struct gnet_stats_basic_cpu {
struct u64_stats_sync syncp;
 };
 
+struct net_rate_estimator;
+
 struct gnet_dump {
spinlock_t *  lock;
struct sk_buff *  skb;
@@ -42,8 +44,7 @@ void __gnet_stats_copy_basic(const seqcount_t *running,
 struct gnet_stats_basic_cpu __percpu *cpu,
 struct gnet_stats_basic_packed *b);
 int gnet_stats_copy_rate_est(struct gnet_dump *d,
-const struct gnet_stats_basic_packed *b,
-struct gnet_stats_rate_est64 *r);
+struct net_rate_estimator __rcu **ptr);
 int gnet_stats_copy_queue(struct gnet_dump *d,
  struct gnet_stats_queue __percpu *cpu_q,
  struct gnet_stats_queue *q, __u32 qlen);
@@ -53,16 +54,16 @@ int gnet_stats_finish_copy(struct gnet_dump *d);
 
 int gen_new_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **rate_est,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-void gen_kill_estimator(struct gnet_stats_basic_packed *bstats,
-   struct gnet_stats_rate_est64 *rate_est);
+void gen_kill_estimator(struct net_rate_estimator __rcu **ptr);
 int gen_replace_estimator(struct gnet_stats_basic_packed *bstats,
  struct gnet_stats_basic_cpu __percpu *cpu_bstats,
- struct gnet_stats_rate_est64 *rate_est,
+ struct net_rate_estimator __rcu **ptr,
  spinlock_t *stats_lock,
  seqcount_t *running, struct nlattr *opt);
-bool gen_estimator_active(const struct gnet_stats_basic_packed *bstats,
- const struct gnet_stats_rate_est64 *rate_est);
+bool gen_estimator_active(struct net_rate_estimator __rcu **ptr);
+bool gen_estimator_read(struct 

[PATCH 1/1] net: ethernet: broadcom: fix improper return value

2016-12-03 Thread Pan Bian
From: Pan Bian 

Macro BNX2X_ALLOC_AND_SET(arr, lbl, func) calls kmalloc() to allocate
memory, and jumps to label "lbl" if the allocation fails. Label "lbl"
first cleans up and then returns variable rc. Before the macro is
called, the value of rc is 0. Because 0 means no error, the callers of
bnx2x_init_firmware() may be misled. This patch fixes the bug by
assigning "-ENOMEM" to rc before calling BNX2X_ALLOC_AND_SET().
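The failure mode can be reproduced in a small userspace model. `ALLOC_AND_SET()`, `try_alloc()` and `init_tables()` are hypothetical stand-ins that only mimic a macro which allocates and jumps to a cleanup label on failure; they are not the bnx2x code. Because the macro never writes `rc`, the value `rc` holds at the `goto` is what the caller receives.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a BNX2X_ALLOC_AND_SET()-style macro:
 * evaluate an allocation and jump to lbl if it returned NULL. */
#define ALLOC_AND_SET(ptr, expr, lbl)	\
	do {				\
		(ptr) = (expr);		\
		if (!(ptr))		\
			goto lbl;	\
	} while (0)

/* Hypothetical allocator that can be told to fail, to exercise the error path. */
static void *try_alloc(size_t size, int fail)
{
	return fail ? NULL : malloc(size);
}

static int init_tables(int fail, int preset_rc)
{
	void *blob;
	int rc = 0;		/* 0 left over from an earlier successful step */

	if (preset_rc)
		rc = -ENOMEM;	/* the fix: an error code is in place before the jump */

	ALLOC_AND_SET(blob, try_alloc(64, fail), out);

	free(blob);		/* sketch only: real code would keep the buffer */
	return 0;
out:
	return rc;		/* a stale 0 here looks like success to the caller */
}
```

With `preset_rc` set to 0, a failed allocation is reported as success (return value 0), which is the situation the patch addresses.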

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=189141

Signed-off-by: Pan Bian 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 0cee4c0..6f9fc20 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -13505,6 +13505,7 @@ static int bnx2x_init_firmware(struct bnx2x *bp)
 
/* Initialize the pointers to the init arrays */
/* Blob */
+   rc = -ENOMEM;
BNX2X_ALLOC_AND_SET(init_data, request_firmware_exit, be32_to_cpu_n);
 
/* Opcodes */
-- 
1.9.1




[PATCH 1/1] net: ethernet: qlogic: fix improper return value

2016-12-03 Thread Pan Bian
From: Pan Bian 

When the call to qlcnic_alloc_mbx_args() fails, returning variable "err"
seems improper. Judging from the context, returning variable "config"
may be better.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=189101

Signed-off-by: Pan Bian 
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
index bdbcd2b..21c4aca 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
@@ -3189,7 +3189,7 @@ int qlcnic_83xx_test_link(struct qlcnic_adapter *adapter)
 
err = qlcnic_alloc_mbx_args(&cmd, adapter, QLCNIC_CMD_GET_LINK_STATUS);
if (err)
-   return err;
+   return config;
 
err = qlcnic_issue_cmd(adapter, &cmd);
if (err) {
-- 
1.9.1




[PATCH 1/1] net: ethernet: qlogic: set error code on failure

2016-12-03 Thread Pan Bian
From: Pan Bian 

When dma_mapping_error() is called, the return variable rc is still 0.
If the call reports a mapping failure, rc is not set to a negative
errno, so the function returns 0 on the error path and its callers
cannot detect the failure. This patch fixes the bug by assigning
"-ENOMEM" to rc.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=189041

Signed-off-by: Pan Bian 
---
 drivers/net/ethernet/qlogic/qed/qed_ll2.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_ll2.c b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
index f95385c..62ae55b 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_ll2.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_ll2.c
@@ -1730,6 +1730,7 @@ static int qed_ll2_start_xmit(struct qed_dev *cdev, struct sk_buff *skb)
   mapping))) {
DP_NOTICE(cdev,
  "Unable to map frag - dropping 
packet\n");
+   rc = -ENOMEM;
goto err;
}
} else {
-- 
1.9.1




Re: [PATCH v3 07/13] net: ethernet: ti: cpts: clean up event list if event pool is empty

2016-12-03 Thread kbuild test robot
Hi WingMan,

[auto build test ERROR on net/master]
[also build test ERROR on v4.9-rc7 next-20161202]
[cannot apply to net-next/master]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Grygorii-Strashko/net-ethernet-ti-cpts-switch-to-readl-writel_relaxed/20161204-010355
config: arm-omap2plus_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

Note: the linux-review/Grygorii-Strashko/net-ethernet-ti-cpts-switch-to-readl-writel_relaxed/20161204-010355 HEAD f938805dce662197c057c75ac67849f60da87c9f builds fine.
  It only hurts bisectibility.

All errors (new ones prefixed by >>):

   In file included from include/linux/dma-mapping.h:6:0,
from include/linux/skbuff.h:34,
from include/linux/ip.h:20,
from include/linux/ptp_classify.h:26,
from drivers/net/ethernet/ti/cpts.c:25:
   drivers/net/ethernet/ti/cpts.c: In function 'cpts_purge_events':
>> drivers/net/ethernet/ti/cpts.c:76:15: error: 'struct cpts' has no member named 'dev'
  dev_dbg(cpts->dev, "cpts: event pool cleaned up %d\n", removed);
  ^
   include/linux/device.h:1209:26: note: in definition of macro 'dev_dbg'
  dev_printk(KERN_DEBUG, dev, format, ##arg); \
 ^~~
   drivers/net/ethernet/ti/cpts.c: In function 'cpts_fifo_read':
   drivers/net/ethernet/ti/cpts.c:94:16: error: 'struct cpts' has no member named 'dev'
   dev_err(cpts->dev, "cpts: event pool empty\n");
   ^~

vim +76 drivers/net/ethernet/ti/cpts.c

19   */
20  #include 
21  #include 
22  #include 
23  #include 
24  #include 
  > 25  #include 
26  #include 
27  #include 
28  #include 
29  #include 
30  #include 
31  
32  #include "cpts.h"
33  
34  #define cpts_read32(c, r)   readl_relaxed(&c->reg->r)
35  #define cpts_write32(c, v, r)   writel_relaxed(v, &c->reg->r)
36  
37  static int event_expired(struct cpts_event *event)
38  {
39  return time_after(jiffies, event->tmo);
40  }
41  
42  static int event_type(struct cpts_event *event)
43  {
44  return (event->high >> EVENT_TYPE_SHIFT) & EVENT_TYPE_MASK;
45  }
46  
47  static int cpts_fifo_pop(struct cpts *cpts, u32 *high, u32 *low)
48  {
49  u32 r = cpts_read32(cpts, intstat_raw);
50  
51  if (r & TS_PEND_RAW) {
52  *high = cpts_read32(cpts, event_high);
53  *low  = cpts_read32(cpts, event_low);
54  cpts_write32(cpts, EVENT_POP, event_pop);
55  return 0;
56  }
57  return -1;
58  }
59  
60  static int cpts_purge_events(struct cpts *cpts)
61  {
62  struct list_head *this, *next;
63  struct cpts_event *event;
64  int removed = 0;
65  
66  list_for_each_safe(this, next, &cpts->events) {
67  event = list_entry(this, struct cpts_event, list);
68  if (event_expired(event)) {
69  list_del_init(&event->list);
70  list_add(&event->list, &cpts->pool);
71  ++removed;
72  }
73  }
74  
75  if (removed)
  > 76  dev_dbg(cpts->dev, "cpts: event pool cleaned up %d\n", removed);
77  return removed ? 0 : -1;
78  }
79  

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation




[PATCH 1/1] atm: fix improper return value

2016-12-03 Thread Pan Bian
From: Pan Bian 

It returns variable "error" when ioremap_nocache() returns a NULL
pointer. The value of "error" is 0 at that point, which misleads the
callers into believing that there is no error. This patch fixes the bug
by returning "-ENOMEM".

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=189021

Signed-off-by: Pan Bian 
---
 drivers/atm/eni.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/atm/eni.c b/drivers/atm/eni.c
index f2aaf9e..40c2d56 100644
--- a/drivers/atm/eni.c
+++ b/drivers/atm/eni.c
@@ -1727,7 +1727,7 @@ static int eni_do_init(struct atm_dev *dev)
printk("\n");
printk(KERN_ERR DEV_LABEL "(itf %d): can't set up page "
"mapping\n",dev->number);
-   return error;
+   return -ENOMEM;
}
eni_dev->ioaddr = base;
eni_dev->base_diff = real_base - (unsigned long) base;
-- 
1.9.1




Re: [net-next PATCH v4 1/6] net: virtio dynamically disable/enable LRO

2016-12-03 Thread Michael S. Tsirkin
On Fri, Dec 02, 2016 at 12:49:45PM -0800, John Fastabend wrote:
> This adds support for dynamically setting the LRO feature flag. The
> message to control guest features in the backend uses the
> CTRL_GUEST_OFFLOADS msg type.
> 
> Signed-off-by: John Fastabend 
> ---
>  drivers/net/virtio_net.c |   45 -
>  1 file changed, 44 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index a21d93a..d814e7cb 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1419,6 +1419,41 @@ static void virtnet_init_settings(struct net_device *dev)
>   .set_settings = virtnet_set_settings,
>  };
>  
> +static int virtnet_set_features(struct net_device *netdev,
> + netdev_features_t features)
> +{
> + struct virtnet_info *vi = netdev_priv(netdev);
> + struct virtio_device *vdev = vi->vdev;
> + struct scatterlist sg;
> + u64 offloads = 0;
> +
> + if (features & NETIF_F_LRO)
> + offloads |= (1 << VIRTIO_NET_F_GUEST_TSO4) |
> + (1 << VIRTIO_NET_F_GUEST_TSO6);
> +
> + if (features & NETIF_F_RXCSUM)
> + offloads |= (1 << VIRTIO_NET_F_GUEST_CSUM);
> +
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
> + sg_init_one(&sg, &offloads, sizeof(uint64_t));
> + if (!virtnet_send_command(vi,
> +   VIRTIO_NET_CTRL_GUEST_OFFLOADS,
> +   VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
> +   &sg)) {
> + dev_warn(>dev,
> +  "Failed to set guest offloads by virtnet command.\n");
> + return -EINVAL;
> + }
> + } else if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) &&
> +!virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {

No need for VIRTIO_F_VERSION_1 here.

> + dev_warn(>dev,
> +  "No support for setting offloads pre version_1.\n");
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
>  static const struct net_device_ops virtnet_netdev = {
>   .ndo_open= virtnet_open,
>   .ndo_stop= virtnet_close,
> @@ -1435,6 +1470,7 @@ static void virtnet_init_settings(struct net_device *dev)
>  #ifdef CONFIG_NET_RX_BUSY_POLL
>   .ndo_busy_poll  = virtnet_busy_poll,
>  #endif
> + .ndo_set_features   = virtnet_set_features,
>  };
>  
>  static void virtnet_config_changed_work(struct work_struct *work)
> @@ -1815,6 +1851,12 @@ static int virtnet_probe(struct virtio_device *vdev)
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_CSUM))
>   dev->features |= NETIF_F_RXCSUM;
>  
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) &&
> + virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6)) {
> + dev->features |= NETIF_F_LRO;
> + dev->hw_features |= NETIF_F_LRO;
> + }
> +
>   dev->vlan_features = dev->features;
>  
>   /* MTU range: 68 - 65535 */
> @@ -2057,7 +2099,8 @@ static int virtnet_restore(struct virtio_device *vdev)
>   VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
>   VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>   VIRTIO_NET_F_CTRL_MAC_ADDR, \
> - VIRTIO_NET_F_MTU
> + VIRTIO_NET_F_MTU, \
> + VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
>  
>  static unsigned int features[] = {
>   VIRTNET_FEATURES,

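A userspace sketch of how virtnet_set_features() in the quoted patch folds netdev features into the guest-offload bitmask. The bit numbers for VIRTIO_NET_F_GUEST_CSUM (1), VIRTIO_NET_F_GUEST_TSO4 (7) and VIRTIO_NET_F_GUEST_TSO6 (8) follow the virtio-net feature bits; `SK_F_LRO`, `SK_F_RXCSUM` and `guest_offloads()` are hypothetical stand-ins for NETIF_F_LRO, NETIF_F_RXCSUM and the kernel function. The sketch shifts a 64-bit constant (`1ULL`) so the mask would stay correct for feature bits above 31; the patch shifts a plain int, which works for these low bit numbers.

```c
#include <stdint.h>

/* virtio-net feature bit numbers. */
#define VIRTIO_NET_F_GUEST_CSUM  1
#define VIRTIO_NET_F_GUEST_TSO4  7
#define VIRTIO_NET_F_GUEST_TSO6  8

/* Hypothetical stand-ins for NETIF_F_LRO and NETIF_F_RXCSUM. */
#define SK_F_LRO     (1u << 0)
#define SK_F_RXCSUM  (1u << 1)

/* Map requested netdev features to the guest-offload bitmask that would be
 * sent with VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET. */
static uint64_t guest_offloads(unsigned int features)
{
	uint64_t offloads = 0;

	if (features & SK_F_LRO)	/* LRO maps to both TSO4 and TSO6 receive offloads */
		offloads |= (1ULL << VIRTIO_NET_F_GUEST_TSO4) |
			    (1ULL << VIRTIO_NET_F_GUEST_TSO6);

	if (features & SK_F_RXCSUM)
		offloads |= 1ULL << VIRTIO_NET_F_GUEST_CSUM;

	return offloads;
}
```

Turning LRO off therefore clears bits 7 and 8 while leaving the checksum bit untouched, which is what lets LRO be toggled independently of RXCSUM.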

[PATCH 1/1] net: irda: set error code on failures

2016-12-03 Thread Pan Bian
From: Pan Bian 

When the calls to kzalloc() fail, the value of return variable ret may
be 0. 0 means success in this context. This patch fixes the bug,
assigning "-ENOMEM" to ret before calling kzalloc().

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188971

Signed-off-by: Pan Bian 
---
 drivers/net/irda/irda-usb.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/irda/irda-usb.c b/drivers/net/irda/irda-usb.c
index a198946..8716b8c 100644
--- a/drivers/net/irda/irda-usb.c
+++ b/drivers/net/irda/irda-usb.c
@@ -1723,6 +1723,7 @@ static int irda_usb_probe(struct usb_interface *intf,
/* Don't change this buffer size and allocation without doing
 * some heavy and complete testing. Don't ask why :-(
 * Jean II */
+   ret = -ENOMEM;
self->speed_buff = kzalloc(IRDA_USB_SPEED_MTU, GFP_KERNEL);
if (!self->speed_buff)
goto err_out_3;
-- 
1.9.1




[PATCH 1/1] isdn: hisax: set error code on failure

2016-12-03 Thread Pan Bian
From: Pan Bian 

In function hfc4s8s_probe(), the value of return variable err should be
negative on failures. However, when the call to request_region() returns
NULL, the value of err is 0. This patch fixes the bug by assigning
"-ENOMEM" to err on the path where request_region() fails.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188931

Signed-off-by: Pan Bian 
---
 drivers/isdn/hisax/hfc4s8s_l1.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/isdn/hisax/hfc4s8s_l1.c b/drivers/isdn/hisax/hfc4s8s_l1.c
index 9600cd7..3172cee 100644
--- a/drivers/isdn/hisax/hfc4s8s_l1.c
+++ b/drivers/isdn/hisax/hfc4s8s_l1.c
@@ -1499,6 +1499,7 @@ struct hfc4s8s_l1 {
printk(KERN_INFO
   "HFC-4S/8S: failed to request address space at 0x%04x\n",
   hw->iobase);
+   err = -ENOMEM;
goto out;
}
 
-- 
1.9.1




[PATCH net v2] ipv6: Allow IPv4-mapped address as next-hop

2016-12-03 Thread Erik Nordmark
Made kernel accept IPv6 routes with IPv4-mapped address as next-hop.

It is possible to configure IP interfaces with IPv4-mapped addresses, and
one can add IPv6 routes for IPv4-mapped destinations/prefixes, yet prior
to this fix the kernel returned an EINVAL when attempting to add an IPv6
route with an IPv4-mapped address as a nexthop/gateway.

RFC 4798 (a proposed standard RFC) uses IPv4-mapped addresses as nexthops,
thus in order to support that type of address configuration the kernel
needs to allow IPv4-mapped addresses as nexthops.
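An IPv4-mapped address has the form ::ffff:a.b.c.d (80 zero bits, 16 one bits, then the IPv4 address). A minimal userspace sketch of the relaxed gateway test, assuming the standard `inet_pton()` and `IN6_IS_ADDR_V4MAPPED()` from the socket headers; `is_v4mapped()` and `gateway_ok()` are hypothetical, and `classified_unicast` only stands in for the kernel's IPV6_ADDR_UNICAST classification.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Returns 1 if the text form is an IPv4-mapped IPv6 address (::ffff:a.b.c.d),
 * 0 if it is a valid IPv6 address of another kind, -1 if it does not parse. */
static int is_v4mapped(const char *text)
{
	struct in6_addr a;

	if (inet_pton(AF_INET6, text, &a) != 1)
		return -1;
	return IN6_IS_ADDR_V4MAPPED(&a) ? 1 : 0;
}

/* Hypothetical model of the relaxed check: a gateway passes if it was
 * already classified as unicast, or if it is IPv4-mapped. */
static bool gateway_ok(const char *text, bool classified_unicast)
{
	return classified_unicast || is_v4mapped(text) == 1;
}
```

Before the change, only the `classified_unicast` half of the test existed, so a gateway such as ::ffff:192.0.2.1 was rejected with EINVAL.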

Signed-off-by: Erik Nordmark 
Signed-off-by: Bob Gilligan 
---
v2 honoring minimum 1000 ft vertical separation between Thunderbird and patches 
(fixed whitespace issues)

 net/ipv6/route.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1b57e11..86bdb02 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1995,8 +1995,11 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg)
   It is very good, but in some (rare!) circumstances
   (SIT, PtP, NBMA NOARP links) it is handy to allow
   some exceptions. --ANK
+  We allow IPv4-mapped nexthops to support RFC4798-type
+  addressing
 */
-   if (!(gwa_type & IPV6_ADDR_UNICAST))
+   if (!(gwa_type & (IPV6_ADDR_UNICAST |
+ IPV6_ADDR_MAPPED)))
goto out;
 
if (cfg->fc_table) {
-- 
1.8.1.4



Re: [PATCH 1/1] net: dcb: set error code on failures

2016-12-03 Thread David Miller
From: Pan Bian 
Date: Sat,  3 Dec 2016 21:49:08 +0800

> From: Pan Bian 
> 
> dcbnl_cee_fill() returns the value of variable err on errors.
> However, on some error paths (e.g. when an nla put fails), its value may
> be 0. It may be better to explicitly set a negative errno to variable
> err before returning.
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=11
> 
> Signed-off-by: Pan Bian 

Applied, thanks.


Re: [PATCH net-next] liquidio: 'imply' ptp instead of 'select'

2016-12-03 Thread David Miller
From: Arnd Bergmann 
Date: Sat,  3 Dec 2016 00:04:32 +0100

> ptp now depends on the optional POSIX_TIMERS setting and fails to build
> if we select it without that:
> 
> warning: (LIQUIDIO_VF && TI_CPTS) selects PTP_1588_CLOCK which has unmet direct dependencies (NET && POSIX_TIMERS)
> warning: (LIQUIDIO_VF && TI_CPTS) selects PTP_1588_CLOCK which has unmet direct dependencies (NET && POSIX_TIMERS)
> ERROR: "posix_clock_unregister" [drivers/ptp/ptp.ko] undefined!
> ERROR: "posix_clock_register" [drivers/ptp/ptp.ko] undefined!
> ERROR: "pps_unregister_source" [drivers/ptp/ptp.ko] undefined!
> ERROR: "pps_event" [drivers/ptp/ptp.ko] undefined!
> ERROR: "pps_register_source" [drivers/ptp/ptp.ko] undefined!
> 
> It seems that two patches have collided here, the build failure
> is a result of the combination. Changing the new option to 'imply'
> as well fixes it.
> 
> Fixes: 111fc64a237f ("liquidio CN23XX: VF registration")
> Fixes: d1cbfd771ce8 ("ptp_clock: Allow for it to be optional")
> Signed-off-by: Arnd Bergmann 

Like the kbuild robot, when I apply this it complains about 'imply' being
an unknown option.

I guess it worked for you because support for 'imply' exists in the -next
tree and gets pulled in from somewhere else.

In any event, as-is I cannot apply this.


Re: [PATCH net v3] tcp: warn on bogus MSS and try to amend it

2016-12-03 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Fri,  2 Dec 2016 20:51:51 -0200

> @@ -144,7 +144,21 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
>*/
>   len = skb_shinfo(skb)->gso_size ? : skb->len;
>   if (len >= icsk->icsk_ack.rcv_mss) {
> - icsk->icsk_ack.rcv_mss = len;
> + static bool __once __read_mostly;
> +
> + icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> +tcp_sk(sk)->advmss);
> + if (icsk->icsk_ack.rcv_mss != len && !__once) {
> + struct net_device *dev;
> +
> + __once = true;
> +
> + rcu_read_lock();
> + dev = dev_get_by_index_rcu(sock_net(sk), skb->skb_iif);
> + pr_warn_once("%s: Driver has suspect GRO implementation, TCP performance may be compromised.\n",
> +  dev ? dev->name : "Unknown driver");
> + rcu_read_unlock();
> + }

This is almost ready to go.

Since you are doing the 'once' logic by hand, using pr_warn_once() is
redundant.  And while you're at it, why not split this into a helper
function:

static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb)
{
	static bool __once __read_mostly;

	if (!__once) {
		struct net_device *dev;

		__once = true;

		rcu_read_lock();
		dev = dev_get_by_index_rcu(sock_net(sk), skb->skb_iif);
		pr_warn("%s: Driver has suspect GRO implementation, TCP performance may be compromised.\n",
			dev ? dev->name : "Unknown driver");
		rcu_read_unlock();
	}
}

And then call that when icsk->icsk_ack.rcv_mss != len; you can even
put an unlikely() around the condition as well.
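A userspace sketch of the logic being discussed, for readers following along: clamp the measured length to the advertised MSS and emit the driver warning at most once. The names `measure_rcv_mss()`, `gro_dev_warn()` and the `warn_count` counter are hypothetical; `printf()` stands in for `pr_warn()`, and there is no RCU or device lookup here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter so the "once" behaviour can be observed. */
static int warn_count;

/* Stand-in for the suggested tcp_gro_dev_warn(): print at most once. */
static void gro_dev_warn(const char *devname)
{
	static bool once;

	if (!once) {
		once = true;
		warn_count++;
		printf("%s: Driver has suspect GRO implementation, "
		       "TCP performance may be compromised.\n",
		       devname ? devname : "Unknown driver");
	}
}

/* Clamp the observed length to advmss and warn when clamping happens. */
static unsigned int measure_rcv_mss(unsigned int len, unsigned int advmss,
				    const char *devname)
{
	unsigned int rcv_mss = len < advmss ? len : advmss;

	/* the kernel version would wrap this test in unlikely() */
	if (rcv_mss != len)
		gro_dev_warn(devname);
	return rcv_mss;
}
```

Keeping the static flag inside the helper, as suggested, also keeps the hot path in tcp_measure_rcv_mss() down to a single comparison.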


Re: [PATCH net-next v5] ipv6 addrconf: Implemented enhanced DAD (RFC7527)

2016-12-03 Thread David Miller
From: Erik Nordmark 
Date: Fri,  2 Dec 2016 14:00:08 -0800

> Implemented RFC7527 Enhanced DAD.
> IPv6 duplicate address detection can fail if there is some temporary
> loopback of Ethernet frames. RFC7527 solves this by including a random
> nonce in the NS messages used for DAD, and if an NS is received with the
> same nonce it is assumed to be a looped back DAD probe and is ignored.
> RFC7527 is enabled by default. It can be disabled by setting both of
> conf/{all,interface}/enhanced_dad to zero.
> 
> Signed-off-by: Erik Nordmark 
> Signed-off-by: Bob Gilligan 
> Reviewed-by: Hannes Frederic Sowa 

Applied, thanks.
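A rough sketch of the loopback test RFC 7527 describes: the node remembers the random nonce it put in its own DAD NS and treats an incoming NS carrying that identical nonce as its own probe looped back, not as evidence of a duplicate address. The struct, the function names and the 6-byte nonce size are assumptions for illustration, not the kernel's addrconf code, and `rand()` is only a placeholder for a proper random source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define DAD_NONCE_LEN 6	/* assumed nonce size for this sketch */

struct dad_state {
	unsigned char nonce[DAD_NONCE_LEN];	/* nonce sent in our DAD NS */
	bool have_nonce;
};

/* Pick a fresh nonce before sending a DAD probe (rand() is a placeholder). */
static void dad_new_nonce(struct dad_state *st)
{
	for (int i = 0; i < DAD_NONCE_LEN; i++)
		st->nonce[i] = (unsigned char)(rand() & 0xff);
	st->have_nonce = true;
}

/*
 * Classify a received DAD NS: true if it is our own probe looped back
 * (same nonce, so ignore it), false if it should count as a conflict.
 */
static bool dad_ns_is_looped_back(const struct dad_state *st,
				  const unsigned char *rx_nonce, size_t rx_len)
{
	if (!st->have_nonce || rx_nonce == NULL || rx_len != DAD_NONCE_LEN)
		return false;
	return memcmp(st->nonce, rx_nonce, DAD_NONCE_LEN) == 0;
}
```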


[PATCH 1/1] net: caif: remove ineffective check

2016-12-03 Thread Pan Bian
The check of the return value of sock_register() is ineffective.
"if(!err)" seems to be a typo. It is better to propagate the error code
to the callers of caif_sktinit_module(). This patch removes the check
statement and directly returns the result of sock_register().

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188751
Signed-off-by: Pan Bian 
---
 net/caif/caif_socket.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index aa209b1..92cbbd2 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -1107,10 +1107,7 @@ static int caif_create(struct net *net, struct socket *sock, int protocol,
 
 static int __init caif_sktinit_module(void)
 {
-   int err = sock_register(_family_ops);
-   if (!err)
-   return err;
-   return 0;
+   return sock_register(_family_ops);
 }
 
 static void __exit caif_sktexit_module(void)
-- 
1.9.1
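To see why the removed check was ineffective, here is a small userspace model. `fake_sock_register()` is a hypothetical stand-in for sock_register(), and the two init functions mirror the old and new bodies of caif_sktinit_module(). With the old body every path returns 0, so a registration failure is silently lost.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for sock_register(); returns the injected result. */
static int fake_rc;

static int fake_sock_register(void)
{
	return fake_rc;
}

/* Old body: "if (!err) return err;" returns 0 on success, then 0 again. */
static int old_init(void)
{
	int err = fake_sock_register();

	if (!err)
		return err;
	return 0;	/* the error code is dropped here */
}

/* New body: propagate whatever sock_register() reported. */
static int new_init(void)
{
	return fake_sock_register();
}
```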




Re: [PATCHv2 net-next 0/4] MV88E6390 batch two

2016-12-03 Thread David Miller
From: Andrew Lunn 
Date: Sat,  3 Dec 2016 04:35:15 +0100

> This is the second batch of patches adding support for the
> MV88e6390. They are not sufficient to make it work properly.
> 
> The mv88e6390 has a much expanded set of priority maps. Refactor the
> existing code, and implement basic support for the new device.
> 
> Similarly, the monitor control register has been reworked.
> 
> The mv88e6390 has something odd in its EDSA tagging implementation,
> which means it is not possible to use it. So we need to use DSA
> tagging. This is the first device with EDSA support where we need to
> use DSA, and the code does not support this. So two patches refactor
> the existing code. The two different register definitions are
> separated out, and using DSA on an EDSA capable device is added.
 ...

Series applied.


Re: [PATCH net] geneve: avoid use-after-free of skb->data

2016-12-03 Thread David Miller
From: Sabrina Dubroca 
Date: Sat, 3 Dec 2016 01:33:26 +0100

> I'd like to try something based on static analysis. We'd need a way to
> tag cached pointers to skb->data (via ip_hdr() or whatever), and
> propagate the notion that pskb_expand_head() makes these cached
> pointers stale through layers of function calls.  I don't know how
> feasible this is with the tools we have.

Perhaps create helpers that have some special attribute attached to
them like "skb_volatile" or whatever.  ip_hdr() et al would go through
them.

Then the static analysis tool is told that pskb_expand_head() "kills"
all skb_volatile obtained values, and it could basically mark all such
variables as uninitialized.
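A userspace illustration of the bug class under discussion, using a toy buffer type: any pointer cached into the data area before a reallocation (the role pskb_expand_head() plays for skb->data) must be treated as dead afterwards, whereas an offset survives. `struct toybuf`, `toybuf_expand_head()` and `read_hdr_byte()` are hypothetical; this only shows the pattern a checker would enforce, not the proposed tooling.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toybuf {
	unsigned char *head;	/* start of allocation */
	unsigned char *data;	/* start of packet data */
	size_t len;		/* bytes of packet data */
};

/* Grow headroom by 'extra' bytes; like pskb_expand_head(), data may move. */
static int toybuf_expand_head(struct toybuf *b, size_t extra)
{
	size_t headroom = (size_t)(b->data - b->head);
	unsigned char *n = malloc(headroom + extra + b->len);

	if (!n)
		return -1;
	memcpy(n + headroom + extra, b->data, b->len);
	free(b->head);	/* every pointer into the old area is now stale */
	b->head = n;
	b->data = n + headroom + extra;
	return 0;
}

/* Safe pattern: keep an offset and re-derive the pointer after expanding. */
static unsigned char read_hdr_byte(const struct toybuf *b, size_t off)
{
	return b->data[off];
}
```

A cached `unsigned char *iph = b.data;` taken before toybuf_expand_head() would point into freed memory afterwards, which is exactly what an "skb_volatile" annotation would flag.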


Re: [net-next PATCH v4 1/6] net: virtio dynamically disable/enable LRO

2016-12-03 Thread David Miller
From: John Fastabend 
Date: Fri, 02 Dec 2016 12:49:45 -0800

> + if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS)) {
> + sg_init_one(, , sizeof(uint64_t));
> + if (!virtnet_send_command(vi,
> +   VIRTIO_NET_CTRL_GUEST_OFFLOADS,
> +   VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET,
> +   )) {
> + dev_warn(>dev,
>  "Failed to set guest offloads by virtnet command.\n");
> + return -EINVAL;
> + }
> + } else if (virtio_has_feature(vdev, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS) &&
> +!virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {

Hmmm, to me this reads as:

if (X) {
	...
} else if (X && ...) {
	...
}

I don't see how the second basic block can ever execute.  If the virtio
has the VIRTIO_NET_F_CTRL_GUEST_OFFLOADS feature, we will execute only
the first basic block.

Maybe I misunderstand the logic for whatever reason.


mlx5 search flow table

2016-12-03 Thread domingo montoya
Hello,

I was wondering if there is any way to find out which flow tables,
flow groups, and flow rules exist for the mlx5_core driver.


I am aware of the QUERY_* commands but I need to provide a valid
tableId or groupId to retrieve the information.

Let's say that, because of some bug, a flow table, flow group, or
flow rule gets orphaned, i.e., doesn't get deleted.

Is there any way to find that out, so as to explicitly delete
those rules, tables, and groups at a later point?

Thanks a lot!

Best Regards,
Domingo


Re: mlx5 VST and VGT mode at the same time

2016-12-03 Thread domingo montoya
Thanks a lot Mohamad. This is really helpful.

On Mon, Aug 22, 2016 at 6:39 PM, Mohamad Haj Yahia
 wrote:
> On Thu, Aug 18, 2016 at 12:41 PM, domingo montoya
>  wrote:
>> Hi All,
>>
>> Is there any way we can support both VST and VGT modes at the same time in mlx5?
>>
>> For e.g,
>>
>> If I send untagged packets from the VF, they should be tagged with the
>> VST vlan, and the vlan should be stripped from received packets.
>>
>> If I send tagged packets from the VF, they should be sent as is, with no
>> tag inserted, and the vlan tag should not be stripped from received
>> packets.
>>
>> Any way we can achieve this?
>>
>>
>> I understand that in the latest code these features are mutually exclusive.
>>
>> But if we have a requirement like this, any ideas on how to go about
>> implementing the same.
>>
>> A few observations:
>>
>> After going through the code, I figured out that for VST mode, we run
>> MODIFY_ESW_VPORT_CONTEXT and as part of this set the flag to strip the
>> vlan from the received packets. In case of VGT mode, because of this
>> command, the tags set by the VF driver also get stripped.
>>
>>
>>
>> Thanks a lot!
>>
>>
>> Best Regards,
>> Domingo
>
> Hi Domingo,
>
> Unfortunately there is a HW limitation that prevents VGT from working
> alongside VST on the same VF.
> Since the stripping feature is a global attribute for all of the VF's
> incoming vlans, if we enable both modes you will see that the VGT
> traffic vlan is also stripped, and thus it arrives at the VF
> untagged.
> Because of this limitation we block outgoing vlan-tagged traffic
> from a VF that is in VST mode, and we also drop incoming vlan-tagged
> packets targeting that VF with a vlan other than the VF vlan-id.
> The mutual exclusion of VGT and VST is enforced by the VF ACL ingress
> and egress flow tables.
>
> Thanks,
> Mohamad
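The enforcement described above can be modelled as two small filters. This is a hypothetical sketch of the policy only, with made-up names; the driver implements it with VF ACL flow tables, not C predicates, and this sketch does not filter untagged traffic towards the VF because the reply does not say how that case is handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct vf_vlan_cfg {
	bool vst;		/* VF is in VST mode (port VLAN configured) */
	uint16_t vst_vid;	/* VLAN ID inserted/stripped by the eswitch */
};

/* Traffic sent by the VF: in VST mode, already-tagged frames are dropped. */
static bool from_vf_allowed(const struct vf_vlan_cfg *cfg, bool tagged)
{
	if (!cfg->vst)
		return true;	/* VGT: the VF's own tags pass through */
	return !tagged;
}

/* Traffic towards the VF: in VST mode, tags other than vst_vid are dropped. */
static bool to_vf_allowed(const struct vf_vlan_cfg *cfg, bool tagged,
			  uint16_t vid)
{
	if (!cfg->vst || !tagged)
		return true;	/* untagged traffic is not filtered here */
	return vid == cfg->vst_vid;
}
```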


[no subject]

2016-12-03 Thread Bob Biloxi
subscribe linux-netdev


Re: [PATCH net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs

2016-12-03 Thread Martin KaFai Lau
On Fri, Dec 02, 2016 at 04:07:09PM -0800, Rick Jones wrote:
> On 12/02/2016 03:23 PM, Martin KaFai Lau wrote:
> >When XDP prog is attached, it is currently limiting
> >MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN) which is 1514
> >in x86.
> >
> >AFAICT, since mlx4 is doing one page per packet for XDP,
> >we can at least raise the MTU limitation up to
> >PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this patch is
> >doing.  It will be useful in the next patch which allows
> >XDP program to extend the packet by adding new header(s).
>
> Is mlx4 the only driver doing page-per-packet?
Sorry for the late reply.  This allocation scheme is only effective
when XDP is active.  AFAIK, only mlx4/5 supports XDP now.
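Working through the numbers in the quoted commit message, assuming FRAG_SZ0 evaluates to 1536 on x86 (which is what makes the old limit come out at the quoted 1514) and a 4 KiB PAGE_SIZE: the patch raises the XDP MTU ceiling from 1514 to 4074 bytes. The SK_* macros are local stand-ins for the kernel constants, not the kernel's definitions.

```c
#include <assert.h>

/* Assumed x86 values; FRAG_SZ0 = 1536 reproduces the quoted 1514. */
#define SK_FRAG_SZ0  1536
#define SK_PAGE_SIZE 4096
#define SK_ETH_HLEN  14
#define SK_VLAN_HLEN 4

/* Room left for the MTU after an Ethernet header and two VLAN tags. */
static int xdp_mtu_limit(int buf_size)
{
	return buf_size - SK_ETH_HLEN - 2 * SK_VLAN_HLEN;
}
```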


[PATCH v2 net-next 3/4] mlx4: xdp: Reserve headroom for receiving packet when XDP prog is active

2016-12-03 Thread Martin KaFai Lau
Reserve XDP_PACKET_HEADROOM and honor bpf_xdp_adjust_head()
when XDP prog is active.  This patch only affects the code
path when XDP is active.
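The MTU arithmetic behind this patch, as a worked sketch. It assumes 4 KiB pages, ETH_HLEN = 14, VLAN_HLEN = 4, ETH_FCS_LEN = 4 and XDP_PACKET_HEADROOM = 256; the values are hard-coded here for illustration, and the function names are hypothetical mirrors of the checks in the diff below. Reserving the headroom lowers the one-page-per-packet MTU ceiling from 4074 to 3818 bytes.

```c
#include <assert.h>
#include <stdbool.h>

/* Values hard-coded for illustration (4 KiB pages assumed). */
enum {
	SKETCH_PAGE_SIZE = 4096,
	SKETCH_ETH_HLEN = 14,
	SKETCH_VLAN_HLEN = 4,
	SKETCH_ETH_FCS_LEN = 4,
	SKETCH_XDP_HEADROOM = 256,
};

/* Largest MTU that fits one packet per page after reserving headroom. */
static int max_xdp_mtu(void)
{
	return SKETCH_PAGE_SIZE - SKETCH_ETH_HLEN - 2 * SKETCH_VLAN_HLEN -
	       SKETCH_XDP_HEADROOM;
}

/* Mirrors the two checks in mlx4_en_check_xdp_mtu() after this patch. */
static bool xdp_mtu_ok(int mtu, int dev_max_mtu)
{
	if (mtu + SKETCH_XDP_HEADROOM > dev_max_mtu)
		return false;	/* device cannot absorb the extra headroom */
	return mtu <= max_xdp_mtu();
}

/* Port MTU programmed in mlx4_en_start_port(): headroom only with XDP. */
static int port_mtu(int rx_skb_size, bool xdp_active)
{
	return rx_skb_size + SKETCH_ETH_FCS_LEN +
	       (xdp_active ? SKETCH_XDP_HEADROOM : 0);
}
```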

Signed-off-by: Martin KaFai Lau 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 17 +++--
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 23 +--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c |  9 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |  3 ++-
 4 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 311c14153b8b..094a13b52cf6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -51,7 +51,8 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
-#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
+#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) - \
+  XDP_PACKET_HEADROOM))
 
 int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
@@ -1551,6 +1552,7 @@ int mlx4_en_start_port(struct net_device *dev)
struct mlx4_en_tx_ring *tx_ring;
int rx_index = 0;
int err = 0;
+   int mtu;
int i, t;
int j;
u8 mc_list[16] = {0};
@@ -1684,8 +1686,12 @@ int mlx4_en_start_port(struct net_device *dev)
}
 
/* Configure port */
+   mtu = priv->rx_skb_size + ETH_FCS_LEN;
+   if (priv->tx_ring_num[TX_XDP])
+   mtu += XDP_PACKET_HEADROOM;
+
err = mlx4_SET_PORT_general(mdev->dev, priv->port,
-   priv->rx_skb_size + ETH_FCS_LEN,
+   mtu,
priv->prof->tx_pause,
priv->prof->tx_ppp,
priv->prof->rx_pause,
@@ -2255,6 +2261,13 @@ static bool mlx4_en_check_xdp_mtu(struct net_device *dev, int mtu)
 {
struct mlx4_en_priv *priv = netdev_priv(dev);
 
+   if (mtu + XDP_PACKET_HEADROOM > priv->max_mtu) {
+   en_err(priv,
+  "Device max mtu:%d does not allow %d bytes reserved headroom for XDP prog\n",
+  priv->max_mtu, XDP_PACKET_HEADROOM);
+   return false;
+   }
+
if (mtu > MLX4_EN_MAX_XDP_MTU) {
en_err(priv, "mtu:%d > max:%d when XDP prog is attached\n",
   mtu, MLX4_EN_MAX_XDP_MTU);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 23e9d04d1ef4..324771ac929e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -96,7 +96,6 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
struct mlx4_en_rx_alloc page_alloc[MLX4_EN_MAX_RX_FRAGS];
const struct mlx4_en_frag_info *frag_info;
struct page *page;
-   dma_addr_t dma;
int i;
 
for (i = 0; i < priv->num_frags; i++) {
@@ -115,9 +114,10 @@ static int mlx4_en_alloc_frags(struct mlx4_en_priv *priv,
 
for (i = 0; i < priv->num_frags; i++) {
frags[i] = ring_alloc[i];
-   dma = ring_alloc[i].dma + ring_alloc[i].page_offset;
+   frags[i].page_offset += priv->frag_info[i].rx_headroom;
+   rx_desc->data[i].addr = cpu_to_be64(frags[i].dma +
+   frags[i].page_offset);
ring_alloc[i] = page_alloc[i];
-   rx_desc->data[i].addr = cpu_to_be64(dma);
}
 
return 0;
@@ -250,7 +250,8 @@ static int mlx4_en_prepare_rx_desc(struct mlx4_en_priv *priv,
 
if (ring->page_cache.index > 0) {
frags[0] = ring->page_cache.buf[--ring->page_cache.index];
-   rx_desc->data[0].addr = cpu_to_be64(frags[0].dma);
+   rx_desc->data[0].addr = cpu_to_be64(frags[0].dma +
+   frags[0].page_offset);
return 0;
}
 
@@ -889,6 +890,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
if (xdp_prog) {
struct xdp_buff xdp;
dma_addr_t dma;
+   void *pg_addr, *orig_data;
u32 act;
 
dma = be64_to_cpu(rx_desc->data[0].addr);
@@ -896,11 +898,18 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
priv->frag_info[0].frag_size,
DMA_FROM_DEVICE);
 
-   xdp.data = page_address(frags[0].page) +
-   frags[0].page_offset;
+   pg_addr = page_address(frags[0].page);
+   orig_data = pg_addr + 

[PATCH v2 net-next 4/4] bpf: xdp: Add XDP example for head adjustment

2016-12-03 Thread Martin KaFai Lau
The XDP prog checks whether the incoming packet matches any VIP:PORT
combination in the BPF hashmap.  If it does, it encapsulates
the packet with an IPv4/v6 header as instructed by the value in
the BPF hashmap and then sends it back out with XDP_TX.

The VIP:PORT -> IP-Encap-Info mapping can be specified through the
command-line arguments of the user prog.
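Two pieces of arithmetic an encapsulating XDP program needs, sketched in plain C: the negative delta passed to bpf_xdp_adjust_head() to grow the head by one outer IPv4 (20-byte, no options) or IPv6 (40-byte) header, and the RFC 1071 one's-complement checksum for the new outer IPv4 header. This is not the code in xdp_tx_iptnl_kern.c; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OUTER_IPV4_HLEN 20	/* IPv4 header without options */
#define OUTER_IPV6_HLEN 40	/* fixed IPv6 header */

/* A negative delta moves data towards the page start, making room. */
static int encap_delta(int ipv6)
{
	return ipv6 ? -OUTER_IPV6_HLEN : -OUTER_IPV4_HLEN;
}

/* RFC 1071 checksum over raw header bytes, read as big-endian words. */
static uint16_t ipv4_hdr_csum(const uint8_t *hdr, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)(hdr[i] << 8 | hdr[i + 1]);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);	/* fold carries */
	return (uint16_t)~sum;
}
```

Computing the checksum with the check field zeroed gives the value to store; re-running it over the finished header yields 0.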

Acked-by: Alexei Starovoitov 
Signed-off-by: Martin KaFai Lau 
---
 samples/bpf/Makefile  |   4 +
 samples/bpf/bpf_helpers.h |   2 +
 samples/bpf/bpf_load.c|  94 ++
 samples/bpf/bpf_load.h|   1 +
 samples/bpf/xdp1_user.c   |  93 --
 samples/bpf/xdp_tx_iptnl_common.h |  37 ++
 samples/bpf/xdp_tx_iptnl_kern.c   | 232 ++
 samples/bpf/xdp_tx_iptnl_user.c   | 253 ++
 8 files changed, 623 insertions(+), 93 deletions(-)
 create mode 100644 samples/bpf/xdp_tx_iptnl_common.h
 create mode 100644 samples/bpf/xdp_tx_iptnl_kern.c
 create mode 100644 samples/bpf/xdp_tx_iptnl_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 00cd3081c038..f78e0ef6ff10 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -33,6 +33,7 @@ hostprogs-y += trace_event
 hostprogs-y += sampleip
 hostprogs-y += tc_l2_redirect
 hostprogs-y += lwt_len_hist
+hostprogs-y += xdp_tx_iptnl
 
 test_lru_dist-objs := test_lru_dist.o libbpf.o
 sock_example-objs := sock_example.o libbpf.o
@@ -67,6 +68,7 @@ trace_event-objs := bpf_load.o libbpf.o trace_event_user.o
 sampleip-objs := bpf_load.o libbpf.o sampleip_user.o
 tc_l2_redirect-objs := bpf_load.o libbpf.o tc_l2_redirect_user.o
 lwt_len_hist-objs := bpf_load.o libbpf.o lwt_len_hist_user.o
+xdp_tx_iptnl-objs := bpf_load.o libbpf.o xdp_tx_iptnl_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -99,6 +101,7 @@ always += test_current_task_under_cgroup_kern.o
 always += trace_event_kern.o
 always += sampleip_kern.o
 always += lwt_len_hist_kern.o
+always += xdp_tx_iptnl_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/testing/selftests/bpf/
@@ -129,6 +132,7 @@ HOSTLOADLIBES_trace_event += -lelf
 HOSTLOADLIBES_sampleip += -lelf
 HOSTLOADLIBES_tc_l2_redirect += -l elf
 HOSTLOADLIBES_lwt_len_hist += -l elf
+HOSTLOADLIBES_xdp_tx_iptnl += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
 #  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 8370a6e3839d..faaffe2e139a 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -57,6 +57,8 @@ static int (*bpf_skb_set_tunnel_opt)(void *ctx, void *md, int size) =
(void *) BPF_FUNC_skb_set_tunnel_opt;
 static unsigned long long (*bpf_get_prandom_u32)(void) =
(void *) BPF_FUNC_get_prandom_u32;
+static int (*bpf_xdp_adjust_head)(void *ctx, int offset) =
+   (void *) BPF_FUNC_xdp_adjust_head;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 49b45ccbe153..e30b6de94f2e 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -12,6 +12,10 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -450,3 +454,93 @@ struct ksym *ksym_search(long key)
/* out of range. return _stext */
return [0];
 }
+
+int set_link_xdp_fd(int ifindex, int fd)
+{
+   struct sockaddr_nl sa;
+   int sock, seq = 0, len, ret = -1;
+   char buf[4096];
+   struct nlattr *nla, *nla_xdp;
+   struct {
+   struct nlmsghdr  nh;
+   struct ifinfomsg ifinfo;
+   char attrbuf[64];
+   } req;
+   struct nlmsghdr *nh;
+   struct nlmsgerr *err;
+
+   memset(, 0, sizeof(sa));
+   sa.nl_family = AF_NETLINK;
+
+   sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
+   if (sock < 0) {
+   printf("open netlink socket: %s\n", strerror(errno));
+   return -1;
+   }
+
+   if (bind(sock, (struct sockaddr *), sizeof(sa)) < 0) {
+   printf("bind to netlink: %s\n", strerror(errno));
+   goto cleanup;
+   }
+
+   memset(, 0, sizeof(req));
+   req.nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg));
+   req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
+   req.nh.nlmsg_type = RTM_SETLINK;
+   req.nh.nlmsg_pid = 0;
+   req.nh.nlmsg_seq = ++seq;
+   req.ifinfo.ifi_family = AF_UNSPEC;
+   req.ifinfo.ifi_index = ifindex;
+   nla = (struct nlattr *)(((char *))
+   + NLMSG_ALIGN(req.nh.nlmsg_len));
+   nla->nla_type = NLA_F_NESTED | 43/*IFLA_XDP*/;
+
+   nla_xdp = (struct nlattr *)((char *)nla + NLA_HDRLEN);
+   nla_xdp->nla_type = 
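The message this helper is building, an RTM_SETLINK request whose nested IFLA_XDP attribute carries the program fd, can be checked in isolation. The sketch below only fills a buffer using the uapi macros from <linux/netlink.h> and <linux/rtnetlink.h> and never opens a socket. build_setlink_xdp() is a hypothetical name, and nesting IFLA_XDP_FD inside IFLA_XDP is an assumption, since the hunk above is cut off before the inner attribute type is set.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_link.h>

/*
 * Lay out RTM_SETLINK + nested IFLA_XDP { IFLA_XDP_FD } in buf.
 * Returns the total nlmsg_len, or -1 if buf is too small.
 */
static int build_setlink_xdp(void *buf, size_t cap, int ifindex, int fd)
{
	struct nlmsghdr *nh = buf;
	struct ifinfomsg *ifi;
	struct nlattr *nla, *nla_fd;
	size_t need = NLMSG_LENGTH(sizeof(*ifi)) + 2 * NLA_HDRLEN + sizeof(int);

	if (cap < NLMSG_ALIGN(need))
		return -1;
	memset(buf, 0, NLMSG_ALIGN(need));

	nh->nlmsg_len = NLMSG_LENGTH(sizeof(*ifi));
	nh->nlmsg_type = RTM_SETLINK;
	nh->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
	ifi = NLMSG_DATA(nh);
	ifi->ifi_family = AF_UNSPEC;
	ifi->ifi_index = ifindex;

	/* Outer attribute: IFLA_XDP, flagged as nested. */
	nla = (struct nlattr *)((char *)nh + NLMSG_ALIGN(nh->nlmsg_len));
	nla->nla_type = NLA_F_NESTED | IFLA_XDP;

	/* Inner attribute: IFLA_XDP_FD carrying the program fd. */
	nla_fd = (struct nlattr *)((char *)nla + NLA_HDRLEN);
	nla_fd->nla_type = IFLA_XDP_FD;
	nla_fd->nla_len = NLA_HDRLEN + sizeof(int);
	memcpy((char *)nla_fd + NLA_HDRLEN, &fd, sizeof(fd));

	nla->nla_len = NLA_HDRLEN + nla_fd->nla_len;
	nh->nlmsg_len += nla->nla_len;
	return (int)nh->nlmsg_len;
}
```

With a 16-byte ifinfomsg the request comes to 32 + 12 = 44 bytes; the patch's set_link_xdp_fd() goes on to send such a request over a NETLINK_ROUTE socket.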

[PATCH v2 net-next 0/4]: Allow head adjustment in XDP prog

2016-12-03 Thread Martin KaFai Lau
This series adds a helper to allow head adjustment in an XDP prog.  The
mlx4 driver has been modified to support this feature.  An example is
written to encapsulate a packet with an IPv4/v6 header and then XDP_TX it
out.

v2:
1. Make a variable name change in bpf_xdp_adjust_head() in patch 1
2. Ensure no less than ETH_HLEN data in bpf_xdp_adjust_head() in patch 1
3. Some clarifications in commit log messages of patch 2 and 3

Thanks,
--Martin
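A userspace model of the bounds check implied by v2 change 2: moving `data` by `delta` must leave at least ETH_HLEN bytes before `data_end`. The lower bound (not moving in front of the start of the headroom) is an assumption added for completeness. `struct xdp_model` and `xdp_adjust_head_model()` are hypothetical; offsets are compared instead of raw pointers to avoid out-of-bounds pointer arithmetic.

```c
#include <assert.h>
#include <errno.h>

#define SK_ETH_HLEN 14

struct xdp_model {
	unsigned char *data_hard_start;	/* first usable byte (start of headroom) */
	unsigned char *data;		/* current packet start */
	unsigned char *data_end;	/* one past the last packet byte */
};

/* A negative delta grows the head (push a header); a positive one trims it. */
static int xdp_adjust_head_model(struct xdp_model *x, int delta)
{
	long off = (long)(x->data - x->data_hard_start) + delta;
	long end = (long)(x->data_end - x->data_hard_start);

	if (off < 0 || off > end - SK_ETH_HLEN)
		return -EINVAL;
	x->data = x->data_hard_start + off;
	return 0;
}
```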



[PATCH v2 net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog

2016-12-03 Thread Martin KaFai Lau
This patch allows XDP prog to extend/remove the packet
data at the head (like adding or removing header).  It is
done by adding a new XDP helper bpf_xdp_adjust_head().

It also renames bpf_helper_changes_skb_data() to
bpf_helper_changes_pkt_data() to better reflect
that XDP prog does not work on skb.

Acked-by: Alexei Starovoitov 
Signed-off-by: Martin KaFai Lau 
---
 arch/powerpc/net/bpf_jit_comp64.c |  4 ++--
 arch/s390/net/bpf_jit_comp.c  |  2 +-
 arch/x86/net/bpf_jit_comp.c   |  2 +-
 include/linux/filter.h|  2 +-
 include/uapi/linux/bpf.h  | 11 ++-
 kernel/bpf/core.c |  2 +-
 kernel/bpf/verifier.c |  2 +-
 net/core/filter.c | 34 --
 8 files changed, 49 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp64.c b/arch/powerpc/net/bpf_jit_comp64.c
index 0fe98a567125..73a5cf18fd84 100644
--- a/arch/powerpc/net/bpf_jit_comp64.c
+++ b/arch/powerpc/net/bpf_jit_comp64.c
@@ -766,7 +766,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
func = (u8 *) __bpf_call_base + imm;
 
/* Save skb pointer if we need to re-cache skb data */
-   if (bpf_helper_changes_skb_data(func))
+   if (bpf_helper_changes_pkt_data(func))
PPC_BPF_STL(3, 1, bpf_jit_stack_local(ctx));
 
bpf_jit_emit_func_call(image, ctx, (u64)func);
@@ -775,7 +775,7 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 *image,
PPC_MR(b2p[BPF_REG_0], 3);
 
/* refresh skb cache */
-   if (bpf_helper_changes_skb_data(func)) {
+   if (bpf_helper_changes_pkt_data(func)) {
/* reload skb pointer to r3 */
PPC_BPF_LL(3, 1, bpf_jit_stack_local(ctx));
bpf_jit_emit_skb_loads(image, ctx);
diff --git a/arch/s390/net/bpf_jit_comp.c b/arch/s390/net/bpf_jit_comp.c
index bee281f3163d..167b31b186c1 100644
--- a/arch/s390/net/bpf_jit_comp.c
+++ b/arch/s390/net/bpf_jit_comp.c
@@ -981,7 +981,7 @@ static noinline int bpf_jit_insn(struct bpf_jit *jit, struct bpf_prog *fp, int i
EMIT2(0x0d00, REG_14, REG_W1);
/* lgr %b0,%r2: load return value into %b0 */
EMIT4(0xb904, BPF_REG_0, REG_2);
-   if (bpf_helper_changes_skb_data((void *)func)) {
+   if (bpf_helper_changes_pkt_data((void *)func)) {
jit->seen |= SEEN_SKB_CHANGE;
/* lg %b1,ST_OFF_SKBP(%r15) */
EMIT6_DISP_LH(0xe300, 0x0004, BPF_REG_1, REG_0,
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index fe04a04dab8e..e76d1af60f7a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -853,7 +853,7 @@ xadd:   if (is_imm8(insn->off))
func = (u8 *) __bpf_call_base + imm32;
jmp_offset = func - (image + addrs[i]);
if (seen_ld_abs) {
-   reload_skb_data = bpf_helper_changes_skb_data(func);
+   reload_skb_data = bpf_helper_changes_pkt_data(func);
if (reload_skb_data) {
EMIT1(0x57); /* push %rdi */
jmp_offset += 22; /* pop, mov, sub, mov */
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 97338134398f..3c02de77ad6a 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -590,7 +590,7 @@ void sk_filter_uncharge(struct sock *sk, struct sk_filter *fp);
 u64 __bpf_call_base(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
 struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog);
-bool bpf_helper_changes_skb_data(void *func);
+bool bpf_helper_changes_pkt_data(void *func);
 
 struct bpf_prog *bpf_patch_insn_single(struct bpf_prog *prog, u32 off,
   const struct bpf_insn *patch, u32 len);
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6123d9b8e828..0eb0e87dbe9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -424,6 +424,12 @@ union bpf_attr {
  * @len: length of header to be pushed in front
  * @flags: Flags (unused for now)
  * Return: 0 on success or negative error
+ *
+ * int bpf_xdp_adjust_head(xdp_md, delta)
+ * Adjust the xdp_md.data by delta
+ * @xdp_md: pointer to xdp_md
+ * @delta: An positive/negative integer to be added to xdp_md.data
+ * Return: 0 on success or negative on error
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -469,7 +475,8 @@ union bpf_attr {
FN(csum_update),

[PATCH v2 net-next 2/4] mlx4: xdp: Allow raising MTU up to one page minus eth and vlan hdrs

2016-12-03 Thread Martin KaFai Lau
When XDP is active in mlx4, mlx4 is using one page/pkt.
At the same time (i.e. when XDP is active), it is currently
limiting MTU to be FRAG_SZ0 - ETH_HLEN - (2 * VLAN_HLEN)
which is 1514 in x86.  AFAICT, we can at least raise the MTU
limit up to PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN) which this
patch is doing.  It will be useful in the next patch which
allows XDP program to extend the packet by adding new header(s).

Note: In the earlier XDP patches, there is already an existing guard
to ensure the page/pkt scheme only applies when XDP is active
in mlx4.

Signed-off-by: Martin KaFai Lau 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 28 +++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 46 ++
 2 files changed, 44 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 49a81f1fc1d6..311c14153b8b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -51,6 +51,8 @@
 #include "mlx4_en.h"
 #include "en_port.h"
 
+#define MLX4_EN_MAX_XDP_MTU ((int)(PAGE_SIZE - ETH_HLEN - (2 * VLAN_HLEN)))
+
 int mlx4_en_setup_tc(struct net_device *dev, u8 up)
 {
struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -2249,6 +2251,19 @@ void mlx4_en_destroy_netdev(struct net_device *dev)
free_netdev(dev);
 }
 
+static bool mlx4_en_check_xdp_mtu(struct net_device *dev, int mtu)
+{
+   struct mlx4_en_priv *priv = netdev_priv(dev);
+
+   if (mtu > MLX4_EN_MAX_XDP_MTU) {
+   en_err(priv, "mtu:%d > max:%d when XDP prog is attached\n",
+  mtu, MLX4_EN_MAX_XDP_MTU);
+   return false;
+   }
+
+   return true;
+}
+
 static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
 {
struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -2258,11 +2273,10 @@ static int mlx4_en_change_mtu(struct net_device *dev, int new_mtu)
en_dbg(DRV, priv, "Change MTU called - current:%d new:%d\n",
 dev->mtu, new_mtu);
 
-   if (priv->tx_ring_num[TX_XDP] && MLX4_EN_EFF_MTU(new_mtu) > FRAG_SZ0) {
-   en_err(priv, "MTU size:%d requires frags but XDP running\n",
-  new_mtu);
-   return -EOPNOTSUPP;
-   }
+   if (priv->tx_ring_num[TX_XDP] &&
+   !mlx4_en_check_xdp_mtu(dev, new_mtu))
+   return -ENOTSUPP;
+
dev->mtu = new_mtu;
 
if (netif_running(dev)) {
@@ -2710,10 +2724,8 @@ static int mlx4_xdp_set(struct net_device *dev, struct bpf_prog *prog)
return 0;
}
 
-   if (priv->num_frags > 1) {
-   en_err(priv, "Cannot set XDP if MTU requires multiple frags\n");
+   if (!mlx4_en_check_xdp_mtu(dev, dev->mtu))
return -EOPNOTSUPP;
-   }
 
tmp = kzalloc(sizeof(*tmp), GFP_KERNEL);
if (!tmp)
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 6562f78b07f4..23e9d04d1ef4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -1164,37 +1164,39 @@ static const int frag_sizes[] = {
 
 void mlx4_en_calc_rx_buf(struct net_device *dev)
 {
-   enum dma_data_direction dma_dir = PCI_DMA_FROMDEVICE;
struct mlx4_en_priv *priv = netdev_priv(dev);
int eff_mtu = MLX4_EN_EFF_MTU(dev->mtu);
-   int order = MLX4_EN_ALLOC_PREFER_ORDER;
-   u32 align = SMP_CACHE_BYTES;
-   int buf_size = 0;
int i = 0;
 
/* bpf requires buffers to be set up as 1 packet per page.
 * This only works when num_frags == 1.
 */
if (priv->tx_ring_num[TX_XDP]) {
-   dma_dir = PCI_DMA_BIDIRECTIONAL;
-   /* This will gain efficient xdp frame recycling at the expense
-* of more costly truesize accounting
+   priv->frag_info[0].order = 0;
+   priv->frag_info[0].frag_size = eff_mtu;
+   priv->frag_info[0].frag_prefix_size = 0;
+   /* This will gain efficient xdp frame recycling at the
+* expense of more costly truesize accounting
 */
-   align = PAGE_SIZE;
-   order = 0;
-   }
-
-   while (buf_size < eff_mtu) {
-   priv->frag_info[i].order = order;
-   priv->frag_info[i].frag_size =
-   (eff_mtu > buf_size + frag_sizes[i]) ?
-   frag_sizes[i] : eff_mtu - buf_size;
-   priv->frag_info[i].frag_prefix_size = buf_size;
-   priv->frag_info[i].frag_stride =
-   ALIGN(priv->frag_info[i].frag_size, align);
-   priv->frag_info[i].dma_dir = dma_dir;
-   buf_size += priv->frag_info[i].frag_size;
-   i++;
+   priv->frag_info[0].frag_stride = PAGE_SIZE;
+

Re: [PATCH v3 net-next 1/2] net: ethernet: slicoss: add slicoss gigabit ethernet driver

2016-12-03 Thread kbuild test robot
Hi Lino,

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Lino-Sanfilippo/net-ethernet-slicoss-add-slicoss-gigabit-ethernet-driver/20161126-202438
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=sparc64 

All errors (new ones prefixed by >>):

   drivers/staging/slicoss/slicoss.c: In function 'slic_cmdq_addcmdpage':
>> drivers/staging/slicoss/slicoss.c:1258:14: error: implicit declaration of function 'virt_to_bus' [-Werror=implicit-function-declaration]
 phys_addr = virt_to_bus((void *)page);
 ^~~
   cc1: some warnings being treated as errors

vim +/virt_to_bus +1258 drivers/staging/slicoss/slicoss.c

4d6ea9c3 Denis Kirjanov 2010-07-10  1242  	struct slic_hostcmd *cmd;
4d6ea9c3 Denis Kirjanov 2010-07-10  1243  	struct slic_hostcmd *prev;
4d6ea9c3 Denis Kirjanov 2010-07-10  1244  	struct slic_hostcmd *tail;
4d6ea9c3 Denis Kirjanov 2010-07-10  1245  	struct slic_cmdqueue *cmdq;
4d6ea9c3 Denis Kirjanov 2010-07-10  1246  	int cmdcnt;
4d6ea9c3 Denis Kirjanov 2010-07-10  1247  	void *cmdaddr;
4d6ea9c3 Denis Kirjanov 2010-07-10  1248  	ulong phys_addr;
4d6ea9c3 Denis Kirjanov 2010-07-10  1249  	u32 phys_addrl;
4d6ea9c3 Denis Kirjanov 2010-07-10  1250  	u32 phys_addrh;
4d6ea9c3 Denis Kirjanov 2010-07-10  1251  	struct slic_handle *pslic_handle;
eafe6002 David Matlack  2015-05-11  1252  	unsigned long flags;
4d6f6af8 Greg Kroah-Hartman 2008-03-19  1253  
4d6ea9c3 Denis Kirjanov 2010-07-10  1254  	cmdaddr = page;
dd146d21 Shraddha Barke 2015-10-15  1255  	cmd = cmdaddr;
4d6ea9c3 Denis Kirjanov 2010-07-10  1256  	cmdcnt = 0;
4d6f6af8 Greg Kroah-Hartman 2008-03-19  1257  
4d6ea9c3 Denis Kirjanov 2010-07-10 @1258  	phys_addr = virt_to_bus((void *)page);
4d6ea9c3 Denis Kirjanov 2010-07-10  1259  	phys_addrl = SLIC_GET_ADDR_LOW(phys_addr);
4d6ea9c3 Denis Kirjanov 2010-07-10  1260  	phys_addrh = SLIC_GET_ADDR_HIGH(phys_addr);
4d6f6af8 Greg Kroah-Hartman 2008-03-19  1261  
4d6ea9c3 Denis Kirjanov 2010-07-10  1262  	prev = NULL;
4d6ea9c3 Denis Kirjanov 2010-07-10  1263  	tail = cmd;
4d6ea9c3 Denis Kirjanov 2010-07-10  1264  	while ((cmdcnt < SLIC_CMDQ_CMDSINPAGE) &&
4d6ea9c3 Denis Kirjanov 2010-07-10  1265  	       (adapter->slic_handle_ix < 256)) {
4d6ea9c3 Denis Kirjanov 2010-07-10  1266  		/* Allocate and initialize a SLIC_HANDLE for this command */

:: The code at line 1258 was first introduced by commit
:: 4d6ea9c3223da8d8dc91b369087fa40cc53edd36 Staging: slicoss: kill functions prototypes and reorder functions

:: TO: Denis Kirjanov 
:: CC: Greg Kroah-Hartman 

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc

2016-12-03 Thread Eric Dumazet
On Sat, 2016-12-03 at 19:16 -0500, David Miller wrote:
> From: Eric Dumazet 
> Date: Sat,  3 Dec 2016 11:14:57 -0800
> 
> > diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> > index d8be083ab0b0..fc5848dad7a4 100644
> > --- a/include/linux/tcp.h
> > +++ b/include/linux/tcp.h
> > @@ -186,7 +186,6 @@ struct tcp_sock {
> > u32 tsoffset;   /* timestamp offset */
> >  
> > struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
> > -   unsigned long   tsq_flags;
> >  
> > /* Data for direct copy to user */
> > struct {
> 
> Hmmm, did you forget to "git add include/net/sock.h" before making
> this commit?

sk_tsq_flags was added in a prior patch in the series (7/8 "net:
reorganize struct sock for better data locality").

What is the problem with this part ?

Thanks




Re: [patch net-next v4 00/10] ipv4: fib: Replay events when registering FIB notifier

2016-12-03 Thread David Miller
From: Jiri Pirko 
Date: Sat,  3 Dec 2016 16:44:57 +0100

> Ido says:
> 
> In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
> by a new FIB notification chain to which modules could register in order
> to be notified about the addition and deletion of FIB entries. The
> motivation for this change was that switchdev drivers need to be able to
> reflect the entire FIB table and not only FIBs configured on top of the
> port netdevs themselves. This is useful in case of in-band management.
> 
> The fundamental problem with this approach is that upon registration
> listeners lose all the information previously sent in the chain and
> thus have an incomplete view of the FIB tables, which can result in
> packet loss. This patchset fixes that by dumping the FIB tables and
> replaying notifications previously sent in the chain for the registered
> notification block.
> 
> The entire dump process is done under RCU and thus the FIB notification
> chain is converted to be atomic. The listeners are modified accordingly.
> This is done in the first eight patches.
> 
> The ninth patch adds a change sequence counter to ensure the integrity
> of the FIB dump. The last patch adds the dump itself to the FIB chain
> registration function and modifies existing listeners to pass a callback
> to be executed in case dump was inconsistent.
 ...

Series applied, thanks.
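The dump-and-verify idea in the cover letter can be sketched as a sequence-counter check: snapshot a change counter, replay every existing entry to the new listener, then compare the counter again and invoke a caller-supplied callback if anything changed mid-dump. All names here are hypothetical; the kernel performs the dump under RCU rather than with the plain loop shown, and `bump_seq_during_dump()` merely simulates a concurrent change for demonstration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fib_model {
	unsigned int seq;	/* bumped on every add/delete */
	int entries[16];
	size_t n;
};

typedef void (*replay_fn)(int entry, void *ctx);
typedef void (*inconsistent_fn)(void *ctx);

static void fib_add(struct fib_model *f, int entry)
{
	if (f->n < sizeof(f->entries) / sizeof(f->entries[0])) {
		f->entries[f->n++] = entry;
		f->seq++;
	}
}

/* Replay current entries; return false (and call bad_cb) if seq moved. */
static bool fib_register_and_dump(struct fib_model *f, replay_fn cb,
				  inconsistent_fn bad_cb, void *ctx)
{
	unsigned int start = f->seq;

	for (size_t i = 0; i < f->n; i++)
		cb(f->entries[i], ctx);
	if (f->seq != start) {
		if (bad_cb)
			bad_cb(ctx);
		return false;
	}
	return true;
}

/* Example listener: counts replayed entries (ctx points at an int). */
static void count_entry(int entry, void *ctx)
{
	(void)entry;
	(*(int *)ctx)++;
}

/* Demo hook: simulates a FIB change racing with the dump. */
static void bump_seq_during_dump(int entry, void *ctx)
{
	(void)entry;
	((struct fib_model *)ctx)->seq++;
}
```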


Re: [PATCH 1/3] uapi: export tc tunnel key file

2016-12-03 Thread David Miller
From: Stephen Hemminger 
Date: Fri,  2 Dec 2016 14:53:58 -0800

> Fixes commit 21609ae32aaf6c6fab0e ("net/sched: Introduce act_tunnel_key")
> The file is necessary for iproute2 headers but was not being
> copied by make install_headers
> 
> Signed-off-by: Stephen Hemminger 

This seems to already be fixed.


Re: [PATCH 2/3] uapi: export tc_skbmod.h

2016-12-03 Thread David Miller
From: Stephen Hemminger 
Date: Fri,  2 Dec 2016 14:53:59 -0800

> Fixes commit 735cffe5d800 ("net_sched: Introduce skbmod action")
> Not used by iproute2 but maybe in future.
> 
> Signed-off-by: Stephen Hemminger 

Applied.


Re: [Patch net-next] act_mirred: fix a typo in get_dev

2016-12-03 Thread David Miller
From: Eric Dumazet 
Date: Sat, 03 Dec 2016 10:59:18 -0800

> On Sat, 2016-12-03 at 10:36 -0800, Cong Wang wrote:
>> Cc: Hadar Hen Zion 
>> Cc: Jiri Pirko 
>> Signed-off-by: Cong Wang 
>> ---
>>  net/sched/act_mirred.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
>> index bb09ba3..2d9fa6e 100644
>> --- a/net/sched/act_mirred.c
>> +++ b/net/sched/act_mirred.c
>> @@ -321,7 +321,7 @@ static int tcf_mirred_device(const struct tc_action *a, struct net *net,
>>  int ifindex = tcf_mirred_ifindex(a);
>>  
>>  *mirred_dev = __dev_get_by_index(net, ifindex);
>> -if (!mirred_dev)
>> +if (!*mirred_dev)
>>  return -EINVAL;
>>  return 0;
>>  }
> 
> Fixes: 255cb30425c0 ("net/sched: act_mirred: Add new tc_action_ops get_dev()")
> Acked-by: Eric Dumazet 

Applied.
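This one-character fix is easy to miss, so here is a minimal userspace sketch of the same out-parameter pattern. `lookup_dev()` and `struct dev_sketch` are hypothetical stand-ins for `__dev_get_by_index()` and `struct net_device`. The parameter `mirred_dev` is a `struct net_device **` that always points at the caller's variable, so `!mirred_dev` is never true. Only `!*mirred_dev` detects a failed lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dev_sketch { int ifindex; };

static struct dev_sketch devices[] = { { 1 }, { 2 } };

/* Hypothetical stand-in for __dev_get_by_index(): NULL if not found. */
struct dev_sketch *lookup_dev(int ifindex)
{
	for (size_t i = 0; i < sizeof(devices) / sizeof(devices[0]); i++)
		if (devices[i].ifindex == ifindex)
			return &devices[i];
	return NULL;
}

/* Buggy variant: tests the out-parameter itself, which is never NULL. */
int get_dev_buggy(int ifindex, struct dev_sketch **out)
{
	*out = lookup_dev(ifindex);
	if (!out)
		return -EINVAL;
	return 0;
}

/* Fixed variant: tests the pointer that was stored through it. */
int get_dev_fixed(int ifindex, struct dev_sketch **out)
{
	*out = lookup_dev(ifindex);
	if (!*out)
		return -EINVAL;
	return 0;
}
```

With an unknown ifindex, the buggy variant reports success and leaves the caller holding a NULL device pointer.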


Re: [PATCH 3/3] uapi: export nf_log.h

2016-12-03 Thread David Miller
From: Stephen Hemminger 
Date: Fri,  2 Dec 2016 14:54:00 -0800

> File is in uapi directory but not being copied on
>  make install_headers
> 
> Fixes commit 4ec9c8fbbc22 ("netfilter: nft_log: complete
> NFTA_LOG_FLAGS attr support").
> 
> Signed-off-by: Stephen Hemminger 

Also applied.

Someone has to explain to me why we don't simply export every single
file under uapi/, it makes no sense to me to have to specify them
explicitly.

We obviously forget to add the files to the lists all the time.


Re: [PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc

2016-12-03 Thread David Miller
From: Eric Dumazet 
Date: Sat,  3 Dec 2016 11:14:57 -0800

> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index d8be083ab0b0..fc5848dad7a4 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -186,7 +186,6 @@ struct tcp_sock {
>   u32 tsoffset;   /* timestamp offset */
>  
>   struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
> - unsigned long   tsq_flags;
>  
>   /* Data for direct copy to user */
>   struct {

Hmmm, did you forget to "git add include/net/sock.h" before making
this commit?


Re: [net-next 00/18][pull request] 40GbE Intel Wired LAN Driver Updates 2016-12-02

2016-12-03 Thread David Miller
From: Jeff Kirsher 
Date: Sat,  3 Dec 2016 01:19:12 -0800

> This series contains updates to i40e and i40evf only.

Pulled, thanks Jeff.


Re: [PATCH net-next] tcp: fix the missing avr32 SOF_TIMESTAMPING_OPT_STATS

2016-12-03 Thread David Miller
From: Yuchung Cheng 
Date: Sat,  3 Dec 2016 14:46:22 -0800

> The commit of SOF_TIMESTAMPING_OPT_STATS didn't include the
> new header for avr32, causing build to break. The patch fixes it.
> 
> Fixes: 1c885808e456 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
> Reported-by: Paul Gortmaker 
> Signed-off-by: Yuchung Cheng 

Applied, thanks.


Re: [PATCH net-next] bpf: Preserve const register type on const OR alu ops

2016-12-03 Thread Daniel Borkmann

On 12/03/2016 09:31 PM, Alexei Starovoitov wrote:
> From: Gianluca Borello 
>
> Occasionally, clang (e.g. version 3.8.1) translates a sum between two
> constant operands using a BPF_OR instead of a BPF_ADD. The verifier is
> currently not handling this scenario, and the destination register type
> becomes UNKNOWN_VALUE even if it's still storing a constant. As a result,
> the destination register cannot be used as argument to a helper function
> expecting a ARG_CONST_STACK_*, limiting some use cases.
>
> Modify the verifier to handle this case, and add a few tests to make sure
> all combinations are supported, and stack boundaries are still verified
> even with BPF_OR.
>
> Signed-off-by: Gianluca Borello 
> Signed-off-by: Alexei Starovoitov 

Acked-by: Daniel Borkmann 
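The verifier change can be modelled in a few lines. This userspace sketch is a reduced model: `struct reg_sketch` with its two-state `type` is a hypothetical simplification of `struct bpf_reg_state`, and only the ADD/OR constant folding mirrors the patch. The selftests use 34 | 13 (= 47) and 34 | 24 (= 58). The second value explains why the boundary test expects `access_size=58` for a buffer starting at fp-48, since -48 + 58 reaches past the top of the stack frame. None of 34, 13 and 24 share a set bit, so OR and ADD give the same result here. That is presumably why clang is free to emit BPF_OR for such sums.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum reg_type_sketch { UNKNOWN_VALUE_SKETCH, CONST_IMM_SKETCH };

struct reg_sketch {
	enum reg_type_sketch type;
	int64_t imm;
};

enum alu_op_sketch { OP_ADD, OP_OR, OP_AND };

static void mark_unknown(struct reg_sketch *r)
{
	r->type = UNKNOWN_VALUE_SKETCH;
	r->imm = 0;
}

/* Simulate "dst op= src" for a dst already known to be constant.
 * src_is_imm selects BPF_K (immediate) vs BPF_X (register) semantics. */
void eval_const_alu(struct reg_sketch *dst, enum alu_op_sketch op,
		    const struct reg_sketch *src, int32_t imm, bool src_is_imm)
{
	int64_t val;

	if (src_is_imm) {
		val = imm;			/* BPF_K: sign-extended immediate */
	} else if (src->type == CONST_IMM_SKETCH) {
		val = src->imm;			/* BPF_X with a known constant */
	} else {
		mark_unknown(dst);		/* unknown source taints dst */
		return;
	}

	if (op == OP_ADD)
		/* wrap-around add, done in unsigned arithmetic to avoid UB */
		dst->imm = (int64_t)((uint64_t)dst->imm + (uint64_t)val);
	else if (op == OP_OR)
		dst->imm |= val;		/* the case added by the patch */
	else
		mark_unknown(dst);		/* other ops not tracked here */
}
```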


Re: [ovs-dev] [PATCH net-next] net: remove abuse of VLAN DEI/CFI bit

2016-12-03 Thread Ben Pfaff
On Sat, Dec 03, 2016 at 10:22:28AM +0100, Michał Mirosław wrote:
> This All-in-one patch removes abuse of VLAN CFI bit, so it can be passed
> intact through linux networking stack.
> 
> Signed-off-by: Michał Mirosław 
> ---
> 
> Dear NetDevs
> 
> I guess this needs to be split to the prep..convert[]..finish sequence,
> but if you like it as is, then it's ready.
> 
> The biggest question is if the modified interface and vlan_present
> is the way to go. This can be changed to use vlan_proto != 0 instead
> of an extra flag bit.
> 
> As I can't test most of the driver changes, please look at them carefully.
> OVS and bridge eyes are especially welcome.

This appears to change the established Open vSwitch userspace API.  You
can see that simply from the way that it changes the documentation for
the userspace API.  If I'm right about that, then this change will break
all userspace programs that use the Open vSwitch kernel module,
including Open vSwitch itself.
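For reference, the 802.1Q Tag Control Information (TCI) field has three parts: 3 bits of priority (PCP), 1 DEI (formerly CFI) bit and a 12-bit VLAN ID. The mask values below match the kernel's `VLAN_PRIO_MASK`, `VLAN_CFI_MASK` and `VLAN_VID_MASK`. The two storage schemes are illustrative, not the patch's code. As far as we can tell, `VLAN_TAG_PRESENT` in kernels of this era records "tag present" by forcing the CFI bit in the stored TCI. Under that scheme, a frame received with DEI=0 becomes indistinguishable from one with DEI=1. Tracking presence separately, as the patch's `vlan_present` proposal does, keeps the wire value intact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VLAN_PRIO_MASK  0xe000
#define VLAN_PRIO_SHIFT 13
#define VLAN_CFI_MASK   0x1000
#define VLAN_VID_MASK   0x0fff

/* Scheme A (illustrative): overload the CFI/DEI bit as "tag present". */
uint16_t store_tci_overloaded(uint16_t wire_tci)
{
	return wire_tci | VLAN_CFI_MASK;
}

/* Scheme B (illustrative): keep the TCI intact, track presence apart. */
struct vlan_meta {
	bool present;
	uint16_t tci;
};

struct vlan_meta store_tci_separate(uint16_t wire_tci)
{
	struct vlan_meta m = { true, wire_tci };
	return m;
}

/* Field accessors for the standard TCI layout. */
unsigned int tci_prio(uint16_t tci) { return (tci & VLAN_PRIO_MASK) >> VLAN_PRIO_SHIFT; }
bool tci_dei(uint16_t tci) { return (tci & VLAN_CFI_MASK) != 0; }
unsigned int tci_vid(uint16_t tci) { return tci & VLAN_VID_MASK; }
```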


Re: [PATCH 3/6] net: ethernet: ti: cpts: add support of cpts HW_TS_PUSH

2016-12-03 Thread Richard Cochran
On Mon, Nov 28, 2016 at 05:04:25PM -0600, Grygorii Strashko wrote:
> This also changes the overflow polling period when the HW_TS_PUSH feature is
> enabled - overflow check work will be scheduled more often (every
> 200ms) for proper HW_TS_PUSH events reporting.

For proper reporting, you should make use of the interrupt.  The small
fifo (16 iirc) could very well overflow in 200 ms.  The interrupt
handler should read out the entire fifo at each interrupt.

Thanks,
Richard
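Richard's suggestion can be sketched with a software model of the timestamp event FIFO. The interrupt handler loops until the FIFO is empty instead of consuming a single event, so a burst of up to `FIFO_DEPTH` events is never left waiting for a 200 ms poll. `struct event_fifo`, `fifo_pop()`, `irq_handler()` and `record_event()` are hypothetical. On real CPTS hardware the loop condition would come from a status register. The depth of 16 comes from the "16 iirc" remark above, not from a datasheet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 16	/* assumption, per the "16 iirc" remark */

struct event_fifo {
	unsigned int buf[FIFO_DEPTH];
	size_t head, count;
	size_t overflows;	/* events dropped because the FIFO was full */
};

/* Hardware side: push an event, dropping it if the FIFO is full. */
void fifo_push(struct event_fifo *f, unsigned int ev)
{
	if (f->count == FIFO_DEPTH) {
		f->overflows++;
		return;
	}
	f->buf[(f->head + f->count) % FIFO_DEPTH] = ev;
	f->count++;
}

bool fifo_pop(struct event_fifo *f, unsigned int *ev)
{
	if (f->count == 0)
		return false;
	*ev = f->buf[f->head];
	f->head = (f->head + 1) % FIFO_DEPTH;
	f->count--;
	return true;
}

/* Interrupt handler: drain every pending event, not just one. */
size_t irq_handler(struct event_fifo *f, void (*handle_event)(unsigned int))
{
	unsigned int ev;
	size_t n = 0;

	while (fifo_pop(f, &ev)) {
		handle_event(ev);
		n++;
	}
	return n;
}

/* Example event consumer for the sketch. */
size_t events_handled;

void record_event(unsigned int ev)
{
	(void)ev;
	events_handled++;
}
```

A handler that returned after one event would leave the remaining entries to accumulate, and anything beyond the depth is lost, as `overflows` counts.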


[PATCH net-next] tcp: fix the missing avr32 SOF_TIMESTAMPING_OPT_STATS

2016-12-03 Thread Yuchung Cheng
The commit of SOF_TIMESTAMPING_OPT_STATS didn't include the
new header for avr32, causing build to break. The patch fixes it.

Fixes: 1c885808e456 ("tcp: SOF_TIMESTAMPING_OPT_STATS option for SO_TIMESTAMPING")
Reported-by: Paul Gortmaker 
Signed-off-by: Yuchung Cheng 
---
 arch/avr32/include/uapi/asm/socket.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/avr32/include/uapi/asm/socket.h b/arch/avr32/include/uapi/asm/socket.h
index 1fd147f..5a65042 100644
--- a/arch/avr32/include/uapi/asm/socket.h
+++ b/arch/avr32/include/uapi/asm/socket.h
@@ -90,4 +90,6 @@
 
 #define SO_CNX_ADVICE  53
 
+#define SCM_TIMESTAMPING_OPT_STATS 54
+
 #endif /* _UAPI__ASM_AVR32_SOCKET_H */
-- 
2.8.0.rc3.226.g39d4020



Re: [RFC PATCH 2/2] Documentation: devictree: Add macb mdio bindings

2016-12-03 Thread Florian Fainelli
On 12/03/16 at 13:35, Rob Herring wrote:
> On Mon, Nov 28, 2016 at 03:19:27PM +0530, Harini Katakam wrote:
>> Add documentation for macb mdio driver.
> 
> Bindings document h/w, not drivers.
> 
>>
>> Signed-off-by: Harini Katakam 
>> ---
>>  .../devicetree/bindings/net/macb-mdio.txt  | 31 ++
>>  1 file changed, 31 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/net/macb-mdio.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/macb-mdio.txt b/Documentation/devicetree/bindings/net/macb-mdio.txt
>> new file mode 100644
>> index 000..014cedf
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/macb-mdio.txt
>> @@ -0,0 +1,31 @@
>> +* Cadence MACB MDIO controller
>> +
>> +Required properties:
>> +- compatible: Should be "cdns,macb-mdio"
> 
> Only one version ever? This needs more specific compatible strings.
> 
>> +- reg: Address and length of the register set of MAC to be used
>> +- clock-names: Tuple listing input clock names.
>> +Required elements: 'pclk', 'hclk'
>> +Optional elements: 'tx_clk'
>> +- clocks: Phandles to input clocks.

You are also missing mandatory properties:

#address-cells = <1> and #size-cells = <0>

Where is patch 1? Can you make sure you have the same recipient list for
both patches in this series so we can review both the binding and driver?

Thanks!
-- 
Florian


Re: [PATCH net-next] net_sched: gen_estimator: account for timer drifts

2016-12-03 Thread Eric Dumazet
On Sat, 2016-12-03 at 16:12 -0500, David Miller wrote:

> Applied.

Thanks David.

I also have to get rid of the WRITE_ONCE() done in est_timer(), since it
does not work properly on 32bit kernels.

It will target net-next, since it is not a very serious problem.




Re: [RFC PATCH 2/2] Documentation: devictree: Add macb mdio bindings

2016-12-03 Thread Rob Herring
On Mon, Nov 28, 2016 at 03:19:27PM +0530, Harini Katakam wrote:
> Add documentation for macb mdio driver.

Bindings document h/w, not drivers.

> 
> Signed-off-by: Harini Katakam 
> ---
>  .../devicetree/bindings/net/macb-mdio.txt  | 31 ++
>  1 file changed, 31 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/macb-mdio.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/macb-mdio.txt b/Documentation/devicetree/bindings/net/macb-mdio.txt
> new file mode 100644
> index 000..014cedf
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/macb-mdio.txt
> @@ -0,0 +1,31 @@
> +* Cadence MACB MDIO controller
> +
> +Required properties:
> +- compatible: Should be "cdns,macb-mdio"

Only one version ever? This needs more specific compatible strings.

> +- reg: Address and length of the register set of MAC to be used
> +- clock-names: Tuple listing input clock names.
> + Required elements: 'pclk', 'hclk'
> + Optional elements: 'tx_clk'
> +- clocks: Phandles to input clocks.
> +
> +Examples:
> +
> + mdio {
> + compatible = "cdns,macb-mdio";
> + reg = <0x0 0xff0b 0x0 0x1000>;
> + clocks = <>, <>, <>;
> + clock-names = "pclk", "hclk", "tx_clk";
> + ethernet_phyC: ethernet-phy@C {

lowercase hex for unit addresses please

> + reg = <0xC>;
> + };
> + ethernet_phy7: ethernet-phy@7 {
> + reg = <0x7>;
> + };
> + ethernet_phy3: ethernet-phy@3 {
> + reg = <0x3>;
> + };
> + ethernet_phy8: ethernet-phy@8 {
> + reg = <0x8>;
> + };
> + };
> +
> -- 
> 2.7.4
> 


Re: [PATCH net-next] sfc: remove EFX_BUG_ON_PARANOID, use EFX_WARN_ON_[ONCE_]PARANOID instead

2016-12-03 Thread David Miller
From: Edward Cree 
Date: Fri, 2 Dec 2016 15:51:33 +

> Logically, EFX_BUG_ON_PARANOID can never be correct.  For, BUG_ON should
>  only be used if it is not possible to continue without potential harm;
>  and since the non-DEBUG driver will continue regardless (as the BUG_ON is
>  compiled out), clearly the BUG_ON cannot be needed in the DEBUG driver.
> So, replace every EFX_BUG_ON_PARANOID with either an EFX_WARN_ON_PARANOID
>  or the newly defined EFX_WARN_ON_ONCE_PARANOID.
> 
> Signed-off-by: Edward Cree 

Applied.


Re: [PATCH net-next] net_sched: gen_estimator: account for timer drifts

2016-12-03 Thread David Miller
From: Eric Dumazet 
Date: Fri, 02 Dec 2016 08:11:00 -0800

> From: Eric Dumazet 
> 
> Under heavy stress, timers used in estimators tend to slowly be delayed
> by a few jiffies, leading to inaccuracies.
> 
> Let's remember the last scheduled jiffies value so that we get more
> precise estimations, without having to add a multiply/divide in the loop
> to account for the drifts.
> 
> Signed-off-by: Eric Dumazet 

Applied.
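The drift argument can be shown with plain arithmetic. In this sketch, every timer callback runs `delay` jiffies after its scheduled expiry. The helper names are hypothetical, and jiffies are modelled as `unsigned long`. Re-arming relative to "now" carries that lateness into every later period. Re-arming relative to the previous scheduled expiry keeps the schedule on its nominal grid. A real implementation must also handle a next slot that is already in the past. That case is left out here, and the quoted excerpt does not show how the patch itself handles it.

```c
#include <assert.h>

/* Simulate n periods of length `interval` where each callback runs
 * `delay` jiffies late; return the scheduled expiry of the last period. */
unsigned long rearm_from_now(unsigned long start, unsigned long interval,
			     unsigned long delay, int n)
{
	unsigned long expires = start + interval;

	for (int i = 1; i < n; i++) {
		unsigned long now = expires + delay;	/* callback runs late */
		expires = now + interval;		/* lateness carried forward */
	}
	return expires;
}

unsigned long rearm_from_last(unsigned long start, unsigned long interval,
			      unsigned long delay, int n)
{
	unsigned long expires = start + interval;

	(void)delay;	/* lateness does not affect the next slot */
	for (int i = 1; i < n; i++)
		expires += interval;			/* stay on the grid */
	return expires;
}
```

With an interval of 100 and a 3-jiffy delay, ten periods end at 1027 when re-arming from "now", versus 1000 when re-arming from the last expiry.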


Re: [PATCH net-next] udp: be less conservative with sock rmem accounting

2016-12-03 Thread David Miller
From: Paolo Abeni 
Date: Fri,  2 Dec 2016 17:35:49 +0100

> Before commit 850cbaddb52d ("udp: use it's own memory accounting
> schema"), the udp protocol allowed sk_rmem_alloc to grow beyond
> the rcvbuf by the whole current packet's truesize. After said commit
> we allow sk_rmem_alloc to exceed the rcvbuf only if the receive queue
> is empty. As reported by Jesper this causes a performance regression
> for some (small) values of rcvbuf.
> 
> This commit is intended to fix the regression restoring the old
> handling of the rcvbuf limit.
> 
> Reported-by: Jesper Dangaard Brouer 
> Fixes: 850cbaddb52d ("udp: use it's own memory accounting schema")
> Signed-off-by: Paolo Abeni 

Applied.
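Here are the two admission rules described above, reduced to integer arithmetic. This sketch follows the rules as the commit message words them; it is not the code of either commit. The function names are hypothetical. `rmem` is the memory already charged to the socket and `size` is the truesize of the incoming skb. `queue_empty` stands in for an empty receive queue. With a small `rcvbuf`, the strict rule can leave room for only one queued packet, which fits the kind of regression reported.

```c
#include <assert.h>
#include <stdbool.h>

/* Rule after 850cbaddb52d, as described: the limit may only be
 * exceeded when the receive queue is empty. */
bool admit_strict(unsigned int rmem, unsigned int size,
		  unsigned int rcvbuf, bool queue_empty)
{
	if (queue_empty)
		return true;
	return rmem + size <= rcvbuf;
}

/* Restored rule: the limit may be exceeded by the whole truesize of the
 * packet being queued, i.e. rmem + size <= rcvbuf + size. */
bool admit_relaxed(unsigned int rmem, unsigned int size, unsigned int rcvbuf)
{
	(void)size;	/* cancels out of rmem + size <= rcvbuf + size */
	return rmem <= rcvbuf;
}
```

With `rcvbuf` = 4096 and a truesize of 2304 (arbitrary example values), the strict rule drops the second packet while the restored rule accepts it.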


Re: [PATCH] pull request for net: batman-adv 2016-12-02

2016-12-03 Thread David Miller
From: Simon Wunderlich 
Date: Fri,  2 Dec 2016 17:13:22 +0100

> here is another bugfix which we would like to see integrated into net,
> if this is possible now.
> 
> Please pull or let me know of any problem!

Pulled, thanks.


Re: [PATCH 1/6] net: stmmac: return error if no DMA configuration is found

2016-12-03 Thread David Miller

When you post a series of related changes as a patch set, you must
provide a proper "[PATCH 0/N] ..." posting which explains what
the series is doing at a high level, how it is doing it, and why
it is doing it that way.

Please repost this entire series with a proper header posting
included.

Thank you.


Re: [PATCH net-next 0/2] samples, bpf: Refactor; Add automated tests for cgroups

2016-12-03 Thread David Miller
From: Sargun Dhillon 
Date: Fri, 2 Dec 2016 02:42:03 -0800

> These two patches are around refactoring out some old, reusable code from the
> existing test_current_task_under_cgroup_user test, and adding a new, automated
> test.
> 
> There is some generic cgroupsv2 setup & cleanup code, given that most
> environments still don't have it set up by default. With this code, we're able
> to pretty easily add an automated test for future cgroupsv2 functionality.

Series applied, thanks.


Re: [PATCH 1/1] net: caif: fix ineffective error check

2016-12-03 Thread Sergei Shtylyov


On 12/03/2016 06:38 PM, Pan Bian wrote:
>>> In function caif_sktinit_module(), the check of the return value of
>>> sock_register() seems ineffective. This patch fixes it.
>>>
>>> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188751
>>>
>>> Signed-off-by: Pan Bian 
>>> ---
>>>  net/caif/caif_socket.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
>>> index aa209b1..2a689a3 100644
>>> --- a/net/caif/caif_socket.c
>>> +++ b/net/caif/caif_socket.c
>>> @@ -1108,7 +1108,7 @@ static int caif_create(struct net *net, struct socket *sock, int protocol,
>>>  static int __init caif_sktinit_module(void)
>>>  {
>>>  	int err = sock_register(_family_ops);
>>> -	if (!err)
>>> +	if (err)
>>>  		return err;
>>>
>>
>>    Why not just:
>>
>> 	return sock_register(_family_ops);
>
> Your solution looks much cleaner.
>
> But I am not really sure whether it is the author's intention to
> return 0 anyway. Do you have any idea?

   I don't think so, the error check seems to have a typo.

[...]

> Best regards,
> Pan

MBR, Sergei
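The following sketch contrasts the original check, the posted fix, and Sergei's suggestion. `fake_sock_register()` is a hypothetical stand-in for `sock_register()` that returns 0 on success or a negative errno (`-EEXIST` is chosen for illustration). The diff context does not show the rest of `caif_sktinit_module()`, so the sketch assumes that the function simply returns 0 after the check. Under that assumption, `if (!err) return err;` swallows every failure, and returning the call's result directly is equivalent to the patched version.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for sock_register(): 0 or a negative errno. */
int fake_sock_register(int should_fail)
{
	return should_fail ? -EEXIST : 0;
}

/* Original logic: the test is inverted, so a failure is swallowed
 * (assuming the function ends with "return 0;"). */
int sktinit_buggy(int should_fail)
{
	int err = fake_sock_register(should_fail);
	if (!err)
		return err;
	return 0;
}

/* Logic after the posted patch. */
int sktinit_patched(int should_fail)
{
	int err = fake_sock_register(should_fail);
	if (err)
		return err;
	return 0;
}

/* Sergei's suggestion: propagate the result directly. */
int sktinit_direct(int should_fail)
{
	return fake_sock_register(should_fail);
}
```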



Re: [PATCH net-next] samples/bpf: silence compiler warnings

2016-12-03 Thread David Miller
From: Alexei Starovoitov 
Date: Thu, 1 Dec 2016 18:31:12 -0800

> silence some of the clang compiler warnings like:
> include/linux/fs.h:2693:9: warning: comparison of unsigned enum expression < 0 is always false
> arch/x86/include/asm/processor.h:491:30: warning: taking address of packed member 'sp0' of class or structure 'x86_hw_tss' may result in an unaligned pointer value
> include/linux/cgroup-defs.h:326:16: warning: field 'cgrp' with variable sized type 'struct cgroup' not at the end of a struct or class is a GNU extension
> since they add too much noise to samples/bpf/ build.
> 
> Signed-off-by: Alexei Starovoitov 

Applied.


Re: [PATCH 1/3] netns: publish net_generic correctly

2016-12-03 Thread David Miller

All 3 patches applied to net-next, thanks.


Re: [PATCH] netlink: 2-clause nla_ok()

2016-12-03 Thread David Miller
From: Alexey Dobriyan 
Date: Fri, 2 Dec 2016 03:59:06 +0300

> nla_ok() consists of 3 clauses:
> 
>   1) int rem >= (int)sizeof(struct nlattr)
> 
>   2) u16 nla_len >= sizeof(struct nlattr)
> 
>   3) u16 nla_len <= int rem
> 
> The statement is that clause (1) is redundant.
> 
> What it does is ensuring that "rem" is a positive number,
> so that in clause (3) positive number will be compared to positive number
> with no problems.
> 
> However, "u16" fully fits into "int" and integers do not change value
> when upcasting even to signed type. Negative integers will be rejected
> by clause (3) just fine. Small positive integers will be rejected
> by transitivity of comparison operator.
> 
> NOTE: all of the above DOES NOT apply to nlmsg_ok() where ->nlmsg_len is
> u32(!), so 3 clauses AND A CAST TO INT are necessary.
> 
> Obligatory space savings report: -1.6 KB
 ...
> Signed-off-by: Alexey Dobriyan 

Looks fine, applied to net-next, thanks.
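The redundancy argument can be checked mechanically. This sketch models `nla_ok()` on two plain values instead of a `struct nlattr` pointer. The function names are hypothetical. The header is two 16-bit fields, so its size is 4. The test compares the three-clause and two-clause forms over every 16-bit `nla_len`, for a range of `rem` values around the header size. Because `nla_len` is promoted to `int`, the final comparison is signed and already rejects any negative `rem`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same layout as the netlink attribute header: nla_len, nla_type. */
struct nlattr_sketch {
	uint16_t nla_len;
	uint16_t nla_type;
};

/* Three-clause form, as listed in the quoted analysis. */
bool nla_ok_3(int rem, uint16_t nla_len)
{
	return rem >= (int)sizeof(struct nlattr_sketch) &&
	       nla_len >= sizeof(struct nlattr_sketch) &&
	       nla_len <= rem;
}

/* Two-clause form: clause (1) dropped. nla_len is promoted to int, so
 * the last comparison is signed and rejects any rem below nla_len. */
bool nla_ok_2(int rem, uint16_t nla_len)
{
	return nla_len >= sizeof(struct nlattr_sketch) &&
	       nla_len <= rem;
}
```

When `rem` is below 4, the two-clause form would need `4 <= nla_len <= rem < 4`, which is impossible. Both forms therefore agree, which is the transitivity point in the commit message.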


Re: [net-next 0/5] use reset to set headers

2016-12-03 Thread David Miller
From: Zhang Shengju 
Date: Fri,  2 Dec 2016 09:51:02 +0800

> This patch series replaces the 'set' functions with 'reset', since the
> offset is zero. Using set is not necessary; the reset function is
> straightforward and avoids the unnecessary add operation in the
> set function.

Series applied, thanks.
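The substitution preserves behaviour, as this sketch shows. It assumes the helper definitions of this era, where `skb_set_network_header(skb, offset)` calls `skb_reset_network_header(skb)` and then adds `offset`. With an offset of 0, the addition is dead work. `struct skb_sketch` and the two functions are reduced, hypothetical stand-ins that store the header position as an offset from `head`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reduced skb: header position kept as an offset from head. */
struct skb_sketch {
	unsigned char *head;
	unsigned char *data;
	uint16_t network_header;
};

/* "reset": the network header starts where data currently points. */
void reset_network_header(struct skb_sketch *skb)
{
	skb->network_header = (uint16_t)(skb->data - skb->head);
}

/* "set": reset plus an extra addition of the offset. */
void set_network_header(struct skb_sketch *skb, int offset)
{
	reset_network_header(skb);
	skb->network_header += offset;
}
```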


Re: [PATCH net-next 0/8] drivers: net: xgene: Add Jumbo and Pause frame support

2016-12-03 Thread David Miller
From: Iyappan Subramanian 
Date: Thu,  1 Dec 2016 16:41:36 -0800

> This patch set adds,
> 
> 1. Jumbo frame support
> 2. Pause frame based flow control
> 
> and fixes RSS for non-TCP/UDP packets.
> 
> Signed-off-by: Iyappan Subramanian 

Series applied, thanks.


Re: [PATCH 2/2] net: stmmac: unify mdio functions

2016-12-03 Thread David Miller
From: Corentin Labbe 
Date: Thu,  1 Dec 2016 16:19:41 +0100

> stmmac_mdio_{read|write} and stmmac_mdio_{read|write}_gmac4 are not
> different enough to justify being split.
> The only differences between those two functions are the shifts/masks for
> addr/reg/clk_csr.
> 
> This patch introduces a per-platform set of variables for setting those
> shifts/masks and unifies the mdio read and write functions.
> 
> Signed-off-by: Corentin Labbe 

Applied.
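Here is the unification idea in miniature. One helper builds the MDIO address register value, and the per-core differences live in a table of shifts and masks. `struct mii_regs_sketch`, `mdio_build_addr()` and the numeric layouts are placeholders for illustration. They are not stmmac's actual definitions or register map.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-core description of the MDIO address register. */
struct mii_regs_sketch {
	unsigned int addr_shift, addr_mask;
	unsigned int reg_shift, reg_mask;
	unsigned int clk_csr_shift, clk_csr_mask;
};

/* Single helper replacing two near-identical variants; each field is
 * shifted into place and clipped to its mask. */
uint32_t mdio_build_addr(const struct mii_regs_sketch *r,
			 unsigned int phyaddr, unsigned int phyreg,
			 unsigned int clk_csr)
{
	uint32_t value = 0;

	value |= (phyaddr << r->addr_shift) & r->addr_mask;
	value |= (phyreg << r->reg_shift) & r->reg_mask;
	value |= (clk_csr << r->clk_csr_shift) & r->clk_csr_mask;
	return value;
}

/* Placeholder layouts for two cores (values are illustrative only). */
const struct mii_regs_sketch core_a = { 11, 0x0000f800, 6, 0x000007c0, 2, 0x0000003c };
const struct mii_regs_sketch core_b = { 21, 0x03e00000, 16, 0x001f0000, 8, 0x00000f00 };
```

Adding a third core then means adding a table entry rather than another copy of the read/write functions.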


Re: [PATCH 1/7] net: ethernet: ti: cpdma: am437x: allow descs to be plased in ddr

2016-12-03 Thread David Miller
From: Grygorii Strashko 
Date: Thu, 1 Dec 2016 17:34:26 -0600

> @@ -167,10 +167,10 @@ static struct cpdma_control_info controls[] = {
>  
>  /* various accessors */
> #define dma_reg_read(ctlr, ofs)  __raw_readl((ctlr)->dmaregs + (ofs))
> -#define chan_read(chan, fld) __raw_readl((chan)->fld)
> +#define chan_read(chan, fld) readl((chan)->fld)
>  #define desc_read(desc, fld) __raw_readl(&(desc)->fld)
>  #define dma_reg_write(ctlr, ofs, v)  __raw_writel(v, (ctlr)->dmaregs + (ofs))
> -#define chan_write(chan, fld, v) __raw_writel(v, (chan)->fld)
> +#define chan_write(chan, fld, v) writel(v, (chan)->fld)
>  #define desc_write(desc, fld, v) __raw_writel((u32)(v), &(desc)->fld)

Unless you want to keep running into subtle errors all over the
place wrt. register vs. memory write ordering, I strongly suggest
you use strongly ordered readl/writel for all register accesses.

I see no tangible, worthwhile, advantage to using these relaxed
ordering primitives.  The only result is potential bugs.

People who use the relaxed ordering primitives properly are only
doing so in extremely carefully coded sequences where a series
of writes has no dependency on main memory operations and is
explicitly completed with a barrier operation such as a read
back of a register in the same device.

That's not at all what is going on here, instead the driver is wildly
using relaxed ordered register accesses for basically everything.
This is extremely unwise and it's why you ran into this bug in the
first place.

Therefore, I absolutely require that you fix this by eliminating
any and all uses of relaxed ordering I/O accessors in this driver.

Thank you.


Re: [PATCH 1/2] net: stmmac: avoid Camelcase naming

2016-12-03 Thread David Miller
From: Corentin Labbe 
Date: Thu,  1 Dec 2016 16:19:40 +0100

> This patch simply renames regValue to value, like it was named in other
> mdio functions.
> 
> Signed-off-by: Corentin Labbe 

Applied.


[PATCH net-next] bpf: Preserve const register type on const OR alu ops

2016-12-03 Thread Alexei Starovoitov
From: Gianluca Borello 

Occasionally, clang (e.g. version 3.8.1) translates a sum between two
constant operands using a BPF_OR instead of a BPF_ADD. The verifier is
currently not handling this scenario, and the destination register type
becomes UNKNOWN_VALUE even if it's still storing a constant. As a result,
the destination register cannot be used as argument to a helper function
expecting a ARG_CONST_STACK_*, limiting some use cases.

Modify the verifier to handle this case, and add a few tests to make sure
all combinations are supported, and stack boundaries are still verified
even with BPF_OR.

Signed-off-by: Gianluca Borello 
Signed-off-by: Alexei Starovoitov 
---
 kernel/bpf/verifier.c   |  9 -
 tools/testing/selftests/bpf/.gitignore  |  1 +
 tools/testing/selftests/bpf/test_verifier.c | 60 +
 3 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 0e742210750e..38d05da84a49 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1481,14 +1481,19 @@ static int evaluate_reg_imm_alu(struct bpf_verifier_env *env,
struct bpf_reg_state *src_reg = [insn->src_reg];
u8 opcode = BPF_OP(insn->code);
 
-   /* dst_reg->type == CONST_IMM here, simulate execution of 'add' insn.
-* Don't care about overflow or negative values, just add them
+   /* dst_reg->type == CONST_IMM here, simulate execution of 'add'/'or'
+* insn. Don't care about overflow or negative values, just add them
 */
if (opcode == BPF_ADD && BPF_SRC(insn->code) == BPF_K)
dst_reg->imm += insn->imm;
else if (opcode == BPF_ADD && BPF_SRC(insn->code) == BPF_X &&
 src_reg->type == CONST_IMM)
dst_reg->imm += src_reg->imm;
+   else if (opcode == BPF_OR && BPF_SRC(insn->code) == BPF_K)
+   dst_reg->imm |= insn->imm;
+   else if (opcode == BPF_OR && BPF_SRC(insn->code) == BPF_X &&
+src_reg->type == CONST_IMM)
+   dst_reg->imm |= src_reg->imm;
else
mark_reg_unknown_value(regs, insn->dst_reg);
return 0;
diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 3c59f96e3ed8..071431bedde8 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -1,2 +1,3 @@
 test_verifier
 test_maps
+test_lru_map
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 5da2e9d7689c..8d71e44b319d 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -2683,6 +2683,66 @@ static struct bpf_test tests[] = {
.errstr_unpriv = "R0 pointer arithmetic prohibited",
.result_unpriv = REJECT,
},
+   {
+   "constant register |= constant should keep constant type",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -48),
+   BPF_MOV64_IMM(BPF_REG_2, 34),
+   BPF_ALU64_IMM(BPF_OR, BPF_REG_2, 13),
+   BPF_MOV64_IMM(BPF_REG_3, 0),
+   BPF_EMIT_CALL(BPF_FUNC_probe_read),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   },
+   {
+   "constant register |= constant should not bypass stack boundary checks",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -48),
+   BPF_MOV64_IMM(BPF_REG_2, 34),
+   BPF_ALU64_IMM(BPF_OR, BPF_REG_2, 24),
+   BPF_MOV64_IMM(BPF_REG_3, 0),
+   BPF_EMIT_CALL(BPF_FUNC_probe_read),
+   BPF_EXIT_INSN(),
+   },
+   .errstr = "invalid stack type R1 off=-48 access_size=58",
+   .result = REJECT,
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+   },
+   {
+   "constant register |= constant register should keep constant type",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, -48),
+   BPF_MOV64_IMM(BPF_REG_2, 34),
+   BPF_MOV64_IMM(BPF_REG_4, 13),
+   BPF_ALU64_REG(BPF_OR, BPF_REG_2, BPF_REG_4),
+   BPF_MOV64_IMM(BPF_REG_3, 0),
+   BPF_EMIT_CALL(BPF_FUNC_probe_read),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
+  

Re: [PATCH -next] net: ethernet: ti: davinci_cpdma: add missing EXPORTs

2016-12-03 Thread David Miller
From: Paul Gortmaker 
Date: Thu, 1 Dec 2016 15:25:28 -0500

> As of commit 8f32b90981dcdb355516fb95953133f8d4e6b11d
> ("net: ethernet: ti: davinci_cpdma: add set rate for a channel") the
> ARM allmodconfig builds would fail modpost with:
> 
> ERROR: "cpdma_chan_set_weight" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
> ERROR: "cpdma_chan_get_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
> ERROR: "cpdma_chan_get_min_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
> ERROR: "cpdma_chan_set_rate" [drivers/net/ethernet/ti/ti_cpsw.ko] undefined!
> 
> Since these weren't declared as static, it is assumed they were
> meant to be shared outside the file, and that modular build testing
> was simply overlooked.
> 
> Fixes: 8f32b90981dc ("net: ethernet: ti: davinci_cpdma: add set rate for a channel")
> Cc: Ivan Khoronzhuk 
> Cc: Mugunthan V N 
> Cc: Grygorii Strashko 
> Cc: linux-o...@vger.kernel.org
> Cc: netdev@vger.kernel.org
> Signed-off-by: Paul Gortmaker 

Applied.


Re: pull-request: can-next 2016-12-01

2016-12-03 Thread David Miller
From: Marc Kleine-Budde 
Date: Thu, 1 Dec 2016 21:21:44 +0100

> this is a pull request of 4 patches for net-next/master.
> 
> There are two patches by Chris Paterson for the rcar_can and rcar_canfd
> device tree binding documentation. And a patch by Geert Uytterhoeven
> that corrects the order of interrupt specifiers.
> 
> The fourth patch by Colin Ian King fixes a spelling error in the
> kvaser_usb driver.

Pulled, thanks.


Re: [PATCH V2 net-next] net: hns: Fix to conditionally convey RX checksum flag to stack

2016-12-03 Thread David Miller
From: Salil Mehta 
Date: Thu, 1 Dec 2016 16:59:14 +

> It looks to me the cumbersome check in the PATCH V2 should
> be retained.

I really want something simpler with small checks that are
done in logical pieces in a straightforward progression.

The code in V2 is completely unreadable.


Re: [PATCH net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog

2016-12-03 Thread Daniel Borkmann

On 12/03/2016 08:32 PM, Martin KaFai Lau wrote:
> On Sat, Dec 03, 2016 at 04:24:13PM +0100, Jesper Dangaard Brouer wrote:
>> On Fri, 2 Dec 2016 15:23:30 -0800
>> Martin KaFai Lau  wrote:
>>
>>> -bool bpf_helper_changes_skb_data(void *func)
>>> +BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>>> +{
>>> +   /* Both mlx4 and mlx5 driver align each packet to PAGE_SIZE when
>>> +* XDP prog is set.
>>> +* If the above is not true for the other drivers to support
>>> +* bpf_xdp_adjust_head, struct xdp_buff can be extended.
>>> +*/
>>> +   void *head = (void *)((unsigned long)xdp->data & PAGE_MASK);
>>> +   void *new_data = xdp->data + offset;
>>> +
>>> +   if (new_data < head || new_data >= xdp->data_end)
>>> +   /* The packet length must be >=1 */
>>> +   return -EINVAL;
>>> +
>>> +   xdp->data = new_data;
>>> +
>>> +   return 0;
>>> +}
>>
>> First time I read this code, I was about to complain about you didn't
>> use XDP_PACKET_HEADROOM in your boundary check.  But then I noticed the
>> PAGE_MASK.  If you rename "head" to "page_boundary" or "page_start"
>> then IMHO the code would be more readable.
> bpf_xdp_adjust_head() could be called multiple times.  Hence,
> XDP_PACKET_HEADROOM is not used in the boundary check.
>
> My thinking is "head" here can closely resemble the meaning of
> skb->head as a boundary.  I think missing the info on
> what head it is could be the confusing part.
>
> Instead of skb boundary (there is no skb here) or
> page boundary (other future XDP drivers may not align like mlx4/5),
> I think maybe "pkt_head" can give more clarity here and also
> for future XDP-capable drivers?

I think as-is with head is also fine with me, but if it should be
something more readable (?), perhaps as such (modulo the min len
part):

BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
{
	unsigned long addr = (unsigned long)xdp->data & PAGE_MASK;
	void *data_hard_start = (void *)addr;
	void *data = xdp->data + offset;

	if (unlikely(data < data_hard_start || data >= xdp->data_end))
		return -EINVAL;

	xdp->data = data;
	return 0;
}

Thanks,
Daniel
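Daniel's variant above can be exercised in userspace with addresses modelled as integers. This sketch assumes a 4 KiB page and the one-packet-per-page layout that the mlx4/mlx5 comment describes. `struct xdp_buff_sketch`, `PAGE_SIZE_SKETCH` and `xdp_adjust_head_sketch()` are hypothetical. The sketch also illustrates Martin's point. The lower bound is recomputed from the page start on every call, so repeated adjustments stay valid without reference to `XDP_PACKET_HEADROOM`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 4096UL
#define PAGE_MASK_SKETCH (~(PAGE_SIZE_SKETCH - 1))

/* Addresses modelled as integers so the boundary arithmetic is explicit. */
struct xdp_buff_sketch {
	uintptr_t data;
	uintptr_t data_end;
};

int xdp_adjust_head_sketch(struct xdp_buff_sketch *xdp, int offset)
{
	/* Start of the page holding the packet: the lowest legal data. */
	uintptr_t data_hard_start = xdp->data & PAGE_MASK_SKETCH;
	/* Unsigned wrap-around makes a negative offset move data back. */
	uintptr_t data = xdp->data + (uintptr_t)(intptr_t)offset;

	/* Stay inside the page and keep at least one byte of packet. */
	if (data < data_hard_start || data >= xdp->data_end)
		return -EINVAL;

	xdp->data = data;
	return 0;
}
```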


Re: [PATCH net-next 1/4] bpf: xdp: Allow head adjustment in XDP prog

2016-12-03 Thread Martin KaFai Lau
On Sat, Dec 03, 2016 at 04:24:13PM +0100, Jesper Dangaard Brouer wrote:
> On Fri, 2 Dec 2016 15:23:30 -0800
> Martin KaFai Lau  wrote:
>
> > -bool bpf_helper_changes_skb_data(void *func)
> > +BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
> > +{
> > +   /* Both mlx4 and mlx5 driver align each packet to PAGE_SIZE when
> > +* XDP prog is set.
> > +* If the above is not true for the other drivers to support
> > +* bpf_xdp_adjust_head, struct xdp_buff can be extended.
> > +*/
> > +   void *head = (void *)((unsigned long)xdp->data & PAGE_MASK);
> > +   void *new_data = xdp->data + offset;
> > +
> > +   if (new_data < head || new_data >= xdp->data_end)
> > +   /* The packet length must be >=1 */
> > +   return -EINVAL;
> > +
> > +   xdp->data = new_data;
> > +
> > +   return 0;
> > +}
>
> First time I read this code, I was about to complain about you didn't
> use XDP_PACKET_HEADROOM in your boundary check.  But then I noticed the
> PAGE_MASK.  If you rename "head" to "page_boundary" or "page_start"
> then IMHO the code would be more readable.
bpf_xdp_adjust_head() could be called multiple times.  Hence,
XDP_PACKET_HEADROOM is not used in the boundary check.

My thinking is "head" here can closely resemble the meaning of
skb->head as a boundary.  I think missing the info on
what head it is could be the confusing part.

Instead of skb boundary (there is no skb here) or
page boundary (other future XDP drivers may not align like mlx4/5),
I think maybe "pkt_head" can give more clarity here and also
for future XDP-capable drivers?


Re: [PATCH] irda: w83977af_ir: fix damaged whitespace

2016-12-03 Thread David Miller
From: Arnd Bergmann 
Date: Mon, 28 Nov 2016 15:19:43 +0100

> As David Miller pointed out for the previous patch, the whitespace
> in some functions looks rather odd. This was caused by commit 6329da5f258a
> ("obsolete config in kernel source: USE_INTERNAL_TIMER"), which removed
> some conditions but did not reindent the code.
> 
> This fixes the indentation in the file and removes extraneous whitespace
> at the end of the lines and before tabs.
> 
> There are many other minor coding style problems in the driver, but I'm
> not touching those here.
> 
> Signed-off-by: Arnd Bergmann 

Applied, thanks Arnd.


Re: [PATCH] stmmac: cleanup documenation, make it match reality

2016-12-03 Thread David Miller
From: Pavel Machek 
Date: Thu, 1 Dec 2016 11:32:18 +0100

> Fix English in documentation, make documentation match reality, remove
> options that were removed from code.
> 
> Signed-off-by: Pavel Machek 

Applied.


Re: [PATCH V2 net-next] net: hns: Fix to conditionally convey RX checksum flag to stack

2016-12-03 Thread David Miller
From: Salil Mehta 
Date: Thu, 1 Dec 2016 12:09:22 +

> But maybe now since we don't have any method to de-multiplex the kind of
> checksum error (cannot depend upon register) we can have below code
> re-arrangement:
> 
> hns_nic_rx_checksum() {
>   /* check supported L3 protocol */
>   if (l3 != IPV4 && l3 != IPV6)
>   return;
>   /* check if L3 protocols error */
>   if (l3e)
>   return;
> 
>   /* check if the packets are fragmented */
>   if (l3frags)
>   return;
> 
>   /* check supported L4 protocol */
>   if (l4 != UDP && l4 != TCP && l4 != SCTP)
>   return;
>   /* check if any L4 protocol error */
>   if (l3e)
>   return;
> 
>   /* packet with valid checksum - convey to stack */
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
> }

This looks a lot cleaner and easier to understand.
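Here is Salil's proposed progression as a compilable sketch. The descriptor struct, the enums and the return values are hypothetical stand-ins for what the driver reads from the RX descriptor. The caller would set `skb->ip_summed = CHECKSUM_UNNECESSARY` only when `CSUM_UNNECESSARY` comes back. One assumption applies: the quoted pseudocode tests `l3e` again under the "L4 protocol error" comment. The sketch uses a separate L4 error flag, `l4e`, which appears to be what was intended.

```c
#include <assert.h>
#include <stdbool.h>

enum l3_proto_sketch { L3_IPV4, L3_IPV6, L3_OTHER };
enum l4_proto_sketch { L4_TCP, L4_UDP, L4_SCTP, L4_OTHER };

/* Hypothetical decoded RX descriptor status. */
struct rx_desc_sketch {
	enum l3_proto_sketch l3;
	enum l4_proto_sketch l4;
	bool l3e;	/* L3 checksum error */
	bool l4e;	/* L4 checksum error */
	bool l3frags;	/* packet is an IP fragment */
};

enum csum_sketch { CSUM_NONE, CSUM_UNNECESSARY };

/* Each check is a small early return, in the order proposed above. */
enum csum_sketch rx_checksum(const struct rx_desc_sketch *d)
{
	/* supported L3 protocol? */
	if (d->l3 != L3_IPV4 && d->l3 != L3_IPV6)
		return CSUM_NONE;
	/* L3 checksum error? */
	if (d->l3e)
		return CSUM_NONE;
	/* fragmented packets are left to the stack */
	if (d->l3frags)
		return CSUM_NONE;
	/* supported L4 protocol? */
	if (d->l4 != L4_TCP && d->l4 != L4_UDP && d->l4 != L4_SCTP)
		return CSUM_NONE;
	/* L4 checksum error? */
	if (d->l4e)
		return CSUM_NONE;
	/* valid checksum: the caller may mark it CHECKSUM_UNNECESSARY */
	return CSUM_UNNECESSARY;
}
```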


[PATCH net-next] r8169: Add support for restarting auto-negotiation

2016-12-03 Thread Florian Fainelli
Implement ethtool_ops::nway_reset by utilizing mii_nway_restart().

Signed-off-by: Florian Fainelli 
---
 drivers/net/ethernet/realtek/r8169.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 2830190aaace..f9b97f5946f8 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -2344,6 +2344,13 @@ static void rtl8169_get_strings(struct net_device *dev, u32 stringset, u8 *data)
}
 }
 
+static int rtl8169_nway_reset(struct net_device *dev)
+{
+   struct rtl8169_private *tp = netdev_priv(dev);
+
+   return mii_nway_restart(>mii);
+}
+
 static const struct ethtool_ops rtl8169_ethtool_ops = {
.get_drvinfo= rtl8169_get_drvinfo,
.get_regs_len   = rtl8169_get_regs_len,
@@ -2359,6 +2366,7 @@ static const struct ethtool_ops rtl8169_ethtool_ops = {
.get_sset_count = rtl8169_get_sset_count,
.get_ethtool_stats  = rtl8169_get_ethtool_stats,
.get_ts_info= ethtool_op_get_ts_info,
+   .nway_reset = rtl8169_nway_reset,
 };
 
 static void rtl8169_get_mac_version(struct rtl8169_private *tp,
-- 
2.9.3



pull request: bluetooth-next 2016-12-03

2016-12-03 Thread Johan Hedberg
Hi Dave,

Here's a set of Bluetooth & 802.15.4 patches for net-next (i.e. 4.10
kernel):

 - Fix for a potential NULL deref in the ieee802154 netlink code
 - Fix for the ED values of the at86rf2xx driver
 - Documentation updates to ieee802154
 - Cleanups to u8 vs __u8 usage
 - Timer API usage cleanups in HCI drivers

Please let me know if there are any issues pulling. Thanks.

Johan

---
The following changes since commit 0b42f25d2f123bb7fbd3565d003a8ea9e1e810fe:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net (2016-11-26 23:42:21 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next.git for-upstream

for you to fetch changes up to 6bf0d84d13e968b4f8bf0710e0cae785e228dbba:

  docs: ieee802154: update main documentation file (2016-11-30 12:33:07 +0100)


Alexander Aring (1):
  at86rf230: fix cca ed values for rf233

Pavel Machek (1):
  Bluetooth: __ variants of u8 and friends are not neccessary inside kernel

Prasanna Karthik (3):
  Bluetooth: hci_bcsp: Use setup_timer Kernel API instead of init_timer
  Bluetooth: hci_h5: Use setup_timer Kernel API instead of init_timer
  Bluetooth: hci_qca: Use setup_timer Kernel API instead of init_timer

Stefan Schmidt (3):
  ieee802154: add myself as co-maintainer to MAINTAINERS file
  ieee802154: fakelb: print number of created fake devices during probe
  docs: ieee802154: update main documentation file

vegard.nos...@oracle.com (1):
  ieee802154: check device type

 Documentation/networking/ieee802154.txt | 26 +++---
 MAINTAINERS |  1 +
 drivers/bluetooth/hci_bcsp.c|  4 +---
 drivers/bluetooth/hci_h5.c  |  4 +---
 drivers/bluetooth/hci_qca.c |  9 +++--
 drivers/net/ieee802154/at86rf230.c  | 16 +++-
 drivers/net/ieee802154/fakelb.c |  2 +-
 include/net/bluetooth/bluetooth.h   | 25 +
 net/ieee802154/nl-phy.c |  6 +-
 9 files changed, 47 insertions(+), 46 deletions(-)




Re: [flamebait] xdp, well meaning but pointless

2016-12-03 Thread John Fastabend
On 16-12-03 08:19 AM, Willem de Bruijn wrote:
> On Fri, Dec 2, 2016 at 12:22 PM, Jesper Dangaard Brouer
>  wrote:
>>
>> On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal  wrote:
>>
>>> In light of DPDK's existence it makes a lot more sense to me to provide
>>> a). a faster mmap based interface (possibly AF_PACKET based) that allows
>>> to map nic directly into userspace, detaching tx/rx queue from kernel.
>>>
>>> John Fastabend sent something like this last year as a proof of
>>> concept, iirc it was rejected because register space got exposed directly
>>> to userspace.  I think we should re-consider merging netmap
>>> (or something conceptually close to its design).
>>
>> I'm actually working in this direction, of zero-copy RX mapping packets
>> into userspace.  This work is mostly related to page_pool, and I only
>> plan to use XDP as a filter for selecting packets going to userspace,
>> as this choice needs to be taken very early.
>>
>> My design is here:
>>  
>> https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>>
>> This is mostly about changing the memory model in the drivers, to allow
>> for safely mapping pages to userspace.  (An efficient queue mechanism is
>> not covered).
> 
> Virtio virtqueues are used in various other locations in the stack.
> With separate memory pools and send + completion descriptor rings,
> signal moderation, careful avoidance of cacheline bouncing, etc. these
> seem like a good opportunity for a TPACKET_V4 format.
> 

FWIW, after we rejected exposing the register space to user space due to
valid security issues, we fell back to using VFIO, which works nicely for
mapping virtual functions into userspace and VMs. The main drawback is that
user space has to manage the VF, but that is mostly a solved problem at
this point, deployment concerns aside.

We had a prototype TPACKET_V4 version that passed buffers down to the
hardware for use with the DMA engine. This gives zero-copy, but, as with
VFs, it requires the hardware to do all the steering of traffic and any
expected policy in front of the application. Because user space had to
kick the hardware and vice versa, it was somewhat slower, so I didn't
finish it up. The kick was implemented as a syscall, iirc. I can maybe
look at it a bit more next week and see if it's worth reviving now in
this context.

I don't think any of this requires page pools, though. Or rather, another
way to look at it is that tpacket and vhost/virtio already know how to do
page pools.

One idea I've been playing around with is a vhost backend using
tpacketv{3|4} so we don't require socket manipulation.

Thanks,
John
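As a reference point for the ring-based designs discussed above, here is a sketch of how the existing TPACKET_V3 RX ring is sized. It assumes the `struct tpacket_req3` layout from `<linux/if_packet.h>`. It checks only a subset of the geometry rules the kernel applies to PACKET_RX_RING: page-aligned blocks, TPACKET_ALIGNMENT-aligned frames, and a frame count consistent with the block count. The helper names and the 60 ms retire timeout are illustrative. The filled request would then be passed to `setsockopt(fd, SOL_PACKET, PACKET_RX_RING, ...)` after selecting TPACKET_V3 via PACKET_VERSION, and `ring_bytes()` would be the mmap() length.

```c
#include <linux/if_packet.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Fill a TPACKET_V3 RX ring request, rejecting obviously invalid geometry.
 * Returns false if the parameters violate the checked rules. */
static bool fill_ring_req(struct tpacket_req3 *req, unsigned int page_size,
			  unsigned int block_size, unsigned int block_nr,
			  unsigned int frame_size)
{
	if (page_size == 0 || block_size % page_size != 0)
		return false;	/* blocks must be page aligned */
	if (frame_size == 0 || frame_size % TPACKET_ALIGNMENT != 0 ||
	    frame_size > block_size)
		return false;	/* frames must be aligned and fit a block */

	memset(req, 0, sizeof(*req));
	req->tp_block_size = block_size;
	req->tp_block_nr = block_nr;
	req->tp_frame_size = frame_size;
	/* frame_nr must equal frames-per-block times the number of blocks */
	req->tp_frame_nr = (block_size / frame_size) * block_nr;
	req->tp_retire_blk_tov = 60;	/* ms before a partly filled block is retired */
	return true;
}

/* Size of the area to mmap() once the ring has been set up. */
static size_t ring_bytes(const struct tpacket_req3 *req)
{
	return (size_t)req->tp_block_size * req->tp_block_nr;
}
```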


[PATCH v2 net-next 0/8] tcp: tsq: performance series

2016-12-03 Thread Eric Dumazet
Under very high TX stress, the CPU handling NIC TX completions can spend
a considerable number of cycles on TSQ (TCP Small Queues) logic.

This patch series avoids some atomic operations, but the most notable
patch is the 3rd one, which allows other CPUs that process ACK packets and
call tcp_write_xmit() to grab TCP_TSQ_DEFERRED, so that
tcp_tasklet_func() can skip already processed sockets.

This avoids lots of lock acquisitions and cache line accesses,
particularly under load.

In v2, I added:

- A tcp_small_queue_check() change to allow the 1st and 2nd packets
  in the write queue to be sent, even when TX completion of
  already acknowledged packets has not happened yet.
  This helps when TX completion coalescing parameters are set
  even to insane values, and/or busy polling is used.

- A reorganization of struct sock fields to
  lower false sharing and increase data locality.

- Then I moved tsq_flags from tcp_sock to struct sock as well,
  to reduce cache line misses during TX completions.

I measured an overall throughput gain of 22 % for heavy TCP use
over a single TX queue.

Eric Dumazet (8):
  tcp: tsq: add tsq_flags / tsq_enum
  tcp: tsq: remove one locked operation in tcp_wfree()
  tcp: tsq: add shortcut in tcp_tasklet_func()
  tcp: tsq: avoid one atomic in tcp_wfree()
  tcp: tsq: add a shortcut in tcp_small_queue_check()
  tcp: tcp_mtu_probe() is likely to exit early
  net: reorganize struct sock for better data locality
  tcp: tsq: move tsq_flags close to sk_wmem_alloc

 include/linux/tcp.h   | 12 +--
 include/net/sock.h| 51 +++--
 net/ipv4/tcp.c|  4 +--
 net/ipv4/tcp_ipv4.c   |  2 +-
 net/ipv4/tcp_output.c | 91 +++
 net/ipv4/tcp_timer.c  |  4 +--
 net/ipv6/tcp_ipv6.c   |  2 +-
 7 files changed, 98 insertions(+), 68 deletions(-)

-- 
2.8.0.rc3.226.g39d4020




[PATCH v2 net-next 8/8] tcp: tsq: move tsq_flags close to sk_wmem_alloc

2016-12-03 Thread Eric Dumazet
Having tsq_flags in the same cache line as sk_wmem_alloc
makes a lot of sense. Both fields are changed from tcp_wfree()
and more generally by various TSQ related functions.

The prior patch made room in struct sock and added sk_tsq_flags;
this patch deletes tsq_flags from struct tcp_sock.

Signed-off-by: Eric Dumazet 
---
 include/linux/tcp.h   |  1 -
 net/ipv4/tcp.c|  4 ++--
 net/ipv4/tcp_ipv4.c   |  2 +-
 net/ipv4/tcp_output.c | 24 +++-
 net/ipv4/tcp_timer.c  |  4 ++--
 net/ipv6/tcp_ipv6.c   |  2 +-
 6 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d8be083ab0b0..fc5848dad7a4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -186,7 +186,6 @@ struct tcp_sock {
u32 tsoffset;   /* timestamp offset */
 
struct list_head tsq_node; /* anchor in tsq_tasklet.head list */
-   unsigned long   tsq_flags;
 
/* Data for direct copy to user */
struct {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 1149b48700a1..1ef3165114ba 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -663,9 +663,9 @@ static void tcp_push(struct sock *sk, int flags, int mss_now,
if (tcp_should_autocork(sk, skb, size_goal)) {
 
/* avoid atomic op if TSQ_THROTTLED bit is already set */
-   if (!test_bit(TSQ_THROTTLED, &tp->tsq_flags)) {
+   if (!test_bit(TSQ_THROTTLED, &sk->sk_tsq_flags)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
-   set_bit(TSQ_THROTTLED, &tp->tsq_flags);
+   set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
}
/* It is possible TX completion already happened
 * before we set TSQ_THROTTLED.
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index b50f05905ced..30d81f533ada 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -443,7 +443,7 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
if (!sock_owned_by_user(sk)) {
tcp_v4_mtu_reduced(sk);
} else {
-   if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &tp->tsq_flags))
+   if (!test_and_set_bit(TCP_MTU_REDUCED_DEFERRED, &sk->sk_tsq_flags))
sock_hold(sk);
}
goto out;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5f04bee4c86a..b45101f3d2bd 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -767,14 +767,15 @@ static void tcp_tasklet_func(unsigned long data)
list_for_each_safe(q, n, &list) {
tp = list_entry(q, struct tcp_sock, tsq_node);
list_del(&tp->tsq_node);
-   clear_bit(TSQ_QUEUED, &tp->tsq_flags);
 
sk = (struct sock *)tp;
+   clear_bit(TSQ_QUEUED, &sk->sk_tsq_flags);
+
if (!sk->sk_lock.owned &&
-   test_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags)) {
+   test_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags)) {
bh_lock_sock(sk);
if (!sock_owned_by_user(sk)) {
-   clear_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
+   clear_bit(TCP_TSQ_DEFERRED, &sk->sk_tsq_flags);
tcp_tsq_handler(sk);
}
bh_unlock_sock(sk);
@@ -797,16 +798,15 @@ static void tcp_tasklet_func(unsigned long data)
  */
 void tcp_release_cb(struct sock *sk)
 {
-   struct tcp_sock *tp = tcp_sk(sk);
unsigned long flags, nflags;
 
/* perform an atomic operation only if at least one flag is set */
do {
-   flags = tp->tsq_flags;
+   flags = sk->sk_tsq_flags;
if (!(flags & TCP_DEFERRED_ALL))
return;
nflags = flags & ~TCP_DEFERRED_ALL;
-   } while (cmpxchg(&tp->tsq_flags, flags, nflags) != flags);
+   } while (cmpxchg(&sk->sk_tsq_flags, flags, nflags) != flags);
 
if (flags & TCPF_TSQ_DEFERRED)
tcp_tsq_handler(sk);
@@ -878,7 +878,7 @@ void tcp_wfree(struct sk_buff *skb)
if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
goto out;
 
-   for (oval = READ_ONCE(tp->tsq_flags);; oval = nval) {
+   for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
struct tsq_tasklet *tsq;
bool empty;
 
@@ -886,7 +886,7 @@ void tcp_wfree(struct sk_buff *skb)
goto out;
 
nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
-   nval = cmpxchg(&tp->tsq_flags, oval, nval);
+   nval = cmpxchg(&sk->sk_tsq_flags, oval, nval);
if (nval != oval)
continue;
 
@@ -2100,7 +2100,7 @@ static bool 

[PATCH v2 net-next 7/8] net: reorganize struct sock for better data locality

2016-12-03 Thread Eric Dumazet
Group the fields used in the TX path, and keep some cache lines mostly
read to permit sharing among CPUs.

Gained two 4-byte holes on 64-bit arches.

Added a placeholder for TCP tsq_flags, next to sk_wmem_alloc,
to speed up tcp_wfree() in the following patch.

I have not added cacheline_aligned_in_smp; this might be done later.
I prefer doing this once the inet and tcp/udp socket reorg is also done.

Tested with both TCP and UDP.

UDP receiver performance under flood increased by ~20%:
accessing sk_filter/sk_wq/sk_napi_id no longer stalls, because sk_drops
was moved away from a critical cache line, which is now mostly read and shared.

    /* --- cacheline 4 boundary (256 bytes) --- */
    unsigned int               sk_napi_id;           /* 0x100   0x4 */
    int                        sk_rcvbuf;            /* 0x104   0x4 */
    struct sk_filter *         sk_filter;            /* 0x108   0x8 */
    union {
        struct socket_wq *     sk_wq;                /*         0x8 */
        struct socket_wq *     sk_wq_raw;            /*         0x8 */
    };                                               /* 0x110   0x8 */
    struct xfrm_policy *       sk_policy[2];         /* 0x118  0x10 */
    struct dst_entry *         sk_rx_dst;            /* 0x128   0x8 */
    struct dst_entry *         sk_dst_cache;         /* 0x130   0x8 */
    atomic_t                   sk_omem_alloc;        /* 0x138   0x4 */
    int                        sk_sndbuf;            /* 0x13c   0x4 */
    /* --- cacheline 5 boundary (320 bytes) --- */
    int                        sk_wmem_queued;       /* 0x140   0x4 */
    atomic_t                   sk_wmem_alloc;        /* 0x144   0x4 */
    long unsigned int          sk_tsq_flags;         /* 0x148   0x8 */
    struct sk_buff *           sk_send_head;         /* 0x150   0x8 */
    struct sk_buff_head        sk_write_queue;       /* 0x158  0x18 */
    __s32                      sk_peek_off;          /* 0x170   0x4 */
    int                        sk_write_pending;     /* 0x174   0x4 */
    long int                   sk_sndtimeo;          /* 0x178   0x8 */

Signed-off-by: Eric Dumazet 
---
 include/net/sock.h | 51 +++
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 69afda6bea15..6dfe3aa22b97 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -343,6 +343,9 @@ struct sock {
 #define sk_rxhash  __sk_common.skc_rxhash
 
socket_lock_t   sk_lock;
+   atomic_t    sk_drops;
+   int sk_rcvlowat;
+   struct sk_buff_head sk_error_queue;
struct sk_buff_head sk_receive_queue;
/*
 * The backlog queue is special, it is always used with
@@ -359,14 +362,13 @@ struct sock {
struct sk_buff  *tail;
} sk_backlog;
 #define sk_rmem_alloc sk_backlog.rmem_alloc
-   int sk_forward_alloc;
 
-   __u32   sk_txhash;
+   int sk_forward_alloc;
 #ifdef CONFIG_NET_RX_BUSY_POLL
-   unsigned int    sk_napi_id;
unsigned int    sk_ll_usec;
+   /* = mostly read cache line = */
+   unsigned int    sk_napi_id;
 #endif
-   atomic_t    sk_drops;
int sk_rcvbuf;
 
struct sk_filter __rcu  *sk_filter;
@@ -379,11 +381,30 @@ struct sock {
 #endif
struct dst_entry*sk_rx_dst;
struct dst_entry __rcu  *sk_dst_cache;
-   /* Note: 32bit hole on 64bit arches */
-   atomic_t    sk_wmem_alloc;
atomic_t    sk_omem_alloc;
int sk_sndbuf;
+
+   /* = cache line for TX = */
+   int sk_wmem_queued;
+   atomic_t    sk_wmem_alloc;
+   unsigned long   sk_tsq_flags;
+   struct sk_buff  *sk_send_head;
struct sk_buff_head sk_write_queue;
+   __s32   sk_peek_off;
+   int sk_write_pending;
+   long    sk_sndtimeo;
+   struct timer_list   sk_timer;
+   __u32   sk_priority;
+   __u32   sk_mark;
+   u32 sk_pacing_rate; /* bytes per second */
+   u32 sk_max_pacing_rate;
+   struct page_frag    sk_frag;
+   netdev_features_t   sk_route_caps;
+   netdev_features_t   sk_route_nocaps;
+   int sk_gso_type;
+   unsigned int    sk_gso_max_size;
+   gfp_t   sk_allocation;
+   __u32   sk_txhash;
 
/*
 * Because of non atomicity rules, all
@@ -414,42 +435,24 @@ struct sock {
 #define SK_PROTOCOL_MAX U8_MAX

[PATCH v2 net-next 5/8] tcp: tsq: add a shortcut in tcp_small_queue_check()

2016-12-03 Thread Eric Dumazet
Always allow the first two skbs in the write queue to be sent,
regardless of sk_wmem_alloc/sk_pacing_rate values.

This helps a lot in situations where TX completions are delayed either
because of driver latencies or softirq latencies.

Test is done with no cache line misses.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 0db63efe5b8b..d5c46749adab 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2091,6 +2091,15 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
limit <<= factor;
 
if (atomic_read(&sk->sk_wmem_alloc) > limit) {
+   /* Always send the 1st or 2nd skb in write queue.
+* No need to wait for TX completion to call us back,
+* after softirq/tasklet schedule.
+* This helps when TX completions are delayed too much.
+*/
+   if (skb == sk->sk_write_queue.next ||
+   skb->prev == sk->sk_write_queue.next)
+   return false;
+
set_bit(TSQ_THROTTLED, &tcp_sk(sk)->tsq_flags);
/* It is possible TX completion already happened
 * before we set TSQ_THROTTLED, so we must
-- 
2.8.0.rc3.226.g39d4020
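The new test in tcp_small_queue_check() relies on the write queue being a circular list whose head's `next` is the first skb. Under that assumption, `skb->prev == head.next` identifies the second skb. Below is a self-contained sketch in which a hypothetical `struct node` stands in for sk_buff/sk_buff_head.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list with a sentinel head, shaped like
 * struct sk_buff_head (head->next is the first element). */
struct node {
	struct node *next, *prev;
};

static void list_init(struct node *head)
{
	head->next = head->prev = head;
}

/* Append at the tail. */
static void list_append(struct node *head, struct node *n)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* True when n is the first or second element of the queue, the same
 * condition the patch uses to bypass TSQ throttling. */
static bool first_or_second(const struct node *head, const struct node *n)
{
	return n == head->next || n->prev == head->next;
}
```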



[PATCH v2 net-next 3/8] tcp: tsq: add shortcut in tcp_tasklet_func()

2016-12-03 Thread Eric Dumazet
Under high stress, I've seen tcp_tasklet_func() consuming
~700 usec, handling ~150 tcp sockets.

By setting TCP_TSQ_DEFERRED in tcp_wfree(), we give other
CPUs/threads entering tcp_write_xmit() a chance to grab it,
allowing tcp_tasklet_func() to skip sockets that already did
an xmit cycle.

In the future, we might give ACK processing an increased
budget to reduce the amount of work done in tcp_tasklet_func() even further.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4adaf8e1bb63..fa23b688a6f3 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -767,19 +767,19 @@ static void tcp_tasklet_func(unsigned long data)
list_for_each_safe(q, n, &list) {
tp = list_entry(q, struct tcp_sock, tsq_node);
list_del(&tp->tsq_node);
+   clear_bit(TSQ_QUEUED, &tp->tsq_flags);
 
sk = (struct sock *)tp;
-   bh_lock_sock(sk);
-
-   if (!sock_owned_by_user(sk)) {
-   tcp_tsq_handler(sk);
-   } else {
-   /* defer the work to tcp_release_cb() */
-   set_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
+   if (!sk->sk_lock.owned &&
+   test_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags)) {
+   bh_lock_sock(sk);
+   if (!sock_owned_by_user(sk)) {
+   clear_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
+   tcp_tsq_handler(sk);
+   }
+   bh_unlock_sock(sk);
}
-   bh_unlock_sock(sk);
 
-   clear_bit(TSQ_QUEUED, &tp->tsq_flags);
sk_free(sk);
}
 }
@@ -884,7 +884,7 @@ void tcp_wfree(struct sk_buff *skb)
if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
goto out;
 
-   nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED;
+   nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
nval = cmpxchg(&tp->tsq_flags, oval, nval);
if (nval != oval)
continue;
@@ -2229,6 +2229,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
break;
 
+   if (test_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags))
+   clear_bit(TCP_TSQ_DEFERRED, &tp->tsq_flags);
if (tcp_small_queue_check(sk, skb, 0))
break;
 
-- 
2.8.0.rc3.226.g39d4020
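The hand-off introduced by this patch can be modelled with C11 atomics:

- the completion path sets a DEFERRED bit;
- a thread that is already transmitting clears it;
- the tasklet does work only if the bit is still set.

This is a single-flag sketch with a hypothetical F_DEFERRED mask. It omits the socket lock and the `sk_lock.owned` check that the real tcp_tasklet_func() performs.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mask mirroring TCPF_TSQ_DEFERRED. */
#define F_DEFERRED (1UL << 2)

/* Completion path (cf. tcp_wfree): record that a transmit pass is pending. */
static void mark_deferred(atomic_ulong *flags)
{
	atomic_fetch_or(flags, F_DEFERRED);
}

/* A thread about to transmit anyway (cf. tcp_write_xmit) claims the pending
 * work; the plain load avoids a locked RMW when the bit is already clear. */
static void claim_in_xmit(atomic_ulong *flags)
{
	if (atomic_load(flags) & F_DEFERRED)
		atomic_fetch_and(flags, ~F_DEFERRED);
}

/* Tasklet side (cf. tcp_tasklet_func): returns true only if the work is
 * still pending, clearing the bit as it takes ownership. */
static bool tasklet_should_run(atomic_ulong *flags)
{
	if (!(atomic_load(flags) & F_DEFERRED))
		return false;	/* someone already ran an xmit cycle: skip */
	return (atomic_fetch_and(flags, ~F_DEFERRED) & F_DEFERRED) != 0;
}
```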



[PATCH v2 net-next 6/8] tcp: tcp_mtu_probe() is likely to exit early

2016-12-03 Thread Eric Dumazet
Adding a likely() in tcp_mtu_probe() moves its code which used to
be inlined in front of tcp_write_xmit().

We still have a cache line miss when accessing icsk->icsk_mtup.enabled;
we will probably have to reorganize fields to help data locality.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d5c46749adab..5f04bee4c86a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1932,26 +1932,26 @@ static inline void tcp_mtu_check_reprobe(struct sock *sk)
  */
 static int tcp_mtu_probe(struct sock *sk)
 {
-   struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
+   struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb, *nskb, *next;
struct net *net = sock_net(sk);
-   int len;
int probe_size;
int size_needed;
-   int copy;
+   int copy, len;
int mss_now;
int interval;
 
/* Not currently probing/verifying,
 * not in recovery,
 * have enough cwnd, and
-* not SACKing (the variable headers throw things off) */
-   if (!icsk->icsk_mtup.enabled ||
-   icsk->icsk_mtup.probe_size ||
-   inet_csk(sk)->icsk_ca_state != TCP_CA_Open ||
-   tp->snd_cwnd < 11 ||
-   tp->rx_opt.num_sacks || tp->rx_opt.dsack)
+* not SACKing (the variable headers throw things off)
+*/
+   if (likely(!icsk->icsk_mtup.enabled ||
+  icsk->icsk_mtup.probe_size ||
+  inet_csk(sk)->icsk_ca_state != TCP_CA_Open ||
+  tp->snd_cwnd < 11 ||
+  tp->rx_opt.num_sacks || tp->rx_opt.dsack))
return -1;
 
/* Use binary search for probe_size between tcp_mss_base,
-- 
2.8.0.rc3.226.g39d4020
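For readers unfamiliar with the annotation: outside branch-profiling builds, the kernel's likely()/unlikely() expand to GCC/Clang's `__builtin_expect()`. Below is a minimal sketch of the pattern used in this patch. The `maybe_probe()` function is a hypothetical stand-in for tcp_mtu_probe().

```c
/* __builtin_expect() tells GCC/Clang which value a condition usually has;
 * the !! normalizes any scalar to 0 or 1 so it matches the expected value. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical early-exit check in the style of tcp_mtu_probe(): the common
 * "cannot probe" case is marked likely, so the compiler treats the code
 * after it as the cold path. */
static int maybe_probe(int enabled, int cwnd)
{
	if (likely(!enabled || cwnd < 11))
		return -1;
	/* ... expensive probe construction would go here ... */
	return 0;
}
```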



[PATCH v2 net-next 2/8] tcp: tsq: remove one locked operation in tcp_wfree()

2016-12-03 Thread Eric Dumazet
Instead of atomically clearing TSQ_THROTTLED and atomically setting TSQ_QUEUED
bits, use one cmpxchg() to perform a single locked operation.

Since the following patch will also set TCP_TSQ_DEFERRED here,
this cmpxchg() will make this addition free.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8f0289b0fb24..4adaf8e1bb63 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -860,6 +860,7 @@ void tcp_wfree(struct sk_buff *skb)
 {
struct sock *sk = skb->sk;
struct tcp_sock *tp = tcp_sk(sk);
+   unsigned long flags, nval, oval;
int wmem;
 
/* Keep one reference on sk_wmem_alloc.
@@ -877,11 +878,17 @@ void tcp_wfree(struct sk_buff *skb)
if (wmem >= SKB_TRUESIZE(1) && this_cpu_ksoftirqd() == current)
goto out;
 
-   if (test_and_clear_bit(TSQ_THROTTLED, &tp->tsq_flags) &&
-   !test_and_set_bit(TSQ_QUEUED, &tp->tsq_flags)) {
-   unsigned long flags;
+   for (oval = READ_ONCE(tp->tsq_flags);; oval = nval) {
struct tsq_tasklet *tsq;
 
+   if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
+   goto out;
+
+   nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED;
+   nval = cmpxchg(&tp->tsq_flags, oval, nval);
+   if (nval != oval)
+   continue;
+
/* queue this socket to tasklet queue */
local_irq_save(flags);
tsq = this_cpu_ptr(&tsq_tasklet);
-- 
2.8.0.rc3.226.g39d4020
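Below is a userspace sketch of the single-cmpxchg transition. It uses C11 `atomic_compare_exchange_weak()` in place of the kernel's cmpxchg(). F_THROTTLED and F_QUEUED are hypothetical masks mirroring TSQF_THROTTLED and TSQF_QUEUED, and `throttled_to_queued()` is an illustrative name.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical masks mirroring TSQF_THROTTLED / TSQF_QUEUED. */
#define F_THROTTLED (1UL << 0)
#define F_QUEUED    (1UL << 1)

/* One compare-and-swap replaces test_and_clear_bit(THROTTLED) followed by
 * test_and_set_bit(QUEUED). Returns true if the caller must queue the
 * socket (THROTTLED was set and QUEUED was not). */
static bool throttled_to_queued(atomic_ulong *flags)
{
	unsigned long oval = atomic_load(flags);

	for (;;) {
		if (!(oval & F_THROTTLED) || (oval & F_QUEUED))
			return false;
		unsigned long nval = (oval & ~F_THROTTLED) | F_QUEUED;
		/* On failure, oval is refreshed with the current value. */
		if (atomic_compare_exchange_weak(flags, &oval, nval))
			return true;
	}
}
```

The weak variant may fail spuriously, and the loop absorbs that. On failure it also reloads `oval`, which plays the role of `oval = nval` in the patch.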



[PATCH v2 net-next 1/8] tcp: tsq: add tsq_flags / tsq_enum

2016-12-03 Thread Eric Dumazet
This is a cleanup, to ease code review of following patches.

Old 'enum tsq_flags' is renamed, and a new enumeration is added
with the flags used in cmpxchg() operations as opposed to
single bit operations.

Signed-off-by: Eric Dumazet 
---
 include/linux/tcp.h   | 11 ++-
 net/ipv4/tcp_output.c | 16 
 2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 734bab4c3bef..d8be083ab0b0 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -364,7 +364,7 @@ struct tcp_sock {
u32 *saved_syn;
 };
 
-enum tsq_flags {
+enum tsq_enum {
TSQ_THROTTLED,
TSQ_QUEUED,
TCP_TSQ_DEFERRED,  /* tcp_tasklet_func() found socket was owned */
@@ -375,6 +375,15 @@ enum tsq_flags {
*/
 };
 
+enum tsq_flags {
+   TSQF_THROTTLED  = (1UL << TSQ_THROTTLED),
+   TSQF_QUEUED = (1UL << TSQ_QUEUED),
+   TCPF_TSQ_DEFERRED   = (1UL << TCP_TSQ_DEFERRED),
+   TCPF_WRITE_TIMER_DEFERRED   = (1UL << TCP_WRITE_TIMER_DEFERRED),
+   TCPF_DELACK_TIMER_DEFERRED  = (1UL << TCP_DELACK_TIMER_DEFERRED),
+   TCPF_MTU_REDUCED_DEFERRED   = (1UL << TCP_MTU_REDUCED_DEFERRED),
+};
+
 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
 {
return (struct tcp_sock *)sk;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index c7adcb57654e..8f0289b0fb24 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -784,10 +784,10 @@ static void tcp_tasklet_func(unsigned long data)
}
 }
 
-#define TCP_DEFERRED_ALL ((1UL << TCP_TSQ_DEFERRED) |  \
- (1UL << TCP_WRITE_TIMER_DEFERRED) |   \
- (1UL << TCP_DELACK_TIMER_DEFERRED) |  \
- (1UL << TCP_MTU_REDUCED_DEFERRED))
+#define TCP_DEFERRED_ALL (TCPF_TSQ_DEFERRED |  \
+ TCPF_WRITE_TIMER_DEFERRED |   \
+ TCPF_DELACK_TIMER_DEFERRED |  \
+ TCPF_MTU_REDUCED_DEFERRED)
 /**
  * tcp_release_cb - tcp release_sock() callback
  * @sk: socket
@@ -808,7 +808,7 @@ void tcp_release_cb(struct sock *sk)
nflags = flags & ~TCP_DEFERRED_ALL;
} while (cmpxchg(&tp->tsq_flags, flags, nflags) != flags);
 
-   if (flags & (1UL << TCP_TSQ_DEFERRED))
+   if (flags & TCPF_TSQ_DEFERRED)
tcp_tsq_handler(sk);
 
/* Here begins the tricky part :
@@ -822,15 +822,15 @@ void tcp_release_cb(struct sock *sk)
 */
sock_release_ownership(sk);
 
-   if (flags & (1UL << TCP_WRITE_TIMER_DEFERRED)) {
+   if (flags & TCPF_WRITE_TIMER_DEFERRED) {
tcp_write_timer_handler(sk);
__sock_put(sk);
}
-   if (flags & (1UL << TCP_DELACK_TIMER_DEFERRED)) {
+   if (flags & TCPF_DELACK_TIMER_DEFERRED) {
tcp_delack_timer_handler(sk);
__sock_put(sk);
}
-   if (flags & (1UL << TCP_MTU_REDUCED_DEFERRED)) {
+   if (flags & TCPF_MTU_REDUCED_DEFERRED) {
inet_csk(sk)->icsk_af_ops->mtu_reduced(sk);
__sock_put(sk);
}
-- 
2.8.0.rc3.226.g39d4020
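Pairing bit numbers with masks lets the masks compose in cmpxchg() arithmetic, as TCP_DEFERRED_ALL does in tcp_release_cb(). The C11 sketch below shows that loop: clear all deferred bits, but perform an atomic read-modify-write only if at least one is set. The DEMO_* names are hypothetical mirrors of the enums above.

```c
#include <stdatomic.h>

/* Bit numbers, for single-bit helpers such as set_bit()/test_bit(). */
enum demo_enum {
	DEMO_THROTTLED,
	DEMO_QUEUED,
	DEMO_TSQ_DEFERRED,
	DEMO_WRITE_TIMER_DEFERRED,
};

/* Matching masks, for whole-word arithmetic such as cmpxchg() loops. */
enum demo_flags {
	DEMOF_THROTTLED            = 1UL << DEMO_THROTTLED,
	DEMOF_QUEUED               = 1UL << DEMO_QUEUED,
	DEMOF_TSQ_DEFERRED         = 1UL << DEMO_TSQ_DEFERRED,
	DEMOF_WRITE_TIMER_DEFERRED = 1UL << DEMO_WRITE_TIMER_DEFERRED,
};

#define DEMO_DEFERRED_ALL \
	((unsigned long)(DEMOF_TSQ_DEFERRED | DEMOF_WRITE_TIMER_DEFERRED))

/* Clear every deferred bit in one atomic step, returning early (no
 * read-modify-write at all) when none is set. Returns the bits taken. */
static unsigned long take_deferred(atomic_ulong *flags)
{
	unsigned long old = atomic_load(flags);

	do {
		if (!(old & DEMO_DEFERRED_ALL))
			return 0;
	} while (!atomic_compare_exchange_weak(flags, &old,
					       old & ~DEMO_DEFERRED_ALL));
	return old & DEMO_DEFERRED_ALL;
}
```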



[PATCH v2 net-next 4/8] tcp: tsq: avoid one atomic in tcp_wfree()

2016-12-03 Thread Eric Dumazet
Under high load, tcp_wfree() has an atomic operation trying
to schedule a tasklet over and over.

We only need to schedule it if our per-cpu list was empty.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fa23b688a6f3..0db63efe5b8b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -880,6 +880,7 @@ void tcp_wfree(struct sk_buff *skb)
 
for (oval = READ_ONCE(tp->tsq_flags);; oval = nval) {
struct tsq_tasklet *tsq;
+   bool empty;
 
if (!(oval & TSQF_THROTTLED) || (oval & TSQF_QUEUED))
goto out;
@@ -892,8 +893,10 @@ void tcp_wfree(struct sk_buff *skb)
/* queue this socket to tasklet queue */
local_irq_save(flags);
tsq = this_cpu_ptr(&tsq_tasklet);
+   empty = list_empty(&tsq->head);
list_add(&tp->tsq_node, &tsq->head);
-   tasklet_schedule(&tsq->tasklet);
+   if (empty)
+   tasklet_schedule(&tsq->tasklet);
local_irq_restore(flags);
return;
}
-- 
2.8.0.rc3.226.g39d4020
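The idea of scheduling only on the empty-to-non-empty transition can be sketched single-threaded. `struct work_queue`, `wq_add()` and the `scheduled` counter are hypothetical; the counter stands in for tasklet_schedule(). The real code relies on running with local IRQs disabled on a per-CPU list, which this sketch does not model.

```c
#include <stdbool.h>

struct node {
	struct node *next, *prev;
};

struct work_queue {
	struct node head;       /* circular list with a sentinel head */
	unsigned int scheduled; /* stand-in for tasklet_schedule() calls */
};

static void wq_init(struct work_queue *q)
{
	q->head.next = q->head.prev = &q->head;
	q->scheduled = 0;
}

/* Insert at the front, like list_add(), and "schedule" the consumer only
 * when the list goes from empty to non-empty: if it was already non-empty,
 * a pending run will process the new entry too. */
static void wq_add(struct work_queue *q, struct node *n)
{
	bool empty = (q->head.next == &q->head);

	n->next = q->head.next;
	n->prev = &q->head;
	q->head.next->prev = n;
	q->head.next = n;
	if (empty)
		q->scheduled++;
}

/* Consumer side: count and detach all entries, leaving the list empty. */
static unsigned int wq_drain(struct work_queue *q)
{
	unsigned int count = 0;

	for (struct node *p = q->head.next; p != &q->head; p = p->next)
		count++;
	q->head.next = q->head.prev = &q->head;
	return count;
}
```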



Re: [Patch net-next] act_mirred: fix a typo in get_dev

2016-12-03 Thread Eric Dumazet
On Sat, 2016-12-03 at 10:36 -0800, Cong Wang wrote:
> Cc: Hadar Hen Zion 
> Cc: Jiri Pirko 
> Signed-off-by: Cong Wang 
> ---
>  net/sched/act_mirred.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
> index bb09ba3..2d9fa6e 100644
> --- a/net/sched/act_mirred.c
> +++ b/net/sched/act_mirred.c
> @@ -321,7 +321,7 @@ static int tcf_mirred_device(const struct tc_action *a, struct net *net,
>   int ifindex = tcf_mirred_ifindex(a);
>  
>   *mirred_dev = __dev_get_by_index(net, ifindex);
> - if (!mirred_dev)
> + if (!*mirred_dev)
>   return -EINVAL;
>   return 0;
>  }

Fixes: 255cb30425c0 ("net/sched: act_mirred: Add new tc_action_ops get_dev()")
Acked-by: Eric Dumazet 





Re: [PATCH v2 net-next 1/2] flow dissector: ICMP support

2016-12-03 Thread Tom Herbert
On Sat, Dec 3, 2016 at 2:49 AM, Jiri Pirko  wrote:
> Fri, Dec 02, 2016 at 09:31:41PM CET, simon.hor...@netronome.com wrote:
>>Allow dissection of ICMP(V6) type and code. This re-uses transport layer
>>port dissection code as although ICMP is not a transport protocol and their
>>type and code are not ports this allows sharing of both code and storage.
>>
>>Signed-off-by: Simon Horman 
>>---
>> drivers/net/bonding/bond_main.c |  6 +++--
>> include/linux/skbuff.h  |  5 +
>> include/net/flow_dissector.h| 50 ++---
>> net/core/flow_dissector.c   | 34 +---
>> net/sched/cls_flow.c|  4 ++--
>> 5 files changed, 89 insertions(+), 10 deletions(-)
>>
>>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>index 8029dd4912b6..a6f75cfb2bf7 100644
>>--- a/drivers/net/bonding/bond_main.c
>>+++ b/drivers/net/bonding/bond_main.c
>>@@ -3181,7 +3181,8 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
>>   } else {
>>   return false;
>>   }
>>-  if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 && proto >= 0)
>>+  if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34 &&
>>+  proto >= 0 && !skb_flow_is_icmp_any(skb, proto))
>>   fk->ports.ports = skb_flow_get_ports(skb, noff, proto);
>>
>>   return true;
>>@@ -3209,7 +3210,8 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
>>   return bond_eth_hash(skb);
>>
>>   if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER23 ||
>>-  bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23)
>>+  bond->params.xmit_policy == BOND_XMIT_POLICY_ENCAP23 ||
>>+  flow_keys_are_icmp_any())
>>   hash = bond_eth_hash(skb);
>>   else
>>   hash = (__force u32)flow.ports.ports;
>>diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>index 9c535fbccf2c..44a8f69a9198 100644
>>--- a/include/linux/skbuff.h
>>+++ b/include/linux/skbuff.h
>>@@ -1094,6 +1094,11 @@ u32 __skb_get_poff(const struct sk_buff *skb, void *data,
>> __be32 __skb_flow_get_ports(const struct sk_buff *skb, int thoff, u8 ip_proto,
>>   void *data, int hlen_proto);
>>
>>+static inline bool skb_flow_is_icmp_any(const struct sk_buff *skb, u8 ip_proto)
>>+{
>>+  return flow_protos_are_icmp_any(skb->protocol, ip_proto);
>>+}
>>+
>> static inline __be32 skb_flow_get_ports(const struct sk_buff *skb,
>>   int thoff, u8 ip_proto)
>> {
>>diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
>>index c4f31666afd2..5540dfa18872 100644
>>--- a/include/net/flow_dissector.h
>>+++ b/include/net/flow_dissector.h
>>@@ -2,6 +2,7 @@
>> #define _NET_FLOW_DISSECTOR_H
>>
>> #include 
>>+#include 
>> #include 
>> #include 
>>
>>@@ -89,10 +90,15 @@ struct flow_dissector_key_addrs {
>> };
>>
>> /**
>>- * flow_dissector_key_tp_ports:
>>- *@ports: port numbers of Transport header
>>+ * flow_dissector_key_ports:
>>+ *@ports: port numbers of Transport header or
>>+ *type and code of ICMP header
>>+ *ports: source (high) and destination (low) port numbers
>>  *src: source port number
>>  *dst: destination port number
>>+ *icmp: ICMP type (high) and code (low)
>>+ *type: ICMP type
>>+ *code: ICMP code
>>  */
>> struct flow_dissector_key_ports {
>>   union {
>>@@ -101,6 +107,11 @@ struct flow_dissector_key_ports {
>>   __be16 src;
>>   __be16 dst;
>>   };
>>+  __be16 icmp;
>>+  struct {
>>+  u8 type;
>>+  u8 code;
>>+  };
>
> Digging into this a bit more. I think it would be much nicer not to mix
> up l4 ports and icmp stuff.
>
> How about to have FLOW_DISSECTOR_KEY_ICMP
> and
> struct flow_dissector_key_icmp {
> u8 type;
> u8 code;
> };
>
> Then you can make this structure and struct flow_dissector_key_ports into
> a union in struct flow_keys.
>
> Looks much cleaner to me.
>
I agree, this patch adds too many conditionals into the fast path for
ICMP handling. Neither is there much point in using type and code as
input to the packet hash.

Tom
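Below is a userspace sketch of the layout Jiri suggests. Ports and ICMP type/code stay separate types but share storage through a union, with ip_proto selecting the valid member. The `key_ports`, `key_icmp` and `flow_keys_demo` types are hypothetical, not the kernel's definitions. They use plain `uint16_t` where the kernel uses `__be16`.

```c
#include <stdint.h>

/* Transport ports key: one 32-bit word or separate src/dst halves. */
struct key_ports {
	union {
		uint32_t ports;
		struct {
			uint16_t src;
			uint16_t dst;
		};
	};
};

/* Separate ICMP key, as proposed, instead of ICMP members inside ports. */
struct key_icmp {
	uint8_t type;
	uint8_t code;
};

/* Flow keys overlay both keys; ip_proto says which member is valid. */
struct flow_keys_demo {
	uint8_t ip_proto;
	union {
		struct key_ports tp;
		struct key_icmp icmp;
	};
};
```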

>
>
>>   };
>> };
>>
>>@@ -188,9 +199,42 @@ struct flow_keys_digest {
>> void make_flow_keys_digest(struct flow_keys_digest *digest,
>>  const struct flow_keys *flow);
>>
>>+static inline bool flow_protos_are_icmpv4(__be16 n_proto, u8 ip_proto)
>>+{
>>+  return n_proto == htons(ETH_P_IP) && ip_proto == IPPROTO_ICMP;
>>+}
>>+
>>+static inline bool flow_protos_are_icmpv6(__be16 n_proto, u8 ip_proto)
>>+{
>>+  return n_proto == htons(ETH_P_IPV6) && ip_proto == IPPROTO_ICMPV6;
>>+}
>>+
>>+static inline bool 

Re: [PATCH 1/1] netdev: broadcom: propagate error code

2016-12-03 Thread Michael Chan
On Sat, Dec 3, 2016 at 1:56 AM, Pan Bian  wrote:
> Function bnxt_hwrm_stat_ctx_alloc() always returns 0, even if the call
> to _hwrm_send_message() fails. It may be better to propagate the errors
> to the caller of bnxt_hwrm_stat_ctx_alloc().
>
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188661
>
> Signed-off-by: Pan Bian 

Acked-by: Michael Chan 
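The sketch below shows the general shape of the fix: keep the per-iteration return code and propagate it instead of returning 0 unconditionally. It uses hypothetical names and does not reproduce bnxt_hwrm_stat_ctx_alloc().

```c
#include <errno.h>

/* Hypothetical stand-in for a firmware request that can fail on one ring. */
static int send_fw_request(int ring, int fail_ring)
{
	return ring == fail_ring ? -EIO : 0;
}

/* Allocate a stats context per ring; stop at and return the first error so
 * callers can unwind, reporting how many contexts were set up. */
static int alloc_stat_ctxs(int nr_rings, int fail_ring, int *allocated)
{
	int rc = 0;

	*allocated = 0;
	for (int i = 0; i < nr_rings; i++) {
		rc = send_fw_request(i, fail_ring);
		if (rc)
			break;
		(*allocated)++;
	}
	return rc;
}
```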


[Patch net-next] act_mirred: fix a typo in get_dev

2016-12-03 Thread Cong Wang
Cc: Hadar Hen Zion 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/act_mirred.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index bb09ba3..2d9fa6e 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -321,7 +321,7 @@ static int tcf_mirred_device(const struct tc_action *a, struct net *net,
int ifindex = tcf_mirred_ifindex(a);
 
*mirred_dev = __dev_get_by_index(net, ifindex);
-   if (!mirred_dev)
+   if (!*mirred_dev)
return -EINVAL;
return 0;
 }
-- 
2.1.4
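The typo fixed above checks the out-parameter `mirred_dev` (a `struct net_device **`, which is never NULL here) instead of the device it points to. A small userspace sketch of the difference, where `struct net_device_sim` and `get_by_index_sim()` are hypothetical stand-ins for `struct net_device` and `__dev_get_by_index()`:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct net_device_sim { int ifindex; };

static struct net_device_sim devices[] = { { 1 }, { 2 } };

/* Hypothetical lookup standing in for __dev_get_by_index(): NULL if absent. */
static struct net_device_sim *get_by_index_sim(int ifindex)
{
	for (size_t i = 0; i < sizeof(devices) / sizeof(devices[0]); i++)
		if (devices[i].ifindex == ifindex)
			return &devices[i];
	return NULL;
}

/* Buggy check: tests the out-parameter itself, so a failed lookup passes. */
static int get_dev_buggy(int ifindex, struct net_device_sim **mirred_dev)
{
	*mirred_dev = get_by_index_sim(ifindex);
	if (!mirred_dev)
		return -EINVAL;
	return 0;
}

/* Fixed check: tests the device pointer that was just stored. */
static int get_dev_fixed(int ifindex, struct net_device_sim **mirred_dev)
{
	assert(mirred_dev != NULL);	/* callers always pass a valid pointer */
	*mirred_dev = get_by_index_sim(ifindex);
	if (!*mirred_dev)
		return -EINVAL;
	return 0;
}
```

With the buggy version a missing ifindex returns 0 and leaves the caller holding a NULL device.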



Re: net: use-after-free in worker_thread

2016-12-03 Thread Cong Wang
On Sat, Dec 3, 2016 at 9:41 AM, Cong Wang  wrote:
> On Sat, Dec 3, 2016 at 4:56 AM, Andrey Konovalov  
> wrote:
>> Hi!
>>
>> I'm seeing lots of the following error reports while running the
>> syzkaller fuzzer.
>>
>> Reports appeared when I updated to 3c49de52 (Dec 2) from 2caceb32 (Dec 1).
>>
>> ==
>> BUG: KASAN: use-after-free in worker_thread+0x17d8/0x18a0
>> Read of size 8 at addr 880067f3ecd8 by task kworker/3:1/774
>>
>> page:ea00019fce00 count:1 mapcount:0 mapping:  (null)
>> index:0x880067f39c10 compound_mapcount: 0
>> flags: 0x5004080(slab|head)
>> page dumped because: kasan: bad access detected
>>
>> CPU: 3 PID: 774 Comm: kworker/3:1 Not tainted 4.9.0-rc7+ #66
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>  88006c267838 81f882da 6c25e338 11000d84ce9a
>>  ed000d84ce92 88006c25e340 41b58ab3 8541e198
>>  81f88048 0001 41b58ab3 853d3ee8
>> Call Trace:
>>  [< inline >] __dump_stack lib/dump_stack.c:15
>>  [] dump_stack+0x292/0x398 lib/dump_stack.c:51
>>  [< inline >] describe_address mm/kasan/report.c:262
>>  [] kasan_report_error+0x121/0x560 mm/kasan/report.c:368
>>  [< inline >] kasan_report mm/kasan/report.c:390
>>  [] __asan_report_load8_noabort+0x3e/0x40
>> mm/kasan/report.c:411
>>  [] worker_thread+0x17d8/0x18a0 kernel/workqueue.c:2228
>>  [] kthread+0x323/0x3e0 kernel/kthread.c:209
>>  [] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
>
> Heck... this is the pending work vs. sk_destruct() race. :-/
> We can't wait for the work in RCU callback, let me think about it...

Please try the attached patch. I have only compile-tested it; I can't access
my desktop right now, so I can't do further testing.

Thanks!
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 602e5eb..6f33013 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -322,11 +322,13 @@ static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
sk_mem_charge(sk, skb->truesize);
 }
 
-static void __netlink_sock_destruct(struct sock *sk)
+static void netlink_sock_destruct(struct sock *sk)
 {
struct netlink_sock *nlk = nlk_sk(sk);
 
if (nlk->cb_running) {
+   if (nlk->cb.done)
+   nlk->cb.done(&nlk->cb);
module_put(nlk->cb.module);
kfree_skb(nlk->cb.skb);
}
@@ -343,28 +345,6 @@ static void __netlink_sock_destruct(struct sock *sk)
WARN_ON(nlk_sk(sk)->groups);
 }
 
-static void netlink_sock_destruct_work(struct work_struct *work)
-{
-   struct netlink_sock *nlk = container_of(work, struct netlink_sock,
-   work);
-
-   nlk->cb.done(&nlk->cb);
-   __netlink_sock_destruct(&nlk->sk);
-}
-
-static void netlink_sock_destruct(struct sock *sk)
-{
-   struct netlink_sock *nlk = nlk_sk(sk);
-
-   if (nlk->cb_running && nlk->cb.done) {
-   INIT_WORK(&nlk->work, netlink_sock_destruct_work);
-   schedule_work(&nlk->work);
-   return;
-   }
-
-   __netlink_sock_destruct(sk);
-}
-
 /* This lock without WQ_FLAG_EXCLUSIVE is good on UP and it is _very_ bad on
  * SMP. Look, when several writers sleep and reader wakes them up, all but one
  * immediately hit write lock and grab all the cpus. Exclusive sleep solves
@@ -664,11 +644,19 @@ static int netlink_create(struct net *net, struct socket *sock, int protocol,
goto out;
 }
 
+static void netlink_sock_put_work(struct work_struct *work)
+{
+   struct netlink_sock *nlk = container_of(work, struct netlink_sock,
+   work);
+   sock_put(&nlk->sk);
+}
+
 static void deferred_put_nlk_sk(struct rcu_head *head)
 {
struct netlink_sock *nlk = container_of(head, struct netlink_sock, rcu);
 
-   sock_put(&nlk->sk);
+   INIT_WORK(&nlk->work, netlink_sock_put_work);
+   schedule_work(&nlk->work);
 }
 
 static int netlink_release(struct socket *sock)
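As I read the report and the patch, the old code scheduled a work item embedded in the netlink socket from `sk_destruct()` and then returned, so the socket (and the embedded work item) could be freed while the worker still referenced it. The patch instead defers the final `sock_put()` from the RCU callback to a work item, so the destructor, including `cb.done()`, runs in process context. The following single-threaded userspace model shows that ordering; `struct nlk_sim`, `schedule_work_sim()` and the one-slot queue are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct work_sim {
	void (*fn)(struct work_sim *work);
};

static struct work_sim *pending_work;	/* one-slot "workqueue" */

static void schedule_work_sim(struct work_sim *work) { pending_work = work; }

static void run_pending_work_sim(void)
{
	struct work_sim *work = pending_work;

	pending_work = NULL;	/* dequeue before running, as a worker does */
	if (work)
		work->fn(work);	/* the callback may free the object embedding work */
}

struct nlk_sim {
	int refcnt;
	bool cb_running;
	int *done_calls;	/* counts cb.done() invocations */
	struct work_sim work;	/* embedded, like netlink_sock::work */
};

/* Destructor: cb.done() is called directly, then the memory is freed. */
static void sock_destruct_sim(struct nlk_sim *nlk)
{
	if (nlk->cb_running)
		(*nlk->done_calls)++;
	free(nlk);
}

static void sock_put_sim(struct nlk_sim *nlk)
{
	assert(nlk->refcnt > 0);
	if (--nlk->refcnt == 0)
		sock_destruct_sim(nlk);
}

static void sock_put_work_sim(struct work_sim *work)
{
	struct nlk_sim *nlk = (struct nlk_sim *)((char *)work -
						 offsetof(struct nlk_sim, work));

	sock_put_sim(nlk);
}

/* RCU-callback stand-in: defer the final put, not the destructor. */
static void deferred_put_sim(struct nlk_sim *nlk)
{
	nlk->work.fn = sock_put_work_sim;
	schedule_work_sim(&nlk->work);
}

/* Returns how many times done() ran, or -1 on early run or allocation failure. */
static int simulate_release(bool cb_running)
{
	int done_calls = 0;
	struct nlk_sim *nlk = malloc(sizeof(*nlk));

	if (!nlk)
		return -1;
	nlk->refcnt = 1;
	nlk->cb_running = cb_running;
	nlk->done_calls = &done_calls;
	deferred_put_sim(nlk);
	if (done_calls != 0)
		return -1;	/* nothing may run from the RCU callback itself */
	run_pending_work_sim();
	return done_calls;
}
```

Freeing the object from inside its own work callback is acceptable because the worker has already dequeued the item before invoking it.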


Re: Possible regression due to "net/sched: cls_flower: Add offload support using egress Hardware device"

2016-12-03 Thread Simon Horman
On Sat, Dec 03, 2016 at 06:06:08PM +0200, Or Gerlitz wrote:
> On Sat, Dec 3, 2016 at 3:17 PM, Simon Horman  wrote:
> 
> > in net-next I am observing what appears to be a regression due to:
> > 7091d8c7055d ("net/sched: cls_flower: Add offload support using egress Hardware device")
> >
> > The problem occurs when adding a flower filter (without offload) to a virtio device.
> 
> > # ethtool -d eth0
> > ethtool -i eth0
> > driver: virtio_net
> 
> > # tc qdisc add dev eth0 ingress
> > # tc filter add dev eth0 protocol ip parent : flower indev eth0
> > [  104.302779] BUG: unable to handle kernel NULL pointer dereference at 
> > 00d5
> 
> Simon, I don't see an action here; is that missing on purpose? I wasn't
> sure what such a filter does, but of course
> if it was supported before the patches, it has to be supported after
> the change as well. We will check and fix.
> 

Hi Or,

sorry, I trimmed the action by mistake when trying to make a minimal test
case. Originally I noticed this bug with something like the following:

tc filter add dev eth0 protocol ip parent : flower indev eth0 ip_proto udp dst_port 53 action drop

I don't think the fields flower is matching on or the action are
particularly important when trying to reproduce the problem. But I could be
wrong.


Re: net: use-after-free in worker_thread

2016-12-03 Thread Cong Wang
On Sat, Dec 3, 2016 at 4:56 AM, Andrey Konovalov  wrote:
> Hi!
>
> I'm seeing lots of the following error reports while running the
> syzkaller fuzzer.
>
> Reports appeared when I updated to 3c49de52 (Dec 2) from 2caceb32 (Dec 1).
>
> ==
> BUG: KASAN: use-after-free in worker_thread+0x17d8/0x18a0
> Read of size 8 at addr 880067f3ecd8 by task kworker/3:1/774
>
> page:ea00019fce00 count:1 mapcount:0 mapping:  (null)
> index:0x880067f39c10 compound_mapcount: 0
> flags: 0x5004080(slab|head)
> page dumped because: kasan: bad access detected
>
> CPU: 3 PID: 774 Comm: kworker/3:1 Not tainted 4.9.0-rc7+ #66
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>  88006c267838 81f882da 6c25e338 11000d84ce9a
>  ed000d84ce92 88006c25e340 41b58ab3 8541e198
>  81f88048 0001 41b58ab3 853d3ee8
> Call Trace:
>  [< inline >] __dump_stack lib/dump_stack.c:15
>  [] dump_stack+0x292/0x398 lib/dump_stack.c:51
>  [< inline >] describe_address mm/kasan/report.c:262
>  [] kasan_report_error+0x121/0x560 mm/kasan/report.c:368
>  [< inline >] kasan_report mm/kasan/report.c:390
>  [] __asan_report_load8_noabort+0x3e/0x40
> mm/kasan/report.c:411
>  [] worker_thread+0x17d8/0x18a0 kernel/workqueue.c:2228
>  [] kthread+0x323/0x3e0 kernel/kthread.c:209
>  [] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

Heck... this is the pending work vs. sk_destruct() race. :-/
We can't wait for the work in RCU callback, let me think about it...


Re: [PATCH v2 net] ixgbevf: fix invalid uses of napi_hash_del()

2016-12-03 Thread Jeff Kirsher
On Sat, 2016-12-03 at 07:00 -0800, Eric Dumazet wrote:
> On Wed, 2016-11-16 at 07:26 -0800, Eric Dumazet wrote:
> > From: Eric Dumazet 
> > 
> > Calling napi_hash_del() before netif_napi_del() is dangerous
> > if a synchronize_rcu() is not enforced before NAPI struct freeing.
> > 
> > Let's leave this detail to the core networking stack to get right.
> > 
> > Signed-off-by: Eric Dumazet 
> > Cc: Jeff Kirsher 
> > ---
> >   drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |    6 --
> >   1 file changed, 6 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> > index 7eaac3234049..bf4d7efc7dbd 100644
> > --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> > +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> > @@ -2511,9 +2511,6 @@ static int ixgbevf_alloc_q_vectors(struct ixgbevf_adapter *adapter)
> >    while (q_idx) {
> >    q_idx--;
> >    q_vector = adapter->q_vector[q_idx];
> > -#ifdef CONFIG_NET_RX_BUSY_POLL
> > - napi_hash_del(&q_vector->napi);
> > -#endif
> >    netif_napi_del(&q_vector->napi);
> >    kfree(q_vector);
> >    adapter->q_vector[q_idx] = NULL;
> > @@ -2537,9 +2534,6 @@ static void ixgbevf_free_q_vectors(struct ixgbevf_adapter *adapter)
> >    struct ixgbevf_q_vector *q_vector = adapter->q_vector[q_idx];
> >   
> >    adapter->q_vector[q_idx] = NULL;
> > -#ifdef CONFIG_NET_RX_BUSY_POLL
> > - napi_hash_del(&q_vector->napi);
> > -#endif
> >    netif_napi_del(&q_vector->napi);
> >    kfree(q_vector);
> >    }
> > 
> > 
> 
> It looks like this patch was not picked up?

Yeah, sorry, I missed it since it was not sent to the intel-wired-lan mailing
list. Dave, I am fine if you want to pick this up for your net tree (if it
is not too late).
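Why calling `napi_hash_del()` first is dangerous can be modelled in userspace, assuming (from my reading of `net/core/dev.c` of this period, not stated in the thread) that `netif_napi_del()` only waits for an RCU grace period via `synchronize_net()` when its own `napi_hash_del()` call actually unhashed the NAPI. All `*_sim` names are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

struct napi_sim {
	bool hashed;	/* reachable by RCU readers (busy polling) via the hash */
};

static int grace_periods;	/* counts synchronize_net() stand-ins */

static void synchronize_net_sim(void) { grace_periods++; }

/* Returns true only if this call removed the NAPI from the hash. */
static bool napi_hash_del_sim(struct napi_sim *n)
{
	bool was_hashed = n->hashed;

	n->hashed = false;
	return was_hashed;
}

/* Assumed core behaviour: wait for readers only if this call unhashed it. */
static void netif_napi_del_sim(struct napi_sim *n)
{
	if (napi_hash_del_sim(n))
		synchronize_net_sim();
}

/* Returns the number of grace periods that elapse before the free. */
static int teardown(bool driver_unhashes_first)
{
	struct napi_sim n = { .hashed = true };
	int before = grace_periods;

	if (driver_unhashes_first)
		napi_hash_del_sim(&n);	/* the pattern removed by the patch */
	netif_napi_del_sim(&n);
	assert(!n.hashed);
	return grace_periods - before;	/* kfree(q_vector) would follow here */
}
```

When the driver unhashes first, the core sees nothing left to unhash, skips the grace period, and the subsequent `kfree(q_vector)` can race with RCU readers.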



Re: [PATCH 1/1] net: caif: fix ineffective error check

2016-12-03 Thread Pan Bian
From: PanBian 

Hello Sergei,

On Sat, Dec 03, 2016 at 04:17:51PM +0300, Sergei Shtylyov wrote:
> Hello.
> 
> On 12/3/2016 2:18 PM, Pan Bian wrote:
> 
> >In function caif_sktinit_module(), the check of the return value of
> >sock_register() seems ineffective. This patch fixes it.
> >
> >Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=188751
> >
> >Signed-off-by: Pan Bian 
> >---
> > net/caif/caif_socket.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
> >index aa209b1..2a689a3 100644
> >--- a/net/caif/caif_socket.c
> >+++ b/net/caif/caif_socket.c
> >@@ -1108,7 +1108,7 @@ static int caif_create(struct net *net, struct socket *sock, int protocol,
> > static int __init caif_sktinit_module(void)
> > {
> > int err = sock_register(_family_ops);
> >-if (!err)
> >+if (err)
> > return err;
> 
>Why not just:
> 
>   return sock_register(_family_ops);
>
Your solution looks much cleaner.

But I am not really sure whether it is the author's intention to
return 0 anyway. Do you have any idea?

Thanks!
> > return 0;
> > }
> 
> MBR, Sergei
> 

Best regards,
Pan
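The three variants discussed in this thread can be compared with a small userspace sketch. `register_ok()` and `register_fail()` are hypothetical stand-ins for `sock_register()` succeeding or failing, and `-EEXIST` is just one plausible error used for illustration. The original code returns 0 on both paths, while the patched version and Sergei's one-liner behave identically.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for sock_register() succeeding and failing. */
static int register_ok(void)   { return 0; }
static int register_fail(void) { return -EEXIST; }

/* Original: both paths return 0, so a failure is swallowed. */
static int init_original(int (*reg)(void))
{
	int err = reg();

	if (!err)
		return err;
	return 0;
}

/* Patched: return the error when there is one. */
static int init_patched(int (*reg)(void))
{
	int err = reg();

	if (err)
		return err;
	return 0;
}

/* Sergei's suggestion: equivalent to the patched version. */
static int init_simplified(int (*reg)(void))
{
	return reg();
}
```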



Re: [flamebait] xdp, well meaning but pointless

2016-12-03 Thread Willem de Bruijn
On Fri, Dec 2, 2016 at 12:22 PM, Jesper Dangaard Brouer
 wrote:
>
> On Thu, 1 Dec 2016 10:11:08 +0100 Florian Westphal  wrote:
>
>> In light of DPDK's existence it makes a lot more sense to me to provide
>> a) a faster mmap-based interface (possibly AF_PACKET based) that allows
>> mapping the NIC directly into userspace, detaching the tx/rx queue from the kernel.
>>
>> John Fastabend sent something like this last year as a proof of
>> concept, iirc it was rejected because register space got exposed directly
>> to userspace.  I think we should re-consider merging netmap
>> (or something conceptually close to its design).
>
> I'm actually working in this direction, of zero-copy RX mapping packets
> into userspace.  This work is mostly related to page_pool, and I only
> plan to use XDP as a filter for selecting packets going to userspace,
> as this choice needs to be made very early.
>
> My design is here:
>  
> https://prototype-kernel.readthedocs.io/en/latest/vm/page_pool/design/memory_model_nic.html
>
> This is mostly about changing the memory model in the drivers, to allow
> for safely mapping pages to userspace.  (An efficient queue mechanism is
> not covered).

Virtio virtqueues are used in various other locations in the stack.
With separate memory pools and send + completion descriptor rings,
signal moderation, careful avoidance of cacheline bouncing, etc. these
seem like a good opportunity for a TPACKET_V4 format.
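To make the "send + completion descriptor rings" idea concrete, here is a generic single-producer/single-consumer ring sketch. It is not the virtio virtqueue or any TPACKET layout; `struct ring`, `struct desc` and `RING_SIZE` are hypothetical. It is single-threaded as written: real concurrent use would need acquire/release ordering on the indices.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u	/* must be a power of two so that masking works */
static_assert((RING_SIZE & (RING_SIZE - 1)) == 0, "RING_SIZE must be a power of two");

struct desc {
	uint64_t addr;	/* e.g. an offset into a shared packet area */
	uint32_t len;
};

/* Producer and consumer indices sit on separate cache lines so each side
 * mostly writes only its own line (the cacheline-bouncing concern). */
struct ring {
	_Alignas(64) uint32_t prod;	/* written only by the producer */
	_Alignas(64) uint32_t cons;	/* written only by the consumer */
	struct desc slots[RING_SIZE];
};

static bool ring_push(struct ring *r, struct desc d)
{
	if (r->prod - r->cons == RING_SIZE)
		return false;	/* full; free-running indices are wrap-safe */
	r->slots[r->prod & (RING_SIZE - 1)] = d;
	r->prod++;
	return true;
}

static bool ring_pop(struct ring *r, struct desc *out)
{
	if (r->prod == r->cons)
		return false;	/* empty */
	*out = r->slots[r->cons & (RING_SIZE - 1)];
	r->cons++;
	return true;
}

/* Queue n send descriptors, let a pretend device consume them and post
 * completions, then count the completions the sender reaps. */
static int tx_roundtrip(int n)
{
	struct ring tx = { 0 }, completion = { 0 };
	struct desc d;
	int reaped = 0;

	for (int i = 0; i < n; i++)
		ring_push(&tx, (struct desc){ .addr = (uint64_t)i * 2048, .len = 64 });
	while (ring_pop(&tx, &d))
		ring_push(&completion, d);
	while (ring_pop(&completion, &d))
		reaped++;
	return reaped;
}
```

Separating the completion ring from the send ring lets the sender recycle buffers without touching the device-side index, which is one reason such designs suit moderated signalling.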


Re: Possible regression due to "net/sched: cls_flower: Add offload support using egress Hardware device"

2016-12-03 Thread Or Gerlitz
On Sat, Dec 3, 2016 at 3:17 PM, Simon Horman  wrote:

> in net-next I am observing what appears to be a regression due to:
> 7091d8c7055d ("net/sched: cls_flower: Add offload support using egress Hardware device")
>
> The problem occurs when adding a flower filter (without offload) to a virtio device.

> # ethtool -d eth0
> ethtool -i eth0
> driver: virtio_net

> # tc qdisc add dev eth0 ingress
> # tc filter add dev eth0 protocol ip parent : flower indev eth0
> [  104.302779] BUG: unable to handle kernel NULL pointer dereference at 
> 00d5

Simon, I don't see an action here; is that missing on purpose? I wasn't
sure what such a filter does, but of course
if it was supported before the patches, it has to be supported after
the change as well. We will check and fix.

Or.


[patch net-next v4 06/10] rocker: Implement FIB offload in deferred work

2016-12-03 Thread Jiri Pirko
From: Ido Schimmel 

Convert rocker to offload FIBs in deferred work in a similar fashion to
mlxsw, which was converted in the previous commits.

Signed-off-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker_main.c  | 58 +-
 drivers/net/ethernet/rocker/rocker_ofdpa.c |  1 +
 2 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 424be96..914e9e1 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2166,28 +2166,70 @@ static const struct switchdev_ops rocker_port_switchdev_ops = {
.switchdev_port_obj_dump= rocker_port_obj_dump,
 };
 
-static int rocker_router_fib_event(struct notifier_block *nb,
-  unsigned long event, void *ptr)
+struct rocker_fib_event_work {
+   struct work_struct work;
+   struct fib_entry_notifier_info fen_info;
+   struct rocker *rocker;
+   unsigned long event;
+};
+
+static void rocker_router_fib_event_work(struct work_struct *work)
 {
-   struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
-   struct fib_entry_notifier_info *fen_info = ptr;
+   struct rocker_fib_event_work *fib_work =
+   container_of(work, struct rocker_fib_event_work, work);
+   struct rocker *rocker = fib_work->rocker;
int err;
 
-   switch (event) {
+   /* Protect internal structures from changes */
+   rtnl_lock();
+   switch (fib_work->event) {
case FIB_EVENT_ENTRY_ADD:
-   err = rocker_world_fib4_add(rocker, fen_info);
+   err = rocker_world_fib4_add(rocker, &fib_work->fen_info);
if (err)
rocker_world_fib4_abort(rocker);
-   else
+   fib_info_put(fib_work->fen_info.fi);
break;
case FIB_EVENT_ENTRY_DEL:
-   rocker_world_fib4_del(rocker, fen_info);
+   rocker_world_fib4_del(rocker, &fib_work->fen_info);
+   fib_info_put(fib_work->fen_info.fi);
break;
case FIB_EVENT_RULE_ADD: /* fall through */
case FIB_EVENT_RULE_DEL:
rocker_world_fib4_abort(rocker);
break;
}
+   rtnl_unlock();
+   kfree(fib_work);
+}
+
+/* Called with rcu_read_lock() */
+static int rocker_router_fib_event(struct notifier_block *nb,
+  unsigned long event, void *ptr)
+{
+   struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
+   struct rocker_fib_event_work *fib_work;
+
+   fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
+   if (WARN_ON(!fib_work))
+   return NOTIFY_BAD;
+
+   INIT_WORK(&fib_work->work, rocker_router_fib_event_work);
+   fib_work->rocker = rocker;
+   fib_work->event = event;
+
+   switch (event) {
+   case FIB_EVENT_ENTRY_ADD: /* fall through */
+   case FIB_EVENT_ENTRY_DEL:
+   memcpy(&fib_work->fen_info, ptr, sizeof(fib_work->fen_info));
+   /* Take reference on fib_info to prevent it from being
+* freed while work is queued. Release it afterwards.
+*/
+   fib_info_hold(fib_work->fen_info.fi);
+   break;
+   }
+
+   queue_work(rocker->rocker_owq, &fib_work->work);
+
return NOTIFY_DONE;
 }
 
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 4ca4613..7cd76b6 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -2516,6 +2516,7 @@ static void ofdpa_fini(struct rocker *rocker)
int bkt;
 
del_timer_sync(&ofdpa->fdb_cleanup_timer);
+   flush_workqueue(rocker->rocker_owq);
 
spin_lock_irqsave(&ofdpa->flow_tbl_lock, flags);
hash_for_each_safe(ofdpa->flow_tbl, bkt, tmp, flow_entry, entry)
-- 
2.7.4
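The deferred-work pattern in this patch (the notifier runs in atomic context, so it only copies the event, takes a reference on the `fib_info`, and queues the work; the worker applies it under RTNL, drops the reference and frees the work item) can be sketched in userspace. The refcounted `struct route_info_sim` and the FIFO list standing in for an ordered workqueue are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PROCESSED 16

/* Hypothetical refcounted route object standing in for struct fib_info. */
struct route_info_sim {
	int refcnt;
	int id;
};

static void route_hold(struct route_info_sim *ri) { ri->refcnt++; }

static void route_put(struct route_info_sim *ri)
{
	assert(ri->refcnt > 0);
	ri->refcnt--;
}

enum fib_event_sim { EV_ENTRY_ADD, EV_ENTRY_DEL };

struct fib_event_work_sim {
	struct fib_event_work_sim *next;
	enum fib_event_sim event;
	struct route_info_sim *ri;	/* reference held while queued */
};

/* FIFO list standing in for an ordered workqueue: events run in order. */
static struct fib_event_work_sim *queue_head, *queue_tail;
static int processed[MAX_PROCESSED];	/* +id for add, -id for delete */

/* Notifier side (atomic context in the kernel): copy, hold, queue. */
static int fib_event_sim(enum fib_event_sim event, struct route_info_sim *ri)
{
	struct fib_event_work_sim *w = calloc(1, sizeof(*w));	/* GFP_ATOMIC there */

	if (!w)
		return -1;
	w->event = event;
	w->ri = ri;
	route_hold(ri);
	if (queue_tail)
		queue_tail->next = w;
	else
		queue_head = w;
	queue_tail = w;
	return 0;
}

/* Worker side (process context): apply each event, drop the ref, free. */
static int fib_event_work_drain_sim(void)
{
	int n = 0;

	while (queue_head) {
		struct fib_event_work_sim *w = queue_head;

		queue_head = w->next;
		if (!queue_head)
			queue_tail = NULL;
		/* the real worker holds rtnl_lock() around this step */
		if (n < MAX_PROCESSED)
			processed[n] = w->event == EV_ENTRY_ADD ? w->ri->id : -w->ri->id;
		n++;
		route_put(w->ri);
		free(w);
	}
	return n;
}
```

An ordered queue matters here: an ADD followed by a DEL for the same route must reach the device in that order.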



[patch net-next v4 10/10] ipv4: fib: Replay events when registering FIB notifier

2016-12-03 Thread Jiri Pirko
From: Ido Schimmel 

Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (e.g., switchdev
drivers) about addition and deletion of routes.

However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.

Solve that by dumping the FIB tables and replaying the events to the
passed notification block. The dump itself is done using RCU in order
not to starve consumers that need RTNL to make progress.

The integrity of the dump is ensured by reading the FIB change sequence
counter before and after the dump under RTNL. This allows us to avoid
the problematic situation in which the dumping process sends an ENTRY_ADD
notification following an ENTRY_DEL generated by another process holding
RTNL.

Callers of the registration function may pass a callback that is
executed in case the dump was inconsistent with current FIB tables.

The number of retries until a consistent dump is achieved is set to a
fixed number to prevent callers from looping for long periods of time.
If the current limit proves to be problematic in the future, it can be
easily converted to be configurable using a sysctl.

Signed-off-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  20 ++-
 drivers/net/ethernet/rocker/rocker_main.c  |   8 +-
 include/net/ip_fib.h   |   3 +-
 net/ipv4/fib_trie.c| 148 -
 4 files changed, 174 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 14bed1d..53126bf 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2027,6 +2027,18 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
return NOTIFY_DONE;
 }
 
+static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
+{
+   struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
+
+   /* Flush pending FIB notifications and then flush the device's
+* table before requesting another dump. The FIB notification
+* block is unregistered, so no need to take RTNL.
+*/
+   mlxsw_core_flush_owq();
+   mlxsw_sp_router_fib_flush(mlxsw_sp);
+}
+
 int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
 {
int err;
@@ -2047,9 +2059,15 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
goto err_neigh_init;
 
mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
-   register_fib_notifier(&mlxsw_sp->fib_nb);
+   err = register_fib_notifier(&mlxsw_sp->fib_nb,
+   mlxsw_sp_router_fib_dump_flush);
+   if (err)
+   goto err_register_fib_notifier;
+
return 0;
 
+err_register_fib_notifier:
+   mlxsw_sp_neigh_fini(mlxsw_sp);
 err_neigh_init:
mlxsw_sp_vrs_fini(mlxsw_sp);
 err_vrs_init:
diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 8c9c90a..7c450b5 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2804,8 +2804,13 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
goto err_alloc_ordered_workqueue;
}
 
+   /* Only FIBs pointing to our own netdevs are programmed into
+* the device, so no need to pass a callback.
+*/
rocker->fib_nb.notifier_call = rocker_router_fib_event;
-   register_fib_notifier(&rocker->fib_nb);
+   err = register_fib_notifier(&rocker->fib_nb, NULL);
+   if (err)
+   goto err_register_fib_notifier;
 
rocker->hw.id = rocker_read64(rocker, SWITCH_ID);
 
@@ -2822,6 +2827,7 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
 err_probe_ports:
unregister_fib_notifier(&rocker->fib_nb);
+err_register_fib_notifier:
destroy_workqueue(rocker->rocker_owq);
 err_alloc_ordered_workqueue:
free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 6c67b93..5f376af 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -221,7 +221,8 @@ enum fib_event_type {
FIB_EVENT_RULE_DEL,
 };
 
-int register_fib_notifier(struct notifier_block *nb);
+int register_fib_notifier(struct notifier_block *nb,
+ void (*cb)(struct notifier_block *nb));
 int unregister_fib_notifier(struct notifier_block *nb);
 int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
   struct fib_notifier_info *info);
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2891356..73a6270 100644

[patch net-next v4 02/10] ipv4: fib: Add fib_info_hold() helper

2016-12-03 Thread Jiri Pirko
From: Ido Schimmel 

As explained in the previous commit, modules are going to need to take a
reference on fib info and then drop it using fib_info_put().

Add the fib_info_hold() helper to make the code more readable and also
symmetric with fib_info_put().

Signed-off-by: Ido Schimmel 
Suggested-by: Jiri Pirko 
Signed-off-by: Jiri Pirko 
---
 include/net/ip_fib.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index f390c3b..6c67b93 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -397,6 +397,11 @@ static inline void fib_combine_itag(u32 *itag, const struct fib_result *res)
 
 void free_fib_info(struct fib_info *fi);
 
+static inline void fib_info_hold(struct fib_info *fi)
+{
+   atomic_inc(&fi->fib_clntref);
+}
+
 static inline void fib_info_put(struct fib_info *fi)
 {
if (atomic_dec_and_test(&fi->fib_clntref))
-- 
2.7.4
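The symmetric hold/put pair added here can be sketched with C11 atomics. `struct fib_info_sim` and the `*_sim` helpers are hypothetical userspace stand-ins; `atomic_fetch_sub()` returning 1 plays the role of the kernel's `atomic_dec_and_test()` returning true, so whoever drops the last reference frees the object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct fib_info's client refcount. */
struct fib_info_sim {
	atomic_int clntref;
};

static int fib_info_frees;	/* counts calls to the free routine */

static struct fib_info_sim *fib_info_alloc_sim(void)
{
	struct fib_info_sim *fi = malloc(sizeof(*fi));

	if (fi)
		atomic_init(&fi->clntref, 1);	/* the creator owns one reference */
	return fi;
}

static void free_fib_info_sim(struct fib_info_sim *fi)
{
	fib_info_frees++;
	free(fi);
}

/* Counterpart of fib_info_hold(): only valid while a reference is held. */
static void fib_info_hold_sim(struct fib_info_sim *fi)
{
	int old = atomic_fetch_add(&fi->clntref, 1);

	assert(old > 0);
}

/* Counterpart of fib_info_put(): the caller dropping the last ref frees. */
static void fib_info_put_sim(struct fib_info_sim *fi)
{
	if (atomic_fetch_sub(&fi->clntref, 1) == 1)
		free_fib_info_sim(fi);
}
```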



[patch net-next v4 00/10] ipv4: fib: Replay events when registering FIB notifier

2016-12-03 Thread Jiri Pirko
From: Jiri Pirko 

Ido says:

In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.

The fundamental problem with this approach is that upon registration
listeners lose all the information previously sent in the chain and
thus have an incomplete view of the FIB tables, which can result in
packet loss. This patchset fixes that by dumping the FIB tables and
replaying notifications previously sent in the chain for the registered
notification block.

The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first eight patches.

The ninth patch adds a change sequence counter to ensure the integrity
of the FIB dump. The last patch adds the dump itself to the FIB chain
registration function and modifies existing listeners to pass a callback
to be executed in case the dump was inconsistent.

---
v3->v4:
- Register the notification block after the dump and protect it using
  the change sequence counter (Hannes Frederic Sowa).
- Since we now integrate the dump into the registration function, drop
  the sysctl to set maximum number of retries and instead set it to a
  fixed number. Let's see if it's really a problem before adding something
  we can never remove.
- For the same reason, dump FIB tables for all net namespaces.
- Add a comment regarding guarantees provided by mutex semantics.

v2->v3:
- Add sysctl to set the number of FIB dump retries (Hannes Frederic Sowa).
- Read the sequence counter under RTNL to ensure synchronization
  between the dump process and other processes changing the routing
  tables (Hannes Frederic Sowa).
- Pass a callback to the dump function to be executed prior to a retry.
- Limit the dump to a single net namespace.

v1->v2:
- Add a sequence counter to ensure the integrity of the FIB dump
  (David S. Miller, Hannes Frederic Sowa).
- Protect notifications from re-ordering in listeners by using an
  ordered workqueue (Hannes Frederic Sowa).
- Introduce fib_info_hold() (Jiri Pirko).
- Relieve rocker from the need to invoke the FIB dump by registering
  to the FIB notification chain prior to ports creation.

Ido Schimmel (10):
  ipv4: fib: Export free_fib_info()
  ipv4: fib: Add fib_info_hold() helper
  mlxsw: core: Create an ordered workqueue for FIB offload
  mlxsw: spectrum_router: Implement FIB offload in deferred work
  rocker: Create an ordered workqueue for FIB offload
  rocker: Implement FIB offload in deferred work
  rocker: Register FIB notifier before creating ports
  ipv4: fib: Convert FIB notification chain to be atomic
  ipv4: fib: Allow for consistent FIB dumping
  ipv4: fib: Replay events when registering FIB notifier

 drivers/net/ethernet/mellanox/mlxsw/core.c |  22 +++
 drivers/net/ethernet/mellanox/mlxsw/core.h |   2 +
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  92 ++--
 drivers/net/ethernet/rocker/rocker.h   |   1 +
 drivers/net/ethernet/rocker/rocker_main.c  |  84 +--
 drivers/net/ethernet/rocker/rocker_ofdpa.c |   1 +
 include/net/ip_fib.h   |   8 +-
 include/net/netns/ipv4.h   |   3 +
 net/ipv4/fib_frontend.c|   2 +
 net/ipv4/fib_semantics.c   |   1 +
 net/ipv4/fib_trie.c| 155 -
 11 files changed, 342 insertions(+), 29 deletions(-)

-- 
2.7.4


