date:20161212

Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)

2016-12-12 Thread Or Gerlitz

On Tue, Dec 13, 2016 at 1:52 AM, Doug Ledford  wrote:
> On 12/9/2016 1:47 AM, Selvin Xavier wrote:
>> This series introduces the RoCE driver for the Broadcom
>> NetXtreme-E 10/25/40/50 gigabit RoCE HCAs.
>> This driver is dependent on the bnxt_en NIC driver and is
>> based on the bnxt_re branch in Doug's repository. bnxt_en changes
>> required for this patch series is already available in this branch.
>>
>> I am preparing a git repository with these changes as per Jason's
>> comment and will share the details later today.
>>
>> v1-> v2:
>>   * The license text in each file updated to reflect Dual license.
>>   * Makefile and Kconfig changes are pushed to the last patch
>>   * Moved bnxt_re_uverbs_abi.h to include/uapi/rdma folder
>>   * Remove duplicate structure definitions from bnxt_re_hsi.h as
>> it is available in the corresponding bnxt_en header file (bnxt_hsi.h)
>>   * Removed some unused code reported during code review.
>>   * Fixed few sparse warnings
>>
>> Doug,
>> Please review and consider applying this to linux-rdma repository.
>
> There are outstanding review comments to be addressed still yet, and the
> v2 patchset doesn't compile for me in 0day testing.  I'm going to bounce
> this one to 4.11.

I made some quick on-the-surface static checkers etc rub on the new
driver (Doug, I used
the bits in your github bnxt_re branch), there are bunch (tons...) of
smatch [1] and sparse [2]
complaints along with few checkpatch [3] things too.  So... addressing
them before this goes upstream?

BTW net-next's drivers/net/ethernet/broadcom/bnxt (where you put the
pre patches for the
RDMA driver) is pretty much clean of that

Or.

[1] smatch

CHECK   drivers/infiniband/hw/bnxt_re/bnxt_re_main.c
  CHECK   drivers/infiniband/hw/bnxt_re/bnxt_re_ib_verbs.c
  CHECK   drivers/infiniband/hw/bnxt_re/bnxt_re_debugfs.c
  CHECK   drivers/infiniband/hw/bnxt_re/bnxt_qplib_res.c
  CHECK   drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c
  CHECK   drivers/infiniband/hw/bnxt_re/bnxt_qplib_sp.c
  CHECK   drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c
drivers/infiniband/hw/bnxt_re/bnxt_qplib_sp.c:185
bnxt_qplib_del_sgid() warn: impossible condition '(*sgid_tbl->hw_id +
index == -1) => (0-65535 == (-1))'
drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:107
bnxt_qplib_rcfw_send_message() warn: test_bit() takes a bit number
drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:116
bnxt_qplib_rcfw_send_message() warn: test_bit() takes a bit number
drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:148
bnxt_qplib_rcfw_send_message() error: double lock 'irqsave:flags'
drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:204
bnxt_qplib_rcfw_send_message() error: double unlock 'irqsave:flags'
drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:121
bnxt_re_unregister_netdev() warn: variable dereferenced before check
'rdev' (see line 118)
drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:140
bnxt_re_register_netdev() warn: variable dereferenced before check
'rdev' (see line 137)
drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:155 bnxt_re_free_msix()
warn: variable dereferenced before check 'rdev' (see line 152)
drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:173
bnxt_re_request_msix() warn: variable dereferenced before check 'rdev'
(see line 171)
drivers/infiniband/hw/bnxt_re/bnxt_qplib_res.c:172
bnxt_qplib_alloc_init_hwq() warn: unsigned '*elements' is never less
than zero.
drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:387
bnxt_qplib_deinit_rcfw() warn: test_bit() takes a bit number
drivers/infiniband/hw/bnxt_re/bnxt_qplib_res.c:488
bnxt_qplib_alloc_sgid_tbl() warn: double check that we're allocating
correct size: 2 vs 4
drivers/infiniband/hw/bnxt_re/bnxt_qplib_res.c:688
bnxt_qplib_alloc_dpi_tbl() warn: consider using resource_size() here
drivers/infiniband/hw/bnxt_re/bnxt_qplib_rcfw.c:496
bnxt_qplib_init_rcfw() warn: test_bit() takes a bit number
./include/linux/mm.h:1385 stack_guard_page_end() warn: bitwise AND
condition is false here
./include/linux/mm.h:1379 vma_growsup() warn: bitwise AND condition is
false here
drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:522 show_rev() info:
ignoring unreachable code.
drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:835 bnxt_re_dev_stop()
warn: was && intended here instead of ||?
drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:1048 bnxt_re_ib_reg()
warn: curly braces intended?
drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:1050 bnxt_re_ib_reg()
warn: inconsistent indenting
drivers/infiniband/hw/bnxt_re/bnxt_re_main.c:1190 bnxt_re_task() warn:
curly braces intended?
drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c:1639 __flush_sq() warn:
variable dereferenced before check 'budget' (see line 1621)
drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c:1852
bnxt_qplib_cq_process_res_ud() warn: shift has higher precedence than
mask
drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c:1861
bnxt_qplib_cq_process_res_ud() warn: inconsistent indenting
drivers/infiniband/hw/bnxt_re/bnxt_qplib_fp.c:2015
bnxt_qplib_cq_process_terminal() warn: variable dereferen

Re: netlink: GPF in sock_sndtimeo

2016-12-12 Thread Richard Guy Briggs

On 2016-12-09 23:40, Cong Wang wrote:
> On Fri, Dec 9, 2016 at 8:13 PM, Cong Wang  wrote:
> > On Fri, Dec 9, 2016 at 3:01 AM, Richard Guy Briggs  wrote:
> >> On 2016-12-08 22:57, Cong Wang wrote:
> >>> On Thu, Dec 8, 2016 at 10:02 PM, Richard Guy Briggs  
> >>> wrote:
> >>> > I also tried to extend Cong Wang's idea to attempt to proactively 
> >>> > respond to a
> >>> > NETLINK_URELEASE on the audit_sock and reset it, but ran into a locking 
> >>> > error
> >>> > stack dump using mutex_lock(&audit_cmd_mutex) in the notifier callback.
> >>> > Eliminating the lock since the sock is dead anways eliminates the error.
> >>> >
> >>> > Is it safe?  I'll resubmit if this looks remotely sane.  Meanwhile I'll 
> >>> > try to
> >>> > get the test case to compile.
> >>>
> >>> It doesn't look safe, because 'audit_sock', 'audit_nlk_portid' and 
> >>> 'audit_pid'
> >>> are updated as a whole and race between audit_receive_msg() and
> >>> NETLINK_URELEASE.
> >>
> >> This is what I expected and why I originally added the mutex lock in the
> >> callback...  The dumps I got were bare with no wrapper identifying the
> >> process context or specific error, so I'm at a bit of a loss how to
> >> solve this (without thinking more about it) other than instinctively
> >> removing the mutex.
> >
> > Netlink notifier can safely be converted to blocking one, I will send
> > a patch.
> >
> > But I seriously doubt you really need NETLINK_URELEASE here,
> > it adds nothing but overhead, b/c the netlink notifier is called on
> > every netlink socket in the system, but for net exit path, that is
> > relatively a slow path.
> >
> > Also, kauditd_send_skb() needs audit_cmd_mutex too.
> 
> Please let me know what you think about the attached patch?
> 
> Thanks!

> commit a12b43ee814625933ff155c20dc863c59cfcf240
> Author: Cong Wang 
> Date:   Fri Dec 9 17:56:42 2016 -0800
> 
> audit: close a race condition on audit_sock
> 
> Signed-off-by: Cong Wang 
> 
> diff --git a/kernel/audit.c b/kernel/audit.c
> index f1ca116..ab947d8 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -423,6 +423,8 @@ static void kauditd_send_skb(struct sk_buff *skb)
>   snprintf(s, sizeof(s), "audit_pid=%d reset", 
> audit_pid);
>   audit_log_lost(s);
>   audit_pid = 0;
> + audit_nlk_portid = 0;
> + sock_put(audit_sock);
>   audit_sock = NULL;
>   } else {
>   pr_warn("re-scheduling(#%d) write to 
> audit_pid=%d\n",
> @@ -899,6 +901,9 @@ static int audit_receive_msg(struct sk_buff *skb, struct 
> nlmsghdr *nlh)
>   audit_log_config_change("audit_pid", new_pid, 
> audit_pid, 1);
>   audit_pid = new_pid;
>   audit_nlk_portid = NETLINK_CB(skb).portid;
> + sock_hold(skb->sk);
> + if (audit_sock)
> + sock_put(audit_sock);
>   audit_sock = skb->sk;
>   }
>   if (s.mask & AUDIT_STATUS_RATE_LIMIT) {
> @@ -1167,10 +1172,6 @@ static void __net_exit audit_net_exit(struct net *net)
>  {
>   struct audit_net *aunet = net_generic(net, audit_net_id);
>   struct sock *sock = aunet->nlsk;
> - if (sock == audit_sock) {
> - audit_pid = 0;
> - audit_sock = NULL;
> - }

So how does this not leak memory leaving the sock refcount incremented
by the registered audit daemon when that daemon shuts down normally?


- RGB

--
Richard Guy Briggs 
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635

Re: Synopsys Ethernet QoS

2016-12-12 Thread Giuseppe CAVALLARO


On 12/12/2016 5:25 PM, Niklas Cassel wrote:



On 12/12/2016 11:19 AM, Joao Pinto wrote:

Hi,

Às 1:44 AM de 12/10/2016, Florian Fainelli escreveu:

Le 12/09/16 à 16:16, Andy Shevchenko a écrit :

On Sat, Dec 10, 2016 at 12:52 AM, Florian Fainelli  wrote:


It's kind of sad that customers of that IP (stmmac, amd-xgbe, sxgbe)
did
actually pioneer the upstreaming effort, but it is good to see people
from Synopsys willing to fix that in the future.

Wait, you would like to tell that we have more than 2 drivers for the
same (okay, same vendor) IP?!
It's better to unify them earlier, than have n+ copies.

Unfortunately that is the case, see this email:

https://www.mail-archive.com/netdev@vger.kernel.org/msg142796.html

dwc_eth_qos and stmmac have some overlap. There seems to be work
underway to unify these two to begin with.


P.S. Though, I don't see how sxgbe got in the list. First glance on
the code doesn't show similarities.

Well samsung/sxgbe looks potentially similar to amd/xgbe, but that's
just my cursory look at the code, it may very well be something entirely
different. The descriptor formats just look suspiciously similar.


Thank you for your inputs! Renaming seems to be a hotspot. I agree that maybe
instead of renaming (breaking retro-compatibility as David and Florian
mentioned), the best is to move stmmac to synopsys/ after merging *qos* and
removing it. As Florian mentioned, git is capable of detecting folder 
restructured.

@Rabin Vincent: Hi Rabin. Since Axis is more familiar with the synopsys/*qos*
driver would it be possible for you to make an initial analysis of what has to
be merged into Stmmac? This way the development would speed-up.


I can answer that question.

I've sent out 12 patches to the stmmac driver
(all patches are included in the current net-next tree),


ok I have seen these patches applied, I had just a minor concern about
the  failure when DMA configuration is missing.
In these years, I have noticed that, for this kind of HW, default DMA
configuration is usually good to have a driver working. AHB, AXI
parameters can be provided to have a best tuning or to fix know issues
on some platforms. So IMO, we should relax the check with a warning.
Please, consider that, the stmmac also supports very old MAC10/100
versions where the DMA configuration was often never passed.


with these patches the stmmac driver works properly on Axis hardware
(we use Synopsys GMAC 4.10a synthesized with multiple TX queues).


perfect and thanks a lot for this effort.


stmmac's DT binding has also been extended with properties that
existed in DWC EQoS's DT binding, such as no-pbl-x8, txpbl, rxpbl.

Since we have no problem updating the DTB together with the kernel,
we will simply move to using the start using the stmmac driver,
with stmmac's DT binding.

However, I've noticed that NVIDIA has extended the DWC EQoS DT binding,
I don't how easy it would be for them to switch to stmmac's DT binding.
(Adding Stephen Warren to CC.)


ok



The reset sequence that Lars Persson was worried about is not an issue
with the stmmac driver.



thx for this check.



There are some performance problems with the stmmac driver though:

When running iperf3 with 3 streams:
iperf3 -c 192.168.0.90 -P 3 -t 30
iperf3 -c 192.168.0.90 -P 3 -t 30 -R

I get really bad fairness between the streams.


Can you confirm you are using the 4.xxa version?

This doesn't match with Alex's experiments on ARM platforms.


This appears to be an issue with how TX IRQ coalescing is implemented in stmmac.
Disabling TX IRQ coalescing in the stmmac driver makes the problem go away.


this doesn't match with what we had seen but I am happy and open to
review and accept new strategy.


We have a local patch that implements TX IRQ coalescing in the dwceqos driver,
and we don't see the same problem.


please, if you have new patch add me on CC and we will review all
together.


Also netperf TCP_RR and UDP_RR gives really bad results compared to the
dwceqos driver (without IRQ coalescing).
2000 transactions/sec vs 9000 transactions/sec.
Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac
gives the same performance. I guess it's a trade off, low CPU usage
vs low latency, so I don't know how important TCP_RR/UDP_RR really is.

The best thing would be to get a good working TX IRQ coalesce
implementation with HR timers in stmmac.


as said, welcome patches.

Basically, the default tuning of coalescence parameters comes from
ST platform experiences. I mean, we tuned to driver to have good
performance and saving CPU on SH4 (UP) and ARM (SMP) systems.
In these years, these default was accepted but, if today we need
to change something welcome effort. On my side, I can try to
perform some bench to see if I have regressions on not.


Perhaps it should also be investigated if the RX interrupt watchdog
timeout should have a lower default value.


Do not expect to get many improvements to play with the HW watchdog
due to the poor granulari

Re: stmmac DT property snps,axi_all

2016-12-12 Thread Giuseppe CAVALLARO


Hello Niklas, Alex,

my fault and a step behind... Current code is OK
when manage the AAL that, although it is passed from
the axi structure, it is always used to program,
for all the chip versions, the writable bit inside the
DMA_BUS_MODE register.
So I guess no extra patch is needed.

Regards
Peppe

On 12/12/2016 3:18 PM, Giuseppe CAVALLARO wrote:

Please Niklas

when you send the patch, add my

Acked-by: Giuseppe Cavallaro 


On 12/9/2016 10:53 AM, Niklas Cassel wrote:

On 12/09/2016 10:20 AM, Niklas Cassel wrote:

On 12/08/2016 02:36 PM, Alexandre Torgue wrote:

Hi Niklas,

On 12/05/2016 05:18 PM, Niklas Cassel wrote:

Hello Giuseppe


I'm trying to figure out what snps,axi_all is supposed to represent.

It appears that the value is saved, but never used in the code.

Looking at the register specification, I'm guessing that it represents
Address-Aligned Beats, but there is already the property snps,aal
for that.

IMO, it is not useful. Indeed AXI_AAL is a read only bit (in AXI bus
mode register) and reflects the aal bit in DMA bus register.
As you know we use "snps,aal" to set aal bit in DMA bus register.
So "snps,axi_all" entry seems useless. Let's see with Peppe.

Ok, I see. GMAC and GMAC4 is different here.

For GMAC4 AAL only exists in DMA_SYS_BUS_MODE.
It's not reflected anywhere else.

The code is correct in the driver.

If snps,axi_all is just created for a read-only register,
and it is currently never used in the code,
while we have snps,aal, which is correct and works,
I guess it should be ok to remove snps,axi_all.

I can cook up a patch.



Here we go :)

I will send it as a real patch once net-next reopens.



From defc01cb7c22611b89d9cf1fcae72544092bd62c Mon Sep 17 00:00:00 2001

From: Niklas Cassel 
Date: Fri, 9 Dec 2016 10:27:00 +0100
Subject: [PATCH net-next] net: stmmac: remove unused duplicate property
 snps,axi_all

For core revision 3.x Address-Aligned Beats is available in two
registers.
The DT property snps,aal was created for AAL in the DMA bus register,
which is a read/write bit.
The DT property snps,axi_all was created for AXI_AAL in the AXI bus mode
register, which is a read only bit that reflects the value of AAL in the
DMA bus register.

Since the value of snps,axi_all is never used in the driver,
and since the property was created for a bit that is read only,
it should be safe to remove the property.

Signed-off-by: Niklas Cassel 
---
 Documentation/devicetree/bindings/net/stmmac.txt  | 1 -
 drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c | 1 -
 include/linux/stmmac.h| 1 -
 3 files changed, 3 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/stmmac.txt
b/Documentation/devicetree/bindings/net/stmmac.txt
index 128da752fec9..c3d2fd480a1b 100644
--- a/Documentation/devicetree/bindings/net/stmmac.txt
+++ b/Documentation/devicetree/bindings/net/stmmac.txt
@@ -65,7 +65,6 @@ Optional properties:
 - snps,wr_osr_lmt: max write outstanding req. limit
 - snps,rd_osr_lmt: max read outstanding req. limit
 - snps,kbbe: do not cross 1KiB boundary.
-- snps,axi_all: align address
 - snps,blen: this is a vector of supported burst length.
 - snps,fb: fixed-burst
 - snps,mb: mixed-burst
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index 082cd48db6a7..60ba8993c650 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -121,7 +121,6 @@ static struct stmmac_axi *stmmac_axi_setup(struct
platform_device *pdev)
 axi->axi_lpi_en = of_property_read_bool(np, "snps,lpi_en");
 axi->axi_xit_frm = of_property_read_bool(np, "snps,xit_frm");
 axi->axi_kbbe = of_property_read_bool(np, "snps,axi_kbbe");
-axi->axi_axi_all = of_property_read_bool(np, "snps,axi_all");
 axi->axi_fb = of_property_read_bool(np, "snps,axi_fb");
 axi->axi_mb = of_property_read_bool(np, "snps,axi_mb");
 axi->axi_rb =  of_property_read_bool(np, "snps,axi_rb");
diff --git a/include/linux/stmmac.h b/include/linux/stmmac.h
index 266dab9ad782..889e0e9a3f1c 100644
--- a/include/linux/stmmac.h
+++ b/include/linux/stmmac.h
@@ -103,7 +103,6 @@ struct stmmac_axi {
 u32 axi_wr_osr_lmt;
 u32 axi_rd_osr_lmt;
 bool axi_kbbe;
-bool axi_axi_all;
 u32 axi_blen[AXI_BLEN];
 bool axi_fb;
 bool axi_mb;

Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)

2016-12-12 Thread Michael Chan

On Mon, Dec 12, 2016 at 8:52 PM, Selvin Xavier
 wrote:

>>   CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.c
>>   CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.c
>>   CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
>> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1015:22: warning: context
>> imbalance in 'bnxt_qplib_lock_cqs' - wrong count at exit
>> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1030:28: warning: context
>> imbalance in 'bnxt_qplib_unlock_cqs' - unexpected unlock
> The above two are false positives, since locking and unlocking are
> handled in two separate functions. This is a wrapper to lock/unlock
> both SQ and RQ CQ locks. Functionally it is ok  since
> bnxt_qplib_unlock_cqs is called just after the critical section and
> both locks are freed in order. I think we can ignore this warning.
>
>
You can use __releases() and __acquires() macros to denote these cases
for sparse.

Re: [PATCH V2 18/22] bnxt_re: Support for DCB

2016-12-12 Thread Selvin Xavier

On Sat, Dec 10, 2016 at 7:20 PM, Or Gerlitz  wrote:
> On Fri, Dec 9, 2016 at 8:48 AM, Selvin Xavier
>  wrote:
>> This patch queries the configured RoCE APP Priority on the host
>> using the dcbnl API and programs the RoCE FW with the corresponding
>> Traffic Class(es) for the priority.
>
>> +#define BNXT_RE_ROCE_V1_ETH_TYPE   0x8915
>> +#define BNXT_RE_ROCE_V2_PORT_NO4791
>
> I believe these two are defined already, try # git grep on each under include

Thanks Or for your comments.
V2 port number is defined in ib_verbs.h.  i will include this in the
next patch set.
v1 eth_type is not defined. All vendor drivers have their own definition.

Thanks,
Selvin

[PATCH net] virtio-net: correctly enable multiqueue

2016-12-12 Thread Jason Wang

Commit 4490001029012539937ff02778fe6180613fa949 ("virtio-net: enable
multiqueue by default") blindly set the affinity instead of queues
during probe which can cause a mismatch of #queues between guest and
host. This patch fixes it by setting queues.

Reported-by: Theodore Ts'o 
Tested-by: Theodore Ts'o 
Cc: Neil Horman 
Cc: Michael S. Tsirkin 
Fixes: 49000102901 ("virtio-net: enable multiqueue by default")
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b425fa1..fe9f772 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1930,7 +1930,9 @@ static int virtnet_probe(struct virtio_device *vdev)
goto free_unregister_netdev;
}
 
-   virtnet_set_affinity(vi);
+   rtnl_lock();
+   virtnet_set_queues(vi, vi->curr_queue_pairs);
+   rtnl_unlock();
 
/* Assume link up if device can't report link status,
   otherwise get link status from config. */
-- 
2.7.4

Re: [PATCH V2 13/22] bnxt_re: Support QP verbs

2016-12-12 Thread Selvin Xavier

On Mon, Dec 12, 2016 at 11:57 PM, Leon Romanovsky  wrote:
> It can help to review if you break this function into smaller pieces and
> get rid of switch->switch->if construction.

Thanks Leon. I will address this and your previous comments in v3 patch set.

Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)

2016-12-12 Thread Selvin Xavier

On Mon, Dec 12, 2016 at 10:37 PM, Jason Gunthorpe
 wrote:
> On Sat, Dec 10, 2016 at 11:06:58AM +0530, Selvin Xavier wrote:
>> On Fri, Dec 9, 2016 at 12:17 PM, Selvin Xavier
>>  wrote:
>> > I am preparing a git repository with these changes as per Jason's
>> > comment and will share the details later today.
>>
>> Please use bnxt_re branch in this git repository.
>>
>> https://github.com/Broadcom/linux-rdma-nxt.git
>
> Why are you using __packed in bnxt_re_uverbs_abi.h ? that doesn't seem
> necessary. It is a good idea to make sure all those structures are a
> multiple of 64 bits (add explicit reserved fields), and make sure you
> test 32 bit verbs as well.

Will take care in v3.

>
> Why are you using debugfs just to export counters? Isn't the core code
> counter framework good enough?

I agree that some of the counters exported by this patch set, tx and
rx bytes/pkts etc, can be exported
through the core counters. i will try adding  this support in v3, if
not, will post as a separate patch.
debugfs was introduced more for the future, in case any HW specific
data needs to be displayed.
As of now, it tracks only the count of resources( CQ/MR/QPs) active at
any given point. So its ok to
skip this patch from this series.

>
> Please try and avoid writing functions as defines (eg rdev_to_dev,
> to_bnxt_re, SQE_PG, RCFW_CMDQ_COOKIE, PTR_PG etc)
>
Sure, will take care in v3.

> There is something wrong with the tabs and spaces (see
> https://github.com/Broadcom/linux-rdma-nxt/blob/03e23b087f7e86ea28656273994e065827210ce5/drivers/infiniband/hw/bnxtre/bnxt_re_hsi.h)
>
> FWIW, I really dislike the column alignment style, it is so hard to
> maintain..
This file contains the Macro defines for the FW/HW structures and are
auto-generated. Some of these auto-generated defines are very long
which makes the lines greater than
80 characters. I will fix whatever possible and include in v3 set.

>
> Jason

Thanks,
Selvin

[PATCH iproute2 1/2] tc: flower: Fix typo in the flower man page

2016-12-12 Thread Roi Dayan

Replace vlan_eth_type with vlan_ethtype.

Fixes: 745d91726006 ("tc: flower: Introduce vlan support")
Signed-off-by: Roi Dayan 
Reviewed-by: Hadar Hen Zion 
---
 man/man8/tc-flower.8 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index 90fdfba..1cea54d 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -27,7 +27,7 @@ flower \- flow based traffic control filter
 .IR VID " | "
 .B vlan_prio
 .IR PRIORITY " | "
-.BR vlan_eth_type " { " ipv4 " | " ipv6 " | "
+.BR vlan_ethtype " { " ipv4 " | " ipv6 " | "
 .IR ETH_TYPE " } | "
 .BR ip_proto " { " tcp " | " udp " | " sctp " | " icmp " | " icmpv6 " | "
 .IR IP_PROTO " } | { "
@@ -87,7 +87,7 @@ Match on vlan tag priority.
 .I PRIORITY
 is an unsigned 3bit value in decimal format.
 .TP
-.BI vlan_eth_type " VLAN_ETH_TYPE"
+.BI vlan_ethtype " VLAN_ETH_TYPE"
 Match on layer three protocol.
 .I ETH_TYPE
 may be either
-- 
2.7.4

[PATCH iproute2 2/2] tc: tunnel_key: Add tc-tunnel_key man page to Makefile

2016-12-12 Thread Roi Dayan

To be installed with the other man pages.

Fixes: d57639a475a9 ("tc/act_tunnel: Introduce ip tunnel action")
Signed-off-by: Roi Dayan 
Reviewed-by: Amir Vadai 
---
 man/man8/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/man/man8/Makefile b/man/man8/Makefile
index de6f249..d4cb01a 100644
--- a/man/man8/Makefile
+++ b/man/man8/Makefile
@@ -17,6 +17,7 @@ MAN8PAGES = $(TARGETS) ip.8 arpd.8 lnstat.8 routel.8 rtacct.8 
rtmon.8 rtpr.8 ss.
tc-tcindex.8 tc-u32.8 tc-matchall.8 \
tc-connmark.8 tc-csum.8 tc-mirred.8 tc-nat.8 tc-pedit.8 tc-police.8 \
tc-simple.8 tc-skbedit.8 tc-vlan.8 tc-xt.8  tc-ife.8 tc-skbmod.8 \
+   tc-tunnel_key.8 \
devlink.8 devlink-dev.8 devlink-monitor.8 devlink-port.8 devlink-sb.8
 
 all: $(TARGETS)
-- 
2.7.4

[PATCH iproute2 0/2] Man page fixes

2016-12-12 Thread Roi Dayan

Hi,

The 2 patches are man page related only.
First fixes a typo and second adding missing man page to the Makefile.

Thanks

Roi Dayan (2):
  tc: flower: Fix typo in the flower man page
  tc: tunnel_key: Add tc-tunnel_key man page to Makefile

 man/man8/Makefile| 1 +
 man/man8/tc-flower.8 | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

-- 
2.7.4

Re: [PATCH net-next 00/27] Remove VLAN CFI bit abuse

2016-12-12 Thread Michał Mirosław

On Tue, Dec 13, 2016 at 01:12:32AM +0100, Michał Mirosław wrote:
> Dear NetDevs
> 
> This series removes an abuse of VLAN CFI bit in Linux networking stack.
> Currently Linux always clears the bit on outgoing traffic and presents
> it cleared to userspace (even via AF_PACKET/tcpdump when hw-accelerated).
[...]

I just noticed net-next got closed few days ago. I'll resend after it
opens again.  Nevertheless, review is appreciated.

Best Regards,
Michał Mirosław

Re: [PATCH v2] audit: use proper refcount locking on audit_sock

2016-12-12 Thread Richard Guy Briggs

On 2016-12-12 15:18, Paul Moore wrote:
> On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs  wrote:
> > Resetting audit_sock appears to be racy.
> >
> > audit_sock was being copied and dereferenced without using a refcount on
> > the source sock.
> >
> > Bump the refcount on the underlying sock when we store a refrence in
> > audit_sock and release it when we reset audit_sock.  audit_sock
> > modification needs the audit_cmd_mutex.
> >
> > See: https://lkml.org/lkml/2016/11/26/232
> >
> > Thanks to Eric Dumazet  and Cong Wang
> >  on ideas how to fix it.
> >
> > Signed-off-by: Richard Guy Briggs 
> > ---
> > There has been a lot of change in the audit code that is about to go
> > upstream to address audit queue issues.  This patch is based on the
> > source tree: git://git.infradead.org/users/pcmoore/audit#next
> > ---
> >  kernel/audit.c |   34 --
> >  1 files changed, 28 insertions(+), 6 deletions(-)
> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index f20eee0..439f7f3 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -452,7 +452,9 @@ static void auditd_reset(void)
> > struct sk_buff *skb;
> >
> > /* break the connection */
> > +   sock_put(audit_sock);
> > audit_pid = 0;
> > +   audit_nlk_portid = 0;
> > audit_sock = NULL;
> >
> > /* flush all of the retry queue to the hold queue */
> > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff 
> > *skb)
> > if (rc >= 0) {
> > consume_skb(skb);
> > rc = 0;
> > +   } else {
> > +   if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
> 
> I dislike the way you wrote this because instead of simply looking at
> this to see if it correct I need to sort out all the bits and find out
> if there are other error codes that could run afoul of this check ...
> make it simple, e.g. (rc == -ENOMEM || rc == -EPERM || ...).
> Actually, since EPERM is 1, -EPERM (-1 in two's compliment is
> 0x) is going to cause this to be true for pretty much any
> value of rc, yes?

Yes, you are correct.  We need there a logical or on the results of each
comparison to the return code rather than bit-wise or-ing the result
codes together first to save a step.

> > +   mutex_lock(&audit_cmd_mutex);
> > +   auditd_reset();
> > +   mutex_unlock(&audit_cmd_mutex);
> > +   }
> 
> The code in audit#next handles netlink_unicast() errors in
> kauditd_thread() and you are adding error handling code here in
> kauditd_send_unicast_skb() ... that's messy.  I don't care too much
> where the auditd_reset() call is made, but let's only do it in one
> function; FWIW, I originally put the error handling code in
> kauditd_thread() because there was other error handling code that
> needed to done in that scope so it resulted in cleaner code.

Hmmm, I seem to remember it not returning the return code and I thought
I had changed it to do so, but I see now that it was already there.
Agreed, I needlessly duplicated that error handling.

> Related, I see you are now considering ENOMEM to be a fatal condition,
> that differs from the AUDITD_BAD macro in kauditd_thread(); this
> difference needs to be reconciled.

Also correct about -EPERM now that I check back to the intent of commit
32a1dbaece7e ("audit: try harder to send to auditd upon netlink
failure")

> Finally, you should update the comment header block for auditd_reset()
> that it needs to be called with the audit_cmd_mutex held.

Yup.

> > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, 
> > struct nlmsghdr *nlh)
> > return -EACCES;
> > }
> > if (audit_pid && new_pid &&
> > -   audit_replace(requesting_pid) != -ECONNREFUSED) 
> > {
> > +   (audit_replace(requesting_pid) & 
> > (-ECONNREFUSED|-EPERM|-ENOMEM))) {
> 
> Do we simply want to treat any error here as fatal, and not just
> ECONN/EPERM/ENOMEM?  If not, let's come up with a single macro to
> handle the fatal netlink_unicast() return codes so we have some chance
> to keep things consistent in the future.

I'll work through this before I post another patch...

> paul moore

- RGB

--
Richard Guy Briggs 
Kernel Security Engineering, Base Operating Systems, Red Hat
Remote, Ottawa, Canada
Voice: +1.647.777.2635, Internal: (81) 32635

Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)

2016-12-12 Thread Selvin Xavier

On Mon, Dec 12, 2016 at 10:24 PM, Jonathan Toppins  wrote:
> CHECK   drivers/infiniband/hw/bnxtre/bnxt_re_debugfs.c
>   CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c
> drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c:729:6: warning: symbol
> 'bnxt_qplib_cleanup_pkey_tbl' was not declared. Should it be static?
I will remove this warning in v3 patch set.

>   CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.c
>   CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.c
>   CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1015:22: warning: context
> imbalance in 'bnxt_qplib_lock_cqs' - wrong count at exit
> drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1030:28: warning: context
> imbalance in 'bnxt_qplib_unlock_cqs' - unexpected unlock
The above two are false positives, since locking and unlocking are
handled in two separate functions. This is a wrapper to lock/unlock
both SQ and RQ CQ locks. Functionally it is ok  since
bnxt_qplib_unlock_cqs is called just after the critical section and
both locks are freed in order. I think we can ignore this warning.


>   MODPOST 2 modules

Re: [PATCH v2] audit: use proper refcount locking on audit_sock

2016-12-12 Thread Richard Guy Briggs

On 2016-12-12 12:10, Paul Moore wrote:
> On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs  wrote:
> > Resetting audit_sock appears to be racy.
> >
> > audit_sock was being copied and dereferenced without using a refcount on
> > the source sock.
> >
> > Bump the refcount on the underlying sock when we store a refrence in
> > audit_sock and release it when we reset audit_sock.  audit_sock
> > modification needs the audit_cmd_mutex.
> >
> > See: https://lkml.org/lkml/2016/11/26/232
> >
> > Thanks to Eric Dumazet  and Cong Wang
> >  on ideas how to fix it.
> >
> > Signed-off-by: Richard Guy Briggs 
> > ---
> > There has been a lot of change in the audit code that is about to go
> > upstream to address audit queue issues.  This patch is based on the
> > source tree: git://git.infradead.org/users/pcmoore/audit#next
> > ---
> >  kernel/audit.c |   34 --
> >  1 files changed, 28 insertions(+), 6 deletions(-)
> 
> This is coming in pretty late for the v4.10 merge window, much later
> than I would usually take things, but this is arguably important, and
> (at first glance) relatively low risk - what testing have you done on
> this?

At this point, compile and boot, and I'm able to compile and run the
supplied test code without any issues.

> > diff --git a/kernel/audit.c b/kernel/audit.c
> > index f20eee0..439f7f3 100644
> > --- a/kernel/audit.c
> > +++ b/kernel/audit.c
> > @@ -452,7 +452,9 @@ static void auditd_reset(void)
> > struct sk_buff *skb;
> >
> > /* break the connection */
> > +   sock_put(audit_sock);
> > audit_pid = 0;
> > +   audit_nlk_portid = 0;
> > audit_sock = NULL;
> >
> > /* flush all of the retry queue to the hold queue */
> > @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff 
> > *skb)
> > if (rc >= 0) {
> > consume_skb(skb);
> > rc = 0;
> > +   } else {
> > +   if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
> > +   mutex_lock(&audit_cmd_mutex);
> > +   auditd_reset();
> > +   mutex_unlock(&audit_cmd_mutex);
> > +   }
> > }
> >
> > return rc;
> > @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy)
> >
> > auditd = 0;
> > if (AUDITD_BAD(rc, reschedule)) {
> > +   mutex_lock(&audit_cmd_mutex);
> > auditd_reset();
> > +   mutex_unlock(&audit_cmd_mutex);
> > reschedule = 0;
> > }
> > } else
> > @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy)
> > auditd = 0;
> > if (AUDITD_BAD(rc, reschedule)) {
> > kauditd_hold_skb(skb);
> > +   mutex_lock(&audit_cmd_mutex);
> > auditd_reset();
> > +   mutex_unlock(&audit_cmd_mutex);
> > reschedule = 0;
> > } else
> > /* temporary problem (we hope), 
> > queue
> > @@ -623,7 +635,9 @@ quick_loop:
> > if (rc) {
> > auditd = 0;
> > if (AUDITD_BAD(rc, reschedule)) {
> > +   
> > mutex_lock(&audit_cmd_mutex);
> > auditd_reset();
> > +   
> > mutex_unlock(&audit_cmd_mutex);
> > reschedule = 0;
> > }
> >
> > @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, 
> > struct nlmsghdr *nlh)
> > return -EACCES;
> > }
> > if (audit_pid && new_pid &&
> > -   audit_replace(requesting_pid) != -ECONNREFUSED) 
> > {
> > +   (audit_replace(requesting_pid) & 
> > (-ECONNREFUSED|-EPERM|-ENOMEM))) {
> > audit_log_config_change("audit_pid", 
> > new_pid, audit_pid, 0);
> > return -EEXIST;
> > }
> > if (audit_enabled != AUDIT_OFF)
> > audit_log_config_change("audit_pid", 
> > new_pid, audit_pid, 1);
> > -   audit_pid = new_pid;
> > -   audit_nlk_portid = NETLINK_CB(skb).portid;
> > -   audit_sock = skb->sk;
> > -   if (!new_pid)
> > +

Re: [PATCH V2 for-next 00/11] Code improvements & fixes for HNS RoCE driver

2016-12-12 Thread Doug Ledford

On 11/15/2016 1:10 PM, Salil Mehta wrote:
> This patchset introduces some code improvements and fixes
> for the identified problems in the HNS RoCE driver.
> 
> Lijun Ou (4):
>   IB/hns: Add the interface for querying QP1
>   IB/hns: add self loopback for CM
>   IB/hns: Modify the condition of notifying hardware loopback
>   IB/hns: Fix the bug for qp state in hns_roce_v1_m_qp()
> 
> Salil Mehta (1):
>   IB/hns: Fix for Checkpatch.pl comment style errors
> 
> Shaobo Xu (1):
>   IB/hns: Implement the add_gid/del_gid and optimize the GIDs
> management
> 
> Wei Hu (Xavier) (5):
>   IB/hns: Add code for refreshing CQ CI using TPTR
>   IB/hns: Optimize the logic of allocating memory using APIs
>   IB/hns: Modify the macro for the timeout when cmd process
>   IB/hns: Modify query info named port_num when querying RC QP
>   IB/hns: Change qpn allocation to round-robin mode.
> 
>  drivers/infiniband/hw/hns/hns_roce_alloc.c  |   11 +-
>  drivers/infiniband/hw/hns/hns_roce_cmd.c|8 +-
>  drivers/infiniband/hw/hns/hns_roce_cmd.h|7 +-
>  drivers/infiniband/hw/hns/hns_roce_common.h |2 -
>  drivers/infiniband/hw/hns/hns_roce_cq.c |   17 +-
>  drivers/infiniband/hw/hns/hns_roce_device.h |   45 ++--
>  drivers/infiniband/hw/hns/hns_roce_eq.c |6 +-
>  drivers/infiniband/hw/hns/hns_roce_hem.c|6 +-
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |  267 +--
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.h  |   17 +-
>  drivers/infiniband/hw/hns/hns_roce_main.c   |  311 
> +++
>  drivers/infiniband/hw/hns/hns_roce_mr.c |   21 +-
>  drivers/infiniband/hw/hns/hns_roce_pd.c |5 +-
>  drivers/infiniband/hw/hns/hns_roce_qp.c |2 +-
>  14 files changed, 363 insertions(+), 362 deletions(-)
> 

Series applied, thanks.

-- 
Doug Ledford 
GPG Key ID: 0E572FDD



signature.asc
Description: OpenPGP digital signature

Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-12 Thread Theodore Ts'o

On Tue, Dec 13, 2016 at 11:43:00AM +0800, Jason Wang wrote:
> Thanks for reporting this issue. Looks like I blindly set the affinity
> instead of queues during probe. Could you please try the following patch to
> see if it works?

This fixed things, thanks!!

- Ted


> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b425fa1..fe9f772 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1930,7 +1930,9 @@ static int virtnet_probe(struct virtio_device *vdev)
> goto free_unregister_netdev;
> }
> 
> -   virtnet_set_affinity(vi);
> +   rtnl_lock();
> +   virtnet_set_queues(vi, vi->curr_queue_pairs);
> +   rtnl_unlock();
> 
> /* Assume link up if device can't report link status,
>otherwise get link status from config. */
> 
>

Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)

2016-12-12 Thread Selvin Xavier

On Tue, Dec 13, 2016 at 5:22 AM, Doug Ledford  wrote:
>
> There are outstanding review comments to be addressed still yet, and the
> v2 patchset doesn't compile for me in 0day testing.  I'm going to bounce
> this one to 4.11.

I will address all review comments and fix the 0day compilation error
and post a v3 soon.

Thanks,
Selvin Xavier

Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-12 Thread Jason Wang




On 2016年12月13日 11:12, Theodore Ts'o wrote:

On Tue, Dec 13, 2016 at 04:28:17AM +0200, Michael S. Tsirkin wrote:

That's unfortunate, of course. It could be a hypervisor or
a guest kernel bug. ideas:
- does host have mq capability? how many queues?
- how about # of msix vectors?
- after you send something on tx queues,
   are interrupts arriving on rx queues?
- is problem rx or tx?
   set ip and arp manually and send a packet to known MAC,
   does it get there?

Sorry, I don't know how to debug virtio-net.  Given that it's in a
cloud environment, I also can't set ip addresses manually, since ip
addresses are set manually.

If you can send me a patch, I'm happy to apply it and send you back
results.

I can say that I've had _zero_ problems using pretty much any kernel
from 3.10 to 4.9 using Google Compute Engine.  The commit I referenced
caused things to stop working.  So in terms of regression, this is
definitely a regression, and it's definitely caused by commit
449000102901.  Even if it is a hypervisor "bug", I'm pretty sure I
know what Linus will say if I ask him to revert it.  Linux kernels are
expected to work around hardware bugs, and breaking users just because
hardware is "broken" by some definition is generally not considered
friendly, especially when has been working for years and years before
some commit "fixed" things.

I would very much like to work with you to fix it, but I will need
your help, since virtio-net doesn't seem to print any informational
during the boot sequence, and I don't know how the best way to debug
it.

Cheers,

- Ted


Thanks for reporting this issue. Looks like I blindly set the affinity 
instead of queues during probe. Could you please try the following patch 
to see if it works?


diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index b425fa1..fe9f772 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1930,7 +1930,9 @@ static int virtnet_probe(struct virtio_device *vdev)
goto free_unregister_netdev;
}

-   virtnet_set_affinity(vi);
+   rtnl_lock();
+   virtnet_set_queues(vi, vi->curr_queue_pairs);
+   rtnl_unlock();

/* Assume link up if device can't report link status,
   otherwise get link status from config. */

Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-12 Thread Michael S. Tsirkin

On Mon, Dec 12, 2016 at 10:12:43PM -0500, Theodore Ts'o wrote:
> On Tue, Dec 13, 2016 at 04:28:17AM +0200, Michael S. Tsirkin wrote:
> > 
> > That's unfortunate, of course. It could be a hypervisor or
> > a guest kernel bug. ideas:
> > - does host have mq capability? how many queues?
> > - how about # of msix vectors?
> > - after you send something on tx queues,
> >   are interrupts arriving on rx queues?
> > - is problem rx or tx?
> >   set ip and arp manually and send a packet to known MAC,
> >   does it get there?
> 
> Sorry, I don't know how to debug virtio-net.  Given that it's in a
> cloud environment, I also can't set ip addresses manually, since ip
> addresses are set manually.

OK, but you can send raw ethernet frames preseumably?


> If you can send me a patch, I'm happy to apply it and send you back
> results.

Let's start with collecting stats from sysfs for this device.
pls get features bitmap from there,
pls get /proc/interrupts mappings,
and pls use lspci to dump pci config.


> I can say that I've had _zero_ problems using pretty much any kernel
> from 3.10 to 4.9 using Google Compute Engine.  The commit I referenced
> caused things to stop working.  So in terms of regression, this is
> definitely a regression, and it's definitely caused by commit
> 449000102901.  Even if it is a hypervisor "bug", I'm pretty sure I
> know what Linus will say if I ask him to revert it.  Linux kernels are
> expected to work around hardware bugs, and breaking users just because
> hardware is "broken" by some definition is generally not considered
> friendly, especially when has been working for years and years before
> some commit "fixed" things.

I'm open to limiting new features to virtio 1 mode just to
avoid the hassle of dealing with legacy hypervisors.
But let's not argue about it until we know the root cause.

> 
> I would very much like to work with you to fix it, but I will need
> your help, since virtio-net doesn't seem to print any informational
> during the boot sequence, and I don't know how the best way to debug
> it.
> 
> Cheers,
> 
>   - Ted


Let's start with debugging it like any PCI NIC.


-- 
MST

Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-12 Thread Theodore Ts'o

On Tue, Dec 13, 2016 at 04:28:17AM +0200, Michael S. Tsirkin wrote:
> 
> That's unfortunate, of course. It could be a hypervisor or
> a guest kernel bug. ideas:
> - does host have mq capability? how many queues?
> - how about # of msix vectors?
> - after you send something on tx queues,
>   are interrupts arriving on rx queues?
> - is problem rx or tx?
>   set ip and arp manually and send a packet to known MAC,
>   does it get there?

Sorry, I don't know how to debug virtio-net.  Given that it's in a
cloud environment, I also can't set ip addresses manually, since ip
addresses are set manually.

If you can send me a patch, I'm happy to apply it and send you back
results.

I can say that I've had _zero_ problems using pretty much any kernel
from 3.10 to 4.9 using Google Compute Engine.  The commit I referenced
caused things to stop working.  So in terms of regression, this is
definitely a regression, and it's definitely caused by commit
449000102901.  Even if it is a hypervisor "bug", I'm pretty sure I
know what Linus will say if I ask him to revert it.  Linux kernels are
expected to work around hardware bugs, and breaking users just because
hardware is "broken" by some definition is generally not considered
friendly, especially when has been working for years and years before
some commit "fixed" things.

I would very much like to work with you to fix it, but I will need
your help, since virtio-net doesn't seem to print any informational
during the boot sequence, and I don't know how the best way to debug
it.

Cheers,

- Ted

Re: "virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-12 Thread Michael S. Tsirkin

On Mon, Dec 12, 2016 at 06:33:43PM -0500, Theodore Ts'o wrote:
> Hi,
> 
> I was doing a last minute regression test of the ext4 tree before
> sending a pull request to Linus, which I do using gce-xfstests[1], and
> I found that using networking was broken on GCE on linux-next.  I was
> using next-20161209, and after bisecting things, I narrowed down the
> commit which causing things to break to commit 449000102901:
> "virtio-net: enable multiqueue by default".  Reverting this commit on
> top of next-20161209 fixed the problem.
> 
> [1] http://thunk.org/gce-xfstests
> 
> You can reproduce the problem for building the kernel for Google
> Compute Engine --- I use a config such as this [2], and then try to
> boot a kernel on a VM.  The way I do this involves booting a test
> appliance and then kexec'ing into the kernel to be tested[3], using a
> 2cpu configuration.  (GCE machine type: n1-standard-2)
> 
> [2] 
> https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kernel-configs/ext4-x86_64-config-4.9
> [3] 
> https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md
> 
> You can then take a look at serial console using a command such as
> "gcloud compute instances get-serial-port-output ", and
> you will get something like this (see attached).  The important bit is
> that the dhclient command is completely failing to be able to get a
> response from the network, from which I deduce that apparently that
> either networking send or receive or both seem to be badly affected by
> the commit in question.
> 
> Please let me know if there's anything I can do to help you debug this
> further.
> 
> Cheers,
> 
>   - Ted

That's unfortunate, of course. It could be a hypervisor or
a guest kernel bug. ideas:
- does host have mq capability? how many queues?
- how about # of msix vectors?
- after you send something on tx queues,
  are interrupts arriving on rx queues?
- is problem rx or tx?
  set ip and arp manually and send a packet to known MAC,
  does it get there?

> Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Linux version 
> 4.9.0-rc8-ext4-06387-g03e5cbd (tytso@tytso-ssd) (gcc version 4.9.2 (Debian 
> 4.9.2-10) ) #9 SMP Mon Dec 12 04:50:16 UTC 2016
> Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Command line: 
> root=/dev/sda1 ro console=ttyS0,38400n8 elevator=noop console=ttyS0  
> fstestcfg=4k fstestset=-g,quick fstestexc= fstestopt=aex fstesttyp=ext4 
> fstestapi=1.3
> Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
> Supporting XSAVE feature 0x001: 'x87 floating point registers'
> Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
> Supporting XSAVE feature 0x002: 'SSE registers'
> Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
> Supporting XSAVE feature 0x004: 'AVX registers'
> Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
> xstate_offset[2]:  576, xstate_sizes[2]:  256
> Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: Enabled 
> xstate features 0x7, context size is 832 bytes, using 'standard' format.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load Kernel Modules.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Apply Kernel 
> Variables...
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting Configuration File 
> System...
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting FUSE Control File 
> System...
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted FUSE Control File 
> System.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted Configuration File 
> System.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Apply Kernel 
> Variables.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Create Static 
> Device Nodes in /dev.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Kernel Device 
> Manager...
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Kernel Device 
> Manager.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Coldplug all 
> Devices.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Wait for 
> Complete Device Initialization...
> Dec 11 23:53:20 xfstests-201612120451 systemd-fsck[1659]: xfstests-root: 
> clean, 56268/655360 files, 357439/2620928 blocks
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started File System Check 
> on Root Device.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Remount Root and 
> Kernel File Systems...
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Remount Root and 
> Kernel File Systems.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Various fixups to 
> make systemd work better on Debian.
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Load/Save Random 
> Seed...
> Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Local File Systems 
> (Pre).
>

Re: [PATCH 0/6] USB support for Broadcom NSP SoC

2016-12-12 Thread Florian Fainelli

On 11/09/2016 01:33 AM, Yendapally Reddy Dhananjaya Reddy wrote:
> This patch set contains the usb support for Broadcom NSP SoC.
> The usb phy is connected through mdio interface. The mdio interface
> can be used to access either internal phys or external phys using a
> multiplexer.
> 
> The first patch provides the documentation details for mdio-mux and
> second patch provides the documentation details for usb3 phy. The third
> patch contains the mdio-mux support and fourth patch contains the
> changes to the mdio bus driver.
> 
> The fifth patch provides the phy driver and sixth patch provides the
> enable method for usb.
> 
> This patch series has been tested on NSP bcm958625HR board.
> This patch series is based on v4.9.0-rc1 and is available from github-
> repo: https://github.com/Broadcom/cygnus-linux.git
> branch:nsp-usb-v1

Can you resubmit this patch series with the feedback from Andrew, Rob
and Scott addressed?

Thanks!
-- 
Florian

Re: [PATCH v2 1/4] net: hix5hd2_gmac: add generic compatible string

2016-12-12 Thread Dongpo Li



On 2016/12/12 22:21, Rob Herring wrote:
> On Mon, Dec 12, 2016 at 5:16 AM, Dongpo Li  wrote:
>> Hi Rob,
>>
>> On 2016/12/10 6:35, Rob Herring wrote:
>>> On Mon, Dec 05, 2016 at 09:27:58PM +0800, Dongpo Li wrote:
 The "hix5hd2" is SoC name, add the generic ethernet driver name.
 The "hisi-gemac-v1" is the basic version and "hisi-gemac-v2" adds
 the SG/TXCSUM/TSO/UFO features.

 Signed-off-by: Dongpo Li 
 ---
  .../devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt|  9 +++--
  drivers/net/ethernet/hisilicon/hix5hd2_gmac.c | 15 
 +++
  2 files changed, 18 insertions(+), 6 deletions(-)

 diff --git 
 a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt 
 b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
 index 75d398b..75920f0 100644
 --- a/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
 +++ b/Documentation/devicetree/bindings/net/hisilicon-hix5hd2-gmac.txt
 @@ -1,7 +1,12 @@
  Hisilicon hix5hd2 gmac controller

  Required properties:
 -- compatible: should be "hisilicon,hix5hd2-gmac".
 +- compatible: should contain one of the following SoC strings:
 +* "hisilicon,hix5hd2-gemac"
 +* "hisilicon,hi3798cv200-gemac"
 +and one of the following version string:
 +* "hisilicon,hisi-gemac-v1"
 +* "hisilicon,hisi-gemac-v2"
>>>
>>> What combinations are valid? I assume both chips don't have both v1 and
>>> v2. 2 SoCs and 2 versions so far, I don't think there is much point to
>>> have the v1 and v2 compatible strings.
>>>
>> The v1 and v2 are generic MAC compatible strings, many HiSilicon SoCs may
>> use the same MAC version. For example,
>> hix5hd2, hi3716cv200 SoCs use the v1 MAC version,
>> hi3798cv200, hi3516a SoCs use the v2 MAC version,
>> and there may be more SoCs added in future.
>> So I think the generic compatible strings are okay here.
>> Should I add the hi3716cv200, hi3516a SoCs compatible here?
> 
> Yes.
> 
>> Do you have any good advice?
>>
  - reg: specifies base physical address(s) and size of the device 
 registers.
The first region is the MAC register base and size.
The second region is external interface control register.
 @@ -20,7 +25,7 @@ Required properties:

  Example:
  gmac0: ethernet@f984 {
 -compatible = "hisilicon,hix5hd2-gmac";
 +compatible = "hisilicon,hix5hd2-gemac", 
 "hisilicon,hisi-gemac-v1";
>>>
>>> You can't just change compatible strings.
>>>
>> Okay, maybe I should name all the compatible string with the suffix "-gmac" 
>> instead of
>> "-gemac". This can keep the compatible strings with the same suffix. Is this 
>> okay?
>> Can I just add the generic compatible string without changing the SoCs 
>> compatible string?
>> Like following:
>> gmac0: ethernet@f984 {
>>  -  compatible = "hisilicon,hix5hd2-gmac";
>>  +  compatible = "hisilicon,hix5hd2-gmac", 
>> "hisilicon,hisi-gmac-v1";
> 
> Yes, this is fine.
> 
Many thanks for your advice.
As the patch series have been applied to net-next branch,
in which way should I commit this compatible fix?
Should I send a new patch with "Fixes: "?


Regards,
Dongpo

.

RE: [PATCH net-next 09/27] cxgb4: use __vlan_hwaccel helpers

2016-12-12 Thread Steve Wise


> This also initializes vlan_proto field.
> 
> Signed-off-by: Michał Mirosław 


Acked-by: Steve Wise

Re: [PATCH net-next 2/5] liquidio VF vxlan

2016-12-12 Thread Felix Manlunas

Or Gerlitz  wrote on Sat [2016-Dec-10 05:46:13 -0800]:
> On Fri, Dec 9, 2016 at 12:42 AM, Vatsavayi, Raghu
>  wrote:
> >> From: Or Gerlitz [mailto:gerlitz...@gmail.com]
> >> On Thu, Dec 8, 2016 at 11:00 PM, Raghu Vatsavayi
> >>  wrote:
> 
> >>> Adds VF vxlan offload support.
> 
> >> What's the use case for that? a VM running a VTEP, isn't that part needs to
> >> run @ the host?
> 
> > Our HW can support offloads for VF which is required if we load it on 
> > Hypervisor.
> 
> 
> +   nctrl.ncmd.u64 = 0;
> +   nctrl.ncmd.s.cmd = command;
> +   nctrl.ncmd.s.more = vxlan_cmd_bit;
> +   nctrl.ncmd.s.param1 = vxlan_port;
> +   nctrl.iq_no = lio->linfo.txpciq[0].s.q_no;
> +   nctrl.wait_time = 100;
> +   nctrl.netpndev = (u64)netdev;
> +   nctrl.cb_fn = liquidio_link_ctrl_cmd_completion;
> +
> +   ret = octnet_send_nic_ctrl_pkt(lio->oct_dev, &nctrl);
> 
> 1. What happens if > 1 one VF runs this code, each with different
> port? who wins? is the result well defined?

There is neither race nor contention, but all VFs "win" (meaning they get
what they ask for) because the VxLAN UDP port can be set on a per VF basis.
So the result of the above case is:  for VFs running in the host (not in
VMs), each VF interface is a VTEP with a distinct UDP port for VxLAN.

> 2. does octnet_send_nic_ctrl_pkt() goes to sleep? this is disallowed here

No, it does not go to sleep.

Felix

Re: [PATCH net-next 20/27] net/bpf_jit: MIPS: split VLAN_PRESENT bit handling from VLAN_TCI

2016-12-12 Thread Ralf Baechle

I assume you want to merge this together with the rest of you series, so

Acked-by: Ralf Baechle 

Cheers,

  Ralf

On Tue, Dec 13, 2016 at 01:12:39AM +0100, Michał Mirosław wrote:
> Date:   Tue, 13 Dec 2016 01:12:39 +0100 (CET)
> From: Michał Mirosław 
> To: netdev@vger.kernel.org
> Cc: Ralf Baechle , "open list:MIPS"
>  
> Subject: [PATCH net-next 20/27] net/bpf_jit: MIPS: split VLAN_PRESENT bit
>  handling from VLAN_TCI
> Content-Type: text/plain; charset=UTF-8
> 
> Signed-off-by: Michał Mirosław 
> ---
>  arch/mips/net/bpf_jit.c | 20 
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
> index 49a2e22..4b12b5d 100644
> --- a/arch/mips/net/bpf_jit.c
> +++ b/arch/mips/net/bpf_jit.c
> @@ -1138,19 +1138,23 @@ static int build_body(struct jit_ctx *ctx)
>   emit_load(r_A, r_skb, off, ctx);
>   break;
>   case BPF_ANC | SKF_AD_VLAN_TAG:
> - case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
>   ctx->flags |= SEEN_SKB | SEEN_A;
>   BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
> vlan_tci) != 2);
>   off = offsetof(struct sk_buff, vlan_tci);
>   emit_half_load(r_s0, r_skb, off, ctx);
> - if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) {
> - emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, 
> ctx);
> - } else {
> - emit_andi(r_A, r_s0, VLAN_TAG_PRESENT, ctx);
> - /* return 1 if present */
> - emit_sltu(r_A, r_zero, r_A, ctx);
> - }
> +#ifdef VLAN_TAG_PRESENT
> + emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, ctx);
> +#endif
> + break;
> + case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
> + ctx->flags |= SEEN_SKB | SEEN_A;
> + emit_load_byte(r_A, r_skb, PKT_VLAN_PRESENT_OFFSET(), 
> ctx);
> + if (PKT_VLAN_PRESENT_BIT)
> + emit_srl(r_A, r_A, PKT_VLAN_PRESENT_BIT, ctx);
> + emit_andi(r_A, r_s0, 1, ctx);
> + /* return 1 if present */
> + emit_sltu(r_A, r_zero, r_A, ctx);
>   break;
>   case BPF_ANC | SKF_AD_PKTTYPE:
>   ctx->flags |= SEEN_SKB;
> -- 
> 2.10.2

Soft lockup in tc_classify

2016-12-12 Thread Shahar Klein


Hi All,

sorry for the spam, the first time was sent with html part and was rejected.

We observed an issue where a classifier instance next member is pointing 
back to itself, causing a CPU soft lockup.
We found it by running traffic on many udp connections and then adding a 
new flower rule using tc.


We added a quick workaround to verify it:

In tc_classify:

for (; tp; tp = rcu_dereference_bh(tp->next)) {
int err;
+   if (tp == tp->next)
+ RCU_INIT_POINTER(tp->next, NULL);


We also had a print here showing tp->next is pointing to tp. With this 
workaround we are not hitting the issue anymore.
We are not sure we fully understand the mechanism here - with the rtnl 
and rcu locks.

We'll appreciate your help solving this issue.

Thanks,
Shahar

[PATCH net-next 13/27] bridge: use __vlan_hwaccel helpers

2016-12-12 Thread Michał Mirosław

This removes assumption than vlan_tci != 0 when tag is present.

Signed-off-by: Michał Mirosław 
---
 net/bridge/br_netfilter_hooks.c | 14 --
 net/bridge/br_private.h |  2 +-
 net/bridge/br_vlan.c|  6 +++---
 3 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index b12501a..2cc0747 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -682,10 +682,8 @@ static int br_nf_push_frag_xmit(struct net *net, struct 
sock *sk, struct sk_buff
return 0;
}
 
-   if (data->vlan_tci) {
-   skb->vlan_tci = data->vlan_tci;
-   skb->vlan_proto = data->vlan_proto;
-   }
+   if (data->vlan_proto)
+   __vlan_hwaccel_put_tag(skb, data->vlan_proto, data->vlan_tci);
 
skb_copy_to_linear_data_offset(skb, -data->size, data->mac, data->size);
__skb_push(skb, data->encap_size);
@@ -749,8 +747,12 @@ static int br_nf_dev_queue_xmit(struct net *net, struct 
sock *sk, struct sk_buff
 
data = this_cpu_ptr(&brnf_frag_data_storage);
 
-   data->vlan_tci = skb->vlan_tci;
-   data->vlan_proto = skb->vlan_proto;
+   if (skb_vlan_tag_present(skb)) {
+   data->vlan_tci = skb->vlan_tci;
+   data->vlan_proto = skb->vlan_proto;
+   } else
+   data->vlan_proto = 0;
+
data->encap_size = nf_bridge_encap_header_len(skb);
data->size = ETH_HLEN + data->encap_size;
 
diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h
index 8ce621e..2efbdaf 100644
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -819,7 +819,7 @@ static inline int br_vlan_get_tag(const struct sk_buff 
*skb, u16 *vid)
int err = 0;
 
if (skb_vlan_tag_present(skb)) {
-   *vid = skb_vlan_tag_get(skb) & VLAN_VID_MASK;
+   *vid = skb_vlan_tag_get_id(skb);
} else {
*vid = 0;
err = -EINVAL;
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index b6de4f4..ef94664 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -377,7 +377,7 @@ struct sk_buff *br_handle_vlan(struct net_bridge *br,
}
 
if (v->flags & BRIDGE_VLAN_INFO_UNTAGGED)
-   skb->vlan_tci = 0;
+   __vlan_hwaccel_clear_tag(skb);
 out:
return skb;
 }
@@ -444,8 +444,8 @@ static bool __allowed_ingress(const struct net_bridge *br,
__vlan_hwaccel_put_tag(skb, br->vlan_proto, pvid);
else
/* Priority-tagged Frame.
-* At this point, We know that skb->vlan_tci had
-* VLAN_TAG_PRESENT bit and its VID field was 0x000.
+* At this point, We know that skb->vlan_tci VID
+* field was 0x000.
 * We update only VID field and preserve PCP field.
 */
skb->vlan_tci |= pvid;
-- 
2.10.2

[PATCH net-next 09/27] cxgb4: use __vlan_hwaccel helpers

2016-12-12 Thread Michał Mirosław

This also initializes vlan_proto field.

Signed-off-by: Michał Mirosław 
---
 drivers/infiniband/hw/cxgb4/cm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index f1510cc..66a3d39 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -3899,7 +3899,7 @@ static int rx_pkt(struct c4iw_dev *dev, struct sk_buff 
*skb)
} else {
vlan_eh = (struct vlan_ethhdr *)(req + 1);
iph = (struct iphdr *)(vlan_eh + 1);
-   skb->vlan_tci = ntohs(cpl->vlan);
+   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), 
ntohs(cpl->vlan));
}
 
if (iph->version != 0x4)
-- 
2.10.2

[PATCH net-next 11/27] sky2: use __vlan_hwaccel helpers

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 drivers/net/ethernet/marvell/sky2.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/sky2.c 
b/drivers/net/ethernet/marvell/sky2.c
index b60ad0e..bcd20e0 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -2485,13 +2485,11 @@ static struct sk_buff *receive_copy(struct sky2_port 
*sky2,
skb->ip_summed = re->skb->ip_summed;
skb->csum = re->skb->csum;
skb_copy_hash(skb, re->skb);
-   skb->vlan_proto = re->skb->vlan_proto;
-   skb->vlan_tci = re->skb->vlan_tci;
+   __vlan_hwaccel_copy_tag(skb, re->skb);
 
pci_dma_sync_single_for_device(sky2->hw->pdev, re->data_addr,
   length, PCI_DMA_FROMDEVICE);
-   re->skb->vlan_proto = 0;
-   re->skb->vlan_tci = 0;
+   __vlan_hwaccel_clear_tag(re->skb);
skb_clear_hash(re->skb);
re->skb->ip_summed = CHECKSUM_NONE;
skb_put(skb, length);
-- 
2.10.2

[PATCH net-next 21/27] net/bpf_jit: PPC: split VLAN_PRESENT bit handling from VLAN_TCI

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 arch/powerpc/net/bpf_jit_comp.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 7e706f3..22ae63f 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -377,18 +377,19 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
  hash));
break;
case BPF_ANC | SKF_AD_VLAN_TAG:
-   case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 
2);
-   BUILD_BUG_ON(VLAN_TAG_PRESENT != 0x1000);
 
PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
  vlan_tci));
-   if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) {
-   PPC_ANDI(r_A, r_A, ~VLAN_TAG_PRESENT);
-   } else {
-   PPC_ANDI(r_A, r_A, VLAN_TAG_PRESENT);
-   PPC_SRWI(r_A, r_A, 12);
-   }
+#ifdef VLAN_TAG_PRESENT
+   PPC_ANDI(r_A, r_A, ~VLAN_TAG_PRESENT);
+#endif
+   break;
+   case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
+   PPC_LBZ_OFFS(r_A, r_skb, PKT_VLAN_PRESENT_OFFSET());
+   if (PKT_VLAN_PRESENT_BIT)
+   PPC_SRWI(r_A, r_A, PKT_VLAN_PRESENT_BIT);
+   PPC_ANDI(r_A, r_A, 1);
break;
case BPF_ANC | SKF_AD_QUEUE:
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
-- 
2.10.2

[PATCH net-next 24/27] bpf_test: prepare for VLAN_TAG_PRESENT removal

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 lib/test_bpf.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/lib/test_bpf.c b/lib/test_bpf.c
index 0362da0..00d3450 100644
--- a/lib/test_bpf.c
+++ b/lib/test_bpf.c
@@ -691,8 +691,13 @@ static struct bpf_test tests[] = {
CLASSIC,
{ },
{
+#ifdef VLAN_TAG_PRESENT
{ 1, SKB_VLAN_TCI & ~VLAN_TAG_PRESENT },
{ 10, SKB_VLAN_TCI & ~VLAN_TAG_PRESENT }
+#else
+   { 1, SKB_VLAN_TCI },
+   { 10, SKB_VLAN_TCI }
+#endif
},
},
{
@@ -705,8 +710,13 @@ static struct bpf_test tests[] = {
CLASSIC,
{ },
{
+#ifdef VLAN_TAG_PRESENT
{ 1, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) },
{ 10, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) }
+#else
+   { 1, SKB_VLAN_PRESENT },
+   { 10, SKB_VLAN_PRESENT }
+#endif
},
},
{
@@ -4773,8 +4783,13 @@ static struct bpf_test tests[] = {
CLASSIC,
{ },
{
+#ifdef VLAN_TAG_PRESENT
{  1, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) },
{ 10, !!(SKB_VLAN_TCI & VLAN_TAG_PRESENT) }
+#else
+   {  1, SKB_VLAN_PRESENT },
+   { 10, SKB_VLAN_PRESENT }
+#endif
},
.fill_helper = bpf_fill_maxinsns6,
},
-- 
2.10.2

[PATCH net-next 15/27] ipv4/tunnel: use __vlan_hwaccel helpers

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 net/ipv4/ip_tunnel_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index fed3d29..0004a54 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -120,7 +120,7 @@ int __iptunnel_pull_header(struct sk_buff *skb, int hdr_len,
}
 
skb_clear_hash_if_not_l4(skb);
-   skb->vlan_tci = 0;
+   __vlan_hwaccel_clear_tag(skb);
skb_set_queue_mapping(skb, 0);
skb_scrub_packet(skb, xnet);
 
-- 
2.10.2

[PATCH net-next 00/27] Remove VLAN CFI bit abuse

2016-12-12 Thread Michał Mirosław

Dear NetDevs

This series removes an abuse of VLAN CFI bit in Linux networking stack.
Currently Linux always clears the bit on outgoing traffic and presents
it cleared to userspace (even via AF_PACKET/tcpdump when hw-accelerated).

This uses a new vlan_present bit in struct skbuff, and removes an assumption
that vlan_proto != 0 when VLAN tag is present.

As I can't test most of the driver changes, please look at them carefully.

The series is supposed to be bisect-friendly and that requires temporary
insertion of #define VLAN_TAG_PRESENT in BPF code to be able to split
JIT changes per architecture.

Best Regards,
Michał Mirosław

---

Michał Mirosław (27):
  net/vlan: introduce __vlan_hwaccel_clear_tag() helper
  net/vlan: introduce __vlan_hwaccel_copy_tag() helper
  ibmvnic: fix accelerated VLAN handling
  qlcnic: remove assumption that vlan_tci != 0
  i40iw: remove use of VLAN_TAG_PRESENT
  cnic: remove use of VLAN_TAG_PRESENT
  gianfar: remove use of VLAN_TAG_PRESENT
  net/hyperv: remove use of VLAN_TAG_PRESENT
  cxgb4: use __vlan_hwaccel helpers
  benet: use __vlan_hwaccel helpers
  sky2: use __vlan_hwaccel helpers
  net/core: use __vlan_hwaccel helpers
  bridge: use __vlan_hwaccel helpers
  8021q: use __vlan_hwaccel helpers
  ipv4/tunnel: use __vlan_hwaccel helpers
  nfnetlink/queue: use __vlan_hwaccel helpers
  OVS: remove assumptions about VLAN_TAG_PRESENT bit
  net/skbuff: add macros for VLAN_PRESENT bit
  net/bpf_jit: ARM: split VLAN_PRESENT bit handling from VLAN_TCI
  net/bpf_jit: MIPS: split VLAN_PRESENT bit handling from VLAN_TCI
  net/bpf_jit: PPC: split VLAN_PRESENT bit handling from VLAN_TCI
  net/bpf_jit: SPARC: split VLAN_PRESENT bit handling from VLAN_TCI
  net/bpf: split VLAN_PRESENT bit handling from VLAN_TCI
  bpf_test: prepare for VLAN_TAG_PRESENT removal
  net: remove VLAN_TAG_PRESENT
  net/hyperv: enable passing of VLAN.CFI bit
  net/vlan: remove unused #define HAVE_VLAN_GET_TAG

 Documentation/networking/openvswitch.txt | 14 --
 arch/arm/net/bpf_jit_32.c| 17 ---
 arch/mips/net/bpf_jit.c  | 17 +++
 arch/powerpc/net/bpf_jit_comp.c  | 14 +++---
 arch/sparc/net/bpf_jit_comp.c| 14 +++---
 drivers/infiniband/hw/cxgb4/cm.c |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_cm.c   |  8 ++--
 drivers/net/ethernet/broadcom/cnic.c |  2 +-
 drivers/net/ethernet/emulex/benet/be_main.c  |  4 +-
 drivers/net/ethernet/freescale/gianfar_ethtool.c |  8 ++--
 drivers/net/ethernet/ibm/ibmvnic.c   |  5 +-
 drivers/net/ethernet/marvell/sky2.c  |  6 +--
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c   |  8 ++--
 drivers/net/hyperv/hyperv_net.h  |  2 +-
 drivers/net/hyperv/netvsc_drv.c  | 14 +++---
 drivers/net/hyperv/rndis_filter.c|  5 +-
 include/linux/if_vlan.h  | 37 +++---
 include/linux/skbuff.h   | 10 +++-
 lib/test_bpf.c   | 14 +++---
 net/8021q/vlan_core.c|  2 +-
 net/bridge/br_netfilter_hooks.c  | 14 +++---
 net/bridge/br_private.h  |  2 +-
 net/bridge/br_vlan.c |  6 +--
 net/core/dev.c   |  8 ++--
 net/core/filter.c| 17 +++
 net/core/skbuff.c|  2 +-
 net/ipv4/ip_tunnel_core.c|  2 +-
 net/netfilter/nfnetlink_queue.c  |  5 +-
 net/openvswitch/actions.c| 13 ++---
 net/openvswitch/flow.c   |  4 +-
 net/openvswitch/flow.h   |  4 +-
 net/openvswitch/flow_netlink.c   | 61 
 net/sched/act_vlan.c |  2 +-
 33 files changed, 170 insertions(+), 173 deletions(-)

-- 
2.10.2

[PATCH net-next 16/27] nfnetlink/queue: use __vlan_hwaccel helpers

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 net/netfilter/nfnetlink_queue.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index be7627b..f268bb9 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -,8 +,9 @@ static int nfqa_parse_bridge(struct nf_queue_entry *entry,
if (!tb[NFQA_VLAN_TCI] || !tb[NFQA_VLAN_PROTO])
return -EINVAL;
 
-   entry->skb->vlan_tci = ntohs(nla_get_be16(tb[NFQA_VLAN_TCI]));
-   entry->skb->vlan_proto = nla_get_be16(tb[NFQA_VLAN_PROTO]);
+   __vlan_hwaccel_put_tag(entry->skb,
+   nla_get_be16(tb[NFQA_VLAN_PROTO]),
+   ntohs(nla_get_be16(tb[NFQA_VLAN_TCI])));
}
 
if (nfqa[NFQA_L2HDR]) {
-- 
2.10.2

[PATCH net-next 14/27] 8021q: use __vlan_hwaccel helpers

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 net/8021q/vlan_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index e2ed698..604a67a 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -50,7 +50,7 @@ bool vlan_do_receive(struct sk_buff **skbp)
}
 
skb->priority = vlan_get_ingress_priority(vlan_dev, skb->vlan_tci);
-   skb->vlan_tci = 0;
+   __vlan_hwaccel_clear_tag(skb);
 
rx_stats = this_cpu_ptr(vlan_dev_priv(vlan_dev)->vlan_pcpu_stats);
 
-- 
2.10.2

[PATCH net-next 22/27] net/bpf_jit: SPARC: split VLAN_PRESENT bit handling from VLAN_TCI

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 arch/sparc/net/bpf_jit_comp.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c
index a6d9204..61cc15d 100644
--- a/arch/sparc/net/bpf_jit_comp.c
+++ b/arch/sparc/net/bpf_jit_comp.c
@@ -601,15 +601,17 @@ void bpf_jit_compile(struct bpf_prog *fp)
emit_skb_load32(hash, r_A);
break;
case BPF_ANC | SKF_AD_VLAN_TAG:
-   case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
emit_skb_load16(vlan_tci, r_A);
-   if (code != (BPF_ANC | SKF_AD_VLAN_TAG)) {
-   emit_alu_K(SRL, 12);
-   emit_andi(r_A, 1, r_A);
-   } else {
-   emit_loadimm(~VLAN_TAG_PRESENT, r_TMP);
-   emit_and(r_A, r_TMP, r_A);
-   }
+#ifdef VLAN_TAG_PRESENT
+   emit_loadimm(~VLAN_TAG_PRESENT, r_TMP);
+   emit_and(r_A, r_TMP, r_A);
+#endif
+   break;
+   case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
+   __emit_skb_load8(__pkt_vlan_present_offset, 
r_A);
+   if (PKT_VLAN_PRESENT_BIT)
+   emit_alu_K(SRL, PKT_VLAN_PRESENT_BIT);
+   emit_andi(r_A, 1, r_A);
break;
case BPF_LD | BPF_W | BPF_LEN:
emit_skb_load32(len, r_A);
-- 
2.10.2

[PATCH net-next 02/27] net/vlan: introduce __vlan_hwaccel_copy_tag() helper

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 include/linux/if_vlan.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 38be904..75e839b 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -393,6 +393,19 @@ static inline void __vlan_hwaccel_clear_tag(struct sk_buff 
*skb)
skb->vlan_tci = 0;
 }
 
+/**
+ * __vlan_hwaccel_copy_tag - copy hardware accelerated VLAN info from another 
skb
+ * @dst: skbuff to copy to
+ * @src: skbuff to copy from
+ *
+ * Copies VLAN information from @src to @dst (for branchless code)
+ */
+static inline void __vlan_hwaccel_copy_tag(struct sk_buff *dst, const struct 
sk_buff *src)
+{
+   dst->vlan_proto = src->vlan_proto;
+   dst->vlan_tci = src->vlan_tci;
+}
+
 /*
  * __vlan_hwaccel_push_inside - pushes vlan tag to the payload
  * @skb: skbuff to tag
-- 
2.10.2

[PATCH net-next 26/27] net/hyperv: enable passing of VLAN.CFI bit

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 drivers/net/hyperv/netvsc_drv.c   | 1 +
 drivers/net/hyperv/rndis_filter.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 6597d79..4e20f4c 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -441,6 +441,7 @@ static int netvsc_start_xmit(struct sk_buff *skb, struct 
net_device *net)
vlan = (struct ndis_pkt_8021q_info *)((void *)ppi +
ppi->ppi_offset);
vlan->vlanid = skb->vlan_tci & VLAN_VID_MASK;
+   vlan->cfi = !!(skb->vlan_tci & VLAN_CFI_MASK);
vlan->pri = (skb->vlan_tci & VLAN_PRIO_MASK) >>
VLAN_PRIO_SHIFT;
}
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 7f7b410..9759d73 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -382,6 +382,7 @@ static int rndis_filter_receive_data(struct rndis_device 
*dev,
vlan = rndis_get_ppi(rndis_pkt, IEEE_8021Q_INFO);
if (vlan) {
vlan_tci = vlan->vlanid |
+   (vlan->cfi ? VLAN_CFI_MASK : 0) |
(vlan->pri << VLAN_PRIO_SHIFT);
}
 
-- 
2.10.2

[PATCH net-next 19/27] net/bpf_jit: ARM: split VLAN_PRESENT bit handling from VLAN_TCI

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 arch/arm/net/bpf_jit_32.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/arch/arm/net/bpf_jit_32.c b/arch/arm/net/bpf_jit_32.c
index 93d0b6d..6dbc602 100644
--- a/arch/arm/net/bpf_jit_32.c
+++ b/arch/arm/net/bpf_jit_32.c
@@ -915,17 +915,20 @@ static int build_body(struct jit_ctx *ctx)
emit(ARM_LDR_I(r_A, r_skb, off), ctx);
break;
case BPF_ANC | SKF_AD_VLAN_TAG:
-   case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
ctx->seen |= SEEN_SKB;
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 
2);
off = offsetof(struct sk_buff, vlan_tci);
emit(ARM_LDRH_I(r_A, r_skb, off), ctx);
-   if (code == (BPF_ANC | SKF_AD_VLAN_TAG))
-   OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, 
ctx);
-   else {
-   OP_IMM3(ARM_LSR, r_A, r_A, 12, ctx);
-   OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
-   }
+#ifdef VLAN_TAG_PRESENT
+   OP_IMM3(ARM_AND, r_A, r_A, ~VLAN_TAG_PRESENT, ctx);
+#endif
+   break;
+   case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
+   off = PKT_VLAN_PRESENT_OFFSET();
+   emit(ARM_LDRB_I(r_A, r_skb, off), ctx);
+   if (PKT_VLAN_PRESENT_BIT)
+   OP_IMM3(ARM_LSR, r_A, r_A, 
PKT_VLAN_PRESENT_BIT, ctx);
+   OP_IMM3(ARM_AND, r_A, r_A, 0x1, ctx);
break;
case BPF_ANC | SKF_AD_PKTTYPE:
ctx->seen |= SEEN_SKB;
-- 
2.10.2

[PATCH net-next 27/27] net/vlan: remove unused #define HAVE_VLAN_GET_TAG

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 include/linux/if_vlan.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 8ff2f0e..f0b9356 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -477,8 +477,6 @@ static inline int __vlan_hwaccel_get_tag(const struct 
sk_buff *skb,
}
 }
 
-#define HAVE_VLAN_GET_TAG
-
 /**
  * vlan_get_tag - get the VLAN ID from the skb
  * @skb: skbuff to query
-- 
2.10.2

[PATCH net-next 05/27] i40iw: remove use of VLAN_TAG_PRESENT

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 drivers/infiniband/hw/i40iw/i40iw_cm.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c 
b/drivers/infiniband/hw/i40iw/i40iw_cm.c
index 8563769..25cf689 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_cm.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
@@ -414,7 +414,7 @@ static struct i40iw_puda_buf *i40iw_form_cm_frame(struct 
i40iw_cm_node *cm_node,
pd_len += MPA_ZERO_PAD_LEN;
}
 
-   if (cm_node->vlan_id < VLAN_TAG_PRESENT)
+   if (cm_node->vlan_id <= VLAN_VID_MASK)
eth_hlen += 4;
 
if (cm_node->ipv4)
@@ -443,7 +443,7 @@ static struct i40iw_puda_buf *i40iw_form_cm_frame(struct 
i40iw_cm_node *cm_node,
 
ether_addr_copy(ethh->h_dest, cm_node->rem_mac);
ether_addr_copy(ethh->h_source, cm_node->loc_mac);
-   if (cm_node->vlan_id < VLAN_TAG_PRESENT) {
+   if (cm_node->vlan_id <= VLAN_VID_MASK) {
((struct vlan_ethhdr *)ethh)->h_vlan_proto = 
htons(ETH_P_8021Q);
((struct vlan_ethhdr *)ethh)->h_vlan_TCI = 
htons(cm_node->vlan_id);
 
@@ -472,7 +472,7 @@ static struct i40iw_puda_buf *i40iw_form_cm_frame(struct 
i40iw_cm_node *cm_node,
 
ether_addr_copy(ethh->h_dest, cm_node->rem_mac);
ether_addr_copy(ethh->h_source, cm_node->loc_mac);
-   if (cm_node->vlan_id < VLAN_TAG_PRESENT) {
+   if (cm_node->vlan_id <= VLAN_VID_MASK) {
((struct vlan_ethhdr *)ethh)->h_vlan_proto = 
htons(ETH_P_8021Q);
((struct vlan_ethhdr *)ethh)->h_vlan_TCI = 
htons(cm_node->vlan_id);
((struct vlan_ethhdr *)ethh)->h_vlan_encapsulated_proto 
= htons(ETH_P_IPV6);
@@ -3235,7 +3235,7 @@ static void i40iw_init_tcp_ctx(struct i40iw_cm_node 
*cm_node,
 
tcp_info->flow_label = 0;
tcp_info->snd_mss = cpu_to_le32(((u32)cm_node->tcp_cntxt.mss));
-   if (cm_node->vlan_id < VLAN_TAG_PRESENT) {
+   if (cm_node->vlan_id <= VLAN_VID_MASK) {
tcp_info->insert_vlan_tag = true;
tcp_info->vlan_tag = cpu_to_le16(cm_node->vlan_id);
}
-- 
2.10.2

[PATCH net-next 23/27] net/bpf: split VLAN_PRESENT bit handling from VLAN_TCI

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 net/core/filter.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index b146170..c3321f1 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -188,22 +188,20 @@ static u32 convert_skb_access(int skb_field, int dst_reg, 
int src_reg,
break;
 
case SKF_AD_VLAN_TAG:
-   case SKF_AD_VLAN_TAG_PRESENT:
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, vlan_tci) != 2);
-   BUILD_BUG_ON(VLAN_TAG_PRESENT != 0x1000);
 
/* dst_reg = *(u16 *) (src_reg + offsetof(vlan_tci)) */
*insn++ = BPF_LDX_MEM(BPF_H, dst_reg, src_reg,
  offsetof(struct sk_buff, vlan_tci));
-   if (skb_field == SKF_AD_VLAN_TAG) {
-   *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg,
-   ~VLAN_TAG_PRESENT);
-   } else {
-   /* dst_reg >>= 12 */
-   *insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, 12);
-   /* dst_reg &= 1 */
-   *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, 1);
-   }
+#ifdef VLAN_TAG_PRESENT
+   *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, ~VLAN_TAG_PRESENT);
+#endif
+   break;
+   case SKF_AD_VLAN_TAG_PRESENT:
+   *insn++ = BPF_LDX_MEM(BPF_B, dst_reg, src_reg, 
PKT_VLAN_PRESENT_OFFSET());
+   if (PKT_VLAN_PRESENT_BIT)
+   *insn++ = BPF_ALU32_IMM(BPF_RSH, dst_reg, 
PKT_VLAN_PRESENT_BIT);
+   *insn++ = BPF_ALU32_IMM(BPF_AND, dst_reg, 1);
break;
}
 
-- 
2.10.2

[PATCH net-next 17/27] OVS: remove assumptions about VLAN_TAG_PRESENT bit

2016-12-12 Thread Michał Mirosław

This leaves CFI bit toggled in API, because userspace might depend this
is set for normal ethernet traffic with tag present.

Signed-off-by: Michał Mirosław 
---
 Documentation/networking/openvswitch.txt | 14 
 net/openvswitch/actions.c| 13 +++
 net/openvswitch/flow.c   |  4 +--
 net/openvswitch/flow.h   |  4 +--
 net/openvswitch/flow_netlink.c   | 61 ++--
 5 files changed, 30 insertions(+), 66 deletions(-)

diff --git a/Documentation/networking/openvswitch.txt 
b/Documentation/networking/openvswitch.txt
index b3b9ac6..e7ca27d 100644
--- a/Documentation/networking/openvswitch.txt
+++ b/Documentation/networking/openvswitch.txt
@@ -219,20 +219,6 @@ this:
 
 eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0)
 
-As another example, consider a packet with an Ethernet type of 0x8100,
-indicating that a VLAN TCI should follow, but which is truncated just
-after the Ethernet type.  The flow key for this packet would include
-an all-zero-bits vlan and an empty encap attribute, like this:
-
-eth(...), eth_type(0x8100), vlan(0), encap()
-
-Unlike a TCP packet with source and destination ports 0, an
-all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka
-VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan
-attribute expressly to allow this situation to be distinguished.
-Thus, the flow key in this second example unambiguously indicates a
-missing or malformed VLAN TCI.
-
 Other rules
 ---
 
diff --git a/net/openvswitch/actions.c b/net/openvswitch/actions.c
index 514f7bc..6015bc9 100644
--- a/net/openvswitch/actions.c
+++ b/net/openvswitch/actions.c
@@ -277,8 +277,7 @@ static int push_vlan(struct sk_buff *skb, struct 
sw_flow_key *key,
key->eth.vlan.tci = vlan->vlan_tci;
key->eth.vlan.tpid = vlan->vlan_tpid;
}
-   return skb_vlan_push(skb, vlan->vlan_tpid,
-ntohs(vlan->vlan_tci) & ~VLAN_TAG_PRESENT);
+   return skb_vlan_push(skb, vlan->vlan_tpid, ntohs(vlan->vlan_tci ^ 
VLAN_CFI_MASK));
 }
 
 /* 'src' is already properly masked. */
@@ -704,8 +703,10 @@ static int ovs_vport_output(struct net *net, struct sock 
*sk, struct sk_buff *sk
__skb_dst_copy(skb, data->dst);
*OVS_CB(skb) = data->cb;
skb->inner_protocol = data->inner_protocol;
-   skb->vlan_tci = data->vlan_tci;
-   skb->vlan_proto = data->vlan_proto;
+   if (data->vlan_proto)
+   __vlan_hwaccel_put_tag(skb, data->vlan_proto, data->vlan_tci ^ 
VLAN_CFI_MASK);
+   else
+   __vlan_hwaccel_clear_tag(skb);
 
/* Reconstruct the MAC header.  */
skb_push(skb, data->l2_len);
@@ -749,8 +750,8 @@ static void prepare_frag(struct vport *vport, struct 
sk_buff *skb,
data->cb = *OVS_CB(skb);
data->inner_protocol = skb->inner_protocol;
data->network_offset = orig_network_offset;
-   data->vlan_tci = skb->vlan_tci;
-   data->vlan_proto = skb->vlan_proto;
+   data->vlan_tci = skb_vlan_tag_present(skb) ? skb->vlan_tci ^ 
VLAN_CFI_MASK : 0;
+   data->vlan_proto = skb_vlan_tag_present(skb) ? skb->vlan_proto : 0;
data->mac_proto = mac_proto;
data->l2_len = hlen;
memcpy(&data->l2_data, skb->data, hlen);
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index 08aa926..df58cfd 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -327,7 +327,7 @@ static int parse_vlan_tag(struct sk_buff *skb, struct 
vlan_head *key_vh)
return -ENOMEM;
 
vh = (struct vlan_head *)skb->data;
-   key_vh->tci = vh->tci | htons(VLAN_TAG_PRESENT);
+   key_vh->tci = vh->tci ^ htons(VLAN_CFI_MASK);
key_vh->tpid = vh->tpid;
 
__skb_pull(skb, sizeof(struct vlan_head));
@@ -347,7 +347,7 @@ static int parse_vlan(struct sk_buff *skb, struct 
sw_flow_key *key)
int res;
 
if (skb_vlan_tag_present(skb)) {
-   key->eth.vlan.tci = htons(skb->vlan_tci);
+   key->eth.vlan.tci = htons(skb->vlan_tci) ^ htons(VLAN_CFI_MASK);
key->eth.vlan.tpid = skb->vlan_proto;
} else {
/* Parse outer vlan tag in the non-accelerated case. */
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index f61cae7..f5115ed 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -57,8 +57,8 @@ struct ovs_tunnel_info {
 };
 
 struct vlan_head {
-   __be16 tpid; /* Vlan type. Generally 802.1q or 802.1ad.*/
-   __be16 tci;  /* 0 if no VLAN, VLAN_TAG_PRESENT set otherwise. */
+   __be16 tpid; /* Vlan type. Generally 802.1q or 802.1ad. 0 if no VLAN*/
+   __be16 tci;
 };
 
 #define OVS_SW_FLOW_KEY_METADATA_SIZE  \
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d19044f..6ae5218 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -835

[PATCH net-next 01/27] net/vlan: introduce __vlan_hwaccel_clear_tag() helper

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 include/linux/if_vlan.h | 13 -
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 8d5fcd6..38be904 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -382,6 +382,17 @@ static inline struct sk_buff 
*vlan_insert_tag_set_proto(struct sk_buff *skb,
return skb;
 }
 
+/**
+ * __vlan_hwaccel_clear_tag - clear hardware accelerated VLAN info
+ * @skb: skbuff to clear
+ *
+ * Clears the VLAN information from @skb
+ */
+static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb)
+{
+   skb->vlan_tci = 0;
+}
+
 /*
  * __vlan_hwaccel_push_inside - pushes vlan tag to the payload
  * @skb: skbuff to tag
@@ -396,7 +407,7 @@ static inline struct sk_buff 
*__vlan_hwaccel_push_inside(struct sk_buff *skb)
skb = vlan_insert_tag_set_proto(skb, skb->vlan_proto,
skb_vlan_tag_get(skb));
if (likely(skb))
-   skb->vlan_tci = 0;
+   __vlan_hwaccel_clear_tag(skb);
return skb;
 }
 
-- 
2.10.2

[PATCH net-next 20/27] net/bpf_jit: MIPS: split VLAN_PRESENT bit handling from VLAN_TCI

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 arch/mips/net/bpf_jit.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
index 49a2e22..4b12b5d 100644
--- a/arch/mips/net/bpf_jit.c
+++ b/arch/mips/net/bpf_jit.c
@@ -1138,19 +1138,23 @@ static int build_body(struct jit_ctx *ctx)
emit_load(r_A, r_skb, off, ctx);
break;
case BPF_ANC | SKF_AD_VLAN_TAG:
-   case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
ctx->flags |= SEEN_SKB | SEEN_A;
BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff,
  vlan_tci) != 2);
off = offsetof(struct sk_buff, vlan_tci);
emit_half_load(r_s0, r_skb, off, ctx);
-   if (code == (BPF_ANC | SKF_AD_VLAN_TAG)) {
-   emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, 
ctx);
-   } else {
-   emit_andi(r_A, r_s0, VLAN_TAG_PRESENT, ctx);
-   /* return 1 if present */
-   emit_sltu(r_A, r_zero, r_A, ctx);
-   }
+#ifdef VLAN_TAG_PRESENT
+   emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, ctx);
+#endif
+   break;
+   case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
+   ctx->flags |= SEEN_SKB | SEEN_A;
+   emit_load_byte(r_A, r_skb, PKT_VLAN_PRESENT_OFFSET(), 
ctx);
+   if (PKT_VLAN_PRESENT_BIT)
+   emit_srl(r_A, r_A, PKT_VLAN_PRESENT_BIT, ctx);
+   emit_andi(r_A, r_s0, 1, ctx);
+   /* return 1 if present */
+   emit_sltu(r_A, r_zero, r_A, ctx);
break;
case BPF_ANC | SKF_AD_PKTTYPE:
ctx->flags |= SEEN_SKB;
-- 
2.10.2

[PATCH net-next 18/27] net/skbuff: add macros for VLAN_PRESENT bit

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 include/linux/skbuff.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 332e767..4a85a1f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -768,6 +768,12 @@ struct sk_buff {
__u32   priority;
int skb_iif;
__u32   hash;
+#define PKT_VLAN_PRESENT_BIT   4   // CFI (12-th bit) in TCI
+#ifdef __BIG_ENDIAN
+#define PKT_VLAN_PRESENT_OFFSET()  offsetof(struct sk_buff, vlan_tci)
+#else
+#define PKT_VLAN_PRESENT_OFFSET()  (offsetof(struct sk_buff, vlan_tci) + 1)
+#endif
__be16  vlan_proto;
__u16   vlan_tci;
 #if defined(CONFIG_NET_RX_BUSY_POLL) || defined(CONFIG_XPS)
-- 
2.10.2

[PATCH net-next 03/27] ibmvnic: fix accelerated VLAN handling

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c 
b/drivers/net/ethernet/ibm/ibmvnic.c
index c125966..c7664db 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -765,7 +765,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct 
net_device *netdev)
tx_crq.v1.sge_len = cpu_to_be32(skb->len);
tx_crq.v1.ioba = cpu_to_be64(data_dma_addr);
 
-   if (adapter->vlan_header_insertion) {
+   if (adapter->vlan_header_insertion && skb_vlan_tag_present(skb)) {
tx_crq.v1.flags2 |= IBMVNIC_TX_VLAN_INSERT;
tx_crq.v1.vlan_id = cpu_to_be16(skb->vlan_tci);
}
@@ -964,7 +964,8 @@ static int ibmvnic_poll(struct napi_struct *napi, int 
budget)
skb = rx_buff->skb;
skb_copy_to_linear_data(skb, rx_buff->data + offset,
length);
-   skb->vlan_tci = be16_to_cpu(next->rx_comp.vlan_tci);
+   if (flags & IBMVNIC_VLAN_STRIPPED)
+   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), 
be16_to_cpu(next->rx_comp.vlan_tci));
/* free the entry */
next->rx_comp.first = 0;
remove_buff_from_pool(adapter, rx_buff);
-- 
2.10.2

[PATCH net-next 25/27] net: remove VLAN_TAG_PRESENT

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 arch/mips/net/bpf_jit.c |  3 ---
 arch/powerpc/net/bpf_jit_comp.c |  3 ---
 arch/sparc/net/bpf_jit_comp.c   |  4 
 include/linux/if_vlan.h | 11 ++-
 include/linux/skbuff.h  | 16 +---
 lib/test_bpf.c  | 17 ++---
 net/core/filter.c   |  3 ---
 7 files changed, 17 insertions(+), 40 deletions(-)

diff --git a/arch/mips/net/bpf_jit.c b/arch/mips/net/bpf_jit.c
index 4b12b5d..fb6d234 100644
--- a/arch/mips/net/bpf_jit.c
+++ b/arch/mips/net/bpf_jit.c
@@ -1143,9 +1143,6 @@ static int build_body(struct jit_ctx *ctx)
  vlan_tci) != 2);
off = offsetof(struct sk_buff, vlan_tci);
emit_half_load(r_s0, r_skb, off, ctx);
-#ifdef VLAN_TAG_PRESENT
-   emit_andi(r_A, r_s0, (u16)~VLAN_TAG_PRESENT, ctx);
-#endif
break;
case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
ctx->flags |= SEEN_SKB | SEEN_A;
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 22ae63f..fb38927 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -381,9 +381,6 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
 
PPC_LHZ_OFFS(r_A, r_skb, offsetof(struct sk_buff,
  vlan_tci));
-#ifdef VLAN_TAG_PRESENT
-   PPC_ANDI(r_A, r_A, ~VLAN_TAG_PRESENT);
-#endif
break;
case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
PPC_LBZ_OFFS(r_A, r_skb, PKT_VLAN_PRESENT_OFFSET());
diff --git a/arch/sparc/net/bpf_jit_comp.c b/arch/sparc/net/bpf_jit_comp.c
index 61cc15d..d499b39 100644
--- a/arch/sparc/net/bpf_jit_comp.c
+++ b/arch/sparc/net/bpf_jit_comp.c
@@ -602,10 +602,6 @@ void bpf_jit_compile(struct bpf_prog *fp)
break;
case BPF_ANC | SKF_AD_VLAN_TAG:
emit_skb_load16(vlan_tci, r_A);
-#ifdef VLAN_TAG_PRESENT
-   emit_loadimm(~VLAN_TAG_PRESENT, r_TMP);
-   emit_and(r_A, r_TMP, r_A);
-#endif
break;
case BPF_ANC | SKF_AD_VLAN_TAG_PRESENT:
__emit_skb_load8(__pkt_vlan_present_offset, 
r_A);
diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
index 75e839b..8ff2f0e 100644
--- a/include/linux/if_vlan.h
+++ b/include/linux/if_vlan.h
@@ -66,7 +66,6 @@ static inline struct vlan_ethhdr *vlan_eth_hdr(const struct 
sk_buff *skb)
 #define VLAN_PRIO_MASK 0xe000 /* Priority Code Point */
 #define VLAN_PRIO_SHIFT13
 #define VLAN_CFI_MASK  0x1000 /* Canonical Format Indicator */
-#define VLAN_TAG_PRESENT   VLAN_CFI_MASK
 #define VLAN_VID_MASK  0x0fff /* VLAN Identifier */
 #define VLAN_N_VID 4096
 
@@ -78,8 +77,8 @@ static inline bool is_vlan_dev(const struct net_device *dev)
 return dev->priv_flags & IFF_802_1Q_VLAN;
 }
 
-#define skb_vlan_tag_present(__skb)((__skb)->vlan_tci & VLAN_TAG_PRESENT)
-#define skb_vlan_tag_get(__skb)((__skb)->vlan_tci & 
~VLAN_TAG_PRESENT)
+#define skb_vlan_tag_present(__skb)((__skb)->vlan_present)
+#define skb_vlan_tag_get(__skb)((__skb)->vlan_tci)
 #define skb_vlan_tag_get_id(__skb) ((__skb)->vlan_tci & VLAN_VID_MASK)
 #define skb_vlan_tag_get_prio(__skb)   ((__skb)->vlan_tci & VLAN_PRIO_MASK)
 
@@ -390,7 +389,7 @@ static inline struct sk_buff 
*vlan_insert_tag_set_proto(struct sk_buff *skb,
  */
 static inline void __vlan_hwaccel_clear_tag(struct sk_buff *skb)
 {
-   skb->vlan_tci = 0;
+   skb->vlan_present = 0;
 }
 
 /**
@@ -402,6 +401,7 @@ static inline void __vlan_hwaccel_clear_tag(struct sk_buff 
*skb)
  */
 static inline void __vlan_hwaccel_copy_tag(struct sk_buff *dst, const struct 
sk_buff *src)
 {
+   dst->vlan_present = src->vlan_present;
dst->vlan_proto = src->vlan_proto;
dst->vlan_tci = src->vlan_tci;
 }
@@ -436,7 +436,8 @@ static inline void __vlan_hwaccel_put_tag(struct sk_buff 
*skb,
  __be16 vlan_proto, u16 vlan_tci)
 {
skb->vlan_proto = vlan_proto;
-   skb->vlan_tci = VLAN_TAG_PRESENT | vlan_tci;
+   skb->vlan_tci = vlan_tci;
+   skb->vlan_present = 1;
 }
 
 /**
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4a85a1f..3577cca 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -740,6 +740,14 @@ struct sk_buff {
__u8csum_level:2;
__u8csum_bad:1;
 
+#ifdef __BIG_ENDIAN_BITFIELD
+#define PKT_VLAN_PRESENT_BIT   7
+#else
+#define PKT_VLAN_PRESENT_BIT   0
+#endif
+#define PKT_VLAN_PRESENT_OFFSET()  offsetof(

[PATCH net-next 07/27] gianfar: remove use of VLAN_TAG_PRESENT

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 drivers/net/ethernet/freescale/gianfar_ethtool.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/freescale/gianfar_ethtool.c 
b/drivers/net/ethernet/freescale/gianfar_ethtool.c
index 56588f2..95fa647 100644
--- a/drivers/net/ethernet/freescale/gianfar_ethtool.c
+++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c
@@ -1155,11 +1155,9 @@ static int gfar_convert_to_filer(struct 
ethtool_rx_flow_spec *rule,
prio = vlan_tci_prio(rule);
prio_mask = vlan_tci_priom(rule);
 
-   if (cfi == VLAN_TAG_PRESENT && cfi_mask == VLAN_TAG_PRESENT) {
-   vlan |= RQFPR_CFI;
-   vlan_mask |= RQFPR_CFI;
-   } else if (cfi != VLAN_TAG_PRESENT &&
-  cfi_mask == VLAN_TAG_PRESENT) {
+   if (cfi_mask) {
+   if (cfi)
+   vlan |= RQFPR_CFI;
vlan_mask |= RQFPR_CFI;
}
}
-- 
2.10.2

[PATCH net-next 04/27] qlcnic: remove assumption that vlan_tci != 0

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c 
b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
index fedd736..c3cc707 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
@@ -459,7 +459,7 @@ static int qlcnic_tx_pkt(struct qlcnic_adapter *adapter,
 struct cmd_desc_type0 *first_desc, struct sk_buff *skb,
 struct qlcnic_host_tx_ring *tx_ring)
 {
-   u8 l4proto, opcode = 0, hdr_len = 0;
+   u8 l4proto, opcode = 0, hdr_len = 0, tag_vlan = 0;
u16 flags = 0, vlan_tci = 0;
int copied, offset, copy_len, size;
struct cmd_desc_type0 *hwdesc;
@@ -472,14 +472,16 @@ static int qlcnic_tx_pkt(struct qlcnic_adapter *adapter,
flags = QLCNIC_FLAGS_VLAN_TAGGED;
vlan_tci = ntohs(vh->h_vlan_TCI);
protocol = ntohs(vh->h_vlan_encapsulated_proto);
+   tag_vlan = 1;
} else if (skb_vlan_tag_present(skb)) {
flags = QLCNIC_FLAGS_VLAN_OOB;
vlan_tci = skb_vlan_tag_get(skb);
+   tag_vlan = 1;
}
if (unlikely(adapter->tx_pvid)) {
-   if (vlan_tci && !(adapter->flags & QLCNIC_TAGGING_ENABLED))
+   if (tag_vlan && !(adapter->flags & QLCNIC_TAGGING_ENABLED))
return -EIO;
-   if (vlan_tci && (adapter->flags & QLCNIC_TAGGING_ENABLED))
+   if (tag_vlan && (adapter->flags & QLCNIC_TAGGING_ENABLED))
goto set_flags;
 
flags = QLCNIC_FLAGS_VLAN_OOB;
-- 
2.10.2

[PATCH net-next 06/27] cnic: remove use of VLAN_TAG_PRESENT

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 drivers/net/ethernet/broadcom/cnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/cnic.c 
b/drivers/net/ethernet/broadcom/cnic.c
index b1d2ac8..6e3c610 100644
--- a/drivers/net/ethernet/broadcom/cnic.c
+++ b/drivers/net/ethernet/broadcom/cnic.c
@@ -5734,7 +5734,7 @@ static int cnic_netdev_event(struct notifier_block *this, 
unsigned long event,
if (realdev) {
dev = cnic_from_netdev(realdev);
if (dev) {
-   vid |= VLAN_TAG_PRESENT;
+   vid |= VLAN_CFI_MASK;   /* make non-zero */
cnic_rcv_netevent(dev->cnic_priv, event, vid);
cnic_put(dev);
}
-- 
2.10.2

[PATCH net-next 08/27] net/hyperv: remove use of VLAN_TAG_PRESENT

2016-12-12 Thread Michał Mirosław

Signed-off-by: Michał Mirosław 
---
 drivers/net/hyperv/hyperv_net.h   |  2 +-
 drivers/net/hyperv/netvsc_drv.c   | 13 ++---
 drivers/net/hyperv/rndis_filter.c |  4 ++--
 3 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 3958ada..b53729e 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -186,7 +186,7 @@ int netvsc_recv_callback(struct hv_device *device_obj,
void **data,
struct ndis_tcp_ip_checksum_info *csum_info,
struct vmbus_channel *channel,
-   u16 vlan_tci);
+   u16 vlan_tci, bool vlan_present);
 void netvsc_channel_cb(void *context);
 int rndis_filter_open(struct netvsc_device *nvdev);
 int rndis_filter_close(struct netvsc_device *nvdev);
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index c9414c0..6597d79 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -595,7 +595,7 @@ void netvsc_linkstatus_callback(struct hv_device 
*device_obj,
 static struct sk_buff *netvsc_alloc_recv_skb(struct net_device *net,
struct hv_netvsc_packet *packet,
struct ndis_tcp_ip_checksum_info *csum_info,
-   void *data, u16 vlan_tci)
+   void *data)
 {
struct sk_buff *skb;
 
@@ -625,10 +625,6 @@ static struct sk_buff *netvsc_alloc_recv_skb(struct 
net_device *net,
skb->ip_summed = CHECKSUM_UNNECESSARY;
}
 
-   if (vlan_tci & VLAN_TAG_PRESENT)
-   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
-  vlan_tci);
-
return skb;
 }
 
@@ -641,7 +637,7 @@ int netvsc_recv_callback(struct hv_device *device_obj,
void **data,
struct ndis_tcp_ip_checksum_info *csum_info,
struct vmbus_channel *channel,
-   u16 vlan_tci)
+   u16 vlan_tci, bool vlan_present)
 {
struct net_device *net = hv_get_drvdata(device_obj);
struct net_device_context *net_device_ctx = netdev_priv(net);
@@ -664,12 +660,15 @@ int netvsc_recv_callback(struct hv_device *device_obj,
net = vf_netdev;
 
/* Allocate a skb - TODO direct I/O to pages? */
-   skb = netvsc_alloc_recv_skb(net, packet, csum_info, *data, vlan_tci);
+   skb = netvsc_alloc_recv_skb(net, packet, csum_info, *data);
if (unlikely(!skb)) {
++net->stats.rx_dropped;
return NVSP_STAT_FAIL;
}
 
+   if (vlan_present)
+   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tci);
+
if (net != vf_netdev)
skb_record_rx_queue(skb,
channel->offermsg.offer.sub_channel_index);
diff --git a/drivers/net/hyperv/rndis_filter.c 
b/drivers/net/hyperv/rndis_filter.c
index 8d90904..7f7b410 100644
--- a/drivers/net/hyperv/rndis_filter.c
+++ b/drivers/net/hyperv/rndis_filter.c
@@ -381,13 +381,13 @@ static int rndis_filter_receive_data(struct rndis_device 
*dev,
 
vlan = rndis_get_ppi(rndis_pkt, IEEE_8021Q_INFO);
if (vlan) {
-   vlan_tci = VLAN_TAG_PRESENT | vlan->vlanid |
+   vlan_tci = vlan->vlanid |
(vlan->pri << VLAN_PRIO_SHIFT);
}
 
csum_info = rndis_get_ppi(rndis_pkt, TCPIP_CHKSUM_PKTINFO);
return netvsc_recv_callback(net_device_ctx->device_ctx, pkt, data,
-   csum_info, channel, vlan_tci);
+   csum_info, channel, vlan_tci, vlan);
 }
 
 int rndis_filter_receive(struct hv_device *dev,
-- 
2.10.2

Re: netlink: GPF in sock_sndtimeo

2016-12-12 Thread Cong Wang

On Mon, Dec 12, 2016 at 2:02 AM, Richard Guy Briggs  wrote:
> On 2016-12-09 20:13, Cong Wang wrote:
>> Netlink notifier can safely be converted to blocking one, I will send
>> a patch.
>
> I had a quick look at how that might happen.  The netlink notifier chain
> is atomic.  Would the registered callback funciton need to spawn a
> one-time thread to avoid blocking?

It is already non-atomic now:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=efa172f42836477bf1ac3c9a3053140df764699c


> I had a look at your patch.  It looks attractively simple.  The audit
> next tree has patches queued that add an audit_reset function that will
> require more work.  I still see some potential gaps.
>
> - If the process messes up (or the sock lookup messes up) it is reset
>   in the kauditd thread under the audit_cmd_mutex.
>
> - If the process exits normally or is replaced due to an audit_replace
>   error, it is reset from audit_receive_skb under the audit_cmd_mutex.
>
> - If the process dies before the kauditd thread notices, either reap it
>   via notifier callback or it needs a check on net exit to reset.  This
>   last one appears necessary to decrement the sock refcount so the sock
>   can be released in netlink_kernel_release().
>
> If we want to be proactive and use the netlink notifier, we assume the
> overhead of adding to the netlink notifier chain and eliminate all the
> other reset calls under the kauditd thread.  If we are ok being
> reactionary, then we'll at least need the net exit check on audit_sock.
>

I don't see why we need to check it in net exit if we use refcnt,
because we have two different users of audit_sock: kauditd and
netns, if both take care of refcnt properly, we don't need to worry
about who is the last, no matter what failures occur in what order.

Re: [PATCH v2] audit: use proper refcount locking on audit_sock

2016-12-12 Thread Cong Wang

On Mon, Dec 12, 2016 at 2:03 AM, Richard Guy Briggs  wrote:
> Resetting audit_sock appears to be racy.
>
> audit_sock was being copied and dereferenced without using a refcount on
> the source sock.
>
> Bump the refcount on the underlying sock when we store a refrence in
> audit_sock and release it when we reset audit_sock.  audit_sock
> modification needs the audit_cmd_mutex.
>
> See: https://lkml.org/lkml/2016/11/26/232
>
> Thanks to Eric Dumazet  and Cong Wang
>  on ideas how to fix it.
>
> Signed-off-by: Richard Guy Briggs 
> ---
> There has been a lot of change in the audit code that is about to go
> upstream to address audit queue issues.  This patch is based on the
> source tree: git://git.infradead.org/users/pcmoore/audit#next
> ---
>  kernel/audit.c |   34 --
>  1 files changed, 28 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/audit.c b/kernel/audit.c
> index f20eee0..439f7f3 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -452,7 +452,9 @@ static void auditd_reset(void)
> struct sk_buff *skb;
>
> /* break the connection */
> +   sock_put(audit_sock);


Why audit_sock can't be NULL here?


> audit_pid = 0;
> +   audit_nlk_portid = 0;
> audit_sock = NULL;
>
> /* flush all of the retry queue to the hold queue */
> @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb)
> if (rc >= 0) {
> consume_skb(skb);
> rc = 0;
> +   } else {
> +   if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {


Are these errno's bits??


> +   mutex_lock(&audit_cmd_mutex);
> +   auditd_reset();
> +   mutex_unlock(&audit_cmd_mutex);
> +   }
> }
>
> return rc;
> @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy)
>
> auditd = 0;
> if (AUDITD_BAD(rc, reschedule)) {
> +   mutex_lock(&audit_cmd_mutex);
> auditd_reset();
> +   mutex_unlock(&audit_cmd_mutex);
> reschedule = 0;
> }
> } else
> @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy)
> auditd = 0;
> if (AUDITD_BAD(rc, reschedule)) {
> kauditd_hold_skb(skb);
> +   mutex_lock(&audit_cmd_mutex);
> auditd_reset();
> +   mutex_unlock(&audit_cmd_mutex);
> reschedule = 0;
> } else
> /* temporary problem (we hope), queue
> @@ -623,7 +635,9 @@ quick_loop:
> if (rc) {
> auditd = 0;
> if (AUDITD_BAD(rc, reschedule)) {
> +   mutex_lock(&audit_cmd_mutex);
> auditd_reset();
> +   
> mutex_unlock(&audit_cmd_mutex);
> reschedule = 0;
> }
>
> @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, 
> struct nlmsghdr *nlh)
> return -EACCES;
> }
> if (audit_pid && new_pid &&
> -   audit_replace(requesting_pid) != -ECONNREFUSED) {
> +   (audit_replace(requesting_pid) & 
> (-ECONNREFUSED|-EPERM|-ENOMEM))) {
> audit_log_config_change("audit_pid", new_pid, 
> audit_pid, 0);
> return -EEXIST;
> }
> if (audit_enabled != AUDIT_OFF)
> audit_log_config_change("audit_pid", new_pid, 
> audit_pid, 1);
> -   audit_pid = new_pid;
> -   audit_nlk_portid = NETLINK_CB(skb).portid;
> -   audit_sock = skb->sk;
> -   if (!new_pid)
> +   if (new_pid) {
> +   if (audit_sock)
> +   sock_put(audit_sock);
> +   audit_pid = new_pid;
> +   audit_nlk_portid = NETLINK_CB(skb).portid;
> +   sock_hold(skb->sk);

Why refcnt is still needed here? I need it because I removed the code
in net exit code path.


> +   audit_sock = skb->sk;
> +   } else {
>

Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)

2016-12-12 Thread Doug Ledford

On 12/9/2016 1:47 AM, Selvin Xavier wrote:
> This series introduces the RoCE driver for the Broadcom
> NetXtreme-E 10/25/40/50 gigabit RoCE HCAs. 
> This driver is dependent on the bnxt_en NIC driver and is 
> based on the bnxt_re branch in Doug's repository. bnxt_en changes
> required for this patch series is already available in this branch.
> 
> I am preparing a git repository with these changes as per Jason's
> comment and will share the details later today.
> 
> v1-> v2:
>   * The license text in each file updated to reflect Dual license.
>   * Makefile and Kconfig changes are pushed to the last patch
>   * Moved bnxt_re_uverbs_abi.h to include/uapi/rdma folder
>   * Remove duplicate structure definitions from bnxt_re_hsi.h as
> it is available in the corresponding bnxt_en header file (bnxt_hsi.h)
>   * Removed some unused code reported during code review.
>   * Fixed few sparse warnings
> 
> Doug,
> Please review and consider applying this to linux-rdma repository.

There are outstanding review comments to be addressed still yet, and the
v2 patchset doesn't compile for me in 0day testing.  I'm going to bounce
this one to 4.11.


-- 
Doug Ledford 
GPG Key ID: 0E572FDD



signature.asc
Description: OpenPGP digital signature

Re: [iproute2 v2 net-next 0/8] Add support for vrf helper

2016-12-12 Thread Stephen Hemminger

On Sat, 10 Dec 2016 12:32:06 -0800
David Ahern  wrote:

> This series adds support to iproute2 to run a command against a specific
> VRF. The user semnatics are similar to 'ip netns'.
> 
> The 'ip vrf' subcommand supports 3 usages:
> 
> 1. Run a command against a given vrf:
>ip vrf exec NAME CMD
> 
>Uses the recently committed cgroup/sock BPF option. vrf directory
>is added to cgroup2 mount. Individual vrfs are created under it. BPF
>filter is attached to vrf/NAME cgroup2 to set sk_bound_dev_if to the
>device index of the VRF. From there the current process (ip's pid) is
>addded to the cgroups.proc file and the given command is exected. In
>doing so all AF_INET/AF_INET6 (ipv4/ipv6) sockets are automatically
>bound to the VRF domain.
> 
>The association is inherited parent to child allowing the command to
>be a shell from which other commands are run relative to the VRF.
> 
> 2. Show the VRF a process is bound to:
>ip vrf id
>This command essentially looks at /proc/pid/cgroup for a "::/vrf/"
>entry.
> 
> 3. Show process ids bound to a VRF
>ip vrf pids NAME
>This command dumps the file MNT/vrf/NAME/cgroup.procs since that file
>shows the process ids in the particular vrf cgroup.
> 
> v2
> - updated suject of patch 3 to avoid spam filters on vger
> 
> David Ahern (8):
>   lib bpf: Add support for BPF_PROG_ATTACH and BPF_PROG_DETACH
>   bpf: export bpf_prog_load
>   Add libbpf.h header with BPF_ macros
>   move cmd_exec to lib utils
>   Add filesystem APIs to lib
>   change name_is_vrf to return index
>   libnetlink: Add variant of rtnl_talk that does not display RTNETLINK
> answers error
>   Introduce ip vrf command
> 
>  include/bpf_util.h   |   6 ++
>  include/libbpf.h | 184 
>  include/libnetlink.h |   3 +
>  include/utils.h  |   4 +
>  ip/Makefile  |   3 +-
>  ip/ip.c  |   4 +-
>  ip/ip_common.h   |   4 +-
>  ip/iplink_vrf.c  |  29 --
>  ip/ipnetns.c |  34 --
>  ip/ipvrf.c   | 289 
> +++
>  lib/Makefile |   2 +-
>  lib/bpf.c|  71 -
>  lib/exec.c   |  41 
>  lib/fs.c | 143 +
>  lib/libnetlink.c |  20 +++-
>  man/man8/ip-vrf.8|  88 
>  16 files changed, 850 insertions(+), 75 deletions(-)
>  create mode 100644 include/libbpf.h
>  create mode 100644 ip/ipvrf.c
>  create mode 100644 lib/exec.c
>  create mode 100644 lib/fs.c
>  create mode 100644 man/man8/ip-vrf.8
> 

Please use tooling that puts v2 on all the updated patches.
It makes it easier to spot them in patchwork

"virtio-net: enable multiqueue by default" in linux-next breaks networking on GCE

2016-12-12 Thread Theodore Ts'o

Hi,

I was doing a last minute regression test of the ext4 tree before
sending a pull request to Linus, which I do using gce-xfstests[1], and
I found that using networking was broken on GCE on linux-next.  I was
using next-20161209, and after bisecting things, I narrowed down the
commit which causing things to break to commit 449000102901:
"virtio-net: enable multiqueue by default".  Reverting this commit on
top of next-20161209 fixed the problem.

[1] http://thunk.org/gce-xfstests

You can reproduce the problem for building the kernel for Google
Compute Engine --- I use a config such as this [2], and then try to
boot a kernel on a VM.  The way I do this involves booting a test
appliance and then kexec'ing into the kernel to be tested[3], using a
2cpu configuration.  (GCE machine type: n1-standard-2)

[2] 
https://git.kernel.org/cgit/fs/ext2/xfstests-bld.git/tree/kernel-configs/ext4-x86_64-config-4.9
[3] 
https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md

You can then take a look at serial console using a command such as
"gcloud compute instances get-serial-port-output ", and
you will get something like this (see attached).  The important bit is
that the dhclient command is completely failing to be able to get a
response from the network, from which I deduce that apparently that
either networking send or receive or both seem to be badly affected by
the commit in question.

Please let me know if there's anything I can do to help you debug this
further.

Cheers,

- Ted

Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Linux version 
4.9.0-rc8-ext4-06387-g03e5cbd (tytso@tytso-ssd) (gcc version 4.9.2 (Debian 
4.9.2-10) ) #9 SMP Mon Dec 12 04:50:16 UTC 2016
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] Command line: 
root=/dev/sda1 ro console=ttyS0,38400n8 elevator=noop console=ttyS0  
fstestcfg=4k fstestset=-g,quick fstestexc= fstestopt=aex fstesttyp=ext4 
fstestapi=1.3
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
Supporting XSAVE feature 0x001: 'x87 floating point registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
Supporting XSAVE feature 0x002: 'SSE registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
Supporting XSAVE feature 0x004: 'AVX registers'
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: 
xstate_offset[2]:  576, xstate_sizes[2]:  256
Dec 11 23:53:20 xfstests-201612120451 kernel: [0.00] x86/fpu: Enabled 
xstate features 0x7, context size is 832 bytes, using 'standard' format.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load Kernel Modules.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Apply Kernel 
Variables...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting Configuration File 
System...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounting FUSE Control File 
System...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted FUSE Control File 
System.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Mounted Configuration File 
System.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Apply Kernel 
Variables.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Create Static Device 
Nodes in /dev.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Kernel Device 
Manager...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Kernel Device 
Manager.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Coldplug all 
Devices.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting udev Wait for 
Complete Device Initialization...
Dec 11 23:53:20 xfstests-201612120451 systemd-fsck[1659]: xfstests-root: clean, 
56268/655360 files, 357439/2620928 blocks
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started File System Check on 
Root Device.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Remount Root and 
Kernel File Systems...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Remount Root and 
Kernel File Systems.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Various fixups to 
make systemd work better on Debian.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Load/Save Random 
Seed...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Local File Systems 
(Pre).
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Reached target Local File 
Systems (Pre).
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started Load/Save Random Seed.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Started udev Wait for 
Complete Device Initialization.
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Activation of LVM2 
logical volumes...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Starting Copy rules generated 
while the root was ro...
Dec 11 23:53:20 xfstests-201612120451 systemd[1]: Found device /dev/ttyS0.
Dec 11 23:53:20 xfstests-201612120451

[ANNOUNCE] iproute2 4.9

2016-12-12 Thread Stephen Hemminger

Release of iproute2 for Linux 4.9, just in time for your holiday
giving.

Update to iproute2 utility to support new features in Linux 4.9.
Mostly this is refinements to add new flags to tipc, l2tp, ss
and macsec support. There are also a couple of performance
enhancments for handling lots of interfaces and namespaces.

Source:
  https://www.kernel.org/pub/linux/utils/net/iproute2/iproute2-4.9.0.tar.gz

Repository:
  git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git

Report problems (or enhancements) to the netdev@vger.kernel.org mailing list.

---
Alexei Starovoitov (1):
  iptnl: add support for collect_md flag in IPv4 and IPv6 tunnels

Anton Aksola (1):
  iproute2: build nsid-name cache only for commands that need it

Asbjørn Sloth Tønnesen (9):
  man: ip-l2tp.8: fix l2spec_type documentation
  man: ip-l2tp.8: remove non-existent tunnel parameter name
  l2tp: fix integers with too few significant bits
  l2tp: fix L2TP_ATTR_{RECV,SEND}_SEQ handling
  l2tp: fix L2TP_ATTR_UDP_CSUM handling
  l2tp: read IPv6 UDP checksum attributes from kernel
  l2tp: support sequence numbering
  l2tp: show tunnel: expose UDP checksum state
  man: ip-l2tp.8: document UDP checksum options

Craig Dillabaugh (1):
  action gact: list pipe as a valid action

Daniel Borkmann (1):
  tc, ipt: don't enforce iproute2 dependency on iptables-devel

Daniel Hopf (1):
  macsec: Nr. of packets and octets for macsec tx stats were swapped

Eric Dumazet (1):
  tc: fq: display unthrottle latency

Hadar Hen Zion (2):
  tc: flower: Introduce vlan support
  tc: m_vlan: Add priority option to push vlan action

Hangbin Liu (4):
  misc/ss: tcp cwnd should be unsigned
  ip rule: merge ip rule flush and list, save together
  ip rule: add selector support
  devlink: Convert conditional in dl_argv_handle_port() to switch()

Isaac Boukris (1):
  iproute2: ss: escape all null bytes in abstract unix domain socket

Jakub Kicinski (1):
  tc: cls_bpf: handle skip_sw and skip_hw flags

Jamal Hadi Salim (4):
  actions ife: Introduce encoding and decoding of tcindex metadata
  actions: add skbmod action
  man pages: Add tc-ife to Makefile
  tc filters: add support to get individual filters by handle

Lorenzo Colitti (1):
  ss: Support displaying and filtering on socket marks.

Lucas Bates (2):
  man pages: update ife action to include tcindex
  man pages: add man page for skbmod action

Mahesh Bandewar (1):
  ip: (ipvlan) introduce L3s mode

Mike Frysinger (1):
  ifstat/nstat: fix help output alignment

Moshe Shemesh (1):
  ip link: Add support to configure SR-IOV VF to vlan protocol 802.1ad (VST 
QinQ)

Neal Cardwell (1):
  ss: output TCP BBR diag information

Nikolay Aleksandrov (4):
  bridge: vlan: add support to display per-vlan statistics
  ipmroute: add support for age dumping
  bridge: vlan: remove wrong stats help
  bridge: add support for the multicast flood flag

Parthasarathy Bhuvaragan (7):
  tipc: remove dead code
  tipc: add link monitor set threshold
  tipc: add link monitor get threshold
  tipc: add link monitor summary
  tipc: refractor bearer to facilitate link monitor
  tipc: add link monitor list
  tipc: update man page for link monitor

Paul Blakey (1):
  tc: flower: Fix usage message

Phil Sutter (6):
  iproute: fix documentation for ip rule scan order
  include: Add linux/sctp.h
  ss: Add support for SCTP protocol
  ipaddress: Simplify vf_info parsing
  ipaddress: Print IFLA_VF_QUERY_RSS_EN setting
  man: ip-route.8: Add notes about dropped IPv4 route cache

Richard Alpe (3):
  tipc: add peer remove functionality
  tipc: introduce bearer add for remoteip
  tipc: add the ability to get UDP bearer options

Roi Dayan (2):
  devlink: Add usage help for eswitch subcommand
  devlink: Add option to set and show eswitch inline mode

Roman Mashak (7):
  ife action: allow specifying index in hex
  ife: print prio, mark and hash as unsigned
  ife: improve help text
  tc: updated man page to reflect GET command to retrieve a single filter.
  tc: improved usage help for fw classifier.
  tc: print raw qdisc handle.
  tc: distinguish Add/Replace filter operations

Shmulik Ladkani (1):
  tc: m_vlan: Add vlan modify action

Simon Horman (1):
  ss: initialise variables outside of for loop

Stephen Hemminger (24):
  update headers to 4.8-rc2 net-next
  update TIPC headers
  tipc: cleanup style issues
  update kernel headers from net-next
  update bpf.h
  update headers from pre 4.9 (net-next)
  iplink: cleanup style errors
  ip: iprule style cleanup
  tc: skbmod style cleanup
  tc_filter: style cleanup
  ip: macvlan style cleanup
  Revert "iproute2: macvlan: add "source" mode"
  cleanup debris from revert
  ss: break really lo

Re: Soft lockup in tc_classify

2016-12-12 Thread Cong Wang

On Mon, Dec 12, 2016 at 1:18 PM, Or Gerlitz  wrote:
> On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann  wrote:
>
>> Note that there's still the RCU fix missing for the deletion race that
>> Cong will still send out, but you say that the only thing you do is to
>> add a single rule, but no other operation in involved during that test?
>
> What's missing to have the deletion race fixed? making a patch or
> testing to a patch which was sent?

If you think it would help for this problem, here is my patch rebased
on the latest net-next.

Again, I don't see how it could help this case yet, especially I don't
see how we could have a loop in this singly linked list.
commit f6becda1e12fd8ef74e901fe39adb4558ce6c8f9
Author: Cong Wang 
Date:   Wed Nov 23 14:58:01 2016 -0800

net_sched: move the empty tp check from ->destroy() to ->delete()

Roi reported we could have a race condition where in ->classify() path
we dereference tp->root and meanwhile a parallel ->destroy() makes it
a NULL.

This is possible because ->destroy() could be called when deleting
a filter to check if we are the last one in tp, this tp is still
linked and visible at that time.

The root cause of this problem is the semantic of ->destroy(), it
does two things (for non-force case):

1) check if tp is empty
2) if tp is empty we could really destroy it

and its caller, if cares, needs to check its return value to see if
it is really destroyed. Therefore we can't unlink tp unless we know
it is empty.

As suggested by Daniel, we could actually move the test logic to ->delete()
so that we can safely unlink tp after ->delete() tells us the last one is
just deleted and before ->destroy().

What's more, even we unlink it before ->destroy(), it could still have
readers since we don't wait for a grace period here, we should not modify
tp->root in ->destroy() either.

Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are 
gone")
Reported-by: Roi Dayan 
Cc: Daniel Borkmann 
Cc: John Fastabend 
Signed-off-by: Cong Wang 

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 498f81b..b5eda3f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -203,14 +203,14 @@ struct tcf_proto_ops {
const struct tcf_proto *,
struct tcf_result *);
int (*init)(struct tcf_proto*);
-   bool(*destroy)(struct tcf_proto*, bool);
+   void(*destroy)(struct tcf_proto*);

unsigned long   (*get)(struct tcf_proto*, u32 handle);
int (*change)(struct net *net, struct sk_buff *,
struct tcf_proto*, unsigned long,
u32 handle, struct nlattr **,
unsigned long *, bool);
-   int (*delete)(struct tcf_proto*, unsigned long);
+   int (*delete)(struct tcf_proto*, unsigned long, 
bool*);
void(*walk)(struct tcf_proto*, struct tcf_walker 
*arg);

/* rtnetlink specific */
@@ -405,7 +405,7 @@ struct Qdisc *qdisc_create_dflt(struct netdev_queue 
*dev_queue,
const struct Qdisc_ops *ops, u32 parentid);
 void __qdisc_calculate_pkt_len(struct sk_buff *skb,
   const struct qdisc_size_table *stab);
-bool tcf_destroy(struct tcf_proto *tp, bool force);
+void tcf_destroy(struct tcf_proto *tp);
 void tcf_destroy_chain(struct tcf_proto __rcu **fl);
 int skb_do_redirect(struct sk_buff *);

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3fbba79..f9179e0 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -321,7 +321,7 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n)

tfilter_notify(net, skb, n, tp, fh,
   RTM_DELTFILTER, false);
-   tcf_destroy(tp, true);
+   tcf_destroy(tp);
err = 0;
goto errout;
}
@@ -331,25 +331,29 @@ static int tc_ctl_tfilter(struct sk_buff *skb, struct 
nlmsghdr *n)
!(n->nlmsg_flags & NLM_F_CREATE))
goto errout;
} else {
+   bool last;
+
switch (n->nlmsg_type) {
case RTM_NEWTFILTER:
err = -EEXIST;
if (n->nlmsg_flags & NLM_F_EXCL) {
if (tp_created)
-   tcf_destroy(tp, true);
+   tcf_destroy(tp);
goto errout;
}
break;

[PATCH] net: cirrus: ep93xx: use new api ethtool_{get|set}_link_ksettings

2016-12-12 Thread Philippe Reynes

The ethtool api {get|set}_settings is deprecated.
We move this driver to new api {get|set}_link_ksettings.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/cirrus/ep93xx_eth.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/cirrus/ep93xx_eth.c 
b/drivers/net/ethernet/cirrus/ep93xx_eth.c
index a1de0d1..396c886 100644
--- a/drivers/net/ethernet/cirrus/ep93xx_eth.c
+++ b/drivers/net/ethernet/cirrus/ep93xx_eth.c
@@ -715,16 +715,18 @@ static void ep93xx_get_drvinfo(struct net_device *dev, 
struct ethtool_drvinfo *i
strlcpy(info->version, DRV_MODULE_VERSION, sizeof(info->version));
 }
 
-static int ep93xx_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int ep93xx_get_link_ksettings(struct net_device *dev,
+struct ethtool_link_ksettings *cmd)
 {
struct ep93xx_priv *ep = netdev_priv(dev);
-   return mii_ethtool_gset(&ep->mii, cmd);
+   return mii_ethtool_get_link_ksettings(&ep->mii, cmd);
 }
 
-static int ep93xx_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int ep93xx_set_link_ksettings(struct net_device *dev,
+const struct ethtool_link_ksettings *cmd)
 {
struct ep93xx_priv *ep = netdev_priv(dev);
-   return mii_ethtool_sset(&ep->mii, cmd);
+   return mii_ethtool_set_link_ksettings(&ep->mii, cmd);
 }
 
 static int ep93xx_nway_reset(struct net_device *dev)
@@ -741,10 +743,10 @@ static u32 ep93xx_get_link(struct net_device *dev)
 
 static const struct ethtool_ops ep93xx_ethtool_ops = {
.get_drvinfo= ep93xx_get_drvinfo,
-   .get_settings   = ep93xx_get_settings,
-   .set_settings   = ep93xx_set_settings,
.nway_reset = ep93xx_nway_reset,
.get_link   = ep93xx_get_link,
+   .get_link_ksettings = ep93xx_get_link_ksettings,
+   .set_link_ksettings = ep93xx_set_link_ksettings,
 };
 
 static const struct net_device_ops ep93xx_netdev_ops = {
-- 
1.7.4.4

Re: Soft lockup in inet_put_port on 4.6

2016-12-12 Thread Josef Bacik



On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa 
 wrote:

On 12.12.2016 19:05, Josef Bacik wrote:
 On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet 


 wrote:

 On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:



  Hmm... Is your ephemeral port range includes the port your load
  balancing app is using ?


 I suspect that you might have processes doing bind( port = 0) that 
are

 trapped into the bind_conflict() scan ?

 With 100,000 + timewaits there, this possibly hurts.

 Can you try the following loop breaker ?


 It doesn't appear that the app is doing bind(port = 0) during normal
 operation.  I tested this patch and it made no difference.  I'm 
going to

 test simply restarting the app without changing to the SO_REUSEPORT
 option.  Thanks,


Would it be possible to trace the time the function uses with trace? 
If

we don't see the number growing considerably over time we probably can
rule out that we loop somewhere in there (I would instrument
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).

__inet_hash_connect -> __inet_check_established also takes a lock
(inet_ehash_lockp) which can be locked from inet_diag code path during
socket diag info dumping.

Unfortunately we couldn't reproduce it so far. :/


So I had a bcc script running to time how long we spent in 
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port, but 
of course I'm an idiot and didn't actually separate out the stats so I 
could tell _which_ one was taking forever.  But anyway here's a normal 
distribution on the box


Some shit   : count distribution
0 -> 1  : 0|   
|
2 -> 3  : 0|   
|
4 -> 7  : 0|   
|
8 -> 15 : 0|   
|
   16 -> 31 : 0|   
|
   32 -> 63 : 0|   
|
   64 -> 127: 0|   
|
  128 -> 255: 0|   
|
  256 -> 511: 0|   
|
  512 -> 1023   : 0|   
|
 1024 -> 2047   : 74   |   
|
 2048 -> 4095   : 10537
||
 4096 -> 8191   : 8497 |   
|
 8192 -> 16383  : 3745 |** 
|
16384 -> 32767  : 300  |*  
|
32768 -> 65535  : 250  |   
|
65536 -> 131071 : 180  |   
|
   131072 -> 262143 : 71   |   
|
   262144 -> 524287 : 18   |   
|
   524288 -> 1048575: 5|   
|


With the times in nanoseconds, and here's the distribution during the 
problem


Some shit   : count distribution
0 -> 1  : 0|   
|
2 -> 3  : 0|   
|
4 -> 7  : 0|   
|
8 -> 15 : 0|   
|
   16 -> 31 : 0|   
|
   32 -> 63 : 0|   
|
   64 -> 127: 0|   
|
  128 -> 255: 0|   
|
  256 -> 511: 0|   
|
  512 -> 1023   : 0|   
|
 1024 -> 2047   : 21   |   
|
 2048 -> 4095   : 21820
||
 4096 -> 8191   : 11598|*  
|
 8192 -> 16383  : 4337 |***
|
16384 -> 32767  : 290  |   
|
32768 -> 65535  : 59   |   
|
65536 -> 131071 : 23   |   
|
   131072 -> 262143 : 12   |   
|
   262144 -> 524287 : 6|   
|
   524288 -> 1048575: 19   |   
|
  1048576 -> 2097151: 1079 |*  
|
  2097152 -> 4194303: 0|

Re: [PATCH for-next 0/6] IB/hns: Bug Fixes for HNS RoCE Driver

2016-12-12 Thread Doug Ledford

On 11/29/2016 6:10 PM, Salil Mehta wrote:
> This patch-set contains bug fixes for the HNS RoCE driver.
> 
> Lijun Ou (1):
>   IB/hns: Fix the IB device name
> 
> Shaobo Xu (2):
>   IB/hns: Fix the bug when free mr
>   IB/hns: Fix the bug when free cq
> 
> Wei Hu (Xavier) (3):
>   IB/hns: Fix the bug when destroy qp
>   IB/hns: Fix the bug of setting port mtu
>   IB/hns: Delete the redundant memset operation
> 
>  drivers/infiniband/hw/hns/hns_roce_cmd.h|5 -
>  drivers/infiniband/hw/hns/hns_roce_common.h |   42 ++
>  drivers/infiniband/hw/hns/hns_roce_cq.c |   27 +-
>  drivers/infiniband/hw/hns/hns_roce_device.h |   18 +
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |  967 
> ---
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.h  |   57 ++
>  drivers/infiniband/hw/hns/hns_roce_main.c   |   26 +-
>  drivers/infiniband/hw/hns/hns_roce_mr.c |   21 +-
>  8 files changed, 1026 insertions(+), 137 deletions(-)
> 

Series applied, thanks.

-- 
Doug Ledford 
GPG Key ID: 0E572FDD



signature.asc
Description: OpenPGP digital signature

Re: [PATCH V3 for-next 00/11] Code improvements & fixes for HNS RoCE driver

2016-12-12 Thread Doug Ledford

On 11/23/2016 2:40 PM, Salil Mehta wrote:
> This patchset introduces some code improvements and fixes
> for the identified problems in the HNS RoCE driver.
> 
> Lijun Ou (4):
>   IB/hns: Add the interface for querying QP1
>   IB/hns: add self loopback for CM
>   IB/hns: Modify the condition of notifying hardware loopback
>   IB/hns: Fix the bug for qp state in hns_roce_v1_m_qp()
> 
> Salil Mehta (1):
>   IB/hns: Fix for Checkpatch.pl comment style errors
> 
> Shaobo Xu (1):
>   IB/hns: Implement the add_gid/del_gid and optimize the GIDs
> management
> 
> Wei Hu (Xavier) (5):
>   IB/hns: Add code for refreshing CQ CI using TPTR
>   IB/hns: Optimize the logic of allocating memory using APIs
>   IB/hns: Modify the macro for the timeout when cmd process
>   IB/hns: Modify query info named port_num when querying RC QP
>   IB/hns: Change qpn allocation to round-robin mode.
> 
>  drivers/infiniband/hw/hns/hns_roce_alloc.c  |   11 +-
>  drivers/infiniband/hw/hns/hns_roce_cmd.c|8 +-
>  drivers/infiniband/hw/hns/hns_roce_cmd.h|7 +-
>  drivers/infiniband/hw/hns/hns_roce_common.h |2 -
>  drivers/infiniband/hw/hns/hns_roce_cq.c |   17 +-
>  drivers/infiniband/hw/hns/hns_roce_device.h |   45 ++--
>  drivers/infiniband/hw/hns/hns_roce_eq.c |6 +-
>  drivers/infiniband/hw/hns/hns_roce_hem.c|6 +-
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.c  |  267 +--
>  drivers/infiniband/hw/hns/hns_roce_hw_v1.h  |   17 +-
>  drivers/infiniband/hw/hns/hns_roce_main.c   |  311 
> +++
>  drivers/infiniband/hw/hns/hns_roce_mr.c |   22 +-
>  drivers/infiniband/hw/hns/hns_roce_pd.c |5 +-
>  drivers/infiniband/hw/hns/hns_roce_qp.c |2 +-
>  14 files changed, 364 insertions(+), 362 deletions(-)
> 

Series applied, thanks.

-- 
Doug Ledford 
GPG Key ID: 0E572FDD



signature.asc
Description: OpenPGP digital signature

Re: Soft lockup in inet_put_port on 4.6

2016-12-12 Thread Josef Bacik

On Mon, Dec 12, 2016 at 1:44 PM, Hannes Frederic Sowa 
 wrote:

On 12.12.2016 19:05, Josef Bacik wrote:
 On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet 


 wrote:

 On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:



  Hmm... Is your ephemeral port range includes the port your load
  balancing app is using ?


 I suspect that you might have processes doing bind( port = 0) that 
are

 trapped into the bind_conflict() scan ?

 With 100,000 + timewaits there, this possibly hurts.

 Can you try the following loop breaker ?


 It doesn't appear that the app is doing bind(port = 0) during normal
 operation.  I tested this patch and it made no difference.  I'm 
going to

 test simply restarting the app without changing to the SO_REUSEPORT
 option.  Thanks,


Would it be possible to trace the time the function uses with trace? 
If

we don't see the number growing considerably over time we probably can
rule out that we loop somewhere in there (I would instrument
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).

__inet_hash_connect -> __inet_check_established also takes a lock
(inet_ehash_lockp) which can be locked from inet_diag code path during
socket diag info dumping.

Unfortunately we couldn't reproduce it so far. :/


Working on getting the timing info, will probably be tomorrow due to 
meetings.  I did test simply restarting the app without changing to the 
config that enabled the use of SO_REUSEPORT and the problem didn't 
occur, so it definitely has something to do with SO_REUSEPORT.  Thanks,


Josef

Re: Soft lockup in tc_classify

2016-12-12 Thread Or Gerlitz

On Mon, Dec 12, 2016 at 3:28 PM, Daniel Borkmann  wrote:

> Note that there's still the RCU fix missing for the deletion race that
> Cong will still send out, but you say that the only thing you do is to
> add a single rule, but no other operation in involved during that test?

What's missing to have the deletion race fixed? making a patch or
testing to a patch which was sent?

Re: [RFC PATCH net-next v3 1/2] macb: Add 1588 support in Cadence GEM.

2016-12-12 Thread Richard Cochran

On Mon, Dec 12, 2016 at 10:22:43AM +, andrei.pistir...@microchip.com wrote:
> Richard, are you agree with this?

Yes, but please trim your replies next time.  Scrolling through pages
of quoted headers and stale content in order to read one line is very
annoying.

Thanks,
Richard

Re: [PATCH V2 10/22] bnxt_re: Support for CQ verbs

2016-12-12 Thread Jonathan Toppins

On 12/09/2016 01:48 AM, Selvin Xavier wrote:
> This patch implements support for create_cq, destroy_cq and req_notify_cq
> verbs.
> 
> Signed-off-by: Eddie Wai 
> Signed-off-by: Devesh Sharma 
> Signed-off-by: Somnath Kotur 
> Signed-off-by: Sriharsha Basavapatna 
> Signed-off-by: Selvin Xavier 
> ---
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c| 183 
> 
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h|  47 ++
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c | 154 
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h |  19 +++
>  drivers/infiniband/hw/bnxtre/bnxt_re_main.c |   4 +
>  include/uapi/rdma/bnxt_re_uverbs_abi.h  |  11 ++
>  6 files changed, 418 insertions(+)

Something I just realized is this patch series does not modify the
MAINTAINERS file. Whom from Broadcom will be maintaining this driver?
Probably want to include this info in the v3 series

[...]

> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c 
> b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> index 3417829..f316598 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
> @@ -60,6 +60,16 @@
>  #include "bnxt_re_ib_verbs.h"
>  #include 
>  
> +static int bnxt_re_copy_to_udata(struct bnxt_re_dev *rdev, void *data, int 
> len,
> +  struct ib_udata *udata)
> +{
> + int rc;
> +
> + rc = ib_copy_to_udata(udata, data, len);
> +
> + return rc ? -EFAULT : 0;
> +}

This function seems to provide no value by wrapping ib_copy_to_udata,
any reason to keep it? From the two call sites for this function it
appears it can be replaced with a direct call to ib_copy_to_udata.

Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def

2016-12-12 Thread Ozgur Karatas



12.12.2016, 20:18, "Leon Romanovsky" :
> On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote:
>>  Dear Romanovsky;
>
> Please avoid top-posting in your replies.
> Thanks

Dear Leon; 
thanks for the information., I will pay attention.

>>  I'm trying to learn english and I apologize for my mistake words and 
>> phrases. So, I think the code when call to "sg_set_buf" and next time set 
>> memory and buffer. For example, isn't to call "WARN_ON" function, get a 
>> error to implicit declaration, right?
>>
>>  Because, you will use to "BUG_ON" get a error implicit declaration of 
>> functions.
>
> I'm not sure that I followed you. mem->offset is set by sg_set_buf from
> buf variable returned by dma_alloc_coherent(). HW needs to get very
> precise size of this buf, in multiple of pages and aligned to pages
> boundaries.

I have studied the following your coding and I guess that's the right patchs.
You are the very expert in this matter, thank you for the correct for me.

I learn to your style as an example.

Regards,

Ozgur Karatas

> See the patch inline which removes this BUG_ON in proper and safe way.
>
> From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001
> From: Leon Romanovsky 
> Date: Mon, 12 Dec 2016 20:02:45 +0200
> Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine
>
> This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
> by checking DMA address aligment in advance and performing proper
> folding in case of error.
>
> Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory")
> Reported-by: Ozgur Karatas 
> Signed-off-by: Leon Romanovsky 
> ---
>  drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
> b/drivers/net/ethernet/mellanox/mlx4/icm.c
> index 2a9dd46..e1f9e7c 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/icm.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
> @@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, 
> struct scatterlist *mem,
>  if (!buf)
>  return -ENOMEM;
>
> + if (offset_in_page(buf)) {
> + dma_free_coherent(dev, PAGE_SIZE << order,
> + buf, sg_dma_address(mem));
> + return -ENOMEM;
> + }
> +
>  sg_set_buf(mem, buf, PAGE_SIZE << order);
> - BUG_ON(mem->offset);
>  sg_dma_len(mem) = PAGE_SIZE << order;
>  return 0;
>  }
> --
> 2.10.2

Re: [PATCH v2] audit: use proper refcount locking on audit_sock

2016-12-12 Thread Paul Moore

On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs  wrote:
> Resetting audit_sock appears to be racy.
>
> audit_sock was being copied and dereferenced without using a refcount on
> the source sock.
>
> Bump the refcount on the underlying sock when we store a refrence in
> audit_sock and release it when we reset audit_sock.  audit_sock
> modification needs the audit_cmd_mutex.
>
> See: https://lkml.org/lkml/2016/11/26/232
>
> Thanks to Eric Dumazet  and Cong Wang
>  on ideas how to fix it.
>
> Signed-off-by: Richard Guy Briggs 
> ---
> There has been a lot of change in the audit code that is about to go
> upstream to address audit queue issues.  This patch is based on the
> source tree: git://git.infradead.org/users/pcmoore/audit#next
> ---
>  kernel/audit.c |   34 --
>  1 files changed, 28 insertions(+), 6 deletions(-)

My previous question about testing still stands, but I took a closer
look and have some additional comments, see below ...

> diff --git a/kernel/audit.c b/kernel/audit.c
> index f20eee0..439f7f3 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -452,7 +452,9 @@ static void auditd_reset(void)
> struct sk_buff *skb;
>
> /* break the connection */
> +   sock_put(audit_sock);
> audit_pid = 0;
> +   audit_nlk_portid = 0;
> audit_sock = NULL;
>
> /* flush all of the retry queue to the hold queue */
> @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb)
> if (rc >= 0) {
> consume_skb(skb);
> rc = 0;
> +   } else {
> +   if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {

I dislike the way you wrote this because instead of simply looking at
this to see if it correct I need to sort out all the bits and find out
if there are other error codes that could run afoul of this check ...
make it simple, e.g. (rc == -ENOMEM || rc == -EPERM || ...).
Actually, since EPERM is 1, -EPERM (-1 in two's compliment is
0x) is going to cause this to be true for pretty much any
value of rc, yes?

> +   mutex_lock(&audit_cmd_mutex);
> +   auditd_reset();
> +   mutex_unlock(&audit_cmd_mutex);
> +   }

The code in audit#next handles netlink_unicast() errors in
kauditd_thread() and you are adding error handling code here in
kauditd_send_unicast_skb() ... that's messy.  I don't care too much
where the auditd_reset() call is made, but let's only do it in one
function; FWIW, I originally put the error handling code in
kauditd_thread() because there was other error handling code that
needed to done in that scope so it resulted in cleaner code.

Related, I see you are now considering ENOMEM to be a fatal condition,
that differs from the AUDITD_BAD macro in kauditd_thread(); this
difference needs to be reconciled.

Finally, you should update the comment header block for auditd_reset()
that it needs to be called with the audit_cmd_mutex held.

> @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, 
> struct nlmsghdr *nlh)
> return -EACCES;
> }
> if (audit_pid && new_pid &&
> -   audit_replace(requesting_pid) != -ECONNREFUSED) {
> +   (audit_replace(requesting_pid) & 
> (-ECONNREFUSED|-EPERM|-ENOMEM))) {

Do we simply want to treat any error here as fatal, and not just
ECONN/EPERM/ENOMEM?  If not, let's come up with a single macro to
handle the fatal netlink_unicast() return codes so we have some chance
to keep things consistent in the future.

-- 
paul moore
www.paul-moore.com

Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries

2016-12-12 Thread Vivien Didelot

Hi Andrew,

Andrew Lunn  writes:

> Humm, it looks like we are doing the atu_get wrong. We are looking for
> a specific MAC address. Yet we seem to be walking the whole table to
> find it, rather than getting the hardware to do the search. 

We are not doing it wrong, the hardware does the search. A classic dump
of an ATU database consists of starting from the broadcast address
ff:ff:ff:ff:ff:ff and issuing GetNext operation until we reach back the
broadcast address. Only addresses in used are returned by GetNext, thus
dumping an empty database is completed in a single operation.

I implemented atu_get intentionally this way because it provides simpler
code, rather than doing arithmetic on MAC addresses (Unless I am unaware
of simple increment/decrement code.)

> The current code is:
>
> static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid,
>  const u8 *addr, struct mv88e6xxx_atu_entry 
> *entry)
> {
> struct mv88e6xxx_atu_entry next;
> int err;
>
> eth_broadcast_addr(next.mac);
>
> err = _mv88e6xxx_atu_mac_write(chip, next.mac);
>
> We should be setting next.mac to one less than the address we are
> looking for.
>
> Volodymyr, please could you try that, and see how much of a speed up
> you get.
>
> There is another optimization which can be made. We only say there is
> no such entry once we have reached the end of the table. But it will
> return the entries in ascending order. So if the entry it returned is
> bigger than what we are looking for, we can immediately abort the
> search and say it does not exist.

However your two suggestions to optimize the lookup are correct. It'd be
interesting to see if that makes a significant difference or not.

Thanks,

Vivien

Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries

2016-12-12 Thread Andrew Lunn

On Mon, Dec 12, 2016 at 08:37:50AM -0800, Florian Fainelli wrote:
> On 12/12/2016 07:22 AM, Volodymyr Bendiuga wrote:
> > Hi,
> > 
> > I apologise for incorrectly formatted patch, I will fix and resend it.
> > The problem with the ATU right now is that it is too slow when inserting
> > entries.
> > When the OS boots up, it might insert some multicast entries into the
> > atu (if
> > they are preconfigured by user). I run a test with 10 mc entries being
> > configured for
> > each port (13 ports), and it took 15 seconds, which made system quite
> > slow on responding to
> > other commands, as it has been inserting mc entries. The implementation
> > with hashtable
> > made insert command for 13 ports and 10 entries per port about 700 msec
> > long.

Humm, it looks like we are doing the atu_get wrong. We are looking for
a specific MAC address. Yet we seem to be walking the whole table to
find it, rather than getting the hardware to do the search. 

The current code is:

static int mv88e6xxx_atu_get(struct mv88e6xxx_chip *chip, int fid,
 const u8 *addr, struct mv88e6xxx_atu_entry *entry)
{
struct mv88e6xxx_atu_entry next;
int err;

eth_broadcast_addr(next.mac);

err = _mv88e6xxx_atu_mac_write(chip, next.mac);

We should be setting next.mac to one less than the address we are
looking for.

Volodymyr, please could you try that, and see how much of a speed up
you get.

There is another optimization which can be made. We only say there is
no such entry once we have reached the end of the table. But it will
return the entries in ascending order. So if the entry it returned is
bigger than what we are looking for, we can immediately abort the
search and say it does not exist.

   Andrew

Re: Soft lockup in tc_classify

2016-12-12 Thread Cong Wang

On Mon, Dec 12, 2016 at 8:04 AM, Shahar Klein  wrote:
>
>
> On 12/12/2016 3:28 PM, Daniel Borkmann wrote:
>>
>> Hi Shahar,
>>
>> On 12/12/2016 10:43 AM, Shahar Klein wrote:
>>>
>>> Hi All,
>>>
>>> sorry for the spam, the first time was sent with html part and was
>>> rejected.
>>>
>>> We observed an issue where a classifier instance next member is
>>> pointing back to itself, causing a CPU soft lockup.
>>> We found it by running traffic on many udp connections and then adding
>>> a new flower rule using tc.
>>>
>>> We added a quick workaround to verify it:
>>>
>>> In tc_classify:
>>>
>>>  for (; tp; tp = rcu_dereference_bh(tp->next)) {
>>>  int err;
>>> +   if (tp == tp->next)
>>> + RCU_INIT_POINTER(tp->next, NULL);
>>>
>>>
>>> We also had a print here showing tp->next is pointing to tp. With this
>>> workaround we are not hitting the issue anymore.
>>> We are not sure we fully understand the mechanism here - with the rtnl
>>> and rcu locks.
>>> We'll appreciate your help solving this issue.
>>
>>
>> Note that there's still the RCU fix missing for the deletion race that
>> Cong will still send out, but you say that the only thing you do is to
>> add a single rule, but no other operation in involved during that test?

Hmm, I thought RCU_INIT_POINTER() respects readers, but seems no?
If so, that could be the cause since we play with the next pointer and
there is only one filter in this case, but I don't see why we could have
a loop here.

>>
>> Do you have a script and kernel .config for reproducing this?
>
>
> I'm using a user space socket app(https://github.com/shahar-klein/noodle)on
> a vm to push udp packets from ~2000 different udp src ports ramping up at
> ~100 per second towards another vm on the same Hypervisor. Once the traffic
> starts I'm pushing ingress flower tc udp rules(even_udp_src_port->mirred,
> odd->drop) on the relevant representor in the Hypervisor.

Do you mind to share your `tc filter show dev...` output? Also, since you
mentioned you only add one flower filter, just want to make sure you never
delete any filter before/when the bug happens? How reproducible is this?

Thanks!

Re: Soft lockup in inet_put_port on 4.6

2016-12-12 Thread Hannes Frederic Sowa

On 12.12.2016 19:05, Josef Bacik wrote:
> On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet 
> wrote:
>> On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:
>>
>>>
>>>  Hmm... Is your ephemeral port range includes the port your load
>>>  balancing app is using ?
>>
>> I suspect that you might have processes doing bind( port = 0) that are
>> trapped into the bind_conflict() scan ?
>>
>> With 100,000 + timewaits there, this possibly hurts.
>>
>> Can you try the following loop breaker ?
> 
> It doesn't appear that the app is doing bind(port = 0) during normal
> operation.  I tested this patch and it made no difference.  I'm going to
> test simply restarting the app without changing to the SO_REUSEPORT
> option.  Thanks,

Would it be possible to trace the time the function uses with trace? If
we don't see the number growing considerably over time we probably can
rule out that we loop somewhere in there (I would instrument
inet_csk_bind_conflict, __inet_hash_connect and inet_csk_get_port).

__inet_hash_connect -> __inet_check_established also takes a lock
(inet_ehash_lockp) which can be locked from inet_diag code path during
socket diag info dumping.

Unfortunately we couldn't reproduce it so far. :/

Thanks,
Hannes

Re: [PATCH V2 13/22] bnxt_re: Support QP verbs

2016-12-12 Thread Leon Romanovsky

On Thu, Dec 08, 2016 at 10:48:07PM -0800, Selvin Xavier wrote:
> This patch implements create_qp, destroy_qp, query_qp and modify_qp verbs.
>
> v2: Fixed sparse warnings
>
> Signed-off-by: Eddie Wai 
> Signed-off-by: Devesh Sharma 
> Signed-off-by: Somnath Kotur 
> Signed-off-by: Sriharsha Basavapatna 
> Signed-off-by: Selvin Xavier 
> ---
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c| 873 
> 
>  drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.h| 250 +++
>  drivers/infiniband/hw/bnxtre/bnxt_re.h  |  14 +
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c | 762 +
>  drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.h |  21 +
>  drivers/infiniband/hw/bnxtre/bnxt_re_main.c |   6 +
>  include/uapi/rdma/bnxt_re_uverbs_abi.h  |  10 +
>  7 files changed, 1936 insertions(+)
>
> diff --git a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c 
> b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> index 636306f..edc9411 100644
> --- a/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> +++ b/drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
> @@ -50,6 +50,69 @@
>  #include "bnxt_qplib_fp.h"
>
>  static void bnxt_qplib_arm_cq_enable(struct bnxt_qplib_cq *cq);
> +
> +static void bnxt_qplib_free_qp_hdr_buf(struct bnxt_qplib_res *res,
> +struct bnxt_qplib_qp *qp)
> +{
> + struct bnxt_qplib_q *rq = &qp->rq;
> + struct bnxt_qplib_q *sq = &qp->sq;
> +
> + if (qp->rq_hdr_buf)
> + dma_free_coherent(&res->pdev->dev,
> +   rq->hwq.max_elements * qp->rq_hdr_buf_size,
> +   qp->rq_hdr_buf, qp->rq_hdr_buf_map);
> + if (qp->sq_hdr_buf)
> + dma_free_coherent(&res->pdev->dev,
> +   sq->hwq.max_elements * qp->sq_hdr_buf_size,
> +   qp->sq_hdr_buf, qp->sq_hdr_buf_map);
> + qp->rq_hdr_buf = NULL;
> + qp->sq_hdr_buf = NULL;
> + qp->rq_hdr_buf_map = 0;
> + qp->sq_hdr_buf_map = 0;
> + qp->sq_hdr_buf_size = 0;
> + qp->rq_hdr_buf_size = 0;
> +}
> +
> +static int bnxt_qplib_alloc_qp_hdr_buf(struct bnxt_qplib_res *res,
> +struct bnxt_qplib_qp *qp)
> +{
> + struct bnxt_qplib_q *rq = &qp->rq;
> + struct bnxt_qplib_q *sq = &qp->rq;
> + int rc = 0;
> +
> + if (qp->sq_hdr_buf_size && sq->hwq.max_elements) {
> + qp->sq_hdr_buf = dma_alloc_coherent(&res->pdev->dev,
> + sq->hwq.max_elements *
> + qp->sq_hdr_buf_size,
> + &qp->sq_hdr_buf_map, GFP_KERNEL);
> + if (!qp->sq_hdr_buf) {
> + rc = -ENOMEM;
> + dev_err(&res->pdev->dev,
> + "QPLIB: Failed to create sq_hdr_buf");
> + goto fail;
> + }
> + }
> +
> + if (qp->rq_hdr_buf_size && rq->hwq.max_elements) {
> + qp->rq_hdr_buf = dma_alloc_coherent(&res->pdev->dev,
> + rq->hwq.max_elements *
> + qp->rq_hdr_buf_size,
> + &qp->rq_hdr_buf_map,
> + GFP_KERNEL);
> + if (!qp->rq_hdr_buf) {
> + rc = -ENOMEM;
> + dev_err(&res->pdev->dev,
> + "QPLIB: Failed to create rq_hdr_buf");
> + goto fail;
> + }
> + }
> + return 0;
> +
> +fail:
> + bnxt_qplib_free_qp_hdr_buf(res, qp);
> + return rc;
> +}
> +
>  static void bnxt_qplib_service_nq(unsigned long data)
>  {
>   struct bnxt_qplib_nq *nq = (struct bnxt_qplib_nq *)data;
> @@ -215,6 +278,816 @@ int bnxt_qplib_alloc_nq(struct pci_dev *pdev, struct 
> bnxt_qplib_nq *nq)
>   return 0;
>  }
>
> +/* QP */
> +int bnxt_qplib_create_qp1(struct bnxt_qplib_res *res, struct bnxt_qplib_qp 
> *qp)
> +{
> + struct bnxt_qplib_rcfw *rcfw = res->rcfw;
> + struct cmdq_create_qp1 req;
> + struct creq_create_qp1_resp *resp;
> + struct bnxt_qplib_pbl *pbl;
> + struct bnxt_qplib_q *sq = &qp->sq;
> + struct bnxt_qplib_q *rq = &qp->rq;
> + int rc;
> + u16 cmd_flags = 0;
> + u32 qp_flags = 0;
> +
> + RCFW_CMD_PREP(req, CREATE_QP1, cmd_flags);
> +
> + /* General */
> + req.type = qp->type;
> + req.dpi = cpu_to_le32(qp->dpi->dpi);
> + req.qp_handle = cpu_to_le64(qp->qp_handle);
> +
> + /* SQ */
> + sq->hwq.max_elements = sq->max_wqe;
> + rc = bnxt_qplib_alloc_init_hwq(res->pdev, &sq->hwq, NULL, 0,
> +&sq->hwq.max_elements,
> +BNXT_QPLIB_MAX_SQE_ENTRY_SIZE, 0,
> +PAGE_SIZE, HWQ_TYPE_QUEUE);
> + if (rc)
> + goto exit;
> +
> + sq

Re: [PATCH 1/1] Fixed to BUG_ON to WARN_ON def

2016-12-12 Thread Leon Romanovsky

On Mon, Dec 12, 2016 at 03:04:28PM +0200, Ozgur Karatas wrote:
> Dear Romanovsky;

Please avoid top-posting in your replies.
Thanks

>
> I'm trying to learn english and I apologize for my mistake words and phrases. 
> So, I think the code when call to "sg_set_buf" and next time set memory and 
> buffer. For example, isn't to call "WARN_ON" function, get a error to 
> implicit declaration, right?
>
> Because, you will use to "BUG_ON" get a error implicit declaration of 
> functions.

I'm not sure that I followed you. mem->offset is set by sg_set_buf from
buf variable returned by dma_alloc_coherent(). HW needs to get very
precise size of this buf, in multiple of pages and aligned to pages
boundaries.

>
> sg_set_buf(mem, buf, PAGE_SIZE << order);
> WARN_ON(mem->offset);

See the patch inline which removes this BUG_ON in proper and safe way.

From 7babe807affa2b27d51d3610afb75b693929ea1a Mon Sep 17 00:00:00 2001
From: Leon Romanovsky 
Date: Mon, 12 Dec 2016 20:02:45 +0200
Subject: [PATCH] net/mlx4: Remove BUG_ON from ICM allocation routine

This patch removes BUG_ON() macro from mlx4_alloc_icm_coherent()
by checking DMA address aligment in advance and performing proper
folding in case of error.

Fixes: 5b0bf5e25efe ("mlx4_core: Support ICM tables in coherent memory")
Reported-by: Ozgur Karatas 
Signed-off-by: Leon Romanovsky 
---
 drivers/net/ethernet/mellanox/mlx4/icm.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/icm.c 
b/drivers/net/ethernet/mellanox/mlx4/icm.c
index 2a9dd46..e1f9e7c 100644
--- a/drivers/net/ethernet/mellanox/mlx4/icm.c
+++ b/drivers/net/ethernet/mellanox/mlx4/icm.c
@@ -118,8 +118,13 @@ static int mlx4_alloc_icm_coherent(struct device *dev, 
struct scatterlist *mem,
if (!buf)
return -ENOMEM;

+   if (offset_in_page(buf)) {
+   dma_free_coherent(dev, PAGE_SIZE << order,
+ buf, sg_dma_address(mem));
+   return -ENOMEM;
+   }
+
sg_set_buf(mem, buf, PAGE_SIZE << order);
-   BUG_ON(mem->offset);
sg_dma_len(mem) = PAGE_SIZE << order;
return 0;
 }
--
2.10.2



signature.asc
Description: PGP signature

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-12 Thread Christoph Lameter

On Mon, 12 Dec 2016, Jesper Dangaard Brouer wrote:

> Hmmm. If you can rely on hardware setup to give you steering and
> dedicated access to the RX rings.  In those cases, I guess, the "push"
> model could be a more direct API approach.

If the hardware does not support steering then one should be able to
provide those services in software.

> I was shooting for a model that worked without hardware support.  And
> then transparently benefit from HW support by configuring a HW filter
> into a specific RX queue and attaching/using to that queue.

The discussion here is a bit amusing since these issues have been resolved
a long time ago with the design of the RDMA subsystem. Zero copy is
already in wide use. Memory registration is used to pin down memory areas.
Work requests can be filed with the RDMA subsystem that then send and
receive packets from the registered memory regions. This is not strictly
remote memory access but this is a basic mode of operations supported  by
the RDMA subsystem. The mlx5 driver quoted here supports all of that.

What is bad about RDMA is that it is a separate kernel subsystem. What I
would like to see is a deeper integration with the network stack so that
memory regions can be registred with a network socket and work requests
then can be submitted and processed that directly read and write in these
regions. The network stack should provide the services that the hardware
of the NIC does not suppport as usual.

The RX/TX ring in user space should be an additional mode of operation of
the socket layer. Once that is in place the "Remote memory acces" can be
trivially implemented on top of that and the ugly RDMA sidecar subsystem
can go away.

Re: Soft lockup in inet_put_port on 4.6

2016-12-12 Thread Josef Bacik

On Fri, Dec 9, 2016 at 11:14 PM, Eric Dumazet  
wrote:

On Fri, 2016-12-09 at 19:47 -0800, Eric Dumazet wrote:



 Hmm... Is your ephemeral port range includes the port your load
 balancing app is using ?


I suspect that you might have processes doing bind( port = 0) that are
trapped into the bind_conflict() scan ?

With 100,000 + timewaits there, this possibly hurts.

Can you try the following loop breaker ?


It doesn't appear that the app is doing bind(port = 0) during normal 
operation.  I tested this patch and it made no difference.  I'm going 
to test simply restarting the app without changing to the SO_REUSEPORT 
option.  Thanks,


Josef

Re: Soft lockup in tc_classify

2016-12-12 Thread Shahar Klein




On 12/12/2016 3:28 PM, Daniel Borkmann wrote:

Hi Shahar,

On 12/12/2016 10:43 AM, Shahar Klein wrote:

Hi All,

sorry for the spam, the first time was sent with html part and was
rejected.

We observed an issue where a classifier instance next member is
pointing back to itself, causing a CPU soft lockup.
We found it by running traffic on many udp connections and then adding
a new flower rule using tc.

We added a quick workaround to verify it:

In tc_classify:

 for (; tp; tp = rcu_dereference_bh(tp->next)) {
 int err;
+   if (tp == tp->next)
+ RCU_INIT_POINTER(tp->next, NULL);


We also had a print here showing tp->next is pointing to tp. With this
workaround we are not hitting the issue anymore.
We are not sure we fully understand the mechanism here - with the rtnl
and rcu locks.
We'll appreciate your help solving this issue.


Note that there's still the RCU fix missing for the deletion race that
Cong will still send out, but you say that the only thing you do is to
add a single rule, but no other operation in involved during that test?

Do you have a script and kernel .config for reproducing this?


I'm using a user space socket 
app(https://github.com/shahar-klein/noodle)on a vm to push udp packets 
from ~2000 different udp src ports ramping up at ~100 per second towards 
another vm on the same Hypervisor. Once the traffic starts I'm pushing 
ingress flower tc udp rules(even_udp_src_port->mirred, odd->drop) on the 
relevant representor in the Hypervisor.




Thanks,
Daniel

Re: [PATCH] sh_eth: add wake-on-lan support via magic packet

2016-12-12 Thread Sergei Shtylyov


On 12/12/2016 06:49 PM, Niklas Söderlund wrote:


Thanks for your feedback.


   Not at all, it's my duty now. :-)
   I should probably have warned you not to post the new version to netdev -- 
DaveM has closed his net-next.git tree (ahead of the usual time, which would 
have been 4.9 release), so you posting would only upset him...


[...]

  You only enable the WOL support fo the R-Car gen2 chips but never say that
explicitly, neither in the subject nor here.


Signed-off-by: Niklas Söderlund 
---
 drivers/net/ethernet/renesas/sh_eth.c | 120 +++---
 drivers/net/ethernet/renesas/sh_eth.h |   4 ++
 2 files changed, 116 insertions(+), 8 deletions(-)



diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
b/drivers/net/ethernet/renesas/sh_eth.c
index 05b0dc5..3974046 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c

[...]

+   /* Handle MagicPacket interrupt */
+   if (sh_eth_read(ndev, ECSR) & ECSR_MPD)


   What if it wasn't enabled ATM?


Sorry I don't understand this comment.


   I'm trying to handle only the enabled interrupts but this hasn't been 
consistently done yet (only for EESR, not ECSR), so nevermind. :-)


[...]

@@ -3150,15 +3193,71 @@ static int sh_eth_drv_remove(struct platform_device 
*pdev)

[...]


This is how it's done in
other parts of the driver when disabling interrupts.


   Not in all parts of the driver that disable EESIPR interrupts... I must
confess that I never liked that 'mdp->irq_enabled' flag and still suspect we
can get things done without it... I need to look at this code again, sigh...


   Well, we can't most probably but I have a patch almost ready that turns 
the boolean flag into u32 field holding the EESIPR value to be written next. 
Would that help you?



This is also why I only check for MagicPacket interrupts if irq_enabled
is false.


  I would have preferred that this was done with the other EMAC interrupts,
in sh_eth_error().


I removed the check for Magic Packet in sh_eth_interrupt() and running
without setting mdp->irq_enabled = false. sh_eth_error() will then clear
any ECI interrupt so no need to add Magic Packet detection to it since
all that is needed on Magic Packet is to clear the interrupt which
already is done. This works and I can do multiple suspend/resume cycles,
will be in v2 thanks for the suggestion.


   OK, let's see what you have when I have some more time. We have a lot of 
time for ironing things out till net-next is opened again -- which will happen 
after -rc1)...


[...]

+
+   /* Enable MagicPacket */
+   sh_eth_modify(ndev, ECMR, 0, ECMR_PMDE);
+
+   /* Increased clock usage so device won't be suspended */
+   clk_enable(mdp->clk);


   Hum, intermixiggn runtime PM with clock API doesn't look good...


I agree it looks weird but I need a way to increment the usage count for
the clock otherwise the PM code will disable the module clock and WoL
will not work.


   How will it do it if you don't call sh_eth_close() in this case?


Note that this call will not enable the clock just
increase the usage count so it won't be disabled when the PM code
decrease it after the sh_eth suspend function is run.


   You mean that the PM code calls RPM or clk API on its own? That's strange...


Yes it calls clk API.


   Hum, will have to look into it as well...

[...]

MBR, Sergei

Re: [PATCH v2] audit: use proper refcount locking on audit_sock

2016-12-12 Thread Paul Moore

On Mon, Dec 12, 2016 at 5:03 AM, Richard Guy Briggs  wrote:
> Resetting audit_sock appears to be racy.
>
> audit_sock was being copied and dereferenced without using a refcount on
> the source sock.
>
> Bump the refcount on the underlying sock when we store a refrence in
> audit_sock and release it when we reset audit_sock.  audit_sock
> modification needs the audit_cmd_mutex.
>
> See: https://lkml.org/lkml/2016/11/26/232
>
> Thanks to Eric Dumazet  and Cong Wang
>  on ideas how to fix it.
>
> Signed-off-by: Richard Guy Briggs 
> ---
> There has been a lot of change in the audit code that is about to go
> upstream to address audit queue issues.  This patch is based on the
> source tree: git://git.infradead.org/users/pcmoore/audit#next
> ---
>  kernel/audit.c |   34 --
>  1 files changed, 28 insertions(+), 6 deletions(-)

This is coming in pretty late for the v4.10 merge window, much later
than I would usually take things, but this is arguably important, and
(at first glance) relatively low risk - what testing have you done on
this?

> diff --git a/kernel/audit.c b/kernel/audit.c
> index f20eee0..439f7f3 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -452,7 +452,9 @@ static void auditd_reset(void)
> struct sk_buff *skb;
>
> /* break the connection */
> +   sock_put(audit_sock);
> audit_pid = 0;
> +   audit_nlk_portid = 0;
> audit_sock = NULL;
>
> /* flush all of the retry queue to the hold queue */
> @@ -478,6 +480,12 @@ static int kauditd_send_unicast_skb(struct sk_buff *skb)
> if (rc >= 0) {
> consume_skb(skb);
> rc = 0;
> +   } else {
> +   if (rc & (-ENOMEM|-EPERM|-ECONNREFUSED)) {
> +   mutex_lock(&audit_cmd_mutex);
> +   auditd_reset();
> +   mutex_unlock(&audit_cmd_mutex);
> +   }
> }
>
> return rc;
> @@ -579,7 +587,9 @@ static int kauditd_thread(void *dummy)
>
> auditd = 0;
> if (AUDITD_BAD(rc, reschedule)) {
> +   mutex_lock(&audit_cmd_mutex);
> auditd_reset();
> +   mutex_unlock(&audit_cmd_mutex);
> reschedule = 0;
> }
> } else
> @@ -594,7 +604,9 @@ static int kauditd_thread(void *dummy)
> auditd = 0;
> if (AUDITD_BAD(rc, reschedule)) {
> kauditd_hold_skb(skb);
> +   mutex_lock(&audit_cmd_mutex);
> auditd_reset();
> +   mutex_unlock(&audit_cmd_mutex);
> reschedule = 0;
> } else
> /* temporary problem (we hope), queue
> @@ -623,7 +635,9 @@ quick_loop:
> if (rc) {
> auditd = 0;
> if (AUDITD_BAD(rc, reschedule)) {
> +   mutex_lock(&audit_cmd_mutex);
> auditd_reset();
> +   
> mutex_unlock(&audit_cmd_mutex);
> reschedule = 0;
> }
>
> @@ -1004,17 +1018,22 @@ static int audit_receive_msg(struct sk_buff *skb, 
> struct nlmsghdr *nlh)
> return -EACCES;
> }
> if (audit_pid && new_pid &&
> -   audit_replace(requesting_pid) != -ECONNREFUSED) {
> +   (audit_replace(requesting_pid) & 
> (-ECONNREFUSED|-EPERM|-ENOMEM))) {
> audit_log_config_change("audit_pid", new_pid, 
> audit_pid, 0);
> return -EEXIST;
> }
> if (audit_enabled != AUDIT_OFF)
> audit_log_config_change("audit_pid", new_pid, 
> audit_pid, 1);
> -   audit_pid = new_pid;
> -   audit_nlk_portid = NETLINK_CB(skb).portid;
> -   audit_sock = skb->sk;
> -   if (!new_pid)
> +   if (new_pid) {
> +   if (audit_sock)
> +   sock_put(audit_sock);
> +   audit_pid = new_pid;
> +   audit_nlk_portid = NETLINK_CB(skb).portid;
> +   sock_hold(skb->sk);
> +   audit_sock = skb->sk;

Re: Designing a safe RX-zero-copy Memory Model for Networking

2016-12-12 Thread Jesper Dangaard Brouer

On Mon, 12 Dec 2016 06:49:03 -0800
John Fastabend  wrote:

> On 16-12-12 06:14 AM, Mike Rapoport wrote:
> > On Mon, Dec 12, 2016 at 10:40:42AM +0100, Jesper Dangaard Brouer wrote:  
> >>
> >> On Mon, 12 Dec 2016 10:38:13 +0200 Mike Rapoport  
> >> wrote:
> >>  
> >>> Hello Jesper,
> >>>
> >>> On Mon, Dec 05, 2016 at 03:31:32PM +0100, Jesper Dangaard Brouer wrote:  
>  Hi all,
> 
>  This is my design for how to safely handle RX zero-copy in the network
>  stack, by using page_pool[1] and modifying NIC drivers.  Safely means
>  not leaking kernel info in pages mapped to userspace and resilience
>  so a malicious userspace app cannot crash the kernel.
> 
>  Design target
>  =
> 
>  Allow the NIC to function as a normal Linux NIC and be shared in a
>  safe manor, between the kernel network stack and an accelerated
>  userspace application using RX zero-copy delivery.
> 
>  Target is to provide the basis for building RX zero-copy solutions in
>  a memory safe manor.  An efficient communication channel for userspace
>  delivery is out of scope for this document, but OOM considerations are
>  discussed below (`Userspace delivery and OOM`_).
> >>>
> >>> Sorry, if this reply is a bit off-topic.  
> >>
> >> It is very much on topic IMHO :-)
> >>  
> >>> I'm working on implementation of RX zero-copy for virtio and I've 
> >>> dedicated
> >>> some thought about making guest memory available for physical NIC DMAs.
> >>> I believe this is quite related to your page_pool proposal, at least from
> >>> the NIC driver perspective, so I'd like to share some thoughts here.  
> >>
> >> Seems quite related. I'm very interested in cooperating with you! I'm
> >> not very familiar with virtio, and how packets/pages gets channeled
> >> into virtio.  
> > 
> > They are copied :-)
> > Presuming we are dealing only with vhost backend, the received skb
> > eventually gets converted to IOVs, which in turn are copied to the guest
> > memory. The IOVs point to the guest memory that is allocated by virtio-net
> > running in the guest.
> >   
> 
> Great I'm also doing something similar.
> 
> My plan was to embed the zero copy as an AF_PACKET mode and then push
> a AF_PACKET backend into vhost. I'll post a patch later this week.
> 
> >>> The idea is to dedicate one (or more) of the NIC's queues to a VM, e.g.
> >>> using macvtap, and then propagate guest RX memory allocations to the NIC
> >>> using something like new .ndo_set_rx_buffers method.  
> >>
> >> I believe the page_pool API/design aligns with this idea/use-case.
> >>  
> >>> What is your view about interface between the page_pool and the NIC
> >>> drivers?  
> >>
> >> In my Prove-of-Concept implementation, the NIC driver (mlx5) register
> >> a page_pool per RX queue.  This is done for two reasons (1) performance
> >> and (2) for supporting use-cases where only one single RX-ring queue is
> >> (re)configured to support RX-zero-copy.  There are some associated
> >> extra cost of enabling this mode, thus it makes sense to only enable it
> >> when needed.
> >>
> >> I've not decided how this gets enabled, maybe some new driver NDO.  It
> >> could also happen when a XDP program gets loaded, which request this
> >> feature.
> >>
> >> The macvtap solution is nice and we should support it, but it requires
> >> VM to have their MAC-addr registered on the physical switch.  This
> >> design is about adding flexibility. Registering an XDP eBPF filter
> >> provides the maximum flexibility for matching the destination VM.  
> > 
> > I'm not very familiar with XDP eBPF, and it's difficult for me to estimate
> > what needs to be done in BPF program to do proper conversion of skb to the
> > virtio descriptors.  
> 
> I don't think XDP has much to do with this code and they should be done
> separately. XDP runs eBPF code on received packets after the DMA engine
> has already placed the packet in memory so its too late in the process.

It does not have to be connected to XDP.  My idea should support RX
zero-copy into normal sockets, without XDP.

My idea was to pre-VMA map the RX ring, when zero-copy is requested,
thus it is not too late in the process.  When frame travel the normal
network stack, then require the SKB-read-only-page mode (skb-frags).
If the SKB reach a socket that support zero-copy, then we can do RX
zero-copy on normal sockets.

 
> The other piece here is enabling XDP in vhost but that is again separate
> IMO.
> 
> Notice that ixgbe supports pushing packets into a macvlan via 'tc'
> traffic steering commands so even though macvlan gets an L2 address it
> doesn't mean it can't use other criteria to steer traffic to it.

This sounds interesting. As this allow much more flexibility macvlan
matching, which I like, but still depending on HW support. 

 
> > We were not considered using XDP yet, so we've decided to limit the initial
> > implementation to macvtap because we can ensure correspondence betw

Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries

2016-12-12 Thread Vivien Didelot

Hi all,

Florian Fainelli  writes:

> Seeing such a change makes me wonder if we should not try to push some
> of this hashtable abstraction (provided that we agree we want it) at a
> higher layer, like net/dsa/slave.c?

That is the major reason why I am reluctant to cache stuffs in drivers.

In most cases, we want the DSA drivers to be "stupid", as much stateless
as possible, simply implementing the supported DSA switch operations.

The DSA core then handles the generic logic of how switch fabrics should
behave, and thus all DSA drivers are consistent and benefit from this.

Thanks,

Vivien

Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)

2016-12-12 Thread Jason Gunthorpe

On Sat, Dec 10, 2016 at 11:06:58AM +0530, Selvin Xavier wrote:
> On Fri, Dec 9, 2016 at 12:17 PM, Selvin Xavier
>  wrote:
> > I am preparing a git repository with these changes as per Jason's
> > comment and will share the details later today.
> 
> Please use bnxt_re branch in this git repository.
> 
> https://github.com/Broadcom/linux-rdma-nxt.git

Why are you using __packed in bnxt_re_uverbs_abi.h ? that doesn't seem
necessary. It is a good idea to make sure all those structures are a
multiple of 64 bits (add explicit reserved fields), and make sure you
test 32 bit verbs as well.

Why are you using debugfs just to export counters? Isn't the core code
counter framework good enough?

Please try and avoid writing functions as defines (eg rdev_to_dev,
to_bnxt_re, SQE_PG, RCFW_CMDQ_COOKIE, PTR_PG etc)

There is something wrong with the tabs and spaces (see
https://github.com/Broadcom/linux-rdma-nxt/blob/03e23b087f7e86ea28656273994e065827210ce5/drivers/infiniband/hw/bnxtre/bnxt_re_hsi.h)

FWIW, I really dislike the column alignment style, it is so hard to
maintain..

Jason

Re: [PATCH net-next 0/2] Add ethtool set regs support

2016-12-12 Thread Florian Fainelli

On 12/11/2016 07:22 AM, Andrew Lunn wrote:
> On Sun, Dec 11, 2016 at 02:18:00PM +0200, Saeed Mahameed wrote:
>> On Wed, Dec 7, 2016 at 4:41 AM, Andrew Lunn  wrote:
>>> On Wed, Dec 07, 2016 at 12:33:08AM +0200, Saeed Mahameed wrote:
 Hi Dave,

 This series adds the support for setting device registers from user
 space ethtool.
>>>
>>> Is this not the start of allowing binary only drivers in user space?
>>>
>>
>> It is not, we want to do same as set_eeprom already do,
>> Just set some HW registers, for analysis/debug/tweak/configure HW
>> dependent register for the NIC netdev sake.
> 
> Mellanox has a good reputation of open drivers. However, this API
> sounds like it would be the first step towards user space
> drivers. This is an API which can peek and poke registers, so it
> probably could be mis-used to put part of a driver in user
> space. Hence we should avoid this sort of API to start with.

I don't necessarily share your concerns here on the proprietary vs. open
source driver, because this interface is limited to the register space,
not the data path, there is only a handful of things you can do here,
but getting a NIC to work, is not probably one of them.

My concern is more with the support/debugging aspect, if someone starts
tweaking a gazillion of registers through that interface, and I have no
way to tell, how am I going to support that? Of course, the first step
is: have you tried to turn it on and off again, and see if this is
reproducible, but what if I was asked/told to tweak this or that value
first, etc... it can be hard to collect the exact state in which a NIC
is at the time of the problem.

NB: on the proprietary driver side, you can already mmap() the PCI
device's space and write an entire user-space based driver (DPDK) and
bypass the kernel for most things, ethtool -D is not much worse here
since it just offers a subset (and a small one) of that.
-- 
Florian

Re: [PATCH V2 00/22] Broadcom RoCE Driver (bnxt_re)

2016-12-12 Thread Jonathan Toppins

On 12/09/2016 01:47 AM, Selvin Xavier wrote:
> This series introduces the RoCE driver for the Broadcom
> NetXtreme-E 10/25/40/50 gigabit RoCE HCAs. 
> This driver is dependent on the bnxt_en NIC driver and is 
> based on the bnxt_re branch in Doug's repository. bnxt_en changes
> required for this patch series is already available in this branch.
> 
> I am preparing a git repository with these changes as per Jason's
> comment and will share the details later today.
> 
> v1-> v2:
>   * The license text in each file updated to reflect Dual license.
>   * Makefile and Kconfig changes are pushed to the last patch
>   * Moved bnxt_re_uverbs_abi.h to include/uapi/rdma folder
>   * Remove duplicate structure definitions from bnxt_re_hsi.h as
> it is available in the corresponding bnxt_en header file (bnxt_hsi.h)
>   * Removed some unused code reported during code review.
>   * Fixed few sparse warnings
> 

I get the following sparse errors (filtered for only bnxt_re ones),
please let me know if they are false positives:

$ make C=2  drivers/net/ethernet/broadcom/bnxt/bnxt_en.ko
drivers/infiniband/hw/bnxtre/bnxt_re.ko
  CHK include/config/kernel.release
  CHK include/generated/uapi/linux/version.h
  CHK include/generated/utsrelease.h
  CHECK   arch/x86/purgatory/purgatory.c
[...]
  CHECK   arch/x86/purgatory/sha256.c
  CHECK   arch/x86/purgatory/string.c
[...]
  CHK include/generated/bounds.h
  CHK include/generated/timeconst.h
  CHK include/generated/asm-offsets.h
  CALLscripts/checksyscalls.sh
  CHECK   scripts/mod/empty.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt_sriov.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt_dcb.c
  CHECK   drivers/net/ethernet/broadcom/bnxt/bnxt_ulp.c
  MODPOST 2 modules
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_re_main.c
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_re_ib_verbs.c
[...]
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_re_debugfs.c
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c
drivers/infiniband/hw/bnxtre/bnxt_qplib_res.c:729:6: warning: symbol
'bnxt_qplib_cleanup_pkey_tbl' was not declared. Should it be static?
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_rcfw.c
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_sp.c
  CHECK   drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c
drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1015:22: warning: context
imbalance in 'bnxt_qplib_lock_cqs' - wrong count at exit
drivers/infiniband/hw/bnxtre/bnxt_qplib_fp.c:1030:28: warning: context
imbalance in 'bnxt_qplib_unlock_cqs' - unexpected unlock
  MODPOST 2 modules

-Jon

Re: fib_frontend: Add network specific broadcasts, when it takes a sense

2016-12-12 Thread Hannes Frederic Sowa

On 12.12.2016 15:44, Brandon Philips wrote:
> On Mon, Dec 12, 2016 at 9:30 AM, Jiri Benc  wrote:
>> On Fri, 9 Dec 2016 15:41:52 -0800, Brandon Philips wrote:
>>> The issue we have: when creating the VXLAN interface and assigning it
>>> an address we see a broadcast route being added by the Kernel. For
>>> example if we have 10.4.0.0/16 a broadcast route to 10.4.0.0 is
>>> created. This route is unwanted because we assign 10.4.0.0 to one of
>>> our VXLAN interfaces.
>>
>> Are you saying you're trying to assign the IP address 10.4.0.0/16 as a
>> unicast address to an interface? Then you'll run into way more problems
>> than the one you're describing. You can't have host part of the IP
>> address consisting of all zeros (or all ones). Just don't do it. Choose
>> a valid IP address instead.
> 
> Yes, this is what we are doing; it is because of an upstream, to us,
> address assignment so I will figure it out upstream.
> 
> Regardless, it is hard to find an RFC that says "simply don't do this
> because _". The closest I could find was RFC 922 after sending
> this which says:
> 
> "There is probably no reason for such addresses to appear anywhere but
> as the source address of an ICMP Information".

Alternatively you can renumber the network to use /32 and add the
unicast routes for your /16 yourself.

Bye,
Hannes

RE: Re: Synopsys Ethernet QoS

2016-12-12 Thread Stephen Warren

Niklas Cassel wrote at Monday, December 12, 2016 9:25 AM:
...
> However, I've noticed that NVIDIA has extended the DWC EQoS DT binding,
> I don't how easy it would be for them to switch to stmmac's DT binding.
> (Adding Stephen Warren to CC.)

I don't believe there's any issue switching drivers, so long as the new one
works on our HW. However, we can't switch DT binding since that's an ABI.
So, if we switch drivers, the "new" driver needs to support all existing
DT bindings.

FWIW, I'd recommend that we don't name the "new" driver stmmac or anything
like that. Rather, name it after the IP block so that any new user of the same
IP block will find the name they expect in the source tree, and just use it.
Renaming the "new" driver to dwc_eqos or similar might be one way to achieve
that.

-- 
nvpublic

Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries

2016-12-12 Thread Florian Fainelli

On 12/12/2016 07:22 AM, Volodymyr Bendiuga wrote:
> Hi,
> 
> I apologise for incorrectly formatted patch, I will fix and resend it.
> The problem with the ATU right now is that it is too slow when inserting
> entries.
> When the OS boots up, it might insert some multicast entries into the
> atu (if
> they are preconfigured by user). I run a test with 10 mc entries being
> configured for
> each port (13 ports), and it took 15 seconds, which made system quite
> slow on responding to
> other commands, as it has been inserting mc entries. The implementation
> with hashtable
> made insert command for 13 ports and 10 entries per port about 700 msec
> long.

Just wondering how do you achieve such speed up? What part of using a
hashtable allows you not to write down all 10 MC entries across the
ports? I would assume that the number of MDIO (is that the bus you are
using here?) operations would be identical.

Seeing such a change makes me wonder if we should not try to push some
of this hashtable abstraction (provided that we agree we want it) at a
higher layer, like net/dsa/slave.c?

Thanks!
-- 
Florian

Re: [PATCH] net:phy fix driver reference count error when attach and detach phy device

2016-12-12 Thread Florian Fainelli

On 12/12/2016 12:49 AM, maowenan wrote:
> 
> 
> On 2016/12/5 16:47, maowenan wrote:
>>
>>
>> On 2016/12/2 17:45, David Laight wrote:
>>> From: Mao Wenan
 Sent: 30 November 2016 10:23
 The nic in my board use the phy dev from marvell, and the system will
 load the marvell phy driver automatically, but when I remove the phy
 drivers, the system immediately panic:
 Call trace:
 [ 2582.834493] [] phy_state_machine+0x3c/0x438 [
 2582.851754] [] process_one_work+0x150/0x428 [
 2582.868188] [] worker_thread+0x144/0x4b0 [
 2582.883882] [] kthread+0xfc/0x110

 there should be proper reference counting in place to avoid that.
 I found that phy_attach_direct() forgets to add phy device driver
 reference count, and phy_detach() forgets to subtract reference count.
 This patch is to fix this bug, after that panic is disappeared when remove
 marvell.ko

 Signed-off-by: Mao Wenan 
 ---
  drivers/net/phy/phy_device.c | 7 +++
  1 file changed, 7 insertions(+)

 diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
 index 1a4bf8a..a7ec7c2 100644
 --- a/drivers/net/phy/phy_device.c
 +++ b/drivers/net/phy/phy_device.c
 @@ -866,6 +866,11 @@ int phy_attach_direct(struct net_device *dev, struct 
 phy_device *phydev,
return -EIO;
}

 +  if (!try_module_get(d->driver->owner)) {
 +  dev_err(&dev->dev, "failed to get the device driver module\n");
 +  return -EIO;
 +  }
>>>
>>> If this is the phy code, what stops the phy driver being unloaded
>>> before the try_module_get() obtains a reference.
>>> If it isn't the phy driver then there ought to be a reference count obtained
>>> when the phy driver is located (by whatever decides which phy driver to 
>>> use).
>>> Even if that code later releases its reference (it probably shouldn't on 
>>> success)
>>> then you can't fail to get an extra reference here.
>>
>> [Mao Wenan]Yes, this is phy code, in function phy_attach_direct(), 
>> drivers/net/phy/phy_device.c.
>> when one NIC driver to do probe behavior, it will attach one matched phy 
>> driver. phy_attach_direct()
>> is to obtain phy driver reference and bind phy driver, if try_module_get() 
>> execute on success, the reference
>> count is added; if failed, the driver can't be attached to this NIC, and it 
>> can't added the phy driver
>> reference count. So before try_module_get obtains a reference, phy driver 
>> can't can't be bound to this NIC.
>> when the phy driver is attached to NIC, the reference count is added, if 
>> someone remove phy driver directly,
>> it will be failed because reference count is not equal to 0.
>>
>> An example of call trace when there is NIC driver to attch one phy driver:
>> hns_nic_dev_probe->hns_nic_try_get_ae->hns_nic_init_phy->of_phy_connect->phy_connect_direct->phy_attach_direct
>>
>> Consider the steps of phy driver(marvell.ko) added and removed, and NIC 
>> driver(hns_enet_drv.ko) added and removed:
>> 1)insmod marvell   ref=0
>> 2)insmod hns_enet_drv  ref=1
>> 3)rmmod marvell(should not on success, ref=1)
>> 4)rmmod hns_enet_drv   ref=0
>> 5)rmmod marvell(should on success, because ref=0)
>>
>> if we don't add the reference count in phy_attach_direct(the second step 
>> ref=0), so the third step rmmod marvell will
>> be panic, because there is one user remain use marvell driver and 
>> phy_stat_machine use the NULL drv pointer.
>>
>>>
 +
get_device(d);

/* Assume that if there is no driver, that it doesn't
 @@ -921,6 +926,7 @@ int phy_attach_direct(struct net_device *dev, struct 
 phy_device *phydev,

  error:
put_device(d);
 +  module_put(d->driver->owner);
>>>
>>> Are those two in the wrong order ?
>>>
module_put(bus->owner);
return err;
  }
 @@ -998,6 +1004,7 @@ void phy_detach(struct phy_device *phydev)
bus = phydev->mdio.bus;

put_device(&phydev->mdio.dev);
 +  module_put(phydev->mdio.dev.driver->owner);
module_put(bus->owner);
>>>
>>> Where is this code called from?
>>> You can't call it from the phy driver because the driver can be unloaded
>>> as soon as the last reference is removed.
>>> At that point the code memory is freed.
>>
>> [Mao Wenan] it is called by NIC when it is removed, which aims to disconnect 
>> one bound phy driver. If this phy driver
>> is not used for this NIC, reference count should be subtracted, and phy 
>> driver can be removed if there is no user.
>> hns_nic_dev_remove->phy_disconnect->phy_detach
>>
>>
>>
>>>
  }
  EXPORT_SYMBOL(phy_detach);
 --
 2.7.0

>>>
>>>
>>> .
>>>
> 
> @Florian Fainelli, what's your comments about this patch?

I am trying to reproduce what you are seeing, but at first glance is
looks like an appropriate solution to me. Do you mind giving me a couple
more days?

Thanks!
-- 
Florian

Re: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-12 Thread Måns Rullgård

David Laight  writes:

> From: Måns Rullgård
>> Sent: 10 December 2016 13:25
> ...
>> I solved this problem in an Ethernet driver by copying the initial part
>> of the packet to an aligned skb and appending the remainder using
>> skb_add_rx_frag().  The kernel network stack only cares about the
>> headers, so the alignment of the packet payload doesn't matter.
>
> That rather depends on where the packet payload ends up.
> It is likely that it will be copied to userspace (or maybe
> into some aligned kernel buffer).
> In which case you get an expensive misaligned copy.

There's not much to be done about that.  The only other option is to
bypass DMA entirely, and that's sure to be even worse.

> What do the hardware engineers think the ethernet interface will
> be used for!

Ticking boxes in marketing material.

-- 
Måns Rullgård

Re: [PATCH 0/2] net: ethernet: hisilicon: set dev->dev.parent before PHY connect

2016-12-12 Thread Florian Fainelli

On 12/12/2016 04:03 AM, Dongpo Li wrote:
> This patch series builds atop:
> ec988ad78ed6d184a7f4ca6b8e962b0e8f1de461 ("phy: Don't increment MDIO bus
> refcount unless it's a different owner")
> 
> I have checked all the hisilicon ethernet driver and found only two drivers
> need to be fixed to make sure set dev->dev.parent before PHY connect.

Thanks for doing this, these two drivers did not show up on my list
because I did not see them calling SET_NETDEV_DEV() so late.
-- 
Florian

Re: [PATCH 1/2] net: ethernet: hisi_femac: Call SET_NETDEV_DEV()

2016-12-12 Thread Florian Fainelli

On 12/12/2016 04:03 AM, Dongpo Li wrote:
> The hisi_femac driver calls into PHYLIB which now checks for
> net_device->dev.parent, so make sure we do set it before calling into
> any MDIO/PHYLIB related function.
> 
> Fixes: ec988ad78ed6 ("phy: Don't increment MDIO bus refcount unless it's a 
> different owner")
> Signed-off-by: Dongpo Li 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH 2/2] net: ethernet: hip04: Call SET_NETDEV_DEV()

2016-12-12 Thread Florian Fainelli

On 12/12/2016 04:03 AM, Dongpo Li wrote:
> The hip04 driver calls into PHYLIB which now checks for
> net_device->dev.parent, so make sure we do set it before calling into
> any MDIO/PHYLIB related function.
> 
> Fixes: ec988ad78ed6 ("phy: Don't increment MDIO bus refcount unless it's a 
> different owner")
> Signed-off-by: Dongpo Li 

Reviewed-by: Florian Fainelli 
-- 
Florian

Re: [PATCH net-next 1/3] net:dsa:mv88e6xxx: use hashtable to store multicast entries

2016-12-12 Thread Vivien Didelot

Hi Volodymyr,

Volodymyr Bendiuga  writes:

> +struct pvec_tbl_entry {
> +struct hlist_node entry;
> +u32 key_crc32; /* key */
> +u16 pvec;
> +struct pvec_tbl_key {
> +u8 addr[ETH_ALEN];
> +u16 fid;
> +} key;
> +};
> +
>  struct mv88e6xxx_atu_entry {
>   u16 fid;
>   u8  state;

Also, if we were to cache some values in mv88e6xxx, I'd use the existing
structures which match the hardware definition. Easier to understand.

mv88e6xxx_atu_entry already represents an ATU entry with its port
vector, FID, MAC address and more. Do you think it can be used here?

Thanks,

Vivien

Re: Re: Synopsys Ethernet QoS

2016-12-12 Thread Niklas Cassel

On 12/12/2016 11:19 AM, Joao Pinto wrote:
> Hi,
>
> Às 1:44 AM de 12/10/2016, Florian Fainelli escreveu:
>> Le 12/09/16 à 16:16, Andy Shevchenko a écrit :
>>> On Sat, Dec 10, 2016 at 12:52 AM, Florian Fainelli  
>>> wrote:
>>>
 It's kind of sad that customers of that IP (stmmac, amd-xgbe, sxgbe)
 did
 actually pioneer the upstreaming effort, but it is good to see people
 from Synopsys willing to fix that in the future.
>>> Wait, you would like to tell that we have more than 2 drivers for the
>>> same (okay, same vendor) IP?!
>>> It's better to unify them earlier, than have n+ copies.
>> Unfortunately that is the case, see this email:
>>
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg142796.html
>>
>> dwc_eth_qos and stmmac have some overlap. There seems to be work
>> underway to unify these two to begin with.
>>
>>> P.S. Though, I don't see how sxgbe got in the list. First glance on
>>> the code doesn't show similarities.
>> Well samsung/sxgbe looks potentially similar to amd/xgbe, but that's
>> just my cursory look at the code, it may very well be something entirely
>> different. The descriptor formats just look suspiciously similar.
>>
> Thank you for your inputs! Renaming seems to be a hotspot. I agree that maybe
> instead of renaming (breaking retro-compatibility as David and Florian
> mentioned), the best is to move stmmac to synopsys/ after merging *qos* and
> removing it. As Florian mentioned, git is capable of detecting folder 
> restructured.
>
> @Rabin Vincent: Hi Rabin. Since Axis is more familiar with the synopsys/*qos*
> driver would it be possible for you to make an initial analysis of what has to
> be merged into Stmmac? This way the development would speed-up.

I can answer that question.

I've sent out 12 patches to the stmmac driver
(all patches are included in the current net-next tree),
with these patches the stmmac driver works properly on Axis hardware
(we use Synopsys GMAC 4.10a synthesized with multiple TX queues).
stmmac's DT binding has also been extended with properties that
existed in DWC EQoS's DT binding, such as no-pbl-x8, txpbl, rxpbl.

Since we have no problem updating the DTB together with the kernel,
we will simply move to using the start using the stmmac driver,
with stmmac's DT binding.

However, I've noticed that NVIDIA has extended the DWC EQoS DT binding,
I don't how easy it would be for them to switch to stmmac's DT binding.
(Adding Stephen Warren to CC.)

The reset sequence that Lars Persson was worried about is not an issue
with the stmmac driver.

There are some performance problems with the stmmac driver though:

When running iperf3 with 3 streams:
iperf3 -c 192.168.0.90 -P 3 -t 30
iperf3 -c 192.168.0.90 -P 3 -t 30 -R

I get really bad fairness between the streams.

This appears to be an issue with how TX IRQ coalescing is implemented in stmmac.
Disabling TX IRQ coalescing in the stmmac driver makes the problem go away.
We have a local patch that implements TX IRQ coalescing in the dwceqos driver,
and we don't see the same problem.

Also netperf TCP_RR and UDP_RR gives really bad results compared to the
dwceqos driver (without IRQ coalescing).
2000 transactions/sec vs 9000 transactions/sec.
Turning TX IRQ coalescing off and RX interrupt watchdog off in stmmac
gives the same performance. I guess it's a trade off, low CPU usage
vs low latency, so I don't know how important TCP_RR/UDP_RR really is.

The best thing would be to get a good working TX IRQ coalesce
implementation with HR timers in stmmac.
Perhaps it should also be investigated if the RX interrupt watchdog
timeout should have a lower default value.

>
> Thanks to all.
>
> Joao

RE: Misalignment, MIPS, and ip_hdr(skb)->version

2016-12-12 Thread David Laight

From: Måns Rullgård
> Sent: 10 December 2016 13:25
...
> I solved this problem in an Ethernet driver by copying the initial part
> of the packet to an aligned skb and appending the remainder using
> skb_add_rx_frag().  The kernel network stack only cares about the
> headers, so the alignment of the packet payload doesn't matter.

That rather depends on where the packet payload ends up.
It is likely that it will be copied to userspace (or maybe
into some aligned kernel buffer).
In which case you get an expensive misaligned copy.

Encapsulation protocols not using headers that are multiples
of 4 bytes is as stupid as ethernet hardware that can't place
the mac address on a 4n+2 boundary.
The latter is particularly stupid when it happens on embedded
silicon with a processor that can only do aligned memory accesses.
What do the hardware engineers think the ethernet interface will
be used for!

David

[PATCHv2 1/5] sh_eth: add generic wake-on-lan support via magic packet

2016-12-12 Thread Niklas Söderlund

Add generic functionality to support Wake-on-Lan using MagicPacket which
are supported by at least a few versions of sh_eth. Only add
functionality for WoL, no specific sh_eth version are marked to support
WoL yet.

WoL is enabled in the suspend callback by setting MagicPacket detection
and disabling all interrupts expect MagicPacket. In the resume path the
driver needs to reset the hardware to rearm the WoL logic, this prevents
the driver from simply restoring the registers and to take advantage of
that sh_eth was not suspended to reduce resume time. To reset the
hardware the driver close and reopens the device just like it would do
in a normal suspend/resume scenario without WoL enabled, but it both
close and open the device in the resume callback since the device needs
to be open for WoL to work.

One quirk needed for WoL is that the module clock needs to be prevented
from being switched off by Runtime PM. To keep the clock alive the
suspend callback need to call clk_enable() directly to increase the
usage count of the clock. Then when Runtime PM decreases the clock usage
count it won't reach 0 and be switched off.

Signed-off-by: Niklas Söderlund 
---
 drivers/net/ethernet/renesas/sh_eth.c | 112 +++---
 drivers/net/ethernet/renesas/sh_eth.h |   3 +
 2 files changed, 107 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/renesas/sh_eth.c 
b/drivers/net/ethernet/renesas/sh_eth.c
index 05b0dc5..87640b9 100644
--- a/drivers/net/ethernet/renesas/sh_eth.c
+++ b/drivers/net/ethernet/renesas/sh_eth.c
@@ -2199,6 +2199,33 @@ static int sh_eth_set_ringparam(struct net_device *ndev,
return 0;
 }
 
+static void sh_eth_get_wol(struct net_device *ndev, struct ethtool_wolinfo 
*wol)
+{
+   struct sh_eth_private *mdp = netdev_priv(ndev);
+
+   wol->supported = 0;
+   wol->wolopts = 0;
+
+   if (mdp->cd->magic && mdp->clk) {
+   wol->supported = WAKE_MAGIC;
+   wol->wolopts = mdp->wol_enabled ? WAKE_MAGIC : 0;
+   }
+}
+
+static int sh_eth_set_wol(struct net_device *ndev, struct ethtool_wolinfo *wol)
+{
+   struct sh_eth_private *mdp = netdev_priv(ndev);
+
+   if (!mdp->cd->magic || !mdp->clk || wol->wolopts & ~WAKE_MAGIC)
+   return -EOPNOTSUPP;
+
+   mdp->wol_enabled = !!(wol->wolopts & WAKE_MAGIC);
+
+   device_set_wakeup_enable(&mdp->pdev->dev, mdp->wol_enabled);
+
+   return 0;
+}
+
 static const struct ethtool_ops sh_eth_ethtool_ops = {
.get_regs_len   = sh_eth_get_regs_len,
.get_regs   = sh_eth_get_regs,
@@ -2213,6 +2240,8 @@ static const struct ethtool_ops sh_eth_ethtool_ops = {
.set_ringparam  = sh_eth_set_ringparam,
.get_link_ksettings = sh_eth_get_link_ksettings,
.set_link_ksettings = sh_eth_set_link_ksettings,
+   .get_wol= sh_eth_get_wol,
+   .set_wol= sh_eth_set_wol,
 };
 
 /* network device open function */
@@ -3017,6 +3046,11 @@ static int sh_eth_drv_probe(struct platform_device *pdev)
goto out_release;
}
 
+   /* Get clock, if not found that's OK but Wake-On-Lan is unavailable */
+   mdp->clk = devm_clk_get(&pdev->dev, NULL);
+   if (IS_ERR(mdp->clk))
+   mdp->clk = NULL;
+
ndev->base_addr = res->start;
 
spin_lock_init(&mdp->lock);
@@ -3111,6 +3145,9 @@ static int sh_eth_drv_probe(struct platform_device *pdev)
if (ret)
goto out_napi_del;
 
+   if (mdp->cd->magic && mdp->clk)
+   device_set_wakeup_capable(&pdev->dev, 1);
+
/* print device information */
netdev_info(ndev, "Base address at 0x%x, %pM, IRQ %d.\n",
(u32)ndev->base_addr, ndev->dev_addr, ndev->irq);
@@ -3150,15 +3187,67 @@ static int sh_eth_drv_remove(struct platform_device 
*pdev)
 
 #ifdef CONFIG_PM
 #ifdef CONFIG_PM_SLEEP
+static int sh_eth_wol_setup(struct net_device *ndev)
+{
+   struct sh_eth_private *mdp = netdev_priv(ndev);
+
+   /* Only allow ECI interrupts */
+   synchronize_irq(ndev->irq);
+   napi_disable(&mdp->napi);
+   sh_eth_write(ndev, DMAC_M_ECI, EESIPR);
+
+   /* Enable MagicPacket */
+   sh_eth_modify(ndev, ECMR, 0, ECMR_PMDE);
+
+   /* Increased clock usage so device won't be suspended */
+   clk_enable(mdp->clk);
+
+   return enable_irq_wake(ndev->irq);
+}
+
+static int sh_eth_wol_restore(struct net_device *ndev)
+{
+   struct sh_eth_private *mdp = netdev_priv(ndev);
+   int ret;
+
+   napi_enable(&mdp->napi);
+
+   /* Disable MagicPacket */
+   sh_eth_modify(ndev, ECMR, ECMR_PMDE, 0);
+
+   /* The device need to be reset to restore MagicPacket logic
+* for next wakeup. If we close and open the device it will
+* both be reset and all registers restored. This is what
+* happens during suspend and resume without WoL enabled.
+*/
+   ret = sh_eth_close(ndev);
+   if (ret < 0)
+   ret

1 2 >

1 - 100 of 174 matches

Mail list logo