Re: Annoying gcc / rdma / networking warnings
On (05/11/19 12:52), Linus Torvalds wrote:
> So David, arguably the kernel "struct sockaddr" is simply wrong, if it
> can't contain a "struct sockaddr_in6". No? Is extending it a huge
> problem for other users that don't need it (mainly stack, I assume..)?

The ipv6 working group came up with sockaddr_storage to solve this; see RFC 2553. However, in practice, since sizeof(struct sockaddr_storage) is much larger than simply creating a union of sockaddr_in and sockaddr_in6, most userspace networking applications will do the latter. The struct sockaddr is merely the generic pointer cast that is expected to be used for the common posix functions like bind/connect etc.

> Also equally arguably, the rdma code could just use a "struct
> sockaddr_in6 for this use and avoid the gcc issue, couldn't it? It has

Yes, that would be the right solution.

--Sowmini
Re: [PATCH net-next] tcp: add documentation for tcp_ca_state
On (03/22/19 10:59), Soheil Hassas Yeganeh wrote:
> Add documentation to the tcp_ca_state enum, since this enum is
> exposed in uapi.

Acked-by: Sowmini Varadhan
[PATCH V2 bpf-next 0/2] Perf-based event notification for sock_ops
This patchset uses the eBPF perf-event based notification mechanism to solve the problem described in https://marc.info/?l=linux-netdev&m=154022219423571&w=2. Thanks to Daniel Borkmann for feedback/input.

V2: inlined the call to sys_perf_event_open() following the style of existing code in kselftests/bpf

The problem statement is:

We would like to monitor some subset of TCP sockets in user-space (the monitoring application would define 4-tuples it wants to monitor), using TCP_INFO stats to analyze reported problems. The idea is to use those stats to see where the bottlenecks are likely to be ("is it application-limited?" or "is there evidence of BufferBloat in the path?" etc).

Today we can do this by periodically polling for tcp_info, but this could be made more efficient if the kernel would asynchronously notify the application via tcp_info when some "interesting" thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc) are reached. And to make this effective, it is better if we could apply the threshold check *before* constructing the tcp_info netlink notification, so that we don't waste resources constructing notifications that will be discarded by the filter.

This patchset solves the problem by adding perf-event based notification support for sock_ops (Patch 1). The eBPF kernel module can thus be designed to apply any desired filters to the bpf_sock_ops and trigger a perf-event notification based on the verdict from the filter. The userspace component can use these perf-event notifications to either read any state managed by the eBPF kernel module, or issue a TCP_INFO netlink call if desired.
Patch 2 provides a simple example that shows how to use this infra (and also provides a test case for it).

Sowmini Varadhan (2):
  bpf: add perf-event notification support for sock_ops
  selftests/bpf: add a test case for sock_ops perf-event notification

 net/core/filter.c                                 |  19 ++
 tools/testing/selftests/bpf/Makefile              |   4 +-
 tools/testing/selftests/bpf/test_tcpnotify.h      |  19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |  95 +++
 tools/testing/selftests/bpf/test_tcpnotify_user.c | 186 +
 5 files changed, 322 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c
[PATCH V2 bpf-next 2/2] selftests/bpf: add a test case for sock_ops perf-event notification
This patch provides a tcp_bpf based eBPF sample. The test
- uses ncat(1) as the TCP client program to connect() to a port with the intention of triggering SYN retransmissions: we first install an iptables DROP rule to make sure ncat SYNs are resent (instead of aborting instantly after a TCP RST)
- has a bpf kernel module that sends a perf-event notification for each TCP retransmit, and also tracks the number of such notifications sent in the global_map

The test passes when the number of event notifications intercepted in user-space matches the value in the global_map.

Signed-off-by: Sowmini Varadhan
---
V2: inline call to sys_perf_event_open() following the style of existing code in kselftests/bpf

 tools/testing/selftests/bpf/Makefile              |   4 +-
 tools/testing/selftests/bpf/test_tcpnotify.h      |  19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |  95 +++
 tools/testing/selftests/bpf/test_tcpnotify_user.c | 186 +
 4 files changed, 303 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index e39dfb4..6c94048 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -24,12 +24,13 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \ test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \ test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \ - test_netcnt + test_netcnt test_tcpnotify_user TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \ test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \ sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \ 
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \ + test_tcpnotify_kern.o \ sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \ sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \ test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \ @@ -74,6 +75,7 @@ $(OUTPUT)/test_sock_addr: cgroup_helpers.c $(OUTPUT)/test_socket_cookie: cgroup_helpers.c $(OUTPUT)/test_sockmap: cgroup_helpers.c $(OUTPUT)/test_tcpbpf_user: cgroup_helpers.c +$(OUTPUT)/test_tcpnotify_user: cgroup_helpers.c trace_helpers.c $(OUTPUT)/test_progs: trace_helpers.c $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c diff --git a/tools/testing/selftests/bpf/test_tcpnotify.h b/tools/testing/selftests/bpf/test_tcpnotify.h new file mode 100644 index 000..8b6cea0 --- /dev/null +++ b/tools/testing/selftests/bpf/test_tcpnotify.h @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0 + +#ifndef _TEST_TCPBPF_H +#define _TEST_TCPBPF_H + +struct tcpnotify_globals { + __u32 total_retrans; + __u32 ncalls; +}; + +struct tcp_notifier { + __u8 type; + __u8 subtype; + __u8 source; + __u8 hash; +}; + +#define TESTPORT 12877 +#endif diff --git a/tools/testing/selftests/bpf/test_tcpnotify_kern.c b/tools/testing/selftests/bpf/test_tcpnotify_kern.c new file mode 100644 index 000..edbca20 --- /dev/null +++ b/tools/testing/selftests/bpf/test_tcpnotify_kern.c @@ -0,0 +1,95 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "bpf_helpers.h" +#include "bpf_endian.h" +#include "test_tcpnotify.h" + +struct bpf_map_def SEC("maps") global_map = { + .type = BPF_MAP_TYPE_ARRAY, + .key_size = sizeof(__u32), + .value_size = sizeof(struct tcpnotify_globals), + .max_entries = 4, +}; + +struct bpf_map_def SEC("maps") perf_event_map = { + .type = BPF_MAP_TYPE_PERF_EVENT_ARRAY, + .key_size = sizeof(int), + .value_size = 
sizeof(__u32), + .max_entries = 2, +}; + +int _version SEC("version") = 1; + +SEC("sockops") +int bpf_testcb(struct bpf_sock_ops *skops) +{ + int rv = -1; + int op; + + op = (int) skops->op; + + if (bpf_ntohl(skops->remote_port) != TESTPORT) { + skops->reply = -1; + return 0; + } + + switch (op) { + case BPF_SOCK_OPS_TIMEOUT_INIT: + case BPF_SOCK_OPS_RWND_INIT: + case BPF_SOCK_OPS_NEEDS_ECN: + case BPF_SOCK_OPS_BASE_RTT: + case BPF_SOCK_OPS_RTO_CB: + rv = 1; + break; + + case BPF
[PATCH V2 bpf-next 1/2] bpf: add perf-event notification support for sock_ops
This patch allows eBPF programs that use sock_ops to send perf-based event notifications using bpf_perf_event_output().

Signed-off-by: Sowmini Varadhan
---
 net/core/filter.c | 19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e521c5e..23464a3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4048,6 +4048,23 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 	return ret;
 }

+BPF_CALL_5(bpf_sock_opts_event_output, struct bpf_sock_ops *, skops,
+	   struct bpf_map *, map, u64, flags, void *, data, u64, size)
+{
+	return bpf_event_output(map, flags, data, size, NULL, 0, NULL);
+}
+
+static const struct bpf_func_proto bpf_sock_ops_event_output_proto = {
+	.func		= bpf_sock_opts_event_output,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM,
+	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
 static const struct bpf_func_proto bpf_setsockopt_proto = {
 	.func		= bpf_setsockopt,
 	.gpl_only	= false,
@@ -5226,6 +5243,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
+	case BPF_FUNC_perf_event_output:
+		return &bpf_sock_ops_event_output_proto;
 	case BPF_FUNC_setsockopt:
 		return &bpf_setsockopt_proto;
 	case BPF_FUNC_getsockopt:
-- 
1.7.1
[PATCH bpf-next 0/2] TCP-BPF event notification support
This patchset uses the eBPF perf-event based notification mechanism to solve the problem described in https://marc.info/?l=linux-netdev&m=154022219423571&w=2. Thanks to Daniel Borkmann for feedback/input.

The problem statement is:

We would like to monitor some subset of TCP sockets in user-space (the monitoring application would define 4-tuples it wants to monitor), using TCP_INFO stats to analyze reported problems. The idea is to use those stats to see where the bottlenecks are likely to be ("is it application-limited?" or "is there evidence of BufferBloat in the path?" etc).

Today we can do this by periodically polling for tcp_info, but this could be made more efficient if the kernel would asynchronously notify the application via tcp_info when some "interesting" thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc) are reached. And to make this effective, it is better if we could apply the threshold check *before* constructing the tcp_info netlink notification, so that we don't waste resources constructing notifications that will be discarded by the filter.

This patchset solves the problem by adding perf-event based notification support for sock_ops (Patch 1). The eBPF kernel module can thus be designed to apply any desired filters to the bpf_sock_ops and trigger a perf-event notification based on the verdict from the filter. The userspace component can use these perf-event notifications to either read any state managed by the eBPF kernel module, or issue a TCP_INFO netlink call if desired.
Patch 2 provides a simple example that shows how to use this infra (and also provides a test case for it).

Sowmini Varadhan (2):
  bpf: add perf-event notification support for sock_ops
  selftests/bpf: add a test case for sock_ops perf-event notification

 net/core/filter.c                                 |  19 ++
 tools/testing/selftests/bpf/Makefile              |   4 +-
 tools/testing/selftests/bpf/perf-sys.h            |  74 
 tools/testing/selftests/bpf/test_tcpnotify.h      |  19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |  95 +++
 tools/testing/selftests/bpf/test_tcpnotify_user.c | 186 +
 6 files changed, 396 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/perf-sys.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c
[PATCH bpf-next 2/2] selftests/bpf: add a test case for sock_ops perf-event notification
This patch provides a tcp_bpf based eBPF sample. The test
- uses ncat(1) as the TCP client program to connect() to a port with the intention of triggering SYN retransmissions: we first install an iptables DROP rule to make sure ncat SYNs are resent (instead of aborting instantly after a TCP RST)
- has a bpf kernel module that sends a perf-event notification for each TCP retransmit, and also tracks the number of such notifications sent in the global_map

The test passes when the number of event notifications intercepted in user-space matches the value in the global_map.

Signed-off-by: Sowmini Varadhan
---
 tools/testing/selftests/bpf/Makefile              |   4 +-
 tools/testing/selftests/bpf/perf-sys.h            |  74 
 tools/testing/selftests/bpf/test_tcpnotify.h      |  19 ++
 tools/testing/selftests/bpf/test_tcpnotify_kern.c |  95 +++
 tools/testing/selftests/bpf/test_tcpnotify_user.c | 186 +
 5 files changed, 377 insertions(+), 1 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/perf-sys.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpnotify_user.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index e39dfb4..6c94048 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -24,12 +24,13 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \ test_sock test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user \ test_socket_cookie test_cgroup_storage test_select_reuseport test_section_names \ - test_netcnt + test_netcnt test_tcpnotify_user TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \ test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \ sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o \ 
test_l4lb_noinline.o test_xdp_noinline.o test_stacktrace_map.o \ + test_tcpnotify_kern.o \ sample_map_ret0.o test_tcpbpf_kern.o test_stacktrace_build_id.o \ sockmap_tcp_msg_prog.o connect4_prog.o connect6_prog.o test_adjust_tail.o \ test_btf_haskv.o test_btf_nokv.o test_sockmap_kern.o test_tunnel_kern.o \ @@ -74,6 +75,7 @@ $(OUTPUT)/test_sock_addr: cgroup_helpers.c $(OUTPUT)/test_socket_cookie: cgroup_helpers.c $(OUTPUT)/test_sockmap: cgroup_helpers.c $(OUTPUT)/test_tcpbpf_user: cgroup_helpers.c +$(OUTPUT)/test_tcpnotify_user: cgroup_helpers.c trace_helpers.c $(OUTPUT)/test_progs: trace_helpers.c $(OUTPUT)/get_cgroup_id_user: cgroup_helpers.c $(OUTPUT)/test_cgroup_storage: cgroup_helpers.c diff --git a/tools/testing/selftests/bpf/perf-sys.h b/tools/testing/selftests/bpf/perf-sys.h new file mode 100644 index 000..3eb7a39 --- /dev/null +++ b/tools/testing/selftests/bpf/perf-sys.h @@ -0,0 +1,74 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _PERF_SYS_H +#define _PERF_SYS_H + +#include +#include +#include +#include +#include +#include +#include + +#ifdef __powerpc__ +#define CPUINFO_PROC {"cpu"} +#endif + +#ifdef __s390__ +#define CPUINFO_PROC {"vendor_id"} +#endif + +#ifdef __sh__ +#define CPUINFO_PROC {"cpu type"} +#endif + +#ifdef __hppa__ +#define CPUINFO_PROC {"cpu"} +#endif + +#ifdef __sparc__ +#define CPUINFO_PROC {"cpu"} +#endif + +#ifdef __alpha__ +#define CPUINFO_PROC {"cpu model"} +#endif + +#ifdef __arm__ +#define CPUINFO_PROC {"model name", "Processor"} +#endif + +#ifdef __mips__ +#define CPUINFO_PROC {"cpu model"} +#endif + +#ifdef __arc__ +#define CPUINFO_PROC {"Processor"} +#endif + +#ifdef __xtensa__ +#define CPUINFO_PROC {"core ID"} +#endif + +#ifndef CPUINFO_PROC +#define CPUINFO_PROC { "model name", } +#endif + +static inline int +sys_perf_event_open(struct perf_event_attr *attr, + pid_t pid, int cpu, int group_fd, + unsigned long flags) +{ + int fd; + + fd = syscall(__NR_perf_event_open, attr, pid, cpu, +group_fd, flags); + +#ifdef 
HAVE_ATTR_TEST + if (unlikely(test_attr__enabled)) + test_attr__open(attr, pid, cpu, fd, group_fd, flags); +#endif + return fd; +} + +#endif /* _PERF_SYS_H */ diff --git a/tools/testing/selftests/bpf/test_tcpnotify.h b/tools/testing/selftests/bpf/test_tcpnotify.h new file mode 100644 index 000..8b6cea0 --- /dev/null +++ b/tools/testing/selftests/bpf/test_tcpnotify.h @@ -0,0 +1,19 @@ +// SPDX-License-Identifier: GPL-2.0 + +#ifndef _TEST_TCPBPF_H +#define _TEST_TCPBPF_H + +struct tcpnotify_globals { + __u32 total_retrans; + __u32
[PATCH bpf-next 1/2] bpf: add perf-event notification support for sock_ops
This patch allows eBPF programs that use sock_ops to send perf-based event notifications using bpf_perf_event_output().

Signed-off-by: Sowmini Varadhan
---
 net/core/filter.c | 19 +++
 1 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index e521c5e..23464a3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4048,6 +4048,23 @@ static unsigned long bpf_xdp_copy(void *dst_buff, const void *src_buff,
 	return ret;
 }

+BPF_CALL_5(bpf_sock_opts_event_output, struct bpf_sock_ops *, skops,
+	   struct bpf_map *, map, u64, flags, void *, data, u64, size)
+{
+	return bpf_event_output(map, flags, data, size, NULL, 0, NULL);
+}
+
+static const struct bpf_func_proto bpf_sock_ops_event_output_proto = {
+	.func		= bpf_sock_opts_event_output,
+	.gpl_only	= true,
+	.ret_type	= RET_INTEGER,
+	.arg1_type	= ARG_PTR_TO_CTX,
+	.arg2_type	= ARG_CONST_MAP_PTR,
+	.arg3_type	= ARG_ANYTHING,
+	.arg4_type	= ARG_PTR_TO_MEM,
+	.arg5_type	= ARG_CONST_SIZE_OR_ZERO,
+};
+
 static const struct bpf_func_proto bpf_setsockopt_proto = {
 	.func		= bpf_setsockopt,
 	.gpl_only	= false,
@@ -5226,6 +5243,8 @@ bool bpf_helper_changes_pkt_data(void *func)
 sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 {
 	switch (func_id) {
+	case BPF_FUNC_perf_event_output:
+		return &bpf_sock_ops_event_output_proto;
 	case BPF_FUNC_setsockopt:
 		return &bpf_setsockopt_proto;
 	case BPF_FUNC_getsockopt:
-- 
1.7.1
[PATCH RFC net-next 0/3] Extensions to allow asynchronous TCP_INFO notifications based on congestion parameters
Problem statement:

We would like to monitor some subset of TCP sockets in user-space (the monitoring application would define 4-tuples it wants to monitor), using TCP_INFO stats to analyze reported problems. The idea is to use those stats to see where the bottlenecks are likely to be ("is it application-limited?" or "is there evidence of BufferBloat in the path?" etc).

Today we can do this by periodically polling for tcp_info, but this could be made more efficient if the kernel would asynchronously notify the application via tcp_info when some "interesting" thresholds (e.g., "RTT variance > X", or "total_retrans > Y" etc) are reached. And to make this effective, it is better if we could apply the threshold check *before* constructing the tcp_info netlink notification, so that we don't waste resources constructing notifications that will be discarded by the filter.

One idea, implemented in this patchset, is to extend the tcp_call_bpf() infra so that the BPF kernel module (the sock_ops filter/callback) can examine the values in the sock_ops, apply any thresholds it wants, and return a new status ("BPF_TCP_INFO_NOTIFY"). This status is then used in the tcp stack to queue up a tcp_info notification (similar to sock_diag_broadcast_destroy() today..).

Patch 1 in this set refactors the existing sock_diag code so that the functions can be reused for notifications from states other than CLOSE.

Patch 2 provides a minor extension to tcp_call_bpf() so that it will queue a tcp_info notification if the BPF callout returns BPF_TCP_INFO_NOTIFY.

Patch 3, provided strictly as a demonstration/PoC to aid in reviewing this proposal, shows a simple sample/bpf example where we trigger the tcp_info notification for an iperf connection if the number of retransmits exceeds 16.
Sowmini Varadhan (3):
  sock_diag: Refactor inet_sock_diag_destroy code
  tcp: BPF_TCP_INFO_NOTIFY support
  bpf: Added a sample for tcp_info_notify callback

 include/linux/sock_diag.h      | 18 +++---
 include/net/tcp.h              | 15 +++-
 include/uapi/linux/bpf.h       |  4 ++
 include/uapi/linux/sock_diag.h |  2 +
 net/core/sock.c                |  4 +-
 net/core/sock_diag.c           | 11 +++---
 samples/bpf/Makefile           |  1 +
 samples/bpf/tcp_notify_kern.c  | 73 
 8 files changed, 114 insertions(+), 14 deletions(-)
 create mode 100644 samples/bpf/tcp_notify_kern.c
[PATCH RFC net-next 2/3] tcp: BPF_TCP_INFO_NOTIFY support
We want to be able to set up the monitoring application so that it can be asynchronously notified when "interesting" events happen, e.g., when application-determined thresholds on parameters like RTT estimate, number of retransmissions, or RTO are reached.

The bpf_sock_ops infrastructure provided as part of commit 40304b2a1567 ("bpf: BPF support for sock_ops") provides an elegant way to trigger this asynchronous notification. The BPF program can examine the current TCP state reported in the bpf_sock_ops and conditionally return a (new) status BPF_TCP_INFO_NOTIFY. The return status is used by the caller to queue up a tcp_info notification for the application.

Signed-off-by: Sowmini Varadhan
---
 include/net/tcp.h        | 15 +--
 include/uapi/linux/bpf.h |  4 
 2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h index 0d29292..df06a9f 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -47,6 +47,7 @@ #include #include #include +#include extern struct inet_hashinfo tcp_hashinfo; @@ -2065,6 +2066,12 @@ struct tcp_ulp_ops { __MODULE_INFO(alias, alias_userspace, name);\ __MODULE_INFO(alias, alias_tcp_ulp, "tcp-ulp-" name) +#define TCPDIAG_CB(sk) \ +do { \ + if (unlikely(sk->sk_net_refcnt && sock_diag_has_listeners(sk))) \ + sock_diag_broadcast(sk);\ +} while (0) + /* Call BPF_SOCK_OPS program that returns an int. 
If the return value * is < 0, then the BPF op failed (for example if the loaded BPF * program does not support the chosen operation or there is no BPF @@ -2088,9 +2095,13 @@ static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args) memcpy(sock_ops.args, args, nargs * sizeof(*args)); ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops); - if (ret == 0) + if (ret == 0) { ret = sock_ops.reply; - else + + /* XXX would be nice if we could use replylong[1] here */ + if (ret == BPF_TCP_INFO_NOTIFY) + TCPDIAG_CB(sk); + } else ret = -1; return ret; } diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index aa5ccd2..bc45e5e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -2678,6 +2678,10 @@ enum { BPF_TCP_MAX_STATES /* Leave at the end! */ }; +enum { + BPF_TCP_INFO_NOTIFY = 2 +}; + #define TCP_BPF_IW 1001/* Set TCP initial congestion window */ #define TCP_BPF_SNDCWND_CLAMP 1002/* Set sndcwnd_clamp */ -- 1.7.1
[PATCH RFC net-next 3/3] bpf: Added a sample for tcp_info_notify callback
Simple Proof-Of-Concept test program for BPF_TCP_INFO_NOTIFY (will move this to testing/selftests/net later).

Signed-off-by: Sowmini Varadhan
---
 samples/bpf/Makefile          |  1 +
 samples/bpf/tcp_notify_kern.c | 73 +
 2 files changed, 74 insertions(+), 0 deletions(-)
 create mode 100644 samples/bpf/tcp_notify_kern.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index be0a961..d937bd2 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -152,6 +152,7 @@ always += tcp_bufs_kern.o
 always += tcp_cong_kern.o
 always += tcp_iw_kern.o
 always += tcp_clamp_kern.o
+always += tcp_notify_kern.o
 always += tcp_basertt_kern.o
 always += tcp_tos_reflect_kern.o
 always += xdp_redirect_kern.o
diff --git a/samples/bpf/tcp_notify_kern.c b/samples/bpf/tcp_notify_kern.c
new file mode 100644
index 000..bc4efd8
--- /dev/null
+++ b/samples/bpf/tcp_notify_kern.c
@@ -0,0 +1,73 @@
+/* Sample BPF program to demonstrate how to trigger async tcp_info
+ * notification based on thresholds determined in the filter.
+ * The example here will trigger notification if skops->total_retrans > 16
+ *
+ * Use load_sock_ops to load this BPF program.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+#define DEBUG 0
+
+#define bpf_printk(fmt, ...)					\
+({								\
+	char ____fmt[] = fmt;					\
+	bpf_trace_printk(____fmt, sizeof(____fmt),		\
+			 ##__VA_ARGS__);			\
+})
+
+SEC("sockops")
+int bpf_tcp_info_notify(struct bpf_sock_ops *skops)
+{
+	int bufsize = 15;
+	int to_init = 10;
+	int clamp = 100;
+	int rv = 0;
+	int op;
+
+	/* For testing purposes, only execute the rest of the BPF program
+	 * if one of the port numbers is 5001 (default iperf port)
+	 */
+	if (bpf_ntohl(skops->remote_port) != 5001 &&
+	    skops->local_port != 5001) {
+		skops->reply = -1;
+		return 0;
+	}
+
+	op = (int) skops->op;
+
+#ifdef DEBUG
+	bpf_printk("BPF command: %d\n", op);
+#endif
+
+	switch (op) {
+	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
+	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
+		bpf_sock_ops_cb_flags_set(skops,
+					  (BPF_SOCK_OPS_RETRANS_CB_FLAG|
+					   BPF_SOCK_OPS_RTO_CB_FLAG));
+		rv = 1;
+		break;
+	case BPF_SOCK_OPS_RETRANS_CB:
+	case BPF_SOCK_OPS_RTO_CB:
+		if (skops->total_retrans < 16)
+			rv = 1; /* skip */
+		else
+			rv = BPF_TCP_INFO_NOTIFY;
+		break;
+	default:
+		rv = -1;
+	}
+#ifdef DEBUG
+	bpf_printk("Returning %d\n", rv);
+#endif
+	skops->reply = rv;
+	return 1;
+}
+char _license[] SEC("license") = "GPL";
-- 
1.7.1
[PATCH RFC net-next 1/3] sock_diag: Refactor inet_sock_diag_destroy code
We want to use the inet_sock_diag_destroy code to send notifications for more types of TCP events than just socket_close(), so refactor the code to allow this. Signed-off-by: Sowmini Varadhan --- include/linux/sock_diag.h | 18 +- include/uapi/linux/sock_diag.h |2 ++ net/core/sock.c|4 ++-- net/core/sock_diag.c | 11 ++- 4 files changed, 23 insertions(+), 12 deletions(-) diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h index 15fe980..df85767 100644 --- a/include/linux/sock_diag.h +++ b/include/linux/sock_diag.h @@ -34,7 +34,7 @@ int sock_diag_put_filterinfo(bool may_report_filterinfo, struct sock *sk, struct sk_buff *skb, int attrtype); static inline -enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk) +enum sknetlink_groups sock_diag_group(const struct sock *sk) { switch (sk->sk_family) { case AF_INET: @@ -43,7 +43,15 @@ enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk) switch (sk->sk_protocol) { case IPPROTO_TCP: - return SKNLGRP_INET_TCP_DESTROY; + switch (sk->sk_state) { + case TCP_ESTABLISHED: + return SKNLGRP_INET_TCP_CONNECTED; + case TCP_SYN_SENT: + case TCP_SYN_RECV: + return SKNLGRP_INET_TCP_3WH; + default: + return SKNLGRP_INET_TCP_DESTROY; + } case IPPROTO_UDP: return SKNLGRP_INET_UDP_DESTROY; default: @@ -67,15 +75,15 @@ enum sknetlink_groups sock_diag_destroy_group(const struct sock *sk) } static inline -bool sock_diag_has_destroy_listeners(const struct sock *sk) +bool sock_diag_has_listeners(const struct sock *sk) { const struct net *n = sock_net(sk); - const enum sknetlink_groups group = sock_diag_destroy_group(sk); + const enum sknetlink_groups group = sock_diag_group(sk); return group != SKNLGRP_NONE && n->diag_nlsk && netlink_has_listeners(n->diag_nlsk, group); } -void sock_diag_broadcast_destroy(struct sock *sk); +void sock_diag_broadcast(struct sock *sk); int sock_diag_destroy(struct sock *sk, int err); #endif diff --git a/include/uapi/linux/sock_diag.h b/include/uapi/linux/sock_diag.h index 
e592500..4252674 100644 --- a/include/uapi/linux/sock_diag.h +++ b/include/uapi/linux/sock_diag.h @@ -32,6 +32,8 @@ enum sknetlink_groups { SKNLGRP_INET_UDP_DESTROY, SKNLGRP_INET6_TCP_DESTROY, SKNLGRP_INET6_UDP_DESTROY, + SKNLGRP_INET_TCP_3WH, + SKNLGRP_INET_TCP_CONNECTED, __SKNLGRP_MAX, }; #define SKNLGRP_MAX(__SKNLGRP_MAX - 1) diff --git a/net/core/sock.c b/net/core/sock.c index 7e8796a..6684840 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -1600,8 +1600,8 @@ static void __sk_free(struct sock *sk) if (likely(sk->sk_net_refcnt)) sock_inuse_add(sock_net(sk), -1); - if (unlikely(sk->sk_net_refcnt && sock_diag_has_destroy_listeners(sk))) - sock_diag_broadcast_destroy(sk); + if (unlikely(sk->sk_net_refcnt && sock_diag_has_listeners(sk))) + sock_diag_broadcast(sk); else sk_destruct(sk); } diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c index 3312a58..dbd9e65 100644 --- a/net/core/sock_diag.c +++ b/net/core/sock_diag.c @@ -116,14 +116,14 @@ static size_t sock_diag_nlmsg_size(void) + nla_total_size_64bit(sizeof(struct tcp_info))); /* INET_DIAG_INFO */ } -static void sock_diag_broadcast_destroy_work(struct work_struct *work) +static void sock_diag_broadcast_work(struct work_struct *work) { struct broadcast_sk *bsk = container_of(work, struct broadcast_sk, work); struct sock *sk = bsk->sk; const struct sock_diag_handler *hndl; struct sk_buff *skb; - const enum sknetlink_groups group = sock_diag_destroy_group(sk); + const enum sknetlink_groups group = sock_diag_group(sk); int err = -1; WARN_ON(group == SKNLGRP_NONE); @@ -144,11 +144,12 @@ static void sock_diag_broadcast_destroy_work(struct work_struct *work) else kfree_skb(skb); out: - sk_destruct(sk); + if (group <= SKNLGRP_INET6_UDP_DESTROY) + sk_destruct(sk); kfree(bsk); } -void sock_diag_broadcast_destroy(struct sock *sk) +void sock_diag_broadcast(struct sock *sk) { /* Note, this function is often called from an interrupt context. 
*/ struct broadcast_sk *bsk = @@ -156,7 +157,7 @@ void sock_diag_broadcast_destroy(struct sock *sk) if (!bsk) return sk_de
Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps
Without getting into Ahern's patchset, which he obviously feels quite passionately about..

On (10/11/18 12:28), David Miller wrote:
> Once you've composed the message, the whole point of filtering is lost.

It would be nice to apply the filter *before* constructing the skb, but afaict most things in BPF today only operate on sk_buffs. How should we use *BPF on something other than an sk_buff?

--Sowmini
Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps
On (10/11/18 09:33), Roopa Prabhu wrote:
> 3. All networking subsystems already have this type of netlink
> attribute filtering that apps rely on. This series
> just makes it consistent for route dumps. Apps use such mechanism
> already when requesting dumps.
> Like everywhere else, BPF hook can be an alternate parallel mechanism.

Sure, and that makes sense, though I hope we will explore those alternate mechanisms too.

--Sowmini
Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps
On (10/11/18 09:32), David Ahern wrote:
> Route dumps are done for the entire FIB for each address family. As we
> approach internet routing tables (700k+ routes for IPv4, currently
> around 55k for IPv6) with many VRFs dumping the entire table is grossly
> inefficient when for example only a single VRF table is wanted.

I think someone mentioned a long time ago that a VRF is not an interface/driver/net_device but rather a separate routing table with a dedicated set of interfaces, iirc :-) :-)

In the latter model, if you wanted to dump a VRF table, you'd only lock that table and walk it, instead of holding up other VRFs.

Sorry, could not resist my i-told-you-so moment :-P

--Sowmini
Re: [PATCH net-next 0/9] net: Kernel side filtering for route dumps
On (10/11/18 08:26), Stephen Hemminger wrote:
> You can do the something like this already with BPF socket filters.
> But writing BPF for multi-part messages is hard.

Indeed. And I was just experimenting with this for ARP just last week. So to handle the case of "ip neigh show a.b.c.d" without walking through the entire arp table and filtering in userspace, you could add a sk_filter() hook like this:

@@ -2258,6 +2260,24 @@ static int neigh_fill_info(struct sk_buff *skb, struct ne
 		goto nla_put_failure;

 	nlmsg_end(skb, nlh);
+
+	/* XXX skb->sk can be null in the neigh_timer_handler->__neigh_notify
+	 * path. Revisit..
+	 */
+	if (!skb->sk)
+		return 0;
+
+	/* pull/push skb->data pointers so that sk_filter only sees the
+	 * most recent nlh that was just added.
+	 */
+	len = skb->len - nlh->nlmsg_len;
+	skb_pull(skb, len);
+	ret = sk_filter(skb->sk, skb);
+	skb_push(skb, len);
+	if (ret)
+		nlmsg_cancel(skb, nlh);
 	return 0;

Writing the cBPF filter is not horrible, due to the nla extension. E.g., to pass a filter that matches on if_index and ipv4 address, the bpf_asm src below will do the job. The benefit of using cBPF is that we can use this on older kernels as well.

/*
 * Generated from the bpf_asm src
 *	ldb [20]		; len(nlmsghdr) + offsetof(ndm_ifindex)
 *	jne sll->sll_ifindex, skip
 *	ld #28			; A <- len(nlmsghdr) + len(ndmsg), payload offset
 *	ldx #1			; X <- NDA_DST
 *	ld #nla			; A <- offset(NDA_DST)
 *	jeq #0, skip
 *	tax
 *	ld [x + 4]		; A <- value(NDA_DST)
 *	jneq htonl(addr), skip
 *	ret #-1
 * skip: ret #0
 */
struct sock_filter bpf_filter[] = {
	{ 0x30, 0, 0, 0x0014 },
	{ 0x15, 0, 1, sll->sll_ifindex },
	{ 0x00, 0, 0, 0x001c },
	{ 0x01, 0, 0, 0x0001 },
	{ 0x20, 0, 0, 0xf00c },
	{ 0x15, 4, 0, 00 },
	{ 0x07, 0, 0, 00 },
	{ 0x40, 0, 0, 0x0004 },
	{ 0x15, 0, 1, htonl(addr) },
	{ 0x06, 0, 0, 0xffffffff },
	{ 0x06, 0, 0, 00 },
};
Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
On (09/17/18 16:15), Alexei Starovoitov wrote: > > if the goal is to add firewall ability to RDS then the patch set > is going in the wrong direction. The goal is to add the ability to process scatterlists directly, just like we process skb's today. Your main objection was that you wanted a test case in selftests that was aligned with existing tests; Tushar is working on that patchset. Why don't we wait for that patchset before continuing this discussion further? > May be the right answer is to teach rds to behave like the rest of protocols. > Then all existing tooling and features will 'just work' ? RDS does not need to be taught anything :-) I think KCM is modeled on the RDS stack model. Before we "teach" rds anything, "we" need to understand what RDS does first - google can provide a lot of slide decks that explain the rds stack to you, suggest you look at those first. Meanwhile, how about waiting for Tushar's next patchset, where you will have your selftests that are based on veth/netns just like existing tests for XDP, vxlan etc. I strongly suggest waiting for that. And btw, it would have been very useful/courteous to help with the RFC reviews to start with. --Sowmini
Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
On (09/12/18 19:07), Alexei Starovoitov wrote: > > I didn't know that. The way I understand your statement that > this new program type, new sg logic, and all the complexity > are only applicable to RDMA capable hw and RDS. I don't know if you have been following the RFC series at all (and DanielB/JohnF feedback to it) but that is not what the patch set is about. To repeat a summary of the original problem statement: RDS (hardly a "niche" driver, let's please not get carried away with strong assertions based on incomplete understanding) is an example of a driver that happens to pass up packets as both scatterlists and sk_buffs to the ULPs. The scatterlists come from IB, the sk_buffs come from the ethernet drivers. At the moment, the only way to build firewalls for this is to convert scatterlist to skb and use either netfilter or eBPF on the skb. What Tushar is adding is support to use eBPF on the scatterlist itself, so that you don't have to do this inefficient scatterlist->skb conversion. > In such case, I think, such BPF extensions do not belong > in the upstream kernel. I don't want BPF to support niche technologies, > since the maintenance overhead makes it prohibitive long term. After I sent the message, I noticed that selftests/bpf has some tests using veth/netns. I think Tushar should be able to write tests for the rds-tcp path (and thus test the eBPF infrastructure, which seems to be what you are worried about). Does that address your concern? --Sowmini
Re: [PATCH net-next 5/5] ebpf: Add sample ebpf program for SOCKET_SG_FILTER
> On 09/11/2018 09:00 PM, Alexei Starovoitov wrote: > >please no samples. > >Add this as proper test to tools/testing/selftests/bpf > >that reports PASS/FAIL and can be run automatically. > >samples/bpf is effectively dead code. Just a second. You do realize that RDS is doing real networking, so it needs RDMA capable hardware to test the rds_rdma paths? Also, when we "talk to ourselves" we default to the rds_loop transport, so we would even bypass the rds-tcp module. I don't think this can be tested with some academic "test it over lo0" exercise.. I suppose you can add example code in selftests for this, but asking for a "proper test" may be a little unrealistic here - a proper test needs proper hardware in this case. --Sowmini
Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE
On (09/10/18 17:16), Cong Wang wrote: > > > > On (09/10/18 16:51), Cong Wang wrote: > > > > > > __rds_create_bind_key(key, addr, port, scope_id); > > > - rs = rhashtable_lookup_fast(&bind_hash_table, key, ht_parms); > > > + rcu_read_lock(); > > > + rs = rhashtable_lookup(&bind_hash_table, key, ht_parms); > > > if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) > > > rds_sock_addref(rs); > > > else > > > rs = NULL; > > > + rcu_read_unlock(); > > > > aiui, the rcu_read_lock/unlock is only useful if the write > > side doing destructive operations does something to make sure readers > > are done before doing the destructive operation. AFAIK, that does > > not exist for rds socket management today > > That is exactly why we need it here, right? Maybe I am confused, what exactly is the patch you are proposing? Does it have the SOCK_RCU_FREE change? Does it have the rcu_read_lock you have above? Where is the call_rcu? > Hmm, so you are saying synchronize_rcu() is kinda more correct > than call_rcu()?? I'm not saying that, I'm asking "what exactly is the patch you are proposing?" The only one on record is http://patchwork.ozlabs.org/patch/968282/ which does not have either synchronize_rcu or call_rcu. > I never hear this before, would like to know why. Please post precise patches first. Thanks.
Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE
On (09/10/18 16:51), Cong Wang wrote: > > __rds_create_bind_key(key, addr, port, scope_id); > - rs = rhashtable_lookup_fast(&bind_hash_table, key, ht_parms); > + rcu_read_lock(); > + rs = rhashtable_lookup(&bind_hash_table, key, ht_parms); > if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) > rds_sock_addref(rs); > else > rs = NULL; > + rcu_read_unlock(); aiui, the rcu_read_lock/unlock is only useful if the write side doing destructive operations does something to make sure readers are done before doing the destructive operation. AFAIK, that does not exist for rds socket management today > Although sock release path is not a very hot path, but blocking > it isn't a good idea either, especially when you can use call_rcu(), > which has the same effect. > > I don't see any reason we should prefer synchronize_rcu() here. Usually correctness (making sure all readers are done before nuking a data structure) is a little bit more important than performance, aka "safety before speed" is what I've always been taught. Clearly, your mileage varies. As you please. --Sowmini
Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE
On (09/10/18 15:43), Santosh Shilimkar wrote: > On 9/10/2018 3:24 PM, Cong Wang wrote: > >When a rds sock is bound, it is inserted into the bind_hash_table > >which is protected by RCU. But when releasing rd sock, after it > >is removed from this hash table, it is freed immediately without > >respecting RCU grace period. This could cause some use-after-free > >as reported by syzbot. > > > Indeed. > > >Mark the rds sock as SOCK_RCU_FREE before inserting it into the > >bind_hash_table, so that it would be always freed after a RCU grace > >period. So I'm not sure I understand. Yes, Cong's fix may eliminate *some* of the syzbot failures, but the basic problem is not solved. To take one example of possible races (one that was discussed in https://www.spinics.net/lists/netdev/msg475074.html) rds_recv_incoming->rds_find_bound is being called in rds_send_worker context and the rds_find_bound code is 63 rs = rhashtable_lookup_fast(&bind_hash_table, &key, ht_parms); 64 if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) 65 rds_sock_addref(rs); 66 else 67 rs = NULL; 68 After we find an rs at line 63, how can we be sure that the entire logic of rds_release does not execute on another cpu, and free the rs, before we hit line 64 with the bad rs? Normally synchronize_rcu() or synchronize_net() in rds_release() would ensure this. How do we ensure this with SOCK_RCU_FREE (or is the intention to just reduce *some* of the syzbot failures)? --Sowmini
Re: [Patch net] rds: mark bound socket with SOCK_RCU_FREE
On (09/10/18 15:24), Cong Wang wrote: > > When a rds sock is bound, it is inserted into the bind_hash_table > which is protected by RCU. But when releasing rd sock, after it > is removed from this hash table, it is freed immediately without > respecting RCU grace period. This could cause some use-after-free > as reported by syzbot. > I have no objection to the change itself, but the syzbot failures are caused for a very simple reason: we need synchronize_net() in rds_release before we remove the rds_sock from the bind_hash_table. I already pointed this out in https://www.spinics.net/lists/netdev/msg475074.html I think the objection to synchronize_net() is that it can cause perf issues (I'm told that rds_release() has been known to be held up by other threads in rcu critical sections?) but I personally don't see any other alternative to this (other than going back to rwlock, instead of rcu) --Sowmini
Re: [PATCH RFC net-next 00/11] udp gso
On (09/03/18 10:02), Steffen Klassert wrote: > I'm working on patches that builds such skb lists. The list is chained > at the frag_list pointer of the first skb, all subsequent skbs are linked > to the next pointer of the skb. It looks like this: there are some risks to using the frag_list pointer; Alex Duyck had pointed this out to me in https://www.mail-archive.com/netdev@vger.kernel.org/msg131081.html (see the last paragraph, or search for the string "gotcha" there) I don't know the details of your playground patch, but you might want to watch out for those, to make sure it is immune to these issues.. --Sowmini
[PATCH V2 ipsec-next 0/2] xfrm: bug fixes when processing multiple transforms
This series contains bug fixes that were encountered when I set up a libreswan tunnel using the config below, which will set up an IPsec policy involving 2 tmpls. type=transport compress=yes esp=aes_gcm_c-128-null # offloaded to Niantic auto=start The non-offload test case uses esp=aes_gcm_c-256-null. Each patch has a technical description of the contents of the fix. V2: added Fixes tag so that it can be backported to the stable trees. Sowmini Varadhan (2): xfrm: reset transport header back to network header after all input transforms have been applied xfrm: reset crypto_done when iterating over multiple input xfrms net/ipv4/xfrm4_input.c |1 + net/ipv4/xfrm4_mode_transport.c |4 +--- net/ipv6/xfrm6_input.c |1 + net/ipv6/xfrm6_mode_transport.c |4 +--- net/xfrm/xfrm_input.c |1 + 5 files changed, 5 insertions(+), 6 deletions(-)
[PATCH V2 ipsec-next 1/2] xfrm: reset transport header back to network header after all input transforms have been applied
A policy may have been set up with multiple transforms (e.g., ESP and ipcomp). In this situation, the ingress IPsec processing iterates in xfrm_input() and applies each transform in turn, processing the nexthdr to find any additional xfrm that may apply. This patch resets the transport header back to network header only after the last transformation so that subsequent xfrms can find the correct transport header. Fixes: 7785bba299a8 ("esp: Add a software GRO codepath") Suggested-by: Steffen Klassert Signed-off-by: Sowmini Varadhan --- v2: added "Fixes" tag net/ipv4/xfrm4_input.c |1 + net/ipv4/xfrm4_mode_transport.c |4 +--- net/ipv6/xfrm6_input.c |1 + net/ipv6/xfrm6_mode_transport.c |4 +--- 4 files changed, 4 insertions(+), 6 deletions(-) diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c index bcfc00e..f8de248 100644 --- a/net/ipv4/xfrm4_input.c +++ b/net/ipv4/xfrm4_input.c @@ -67,6 +67,7 @@ int xfrm4_transport_finish(struct sk_buff *skb, int async) if (xo && (xo->flags & XFRM_GRO)) { skb_mac_header_rebuild(skb); + skb_reset_transport_header(skb); return 0; } diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c index 3d36644..1ad2c2c 100644 --- a/net/ipv4/xfrm4_mode_transport.c +++ b/net/ipv4/xfrm4_mode_transport.c @@ -46,7 +46,6 @@ static int xfrm4_transport_output(struct xfrm_state *x, struct sk_buff *skb) static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb) { int ihl = skb->data - skb_transport_header(skb); - struct xfrm_offload *xo = xfrm_offload(skb); if (skb->transport_header != skb->network_header) { memmove(skb_transport_header(skb), @@ -54,8 +53,7 @@ static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb) skb->network_header = skb->transport_header; } ip_hdr(skb)->tot_len = htons(skb->len + ihl); - if (!xo || !(xo->flags & XFRM_GRO)) - skb_reset_transport_header(skb); + skb_reset_transport_header(skb); return 0; } diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c index 
841f4a0..9ef490d 100644 --- a/net/ipv6/xfrm6_input.c +++ b/net/ipv6/xfrm6_input.c @@ -59,6 +59,7 @@ int xfrm6_transport_finish(struct sk_buff *skb, int async) if (xo && (xo->flags & XFRM_GRO)) { skb_mac_header_rebuild(skb); + skb_reset_transport_header(skb); return -1; } diff --git a/net/ipv6/xfrm6_mode_transport.c b/net/ipv6/xfrm6_mode_transport.c index 9ad07a9..3c29da5 100644 --- a/net/ipv6/xfrm6_mode_transport.c +++ b/net/ipv6/xfrm6_mode_transport.c @@ -51,7 +51,6 @@ static int xfrm6_transport_output(struct xfrm_state *x, struct sk_buff *skb) static int xfrm6_transport_input(struct xfrm_state *x, struct sk_buff *skb) { int ihl = skb->data - skb_transport_header(skb); - struct xfrm_offload *xo = xfrm_offload(skb); if (skb->transport_header != skb->network_header) { memmove(skb_transport_header(skb), @@ -60,8 +59,7 @@ static int xfrm6_transport_input(struct xfrm_state *x, struct sk_buff *skb) } ipv6_hdr(skb)->payload_len = htons(skb->len + ihl - sizeof(struct ipv6hdr)); - if (!xo || !(xo->flags & XFRM_GRO)) - skb_reset_transport_header(skb); + skb_reset_transport_header(skb); return 0; } -- 1.7.1
[PATCH V2 ipsec-next 2/2] xfrm: reset crypto_done when iterating over multiple input xfrms
We only support one offloaded xfrm (we do not have devices that can handle more than one offload), so reset crypto_done in xfrm_input() when iterating over multiple transforms in xfrm_input, so that we can invoke the appropriate x->type->input for the non-offloaded transforms Fixes: d77e38e612a0 ("xfrm: Add an IPsec hardware offloading API") Signed-off-by: Sowmini Varadhan --- v2: added "Fixes" tag net/xfrm/xfrm_input.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index b89c9c7..be3520e 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -458,6 +458,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR); goto drop; } + crypto_done = false; } while (!err); err = xfrm_rcv_cb(skb, family, x->type->proto, 0); -- 1.7.1
[PATCH ipsec-next 2/2] xfrm: reset crypto_done when iterating over multiple input xfrms
We only support one offloaded xfrm (we do not have devices that can handle more than one offload), so reset crypto_done in xfrm_input() when iterating over multiple transforms in xfrm_input, so that we can invoke the appropriate x->type->input for the non-offloaded transforms Signed-off-by: Sowmini Varadhan --- net/xfrm/xfrm_input.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c index b89c9c7..be3520e 100644 --- a/net/xfrm/xfrm_input.c +++ b/net/xfrm/xfrm_input.c @@ -458,6 +458,7 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 spi, int encap_type) XFRM_INC_STATS(net, LINUX_MIB_XFRMINHDRERROR); goto drop; } + crypto_done = false; } while (!err); err = xfrm_rcv_cb(skb, family, x->type->proto, 0); -- 1.7.1
[PATCH ipsec-next 1/2] xfrm: reset transport header back to network header after all input transforms have been applied
A policy may have been set up with multiple transforms (e.g., ESP and ipcomp). In this situation, the ingress IPsec processing iterates in xfrm_input() and applies each transform in turn, processing the nexthdr to find any additional xfrm that may apply. This patch resets the transport header back to network header only after the last transformation so that subsequent xfrms can find the correct transport header. Suggested-by: Steffen Klassert Signed-off-by: Sowmini Varadhan --- net/ipv4/xfrm4_input.c |1 + net/ipv4/xfrm4_mode_transport.c |4 +--- net/ipv6/xfrm6_input.c |1 + net/ipv6/xfrm6_mode_transport.c |4 +--- 4 files changed, 4 insertions(+), 6 deletions(-) diff --git a/net/ipv4/xfrm4_input.c b/net/ipv4/xfrm4_input.c index bcfc00e..f8de248 100644 --- a/net/ipv4/xfrm4_input.c +++ b/net/ipv4/xfrm4_input.c @@ -67,6 +67,7 @@ int xfrm4_transport_finish(struct sk_buff *skb, int async) if (xo && (xo->flags & XFRM_GRO)) { skb_mac_header_rebuild(skb); + skb_reset_transport_header(skb); return 0; } diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c index 3d36644..1ad2c2c 100644 --- a/net/ipv4/xfrm4_mode_transport.c +++ b/net/ipv4/xfrm4_mode_transport.c @@ -46,7 +46,6 @@ static int xfrm4_transport_output(struct xfrm_state *x, struct sk_buff *skb) static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb) { int ihl = skb->data - skb_transport_header(skb); - struct xfrm_offload *xo = xfrm_offload(skb); if (skb->transport_header != skb->network_header) { memmove(skb_transport_header(skb), @@ -54,8 +53,7 @@ static int xfrm4_transport_input(struct xfrm_state *x, struct sk_buff *skb) skb->network_header = skb->transport_header; } ip_hdr(skb)->tot_len = htons(skb->len + ihl); - if (!xo || !(xo->flags & XFRM_GRO)) - skb_reset_transport_header(skb); + skb_reset_transport_header(skb); return 0; } diff --git a/net/ipv6/xfrm6_input.c b/net/ipv6/xfrm6_input.c index 841f4a0..9ef490d 100644 --- a/net/ipv6/xfrm6_input.c +++ 
b/net/ipv6/xfrm6_input.c @@ -59,6 +59,7 @@ int xfrm6_transport_finish(struct sk_buff *skb, int async) if (xo && (xo->flags & XFRM_GRO)) { skb_mac_header_rebuild(skb); + skb_reset_transport_header(skb); return -1; } diff --git a/net/ipv6/xfrm6_mode_transport.c b/net/ipv6/xfrm6_mode_transport.c index 9ad07a9..3c29da5 100644 --- a/net/ipv6/xfrm6_mode_transport.c +++ b/net/ipv6/xfrm6_mode_transport.c @@ -51,7 +51,6 @@ static int xfrm6_transport_output(struct xfrm_state *x, struct sk_buff *skb) static int xfrm6_transport_input(struct xfrm_state *x, struct sk_buff *skb) { int ihl = skb->data - skb_transport_header(skb); - struct xfrm_offload *xo = xfrm_offload(skb); if (skb->transport_header != skb->network_header) { memmove(skb_transport_header(skb), @@ -60,8 +59,7 @@ static int xfrm6_transport_input(struct xfrm_state *x, struct sk_buff *skb) } ipv6_hdr(skb)->payload_len = htons(skb->len + ihl - sizeof(struct ipv6hdr)); - if (!xo || !(xo->flags & XFRM_GRO)) - skb_reset_transport_header(skb); + skb_reset_transport_header(skb); return 0; } -- 1.7.1
[PATCH ipsec-next 0/2] xfrm: bug fixes when processing multiple transforms
This series contains bug fixes that were encountered when I set up a libreswan tunnel using the config below, which will set up an IPsec policy involving 2 tmpls. type=transport compress=yes esp=aes_gcm_c-128-null # offloaded to Niantic auto=start The non-offload test case uses esp=aes_gcm_c-256-null. Each patch has a technical description of the contents of the fix. Sowmini Varadhan (2): xfrm: reset transport header back to network header after all input transforms have been applied xfrm: reset crypto_done when iterating over multiple input xfrms net/ipv4/xfrm4_input.c |1 + net/ipv4/xfrm4_mode_transport.c |4 +--- net/ipv6/xfrm6_input.c |1 + net/ipv6/xfrm6_mode_transport.c |4 +--- net/xfrm/xfrm_input.c |1 + 5 files changed, 5 insertions(+), 6 deletions(-)
Re: [PATCH] rds: tcp: remove duplicated include from tcp.c
On (08/21/18 14:05), Yue Haibing wrote: > Remove duplicated include. > > Signed-off-by: Yue Haibing Acked-by: Sowmini Varadhan
Re: [PATCH net-next] rds: avoid lock hierarchy violation between m_rs_lock and rs_recv_lock
On (08/08/18 14:51), Santosh Shilimkar wrote: > This bug doesn't make sense since two different transports are using > same socket (Loop and rds_tcp) and running together. > For same transport, such race can't happen with MSG_ON_SOCK flag. > CPU1-> rds_loop_inc_free > CPU0 -> rds_tcp_cork ... > The test is just reporting a lock hierarchy violation As far as I can tell, this wasn't an actual deadlock that happened, because, as you point out, either a socket has the rds_tcp transport or the rds_loop transport, so this particular pair of stack traces would not happen with the code as it exists today. But there is a valid lock hierarchy violation here, and imho it's a good idea to get that cleaned up. It also avoids needlessly holding the rs_recv_lock while doing an rds_inc_put. --Sowmini
[PATCH net-next] rds: avoid lock hierarchy violation between m_rs_lock and rs_recv_lock
The following deadlock, reported by syzbot, can occur if CPU0 is in rds_send_remove_from_sock() while CPU1 is in rds_clear_recv_queue() CPU0CPU1 lock(&(&rm->m_rs_lock)->rlock); lock(&rs->rs_recv_lock); lock(&(&rm->m_rs_lock)->rlock); lock(&rs->rs_recv_lock); The deadlock should be avoided by moving the messages from the rs_recv_queue into a tmp_list in rds_clear_recv_queue() under the rs_recv_lock, and then dropping the refcnt on the messages in the tmp_list (potentially resulting in rds_message_purge()) after dropping the rs_recv_lock. The same lock hierarchy violation also exists in rds_still_queued() and should be avoided in a similar manner Signed-off-by: Sowmini Varadhan Reported-by: syzbot+52140d69ac6dc6b92...@syzkaller.appspotmail.com --- net/rds/recv.c | 11 +-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/net/rds/recv.c b/net/rds/recv.c index 504cd6b..1cf7072 100644 --- a/net/rds/recv.c +++ b/net/rds/recv.c @@ -429,6 +429,7 @@ static int rds_still_queued(struct rds_sock *rs, struct rds_incoming *inc, struct sock *sk = rds_rs_to_sk(rs); int ret = 0; unsigned long flags; + bool drop_ref = false; write_lock_irqsave(&rs->rs_recv_lock, flags); if (!list_empty(&inc->i_item)) { @@ -439,11 +440,13 @@ static int rds_still_queued(struct rds_sock *rs, struct rds_incoming *inc, -be32_to_cpu(inc->i_hdr.h_len), inc->i_hdr.h_dport); list_del_init(&inc->i_item); - rds_inc_put(inc); + drop_ref = true; } } write_unlock_irqrestore(&rs->rs_recv_lock, flags); + if (drop_ref) + rds_inc_put(inc); rdsdebug("inc %p rs %p still %d dropped %d\n", inc, rs, ret, drop); return ret; } @@ -751,16 +754,20 @@ void rds_clear_recv_queue(struct rds_sock *rs) struct sock *sk = rds_rs_to_sk(rs); struct rds_incoming *inc, *tmp; unsigned long flags; + LIST_HEAD(tmp_list); write_lock_irqsave(&rs->rs_recv_lock, flags); list_for_each_entry_safe(inc, tmp, &rs->rs_recv_queue, i_item) { rds_recv_rcvbuf_delta(rs, sk, inc->i_conn->c_lcong, -be32_to_cpu(inc->i_hdr.h_len), 
inc->i_hdr.h_dport); + list_move_tail(&inc->i_item, &tmp_list); + } + write_unlock_irqrestore(&rs->rs_recv_lock, flags); + list_for_each_entry_safe(inc, tmp, &tmp_list, i_item) { list_del_init(&inc->i_item); rds_inc_put(inc); } - write_unlock_irqrestore(&rs->rs_recv_lock, flags); } /* -- 1.7.1
Re: [PATCH] rds: send: Fix dead code in rds_sendmsg
On (07/25/18 10:22), Gustavo A. R. Silva wrote: > Currently, code at label *out* is unreachable. Fix this by updating > variable *ret* with -EINVAL, so the jump to *out* can be properly > executed instead of directly returning from function. > > Addresses-Coverity-ID: 1472059 ("Structurally dead code") > Fixes: 1e2b44e78eea ("rds: Enable RDS IPv6 support") > Signed-off-by: Gustavo A. R. Silva Acked-by: Sowmini Varadhan
Re: [PATCH v3 net-next 0/3] rds: IPv6 support
On (07/18/18 15:19), Ka-Cheong Poon wrote: > >bind() and connect() are using the sa_family/ss_family to have > >the application signal to the kernel about whether ipv4 or ipv6 is > >desired. (and bind and connect are doing the right thing for > >v4mapped, so that doesn't seem to be a problem there) > > > >In this case you want the application to signal that info via > >the optlen. (And the reason for this inconsistency is that you don't > >want to deal with the user->kernel copy in the same way?) > > > Because doing that can break existing RDS apps. Existing code > does not check the address family in processing this socket > option. It only cares about the address and port. If the new I'll leave this up to DaveM. Existing code only handles IPv4; everywhere else, we always check the sa_family or ss_family first and verify the length afterward. This was DaveM's original point about bind/connect/sendmsg. I don't know why rds sockopts have to be special. > code suddenly checks that and use it to decide on the address > family, working app will break. As I mentioned before, this > patch set does not change existing behavior. And doing what > you mentioned will change existing behavior and break apps. thank you. --Sowmini
Re: [PATCH v3 net-next 0/3] rds: IPv6 support
On (07/17/18 13:32), Ka-Cheong Poon wrote: > > The app can use either structures to make the call. When the > app fills in the structure, it knows what it is filling in, > either sockaddr_in or sockaddr_in6. So it knows the right size > to use. The app can also use IPv4 mapped address in a sockaddr_in6 > without a problem. Typical applications that I have seen in routing code will use a union like union { struct sockaddr_in sin; struct sockaddr_in6 sin6; } Or they will use sockaddr_storage. Passing down the sizeof of that structure will do the wrong thing in the existing code for ipv4 (even though it will not generate EFAULT).. > Could you please explain the inconsistency? An app can use IPv4 > mapped address in a sockaddr_in6 to operate on an IPv4 connection, > in case you are thinking of this new addition in v3 of the patch. bind() and connect() are using the sa_family/ss_family to have the application signal to the kernel about whether ipv4 or ipv6 is desired. (and bind and connect are doing the right thing for v4mapped, so that doesn't seem to be a problem there) In this case you want the application to signal that info via the optlen. (And the reason for this inconsistency is that you don't want to deal with the user->kernel copy in the same way?) --Sowmini
Re: [PATCH v3 net-next 0/3] rds: IPv6 support
- Looks like rds_connect() is checking things in the right order (thanks) However, rds_cancel_sent_to is still looking at the len to figure out the family.. as we move to ipv6, it would be better if we allow the caller to specify struct sockaddr_storage, or even a union of sockaddr_in/sockaddr_in6, rather than require them to hint at which one of ipv4/ipv6 through the optlen. Please see __sys_connect and move_addr_to_kernel if the user-kernel copy is the reason you are not doing this. Similar to inet_dgram_connect you can then check the sa_family and use that to figure out the "Assume IPv4" etc stuff. This would also make the CANCEL_SEND_TO API consistent with the bind/ connect etc semantics. - net/rds/rds.h: thanks for moving RDS_CM_PORT to the rdma specific file. I am guessing (?) that you want to update the comment to talk about the non-existent "RDS over UDP" based on the title of the IANA registration? I would just like to re-iterate that this is actually inaccurate (and confusing to someone looking at this for the first time, since there is no RDS-over-UDP today). If it were up to me, I would update the comment to say /* The following ports, 16385, 18634, 18635, are registered with IANA as * the ports to be used for "RDS over TCP and UDP". * The current linux implementation supports RDS over TCP and IB, and uses * the ports as follows: 18634 is the historical value used for the * RDMA_CM listener port. RDS/TCP uses port 16385. After * IPv6 work, RDMA_CM also uses 16385 as the listener port. 18634 is kept * to ensure compatibility with older RDS modules. Those ports are defined * in each transport's header file. IMHO that makes the comment look a little less odd (I've already explained to you why RDS-over-UDP does not make much practical sense for the RDS use-cases we anticipate). YMMV. Thanks, --Sowmini
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On (07/06/18 23:08), Ka-Cheong Poon wrote: > > As mentioned in a previous mail, it is unclear why the > port number is transport specific. Most Internet services > use the same port number running over TCP/UDP as shown > in the IANA database. And the IANA RDS registration is > the same. What is the rationale of having a transport > specific number in the RDS implementation? because not every transport may need a port number. e.g., "RDS over pigeon carrier" may not need a port number. in a different design (e.g., KCM) the listen port is configurable with sysctl Some may need more than one, e.g., rds_rdma, evidently. What is the problem with using the unused 18635 for the RDS_CM_PORT?
Re: [PATCH v2 net-next 0/3] rds: IPv6 support
On (07/06/18 22:36), Ka-Cheong Poon wrote: > This patch series does not change existing behavior. But > I think this is a strange RDS semantics as it differs from > other types of socket. But this is not about IPv6 support > and can be dealt with later. sure, > > Since we are choosing to treat this behavior as a feature we > > need to be consistent in rds_getname(). If we find > > (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound, > > the returned address (address size, address family) should be based > > on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a > > bind(), and try to do a getsockname(), I get back a sockaddr_in?!) > > > OK, will change that. > >- rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg > > As DaveM has already pointed out, we should be using sa_family to figure > > out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting > > len first, and then the family- that way wont work if I pass in > > sockaddr_storage). E.g., see inet_dgram_connect. > > > > if (addr_len < sizeof(uaddr->sa_family)) > > return -EINVAL; > OK, will change that. thank you > >- In net/rds/rds.h; > > > > /* The following ports, 16385, 18634, 18635, are registered with IANA as > >* the ports to be used for RDS over TCP and UDP. 18634 is the historical > > > > What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. > > IN fact RDS has nothing to do with UDP. The comment is confused. See next > > item below, where the comment disappears. > > > As mentioned in the above comments, they are registered with IANA. 16385 is registered for RDS-TCP. what does it mean to run rds-tcp and rds_rdma with the CM at the same time? Is this possible? (if it is, why not poach some other number like 2049 from NFS?!) 18635 is actually not used in the RDS code. Why can you not use that for RDS_CM_PORT? 
In general, please do NOT pull up these definitions into net/rds/rds.h They may change in the future - and they really belong in the transport specific header file - you question this further below, see my answer there. > There is no RDS/UDP in Linux but the port numbers are registered > nevertheless. And RDS can work over UDP AFAICT. BTW, it is really No it cannot. RDS needs a reliable transport. If you start using UDP (or pigeon-carrier) for the transport you have to build the reliability yourself. The Oracle DB libraries either use RDS-TCP or RDS-IB or UDP (in the UDP case they do their own congestion management in userspace, and history has shown that it is hard to get that right, thus the desire to switch to rds-tcp). Anyway, even though Andy Grover registered 18635 for "RDS-IB UDP", that is probably the most harmless one to use for this case, because - it is unused, - it calls itself rds-Infiniband, - the likelihood of doing rds-udp is infinitesimally small (it is more likely that we will need to send packets from abc::1 -> fe80:2 before we need rds-udp :-) :-) :-)) - and if rds-udp happens, we can use 16385/udp for rds-udp so please use 18635 for your RDS_CM_PORT and move it to IB specific header files and let's converge this thread. Towing RDS_TCP_PORT around has absolutely nothing to do with your ipv6 patch set.
There is no need to define this now; if/when we add RDS/SCTP, we shall keep that in mind. This is getting to be "arguing for the sake of arguing". I don't have the time for that. > a new port number for each transport. What is the rationale that > each transport needs to have its own number and be defined in its > own header file? Some transports may not even need a port number. Some may need several. Some may use sysctl (suggested by Tom Herbert) to make this configurable. Until recently, we had rds/iWARP, which may have needed its own transport hooks; it does not make sense to peanut-butter that into the core module. That is why it has to be in the transport. I hope that answers the question. > If the behavior of a software component is modified/enhanced such > that the existing API of this component has problems dealing with > this new behavior, should the API be modified or a new API be added > to handle that? If neither is done, shouldn't this be considered > a bug? whatever. The design (parallelization for perf) is fine. Some parts of
Re: [PATCH v2 net-next 1/3] rds: Changing IP address internal representation to struct in6_addr
On (07/06/18 17:08), Ka-Cheong Poon wrote: > >Hmm. Why do you need to include tcp header in ib transport > >code ? If there is any common function just move to core > >common file and use it. > > I think it can be removed as it is left over from earlier > changes when the IB IPv6 listener port was RDS_TCP_PORT. > Now all the port definitions are in rds.h. Transport specific information such as port numbers should not be smeared into the common rds-module. Please fix that. If IB is also piggybacking on port 16385 (why?), please use your own definitions for it in IB specific header files. Santosh, David, I have to NACK this if it is not changed. --Sowmini
Re: [PATCH v2 net-next 0/3] rds: IPv6 support
Some additional comments on this patchset (consolidated here, please tease this apart into patch1/patch2/patch3 as appropriate) I looked at most of rds-core, and the rds-tcp changes. Please make sure Santosh looks at these carefully, especially: - RDS bind key changes - connection.c - all the rds_rdma related changes (e.g., the ib* and rdma* files) - rds_getname(): one of the existing properties of PF_RDS is that a connect() does not involve an implicit bind(). Note that you are basing the changes in rds_bind() on this behavior, thus I guess we make the choice of treating this as a feature, not a bug. Since we are choosing to treat this behavior as a feature we need to be consistent in rds_getname(). If we find (!peer and !ipv6_addr_any(rs_conn_addr)) and the socket is not yet bound, the returned address (address size, address family) should be based on the rs_conn_addr. (Otherwise if I connect to abc::1 without doing a bind(), and try to do a getsockname(), I get back a sockaddr_in?!) - rds_cancel_sent_to() and rds_connect and rds_bind and rds_sendmsg As DaveM has already pointed out, we should be using sa_family to figure out sockaddr_in vs sockaddr_in6 (not the other way around of inspecting len first, and then the family- that way won't work if I pass in sockaddr_storage). E.g., see inet_dgram_connect. if (addr_len < sizeof(uaddr->sa_family)) return -EINVAL; - In net/rds/rds.h; /* The following ports, 16385, 18634, 18635, are registered with IANA as * the ports to be used for RDS over TCP and UDP. 18634 is the historical What is "RDS over TCP and UDP"? There is no such thing as RDS-over-UDP. In fact RDS has nothing to do with UDP. The comment is confused. See next item below, where the comment disappears. - Also in net/rds/rds.h Please don't define transport-specific parameters like RDS_CM_PORT in the common rds.h header. It is unfortunate that we already have RDS_PORT there, and we should try to clean that up as well. 
Note that RDS_TCP_PORT is now in the correct transport-module-specific header (net/rds/tcp.h) and it's unclean to drag it from there into the common header as you are doing. In fact I just tried to move the RDS_PORT definition into net/rds/rdma_transport.h and it built just fine. So to summarize, please do the following: 1. move RDS_PORT into rdma_transport.h 2. add RDS_CM_PORT into rdma_transport.h 3. stop dragging RDS_TCP_PORT from its current happy home in net/rds/tcp.h to net/rds/rds.h - net/rds/connection.c As we have discussed offline before, the fact that we cannot report TCP seq# etc info via the existing rds-info API is not "a bug in the design of MPRDS" but rather a limitation of the API design. Moreover, much of the useful information around the TCP socket is already available via procfs, TCP_INFO etc, so the info by rds-info is rarely used for rds-tcp; the more useful information is around the RDS socket itself. So there is a bug in the comment, would be nice if you removed it. Also, while you are there, s/exisiting/existing, please. General comments: - I remain unconvinced by your global <-> link-local arguments. For UDP sockets we can do this: eth0 host1 -- host2 abc::1/64 fe80::2 add abc::/64 as onlink subnet route host1# traceroute6 -i eth0 -s abc::1 fe80::2 You just broke this for RDS and are using polemic to defend your case, but the main thrust of your diatribe seems to be "why would you need this?" I'll try to address that briefly here. - There may be a lot of valid reasons why host2 does not want to be configured with a global prefix. e.g., I only want host2 to be able to talk to onlink hosts. - RDS mandatorily requires sockets to be bound. So the normal src addr selection (that would have caused host1 to use a link-local to talk to host2) is suppressed in this case. This is exactly the same as a UDP socket bound to abc::1 Note well that one of the use-cases for RDS-TCP is to replace existing infra that uses UDP for cluster-IPC. 
This has come up before on netdev: See https://www.mail-archive.com/search?l=netdev@vger.kernel.org&q=subject:%22Re%5C%3A+%5C%5BPATCH+net%5C-next+0%5C%2F6%5C%5D+kcm%5C%3A+Kernel+Connection+Multiplexor+%5C%28KCM%5C%29%22&o=newest&f=1 so feature parity with UDP is just as important as feature parity for rds_rdma. I hope that helps you see why we need to not break this gratuitously for rds-tcp. BTW, etiquette is to cc folks who have offered review comments on the code. Please make sure to cc me in follow-ups to this thread. Thank you. --Sowmini
[BISECTED] [4.17.0-rc6] IPv6 link-local address not getting added
Hi David, An IPv6 regression has been introduced in 4.17.0-rc6 by commit 8308f3f ("net/ipv6: Add support for specifying metric of connected routes"). The regression is that some interfaces on my test machine come up with link-local addresses, but the fe80::/64 prefix route is missing. As a result of this bug, I cannot send any packets to anyone onlink (including my routers). Here are the symptoms: When everything is fine, "ip -6 route|grep eno" shows 2606:b400:400:18c8::/64 dev eno1 proto ra metric 100 pref medium fe80::5:73ff:fea0:52d dev eno1 proto static metric 100 pref medium fe80::/64 dev eno1 proto kernel metric 256 pref medium fe80::/64 dev eno3 proto kernel metric 256 pref medium fe80::/64 dev eno4 proto kernel metric 256 pref medium default via fe80::5:73ff:fea0:52d dev eno1 proto static metric 100 pref medium But after 8308f3f, I only find # ip -6 route|grep eno 2606:b400:400:18c8::/64 dev eno1 proto ra metric 100 pref medium fe80::5:73ff:fea0:52d dev eno1 proto static metric 100 pref medium fe80::/64 dev eno1 proto kernel metric 256 pref medium default via fe80::5:73ff:fea0:52d dev eno1 proto static metric 100 pref medium (note that eno2 is not enabled in my config, so its absence is expected) Please have a look, thanks. --Sowmini
Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
On (06/27/18 18:07), Ka-Cheong Poon wrote: > > There is a reason for that. It is the way folks expect > how IPv6 addresses are being used. Have you tried "traceroute6 -s abc::2 fe80::2" on linux? > It is not just forwarding. The simple case is that one > picks a global address in a different link and then > use it to send to a link local address in another link. This is actually not any different from IPv4's strong/weak ES model. Global addresses are supposed to be globally routable. For your above example, if you do that, it is assumed that your routing table has been set up suitably. To state what may be well-known: This does not work for link-locals, because, as the name suggests, those are local to the link and you may have the same link-local on multiple links > This does not work. And the RDS connection created will > be stuck forever. that is a different problem in the RDS implementation (that it does not backoff and timeout a failing reconnect) As you can see from the traceroute6 example, global <-> link-local is supported for udp (and probably also tcp sockets, I have not checked that case) > I don't expect RDS apps will want to use link local address > in the first place. In fact, most normal network apps don't. : > Do you know of any IPv4 RDS app which uses IPv4 link local > address? In fact, IPv4 link local address is explicitly > disallowed for active active bonding. Are we talking about "why this is ok for my particular use of link-local, so I can slide my patch forward" or, "why this is correct IPv6 behavior"? > Can you explain why DNS name resolution will return an IPv6 > link local address? I'm surprised if it actually does. It depends on how you set up your DNS. It seems like this is all about "I don't want to deal with this now", so I don't want to continue this discussion, which is really going nowhere. Thanks --Sowmini
Re: [rds-devel] [PATCH net-next] rds: clean up loopback rds_connections on netns deletion
On (06/26/18 10:53), Sowmini Varadhan wrote: > Date: Tue, 26 Jun 2018 10:53:23 -0400 > From: Sowmini Varadhan > To: David Miller > Cc: netdev@vger.kernel.org, rds-de...@oss.oracle.com > Subject: Re: [rds-devel] [PATCH net-next] rds: clean up loopback > > and just to add, the fix itself is logically correct, so belongs in > net-next. What I don't have (and therefore did not target net) is > official confirmation that the syzbot failures are root-caused to the > absence of this patch (since there is no reproducer for many of these, > and no crash dumps available from syzbot). > With help from Dmitry, I just got the confirmation from syzbot that "syzbot has tested the proposed patch and the reproducer did not trigger crash:" Thus, we can mark this Reported-and-tested-by: syzbot+4c20b3866171ce844...@syzkaller.appspotmail.com and yes, it can target net. Thanks --Sowmini
Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion
On (06/26/18 16:48), Dmitry Vyukov wrote: > it probably hit the race by a pure luck of the large program, but then > never had the same luck when tried to remove any syscalls. > So it can make sense to submit several test requests to get more testing. How does one submit test requests by email? The last time I asked this question, the answer was a pointer to https://groups.google.com/forum/#!msg/syzkaller-bugs/7ucgCkAJKSk/skZjgavRAQAJ Thanks --Sowmini
Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion
On (06/26/18 23:29), David Miller wrote: > >> > >> Since this probably fixes syzbot reports, this can be targetted > >> at 'net' instead? > > > > that thought occurred to me but I wanted to be conservative and have > > it in net-next first, have the syzkaller-bugs team confirm the > > fixes and then backport to earlier kernels (if needed).. > > I think there is a way to ask syzbot to test a patch in an > email. And just to add, the fix itself is logically correct, so belongs in net-next. What I don't have (and therefore did not target net) is official confirmation that the syzbot failures are root-caused to the absence of this patch (since there is no reproducer for many of these, and no crash dumps available from syzbot). --Sowmini
Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion
On (06/26/18 23:29), David Miller wrote: > > I think there is a way to ask syzbot to test a patch in an > email. Dmitry/syzkaller-bugs, can you clarify? This is for the cluster of dup reports like https://groups.google.com/forum/#!topic/syzkaller-bugs/zBph8Vu-q2U and (most recently) https://www.spinics.net/lists/linux-rdma/msg66020.html As I understand it, if there is no reproducer, you cannot really have a pass/fail test to confirm the fix. --Sowmini
Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion
On (06/26/18 22:23), David Miller wrote: > > Since this probably fixes syzbot reports, this can be targetted > at 'net' instead? That thought occurred to me but I wanted to be conservative and have it in net-next first, have the syzkaller-bugs team confirm the fixes and then backport to earlier kernels (if needed).. --Sowmini
Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
On (06/26/18 21:02), Ka-Cheong Poon wrote: > > In this case, RFC 6724 prefers link local address as source. The keyword is "prefers". > While using non-link local address (say ULA) is not forbidden, > doing this can easily cause inter-operability issues (does the > app really know that the non-link local source and the link > local destination addresses are really on the same link?). I > think it is prudent to disallow this in RDS unless there is a > very clear and important reason to do so. I remember the issues that triggered 6724. The "interop" issue is that when you send from link-local to global, and need forwarding, it may not work. But I don't think an RDS application today expects to deal with the case that "oh, I got back an error when I tried to send to address X on rds socket rs1, let me go and check what I am bound to, and maybe create another socket, and bind it to link-local" You're not doing this for IPv4 and RDS today (you don't have to do this for UDP, afaik) This is especially true if "X" is a hostname that got resolved using DNS > BTW, if it is really > needed, it can be added in future. shrug. You are introducing a new error return. --Sowmini
Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
On (06/26/18 13:30), Ka-Cheong Poon wrote: > > My answer to this is that if a socket is not bound to a link > local address (meaning it is bound to a non-link local address) > and it is used to send to a link local peer, I think it should > fail. Hmm, I'm not sure I agree. I don't think this is forbidden by RFC 6724 - yes, such a packet cannot be forwarded, but if everything is on the same link, and the dest only has a link-local, you should not need to (create and) bind another socket to a link-local to talk to this destination.. > This is consistent with the scope_id check I mentioned in > the previous mail. If the socket is not bound to a link local > address, the bound_scope_id is 0. So if the socket is used to > send to a link local address (which has a non-zero scope_id), the > check will catch it and fail the call. A new conn should not > be created in this case.
Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
On (06/26/18 01:43), Ka-Cheong Poon wrote: > > Yes, I think if the socket is bound, it should check the scope_id > in msg_name (if not NULL) to make sure that they match. A bound > RDS socket can send to multiple peers. But if the bound local > address is link local, it should only be allowed to send to peers > on the same link. Agree. > If a socket is bound, I guess the scope_id should be used. So > if a socket is not bound to a link local address and the socket > is used to send to a link local peer, it should fail. PF_RDS sockets *MUST* always be bound. See Documentation/networking/rds.txt: " Sockets must be bound before you can send or receive data. This is needed because binding also selects a transport and attaches it to the socket. Once bound, the transport assignment does not change." Also, rds_sendmsg checks this (from net-next, your version has the equivalent ipv6_addr_any etc check): if (daddr == 0 || rs->rs_bound_addr == 0) { release_sock(sk); ret = -ENOTCONN; /* XXX not a great errno */ goto out; } > > >Also, why is there no IPv6 support in rds_connect? > > > Oops, I missed this when I ported the internal version to the > net-next version. Will add it back. Ok --Sowmini
Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support
On (06/25/18 03:38), Ka-Cheong Poon wrote: > @@ -1105,8 +1105,27 @@ int rds_sendmsg(struct socket *sock, struct msghdr > *msg, size_t payload_len) > break; > > case sizeof(*sin6): { > - ret = -EPROTONOSUPPORT; > - goto out; > + int addr_type; : : > + daddr = sin6->sin6_addr; > + dport = sin6->sin6_port; > + scope_id = sin6->sin6_scope_id; > + break; > } In rds_sendmsg, the scopeid passed to rds_conn_create_outgoing may come from the msg_name (if msg_name is a link-local) or may come from the rs_bound_scope_id (for connected socket, change made in Patch 1 of the series). This sounds inconsistent. If I bind to scopeid if1 and then send to fe80::1%if2 (without connect()), we'd create an rds_connection with dev_if set to if2. (First off, it's a bit unexpected to be sending to fe80::1%if2 when you are bound to a link-local on if1!) But then, if we got back a response from fe80::1%if2, I think we would not find a matching conn in rds_recv_incoming? And this is even more confusing because the fastpath in rds_sendmsg does not take the bound_scope_id into consideration at all: 1213 if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr)) 1214 conn = rs->rs_conn; 1215 else { 1216 conn = rds_conn_create_outgoing( /* .. */, scope_id) So if I erroneously passed a msg_name on a connected rds socket, what would happen? (see also question about rds_connect() itself, below) Should we always use rs_bound_scope_id for creating the outgoing rds_connection? (you may need something deterministic for this, like "if bound addr is linklocal, return error if daddr has a different scopeid, else use the bound addr's scopeid", plus, "if bound addr is not link-local, and daddr is link-local, we need a conn with the daddr's scopeid") Also, why is there no IPv6 support in rds_connect? (still looking through the rds-tcp changes, but wanted to get these questions clarified first). --Sowmini
Re: [PATCH net-next] rds: clean up loopback rds_connections on netns deletion
On (06/25/18 06:41), Sowmini Varadhan wrote: : > Add the changes aligned with the changes from > commit ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize > netns/module teardown and rds connection/workq management") for > rds_loop_transport FWIW, I am optimistic that this will take care of a number of the use-after-free panics reported by syzbot (I have not marked the patch with the recommended syzkaller Reported-by tags because I was not able to reproduce each original issue, but inspection of the traces suggests this missing patch may be behind the races that cause the reports). --Sowmini
[PATCH net-next] rds: clean up loopback rds_connections on netns deletion
The RDS core module creates rds_connections based on callbacks from rds_loop_transport when sending/receiving packets to local addresses. These connections will need to be cleaned up when they are created from a netns that is not init_net, and that netns is deleted. Add the changes aligned with the changes from commit ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management") for rds_loop_transport Acked-by: Santosh Shilimkar Signed-off-by: Sowmini Varadhan --- net/rds/connection.c | 11 +- net/rds/loop.c | 56 ++ net/rds/loop.h |2 + 3 files changed, 68 insertions(+), 1 deletions(-) diff --git a/net/rds/connection.c b/net/rds/connection.c index abef75d..cfb0595 100644 --- a/net/rds/connection.c +++ b/net/rds/connection.c @@ -659,11 +659,19 @@ static void rds_conn_info(struct socket *sock, unsigned int len, int rds_conn_init(void) { + int ret; + + ret = rds_loop_net_init(); /* register pernet callback */ + if (ret) + return ret; + rds_conn_slab = kmem_cache_create("rds_connection", sizeof(struct rds_connection), 0, 0, NULL); - if (!rds_conn_slab) + if (!rds_conn_slab) { + rds_loop_net_exit(); return -ENOMEM; + } rds_info_register_func(RDS_INFO_CONNECTIONS, rds_conn_info); rds_info_register_func(RDS_INFO_SEND_MESSAGES, @@ -676,6 +684,7 @@ int rds_conn_init(void) void rds_conn_exit(void) { + rds_loop_net_exit(); /* unregister pernet callback */ rds_loop_exit(); WARN_ON(!hlist_empty(rds_conn_hash)); diff --git a/net/rds/loop.c b/net/rds/loop.c index dac6218..feea1f9 100644 --- a/net/rds/loop.c +++ b/net/rds/loop.c @@ -33,6 +33,8 @@ #include #include #include +#include +#include #include "rds_single_path.h" #include "rds.h" @@ -40,6 +42,17 @@ static DEFINE_SPINLOCK(loop_conns_lock); static LIST_HEAD(loop_conns); +static atomic_t rds_loop_unloading = ATOMIC_INIT(0); + +static void rds_loop_set_unloading(void) +{ + atomic_set(&rds_loop_unloading, 1); +} + +static bool rds_loop_is_unloading(struct 
rds_connection *conn) +{ + return atomic_read(&rds_loop_unloading) != 0; +} /* * This 'loopback' transport is a special case for flows that originate @@ -165,6 +178,8 @@ void rds_loop_exit(void) struct rds_loop_connection *lc, *_lc; LIST_HEAD(tmp_list); + rds_loop_set_unloading(); + synchronize_rcu(); /* avoid calling conn_destroy with irqs off */ spin_lock_irq(&loop_conns_lock); list_splice(&loop_conns, &tmp_list); @@ -177,6 +192,46 @@ void rds_loop_exit(void) } } +static void rds_loop_kill_conns(struct net *net) +{ + struct rds_loop_connection *lc, *_lc; + LIST_HEAD(tmp_list); + + spin_lock_irq(&loop_conns_lock); + list_for_each_entry_safe(lc, _lc, &loop_conns, loop_node) { + struct net *c_net = read_pnet(&lc->conn->c_net); + + if (net != c_net) + continue; + list_move_tail(&lc->loop_node, &tmp_list); + } + spin_unlock_irq(&loop_conns_lock); + + list_for_each_entry_safe(lc, _lc, &tmp_list, loop_node) { + WARN_ON(lc->conn->c_passive); + rds_conn_destroy(lc->conn); + } +} + +static void __net_exit rds_loop_exit_net(struct net *net) +{ + rds_loop_kill_conns(net); +} + +static struct pernet_operations rds_loop_net_ops = { + .exit = rds_loop_exit_net, +}; + +int rds_loop_net_init(void) +{ + return register_pernet_device(&rds_loop_net_ops); +} + +void rds_loop_net_exit(void) +{ + unregister_pernet_device(&rds_loop_net_ops); +} + /* * This is missing .xmit_* because loop doesn't go through generic * rds_send_xmit() and doesn't call rds_recv_incoming(). .listen_stop and @@ -194,4 +249,5 @@ struct rds_transport rds_loop_transport = { .inc_free = rds_loop_inc_free, .t_name = "loopback", .t_type = RDS_TRANS_LOOP, + .t_unloading= rds_loop_is_unloading, }; diff --git a/net/rds/loop.h b/net/rds/loop.h index 469fa4b..bbc8cdd 100644 --- a/net/rds/loop.h +++ b/net/rds/loop.h @@ -5,6 +5,8 @@ /* loop.c */ extern struct rds_transport rds_loop_transport; +int rds_loop_net_init(void); +void rds_loop_net_exit(void); void rds_loop_exit(void); #endif -- 1.7.1
Re: KASAN: out-of-bounds Read in rds_cong_queue_updates (2)
On (06/13/18 09:52), Dmitry Vyukov wrote: > I think this is: > > #syz dup: KASAN: use-after-free Read in rds_cong_queue_updates Indeed. We'd had a discussion about getting a dump of threads using sysrq (or similar), given the challenges around actually getting a crash dump; is that now possible? That will certainly help. Another missing bit is that we still need the synchronize_net() in rds_release(). I realize synchronize_net() is sub-optimal for perf, but leaving this existing hole where races can occur in unexpected manifestations is not ideal either. (See https://www.spinics.net/lists/netdev/msg475074.html for earlier discussion thread) --Sowmini
Re: [rds-devel] KASAN: null-ptr-deref Read in rds_ib_get_mr
On (05/11/18 15:48), Yanjun Zhu wrote: > diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c > index e678699..2228b50 100644 > --- a/net/rds/ib_rdma.c > +++ b/net/rds/ib_rdma.c > @@ -539,11 +539,17 @@ void rds_ib_flush_mrs(void) >void *rds_ib_get_mr(struct scatterlist *sg, unsigned long nents, > struct rds_sock *rs, u32 *key_ret) >{ > - struct rds_ib_device *rds_ibdev; > + struct rds_ib_device *rds_ibdev = NULL; > struct rds_ib_mr *ibmr = NULL; > - struct rds_ib_connection *ic = rs->rs_conn->c_transport_data; > + struct rds_ib_connection *ic = NULL; > int ret; > > + if (rs->rs_bound_addr == 0) { > + ret = -EPERM; > + goto out; > + } > + > + ic = rs->rs_conn->c_transport_data; > rds_ibdev = rds_ib_get_device(rs->rs_bound_addr); > if (!rds_ibdev) { > ret = -ENODEV; > > I made this raw patch. If you can reproduce this bug, please make tests > with it. I don't think this solves the problem; I think it just changes the timing under which it can still happen. What if the rds_remove_bound() in rds_bind() happens after the check for "if (rs->rs_bound_addr == 0)" added above by the patch? I believe you need some type of synchronization (either through mutex, or some atomic flag in the rs or similar) to make sure rds_bind() and rds_ib_get_mr() are mutually exclusive. --Sowmini
Re: [PATCH RFC net-next 00/11] udp gso
On (04/18/18 06:35), Eric Dumazet wrote: > > There is no change at all. > > This will only be used as a mechanism to send X packets of same size. > > So instead of X system calls , one system call. > > One traversal of some expensive part of the host stack. > > The content on the wire should be the same. I'm sorry, that's not how I interpret Willem's email below (and maybe I misunderstood); the following is taken from https://www.spinics.net/lists/netdev/msg496150.html Sowmini> If yes, how will the recvmsg differentiate between the case Sowmini> (2000 byte message followed by 512 byte message) and Sowmini> (1472 byte message, 528 byte message, then 512 byte message), Sowmini> in other words, how are UDP message boundary semantics preserved? Willem> They aren't. This is purely an optimization to amortize the cost of Willem> repeated tx stack traversal. Unlike UFO, which would preserve the Willem> boundaries of the original larger than MTU datagram. As I understand Willem's explanation, if I do a sendmsg of 2000 bytes, - classic UDP will send 2 IP fragments, the first one with a full UDP header, and the IP header indicating that this is the first frag for that ipid, with more frags to follow. The second frag will have the rest with the same ipid, it will not have a udp header, and it will indicate that it is the last frag (no more frags). The receiver can thus use the ipid, "more-frags" bit, frag offset etc to stitch the 2000 byte udp message together and pass it up on the udp socket. - in the "GSO" proposal my 2000 bytes of data are sent as *two* udp packets, each of them with a unique udp header, carrying 1472 bytes of payload (for the first) and 528 (for the second). The receiver has no clue that they are both part of the same UDP datagram. So the wire format is not the same, am I mistaken? --Sowmini
Re: [PATCH RFC net-next 00/11] udp gso
I went through the patch set and the code looks fine - it extends existing infra for TCP/GSO to UDP. One thing that was not clear to me about the API: shouldn't UDP_SEGMENT just be automatically determined in the stack from the pmtu? What's the motivation for the socket option for this? Also, AIUI, this can be either a per-socket or a per-packet option? However, I share Sridhar's concerns about the very fundamental change to UDP message boundary semantics here. There is actually no such thing as a "segment" in udp, so in general this feature makes me a little uneasy. Well-behaved UDP applications should already be sending MTU-sized datagrams. And the not-so-well-behaved ones are probably relying on IP fragmentation/reassembly to take care of datagram boundary semantics for them? As Sridhar points out, the feature is not really "negotiated" - one side unilaterally sets the option. If the receiver is a classic/POSIX UDP implementation, it will have no way of knowing that message boundaries have been re-adjusted at the sender. One thought to recover from this: use the infra being proposed in https://tools.ietf.org/html/draft-touch-tsvwg-udp-options-09 to include a new UDP TLV option that tracks datagram# (similar to IP ID) to help the receiver reassemble the UDP datagram and pass it up with the POSIX-conformant UDP message boundary. I realize that this is also not a perfect solution: as you point out, there are risks from packet re-ordering/drops - you may well end up just reinventing IP frag/re-assembly when you are done (with just the slight improvement that each "fragment" has a full UDP header, so it has a better shot at ECMP and RSS). --Sowmini
Re: [PATCH RFC net-next 00/11] udp gso
On (04/17/18 16:23), Willem de Bruijn wrote: > > Assuming IPv4 with an MTU of 1500 and the maximum segment > size of 1472, the receiver will see three datagrams with MSS of > 1472B, 528B and 512B. So the recvmsg will also pass up 1472, 528, 512, right? If yes, how will the recvmsg differentiate between the case (2000 byte message followed by 512 byte message) and (1472 byte message, 528 byte message, then 512 byte message), in other words, how are UDP message boundary semantics preserved? --Sowmini
Re: [PATCH RFC net-next 00/11] udp gso
On (04/17/18 16:00), Willem de Bruijn wrote: > > This patchset implements GSO for UDP. A process can concatenate and > submit multiple datagrams to the same destination in one send call > by setting socket option SOL_UDP/UDP_SEGMENT with the segment size, > or passing an analogous cmsg at send time. > > The stack will send the entire large (up to network layer max size) > datagram through the protocol layer. At the GSO layer, it is broken > up in individual segments. All receive the same network layer header > and UDP src and dst port. All but the last segment have the same UDP > header, but the last may differ in length and checksum. I'll go through the patch-set later today/tomorrow (interesting!), but a question: what are message boundary semantics in this model? E.g., if I do a udp_sendmsg of 2000 bytes, and then a udp_sendmsg of 512 bytes, will the receiver first recvmsg 2000 bytes and then the 512 bytes? My understanding of the comment above is that no, it will not, because (assuming an mtu of 1500) there will be 2 UDP messages on the wire, the first one with 1500 bytes, and the second with 1012 bytes (with some discounts for the various L2/L3 etc headers) --Sowmini
Re: KASAN: use-after-free Read in inet_create
#syz dup: KASAN: use-after-free Read in rds_cong_queue_updates There are a number of manifestations of this bug; basically all of them suggest that the connect/reconnect etc workqs are somehow being scheduled after the netns is deleted, despite the code refactoring in commit 3db6e0d172c (and it looks like the WARN_ONs in that commit are not even being triggered). We've not been able to reproduce these issues, and without a crash dump (or some hint of other threads that were running at the time of the problem) we are working on figuring out the root cause by code inspection. --Sowmini
Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
On (03/20/18 12:37), Håkon Bugge wrote: > > A little nit below. And some spelling issues in existing commentary > you can consider fixing, since you reshuffle this file considerably. > > + if (net != &init_net && rtn->ctl_table) > > + kfree(rtn->ctl_table); > > Well, this comes from the existing code, but as pointed out by Linus > recently, kfree() handles NULL pointers. So, if rtn->ctl_table is > NULL most likely, the code is OK _if_ you add an unlikely() around the > if-clause. Otherwise, the '&& rtn->ctl_table' can simply be removed. As you observe correctly, this comes from the existing code, and is unrelated to the contents of the commit comment. So if we feel passionately about this, let us please fix this separately, with its own ceremony. > s/when/When/ > s/parameters,this/parameters, this/ > > Well, not part of your commit. As above. > > > > * function resets the RDS connections in that netns so that we can > > Two double spaces incidents above > > Not part of your commit As above. Thanks much. --Sowmini
[PATCH net-next] rds: tcp: remove register_netdevice_notifier infrastructure.
The netns deletion path does not need to wait for all net_devices to be unregistered before dismantling rds_tcp state for the netns (we are able to dismantle this state on module unload even when all net_devices are active so there is no dependency here). This patch removes code related to netdevice notifiers and refactors all the code needed to dismantle rds_tcp state into a ->exit callback for the pernet_operations used with register_pernet_device(). Signed-off-by: Sowmini Varadhan --- net/rds/tcp.c | 93 ++--- 1 files changed, 23 insertions(+), 70 deletions(-) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 08ea9cd..4f3a32c 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net) return err; } -static void __net_exit rds_tcp_exit_net(struct net *net) -{ - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); - - if (rtn->rds_tcp_sysctl) - unregister_net_sysctl_table(rtn->rds_tcp_sysctl); - - if (net != &init_net && rtn->ctl_table) - kfree(rtn->ctl_table); - - /* If rds_tcp_exit_net() is called as a result of netns deletion, -* the rds_tcp_kill_sock() device notifier would already have cleaned -* up the listen socket, thus there is no work to do in this function. -* -* If rds_tcp_exit_net() is called as a result of module unload, -* i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then -* we do need to clean up the listen socket here. 
-*/ - if (rtn->rds_tcp_listen_sock) { - struct socket *lsock = rtn->rds_tcp_listen_sock; - - rtn->rds_tcp_listen_sock = NULL; - rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w); - } -} - -static struct pernet_operations rds_tcp_net_ops = { - .init = rds_tcp_init_net, - .exit = rds_tcp_exit_net, - .id = &rds_tcp_netid, - .size = sizeof(struct rds_tcp_net), - .async = true, -}; - static void rds_tcp_kill_sock(struct net *net) { struct rds_tcp_connection *tc, *_tc; @@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net) rds_conn_destroy(tc->t_cpath->cp_conn); } -void *rds_tcp_listen_sock_def_readable(struct net *net) +static void __net_exit rds_tcp_exit_net(struct net *net) { struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); - struct socket *lsock = rtn->rds_tcp_listen_sock; - if (!lsock) - return NULL; + rds_tcp_kill_sock(net); - return lsock->sk->sk_user_data; + if (rtn->rds_tcp_sysctl) + unregister_net_sysctl_table(rtn->rds_tcp_sysctl); + + if (net != &init_net && rtn->ctl_table) + kfree(rtn->ctl_table); } -static int rds_tcp_dev_event(struct notifier_block *this, -unsigned long event, void *ptr) +static struct pernet_operations rds_tcp_net_ops = { + .init = rds_tcp_init_net, + .exit = rds_tcp_exit_net, + .id = &rds_tcp_netid, + .size = sizeof(struct rds_tcp_net), + .async = true, +}; + +void *rds_tcp_listen_sock_def_readable(struct net *net) { - struct net_device *dev = netdev_notifier_info_to_dev(ptr); + struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); + struct socket *lsock = rtn->rds_tcp_listen_sock; - /* rds-tcp registers as a pernet subys, so the ->exit will only -* get invoked after network acitivity has quiesced. We need to -* clean up all sockets to quiesce network activity, and use -* the unregistration of the per-net loopback device as a trigger -* to start that cleanup. 
-*/ - if (event == NETDEV_UNREGISTER_FINAL && - dev->ifindex == LOOPBACK_IFINDEX) - rds_tcp_kill_sock(dev_net(dev)); + if (!lsock) + return NULL; - return NOTIFY_DONE; + return lsock->sk->sk_user_data; } -static struct notifier_block rds_tcp_dev_notifier = { - .notifier_call= rds_tcp_dev_event, - .priority = -10, /* must be called after other network notifiers */ -}; - /* when sysctl is used to modify some kernel socket parameters,this * function resets the RDS connections in that netns so that we can * restart with new parameters. The assumption is that such reset @@ -625,9 +589,7 @@ static void rds_tcp_exit(void) rds_tcp_set_unloading(); synchronize_rcu(); rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info); - unregister_pernet_subsys(&rds_tcp_net_ops); - if (unregister_netdevice_notifier(&rds_tcp_dev_notifier)) - pr_warn("could not unregister rds_tcp_dev_notifier\n")
Re: KASAN: slab-out-of-bounds Read in rds_cong_queue_updates
On (03/19/18 09:29), Dmitry Vyukov wrote: > > This looks the same as: > > #syz dup: KASAN: use-after-free Read in rds_cong_queue_updates Correct; it seems like the rds_destroy_pending() fixes did not seal this race condition. I need to look at this more carefully to see what race I missed... no easy answer here, I am afraid. --Sowmini
Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
On (03/18/18 00:55), Kirill Tkhai wrote: > > I just want to make rds not using NETDEV_UNREGISTER_FINAL. If there is > another solution to do that, I'm not again that. The patch below takes care of this. I've done some preliminary testing, and I'll send it upstream after doing additional self-review/testing. Please also take a look, if you can, to see if I missed something. Thanks for the input, --Sowmini ---patch follows diff --git a/net/rds/tcp.c b/net/rds/tcp.c index 08ea9cd..87c2643 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -485,40 +485,6 @@ static __net_init int rds_tcp_init_net(struct net *net) return err; } -static void __net_exit rds_tcp_exit_net(struct net *net) -{ - struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); - - if (rtn->rds_tcp_sysctl) - unregister_net_sysctl_table(rtn->rds_tcp_sysctl); - - if (net != &init_net && rtn->ctl_table) - kfree(rtn->ctl_table); - - /* If rds_tcp_exit_net() is called as a result of netns deletion, -* the rds_tcp_kill_sock() device notifier would already have cleaned -* up the listen socket, thus there is no work to do in this function. -* -* If rds_tcp_exit_net() is called as a result of module unload, -* i.e., due to rds_tcp_exit() -> unregister_pernet_subsys(), then -* we do need to clean up the listen socket here. 
-*/ - if (rtn->rds_tcp_listen_sock) { - struct socket *lsock = rtn->rds_tcp_listen_sock; - - rtn->rds_tcp_listen_sock = NULL; - rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w); - } -} - -static struct pernet_operations rds_tcp_net_ops = { - .init = rds_tcp_init_net, - .exit = rds_tcp_exit_net, - .id = &rds_tcp_netid, - .size = sizeof(struct rds_tcp_net), - .async = true, -}; - static void rds_tcp_kill_sock(struct net *net) { struct rds_tcp_connection *tc, *_tc; @@ -546,40 +512,38 @@ static void rds_tcp_kill_sock(struct net *net) rds_conn_destroy(tc->t_cpath->cp_conn); } -void *rds_tcp_listen_sock_def_readable(struct net *net) +static void __net_exit rds_tcp_exit_net(struct net *net) { struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); - struct socket *lsock = rtn->rds_tcp_listen_sock; - if (!lsock) - return NULL; + rds_tcp_kill_sock(net); - return lsock->sk->sk_user_data; + if (rtn->rds_tcp_sysctl) + unregister_net_sysctl_table(rtn->rds_tcp_sysctl); + + if (net != &init_net && rtn->ctl_table) + kfree(rtn->ctl_table); } -static int rds_tcp_dev_event(struct notifier_block *this, -unsigned long event, void *ptr) +static struct pernet_operations rds_tcp_net_ops = { + .init = rds_tcp_init_net, + .exit = rds_tcp_exit_net, + .id = &rds_tcp_netid, + .size = sizeof(struct rds_tcp_net), + .async = true, +}; + +void *rds_tcp_listen_sock_def_readable(struct net *net) { - struct net_device *dev = netdev_notifier_info_to_dev(ptr); + struct rds_tcp_net *rtn = net_generic(net, rds_tcp_netid); + struct socket *lsock = rtn->rds_tcp_listen_sock; - /* rds-tcp registers as a pernet subys, so the ->exit will only -* get invoked after network acitivity has quiesced. We need to -* clean up all sockets to quiesce network activity, and use -* the unregistration of the per-net loopback device as a trigger -* to start that cleanup. 
-*/ - if (event == NETDEV_UNREGISTER_FINAL && - dev->ifindex == LOOPBACK_IFINDEX) - rds_tcp_kill_sock(dev_net(dev)); + if (!lsock) + return NULL; - return NOTIFY_DONE; + return lsock->sk->sk_user_data; } -static struct notifier_block rds_tcp_dev_notifier = { - .notifier_call= rds_tcp_dev_event, - .priority = -10, /* must be called after other network notifiers */ -}; - /* when sysctl is used to modify some kernel socket parameters,this * function resets the RDS connections in that netns so that we can * restart with new parameters. The assumption is that such reset @@ -625,9 +589,7 @@ static void rds_tcp_exit(void) rds_tcp_set_unloading(); synchronize_rcu(); rds_info_deregister_func(RDS_INFO_TCP_SOCKETS, rds_tcp_tc_info); - unregister_pernet_subsys(&rds_tcp_net_ops); - if (unregister_netdevice_notifier(&rds_tcp_dev_notifier)) - pr_warn("could not unregister rds_tcp_dev_notifier\n"); + unregister_pernet_device(&rds_tcp_net_ops); rds_tcp_destroy_conns(); rds_trans_unregister(&rds_tcp_transport); rds_tcp_recv_exit(); @@ -651,24 +613,17 @@ static int rds_tcp_init(void) if (ret) goto out_slab; - ret = register_pernet_s
Re: [rds-devel] [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
On (03/17/18 10:15), Sowmini Varadhan wrote: > To solve the scaling problem why not just have a well-defined > callback to modules when devices are quiesced, instead of > overloading the pernet_device registration in this obscure way? I thought about this a bit, and maybe I missed your original point- today we are able to do all the needed cleanup for rds-tcp when we unload the module, even though network activity has not quiesced, and there is no reason we cannot use the same code for netns cleanup as well. I think this is what you were trying to ask when you said "why do you need to know that loopback is down?" I'm sorry I missed that; I will re-examine the code and get back to you- it should be possible to just do one registration, clean up rds state, and avoid the hack of registering twice (saw your most recent long mail- sorry- both v1 and v2 are hacks). I'm on the road at the moment, so I'll get back to you on this. Thanks, --Sowmini
Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
I spent a long time staring at both v1 and v2 of your patch. I understand the overall goal, but I am afraid to say that these patches are complete hacks. I was trying to understand why patch v1 blows up with a null rtn in rds_tcp_init_net, but v2 does not, and the analysis is ugly. I'm going to put down the analysis here, and others can decide if this sort of hack is a healthy solution for a scaling issue (IMHO it is not- we should get the real fix for the scaling instead of using duct-tape-and-chewing-gum solutions). What is happening in v1 is this: 1. When I do "modprobe rds_tcp" in init_net, we end up doing the following in rds_tcp_init: register_pernet_device(&rds_tcp_dev_ops); register_pernet_subsys(&rds_tcp_net_ops); where rds_tcp_dev_ops has .id = &rds_tcp_netid, .size = sizeof(struct rds_tcp_net), and rds_tcp_net_ops has 0 values for both of these. 2. So now the pernet_list has &rds_tcp_net_ops as its first member. 3. Now I do "ip netns create blue". As part of setup_net(), we walk the pernet_list and call the ->init of each member (through ops_init()). So we'd hit rds_tcp_net_ops first. Since its id/size are 0, we'd skip the struct rds_tcp_net allocation, so rds_tcp_init_net would find a null return from net_generic() and bomb. The way I view it (and ymmv), the hack here is to call both register_pernet_device and register_pernet_subsys: the kernel only guarantees that calling *one* of register_pernet_* will ensure that you can safely call net_generic() afterwards. The v2 patch "works around" this by reordering the registration. So this time, init_net will set up rds_tcp_net_ops as the second member, and the first member will be the pernet_operations struct that has non-zero id and size. But then the unregistration (necessarily) works in the opposite order: you have to unregister_pernet_device first (so that interfaces are quiesced) and then unregister_pernet_subsys() so that sockets can be safely quiesced.
I don't think this type of hack makes the code cleaner; it just makes things much harder to understand, and leaves them completely brittle for subsequent changes. To solve the scaling problem, why not just have a well-defined callback to modules when devices are quiesced, instead of overloading the pernet_device registration in this obscure way?
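The failure sequence analyzed above can be condensed into a toy model (an illustrative sketch only — locking, error unwinding, and the real kernel data structures are omitted, and the allocation rule is reduced to one line: per-net storage exists only if an ops entry carries a non-zero id and size):

```python
# Toy model of setup_net() walking the pernet_list. Names mirror the
# kernel's, but none of this is kernel code.

class PernetOps:
    def __init__(self, name, has_id_size, init):
        self.name = name
        self.has_id_size = has_id_size  # models non-zero ops->id / ops->size
        self.init = init

class Net:
    def __init__(self):
        self.gen = {}  # models the net_generic() per-net storage

def ops_init(ops, net):
    # setup_net() only allocates per-net storage for ops that carry id/size
    if ops.has_id_size:
        net.gen["rds_tcp_netid"] = object()  # models kzalloc of rds_tcp_net
    if ops.init:
        ops.init(net)

def rds_tcp_init_net(net):
    rtn = net.gen.get("rds_tcp_netid")  # models net_generic(net, rds_tcp_netid)
    if rtn is None:
        raise RuntimeError("null rtn: would panic here")

# v1 ordering: the entry with no id/size sits first in the pernet_list,
# so its ->init runs before any per-net storage has been allocated.
pernet_list = [
    PernetOps("rds_tcp_net_ops", has_id_size=False, init=rds_tcp_init_net),
    PernetOps("rds_tcp_dev_ops", has_id_size=True, init=None),
]

net = Net()
try:
    for ops in pernet_list:
        ops_init(ops, net)
except RuntimeError as e:
    print(e)  # null rtn: would panic here
```

Reordering the list so the allocating entry runs first (which is what v2's reshuffled registration achieves) makes the lookup succeed — which is why v2 survives while v1 panics.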
Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
On (03/16/18 21:48), Kirill Tkhai wrote: > > > Thus I have to spend some time reviewing your patch, > > and I cannot give you an answer in the next 5 minutes. > > No problem, a 5 minute response was never required. Thanks for > your review. Thank you. I would like to take some time this weekend to understand why v1 of your patch hit the null rtn but v2 did not (a superficial glance at the patch suggested that you were registering twice in both cases, with just a reordering, so I would like to understand the root-cause of the null ptr deref with v1). As for registering 2 times, that needs some comments to provide guidance for other subsystems. E.g., I found the large block comment in net_namespace.h very helpful, so let's please clearly document what and why and when this should be used. --Sowmini
Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
On (03/16/18 21:14), Kirill Tkhai wrote: > > I did the second version and sent it to you. Have you tried it? I tried it briefly, and it works for the handful of testcases that I tried, but I still think it's very weird to register as both a device and a subsys, especially in the light of the warning in net_namespace.h. Thus I have to spend some time reviewing your patch, and I cannot give you an answer in the next 5 minutes. > Calling the netdevice handler for every event is more disturbing, > as the number of events is several times bigger than one more > pernet exit method. So you are saying there are scaling constraints on subsystems that register for netdevice handlers. The disturbing part of that is that it does not scale. Thanks. --Sowmini
Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
I had taken some of this offline, but it occurs to me that some of these notes should be saved to the netdev archives, in case this question pops up again in a few years. When I run your patch, I get a repeatable panic by doing: modprobe rds-tcp; ip netns create blue. The panic is because we are finding a null rtn in rds_tcp_init_net. I think there's something very disturbing about calling register_pernet_operations() twice, once via register_pernet_device() and again via register_pernet_subsys(). I suspect this has everything to do with the panic, but I have not had time to debug every little detail here. In general, rds_tcp is not a network device, it is a kernel module. That is the fundamental problem here. To repeat the comments from net_namespace.h: * Network interfaces need to be removed from a dying netns _before_ * subsys notifiers can be called, as most of the network code cleanup * (which is done from subsys notifiers) runs with the assumption that * dev_remove_pack has been called so no new packets will arrive during * and after the cleanup functions have been called. dev_remove_pack * is not per namespace so instead the guarantee of no more packets * arriving in a network namespace is provided by ensuring that all * network devices and all sockets have left the network namespace * before the cleanup methods are called. When the "blue" netns starts up, it creates at least one kernel listen socket on *.16385. This socket, and any other child/client sockets created, must be cleaned up before the cleanup_net can happen. This is why I chose to call register_pernet_subsys. Again, as per the comments in net_namespace.h: * Use these carefully. If you implement a network device and it * needs per network namespace operations use device pernet operations, * otherwise use pernet subsys operations. On (03/16/18 18:51), Kirill Tkhai wrote: > > Let's find another approach. Could you tell what problem we have in > > case rds_tcp_dev_ops is declared as a pernet_device?
As above, rds-tcp is not a network device. > One more question. At what time do we take a reference on the loopback device? > Is it possible before we have created a net completely? We don't take a reference on the loopback device. We make sure none of the kernel sockets does a get_net() so that we don't block the cleanup_net, and then, when all the network interfaces have been taken down (loopback is the last one) we know there are no more packets coming in and out, so it is safe to dismantle all kernel sockets created by rds-tcp. Hope that helps. --Sowmini
Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
Found my previous question: https://www.mail-archive.com/netdev@vger.kernel.org/msg72330.html (see the section about "Comments are specifically invited...") > This is not a problem, and rds-tcp is not the only pernet_subsys registering > a socket. It's OK to close it from the .exit method. There are many examples, > let me point you to icmp_sk_ops as one of them. But it's not the only one. I'm not averse to changing this to NETDEV_UNREGISTER as long as it works for the 2 test cases below- you can test it by using rds-ping from the rds-tools rpm, to be used from/to init_net, and from/to the netns against some external machine (i.e., something not on the same physical host). > > For rds-tcp, we need to be able to do the right thing in both of these > > cases > > 1. modprobe -r rds-tcp (cleanup of rds-tcp state should happen in > >every namespace, including init_net) > > 2. netns delete (rds_tcp.ko should remain loaded for other namespaces) > > The same as above, every pernet_subsys does this. It's not a problem. > exit and exit_batch methods are called in both of the cases. > > Please, see __unregister_pernet_operations()->ops_exit_list for the details. I am familiar with ops_exit_list, but this is the sequence: - when the module is loaded (or the netns is started) it starts a kernel listen socket on *.16385 - when you start the rds-pings above, it will create kernel tcp connections from/to port 16385 in the netns, and it will start socket keepalives for those connections. Each tcp connection is associated with an rds_connection. As I recall, when I wrote the initial patchset, my problem was that in order to let the module unload make progress, all these sockets had to be cleaned up. But to clean up these sockets, net_device cleanup had to complete (we should not have any new incoming connections to the listen endpoint on a non-loopback socket), so I ended up with a circular dependency.
> If we replace NETDEV_UNREGISTER_FINAL with NETDEV_UNREGISTER, the only change > which happens is we call rds_tcp_kill_sock() earlier. So, it may cause > problems only if someone changes the list during the time between > NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL being called for loopback. > But since at this time no one related to this net can extend the list, > there is no problem to do that. Please share your patch, I can review it and maybe help to test it. As I was trying to say in my RFC, I am quite open to ways to make this cleanup more obvious. --Sowmini
Re: [PATCH RFC RFC] rds: Use NETDEV_UNREGISTER in rds_tcp_dev_event() (then kill NETDEV_UNREGISTER_FINAL)
On (03/16/18 15:38), Kirill Tkhai wrote: > > 467fa15356acf by Sowmini Varadhan added NETDEV_UNREGISTER_FINAL dependence > with the commentary: > > /* rds-tcp registers as a pernet subys, so the ->exit will only >* get invoked after network acitivity has quiesced. We need to >* clean up all sockets to quiesce network activity, and use >* the unregistration of the per-net loopback device as a trigger >* to start that cleanup. >*/ > > It seems all the protocols' pernet subsystems meet this situation, but they > solve it in a generic way. What is so specific about rds? The difference with rds is this: most consumers of netns associate a net_device with the netns, and cleanup of the netns state happens as part of the net_device teardown without the constraint above. rds-tcp instead has its netns state tied to a listening socket, not to a specific network interface (net_device), so it registers as a pernet subsys. But this means that cleanup has to be done carefully (see the comments in net_namespace.h before register_pernet_subsys). For rds-tcp, we need to be able to do the right thing in both of these cases: 1. modprobe -r rds-tcp (cleanup of rds-tcp state should happen in every namespace, including init_net) 2. netns delete (rds_tcp.ko should remain loaded for other namespaces) > This commit makes the event handler iterate rds_tcp_conn_list and > kill them. If we change the stage to NETDEV_UNREGISTER, what will change? The above two cases need to work correctly. > Can an unregistered loopback device on a dead net add new items to > rds_tcp_conn_list? > How is that possible? I don't understand the question- no, an unregistered loopback device cannot add items. fwiw, I had asked questions about this (netns per net_device vs netns for module) on the netdev list a few years ago; I can try to hunt down that thread for you later (nobody replied to it, but maybe it will help answer your questions). --Sowmini
[PATCH net-next] rds: tcp: must use spin_lock_irq* and not spin_lock_bh with rds_tcp_conn_lock
rds_tcp_connection allocation/free management has the potential to be called from __rds_conn_create after IRQs have been disabled, so spin_[un]lock_bh cannot be used with rds_tcp_conn_lock. Bottom-halves that need to synchronize for critical sections protected by rds_tcp_conn_lock should instead use rds_destroy_pending() correctly. Reported-by: syzbot+c68e51bb5e699d3f8...@syzkaller.appspotmail.com Fixes: ebeeb1ad9b8a ("rds: tcp: use rds_destroy_pending() to synchronize netns/module teardown and rds connection/workq management") Signed-off-by: Sowmini Varadhan --- net/rds/tcp.c | 17 + 1 files changed, 9 insertions(+), 8 deletions(-) diff --git a/net/rds/tcp.c b/net/rds/tcp.c index eb04e7f..08ea9cd 100644 --- a/net/rds/tcp.c +++ b/net/rds/tcp.c @@ -272,13 +272,14 @@ static int rds_tcp_laddr_check(struct net *net, __be32 addr) static void rds_tcp_conn_free(void *arg) { struct rds_tcp_connection *tc = arg; + unsigned long flags; rdsdebug("freeing tc %p\n", tc); - spin_lock_bh(&rds_tcp_conn_lock); + spin_lock_irqsave(&rds_tcp_conn_lock, flags); if (!tc->t_tcp_node_detached) list_del(&tc->t_tcp_node); - spin_unlock_bh(&rds_tcp_conn_lock); + spin_unlock_irqrestore(&rds_tcp_conn_lock, flags); kmem_cache_free(rds_tcp_conn_slab, tc); } @@ -308,13 +309,13 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp) rdsdebug("rds_conn_path [%d] tc %p\n", i, conn->c_path[i].cp_transport_data); } - spin_lock_bh(&rds_tcp_conn_lock); + spin_lock_irq(&rds_tcp_conn_lock); for (i = 0; i < RDS_MPATH_WORKERS; i++) { tc = conn->c_path[i].cp_transport_data; tc->t_tcp_node_detached = false; list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list); } - spin_unlock_bh(&rds_tcp_conn_lock); + spin_unlock_irq(&rds_tcp_conn_lock); fail: if (ret) { for (j = 0; j < i; j++) @@ -527,7 +528,7 @@ static void rds_tcp_kill_sock(struct net *net) rtn->rds_tcp_listen_sock = NULL; rds_tcp_listen_stop(lsock, &rtn->rds_tcp_accept_w); - spin_lock_bh(&rds_tcp_conn_lock); + 
spin_lock_irq(&rds_tcp_conn_lock); list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) { struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net); @@ -540,7 +541,7 @@ static void rds_tcp_kill_sock(struct net *net) tc->t_tcp_node_detached = true; } } - spin_unlock_bh(&rds_tcp_conn_lock); + spin_unlock_irq(&rds_tcp_conn_lock); list_for_each_entry_safe(tc, _tc, &tmp_list, t_tcp_node) rds_conn_destroy(tc->t_cpath->cp_conn); } @@ -588,7 +589,7 @@ static void rds_tcp_sysctl_reset(struct net *net) { struct rds_tcp_connection *tc, *_tc; - spin_lock_bh(&rds_tcp_conn_lock); + spin_lock_irq(&rds_tcp_conn_lock); list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node) { struct net *c_net = read_pnet(&tc->t_cpath->cp_conn->c_net); @@ -598,7 +599,7 @@ static void rds_tcp_sysctl_reset(struct net *net) /* reconnect with new parameters */ rds_conn_path_drop(tc->t_cpath, false); } - spin_unlock_bh(&rds_tcp_conn_lock); + spin_unlock_irq(&rds_tcp_conn_lock); } static int rds_tcp_skbuf_handler(struct ctl_table *ctl, int write, -- 1.7.1
Re: WARNING in __local_bh_enable_ip (2)
On (03/14/18 14:28), Eric Dumazet wrote: > > > spin_lock_bh(&rds_tcp_conn_lock);/spin_unlock_bh(&rds_tcp_conn_lock); in > rds_tcp_conn_free() > > is in conflict with the spin_lock_irqsave(&rds_conn_lock, flags); > in __rds_conn_create() Yes, I was going to look into this and fix it later. > Hard to understand why RDS is messing with hard irqs really. Some of it comes from the rds_rdma history: some parts of the common rds and rds_rdma modules get called in various driver contexts for InfiniBand. --Sowmini
Re: [PATCH][rds-next] rds: make functions rds_info_from_znotifier and rds_message_zcopy_from_user static
On (03/11/18 18:03), Colin King wrote: > From: Colin Ian King > > Functions rds_info_from_znotifier and rds_message_zcopy_from_user are > local to the source and do not need to be in global scope, so make them > static. The rds_message_zcopy_from_user warning was already flagged by the kbuild robot last week. See https://www.spinics.net/lists/netdev/msg488041.html
Re: [rds-devel] [PATCH][rds-next] rds: remove redundant variable 'sg_off'
On (03/11/18 17:27), Colin King wrote: > Variable sg_off is assigned a value but it is never read, hence it is > redundant and can be removed. > Acked-by: Sowmini Varadhan
Re: [RFC PATCH linux-next] rds: rds_message_zcopy_from_user() can be static
On (03/08/18 18:56), kbuild test robot wrote: > > Fixes: d40a126b16ea ("rds: refactor zcopy code into > rds_message_zcopy_from_user") > Signed-off-by: Fengguang Wu Acked-by: Sowmini Varadhan (do I need to separately submit a non-RFC patch for this?)
Re: [PATCH net-next] sock: Fix SO_ZEROCOPY switch case
On (03/07/18 09:40), Jesus Sanchez-Palencia wrote: > Fix the SO_ZEROCOPY switch case on sock_setsockopt() avoiding the > ret values to be overwritten by the one set on the default case. Acked-by: Sowmini Varadhan
[PATCH net-next 1/2] rds: refactor zcopy code into rds_message_zcopy_from_user
Move the large block of code predicated on zcopy from rds_message_copy_from_user into a new function, rds_message_zcopy_from_user() Signed-off-by: Sowmini Varadhan --- net/rds/message.c | 108 +--- 1 files changed, 60 insertions(+), 48 deletions(-) diff --git a/net/rds/message.c b/net/rds/message.c index 116cf87..c36edbb 100644 --- a/net/rds/message.c +++ b/net/rds/message.c @@ -333,14 +333,14 @@ struct rds_message *rds_message_map_pages(unsigned long *page_addrs, unsigned in return rm; } -int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from, - bool zcopy) +int rds_message_zcopy_from_user(struct rds_message *rm, struct iov_iter *from) { - unsigned long to_copy, nbytes; unsigned long sg_off; struct scatterlist *sg; int ret = 0; int length = iov_iter_count(from); + int total_copied = 0; + struct sk_buff *skb; rm->m_inc.i_hdr.h_len = cpu_to_be32(iov_iter_count(from)); @@ -350,54 +350,66 @@ int rds_message_copy_from_user(struct rds_message *rm, struct iov_iter *from, sg = rm->data.op_sg; sg_off = 0; /* Dear gcc, sg->page will be null from kzalloc. 
*/ - if (zcopy) { - int total_copied = 0; - struct sk_buff *skb; - - skb = alloc_skb(0, GFP_KERNEL); - if (!skb) - return -ENOMEM; - BUILD_BUG_ON(sizeof(skb->cb) < -max_t(int, sizeof(struct rds_znotifier), - sizeof(struct rds_zcopy_cookies))); - rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb); - if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp, - length)) { - ret = -ENOMEM; + skb = alloc_skb(0, GFP_KERNEL); + if (!skb) + return -ENOMEM; + BUILD_BUG_ON(sizeof(skb->cb) < max_t(int, sizeof(struct rds_znotifier), +sizeof(struct rds_zcopy_cookies))); + rm->data.op_mmp_znotifier = RDS_ZCOPY_SKB(skb); + if (mm_account_pinned_pages(&rm->data.op_mmp_znotifier->z_mmp, + length)) { + ret = -ENOMEM; + goto err; + } + while (iov_iter_count(from)) { + struct page *pages; + size_t start; + ssize_t copied; + + copied = iov_iter_get_pages(from, &pages, PAGE_SIZE, + 1, &start); + if (copied < 0) { + struct mmpin *mmp; + int i; + + for (i = 0; i < rm->data.op_nents; i++) + put_page(sg_page(&rm->data.op_sg[i])); + mmp = &rm->data.op_mmp_znotifier->z_mmp; + mm_unaccount_pinned_pages(mmp); + ret = -EFAULT; goto err; } - while (iov_iter_count(from)) { - struct page *pages; - size_t start; - ssize_t copied; - - copied = iov_iter_get_pages(from, &pages, PAGE_SIZE, - 1, &start); - if (copied < 0) { - struct mmpin *mmp; - int i; - - for (i = 0; i < rm->data.op_nents; i++) - put_page(sg_page(&rm->data.op_sg[i])); - mmp = &rm->data.op_mmp_znotifier->z_mmp; - mm_unaccount_pinned_pages(mmp); - ret = -EFAULT; - goto err; - } - total_copied += copied; - iov_iter_advance(from, copied); - length -= copied; - sg_set_page(sg, pages, copied, start); - rm->data.op_nents++; - sg++; - } - WARN_ON_ONCE(length != 0); - return ret; + total_copied += copied; + iov_iter_advance(from, copied); + length -= copied; + sg_set_page(sg, pages, copied, start); + rm->data.op_nents++; + sg++; + } + WARN_ON_ONCE(length != 0); + return ret; err: - consume_skb(skb); - rm->data.op_mmp_znotifier = NULL; - return 
ret; - } /* zcopy */ + consume_skb(skb); + rm->data.op_mmp_znotifier = NULL; + return ret
[PATCH net-next 0/2] RDS: zerocopy code enhancements
A couple of enhancements to the rds zerocopy code - patch 1 refactors rds_message_copy_from_user to pull the zcopy logic into its own function - patch 2 drops the use of sk_buff to track MSG_ZEROCOPY cookies and uses a simple linked list (enhancement suggested by willemb during code review) Sowmini Varadhan (2): rds: refactor zcopy code into rds_message_zcopy_from_user rds: use list structure to track information for zerocopy completion notification
[PATCH net-next 2/2] rds: use list structure to track information for zerocopy completion notification
Commit 401910db4cd4 ("rds: deliver zerocopy completion notification with data") removes support fo r zerocopy completion notification on the sk_error_queue, thus we no longer need to track the cookie information in sk_buff structures. This commit removes the struct sk_buff_head rs_zcookie_queue by a simpler list that results in a smaller memory footprint as well as more efficient memory_allocation time. Signed-off-by: Sowmini Varadhan --- net/rds/af_rds.c |6 ++-- net/rds/message.c | 77 +--- net/rds/rds.h | 23 +++ net/rds/recv.c| 23 +++- 4 files changed, 85 insertions(+), 44 deletions(-) diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index f712610..ab751a1 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -77,7 +77,7 @@ static int rds_release(struct socket *sock) rds_send_drop_to(rs, NULL); rds_rdma_drop_keys(rs); rds_notify_queue_get(rs, NULL); - __skb_queue_purge(&rs->rs_zcookie_queue); + rds_notify_msg_zcopy_purge(&rs->rs_zcookie_queue); spin_lock_bh(&rds_sock_lock); list_del_init(&rs->rs_item); @@ -180,7 +180,7 @@ static __poll_t rds_poll(struct file *file, struct socket *sock, } if (!list_empty(&rs->rs_recv_queue) || !list_empty(&rs->rs_notify_queue) || - !skb_queue_empty(&rs->rs_zcookie_queue)) + !list_empty(&rs->rs_zcookie_queue.zcookie_head)) mask |= (EPOLLIN | EPOLLRDNORM); if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) mask |= (EPOLLOUT | EPOLLWRNORM); @@ -515,7 +515,7 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol) INIT_LIST_HEAD(&rs->rs_recv_queue); INIT_LIST_HEAD(&rs->rs_notify_queue); INIT_LIST_HEAD(&rs->rs_cong_list); - skb_queue_head_init(&rs->rs_zcookie_queue); + rds_message_zcopy_queue_init(&rs->rs_zcookie_queue); spin_lock_init(&rs->rs_rdma_lock); rs->rs_rdma_keys = RB_ROOT; rs->rs_rx_traces = 0; diff --git a/net/rds/message.c b/net/rds/message.c index c36edbb..90dcdcf 100644 --- a/net/rds/message.c +++ b/net/rds/message.c @@ -48,7 +48,6 @@ [RDS_EXTHDR_GEN_NUM] = sizeof(u32), }; - void rds_message_addref(struct 
rds_message *rm) { rdsdebug("addref rm %p ref %d\n", rm, refcount_read(&rm->m_refcount)); @@ -56,9 +55,9 @@ void rds_message_addref(struct rds_message *rm) } EXPORT_SYMBOL_GPL(rds_message_addref); -static inline bool skb_zcookie_add(struct sk_buff *skb, u32 cookie) +static inline bool rds_zcookie_add(struct rds_msg_zcopy_info *info, u32 cookie) { - struct rds_zcopy_cookies *ck = (struct rds_zcopy_cookies *)skb->cb; + struct rds_zcopy_cookies *ck = &info->zcookies; int ncookies = ck->num; if (ncookies == RDS_MAX_ZCOOKIES) @@ -68,38 +67,61 @@ static inline bool skb_zcookie_add(struct sk_buff *skb, u32 cookie) return true; } +struct rds_msg_zcopy_info *rds_info_from_znotifier(struct rds_znotifier *znotif) +{ + return container_of(znotif, struct rds_msg_zcopy_info, znotif); +} + +void rds_notify_msg_zcopy_purge(struct rds_msg_zcopy_queue *q) +{ + unsigned long flags; + LIST_HEAD(copy); + struct rds_msg_zcopy_info *info, *tmp; + + spin_lock_irqsave(&q->lock, flags); + list_splice(&q->zcookie_head, ©); + INIT_LIST_HEAD(&q->zcookie_head); + spin_unlock_irqrestore(&q->lock, flags); + + list_for_each_entry_safe(info, tmp, ©, rs_zcookie_next) { + list_del(&info->rs_zcookie_next); + kfree(info); + } +} + static void rds_rm_zerocopy_callback(struct rds_sock *rs, struct rds_znotifier *znotif) { - struct sk_buff *skb, *tail; - unsigned long flags; - struct sk_buff_head *q; + struct rds_msg_zcopy_info *info; + struct rds_msg_zcopy_queue *q; u32 cookie = znotif->z_cookie; struct rds_zcopy_cookies *ck; + struct list_head *head; + unsigned long flags; + mm_unaccount_pinned_pages(&znotif->z_mmp); q = &rs->rs_zcookie_queue; spin_lock_irqsave(&q->lock, flags); - tail = skb_peek_tail(q); - - if (tail && skb_zcookie_add(tail, cookie)) { - spin_unlock_irqrestore(&q->lock, flags); - mm_unaccount_pinned_pages(&znotif->z_mmp); - consume_skb(rds_skb_from_znotifier(znotif)); - /* caller invokes rds_wake_sk_sleep() */ - return; + head = &q->zcookie_head; + if (!list_empty(head)) { + info = 
list_entry(head, struct rds_msg_zcopy_info, + rs_zco
Re: [PATCH v2 net] rds: Incorrect reference counting in TCP socket creation
On (03/01/18 21:07), Ka-Cheong Poon wrote: > Commit 0933a578cd55 ("rds: tcp: use sock_create_lite() to create the > accept socket") has a reference counting issue in TCP socket creation > when accepting a new connection. The code uses sock_create_lite() to > create a kernel socket. But it does not do __module_get() on the > socket owner. When the connection is shutdown and sock_release() is > called to free the socket, the owner's reference count is decremented > and becomes incorrect. Note that this bug only shows up when the socket > owner is configured as a kernel module. Acked-by: Sowmini Varadhan
Re: [PATCH net] rds: Incorrect reference counting in TCP socket creation
On (03/01/18 20:19), Ka-Cheong Poon wrote: > >> > >>- new_sock->type = sock->type; > >>- new_sock->ops = sock->ops; > >> ret = sock->ops->accept(sock, new_sock, O_NONBLOCK, true); > >> if (ret < 0) > >> goto out; > >> > >>+ new_sock->ops = sock->ops; > > > >How is this delta relevant to the commit comment? Seems unrelated? > > > Note that sock_release() checks if sock->ops is set before > decrementing the refcnt. By moving the ops assignment after > the ops->accept() call, we save increasing the refcnt in > case the ops->accept() fails. Otherwise, the __module_get() > needs to be moved before ops->accept() to handle this failure > case. I see, thanks for clarification. It may be helpful to have some comment in there, in case some other module trips on something similar in the future. --Sowmini
Re: [PATCH net] rds: Incorrect reference counting in TCP socket creation
On Wed, Feb 28, 2018 at 11:44 PM, Ka-Cheong Poon wrote: > Commit 0933a578cd55 ("rds: tcp: use sock_create_lite() to create the > accept socket") has a reference counting issue in TCP socket creation > when accepting a new connection. The code uses sock_create_lite() to > create a kernel socket. But it does not do __module_get() on the > socket owner. When the connection is shutdown and sock_release() is > called to free the socket, the owner's reference count is decremented > and becomes incorrect. Note that this bug only shows up when the socket > owner is configured as a kernel module. > > - new_sock->type = sock->type; > - new_sock->ops = sock->ops; > ret = sock->ops->accept(sock, new_sock, O_NONBLOCK, true); > if (ret < 0) > goto out; > > + new_sock->ops = sock->ops; How is this delta relevant to the commit comment? Seems unrelated? --Sowmini
[PATCH RESEND V3 net-next 3/3] selftests/net: reap zerocopy completions passed up as ancillary data.
PF_RDS sockets pass up cookies for zerocopy completion as ancillary data. Update msg_zerocopy to reap this information. Signed-off-by: Sowmini Varadhan Acked-by: Willem de Bruijn Acked-by: Santosh Shilimkar --- v2: receive zerocopy completion notification as POLLIN v3: drop ncookies arg in do_process_zerocopy_cookies; Reverse christmas tree declarations; check for C_TRUNC; print warning when encountering ignored cmsghdrs in do_recvmsg_completion tools/testing/selftests/net/msg_zerocopy.c | 65 --- 1 files changed, 57 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index eff9cf2..406cc70 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -344,7 +344,53 @@ static int do_setup_tx(int domain, int type, int protocol) return fd; } -static bool do_recv_completion(int fd) +static uint32_t do_process_zerocopy_cookies(struct rds_zcopy_cookies *ck) +{ + int i; + + if (ck->num > RDS_MAX_ZCOOKIES) + error(1, 0, "Returned %d cookies, max expected %d\n", + ck->num, RDS_MAX_ZCOOKIES); + for (i = 0; i < ck->num; i++) + if (cfg_verbose >= 2) + fprintf(stderr, "%d\n", ck->cookies[i]); + return ck->num; +} + +static bool do_recvmsg_completion(int fd) +{ + char cmsgbuf[CMSG_SPACE(sizeof(struct rds_zcopy_cookies))]; + struct rds_zcopy_cookies *ck; + struct cmsghdr *cmsg; + struct msghdr msg; + bool ret = false; + + memset(&msg, 0, sizeof(msg)); + msg.msg_control = cmsgbuf; + msg.msg_controllen = sizeof(cmsgbuf); + + if (recvmsg(fd, &msg, MSG_DONTWAIT)) + return ret; + + if (msg.msg_flags & MSG_CTRUNC) + error(1, errno, "recvmsg notification: truncated"); + + for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) { + if (cmsg->cmsg_level == SOL_RDS && + cmsg->cmsg_type == RDS_CMSG_ZCOPY_COMPLETION) { + + ck = (struct rds_zcopy_cookies *)CMSG_DATA(cmsg); + completions += do_process_zerocopy_cookies(ck); + ret = true; + break; + } + 
error(0, 0, "ignoring cmsg at level %d type %d\n", + cmsg->cmsg_level, cmsg->cmsg_type); + } + return ret; +} + +static bool do_recv_completion(int fd, int domain) { struct sock_extended_err *serr; struct msghdr msg = {}; @@ -353,6 +399,9 @@ static bool do_recv_completion(int fd) int ret, zerocopy; char control[100]; + if (domain == PF_RDS) + return do_recvmsg_completion(fd); + msg.msg_control = control; msg.msg_controllen = sizeof(control); @@ -409,20 +458,20 @@ static bool do_recv_completion(int fd) } /* Read all outstanding messages on the errqueue */ -static void do_recv_completions(int fd) +static void do_recv_completions(int fd, int domain) { - while (do_recv_completion(fd)) {} + while (do_recv_completion(fd, domain)) {} } /* Wait for all remaining completions on the errqueue */ -static void do_recv_remaining_completions(int fd) +static void do_recv_remaining_completions(int fd, int domain) { int64_t tstop = gettimeofday_ms() + cfg_waittime_ms; while (completions < expected_completions && gettimeofday_ms() < tstop) { - if (do_poll(fd, POLLERR)) - do_recv_completions(fd); + if (do_poll(fd, domain == PF_RDS ? POLLIN : POLLERR)) + do_recv_completions(fd, domain); } if (completions < expected_completions) @@ -503,13 +552,13 @@ static void do_tx(int domain, int type, int protocol) while (!do_poll(fd, POLLOUT)) { if (cfg_zerocopy) - do_recv_completions(fd); + do_recv_completions(fd, domain); } } while (gettimeofday_ms() < tstop); if (cfg_zerocopy) - do_recv_remaining_completions(fd); + do_recv_remaining_completions(fd, domain); if (close(fd)) error(1, errno, "close"); -- 1.7.1
[PATCH RESEND V3 net-next 1/3] selftests/net: revert the zerocopy Rx path for PF_RDS
In preparation for optimized reception of zerocopy completion, revert the Rx side changes introduced by Commit dfb8434b0a94 ("selftests/net: add zerocopy support for PF_RDS test case") Signed-off-by: Sowmini Varadhan Acked-by: Willem de Bruijn Acked-by: Santosh Shilimkar --- v2: prepare to remove sk_error_queue based path; remove recvmsg() as well, PF_RDS can also use recv() for the usage pattern in msg_zerocopy tools/testing/selftests/net/msg_zerocopy.c | 67 1 files changed, 0 insertions(+), 67 deletions(-) diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index 5cc2a53..eff9cf2 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -344,26 +344,6 @@ static int do_setup_tx(int domain, int type, int protocol) return fd; } -static int do_process_zerocopy_cookies(struct sock_extended_err *serr, - uint32_t *ckbuf, size_t nbytes) -{ - int ncookies, i; - - if (serr->ee_errno != 0) - error(1, 0, "serr: wrong error code: %u", serr->ee_errno); - ncookies = serr->ee_data; - if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES) - error(1, 0, "Returned %d cookies, max expected %d\n", - ncookies, SO_EE_ORIGIN_MAX_ZCOOKIES); - if (nbytes != ncookies * sizeof(uint32_t)) - error(1, 0, "Expected %d cookies, got %ld\n", - ncookies, nbytes/sizeof(uint32_t)); - for (i = 0; i < ncookies; i++) - if (cfg_verbose >= 2) - fprintf(stderr, "%d\n", ckbuf[i]); - return ncookies; -} - static bool do_recv_completion(int fd) { struct sock_extended_err *serr; @@ -372,17 +352,10 @@ static bool do_recv_completion(int fd) uint32_t hi, lo, range; int ret, zerocopy; char control[100]; - uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES]; - struct iovec iov; msg.msg_control = control; msg.msg_controllen = sizeof(control); - iov.iov_base = ckbuf; - iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0])); - msg.msg_iov = &iov; - msg.msg_iovlen = 1; - ret = recvmsg(fd, &msg, MSG_ERRQUEUE); if (ret == -1 && errno == 
EAGAIN) return false; @@ -402,10 +375,6 @@ static bool do_recv_completion(int fd) serr = (void *) CMSG_DATA(cm); - if (serr->ee_origin == SO_EE_ORIGIN_ZCOOKIE) { - completions += do_process_zerocopy_cookies(serr, ckbuf, ret); - return true; - } if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) error(1, 0, "serr: wrong origin: %u", serr->ee_origin); if (serr->ee_errno != 0) @@ -631,40 +600,6 @@ static void do_flush_datagram(int fd, int type) bytes += cfg_payload_len; } - -static void do_recvmsg(int fd) -{ - int ret, off = 0; - char *buf; - struct iovec iov; - struct msghdr msg; - struct sockaddr_storage din; - - buf = calloc(cfg_payload_len, sizeof(char)); - iov.iov_base = buf; - iov.iov_len = cfg_payload_len; - - memset(&msg, 0, sizeof(msg)); - msg.msg_name = &din; - msg.msg_namelen = sizeof(din); - msg.msg_iov = &iov; - msg.msg_iovlen = 1; - - ret = recvmsg(fd, &msg, MSG_TRUNC); - - if (ret == -1) - error(1, errno, "recv"); - if (ret != cfg_payload_len) - error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len); - - if (memcmp(buf + off, payload, ret)) - error(1, 0, "recv: data mismatch"); - - free(buf); - packets++; - bytes += cfg_payload_len; -} - static void do_rx(int domain, int type, int protocol) { uint64_t tstop; @@ -676,8 +611,6 @@ static void do_rx(int domain, int type, int protocol) do { if (type == SOCK_STREAM) do_flush_tcp(fd); - else if (domain == PF_RDS) - do_recvmsg(fd); else do_flush_datagram(fd, type); -- 1.7.1
[PATCH RESEND V3 net-next 0/3] RDS: optimized notification for zerocopy completion
Resending with acked-by additions: previous attempt does not show up in Patchwork. This time with a new mail Message-Id. RDS applications use predominantly request-response, transaction-based IPC, so that ingress and egress traffic are well-balanced, and it is possible/desirable to reduce system-call overhead by piggybacking the notifications for zerocopy completion response with data. Moreover, it has been pointed out that socket functions block if sk_err is non-zero, thus if the RDS code does not plan/need to use sk_error_queue path for completion notification, it is preferable to remove the sk_error_queue related paths in RDS. Both of these goals are implemented in this series. v2: removed sk_error_queue support v3: incorporated additional code review comments (details in each patch) Sowmini Varadhan (3): selftests/net: revert the zerocopy Rx path for PF_RDS rds: deliver zerocopy completion notification with data selftests/net: reap zerocopy completions passed up as ancillary data. include/uapi/linux/errqueue.h |2 - include/uapi/linux/rds.h |7 ++ net/rds/af_rds.c |7 +- net/rds/message.c | 38 - net/rds/rds.h |2 + net/rds/recv.c | 31 +++- tools/testing/selftests/net/msg_zerocopy.c | 120 7 files changed, 111 insertions(+), 96 deletions(-)
[PATCH RESEND V3 net-next 2/3] rds: deliver zerocopy completion notification with data
This commit is an optimization over commit 01883eda72bd ("rds: support for zcopy completion notification") for PF_RDS sockets. RDS applications are predominantly request-response transactions, so it is more efficient to reduce the number of system calls and have zerocopy completion notification delivered as ancillary data on the POLLIN channel. Cookies are passed up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when the returned value of recvmsg() is greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed with each message. This commit removes support for zerocopy completion notification on MSG_ERRQUEUE for PF_RDS sockets. Signed-off-by: Sowmini Varadhan Acked-by: Willem de Bruijn Acked-by: Santosh Shilimkar --- v2: remove sk_error_queue path; lots of cautionary checks in rds_recvmsg_zcookie() and callers to make sure we don't remove cookies from the queue and then fail to pass them up to the caller v3: - bounds check on skb->cb to make sure there is enough room for struct rds_zcopy_cookies as well as the rds_znotifier; - Refactor cautionary checks in rds_recvmsg_zcookie: if no msg_control has been passed, or if there is not enough msg_controllen for a rds_zcopy_cookies, return silently (do not return error, as the caller may have wanted other ancillary data which may happen to fit in the space provided) - return bool from rds_recvmsg_zcookie, some other code cleanup include/uapi/linux/errqueue.h |2 -- include/uapi/linux/rds.h |7 +++ net/rds/af_rds.c |7 +-- net/rds/message.c | 38 -- net/rds/rds.h |2 ++ net/rds/recv.c| 31 ++- 6 files changed, 60 insertions(+), 27 deletions(-) diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index 28812ed..dc64cfa 100644 --- a/include/uapi/linux/errqueue.h +++ b/include/uapi/linux/errqueue.h @@ -20,13 +20,11 @@ struct sock_extended_err { #define SO_EE_ORIGIN_ICMP6 3 #define SO_EE_ORIGIN_TXSTATUS 4 #define SO_EE_ORIGIN_ZEROCOPY 5 -#define SO_EE_ORIGIN_ZCOOKIE 6 #define
SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1)) #define SO_EE_CODE_ZEROCOPY_COPIED 1 -#defineSO_EE_ORIGIN_MAX_ZCOOKIES 8 /** * struct scm_timestamping - timestamps exposed through cmsg diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h index 12e3bca..a66b213 100644 --- a/include/uapi/linux/rds.h +++ b/include/uapi/linux/rds.h @@ -104,6 +104,7 @@ #define RDS_CMSG_MASKED_ATOMIC_CSWP9 #define RDS_CMSG_RXPATH_LATENCY11 #defineRDS_CMSG_ZCOPY_COOKIE 12 +#defineRDS_CMSG_ZCOPY_COMPLETION 13 #define RDS_INFO_FIRST 1 #define RDS_INFO_COUNTERS 1 @@ -317,6 +318,12 @@ struct rds_rdma_notify { #define RDS_RDMA_DROPPED 3 #define RDS_RDMA_OTHER_ERROR 4 +#defineRDS_MAX_ZCOOKIES8 +struct rds_zcopy_cookies { + __u32 num; + __u32 cookies[RDS_MAX_ZCOOKIES]; +}; + /* * Common set of flags for all RDMA related structs */ diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index a937f18..f712610 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -77,6 +77,7 @@ static int rds_release(struct socket *sock) rds_send_drop_to(rs, NULL); rds_rdma_drop_keys(rs); rds_notify_queue_get(rs, NULL); + __skb_queue_purge(&rs->rs_zcookie_queue); spin_lock_bh(&rds_sock_lock); list_del_init(&rs->rs_item); @@ -144,7 +145,7 @@ static int rds_getname(struct socket *sock, struct sockaddr *uaddr, * - to signal that a previously congested destination may have become * uncongested * - A notification has been queued to the socket (this can be a congestion - * update, or a RDMA completion). + * update, or a RDMA completion, or a MSG_ZEROCOPY completion). * * EPOLLOUT is asserted if there is room on the send queue. This does not mean * however, that the next sendmsg() call will succeed. 
If the application tries @@ -178,7 +179,8 @@ static __poll_t rds_poll(struct file *file, struct socket *sock, spin_unlock(&rs->rs_lock); } if (!list_empty(&rs->rs_recv_queue) || - !list_empty(&rs->rs_notify_queue)) + !list_empty(&rs->rs_notify_queue) || + !skb_queue_empty(&rs->rs_zcookie_queue)) mask |= (EPOLLIN | EPOLLRDNORM); if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) mask |= (EPOLLOUT | EPOLLWRNORM); @@ -513,6 +515,7 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol) INIT_LIST_HEAD(&rs->rs_recv_queue); INIT_LIST_HEAD(&rs->rs_notify_queue);
Re: [PATCH net-next 0/2] ntuple filters with RSS
On (02/27/18 12:40), David Miller wrote: > > I'm expecting a V3 respin of your zerocopy notification changes > if that is what you're talking about, and I therefore marked > the most recent spin as "changes requested". sorry, I'm confused - you are waiting for V4? I am not seeing v3 on patchwork, the only thing I see is V2 (with "changes requested" as I expected). V3 is archived here, and afaik needs no more respins: https://www.spinics.net/lists/netdev/msg485282.html Willem had responded with " For the series: Acked-by Willem de Bruijn Small item I missed in v2: the recvmsg in patch 3 should fail hard with error() on an unexpected errno. Not worth sending a v4 just for that." (see https://www.spinics.net/lists/netdev/msg485418.html) Santosh had incorrectly flagged a smatch non-issue: https://www.spinics.net/lists/netdev/msg485424.html I resent my patch a few minutes ago, but I suspect I may now be hitting this well-known patchwork bug: https://www.spinics.net/lists/sparclinux/msg13787.html Do I need to do something? --Sowmini
Re: [PATCH net-next 0/2] ntuple filters with RSS
On (02/27/18 11:49), David Miller wrote: > > Do I need to resend? > > Yes, see my other email. do we need to resend patches not showing up in patchwork? I recall seeing email about picking things manually from inbox but missed this. --Sowmini
Re: [PATCH V3 net-next 2/3] rds: deliver zerocopy completion notification with data
On (02/26/18 09:07), Santosh Shilimkar wrote: > Just in case you haven't seen yet, Dan Carpenter reported skb deref > warning on previous version of the patch. Not sure why it wasn't sent > on netdev. Yes, I saw it, but that was for the previous version of the patch, around code that delivers notifications on sk_error_queue. This patch series removes the sk_error_queue support, so the smatch warning is not applicable after this patch. On a different note, for some odd reason I'm not seeing this patch series on the patch queue, though it's showing up in the archives. --Sowmini
[PATCH V3 net-next 1/3] selftests/net: revert the zerocopy Rx path for PF_RDS
In preparation for optimized reception of zerocopy completion, revert the Rx side changes introduced by Commit dfb8434b0a94 ("selftests/net: add zerocopy support for PF_RDS test case") Signed-off-by: Sowmini Varadhan --- v2: prepare to remove sk_error_queue based path; remove recvmsg() as well, PF_RDS can also use recv() for the usage pattern in msg_zerocopy tools/testing/selftests/net/msg_zerocopy.c | 67 1 files changed, 0 insertions(+), 67 deletions(-) diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index 5cc2a53..eff9cf2 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -344,26 +344,6 @@ static int do_setup_tx(int domain, int type, int protocol) return fd; } -static int do_process_zerocopy_cookies(struct sock_extended_err *serr, - uint32_t *ckbuf, size_t nbytes) -{ - int ncookies, i; - - if (serr->ee_errno != 0) - error(1, 0, "serr: wrong error code: %u", serr->ee_errno); - ncookies = serr->ee_data; - if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES) - error(1, 0, "Returned %d cookies, max expected %d\n", - ncookies, SO_EE_ORIGIN_MAX_ZCOOKIES); - if (nbytes != ncookies * sizeof(uint32_t)) - error(1, 0, "Expected %d cookies, got %ld\n", - ncookies, nbytes/sizeof(uint32_t)); - for (i = 0; i < ncookies; i++) - if (cfg_verbose >= 2) - fprintf(stderr, "%d\n", ckbuf[i]); - return ncookies; -} - static bool do_recv_completion(int fd) { struct sock_extended_err *serr; @@ -372,17 +352,10 @@ static bool do_recv_completion(int fd) uint32_t hi, lo, range; int ret, zerocopy; char control[100]; - uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES]; - struct iovec iov; msg.msg_control = control; msg.msg_controllen = sizeof(control); - iov.iov_base = ckbuf; - iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0])); - msg.msg_iov = &iov; - msg.msg_iovlen = 1; - ret = recvmsg(fd, &msg, MSG_ERRQUEUE); if (ret == -1 && errno == EAGAIN) return false; @@ -402,10 +375,6 @@ static bool 
do_recv_completion(int fd) serr = (void *) CMSG_DATA(cm); - if (serr->ee_origin == SO_EE_ORIGIN_ZCOOKIE) { - completions += do_process_zerocopy_cookies(serr, ckbuf, ret); - return true; - } if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) error(1, 0, "serr: wrong origin: %u", serr->ee_origin); if (serr->ee_errno != 0) @@ -631,40 +600,6 @@ static void do_flush_datagram(int fd, int type) bytes += cfg_payload_len; } - -static void do_recvmsg(int fd) -{ - int ret, off = 0; - char *buf; - struct iovec iov; - struct msghdr msg; - struct sockaddr_storage din; - - buf = calloc(cfg_payload_len, sizeof(char)); - iov.iov_base = buf; - iov.iov_len = cfg_payload_len; - - memset(&msg, 0, sizeof(msg)); - msg.msg_name = &din; - msg.msg_namelen = sizeof(din); - msg.msg_iov = &iov; - msg.msg_iovlen = 1; - - ret = recvmsg(fd, &msg, MSG_TRUNC); - - if (ret == -1) - error(1, errno, "recv"); - if (ret != cfg_payload_len) - error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len); - - if (memcmp(buf + off, payload, ret)) - error(1, 0, "recv: data mismatch"); - - free(buf); - packets++; - bytes += cfg_payload_len; -} - static void do_rx(int domain, int type, int protocol) { uint64_t tstop; @@ -676,8 +611,6 @@ static void do_rx(int domain, int type, int protocol) do { if (type == SOCK_STREAM) do_flush_tcp(fd); - else if (domain == PF_RDS) - do_recvmsg(fd); else do_flush_datagram(fd, type); -- 1.7.1
[PATCH V3 net-next 3/3] selftests/net: reap zerocopy completions passed up as ancillary data.
PF_RDS sockets pass up cookies for zerocopy completion as ancillary data. Update msg_zerocopy to reap this information. Signed-off-by: Sowmini Varadhan --- v2: receive zerocopy completion notification as POLLIN v3: drop ncookies arg in do_process_zerocopy_cookies; Reverse christmas tree declarations; check for C_TRUNC; print warning when encountering ignored cmsghdrs in do_recvmsg_completion tools/testing/selftests/net/msg_zerocopy.c | 65 --- 1 files changed, 57 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index eff9cf2..406cc70 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -344,7 +344,53 @@ static int do_setup_tx(int domain, int type, int protocol) return fd; } -static bool do_recv_completion(int fd) +static uint32_t do_process_zerocopy_cookies(struct rds_zcopy_cookies *ck) +{ + int i; + + if (ck->num > RDS_MAX_ZCOOKIES) + error(1, 0, "Returned %d cookies, max expected %d\n", + ck->num, RDS_MAX_ZCOOKIES); + for (i = 0; i < ck->num; i++) + if (cfg_verbose >= 2) + fprintf(stderr, "%d\n", ck->cookies[i]); + return ck->num; +} + +static bool do_recvmsg_completion(int fd) +{ + char cmsgbuf[CMSG_SPACE(sizeof(struct rds_zcopy_cookies))]; + struct rds_zcopy_cookies *ck; + struct cmsghdr *cmsg; + struct msghdr msg; + bool ret = false; + + memset(&msg, 0, sizeof(msg)); + msg.msg_control = cmsgbuf; + msg.msg_controllen = sizeof(cmsgbuf); + + if (recvmsg(fd, &msg, MSG_DONTWAIT)) + return ret; + + if (msg.msg_flags & MSG_CTRUNC) + error(1, errno, "recvmsg notification: truncated"); + + for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) { + if (cmsg->cmsg_level == SOL_RDS && + cmsg->cmsg_type == RDS_CMSG_ZCOPY_COMPLETION) { + + ck = (struct rds_zcopy_cookies *)CMSG_DATA(cmsg); + completions += do_process_zerocopy_cookies(ck); + ret = true; + break; + } + error(0, 0, "ignoring cmsg at level %d type %d\n", + 
cmsg->cmsg_level, cmsg->cmsg_type); + } + return ret; +} + +static bool do_recv_completion(int fd, int domain) { struct sock_extended_err *serr; struct msghdr msg = {}; @@ -353,6 +399,9 @@ static bool do_recv_completion(int fd) int ret, zerocopy; char control[100]; + if (domain == PF_RDS) + return do_recvmsg_completion(fd); + msg.msg_control = control; msg.msg_controllen = sizeof(control); @@ -409,20 +458,20 @@ static bool do_recv_completion(int fd) } /* Read all outstanding messages on the errqueue */ -static void do_recv_completions(int fd) +static void do_recv_completions(int fd, int domain) { - while (do_recv_completion(fd)) {} + while (do_recv_completion(fd, domain)) {} } /* Wait for all remaining completions on the errqueue */ -static void do_recv_remaining_completions(int fd) +static void do_recv_remaining_completions(int fd, int domain) { int64_t tstop = gettimeofday_ms() + cfg_waittime_ms; while (completions < expected_completions && gettimeofday_ms() < tstop) { - if (do_poll(fd, POLLERR)) - do_recv_completions(fd); + if (do_poll(fd, domain == PF_RDS ? POLLIN : POLLERR)) + do_recv_completions(fd, domain); } if (completions < expected_completions) @@ -503,13 +552,13 @@ static void do_tx(int domain, int type, int protocol) while (!do_poll(fd, POLLOUT)) { if (cfg_zerocopy) - do_recv_completions(fd); + do_recv_completions(fd, domain); } } while (gettimeofday_ms() < tstop); if (cfg_zerocopy) - do_recv_remaining_completions(fd); + do_recv_remaining_completions(fd, domain); if (close(fd)) error(1, errno, "close"); -- 1.7.1
[PATCH V3 net-next 2/3] rds: deliver zerocopy completion notification with data
This commit is an optimization over commit 01883eda72bd ("rds: support for zcopy completion notification") for PF_RDS sockets. RDS applications are predominantly request-response transactions, so it is more efficient to reduce the number of system calls and have zerocopy completion notification delivered as ancillary data on the POLLIN channel. Cookies are passed up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when the returned value of recvmsg() is greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed with each message. This commit removes support for zerocopy completion notification on MSG_ERRQUEUE for PF_RDS sockets. Signed-off-by: Sowmini Varadhan --- v2: remove sk_error_queue path; lots of cautionary checks in rds_recvmsg_zcookie() and callers to make sure we don't remove cookies from the queue and then fail to pass them up to the caller v3: - bounds check on skb->cb to make sure there is enough room for struct rds_zcopy_cookies as well as the rds_znotifier; - Refactor cautionary checks in rds_recvmsg_zcookie: if no msg_control has been passed, or if there is not enough msg_controllen for a rds_zcopy_cookies, return silently (do not return error, as the caller may have wanted other ancillary data which may happen to fit in the space provided) - return bool from rds_recvmsg_zcookie, some other code cleanup include/uapi/linux/errqueue.h |2 -- include/uapi/linux/rds.h |7 +++ net/rds/af_rds.c |7 +-- net/rds/message.c | 38 -- net/rds/rds.h |2 ++ net/rds/recv.c| 31 ++- 6 files changed, 60 insertions(+), 27 deletions(-) diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index 28812ed..dc64cfa 100644 --- a/include/uapi/linux/errqueue.h +++ b/include/uapi/linux/errqueue.h @@ -20,13 +20,11 @@ struct sock_extended_err { #define SO_EE_ORIGIN_ICMP6 3 #define SO_EE_ORIGIN_TXSTATUS 4 #define SO_EE_ORIGIN_ZEROCOPY 5 -#define SO_EE_ORIGIN_ZCOOKIE 6 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS #define
SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1)) #define SO_EE_CODE_ZEROCOPY_COPIED 1 -#defineSO_EE_ORIGIN_MAX_ZCOOKIES 8 /** * struct scm_timestamping - timestamps exposed through cmsg diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h index 12e3bca..a66b213 100644 --- a/include/uapi/linux/rds.h +++ b/include/uapi/linux/rds.h @@ -104,6 +104,7 @@ #define RDS_CMSG_MASKED_ATOMIC_CSWP9 #define RDS_CMSG_RXPATH_LATENCY11 #defineRDS_CMSG_ZCOPY_COOKIE 12 +#defineRDS_CMSG_ZCOPY_COMPLETION 13 #define RDS_INFO_FIRST 1 #define RDS_INFO_COUNTERS 1 @@ -317,6 +318,12 @@ struct rds_rdma_notify { #define RDS_RDMA_DROPPED 3 #define RDS_RDMA_OTHER_ERROR 4 +#defineRDS_MAX_ZCOOKIES8 +struct rds_zcopy_cookies { + __u32 num; + __u32 cookies[RDS_MAX_ZCOOKIES]; +}; + /* * Common set of flags for all RDMA related structs */ diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index a937f18..f712610 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -77,6 +77,7 @@ static int rds_release(struct socket *sock) rds_send_drop_to(rs, NULL); rds_rdma_drop_keys(rs); rds_notify_queue_get(rs, NULL); + __skb_queue_purge(&rs->rs_zcookie_queue); spin_lock_bh(&rds_sock_lock); list_del_init(&rs->rs_item); @@ -144,7 +145,7 @@ static int rds_getname(struct socket *sock, struct sockaddr *uaddr, * - to signal that a previously congested destination may have become * uncongested * - A notification has been queued to the socket (this can be a congestion - * update, or a RDMA completion). + * update, or a RDMA completion, or a MSG_ZEROCOPY completion). * * EPOLLOUT is asserted if there is room on the send queue. This does not mean * however, that the next sendmsg() call will succeed. 
If the application tries @@ -178,7 +179,8 @@ static __poll_t rds_poll(struct file *file, struct socket *sock, spin_unlock(&rs->rs_lock); } if (!list_empty(&rs->rs_recv_queue) || - !list_empty(&rs->rs_notify_queue)) + !list_empty(&rs->rs_notify_queue) || + !skb_queue_empty(&rs->rs_zcookie_queue)) mask |= (EPOLLIN | EPOLLRDNORM); if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) mask |= (EPOLLOUT | EPOLLWRNORM); @@ -513,6 +515,7 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol) INIT_LIST_HEAD(&rs->rs_recv_queue); INIT_LIST_HEAD(&rs->rs_notify_queue); INIT_LIST_HEAD(&rs->rs_cong_list); + skb_queue_head_ini
[PATCH V3 net-next 0/3] RDS: optimized notification for zerocopy completion
RDS applications use predominantly request-response, transaction-based IPC, so that ingress and egress traffic are well-balanced, and it is possible/desirable to reduce system-call overhead by piggybacking the notifications for zerocopy completion response with data. Moreover, it has been pointed out that socket functions block if sk_err is non-zero, thus if the RDS code does not plan/need to use sk_error_queue path for completion notification, it is preferable to remove the sk_error_queue related paths in RDS. Both of these goals are implemented in this series. v2: removed sk_error_queue support v3: incorporated additional code review comments (details in each patch) Sowmini Varadhan (3): selftests/net: revert the zerocopy Rx path for PF_RDS rds: deliver zerocopy completion notification with data selftests/net: reap zerocopy completions passed up as ancillary data. include/uapi/linux/errqueue.h |2 - include/uapi/linux/rds.h |7 ++ net/rds/af_rds.c |7 +- net/rds/message.c | 38 - net/rds/rds.h |2 + net/rds/recv.c | 31 +++- tools/testing/selftests/net/msg_zerocopy.c | 120 7 files changed, 111 insertions(+), 96 deletions(-)
Re: [PATCH V2 net-next 2/3] rds: deliver zerocopy completion notification with data
On (02/25/18 10:56), Willem de Bruijn wrote: > > @@ -91,22 +85,19 @@ static void rds_rm_zerocopy_callback(struct rds_sock > > *rs, > > spin_unlock_irqrestore(&q->lock, flags); > > mm_unaccount_pinned_pages(&znotif->z_mmp); > > consume_skb(rds_skb_from_znotifier(znotif)); > > - sk->sk_error_report(sk); > > + /* caller should wake up POLLIN */ > > sk->sk_data_ready(sk); Yes, this was my first thought, but everything else in rds calls rds_wake_sk_sleep (this is even done in rds_recv_incoming(), which actually queues up the data), so I chose to align with that model (and call this in the caller of rds_rm_zerocopy_callback()). > Without the error queue, the struct no longer needs to be an skb, > per se. Converting to a different struct with list_head is definitely > a longer patch. But kmalloc will be cheaper than alloc_skb. > Perhaps something to try (as separate follow-on work). Right, I was thinking along these exact lines as well, and was already planning a follow-up. > > + if (!sock_flag(rds_rs_to_sk(rs), SOCK_ZEROCOPY) || !skb_peek(q)) > > + return 0; > > Racy read? Can you elaborate? I only put in the skb_peek to quickly bail for sockets that are not using zerocopy at all - if you race against something that's queuing data, and miss it on the peek, the next read/recv should find it. Am I missing some race? > > > + > > + if (!msg->msg_control || > > I'd move this first, so that the cookie queue need not even be probed > in the common case. You mean before the check for SOCK_ZEROCOPY? > > + msg->msg_controllen < CMSG_SPACE(sizeof(*done))) > > + return 0; > > if caller does not satisfy the contract on controllen size, can be > more explicit and return an error. If SOCK_ZEROCOPY has been set, but the recv did not specify a cmsghdr, you mean? > > + ncookies = rds_recvmsg_zcookie(rs, msg); Will take care of the remaining comments in V3.
[PATCH V2 net-next 0/3] RDS: optimized notification for zerocopy completion
RDS applications use predominantly request-response, transaction-based IPC, so that ingress and egress traffic are well-balanced, and it is possible/desirable to reduce system-call overhead by piggybacking the notifications for zerocopy completion on the response data. Moreover, it has been pointed out that socket functions block if sk_err is non-zero; thus, if the RDS code does not plan/need to use the sk_error_queue path for completion notification, it is preferable to remove the sk_error_queue related paths in RDS. Both of these goals are implemented in this series. Sowmini Varadhan (3): selftests/net: revert the zerocopy Rx path for PF_RDS rds: deliver zerocopy completion notification with data selftests/net: reap zerocopy completions passed up as ancillary data. include/uapi/linux/errqueue.h |2 - include/uapi/linux/rds.h |7 ++ net/rds/af_rds.c |7 ++- net/rds/message.c | 35 -- net/rds/rds.h |2 + net/rds/recv.c | 34 +- tools/testing/selftests/net/msg_zerocopy.c | 109 +++- 7 files changed, 103 insertions(+), 93 deletions(-)
[PATCH V2 net-next 3/3] selftests/net: reap zerocopy completions passed up as ancillary data.
PF_RDS sockets pass up cookies for zerocopy completion as ancillary data. Update msg_zerocopy to reap this information. Signed-off-by: Sowmini Varadhan --- v2: receive zerocopy completion notification as POLLIN tools/testing/selftests/net/msg_zerocopy.c | 60 1 files changed, 52 insertions(+), 8 deletions(-) diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index eff9cf2..8c466e8 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -344,7 +344,48 @@ static int do_setup_tx(int domain, int type, int protocol) return fd; } -static bool do_recv_completion(int fd) +static int do_process_zerocopy_cookies(struct rds_zcopy_cookies *ck) +{ + int ncookies, i; + + ncookies = ck->num; + if (ncookies > RDS_MAX_ZCOOKIES) + error(1, 0, "Returned %d cookies, max expected %d\n", + ncookies, RDS_MAX_ZCOOKIES); + for (i = 0; i < ncookies; i++) + if (cfg_verbose >= 2) + fprintf(stderr, "%d\n", ck->cookies[i]); + return ncookies; +} + +static int do_recvmsg_completion(int fd) +{ + struct msghdr msg; + char cmsgbuf[256]; + struct cmsghdr *cmsg; + bool ret = false; + struct rds_zcopy_cookies *ck; + + memset(&msg, 0, sizeof(msg)); + msg.msg_control = cmsgbuf; + msg.msg_controllen = sizeof(cmsgbuf); + + if (recvmsg(fd, &msg, MSG_DONTWAIT)) + return ret; + for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) { + if (cmsg->cmsg_level == SOL_RDS && + cmsg->cmsg_type == RDS_CMSG_ZCOPY_COMPLETION) { + + ck = (struct rds_zcopy_cookies *)CMSG_DATA(cmsg); + completions += do_process_zerocopy_cookies(ck); + ret = true; + break; + } + } + return ret; +} + +static bool do_recv_completion(int fd, int domain) { struct sock_extended_err *serr; struct msghdr msg = {}; @@ -353,6 +394,9 @@ static bool do_recv_completion(int fd) int ret, zerocopy; char control[100]; + if (domain == PF_RDS) + return do_recvmsg_completion(fd); + msg.msg_control = control; msg.msg_controllen = 
sizeof(control); @@ -409,20 +453,20 @@ static bool do_recv_completion(int fd) } /* Read all outstanding messages on the errqueue */ -static void do_recv_completions(int fd) +static void do_recv_completions(int fd, int domain) { - while (do_recv_completion(fd)) {} + while (do_recv_completion(fd, domain)) {} } /* Wait for all remaining completions on the errqueue */ -static void do_recv_remaining_completions(int fd) +static void do_recv_remaining_completions(int fd, int domain) { int64_t tstop = gettimeofday_ms() + cfg_waittime_ms; while (completions < expected_completions && gettimeofday_ms() < tstop) { - if (do_poll(fd, POLLERR)) - do_recv_completions(fd); + if (do_poll(fd, domain == PF_RDS ? POLLIN : POLLERR)) + do_recv_completions(fd, domain); } if (completions < expected_completions) @@ -503,13 +547,13 @@ static void do_tx(int domain, int type, int protocol) while (!do_poll(fd, POLLOUT)) { if (cfg_zerocopy) - do_recv_completions(fd); + do_recv_completions(fd, domain); } } while (gettimeofday_ms() < tstop); if (cfg_zerocopy) - do_recv_remaining_completions(fd); + do_recv_remaining_completions(fd, domain); if (close(fd)) error(1, errno, "close"); -- 1.7.1
[PATCH V2 net-next 1/3] selftests/net: revert the zerocopy Rx path for PF_RDS
In preparation for optimized reception of zerocopy completion, revert the Rx side changes introduced by Commit dfb8434b0a94 ("selftests/net: add zerocopy support for PF_RDS test case") Signed-off-by: Sowmini Varadhan --- v2: prepare to remove sk_error_queue based path; remove recvmsg() as well, PF_RDS can also use recv() for the usage pattern in msg_zerocopy tools/testing/selftests/net/msg_zerocopy.c | 67 1 files changed, 0 insertions(+), 67 deletions(-) diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index 5cc2a53..eff9cf2 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -344,26 +344,6 @@ static int do_setup_tx(int domain, int type, int protocol) return fd; } -static int do_process_zerocopy_cookies(struct sock_extended_err *serr, - uint32_t *ckbuf, size_t nbytes) -{ - int ncookies, i; - - if (serr->ee_errno != 0) - error(1, 0, "serr: wrong error code: %u", serr->ee_errno); - ncookies = serr->ee_data; - if (ncookies > SO_EE_ORIGIN_MAX_ZCOOKIES) - error(1, 0, "Returned %d cookies, max expected %d\n", - ncookies, SO_EE_ORIGIN_MAX_ZCOOKIES); - if (nbytes != ncookies * sizeof(uint32_t)) - error(1, 0, "Expected %d cookies, got %ld\n", - ncookies, nbytes/sizeof(uint32_t)); - for (i = 0; i < ncookies; i++) - if (cfg_verbose >= 2) - fprintf(stderr, "%d\n", ckbuf[i]); - return ncookies; -} - static bool do_recv_completion(int fd) { struct sock_extended_err *serr; @@ -372,17 +352,10 @@ static bool do_recv_completion(int fd) uint32_t hi, lo, range; int ret, zerocopy; char control[100]; - uint32_t ckbuf[SO_EE_ORIGIN_MAX_ZCOOKIES]; - struct iovec iov; msg.msg_control = control; msg.msg_controllen = sizeof(control); - iov.iov_base = ckbuf; - iov.iov_len = (SO_EE_ORIGIN_MAX_ZCOOKIES * sizeof(ckbuf[0])); - msg.msg_iov = &iov; - msg.msg_iovlen = 1; - ret = recvmsg(fd, &msg, MSG_ERRQUEUE); if (ret == -1 && errno == EAGAIN) return false; @@ -402,10 +375,6 @@ static bool 
do_recv_completion(int fd) serr = (void *) CMSG_DATA(cm); - if (serr->ee_origin == SO_EE_ORIGIN_ZCOOKIE) { - completions += do_process_zerocopy_cookies(serr, ckbuf, ret); - return true; - } if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) error(1, 0, "serr: wrong origin: %u", serr->ee_origin); if (serr->ee_errno != 0) @@ -631,40 +600,6 @@ static void do_flush_datagram(int fd, int type) bytes += cfg_payload_len; } - -static void do_recvmsg(int fd) -{ - int ret, off = 0; - char *buf; - struct iovec iov; - struct msghdr msg; - struct sockaddr_storage din; - - buf = calloc(cfg_payload_len, sizeof(char)); - iov.iov_base = buf; - iov.iov_len = cfg_payload_len; - - memset(&msg, 0, sizeof(msg)); - msg.msg_name = &din; - msg.msg_namelen = sizeof(din); - msg.msg_iov = &iov; - msg.msg_iovlen = 1; - - ret = recvmsg(fd, &msg, MSG_TRUNC); - - if (ret == -1) - error(1, errno, "recv"); - if (ret != cfg_payload_len) - error(1, 0, "recv: ret=%u != %u", ret, cfg_payload_len); - - if (memcmp(buf + off, payload, ret)) - error(1, 0, "recv: data mismatch"); - - free(buf); - packets++; - bytes += cfg_payload_len; -} - static void do_rx(int domain, int type, int protocol) { uint64_t tstop; @@ -676,8 +611,6 @@ static void do_rx(int domain, int type, int protocol) do { if (type == SOCK_STREAM) do_flush_tcp(fd); - else if (domain == PF_RDS) - do_recvmsg(fd); else do_flush_datagram(fd, type); -- 1.7.1
[PATCH V2 net-next 2/3] rds: deliver zerocopy completion notification with data
This commit is an optimization of the commit 01883eda72bd ("rds: support for zcopy completion notification") for PF_RDS sockets. RDS applications are predominantly request-response transactions, so it is more efficient to reduce the number of system calls and have zerocopy completion notification delivered as ancillary data on the POLLIN channel. Cookies are passed up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when the returned value of recvmsg() is greater than, or equal to, 0. A max of RDS_MAX_ZCOOKIES may be passed with each message. This commit removes support for zerocopy completion notification on MSG_ERRQUEUE for PF_RDS sockets. Signed-off-by: Sowmini Varadhan --- v2: remove sk_error_queue path; lots of cautionary checks in rds_recvmsg_zcookie() and callers to make sure we don't remove cookies from the queue and then fail to pass them up to the caller include/uapi/linux/errqueue.h |2 -- include/uapi/linux/rds.h |7 +++ net/rds/af_rds.c |7 +-- net/rds/message.c | 35 +-- net/rds/rds.h |2 ++ net/rds/recv.c | 34 +- 6 files changed, 60 insertions(+), 27 deletions(-) diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index 28812ed..dc64cfa 100644 --- a/include/uapi/linux/errqueue.h +++ b/include/uapi/linux/errqueue.h @@ -20,13 +20,11 @@ struct sock_extended_err { #define SO_EE_ORIGIN_ICMP6 3 #define SO_EE_ORIGIN_TXSTATUS 4 #define SO_EE_ORIGIN_ZEROCOPY 5 -#define SO_EE_ORIGIN_ZCOOKIE 6 #define SO_EE_ORIGIN_TIMESTAMPING SO_EE_ORIGIN_TXSTATUS #define SO_EE_OFFENDER(ee) ((struct sockaddr*)((ee)+1)) #define SO_EE_CODE_ZEROCOPY_COPIED 1 -#define SO_EE_ORIGIN_MAX_ZCOOKIES 8 /** * struct scm_timestamping - timestamps exposed through cmsg diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h index 12e3bca..a66b213 100644 --- a/include/uapi/linux/rds.h +++ b/include/uapi/linux/rds.h @@ -104,6 +104,7 @@ #define RDS_CMSG_MASKED_ATOMIC_CSWP 9 #define RDS_CMSG_RXPATH_LATENCY 11 #define RDS_CMSG_ZCOPY_COOKIE 12 
+#define RDS_CMSG_ZCOPY_COMPLETION 13 #define RDS_INFO_FIRST 1 #define RDS_INFO_COUNTERS 1 @@ -317,6 +318,12 @@ struct rds_rdma_notify { #define RDS_RDMA_DROPPED 3 #define RDS_RDMA_OTHER_ERROR 4 +#define RDS_MAX_ZCOOKIES 8 +struct rds_zcopy_cookies { + __u32 num; + __u32 cookies[RDS_MAX_ZCOOKIES]; +}; + /* * Common set of flags for all RDMA related structs */ diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c index a937f18..f712610 100644 --- a/net/rds/af_rds.c +++ b/net/rds/af_rds.c @@ -77,6 +77,7 @@ static int rds_release(struct socket *sock) rds_send_drop_to(rs, NULL); rds_rdma_drop_keys(rs); rds_notify_queue_get(rs, NULL); + __skb_queue_purge(&rs->rs_zcookie_queue); spin_lock_bh(&rds_sock_lock); list_del_init(&rs->rs_item); @@ -144,7 +145,7 @@ static int rds_getname(struct socket *sock, struct sockaddr *uaddr, * - to signal that a previously congested destination may have become * uncongested * - A notification has been queued to the socket (this can be a congestion - * update, or a RDMA completion). + * update, or a RDMA completion, or a MSG_ZEROCOPY completion). * * EPOLLOUT is asserted if there is room on the send queue. This does not mean * however, that the next sendmsg() call will succeed. 
If the application tries @@ -178,7 +179,8 @@ static __poll_t rds_poll(struct file *file, struct socket *sock, spin_unlock(&rs->rs_lock); } if (!list_empty(&rs->rs_recv_queue) || - !list_empty(&rs->rs_notify_queue)) + !list_empty(&rs->rs_notify_queue) || + !skb_queue_empty(&rs->rs_zcookie_queue)) mask |= (EPOLLIN | EPOLLRDNORM); if (rs->rs_snd_bytes < rds_sk_sndbuf(rs)) mask |= (EPOLLOUT | EPOLLWRNORM); @@ -513,6 +515,7 @@ static int __rds_create(struct socket *sock, struct sock *sk, int protocol) INIT_LIST_HEAD(&rs->rs_recv_queue); INIT_LIST_HEAD(&rs->rs_notify_queue); INIT_LIST_HEAD(&rs->rs_cong_list); + skb_queue_head_init(&rs->rs_zcookie_queue); spin_lock_init(&rs->rs_rdma_lock); rs->rs_rdma_keys = RB_ROOT; rs->rs_rx_traces = 0; diff --git a/net/rds/message.c b/net/rds/message.c index 6518345..2e8bdaf 100644 --- a/net/rds/message.c +++ b/net/rds/message.c @@ -58,32 +58,26 @@ void rds_message_addref(struct rds_message *rm) static inline bool skb_zcookie_add(struct sk_buff *skb, u32 cookie) { - struct sock_exterr_skb *serr = SKB_EXT_ERR(skb); - int ncookies; -