Re: Long delay on estimation_timer causes packet latency
Hello,

On Fri, 4 Dec 2020, dust.li wrote:

> > On 12/3/20 4:48 PM, Julian Anastasov wrote:
> >
> > - work will use spin_lock_bh(&s->lock) to protect the
> > entries, we do not want delays between /proc readers and
> > the work if using mutex. But _bh locks stop timers and
> > networking for short time :( Not sure yet if just spin_lock
> > is safe for both /proc and estimator's work.

	Here stopping BH may be not so fatal if some CPUs are used
for networking and others for workqueues.

> Thanks for sharing your thoughts !
>
> I think it's a good idea to split the est_list into different
> slots, I believe it will dramatically reduce the delay brought
> by estimation.

	268ms/64 => 4ms average. As the estimation with single work
does not utilize many CPUs simultaneously, this can be a problem for
30-40 services but this looks crazy.

> My only concern is the cost of the estimation when the number of
> services is large. Splitting the est_list won't reduce the real
> work to do.
>
> In our case, each estimation cost at most 268ms/2000ms, which is
> about 13% of one CPU hyper-thread, and this should be a common case
> in a large K8S cluster with lots of services.
>
> Since the estimation is not needed in our environment at all, it's
> just a waste of CPU resource. Have you ever consider add a switch to
> let the user turn the estimator off ?

	No problem to add a sysctl var for this, we usually add a
function to check which can be used in ip_vs_in_stats,
ip_vs_out_stats and ip_vs_conn_stats. If the switch can be changed at
any time, what should we do? Add/Del est entries as normally but do
not start the delayed work if the flag disables stats. When the flag
is enabled, counters will increase and we will start the delayed work.

Regards

--
Julian Anastasov
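The switch described here could look roughly like the following user-space sketch (all names — run_estimation, demo_in_stats, start_est_work — are illustrative stand-ins, not the kernel API): the stats helpers test a run-time flag, count nothing while it is off, and lazily start the delayed work on the first update after it is turned on.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative sketch, not kernel API: a run-time flag gates all
 * counter work, and the delayed estimator work is started lazily
 * by the first stats update after the flag is enabled. */

static atomic_int run_estimation;	/* the proposed sysctl switch */
static int est_work_started;

static void start_est_work(void)
{
	est_work_started = 1;		/* stands in for the delayed work */
}

struct demo_stats {
	long packets;
	long bytes;
};

/* Analogue of ip_vs_in_stats(): do nothing while the switch is off. */
static void demo_in_stats(struct demo_stats *s, long len)
{
	if (!atomic_load(&run_estimation))
		return;			/* estimator off: no counters, no work */
	s->packets++;
	s->bytes += len;
	if (!est_work_started)
		start_est_work();
}
```

This mirrors the proposal above: while the flag is off, the hot path costs one load and a branch, and no delayed work ever runs.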
Re: Long delay on estimation_timer causes packet latency
ives allow it for _rcu.

	- next option is to insert entries in some current list, if
their count reaches, say 128, then move to the next list for
inserting. This option tries to provide exact 2000ms delay for the
first estimation for the newly added entry.

	We can start with some implementation and see if your tests
are happy.

Regards

--
Julian Anastasov
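The per-list insertion scheme above can be modelled outside the kernel like this (NSLOTS, SLOT_LIMIT and est_insert are invented names; real code would keep hlists and process one slot per work tick, 2000ms / NSLOTS apart): new entries fill the current list until it holds 128 entries, then inserting moves to the next list.

```c
#include <assert.h>

/* Simplified model of the insertion scheme discussed above;
 * NSLOTS, SLOT_LIMIT and est_insert are invented names. */
#define NSLOTS		64
#define SLOT_LIMIT	128

struct est_slots {
	int count[NSLOTS];
	int cur;		/* list currently accepting inserts */
};

/* Returns the slot the new entry lands in; once the current list
 * holds SLOT_LIMIT entries, inserting moves to the next one. */
static int est_insert(struct est_slots *s)
{
	if (s->count[s->cur] >= SLOT_LIMIT)
		s->cur = (s->cur + 1) % NSLOTS;
	s->count[s->cur]++;
	return s->cur;
}
```

With 64 slots walked round-robin every tick, each tick touches only one list, which is where the 268ms/64 => ~4ms average in the discussion comes from.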
Re: [PATCH net v3] ipvs: fix possible memory leak in ip_vs_control_net_init
Hello, On Tue, 24 Nov 2020, Wang Hai wrote: > kmemleak report a memory leak as follows: > > BUG: memory leak > unreferenced object 0x8880759ea000 (size 256): > backtrace: > [<c0bf2deb>] kmem_cache_zalloc include/linux/slab.h:656 [inline] > [<c0bf2deb>] __proc_create+0x23d/0x7d0 fs/proc/generic.c:421 > [<9d718d02>] proc_create_reg+0x8e/0x140 fs/proc/generic.c:535 > [<97bbfc4f>] proc_create_net_data+0x8c/0x1b0 fs/proc/proc_net.c:126 > [<652480fc>] ip_vs_control_net_init+0x308/0x13a0 > net/netfilter/ipvs/ip_vs_ctl.c:4169 > [<4c927ebe>] __ip_vs_init+0x211/0x400 > net/netfilter/ipvs/ip_vs_core.c:2429 > [<aa6b72d9>] ops_init+0xa8/0x3c0 net/core/net_namespace.c:151 > [<153fd114>] setup_net+0x2de/0x7e0 net/core/net_namespace.c:341 > [<be4e4f07>] copy_net_ns+0x27d/0x530 net/core/net_namespace.c:482 > [<f1c23ec9>] create_new_namespaces+0x382/0xa30 kernel/nsproxy.c:110 > [<098a5757>] copy_namespaces+0x2e6/0x3b0 kernel/nsproxy.c:179 > [<26ce39e9>] copy_process+0x220a/0x5f00 kernel/fork.c:2072 > [<b71f4efe>] _do_fork+0xc7/0xda0 kernel/fork.c:2428 > [<2974ee96>] __do_sys_clone3+0x18a/0x280 kernel/fork.c:2703 > [<62ac0a4d>] do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 > [<93f1ce2c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > In the error path of ip_vs_control_net_init(), remove_proc_entry() needs > to be called to remove the added proc entry, otherwise a memory leak > will occur. > > Also, add some '#ifdef CONFIG_PROC_FS' because proc_create_net* return NULL > when PROC is not used. > > Fixes: b17fc9963f83 ("IPVS: netns, ip_vs_stats and its procfs") > Fixes: 61b1ab4583e2 ("IPVS: netns, add basic init per netns.") > Reported-by: Hulk Robot > Signed-off-by: Wang Hai Looks good to me, thanks! 
Acked-by: Julian Anastasov > --- > v2->v3: improve code format > v1->v2: add some '#ifdef CONFIG_PROC_FS' and check the return value of > proc_create_net* > net/netfilter/ipvs/ip_vs_ctl.c | 31 +-- > 1 file changed, 25 insertions(+), 6 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > index e279ded4e306..d45dbcba8b49 100644 > --- a/net/netfilter/ipvs/ip_vs_ctl.c > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > @@ -4167,12 +4167,18 @@ int __net_init ip_vs_control_net_init(struct > netns_ipvs *ipvs) > > spin_lock_init(&ipvs->tot_stats.lock); > > - proc_create_net("ip_vs", 0, ipvs->net->proc_net, &ip_vs_info_seq_ops, > - sizeof(struct ip_vs_iter)); > - proc_create_net_single("ip_vs_stats", 0, ipvs->net->proc_net, > - ip_vs_stats_show, NULL); > - proc_create_net_single("ip_vs_stats_percpu", 0, ipvs->net->proc_net, > - ip_vs_stats_percpu_show, NULL); > +#ifdef CONFIG_PROC_FS > + if (!proc_create_net("ip_vs", 0, ipvs->net->proc_net, > + &ip_vs_info_seq_ops, sizeof(struct ip_vs_iter))) > + goto err_vs; > + if (!proc_create_net_single("ip_vs_stats", 0, ipvs->net->proc_net, > + ip_vs_stats_show, NULL)) > + goto err_stats; > + if (!proc_create_net_single("ip_vs_stats_percpu", 0, > + ipvs->net->proc_net, > + ip_vs_stats_percpu_show, NULL)) > + goto err_percpu; > +#endif > > if (ip_vs_control_net_init_sysctl(ipvs)) > goto err; > @@ -4180,6 +4186,17 @@ int __net_init ip_vs_control_net_init(struct > netns_ipvs *ipvs) > return 0; > > err: > +#ifdef CONFIG_PROC_FS > + remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net); > + > +err_percpu: > + remove_proc_entry("ip_vs_stats", ipvs->net->proc_net); > + > +err_stats: > + remove_proc_entry("ip_vs", ipvs->net->proc_net); > + > +err_vs: > +#endif > free_percpu(ipvs->tot_stats.cpustats); > return -ENOMEM; > } > @@ -4188,9 +4205,11 @@ void __net_exit ip_vs_control_net_cleanup(struct > netns_ipvs *ipvs) > { > ip_vs_trash_cleanup(ipvs); > ip_vs_control_net_cleanup_sysctl(ipvs); > +#ifdef 
CONFIG_PROC_FS > remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net); > remove_proc_entry("ip_vs_stats", ipvs->net->proc_net); > remove_proc_entry("ip_vs", ipvs->net->proc_net); > +#endif > free_percpu(ipvs->tot_stats.cpustats); > } > > -- > 2.17.1 Regards -- Julian Anastasov
Re: [PATCH net] ipvs: fix possible memory leak in ip_vs_control_net_init
Hello, On Thu, 19 Nov 2020, Wang Hai wrote: > kmemleak report a memory leak as follows: > > BUG: memory leak > unreferenced object 0x8880759ea000 (size 256): > comm "syz-executor.3", pid 6484, jiffies 4297476946 (age 48.546s) > hex dump (first 32 bytes): > 00 00 00 00 01 00 00 00 08 a0 9e 75 80 88 ff ff ...u > 08 a0 9e 75 80 88 ff ff 00 00 00 00 ad 4e ad de ...u.N.. > backtrace: > [<c0bf2deb>] kmem_cache_zalloc include/linux/slab.h:656 [inline] > [<c0bf2deb>] __proc_create+0x23d/0x7d0 fs/proc/generic.c:421 > [<9d718d02>] proc_create_reg+0x8e/0x140 fs/proc/generic.c:535 > [<97bbfc4f>] proc_create_net_data+0x8c/0x1b0 fs/proc/proc_net.c:126 > [<652480fc>] ip_vs_control_net_init+0x308/0x13a0 > net/netfilter/ipvs/ip_vs_ctl.c:4169 > [<4c927ebe>] __ip_vs_init+0x211/0x400 > net/netfilter/ipvs/ip_vs_core.c:2429 > [<aa6b72d9>] ops_init+0xa8/0x3c0 net/core/net_namespace.c:151 > [<153fd114>] setup_net+0x2de/0x7e0 net/core/net_namespace.c:341 > [<be4e4f07>] copy_net_ns+0x27d/0x530 net/core/net_namespace.c:482 > [<f1c23ec9>] create_new_namespaces+0x382/0xa30 kernel/nsproxy.c:110 > [<098a5757>] copy_namespaces+0x2e6/0x3b0 kernel/nsproxy.c:179 > [<26ce39e9>] copy_process+0x220a/0x5f00 kernel/fork.c:2072 > [<b71f4efe>] _do_fork+0xc7/0xda0 kernel/fork.c:2428 > [<2974ee96>] __do_sys_clone3+0x18a/0x280 kernel/fork.c:2703 > [<62ac0a4d>] do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46 > [<93f1ce2c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > In the error path of ip_vs_control_net_init(), remove_proc_entry() needs > to be called to remove the added proc entry, otherwise a memory leak > will occur. 
> > Fixes: b17fc9963f83 ("IPVS: netns, ip_vs_stats and its procfs") > Fixes: 61b1ab4583e2 ("IPVS: netns, add basic init per netns.") > Reported-by: Hulk Robot > Signed-off-by: Wang Hai > --- > net/netfilter/ipvs/ip_vs_ctl.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > index e279ded4e306..d99bb89e7c25 100644 > --- a/net/netfilter/ipvs/ip_vs_ctl.c > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > @@ -4180,6 +4180,9 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs > *ipvs) > return 0; May be we should add some #ifdef CONFIG_PROC_FS because proc_create_net* return NULL when PROC is not used. For example: #ifdef CONFIG_PROC_FS if (!proc_create_net... goto err_vs; if (!proc_create_net... goto err_stats; ... #endif ... > err: #ifdef CONFIG_PROC_FS > + remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net); err_percpu: > + remove_proc_entry("ip_vs_stats", ipvs->net->proc_net); err_stats: > + remove_proc_entry("ip_vs", ipvs->net->proc_net); err_vs: #endif > free_percpu(ipvs->tot_stats.cpustats); > return -ENOMEM; > } > -- Regards -- Julian Anastasov
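The error path suggested here is the usual kernel goto-unwind ladder. A minimal user-space analogue (demo_init is a made-up name) shows the shape: each acquisition gets a label, and a failure jumps past its own cleanup so only what was already created is released, in reverse order.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space analogue of the goto-unwind error path: a failed step
 * jumps to the label just below its own cleanup, so earlier
 * allocations are freed in reverse order and nothing else is touched. */
static int demo_init(char **a, char **b, char **c)
{
	*a = malloc(16);
	if (!*a)
		goto err_a;
	*b = malloc(16);
	if (!*b)
		goto err_b;
	*c = malloc(16);
	if (!*c)
		goto err_c;
	return 0;

err_c:
	free(*b);
err_b:
	free(*a);
err_a:
	return -1;	/* the kernel version returns -ENOMEM */
}
```

The #ifdef CONFIG_PROC_FS wrapping in the patch follows the same idea: when proc is compiled out, both the creation calls and their matching unwind labels disappear together.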
Re: [PATCH] ipvs: replace atomic_add_return()
Hello, On Mon, 16 Nov 2020, Yejune Deng wrote: > atomic_inc_return() looks better > > Signed-off-by: Yejune Deng Looks good to me for -next, thanks! Acked-by: Julian Anastasov > --- > net/netfilter/ipvs/ip_vs_core.c | 2 +- > net/netfilter/ipvs/ip_vs_sync.c | 4 ++-- > 2 files changed, 3 insertions(+), 3 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > index c0b8215..54e086c 100644 > --- a/net/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -2137,7 +2137,7 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, > struct sk_buff *skb, > if (cp->flags & IP_VS_CONN_F_ONE_PACKET) > pkts = sysctl_sync_threshold(ipvs); > else > - pkts = atomic_add_return(1, &cp->in_pkts); > + pkts = atomic_inc_return(&cp->in_pkts); > > if (ipvs->sync_state & IP_VS_STATE_MASTER) > ip_vs_sync_conn(ipvs, cp, pkts); > diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c > index 16b4806..9d43277 100644 > --- a/net/netfilter/ipvs/ip_vs_sync.c > +++ b/net/netfilter/ipvs/ip_vs_sync.c > @@ -615,7 +615,7 @@ static void ip_vs_sync_conn_v0(struct netns_ipvs *ipvs, > struct ip_vs_conn *cp, > cp = cp->control; > if (cp) { > if (cp->flags & IP_VS_CONN_F_TEMPLATE) > - pkts = atomic_add_return(1, &cp->in_pkts); > + pkts = atomic_inc_return(&cp->in_pkts); > else > pkts = sysctl_sync_threshold(ipvs); > ip_vs_sync_conn(ipvs, cp, pkts); > @@ -776,7 +776,7 @@ void ip_vs_sync_conn(struct netns_ipvs *ipvs, struct > ip_vs_conn *cp, int pkts) > if (!cp) > return; > if (cp->flags & IP_VS_CONN_F_TEMPLATE) > - pkts = atomic_add_return(1, &cp->in_pkts); > + pkts = atomic_inc_return(&cp->in_pkts); > else > pkts = sysctl_sync_threshold(ipvs); > goto sloop; > -- > 1.9.1 Regards -- Julian Anastasov
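The change is purely cosmetic: atomic_inc_return(v) is shorthand for atomic_add_return(1, v). A small C11 sketch (add_return and inc_return are illustrative wrappers, not the kernel primitives) shows both forms returning the post-increment value.

```c
#include <assert.h>
#include <stdatomic.h>

/* C11 sketch of the equivalence behind the patch: both helpers
 * return the value *after* the increment, like the kernel's
 * atomic_add_return(1, v) and atomic_inc_return(v). */
static int add_return(atomic_int *v, int n)
{
	/* atomic_fetch_add() returns the old value, so add n back */
	return atomic_fetch_add(v, n) + n;
}

static int inc_return(atomic_int *v)
{
	return add_return(v, 1);
}
```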
Re: [PATCH RFC v3] ipvs: add genetlink cmd to dump all services and destinations
ctx->idx_dest--; > + ret = -EMSGSIZE; > + goto nla_nested_end; > + } > + } > + ctx->idx_dest = 0; > + ctx->start_dest = 0; > + > +nla_nested_end: > + nla_nest_end(skb, nl_dests); > + nla_nest_end(skb, nl_service); > + genlmsg_end(skb, hdr); > + return ret; > + > +nla_nested_failure: > + nla_nest_cancel(skb, nl_service); > + > +nla_put_failure: > + genlmsg_cancel(skb, hdr); > + > +out_err: > + ctx->idx_svc--; > + return -EMSGSIZE; > +} > + > +static int ip_vs_genl_dump_services_destinations(struct sk_buff *skb, > + struct netlink_callback *cb) > +{ > + struct dump_services_dests_ctx ctx = { > + .idx_svc = 0, > + .start_svc = cb->args[0], > + .idx_dest = 0, > + .start_dest = cb->args[1], > + }; > + struct net *net = sock_net(skb->sk); > + struct netns_ipvs *ipvs = net_ipvs(net); > + struct ip_vs_service *svc; > + struct nlattr *attrs[IPVS_CMD_ATTR_MAX + 1]; > + int tab = cb->args[2]; > + int row = cb->args[3]; > + > + mutex_lock(&__ip_vs_mutex); > + > + if (nlmsg_parse_deprecated(cb->nlh, GENL_HDRLEN, attrs, > +IPVS_CMD_ATTR_MAX, ip_vs_cmd_policy, > +cb->extack) == 0) { > + if (attrs[IPVS_CMD_ATTR_SERVICE]) { > + svc = ip_vs_genl_find_service(ipvs, > + > attrs[IPVS_CMD_ATTR_SERVICE]); > + if (IS_ERR_OR_NULL(svc)) > + goto out_err; > + ip_vs_genl_dump_service_dests(skb, cb, ipvs, svc, &ctx); > + goto nla_put_failure; May be we should use different name for above label, we are not deailing with nla in this function. 
> + } > + } > + > + if (tab >= 2) > + goto nla_put_failure; > + > + if (tab >= 1) > + goto tab_1; > + > + for (; row < IP_VS_SVC_TAB_SIZE; row++) { > + hlist_for_each_entry(svc, &ip_vs_svc_table[row], s_list) { > + if (ip_vs_genl_dump_service_dests(skb, cb, ipvs, > + svc, &ctx)) > + goto nla_put_failure; > + } > + ctx.idx_svc = 0; > + ctx.start_svc = 0; > + ctx.idx_dest = 0; > + ctx.start_dest = 0; > + } > + > + row = 0; > + tab++; > + > +tab_1: > + for (; row < IP_VS_SVC_TAB_SIZE; row++) { > + hlist_for_each_entry(svc, &ip_vs_svc_fwm_table[row], f_list) { > + if (ip_vs_genl_dump_service_dests(skb, cb, ipvs, > + svc, &ctx)) > + goto nla_put_failure; > + } > + ctx.idx_svc = 0; > + ctx.start_svc = 0; > + ctx.idx_dest = 0; > + ctx.start_dest = 0; > + } > + > + row = 0; > + tab++; > + > +nla_put_failure: > + cb->args[0] = ctx.idx_svc; > + cb->args[1] = ctx.idx_dest; > + cb->args[2] = tab; > + cb->args[3] = row; > + > +out_err: > + mutex_unlock(&__ip_vs_mutex); > + > + return skb->len; > +} > + > static int ip_vs_genl_parse_dest(struct ip_vs_dest_user_kern *udest, >struct nlattr *nla, bool full_entry) > { > @@ -3991,6 +4155,12 @@ static const struct genl_small_ops ip_vs_genl_ops[] = { > .flags = GENL_ADMIN_PERM, > .doit = ip_vs_genl_set_cmd, > }, > + { > + .cmd= IPVS_CMD_GET_SERVICE_DEST, > + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, > + .flags = GENL_ADMIN_PERM, > + .dumpit = ip_vs_genl_dump_services_destinations, > + }, > }; > > static struct genl_family ip_vs_genl_family __ro_after_init = { > -- > 2.25.1 Some comments for the ipvsadm patch. Please, separately post it next time here on the list with its own subject, so that we can comment it inline. - ipvs_get_services_dests(): #ifdef can be before declarations, try to use long-to-short lines (reverse xmas tree order for variables in declarations) - print_service_entry(): no need to check d before free(d), free() checks it itself, just like kfree() in kernel. 
	- ipvs_services_dests_parse_cb: we should stop if realloc()
fails, sadly, existing code does not check realloc() result but for
new code we should do it

	- ipvs_get_services_dests(): kernel avoids using assignments
in 'if' condition, we do the same for new code. You have to split
such code to assignment+condition.

	- there are extra parentheses in code such as
sizeof(*(get->index)), that should be fine instead:
sizeof(*get->index), same for sizeof(get->index[0]). Extra parens
also for &(get->dests), etc.

	- as new code runs only for LIBIPVS_USE_NL, check if it is
wrapped in proper #ifdef in libipvs/libipvs.c. Make sure ipvsadm
compiles without LIBIPVS_USE_NL.

	- the extern word should not be used in .h files anymore

	Some of the above styling issues are also reported by:

linux# scripts/checkpatch.pl --strict /tmp/ipvsadm.patch

	As we try to apply to ipvsadm the same styling rules that
are used for networking in kernel, you should be able to fix all
such places with help from checkpatch.pl. Probably, you know about
this file: Documentation/process/coding-style.rst

Regards

--
Julian Anastasov
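The realloc() point can be illustrated with a short sketch (grow is a hypothetical helper, not ipvsadm code): the result is kept in a temporary and only assigned after the NULL check, so the old buffer is neither leaked nor lost on failure, and the assignment stays out of the 'if' condition as requested.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper showing the requested realloc() pattern: keep
 * the result in a temporary, check it before overwriting the old
 * pointer, and keep the assignment out of the 'if' condition. */
static int grow(int **arr, size_t *cap, size_t newcap)
{
	int *tmp;

	tmp = realloc(*arr, newcap * sizeof(**arr));
	if (!tmp)
		return -1;	/* stop parsing; old buffer still valid */
	*arr = tmp;
	*cap = newcap;
	return 0;
}
```

Writing `*arr = realloc(*arr, ...)` directly would lose the only pointer to the old block on failure, which is exactly the leak the review warns about.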
Re: [PATCH RFC v2] ipvs: add genetlink cmd to dump all services and destinations
cb->nlh->nlmsg_seq, &ip_vs_genl_family, > + NLM_F_MULTI, IPVS_CMD_NEW_SERVICE); > + if (!hdr) > + goto out_err; > + > + nl_service = nla_nest_start_noflag(skb, IPVS_CMD_ATTR_SERVICE); > + if (!nl_service) > + goto nla_put_failure; > + > + if (ip_vs_genl_put_service_attrs(skb, svc)) > + goto nla_nested_failure; > + > + nl_dests = nla_nest_start_noflag(skb, IPVS_SVC_ATTR_DESTS); > + if (!nl_dests) > + goto nla_nested_failure; > + > + list_for_each_entry(dest, &svc->destinations, n_list) { > + if (++ctx->idx_dest <= ctx->start_dest) > + continue; > + if (ip_vs_genl_fill_dest(skb, IPVS_DESTS_ATTR_DEST, dest) < 0) { > + ctx->idx_svc--; > + ctx->idx_dest--; > + ret = -EMSGSIZE; > + goto nla_nested_end; > + } > + } > + ctx->idx_dest = 0; > + ctx->start_dest = 0; > + > +nla_nested_end: > + nla_nest_end(skb, nl_dests); > + nla_nest_end(skb, nl_service); > + genlmsg_end(skb, hdr); > + return ret; > + > +nla_nested_failure: > + nla_nest_cancel(skb, nl_service); > + > +nla_put_failure: > + genlmsg_cancel(skb, hdr); > + > +out_err: > + ctx->idx_svc--; > + return -EMSGSIZE; > +} > + > +static int ip_vs_genl_dump_services_destinations(struct sk_buff *skb, > + struct netlink_callback *cb) > +{ > + struct dump_services_dests_ctx ctx = { > + .idx_svc = 0, > + .start_svc = cb->args[0], > + .idx_dest = 0, > + .start_dest = cb->args[1], > + }; > + struct net *net = sock_net(skb->sk); > + struct netns_ipvs *ipvs = net_ipvs(net); > + struct ip_vs_service *svc = NULL; NULL not needed > + struct nlattr *attrs[IPVS_CMD_ATTR_MAX + 1]; > + int tab = cb->args[2]; > + int row = cb->args[3]; > + > + mutex_lock(&__ip_vs_mutex); > + > + if (nlmsg_parse_deprecated(cb->nlh, GENL_HDRLEN, attrs, > +IPVS_CMD_ATTR_MAX, ip_vs_cmd_policy, > +cb->extack) == 0) { > + if (attrs[IPVS_CMD_ATTR_SERVICE]) { > + svc = ip_vs_genl_find_service(ipvs, > + > attrs[IPVS_CMD_ATTR_SERVICE]); > + if (IS_ERR_OR_NULL(svc)) > + goto out_err; > + ip_vs_genl_dump_service_dests(skb, cb, ipvs, svc, &ctx); > + goto 
nla_put_failure; > + } > + } > + To make it more readable and to avoid lookup when at EOF we can start with the tab checks: if (tab >= 2) goto nla_put_failure; # or done if (tab >= 1) goto tab_1; for (; row < IP_VS_SVC_TAB_SIZE; row++) { > + for (; tab == 0 && row < IP_VS_SVC_TAB_SIZE; row++) { > + hlist_for_each_entry(svc, &ip_vs_svc_table[row], s_list) { > + if (ip_vs_genl_dump_service_dests(skb, cb, ipvs, > + svc, &ctx)) > + goto nla_put_failure; > + } > + ctx.idx_svc = 0; > + ctx.start_svc = 0; If we were at the middle of dests for the previous packet but now the svc and its dests are deleted, we have to reset the dest ptr too, otherwise we will skip dests in next row: ctx->idx_dest = 0; ctx->start_dest = 0; But any kind of modifications will show wrong results, so it does not matter much. > + } > + > + if (tab == 0) { > + row = 0; > + tab++; > + } > + row = 0; tab++; tab_1: > + for (; row < IP_VS_SVC_TAB_SIZE; row++) { > + hlist_for_each_entry(svc, &ip_vs_svc_fwm_table[row], f_list) { > + if (ip_vs_genl_dump_service_dests(skb, cb, ipvs, > + svc, &ctx)) > + goto nla_put_failure; > + } > + ctx.idx_svc = 0; > + ctx.start_svc = 0; ctx->idx_dest = 0; ctx->start_dest = 0; > + } row = 0;# Not needed tab++; $ tab = 2 to indicate EOF > + > +nla_put_failure: > + cb->args[0] = ctx.idx_svc; > + cb->args[1] = ctx.idx_dest; > + cb->args[2] = tab; > + cb->args[3] = row; > + > +out_err: > + mutex_unlock(&__ip_vs_mutex); > + > + return skb->len; > +} > + > static int ip_vs_genl_parse_dest(struct ip_vs_dest_user_kern *udest, >struct nlattr *nla, bool full_entry) > { > @@ -3991,6 +4143,12 @@ static const struct genl_small_ops ip_vs_genl_ops[] = { > .flags = GENL_ADMIN_PERM, > .doit = ip_vs_genl_set_cmd, > }, > + { > + .cmd= IPVS_CMD_GET_SERVICE_DEST, > + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, > + .flags = GENL_ADMIN_PERM, > + .dumpit = ip_vs_genl_dump_services_destinations, > + }, > }; > > static struct genl_family ip_vs_genl_family __ro_after_init = { 
> -- > 2.25.1 Regards -- Julian Anastasov
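The tab/row cursor kept in cb->args[] can be simulated in plain C (dump_step, struct cursor and TAB_SIZE are invented for the demo; a real dumpit walks hash tables and stops when the skb fills up): each call resumes at the saved (tab, row) position, and tab == 2 marks EOF.

```c
#include <assert.h>

/* Toy model of the resumable netlink dump discussed above: the
 * cursor plays the role of cb->args[], and budget the packet size.
 * Two "tables" of TAB_SIZE rows each are walked in order. */
#define TAB_SIZE 4

struct cursor {
	int tab;	/* 0, 1, or 2 for EOF */
	int row;
};

/* Emit up to budget rows across both tables; returns rows emitted.
 * The cursor is left pointing at the first row not yet emitted. */
static int dump_step(struct cursor *c, int budget)
{
	int emitted = 0;

	for (; c->tab < 2; c->tab++, c->row = 0) {
		for (; c->row < TAB_SIZE; c->row++) {
			if (emitted == budget)
				return emitted;	/* packet full, resume later */
			emitted++;
		}
	}
	return emitted;		/* tab == 2: EOF */
}
```

As in the review, resetting row when advancing tab (the `c->row = 0` in the outer loop) is what keeps the second table from being partially skipped.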
Re: [PATCH RFC] ipvs: add genetlink cmd to dump all services and destinations
Hello,

On Tue, 3 Nov 2020, Cezar Sá Espinola wrote:

> > And now what happens if all dests can not fit in a packet?
> > We should start next packet with the same svc? And then
> > user space should merge the dests when multiple packets
> > start with same service?
>
> My (maybe not so great) idea was to avoid repeating the svc on each
> packet. It's possible for a packet to start with a destination and
> user space must consider then as belonging to the last svc received on
> the previous packet. The comparison "ctx->last_svc != svc" was
> intended to ensure that a packet only starts with destinations if the
> current service is the same as the svc we sent on the previous packet.

	You can also consider the idea of having 3 coordinates for
start svc: idx_svc_tab (0 or 1), idx_svc_row (0..IP_VS_SVC_TAB_SIZE-1)
and idx_svc for index in row's chain. On new packet this will indicate
the htable and its row and we have to skip svcs in this row to find
our starting svc. I think, this will still fit in the netlink_callback's
args area. If not, we can always kmalloc our context in args[0].

	In single table, this should speedup the start svc lookup
128 times in average (we have 256 rows). In setup with 1024 svcs
(average 4 in each of the 256 rows) we should skip these 0..3 entries
instead of 512 in average.

> > last_svc is used out of __ip_vs_mutex region,
> > so it is not safe. We can get a reference count but this
> > is bad if user space blocks.
>
> I thought it would be relatively safe to store a pointer to the last
> svc since I would only use it for pointer comparison and never
> dereferencing it. But in retrospect it does look unsafe and fragile
> and could probably lead to errors especially if services are modified
> during a dump causing the stored pointer to point to a different
> service.

	Yes, nobody is using such pointers. We should create packets
that correctly identify svc for the dests. The drawback is that user
space may need more work for merging. We can always create a sorted
array of pointers to svcs, so that we can binary search with bsearch()
the svc from every received packet. Then we will know if this is a new
svc or an old one (with dests in multiple packets). Should we also
check for dest duplicates in the svc? The question is how much safe we
should play. In user space the max work we can do is to avoid
duplicates and to put dests to their correct svc.

> > But even if we use just indexes it should be ok.
> > If multiple agents are used in parallel it is not our
> > problem. What can happen is that we can send duplicates
> > or to skip entries (both svcs and dests). It is impossible
> > to keep any kind of references to current entries or even
> > keys to lookup them if another agent can remove them.
>
> Got it. I noticed this behavior while writing this patch and even
> created a few crude validation scripts running parallel agents and
> checking the diff in [1].

	Ok, make sure your tests cover cases with multiple dests,
so that single service occupies multiple packets, I'm not sure if
100 dests fit in one packet or not.

Regards

--
Julian Anastasov
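The bsearch() merge idea can be sketched as follows (svc_known and the int keys are placeholders for real service identities — protocol, address, port or fwmark): user space keeps a sorted array of services seen so far and binary-searches each service header from a received packet to tell a new service from a continuation.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch of the user-space merge step: keep a sorted array of the
 * service keys seen so far and bsearch() each service found in a
 * received packet. Plain ints stand in for real service identities. */
static int cmp_id(const void *key, const void *elem)
{
	int k = *(const int *)key;
	int e = *(const int *)elem;

	return (k > e) - (k < e);
}

/* Returns 1 if id was already seen (continuation), 0 if new. */
static int svc_known(const int *ids, size_t n, int id)
{
	return bsearch(&id, ids, n, sizeof(*ids), cmp_id) != NULL;
}
```

A continuation appends its dests to the existing service; a miss inserts the new key (keeping the array sorted) and starts a fresh entry — which is also the natural place to drop duplicate dests.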
Re: [PATCH RFC] ipvs: add genetlink cmd to dump all services and destinations
hedulers - every packet should contain info for svc, so that we can properly add dests to the right svc > + return -EMSGSIZE; > + } > + } > + > + return 0; > +} > + > +static int ip_vs_genl_dump_services_destinations(struct sk_buff *skb, > + struct netlink_callback *cb) > +{ > + /* Besides usual index based counters, saving a pointer to the last > + * dumped service is useful to ensure we only dump destinations that > + * belong to it, even when services are removed while the dump is still > + * running causing indexes to shift. > + */ > + struct dump_services_dests_ctx ctx = { > + .idx_svc = 0, > + .idx_dest = 0, > + .start_svc = cb->args[0], > + .start_dest = cb->args[1], > + .last_svc = (struct ip_vs_service *)(cb->args[2]), > + }; > + struct net *net = sock_net(skb->sk); > + struct netns_ipvs *ipvs = net_ipvs(net); > + struct ip_vs_service *svc = NULL; > + struct nlattr *attrs[IPVS_CMD_ATTR_MAX + 1]; > + int i; > + > + mutex_lock(&__ip_vs_mutex); > + > + if (nlmsg_parse_deprecated(cb->nlh, GENL_HDRLEN, attrs, > IPVS_CMD_ATTR_MAX, > +ip_vs_cmd_policy, cb->extack) == 0) { > + svc = ip_vs_genl_find_service(ipvs, > attrs[IPVS_CMD_ATTR_SERVICE]); > + > + if (!IS_ERR_OR_NULL(svc)) { > + ip_vs_genl_dump_service_destinations(skb, cb, svc, > &ctx); > + goto nla_put_failure; > + } > + } > + > + for (i = 0; i < IP_VS_SVC_TAB_SIZE; i++) { > + hlist_for_each_entry(svc, &ip_vs_svc_table[i], s_list) { > + if (svc->ipvs != ipvs) > + continue; > + if (ip_vs_genl_dump_service_destinations(skb, cb, svc, > &ctx) < 0) > + goto nla_put_failure; > + } > + } > + > + for (i = 0; i < IP_VS_SVC_TAB_SIZE; i++) { > + hlist_for_each_entry(svc, &ip_vs_svc_fwm_table[i], s_list) { > + if (svc->ipvs != ipvs) > + continue; > + if (ip_vs_genl_dump_service_destinations(skb, cb, svc, > &ctx) < 0) > + goto nla_put_failure; > + } > + } > + > +nla_put_failure: > + mutex_unlock(&__ip_vs_mutex); > + cb->args[0] = ctx.idx_svc; > + cb->args[1] = ctx.idx_dest; > + cb->args[2] = (long)ctx.last_svc; last_svc 
is used out of __ip_vs_mutex region, so it is not safe. We can get a reference count but this is bad if user space blocks. But even if we use just indexes it should be ok. If multiple agents are used in parallel it is not our problem. What can happen is that we can send duplicates or to skip entries (both svcs and dests). It is impossible to keep any kind of references to current entries or even keys to lookup them if another agent can remove them. > + > + return skb->len; > +} > + > static int ip_vs_genl_parse_dest(struct ip_vs_dest_user_kern *udest, >struct nlattr *nla, bool full_entry) > { > @@ -3991,6 +4094,12 @@ static const struct genl_small_ops ip_vs_genl_ops[] = { > .flags = GENL_ADMIN_PERM, > .doit = ip_vs_genl_set_cmd, > }, > + { > + .cmd= IPVS_CMD_GET_SERVICE_DEST, > + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP, > + .flags = GENL_ADMIN_PERM, > + .dumpit = ip_vs_genl_dump_services_destinations, > + }, > }; > > static struct genl_family ip_vs_genl_family __ro_after_init = { > -- Regards -- Julian Anastasov
Re: Fw: [Bug 209427] New: Incorrect timestamp cause packet to be dropped
Hello,

On Wed, 30 Sep 2020, Eric Dumazet wrote:

> On 9/29/20 7:35 PM, Stephen Hemminger wrote:
> >
> > then I noticed that in some cases skb->tstamp is equal to real ts whereas in
> > the regular cases where a packet pass through it's time since kernel boot. This
> > doesn't make any sense for me as this condition is satisfied constantly
> >
> > net/sched/sch_fq.c:439
> > static bool fq_packet_beyond_horizon(const struct sk_buff *skb,
> >                                      const struct fq_sched_data *q)
> > {
> >         return unlikely((s64)skb->tstamp > (s64)(q->ktime_cache + q->horizon));
> > }
> >
> > Any ideas on what it can be?
>
> Thanks for the detailed report !
>
> I suspect ipvs or bridge code needs something similar to the fixes done in
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de20900fbe1c4fd36de25a7a5a43223254ecf0d0
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=41d1c8839e5f8cb781cc635f12791decee8271b7
>
> The reason for that is that skb->tstamp can get a timestamp in input path,
> with a base which is not CLOCK_MONOTONIC, unfortunately.
>
> Whenever a packet is forwarded, its tstamp must be cleared.
>
> Can you try :
>
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index b00866d777fe0e9ed8018087ebc664c56f29b5c9..11e8ccdae358a89067046efa62ed40308b9e06f9 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -952,6 +952,8 @@ ip_vs_prepare_tunneled_skb(struct sk_buff *skb, int skb_af,
>
>  	ip_vs_drop_early_demux_sk(skb);
>
> +	skb->tstamp = 0;
> +
>  	if (skb_headroom(skb) < max_headroom || skb_cloned(skb)) {
>  		new_skb = skb_realloc_headroom(skb, max_headroom);
>  		if (!new_skb)

	Should be after all skb_forward_csum() calls in ip_vs_xmit.c

Regards

--
Julian Anastasov
Re: [PATCH v5] ipvs: adjust the debug info in function set_tcp_state
Hello, On Mon, 28 Sep 2020, longguang.yue wrote: > Outputting client,virtual,dst addresses info when tcp state changes, > which makes the connection debug more clear > > Signed-off-by: longguang.yue OK, v5 can be used instead of fixing v4. Acked-by: Julian Anastasov > --- longguang.yue, at this place after --- you can add info for changes between versions, eg: v5: fix indentation Use this for other patches, so that we know what is changed between versions. > net/netfilter/ipvs/ip_vs_proto_tcp.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c > b/net/netfilter/ipvs/ip_vs_proto_tcp.c > index dc2e7da2742a..7da51390cea6 100644 > --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c > +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c > @@ -539,8 +539,8 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct > ip_vs_conn *cp, > if (new_state != cp->state) { > struct ip_vs_dest *dest = cp->dest; > > - IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] %s:%d->" > - "%s:%d state: %s->%s conn->refcnt:%d\n", > + IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] c:%s:%d v:%s:%d " > + "d:%s:%d state: %s->%s conn->refcnt:%d\n", > pd->pp->name, > ((state_off == TCP_DIR_OUTPUT) ? > "output " : "input "), > @@ -548,10 +548,12 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct > ip_vs_conn *cp, > th->fin ? 'F' : '.', > th->ack ? 'A' : '.', > th->rst ? 'R' : '.', > - IP_VS_DBG_ADDR(cp->daf, &cp->daddr), > - ntohs(cp->dport), > IP_VS_DBG_ADDR(cp->af, &cp->caddr), > ntohs(cp->cport), > + IP_VS_DBG_ADDR(cp->af, &cp->vaddr), > + ntohs(cp->vport), > + IP_VS_DBG_ADDR(cp->daf, &cp->daddr), > + ntohs(cp->dport), > tcp_state_name(cp->state), > tcp_state_name(new_state), > refcount_read(&cp->refcnt)); > -- > 2.20.1 (Apple Git-117) Regards -- Julian Anastasov
Re: [PATCH v4] ipvs: adjust the debug info in function set_tcp_state
Hello, On Sun, 27 Sep 2020, longguang.yue wrote: > outputting client,virtual,dst addresses info when tcp state changes, > which makes the connection debug more clear > > Signed-off-by: longguang.yue Looks good to me, thanks! Acked-by: Julian Anastasov Simon, Pablo, may be commit description should not be indented... > --- > net/netfilter/ipvs/ip_vs_proto_tcp.c | 10 ++ > 1 file changed, 6 insertions(+), 4 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c > b/net/netfilter/ipvs/ip_vs_proto_tcp.c > index dc2e7da2742a..7da51390cea6 100644 > --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c > +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c > @@ -539,8 +539,8 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct > ip_vs_conn *cp, > if (new_state != cp->state) { > struct ip_vs_dest *dest = cp->dest; > > - IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] %s:%d->" > - "%s:%d state: %s->%s conn->refcnt:%d\n", > + IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] c:%s:%d v:%s:%d " > + "d:%s:%d state: %s->%s conn->refcnt:%d\n", > pd->pp->name, > ((state_off == TCP_DIR_OUTPUT) ? > "output " : "input "), > @@ -548,10 +548,12 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct > ip_vs_conn *cp, > th->fin ? 'F' : '.', > th->ack ? 'A' : '.', > th->rst ? 'R' : '.', > - IP_VS_DBG_ADDR(cp->daf, &cp->daddr), > - ntohs(cp->dport), > IP_VS_DBG_ADDR(cp->af, &cp->caddr), > ntohs(cp->cport), > + IP_VS_DBG_ADDR(cp->af, &cp->vaddr), > + ntohs(cp->vport), > + IP_VS_DBG_ADDR(cp->daf, &cp->daddr), > + ntohs(cp->dport), > tcp_state_name(cp->state), > tcp_state_name(new_state), > refcount_read(&cp->refcnt)); > -- > 2.20.1 (Apple Git-117) Regards -- Julian Anastasov
Re: [PATCHv5 net-next] ipvs: remove dependency on ip6_tables
Hello, On Sat, 29 Aug 2020, Yaroslav Bolyukin wrote: > This dependency was added because ipv6_find_hdr was in iptables specific > code but is no longer required > > Fixes: f8f626754ebe ("ipv6: Move ipv6_find_hdr() out of Netfilter code.") > Fixes: 63dca2c0b0e7 ("ipvs: Fix faulty IPv6 extension header handling in > IPVS"). > Signed-off-by: Yaroslav Bolyukin Looks good to me, thanks! May be maintainers will remove the extra dot after the Fixes line. Acked-by: Julian Anastasov > --- > Missed canonical patch format section, subsystem is now spevified > > include/net/ip_vs.h| 3 --- > net/netfilter/ipvs/Kconfig | 1 - > 2 files changed, 4 deletions(-) > > diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h > index 9a59a3378..d609e957a 100644 > --- a/include/net/ip_vs.h > +++ b/include/net/ip_vs.h > @@ -25,9 +25,6 @@ > #include > #include /* for struct ipv6hdr */ > #include > -#if IS_ENABLED(CONFIG_IP_VS_IPV6) > -#include > -#endif > #if IS_ENABLED(CONFIG_NF_CONNTRACK) > #include > #endif > diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig > index 2c1593089..eb0e329f9 100644 > --- a/net/netfilter/ipvs/Kconfig > +++ b/net/netfilter/ipvs/Kconfig > @@ -29,7 +29,6 @@ if IP_VS > config IP_VS_IPV6 > bool "IPv6 support for IPVS" > depends on IPV6 = y || IP_VS = IPV6 > - select IP6_NF_IPTABLES > select NF_DEFRAG_IPV6 > help > Add IPv6 support to IPVS. > -- > 2.28.0 Regards -- Julian Anastasov
Re: [PATCH] Remove ipvs v6 dependency on iptables
Hello,

On Sat, 29 Aug 2020, Yaroslav Bolyukin wrote:

> This dependency was added as part of commit ecefa32ffda201975
> ("ipvs: Fix faulty IPv6 extension header handling in IPVS"), because it
> had dependency on ipv6_find_hdr, which was located in iptables-specific
> code
>
> But it is no longer required after commit e6f890cfde0e74d5b
> ("ipv6:Move ipv6_find_hdr() out of Netfilter code.")
>
> Also remove ip6tables include from ip_vs
>
> Signed-off-by: Yaroslav Bolyukin

	The commit you reference is better added as a special tag, eg:

Fixes: f8f626754ebe ("ipv6: Move ipv6_find_hdr() out of Netfilter code.")

before the Signed-off-by line. Then you may skip mentioning the commit in the description; it will be in the Fixes tag. Note that the first 12 chars of the commit id are used, not the last. A second Fixes line can be for 63dca2c0b0e7 ("ipvs: Fix faulty IPv6 extension header handling in IPVS"). Neither Fixes line should be wrapped.

	The Subject line needs to include version and tree, for example:

[PATCHv2 net-next] ipvs: remove v6 dependency on iptables

You increase the version when sending a modified patch. You can check the Documentation/process/submitting-patches.rst guide for more info. 
> --- > include/net/ip_vs.h| 3 --- > net/netfilter/ipvs/Kconfig | 1 - > 2 files changed, 4 deletions(-) > > diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h > index 9a59a3378..d609e957a 100644 > --- a/include/net/ip_vs.h > +++ b/include/net/ip_vs.h > @@ -25,9 +25,6 @@ > #include > #include /* for struct ipv6hdr */ > #include > -#if IS_ENABLED(CONFIG_IP_VS_IPV6) > -#include > -#endif > #if IS_ENABLED(CONFIG_NF_CONNTRACK) > #include > #endif > diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig > index 2c1593089..eb0e329f9 100644 > --- a/net/netfilter/ipvs/Kconfig > +++ b/net/netfilter/ipvs/Kconfig > @@ -29,7 +29,6 @@ if IP_VS > config IP_VS_IPV6 > bool "IPv6 support for IPVS" > depends on IPV6 = y || IP_VS = IPV6 > - select IP6_NF_IPTABLES > select NF_DEFRAG_IPV6 > help > Add IPv6 support to IPVS. > -- Regards -- Julian Anastasov
Re: [PATCH] Remove ipvs v6 dependency on iptables
Hello, On Fri, 28 Aug 2020, Lach wrote: > This dependency was added in 63dca2c0b0e7a92cb39d1b1ecefa32ffda201975, > because this commit had dependency on > ipv6_find_hdr, which was located in iptables-specific code > > But it is no longer required, because > f8f626754ebeca613cf1af2e6f890cfde0e74d5b moved them to a more common location May be then we should also not include ip6_tables.h from include/net/ip_vs.h ? > --- > net/netfilter/ipvs/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig > index 2c1593089..eb0e329f9 100644 > --- a/net/netfilter/ipvs/Kconfig > +++ b/net/netfilter/ipvs/Kconfig > @@ -29,7 +29,6 @@ if IP_VS > config IP_VS_IPV6 > bool "IPv6 support for IPVS" > depends on IPV6 = y || IP_VS = IPV6 > - select IP6_NF_IPTABLES > select NF_DEFRAG_IPV6 > help > Add IPv6 support to IPVS. > -- > 2.28.0 Regards -- Julian Anastasov
Re: [Linux-kernel-mentees] [PATCH net-next v2] ipvs: Fix uninit-value in do_ip_vs_set_ctl()
Hello, On Tue, 11 Aug 2020, Peilin Ye wrote: > do_ip_vs_set_ctl() is referencing uninitialized stack value when `len` is > zero. Fix it. > > Reported-by: syzbot+23b5f9e7caf61d9a3...@syzkaller.appspotmail.com > Link: > https://syzkaller.appspot.com/bug?id=46ebfb92a8a812621a001ef04d90dfa459520fe2 > Suggested-by: Julian Anastasov > Signed-off-by: Peilin Ye Looks good to me, thanks! Acked-by: Julian Anastasov > --- > Changes in v2: > - Target net-next tree. (Suggested by Julian Anastasov ) > - Reject all `len == 0` requests except `IP_VS_SO_SET_FLUSH`, instead > of initializing `arg`. (Suggested by Cong Wang > , Julian Anastasov ) > > net/netfilter/ipvs/ip_vs_ctl.c | 7 --- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > index 412656c34f20..beeafa42aad7 100644 > --- a/net/netfilter/ipvs/ip_vs_ctl.c > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > @@ -2471,6 +2471,10 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user > *user, unsigned int len) > /* Set timeout values for (tcp tcpfin udp) */ > ret = ip_vs_set_timeout(ipvs, (struct ip_vs_timeout_user *)arg); > goto out_unlock; > + } else if (!len) { > + /* No more commands with len == 0 below */ > + ret = -EINVAL; > + goto out_unlock; > } > > usvc_compat = (struct ip_vs_service_user *)arg; > @@ -2547,9 +2551,6 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user > *user, unsigned int len) > break; > case IP_VS_SO_SET_DELDEST: > ret = ip_vs_del_dest(svc, &udest); > - break; > - default: > - ret = -EINVAL; > } > >out_unlock: > -- > 2.25.1 Regards -- Julian Anastasov
Re: [Linux-kernel-mentees] [PATCH net] ipvs: Fix uninit-value in do_ip_vs_set_ctl()
Hello,

On Tue, 11 Aug 2020, Peilin Ye wrote:

> On Mon, Aug 10, 2020 at 08:57:19PM -0700, Cong Wang wrote:
> > On Mon, Aug 10, 2020 at 3:10 PM Peilin Ye wrote:
> > >
> > > do_ip_vs_set_ctl() is referencing uninitialized stack value when `len` is
> > > zero. Fix it.
> >
> > Which exact 'cmd' is it here?
> >
> > I _guess_ it is one of those uninitialized in set_arglen[], which is 0.
>
> Yes, it was `IP_VS_SO_SET_NONE`, implicitly initialized to zero.
>
> > But if that is the case, should it be initialized to
> > sizeof(struct ip_vs_service_user) instead because ip_vs_copy_usvc_compat()
> > is called anyway. Or, maybe we should just ban len==0 case.
>
> I see. I think the latter would be easier, but we cannot ban all of
> them, since the function does something with `IP_VS_SO_SET_FLUSH`, which
> is a `len == 0` case.
>
> Maybe we do something like this?

	Yes, only IP_VS_SO_SET_FLUSH uses len 0. We can go with this change, but you do not need to target the net tree; as the problem is not fatal, net-next works too. What happens is that we may look up services with random search keys, which is harmless.

	Another option is to add a new block after this one:

	} else if (cmd == IP_VS_SO_SET_TIMEOUT) {
		/* Set timeout values for (tcp tcpfin udp) */
		ret = ip_vs_set_timeout(ipvs, (struct ip_vs_timeout_user *)arg);
		goto out_unlock;
	}

such as:

	} else if (!len) {
		/* No more commands with len=0 below */
		ret = -EINVAL;
		goto out_unlock;
	}

	It gives more chance for future commands to use len=0, but the drawback is that the check happens under the mutex. 
So, I'm fine with both versions, it is up to you to decide :) > @@ -2432,6 +2432,8 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user > *user, unsigned int len) > > if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX) > return -EINVAL; > + if (len == 0 && cmd != IP_VS_SO_SET_FLUSH) > + return -EINVAL; > if (len != set_arglen[CMDID(cmd)]) { > IP_VS_DBG(1, "set_ctl: len %u != %u\n", > len, set_arglen[CMDID(cmd)]); > @@ -2547,9 +2549,6 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user > *user, unsigned int len) > break; > case IP_VS_SO_SET_DELDEST: > ret = ip_vs_del_dest(svc, &udest); > - break; > - default: > - ret = -EINVAL; > } > >out_unlock: Regards -- Julian Anastasov
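The validation pattern discussed in the two messages above can be sketched in userspace C. This is only an illustration of the idea (reject all zero-length requests except the one flush-style command whose table entry is legitimately 0); the command IDs, struct sizes and the check_set_arglen() helper below are made up for the example and are not the real IPVS definitions.

```c
#include <stddef.h>

/* Illustrative command IDs; SET_NONE models IP_VS_SO_SET_NONE, whose
 * table entry is implicitly zero -- the case syzbot hit. */
enum { SET_NONE, SET_FLUSH, SET_ADD, SET_TIMEOUT, SET_MAX };

struct service_user { int addr, port; };
struct timeout_user { int tcp, tcp_fin, udp; };

/* Expected argument length per command, mirroring set_arglen[]. */
const size_t set_arglen[SET_MAX] = {
	[SET_FLUSH]   = 0,                           /* the only len==0 command */
	[SET_ADD]     = sizeof(struct service_user),
	[SET_TIMEOUT] = sizeof(struct timeout_user),
};

/* Return 0 if (cmd, len) is acceptable, -1 otherwise. */
int check_set_arglen(int cmd, size_t len)
{
	if (cmd < 0 || cmd >= SET_MAX)
		return -1;
	if (len != set_arglen[cmd])
		return -1;   /* wrong size for this command */
	if (len == 0 && cmd != SET_FLUSH)
		return -1;   /* ban all other len==0 requests */
	return 0;
}
```

With this shape, SET_NONE is rejected even though its table entry matches len==0, which is exactly what the early-check variant of the patch achieves.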
Re: [PATCH] ipvs: add missing struct name in ip_vs_enqueue_expire_nodest_conns when CONFIG_SYSCTL is disabled
Hello, On Wed, 22 Jul 2020, Pablo Neira Ayuso wrote: > On Fri, Jul 17, 2020 at 08:36:36PM +0300, Julian Anastasov wrote: > > > > On Fri, 17 Jul 2020, Andrew Sy Kim wrote: > > > > > Adds missing "*ipvs" to ip_vs_enqueue_expire_nodest_conns when > > > CONFIG_SYSCTL is disabled > > > > > > Signed-off-by: Andrew Sy Kim > > > > Acked-by: Julian Anastasov > > > > Pablo, please apply this too. > > I have squashed this fix and "ipvs: ensure RCU read unlock when > connection flushing and ipvs is disabled" into the original patch: > > "ipvs: queue delayed work to expire no destination connections if > expire_nodest_conn=1" Very good, thanks! Regards -- Julian Anastasov
Re: [PATCH,v2] ipvs: fix the connection sync failed in some cases
Hello, On Thu, 16 Jul 2020, guodeqing wrote: > The sync_thread_backup only checks sk_receive_queue is empty or not, > there is a situation which cannot sync the connection entries when > sk_receive_queue is empty and sk_rmem_alloc is larger than sk_rcvbuf, > the sync packets are dropped in __udp_enqueue_schedule_skb, this is > because the packets in reader_queue is not read, so the rmem is > not reclaimed. > > Here I add the check of whether the reader_queue of the udp sock is > empty or not to solve this problem. > > Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception") > Reported-by: zhouxudong > Signed-off-by: guodeqing Looks good to me, thanks! Acked-by: Julian Anastasov Simon, Pablo, this patch should be applied to the nf tree. As the reader_queue appears in 4.13, this patch can be backported to 4.14, 4.19, 5.4, etc, they all have skb_queue_empty_lockless. > --- > net/netfilter/ipvs/ip_vs_sync.c | 12 > 1 file changed, 8 insertions(+), 4 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c > index 605e0f6..2b8abbf 100644 > --- a/net/netfilter/ipvs/ip_vs_sync.c > +++ b/net/netfilter/ipvs/ip_vs_sync.c > @@ -1717,6 +1717,8 @@ static int sync_thread_backup(void *data) > { > struct ip_vs_sync_thread_data *tinfo = data; > struct netns_ipvs *ipvs = tinfo->ipvs; > + struct sock *sk = tinfo->sock->sk; > + struct udp_sock *up = udp_sk(sk); > int len; > > pr_info("sync thread started: state = BACKUP, mcast_ifn = %s, " > @@ -1724,12 +1726,14 @@ static int sync_thread_backup(void *data) > ipvs->bcfg.mcast_ifn, ipvs->bcfg.syncid, tinfo->id); > > while (!kthread_should_stop()) { > - wait_event_interruptible(*sk_sleep(tinfo->sock->sk), > - !skb_queue_empty(&tinfo->sock->sk->sk_receive_queue) > - || kthread_should_stop()); > + wait_event_interruptible(*sk_sleep(sk), > + > !skb_queue_empty_lockless(&sk->sk_receive_queue) || > + > !skb_queue_empty_lockless(&up->reader_queue) || > + kthread_should_stop()); > > /* do we 
have data now? */ > - while (!skb_queue_empty(&(tinfo->sock->sk->sk_receive_queue))) { > + while (!skb_queue_empty_lockless(&sk->sk_receive_queue) || > +!skb_queue_empty_lockless(&up->reader_queue)) { > len = ip_vs_receive(tinfo->sock, tinfo->buf, > ipvs->bcfg.sync_maxlen); > if (len <= 0) { > -- > 2.7.4 Regards -- Julian Anastasov
Re: [PATCH] ipvs: add missing struct name in ip_vs_enqueue_expire_nodest_conns when CONFIG_SYSCTL is disabled
Hello, On Fri, 17 Jul 2020, Andrew Sy Kim wrote: > Adds missing "*ipvs" to ip_vs_enqueue_expire_nodest_conns when > CONFIG_SYSCTL is disabled > > Signed-off-by: Andrew Sy Kim Acked-by: Julian Anastasov Pablo, please apply this too. > --- > include/net/ip_vs.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h > index 91a9e1d590a6..9a59a33787cb 100644 > --- a/include/net/ip_vs.h > +++ b/include/net/ip_vs.h > @@ -1533,7 +1533,7 @@ static inline void > ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs *ipvs) > > void ip_vs_expire_nodest_conn_flush(struct netns_ipvs *ipvs); > #else > -static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs) {} > +static inline void ip_vs_enqueue_expire_nodest_conns(struct netns_ipvs > *ipvs) {} > #endif > > #define IP_VS_DFWD_METHOD(dest) (atomic_read(&(dest)->conn_flags) & \ > -- > 2.20.1 Regards -- Julian Anastasov
Re: [PATCH] ipvs: fix the connection sync failed in some cases
Hello, On Wed, 15 Jul 2020, guodeqing wrote: > The sync_thread_backup only checks sk_receive_queue is empty or not, > there is a situation which cannot sync the connection entries when > sk_receive_queue is empty and sk_rmem_alloc is larger than sk_rcvbuf, > the sync packets are dropped in __udp_enqueue_schedule_skb, this is > because the packets in reader_queue is not read, so the rmem is > not reclaimed. Good catch. We missed this change in UDP... > Here I add the check of whether the reader_queue of the udp sock is > empty or not to solve this problem. > > Fixes: 7c13f97ffde6 ("udp: do fwd memory scheduling on dequeue") Why this commit and not 2276f58ac589 which adds reader_queue to udp_poll() ? May be both? > Reported-by: zhouxudong > Signed-off-by: guodeqing > --- > net/netfilter/ipvs/ip_vs_sync.c | 12 > 1 file changed, 8 insertions(+), 4 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c > index 605e0f6..abe8d63 100644 > --- a/net/netfilter/ipvs/ip_vs_sync.c > +++ b/net/netfilter/ipvs/ip_vs_sync.c > @@ -1717,6 +1717,8 @@ static int sync_thread_backup(void *data) > { > struct ip_vs_sync_thread_data *tinfo = data; > struct netns_ipvs *ipvs = tinfo->ipvs; > + struct sock *sk = tinfo->sock->sk; > + struct udp_sock *up = udp_sk(sk); > int len; > > pr_info("sync thread started: state = BACKUP, mcast_ifn = %s, " > @@ -1724,12 +1726,14 @@ static int sync_thread_backup(void *data) > ipvs->bcfg.mcast_ifn, ipvs->bcfg.syncid, tinfo->id); > > while (!kthread_should_stop()) { > - wait_event_interruptible(*sk_sleep(tinfo->sock->sk), > - !skb_queue_empty(&tinfo->sock->sk->sk_receive_queue) > - || kthread_should_stop()); > + wait_event_interruptible(*sk_sleep(sk), > + > !skb_queue_empty(&sk->sk_receive_queue) || > + !skb_queue_empty(&up->reader_queue) || May be we should use skb_queue_empty_lockless for 5.4+ and skb_queue_empty() for backports to 4.14 and 4.19... > + kthread_should_stop()); > > /* do we have data now? 
*/ > - while (!skb_queue_empty(&(tinfo->sock->sk->sk_receive_queue))) { > + while (!skb_queue_empty(&sk->sk_receive_queue) || > +!skb_queue_empty(&up->reader_queue)) { Here too > len = ip_vs_receive(tinfo->sock, tinfo->buf, > ipvs->bcfg.sync_maxlen); > if (len <= 0) { > -- > 2.7.4 Regards -- Julian Anastasov
Re: [PATCH] ipvs: avoid drop first packet to reuse conntrack
et/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -2086,11 +2086,11 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, > struct sk_buff *skb, > } > > if (resched) { > + if (uses_ct) > + cp->flags &= ~IP_VS_CONN_F_NFCT; > if (!atomic_read(&cp->n_control)) > ip_vs_conn_expire_now(cp); > __ip_vs_conn_put(cp); > - if (uses_ct) > - return NF_DROP; > cp = NULL; > } > } > -- Regards -- Julian Anastasov
Re: TCP_DEFER_ACCEPT wakes up without data
Hello, On Thu, 4 Jun 2020, Christoph Paasch wrote: > On Wed, Jun 11, 2014 at 11:05 PM Julian Anastasov wrote: > > > > > > > The behavior that we want is for the receipt of the duplicate bare > > > ACK to not result in waking up user space. The socket still hasn't > > > received any data, so there's no point in the process accepting the > > > socket since there's nothing the process can do. > > > > One problem with this behavior is that after first ACK > > more ACKs are not expected. Your RST logic still relies on the > > last SYN+ACK we sent to trigger additional ACK. I guess, > > we can live with this because for firewalls it is not worse > > than current behavior. We replace accept() with RST. > > > > > I would prefer that we send a RST upon receipt of a bare ACK for a > > > socket that has completed the 3way handshake, waited the > > > TCP_DEFER_ACCEPT timeout and has never received any > > > data. If it has timed out, then the server should be done with the > > > connection request. > > > > I'm ok with this idea. > > Is there any specific reason as to why we would not want to do this? > > API-breakage does not seem to me to be a concern here. Apps that are > setting DEFER_ACCEPT probably would not expect to get a socket that > does not have data to read. There are older threads that explain why we create sockets on bare ACK: https://marc.info/?t=12554106291&r=1&w=2 Starting message: https://marc.info/?l=linux-netdev&m=125541058019459&w=2 In short, it is better for firewalls when we do not drop silently the request socket. For now we do not have a way to specify that we do not want child sockets without DATA. TL;DR Here is how TCP_DEFER_ACCEPT works for server, with pseudo-states. TCP_DEFER_ACCEPT for server uses: - num_timeout: increments when it is time to send SYN+ACK - num_retrans: increments when SYN+ACK is sent - TCP_DEFER_ACCEPT (seconds) is converted to retransmission periods. 
The possible time for SYN+ACK retransmissions is calculated as the maximum of SYNCNT/tcp_synack_retries and the TCP_DEFER_ACCEPT periods, see reqsk_timer_handler().
- when bare ACK is received we avoid resending SYN+ACKs, we start to resend just when the last TCP_DEFER_ACCEPT period is started, see syn_ack_recalc().

Pseudo States for SYN_RECV:

WAIT_AND_RESEND:
- when:
  - SYN is received
- [re]send SYN+ACK depending on SYNCNT (at least for TCP_DEFER_ACCEPT time, if set), the retransmission period is max-of(TCP_DEFER_ACCEPT periods, SYNCNT)
- on received DATA => create socket
- on received bare ACK:
  - if using TCP_DEFER_ACCEPT => enter WAIT_SILENTLY
  - if not using TCP_DEFER_ACCEPT => create socket

WAIT_SILENTLY:
- when:
  - using TCP_DEFER_ACCEPT
  - bare ACK received (indicates that client received our SYN+ACK, so no need to resend anymore)
- do not send SYN+ACK
- on received DATA => create socket
- on received bare ACK: drop it (do not resend on packet)
- when the last TCP_DEFER_ACCEPT period starts enter REFRESH_STATE

REFRESH_STATE:
- when:
  - last TCP_DEFER_ACCEPT period starts after bare ACK
- resend depending on SYNCNT to update the connection state in client and stateful firewalls. Such packets will trigger DATA (hopefully), ACK, FIN, RST, etc. When SYNCNT is less than the TCP_DEFER_ACCEPT period, we will send single SYN+ACK, otherwise more SYN+ACKs will be sent (up to SYNCNT).
- on received DATA => create socket
- on received bare ACK: create socket
- no more resends (SYNCNT) => Drop req.

Programming hints for servers:

1. TCP_DEFER_ACCEPT controlled with TCP_SYNCNT
   - set TCP_DEFER_ACCEPT (defer period) to 1 (used as flag=Enabled)
   - tune the possible retransmission period with TCP_SYNCNT
   - PRO:
     - more control on the sent packets
   - CON:
     - time depends on TCP_TIMEOUT_INIT value
     - server is not silent after bare ACK because defer period is small

2. Use just seconds in TCP_DEFER_ACCEPT
   - PRO:
     - defer period is specified in seconds
     - no SYN+ACKs are sent in defer period after bare ACK
     - can send more SYN+ACKs when TCP_SYNCNT is greater, to avoid losses

- In both cases be ready to accept sockets without data after the defer period

Regards -- Julian Anastasov
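The server-side setup from hint 2 above can be sketched as a small userspace program. The listen_with_defer() helper is a hypothetical name for this example; the defer period is passed in seconds, and note that the kernel rounds it up to a SYN+ACK retransmission-period boundary, so getsockopt() may report a value slightly larger than requested.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a listening TCP socket with TCP_DEFER_ACCEPT set to defer_secs
 * seconds (hint 2 above).  Returns the listening fd, or -1 on error.
 * listen() on an unbound socket auto-binds to an ephemeral port, which
 * is enough for this sketch. */
int listen_with_defer(int defer_secs)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
		       &defer_secs, sizeof(defer_secs)) < 0) {
		close(fd);
		return -1;
	}
	if (listen(fd, 128) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A server using this must still be prepared for accept() to return a socket without data once the defer period has elapsed, as noted above.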
Re: [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1
Hello, On Tue, 19 May 2020, Marco Angaroni wrote: > Hi Andrew, Julian, > > could you please confirm if/how this patch is changing any of the > following behaviours, which I’m listing below as per my understanding > ? > > When expire_nodest is set and real-server is unavailable, at the > moment the following happens to a packet going through IPVS: > > a) TCP (or other connection-oriented protocols): >the packet is silently dropped, then the following retransmission > causes the generation of a RST from the load-balancer to the client, > which will then re-open a new TCP connection Yes. It seems we can not create new connection in all cases, we should also check with is_new_conn(). What we have is that two cases are possible depending on conn_reuse_mode, the state of existing connection and whether netfilter conntrack is used: 1. setup expire for old conn, then drop packet 2. setup expire for old conn, then create new conn to schedule the packet When expiration is set, the timer will fire in the next jiffie to remove the connection from hash table. Until removed, the connection still can cause drops. Sometimes we can simply create new connection with the same tuple, so it is possible both connections to coexist for one jiffie but the old connection is not reached on lookup. > b) UDP: >the packet is silently dropped, then the following retransmission > is rescheduled to a new real-server Yes, we drop while old conn is not expired yet > c) UDP in OPS mode: >the packet is rescheduled to a new real-server, as no previous > connection exists in IPVS connection table, and a new OPS connection > is created (but it lasts only the time to transmit the packet) Yes, OPS is not affected. 
> d) UDP in OPS mode + persistent-template: >the packet is rescheduled to a new real-server, as previous > template-connection is invalidated, a new template-connection is > created, and a new OPS connection is created (but it lasts only the > time to transmit the packet) Yes, the existing template is ignored when its server is unavailable. > It seems to me that you are trying to optimize case a) and b), > avoiding the first step where the packet is silently dropped and > consequently avoiding the retransmission. > And contextually expire also all the other connections pointing to the > unavailable real-sever. The change will allow immediate scheduling in a new connection for any protocol when netfilter conntrack is not used: - TCP: avoids retransmission for SYN - UDP: reduces drops from 1 jiffie to 0 (no drops) But this single jiffie compared to the delay between real server failure and the removal from the IPVS table can be negligible. Of course, if real server is removed while it is working, with this change we should not see any UDP drops. > However I'm confused about the references to OPS mode. > And why you need to expire all the connections at once: if you expire > on a per connection basis, the client experiences the same behaviour > (no more re-transmissions), but you avoid the complexities of a new > thread. Such flushing can help when conntrack is used in which case the cost is a retransmission or downtime for one jiffie. > Maybe also the documentation of expire_nodest_conn sysctl should be updated. > When it's stated: > > If this feature is enabled, the load balancer will expire the > connection immediately when a packet arrives and its > destination server is not available, then the client program > will be notified that the connection is closed > > I think it should be at least "and the client program" instead of > "then the client program". > Or a more detailed explanation. Yes, if the packet is SYN we can create new connection. 
If it is ACK, the retransmission will get RST. Regards -- Julian Anastasov
Re: [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1
= ip_vs_conn_uses_conntrack(cp, skb);

			ip_vs_conn_expire_now(cp);
			__ip_vs_conn_put(cp);
			if (uses_ct)
				return NF_DROP;
			cp = NULL;
		} else {
			__ip_vs_conn_put(cp);
			return NF_DROP;
		}
	}

	if (unlikely(!cp)) {
		int v;

		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
			return v;
	}

	Before now, we always waited one jiffie for the connection to expire; now one packet will:

- schedule expiration for the existing connection with unavailable dest, as before
- create a new connection to an available destination that will be found first in lists.

	But it can work only when the sysctl var "conntrack" is 0; we do not want to create two netfilter conntracks to different real servers. Note that we intentionally removed the timer_pending() check because we cannot see existing ONE_PACKET connections in the table.

Regards -- Julian Anastasov
Re: [PATCH] netfilter/ipvs: expire no destination UDP connections when expire_nodest_conn=1
Hello, On Thu, 14 May 2020, Andrew Sy Kim wrote: > When expire_nodest_conn=1 and an IPVS destination is deleted, IPVS > doesn't expire connections with the IP_VS_CONN_F_ONE_PACKET flag set (any > UDP connection). If there are many UDP packets to a virtual server from a > single client and a destination is deleted, many packets are silently > dropped whenever an existing connection entry with the same source port > exists. This patch ensures IPVS also expires UDP connections when a > packet matches an existing connection with no destinations. > > Signed-off-by: Andrew Sy Kim > --- > net/netfilter/ipvs/ip_vs_core.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > index aa6a603a2425..f0535586fe75 100644 > --- a/net/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -2116,8 +2116,7 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, > struct sk_buff *skb, int > else > ip_vs_conn_put(cp); Above ip_vs_conn_put() should free the ONE_PACKET connections because: - such connections never start timer, they are designed to exist just to schedule the packet, then they are released. - noone takes extra references So, ip_vs_conn_put() simply calls ip_vs_conn_expire() where connections should be released immediately. As result, we can not access cp after this point here. That is why we work just with 'flags' below... Note that not every UDP connection has ONE_PACKET flag, it is present if you configure it for the service. Do you have -o/--ops flag? If not, the UDP connection should expire before the next jiffie. This is the theory, in practice, you may observe some problem... 
> - if (sysctl_expire_nodest_conn(ipvs) && > - !(flags & IP_VS_CONN_F_ONE_PACKET)) { > + if (sysctl_expire_nodest_conn(ipvs)) { > /* try to expire the connection immediately */ > ip_vs_conn_expire_now(cp); > } You can also look at the discussion which resulted in the last patch for this place: http://archive.linuxvirtualserver.org/html/lvs-devel/2018-07/msg00014.html Regards -- Julian Anastasov
Re: [PATCH] ipvs: no need to update skb route entry for local destination packets.
Hello, On Mon, 30 Sep 2019, zhang kai wrote: > In the end of function __ip_vs_get_out_rt/__ip_vs_get_out_rt_v6,the > 'local' variable is always zero. > > Signed-off-by: zhang kai Looks good to me, thanks! Acked-by: Julian Anastasov Simon, this is for -next kernels... > --- > net/netfilter/ipvs/ip_vs_xmit.c | 18 ++ > 1 file changed, 6 insertions(+), 12 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c > index 9c464d24beec..037c7c91044e 100644 > --- a/net/netfilter/ipvs/ip_vs_xmit.c > +++ b/net/netfilter/ipvs/ip_vs_xmit.c > @@ -407,12 +407,9 @@ __ip_vs_get_out_rt(struct netns_ipvs *ipvs, int skb_af, > struct sk_buff *skb, > goto err_put; > > skb_dst_drop(skb); > - if (noref) { > - if (!local) > - skb_dst_set_noref(skb, &rt->dst); > - else > - skb_dst_set(skb, dst_clone(&rt->dst)); > - } else > + if (noref) > + skb_dst_set_noref(skb, &rt->dst); > + else > skb_dst_set(skb, &rt->dst); > > return local; > @@ -574,12 +571,9 @@ __ip_vs_get_out_rt_v6(struct netns_ipvs *ipvs, int > skb_af, struct sk_buff *skb, > goto err_put; > > skb_dst_drop(skb); > - if (noref) { > - if (!local) > - skb_dst_set_noref(skb, &rt->dst); > - else > - skb_dst_set(skb, dst_clone(&rt->dst)); > - } else > + if (noref) > + skb_dst_set_noref(skb, &rt->dst); > + else > skb_dst_set(skb, &rt->dst); > > return local; > -- > 2.17.1 Regards -- Julian Anastasov
Re: [PATCH net] ipv4: Revert removal of rt_uses_gateway
Hello, On Tue, 17 Sep 2019, David Ahern wrote: > On 9/17/19 12:50 PM, Julian Anastasov wrote: > > > > Looks good to me, thanks! > > > > Reviewed-by: Julian Anastasov > > > > BTW, do you have any tests for the rt_uses_gateway paths - showing why > it is needed? No special tests. > All of the pmtu, redirect, fib tests, etc worked fine without the > special flag. Sure, the 'ip ro get' had extra data; it seems like that > could be handled. I'll explain. In the period before the route cache was removed in 3.6, there were two fields: rt_dst and rt_gateway. For targets on LAN both contained the target. For targets via gateway (nh_gw), rt_dst still stores the target but rt_gateway stored the nh_gw. In short, rt_gateway always contained the next hop IP for neigh resolving. In 3.6, rt_dst was removed and only rt_gateway remained to store nh_gw. As fnhe_rth/nh_pcpu_rth_output were used to cache the output route, rt_gateway can not contain any IP because the route can be used for any target on the LAN. This is true even now, rt_gateway is 0 for cached routes that are not DST_HOST, i.e. not for single target IP. Why this matters? There are users such as IPVS, TEE, raw.c (IP_HDRINCL) that use FLOWI_FLAG_KNOWN_NH to request route with rt_gateway != 0 to be returned, i.e. they want rt_gateway to store a next hop IP for neigh resolving but to put different IP in iph->daddr. This is honoured by rt_nexthop(), it prefers rt_gateway before the iph->daddr. With this FLOWI flag the routing will avoid returning routes with rt_gateway = 0 (cached in NH), instead it will allocate DST_HOST route which can safely hold IP in rt_gateway. So, in 3.7 commit 155e8336c373 ("ipv4: introduce rt_uses_gateway") was created to restore the knowledge that rt_dst != rt_gateway means route via GW and also commit c92b96553a80 ("ipv4: Add FLOWI_FLAG_KNOWN_NH") to make sure packets are routed by requested next hop and not by iph->daddr. 
So these are the places that need to know whether rt_gateway contains nh_gw (route via GW) or just a requested next hop (when nh_gw is 0). It matters for cases where strict source routes must be on a connected network. In simple tests things work without the rt_uses_gateway flag because it is a corner case to see the above cases combined with strict source routing or MTU locking. Not sure if we can use some trick to support it differently without such a flag. Regards -- Julian Anastasov
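The next-hop preference that rt_nexthop() honours for FLOWI_FLAG_KNOWN_NH users, as described above, can be modelled in a few lines. The rtable_model struct and rt_nexthop_model() below are simplified stand-ins for this illustration only; the real definitions live in include/net/route.h.

```c
#include <stdint.h>

/* Simplified model of struct rtable: only the field that matters for
 * next-hop selection.  rt_gateway == 0 means no next hop was recorded. */
struct rtable_model {
	uint32_t rt_gateway;
};

/* Mirrors the rt_nexthop() choice: prefer the recorded next hop (nh_gw,
 * or the next hop requested via FLOWI_FLAG_KNOWN_NH) over the packet's
 * daddr; fall back to daddr for targets directly on the LAN. */
uint32_t rt_nexthop_model(const struct rtable_model *rt, uint32_t daddr)
{
	if (rt->rt_gateway)
		return rt->rt_gateway;
	return daddr;
}
```

This is why callers such as IPVS and TEE can put one IP in iph->daddr while neigh resolution uses a different next hop, and why a separate rt_uses_gateway flag is still needed to tell "via GW" apart from "requested next hop on the LAN".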
Re: [PATCH net] ipv4: Revert removal of rt_uses_gateway
Hello, On Tue, 17 Sep 2019, David Ahern wrote: > From: David Ahern > > Julian noted that rt_uses_gateway has a more subtle use than 'is gateway > set': > > https://lore.kernel.org/netdev/alpine.lfd.2.21.1909151104060.2...@ja.home.ssi.bg/ > > Revert that part of the commit referenced in the Fixes tag. > > Currently, there are no u8 holes in 'struct rtable'. There is a 4-byte hole > in the second cacheline which contains the gateway declaration. So move > rt_gw_family down to the gateway declarations since they are always used > together, and then re-use that u8 for rt_uses_gateway. End result is that > rtable size is unchanged. > > Fixes: 1550c171935d ("ipv4: Prepare rtable for IPv6 gateway") > Reported-by: Julian Anastasov > Signed-off-by: David Ahern Looks good to me, thanks! Reviewed-by: Julian Anastasov > --- > drivers/infiniband/core/addr.c | 2 +- > include/net/route.h | 3 ++- > net/ipv4/inet_connection_sock.c | 4 ++-- > net/ipv4/ip_forward.c | 2 +- > net/ipv4/ip_output.c| 2 +- > net/ipv4/route.c| 36 +--- > net/ipv4/xfrm4_policy.c | 1 + > 7 files changed, 29 insertions(+), 21 deletions(-) > > diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c > index 9b76a8fcdd24..bf539c34ccd3 100644 > --- a/drivers/infiniband/core/addr.c > +++ b/drivers/infiniband/core/addr.c > @@ -352,7 +352,7 @@ static bool has_gateway(const struct dst_entry *dst, > sa_family_t family) > > if (family == AF_INET) { > rt = container_of(dst, struct rtable, dst); > - return rt->rt_gw_family == AF_INET; > + return rt->rt_uses_gateway; > } > > rt6 = container_of(dst, struct rt6_info, dst); > diff --git a/include/net/route.h b/include/net/route.h > index dfce19c9fa96..6c516840380d 100644 > --- a/include/net/route.h > +++ b/include/net/route.h > @@ -53,10 +53,11 @@ struct rtable { > unsigned intrt_flags; > __u16 rt_type; > __u8rt_is_input; > - u8 rt_gw_family; > + __u8rt_uses_gateway; > > int rt_iif; > > + u8 rt_gw_family; > /* Info on neighbour */ > union { > __be32 
rt_gw4; > diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c > index f5c163d4771b..a9183543ca30 100644 > --- a/net/ipv4/inet_connection_sock.c > +++ b/net/ipv4/inet_connection_sock.c > @@ -560,7 +560,7 @@ struct dst_entry *inet_csk_route_req(const struct sock > *sk, > rt = ip_route_output_flow(net, fl4, sk); > if (IS_ERR(rt)) > goto no_route; > - if (opt && opt->opt.is_strictroute && rt->rt_gw_family) > + if (opt && opt->opt.is_strictroute && rt->rt_uses_gateway) > goto route_err; > rcu_read_unlock(); > return &rt->dst; > @@ -598,7 +598,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct > sock *sk, > rt = ip_route_output_flow(net, fl4, sk); > if (IS_ERR(rt)) > goto no_route; > - if (opt && opt->opt.is_strictroute && rt->rt_gw_family) > + if (opt && opt->opt.is_strictroute && rt->rt_uses_gateway) > goto route_err; > return &rt->dst; > > diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c > index 06f6f280b9ff..00ec819f949b 100644 > --- a/net/ipv4/ip_forward.c > +++ b/net/ipv4/ip_forward.c > @@ -123,7 +123,7 @@ int ip_forward(struct sk_buff *skb) > > rt = skb_rtable(skb); > > - if (opt->is_strictroute && rt->rt_gw_family) > + if (opt->is_strictroute && rt->rt_uses_gateway) > goto sr_failed; > > IPCB(skb)->flags |= IPSKB_FORWARDED; > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c > index cc7ef0d05bbd..da521790cd63 100644 > --- a/net/ipv4/ip_output.c > +++ b/net/ipv4/ip_output.c > @@ -499,7 +499,7 @@ int __ip_queue_xmit(struct sock *sk, struct sk_buff *skb, > struct flowi *fl, > skb_dst_set_noref(skb, &rt->dst); > > packet_routed: > - if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_gw_family) > + if (inet_opt && inet_opt->opt.is_strictroute && rt->rt_uses_gateway) > goto no_route; > > /* OK, we know where to send it, allocate and build IP header. */ >
rt_uses_gateway was removed?
Hello, Looks like I'm a bit late with the storm of changes in the routing. By default, after allocation rt_uses_gateway was set to 0. Later it can be set to 1 if nh_gw is not the final route target, i.e. it is an indirect GW and not a target on the LAN (the RT_SCOPE_LINK check in rt_set_nexthop). What remains hidden in the rt_uses_gateway semantic is this code in rt_set_nexthop: if (unlikely(!cached)) { /* Routes we intend to cache in nexthop exception or * FIB nexthop have the DST_NOCACHE bit clear. * However, if we are unsuccessful at storing this * route into the cache we really need to set it. */ if (!rt->rt_gateway) rt->rt_gateway = daddr; rt_add_uncached_list(rt); } and this code in rt_bind_exception: if (!rt->rt_gateway) rt->rt_gateway = daddr; I.e. even if rt_uses_gateway remains 0, rt_gateway can contain an address, i.e. the target is on the LAN and not behind nh_gw. Now I see commit 1550c171935d wrongly changes that to "If rt_gw_family is set it implies rt_uses_gateway.". As a result, we set rt_gw_family while rt_uses_gateway was 0 for the above cases. Think about it this way: there should be a reason why we used the rt_uses_gateway flag instead of a simple "rt_gateway != 0" check, right? Replacing rt->rt_gateway checks with rt_gw_family checks is right, but the rt_uses_gateway checks should be put back because they indicate the route has more hops to the target. As the problem is related to some FNHE and non-cached routes, redirects, etc., the easiest way to see the problem is with 'ip route get LOCAL_IP oif eth0' where an extra 'via GW' line is shown. Regards -- Julian Anastasov
Re: [PATCHv4 net 1/2] ipv4/icmp: fix rt dst dev null pointer dereference
Hello, On Thu, 22 Aug 2019, Hangbin Liu wrote: > In __icmp_send() there is a possibility that the rt->dst.dev is NULL, > e.g. with tunnel collect_md mode, which will cause a kernel crash. > Here is what the code path looks like, for GRE: > > - ip6gre_tunnel_xmit > - ip6gre_xmit_ipv4 > - __gre6_xmit > - ip6_tnl_xmit > - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE > - icmp_send > - net = dev_net(rt->dst.dev); <-- here > > The reason is that __metadata_dst_init() inits dst->dev to NULL by default. > We could not fix it in __metadata_dst_init() as there is no dev supplied. > On the other hand, the reason we need rt->dst.dev is to get the net. > So we can just try to get it from skb->dev when rt->dst.dev is NULL. > > v4: Julian Anastasov reminded that skb->dev could also be NULL. We'd better > still use dst.dev and do a check to avoid a crash. > > v3: No changes. > > v2: fix the issue in __icmp_send() instead of updating the shared dst dev > in {ip_md, ip6}_tunnel_xmit. > > Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit") > Signed-off-by: Hangbin Liu This patch looks good to me, thanks! Reviewed-by: Julian Anastasov > --- > net/ipv4/icmp.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c > index 1510e951f451..001f03f76bc4 100644 > --- a/net/ipv4/icmp.c > +++ b/net/ipv4/icmp.c > @@ -582,7 +582,13 @@ void __icmp_send(struct sk_buff *skb_in, int type, int > code, __be32 info, > > if (!rt) > goto out; > - net = dev_net(rt->dst.dev); > + > + if (rt->dst.dev) > + net = dev_net(rt->dst.dev); > + else if (skb_in->dev) > + net = dev_net(skb_in->dev); > + else > + goto out; > > /* >* Find the original header. It is expected to be valid, of course. > -- > 2.19.2 Regards -- Julian Anastasov
Re: [PATCH 1/2] ipv4/icmp: fix rt dst dev null pointer dereference
Hello, On Mon, 19 Aug 2019, Hangbin Liu wrote: > In __icmp_send() there is a possibility that the rt->dst.dev is NULL, > e.g. with tunnel collect_md mode, which will cause a kernel crash. > Here is what the code path looks like, for GRE: > > - ip6gre_tunnel_xmit > - ip6gre_xmit_ipv4 > - __gre6_xmit > - ip6_tnl_xmit > - if skb->len - t->tun_hlen - eth_hlen > mtu; return -EMSGSIZE > - icmp_send > - net = dev_net(rt->dst.dev); <-- here > > The reason is that __metadata_dst_init() inits dst->dev to NULL by default. > We could not fix it in __metadata_dst_init() as there is no dev supplied. > On the other hand, the reason we need rt->dst.dev is to get the net. > So we can just get it from skb->dev, just like commit 8d9336704521 > ("ipv6: make icmp6_send() robust against null skb->dev") did. > > Fixes: c8b34e680a09 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit") > Signed-off-by: Hangbin Liu > --- > net/ipv4/icmp.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c > index 1510e951f451..5f00c9d18b02 100644 > --- a/net/ipv4/icmp.c > +++ b/net/ipv4/icmp.c > @@ -582,7 +582,10 @@ void __icmp_send(struct sk_buff *skb_in, int type, int > code, __be32 info, > > if (!rt) > goto out; > - net = dev_net(rt->dst.dev); > + > + if (!skb_in->dev) > + goto out; This looks wrong to me. IIRC, we should be able to send ICMP errors from the OUTPUT hook, where skb->dev is NULL. It is true even for IPv6: net/ipv6/netfilter/ip6t_REJECT.c works for NF_INET_LOCAL_OUT. nf_send_unreach6() and other IPv6 places have workarounds to avoid skb->dev being NULL, but IPv4 and IPv6 are different: IPv4 never required skb->dev to be non-NULL, so better not to change that. Just check dst.dev to avoid the crash. > + net = dev_net(skb_in->dev); > > /* >* Find the original header. It is expected to be valid, of course. > -- > 2.19.2 Regards -- Julian Anastasov
Re: [PATCH] ipvs: change type of delta and previous_delta in ip_vs_seq.
Hello, Cc list trimmed... On Tue, 20 Aug 2019, zhang kai wrote: > In NAT forwarding mode, applications may decrease the size of packets, > and TCP sequences will get smaller, so both variables will be negative > values in this case. As long as nobody cares about their sign, the type should not matter. You can not solve all signed/unsigned mismatches with such a small patch. Or are you seeing some problem, maybe in debug? > > Signed-off-by: zhang kai > --- > include/net/ip_vs.h | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h > index 3759167f91f5..de7e75063c7c 100644 > --- a/include/net/ip_vs.h > +++ b/include/net/ip_vs.h > @@ -346,8 +346,8 @@ enum ip_vs_sctp_states { > */ > struct ip_vs_seq { > __u32 init_seq; /* Add delta from this seq */ > - __u32 delta; /* Delta in sequence numbers */ > - __u32 previous_delta; /* Delta in sequence numbers > + __s32 delta; /* Delta in sequence numbers */ > + __s32 previous_delta; /* Delta in sequence numbers >* before last resized pkt */ > }; > > -- > 2.17.1 Regards -- Julian Anastasov
Re: [PATCH net v3] ipvs: Improve robustness to the ipvs sysctl
Hello, On Thu, 1 Aug 2019, hujunwei wrote: > From: Junwei Hu > > The ipvs module parses the user buffer and saves it to sysctl, > then checks if the value is valid; an invalid value can be seen > over a period of time. > Here, I add a variable, struct ctl_table tmp, used to read > the value from the user buffer, and save it only when it is valid. > I delete proc_do_sync_mode and use extra1/2 in the table for the > proc_dointvec_minmax call. > > Fixes: f73181c8288f ("ipvs: add support for sync threads") > Signed-off-by: Junwei Hu > Acked-by: Julian Anastasov Yep, Acked-by: Julian Anastasov > --- > V1->V2: > - delete proc_do_sync_mode and use proc_dointvec_minmax call. > V2->V3: > - update git version > --- > net/netfilter/ipvs/ip_vs_ctl.c | 69 +- > 1 file changed, 35 insertions(+), 34 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > index 060565e7d227..72189559a1cd 100644 > --- a/net/netfilter/ipvs/ip_vs_ctl.c > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > @@ -1737,12 +1737,18 @@ proc_do_defense_mode(struct ctl_table *table, int > write, > int val = *valp; > int rc; > > - rc = proc_dointvec(table, write, buffer, lenp, ppos); > + struct ctl_table tmp = { > + .data = &val, > + .maxlen = sizeof(int), > + .mode = table->mode, > + }; > + > + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos); > if (write && (*valp != val)) { > - if ((*valp < 0) || (*valp > 3)) { > - /* Restore the correct value */ > - *valp = val; > + if (val < 0 || val > 3) { > + rc = -EINVAL; > } else { > + *valp = val; > update_defense_level(ipvs); > } > } > @@ -1756,33 +1762,20 @@ proc_do_sync_threshold(struct ctl_table *table, int > write, > int *valp = table->data; > int val[2]; > int rc; > + struct ctl_table tmp = { > + .data = &val, > + .maxlen = table->maxlen, > + .mode = table->mode, > + }; > > - /* backup the value first */ > memcpy(val, valp, sizeof(val)); > - > - rc = proc_dointvec(table, write, buffer, lenp, ppos); > - if (write && (valp[0] < 0 || valp[1] < 0 || > - 
(valp[0] >= valp[1] && valp[1]))) { > - /* Restore the correct value */ > - memcpy(valp, val, sizeof(val)); > - } > - return rc; > -} > - > -static int > -proc_do_sync_mode(struct ctl_table *table, int write, > - void __user *buffer, size_t *lenp, loff_t *ppos) > -{ > - int *valp = table->data; > - int val = *valp; > - int rc; > - > - rc = proc_dointvec(table, write, buffer, lenp, ppos); > - if (write && (*valp != val)) { > - if ((*valp < 0) || (*valp > 1)) { > - /* Restore the correct value */ > - *valp = val; > - } > + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos); > + if (write) { > + if (val[0] < 0 || val[1] < 0 || > + (val[0] >= val[1] && val[1])) > + rc = -EINVAL; > + else > + memcpy(valp, val, sizeof(val)); > } > return rc; > } > @@ -1795,12 +1788,18 @@ proc_do_sync_ports(struct ctl_table *table, int write, > int val = *valp; > int rc; > > - rc = proc_dointvec(table, write, buffer, lenp, ppos); > + struct ctl_table tmp = { > + .data = &val, > + .maxlen = sizeof(int), > + .mode = table->mode, > + }; > + > + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos); > if (write && (*valp != val)) { > - if (*valp < 1 || !is_power_of_2(*valp)) { > - /* Restore the correct value */ > + if (val < 1 || !is_power_of_2(val)) > + rc = -EINVAL; > + else > *valp = val; > - } > } > return rc; > } > @@ -1860,7 +1859,9 @@ static struct ctl_table vs_vars[] = { > .procname = "sync_version", > .maxlen = sizeof(int), > .mode = 0644, > - .proc_handler = proc_do_sync_mode, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = SYSCTL_ZERO, > + .extra2 = SYSCTL_ONE, > }, > { > .procname = "sync_ports", > -- > 2.21.GIT Regards -- Julian Anastasov
Re: [PATCH net v2] ipvs: Improve robustness to the ipvs sysctl
Hello, On Tue, 30 Jul 2019, hujunwei wrote: > From: Junwei Hu > > The ipvs module parses the user buffer and saves it to sysctl, > then checks if the value is valid; an invalid value can be seen > over a period of time. > Here, I add a variable, struct ctl_table tmp, used to read > the value from the user buffer, and save it only when it is valid. > I delete proc_do_sync_mode and use extra1/2 in the table for the > proc_dointvec_minmax call. > > Fixes: f73181c8288f ("ipvs: add support for sync threads") > Signed-off-by: Junwei Hu Looks good to me, thanks! Acked-by: Julian Anastasov BTW, why ip_vs_zero_all everywhere? Due to an old git version? > --- > V1->V2: > - delete proc_do_sync_mode and use proc_dointvec_minmax call. > --- > net/netfilter/ipvs/ip_vs_ctl.c | 69 > +- > 1 file changed, 35 insertions(+), 34 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > index 060565e..7aed7b0 100644 > --- a/net/netfilter/ipvs/ip_vs_ctl.c > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > @@ -1737,12 +1737,18 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs) > int val = *valp; > int rc; > > - rc = proc_dointvec(table, write, buffer, lenp, ppos); > + struct ctl_table tmp = { > + .data = &val, > + .maxlen = sizeof(int), > + .mode = table->mode, > + }; > + > + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos); > if (write && (*valp != val)) { > - if ((*valp < 0) || (*valp > 3)) { > - /* Restore the correct value */ > + if (val < 0 || val > 3) { > + rc = -EINVAL; > + } else { > *valp = val; > - } else { > update_defense_level(ipvs); > } > } > @@ -1756,33 +1762,20 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs) > int *valp = table->data; > int val[2]; > int rc; > + struct ctl_table tmp = { > + .data = &val, > + .maxlen = table->maxlen, > + .mode = table->mode, > + }; > > - /* backup the value first */ > memcpy(val, valp, sizeof(val)); > - > - rc = proc_dointvec(table, write, buffer, lenp, ppos); > - if (write && (valp[0] < 0 || valp[1] < 0 || > - (valp[0] 
>= valp[1] && valp[1]))) { > - /* Restore the correct value */ > - memcpy(valp, val, sizeof(val)); > - } > - return rc; > -} > - > -static int > -proc_do_sync_mode(struct ctl_table *table, int write, > - void __user *buffer, size_t *lenp, loff_t *ppos) > -{ > - int *valp = table->data; > - int val = *valp; > - int rc; > - > - rc = proc_dointvec(table, write, buffer, lenp, ppos); > - if (write && (*valp != val)) { > - if ((*valp < 0) || (*valp > 1)) { > - /* Restore the correct value */ > - *valp = val; > - } > + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos); > + if (write) { > + if (val[0] < 0 || val[1] < 0 || > + (val[0] >= val[1] && val[1])) > + rc = -EINVAL; > + else > + memcpy(valp, val, sizeof(val)); > } > return rc; > } > @@ -1795,12 +1788,18 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs) > int val = *valp; > int rc; > > - rc = proc_dointvec(table, write, buffer, lenp, ppos); > + struct ctl_table tmp = { > + .data = &val, > + .maxlen = sizeof(int), > + .mode = table->mode, > + }; > + > + rc = proc_dointvec(&tmp, write, buffer, lenp, ppos); > if (write && (*valp != val)) { > - if (*valp < 1 || !is_power_of_2(*valp)) { > - /* Restore the correct value */ > + if (val < 1 || !is_power_of_2(val)) > + rc = -EINVAL; > + else > *valp = val; > - } > } > return rc; > } > @@ -1860,7 +1859,9 @@ static int ip_vs_zero_all(struct netns_ipvs *ipvs) > .procname = "sync_version", > .maxlen = sizeof(int), > .mode = 0644, > - .proc_handler = proc_do_sync_mode, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = SYSCTL_ZERO, > + .extra2 = SYSCTL_ONE, > }, > { > .procname = "sync_ports", > -- > 1.7.12.4 Regards -- Julian Anastasov
Re: [PATCH net v3] net: neigh: fix multiple neigh timer scheduling
= NUD_INCOMPLETE; > neigh->updated = now; > next = now + max(NEIGH_VAR(neigh->parms, RETRANS_TIME), > @@ -1140,6 +1141,7 @@ int __neigh_event_send(struct neighbour *neigh, struct > sk_buff *skb) > } > } else if (neigh->nud_state & NUD_STALE) { > neigh_dbg(2, "neigh %p is delayed\n", neigh); > + neigh_del_timer(neigh); > neigh->nud_state = NUD_DELAY; > neigh->updated = jiffies; > neigh_add_timer(neigh, jiffies + > -- > 2.21.0 Regards -- Julian Anastasov
Re: [PATCH] ipvs: allow tunneling with gre encapsulation
Hello, On Tue, 25 Jun 2019, Vadim Fedorenko wrote: > Windows real servers can handle gre tunnels, this patch allows > gre encapsulation with the tunneling method, thereby letting ipvs > be a load balancer for Windows-based services > > Signed-off-by: Vadim Fedorenko Looks like your patch is not encoded properly. Check Documentation/process/email-clients.rst for your mail client to avoid the format=flowed in Content-Type. To be sure, send the patch to yourself and try to apply it. As an alternative, use git send-email to send patches, for example: git send-email --envelope-sender 'YOUR EMAIL' \ [--suppress-from] [--confirm=always] \ --to 'TO_EMAIL' --cc 'CC_EMAIL1' --cc 'CC_EMAIL2' \ file_or_dir You can test it by sending first to yourself only. Also, test every patch for problems before sending: scripts/checkpatch.pl --strict file.patch It shows some problems you should fix. So, your patch works but you have to send v2 with some changes... > --- > include/uapi/linux/ip_vs.h | 1 + > net/netfilter/ipvs/ip_vs_xmit.c | 76 > + > 2 files changed, 77 insertions(+) > > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c > index 71fc6d6..fad3f33 100644 > --- a/net/netfilter/ipvs/ip_vs_xmit.c > +++ b/net/netfilter/ipvs/ip_vs_xmit.c > @@ -29,6 +29,7 @@ > #include /* for tcphdr */ > #include > #include > +#include > #include /* for csum_tcpudp_magic */ > #include > #include/* for icmp_send */ > @@ -389,6 +390,12 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs, > skb->ip_summed == CHECKSUM_PARTIAL) > mtu -= GUE_PLEN_REMCSUM + GUE_LEN_PRIV; > } > + if (dest->tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GRE) { > + __be16 tflags = 0; Empty line needed > + if (dest->tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) > + tflags |= TUNNEL_CSUM; > + mtu -= gre_calc_hlen(tflags); > + } > if (mtu < 68) { > IP_VS_DBG_RL("%s(): mtu less than 68\n", __func__); > goto err_put; > @@ -549,6 +556,12 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs, > skb->ip_summed == 
CHECKSUM_PARTIAL) > mtu -= GUE_PLEN_REMCSUM + GUE_LEN_PRIV; > } > + if (dest->tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GRE) { > + __be16 tflags = 0; Empty line needed > + if (dest->tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) > + tflags |= TUNNEL_CSUM; > + mtu -= gre_calc_hlen(tflags); > + } > if (mtu < IPV6_MIN_MTU) { > IP_VS_DBG_RL("%s(): mtu less than %d\n", __func__, >IPV6_MIN_MTU); > @@ -1079,6 +1092,24 @@ static inline int __tun_gso_type_mask(int encaps_af, > int orig_af) > return 0; > } > > +static void > +ipvs_gre_encap(struct net *net, struct sk_buff *skb, > +struct ip_vs_conn *cp, __u8 *next_protocol) > +{ > + size_t hdrlen; > + __be16 tflags = 0; > + __be16 proto = *next_protocol == IPPROTO_IPIP ? htons(ETH_P_IP) : > htons(ETH_P_IPV6); This line is too long. You can also order the lines from long to short. > + > + if (cp->dest->tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) > + tflags |= TUNNEL_CSUM; Many empty lines are not needed... > + > + hdrlen = gre_calc_hlen(tflags); > + > + gre_build_header(skb, hdrlen, tflags, proto, 0, 0); > + > + *next_protocol = IPPROTO_GRE; > +} > + > /* > * IP Tunneling transmitter > * > @@ -1328,6 +1400,10 @@ static inline int __tun_gso_type_mask(int encaps_af, > int orig_af) > udp6_set_csum(!check, skb, &saddr, &cp->daddr.in6, skb->len); > } > > + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GRE) { > + ipvs_gre_encap(net, skb, cp, &next_protocol); > + } Braces are not needed, at both places where ipvs_gre_encap is called. > + > skb_push(skb, sizeof(struct ipv6hdr)); > skb_reset_network_header(skb); > memset(&(IPCB(skb)->opt), 0, sizeof(IPCB(skb)->opt)); As part of your patch, the new tunnel type should be registered also in ip_vs_rs_hash(), GRE will use port 0 just like IPIP, eg: case IP_VS_CONN_F_TUNNEL_TYPE_IPIP: + case IP_VS_CONN_F_TUNNEL_TYPE_GRE: port = 0; break; Then I'll post a patch for ip_vs_in_icmp() that strips the GRE header from ICMP errors by adding ipvs_gre_decap(). I also created ipvsadm patch for GRE. 
Regards -- Julian Anastasov
Re: [PATCH v1] ipvs: add checksum support for gue encapsulation
_GUE) { We can move gue_hdrlen and gue_optlen here. > + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM) && > + skb->ip_summed == CHECKSUM_PARTIAL) { > + gue_optlen += GUE_PLEN_REMCSUM + GUE_LEN_PRIV; > + } > + gue_hdrlen = sizeof(struct guehdr) + gue_optlen; > + > + max_headroom += sizeof(struct udphdr) + gue_hdrlen; > + } > > /* We only care about the df field if sysctl_pmtu_disc(ipvs) is set */ > dfp = sysctl_pmtu_disc(ipvs) ? &df : NULL; > @@ -1105,8 +1164,17 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct > ip_vs_conn *cp, > goto tx_error; > > gso_type = __tun_gso_type_mask(AF_INET, cp->af); > - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) > - gso_type |= SKB_GSO_UDP_TUNNEL; > + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) { > + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) || > + (tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM)) > + gso_type |= SKB_GSO_UDP_TUNNEL_CSUM; > + else > + gso_type |= SKB_GSO_UDP_TUNNEL; > + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM) && > + skb->ip_summed == CHECKSUM_PARTIAL) { > + gso_type |= SKB_GSO_TUNNEL_REMCSUM; > + } > + } > > if (iptunnel_handle_offloads(skb, gso_type)) > goto tx_error; > @@ -1115,8 +1183,19 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct > ip_vs_conn *cp, > > skb_set_inner_ipproto(skb, next_protocol); > > - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) > - ipvs_gue_encap(net, skb, cp, &next_protocol); > + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) { > + bool check = false; > + > + if (ipvs_gue_encap(net, skb, cp, &next_protocol)) > + goto tx_error; > + > + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) || > + (tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM)) > + check = true; > + > + udp_set_csum(!check, skb, saddr, cp->daddr.ip, skb->len); > + } > + > > skb_push(skb, sizeof(struct iphdr)); > skb_reset_network_header(skb); > @@ -1174,6 +1253,8 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct > ip_vs_conn *cp, > unsigned int max_headroom; /* The extra header space needed */ > int ret, local; > int 
tun_type, gso_type; > + int tun_flags; > + size_t gue_hdrlen, gue_optlen = 0; > > EnterFunction(10); > > @@ -1197,9 +1278,17 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct > ip_vs_conn *cp, > max_headroom = LL_RESERVED_SPACE(tdev) + sizeof(struct ipv6hdr); > > tun_type = cp->dest->tun_type; > + tun_flags = cp->dest->tun_flags; > + > + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) { > + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM) && > + skb->ip_summed == CHECKSUM_PARTIAL) { Same, we can move gue_hdrlen and gue_optlen here. > + gue_optlen += GUE_PLEN_REMCSUM + GUE_LEN_PRIV; > + } > + gue_hdrlen = sizeof(struct guehdr) + gue_optlen; > > - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) > - max_headroom += sizeof(struct udphdr) + sizeof(struct guehdr); > + max_headroom += sizeof(struct udphdr) + gue_hdrlen; > + } > > skb = ip_vs_prepare_tunneled_skb(skb, cp->af, max_headroom, >&next_protocol, &payload_len, > @@ -1208,8 +1297,17 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct > ip_vs_conn *cp, > goto tx_error; > > gso_type = __tun_gso_type_mask(AF_INET6, cp->af); > - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) > - gso_type |= SKB_GSO_UDP_TUNNEL; > + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) { > + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) || > + (tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM)) > + gso_type |= SKB_GSO_UDP_TUNNEL_CSUM; > + else > + gso_type |= SKB_GSO_UDP_TUNNEL; > + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM) && > + skb->ip_summed == CHECKSUM_PARTIAL) { > + gso_type |= SKB_GSO_TUNNEL_REMCSUM; > + } > + } > > if (iptunnel_handle_offloads(skb, gso_type)) > goto tx_error; > @@ -1218,8 +1316,18 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct > ip_vs_conn *cp, > > skb_set_inner_ipproto(skb, next_protocol); > > - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) > - ipvs_gue_encap(net, skb, cp, &next_protocol); > + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) { > + bool check = false; > + > + if (ipvs_gue_encap(net, skb, cp, 
&next_protocol)) > + goto tx_error; > + > + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) || > + (tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM)) > + check = true; > + > + udp6_set_csum(!check, skb, &saddr, &cp->daddr.in6, skb->len); > + } > > skb_push(skb, sizeof(struct ipv6hdr)); > skb_reset_network_header(skb); > -- > 2.21.0 Regards -- Julian Anastasov
Re: [PATCH] ipvs: Fix use-after-free in ip_vs_in
> 8881e9b26f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > while unregistering the ipvs module, ops_free_list calls > __ip_vs_cleanup, then nf_unregister_net_hooks is called to > remove the nf hook entries. It needs an RCU period to finish; > however, net->ipvs is set to NULL immediately, which will > trigger a NULL pointer dereference when a packet is hooked > and handled by ip_vs_in, where net->ipvs is dereferenced. > > Another scenario is ops_free_list calling ops_free to free the > net_generic directly after __ip_vs_cleanup has finished; then > calling ip_vs_in triggers a use-after-free. OK, can you instead test and post a patch that moves nf_unregister_net_hooks from __ip_vs_cleanup() to __ip_vs_dev_cleanup()? You can add commit efe41606184e in the Fixes line. There is rcu_barrier() in unregister_pernet_device -> unregister_pernet_operations that will do the needed grace period. In a followup patch for net-next I'll drop the ipvs->enable flag and will move the nf_register_net_hooks() call to ip_vs_add_service() just before the 'svc = kzalloc' part. So, for now you do not need to move nf_register_net_hooks. As a result, hooks will be registered only when there are IPVS rules. > Reported-by: Hulk Robot > Fixes: efe41606184e ("ipvs: convert to use pernet nf_hook api") > Signed-off-by: YueHaibing > --- > net/netfilter/ipvs/ip_vs_core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > index 1445755..33205db 100644 > --- a/net/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -2320,6 +2320,7 @@ static void __net_exit __ip_vs_cleanup(struct net *net) > ip_vs_control_net_cleanup(ipvs); > ip_vs_estimator_net_cleanup(ipvs); > IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen); > + synchronize_net(); > net->ipvs = NULL; > } Regards -- Julian Anastasov
Re: [PATCH] ipvs: Fix use-after-free in ip_vs_in
> 8881e9b26f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb > > while unregistering the ipvs module, ops_free_list calls > __ip_vs_cleanup, then nf_unregister_net_hooks is called to > remove the nf hook entries. It needs an RCU period to finish; > however, net->ipvs is set to NULL immediately, which will > trigger a NULL pointer dereference when a packet is hooked > and handled by ip_vs_in, where net->ipvs is dereferenced. > > Another scenario is ops_free_list calling ops_free to free the > net_generic directly after __ip_vs_cleanup has finished; then > calling ip_vs_in triggers a use-after-free. > > Reported-by: Hulk Robot > Fixes: efe41606184e ("ipvs: convert to use pernet nf_hook api") > Signed-off-by: YueHaibing > --- > net/netfilter/ipvs/ip_vs_core.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > index 1445755..33205db 100644 > --- a/net/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -2320,6 +2320,7 @@ static void __net_exit __ip_vs_cleanup(struct net *net) > ip_vs_control_net_cleanup(ipvs); > ip_vs_estimator_net_cleanup(ipvs); > IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen); > + synchronize_net(); A grace period in the net_exit handler should be avoided. It can be added to ip_vs_cleanup(), but maybe we have to reorder the operations so that we can have a single grace period. Note that ip_vs_conn_cleanup() already includes rcu_barrier() and we can use it to split the cleanups into two steps: 1: unregister hooks (__ip_vs_dev_cleanup) to stop traffic and 2: cleanups when traffic is stopped. Note that the problem should occur only when the module is removed; the case with netns exit in cleanup_net() should not cause a problem. I'll have more time this weekend to reorganize the code... > net->ipvs = NULL; > } > > -- > 2.7.4 Regards -- Julian Anastasov
Re: [PATCH net] tcp: fix tcp_inet6_sk() for 32bit kernels
Hello, On Mon, 1 Apr 2019, Eric Dumazet wrote: > It turns out that struct ipv6_pinfo is not located as we think. > > inet6_sk_generic() and tcp_inet6_sk() disagree on 32bit kernels by 4-bytes, > because struct tcp_sock has 8-bytes alignment, > but ipv6_pinfo size is not a multiple of 8. > > sizeof(struct ipv6_pinfo): 116 (not padded to 8) > > I actually first coded tcp_inet6_sk() as this patch does, but thought > that "container_of(tcp_sk(sk), struct tcp6_sock, tcp)" was cleaner. > > As Julian told me : Nobody should use tcp6_sock.inet6 > directly, it should be accessed via tcp_inet6_sk() or inet6_sk(). > > This happened when we added the first u64 field in struct tcp_sock. > > Fixes: 93a77c11ae79 ("tcp: add tcp_inet6_sk() helper") > Signed-off-by: Eric Dumazet > Bisected-by: Julian Anastasov > --- > net/ipv6/tcp_ipv6.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c > index > eec814fe53b817106bc1d1eaa89dadcb96c974fa..82018bdce863165eba72e1ccf0c12ee558042ae8 > 100644 > --- a/net/ipv6/tcp_ipv6.c > +++ b/net/ipv6/tcp_ipv6.c > @@ -93,12 +93,13 @@ static struct tcp_md5sig_key *tcp_v6_md5_do_lookup(const > struct sock *sk, > /* Helper returning the inet6 address from a given tcp socket. > * It can be used in TCP stack instead of inet6_sk(sk). > * This avoids a dereference and allow compiler optimizations. > + * It is a specialized version of inet6_sk_generic(). > */ > static struct ipv6_pinfo *tcp_inet6_sk(const struct sock *sk) > { > - struct tcp6_sock *tcp6 = container_of(tcp_sk(sk), struct tcp6_sock, > tcp); > + unsigned int offset = sizeof(struct tcp6_sock) - sizeof(struct > ipv6_pinfo); > > - return &tcp6->inet6; > + return (struct ipv6_pinfo *)(((u8 *)sk) + offset); > } > > static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb) Thanks Eric! It is working here. Regards -- Julian Anastasov
Re: [PATCH net] ipvs: get sctphdr by sctphoff in sctp_csum_check
Hello, On Mon, 25 Feb 2019, Xin Long wrote: > sctp_csum_check() is called by sctp_s/dnat_handler() where it calls > skb_make_writable() to ensure the sctphdr is linearized. > > So there's no need to get the sctphdr by calling skb_header_pointer() > in sctp_csum_check(). > > Signed-off-by: Xin Long Looks good to me, thanks! Acked-by: Julian Anastasov I guess it is for the nf-next/net-next tree because it just eliminates a duplicate check. > --- > net/netfilter/ipvs/ip_vs_proto_sctp.c | 7 ++- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c > b/net/netfilter/ipvs/ip_vs_proto_sctp.c > index b0cd7d0..0ecf241 100644 > --- a/net/netfilter/ipvs/ip_vs_proto_sctp.c > +++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c > @@ -183,7 +183,7 @@ static int > sctp_csum_check(int af, struct sk_buff *skb, struct ip_vs_protocol *pp) > { > unsigned int sctphoff; > - struct sctphdr *sh, _sctph; > + struct sctphdr *sh; > __le32 cmp, val; > > #ifdef CONFIG_IP_VS_IPV6 > @@ -193,10 +193,7 @@ sctp_csum_check(int af, struct sk_buff *skb, struct > ip_vs_protocol *pp) > #endif > sctphoff = ip_hdrlen(skb); > > - sh = skb_header_pointer(skb, sctphoff, sizeof(_sctph), &_sctph); > - if (sh == NULL) > - return 0; > - > + sh = (struct sctphdr *)(skb->data + sctphoff); > cmp = sh->checksum; > val = sctp_compute_cksum(skb, sctphoff); > > -- > 2.1.0 Regards -- Julian Anastasov
Re: [PATCH ipvs-next] ipvs: use indirect call wrappers
Hello, On Sat, 19 Jan 2019, Matteo Croce wrote: > Use the new indirect call wrappers in IPVS when calling the TCP or UDP > protocol specific functions. > This avoids an indirect calls in IPVS, and reduces the performance > impact of the Spectre mitigation. > > Signed-off-by: Matteo Croce Looks good to me, thanks! Acked-by: Julian Anastasov > --- > net/netfilter/ipvs/ip_vs_core.c | 49 +++- > net/netfilter/ipvs/ip_vs_proto_tcp.c | 3 +- > net/netfilter/ipvs/ip_vs_proto_udp.c | 3 +- > 3 files changed, 45 insertions(+), 10 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > index fe9abf3cc10a..e969dad66991 100644 > --- a/net/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -53,6 +53,7 @@ > #endif > > #include > +#include > > > EXPORT_SYMBOL(register_ip_vs_scheduler); > @@ -70,6 +71,29 @@ EXPORT_SYMBOL(ip_vs_get_debug_level); > #endif > EXPORT_SYMBOL(ip_vs_new_conn_out); > > +#ifdef CONFIG_IP_VS_PROTO_TCP > +INDIRECT_CALLABLE_DECLARE(int > + tcp_snat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp, > + struct ip_vs_conn *cp, struct ip_vs_iphdr *iph)); > +#endif > + > +#ifdef CONFIG_IP_VS_PROTO_UDP > +INDIRECT_CALLABLE_DECLARE(int > + udp_snat_handler(struct sk_buff *skb, struct ip_vs_protocol *pp, > + struct ip_vs_conn *cp, struct ip_vs_iphdr *iph)); > +#endif > + > +#if defined(CONFIG_IP_VS_PROTO_TCP) && defined(CONFIG_IP_VS_PROTO_UDP) > +#define SNAT_CALL(f, ...) \ > + INDIRECT_CALL_2(f, tcp_snat_handler, udp_snat_handler, __VA_ARGS__) > +#elif defined(CONFIG_IP_VS_PROTO_TCP) > +#define SNAT_CALL(f, ...) INDIRECT_CALL_1(f, tcp_snat_handler, __VA_ARGS__) > +#elif defined(CONFIG_IP_VS_PROTO_UDP) > +#define SNAT_CALL(f, ...) INDIRECT_CALL_1(f, udp_snat_handler, __VA_ARGS__) > +#else > +#define SNAT_CALL(f, ...) 
f(__VA_ARGS__) > +#endif > + > static unsigned int ip_vs_net_id __read_mostly; > /* netns cnt used for uniqueness */ > static atomic_t ipvs_netns_cnt = ATOMIC_INIT(0); > @@ -478,7 +502,9 @@ ip_vs_schedule(struct ip_vs_service *svc, struct sk_buff > *skb, >*/ > if ((!skb->dev || skb->dev->flags & IFF_LOOPBACK)) { > iph->hdr_flags ^= IP_VS_HDR_INVERSE; > - cp = pp->conn_in_get(svc->ipvs, svc->af, skb, iph); > + cp = INDIRECT_CALL_1(pp->conn_in_get, > + ip_vs_conn_in_get_proto, svc->ipvs, > + svc->af, skb, iph); > iph->hdr_flags ^= IP_VS_HDR_INVERSE; > > if (cp) { > @@ -972,7 +998,8 @@ static int ip_vs_out_icmp(struct netns_ipvs *ipvs, struct > sk_buff *skb, > ip_vs_fill_iph_skb_icmp(AF_INET, skb, offset, true, &ciph); > > /* The embedded headers contain source and dest in reverse order */ > - cp = pp->conn_out_get(ipvs, AF_INET, skb, &ciph); > + cp = INDIRECT_CALL_1(pp->conn_out_get, ip_vs_conn_out_get_proto, > + ipvs, AF_INET, skb, &ciph); > if (!cp) > return NF_ACCEPT; > > @@ -1028,7 +1055,8 @@ static int ip_vs_out_icmp_v6(struct netns_ipvs *ipvs, > struct sk_buff *skb, > return NF_ACCEPT; > > /* The embedded headers contain source and dest in reverse order */ > - cp = pp->conn_out_get(ipvs, AF_INET6, skb, &ciph); > + cp = INDIRECT_CALL_1(pp->conn_out_get, ip_vs_conn_out_get_proto, > + ipvs, AF_INET6, skb, &ciph); > if (!cp) > return NF_ACCEPT; > > @@ -1263,7 +1291,8 @@ handle_response(int af, struct sk_buff *skb, struct > ip_vs_proto_data *pd, > goto drop; > > /* mangle the packet */ > - if (pp->snat_handler && !pp->snat_handler(skb, pp, cp, iph)) > + if (pp->snat_handler && > + !SNAT_CALL(pp->snat_handler, skb, pp, cp, iph)) > goto drop; > > #ifdef CONFIG_IP_VS_IPV6 > @@ -1389,7 +1418,8 @@ ip_vs_out(struct netns_ipvs *ipvs, unsigned int > hooknum, struct sk_buff *skb, in > /* >* Check if the packet belongs to an existing entry >*/ > - cp = pp->conn_out_get(ipvs, af, skb, &iph); > + cp = INDIRECT_CALL_1(pp->conn_out_get, ip_vs_conn_out_get_proto, > + ipvs, af, 
skb, &iph); > > if (likely(cp)) { > if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ) > @@ -1644,7 +1674,8 @@ ip_vs_in_icmp(struct netns_ipvs *ipvs, struct sk_buff > *s
Re: [PATCH ipvs-next] ipvs: avoid indirect calls when calculating checksums
Hello, On Sat, 19 Jan 2019, Matteo Croce wrote: > The function pointer ip_vs_protocol->csum_check is only used in protocol > specific code, and never in the generic one. > Remove the function pointer from struct ip_vs_protocol and call the > checksum functions directly. > This reduces the performance impact of the Spectre mitigation, and > should give a small improvement even with RETPOLINES disabled. > > Signed-off-by: Matteo Croce Looks good to me, thanks! Acked-by: Julian Anastasov > --- > include/net/ip_vs.h | 3 --- > net/netfilter/ipvs/ip_vs_proto_ah_esp.c | 2 -- > net/netfilter/ipvs/ip_vs_proto_sctp.c | 8 +--- > net/netfilter/ipvs/ip_vs_proto_tcp.c| 12 +++- > net/netfilter/ipvs/ip_vs_proto_udp.c| 12 +++- > 5 files changed, 19 insertions(+), 18 deletions(-) > > diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h > index a0d2e0bb9a94..047f9a5ccaad 100644 > --- a/include/net/ip_vs.h > +++ b/include/net/ip_vs.h > @@ -453,9 +453,6 @@ struct ip_vs_protocol { > int (*dnat_handler)(struct sk_buff *skb, struct ip_vs_protocol *pp, > struct ip_vs_conn *cp, struct ip_vs_iphdr *iph); > > - int (*csum_check)(int af, struct sk_buff *skb, > - struct ip_vs_protocol *pp); > - > const char *(*state_name)(int state); > > void (*state_transition)(struct ip_vs_conn *cp, int direction, > diff --git a/net/netfilter/ipvs/ip_vs_proto_ah_esp.c > b/net/netfilter/ipvs/ip_vs_proto_ah_esp.c > index 5320d39976e1..480598cb0f05 100644 > --- a/net/netfilter/ipvs/ip_vs_proto_ah_esp.c > +++ b/net/netfilter/ipvs/ip_vs_proto_ah_esp.c > @@ -129,7 +129,6 @@ struct ip_vs_protocol ip_vs_protocol_ah = { > .conn_out_get = ah_esp_conn_out_get, > .snat_handler = NULL, > .dnat_handler = NULL, > - .csum_check = NULL, > .state_transition = NULL, > .register_app = NULL, > .unregister_app = NULL, > @@ -152,7 +151,6 @@ struct ip_vs_protocol ip_vs_protocol_esp = { > .conn_out_get = ah_esp_conn_out_get, > .snat_handler = NULL, > .dnat_handler = NULL, > - .csum_check = NULL, > .state_transition = NULL, > 
.register_app = NULL, > .unregister_app = NULL, > diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c > b/net/netfilter/ipvs/ip_vs_proto_sctp.c > index b0cd7d08f2a7..bc3d1625ecc8 100644 > --- a/net/netfilter/ipvs/ip_vs_proto_sctp.c > +++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c > @@ -9,6 +9,9 @@ > #include > #include > > +static int > +sctp_csum_check(int af, struct sk_buff *skb, struct ip_vs_protocol *pp); > + > static int > sctp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, > struct ip_vs_proto_data *pd, > @@ -105,7 +108,7 @@ sctp_snat_handler(struct sk_buff *skb, struct > ip_vs_protocol *pp, > int ret; > > /* Some checks before mangling */ > - if (pp->csum_check && !pp->csum_check(cp->af, skb, pp)) > + if (!sctp_csum_check(cp->af, skb, pp)) > return 0; > > /* Call application helper if needed */ > @@ -152,7 +155,7 @@ sctp_dnat_handler(struct sk_buff *skb, struct > ip_vs_protocol *pp, > int ret; > > /* Some checks before mangling */ > - if (pp->csum_check && !pp->csum_check(cp->af, skb, pp)) > + if (!sctp_csum_check(cp->af, skb, pp)) > return 0; > > /* Call application helper if needed */ > @@ -587,7 +590,6 @@ struct ip_vs_protocol ip_vs_protocol_sctp = { > .conn_out_get = ip_vs_conn_out_get_proto, > .snat_handler = sctp_snat_handler, > .dnat_handler = sctp_dnat_handler, > - .csum_check = sctp_csum_check, > .state_name = sctp_state_name, > .state_transition = sctp_state_transition, > .app_conn_bind = sctp_app_conn_bind, > diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c > b/net/netfilter/ipvs/ip_vs_proto_tcp.c > index 1770fc6ce960..6a275f989085 100644 > --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c > +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c > @@ -31,6 +31,9 @@ > > #include > > +static int > +tcp_csum_check(int af, struct sk_buff *skb, struct ip_vs_protocol *pp); > + > static int > tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct sk_buff *skb, > struct ip_vs_proto_data *pd, > @@ -166,7 +169,7 @@ tcp_snat_handler(struct sk_buff
Re: [PATCH v3] ipvs: fix race between ip_vs_conn_new() and ip_vs_del_dest()
Hello, On Wed, 25 Jul 2018, Tan Hu wrote: > We came across infinite loop in ipvs when using ipvs in docker > env. > > When ipvs receives new packets and cannot find an ipvs connection, > it will create a new connection, then if the dest is unavailable > (i.e. IP_VS_DEST_F_AVAILABLE), the packet will be dropped sliently. > > But if the dropped packet is the first packet of this connection, > the connection control timer never has a chance to start and the > ipvs connection cannot be released. This will lead to memory leak, or > infinite loop in cleanup_net() when net namespace is released like > this: > > ip_vs_conn_net_cleanup at a0a9f31a [ip_vs] > __ip_vs_cleanup at a0a9f60a [ip_vs] > ops_exit_list at 81567a49 > cleanup_net at 81568b40 > process_one_work at 810a851b > worker_thread at 810a9356 > kthread at 810b0b6f > ret_from_fork at 81697a18 > > race condition: > CPU1 CPU2 > ip_vs_in() > ip_vs_conn_new() >ip_vs_del_dest() > __ip_vs_unlink_dest() >~IP_VS_DEST_F_AVAILABLE > cp->dest && !IP_VS_DEST_F_AVAILABLE > __ip_vs_conn_put > ... > cleanup_net ---> infinite looping > > Fix this by checking whether the timer already started. > > Signed-off-by: Tan Hu > Reviewed-by: Jiang Biao v3 looks good to me, Acked-by: Julian Anastasov Simon and Pablo, this can be applied to ipvs/nf tree... 
> --- > v2: fix use-after-free in CONN_ONE_PACKET case suggested by Julian Anastasov > v3: remove trailing whitespace for patch checking > > net/netfilter/ipvs/ip_vs_core.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > index 0679dd1..a17104f 100644 > --- a/net/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -1972,13 +1972,20 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, > struct sk_buff *skb, > if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) { > /* the destination server is not available */ > > - if (sysctl_expire_nodest_conn(ipvs)) { > + __u32 flags = cp->flags; > + > + /* when timer already started, silently drop the packet.*/ > + if (timer_pending(&cp->timer)) > + __ip_vs_conn_put(cp); > + else > + ip_vs_conn_put(cp); > + > + if (sysctl_expire_nodest_conn(ipvs) && > + !(flags & IP_VS_CONN_F_ONE_PACKET)) { > /* try to expire the connection immediately */ > ip_vs_conn_expire_now(cp); > } > - /* don't restart its timer, and silently > -drop the packet. */ > - __ip_vs_conn_put(cp); > + > return NF_DROP; > } > > -- > 1.8.3.1 Regards -- Julian Anastasov
Re: [PATCH v2] ipvs: fix race between ip_vs_conn_new() and ip_vs_del_dest()
Hello, On Wed, 25 Jul 2018, Tan Hu wrote: > We came across infinite loop in ipvs when using ipvs in docker > env. > > When ipvs receives new packets and cannot find an ipvs connection, > it will create a new connection, then if the dest is unavailable > (i.e. IP_VS_DEST_F_AVAILABLE), the packet will be dropped sliently. > > But if the dropped packet is the first packet of this connection, > the connection control timer never has a chance to start and the > ipvs connection cannot be released. This will lead to memory leak, or > infinite loop in cleanup_net() when net namespace is released like > this: > > ip_vs_conn_net_cleanup at a0a9f31a [ip_vs] > __ip_vs_cleanup at a0a9f60a [ip_vs] > ops_exit_list at 81567a49 > cleanup_net at 81568b40 > process_one_work at 810a851b > worker_thread at 810a9356 > kthread at 810b0b6f > ret_from_fork at 81697a18 > > race condition: > CPU1 CPU2 > ip_vs_in() > ip_vs_conn_new() >ip_vs_del_dest() > __ip_vs_unlink_dest() >~IP_VS_DEST_F_AVAILABLE > cp->dest && !IP_VS_DEST_F_AVAILABLE > __ip_vs_conn_put > ... > cleanup_net ---> infinite looping > > Fix this by checking whether the timer already started. > > Signed-off-by: Tan Hu > Reviewed-by: Jiang Biao > --- > v2: fix use-after-free in CONN_ONE_PACKET case suggested by Julian Anastasov > > net/netfilter/ipvs/ip_vs_core.c | 15 +++ > 1 file changed, 11 insertions(+), 4 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c > index 0679dd1..a17104f 100644 > --- a/net/netfilter/ipvs/ip_vs_core.c > +++ b/net/netfilter/ipvs/ip_vs_core.c > @@ -1972,13 +1972,20 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, > struct sk_buff *skb, > if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) { > /* the destination server is not available */ > > - if (sysctl_expire_nodest_conn(ipvs)) { > + __u32 flags = cp->flags; Ops, now scripts/checkpatch.pl --strict /tmp/file.patch is complaining about extra trailing space in above line. 
You can also remove the empty line above the new code... Regards -- Julian Anastasov
Re: Route fallback issue
Hello, On Sun, 24 Jun 2018, Erik Auerswald wrote: > Hello Julien, > > > http://ja.ssi.bg/dgd-usage.txt > > Thanks for that info! > > Can you tell us what parts from the above text is actually implemented > in the upstream Linux kernel, and starting with which version(s) > (approximately)? The text describes ideas and patches from nearly two > decades ago, is more recent documentation available somewhere?

Nothing is included in the kernel; the idea is that user space has more control. It is best done with CONNMARK: stick a NATed connection to some path (via an alive ISP) and use the route lookup only to select an alive path for the first packet of a connection. So, what we balance are connections, not packets (per-packet balancing does not work across different ISPs). Probe the GWs to keep only alive routes in the table. Regards -- Julian Anastasov
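A rough sketch of such a CONNMARK setup follows; all device names, marks, table numbers and gateway addresses here are hypothetical, only to illustrate the idea of pinning a connection to the ISP chosen for its first packet:

```shell
# Restore the saved mark so packets of established connections are
# routed over the path chosen for their first packet.
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark

# The first packet of a connection has mark 0: the normal route
# lookup (over tables kept alive by health checks) picks a path.
# Remember the chosen ISP by marking per output device, then save
# the mark into the conntrack entry for later packets.
iptables -t mangle -A POSTROUTING -o isp0 -m mark --mark 0 -j MARK --set-mark 1
iptables -t mangle -A POSTROUTING -o isp1 -m mark --mark 0 -j MARK --set-mark 2
iptables -t mangle -A POSTROUTING -j CONNMARK --save-mark

# Later packets carry the restored mark and are pinned to their ISP,
# even after the health checker changes the main routing table.
ip rule add prio 100 fwmark 1 table 101
ip rule add prio 100 fwmark 2 table 102
ip route add default via 192.0.2.1 dev isp0 table 101
ip route add default via 198.51.100.1 dev isp1 table 102
```

Per-ISP SNAT rules keyed on the same mark would complete the picture, so replies come back via the provider that carried the first packet.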
Re: Route fallback issue
Hello, On Thu, 21 Jun 2018, Grant Taylor wrote: > On 06/21/2018 01:57 PM, Julian Anastasov wrote: > > Hello, > > > http://ja.ssi.bg/dgd-usage.txt > > "DGD" or "Dead Gateway Detection" sounds very familiar. I referenced it in an > earlier reply. > > I distinctly remember DGD not behaving satisfactorily years ago. Where > unsatisfactorily was something like 90 seconds (or more) to recover. Which > actually matches what I was getting without the ignore_routes_with_linkdown=1 > setting that David A. mentioned. Yes, ARP state for unreachable GWs may be updated slowly, there is in-time feedback only for reachable state. > With ignore_routes_with_linkdown=1 things behaved much better. > > > Not true. net/ipv4/fib_semantics.c:fib_select_path() calls > > fib_select_default() only when prefixlen = 0 (default route). > > Okay My testing last night disagrees with you. Specifically, I was able > to add a alternate routes to the same prefix, 192.0.2.128/26. There was not > any default gateway configured on any of the NetNSs. So everything was using > routes for locally attacked or the two added via "ip route append". > > What am I misinterpreting? Or where are we otherwise talking past each other? You can create the two routes, of course. But only the default routes are alternative. > > > Otherwise, only the first route will be considered. > > "only the first route" almost sounds like something akin to Equal Cost Multi > Path. > > I was not expecting "alternative routes" to use more than one route at a time, > equally or otherwise. I was wanting for the kernel to fall back to an > alternate route / gateway / path in the event that the one that was being used > became unusable / unreachable. > > So what should "Alternative Routes" do? How does this compare / contract to > E.C.M.P. or D.G.D. 
The alternative routes work in this way:

- on lookup, routes are walked in order, as listed in the table
- as long as a route contains a reachable gateway (ARP state), only this route is used
- if some gateway becomes unreachable (ARP state), the next alternative routes are tried
- if an ARP entry is expired (missing), the gateway can be probed if its route comes before the currently used route. This is what happens initially, when no ARP state is present for the GWs. It is bad luck if the probed GW is actually unreachable.
- active probing by user space (pinging the GWs) can only help to keep the ARP state present for the used gateways. This way, if the ARP entry for a GW is missing, the kernel will not risk selecting an unavailable route just to probe the GW.

> > fib_select_default() is the function that decides which nexthop is reachable
> > and whether to contact it. It uses the ARP state via fib_detect_death().
> > That is all code that is behind this feature called "alternative routes":
> > the kernel selects one based on nexthop's ARP state.
>
> Please confirm that you aren't entering / referring to E.C.M.P. territory when
> you say "nexthop". I think that you are not, but I want to ask and be sure,
> particularly seeing as how things are very closely related.

nexthop is the GW in the route

> It sounds like you're referring to literally the router that is the next hop
> in the path. I.e. the device on the other end of the wire.

Yes, the kernel avoids alternative routes with unreachable GWs

> I want to do some testing to see if fib_multipath_use_neigh alters this
> behavior at all. I'm hoping that it will invalidate an alternate route if the
> MAC is not resolvable even if the physical link stays up.

The multipath route uses all its alive nexthops at the same time... But you may need active probing by user space in the same way; otherwise, an unavailable GW can be selected.

> Sure, the ARP cache may have a 30 ~ 120 second timeout before triggering this
> behavior.
> But having that timeout and starting to use an alternative route is
> considerably better than not using an alternative route.

Yes, if you prefer, you may run PING every second to avoid such delays... Regards -- Julian Anastasov
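The per-second PING mentioned above can be as simple as the loop below; the gateway addresses and the one-second interval are examples, not from the original setup:

```shell
#!/bin/sh
# Keep the neighbour (ARP) entries of all candidate gateways fresh,
# so the kernel always has recent reachability state when it picks
# among the alternative routes.
GWS="192.0.2.62 192.0.2.126"   # example gateway list
while :; do
    for gw in $GWS; do
        # One probe per gateway, in parallel, short timeout.
        ping -c 1 -W 1 "$gw" >/dev/null 2>&1 &
    done
    wait
    sleep 1
done
```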
Re: Route fallback issue
Hello, On Wed, 20 Jun 2018, Grant Taylor wrote: > On 06/20/2018 01:00 PM, Julian Anastasov wrote: > > You can also try alternative routes. > > "Alternative routes"? I can't say as I've heard that description as a > specific technique / feature / capability before. > > Is that it's official name? I think so > Where can I find out more about it? You can search on net. I have some old docs on these issues, they should be actual: http://ja.ssi.bg/dgd-usage.txt > > But as the kernel supports only default alternative routes, you can put them > > in their own table: > > I don't know that that is the case any more. > > I was able to issue the following commands without a problem: > > # ip route append 192.0.2.128/26 via 192.0.2.62 > # ip route append 192.0.2.128/26 via 192.0.2.126 > > I crated two network namespaces and had a pair of vEths between them > (192.0.2.0/26 and 192.0.2.64/26). I added a dummy network to each NetNS > (192.0.2.128/26 and 192.0.2.192/26). > > I ran the following commands while a persistent ping was running from one > NetNS to the IP on the other's dummy0 interface: > > # ip link set ns2b up && ip route append 192.0.2.192/26 via 192.0.2.126 && ip > link set ns2a down > (pause and watch things) > # ip link set ns2a up && ip route append 192.0.2.192/26 via 192.0.2.62 && ip > link set ns2b down > (pause and watch things) > > I could iterate between the two above commands and pings continued to work. > > So, I think that it's now possible to use "alternate routes" (new to me) on > specific prefixes in addition to the default. Thus there is no longer any > need for a separate table and the associated IP rule. Not true. net/ipv4/fib_semantics.c:fib_select_path() calls fib_select_default() only when prefixlen = 0 (default route). Otherwise, only the first route will be considered. fib_select_default() is the function that decides which nexthop is reachable and whether to contact it. It uses the ARP state via fib_detect_death(). 
That is all code that is behind this feature called "alternative routes": the kernel selects one based on nexthop's ARP state. Routes with different metric are considered only when the routes with lower metric are removed. > I'm running kernel version 4.9.76. > > I did go ahead and set net.ipv4.conf.ns2b.ignore_routes_with_linkdown to 1. > > for i in /proc/sys/net/ipv4/conf/*/ignore_routes_with_linkdown; do echo 1 > > $i; done IIRC, this flag invalidates nexthops depending on the link state. If your link is always UP it does not help much. If you rely on user space tool, you can check the state of the desired hops: device link state, your gateway to ISP, one or more gateways in the ISP network which you consider permanent part of the path via this ISP. > Doing that dropped the number of dropped pings from 60 ~ 90 (1 / second) to 0 > ~ 5 (1 / second). (Rarely, maybe 1 out of 20 flips, would it take upwards of > 10 pings / seconds.) > > > # Alternative routes use same metric!!! > > ip route append default via 192.168.1.254 dev eno1 table 100 > > ip route append default via 192.168.2.254 dev eno2 table 100 > > ip rule add prio 100 to 172.16.0.0/12 table 100 > > I did have to "append" the route. I couldn't just "add" the route. When I > tried to "add" the second route, I got an error about the route already > existing. Using "append" instead of "add" with everything else the same > worked just fine. > > Note: I did go ahead and remove the single route that was added via "add" and > used "append" for both. First route can be created with 'add' but all next alternative routes can be added only with "append". If you successfully add them with "add" it means they are not alternatives to the first one, they are not considered at all. Regards -- Julian Anastasov
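The "add" vs "append" difference described above shows up immediately on the command line; the table number and gateways here are illustrative:

```shell
# The first route can go in with 'add'...
ip route add default via 192.168.1.254 dev eno1 table 100
# ...but a second 'add' for the same prefix and metric is rejected:
ip route add default via 192.168.2.254 dev eno2 table 100
#   RTNETLINK answers: File exists
# 'append' creates a real alternative route with the same metric:
ip route append default via 192.168.2.254 dev eno2 table 100
```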
Re: Route fallback issue
Hello, On Wed, 20 Jun 2018, Akshat Kakkar wrote: > Hi netdev community, > > I have 2 interfaces > eno1 : 192.168.1.10/24 > eno2 : 192.168.2.10/24 > > I added routes as > 172.16.0.0/12 via 192.168.1.254 metric 1 > 172.16.0.0/12 via 192.168.2.254 metric 2 > > My intention : All traffic to 172.16.0.0/12 should go thru eno1. If > 192.168.1.254 is not reachable (no arp entry or link down), then it > should fall back to eno2. You can also try alternative routes. But as the kernel supports only default alternative routes, you can put them in their own table: # Alternative routes use same metric!!! ip route append default via 192.168.1.254 dev eno1 table 100 ip route append default via 192.168.2.254 dev eno2 table 100 ip rule add prio 100 to 172.16.0.0/12 table 100 Of course, you will get better results if an user space tool puts only alive routes in service after doing health checks of all near gateways. Regards -- Julian Anastasov
[PATCH net] ipv6: allow PMTU exceptions to local routes
IPVS setups with local client and remote tunnel server need to create exception for the local virtual IP. What we do is to change PMTU from 64KB (on "lo") to 1460 in the common case. Suggested-by: Martin KaFai Lau Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception") Fixes: 7343ff31ebf0 ("ipv6: Don't create clones of host routes.") Signed-off-by: Julian Anastasov --- net/ipv6/route.c | 3 --- 1 file changed, 3 deletions(-) Note: I failed to build 2.6.38 kernel for the test but I think commit 7343ff31ebf0 looks as the one that added the restriction to change PMTU for local IPs. So, I'm not sure from how long time the issue with local IPVS clients and tunnelling method exists. May be it worked between 2.6.28 and 2.6.37. diff --git a/net/ipv6/route.c b/net/ipv6/route.c index fb95698..86a0e43 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2307,9 +2307,6 @@ static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk, const struct in6_addr *daddr, *saddr; struct rt6_info *rt6 = (struct rt6_info *)dst; - if (rt6->rt6i_flags & RTF_LOCAL) - return; - if (dst_metric_locked(dst, RTAX_MTU)) return; -- 2.9.5
Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
Hello, On Tue, 5 Jun 2018, Martin KaFai Lau wrote: > On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote: > > So, except the RTF_LOCAL check in __ip6_rt_update_pmtu > > we should have no other issues. > Hi Julian, > > Do you have a chance to work on the IPv6 side? I was chasing other IPVS issues while preparing to finalize these changes. I plan to do tests this weekend and to submit my patch, but without the gso_size modifications. I can post the patch for this check in __ip6_rt_update_pmtu too, separately, after making sure all is fine. Regards -- Julian Anastasov
Re: [PATCH] ipvs: drop templates for never established TCP connections
/ > +#define IP_VS_CONN_F_TEMPLATE0x1000 /* template, > not connection */ > +#define IP_VS_CONN_F_ONE_PACKET 0x2000 /* forward only > one packet */ > +#define IP_VS_CONN_F_TMPL_PERSISTED 0x4000 /* template, confirmed > persistent */ > > /* Initial bits allowed in backup server */ > #define IP_VS_CONN_F_BACKUP_MASK (IP_VS_CONN_F_FWD_MASK | \ > diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c > index 370abbf6f421..6afc606a388c 100644 > --- a/net/netfilter/ipvs/ip_vs_conn.c > +++ b/net/netfilter/ipvs/ip_vs_conn.c > @@ -820,6 +820,7 @@ static void ip_vs_conn_rcu_free(struct rcu_head *head) > static void ip_vs_conn_expire(struct timer_list *t) > { > struct ip_vs_conn *cp = from_timer(cp, t, timer); > + struct ip_vs_conn *cp_c; > struct netns_ipvs *ipvs = cp->ipvs; > > /* > @@ -834,8 +835,15 @@ static void ip_vs_conn_expire(struct timer_list *t) > del_timer(&cp->timer); > > /* does anybody control me? */ > - if (cp->control) > + cp_c = cp->control; > + if (cp_c) { > ip_vs_control_del(cp); > + if (cp_c->flags & IP_VS_CONN_F_TEMPLATE && > + !(cp_c->flags & IP_VS_CONN_F_TMPL_PERSISTED)) { > + IP_VS_DBG(4, "del conn template\n"); > + ip_vs_conn_expire_now(cp_c); So, we have current conn expired after 60 seconds in IP_VS_TCP_S_SYN_RECV state and possibly other conns in same state that are not expired yet. Another option is just to use something like: if (cp_c) { ip_vs_control_del(cp); /* Restart cp_c timer only for last conn */ if (!atomic_read(&cp_c->n_control) && (cp_c->flags & IP_VS_CONN_F_TEMPLATE) && /* Some func to decide when to drop cp_c: * - it can be for SYN state * - it can be when cp was dropped on load */ cp->state == IP_VS_TCP_S_SYN_RECV) { IP_VS_DBG(4, "del conn template\n"); ip_vs_conn_expire_now(cp_c); } } It is not perfect, i.e. 
it does not know if there was some conn that was established in the past: - CONN1: SYN, SYN+ACK, ESTABLISH, FIN, FIN+ACK, expire - CONN2: expire in SYN state, drop tpl before persistent timeout But it should work in the general case. Anyways, give me some days to think more on this issue. Regards -- Julian Anastasov
Re: kernel BUG at lib/string.c:LINE! (4)
Hello, On Wed, 16 May 2018, syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit:0b7d9978406f Merge branch 'Microsemi-Ocelot-Ethernet-switc.. > git tree: net-next > console output: https://syzkaller.appspot.com/x/log.txt?x=16e9101780 > kernel config: https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1 > dashboard link: https://syzkaller.appspot.com/bug?extid=aac887f77319868646df > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1665d63780 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1051710780 > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+aac887f7731986864...@syzkaller.appspotmail.com > > IPVS: Unknown mcast interface: veth1_to???a > IPVS: Unknown mcast interface: veth1_to???a > IPVS: Unknown mcast interface: veth1_to???a > detected buffer overflow in strlen > [ cut here ] > kernel BUG at lib/string.c:1052! > invalid opcode: [#1] SMP KASAN > Dumping ftrace buffer: > (ftrace buffer empty) > Modules linked in: > CPU: 1 PID: 373 Comm: syz-executor936 Not tainted 4.17.0-rc4+ #45 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google > 01/01/2011 > RIP: 0010:fortify_panic+0x13/0x20 lib/string.c:1051 > RSP: 0018:8801c976f800 EFLAGS: 00010282 > RAX: 0022 RBX: 0040 RCX: > RDX: 0022 RSI: 8160f6f1 RDI: ed00392edef6 > RBP: 8801c976f800 R08: 8801cf4c62c0 R09: ed003b5e4fb0 > R10: ed003b5e4fb0 R11: 8801daf27d87 R12: 8801c976fa20 > R13: 8801c976fae4 R14: 8801c976fae0 R15: 048b > FS: 7fd99f75e700() GS:8801daf0() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 21c0 CR3: 0001d6843000 CR4: 001406e0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > strlen include/linux/string.h:270 [inline] > strlcpy include/linux/string.h:293 [inline] > do_ip_vs_set_ctl+0x31c/0x1d00 net/netfilter/ipvs/ip_vs_ctl.c:2388 > nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] > nf_setsockopt+0x7d/0xd0 
net/netfilter/nf_sockopt.c:115 > ip_setsockopt+0xd8/0xf0 net/ipv4/ip_sockglue.c:1253 > udp_setsockopt+0x62/0xa0 net/ipv4/udp.c:2487 > ipv6_setsockopt+0x149/0x170 net/ipv6/ipv6_sockglue.c:917 > tcp_setsockopt+0x93/0xe0 net/ipv4/tcp.c:3057 > sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3046 > __sys_setsockopt+0x1bd/0x390 net/socket.c:1903 > __do_sys_setsockopt net/socket.c:1914 [inline] > __se_sys_setsockopt net/socket.c:1911 [inline] > __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911 > do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 > entry_SYSCALL_64_after_hwframe+0x49/0xbe > RIP: 0033:0x447369 > RSP: 002b:7fd99f75dda8 EFLAGS: 0246 ORIG_RAX: 0036 > RAX: ffda RBX: 006e39e4 RCX: 00447369 > RDX: 048b RSI: RDI: 0003 > RBP: R08: 0018 R09: > R10: 21c0 R11: 0246 R12: 006e39e0 > R13: 75a1ff93f0896195 R14: 6f745f3168746576 R15: 0001 > Code: 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 48 89 df e8 d2 8f 48 fa eb de > 55 48 89 fe 48 c7 c7 60 65 64 88 48 89 e5 e8 91 dd f3 f9 <0f> 0b 90 90 90 90 > 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56 > RIP: fortify_panic+0x13/0x20 lib/string.c:1051 RSP: 8801c976f800 > ---[ end trace 624046f2d9af7702 ]--- Just to let you know that I tested a patch with the syzbot, will do more tests before submitting... Regards -- Julian Anastasov
Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
Hello, On Mon, 7 May 2018, Martin KaFai Lau wrote: > On Sat, May 05, 2018 at 03:58:25PM +0300, Julian Anastasov wrote: > > > > So, except the RTF_LOCAL check in __ip6_rt_update_pmtu > > we should have no other issues. Only one minor bit is strange to me, > > why rt6_insert_exception warns for RTF_PCPU if rt6_cache_allowed_for_pmtu > > allows it when returning true... > hmm...I am not sure I follow these bits. Where is the warn?

	if (ort->rt6i_flags & (RTF_CACHE | RTF_PCPU))
		ort = ort->from;

Sorry, my fault, I missed the above re-assignment before the

	WARN_ON_ONCE(ort->rt6i_flags & (RTF_CACHE | RTF_PCPU));

> Note that "nrt6" and "from" are passed to rt6_insert_exception() > instead of "rt6". > > > > > Also, commit 0d3f6d297bfb allows rt6_do_update_pmtu() for > > routes without RTF_CACHE, RTF_PCPU and rt6i_node. Should we > > restrict rt6_do_update_pmtu only to RTF_CACHE routes? > > > > if (!rt6_cache_allowed_for_pmtu(rt6)) { > > - rt6_do_update_pmtu(rt6, mtu); > The existing rt6_do_update_pmtu() looks correct. > The mtu of the dst created by icmp6_dst_alloc() > needs to be updated and this dst does not have > the RTF_CACHE. Aha, ok. I thought only RTF_CACHE routes could hold a PMTU. Regards -- Julian Anastasov
Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
Hello, On Thu, 3 May 2018, Martin KaFai Lau wrote: > > - when exactly we start to use the new PMTU, eg. what happens > > in case socket caches the route, whether route is killed via > > dst->obsolete. Or may be while the PMTU expiration is handled > > per-packet, the PMTU change is noticed only on ICMP... > Before sk can reuse its dst cache, the sk will notice > its dst cache is no longer valid by calling dst_check(). > dst_check() should return NULL which is one of the side > effect of the earlier update_pmtu(). This dst_check() > is usually only called when the sk needs to do output, > so the new PMTU route (i.e. the RTF_CACHE IPv6 route) > only have effect to the later packets.

I checked again the code and it looks like sockets are forced to use new exceptional route (RTF_CACHE/fnhe) via dst_check only when the PMTU update should move them away from old non-exceptional routes. Later, if PMTU is reduced/updated this is noticed for every packet via dst_mtu, as in the case with TCP. So, except the RTF_LOCAL check in __ip6_rt_update_pmtu we should have no other issues. Only one minor bit is strange to me, why rt6_insert_exception warns for RTF_PCPU if rt6_cache_allowed_for_pmtu allows it when returning true...

Also, commit 0d3f6d297bfb allows rt6_do_update_pmtu() for routes without RTF_CACHE, RTF_PCPU and rt6i_node. Should we restrict rt6_do_update_pmtu only to RTF_CACHE routes?

	if (!rt6_cache_allowed_for_pmtu(rt6)) {
-		rt6_do_update_pmtu(rt6, mtu);
-		/* update rt6_ex->stamp for cache */
-		if (rt6->rt6i_flags & RTF_CACHE)
+		if (rt6->rt6i_flags & RTF_CACHE) {
+			rt6_do_update_pmtu(rt6, mtu);
+			/* update rt6_ex->stamp for cache */
 			rt6_update_exception_stamp_rt(rt6);
+		}
	} else if (daddr) {

Regards -- Julian Anastasov
Re: [PATCH net] ipv4: fix fnhe usage by non-cached routes
Hello, On Wed, 2 May 2018, David Ahern wrote: > On 5/2/18 12:41 AM, Julian Anastasov wrote: > > Allow some non-cached routes to use non-expired fnhe: > > > > 1. ip_del_fnhe: moved above and now called by find_exception. > > The 4.5+ commit deed49df7390 expires fnhe only when caching > > routes. Change that to: > > > > 1.1. use fnhe for non-cached local output routes, with the help > > from (2) > > > > 1.2. allow __mkroute_input to detect expired fnhe (outdated > > fnhe_gw, for example) when do_cache is false, eg. when itag!=0 > > for unicast destinations. > > > > 2. __mkroute_output: keep fi to allow local routes with orig_oif != 0 > > to use fnhe info even when the new route will not be cached into fnhe. > > After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib > > result for local traffic") it means all local routes will be affected > > because they are not cached. This change is used to solve a PMTU > > problem with IPVS (and probably Netfilter DNAT) setups that redirect > > local clients from target local IP (local route to Virtual IP) > > to new remote IP target, eg. IPVS TUN real server. Loopback has > > 64K MTU and we need to create fnhe on the local route that will > > keep the reduced PMTU for the Virtual IP. Without this change > > fnhe_pmtu is updated from ICMP but never exposed to non-cached > > local routes. This includes routes with flowi4_oif!=0 for 4.6+ and > > with flowi4_oif=any for 4.14+). > > Can you add a test case to tools/testing/selftests/net/pmtu.sh to cover > this situation? Sure, I'll give it a try. 
> > @@ -1310,8 +1340,14 @@ static struct fib_nh_exception > > *find_exception(struct fib_nh *nh, __be32 daddr) > > > > for (fnhe = rcu_dereference(hash[hval].chain); fnhe; > > fnhe = rcu_dereference(fnhe->fnhe_next)) { > > - if (fnhe->fnhe_daddr == daddr) > > + if (fnhe->fnhe_daddr == daddr) { > > + if (fnhe->fnhe_expires && > > + time_after(jiffies, fnhe->fnhe_expires)) { > > + ip_del_fnhe(nh, daddr); > > I'm surprised this is done in the fast path vs gc time. (the existing > code does as well; your change is only moving the call to make the input > and output paths the same) > > > The change looks correct to me and all of my functional tests passed. > > Acked-by: David Ahern Thanks for the review! Regards -- Julian Anastasov
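[Editor's note: for readers following the patch, here is a standalone Python toy model of the lookup-time expiry pattern being reviewed — an exception entry is deleted in the lookup path itself rather than waiting for garbage collection, mirroring the moved ip_del_fnhe() call. All names are illustrative; this is not the kernel code.]

```python
import time

class ExceptionCache:
    """Toy model of a per-nexthop exception table with expiry on lookup."""

    def __init__(self):
        self._entries = {}  # daddr -> (pmtu, expires); expires=None means "never"

    def update(self, daddr, pmtu, lifetime=None):
        expires = time.monotonic() + lifetime if lifetime is not None else None
        self._entries[daddr] = (pmtu, expires)

    def find_exception(self, daddr):
        entry = self._entries.get(daddr)
        if entry is None:
            return None
        pmtu, expires = entry
        if expires is not None and time.monotonic() > expires:
            # Expired: remove in the fast path, as the patch does for fnhe.
            del self._entries[daddr]
            return None
        return pmtu

cache = ExceptionCache()
cache.update("203.0.113.5", pmtu=1400, lifetime=0.05)
assert cache.find_exception("203.0.113.5") == 1400
time.sleep(0.1)
assert cache.find_exception("203.0.113.5") is None  # expired and deleted
```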
Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
Hello, On Wed, 2 May 2018, Martin KaFai Lau wrote: > On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote: > > > > - initial traffic for port 21 does not use GSO. But after > > every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu) > > to report the reduced MTU. These updates are stored in fnhe_pmtu > > but they do not go to any route, even if we try to get fresh > > output route. Why? Because the local routes are not cached, so > > they can not use the fnhe. This is what my patch for route.c > > will fix. With this fix FTP-DATA gets route with reduced PMTU. > For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in > __ip6_rt_update_pmtu() may need to be lifted also. Probably. I completely forgot the IPv6 part but as I don't know the IPv6 code enough, it may take some time to understand what can be the problem there... I'm not sure whether everything started with commit 0a6b2a1dc2a2, so that in some configurations before that commit things worked and problem was not noticed. I think, we should focus on such direction for IPv6: - do we remember per-VIP PMTU for the local routes - when exactly we start to use the new PMTU, eg. what happens in case socket caches the route, whether route is killed via dst->obsolete. Or may be while the PMTU expiration is handled per-packet, the PMTU change is noticed only on ICMP... - as IPVS reports the PMTU via dst.ops->update_pmtu() long before any large packets are sent, do we propagate the PMTU. Also, for IPv4 __ip_rt_update_pmtu() has some protection from such per-packet updates that do not change the PMTU. - if IPVS starts to send ICMP when gso_size exceeds PMTU, like in my draft patch, whether the PMTU is propagated to route and then to socket. As for the gso_size decrease, playing in IPVS is not very safe, at least, we need help from GSO experts to know how we should use it. Regards -- Julian Anastasov
[PATCH net] ipv4: fix fnhe usage by non-cached routes
Allow some non-cached routes to use non-expired fnhe: 1. ip_del_fnhe: moved above and now called by find_exception. The 4.5+ commit deed49df7390 expires fnhe only when caching routes. Change that to: 1.1. use fnhe for non-cached local output routes, with the help from (2) 1.2. allow __mkroute_input to detect expired fnhe (outdated fnhe_gw, for example) when do_cache is false, eg. when itag!=0 for unicast destinations. 2. __mkroute_output: keep fi to allow local routes with orig_oif != 0 to use fnhe info even when the new route will not be cached into fnhe. After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local traffic") it means all local routes will be affected because they are not cached. This change is used to solve a PMTU problem with IPVS (and probably Netfilter DNAT) setups that redirect local clients from target local IP (local route to Virtual IP) to new remote IP target, eg. IPVS TUN real server. Loopback has 64K MTU and we need to create fnhe on the local route that will keep the reduced PMTU for the Virtual IP. Without this change fnhe_pmtu is updated from ICMP but never exposed to non-cached local routes. This includes routes with flowi4_oif!=0 for 4.6+ and with flowi4_oif=any for 4.14+). 3. 
update_or_create_fnhe: make sure fnhe_expires is not 0 for new entries Fixes: 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local traffic") Fixes: d6d5e999e5df ("route: do not cache fib route info on local routes with oif") Fixes: deed49df7390 ("route: check and remove route cache when we get route") Cc: David Ahern Cc: Xin Long Signed-off-by: Julian Anastasov --- net/ipv4/route.c | 118 +-- 1 file changed, 53 insertions(+), 65 deletions(-) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index ccb25d8..1412a7b 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -709,7 +709,7 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw, fnhe->fnhe_gw = gw; fnhe->fnhe_pmtu = pmtu; fnhe->fnhe_mtu_locked = lock; - fnhe->fnhe_expires = expires; + fnhe->fnhe_expires = max(1UL, expires); /* Exception created; mark the cached routes for the nexthop * stale, so anyone caching it rechecks if this exception @@ -1297,6 +1297,36 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst) return mtu - lwtunnel_headroom(dst->lwtstate, mtu); } +static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr) +{ + struct fnhe_hash_bucket *hash; + struct fib_nh_exception *fnhe, __rcu **fnhe_p; + u32 hval = fnhe_hashfun(daddr); + + spin_lock_bh(&fnhe_lock); + + hash = rcu_dereference_protected(nh->nh_exceptions, +lockdep_is_held(&fnhe_lock)); + hash += hval; + + fnhe_p = &hash->chain; + fnhe = rcu_dereference_protected(*fnhe_p, lockdep_is_held(&fnhe_lock)); + while (fnhe) { + if (fnhe->fnhe_daddr == daddr) { + rcu_assign_pointer(*fnhe_p, rcu_dereference_protected( + fnhe->fnhe_next, lockdep_is_held(&fnhe_lock))); + fnhe_flush_routes(fnhe); + kfree_rcu(fnhe, rcu); + break; + } + fnhe_p = &fnhe->fnhe_next; + fnhe = rcu_dereference_protected(fnhe->fnhe_next, +lockdep_is_held(&fnhe_lock)); + } + + spin_unlock_bh(&fnhe_lock); +} + static struct fib_nh_exception *find_exception(struct fib_nh *nh, __be32 daddr) { struct fnhe_hash_bucket *hash = 
rcu_dereference(nh->nh_exceptions); @@ -1310,8 +1340,14 @@ static struct fib_nh_exception *find_exception(struct fib_nh *nh, __be32 daddr) for (fnhe = rcu_dereference(hash[hval].chain); fnhe; fnhe = rcu_dereference(fnhe->fnhe_next)) { - if (fnhe->fnhe_daddr == daddr) + if (fnhe->fnhe_daddr == daddr) { + if (fnhe->fnhe_expires && + time_after(jiffies, fnhe->fnhe_expires)) { + ip_del_fnhe(nh, daddr); + break; + } return fnhe; + } } return NULL; } @@ -1636,36 +1672,6 @@ static void ip_handle_martian_source(struct net_device *dev, #endif } -static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr) -{ - struct fnhe_hash_bucket *hash; - struct fib_nh_exception *fnhe, __rcu **fnhe_p; - u32 hval = fnhe_hashfun(daddr); - - spin_lock_bh(&fnhe_lock); - - hash = rcu_dereference_protected(nh->nh_exception
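[Editor's note: the expiry test above relies on the kernel's time_after(), which compares jiffies values safely across counter wraparound by looking at the sign of the difference — which is also why the patch keeps fnhe_expires nonzero with max(1UL, expires), reserving 0 as the "no expiry" sentinel. A minimal 32-bit Python model of the comparison:]

```python
BITS = 32
MASK = (1 << BITS) - 1

def time_after(a, b):
    """True if a is after b, modulo 2**32 (time_after() semantics)."""
    diff = (b - a) & MASK
    # Interpret the difference as a signed 32-bit value.
    if diff >= 1 << (BITS - 1):
        diff -= 1 << BITS
    return diff < 0

assert time_after(100, 50)
assert not time_after(50, 100)
# Near wraparound: 5 (after the counter wrapped) is still "after" MASK - 5.
assert time_after(5, MASK - 5)
```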
Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu)) + return false; + return true; } /* Get route to daddr, update *saddr, optionally bind route to saddr */ @@ -212,11 +234,42 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu) ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu); } +/* GSO: length of network and transport headers, 0=unsupported */ +static unsigned short ipvs_gso_hlen(struct sk_buff *skb, + struct ip_vs_iphdr *ipvsh) +{ + unsigned short hlen = ipvsh->len - ipvsh->off; + + if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) { + struct tcphdr _tcph, *th; + + th = skb_header_pointer(skb, ipvsh->len, sizeof(_tcph), &_tcph); + if (th) + return hlen + (th->doff << 2); + } + return 0; +} + static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af, int rt_mode, struct ip_vs_iphdr *ipvsh, struct sk_buff *skb, int mtu) { + /* Re-segment traffic from local clients */ + if (!skb->dev && skb_is_gso(skb)) { + unsigned short hlen = ipvs_gso_hlen(skb, ipvsh); + + if (hlen && mtu - hlen < skb_shinfo(skb)->gso_size && + mtu > hlen) { + u16 reduce = skb_shinfo(skb)->gso_size - (mtu - hlen); + + IP_VS_DBG(10, "Reducing gso_size=%u with %u\n", + skb_shinfo(skb)->gso_size, reduce); + skb_decrease_gso_size(skb_shinfo(skb), reduce); + skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY; + skb_shinfo(skb)->gso_segs = 0; + } + } #ifdef CONFIG_IP_VS_IPV6 if (skb_af == AF_INET6) { struct net *net = ipvs->net; @@ -240,9 +293,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af, if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs)) return true; - if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) && -skb->len > mtu && !skb_is_gso(skb) && - !ip_vs_iph_icmp(ipvsh))) { + if (unlikely(__mtu_check_toobig(skb, mtu))) { icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu)); IP_VS_DBG(1, "frag needed for %pI4\n", Regards -- Julian Anastasov
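[Editor's note: the gso_size adjustment in the draft patch above reduces to a small piece of arithmetic. The following is a standalone model of that arithmetic only, not the kernel code; names are illustrative.]

```python
def reduced_gso_size(gso_size, mtu, hlen):
    """Adjust gso_size so that hlen + gso_size fits within mtu.

    Mirrors the draft patch: when a full segment (network/transport
    headers plus payload) would exceed the path MTU, shrink the
    per-segment payload by gso_size - (mtu - hlen). Returns gso_size
    unchanged when it already fits or when mtu <= hlen.
    """
    if mtu > hlen and mtu - hlen < gso_size:
        reduce = gso_size - (mtu - hlen)
        return gso_size - reduce  # == mtu - hlen
    return gso_size

# Example: IPv6+TCP headers (40 + 20 = 60 bytes), tunnel path MTU 1460:
# a 1440-byte gso_size must drop to 1400 so each segment fits.
assert reduced_gso_size(1440, 1460, 60) == 1400
# Already fits: left unchanged.
assert reduced_gso_size(1400, 1500, 60) == 1400
```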
Re: [Patch nf] ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()
Hello, On Mon, 23 Apr 2018, Cong Wang wrote: > Similarly, tbl->entries is not initialized after kmalloc(), > therefore causes an uninit-value warning in ip_vs_lblc_check_expire(), > as reported by syzbot. > > Reported-by: > Cc: Simon Horman > Cc: Julian Anastasov > Cc: Pablo Neira Ayuso > Signed-off-by: Cong Wang Thanks! Acked-by: Julian Anastasov > --- > net/netfilter/ipvs/ip_vs_lblc.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c > index 3057e453bf31..83918119ceb8 100644 > --- a/net/netfilter/ipvs/ip_vs_lblc.c > +++ b/net/netfilter/ipvs/ip_vs_lblc.c > @@ -371,6 +371,7 @@ static int ip_vs_lblc_init_svc(struct ip_vs_service *svc) > tbl->counter = 1; > tbl->dead = false; > tbl->svc = svc; > + atomic_set(&tbl->entries, 0); > > /* > *Hook periodic timer for garbage collection > -- > 2.13.0 Regards -- Julian Anastasov
Re: [Patch nf] ipvs: initialize tbl->entries after allocation
Hello, On Mon, 23 Apr 2018, Cong Wang wrote: > tbl->entries is not initialized after kmalloc(), therefore > causes an uninit-value warning in ip_vs_lblc_check_expire() > as reported by syzbot. > > Reported-by: > Cc: Simon Horman > Cc: Julian Anastasov > Cc: Pablo Neira Ayuso > Signed-off-by: Cong Wang Thanks! Acked-by: Julian Anastasov > --- > net/netfilter/ipvs/ip_vs_lblcr.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c > b/net/netfilter/ipvs/ip_vs_lblcr.c > index 92adc04557ed..bc2bc5eebcb8 100644 > --- a/net/netfilter/ipvs/ip_vs_lblcr.c > +++ b/net/netfilter/ipvs/ip_vs_lblcr.c > @@ -534,6 +534,7 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc) > tbl->counter = 1; > tbl->dead = false; > tbl->svc = svc; > + atomic_set(&tbl->entries, 0); > > /* > *Hook periodic timer for garbage collection > -- > 2.13.0 Regards -- Julian Anastasov
Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
tu 1500 state UP qlen 1000 > inet6 2401:db00:1011:10af:face:0:27:0/64 scope global >valid_lft forever preferred_lft forever > > [host-a] > cat /proc/net/ip_vs > TCP [2401:db00:1011:1f01:face:b00c::0085]:01BB rr > -> [2401:db00:1011:10cc:face::0091:]:01BB Tunnel 6772 9 > 6 > -> [2401:db00:1011:10d8:face::0091:]:01BB Tunnel 6772 8 > 6 > -> [2401:db00:1011:10d2:face::0091:]:01BB Tunnel 6772 19 > 7 > > [host-a] > openssl s_client -connect [2401:db00:1011:1f01:face:b00c:0:85]:443 > send-something-long-here-to-trigger-the-bug > > Changing the local route mtu to 1460 to account for the extra ipv6 tunnel > header > can also side step the issue. Like this: Yes, reducing the MTU was a solution recommended from long time ago in the IPVS HOWTO. > > > ip -6 r show table local > local 2401:db00:1011:1f01:face:b00c:0:85 dev lo src > 2401:db00:1011:10af:face:0:27:0 metric 1024 mtu 1460 advmss 1440 pref medium > > Signed-off-by: Martin KaFai Lau > --- > net/netfilter/ipvs/ip_vs_xmit.c | 49 > +++-- > 1 file changed, 33 insertions(+), 16 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c > index 11c416f3d6e3..88cc0d53ebce 100644 > --- a/net/netfilter/ipvs/ip_vs_xmit.c > +++ b/net/netfilter/ipvs/ip_vs_xmit.c > @@ -212,13 +212,15 @@ static inline void maybe_update_pmtu(int skb_af, struct > sk_buff *skb, int mtu) > ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu); > } > > -static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int > skb_af, > +static inline bool ensure_mtu_is_adequate(struct ip_vs_conn *cp, > int rt_mode, > struct ip_vs_iphdr *ipvsh, > struct sk_buff *skb, int mtu) > { > + struct netns_ipvs *ipvs = cp->ipvs; > + > #ifdef CONFIG_IP_VS_IPV6 > - if (skb_af == AF_INET6) { > + if (cp->af == AF_INET6) { > struct net *net = ipvs->net; > > if (unlikely(__mtu_check_toobig_v6(skb, mtu))) { > @@ -251,6 +253,17 @@ static inline bool ensure_mtu_is_adequate(struct > netns_ipvs *ipvs, int skb_af, > } > } > > + if 
(skb_shinfo(skb)->gso_size && cp->protocol == IPPROTO_TCP) { > + const struct tcphdr *th = (struct tcphdr > *)skb_transport_header(skb); > + unsigned short hdr_len = (th->doff << 2) + > + skb_network_header_len(skb); > + > + if (mtu > hdr_len && mtu - hdr_len < skb_shinfo(skb)->gso_size) > + skb_decrease_gso_size(skb_shinfo(skb), > + skb_shinfo(skb)->gso_size - > + (mtu - hdr_len)); So, software segmentation happens and we want the tunnel header to be accounted immediately and not after PMTU probing period? Is this a problem only for IPVS tunnels? Do we see such delays with other tunnels? May be this should be solved for all protocols (not just TCP) and for all tunnels? Looking at ip6_xmit, on GSO we do not return -EMSGSIZE to local sender, so we should really alter the gso_size for proper segmentation? Regards -- Julian Anastasov
Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
Hello, On Thu, 12 Apr 2018, Stephen Suryaputra wrote: > Thanks for the feedbacks. Please see the detail below: > > On Wed, Apr 11, 2018 at 3:37 PM, Julian Anastasov wrote: > [snip] > >> - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS); > >> + __IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS); > > > > May be skb->dev if we want to account it to the > > input device. > > > Yes. I'm about to make change it but see the next one. > > [snip] > >> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c > >> b/net/netfilter/ipvs/ip_vs_xmit.c > >> index 4527921..32bd3af 100644 > >> --- a/net/netfilter/ipvs/ip_vs_xmit.c > >> +++ b/net/netfilter/ipvs/ip_vs_xmit.c > >> @@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs > >> *ipvs, > >> { > >> if (ip_hdr(skb)->ttl <= 1) { > >> /* Tell the sender its packet died... */ > >> - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS); > >> + __IP_INC_STATS(net, skb_dst(skb)->dev, > >> IPSTATS_MIB_INHDRERRORS); > > > > At this point, skb_dst(skb) can be: > > > > - input route at LOCAL_IN => dst->dev is "lo", skb->dev = input_device > > - output route at LOCAL_OUT => dst->dev is output_device, skb->dev = NULL > > > > We should see this error on LOCAL_IN but better to be > > safe: use 'skb->dev ? : skb_dst(skb)->dev' instead of just > > 'skb_dst(skb)->dev'. > > > This follows v6 implementation in the same function: > > #ifdef CONFIG_IP_VS_IPV6 > if (skb_af == AF_INET6) { > struct dst_entry *dst = skb_dst(skb); > > /* check and decrement ttl */ > if (ipv6_hdr(skb)->hop_limit <= 1) { > /* Force OUTPUT device used as source address */ It looks like IPVS copied it from ip6_forward() but in IPVS context it has its reason: we want ICMP to exit with saddr=Virtual_IP. And we are at LOCAL_IN where there is no output device like in ip6_forward(FORWARD) to use its source address. So, IPVS is special (both input and output path) and needs: IPv4: skb->dev ? 
: skb_dst(skb)->dev IPv6 needs fix for IPVS stats in decrement_ttl: idev = skb->dev ? __in6_dev_get(skb->dev) : ip6_dst_idev(dst); ... __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS); Otherwise, stats will go to "lo" if ip6_dst_idev is used for local input route. So, for accounting on input IPv4 path skb->dev should be used, while for IPv6 some sites may prefer to feed icmpv6_send() with output dst->dev as device containing the source address (skb->dev). But this is unrelated to the stats. > skb->dev = dst->dev; > icmpv6_send(skb, ICMPV6_TIME_EXCEED, > ICMPV6_EXC_HOPLIMIT, 0); > __IP6_INC_STATS(net, ip6_dst_idev(dst), > IPSTATS_MIB_INHDRERRORS); > > return false; > } > > /* don't propagate ttl change to cloned packets */ > if (!skb_make_writable(skb, sizeof(struct ipv6hdr))) > return false; > > ipv6_hdr(skb)->hop_limit--; > } else > #endif > > [snip] > > > > The patch probably has other errors, for example, > > using rt->dst.dev (lo) when rt->dst.error != 0 in ip_error, > > may be 'dev' should be used instead... > > Same also here. Examples are ip6_forward and ip6_pkt_drop. > > I think it's better be counted in the input device for them also. Thoughts? I think so. ipv6_rcv() works with idev = __in6_dev_get(skb->dev) but I don't know IPv6 well and whether ip6_dst_idev(skb_dst(skb)) is correct usage for input path. It should be correct for output path, though. Regards -- Julian Anastasov
Re: WARNING: possible recursive locking detected
Hello, On Wed, 11 Apr 2018, Dmitry Vyukov wrote: > On Wed, Apr 11, 2018 at 4:02 PM, syzbot > wrote: > > Hello, > > > > syzbot hit the following crash on upstream commit > > b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +) > > Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client > > syzbot dashboard link: > > https://syzkaller.appspot.com/bug?extid=3c43eecd7745a5ce1640 > > > > So far this crash happened 3 times on upstream. > > C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5103706542440448 > > syzkaller reproducer: > > https://syzkaller.appspot.com/x/repro.syz?id=5641659786199040 > > Raw console output: > > https://syzkaller.appspot.com/x/log.txt?id=5099510896263168 > > Kernel config: > > https://syzkaller.appspot.com/x/.config?id=-1223000601505858474 > > compiler: gcc (GCC) 8.0.1 20180301 (experimental) > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit: > > Reported-by: syzbot+3c43eecd7745a5ce1...@syzkaller.appspotmail.com > > It will help syzbot understand when the bug is fixed. See footer for > > details. > > If you forward the report, please keep this part and the footer. > > #syz dup: possible deadlock in rtnl_lock (5) Yes, patch is now in the "nf" tree, so all these lockups around start_sync_thread should be resolved soon... > > IPVS: sync thread started: state = BACKUP, mcast_ifn = lo, syncid = 0, id = > > 0 > > IPVS: stopping backup sync thread 4546 ... > > > > > > IPVS: stopping backup sync thread 4559 ... > > WARNING: possible recursive locking detected Regards -- Julian Anastasov
Re: [PATCH net-next] Per interface IPv4 stats (CONFIG_IP_IFSTATS_TABLE)
Hello, On Tue, 10 Apr 2018, Stephen Suryaputra wrote: > This is enhanced from the proposed patch by Igor Maravic in 2011 to > support per interface IPv4 stats. The enhancement is mainly adding a > kernel configuration option CONFIG_IP_IFSTATS_TABLE. > > Signed-off-by: Stephen Suryaputra > --- ... > diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c > index b54b948..bb9be11 100644 > --- a/net/ipv4/ip_forward.c > +++ b/net/ipv4/ip_forward.c > @@ -66,8 +66,8 @@ static int ip_forward_finish(struct net *net, struct sock > *sk, struct sk_buff *s > { > struct ip_options *opt = &(IPCB(skb)->opt); > > - __IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS); > - __IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len); > + __IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTFORWDATAGRAMS); > + __IP_ADD_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_OUTOCTETS, skb->len); > > if (unlikely(opt->optlen)) > ip_forward_options(skb); > @@ -121,7 +121,7 @@ int ip_forward(struct sk_buff *skb) > IPCB(skb)->flags |= IPSKB_FORWARDED; > mtu = ip_dst_mtu_maybe_forward(&rt->dst, true); > if (ip_exceeds_mtu(skb, mtu)) { > - IP_INC_STATS(net, IPSTATS_MIB_FRAGFAILS); > + IP_INC_STATS(net, rt->dst.dev, IPSTATS_MIB_FRAGFAILS); > icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, > htonl(mtu)); > goto drop; > @@ -158,7 +158,7 @@ int ip_forward(struct sk_buff *skb) > > too_many_hops: > /* Tell the sender its packet died... */ > - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS); > + __IP_INC_STATS(net, skb_dst(skb)->dev, IPSTATS_MIB_INHDRERRORS); May be skb->dev if we want to account it to the input device. > icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0); > drop: > kfree_skb(skb); ... 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c > index 4527921..32bd3af 100644 > --- a/net/netfilter/ipvs/ip_vs_xmit.c > +++ b/net/netfilter/ipvs/ip_vs_xmit.c > @@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs, > { > if (ip_hdr(skb)->ttl <= 1) { > /* Tell the sender its packet died... */ > - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS); > + __IP_INC_STATS(net, skb_dst(skb)->dev, > IPSTATS_MIB_INHDRERRORS); At this point, skb_dst(skb) can be: - input route at LOCAL_IN => dst->dev is "lo", skb->dev = input_device - output route at LOCAL_OUT => dst->dev is output_device, skb->dev = NULL We should see this error on LOCAL_IN but better to be safe: use 'skb->dev ? : skb_dst(skb)->dev' instead of just 'skb_dst(skb)->dev'. > icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0); > return false; > } The patch probably has other errors, for example, using rt->dst.dev (lo) when rt->dst.error != 0 in ip_error, may be 'dev' should be used instead... Regards -- Julian Anastasov
Re: [PATCH net-next v1] ipvs: add consistent source hashing scheduling
Hello, On Mon, 2 Apr 2018, Vincent Bernat wrote: > Based on Google's Maglev algorithm [1][2], this scheduler builds a > lookup table in a way that disruption is minimized when a change > occurs. This helps in case of active/active setup without > synchronization. Like for classic source hashing, this lookup table is > used to assign connections to a real server. > > Both source address and port are used to compute the hash (unlike sh > where this is optional). > > Weights are correctly handled. Unlike sh, servers with a weight of 0 > are considered as absent. Also, unlike sh, when a server becomes > unavailable due to a threshold, no fallback is possible: doing so > would seriously impair the usefulness of using a consistent hash. > > There is a small hack to detect when all real servers have a weight of > 0. It relies on the fact that it is not possible for the weight of a real > server to change during the execution of the assignment. I believe > this is the case as modifications through netlink are subject to a > mutex, but the use of atomic_read() is unsettling. > > The value of 65537 for the hash table size is currently not modifiable > at compile-time. This is the value suggested in the Maglev > paper. Other possible values are 257 (for small tests) and 655373 (for > very large setups).
> > [1]: https://research.google.com/pubs/pub44824.html > [2]: > https://blog.acolyer.org/2016/03/21/maglev-a-fast-and-reliable-software-network-load-balancer/ Sorry to say it but may be you missed the discussion on lvs-devel about the new MH scheduler implemented by Inju Song: https://www.spinics.net/lists/lvs-devel/msg04928.html http://archive.linuxvirtualserver.org/html/lvs-devel/2018-03/msg00023.html In the last 6 months we fixed all issues and I acked v4 just yesterday: http://archive.linuxvirtualserver.org/html/lvs-devel/2018-04/msg3.html This scheduler supports: - tables with different size (prime): IP_VS_MH_TAB_INDEX - gcd of weights: ip_vs_mh_gcd_weight - shifted weights: ip_vs_mh_shift_weight - weight can be changed any time > Signed-off-by: Vincent Bernat > --- > include/net/ip_vs.h| 27 > net/netfilter/ipvs/Kconfig | 13 ++ > net/netfilter/ipvs/Makefile| 1 + > net/netfilter/ipvs/ip_vs_csh.c | 339 > + > net/netfilter/ipvs/ip_vs_sh.c | 32 +--- > 5 files changed, 381 insertions(+), 31 deletions(-) > create mode 100644 net/netfilter/ipvs/ip_vs_csh.c Regards -- Julian Anastasov
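[Editor's note: for context, a minimal unweighted Python sketch of the Maglev table-population algorithm from the cited paper. This is neither the proposed ip_vs_csh code nor the MH scheduler referenced above (which additionally handles weights, gcd of weights, and runtime weight changes); sha256 stands in for whatever hash functions an implementation would use, and table_size should be prime so each backend's (offset, skip) pair generates a full permutation of the slots.]

```python
import hashlib

def populate(backends, table_size):
    """Build a Maglev-style lookup table mapping each slot to a backend."""
    def h(name, salt):
        return int(hashlib.sha256(f"{salt}:{name}".encode()).hexdigest(), 16)

    n = len(backends)
    offsets = [h(b, "offset") % table_size for b in backends]
    skips = [h(b, "skip") % (table_size - 1) + 1 for b in backends]
    table = [None] * table_size
    next_idx = [0] * n
    filled = 0
    while filled < table_size:
        for i in range(n):
            # Walk this backend's preference permutation to its next empty slot.
            c = (offsets[i] + next_idx[i] * skips[i]) % table_size
            while table[c] is not None:
                next_idx[i] += 1
                c = (offsets[i] + next_idx[i] * skips[i]) % table_size
            table[c] = backends[i]
            next_idx[i] += 1
            filled += 1
            if filled == table_size:
                break
    return table

# Round-robin filling keeps the slot counts nearly equal, and removing one
# backend only disturbs the slots it owned (plus a few reshuffled ones).
table = populate(["rs1", "rs2", "rs3"], 13)
assert len(table) == 13
assert set(table) == {"rs1", "rs2", "rs3"}
```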
Re: [PATCH net-next v2] ipvs: fix multiplicative hashing in sh/dh/lblc/lblcr algorithms
Hello, On Sun, 1 Apr 2018, Vincent Bernat wrote: > The sh/dh/lblc/lblcr algorithms are using Knuth's multiplicative > hashing incorrectly. Replace its use by the hash_32() macro, which > correctly implements this algorithm. It doesn't use the same constant, > but it shouldn't matter. > > Signed-off-by: Vincent Bernat Looks good to me, thanks! Simon, please apply, if possible with the extra space removed, see below... Acked-by: Julian Anastasov > --- > net/netfilter/ipvs/ip_vs_dh.c| 3 ++- > net/netfilter/ipvs/ip_vs_lblc.c | 3 ++- > net/netfilter/ipvs/ip_vs_lblcr.c | 3 ++- > net/netfilter/ipvs/ip_vs_sh.c| 3 ++- > 4 files changed, 8 insertions(+), 4 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_dh.c b/net/netfilter/ipvs/ip_vs_dh.c > index 75f798f8e83b..dfea31fb10c5 100644 > --- a/net/netfilter/ipvs/ip_vs_dh.c > +++ b/net/netfilter/ipvs/ip_vs_dh.c > @@ -43,6 +43,7 @@ > #include > #include > #include > +#include > > #include > > @@ -81,7 +82,7 @@ static inline unsigned int ip_vs_dh_hashkey(int af, const > union nf_inet_addr *ad > addr_fold = addr->ip6[0]^addr->ip6[1]^ > addr->ip6[2]^addr->ip6[3]; > #endif > - return (ntohl(addr_fold)*2654435761UL) & IP_VS_DH_TAB_MASK; > + return hash_32(ntohl(addr_fold), IP_VS_DH_TAB_BITS); Extra space above > } > > > diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c > index 3057e453bf31..08147fc6400c 100644 > --- a/net/netfilter/ipvs/ip_vs_lblc.c > +++ b/net/netfilter/ipvs/ip_vs_lblc.c > @@ -48,6 +48,7 @@ > #include > #include > #include > +#include > > /* for sysctl */ > #include > @@ -160,7 +161,7 @@ ip_vs_lblc_hashkey(int af, const union nf_inet_addr *addr) > addr_fold = addr->ip6[0]^addr->ip6[1]^ > addr->ip6[2]^addr->ip6[3]; > #endif > - return (ntohl(addr_fold)*2654435761UL) & IP_VS_LBLC_TAB_MASK; > + return hash_32(ntohl(addr_fold), IP_VS_LBLC_TAB_BITS); > } > > > diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c > b/net/netfilter/ipvs/ip_vs_lblcr.c > index 92adc04557ed..9b6a6c9e9cfa 100644 > --- 
a/net/netfilter/ipvs/ip_vs_lblcr.c > +++ b/net/netfilter/ipvs/ip_vs_lblcr.c > @@ -47,6 +47,7 @@ > #include > #include > #include > +#include > > /* for sysctl */ > #include > @@ -323,7 +324,7 @@ ip_vs_lblcr_hashkey(int af, const union nf_inet_addr > *addr) > addr_fold = addr->ip6[0]^addr->ip6[1]^ > addr->ip6[2]^addr->ip6[3]; > #endif > - return (ntohl(addr_fold)*2654435761UL) & IP_VS_LBLCR_TAB_MASK; > + return hash_32(ntohl(addr_fold), IP_VS_LBLCR_TAB_BITS); > } > > > diff --git a/net/netfilter/ipvs/ip_vs_sh.c b/net/netfilter/ipvs/ip_vs_sh.c > index 16aaac6eedc9..1e01c782583a 100644 > --- a/net/netfilter/ipvs/ip_vs_sh.c > +++ b/net/netfilter/ipvs/ip_vs_sh.c > @@ -96,7 +96,8 @@ ip_vs_sh_hashkey(int af, const union nf_inet_addr *addr, > addr_fold = addr->ip6[0]^addr->ip6[1]^ > addr->ip6[2]^addr->ip6[3]; > #endif > - return (offset + (ntohs(port) + ntohl(addr_fold))*2654435761UL) & > + return (offset + hash_32(ntohs(port) + ntohl(addr_fold), > + IP_VS_SH_TAB_BITS)) & > IP_VS_SH_TAB_MASK; > } > > -- > 2.16.3 Regards -- Julian Anastasov
Re: [PATCH net-next v1] ipvs: fix multiplicative hashing in sh/dh/lblc/lblcr algorithms
Hello, On Sun, 1 Apr 2018, Vincent Bernat wrote: > The sh/dh/lblc/lblcr algorithms are using Knuth's multiplicative > hashing incorrectly. This results in uneven distribution. Good catch. > To fix this, the result has to be shifted by a constant. In "Lecture > 21: Hash functions" [1], it is said: > >In the fixed-point version, The division by 2^q is crucial. The >common mistake when doing multiplicative hashing is to forget to do >it, and in fact you can find web pages highly ranked by Google that >explain multiplicative hashing without this step. Without this >division, there is little point to multiplying by a, because ka mod >m = (k mod m) * (a mod m) mod m . This is no better than modular >hashing with a modulus of m, and quite possibly worse. > > Typing the 2654435761 constant in DuckDuckGo shows many other sources > to confirm this issue. Moreover, doing the multiplication in the 32bit > integer space is enough, hence the change from 2654435761UL to > 2654435761U. > > [1]: https://www.cs.cornell.edu/courses/cs3110/2008fa/lectures/lec21.html > > The following Python program illustrates the bug and its fix: > > import netaddr > import collections > import socket > import statistics > > def run(buggy=False): > base = netaddr.IPAddress('203.0.113.0') > count = collections.defaultdict(int) > for offset in range(100): > for port in range(1, 11000): > r = socket.ntohs(port) + socket.ntohl(int(base) + offset) > r *= 2654435761 > if buggy: > r %= 1 << 64 > else: > r %= 1 << 32 > r >>= 24 > r &= 255 > count[r] += 1 > > print(buggy, > statistics.mean(count.values()), > statistics.stdev(count.values())) > > run(True) > run(False) > > Its output is: > > True 25000 765.9416862050705 > False 390.625 4.681209831891333 > > Signed-off-by: Vincent Bernat > --- > net/netfilter/ipvs/ip_vs_dh.c| 4 +++- > net/netfilter/ipvs/ip_vs_lblc.c | 4 +++- > net/netfilter/ipvs/ip_vs_lblcr.c | 4 +++- > net/netfilter/ipvs/ip_vs_sh.c| 3 ++- > 4 files changed, 11 insertions(+), 4 deletions(-) > > 
diff --git a/net/netfilter/ipvs/ip_vs_dh.c b/net/netfilter/ipvs/ip_vs_dh.c > index 75f798f8e83b..5638e66dbdd1 100644 > --- a/net/netfilter/ipvs/ip_vs_dh.c > +++ b/net/netfilter/ipvs/ip_vs_dh.c > @@ -81,7 +81,9 @@ static inline unsigned int ip_vs_dh_hashkey(int af, const > union nf_inet_addr *ad > addr_fold = addr->ip6[0]^addr->ip6[1]^ > addr->ip6[2]^addr->ip6[3]; > #endif > - return (ntohl(addr_fold)*2654435761UL) & IP_VS_DH_TAB_MASK; > + return ((ntohl(addr_fold)*2654435761U) >> > + (32 - IP_VS_DH_TAB_BITS)) & > + IP_VS_DH_TAB_MASK; Looks like the '& mask' part is not needed, still, it does not generate extra code. I see that other code uses hash_32(val, bits) from include/linux/hash.h but note that it used different ratio before Linux 4.7, in case someone backports this patch on old kernels. So, I don't have preference what should be used, may be return hash_32(ntohl(addr_fold), IP_VS_DH_TAB_BITS) is better. Regards -- Julian Anastasov
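[Editor's note: a toy Python model of why the shift matters. The constant below is the one from the old IPVS code; the kernel's hash_32() uses the same high-bits idea with a different constant since 4.7, as noted above.]

```python
GOLDEN = 2654435761  # constant used by the old IPVS hash code
BITS = 8             # toy table: 256 buckets

def hash_buggy(k):
    """Old behaviour: keep the LOW bits, equivalent to k * odd mod 256."""
    return (k * GOLDEN) & ((1 << BITS) - 1)

def hash_fixed(k):
    """Fixed-point multiplicative hashing: keep the HIGH bits."""
    return ((k * GOLDEN) % (1 << 32)) >> (32 - BITS)

def buckets_used(h, keys):
    return len({h(k) for k in keys})

# Keys that differ only in their high byte: the buggy version maps them
# all to bucket 0 (the low byte of k * GOLDEN never changes), while the
# shifted version spreads them perfectly.
keys = [j << 24 for j in range(100)]
assert buckets_used(hash_buggy, keys) == 1
assert buckets_used(hash_fixed, keys) == 100
```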
Re: INFO: task hung in stop_sync_thread (2)
a->atomic_read_lock){+.+.}, at: [<c1d180aa>] > n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 > 2 locks held by getty/4342: > #0: (&tty->ldisc_sem){}, at: [<bee98654>] > ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 > #1: (&ldata->atomic_read_lock){+.+.}, at: [<c1d180aa>] > n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 > 2 locks held by getty/4343: > #0: (&tty->ldisc_sem){}, at: [<bee98654>] > ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 > #1: (&ldata->atomic_read_lock){+.+.}, at: [<c1d180aa>] > n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 > 2 locks held by getty/4344: > #0: (&tty->ldisc_sem){}, at: [<bee98654>] > ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 > #1: (&ldata->atomic_read_lock){+.+.}, at: [<c1d180aa>] > n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 > 3 locks held by kworker/0:5/6494: > #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<a062b18e>] > work_static include/linux/workqueue.h:198 [inline] > #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<a062b18e>] > set_work_data kernel/workqueue.c:619 [inline] > #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<a062b18e>] > set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline] > #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<a062b18e>] > process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084 > #1: ((addr_chk_work).work){+.+.}, at: [<278427d5>] > process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088 > #2: (rtnl_mutex){+.+.}, at: [<066e35ac>] rtnl_lock+0x17/0x20 > net/core/rtnetlink.c:74 > 1 lock held by syz-executor7/25421: > #0: (ipvs->sync_mutex){+.+.}, at: [<d414a689>] > do_ip_vs_set_ctl+0x277/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2393 > 2 locks held by syz-executor7/25427: > #0: (rtnl_mutex){+.+.}, at: [<066e35ac>] rtnl_lock+0x17/0x20 > net/core/rtnetlink.c:74 > #1: (ipvs->sync_mutex){+.+.}, at: [<e6d48489>] > do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388 Above is start_sync_thread() waiting kthread to stop... 
> 1 lock held by syz-executor7/25435: > #0: (rtnl_mutex){+.+.}, at: [<066e35ac>] rtnl_lock+0x17/0x20 > net/core/rtnetlink.c:74 > 1 lock held by ipvs-b:2:0/25415: > #0: (rtnl_mutex){+.+.}, at: [<066e35ac>] rtnl_lock+0x17/0x20 > net/core/rtnetlink.c:74 backup kthread needs rtnl_lock to stop... > > = > > NMI backtrace for cpu 1 > CPU: 1 PID: 868 Comm: khungtaskd Not tainted 4.16.0-rc6+ #284 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google > 01/01/2011 > Call Trace: > __dump_stack lib/dump_stack.c:17 [inline] > dump_stack+0x194/0x24d lib/dump_stack.c:53 > nmi_cpu_backtrace+0x1d2/0x210 lib/nmi_backtrace.c:103 > nmi_trigger_cpumask_backtrace+0x123/0x180 lib/nmi_backtrace.c:62 > arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38 > trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline] > check_hung_task kernel/hung_task.c:132 [inline] > check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline] > watchdog+0x90c/0xd60 kernel/hung_task.c:249 > kthread+0x33c/0x400 kernel/kthread.c:238 > ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:406 > Sending NMI from CPU 1 to CPUs 0: > NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0x6/0x10 > arch/x86/include/asm/irqflags.h:54 > > > --- > This bug is generated by a dumb bot. It may contain errors. > See https://goo.gl/tpsmEJ for details. > Direct all questions to syzkal...@googlegroups.com. > > syzbot will keep track of this bug report. > If you forgot to add the Reported-by tag, once the fix for this bug is merged > into any tree, please reply to this email with: > #syz fix: exact-commit-title > To mark this as a duplicate of another syzbot report, please reply with: > #syz dup: exact-subject-of-another-report > If it's a one-off invalid bug report, please reply with: > #syz invalid > Note: if the crash happens again, it will cause creation of a new bug report. > Note: all commands must start from beginning of the line in the email body. 
Regards -- Julian Anastasov
Re: possible deadlock in rtnl_lock (5)
Hello, On Tue, 27 Mar 2018, Florian Westphal wrote: > syzbot wrote: > [ cc Julian and trimming cc list ] > > > syzkaller688027/4497 is trying to acquire lock: > > (rtnl_mutex){+.+.}, at: [<bb14d7fb>] rtnl_lock+0x17/0x20 > > net/core/rtnetlink.c:74 > > > but task is already holding lock: > > IPVS: stopping backup sync thread 4495 ... > > (rtnl_mutex){+.+.}, at: [<bb14d7fb>] rtnl_lock+0x17/0x20 > > net/core/rtnetlink.c:74 > > > > other info that might help us debug this: > > Possible unsafe locking scenario: > > > >CPU0 > > > > lock(rtnl_mutex); > > lock(rtnl_mutex); > > > > *** DEADLOCK *** > > > > May be due to missing lock nesting notation > > Looks like this is real, commit e0b26cc997d57305b4097711e12e13992580ae34 > ("ipvs: call rtnl_lock early") added rtnl_lock when starting sync thread > but socket close invokes rtnl_lock too: I see, thanks! I'll have to move the locks into start_sync_thread and to split make_{send,receive}_sock to {make,setup}_{send,receive}_sock ... > > stack backtrace: > > rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 > > ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643 > > inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413 > > sock_release+0x8d/0x1e0 net/socket.c:595 > > start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924 > > do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389 Regards -- Julian Anastasov
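The lock recursion Florian spotted — start_sync_thread() holding rtnl_lock and then reaching rtnl_lock again through sock_release() on the error path — can be modeled in a few lines of userspace C. This is a sketch with invented names (`mutex_model`, the `*_model` functions), not kernel code; a real non-recursive mutex would simply hang where the model returns EDEADLK:

```c
#include <assert.h>
#include <errno.h>

/* Single-threaded model of a non-recursive mutex like rtnl_mutex:
 * a second lock attempt by the current holder reports EDEADLK
 * instead of hanging, so the recursion is observable. */
struct mutex_model { int held; };

static int mutex_lock(struct mutex_model *m)
{
    if (m->held)
        return EDEADLK;
    m->held = 1;
    return 0;
}

static void mutex_unlock(struct mutex_model *m)
{
    m->held = 0;
}

static struct mutex_model rtnl_mutex;

/* Models sock_release() -> ip_mc_drop_socket(): takes rtnl_mutex. */
static int sock_release_model(void)
{
    int rc = mutex_lock(&rtnl_mutex);

    if (rc == 0)
        mutex_unlock(&rtnl_mutex);
    return rc;
}

/* Models start_sync_thread() before the fix: rtnl_mutex is already
 * held when the error path calls sock_release() -> self-deadlock. */
static int start_sync_thread_model(void)
{
    int rc;

    mutex_lock(&rtnl_mutex);
    rc = sock_release_model();
    mutex_unlock(&rtnl_mutex);
    return rc;
}
```

The fix direction described in the reply — taking the lock later in start_sync_thread() and splitting socket setup so sock_release() is never called with rtnl_lock held — removes exactly this re-entry.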
Re: [PATCH net-next 0/6] Converting pernet_operations (part #8)
Hello, On Thu, 15 Mar 2018, Kirill Tkhai wrote: > Hi, > > this series continues to review and to convert pernet_operations > to make them possible to be executed in parallel for several > net namespaces at the same time. There are different operations > over the tree, mostly are ipvs. > > > Thanks, > Kirill > --- > > Kirill Tkhai (6): > net: Convert l2tp_net_ops > net: Convert mpls_net_ops > net: Convert ovs_net_ops > net: Convert ipvs_core_ops > net: Convert ipvs_core_dev_ops > net: Convert ip_vs_ftp_ops The IPVS patches 4-6 look good to me, Acked-by: Julian Anastasov > net/l2tp/l2tp_core.c|1 + > net/mpls/af_mpls.c |1 + > net/netfilter/ipvs/ip_vs_core.c |2 ++ > net/netfilter/ipvs/ip_vs_ftp.c |1 + > net/openvswitch/datapath.c |1 + > 5 files changed, 6 insertions(+) > > -- > Signed-off-by: Kirill Tkhai Regards -- Julian Anastasov
Re: [PATCH] netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed
Hello, On Thu, 26 Oct 2017, Ye Yin wrote: > When run ipvs in two different network namespace at the same host, and one > ipvs transport network traffic to the other network namespace ipvs. > 'ipvs_property' flag will make the second ipvs take no effect. So we should > clear 'ipvs_property' when SKB network namespace changed. > > Signed-off-by: Ye Yin > Signed-off-by: Wei Zhou Patch looks good to me. ipvs_property was added long ago but skb_scrub_packet() is more recent (3.11), so: Fixes: 621e84d6f373 ("dev: introduce skb_scrub_packet()") Signed-off-by: Julian Anastasov I guess, DaveM can apply it directly as a bugfix to the net tree. > --- > include/linux/skbuff.h | 7 +++ > net/core/skbuff.c | 1 + > 2 files changed, 8 insertions(+) > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h > index 72299ef..d448a48 100644 > --- a/include/linux/skbuff.h > +++ b/include/linux/skbuff.h > @@ -3770,6 +3770,13 @@ static inline void nf_reset_trace(struct sk_buff *skb) > #endif > } > > +static inline void ipvs_reset(struct sk_buff *skb) > +{ > +#if IS_ENABLED(CONFIG_IP_VS) > + skb->ipvs_property = 0; > +#endif > +} > + > /* Note: This doesn't put any conntrack and bridge info in dst. */ > static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src, >bool copy) > diff --git a/net/core/skbuff.c b/net/core/skbuff.c > index 2465607..e140ba4 100644 > --- a/net/core/skbuff.c > +++ b/net/core/skbuff.c > @@ -4864,6 +4864,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet) > if (!xnet) > return; > > + ipvs_reset(skb); > skb_orphan(skb); > skb->mark = 0; > } > -- > 1.7.12.4 Regards -- Julian Anastasov
Re: [PATCH] netfilter: ipvs: Convert timers to use timer_setup()
Hello, On Tue, 24 Oct 2017, Kees Cook wrote: > In preparation for unconditionally passing the struct timer_list pointer to > all timer callbacks, switch to using the new timer_setup() and from_timer() > to pass the timer pointer explicitly. > > Cc: Wensong Zhang > Cc: Simon Horman > Cc: Julian Anastasov > Cc: Pablo Neira Ayuso > Cc: Jozsef Kadlecsik > Cc: Florian Westphal > Cc: "David S. Miller" > Cc: netdev@vger.kernel.org > Cc: lvs-de...@vger.kernel.org > Cc: netfilter-de...@vger.kernel.org > Cc: coret...@netfilter.org > Signed-off-by: Kees Cook Looks good to me, Acked-by: Julian Anastasov > --- > net/netfilter/ipvs/ip_vs_conn.c | 10 +- > net/netfilter/ipvs/ip_vs_ctl.c | 7 +++ > net/netfilter/ipvs/ip_vs_est.c | 6 +++--- > net/netfilter/ipvs/ip_vs_lblc.c | 11 ++- > net/netfilter/ipvs/ip_vs_lblcr.c | 11 ++- > 5 files changed, 23 insertions(+), 22 deletions(-) > > diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c > index 3d2ac71a83ec..3a43b3470331 100644 > --- a/net/netfilter/ipvs/ip_vs_conn.c > +++ b/net/netfilter/ipvs/ip_vs_conn.c > @@ -104,7 +104,7 @@ static inline void ct_write_unlock_bh(unsigned int key) > spin_unlock_bh(&__ip_vs_conntbl_lock_array[key&CT_LOCKARRAY_MASK].l); > } > > -static void ip_vs_conn_expire(unsigned long data); > +static void ip_vs_conn_expire(struct timer_list *t); > > /* > * Returns hash value for IPVS connection entry > @@ -457,7 +457,7 @@ EXPORT_SYMBOL_GPL(ip_vs_conn_out_get_proto); > static void __ip_vs_conn_put_notimer(struct ip_vs_conn *cp) > { > __ip_vs_conn_put(cp); > - ip_vs_conn_expire((unsigned long)cp); > + ip_vs_conn_expire(&cp->timer); > } > > /* > @@ -817,9 +817,9 @@ static void ip_vs_conn_rcu_free(struct rcu_head *head) > kmem_cache_free(ip_vs_conn_cachep, cp); > } > > -static void ip_vs_conn_expire(unsigned long data) > +static void ip_vs_conn_expire(struct timer_list *t) > { > - struct ip_vs_conn *cp = (struct ip_vs_conn *)data; > + struct ip_vs_conn *cp = from_timer(cp, t, timer); > struct 
netns_ipvs *ipvs = cp->ipvs; > > /* > @@ -909,7 +909,7 @@ ip_vs_conn_new(const struct ip_vs_conn_param *p, int > dest_af, > } > > INIT_HLIST_NODE(&cp->c_list); > - setup_timer(&cp->timer, ip_vs_conn_expire, (unsigned long)cp); > + timer_setup(&cp->timer, ip_vs_conn_expire, 0); > cp->ipvs = ipvs; > cp->af = p->af; > cp->daf= dest_af; > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > index 4f940d7eb2f7..b47e266c6eca 100644 > --- a/net/netfilter/ipvs/ip_vs_ctl.c > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > @@ -1146,9 +1146,9 @@ ip_vs_del_dest(struct ip_vs_service *svc, struct > ip_vs_dest_user_kern *udest) > return 0; > } > > -static void ip_vs_dest_trash_expire(unsigned long data) > +static void ip_vs_dest_trash_expire(struct timer_list *t) > { > - struct netns_ipvs *ipvs = (struct netns_ipvs *)data; > + struct netns_ipvs *ipvs = from_timer(ipvs, t, dest_trash_timer); > struct ip_vs_dest *dest, *next; > unsigned long now = jiffies; > > @@ -4019,8 +4019,7 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs > *ipvs) > > INIT_LIST_HEAD(&ipvs->dest_trash); > spin_lock_init(&ipvs->dest_trash_lock); > - setup_timer(&ipvs->dest_trash_timer, ip_vs_dest_trash_expire, > - (unsigned long) ipvs); > + timer_setup(&ipvs->dest_trash_timer, ip_vs_dest_trash_expire, 0); > atomic_set(&ipvs->ftpsvc_counter, 0); > atomic_set(&ipvs->nullsvc_counter, 0); > atomic_set(&ipvs->conn_out_counter, 0); > diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c > index 457c6c193e13..489055091a9b 100644 > --- a/net/netfilter/ipvs/ip_vs_est.c > +++ b/net/netfilter/ipvs/ip_vs_est.c > @@ -97,12 +97,12 @@ static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum, > } > > > -static void estimation_timer(unsigned long arg) > +static void estimation_timer(struct timer_list *t) > { > struct ip_vs_estimator *e; > struct ip_vs_stats *s; > u64 rate; > - struct netns_ipvs *ipvs = (struct netns_ipvs *)arg; > + struct netns_ipvs *ipvs = from_timer(ipvs, t, 
est_timer); > > spin_lock(&ipvs->est_lock); > list_for_each_entry(e, &ipvs->est_list, list) { > @@ -192,7 +192,7 @@ int __net_init ip_vs_estimator_net_init(struct n
Re: [PATCH] ipvs: Fix inappropriate output of procfs
Hello, On Sun, 15 Oct 2017, KUWAZAWA Takuya wrote: > Information about ipvs in different network namespace can be seen via procfs. > > How to reproduce: > > # ip netns add ns01 > # ip netns add ns02 > # ip netns exec ns01 ip a add dev lo 127.0.0.1/8 > # ip netns exec ns02 ip a add dev lo 127.0.0.1/8 > # ip netns exec ns01 ipvsadm -A -t 10.1.1.1:80 > # ip netns exec ns02 ipvsadm -A -t 10.1.1.2:80 > > The ipvsadm displays information about its own network namespace only. > > # ip netns exec ns01 ipvsadm -Ln > IP Virtual Server version 1.2.1 (size=4096) > Prot LocalAddress:Port Scheduler Flags > -> RemoteAddress:Port Forward Weight ActiveConn InActConn > TCP 10.1.1.1:80 wlc > > # ip netns exec ns02 ipvsadm -Ln > IP Virtual Server version 1.2.1 (size=4096) > Prot LocalAddress:Port Scheduler Flags > -> RemoteAddress:Port Forward Weight ActiveConn InActConn > TCP 10.1.1.2:80 wlc > > But I can see information about other network namespace via procfs. > > # ip netns exec ns01 cat /proc/net/ip_vs > IP Virtual Server version 1.2.1 (size=4096) > Prot LocalAddress:Port Scheduler Flags > -> RemoteAddress:Port Forward Weight ActiveConn InActConn > TCP 0A010101:0050 wlc > TCP 0A010102:0050 wlc > > # ip netns exec ns02 cat /proc/net/ip_vs > IP Virtual Server version 1.2.1 (size=4096) > Prot LocalAddress:Port Scheduler Flags > -> RemoteAddress:Port Forward Weight ActiveConn InActConn > TCP 0A010102:0050 wlc > > Signed-off-by: KUWAZAWA Takuya Looks good to me Acked-by: Julian Anastasov Simon, please apply to ipvs tree. 
> --- > net/netfilter/ipvs/ip_vs_ctl.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c > index 4f940d7..b3245f9 100644 > --- a/net/netfilter/ipvs/ip_vs_ctl.c > +++ b/net/netfilter/ipvs/ip_vs_ctl.c > @@ -2034,12 +2034,16 @@ static int ip_vs_info_seq_show(struct seq_file *seq, > void *v) > seq_puts(seq, >" -> RemoteAddress:Port Forward Weight ActiveConn > InActConn\n"); > } else { > + struct net *net = seq_file_net(seq); > + struct netns_ipvs *ipvs = net_ipvs(net); > const struct ip_vs_service *svc = v; > const struct ip_vs_iter *iter = seq->private; > const struct ip_vs_dest *dest; > struct ip_vs_scheduler *sched = rcu_dereference(svc->scheduler); > char *sched_name = sched ? sched->name : "none"; > > + if (svc->ipvs != ipvs) > + return 0; > if (iter->table == ip_vs_svc_table) { > #ifdef CONFIG_IP_VS_IPV6 > if (svc->af == AF_INET6) > -- > 1.8.3.1 Regards -- Julian Anastasov
Re: Use after free in __dst_destroy_metrics_generic
Hello, On Fri, 15 Sep 2017, Subash Abhinov Kasiviswanathan wrote: > > May be I'm missing some posting but I don't see if > > the patch was tested successfully. > > > Hi Julian > > I've had this patch being tested for the last 3-4 days in our regression rack > and I haven't seen the same issue being reproduced or even a related crash > or leak in dst. For now I see only its bug-hiding side effects, i.e. it does not call kfree. Now if there is no double-free there should be a leak, not for dst but for metrics. > The original issue was reported only once to us from the regression rack only > so the exact steps to reproduce is unknown. OK, let's see, maybe others can explain what happens. Regards -- Julian Anastasov
Re: Use after free in __dst_destroy_metrics_generic
Hello, On Fri, 15 Sep 2017, Eric Dumazet wrote: > On Fri, 2017-09-08 at 09:10 -0700, Cong Wang wrote: > > On Thu, Sep 7, 2017 at 5:52 PM, Subash Abhinov Kasiviswanathan > > wrote: > > > We are seeing a possible use after free in ip6_dst_destroy. > > > > > > It appears as if memory of the __DST_METRICS_PTR(old) was freed in some > > > path > > > and allocated > > > to ion driver. ion driver has also freed it. Finally the memory is freed > > > by > > > the > > > fib gc and crashes since it is already deallocated. > > > > Does the attach (compile-only) patch help anything? > > > > From my _quick_ glance, it seems we miss the refcnt'ing > > right in __dst_destroy_metrics_generic(). > > > > Thanks! > > > Hi Cong > > I believe your patch makes a lot of sense, please submit it formally ? Cong's patch is wrong for few reasons: - it will stop to kfree non-refcounted metrics - report was for IPV6 and we set DST_METRICS_REFCOUNTED only for IPv4, for DST_METRICS_READ_ONLY metrics - __dst_destroy_metrics_generic is called for val without DST_METRICS_READ_ONLY flag and such metrics are not with DST_METRICS_REFCOUNTED flag - ->cow_metrics and dst_cow_metrics_generic are called with DST_METRICS_READ_ONLY flag set, there is no chance to write new value twice, especially to write DST_METRICS_REFCOUNTED flag and later to see this flag in __dst_destroy_metrics_generic So, I'm not sure where exactly is the bug with the metrics. May be I'm missing some posting but I don't see if the patch was tested successfully. Regards -- Julian Anastasov
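The fix eventually applied for this report (the dst_metrics refcount patch elsewhere in this archive) gives the shared metrics block its own reference count, so every holder can drop it independently and the last put frees it exactly once. A minimal userspace sketch of that ownership scheme — names and the array size are illustrative, not the kernel's API:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define METRICS_MAX 16   /* illustrative, not the kernel's RTAX_MAX */

/* Stand-in for struct dst_metrics: the metrics array carries its own
 * refcount, so fib entries and dst entries can share one block without
 * either a leak or a double free. */
struct metrics {
    unsigned int val[METRICS_MAX];
    atomic_int refcnt;
};

static struct metrics *metrics_alloc(void)
{
    struct metrics *m = calloc(1, sizeof(*m));

    if (m)
        atomic_init(&m->refcnt, 1);
    return m;
}

static struct metrics *metrics_hold(struct metrics *m)
{
    atomic_fetch_add(&m->refcnt, 1);
    return m;
}

/* Returns 1 if this put dropped the last reference and freed m. */
static int metrics_put(struct metrics *m)
{
    /* atomic_fetch_sub returns the value *before* the decrement. */
    if (atomic_fetch_sub(&m->refcnt, 1) == 1) {
        free(m);
        return 1;
    }
    return 0;
}
```

With this shape, the "leak, not for dst but for metrics" scenario above cannot occur: a holder that forgets to put leaks only its own reference, and a double put is the only way to reach a premature free.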
Re: [PATCH net 0/2] netfilter: ipvs: some fixes in sctp_conn_schedule
Hello, On Sun, 20 Aug 2017, Xin Long wrote: > Patch 1/2 fixes the regression introduced by commit 5e26b1b3abce. > Patch 2/2 makes ipvs not create conn for sctp ABORT packet. > > Xin Long (2): > netfilter: ipvs: fix the issue that sctp_conn_schedule drops non-INIT > packet > netfilter: ipvs: do not create conn for ABORT packet in > sctp_conn_schedule > > net/netfilter/ipvs/ip_vs_proto_sctp.c | 8 ++-- > 1 file changed, 6 insertions(+), 2 deletions(-) Patchset looks ok to me, Acked-by: Julian Anastasov
Re: 100% CPU load when generating traffic to destination network that nexthop is not reachable
Hello, On Tue, 15 Aug 2017, Eric Dumazet wrote: > It must be possible to add a fast path without locks. > > (say if jiffies has not changed before last state change) New day - new idea. Something like this? But it has bug: without checking neigh->dead under lock we don't have the right to access neigh->parms, it can be destroyed immediately by neigh_release->neigh_destroy->neigh_parms_put-> neigh_parms_destroy->kfree. Not sure, may be kfree_rcu can help for this... diff --git a/include/net/neighbour.h b/include/net/neighbour.h index 9816df2..f52763c 100644 --- a/include/net/neighbour.h +++ b/include/net/neighbour.h @@ -428,10 +428,10 @@ static inline int neigh_event_send(struct neighbour *neigh, struct sk_buff *skb) { unsigned long now = jiffies; - if (neigh->used != now) - neigh->used = now; if (!(neigh->nud_state&(NUD_CONNECTED|NUD_DELAY|NUD_PROBE))) return __neigh_event_send(neigh, skb); + if (neigh->used != now) + neigh->used = now; return 0; } diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 16a1a4c..52a8718 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -991,8 +991,18 @@ static void neigh_timer_handler(unsigned long arg) int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb) { - int rc; bool immediate_probe = false; + unsigned long now = jiffies; + int rc; + + if (neigh->used != now) { + neigh->used = now; + } else if (neigh->nud_state == NUD_INCOMPLETE && + (!skb || neigh->arp_queue_len_bytes + skb->truesize > + NEIGH_VAR(neigh->parms, QUEUE_LEN_BYTES))) { + kfree_skb(skb); + return 1; + } write_lock_bh(&neigh->lock); @@ -1005,7 +1015,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb) if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) { if (NEIGH_VAR(neigh->parms, MCAST_PROBES) + NEIGH_VAR(neigh->parms, APP_PROBES)) { - unsigned long next, now = jiffies; + unsigned long next; atomic_set(&neigh->probes, NEIGH_VAR(neigh->parms, UCAST_PROBES)); Regards -- Julian Anastasov
Re: 100% CPU load when generating traffic to destination network that nexthop is not reachable
Hello, On Wed, 16 Aug 2017, Julian Anastasov wrote: > I thought about this, it is possible in > neigh_event_send: > > if (neigh->used != now) > neigh->used = now; > else if (neigh->nud_state == NUD_INCOMPLETE && >neigh->arp_queue_len_bytes + skb->truesize > >NEIGH_VAR(neigh->parms, QUEUE_LEN_BYTES) With kfree_skb(skb) here, of course... > return 1; > > But this is really in fast path and not sure it is > worth it. May be if we can move it somehow in __neigh_event_send > but as neigh->used is changed early we need a better idea > how to reduce the arp_queue hit rate... Regards -- Julian Anastasov
Re: 100% CPU load when generating traffic to destination network that nexthop is not reachable
Hello, On Tue, 15 Aug 2017, Eric Dumazet wrote: > On Tue, 2017-08-15 at 22:45 +0300, Julian Anastasov wrote: > > Hello, > > > > On Tue, 15 Aug 2017, Eric Dumazet wrote: > > > > > Please try this : > > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > > > index > > > 16a1a4c4eb57fa1147f230916e2e62e18ef89562..95e0d7702029b583de8229e3c3eb923f6395b072 > > > 100644 > > > --- a/net/core/neighbour.c > > > +++ b/net/core/neighbour.c > > > @@ -991,14 +991,18 @@ static void neigh_timer_handler(unsigned long arg) > > > > > > int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb) > > > { > > > - int rc; > > > bool immediate_probe = false; > > > + int rc; > > > + > > > + /* We _should_ test this under write_lock_bh(&neigh->lock), > > > + * but this is too costly. > > > + */ > > > + if (READ_ONCE(neigh->nud_state) & (NUD_CONNECTED | NUD_DELAY | > > > NUD_PROBE)) > > > + return 0; > > > > The same fast check is already done in the only caller, > > neigh_event_send. Now we risk to enter the > > 'if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {' block... > > > Right you are. > > It must be possible to add a fast path without locks. > > (say if jiffies has not changed before last state change) I thought about this, it is possible in neigh_event_send: if (neigh->used != now) neigh->used = now; else if (neigh->nud_state == NUD_INCOMPLETE && neigh->arp_queue_len_bytes + skb->truesize > NEIGH_VAR(neigh->parms, QUEUE_LEN_BYTES) return 1; But this is really in fast path and not sure it is worth it. May be if we can move it somehow in __neigh_event_send but as neigh->used is changed early we need a better idea how to reduce the arp_queue hit rate... Regards -- Julian Anastasov
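The fast-path idea discussed here — drop the packet early, without taking neigh->lock, when the arp_queue is already full — can be sketched in userspace C. Names and the queue limit are illustrative; as the thread notes, the kernel version would still need to recheck state under the lock, since the lockless read is intentionally racy:

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative stand-in for NEIGH_VAR(neigh->parms, QUEUE_LEN_BYTES). */
#define QUEUE_LEN_BYTES 1024

struct neigh_model {
    atomic_int queue_len_bytes;   /* models arp_queue_len_bytes */
};

/* Slow path: in the kernel this work runs under write_lock_bh(). */
static void slow_path_enqueue(struct neigh_model *n, int truesize)
{
    atomic_fetch_add(&n->queue_len_bytes, truesize);
}

/* Returns 1 if the packet is dropped on the lockless fast path,
 * 0 if it reached the slow path and was queued. */
static int event_send(struct neigh_model *n, int skb_truesize)
{
    /* Racy read is acceptable here: a stale value only means we
     * fall through to the slow path and recheck under the lock.
     * The win is skipping the lock when the queue is clearly full. */
    if (atomic_load(&n->queue_len_bytes) + skb_truesize > QUEUE_LEN_BYTES)
        return 1;   /* kfree_skb(skb) in the kernel proposal */

    slow_path_enqueue(n, skb_truesize);
    return 0;
}
```

This captures why the check helps the 100% CPU case: once the queue is saturated, every further packet takes only one atomic load instead of a write-locked slow path.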
Re: 100% CPU load when generating traffic to destination network that nexthop is not reachable
Hello, On Tue, 15 Aug 2017, Eric Dumazet wrote: > Please try this : > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index > 16a1a4c4eb57fa1147f230916e2e62e18ef89562..95e0d7702029b583de8229e3c3eb923f6395b072 > 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -991,14 +991,18 @@ static void neigh_timer_handler(unsigned long arg) > > int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb) > { > - int rc; > bool immediate_probe = false; > + int rc; > + > + /* We _should_ test this under write_lock_bh(&neigh->lock), > + * but this is too costly. > + */ > + if (READ_ONCE(neigh->nud_state) & (NUD_CONNECTED | NUD_DELAY | > NUD_PROBE)) > + return 0; The same fast check is already done in the only caller, neigh_event_send. Now we risk to enter the 'if (!(neigh->nud_state & (NUD_STALE | NUD_INCOMPLETE))) {' block... > write_lock_bh(&neigh->lock); > > rc = 0; > - if (neigh->nud_state & (NUD_CONNECTED | NUD_DELAY | NUD_PROBE)) > - goto out_unlock_bh; > if (neigh->dead) > goto out_dead; Regards -- Julian Anastasov
Re: [PATCH V5 net-next] neigh: Really delete an arp/neigh entry on "ip neigh delete" or "arp -d"
Hello, On Fri, 2 Jun 2017, Sowmini Varadhan wrote: > The command > # arp -s 62.2.0.1 a:b:c:d:e:f dev eth2 > adds an entry like the following (listed by "arp -an") > ? (62.2.0.1) at 0a:0b:0c:0d:0e:0f [ether] PERM on eth2 > but the symmetric deletion command > # arp -i eth2 -d 62.2.0.1 > does not remove the PERM entry from the table, and instead leaves behind > ? (62.2.0.1) at on eth2 > > The reason is that there is a refcnt of 1 for the arp_tbl itself > (neigh_alloc starts off the entry with a refcnt of 1), thus > the neigh_release() call from arp_invalidate() will (at best) just > decrement the ref to 1, but will never actually free it from the > table. > > To fix this, we need to do something like neigh_forced_gc: if > the refcnt is 1 (i.e., on the table's ref), remove the entry from > the table and free it. This patch refactors and shares common code > between neigh_forced_gc and the newly added neigh_remove_one. > > A similar issue exists for IPv6 Neighbor Cache entries, and is fixed > in a similar manner by this patch. > > Signed-off-by: Sowmini Varadhan > Reviewed-by: Julian Anastasov Very good, thanks! > --- > v2: kbuild-test-robot compile error > v3: do not export_symbol neigh_remove_one (David Miller comment) > v4: rearrange locking for tbl->lock (Julian Anastasov comment) > v5: minor touch-ups: removed retval from neigh_remove_one (Julian Anastasov), > got rid of checkpatch warnings, explicitly add net-next to subject line. 
> > include/net/neighbour.h |1 + > net/core/neighbour.c| 60 ++ > net/ipv4/arp.c |4 +++ > 3 files changed, 54 insertions(+), 11 deletions(-) > > diff --git a/include/net/neighbour.h b/include/net/neighbour.h > index e4dd3a2..639b675 100644 > --- a/include/net/neighbour.h > +++ b/include/net/neighbour.h > @@ -317,6 +317,7 @@ struct neighbour *__neigh_create(struct neigh_table *tbl, > const void *pkey, > int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new, u32 > flags, >u32 nlmsg_pid); > void __neigh_set_probe_once(struct neighbour *neigh); > +bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl); > void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev); > int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev); > int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb); > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index d274f81..dadb5ee 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -118,6 +118,50 @@ unsigned long neigh_rand_reach_time(unsigned long base) > EXPORT_SYMBOL(neigh_rand_reach_time); > > > +static bool neigh_del(struct neighbour *n, __u8 state, > + struct neighbour __rcu **np, struct neigh_table *tbl) > +{ > + bool retval = false; > + > + write_lock(&n->lock); > + if (atomic_read(&n->refcnt) == 1 && !(n->nud_state & state)) { > + struct neighbour *neigh; > + > + neigh = rcu_dereference_protected(n->next, > + lockdep_is_held(&tbl->lock)); > + rcu_assign_pointer(*np, neigh); > + n->dead = 1; > + retval = true; > + } > + write_unlock(&n->lock); > + if (retval) > + neigh_cleanup_and_release(n); > + return retval; > +} > + > +bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl) > +{ > + struct neigh_hash_table *nht; > + void *pkey = ndel->primary_key; > + u32 hash_val; > + struct neighbour *n; > + struct neighbour __rcu **np; > + > + nht = rcu_dereference_protected(tbl->nht, > + lockdep_is_held(&tbl->lock)); > + hash_val = 
tbl->hash(pkey, ndel->dev, nht->hash_rnd); > + hash_val = hash_val >> (32 - nht->hash_shift); > + > + np = &nht->hash_buckets[hash_val]; > + while ((n = rcu_dereference_protected(*np, > + lockdep_is_held(&tbl->lock { > + if (n == ndel) > + return neigh_del(n, 0, np, tbl); > + np = &n->next; > + } > + return false; > +} > + > static int neigh_forced_gc(struct neigh_table *tbl) > { > int shrunk = 0; > @@ -140,19 +184,10 @@ static int neigh_forced_gc(struct neigh_table *tbl) >* - nobody refers to it. >* - it is not permanent >
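The bucket walk in neigh_remove_one() above uses a pointer to the previous link (`struct neighbour __rcu **np`) so the matching entry can be unlinked without special-casing the bucket head. A userspace sketch of that pattern, minus the RCU annotations and table lock — `node`, `remove_one`, and the refcount field are invented names for illustration:

```c
#include <assert.h>
#include <stdlib.h>

struct node {
    struct node *next;
    int refcnt;   /* models neigh->refcnt; 1 == only the table's ref */
};

/* Unlink ndel from the singly linked list at *head, but only when the
 * list's own reference is the last one (mirroring the refcnt == 1
 * check in neigh_del()). Returns 1 if the node was removed and freed. */
static int remove_one(struct node **head, struct node *ndel)
{
    struct node **np;

    /* np always points at the link to fix up, so removing the first
     * node and removing a middle node are the same operation. */
    for (np = head; *np; np = &(*np)->next) {
        if (*np == ndel) {
            if (ndel->refcnt != 1)
                return 0;   /* still referenced elsewhere */
            *np = ndel->next;
            free(ndel);     /* neigh_cleanup_and_release() in-kernel */
            return 1;
        }
    }
    return 0;
}
```

In the kernel version the unlink is an rcu_assign_pointer() under tbl->lock and the free is deferred past a grace period, but the pointer-to-pointer traversal is the same.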
Re: [PATCH V4] neigh: Really delete an arp/neigh entry on "ip neigh delete" or "arp -d"
Hello, On Wed, 31 May 2017, Sowmini Varadhan wrote: > The command > # arp -s 62.2.0.1 a:b:c:d:e:f dev eth2 > adds an entry like the following (listed by "arp -an") > ? (62.2.0.1) at 0a:0b:0c:0d:0e:0f [ether] PERM on eth2 > but the symmetric deletion command > # arp -i eth2 -d 62.2.0.1 > does not remove the PERM entry from the table, and instead leaves behind > ? (62.2.0.1) at on eth2 > > The reason is that there is a refcnt of 1 for the arp_tbl itself > (neigh_alloc starts off the entry with a refcnt of 1), thus > the neigh_release() call from arp_invalidate() will (at best) just > decrement the ref to 1, but will never actually free it from the > table. > > To fix this, we need to do something like neigh_forced_gc: if > the refcnt is 1 (i.e., on the table's ref), remove the entry from > the table and free it. This patch refactors and shares common code > between neigh_forced_gc and the newly added neigh_remove_one. > > A similar issue exists for IPv6 Neighbor Cache entries, and is fixed > in a similar manner by this patch. > > Signed-off-by: Sowmini Varadhan Change looks ok to me but with some non-fatal warnings, see below. 
Reviewed-by: Julian Anastasov > --- > v2: kbuild-test-robot compile error > v3: do not export_symbol neigh_remove_one (David Miller comment) > v4: rearrange locking for tbl->lock (Julian Anastasov comment) > > include/net/neighbour.h |1 + > net/core/neighbour.c| 61 ++ > net/ipv4/arp.c |4 +++ > 3 files changed, 55 insertions(+), 11 deletions(-) > > diff --git a/include/net/neighbour.h b/include/net/neighbour.h > index e4dd3a2..639b675 100644 > --- a/include/net/neighbour.h > +++ b/include/net/neighbour.h > @@ -317,6 +317,7 @@ struct neighbour *__neigh_create(struct neigh_table *tbl, > const void *pkey, > int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new, u32 > flags, >u32 nlmsg_pid); > void __neigh_set_probe_once(struct neighbour *neigh); > +bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl); > void neigh_changeaddr(struct neigh_table *tbl, struct net_device *dev); > int neigh_ifdown(struct neigh_table *tbl, struct net_device *dev); > int neigh_resolve_output(struct neighbour *neigh, struct sk_buff *skb); > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index d274f81..b473f9f 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c ... 
> +bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl) > +{ > + struct neigh_hash_table *nht; > + void *pkey = ndel->primary_key; > + u32 hash_val; > + struct neighbour *n; > + struct neighbour __rcu **np; > + bool retval = false; > + > + nht = rcu_dereference_protected(tbl->nht, > + lockdep_is_held(&tbl->lock)); > + hash_val = tbl->hash(pkey, ndel->dev, nht->hash_rnd); > + hash_val = hash_val >> (32 - nht->hash_shift); > + > + np = &nht->hash_buckets[hash_val]; > + while ((n = rcu_dereference_protected(*np, > + lockdep_is_held(&tbl->lock))) != NULL) { checkpatch shows some warnings: scripts/checkpatch.pl --strict /tmp/file.patch > + if (n == ndel) { > + retval = neigh_del(n, 0, np, tbl); > + return retval; In case there is another patch version, the retval can be removed: if (n == ndel) return neigh_del(n, 0, np, tbl); > + } > + np = &n->next; > + } > + return retval; return false; > +} > + > static int neigh_forced_gc(struct neigh_table *tbl) > { > int shrunk = 0; > @@ -140,19 +185,10 @@ static int neigh_forced_gc(struct neigh_table *tbl) >* - nobody refers to it. >* - it is not permanent >*/ > - write_lock(&n->lock); > - if (atomic_read(&n->refcnt) == 1 && > - !(n->nud_state & NUD_PERMANENT)) { > - rcu_assign_pointer(*np, > - rcu_dereference_protected(n->next, > - lockdep_is_held(&tbl->lock))); > - n->dead = 1; > - shrunk = 1; > - write_unlock(&n->lock); > - neigh_cleanup_and_release(n); > + if (neigh
Re: [PATCH V3] neigh: Really delete an arp/neigh entry on "ip neigh delete" or "arp -d"
Hello, On Wed, 31 May 2017, Sowmini Varadhan wrote: > On (06/01/17 00:41), Julian Anastasov wrote: > > > > So, we do not hold reference to neigh while accessing > > its fields. I suspect we need to move the table lock from > > neigh_remove_one here, for example: > > good point, let me think over your suggestion carefully (it sounds > right, I want to make sure I dont miss any other race-windows) > and post patch v4 tomorrow.. Another problem is that neigh_update() changes the state but before we go and unlink the entry another CPU can reactivate the entry, i.e. NUD_INCOMPLETE entered in __neigh_event_send(). So, there will be always some small window where surprises can happen and the entry is not really deleted. Checking for NUD_FAILED may not be needed, the atomic_read(&n->refcnt) == 1 check under lock ensures that we are in some state without timer, i.e. not in NUD_INCOMPLETE, so it is ok to remove the entry. > > Another solution to cause faster removal would be > > to cancel the gc_work and to schedule it after 1 jiffie. > > It helps when many entries are deleted at once but the > > work prefers to just sleep when gc_thresh1 is not reached, > > so such solution is not good enough. > > Right the other drawback of relying on gc for cleanup is > that it doesn't give higher preference to cleaning up FAILED > entries first- so it can end up cleaning up other useful entries > (as I was pointing out to David Ahern) I see, ok. Regards -- Julian Anastasov
Re: [PATCH V3] neigh: Really delete an arp/neigh entry on "ip neigh delete" or "arp -d"
Hello, On Tue, 30 May 2017, Sowmini Varadhan wrote: > @@ -1650,6 +1689,7 @@ static int neigh_delete(struct sk_buff *skb, struct > nlmsghdr *nlh, > NEIGH_UPDATE_F_ADMIN, > NETLINK_CB(skb).portid); > neigh_release(neigh); > + neigh_remove_one(neigh, tbl); So, we do not hold reference to neigh while accessing its fields. I suspect we need to move the table lock from neigh_remove_one here, for example: + write_lock_bh(&tbl->lock); neigh_release(neigh); + neigh_remove_one(neigh, tbl); + write_unlock_bh(&tbl->lock); Otherwise, neigh_forced_gc can call neigh_destroy where neigh is freed (and replaced by another neigh entry) after RCU grace period (which can end prematurely if there are no running RCU read-side critical sections) and I'm not sure if our thread can be delayed, so that such pointer replacement can happen unnoticed by us. Another solution to cause faster removal would be to cancel the gc_work and to schedule it after 1 jiffie. It helps when many entries are deleted at once but the work prefers to just sleep when gc_thresh1 is not reached, so such solution is not good enough. Regards -- Julian Anastasov
Re: [PATCH net] ipv4: add reference counting to metrics
Hello, On Thu, 25 May 2017, Eric Dumazet wrote: > From: Eric Dumazet > > Andrey Konovalov reported crashes in ipv4_mtu() > > I could reproduce the issue with KASAN kernels, between > 10.246.7.151 and 10.246.7.152 : > > 1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 & > > 2) At the same time run following loop : > while : > do > ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 > ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500 > done > > > Cong Wang attempted to add back rt->fi in commit > 82486aa6f1b9 ("ipv4: restore rt->fi for reference counting") > but this proved to add some issues that were complex to solve. > > Instead, I suggested to add a refcount to the metrics themselves, > being a standalone object (in particular, no reference to other objects) > > I tried to make this patch as small as possible to ease its backport, > instead of being super clean. Note that we believe that only ipv4 dst > need to take care of the metric refcount. But if this is wrong, > this patch adds the basic infrastructure to extend this to other > families. > > Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang > for his efforts on this problem. > > Fixes: 2860583fe840 ("ipv4: Kill rt->fi") > Signed-off-by: Eric Dumazet > Reported-by: Andrey Konovalov Nice work, thanks! 
Reviewed-by: Julian Anastasov > --- > include/net/dst.h|8 +++- > include/net/ip_fib.h | 10 +- > net/core/dst.c | 23 ++- > net/ipv4/fib_semantics.c | 17 ++--- > net/ipv4/route.c | 10 +- > 5 files changed, 45 insertions(+), 23 deletions(-) > > diff --git a/include/net/dst.h b/include/net/dst.h > index > 049af33da3b6c95897d544670cea65c542317673..cfc0437841665d7ed46a714915c50d723c24901c > 100644 > --- a/include/net/dst.h > +++ b/include/net/dst.h > @@ -107,10 +107,16 @@ struct dst_entry { > }; > }; > > +struct dst_metrics { > + u32 metrics[RTAX_MAX]; > + atomic_trefcnt; > +}; > +extern const struct dst_metrics dst_default_metrics; > + > u32 *dst_cow_metrics_generic(struct dst_entry *dst, unsigned long old); > -extern const u32 dst_default_metrics[]; > > #define DST_METRICS_READ_ONLY0x1UL > +#define DST_METRICS_REFCOUNTED 0x2UL > #define DST_METRICS_FLAGS0x3UL > #define __DST_METRICS_PTR(Y) \ > ((u32 *)((Y) & ~DST_METRICS_FLAGS)) > diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h > index > 6692c5758b332d468f1e0611ecc4f3e03ae03b2b..f7f6aa789c6174c41ca9739206d586c559c1f3a1 > 100644 > --- a/include/net/ip_fib.h > +++ b/include/net/ip_fib.h > @@ -114,11 +114,11 @@ struct fib_info { > __be32 fib_prefsrc; > u32 fib_tb_id; > u32 fib_priority; > - u32 *fib_metrics; > -#define fib_mtu fib_metrics[RTAX_MTU-1] > -#define fib_window fib_metrics[RTAX_WINDOW-1] > -#define fib_rtt fib_metrics[RTAX_RTT-1] > -#define fib_advmss fib_metrics[RTAX_ADVMSS-1] > + struct dst_metrics *fib_metrics; > +#define fib_mtu fib_metrics->metrics[RTAX_MTU-1] > +#define fib_window fib_metrics->metrics[RTAX_WINDOW-1] > +#define fib_rtt fib_metrics->metrics[RTAX_RTT-1] > +#define fib_advmss fib_metrics->metrics[RTAX_ADVMSS-1] > int fib_nhs; > #ifdef CONFIG_IP_ROUTE_MULTIPATH > int fib_weight; > diff --git a/net/core/dst.c b/net/core/dst.c > index > 960e503b5a529a2c4f1866f49c150493ee98d7da..6192f11beec9077de964e2aeff4f78547f08b8da > 100644 > --- a/net/core/dst.c > +++ b/net/core/dst.c > @@ -151,13 
+151,13 @@ int dst_discard_out(struct net *net, struct sock *sk, > struct sk_buff *skb) > } > EXPORT_SYMBOL(dst_discard_out); > > -const u32 dst_default_metrics[RTAX_MAX + 1] = { > +const struct dst_metrics dst_default_metrics = { > /* This initializer is needed to force linker to place this variable >* into const section. Otherwise it might end into bss section. >* We really want to avoid false sharing on this variable, and catch >* any writes on it. >*/ > - [RTAX_MAX] = 0xdeadbeef, > + .refcnt = ATOMIC_INIT(1), > }; > > void dst_init(struct dst_entry *dst, struct dst_ops *ops, > @@ -169,7 +169,7 @@ void dst_init(struct dst_entry *dst, struct dst_ops *ops, > if (dev) > dev_hold(dev); > dst->ops = ops; > - dst_init_metrics(dst, dst_default_metrics, true); > + dst_
Re: [PATCH v2 2/4] arp: decompose is_garp logic into a separate function
Hello, On Thu, 18 May 2017, Ihar Hrachyshka wrote: > The code is quite involving already to earn a separate function for > itself. If anything, it helps arp_process readability. > > Signed-off-by: Ihar Hrachyshka > --- > net/ipv4/arp.c | 35 +++ > 1 file changed, 23 insertions(+), 12 deletions(-) > > diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c > index 053492a..ca6e1e6 100644 > --- a/net/ipv4/arp.c > +++ b/net/ipv4/arp.c > @@ -641,6 +641,27 @@ void arp_xmit(struct sk_buff *skb) > } > EXPORT_SYMBOL(arp_xmit); > > +static bool arp_is_garp(struct net_device *dev, int addr_type, > + __be16 ar_op, > + __be32 sip, __be32 tip, > + unsigned char *sha, unsigned char *tha) > +{ > + bool is_garp = tip == sip && addr_type == RTN_UNICAST; > + > + /* Gratuitous ARP _replies_ also require target hwaddr to be > + * the same as source. > + */ > + if (is_garp && ar_op == htons(ARPOP_REPLY)) > + is_garp = > + /* IPv4 over IEEE 1394 doesn't provide target > + * hardware address field in its ARP payload. > + */ > + tha && All 4 patches look ok to me, with only a small problem which comes from a patch already included in the kernel. I see that GARP replies cannot work for 1394: is_garp will be cleared. Maybe the 'tha' check should be moved into the if expression, for example: if (is_garp && ar_op == htons(ARPOP_REPLY) && tha) is_garp = !memcmp(tha, sha, dev->addr_len); > + !memcmp(tha, sha, dev->addr_len); > + > + return is_garp; > +} Regards -- Julian Anastasov
Re: [PATCH 2/2] arp: always override existing neigh entries with gratuitous ARP
Hello, On Wed, 17 May 2017, Ihar Hrachyshka wrote: > Currently, when arp_accept is 1, we always override existing neigh > entries with incoming gratuitous ARP replies. Otherwise, we override > them only if new replies satisfy _locktime_ conditional (packets arrive > not earlier than _locktime_ seconds since the last update to the neigh > entry). > > The idea behind locktime is to pick the very first (=> close) reply > received in a unicast burst when ARP proxies are used. This helps to > avoid ARP thrashing where Linux would switch back and forth from one > proxy to another. > > This logic has nothing to do with gratuitous ARP replies that are > generally not aligned in time when multiple IP address carriers send > them into network. > > This patch enforces overriding of existing neigh entries by all incoming > gratuitous ARP packets, irrespective of their time of arrival. This will > make the kernel honour all incoming gratuitous ARP packets. > > Signed-off-by: Ihar Hrachyshka > --- > net/ipv4/arp.c | 14 +++--- > 1 file changed, 7 insertions(+), 7 deletions(-) > > diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c > index 3f06201..97ea9d8 100644 > --- a/net/ipv4/arp.c > +++ b/net/ipv4/arp.c > @@ -858,16 +858,16 @@ static int arp_process(struct net *net, struct sock > *sk, struct sk_buff *skb) > > n = __neigh_lookup(&arp_tbl, &sip, dev, 0); > > - if (IN_DEV_ARP_ACCEPT(in_dev)) { > - unsigned int addr_type = inet_addr_type_dev_table(net, dev, > sip); > - > - /* Unsolicited ARP is not accepted by default. The above line is not related to GARP. I think it should stay at its old place; see below. Also, in the first patch, the line should be: /* GARP _replies_ also require target hwaddr to be * the same as source. 
*/ > -It is possible, that this option should be enabled for some > -devices (strip is candidate) > - */ > + if (n || IN_DEV_ARP_ACCEPT(in_dev)) { > + addr_type = inet_addr_type_dev_table(net, dev, sip); > is_garp = arp_is_garp(dev, addr_type, arp->ar_op, > sip, tip, sha, tha); There was something I didn't like in the old code: inet_addr_type_dev_table() can be slow and we should call it only when needed. With some rearranging we can fix it, for example: 1. arp_is_garp(net,..., int *addr_type) should use: bool is_garp = tip == sip; ... if (is_garp) { *addr_type = inet_addr_type_dev_table(net, dev, sip); if (*addr_type != RTN_UNICAST) is_garp = false; } return is_garp; > + } > > + if (IN_DEV_ARP_ACCEPT(in_dev)) { > + /* It is possible, that this option should be enabled for some > + * devices (strip is candidate) > + */ > if (!n && > ((arp->ar_op == htons(ARPOP_REPLY) && > addr_type == RTN_UNICAST) || is_garp)) > 2. fill addr_type and do is_garp check first: if (n || IN_DEV_ARP_ACCEPT(in_dev)) { addr_type = -1; is_garp = arp_is_garp(net, dev, addr_type, arp->ar_op, sip, tip, sha, tha, &addr_type); } /* Unsolicited ARP is not accepted by default. * It is possible, that this option should be enabled for some * devices (strip is candidate) */ if (!n && IN_DEV_ARP_ACCEPT(in_dev) && (is_garp || (arp->ar_op == htons(ARPOP_REPLY) && (addr_type == RTN_UNICAST || (addr_type < 0 && inet_addr_type_dev_table(net, dev, sip) == RTN_UNICAST) n = __neigh_lookup(&arp_tbl, &sip, dev, 1); As a result, we will avoid the unneeded inet_addr_type_dev_table() call for ARP requests (non-GARP), which are by far the most common case. Maybe there is a nicer way to structure this code... Regards -- Julian Anastasov
Re: [Patch net] ipv4: restore rt->fi for reference counting
Hello, On Mon, 15 May 2017, Cong Wang wrote: > On Mon, May 15, 2017 at 1:37 PM, Julian Anastasov wrote: > > Any user that does not set FIB_LOOKUP_NOREF > > will need nh_dev refcounts. The assumption is that the > > NHs are accessed, who knows, may be even after RCU grace > > period. As result, we can not use dev_put on NETDEV_UNREGISTER. > > So, we should check if there are users that do not > > set FIB_LOOKUP_NOREF, at first look, I don't see such ones > > for IPv4. > > I see, although we do have FIB_LOOKUP_NOREF set all the times, > there are other places we hold fib_clntref too, for example > mlxsw_sp_router_fib_event_work(), it actually uses nh_dev too... > > So I am afraid moving dev_put() to fib_release_info() is not a solution > here. I have to rethink about it. At first look, they use fib_info_hold() to get a fib_clntref reference from places where fib_treeref is not finally decreased to 0, but later a work is used which finishes the job. I guess we can convert such places to use just a fib_treeref reference. They can use a new method like this instead of fib_info_hold: void fib_treeref_get(struct fib_info *fi) { spin_lock_bh(&fib_info_lock); fi->fib_treeref++; spin_unlock_bh(&fib_info_lock); } They will use fib_release_info() to put the reference. But on FIB_EVENT_ENTRY_DEL there is a small window where the scheduled work delays the unlink of the fib info from the hash tables, i.e. there is a risk that fib_find_info will reuse a dead fib info. Maybe we can add a fi->fib_flags & RTNH_F_DEAD check there, but the problem is that it is also set on NETDEV_DOWN. While attempts to add a route to a device with !(dev->flags & IFF_UP) are rejected by fib_check_nh(), fib_create_info can still create routes when cfg->fc_scope == RT_SCOPE_HOST. So, an RTNH_F_DEAD check in fib_find_info can avoid the reuse of a fib info for host routes while the device is down but not unregistered. As a result, many fib infos can be created instead of one being reused. 
Adding a new RTNH_F_* flag to properly handle this does not look good... Regards -- Julian Anastasov
Re: [PATCH] neighbour: update neigh timestamps iff update is effective
Hello, On Mon, 15 May 2017, Ihar Hrachyshka wrote: > On Mon, May 15, 2017 at 1:05 PM, Julian Anastasov wrote: > > > > It seems arp_accept value currently has influence on > > the locktime for GARP requests. My understanding is that > > locktime is used to ignore replies from proxy_arp > > routers while the requested IP is present on the LAN > > and replies immediately. IMHO, GARP requests should not > > depend on locktime, even when arp_accept=0. For example: > > Yes, I believe so. > > I actually thought about introducing the patch that does just that: > forcing override on garp, but then I was thinking, maybe there is some > reason to still apply locktime rules to garps; f.e. if you have > multiple nodes carrying the ip address and located on the same > segment, maybe you want to pick the first that replies to you (in > theory, it may be the node that is less loaded, or closer to us; but > then, it's so fragile even if that was the intent...) Do you want me > to post the patch, or will you cover it? Feel free to post a patch for this. I see that you change the is_garp value in another patch, so it seems the same logic should be used in both places. Regards -- Julian Anastasov
Re: [Patch net] ipv4: restore rt->fi for reference counting
Hello, On Mon, 15 May 2017, Cong Wang wrote: > On Fri, May 12, 2017 at 2:27 PM, Julian Anastasov wrote: > > Now the main question: is FIB_LOOKUP_NOREF used > > everywhere in IPv4? I guess so. If not, it means > > someone can walk its res->fi NHs which is bad. I think, > > this will delay the unregistration for long time and we > > can not solve the problem. > > > > If yes, free_fib_info() should not use call_rcu. > > Instead, fib_release_info() will start RCU callback to > > drop everything via a common function for fib_release_info > > and free_fib_info. As result, the last fib_info_put will > > just need to free fi->fib_metrics and fi. > > > Yes it is used. But this is a different problem from the > dev refcnt issue, right? I can send a separate patch to > address it. Any user that does not set FIB_LOOKUP_NOREF will need nh_dev refcounts. The assumption is that the NHs are accessed, who knows, may be even after RCU grace period. As result, we can not use dev_put on NETDEV_UNREGISTER. So, we should check if there are users that do not set FIB_LOOKUP_NOREF, at first look, I don't see such ones for IPv4. > >> Are you sure we are safe to call dev_put() in fib_release_info() > >> for _all_ paths, especially non-unregister paths? See: > > > > Yep, dev_put is safe there... > > > >> commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1 > >> Author: Yanmin Zhang > >> Date: Wed May 23 15:39:45 2012 + > >> > >> ipv4: fix the rcu race between free_fib_info and ip_route_output_slow > > > > ...as long as we do not set nh_dev to NULL > > > > OK, fair enough, then I think the best solution here is to move > the dev_put() from free_fib_info_rcu() to fib_release_info(), > fib_nh is already removed from hash there anyway. free_fib_info still needs to put the references, that is the reason for the common fib_info_release() in my example. It happens in fib_create_info() where free_fib_info() is called. The func names in my example can be corrected, if needed. 
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c > index da449dd..cb712d1 100644 > --- a/net/ipv4/fib_semantics.c > +++ b/net/ipv4/fib_semantics.c > @@ -205,8 +205,6 @@ static void free_fib_info_rcu(struct rcu_head *head) > struct fib_info *fi = container_of(head, struct fib_info, rcu); > > change_nexthops(fi) { > - if (nexthop_nh->nh_dev) > - dev_put(nexthop_nh->nh_dev); > lwtstate_put(nexthop_nh->nh_lwtstate); > free_nh_exceptions(nexthop_nh); > rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output); > @@ -246,6 +244,14 @@ void fib_release_info(struct fib_info *fi) > if (!nexthop_nh->nh_dev) > continue; > hlist_del(&nexthop_nh->nh_hash); > + /* We have to release these nh_dev here because a dst > +* could still hold a fib_info via rt->fi, we can't > wait > +* for GC, a socket could hold the dst for a long > time. > +* > +* This is safe, dev_put() alone does not really free > +* the netdevice, we just have to put the refcnt back. > +*/ > + dev_put(nexthop_nh->nh_dev); > } endfor_nexthops(fi) > fi->fib_dead = 1; Such solution needs the fib_dead = 1|2 game to know who dropped the nh_dev reference, fib_release_info (2) or fib_create_info (1). You can not remove the dev_put calls from free_fib_info_rcu. > fib_info_put(fi); Regards -- Julian Anastasov
Re: [PATCH] neighbour: update neigh timestamps iff update is effective
Hello, On Tue, 9 May 2017, Ihar Hrachyshka wrote: > It's a common practice to send gratuitous ARPs after moving an > IP address to another device to speed up healing of a service. To > fulfill service availability constraints, the timing of network peers > updating their caches to point to a new location of an IP address can be > particularly important. > > Sometimes neigh_update calls won't touch neither lladdr nor state, for > example if an update arrives in locktime interval. Then we effectively > ignore the update request, bailing out of touching the neigh entry, > except that we still bump its timestamps. > > This may be a problem for updates arriving in quick succession. For > example, consider the following scenario: > > A service is moved to another device with its IP address. The new device > sends three gratuitous ARP requests into the network with ~1 seconds > interval between them. Just before the first request arrives to one of > network peer nodes, its neigh entry for the IP address transitions from > STALE to DELAY. This transition, among other things, updates > neigh->updated. Once the kernel receives the first gratuitous ARP, it > ignores it because its arrival time is inside the locktime interval. The > kernel still bumps neigh->updated. Then the second gratuitous ARP > request arrives, and it's also ignored because it's still in the (new) > locktime interval. Same happens for the third request. The node > eventually heals itself (after delay_first_probe_time seconds since the > initial transition to DELAY state), but it just wasted some time and > require a new ARP request/reply round trip. This unfortunate behaviour > both puts more load on the network, as well as reduces service > availability. > > This patch changes neigh_update so that it bumps neigh->updated (as well > as neigh->confirmed) only once we are sure that either lladdr or entry > state will change). 
In the scenario described above, it means that the > second gratuitous ARP request will actually update the entry lladdr. > > Ideally, we would update the neigh entry on the very first gratuitous > ARP request. The locktime mechanism is designed to ignore ARP updates in > a short timeframe after a previous ARP update was honoured by the kernel > layer. This would require tracking timestamps for state transitions > separately from timestamps when actual updates are received. This would > probably involve changes in neighbour struct. Therefore, the patch > doesn't tackle the issue of the first gratuitous APR ignored, leaving > it for a follow-up. > > Signed-off-by: Ihar Hrachyshka Looks ok to me, Acked-by: Julian Anastasov It seems arp_accept value currently has influence on the locktime for GARP requests. My understanding is that locktime is used to ignore replies from proxy_arp routers while the requested IP is present on the LAN and replies immediately. IMHO, GARP requests should not depend on locktime, even when arp_accept=0. For example: if (IN_DEV_ARP_ACCEPT(in_dev)) { ... + } else if (n && tip == sip && arp->ar_op == htons(ARPOP_REQUEST)) { + unsigned int addr_type = inet_addr_type_dev_table(net, dev, sip); + + is_garp = (addr_type == RTN_UNICAST); } > --- > net/core/neighbour.c | 14 ++ > 1 file changed, 10 insertions(+), 4 deletions(-) > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index 58b0bcc..d274f81 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -1132,10 +1132,6 @@ int neigh_update(struct neighbour *neigh, const u8 > *lladdr, u8 new, > lladdr = neigh->ha; > } > > - if (new & NUD_CONNECTED) > - neigh->confirmed = jiffies; > - neigh->updated = jiffies; > - > /* If entry was valid and address is not changed, > do not change entry state, if new one is STALE. 
>*/ > @@ -1157,6 +1153,16 @@ int neigh_update(struct neighbour *neigh, const u8 > *lladdr, u8 new, > } > } > > + /* Update timestamps only once we know we will make a change to the > + * neighbour entry. Otherwise we risk to move the locktime window with > + * noop updates and ignore relevant ARP updates. > + */ > + if (new != old || lladdr != neigh->ha) { > + if (new & NUD_CONNECTED) > + neigh->confirmed = jiffies; > + neigh->updated = jiffies; > + } > + > if (new != old) { > neigh_del_timer(neigh); > if (new & NUD_PROBE) > -- > 2.9.3 Regards -- Julian Anastasov
Re: [Patch net] ipv4: restore rt->fi for reference counting
Hello, On Fri, 12 May 2017, Cong Wang wrote: > On Thu, May 11, 2017 at 11:39 PM, Julian Anastasov wrote: > > > > fib_flush will unlink the FIB infos at NETDEV_UNREGISTER > > time, so we can not see them in any hash tables later on > > NETDEV_UNREGISTER_FINAL. fib_put_nh_devs() can not work > > except if moving NHs in another hash table but that should > > not be needed. > > Ah, I did miss the hlist_del() in fib_table_flush(), so we just need some > way to link those fib_nh together for NETDEV_UNREGISTER_FINAL, > a linked list should be sufficient, but requires adding list_head to fib_nh. It could be a fib_info_devhash_dead hash table, similar to fib_info_devhash where we can hash NHs for dead fib infos but then we will need a fib_clntref reference or otherwise last user can drop the fib info before NETDEV_UNREGISTER_FINAL is called. It will add more code in fib_sync_down_dev to know when to call fib_info_hold. But OTOH, it will reduce the work needed for careful release of the references in fib_release_info. So, we have 2 possible cases to consider: 1. route deleted, fib_release_info called, fi held by socket, NETDEV_UNREGISTER can not see the NHs because they are unlinked in fib_release_info. dev unreg delayed for long time... 2. NETDEV_UNREGISTER called first, before the NHs are unlinked. Now the main question: is FIB_LOOKUP_NOREF used everywhere in IPv4? I guess so. If not, it means someone can walk its res->fi NHs which is bad. I think, this will delay the unregistration for long time and we can not solve the problem. If yes, free_fib_info() should not use call_rcu. Instead, fib_release_info() will start RCU callback to drop everything via a common function for fib_release_info and free_fib_info. As result, the last fib_info_put will just need to free fi->fib_metrics and fi. 
Something like: /* Long living data */ fib_info_destroy: { if (fi->fib_dead == 0) { pr_warn("Freeing alive fib_info %p\n", fi); return; } if (fi->fib_metrics != (u32 *) dst_default_metrics) kfree(fi->fib_metrics); kfree(fi); } /* Do not imply RCU grace period anymore, last user is last */ fib_info_put(): { if (atomic_dec_and_test(&fi->fib_clntref)) fib_info_destroy(fi); } /* Release everything except long living fields (fib_metrics) */ fib_info_release(): { change_nexthops(fi) { if (nexthop_nh->nh_dev) dev_put(nexthop_nh->nh_dev); lwtstate_put(nexthop_nh->nh_lwtstate); free_nh_exceptions(nexthop_nh); rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output); rt_fibinfo_free(&nexthop_nh->nh_rth_input); } endfor_nexthops(fi); } /* RCU grace period after unlink */ fib_release_info_rcu(): { struct fib_info *fi = container_of(head, struct fib_info, rcu); fib_info_release(fi); fib_info_put(fi); } /* Called just on error */ free_fib_info(): { fib_info_release(fi); fib_info_put(fi); } fib_release_info(): { ... fi->fib_dead = 1; - fib_info_put(fi); + call_rcu(&fi->rcu, fib_release_info_rcu); } > > I'm thinking for the following solution which > > is a bit hackish: > > > > - on NETDEV_UNREGISTER we want to put the nh_dev references, > > so fib_release_info is a good place. But as fib_release_info > > is not always called, we will have two places that put > > references. We can use such hack: > > > > - for example, use nh_oif to know if dev_put is > > already called > > > > - fib_release_info() should set nh_oif to 0 because > > it will now call dev_put without clearing nh_dev > > Are you sure we are safe to call dev_put() in fib_release_info() > for _all_ paths, especially non-unregister paths? See: Yep, dev_put is safe there... 
> commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1 > Author: Yanmin Zhang > Date: Wed May 23 15:39:45 2012 + > > ipv4: fix the rcu race between free_fib_info and ip_route_output_slow ...as long as we do not set nh_dev to NULL > But, hmm, very interesting, I always thought dev_put() triggers a > kfree() potentially, but here your suggestion actually leverages the fact > that it is merely a pcpu counter decrease, so for unregister path, > this is just giving refcnt back, which means it is safe as long as > we don't have any parallel unregister? We should because of RTNL. Yes, we are in the context of unregistration and there is a rcu_barrier() that prevents device to be destroyed while there are RCU users. Refcnt reaching 0 is not enough to f
Re: [Patch net] ipv4: restore rt->fi for reference counting
Hello, On Thu, 11 May 2017, Cong Wang wrote: > On Thu, May 11, 2017 at 5:07 PM, Cong Wang wrote: > > So, if I understand you correctly it is safe to NULL'ing > > nh_dev in NETDEV_UNREGISTER_FINAL, right? > > > > If still not, how about transfer nh_dev's to loopback_dev > > too in NETDEV_UNREGISTER? Like we transfer dst->dev. > > > > I don't want to touch the fast path to check for NULL, as > > it will change more code and slow down performance. > > Finally I come up with the attached patch. Please let me know if > I still miss anything. fib_flush will unlink the FIB infos at NETDEV_UNREGISTER time, so we cannot see them in any hash tables later at NETDEV_UNREGISTER_FINAL. fib_put_nh_devs() cannot work unless the NHs are moved into another hash table, but that should not be needed. I'm thinking of the following solution, which is a bit hackish: - on NETDEV_UNREGISTER we want to put the nh_dev references, so fib_release_info is a good place. But as fib_release_info is not always called, we will have two places that put references. We can use a hack like this: - for example, use nh_oif to know if dev_put is already called - fib_release_info() should set nh_oif to 0 because it will now call dev_put without clearing nh_dev - free_fib_info_rcu() will not call dev_put if nh_oif is 0: if (nexthop_nh->nh_dev && nexthop_nh->nh_oif) dev_put(nexthop_nh->nh_dev); - as fi is unlinked, there is no chance for fib_info_hashfn() to use the cleared nh_oif for hashing - we expect no one to touch nh_dev fields after fi is unlinked, this includes free_fib_info and free_fib_info_rcu What we achieve: - between NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL nh_dev still points to our dev (and not to loopback; I think this answers your question in the previous email), so route lookups will not find loopback in the FIB tree. We do not want some packets to be wrongly rerouted. Even if we don't hold a dev reference, the RCU grace period helps here. 
- fib_dev/nh_dev != NULL checks are not needed, so this addresses Eric's concerns. BTW, fib_route_seq_show actually checks for fi->fib_dev, maybe not in a safe way (lockless_dereference needed?), but as we don't set nh_dev to NULL this is not needed. What else? What about nh_pcpu_rth_output and nh_rth_input holding routes? We should think about them too. I should think more about whether the nh_oif trick can work for them; maybe not, because nh_oif is optional... Maybe we should purge them somehow? Regards -- Julian Anastasov
Re: [Patch net] ipv4: restore rt->fi for reference counting
Hello, On Wed, 10 May 2017, Cong Wang wrote: > On Wed, May 10, 2017 at 12:38 AM, Julian Anastasov wrote: > > > > During NETDEV_UNREGISTER packets for dev should not > > be flying but packets for other devs can walk the nexthops > > for multipath routes. It is the rcu_barrier before > > NETDEV_UNREGISTER_FINAL that allows nh_dev to be set to NULL > > during this grace period but there are many places to fix that > > assume nh_dev!=NULL. > > Excellent point. Unfortunately NETDEV_UNREGISTER_FINAL is not > yet captured by the fib event notifier. > > I think readers are still safe to assume nh_dev!=NULL since we wait > for existing readers, readers coming later can't see it any more. Or > am I missing anything? It is not safe because other devs can deliver packets that still can see the multipath fib info in the FIB nodes. CPU 1 (unreg dev) CPU 2 (packet to another dev2) NETDEV_DOWN => set RTNH_F_DEAD Now traffic sent via downed dev should stop flush_all_backlogs synchronize_net Now received traffic for dev should stop on all CPUs fib_sync_down_dev nh_dev = NULL fib_table_lookup: boom in __in_dev_get_rcu(nh->nh_dev), all nh_dev dereferences must be checked. If checked, it should be safe to use this nh_dev due to the rcu_barrier in netdev_run_todo. But only the structure can be accessed, traffic should not go via unreged dev. fib_flush: unlink fi from fa_info in fib_table_flush Now other CPUs will not see this fi on lookups (after fib_table_flush) netdev_run_todo rcu_barrier Now other CPUs should not send packets via this fib info NETDEV_UNREGISTER_FINAL Notes: - when NETDEV_DOWN is delivered fib_sync_down_dev sets RTNH_F_DEAD but only for link routes, not for host routes - fib_table_lookup runs without any memory ordering barriers, when CPU 2 delivers packet from different dev it can hit a multipath route with nh_dev set to NULL. Not so often if RTNH_F_DEAD was set early and properly checked before dereferencing nh_dev. > > But why we leak routes? 
Is there some place that holds > > routes without listening for NETDEV_UNREGISTER? On fib_flush > > the infos are unlinked from trees, so after a grace period packets > > should not see/hold such infos. If we hold routes somewhere for > > long time, problem can happen also for routes with single nexthop. > > My theory is that dst which holds a refcnt to fib_info (of course, after > this patch) is moved in GC after NETDEV_UNREGISTER but still > referenced somewhere, it therefore holds these nh_dev's, which > in turn hold back to the netdevice which is being unregistered, thus > Eric saw these refcnt leak warnings. Oh, well, the sockets can hold cached dst. But if the promise is that rt->fi is used only as reference to metrics we have to find a way to drop the dev references at NETDEV_UNREGISTER time. If you set nh_dev to NULL then all lookups should check it for != NULL. The sockets will not walk NHs via rt->fi, i.e. the route lookups will get valid res.fi from trees, so it may work in this way. Regards -- Julian Anastasov
Re: [Patch net] ipv4: restore rt->fi for reference counting
Hello, On Tue, 9 May 2017, Cong Wang wrote: > > Also setting nexthop_nh->nh_dev to NULL looks quite dangerous > > > > We have plenty of sites doing : > > > > if (fi->fib_dev) > > x = fi->fib_dev->field > > > > fib_route_seq_show() is one example. > > > > All of them take RCU read lock, so, as I explained in the code comment, > they all should be fine because of synchronize_net() on unregister path. > Do you see anything otherwise? During NETDEV_UNREGISTER packets for dev should not be flying but packets for other devs can walk the nexthops for multipath routes. It is the rcu_barrier before NETDEV_UNREGISTER_FINAL that allows nh_dev to be set to NULL during this grace period but there are many places to fix that assume nh_dev!=NULL. But why we leak routes? Is there some place that holds routes without listening for NETDEV_UNREGISTER? On fib_flush the infos are unlinked from trees, so after a grace period packets should not see/hold such infos. If we hold routes somewhere for long time, problem can happen also for routes with single nexthop. Regards -- Julian Anastasov
Re: PROBLEM: IPVS incorrectly reverse-NATs traffic to LVS host
Hello, On Mon, 24 Apr 2017, Nick Moriarty wrote: > Many thanks for your prompt fix! I've now tested this patch against > the 4.11.0-rc8 kernel on Ubuntu, and I can confirm that my check > script is no longer seeing incorrect addresses in its responses. > > Could you please keep me posted as this is merged? Sure. Thanks for the confirmation! I'll do some tests and will post official patch in few days. Regards -- Julian Anastasov
Re: [PATCH] ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled
Hello, On Mon, 24 Apr 2017, Paolo Abeni wrote: > Hi, > > The problem with the patched code is that it tries to resolve ipv6 > addresses that are not created/validated by the kernel. OK. Simon, please apply to ipvs tree. Acked-by: Julian Anastasov Regards -- Julian Anastasov
Re: PROBLEM: IPVS incorrectly reverse-NATs traffic to LVS host
Hello,

On Wed, 12 Apr 2017, Nick Moriarty wrote:

> Hi,
>
> I've experienced a problem in how traffic returning to an LVS host is
> handled in certain circumstances. Please find a bug report below - if
> there's any further information you'd like, please let me know.
>
> [1.] One line summary of the problem:
> IPVS incorrectly reverse-NATs traffic to LVS host
>
> [2.] Full description of the problem/report:
> When using IPVS in direct-routing mode, normal traffic from the LVS
> host to a back-end server is sometimes incorrectly NATed on the way
> back into the LVS host. Using tcpdump shows that the return packets
> have the correct source IP, but by the time it makes it back to the
> application, it's been changed.
>
> To reproduce this, a configuration such as the following will work:
> - Set up an LVS system with a VIP serving UDP to a backend DNS server
>   using the direct-routing method in IPVS
> - Make an outgoing UDP request to the VIP from the LVS system itself
>   (this causes a connection to be added to the IPVS connection table)
> - The request should succeed as normal
> - Note the UDP source port used
> - Within 5 minutes (before the UDP connection entry expires), make an
>   outgoing UDP request directly to the backend DNS server
> - The request will fail as the reply is incorrectly modified on its
>   way back and appears to return from the VIP
>
> Monitoring the above sequence with tcpdump verifies that the returned
> packet (as it enters the host) is from the DNS IP, even though the
> application sees the VIP.
>
> If an outgoing request direct to the DNS server is made from a port
> not in the connection table, everything is fine.

	Thanks for the detailed report! I think I fixed the problem.
Let me know if you are able to test the appended fix.

> I expect that somewhere, something (e.g. functionality for IPVS MASQ
> responses) is applying IPVS connection information to incoming
> traffic, matching a DROUTE rule, and treating it as NAT traffic.
	Yep, that is what happens.

[PATCH net] ipvs: SNAT packet replies only for NATed connections

We do not check if a packet from the real server is for a NAT
connection before performing SNAT. This causes problems for setups
that use DR/TUN and allow local clients to access the real server
directly, for example:

- a local client in the director creates an IPVS-DR/TUN connection
  CIP->VIP and the request packets are routed to RIP. Talks are
  finished but the IPVS connection is not expired yet.

- a second local client creates a non-IPVS connection CIP->RIP with
  the same reply tuple RIP->CIP, and when replies are received on
  LOCAL_IN we wrongly assign them to the first client's connection
  because RIP->CIP matches the reply direction.

The problem is more visible to local UDP clients, but in rare cases it
can happen also for TCP or remote clients when the real server sends
the reply traffic via the director.

So, better to be more precise for the reply traffic. As replies are
not expected for DR/TUN connections, better to not touch them.
Reported-by: Nick Moriarty
Signed-off-by: Julian Anastasov
---
 net/netfilter/ipvs/ip_vs_core.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index db40050..ee44ed5 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -849,10 +849,8 @@ static int handle_response_icmp(int af, struct sk_buff *skb,
 {
 	unsigned int verdict = NF_DROP;
 
-	if (IP_VS_FWD_METHOD(cp) != 0) {
-		pr_err("shouldn't reach here, because the box is on the "
-		       "half connection in the tun/dr module.\n");
-	}
+	if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ)
+		goto ignore_cp;
 
 	/* Ensure the checksum is correct */
 	if (!skb_csum_unnecessary(skb) && ip_vs_checksum_complete(skb, ihl)) {
@@ -886,6 +884,8 @@ static int handle_response_icmp(int af, struct sk_buff *skb,
 		ip_vs_notrack(skb);
 	else
 		ip_vs_update_conntrack(skb, cp, 0);
+
+ignore_cp:
 	verdict = NF_ACCEPT;
 
 out:
@@ -1385,8 +1385,11 @@ ip_vs_out(struct netns_ipvs *ipvs, unsigned int hooknum, struct sk_buff *skb, in
 	 */
 	cp = pp->conn_out_get(ipvs, af, skb, &iph);
 
-	if (likely(cp))
+	if (likely(cp)) {
+		if (IP_VS_FWD_METHOD(cp) != IP_VS_CONN_F_MASQ)
+			goto ignore_cp;
 		return handle_response(af, skb, pd, cp, &iph, hooknum);
+	}
 
 	/* Check for real-server-started requests */
 	if (atomic_read(&ipvs->conn_out_counter)) {
@@ -1444,9 +1447,15 @@ ip_vs_out(struct netns_ipvs