
Re: [PATCH] ipvs: Avoid unnecessary calls to skb_is_gso_sctp

2024-05-27 Thread Julian Anastasov


Hello,

On Thu, 23 May 2024, Ismael Luceno wrote:

> In the context of the SCTP SNAT/DNAT handler, these calls can only
> return true.
> 
> Ref: e10d3ba4d434 ("ipvs: Fix checksumming on GSO of SCTP packets")

checkpatch.pl prefers to see the "commit" word:

Ref: commit e10d3ba4d434 ("ipvs: Fix checksumming on GSO of SCTP packets")

> Signed-off-by: Ismael Luceno 

Looks good to me for nf-next, thanks!

Acked-by: Julian Anastasov 

> CC: Pablo Neira Ayuso 
> CC: Michal Kubeček 
> CC: Simon Horman 
> CC: Julian Anastasov 
> CC: lvs-de...@vger.kernel.org
> CC: netfilter-de...@vger.kernel.org
> CC: net...@vger.kernel.org
> CC: coret...@netfilter.org
> ---
>  net/netfilter/ipvs/ip_vs_proto_sctp.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c 
> b/net/netfilter/ipvs/ip_vs_proto_sctp.c
> index 1e689c714127..83e452916403 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_sctp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c
> @@ -126,7 +126,7 @@ sctp_snat_handler(struct sk_buff *skb, struct 
> ip_vs_protocol *pp,
>   if (sctph->source != cp->vport || payload_csum ||
>   skb->ip_summed == CHECKSUM_PARTIAL) {
>   sctph->source = cp->vport;
> - if (!skb_is_gso(skb) || !skb_is_gso_sctp(skb))
> + if (!skb_is_gso(skb))
>   sctp_nat_csum(skb, sctph, sctphoff);
>   } else {
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
> @@ -175,7 +175,7 @@ sctp_dnat_handler(struct sk_buff *skb, struct 
> ip_vs_protocol *pp,
>   (skb->ip_summed == CHECKSUM_PARTIAL &&
>!(skb_dst(skb)->dev->features & NETIF_F_SCTP_CRC))) {
>   sctph->dest = cp->dport;
> - if (!skb_is_gso(skb) || !skb_is_gso_sctp(skb))
> + if (!skb_is_gso(skb))
>       sctp_nat_csum(skb, sctph, sctphoff);
>   } else if (skb->ip_summed != CHECKSUM_PARTIAL) {
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
> -- 
> 2.44.0

Regards

--
Julian Anastasov

Re: [PATCH v4 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-05-06 Thread Julian Anastasov



Hello,

On Mon, 6 May 2024, Alexander Mikhalitsyn wrote:

> Let's make all IPVS sysctls writable even when
> network namespace is owned by non-initial user namespace.
> 
> Let's make a few sysctls to be read-only for non-privileged users:
> - sync_qlen_max
> - sync_sock_size
> - run_estimation
> - est_cpulist
> - est_nice
> 
> I'm trying to be conservative with this to prevent
> introducing any security issues in there. Maybe,
> we can allow more sysctls to be writable, but let's
> do this on-demand and when we see real use-case.
> 
> This patch is motivated by user request in the LXC
> project [1]. Having this can help with running some
> Kubernetes [2] or Docker Swarm [3] workloads inside the system
> containers.
> 
> Link: https://github.com/lxc/lxc/issues/4278 [1]
> Link: https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103 [2]
> Link: https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682 [3]
> 
> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Signed-off-by: Alexander Mikhalitsyn 

Looks good to me for net-next, thanks!

Acked-by: Julian Anastasov 

> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index e122fa367b81..b6d0dcf3a5c3 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -4269,6 +4269,7 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>   struct ctl_table *tbl;
>   int idx, ret;
>   size_t ctl_table_size = ARRAY_SIZE(vs_vars);
> + bool unpriv = net->user_ns != &init_user_ns;
>  
>   atomic_set(&ipvs->dropentry, 0);
>   spin_lock_init(&ipvs->dropentry_lock);
> @@ -4283,10 +4284,6 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>   tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
>   if (tbl == NULL)
>   return -ENOMEM;
> -
> - /* Don't export sysctls to unprivileged users */
> - if (net->user_ns != &init_user_ns)
> - ctl_table_size = 0;
>   } else
>   tbl = vs_vars;
>   /* Initialize sysctl defaults */
> @@ -4312,10 +4309,17 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>   ipvs->sysctl_sync_ports = 1;
>   tbl[idx++].data = &ipvs->sysctl_sync_ports;
>   tbl[idx++].data = &ipvs->sysctl_sync_persist_mode;
> +
>   ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
> +
>   ipvs->sysctl_sync_sock_size = 0;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
> +
>   tbl[idx++].data = &ipvs->sysctl_cache_bypass;
>   tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
>   tbl[idx++].data = &ipvs->sysctl_sloppy_tcp;
> @@ -4338,15 +4342,22 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>   tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
>   tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
>   tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
> +
>   ipvs->sysctl_run_estimation = 1;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx].extra2 = ipvs;
>   tbl[idx++].data = &ipvs->sysctl_run_estimation;
>  
>   ipvs->est_cpulist_valid = 0;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx].extra2 = ipvs;
>   tbl[idx++].data = &ipvs->sysctl_est_cpulist;
>  
>   ipvs->sysctl_est_nice = IPVS_EST_NICE;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx].extra2 = ipvs;
>   tbl[idx++].data = &ipvs->sysctl_est_nice;
>  
> -- 
> 2.34.1

Regards

--
Julian Anastasov

Re: [PATCH v4 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-05-06 Thread Julian Anastasov



Hello,

On Mon, 6 May 2024, Alexander Mikhalitsyn wrote:

> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Suggested-by: Julian Anastasov 
> Signed-off-by: Alexander Mikhalitsyn 

Looks good to me for net-next, thanks!

Acked-by: Julian Anastasov 

> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 50b5dbe40eb8..e122fa367b81 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -94,6 +94,7 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>  {
>   struct sysinfo i;
>   int availmem;
> + int amemthresh;
>   int nomem;
>   int to_change = -1;
>  
> @@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   /* si_swapinfo(); */
>   /* availmem = availmem - (i.totalswap - i.freeswap); */
>  
> - nomem = (availmem < ipvs->sysctl_amemthresh);
> + amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
> + nomem = (availmem < amemthresh);
>  
>   local_bh_disable();
>  
> @@ -145,9 +147,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   break;
>   case 1:
>   if (nomem) {
> - ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + ipvs->drop_counter = amemthresh / (amemthresh - availmem);
> + ipvs->drop_rate = ipvs->drop_counter;
>   ipvs->sysctl_drop_packet = 2;
>   } else {
>   ipvs->drop_rate = 0;
> @@ -155,9 +156,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   break;
>   case 2:
>   if (nomem) {
> - ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + ipvs->drop_counter = amemthresh / (amemthresh - availmem);
> + ipvs->drop_rate = ipvs->drop_counter;
>   } else {
>   ipvs->drop_rate = 0;
>   ipvs->sysctl_drop_packet = 1;
> -- 
> 2.34.1

Regards

--
Julian Anastasov

Re: [PATCH net-next v3 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-05-03 Thread Julian Anastasov


Hello,

On Thu, 18 Apr 2024, Alexander Mikhalitsyn wrote:

> Let's make all IPVS sysctls writable even when
> network namespace is owned by non-initial user namespace.
> 
> Let's make a few sysctls to be read-only for non-privileged users:
> - sync_qlen_max
> - sync_sock_size
> - run_estimation
> - est_cpulist
> - est_nice
> 
> I'm trying to be conservative with this to prevent
> introducing any security issues in there. Maybe,
> we can allow more sysctls to be writable, but let's
> do this on-demand and when we see real use-case.
> 
> This patch is motivated by user request in the LXC
> project [1]. Having this can help with running some
> Kubernetes [2] or Docker Swarm [3] workloads inside the system
> containers.
> 
> Link: https://github.com/lxc/lxc/issues/4278 [1]
> Link: https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103 [2]
> Link: https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682 [3]
> 
> Cc: Stéphane Graber 
> Cc: Christian Brauner 
> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Signed-off-by: Alexander Mikhalitsyn 
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 21 +++--
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 32be24f0d4e4..c3ba71aa2654 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c

...

> @@ -4284,12 +4285,6 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>   tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
>   if (tbl == NULL)
>   return -ENOMEM;
> -
> - /* Don't export sysctls to unprivileged users */
> - if (net->user_ns != &init_user_ns) {
> - tbl[0].procname = NULL;
> - ctl_table_size = 0;
> - }
>   } else
>   tbl = vs_vars;
>   /* Initialize sysctl defaults */

Sorry, but you have to send v4 because the if-block above was
changed by net-next commit 635470eb0aa7 from today...

Regards

--
Julian Anastasov

Re: [PATCH v3] ipvs: Fix checksumming on GSO of SCTP packets

2024-04-25 Thread Julian Anastasov


Hello,

On Thu, 25 Apr 2024, Ismael Luceno wrote:

> It was observed in the wild that pairs of consecutive packets would leave
> the IPVS with the same wrong checksum, and the issue only went away when
> disabling GSO.
> 
> IPVS needs to avoid computing the SCTP checksum when using GSO.
> 
> Fixes: 90017accff61 ("sctp: Add GSO support", 2016-06-02)
> Co-developed-by: Firo Yang 
> Signed-off-by: Ismael Luceno 
> Tested-by: Andreas Taschner 
> CC: Michal Kubeček 
> CC: Simon Horman 
> CC: Julian Anastasov 
> CC: lvs-de...@vger.kernel.org
> CC: netfilter-de...@vger.kernel.org
> CC: net...@vger.kernel.org
> CC: coret...@netfilter.org
> ---
> 
> Notes:
> Changes since v2:
> * Use only skb_is_gso, no need to check for GSO type

v2 is already applied. I acked it because sctp_gso_segment()
checks for skb_is_gso_sctp(). If v3 is just an optimization, is it
better to live with v2? Is it possible to see skb_is_gso() but
not skb_is_gso_sctp() while working with an SCTP packet?
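
To make the question concrete, here is a small userspace model of the
two conditions. It is only a sketch, not kernel code: it assumes that
skb_is_gso() tests gso_size and skb_is_gso_sctp() tests the SCTP bit in
gso_type, and the struct, the flag value and all function names are
made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the GSO fields kept per packet. */
struct pkt_model {
	unsigned int gso_size;	/* non-zero means "GSO packet" */
	unsigned int gso_type;	/* bit mask of GSO types */
};

#define MODEL_GSO_SCTP 0x1u	/* illustrative value, not the kernel's */

static bool model_is_gso(const struct pkt_model *p)
{
	return p->gso_size != 0;
}

static bool model_is_gso_sctp(const struct pkt_model *p)
{
	return (p->gso_type & MODEL_GSO_SCTP) != 0;
}

/* v2 condition: recompute the checksum unless this is an SCTP GSO packet. */
static bool needs_csum_v2(const struct pkt_model *p)
{
	return !model_is_gso(p) || !model_is_gso_sctp(p);
}

/* v3 condition: recompute the checksum unless this is any GSO packet. */
static bool needs_csum_v3(const struct pkt_model *p)
{
	return !model_is_gso(p);
}
```

In this model the two conditions differ only for a packet that has
gso_size set but lacks the SCTP type bit. If such a packet can never
reach the SCTP handlers, v2 and v3 behave the same.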

> Changes since v1:
> * Added skb_is_gso before skb_is_gso_sctp.
> * Added "Fixes" tag.
> 
>  net/netfilter/ipvs/ip_vs_proto_sctp.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c 
> b/net/netfilter/ipvs/ip_vs_proto_sctp.c
> index a0921adc31a9..83e452916403 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_sctp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c
> @@ -126,7 +126,8 @@ sctp_snat_handler(struct sk_buff *skb, struct 
> ip_vs_protocol *pp,
>   if (sctph->source != cp->vport || payload_csum ||
>   skb->ip_summed == CHECKSUM_PARTIAL) {
>   sctph->source = cp->vport;
> - sctp_nat_csum(skb, sctph, sctphoff);
> + if (!skb_is_gso(skb))
> + sctp_nat_csum(skb, sctph, sctphoff);
>   } else {
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
>   }
> @@ -174,7 +175,8 @@ sctp_dnat_handler(struct sk_buff *skb, struct 
> ip_vs_protocol *pp,
>   (skb->ip_summed == CHECKSUM_PARTIAL &&
>!(skb_dst(skb)->dev->features & NETIF_F_SCTP_CRC))) {
>   sctph->dest = cp->dport;
> - sctp_nat_csum(skb, sctph, sctphoff);
> + if (!skb_is_gso(skb))
> + sctp_nat_csum(skb, sctph, sctphoff);
>   } else if (skb->ip_summed != CHECKSUM_PARTIAL) {
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
>   }
> -- 
> 2.43.0

Regards

--
Julian Anastasov

Re: [PATCH v2] ipvs: Fix checksumming on GSO of SCTP packets

2024-04-22 Thread Julian Anastasov


Hello,

On Sun, 21 Apr 2024, Ismael Luceno wrote:

> It was observed in the wild that pairs of consecutive packets would leave
> the IPVS with the same wrong checksum, and the issue only went away when
> disabling GSO.
> 
> IPVS needs to avoid computing the SCTP checksum when using GSO.
> 
> Fixes: 90017accff61 ("sctp: Add GSO support", 2016-06-02)
> Co-developed-by: Firo Yang 
> Signed-off-by: Ismael Luceno 
> Tested-by: Andreas Taschner 
> CC: Michal Kubeček 
> CC: Simon Horman 
> CC: Julian Anastasov 
> CC: lvs-de...@vger.kernel.org
> CC: netfilter-de...@vger.kernel.org
> CC: net...@vger.kernel.org
> CC: coret...@netfilter.org

Looks good to me, thanks!

Acked-by: Julian Anastasov 

As scripts/checkpatch.pl --strict /tmp/file.patch complains
about the Co-developed-by and Signed-off-by lines, you may want to
send v3...

> ---
> 
> Notes:
> Changes since v1:
> * Added skb_is_gso before skb_is_gso_sctp.
> * Added "Fixes" tag.
> 
>  net/netfilter/ipvs/ip_vs_proto_sctp.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c 
> b/net/netfilter/ipvs/ip_vs_proto_sctp.c
> index a0921adc31a9..1e689c714127 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_sctp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c
> @@ -126,7 +126,8 @@ sctp_snat_handler(struct sk_buff *skb, struct 
> ip_vs_protocol *pp,
>   if (sctph->source != cp->vport || payload_csum ||
>   skb->ip_summed == CHECKSUM_PARTIAL) {
>   sctph->source = cp->vport;
> - sctp_nat_csum(skb, sctph, sctphoff);
> + if (!skb_is_gso(skb) || !skb_is_gso_sctp(skb))
> + sctp_nat_csum(skb, sctph, sctphoff);
>   } else {
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
>   }
> @@ -174,7 +175,8 @@ sctp_dnat_handler(struct sk_buff *skb, struct 
> ip_vs_protocol *pp,
>   (skb->ip_summed == CHECKSUM_PARTIAL &&
>!(skb_dst(skb)->dev->features & NETIF_F_SCTP_CRC))) {
>   sctph->dest = cp->dport;
> - sctp_nat_csum(skb, sctph, sctphoff);
> + if (!skb_is_gso(skb) || !skb_is_gso_sctp(skb))
> +     sctp_nat_csum(skb, sctph, sctphoff);
>   } else if (skb->ip_summed != CHECKSUM_PARTIAL) {
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
>   }
> -- 
> 2.43.0

Regards

--
Julian Anastasov

Re: [PATCH] ipvs: Fix checksumming on GSO of SCTP packets

2024-04-21 Thread Julian Anastasov



Hello,

On Sun, 21 Apr 2024, Ismael Luceno wrote:

> On 21/Apr/2024 14:01, Julian Anastasov wrote:
> 
> > I'm guessing what the Fixes line should be, maybe:
> > 
> > Fixes: 90017accff61 ("sctp: Add GSO support")
> 
> This seems like the right one.
> 
> > because SCTP GSO was added after the IPVS code? Or the
> > more recent commit d02f51cbcf12, which adds skb_is_gso_sctp()?
> 
> That doesn't seem related at all.
> 
> Do we need to check .gso_type in cases like this?

Just skb_is_gso(skb)? IMHO, this should work.

Regards

--
Julian Anastasov

Re: [PATCH net-next v3 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-04-21 Thread Julian Anastasov


Hello,

On Thu, 18 Apr 2024, Alexander Mikhalitsyn wrote:

> Let's make all IPVS sysctls writable even when
> network namespace is owned by non-initial user namespace.
> 
> Let's make a few sysctls to be read-only for non-privileged users:
> - sync_qlen_max
> - sync_sock_size
> - run_estimation
> - est_cpulist
> - est_nice
> 
> I'm trying to be conservative with this to prevent
> introducing any security issues in there. Maybe,
> we can allow more sysctls to be writable, but let's
> do this on-demand and when we see real use-case.
> 
> This patch is motivated by user request in the LXC
> project [1]. Having this can help with running some
> Kubernetes [2] or Docker Swarm [3] workloads inside the system
> containers.
> 
> Link: https://github.com/lxc/lxc/issues/4278 [1]
> Link: https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103 [2]
> Link: https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682 [3]
> 
> Cc: Stéphane Graber 
> Cc: Christian Brauner 
> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Signed-off-by: Alexander Mikhalitsyn 

Looks good to me, thanks!

Acked-by: Julian Anastasov 

> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 21 +++--
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 32be24f0d4e4..c3ba71aa2654 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -4270,6 +4270,7 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>   struct ctl_table *tbl;
>   int idx, ret;
>   size_t ctl_table_size = ARRAY_SIZE(vs_vars);
> + bool unpriv = net->user_ns != &init_user_ns;
>  
>   atomic_set(&ipvs->dropentry, 0);
>   spin_lock_init(&ipvs->dropentry_lock);
> @@ -4284,12 +4285,6 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>   tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
>   if (tbl == NULL)
>   return -ENOMEM;
> -
> - /* Don't export sysctls to unprivileged users */
> - if (net->user_ns != &init_user_ns) {
> - tbl[0].procname = NULL;
> - ctl_table_size = 0;
> - }
>   } else
>   tbl = vs_vars;
>   /* Initialize sysctl defaults */
> @@ -4315,10 +4310,17 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>   ipvs->sysctl_sync_ports = 1;
>   tbl[idx++].data = &ipvs->sysctl_sync_ports;
>   tbl[idx++].data = &ipvs->sysctl_sync_persist_mode;
> +
>   ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
> +
>   ipvs->sysctl_sync_sock_size = 0;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
> +
>   tbl[idx++].data = &ipvs->sysctl_cache_bypass;
>   tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
>   tbl[idx++].data = &ipvs->sysctl_sloppy_tcp;
> @@ -4341,15 +4343,22 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
>   tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
>   tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
>   tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
> +
>   ipvs->sysctl_run_estimation = 1;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx].extra2 = ipvs;
>   tbl[idx++].data = &ipvs->sysctl_run_estimation;
>  
>   ipvs->est_cpulist_valid = 0;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx].extra2 = ipvs;
>   tbl[idx++].data = &ipvs->sysctl_est_cpulist;
>  
>   ipvs->sysctl_est_nice = IPVS_EST_NICE;
> + if (unpriv)
> + tbl[idx].mode = 0444;
>   tbl[idx].extra2 = ipvs;
>   tbl[idx++].data = &ipvs->sysctl_est_nice;
>  
> -- 
> 2.34.1
> 
> 

Regards

--
Julian Anastasov

Re: [PATCH net-next v3 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-21 Thread Julian Anastasov



Hello,

On Thu, 18 Apr 2024, Alexander Mikhalitsyn wrote:

> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Suggested-by: Julian Anastasov 
> Signed-off-by: Alexander Mikhalitsyn 

Looks good to me, thanks!

Acked-by: Julian Anastasov 

> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..32be24f0d4e4 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -94,6 +94,7 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>  {
>   struct sysinfo i;
>   int availmem;
> + int amemthresh;
>   int nomem;
>   int to_change = -1;
>  
> @@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   /* si_swapinfo(); */
>   /* availmem = availmem - (i.totalswap - i.freeswap); */
>  
> - nomem = (availmem < ipvs->sysctl_amemthresh);
> + amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
> + nomem = (availmem < amemthresh);
>  
>   local_bh_disable();
>  
> @@ -145,9 +147,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   break;
>   case 1:
>   if (nomem) {
> - ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + ipvs->drop_counter = amemthresh / (amemthresh - availmem);
> + ipvs->drop_rate = ipvs->drop_counter;
>   ipvs->sysctl_drop_packet = 2;
>   } else {
>   ipvs->drop_rate = 0;
> @@ -155,9 +156,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   break;
>   case 2:
>   if (nomem) {
> - ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + ipvs->drop_counter = amemthresh / (amemthresh - availmem);
> + ipvs->drop_rate = ipvs->drop_counter;
>   } else {
>   ipvs->drop_rate = 0;
>   ipvs->sysctl_drop_packet = 1;
> -- 
> 2.34.1

Regards

--
Julian Anastasov

Re: [PATCH] ipvs: Fix checksumming on GSO of SCTP packets

2024-04-21 Thread Julian Anastasov


Hello,

On Thu, 18 Apr 2024, Ismael Luceno wrote:

> It was observed in the wild that pairs of consecutive packets would leave
> the IPVS with the same wrong checksum, and the issue only went away when
> disabling GSO.
> 
> IPVS needs to avoid computing the SCTP checksum when using GSO.
> 
> Co-developed-by: Firo Yang 
> Signed-off-by: Ismael Luceno 
> Tested-by: Andreas Taschner 
> CC: Michal Kubeček 
> CC: Simon Horman 
> CC: Julian Anastasov 
> CC: lvs-de...@vger.kernel.org
> CC: netfilter-de...@vger.kernel.org
> CC: net...@vger.kernel.org
> CC: coret...@netfilter.org

Thanks for the fix, I'll accept this, but skb_is_gso_sctp()
has a comment stating a precondition: skb_is_gso(skb). Can you send v2
with it?

I'm guessing what the Fixes line should be, maybe:

Fixes: 90017accff61 ("sctp: Add GSO support")

because SCTP GSO was added after the IPVS code? Or the
more recent commit d02f51cbcf12, which adds skb_is_gso_sctp()?

> ---
>  net/netfilter/ipvs/ip_vs_proto_sctp.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_proto_sctp.c 
> b/net/netfilter/ipvs/ip_vs_proto_sctp.c
> index a0921adc31a9..3205b45ce161 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_sctp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_sctp.c
> @@ -126,7 +126,8 @@ sctp_snat_handler(struct sk_buff *skb, struct 
> ip_vs_protocol *pp,
>   if (sctph->source != cp->vport || payload_csum ||
>   skb->ip_summed == CHECKSUM_PARTIAL) {
>   sctph->source = cp->vport;
> - sctp_nat_csum(skb, sctph, sctphoff);
> + if (!skb_is_gso_sctp(skb))
> + sctp_nat_csum(skb, sctph, sctphoff);
>   } else {
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
>   }
> @@ -174,7 +175,8 @@ sctp_dnat_handler(struct sk_buff *skb, struct 
> ip_vs_protocol *pp,
>   (skb->ip_summed == CHECKSUM_PARTIAL &&
>!(skb_dst(skb)->dev->features & NETIF_F_SCTP_CRC))) {
>   sctph->dest = cp->dport;
> - sctp_nat_csum(skb, sctph, sctphoff);
> + if (!skb_is_gso_sctp(skb))
> +     sctp_nat_csum(skb, sctph, sctphoff);
>   } else if (skb->ip_summed != CHECKSUM_PARTIAL) {
>   skb->ip_summed = CHECKSUM_UNNECESSARY;
>   }

Regards

--
Julian Anastasov

Re: [PATCH net-next v2 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Julian Anastasov



Hello,

On Thu, 18 Apr 2024, Alexander Mikhalitsyn wrote:

> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Suggested-by: Julian Anastasov 
> Signed-off-by: Alexander Mikhalitsyn 
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..daa62b8b2dd1 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c

> @@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   /* si_swapinfo(); */
>   /* availmem = availmem - (i.totalswap - i.freeswap); */
>  
> - nomem = (availmem < ipvs->sysctl_amemthresh);
> + amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
> + nomem = (availmem < amemthresh);
>  
>   local_bh_disable();
>  
> @@ -146,8 +148,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   case 1:
>   if (nomem) {
>   ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + = amemthresh /
> + (amemthresh-availmem);

Thanks, both patches look ok except that the old styling
is showing warnings for this patch:

scripts/checkpatch.pl --strict /tmp/file1.patch

It would be great if you silence them somehow in v3...

BTW, est_cpulist is masked with current->cpus_mask of the
sysctl writer process, if that is of any help. That is why I skipped
it, but let's keep it read-only for now...

>   ipvs->sysctl_drop_packet = 2;
>   } else {
>   ipvs->drop_rate = 0;
> @@ -156,8 +158,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
>   case 2:
>   if (nomem) {
>   ipvs->drop_rate = ipvs->drop_counter
> - = ipvs->sysctl_amemthresh /
> - (ipvs->sysctl_amemthresh-availmem);
> + = amemthresh /
> +     (amemthresh-availmem);
>   } else {
>   ipvs->drop_rate = 0;
>   ipvs->sysctl_drop_packet = 1;

Regards

--
Julian Anastasov

Re: [PATCH net-next] ipvs: allow some sysctls in non-init user namespaces

2024-04-17 Thread Julian Anastasov


Hello,

On Tue, 16 Apr 2024, Alexander Mikhalitsyn wrote:

> Let's make all IPVS sysctls visible and RO even when
> network namespace is owned by non-initial user namespace.
> 
> Let's make a few sysctls to be writable:
> - conntrack
> - conn_reuse_mode
> - expire_nodest_conn
> - expire_quiescent_template
> 
> I'm trying to be conservative with this to prevent
> introducing any security issues in there. Maybe,
> we can allow more sysctls to be writable, but let's
> do this on-demand and when we see real use-case.
> 
> This list of sysctls was chosen because I can't
> see any security risks allowing them and also
> Kubernetes uses [2] these specific sysctls.
> 
> This patch is motivated by user request in the LXC
> project [1].
> 
> [1] https://github.com/lxc/lxc/issues/4278
> [2] https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
> 
> Cc: Stéphane Graber 
> Cc: Christian Brauner 
> Cc: Julian Anastasov 
> Cc: Simon Horman 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Signed-off-by: Alexander Mikhalitsyn 
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..92a818c2f783 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -4285,10 +4285,22 @@ static int __net_init 
> ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)

As the list of privileged vars is short, I prefer
to use a bool and to make only some vars read-only:

bool unpriv = false;

>   if (tbl == NULL)
>   return -ENOMEM;
>  
> - /* Don't export sysctls to unprivileged users */
> + /* Let's show all sysctls in non-init user namespace-owned
> +  * net namespaces, but make them read-only.
> +  *
> +  * Allow only a few specific sysctls to be writable.
> +  */
>   if (net->user_ns != &init_user_ns) {

Here we should just set: unpriv = true;

> - tbl[0].procname = NULL;
> - ctl_table_size = 0;
> + for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
> + if (!tbl[idx].procname)
> + continue;
> +
> + if (!((strcmp(tbl[idx].procname, "conntrack") == 0) ||
> +   (strcmp(tbl[idx].procname, "conn_reuse_mode") == 0) ||
> +   (strcmp(tbl[idx].procname, "expire_nodest_conn") == 0) ||
> +   (strcmp(tbl[idx].procname, "expire_quiescent_template") == 0)))
> + tbl[idx].mode = 0444;
> + }
>   }
>   } else
>   tbl = vs_vars;

And below, at each relevant place, use:

if (unpriv)
tbl[idx].mode = 0444;

for the following 4 privileged sysctl vars:

- sync_qlen_max:
  - allocates messages in kernel context
  - this needs better tuning in another patch

- sync_sock_size:
  - allocates messages in kernel context

- run_estimation:
  - for now, better to let init ns decide whether to use est stats

- est_nice:
  - for now, better to let init ns decide the value

- debug_level:
  - already set to 0444

I.e. these vars allocate resources (mem, CPU) without
proper control, so for now we will just copy them from init ns
without allowing writing. And they are vars that are not tuned
often. Also we do not know which netns is supposed to be the
privileged one, some solutions move all devices out of init_net,
so we can not decide where to use lower limits.
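
As a rough userspace illustration of the "if (unpriv)" pattern
suggested above (a sketch only): the struct below is a trimmed-down,
hypothetical stand-in for struct ctl_table, the table holds just four
of the vars, and init_sysctl_modes() is an invented name. Entries start
out writable and only the privileged ones are downgraded to 0444 when
the netns is owned by a non-initial user namespace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for struct ctl_table. */
struct ctl_entry {
	const char *procname;
	unsigned int mode;
};

#define N_VARS 4

/* Template: every entry starts out writable (0644). */
static const struct ctl_entry vs_vars_model[N_VARS] = {
	{ "sync_ports",     0644 },
	{ "sync_qlen_max",  0644 },
	{ "sync_sock_size", 0644 },
	{ "cache_bypass",   0644 },
};

/* Copy the template, then walk it in order and make privileged vars read-only. */
static void init_sysctl_modes(struct ctl_entry *tbl, bool unpriv)
{
	size_t idx;

	for (idx = 0; idx < N_VARS; idx++)
		tbl[idx] = vs_vars_model[idx];

	idx = 0;
	idx++;				/* sync_ports: stays writable */

	if (unpriv)
		tbl[idx].mode = 0444;	/* sync_qlen_max */
	idx++;

	if (unpriv)
		tbl[idx].mode = 0444;	/* sync_sock_size */

	/* cache_bypass and anything after it stay writable */
}
```

Compared with matching names by strcmp() in a loop, this keeps the
mode change next to the place where each var is initialized.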

OTOH, "amemthresh" is not privileged but needs a single READ_ONCE
for sysctl_amemthresh in update_defense_level() due to the possible
div by zero if we allow writing to anyone, e.g.:

int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
...
nomem = availmem < amemthresh;
... use only amemthresh

All other vars can be writable.
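
A self-contained userspace sketch of the amemthresh idea follows. It is
not the kernel function: READ_ONCE_INT(), struct ipvs_model and
update_drop_rate() are hypothetical names, and the macro only imitates
READ_ONCE() with a single volatile load. The point is that the
threshold is read once into a local, so the comparison and the division
use the same value even if a writer changes the sysctl in between.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's READ_ONCE(): one volatile load. */
#define READ_ONCE_INT(x) (*(const volatile int *)&(x))

struct ipvs_model {
	int sysctl_amemthresh;	/* may be rewritten at any time via sysctl */
	int drop_rate;
	int drop_counter;
};

static int max_int(int a, int b)
{
	return a > b ? a : b;
}

/* Returns 1 when available memory is below the threshold, else 0. */
static int update_drop_rate(struct ipvs_model *ipvs, int availmem)
{
	/* Snapshot once and clamp negative values to 0. */
	int amemthresh = max_int(READ_ONCE_INT(ipvs->sysctl_amemthresh), 0);
	int nomem = availmem < amemthresh;

	if (nomem) {
		/* nomem implies amemthresh - availmem > 0 for the snapshot. */
		assert(amemthresh - availmem > 0);
		ipvs->drop_counter = amemthresh / (amemthresh - availmem);
		ipvs->drop_rate = ipvs->drop_counter;
	} else {
		ipvs->drop_rate = 0;
	}
	return nomem;
}
```

If the field were read again for the division, a concurrent write
between the two reads could make the divisor zero even though the
nomem test had passed.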

Regards

--
Julian Anastasov

Re: [PATCH] ipvs: allow netlink configuration from non-initial user namespace

2024-03-08 Thread Julian Anastasov

Hello,

On Thu, 7 Mar 2024, Michael Weiß wrote:

> Configuring ipvs in a non-initial user namespace using the genl
> netlink interface, e.g., by 'ipvsadm' is currently resulting in an
> '-EPERM'. This is due to the use of GENL_ADMIN_PERM flag in
> 'ip_vs_ctl.c'.
> 
> Similarly to other genl interfaces, we switch to the use of
> GENL_UNS_ADMIN_PERM flag which allows connection from non-initial
> user namespace. Thus, it would be feasible to configure ipvs using
> the genl interface also from within an unprivileged system container.
> 
> Since adding of new services and new dests are triggered from
> userspace, accounting for the corresponding memory allocations in
> ip_vs_new_dest() and ip_vs_add_service() is activated.
> 
> We tested this by simply running some samples from "man ipvsadm"
> within an unprivileged user namespaced system container in GyroidOS.
> Further, we successfully passed an adapted version of the ipvs
> selftest in 'tools/testing/selftests/netfilter/ipvs.sh' using
> preliminary created network namespaces from unprivileged GyroidOS
> containers.

I planned such a change, but as a follow-up patchset to other
work which converts many structures to be per-netns.

There is an RFC v2 patchset for reference:

https://archive.linuxvirtualserver.org/html/lvs-devel/2023-12/index.html

My goal was to isolate the different namespaces as much as
possible: different structures, different kthreads, etc. with the
goal to reduce the security risks of giving power to unprivileged roots.
Such isolation should help when namespaces are served from different CPUs.

Maybe I should push a fresh v3 soon, so that we can later use
GFP_KERNEL_ACCOUNT not only for services and dests but also
for allocations by schedulers, estimators, etc. The access to
sysctl vars should be enabled too, around the comment
"Don't export sysctls to unprivileged users", along with
alloc_percpu => alloc_percpu_gfp(,GFP_KERNEL_ACCOUNT) and
SLAB_ACCOUNT for kmem_cache_create. I'm not sure about __GFP_NOWARN and
__GFP_NORETRY usage either.

Not sure about the sysctl vars: now they are cloned from
init_net, do we give full access for writing, some can be privileged,
etc.

I didn't push such changes yet because I'm not sure what
is needed: it looks like, for now, what was needed is root from init_net to
control rules in different netns, and there was no demand from the
virtualization world to extend this. If we can clearly define what is
good and what is bad from a security perspective, we can go with such
changes after pushing the above patchset, i.e. the GENL_UNS_ADMIN_PERM
change should follow all other changes.

> Signed-off-by: Michael Weiß 
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 36 ++++++++++++++++++------------------
>  1 file changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 143a341bbc0a..d39120c64207 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -1080,7 +1080,7 @@ ip_vs_new_dest(struct ip_vs_service *svc, struct 
> ip_vs_dest_user_kern *udest)
>   return -EINVAL;
>   }
>  
> - dest = kzalloc(sizeof(struct ip_vs_dest), GFP_KERNEL);
> + dest = kzalloc(sizeof(struct ip_vs_dest), GFP_KERNEL_ACCOUNT);
>   if (dest == NULL)
>   return -ENOMEM;
>  
> @@ -1421,7 +1421,7 @@ ip_vs_add_service(struct netns_ipvs *ipvs, struct 
> ip_vs_service_user_kern *u,
>   ret_hooks = ret;
>   }
>  
> - svc = kzalloc(sizeof(struct ip_vs_service), GFP_KERNEL);
> + svc = kzalloc(sizeof(struct ip_vs_service), GFP_KERNEL_ACCOUNT);
>   if (svc == NULL) {
>   IP_VS_DBG(1, "%s(): no memory\n", __func__);
>   ret = -ENOMEM;
> @@ -4139,98 +4139,98 @@ static const struct genl_small_ops ip_vs_genl_ops[] = 
> {
>   {
>   .cmd= IPVS_CMD_NEW_SERVICE,
>   .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> - .flags  = GENL_ADMIN_PERM,
> + .flags  = GENL_UNS_ADMIN_PERM,
>   .doit   = ip_vs_genl_set_cmd,
...

Regards

--
Julian Anastasov

Re: [PATCH net] net: ipvs: avoid stat macros calls from preemptible context

2024-01-16 Thread Julian Anastasov



Hello,

On Mon, 15 Jan 2024, Fedor Pchelkin wrote:

> Inside decrement_ttl() upon discovering that the packet ttl has exceeded,
> __IP_INC_STATS and __IP6_INC_STATS macros can be called from preemptible
> context having the following backtrace:
> 
> check_preemption_disabled: 48 callbacks suppressed
> BUG: using __this_cpu_add() in preemptible [] code: curl/1177
> caller is decrement_ttl+0x217/0x830
> CPU: 5 PID: 1177 Comm: curl Not tainted 6.7.0+ #34
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 04/01/2014
> Call Trace:
>  
>  dump_stack_lvl+0xbd/0xe0
>  check_preemption_disabled+0xd1/0xe0
>  decrement_ttl+0x217/0x830
>  __ip_vs_get_out_rt+0x4e0/0x1ef0
>  ip_vs_nat_xmit+0x205/0xcd0
>  ip_vs_in_hook+0x9b1/0x26a0
>  nf_hook_slow+0xc2/0x210
>  nf_hook+0x1fb/0x770
>  __ip_local_out+0x33b/0x640
>  ip_local_out+0x2a/0x490
>  __ip_queue_xmit+0x990/0x1d10
>  __tcp_transmit_skb+0x288b/0x3d10
>  tcp_connect+0x3466/0x5180
>  tcp_v4_connect+0x1535/0x1bb0
>  __inet_stream_connect+0x40d/0x1040
>  inet_stream_connect+0x57/0xa0
>  __sys_connect_file+0x162/0x1a0
>  __sys_connect+0x137/0x160
>  __x64_sys_connect+0x72/0xb0
>  do_syscall_64+0x6f/0x140
>  entry_SYSCALL_64_after_hwframe+0x6e/0x76
> RIP: 0033:0x7fe6dbbc34e0
> 
> Use the corresponding preemption-aware variants: IP_INC_STATS and
> IP6_INC_STATS.
> 
> Found by Linux Verification Center (linuxtesting.org).
> 
> Fixes: 8d8e20e2d7bb ("ipvs: Decrement ttl")
> Signed-off-by: Fedor Pchelkin 

Looks good to me, thanks!

Acked-by: Julian Anastasov 

> ---
>  net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 9193e109e6b3..65e0259178da 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -271,7 +271,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
>   skb->dev = dst->dev;
>   icmpv6_send(skb, ICMPV6_TIME_EXCEED,
>   ICMPV6_EXC_HOPLIMIT, 0);
> - __IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
> + IP6_INC_STATS(net, idev, IPSTATS_MIB_INHDRERRORS);
>  
>   return false;
>   }
> @@ -286,7 +286,7 @@ static inline bool decrement_ttl(struct netns_ipvs *ipvs,
>   {
>   if (ip_hdr(skb)->ttl <= 1) {
>   /* Tell the sender its packet died... */
> - __IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
> +     IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
>   icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
>   return false;
>   }
> -- 
> 2.43.0

Regards

--
Julian Anastasov

Re: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler

2023-12-06 Thread Julian Anastasov

Hello,

On Mon, 4 Dec 2023, Lev Pantiukhin wrote:

> +#define IP_VS_SVC_F_STATELESS0x0040  /* stateless scheduling 
> */

I have another idea for the traffic that does not
need per-client state. We need some per-dest cp to forward the packet.
If we replace the cp->caddr usage with iph->saddr/daddr usage, we can try
it. cp->caddr is used in the following places:

- tcp_snat_handler (iph->daddr), tcp_dnat_handler (iph->saddr): iph is
already provided. tcp_snat_handler requires IP_VS_SVC_F_STATELESS
to be set for a service with vaddr present, i.e. non-fwmark based.
So, NAT+svc->fwmark is another restriction for IP_VS_SVC_F_STATELESS
because we do not know what VIP to use as saddr for outgoing traffic.

- ip_vs_nfct_expect_related
- we should investigate any problems when IP_VS_CONN_F_NFCT
is set; probably we cannot work with NFCT?

- ip_vs_conn_drop_conntrack

- FTP:
- sets IP_VS_CONN_F_NFCT, uses cp->app

Maybe IP_VS_CONN_F_NFCT should be a restriction for
IP_VS_SVC_F_STATELESS mode? cp->app certainly should be, because we keep
TCP seq/ack state for the app in cp->in_seq/out_seq.

We can keep some dest->cp_route (or another name) that will
hold our cp for such connections. The idea is to not allocate a cp for
every packet but to reuse this saved cp. It has all the info needed to
forward the skb to the real server. The first packet will create it and
save it, with some locking, into the dest; subsequent packets will reuse it.

Probably it should be a ONE_PACKET entry (not hashed in the table), but
it can have a running timer, if needed. One refcnt for attaching to the dest,
a new temporary refcnt for every packet. But in this mode __ip_vs_conn_put_timer
uses a 0-second timer, so we have to handle it somehow. The cp should be
released when the dest is removed and on edit_dest if needed.
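
To make the refcnt idea above concrete, here is a minimal user-space
sketch. It is only a model: struct conn, struct dest, dest_get_conn(),
conn_put() and dest_release_conn() are hypothetical names and not the
IPVS API, and C11 atomics stand in for the kernel primitives. Note that
it does not protect a reader against a concurrent release; the kernel
would need RCU or a lock for that.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for ip_vs_conn and ip_vs_dest. */
struct conn {
	atomic_int refcnt;
	int dest_id;
};

struct dest {
	int id;
	_Atomic(struct conn *) cp_route;	/* cached conn, NULL until first packet */
};

/* Return the cached conn with one temporary reference, creating it on first use. */
static struct conn *dest_get_conn(struct dest *d)
{
	struct conn *cp = atomic_load(&d->cp_route);

	if (!cp) {
		struct conn *expected = NULL;
		struct conn *fresh = calloc(1, sizeof(*fresh));

		if (!fresh)
			return NULL;
		atomic_init(&fresh->refcnt, 1);	/* reference owned by the dest */
		fresh->dest_id = d->id;
		if (atomic_compare_exchange_strong(&d->cp_route, &expected, fresh)) {
			cp = fresh;
		} else {
			free(fresh);		/* somebody else attached one first */
			cp = expected;
		}
	}
	atomic_fetch_add(&cp->refcnt, 1);	/* temporary reference for this packet */
	return cp;
}

/* Drop one reference; the last one frees the conn. */
static void conn_put(struct conn *cp)
{
	if (atomic_fetch_sub(&cp->refcnt, 1) == 1)
		free(cp);
}

/* Detach the cached conn, e.g. when the dest is removed or edited. */
static void dest_release_conn(struct dest *d)
{
	struct conn *cp = atomic_exchange(&d->cp_route, NULL);

	if (cp)
		conn_put(cp);
}
```

Each packet would pair dest_get_conn() with conn_put(), while
dest_release_conn() drops the reference held by the dest.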

There are other problems to solve, such as set_tcp_state()
changing dest->activeconns and dest->inactconns. They are used also
in ip_vs_bind_dest(), ip_vs_unbind_dest(). As we do not keep previous
connection state and as conn can start in established state, we should
avoid touching these counters. For UDP ONE_PACKET has no such problem
with states but for TCP/SCTP we should take care.

Regards

--
Julian Anastasov

Re: [PATCH] ipvs: add a stateless type of service and a stateless Maglev hashing scheduler

2023-12-05 Thread Julian Anastasov

dests->new_dest);
> + dests->dest = dests->new_dest;
> + RCU_INIT_POINTER(states->first->lookup[hash].dest,
> +  dests->new_dest);
> + states->timestamps[hash] = (ktime_t)0;

These operations are not SMP safe: many readers may try to
switch to the stable state at the same time. Maybe some xchg operation
for timestamps[] can help. But it also races with reconfiguration,
i.e. ip_vs_mhs_update_timestamps(), ip_vs_mhs_populate(), etc.
As it is a rare condition, taking the lock with spin_lock_bh() will help instead.
You should revalidate states->timestamps[hash] under the lock.
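
A rough user-space sketch of "revalidate under lock" follows. All names
(struct slot, slot_make_stable(), the timeout parameter) are
hypothetical and do not come from the patch, and an atomic_flag spin
lock stands in for spin_lock_bh().

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one lookup slot in the unstable state. */
struct slot {
	atomic_flag lock;
	int64_t stamp;		/* 0 means the slot is already stable */
	int dest;
	int new_dest;
};

static void slot_lock(struct slot *s)
{
	while (atomic_flag_test_and_set_explicit(&s->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void slot_unlock(struct slot *s)
{
	atomic_flag_clear_explicit(&s->lock, memory_order_release);
}

/* Switch the slot to new_dest once the timeout has passed.
 * Returns true only for the caller that performed the switch.
 */
static bool slot_make_stable(struct slot *s, int64_t now, int64_t timeout)
{
	bool switched = false;

	slot_lock(s);
	/* Re-check under the lock: another CPU may have switched already. */
	if (s->stamp != 0 && now - s->stamp >= timeout) {
		s->dest = s->new_dest;
		s->stamp = 0;
		switched = true;
	}
	slot_unlock(s);
	return switched;
}
```

Because the condition is evaluated again after taking the lock, only one
of several racing callers performs the switch; the others see stamp == 0
and do nothing.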

> + }
> + /* stable */
> + dests->unstable = false;
> +}
> +
> +/* Stateless Maglev Hashing scheduling */
> +static struct ip_vs_dest *
> +ip_vs_mhs_schedule(struct ip_vs_service *svc,
> +const struct sk_buff *skb,
> +struct ip_vs_iphdr *iph,
> +bool *need_state)
> +{
> + struct ip_vs_mhs_two_dests dests;
> + struct ip_vs_dest *final_dest = NULL;
> + struct ip_vs_mhs_two_states *states = svc->sched_data;
> + __be16 port = 0;
> + const union nf_inet_addr *hash_addr;
> +
> + *need_state = false;
> + hash_addr = ip_vs_iph_inverse(iph) ? &iph->daddr : &iph->saddr;
> +
> + if (svc->flags & IP_VS_SVC_F_SCHED_MH_PORT)
> + port = ip_vs_mhs_get_port(skb, iph);
> +
> + ip_vs_mhs_get(svc, states, &dests, hash_addr, port);
> + IP_VS_DBG_BUF(6,
> +   "MHS: %s(): source IP address %s:%u --> server %s and 
> %s\n",
> +   __func__,
> +   IP_VS_DBG_ADDR(svc->af, hash_addr),
> +   ntohs(port),
> +   dests.dest
> +   ? IP_VS_DBG_ADDR(dests.dest->af, &dests.dest->addr)
> +   : "NULL",
> +   dests.new_dest
> +   ? IP_VS_DBG_ADDR(dests.new_dest->af,
> +&dests.new_dest->addr)
> +   : "NULL");
> +
> + if (!dests.dest && !dests.new_dest) {
> + /* Both dests is NULL */
> + return NULL;
> + }
> +
> + if (!(dests.dest && dests.new_dest)) {
> + /* dest is NULL or new_dest is NULL,
> +  * so we send all packets to singular available dest
> +  * and create state
> +  */
> + if (dests.new_dest) {
> + /* dest is NULL */
> + final_dest = dests.new_dest;
> + } else {
> + /* new_dest is NULL */
> + final_dest = dests.dest;

In two cases we return dests.dest without checking
for IP_VS_DEST_F_AVAILABLE; moreover, you keep the flag set after the dest is
removed, which is not nice. If we do not want to fall back, in this case
we should return NULL, e.g. for ACK. All traffic should stop if
!IP_VS_DEST_F_AVAILABLE, and if weight=0 only established connections should
work. As for IP_VS_DEST_F_OVERLOAD, if used, it should lead to allocating the
connection to a fallback server, something not suitable for every scheduler.
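
The rule above could be modelled like this. It is a sketch under
assumptions: struct dest_model and dest_usable() are hypothetical, the
flag value is repeated from the uapi header so that the snippet stands
alone, and IP_VS_DEST_F_OVERLOAD is deliberately left out because its
handling depends on the scheduler.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same value as in include/uapi/linux/ip_vs.h, repeated for a standalone sketch. */
#define IP_VS_DEST_F_AVAILABLE	0x0001

/* Hypothetical, reduced view of a destination. */
struct dest_model {
	unsigned int flags;
	int weight;
};

/* May a packet be sent to d? new_conn is true for packets that start a connection. */
static bool dest_usable(const struct dest_model *d, bool new_conn)
{
	if (d == NULL || !(d->flags & IP_VS_DEST_F_AVAILABLE))
		return false;		/* removed or missing dest: stop all traffic */
	if (d->weight <= 0)
		return !new_conn;	/* quiesced: only established connections */
	return true;
}
```

A scheduler following this rule would return NULL whenever
dest_usable() is false instead of returning the dest unconditionally.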

> + }
> + *need_state = true;
> + IP_VS_DBG(6,
> +   "MHS: %s(): One dest, need_state=%s\n",
> +   __func__,
> +   *need_state ? "true" : "false");
> + } else if (dests.unstable) {
> + /* unstable */
> + if (iph->protocol == IPPROTO_TCP) {
> + /* TCP */
> + *need_state = true;

Looks like we can use iph.hdr_flags & IP_VS_HDR_NEW_CONN instead
of ip_vs_mhs_is_new_conn. IP_VS_HDR_NEW_CONN can be set where we
call is_new_conn in ip_vs_in_hook:

	if (!iph.fragoffs && is_new_conn(skb, &iph))
		iph.hdr_flags |= IP_VS_HDR_NEW_CONN;
	if (iph.hdr_flags & IP_VS_HDR_NEW_CONN && cp) {

> + if (ip_vs_mhs_is_new_conn(skb, iph)) {
> + /* SYN packet */
> + final_dest = dests.new_dest;
> + IP_VS_DBG(6,
> +   "MHS: %s(): Unstable, need_state=%s, 
> SYN packet\n",
> +   __func__,
> +   *need_state ? "true" : "false");
> + } else {
> + /* Not SYN packet */
> +     final_dest = dests.dest;
> + IP_VS_DBG(6,
> +   "MHS: %s(): Unstable, need_state=%s, 
> not SYN packet\n",
> +   __func__,
> +   *need_state ? "true" : "false");
> + }
> + } else if (iph->protocol == IPPROTO_UDP) {
> + /* UDP */
> + final_dest = dests.new_dest;
> + IP_VS_DBG(6,
> +   "MHS: %s(): Unstable, need_state=%s, UDP 
> packet\n",
> +   __func__,
> +   *need_state ? "true" : "false");
> + }
> + } else {
> + /* stable */
> + final_dest = dests.dest;
> + IP_VS_DBG(6,
> +   "MHS: %s(): Stable, need_state=%s\n",
> +   __func__,
> +   *need_state ? "true" : "false");
> + }
> + return final_dest;
> +}
> +
> +/* IPVS MHS Scheduler structure */
> +static struct ip_vs_scheduler ip_vs_mhs_scheduler = {
> + .name ="mhs",
> + .refcnt =ATOMIC_INIT(0),
> + .module =THIS_MODULE,
> + .n_list =LIST_HEAD_INIT(ip_vs_mhs_scheduler.n_list),
> + .init_service =ip_vs_mhs_init_svc,
> + .done_service =ip_vs_mhs_done_svc,
> + .add_dest =ip_vs_mhs_dest_changed,
> + .del_dest =ip_vs_mhs_dest_changed,
> + .upd_dest =ip_vs_mhs_dest_changed,
> + .schedule_sl =ip_vs_mhs_schedule,
> +};
> +
> +static int __init
> +ip_vs_mhs_init(void)
> +{
> + return register_ip_vs_scheduler(&ip_vs_mhs_scheduler);
> +}
> +
> +static void __exit
> +ip_vs_mhs_cleanup(void)
> +{
> + unregister_ip_vs_scheduler(&ip_vs_mhs_scheduler);
> + rcu_barrier();
> +}
> +
> +module_init(ip_vs_mhs_init);
> +module_exit(ip_vs_mhs_cleanup);
> +MODULE_DESCRIPTION("Stateless Maglev hashing ipvs scheduler");
> +MODULE_LICENSE("GPL");
> +MODULE_AUTHOR("Lev Pantiukhin ");
> diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c 
> b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> index 7da51390cea6..31a8c1bfc863 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> @@ -38,7 +38,7 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct 
> sk_buff *skb,
> struct ip_vs_iphdr *iph)
>  {
>   struct ip_vs_service *svc;
> - struct tcphdr _tcph, *th;
> + struct tcphdr _tcph, *th = NULL;
>   __be16 _ports[2], *ports = NULL;
>  
>   /* In the event of icmp, we're only guaranteed to have the first 8
> @@ -47,11 +47,8 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct 
> sk_buff *skb,
>*/
>   if (likely(!ip_vs_iph_icmp(iph))) {
>   th = skb_header_pointer(skb, iph->len, sizeof(_tcph), &_tcph);
> - if (th) {
> - if (th->rst || !(sysctl_sloppy_tcp(ipvs) || th->syn))
> - return 1;
> + if (th)
>   ports = &th->source;
> - }
>   } else {
>   ports = skb_header_pointer(
>   skb, iph->len, sizeof(_ports), &_ports);
> @@ -74,6 +71,17 @@ tcp_conn_schedule(struct netns_ipvs *ipvs, int af, struct 
> sk_buff *skb,
>   if (svc) {
>   int ignored;
>  
> + if (th) {
> + /* If sloppy_tcp or IP_VS_SVC_F_STATELESS is true,
> +  * all SYN packets are scheduled except packets
> +  * with set RST flag.
> +  */
> + if (!sysctl_sloppy_tcp(ipvs) &&
> + !(svc->flags & IP_VS_SVC_F_STATELESS) &&
> + (!th->syn || th->rst))
> + return 1;
> + }

Probably the same can be done for sctp_conn_schedule().

> +
>   if (ip_vs_todrop(ipvs)) {
>   /*
>* It seems that we are very loaded.
> -- 
> 2.17.1

Regards

--
Julian Anastasov

Re: [PATCH net v3] ipvs: fix possible memory leak in ip_vs_control_net_init

2020-11-24 Thread Julian Anastasov



Hello,

On Tue, 24 Nov 2020, Wang Hai wrote:

> kmemleak report a memory leak as follows:
> 
> BUG: memory leak
> unreferenced object 0x8880759ea000 (size 256):
> backtrace:
> [<c0bf2deb>] kmem_cache_zalloc include/linux/slab.h:656 [inline]
> [<c0bf2deb>] __proc_create+0x23d/0x7d0 fs/proc/generic.c:421
> [<9d718d02>] proc_create_reg+0x8e/0x140 fs/proc/generic.c:535
> [<97bbfc4f>] proc_create_net_data+0x8c/0x1b0 fs/proc/proc_net.c:126
> [<652480fc>] ip_vs_control_net_init+0x308/0x13a0 
> net/netfilter/ipvs/ip_vs_ctl.c:4169
> [<4c927ebe>] __ip_vs_init+0x211/0x400 
> net/netfilter/ipvs/ip_vs_core.c:2429
> [<aa6b72d9>] ops_init+0xa8/0x3c0 net/core/net_namespace.c:151
> [<153fd114>] setup_net+0x2de/0x7e0 net/core/net_namespace.c:341
> [<be4e4f07>] copy_net_ns+0x27d/0x530 net/core/net_namespace.c:482
> [<f1c23ec9>] create_new_namespaces+0x382/0xa30 kernel/nsproxy.c:110
> [<098a5757>] copy_namespaces+0x2e6/0x3b0 kernel/nsproxy.c:179
> [<26ce39e9>] copy_process+0x220a/0x5f00 kernel/fork.c:2072
> [<b71f4efe>] _do_fork+0xc7/0xda0 kernel/fork.c:2428
> [<2974ee96>] __do_sys_clone3+0x18a/0x280 kernel/fork.c:2703
> [<62ac0a4d>] do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
> [<93f1ce2c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> In the error path of ip_vs_control_net_init(), remove_proc_entry() needs
> to be called to remove the added proc entry, otherwise a memory leak
> will occur.
> 
> Also, add some '#ifdef CONFIG_PROC_FS' because proc_create_net* return NULL
> when PROC is not used.
> 
> Fixes: b17fc9963f83 ("IPVS: netns, ip_vs_stats and its procfs")
> Fixes: 61b1ab4583e2 ("IPVS: netns, add basic init per netns.")
> Reported-by: Hulk Robot 
> Signed-off-by: Wang Hai 

Looks good to me, thanks!

Acked-by: Julian Anastasov 

> ---
> v2->v3: improve code format
> v1->v2: add some '#ifdef CONFIG_PROC_FS' and check the return value of 
> proc_create_net*
>  net/netfilter/ipvs/ip_vs_ctl.c | 31 +++++++++++++++++++++++++------
>  1 file changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index e279ded4e306..d45dbcba8b49 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -4167,12 +4167,18 @@ int __net_init ip_vs_control_net_init(struct 
> netns_ipvs *ipvs)
>  
>   spin_lock_init(&ipvs->tot_stats.lock);
>  
> - proc_create_net("ip_vs", 0, ipvs->net->proc_net, &ip_vs_info_seq_ops,
> - sizeof(struct ip_vs_iter));
> - proc_create_net_single("ip_vs_stats", 0, ipvs->net->proc_net,
> - ip_vs_stats_show, NULL);
> - proc_create_net_single("ip_vs_stats_percpu", 0, ipvs->net->proc_net,
> - ip_vs_stats_percpu_show, NULL);
> +#ifdef CONFIG_PROC_FS
> + if (!proc_create_net("ip_vs", 0, ipvs->net->proc_net,
> +  &ip_vs_info_seq_ops, sizeof(struct ip_vs_iter)))
> + goto err_vs;
> + if (!proc_create_net_single("ip_vs_stats", 0, ipvs->net->proc_net,
> + ip_vs_stats_show, NULL))
> + goto err_stats;
> + if (!proc_create_net_single("ip_vs_stats_percpu", 0,
> + ipvs->net->proc_net,
> + ip_vs_stats_percpu_show, NULL))
> + goto err_percpu;
> +#endif
>  
>   if (ip_vs_control_net_init_sysctl(ipvs))
>   goto err;
> @@ -4180,6 +4186,17 @@ int __net_init ip_vs_control_net_init(struct 
> netns_ipvs *ipvs)
>   return 0;
>  
>  err:
> +#ifdef CONFIG_PROC_FS
> + remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
> +
> +err_percpu:
> + remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
> +
> +err_stats:
> + remove_proc_entry("ip_vs", ipvs->net->proc_net);
> +
> +err_vs:
> +#endif
>   free_percpu(ipvs->tot_stats.cpustats);
>   return -ENOMEM;
>  }
> @@ -4188,9 +4205,11 @@ void __net_exit ip_vs_control_net_cleanup(struct 
> netns_ipvs *ipvs)
>  {
>   ip_vs_trash_cleanup(ipvs);
>   ip_vs_control_net_cleanup_sysctl(ipvs);
> +#ifdef CONFIG_PROC_FS
>   remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
>   remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
>   remove_proc_entry("ip_vs", ipvs->net->proc_net);
> +#endif
>   free_percpu(ipvs->tot_stats.cpustats);
>  }
>  
> -- 
> 2.17.1

Regards

--
Julian Anastasov

Re: [PATCH net] ipvs: fix possible memory leak in ip_vs_control_net_init

2020-11-19 Thread Julian Anastasov



Hello,

On Thu, 19 Nov 2020, Wang Hai wrote:

> kmemleak report a memory leak as follows:
> 
> BUG: memory leak
> unreferenced object 0x8880759ea000 (size 256):
> comm "syz-executor.3", pid 6484, jiffies 4297476946 (age 48.546s)
> hex dump (first 32 bytes):
> 00 00 00 00 01 00 00 00 08 a0 9e 75 80 88 ff ff ...u
> 08 a0 9e 75 80 88 ff ff 00 00 00 00 ad 4e ad de ...u.N..
> backtrace:
> [<c0bf2deb>] kmem_cache_zalloc include/linux/slab.h:656 [inline]
> [<c0bf2deb>] __proc_create+0x23d/0x7d0 fs/proc/generic.c:421
> [<9d718d02>] proc_create_reg+0x8e/0x140 fs/proc/generic.c:535
> [<97bbfc4f>] proc_create_net_data+0x8c/0x1b0 fs/proc/proc_net.c:126
> [<652480fc>] ip_vs_control_net_init+0x308/0x13a0 
> net/netfilter/ipvs/ip_vs_ctl.c:4169
> [<4c927ebe>] __ip_vs_init+0x211/0x400 
> net/netfilter/ipvs/ip_vs_core.c:2429
> [<aa6b72d9>] ops_init+0xa8/0x3c0 net/core/net_namespace.c:151
> [<153fd114>] setup_net+0x2de/0x7e0 net/core/net_namespace.c:341
> [<be4e4f07>] copy_net_ns+0x27d/0x530 net/core/net_namespace.c:482
> [<f1c23ec9>] create_new_namespaces+0x382/0xa30 kernel/nsproxy.c:110
> [<098a5757>] copy_namespaces+0x2e6/0x3b0 kernel/nsproxy.c:179
> [<26ce39e9>] copy_process+0x220a/0x5f00 kernel/fork.c:2072
> [<b71f4efe>] _do_fork+0xc7/0xda0 kernel/fork.c:2428
> [<2974ee96>] __do_sys_clone3+0x18a/0x280 kernel/fork.c:2703
> [<62ac0a4d>] do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
> [<93f1ce2c>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> In the error path of ip_vs_control_net_init(), remove_proc_entry() needs
> to be called to remove the added proc entry, otherwise a memory leak
> will occur.
> 
> Fixes: b17fc9963f83 ("IPVS: netns, ip_vs_stats and its procfs")
> Fixes: 61b1ab4583e2 ("IPVS: netns, add basic init per netns.")
> Reported-by: Hulk Robot 
> Signed-off-by: Wang Hai 
> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index e279ded4e306..d99bb89e7c25 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -4180,6 +4180,9 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs 
> *ipvs)
>   return 0;

Maybe we should add some #ifdef CONFIG_PROC_FS because
proc_create_net* return NULL when PROC is not used. For example:

#ifdef CONFIG_PROC_FS
if (!proc_create_net...
goto err_vs;
if (!proc_create_net...
goto err_stats;
...
#endif
...

>  err:

#ifdef CONFIG_PROC_FS
> + remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);

err_percpu:
> + remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);

err_stats:
> + remove_proc_entry("ip_vs", ipvs->net->proc_net);

err_vs:
#endif

>   free_percpu(ipvs->tot_stats.cpustats);
>   return -ENOMEM;
>  }
> -- 

Regards

--
Julian Anastasov

Re: [PATCH] ipvs: replace atomic_add_return()

2020-11-17 Thread Julian Anastasov



Hello,

On Mon, 16 Nov 2020, Yejune Deng wrote:

> atomic_inc_return() looks better
> 
> Signed-off-by: Yejune Deng 

Looks good to me for -next, thanks!

Acked-by: Julian Anastasov 

> ---
>  net/netfilter/ipvs/ip_vs_core.c | 2 +-
>  net/netfilter/ipvs/ip_vs_sync.c | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index c0b8215..54e086c 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -2137,7 +2137,7 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, 
> struct sk_buff *skb,
>   if (cp->flags & IP_VS_CONN_F_ONE_PACKET)
>   pkts = sysctl_sync_threshold(ipvs);
>   else
> - pkts = atomic_add_return(1, &cp->in_pkts);
> + pkts = atomic_inc_return(&cp->in_pkts);
>  
>   if (ipvs->sync_state & IP_VS_STATE_MASTER)
>   ip_vs_sync_conn(ipvs, cp, pkts);
> diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
> index 16b4806..9d43277 100644
> --- a/net/netfilter/ipvs/ip_vs_sync.c
> +++ b/net/netfilter/ipvs/ip_vs_sync.c
> @@ -615,7 +615,7 @@ static void ip_vs_sync_conn_v0(struct netns_ipvs *ipvs, 
> struct ip_vs_conn *cp,
>   cp = cp->control;
>   if (cp) {
>   if (cp->flags & IP_VS_CONN_F_TEMPLATE)
> - pkts = atomic_add_return(1, &cp->in_pkts);
> + pkts = atomic_inc_return(&cp->in_pkts);
>   else
>   pkts = sysctl_sync_threshold(ipvs);
>   ip_vs_sync_conn(ipvs, cp, pkts);
> @@ -776,7 +776,7 @@ void ip_vs_sync_conn(struct netns_ipvs *ipvs, struct 
> ip_vs_conn *cp, int pkts)
>   if (!cp)
>   return;
>   if (cp->flags & IP_VS_CONN_F_TEMPLATE)
> - pkts = atomic_add_return(1, &cp->in_pkts);
> + pkts = atomic_inc_return(&cp->in_pkts);
>   else
>   pkts = sysctl_sync_threshold(ipvs);
>   goto sloop;
> -- 
> 1.9.1

Regards

--
Julian Anastasov

Re: [PATCH RFC v3] ipvs: add genetlink cmd to dump all services and destinations

2020-11-15 Thread Julian Anastasov

ZE;
> + goto nla_nested_end;
> + }
> + }
> + ctx->idx_dest = 0;
> + ctx->start_dest = 0;
> +
> +nla_nested_end:
> + nla_nest_end(skb, nl_dests);
> + nla_nest_end(skb, nl_service);
> + genlmsg_end(skb, hdr);
> + return ret;
> +
> +nla_nested_failure:
> + nla_nest_cancel(skb, nl_service);
> +
> +nla_put_failure:
> + genlmsg_cancel(skb, hdr);
> +
> +out_err:
> + ctx->idx_svc--;
> + return -EMSGSIZE;
> +}
> +
> +static int ip_vs_genl_dump_services_destinations(struct sk_buff *skb,
> +  struct netlink_callback *cb)
> +{
> + struct dump_services_dests_ctx ctx = {
> + .idx_svc = 0,
> + .start_svc = cb->args[0],
> + .idx_dest = 0,
> + .start_dest = cb->args[1],
> + };
> + struct net *net = sock_net(skb->sk);
> + struct netns_ipvs *ipvs = net_ipvs(net);
> + struct ip_vs_service *svc;
> + struct nlattr *attrs[IPVS_CMD_ATTR_MAX + 1];
> + int tab = cb->args[2];
> + int row = cb->args[3];
> +
> + mutex_lock(&__ip_vs_mutex);
> +
> + if (nlmsg_parse_deprecated(cb->nlh, GENL_HDRLEN, attrs,
> +IPVS_CMD_ATTR_MAX, ip_vs_cmd_policy,
> +cb->extack) == 0) {
> + if (attrs[IPVS_CMD_ATTR_SERVICE]) {
> + svc = ip_vs_genl_find_service(ipvs,
> +   
> attrs[IPVS_CMD_ATTR_SERVICE]);
> + if (IS_ERR_OR_NULL(svc))
> + goto out_err;
> + ip_vs_genl_dump_service_dests(skb, cb, ipvs, svc, &ctx);
> + goto nla_put_failure;

Maybe we should use a different name for the above label;
we are not dealing with nla in this function.

> + }
> + }
> +
> + if (tab >= 2)
> + goto nla_put_failure;
> +
> + if (tab >= 1)
> + goto tab_1;
> +
> + for (; row < IP_VS_SVC_TAB_SIZE; row++) {
> + hlist_for_each_entry(svc, &ip_vs_svc_table[row], s_list) {
> + if (ip_vs_genl_dump_service_dests(skb, cb, ipvs,
> +   svc, &ctx))
> + goto nla_put_failure;
> + }
> + ctx.idx_svc = 0;
> + ctx.start_svc = 0;
> + ctx.idx_dest = 0;
> + ctx.start_dest = 0;
> + }
> +
> + row = 0;
> + tab++;
> +
> +tab_1:
> + for (; row < IP_VS_SVC_TAB_SIZE; row++) {
> + hlist_for_each_entry(svc, &ip_vs_svc_fwm_table[row], f_list) {
> + if (ip_vs_genl_dump_service_dests(skb, cb, ipvs,
> +   svc, &ctx))
> + goto nla_put_failure;
> + }
> + ctx.idx_svc = 0;
> + ctx.start_svc = 0;
> + ctx.idx_dest = 0;
> + ctx.start_dest = 0;
> + }
> +
> + row = 0;
> + tab++;
> +
> +nla_put_failure:
> + cb->args[0] = ctx.idx_svc;
> + cb->args[1] = ctx.idx_dest;
> + cb->args[2] = tab;
> + cb->args[3] = row;
> +
> +out_err:
> + mutex_unlock(&__ip_vs_mutex);
> +
> + return skb->len;
> +}
> +
>  static int ip_vs_genl_parse_dest(struct ip_vs_dest_user_kern *udest,
>struct nlattr *nla, bool full_entry)
>  {
> @@ -3991,6 +4155,12 @@ static const struct genl_small_ops ip_vs_genl_ops[] = {
>   .flags  = GENL_ADMIN_PERM,
>   .doit   = ip_vs_genl_set_cmd,
>   },
> + {
> + .cmd= IPVS_CMD_GET_SERVICE_DEST,
> + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> + .flags  = GENL_ADMIN_PERM,
> + .dumpit = ip_vs_genl_dump_services_destinations,
> + },
>  };
>  
>  static struct genl_family ip_vs_genl_family __ro_after_init = {
> -- 
> 2.25.1

Some comments on the ipvsadm patch. Next time, please
post it separately here on the list with its own subject, so that
we can comment on it inline.

- ipvs_get_services_dests(): #ifdef can be before declarations,
try to order lines from long to short (reverse xmas tree order
for variables in declarations)
- print_service_entry(): no need to check d before free(d),
free() checks it itself, just like kfree() in kernel.
- ipvs_services_dests_parse_cb: we should stop if realloc() fails.
Sadly, existing code does not check the realloc() result, but
new code should do it
- ipvs_get_services_dests(): kernel avoids using assignments in
'if' condition, we do the same for new code. You have to
split such code into assignment+condition.
- there are extra parentheses in code such as sizeof(*(get->index));
this should be fine instead: sizeof(*get->index), same for
sizeof(get->index[0]). Extra parens also for &(get->dests),
etc.
- as new code runs only for LIBIPVS_USE_NL, check if it is wrapped
in proper #ifdef in libipvs/libipvs.c. Make sure
ipvsadm compiles without LIBIPVS_USE_NL.
- the extern keyword should not be used in .h files anymore
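
To illustrate a few of the points above (checked realloc(), assignment
split from the 'if' condition, no NULL check before free(), no extra
parentheses in sizeof), here is a small standalone sketch. The names
struct dest_list, dest_list_append() and dest_list_free() are
hypothetical and are not taken from ipvsadm.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical growable list, standing in for the parsed dests. */
struct dest_list {
	int *items;
	size_t count;
};

/* Append one item; returns 0 on success, -1 if realloc() fails. */
static int dest_list_append(struct dest_list *l, int item)
{
	int *tmp;

	/* assignment first, condition second */
	tmp = realloc(l->items, (l->count + 1) * sizeof(*l->items));
	if (!tmp)
		return -1;	/* l->items is still valid and still owned by l */
	l->items = tmp;
	l->items[l->count++] = item;
	return 0;
}

static void dest_list_free(struct dest_list *l)
{
	free(l->items);		/* free(NULL) is a no-op, no check needed */
	l->items = NULL;
	l->count = 0;
}
```

A parse callback would stop processing as soon as dest_list_append()
returns -1, keeping what was collected so far for cleanup.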

Some of the above styling issues are also reported by
linux# scripts/checkpatch.pl --strict /tmp/ipvsadm.patch

As we try to apply to ipvsadm the same styling rules
that are used for networking in kernel, you should be able
to fix all such places with help from checkpatch.pl. Probably,
you know about this file:

Documentation/process/coding-style.rst

Regards

--
Julian Anastasov

Re: [PATCH RFC v2] ipvs: add genetlink cmd to dump all services and destinations

2020-11-09 Thread Julian Anastasov

; +   NLM_F_MULTI, IPVS_CMD_NEW_SERVICE);
> + if (!hdr)
> + goto out_err;
> +
> + nl_service = nla_nest_start_noflag(skb, IPVS_CMD_ATTR_SERVICE);
> + if (!nl_service)
> + goto nla_put_failure;
> +
> + if (ip_vs_genl_put_service_attrs(skb, svc))
> + goto nla_nested_failure;
> +
> + nl_dests = nla_nest_start_noflag(skb, IPVS_SVC_ATTR_DESTS);
> + if (!nl_dests)
> + goto nla_nested_failure;
> +
> + list_for_each_entry(dest, &svc->destinations, n_list) {
> + if (++ctx->idx_dest <= ctx->start_dest)
> + continue;
> + if (ip_vs_genl_fill_dest(skb, IPVS_DESTS_ATTR_DEST, dest) < 0) {
> + ctx->idx_svc--;
> + ctx->idx_dest--;
> + ret = -EMSGSIZE;
> + goto nla_nested_end;
> + }
> + }
> + ctx->idx_dest = 0;
> + ctx->start_dest = 0;
> +
> +nla_nested_end:
> + nla_nest_end(skb, nl_dests);
> + nla_nest_end(skb, nl_service);
> + genlmsg_end(skb, hdr);
> + return ret;
> +
> +nla_nested_failure:
> + nla_nest_cancel(skb, nl_service);
> +
> +nla_put_failure:
> + genlmsg_cancel(skb, hdr);
> +
> +out_err:
> + ctx->idx_svc--;
> + return -EMSGSIZE;
> +}
> +
> +static int ip_vs_genl_dump_services_destinations(struct sk_buff *skb,
> +  struct netlink_callback *cb)
> +{
> + struct dump_services_dests_ctx ctx = {
> + .idx_svc = 0,
> + .start_svc = cb->args[0],
> + .idx_dest = 0,
> + .start_dest = cb->args[1],
> + };
> + struct net *net = sock_net(skb->sk);
> + struct netns_ipvs *ipvs = net_ipvs(net);
> + struct ip_vs_service *svc = NULL;

NULL not needed

> + struct nlattr *attrs[IPVS_CMD_ATTR_MAX + 1];
> + int tab = cb->args[2];
> + int row = cb->args[3];
> +
> + mutex_lock(&__ip_vs_mutex);
> +
> + if (nlmsg_parse_deprecated(cb->nlh, GENL_HDRLEN, attrs,
> +IPVS_CMD_ATTR_MAX, ip_vs_cmd_policy,
> +cb->extack) == 0) {
> + if (attrs[IPVS_CMD_ATTR_SERVICE]) {
> + svc = ip_vs_genl_find_service(ipvs,
> +   
> attrs[IPVS_CMD_ATTR_SERVICE]);
> + if (IS_ERR_OR_NULL(svc))
> + goto out_err;
> + ip_vs_genl_dump_service_dests(skb, cb, ipvs, svc, &ctx);
> + goto nla_put_failure;
> + }
> + }
> +

To make it more readable and to avoid the lookup when at EOF,
we can start with the tab checks:

if (tab >= 2)
goto nla_put_failure;   # or done
if (tab >= 1)
goto tab_1;

for (; row < IP_VS_SVC_TAB_SIZE; row++) {

> + for (; tab == 0 && row < IP_VS_SVC_TAB_SIZE; row++) {
> + hlist_for_each_entry(svc, &ip_vs_svc_table[row], s_list) {
> + if (ip_vs_genl_dump_service_dests(skb, cb, ipvs,
> +   svc, &ctx))
> + goto nla_put_failure;
> + }
> + ctx.idx_svc = 0;
> + ctx.start_svc = 0;

If we were in the middle of the dests for the previous packet
but now the svc and its dests are deleted, we have to reset the
dest position too, otherwise we will skip dests in the next row:

ctx.idx_dest = 0;
ctx.start_dest = 0;

But any kind of modification will show wrong results,
so it does not matter much.

> + }
> +
> + if (tab == 0) {
> + row = 0;
> + tab++;
> + }
> +

row = 0;
tab++;

tab_1:

> + for (; row < IP_VS_SVC_TAB_SIZE; row++) {
> + hlist_for_each_entry(svc, &ip_vs_svc_fwm_table[row], f_list) {
> + if (ip_vs_genl_dump_service_dests(skb, cb, ipvs,
> +   svc, &ctx))
> + goto nla_put_failure;
> + }
> +     ctx.idx_svc = 0;
> + ctx.start_svc = 0;

ctx.idx_dest = 0;
ctx.start_dest = 0;

> + }

row = 0;        # Not needed
tab++;          # tab = 2 to indicate EOF

> +
> +nla_put_failure:
> + cb->args[0] = ctx.idx_svc;
> + cb->args[1] = ctx.idx_dest;
> + cb->args[2] = tab;
> + cb->args[3] = row;
> +
> +out_err:
> + mutex_unlock(&__ip_vs_mutex);
> +
> + return skb->len;
> +}
> +
>  static int ip_vs_genl_parse_dest(struct ip_vs_dest_user_kern *udest,
>struct nlattr *nla, bool full_entry)
>  {
> @@ -3991,6 +4143,12 @@ static const struct genl_small_ops ip_vs_genl_ops[] = {
>   .flags  = GENL_ADMIN_PERM,
>   .doit   = ip_vs_genl_set_cmd,
>   },
> + {
> + .cmd= IPVS_CMD_GET_SERVICE_DEST,
> + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> + .flags  = GENL_ADMIN_PERM,
> + .dumpit = ip_vs_genl_dump_services_destinations,
> + },
>  };
>  
>  static struct genl_family ip_vs_genl_family __ro_after_init = {
> -- 
> 2.25.1

Regards

--
Julian Anastasov

Re: [PATCH RFC] ipvs: add genetlink cmd to dump all services and destinations

2020-11-03 Thread Julian Anastasov

Hello,

On Tue, 3 Nov 2020, Cezar Sá Espinola wrote:

> > And now what happens if all dests can not fit in a packet?
> > We should start next packet with the same svc? And then
> > user space should merge the dests when multiple packets
> > start with same service?
> 
> My (maybe not so great) idea was to avoid repeating the svc on each
> packet. It's possible for a packet to start with a destination and
> user space must consider then as belonging to the last svc received on
> the previous packet. The comparison "ctx->last_svc != svc" was
> intended to ensure that a packet only starts with destinations if the
> current service is the same as the svc we sent on the previous packet.

You can also consider the idea of having 3 coordinates
for start svc: idx_svc_tab (0 or 1), idx_svc_row (0..IP_VS_SVC_TAB_SIZE-1)
and idx_svc for index in row's chain. On new packet this will
indicate the htable and its row and we have to skip svcs in
this row to find our starting svc. I think, this will still fit in
the netlink_callback's args area. If not, we can always kmalloc
our context in args[0]. In a single table, this should speed up
the start svc lookup 128 times on average (we have 256 rows).
In a setup with 1024 svcs (average 4 in each of the 256 rows)
we should skip these 0..3 entries instead of 512 on average.
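
As a rough illustration of the three-coordinate idea, here is a minimal
user-space sketch in C. It is not the kernel code: struct tables,
struct cursor and dump_chunk() are hypothetical names, and the tiny
ROW_COUNT only stands in for IP_VS_SVC_TAB_SIZE. It shows that on resume
only the entries before idx in a single row are skipped. With 1024 svcs
over 256 rows that is about 4 entries per row, compared with walking
half of a flat list (512 on average).

```c
#include <assert.h>
#include <stddef.h>

#define TAB_COUNT 2	/* stand-in for the two svc hash tables */
#define ROW_COUNT 4	/* stand-in for IP_VS_SVC_TAB_SIZE (256 rows) */
#define CHAIN_MAX 8

/* Hypothetical model: each row holds a short chain of service ids. */
struct row { int len; int svc[CHAIN_MAX]; };
struct tables { struct row rows[TAB_COUNT][ROW_COUNT]; };

/* Resume position: table, row in table, index in the row's chain. */
struct cursor { int tab; int row; int idx; };

/*
 * Copy up to 'budget' service ids into 'out', starting at *cur.
 * On return *cur points at the first entry that was not emitted.
 * Returns the number of ids written (0 when the dump is complete).
 */
static int dump_chunk(const struct tables *t, struct cursor *cur,
		      int *out, int budget)
{
	int n = 0;

	assert(budget >= 0);
	while (cur->tab < TAB_COUNT) {
		const struct row *r = &t->rows[cur->tab][cur->row];

		/* Resume inside this row: entries before idx are skipped. */
		while (cur->idx < r->len) {
			if (n == budget)
				return n;
			out[n++] = r->svc[cur->idx++];
		}
		/* Row exhausted, move to the next row or the next table. */
		cur->idx = 0;
		if (++cur->row == ROW_COUNT) {
			cur->row = 0;
			cur->tab++;
		}
	}
	return n;
}
```

A caller would keep the three cursor fields between calls (in the
kernel case, in the netlink_callback args) and call dump_chunk() until
it returns 0.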

> > last_svc is used out of __ip_vs_mutex region,
> > so it is not safe. We can get a reference count but this
> > is bad if user space blocks.
> 
> I thought it would be relatively safe to store a pointer to the last
> svc since I would only use it for pointer comparison and never
> dereferencing it. But in retrospect it does look unsafe and fragile
> and could probably lead to errors especially if services are modified
> during a dump causing the stored pointer to point to a different
> service.

Yes, nobody is using such pointers. We should create
packets that correctly identify svc for the dests. The drawback
is that user space may need more work for merging. We can always
create a sorted array of pointers to svcs, so that we can binary
search with bsearch() the svc from every received packet. Then we
will know if this is a new svc or an old one (with dests in
multiple packets). Should we also check for dest duplicates in
the svc? The question is how safe we should play it. In
user space the most we can do is to avoid duplicates
and to put dests into their correct svc.
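
A minimal sketch of the user-space lookup mentioned above, using the
standard qsort() and bsearch(). The struct svc layout, its single
integer 'key' and the helper names are assumptions made for the
example; a real tool would compare protocol/address/port or fwmark.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical user-space view of a service. */
struct svc { unsigned int key; int num_dests; };

/* Order an array of 'struct svc *' by key. */
static int cmp_svc_ptr(const void *a, const void *b)
{
	const struct svc *sa = *(const struct svc * const *)a;
	const struct svc *sb = *(const struct svc * const *)b;

	return (sa->key > sb->key) - (sa->key < sb->key);
}

/*
 * Return the already known service with this key, or NULL if the svc
 * in the received packet is new. 'sorted' must be ordered with
 * cmp_svc_ptr().
 */
static struct svc *lookup_svc(struct svc **sorted, size_t n,
			      unsigned int key)
{
	struct svc probe = { .key = key };
	const struct svc *pp = &probe;
	struct svc **hit;

	assert(sorted != NULL);
	hit = bsearch(&pp, sorted, n, sizeof(*sorted), cmp_svc_ptr);
	return hit ? *hit : NULL;
}
```

When lookup_svc() returns a service, the dests from the packet are
merged into it; when it returns NULL, the service is added and the
array is sorted again (or the new pointer is inserted in order).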

> > But even if we use just indexes it should be ok.
> > If multiple agents are used in parallel it is not our
> > problem. What can happen is that we can send duplicates
> > or skip entries (both svcs and dests). It is impossible
> > to keep any kind of references to current entries or even
> > keys to look them up if another agent can remove them.
> 
> Got it. I noticed this behavior while writing this patch and even
> created a few crude validation scripts running parallel agents and
> checking the diff in [1].

Ok, make sure your tests cover cases with multiple
dests, so that a single service occupies multiple packets.
I'm not sure whether 100 dests fit in one packet or not.

Regards

--
Julian Anastasov

Re: [PATCH RFC] ipvs: add genetlink cmd to dump all services and destinations

2020-11-02 Thread Julian Anastasov

should contain info for svc, so that we can
properly add dests to the right svc

> + return -EMSGSIZE;
> + }
> + }
> +
> + return 0;
> +}
> +
> +static int ip_vs_genl_dump_services_destinations(struct sk_buff *skb,
> +  struct netlink_callback *cb)
> +{
> + /* Besides usual index based counters, saving a pointer to the last
> +  * dumped service is useful to ensure we only dump destinations that
> +  * belong to it, even when services are removed while the dump is still
> +  * running causing indexes to shift.
> +  */
> + struct dump_services_dests_ctx ctx = {
> + .idx_svc = 0,
> + .idx_dest = 0,
> + .start_svc = cb->args[0],
> + .start_dest = cb->args[1],
> + .last_svc = (struct ip_vs_service *)(cb->args[2]),
> + };
> + struct net *net = sock_net(skb->sk);
> + struct netns_ipvs *ipvs = net_ipvs(net);
> + struct ip_vs_service *svc = NULL;
> + struct nlattr *attrs[IPVS_CMD_ATTR_MAX + 1];
> + int i;
> +
> + mutex_lock(&__ip_vs_mutex);
> +
> + if (nlmsg_parse_deprecated(cb->nlh, GENL_HDRLEN, attrs, IPVS_CMD_ATTR_MAX,
> +ip_vs_cmd_policy, cb->extack) == 0) {
> + svc = ip_vs_genl_find_service(ipvs, attrs[IPVS_CMD_ATTR_SERVICE]);
> +
> + if (!IS_ERR_OR_NULL(svc)) {
> + ip_vs_genl_dump_service_destinations(skb, cb, svc, &ctx);
> + goto nla_put_failure;
> + }
> + }
> +
> + for (i = 0; i < IP_VS_SVC_TAB_SIZE; i++) {
> + hlist_for_each_entry(svc, &ip_vs_svc_table[i], s_list) {
> + if (svc->ipvs != ipvs)
> + continue;
> + if (ip_vs_genl_dump_service_destinations(skb, cb, svc, &ctx) < 0)
> + goto nla_put_failure;
> + }
> + }
> +
> + for (i = 0; i < IP_VS_SVC_TAB_SIZE; i++) {
> + hlist_for_each_entry(svc, &ip_vs_svc_fwm_table[i], s_list) {
> + if (svc->ipvs != ipvs)
> + continue;
> + if (ip_vs_genl_dump_service_destinations(skb, cb, svc, &ctx) < 0)
> + goto nla_put_failure;
> + }
> + }
> +
> +nla_put_failure:
> + mutex_unlock(&__ip_vs_mutex);
> + cb->args[0] = ctx.idx_svc;
> + cb->args[1] = ctx.idx_dest;
> + cb->args[2] = (long)ctx.last_svc;

last_svc is used out of __ip_vs_mutex region,
so it is not safe. We can get a reference count but this
is bad if user space blocks.

But even if we use just indexes it should be ok.
If multiple agents are used in parallel it is not our
problem. What can happen is that we can send duplicates
or skip entries (both svcs and dests). It is impossible
to keep any kind of references to current entries or even
keys to look them up if another agent can remove them.
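
To make the "skip entries" case concrete, here is a small user-space
sketch in C. The flat struct list and the helpers dump_from() and
remove_at() are hypothetical and only model an index-based cursor; they
are not IPVS code. An entry removed between two dump calls shifts the
later entries down, so the saved index jumps over one of them.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flat list standing in for the entries seen by a dump. */
struct list { int len; int item[8]; };

/* Emit up to 'budget' items starting at index *pos and advance *pos. */
static int dump_from(const struct list *l, int *pos, int *out, int budget)
{
	int n = 0;

	assert(budget >= 0);
	while (*pos < l->len && n < budget)
		out[n++] = l->item[(*pos)++];
	return n;
}

/* Remove the entry at 'idx', shifting later entries down by one. */
static void remove_at(struct list *l, int idx)
{
	assert(idx >= 0 && idx < l->len);
	memmove(&l->item[idx], &l->item[idx + 1],
		(size_t)(l->len - idx - 1) * sizeof(l->item[0]));
	l->len--;
}
```

With the list 1,2,3,4 and a budget of 2, the first call emits 1 and 2.
If another agent then removes entry 1, the list becomes 2,3,4 and the
second call resumes at index 2, which now holds 4, so 3 is never sent.
An insertion before the cursor would produce a duplicate in the same
way.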

> +
> + return skb->len;
> +}
> +
>  static int ip_vs_genl_parse_dest(struct ip_vs_dest_user_kern *udest,
>struct nlattr *nla, bool full_entry)
>  {
> @@ -3991,6 +4094,12 @@ static const struct genl_small_ops ip_vs_genl_ops[] = {
>   .flags  = GENL_ADMIN_PERM,
>   .doit   = ip_vs_genl_set_cmd,
>   },
> + {
> + .cmd= IPVS_CMD_GET_SERVICE_DEST,
> + .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
> + .flags  = GENL_ADMIN_PERM,
> + .dumpit = ip_vs_genl_dump_services_destinations,
> + },
>  };
>  
>  static struct genl_family ip_vs_genl_family __ro_after_init = {
> -- 

Regards

--
Julian Anastasov

Re: [PATCH v5] ipvs: adjust the debug info in function set_tcp_state

2020-09-29 Thread Julian Anastasov



Hello,

On Mon, 28 Sep 2020, longguang.yue wrote:

> Outputting client,virtual,dst addresses info when tcp state changes,
> which makes the connection debug more clear
> 
> Signed-off-by: longguang.yue 

OK, v5 can be used instead of fixing v4.

Acked-by: Julian Anastasov 

> ---

longguang.yue, at this place after --- you can add info
for changes between versions, eg:
v5: fix indentation

Use this for other patches, so that we know what is
changed between versions.

>  net/netfilter/ipvs/ip_vs_proto_tcp.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c 
> b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> index dc2e7da2742a..7da51390cea6 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> @@ -539,8 +539,8 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct 
> ip_vs_conn *cp,
>   if (new_state != cp->state) {
>   struct ip_vs_dest *dest = cp->dest;
>  
> - IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] %s:%d->"
> -   "%s:%d state: %s->%s conn->refcnt:%d\n",
> + IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] c:%s:%d v:%s:%d "
> +   "d:%s:%d state: %s->%s conn->refcnt:%d\n",
> pd->pp->name,
> ((state_off == TCP_DIR_OUTPUT) ?
>  "output " : "input "),
> @@ -548,10 +548,12 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct 
> ip_vs_conn *cp,
> th->fin ? 'F' : '.',
> th->ack ? 'A' : '.',
> th->rst ? 'R' : '.',
> -   IP_VS_DBG_ADDR(cp->daf, &cp->daddr),
> -   ntohs(cp->dport),
> IP_VS_DBG_ADDR(cp->af, &cp->caddr),
> ntohs(cp->cport),
> +   IP_VS_DBG_ADDR(cp->af, &cp->vaddr),
> +   ntohs(cp->vport),
> +   IP_VS_DBG_ADDR(cp->daf, &cp->daddr),
> +   ntohs(cp->dport),
> tcp_state_name(cp->state),
> tcp_state_name(new_state),
> refcount_read(&cp->refcnt));
> -- 
> 2.20.1 (Apple Git-117)

Regards

--
Julian Anastasov

Re: [PATCH v4] ipvs: adjust the debug info in function set_tcp_state

2020-09-27 Thread Julian Anastasov



Hello,

On Sun, 27 Sep 2020, longguang.yue wrote:

> outputting client,virtual,dst addresses info when tcp state changes,
> which makes the connection debug more clear
> 
> Signed-off-by: longguang.yue 

Looks good to me, thanks!

Acked-by: Julian Anastasov 

Simon, Pablo, maybe the commit description should not
be indented...

> ---
>  net/netfilter/ipvs/ip_vs_proto_tcp.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c 
> b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> index dc2e7da2742a..7da51390cea6 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> @@ -539,8 +539,8 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct 
> ip_vs_conn *cp,
>   if (new_state != cp->state) {
>   struct ip_vs_dest *dest = cp->dest;
>  
> - IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] %s:%d->"
> -   "%s:%d state: %s->%s conn->refcnt:%d\n",
> + IP_VS_DBG_BUF(8, "%s %s [%c%c%c%c] c:%s:%d v:%s:%d "
> +   "d:%s:%d state: %s->%s conn->refcnt:%d\n",
> pd->pp->name,
> ((state_off == TCP_DIR_OUTPUT) ?
>  "output " : "input "),
> @@ -548,10 +548,12 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct 
> ip_vs_conn *cp,
> th->fin ? 'F' : '.',
> th->ack ? 'A' : '.',
> th->rst ? 'R' : '.',
> -   IP_VS_DBG_ADDR(cp->daf, &cp->daddr),
> -   ntohs(cp->dport),
> IP_VS_DBG_ADDR(cp->af, &cp->caddr),
> ntohs(cp->cport),
> +   IP_VS_DBG_ADDR(cp->af, &cp->vaddr),
> +   ntohs(cp->vport),
> +   IP_VS_DBG_ADDR(cp->daf, &cp->daddr),
> +   ntohs(cp->dport),
> tcp_state_name(cp->state),
> tcp_state_name(new_state),
> refcount_read(&cp->refcnt));
> -- 
> 2.20.1 (Apple Git-117)

Regards

--
Julian Anastasov

Re: [PATCHv5 net-next] ipvs: remove dependency on ip6_tables

2020-08-31 Thread Julian Anastasov



Hello,

On Sat, 29 Aug 2020, Yaroslav Bolyukin wrote:

> This dependency was added because ipv6_find_hdr was in iptables specific
> code but is no longer required
> 
> Fixes: f8f626754ebe ("ipv6: Move ipv6_find_hdr() out of Netfilter code.")
> Fixes: 63dca2c0b0e7 ("ipvs: Fix faulty IPv6 extension header handling in 
> IPVS").
> Signed-off-by: Yaroslav Bolyukin 

Looks good to me, thanks! Maybe the maintainers will
remove the extra dot after the Fixes line.

Acked-by: Julian Anastasov 

> ---
>  Missed canonical patch format section, subsystem is now specified
> 
>  include/net/ip_vs.h| 3 ---
>  net/netfilter/ipvs/Kconfig | 1 -
>  2 files changed, 4 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 9a59a3378..d609e957a 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -25,9 +25,6 @@
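>  #include <linux/ip.h>
>  #include <linux/ipv6.h>   /* for struct ipv6hdr */
>  #include <net/ipv6.h>
> -#if IS_ENABLED(CONFIG_IP_VS_IPV6)
> -#include <linux/netfilter_ipv6/ip6_tables.h>
> -#endif
>  #if IS_ENABLED(CONFIG_NF_CONNTRACK)
>  #include <net/netfilter/nf_conntrack.h>
>  #endif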
> diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> index 2c1593089..eb0e329f9 100644
> --- a/net/netfilter/ipvs/Kconfig
> +++ b/net/netfilter/ipvs/Kconfig
> @@ -29,7 +29,6 @@ if IP_VS
>  config   IP_VS_IPV6
>   bool "IPv6 support for IPVS"
>   depends on IPV6 = y || IP_VS = IPV6
> - select IP6_NF_IPTABLES
>   select NF_DEFRAG_IPV6
>   help
> Add IPv6 support to IPVS.
> --
> 2.28.0

Regards

--
Julian Anastasov

Re: [PATCH] Remove ipvs v6 dependency on iptables

2020-08-29 Thread Julian Anastasov

Hello,

On Sat, 29 Aug 2020, Yaroslav Bolyukin wrote:

> This dependency was added as part of commit ecefa32ffda201975
> ("ipvs: Fix faulty IPv6 extension header handling in IPVS"), because it
> had dependency on ipv6_find_hdr, which was located in iptables-specific
> code
> 
> But it is no longer required after commit e6f890cfde0e74d5b
> ("ipv6:Move ipv6_find_hdr() out of Netfilter code.")
> 
> Also remove ip6tables include from ip_vs
> 
> Signed-off-by: Yaroslav Bolyukin 

The commit you reference is better added as a special
tag before the Signed-off-by line, e.g.:
Fixes: f8f626754ebe ("ipv6: Move ipv6_find_hdr() out of Netfilter code.")
Then you may skip mentioning the commit in the description, it will be
in the Fixes tag. Note that the first 12 chars of the commit id are
used, not the last. A second Fixes line can be for
63dca2c0b0e7 ("ipvs: Fix faulty IPv6 extension header handling in IPVS").
Neither Fixes line should be wrapped.
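
The "first 12 chars" rule can be sketched with a small helper in C.
format_fixes_tag() is a hypothetical name used only for this example;
it is not part of any kernel tool. It keeps the first 12 characters of
the commit id and puts the subject in parentheses and double quotes on
one unwrapped line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write 'Fixes: <first 12 chars of id> ("<subject>")' into buf.
 * Returns 0 on success, -1 if the id is shorter than 12 characters
 * or the result does not fit into buf.
 */
static int format_fixes_tag(char *buf, size_t size,
			    const char *commit_id, const char *subject)
{
	int n;

	assert(buf != NULL && commit_id != NULL && subject != NULL);
	if (strlen(commit_id) < 12)
		return -1;
	/* %.12s prints at most the first 12 characters of the id. */
	n = snprintf(buf, size, "Fixes: %.12s (\"%s\")", commit_id, subject);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Passing the full id f8f626754ebeca613cf1af2e6f890cfde0e74d5b yields the
f8f626754ebe prefix used in the tag above, not the trailing characters.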

The Subject line needs to include the version and tree,
for example: [PATCHv2 net-next] ipvs: remove v6 dependency on iptables
You increase the version when sending a modified patch.

You can check the Documentation/process/submitting-patches.rst
guide for more info.

> ---
>  include/net/ip_vs.h| 3 ---
>  net/netfilter/ipvs/Kconfig | 1 -
>  2 files changed, 4 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 9a59a3378..d609e957a 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -25,9 +25,6 @@
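>  #include <linux/ip.h>
>  #include <linux/ipv6.h>   /* for struct ipv6hdr */
>  #include <net/ipv6.h>
> -#if IS_ENABLED(CONFIG_IP_VS_IPV6)
> -#include <linux/netfilter_ipv6/ip6_tables.h>
> -#endif
>  #if IS_ENABLED(CONFIG_NF_CONNTRACK)
>  #include <net/netfilter/nf_conntrack.h>
>  #endif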
> diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> index 2c1593089..eb0e329f9 100644
> --- a/net/netfilter/ipvs/Kconfig
> +++ b/net/netfilter/ipvs/Kconfig
> @@ -29,7 +29,6 @@ if IP_VS
>  config   IP_VS_IPV6
>   bool "IPv6 support for IPVS"
>   depends on IPV6 = y || IP_VS = IPV6
> - select IP6_NF_IPTABLES
>   select NF_DEFRAG_IPV6
>   help
> Add IPv6 support to IPVS.
> -- 

Regards

--
Julian Anastasov

Re: [PATCH] Remove ipvs v6 dependency on iptables

2020-08-27 Thread Julian Anastasov



Hello,

On Fri, 28 Aug 2020, Lach wrote:

> This dependency was added in 63dca2c0b0e7a92cb39d1b1ecefa32ffda201975, 
> because this commit had dependency on
> ipv6_find_hdr, which was located in iptables-specific code
> 
> But it is no longer required, because 
> f8f626754ebeca613cf1af2e6f890cfde0e74d5b moved them to a more common location

Maybe then we should also stop including ip6_tables.h from
include/net/ip_vs.h ?

> ---
>  net/netfilter/ipvs/Kconfig | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/netfilter/ipvs/Kconfig b/net/netfilter/ipvs/Kconfig
> index 2c1593089..eb0e329f9 100644
> --- a/net/netfilter/ipvs/Kconfig
> +++ b/net/netfilter/ipvs/Kconfig
> @@ -29,7 +29,6 @@ if IP_VS
>  config   IP_VS_IPV6
>   bool "IPv6 support for IPVS"
>   depends on IPV6 = y || IP_VS = IPV6
> - select IP6_NF_IPTABLES
>   select NF_DEFRAG_IPV6
>   help
>     Add IPv6 support to IPVS.
> -- 
> 2.28.0

Regards

--
Julian Anastasov

Re: [Linux-kernel-mentees] [PATCH net-next v2] ipvs: Fix uninit-value in do_ip_vs_set_ctl()

2020-08-11 Thread Julian Anastasov



Hello,

On Tue, 11 Aug 2020, Peilin Ye wrote:

> do_ip_vs_set_ctl() is referencing uninitialized stack value when `len` is
> zero. Fix it.
> 
> Reported-by: syzbot+23b5f9e7caf61d9a3...@syzkaller.appspotmail.com
> Link: 
> https://syzkaller.appspot.com/bug?id=46ebfb92a8a812621a001ef04d90dfa459520fe2
> Suggested-by: Julian Anastasov 
> Signed-off-by: Peilin Ye 

Looks good to me, thanks!

Acked-by: Julian Anastasov 

> ---
> Changes in v2:
> - Target net-next tree. (Suggested by Julian Anastasov )
> - Reject all `len == 0` requests except `IP_VS_SO_SET_FLUSH`, instead
>   of initializing `arg`. (Suggested by Cong Wang
>   , Julian Anastasov )
> 
>  net/netfilter/ipvs/ip_vs_ctl.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 412656c34f20..beeafa42aad7 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -2471,6 +2471,10 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user 
> *user, unsigned int len)
>   /* Set timeout values for (tcp tcpfin udp) */
>   ret = ip_vs_set_timeout(ipvs, (struct ip_vs_timeout_user *)arg);
>   goto out_unlock;
> + } else if (!len) {
> + /* No more commands with len == 0 below */
> + ret = -EINVAL;
> + goto out_unlock;
>   }
>  
>   usvc_compat = (struct ip_vs_service_user *)arg;
> @@ -2547,9 +2551,6 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user 
> *user, unsigned int len)
>   break;
>   case IP_VS_SO_SET_DELDEST:
>   ret = ip_vs_del_dest(svc, &udest);
> -     break;
> - default:
> - ret = -EINVAL;
>   }
>  
>out_unlock:
> -- 
> 2.25.1

Regards

--
Julian Anastasov

Re: [Linux-kernel-mentees] [PATCH net] ipvs: Fix uninit-value in do_ip_vs_set_ctl()

2020-08-11 Thread Julian Anastasov



Hello,

On Tue, 11 Aug 2020, Peilin Ye wrote:

> On Mon, Aug 10, 2020 at 08:57:19PM -0700, Cong Wang wrote:
> > On Mon, Aug 10, 2020 at 3:10 PM Peilin Ye  wrote:
> > >
> > > do_ip_vs_set_ctl() is referencing uninitialized stack value when `len` is
> > > zero. Fix it.
> > 
> > Which exact 'cmd' is it here?
> > 
> > I _guess_ it is one of those uninitialized in set_arglen[], which is 0.
> 
> Yes, it was `IP_VS_SO_SET_NONE`, implicitly initialized to zero.
> 
> > But if that is the case, should it be initialized to
> > sizeof(struct ip_vs_service_user) instead because ip_vs_copy_usvc_compat()
> > is called anyway. Or, maybe we should just ban len==0 case.
> 
> I see. I think the latter would be easier, but we cannot ban all of
> them, since the function does something with `IP_VS_SO_SET_FLUSH`, which
> is a `len == 0` case.
> 
> Maybe we do something like this?

Yes, only IP_VS_SO_SET_FLUSH uses len 0. We can go with
this change but you do not need to target the net tree: as the
problem is not fatal, net-next works too. What happens is
that we may look up services with random search keys, which
is harmless.

Another option is to add new block after this one:

	} else if (cmd == IP_VS_SO_SET_TIMEOUT) {
		/* Set timeout values for (tcp tcpfin udp) */
		ret = ip_vs_set_timeout(ipvs, (struct ip_vs_timeout_user *)arg);
		goto out_unlock;
	}

such as:

	} else if (!len) {
		/* No more commands with len=0 below */
		ret = -EINVAL;
		goto out_unlock;
	}

It gives future commands more chance to use len=0,
but the drawback is that the check happens under the mutex. So, I'm
fine with both versions, it is up to you to decide :)
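
For comparison, the early-check variant from the quoted diff can be
modelled in user space as below. Everything here is an assumption made
for the example: the CMD_* ids, the arglen[] lengths and
check_request() only mimic the roles of the IP_VS_SO_SET_* values,
set_arglen[] and the checks at the top of do_ip_vs_set_ctl().

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical command ids standing in for IP_VS_SO_SET_*. */
enum { CMD_NONE, CMD_FLUSH, CMD_TIMEOUT, CMD_ADD, CMD_MAX };

/*
 * Expected argument length per command, in the spirit of set_arglen[].
 * Entries that are not listed are implicitly 0, which is how a command
 * like IP_VS_SO_SET_NONE ends up with len == 0. Lengths are made up.
 */
static const unsigned int arglen[CMD_MAX] = {
	[CMD_TIMEOUT] = 12,
	[CMD_ADD]     = 32,
};

static unsigned int expected_len(int cmd)
{
	assert(cmd >= 0 && cmd < CMD_MAX);
	return arglen[cmd];
}

/* Returns 0 if the request may proceed, -EINVAL otherwise. */
static int check_request(int cmd, unsigned int len)
{
	if (cmd < 0 || cmd >= CMD_MAX)
		return -EINVAL;
	/* Only the flush command legitimately carries no argument. */
	if (len == 0 && cmd != CMD_FLUSH)
		return -EINVAL;
	if (len != expected_len(cmd))
		return -EINVAL;
	return 0;
}
```

The in-mutex variant performs the same len == 0 rejection later, after
the commands that may use an empty argument have been handled, which is
why it leaves more room for future len=0 commands.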

> @@ -2432,6 +2432,8 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user 
> *user, unsigned int len)
> 
>   if (cmd < IP_VS_BASE_CTL || cmd > IP_VS_SO_SET_MAX)
>   return -EINVAL;
> + if (len == 0 && cmd != IP_VS_SO_SET_FLUSH)
> + return -EINVAL;
>   if (len != set_arglen[CMDID(cmd)]) {
>   IP_VS_DBG(1, "set_ctl: len %u != %u\n",
> len, set_arglen[CMDID(cmd)]);
> @@ -2547,9 +2549,6 @@ do_ip_vs_set_ctl(struct sock *sk, int cmd, void __user 
> *user, unsigned int len)
>   break;
>   case IP_VS_SO_SET_DELDEST:
>   ret = ip_vs_del_dest(svc, &udest);
> - break;
> - default:
> - ret = -EINVAL;
>   }
> 
>out_unlock:

Regards

--
Julian Anastasov

Re: [PATCH] ipvs: avoid drop first packet to reuse conntrack

2020-06-11 Thread Julian Anastasov

ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -2086,11 +2086,11 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, 
> struct sk_buff *skb,
>   }
>  
>   if (resched) {
> + if (uses_ct)
> + cp->flags &= ~IP_VS_CONN_F_NFCT;
>   if (!atomic_read(&cp->n_control))
>   ip_vs_conn_expire_now(cp);
>   __ip_vs_conn_put(cp);
> - if (uses_ct)
> - return NF_DROP;
>   cp = NULL;
>   }
>   }
> -- 

Regards

--
Julian Anastasov

Re: [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1

2020-05-19 Thread Julian Anastasov

Hello,

On Tue, 19 May 2020, Marco Angaroni wrote:

> Hi Andrew, Julian,
> 
> could you please confirm if/how this patch is changing any of the
> following behaviours, which I’m listing below as per my understanding
> ?
> 
> When expire_nodest is set and real-server is unavailable, at the
> moment the following happens to a packet going through IPVS:
> 
> a) TCP (or other connection-oriented protocols):
>the packet is silently dropped, then the following retransmission
> causes the generation of a RST from the load-balancer to the client,
> which will then re-open a new TCP connection

Yes. It seems we can not create new connection in
all cases, we should also check with is_new_conn().

What we have is that two cases are possible depending on 
conn_reuse_mode, the state of existing connection and whether
netfilter conntrack is used:

1. setup expire for old conn, then drop packet
2. setup expire for old conn, then create new
conn to schedule the packet

When expiration is set, the timer will fire in the
next jiffie to remove the connection from the hash table. Until
removed, the connection can still cause drops. Sometimes
we can simply create a new connection with the same tuple,
so it is possible for both connections to coexist for one jiffie
but the old connection is not reached on lookup.

> b) UDP:
>the packet is silently dropped, then the following retransmission
> is rescheduled to a new real-server

Yes, we drop while old conn is not expired yet

> c) UDP in OPS mode:
>the packet is rescheduled to a new real-server, as no previous
> connection exists in IPVS connection table, and a new OPS connection
> is created (but it lasts only the time to transmit the packet)

Yes, OPS is not affected.

> d) UDP in OPS mode + persistent-template:
>the packet is rescheduled to a new real-server, as previous
> template-connection is invalidated, a new template-connection is
> created, and a new OPS connection is created (but it lasts only the
> time to transmit the packet)

Yes, the existing template is ignored when its server
is unavailable.

> It seems to me that you are trying to optimize case a) and b),
> avoiding the first step where the packet is silently dropped and
> consequently avoiding the retransmission.
> And contextually expire also all the other connections pointing to the
> unavailable real-server.

The change will allow immediate scheduling in a new
connection for any protocol when netfilter conntrack is not
used:

- TCP: avoids retransmission for SYN
- UDP: reduces drops from 1 jiffie to 0 (no drops)

But this single jiffie compared to the delay between
real server failure and the removal from the IPVS table can be
negligible. Of course, if real server is removed while it is
working, with this change we should not see any UDP drops.

> However I'm confused about the references to OPS mode.
> And why you need to expire all the connections at once: if you expire
> on a per connection basis, the client experiences the same behaviour
> (no more re-transmissions), but you avoid the complexities of a new
> thread.

Such flushing can help when conntrack is used in which
case the cost is a retransmission or downtime for one jiffie.

> Maybe also the documentation of expire_nodest_conn sysctl should be updated.
> When it's stated:
> 
> If this feature is enabled, the load balancer will expire the
> connection immediately when a packet arrives and its
> destination server is not available, then the client program
> will be notified that the connection is closed
> 
> I think it should be at least "and the client program" instead of
> "then the client program".
> Or a more detailed explanation.

Yes, if the packet is SYN we can create new connection.
If it is ACK, the retransmission will get RST.

Regards

--
Julian Anastasov

Re: [PATCH] netfilter/ipvs: immediately expire UDP connections matching unavailable destination if expire_nodest_conn=1

2020-05-18 Thread Julian Anastasov

_vs_conn_uses_conntrack(cp, skb);

			ip_vs_conn_expire_now(cp);
			__ip_vs_conn_put(cp);
			if (uses_ct)
				return NF_DROP;
			cp = NULL;
		} else {
			__ip_vs_conn_put(cp);
			return NF_DROP;
		}
	}

	if (unlikely(!cp)) {
		int v;

		if (!ip_vs_try_to_schedule(ipvs, af, skb, pd, &v, &cp, &iph))
			return v;
	}

Until now, we always waited one jiffie for the connection to expire;
now one packet will:

- schedule expiration for existing connection with unavailable dest,
as before

- create new connection to available destination that will be found
first in lists. But it can work only when sysctl var "conntrack" is 0,
we do not want to create two netfilter conntracks to different
real servers.

Note that we intentionally removed the timer_pending() check
because we can not see existing ONE_PACKET connections in table.

Regards

--
Julian Anastasov

Re: [PATCH] netfilter/ipvs: expire no destination UDP connections when expire_nodest_conn=1

2020-05-15 Thread Julian Anastasov

Hello,

On Thu, 14 May 2020, Andrew Sy Kim wrote:

> When expire_nodest_conn=1 and an IPVS destination is deleted, IPVS
> doesn't expire connections with the IP_VS_CONN_F_ONE_PACKET flag set (any
> UDP connection). If there are many UDP packets to a virtual server from a
> single client and a destination is deleted, many packets are silently
> dropped whenever an existing connection entry with the same source port
> exists. This patch ensures IPVS also expires UDP connections when a
> packet matches an existing connection with no destinations.
> 
> Signed-off-by: Andrew Sy Kim 
> ---
>  net/netfilter/ipvs/ip_vs_core.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index aa6a603a2425..f0535586fe75 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -2116,8 +2116,7 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int hooknum, 
> struct sk_buff *skb, int
>   else
>   ip_vs_conn_put(cp);

Above ip_vs_conn_put() should free the ONE_PACKET
connections because:

- such connections never start a timer, they are designed
to exist just to schedule the packet, then they are released.
- no one takes extra references

So, ip_vs_conn_put() simply calls ip_vs_conn_expire()
where connections should be released immediately. As a result,
we can not access cp after this point here. That is why we work
just with 'flags' below...
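
The "work just with 'flags'" pattern is the usual way to avoid touching
an object after the last reference is dropped. A minimal user-space
model in C follows; struct conn, conn_put() and put_and_get_flags() are
hypothetical and much simpler than the real ip_vs_conn handling, where
the release goes through ip_vs_conn_expire().

```c
#include <assert.h>
#include <stdlib.h>

#define F_ONE_PACKET 0x1u	/* stand-in for IP_VS_CONN_F_ONE_PACKET */

/* Hypothetical, heavily simplified connection object. */
struct conn {
	unsigned int flags;
	int refcnt;
};

/* Drop one reference; the last put frees the object. */
static void conn_put(struct conn *cp)
{
	assert(cp->refcnt > 0);
	if (--cp->refcnt == 0)
		free(cp);
}

/*
 * Copy what is needed later, then drop the reference. After conn_put()
 * the pointer may be dangling, so only the saved copy is used.
 */
static unsigned int put_and_get_flags(struct conn *cp)
{
	unsigned int flags = cp->flags;

	conn_put(cp);
	return flags;
}
```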

Note that not every UDP connection has ONE_PACKET
flag, it is present if you configure it for the service.
Do you have -o/--ops flag? If not, the UDP connection
should expire before the next jiffie. This is the theory,
in practice, you may observe some problem...

> - if (sysctl_expire_nodest_conn(ipvs) &&
> - !(flags & IP_VS_CONN_F_ONE_PACKET)) {
> + if (sysctl_expire_nodest_conn(ipvs)) {
>   /* try to expire the connection immediately */
>   ip_vs_conn_expire_now(cp);
>   }

You can also look at the discussion which resulted in
the last patch for this place:

http://archive.linuxvirtualserver.org/html/lvs-devel/2018-07/msg00014.html

Regards

--
Julian Anastasov

Re: [PATCH v3 0/3] selftests: netfilter: introduce test cases for ipvs

2019-10-01 Thread Julian Anastasov



Hello,

On Tue, 1 Oct 2019, Haishuang Yan wrote:

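> This patch series includes test cases for ipvs.
> 
> The test topology is shown below:
> +--------------------------------------------------------------+
> |                      |                                       |
> |         ns0          |         ns1                           |
> |      -----------     |     -----------    -----------        |
> |      | veth01  | --------- | veth10  |    | veth12  |        |
> |      -----------    peer   -----------    -----------        |
> |           |          |                        |              |
> |      -----------     |                        |              |
> |      |  br0    |     |-----------------  peer |--------------|
> |      -----------     |                        |              |
> |           |          |                        |              |
> |      ----------     peer   ----------      -----------       |
> |      |  veth02 | --------- |  veth20 |     | veth12  |       |
> |      ----------      |     ----------      -----------       |
> |                      |         ns2                           |
> |                      |                                       |
> +--------------------------------------------------------------+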
> 
> Test results:
> # selftests: netfilter: ipvs.sh
> # Testing DR mode...
> # Testing NAT mode...
> # Testing Tunnel mode...
> # ipvs.sh: PASS
> ok 6 selftests: netfilter: ipvs.sh
> 
> Haishuang Yan (3):
>   selftests: netfilter: add ipvs test script
>   selftests: netfilter: add ipvs nat test case
>   selftests: netfilter: add ipvs tunnel test case

Acked-by: Julian Anastasov 

>  tools/testing/selftests/netfilter/Makefile |   2 +-
>  tools/testing/selftests/netfilter/ipvs.sh  | 234 
> +
>  2 files changed, 235 insertions(+), 1 deletion(-)
>  create mode 100755 tools/testing/selftests/netfilter/ipvs.sh

Regards

--
Julian Anastasov

Re: [PATCH v2 0/3] selftests: netfilter: introduce test cases for ipvs

2019-09-30 Thread Julian Anastasov



Hello,

On Fri, 27 Sep 2019, Haishuang Yan wrote:

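> This patch series includes test cases for ipvs.
> 
> The test topology is shown below:
> +--------------------------------------------------------------+
> |                      |                                       |
> |         ns0          |         ns1                           |
> |      -----------     |     -----------    -----------        |
> |      | veth01  | --------- | veth10  |    | veth12  |        |
> |      -----------    peer   -----------    -----------        |
> |           |          |                        |              |
> |      -----------     |                        |              |
> |      |  br0    |     |-----------------  peer |--------------|
> |      -----------     |                        |              |
> |           |          |                        |              |
> |      ----------     peer   ----------      -----------       |
> |      |  veth02 | --------- |  veth20 |     | veth21  |       |
> |      ----------      |     ----------      -----------       |
> |                      |         ns2                           |
> |                      |                                       |
> +--------------------------------------------------------------+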
> 
> Test results:
> # selftests: netfilter: ipvs.sh
> # Testing DR mode...
> # Testing NAT mode...
> # Testing Tunnel mode...
> # ipvs.sh: PASS
> ok 6 selftests: netfilter: ipvs.sh
> 
> Haishuang Yan (3):
>   selftests: netfilter: add ipvs test script
>   selftests: netfilter: add ipvs nat test case
>   selftests: netfilter: add ipvs tunnel test case
> 
>  tools/testing/selftests/netfilter/Makefile |   2 +-
>  tools/testing/selftests/netfilter/ipvs.sh  | 234 
> +
>  2 files changed, 235 insertions(+), 1 deletion(-)
>  create mode 100755 tools/testing/selftests/netfilter/ipvs.sh

Patchset v2 looks good to me, thanks!

Acked-by: Julian Anastasov 

Regards

--
Julian Anastasov

Re: [PATCH v2 0/2] ipvs: speedup ipvs netns dismantle

2019-09-30 Thread Julian Anastasov



Hello,

On Fri, 27 Sep 2019, Haishuang Yan wrote:

> Implement exit_batch() method to dismantle more ipvs netns
> per round.
> 
> Tested:
> $  cat add_del_unshare.sh
> #!/bin/bash
> 
> for i in `seq 1 100`
> do
>  (for j in `seq 1 40` ; do  unshare -n ipvsadm -A -t 172.16.$i.$j:80 
> >/dev/null ; done) &
> done
> wait; grep net_namespace /proc/slabinfo
> 
> Before patch:
> $  time sh add_del_unshare.sh
> net_namespace       4020   4020   4736    6    8 : tunables    0    0    0 : slabdata    670    670      0
> 
> real    0m8.086s
> user    0m2.025s
> sys     0m36.956s
> 
> After patch:
> $  time sh add_del_unshare.sh
> net_namespace       4020   4020   4736    6    8 : tunables    0    0    0 : slabdata    670    670      0
> 
> real    0m7.623s
> user    0m2.003s
> sys     0m32.935s
> 
> Haishuang Yan (2):
>   ipvs: batch __ip_vs_cleanup
>   ipvs: batch __ip_vs_dev_cleanup
> 
>  include/net/ip_vs.h |  2 +-
>  net/netfilter/ipvs/ip_vs_core.c | 47 
> -
>  net/netfilter/ipvs/ip_vs_ctl.c  | 12 ++++---
>  3 files changed, 38 insertions(+), 23 deletions(-)

Both patches in v2 look good to me, thanks!

Acked-by: Julian Anastasov 

This is for the -next kernels...

Regards

--
Julian Anastasov

Re: [PATCH 3/3] selftests: netfilter: add ipvs tunnel test case

2019-09-26 Thread Julian Anastasov



Hello,

On Fri, 27 Sep 2019, Haishuang Yan wrote:

> Test virtual server via ipip tunnel.
> 
> Tested:
> # selftests: netfilter: ipvs.sh
> # Testing DR mode...
> # Testing NAT mode...
> # Testing Tunnel mode...
> # ipvs.sh: PASS
> ok 6 selftests: netfilter: ipvs.sh
> 
> Signed-off-by: Haishuang Yan 

It is good to have IPVS selftests... This is a good start,
later we can add IPv6...

> ---
>  tools/testing/selftests/netfilter/ipvs.sh | 33 
> +++
>  1 file changed, 33 insertions(+)
> 
> diff --git a/tools/testing/selftests/netfilter/ipvs.sh 
> b/tools/testing/selftests/netfilter/ipvs.sh
> index 40058f9..2012cec 100755
> --- a/tools/testing/selftests/netfilter/ipvs.sh
> +++ b/tools/testing/selftests/netfilter/ipvs.sh
> @@ -167,6 +167,33 @@ test_nat() {
>  test_service
>  }
>  
> +test_tun() {
> +ip netns exec ns0 ip route add ${vip_v4} via ${gip_v4} dev br0
> +
> +ip netns exec ns1 modprobe ipip
> +ip netns exec ns1 ip link set tunl0 up
> +ip netns exec ns1 sysctl -qw net.ipv4.ip_forward=0
> +ip netns exec ns1 sysctl -qw net.ipv4.conf.all.send_redirects=0
> +ip netns exec ns1 sysctl -qw net.ipv4.conf.default.send_redirects=0
> +ip netns exec ns1 ipvsadm -A -t ${vip_v4}:${port} -s rr
> +ip netns exec ns1 ipvsadm -a -i -t ${vip_v4}:${port} -r ${rip_v4}:${port}
> +ip netns exec ns1 ip addr add ${vip_v4}/32 dev lo:1
> +
> +ip netns exec ns2 modprobe ipip
> +ip netns exec ns2 ip link set tunl0 up
> +ip netns exec ns2 sysctl -qw net.ipv4.conf.all.arp_ignore=1
> +ip netns exec ns2 sysctl -qw net.ipv4.conf.all.arp_announce=2
> +ip netns exec ns2 sysctl -qw net.ipv4.conf.all.rp_filter=0
> +ip netns exec ns2 sysctl -qw net.ipv4.conf.lo.arp_ignore=1
> +ip netns exec ns2 sysctl -qw net.ipv4.conf.lo.arp_announce=2

arp_ignore and arp_announce are not used on "lo". Also, the MAX of
the "all" and the per-interface value is used, i.e.

# for all interfaces use (suitable for our test setup):
all.arp_ignore=1
all.arp_announce=2

# or if above is not desired, for specific LAN interface use
veth21.arp_ignore=1
veth21.arp_announce=2

BTW, the picture has ns2/veth12 while it should be veth21.
Also, should we check if the IPVS module is loaded? E.g. depending
on whether the /proc/sys/net/ipv4/vs/ dir is present?

> +ip netns exec ns2 sysctl -qw net.ipv4.conf.lo.rp_filter=0

IIRC, lo.rp_filter is never used, packets from "lo" always come
with attached output route, so source validation is not performed.

> +ip netns exec ns2 sysctl -qw net.ipv4.conf.tunl0.rp_filter=0
> +ip netns exec ns2 sysctl -qw net.ipv4.conf.veth21.rp_filter=0
> +ip netns exec ns2 ip addr add ${vip_v4}/32 dev lo:1
> +
> +test_service
> +}
> +
>  run_tests() {
>   local errors=
>  
> @@ -182,6 +209,12 @@ run_tests() {
>   test_nat
>   errors=$(( $errors + $? ))
>  
> + echo "Testing Tunnel mode..."
> + cleanup
> + setup
> + test_tun
> + errors=$(( $errors + $? ))
> +
>   return $errors
>  }
>  
> -- 
> 1.8.3.1

Regards

--
Julian Anastasov

Re: [net-next 1/2] ipvs: batch __ip_vs_cleanup

2019-07-29 Thread Julian Anastasov



Hello,

On Thu, 18 Jul 2019, Haishuang Yan wrote:

> As the following benchmark testing results show, there is a little 
> performance improvement:

OK, can you send v2 after removing the LIST_HEAD(list) from
both patches? I guess it is not needed. If you prefer, you can
include these benchmark results too.

> $  cat add_del_unshare.sh
> #!/bin/bash
> 
> for i in `seq 1 100`
> do
>  (for j in `seq 1 40` ; do  unshare -n ipvsadm -A -t 172.16.$i.$j:80 >/dev/null ; done) &
> done
> wait; grep net_namespace /proc/slabinfo
> 
> Before patch:
> $  time sh add_del_unshare.sh
> net_namespace   4020   4020   4736    6    8 : tunables    0    0    0 : slabdata    670    670      0
> 
> real    0m8.086s
> user    0m2.025s
> sys     0m36.956s
> 
> After patch:
> $  time sh add_del_unshare.sh
> net_namespace   4020   4020   4736    6    8 : tunables    0    0    0 : slabdata    670    670      0
> 
> real    0m7.623s
> user    0m2.003s
> sys     0m32.935s
> 
> 
> > 
> >> +  ipvs = net_ipvs(net);
> >> +  ip_vs_conn_net_cleanup(ipvs);
> >> +  ip_vs_app_net_cleanup(ipvs);
> >> +  ip_vs_protocol_net_cleanup(ipvs);
> >> +  ip_vs_control_net_cleanup(ipvs);
> >> +      ip_vs_estimator_net_cleanup(ipvs);
> >> +  IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> >> +  net->ipvs = NULL;

Regards

--
Julian Anastasov

Re: [PATCH] [v2 net-next] ipvs: reduce kernel stack usage

2019-07-24 Thread Julian Anastasov

> -   IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport),
> -   IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport),
> -   IP_VS_DBG_ADDR(cp->daf, &cp->daddr), ntohs(cp->dport),
> +   IP_VS_DBG_SOCKADDR(cp->af, &cp->caddr, cp->cport),
> +   IP_VS_DBG_SOCKADDR(cp->af, &cp->vaddr, cp->vport),
> +   IP_VS_DBG_SOCKADDR(cp->daf, &cp->daddr, cp->dport),
> cp->flags, refcount_read(&cp->refcnt));
>  
>   ip_vs_conn_stats(cp, svc);
> @@ -886,8 +885,8 @@ static int handle_response_icmp(int af, struct sk_buff 
> *skb,
>   /* Ensure the checksum is correct */
>   if (!skb_csum_unnecessary(skb) && ip_vs_checksum_complete(skb, ihl)) {
>   /* Failed checksum! */
> - IP_VS_DBG_BUF(1, "Forward ICMP: failed checksum from %s!\n",
> -   IP_VS_DBG_ADDR(af, snet));
> + IP_VS_DBG(1, "Forward ICMP: failed checksum from %pISc!\n",
> +   IP_VS_DBG_SOCKADDR(af, snet, 0));
>   goto out;
>   }
>  
> @@ -1220,13 +1219,13 @@ struct ip_vs_conn *ip_vs_new_conn_out(struct 
> ip_vs_service *svc,
>   ip_vs_conn_stats(cp, svc);
>  
>   /* return connection (will be used to handle outgoing packet) */
> - IP_VS_DBG_BUF(6, "New connection RS-initiated:%c c:%s:%u v:%s:%u "
> -   "d:%s:%u conn->flags:%X conn->refcnt:%d\n",
> -   ip_vs_fwd_tag(cp),
> -   IP_VS_DBG_ADDR(cp->af, &cp->caddr), ntohs(cp->cport),
> -   IP_VS_DBG_ADDR(cp->af, &cp->vaddr), ntohs(cp->vport),
> -   IP_VS_DBG_ADDR(cp->af, &cp->daddr), ntohs(cp->dport),
> -   cp->flags, refcount_read(&cp->refcnt));
> + IP_VS_DBG(6, "New connection RS-initiated:%c c:%pISpc v:%pISpc "
> +   "d:%pISp conn->flags:%X conn->refcnt:%d\n",

This should be d:%pISpc, to match the compact form used for c: and v:.

> +   ip_vs_fwd_tag(cp),
> +   IP_VS_DBG_SOCKADDR(cp->af, &cp->caddr, cp->cport),
> +   IP_VS_DBG_SOCKADDR(cp->af, &cp->vaddr, cp->vport),
> +   IP_VS_DBG_SOCKADDR(cp->af, &cp->daddr, cp->dport),
> +   cp->flags, refcount_read(&cp->refcnt));
>   LeaveFunction(12);
>   return cp;
>  }
> @@ -1969,7 +1968,6 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, 
> struct sk_buff *skb,
>  }
>  #endif
>  
> -
>  /*
>   *   Check if it's for virtual services, look it up,
>   *   and send it on its way...
> @@ -1998,10 +1996,10 @@ ip_vs_in(struct netns_ipvs *ipvs, unsigned int 
> hooknum, struct sk_buff *skb, int
> hooknum != NF_INET_LOCAL_OUT) ||
>!skb_dst(skb))) {
>   ip_vs_fill_iph_skb(af, skb, false, );
> - IP_VS_DBG_BUF(12, "packet type=%d proto=%d daddr=%s"
> + IP_VS_DBG(12, "packet type=%d proto=%d daddr=%pISc"
> " ignored in hook %u\n",
> skb->pkt_type, iph.protocol,
> -   IP_VS_DBG_ADDR(af, ), hooknum);
> +   IP_VS_DBG_SOCKADDR(af, , 0), hooknum);
>   return NF_ACCEPT;
>   }
>   /* ipvs enabled in this netns ? */
> diff --git a/net/netfilter/ipvs/ip_vs_ftp.c b/net/netfilter/ipvs/ip_vs_ftp.c
> index cf925906f59b..22099c42e184 100644
> --- a/net/netfilter/ipvs/ip_vs_ftp.c
> +++ b/net/netfilter/ipvs/ip_vs_ftp.c
> @@ -306,9 +306,9 @@ static int ip_vs_ftp_out(struct ip_vs_app *app, struct 
> ip_vs_conn *cp,
>  , ) != 1)
>   return 1;
>  
> - IP_VS_DBG_BUF(7, "EPSV response (%s:%u) -> %s:%u detected\n",
> -   IP_VS_DBG_ADDR(cp->af, ), ntohs(port),
> -   IP_VS_DBG_ADDR(cp->af, >caddr), 0);
> + IP_VS_DBG(7, "EPSV response (%pISpc) -> %pISc detected\n",
> +   IP_VS_DBG_SOCKADDR(cp->af, , port),
> +   IP_VS_DBG_SOCKADDR(cp->af, >caddr, 0));
>   } else {
>   return 1;
>   }
> @@ -510,15 +510,15 @@ static int ip_vs_ftp_in(struct ip_vs_app *app, struct 
> ip_vs_conn *cp,
> , , cp->af,
> , ) == 1) {
>  
> - IP_VS_DBG_BUF(7, "EPRT %s:%u detected\n",
> -   IP_VS_DBG_ADDR(cp->af, ), ntohs(port));
> + IP_VS_DBG(7, "EPRT %pISpc detected\n",
> +   IP_VS_DBG_SOCKADDR(cp->af, , port));
>  
>   /* Now update or create a connection entry for it */
> - IP_VS_DBG_BUF(7, "protocol %s %s:%u %s:%u\n",
> -   ip_vs_proto_name(ipvsh->protocol),
> -   IP_VS_DBG_ADDR(cp->af, ), ntohs(port),
> -   IP_VS_DBG_ADDR(cp->af, >vaddr),
> -   ntohs(cp->vport)-1);
> + IP_VS_DBG(7, "protocol %s %pISpc %pISpc\n",
> +   ip_vs_proto_name(ipvsh->protocol),
> +   IP_VS_DBG_SOCKADDR(cp->af, , port),
> +   IP_VS_DBG_SOCKADDR(cp->af, >vaddr,
> +  htons(ntohs(cp->vport)-1)));
>   } else {
>   return 1;
>   }
> -- 
> 2.20.0

Regards

--
Julian Anastasov

Re: [net-next 1/2] ipvs: batch __ip_vs_cleanup

2019-07-15 Thread Julian Anastasov



Hello,

On Sat, 13 Jul 2019, Haishuang Yan wrote:

> It's better to batch __ip_vs_cleanup to speedup ipvs
> connections dismantle.
> 
> Signed-off-by: Haishuang Yan 
> ---
>  include/net/ip_vs.h |  2 +-
>  net/netfilter/ipvs/ip_vs_core.c | 29 +
>  net/netfilter/ipvs/ip_vs_ctl.c  | 13 ++---
>  3 files changed, 28 insertions(+), 16 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 3759167..93e7a25 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -1324,7 +1324,7 @@ static inline void ip_vs_control_del(struct ip_vs_conn 
> *cp)
>  void ip_vs_control_net_cleanup(struct netns_ipvs *ipvs);
>  void ip_vs_estimator_net_cleanup(struct netns_ipvs *ipvs);
>  void ip_vs_sync_net_cleanup(struct netns_ipvs *ipvs);
> -void ip_vs_service_net_cleanup(struct netns_ipvs *ipvs);
> +void ip_vs_service_nets_cleanup(struct list_head *net_list);
>  
>  /* IPVS application functions
>   * (from ip_vs_app.c)
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 46f06f9..b4d79b7 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -2402,18 +2402,23 @@ static int __net_init __ip_vs_init(struct net *net)
>   return -ENOMEM;
>  }
>  
> -static void __net_exit __ip_vs_cleanup(struct net *net)
> +static void __net_exit __ip_vs_cleanup_batch(struct list_head *net_list)
>  {
> - struct netns_ipvs *ipvs = net_ipvs(net);
> -
> - ip_vs_service_net_cleanup(ipvs);/* ip_vs_flush() with locks */
> - ip_vs_conn_net_cleanup(ipvs);
> - ip_vs_app_net_cleanup(ipvs);
> - ip_vs_protocol_net_cleanup(ipvs);
> - ip_vs_control_net_cleanup(ipvs);
> - ip_vs_estimator_net_cleanup(ipvs);
> - IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> - net->ipvs = NULL;
> + struct netns_ipvs *ipvs;
> + struct net *net;
> + LIST_HEAD(list);
> +
> + ip_vs_service_nets_cleanup(net_list);   /* ip_vs_flush() with locks */
> + list_for_each_entry(net, net_list, exit_list) {

How much faster is it to replace list_for_each_entry in
ops_exit_list() with this one? IPVS can waste time in calls
such as kthread_stop() and del_timer_sync() but I'm not sure
we can solve it easily. What gain do you see in benchmarks?

> + ipvs = net_ipvs(net);
> + ip_vs_conn_net_cleanup(ipvs);
> + ip_vs_app_net_cleanup(ipvs);
> + ip_vs_protocol_net_cleanup(ipvs);
> + ip_vs_control_net_cleanup(ipvs);
> + ip_vs_estimator_net_cleanup(ipvs);
> + IP_VS_DBG(2, "ipvs netns %d released\n", ipvs->gen);
> + net->ipvs = NULL;
> + }
>  }

Regards

--
Julian Anastasov

Re: linux-next: Tree for Jul 3 (netfilter/ipvs/)

2019-07-03 Thread Julian Anastasov

Hello,

On Wed, 3 Jul 2019, Randy Dunlap wrote:

> On 7/3/19 4:49 AM, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Changes since 20190702:
> > 
> 
> on i386:

Oh, well. net/gre.h was being included only via CONFIG_NF_CONNTRACK,
so the build fails when CONFIG_NF_CONNTRACK is not used.

Pablo, should I post v2 or just a fix?

> 
>   CC  net/netfilter/ipvs/ip_vs_core.o
> ../net/netfilter/ipvs/ip_vs_core.c: In function ‘ipvs_gre_decap’:
> ../net/netfilter/ipvs/ip_vs_core.c:1618:22: error: storage size of ‘_greh’ 
> isn’t known
>   struct gre_base_hdr _greh, *greh;
>   ^

Regards

--
Julian Anastasov

Re: [PATCH 4/4] ipvs: reduce kernel stack usage

2019-06-30 Thread Julian Anastasov



Hello,

On Fri, 28 Jun 2019, Arnd Bergmann wrote:

> With the new CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL option, the stack
> usage in the ipvs debug output grows because each instance of
> IP_VS_DBG_BUF() now has its own buffer of 160 bytes that add up
> rather than reusing the stack slots:
> 
> net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_sched_persist':
> net/netfilter/ipvs/ip_vs_core.c:427:1: error: the frame size of 1052 bytes is 
> larger than 1024 bytes [-Werror=frame-larger-than=]
> net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_new_conn_out':
> net/netfilter/ipvs/ip_vs_core.c:1231:1: error: the frame size of 1048 bytes 
> is larger than 1024 bytes [-Werror=frame-larger-than=]
> net/netfilter/ipvs/ip_vs_ftp.c: In function 'ip_vs_ftp_out':
> net/netfilter/ipvs/ip_vs_ftp.c:397:1: error: the frame size of 1104 bytes is 
> larger than 1024 bytes [-Werror=frame-larger-than=]
> net/netfilter/ipvs/ip_vs_ftp.c: In function 'ip_vs_ftp_in':
> net/netfilter/ipvs/ip_vs_ftp.c:555:1: error: the frame size of 1200 bytes is 
> larger than 1024 bytes [-Werror=frame-larger-than=]
> 
> Since printk() already has a way to print IPv4/IPv6 addresses using
> the %pIS format string, use that instead, combined with a macro that
> creates a local sockaddr structure on the stack. These will still
> add up, but the stack frames are now under 200 bytes.
> 
> Signed-off-by: Arnd Bergmann 
> ---
> I'm not sure this actually does what I think it does. Someone
> needs to verify that we correctly print the addresses here.
> I've also only added three files that caused the warning messages
> to be reported. There are still a lot of other instances of
> IP_VS_DBG_BUF() that could be converted the same way after the
> basic idea is confirmed.
> ---
>  include/net/ip_vs.h | 71 +++--
>  net/netfilter/ipvs/ip_vs_core.c | 44 ++--
>  net/netfilter/ipvs/ip_vs_ftp.c  | 20 +-
>  3 files changed, 72 insertions(+), 63 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index 3759167f91f5..3dfbeef67be6 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -227,6 +227,16 @@ static inline const char *ip_vs_dbg_addr(int af, char 
> *buf, size_t buf_len,
>  sizeof(ip_vs_dbg_buf), addr, \
>  _vs_dbg_idx)
>  
> +#define IP_VS_DBG_SOCKADDR4(fam, addr, port) \
> + (struct sockaddr*)&(struct sockaddr_in) \
> + { .sin_family = (fam), .sin_addr = (addr)->in, .sin_port = (port) }
> +#define IP_VS_DBG_SOCKADDR6(fam, addr, port) \
> + (struct sockaddr*)&(struct sockaddr_in6) \
> + { .sin6_family = (fam), .sin6_addr = (addr)->in6, .sin6_port = (port) }
> +#define IP_VS_DBG_SOCKADDR(fam, addr, port) (fam == AF_INET ?   \
> + IP_VS_DBG_SOCKADDR4(fam, addr, port) :  \
> + IP_VS_DBG_SOCKADDR6(fam, addr, port))
> +
>  #define IP_VS_DBG(level, msg, ...)   \
>   do {\
>   if (level <= ip_vs_get_debug_level())   \
> @@ -251,6 +261,7 @@ static inline const char *ip_vs_dbg_addr(int af, char 
> *buf, size_t buf_len,
>  #else/* NO DEBUGGING at ALL */
>  #define IP_VS_DBG_BUF(level, msg...)  do {} while (0)
>  #define IP_VS_ERR_BUF(msg...)  do {} while (0)
> +#define IP_VS_DBG_SOCKADDR(fam, addr, port) NULL
>  #define IP_VS_DBG(level, msg...)  do {} while (0)
>  #define IP_VS_DBG_RL(msg...)  do {} while (0)
>  #define IP_VS_DBG_PKT(level, af, pp, skb, ofs, msg)  do {} while (0)
> @@ -1244,31 +1255,31 @@ static inline void ip_vs_control_del(struct 
> ip_vs_conn *cp)
>  {
>   struct ip_vs_conn *ctl_cp = cp->control;
>   if (!ctl_cp) {
> - IP_VS_ERR_BUF("request control DEL for uncontrolled: "
> -   "%s:%d to %s:%d\n",
> -   IP_VS_DBG_ADDR(cp->af, &cp->caddr),
> -   ntohs(cp->cport),
> -   IP_VS_DBG_ADDR(cp->af, &cp->vaddr),
> -   ntohs(cp->vport));
> + pr_err("request control DEL for uncontrolled: "
> +"%pISp to %pISp\n",

ip_vs_dbg_addr() used compact form (%pI6c), so it would be
better to use %pISc and %pISpc everywhere in IPVS...

Also, note that before now port was printed with %d and
ntohs() was used, now port should be in network order, so:

- ntohs() should be removed
- htons() should be added, if missing. At first look, this case
is not present in IPVS, we have only ntohs() usage
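
A small userspace sketch of the byte order point, in C. It is only an
analogy: format_sockaddr4() and make_sockaddr4() are hypothetical helpers
that mimic how a "%pISpc"-style formatter reads sin_port (network order,
converted with ntohs() at print time), they are not kernel code.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

/* Format an IPv4 sockaddr as "a.b.c.d:port". Like a %pISpc-style
 * formatter, it expects sin_port in network byte order and applies
 * ntohs() itself when printing.
 */
static int format_sockaddr4(const struct sockaddr_in *sin, char *buf,
			    size_t len)
{
	char ip[INET_ADDRSTRLEN];

	assert(sin != NULL && buf != NULL);
	if (!inet_ntop(AF_INET, &sin->sin_addr, ip, sizeof(ip)))
		return -1;
	return snprintf(buf, len, "%s:%u", ip,
			(unsigned int)ntohs(sin->sin_port));
}

/* Build the sockaddr from a port that is already in network order
 * (as cp->cport is), so the caller must not apply ntohs().
 */
static struct sockaddr_in make_sockaddr4(struct in_addr addr,
					 in_port_t port_be)
{
	struct sockaddr_in sin;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr = addr;
	sin.sin_port = port_be;
	return sin;
}
```

Passing ntohs(port) into make_sockaddr4() would swap the bytes twice on
little-endian hosts and print the wrong port, which is the reason the old
ntohs() calls have to go when switching to the sockaddr form.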

Regards

--
Julian Anastasov

Re: memory leak in start_sync_thread

2019-06-11 Thread Julian Anastasov



Hello,

On Mon, 10 Jun 2019, Eric Biggers wrote:

> On Tue, May 28, 2019 at 11:28:05AM -0700, syzbot wrote:
> > Hello,
> > 
> > syzbot found the following crash on:
> > 
> > HEAD commit:cd6c84d8 Linux 5.2-rc2
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=132bd44aa0
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=64479170dcaf0e11
> > dashboard link: https://syzkaller.appspot.com/bug?extid=7e2e50c8adfccd2e5041
> > compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=114b1354a0
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14b7ad26a0
> > 
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+7e2e50c8adfccd2e5...@syzkaller.appspotmail.com
> > 
> > d started: state = MASTER, mcast_ifn = syz_tun, syncid = 0, id = 0
> > BUG: memory leak
> > unreferenced object 0x8881206bf700 (size 32):
> >   comm "syz-executor761", pid 7268, jiffies 4294943441 (age 20.470s)
> >   hex dump (first 32 bytes):
> > 00 40 7c 09 81 88 ff ff 80 45 b8 21 81 88 ff ff  .@|..E.!
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> >   backtrace:
> > [<57619e23>] kmemleak_alloc_recursive
> > include/linux/kmemleak.h:55 [inline]
> > [<57619e23>] slab_post_alloc_hook mm/slab.h:439 [inline]
> > [<57619e23>] slab_alloc mm/slab.c:3326 [inline]
> > [<57619e23>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
> > [<86ce5479>] kmalloc include/linux/slab.h:547 [inline]
> > [<86ce5479>] start_sync_thread+0x5d2/0xe10
> > net/netfilter/ipvs/ip_vs_sync.c:1862
> > [<1a9229cc>] do_ip_vs_set_ctl+0x4c5/0x780
> > net/netfilter/ipvs/ip_vs_ctl.c:2402
> > [<ece457c8>] nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
> > [<ece457c8>] nf_setsockopt+0x4c/0x80
> > net/netfilter/nf_sockopt.c:115
> > [<942f62d4>] ip_setsockopt net/ipv4/ip_sockglue.c:1258 [inline]
> > [<942f62d4>] ip_setsockopt+0x9b/0xb0 net/ipv4/ip_sockglue.c:1238
> > [<a56a8ffd>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
> > [<fa895401>] sock_common_setsockopt+0x38/0x50
> > net/core/sock.c:3130
> > [<95eef4cf>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
> > [<9747cf88>] __do_sys_setsockopt net/socket.c:2089 [inline]
> > [<9747cf88>] __se_sys_setsockopt net/socket.c:2086 [inline]
> > [<9747cf88>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
> > [<ded8ba80>] do_syscall_64+0x76/0x1a0
> > arch/x86/entry/common.c:301
> > [<893b4ac8>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> 
> The bug is that ownership of some memory is passed to a kthread started by
> kthread_run(), but the kthread can be stopped before it actually executes the
> threadfn.  See the code in kernel/kthread.c:
> 
> ret = -EINTR;
> if (!test_bit(KTHREAD_SHOULD_STOP, &self->flags)) {
> cgroup_kthread_ready();
> __kthread_parkme(self);
> ret = threadfn(data);
> }
> 
> So, apparently the thread parameters must always be owned by the owner of the
> kthread, not by the kthread itself.  It seems like this would be a common
> mistake in kernel code; I'm surprised this doesn't come up more...

Thanks! It explains the problem. It was not obvious from the
fact that only tinfo was reported as a leak, nothing for tinfo->sock.

Moving sock_release to owner complicates the locking but
I'll try to fix it in the following days...
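
The ownership rule can be sketched in plain userspace C. This is only a
model under stated assumptions: fake_kthread() is a hypothetical stand-in
for a kthread that may be stopped before threadfn runs, and struct
thread_info here is a simplified placeholder, not the IPVS structure.

```c
#include <assert.h>
#include <stdlib.h>

struct thread_info {
	char *buf;	/* resource handed to the thread */
	int ran;	/* set by the thread function */
};

/* Stand-in for the kthread entry logic: if a stop was requested before
 * the thread got to run, threadfn is never called.
 */
static int fake_kthread(int (*threadfn)(void *), void *data,
			int stop_before_start)
{
	assert(threadfn != NULL);
	if (stop_before_start)
		return -1;	/* plays the role of -EINTR */
	return threadfn(data);
}

/* The thread only uses its parameters, it does not free them. */
static int sync_threadfn(void *data)
{
	struct thread_info *ti = data;

	ti->ran = 1;
	return 0;
}

/* The owner allocates, runs or stops the thread, and frees everything
 * on every path. Returns 1 if threadfn ran, 0 if not, -1 on allocation
 * failure.
 */
static int owner_run(int stop_before_start)
{
	struct thread_info *ti = calloc(1, sizeof(*ti));
	int ran;

	if (!ti)
		return -1;
	ti->buf = malloc(32);
	if (!ti->buf) {
		free(ti);
		return -1;
	}
	fake_kthread(sync_threadfn, ti, stop_before_start);
	ran = ti->ran;
	free(ti->buf);	/* freed by the owner even if threadfn never ran */
	free(ti);
	return ran;
}
```

Because the release is done by the owner, the early-stop path cannot leak
the parameters, at the cost of the owner having to know when the thread is
really done with them.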

Regards

--
Julian Anastasov

Re: memory leak in nf_hook_entries_grow

2019-06-03 Thread Julian Anastasov



Hello,

On Mon, 3 Jun 2019, syzbot wrote:

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:3ab4436f Merge tag 'nfsd-5.2-1' of git://linux-nfs.org/~bf..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=15feaf82a0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=50393f7bfe444ff6
> dashboard link: https://syzkaller.appspot.com/bug?extid=722da59ccb264bc19910
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12f02772a0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1657b80ea0
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+722da59ccb264bc19...@syzkaller.appspotmail.com
> 
> 035][ T7273] IPVS: ftp: loaded support on port[0] = 21
> BUG: memory leak
> unreferenced object 0x88810acd8a80 (size 96):
>  comm "syz-executor073", pid 7254, jiffies 4294950560 (age 22.250s)
>  hex dump (first 32 bytes):
>02 00 00 00 00 00 00 00 50 8b bb 82 ff ff ff ff  P...
>00 00 00 00 00 00 00 00 00 77 bb 82 ff ff ff ff  .w..
>  backtrace:
>[<13db61f1>] kmemleak_alloc_recursive include/linux/kmemleak.h:55
>[inline]
>[<13db61f1>] slab_post_alloc_hook mm/slab.h:439 [inline]
>[<13db61f1>] slab_alloc_node mm/slab.c:3269 [inline]
>[<13db61f1>] kmem_cache_alloc_node_trace+0x15b/0x2a0 mm/slab.c:3597
>[<1a27307d>] __do_kmalloc_node mm/slab.c:3619 [inline]
>[<1a27307d>] __kmalloc_node+0x38/0x50 mm/slab.c:3627
>[<25054add>] kmalloc_node include/linux/slab.h:590 [inline]
>[<25054add>] kvmalloc_node+0x4a/0xd0 mm/util.c:431
>[<50d1bc00>] kvmalloc include/linux/mm.h:637 [inline]
>[<50d1bc00>] kvzalloc include/linux/mm.h:645 [inline]
>[<50d1bc00>] allocate_hook_entries_size+0x3b/0x60
>net/netfilter/core.c:61
>[<e8abe142>] nf_hook_entries_grow+0xae/0x270
>net/netfilter/core.c:128
>[<4b94797c>] __nf_register_net_hook+0x9a/0x170
>net/netfilter/core.c:337
>[<d1545cbc>] nf_register_net_hook+0x34/0xc0
>net/netfilter/core.c:464
>[<876c9b55>] nf_register_net_hooks+0x53/0xc0
>net/netfilter/core.c:480
>[<2ea868e0>] __ip_vs_init+0xe8/0x170
>net/netfilter/ipvs/ip_vs_core.c:2280

After commit "ipvs: Fix use-after-free in ip_vs_in" we planned
to call nf_register_net_hooks() only when a rule is created, but this
is net-next material and we should not leave a leak in the error path.
I'll post a patch that adds an .init handler for ipvs_core_dev_ops, so
that nf_register_net_hooks() is called there.
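
A rough userspace model of why pairing .init and .exit in the same ops
helps, in C. Everything here is hypothetical (struct ops, setup_all() and
the hooks_* helpers are illustration only); it assumes the setup loop
unwinds already initialized ops in reverse order when a later init fails.

```c
#include <assert.h>
#include <stddef.h>

struct ops {
	int (*init)(void);
	void (*exit)(void);
};

/* Run every init in order. On failure, call exit for the entries that
 * already succeeded, in reverse order, and return the error.
 */
static int setup_all(const struct ops *ops, size_t n)
{
	size_t i;
	int err;

	assert(ops != NULL || n == 0);
	for (i = 0; i < n; i++) {
		err = ops[i].init ? ops[i].init() : 0;
		if (err) {
			while (i-- > 0)
				if (ops[i].exit)
					ops[i].exit();
			return err;
		}
	}
	return 0;
}

static int hooks_registered;	/* stands in for registered nf hooks */

static int hooks_init(void)
{
	hooks_registered = 1;
	return 0;
}

static void hooks_exit(void)
{
	hooks_registered = 0;
}

static int failing_init(void)
{
	return -12;	/* e.g. -ENOMEM from a later init */
}
```

If the registration lives in an init whose matching exit sits in different
ops, a failure in between can skip the unregistration, which is the kind of
error path leak reported here.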

> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
> 
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches

Regards

--
Julian Anastasov

Re: [PATCH v4] ipvs: add checksum support for gue encapsulation

2019-05-30 Thread Julian Anastasov



Hello,

On Thu, 30 May 2019, Jacky Hu wrote:

> Add checksum support for gue encapsulation with the tun_flags parameter,
> which could be one of the values below:
> IP_VS_TUNNEL_ENCAP_FLAG_NOCSUM
> IP_VS_TUNNEL_ENCAP_FLAG_CSUM
> IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM
> 
> Signed-off-by: Jacky Hu 

Looks good to me, thanks!

Signed-off-by: Julian Anastasov 

> ---
> v4->v3:
>   1) defer pd assignment after data += GUE_LEN_PRIV
> 
> v3->v2:
>   1) fixed CHECK: spaces preferred around that '<<' (ctx:VxV)
> 
> v2->v1:
>   1) removed unnecessary changes to ip_vs_core.c
>   2) use correct nla_get/put function for tun_flags
>   3) use correct gue hdrlen for skb_push in ipvs_gue_encap
>   4) moved declaration of gue_hdrlen and gue_optlen
> 
>  include/net/ip_vs.h |   2 +
>  include/uapi/linux/ip_vs.h  |   7 ++
>  net/netfilter/ipvs/ip_vs_ctl.c  |  11 ++-
>  net/netfilter/ipvs/ip_vs_xmit.c | 143 
>  4 files changed, 146 insertions(+), 17 deletions(-)
> 
> diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
> index b01a94ebfc0e..cb1ad0cc5c7b 100644
> --- a/include/net/ip_vs.h
> +++ b/include/net/ip_vs.h
> @@ -603,6 +603,7 @@ struct ip_vs_dest_user_kern {
>  
>   u16 tun_type;   /* tunnel type */
>   __be16  tun_port;   /* tunnel port */
> + u16 tun_flags;  /* tunnel flags */
>  };
>  
>  
> @@ -665,6 +666,7 @@ struct ip_vs_dest {
>   atomic_t    last_weight;    /* server latest weight */
>   __u16   tun_type;   /* tunnel type */
>   __be16  tun_port;   /* tunnel port */
> + __u16   tun_flags;  /* tunnel flags */
>  
>   refcount_t  refcnt; /* reference counter */
>   struct ip_vs_stats  stats;  /* statistics */
> diff --git a/include/uapi/linux/ip_vs.h b/include/uapi/linux/ip_vs.h
> index e34f436fc79d..e4f18061a4fd 100644
> --- a/include/uapi/linux/ip_vs.h
> +++ b/include/uapi/linux/ip_vs.h
> @@ -131,6 +131,11 @@ enum {
>   IP_VS_CONN_F_TUNNEL_TYPE_MAX,
>  };
>  
> +/* Tunnel encapsulation flags */
> +#define IP_VS_TUNNEL_ENCAP_FLAG_NOCSUM   (0)
> +#define IP_VS_TUNNEL_ENCAP_FLAG_CSUM (1 << 0)
> +#define IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM  (1 << 1)
> +
>  /*
>   *   The struct ip_vs_service_user and struct ip_vs_dest_user are
>   *   used to set IPVS rules through setsockopt.
> @@ -403,6 +408,8 @@ enum {
>  
>   IPVS_DEST_ATTR_TUN_PORT,/* tunnel port */
>  
> + IPVS_DEST_ATTR_TUN_FLAGS,   /* tunnel flags */
> +
>   __IPVS_DEST_ATTR_MAX,
>  };
>  
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index d5847e06350f..ad19ac08622f 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -893,6 +893,7 @@ __ip_vs_update_dest(struct ip_vs_service *svc, struct 
> ip_vs_dest *dest,
>   /* set the tunnel info */
>   dest->tun_type = udest->tun_type;
>   dest->tun_port = udest->tun_port;
> + dest->tun_flags = udest->tun_flags;
>  
>   /* set the IP_VS_CONN_F_NOOUTPUT flag if not masquerading/NAT */
>   if ((conn_flags & IP_VS_CONN_F_FWD_MASK) != IP_VS_CONN_F_MASQ) {
> @@ -2967,6 +2968,7 @@ static const struct nla_policy 
> ip_vs_dest_policy[IPVS_DEST_ATTR_MAX + 1] = {
>   [IPVS_DEST_ATTR_ADDR_FAMILY]= { .type = NLA_U16 },
>   [IPVS_DEST_ATTR_TUN_TYPE]   = { .type = NLA_U8 },
>   [IPVS_DEST_ATTR_TUN_PORT]   = { .type = NLA_U16 },
> + [IPVS_DEST_ATTR_TUN_FLAGS]  = { .type = NLA_U16 },
>  };
>  
>  static int ip_vs_genl_fill_stats(struct sk_buff *skb, int container_type,
> @@ -3273,6 +3275,8 @@ static int ip_vs_genl_fill_dest(struct sk_buff *skb, 
> struct ip_vs_dest *dest)
>  dest->tun_type) ||
>   nla_put_be16(skb, IPVS_DEST_ATTR_TUN_PORT,
>dest->tun_port) ||
> + nla_put_u16(skb, IPVS_DEST_ATTR_TUN_FLAGS,
> + dest->tun_flags) ||
>   nla_put_u32(skb, IPVS_DEST_ATTR_U_THRESH, dest->u_threshold) ||
>   nla_put_u32(skb, IPVS_DEST_ATTR_L_THRESH, dest->l_threshold) ||
>   nla_put_u32(skb, IPVS_DEST_ATTR_ACTIVE_CONNS,
> @@ -3393,7 +3397,8 @@ static int ip_vs_genl_parse_dest(struct 
> ip_vs_dest_user_kern *udest,
>   /* If a full entry was requested, check for the additional fields */
>   if (full_entry) {
>   struct nlattr *nla_fwd, *nla_weight, *nla_u_thresh,
> -

Re: [PATCH v3] ipvs: add checksum support for gue encapsulation

2019-05-29 Thread Julian Anastasov



Hello,

On Wed, 29 May 2019, Jacky Hu wrote:

>   gueh = (struct guehdr *)skb->data;
>  
>   gueh->control = 0;
>   gueh->version = 0;
> - gueh->hlen = 0;
> + gueh->hlen = optlen >> 2;
>   gueh->flags = 0;
>   gueh->proto_ctype = *next_protocol;
>  
> + data = &gueh[1];
> +
> + if (need_priv) {
> + __be32 *flags = data;
> + u16 csum_start = skb_checksum_start_offset(skb);
> + __be16 *pd = data;

Packet tests show another problem. The fix is to defer the
pd assignment until after data += GUE_LEN_PRIV:

__be16 *pd;

> +
> + gueh->flags |= GUE_FLAG_PRIV;
> + *flags = 0;
> + data += GUE_LEN_PRIV;
> +
> + if (csum_start < hdrlen)
> + return -EINVAL;
> +
> + csum_start -= hdrlen;

pd = data;

> + pd[0] = htons(csum_start);
> + pd[1] = htons(csum_start + skb->csum_offset);
> +
> + if (!skb_is_gso(skb)) {
> + skb->ip_summed = CHECKSUM_NONE;
> + skb->encapsulation = 0;
> + }
> +
> + *flags |= GUE_PFLAG_REMCSUM;
> + data += GUE_PLEN_REMCSUM;
> + }
> +
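
To show why the order matters, here is a userspace sketch in C of the
option layout: the private flags word comes first and the remote checksum
option follows it. The constants LEN_PRIV, PLEN_REMCSUM and PFLAG_REMCSUM
are assumptions standing in for GUE_LEN_PRIV, GUE_PLEN_REMCSUM and
GUE_PFLAG_REMCSUM, and write_remcsum_opt() is a hypothetical helper.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LEN_PRIV	4		/* assumed: private flags word */
#define PLEN_REMCSUM	4		/* assumed: two 16-bit fields */
#define PFLAG_REMCSUM	(1U << 31)	/* assumed flag bit, host order */

/* Write the private flags followed by the remcsum option into data.
 * Returns the number of option bytes written.
 */
static size_t write_remcsum_opt(uint8_t *data, uint16_t csum_start,
				uint16_t csum_offset)
{
	uint32_t flags = htonl(PFLAG_REMCSUM);
	uint16_t pd[2];
	uint8_t *p = data;

	assert(data != NULL);
	memcpy(p, &flags, LEN_PRIV);
	p += LEN_PRIV;		/* advance past the flags first ... */

	/* ... and only then place the option, so it does not overwrite
	 * the flags word.
	 */
	pd[0] = htons(csum_start);
	pd[1] = htons((uint16_t)(csum_start + csum_offset));
	memcpy(p, pd, PLEN_REMCSUM);
	p += PLEN_REMCSUM;

	return (size_t)(p - data);
}
```

If the option pointer were taken before the advance, the two 16-bit fields
would land on top of the flags word, which matches the problem seen in the
packet tests.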

Regards

--
Julian Anastasov

Re: [PATCH v2] ipvs: add checksum support for gue encapsulation

2019-05-28 Thread Julian Anastasov

Hello,

On Sun, 26 May 2019, Jacky Hu wrote:

> +/* Tunnel encapsulation flags */
> +#define IP_VS_TUNNEL_ENCAP_FLAG_NOCSUM   (0)
> +#define IP_VS_TUNNEL_ENCAP_FLAG_CSUM (1<<0)
> +#define IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM  (1<<1)

scripts/checkpatch.pl --strict file.patch
reports some issues that you should resolve for v3.
Otherwise, the patch looks good to me.

Regards

--
Julian Anastasov

Re: [PATCH v1] ipvs: add checksum support for gue encapsulation

2019-05-24 Thread Julian Anastasov

re.

> + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM) &&
> + skb->ip_summed == CHECKSUM_PARTIAL) {
> + gue_optlen += GUE_PLEN_REMCSUM + GUE_LEN_PRIV;
> + }
> + gue_hdrlen = sizeof(struct guehdr) + gue_optlen;
> +
> + max_headroom += sizeof(struct udphdr) + gue_hdrlen;
> + }
>  
>   /* We only care about the df field if sysctl_pmtu_disc(ipvs) is set */
>   dfp = sysctl_pmtu_disc(ipvs) ?  : NULL;
> @@ -1105,8 +1164,17 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct 
> ip_vs_conn *cp,
>   goto tx_error;
>  
>   gso_type = __tun_gso_type_mask(AF_INET, cp->af);
> - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE)
> - gso_type |= SKB_GSO_UDP_TUNNEL;
> + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) {
> + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) ||
> + (tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM))
> + gso_type |= SKB_GSO_UDP_TUNNEL_CSUM;
> + else
> + gso_type |= SKB_GSO_UDP_TUNNEL;
> + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM) &&
> + skb->ip_summed == CHECKSUM_PARTIAL) {
> + gso_type |= SKB_GSO_TUNNEL_REMCSUM;
> + }
> + }
>  
>   if (iptunnel_handle_offloads(skb, gso_type))
>   goto tx_error;
> @@ -1115,8 +1183,19 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct 
> ip_vs_conn *cp,
>  
>   skb_set_inner_ipproto(skb, next_protocol);
>  
> - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE)
> - ipvs_gue_encap(net, skb, cp, _protocol);
> + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) {
> + bool check = false;
> +
> + if (ipvs_gue_encap(net, skb, cp, _protocol))
> + goto tx_error;
> +
> + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) ||
> + (tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM))
> + check = true;
> +
> + udp_set_csum(!check, skb, saddr, cp->daddr.ip, skb->len);
> + }
> +
>  
>   skb_push(skb, sizeof(struct iphdr));
>   skb_reset_network_header(skb);
> @@ -1174,6 +1253,8 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct 
> ip_vs_conn *cp,
>   unsigned int max_headroom;  /* The extra header space needed */
>   int ret, local;
>   int tun_type, gso_type;
> + int tun_flags;
> + size_t gue_hdrlen, gue_optlen = 0;
>  
>   EnterFunction(10);
>  
> @@ -1197,9 +1278,17 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct 
> ip_vs_conn *cp,
>   max_headroom = LL_RESERVED_SPACE(tdev) + sizeof(struct ipv6hdr);
>  
>   tun_type = cp->dest->tun_type;
> + tun_flags = cp->dest->tun_flags;
> +
> + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) {
> + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM) &&
> + skb->ip_summed == CHECKSUM_PARTIAL) {

Same, we can move gue_hdrlen and gue_optlen here.

> + gue_optlen += GUE_PLEN_REMCSUM + GUE_LEN_PRIV;
> + }
> + gue_hdrlen = sizeof(struct guehdr) + gue_optlen;
>  
> - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE)
> - max_headroom += sizeof(struct udphdr) + sizeof(struct guehdr);
> + max_headroom += sizeof(struct udphdr) + gue_hdrlen;
> + }
>  
>   skb = ip_vs_prepare_tunneled_skb(skb, cp->af, max_headroom,
>_protocol, _len,
> @@ -1208,8 +1297,17 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct 
> ip_vs_conn *cp,
>   goto tx_error;
>  
>   gso_type = __tun_gso_type_mask(AF_INET6, cp->af);
> - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE)
> - gso_type |= SKB_GSO_UDP_TUNNEL;
> + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) {
> + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) ||
> + (tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM))
> + gso_type |= SKB_GSO_UDP_TUNNEL_CSUM;
> + else
> + gso_type |= SKB_GSO_UDP_TUNNEL;
> + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM) &&
> + skb->ip_summed == CHECKSUM_PARTIAL) {
> + gso_type |= SKB_GSO_TUNNEL_REMCSUM;
> + }
> + }
>  
>   if (iptunnel_handle_offloads(skb, gso_type))
>   goto tx_error;
> @@ -1218,8 +1316,18 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct 
> ip_vs_conn *cp,
>  
>   skb_set_inner_ipproto(skb, next_protocol);
>  
> - if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE)
> - ipvs_gue_encap(net, skb, cp, _protocol);
> + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE) {
> + bool check = false;
> +
> + if (ipvs_gue_encap(net, skb, cp, _protocol))
> + goto tx_error;
> +
> + if ((tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_CSUM) ||
> + (tun_flags & IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM))
> + check = true;
> +
> + udp6_set_csum(!check, skb, , >daddr.in6, skb->len);
> + }
>  
>   skb_push(skb, sizeof(struct ipv6hdr));
>   skb_reset_network_header(skb);
> -- 
> 2.21.0

Regards

--
Julian Anastasov

Re: [PATCH v2] ipvs: Fix use-after-free in ip_vs_in

2019-05-19 Thread Julian Anastasov



Hello,

On Fri, 17 May 2019, YueHaibing wrote:

> BUG: KASAN: use-after-free in ip_vs_in.part.29+0xe8/0xd20 [ip_vs]
> Read of size 4 at addr 8881e9b26e2c by task sshd/5603
> 
> CPU: 0 PID: 5603 Comm: sshd Not tainted 4.19.39+ #30
> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
> Call Trace:
>  dump_stack+0x71/0xab
>  print_address_description+0x6a/0x270
>  kasan_report+0x179/0x2c0
>  ip_vs_in.part.29+0xe8/0xd20 [ip_vs]
>  ip_vs_in+0xd8/0x170 [ip_vs]
>  nf_hook_slow+0x5f/0xe0
>  __ip_local_out+0x1d5/0x250
>  ip_local_out+0x19/0x60
>  __tcp_transmit_skb+0xba1/0x14f0
>  tcp_write_xmit+0x41f/0x1ed0
>  ? _copy_from_iter_full+0xca/0x340
>  __tcp_push_pending_frames+0x52/0x140
>  tcp_sendmsg_locked+0x787/0x1600
>  ? tcp_sendpage+0x60/0x60
>  ? inet_sk_set_state+0xb0/0xb0
>  tcp_sendmsg+0x27/0x40
>  sock_sendmsg+0x6d/0x80
>  sock_write_iter+0x121/0x1c0
>  ? sock_sendmsg+0x80/0x80
>  __vfs_write+0x23e/0x370
>  vfs_write+0xe7/0x230
>  ksys_write+0xa1/0x120
>  ? __ia32_sys_read+0x50/0x50
>  ? __audit_syscall_exit+0x3ce/0x450
>  do_syscall_64+0x73/0x200
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7ff6f6147c60
> Code: 73 01 c3 48 8b 0d 28 12 2d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 
> 00 00 83 3d 5d 73 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 
> 31 c3 48 83
> RSP: 002b:7ffd772ead18 EFLAGS: 0246 ORIG_RAX: 0001
> RAX: ffda RBX: 0034 RCX: 7ff6f6147c60
> RDX: 0034 RSI: 55df30a31270 RDI: 0003
> RBP: 55df30a31270 R08:  R09: 
> R10: 7ffd772ead70 R11: 0246 R12: 7ffd772ead74
> R13: 7ffd772eae20 R14: 7ffd772eae24 R15: 55df2f12ddc0
> 
> Allocated by task 6052:
>  kasan_kmalloc+0xa0/0xd0
>  __kmalloc+0x10a/0x220
>  ops_init+0x97/0x190
>  register_pernet_operations+0x1ac/0x360
>  register_pernet_subsys+0x24/0x40
>  0xc0ea016d
>  do_one_initcall+0x8b/0x253
>  do_init_module+0xe3/0x335
>  load_module+0x2fc0/0x3890
>  __do_sys_finit_module+0x192/0x1c0
>  do_syscall_64+0x73/0x200
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Freed by task 6067:
>  __kasan_slab_free+0x130/0x180
>  kfree+0x90/0x1a0
>  ops_free_list.part.7+0xa6/0xc0
>  unregister_pernet_operations+0x18b/0x1f0
>  unregister_pernet_subsys+0x1d/0x30
>  ip_vs_cleanup+0x1d/0xd2f [ip_vs]
>  __x64_sys_delete_module+0x20c/0x300
>  do_syscall_64+0x73/0x200
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> The buggy address belongs to the object at 8881e9b26600 which belongs to 
> the cache kmalloc-4096 of size 4096
> The buggy address is located 2092 bytes inside of 4096-byte region 
> [8881e9b26600, 8881e9b27600)
> The buggy address belongs to the page:
> page:ea0007a6c800 count:1 mapcount:0 mapping:888107c0e600 index:0x0 
> compound_mapcount: 0
> flags: 0x17c0008100(slab|head)
> raw: 0017c0008100 dead0100 dead0200 888107c0e600
> raw:  80070007 0001 
> page dumped because: kasan: bad access detected
> 
> While unregistering the ipvs module, ops_free_list calls
> __ip_vs_cleanup, which then calls nf_unregister_net_hooks to
> remove the nf hook entries. This needs an RCU grace period to finish,
> but net->ipvs is set to NULL immediately, which triggers a
> NULL pointer dereference when a packet is hooked and handled
> by ip_vs_in, where net->ipvs is dereferenced.
> 
> Another scenario is that ops_free_list calls ops_free to free the
> net_generic directly once __ip_vs_cleanup has finished; a
> later call to ip_vs_in then triggers a use-after-free.
> 
> This patch moves nf_unregister_net_hooks from __ip_vs_cleanup()
> to __ip_vs_dev_cleanup(), where rcu_barrier() is called by
> unregister_pernet_device -> unregister_pernet_operations,
> which provides the needed grace period.
> 
> Reported-by: Hulk Robot 
> Fixes: efe41606184e ("ipvs: convert to use pernet nf_hook api")
> Suggested-by: Julian Anastasov 
> Signed-off-by: YueHaibing 

Looks good to me, thanks!

Acked-by: Julian Anastasov 

It should restore the order of unregistrations from before
the mentioned commit and ensure a grace period before stopping
the traffic and unregistering ipvs_core_ops, where traffic is not
expected.
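
To illustrate the ordering described above, here is a minimal user-space sketch. It is not kernel code: struct model_ns and all model_* names are hypothetical, the "grace period" is reduced to checking that no handler is running, and the state pointer only stands in for the data reached via net->ipvs.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-namespace state; stands in for data reached via net->ipvs. */
struct model_ns {
	bool hook_registered;   /* packets can still reach the handler */
	int  readers_in_flight; /* handlers currently running */
	int *state;             /* NULL once freed */
};

/* Packet path: only runs while the hook is registered. Returns true if handled. */
static bool model_packet_in(struct model_ns *ns)
{
	if (!ns->hook_registered)
		return false;
	ns->readers_in_flight++;
	(*ns->state)++;         /* would be a use-after-free if state were freed */
	ns->readers_in_flight--;
	return true;
}

/* Step 1 (device cleanup): stop new packets from entering the handler. */
static void model_dev_cleanup(struct model_ns *ns)
{
	ns->hook_registered = false;
}

/* Step 2: stand-in for the grace period; true when no handler is running. */
static bool model_grace_period_done(const struct model_ns *ns)
{
	return ns->readers_in_flight == 0;
}

/* Step 3 (subsys cleanup): free only after steps 1 and 2; false if refused. */
static bool model_subsys_cleanup(struct model_ns *ns)
{
	if (ns->hook_registered || !model_grace_period_done(ns))
		return false;
	free(ns->state);
	ns->state = NULL;
	return true;
}
```

The point of the sketch is only the order: the hook goes away first, then the wait, then the free. Freeing while the hook is still registered is refused by the model.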

> ---
> v2: fix by moving nf_unregister_net_hooks from __ip_vs_cleanup() to 
> __ip_vs_dev_cleanup()
> ---
>  net/netfilter/ipvs/ip_vs_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 14457551bcb4..8ebf21149ec3 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
>

Re: [PATCH v6] ipvs: allow tunneling with gue encapsulation

2019-03-26 Thread Julian Anastasov



Hello,

On Tue, 26 Mar 2019, Jacky Hu wrote:

> ipip packets are blocked in some public cloud environments. This patch
> allows gue encapsulation with the tunneling method, which makes
> tunneling work in those environments.
> 
> Signed-off-by: Jacky Hu 
> ---
>  include/net/ip_vs.h |  5 ++
>  include/uapi/linux/ip_vs.h  | 11 +
>  net/netfilter/ipvs/ip_vs_ctl.c  | 35 +-
>  net/netfilter/ipvs/ip_vs_xmit.c | 86 +++--
>  4 files changed, 132 insertions(+), 5 deletions(-)
> 

> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 473cce2a5231..36819c2fabf1 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c

> @@ -1054,11 +1104,21 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, struct 
> ip_vs_conn *cp,
>   if (IS_ERR(skb))
>   goto tx_error;
>  
> - if (iptunnel_handle_offloads(skb, __tun_gso_type_mask(AF_INET, cp->af)))
> + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE)
> + gso_type = SKB_GSO_UDP_TUNNEL;
> + else
> + gso_type = __tun_gso_type_mask(AF_INET, cp->af);

Looks like we should request the IP segmentation. Looking
at skb_udp_tunnel_segment() we call the proper gso_inner_segment
handler but ipip_gso_segment() still requires its bit in gso_type.
So, in both functions we should do:

	gso_type = __tun_gso_type_mask();
	if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE)
		gso_type |= SKB_GSO_UDP_TUNNEL;

Probably, it can be tested with local client sending large
TCP packets...
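
A small user-space sketch of the suggestion above, to show why OR-ing is different from replacing the mask. Everything named model_* or MODEL_* is hypothetical: the bit values are placeholders and not the kernel's SKB_GSO_* values, and model_tun_gso_type_mask() only assumes that the mask helper picks an IP-in-IP bit from the outer address family.

```c
/* Placeholder address families and GSO bits; not the kernel's values. */
enum model_af { MODEL_AF_INET, MODEL_AF_INET6 };

enum {
	MODEL_GSO_IPXIP4     = 1 << 0,
	MODEL_GSO_IPXIP6     = 1 << 1,
	MODEL_GSO_UDP_TUNNEL = 1 << 2,
};

enum model_tun_type { MODEL_TUN_IPIP, MODEL_TUN_GUE };

/* Assumed behaviour of the mask helper: choose the bit by outer family. */
static int model_tun_gso_type_mask(enum model_af encaps_af)
{
	return encaps_af == MODEL_AF_INET ? MODEL_GSO_IPXIP4
					  : MODEL_GSO_IPXIP6;
}

/* Keep the IP-in-IP bit and add the UDP tunnel bit on top for GUE. */
static int model_gso_type(enum model_af encaps_af, enum model_tun_type tun_type)
{
	int gso_type = model_tun_gso_type_mask(encaps_af);

	if (tun_type == MODEL_TUN_GUE)
		gso_type |= MODEL_GSO_UDP_TUNNEL;
	return gso_type;
}
```

With the if/else form from the patch, the GUE case would carry only the UDP tunnel bit, so the inner IP segmentation handler would not see the bit it requires.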

> @@ -1134,17 +1197,32 @@ ip_vs_tunnel_xmit_v6(struct sk_buff *skb, struct 
> ip_vs_conn *cp,
>*/
>   max_headroom = LL_RESERVED_SPACE(tdev) + sizeof(struct ipv6hdr);
>  
> + tun_type = cp->dest->tun_type;
> +
> + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE)
> + max_headroom += sizeof(struct udphdr) + sizeof(struct guehdr);
> +
>   skb = ip_vs_prepare_tunneled_skb(skb, cp->af, max_headroom,
>  &next_protocol, &payload_len,
>  &dsfield, &ttl, NULL);
>   if (IS_ERR(skb))
>   goto tx_error;
>  
> - if (iptunnel_handle_offloads(skb, __tun_gso_type_mask(AF_INET6, 
> cp->af)))
> + if (tun_type == IP_VS_CONN_F_TUNNEL_TYPE_GUE)
> + gso_type = SKB_GSO_UDP_TUNNEL;
> + else
> + gso_type = __tun_gso_type_mask(AF_INET6, cp->af);

Here too

> + if (iptunnel_handle_offloads(skb, gso_type))
>   goto tx_error;

Regards

--
Julian Anastasov

Re: [PATCH v3] ipvs: fix race between ip_vs_conn_new() and ip_vs_del_dest()

2018-07-25 Thread Julian Anastasov



Hello,

On Wed, 25 Jul 2018, Tan Hu wrote:

> We came across an infinite loop in ipvs when using ipvs in a docker
> environment.
> 
> When ipvs receives new packets and cannot find an ipvs connection,
> it will create a new connection, then if the dest is unavailable
> (i.e. IP_VS_DEST_F_AVAILABLE is not set), the packet will be dropped silently.
> 
> But if the dropped packet is the first packet of this connection,
> the connection control timer never has a chance to start and the
> ipvs connection cannot be released. This will lead to a memory leak, or
> an infinite loop in cleanup_net() when the net namespace is released, like
> this:
> 
> ip_vs_conn_net_cleanup at a0a9f31a [ip_vs]
> __ip_vs_cleanup at a0a9f60a [ip_vs]
> ops_exit_list at 81567a49
> cleanup_net at 81568b40
> process_one_work at 810a851b
> worker_thread at 810a9356
> kthread at 810b0b6f
> ret_from_fork at 81697a18
> 
> race condition:
> CPU1                           CPU2
> ip_vs_in()
>   ip_vs_conn_new()
>                                ip_vs_del_dest()
>                                  __ip_vs_unlink_dest()
>                                    ~IP_VS_DEST_F_AVAILABLE
>   cp->dest && !IP_VS_DEST_F_AVAILABLE
>   __ip_vs_conn_put
> ...
> cleanup_net  ---> infinite looping
> 
> Fix this by checking whether the timer already started.
> 
> Signed-off-by: Tan Hu 
> Reviewed-by: Jiang Biao 

v3 looks good to me,

Acked-by: Julian Anastasov 

    Simon and Pablo, this can be applied to ipvs/nf tree...
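
For readers following the thread, a user-space sketch of the put logic in the patch. It is only a model: struct model_conn and the model_* helpers are hypothetical, the "timer" is a flag, and model_conn_put() merely assumes that the real ip_vs_conn_put() (re)starts the connection timer before dropping the reference.

```c
#include <stdbool.h>

/* Hypothetical connection; fields stand in for the real ip_vs_conn state. */
struct model_conn {
	int  refcnt;
	bool timer_pending;
	bool one_packet;        /* stands in for IP_VS_CONN_F_ONE_PACKET */
	bool expire_requested;  /* set where ip_vs_conn_expire_now() would run */
};

/* Stand-in for __ip_vs_conn_put(): drop the reference, leave the timer alone. */
static void model_conn_put_noarm(struct model_conn *cp)
{
	cp->refcnt--;
}

/* Stand-in for ip_vs_conn_put(): arm the timer, then drop the reference. */
static void model_conn_put(struct model_conn *cp)
{
	cp->timer_pending = true;
	cp->refcnt--;
}

/* Drop path taken when the destination is not available. */
static void model_drop_for_unavailable_dest(struct model_conn *cp,
					    bool expire_nodest)
{
	/* Read the flag first; in the kernel the put may release the conn. */
	bool one_packet = cp->one_packet;

	if (cp->timer_pending)
		model_conn_put_noarm(cp);   /* timer already running */
	else
		model_conn_put(cp);         /* first packet: start the timer */

	if (expire_nodest && !one_packet)
		cp->expire_requested = true;
}
```

The first-packet case is the one that matters: without arming the timer, nothing would ever expire the new connection.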

> ---
> v2: fix use-after-free in CONN_ONE_PACKET case suggested by Julian Anastasov
> v3: remove trailing whitespace for patch checking 
> 
>  net/netfilter/ipvs/ip_vs_core.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 0679dd1..a17104f 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -1972,13 +1972,20 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, 
> struct sk_buff *skb,
>   if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>   /* the destination server is not available */
> 
> - if (sysctl_expire_nodest_conn(ipvs)) {
> + __u32 flags = cp->flags;
> +
> + /* when timer already started, silently drop the packet.*/
> + if (timer_pending(&cp->timer))
> + __ip_vs_conn_put(cp);
> + else
> + ip_vs_conn_put(cp);
> +
> + if (sysctl_expire_nodest_conn(ipvs) &&
> + !(flags & IP_VS_CONN_F_ONE_PACKET)) {
>   /* try to expire the connection immediately */
>   ip_vs_conn_expire_now(cp);
>   }
> - /* don't restart its timer, and silently
> -drop the packet. */
> - __ip_vs_conn_put(cp);
> +
>   return NF_DROP;
>   }
> 
> --
> 1.8.3.1

Regards

--
Julian Anastasov

Re: [PATCH v2] ipvs: fix race between ip_vs_conn_new() and ip_vs_del_dest()

2018-07-24 Thread Julian Anastasov



Hello,

On Wed, 25 Jul 2018, Tan Hu wrote:

> We came across an infinite loop in ipvs when using ipvs in a docker
> environment.
> 
> When ipvs receives new packets and cannot find an ipvs connection,
> it will create a new connection, then if the dest is unavailable
> (i.e. IP_VS_DEST_F_AVAILABLE is not set), the packet will be dropped silently.
> 
> But if the dropped packet is the first packet of this connection,
> the connection control timer never has a chance to start and the
> ipvs connection cannot be released. This will lead to a memory leak, or
> an infinite loop in cleanup_net() when the net namespace is released, like
> this:
> 
> ip_vs_conn_net_cleanup at a0a9f31a [ip_vs]
> __ip_vs_cleanup at a0a9f60a [ip_vs]
> ops_exit_list at 81567a49
> cleanup_net at 81568b40
> process_one_work at 810a851b
> worker_thread at 810a9356
> kthread at 810b0b6f
> ret_from_fork at 81697a18
> 
> race condition:
> CPU1                           CPU2
> ip_vs_in()
>   ip_vs_conn_new()
>                                ip_vs_del_dest()
>                                  __ip_vs_unlink_dest()
>                                    ~IP_VS_DEST_F_AVAILABLE
>   cp->dest && !IP_VS_DEST_F_AVAILABLE
>   __ip_vs_conn_put
> ...
> cleanup_net  ---> infinite looping
> 
> Fix this by checking whether the timer already started.
> 
> Signed-off-by: Tan Hu 
> Reviewed-by: Jiang Biao 
> ---
> v2: fix use-after-free in CONN_ONE_PACKET case suggested by Julian Anastasov
> 
>  net/netfilter/ipvs/ip_vs_core.c | 15 +++
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index 0679dd1..a17104f 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -1972,13 +1972,20 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, 
> struct sk_buff *skb,
>   if (cp->dest && !(cp->dest->flags & IP_VS_DEST_F_AVAILABLE)) {
>   /* the destination server is not available */
>  
> - if (sysctl_expire_nodest_conn(ipvs)) {
> + __u32 flags = cp->flags; 

Oops, now scripts/checkpatch.pl --strict /tmp/file.patch
is complaining about an extra trailing space in the line above.
You can also remove the empty line above the new code...
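
As a quick illustration of the kind of check involved, here is a tiny stand-alone sketch that flags an added patch line ending in a space or tab. The function name is made up for this example, and this is not how checkpatch.pl itself is implemented.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if an added patch line ("+...") ends in a space or tab. */
static bool added_line_has_trailing_space(const char *line)
{
	size_t len = strlen(line);

	/* Ignore the line terminator, if present. */
	while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
		len--;

	/* Only added lines with some content after the '+' are checked. */
	if (len < 2 || line[0] != '+')
		return false;

	return line[len - 1] == ' ' || line[len - 1] == '\t';
}
```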

Regards

--
Julian Anastasov

Re: kernel BUG at lib/string.c:LINE! (4)

2018-05-16 Thread Julian Anastasov


Hello,

On Wed, 16 May 2018, syzbot wrote:

> Hello,
> 
> syzbot found the following crash on:
> 
> HEAD commit:0b7d9978406f Merge branch 'Microsemi-Ocelot-Ethernet-switc..
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=16e9101780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b632d8e2c2ab2c1
> dashboard link: https://syzkaller.appspot.com/bug?extid=aac887f77319868646df
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=1665d63780
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1051710780
> 
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+aac887f7731986864...@syzkaller.appspotmail.com
> 
> IPVS: Unknown mcast interface: veth1_to???a
> IPVS: Unknown mcast interface: veth1_to???a
> IPVS: Unknown mcast interface: veth1_to???a
> detected buffer overflow in strlen
> [ cut here ]
> kernel BUG at lib/string.c:1052!
> invalid opcode:  [#1] SMP KASAN
> Dumping ftrace buffer:
>   (ftrace buffer empty)
> Modules linked in:
> CPU: 1 PID: 373 Comm: syz-executor936 Not tainted 4.17.0-rc4+ #45
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> RIP: 0010:fortify_panic+0x13/0x20 lib/string.c:1051
> RSP: 0018:8801c976f800 EFLAGS: 00010282
> RAX: 0022 RBX: 0040 RCX: 
> RDX: 0022 RSI: 8160f6f1 RDI: ed00392edef6
> RBP: 8801c976f800 R08: 8801cf4c62c0 R09: ed003b5e4fb0
> R10: ed003b5e4fb0 R11: 8801daf27d87 R12: 8801c976fa20
> R13: 8801c976fae4 R14: 8801c976fae0 R15: 048b
> FS:  7fd99f75e700() GS:8801daf0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 21c0 CR3: 0001d6843000 CR4: 001406e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
> strlen include/linux/string.h:270 [inline]
> strlcpy include/linux/string.h:293 [inline]
> do_ip_vs_set_ctl+0x31c/0x1d00 net/netfilter/ipvs/ip_vs_ctl.c:2388
> nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
> nf_setsockopt+0x7d/0xd0 net/netfilter/nf_sockopt.c:115
> ip_setsockopt+0xd8/0xf0 net/ipv4/ip_sockglue.c:1253
> udp_setsockopt+0x62/0xa0 net/ipv4/udp.c:2487
> ipv6_setsockopt+0x149/0x170 net/ipv6/ipv6_sockglue.c:917
> tcp_setsockopt+0x93/0xe0 net/ipv4/tcp.c:3057
> sock_common_setsockopt+0x9a/0xe0 net/core/sock.c:3046
> __sys_setsockopt+0x1bd/0x390 net/socket.c:1903
> __do_sys_setsockopt net/socket.c:1914 [inline]
> __se_sys_setsockopt net/socket.c:1911 [inline]
> __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
> do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x447369
> RSP: 002b:7fd99f75dda8 EFLAGS: 0246 ORIG_RAX: 0036
> RAX: ffda RBX: 006e39e4 RCX: 00447369
> RDX: 048b RSI:  RDI: 0003
> RBP:  R08: 0018 R09: 
> R10: 21c0 R11: 0246 R12: 006e39e0
> R13: 75a1ff93f0896195 R14: 6f745f3168746576 R15: 0001
> Code: 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 48 89 df e8 d2 8f 48 fa eb de
> 55 48 89 fe 48 c7 c7 60 65 64 88 48 89 e5 e8 91 dd f3 f9 <0f> 0b 90 90 90 90
> 90 90 90 90 90 90 90 55 48 89 e5 41 57 41 56
> RIP: fortify_panic+0x13/0x20 lib/string.c:1051 RSP: 8801c976f800
> ---[ end trace 624046f2d9af7702 ]---

Just to let you know that I tested a patch with
the syzbot, will do more tests before submitting...
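
The trace shows strlcpy() calling strlen() on its source, which overruns when a fixed-size field coming from user space has no terminating NUL. The sketch below shows one generic way to guard against that in plain C. It is not the patch mentioned above; struct model_daemon_req, MODEL_IFNAME_MAXLEN and model_copy_ifname() are hypothetical names chosen for this example.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_IFNAME_MAXLEN 16  /* hypothetical field size */

/* Hypothetical request as copied from user space; the name may lack a NUL. */
struct model_daemon_req {
	char mcast_ifn[MODEL_IFNAME_MAXLEN];
};

/*
 * Copy the interface name only if it is NUL-terminated inside its field
 * and fits in the destination. Returns false instead of reading past it.
 */
static bool model_copy_ifname(char *dst, size_t dst_size,
			      const struct model_daemon_req *req)
{
	const char *nul = memchr(req->mcast_ifn, '\0', sizeof(req->mcast_ifn));
	size_t len;

	if (!nul)                       /* no terminator inside the field */
		return false;
	len = (size_t)(nul - req->mcast_ifn);
	if (len >= dst_size)            /* would not fit with its terminator */
		return false;
	memcpy(dst, req->mcast_ifn, len + 1);
	return true;
}
```

The bounded search never reads beyond the field, so an unterminated name is rejected rather than measured.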

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: WARNING: possible recursive locking detected

2018-04-11 Thread Julian Anastasov


Hello,

On Wed, 11 Apr 2018, Dmitry Vyukov wrote:

> On Wed, Apr 11, 2018 at 4:02 PM, syzbot
> <syzbot+3c43eecd7745a5ce1...@syzkaller.appspotmail.com> wrote:
> > Hello,
> >
> > syzbot hit the following crash on upstream commit
> > b284d4d5a6785f8cd07eda2646a95782373cd01e (Tue Apr 10 19:25:30 2018 +)
> > Merge tag 'ceph-for-4.17-rc1' of git://github.com/ceph/ceph-client
> > syzbot dashboard link:
> > https://syzkaller.appspot.com/bug?extid=3c43eecd7745a5ce1640
> >
> > So far this crash happened 3 times on upstream.
> > C reproducer: https://syzkaller.appspot.com/x/repro.c?id=5103706542440448
> > syzkaller reproducer:
> > https://syzkaller.appspot.com/x/repro.syz?id=5641659786199040
> > Raw console output:
> > https://syzkaller.appspot.com/x/log.txt?id=5099510896263168
> > Kernel config:
> > https://syzkaller.appspot.com/x/.config?id=-1223000601505858474
> > compiler: gcc (GCC) 8.0.1 20180301 (experimental)
> >
> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > Reported-by: syzbot+3c43eecd7745a5ce1...@syzkaller.appspotmail.com
> > It will help syzbot understand when the bug is fixed. See footer for
> > details.
> > If you forward the report, please keep this part and the footer.
> 
> #syz dup: possible deadlock in rtnl_lock (5)

Yes, patch is now in the "nf" tree, so all these
lockups around start_sync_thread should be resolved soon...

> > IPVS: sync thread started: state = BACKUP, mcast_ifn = lo, syncid = 0, id =
> > 0
> > IPVS: stopping backup sync thread 4546 ...
> >
> > 
> > IPVS: stopping backup sync thread 4559 ...
> > WARNING: possible recursive locking detected

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: INFO: task hung in stop_sync_thread (2)

2018-03-29 Thread Julian Anastasov

ty.c:2131
> 2 locks held by getty/4342:
> #0:  (>ldisc_sem){}, at: [<bee98654>]
> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
> #1:  (>atomic_read_lock){+.+.}, at: [<c1d180aa>]
> n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4343:
> #0:  (>ldisc_sem){}, at: [<bee98654>]
> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
> #1:  (>atomic_read_lock){+.+.}, at: [<c1d180aa>]
> n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
> 2 locks held by getty/4344:
> #0:  (>ldisc_sem){}, at: [<bee98654>]
> ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365
> #1:  (>atomic_read_lock){+.+.}, at: [<c1d180aa>]
> n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131
> 3 locks held by kworker/0:5/6494:
> #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<a062b18e>]
> work_static include/linux/workqueue.h:198 [inline]
> #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<a062b18e>]
> set_work_data kernel/workqueue.c:619 [inline]
> #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<a062b18e>]
> set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
> #0:  ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<a062b18e>]
> process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
> #1:  ((addr_chk_work).work){+.+.}, at: [<278427d5>]
> process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
> #2:  (rtnl_mutex){+.+.}, at: [<066e35ac>] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 1 lock held by syz-executor7/25421:
> #0:  (ipvs->sync_mutex){+.+.}, at: [<d414a689>]
> do_ip_vs_set_ctl+0x277/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2393
> 2 locks held by syz-executor7/25427:
> #0:  (rtnl_mutex){+.+.}, at: [<066e35ac>] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> #1:  (ipvs->sync_mutex){+.+.}, at: [<e6d48489>]
> do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388

Above is start_sync_thread() waiting for the kthread to stop...

> 1 lock held by syz-executor7/25435:
> #0:  (rtnl_mutex){+.+.}, at: [<066e35ac>] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74
> 1 lock held by ipvs-b:2:0/25415:
> #0:  (rtnl_mutex){+.+.}, at: [<066e35ac>] rtnl_lock+0x17/0x20
> net/core/rtnetlink.c:74

The backup kthread needs rtnl_lock to stop...
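
The two notes above describe a circular wait: the caller holds rtnl_lock while waiting for the kthread to stop, and the kthread needs rtnl_lock before it can stop. A minimal sketch of that wait-for cycle follows, as a plain user-space model. The task numbering and all model_* names are made up for illustration.

```c
#include <stdbool.h>

#define MODEL_NTASKS 3
#define MODEL_NOBODY (-1)

/*
 * waits_for[i] is the task that task i is blocked on, or MODEL_NOBODY.
 * Returns true if following the chain from 'start' leads back to 'start',
 * i.e. 'start' is part of a circular wait.
 */
static bool model_has_deadlock(const int waits_for[MODEL_NTASKS], int start)
{
	int cur = start;
	int steps;

	for (steps = 0; steps < MODEL_NTASKS; steps++) {
		cur = waits_for[cur];
		if (cur == MODEL_NOBODY)
			return false;   /* chain ends: someone can make progress */
		if (cur == start)
			return true;    /* came back around: circular wait */
	}
	return false;
}
```

Here task 0 would be the setsockopt caller (holding rtnl_lock, waiting for the kthread) and task 1 the backup kthread (waiting for rtnl_lock), so the table is { 1, 0, MODEL_NOBODY }.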

> 
> =
> 
> NMI backtrace for cpu 1
> CPU: 1 PID: 868 Comm: khungtaskd Not tainted 4.16.0-rc6+ #284
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
> 01/01/2011
> Call Trace:
> __dump_stack lib/dump_stack.c:17 [inline]
> dump_stack+0x194/0x24d lib/dump_stack.c:53
> nmi_cpu_backtrace+0x1d2/0x210 lib/nmi_backtrace.c:103
> nmi_trigger_cpumask_backtrace+0x123/0x180 lib/nmi_backtrace.c:62
> arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
> trigger_all_cpu_backtrace include/linux/nmi.h:138 [inline]
> check_hung_task kernel/hung_task.c:132 [inline]
> check_hung_uninterruptible_tasks kernel/hung_task.c:190 [inline]
> watchdog+0x90c/0xd60 kernel/hung_task.c:249
> kthread+0x33c/0x400 kernel/kthread.c:238
> ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:406
> Sending NMI from CPU 1 to CPUs 0:
> NMI backtrace for cpu 0 skipped: idling at native_safe_halt+0x6/0x10
> arch/x86/include/asm/irqflags.h:54
> 
> 
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
> 
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug report.
> Note: all commands must start from beginning of the line in the email body.

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH] netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed

2017-10-28 Thread Julian Anastasov


Hello,

On Thu, 26 Oct 2017, Ye Yin wrote:

> When run ipvs in two different network namespace at the same host, and one
> ipvs transport network traffic to the other network namespace ipvs.
> 'ipvs_property' flag will make the second ipvs take no effect. So we should
> clear 'ipvs_property' when SKB network namespace changed.
> 
> Signed-off-by: Ye Yin <hust...@gmail.com>
> Signed-off-by: Wei Zhou <chouryz...@gmail.com>

Patch looks good to me. ipvs_property was added long ago
but skb_scrub_packet() is more recent (3.11), so:

Fixes: 621e84d6f373 ("dev: introduce skb_scrub_packet()")
Signed-off-by: Julian Anastasov <j...@ssi.bg>

I guess, DaveM can apply it directly as a bugfix
to the net tree.

> ---
>  include/linux/skbuff.h | 7 +++
>  net/core/skbuff.c  | 1 +
>  2 files changed, 8 insertions(+)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 72299ef..d448a48 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -3770,6 +3770,13 @@ static inline void nf_reset_trace(struct sk_buff *skb)
>  #endif
>  }
>  
> +static inline void ipvs_reset(struct sk_buff *skb)
> +{
> +#if IS_ENABLED(CONFIG_IP_VS)
> + skb->ipvs_property = 0;
> +#endif
> +}
> +
>  /* Note: This doesn't put any conntrack and bridge info in dst. */
>  static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
>bool copy)
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 2465607..e140ba4 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4864,6 +4864,7 @@ void skb_scrub_packet(struct sk_buff *skb, bool xnet)
>   if (!xnet)
>   return;
>  
> + ipvs_reset(skb);
>   skb_orphan(skb);
>   skb->mark = 0;
>  }
> -- 
> 1.7.12.4

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH] netfilter: ipvs: Convert timers to use timer_setup()

2017-10-24 Thread Julian Anastasov


Hello,

On Tue, 24 Oct 2017, Kees Cook wrote:

> In preparation for unconditionally passing the struct timer_list pointer to
> all timer callbacks, switch to using the new timer_setup() and from_timer()
> to pass the timer pointer explicitly.
> 
> Cc: Wensong Zhang 
> Cc: Simon Horman 
> Cc: Julian Anastasov 
> Cc: Pablo Neira Ayuso 
> Cc: Jozsef Kadlecsik 
> Cc: Florian Westphal 
> Cc: "David S. Miller" 
> Cc: net...@vger.kernel.org
> Cc: lvs-de...@vger.kernel.org
> Cc: netfilter-de...@vger.kernel.org
> Cc: coret...@netfilter.org
> Signed-off-by: Kees Cook 

Looks good to me,

Acked-by: Julian Anastasov 
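
For readers following the conversion, here is a minimal userspace sketch of the pattern the patch switches to: the callback receives the timer pointer and recovers the enclosing object from it. The names `struct conn`, `conn_expire` and the reduced `struct timer_list` are stand-ins made up for illustration, not the kernel definitions, and `from_timer()` here is a simplified version that assumes a compiler with `__typeof__` (GCC or Clang).

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's struct timer_list (assumption). */
struct timer_list {
	void (*function)(struct timer_list *t);
};

/* Hypothetical, heavily reduced connection object with an embedded timer. */
struct conn {
	int id;
	int expired;
	struct timer_list timer;
};

/* Step back from a pointer to a member to the structure that contains it. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified from_timer(): the target type is derived from 'var'. */
#define from_timer(var, t, field) \
	container_of(t, __typeof__(*(var)), field)

/* New-style callback: gets the timer, not an opaque unsigned long. */
static void conn_expire(struct timer_list *t)
{
	struct conn *cp = from_timer(cp, t, timer);

	cp->expired = 1;
}

/* Reduced timer_setup(): only records the callback, flags are ignored. */
static void timer_setup(struct timer_list *t,
			void (*fn)(struct timer_list *), unsigned int flags)
{
	(void)flags;
	t->function = fn;
}
```

Because the object is recovered from the timer address, a direct call such as the one in __ip_vs_conn_put_notimer() only needs to pass the address of the embedded timer.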

> ---
>  net/netfilter/ipvs/ip_vs_conn.c  | 10 +-
>  net/netfilter/ipvs/ip_vs_ctl.c   |  7 +++
>  net/netfilter/ipvs/ip_vs_est.c   |  6 +++---
>  net/netfilter/ipvs/ip_vs_lblc.c  | 11 ++-
>  net/netfilter/ipvs/ip_vs_lblcr.c | 11 ++-
>  5 files changed, 23 insertions(+), 22 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
> index 3d2ac71a83ec..3a43b3470331 100644
> --- a/net/netfilter/ipvs/ip_vs_conn.c
> +++ b/net/netfilter/ipvs/ip_vs_conn.c
> @@ -104,7 +104,7 @@ static inline void ct_write_unlock_bh(unsigned int key)
>   spin_unlock_bh(&__ip_vs_conntbl_lock_array[key&CT_LOCKARRAY_MASK].l);
>  }
>  
> -static void ip_vs_conn_expire(unsigned long data);
> +static void ip_vs_conn_expire(struct timer_list *t);
>  
>  /*
>   *   Returns hash value for IPVS connection entry
> @@ -457,7 +457,7 @@ EXPORT_SYMBOL_GPL(ip_vs_conn_out_get_proto);
>  static void __ip_vs_conn_put_notimer(struct ip_vs_conn *cp)
>  {
>   __ip_vs_conn_put(cp);
> - ip_vs_conn_expire((unsigned long)cp);
> + ip_vs_conn_expire(&cp->timer);
>  }
>  
>  /*
> @@ -817,9 +817,9 @@ static void ip_vs_conn_rcu_free(struct rcu_head *head)
>   kmem_cache_free(ip_vs_conn_cachep, cp);
>  }
>  
> -static void ip_vs_conn_expire(unsigned long data)
> +static void ip_vs_conn_expire(struct timer_list *t)
>  {
> - struct ip_vs_conn *cp = (struct ip_vs_conn *)data;
> + struct ip_vs_conn *cp = from_timer(cp, t, timer);
>   struct netns_ipvs *ipvs = cp->ipvs;
>  
>   /*
> @@ -909,7 +909,7 @@ ip_vs_conn_new(const struct ip_vs_conn_param *p, int 
> dest_af,
>   }
>  
>   INIT_HLIST_NODE(&cp->c_list);
> - setup_timer(&cp->timer, ip_vs_conn_expire, (unsigned long)cp);
> + timer_setup(&cp->timer, ip_vs_conn_expire, 0);
>   cp->ipvs   = ipvs;
>   cp->af = p->af;
>   cp->daf= dest_af;
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 4f940d7eb2f7..b47e266c6eca 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -1146,9 +1146,9 @@ ip_vs_del_dest(struct ip_vs_service *svc, struct 
> ip_vs_dest_user_kern *udest)
>   return 0;
>  }
>  
> -static void ip_vs_dest_trash_expire(unsigned long data)
> +static void ip_vs_dest_trash_expire(struct timer_list *t)
>  {
> - struct netns_ipvs *ipvs = (struct netns_ipvs *)data;
> + struct netns_ipvs *ipvs = from_timer(ipvs, t, dest_trash_timer);
>   struct ip_vs_dest *dest, *next;
>   unsigned long now = jiffies;
>  
> @@ -4019,8 +4019,7 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs 
> *ipvs)
>  
>   INIT_LIST_HEAD(&ipvs->dest_trash);
>   spin_lock_init(&ipvs->dest_trash_lock);
> - setup_timer(&ipvs->dest_trash_timer, ip_vs_dest_trash_expire,
> - (unsigned long) ipvs);
> + timer_setup(&ipvs->dest_trash_timer, ip_vs_dest_trash_expire, 0);
>   atomic_set(&ipvs->ftpsvc_counter, 0);
>   atomic_set(&ipvs->nullsvc_counter, 0);
>   atomic_set(&ipvs->conn_out_counter, 0);
> diff --git a/net/netfilter/ipvs/ip_vs_est.c b/net/netfilter/ipvs/ip_vs_est.c
> index 457c6c193e13..489055091a9b 100644
> --- a/net/netfilter/ipvs/ip_vs_est.c
> +++ b/net/netfilter/ipvs/ip_vs_est.c
> @@ -97,12 +97,12 @@ static void ip_vs_read_cpu_stats(struct ip_vs_kstats *sum,
>  }
>  
>  
> -static void estimation_timer(unsigned long arg)
> +static void estimation_timer(struct timer_list *t)
>  {
>   struct ip_vs_estimator *e;
>   struct ip_vs_stats *s;
>   u64 rate;
> - struct netns_ipvs *ipvs = (struct netns_ipvs *)arg;
> + struct netns_ipvs *ipvs = from_timer(ipvs, t, est_timer);
>  
>   spin_lock(&ipvs->est_lock);
>   list_for_each_entry(e, &ipvs->est_list, list) {
> @@ -192,7 +192,7 @@ int __net_init ip_vs_estimator_net_init(struct netns_ipvs 
> *ipvs)
>  {
>   INIT_LIST_HEAD(&ipvs->est_list);
>   spin_lock_init(&ipvs->est_lock);
> - s

Re: [PATCH] ipvs: Fix inappropriate output of procfs

2017-10-15 Thread Julian Anastasov


Hello,

On Sun, 15 Oct 2017, KUWAZAWA Takuya wrote:

> Information about ipvs in different network namespace can be seen via procfs.
> 
> How to reproduce:
> 
>   # ip netns add ns01
>   # ip netns add ns02
>   # ip netns exec ns01 ip a add dev lo 127.0.0.1/8
>   # ip netns exec ns02 ip a add dev lo 127.0.0.1/8
>   # ip netns exec ns01 ipvsadm -A -t 10.1.1.1:80
>   # ip netns exec ns02 ipvsadm -A -t 10.1.1.2:80
> 
> The ipvsadm displays information about its own network namespace only.
> 
>   # ip netns exec ns01 ipvsadm -Ln
>   IP Virtual Server version 1.2.1 (size=4096)
>   Prot LocalAddress:Port Scheduler Flags
> -> RemoteAddress:Port   Forward Weight ActiveConn InActConn
>   TCP  10.1.1.1:80 wlc
> 
>   # ip netns exec ns02 ipvsadm -Ln
>   IP Virtual Server version 1.2.1 (size=4096)
>   Prot LocalAddress:Port Scheduler Flags
> -> RemoteAddress:Port   Forward Weight ActiveConn InActConn
>   TCP  10.1.1.2:80 wlc
> 
> But I can see information about other network namespace via procfs.
> 
>   # ip netns exec ns01 cat /proc/net/ip_vs
>   IP Virtual Server version 1.2.1 (size=4096)
>   Prot LocalAddress:Port Scheduler Flags
> -> RemoteAddress:Port Forward Weight ActiveConn InActConn
>   TCP  0A010101:0050 wlc
>   TCP  0A010102:0050 wlc
> 
>   # ip netns exec ns02 cat /proc/net/ip_vs
>   IP Virtual Server version 1.2.1 (size=4096)
>   Prot LocalAddress:Port Scheduler Flags
> -> RemoteAddress:Port Forward Weight ActiveConn InActConn
>   TCP  0A010102:0050 wlc
> 
> Signed-off-by: KUWAZAWA Takuya <albatro...@gmail.com>

Looks good to me

Acked-by: Julian Anastasov <j...@ssi.bg>

Simon, please apply to ipvs tree.
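
The check added by the patch can be pictured with a small userspace sketch: the service table is shared, every entry remembers which namespace owns it, and the reader skips entries that belong to someone else. `struct netns`, `struct service` and `visible_services()` are hypothetical names chosen for this illustration only; they are not the kernel structures.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-namespace IPVS state. */
struct netns {
	int id;
};

/* Hypothetical service entry living in a table shared by all namespaces. */
struct service {
	const struct netns *owner;
	const char *addr;
};

/*
 * Count the rows a reader in namespace 'viewer' would be shown.
 * Entries owned by another namespace are skipped, which mirrors the
 * "if (svc->ipvs != ipvs) return 0;" test in the patch.
 */
static size_t visible_services(const struct service *tbl, size_t n,
			       const struct netns *viewer)
{
	size_t shown = 0;

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].owner != viewer)
			continue;
		shown++;
	}
	return shown;
}
```

With one service per namespace, as in the reproducer, each reader is left with exactly one row.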

> ---
>  net/netfilter/ipvs/ip_vs_ctl.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index 4f940d7..b3245f9 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -2034,12 +2034,16 @@ static int ip_vs_info_seq_show(struct seq_file *seq, 
> void *v)
>   seq_puts(seq,
>"  -> RemoteAddress:Port Forward Weight ActiveConn 
> InActConn\n");
>   } else {
> + struct net *net = seq_file_net(seq);
> + struct netns_ipvs *ipvs = net_ipvs(net);
>   const struct ip_vs_service *svc = v;
>   const struct ip_vs_iter *iter = seq->private;
>   const struct ip_vs_dest *dest;
>   struct ip_vs_scheduler *sched = rcu_dereference(svc->scheduler);
>   char *sched_name = sched ? sched->name : "none";
>  
> + if (svc->ipvs != ipvs)
> + return 0;
>   if (iter->table == ip_vs_svc_table) {
>  #ifdef CONFIG_IP_VS_IPV6
>   if (svc->af == AF_INET6)
> -- 
> 1.8.3.1

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH] netfilter: ip_vs_sync: fix bogus maybe-uninitialized warning

2016-10-24 Thread Julian Anastasov


Hello,

On Mon, 24 Oct 2016, Arnd Bergmann wrote:

> Building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86
> confuses the compiler to the point where it produces a rather
> dubious warning message:
> 
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.init_seq’ may be used 
> uninitialized in this function [-Werror=maybe-uninitialized]
>   struct ip_vs_sync_conn_options opt;
>  ^~~
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.delta’ may be used 
> uninitialized in this function [-Werror=maybe-uninitialized]
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘opt.previous_delta’ may be 
> used uninitialized in this function [-Werror=maybe-uninitialized]
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)+12).init_seq’ 
> may be used uninitialized in this function [-Werror=maybe-uninitialized]
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void *)+12).delta’ 
> may be used uninitialized in this function [-Werror=maybe-uninitialized]
> net/netfilter/ipvs/ip_vs_sync.c:1073:33: error: ‘*((void 
> *)+12).previous_delta’ may be used uninitialized in this function 
> [-Werror=maybe-uninitialized]
> 
> The problem appears to be a combination of a number of factors, including
> the __builtin_bswap32 compiler builtin being slightly odd, having a large
> amount of code inlined into a single function, and the way that some
> functions only get partially inlined here.
> 
> I've spent way too much time trying to work out a way to improve the
> code, but the best I've come up with is to add an explicit memset
> right before the ip_vs_seq structure is first initialized here. When
> the compiler works correctly, this has absolutely no effect, but in the
> case that produces the warning, the warning disappears.
> 
> In the process of analysing this warning, I also noticed that
> we use memcpy to copy the larger ip_vs_sync_conn_options structure
> over two members of the ip_vs_conn structure. This works because
> the layout is identical, but seems error-prone, so I'm changing
> this in the process to directly copy the two members. This change
> seemed to have no effect on the object code or the warning, but
> it deals with the same data, so I kept the two changes together.
> 
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

OK,

Acked-by: Julian Anastasov <j...@ssi.bg>

I guess, Simon will take the patch for ipvs-next.

> ---
>  net/netfilter/ipvs/ip_vs_sync.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
> index 1b07578bedf3..9350530c16c1 100644
> --- a/net/netfilter/ipvs/ip_vs_sync.c
> +++ b/net/netfilter/ipvs/ip_vs_sync.c
> @@ -283,6 +283,7 @@ struct ip_vs_sync_buff {
>   */
>  static void ntoh_seq(struct ip_vs_seq *no, struct ip_vs_seq *ho)
>  {
> + memset(ho, 0, sizeof(*ho));
>   ho->init_seq   = get_unaligned_be32(&no->init_seq);
>   ho->delta  = get_unaligned_be32(&no->delta);
>   ho->previous_delta = get_unaligned_be32(&no->previous_delta);

So, now there is a double write here?

What about such constructs?:

*ho = (struct ip_vs_seq) {
.init_seq   = get_unaligned_be32(&no->init_seq),
...
};

Any difference in the compiled code or warnings?
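
A self-contained sketch of that construct, for experimenting outside the kernel. Assumptions: `struct seq` is a reduced stand-in for struct ip_vs_seq with only the three fields touched here, and `load_be32()` is a hypothetical userspace replacement for get_unaligned_be32(). Whether this form changes the warning is exactly the open question, so the sketch makes no claim about that.

```c
#include <stdint.h>

/* Reduced stand-in for struct ip_vs_seq (assumption). */
struct seq {
	uint32_t init_seq;
	uint32_t delta;
	uint32_t previous_delta;
};

/* Hypothetical replacement for get_unaligned_be32(): byte-wise big-endian load. */
static uint32_t load_be32(const void *p)
{
	const unsigned char *b = p;

	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/*
 * Convert from network to host order with one compound-literal
 * assignment: every member of *ho is written exactly once, and any
 * member not named in the initializer would be zeroed.
 */
static void ntoh_seq_literal(const struct seq *no, struct seq *ho)
{
	*ho = (struct seq) {
		.init_seq       = load_be32(&no->init_seq),
		.delta          = load_be32(&no->delta),
		.previous_delta = load_be32(&no->previous_delta),
	};
}
```

Compared with memset() followed by three stores, the whole structure is defined in a single assignment, so there is no separate zeroing pass to write each field twice.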

> @@ -917,8 +918,10 @@ static void ip_vs_proc_conn(struct netns_ipvs *ipvs, 
> struct ip_vs_conn_param *pa
>   kfree(param->pe_data);
>   }
>  
> - if (opt)
> - memcpy(&cp->in_seq, opt, sizeof(*opt));
> + if (opt) {
> + cp->in_seq = opt->in_seq;
> + cp->out_seq = opt->out_seq;

This fix is fine.

> + }
>   atomic_set(&cp->in_pkts, sysctl_sync_threshold(ipvs));
>   cp->state = state;
>   cp->old_state = cp->state;
> -- 
> 2.9.0

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH v2] net: neigh: disallow transition to NUD_STALE if lladdr is unchanged in neigh_update()

2016-07-26 Thread Julian Anastasov


Hello,

On Tue, 26 Jul 2016, Chunhui He wrote:

> NUD_STALE is used when the caller(e.g. arp_process()) can't guarantee
> neighbour reachability. If the entry was NUD_VALID and lladdr is unchanged,
> the entry state should not be changed.
> 
> Currently the code puts an extra "NUD_CONNECTED" condition. So if old state
> was NUD_DELAY or NUD_PROBE (they are NUD_VALID but not NUD_CONNECTED), the
> state can be changed to NUD_STALE.
> 
> This may cause problem. Because NUD_STALE lladdr doesn't guarantee
> reachability, when we send traffic, the state will be changed to
> NUD_DELAY. In normal case, if we get no confirmation (by dst_confirm()),
> we will change the state to NUD_PROBE and send probe traffic. But now the
> state may be reset to NUD_STALE again(e.g. by broadcast ARP packets),
> so the probe traffic will not be sent. This situation may happen again and
> again, and packets will be sent to an non-reachable lladdr forever.
> 
> The fix is to remove the "NUD_CONNECTED" condition. After that the
> "NEIGH_UPDATE_F_WEAK_OVERRIDE" condition (used by IPv6) in that branch will
> be redundant, so remove it.
> 
> This change may increase probe traffic, but it's essential since NUD_STALE
> lladdr is unreliable. To ensure correctness, we prefer to resolve lladdr,
> when we can't get confirmation, even while remote packets try to set
> NUD_STALE state.
> 
> Signed-off-by: Chunhui He <hchun...@mail.ustc.edu.cn>

Looks good to me,

Signed-off-by: Julian Anastasov <j...@ssi.bg>
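
The effect of dropping the NUD_CONNECTED condition can be checked with a small userspace model of just this branch. Assumptions: the state values are written out as plain constants for illustration, `WEAK_OVERRIDE` stands in for NEIGH_UPDATE_F_WEAK_OVERRIDE, `same_lladdr` replaces the "lladdr == neigh->ha" comparison, and the function names are made up. The rest of neigh_update() is not modelled.

```c
#include <stdbool.h>

/* Neighbour states as bit flags (written out here for illustration). */
enum {
	NUD_INCOMPLETE = 0x01,
	NUD_REACHABLE  = 0x02,
	NUD_STALE      = 0x04,
	NUD_DELAY      = 0x08,
	NUD_PROBE      = 0x10,
	NUD_FAILED     = 0x20,
	NUD_NOARP      = 0x40,
	NUD_PERMANENT  = 0x80,
};

#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)
#define WEAK_OVERRIDE 0x2u	/* stand-in for NEIGH_UPDATE_F_WEAK_OVERRIDE */

/* Branch as it was: the old state is kept only for "connected" entries. */
static unsigned int state_before(unsigned int old, unsigned int next,
				 bool same_lladdr, unsigned int flags)
{
	if (same_lladdr && next == NUD_STALE &&
	    ((flags & WEAK_OVERRIDE) || (old & NUD_CONNECTED)))
		next = old;
	return next;
}

/* Branch after the patch: an unchanged lladdr never demotes to STALE. */
static unsigned int state_after(unsigned int old, unsigned int next,
				bool same_lladdr)
{
	if (same_lladdr && next == NUD_STALE)
		next = old;
	return next;
}
```

In this model an entry in NUD_DELAY or NUD_PROBE with an unchanged lladdr falls back to NUD_STALE under the old condition but keeps its state under the new one, so the pending probe is not cancelled.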

> ---
> v2:
>  - change title from "net: neigh: disallow state transition DELAY->STALE in
>neigh_update()"
>  - remove "NUD_CONNECTED" condition instead of "NUD_CONNECTED | NUD_DELAY"
>  - remove "NEIGH_UPDATE_F_WEAK_OVERRIDE" condition
> 
> ---
>  net/core/neighbour.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 510cd62..ed8c317e 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -1060,8 +1060,6 @@ static void neigh_update_hhs(struct neighbour *neigh)
>   NEIGH_UPDATE_F_WEAK_OVERRIDE will suspect existing "connected"
>   lladdr instead of overriding it
>   if it is different.
> - It also allows to retain current state
> - if lladdr is unchanged.
>   NEIGH_UPDATE_F_ADMINmeans that the change is administrative.
>  
>   NEIGH_UPDATE_F_OVERRIDE_ISROUTER allows to override existing
> @@ -1150,10 +1148,7 @@ int neigh_update(struct neighbour *neigh, const u8 
> *lladdr, u8 new,
>   } else
>   goto out;
>   } else {
> - if (lladdr == neigh->ha && new == NUD_STALE &&
> - ((flags & NEIGH_UPDATE_F_WEAK_OVERRIDE) ||
> -  (old & NUD_CONNECTED))
> - )
> + if (lladdr == neigh->ha && new == NUD_STALE)
>   new = old;
>   }
>   }
> -- 
> 2.1.4

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH] net: neigh: disallow state transition DELAY->STALE in neigh_update()

2016-07-25 Thread Julian Anastasov

Hello,

On Mon, 25 Jul 2016, 吉藤英明 wrote:

> OK, following blocks are "no-op" and we will get same result.
> 
> Well, please do not try changing several things at the same time and
> you could say:
> 
> if (lladdr == neigh->ha && new == NUD_STALE &&
> !(flags & NEIGH_UPDATE_F_ADMIN))
> new = old;

OK, let's do it with 2 patches then.

Chunhui He, can you modify your patch to delete
both lines and explain that we prefer to resolve the
remote address, even while remote packets try to set the NUD_STALE
state? If your patch is accepted, I'll post a second patch that
adds the line with the ADMIN check. As a result, the code will
look like the example from Yoshifuji Hideaki above.
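
As a rough illustration of the two-patch plan, the condition can be modelled in user space as two predicates. This is only a sketch: the function names are hypothetical, and the constant values are assumed to match the kernel headers rather than taken from this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values assumed to match the kernel headers; treat them as illustrative. */
#define NUD_STALE            0x04
#define NEIGH_UPDATE_F_ADMIN 0x80000000u

/* After patch 1: an unchanged address plus a STALE request keeps the old state. */
bool keep_state_after_patch1(bool lladdr_unchanged, uint8_t new_state)
{
	return lladdr_unchanged && new_state == NUD_STALE;
}

/* After patch 2: the same, unless the update is administrative. */
bool keep_state_after_patch2(bool lladdr_unchanged, uint8_t new_state,
			     uint32_t flags)
{
	return lladdr_unchanged && new_state == NUD_STALE &&
	       !(flags & NEIGH_UPDATE_F_ADMIN);
}
```

When a predicate returns true, the kernel code would execute "new = old", i.e. the STALE request is ignored.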

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH] net: neigh: disallow state transition DELAY->STALE in neigh_update()

2016-07-23 Thread Julian Anastasov

Hello,

On Sat, 23 Jul 2016, Chunhui He wrote:

> The neigh system is to reduce ARP traffic, that is good. The problem is it
> fails to handle some corner cases.
> 
> The corner case is (let's forget my case above):
> In NUD_DELAY, the neigh system is waiting for a proof of reachability. If there
> is no proof, the neigh system must prove by itself, so goes to NUD_PROBE and
> sends request. But when some other part of kernel gives a non-proof by
> neigh_update() (STALE is a *hint*, not a proof of reachability), the neigh
> system will leave NUD_DELAY, and will *"forget"* to prove by itself. So it's
> possible to send traffic to a non-reachable address. That's definitely wrong,
> even if it "saves" traffic.
> 
> And the fix is to disallow NUD_DELAY -> NUD_STALE.

But the NUD_STALE event happens only for a received
packet, for the concerned remote IP address, for the same or
a different hwaddr, for any kind of tip (target IP). Examples:

- Received ARP request who-has LOCAL_IP tell NEIGH_IP:
neigh_event_ns is called for the RTN_LOCAL case,
for sip/sha. Reply is sent.

- Received ARP request who-has UNICAST_IP tell NEIGH_IP:
neigh_event_ns is called for the IN_DEV_FORWARD case,
for sip/sha, i.e. if we use proxy_arp. Deferred
or immediate reply is sent.

- Received ARP request who-has UNICAST_IP tell NEIGH_IP:
neigh_update is called for existing entry when
proxy_arp=0, i.e. request not caught by the above case.
No reply is sent.

- Received Gratuitous ARP request who-has NEIGH_IP tell NEIGH_IP:
neigh_update is called for broadcast request when
arp_accept=1 or when arp_accept=0 while cache entry exists.
No reply is sent.

- Received Gratuitous ARP reply NEIGH_IP is-at hwaddr:
neigh_update is called for the received broadcast reply

This was all for NUD_STALE. There is only one
ARP case where NUD_REACHABLE is set, usually in response
to our request:

- Received unicast reply NEIGH_IP is-at hwaddr

- the second non-ARP case for NUD_REACHABLE is from dst_confirm
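
The cases listed above can be condensed into a small lookup. This is a sketch only: the enum and function names are hypothetical and merely restate the list, they are not kernel identifiers.

```c
#include <stdbool.h>

/* One entry per ARP case from the list above (names are made up). */
enum arp_event {
	ARP_REQ_FOR_LOCAL_IP,    /* who-has LOCAL_IP tell NEIGH_IP */
	ARP_REQ_PROXIED,         /* who-has UNICAST_IP, proxy_arp in use */
	ARP_REQ_FOR_OTHER_HOST,  /* who-has UNICAST_IP, proxy_arp=0 */
	GARP_REQUEST,            /* who-has NEIGH_IP tell NEIGH_IP */
	GARP_REPLY_BROADCAST,    /* broadcast NEIGH_IP is-at hwaddr */
	ARP_REPLY_UNICAST,       /* unicast NEIGH_IP is-at hwaddr */
};

enum nud_hint { HINT_STALE, HINT_REACHABLE };

/* Only the unicast reply is treated as proof; everything else is a hint. */
enum nud_hint arp_event_hint(enum arp_event ev)
{
	return ev == ARP_REPLY_UNICAST ? HINT_REACHABLE : HINT_STALE;
}

/* The first two cases go through neigh_event_ns, the rest through neigh_update. */
bool arp_event_uses_neigh_event_ns(enum arp_event ev)
{
	return ev == ARP_REQ_FOR_LOCAL_IP || ev == ARP_REQ_PROXIED;
}
```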

> > Can it learn from our unicast ARP replies that we
> > should send in response to its broadcast probes? Or it
> > expects only ARP requests?
> 
> All the broadcast probes I have seen are not "who has ". They are about
> other hosts, so we are not expected to answer.

Maybe that is the problem: we receive such a packet,
ip_route_input_noref detects that we allow such a packet
from NEIGH_IP on this interface, tip is not RTN_LOCAL (no
ARP reply from us), tip is RTN_UNICAST but proxy_arp is not
allowed, so we continue and reach __neigh_lookup which finds
the existing cache entry because we talked to GW before that.
As this is an ARP request, neigh_update is called with NUD_STALE.
No reply is sent because the request was not for us but we
just learned that NEIGH_IP is alive because it is looking
for someone else. This is common to observe with broadcasts:
GW looks up other hosts and has to expose its IP+hwaddr.
It is harder to see with unicast packets: you need a hub,
not a switch, to detect such packets.

It is possible that you miss the packet that tries
to set NUD_STALE. Maybe you can add some printk's to catch
what kind of packet causes this. This can help too:

tcpdump -lnnn -s0 arp and host GW_IP

If you see such a packet, that is it. Our cache is
updated with NUD_STALE.

> So I'm not sure if it can learn from ARP reply.

See above, received broadcast GARP reply can set
NUD_STALE. But the most trivial case of GW exposing its
IP while looking for other hosts should be the culprit.
It probably happens often, that is why we have no chance
to send ARP requests, GW is more ARP-active than us and
updates our cache and we are happy.

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH] net: neigh: disallow state transition DELAY->STALE in neigh_update()

2016-07-23 Thread Julian Anastasov

Hello,

On Sat, 23 Jul 2016, Chunhui He wrote:

> On Sat, 23 Jul 2016 09:17:59 +0300 (EEST), Julian Anastasov <j...@ssi.bg> 
> wrote:
> > 
> > What kind of problem is this? Remote host wants to
> > see a recent probe from us, otherwise it refuses to resolve
> > our address before its traffic to us and it is not sent?
> > Can you explain this in more detail because after looking
> > again I have some doubts what actually happens, see below.
> >
> 
> The remote host is configured to refuse to send any packets to a host it
> doesn't "know" (but broadcast is allowed), and it can only "learn" from ARP
> packets.

Can it learn from our unicast ARP replies that we
should send in response to its broadcast probes? Or it
expects only ARP requests?

> When I send packets, if broadcast ARP requests from the remote host are
> received and set the state to NUD_STALE, then I am stuck.

So, this is a special case. Is it possible to
solve it from user space?

1.1. echo 0 > delay_first_probe_time. This can help if the
remote host sends broadcast ARP probes every second and
if we send IP packets too.

1.2. reduce base_reachable_time if needed to send ARP probes
more often

2. Send ARP probe by using the arping tool, e.g. from cron

Note that solution 1 is not good. If we do not
have traffic to send there will be no ARP probe and the
remote host can not send to us.

> > To summarize: currently the change to NUD_STALE serves the
> > purpose to avoid/delay our hwaddr refreshing probes. They are
> > avoided if protocols indicate progress with the current hwaddr.
> > Outgoing IP traffic that does not trigger confirmation
> > from replies (for example TCP ACK calling dst_confirm) or
> > from applications (MSG_CONFIRM) surely will cause a
> > switch to NUD_PROBE.
> >
> 
> Yes, I agree.
> But now it is possible to delay the probes *forever*, and at the same time we
> get no positive response from the remote host.

What happens if we do not send traffic and the
neigh entry is removed? How will the remote host learn
our address? If the remote host sends ARP broadcasts, even
arp_accept=1 will create a NUD_STALE entry and without any
traffic we can stay in this state, no chance for NUD_DELAY.

> > So, the question is, to avoid probes or to refresh
> > frequently? Is there a good reason to ignore this NUD_STALE
> > event in NUD_DELAY | NUD_PROBE state?
> >
> 
> So reaching a NUD_REACHABLE state is not our goal. It's to ensure correctness.
> Cycle between NUD_STALE and NUD_DELAY is not correct.

The main goal looks to be the reduced ARP traffic. If
we learned the neigh address recently (even if from remote ARP
broadcast probes or from TCP ACKs) we do not need to send
probes. Looks like the goal "always stay present in remote
ARP caches" is not listed as our goal. Even "always update
remote ARP cache" is not implemented, no outgoing traffic =>
no ARP probes.

> Maybe it is enough to ignore NUD_STALE?

But in this case you rely on traffic to enter
NUD_DELAY state. Note that, looking at neigh_timer_handler,
NUD_DELAY state is not guaranteed: if there is no
recent outgoing traffic the NUD_REACHABLE state can be changed
to NUD_STALE, not to NUD_DELAY, so no chance for probes
that will keep the entry refreshed forever.
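
The timer decision for a NUD_REACHABLE entry can be sketched roughly as below. This is a simplified user-space model written from memory of neigh_timer_handler, not a copy of it: the function and enum names are hypothetical, timestamps are plain counters, and jiffies wraparound is ignored.

```c
/* Hypothetical state names for this sketch only. */
enum nud_state { ST_REACHABLE, ST_DELAY, ST_STALE };

/*
 * Decide what happens to a REACHABLE entry when its timer fires.
 * confirmed: last time reachability was confirmed
 * used:      last time the entry was used for outgoing traffic
 */
enum nud_state reachable_timer_next(unsigned long now,
				    unsigned long confirmed,
				    unsigned long used,
				    unsigned long reachable_time,
				    unsigned long delay_probe_time)
{
	if (now <= confirmed + reachable_time)
		return ST_REACHABLE;   /* confirmation is still fresh */
	if (now <= used + delay_probe_time)
		return ST_DELAY;       /* recent outgoing traffic: probe soon */
	return ST_STALE;               /* idle entry: no probes at all */
}
```

With no recent outgoing traffic the last branch is taken, which is the "no chance for probes" case described above.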

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH] net: neigh: disallow state transition DELAY->STALE in neigh_update()

2016-07-23 Thread Julian Anastasov

Hello,

On Fri, 22 Jul 2016, Chunhui He wrote:

> The original code allows NUD_DELAY -> NUD_STALE and NUD_PROBE -> NUD_STALE.
> This part was imported to kernel since v2.1.79, I don't know clearly why it
> allows that.
> 
> My analysis:
> (1) As shown in my previous mail, NUD_DELAY -> NUD_STALE may cause "dead
> loop", so it should be fixed.

Yes, because we stay in NUD_DELAY for many seconds
which is enough for remote host to reset our resolving.

BTW, you said:

In my case, the gateway refuses to send unicast packets to me, before it sees
my ARP request. So it's critical to enter REACHABLE state by sending ARP
request, but not by external confirmation.

What kind of problem is this? Remote host wants to
see a recent probe from us, otherwise it refuses to resolve
our address before its traffic to us and it is not sent?
Can you explain this in more detail because after looking
again I have some doubts what actually happens, see below.

> (2) But NUD_PROBE -> NUD_STALE is acceptable, because in NUD_PROBE, ARP
> request has been sent, it is sufficient to break the "dead loop".
> More attempts are accomplished by the following sequence:
> NUD_STALE --> NUD_DELAY -(sent req)-> NUD_PROBE -(reset by neigh_update())->

I think, when entering NUD_DELAY we do not send
any ARP probe: for NUD_STALE __neigh_event_send is called on
outgoing traffic to change state to NUD_DELAY and to start
timer (it was stopped in NUD_STALE) to detect if address is
still alive before probing it again. Now in this period of
5 seconds (delay_first_probe_time) two things can happen:

1. Unexpected Unicast ARP reply (immediate switch to NUD_REACHABLE)
or protocol indication (dst_confirm) causing delayed switch to
NUD_REACHABLE on next outgoing packet. On sporadic
request+reply we may not switch immediately to NUD_REACHABLE.
Even if the reply called dst_confirm, the change happens
next time when new request is sent and dst_neigh_output is called.

2. Remote host is fast enough to reset us again to NUD_STALE
before we change state to ->NUD_PROBE->NUD_REACHABLE.

To summarize: currently the change to NUD_STALE serves the
purpose to avoid/delay our hwaddr refreshing probes. They are
avoided if protocols indicate progress with the current hwaddr.
Outgoing IP traffic that does not trigger confirmation
from replies (for example TCP ACK calling dst_confirm) or
from applications (MSG_CONFIRM) surely will cause a
switch to NUD_PROBE.

Now the main question: is reaching a NUD_REACHABLE
state a good enough goal (if we ignore the NUD_STALE in
NUD_DELAY | NUD_PROBE state), or do we prefer traffic that does
not provide confirmation indications to use the current
hwaddr based only on indications from received ARP broadcasts
or requests, in which case we avoid our ARP probes? In the
latter case remote hosts do not see fresh probes from us
and we may cycle between NUD_STALE and NUD_DELAY if
such remote packets come more often.
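
The cycle can be shown with a toy state machine. This is a deliberately simplified sketch, not the kernel logic: the names are hypothetical, the delay timer always leads to PROBE (the real timer can also go to REACHABLE after a dst_confirm), and the flag ignore_stale_hint selects between the current behaviour and the one under discussion.

```c
#include <stdbool.h>

enum nud_state { ST_STALE, ST_DELAY, ST_PROBE, ST_REACHABLE };

enum nud_event {
	EV_OUTGOING_PACKET,  /* we send traffic through the entry */
	EV_DELAY_TIMER,      /* delay_first_probe_time expired */
	EV_STALE_HINT,       /* remote ARP broadcast, same hwaddr */
	EV_UNICAST_REPLY,    /* unicast ARP reply from the neighbour */
};

/* One transition of the simplified model. */
enum nud_state nud_step(enum nud_state s, enum nud_event ev,
			bool ignore_stale_hint)
{
	switch (ev) {
	case EV_OUTGOING_PACKET:
		return s == ST_STALE ? ST_DELAY : s;
	case EV_DELAY_TIMER:
		return s == ST_DELAY ? ST_PROBE : s;
	case EV_STALE_HINT:
		/* Current code lets the hint reset DELAY/PROBE to STALE. */
		if (s == ST_DELAY || s == ST_PROBE)
			return ignore_stale_hint ? s : ST_STALE;
		return s;
	case EV_UNICAST_REPLY:
		return ST_REACHABLE;
	}
	return s;
}
```

With ignore_stale_hint set to false, the sequence outgoing packet, stale hint, delay timer ends in STALE again and no probe is ever sent; with true, the same sequence ends in PROBE.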

So, the question is, to avoid probes or to refresh
frequently? Is there a good reason to ignore this NUD_STALE
event in NUD_DELAY | NUD_PROBE state?

> NUD_STALE --> NUD_DELAY -(send req again)-> ... -->
> NUD_REACHABLE

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH] net: neigh: disallow state transition DELAY->STALE in neigh_update()

2016-07-22 Thread Julian Anastasov


Hello,

On Thu, 21 Jul 2016, Chunhui He wrote:

> If neigh entry was CONNECTED and address is not changed, and if new state is
> STALE, entry state will not change. Because DELAY is not in CONNECTED, it's
> possible to change state from DELAY to STALE.
> 
> That is bad. Consider a host in IPv4 network, a neigh entry in STALE state
> is referenced to send packets, so goes to DELAY state. If the entry is not
> confirmed by upper layer, it goes to PROBE state, and sends ARP request.
> The neigh host sends ARP reply, then the entry goes to REACHABLE state.
> But the entry state may be reset to STALE by broadcast ARP packets, before
> the entry goes to PROBE state. So it's possible that the entry will never go
> to REACHABLE state, without external confirmation.
> 
> In my case, the gateway refuses to send unicast packets to me, before it sees
> my ARP request. So it's critical to enter REACHABLE state by sending ARP
> request, but not by external confirmation.
> 
> This fixes neigh_update() not to change to STALE if old state is CONNECTED or
> DELAY.
> 
> Signed-off-by: Chunhui He <hchun...@mail.ustc.edu.cn>
> ---
>  net/core/neighbour.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 510cd62..29429eb 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -1152,7 +1152,7 @@ int neigh_update(struct neighbour *neigh, const u8 
> *lladdr, u8 new,
>   } else {
>   if (lladdr == neigh->ha && new == NUD_STALE &&
>   ((flags & NEIGH_UPDATE_F_WEAK_OVERRIDE) ||
> -  (old & NUD_CONNECTED))
> +  (old & (NUD_CONNECTED | NUD_DELAY)))
>   )
>   new = old;
>   }

Your change looks correct to me. But this place
has more problems. There is no good reason to set NUD_STALE
for any state that is NUD_VALID if address is not changed.
This matches perfectly the comment above this code:
NUD_STALE should change a NUD_VALID state only when
address changes. It also means that IPv6 does not need
to provide NEIGH_UPDATE_F_WEAK_OVERRIDE anymore when
NEIGH_UPDATE_F_OVERRIDE is also present.

This way the state machine can continue with
the resolving: NUD_STALE -> NUD_DELAY (traffic) ->
NUD_PROBE (retries) -> NUD_REACHABLE (unicast reply)
while the address is not changed. Your change covers only
NUD_DELAY, not NUD_PROBE, so it is better to allow more
retries to send. We should not give up until success (NUD_REACHABLE).

Second problem: NEIGH_UPDATE_F_WEAK_OVERRIDE has no
priority over NEIGH_UPDATE_F_ADMIN. For example, now I can not
change from NUD_PERMANENT to NUD_STALE:

# ip neigh add 192.168.168.111 lladdr 00:11:22:33:44:55 nud perm dev wlan0
# ip neigh show to 192.168.168.111
192.168.168.111 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
# ip neigh change 192.168.168.111 lladdr 00:11:22:33:44:55 nud stale dev wlan0
# ip neigh show to 192.168.168.111
192.168.168.111 dev wlan0 lladdr 00:11:22:33:44:55 PERMANENT
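
The transcript is consistent with the existing condition: NUD_PERMANENT is part of NUD_CONNECTED, so a STALE request for an unchanged address is turned back into the old state even when it is administrative. A minimal user-space sketch comparing the existing check with an ADMIN-aware one follows; the function names are hypothetical and the constant values are assumed to match the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values assumed to match the kernel headers; treat them as illustrative. */
#define NUD_REACHABLE 0x02
#define NUD_STALE     0x04
#define NUD_DELAY     0x08
#define NUD_NOARP     0x40
#define NUD_PERMANENT 0x80
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

#define NEIGH_UPDATE_F_OVERRIDE      0x00000001u
#define NEIGH_UPDATE_F_WEAK_OVERRIDE 0x00000002u
#define NEIGH_UPDATE_F_ADMIN         0x80000000u

/* Existing check: true means the STALE request is ignored (new = old). */
bool stale_ignored_current(bool same_lladdr, uint8_t old, uint8_t new_state,
			   uint32_t flags)
{
	return same_lladdr && new_state == NUD_STALE &&
	       ((flags & NEIGH_UPDATE_F_WEAK_OVERRIDE) ||
		(old & NUD_CONNECTED));
}

/* ADMIN-aware check: old state no longer matters, admin requests win. */
bool stale_ignored_proposed(bool same_lladdr, uint8_t old, uint8_t new_state,
			    uint32_t flags)
{
	(void)old;
	return same_lladdr && new_state == NUD_STALE &&
	       !(flags & NEIGH_UPDATE_F_ADMIN);
}
```

For an administrative change on a PERMANENT entry, the first predicate is true (state stays PERMANENT) and the second is false (STALE is applied). For a non-administrative STALE hint in NUD_DELAY it is the other way round.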

IMHO, here is how this place should look:

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 5cdc62a..2b1cb91 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1151,10 +1151,8 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
goto out;
} else {
if (lladdr == neigh->ha && new == NUD_STALE &&
-   ((flags & NEIGH_UPDATE_F_WEAK_OVERRIDE) ||
-(old & NUD_CONNECTED))
-   )
-       new = old;
+   !(flags & NEIGH_UPDATE_F_ADMIN))
+   goto out;
}
}

Any thoughts?
 
Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH v4 net] ipvs: fix bind to link-local mcast IPv6 address in backup

2016-06-17 Thread Julian Anastasov


Hello,

On Thu, 16 Jun 2016, Quentin Armitage wrote:

> When using HEAD from
> https://git.kernel.org/cgit/utils/kernel/ipvsadm/ipvsadm.git/,
> the command:
> ipvsadm --start-daemon backup --mcast-interface eth0.60 \
> --mcast-group ff02::1:81
> fails with the error message:
> Argument list too long
> 
> whereas both:
> ipvsadm --start-daemon master --mcast-interface eth0.60 \
> --mcast-group ff02::1:81
> and:
> ipvsadm --start-daemon backup --mcast-interface eth0.60 \
> --mcast-group 224.0.0.81
> are successful.
> 
> The error message "Argument list too long" isn't helpful. The error occurs
> because an IPv6 address is given in backup mode.
> 
> The error is in make_receive_sock() in net/netfilter/ipvs/ip_vs_sync.c,
> since it fails to set the interface on the address or the socket before
> calling inet6_bind() (via sock->ops->bind), where the test
> 'if (!sk->sk_bound_dev_if)' failed.
> 
> Setting sock->sk->sk_bound_dev_if on the socket before calling
> inet6_bind() resolves the issue.
> 
> Fixes: d33288172e72 ("ipvs: add more mcast parameters for the sync daemon")
> Signed-off-by: Quentin Armitage <quen...@armitage.org.uk>

Looks good to me, thanks!

Acked-by: Julian Anastasov <j...@ssi.bg>

Simon, please apply to ipvs tree. Patch compiles
also on stable 4.4.13, 4.5.7 and 4.6.2, so no need for
special versions. The ack is also for the other 3 patches
from v4 (for ipvs-next) but they depend on this patch.

> ---
>  net/netfilter/ipvs/ip_vs_sync.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
> index 803001a..1b07578 100644
> --- a/net/netfilter/ipvs/ip_vs_sync.c
> +++ b/net/netfilter/ipvs/ip_vs_sync.c
> @@ -1545,7 +1545,8 @@ error:
>  /*
>   *  Set up receiving multicast socket over UDP
>   */
> -static struct socket *make_receive_sock(struct netns_ipvs *ipvs, int id)
> +static struct socket *make_receive_sock(struct netns_ipvs *ipvs, int id,
> + int ifindex)
>  {
>   /* multicast addr */
>   union ipvs_sockaddr mcast_addr;
> @@ -1566,6 +1567,7 @@ static struct socket *make_receive_sock(struct netns_ipvs *ipvs, int id)
>   set_sock_size(sock->sk, 0, result);
>  
>   get_mcast_sockaddr(&mcast_addr, &salen, &ipvs->bcfg, id);
> + sock->sk->sk_bound_dev_if = ifindex;
>   result = sock->ops->bind(sock, (struct sockaddr *)&mcast_addr, salen);
>   if (result < 0) {
>   pr_err("Error binding to the multicast addr\n");
> @@ -1868,7 +1870,7 @@ int start_sync_thread(struct netns_ipvs *ipvs, struct ipvs_sync_daemon_cfg *c,
>   if (state == IP_VS_STATE_MASTER)
>   sock = make_send_sock(ipvs, id);
>   else
> -     sock = make_receive_sock(ipvs, id);
> + sock = make_receive_sock(ipvs, id, dev->ifindex);
>   if (IS_ERR(sock)) {
>   result = PTR_ERR(sock);
>   goto outtinfo;
> -- 
> 1.7.7.6

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH v3 0/4] ipvs: fix backup sync daemon with IPv6, and minor updates

2016-06-16 Thread Julian Anastasov


Hello,

On Wed, 15 Jun 2016, Quentin Armitage wrote:

> This series of patches arise from discovering that:
> ipvsadm --start-daemon backup --mcast-group IPv6_address ...
> would always fail.
> 
> The first patch resolves the problem. The second and third patches are
> optimizations that were noticed while investigating the original problem.
> The fourth patch adds a lock which appears to have been omitted, and the
> final patch adds the recently added sync daemon multicast parameters to
> the log messages that are written when the sync daemons start.
> 
> v2 fixes a compile error in a debug message identified by kbuild test
> robot. Now compiles with CONFIG_IP_VS_DEBUG enabled. Patch 2/5 is modified
> to correct the problem, and patch 3/5 is modifed to apply with the
> modified patch 2/5.
> 
> v3 incorporates changes suggested by Julian Anastasov.
> Patch 1 now sets 'sock->sk->sk_bound_dev_if = ifindex' rather than setting
> sin6_scope_id. Also remove the locks since unnecessary.
> Patch 3 shortens the logged message in order not to exceed 80-char limit.
> Patch 4 Removed, the locks aren't necessary
> Patch 5 No longer changes indentation of existing pr_info. Also removes <>
> around commit IDs in commit description.
> Patches 1, 2, 3, 5 are updated to resolve coding style warnings, and all
>   pass with 0 errors, warnings and checks.
> Patch 5 now becomes patch 4.
> 
> The changes have all been tested and work as expected.
> 
> Quentin Armitage (4):
>   ipvs: Enable setting IPv6 multicast address for ipvs
>   ipvs: Stop calling __dev_get_by_name() repeatedly when starting sync
> daemon
>   ipvs: Don't check result < 0 after setting result = 0
>   ipvs: log additional sync daemon parameters
> 
>  net/netfilter/ipvs/ip_vs_sync.c |  105 
> +++
>  1 files changed, 52 insertions(+), 53 deletions(-)
> 
> -- 
> 1.7.7.6

You should post the first patch separately, not as part
of the patchset, and specify the tree:

[PATCH v4 net] ipvs: ...

The other 3 patches remain in this patchset,
with added "net-next":

[PATCH v4 net-next */3] ipvs: ...

Patch 1:

It is good to mention that the problem happens for link-local
addresses, not for site/org-local or global scope. Being precise
in a bugfix avoids confusion.
You can also check the Subject and the commit message again for
improvements. It is up to you but here is an example:

ipvs: fix bind to link-local mcast IPv6 address in backup

The empty line between Fixes and Signed-off-by should be
removed.

Patch 2-3: look OK

Patch 4:

Some of the fields are unsigned, so %d should be %u:
sync_maxlen, mcast_port, mcast_af, mcast_ttl
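
The point about matching the conversion specifier to the field type can be illustrated with a small userspace sketch. This is only an illustration: the struct below is a hypothetical stand-in for the sync daemon configuration (the field names come from the list above, the types are assumptions), and snprintf() stands in for pr_info().

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical userspace stand-in; the field types are assumptions. */
struct sync_cfg {
	unsigned int   sync_maxlen;
	unsigned short mcast_port;
	unsigned short mcast_af;
	unsigned char  mcast_ttl;
};

/*
 * Format the parameters with %u, the specifier that matches
 * unsigned fields. Returns what snprintf() returns.
 */
int format_sync_cfg(char *buf, size_t len, const struct sync_cfg *c)
{
	return snprintf(buf, len,
			"maxlen = %u, port = %u, af = %u, ttl = %u",
			c->sync_maxlen, c->mcast_port,
			c->mcast_af, c->mcast_ttl);
}
```

With %d the output is the same for small values, but the specifier would no longer agree with the declared signedness of the fields, which is the kind of mismatch the review comment asks to avoid.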

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH v2 0/5] ipvs: fix backup sync daemon with IPv6, and minor updates

2016-06-15 Thread Julian Anastasov

Hello,

On Wed, 15 Jun 2016, Quentin Armitage wrote:

> I am updating the patches in line with your comments, but I'm not sure about
> a couple of points.
> 
> Patch 4:
> 
> You state that before bind(), such changes should be safe. However, from the
> function make_send_sock(), when the functions set_mcast_if(),
> set_mcast_loop(), set_mcast_ttl() and set_mcast_pmtudisc() are called before
> connect(), they all lock the socket before modifying it. Patch 4 was
> intended to make the setting of REUSE consistent.
> 
> If the locking is not necessary, would it be better to remove the locks from
> the set_mcast_...() functions referred to above.

This is a slow path, so it does not matter much.
There is no concurrent access to the socket, the only
risk is some call into the stack that checks with lockdep
for the missing lock. Such example but for another lock
we already hold is ASSERT_RTNL in ip_mc_join_group. But for simple
sk vars lock is not needed. You can safely remove locks before
connect/bind if only sk fields are accessed directly.
We can keep it only in join_mcast_group*(), especially
because they are called after bind().

> Re patch 1 setting 'sock->sk->sk_bound_dev_if = ifindex;', I presume the
> locking should be consistent with what is done in the other functions.

It is a simple var, so it can work without lock.

> Your comments on the above would be really helpful.
> 
> Patch 5:
> 
> You state 'The indentation of existing pr_info in both cases should not be
> changed". I'm not clear exactly what that means. Does it mean that the
> spaces at the beginning of the pr_info() strings which report group, port
> and ttl should be removed?

No, here is example from your patch:

        pr_info("sync thread started: state = MASTER, mcast_ifn = %s, "
-               "syncid = %d, id = %d\n",
-               ipvs->mcfg.mcast_ifn, ipvs->mcfg.syncid, tinfo->id);
+                       "syncid = %d, id = %d, maxlen = %d\n",
+                       ipvs->mcfg.mcast_ifn, ipvs->mcfg.syncid,
+                       tinfo->id, ipvs->mcfg.sync_maxlen);

"syncid = " was at the same column as "sync thread started",
you added another tab, may be to align with the args in new pr_info.
The result is:

        pr_info("sync thread started: state = MASTER, mcast_ifn = %s, "
                        "syncid = %d, id = %d, maxlen = %d\n",
                        ipvs->mcfg.mcast_ifn, ipvs->mcfg.syncid,
                        tinfo->id, ipvs->mcfg.sync_maxlen);
        <--- 2 TABs --->

But it should be:

        pr_info("sync thread started: state = MASTER, mcast_ifn = %s, "
                "syncid = %d, id = %d, maxlen = %d\n",
                ipvs->mcfg.mcast_ifn, ipvs->mcfg.syncid,
                tinfo->id, ipvs->mcfg.sync_maxlen);
        < 1 TAB>

Also, the new pr_info calls exceed 80 columns.
Maybe you can reduce the number of spaces.

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH v2 0/5] ipvs: fix backup sync daemon with IPv6, and minor updates

2016-06-14 Thread Julian Anastasov

Hello,

On Tue, 14 Jun 2016, Quentin Armitage wrote:

> This series of patches arise from discovering that:
> ipvsadm --start-daemon backup --mcast-group IPv6_address ...
> would always fail.
> 
> The first patch resolves the problem. The second and third patches are
> optimizations that were noticed while investigating the original problem.
> The fourth patch adds a lock which appears to have been omitted, and the
> final patch adds the recently added sync daemon multicast parameters to
> the log messages that are written when the sync daemons start.
> 
> v2 fixes a compile error in a debug message identified by kbuild test robot.
> Now compiles with CONFIG_IP_VS_DEBUG enabled. Patch 2/5 is modified to correct
> the problem, and patch 3/5 is modified to apply with the modified patch 2/5.
> 
> Quentin Armitage (5):
>   ipvs: Enable setting IPv6 multicast address for ipvs sync daemon
> backup
>   ipvs: Stop calling __dev_get_by_name() repeatedly when starting sync
> daemon
>   ipvs: Don't check result < 0 after setting result = 0
>   ipvs: Lock socket before setting SK_CAN_REUSE
>   ipvs: log additional sync daemon parameters
> 
>  net/netfilter/ipvs/ip_vs_sync.c |  104 +++---
>  1 files changed, 52 insertions(+), 52 deletions(-)
> 
> -- 
> 1.7.7.6

Thanks for catching this bug. Following are my
comments for the patches:

Patch 1:

I missed the fact that link-local addresses (ffx2) require
binding to ifindex due to __ipv6_addr_needs_scope_id check,
I tested only with a ff05 address. BTW, ff01 is a node-local
address (loopback), you should not use it for IPVS.

Instead of directly writing into sin6_scope_id we can use
'sock->sk->sk_bound_dev_if = ifindex;' before bind(), it will
work for v4 and v6. Let me know if such solution works.
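
The scope rule mentioned above can be approximated in userspace as a quick check of which group addresses need an interface. This is a sketch under assumptions, not the kernel's __ipv6_addr_needs_scope_id(): the function name addr_needs_ifindex() is made up for the example, and the classification relies on the standard IN6_IS_ADDR_* macros from <netinet/in.h>.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>

/*
 * Userspace approximation (hypothetical helper): link-local unicast
 * and link-local or node-local multicast addresses are only
 * meaningful together with an interface.
 * Returns false for strings that do not parse as IPv6 addresses.
 */
bool addr_needs_ifindex(const char *text)
{
	struct in6_addr a;

	if (inet_pton(AF_INET6, text, &a) != 1)
		return false;

	return IN6_IS_ADDR_LINKLOCAL(&a) ||
	       IN6_IS_ADDR_MC_LINKLOCAL(&a) ||
	       IN6_IS_ADDR_MC_NODELOCAL(&a);
}
```

Under this check ff02::1:81 requires an interface while ff05::1:81 does not, which matches why a test with only a ff05 address did not expose the bind failure.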

You have to send this patch as a bugfix, it should
apply to the net tree and later will go to stable trees (4.3+),
i.e. 4.4, 4.5, 4.6 and 4.7, I don't see stable 4.3 in
https://www.kernel.org/. You should mention in commit message
that this patch is a fix to specific commit (check
Documentation/SubmittingPatches):

Fixes: d33288172e72 ("ipvs: add more mcast parameters for the sync daemon")

The other patches will go to the net-next tree in a
separate patchset but I see a little fuzz if patch 2 is applied
without patch 1, so maybe this patchset should wait for the first
patch to appear in the net-next kernel.

Patch 2: looks OK

Patch 3: looks OK

It was done this way to not exceed the 80-char limit.
May be you can reduce the message for the same reason.

Patch 4: looks OK

Before bind() such operations should be safe without locks.

Patch 5:

No need of <> for the commit IDs.

The indentation of existing pr_info in both cases
should not be changed.

Patches 1, 2, 3 have coding style warnings from checkpatch
that can be fixed, you can check them in this way:

scripts/checkpatch.pl --strict /tmp/file.patch

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH ipvs-next] ipvs: count pre-established TCP states as active

2016-06-12 Thread Julian Anastasov


Hello,

On Fri, 3 Jun 2016, Michal Kubecek wrote:

> Some users observed that "least connection" distribution algorithm doesn't
> handle well bursts of TCP connections from reconnecting clients after
> a node or network failure.
> 
> This is because the algorithm counts active connection as worth 256
> inactive ones where for TCP, "active" only means TCP connections in
> ESTABLISHED state. In case of a connection burst, new connections are
> handled before previous ones have finished the three way handshaking so
> that all are still counted as "inactive", i.e. cheap ones. They become
> "active" quickly but at that time, all of them are already assigned to one
> real server (or few), resulting in highly unbalanced distribution.
> 
> Address this by counting the "pre-established" states as "active".
> 
> Signed-off-by: Michal Kubecek <mkube...@suse.cz>

Acked-by: Julian Anastasov <j...@ssi.bg>

Simon, please apply!

> ---
>  net/netfilter/ipvs/ip_vs_proto_tcp.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> index d7024b2ed769..5117bcb7d2f0 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> @@ -395,6 +395,20 @@ static const char *const tcp_state_name_table[IP_VS_TCP_S_LAST+1] = {
>   [IP_VS_TCP_S_LAST]  =   "BUG!",
>  };
>  
> +static const bool tcp_state_active_table[IP_VS_TCP_S_LAST] = {
> + [IP_VS_TCP_S_NONE]  =   false,
> + [IP_VS_TCP_S_ESTABLISHED]   =   true,
> + [IP_VS_TCP_S_SYN_SENT]  =   true,
> + [IP_VS_TCP_S_SYN_RECV]  =   true,
> + [IP_VS_TCP_S_FIN_WAIT]  =   false,
> + [IP_VS_TCP_S_TIME_WAIT] =   false,
> + [IP_VS_TCP_S_CLOSE] =   false,
> + [IP_VS_TCP_S_CLOSE_WAIT]=   false,
> + [IP_VS_TCP_S_LAST_ACK]  =   false,
> + [IP_VS_TCP_S_LISTEN]=   false,
> + [IP_VS_TCP_S_SYNACK]=   true,
> +};
> +
>  #define sNO IP_VS_TCP_S_NONE
>  #define sES IP_VS_TCP_S_ESTABLISHED
>  #define sSS IP_VS_TCP_S_SYN_SENT
> @@ -418,6 +432,13 @@ static const char * tcp_state_name(int state)
>   return tcp_state_name_table[state] ? tcp_state_name_table[state] : "?";
>  }
>  
> +static bool tcp_state_active(int state)
> +{
> + if (state >= IP_VS_TCP_S_LAST)
> + return false;
> + return tcp_state_active_table[state];
> +}
> +
>  static struct tcp_states_t tcp_states [] = {
>  /*   INPUT */
>  /*sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA  */
> @@ -540,12 +561,12 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct ip_vs_conn *cp,
>  
>   if (dest) {
>   if (!(cp->flags & IP_VS_CONN_F_INACTIVE) &&
> - (new_state != IP_VS_TCP_S_ESTABLISHED)) {
> + !tcp_state_active(new_state)) {
>   atomic_dec(&dest->activeconns);
>   atomic_inc(&dest->inactconns);
>   cp->flags |= IP_VS_CONN_F_INACTIVE;
>   } else if ((cp->flags & IP_VS_CONN_F_INACTIVE) &&
> -        (new_state == IP_VS_TCP_S_ESTABLISHED)) {
> +tcp_state_active(new_state)) {
>   atomic_inc(&dest->activeconns);
>   atomic_dec(&dest->inactconns);
>   cp->flags &= ~IP_VS_CONN_F_INACTIVE;
> -- 
> 2.8.3

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH ipvs-next] ipvs: count pre-established TCP states as active

2016-06-06 Thread Julian Anastasov


Hello,

On Fri, 3 Jun 2016, Michal Kubecek wrote:

> Some users observed that "least connection" distribution algorithm doesn't
> handle well bursts of TCP connections from reconnecting clients after
> a node or network failure.
> 
> This is because the algorithm counts active connection as worth 256
> inactive ones where for TCP, "active" only means TCP connections in
> ESTABLISHED state. In case of a connection burst, new connections are
> handled before previous ones have finished the three way handshaking so
> that all are still counted as "inactive", i.e. cheap ones. They become
> "active" quickly but at that time, all of them are already assigned to one
> real server (or few), resulting in highly unbalanced distribution.
> 
> Address this by counting the "pre-established" states as "active".
> 
> Signed-off-by: Michal Kubecek <mkube...@suse.cz>

LC and WLC are bursty by nature. Maybe a new
scheduler is needed that combines the LC algorithm with
WRR mode to adaptively reduce the difference in load,
especially for the case when new server is started in
a setup with many servers.

Give me some days or week to analyze the effects of
your patch, mostly in situations of SYN attack. Note that
there can be schedulers affected by this change but I think
they will only benefit from it.
For example:

- LBLC, LBLCR: can use weight as threshold for activeconns
- NQ, SED: only activeconns are used
- OVF: weight is used as threshold for activeconns,
inactconns are not used

Schedulers not part of the kernel can be affected
too. So, for now the plan is to apply this patch. If there
are other opinions, please speak up.
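
The burst effect described in the commit message can be reproduced with a small userspace model. This is a sketch under assumptions, not the IPVS scheduler code: struct dest, lc_overhead() and lc_pick() are hypothetical names, and the only thing taken from the thread is the rule that one active connection is worth 256 inactive ones.

```c
#include <stddef.h>

/* Hypothetical stand-in for a real server's connection counters. */
struct dest {
	int activeconns;
	int inactconns;
};

/* Least-connection cost: one active connection weighs as much as 256 inactive ones. */
long lc_overhead(const struct dest *d)
{
	return ((long)d->activeconns << 8) + d->inactconns;
}

/* Return the index of the destination with the smallest overhead (first wins on ties). */
size_t lc_pick(const struct dest *dests, size_t n)
{
	size_t best = 0;

	for (size_t i = 1; i < n; i++)
		if (lc_overhead(&dests[i]) < lc_overhead(&dests[best]))
			best = i;
	return best;
}
```

Starting from three servers with 10, 10 and 9 active connections, scheduling a burst of 100 connections that are still counted as inactive sends every one of them to the third server, because each adds only 1 to an overhead that is 256 lower than the others. Counting the same connections as active at scheduling time spreads them evenly, which is the behaviour the patch aims for.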

> ---
>  net/netfilter/ipvs/ip_vs_proto_tcp.c | 25 +++++++++++++++++++++++--
>  1 file changed, 23 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_proto_tcp.c b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> index d7024b2ed769..5117bcb7d2f0 100644
> --- a/net/netfilter/ipvs/ip_vs_proto_tcp.c
> +++ b/net/netfilter/ipvs/ip_vs_proto_tcp.c
> @@ -395,6 +395,20 @@ static const char *const tcp_state_name_table[IP_VS_TCP_S_LAST+1] = {
>   [IP_VS_TCP_S_LAST]  =   "BUG!",
>  };
>  
> +static const bool tcp_state_active_table[IP_VS_TCP_S_LAST] = {
> + [IP_VS_TCP_S_NONE]  =   false,
> + [IP_VS_TCP_S_ESTABLISHED]   =   true,
> + [IP_VS_TCP_S_SYN_SENT]  =   true,
> + [IP_VS_TCP_S_SYN_RECV]  =   true,
> + [IP_VS_TCP_S_FIN_WAIT]  =   false,
> + [IP_VS_TCP_S_TIME_WAIT] =   false,
> + [IP_VS_TCP_S_CLOSE] =   false,
> + [IP_VS_TCP_S_CLOSE_WAIT]=   false,
> + [IP_VS_TCP_S_LAST_ACK]  =   false,
> + [IP_VS_TCP_S_LISTEN]=   false,
> + [IP_VS_TCP_S_SYNACK]=   true,
> +};
> +
>  #define sNO IP_VS_TCP_S_NONE
>  #define sES IP_VS_TCP_S_ESTABLISHED
>  #define sSS IP_VS_TCP_S_SYN_SENT
> @@ -418,6 +432,13 @@ static const char * tcp_state_name(int state)
>   return tcp_state_name_table[state] ? tcp_state_name_table[state] : "?";
>  }
>  
> +static bool tcp_state_active(int state)
> +{
> + if (state >= IP_VS_TCP_S_LAST)
> + return false;
> + return tcp_state_active_table[state];
> +}
> +
>  static struct tcp_states_t tcp_states [] = {
>  /*   INPUT */
>  /*sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA  */
> @@ -540,12 +561,12 @@ set_tcp_state(struct ip_vs_proto_data *pd, struct ip_vs_conn *cp,
>  
>   if (dest) {
>   if (!(cp->flags & IP_VS_CONN_F_INACTIVE) &&
> - (new_state != IP_VS_TCP_S_ESTABLISHED)) {
> + !tcp_state_active(new_state)) {
>   atomic_dec(&dest->activeconns);
>   atomic_inc(&dest->inactconns);
>   cp->flags |= IP_VS_CONN_F_INACTIVE;
>   } else if ((cp->flags & IP_VS_CONN_F_INACTIVE) &&
> -        (new_state == IP_VS_TCP_S_ESTABLISHED)) {
> +tcp_state_active(new_state)) {
>   atomic_inc(&dest->activeconns);
>   atomic_dec(&dest->inactconns);
>   cp->flags &= ~IP_VS_CONN_F_INACTIVE;
> -- 
> 2.8.3

Regards

--
Julian Anastasov <j...@ssi.bg>

Re: [PATCH 2/2] netfilter: ipvs/SIP: handle ip_vs_fill_iph_skb_off failure

2016-01-27 Thread Julian Anastasov


Hello,

On Wed, 27 Jan 2016, Arnd Bergmann wrote:

> ip_vs_fill_iph_skb_off() may not find an IP header, and gcc has
> determined that ip_vs_sip_fill_param() then incorrectly accesses
> the protocol fields:
> 
> net/netfilter/ipvs/ip_vs_pe_sip.c: In function 'ip_vs_sip_fill_param':
> net/netfilter/ipvs/ip_vs_pe_sip.c:76:5: error: 'iph.protocol' may be used 
> uninitialized in this function [-Werror=maybe-uninitialized]
>   if (iph.protocol != IPPROTO_UDP)
>  ^
> net/netfilter/ipvs/ip_vs_pe_sip.c:81:10: error: 'iph.len' may be used 
> uninitialized in this function [-Werror=maybe-uninitialized]
>   dataoff = iph.len + sizeof(struct udphdr);
>   ^
> 
> This adds a check for the ip_vs_fill_iph_skb_off() return code
> before looking at the ip header data returned from it.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: b0e010c527de ("ipvs: replace ip_vs_fill_ip4hdr with 
> ip_vs_fill_iph_skb_off")

Looks ok to me,

Acked-by: Julian Anastasov 

but see below...

> ---
>  net/netfilter/ipvs/ip_vs_pe_sip.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_pe_sip.c 
> b/net/netfilter/ipvs/ip_vs_pe_sip.c
> index 1b8d594e493a..c4e9ca016a88 100644
> --- a/net/netfilter/ipvs/ip_vs_pe_sip.c
> +++ b/net/netfilter/ipvs/ip_vs_pe_sip.c
> @@ -70,10 +70,10 @@ ip_vs_sip_fill_param(struct ip_vs_conn_param *p, struct 
> sk_buff *skb)
>   const char *dptr;
>   int retc;
>  
> - ip_vs_fill_iph_skb(p->af, skb, false, &iph);
> + retc = ip_vs_fill_iph_skb(p->af, skb, false, &iph);
>  
>   /* Only useful with UDP */
> - if (iph.protocol != IPPROTO_UDP)
> + if (!retc || iph.protocol != IPPROTO_UDP)
>   return -EINVAL;
>   /* todo: IPv6 fragments:
>*   I think this only should be done for the first fragment. /HS

There are other places like this where the result is not
checked because there is always a guarding skb_header_pointer
check, i.e. ip_vs_fill_iph_skb* should not fail at such a point.

Let us know if you want to extend this patch to other such
calls (including ip_vs_fill_iph_skb_icmp). Maybe they will
need to return NF_ACCEPT. I guess all such changes should be
for the ipvs-next/net-next tree when it opens.
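
For illustration, the pattern of checking the fill helper before reading the header can be sketched in user space. The demo_* names and the simplified IPv4 parsing are assumptions made for this example, not the IPVS helpers:

```c
#include <stddef.h>
#include <errno.h>

#define DEMO_IPPROTO_UDP	17
#define DEMO_UDP_HDR_LEN	8

struct demo_iphdr { int protocol; int len; };

/* Returns 1 and fills *iph on success; returns 0 and leaves *iph
 * untouched when the buffer is too short for a basic IPv4 header.
 */
static int demo_fill_iph(const unsigned char *buf, size_t buflen,
			 struct demo_iphdr *iph)
{
	if (buflen < 20)
		return 0;
	iph->len = (buf[0] & 0x0f) * 4;	/* IHL in 32-bit words */
	iph->protocol = buf[9];
	return 1;
}

/* Returns the payload offset after the UDP header, or -EINVAL. */
static int demo_sip_fill_param(const unsigned char *buf, size_t buflen)
{
	struct demo_iphdr iph;

	/* Short-circuit: iph is read only if the helper succeeded. */
	if (!demo_fill_iph(buf, buflen, &iph) ||
	    iph.protocol != DEMO_IPPROTO_UDP)
		return -EINVAL;

	return iph.len + DEMO_UDP_HDR_LEN;
}
```

Testing the return value first also lets the compiler see that iph is never read uninitialized, which is what silences the -Wmaybe-uninitialized report.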

Regards

--
Julian Anastasov

Re: [PATCH 1/2] netfilter: ipvs: avoid unused variable warnings

2016-01-27 Thread Julian Anastasov


Hello,

On Wed, 27 Jan 2016, Arnd Bergmann wrote:

> The proc_create() and remove_proc_entry() functions do not reference
> their arguments when CONFIG_PROC_FS is disabled, so we get a couple
> of warnings about unused variables in IPVS:
> 
> ipvs/ip_vs_app.c:608:14: warning: unused variable 'net' [-Wunused-variable]
> ipvs/ip_vs_ctl.c:3950:14: warning: unused variable 'net' [-Wunused-variable]
> ipvs/ip_vs_ctl.c:3994:14: warning: unused variable 'net' [-Wunused-variable]
> 
> This removes the local variables and instead looks them up separately
> for each use, which obviously avoids the warning.
> 
> Signed-off-by: Arnd Bergmann 
> Fixes: 4c50a8ce2b63 ("netfilter: ipvs: avoid unused variable warning")

Looks like your previous patch for ip_vs_app_net_cleanup
was delayed in the ipvs-next tree. I guess Simon should drop it and
use this one instead when net-next opens:

Acked-by: Julian Anastasov 

> ---
>  net/netfilter/ipvs/ip_vs_app.c |  8 ++--
>  net/netfilter/ipvs/ip_vs_ctl.c | 15 ++-
>  2 files changed, 8 insertions(+), 15 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_app.c b/net/netfilter/ipvs/ip_vs_app.c
> index 0328f7250693..299edc6add5a 100644
> --- a/net/netfilter/ipvs/ip_vs_app.c
> +++ b/net/netfilter/ipvs/ip_vs_app.c
> @@ -605,17 +605,13 @@ static const struct file_operations ip_vs_app_fops = {
>  
>  int __net_init ip_vs_app_net_init(struct netns_ipvs *ipvs)
>  {
> - struct net *net = ipvs->net;
> -
>   INIT_LIST_HEAD(&ipvs->app_list);
> - proc_create("ip_vs_app", 0, net->proc_net, &ip_vs_app_fops);
> + proc_create("ip_vs_app", 0, ipvs->net->proc_net, &ip_vs_app_fops);
>   return 0;
>  }
>  
>  void __net_exit ip_vs_app_net_cleanup(struct netns_ipvs *ipvs)
>  {
> - struct net *net = ipvs->net;
> -
>   unregister_ip_vs_app(ipvs, NULL /* all */);
> - remove_proc_entry("ip_vs_app", net->proc_net);
> + remove_proc_entry("ip_vs_app", ipvs->net->proc_net);
>  }
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index e7c1b052c2a3..bfb4f8372b83 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -3947,7 +3947,6 @@ static struct notifier_block ip_vs_dst_notifier = {
>  
>  int __net_init ip_vs_control_net_init(struct netns_ipvs *ipvs)
>  {
> - struct net *net = ipvs->net;
>   int i, idx;
>  
>   /* Initialize rs_table */
> @@ -3974,9 +3973,9 @@ int __net_init ip_vs_control_net_init(struct netns_ipvs 
> *ipvs)
>  
>   spin_lock_init(&ipvs->tot_stats.lock);
>  
> - proc_create("ip_vs", 0, net->proc_net, &ip_vs_info_fops);
> - proc_create("ip_vs_stats", 0, net->proc_net, &ip_vs_stats_fops);
> - proc_create("ip_vs_stats_percpu", 0, net->proc_net,
> + proc_create("ip_vs", 0, ipvs->net->proc_net, &ip_vs_info_fops);
> + proc_create("ip_vs_stats", 0, ipvs->net->proc_net, &ip_vs_stats_fops);
> + proc_create("ip_vs_stats_percpu", 0, ipvs->net->proc_net,
>   &ip_vs_stats_percpu_fops);
>  
>   if (ip_vs_control_net_init_sysctl(ipvs))
> @@ -3991,13 +3990,11 @@ err:
>  
>  void __net_exit ip_vs_control_net_cleanup(struct netns_ipvs *ipvs)
>  {
> - struct net *net = ipvs->net;
> -
>   ip_vs_trash_cleanup(ipvs);
>   ip_vs_control_net_cleanup_sysctl(ipvs);
> - remove_proc_entry("ip_vs_stats_percpu", net->proc_net);
> - remove_proc_entry("ip_vs_stats", net->proc_net);
> - remove_proc_entry("ip_vs", net->proc_net);
> + remove_proc_entry("ip_vs_stats_percpu", ipvs->net->proc_net);
> + remove_proc_entry("ip_vs_stats", ipvs->net->proc_net);
> + remove_proc_entry("ip_vs", ipvs->net->proc_net);
>   free_percpu(ipvs->tot_stats.cpustats);
>  }
>  
> -- 
> 2.7.0
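
As a rough sketch of why the warning appears and why the patch avoids it: when a stub is a macro that drops its arguments, a local variable used only in that call has no remaining use after preprocessing. The demo_* names below are hypothetical, and the stub only models the disabled-feature case described in the commit message.

```c
#include <stddef.h>

struct demo_net { void *proc_net; };
struct demo_ipvs { struct demo_net *net; int cleaned; };

/* Stub modelled on a compiled-out feature: the preprocessor discards
 * both arguments, so nothing passed here counts as a use.
 */
#define demo_remove_proc_entry(name, parent) do { } while (0)

/* The old form would read:
 *
 *	struct demo_net *net = ipvs->net;
 *	demo_remove_proc_entry("ip_vs_app", net->proc_net);
 *
 * After expansion 'net' is never referenced, hence -Wunused-variable.
 * The form below has no local variable left to warn about.
 */
static void demo_cleanup(struct demo_ipvs *ipvs)
{
	demo_remove_proc_entry("ip_vs_app", ipvs->net->proc_net);
	ipvs->cleaned = 1;
}
```

Because the expression is written inline in the argument list, it disappears together with the call when the stub is used.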

Regards

--
Julian Anastasov

Re: ipv4: ip unreachable with SO_BINDTODEVICE socket

2015-11-11 Thread Julian Anastasov


Hello,

On Wed, 11 Nov 2015, Kouya Shimura wrote:

> Hi
> 
> When both server and client are on the same machine and each of their
> sockets has SO_BINDTODEVICE set, sometimes a packet doesn't
> reach the server.
> 
> The reproducible test program is attached. (modify "IF_ADDR=, IP_ADDR=,
> PORT=" lines appropriately).  Please try 'taskset -c 1 python test.py'
> since per cpu data (rt_cache) affects results.  Also 'tcpdump -i lo'
> is helpful for testing.  you can see "ICMP udp port unreachable".
> 
> In this test program, a packet doesn't pass through the bound
> interface but 'lo' interface. So, it might be granted that local
> communication with SO_BINDTODEVICE socket fails. However, dnsmasq and
> dhcp_release commands rely on it (Actually I've found this issue on
> the OpenStack environment) and the test program works well on
> linux-2.6.32 but doesn't work on linux-3.10.0 and 4.3.0.
> 
> I'd like to know whether this is a kernel bug or the specification of
> SO_BINDTODEVICE.

My man page does not indicate that SO_BINDTODEVICE
should be relaxed for traffic from loopback but for old
kernels it was really working in this way due to caching
of different orig_oif values and even for mcast/bcast
because ip_mc_output() does not use "lo" for loopback.

> The attached patch fixes this issue, but no confidence this is a
> right modification.

> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 85f184e..546cabe 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2027,7 +2027,7 @@ static struct rtable *__mkroute_output(const struct 
> fib_result *res,
> prth = raw_cpu_ptr(nh->nh_pcpu_rth_output);
> }
> rth = rcu_dereference(*prth);
> -   if (rt_cache_valid(rth)) {
> +   if (rt_cache_valid(rth) && rth->rt_iif == orig_oif) {
> dst_hold(&rth->dst);
> return rth;
> }

So, if the cache holds a route with the same orig_oif we
will use it, and if it differs we will replace the cached entry.
As long as the traffic uses the same orig_oif we will not have
frequent rt_cache_route() calls that replace the cache.

Patch looks ok to me but I'm not sure if we should
worry about the unicast traffic. If we want frequent
updates only for loopback then the check could be:

if (rt_cache_valid(rth) &&
(!(flags & RTCF_LOCAL) || rth->rt_iif == orig_oif)) {

Or the following, which should cache mcast better because
mcast does not use/need the rt_iif check:

if (rt_cache_valid(rth) &&
(type != RTN_LOCAL || rth->rt_iif == orig_oif)) {
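
The effect of the second variant can be modelled with a one-slot cache. This is only a sketch: the demo_* names are hypothetical, and it assumes the cached entry records the orig_oif it was created with.

```c
#include <stdbool.h>

enum demo_rtn_type { DEMO_RTN_UNICAST, DEMO_RTN_LOCAL, DEMO_RTN_MULTICAST };

/* One cached output route, standing in for the per-CPU slot. */
struct demo_rtable { bool valid; int rt_iif; };

/* Only local routes must match the interface the caller is bound to. */
static bool demo_cache_usable(const struct demo_rtable *rth,
			      enum demo_rtn_type type, int orig_oif)
{
	return rth->valid &&
	       (type != DEMO_RTN_LOCAL || rth->rt_iif == orig_oif);
}

/* Returns true on a cache hit; on a miss the slot is replaced. */
static bool demo_lookup(struct demo_rtable *cache,
			enum demo_rtn_type type, int orig_oif)
{
	if (demo_cache_usable(cache, type, orig_oif))
		return true;

	cache->valid = true;
	cache->rt_iif = orig_oif;
	return false;
}
```

Local traffic that alternates between different orig_oif values replaces the slot on every lookup, while unicast and mcast lookups keep hitting the cached entry regardless of orig_oif.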

Regards

--
Julian Anastasov 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4.1 125/159] net: call rcu_read_lock early in process_backlog

2015-09-29 Thread Julian Anastasov


Hello,

On Tue, 29 Sep 2015, Andre Tomt wrote:

> On 29. sep. 2015 10:39, Andre Tomt (LKML) wrote:
> > I just had another hang with it reverted on two different guests..
> > However it took nearly 6 hours rather than the usual "few minutes" for
> > these two. So now I'm a little unsure about my initial conclusions.
> > 
> > On 29. sep. 2015 09:40, Julian Anastasov wrote:
> >> On Tue, 29 Sep 2015, Andre Tomt (LKML) wrote:
> 
> >>They are 2 related patches, the first one is
> >> [PATCH 4.1 124/159] net: do not process device backlog during 
> >> unregistration
> > 
> > Would reverting this change anything outside device unregistration at all?

Its role is only during unregistration, so it
should not matter.

> I enabled CONFIG_RCU_CPU_STALL_INFO=y and disabled a bunch of non-virt
> drivers to speed up debugging. But no output this time either. Got any
> ideas on debugging options I've forgot? Useful sysrqs?

Checking my .config for debug options... I'm not an expert
on this but maybe such settings can help:

CONFIG_PROVE_RCU=y
CONFIG_PROVE_RCU_REPEATEDLY=y
CONFIG_SPARSE_RCU_POINTER=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_STACKTRACE=y - you have this one
CONFIG_DEBUG_BUGVERBOSE=y

Regards

--
Julian Anastasov 
