Re: [PATCH] net-sysfs: export gso_max_size attribute

2017-11-24 Thread Eric Dumazet
On Fri, 2017-11-24 at 11:43 -0700, David Ahern wrote:
> On 11/24/17 11:32 AM, Eric Dumazet wrote:
> > On Fri, 2017-11-24 at 10:14 -0700, David Ahern wrote:
> > > On 11/22/17 5:30 PM, Solio Sarabia wrote:
> > > > The netdevice gso_max_size is exposed to allow users fine-
> > > > control
> > > > on
> > > > systems with multiple NICs with different GSO buffer sizes, and
> > > > where
> > > > the virtual devices like bridge and veth, need to be aware of
> > > > the
> > > > GSO
> > > > size of the underlying devices.
> > > > 
> > > > In a virtualized environment, setting the right GSO sizes for
> > > > physical
> > > > and virtual devices makes all TSO work to be on physical NIC,
> > > > improving
> > > > throughput and reducing CPU util. If virtual devices send
> > > > buffers
> > > > greater than what NIC supports, it forces host to do TSO for
> > > > buffers
> > > > exceeding the limit, increasing CPU utilization in host.
> > > > 
> > > > Suggested-by: Shiny Sebastian 
> > > > Signed-off-by: Solio Sarabia 
> > > > ---
> > > 
> > > This should be added to rtnetlink rather than sysfs.
> > 
> > This is already exposed by rtnetlink [1]
> 
> It currently is read-only. This patch wants to control setting it.
> 
> > 
> > Please lets not add yet another net-sysfs knob.
> 
> Which is my main point - no more sysfs files.
> 

I was not objecting to your point, sorry if this was not obvious.

I usually hit reply on the latest email, not the first one in the
thread.

Proper support for changing these attributes is more complex than that
trivial change. Bonding and team devices, and tunnels comes to mind.




Re: [PATCH] net-sysfs: export gso_max_size attribute

2017-11-24 Thread Eric Dumazet
On Fri, 2017-11-24 at 10:14 -0700, David Ahern wrote:
> On 11/22/17 5:30 PM, Solio Sarabia wrote:
> > The netdevice gso_max_size is exposed to allow users fine-control
> > on
> > systems with multiple NICs with different GSO buffer sizes, and
> > where
> > the virtual devices like bridge and veth, need to be aware of the
> > GSO
> > size of the underlying devices.
> > 
> > In a virtualized environment, setting the right GSO sizes for
> > physical
> > and virtual devices makes all TSO work to be on physical NIC,
> > improving
> > throughput and reducing CPU util. If virtual devices send buffers
> > greater than what NIC supports, it forces host to do TSO for
> > buffers
> > exceeding the limit, increasing CPU utilization in host.
> > 
> > Suggested-by: Shiny Sebastian 
> > Signed-off-by: Solio Sarabia 
> > ---
> 
> This should be added to rtnetlink rather than sysfs.

This is already exposed by rtnetlink [1]

Please lets not add yet another net-sysfs knob.

[1] c70ce028e834f8e51306217dbdbd441d851c64d3 net/rtnetlink: add 
IFLA_GSO_MAX_SEGS and IFLA_GSO_MAX_SIZE attributes




Re: [PATCH] net-sysfs: export gso_max_size attribute

2017-11-24 Thread Eric Dumazet
On Fri, 2017-11-24 at 10:14 -0700, David Ahern wrote:
> On 11/22/17 5:30 PM, Solio Sarabia wrote:
> > The netdevice gso_max_size is exposed to allow users fine-control
> > on
> > systems with multiple NICs with different GSO buffer sizes, and
> > where
> > the virtual devices like bridge and veth, need to be aware of the
> > GSO
> > size of the underlying devices.
> > 
> > In a virtualized environment, setting the right GSO sizes for
> > physical
> > and virtual devices makes all TSO work to be on physical NIC,
> > improving
> > throughput and reducing CPU util. If virtual devices send buffers
> > greater than what NIC supports, it forces host to do TSO for
> > buffers
> > exceeding the limit, increasing CPU utilization in host.
> > 
> > Suggested-by: Shiny Sebastian 
> > Signed-off-by: Solio Sarabia 
> > ---
> 
> This should be added to rtnetlink rather than sysfs.

This is already exposed by rtnetlink [1]

Please lets not add yet another net-sysfs knob.

[1] c70ce028e834f8e51306217dbdbd441d851c64d3 net/rtnetlink: add 
IFLA_GSO_MAX_SEGS and IFLA_GSO_MAX_SIZE attributes




Re: [PATCH net v2] net: thunderx: Fix TCP/UDP checksum offload for IPv6 pkts

2017-11-23 Thread Eric Dumazet
On Thu, 2017-11-23 at 17:41 +0300, Aleksey Makarov wrote:
> From: Sunil Goutham <sgout...@cavium.com>
> 
> Don't offload IP header checksum to NIC.
> 
> This fixes a previous patch which enabled checksum offloading
> for both IPv4 and IPv6 packets.  So L3 checksum offload was
> getting enabled for IPv6 pkts.  And HW is dropping these pkts
> as it assumes the pkt is IPv4 when IP csum offload is set
> in the SQ descriptor.
> 
> Fixes: 494fd005 ("net: thunderx: Enable TSO and checksum offloads
> for ipv6")
> Signed-off-by: Sunil Goutham <sgout...@cavium.com>
> Signed-off-by: Aleksey Makarov <aleksey.maka...@auriga.com>
> ---
>  drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> v2:
> - Don't enable checksum offloading both for IPv4 and IPv6 (Eric
> Dumazet)
> 
> v1:
>   https://lkml.kernel.org/r/20171122123727.23580-1-aleksey.makarov@au
> riga.com
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> index d4496e9afcdf..8b2c31e2a2b0 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> @@ -1355,7 +1355,6 @@ nicvf_sq_add_hdr_subdesc(struct nicvf *nic,
> struct snd_queue *sq, int qentry,
>  
>   /* Offload checksum calculation to HW */
>   if (skb->ip_summed == CHECKSUM_PARTIAL) {
> - hdr->csum_l3 = 1; /* Enable IP csum calculation */
>   hdr->l3_offset = skb_network_offset(skb);
>   hdr->l4_offset = skb_transport_offset(skb);
>  

Reviewed-by: Eric Dumazet <eduma...@google.com>

Thanks !



Re: [PATCH net v2] net: thunderx: Fix TCP/UDP checksum offload for IPv6 pkts

2017-11-23 Thread Eric Dumazet
On Thu, 2017-11-23 at 17:41 +0300, Aleksey Makarov wrote:
> From: Sunil Goutham 
> 
> Don't offload IP header checksum to NIC.
> 
> This fixes a previous patch which enabled checksum offloading
> for both IPv4 and IPv6 packets.  So L3 checksum offload was
> getting enabled for IPv6 pkts.  And HW is dropping these pkts
> as it assumes the pkt is IPv4 when IP csum offload is set
> in the SQ descriptor.
> 
> Fixes: 494fd005 ("net: thunderx: Enable TSO and checksum offloads
> for ipv6")
> Signed-off-by: Sunil Goutham 
> Signed-off-by: Aleksey Makarov 
> ---
>  drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> v2:
> - Don't enable checksum offloading both for IPv4 and IPv6 (Eric
> Dumazet)
> 
> v1:
>   https://lkml.kernel.org/r/20171122123727.23580-1-aleksey.makarov@au
> riga.com
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> index d4496e9afcdf..8b2c31e2a2b0 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> @@ -1355,7 +1355,6 @@ nicvf_sq_add_hdr_subdesc(struct nicvf *nic,
> struct snd_queue *sq, int qentry,
>  
>   /* Offload checksum calculation to HW */
>   if (skb->ip_summed == CHECKSUM_PARTIAL) {
> - hdr->csum_l3 = 1; /* Enable IP csum calculation */
>       hdr->l3_offset = skb_network_offset(skb);
>   hdr->l4_offset = skb_transport_offset(skb);
>  

Reviewed-by: Eric Dumazet 

Thanks !



Re: [PATCH net] net: thunderx: Fix TCP/UDP checksum offload for IPv6 pkts

2017-11-22 Thread Eric Dumazet
On Wed, 2017-11-22 at 15:37 +0300, Aleksey Makarov wrote:
> From: Sunil Goutham 
> 
> This fixes a previous patch which missed some changes
> and due to which L3 checksum offload was getting enabled
> for IPv6 pkts. And HW is dropping these pkts as it assumes
> the pkt is IPv4 when IP csum offload is set in the SQ
> descriptor.
> 
> Fixes: 494fd005 ("net: thunderx: Enable TSO and checksum offloads
> for ipv6")
> Signed-off-by: Sunil Goutham 
> Signed-off-by: Aleksey Makarov 
> ---
>  drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> index d4496e9afcdf..184d5bdbe7e0 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> @@ -1355,10 +1355,11 @@ nicvf_sq_add_hdr_subdesc(struct nicvf *nic,
> struct snd_queue *sq, int qentry,
>  
>   /* Offload checksum calculation to HW */
>   if (skb->ip_summed == CHECKSUM_PARTIAL) {
> - hdr->csum_l3 = 1; /* Enable IP csum calculation */
>   hdr->l3_offset = skb_network_offset(skb);
>   hdr->l4_offset = skb_transport_offset(skb);
>  
> + /* Enable IP HDR csum calculation for V4 pkts */
> + hdr->csum_l3 = (ip.v4->version == 4) ? 1 : 0;

Have you tried to set hdr->csum_l3 to 0 regardless of version being 4
or 6 ?

This would remove the need for yet another conditional.

AFAIK, linux does not offload IPv4 header checksums to NIC, it is not
worth the trouble.




Re: [PATCH net] net: thunderx: Fix TCP/UDP checksum offload for IPv6 pkts

2017-11-22 Thread Eric Dumazet
On Wed, 2017-11-22 at 15:37 +0300, Aleksey Makarov wrote:
> From: Sunil Goutham 
> 
> This fixes a previous patch which missed some changes
> and due to which L3 checksum offload was getting enabled
> for IPv6 pkts. And HW is dropping these pkts as it assumes
> the pkt is IPv4 when IP csum offload is set in the SQ
> descriptor.
> 
> Fixes: 494fd005 ("net: thunderx: Enable TSO and checksum offloads
> for ipv6")
> Signed-off-by: Sunil Goutham 
> Signed-off-by: Aleksey Makarov 
> ---
>  drivers/net/ethernet/cavium/thunder/nicvf_queues.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> index d4496e9afcdf..184d5bdbe7e0 100644
> --- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> +++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
> @@ -1355,10 +1355,11 @@ nicvf_sq_add_hdr_subdesc(struct nicvf *nic,
> struct snd_queue *sq, int qentry,
>  
>   /* Offload checksum calculation to HW */
>   if (skb->ip_summed == CHECKSUM_PARTIAL) {
> - hdr->csum_l3 = 1; /* Enable IP csum calculation */
>   hdr->l3_offset = skb_network_offset(skb);
>   hdr->l4_offset = skb_transport_offset(skb);
>  
> + /* Enable IP HDR csum calculation for V4 pkts */
> + hdr->csum_l3 = (ip.v4->version == 4) ? 1 : 0;

Have you tried to set hdr->csum_l3 to 0 regardless of version being 4
or 6 ?

This would remove the need for yet another conditional.

AFAIK, linux does not offload IPv4 header checksums to NIC, it is not
worth the trouble.




Re: [PATCH][net-next] tcp: remove the now redundant non-null check on tskb

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 18:41 +, Colin King wrote:
> From: Colin Ian King 
> 
> The non-null check on tskb is redundant as it is in an else
> section of a check on tskb where tskb is always null. Remove
> the redundant if statement and the label coalesce.
> 
> Detected by CoverityScan, CID#1457751 ("Logically dead code")
> 
> Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
> Signed-off-by: Colin Ian King 
> ---
>  net/ipv4/tcp_output.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 071bdd34f8eb..b58c986b2b27 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3053,7 +3053,6 @@ void tcp_send_fin(struct sock *sk)
>   tskb = skb_rb_last(>tcp_rtx_queue);
>  
>   if (tskb) {
> -coalesce:
>   TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN;
>   TCP_SKB_CB(tskb)->end_seq++;
>   tp->write_seq++;
> @@ -3069,11 +3068,8 @@ void tcp_send_fin(struct sock *sk)
>   }
>   } else {
>   skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
> - if (unlikely(!skb)) {
> - if (tskb)
> - goto coalesce;
> + if (unlikely(!skb))
>   return;
> - }
>   INIT_LIST_HEAD(>tcp_tsorted_anchor);
>   skb_reserve(skb, MAX_TCP_HEADER);
>   sk_forced_mem_schedule(sk, skb->truesize);

Hmm... I would rather try to use skb_rb_last(), because
alloc_skb_fclone() might fail even if tcp_under_memory_pressure() is
false.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
76dbe884f2469660028684a46fc19afa000a1353..eea017b8a8918815226fd1412c0a7b8e484aeca8
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3070,6 +3070,7 @@ void tcp_send_fin(struct sock *sk)
} else {
skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
if (unlikely(!skb)) {
+   tskb = skb_rb_last(>tcp_rtx_queue);
if (tskb)
goto coalesce;
return;




Re: [PATCH][net-next] tcp: remove the now redundant non-null check on tskb

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 18:41 +, Colin King wrote:
> From: Colin Ian King 
> 
> The non-null check on tskb is redundant as it is in an else
> section of a check on tskb where tskb is always null. Remove
> the redundant if statement and the label coalesce.
> 
> Detected by CoverityScan, CID#1457751 ("Logically dead code")
> 
> Fixes: 75c119afe14f ("tcp: implement rb-tree based retransmit queue")
> Signed-off-by: Colin Ian King 
> ---
>  net/ipv4/tcp_output.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 071bdd34f8eb..b58c986b2b27 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3053,7 +3053,6 @@ void tcp_send_fin(struct sock *sk)
>   tskb = skb_rb_last(>tcp_rtx_queue);
>  
>   if (tskb) {
> -coalesce:
>   TCP_SKB_CB(tskb)->tcp_flags |= TCPHDR_FIN;
>   TCP_SKB_CB(tskb)->end_seq++;
>   tp->write_seq++;
> @@ -3069,11 +3068,8 @@ void tcp_send_fin(struct sock *sk)
>   }
>   } else {
>   skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
> - if (unlikely(!skb)) {
> - if (tskb)
> - goto coalesce;
> + if (unlikely(!skb))
>   return;
> - }
>   INIT_LIST_HEAD(>tcp_tsorted_anchor);
>   skb_reserve(skb, MAX_TCP_HEADER);
>   sk_forced_mem_schedule(sk, skb->truesize);

Hmm... I would rather try to use skb_rb_last(), because
alloc_skb_fclone() might fail even if tcp_under_memory_pressure() is
false.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 
76dbe884f2469660028684a46fc19afa000a1353..eea017b8a8918815226fd1412c0a7b8e484aeca8
 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3070,6 +3070,7 @@ void tcp_send_fin(struct sock *sk)
} else {
skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation);
if (unlikely(!skb)) {
+   tskb = skb_rb_last(>tcp_rtx_queue);
if (tskb)
goto coalesce;
return;




Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 10:11 -0800, Stephen Hemminger wrote:
> On Tue, 14 Nov 2017 16:53:33 +0300
> Kirill Tkhai  wrote:
> 
> > +   /*
> > +* RCU-protected list, modifiable by pernet-init and -exit methods.
> > +* When net namespace is alive (net::count > 0), all the changes
> > +* are made under rw_sem held on write.
> > +*/
> > +   struct list_headfib_notifier_ops;
> >  
> 
> If you use __rcu annotation then it could be checked (and the comment is not 
> needed).

I do not think we can use __rcu annotation yet on a struct list_head ?

(The annotation would be needed on the members)






Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 10:11 -0800, Stephen Hemminger wrote:
> On Tue, 14 Nov 2017 16:53:33 +0300
> Kirill Tkhai  wrote:
> 
> > +   /*
> > +* RCU-protected list, modifiable by pernet-init and -exit methods.
> > +* When net namespace is alive (net::count > 0), all the changes
> > +* are made under rw_sem held on write.
> > +*/
> > +   struct list_headfib_notifier_ops;
> >  
> 
> If you use __rcu annotation then it could be checked (and the comment is not 
> needed).

I do not think we can use __rcu annotation yet on a struct list_head ?

(The annotation would be needed on the members)






Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 09:44 -0800, Andrei Vagin wrote:
> On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> > Curently mutex is used to protect pernet operations list. It makes
> > cleanup_net() to execute ->exit methods of the same operations set,
> > which was used on the time of ->init, even after net namespace is
> > unlinked from net_namespace_list.
> > 
> > But the problem is it's need to synchronize_rcu() after net is removed
> > from net_namespace_list():
> > 
> > Destroy net_ns:
> > cleanup_net()
> >   mutex_lock(_mutex)
> >   list_del_rcu(>list)
> >   synchronize_rcu()  <--- Sleep there for 
> > ages
> >   list_for_each_entry_reverse(ops, _list, list)
> > ops_exit_list(ops, _exit_list)
> >   list_for_each_entry_reverse(ops, _list, list)
> > ops_free_list(ops, _exit_list)
> >   mutex_unlock(_mutex)
> > 
> > This primitive is not fast, especially on the systems with many processors
> > and/or when preemptible RCU is enabled in config. So, all the time, while
> > cleanup_net() is waiting for RCU grace period, creation of new net 
> > namespaces
> > is not possible, the tasks, who makes it, are sleeping on the same mutex:
> > 
> > Create net_ns:
> > copy_net_ns()
> >   mutex_lock_killable(_mutex)<--- Sleep there for 
> > ages
> > 
> > The solution is to convert net_mutex to the rw_semaphore. Then,
> > pernet_operations::init/::exit methods, modifying the net-related data,
> > will require down_read() locking only, while down_write() will be used
> > for changing pernet_list.
> > 
> > This gives signify performance increase, like you may see below. There
> > is measured sequential net namespace creation in a cycle, in single
> > thread, without other tasks (single user mode):
> > 
> > 1)int main(int argc, char *argv[])
> > {
> > unsigned nr;
> > if (argc < 2) {
> > fprintf(stderr, "Provide nr iterations arg\n");
> > return 1;
> > }
> > nr = atoi(argv[1]);
> > while (nr-- > 0) {
> > if (unshare(CLONE_NEWNET)) {
> > perror("Can't unshare");
> > return 1;
> > }
> > }
> > return 0;
> > }
> > 
> > Origin, 10 unshare():
> > 0.03user 23.14system 1:39.85elapsed 23%CPU
> > 
> > Patched, 10 unshare():
> > 0.03user 67.49system 1:08.34elapsed 98%CPU
> > 
> > 2)for i in {1..1}; do unshare -n bash -c exit; done
> 
> Hi Kirill,
> 
> This mutex has another role. You know that net namespaces are destroyed
> asynchronously, and the net mutex gurantees that a backlog will be not
> big. If we have something in backlog, we know that it will be handled
> before creating a new net ns.
> 
> As far as I remember net namespaces are created much faster than
> they are destroyed, so with this changes we can create a really big
> backlog, can't we?

Please take a look at the recent patches I did :

8ca712c373a462cfa1b62272870b6c2c74aa83f9 Merge branch 
'net-speedup-netns-create-delete-time'
64bc17811b72758753e2b64cd8f2a63812c61fe1 ipv4: speedup ipv6 tunnels dismantle
bb401caefe9d2c65e0c0fa23b21deecfbfa473fe ipv6: speedup ipv6 tunnels dismantle
789e6ddb0b2fb5d5024b760b178a47876e4de7a6 tcp: batch tcp_net_metrics_exit
a90c9347e90ed1e9323d71402ed18023bc910cd8 ipv6: addrlabel: per netns list
d464e84eed02993d40ad55fdc19f4523e4deee5b kobject: factorize skb setup in 
kobject_uevent_net_broadcast()
4a336a23d619e96aef37d4d054cfadcdd1b581ba kobject: copy env blob in one go
16dff336b33d87c15d9cbe933cfd275aae2a8251 kobject: add 
kobject_uevent_net_broadcast()





Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 09:44 -0800, Andrei Vagin wrote:
> On Tue, Nov 14, 2017 at 04:53:33PM +0300, Kirill Tkhai wrote:
> > Curently mutex is used to protect pernet operations list. It makes
> > cleanup_net() to execute ->exit methods of the same operations set,
> > which was used on the time of ->init, even after net namespace is
> > unlinked from net_namespace_list.
> > 
> > But the problem is it's need to synchronize_rcu() after net is removed
> > from net_namespace_list():
> > 
> > Destroy net_ns:
> > cleanup_net()
> >   mutex_lock(_mutex)
> >   list_del_rcu(>list)
> >   synchronize_rcu()  <--- Sleep there for 
> > ages
> >   list_for_each_entry_reverse(ops, _list, list)
> > ops_exit_list(ops, _exit_list)
> >   list_for_each_entry_reverse(ops, _list, list)
> > ops_free_list(ops, _exit_list)
> >   mutex_unlock(_mutex)
> > 
> > This primitive is not fast, especially on the systems with many processors
> > and/or when preemptible RCU is enabled in config. So, all the time, while
> > cleanup_net() is waiting for RCU grace period, creation of new net 
> > namespaces
> > is not possible, the tasks, who makes it, are sleeping on the same mutex:
> > 
> > Create net_ns:
> > copy_net_ns()
> >   mutex_lock_killable(_mutex)<--- Sleep there for 
> > ages
> > 
> > The solution is to convert net_mutex to the rw_semaphore. Then,
> > pernet_operations::init/::exit methods, modifying the net-related data,
> > will require down_read() locking only, while down_write() will be used
> > for changing pernet_list.
> > 
> > This gives signify performance increase, like you may see below. There
> > is measured sequential net namespace creation in a cycle, in single
> > thread, without other tasks (single user mode):
> > 
> > 1)int main(int argc, char *argv[])
> > {
> > unsigned nr;
> > if (argc < 2) {
> > fprintf(stderr, "Provide nr iterations arg\n");
> > return 1;
> > }
> > nr = atoi(argv[1]);
> > while (nr-- > 0) {
> > if (unshare(CLONE_NEWNET)) {
> > perror("Can't unshare");
> > return 1;
> > }
> > }
> > return 0;
> > }
> > 
> > Origin, 10 unshare():
> > 0.03user 23.14system 1:39.85elapsed 23%CPU
> > 
> > Patched, 10 unshare():
> > 0.03user 67.49system 1:08.34elapsed 98%CPU
> > 
> > 2)for i in {1..1}; do unshare -n bash -c exit; done
> 
> Hi Kirill,
> 
> This mutex has another role. You know that net namespaces are destroyed
> asynchronously, and the net mutex gurantees that a backlog will be not
> big. If we have something in backlog, we know that it will be handled
> before creating a new net ns.
> 
> As far as I remember net namespaces are created much faster than
> they are destroyed, so with this changes we can create a really big
> backlog, can't we?

Please take a look at the recent patches I did :

8ca712c373a462cfa1b62272870b6c2c74aa83f9 Merge branch 
'net-speedup-netns-create-delete-time'
64bc17811b72758753e2b64cd8f2a63812c61fe1 ipv4: speedup ipv6 tunnels dismantle
bb401caefe9d2c65e0c0fa23b21deecfbfa473fe ipv6: speedup ipv6 tunnels dismantle
789e6ddb0b2fb5d5024b760b178a47876e4de7a6 tcp: batch tcp_net_metrics_exit
a90c9347e90ed1e9323d71402ed18023bc910cd8 ipv6: addrlabel: per netns list
d464e84eed02993d40ad55fdc19f4523e4deee5b kobject: factorize skb setup in 
kobject_uevent_net_broadcast()
4a336a23d619e96aef37d4d054cfadcdd1b581ba kobject: copy env blob in one go
16dff336b33d87c15d9cbe933cfd275aae2a8251 kobject: add 
kobject_uevent_net_broadcast()





Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 16:53 +0300, Kirill Tkhai wrote:
> Curently mutex is used to protect pernet operations list. It makes
> cleanup_net() to execute ->exit methods of the same operations set,
> which was used on the time of ->init, even after net namespace is
> unlinked from net_namespace_list.
> 
...

> The solution is to convert net_mutex to the rw_semaphore. Then,
> pernet_operations::init/::exit methods, modifying the net-related data,
> will require down_read() locking only, while down_write() will be used
> for changing pernet_list.
> 
...

> This patch requires commit 76f8507f7a64 "locking/rwsem: Add 
> down_read_killable()"
> from Linus tree (not in net-next yet).


Looks great, thanks for doing this.

I wonder if the down_read_killable() is really needed, maybe a
down_read() would not be blocking the thread too long after this change.





Re: [PATCH] net: Convert net_mutex into rw_semaphore and down read it on net->init/->exit

2017-11-14 Thread Eric Dumazet
On Tue, 2017-11-14 at 16:53 +0300, Kirill Tkhai wrote:
> Curently mutex is used to protect pernet operations list. It makes
> cleanup_net() to execute ->exit methods of the same operations set,
> which was used on the time of ->init, even after net namespace is
> unlinked from net_namespace_list.
> 
...

> The solution is to convert net_mutex to the rw_semaphore. Then,
> pernet_operations::init/::exit methods, modifying the net-related data,
> will require down_read() locking only, while down_write() will be used
> for changing pernet_list.
> 
...

> This patch requires commit 76f8507f7a64 "locking/rwsem: Add 
> down_read_killable()"
> from Linus tree (not in net-next yet).


Looks great, thanks for doing this.

I wonder if the down_read_killable() is really needed, maybe a
down_read() would not be blocking the thread too long after this change.





Re: [PATCH net-next 2/2] tcp: handle TCP_TIME_WAIT/TCP_NEW_SYN_RECV in tcp_set_state tracepoint

2017-11-09 Thread Eric Dumazet
On Thu, 2017-11-09 at 23:11 +0800, Yafang Shao wrote:

> I'm also very sad that I'm still using IPv4 in 2017 : (
> 
> Okay then another issue,
> shoule we reduce the complexity in  the function tcp4_seq_show() ?

This is irrelevant really.

I do not see how tcp4_seq_show() could avoid testing sk_state.




Re: [PATCH net-next 2/2] tcp: handle TCP_TIME_WAIT/TCP_NEW_SYN_RECV in tcp_set_state tracepoint

2017-11-09 Thread Eric Dumazet
On Thu, 2017-11-09 at 23:11 +0800, Yafang Shao wrote:

> I'm also very sad that I'm still using IPv4 in 2017 : (
> 
> Okay then another issue,
> shoule we reduce the complexity in  the function tcp4_seq_show() ?

This is irrelevant really.

I do not see how tcp4_seq_show() could avoid testing sk_state.




Re: [PATCH net-next 2/2] tcp: handle TCP_TIME_WAIT/TCP_NEW_SYN_RECV in tcp_set_state tracepoint

2017-11-09 Thread Eric Dumazet
On Thu, 2017-11-09 at 06:52 -0800, Eric Dumazet wrote:

> Wow.
> 
> 
> Since all three variants of sockets (full sockets, request sockets,
> timewait sockets) are all hashed into ehash table these days, they all
> have the fields at the same offset
> 
> For IPv4, that would be :
> 
> __sk_common.skc_daddr(or inet_daddr)
> __sk_common.skc_rcv_saddr (or inet_rcv_saddr )
> __sk_common.skc_dport (or inet_dport)
> __sk_common.skc_num   (or inet_num)
> 
> Look at __inet_lookup_established() and INET_MATCH() : They deal with
> the three variants, without having to look at sk_state.
> 
> If you were using the fields that are common to all sockets, no need to
> add all this unnecessary complexity.
> 

Not to mention that your patch took care of IPv4 only.

I can not say how sad I am that in 2017 IPv6 seems to be second class
citizen.







Re: [PATCH net-next 2/2] tcp: handle TCP_TIME_WAIT/TCP_NEW_SYN_RECV in tcp_set_state tracepoint

2017-11-09 Thread Eric Dumazet
On Thu, 2017-11-09 at 06:52 -0800, Eric Dumazet wrote:

> Wow.
> 
> 
> Since all three variants of sockets (full sockets, request sockets,
> timewait sockets) are all hashed into ehash table these days, they all
> have the fields at the same offset
> 
> For IPv4, that would be :
> 
> __sk_common.skc_daddr(or inet_daddr)
> __sk_common.skc_rcv_saddr (or inet_rcv_saddr )
> __sk_common.skc_dport (or inet_dport)
> __sk_common.skc_num   (or inet_num)
> 
> Look at __inet_lookup_established() and INET_MATCH() : They deal with
> the three variants, without having to look at sk_state.
> 
> If you were using the fields that are common to all sockets, no need to
> add all this unnecessary complexity.
> 

Not to mention that your patch took care of IPv4 only.

I can not say how sad I am that in 2017 IPv6 seems to be second class
citizen.







Re: [PATCH net-next 2/2] tcp: handle TCP_TIME_WAIT/TCP_NEW_SYN_RECV in tcp_set_state tracepoint

2017-11-09 Thread Eric Dumazet
On Thu, 2017-11-09 at 14:26 +, Yafang Shao wrote:
> When TCP connetion in TCP_TIME_WAIT or TCP_NEW_SYN_RECV state, it can't
> get the sport/dport/saddr/daddr from inet_sock.
> 
> trace_tcp_set_state may be called when the oldstate in these two states.
> 
> Signed-off-by: Yafang Shao 
> ---
>  include/trace/events/tcp.h | 33 ++---
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> index 07a..1982a71 100644
> --- a/include/trace/events/tcp.h
> +++ b/include/trace/events/tcp.h
> @@ -196,7 +196,6 @@
>   ),
>  
>   TP_fast_assign(
> - struct inet_sock *inet = inet_sk(sk);
>   struct in6_addr *pin6;
>   __be32 *p32;
>  
> @@ -204,14 +203,34 @@
>   __entry->oldstate = oldstate;
>   __entry->newstate = newstate;
>  
> - __entry->sport = ntohs(inet->inet_sport);
> - __entry->dport = ntohs(inet->inet_dport);
> + if (oldstate == TCP_TIME_WAIT) {
> + __entry->sport = ntohs(inet_twsk(sk)->tw_sport);
> + __entry->dport = ntohs(inet_twsk(sk)->tw_dport);
>  
> - p32 = (__be32 *) __entry->saddr;
> - *p32 = inet->inet_saddr;
> + p32 = (__be32 *) __entry->saddr;
> + *p32 = inet_twsk(sk)->tw_rcv_saddr;
>  
> - p32 = (__be32 *) __entry->daddr;
> - *p32 =  inet->inet_daddr;
> + p32 = (__be32 *) __entry->daddr;
> + *p32 = inet_twsk(sk)->tw_daddr;
> + } else if (oldstate == TCP_NEW_SYN_RECV) {
> + __entry->sport = inet_rsk(inet_reqsk(sk))->ir_num;
> + __entry->dport = 
> ntohs(inet_rsk(inet_reqsk(sk))->ir_rmt_port);
> +
> + p32 = (__be32 *) __entry->saddr;
> + *p32 = inet_rsk(inet_reqsk(sk))->ir_loc_addr;
> +
> + p32 = (__be32 *) __entry->daddr;
> + *p32 = inet_rsk(inet_reqsk(sk))->ir_rmt_addr;
> + } else {
> + __entry->sport = ntohs(inet_sk(sk)->inet_sport);
> + __entry->dport = ntohs(inet_sk(sk)->inet_dport);
> +
> + p32 = (__be32 *) __entry->saddr;
> + *p32 = inet_sk(sk)->inet_saddr;
> +
> + p32 = (__be32 *) __entry->daddr;
> + *p32 =  inet_sk(sk)->inet_daddr;
> + }


Wow.


Since all three variants of sockets (full sockets, request sockets,
timewait sockets) are all hashed into ehash table these days, they all
have the fields at the same offset

For IPv4, that would be :

__sk_common.skc_daddr(or inet_daddr)
__sk_common.skc_rcv_saddr (or inet_rcv_saddr )
__sk_common.skc_dport (or inet_dport)
__sk_common.skc_num   (or inet_num)

Look at __inet_lookup_established() and INET_MATCH() : They deal with
the three variants, without having to look at sk_state.

If you were using the fields that are common to all sockets, no need to
add all this unnecessary complexity.




Re: [PATCH net-next 2/2] tcp: handle TCP_TIME_WAIT/TCP_NEW_SYN_RECV in tcp_set_state tracepoint

2017-11-09 Thread Eric Dumazet
On Thu, 2017-11-09 at 14:26 +, Yafang Shao wrote:
> When TCP connetion in TCP_TIME_WAIT or TCP_NEW_SYN_RECV state, it can't
> get the sport/dport/saddr/daddr from inet_sock.
> 
> trace_tcp_set_state may be called when the oldstate in these two states.
> 
> Signed-off-by: Yafang Shao 
> ---
>  include/trace/events/tcp.h | 33 ++---
>  1 file changed, 26 insertions(+), 7 deletions(-)
> 
> diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
> index 07a..1982a71 100644
> --- a/include/trace/events/tcp.h
> +++ b/include/trace/events/tcp.h
> @@ -196,7 +196,6 @@
>   ),
>  
>   TP_fast_assign(
> - struct inet_sock *inet = inet_sk(sk);
>   struct in6_addr *pin6;
>   __be32 *p32;
>  
> @@ -204,14 +203,34 @@
>   __entry->oldstate = oldstate;
>   __entry->newstate = newstate;
>  
> - __entry->sport = ntohs(inet->inet_sport);
> - __entry->dport = ntohs(inet->inet_dport);
> + if (oldstate == TCP_TIME_WAIT) {
> + __entry->sport = ntohs(inet_twsk(sk)->tw_sport);
> + __entry->dport = ntohs(inet_twsk(sk)->tw_dport);
>  
> - p32 = (__be32 *) __entry->saddr;
> - *p32 = inet->inet_saddr;
> + p32 = (__be32 *) __entry->saddr;
> + *p32 = inet_twsk(sk)->tw_rcv_saddr;
>  
> - p32 = (__be32 *) __entry->daddr;
> - *p32 =  inet->inet_daddr;
> + p32 = (__be32 *) __entry->daddr;
> + *p32 = inet_twsk(sk)->tw_daddr;
> + } else if (oldstate == TCP_NEW_SYN_RECV) {
> + __entry->sport = inet_rsk(inet_reqsk(sk))->ir_num;
> + __entry->dport = 
> ntohs(inet_rsk(inet_reqsk(sk))->ir_rmt_port);
> +
> + p32 = (__be32 *) __entry->saddr;
> + *p32 = inet_rsk(inet_reqsk(sk))->ir_loc_addr;
> +
> + p32 = (__be32 *) __entry->daddr;
> + *p32 = inet_rsk(inet_reqsk(sk))->ir_rmt_addr;
> + } else {
> + __entry->sport = ntohs(inet_sk(sk)->inet_sport);
> + __entry->dport = ntohs(inet_sk(sk)->inet_dport);
> +
> + p32 = (__be32 *) __entry->saddr;
> + *p32 = inet_sk(sk)->inet_saddr;
> +
> + p32 = (__be32 *) __entry->daddr;
> + *p32 =  inet_sk(sk)->inet_daddr;
> + }


Wow.


Since all three variants of sockets (full sockets, request sockets,
timewait sockets) are all hashed into ehash table these days, they all
have the fields at the same offset

For IPv4, that would be :

__sk_common.skc_daddr(or inet_daddr)
__sk_common.skc_rcv_saddr (or inet_rcv_saddr )
__sk_common.skc_dport (or inet_dport)
__sk_common.skc_num   (or inet_num)

Look at __inet_lookup_established() and INET_MATCH() : They deal with
the three variants, without having to look at sk_state.

If you were using the fields that are common to all sockets, no need to
add all this unnecessary complexity.




Re: IPv6 issue in next-20171102 - lockdep and BUG handling RA packet.

2017-11-06 Thread Eric Dumazet
On Mon, Nov 6, 2017 at 4:29 PM, David Ahern  wrote:
> On 11/7/17 5:56 AM, valdis.kletni...@vt.edu wrote:
>> I've hit this 6 times now, across 3 boots:
>>
>> Nov  3 11:04:54 turing-police kernel: [  547.814748] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Nov  3 20:24:11 turing-police kernel: [   60.093793] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  4 20:20:54 turing-police kernel: [86264.366955] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  5 19:17:40 turing-police kernel: [172469.769179] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  6 06:07:37 turing-police kernel: [211467.239460] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Nov  6 14:12:43 turing-police kernel: [   54.891848] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Something seems to be going astray while handling a RA packet.
>
> Odd. I tested RA before sending the patches and again just now - no
> traceback with an RCU / lock debugging kernel. What is sending the RA's
> in your case? I'd like to understand the config and add a test for this.

Do you have CONFIG_DEBUG_ATOMIC_SLEEP=y in your .config ?


Re: IPv6 issue in next-20171102 - lockdep and BUG handling RA packet.

2017-11-06 Thread Eric Dumazet
On Mon, Nov 6, 2017 at 4:29 PM, David Ahern  wrote:
> On 11/7/17 5:56 AM, valdis.kletni...@vt.edu wrote:
>> I've hit this 6 times now, across 3 boots:
>>
>> Nov  3 11:04:54 turing-police kernel: [  547.814748] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Nov  3 20:24:11 turing-police kernel: [   60.093793] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  4 20:20:54 turing-police kernel: [86264.366955] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  5 19:17:40 turing-police kernel: [172469.769179] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  6 06:07:37 turing-police kernel: [211467.239460] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Nov  6 14:12:43 turing-police kernel: [   54.891848] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Something seems to be going astray while handling a RA packet.
>
> Odd. I tested RA before sending the patches and again just now - no
> traceback with an RCU / lock debugging kernel. What is sending the RA's
> in your case? I'd like to understand the config and add a test for this.

Do you have CONFIG_DEBUG_ATOMIC_SLEEP=y in your .config ?


Re: IPv6 issue in next-20171102 - lockdep and BUG handling RA packet.

2017-11-06 Thread Eric Dumazet
On Mon, Nov 6, 2017 at 2:04 PM, Eric Dumazet <eduma...@google.com> wrote:
> I have a patch, will send in a couple of minutes. Thanks.

https://patchwork.ozlabs.org/patch/834983/  ipv6: addrconf: fix a lockdep splat


Re: IPv6 issue in next-20171102 - lockdep and BUG handling RA packet.

2017-11-06 Thread Eric Dumazet
On Mon, Nov 6, 2017 at 2:04 PM, Eric Dumazet  wrote:
> I have a patch, will send in a couple of minutes. Thanks.

https://patchwork.ozlabs.org/patch/834983/  ipv6: addrconf: fix a lockdep splat


Re: IPv6 issue in next-20171102 - lockdep and BUG handling RA packet.

2017-11-06 Thread Eric Dumazet
On Mon, Nov 6, 2017 at 2:01 PM, Ido Schimmel  wrote:
> On Mon, Nov 06, 2017 at 03:56:54PM -0500, valdis.kletni...@vt.edu wrote:
>> I've hit this 6 times now, across 3 boots:
>>
>> Nov  3 11:04:54 turing-police kernel: [  547.814748] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Nov  3 20:24:11 turing-police kernel: [   60.093793] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  4 20:20:54 turing-police kernel: [86264.366955] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  5 19:17:40 turing-police kernel: [172469.769179] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  6 06:07:37 turing-police kernel: [211467.239460] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Nov  6 14:12:43 turing-police kernel: [   54.891848] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Something seems to be going astray while handling a RA packet.
>>
>> Kernel dirty due to hand-patching 
>> https://patchwork.kernel.org/patch/10003555/
>> (signed int:1 bitfield in sched.h causing tons of warnings)
>>
>> Unfortunately, the previous next- kernel I built was -20170927 (which worked 
>> OK).
>>
>> Googling for things in the traceback in the last month comes up empty, and 
>> only
>> thing in the git log for net/ipv6 that looks vaguely related:
>>
>> commit f3d9832e56c48e4ca50bab0457e21bcaade4536d
>> Author: David Ahern 
>> Date:   Wed Oct 18 09:56:52 2017 -0700
>>
>> ipv6: addrconf: cleanup locking in ipv6_add_addr
>
> Probably right...
>
> [...]
>
>> [   54.891848] BUG: sleeping function called from invalid context at 
>> mm/slab.h:422
>> [   54.891855] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
>> [   54.891859] INFO: lockdep is turned off.
>> [   54.891867] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW  OE
>> 4.14.0-rc7-next-20171102-dirty #537
>> [   54.891872] Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A20 
>> 05/08/2017
>> [   54.891877] Call Trace:
>> [   54.891882]  
>> [   54.891894]  dump_stack+0x7b/0xe4
>> [   54.891907]  ___might_sleep+0x1b0/0x300
>> [   54.891921]  kmem_cache_alloc_trace+0x2c7/0x500
>> [   54.891931]  ? cyc2ns_read_end+0x1e/0x30
>> [   54.891944]  ipv6_add_addr+0x15a/0xc30
>> [   54.891977]  ? ipv6_create_tempaddr+0x2ea/0x5d0
>> [   54.891986]  ipv6_create_tempaddr+0x2ea/0x5d0
>
> This function is called from softirq so we should call ipv6_add_addr()
> with 'can_block' set to 'false'.
>
> DavidA, it's already late here so you can probably fix this before I
> take a closer look tomorrow morning. :)

I have a patch, will send in a couple of minutes. Thanks.


Re: IPv6 issue in next-20171102 - lockdep and BUG handling RA packet.

2017-11-06 Thread Eric Dumazet
On Mon, Nov 6, 2017 at 2:01 PM, Ido Schimmel  wrote:
> On Mon, Nov 06, 2017 at 03:56:54PM -0500, valdis.kletni...@vt.edu wrote:
>> I've hit this 6 times now, across 3 boots:
>>
>> Nov  3 11:04:54 turing-police kernel: [  547.814748] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Nov  3 20:24:11 turing-police kernel: [   60.093793] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  4 20:20:54 turing-police kernel: [86264.366955] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  5 19:17:40 turing-police kernel: [172469.769179] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>> Nov  6 06:07:37 turing-police kernel: [211467.239460] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Nov  6 14:12:43 turing-police kernel: [   54.891848] BUG: sleeping function 
>> called from invalid context at mm/slab.h:422
>>
>> Something seems to be going astray while handling a RA packet.
>>
>> Kernel dirty due to hand-patching 
>> https://patchwork.kernel.org/patch/10003555/
>> (signed int:1 bitfield in sched.h causing tons of warnings)
>>
>> Unfortunately, the previous next- kernel I built was -20170927 (which worked 
>> OK).
>>
>> Googling for things in the traceback in the last month comes up empty, and 
>> only
>> thing in the git log for net/ipv6 that looks vaguely related:
>>
>> commit f3d9832e56c48e4ca50bab0457e21bcaade4536d
>> Author: David Ahern 
>> Date:   Wed Oct 18 09:56:52 2017 -0700
>>
>> ipv6: addrconf: cleanup locking in ipv6_add_addr
>
> Probably right...
>
> [...]
>
>> [   54.891848] BUG: sleeping function called from invalid context at 
>> mm/slab.h:422
>> [   54.891855] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
>> [   54.891859] INFO: lockdep is turned off.
>> [   54.891867] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW  OE
>> 4.14.0-rc7-next-20171102-dirty #537
>> [   54.891872] Hardware name: Dell Inc. Latitude E6530/07Y85M, BIOS A20 
>> 05/08/2017
>> [   54.891877] Call Trace:
>> [   54.891882]  
>> [   54.891894]  dump_stack+0x7b/0xe4
>> [   54.891907]  ___might_sleep+0x1b0/0x300
>> [   54.891921]  kmem_cache_alloc_trace+0x2c7/0x500
>> [   54.891931]  ? cyc2ns_read_end+0x1e/0x30
>> [   54.891944]  ipv6_add_addr+0x15a/0xc30
>> [   54.891977]  ? ipv6_create_tempaddr+0x2ea/0x5d0
>> [   54.891986]  ipv6_create_tempaddr+0x2ea/0x5d0
>
> This function is called from softirq so we should call ipv6_add_addr()
> with 'can_block' set to 'false'.
>
> DavidA, it's already late here so you can probably fix this before I
> take a closer look tomorrow morning. :)

I have a patch, will send in a couple of minutes. Thanks.


Re: WARNING: at net/core/stream.c:204

2017-11-02 Thread Eric Dumazet
On Thu, 2017-11-02 at 08:36 -0400, Shankara Pailoor wrote:
> Hi,
> 
> We encountered the following warning when fuzzing with Syzkaller on
> Linux 4.14-rc4. Syzkaller was able to isolate the sequence of calls
> which caused the bug but couldn't create a C program that could
> regularly trigger it.
> 
> 
> Here are the logs from the reproducer attempts: https://pastebin.com/QWQZK6hW
> 
> Here is the original stack trace: https://pastebin.com/7L4WciRr

Sorry, we are flooded by such reports, so you have to give the exact
details, and find a C repro.

A bonus if you actually send a kernel patch, of course.

Thanks.




Re: WARNING: at net/core/stream.c:204

2017-11-02 Thread Eric Dumazet
On Thu, 2017-11-02 at 08:36 -0400, Shankara Pailoor wrote:
> Hi,
> 
> We encountered the following warning when fuzzing with Syzkaller on
> Linux 4.14-rc4. Syzkaller was able to isolate the sequence of calls
> which caused the bug but couldn't create a C program that could
> regularly trigger it.
> 
> 
> Here are the logs from the reproducer attempts: https://pastebin.com/QWQZK6hW
> 
> Here is the original stack trace: https://pastebin.com/7L4WciRr

Sorry, we are flooded by such reports, so you have to give the exact
details, and find a C repro.

A bonus if you actually send a kernel patch, of course.

Thanks.




Re: [PATCH] tcp_nv: fix division by zero in tcpnv_acked()

2017-11-01 Thread Eric Dumazet
On Wed, 2017-11-01 at 16:32 +0300, Konstantin Khlebnikov wrote:
> Average RTT could become zero. This happened in real life at least twice.
> This patch treats zero as 1us.
> 
> Signed-off-by: Konstantin Khlebnikov 
> ---
>  net/ipv4/tcp_nv.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
> index 1ff73982e28c..125fc1450b01 100644
> --- a/net/ipv4/tcp_nv.c
> +++ b/net/ipv4/tcp_nv.c
> @@ -252,7 +252,7 @@ static void tcpnv_acked(struct sock *sk, const struct 
> ack_sample *sample)
>  
>   /* rate in 100's bits per second */
>   rate64 = ((u64)sample->in_flight) * 800;
> - rate = (u32)div64_u64(rate64, (u64)(avg_rtt * 100));
> + rate = (u32)div64_u64(rate64, (u64)(avg_rtt ?: 1) * 100);
>  
>   /* Remember the maximum rate seen during this RTT
>* Note: It may be more than one RTT. This function should be
> 

BTW, this div64_u64() could be replaced by a u32 divide (aka do_div()),
slightly less expensive.


rate64 = ((u64)sample->in_flight) * 8;
do_div(rate64, avg_rtt ?: 1);
rate = (u32)rate64;






Re: [PATCH] tcp_nv: fix division by zero in tcpnv_acked()

2017-11-01 Thread Eric Dumazet
On Wed, 2017-11-01 at 16:32 +0300, Konstantin Khlebnikov wrote:
> Average RTT could become zero. This happened in real life at least twice.
> This patch treats zero as 1us.
> 
> Signed-off-by: Konstantin Khlebnikov 
> ---
>  net/ipv4/tcp_nv.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
> index 1ff73982e28c..125fc1450b01 100644
> --- a/net/ipv4/tcp_nv.c
> +++ b/net/ipv4/tcp_nv.c
> @@ -252,7 +252,7 @@ static void tcpnv_acked(struct sock *sk, const struct 
> ack_sample *sample)
>  
>   /* rate in 100's bits per second */
>   rate64 = ((u64)sample->in_flight) * 800;
> - rate = (u32)div64_u64(rate64, (u64)(avg_rtt * 100));
> + rate = (u32)div64_u64(rate64, (u64)(avg_rtt ?: 1) * 100);
>  
>   /* Remember the maximum rate seen during this RTT
>* Note: It may be more than one RTT. This function should be
> 

BTW, this div64_u64() could be replaced by a u32 divide (aka do_div()),
slightly less expensive.


rate64 = ((u64)sample->in_flight) * 8;
do_div(rate64, avg_rtt ?: 1);
rate = (u32)rate64;






Re: [PATCH] net: recvmsg: Unconditionally zero struct sockaddr_storage

2017-10-31 Thread Eric Dumazet
On Tue, 2017-10-31 at 09:14 -0700, Kees Cook wrote:
> Some protocols do not correctly wipe the contents of the on-stack
> struct sockaddr_storage sent down into recvmsg() (e.g. SCTP), and leak
> kernel stack contents to userspace. This wipes it unconditionally before
> per-protocol handlers run.
> 
> Note that leaks like this are mitigated by building with
> CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y
> 
> Reported-by: Alexander Potapenko 
> Cc: "David S. Miller" 
> Cc: net...@vger.kernel.org
> Signed-off-by: Kees Cook 
> ---
>  net/socket.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/socket.c b/net/socket.c
> index c729625eb5d3..34183f4fbdf8 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -2188,6 +2188,7 @@ static int ___sys_recvmsg(struct socket *sock, struct 
> user_msghdr __user *msg,
>   struct sockaddr __user *uaddr;
>   int __user *uaddr_len = COMPAT_NAMELEN(msg);
>  
> + memset(, 0, sizeof(addr));
>   msg_sys->msg_name = 
>  

This kind of patch comes every year.

Standard answer is : We fix the buggy protocol, we do not make
everything slower just because we are lazy.

struct sockaddr is 128 bytes, but IPV4 only uses a fraction of it.

Also memset() is using long word stores, so next 4-byte or 2-byte stores
on same location hit a performance problem on x86.

By adding all these defensive programming, we would give strong
incentives to bypass the kernel for networking. That would be bad.





Re: [PATCH] net: recvmsg: Unconditionally zero struct sockaddr_storage

2017-10-31 Thread Eric Dumazet
On Tue, 2017-10-31 at 09:14 -0700, Kees Cook wrote:
> Some protocols do not correctly wipe the contents of the on-stack
> struct sockaddr_storage sent down into recvmsg() (e.g. SCTP), and leak
> kernel stack contents to userspace. This wipes it unconditionally before
> per-protocol handlers run.
> 
> Note that leaks like this are mitigated by building with
> CONFIG_GCC_PLUGIN_STRUCTLEAK_BYREF_ALL=y
> 
> Reported-by: Alexander Potapenko 
> Cc: "David S. Miller" 
> Cc: net...@vger.kernel.org
> Signed-off-by: Kees Cook 
> ---
>  net/socket.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/socket.c b/net/socket.c
> index c729625eb5d3..34183f4fbdf8 100644
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -2188,6 +2188,7 @@ static int ___sys_recvmsg(struct socket *sock, struct 
> user_msghdr __user *msg,
>   struct sockaddr __user *uaddr;
>   int __user *uaddr_len = COMPAT_NAMELEN(msg);
>  
> + memset(, 0, sizeof(addr));
>   msg_sys->msg_name = 
>  

This kind of patch comes every year.

Standard answer is : We fix the buggy protocol, we do not make
everything slower just because we are lazy.

struct sockaddr is 128 bytes, but IPV4 only uses a fraction of it.

Also memset() is using long word stores, so next 4-byte or 2-byte stores
on same location hit a performance problem on x86.

By adding all these defensive programming, we would give strong
incentives to bypass the kernel for networking. That would be bad.





Re: [PATCH net-next 3/4] hv_netvsc: reset net_device_ctx->nvdev with rcu_assign_pointer()

2017-10-31 Thread Eric Dumazet
On Tue, 2017-10-31 at 14:42 +0100, Vitaly Kuznetsov wrote:
> RCU_INIT_POINTER() is not suitable here as it doesn't give us ordering
> guarantees (see the comment in rcupdate.h). This is also not a hotpath.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  drivers/net/hyperv/netvsc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index bfc79698b8f4..12efb3e34775 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -560,7 +560,7 @@ void netvsc_device_remove(struct hv_device *device)
>  
>   netvsc_revoke_buf(device, net_device);
>  
> - RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
> + rcu_assign_pointer(net_device_ctx->nvdev, NULL);

I see no point for this patch.

Setting a NULL pointer needs no barrier at all.




Re: [PATCH net-next 3/4] hv_netvsc: reset net_device_ctx->nvdev with rcu_assign_pointer()

2017-10-31 Thread Eric Dumazet
On Tue, 2017-10-31 at 14:42 +0100, Vitaly Kuznetsov wrote:
> RCU_INIT_POINTER() is not suitable here as it doesn't give us ordering
> guarantees (see the comment in rcupdate.h). This is also not a hotpath.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  drivers/net/hyperv/netvsc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
> index bfc79698b8f4..12efb3e34775 100644
> --- a/drivers/net/hyperv/netvsc.c
> +++ b/drivers/net/hyperv/netvsc.c
> @@ -560,7 +560,7 @@ void netvsc_device_remove(struct hv_device *device)
>  
>   netvsc_revoke_buf(device, net_device);
>  
> - RCU_INIT_POINTER(net_device_ctx->nvdev, NULL);
> + rcu_assign_pointer(net_device_ctx->nvdev, NULL);

I see no point for this patch.

Setting a NULL pointer needs no barrier at all.




Re: linux-next: Signed-off-by missing for commit in the net-next tree

2017-10-31 Thread Eric Dumazet
On Tue, 2017-10-31 at 09:14 +0300, Kirill Smelkov wrote:
> Francois,
> 
> On Tue, Oct 31, 2017 at 04:40:19PM +1100, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Commit
> > 
> >   509708310cf9 ("r8169: Add support for interrupt coalesce tuning (ethtool 
> > -C)")
> > 
> > is missing a Signed-off-by from its author (or its author is wrong).
> 
> Could we please get signoff for this your patch from 2012 which I've
> reworked a bit to be commit 509708310cf9 ?
> 
>   https://www.spinics.net/lists/netdev/msg217984.html
>   https://www.spinics.net/lists/netdev/msg218207.html
> 
> I was keeping you in To and Cc all the time but got no reply at all since my
> first posting from ~ 1 month ago.

Nothing we can do, patch is merged, we can not modify it anymore.

You provided all the rationale in the changelog, so no worries.




Re: linux-next: Signed-off-by missing for commit in the net-next tree

2017-10-31 Thread Eric Dumazet
On Tue, 2017-10-31 at 09:14 +0300, Kirill Smelkov wrote:
> Francois,
> 
> On Tue, Oct 31, 2017 at 04:40:19PM +1100, Stephen Rothwell wrote:
> > Hi all,
> > 
> > Commit
> > 
> >   509708310cf9 ("r8169: Add support for interrupt coalesce tuning (ethtool 
> > -C)")
> > 
> > is missing a Signed-off-by from its author (or its author is wrong).
> 
> Could we please get signoff for this your patch from 2012 which I've
> reworked a bit to be commit 509708310cf9 ?
> 
>   https://www.spinics.net/lists/netdev/msg217984.html
>   https://www.spinics.net/lists/netdev/msg218207.html
> 
> I was keeping you in To and Cc all the time but got no reply at all since my
> first posting from ~ 1 month ago.

Nothing we can do, patch is merged, we can not modify it anymore.

You provided all the rationale in the changelog, so no worries.




Re: KASAN: use-after-free Write in detach_if_pending

2017-10-30 Thread Eric Dumazet
On Mon, 2017-10-30 at 20:52 +0300, Dmitry Vyukov wrote:

> syz fix:   patch title
> 
> then that's doable. If we agree on this format, then I am ready to
> implement this.

Yes please.

David Miller and Linus never rebase their trees, for good reasons.




Re: KASAN: use-after-free Write in detach_if_pending

2017-10-30 Thread Eric Dumazet
On Mon, 2017-10-30 at 20:52 +0300, Dmitry Vyukov wrote:

> syz fix:   patch title
> 
> then that's doable. If we agree on this format, then I am ready to
> implement this.

Yes please.

David Miller and Linus never rebase their trees, for good reasons.




Re: KASAN: use-after-free Write in detach_if_pending

2017-10-30 Thread Eric Dumazet
On Mon, 2017-10-30 at 18:40 +0100, Dmitry Vyukov wrote:
> On Mon, Oct 30, 2017 at 6:30 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Mon, 2017-10-30 at 18:06 +0100, Dmitry Vyukov wrote:
> >
> >> Yes, but hashes in random trees also don't tell much. A tree can be
> >> rebased so the hash will be lost. It can be a tree unknown to the
> >> system. Even if we find the commit by hash, in order to match it
> >> against other trees we will have to use the title anyway (or are there
> >> better options?), so using hashes becomes pointless.
> >
> > We do not send hashes on random trees, but official SHA1 in David Miller
> > trees. They will be the same SHA1 in official Linus Torvalds tree.
> >
> > Really, you make our life more difficult by pretending that hashes are
> > not the proper way.
> >
> > They are reasons we use Fixes: tags all over the places, they are unique
> > in Linus tree.
> >
> > Since syzbot gives a SHA1 itself, it must be using a tree, right ?
> >
> > So a SHA1 that is guaranteed to enter the same tree is correct.
> >
> > Please fix your bot.
> 
> 
> They don't necessary enter the same tree (that's more of an exception
> than the rule). For bugs that we find in Linus tree, fixes enter usb,
> kvm, block, sound, linux-next and a bunch of other trees that I never
> heard of. At the very least we will need a git repo address + commit
> hash. But then for say linux-next hashes disappear. And mm which is
> not a git tree at all (no hashes).
> And still the hashes will need to be explicitly marked as fixes (with
> #syz fix or something else). So that would look like:
> ##syz fix: git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> e7989f973ae1b90ec7c0b671c81f7f553affccbe
> which does not look much better than:
> ##syz fix: tun: do not arm flow_gc_timer in tun_flow_init()
> which also I think makes it easier for humans to ensure that they
> actually reference what they meant to reference (and maybe find the
> fix in other trees).


I suggested that syzbot catches up on standard way :

  patch title

It contains way more information than :

sys fix: patch title

I never suggested to only use 





Re: KASAN: use-after-free Write in detach_if_pending

2017-10-30 Thread Eric Dumazet
On Mon, 2017-10-30 at 18:40 +0100, Dmitry Vyukov wrote:
> On Mon, Oct 30, 2017 at 6:30 PM, Eric Dumazet  wrote:
> > On Mon, 2017-10-30 at 18:06 +0100, Dmitry Vyukov wrote:
> >
> >> Yes, but hashes in random trees also don't tell much. A tree can be
> >> rebased so the hash will be lost. It can be a tree unknown to the
> >> system. Even if we find the commit by hash, in order to match it
> >> against other trees we will have to use the title anyway (or are there
> >> better options?), so using hashes becomes pointless.
> >
> > We do not send hashes on random trees, but official SHA1 in David Miller
> > trees. They will be the same SHA1 in official Linus Torvalds tree.
> >
> > Really, you make our life more difficult by pretending that hashes are
> > not the proper way.
> >
> > They are reasons we use Fixes: tags all over the places, they are unique
> > in Linus tree.
> >
> > Since syzbot gives a SHA1 itself, it must be using a tree, right ?
> >
> > So a SHA1 that is guaranteed to enter the same tree is correct.
> >
> > Please fix your bot.
> 
> 
> They don't necessary enter the same tree (that's more of an exception
> than the rule). For bugs that we find in Linus tree, fixes enter usb,
> kvm, block, sound, linux-next and a bunch of other trees that I never
> heard of. At the very least we will need a git repo address + commit
> hash. But then for say linux-next hashes disappear. And mm which is
> not a git tree at all (no hashes).
> And still the hashes will need to be explicitly marked as fixes (with
> #syz fix or something else). So that would look like:
> ##syz fix: git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git
> e7989f973ae1b90ec7c0b671c81f7f553affccbe
> which does not look much better than:
> ##syz fix: tun: do not arm flow_gc_timer in tun_flow_init()
> which also I think makes it easier for humans to ensure that they
> actually reference what they meant to reference (and maybe find the
> fix in other trees).


I suggested that syzbot catches up on standard way :

  patch title

It contains way more information than :

sys fix: patch title

I never suggested to only use 





Re: KASAN: use-after-free Write in detach_if_pending

2017-10-30 Thread Eric Dumazet
On Mon, 2017-10-30 at 18:06 +0100, Dmitry Vyukov wrote:

> Yes, but hashes in random trees also don't tell much. A tree can be
> rebased so the hash will be lost. It can be a tree unknown to the
> system. Even if we find the commit by hash, in order to match it
> against other trees we will have to use the title anyway (or are there
> better options?), so using hashes becomes pointless.

We do not send hashes on random trees, but official SHA1 in David Miller
trees. They will be the same SHA1 in official Linus Torvalds tree.

Really, you make our life more difficult by pretending that hashes are
not the proper way.

They are reasons we use Fixes: tags all over the places, they are unique
in Linus tree.

Since syzbot gives a SHA1 itself, it must be using a tree, right ?

So a SHA1 that is guaranteed to enter the same tree is correct.

Please fix your bot.




Re: KASAN: use-after-free Write in detach_if_pending

2017-10-30 Thread Eric Dumazet
On Mon, 2017-10-30 at 18:06 +0100, Dmitry Vyukov wrote:

> Yes, but hashes in random trees also don't tell much. A tree can be
> rebased so the hash will be lost. It can be a tree unknown to the
> system. Even if we find the commit by hash, in order to match it
> against other trees we will have to use the title anyway (or are there
> better options?), so using hashes becomes pointless.

We do not send hashes on random trees, but official SHA1 in David Miller
trees. They will be the same SHA1 in official Linus Torvalds tree.

Really, you make our life more difficult by pretending that hashes are
not the proper way.

They are reasons we use Fixes: tags all over the places, they are unique
in Linus tree.

Since syzbot gives a SHA1 itself, it must be using a tree, right ?

So a SHA1 that is guaranteed to enter the same tree is correct.

Please fix your bot.




Re: KASAN: use-after-free Write in detach_if_pending

2017-10-30 Thread Eric Dumazet
On Mon, 2017-10-30 at 16:48 +0100, Dmitry Vyukov wrote:
> >
> > net-next tree :
> >
> > $ git log --oneline e7989f973ae1b90ec7c0b671c81.. -- drivers/net/tun.c
> > f8ddadc4db6c7b7029b6d0e0d9af24f74ad27ca2 Merge 
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > ee74d9967b829232723939cb7c9b100b29f6ec98 tun: do not arm flow_gc_timer in 
> > tun_flow_init()
> > 81d98fa4df3d1683b3ef21e8a7a0ccac7874f0de tun: avoid extra timer schedule in 
> > tun_flow_cleanup()
> > 7dbfb4ef77db5666f0f3a425e7db93ca30ff4285 tun: do not block BH again in 
> > tun_flow_cleanup()
> > aec72f3392b1d598a979e89c4fdb131965ae0ab3 net-tun: fix panics at dismantle 
> > time
> > 010f245b9dd734adda6386c494a4ace953ea8dc4 tun: relax check on 
> > eth_get_headlen() return value
> > 0ad646c81b2182f7fa67ec0c8c825e0ee165696d tun: call dev_get_valid_name() 
> > before register_netdevice()
> > 53954cf8c5d205624167a2bfd117cc0c1a5f3c6d Merge 
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > 2580c4c17aee3ad58e9751012bad278dd074ccae tun: bail out from tun_get_user() 
> > if the skb is empty
> > de8f3a83b0a0fddb2cf56e7a718127e9619ea3da bpf: add meta pointer for direct 
> > access
> > 9484dc74fcf0750cd6726c9aa27edf97223916a8 tun: delete original tun_get() and 
> > rename __tun_get() to tun_get()
> > 90e33d45940793def6f773b2d528e9f3c84ffdc7 tun: enable napi_gro_frags() for 
> > TUN/TAP driver
> > 943170998b200190f99d3fe7e771437e2c51f319 tun: enable NAPI for TUN/TAP driver
> >
> > net tree :
> >
> > $ git log --oneline e7989f973ae1b90ec7c0b671c81.. -- drivers/net/tun.c
> > 63b9ab65bd76e5de6479bb14b4014b64aa1a317a tuntap: properly align skb->head 
> > before building skb
> > 5c25f65fd1e42685f7ccd80e0621829c105785d9 tun: allow positive return values 
> > on dev_get_valid_name() call
> > 0ad646c81b2182f7fa67ec0c8c825e0ee165696d tun: call dev_get_valid_name() 
> > before register_netdevice()
> > 2580c4c17aee3ad58e9751012bad278dd074ccae tun: bail out from tun_get_user() 
> > if the skb is empty
> >
> > Pick the fixes, they are at least 2 patches that addressed the issue.
> 
> Let's try the last one in net-next that touches timers:
> 
> #syz fix: tun: do not arm flow_gc_timer in tun_flow_init()

Note that is is common to have multiple patches sharing a same title, so
your tag is ambiguous.

Why don't you update your bot to catch up standard SHA1 title  format,
so that we do not have to ?





Re: KASAN: use-after-free Write in detach_if_pending

2017-10-30 Thread Eric Dumazet
On Mon, 2017-10-30 at 16:48 +0100, Dmitry Vyukov wrote:
> >
> > net-next tree :
> >
> > $ git log --oneline e7989f973ae1b90ec7c0b671c81.. -- drivers/net/tun.c
> > f8ddadc4db6c7b7029b6d0e0d9af24f74ad27ca2 Merge 
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > ee74d9967b829232723939cb7c9b100b29f6ec98 tun: do not arm flow_gc_timer in 
> > tun_flow_init()
> > 81d98fa4df3d1683b3ef21e8a7a0ccac7874f0de tun: avoid extra timer schedule in 
> > tun_flow_cleanup()
> > 7dbfb4ef77db5666f0f3a425e7db93ca30ff4285 tun: do not block BH again in 
> > tun_flow_cleanup()
> > aec72f3392b1d598a979e89c4fdb131965ae0ab3 net-tun: fix panics at dismantle 
> > time
> > 010f245b9dd734adda6386c494a4ace953ea8dc4 tun: relax check on 
> > eth_get_headlen() return value
> > 0ad646c81b2182f7fa67ec0c8c825e0ee165696d tun: call dev_get_valid_name() 
> > before register_netdevice()
> > 53954cf8c5d205624167a2bfd117cc0c1a5f3c6d Merge 
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> > 2580c4c17aee3ad58e9751012bad278dd074ccae tun: bail out from tun_get_user() 
> > if the skb is empty
> > de8f3a83b0a0fddb2cf56e7a718127e9619ea3da bpf: add meta pointer for direct 
> > access
> > 9484dc74fcf0750cd6726c9aa27edf97223916a8 tun: delete original tun_get() and 
> > rename __tun_get() to tun_get()
> > 90e33d45940793def6f773b2d528e9f3c84ffdc7 tun: enable napi_gro_frags() for 
> > TUN/TAP driver
> > 943170998b200190f99d3fe7e771437e2c51f319 tun: enable NAPI for TUN/TAP driver
> >
> > net tree :
> >
> > $ git log --oneline e7989f973ae1b90ec7c0b671c81.. -- drivers/net/tun.c
> > 63b9ab65bd76e5de6479bb14b4014b64aa1a317a tuntap: properly align skb->head 
> > before building skb
> > 5c25f65fd1e42685f7ccd80e0621829c105785d9 tun: allow positive return values 
> > on dev_get_valid_name() call
> > 0ad646c81b2182f7fa67ec0c8c825e0ee165696d tun: call dev_get_valid_name() 
> > before register_netdevice()
> > 2580c4c17aee3ad58e9751012bad278dd074ccae tun: bail out from tun_get_user() 
> > if the skb is empty
> >
> > Pick the fixes, they are at least 2 patches that addressed the issue.
> 
> Let's try the last one in net-next that touches timers:
> 
> #syz fix: tun: do not arm flow_gc_timer in tun_flow_init()

Note that is is common to have multiple patches sharing a same title, so
your tag is ambiguous.

Why don't you update your bot to catch up standard SHA1 title  format,
so that we do not have to ?





Re: KASAN: use-after-free Write in detach_if_pending

2017-10-29 Thread Eric Dumazet
On Sun, 2017-10-29 at 13:45 +0100, Thomas Gleixner wrote:
> On Fri, 27 Oct 2017, syzbot wrote:
> 
> Cc'ed network folks.
> 
> > syzkaller hit the following crash on 
> > e7989f973ae1b90ec7c0b671c81f7f553affccbe
> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > C reproducer is attached
> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> > for information about syzkaller reproducers
> > 
> > 
> > BUG: KASAN: use-after-free in __write_once_size include/linux/compiler.h:305
> > [inline]
> > BUG: KASAN: use-after-free in __hlist_del include/linux/list.h:648 [inline]
> > BUG: KASAN: use-after-free in detach_timer kernel/time/timer.c:791 [inline]
> > BUG: KASAN: use-after-free in detach_if_pending+0x557/0x610
> > kernel/time/timer.c:808
> > Write of size 8 at addr 8801d3bab780 by task syzkaller900516/2986
> 
> That's just the point where this gets detected.
> 
> > CPU: 1 PID: 2986 Comm: syzkaller900516 Not tainted 4.13.0+ #82
> 
> > __hlist_del include/linux/list.h:648 [inline]
> > detach_timer kernel/time/timer.c:791 [inline]
> > detach_if_pending+0x557/0x610 kernel/time/timer.c:808
> > try_to_del_timer_sync+0xa2/0x120 kernel/time/timer.c:1182
> > del_timer_sync+0x18a/0x240 kernel/time/timer.c:1247
> > tun_flow_uninit drivers/net/tun.c:1104 [inline]
> > tun_free_netdev+0x105/0x1b0 drivers/net/tun.c:1776
> 
>    This shouldn't be called I think
> 
> > netdev_run_todo+0x870/0xca0 net/core/dev.c:7864
> > rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106
> > tun_detach drivers/net/tun.c:588 [inline]
> > tun_chr_close+0x49/0x60 drivers/net/tun.c:2609
> > __fput+0x333/0x7f0 fs/file_table.c:210
> > fput+0x15/0x20 fs/file_table.c:246
> > task_work_run+0x199/0x270 kernel/task_work.c:112
> > exit_task_work include/linux/task_work.h:21 [inline]
> > do_exit+0xa52/0x1b40 kernel/exit.c:865
> 
> Here is the allocation path
> 
> > alloc_netdev_mqs+0x16e/0xed0 net/core/dev.c:8018
> > tun_set_iff drivers/net/tun.c:2022 [inline]
> > __tun_chr_ioctl+0x12be/0x3d20 drivers/net/tun.c:2276
> > tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2521
> > vfs_ioctl fs/ioctl.c:45 [inline]
> > do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
> > SYSC_ioctl fs/ioctl.c:700 [inline]
> > SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
> > entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> 
> And this is free.
> 
> > netdev_freemem net/core/dev.c:7970 [inline]
> > free_netdev+0x2cf/0x360 net/core/dev.c:8132
> > tun_set_iff drivers/net/tun.c:2105 [inline]
> 
> err_free_flow:
> tun_flow_uninit(tun); <
> 
> > __tun_chr_ioctl+0x2cf6/0x3d20 drivers/net/tun.c:2276
> > tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2521
> > vfs_ioctl fs/ioctl.c:45 [inline]
> > do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
> > SYSC_ioctl fs/ioctl.c:700 [inline]
> > SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
> > entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> So it's the TUNSETIFF ioctl which first allocates and then frees in the
> errorpath of tun_set_iff.
> 
> But for some reason this sticks and the exit of that task does it again,
> which triggers KASAN in the innocent timer code.

Pretty old story, already fixed in David Miller trees.

net-next tree :

$ git log --oneline e7989f973ae1b90ec7c0b671c81.. -- drivers/net/tun.c
f8ddadc4db6c7b7029b6d0e0d9af24f74ad27ca2 Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
ee74d9967b829232723939cb7c9b100b29f6ec98 tun: do not arm flow_gc_timer in 
tun_flow_init()
81d98fa4df3d1683b3ef21e8a7a0ccac7874f0de tun: avoid extra timer schedule in 
tun_flow_cleanup()
7dbfb4ef77db5666f0f3a425e7db93ca30ff4285 tun: do not block BH again in 
tun_flow_cleanup()
aec72f3392b1d598a979e89c4fdb131965ae0ab3 net-tun: fix panics at dismantle time
010f245b9dd734adda6386c494a4ace953ea8dc4 tun: relax check on eth_get_headlen() 
return value
0ad646c81b2182f7fa67ec0c8c825e0ee165696d tun: call dev_get_valid_name() before 
register_netdevice()
53954cf8c5d205624167a2bfd117cc0c1a5f3c6d Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
2580c4c17aee3ad58e9751012bad278dd074ccae tun: bail out from tun_get_user() if 
the skb is empty
de8f3a83b0a0fddb2cf56e7a718127e9619ea3da bpf: add meta pointer for direct access
9484dc74fcf0750cd6726c9aa27edf97223916a8 tun: delete original tun_get() and 
rename __tun_get() to tun_get()
90e33d45940793def6f773b2d528e9f3c84ffdc7 tun: enable napi_gro_frags() for 
TUN/TAP driver
943170998b200190f99d3fe7e771437e2c51f319 tun: enable NAPI for TUN/TAP driver

net tree :

$ git log --oneline e7989f973ae1b90ec7c0b671c81.. -- drivers/net/tun.c
63b9ab65bd76e5de6479bb14b4014b64aa1a317a tuntap: properly align skb->head 
before building skb
5c25f65fd1e42685f7ccd80e0621829c105785d9 tun: allow positive return values on 
dev_get_valid_name() call
0ad646c81b2182f7fa67ec0c8c825e0ee165696d tun: call dev_get_valid_name() before 
register_netdevice()
2580c4c17aee3ad58e9751012bad278dd074ccae tun: bail out from 

Re: KASAN: use-after-free Write in detach_if_pending

2017-10-29 Thread Eric Dumazet
On Sun, 2017-10-29 at 13:45 +0100, Thomas Gleixner wrote:
> On Fri, 27 Oct 2017, syzbot wrote:
> 
> Cc'ed network folks.
> 
> > syzkaller hit the following crash on 
> > e7989f973ae1b90ec7c0b671c81f7f553affccbe
> > git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> > compiler: gcc (GCC) 7.1.1 20170620
> > .config is attached
> > Raw console output is attached.
> > C reproducer is attached
> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> > for information about syzkaller reproducers
> > 
> > 
> > BUG: KASAN: use-after-free in __write_once_size include/linux/compiler.h:305
> > [inline]
> > BUG: KASAN: use-after-free in __hlist_del include/linux/list.h:648 [inline]
> > BUG: KASAN: use-after-free in detach_timer kernel/time/timer.c:791 [inline]
> > BUG: KASAN: use-after-free in detach_if_pending+0x557/0x610
> > kernel/time/timer.c:808
> > Write of size 8 at addr 8801d3bab780 by task syzkaller900516/2986
> 
> That's just the point where this gets detected.
> 
> > CPU: 1 PID: 2986 Comm: syzkaller900516 Not tainted 4.13.0+ #82
> 
> > __hlist_del include/linux/list.h:648 [inline]
> > detach_timer kernel/time/timer.c:791 [inline]
> > detach_if_pending+0x557/0x610 kernel/time/timer.c:808
> > try_to_del_timer_sync+0xa2/0x120 kernel/time/timer.c:1182
> > del_timer_sync+0x18a/0x240 kernel/time/timer.c:1247
> > tun_flow_uninit drivers/net/tun.c:1104 [inline]
> > tun_free_netdev+0x105/0x1b0 drivers/net/tun.c:1776
> 
>    This shouldn't be called I think
> 
> > netdev_run_todo+0x870/0xca0 net/core/dev.c:7864
> > rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106
> > tun_detach drivers/net/tun.c:588 [inline]
> > tun_chr_close+0x49/0x60 drivers/net/tun.c:2609
> > __fput+0x333/0x7f0 fs/file_table.c:210
> > fput+0x15/0x20 fs/file_table.c:246
> > task_work_run+0x199/0x270 kernel/task_work.c:112
> > exit_task_work include/linux/task_work.h:21 [inline]
> > do_exit+0xa52/0x1b40 kernel/exit.c:865
> 
> Here is the allocation path
> 
> > alloc_netdev_mqs+0x16e/0xed0 net/core/dev.c:8018
> > tun_set_iff drivers/net/tun.c:2022 [inline]
> > __tun_chr_ioctl+0x12be/0x3d20 drivers/net/tun.c:2276
> > tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2521
> > vfs_ioctl fs/ioctl.c:45 [inline]
> > do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
> > SYSC_ioctl fs/ioctl.c:700 [inline]
> > SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
> > entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> 
> And this is free.
> 
> > netdev_freemem net/core/dev.c:7970 [inline]
> > free_netdev+0x2cf/0x360 net/core/dev.c:8132
> > tun_set_iff drivers/net/tun.c:2105 [inline]
> 
> err_free_flow:
> tun_flow_uninit(tun); <
> 
> > __tun_chr_ioctl+0x2cf6/0x3d20 drivers/net/tun.c:2276
> > tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2521
> > vfs_ioctl fs/ioctl.c:45 [inline]
> > do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
> > SYSC_ioctl fs/ioctl.c:700 [inline]
> > SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
> > entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> So it's the TUNSETIFF ioctl which first allocates and then frees in the
> errorpath of tun_set_iff.
> 
> But for some reason this sticks and the exit of that task does it again,
> which triggers KASAN in the innocent timer code.

Pretty old story, already fixed in David Miller trees.

net-next tree :

$ git log --oneline e7989f973ae1b90ec7c0b671c81.. -- drivers/net/tun.c
f8ddadc4db6c7b7029b6d0e0d9af24f74ad27ca2 Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
ee74d9967b829232723939cb7c9b100b29f6ec98 tun: do not arm flow_gc_timer in 
tun_flow_init()
81d98fa4df3d1683b3ef21e8a7a0ccac7874f0de tun: avoid extra timer schedule in 
tun_flow_cleanup()
7dbfb4ef77db5666f0f3a425e7db93ca30ff4285 tun: do not block BH again in 
tun_flow_cleanup()
aec72f3392b1d598a979e89c4fdb131965ae0ab3 net-tun: fix panics at dismantle time
010f245b9dd734adda6386c494a4ace953ea8dc4 tun: relax check on eth_get_headlen() 
return value
0ad646c81b2182f7fa67ec0c8c825e0ee165696d tun: call dev_get_valid_name() before 
register_netdevice()
53954cf8c5d205624167a2bfd117cc0c1a5f3c6d Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
2580c4c17aee3ad58e9751012bad278dd074ccae tun: bail out from tun_get_user() if 
the skb is empty
de8f3a83b0a0fddb2cf56e7a718127e9619ea3da bpf: add meta pointer for direct access
9484dc74fcf0750cd6726c9aa27edf97223916a8 tun: delete original tun_get() and 
rename __tun_get() to tun_get()
90e33d45940793def6f773b2d528e9f3c84ffdc7 tun: enable napi_gro_frags() for 
TUN/TAP driver
943170998b200190f99d3fe7e771437e2c51f319 tun: enable NAPI for TUN/TAP driver

net tree :

$ git log --oneline e7989f973ae1b90ec7c0b671c81.. -- drivers/net/tun.c
63b9ab65bd76e5de6479bb14b4014b64aa1a317a tuntap: properly align skb->head 
before building skb
5c25f65fd1e42685f7ccd80e0621829c105785d9 tun: allow positive return values on 
dev_get_valid_name() call
0ad646c81b2182f7fa67ec0c8c825e0ee165696d tun: call dev_get_valid_name() before 
register_netdevice()
2580c4c17aee3ad58e9751012bad278dd074ccae tun: bail out from 

Re: WARNING in refcount_sub_and_test

2017-10-27 Thread Eric Dumazet
On Fri, 2017-10-27 at 08:09 +0200, Dmitry Vyukov wrote:

> Yes, I've noticed this one. It seems to happen on a first incoming
> network connection (ssh/scp). I have not seen it before.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=timers/core=52f737c2da40259ac9962170ce608b6fb1b55ee4

( Google-Bug-Id: 68003409 )




Re: WARNING in refcount_sub_and_test

2017-10-27 Thread Eric Dumazet
On Fri, 2017-10-27 at 08:09 +0200, Dmitry Vyukov wrote:

> Yes, I've noticed this one. It seems to happen on a first incoming
> network connection (ssh/scp). I have not seen it before.

https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=timers/core=52f737c2da40259ac9962170ce608b6fb1b55ee4

( Google-Bug-Id: 68003409 )




Re: [PATCH net] tuntap: properly align skb->head before building skb

2017-10-26 Thread Eric Dumazet
On Thu, Oct 26, 2017 at 5:15 AM, Jason Wang <jasow...@redhat.com> wrote:
> An unaligned alloc_frag->offset caused by previous allocation will
> result an unaligned skb->head. This will lead unaligned
> skb_shared_info and then unaligned dataref which requires to be
> aligned for accessing on some architecture. Fix this by aligning
> alloc_frag->offset before the frag refilling.
>
> Fixes: 0bbd7dad34f8 ("tun: make tun_build_skb() thread safe")
> Cc: Eric Dumazet <eduma...@google.com>
> Cc: Willem de Bruijn <willemdebruijn.ker...@gmail.com>
> Cc: Wei Wei <dotwe...@gmail.com>
> Cc: Dmitry Vyukov <dvyu...@google.com>
> Cc: Mark Rutland <mark.rutl...@arm.com>
> Reported-by: Wei Wei <dotwe...@gmail.com>
> Signed-off-by: Jason Wang <jasow...@redhat.com>
> ---
> - The patch is needed for -stable.
> - Wei, can you try this patch to see if it solves your issue?
> ---
>  drivers/net/tun.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index b9973fb..60e44f2 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1286,6 +1286,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
> *tun,
> buflen += SKB_DATA_ALIGN(len + pad);
> rcu_read_unlock();
>
> +   alloc_frag->offset = ALIGN((u64)alloc_frag->offset, TUN_RX_PAD);

You have to align to one cache line (SMP_CACHE_BYTES), or SKB_DATA_ALIGN(1)

Then eventually use skb_reserve() for NET_IP_ALIGN, but I guess it is
already done.


Re: [PATCH net] tuntap: properly align skb->head before building skb

2017-10-26 Thread Eric Dumazet
On Thu, Oct 26, 2017 at 5:15 AM, Jason Wang  wrote:
> An unaligned alloc_frag->offset caused by previous allocation will
> result an unaligned skb->head. This will lead unaligned
> skb_shared_info and then unaligned dataref which requires to be
> aligned for accessing on some architecture. Fix this by aligning
> alloc_frag->offset before the frag refilling.
>
> Fixes: 0bbd7dad34f8 ("tun: make tun_build_skb() thread safe")
> Cc: Eric Dumazet 
> Cc: Willem de Bruijn 
> Cc: Wei Wei 
> Cc: Dmitry Vyukov 
> Cc: Mark Rutland 
> Reported-by: Wei Wei 
> Signed-off-by: Jason Wang 
> ---
> - The patch is needed for -stable.
> - Wei, can you try this patch to see if it solves your issue?
> ---
>  drivers/net/tun.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index b9973fb..60e44f2 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1286,6 +1286,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
> *tun,
> buflen += SKB_DATA_ALIGN(len + pad);
> rcu_read_unlock();
>
> +   alloc_frag->offset = ALIGN((u64)alloc_frag->offset, TUN_RX_PAD);

You have to align to one cache line (SMP_CACHE_BYTES), or SKB_DATA_ALIGN(1)

Then eventually use skb_reserve() for NET_IP_ALIGN, but I guess it is
already done.


Re: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-25 Thread Eric Dumazet
On Wed, Oct 25, 2017 at 11:49 AM, Willem de Bruijn
 wrote:

> From skb->dev and netdev_priv, the tun device has flags 0x1002 ==
> IFF_TAP | IFF_NO_PI. This kernel precedes the recent support for
> IFF_NAPI and IFF_NAPI_FRAGS. The allocation most likely happened
> in tun_build_skb from current->task_frag. It would be a previous
> allocation that left alloc_frag->offset unaligned. But perhaps this code
> needs to perform alignment before setting skb->head. At least on
> platforms where atomic on dataref must be aligned.

+1

Bug added in commit 66ccbc9c87c2 ("tap: use build_skb() for small packet")


Re: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-25 Thread Eric Dumazet
On Wed, Oct 25, 2017 at 11:49 AM, Willem de Bruijn
 wrote:

> From skb->dev and netdev_priv, the tun device has flags 0x1002 ==
> IFF_TAP | IFF_NO_PI. This kernel precedes the recent support for
> IFF_NAPI and IFF_NAPI_FRAGS. The allocation most likely happened
> in tun_build_skb from current->task_frag. It would be a previous
> allocation that left alloc_frag->offset unaligned. But perhaps this code
> needs to perform alignment before setting skb->head. At least on
> platforms where atomic on dataref must be aligned.

+1

Bug added in commit 66ccbc9c87c2 ("tap: use build_skb() for small packet")


Re: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-19 Thread Eric Dumazet
   mov%al,(%rax)   <-- trapping 
> instruction
>   15:   00 00   add%al,(%rax)
> ...
>
> Code starting with the faulting instruction
> ===
>    0:   01 7c 5f 88 add%edi,-0x78(%rdi,%rbx,2)
>4:   00 00   add%al,(%rax)
> ...
> —[ end trace 261e7ac1458ccc0a ]---
>

I thought it was happening on arm64 ?

This is x86_64 disassembly :/

Thanks.

> Thanks,
> Wei
>
>> On 19 Oct 2017, at 10:53 PM, Eric Dumazet <eduma...@google.com> wrote:
>>
>> On Thu, Oct 19, 2017 at 7:16 PM, Wei Wei <dotwe...@gmail.com> wrote:
>>> Hi all,
>>>
>>> I have fuzzed v4.14-rc3 using syzkaller and found a bug similar to that one 
>>> [1].
>>> But the call trace isn’t the same. The atomic_inc() might handle a corrupted
>>> skb_buff.
>>>
>>> The logs and config have been uploaded to my github repo [2].
>>>
>>> [1] https://lkml.org/lkml/2017/10/2/216
>>> [2] https://github.com/dotweiba/skb_clone_atomic_inc_bug
>>>
>>> Thanks,
>>> Wei
>>>
>>> Unable to handle kernel paging request at virtual address 80005bfb81ed
>>> Mem abort info:
>>>   Exception class = DABT (current EL), IL = 32 bits
>>>   SET = 0, FnV = 0
>>>   EA = 0, S1PTW = 0
>>> Data abort info:
>>>   ISV = 0, ISS = 0x0033
>>>   CM = 0, WnR = 0
>>> swapper pgtable: 4k pages, 48-bit VAs, pgd = 2b366000
>>> [80005bfb81ed] *pgd=beff7003, *pud=00e88711
>>> Internal error: Oops: 9621 [#1] PREEMPT SMP
>>> Modules linked in:
>>> CPU: 3 PID: 4725 Comm: syz-executor0 Not tainted 4.14.0-rc3 #3
>>> Hardware name: linux,dummy-virt (DT)
>>> task: 800074409e00 task.stack: 800033db
>>> PC is at __skb_clone+0x430/0x5b0
>>> LR is at __skb_clone+0x1dc/0x5b0
>>> pc : [] lr : [] pstate: 1145
>>> sp : 800033db33d0
>>> x29: 800033db33d0 x28: 298ac378
>>> x27: 16a860e1 x26: 167b66b6
>>> x25: 8000743340a0 x24: 800035430708
>>> x23: 80005bfb80c9 x22: 800035430710
>>> x21: 0380 x20: 800035430640
>>> x19: 8000354312c0 x18: 
>>> x17: 004af000 x16: 2845e8c8
>>> x15: 1e518060 x14: d8316070
>>> x13: d8316090 x12: 
>>> x11: 16a8626f x10: 16a8626f
>>> x9 : dfff2000 x8 : 0082009000900608
>>> x7 :  x6 : 800035431380
>>> x5 : 16a86270 x4 : 
>>> x3 : 16a86273 x2 : 
>>> x1 : 0100 x0 : 80005bfb81ed
>>> Process syz-executor0 (pid: 4725, stack limit = 0x800033db)
>>> Call trace:
>>> Exception stack(0x800033db3290 to 0x800033db33d0)
>>> 3280:   80005bfb81ed 0100
>>> 32a0:  16a86273  16a86270
>>> 32c0: 800035431380  0082009000900608 dfff2000
>>> 32e0: 16a8626f 16a8626f  d8316090
>>> 3300: d8316070 1e518060 2845e8c8 004af000
>>> 3320:  8000354312c0 800035430640 0380
>>> 3340: 800035430710 80005bfb80c9 800035430708 8000743340a0
>>> 3360: 167b66b6 16a860e1 298ac378 800033db33d0
>>> 3380: 29705cfc 800033db33d0 29705f50 1145
>>> 33a0: 8000354312c0 800035430640 0001 800074334000
>>> 33c0: 800033db33d0 29705f50
>>> [] __skb_clone+0x430/0x5b0
>>> [] skb_clone+0x164/0x2c8
>>> [] arp_rcv+0x120/0x488
>>> [] __netif_receive_skb_core+0x11e8/0x18c8
>>> [] __netif_receive_skb+0x30/0x198
>>> [] netif_receive_skb_internal+0x98/0x370
>>> [] netif_receive_skb+0x1c/0x28
>>> [] tun_get_user+0x12f0/0x2e40
>>> [] tun_chr_write_iter+0xbc/0x140
>>> [] do_iter_readv_writev+0x2d4/0x468
>>> [] do_iter_write+0x148/0x498
>>> [] vfs_writev+0x118/0x250
>>> [] do_writev+0xc4/0x1e8
>>> [] SyS_writev+0x34/0x48
>>> Exception stack(0x800033db3ec0 to 0x800033db4000)
>>> 3ec0: 0015 829985e0 0001 8299851c
>>> 3ee0: 82999068 82998f60 82999650 
>>> 3f00: 0042 0036 00406608 82998400
>>> 3f20: 82998f60 d8316090 d8316070 1e518060
>>> 3f40:  004af000  0036
>>> 3f60: 20004fca 2000 0046ccf0 0530
>>> 3f80: 0046cce8 004ade98  395fa6f0
>>> 3fa0: 82998f60 82998560 00431448 82998520
>>> 3fc0: 0043145c 8000 0015 0042
>>> 3fe0:    
>>> [] el0_svc_naked+0x24/0x28
>>> Code: f9406680 8b01 91009000 f9800011 (885f7c01)
>>> ---[ end trace 261e7ac1458ccc0a ]---
>>
>> Please provide proper file:line information in this trace.
>>
>> You can use scripts/decode_stacktrace.sh
>>
>> Thanks.
>


Re: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-19 Thread Eric Dumazet
<-- trapping 
> instruction
>   15:   00 00   add%al,(%rax)
> ...
>
> Code starting with the faulting instruction
> ===
>    0:   01 7c 5f 88 add%edi,-0x78(%rdi,%rbx,2)
>4:   00 00   add%al,(%rax)
> ...
> —[ end trace 261e7ac1458ccc0a ]---
>

I thought it was happening on arm64 ?

This is x86_64 disassembly :/

Thanks.

> Thanks,
> Wei
>
>> On 19 Oct 2017, at 10:53 PM, Eric Dumazet  wrote:
>>
>> On Thu, Oct 19, 2017 at 7:16 PM, Wei Wei  wrote:
>>> Hi all,
>>>
>>> I have fuzzed v4.14-rc3 using syzkaller and found a bug similar to that one 
>>> [1].
>>> But the call trace isn’t the same. The atomic_inc() might handle a corrupted
>>> skb_buff.
>>>
>>> The logs and config have been uploaded to my github repo [2].
>>>
>>> [1] https://lkml.org/lkml/2017/10/2/216
>>> [2] https://github.com/dotweiba/skb_clone_atomic_inc_bug
>>>
>>> Thanks,
>>> Wei
>>>
>>> Unable to handle kernel paging request at virtual address 80005bfb81ed
>>> Mem abort info:
>>>   Exception class = DABT (current EL), IL = 32 bits
>>>   SET = 0, FnV = 0
>>>   EA = 0, S1PTW = 0
>>> Data abort info:
>>>   ISV = 0, ISS = 0x0033
>>>   CM = 0, WnR = 0
>>> swapper pgtable: 4k pages, 48-bit VAs, pgd = 2b366000
>>> [80005bfb81ed] *pgd=beff7003, *pud=00e88711
>>> Internal error: Oops: 9621 [#1] PREEMPT SMP
>>> Modules linked in:
>>> CPU: 3 PID: 4725 Comm: syz-executor0 Not tainted 4.14.0-rc3 #3
>>> Hardware name: linux,dummy-virt (DT)
>>> task: 800074409e00 task.stack: 800033db
>>> PC is at __skb_clone+0x430/0x5b0
>>> LR is at __skb_clone+0x1dc/0x5b0
>>> pc : [] lr : [] pstate: 1145
>>> sp : 800033db33d0
>>> x29: 800033db33d0 x28: 298ac378
>>> x27: 16a860e1 x26: 167b66b6
>>> x25: 8000743340a0 x24: 800035430708
>>> x23: 80005bfb80c9 x22: 800035430710
>>> x21: 0380 x20: 800035430640
>>> x19: 8000354312c0 x18: 
>>> x17: 004af000 x16: 2845e8c8
>>> x15: 1e518060 x14: d8316070
>>> x13: d8316090 x12: 
>>> x11: 16a8626f x10: 16a8626f
>>> x9 : dfff2000 x8 : 0082009000900608
>>> x7 :  x6 : 800035431380
>>> x5 : 16a86270 x4 : 
>>> x3 : 16a86273 x2 : 
>>> x1 : 0100 x0 : 80005bfb81ed
>>> Process syz-executor0 (pid: 4725, stack limit = 0x800033db)
>>> Call trace:
>>> Exception stack(0x800033db3290 to 0x800033db33d0)
>>> 3280:   80005bfb81ed 0100
>>> 32a0:  16a86273  16a86270
>>> 32c0: 800035431380  0082009000900608 dfff2000
>>> 32e0: 16a8626f 16a8626f  d8316090
>>> 3300: d8316070 1e518060 2845e8c8 004af000
>>> 3320:  8000354312c0 800035430640 0380
>>> 3340: 800035430710 80005bfb80c9 800035430708 8000743340a0
>>> 3360: 167b66b6 16a860e1 298ac378 800033db33d0
>>> 3380: 29705cfc 800033db33d0 29705f50 1145
>>> 33a0: 8000354312c0 800035430640 0001 800074334000
>>> 33c0: 800033db33d0 29705f50
>>> [] __skb_clone+0x430/0x5b0
>>> [] skb_clone+0x164/0x2c8
>>> [] arp_rcv+0x120/0x488
>>> [] __netif_receive_skb_core+0x11e8/0x18c8
>>> [] __netif_receive_skb+0x30/0x198
>>> [] netif_receive_skb_internal+0x98/0x370
>>> [] netif_receive_skb+0x1c/0x28
>>> [] tun_get_user+0x12f0/0x2e40
>>> [] tun_chr_write_iter+0xbc/0x140
>>> [] do_iter_readv_writev+0x2d4/0x468
>>> [] do_iter_write+0x148/0x498
>>> [] vfs_writev+0x118/0x250
>>> [] do_writev+0xc4/0x1e8
>>> [] SyS_writev+0x34/0x48
>>> Exception stack(0x800033db3ec0 to 0x800033db4000)
>>> 3ec0: 0015 829985e0 0001 8299851c
>>> 3ee0: 82999068 82998f60 82999650 
>>> 3f00: 0042 0036 00406608 82998400
>>> 3f20: 82998f60 d8316090 d8316070 1e518060
>>> 3f40:  004af000  0036
>>> 3f60: 20004fca 2000 0046ccf0 0530
>>> 3f80: 0046cce8 004ade98  395fa6f0
>>> 3fa0: 82998f60 82998560 00431448 82998520
>>> 3fc0: 0043145c 8000 0015 0042
>>> 3fe0:    
>>> [] el0_svc_naked+0x24/0x28
>>> Code: f9406680 8b01 91009000 f9800011 (885f7c01)
>>> ---[ end trace 261e7ac1458ccc0a ]---
>>
>> Please provide proper file:line information in this trace.
>>
>> You can use scripts/decode_stacktrace.sh
>>
>> Thanks.
>


Re: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-19 Thread Eric Dumazet
On Thu, Oct 19, 2017 at 7:16 PM, Wei Wei  wrote:
> Hi all,
>
> I have fuzzed v4.14-rc3 using syzkaller and found a bug similar to that one 
> [1].
> But the call trace isn’t the same. The atomic_inc() might handle a corrupted
> skb_buff.
>
> The logs and config have been uploaded to my github repo [2].
>
> [1] https://lkml.org/lkml/2017/10/2/216
> [2] https://github.com/dotweiba/skb_clone_atomic_inc_bug
>
> Thanks,
> Wei
>
>  Unable to handle kernel paging request at virtual address 80005bfb81ed
>  Mem abort info:
>Exception class = DABT (current EL), IL = 32 bits
>SET = 0, FnV = 0
>EA = 0, S1PTW = 0
>  Data abort info:
>ISV = 0, ISS = 0x0033
>CM = 0, WnR = 0
>  swapper pgtable: 4k pages, 48-bit VAs, pgd = 2b366000
>  [80005bfb81ed] *pgd=beff7003, *pud=00e88711
>  Internal error: Oops: 9621 [#1] PREEMPT SMP
>  Modules linked in:
>  CPU: 3 PID: 4725 Comm: syz-executor0 Not tainted 4.14.0-rc3 #3
>  Hardware name: linux,dummy-virt (DT)
>  task: 800074409e00 task.stack: 800033db
>  PC is at __skb_clone+0x430/0x5b0
>  LR is at __skb_clone+0x1dc/0x5b0
>  pc : [] lr : [] pstate: 1145
>  sp : 800033db33d0
>  x29: 800033db33d0 x28: 298ac378
>  x27: 16a860e1 x26: 167b66b6
>  x25: 8000743340a0 x24: 800035430708
>  x23: 80005bfb80c9 x22: 800035430710
>  x21: 0380 x20: 800035430640
>  x19: 8000354312c0 x18: 
>  x17: 004af000 x16: 2845e8c8
>  x15: 1e518060 x14: d8316070
>  x13: d8316090 x12: 
>  x11: 16a8626f x10: 16a8626f
>  x9 : dfff2000 x8 : 0082009000900608
>  x7 :  x6 : 800035431380
>  x5 : 16a86270 x4 : 
>  x3 : 16a86273 x2 : 
>  x1 : 0100 x0 : 80005bfb81ed
>  Process syz-executor0 (pid: 4725, stack limit = 0x800033db)
>  Call trace:
>  Exception stack(0x800033db3290 to 0x800033db33d0)
>  3280:   80005bfb81ed 0100
>  32a0:  16a86273  16a86270
>  32c0: 800035431380  0082009000900608 dfff2000
>  32e0: 16a8626f 16a8626f  d8316090
>  3300: d8316070 1e518060 2845e8c8 004af000
>  3320:  8000354312c0 800035430640 0380
>  3340: 800035430710 80005bfb80c9 800035430708 8000743340a0
>  3360: 167b66b6 16a860e1 298ac378 800033db33d0
>  3380: 29705cfc 800033db33d0 29705f50 1145
>  33a0: 8000354312c0 800035430640 0001 800074334000
>  33c0: 800033db33d0 29705f50
>  [] __skb_clone+0x430/0x5b0
>  [] skb_clone+0x164/0x2c8
>  [] arp_rcv+0x120/0x488
>  [] __netif_receive_skb_core+0x11e8/0x18c8
>  [] __netif_receive_skb+0x30/0x198
>  [] netif_receive_skb_internal+0x98/0x370
>  [] netif_receive_skb+0x1c/0x28
>  [] tun_get_user+0x12f0/0x2e40
>  [] tun_chr_write_iter+0xbc/0x140
>  [] do_iter_readv_writev+0x2d4/0x468
>  [] do_iter_write+0x148/0x498
>  [] vfs_writev+0x118/0x250
>  [] do_writev+0xc4/0x1e8
>  [] SyS_writev+0x34/0x48
>  Exception stack(0x800033db3ec0 to 0x800033db4000)
>  3ec0: 0015 829985e0 0001 8299851c
>  3ee0: 82999068 82998f60 82999650 
>  3f00: 0042 0036 00406608 82998400
>  3f20: 82998f60 d8316090 d8316070 1e518060
>  3f40:  004af000  0036
>  3f60: 20004fca 2000 0046ccf0 0530
>  3f80: 0046cce8 004ade98  395fa6f0
>  3fa0: 82998f60 82998560 00431448 82998520
>  3fc0: 0043145c 8000 0015 0042
>  3fe0:    
>  [] el0_svc_naked+0x24/0x28
>  Code: f9406680 8b01 91009000 f9800011 (885f7c01)
>  ---[ end trace 261e7ac1458ccc0a ]---

Please provide proper file:line information in this trace.

You can use scripts/decode_stacktrace.sh

Thanks.


Re: v4.14-rc3/arm64 DABT exception in atomic_inc() / __skb_clone()

2017-10-19 Thread Eric Dumazet
On Thu, Oct 19, 2017 at 7:16 PM, Wei Wei  wrote:
> Hi all,
>
> I have fuzzed v4.14-rc3 using syzkaller and found a bug similar to that one 
> [1].
> But the call trace isn’t the same. The atomic_inc() might handle a corrupted
> skb_buff.
>
> The logs and config have been uploaded to my github repo [2].
>
> [1] https://lkml.org/lkml/2017/10/2/216
> [2] https://github.com/dotweiba/skb_clone_atomic_inc_bug
>
> Thanks,
> Wei
>
>  Unable to handle kernel paging request at virtual address 80005bfb81ed
>  Mem abort info:
>Exception class = DABT (current EL), IL = 32 bits
>SET = 0, FnV = 0
>EA = 0, S1PTW = 0
>  Data abort info:
>ISV = 0, ISS = 0x0033
>CM = 0, WnR = 0
>  swapper pgtable: 4k pages, 48-bit VAs, pgd = 2b366000
>  [80005bfb81ed] *pgd=beff7003, *pud=00e88711
>  Internal error: Oops: 9621 [#1] PREEMPT SMP
>  Modules linked in:
>  CPU: 3 PID: 4725 Comm: syz-executor0 Not tainted 4.14.0-rc3 #3
>  Hardware name: linux,dummy-virt (DT)
>  task: 800074409e00 task.stack: 800033db
>  PC is at __skb_clone+0x430/0x5b0
>  LR is at __skb_clone+0x1dc/0x5b0
>  pc : [] lr : [] pstate: 1145
>  sp : 800033db33d0
>  x29: 800033db33d0 x28: 298ac378
>  x27: 16a860e1 x26: 167b66b6
>  x25: 8000743340a0 x24: 800035430708
>  x23: 80005bfb80c9 x22: 800035430710
>  x21: 0380 x20: 800035430640
>  x19: 8000354312c0 x18: 
>  x17: 004af000 x16: 2845e8c8
>  x15: 1e518060 x14: d8316070
>  x13: d8316090 x12: 
>  x11: 16a8626f x10: 16a8626f
>  x9 : dfff2000 x8 : 0082009000900608
>  x7 :  x6 : 800035431380
>  x5 : 16a86270 x4 : 
>  x3 : 16a86273 x2 : 
>  x1 : 0100 x0 : 80005bfb81ed
>  Process syz-executor0 (pid: 4725, stack limit = 0x800033db)
>  Call trace:
>  Exception stack(0x800033db3290 to 0x800033db33d0)
>  3280:   80005bfb81ed 0100
>  32a0:  16a86273  16a86270
>  32c0: 800035431380  0082009000900608 dfff2000
>  32e0: 16a8626f 16a8626f  d8316090
>  3300: d8316070 1e518060 2845e8c8 004af000
>  3320:  8000354312c0 800035430640 0380
>  3340: 800035430710 80005bfb80c9 800035430708 8000743340a0
>  3360: 167b66b6 16a860e1 298ac378 800033db33d0
>  3380: 29705cfc 800033db33d0 29705f50 1145
>  33a0: 8000354312c0 800035430640 0001 800074334000
>  33c0: 800033db33d0 29705f50
>  [] __skb_clone+0x430/0x5b0
>  [] skb_clone+0x164/0x2c8
>  [] arp_rcv+0x120/0x488
>  [] __netif_receive_skb_core+0x11e8/0x18c8
>  [] __netif_receive_skb+0x30/0x198
>  [] netif_receive_skb_internal+0x98/0x370
>  [] netif_receive_skb+0x1c/0x28
>  [] tun_get_user+0x12f0/0x2e40
>  [] tun_chr_write_iter+0xbc/0x140
>  [] do_iter_readv_writev+0x2d4/0x468
>  [] do_iter_write+0x148/0x498
>  [] vfs_writev+0x118/0x250
>  [] do_writev+0xc4/0x1e8
>  [] SyS_writev+0x34/0x48
>  Exception stack(0x800033db3ec0 to 0x800033db4000)
>  3ec0: 0015 829985e0 0001 8299851c
>  3ee0: 82999068 82998f60 82999650 
>  3f00: 0042 0036 00406608 82998400
>  3f20: 82998f60 d8316090 d8316070 1e518060
>  3f40:  004af000  0036
>  3f60: 20004fca 2000 0046ccf0 0530
>  3f80: 0046cce8 004ade98  395fa6f0
>  3fa0: 82998f60 82998560 00431448 82998520
>  3fc0: 0043145c 8000 0015 0042
>  3fe0:    
>  [] el0_svc_naked+0x24/0x28
>  Code: f9406680 8b01 91009000 f9800011 (885f7c01)
>  ---[ end trace 261e7ac1458ccc0a ]---

Please provide proper file:line information in this trace.

You can use scripts/decode_stacktrace.sh

Thanks.


Re: [PATCH] lib/dynamic_queue_limits.c: relax BUG_ON to WARN_ON in dql_complete()

2017-10-18 Thread Eric Dumazet
On Wed, 2017-10-18 at 18:57 +0100, Ard Biesheuvel wrote:
> On 18 October 2017 at 17:29, Eric Dumazet <eric.duma...@gmail.com> wrote:
> > On Wed, 2017-10-18 at 16:45 +0100, Ard Biesheuvel wrote:
> >> Even though calling dql_completed() with a count that exceeds the
> >> queued count is a serious error, it still does not justify bringing
> >> down the entire kernel with a BUG_ON(). So relax it to a WARN_ON()
> >> instead.
> >>
> >> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> >> ---
> >>  lib/dynamic_queue_limits.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> >> index f346715e2255..24ce495d78f3 100644
> >> --- a/lib/dynamic_queue_limits.c
> >> +++ b/lib/dynamic_queue_limits.c
> >> @@ -23,7 +23,7 @@ void dql_completed(struct dql *dql, unsigned int count)
> >>   num_queued = ACCESS_ONCE(dql->num_queued);
> >>
> >>   /* Can't complete more than what's in queue */
> >> - BUG_ON(count > num_queued - dql->num_completed);
> >> + WARN_ON(count > num_queued - dql->num_completed);
> >>
> >>   completed = dql->num_completed + count;
> >>   limit = dql->limit;
> >
> > So instead fixing the faulty driver, you'll have strange lockups, and
> > force your users to reboot anyway, after annoying periods where
> > "Internet does not work"
> >
> > These kinds of errors should be found when testing a new device driver
> > or new kernel.
> >
> > Have you found the root cause ?
> >
> 
> Not yet, and I don't intend to send out any patches for this
> particular hardware until this is fixed.
> 
> But that still doesn't mean you should crash hard. As Linus puts it,
> it is better to 'limp on' if you can (unless we're likely to corrupt
> any non-volatile data, e.g., files on disk etc)

How many BUG() do you plan to change to WARN() exactly ?

If you want to comply to Linus wish, just compile your kernel
with appropriate option.

CONFIG_BUG=n




Re: [PATCH] lib/dynamic_queue_limits.c: relax BUG_ON to WARN_ON in dql_complete()

2017-10-18 Thread Eric Dumazet
On Wed, 2017-10-18 at 18:57 +0100, Ard Biesheuvel wrote:
> On 18 October 2017 at 17:29, Eric Dumazet  wrote:
> > On Wed, 2017-10-18 at 16:45 +0100, Ard Biesheuvel wrote:
> >> Even though calling dql_completed() with a count that exceeds the
> >> queued count is a serious error, it still does not justify bringing
> >> down the entire kernel with a BUG_ON(). So relax it to a WARN_ON()
> >> instead.
> >>
> >> Signed-off-by: Ard Biesheuvel 
> >> ---
> >>  lib/dynamic_queue_limits.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> >> index f346715e2255..24ce495d78f3 100644
> >> --- a/lib/dynamic_queue_limits.c
> >> +++ b/lib/dynamic_queue_limits.c
> >> @@ -23,7 +23,7 @@ void dql_completed(struct dql *dql, unsigned int count)
> >>   num_queued = ACCESS_ONCE(dql->num_queued);
> >>
> >>   /* Can't complete more than what's in queue */
> >> - BUG_ON(count > num_queued - dql->num_completed);
> >> + WARN_ON(count > num_queued - dql->num_completed);
> >>
> >>   completed = dql->num_completed + count;
> >>   limit = dql->limit;
> >
> > So instead fixing the faulty driver, you'll have strange lockups, and
> > force your users to reboot anyway, after annoying periods where
> > "Internet does not work"
> >
> > These kinds of errors should be found when testing a new device driver
> > or new kernel.
> >
> > Have you found the root cause ?
> >
> 
> Not yet, and I don't intend to send out any patches for this
> particular hardware until this is fixed.
> 
> But that still doesn't mean you should crash hard. As Linus puts it,
> it is better to 'limp on' if you can (unless we're likely to corrupt
> any non-volatile data, e.g., files on disk etc)

How many BUG() do you plan to change to WARN() exactly ?

If you want to comply to Linus wish, just compile your kernel
with appropriate option.

CONFIG_BUG=n




Re: [PATCH] lib/dynamic_queue_limits.c: relax BUG_ON to WARN_ON in dql_complete()

2017-10-18 Thread Eric Dumazet
On Wed, 2017-10-18 at 16:45 +0100, Ard Biesheuvel wrote:
> Even though calling dql_completed() with a count that exceeds the
> queued count is a serious error, it still does not justify bringing
> down the entire kernel with a BUG_ON(). So relax it to a WARN_ON()
> instead.
> 
> Signed-off-by: Ard Biesheuvel 
> ---
>  lib/dynamic_queue_limits.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> index f346715e2255..24ce495d78f3 100644
> --- a/lib/dynamic_queue_limits.c
> +++ b/lib/dynamic_queue_limits.c
> @@ -23,7 +23,7 @@ void dql_completed(struct dql *dql, unsigned int count)
>   num_queued = ACCESS_ONCE(dql->num_queued);
>  
>   /* Can't complete more than what's in queue */
> - BUG_ON(count > num_queued - dql->num_completed);
> + WARN_ON(count > num_queued - dql->num_completed);
>  
>   completed = dql->num_completed + count;
>   limit = dql->limit;

So instead fixing the faulty driver, you'll have strange lockups, and
force your users to reboot anyway, after annoying periods where
"Internet does not work"

These kinds of errors should be found when testing a new device driver
or new kernel.

Have you found the root cause ?




Re: [PATCH] lib/dynamic_queue_limits.c: relax BUG_ON to WARN_ON in dql_complete()

2017-10-18 Thread Eric Dumazet
On Wed, 2017-10-18 at 16:45 +0100, Ard Biesheuvel wrote:
> Even though calling dql_completed() with a count that exceeds the
> queued count is a serious error, it still does not justify bringing
> down the entire kernel with a BUG_ON(). So relax it to a WARN_ON()
> instead.
> 
> Signed-off-by: Ard Biesheuvel 
> ---
>  lib/dynamic_queue_limits.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
> index f346715e2255..24ce495d78f3 100644
> --- a/lib/dynamic_queue_limits.c
> +++ b/lib/dynamic_queue_limits.c
> @@ -23,7 +23,7 @@ void dql_completed(struct dql *dql, unsigned int count)
>   num_queued = ACCESS_ONCE(dql->num_queued);
>  
>   /* Can't complete more than what's in queue */
> - BUG_ON(count > num_queued - dql->num_completed);
> + WARN_ON(count > num_queued - dql->num_completed);
>  
>   completed = dql->num_completed + count;
>   limit = dql->limit;

So instead fixing the faulty driver, you'll have strange lockups, and
force your users to reboot anyway, after annoying periods where
"Internet does not work"

These kinds of errors should be found when testing a new device driver
or new kernel.

Have you found the root cause ?




Re: [PATCH] net: export netdev_txq_to_tc to allow sch_mqprio to compile as module

2017-10-17 Thread Eric Dumazet
On Tue, Oct 17, 2017 at 3:10 AM, Henrik Austad <hen...@austad.us> wrote:
> In commit 32302902ff09 ("mqprio: Reserve last 32 classid values for HW
> traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
> to find the correct tc instead of dev->tc_to_txq[]
>
> However, when mqprio is compiled as a module, it cannot resolve the
> symbol, leading to this error:
>
>  ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!
>
> This adds an EXPORT_SYMBOL() since the other user in the kernel
> (netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
> sysfs-callback.
>
> Cc: Alexander Duyck <alexander.h.du...@intel.com>
> Cc: Jesus Sanchez-Palencia <jesus.sanchez-palen...@intel.com>
> Cc: David S. Miller <da...@davemloft.net>
> Signed-off-by: Henrik Austad <haus...@cisco.com>


Reviewed-by: Eric Dumazet <eduma...@google.com>


Re: [PATCH] net: export netdev_txq_to_tc to allow sch_mqprio to compile as module

2017-10-17 Thread Eric Dumazet
On Tue, Oct 17, 2017 at 3:10 AM, Henrik Austad  wrote:
> In commit 32302902ff09 ("mqprio: Reserve last 32 classid values for HW
> traffic classes and misc IDs") sch_mqprio started using netdev_txq_to_tc
> to find the correct tc instead of dev->tc_to_txq[]
>
> However, when mqprio is compiled as a module, it cannot resolve the
> symbol, leading to this error:
>
>  ERROR: "netdev_txq_to_tc" [net/sched/sch_mqprio.ko] undefined!
>
> This adds an EXPORT_SYMBOL() since the other user in the kernel
> (netif_set_xps_queue) is also EXPORT_SYMBOL() (and not _GPL) or in a
> sysfs-callback.
>
> Cc: Alexander Duyck 
> Cc: Jesus Sanchez-Palencia 
> Cc: David S. Miller 
> Signed-off-by: Henrik Austad 


Reviewed-by: Eric Dumazet 


Re: [lkp-robot] [ipv6] 2b760fcf5c: WARNING:suspicious_RCU_usage

2017-10-13 Thread Eric Dumazet
On Thu, Oct 12, 2017 at 10:09 PM, Wei Wang  wrote:
>
> This warning should be fixed by later commit 66f5d6ce53e6 ("ipv6:
> replace rwlock with rcu and spinlock in fib6_table").
> But by this commit, rcu is not yet used in ip6_pol_route(). (Sorry
> that I missed this earlier.) Not sure what to do here to fix this
> particular warning for this commit.

Hi Wei

There is nothing special to be done, if the current tree is all good.

Such a major patch series being done without a single temp 'bug' would
have been quite a challenge.

Thanks !


Re: [lkp-robot] [ipv6] 2b760fcf5c: WARNING:suspicious_RCU_usage

2017-10-13 Thread Eric Dumazet
On Thu, Oct 12, 2017 at 10:09 PM, Wei Wang  wrote:
>
> This warning should be fixed by later commit 66f5d6ce53e6 ("ipv6:
> replace rwlock with rcu and spinlock in fib6_table").
> But by this commit, rcu is not yet used in ip6_pol_route(). (Sorry
> that I missed this earlier.) Not sure what to do here to fix this
> particular warning for this commit.

Hi Wei

There is nothing special to be done, if the current tree is all good.

Such a major patch series being done without a single temp 'bug' would
have been quite a challenge.

Thanks !


Re: next: arm64: LTP sendto01 test causes system crash in ilp32 mode

2017-10-11 Thread Eric Dumazet
On Wed, 2017-10-11 at 22:43 +0300, Yury Norov wrote:

> The fix works for me, thanks.
> 
> Tested-by: Yury Norov 

Thanks Yury !




Re: next: arm64: LTP sendto01 test causes system crash in ilp32 mode

2017-10-11 Thread Eric Dumazet
On Wed, 2017-10-11 at 22:43 +0300, Yury Norov wrote:

> The fix works for me, thanks.
> 
> Tested-by: Yury Norov 

Thanks Yury !




Re: next: arm64: LTP sendto01 test causes system crash in ilp32 mode

2017-10-11 Thread Eric Dumazet
On Wed, 2017-10-11 at 11:41 -0700, Eric Dumazet wrote:
> On Wed, 2017-10-11 at 21:35 +0300, Yury Norov wrote:
> > Hi all, 
> > 
> > It seems like next-20171009 with ilp32 patches crashes on LTP sendto01 test
> > in sys_sendto() path, like this:
> 
> Thanks for the report.
> Probably caused by one of my recent patches, so I am taking a look.

Yes, this was silly.

Please test this fix :

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 
5a95e5886b55e03e4a8bfeac3506c657a4f97dde..15163454174babdcb465904f725b919268dd1bc7
 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1712,6 +1712,7 @@ static inline void tcp_insert_write_queue_before(struct 
sk_buff *new,
 
 static inline void tcp_unlink_write_queue(struct sk_buff *skb, struct sock *sk)
 {
+   tcp_skb_tsorted_anchor_cleanup(skb);
__skb_unlink(skb, >sk_write_queue);
 }
 




Re: next: arm64: LTP sendto01 test causes system crash in ilp32 mode

2017-10-11 Thread Eric Dumazet
On Wed, 2017-10-11 at 11:41 -0700, Eric Dumazet wrote:
> On Wed, 2017-10-11 at 21:35 +0300, Yury Norov wrote:
> > Hi all, 
> > 
> > It seems like next-20171009 with ilp32 patches crashes on LTP sendto01 test
> > in sys_sendto() path, like this:
> 
> Thanks for the report.
> Probably caused by one of my recent patches, so I am taking a look.

Yes, this was silly.

Please test this fix :

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 
5a95e5886b55e03e4a8bfeac3506c657a4f97dde..15163454174babdcb465904f725b919268dd1bc7
 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1712,6 +1712,7 @@ static inline void tcp_insert_write_queue_before(struct 
sk_buff *new,
 
 static inline void tcp_unlink_write_queue(struct sk_buff *skb, struct sock *sk)
 {
+   tcp_skb_tsorted_anchor_cleanup(skb);
__skb_unlink(skb, >sk_write_queue);
 }
 




Re: next: arm64: LTP sendto01 test causes system crash in ilp32 mode

2017-10-11 Thread Eric Dumazet
On Wed, 2017-10-11 at 21:35 +0300, Yury Norov wrote:
> Hi all, 
> 
> It seems like next-20171009 with ilp32 patches crashes on LTP sendto01 test
> in sys_sendto() path, like this:

Thanks for the report.
Probably caused by one of my recent patches, so I am taking a look.





Re: next: arm64: LTP sendto01 test causes system crash in ilp32 mode

2017-10-11 Thread Eric Dumazet
On Wed, 2017-10-11 at 21:35 +0300, Yury Norov wrote:
> Hi all, 
> 
> It seems like next-20171009 with ilp32 patches crashes on LTP sendto01 test
> in sys_sendto() path, like this:

Thanks for the report.
Probably caused by one of my recent patches, so I am taking a look.





Re: [PATCH net-next 2/2] net sched act_vlan: VLAN action rewrite to use RCU lock/unlock and update

2017-10-11 Thread Eric Dumazet
On Wed, 2017-10-11 at 06:10 -0700, Eric Dumazet wrote:
> On Tue, 2017-10-10 at 22:33 -0400, Manish Kurup wrote:
> > Using a spinlock in the VLAN action causes performance issues when the VLAN
> > action is used on multiple cores. Rewrote the VLAN action to use RCU read
> > locking for reads and updates instead.
> > 
> > Signed-off-by: Manish Kurup <manish.ku...@verizon.com>
> > ---
> >  include/net/tc_act/tc_vlan.h | 21 -
> >  net/sched/act_vlan.c | 73 
> > ++--
> >  2 files changed, 63 insertions(+), 31 deletions(-)
> > 
> > diff --git a/include/net/tc_act/tc_vlan.h b/include/net/tc_act/tc_vlan.h
> > index c2090df..67fd355 100644
> > --- a/include/net/tc_act/tc_vlan.h
> > +++ b/include/net/tc_act/tc_vlan.h
> > @@ -13,12 +13,17 @@
> >  #include 
> >  #include 
> >  
> > +struct tcf_vlan_params {
> > +   struct rcu_head   rcu;
> > +   int   tcfv_action;
> > +   u16   tcfv_push_vid;
> > +   __be16tcfv_push_proto;
> > +   u8tcfv_push_prio;
> > +};
> > +
> >  struct tcf_vlan {
> > struct tc_actioncommon;
> > -   int tcfv_action;
> > -   u16 tcfv_push_vid;
> > -   __be16  tcfv_push_proto;
> > -   u8  tcfv_push_prio;
> > +   struct tcf_vlan_params __rcu *vlan_p;
> >  };
> >  #define to_vlan(a) ((struct tcf_vlan *)a)
> >  
> > @@ -33,22 +38,22 @@ static inline bool is_tcf_vlan(const struct tc_action 
> > *a)
> >  
> >  static inline u32 tcf_vlan_action(const struct tc_action *a)
> >  {
> > -   return to_vlan(a)->tcfv_action;
> > +   return to_vlan(a)->vlan_p->tcfv_action;
> 
> This is not proper RCU : ->vlan_p should be read once, and using
> rcu_dereference()
> 
> So the caller should provide to this helper the 'p' pointer instead of
> 'a'
> 
> 
> CONFIG_SPARSE_RCU_POINTER=y
> 
> and make C=2 would probably give you warnings about that.
> 


BTW no need to change your .config after :


commit 41a2901e7d220875752a8c870e0b53288a578c20
Author: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Date:   Fri May 12 15:56:35 2017 -0700

rcu: Remove SPARSE_RCU_POINTER Kconfig option

The sparse-based checking for non-RCU accesses to RCU-protected pointers
has been around for a very long time, and it is now the only type of
sparse-based checking that is optional.  This commit therefore makes
it unconditional.

Reported-by: Ingo Molnar <mi...@kernel.org>
Signed-off-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
Cc: Fengguang Wu <fengguang...@intel.com>





Re: [PATCH net-next 2/2] net sched act_vlan: VLAN action rewrite to use RCU lock/unlock and update

2017-10-11 Thread Eric Dumazet
On Wed, 2017-10-11 at 06:10 -0700, Eric Dumazet wrote:
> On Tue, 2017-10-10 at 22:33 -0400, Manish Kurup wrote:
> > Using a spinlock in the VLAN action causes performance issues when the VLAN
> > action is used on multiple cores. Rewrote the VLAN action to use RCU read
> > locking for reads and updates instead.
> > 
> > Signed-off-by: Manish Kurup 
> > ---
> >  include/net/tc_act/tc_vlan.h | 21 -
> >  net/sched/act_vlan.c | 73 
> > ++--
> >  2 files changed, 63 insertions(+), 31 deletions(-)
> > 
> > diff --git a/include/net/tc_act/tc_vlan.h b/include/net/tc_act/tc_vlan.h
> > index c2090df..67fd355 100644
> > --- a/include/net/tc_act/tc_vlan.h
> > +++ b/include/net/tc_act/tc_vlan.h
> > @@ -13,12 +13,17 @@
> >  #include 
> >  #include 
> >  
> > +struct tcf_vlan_params {
> > +   struct rcu_head   rcu;
> > +   int   tcfv_action;
> > +   u16   tcfv_push_vid;
> > +   __be16tcfv_push_proto;
> > +   u8tcfv_push_prio;
> > +};
> > +
> >  struct tcf_vlan {
> > struct tc_actioncommon;
> > -   int tcfv_action;
> > -   u16 tcfv_push_vid;
> > -   __be16  tcfv_push_proto;
> > -   u8  tcfv_push_prio;
> > +   struct tcf_vlan_params __rcu *vlan_p;
> >  };
> >  #define to_vlan(a) ((struct tcf_vlan *)a)
> >  
> > @@ -33,22 +38,22 @@ static inline bool is_tcf_vlan(const struct tc_action 
> > *a)
> >  
> >  static inline u32 tcf_vlan_action(const struct tc_action *a)
> >  {
> > -   return to_vlan(a)->tcfv_action;
> > +   return to_vlan(a)->vlan_p->tcfv_action;
> 
> This is not proper RCU : ->vlan_p should be read once, and using
> rcu_dereference()
> 
> So the caller should provide to this helper the 'p' pointer instead of
> 'a'
> 
> 
> CONFIG_SPARSE_RCU_POINTER=y
> 
> and make C=2 would probably give you warnings about that.
> 


BTW no need to change your .config after :


commit 41a2901e7d220875752a8c870e0b53288a578c20
Author: Paul E. McKenney 
Date:   Fri May 12 15:56:35 2017 -0700

rcu: Remove SPARSE_RCU_POINTER Kconfig option

The sparse-based checking for non-RCU accesses to RCU-protected pointers
has been around for a very long time, and it is now the only type of
sparse-based checking that is optional.  This commit therefore makes
it unconditional.

Reported-by: Ingo Molnar 
Signed-off-by: Paul E. McKenney 
Cc: Fengguang Wu 





Re: [PATCH net-next 2/2] net sched act_vlan: VLAN action rewrite to use RCU lock/unlock and update

2017-10-11 Thread Eric Dumazet
On Tue, 2017-10-10 at 22:33 -0400, Manish Kurup wrote:
> Using a spinlock in the VLAN action causes performance issues when the VLAN
> action is used on multiple cores. Rewrote the VLAN action to use RCU read
> locking for reads and updates instead.
> 
> Signed-off-by: Manish Kurup 
> ---
>  include/net/tc_act/tc_vlan.h | 21 -
>  net/sched/act_vlan.c | 73 
> ++--
>  2 files changed, 63 insertions(+), 31 deletions(-)
> 
> diff --git a/include/net/tc_act/tc_vlan.h b/include/net/tc_act/tc_vlan.h
> index c2090df..67fd355 100644
> --- a/include/net/tc_act/tc_vlan.h
> +++ b/include/net/tc_act/tc_vlan.h
> @@ -13,12 +13,17 @@
>  #include 
>  #include 
>  
> +struct tcf_vlan_params {
> + struct rcu_head   rcu;
> + int   tcfv_action;
> + u16   tcfv_push_vid;
> + __be16tcfv_push_proto;
> + u8tcfv_push_prio;
> +};
> +
>  struct tcf_vlan {
>   struct tc_actioncommon;
> - int tcfv_action;
> - u16 tcfv_push_vid;
> - __be16  tcfv_push_proto;
> - u8  tcfv_push_prio;
> + struct tcf_vlan_params __rcu *vlan_p;
>  };
>  #define to_vlan(a) ((struct tcf_vlan *)a)
>  
> @@ -33,22 +38,22 @@ static inline bool is_tcf_vlan(const struct tc_action *a)
>  
>  static inline u32 tcf_vlan_action(const struct tc_action *a)
>  {
> - return to_vlan(a)->tcfv_action;
> + return to_vlan(a)->vlan_p->tcfv_action;

This is not proper RCU : ->vlan_p should be read once, and using
rcu_dereference()

So the caller should provide to this helper the 'p' pointer instead of
'a'


CONFIG_SPARSE_RCU_POINTER=y

and make C=2 would probably give you warnings about that.






Re: [PATCH net-next 2/2] net sched act_vlan: VLAN action rewrite to use RCU lock/unlock and update

2017-10-11 Thread Eric Dumazet
On Tue, 2017-10-10 at 22:33 -0400, Manish Kurup wrote:
> Using a spinlock in the VLAN action causes performance issues when the VLAN
> action is used on multiple cores. Rewrote the VLAN action to use RCU read
> locking for reads and updates instead.
> 
> Signed-off-by: Manish Kurup 
> ---
>  include/net/tc_act/tc_vlan.h | 21 -
>  net/sched/act_vlan.c | 73 
> ++--
>  2 files changed, 63 insertions(+), 31 deletions(-)
> 
> diff --git a/include/net/tc_act/tc_vlan.h b/include/net/tc_act/tc_vlan.h
> index c2090df..67fd355 100644
> --- a/include/net/tc_act/tc_vlan.h
> +++ b/include/net/tc_act/tc_vlan.h
> @@ -13,12 +13,17 @@
>  #include 
>  #include 
>  
> +struct tcf_vlan_params {
> + struct rcu_head   rcu;
> + int   tcfv_action;
> + u16   tcfv_push_vid;
> + __be16tcfv_push_proto;
> + u8tcfv_push_prio;
> +};
> +
>  struct tcf_vlan {
>   struct tc_actioncommon;
> - int tcfv_action;
> - u16 tcfv_push_vid;
> - __be16  tcfv_push_proto;
> - u8  tcfv_push_prio;
> + struct tcf_vlan_params __rcu *vlan_p;
>  };
>  #define to_vlan(a) ((struct tcf_vlan *)a)
>  
> @@ -33,22 +38,22 @@ static inline bool is_tcf_vlan(const struct tc_action *a)
>  
>  static inline u32 tcf_vlan_action(const struct tc_action *a)
>  {
> - return to_vlan(a)->tcfv_action;
> + return to_vlan(a)->vlan_p->tcfv_action;

This is not proper RCU : ->vlan_p should be read once, and using
rcu_dereference()

So the caller should provide to this helper the 'p' pointer instead of
'a'


CONFIG_SPARSE_RCU_POINTER=y

and make C=2 would probably give you warnings about that.






Re: [PATCH][net-next] ipv6: fix dereference of rt6_ex before null check error

2017-10-10 Thread Eric Dumazet
On Tue, 2017-10-10 at 18:01 +0100, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
> 
> Currently rt6_ex is being dereferenced before it is null checked
> hence there is a possible null dereference bug. Fix this by only
> dereferencing rt6_ex after it has been null checked.
> 
> Detected by CoverityScan, CID#1457749 ("Dereference before null check")
> 
> Fixes: 81eb8447daae ("ipv6: take care of rt6_stats")
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
> ---

Reviewed-by: Eric Dumazet <eduma...@google.com>




Re: [PATCH][net-next] ipv6: fix dereference of rt6_ex before null check error

2017-10-10 Thread Eric Dumazet
On Tue, 2017-10-10 at 18:01 +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Currently rt6_ex is being dereferenced before it is null checked
> hence there is a possible null dereference bug. Fix this by only
> dereferencing rt6_ex after it has been null checked.
> 
> Detected by CoverityScan, CID#1457749 ("Dereference before null check")
> 
> Fixes: 81eb8447daae ("ipv6: take care of rt6_stats")
> Signed-off-by: Colin Ian King 
> ---

Reviewed-by: Eric Dumazet 




Re: [lkp-robot] [ipv6] a94b9367e0: BUG:using_smp_processor_id()in_preemptible

2017-10-10 Thread Eric Dumazet
On Tue, Oct 10, 2017 at 5:13 AM, kernel test robot
 wrote:
>
> FYI, we noticed the following commit (built with gcc-4.9):
>
> commit: a94b9367e044ba672c9f4105eb1516ff6ff4948a ("ipv6: grab rt->rt6i_ref 
> before allocating pcpu rt")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -smp 2 -m 512M
>
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
>
>
> ++++
> || 2b760fcf5c | a94b9367e0 |
> ++++
> | boot_successes | 8  | 4  |
> | boot_failures  | 0  | 4  |
> | BUG:using_smp_processor_id()in_preemptible | 0  | 4  |
> ++++
>
>
>
> [   28.927206] BUG: using smp_processor_id() in preemptible [] code: 
> odhcpd/1011
> [   28.928424] caller is debug_smp_processor_id+0x17/0x19
> [   28.929085] CPU: 1 PID: 1011 Comm: odhcpd Not tainted 
> 4.14.0-rc3-00908-ga94b936 #1
> [   28.930196] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [   28.931421] Call Trace:
> [   28.931765]  dump_stack+0x85/0xbe
> [   28.932244]  check_preemption_disabled+0xdb/0xed
> [   28.932845]  debug_smp_processor_id+0x17/0x19
> [   28.933421]  ip6_pol_route+0x541/0x607
> [   28.933993]  ip6_pol_route_output+0x11/0x13
> [   28.934547]  fib6_rule_action+0xbf/0x1ff
> [   28.935088]  ? ip6_pol_route_input+0x17/0x17
> [   28.935725]  fib_rules_lookup+0x18e/0x245
> [   28.936255]  ? fib_rules_lookup+0x18e/0x245
> [   28.936819]  ? ip6_pol_route_input+0x17/0x17
> [   28.937404]  fib6_rule_lookup+0x5a/0xe9
> [   28.937944]  ? ip6_pol_route_input+0x17/0x17
> [   28.938579]  ip6_route_output_flags+0xd0/0xdc
> [   28.939147]  ip6_dst_lookup_tail+0x57/0x1aa
> [   28.939748]  ip6_dst_lookup_flow+0x33/0x73
> [   28.940347]  ip6_datagram_dst_update+0x25b/0x4c2
> [   28.940947]  ? save_stack_trace+0x16/0x18
> [   28.941697]  __ip6_datagram_connect+0x218/0x2ba
> [   28.942286]  ? __ip6_datagram_connect+0x218/0x2ba
> [   28.942897]  ? __local_bh_enable_ip+0x87/0xa8
> [   28.943471]  ? lock_sock_nested+0x42/0x90
> [   28.944080]  ip6_datagram_connect+0x26/0x3a
> [   28.944634]  ip6_datagram_connect_v6_only+0x14/0x16
> [   28.945296]  inet_dgram_connect+0x4a/0x64
> [   28.945815]  SyS_connect+0x7e/0xbf
> [   28.946373]  ? __might_fault+0x79/0x7f
> [   28.946934]  compat_SyS_socketcall+0x113/0x219
> [   28.947797]  ? do_int80_syscall_32+0x22/0x1c1
> [   28.948438]  ? trace_hardirqs_on_caller+0x174/0x190
> [   28.949063]  do_int80_syscall_32+0x74/0x1c1
> [   28.949664]  entry_INT80_compat+0x32/0x50
> [   28.950269] RIP: 0023:0xf7ef2384
> [   28.950735] RSP: 002b:ffa89318 EFLAGS: 0296 ORIG_RAX: 
> 0066
> [   28.951736] RAX: ffda RBX: 0003 RCX: 
> ffa89328
> [   28.952670] RDX: f7f38000 RSI: ffa89328 RDI: 
> ffa89440
> [   28.953572] RBP: ffa89388 R08:  R09: 
> 
> [   28.954538] R10:  R11:  R12: 
> 
> [   28.955533] R13:  R14:  R15: 
> 
>
> Elapsed time: 30
>
> initrds=(
> /osimage/openwrt/openwrt-i386-2016-03-16.cgz
> 
> /lkp/scheduled/vm-lkp-nex04-openwrt-ia32-27/boot-1-openwrt-i386-2016-03-16.cgz-a94b9367e044ba672c9f4105eb1516ff6ff4948a-20171009-12855-139piyy-0.cgz
> /lkp/lkp/lkp-i386.cgz
> )
>
> cat "${initrds[@]}" > /fs/sdb1/initrd-vm-lkp-nex04-openwrt-ia32-27
>
> kvm=(
> qemu-system-x86_64
> -enable-kvm
> -kernel 
> /pkg/linux/x86_64-randconfig-h0-10091308/gcc-4.9/a94b9367e044ba672c9f4105eb1516ff6ff4948a/vmlinuz-4.14.0-rc3-00908-ga94b936
> -initrd /fs/sdb1/initrd-vm-lkp-nex04-openwrt-ia32-27
> -m 512
> -smp 2
> -device e1000,netdev=net0
> -netdev user,id=net0,hostfwd=tcp::23296-:22
> -boot order=nc
> -no-reboot
> -watchdog i6300esb
> -watchdog-action debug
> -rtc base=localtime
> -pidfile /dev/shm/kboot/pid-vm-lkp-nex04-openwrt-ia32-27
> -serial file:/dev/shm/kboot/vm-lkp-nex04-openwrt-ia32-27/serial
> -serial file:/dev/shm/kboot/vm-lkp-nex04-openwrt-ia32-27/kmsg
> -daemonize
> -display none
> -monitor null
> )
>
> append=(
> ip=vm-lkp-nex04-openwrt-ia32-27::dhcp
> root=/dev/ram0
> user=lkp
> 
> job=/lkp/scheduled/vm-lkp-nex04-openwrt-ia32-27/boot-1-openwrt-i386-2016-03-16.cgz-a94b9367e044ba672c9f4105eb1516ff6ff4948a-20171009-12855-139piyy-0.yaml
> ARCH=x86_64
> 

Re: [lkp-robot] [ipv6] a94b9367e0: BUG:using_smp_processor_id()in_preemptible

2017-10-10 Thread Eric Dumazet
On Tue, Oct 10, 2017 at 5:13 AM, kernel test robot
 wrote:
>
> FYI, we noticed the following commit (built with gcc-4.9):
>
> commit: a94b9367e044ba672c9f4105eb1516ff6ff4948a ("ipv6: grab rt->rt6i_ref 
> before allocating pcpu rt")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -smp 2 -m 512M
>
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
>
>
> ++++
> || 2b760fcf5c | a94b9367e0 |
> ++++
> | boot_successes | 8  | 4  |
> | boot_failures  | 0  | 4  |
> | BUG:using_smp_processor_id()in_preemptible | 0  | 4  |
> ++++
>
>
>
> [   28.927206] BUG: using smp_processor_id() in preemptible [] code: 
> odhcpd/1011
> [   28.928424] caller is debug_smp_processor_id+0x17/0x19
> [   28.929085] CPU: 1 PID: 1011 Comm: odhcpd Not tainted 
> 4.14.0-rc3-00908-ga94b936 #1
> [   28.930196] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [   28.931421] Call Trace:
> [   28.931765]  dump_stack+0x85/0xbe
> [   28.932244]  check_preemption_disabled+0xdb/0xed
> [   28.932845]  debug_smp_processor_id+0x17/0x19
> [   28.933421]  ip6_pol_route+0x541/0x607
> [   28.933993]  ip6_pol_route_output+0x11/0x13
> [   28.934547]  fib6_rule_action+0xbf/0x1ff
> [   28.935088]  ? ip6_pol_route_input+0x17/0x17
> [   28.935725]  fib_rules_lookup+0x18e/0x245
> [   28.936255]  ? fib_rules_lookup+0x18e/0x245
> [   28.936819]  ? ip6_pol_route_input+0x17/0x17
> [   28.937404]  fib6_rule_lookup+0x5a/0xe9
> [   28.937944]  ? ip6_pol_route_input+0x17/0x17
> [   28.938579]  ip6_route_output_flags+0xd0/0xdc
> [   28.939147]  ip6_dst_lookup_tail+0x57/0x1aa
> [   28.939748]  ip6_dst_lookup_flow+0x33/0x73
> [   28.940347]  ip6_datagram_dst_update+0x25b/0x4c2
> [   28.940947]  ? save_stack_trace+0x16/0x18
> [   28.941697]  __ip6_datagram_connect+0x218/0x2ba
> [   28.942286]  ? __ip6_datagram_connect+0x218/0x2ba
> [   28.942897]  ? __local_bh_enable_ip+0x87/0xa8
> [   28.943471]  ? lock_sock_nested+0x42/0x90
> [   28.944080]  ip6_datagram_connect+0x26/0x3a
> [   28.944634]  ip6_datagram_connect_v6_only+0x14/0x16
> [   28.945296]  inet_dgram_connect+0x4a/0x64
> [   28.945815]  SyS_connect+0x7e/0xbf
> [   28.946373]  ? __might_fault+0x79/0x7f
> [   28.946934]  compat_SyS_socketcall+0x113/0x219
> [   28.947797]  ? do_int80_syscall_32+0x22/0x1c1
> [   28.948438]  ? trace_hardirqs_on_caller+0x174/0x190
> [   28.949063]  do_int80_syscall_32+0x74/0x1c1
> [   28.949664]  entry_INT80_compat+0x32/0x50
> [   28.950269] RIP: 0023:0xf7ef2384
> [   28.950735] RSP: 002b:ffa89318 EFLAGS: 0296 ORIG_RAX: 
> 0066
> [   28.951736] RAX: ffda RBX: 0003 RCX: 
> ffa89328
> [   28.952670] RDX: f7f38000 RSI: ffa89328 RDI: 
> ffa89440
> [   28.953572] RBP: ffa89388 R08:  R09: 
> 
> [   28.954538] R10:  R11:  R12: 
> 
> [   28.955533] R13:  R14:  R15: 
> 
>
> Elapsed time: 30
>
> initrds=(
> /osimage/openwrt/openwrt-i386-2016-03-16.cgz
> 
> /lkp/scheduled/vm-lkp-nex04-openwrt-ia32-27/boot-1-openwrt-i386-2016-03-16.cgz-a94b9367e044ba672c9f4105eb1516ff6ff4948a-20171009-12855-139piyy-0.cgz
> /lkp/lkp/lkp-i386.cgz
> )
>
> cat "${initrds[@]}" > /fs/sdb1/initrd-vm-lkp-nex04-openwrt-ia32-27
>
> kvm=(
> qemu-system-x86_64
> -enable-kvm
> -kernel 
> /pkg/linux/x86_64-randconfig-h0-10091308/gcc-4.9/a94b9367e044ba672c9f4105eb1516ff6ff4948a/vmlinuz-4.14.0-rc3-00908-ga94b936
> -initrd /fs/sdb1/initrd-vm-lkp-nex04-openwrt-ia32-27
> -m 512
> -smp 2
> -device e1000,netdev=net0
> -netdev user,id=net0,hostfwd=tcp::23296-:22
> -boot order=nc
> -no-reboot
> -watchdog i6300esb
> -watchdog-action debug
> -rtc base=localtime
> -pidfile /dev/shm/kboot/pid-vm-lkp-nex04-openwrt-ia32-27
> -serial file:/dev/shm/kboot/vm-lkp-nex04-openwrt-ia32-27/serial
> -serial file:/dev/shm/kboot/vm-lkp-nex04-openwrt-ia32-27/kmsg
> -daemonize
> -display none
> -monitor null
> )
>
> append=(
> ip=vm-lkp-nex04-openwrt-ia32-27::dhcp
> root=/dev/ram0
> user=lkp
> 
> job=/lkp/scheduled/vm-lkp-nex04-openwrt-ia32-27/boot-1-openwrt-i386-2016-03-16.cgz-a94b9367e044ba672c9f4105eb1516ff6ff4948a-20171009-12855-139piyy-0.yaml
> ARCH=x86_64
> kconfig=x86_64-randconfig-h0-10091308

Re: [PATCH 4/4] tcp: avoid noref dst leak on input path

2017-10-06 Thread Eric Dumazet
On Fri, Oct 6, 2017 at 8:21 AM, Paolo Abeni <pab...@redhat.com> wrote:
> Hi,
>
> On Fri, 2017-10-06 at 07:37 -0700, Eric Dumazet wrote:
>> On Fri, 2017-10-06 at 14:57 +0200, Paolo Abeni wrote:
>> > Enabling CONFIG_RCU_NOREF_DEBUG gives the following splat when
>> > processing tcp packets:
>> >
>> >to-be-untracked noref entity 942cb71ea300 not found in cache
>> >[ cut here ]
>> >WARNING: CPU: 24 PID: 178 at kernel/rcu/noref_debug.c:54 
>> > rcu_track_noref+0xa4/0xf0
>> >Modules linked in: intel_rapl sb_edac x86_pkg_temp_thermal 
>> > intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul 
>> > crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper 
>> > cryptd iTCO_wdt ipmi_ssif mei_me iTCO_vendor_support mei dcdbas lpc_ich 
>> > ipmi_si mxm_wmi sg pcspkr ipmi_devintf ipmi_msghandler acpi_power_meter 
>> > shpchp wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs 
>> > libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt 
>> > fb_sys_fops ttm igb drm ixgbe mdio crc32c_intel ahci ptp i2c_algo_bit 
>> > libahci pps_core i2c_core libata dca dm_mirror dm_region_hash dm_log dm_mod
>> >CPU: 24 PID: 178 Comm: ksoftirqd/24 Not tainted 
>> > 4.14.0-rc1.noref_route+ #1610
>> >Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 
>> > 01/17/2017
>> >task: 940e4830 task.stack: aec406a2
>> >RIP: 0010:rcu_track_noref+0xa4/0xf0
>> >RSP: 0018:aec406a238e0 EFLAGS: 00010246
>> >RAX: 0040 RBX:  RCX: 
>> > 0002
>> >RDX:  RSI:  RDI: 
>> > 0292
>> >RBP: aec406a238e0 R08:  R09: 
>> > 

> Thank you for the feedback.
>
> I most probably messed-up while extracting the info from dmsg, as this
> issue gives a couple of splats almost concurrently. Please let me re-do
> the test and post a more resonable dmsg.
>
> The problem with the current code is that in the tcp_rcv_established()
> -> tcp_queue_rcv() path, the skb_dst() is not cleared.
>

In any case, I would rather put one skb_dst_drop() right after the
last possible use of skb dst in TCP stack,
probably after sk_rx_dst_set() call.

Trying to move it in multiple places has been error prone, even if
current code is not buggy.


Re: [PATCH 4/4] tcp: avoid noref dst leak on input path

2017-10-06 Thread Eric Dumazet
On Fri, Oct 6, 2017 at 8:21 AM, Paolo Abeni  wrote:
> Hi,
>
> On Fri, 2017-10-06 at 07:37 -0700, Eric Dumazet wrote:
>> On Fri, 2017-10-06 at 14:57 +0200, Paolo Abeni wrote:
>> > Enabling CONFIG_RCU_NOREF_DEBUG gives the following splat when
>> > processing tcp packets:
>> >
>> >to-be-untracked noref entity 942cb71ea300 not found in cache
>> >[ cut here ]
>> >WARNING: CPU: 24 PID: 178 at kernel/rcu/noref_debug.c:54 
>> > rcu_track_noref+0xa4/0xf0
>> >Modules linked in: intel_rapl sb_edac x86_pkg_temp_thermal 
>> > intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul 
>> > crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper 
>> > cryptd iTCO_wdt ipmi_ssif mei_me iTCO_vendor_support mei dcdbas lpc_ich 
>> > ipmi_si mxm_wmi sg pcspkr ipmi_devintf ipmi_msghandler acpi_power_meter 
>> > shpchp wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs 
>> > libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt 
>> > fb_sys_fops ttm igb drm ixgbe mdio crc32c_intel ahci ptp i2c_algo_bit 
>> > libahci pps_core i2c_core libata dca dm_mirror dm_region_hash dm_log dm_mod
>> >CPU: 24 PID: 178 Comm: ksoftirqd/24 Not tainted 
>> > 4.14.0-rc1.noref_route+ #1610
>> >Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 
>> > 01/17/2017
>> >task: 940e4830 task.stack: aec406a2
>> >RIP: 0010:rcu_track_noref+0xa4/0xf0
>> >RSP: 0018:aec406a238e0 EFLAGS: 00010246
>> >RAX: 0040 RBX:  RCX: 
>> > 0002
>> >RDX:  RSI:  RDI: 
>> > 0292
>> >RBP: aec406a238e0 R08:  R09: 
>> > 

> Thank you for the feedback.
>
> I most probably messed-up while extracting the info from dmsg, as this
> issue gives a couple of splats almost concurrently. Please let me re-do
> the test and post a more resonable dmsg.
>
> The problem with the current code is that in the tcp_rcv_established()
> -> tcp_queue_rcv() path, the skb_dst() is not cleared.
>

In any case, I would rather put one skb_dst_drop() right after the
last possible use of skb dst in TCP stack,
probably after sk_rx_dst_set() call.

Trying to move it in multiple places has been error prone, even if
current code is not buggy.


Re: [PATCH 4/4] tcp: avoid noref dst leak on input path

2017-10-06 Thread Eric Dumazet
On Fri, 2017-10-06 at 14:57 +0200, Paolo Abeni wrote:
> Enabling CONFIG_RCU_NOREF_DEBUG gives the following splat when
> processing tcp packets:
> 
>to-be-untracked noref entity 942cb71ea300 not found in cache
>[ cut here ]
>WARNING: CPU: 24 PID: 178 at kernel/rcu/noref_debug.c:54 
> rcu_track_noref+0xa4/0xf0
>Modules linked in: intel_rapl sb_edac x86_pkg_temp_thermal 
> intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul 
> crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper 
> cryptd iTCO_wdt ipmi_ssif mei_me iTCO_vendor_support mei dcdbas lpc_ich 
> ipmi_si mxm_wmi sg pcspkr ipmi_devintf ipmi_msghandler acpi_power_meter 
> shpchp wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs 
> libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt 
> fb_sys_fops ttm igb drm ixgbe mdio crc32c_intel ahci ptp i2c_algo_bit libahci 
> pps_core i2c_core libata dca dm_mirror dm_region_hash dm_log dm_mod
>CPU: 24 PID: 178 Comm: ksoftirqd/24 Not tainted 
> 4.14.0-rc1.noref_route+ #1610
>Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 
> 01/17/2017
>task: 940e4830 task.stack: aec406a2
>RIP: 0010:rcu_track_noref+0xa4/0xf0
>RSP: 0018:aec406a238e0 EFLAGS: 00010246
>RAX: 0040 RBX:  RCX: 0002
>RDX:  RSI:  RDI: 0292
>RBP: aec406a238e0 R08:  R09: 
>R10: 0001 R11: 0003 R12: 
>R13: 942cb511 R14: fe88 R15: 942cb1a20200
>FS:  () GS:942cbee0() 
> knlGS:
>CS:  0010 DS:  ES:  CR0: 80050033
>CR2: 7febc072d140 CR3: 001feebd6002 CR4: 003606e0
>DR0:  DR1:  DR2: 
>DR3:  DR6: fffe0ff0 DR7: 0400
>Call Trace:
> tcp_data_queue+0x82a/0xce0

That is strange, since tcp_data_queue() starts with

skb_dst_drop(skb); 

So this stack trace looks suspicious.

> tcp_rcv_established+0x283/0x570
> tcp_v4_do_rcv+0x102/0x1e0
> tcp_v4_rcv+0xa9e/0xc10
> ip_local_deliver_finish+0x128/0x380
> ? ip_local_deliver_finish+0x41/0x380
> ip_local_deliver+0x74/0x230
> ip_rcv_finish+0x105/0x5e0
> ip_rcv+0x2a7/0x540
> __netif_receive_skb_core+0x3b9/0xe10
> ? netif_receive_skb_internal+0x40/0x390
> __netif_receive_skb+0x18/0x60
> netif_receive_skb_internal+0x8d/0x390
> ? netif_receive_skb_internal+0x40/0x390
> napi_gro_complete+0x127/0x1d0
> ? napi_gro_complete+0x2a/0x1d0
> napi_gro_flush+0x5f/0x80
> napi_complete_done+0x96/0x100
> ixgbe_poll+0x5f8/0x7c0 [ixgbe]
> net_rx_action+0x27d/0x520
> __do_softirq+0xd1/0x4f5
> run_ksoftirqd+0x25/0x70
> smpboot_thread_fn+0x11a/0x1f0
> kthread+0x155/0x190
> ? sort_range+0x30/0x30
> ? kthread_create_on_node+0x70/0x70
> ret_from_fork+0x2a/0x40
>Code: e9 83 c2 01 48 83 c0 18 83 fa 07 75 ef 80 3d b0 e5 ff 00 00 
> 75 d2 48 c7 c7 50 54 07 9d 31 c0 c6 05 9e e5 ff 00 01 e8 ef af fe ff <0f> ff 
> 5d c3 80 3d 8f e5 ff 00 00 75 b0 48 c7 c7 00 54 07 9d 31
> 
> we must clear the skb dst before enqueuing the skb somewhere,
> but currently the dst entry for packets delivered
> via tcp_rcv_established() -> tcp_queue_rcv() is not cleared.
> 
> Fix it by adding the explicit drop in tcp_queue_rcv() and moving
> the current skb_dst_drop() just before the other enqueuing
> operation, do avoid unneeded double skb_dst_drop() for some
> path.
> 
> The leak itself is not harmful, because the tcp recvmsg() code
> should not access such info.
> 
> Signed-off-by: Paolo Abeni 
> ---
>  net/ipv4/tcp_input.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index c5d7656b..bf4e17edfe7a 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4422,6 +4422,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
> sk_buff *skb)
>   return;
>   }
>  
> + /* drop the -possibly noref - dst before delivery the skb to ofo tree */
> + skb_dst_drop(skb);
> +
>   /* Stash tstamp to avoid being stomped on by rbnode */
>   if (TCP_SKB_CB(skb)->has_rxtstamp)
>   TCP_SKB_CB(skb)->swtstamp = skb->tstamp;
> @@ -4560,6 +4563,7 @@ static int __must_check tcp_queue_rcv(struct sock *sk, 
> 

Re: [PATCH 4/4] tcp: avoid noref dst leak on input path

2017-10-06 Thread Eric Dumazet
On Fri, 2017-10-06 at 14:57 +0200, Paolo Abeni wrote:
> Enabling CONFIG_RCU_NOREF_DEBUG gives the following splat when
> processing tcp packets:
> 
>to-be-untracked noref entity 942cb71ea300 not found in cache
>[ cut here ]
>WARNING: CPU: 24 PID: 178 at kernel/rcu/noref_debug.c:54 
> rcu_track_noref+0xa4/0xf0
>Modules linked in: intel_rapl sb_edac x86_pkg_temp_thermal 
> intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul 
> crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper 
> cryptd iTCO_wdt ipmi_ssif mei_me iTCO_vendor_support mei dcdbas lpc_ich 
> ipmi_si mxm_wmi sg pcspkr ipmi_devintf ipmi_msghandler acpi_power_meter 
> shpchp wmi nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs 
> libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt 
> fb_sys_fops ttm igb drm ixgbe mdio crc32c_intel ahci ptp i2c_algo_bit libahci 
> pps_core i2c_core libata dca dm_mirror dm_region_hash dm_log dm_mod
>CPU: 24 PID: 178 Comm: ksoftirqd/24 Not tainted 
> 4.14.0-rc1.noref_route+ #1610
>Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 
> 01/17/2017
>task: 940e4830 task.stack: aec406a2
>RIP: 0010:rcu_track_noref+0xa4/0xf0
>RSP: 0018:aec406a238e0 EFLAGS: 00010246
>RAX: 0040 RBX:  RCX: 0002
>RDX:  RSI:  RDI: 0292
>RBP: aec406a238e0 R08:  R09: 
>R10: 0001 R11: 0003 R12: 
>R13: 942cb511 R14: fe88 R15: 942cb1a20200
>FS:  () GS:942cbee0() 
> knlGS:
>CS:  0010 DS:  ES:  CR0: 80050033
>CR2: 7febc072d140 CR3: 001feebd6002 CR4: 003606e0
>DR0:  DR1:  DR2: 
>DR3:  DR6: fffe0ff0 DR7: 0400
>Call Trace:
> tcp_data_queue+0x82a/0xce0

That is strange, since tcp_data_queue() starts with

skb_dst_drop(skb); 

So this stack trace looks suspicious.

> tcp_rcv_established+0x283/0x570
> tcp_v4_do_rcv+0x102/0x1e0
> tcp_v4_rcv+0xa9e/0xc10
> ip_local_deliver_finish+0x128/0x380
> ? ip_local_deliver_finish+0x41/0x380
> ip_local_deliver+0x74/0x230
> ip_rcv_finish+0x105/0x5e0
> ip_rcv+0x2a7/0x540
> __netif_receive_skb_core+0x3b9/0xe10
> ? netif_receive_skb_internal+0x40/0x390
> __netif_receive_skb+0x18/0x60
> netif_receive_skb_internal+0x8d/0x390
> ? netif_receive_skb_internal+0x40/0x390
> napi_gro_complete+0x127/0x1d0
> ? napi_gro_complete+0x2a/0x1d0
> napi_gro_flush+0x5f/0x80
> napi_complete_done+0x96/0x100
> ixgbe_poll+0x5f8/0x7c0 [ixgbe]
> net_rx_action+0x27d/0x520
> __do_softirq+0xd1/0x4f5
> run_ksoftirqd+0x25/0x70
> smpboot_thread_fn+0x11a/0x1f0
> kthread+0x155/0x190
> ? sort_range+0x30/0x30
> ? kthread_create_on_node+0x70/0x70
> ret_from_fork+0x2a/0x40
>Code: e9 83 c2 01 48 83 c0 18 83 fa 07 75 ef 80 3d b0 e5 ff 00 00 
> 75 d2 48 c7 c7 50 54 07 9d 31 c0 c6 05 9e e5 ff 00 01 e8 ef af fe ff <0f> ff 
> 5d c3 80 3d 8f e5 ff 00 00 75 b0 48 c7 c7 00 54 07 9d 31
> 
> we must clear the skb dst before enqueuing the skb somewhere,
> but currently the dst entry for packets delivered
> via tcp_rcv_established() -> tcp_queue_rcv() is not cleared.
> 
> Fix it by adding the explicit drop in tcp_queue_rcv() and moving
> the current skb_dst_drop() just before the other enqueuing
> operation, do avoid unneeded double skb_dst_drop() for some
> path.
> 
> The leak itself is not harmful, because the tcp recvmsg() code
> should not access such info.
> 
> Signed-off-by: Paolo Abeni 
> ---
>  net/ipv4/tcp_input.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index c5d7656b..bf4e17edfe7a 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -4422,6 +4422,9 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
> sk_buff *skb)
>   return;
>   }
>  
> + /* drop the -possibly noref - dst before delivery the skb to ofo tree */
> + skb_dst_drop(skb);
> +
>   /* Stash tstamp to avoid being stomped on by rbnode */
>   if (TCP_SKB_CB(skb)->has_rxtstamp)
>   TCP_SKB_CB(skb)->swtstamp = skb->tstamp;
> @@ -4560,6 +4563,7 @@ static int __must_check tcp_queue_rcv(struct sock *sk, 
> struct sk_buff *skb, 

Re: [PATCH 2/2] vfs: grab the lock instead of blocking in __fd_install during resizing

2017-10-04 Thread Eric Dumazet
On Wed, Oct 4, 2017 at 7:58 AM, Matthew Wilcox <wi...@infradead.org> wrote:
> On Wed, Oct 04, 2017 at 07:00:40AM -0700, Eric Dumazet wrote:
>> On Wed, Oct 4, 2017 at 6:55 AM, Matthew Wilcox <wi...@infradead.org> wrote:
>> > Any chance you could review the patches from Sandhya that make this entire
>> > codepath obsolete?
>> >
>> > https://lkml.org/lkml/2017/4/29/20
>> >
>>
>> Hmm...
>>
>> 18 files changed, 578 insertions(+), 585 deletions(-)
>>
>> Frankly I need to be convinced with solid performance numbers before I
>> am taking a look at this series.
>
> I was hoping you'd help us get some solid performance numbers ... you
> seem to have workloads available to you that help find weaknesses in
> implementations.
>
> The number of lines inserted is a bit of a red herring.  Over 100 are in
> the test suite (you surely aren't going to review those) and another ~300
> are adding enhancements to the IDR & radix tree that should be useful
> for other users (eg I think I have a way to speed up writing out dirty
> pages by using get_tag_batch()).
>
>> I do not believe an IDR will be faster than current implementation, so
>> I am not quite convinced at this moment.
>
> I don't think it should be significantly different in performance.  Let's
> look at the layout of data for a typical bash shell (fds 0-2 and 255 open).
>
> Current implementation:
>
> files_struct -> fdt -> fd -> struct file
>
> IDR:
>
> files_struct -> radix node (shift 6) -> radix node (shift 0) -> struct file
>
> In either case, it's the same number of dependent loads.  It'll start
> to look worse for the radix tree above 4096 open fds in a given process.

I am interested in performance for process with 10,000,000 fds, and
~100 threads constantly adding/deleting/using fds.

Thanks


Re: [PATCH 2/2] vfs: grab the lock instead of blocking in __fd_install during resizing

2017-10-04 Thread Eric Dumazet
On Wed, Oct 4, 2017 at 7:58 AM, Matthew Wilcox  wrote:
> On Wed, Oct 04, 2017 at 07:00:40AM -0700, Eric Dumazet wrote:
>> On Wed, Oct 4, 2017 at 6:55 AM, Matthew Wilcox  wrote:
>> > Any chance you could review the patches from Sandhya that make this entire
>> > codepath obsolete?
>> >
>> > https://lkml.org/lkml/2017/4/29/20
>> >
>>
>> Hmm...
>>
>> 18 files changed, 578 insertions(+), 585 deletions(-)
>>
>> Frankly I need to be convinced with solid performance numbers before I
>> am taking a look at this series.
>
> I was hoping you'd help us get some solid performance numbers ... you
> seem to have workloads available to you that help find weaknesses in
> implementations.
>
> The number of lines inserted is a bit of a red herring.  Over 100 are in
> the test suite (you surely aren't going to review those) and another ~300
> are adding enhancements to the IDR & radix tree that should be useful
> for other users (eg I think I have a way to speed up writing out dirty
> pages by using get_tag_batch()).
>
>> I do not believe an IDR will be faster than current implementation, so
>> I am not quite convinced at this moment.
>
> I don't think it should be significantly different in performance.  Let's
> look at the layout of data for a typical bash shell (fds 0-2 and 255 open).
>
> Current implementation:
>
> files_struct -> fdt -> fd -> struct file
>
> IDR:
>
> files_struct -> radix node (shift 6) -> radix node (shift 0) -> struct file
>
> In either case, it's the same number of dependent loads.  It'll start
> to look worse for the radix tree above 4096 open fds in a given process.

I am interested in performance for process with 10,000,000 fds, and
~100 threads constantly adding/deleting/using fds.

Thanks


Re: [PATCH 2/2] vfs: grab the lock instead of blocking in __fd_install during resizing

2017-10-04 Thread Eric Dumazet
On Wed, Oct 4, 2017 at 6:55 AM, Matthew Wilcox <wi...@infradead.org> wrote:
> On Tue, Oct 03, 2017 at 07:41:11AM -0700, Eric Dumazet wrote:
>> On Tue, Oct 3, 2017 at 3:58 AM, Mateusz Guzik <mgu...@redhat.com> wrote:
>> > Explicit locking in the fallback case provides a safe state of the
>> > table. Getting rid of blocking semantics makes __fd_install usable
>> > again in non-sleepable contexts, which easies backporting efforts.
>> >
>> > There is a side effect of slightly nicer assembly for the common case
>> > as might_sleep can now be removed.
>> >
>> > Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
>> > ---
>> >  Documentation/filesystems/porting |  4 
>> >  fs/file.c         | 11 +++
>> >  2 files changed, 7 insertions(+), 8 deletions(-)
>>
>> Nice change !
>>
>> Reviewed-by: Eric Dumazet <eduma...@google.com>
>
> Hey Eric,
>
> Any chance you could review the patches from Sandhya that make this entire
> codepath obsolete?
>
> https://lkml.org/lkml/2017/4/29/20
>

Hmm...

18 files changed, 578 insertions(+), 585 deletions(-)

Frankly I need to be convinced with solid performance numbers before I
am taking a look at this series.

I do not believe an IDR will be faster than current implementation, so
I am not quite convinced at this moment.

Thanks.


Re: [PATCH 2/2] vfs: grab the lock instead of blocking in __fd_install during resizing

2017-10-04 Thread Eric Dumazet
On Wed, Oct 4, 2017 at 6:55 AM, Matthew Wilcox  wrote:
> On Tue, Oct 03, 2017 at 07:41:11AM -0700, Eric Dumazet wrote:
>> On Tue, Oct 3, 2017 at 3:58 AM, Mateusz Guzik  wrote:
>> > Explicit locking in the fallback case provides a safe state of the
>> > table. Getting rid of blocking semantics makes __fd_install usable
>> > again in non-sleepable contexts, which easies backporting efforts.
>> >
>> > There is a side effect of slightly nicer assembly for the common case
>> > as might_sleep can now be removed.
>> >
>> > Signed-off-by: Mateusz Guzik 
>> > ---
>> >  Documentation/filesystems/porting |  4 
>> >  fs/file.c | 11 +++----
>> >  2 files changed, 7 insertions(+), 8 deletions(-)
>>
>> Nice change !
>>
>> Reviewed-by: Eric Dumazet 
>
> Hey Eric,
>
> Any chance you could review the patches from Sandhya that make this entire
> codepath obsolete?
>
> https://lkml.org/lkml/2017/4/29/20
>

Hmm...

18 files changed, 578 insertions(+), 585 deletions(-)

Frankly I need to be convinced with solid performance numbers before I
am taking a look at this series.

I do not believe an IDR will be faster than current implementation, so
I am not quite convinced at this moment.

Thanks.


Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-03 Thread Eric Dumazet
On Tue, Oct 3, 2017 at 9:06 AM, Dmitry Vyukov <dvyu...@google.com> wrote:
> On Tue, Oct 3, 2017 at 5:38 PM, 'Eric Dumazet' via syzkaller
> <syzkal...@googlegroups.com> wrote:
>> On Tue, Oct 3, 2017 at 8:19 AM, Dmitry Vyukov <dvyu...@google.com> wrote:
>>> On Mon, Oct 2, 2017 at 4:42 PM, 'Eric Dumazet' via syzkaller
>>> <syzkal...@googlegroups.com> wrote:
>>>> On Mon, Oct 2, 2017 at 7:21 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>>>>> Hi Eric,
>>>>>
>>>>> On Mon, Oct 02, 2017 at 06:36:32AM -0700, Eric Dumazet wrote:
>>>>>> On Mon, Oct 2, 2017 at 3:49 AM, Mark Rutland <mark.rutl...@arm.com> 
>>>>>> wrote:
>>>>>> > I hit the below splat at net/core/skbuff.c:2626 while fuzzing v4.14-rc2
>>>>>> > on arm64 with Syzkaller. This is the BUG_ON(len) at the end of
>>>>>> > skb_copy_and_csum_bits().
>>>>>
>>>>>> > kernel BUG at net/core/skbuff.c:2626!
>>>>>
>>>>>> > [] skb_copy_and_csum_bits+0x8dc/0xae0 
>>>>>> > net/core/skbuff.c:2626
>>>>>> > [] icmp_glue_bits+0xa4/0x2a0 net/ipv4/icmp.c:357
>>>>>> > [] __ip_append_data+0x10e4/0x20a8 
>>>>>> > net/ipv4/ip_output.c:1018
>>>>>> > [] ip_append_data.part.3+0xe8/0x1a0 
>>>>>> > net/ipv4/ip_output.c:1170
>>>>>> > [] ip_append_data+0xa4/0xb0 net/ipv4/ip_output.c:1173
>>>>>> > [] icmp_push_reply+0x1b8/0x690 net/ipv4/icmp.c:375
>>>>>> > [] icmp_send+0x1070/0x1890 net/ipv4/icmp.c:741
>>>>>> > [] ip_fragment.constprop.4+0x208/0x340 
>>>>>> > net/ipv4/ip_output.c:552
>>>>>> > [] ip_finish_output+0x3a8/0xab0 
>>>>>> > net/ipv4/ip_output.c:315
>>>>>> > [] NF_HOOK_COND include/linux/netfilter.h:238 
>>>>>> > [inline]
>>>>>> > [] ip_output+0x284/0x790 net/ipv4/ip_output.c:405
>>>>>> > [] dst_output include/net/dst.h:458 [inline]
>>>>>> > [] ip_local_out+0x9c/0x1b8 net/ipv4/ip_output.c:124
>>>>>> > [] ip_queue_xmit+0x850/0x18e0 
>>>>>> > net/ipv4/ip_output.c:504
>>>>>> > [] tcp_transmit_skb+0x107c/0x3338 
>>>>>> > net/ipv4/tcp_output.c:1123
>>>>>> > [] __tcp_retransmit_skb+0x614/0x1d18 
>>>>>> > net/ipv4/tcp_output.c:2847
>>>>>> > [] tcp_send_loss_probe+0x478/0x7d0 
>>>>>> > net/ipv4/tcp_output.c:2457
>>>>>> > [] tcp_write_timer_handler+0x50c/0x7e8 
>>>>>> > net/ipv4/tcp_timer.c:557
>>>>>> > [] tcp_write_timer+0x78/0x170 
>>>>>> > net/ipv4/tcp_timer.c:579
>>>>>> > [] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
>>>>>> > [] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
>>>>>> > [] __run_timers kernel/time/timer.c:1620 [inline]
>>>>>> > [] run_timer_softirq+0x214/0x5f0 
>>>>>> > kernel/time/timer.c:1646
>>>>>> > [] __do_softirq+0x350/0xc0c kernel/softirq.c:284
>>>>>> > [] do_softirq_own_stack 
>>>>>> > include/linux/interrupt.h:498 [inline]
>>>>>> > [] invoke_softirq kernel/softirq.c:371 [inline]
>>>>>> > [] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
>>>>>> > [] __handle_domain_irq+0xdc/0x230 
>>>>>> > kernel/irq/irqdesc.c:647
>>>>>> > [] handle_domain_irq include/linux/irqdesc.h:175 
>>>>>> > [inline]
>>>>>> > [] gic_handle_irq+0x6c/0xe0 
>>>>>> > drivers/irqchip/irq-gic.c:367
>>>>>
>>>>>> This is most likely a bug caused by syzkaller setting a ridiculous MTU
>>>>>> on loopback device, below minimum size of ipv4 MTU.
>>>>>
>>>>>> I tried to track it in August [1], but it seems hard to find all the
>>>>>> issues with this.
>>>>>>
>>>>>> commit c780a049f9bf442314335372c9abc4548bfe3e44
>>>>>> Author: Eric Dumazet <eduma...@google.com>
>>>>>> Date:   Wed Aug 16 11:09:12 2017 -0700
>>>>>>
>>>>>> ipv4: better IP_MAX_MTU enforcement
>>>>>>
>>>>>> While working on yet another syzkall

Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-03 Thread Eric Dumazet
On Tue, Oct 3, 2017 at 9:06 AM, Dmitry Vyukov  wrote:
> On Tue, Oct 3, 2017 at 5:38 PM, 'Eric Dumazet' via syzkaller
>  wrote:
>> On Tue, Oct 3, 2017 at 8:19 AM, Dmitry Vyukov  wrote:
>>> On Mon, Oct 2, 2017 at 4:42 PM, 'Eric Dumazet' via syzkaller
>>>  wrote:
>>>> On Mon, Oct 2, 2017 at 7:21 AM, Mark Rutland  wrote:
>>>>> Hi Eric,
>>>>>
>>>>> On Mon, Oct 02, 2017 at 06:36:32AM -0700, Eric Dumazet wrote:
>>>>>> On Mon, Oct 2, 2017 at 3:49 AM, Mark Rutland  
>>>>>> wrote:
>>>>>> > I hit the below splat at net/core/skbuff.c:2626 while fuzzing v4.14-rc2
>>>>>> > on arm64 with Syzkaller. This is the BUG_ON(len) at the end of
>>>>>> > skb_copy_and_csum_bits().
>>>>>
>>>>>> > kernel BUG at net/core/skbuff.c:2626!
>>>>>
>>>>>> > [] skb_copy_and_csum_bits+0x8dc/0xae0 
>>>>>> > net/core/skbuff.c:2626
>>>>>> > [] icmp_glue_bits+0xa4/0x2a0 net/ipv4/icmp.c:357
>>>>>> > [] __ip_append_data+0x10e4/0x20a8 
>>>>>> > net/ipv4/ip_output.c:1018
>>>>>> > [] ip_append_data.part.3+0xe8/0x1a0 
>>>>>> > net/ipv4/ip_output.c:1170
>>>>>> > [] ip_append_data+0xa4/0xb0 net/ipv4/ip_output.c:1173
>>>>>> > [] icmp_push_reply+0x1b8/0x690 net/ipv4/icmp.c:375
>>>>>> > [] icmp_send+0x1070/0x1890 net/ipv4/icmp.c:741
>>>>>> > [] ip_fragment.constprop.4+0x208/0x340 
>>>>>> > net/ipv4/ip_output.c:552
>>>>>> > [] ip_finish_output+0x3a8/0xab0 
>>>>>> > net/ipv4/ip_output.c:315
>>>>>> > [] NF_HOOK_COND include/linux/netfilter.h:238 
>>>>>> > [inline]
>>>>>> > [] ip_output+0x284/0x790 net/ipv4/ip_output.c:405
>>>>>> > [] dst_output include/net/dst.h:458 [inline]
>>>>>> > [] ip_local_out+0x9c/0x1b8 net/ipv4/ip_output.c:124
>>>>>> > [] ip_queue_xmit+0x850/0x18e0 
>>>>>> > net/ipv4/ip_output.c:504
>>>>>> > [] tcp_transmit_skb+0x107c/0x3338 
>>>>>> > net/ipv4/tcp_output.c:1123
>>>>>> > [] __tcp_retransmit_skb+0x614/0x1d18 
>>>>>> > net/ipv4/tcp_output.c:2847
>>>>>> > [] tcp_send_loss_probe+0x478/0x7d0 
>>>>>> > net/ipv4/tcp_output.c:2457
>>>>>> > [] tcp_write_timer_handler+0x50c/0x7e8 
>>>>>> > net/ipv4/tcp_timer.c:557
>>>>>> > [] tcp_write_timer+0x78/0x170 
>>>>>> > net/ipv4/tcp_timer.c:579
>>>>>> > [] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
>>>>>> > [] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
>>>>>> > [] __run_timers kernel/time/timer.c:1620 [inline]
>>>>>> > [] run_timer_softirq+0x214/0x5f0 
>>>>>> > kernel/time/timer.c:1646
>>>>>> > [] __do_softirq+0x350/0xc0c kernel/softirq.c:284
>>>>>> > [] do_softirq_own_stack 
>>>>>> > include/linux/interrupt.h:498 [inline]
>>>>>> > [] invoke_softirq kernel/softirq.c:371 [inline]
>>>>>> > [] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
>>>>>> > [] __handle_domain_irq+0xdc/0x230 
>>>>>> > kernel/irq/irqdesc.c:647
>>>>>> > [] handle_domain_irq include/linux/irqdesc.h:175 
>>>>>> > [inline]
>>>>>> > [] gic_handle_irq+0x6c/0xe0 
>>>>>> > drivers/irqchip/irq-gic.c:367
>>>>>
>>>>>> This is most likely a bug caused by syzkaller setting a ridiculous MTU
>>>>>> on loopback device, below minimum size of ipv4 MTU.
>>>>>
>>>>>> I tried to track it in August [1], but it seems hard to find all the
>>>>>> issues with this.
>>>>>>
>>>>>> commit c780a049f9bf442314335372c9abc4548bfe3e44
>>>>>> Author: Eric Dumazet 
>>>>>> Date:   Wed Aug 16 11:09:12 2017 -0700
>>>>>>
>>>>>> ipv4: better IP_MAX_MTU enforcement
>>>>>>
>>>>>> While working on yet another syzkaller report, I found
>>>>>> that our IP_MAX_MTU enforcements were not properly done.
>>>>>>
>>>>>> gcc seems to reload dev->mtu for min

Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-03 Thread Eric Dumazet
On Tue, Oct 3, 2017 at 8:19 AM, Dmitry Vyukov <dvyu...@google.com> wrote:
> On Mon, Oct 2, 2017 at 4:42 PM, 'Eric Dumazet' via syzkaller
> <syzkal...@googlegroups.com> wrote:
>> On Mon, Oct 2, 2017 at 7:21 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>>> Hi Eric,
>>>
>>> On Mon, Oct 02, 2017 at 06:36:32AM -0700, Eric Dumazet wrote:
>>>> On Mon, Oct 2, 2017 at 3:49 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>>>> > I hit the below splat at net/core/skbuff.c:2626 while fuzzing v4.14-rc2
>>>> > on arm64 with Syzkaller. This is the BUG_ON(len) at the end of
>>>> > skb_copy_and_csum_bits().
>>>
>>>> > kernel BUG at net/core/skbuff.c:2626!
>>>
>>>> > [] skb_copy_and_csum_bits+0x8dc/0xae0 
>>>> > net/core/skbuff.c:2626
>>>> > [] icmp_glue_bits+0xa4/0x2a0 net/ipv4/icmp.c:357
>>>> > [] __ip_append_data+0x10e4/0x20a8 
>>>> > net/ipv4/ip_output.c:1018
>>>> > [] ip_append_data.part.3+0xe8/0x1a0 
>>>> > net/ipv4/ip_output.c:1170
>>>> > [] ip_append_data+0xa4/0xb0 net/ipv4/ip_output.c:1173
>>>> > [] icmp_push_reply+0x1b8/0x690 net/ipv4/icmp.c:375
>>>> > [] icmp_send+0x1070/0x1890 net/ipv4/icmp.c:741
>>>> > [] ip_fragment.constprop.4+0x208/0x340 
>>>> > net/ipv4/ip_output.c:552
>>>> > [] ip_finish_output+0x3a8/0xab0 
>>>> > net/ipv4/ip_output.c:315
>>>> > [] NF_HOOK_COND include/linux/netfilter.h:238 [inline]
>>>> > [] ip_output+0x284/0x790 net/ipv4/ip_output.c:405
>>>> > [] dst_output include/net/dst.h:458 [inline]
>>>> > [] ip_local_out+0x9c/0x1b8 net/ipv4/ip_output.c:124
>>>> > [] ip_queue_xmit+0x850/0x18e0 net/ipv4/ip_output.c:504
>>>> > [] tcp_transmit_skb+0x107c/0x3338 
>>>> > net/ipv4/tcp_output.c:1123
>>>> > [] __tcp_retransmit_skb+0x614/0x1d18 
>>>> > net/ipv4/tcp_output.c:2847
>>>> > [] tcp_send_loss_probe+0x478/0x7d0 
>>>> > net/ipv4/tcp_output.c:2457
>>>> > [] tcp_write_timer_handler+0x50c/0x7e8 
>>>> > net/ipv4/tcp_timer.c:557
>>>> > [] tcp_write_timer+0x78/0x170 net/ipv4/tcp_timer.c:579
>>>> > [] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
>>>> > [] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
>>>> > [] __run_timers kernel/time/timer.c:1620 [inline]
>>>> > [] run_timer_softirq+0x214/0x5f0 
>>>> > kernel/time/timer.c:1646
>>>> > [] __do_softirq+0x350/0xc0c kernel/softirq.c:284
>>>> > [] do_softirq_own_stack include/linux/interrupt.h:498 
>>>> > [inline]
>>>> > [] invoke_softirq kernel/softirq.c:371 [inline]
>>>> > [] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
>>>> > [] __handle_domain_irq+0xdc/0x230 
>>>> > kernel/irq/irqdesc.c:647
>>>> > [] handle_domain_irq include/linux/irqdesc.h:175 
>>>> > [inline]
>>>> > [] gic_handle_irq+0x6c/0xe0 
>>>> > drivers/irqchip/irq-gic.c:367
>>>
>>>> This is most likely a bug caused by syzkaller setting a ridiculous MTU
>>>> on loopback device, below minimum size of ipv4 MTU.
>>>
>>>> I tried to track it in August [1], but it seems hard to find all the
>>>> issues with this.
>>>>
>>>> commit c780a049f9bf442314335372c9abc4548bfe3e44
>>>> Author: Eric Dumazet <eduma...@google.com>
>>>> Date:   Wed Aug 16 11:09:12 2017 -0700
>>>>
>>>> ipv4: better IP_MAX_MTU enforcement
>>>>
>>>> While working on yet another syzkaller report, I found
>>>> that our IP_MAX_MTU enforcements were not properly done.
>>>>
>>>> gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
>>>> final result can be bigger than IP_MAX_MTU :/
>>>>
>>>> This is a problem because device mtu can be changed on other cpus or
>>>> threads.
>>>>
>>>> While this patch does not fix the issue I am working on, it is
>>>> probably worth addressing it.
>>>
>>> Just to check I've understood correctly, are you suggesting that the
>>> IPv4 code should also check the dev->mtu against a IP_MIN_MTU (which
>>> doesn't seem to exist today)?
>>
>> We have plenty of places this is checked.
>>
>> For example, trying to set M

Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-03 Thread Eric Dumazet
On Tue, Oct 3, 2017 at 8:19 AM, Dmitry Vyukov  wrote:
> On Mon, Oct 2, 2017 at 4:42 PM, 'Eric Dumazet' via syzkaller
>  wrote:
>> On Mon, Oct 2, 2017 at 7:21 AM, Mark Rutland  wrote:
>>> Hi Eric,
>>>
>>> On Mon, Oct 02, 2017 at 06:36:32AM -0700, Eric Dumazet wrote:
>>>> On Mon, Oct 2, 2017 at 3:49 AM, Mark Rutland  wrote:
>>>> > I hit the below splat at net/core/skbuff.c:2626 while fuzzing v4.14-rc2
>>>> > on arm64 with Syzkaller. This is the BUG_ON(len) at the end of
>>>> > skb_copy_and_csum_bits().
>>>
>>>> > kernel BUG at net/core/skbuff.c:2626!
>>>
>>>> > [] skb_copy_and_csum_bits+0x8dc/0xae0 
>>>> > net/core/skbuff.c:2626
>>>> > [] icmp_glue_bits+0xa4/0x2a0 net/ipv4/icmp.c:357
>>>> > [] __ip_append_data+0x10e4/0x20a8 
>>>> > net/ipv4/ip_output.c:1018
>>>> > [] ip_append_data.part.3+0xe8/0x1a0 
>>>> > net/ipv4/ip_output.c:1170
>>>> > [] ip_append_data+0xa4/0xb0 net/ipv4/ip_output.c:1173
>>>> > [] icmp_push_reply+0x1b8/0x690 net/ipv4/icmp.c:375
>>>> > [] icmp_send+0x1070/0x1890 net/ipv4/icmp.c:741
>>>> > [] ip_fragment.constprop.4+0x208/0x340 
>>>> > net/ipv4/ip_output.c:552
>>>> > [] ip_finish_output+0x3a8/0xab0 
>>>> > net/ipv4/ip_output.c:315
>>>> > [] NF_HOOK_COND include/linux/netfilter.h:238 [inline]
>>>> > [] ip_output+0x284/0x790 net/ipv4/ip_output.c:405
>>>> > [] dst_output include/net/dst.h:458 [inline]
>>>> > [] ip_local_out+0x9c/0x1b8 net/ipv4/ip_output.c:124
>>>> > [] ip_queue_xmit+0x850/0x18e0 net/ipv4/ip_output.c:504
>>>> > [] tcp_transmit_skb+0x107c/0x3338 
>>>> > net/ipv4/tcp_output.c:1123
>>>> > [] __tcp_retransmit_skb+0x614/0x1d18 
>>>> > net/ipv4/tcp_output.c:2847
>>>> > [] tcp_send_loss_probe+0x478/0x7d0 
>>>> > net/ipv4/tcp_output.c:2457
>>>> > [] tcp_write_timer_handler+0x50c/0x7e8 
>>>> > net/ipv4/tcp_timer.c:557
>>>> > [] tcp_write_timer+0x78/0x170 net/ipv4/tcp_timer.c:579
>>>> > [] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
>>>> > [] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
>>>> > [] __run_timers kernel/time/timer.c:1620 [inline]
>>>> > [] run_timer_softirq+0x214/0x5f0 
>>>> > kernel/time/timer.c:1646
>>>> > [] __do_softirq+0x350/0xc0c kernel/softirq.c:284
>>>> > [] do_softirq_own_stack include/linux/interrupt.h:498 
>>>> > [inline]
>>>> > [] invoke_softirq kernel/softirq.c:371 [inline]
>>>> > [] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
>>>> > [] __handle_domain_irq+0xdc/0x230 
>>>> > kernel/irq/irqdesc.c:647
>>>> > [] handle_domain_irq include/linux/irqdesc.h:175 
>>>> > [inline]
>>>> > [] gic_handle_irq+0x6c/0xe0 
>>>> > drivers/irqchip/irq-gic.c:367
>>>
>>>> This is most likely a bug caused by syzkaller setting a ridiculous MTU
>>>> on loopback device, below minimum size of ipv4 MTU.
>>>
>>>> I tried to track it in August [1], but it seems hard to find all the
>>>> issues with this.
>>>>
>>>> commit c780a049f9bf442314335372c9abc4548bfe3e44
>>>> Author: Eric Dumazet 
>>>> Date:   Wed Aug 16 11:09:12 2017 -0700
>>>>
>>>> ipv4: better IP_MAX_MTU enforcement
>>>>
>>>> While working on yet another syzkaller report, I found
>>>> that our IP_MAX_MTU enforcements were not properly done.
>>>>
>>>> gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
>>>> final result can be bigger than IP_MAX_MTU :/
>>>>
>>>> This is a problem because device mtu can be changed on other cpus or
>>>> threads.
>>>>
>>>> While this patch does not fix the issue I am working on, it is
>>>> probably worth addressing it.
>>>
>>> Just to check I've understood correctly, are you suggesting that the
>>> IPv4 code should also check the dev->mtu against a IP_MIN_MTU (which
>>> doesn't seem to exist today)?
>>
>> We have plenty of places this is checked.
>>
>> For example, trying to set MTU < 68 usually removes IPv4 addresses and 
>> routes.
>>
>> Problem is : these checks are not fool proof yet.
>>
>> ( Only the admin was supposed to play these games )
>>
>>>
>>> Otherwise, I do spot another potential issue. The writer side (e.g. most
>>> net_device::ndo_change_mtu implementations and the __dev_set_mtu()
>>> fallback) doesn't use WRITE_ONCE().
>>
>> It does not matter how many strange values can be observed by the reader :
>> We must be fool proof anyway from reader point of view, so the
>> WRITE_ONCE() is not strictly needed.
>
>
> Note if writer stores some temporal garbage there (which C language
> perfectly allows), it does not matter what we do on reader side --
> reader won't get correct data anyway. Say mtu changes from 1000 to
> 2000, but writer temporary stores 1 there, reader can observe 1 while
> it must not. Synchronization is always a game of two.

Since we have no sync here, a reader _must_ cope with any MTU value.

We need to care of any value, so we do not care how dummy writers can be.

Sure, a WRITE_ONCE() will help avoiding some strange values being written,
 but since we _allow_ writers to write such strange values,
there is really no point pretending to be safe here.

Adding a WRITE_ONCE() will not fix the bug.


Re: [PATCH 2/2] vfs: grab the lock instead of blocking in __fd_install during resizing

2017-10-03 Thread Eric Dumazet
On Tue, Oct 3, 2017 at 3:58 AM, Mateusz Guzik <mgu...@redhat.com> wrote:
> Explicit locking in the fallback case provides a safe state of the
> table. Getting rid of blocking semantics makes __fd_install usable
> again in non-sleepable contexts, which easies backporting efforts.
>
> There is a side effect of slightly nicer assembly for the common case
> as might_sleep can now be removed.
>
> Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
> ---
>  Documentation/filesystems/porting |  4 
>  fs/file.c | 11 +++
>  2 files changed, 7 insertions(+), 8 deletions(-)

Nice change !

Reviewed-by: Eric Dumazet <eduma...@google.com>


Re: [PATCH 2/2] vfs: grab the lock instead of blocking in __fd_install during resizing

2017-10-03 Thread Eric Dumazet
On Tue, Oct 3, 2017 at 3:58 AM, Mateusz Guzik  wrote:
> Explicit locking in the fallback case provides a safe state of the
> table. Getting rid of blocking semantics makes __fd_install usable
> again in non-sleepable contexts, which easies backporting efforts.
>
> There is a side effect of slightly nicer assembly for the common case
> as might_sleep can now be removed.
>
> Signed-off-by: Mateusz Guzik 
> ---
>  Documentation/filesystems/porting |  4 
>  fs/file.c | 11 +++
>  2 files changed, 7 insertions(+), 8 deletions(-)

Nice change !

Reviewed-by: Eric Dumazet 


Re: [PATCH 1/2] vfs: stop clearing close on exec when closing a fd

2017-10-03 Thread Eric Dumazet
On Tue, Oct 3, 2017 at 3:58 AM, Mateusz Guzik <mgu...@redhat.com> wrote:
> Codepaths allocating a fd always make sure the bit is set/unset
> depending on flags, thus clearing on close is redundant.
>
> Signed-off-by: Mateusz Guzik <mgu...@redhat.com>
> ---

Reviewed-by: Eric Dumazet <eduma...@google.com>


Re: [PATCH 1/2] vfs: stop clearing close on exec when closing a fd

2017-10-03 Thread Eric Dumazet
On Tue, Oct 3, 2017 at 3:58 AM, Mateusz Guzik  wrote:
> Codepaths allocating a fd always make sure the bit is set/unset
> depending on flags, thus clearing on close is redundant.
>
> Signed-off-by: Mateusz Guzik 
> ---

Reviewed-by: Eric Dumazet 


Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Eric Dumazet
On Mon, Oct 2, 2017 at 10:21 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
> On Mon, Oct 02, 2017 at 07:48:28AM -0700, Eric Dumazet wrote:
>> Please try the following fool proof patch.
>>
>> This is what I had in my local tree back in August but could not
>> conclude on the syzkaller bug I was working on.
>>
>> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>> index 
>> 681e33998e03b609fdca83a83e0fc62a3fee8c39..e51d97a927058760a1ab7af00579f7488cb5
>>  100644
>> --- a/net/ipv4/icmp.c
>> +++ b/net/ipv4/icmp.c
>> @@ -732,7 +732,8 @@ void icmp_send(struct sk_buff *skb_in, int type, int 
>> code, __be32 info)
>>   room = 576;
>>   room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
>>   room -= sizeof(struct icmphdr);
>> -
>> + if (room < 0)
>> + goto ende;
>>   icmp_param.data_len = skb_in->len - icmp_param.offset;
>>   if (icmp_param.data_len > room)
>>   icmp_param.data_len = room;
>>
>
> Unfortuantely, with this applied I still see the issue.
>
> Syzkaller came up with a minimized reproducer [1], which can trigger the
> issue near instantly under syz-execprog. If there's anything that would
> help to narrow this down, I'm more than happy to give it a go.
>
> Thanks,
> Mark.
>
> [1] 
> https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20171002-skb_clone-misaligned-atomic/syzkaller.repro

Note that I was not trying to address the misaligned stuff.

Only this :

[ cut here ]
kernel BUG at net/core/skbuff.c:2626!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.0-rc2-1-gd7ad33d #115
Hardware name: linux,dummy-virt (DT)
task: 80003a901a80 task.stack: 80003a908000
PC is at skb_copy_and_csum_bits+0x8dc/0xae0 net/core/skbuff.c:2626
LR is at skb_copy_and_csum_bits+0x8dc/0xae0 net/core/skbuff.c:2626


Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Eric Dumazet
On Mon, Oct 2, 2017 at 10:21 AM, Mark Rutland  wrote:
> On Mon, Oct 02, 2017 at 07:48:28AM -0700, Eric Dumazet wrote:
>> Please try the following fool proof patch.
>>
>> This is what I had in my local tree back in August but could not
>> conclude on the syzkaller bug I was working on.
>>
>> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
>> index 
>> 681e33998e03b609fdca83a83e0fc62a3fee8c39..e51d97a927058760a1ab7af00579f7488cb5
>>  100644
>> --- a/net/ipv4/icmp.c
>> +++ b/net/ipv4/icmp.c
>> @@ -732,7 +732,8 @@ void icmp_send(struct sk_buff *skb_in, int type, int 
>> code, __be32 info)
>>   room = 576;
>>   room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
>>   room -= sizeof(struct icmphdr);
>> -
>> + if (room < 0)
>> + goto ende;
>>   icmp_param.data_len = skb_in->len - icmp_param.offset;
>>   if (icmp_param.data_len > room)
>>   icmp_param.data_len = room;
>>
>
> Unfortuantely, with this applied I still see the issue.
>
> Syzkaller came up with a minimized reproducer [1], which can trigger the
> issue near instantly under syz-execprog. If there's anything that would
> help to narrow this down, I'm more than happy to give it a go.
>
> Thanks,
> Mark.
>
> [1] 
> https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20171002-skb_clone-misaligned-atomic/syzkaller.repro

Note that I was not trying to address the misaligned stuff.

Only this :

[ cut here ]
kernel BUG at net/core/skbuff.c:2626!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:
CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.0-rc2-1-gd7ad33d #115
Hardware name: linux,dummy-virt (DT)
task: 80003a901a80 task.stack: 80003a908000
PC is at skb_copy_and_csum_bits+0x8dc/0xae0 net/core/skbuff.c:2626
LR is at skb_copy_and_csum_bits+0x8dc/0xae0 net/core/skbuff.c:2626


Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Eric Dumazet
On Mon, 2017-10-02 at 15:21 +0100, Mark Rutland wrote:
> Hi Eric,
> 
> On Mon, Oct 02, 2017 at 06:36:32AM -0700, Eric Dumazet wrote:
> > On Mon, Oct 2, 2017 at 3:49 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
> > > I hit the below splat at net/core/skbuff.c:2626 while fuzzing v4.14-rc2
> > > on arm64 with Syzkaller. This is the BUG_ON(len) at the end of
> > > skb_copy_and_csum_bits().
> 
> > > kernel BUG at net/core/skbuff.c:2626!
> 
> > > [] skb_copy_and_csum_bits+0x8dc/0xae0 
> > > net/core/skbuff.c:2626
> > > [] icmp_glue_bits+0xa4/0x2a0 net/ipv4/icmp.c:357
> > > [] __ip_append_data+0x10e4/0x20a8 
> > > net/ipv4/ip_output.c:1018
> > > [] ip_append_data.part.3+0xe8/0x1a0 
> > > net/ipv4/ip_output.c:1170
> > > [] ip_append_data+0xa4/0xb0 net/ipv4/ip_output.c:1173
> > > [] icmp_push_reply+0x1b8/0x690 net/ipv4/icmp.c:375
> > > [] icmp_send+0x1070/0x1890 net/ipv4/icmp.c:741
> > > [] ip_fragment.constprop.4+0x208/0x340 
> > > net/ipv4/ip_output.c:552
> > > [] ip_finish_output+0x3a8/0xab0 net/ipv4/ip_output.c:315
> > > [] NF_HOOK_COND include/linux/netfilter.h:238 [inline]
> > > [] ip_output+0x284/0x790 net/ipv4/ip_output.c:405
> > > [] dst_output include/net/dst.h:458 [inline]
> > > [] ip_local_out+0x9c/0x1b8 net/ipv4/ip_output.c:124
> > > [] ip_queue_xmit+0x850/0x18e0 net/ipv4/ip_output.c:504
> > > [] tcp_transmit_skb+0x107c/0x3338 
> > > net/ipv4/tcp_output.c:1123
> > > [] __tcp_retransmit_skb+0x614/0x1d18 
> > > net/ipv4/tcp_output.c:2847
> > > [] tcp_send_loss_probe+0x478/0x7d0 
> > > net/ipv4/tcp_output.c:2457
> > > [] tcp_write_timer_handler+0x50c/0x7e8 
> > > net/ipv4/tcp_timer.c:557
> > > [] tcp_write_timer+0x78/0x170 net/ipv4/tcp_timer.c:579
> > > [] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
> > > [] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
> > > [] __run_timers kernel/time/timer.c:1620 [inline]
> > > [] run_timer_softirq+0x214/0x5f0 
> > > kernel/time/timer.c:1646
> > > [] __do_softirq+0x350/0xc0c kernel/softirq.c:284
> > > [] do_softirq_own_stack include/linux/interrupt.h:498 
> > > [inline]
> > > [] invoke_softirq kernel/softirq.c:371 [inline]
> > > [] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
> > > [] __handle_domain_irq+0xdc/0x230 
> > > kernel/irq/irqdesc.c:647
> > > [] handle_domain_irq include/linux/irqdesc.h:175 
> > > [inline]
> > > [] gic_handle_irq+0x6c/0xe0 
> > > drivers/irqchip/irq-gic.c:367

Please try the following fool proof patch.

This is what I had in my local tree back in August but could not
conclude on the syzkaller bug I was working on.


diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 
681e33998e03b609fdca83a83e0fc62a3fee8c39..e51d97a927058760a1ab7af00579f7488cb5
 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -732,7 +732,8 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, 
__be32 info)
room = 576;
room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
room -= sizeof(struct icmphdr);
-
+   if (room < 0)
+   goto ende;
icmp_param.data_len = skb_in->len - icmp_param.offset;
if (icmp_param.data_len > room)
icmp_param.data_len = room;





Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Eric Dumazet
On Mon, 2017-10-02 at 15:21 +0100, Mark Rutland wrote:
> Hi Eric,
> 
> On Mon, Oct 02, 2017 at 06:36:32AM -0700, Eric Dumazet wrote:
> > On Mon, Oct 2, 2017 at 3:49 AM, Mark Rutland  wrote:
> > > I hit the below splat at net/core/skbuff.c:2626 while fuzzing v4.14-rc2
> > > on arm64 with Syzkaller. This is the BUG_ON(len) at the end of
> > > skb_copy_and_csum_bits().
> 
> > > kernel BUG at net/core/skbuff.c:2626!
> 
> > > [] skb_copy_and_csum_bits+0x8dc/0xae0 
> > > net/core/skbuff.c:2626
> > > [] icmp_glue_bits+0xa4/0x2a0 net/ipv4/icmp.c:357
> > > [] __ip_append_data+0x10e4/0x20a8 
> > > net/ipv4/ip_output.c:1018
> > > [] ip_append_data.part.3+0xe8/0x1a0 
> > > net/ipv4/ip_output.c:1170
> > > [] ip_append_data+0xa4/0xb0 net/ipv4/ip_output.c:1173
> > > [] icmp_push_reply+0x1b8/0x690 net/ipv4/icmp.c:375
> > > [] icmp_send+0x1070/0x1890 net/ipv4/icmp.c:741
> > > [] ip_fragment.constprop.4+0x208/0x340 
> > > net/ipv4/ip_output.c:552
> > > [] ip_finish_output+0x3a8/0xab0 net/ipv4/ip_output.c:315
> > > [] NF_HOOK_COND include/linux/netfilter.h:238 [inline]
> > > [] ip_output+0x284/0x790 net/ipv4/ip_output.c:405
> > > [] dst_output include/net/dst.h:458 [inline]
> > > [] ip_local_out+0x9c/0x1b8 net/ipv4/ip_output.c:124
> > > [] ip_queue_xmit+0x850/0x18e0 net/ipv4/ip_output.c:504
> > > [] tcp_transmit_skb+0x107c/0x3338 
> > > net/ipv4/tcp_output.c:1123
> > > [] __tcp_retransmit_skb+0x614/0x1d18 
> > > net/ipv4/tcp_output.c:2847
> > > [] tcp_send_loss_probe+0x478/0x7d0 
> > > net/ipv4/tcp_output.c:2457
> > > [] tcp_write_timer_handler+0x50c/0x7e8 
> > > net/ipv4/tcp_timer.c:557
> > > [] tcp_write_timer+0x78/0x170 net/ipv4/tcp_timer.c:579
> > > [] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
> > > [] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
> > > [] __run_timers kernel/time/timer.c:1620 [inline]
> > > [] run_timer_softirq+0x214/0x5f0 
> > > kernel/time/timer.c:1646
> > > [] __do_softirq+0x350/0xc0c kernel/softirq.c:284
> > > [] do_softirq_own_stack include/linux/interrupt.h:498 
> > > [inline]
> > > [] invoke_softirq kernel/softirq.c:371 [inline]
> > > [] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
> > > [] __handle_domain_irq+0xdc/0x230 
> > > kernel/irq/irqdesc.c:647
> > > [] handle_domain_irq include/linux/irqdesc.h:175 
> > > [inline]
> > > [] gic_handle_irq+0x6c/0xe0 
> > > drivers/irqchip/irq-gic.c:367

Please try the following fool proof patch.

This is what I had in my local tree back in August but could not
conclude on the syzkaller bug I was working on.


diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 
681e33998e03b609fdca83a83e0fc62a3fee8c39..e51d97a927058760a1ab7af00579f7488cb5
 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -732,7 +732,8 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, 
__be32 info)
room = 576;
room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
room -= sizeof(struct icmphdr);
-
+   if (room < 0)
+   goto ende;
icmp_param.data_len = skb_in->len - icmp_param.offset;
if (icmp_param.data_len > room)
icmp_param.data_len = room;





Re: v4.14-rc2/arm64 kernel BUG at net/core/skbuff.c:2626

2017-10-02 Thread Eric Dumazet
On Mon, Oct 2, 2017 at 7:21 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
> Hi Eric,
>
> On Mon, Oct 02, 2017 at 06:36:32AM -0700, Eric Dumazet wrote:
>> On Mon, Oct 2, 2017 at 3:49 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
>> > I hit the below splat at net/core/skbuff.c:2626 while fuzzing v4.14-rc2
>> > on arm64 with Syzkaller. This is the BUG_ON(len) at the end of
>> > skb_copy_and_csum_bits().
>
>> > kernel BUG at net/core/skbuff.c:2626!
>
>> > [] skb_copy_and_csum_bits+0x8dc/0xae0 
>> > net/core/skbuff.c:2626
>> > [] icmp_glue_bits+0xa4/0x2a0 net/ipv4/icmp.c:357
>> > [] __ip_append_data+0x10e4/0x20a8 
>> > net/ipv4/ip_output.c:1018
>> > [] ip_append_data.part.3+0xe8/0x1a0 
>> > net/ipv4/ip_output.c:1170
>> > [] ip_append_data+0xa4/0xb0 net/ipv4/ip_output.c:1173
>> > [] icmp_push_reply+0x1b8/0x690 net/ipv4/icmp.c:375
>> > [] icmp_send+0x1070/0x1890 net/ipv4/icmp.c:741
>> > [] ip_fragment.constprop.4+0x208/0x340 
>> > net/ipv4/ip_output.c:552
>> > [] ip_finish_output+0x3a8/0xab0 net/ipv4/ip_output.c:315
>> > [] NF_HOOK_COND include/linux/netfilter.h:238 [inline]
>> > [] ip_output+0x284/0x790 net/ipv4/ip_output.c:405
>> > [] dst_output include/net/dst.h:458 [inline]
>> > [] ip_local_out+0x9c/0x1b8 net/ipv4/ip_output.c:124
>> > [] ip_queue_xmit+0x850/0x18e0 net/ipv4/ip_output.c:504
>> > [] tcp_transmit_skb+0x107c/0x3338 
>> > net/ipv4/tcp_output.c:1123
>> > [] __tcp_retransmit_skb+0x614/0x1d18 
>> > net/ipv4/tcp_output.c:2847
>> > [] tcp_send_loss_probe+0x478/0x7d0 
>> > net/ipv4/tcp_output.c:2457
>> > [] tcp_write_timer_handler+0x50c/0x7e8 
>> > net/ipv4/tcp_timer.c:557
>> > [] tcp_write_timer+0x78/0x170 net/ipv4/tcp_timer.c:579
>> > [] call_timer_fn+0x1b8/0x430 kernel/time/timer.c:1281
>> > [] expire_timers+0x1d4/0x320 kernel/time/timer.c:1320
>> > [] __run_timers kernel/time/timer.c:1620 [inline]
>> > [] run_timer_softirq+0x214/0x5f0 kernel/time/timer.c:1646
>> > [] __do_softirq+0x350/0xc0c kernel/softirq.c:284
>> > [] do_softirq_own_stack include/linux/interrupt.h:498 
>> > [inline]
>> > [] invoke_softirq kernel/softirq.c:371 [inline]
>> > [] irq_exit+0x1dc/0x2f8 kernel/softirq.c:405
>> > [] __handle_domain_irq+0xdc/0x230 
>> > kernel/irq/irqdesc.c:647
>> > [] handle_domain_irq include/linux/irqdesc.h:175 [inline]
>> > [] gic_handle_irq+0x6c/0xe0 drivers/irqchip/irq-gic.c:367
>
>> This is most likely a bug caused by syzkaller setting a ridiculous MTU
>> on loopback device, below minimum size of ipv4 MTU.
>
>> I tried to track it in August [1], but it seems hard to find all the
>> issues with this.
>>
>> commit c780a049f9bf442314335372c9abc4548bfe3e44
>> Author: Eric Dumazet <eduma...@google.com>
>> Date:   Wed Aug 16 11:09:12 2017 -0700
>>
>> ipv4: better IP_MAX_MTU enforcement
>>
>> While working on yet another syzkaller report, I found
>> that our IP_MAX_MTU enforcements were not properly done.
>>
>> gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
>> final result can be bigger than IP_MAX_MTU :/
>>
>> This is a problem because device mtu can be changed on other cpus or
>> threads.
>>
>> While this patch does not fix the issue I am working on, it is
>> probably worth addressing it.
>
> Just to check I've understood correctly, are you suggesting that the
> IPv4 code should also check the dev->mtu against a IP_MIN_MTU (which
> doesn't seem to exist today)?

We have plenty of places this is checked.

For example, trying to set MTU < 68 usually removes IPv4 addresses and routes.

Problem is : these checks are not fool proof yet.

( Only the admin was supposed to play these games )

>
> Otherwise, I do spot another potential issue. The writer side (e.g. most
> net_device::ndo_change_mtu implementations and the __dev_set_mtu()
> fallback) doesn't use WRITE_ONCE().

It does not matter how many strange values can be observed by the reader :
We must be fool proof anyway from reader point of view, so the
WRITE_ONCE() is not strictly needed.


>
> IIUC, that means that the write could be torn across multiple accesses,
> and we could see dev->mtu < dev->min_mtu on the read side, even if we
> use READ_ONCE(), and sanity check the mtu value before calling
> __dev_set_mtu().


<    5   6   7   8   9   10   11   12   13   14   >