Re: [PATCH] r8169: Add new device ID support

2018-10-23 Thread Shawn Lin

On 2018/10/24 13:54, David Miller wrote:

From: Shawn Lin 
Date: Wed, 24 Oct 2018 13:48:55 +0800


Hi David,

On 2018/10/24 10:19, David Miller wrote:

From: Shawn Lin 
Date: Wed, 24 Oct 2018 09:46:47 +0800


It's found my r8169 ethernet card at hand has a device ID
of 0x which wasn't on the list of rtl8169_pci_tbl. Add
a new entry to make it work:

   ...

01:00.0 Class 0200: 10ec:

I don't know about this.
A value of zero could mean the device is mis-responding to
PCI config space requests or something like that.


It was working fine on my retired Windows XP home PC with same devcice
ID listed, so I guess r8169 driver for windows system knows 0x is
also valid.


It is also possible the device comes up in a different state.

Under windows does it show with that device ID of zero?


yup. More precisely, I checked how BIOS enumerate it by PCIe analyzer
and see it does report 0x as device ID.









Re: [PATCH] r8169: Add new device ID support

2018-10-23 Thread Shawn Lin

Hi David,

On 2018/10/24 10:19, David Miller wrote:

From: Shawn Lin 
Date: Wed, 24 Oct 2018 09:46:47 +0800


It's found my r8169 ethernet card at hand has a device ID
of 0x which wasn't on the list of rtl8169_pci_tbl. Add
a new entry to make it work:

  ...

01:00.0 Class 0200: 10ec:


I don't know about this.

A value of zero could mean the device is mis-responding to
PCI config space requests or something like that.


It was working fine on my retired Windows XP home PC with same devcice
ID listed, so I guess r8169 driver for windows system knows 0x is
also valid.









Re: [PATCH] r8169: Add new device ID support

2018-10-23 Thread David Miller
From: Shawn Lin 
Date: Wed, 24 Oct 2018 13:48:55 +0800

> Hi David,
> 
> On 2018/10/24 10:19, David Miller wrote:
>> From: Shawn Lin 
>> Date: Wed, 24 Oct 2018 09:46:47 +0800
>> 
>>> It's found my r8169 ethernet card at hand has a device ID
>>> of 0x which wasn't on the list of rtl8169_pci_tbl. Add
>>> a new entry to make it work:
>>   ...
>>> 01:00.0 Class 0200: 10ec:
>> I don't know about this.
>> A value of zero could mean the device is mis-responding to
>> PCI config space requests or something like that.
> 
> It was working fine on my retired Windows XP home PC with same devcice
> ID listed, so I guess r8169 driver for windows system knows 0x is
> also valid.

It is also possible the device comes up in a different state.

Under windows does it show with that device ID of zero?


Re: [PATCH net v2] net/ipv6: Add anycast addresses to a global hashtable

2018-10-23 Thread Jeff Barnhill
Thanks for the feedback.

As suggested, I did these things:
 - switched to refcount_t
 - stopped grabbing a reference on the netns (now able to use kfree_rcu)
 - re-ordered ipv6_chk_acast_addr variable definitions to reverse xmas tree

With regards to your question in __ipv6_dev_ac_inc():

> + err = ipv6_add_acaddr_hash(dev_net(idev->dev), addr);
> + if (err) {
> + fib6_info_release(f6i);
> + fib6_info_release(f6i);
Double call to fib6_info_release() ? Why ?

Unless I mis-understand, both addrconf_f6i_alloc() (indirectly through
fib6_info_alloc()) and aca_alloc() increment fib6_ref, so it seemed like
to fully cleanup/backout, we needed to to decrement twice.  Please let me know
if I'm wrong here.

I'll re-submit the patch after agreement on the double call and
testing with the new changes.

Thanks!
Jeff
On Tue, Oct 23, 2018 at 11:12 PM Eric Dumazet  wrote:
>
>
>
> On 10/23/2018 06:58 PM, Jeff Barnhill wrote:
> > icmp6_send() function is expensive on systems with a large number of
> > interfaces. Every time it’s called, it has to verify that the source
> > address does not correspond to an existing anycast address by looping
> > through every device and every anycast address on the device.  This can
> > result in significant delays for a CPU when there are a large number of
> > neighbors and ND timers are frequently timing out and calling
> > neigh_invalidate().
> >
> > Add anycast addresses to a global hashtable to allow quick searching for
> > matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.
> >
> > Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>
> > ---
> >  include/net/addrconf.h |   2 +
> >  include/net/if_inet6.h |   8 +++
> >  net/ipv6/af_inet6.c|   5 ++
> >  net/ipv6/anycast.c | 132 
> > -
> >  4 files changed, 145 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/addrconf.h b/include/net/addrconf.h
> > index 6def0351bcc3..0cee3f99c41d 100644
> > --- a/include/net/addrconf.h
> > +++ b/include/net/addrconf.h
> > @@ -312,6 +312,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct 
> > net_device *dev,
> >const struct in6_addr *addr);
> >  bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
> >const struct in6_addr *addr);
> > +int anycast_init(void);
> > +void anycast_cleanup(void);
> >
> >  /* Device notifier */
> >  int register_inet6addr_notifier(struct notifier_block *nb);
> > diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
> > index d7578cf49c3a..55a4a1d8cebc 100644
> > --- a/include/net/if_inet6.h
> > +++ b/include/net/if_inet6.h
> > @@ -142,6 +142,14 @@ struct ipv6_ac_socklist {
> >   struct ipv6_ac_socklist *acl_next;
> >  };
> >
> > +struct ipv6_ac_addrlist {
> > + struct in6_addr acal_addr;
> > + possible_net_t  acal_pnet;
> > + int acal_users;
>
> That would be a refcount_t acal_users; so that CONFIG_REFCOUNT_FULL brings 
> debugging for free.
>
> > + struct hlist_node   acal_lst; /* inet6_acaddr_lst */
> > + struct rcu_head rcu;
> > +};
> > +
> >  struct ifacaddr6 {
> >   struct in6_addr aca_addr;
> >   struct fib6_info*aca_rt;
> > diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> > index 9a4261e50272..971a05fdd3bd 100644
> > --- a/net/ipv6/af_inet6.c
> > +++ b/net/ipv6/af_inet6.c
> > @@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
> >   err = ip6_flowlabel_init();
> >   if (err)
> >   goto ip6_flowlabel_fail;
> > + err = anycast_init();
> > + if (err)
> > + goto anycast_fail;
> >   err = addrconf_init();
> >   if (err)
> >   goto addrconf_fail;
> > @@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
> >  ipv6_exthdrs_fail:
> >   addrconf_cleanup();
> >  addrconf_fail:
> > + anycast_cleanup();
> > +anycast_fail:
> >   ip6_flowlabel_cleanup();
> >  ip6_flowlabel_fail:
> >   ndisc_late_cleanup();
> > diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
> > index 4e0ff7031edd..def1e156d857 100644
> > --- a/net/ipv6/anycast.c
> > +++ b/net/ipv6/anycast.c
> > @@ -44,8 +44,22 @@
> >
> >  #include 
> >
> > +#define IN6_ADDR_HSIZE_SHIFT 8
> > +#define IN6_ADDR_HSIZE   BIT(IN6_ADDR_HSIZE_SHIFT)
> > +/*   anycast address hash table
> > + */
> > +static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
> > +static DEFINE_SPINLOCK(acaddr_hash_lock);
> > +
> >  static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr 
> > *addr);
> >
> > +static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
> > +{
> > + u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
> > +
> > + return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
> > +}
> > +
> >  /*
> >   *   socket join an anycast group
> >   */
> > @@ -204,6 +218,83 @@ void ipv6_sock_ac_close(struct sock *sk)
> >

Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection

2018-10-23 Thread Subash Abhinov Kasiviswanathan

Is the "likely" required here?


Not required, but currently helpful IMHO, as we should hit the above
only on unlikey and really unwonted configuration.

Note that only SKB_GSO_UDP_L4 GSO packets will not match the above
likely condition.

HW can coalesce all incoming streams of UDP and may not know the 
socket

state.
In that case, a socket not having UDP GRO option might see a penalty
here.


Really? Is there any HW creating SKB_GSO_UDP_L4 packets on RX? if the
HW is doing that, without this patch, I think it's breaking existing
applications (which may expext that the read UDP frame length
implicitly describe the application level message length).



Hi

Yes, I agree that existing HW would not work without the patch.
My question was based on how UDP GRO packets from any future HW would 
interact
with this code path and if they might be potentially have any side 
effects

due to socket option not being set.

We do not have control over the application and the socket options being 
used
on systems like Android. The packet count reduction due to UDP GRO would 
help

when there are multiple firewall rules present even if we do not take
advantage of the reduced recvmsg() calls.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH net v2] net/ipv6: Add anycast addresses to a global hashtable

2018-10-23 Thread Eric Dumazet



On 10/23/2018 06:58 PM, Jeff Barnhill wrote:
> icmp6_send() function is expensive on systems with a large number of
> interfaces. Every time it’s called, it has to verify that the source
> address does not correspond to an existing anycast address by looping
> through every device and every anycast address on the device.  This can
> result in significant delays for a CPU when there are a large number of
> neighbors and ND timers are frequently timing out and calling
> neigh_invalidate().
> 
> Add anycast addresses to a global hashtable to allow quick searching for
> matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.
> 
> Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>
> ---
>  include/net/addrconf.h |   2 +
>  include/net/if_inet6.h |   8 +++
>  net/ipv6/af_inet6.c|   5 ++
>  net/ipv6/anycast.c | 132 
> -
>  4 files changed, 145 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/addrconf.h b/include/net/addrconf.h
> index 6def0351bcc3..0cee3f99c41d 100644
> --- a/include/net/addrconf.h
> +++ b/include/net/addrconf.h
> @@ -312,6 +312,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct 
> net_device *dev,
>const struct in6_addr *addr);
>  bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
>const struct in6_addr *addr);
> +int anycast_init(void);
> +void anycast_cleanup(void);
>  
>  /* Device notifier */
>  int register_inet6addr_notifier(struct notifier_block *nb);
> diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
> index d7578cf49c3a..55a4a1d8cebc 100644
> --- a/include/net/if_inet6.h
> +++ b/include/net/if_inet6.h
> @@ -142,6 +142,14 @@ struct ipv6_ac_socklist {
>   struct ipv6_ac_socklist *acl_next;
>  };
>  
> +struct ipv6_ac_addrlist {
> + struct in6_addr acal_addr;
> + possible_net_t  acal_pnet;
> + int acal_users;

That would be a refcount_t acal_users; so that CONFIG_REFCOUNT_FULL brings 
debugging for free.

> + struct hlist_node   acal_lst; /* inet6_acaddr_lst */
> + struct rcu_head rcu;
> +};
> +
>  struct ifacaddr6 {
>   struct in6_addr aca_addr;
>   struct fib6_info*aca_rt;
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 9a4261e50272..971a05fdd3bd 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
>   err = ip6_flowlabel_init();
>   if (err)
>   goto ip6_flowlabel_fail;
> + err = anycast_init();
> + if (err)
> + goto anycast_fail;
>   err = addrconf_init();
>   if (err)
>   goto addrconf_fail;
> @@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
>  ipv6_exthdrs_fail:
>   addrconf_cleanup();
>  addrconf_fail:
> + anycast_cleanup();
> +anycast_fail:
>   ip6_flowlabel_cleanup();
>  ip6_flowlabel_fail:
>   ndisc_late_cleanup();
> diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
> index 4e0ff7031edd..def1e156d857 100644
> --- a/net/ipv6/anycast.c
> +++ b/net/ipv6/anycast.c
> @@ -44,8 +44,22 @@
>  
>  #include 
>  
> +#define IN6_ADDR_HSIZE_SHIFT 8
> +#define IN6_ADDR_HSIZE   BIT(IN6_ADDR_HSIZE_SHIFT)
> +/*   anycast address hash table
> + */
> +static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
> +static DEFINE_SPINLOCK(acaddr_hash_lock);
> +
>  static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr 
> *addr);
>  
> +static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
> +{
> + u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
> +
> + return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
> +}
> +
>  /*
>   *   socket join an anycast group
>   */
> @@ -204,6 +218,83 @@ void ipv6_sock_ac_close(struct sock *sk)
>   rtnl_unlock();
>  }
>  
> +static struct ipv6_ac_addrlist *acal_alloc(struct net *net,
> +const struct in6_addr *addr)
> +{
> + struct ipv6_ac_addrlist *acal;
> +
> + acal = kzalloc(sizeof(*acal), GFP_ATOMIC);
> + if (!acal)
> + return NULL;
> +
> + acal->acal_addr = *addr;
> + write_pnet(&acal->acal_pnet, get_net(net));

I am not sure why you grab a reference on the netns.

The ipv6 address will be freed at some point before the netns disappears.
It would automatically remove the associated struct ipv6_ac_addrlist.

> + acal->acal_users = 1;
> + INIT_HLIST_NODE(&acal->acal_lst);
> +
> + return acal;
> +}
> +
> +static void acal_free_rcu(struct rcu_head *h)
> +{
> + struct ipv6_ac_addrlist *acal;
> +
> + acal = container_of(h, struct ipv6_ac_addrlist, rcu);
> + WARN_ON(acal->acal_users);

Not needed with refcount_t debugging infra.

> + put_net(read_pnet(&acal->acal_pnet));
> + kfree(acal);

So this could use kfree_rcu() in the caller, and get rid of acal_free_rcu() 
completely.

> +}

Re: [PATCH] r8169: Add new device ID support

2018-10-23 Thread David Miller
From: Shawn Lin 
Date: Wed, 24 Oct 2018 09:46:47 +0800

> It's found my r8169 ethernet card at hand has a device ID
> of 0x which wasn't on the list of rtl8169_pci_tbl. Add
> a new entry to make it work:
 ...
> 01:00.0 Class 0200: 10ec:

I don't know about this.

A value of zero could mean the device is mis-responding to
PCI config space requests or something like that.


[PATCH net v2] net/ipv6: Add anycast addresses to a global hashtable

2018-10-23 Thread Jeff Barnhill
icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device.  This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().

Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.

Signed-off-by: Jeff Barnhill <0xeff...@gmail.com>
---
 include/net/addrconf.h |   2 +
 include/net/if_inet6.h |   8 +++
 net/ipv6/af_inet6.c|   5 ++
 net/ipv6/anycast.c | 132 -
 4 files changed, 145 insertions(+), 2 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 6def0351bcc3..0cee3f99c41d 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -312,6 +312,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device 
*dev,
 const struct in6_addr *addr);
 bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
 const struct in6_addr *addr);
+int anycast_init(void);
+void anycast_cleanup(void);
 
 /* Device notifier */
 int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d7578cf49c3a..55a4a1d8cebc 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -142,6 +142,14 @@ struct ipv6_ac_socklist {
struct ipv6_ac_socklist *acl_next;
 };
 
+struct ipv6_ac_addrlist {
+   struct in6_addr acal_addr;
+   possible_net_t  acal_pnet;
+   int acal_users;
+   struct hlist_node   acal_lst; /* inet6_acaddr_lst */
+   struct rcu_head rcu;
+};
+
 struct ifacaddr6 {
struct in6_addr aca_addr;
struct fib6_info*aca_rt;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 9a4261e50272..971a05fdd3bd 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
err = ip6_flowlabel_init();
if (err)
goto ip6_flowlabel_fail;
+   err = anycast_init();
+   if (err)
+   goto anycast_fail;
err = addrconf_init();
if (err)
goto addrconf_fail;
@@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
 ipv6_exthdrs_fail:
addrconf_cleanup();
 addrconf_fail:
+   anycast_cleanup();
+anycast_fail:
ip6_flowlabel_cleanup();
 ip6_flowlabel_fail:
ndisc_late_cleanup();
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 4e0ff7031edd..def1e156d857 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -44,8 +44,22 @@
 
 #include 
 
+#define IN6_ADDR_HSIZE_SHIFT   8
+#define IN6_ADDR_HSIZE BIT(IN6_ADDR_HSIZE_SHIFT)
+/* anycast address hash table
+ */
+static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
+static DEFINE_SPINLOCK(acaddr_hash_lock);
+
 static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr 
*addr);
 
+static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+   u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
+
+   return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
+}
+
 /*
  * socket join an anycast group
  */
@@ -204,6 +218,83 @@ void ipv6_sock_ac_close(struct sock *sk)
rtnl_unlock();
 }
 
+static struct ipv6_ac_addrlist *acal_alloc(struct net *net,
+  const struct in6_addr *addr)
+{
+   struct ipv6_ac_addrlist *acal;
+
+   acal = kzalloc(sizeof(*acal), GFP_ATOMIC);
+   if (!acal)
+   return NULL;
+
+   acal->acal_addr = *addr;
+   write_pnet(&acal->acal_pnet, get_net(net));
+   acal->acal_users = 1;
+   INIT_HLIST_NODE(&acal->acal_lst);
+
+   return acal;
+}
+
+static void acal_free_rcu(struct rcu_head *h)
+{
+   struct ipv6_ac_addrlist *acal;
+
+   acal = container_of(h, struct ipv6_ac_addrlist, rcu);
+   WARN_ON(acal->acal_users);
+   put_net(read_pnet(&acal->acal_pnet));
+   kfree(acal);
+}
+
+static int ipv6_add_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+   unsigned int hash = inet6_acaddr_hash(net, addr);
+   struct ipv6_ac_addrlist *acal;
+   int err = 0;
+
+   spin_lock(&acaddr_hash_lock);
+   hlist_for_each_entry(acal, &inet6_acaddr_lst[hash], acal_lst) {
+   if (!net_eq(read_pnet(&acal->acal_pnet), net))
+   continue;
+   if (ipv6_addr_equal(&acal->acal_addr, addr)) {
+   acal->acal_users++;
+   goto out;
+   }
+   }
+
+   acal = acal_alloc(net, addr);
+   if (!acal) {
+   err = -ENOMEM

Re: [PATCH net-next] tcp: add tcp_reset_xmit_timer() helper

2018-10-23 Thread David Miller
From: Eric Dumazet 
Date: Tue, 23 Oct 2018 11:54:16 -0700

> With EDT model, SRTT no longer is inflated by pacing delays.
> 
> This means that RTO and some other xmit timers might be setup
> incorrectly. This is particularly visible with either :
> 
> - Very small enforced pacing rates (SO_MAX_PACING_RATE)
> - Reduced rto (from the default 200 ms)
> 
> This can lead to TCP flows aborts in the worst case,
> or spurious retransmits in other cases.
> 
> For example, this session gets far more throughput
> than the requested 80kbit :
> 
> $ netperf -H 127.0.0.2 -l 100 -- -q 1
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 
> () port 0 AF_INET
> Recv   SendSend
> Socket Socket  Message  Elapsed
> Size   SizeSize Time Throughput
> bytes  bytes   bytessecs.10^6bits/sec
> 
> 54 262144 262144104.00  2.66
> 
> With the fix :
> 
> $ netperf -H 127.0.0.2 -l 100 -- -q 1
> MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 
> () port 0 AF_INET
> Recv   SendSend
> Socket Socket  Message  Elapsed
> Size   SizeSize Time Throughput
> bytes  bytes   bytessecs.10^6bits/sec
> 
> 54 262144 262144104.00  0.12
> 
> EDT allows for better control of rtx timers, since TCP has
> a better idea of the earliest departure time of each skb
> in the rtx queue. We only have to eventually add to the
> timer the difference of the EDT time with current time.
> 
> Signed-off-by: Eric Dumazet 

Applied.


[PATCH net v2] net: udp: fix handling of CHECKSUM_COMPLETE packets

2018-10-23 Thread Sean Tranchetti
Current handling of CHECKSUM_COMPLETE packets by the UDP stack is
incorrect for any packet that has an incorrect checksum value.

udp4/6_csum_init() will both make a call to
__skb_checksum_validate_complete() to initialize/validate the csum
field when receiving a CHECKSUM_COMPLETE packet. When this packet
fails validation, skb->csum will be overwritten with the pseudoheader
checksum so the packet can be fully validated by software, but the
skb->ip_summed value will be left as CHECKSUM_COMPLETE so that way
the stack can later warn the user about their hardware spewing bad
checksums. Unfortunately, leaving the SKB in this state can cause
problems later on in the checksum calculation.

Since the the packet is still marked as CHECKSUM_COMPLETE,
udp_csum_pull_header() will SUBTRACT the checksum of the UDP header
from skb->csum instead of adding it, leaving us with a garbage value
in that field. Once we try to copy the packet to userspace in the
udp4/6_recvmsg(), we'll make a call to skb_copy_and_csum_datagram_msg()
to checksum the packet data and add it in the garbage skb->csum value
to perform our final validation check.

Since the value we're validating is not the proper checksum, it's possible
that the folded value could come out to 0, causing us not to drop the
packet. Instead, we believe that the packet was checksummed incorrectly
by hardware since skb->ip_summed is still CHECKSUM_COMPLETE, and we attempt
to warn the user with netdev_rx_csum_fault(skb->dev);

Unfortunately, since this is the UDP path, skb->dev has been overwritten
by skb->dev_scratch and is no longer a valid pointer, so we end up
reading invalid memory.

This patch addresses this problem in two ways:
1) Do not use the dev pointer when calling netdev_rx_csum_fault()
   from skb_copy_and_csum_datagram_msg(). Since this gets called
   from the UDP path where skb->dev has been overwritten, we have
   no way of knowing if the pointer is still valid. Also for the
   sake of consistency with the other uses of
   netdev_rx_csum_fault(), don't attempt to call it if the
   packet was checksummed by software.

2) Add better CHECKSUM_COMPLETE handling to udp4/6_csum_init().
   If we receive a packet that's CHECKSUM_COMPLETE that fails
   verification (i.e. skb->csum_valid == 0), check who performed
   the calculation. It's possible that the checksum was done in
   software by the network stack earlier (such as Netfilter's
   CONNTRACK module), and if that says the checksum is bad,
   we can drop the packet immediately instead of waiting until
   we try and copy it to userspace. Otherwise, we need to
   mark the SKB as CHECKSUM_NONE, since the skb->csum field
   no longer contains the full packet checksum after the
   call to __skb_checksum_validate_complete().

Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing")
Fixes: c84d949057ca ("udp: copy skb->truesize in the first cache line")
Cc: Sam Kumar 
Cc: Eric Dumazet 
Signed-off-by: Sean Tranchetti 
---
 net/core/datagram.c |  5 +++--
 net/ipv4/udp.c  | 20 ++--
 net/ipv6/ip6_checksum.c | 20 ++--
 3 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 9aac0d6..df16493 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -808,8 +808,9 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb,
return -EINVAL;
}
 
-   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE))
-   netdev_rx_csum_fault(skb->dev);
+   if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
+   !skb->csum_complete_sw)
+   netdev_rx_csum_fault(NULL);
}
return 0;
 fault:
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index c32a4c1..f8183fd 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2120,8 +2120,24 @@ static inline int udp4_csum_init(struct sk_buff *skb, 
struct udphdr *uh,
/* Note, we are only interested in != 0 or == 0, thus the
 * force to int.
 */
-   return (__force int)skb_checksum_init_zero_check(skb, proto, uh->check,
-inet_compute_pseudo);
+   err = (__force int)skb_checksum_init_zero_check(skb, proto, uh->check,
+   inet_compute_pseudo);
+   if (err)
+   return err;
+
+   if (skb->ip_summed == CHECKSUM_COMPLETE && !skb->csum_valid) {
+   /* If SW calculated the value, we know it's bad */
+   if (skb->csum_complete_sw)
+   return 1;
+
+   /* HW says the value is bad. Let's validate that.
+* skb->csum is no longer the full packet checksum,
+* so don't treat it as such.
+  

Re: [PATCH nf] netfilter: ipv6: fix oops when defragmenting locally generated fragments

2018-10-23 Thread Maciej Żenczykowski
On Tue, Oct 23, 2018 at 7:47 AM, Florian Westphal  wrote:
> Unlike ipv4 and normal ipv6 defrag, netfilter ipv6 defragmentation did
> not save/restore skb->dst.
>
> This causes oops when handling locally generated ipv6 fragments, as
> output path needs a valid dst.
>
> Reported-by: Maciej Żenczykowski 
> Fixes: 84379c9afe01 ("netfilter: ipv6: nf_defrag: drop skb dst before 
> queueing")
> Signed-off-by: Florian Westphal 
> ---
>  net/ipv6/netfilter/nf_conntrack_reasm.c | 13 +
>  1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c 
> b/net/ipv6/netfilter/nf_conntrack_reasm.c
> index 8f68a518d9db..f76bd4d15704 100644
> --- a/net/ipv6/netfilter/nf_conntrack_reasm.c
> +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
> @@ -587,11 +587,16 @@ int nf_ct_frag6_gather(struct net *net, struct sk_buff 
> *skb, u32 user)
>  */
> ret = -EINPROGRESS;
> if (fq->q.flags == (INET_FRAG_FIRST_IN | INET_FRAG_LAST_IN) &&
> -   fq->q.meat == fq->q.len &&
> -   nf_ct_frag6_reasm(fq, skb, dev))
> -   ret = 0;
> -   else
> +   fq->q.meat == fq->q.len) {
> +   unsigned long orefdst = skb->_skb_refdst;
> +
> +   skb->_skb_refdst = 0UL;
> +   if (nf_ct_frag6_reasm(fq, skb, dev))
> +   ret = 0;
> +   skb->_skb_refdst = orefdst;
> +   } else {
> skb_dst_drop(skb);
> +   }
>
>  out_unlock:
> spin_unlock_bh(&fq->q.lock);
> --
> 2.18.1
>

I don't quite follow how this fixes things, but I'll trust you on it.
(nor do I understand why only 4.9 LTS appears to crash with a null ptr deref)

Thanks for the fix.


Re: [PATCH v2] wireless: mark expected switch fall-throughs

2018-10-23 Thread Johannes Berg
On Tue, 2018-10-23 at 12:58 +0200, Gustavo A. R. Silva wrote:
> On 10/23/18 10:59 AM, Gustavo A. R. Silva wrote:
> > 
> > On 10/23/18 9:01 AM, Johannes Berg wrote:
> > > On Tue, 2018-10-23 at 02:13 +0200, Gustavo A. R. Silva wrote:
> > > > In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> > > > where we are expecting to fall through.
> > > > 
> > > > Warning level 3 was used: -Wimplicit-fallthrough=3
> > > > 
> > > > This code was not tested and GCC 7.2.0 was used to compile it.
> > > 
> > > Look, I'm not going to make this any clearer: I'm not applying patches
> > > like that where you've invested no effort whatsoever on verifying that
> > > they're correct.
> > > 
> > 
> > How do you suggest me to verify that every part is correct in this type
> > of patches?
> > 
> 
> BTW... I'm under the impression you think that I don't even look at
> the code. Is that correct?

That's what your commit log looks like, yes. This is your full commit
log:

   In preparation to enabling -Wimplicit-fallthrough, mark switch cases
   where we are expecting to fall through.

   Warning level 3 was used: -Wimplicit-fallthrough=3

   This code was not tested and GCC 7.2.0 was used to compile it.

   For all I know, you could've run spatch to add the comments wherever
   there was no break, and then compiled it once.

   > I've been working on this for quite a while, and in every case I try to
> understand the code in terms of the context in which every warning is
> reported.

That's great.

> Here is a bug I found yesterday at 
> drivers/net/wireless/realtek/rtl8xxxu/rtl8xxxu_core.c
> 
> 5690 case WLAN_CIPHER_SUITE_CCMP:
> 5691 key->flags |= IEEE80211_KEY_FLAG_SW_MGMT_TX;
> 5692 break;
> 5693 case WLAN_CIPHER_SUITE_TKIP:
> 5694 key->flags |= IEEE80211_KEY_FLAG_GENERATE_MMIC;
> 5695 default:
> 5696 return -EOPNOTSUPP;
> 5697 }

Indeed, that looks like a bug, although kinda benign since it just means
TKIP will always use software crypto and TKIP is slow anyway ;-)

> I do this analysis for every warning.  Now, when I say I haven't tested the 
> code
> is because I don't have any log as evidence for anything. Not that I haven't 
> put
> any effort on trying to understand it and its context.  When I started 
> working on
> this task there were more than 2000 of these issues, now there are around 600 
> left.
> 
> I have fixed many bugs on the way, so a good amount of work is being invested 
> on
> this, and it's paying off. :)

:-)

> Now, let me ask you this question:
> 
> It would be easier for you to review this patch if I turn it into a series?
> 
> I can do that without a problem.

I'd be happy if you were to actually just mention that you've looked at
them, and found them to be expected fall throughs. I'll still review
them, but without that information I feel like I'm doing the first round
of reviews this code ever got.

johannes



Re: Kernel oops with mlx5 and dual XDP redirect programs

2018-10-23 Thread Toke Høiland-Jørgensen
Saeed Mahameed  writes:

> On Tue, 2018-10-23 at 12:10 +0200, Toke Høiland-Jørgensen wrote:
>> Saeed Mahameed  writes:
>> 
>> > On Thu, 2018-10-18 at 23:53 +0200, Toke Høiland-Jørgensen wrote:
>> > > Saeed Mahameed  writes:
>> > > 
>> > > > I think that the mlx5 driver doesn't know how to tell the other
>> > > > device
>> > > > to stop transmitting to it while it is resetting.. Maybe tariq
>> > > > or
>> > > > Jesper know more about this ?
>> > > > I will look at this tomorrow after noon and will try to
>> > > > repro...
>> > > 
>> > > Hi Saeed
>> > > 
>> > > Did you have a chance to poke at this? :)
>> > 
>> > HI Toke, yes i have been planing to respond but also i wanted to
>> > dig
>> > more,
>> > 
>> > so the root cause is very clear.
>> > 
>> > 1. core 1 is doing tx_dev->ndo_xdp_xmit()
>> > 2. core 2 is doing tx_dev->xdp_set() //remove xdp program.
>> 
>> Right, it was also my guess that it was related to this interaction.
>> Thanks for looking into it!
>> 
>> > and the problem is beyond mlx5, since we don't have a way to tell a
>> > different core/different netdev to stop xmitting, or at least
>> > synchronize with it.
>> 
>> Hmm, ideally there should be some way for the higher level XDP API to
>> notice this and abort the call before it even reaches the driver on
>> the
>> TX side, shouldn't there? At LPC, Jesper and I will be talking about
>> a
>> proposal for decoupling the ndo_xdp_xmit() resource allocation from
>> loading and unloading XDP programs, which I guess could be a way to
>> deal
>> with this as well.
>> 
>> In the meantime...
>> 
>
> Yes totally agree, this is why my fix is temporary. 
> Good Idea about LPC, let's discuss this there.
>
>> > I will be waiting for your confirmation that the fix did work.
>> 
>> I tested your patch, and it does indeed fix the crash. However, it
>> also
>> seems to have the effect that the XDP redirect continues to function
>> even after removing the XDP program on the target device.
>> 
>> I.e., after the call to ./xdp_fwd -d $TX_IF, I still see packets
>> being
>> redirected out $TX_IF. Is this intentional?
>> 
>
> Interesting, shouldn't happen, unless there is something weird going on
> when running xpd_fwd -d together with xdp_redirect_map, i just checked
> the code and if ndo_xdp_set was called with null program we will remove
> xdp tx resources, nothing suspicious in the driver.
>
> I will look at this later this week.

Cool. Let me know if you need anything more from me :)

-Toke


Re: ethernet "bus" number in DTS ?

2018-10-23 Thread Florian Fainelli
On 10/23/18 1:02 PM, Joakim Tjernlund wrote:
> On Tue, 2018-10-23 at 11:20 -0700, Florian Fainelli wrote:
>>
>> On 10/23/18 11:02 AM, Joakim Tjernlund wrote:
>>> On Tue, 2018-10-23 at 10:03 -0700, Florian Fainelli wrote:


 On 10/23/18 9:49 AM, Joakim Tjernlund wrote:
> SPI (and others) has a way to define bus number in a aliases:
>   aliases {
>   ethernet4 = &enet4;
>   ethernet0 = &enet0;
>   ethernet1 = &enet1;
>   ethernet2 = &enet2;
>   ethernet3 = &enet3;
>   spi0 = &spi0
>   };
> The 0 in the spi0 alias will translate to bus num 0 so one can control 
> the /dev nodes, like /dev/spidev0
> I am looking for the same for ethernet devices:
>  ethernet4 = &enet4;  /* should become eth4 */
>  ethernet0 = &enet0;  /* should become eth0 */
> but I cannot find something like that for eth devices.
>
> Could such functionality be added?

 It could, do we want and need to, no. You have the Ethernet alias in
 /sys/class/net/*/device/uevent already that would allow you to perform
 that (re)naming in user-space:

 # cat /sys/class/net/eth0/device/uevent
 DRIVER=bcmgenet
 OF_NAME=ethernet
 OF_FULLNAME=/rdb/ethernet@f048
 OF_TYPE=network
 OF_COMPATIBLE_0=brcm,genet-v5
 OF_COMPATIBLE_N=1
 OF_ALIAS_0=eth0 <==
 MODALIAS=of:NethernetTnetworkCbrcm,genet-v5
>>>
>>> Yes, one can if one uses udev and can find something to identify the hw I/F 
>>> with, my
>>> cat /sys/class/net/eth0/device/uevent looks like:
>>> DRIVER=fsl_dpa
>>> MODALIAS=platform:dpaa-ethernet
>>
>> Does not dpaa have a notion of Ethernet ports and those should have
>> proper information? Maybe that is part of your problem here, it should
>> have the OF_ALIAS information somehow available.
> 
> I cannot say ATM, but this lack of standard does not make it easier to rename 
> I/F's in udev.
> 
>>
>>> not sure mdev supports this, does it?
>>> Our simple installer FS(initramfs) doesn't have either udev or mdev.
>>
>> I don't know, but you could have a simple shell script that looks at
>> specific network device properties to decide on the naming and call
>> ifrename.
> 
> This reinventing of the wheel is what I am trying to avoid.

Embedded is all about being a special snowflake and re-inventing the
wheel, but having some desktop-like distribution user-space would
certainly allow you to re-invent other parts of the wheel.

> 
>>
>>> I also noted that using status = "disabled" didn't work either to create a 
>>> fix name scheme.
>>> Even worse, all the eth I/F after gets renumbered. It seems to me there
>>> is value in having stability in eth I/F naming at boot.
>>> Then userspace(udev) can rename if need be.
>>>
>>> Sure would like to known more about why this feature is not wanted ?
>>>
>>> I found
>>>   https://patchwork.kernel.org/patch/4122441/
>>> You quote policy as reason but surely it must be better to
>>> have something stable, connected to the hardware name, than semirandom 
>>> naming?
>>
>> If the Device Tree nodes are ordered by ascending base register address,
>> my understanding is that you get the same order as far as
>> platform_device creation goes, this may not be true in the future if Rob
>> decides to randomize that, but AFAICT this is still true. This may not
>> work well with status = disabled properties being inserted here and
>> there, but we have used that here and it has worked for as far as I can
>> remember doing it.
> 
> I recall it is the order in which the eth alias appear that controls the 
> naming,
> not 100% sure though.

Aliases are not looked up at all by the platform bus code other that
with of_get_alias() and friends, it is the order in which the nodes are
declared in the Device Tree, preferably ordered by base address that
dictates the order in which platform devices are created.

> 
>>
>> Second, you might want to name network devices ethX, but what if I want
>> to name them ethernetX or fooX or barX? Should we be accepting a
>> mechanism in the kernel that would allow someone to name the interfaces
>> the way they want straight from a name being provided in Device Tree?
> 
> I just want to have stable boot names, aka ethX, which can defined in
> the platforms DT. Then userspace can go from there to whatever it needs,
> udev could possibly use these stable boot names to identify the I/F's to 
> rename.
> 
> ATM, it is pretty hard to even use udev when /sys/class/net/eth0/device/uevent
> can look different even for OF created interfaces.

network devices have a gazillion of other sysfs attributes that all make
them unique enough to create stable names.

> 
>>
>> Aliases are fine for providing relative stability within the Device Tree
>> itself and boot programs that might need to modify the Device Tree (e.g:
>> inserting MAC addresses) such that you don't have to encode l

Re: ethernet "bus" number in DTS ?

2018-10-23 Thread Joakim Tjernlund
On Tue, 2018-10-23 at 11:20 -0700, Florian Fainelli wrote:
> 
> On 10/23/18 11:02 AM, Joakim Tjernlund wrote:
> > On Tue, 2018-10-23 at 10:03 -0700, Florian Fainelli wrote:
> > > 
> > > 
> > > On 10/23/18 9:49 AM, Joakim Tjernlund wrote:
> > > > SPI (and others) has a way to define bus number in a aliases:
> > > >   aliases {
> > > >   ethernet4 = &enet4;
> > > >   ethernet0 = &enet0;
> > > >   ethernet1 = &enet1;
> > > >   ethernet2 = &enet2;
> > > >   ethernet3 = &enet3;
> > > >   spi0 = &spi0
> > > >   };
> > > > The 0 in the spi0 alias will translate to bus num 0 so one can control 
> > > > the /dev nodes, like /dev/spidev0
> > > > I am looking for the same for ethernet devices:
> > > >  ethernet4 = &enet4;  /* should become eth4 */
> > > >  ethernet0 = &enet0;  /* should become eth0 */
> > > > but I cannot find something like that for eth devices.
> > > > 
> > > > Could such functionality be added?
> > > 
> > > It could, do we want and need to, no. You have the Ethernet alias in
> > > /sys/class/net/*/device/uevent already that would allow you to perform
> > > that (re)naming in user-space:
> > > 
> > > # cat /sys/class/net/eth0/device/uevent
> > > DRIVER=bcmgenet
> > > OF_NAME=ethernet
> > > OF_FULLNAME=/rdb/ethernet@f048
> > > OF_TYPE=network
> > > OF_COMPATIBLE_0=brcm,genet-v5
> > > OF_COMPATIBLE_N=1
> > > OF_ALIAS_0=eth0 <==
> > > MODALIAS=of:NethernetTnetworkCbrcm,genet-v5
> > 
> > Yes, one can if one uses udev and can find something to identify the hw I/F 
> > with, my
> > cat /sys/class/net/eth0/device/uevent looks like:
> > DRIVER=fsl_dpa
> > MODALIAS=platform:dpaa-ethernet
> 
> Does not dpaa have a notion of Ethernet ports and those should have
> proper information? Maybe that is part of your problem here, it should
> have the OF_ALIAS information somehow available.

I cannot say ATM, but this lack of standard does not make it easier to rename 
I/F's in udev.

> 
> > not sure mdev supports this, does it?
> > Our simple installer FS(initramfs) doesn't have either udev or mdev.
> 
> I don't know, but you could have a simple shell script that looks at
> specific network device properties to decide on the naming and call
> ifrename.

This reinventing of the wheel is what I am trying to avoid.

> 
> > I also noted that using status = "disabled" didn't work either to create a 
> > fix name scheme.
> > Even worse, all the eth I/F after gets renumbered. It seems to me there
> > is value in having stability in eth I/F naming at boot.
> > Then userspace(udev) can rename if need be.
> > 
> > Sure would like to known more about why this feature is not wanted ?
> > 
> > I found
> >   https://patchwork.kernel.org/patch/4122441/
> > You quote policy as reason but surely it must be better to
> > have something stable, connected to the hardware name, than semirandom 
> > naming?
> 
> If the Device Tree nodes are ordered by ascending base register address,
> my understanding is that you get the same order as far as
> platform_device creation goes, this may not be true in the future if Rob
> decides to randomize that, but AFAICT this is still true. This may not
> work well with status = disabled properties being inserted here and
> there, but we have used that here and it has worked for as far as I can
> remember doing it.

I recall it is the order in which the eth alias appear that controls the naming,
not 100% sure though.

> 
> Second, you might want to name network devices ethX, but what if I want
> to name them ethernetX or fooX or barX? Should we be accepting a
> mechanism in the kernel that would allow someone to name the interfaces
> the way they want straight from a name being provided in Device Tree?

I just want to have stable boot names, aka ethX, which can defined in
the platforms DT. Then userspace can go from there to whatever it needs,
udev could possibly use these stable boot names to identify the I/F's to rename.

ATM, it is pretty hard to even use udev when /sys/class/net/eth0/device/uevent
can look different even for OF created interfaces.

> 
> Aliases are fine for providing relative stability within the Device Tree
> itself and boot programs that might need to modify the Device Tree (e.g:
> inserting MAC addresses) such that you don't have to encode logic to
> search for nodes by compatible strings etc. but outside of that use
> case, it seems to me that you can resolve every naming decision in
> user-space.

Well, you can resolve MAC address assignment in user space too but most
will agree that is not convenient. I suggest it is also handy to have
some control of I/F enumeration(ethX that is) from platform code like DT.

 Jocke


Re: [RFC PATCH] cgroup, netclassid: add a preemption point to write_classid

2018-10-23 Thread Tejun Heo
On Thu, Oct 18, 2018 at 10:56:17AM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> We have seen a customer complaining about soft lockups on !PREEMPT
> kernel config with 4.4 based kernel
...
> If a cgroup has many tasks with many open file descriptors then we would
> end up in a large loop without any rescheduling point throught the
> operation. Add cond_resched once per task.
> 
> Signed-off-by: Michal Hocko 

Applied to cgroup/for-4.20.

Thanks.

-- 
tejun


Re: [PATCH v2] rtlwifi: remove set but not used variable 'radiob_array_table' and 'radiob_arraylen'

2018-10-23 Thread Joe Perches
On Tue, 2018-10-23 at 16:28 +0800, zhong jiang wrote:
> radiob_array_table' and 'radiob_arraylen' are not used after setting its 
> value.
> It is safe to remove the unused variable. Meanwhile, radio B radio should be
> removed as well. because it will no longer be referenced.

The patch subject is a bit off and too generic here.

This is specific to rtl8723ae and not rtlwifi so it is
probably better for the subject to be something like:

[PATCH] rtl8723ae: Remove set but not used variables and #defines

> Signed-off-by: zhong jiang 
> ---
>  drivers/net/wireless/realtek/rtlwifi/rtl8723ae/phy.c   | 5 +
>  drivers/net/wireless/realtek/rtlwifi/rtl8723ae/table.c | 4 
>  drivers/net/wireless/realtek/rtlwifi/rtl8723ae/table.h | 2 --



[PATCH net-next] tcp: add tcp_reset_xmit_timer() helper

2018-10-23 Thread Eric Dumazet
With EDT model, SRTT no longer is inflated by pacing delays.

This means that RTO and some other xmit timers might be setup
incorrectly. This is particularly visible with either :

- Very small enforced pacing rates (SO_MAX_PACING_RATE)
- Reduced rto (from the default 200 ms)

This can lead to TCP flows aborts in the worst case,
or spurious retransmits in other cases.

For example, this session gets far more throughput
than the requested 80kbit :

$ netperf -H 127.0.0.2 -l 100 -- -q 1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () 
port 0 AF_INET
Recv   SendSend
Socket Socket  Message  Elapsed
Size   SizeSize Time Throughput
bytes  bytes   bytessecs.10^6bits/sec

54 262144 262144104.00  2.66

With the fix :

$ netperf -H 127.0.0.2 -l 100 -- -q 1
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () 
port 0 AF_INET
Recv   SendSend
Socket Socket  Message  Elapsed
Size   SizeSize Time Throughput
bytes  bytes   bytessecs.10^6bits/sec

54 262144 262144104.00  0.12

EDT allows for better control of rtx timers, since TCP has
a better idea of the earliest departure time of each skb
in the rtx queue. We only have to eventually add to the
timer the difference of the EDT time with current time.

Signed-off-by: Eric Dumazet 
---
 include/net/tcp.h | 30 +++---
 net/ipv4/tcp_input.c  |  8 
 net/ipv4/tcp_output.c | 18 ++
 3 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 
8a61c3e8c15dfaf1095490814288c41c13a629f3..a18914d204864e3016fe8121ec7065da6696f9ef
 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1245,8 +1245,31 @@ static inline bool tcp_needs_internal_pacing(const 
struct sock *sk)
return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
 }
 
+/* Return in jiffies the delay before one skb is sent.
+ * If @skb is NULL, we look at EDT for next packet being sent on the socket.
+ */
+static inline unsigned long tcp_pacing_delay(const struct sock *sk,
+const struct sk_buff *skb)
+{
+   s64 pacing_delay = skb ? skb->tstamp : tcp_sk(sk)->tcp_wstamp_ns;
+
+   pacing_delay -= tcp_sk(sk)->tcp_clock_cache;
+
+   return pacing_delay > 0 ? nsecs_to_jiffies(pacing_delay) : 0;
+}
+
+static inline void tcp_reset_xmit_timer(struct sock *sk,
+   const int what,
+   unsigned long when,
+   const unsigned long max_when,
+   const struct sk_buff *skb)
+{
+   inet_csk_reset_xmit_timer(sk, what, when + tcp_pacing_delay(sk, skb),
+ max_when);
+}
+
 /* Something is really bad, we could not queue an additional packet,
- * because qdisc is full or receiver sent a 0 window.
+ * because qdisc is full or receiver sent a 0 window, or we are paced.
  * We do not want to add fuel to the fire, or abort too early,
  * so make sure the timer we arm now is at least 200ms in the future,
  * regardless of current icsk_rto value (as it could be ~2ms)
@@ -1268,8 +1291,9 @@ static inline unsigned long tcp_probe0_when(const struct 
sock *sk,
 static inline void tcp_check_probe_timer(struct sock *sk)
 {
if (!tcp_sk(sk)->packets_out && !inet_csk(sk)->icsk_pending)
-   inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
- tcp_probe0_base(sk), TCP_RTO_MAX);
+   tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
+tcp_probe0_base(sk), TCP_RTO_MAX,
+NULL);
 }
 
 static inline void tcp_init_wl(struct tcp_sock *tp, u32 seq)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
188980c58f873ffdc81c018dcab0996f603cd7eb..2868ef28ce52179b3c5874e749b680ffbdc0521a
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2979,8 +2979,8 @@ void tcp_rearm_rto(struct sock *sk)
 */
rto = usecs_to_jiffies(max_t(int, delta_us, 1));
}
-   inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
- TCP_RTO_MAX);
+   tcp_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto,
+TCP_RTO_MAX, tcp_rtx_queue_head(sk));
}
 }
 
@@ -3255,8 +3255,8 @@ static void tcp_ack_probe(struct sock *sk)
} else {
unsigned long when = tcp_probe0_when(sk, TCP_RTO_MAX);
 
-   inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
- when, TCP_RTO_MAX);
+   tcp_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
+when, TCP_RTO_MAX, NULL);
}
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4

Re: [PATCH] Revert "be2net: remove desc field from be_eq_obj"

2018-10-23 Thread Ivan Vecera
On 23.10.2018 20:03, David Miller wrote:
> From: Ivan Vecera 
> Date: Tue, 23 Oct 2018 16:40:26 +0200
> 
>> The mentioned commit needs to be reverted because we cannot pass
>> string allocated on stack to request_irq(). This function stores
>> uses this pointer for later use (e.g. /proc/interrupts) so we need
>> to keep this string persistently.
>>
>> Fixes: d6d9704af8f4 ("be2net: remove desc field from be_eq_obj")
>>
>> Signed-off-by: Ivan Vecera 
> 
> Please do not put empty lines between Fixes and the other tags,
> all tags should appear together as one contiguous group.
Sure Dave... sorry for that.

Ivan


Re: [PATCH net] net/ipv6: Add anycast addresses to a global hashtable

2018-10-23 Thread Jeff Barnhill
Thanks! You are right. I mis-understood net->ifindex.  I think I need
to instead hold the net pointer in the new ipv6_ac_addrlist structure.
Do you see a problem with that?
On Mon, Oct 22, 2018 at 10:26 PM Eric Dumazet  wrote:
>
>
>
> On 10/22/2018 07:12 PM, Jeff Barnhill wrote:
> > icmp6_send() function is expensive on systems with a large number of
> > interfaces. Every time it’s called, it has to verify that the source
> > address does not correspond to an existing anycast address by looping
> > through every device and every anycast address on the device.  This can
> > result in significant delays for a CPU when there are a large number of
> > neighbors and ND timers are frequently timing out and calling
> > neigh_invalidate().
> >
> > Add anycast addresses to a global hashtable to allow quick searching for
> > matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.
> >
>
> I do not see this patch being netns aware ?
>
> Also I believe you misunderstood what was stored in net->ifindex
> You can look at dev_new_index() for what I mean.
>


Re: ethernet "bus" number in DTS ?

2018-10-23 Thread Florian Fainelli
On 10/23/18 11:02 AM, Joakim Tjernlund wrote:
> On Tue, 2018-10-23 at 10:03 -0700, Florian Fainelli wrote:
>> CAUTION: This email originated from outside of the organization. Do not 
>> click links or open attachments unless you recognize the sender and know the 
>> content is safe.
>>
>>
>> On 10/23/18 9:49 AM, Joakim Tjernlund wrote:
>>> SPI (and others) has a way to define bus number in a aliases:
>>>   aliases {
>>>   ethernet4 = &enet4;
>>>   ethernet0 = &enet0;
>>>   ethernet1 = &enet1;
>>>   ethernet2 = &enet2;
>>>   ethernet3 = &enet3;
>>>   spi0 = &spi0
>>>   };
>>> The 0 in the spi0 alias will translate to bus num 0 so one can control the 
>>> /dev nodes, like /dev/spidev0
>>> I am looking for the same for ethernet devices:
>>>  ethernet4 = &enet4;  /* should become eth4 */
>>>  ethernet0 = &enet0;  /* should become eth0 */
>>> but I cannot find something like that for eth devices.
>>>
>>> Could such functionality be added?
>>
>> It could, do we want and need to, no. You have the Ethernet alias in
>> /sys/class/net/*/device/uevent already that would allow you to perform
>> that (re)naming in user-space:
>>
>> # cat /sys/class/net/eth0/device/uevent
>> DRIVER=bcmgenet
>> OF_NAME=ethernet
>> OF_FULLNAME=/rdb/ethernet@f048
>> OF_TYPE=network
>> OF_COMPATIBLE_0=brcm,genet-v5
>> OF_COMPATIBLE_N=1
>> OF_ALIAS_0=eth0 <==
>> MODALIAS=of:NethernetTnetworkCbrcm,genet-v5
> 
> Yes, one can if one uses udev and can find something to identify the hw I/F 
> with, my
> cat /sys/class/net/eth0/device/uevent looks like:
> DRIVER=fsl_dpa
> MODALIAS=platform:dpaa-ethernet

Does not dpaa have a notion of Ethernet ports and those should have
proper information? Maybe that is part of your problem here, it should
have the OF_ALIAS information somehow available.

> 
> not sure mdev supports this, does it?
> Our simple installer FS(initramfs) doesn't have either udev or mdev.

I don't know, but you could have a simple shell script that looks at
specific network device properties to decide on the naming and call
ifrename.

> 
> I also noted that using status = "disabled" didn't work either to create a 
> fix name scheme.
> Even worse, all the eth I/F after gets renumbered. It seems to me there
> is value in having stability in eth I/F naming at boot.
> Then userspace(udev) can rename if need be. 
> 
> Sure would like to known more about why this feature is not wanted ?
> 
> I found
>   https://patchwork.kernel.org/patch/4122441/
> You quote policy as reason but surely it must be better to
> have something stable, connected to the hardware name, than semirandom naming?

If the Device Tree nodes are ordered by ascending base register address,
my understanding is that you get the same order as far as
platform_device creation goes, this may not be true in the future if Rob
decides to randomize that, but AFAICT this is still true. This may not
work well with status = disabled properties being inserted here and
there, but we have used that here and it has worked for as far as I can
remember doing it.

Second, you might want to name network devices ethX, but what if I want
to name them ethernetX or fooX or barX? Should we be accepting a
mechanism in the kernel that would allow someone to name the interfaces
the way they want straight from a name being provided in Device Tree?

Aliases are fine for providing relative stability within the Device Tree
itself and boot programs that might need to modify the Device Tree (e.g:
inserting MAC addresses) such that you don't have to encode logic to
search for nodes by compatible strings etc. but outside of that use
case, it seems to me that you can resolve every naming decision in
user-space.
-- 
Florian


Re: [PATCH] bonding:avoid repeated display of same link status change

2018-10-23 Thread David Miller
From: mk.si...@oracle.com
Date: Tue, 23 Oct 2018 20:59:24 +0530

> @@ -229,6 +229,7 @@ struct bonding {
>   struct   dentry *debug_dir;
>  #endif /* CONFIG_DEBUG_FS */
>   struct rtnl_link_stats64 bond_stats;
> + atomic_t rtnl_needed;

As mentioned by others, if the only operations you perform on a value
are set and read, using atomic_t is utterly and totally pointless.

I really have no idea what is achieved by using atomic_t in this set
of circumstances.

It is not guaranteeing that the value stays stable after you read it,
and it is not guaranteeing that another thread won't overwrite the
value you just set it to.

All of those things, if important, need proper synchronization.  An
atomic_t by itself will not do that for you.


Re: [PATCH net 1/1] qed: Fix static checker warning

2018-10-23 Thread David Miller
From: Rahul Verma 
Date: Tue, 23 Oct 2018 08:04:24 -0700

> From: Rahul Verma 
> 
>   Static Checker Warnings:
>   drivers/net/ethernet/qlogic/qed/qed_main.c:1510 
> qed_fill_link_capability()
>   error: uninitialized symbol 'tcvr_state'.
>   drivers/net/ethernet/qlogic/qed/qed_mcp.c:1951 
> qed_mcp_trans_speed_mask()
>   error: uninitialized symbol 'transceiver_state'.
>   drivers/net/ethernet/qlogic/qed/qed_mcp.c:1951 
> qed_mcp_trans_speed_mask()
>   error: uninitialized symbol 'transceiver_type'.
> 
>   Symbols tcvr_state, transceiver_state and transceiver_type
>   are initialized with respective default state.
> 
> Fixes: c56a8be7e7aa (qed: Add supported link and advertise link to display in 
> ethtool.)

Referenced commits shall be of the form:

SHA1_ID ("Commit header text.")

Meaning that text must be contained inside of parenthesis and also double 
quotes.
Please do not forget the double quotes in the future.

> Reported-by: Dan Carpenter 
> Signed-off-by: Rahul Verma 

Fixed, and applied.

Thanks.


Re: [PATCH] Revert "be2net: remove desc field from be_eq_obj"

2018-10-23 Thread David Miller
From: Ivan Vecera 
Date: Tue, 23 Oct 2018 16:40:26 +0200

> The mentioned commit needs to be reverted because we cannot pass
> string allocated on stack to request_irq(). This function stores
> uses this pointer for later use (e.g. /proc/interrupts) so we need
> to keep this string persistently.
> 
> Fixes: d6d9704af8f4 ("be2net: remove desc field from be_eq_obj")
> 
> Signed-off-by: Ivan Vecera 

Please do not put empty lines between Fixes and the other tags,
all tags should appear together as one contiguous group.

Fixed and applied and queued up for -stable.


Re: ethernet "bus" number in DTS ?

2018-10-23 Thread Joakim Tjernlund
On Tue, 2018-10-23 at 10:03 -0700, Florian Fainelli wrote:
> CAUTION: This email originated from outside of the organization. Do not click 
> links or open attachments unless you recognize the sender and know the 
> content is safe.
> 
> 
> On 10/23/18 9:49 AM, Joakim Tjernlund wrote:
> > SPI (and others) has a way to define bus number in a aliases:
> >   aliases {
> >   ethernet4 = &enet4;
> >   ethernet0 = &enet0;
> >   ethernet1 = &enet1;
> >   ethernet2 = &enet2;
> >   ethernet3 = &enet3;
> >   spi0 = &spi0
> >   };
> > The 0 in the spi0 alias will translate to bus num 0 so one can control the 
> > /dev nodes, like /dev/spidev0
> > I am looking for the same for ethernet devices:
> >  ethernet4 = &enet4;  /* should become eth4 */
> >  ethernet0 = &enet0;  /* should become eth0 */
> > but I cannot find something like that for eth devices.
> > 
> > Could such functionality be added?
> 
> It could, do we want and need to, no. You have the Ethernet alias in
> /sys/class/net/*/device/uevent already that would allow you to perform
> that (re)naming in user-space:
> 
> # cat /sys/class/net/eth0/device/uevent
> DRIVER=bcmgenet
> OF_NAME=ethernet
> OF_FULLNAME=/rdb/ethernet@f048
> OF_TYPE=network
> OF_COMPATIBLE_0=brcm,genet-v5
> OF_COMPATIBLE_N=1
> OF_ALIAS_0=eth0 <==
> MODALIAS=of:NethernetTnetworkCbrcm,genet-v5

Yes, one can if one uses udev and can find something to identify the hw I/F 
with, my
cat /sys/class/net/eth0/device/uevent looks like:
DRIVER=fsl_dpa
MODALIAS=platform:dpaa-ethernet

not sure mdev supports this, does it?
Our simple installer FS(initramfs) doesn't have either udev or mdev.

I also noted that using status = "disabled" didn't work either to create a fix 
name scheme.
Even worse, all the eth I/F after gets renumbered. It seems to me there
is value in having stability in eth I/F naming at boot.
Then userspace(udev) can rename if need be. 

Sure would like to known more about why this feature is not wanted ?

I found
  https://patchwork.kernel.org/patch/4122441/
You quote policy as reason but surely it must be better to
have something stable, connected to the hardware name, than semirandom naming?

 Jocke



Re: Kernel oops with mlx5 and dual XDP redirect programs

2018-10-23 Thread Saeed Mahameed
On Tue, 2018-10-23 at 12:10 +0200, Toke Høiland-Jørgensen wrote:
> Saeed Mahameed  writes:
> 
> > On Thu, 2018-10-18 at 23:53 +0200, Toke Høiland-Jørgensen wrote:
> > > Saeed Mahameed  writes:
> > > 
> > > > I think that the mlx5 driver doesn't know how to tell the other
> > > > device
> > > > to stop transmitting to it while it is resetting.. Maybe tariq
> > > > or
> > > > Jesper know more about this ?
> > > > I will look at this tomorrow after noon and will try to
> > > > repro...
> > > 
> > > Hi Saeed
> > > 
> > > Did you have a chance to poke at this? :)
> > 
> > HI Toke, yes i have been planing to respond but also i wanted to
> > dig
> > more,
> > 
> > so the root cause is very clear.
> > 
> > 1. core 1 is doing tx_dev->ndo_xdp_xmit()
> > 2. core 2 is doing tx_dev->xdp_set() //remove xdp program.
> 
> Right, it was also my guess that it was related to this interaction.
> Thanks for looking into it!
> 
> > and the problem is beyond mlx5, since we don't have a way to tell a
> > different core/different netdev to stop xmitting, or at least
> > synchronize with it.
> 
> Hmm, ideally there should be some way for the higher level XDP API to
> notice this and abort the call before it even reaches the driver on
> the
> TX side, shouldn't there? At LPC, Jesper and I will be talking about
> a
> proposal for decoupling the ndo_xdp_xmit() resource allocation from
> loading and unloading XDP programs, which I guess could be a way to
> deal
> with this as well.
> 
> In the meantime...
> 

Yes totally agree, this is why my fix is temporary. 
Good Idea about LPC, let's discuss this there.

> > I will be waiting for your confirmation that the fix did work.
> 
> I tested your patch, and it does indeed fix the crash. However, it
> also
> seems to have the effect that the XDP redirect continues to function
> even after removing the XDP program on the target device.
> 
> I.e., after the call to ./xdp_fwd -d $TX_IF, I still see packets
> being
> redirected out $TX_IF. Is this intentional?
> 

Interesting, shouldn't happen, unless there is something weird going on
when running xpd_fwd -d together with xdp_redirect_map, i just checked
the code and if ndo_xdp_set was called with null program we will remove
xdp tx resources, nothing suspicious in the driver.

I will look at this later this week.

> -Toke


Re: [PATCH net 1/1] net/smc: save link group ptr before calling smc_buf_unuse

2018-10-23 Thread David Miller
From: Ursula Braun 
Date: Tue, 23 Oct 2018 15:48:05 +0200

> @@ -315,6 +314,8 @@ static void smc_buf_unuse(struct smc_connection *conn)
>  /* remove a finished connection from its link group */
>  void smc_conn_free(struct smc_connection *conn)
>  {
> + struct smc_link_group *lgr;
> +
>   if (!conn->lgr)
>   return;
>   if (conn->lgr->is_smcd) {
> @@ -323,8 +324,9 @@ void smc_conn_free(struct smc_connection *conn)
>   } else {
>   smc_cdc_tx_dismiss_slots(conn);
>   }
> + lgr = conn->lgr; /* smc_lgr_unregister_conn() unsets lgr */
>   smc_lgr_unregister_conn(conn);
> - smc_buf_unuse(conn);
> + smc_buf_unuse(conn, lgr);
>  }

This doesn't make any sense.

smc_lgr_unregister_conn() can free the memory and release the object,
albeit sometimes asynchronously via a workqueue.

It is not safe, therefore, to refrence the lgr object after that
function call.

I'm not applying this, sorry.


Re: [PATCH net] Revert "net: simplify sock_poll_wait"

2018-10-23 Thread David Miller
From: Karsten Graul 
Date: Tue, 23 Oct 2018 13:40:39 +0200

> This reverts commit dd979b4df817e9976f18fb6f9d134d6bc4a3c317.
> 
> This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
> internal TCP socket for the initial handshake with the remote peer.
> Whenever the SMC connection can not be established this TCP socket is
> used as a fallback. All socket operations on the SMC socket are then
> forwarded to the TCP socket. In case of poll, the file->private_data
> pointer references the SMC socket because the TCP socket has no file
> assigned. This causes tcp_poll to wait on the wrong socket.
> 
> Signed-off-by: Karsten Graul 

Applied and queued up for -stable, thank you.


Re: [PATCH] net/wan/fsl_ucc_hdlc: add BQL support

2018-10-23 Thread David Miller
From: Mathias Thore 
Date: Tue, 23 Oct 2018 13:48:32 +0200

> Add byte queue limits support in the fsl_ucc_hdlc driver.
> 
> Signed-off-by: Mathias Thore 
> ---
> 
> Note that this patch is created relative to another patch that was
> applied recently: net/wan/fsl_ucc_hdlc: error counters

net-next is closed, please resubmit this when net-next opens back up.

Thank you.


Re: [PATCH net v2 0/3] Bugfix for the netsec driver

2018-10-23 Thread David Miller
From: masahisa.koj...@linaro.org
Date: Tue, 23 Oct 2018 20:24:25 +0900

> From: Masahisa Kojima 
> 
> This patch series include bugfix for the netsec ethernet
> controller driver, fix the problem in interface down/up.
> 
> changes in v2:
>  - change the place to perform the PHY power down
>  - use the MACROs defiend in include/uapi/linux/mii.h
>  - update commit comment

Series applied.


Re: [PATCHv3 iproute2-next] ip/geneve: fix ttl inherit behavior

2018-10-23 Thread David Ahern
On 10/22/18 1:46 AM, Hangbin Liu wrote:
> Currently when we add geneve with "ttl inherit", we only set ttl to 0, which
> is actually use whatever default value instead of inherit the inner protocol's
> ttl value.
> 
> To make a difference with ttl inherit and ttl == 0, we add an attribute
> IFLA_GENEVE_TTL_INHERIT in kernel commit 52d0d404d39dd ("geneve: add ttl
> inherit support"). Now let's use "ttl inherit" to inherit the inner
> protocol's ttl, and use "ttl auto" to means "use whatever default value",
> the same behavior with ttl == 0.
> 
> v2:
> 1) remove IFLA_GENEVE_TTL_INHERIT defination in if_link.h as it's already
>updated.
> 2) Still use addattr8() so we can enable/disable ttl inherit, as Michal
>suggested.
> 
> v3: Update man page
> 
> Reported-by: Jianlin Shi 
> Signed-off-by: Hangbin Liu 
> ---
>  ip/iplink_geneve.c| 20 +---
>  man/man8/ip-link.8.in |  4 +++-
>  2 files changed, 16 insertions(+), 8 deletions(-)
> 


applied to iproute2-next. Thanks


Re: [PATCH v3 1/4] net: emac: implement 802.1Q VLAN TX tagging support

2018-10-23 Thread David Miller


Sorry, net-next is closed.

Please resubmit this when net-next opens back up.

Thank you.


Re: [iproute PATCH v2] tc: htb: Print default value in hex

2018-10-23 Thread Stephen Hemminger
On Tue, 23 Oct 2018 12:36:24 +0200
Phil Sutter  wrote:

> Value of 'default' is assumed to be hexadecimal when parsing, so
> consequently it should be printed in hex as well. This is a regression
> introduced when adding JSON output.
> 
> As requested, also change JSON output to print the value as hex string.
> 
> Fixes: f354fa6aa5ff0 ("tc: jsonify htb qdisc")
> Signed-off-by: Phil Sutter 

Thanks, applied.


Re: Improving accuracy of PHC readings

2018-10-23 Thread Miroslav Lichvar
On Fri, Oct 19, 2018 at 04:52:13PM +, Keller, Jacob E wrote:
> > This should significantly improve the accuracy of the synchronization,
> > reduce the uncertainty in the readings to less than a half or third,
> > and also reduce the jitter as there are fewer register reads sensitive
> > to the PCIe delay.
> > 
> > What do you think?
> > 
> 
> Nice! I think this is good. I'd love to see some data to back it up, but it 
> makes sense to me.

I tried a quick hack with an X550 and I219. The delay dropped from
about 2940 ns to 1040 ns on the first port of the X550, from 1920 ns
to 660 ns on the second port of the X550, and from 2500 ns to 1300 ns
on the I219.

The I219 supports the SYS_OFFSET_PRECISE ioctl (cross timestamping),
which we can use for comparison. The difference between the offsets
calculated using the two ioctls was about 500-600 ns before and now it
is about -50--150 ns.

I was not able to find any information on how accurate cross
timestamping on this HW is actually supposed to be, so I'm wondering
which of the two is closer to the truth.

Here is an output from phc2sys with the I219:

Before:
phc offset   -59 s2 freq +40 delay   2527
phc offset19 s2 freq+101 delay   2526
phc offset   -23 s2 freq +64 delay   2522
phc offset46 s2 freq+126 delay   2535
phc offset   -32 s2 freq +62 delay   2530
phc offset   -10 s2 freq +75 delay   2526
phc offset   102 s2 freq+184 delay   2523

After:
phc offset17 s2 freq+105 delay   1298
phc offset47 s2 freq+140 delay   1299
phc offset   -42 s2 freq +65 delay   1293
phc offset-6 s2 freq +88 delay   1299
phc offset34 s2 freq+127 delay   1300
phc offset   -14 s2 freq +89 delay   1301
phc offset   -86 s2 freq +13 delay   1296
phc offset   -21 s2 freq +52 delay   1298

-- 
Miroslav Lichvar


[PATCH net 1/1] qed: Fix static checker warning

2018-10-23 Thread Rahul Verma
From: Rahul Verma 

Static Checker Warnings:
drivers/net/ethernet/qlogic/qed/qed_main.c:1510 
qed_fill_link_capability()
error: uninitialized symbol 'tcvr_state'.
drivers/net/ethernet/qlogic/qed/qed_mcp.c:1951 
qed_mcp_trans_speed_mask()
error: uninitialized symbol 'transceiver_state'.
drivers/net/ethernet/qlogic/qed/qed_mcp.c:1951 
qed_mcp_trans_speed_mask()
error: uninitialized symbol 'transceiver_type'.

Symbols tcvr_state, transceiver_state and transceiver_type
are initialized with respective default state.

Fixes: c56a8be7e7aa (qed: Add supported link and advertise link to display in 
ethtool.)
Reported-by: Dan Carpenter 
Signed-off-by: Rahul Verma 
---
 drivers/net/ethernet/qlogic/qed/qed_mcp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c 
b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index 386ee54..f40f654 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -1900,6 +1900,9 @@ int qed_mcp_get_transceiver_data(struct qed_hwfn *p_hwfn,
 {
u32 transceiver_info;
 
+   *p_transceiver_type = ETH_TRANSCEIVER_TYPE_NONE;
+   *p_transceiver_state = ETH_TRANSCEIVER_STATE_UPDATING;
+
if (IS_VF(p_hwfn->cdev))
return -EINVAL;
 
@@ -1908,9 +1911,6 @@ int qed_mcp_get_transceiver_data(struct qed_hwfn *p_hwfn,
return -EBUSY;
}
 
-   *p_transceiver_type = ETH_TRANSCEIVER_TYPE_NONE;
-   *p_transceiver_state = ETH_TRANSCEIVER_STATE_UPDATING;
-
transceiver_info = qed_rd(p_hwfn, p_ptt,
  p_hwfn->mcp_info->port_addr +
  offsetof(struct public_port,
-- 
1.8.3.1



Re: Improving accuracy of PHC readings

2018-10-23 Thread Miroslav Lichvar
On Mon, Oct 22, 2018 at 03:48:02PM -0700, Richard Cochran wrote:
> On Fri, Oct 19, 2018 at 11:51:37AM +0200, Miroslav Lichvar wrote:
> > The extra timestamp doesn't fit the API of the PTP_SYS_OFFSET ioctl,
> > so it would need to shift the timestamp it returns by the missing
> > intervals (assuming the frequency offset between the PHC and system
> > clock is small), or a new ioctl could be introduced that would return
> > all timestamps in an array looking like this:
> > 
> > [sys, phc, sys, sys, phc, sys, ...]
> 
> How about a new ioctl with number of trials as input and single offset
> as output?

The difference between the system timestamps is important as it gives
an upper bound on the error in the offset, so I think the output
should be at least a pair of offset and delay.

The question is from which triplet should be the offset and delay
calculated. The one with the minimum delay is a good choice, but it's
not the only option. For instance, an average or median from all
triplets that have delay smaller than the minimum + 30 nanoseconds may
give a more stable offset.

This is not that different from an NTP client filtering measurements
made over network. I'm not sure if we should try to solve it in the
kernel or drivers. My preference would be to give the user space all
the data and process it there.

If the increased size of the array is an issue, we can reduce the
maximum number of readings.

Does that make sense?

-- 
Miroslav Lichvar


[PATCH net 1/1] net/smc: save link group ptr before calling smc_buf_unuse

2018-10-23 Thread Ursula Braun
From: Karsten Graul 

The pointer to the link group is unset in the smc connection structure
right before the call to smc_buf_unuse. Save the pointer and provide it
to smc_buf_unuse.

Fixes: a6920d1d130c ("net/smc: handle unregistered buffers")
Signed-off-by: Karsten Graul 
Signed-off-by: Ursula Braun 
---
 net/smc/smc_core.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index e871368500e3..12d8493f72f4 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -291,7 +291,8 @@ static int smc_lgr_create(struct smc_sock *smc, bool 
is_smcd,
return rc;
 }
 
-static void smc_buf_unuse(struct smc_connection *conn)
+static void smc_buf_unuse(struct smc_connection *conn,
+ struct smc_link_group *lgr)
 {
if (conn->sndbuf_desc)
conn->sndbuf_desc->used = 0;
@@ -301,8 +302,6 @@ static void smc_buf_unuse(struct smc_connection *conn)
conn->rmb_desc->used = 0;
} else {
/* buf registration failed, reuse not possible */
-   struct smc_link_group *lgr = conn->lgr;
-
write_lock_bh(&lgr->rmbs_lock);
list_del(&conn->rmb_desc->list);
write_unlock_bh(&lgr->rmbs_lock);
@@ -315,6 +314,8 @@ static void smc_buf_unuse(struct smc_connection *conn)
 /* remove a finished connection from its link group */
 void smc_conn_free(struct smc_connection *conn)
 {
+   struct smc_link_group *lgr;
+
if (!conn->lgr)
return;
if (conn->lgr->is_smcd) {
@@ -323,8 +324,9 @@ void smc_conn_free(struct smc_connection *conn)
} else {
smc_cdc_tx_dismiss_slots(conn);
}
+   lgr = conn->lgr; /* smc_lgr_unregister_conn() unsets lgr */
smc_lgr_unregister_conn(conn);
-   smc_buf_unuse(conn);
+   smc_buf_unuse(conn, lgr);
 }
 
 static void smc_link_clear(struct smc_link *lnk)
-- 
2.16.4



Re: [PATCH v3 net-next 03/11] net: dsa: microchip: Initialize mutex before use

2018-10-23 Thread Andrew Lunn
On Mon, Oct 22, 2018 at 07:26:07PM -0700, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> Initialize mutex before use.

Hi Tristram

This seems like a fix for the driver, not simple refactoring.

Please could you rebase this on net, add a fixes: tag, and send it to
netdev for merging. Dave will take fixes anytime.

Thanks
Andrew


Re: [PATCH net-next 3/4] net: phy-c45: Implement reset/suspend/resume callbacks

2018-10-23 Thread Andrew Lunn
> If it's okay for Generic 10G driver I can submit only this and
> manually reset PHY in stmmac driver so that I don't need to
> implement custom PHY driver ...

Hi Jose

That is a bad idea. What happens when somebody uses a different PHY
which uses a different reset sequence? Please keep with the
abstraction. Anything touching the PHY needs to be in a PHY driver, or
phylib.

   Andrew


Re: CRC errors between mvneta and macb

2018-10-23 Thread Richard Genoud
Le 22/10/2018 à 20:19, Andrew Lunn a écrit :
>> I dug more on the subject, and I think I found what Marvell's PHY/MAC
>> doesn't like.
> 
> Hi Richard
> 
> What PHY is being used?
> 
>> After analyzing the ethernet frame on the Davicom PHY's output (pin
>> TX+), I find out that the FCS errors occurs when the ethernet preamble
>> is longer than 56bits. (something like 58 or 60 bits)
> 
> Some Marvell PHYs have a register bit which might be of interest: Page
> 2, register 16, bit 6.
> 
> 0 = Pad odd nibble preambles in copper receive packets.
> 1 = Pass as is and do not pad odd nibble preambles in
It doesn't seem to change anything.

But the problem really seems to be between the 88E1512 and mvneta.

In mvneta_rx_swbm() I dumped the data received, in both cases, I've got
the same thing:
    0004 a3d2 a7ef 0800
dead beef      
       
       8a86 ce78
The 2 first bytes are the marvell header, and 4 last the CRC
In one case the MVNETA_RXD_ERR_SUMMARY status bit is set, and not in the
other case.

But I don't have access to the Marvell documentation to know exactly
what is the status "MVNETA_RXD_ERR_CRC".

Richard


Re: [RFC PATCH v2 00/10] udp: implement GRO support

2018-10-23 Thread Paolo Abeni
Hi,

On Tue, 2018-10-23 at 14:10 +0200, Steffen Klassert wrote:
> On Fri, Oct 19, 2018 at 04:25:10PM +0200, Paolo Abeni wrote:
> > This series implements GRO support for UDP sockets, as the RX counterpart
> > of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
> > The core functionality is implemented by the second patch, introducing a new
> > sockopt to enable UDP_GRO, while patch 3 implements support for passing the
> > segment size to the user space via a new cmsg.
> > UDP GRO performs a socket lookup for each ingress packets and aggregate 
> > datagram
> > directed to UDP GRO enabled sockets with constant l4 tuple.
> > 
> > UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables 
> > NAT
> > rules, and that could potentially confuse existing applications.
> > 
> > The solution adopted here is to de-segment the GRO packet before enqueuing
> > as needed. Since we must cope with packet reinsertion after de-segmentation,
> > the relevant code is factored-out in ipv4 and ipv6 specific helpers and 
> > exposed
> > to UDP usage.
> > 
> > While the current code can probably be improved, this safeguard 
> > ,implemented in
> > the patches 4-7, allows future enachements to enable UDP GSO offload on more
> > virtual devices eventually even on forwarded packets.
> 
> I was curious what I get with this when doing packet forwarding.
> So I added forwarding support with the patch below (on top of
> your patchset). While the packet processing could work the way
> I do it in this patch, I'm not sure about the way I enable
> UDP GRO on forwarding. Maybe there it a better way than just
> do UDP GRO if forwarding is enabled on the receiving device.

My idea/hope is slightly different: ensure that UDP GRO + UDP input
path + UDP desegmentation on socket enqueue is safe and as fast as
plain UDP input path and then enable UDP GRO without socket lookup, for
each incoming frame (!!!).

If we land on the input path on a non UDP GRO enabled socket, the
packet is de-segment before enqueuing.

Socket lookup would be needed only if tunnels are enabled, to check if
the ingress frame match a tunnel. We will use UDP tunnel GRO in that
case.

> Some quick benchmark numbers with UDP packet forwarding
> (1460 byte packets) through two gateways:
> 
> net-next: 16.4 Gbps
> 
> net-next + UDP GRO: 20.3 Gbps

uhmmm... what do you think about this speed-up ?

I hoped that without the socket lookup we can get measurably better
figures.

Cheers,

Paolo



Re: [RFC PATCH v2 00/10] udp: implement GRO support

2018-10-23 Thread Steffen Klassert
On Fri, Oct 19, 2018 at 04:25:10PM +0200, Paolo Abeni wrote:
> This series implements GRO support for UDP sockets, as the RX counterpart
> of commit bec1f6f69736 ("udp: generate gso with UDP_SEGMENT").
> The core functionality is implemented by the second patch, introducing a new
> sockopt to enable UDP_GRO, while patch 3 implements support for passing the
> segment size to the user space via a new cmsg.
> UDP GRO performs a socket lookup for each ingress packets and aggregate 
> datagram
> directed to UDP GRO enabled sockets with constant l4 tuple.
> 
> UDP GRO packets can land on non GRO-enabled sockets, e.g. due to iptables NAT
> rules, and that could potentially confuse existing applications.
> 
> The solution adopted here is to de-segment the GRO packet before enqueuing
> as needed. Since we must cope with packet reinsertion after de-segmentation,
> the relevant code is factored-out in ipv4 and ipv6 specific helpers and 
> exposed
> to UDP usage.
> 
> While the current code can probably be improved, this safeguard ,implemented 
> in
> the patches 4-7, allows future enachements to enable UDP GSO offload on more
> virtual devices eventually even on forwarded packets.

I was curious what I get with this when doing packet forwarding.
So I added forwarding support with the patch below (on top of
your patchset). While the packet processing could work the way
I do it in this patch, I'm not sure about the way I enable
UDP GRO on forwarding. Maybe there it a better way than just
do UDP GRO if forwarding is enabled on the receiving device.

Some quick benchmark numbers with UDP packet forwarding
(1460 byte packets) through two gateways:

net-next: 16.4 Gbps

net-next + UDP GRO: 20.3 Gbps

Subject: [PATCH RFC] udp: Allow gro for the forwarding path.

This patch adds a early route lookup to inet_gro_receive()
in case forwarding is enabled on the receiving device.
To be forwarded packets are allowed to enter the UDP
GRO handlers then.

Signed-off-by: Steffen Klassert 
---
 include/linux/netdevice.h |  2 +-
 include/net/dst.h |  1 +
 net/core/dev.c|  6 --
 net/ipv4/af_inet.c| 12 
 net/ipv4/route.c  |  1 +
 net/ipv4/udp_offload.c| 11 +--
 6 files changed, 28 insertions(+), 5 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dc1d9ed33b31..2eb3fa960ad4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2304,7 +2304,7 @@ struct napi_gro_cb {
/* Number of gro_receive callbacks this packet already went through */
u8 recursion_counter:4;
 
-   /* 1 bit hole */
+   u8  is_fwd:1;
 
/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum  csum;
diff --git a/include/net/dst.h b/include/net/dst.h
index 6cf0870414c7..01bdf825a6f9 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -54,6 +54,7 @@ struct dst_entry {
 #define DST_XFRM_TUNNEL0x0020
 #define DST_XFRM_QUEUE 0x0040
 #define DST_METADATA   0x0080
+#define DST_FWD0x0100
 
/* A non-zero value of dst->obsolete forces by-hand validation
 * of the route entry.  Positive values are set by the generic
diff --git a/net/core/dev.c b/net/core/dev.c
index 022ad73d6253..c6aaffb99456 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5387,8 +5387,10 @@ static struct list_head *gro_list_prepare(struct 
napi_struct *napi,
 
diffs = (unsigned long)p->dev ^ (unsigned long)skb->dev;
diffs |= p->vlan_tci ^ skb->vlan_tci;
-   diffs |= skb_metadata_dst_cmp(p, skb);
-   diffs |= skb_metadata_differs(p, skb);
+   if (!NAPI_GRO_CB(p)->is_fwd) {
+   diffs |= skb_metadata_dst_cmp(p, skb);
+   diffs |= skb_metadata_differs(p, skb);
+   }
if (maclen == ETH_HLEN)
diffs |= compare_ether_header(skb_mac_header(p),
  skb_mac_header(skb));
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 1fbe2f815474..6e4e8588c0b1 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1479,8 +1479,20 @@ struct sk_buff *inet_gro_receive(struct list_head *head, 
struct sk_buff *skb)
NAPI_GRO_CB(p)->flush_id = flush_id;
else
NAPI_GRO_CB(p)->flush_id |= flush_id;
+
+   NAPI_GRO_CB(skb)->is_fwd = NAPI_GRO_CB(p)->is_fwd;
+
+   goto found;
}
 
+   if (IN_DEV_FORWARD(__in_dev_get_rcu(skb->dev))) {
+   if (!ip_route_input_noref(skb, iph->daddr, iph->saddr, iph->tos,
+ skb->dev)) {
+   if ((skb_dst(skb)->flags & DST_FWD))
+   NAPI_GRO_CB(skb)->is_fwd = 1;
+   }
+   }
+found:
NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & hto

I wait for your prompt response.

2018-10-23 Thread Aziz Dake
Good day,

I am Mr. Aziz Dake, from Burkina Faso  a Minister confide on me to
look for foreign partner who will assist him to invest the sum of
Thirty  Million  Dollars  ($30,000,000) in your country.

He has investment interest in mining, exotic properties for commercial
resident, development properties, hotels and any other viable
investment opportunities in your country based on your recommendation
will be highly welcomed.

Hence your co -operation is highly needed to actualize this investment project

I wait for your prompt response.

Sincerely yours

Mr Aziz Dake.


[PATCH net] Revert "net: simplify sock_poll_wait"

2018-10-23 Thread Karsten Graul
This reverts commit dd979b4df817e9976f18fb6f9d134d6bc4a3c317.

This broke tcp_poll for SMC fallback: An AF_SMC socket establishes an
internal TCP socket for the initial handshake with the remote peer.
Whenever the SMC connection can not be established this TCP socket is
used as a fallback. All socket operations on the SMC socket are then
forwarded to the TCP socket. In case of poll, the file->private_data
pointer references the SMC socket because the TCP socket has no file
assigned. This causes tcp_poll to wait on the wrong socket.

Signed-off-by: Karsten Graul 
---
 crypto/af_alg.c|  2 +-
 include/net/sock.h | 12 +---
 net/atm/common.c   |  2 +-
 net/caif/caif_socket.c |  2 +-
 net/core/datagram.c|  2 +-
 net/dccp/proto.c   |  2 +-
 net/ipv4/tcp.c |  2 +-
 net/iucv/af_iucv.c |  2 +-
 net/nfc/llcp_sock.c|  2 +-
 net/rxrpc/af_rxrpc.c   |  2 +-
 net/smc/af_smc.c   |  2 +-
 net/tipc/socket.c  |  2 +-
 net/unix/af_unix.c |  4 ++--
 13 files changed, 22 insertions(+), 16 deletions(-)

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index b053179e0bc5..17eb09d222ff 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -1071,7 +1071,7 @@ __poll_t af_alg_poll(struct file *file, struct socket 
*sock,
struct af_alg_ctx *ctx = ask->private;
__poll_t mask;
 
-   sock_poll_wait(file, wait);
+   sock_poll_wait(file, sock, wait);
mask = 0;
 
if (!ctx->more || ctx->used)
diff --git a/include/net/sock.h b/include/net/sock.h
index 433f45fc2d68..c64a1cff9eb3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2057,14 +2057,20 @@ static inline bool skwq_has_sleeper(struct socket_wq 
*wq)
 /**
  * sock_poll_wait - place memory barrier behind the poll_wait call.
  * @filp:   file
+ * @sock:   socket to wait on
  * @p:  poll_table
  *
  * See the comments in the wq_has_sleeper function.
+ *
+ * Do not derive sock from filp->private_data here. An SMC socket establishes
+ * an internal TCP socket that is used in the fallback case. All socket
+ * operations on the SMC socket are then forwarded to the TCP socket. In case 
of
+ * poll, the filp->private_data pointer references the SMC socket because the
+ * TCP socket has no file assigned.
  */
-static inline void sock_poll_wait(struct file *filp, poll_table *p)
+static inline void sock_poll_wait(struct file *filp, struct socket *sock,
+ poll_table *p)
 {
-   struct socket *sock = filp->private_data;
-
if (!poll_does_not_wait(p)) {
poll_wait(filp, &sock->wq->wait, p);
/* We need to be sure we are in sync with the
diff --git a/net/atm/common.c b/net/atm/common.c
index 9f8cb0d2e71e..a38c174fc766 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -653,7 +653,7 @@ __poll_t vcc_poll(struct file *file, struct socket *sock, 
poll_table *wait)
struct atm_vcc *vcc;
__poll_t mask;
 
-   sock_poll_wait(file, wait);
+   sock_poll_wait(file, sock, wait);
mask = 0;
 
vcc = ATM_SD(sock);
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index d18965f3291f..416717c57cd1 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -941,7 +941,7 @@ static __poll_t caif_poll(struct file *file,
__poll_t mask;
struct caifsock *cf_sk = container_of(sk, struct caifsock, sk);
 
-   sock_poll_wait(file, wait);
+   sock_poll_wait(file, sock, wait);
mask = 0;
 
/* exceptional events? */
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 9aac0d63d53e..6a034eb538a1 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -837,7 +837,7 @@ __poll_t datagram_poll(struct file *file, struct socket 
*sock,
struct sock *sk = sock->sk;
__poll_t mask;
 
-   sock_poll_wait(file, wait);
+   sock_poll_wait(file, sock, wait);
mask = 0;
 
/* exceptional events? */
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 875858c8b059..43733accf58e 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -325,7 +325,7 @@ __poll_t dccp_poll(struct file *file, struct socket *sock,
__poll_t mask;
struct sock *sk = sock->sk;
 
-   sock_poll_wait(file, wait);
+   sock_poll_wait(file, sock, wait);
if (sk->sk_state == DCCP_LISTEN)
return inet_csk_listen_poll(sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 10c6246396cc..bbd07736fb0f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -507,7 +507,7 @@ __poll_t tcp_poll(struct file *file, struct socket *sock, 
poll_table *wait)
const struct tcp_sock *tp = tcp_sk(sk);
int state;
 
-   sock_poll_wait(file, wait);
+   sock_poll_wait(file, sock, wait);
 
state = inet_sk_state_load(sk);
if (state == TCP_LISTEN)
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 634150bff156..69057dccece1 100644
--- a/net/iucv/af_iucv.c
+++ b/net

Re: [PATCH net v2 0/3] Bugfix for the netsec driver

2018-10-23 Thread Masahisa Kojima
Hi Florian, Ard,

Yes, that is my mistake. Thank you very much for pointing out, Ard.
On Tue, 23 Oct 2018 at 20:32, Ard Biesheuvel  wrote:
>
> (+ Florian)
>
> On 23 October 2018 at 08:24,   wrote:
> > From: Masahisa Kojima 
> >
> > This patch series include bugfix for the netsec ethernet
> > controller driver, fix the problem in interface down/up.
> >
> > changes in v2:
> >  - change the place to perform the PHY power down
> >  - use the MACROs defiend in include/uapi/linux/mii.h
> >  - update commit comment
> >
> > Masahisa Kojima (3):
> >   net: socionext: Stop PHY before resetting netsec
> >   net: socionext: Add dummy PHY register read in phy_write()
> >   net: socionext: Reset tx queue in ndo_stop
> >
> >  drivers/net/ethernet/socionext/netsec.c | 40 
> > -
> >  1 file changed, 34 insertions(+), 6 deletions(-)
> >
>
> Hello Masahisa,
>
> As a courtesy, please cc people that have commented on your patches on
> subsequent revisions of the series.


Re: [PATCH net v2 0/3] Bugfix for the netsec driver

2018-10-23 Thread Ard Biesheuvel
(+ Florian)

On 23 October 2018 at 08:24,   wrote:
> From: Masahisa Kojima 
>
> This patch series include bugfix for the netsec ethernet
> controller driver, fix the problem in interface down/up.
>
> changes in v2:
>  - change the place to perform the PHY power down
>  - use the MACROs defiend in include/uapi/linux/mii.h
>  - update commit comment
>
> Masahisa Kojima (3):
>   net: socionext: Stop PHY before resetting netsec
>   net: socionext: Add dummy PHY register read in phy_write()
>   net: socionext: Reset tx queue in ndo_stop
>
>  drivers/net/ethernet/socionext/netsec.c | 40 
> -
>  1 file changed, 34 insertions(+), 6 deletions(-)
>

Hello Masahisa,

As a courtesy, please cc people that have commented on your patches on
subsequent revisions of the series.


[PATCH net v2 3/3] net: socionext: Reset tx queue in ndo_stop

2018-10-23 Thread masahisa . kojima
From: Masahisa Kojima 

We observed that packets and bytes count are not reset
when user performs interface down. Eventually, tx queue is
exhausted and packets will not be sent out.
To avoid this problem, resets tx queue in ndo_stop.

Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
Signed-off-by: Masahisa Kojima 
Signed-off-by: Yoshitoyo Osaki 
---
changes in v2:
 - update commit comment

 drivers/net/ethernet/socionext/netsec.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index 5c295cc0b8f8..d4da7e017207 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -952,6 +952,9 @@ static void netsec_uninit_pkt_dring(struct netsec_priv 
*priv, int id)
dring->head = 0;
dring->tail = 0;
dring->pkt_cnt = 0;
+
+   if (id == NETSEC_RING_TX)
+   netdev_reset_queue(priv->ndev);
 }
 
 static void netsec_free_dring(struct netsec_priv *priv, int id)
-- 
2.14.2



[PATCH net v2 1/3] net: socionext: Stop PHY before resetting netsec

2018-10-23 Thread masahisa . kojima
From: Masahisa Kojima 

In ndo_stop, driver resets the netsec ethernet controller IP.
When the netsec IP is reset, HW running mode turns to NRM mode
and driver has to wait until this mode transition completes.

But mode transition to NRM will not complete if the PHY is
in normal operation state. Netsec IP requires PHY is in
power down state when it is reset.

This modification stops the PHY before resetting netsec.

Together with this modification, phy_addr is stored in netsec_priv
structure because ndev->phydev is not yet ready in ndo_init.

Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
Signed-off-by: Masahisa Kojima 
Signed-off-by: Yoshitoyo Osaki 
---
changes in v2:
 - change the place to perform the PHY power down
 - use the MACROs defiend in include/uapi/linux/mii.h

 drivers/net/ethernet/socionext/netsec.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index 7aa5ebb6766c..829ed2718b22 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -274,6 +274,7 @@ struct netsec_priv {
struct clk *clk;
u32 msg_enable;
u32 freq;
+   u32 phy_addr;
bool rx_cksum_offload_flag;
 };
 
@@ -1340,11 +1341,11 @@ static int netsec_netdev_stop(struct net_device *ndev)
netsec_uninit_pkt_dring(priv, NETSEC_RING_TX);
netsec_uninit_pkt_dring(priv, NETSEC_RING_RX);
 
-   ret = netsec_reset_hardware(priv, false);
-
phy_stop(ndev->phydev);
phy_disconnect(ndev->phydev);
 
+   ret = netsec_reset_hardware(priv, false);
+
pm_runtime_put_sync(priv->dev);
 
return ret;
@@ -1354,6 +1355,7 @@ static int netsec_netdev_init(struct net_device *ndev)
 {
struct netsec_priv *priv = netdev_priv(ndev);
int ret;
+   u16 data;
 
ret = netsec_alloc_dring(priv, NETSEC_RING_TX);
if (ret)
@@ -1363,6 +1365,11 @@ static int netsec_netdev_init(struct net_device *ndev)
if (ret)
goto err1;
 
+   /* set phy power down */
+   data = netsec_phy_read(priv->mii_bus, priv->phy_addr, MII_BMCR) |
+   BMCR_PDOWN;
+   netsec_phy_write(priv->mii_bus, priv->phy_addr, MII_BMCR, data);
+
ret = netsec_reset_hardware(priv, true);
if (ret)
goto err2;
@@ -1412,7 +1419,7 @@ static const struct net_device_ops netsec_netdev_ops = {
 };
 
 static int netsec_of_probe(struct platform_device *pdev,
-  struct netsec_priv *priv)
+  struct netsec_priv *priv, u32 *phy_addr)
 {
priv->phy_np = of_parse_phandle(pdev->dev.of_node, "phy-handle", 0);
if (!priv->phy_np) {
@@ -1420,6 +1427,8 @@ static int netsec_of_probe(struct platform_device *pdev,
return -EINVAL;
}
 
+   *phy_addr = of_mdio_parse_addr(&pdev->dev, priv->phy_np);
+
priv->clk = devm_clk_get(&pdev->dev, NULL); /* get by 'phy_ref_clk' */
if (IS_ERR(priv->clk)) {
dev_err(&pdev->dev, "phy_ref_clk not found\n");
@@ -1620,12 +1629,14 @@ static int netsec_probe(struct platform_device *pdev)
}
 
if (dev_of_node(&pdev->dev))
-   ret = netsec_of_probe(pdev, priv);
+   ret = netsec_of_probe(pdev, priv, &phy_addr);
else
ret = netsec_acpi_probe(pdev, priv, &phy_addr);
if (ret)
goto free_ndev;
 
+   priv->phy_addr = phy_addr;
+
if (!priv->freq) {
dev_err(&pdev->dev, "missing PHY reference clock frequency\n");
ret = -ENODEV;
-- 
2.14.2



[PATCH net v2 2/3] net: socionext: Add dummy PHY register read in phy_write()

2018-10-23 Thread masahisa . kojima
From: Masahisa Kojima 

There is a compatibility issue between RTL8211E implemented
in Developerbox and netsec ethernet controller IP.

Our MDIO controller stops MDC clock right after the write
access, but RTL8211E expects MDC clock must be kept toggling
for several clock cycle with MDIO high before entering
the IDLE state. Without keeping clock after write access,
write access is not correctly handled and register is not
updated.

To meet this requirement, netsec driver needs to issue dummy
read(e.g. read PHYID1(offset 0x2) register) right after write
access, to keep MDC clock.

We think this compatibility issue is a problem specific to
our MDIO controller and RTL8211E.

Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
Signed-off-by: Masahisa Kojima 
Signed-off-by: Yoshitoyo Osaki 
---
changes in v2:
 - use the MACROs defiend in include/uapi/linux/mii.h

 drivers/net/ethernet/socionext/netsec.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c 
b/drivers/net/ethernet/socionext/netsec.c
index 829ed2718b22..5c295cc0b8f8 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -432,9 +432,12 @@ static int netsec_mac_update_to_phy_state(struct 
netsec_priv *priv)
return 0;
 }
 
+static int netsec_phy_read(struct mii_bus *bus, int phy_addr, int reg_addr);
+
 static int netsec_phy_write(struct mii_bus *bus,
int phy_addr, int reg, u16 val)
 {
+   int status;
struct netsec_priv *priv = bus->priv;
 
if (netsec_mac_write(priv, GMAC_REG_GDR, val))
@@ -447,8 +450,19 @@ static int netsec_phy_write(struct mii_bus *bus,
  GMAC_REG_SHIFT_CR_GAR)))
return -ETIMEDOUT;
 
-   return netsec_mac_wait_while_busy(priv, GMAC_REG_GAR,
- NETSEC_GMAC_GAR_REG_GB);
+   status = netsec_mac_wait_while_busy(priv, GMAC_REG_GAR,
+   NETSEC_GMAC_GAR_REG_GB);
+
+   /* Developerbox implements RTL8211E PHY and there is
+* a compatibility problem with F_GMAC4.
+* RTL8211E expects MDC clock must be kept toggling for several
+* clock cycle with MDIO high before entering the IDLE state.
+* To meet this requirement, netsec driver needs to issue dummy
+* read(e.g. read PHYID1(offset 0x2) register) right after write.
+*/
+   netsec_phy_read(bus, phy_addr, MII_PHYSID1);
+
+   return status;
 }
 
 static int netsec_phy_read(struct mii_bus *bus, int phy_addr, int reg_addr)
-- 
2.14.2



[PATCH net v2 0/3] Bugfix for the netsec driver

2018-10-23 Thread masahisa . kojima
From: Masahisa Kojima 

This patch series include bugfix for the netsec ethernet
controller driver, fix the problem in interface down/up.

changes in v2:
 - change the place to perform the PHY power down
 - use the MACROs defiend in include/uapi/linux/mii.h
 - update commit comment

Masahisa Kojima (3):
  net: socionext: Stop PHY before resetting netsec
  net: socionext: Add dummy PHY register read in phy_write()
  net: socionext: Reset tx queue in ndo_stop

 drivers/net/ethernet/socionext/netsec.c | 40 -
 1 file changed, 34 insertions(+), 6 deletions(-)

-- 
2.14.2



Re: [PATCH net-next 3/4] net: phy-c45: Implement reset/suspend/resume callbacks

2018-10-23 Thread Russell King - ARM Linux
On Tue, Oct 23, 2018 at 11:28:09AM +0100, Jose Abreu wrote:
> On 23-10-2018 11:20, Russell King - ARM Linux wrote:
> > I have no idea what you're proposing there - your patches weren't copied
> > to me.
> 
> They just set / unset  MDIO_CTRL1_LPOWER bit in PCS. I find that
> without this remote end doesn't detect link is down ...
> 
> If it's okay for Generic 10G driver I can submit only this and
> manually reset PHY in stmmac driver so that I don't need to
> implement custom PHY driver ...



> BTW, I just found out currently Generic 10G Driver is broken
> without patch 4/4 of this series [1]
> 
> [1] https://patchwork.ozlabs.org/patch/987570/

How is it broken - what are the symptoms?

The generic 10G driver is bound not via the normal bus matching and
phy_bus_match(), but via a manual bind in phy_attach_direct().  This
calls the probe function, which is phy_probe(), which initialises
the supported/advertising to the driver's features (which as you note
are zero.)

However, phy_attach_direct() goes on to call phy_init_hw(), which
calls the config_init() method.  The config_init() method initialises
the supported/advertising masks to 10GbaseT.  This is (partly) what
I refer to when I say that the generic 10G support is crippled - it
only supports this single speed and media.

So the supported/advertising masks should be forced to only 10GbaseT
at the completion of phy_attach_direct().

The "generic 10G" support doesn't do autonegotiation, configuration
or link mode forcing.  It only assumes 10GbaseT is supported, and
only checks for the "link up" bits.

It isn't like the non-10G generic PHY support due to history - it
was added in 2014 by Andy Fleming (see 124059fd53af).

BTW, your patch 1 is wrong as well (introducing phy_update_link()).
You don't take account that a 10G phy may have alternative ways of
reading the link (like 88x3310 does, because it has multiple
instances of AN/PCS/PHYXS at 1k offsets.)  All the gen10g_*
functions are legacy functions for the crippled "generic" 10G
support.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


[iproute PATCH v2] tc: htb: Print default value in hex

2018-10-23 Thread Phil Sutter
Value of 'default' is assumed to be hexadecimal when parsing, so
consequently it should be printed in hex as well. This is a regression
introduced when adding JSON output.

As requested, also change JSON output to print the value as hex string.

Fixes: f354fa6aa5ff0 ("tc: jsonify htb qdisc")
Signed-off-by: Phil Sutter 
---
Changes since v1:
- Use print_0xhex().
---
 tc/q_htb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tc/q_htb.c b/tc/q_htb.c
index c8b2941d945b7..5fb11d28c5c3a 100644
--- a/tc/q_htb.c
+++ b/tc/q_htb.c
@@ -332,7 +332,7 @@ static int htb_print_opt(struct qdisc_util *qu, FILE *f, 
struct rtattr *opt)
if (RTA_PAYLOAD(tb[TCA_HTB_INIT])  < sizeof(*gopt)) return -1;
 
print_int(PRINT_ANY, "r2q", "r2q %d", gopt->rate2quantum);
-   print_uint(PRINT_ANY, "default", " default %u", gopt->defcls);
+   print_0xhex(PRINT_ANY, "default", " default %x", gopt->defcls);
print_uint(PRINT_ANY, "direct_packets_stat",
   " direct_packets_stat %u", gopt->direct_pkts);
if (show_details) {
-- 
2.19.0



Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection

2018-10-23 Thread Steffen Klassert
On Mon, Oct 22, 2018 at 02:51:56PM +0200, Paolo Abeni wrote:
> > > +
> > > +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > > +{
> > > + struct sk_buff *next, *segs;
> > > + int ret;
> > > +
> > > + if (likely(!udp_unexpected_gso(sk, skb)))
> > > + return udp_queue_rcv_one_skb(sk, skb);
> > > +
> > > + BUILD_BUG_ON(sizeof(struct udp_skb_cb) > SKB_SGO_CB_OFFSET);
> > > + __skb_push(skb, -skb_mac_offset(skb));
> > > + segs = udp_rcv_segment(sk, skb);
> > > + for (skb = segs; skb; skb = next) {
> > > + next = skb->next;
> > > + __skb_pull(skb, skb_transport_offset(skb));
> > > + ret = udp_queue_rcv_one_skb(sk, skb);
> > 
> > udp_queue_rcv_one_skb() starts with doing a xfrm4_policy_check().
> > Maybe we can do this on the GSO packet instead of the segments.
> > So far this code is just for handling a corner case, but this might
> > change.
> 
> I thought about keeping the policy check here, but then I preferred
> what looked the safest option. Perhaps we can improve with a follow-up?

Fair enough. Let's keep it in mind and do it later.


Re: [PATCH net-next 3/4] net: phy-c45: Implement reset/suspend/resume callbacks

2018-10-23 Thread Jose Abreu
On 23-10-2018 11:20, Russell King - ARM Linux wrote:
> On Tue, Oct 23, 2018 at 11:17:50AM +0100, Jose Abreu wrote:
>> On 22-10-2018 18:13, Florian Fainelli wrote:
>>> On 10/22/18 8:48 AM, Russell King - ARM Linux wrote:
 On Mon, Oct 22, 2018 at 01:47:48PM +0100, Jose Abreu wrote:
> Hello,
>
> On 22-10-2018 13:28, Andrew Lunn wrote:
>>>  EXPORT_SYMBOL_GPL(gen10g_resume);
>>> @@ -327,7 +381,7 @@ struct phy_driver genphy_10g_driver = {
>>> .phy_id = 0x,
>>> .phy_id_mask= 0x,
>>> .name   = "Generic 10G PHY",
>>> -   .soft_reset = gen10g_no_soft_reset,
>>> +   .soft_reset = gen10g_soft_reset,
>>> .config_init= gen10g_config_init,
>>> .features   = 0,
>>> .aneg_done  = genphy_c45_aneg_done,
>> Hi Jose
>>
>> You need to be careful here. There is a reason this is called
>> gen10g_no_soft_reset, rather than having an empty
>> gen10g_soft_reset. Some PHYs break when you do a reset.  So adding a
>> gen10g_soft_reset is fine, but don't change this here, without first
>> understanding the history, and talking to Russell King.
> Hmm, the reset function only interacts with standard PCS
> registers, which should always be available ...
>
> >From my tests I need to do at least 1 reset during power-up so in
> ultimate case I can add a feature quirk or similar.
>
> Russell, can you please comment ?
 Setting the reset bit on 88x3310 causes the entire device to become
 completely inaccessible until hardware reset.  Therefore, this bit
 must _never_ be set for these devices.  That said, we have a separate
 driver for these PHYs, but that will only be used for them if it's
 present in the kernel.  If we accidentally fall back to the generic
 driver, then we'll screw the 88x3310 until a full hardware reset.

 We also have a bunch of net devices that make use of this crippled
 "generic" 10G support - we don't know whether resetting the PHY
 for those systems will cause a regression - maybe board firmware
 already configured the PHY?  I can't say either way on that, except
 that we've had crippled 10G support in PHYLIB for a number of years
 now _with_ users, and adding reset support drastically changes the
 subsystem's behaviour for these users.

 I would recommend not touching the generic 10G driver, but instead
 implement your own driver for your PHY to avoid causing regressions.

>>> Agreed.
>> What about .suspend / .resume ?
> I have no idea what you're proposing there - your patches weren't copied
> to me.
>

They just set / unset  MDIO_CTRL1_LPOWER bit in PCS. I find that
without this remote end doesn't detect link is down ...

If it's okay for Generic 10G driver I can submit only this and
manually reset PHY in stmmac driver so that I don't need to
implement custom PHY driver ...

BTW, I just found out currently Generic 10G Driver is broken
without patch 4/4 of this series [1]

[1] https://patchwork.ozlabs.org/patch/987570/

Thanks and Best Regards,
Jose Miguel Abreu


Re: [PATCH net-next 3/4] net: phy-c45: Implement reset/suspend/resume callbacks

2018-10-23 Thread Russell King - ARM Linux
On Tue, Oct 23, 2018 at 11:17:50AM +0100, Jose Abreu wrote:
> On 22-10-2018 18:13, Florian Fainelli wrote:
> > On 10/22/18 8:48 AM, Russell King - ARM Linux wrote:
> >> On Mon, Oct 22, 2018 at 01:47:48PM +0100, Jose Abreu wrote:
> >>> Hello,
> >>>
> >>> On 22-10-2018 13:28, Andrew Lunn wrote:
> >  EXPORT_SYMBOL_GPL(gen10g_resume);
> > @@ -327,7 +381,7 @@ struct phy_driver genphy_10g_driver = {
> > .phy_id = 0x,
> > .phy_id_mask= 0x,
> > .name   = "Generic 10G PHY",
> > -   .soft_reset = gen10g_no_soft_reset,
> > +   .soft_reset = gen10g_soft_reset,
> > .config_init= gen10g_config_init,
> > .features   = 0,
> > .aneg_done  = genphy_c45_aneg_done,
>  Hi Jose
> 
>  You need to be careful here. There is a reason this is called
>  gen10g_no_soft_reset, rather than having an empty
>  gen10g_soft_reset. Some PHYs break when you do a reset.  So adding a
>  gen10g_soft_reset is fine, but don't change this here, without first
>  understanding the history, and talking to Russell King.
> >>> Hmm, the reset function only interacts with standard PCS
> >>> registers, which should always be available ...
> >>>
> >>> >From my tests I need to do at least 1 reset during power-up so in
> >>> ultimate case I can add a feature quirk or similar.
> >>>
> >>> Russell, can you please comment ?
> >> Setting the reset bit on 88x3310 causes the entire device to become
> >> completely inaccessible until hardware reset.  Therefore, this bit
> >> must _never_ be set for these devices.  That said, we have a separate
> >> driver for these PHYs, but that will only be used for them if it's
> >> present in the kernel.  If we accidentally fall back to the generic
> >> driver, then we'll screw the 88x3310 until a full hardware reset.
> >>
> >> We also have a bunch of net devices that make use of this crippled
> >> "generic" 10G support - we don't know whether resetting the PHY
> >> for those systems will cause a regression - maybe board firmware
> >> already configured the PHY?  I can't say either way on that, except
> >> that we've had crippled 10G support in PHYLIB for a number of years
> >> now _with_ users, and adding reset support drastically changes the
> >> subsystem's behaviour for these users.
> >>
> >> I would recommend not touching the generic 10G driver, but instead
> >> implement your own driver for your PHY to avoid causing regressions.
> >>
> > Agreed.
> 
> What about .suspend / .resume ?

I have no idea what you're proposing there - your patches weren't copied
to me.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [PATCH net-next 1/4] net: phy: Use C45 Helpers when forcing PHY

2018-10-23 Thread Jose Abreu
On 22-10-2018 18:11, Florian Fainelli wrote:
> On 10/22/18 3:32 AM, Jose Abreu wrote:
>> If PHY is in force state and we have a C45 phy we need to use the
>> standard C45 helpers and not the C22 ones.
>>
>> Signed-off-by: Jose Abreu 
>> Cc: Andrew Lunn 
>> Cc: Florian Fainelli 
>> Cc: "David S. Miller" 
>> Cc: Joao Pinto 
>> ---
>>  drivers/net/phy/phy.c | 2 +-
>>  include/linux/phy.h   | 8 
>>  2 files changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index 1d73ac3309ce..0ff4946e208e 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -995,7 +995,7 @@ void phy_state_machine(struct work_struct *work)
>>  }
>>  break;
>>  case PHY_FORCING:
>> -err = genphy_update_link(phydev);
>> +err = phy_update_link(phydev);
>>  if (err)
>>  break;
>>  
>> diff --git a/include/linux/phy.h b/include/linux/phy.h
>> index 3ea87f774a76..02c2ee8bc05b 100644
>> --- a/include/linux/phy.h
>> +++ b/include/linux/phy.h
>> @@ -1044,6 +1044,14 @@ static inline int phy_read_status(struct phy_device 
>> *phydev)
>>  return genphy_read_status(phydev);
>>  }
>>  
>> +static inline int phy_update_link(struct phy_device *phydev)
>> +{
>> +if (phydev->is_c45)
>> +return gen10g_read_status(phydev);
> Should not this be genphy_c45_read_link() for symmetry with
> genphy_update_link() which only updates phydev->link?

Hmmm, genphy_c45_read_link() does not update phydev->link ... I
can create a new gen10g_update_link() that wraps around
genphy_c45_read_link() and updates link ...

Thanks and Best Regards,
Jose Miguel Abreu


Re: [PATCH net-next 3/4] net: phy-c45: Implement reset/suspend/resume callbacks

2018-10-23 Thread Jose Abreu
On 22-10-2018 18:13, Florian Fainelli wrote:
> On 10/22/18 8:48 AM, Russell King - ARM Linux wrote:
>> On Mon, Oct 22, 2018 at 01:47:48PM +0100, Jose Abreu wrote:
>>> Hello,
>>>
>>> On 22-10-2018 13:28, Andrew Lunn wrote:
>  EXPORT_SYMBOL_GPL(gen10g_resume);
> @@ -327,7 +381,7 @@ struct phy_driver genphy_10g_driver = {
>   .phy_id = 0x,
>   .phy_id_mask= 0x,
>   .name   = "Generic 10G PHY",
> - .soft_reset = gen10g_no_soft_reset,
> + .soft_reset = gen10g_soft_reset,
>   .config_init= gen10g_config_init,
>   .features   = 0,
>   .aneg_done  = genphy_c45_aneg_done,
 Hi Jose

 You need to be careful here. There is a reason this is called
 gen10g_no_soft_reset, rather than having an empty
 gen10g_soft_reset. Some PHYs break when you do a reset.  So adding a
 gen10g_soft_reset is fine, but don't change this here, without first
 understanding the history, and talking to Russell King.
>>> Hmm, the reset function only interacts with standard PCS
>>> registers, which should always be available ...
>>>
>>> >From my tests I need to do at least 1 reset during power-up so in
>>> ultimate case I can add a feature quirk or similar.
>>>
>>> Russell, can you please comment ?
>> Setting the reset bit on 88x3310 causes the entire device to become
>> completely inaccessible until hardware reset.  Therefore, this bit
>> must _never_ be set for these devices.  That said, we have a separate
>> driver for these PHYs, but that will only be used for them if it's
>> present in the kernel.  If we accidentally fall back to the generic
>> driver, then we'll screw the 88x3310 until a full hardware reset.
>>
>> We also have a bunch of net devices that make use of this crippled
>> "generic" 10G support - we don't know whether resetting the PHY
>> for those systems will cause a regression - maybe board firmware
>> already configured the PHY?  I can't say either way on that, except
>> that we've had crippled 10G support in PHYLIB for a number of years
>> now _with_ users, and adding reset support drastically changes the
>> subsystem's behaviour for these users.
>>
>> I would recommend not touching the generic 10G driver, but instead
>> implement your own driver for your PHY to avoid causing regressions.
>>
> Agreed.

What about .suspend / .resume ?

Thanks and Best Regards,
Jose Miguel Abreu


Re: Kernel oops with mlx5 and dual XDP redirect programs

2018-10-23 Thread Toke Høiland-Jørgensen
Saeed Mahameed  writes:

> On Thu, 2018-10-18 at 23:53 +0200, Toke Høiland-Jørgensen wrote:
>> Saeed Mahameed  writes:
>> 
>> > I think that the mlx5 driver doesn't know how to tell the other
>> > device
>> > to stop transmitting to it while it is resetting.. Maybe tariq or
>> > Jesper know more about this ?
>> > I will look at this tomorrow after noon and will try to repro...
>> 
>> Hi Saeed
>> 
>> Did you have a chance to poke at this? :)
>
> HI Toke, yes i have been planing to respond but also i wanted to dig
> more,
>
> so the root cause is very clear.
>
> 1. core 1 is doing tx_dev->ndo_xdp_xmit()
> 2. core 2 is doing tx_dev->xdp_set() //remove xdp program.

Right, it was also my guess that it was related to this interaction.
Thanks for looking into it!

> and the problem is beyond mlx5, since we don't have a way to tell a
> different core/different netdev to stop xmitting, or at least
> synchronize with it.

Hmm, ideally there should be some way for the higher level XDP API to
notice this and abort the call before it even reaches the driver on the
TX side, shouldn't there? At LPC, Jesper and I will be talking about a
proposal for decoupling the ndo_xdp_xmit() resource allocation from
loading and unloading XDP programs, which I guess could be a way to deal
with this as well.

In the meantime...

> I will be waiting for your confirmation that the fix did work.

I tested your patch, and it does indeed fix the crash. However, it also
seems to have the effect that the XDP redirect continues to function
even after removing the XDP program on the target device.

I.e., after the call to ./xdp_fwd -d $TX_IF, I still see packets being
redirected out $TX_IF. Is this intentional?

-Toke


[PATCH v3 3/4] net: emac: remove IBM_EMAC_RX_SKB_HEADROOM

2018-10-23 Thread Christian Lamparter
The EMAC driver had a custom IBM_EMAC_RX_SKB_HEADROOM
Kconfig option that reserved additional skb headroom for RX.
This patch removes the option and migrates the code
to use napi_alloc_skb() and netdev_alloc_skb_ip_align()
in its place.

Signed-off-by: Christian Lamparter 
---
 drivers/net/ethernet/ibm/emac/Kconfig | 12 --
 drivers/net/ethernet/ibm/emac/core.c  | 57 +++
 drivers/net/ethernet/ibm/emac/core.h  | 10 ++---
 3 files changed, 43 insertions(+), 36 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/Kconfig 
b/drivers/net/ethernet/ibm/emac/Kconfig
index 90d49191beb3..eacf7e141fdc 100644
--- a/drivers/net/ethernet/ibm/emac/Kconfig
+++ b/drivers/net/ethernet/ibm/emac/Kconfig
@@ -28,18 +28,6 @@ config IBM_EMAC_RX_COPY_THRESHOLD
depends on IBM_EMAC
default "256"
 
-config IBM_EMAC_RX_SKB_HEADROOM
-   int "Additional RX skb headroom (bytes)"
-   depends on IBM_EMAC
-   default "0"
-   help
- Additional receive skb headroom. Note, that driver
- will always reserve at least 2 bytes to make IP header
- aligned, so usually there is no need to add any additional
- headroom.
-
- If unsure, set to 0.
-
 config IBM_EMAC_DEBUG
bool "Debugging"
depends on IBM_EMAC
diff --git a/drivers/net/ethernet/ibm/emac/core.c 
b/drivers/net/ethernet/ibm/emac/core.c
index 80aafd7552aa..266b6614125b 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -1075,7 +1075,9 @@ static int emac_resize_rx_ring(struct emac_instance *dev, 
int new_mtu)
 
/* Second pass, allocate new skbs */
for (i = 0; i < NUM_RX_BUFF; ++i) {
-   struct sk_buff *skb = alloc_skb(rx_skb_size, GFP_ATOMIC);
+   struct sk_buff *skb;
+
+   skb = netdev_alloc_skb_ip_align(dev->ndev, rx_skb_size);
if (!skb) {
ret = -ENOMEM;
goto oom;
@@ -1084,7 +1086,6 @@ static int emac_resize_rx_ring(struct emac_instance *dev, 
int new_mtu)
BUG_ON(!dev->rx_skb[i]);
dev_kfree_skb(dev->rx_skb[i]);
 
-   skb_reserve(skb, EMAC_RX_SKB_HEADROOM + 2);
dev->rx_desc[i].data_ptr =
dma_map_single(&dev->ofdev->dev, skb->data - 2, 
rx_sync_size,
   DMA_FROM_DEVICE) + 2;
@@ -1205,20 +1206,18 @@ static void emac_clean_rx_ring(struct emac_instance 
*dev)
}
 }
 
-static inline int emac_alloc_rx_skb(struct emac_instance *dev, int slot,
-   gfp_t flags)
+static inline int
+__emac_prepare_rx_skb(struct sk_buff *skb, struct emac_instance *dev, int slot)
 {
-   struct sk_buff *skb = alloc_skb(dev->rx_skb_size, flags);
if (unlikely(!skb))
return -ENOMEM;
 
dev->rx_skb[slot] = skb;
dev->rx_desc[slot].data_len = 0;
 
-   skb_reserve(skb, EMAC_RX_SKB_HEADROOM + 2);
dev->rx_desc[slot].data_ptr =
-   dma_map_single(&dev->ofdev->dev, skb->data - 2, dev->rx_sync_size,
-  DMA_FROM_DEVICE) + 2;
+   dma_map_single(&dev->ofdev->dev, skb->data - NET_IP_ALIGN,
+  dev->rx_sync_size, DMA_FROM_DEVICE) + NET_IP_ALIGN;
wmb();
dev->rx_desc[slot].ctrl = MAL_RX_CTRL_EMPTY |
(slot == (NUM_RX_BUFF - 1) ? MAL_RX_CTRL_WRAP : 0);
@@ -1226,6 +1225,27 @@ static inline int emac_alloc_rx_skb(struct emac_instance 
*dev, int slot,
return 0;
 }
 
+static inline int
+emac_alloc_rx_skb(struct emac_instance *dev, int slot)
+{
+   struct sk_buff *skb;
+
+   skb = __netdev_alloc_skb_ip_align(dev->ndev, dev->rx_skb_size,
+ GFP_KERNEL);
+
+   return __emac_prepare_rx_skb(skb, dev, slot);
+}
+
+static inline int
+emac_alloc_rx_skb_napi(struct emac_instance *dev, int slot)
+{
+   struct sk_buff *skb;
+
+   skb = napi_alloc_skb(&dev->mal->napi, dev->rx_skb_size);
+
+   return __emac_prepare_rx_skb(skb, dev, slot);
+}
+
 static void emac_print_link_status(struct emac_instance *dev)
 {
if (netif_carrier_ok(dev->ndev))
@@ -1256,7 +1276,7 @@ static int emac_open(struct net_device *ndev)
 
/* Allocate RX ring */
for (i = 0; i < NUM_RX_BUFF; ++i)
-   if (emac_alloc_rx_skb(dev, i, GFP_KERNEL)) {
+   if (emac_alloc_rx_skb(dev, i)) {
printk(KERN_ERR "%s: failed to allocate RX ring\n",
   ndev->name);
goto oom;
@@ -1779,8 +1799,9 @@ static inline void emac_recycle_rx_skb(struct 
emac_instance *dev, int slot,
DBG2(dev, "recycle %d %d" NL, slot, len);
 
if (len)
-   dma_map_single(&dev->ofdev->dev, skb->data - 2,
-  EMAC_DMA_ALIGN(len + 2), DMA_FROM_DEVICE);
+   dma_map_single(&dev->ofdev->dev, skb->data - NET_IP_ALIGN,
+ 

[PATCH v3 2/4] net: emac: implement TCP segmentation offload (TSO)

2018-10-23 Thread Christian Lamparter
This patch enables TSO(v4) hw feature for emac driver.
As atleast the APM82181's TCP/IP acceleration hardware
controller (TAH) provides TCP segmentation support in
the transmit path.

Signed-off-by: Christian Lamparter 
---
 drivers/net/ethernet/ibm/emac/core.c | 112 ++-
 drivers/net/ethernet/ibm/emac/core.h |   7 ++
 drivers/net/ethernet/ibm/emac/emac.h |   7 ++
 drivers/net/ethernet/ibm/emac/tah.c  |  22 +-
 drivers/net/ethernet/ibm/emac/tah.h  |   2 +
 5 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c 
b/drivers/net/ethernet/ibm/emac/core.c
index be560f9031f4..80aafd7552aa 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -38,6 +38,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -1118,6 +1121,32 @@ static int emac_resize_rx_ring(struct emac_instance 
*dev, int new_mtu)
return ret;
 }
 
+/* Restriction applied for the segmentation size
+ * to use HW segmentation offload feature. the size
+ * of the segment must not be less than 168 bytes for
+ * DIX formatted segments, or 176 bytes for
+ * IEEE formatted segments. However based on actual
+ * tests any MTU less than 416 causes excessive retries
+ * due to TX FIFO underruns.
+ */
+const u32 tah_ss[TAH_NO_SSR] = { 1500, 1344, 1152, 960, 768, 416 };
+
+/* look-up matching segment size for the given mtu */
+static void emac_find_tso_ss_for_mtu(struct emac_instance *dev)
+{
+   int i;
+
+   for (i = 0; i < ARRAY_SIZE(tah_ss); i++) {
+   if (tah_ss[i] <= dev->ndev->mtu)
+   break;
+   }
+   /* if no matching segment size is found, set the tso_ss_mtu_start
+* variable anyway. This will cause the emac_tx_tso to skip straight
+* to the software fallback.
+*/
+   dev->tso_ss_mtu_start = i;
+}
+
 /* Process ctx, rtnl_lock semaphore */
 static int emac_change_mtu(struct net_device *ndev, int new_mtu)
 {
@@ -1134,6 +1163,7 @@ static int emac_change_mtu(struct net_device *ndev, int 
new_mtu)
 
if (!ret) {
ndev->mtu = new_mtu;
+   emac_find_tso_ss_for_mtu(dev);
dev->rx_skb_size = emac_rx_skb_size(new_mtu);
dev->rx_sync_size = emac_rx_sync_size(new_mtu);
}
@@ -1410,6 +1440,33 @@ static inline u16 emac_tx_csum(struct emac_instance *dev,
return 0;
 }
 
+static int emac_tx_tso(struct emac_instance *dev, struct sk_buff *skb,
+  u16 *ctrl)
+{
+   if (emac_has_feature(dev, EMAC_FTR_TAH_HAS_TSO) && skb_is_gso(skb) &&
+   !!(skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6))) {
+   u32 seg_size = 0, i;
+
+   /* Get the MTU */
+   seg_size = skb_shinfo(skb)->gso_size + tcp_hdrlen(skb) +
+  skb_network_header_len(skb);
+
+   for (i = dev->tso_ss_mtu_start; i < ARRAY_SIZE(tah_ss); i++) {
+   if (tah_ss[i] > seg_size)
+   continue;
+
+   *ctrl |= EMAC_TX_CTRL_TAH_SSR(i);
+   return 0;
+   }
+
+   /* none found fall back to software */
+   return -EINVAL;
+   }
+
+   *ctrl |= emac_tx_csum(dev, skb);
+   return 0;
+}
+
 static inline netdev_tx_t emac_xmit_finish(struct emac_instance *dev, int len)
 {
struct emac_regs __iomem *p = dev->emacp;
@@ -1452,6 +1509,46 @@ static inline u16 emac_tx_vlan(struct emac_instance 
*dev, struct sk_buff *skb)
return 0;
 }
 
+static netdev_tx_t
+emac_start_xmit_sg(struct sk_buff *skb, struct net_device *ndev);
+
+static int
+emac_sw_tso(struct sk_buff *skb, struct net_device *ndev)
+{
+   struct emac_instance *dev = netdev_priv(ndev);
+   struct sk_buff *segs, *curr;
+   unsigned int i, frag_slots;
+
+   /* make sure to not overflow the tx ring */
+   frag_slots = dev->tx_cnt;
+   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+   struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i];
+
+   frag_slots += mal_tx_chunks(skb_frag_size(frag));
+
+   if (frag_slots >= NUM_TX_BUFF)
+   return -ENOSPC;
+   };
+
+   segs = skb_gso_segment(skb, ndev->features &
+   ~(NETIF_F_TSO | NETIF_F_TSO6));
+   if (IS_ERR_OR_NULL(segs)) {
+   ++dev->estats.tx_dropped;
+   dev_kfree_skb_any(skb);
+   } else {
+   while (segs) {
+   curr = segs;
+   segs = curr->next;
+   curr->next = NULL;
+
+   emac_start_xmit_sg(curr, ndev);
+   }
+   dev_consume_skb_any(skb);
+   }
+
+   return 0;
+}
+
 /* Tx lock BH */
 static netdev_tx_t emac_start_xmit(struct sk_buff *skb, struct net_device 
*nd

[PATCH v3 4/4] net: emac: add deprecation notice to emac custom phy users

2018-10-23 Thread Christian Lamparter
This patch starts the deprecation process of emac's small library of
supported phys by adding a message to inform all remaining users to
start looking into converting their platform's device-tree to PHYLIB.

EMAC's phy.c support is limited to mostly single ethernet transceivers:
CIS8201, BCM5248, ET1011C, Marvell 88E and 88E1112, AR8035.

And Linux has dedicated PHYLIB drivers for all but the BCM5248 which
can be supported by the generic phy driver.

Signed-off-by: Christian Lamparter 
---
 drivers/net/ethernet/ibm/emac/phy.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/ibm/emac/phy.c 
b/drivers/net/ethernet/ibm/emac/phy.c
index aa070c063e48..143b4c688ee9 100644
--- a/drivers/net/ethernet/ibm/emac/phy.c
+++ b/drivers/net/ethernet/ibm/emac/phy.c
@@ -496,6 +496,7 @@ static struct mii_phy_def ar8035_phy_def = {
 };
 
 static struct mii_phy_def *mii_phy_table[] = {
+   /* DEPRECATED: Do not add any new PHY drivers to this list. */
&et1011c_phy_def,
&cis8201_phy_def,
&bcm5248_phy_def,
@@ -512,6 +513,9 @@ int emac_mii_phy_probe(struct mii_phy *phy, int address)
int i;
u32 id;
 
+   pr_info("EMAC's custom phy code has been deprecated.\n"
+   "Please convert your EMAC device to PHYLIB.\n");
+
phy->autoneg = AUTONEG_DISABLE;
phy->advertising = 0;
phy->address = address;
-- 
2.19.1



[PATCH v3 1/4] net: emac: implement 802.1Q VLAN TX tagging support

2018-10-23 Thread Christian Lamparter
As per' APM82181 Embedded Processor User Manual 26.1 EMAC Features:
VLAN:
 - Support for VLAN tag ID in compliance with IEEE 802.3ac.
 - VLAN tag insertion or replacement for transmit packets

This patch completes the missing code for the VLAN tx tagging
support, as the the EMAC_MR1_VLE was already enabled.

Signed-off-by: Christian Lamparter 
---
 drivers/net/ethernet/ibm/emac/core.c | 32 
 drivers/net/ethernet/ibm/emac/core.h |  6 +-
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/ibm/emac/core.c 
b/drivers/net/ethernet/ibm/emac/core.c
index 760b2ad8e295..be560f9031f4 100644
--- a/drivers/net/ethernet/ibm/emac/core.c
+++ b/drivers/net/ethernet/ibm/emac/core.c
@@ -37,6 +37,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -674,7 +675,7 @@ static int emac_configure(struct emac_instance *dev)
 ndev->dev_addr[5]);
 
/* VLAN Tag Protocol ID */
-   out_be32(&p->vtpid, 0x8100);
+   out_be32(&p->vtpid, ETH_P_8021Q);
 
/* Receive mode register */
r = emac_iff2rmr(ndev);
@@ -1435,6 +1436,22 @@ static inline netdev_tx_t emac_xmit_finish(struct 
emac_instance *dev, int len)
return NETDEV_TX_OK;
 }
 
+static inline u16 emac_tx_vlan(struct emac_instance *dev, struct sk_buff *skb)
+{
+   /* Handle VLAN TPID and TCI insert if this is a VLAN skb */
+   if (emac_has_feature(dev, EMAC_FTR_HAS_VLAN_CTAG_TX) &&
+   skb_vlan_tag_present(skb)) {
+   struct emac_regs __iomem *p = dev->emacp;
+
+   /* update the VLAN TCI */
+   out_be32(&p->vtci, (u32)skb_vlan_tag_get(skb));
+
+   /* Insert VLAN tag */
+   return EMAC_TX_CTRL_IVT;
+   }
+   return 0;
+}
+
 /* Tx lock BH */
 static netdev_tx_t emac_start_xmit(struct sk_buff *skb, struct net_device 
*ndev)
 {
@@ -1443,7 +1460,7 @@ static netdev_tx_t emac_start_xmit(struct sk_buff *skb, 
struct net_device *ndev)
int slot;
 
u16 ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY |
-   MAL_TX_CTRL_LAST | emac_tx_csum(dev, skb);
+   MAL_TX_CTRL_LAST | emac_tx_csum(dev, skb) | emac_tx_vlan(dev, skb);
 
slot = dev->tx_slot++;
if (dev->tx_slot == NUM_TX_BUFF) {
@@ -1518,7 +1535,7 @@ emac_start_xmit_sg(struct sk_buff *skb, struct net_device 
*ndev)
goto stop_queue;
 
ctrl = EMAC_TX_CTRL_GFCS | EMAC_TX_CTRL_GP | MAL_TX_CTRL_READY |
-   emac_tx_csum(dev, skb);
+   emac_tx_csum(dev, skb) | emac_tx_vlan(dev, skb);
slot = dev->tx_slot;
 
/* skb data */
@@ -2891,7 +2908,8 @@ static int emac_init_config(struct emac_instance *dev)
if (of_device_is_compatible(np, "ibm,emac-apm821xx")) {
dev->features |= (EMAC_APM821XX_REQ_JUMBO_FRAME_SIZE |
  EMAC_FTR_APM821XX_NO_HALF_DUPLEX |
- EMAC_FTR_460EX_PHY_CLK_FIX);
+ EMAC_FTR_460EX_PHY_CLK_FIX |
+ EMAC_FTR_HAS_VLAN_CTAG_TX);
}
} else if (of_device_is_compatible(np, "ibm,emac4")) {
dev->features |= EMAC_FTR_EMAC4;
@@ -3148,6 +3166,12 @@ static int emac_probe(struct platform_device *ofdev)
 
if (dev->tah_dev) {
ndev->hw_features = NETIF_F_IP_CSUM | NETIF_F_SG;
+
+   if (emac_has_feature(dev, EMAC_FTR_HAS_VLAN_CTAG_TX)) {
+   ndev->vlan_features |= ndev->hw_features;
+   ndev->hw_features |= NETIF_F_HW_VLAN_CTAG_TX;
+   }
+
ndev->features |= ndev->hw_features | NETIF_F_RXCSUM;
}
ndev->watchdog_timeo = 5 * HZ;
diff --git a/drivers/net/ethernet/ibm/emac/core.h 
b/drivers/net/ethernet/ibm/emac/core.h
index 84caa4a3fc52..8d84d439168c 100644
--- a/drivers/net/ethernet/ibm/emac/core.h
+++ b/drivers/net/ethernet/ibm/emac/core.h
@@ -334,6 +334,8 @@ struct emac_instance {
  * APM821xx does not support Half Duplex mode
  */
 #define EMAC_FTR_APM821XX_NO_HALF_DUPLEX   0x1000
+/* EMAC can insert 802.1Q tag */
+#define EMAC_FTR_HAS_VLAN_CTAG_TX  0x2000
 
 /* Right now, we don't quite handle the always/possible masks on the
  * most optimal way as we don't have a way to say something like
@@ -363,7 +365,9 @@ enum {
EMAC_FTR_460EX_PHY_CLK_FIX |
EMAC_FTR_440EP_PHY_CLK_FIX |
EMAC_APM821XX_REQ_JUMBO_FRAME_SIZE |
-   EMAC_FTR_APM821XX_NO_HALF_DUPLEX,
+   EMAC_FTR_APM821XX_NO_HALF_DUPLEX |
+   EMAC_FTR_HAS_VLAN_CTAG_TX |
+   0,
 };
 
 static inline int emac_has_feature(struct emac_instance *dev,
-- 
2.19.1



Re: [RFC PATCH v2 06/10] udp: cope with UDP GRO packet misdirection

2018-10-23 Thread Paolo Abeni
Hi,

On Mon, 2018-10-22 at 13:04 -0600, Subash Abhinov Kasiviswanathan
wrote:
> On 2018-10-19 08:25, Paolo Abeni wrote:
> > In some scenarios, the GRO engine can assemble an UDP GRO packet
> > that ultimately lands on a non GRO-enabled socket.
> > This patch tries to address the issue explicitly checking for the UDP
> > socket features before enqueuing the packet, and eventually segmenting
> > the unexpected GRO packet, as needed.
> > 
> > We must also cope with re-insertion requests: after segmentation the
> > UDP code calls the helper introduced by the previous patches, as 
> > needed.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> > +static inline bool udp_unexpected_gso(struct sock *sk, struct sk_buff
> > *skb)
> > +{
> > +   return !udp_sk(sk)->gro_enabled && skb_is_gso(skb) &&
> > +  skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4;
> > +}
> > +
> > +static inline struct sk_buff *udp_rcv_segment(struct sock *sk,
> > + struct sk_buff *skb)
> > +{
> > +   struct sk_buff *segs;
> > +
> > +   /* the GSO CB lays after the UDP one, no need to save and restore
> > any
> > +* CB fragment, just initialize it
> > +*/
> > +   segs = __skb_gso_segment(skb, NETIF_F_SG, false);
> > +   if (unlikely(IS_ERR(segs)))
> > +   kfree_skb(skb);
> > +   else if (segs)
> > +   consume_skb(skb);
> > +   return segs;
> > +}
> > +
> > +
> 
> Hi Paolo
> 
> Do we need to check for IS_ERR_OR_NULL(segs)

Yes, thanks.

(also Williem already noted the above)

> > 
> > +void ip_protocol_deliver_rcu(struct net *net, struct sk_buff *skb, int
> > proto);
> > +
> > +static int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > +{
> > +   struct sk_buff *next, *segs;
> > +   int ret;
> > +
> > +   if (likely(!udp_unexpected_gso(sk, skb)))
> > +   return udp_queue_rcv_one_skb(sk, skb);
> > +static int udpv6_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
> > +{
> > +   struct sk_buff *next, *segs;
> > +   int ret;
> > +
> > +   if (likely(!udp_unexpected_gso(sk, skb)))
> > +   return udpv6_queue_rcv_one_skb(sk, skb);
> > +
> 
> Is the "likely" required here?

Not required, but currently helpful IMHO, as we should hit the above
only on unlikey and really unwonted configuration.

Note that only SKB_GSO_UDP_L4 GSO packets will not match the above
likely condition.

> HW can coalesce all incoming streams of UDP and may not know the socket 
> state.
> In that case, a socket not having UDP GRO option might see a penalty 
> here.

Really? Is there any HW creating SKB_GSO_UDP_L4 packets on RX? if the
HW is doing that, without this patch, I think it's breaking existing
applications (which may expext that the read UDP frame length
implicitly describe the application level message length).

Cheers,

Paolo






Re: [PATCH net-next 1/3] net/sock: factor out dequeue/peek with offset code

2018-10-23 Thread Paolo Abeni
Hi,

On Mon, 2018-10-22 at 21:49 -0700, Alexei Starovoitov wrote:
> On Mon, May 15, 2017 at 11:01:42AM +0200, Paolo Abeni wrote:
> > And update __sk_queue_drop_skb() to work on the specified queue.
> > This will help the udp protocol to use an additional private
> > rx queue in a later patch.
> > 
> > Signed-off-by: Paolo Abeni 
> > ---
> >  include/linux/skbuff.h |  7 
> >  include/net/sock.h |  4 +--
> >  net/core/datagram.c| 90 
> > --
> >  3 files changed, 60 insertions(+), 41 deletions(-)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index a098d95..bfc7892 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -3056,6 +3056,13 @@ static inline void skb_frag_list_init(struct sk_buff 
> > *skb)
> >  
> >  int __skb_wait_for_more_packets(struct sock *sk, int *err, long *timeo_p,
> > const struct sk_buff *skb);
> > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > + struct sk_buff_head *queue,
> > + unsigned int flags,
> > + void (*destructor)(struct sock *sk,
> > +  struct sk_buff *skb),
> > + int *peeked, int *off, int *err,
> > + struct sk_buff **last);
> >  struct sk_buff *__skb_try_recv_datagram(struct sock *sk, unsigned flags,
> > void (*destructor)(struct sock *sk,
> >struct sk_buff *skb),
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 66349e4..49d226f 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -2035,8 +2035,8 @@ void sk_reset_timer(struct sock *sk, struct 
> > timer_list *timer,
> >  
> >  void sk_stop_timer(struct sock *sk, struct timer_list *timer);
> >  
> > -int __sk_queue_drop_skb(struct sock *sk, struct sk_buff *skb,
> > -   unsigned int flags,
> > +int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
> > +   struct sk_buff *skb, unsigned int flags,
> > void (*destructor)(struct sock *sk,
> >struct sk_buff *skb));
> >  int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
> > diff --git a/net/core/datagram.c b/net/core/datagram.c
> > index db1866f2..a4592b4 100644
> > --- a/net/core/datagram.c
> > +++ b/net/core/datagram.c
> > @@ -161,6 +161,43 @@ static struct sk_buff *skb_set_peeked(struct sk_buff 
> > *skb)
> > return skb;
> >  }
> >  
> > +struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
> > + struct sk_buff_head *queue,
> > + unsigned int flags,
> > + void (*destructor)(struct sock *sk,
> > +  struct sk_buff *skb),
> > + int *peeked, int *off, int *err,
> > + struct sk_buff **last)
> > +{
> > +   struct sk_buff *skb;
> > +
> > +   *last = queue->prev;
> 
> this refactoring changed the behavior.
> Now queue->prev is returned as last.
> Whereas it was *last = queue before.
> 
> > +   skb_queue_walk(queue, skb) {
> 
> and *last = skb assignment is gone too.
> 
> Was this intentional ? 

Yes.

> Is this the right behavior?

I think so. queue->prev is the last skb in the queue. With the old
code,   __skb_try_recv_datagram(), when returning NULL, used the
instructions you quoted to overall set 'last' to the last skb in the
queue. We did not use 'last' elsewhere. So overall this just reduce the
number of instructions inside the loop. (unless I'm missing something).

Are you experiencing any specific issues due to the mentioned commit?

Thanks,

Paolo