Re: [PATCH] [net/ipv6] ip6_output: Add ipv6_pinfo null check

2020-07-27 Thread Cong Wang
On Sun, Jul 26, 2020 at 8:39 PM Gaurav Singh  wrote:
>
> ipv6_pinfo is initialized by inet6_sk() which returns NULL.

Why? It only returns NULL for timewait or request sock, but
I don't see how ip6_autoflowlabel() could be called on these
sockets. So please explain.

> Hence it can cause a segmentation fault. Fix this by adding a
> NULL check.

Which exact call path? Do you have a full stack trace?

Thanks.


Re: [PATCH v2] net: ipv6: fix use-after-free Read in __xfrm6_tunnel_spi_lookup

2020-07-26 Thread Cong Wang
On Sat, Jul 25, 2020 at 11:12 PM B K Karthik  wrote:
>
> On Sun, Jul 26, 2020 at 11:05 AM Cong Wang  wrote:
> >
> > On Sat, Jul 25, 2020 at 8:09 PM B K Karthik  wrote:
> > > @@ -103,10 +103,10 @@ static int __xfrm6_tunnel_spi_check(struct net *net, u32 spi)
> > >  {
> > > struct xfrm6_tunnel_net *xfrm6_tn = xfrm6_tunnel_pernet(net);
> > > struct xfrm6_tunnel_spi *x6spi;
> > > -   int index = xfrm6_tunnel_spi_hash_byspi(spi);
> > > +   int index = xfrm6_tunnel_spi_hash_byaddr((const xfrm_address_t *)spi);
> > >
> > > hlist_for_each_entry(x6spi,
> > > -&xfrm6_tn->spi_byspi[index],
> > > +&xfrm6_tn->spi_byaddr[index],
> > >  list_byspi) {
> > > if (x6spi->spi == spi)
> >
> > How did you convince yourself this is correct? This lookup is still
> > using spi. :)
>
> I'm sorry, but my intention behind writing this patch was not to fix
> the UAF, but to fix a slab-out-of-bound.

Odd, your $subject is clearly UAF, and so is the stack trace in your changelog.
:)


> If required, I can definitely change the subject line and resend the
> patch, but I figured this was correct for
> https://syzkaller.appspot.com/bug?id=058d05f470583ab2843b1d6785fa8d0658ef66ae
> . since that particular report did not have a reproducer,
> Dmitry Vyukov  suggested that I test this patch on
> other reports for xfrm/spi .

You have to change it to avoid being misleading.

>
> Forgive me if this was the wrong way to send a patch for that
> particular report, but I guessed since the reproducer did not trigger
> the crash
> for UAF, I would leave the subject line as 'fix UAF' :)
>
> xfrm6_spi_hash_by_hash seemed more convincing because I had to prevent
> a slab-out-of-bounds because it uses ipv6_addr_hash.
> It would be of great help if you could help me understand how this was
> able to fix a UAF.

Sure, you just avoid a pointer deref, which of course can fix the UAF,
but I still don't think it is correct in any aspect.

Even if it is an OOB, you still have to explain why it happened. Once
again, I can't see how it could happen either.

>
> >
> > More importantly, can you explain how UAF happens? Apparently
> > the syzbot stack traces you quote make no sense at all. I also
> > looked at other similar reports; none of them makes sense to me.
>
> Forgive me, but I do not understand what you mean by the stack traces
> (this or other similar reports) "make no sense".

Because the stack trace in your changelog clearly shows it is allocated
in tomoyo_init_log(), which is a buffer in struct tomoyo_query, but
none of the xfrm paths uses it. Or do you see anything otherwise?

Thanks.


Re: [PATCH v2] net: ipv6: fix use-after-free Read in __xfrm6_tunnel_spi_lookup

2020-07-25 Thread Cong Wang
On Sat, Jul 25, 2020 at 8:09 PM B K Karthik  wrote:
> @@ -103,10 +103,10 @@ static int __xfrm6_tunnel_spi_check(struct net *net, u32 spi)
>  {
> struct xfrm6_tunnel_net *xfrm6_tn = xfrm6_tunnel_pernet(net);
> struct xfrm6_tunnel_spi *x6spi;
> -   int index = xfrm6_tunnel_spi_hash_byspi(spi);
> +   int index = xfrm6_tunnel_spi_hash_byaddr((const xfrm_address_t *)spi);
>
> hlist_for_each_entry(x6spi,
> -&xfrm6_tn->spi_byspi[index],
> +&xfrm6_tn->spi_byaddr[index],
>  list_byspi) {
> if (x6spi->spi == spi)

How did you convince yourself this is correct? This lookup is still
using spi. :)
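(For reference, the two hash helpers in net/ipv6/xfrm6_tunnel.c look
roughly like this; paraphrased, not copied verbatim:

static inline unsigned int xfrm6_tunnel_spi_hash_byaddr(const xfrm_address_t *addr)
{
        unsigned int h;

        h = ipv6_addr_hash((const struct in6_addr *)addr);
        h ^= h >> 16;
        h ^= h >> 8;
        h &= XFRM6_TUNNEL_SPI_BYADDR_HSIZE - 1;
        return h;
}

static inline unsigned int xfrm6_tunnel_spi_hash_byspi(u32 spi)
{
        return spi % XFRM6_TUNNEL_SPI_BYSPI_HSIZE;
}

Note that the _byaddr() variant expects a pointer to a 16-byte IPv6
address, so casting a u32 SPI value to xfrm_address_t * would make it
read from a bogus location, while the loop still compares x6spi->spi.)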

More importantly, can you explain how UAF happens? Apparently
the syzbot stack traces you quote make no sense at all. I also
looked at other similar reports; none of them makes sense to me.

Thanks.


[Patch net] ipv6: fix memory leaks on IPV6_ADDRFORM path

2020-07-25 Thread Cong Wang
IPV6_ADDRFORM causes resource leaks when converting an IPv6 socket
to IPv4, particularly struct ipv6_ac_socklist. Similar to
struct ipv6_mc_socklist, we should just close it on this path.

This bug can be easily reproduced with the following C program:

  #include <stdio.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <netinet/in.h>
  #include <arpa/inet.h>

  int main()
  {
int s, value;
struct sockaddr_in6 addr;
struct ipv6_mreq m6;

s = socket(AF_INET6, SOCK_DGRAM, 0);
addr.sin6_family = AF_INET6;
addr.sin6_port = htons(5000);
inet_pton(AF_INET6, ":::192.168.122.194", &addr.sin6_addr);
connect(s, (struct sockaddr *)&addr, sizeof(addr));

inet_pton(AF_INET6, "fe80::", &m6.ipv6mr_multiaddr);
m6.ipv6mr_interface = 5;
setsockopt(s, SOL_IPV6, IPV6_JOIN_ANYCAST, &m6, sizeof(m6));

value = AF_INET;
setsockopt(s, SOL_IPV6, IPV6_ADDRFORM, &value, sizeof(value));

close(s);
return 0;
  }

Reported-by: ch333...@gmail.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Cong Wang 
---
 include/net/addrconf.h   |  1 +
 net/ipv6/anycast.c   | 17 -
 net/ipv6/ipv6_sockglue.c |  1 +
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index fdb07105384c..8418b7d38468 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -274,6 +274,7 @@ int ipv6_sock_ac_join(struct sock *sk, int ifindex,
  const struct in6_addr *addr);
 int ipv6_sock_ac_drop(struct sock *sk, int ifindex,
  const struct in6_addr *addr);
+void __ipv6_sock_ac_close(struct sock *sk);
 void ipv6_sock_ac_close(struct sock *sk);
 
 int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr);
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 893261230ffc..dacdea7fcb62 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -183,7 +183,7 @@ int ipv6_sock_ac_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
return 0;
 }
 
-void ipv6_sock_ac_close(struct sock *sk)
+void __ipv6_sock_ac_close(struct sock *sk)
 {
struct ipv6_pinfo *np = inet6_sk(sk);
struct net_device *dev = NULL;
@@ -191,10 +191,7 @@ void ipv6_sock_ac_close(struct sock *sk)
struct net *net = sock_net(sk);
int prev_index;
 
-   if (!np->ipv6_ac_list)
-   return;
-
-   rtnl_lock();
+   ASSERT_RTNL();
pac = np->ipv6_ac_list;
np->ipv6_ac_list = NULL;
 
@@ -211,6 +208,16 @@ void ipv6_sock_ac_close(struct sock *sk)
sock_kfree_s(sk, pac, sizeof(*pac));
pac = next;
}
+}
+
+void ipv6_sock_ac_close(struct sock *sk)
+{
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (!np->ipv6_ac_list)
+   return;
+   rtnl_lock();
+   __ipv6_sock_ac_close(sk);
rtnl_unlock();
 }
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 20576e87a5f7..76f9e41859a2 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -240,6 +240,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 
fl6_free_socklist(sk);
__ipv6_sock_mc_close(sk);
+   __ipv6_sock_ac_close(sk);
 
/*
 * Sock is moving from IPv6 to IPv4 (sk_prot), so
-- 
2.27.0



[Patch net-next] net_sched: initialize timer earlier in red_init()

2020-07-25 Thread Cong Wang
When red_init() fails, red_destroy() is called to clean up.
If the timer is not initialized yet, del_timer_sync() will
complain. So we have to move timer_setup() before any failure.

Reported-and-tested-by: syzbot+6e95a4fabf88dc217...@syzkaller.appspotmail.com
Fixes: aee9caa03fc3 ("net: sched: sch_red: Add qevents "early_drop" and "mark"")
Cc: Petr Machata 
Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/sch_red.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index 4cc0ad0b1189..deac82f3ad7b 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -333,6 +333,10 @@ static int red_init(struct Qdisc *sch, struct nlattr *opt,
struct nlattr *tb[TCA_RED_MAX + 1];
int err;
 
+   q->qdisc = &noop_qdisc;
+   q->sch = sch;
+   timer_setup(&q->adapt_timer, red_adaptative_timer, 0);
+
if (!opt)
return -EINVAL;
 
@@ -341,10 +345,6 @@ static int red_init(struct Qdisc *sch, struct nlattr *opt,
if (err < 0)
return err;
 
-   q->qdisc = &noop_qdisc;
-   q->sch = sch;
-   timer_setup(&q->adapt_timer, red_adaptative_timer, 0);
-
err = __red_change(sch, tb, extack);
if (err)
return err;
-- 
2.27.0



Re: memory leak in ipv6_sock_ac_join ( 5.8.0-rc6+)

2020-07-24 Thread Cong Wang
On Thu, Jul 23, 2020 at 5:06 PM cH3332  wrote:
>
> Hi,
>
> SYZKALLER found the following Memory Leak

Thanks for your report!

Can you test the attached patch? I only did a compile test as I don't
have an environment to run syz programs.
diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index fdb07105384c..8418b7d38468 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -274,6 +274,7 @@ int ipv6_sock_ac_join(struct sock *sk, int ifindex,
 		  const struct in6_addr *addr);
 int ipv6_sock_ac_drop(struct sock *sk, int ifindex,
 		  const struct in6_addr *addr);
+void __ipv6_sock_ac_close(struct sock *sk);
 void ipv6_sock_ac_close(struct sock *sk);
 
 int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr);
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 893261230ffc..dacdea7fcb62 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -183,7 +183,7 @@ int ipv6_sock_ac_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
 	return 0;
 }
 
-void ipv6_sock_ac_close(struct sock *sk)
+void __ipv6_sock_ac_close(struct sock *sk)
 {
 	struct ipv6_pinfo *np = inet6_sk(sk);
 	struct net_device *dev = NULL;
@@ -191,10 +191,7 @@ void ipv6_sock_ac_close(struct sock *sk)
 	struct net *net = sock_net(sk);
 	int	prev_index;
 
-	if (!np->ipv6_ac_list)
-		return;
-
-	rtnl_lock();
+	ASSERT_RTNL();
 	pac = np->ipv6_ac_list;
 	np->ipv6_ac_list = NULL;
 
@@ -211,6 +208,16 @@ void ipv6_sock_ac_close(struct sock *sk)
 		sock_kfree_s(sk, pac, sizeof(*pac));
 		pac = next;
 	}
+}
+
+void ipv6_sock_ac_close(struct sock *sk)
+{
+	struct ipv6_pinfo *np = inet6_sk(sk);
+
+	if (!np->ipv6_ac_list)
+		return;
+	rtnl_lock();
+	__ipv6_sock_ac_close(sk);
 	rtnl_unlock();
 }
 
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 20576e87a5f7..76f9e41859a2 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -240,6 +240,7 @@ static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 
 			fl6_free_socklist(sk);
 			__ipv6_sock_mc_close(sk);
+			__ipv6_sock_ac_close(sk);
 
 			/*
 			 * Sock is moving from IPv6 to IPv4 (sk_prot), so


[Patch net v2] qrtr: orphan socket in qrtr_release()

2020-07-24 Thread Cong Wang
We have to detach sock from socket in qrtr_release(),
otherwise skb->sk may still reference this socket
when the skb is released in tun->queue, particularly
sk->sk_wq still points to &sock->wq, which leads to
a UAF.

Reported-and-tested-by: syzbot+6720d64f31c081c2f...@syzkaller.appspotmail.com
Fixes: 28fb4e59a47d ("net: qrtr: Expose tunneling endpoint to user space")
Cc: Bjorn Andersson 
Cc: Eric Dumazet 
Signed-off-by: Cong Wang 
---
 net/qrtr/qrtr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c
index 24a8c3c6da0d..300a104b9a0f 100644
--- a/net/qrtr/qrtr.c
+++ b/net/qrtr/qrtr.c
@@ -1180,6 +1180,7 @@ static int qrtr_release(struct socket *sock)
sk->sk_state_change(sk);
 
sock_set_flag(sk, SOCK_DEAD);
+   sock_orphan(sk);
sock->sk = NULL;
 
if (!sock_flag(sk, SOCK_ZAPPED))
-- 
2.27.0
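(For context, sock_orphan() from include/net/sock.h, lightly
paraphrased; it is the step that clears sk->sk_wq, so nothing is left
pointing into the about-to-be-freed struct socket:

static inline void sock_orphan(struct sock *sk)
{
        write_lock_bh(&sk->sk_callback_lock);
        sock_set_flag(sk, SOCK_DEAD);
        sk_set_socket(sk, NULL);
        sk->sk_wq = NULL;
        write_unlock_bh(&sk->sk_callback_lock);
}
)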



Re: [Patch net] qrtr: orphan skb before queuing in xmit

2020-07-23 Thread Cong Wang
On Thu, Jul 23, 2020 at 11:00 PM Cong Wang  wrote:
>
> I said socket, not sock. I believe the socket can be gone while the sock is
> still there.

Hmm, looks like sock_orphan() should be called...


Re: [Patch net] qrtr: orphan skb before queuing in xmit

2020-07-23 Thread Cong Wang
On Thu, Jul 23, 2020 at 10:35 PM Eric Dumazet  wrote:
>
>
>
> On 7/23/20 9:50 PM, Cong Wang wrote:
> > Similar to tun_net_xmit(), we have to orphan the skb
> > before queuing it, otherwise we may use the socket when
> > purging the queue after it is freed by user-space.
>
> Which socket ?

sk->sk_wq points to &sock->wq. The socket is of course from
qrtr_create().

>
> By not calling skb_orphan(skb), this skb should own a reference on skb->sk 
> preventing
> skb->sk to disappear.
>

I said socket, not sock. I believe the socket can be gone while the sock is
still there.


> It seems that instead of skb_orphan() here, we could avoid calling 
> skb_set_owner_w() in the first place,
> because this is confusing.

Not sure about this, at least tun calls skb_set_owner_w() too. More
importantly, sock_alloc_send_skb() calls it too. :)


[Patch net] qrtr: orphan skb before queuing in xmit

2020-07-23 Thread Cong Wang
Similar to tun_net_xmit(), we have to orphan the skb
before queuing it, otherwise we may use the socket when
purging the queue after it is freed by user-space.

Reported-and-tested-by: syzbot+6720d64f31c081c2f...@syzkaller.appspotmail.com
Fixes: 28fb4e59a47d ("net: qrtr: Expose tunneling endpoint to user space")
Cc: Bjorn Andersson 
Signed-off-by: Cong Wang 
---
 net/qrtr/tun.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/qrtr/tun.c b/net/qrtr/tun.c
index 15ce9b642b25..54a565dcfef3 100644
--- a/net/qrtr/tun.c
+++ b/net/qrtr/tun.c
@@ -20,6 +20,7 @@ static int qrtr_tun_send(struct qrtr_endpoint *ep, struct sk_buff *skb)
 {
struct qrtr_tun *tun = container_of(ep, struct qrtr_tun, ep);
 
+   skb_orphan(skb);
skb_queue_tail(&tun->queue, skb);
 
/* wake up any blocking processes, waiting for new data */
-- 
2.27.0



Re: KASAN: use-after-free Read in vlan_dev_get_iflink

2020-07-23 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: KASAN: use-after-free Read in macvlan_dev_get_iflink

2020-07-23 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


[Patch net] geneve: fix an uninitialized value in geneve_changelink()

2020-07-22 Thread Cong Wang
geneve_nl2info() sets 'df' conditionally, so we have to
initialize it by copying the value from existing geneve
device in geneve_changelink().

Fixes: 56c09de347e4 ("geneve: allow changing DF behavior after creation")
Reported-by: syzbot+7ebc2e088af5e4c0c...@syzkaller.appspotmail.com
Cc: Sabrina Dubroca 
Signed-off-by: Cong Wang 
---
 drivers/net/geneve.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 4661ef865807..dec52b763d50 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -1615,11 +1615,11 @@ static int geneve_changelink(struct net_device *dev, struct nlattr *tb[],
 struct netlink_ext_ack *extack)
 {
struct geneve_dev *geneve = netdev_priv(dev);
+   enum ifla_geneve_df df = geneve->df;
struct geneve_sock *gs4, *gs6;
struct ip_tunnel_info info;
bool metadata;
bool use_udp6_rx_checksums;
-   enum ifla_geneve_df df;
bool ttl_inherit;
int err;
 
-- 
2.27.0



[Patch net] bonding: check return value of register_netdevice() in bond_newlink()

2020-07-22 Thread Cong Wang
Very similar to commit 544f287b8495
("bonding: check error value of register_netdevice() immediately"),
we should immediately check the return value of register_netdevice()
before doing anything else.

Fixes: 005db31d5f5f ("bonding: set carrier off for devices created through netlink")
Reported-and-tested-by: syzbot+bbc3a11c4da63c1b7...@syzkaller.appspotmail.com
Cc: Beniamino Galvani 
Cc: Taehee Yoo 
Cc: Jay Vosburgh 
Signed-off-by: Cong Wang 
---
 drivers/net/bonding/bond_netlink.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_netlink.c b/drivers/net/bonding/bond_netlink.c
index b43b51646b11..f0f9138e967f 100644
--- a/drivers/net/bonding/bond_netlink.c
+++ b/drivers/net/bonding/bond_netlink.c
@@ -456,11 +456,10 @@ static int bond_newlink(struct net *src_net, struct net_device *bond_dev,
return err;
 
err = register_netdevice(bond_dev);
-
-   netif_carrier_off(bond_dev);
if (!err) {
struct bonding *bond = netdev_priv(bond_dev);
 
+   netif_carrier_off(bond_dev);
bond_work_init_all(bond);
}
 
-- 
2.27.0



Re: KASAN: use-after-free Write in __linkwatch_run_queue

2020-07-22 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: [PATCH] ipvlan: add the check of ip header checksum

2020-07-21 Thread Cong Wang
On Tue, Jul 21, 2020 at 6:17 AM guodeqing  wrote:
>
> The ip header checksum can be error in the following steps.
> $ ip netns add ns1
> $ ip link add gw link eth0 type ipvlan
> $ ip addr add 168.16.0.1/24 dev gw
> $ ip link set dev gw up
> $ ip link add ip1 link eth0 type ipvlan
> $ ip link set ip1 netns ns1
> $ ip netns exec ns1 ip link set ip1 up
> $ ip netns exec ns1 ip addr add 168.16.0.2/24 dev ip1
> $ ip netns exec ns1 tc qdisc add dev ip1 root netem corrupt 50%
> $ ip netns exec ns1 ping 168.16.0.1
>
> The ip header of a packet may be modified when it steps into
> ipvlan_process_v4_outbound because of the netem; the corrupted
> packets should be dropped.

This does not make much sense, as you intentionally corrupt
the header. More importantly, the check you add is too late, right?
ipvlan_xmit_mode_l3() already does the addr lookup with the IP header.

Thanks.


Re: [patch net-next v2] sched: sch_api: add missing rcu read lock to silence the warning

2020-07-21 Thread Cong Wang
On Mon, Jul 20, 2020 at 1:10 AM Jiri Pirko  wrote:
> diff --git a/include/linux/hashtable.h b/include/linux/hashtable.h
> index 78b6ea5fa8ba..f6c666730b8c 100644
> --- a/include/linux/hashtable.h
> +++ b/include/linux/hashtable.h
> @@ -173,9 +173,9 @@ static inline void hash_del_rcu(struct hlist_node *node)
>   * @member: the name of the hlist_node within the struct
>   * @key: the key of the objects to iterate over
>   */

I think you need to update the doc here too, that is, add @cond.


Re: [PATCH net] bonding: check error value of register_netdevice() immediately

2020-07-18 Thread Cong Wang
On Sat, Jul 18, 2020 at 12:14 AM Taehee Yoo  wrote:
>
> If register_netdevice() is failed, net_device should not be used
> because variables are uninitialized or freed.
> So, the routine should be stopped immediately.
> But, bond_create() doesn't check return value of register_netdevice()
> immediately. That will result in a panic because of using uninitialized
> or freed memory.
>
> Test commands:
> modprobe netdev-notifier-error-inject
> echo -22 > /sys/kernel/debug/notifier-error-inject/netdev/\
> actions/NETDEV_REGISTER/error
> modprobe bonding max_bonds=3
>
> Splat looks like:
> [  375.028492][  T193] general protection fault, probably for non-canonical 
> address 0x6b6b6b6b6b6b6b6b:  [#1] SMP DEBUG_PAGEALLOC PTI
> [  375.033207][  T193] CPU: 2 PID: 193 Comm: kworker/2:2 Not tainted 
> 5.8.0-rc4+ #645
> [  375.036068][  T193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.10.2-1ubuntu1 04/01/2014
> [  375.039673][  T193] Workqueue: events linkwatch_event
> [  375.041557][  T193] RIP: 0010:dev_activate+0x4a/0x340
> [  375.043381][  T193] Code: 40 a8 04 0f 85 db 00 00 00 8b 83 08 04 00 00 85 
> c0 0f 84 0d 01 00 00 31 d2 89 d0 48 8d 04 40 48 c1 e0 07 48 03 83 00 04 00 00 
> <48> 8b 48 10 f6 41 10 01 75 08 f0 80 a1 a0 01 00 00 fd 48 89 48 08
> [  375.050267][  T193] RSP: 0018:9f8facfcfdd8 EFLAGS: 00010202
> [  375.052410][  T193] RAX: 6b6b6b6b6b6b6b6b RBX: 9f8fae6ea000 RCX: 
> 0006
> [  375.055178][  T193] RDX:  RSI:  RDI: 
> 9f8fae6ea000
> [  375.057762][  T193] RBP: 9f8fae6ea000 R08:  R09: 
> 
> [  375.059810][  T193] R10: 0001 R11:  R12: 
> 9f8facfcfe08
> [  375.061892][  T193] R13: 883587e0 R14:  R15: 
> 9f8fae6ea580
> [  375.063931][  T193] FS:  () GS:9f8fbae0() 
> knlGS:
> [  375.066239][  T193] CS:  0010 DS:  ES:  CR0: 80050033
> [  375.067841][  T193] CR2: 7f2f542167a0 CR3: 00012cee6002 CR4: 
> 003606e0
> [  375.069657][  T193] DR0:  DR1:  DR2: 
> 
> [  375.071471][  T193] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  375.073269][  T193] Call Trace:
> [  375.074005][  T193]  linkwatch_do_dev+0x4d/0x50
> [  375.075052][  T193]  __linkwatch_run_queue+0x10b/0x200
> [  375.076244][  T193]  linkwatch_event+0x21/0x30
> [  375.077274][  T193]  process_one_work+0x252/0x600
> [  375.078379][  T193]  ? process_one_work+0x600/0x600
> [  375.079518][  T193]  worker_thread+0x3c/0x380
> [  375.080534][  T193]  ? process_one_work+0x600/0x600
> [  375.081668][  T193]  kthread+0x139/0x150
> [  375.082567][  T193]  ? kthread_park+0x90/0x90
> [  375.083567][  T193]  ret_from_fork+0x22/0x30
>
> Fixes: 9e2e61fbf8ad ("bonding: fix potential deadlock in bond_uninit()")

I doubt this is the first offending commit. At that time, the only
thing after register_netdevice() was rtnl_unlock(). I think it is
commit e826eafa65c6f1f7c8db5a237556cebac57ebcc5 which
introduced the bug, as it is the first commit that puts something between
register_netdevice() and rtnl_unlock().

But this patch itself is obviously correct.

Thanks.


Re: [PATCH net-next v2 0/3] make nf_ct_frag/6_gather elide the skb CB clear

2020-07-17 Thread Cong Wang
On Wed, Jul 15, 2020 at 2:17 PM Florian Westphal  wrote:
>
> Jakub Kicinski  wrote:
> > On Tue,  7 Jul 2020 12:55:08 +0800 we...@ucloud.cn wrote:
> > > From: wenxu 
> > >
> > > Add nf_ct_frag_gather and Make nf_ct_frag6_gather elide the CB clear
> > > when packets are defragmented by connection tracking. This can make
> > > each subsystem such as br_netfilter, openvswitch, act_ct do defrag
> > > without restore the CB.
> > > This also avoid serious crashes and problems in  ct subsystem.
> > > Because Some packet schedulers store pointers in the qdisc CB private
> > > area and parallel accesses to the SKB.
> > >
> > > This series following up
> > > http://patchwork.ozlabs.org/project/netdev/patch/1593422178-26949-1-git-send-email-we...@ucloud.cn/
> > >
> > > patch1: add nf_ct_frag_gather elide the CB clear
> > > patch2: make nf_ct_frag6_gather elide the CB clear
> > > patch3: fix clobber qdisc_skb_cb in act_ct with defrag
> > >
> > > v2: resue some ip_defrag function in patch1
> >
> > Florian, Cong - are you willing to venture an ack on these? Anyone?
>
> Nope, sorry.  Reason is that I can't figure out the need for this series.
> Taking a huge step back:
>
> http://patchwork.ozlabs.org/project/netdev/patch/1593422178-26949-1-git-send-email-we...@ucloud.cn/
>
> That patch looks ok to me:
> I understand the problem statement/commit message and I can see how its 
> addressed.
>
> I don't understand why the CB clearing must be avoided.
>
> defrag assumes skb ownership -- e.g. it may realloc skb->data
> (calls pskb_may_pull), it calls skb_orphan(), etc.
>
> AFAICS, tcf_classify makes same assumption -- exclusive ownership
> and no parallel skb accesses.
>
> So, if in fact the "only" problem is the loss of
> qdisc_skb_cb(skb)->pkt_len, then the other patch looks ok to me.
>
> If we indeed have parallel access, then I do not understand how
> avoiding the memsets in the defrag path makes things any better
> (see above wrt. skb pull and the like).

+1

I don't see parallel access here either. skb can be cloned for packet
socket or act_mirred, but its CB is cloned at the same time.

Thanks.


Re: [PATCH net-next v3 2/4] net/sched: Introduce action hash

2020-07-14 Thread Cong Wang
On Mon, Jul 13, 2020 at 8:17 PM Ariel Levkovich  wrote:
>
> On 7/13/20 6:04 PM, Cong Wang wrote:
> > On Sat, Jul 11, 2020 at 2:28 PM Ariel Levkovich  wrote:
> >> Allow user to set a packet's hash value using a bpf program.
> >>
> >> The user provided BPF program is required to compute and return
> >> a hash value for the packet which is then stored in skb->hash.
> > Can be done by act_bpf, right?
>
> Right. We already agreed on that.
>
> Nevertheless, as I mentioned, act_bpf is not offloadable.
>
> Device driver has no clue what the program does.

What about offloading act_skbedit? You care about offloading
the skb->hash computation, not about bpf.

Thanks.


Re: [PATCH v4 net] rtnetlink: Fix memory(net_device) leak when ->newlink fails

2020-07-14 Thread Cong Wang
On Tue, Jul 14, 2020 at 6:27 PM Weilong Chen  wrote:
>
> When vlan_newlink's call to register_vlan_dev fails, it might return an error
> with dev->reg_state = NETREG_UNREGISTERED. The rtnl_newlink should
> free the memory. But currently rtnl_newlink only frees the memory whose
> state is NETREG_UNINITIALIZED.
>
> BUG: memory leak
> unreferenced object 0x8881051de000 (size 4096):
>   comm "syz-executor139", pid 560, jiffies 4294745346 (age 32.445s)
>   hex dump (first 32 bytes):
> 76 6c 61 6e 32 00 00 00 00 00 00 00 00 00 00 00  vlan2...
> 00 45 28 03 81 88 ff ff 00 00 00 00 00 00 00 00  .E(.
>   backtrace:
> [<47527e31>] kmalloc_node include/linux/slab.h:578 [inline]
> [<47527e31>] kvmalloc_node+0x33/0xd0 mm/util.c:574
> [<2b59e3bc>] kvmalloc include/linux/mm.h:753 [inline]
> [<2b59e3bc>] kvzalloc include/linux/mm.h:761 [inline]
> [<2b59e3bc>] alloc_netdev_mqs+0x83/0xd90 net/core/dev.c:9929
> [<6076752a>] rtnl_create_link+0x2c0/0xa20 
> net/core/rtnetlink.c:3067
> [<572b3be5>] __rtnl_newlink+0xc9c/0x1330 net/core/rtnetlink.c:3329
> [] rtnl_newlink+0x66/0x90 net/core/rtnetlink.c:3397
> [<52c7c0a9>] rtnetlink_rcv_msg+0x540/0x990 
> net/core/rtnetlink.c:5460
> [<4b5cb379>] netlink_rcv_skb+0x12b/0x3a0 
> net/netlink/af_netlink.c:2469
> [] netlink_unicast_kernel net/netlink/af_netlink.c:1303 
> [inline]
> [] netlink_unicast+0x4c6/0x690 
> net/netlink/af_netlink.c:1329
> [] netlink_sendmsg+0x735/0xcc0 
> net/netlink/af_netlink.c:1918
> [<9221ebf7>] sock_sendmsg_nosec net/socket.c:652 [inline]
> [<9221ebf7>] sock_sendmsg+0x109/0x140 net/socket.c:672
> [<1c30ffe4>] sys_sendmsg+0x5f5/0x780 net/socket.c:2352
> [] ___sys_sendmsg+0x11d/0x1a0 net/socket.c:2406
> [<07297384>] __sys_sendmsg+0xeb/0x1b0 net/socket.c:2439
> [<0eb29b11>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
> [<6839b4d0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: e51fb152318ee6 ("rtnetlink: fix a memory leak when ->newlink fails")

This bug was apparently not introduced by my commit above.

It should be commit cb626bf566eb4433318d35681286c494f0,
right? That commit introduced NETREG_UNREGISTERED on the path.


Re: PATCH ax25: Don't hold skb lock while doing blocking read

2020-07-13 Thread Cong Wang
On Thu, Jul 9, 2020 at 12:53 PM Thomas Habets  wrote:
>
> Here's a test program that illustrates the problem:
> https://github.com/ThomasHabets/radiostuff/blob/master/ax25/axftp/examples/client_lockcheck.cc
>
> Before this patch, this hangs, because the read(2) blocks the
> write(2).
>
> I see that calling skb_recv_datagram without this lock is done in
> pep_sock_accept() and atalk_recvmsg() and others, which is what makes
> me think it's safe to do here too. But I'm far from an expert on skb
> lock semantics.

A proper fix here is just to release the sock lock before going to
sleep in __skb_wait_for_more_packets(). You can take a look at
how sk_wait_event() handles this.

Of course, not all callers call it with sock lock, so it needs some
additional work.
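(A simplified sketch of the sk_wait_event() pattern from
include/net/sock.h, with details paraphrased from memory:

#define sk_wait_event(__sk, __timeo, __condition, __wait)              \
({                                                                      \
        int __rc;                                                       \
        release_sock(__sk);                                             \
        __rc = __condition;                                             \
        if (!__rc)                                                      \
                *(__timeo) = wait_woken(__wait, TASK_INTERRUPTIBLE,     \
                                        *(__timeo));                    \
        lock_sock(__sk);                                                \
        __rc = __condition;                                             \
        __rc;                                                           \
})

The point is the release_sock()/lock_sock() pair around the sleep, so a
blocking read does not hold the sock lock and block writers.)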

Thanks.


Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks

2020-07-13 Thread Cong Wang
On Fri, Jul 10, 2020 at 7:40 AM Petr Machata  wrote:
>
>
> Cong Wang  writes:
>
> > On Wed, Jul 8, 2020 at 5:13 PM Petr Machata  wrote:
> >>
> >>
> >> Petr Machata  writes:
> >>
> >> > Cong Wang  writes:
> >> >
> >> > I'll think about it some more. For now I will at least fix the lack of
> >> > locking.
> >>
> >> I guess I could store smp_processor_id() that acquired the lock in
> >> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
> >> the stored value. I'll need to be careful about the race between
> >> unsuccessful trylock and the test, and about making sure CPU ID doesn't
> >> change after it is read. I'll probe this tomorrow.
> >
> > Like __netif_tx_lock(), right? Seems doable.
>
> Good to see it actually used, I wasn't sure if the idea made sense :)
>
> Unfortunately it is not enough.
>
> Consider two threads (A, B) and two netdevices (eth0, eth1):
>
> - "A" takes eth0's root lock and proceeds to classification
> - "B" takes eth1's root lock and proceeds to classification
> - "A" invokes mirror to eth1, waits on lock held by "B"
> - "B" invakes mirror to eth0, waits on lock held by "A"
> - Some say they are still waiting to this day.

Sure, AA or ABBA deadlock.

>
> So one option that I see is to just stash the mirrored packet in a queue
> instead of delivering it right away:
>
> - s/netif_receive_skb/netif_rx/ in act_mirred
>
> - Reuse the RX queue for TX packets as well, differentiating the two by
>   a bit in SKB CB. Then process_backlog() would call either
>   __netif_receive_skb() or dev_queue_transmit().
>
> - Drop mirred_rec_level guard.

I don't think I follow you, the root qdisc lock is on egress which has
nothing to do with ingress, so I don't see how netif_rx() is even involved.

>
> This seems to work, but I might be missing something non-obvious, such
> as CB actually being used for something already in that context. I would
> really rather not introduce a second backlog queue just for mirred
> though.
>
> Since mirred_rec_level does not kick in anymore, the same packet can end
> up being forwarded from the backlog queue, to the qdisc, and back to the
> backlog queue, forever. But that seems OK, that's what the admin
> configured, so that's what's happening.
>
> If this is not a good idea for some reason, this might work as well:
>
> - Convert the current root lock to an rw lock. Convert all current
>   lockers to write lock (which should be safe), except of enqueue, which
>   will take read lock. That will allow many concurrent threads to enter
>   enqueue, or one thread several times, but it will exclude all other
>   users.

Are you sure we can parallelize enqueue()? They all need to move
skb into some queue, which cannot be parallelized with just a read
lock. Even the "lockless" qdisc takes a spinlock, r->producer_lock,
for enqueue().
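(For example, ptr_ring_produce() from include/linux/ptr_ring.h, which
backs the "lockless" pfifo_fast, paraphrased:

static inline int ptr_ring_produce(struct ptr_ring *r, void *ptr)
{
        int ret;

        spin_lock(&r->producer_lock);
        ret = __ptr_ring_produce(r, ptr);
        spin_unlock(&r->producer_lock);

        return ret;
}
)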


>
>   So this guards configuration access to the qdisc tree, makes sure
>   qdiscs don't go away from under one's feet.
>
> - Introduce another spin lock to guard the private data of the qdisc
>   tree, counters etc., things that even two concurrent enqueue
>   operations shouldn't trample on. Enqueue takes this spin lock after
>   read-locking the root lock. act_mirred drops it before injecting the
>   packet and takes it again afterwards.
>
> Any opinions y'all?

I thought about forbidding mirror/redirecting to the same device,
but there might be some legitimate use cases of such. So, I don't
have any other ideas yet, perhaps there is some way to refactor
dev_queue_xmit() to avoid this deadlock.

Thanks.


Re: [PATCH net-next v3 2/4] net/sched: Introduce action hash

2020-07-13 Thread Cong Wang
On Sat, Jul 11, 2020 at 2:28 PM Ariel Levkovich  wrote:
>
> Allow user to set a packet's hash value using a bpf program.
>
> The user provided BPF program is required to compute and return
> a hash value for the packet which is then stored in skb->hash.

Can be done by act_bpf, right?

>
> Using this action to set the skb->hash is an alternative to setting
> it with act_skbedit and can be useful for future HW offload support
> when the HW hash function is different then the kernel's hash
> function.
> By using a bpg program that emulates the HW hash function user
> can ensure hash consistency between the SW and the HW.

It sounds weird that the sole reason to add a new action is
because of HW offloading. What prevents us from extending the
existing actions to support HW offloading?

Thanks.


Re: [PATCH net-next 2/2] net: sched: Lockless Token Bucket (LTB) Qdisc

2020-07-09 Thread Cong Wang
On Thu, Jul 9, 2020 at 11:07 PM YU, Xiangning
 wrote:
>
>
> On 7/9/20 10:20 PM, Cong Wang wrote:
> > On Thu, Jul 9, 2020 at 10:04 PM Cong Wang  wrote:
> >> IOW, without these *additional* efforts, it is broken in terms of
> >> out-of-order.
> >>
> >
> > Take a look at fq_codel, it provides a hash function for flow 
> > classification,
> > fq_codel_hash(), as default, thus its default configuration does not
> > have such issues. So, you probably want to provide such a hash
> > function too instead of a default class.
> >
> If I understand this code correctly, this socket hash value identifies a 
> flow. Essentially it serves the same purpose as socket priority. In this 
> patch, we use a similar classification method like htb, but without filters.

How is it any similar to HTB? HTB does not have a per-cpu queue
for each class. This is a huge difference.

>
> We could provide a hash function, but I'm a bit confused about the problem we 
> are trying to solve.

Probably more than that, you need to ensure the packets in the same flow
are queued on the same queue.

Let's say you have two packets P1 and P2 from the same flow (P1 is before P2),
you can classify them into the same class of course, but with per-cpu queues
they can be sent out in a wrong order too:

send(P1) on CPU1 -> classify() returns default class -> P1 is queued on
the CPU1 queue of default class

(Now process is migrated to CPU2)

send(P2) on CPU2 -> classify() returns default class -> P2 is queued on
the CPU2 queue of default class

P2 is dequeued on CPU2 before P1 dequeued on CPU1.

Now, out of order. :)

Hope it is clear now.

Thanks.


Re: [PATCH net-next 2/2] net: sched: Lockless Token Bucket (LTB) Qdisc

2020-07-09 Thread Cong Wang
On Thu, Jul 9, 2020 at 10:49 PM YU, Xiangning
 wrote:
>
> Well, we do ask packets from a flow to be classified to a single class, not 
> multiple ones. It doesn't have to be socket priority, it could be five tuple 
> hash, or even container classid.

I don't see how that is so in your code; without skb priority your
code simply falls back to the default class:

+   /* Allow to select a class by setting skb->priority */
+   if (likely(skb->priority != 0)) {
+   cl = ltb_find_class(sch, skb->priority);
+   if (cl)
+   return cl;
+   }
+   return rcu_dereference_bh(ltb->default_cls);

Mind to be more specific here?

BTW, your qdisc does not even support TC filters, does it?
At least I don't see that tcf_classify() is called.


>
> I think it's ok to have this requirement, even if we use htb, I would suggest 
> the same. Why do you think this is a problem?

Because HTB does not have a per-cpu queue for each class,
yours does, cl->aggr_queues[cpu], if your point here is why we
don't blame HTB.

Thanks.


Re: [PATCH net-next 2/2] net: sched: Lockless Token Bucket (LTB) Qdisc

2020-07-09 Thread Cong Wang
On Thu, Jul 9, 2020 at 10:04 PM Cong Wang  wrote:
> IOW, without these *additional* efforts, it is broken in terms of
> out-of-order.
>

Take a look at fq_codel: it provides a hash function for flow classification,
fq_codel_hash(), by default, and thus its default configuration does not
have such issues. So, you probably want to provide such a hash
function too instead of a default class.
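(For reference, fq_codel's default classifier is just a hash of the
flow, paraphrased from net/sched/sch_fq_codel.c:

static unsigned int fq_codel_hash(const struct fq_codel_sched_data *q,
                                  struct sk_buff *skb)
{
        return reciprocal_scale(skb_get_hash(skb), q->flows_cnt);
}

Since skb_get_hash() is computed from the flow tuple, all packets of a
flow land in the same queue and stay in order.)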


Re: [PATCH net-next 2/2] net: sched: Lockless Token Bucket (LTB) Qdisc

2020-07-09 Thread Cong Wang
On Wed, Jul 8, 2020 at 2:07 PM YU, Xiangning
 wrote:
>
>
>
> On 7/8/20 1:24 PM, Cong Wang wrote:
> > On Tue, Jul 7, 2020 at 2:24 PM YU, Xiangning
> >  wrote:
> >>
> >> The key is to avoid classifying packets from a same flow into different 
> >> classes. So we use socket priority to classify packets. It's always going 
> >> to be correctly classified.
> >>
> >> Not sure what do you mean by default configuration. But we create a shadow 
> >> class when the qdisc is created. Before any other classes are created, all 
> >> packets from any flow will be classified to this same shadow class, there 
> >> won't be any incorrect classified packets either.
> >
> > By "default configuration" I mean no additional configuration on top
> > of qdisc creation. If you have to rely on additional TC filters to
> > do the classification, it could be problematic. Same for setting
> > skb priority, right?
> >
>
> In this patch we don't rely on other TC filters. In our use case, socket 
> priority is set on a per-flow basis, not per-skb basis.

Your use case is not the default configuration I mentioned.

>
> > Also, you use a default class, this means all unclassified packets
> > share the same class, and a flow falls into this class could be still
> > out-of-order, right?
> >
>
> A flow will fall, and only fall, into this class. If we can keep the order within
> a flow, I'm not sure why we still have this issue?

The issue here is obvious: you have to rely on either TC filters or
whatever sets skb priority to keep packets of a flow in order.

IOW, without these *additional* efforts, it is broken in terms of
out-of-order.

Thanks.


Re: [PATCH net] cgroup: Fix sock_cgroup_data on big-endian.

2020-07-09 Thread Cong Wang
On Thu, Jul 9, 2020 at 4:32 PM David Miller  wrote:
>
>
> From: Cong Wang 
>
> In order for no_refcnt and is_data to be the lowest order two
> bits in the 'val' we have to pad out the bitfield of the u8.
>
> Fixes: ad0f75e5f57c ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()")
> Reported-by: Guenter Roeck 
> Signed-off-by: David S. Miller 

Thanks a lot for sending this out (and correcting a stupid syntax
error from me)!


Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks

2020-07-09 Thread Cong Wang
On Wed, Jul 8, 2020 at 5:13 PM Petr Machata  wrote:
>
>
> Petr Machata  writes:
>
> > Cong Wang  writes:
> >
> > I'll think about it some more. For now I will at least fix the lack of
> > locking.
>
> I guess I could store smp_processor_id() that acquired the lock in
> struct qdisc_skb_head. Do a trylock instead of lock, and on fail check
> the stored value. I'll need to be careful about the race between
> unsuccessful trylock and the test, and about making sure CPU ID doesn't
> change after it is read. I'll probe this tomorrow.

Like __netif_tx_lock(), right? Seems doable.
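(Paraphrased from include/linux/netdevice.h: the owner CPU is recorded
under the lock, so a recursive transmit on the same CPU can be detected
by checking xmit_lock_owner before locking again:

static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu)
{
        spin_lock(&txq->_xmit_lock);
        txq->xmit_lock_owner = cpu;
}
)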

Thanks.


Re: [Patch net v2] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-07-09 Thread Cong Wang
On Thu, Jul 9, 2020 at 12:13 PM Guenter Roeck  wrote:
>
> On 7/9/20 11:51 AM, Cong Wang wrote:
> > On Thu, Jul 9, 2020 at 10:10 AM Guenter Roeck  wrote:
> >>
> >> Something seems fishy with the use of skcd->val on big endian systems.
> >>
> >> Some debug output:
> >>
> >> [   22.643703] sock: # sk_alloc(sk=1be28100): Calling 
> >> cgroup_sk_alloc(1be28550)
> >> [   22.643807] cgroup: # cgroup_sk_alloc(skcd=1be28550): 
> >> cgroup_sk_alloc_disabled=0, in_interrupt: 0
> >> [   22.643886] cgroup:   cgroup_sk_alloc(skcd=1be28550): 
> >> cset->dfl_cgrp=01224040, skcd->val=0x1224040
> >> [   22.643957] cgroup: ## cgroup_bpf_get(cgrp=01224040)
> >> [   22.646451] sock: # sk_prot_free(sk=1be28100): Calling 
> >> cgroup_sk_free(1be28550)
> >> [   22.646607] cgroup:   sock_cgroup_ptr(skcd=1be28550) -> 
> >> 00014040 [v=14040, skcd->val=14040]
> >> [   22.646632] cgroup: ### cgroup_sk_free(): skcd=1be28550, 
> >> cgrp=00014040
> >> [   22.646739] cgroup: ### cgroup_sk_free(): skcd->no_refcnt=0
> >> [   22.646814] cgroup: ### cgroup_sk_free(): Calling 
> >> cgroup_bpf_put(cgrp=00014040)
> >> [   22.646886] cgroup: ## cgroup_bpf_put(cgrp=00014040)
> >
> > Excellent debugging! I thought it was a double put, but it seems to
> > be an endian issue. I didn't realize the bit endian machine actually
> > packs bitfields in a big endian way too...
> >
> > Does the attached patch address this?
> >
>
> Partially. I don't see the crash anymore, but something is still odd - some 
> of my
> tests require a retry with this patch applied, which previously never 
> happened.
> I don't know if this is another problem with this patch, or a different 
> problem.
> Unfortunately, I'll be unable to debug this further until next Tuesday.

Make sure you test the second patch I sent, not the first one. The first one
is still incomplete and ugly too. The two bits must be the last two, so
correcting the if test is not sufficient; we have to fix the whole
bitfield packing.

Thanks!


Re: [Patch net v2] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-07-09 Thread Cong Wang
On Thu, Jul 9, 2020 at 11:51 AM Cong Wang  wrote:
>
> On Thu, Jul 9, 2020 at 10:10 AM Guenter Roeck  wrote:
> >
> > Something seems fishy with the use of skcd->val on big endian systems.
> >
> > Some debug output:
> >
> > [   22.643703] sock: # sk_alloc(sk=1be28100): Calling 
> > cgroup_sk_alloc(1be28550)
> > [   22.643807] cgroup: # cgroup_sk_alloc(skcd=1be28550): 
> > cgroup_sk_alloc_disabled=0, in_interrupt: 0
> > [   22.643886] cgroup:   cgroup_sk_alloc(skcd=1be28550): 
> > cset->dfl_cgrp=01224040, skcd->val=0x1224040
> > [   22.643957] cgroup: ## cgroup_bpf_get(cgrp=01224040)
> > [   22.646451] sock: # sk_prot_free(sk=1be28100): Calling 
> > cgroup_sk_free(1be28550)
> > [   22.646607] cgroup:   sock_cgroup_ptr(skcd=1be28550) -> 
> > 00014040 [v=14040, skcd->val=14040]
> > [   22.646632] cgroup: ### cgroup_sk_free(): skcd=1be28550, 
> > cgrp=00014040
> > [   22.646739] cgroup: ### cgroup_sk_free(): skcd->no_refcnt=0
> > [   22.646814] cgroup: ### cgroup_sk_free(): Calling 
> > cgroup_bpf_put(cgrp=00014040)
> > [   22.646886] cgroup: ## cgroup_bpf_put(cgrp=00014040)
>
> Excellent debugging! I thought it was a double put, but it seems to
> be an endian issue. I didn't realize the big endian machine actually
> packs bitfields in a big endian way too...
>
> Does the attached patch address this?

Ah, this is too ugly. We just have to always make them the last two bits.

Please test this attached patch instead and ignore the previous one.

Thanks.
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 4f1cd0edc57d..0e7a97b50d49 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -792,6 +792,7 @@ struct sock_cgroup_data {
 		struct {
 			u8	is_data : 1;
 			u8	no_refcnt : 1;
+			u8	unused : 6;
 			u8	padding;
 			u16	prioidx;
 			u32	classid;
@@ -801,6 +802,7 @@ struct sock_cgroup_data {
 			u32	classid;
 			u16	prioidx;
 			u8	padding;
+			u8	unused : 6;
 			u8	no_refcnt : 1;
 			u8	is_data : 1;
 		} __packed;
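(A hypothetical userspace demo of the layout issue, not part of the
patch; the struct below mirrors the little-endian branch:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

union scd {
	struct {
		uint8_t  is_data : 1;
		uint8_t  no_refcnt : 1;
		uint8_t  unused : 6;
		uint8_t  padding;
		uint16_t prioidx;
		uint32_t classid;
	} fields;
	uint64_t val;
};

int main(void)
{
	union scd d;

	memset(&d, 0, sizeof(d));
	d.fields.is_data = 1;
	/* Little endian: prints 0x1, i.e. is_data is bit 0 of val.
	 * Big endian: the same struct puts is_data in the top byte of
	 * val, which is why the kernel needs the mirrored layout. */
	printf("val = %#llx\n", (unsigned long long)d.val);
	return 0;
}
)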


Re: [Patch net v2] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-07-09 Thread Cong Wang
On Thu, Jul 9, 2020 at 10:10 AM Guenter Roeck  wrote:
>
> Something seems fishy with the use of skcd->val on big endian systems.
>
> Some debug output:
>
> [   22.643703] sock: # sk_alloc(sk=1be28100): Calling 
> cgroup_sk_alloc(1be28550)
> [   22.643807] cgroup: # cgroup_sk_alloc(skcd=1be28550): 
> cgroup_sk_alloc_disabled=0, in_interrupt: 0
> [   22.643886] cgroup:   cgroup_sk_alloc(skcd=1be28550): 
> cset->dfl_cgrp=01224040, skcd->val=0x1224040
> [   22.643957] cgroup: ## cgroup_bpf_get(cgrp=01224040)
> [   22.646451] sock: # sk_prot_free(sk=1be28100): Calling 
> cgroup_sk_free(1be28550)
> [   22.646607] cgroup:   sock_cgroup_ptr(skcd=1be28550) -> 
> 00014040 [v=14040, skcd->val=14040]
> [   22.646632] cgroup: ### cgroup_sk_free(): skcd=1be28550, 
> cgrp=00014040
> [   22.646739] cgroup: ### cgroup_sk_free(): skcd->no_refcnt=0
> [   22.646814] cgroup: ### cgroup_sk_free(): Calling 
> cgroup_bpf_put(cgrp=00014040)
> [   22.646886] cgroup: ## cgroup_bpf_put(cgrp=00014040)

Excellent debugging! I thought it was a double put, but it seems to
be an endian issue. I didn't realize the big endian machine actually
packs bitfields in a big endian way too...

Does the attached patch address this?

Thank you!
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 618838c48313..1729eded34ab 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -836,7 +836,11 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 	 */
 	v = READ_ONCE(skcd->val);
 
-	if (v & 3)
+#ifdef __LITTLE_ENDIAN
+	if (v & 0x3)
+#else
+	if (v & 0xc0)
+#endif
 		return &cgrp_dfl_root.cgrp;
 
 	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;


[Patch net] net_sched: fix a memory leak in atm_tc_init()

2020-07-08 Thread Cong Wang
When tcf_block_get() fails inside atm_tc_init(),
atm_tc_put() is called to release the qdisc p->link.q.
But the flow->ref prevents it from doing so, as the flow->ref
is still zero.

Fix this by moving the p->link.ref initialization before
tcf_block_get().

Fixes: 6529eaba33f0 ("net: sched: introduce tcf block infractructure")
Reported-and-tested-by: syzbot+d411cff6ab29cc2c3...@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/sch_atm.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_atm.c b/net/sched/sch_atm.c
index ee12ca9f55b4..1c281cc81f57 100644
--- a/net/sched/sch_atm.c
+++ b/net/sched/sch_atm.c
@@ -553,16 +553,16 @@ static int atm_tc_init(struct Qdisc *sch, struct nlattr *opt,
if (!p->link.q)
p->link.q = &noop_qdisc;
pr_debug("atm_tc_init: link (%p) qdisc %p\n", &p->link, p->link.q);
+   p->link.vcc = NULL;
+   p->link.sock = NULL;
+   p->link.common.classid = sch->handle;
+   p->link.ref = 1;
 
err = tcf_block_get(&p->link.block, &p->link.filter_list, sch,
extack);
if (err)
return err;
 
-   p->link.vcc = NULL;
-   p->link.sock = NULL;
-   p->link.common.classid = sch->handle;
-   p->link.ref = 1;
tasklet_init(&p->task, sch_atm_dequeue, (unsigned long)sch);
return 0;
 }
-- 
2.27.0



Re: [PATCH net-next 2/2] net: sched: Lockless Token Bucket (LTB) Qdisc

2020-07-08 Thread Cong Wang
On Tue, Jul 7, 2020 at 2:24 PM YU, Xiangning
 wrote:
>
> The key is to avoid classifying packets from a same flow into different 
> classes. So we use socket priority to classify packets. It's always going to 
> be correctly classified.
>
> Not sure what do you mean by default configuration. But we create a shadow 
> class when the qdisc is created. Before any other classes are created, all 
> packets from any flow will be classified to this same shadow class, there 
> won't be any incorrect classified packets either.

By "default configuration" I mean no additional configuration on top
of qdisc creation. If you have to rely on additional TC filters to
do the classification, it could be problematic. Same for setting
skb priority, right?

Also, you use a default class, this means all unclassified packets
share the same class, and a flow falls into this class could be still
out-of-order, right?

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-08 Thread Cong Wang
On Tue, Jul 7, 2020 at 7:18 AM Paolo Abeni  wrote:
> So the regression with 2 pktgen threads is still relevant. 'perf' shows
> relevant time spent into net_tx_action() and __netif_schedule().

So, touching the __QDISC_STATE_SCHED bit in __dev_xmit_skb() is
not a good idea.

Let me see if there is any other way to fix this.

Thanks.


Re: [Patch net v2] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-07-08 Thread Cong Wang
On Wed, Jul 8, 2020 at 12:07 PM Guenter Roeck  wrote:
>
> On 7/8/20 11:34 AM, Cong Wang wrote:
> > Hi,
> >
> > On Wed, Jul 8, 2020 at 8:33 AM Guenter Roeck  wrote:
> >> This patch causes all my s390 boot tests to crash. Reverting it fixes
> >> the problem. Please see bisect results and and crash log below.
> >>
> > ...
> >> Crash log:
> >
> > Interesting. I don't see how the unix socket is special here; it creates
> > a peer sock with sk_alloc(), but this is not any different from two
> > separate sockets.
> >
> > What is your kernel config? Do you enable CONFIG_CGROUP_NET_PRIO
> > or CONFIG_CGROUP_NET_CLASSID? I can see there might be a problem
> > if you don't enable either of them but enable CONFIG_CGROUP_BPF.
> >
>
> cgroup specific configuration bits:
>
> CONFIG_CGROUPS=y
> CONFIG_BLK_CGROUP=y
> CONFIG_CGROUP_WRITEBACK=y
> CONFIG_CGROUP_SCHED=y
> CONFIG_CGROUP_PIDS=y
> CONFIG_CGROUP_RDMA=y
> CONFIG_CGROUP_FREEZER=y
> CONFIG_CGROUP_HUGETLB=y
> CONFIG_CGROUP_DEVICE=y
> CONFIG_CGROUP_CPUACCT=y
> CONFIG_CGROUP_PERF=y
> CONFIG_CGROUP_BPF=y
> # CONFIG_CGROUP_DEBUG is not set
> CONFIG_SOCK_CGROUP_DATA=y
> CONFIG_BLK_CGROUP_RWSTAT=y
> CONFIG_BLK_CGROUP_IOLATENCY=y
> CONFIG_BLK_CGROUP_IOCOST=y
> # CONFIG_BFQ_CGROUP_DEBUG is not set
> # CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
> CONFIG_NET_CLS_CGROUP=y
> CONFIG_CGROUP_NET_PRIO=y
> CONFIG_CGROUP_NET_CLASSID=y
>
> This originates from arch/s390/configs/defconfig; I don't touch
> any cgroup specific configuration in my tests.

Good to know you enable everything related here.

>
> > And if you have the full kernel log, it would be helpful too.
> >
>
> https://kerneltests.org/builders/qemu-s390-pending-fixes/builds/222/steps/qemubuildcommand/logs/stdio

It looks like cgroup_sk_alloc_disabled is always false in your case.
This makes the bug even more weird. Unless there is a refcnt bug
prior to my commit, I don't see how it could happen.

>
> Interestingly, enabling CONFIG_BFQ_CGROUP_DEBUG makes the problem disappear.

Yeah, I guess there might be some cgroup refcnt bug which could
just be papered over with CONFIG_BFQ_CGROUP_DEBUG=y.

If you can test patches, I can send you a debugging patch for you
to collect more data. I assume this is 100% reproducible on your
side?

Thanks.


Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks

2020-07-08 Thread Cong Wang
On Wed, Jul 8, 2020 at 9:21 AM Petr Machata  wrote:
>
> Actually I guess I could qdisc_refcount_inc() the current qdisc so that
> it doesn't go away. Then when enqueing I could access the child
> directly, not relying on the now-obsolete cache from the beginning of
> the enqueue function. I suppose that a similar approach could be used in
> other users of tcf_classify() as well. What do you think?

The above example is just a quick one I can think of, there could be
more race conditions that lead to other kinds of bugs.

I am sure you can fix that one, but the point is that it is hard to
audit and fix them all. The best solution here is of course not to
release that lock, but again it requires some more work to avoid
the deadlock.

Thanks.


Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks

2020-07-08 Thread Cong Wang
On Wed, Jul 8, 2020 at 5:35 AM Petr Machata  wrote:
> Do you have another solution in mind here? I think the deadlock (in both
> classification and qevents) is an issue, but really don't know how to
> avoid it except by dropping the lock.

Ideally we should only take the lock once, but it clearly requires some
work to teach the dev_queue_xmit() in act_mirred not to acquire it again.

Thanks.


Re: [Patch net v2] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-07-08 Thread Cong Wang
Hi,

On Wed, Jul 8, 2020 at 8:33 AM Guenter Roeck  wrote:
> This patch causes all my s390 boot tests to crash. Reverting it fixes
> the problem. Please see bisect results and and crash log below.
>
...
> Crash log:

Interesting. I don't see how the unix socket is special here; it creates
a peer sock with sk_alloc(), but this is not any different from two
separate sockets.

What is your kernel config? Do you enable CONFIG_CGROUP_NET_PRIO
or CONFIG_CGROUP_NET_CLASSID? I can see there might be a problem
if you don't enable either of them but enable CONFIG_CGROUP_BPF.

And if you have the full kernel log, it would be helpful too.

Thanks.


Re: [RFC PATCH] sch_htb: Hierarchical QoS hardware offload

2020-07-07 Thread Cong Wang
On Fri, Jun 26, 2020 at 3:46 AM Maxim Mikityanskiy  wrote:
>
> HTB doesn't scale well because of contention on a single lock, and it
> also consumes CPU. Mellanox hardware supports hierarchical rate limiting
> that can be leveraged by offloading the functionality of HTB.

True, essentially because it has to enforce a global rate limit with
link sharing.

There is a proposal to add a new lockless shaping qdisc, which
you can find on the netdev list.

>
> Our solution addresses two problems of HTB:
>
> 1. Contention by flow classification. Currently the filters are attached
> to the HTB instance as follows:
>
> # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
> classid 1:10
>
> It's possible to move classification to clsact egress hook, which is
> thread-safe and lock-free:
>
> # tc filter add dev eth0 egress protocol ip flower dst_port 80
> action skbedit priority 1:10
>
> This way classification still happens in software, but the lock
> contention is eliminated, and it happens before selecting the TX queue,
> allowing the driver to translate the class to the corresponding hardware
> queue.
>
> Note that this is already compatible with non-offloaded HTB and doesn't
> require changes to the kernel nor iproute2.
>
> 2. Contention by handling packets. HTB is not multi-queue, it attaches
> to a whole net device, and handling of all packets takes the same lock.
> Our solution offloads the logic of HTB to the hardware and registers HTB
> as a multi-queue qdisc, similarly to how mq qdisc does, i.e. HTB is
> attached to the netdev, and each queue has its own qdisc. The control
> flow is performed by HTB, it replicates the hierarchy of classes in
> hardware by calling callbacks of the driver. Leaf classes are presented
> by hardware queues. The data path works as follows: a packet is
> classified by clsact, the driver selects the hardware queue according
> to its class, and the packet is enqueued into this queue's qdisc.

Are you sure the HTB algorithm could still work even after you
kinda make each HTB class separated? I think they must still share
something when they borrow bandwidth from each other. This is why I
doubt you can simply add a ->attach() without touching the core
algorithm.

Thanks.


Re: [PATCH net-next 2/2] net: sched: Lockless Token Bucket (LTB) Qdisc

2020-07-07 Thread Cong Wang
On Mon, Jul 6, 2020 at 1:34 PM YU, Xiangning
 wrote:
>
>
>
> On 7/6/20 1:10 PM, Cong Wang wrote:
> > On Mon, Jul 6, 2020 at 11:11 AM YU, Xiangning
> >  wrote:
> >> +static int ltb_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
> >> +  struct sk_buff **to_free)
> >> +{
> >> +   struct ltb_sched *ltb = qdisc_priv(sch);
> >> +   struct ltb_pcpu_sched *pcpu_q;
> >> +   struct ltb_class *cl;
> >> +   struct ltb_pcpu_data *pcpu = this_cpu_ptr(ltb->pcpu_data);
> >> +   int cpu;
> >> +
> >> +   cpu = smp_processor_id();
> >> +   pcpu_q = qdisc_priv(pcpu->qdisc);
> >> +   ltb_skb_cb(skb)->cpu = cpu;
> >> +
> >> +   cl = ltb_classify(sch, ltb, skb);
> >> +   if (unlikely(!cl)) {
> >> +   kfree_skb(skb);
> >> +   return NET_XMIT_DROP;
> >> +   }
> >> +
> >> +   pcpu->active = true;
> >> +   if (unlikely(kfifo_put(&cl->aggr_queues[cpu], skb) == 0)) {
> >> +   kfree_skb(skb);
> >> +   atomic64_inc(&cl->stat_drops);
> >> +   return NET_XMIT_DROP;
> >> +   }
> >
> >
> > How do you prevent out-of-order packets?
> >
>
> Hi Cong,
>
> That's a good question. In theory there will be out-of-order packets during
> aggregation. But keep in mind this is per-class aggregation, and it runs at
> a high frequency, so the chance of having any leftover skbs from the same
> TCP flow on many CPUs is low.
>
> Also, based on real deployment experience, we haven't observed an increased 
> out of order events even under heavy work load.
>

Yeah, but unless you always classify packets into proper flows, there
is always a chance to generate out-of-order packets here, which
means the default configuration is flawed.


> >
> >> +static int ltb_init(struct Qdisc *sch, struct nlattr *opt,
> > ...
> >> +   ltb->default_cls = ltb->shadow_cls; /* Default hasn't been created */
> >> +   tasklet_init(<b->fanout_tasklet, ltb_fanout_tasklet,
> >> +(unsigned long)ltb);
> >> +
> >> +   /* Bandwidth balancer, this logic can be implemented in user-land. */
> >> +   init_waitqueue_head(<b->bwbalancer_wq);
> >> +   ltb->bwbalancer_task =
> >> +   kthread_create(ltb_bw_balancer_kthread, ltb, "ltb-balancer");
> >> +   wake_up_process(ltb->bwbalancer_task);
> >
> > Creating a kthread for each qdisc doesn't look good. Why do you
> > need a per-qdisc kthread or even a kernel thread at all?
> >
>
> We moved the bandwidth sharing out of the critical data path, that's why we 
> use a kernel thread to balance the current maximum bandwidth used by each 
> class periodically.
>
> This part could be implemented as a timer. What's your suggestion?

I doubt you can use a timer, as you call rtnl_trylock() there.
Why not use a delayed work?
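(A minimal sketch of the delayed-work suggestion; the field name
balance_dwork and the interval are made up, not taken from the posted
patch:

static void ltb_bw_balance_work(struct work_struct *work)
{
	struct ltb_sched *ltb = container_of(to_delayed_work(work),
					     struct ltb_sched, balance_dwork);

	/* Rebalance class bandwidth here. Unlike a timer callback,
	 * this runs in process context, so sleeping and rtnl_lock()
	 * are allowed. */

	schedule_delayed_work(&ltb->balance_dwork, HZ);
}

In ltb_init(): INIT_DELAYED_WORK(&ltb->balance_dwork, ltb_bw_balance_work);
followed by schedule_delayed_work(&ltb->balance_dwork, HZ).)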

Thanks.


Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks

2020-07-07 Thread Cong Wang
On Fri, Jun 26, 2020 at 3:46 PM Petr Machata  wrote:
> The function tcf_qevent_handle() should be invoked when qdisc hits the
> "interesting event" corresponding to a block. This function releases root
> lock for the duration of executing the attached filters, to allow packets
> generated through user actions (notably mirred) to be reinserted to the
> same qdisc tree.

I read this again, another question here is: why is tcf_qevent_handle()
special here? We call tcf_classify() under root lock in many other places
too, for example htb_enqueue(), which of course includes act_mirred
execution, so why isn't it a problem there?

People added MIRRED_RECURSION_LIMIT for this kinda recursion,
but never released that root lock.

Thanks.


Re: [PATCH net-next v1 1/5] net: sched: Pass root lock to Qdisc_ops.enqueue

2020-07-07 Thread Cong Wang
On Tue, Jul 7, 2020 at 8:25 AM Petr Machata  wrote:
>
>
> Cong Wang  writes:
>
> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata  wrote:
> >>
> >> A following patch introduces qevents, points in qdisc algorithm where
> >> packet can be processed by user-defined filters. Should this processing
> >> lead to a situation where a new packet is to be enqueued on the same port,
> >> holding the root lock would lead to deadlocks. To solve the issue, qevent
> >> handler needs to unlock and relock the root lock when necessary.
> >>
> >> To that end, add the root lock argument to the qdisc op enqueue, and
> >> propagate throughout.
> >
> > Hmm, but why do you pass root lock down to each ->enqueue()?
> >
> > You can find root lock with sch_tree_lock() (or qdisc_lock() if you don't
> > care about hierarchy), and you already have qdisc as a parameter of
> > tcf_qevent_handle().
>
> I know, I wanted to make it clear that the lock may end up being used,
> instead of doing it "stealthily". If you find this inelegant I can push
> a follow-up that converts tcf_qevent_handle() to sch_tree_unlock().

So far only sch_red uses tcf_qevent_handle(); changing the rest of the
qdiscs just for sch_red seems overkill here.

Of course, if we could eliminate the root lock release, all the pain
would go away.

Thanks.


Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks

2020-07-07 Thread Cong Wang
On Tue, Jul 7, 2020 at 8:22 AM Petr Machata  wrote:
>
>
> Cong Wang  writes:
>
> > On Fri, Jun 26, 2020 at 3:46 PM Petr Machata  wrote:
> >> The function tcf_qevent_handle() should be invoked when qdisc hits the
> >> "interesting event" corresponding to a block. This function releases root
> >> lock for the duration of executing the attached filters, to allow packets
> >> generated through user actions (notably mirred) to be reinserted to the
> >> same qdisc tree.
> >
> > Are you sure releasing the root lock in the middle of an enqueue operation
> > is a good idea? I mean, it seems racy with qdisc change or reset path,
> > for example, __red_change() could update some RED parameters
> > immediately after you release the root lock.
>
> So I had mulled over this for a while. If packets are enqueued or
> dequeued while the lock is released, maybe the packet under
> consideration should have been hard_marked instead of prob_marked, or
> vice versa. (I can't really go to not marked at all, because the fact
> that we ran the qevent is a very observable commitment to put the packet
> in the queue with marking.) I figured that is not such a big deal.
>
> Regarding a configuration change, for a brief period after the change, a
> few not-yet-pushed packets could have been enqueued with ECN marking
> even as I e.g. disabled ECN. This does not seem like a big deal either,
> these are transient effects.

Hmm, let's see:

1. red_enqueue() caches a pointer to child qdisc, child = q->qdisc
2. root lock is released by tcf_qevent_handle().
3. __red_change() acquires the root lock and then changes child
qdisc with a new one
4. The old child qdisc is put by qdisc_put()
5. tcf_qevent_handle() acquires the root lock again, and still uses
the cached but now freed old child qdisc.

Isn't this a problem?

Even if it really is not, why do we make tcf_qevent_handle() callers'
life so hard? They have to be very careful with race conditions like this.

Thanks.


Re: [PATCH net-next 2/2] net: sched: Lockless Token Bucket (LTB) Qdisc

2020-07-06 Thread Cong Wang
On Mon, Jul 6, 2020 at 11:11 AM YU, Xiangning
 wrote:
> +static int ltb_enqueue(struct sk_buff *skb, struct Qdisc *sch, spinlock_t *root_lock,
> +  struct sk_buff **to_free)
> +{
> +   struct ltb_sched *ltb = qdisc_priv(sch);
> +   struct ltb_pcpu_sched *pcpu_q;
> +   struct ltb_class *cl;
> +   struct ltb_pcpu_data *pcpu = this_cpu_ptr(ltb->pcpu_data);
> +   int cpu;
> +
> +   cpu = smp_processor_id();
> +   pcpu_q = qdisc_priv(pcpu->qdisc);
> +   ltb_skb_cb(skb)->cpu = cpu;
> +
> +   cl = ltb_classify(sch, ltb, skb);
> +   if (unlikely(!cl)) {
> +   kfree_skb(skb);
> +   return NET_XMIT_DROP;
> +   }
> +
> +   pcpu->active = true;
> +   if (unlikely(kfifo_put(&cl->aggr_queues[cpu], skb) == 0)) {
> +   kfree_skb(skb);
> +   atomic64_inc(&cl->stat_drops);
> +   return NET_XMIT_DROP;
> +   }


How do you prevent out-of-order packets?


> +static int ltb_init(struct Qdisc *sch, struct nlattr *opt,
...
> +   ltb->default_cls = ltb->shadow_cls; /* Default hasn't been created */
> +   tasklet_init(&ltb->fanout_tasklet, ltb_fanout_tasklet,
> +(unsigned long)ltb);
> +
> +   /* Bandwidth balancer, this logic can be implemented in user-land. */
> +   init_waitqueue_head(&ltb->bwbalancer_wq);
> +   ltb->bwbalancer_task =
> +   kthread_create(ltb_bw_balancer_kthread, ltb, "ltb-balancer");
> +   wake_up_process(ltb->bwbalancer_task);

Creating a kthread for each qdisc doesn't look good. Why do you
need a per-qdisc kthread or even a kernel thread at all?

Thanks.


Re: [PATCH net-next v1 2/5] net: sched: Introduce helpers for qevent blocks

2020-07-06 Thread Cong Wang
On Fri, Jun 26, 2020 at 3:46 PM Petr Machata  wrote:
> The function tcf_qevent_handle() should be invoked when qdisc hits the
> "interesting event" corresponding to a block. This function releases root
> lock for the duration of executing the attached filters, to allow packets
> generated through user actions (notably mirred) to be reinserted to the
> same qdisc tree.

Are you sure releasing the root lock in the middle of an enqueue operation
is a good idea? I mean, it seems racy with qdisc change or reset path,
for example, __red_change() could update some RED parameters
immediately after you release the root lock.

Thanks.


Re: [PATCH net-next v1 1/5] net: sched: Pass root lock to Qdisc_ops.enqueue

2020-07-06 Thread Cong Wang
On Fri, Jun 26, 2020 at 3:46 PM Petr Machata  wrote:
>
> A following patch introduces qevents, points in qdisc algorithm where
> packet can be processed by user-defined filters. Should this processing
> lead to a situation where a new packet is to be enqueued on the same port,
> holding the root lock would lead to deadlocks. To solve the issue, qevent
> handler needs to unlock and relock the root lock when necessary.
>
> To that end, add the root lock argument to the qdisc op enqueue, and
> propagate throughout.

Hmm, but why do you pass root lock down to each ->enqueue()?

You can find root lock with sch_tree_lock() (or qdisc_lock() if you don't
care about hierarchy), and you already have qdisc as a parameter of
tcf_qevent_handle().

Thanks.


Re: [PATCH net] net/sched: act_mirred: fix fragment the packet after defrag in act_ct

2020-07-05 Thread Cong Wang
On Sun, Jul 5, 2020 at 7:33 AM wenxu  wrote:
> Thanks, I also think It is ok do fragment in the mirred output.

Please stop doing it. There is no way to make this acceptable.

You must be smart enough to find solutions elsewhere, possibly
not even in TC at all. It would be best if you can solve this L3 problem
at L3.

Thanks!


Re: [PATCH net] net/sched: act_mirred: fix fragment the packet after defrag in act_ct

2020-07-05 Thread Cong Wang
On Thu, Jul 2, 2020 at 5:47 PM Marcelo Ricardo Leitner
 wrote:
>
> in this case a act_output_it_well could do it. ;-)

Yeah, still much better than making "mirred" do "mirror, redirect, frag,
defrag", can't we all agree it is too late to rename mirred? :)

Please do not try to add any arbitrary functionality into act_mirred.
Just stop here and find a better way.

Thanks.


Re: [PATCH net-next 1/3] netfilter: nf_defrag_ipv4: Add nf_ct_frag_gather support

2020-07-05 Thread Cong Wang
On Sun, Jul 5, 2020 at 7:34 AM  wrote:
> +static int nf_ct_frag_reinit(struct ipq *qp)
> +{
> +   unsigned int sum_truesize = 0;
> +
> +   if (!mod_timer(&qp->q.timer, jiffies + qp->q.fqdir->timeout)) {
> +   refcount_inc(&qp->q.refcnt);
> +   return -ETIMEDOUT;
> +   }
> +
> +   sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments);
> +   sub_frag_mem_limit(qp->q.fqdir, sum_truesize);
> +
> +   qp->q.flags = 0;
> +   qp->q.len = 0;
> +   qp->q.meat = 0;
> +   qp->q.rb_fragments = RB_ROOT;
> +   qp->q.fragments_tail = NULL;
> +   qp->q.last_run_head = NULL;
> +   qp->iif = 0;
> +   qp->ecn = 0;
> +
> +   return 0;
> +}
> +
> +static struct ipq *ip_find(struct net *net, struct iphdr *iph,
> +  u32 user, int vif)
> +{
> +   struct frag_v4_compare_key key = {
> +   .saddr = iph->saddr,
> +   .daddr = iph->daddr,
> +   .user = user,
> +   .vif = vif,
> +   .id = iph->id,
> +   .protocol = iph->protocol,
> +   };
> +   struct inet_frag_queue *q;
> +
> +   q = inet_frag_find(net->ipv4.fqdir, &key);
> +   if (!q)
> +   return NULL;
> +
> +   return container_of(q, struct ipq, q);
> +}


Please avoid copy-n-paste code by finding a proper way
to reuse them.

Thanks.


Re: [PATCH net-next v2 0/3] ] TC datapath hash api

2020-07-05 Thread Cong Wang
On Sun, Jul 5, 2020 at 2:50 PM Jamal Hadi Salim  wrote:
> BTW, nothing in skbedit is against computing what the new metadata
> should be.

Yup.

>
> IMO: A good argument to not make it part of skbedit is if it adds
> unnecessary complexity to skbedit or policy definitions.
>

TCA_HASH_ALG_L4 literally has 4 lines of code, has no way
to add any unnecessary complexity to skbedit. (The BPF algorithm
does not belong to skbedit, obviously.)

Thanks.


Re: [PATCH net-next v2 0/3] ] TC datapath hash api

2020-07-05 Thread Cong Wang
On Sun, Jul 5, 2020 at 10:26 AM Ariel Levkovich  wrote:
> However I believe that from a concept point of view, using it is wrong.
>
> In my honest opinion, the concept here is to perform some calculation on
> the packet itself and its headers while the skb->hash field
> is the storage location of the calculation result (in SW).

With skbedit, you don't have to pass a value either, whatever you
pass to your act_hash, you can pass it to skbedit too. In your case,
it seems to be an algorithm name.

You can take a look at SKBEDIT_F_INHERITDSFIELD, it calculates
skb->priority from headers, not passed from user-space.


>
> Furthermore, looking forward to HW offload support, the HW devices will
> be offloading the hash calculation and not rewriting skb metadata fields.
> Therefore the action should be the hash, not skbedit.

Not sure if this makes sense; whatever code you have under case
TCA_HASH_ALG_L4 can simply be moved to skbedit. I don't see
how making it standalone could be different for HW offloading.


>
> Another thing that I can mention, which is kind of related to what I
> wrote above, is that for all existing skbedit supported fields, the
> user typically provides a desired value of his choosing to set to a skb
> metadata field.

Again, no one forces this rule. Please feel free to adjust it for your needs.

Thanks.


Re: [PATCH net v2] sched: consistently handle layer3 header accesses in the presence of VLANs

2020-07-03 Thread Cong Wang
On Fri, Jul 3, 2020 at 8:22 AM Toke Høiland-Jørgensen  wrote:
> diff --git a/include/linux/if_vlan.h b/include/linux/if_vlan.h
> index b05e855f1ddd..d0c1cb0d264d 100644
> --- a/include/linux/if_vlan.h
> +++ b/include/linux/if_vlan.h
> @@ -308,6 +308,35 @@ static inline bool eth_type_vlan(__be16 ethertype)
> }
>  }
>
> +/* A getter for the SKB protocol field which will handle VLAN tags consistently
> + * whether VLAN acceleration is enabled or not.
> + */
> +static inline __be16 skb_protocol(const struct sk_buff *skb, bool skip_vlan)
> +{
> +   unsigned int offset = skb_mac_offset(skb) + sizeof(struct ethhdr);
> +   __be16 proto = skb->protocol;
> +   struct vlan_hdr vhdr, *vh;
> +
> +   if (!skip_vlan)
> +   /* VLAN acceleration strips the VLAN header from the skb and
> +* moves it to skb->vlan_proto
> +*/
> +   return skb_vlan_tag_present(skb) ? skb->vlan_proto : proto;
> +
> +   while (eth_type_vlan(proto)) {
> +   vh = skb_header_pointer(skb, offset, sizeof(vhdr), &vhdr);
> +   if (!vh)
> +   break;
> +
> +   proto = vh->h_vlan_encapsulated_proto;
> +   offset += sizeof(vhdr);
> +   }
> +
> +   return proto;
> +}
> +
> +
> +

Just nit: too many newlines here. Please run checkpatch.pl.


>  static inline bool vlan_hw_offload_capable(netdev_features_t features,
>__be16 proto)
>  {
> diff --git a/include/net/inet_ecn.h b/include/net/inet_ecn.h
> index 0f0d1efe06dd..82763ba597f2 100644
> --- a/include/net/inet_ecn.h
> +++ b/include/net/inet_ecn.h
> @@ -4,6 +4,7 @@
>
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -172,7 +173,7 @@ static inline void ipv6_copy_dscp(unsigned int dscp, struct ipv6hdr *inner)
>
>  static inline int INET_ECN_set_ce(struct sk_buff *skb)
>  {
> -   switch (skb->protocol) {
> +   switch (skb_protocol(skb, true)) {
> case cpu_to_be16(ETH_P_IP):
> if (skb_network_header(skb) + sizeof(struct iphdr) <=
> skb_tail_pointer(skb))
> @@ -191,7 +192,7 @@ static inline int INET_ECN_set_ce(struct sk_buff *skb)
>
>  static inline int INET_ECN_set_ect1(struct sk_buff *skb)
>  {
> -   switch (skb->protocol) {
> +   switch (skb_protocol(skb, true)) {
> case cpu_to_be16(ETH_P_IP):
> if (skb_network_header(skb) + sizeof(struct iphdr) <=
> skb_tail_pointer(skb))

These two helpers are called by non-net_sched too, are you sure
your change is correct for them too?

For example, IP6_ECN_decapsulate() uses skb->protocol then calls
INET_ECN_decapsulate() which calls the above, after your change
they use skb_protocol(). This looks inconsistent to me.
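
The mix in miniature (illustrative, condensed from the call chain above):

	IP6_ECN_decapsulate()            /* switches on skb->protocol */
	  -> INET_ECN_decapsulate()
	       -> INET_ECN_set_ce()      /* now uses skb_protocol(skb, true) */

so the outer check and the inner check can disagree once VLAN tags are
involved.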

Thanks.


Re: [PATCH net] net/sched: act_mirred: fix fragment the packet after defrag in act_ct

2020-07-02 Thread Cong Wang
On Thu, Jul 2, 2020 at 10:32 AM Marcelo Ricardo Leitner
 wrote:
>
> On Thu, Jul 02, 2020 at 05:36:38PM +0800, wenxu wrote:
> >
> > On 7/2/2020 1:33 AM, Cong Wang wrote:
> > > On Wed, Jul 1, 2020 at 1:21 AM wenxu  wrote:
> > >>
> > >> On 7/1/2020 2:21 PM, wenxu wrote:
> > >>> On 7/1/2020 2:12 PM, Cong Wang wrote:
> > >>>> On Tue, Jun 30, 2020 at 11:03 PM wenxu  wrote:
> > >>>>> Only the forwarded-packet case needs to be fragmented again, and there
> > >>>>> is no need to do the defrag explicitly.
> > >>>> Same question: why act_mirred? You have to explain why act_mirred
> > >>>> has the responsibility to do this job.
> > >>> The fragment behavior only depends on the mtu of the device sent in 
> > >>> act_mirred. Only in
> > >>>
> > >>> the act_mirred can decides whether do the fragment or not.
> > >> Hi cong,
> > >>
> > >>
> > >> I still think this should be resolved in the act_mirred. Maybe it is
> > >> not a matter of "responsibility".
> > >>
> > >> Did you have some other suggestion to solve this problem?
> > > Like I said, why not introduce a new action to handle fragment/defragment?
> > >
> > > With that, you can still pipe it to act_ct and act_mirred to achieve
> > > the same goal.
> >
> > Thanks.  Consider about the act_fagment, There are two problem for this.
> >
> >
> > The frag action will put the ressemble skb to more than one packets. How 
> > these packets
> >
> > go through the following tc filter or chain?
>
> One idea is to listificate it, but I don't see how it can work. For
> example, it can be quite an issue when jumping chains, as the match
> would have to work on the list as well.

Why is this an issue? We already use goto action for use cases like
vlan pop/push. The header can be changed all the time, reclassification
is necessary.

>
> >
> >
> > When should use the act_fragament the action,  always before the act_mirred?
>
> Which can be messy if you consider chains like: "mirred, push vlan,
> mirred" or so. "frag, mirred, defrag, push vlan, frag, mirred".

So you mean we should have a giant act_do_everything?

"Do one thing do it well" is exactly the philosophy of designing tc
actions, if you are against this, you are too late in the game.

Thanks.


Re: [PATCH net-next v2 1/3] net/sched: Introduce action hash

2020-07-02 Thread Cong Wang
On Wed, Jul 1, 2020 at 11:47 AM Ariel Levkovich  wrote:
>
> Allow setting a hash value to a packet for a future match.
>
> The action will determine the packet's hash result according to
> the selected hash type.
>
> The first option is to select a basic asymmetric l4 hash calculation
> on the packet headers which will either use the skb->hash value as
> if such was already calculated and set by the device driver, or it
> will perform the kernel jenkins hash function on the packet which will
> generate the result otherwise.
>
> The other option is for user to provide an BPF program which is
> dedicated to calculate the hash. In such case the program is loaded
> and used by tc to perform the hash calculation and provide it to
> the hash action to be stored in skb->hash field.
>
> The BPF option can be useful for future HW offload support of the hash
> calculation by emulating the HW hash function when it's different than
> the kernel's but yet we want to maintain consistency between the SW and
> the HW.

In previous discussion, people mentioned act_skbedit. If you have
a legitimate reason for adding a new action instead of simply extending
act_skbedit, you have to mention it here or in the cover letter.

Thanks.


Re: [PATCH net-next 1/3] net/sched: Introduce action hash

2020-07-02 Thread Cong Wang
On Fri, Jun 19, 2020 at 7:42 PM Ariel Levkovich  wrote:
> I'll try to address both of your comments regarding existing
> alternatives to this new action here, so that we can have a single
> thread about it.
>
> Act_bpf indeed can provide similar functionality. Furthermore, there
> are already existing BPF helpers to allow the user to change skb->hash
> within the BPF program, so there's no need to perform act_skbedit
> after act_bpf.

What is "perform act_skbedit after act_bpf"? You mentioned avoiding
act_bpf is one of your goals here.


>
> However, since we are trying to offer the user multiple methods to
> calculate the hash, and not only using a BPF program, act_bpf on its
> own is not enough.

If this is the reason for you to add a new action, why not just extend
act_skbedit?


>
> If we are looking at HW offload (as Daniel mentioned), like I mentioned
> in the cover letter, it is important that SW will be able to get the
> same hash result as the HW for a certain packet.
>
> When certain HW is not able to forward TC the hash result, using a BPF
> program that mimics the HW hash function is useful to maintain
> consistency, but there are cases where the HW can and does forward the
> hash value via the received packet's metadata, and the vendor driver
> already fills in skb->hash with this value. In such cases BPF program
> usage can be avoided.
>
> So to sum it up, this API offers the user both ways to calculate the hash:
>
> 1. Use the value that is already there (if the vendor driver already set
> it; if not, calculate using Linux jhash).

Can be done by extending act_skbedit?

>
> 2. Use a given BPF program to calculate the hash and to set skb->hash
> with it.

Seems you agreed this can be done via act_bpf.

>
> It's true, you can cover both cases with BPF - meaning, always use BPF
> even if the HW/driver can provide the hash to TC by other means - but we
> thought about giving an option to avoid writing and using BPF when
> it's not necessary.

How about extending act_skbedit? Adding TCA_SKBEDIT_HASH?

Thanks.


[Patch net v2] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-07-02 Thread Cong Wang
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.
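
For illustration only (the patch itself uses a bitfield overlay on the
skcd->val union, not explicit masking), the underlying trick is classic
low-bit pointer tagging:

	#define SKCD_IS_DATA	0x1UL
	#define SKCD_NO_REFCNT	0x2UL

	static inline unsigned long skcd_pack(struct cgroup *cgrp,
					      unsigned long flags)
	{
		/* cgrp is at least 4-byte aligned, so bits 0-1 are free */
		return (unsigned long)cgrp | flags;
	}

	static inline struct cgroup *skcd_unpack(unsigned long val)
	{
		return (struct cgroup *)(val & ~3UL);
	}

The low bits double as "this is not a plain pointer" flags, which is why
sock_cgroup_ptr() below tests (v & 3) and falls back to the default root
cgroup when either bit is set.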

This bug seems to have been present since the beginning; commit
d979a39d7242 ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not completely. It seems hard to trigger until
the recent commit 090e28b229af
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.

Fixes: bd1060a1d671 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Cameron Berkenpas 
Reported-by: Peter Geis 
Reported-by: Lu Fengqi 
Reported-by: Daniël Sonck 
Reported-by: Zhang Qiang 
Tested-by: Cameron Berkenpas 
Tested-by: Peter Geis 
Tested-by: Thomas Lamprecht 
Cc: Daniel Borkmann 
Cc: Zefan Li 
Cc: Tejun Heo 
Cc: Roman Gushchin 
Signed-off-by: Cong Wang 
---
 include/linux/cgroup-defs.h |  6 --
 include/linux/cgroup.h  |  4 +++-
 kernel/cgroup/cgroup.c  | 31 +++
 net/core/sock.c |  2 +-
 4 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 52661155f85f..4f1cd0edc57d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -790,7 +790,8 @@ struct sock_cgroup_data {
union {
 #ifdef __LITTLE_ENDIAN
struct {
-   u8  is_data;
+   u8  is_data : 1;
+   u8  no_refcnt : 1;
u8  padding;
u16 prioidx;
u32 classid;
@@ -800,7 +801,8 @@ struct sock_cgroup_data {
u32 classid;
u16 prioidx;
u8  padding;
-   u8  is_data;
+   u8  no_refcnt : 1;
+   u8  is_data : 1;
} __packed;
 #endif
u64 val;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..618838c48313 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -822,6 +822,7 @@ extern spinlock_t cgroup_sk_update_lock;
 
 void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
@@ -835,7 +836,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct 
sock_cgroup_data *skcd)
 */
v = READ_ONCE(skcd->val);
 
-   if (v & 1)
+   if (v & 3)
return &cgrp_dfl_root.cgrp;
 
return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
@@ -847,6 +848,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct 
sock_cgroup_data *skcd)
 #else  /* CONFIG_CGROUP_DATA */
 
 static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
 static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
 
 #endif /* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1ea181a58465..dd247747ec14 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6439,18 +6439,8 @@ void cgroup_sk_alloc_disable(void)
 
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 {
-   if (cgroup_sk_alloc_disabled)
-   return;
-
-   /* Socket clone path */
-   if (skcd->val) {
-   /*
-* We might be cloning a socket which is left in an empty
-* cgroup and the cgroup might have already been rmdir'd.
-* Don't use cgroup_get_live().
-*/
-   cgroup_get(sock_cgroup_ptr(skcd));
-   cgroup_bpf_get(sock_cgroup_ptr(skcd));
+   if (cgroup_sk_alloc_disabled) {
+   skcd->no_refcnt = 1;
return;
}
 
@@ -6475,10 +6465,27 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
rcu_rea

Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-07-01 Thread Cong Wang
On Tue, Jun 30, 2020 at 3:48 PM Roman Gushchin  wrote:
>
> Btw if we want to backport the problem but can't blame a specific commit,
> we can always use something like "Cc: [3.1+]".

Sure, but if we don't know which is the right commit to blame, then how
do we know which stable version should the patch target? :)

I am open to all options here, including not backporting to stable at all.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-01 Thread Cong Wang
On Wed, Jul 1, 2020 at 9:05 AM Cong Wang  wrote:
>
> On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt  wrote:
> > Do either of you know if there's been any development on a fix for this
> > issue? If not we can propose something.
>
> If you have a reproducer, I can look into this.

Does the attached patch fix this bug completely?

Thanks.
diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 9092e697059e..5a03cded3054 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -123,7 +123,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
 
 void __qdisc_run(struct Qdisc *q);
 
-static inline void qdisc_run(struct Qdisc *q)
+static inline bool qdisc_run(struct Qdisc *q)
 {
 	if (qdisc_run_begin(q)) {
 		/* NOLOCK qdisc must check 'state' under the qdisc seqlock
@@ -133,7 +133,9 @@ static inline void qdisc_run(struct Qdisc *q)
 		likely(!test_bit(__QDISC_STATE_DEACTIVATED, &q->state)))
 			__qdisc_run(q);
 		qdisc_run_end(q);
+		return true;
 	}
+	return false;
 }
 
 static inline __be16 tc_skb_protocol(const struct sk_buff *skb)
diff --git a/net/core/dev.c b/net/core/dev.c
index 90b59fc50dc9..c7e48356132a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3744,7 +3744,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 
 	if (q->flags & TCQ_F_NOLOCK) {
 		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
-		qdisc_run(q);
+		if (!qdisc_run(q) && rc == NET_XMIT_SUCCESS)
+			__netif_schedule(q);
 
 		if (unlikely(to_free))
 			kfree_skb_list(to_free);


Re: [PATCH net] net/sched: act_mirred: fix fragment the packet after defrag in act_ct

2020-07-01 Thread Cong Wang
On Wed, Jul 1, 2020 at 1:21 AM wenxu  wrote:
>
>
> On 7/1/2020 2:21 PM, wenxu wrote:
> > On 7/1/2020 2:12 PM, Cong Wang wrote:
> >> On Tue, Jun 30, 2020 at 11:03 PM wenxu  wrote:
> >>> Only the forwarded-packet case needs to be fragmented again, and there is
> >>> no need to do the defrag explicitly.
> >> Same question: why act_mirred? You have to explain why act_mirred
> >> has the responsibility to do this job.
> > The fragment behavior only depends on the mtu of the device sent in 
> > act_mirred. Only in
> >
> > the act_mirred can decides whether do the fragment or not.
>
> Hi cong,
>
>
> I still think this should be resolved in the act_mirred. Maybe it is not
> a matter of "responsibility".
>
> Did you have some other suggestion to solve this problem?

Like I said, why not introduce a new action to handle fragment/defragment?

With that, you can still pipe it to act_ct and act_mirred to achieve
the same goal.

That act_mirred has the context to handle it does not mean it has to handle it.
Its name already tells you it only handles mirroring or redirection, and
fragmentation is a layer 3 thing; it does not fit well in layer 2 here. This
is why you should think carefully about the best place to handle
it; _possibly_ it should not be in TC at all.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-07-01 Thread Cong Wang
On Tue, Jun 30, 2020 at 2:08 PM Josh Hunt  wrote:
> Do either of you know if there's been any development on a fix for this
> issue? If not we can propose something.

If you have a reproducer, I can look into this.

Thanks.


Re: [PATCH net] net/sched: act_mirred: fix fragment the packet after defrag in act_ct

2020-06-30 Thread Cong Wang
On Tue, Jun 30, 2020 at 11:03 PM wenxu  wrote:
>
> Only the forwarded-packet case needs to be fragmented again, and there is no
> need to do the defrag explicitly.

Same question: why act_mirred? You have to explain why act_mirred
has the responsibility to do this job.

Thanks.


Re: [PATCH net] net/sched: act_mirred: fix fragment the packet after defrag in act_ct

2020-06-30 Thread Cong Wang
On Tue, Jun 30, 2020 at 7:36 PM wenxu  wrote:
>
>
> On 7/1/2020 3:02 AM, Cong Wang wrote:
> > On Mon, Jun 29, 2020 at 7:55 PM  wrote:
> >> From: wenxu 
> >>
> >> Fragmented packets are defragmented in the act_ct module. The reassembled
> >> packet can exceed the MTU in act_mirred. Such a big packet should be
> >> fragmented again before being sent out.
> > This is too brief. Why should act_mirred handle the burden introduced by
> > act_ct? And why is this 158-line change targeting -net, not -net-next?
>
> Hi Cong,
>
> In act_ct, fragmented packets are defragmented into one big packet before
> the conntrack processing.
>
> But in the later mirred filter action, that big packet normally exceeds
> the MTU of the outgoing device.
>
> So when act_mirred sends the packet, it should be fragmented again.

Why act_mirred? Not, for a quick example, a new action called act_defrag?
I understand you happen to use the combination of act_ct and act_mirred,
but that is not the reason we should make act_mirred specifically work
for your case.

>
> I think this is a bugfix for the net branch; the act_ct handling of
> fragments will always fail otherwise.

act_mirred is just there to mirror or redirect packets; if they are too big
to deliver, it is not act_mirred's fault.

And more importantly, why don't you put all this important information in
your changelog?

Thanks.


Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-30 Thread Cong Wang
On Sat, Jun 27, 2020 at 4:41 PM Roman Gushchin  wrote:
>
> On Fri, Jun 26, 2020 at 10:58:14AM -0700, Cong Wang wrote:
> > On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas  wrote:
> > >
> > > Hello,
> > >
> > > Somewhere along the way I got the impression that it generally takes
> > > those affected hours before their systems lock up. I'm (generally) able
> > > to reproduce this issue much faster than that. Regardless, I can help 
> > > test.
> > >
> > > Are there any patches that need testing or is this all still pending
> > > discussion around the  best way to resolve the issue?
> >
> > Yes. I came up with a (hopefully) much better patch in the attachment.
> > Can you help to test it? You need to unapply the previous patch before
> > applying this one.
> >
> > (Just in case of any confusion: I still believe we should check NULL on
> > top of this refcnt fix. But it should be a separate patch.)
> >
> > Thank you!
>
> Not opposing the patch, but the Fixes tag is still confusing me.
> Do we have an explanation for what's wrong with 4bfc0bb2c60e?
>
> It looks like we have cgroup_bpf_get()/put() exactly where we have
> cgroup_get()/put(), so it would be nice to understand what's different
> if the problem is bpf-related.

Hmm, I think it is Zefan who believes cgroup refcnt is fine, the bug
is just in cgroup bpf refcnt, in our previous discussion.

Although I agree cgroup refcnt is buggy too, it may not necessarily
cause any real problem, otherwise we would have received bug reports
much earlier than just recently, right?

If the Fixes tag is confusing, I can certainly remove it, but this also
means the patch will not be backported to stable. I am fine either
way, this crash is only reported after Zefan's recent change anyway.

Thanks.


Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-30 Thread Cong Wang
On Sat, Jun 27, 2020 at 3:59 PM Cameron Berkenpas  wrote:
>
> The box has been up without issue for over 25 hours now. The patch seems
> solid.

That's great! Thanks for testing!


Re: [PATCH net] net/sched: act_mirred: fix fragment the packet after defrag in act_ct

2020-06-30 Thread Cong Wang
On Mon, Jun 29, 2020 at 7:55 PM  wrote:
>
> From: wenxu 
>
> Fragmented packets are defragmented in the act_ct module. The reassembled
> packet can exceed the MTU in act_mirred. Such a big packet should be
> fragmented again before being sent out.

This is too brief. Why should act_mirred handle the burden introduced by
act_ct? And why is this 158-line change targeting -net, not -net-next?

Thanks.


Re: [PATCH net v2] genetlink: remove genl_bind

2020-06-30 Thread Cong Wang
On Tue, Jun 30, 2020 at 10:50 AM Sean Tranchetti
 wrote:
>
> A potential deadlock can occur during registering or unregistering a
> new generic netlink family between the main nl_table_lock and the
> cb_lock where each thread wants the lock held by the other, as
> demonstrated below.
>
> 1) Thread 1 is performing a netlink_bind() operation on a socket. As part
>of this call, it will call netlink_lock_table(), incrementing the
>nl_table_users count to 1.
> 2) Thread 2 is registering (or unregistering) a genl_family via the
>genl_(un)register_family() API. The cb_lock semaphore will be taken for
>writing.
> 3) Thread 1 will call genl_bind() as part of the bind operation to handle
>subscribing to GENL multicast groups at the request of the user. It will
>attempt to take the cb_lock semaphore for reading, but it will fail and
>be scheduled away, waiting for Thread 2 to finish the write.
> 4) Thread 2 will call netlink_table_grab() during the (un)registration
>call. However, as Thread 1 has incremented nl_table_users, it will not
>be able to proceed, and both threads will be stuck waiting for the
>other.
>
> genl_bind() is a noop, unless a genl_family implements the mcast_bind()
> function to handle setting up family-specific multicast operations. Since
> no one in-tree uses this functionality as Cong pointed out, simply removing
> the genl_bind() function will remove the possibility for deadlock, as there
> is no attempt by Thread 1 above to take the cb_lock semaphore.

I think it is worth noting that removing the -ENOENT is probably okay,
as mentioned in commit 023e2cfa36c31b0ad28c159a1bb0d61ff57334c8:

Also do this in generic netlink, and there also make sure that any
bind for multicast groups that only exist in init_net is rejected.
This isn't really a problem if it is accepted since a client in a
different namespace will never receive any notifications from such
a group, but it can confuse the family if not rejected (it's also
possible to silently (without telling the family) accept it, but it
would also have to be ignored on unbind so families that take any
kind of action on bind/unbind won't do unnecessary work for invalid
clients like that.

Thanks.


Re: KASAN: use-after-free Read in nl802154_dump_wpan_phy (2)

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: vmalloc-out-of-bounds Read in tipc_nl_publ_dump

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in tipc_nl_node_dump_monitor_peer (2)

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: vmalloc-out-of-bounds Read in netdev_name_node_lookup_rcu

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: vmalloc-out-of-bounds Read in __nla_validate_parse

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: vmalloc-out-of-bounds Read in __cfg802154_wpan_dev_from_attrs

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in devlink_get_from_attrs

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in tipc_nl_publ_dump (2)

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in dev_get_by_name

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in netdev_name_node_lookup_rcu

2020-06-29 Thread Cong Wang
#syz fix: genetlink: get rid of family->attrbuf


Re: KASAN: use-after-free Read in netdev_name_node_lookup_rcu

2020-06-29 Thread Cong Wang
On Mon, Jun 29, 2020 at 6:17 PM Jason A. Donenfeld  wrote:
>
> Hey Cong,

Hi, Jason

>
> I'm wondering if the below error is related to what you've been
> looking at yesterday. AFAICT, there's a simple UaF on the attrbuf
> passed to the start method. I recall recently you were working on the
> locking in genetlink's family buffers and wound up mallocing some
> things, so it seems like this might be related. See below.

Yeah, very likely it is the same bug I have fixed. I will close
this together with others.

Thanks.


Re: [PATCH net] genetlink: take netlink table lock when (un)registering

2020-06-27 Thread Cong Wang
On Fri, Jun 26, 2020 at 5:32 PM Sean Tranchetti  wrote:
>
> A potential deadlock can occur during registering or unregistering a new
> generic netlink family between the main nl_table_lock and the cb_lock where
> each thread wants the lock held by the other, as demonstrated below.
>
> 1) Thread 1 is performing a netlink_bind() operation on a socket. As part
>of this call, it will call netlink_lock_table(), incrementing the
>nl_table_users count to 1.
> 2) Thread 2 is registering (or unregistering) a genl_family via the
>genl_(un)register_family() API. The cb_lock semaphore will be taken for
>writing.
> 3) Thread 1 will call genl_bind() as part of the bind operation to handle
>subscribing to GENL multicast groups at the request of the user. It will
>attempt to take the cb_lock semaphore for reading, but it will fail and
>be scheduled away, waiting for Thread 2 to finish the write.
> 4) Thread 2 will call netlink_table_grab() during the (un)registration
>call. However, as Thread 1 has incremented nl_table_users, it will not
>be able to proceed, and both threads will be stuck waiting for the
>other.
>
> To avoid this scenario, the locks should be acquired in the same order by
> both threads. Since both the register and unregister functions need to take
> the nl_table_lock in their processing, it makes sense to explicitly acquire
> them before they lock the genl_mutex and the cb_lock. In unregistering, no
> other change is needed aside from this locking change.

Like the kernel test robot reported, you cannot call genl_lock_all()
while holding netlink_table_grab(), which is effectively a write lock.

To me, it seems genl_bind() can simply be removed, as no in-tree user
implements family->mcast_bind(). Can you test the attached patch?
It seems sufficient to fix this deadlock.
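
The ABBA pattern in miniature (illustrative, condensed from the steps
quoted above):

	/* thread 1 (bind):           thread 2 (register):      */
	netlink_lock_table();         down_write(&cb_lock);
	down_read(&cb_lock);          netlink_table_grab();
	/* blocks on cb_lock          blocks on nl_table_users  */

Each side holds the lock the other needs; removing genl_bind() removes
thread 1's down_read(&cb_lock), so no cycle remains.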

Thanks.
diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index ad71ed4f55ff..6e5f1e1aa822 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -35,12 +35,6 @@ struct genl_info;
  *	do additional, common, filtering and return an error
  * @post_doit: called after an operation's doit callback, it may
  *	undo operations done by pre_doit, for example release locks
- * @mcast_bind: a socket bound to the given multicast group (which
- *	is given as the offset into the groups array)
- * @mcast_unbind: a socket was unbound from the given multicast group.
- *	Note that unbind() will not be called symmetrically if the
- *	generic netlink family is removed while there are still open
- *	sockets.
  * @mcgrps: multicast groups used by this family
  * @n_mcgrps: number of multicast groups
  * @mcgrp_offset: starting number of multicast group IDs in this family
@@ -63,8 +57,6 @@ struct genl_family {
 	void			(*post_doit)(const struct genl_ops *ops,
 	 struct sk_buff *skb,
 	 struct genl_info *info);
-	int			(*mcast_bind)(struct net *net, int group);
-	void			(*mcast_unbind)(struct net *net, int group);
 	const struct genl_ops *	ops;
 	const struct genl_multicast_group *mcgrps;
 	unsigned int		n_ops;
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index a914b9365a46..9395ee8a868d 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1144,60 +1144,11 @@ static struct genl_family genl_ctrl __ro_after_init = {
 	.netnsok = true,
 };
 
-static int genl_bind(struct net *net, int group)
-{
-	struct genl_family *f;
-	int err = -ENOENT;
-	unsigned int id;
-
-	down_read(&cb_lock);
-
-	idr_for_each_entry(&genl_fam_idr, f, id) {
-		if (group >= f->mcgrp_offset &&
-		    group < f->mcgrp_offset + f->n_mcgrps) {
-			int fam_grp = group - f->mcgrp_offset;
-
-			if (!f->netnsok && net != &init_net)
-				err = -ENOENT;
-			else if (f->mcast_bind)
-				err = f->mcast_bind(net, fam_grp);
-			else
-				err = 0;
-			break;
-		}
-	}
-	up_read(&cb_lock);
-
-	return err;
-}
-
-static void genl_unbind(struct net *net, int group)
-{
-	struct genl_family *f;
-	unsigned int id;
-
-	down_read(&cb_lock);
-
-	idr_for_each_entry(&genl_fam_idr, f, id) {
-		if (group >= f->mcgrp_offset &&
-		    group < f->mcgrp_offset + f->n_mcgrps) {
-			int fam_grp = group - f->mcgrp_offset;
-
-			if (f->mcast_unbind)
-				f->mcast_unbind(net, fam_grp);
-			break;
-		}
-	}
-	up_read(&cb_lock);
-}
-
 static int __net_init genl_pernet_init(struct net *net)
 {
 	struct netlink_kernel_cfg cfg = {
 		.input		= genl_rcv,
 		.flags		= NL_CFG_F_NONROOT_RECV,
-		.bind		= genl_bind,
-		.unbind		= genl_unbind,
 	};
 
 	/* we'll bump the group number right afterwards */


Re: possible deadlock in dev_mc_unsync

2020-06-27 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


[Patch net] genetlink: get rid of family->attrbuf

2020-06-27 Thread Cong Wang
genl_family_rcv_msg_attrs_parse() reuses the global family->attrbuf
when family->parallel_ops is false. However, family->attrbuf is not
protected by any lock on the genl_family_rcv_msg_doit() code path.

This leads to several different consequences, one of them is UAF,
like the following:

    thread 1 (doit)                     thread 2 (dumpit)

    genl_family_rcv_msg_doit():
      genl_family_rcv_msg_attrs_parse()
        attrbuf = family->attrbuf
        __nlmsg_parse(attrbuf);
                                        genl_start():
                                          genl_family_rcv_msg_attrs_parse()
                                            attrbuf = family->attrbuf
                                            __nlmsg_parse(attrbuf);
                                          info->attrs = attrs;
                                          cb->data = info;

    netlink_unicast_kernel():
      consume_skb()
                                        genl_lock_dumpit():
                                          genl_dumpit_info(cb)->attrs

Note family->attrbuf is an array of pointers to the skb data, once
the skb is freed, any dereference of family->attrbuf will be a UAF.

Maybe we could serialize the family->attrbuf with genl_mutex too, but
that would make the locking more complicated. Instead, we can just get
rid of family->attrbuf and always allocate attrbuf from heap like the
family->parallel_ops==true code path. This may add some performance
overhead but comparing with taking the global genl_mutex, it still
looks better.

Fixes: 75cdbdd08900 ("net: ieee802154: have genetlink code to parse the attrs during dumpit")
Fixes: 057af7071344 ("net: tipc: have genetlink code to parse the attrs during dumpit")
Reported-and-tested-by: syzbot+3039ddf6d7b13daf3...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+80cad1e3cb4c41cde...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+736bcbcb11b60d0c0...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+520f8704db2b68091...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c96e4dfb32f8987fd...@syzkaller.appspotmail.com
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 include/net/genetlink.h |  2 --
 net/netlink/genetlink.c | 48 +++--
 2 files changed, 13 insertions(+), 37 deletions(-)

diff --git a/include/net/genetlink.h b/include/net/genetlink.h
index 74950663bb00..ad71ed4f55ff 100644
--- a/include/net/genetlink.h
+++ b/include/net/genetlink.h
@@ -41,7 +41,6 @@ struct genl_info;
  * Note that unbind() will not be called symmetrically if the
  * generic netlink family is removed while there are still open
  * sockets.
- * @attrbuf: buffer to store parsed attributes (private)
  * @mcgrps: multicast groups used by this family
  * @n_mcgrps: number of multicast groups
  * @mcgrp_offset: starting number of multicast group IDs in this family
@@ -66,7 +65,6 @@ struct genl_family {
 struct genl_info *info);
int (*mcast_bind)(struct net *net, int group);
void(*mcast_unbind)(struct net *net, int group);
-   struct nlattr **attrbuf;/* private */
const struct genl_ops * ops;
const struct genl_multicast_group *mcgrps;
unsigned intn_ops;
diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 55ee680e9db1..a914b9365a46 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -351,22 +351,11 @@ int genl_register_family(struct genl_family *family)
start = end = GENL_ID_VFS_DQUOT;
}
 
-   if (family->maxattr && !family->parallel_ops) {
-   family->attrbuf = kmalloc_array(family->maxattr + 1,
-   sizeof(struct nlattr *),
-   GFP_KERNEL);
-   if (family->attrbuf == NULL) {
-   err = -ENOMEM;
-   goto errout_locked;
-   }
-   } else
-   family->attrbuf = NULL;
-
family->id = idr_alloc_cyclic(&genl_fam_idr, family,
  start, end + 1, GFP_KERNEL);
if (family->id < 0) {
err = family->id;
-   goto errout_free;
+   goto errout_locked;
}
 
err = genl_validate_assign_mc_groups(family);
@@ -385,8 +374,6 @@ int genl_register_family(struct genl_family *family)
 
 errout_remove:
idr_remove(&genl_fam_idr, family->id);
-errout_free:
-   kfree(family->attrbuf);
 errout_locked:
genl_unlock_all();
return err;
@@ -419,8 +406,6 @@ int genl_unregister_family(const struct genl_family *family)
   atomic_read(&genl_sk_destructing_cnt) == 0);
genl_unlock();
 
-   kfree(family->attrbuf);
-
genl_ctrl_event(CTRL_CMD_DELFAMILY, family, NULL, 0);
 
return 0;

Re: KASAN: use-after-free Read in tipc_nl_node_dump_monitor_peer (2)

2020-06-26 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


[Patch net] net: explain the lockdep annotations for dev_uc_unsync()

2020-06-26 Thread Cong Wang
The lockdep annotations for dev_uc_unsync() and dev_mc_unsync()
are not easy to understand, so add some comments to explain
why they are correct.

Similarly for the rest of the netif_addr_lock_bh() cases: they don't
need the nested version.

Cc: Taehee Yoo 
Cc: Dmitry Vyukov 
Signed-off-by: Cong Wang 
---
 net/core/dev_addr_lists.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index 6393ba930097..54cd568e7c2f 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -690,6 +690,15 @@ void dev_uc_unsync(struct net_device *to, struct net_device *from)
if (to->addr_len != from->addr_len)
return;
 
+   /* netif_addr_lock_bh() uses lockdep subclass 0, this is okay for two
+* reasons:
+* 1) This is always called without any addr_list_lock, so as the
+*outermost one here, it must be 0.
+* 2) This is called by some callers after unlinking the upper device,
+*so the dev->lower_level becomes 1 again.
+* Therefore, the subclass for 'from' is 0, for 'to' is either 1 or
+* larger.
+*/
netif_addr_lock_bh(from);
netif_addr_lock_nested(to);
__hw_addr_unsync(&to->uc, &from->uc, to->addr_len);
@@ -911,6 +920,7 @@ void dev_mc_unsync(struct net_device *to, struct net_device *from)
if (to->addr_len != from->addr_len)
return;
 
+   /* See the above comments inside dev_uc_unsync(). */
netif_addr_lock_bh(from);
netif_addr_lock_nested(to);
__hw_addr_unsync(&to->mc, &from->mc, to->addr_len);
-- 
2.27.0



[Patch net] net: get rid of lockdep_set_class_and_subclass()

2020-06-26 Thread Cong Wang
lockdep_set_class_and_subclass() is meant to reduce
the _nested() annotations by assigning a default subclass.
For addr_list_lock, we have to compute the subclass at
run-time as the netdevice topology changes after creation.

So, we should just get rid of these
lockdep_set_class_and_subclass() and stick with our _nested()
annotations.
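
The contrast in miniature (illustrative, not part of the diff):

	/* static: the subclass is fixed once, at init time */
	lockdep_set_class_and_subclass(&dev->addr_list_lock,
				       &key, dev->lower_level);

	/* dynamic: the subclass is supplied at each lock site */
	spin_lock_nested(&dev->addr_list_lock, dev->lower_level);

The second form picks up dev->lower_level as it is at lock time, which
is what a topology that keeps changing after device creation needs.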

Fixes: 845e0ebb4408 ("net: change addr_list_lock back to static key")
Suggested-by: Taehee Yoo 
Cc: Dmitry Vyukov 
Signed-off-by: Cong Wang 
---
 drivers/net/macsec.c  | 5 ++---
 drivers/net/macvlan.c | 5 ++---
 net/8021q/vlan_dev.c  | 9 -
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index e56547bfdac9..9159846b8b93 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -4052,9 +4052,8 @@ static int macsec_newlink(struct net *net, struct net_device *dev,
return err;
 
netdev_lockdep_set_classes(dev);
-   lockdep_set_class_and_subclass(&dev->addr_list_lock,
-  &macsec_netdev_addr_lock_key,
-  dev->lower_level);
+   lockdep_set_class(&dev->addr_list_lock,
+ &macsec_netdev_addr_lock_key);
 
err = netdev_upper_dev_link(real_dev, dev, extack);
if (err < 0)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 6a6cc9f75307..4942f6112e51 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -880,9 +880,8 @@ static struct lock_class_key macvlan_netdev_addr_lock_key;
 static void macvlan_set_lockdep_class(struct net_device *dev)
 {
netdev_lockdep_set_classes(dev);
-   lockdep_set_class_and_subclass(&dev->addr_list_lock,
-  &macvlan_netdev_addr_lock_key,
-  dev->lower_level);
+   lockdep_set_class(&dev->addr_list_lock,
+ &macvlan_netdev_addr_lock_key);
 }
 
 static int macvlan_init(struct net_device *dev)
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index c8d6a07e23c5..3dd7c972677b 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -503,11 +503,10 @@ static void vlan_dev_set_lockdep_one(struct net_device *dev,
lockdep_set_class(&txq->_xmit_lock, &vlan_netdev_xmit_lock_key);
 }
 
-static void vlan_dev_set_lockdep_class(struct net_device *dev, int subclass)
+static void vlan_dev_set_lockdep_class(struct net_device *dev)
 {
-   lockdep_set_class_and_subclass(&dev->addr_list_lock,
-  &vlan_netdev_addr_lock_key,
-  subclass);
+   lockdep_set_class(&dev->addr_list_lock,
+ &vlan_netdev_addr_lock_key);
netdev_for_each_tx_queue(dev, vlan_dev_set_lockdep_one, NULL);
 }
 
@@ -601,7 +600,7 @@ static int vlan_dev_init(struct net_device *dev)
 
SET_NETDEV_DEVTYPE(dev, &vlan_type);
 
-   vlan_dev_set_lockdep_class(dev, dev->lower_level);
+   vlan_dev_set_lockdep_class(dev);
 
vlan->vlan_pcpu_stats = netdev_alloc_pcpu_stats(struct vlan_pcpu_stats);
if (!vlan->vlan_pcpu_stats)
-- 
2.27.0



Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-26 Thread Cong Wang
On Thu, Jun 25, 2020 at 10:23 PM Cameron Berkenpas  wrote:
>
> Hello,
>
> Somewhere along the way I got the impression that it generally takes
> those affected hours before their systems lock up. I'm (generally) able
> to reproduce this issue much faster than that. Regardless, I can help test.
>
> Are there any patches that need testing or is this all still pending
> discussion around the  best way to resolve the issue?

Yes. I came up with a (hopefully) much better patch in the attachment.
Can you help to test it? You need to unapply the previous patch before
applying this one.

(Just in case of any confusion: I still believe we should check NULL on
top of this refcnt fix. But it should be a separate patch.)

Thank you!
commit 259150604c0b77c717fdaab057da5722e2dfd922
Author: Cong Wang 
Date:   Sat Jun 13 12:34:40 2020 -0700

cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Reported-by: Cameron Berkenpas 
Reported-by: Peter Geis 
Reported-by: Lu Fengqi 
Reported-by: Daniël Sonck 
Tested-by: Cameron Berkenpas 
Cc: Daniel Borkmann 
Cc: Zefan Li 
Cc: Tejun Heo 
Signed-off-by: Cong Wang 

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 52661155f85f..4f1cd0edc57d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -790,7 +790,8 @@ struct sock_cgroup_data {
 	union {
 #ifdef __LITTLE_ENDIAN
 		struct {
-			u8	is_data;
+			u8	is_data : 1;
+			u8	no_refcnt : 1;
 			u8	padding;
 			u16	prioidx;
 			u32	classid;
@@ -800,7 +801,8 @@ struct sock_cgroup_data {
 			u32	classid;
 			u16	prioidx;
 			u8	padding;
-			u8	is_data;
+			u8	no_refcnt : 1;
+			u8	is_data : 1;
 		} __packed;
 #endif
 		u64		val;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..618838c48313 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -822,6 +822,7 @@ extern spinlock_t cgroup_sk_update_lock;
 
 void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
@@ -835,7 +836,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 	 */
 	v = READ_ONCE(skcd->val);
 
-	if (v & 1)
+	if (v & 3)
 		return &cgrp_dfl_root.cgrp;
 
 	return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
@@ -847,6 +848,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
 #else	/* CONFIG_CGROUP_DATA */
 
 static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
 static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
 
 #endif	/* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1ea181a58465..dd247747ec14 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6439,18 +6439,8 @@ void cgroup_sk_alloc_disable(void)
 
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 {
-	if (cgroup_sk_alloc_disabled)
-		return;
-
-	/* Socket clone path */
-	if (skcd->val) {
-		/*
-		 * We might be cloning a socket which is left in an empty
-		 * cgroup and the cgroup might have already been rmdir'd.
-		 * Don't use cgroup_get_live().
-		 */
-		cgroup_get(sock_cgroup_ptr(skcd));
-		cgroup_bpf_get(sock_cgroup_ptr(skcd));
+	if (cgroup_sk_alloc_disabled) {
+		skcd->no_refcnt = 1;
 		return;
 	}
 
@@ -6475,10 +6465,27 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
 	rcu_read_unlock();
 }
 
+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+	if (skcd->val) {
+		if (skcd->no_refcnt)
+			return;
+		/*
+		 * We might be cloning a socket which is left in an empty
+		 * cgroup and the cgroup might have already been rmdir'd.
+		 * Don't use cgroup_get_live().
+		 */
+		cgroup_get(sock_cgroup_ptr(skcd));
+		cgroup_bpf_get(sock_cgroup_ptr(skcd));
+	}
+}
+
 void cgroup_sk_free(struct sock_cgroup_data *skcd)
 {
 	struct cgroup *cgrp = sock_cgroup_ptr(skcd);
 
+	if (skcd->no_refcnt)
+		return;
 	cgroup_bpf_put(cgrp);
 	cgroup_put(cgrp);
 }
diff --git a/net/core/sock.c b/net/core/sock.c
index d832c65

Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-23 Thread Cong Wang
On Tue, Jun 23, 2020 at 1:45 AM Zhang,Qiang  wrote:
>
> There are some messages in kernel v5.4; I don't know if they will help.
>
> demsg:
>
> cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or
> net_cls activation
...
> ---[ cut here ]---
> percpu ref (cgroup_bpf_release_fn) <= 0 (-12) after switching to atomic
> WARNING: CPU: 1 PID: 0 at lib/percpu-refcount.c:161
> percpu_ref_switch_to_atomic_rcu+0x12a/0x140

Yes, this proves we have the refcnt bug which my patch tries to fix.
The negative refcnt is exactly a consequence of the bug, as without
my patch we just put the refcnt without holding it first.

If you can reproduce it, please test my patch:
https://patchwork.ozlabs.org/project/netdev/patch/20200616180352.18602-1-xiyou.wangc...@gmail.com/

But so far I still don't have a good explanation for the NULL
pointer deref. I think that one is an older bug, and we need to check
for NULL even after we fix the refcnt bug, but I do not know why it was
only exposed recently by Zefan's patch. I am still trying to find an
explanation.

Thanks!


Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-22 Thread Cong Wang
On Sat, Jun 20, 2020 at 8:58 AM Roman Gushchin  wrote:
>
> On Fri, Jun 19, 2020 at 08:00:41PM -0700, Cong Wang wrote:
> > On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin  wrote:
> > >
> > > On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > > > I think so, though I'm not familiar with the bpf cgroup code.
> > > >
> > > > > If so, we might wanna fix it in a different way,
> > > > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > > > like in cgroup_put(). It feels more reliable to me.
> > > > >
> > > >
> > > > Yeah I also have this idea in my mind.
> > >
> > > I wonder if the following patch will fix the issue?
> >
> > Interesting, AFAIU, this refcnt is for bpf programs attached
> > to the cgroup. By this suggestion, do you mean the root
> > cgroup does not need to refcnt the bpf programs attached
> > to it? This seems odd, as I don't see how root is different
> > from others in terms of bpf programs which can be attached
> > and detached in the same way.
> >
> > I certainly understand the root cgroup is never gone, but this
> > does not mean the bpf programs attached to it too.
> >
> > What am I missing?
>
> It's different because the root cgroup can't be deleted.
>
> All this reference counting is required to automatically detach bpf programs
> from a _deleted_ cgroup (look at cgroup_bpf_offline()). It's required
> because a cgroup can be in dying state for a long time being pinned by a
> pagecache page, for example. Only a user can detach a bpf program from
> an existing cgroup.

Yeah, but users can still detach the bpf programs from the root cgroup.
IIUC, after detaching, the pointer in the bpf array will be empty_prog_array,
which is just an array of NULL. Then __cgroup_bpf_run_filter_skb() will
deref it without checking NULL (as check_non_null == false).

This matches the 0010 pointer seen in the bug reports:
0x10, that is 16, is the offset of items[] in struct bpf_prog_array.
So it looks like we have to add a NULL check there regardless of the refcnt.
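
To make the 0x10 concrete, a minimal sketch of the layout as I read it
(field order from include/linux/bpf.h; offsets assume 64-bit):

struct bpf_prog_array {
	struct rcu_head rcu;                 /* two pointers = 16 bytes */
	struct bpf_prog_array_item items[];  /* so items[] starts at 0x10 */
};

and the run-array macro behind __cgroup_bpf_run_filter_skb() does,
in effect:

	item = &array->items[0];     /* NULL + 0x10 when array is NULL */
	while ((prog = READ_ONCE(item->prog))) { ... }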

Also, I am not sure whether your suggested patch makes a difference
for the percpu refcnt, as percpu_ref_put() will never call ->release() until
percpu_ref_kill(), which is never called on the root cgroup?
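
Schematically, the put path in lib/percpu-refcount.c, as I read it, is:

	percpu_ref_put()
		if (__ref_is_percpu(ref))          /* true until percpu_ref_kill() */
			this_cpu_dec(*percpu_count);   /* ->release() never runs */
		else if (atomic_long_dec_and_test(&ref->count))
			ref->release(ref);

so an underflow only becomes visible once percpu_ref_kill() switches the
ref to atomic mode, which is where the percpu_ref_switch_to_atomic_rcu
warning quoted earlier fires, and that never happens on the root cgroup.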

Thanks.


Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-19 Thread Cong Wang
On Fri, Jun 19, 2020 at 5:51 PM Zefan Li  wrote:
>
> On 2020/6/20 8:45, Zefan Li wrote:
> > On 2020/6/20 3:51, Cong Wang wrote:
> >> On Thu, Jun 18, 2020 at 11:40 PM Zefan Li  wrote:
> >>>
> >>> On 2020/6/19 5:09, Cong Wang wrote:
> >>>> On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin  wrote:
> >>>>>
> >>>>> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>>>>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li  wrote:
> >>>>>>>
> >>>>>>> Cc: Roman Gushchin 
> >>>>>>>
> >>>>>>> Thanks for fixing this.
> >>>>>>>
> >>>>>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>>>>> even when cgroup_sk_alloc is disabled.
> >>>>>>>>
> >>>>>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>>>>> would terminate this function if called there. And for sk_alloc()
> >>>>>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>>>>> to make it more readable.
> >>>>>>>>
> >>>>>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory 
> >>>>>>>> leak of v2 cgroups")
> >>>>>>>
> >>>>>>> but I don't think the bug was introduced by this commit, because there
> >>>>>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>>>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>>>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>>>>> with systemd involved.
> >>>>>>>
> >>>>>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of 
> >>>>>>> cgroup_bpf from cgroup itself"),
> >>>>>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>>>>
> >>>>>> Good point.
> >>>>>>
> >>>>>> I take a deeper look, it looks like commit d979a39d7242e06
> >>>>>> is the one to blame, because it is the first commit that began to
> >>>>>> hold cgroup refcnt in cgroup_sk_alloc().
> >>>>>
> >>>>> I agree, it seems that the issue is not related to bpf and probably
> >>>>> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >>>>> seems closer to the origin.
> >>>>
> >>>> Yeah, I will update the Fixes tag and send V2.
> >>>>
> >>>
> >>> Commit d979a39d7242e06 looks innocent to me. With this commit when 
> >>> cgroup_sk_alloc
> >>> is disabled and then a socket is cloned the cgroup refcnt will not be 
> >>> incremented,
> >>> but this is fine, because when the socket is to be freed:
> >>>
> >>>  sk_prot_free()
> >>>cgroup_sk_free()
> >>>  cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
> >>>
> >>> cgroup_put() does nothing for the default root cgroup, so nothing bad 
> >>> will happen.
> >>
> >> But skcd->val can be a pointer to a non-root cgroup:
> >
> > It returns a non-root cgroup when cgroup_sk_alloc is not disabled. The bug 
> > happens
> > when cgroup_sk_alloc is disabled.
> >
>
> And please read those recent bug reports, they all happened when bpf cgroup 
> was in use,
> and there was no bpf cgroup when d979a39d7242e06 was merged into mainline.

I am totally aware of this. My concern is whether the cgroup refcnt
has the same bug, as it always pairs with the bpf refcnt.

But, after a second look, the non-root cgroup refcnt is immediately
overwritten by sock_update_classid() or sock_update_netprioidx(),
which effectively turns it into a root cgroup again. :-/

(It seems we leak a refcnt here, but this is not related to my patch).
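
For completeness, the overwrite works because sock_cgroup_data is a
union; roughly, per include/linux/cgroup-defs.h (little-endian variant):

struct sock_cgroup_data {
	union {
		struct {
			u8	is_data;	/* set by sock_cgroup_set_{prioidx,classid}() */
			u8	padding;
			u16	prioidx;
			u32	classid;
		} __packed;
		u64	val;	/* holds the (struct cgroup *) while is_data == 0 */
	};
};

Once is_data is set, the low bit of ->val becomes 1 and
sock_cgroup_ptr() maps the socket back to &cgrp_dfl_root.cgrp, so the
non-root pointer, and the refcnt taken on it, is simply dropped.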

Thanks.


Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-19 Thread Cong Wang
On Fri, Jun 19, 2020 at 6:14 PM Roman Gushchin  wrote:
>
> On Sat, Jun 20, 2020 at 09:00:40AM +0800, Zefan Li wrote:
> > I think so, though I'm not familiar with the bpf cgroup code.
> >
> > > If so, we might wanna fix it in a different way,
> > > just checking if (!(css->flags & CSS_NO_REF)) in cgroup_bpf_put()
> > > like in cgroup_put(). It feels more reliable to me.
> > >
> >
> > Yeah I also have this idea in my mind.
>
> I wonder if the following patch will fix the issue?

Interesting, AFAIU, this refcnt is for bpf programs attached
to the cgroup. By this suggestion, do you mean the root
cgroup does not need to refcnt the bpf programs attached
to it? This seems odd, as I don't see how root is different
from others in terms of bpf programs which can be attached
and detached in the same way.

I certainly understand the root cgroup is never gone, but that does
not mean the same for the bpf programs attached to it.

What am I missing?

Thanks.


Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-19 Thread Cong Wang
On Thu, Jun 18, 2020 at 11:40 PM Zefan Li  wrote:
>
> On 2020/6/19 5:09, Cong Wang wrote:
> > On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin  wrote:
> >>
> >> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> >>> On Wed, Jun 17, 2020 at 6:44 PM Zefan Li  wrote:
> >>>>
> >>>> Cc: Roman Gushchin 
> >>>>
> >>>> Thanks for fixing this.
> >>>>
> >>>> On 2020/6/17 2:03, Cong Wang wrote:
> >>>>> When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> >>>>> copied, so the cgroup refcnt must be taken too. And, unlike the
> >>>>> sk_alloc() path, sock_update_netprioidx() is not called here.
> >>>>> Therefore, it is safe and necessary to grab the cgroup refcnt
> >>>>> even when cgroup_sk_alloc is disabled.
> >>>>>
> >>>>> sk_clone_lock() is in BH context anyway, the in_interrupt()
> >>>>> would terminate this function if called there. And for sk_alloc()
> >>>>> skcd->val is always zero. So it's safe to factor out the code
> >>>>> to make it more readable.
> >>>>>
> >>>>> Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak 
> >>>>> of v2 cgroups")
> >>>>
> >>>> but I don't think the bug was introduced by this commit, because there
> >>>> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> >>>> write_classid(), which can be triggered by writing to ifpriomap or
> >>>> classid in cgroupfs. This commit just made it much easier to happen
> >>>> with systemd involved.
> >>>>
> >>>> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf 
> >>>> from cgroup itself"),
> >>>> which added cgroup_bpf_get() in cgroup_sk_alloc().
> >>>
> >>> Good point.
> >>>
> >>> I take a deeper look, it looks like commit d979a39d7242e06
> >>> is the one to blame, because it is the first commit that began to
> >>> hold cgroup refcnt in cgroup_sk_alloc().
> >>
> >> I agree, it seems that the issue is not related to bpf and probably
> >> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> >> seems closer to the origin.
> >
> > Yeah, I will update the Fixes tag and send V2.
> >
>
> Commit d979a39d7242e06 looks innocent to me. With this commit when 
> cgroup_sk_alloc
> is disabled and then a socket is cloned the cgroup refcnt will not be 
> incremented,
> but this is fine, because when the socket is to be freed:
>
>  sk_prot_free()
>cgroup_sk_free()
>  cgroup_put(sock_cgroup_ptr(skcd)) == cgroup_put(&cgrp_dfl_root.cgrp)
>
> cgroup_put() does nothing for the default root cgroup, so nothing bad will 
> happen.

But skcd->val can be a pointer to a non-root cgroup:

static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
{
#if defined(CONFIG_CGROUP_NET_PRIO) || defined(CONFIG_CGROUP_NET_CLASSID)
unsigned long v;

/*
 * @skcd->val is 64bit but the following is safe on 32bit too as we
 * just need the lower ulong to be written and read atomically.
 */
v = READ_ONCE(skcd->val);

if (v & 1)
return &cgrp_dfl_root.cgrp;

return (struct cgroup *)(unsigned long)v ?: &cgrp_dfl_root.cgrp;
#else
return (struct cgroup *)(unsigned long)skcd->val;
#endif
}


Re: RATE not being printed on tc -s class show dev XXXX

2020-06-18 Thread Cong Wang
On Tue, Jun 16, 2020 at 7:06 AM Roberto J. Blandino Cisneros
 wrote:
> I am seeing "rate 0bit".
>
> But installing from the Debian package iproute2 I got nothing, so I decided
> to compile iproute2 using version 5.7.0.
>
> But my output is the same as below:

You either need to enable /sys/module/sch_htb/parameters/htb_rate_est
or specify a rate estimator when you create your HTB class.
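
E.g. (untested; eth0 and the numbers below are just placeholders):

  # globally, before the classes are created:
  echo 1 > /sys/module/sch_htb/parameters/htb_rate_est

  # or per class, with an explicit rate estimator (1s interval, 8s average):
  tc class add dev eth0 parent 1: classid 1:10 est 1sec 8sec htb rate 10mbit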

Thanks.


Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-18 Thread Cong Wang
On Thu, Jun 18, 2020 at 12:36 PM Roman Gushchin  wrote:
>
> On Thu, Jun 18, 2020 at 12:19:13PM -0700, Cong Wang wrote:
> > On Wed, Jun 17, 2020 at 6:44 PM Zefan Li  wrote:
> > >
> > > Cc: Roman Gushchin 
> > >
> > > Thanks for fixing this.
> > >
> > > On 2020/6/17 2:03, Cong Wang wrote:
> > > > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > > > copied, so the cgroup refcnt must be taken too. And, unlike the
> > > > sk_alloc() path, sock_update_netprioidx() is not called here.
> > > > Therefore, it is safe and necessary to grab the cgroup refcnt
> > > > even when cgroup_sk_alloc is disabled.
> > > >
> > > > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > > > would terminate this function if called there. And for sk_alloc()
> > > > skcd->val is always zero. So it's safe to factor out the code
> > > > to make it more readable.
> > > >
> > > > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak 
> > > > of v2 cgroups")
> > >
> > > but I don't think the bug was introduced by this commit, because there
> > > are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> > > write_classid(), which can be triggered by writing to ifpriomap or
> > > classid in cgroupfs. This commit just made it much easier to happen
> > > with systemd involved.
> > >
> > > I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf 
> > > from cgroup itself"),
> > > which added cgroup_bpf_get() in cgroup_sk_alloc().
> >
> > Good point.
> >
> > I take a deeper look, it looks like commit d979a39d7242e06
> > is the one to blame, because it is the first commit that began to
> > hold cgroup refcnt in cgroup_sk_alloc().
>
> I agree, it seems that the issue is not related to bpf and probably
> can be reproduced without CONFIG_CGROUP_BPF. d979a39d7242e06 indeed
> seems closer to the origin.

Yeah, I will update the Fixes tag and send V2.

>
> Btw, based on the number of reported-by tags it seems that there was
> a real issue which the patch is fixing. Maybe you'll a couple of words
> about how it reveals itself in the real life?

I still have no idea how exactly this is triggered. According to the
people who reported this bug, they just need to wait a few hours for
it to trigger. So I am not sure what to add here, just the stack trace?

Thanks.


Re: [Patch net] net: change addr_list_lock back to static key

2020-06-18 Thread Cong Wang
On Thu, Jun 18, 2020 at 1:33 PM Vladimir Oltean  wrote:
>
> On Thu, 18 Jun 2020 at 23:06, Cong Wang  wrote:
> >
> > On Thu, Jun 18, 2020 at 12:56 PM Cong Wang  wrote:
> > >
> > > On Thu, Jun 18, 2020 at 12:40 PM Vladimir Oltean  
> > > wrote:
> > > >
> > > > It's me with the stacked DSA devices again:
> > >
> > > It looks like DSA never uses netdev API to link master
> > > device with slave devices? If so, their dev->lower_level
> > > are always 1, therefore triggers this warning.
> > >
> > > I think it should call one of these netdev_upper_dev_link()
> > > API's when creating a slave device.
> > >
> >
> > I don't know whether DSA is too special to use the API, but
> > something like this should work:
> >
> > diff --git a/net/dsa/slave.c b/net/dsa/slave.c
> > index 4c7f086a047b..f7a2a281e7f0 100644
> > --- a/net/dsa/slave.c
> > +++ b/net/dsa/slave.c
> > @@ -1807,6 +1807,11 @@ int dsa_slave_create(struct dsa_port *port)
> >ret, slave_dev->name);
> > goto out_phy;
> > }
> > +   ret = netdev_upper_dev_link(slave_dev, master, NULL);
> > +   if (ret) {
> > +   unregister_netdevice(slave_dev);
> > +   goto out_phy;
> > +   }
> >
> > return 0;
> >
> > @@ -1832,6 +1837,7 @@ void dsa_slave_destroy(struct net_device *slave_dev)
> > netif_carrier_off(slave_dev);
> > rtnl_lock();
> > phylink_disconnect_phy(dp->pl);
> > +   netdev_upper_dev_unlink(slave_dev, dp->master);
> > rtnl_unlock();
> >
> > dsa_slave_notify(slave_dev, DSA_PORT_UNREGISTER);
>
> Thanks. This is a good approximation of what needed to be done:
> - netdev_upper_dev_link needs to be under rtnl,
> - "dp->master" should be "dsa_slave_to_master(slave_dev)" since it's
> actually a union if you look at struct dsa_port).
> - And, most importantly, I think the hierarchy should be reversed: a
> (virtual) DSA switch port net device (slave) should be an upper of the
> (real) DSA master (the host port). Think of it like this: a DSA switch
> is a sort of port multiplier for a host port, based on a frame header.
> But, it works!

Please feel free to make any changes you need and submit it
yourself, as you know DSA better than I do and I do not even
have a DSA testing environment.

>
> Do you mind if I submit your modified patch to "net"? What would be an
> adequate Fixes: tag?

If it is merely to fix the lockdep warning, my commit 845e0ebb4408d447
is the right one to blame.
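
i.e., something like:

Fixes: 845e0ebb4408 ("net: change addr_list_lock back to static key")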

Thanks.


Re: [Patch net] net: change addr_list_lock back to static key

2020-06-18 Thread Cong Wang
On Thu, Jun 18, 2020 at 12:56 PM Cong Wang  wrote:
>
> On Thu, Jun 18, 2020 at 12:40 PM Vladimir Oltean  wrote:
> >
> > It's me with the stacked DSA devices again:
>
> It looks like DSA never uses netdev API to link master
> device with slave devices? If so, their dev->lower_level
> are always 1, therefore triggers this warning.
>
> I think it should call one of these netdev_upper_dev_link()
> API's when creating a slave device.
>

I don't know whether DSA is too special to use the API, but
something like this should work:

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 4c7f086a047b..f7a2a281e7f0 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1807,6 +1807,11 @@ int dsa_slave_create(struct dsa_port *port)
   ret, slave_dev->name);
goto out_phy;
}
+   ret = netdev_upper_dev_link(slave_dev, master, NULL);
+   if (ret) {
+   unregister_netdevice(slave_dev);
+   goto out_phy;
+   }

return 0;

@@ -1832,6 +1837,7 @@ void dsa_slave_destroy(struct net_device *slave_dev)
netif_carrier_off(slave_dev);
rtnl_lock();
phylink_disconnect_phy(dp->pl);
+   netdev_upper_dev_unlink(slave_dev, dp->master);
rtnl_unlock();

dsa_slave_notify(slave_dev, DSA_PORT_UNREGISTER);


Re: [Patch net] net: change addr_list_lock back to static key

2020-06-18 Thread Cong Wang
On Thu, Jun 18, 2020 at 12:40 PM Vladimir Oltean  wrote:
>
> It's me with the stacked DSA devices again:

It looks like DSA never uses the netdev API to link the master
device with its slave devices? If so, their dev->lower_level
is always 1, which triggers this warning.

I think it should call one of the netdev_upper_dev_link()
APIs when creating a slave device.

Thanks.


Re: [Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-18 Thread Cong Wang
On Wed, Jun 17, 2020 at 6:44 PM Zefan Li  wrote:
>
> Cc: Roman Gushchin 
>
> Thanks for fixing this.
>
> On 2020/6/17 2:03, Cong Wang wrote:
> > When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
> > copied, so the cgroup refcnt must be taken too. And, unlike the
> > sk_alloc() path, sock_update_netprioidx() is not called here.
> > Therefore, it is safe and necessary to grab the cgroup refcnt
> > even when cgroup_sk_alloc is disabled.
> >
> > sk_clone_lock() is in BH context anyway, the in_interrupt()
> > would terminate this function if called there. And for sk_alloc()
> > skcd->val is always zero. So it's safe to factor out the code
> > to make it more readable.
> >
> > Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 
> > cgroups")
>
> but I don't think the bug was introduced by this commit, because there
> are already calls to cgroup_sk_alloc_disable() in write_priomap() and
> write_classid(), which can be triggered by writing to ifpriomap or
> classid in cgroupfs. This commit just made it much easier to happen
> with systemd involved.
>
> I think it's 4bfc0bb2c60e2f4c ("bpf: decouple the lifetime of cgroup_bpf from 
> cgroup itself"),
> which added cgroup_bpf_get() in cgroup_sk_alloc().

Good point.

I took a deeper look, and it looks like commit d979a39d7242e06
is the one to blame, because it is the first commit that began to
hold a cgroup refcnt in cgroup_sk_alloc().

The commit you mentioned above merely adds a refcnt for
cgroup bpf on top of the cgroup refcnt.
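
For reference, the bpf side is literally just a second percpu ref
living inside the cgroup; roughly, from include/linux/cgroup.h:

static inline void cgroup_bpf_get(struct cgroup *cgrp)
{
	percpu_ref_get(&cgrp->bpf.refcnt);
}

static inline void cgroup_bpf_put(struct cgroup *cgrp)
{
	percpu_ref_put(&cgrp->bpf.refcnt);
}

so wherever cgroup_get() is taken on behalf of a socket, a matching
cgroup_bpf_get() has to be taken as well, and likewise for the puts.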

Thanks.


Re: [Patch net] net: change addr_list_lock back to static key

2020-06-17 Thread Cong Wang
On Tue, Jun 16, 2020 at 8:03 AM Taehee Yoo  wrote:
>
> I agree with that.
> And, do you have any plan to replace netif_addr_lock_bh() with
> netif_addr_lock_nested()?
> (Of course, it needs BH handling code)
> I'm not sure but I think it would be needed.

Yeah, I agree it's needed. I have a patch now and will send it
out after some testing.

Thanks.


[Patch net] cgroup: fix cgroup_sk_alloc() for sk_clone_lock()

2020-06-16 Thread Cong Wang
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.

sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.

Fixes: 090e28b229af92dc5b ("netprio_cgroup: Fix unlimited memory leak of v2 
cgroups")
Reported-by: Cameron Berkenpas 
Reported-by: Peter Geis 
Reported-by: Lu Fengqi 
Reported-by: Daniël Sonck 
Tested-by: Cameron Berkenpas 
Cc: Daniel Borkmann 
Cc: Zefan Li 
Cc: Tejun Heo 
Signed-off-by: Cong Wang 
---
 include/linux/cgroup.h |  2 ++
 kernel/cgroup/cgroup.c | 26 ++
 net/core/sock.c|  2 +-
 3 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 4598e4da6b1b..818dc7b3ed6c 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -822,6 +822,7 @@ extern spinlock_t cgroup_sk_update_lock;
 
 void cgroup_sk_alloc_disable(void);
 void cgroup_sk_alloc(struct sock_cgroup_data *skcd);
+void cgroup_sk_clone(struct sock_cgroup_data *skcd);
 void cgroup_sk_free(struct sock_cgroup_data *skcd);
 
 static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
@@ -847,6 +848,7 @@ static inline struct cgroup *sock_cgroup_ptr(struct 
sock_cgroup_data *skcd)
 #else  /* CONFIG_CGROUP_DATA */
 
 static inline void cgroup_sk_alloc(struct sock_cgroup_data *skcd) {}
+static inline void cgroup_sk_clone(struct sock_cgroup_data *skcd) {}
 static inline void cgroup_sk_free(struct sock_cgroup_data *skcd) {}
 
 #endif /* CONFIG_CGROUP_DATA */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1ea181a58465..6377045b7096 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -6442,18 +6442,6 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
if (cgroup_sk_alloc_disabled)
return;
 
-   /* Socket clone path */
-   if (skcd->val) {
-   /*
-* We might be cloning a socket which is left in an empty
-* cgroup and the cgroup might have already been rmdir'd.
-* Don't use cgroup_get_live().
-*/
-   cgroup_get(sock_cgroup_ptr(skcd));
-   cgroup_bpf_get(sock_cgroup_ptr(skcd));
-   return;
-   }
-
/* Don't associate the sock with unrelated interrupted task's cgroup. */
if (in_interrupt())
return;
@@ -6475,6 +6463,20 @@ void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
rcu_read_unlock();
 }
 
+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   /* Socket clone path */
+   if (skcd->val) {
+   /*
+* We might be cloning a socket which is left in an empty
+* cgroup and the cgroup might have already been rmdir'd.
+* Don't use cgroup_get_live().
+*/
+   cgroup_get(sock_cgroup_ptr(skcd));
+   cgroup_bpf_get(sock_cgroup_ptr(skcd));
+   }
+}
+
 void cgroup_sk_free(struct sock_cgroup_data *skcd)
 {
struct cgroup *cgrp = sock_cgroup_ptr(skcd);
diff --git a/net/core/sock.c b/net/core/sock.c
index 6c4acf1f0220..b62f06fa5e37 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1925,7 +1925,7 @@ struct sock *sk_clone_lock(const struct sock *sk, const 
gfp_t priority)
/* sk->sk_memcg will be populated at accept() time */
newsk->sk_memcg = NULL;
 
-   cgroup_sk_alloc(&newsk->sk_cgrp_data);
+   cgroup_sk_clone(&newsk->sk_cgrp_data);
 
rcu_read_lock();
filter = rcu_dereference(sk->sk_filter);
-- 
2.26.2



Re: BUG: kernel NULL pointer dereference in __cgroup_bpf_run_filter_skb

2020-06-15 Thread Cong Wang
On Mon, Jun 15, 2020 at 6:06 AM Daniël Sonck  wrote:
>
> Op zo 14 jun. 2020 om 22:43 schreef Daniël Sonck :
> >
> > Hello,
> >
> > Op zo 14 jun. 2020 om 20:29 schreef Cong Wang :
> > >
> > > Hello,
> > >
> > > On Sun, Jun 14, 2020 at 5:39 AM Daniël Sonck  wrote:
> > > >
> > > > Hello,
> > > >
> > > > I found on the archive that this bug I encountered also happened to
> > > > others. I too have a very similar stacktrace. The issue I'm
> > > > experiencing is:
> > > >
> > > > Whenever I fully boot my cluster, in some time, the host crashes with
> > > > the __cgroup_bpf_run_filter_skb NULL pointer dereference. This has
> > > > been sporadic enough before not to cause real issues. However, as of
> > > > lately, the bug is triggered much more frequently. I've changed my
> > > > server hardware so I could capture serial output in order to get the
> > > > trace. This trace looked very similar as reported by Lu Fengqi. As it
> > > > currently stands, I cannot run the cluster as it's almost instantly
> > > > crashing the host.
> > >
> > > This has been reported for multiple times. Are you able to test the
> > > attached patch? And let me know if everything goes fine with it.
> >
> > I will try out the patch. Since the host reliably crashed each time as
> > I booted up
> > the cluster VMs I will be able to tell whether it has any positive effect.
> > >
> > > I suspect we may still leak some cgroup refcnt even with the patch,
> > > but it might be much harder to trigger with this patch applied.
> >
> > Currently applying the patch to the kernel and compiling so I should
> > know in a few hours
>
> The compilation with the patch has finished and I've since rebooted to the
> new kernel about 12 hours ago, so far this bug did not trigger whereas without
> the patch, by this time it would have triggered. Regardless, I will keep my
> serial connection in case something pops up.

That is great. Please keep it running as this is a race condition which
is not easy to trigger reliably.

Thanks for testing!


<    1   2   3   4   5   6   7   8   9   10   >