Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-11-04 Thread Cong Wang
On Wed, Nov 4, 2020 at 10:04 PM Cong Wang  wrote:
>
> On Mon, Nov 2, 2020 at 11:24 PM Yunsheng Lin  wrote:
> > >> From my understanding, we can do anything about the old qdisc (including
> > >> destroying the old qdisc) after some_qdisc_is_busy() returns false.
> > >
> > > But the current code does the reset _before_ some_qdisc_is_busy(). ;)
> >
> > If lock is taken when doing reset, it does not matter if the reset is
> > before some_qdisc_is_busy(), right?
>
> Why not? How about the following scenario?
>
> CPU0:                           CPU1:
> dev_reset_queue()
>                                 net_tx_action()
>                                  -> sch_direct_xmit()
>                                    -> dev_requeue_skb()
> some_qdisc_is_busy()
> // waiting for TX action on CPU1
>                                 // now some packets are requeued

Never mind, the skb_bad_txq is also cleared by dev_reset_queue().
TX action after resetting should get NULL.
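
For reference, the purge happens in qdisc_reset(), which dev_reset_queue()
ends up calling. Roughly (a sketch based on net/sched/sch_generic.c of that
era; details may differ):

void qdisc_reset(struct Qdisc *qdisc)
{
	const struct Qdisc_ops *ops = qdisc->ops;
	struct sk_buff *skb, *tmp;

	if (ops->reset)
		ops->reset(qdisc);

	/* drop deferred/requeued packets */
	skb_queue_walk_safe(&qdisc->gso_skb, skb, tmp) {
		__skb_unlink(skb, &qdisc->gso_skb);
		kfree_skb_list(skb);
	}

	/* drop the skb_bad_txq packets mentioned above */
	skb_queue_walk_safe(&qdisc->skb_bad_txq, skb, tmp) {
		__skb_unlink(skb, &qdisc->skb_bad_txq);
		kfree_skb_list(skb);
	}

	qdisc->q.qlen = 0;
	qdisc->qstats.backlog = 0;
}

So a dequeue after the reset sees both gso_skb and skb_bad_txq empty.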

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-11-04 Thread Cong Wang
On Mon, Nov 2, 2020 at 11:24 PM Yunsheng Lin  wrote:
> >> From my understanding, we can do anything about the old qdisc (including
> >> destroying the old qdisc) after some_qdisc_is_busy() returns false.
> >
> > But the current code does the reset _before_ some_qdisc_is_busy(). ;)
>
> If lock is taken when doing reset, it does not matter if the reset is
> before some_qdisc_is_busy(), right?

Why not? How about the following scenario?

CPU0:                           CPU1:
dev_reset_queue()
                                net_tx_action()
                                 -> sch_direct_xmit()
                                   -> dev_requeue_skb()
some_qdisc_is_busy()
// waiting for TX action on CPU1
                                // now some packets are requeued


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-11-02 Thread Cong Wang
On Fri, Oct 30, 2020 at 12:38 AM Yunsheng Lin  wrote:
>
> On 2020/10/30 3:05, Cong Wang wrote:
> >
> > I do not see how and why it should. synchronize_net() is merely an optimized
> > version of synchronize_rcu(); it should wait for RCU readers, softirqs are
> > not necessarily RCU readers, and net_tx_action() does not take the RCU read
> > lock either.
>
> Ok, make sense.
>
> Taking the RCU read lock in net_tx_action() does not seem to solve the
> problem; what about the time window between __netif_reschedule() and
> net_tx_action()?
>
> It seems we need to re-dereference the qdisc whenever the RCU read lock is
> released and the qdisc is still in sd->output_queue, or wait for the
> sd->output_queue to drain?

I am not suggesting you take the RCU read lock. We already wait for TX action
with a sleep loop. To me, the only thing missing is just moving the reset
after that wait.


> >>>> If we do any additional reset that is not related to the qdisc in
> >>>> dev_reset_queue(), we can move it after the some_qdisc_is_busy() checking.
> >>>
> >>> I am not suggesting to do an additional reset, I am suggesting to move
> >>> your reset after the busy waiting.
> >>
> >> There may be a deadlock here if we reset the qdisc after the
> >> some_qdisc_is_busy() checking, because some_qdisc_is_busy() may require the
> >> qdisc reset to clear the skb, so that
> >
> > some_qdisc_is_busy() checks the status of qdisc, not the skb queue.
>
> Is there any reason why we do not check the skb queue in the qdisc?
> It seems there may be skbs left when the netdev is deactivated; maybe we
> should at least warn about that?
> Is that why we call qdisc_reset() to clear the leftover skbs in
> qdisc_destroy()?
>
> >
> >
> >> some_qdisc_is_busy() can return false. I am not sure this is really a
> >> problem, but sch_direct_xmit() may requeue the skb when
> >> dev_hard_start_xmit() returns TX_BUSY.
> >
> > Sounds like another reason we should move the reset as late as possible?
>
> Why?

You said "sch_direct_xmit() may requeue the skb", I agree. I assume you mean
net_tx_action() calls sch_direct_xmit() which does the requeue then races with
reset. No?


>
> The current netdev down order is mainly as below:
>
> netif_tx_stop_all_queues()
>
> dev_deactivate_queue()
>
> synchronize_net()
>
> dev_reset_queue()
>
> some_qdisc_is_busy()
>
>
> You suggest changing it to the below order, right?
>
> netif_tx_stop_all_queues()
>
> dev_deactivate_queue()
>
> synchronize_net()
>
> some_qdisc_is_busy()
>
> dev_reset_queue()

Yes.
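
In other words, a sketch of the intended dev_deactivate_many() flow, matching
the order in your list above (the busy-wait loop is simplified and the
per-device/per-queue iteration is elided):

	netif_tx_stop_all_queues(dev);

	dev_deactivate_queue(...);  /* swap in noop_qdisc, mark DEACTIVATED */

	synchronize_net();          /* wait for RCU readers of the old qdisc */

	while (some_qdisc_is_busy(dev))
		msleep(1);          /* wait for in-flight TX actions */

	dev_reset_queue(...);       /* only now purge the old qdisc's queues */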

>
>
> What are the semantics of some_qdisc_is_busy()?

Waiting for any in-flight TX action.

> From my understanding, we can do anything about the old qdisc (including
> destroying the old qdisc) after some_qdisc_is_busy() returns false.

But the current code does the reset _before_ some_qdisc_is_busy(). ;)

Thanks.


Re: arping stuck with ENOBUFS in 4.19.150

2020-10-29 Thread Cong Wang
On Thu, Oct 29, 2020 at 7:11 AM Joakim Tjernlund
 wrote:
>
> OK, bisecting (was a bit of a bother since we merge upstream releases into
> our tree, is there a way to just bisect that?)
>
> Result was commit "net: sch_generic: aviod concurrent reset and enqueue op
> for lockless qdisc" (749cc0b0c7f3dcdfe5842f998c0274e54987384f)
>
> Reverting that commit on top of our tree made it work again. How to fix?

This is odd. The above commit touches the netdev reset path; did
your netdev get reset when you ran arping? You said your eth1 is UP,
is it always UP or flapping?

In the other thread, a bisect also points to the same commit on 5.4.
I guess there might be something missing in the backport.

Thanks.


Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-10-29 Thread Cong Wang
On Wed, Oct 28, 2020 at 7:54 PM Yunsheng Lin  wrote:
>
> On 2020/9/18 3:26, Cong Wang wrote:
> > On Fri, Sep 11, 2020 at 1:13 AM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/11 4:07, Cong Wang wrote:
> >>> On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> Currently there is concurrent reset and enqueue operation for the
> >>>> same lockless qdisc when there is no lock to synchronize the
> >>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>> out-of-bounds access for priv->ring[] in the hns3 driver if the user has
> >>>> requested a smaller queue num when __dev_xmit_skb() still enqueues an
> >>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>> reset, and calls hns3_nic_net_xmit() with that skb later.
> >>>>
> >>>> Reuse the existing synchronize_net() in dev_deactivate_many() to
> >>>> make sure an skb with a larger queue_mapping enqueued to the old qdisc
> >>>> (which is saved in dev_queue->qdisc_sleeping) will always be reset when
> >>>> dev_reset_queue() is called.
> >>>>
> >>>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> >>>> Signed-off-by: Yunsheng Lin 
> >>>> ---
> >>>> ChangeLog V2:
> >>>> Reuse existing synchronize_net().
> >>>> ---
> >>>>  net/sched/sch_generic.c | 48 +---
> >>>>  1 file changed, 33 insertions(+), 15 deletions(-)
> >>>>
> >>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> >>>> index 265a61d..54c4172 100644
> >>>> --- a/net/sched/sch_generic.c
> >>>> +++ b/net/sched/sch_generic.c
> >>>> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
> >>>>
> >>>>  static void qdisc_deactivate(struct Qdisc *qdisc)
> >>>>  {
> >>>> -   bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> >>>> -
> >>>> if (qdisc->flags & TCQ_F_BUILTIN)
> >>>> return;
> >>>> -   if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
> >>>> -   return;
> >>>> -
> >>>> -   if (nolock)
> >>>> -   spin_lock_bh(&qdisc->seqlock);
> >>>> -   spin_lock_bh(qdisc_lock(qdisc));
> >>>>
> >>>> set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
> >>>> -
> >>>> -   qdisc_reset(qdisc);
> >>>> -
> >>>> -   spin_unlock_bh(qdisc_lock(qdisc));
> >>>> -   if (nolock)
> >>>> -   spin_unlock_bh(&qdisc->seqlock);
> >>>>  }
> >>>>
> >>>>  static void dev_deactivate_queue(struct net_device *dev,
> >>>> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct net_device *dev,
> >>>> }
> >>>>  }
> >>>>
> >>>> +static void dev_reset_queue(struct net_device *dev,
> >>>> +   struct netdev_queue *dev_queue,
> >>>> +   void *_unused)
> >>>> +{
> >>>> +   struct Qdisc *qdisc;
> >>>> +   bool nolock;
> >>>> +
> >>>> +   qdisc = dev_queue->qdisc_sleeping;
> >>>> +   if (!qdisc)
> >>>> +   return;
> >>>> +
> >>>> +   nolock = qdisc->flags & TCQ_F_NOLOCK;
> >>>> +
> >>>> +   if (nolock)
> >>>> +   spin_lock_bh(&qdisc->seqlock);
> >>>> +   spin_lock_bh(qdisc_lock(qdisc));
> >>>
> >>>
> >>> I think you do not need this lock for lockless one.
> >>
> >> It seems so.
> >> Maybe another patch to remove qdisc_lock(qdisc) for lockless
> >> qdisc?
> >
> > Yeah, but I am not sure if we still want this lockless qdisc anymore,
> > it brings more trouble than gain.
> >
> >>
> >>
> >>>
> >>>> +
> >>>> +   qdisc_reset(qdisc);
> >>>>

Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc

2020-10-28 Thread Cong Wang
On Wed, Oct 28, 2020 at 8:37 AM Pai, Vishwanath  wrote:
> Hi,
>
> We noticed some problems when testing the latest 5.4 LTS kernel and traced it
> back to this commit using git bisect. When running our tests the machine stops
> responding to all traffic and the only way to recover is a reboot. I do not
> see a stack trace on the console.

Do you mean the machine is still running fine, just the network is down?

If so, can you dump your tc config with stats when the problem is happening?
(You can use `tc -s -d qd show ...`.)

>
> This can be reproduced using the packetdrill test below; it should be run a
> few times or in a loop. You should hit this issue within a few tries, but
> sometimes it might take up to 15-20 tries.
...
> I can reproduce the issue easily on v5.4.68, and after reverting this commit
> it does not happen anymore.

This is odd. The patch in this thread touches the netdev reset path; if
packetdrill is the only thing you use to trigger the bug (that is, the netdev
is always active), I cannot connect them.

Thanks.


Re: [tipc-discussion] [net v3 1/1] tipc: fix memory leak caused by tipc_buf_append()

2020-10-28 Thread Cong Wang
On Tue, Oct 27, 2020 at 10:23 PM Tung Quang Nguyen
 wrote:
>
> Hi Cong,
>
> No, I have never ignored any comment from reviewers. I sent v2 on Oct 26
> after discussing with Xin Long, and v3 on Oct 27 after receiving a comment
> from Jakub.
> I received your 3 emails nearly at the same time on Oct 28. It's weird. Your
> emails did not appear in this email archive either:
> https://sourceforge.net/p/tipc/mailman/tipc-discussion/
>
> Anyway, I answer your questions:

Oh, I just realized you meant shinfo->dataref, not skb->users...
Then it makes sense now.

Thanks.


Re: [PATCH v2 net] net/sched: act_mpls: Add softdep on mpls_gso.ko

2020-10-28 Thread Cong Wang
On Tue, Oct 27, 2020 at 2:39 PM Guillaume Nault  wrote:
>
> On Tue, Oct 27, 2020 at 10:28:29AM -0700, Cong Wang wrote:
> > On Mon, Oct 26, 2020 at 4:23 AM Guillaume Nault  wrote:
> > >
> > > TCA_MPLS_ACT_PUSH and TCA_MPLS_ACT_MAC_PUSH might be used on gso
> > > packets. Such packets will thus require mpls_gso.ko for segmentation.
> >
> > Any reason not to call request_module() at run time?
>
> So that mpls_gso would be loaded only when initialising the
> TCA_MPLS_ACT_PUSH or TCA_MPLS_ACT_MAC_PUSH modes?

Yes, exactly.

>
> That could be done, but the dependency on mpls_gso wouldn't be visible
> anymore with modinfo. I don't really mind, I just felt that such
> information could be important for the end user.

I think the dependency is determined at run time based on
TCA_MPLS_ACT_*, so it should be reflected at run time, rather than at
compile time.
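
Something like the following in tcf_mpls_init() could work; this is only a
hypothetical, untested sketch, and the exact placement is up to you:

	switch (parm->m_action) {
	case TCA_MPLS_ACT_PUSH:
	case TCA_MPLS_ACT_MAC_PUSH:
		/* load the GSO handler only when a push mode is configured */
		request_module("mpls_gso");
		break;
	}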

If loading mpls_gso even when not needed is not a big deal, I am fine
with your patch too.

Thanks.


Re: [tipc-discussion] [net v3 1/1] tipc: fix memory leak caused by tipc_buf_append()

2020-10-28 Thread Cong Wang
On Tue, Oct 27, 2020 at 1:09 PM Tung Nguyen
 wrote:
>
> Commit ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()")
> replaced skb_unshare() with skb_copy() to not reduce the data reference
> counter of the original skb intentionally. This is not the correct

More precisely, it is shinfo->dataref.


> way to handle the cloned skb because it causes a memory leak in the following
> 2 cases:
>  1/ Sending multicast messages via broadcast link
>   The original skb list is cloned to the local skb list for local
>   destination. After that, the data reference counter of each skb
>   in the original list has the value of 2. This causes each skb not
>   to be freed after receiving ACK:
>   tipc_link_advance_transmq()
>   {
>...
>/* release skb */
>__skb_unlink(skb, &l->transmq);
>kfree_skb(skb); <-- memory exists after being freed
>   }
>
>  2/ Sending multicast messages via replicast link
>   Similar to the above case, each skb cannot be freed after purging
>   the skb list:
>   tipc_mcast_xmit()
>   {
>...
>__skb_queue_purge(pkts); <-- memory exists after being freed
>   }
>
> This commit fixes this issue by using skb_unshare() instead. Besides,
> to avoid use-after-free error reported by KASAN, the pointer to the
> fragment is set to NULL before calling skb_unshare() to make sure that
> the original skb is not freed after freeing the fragment 2 times in
> case skb_unshare() returns NULL.
>
> Fixes: ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()")
> Acked-by: Jon Maloy 
> Reported-by: Thang Hoang Ngo 
> Signed-off-by: Tung Nguyen 

Acked-by: Cong Wang 


Re: [PATCH] net: cls_api: remove unneeded local variable in tc_dump_chain()

2020-10-28 Thread Cong Wang
On Wed, Oct 28, 2020 at 6:59 AM Tom Rix  wrote:
>
>
> On 10/28/20 4:35 AM, Lukas Bulwahn wrote:
> > @@ -2971,13 +2963,11 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
> >   if (!dev)
> >   return skb->len;
> >
> > - parent = tcm->tcm_parent;
> > - if (!parent) {
> > + if (!tcm->tcm_parent)
> >   q = dev->qdisc;
> > - parent = q->handle;
>
> This looks like an unused error handler.
>
> and the later call to
>
> if (TC_H_MIN(tcm->tcm_parent))
>
> maybe should be
>
> if (TC_H_MIN(parent))

When tcm->tcm_parent is 0, TC_H_MIN(tcm->tcm_parent) is also 0,
so we will not hit that if branch.
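
For reference, from include/uapi/linux/pkt_sched.h:

#define TC_H_MIN_MASK (0x0000FFFFU)
#define TC_H_MIN(h) ((h) & TC_H_MIN_MASK)

so TC_H_MIN(0) is trivially 0 and the branch is not taken.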

So, I think Lukas' patch is correct.

Thanks.


Re: [tipc-discussion] [net v3 1/1] tipc: fix memory leak caused by tipc_buf_append()

2020-10-27 Thread Cong Wang
On Tue, Oct 27, 2020 at 1:09 PM Tung Nguyen
 wrote:
>
> Commit ed42989eab57 ("tipc: fix the skb_unshare() in tipc_buf_append()")
> replaced skb_unshare() with skb_copy() to not reduce the data reference
> counter of the original skb intentionally. This is not the correct
> way to handle the cloned skb because it causes a memory leak in the following
> 2 cases:
>  1/ Sending multicast messages via broadcast link
>   The original skb list is cloned to the local skb list for local
>   destination. After that, the data reference counter of each skb
>   in the original list has the value of 2. This causes each skb not
>   to be freed after receiving ACK:

This does not make sense at all.

skb_unclone() expects refcnt == 1, as stated in the comments
above pskb_expand_head(). skb_unclone() was used prior to
Xin Long's commit.

So either the above is wrong, or something important is still missing
in your changelog. Neither is addressed in your V3.

I also asked you two questions before you sent V3, you seem to
intentionally ignore them. This is not how we collaborate.

Thanks.


Re: [tipc-discussion] [net v2 1/1] tipc: fix memory leak caused by tipc_buf_append()

2020-10-27 Thread Cong Wang
On Tue, Oct 27, 2020 at 11:21 AM Cong Wang  wrote:
>
> On Mon, Oct 26, 2020 at 3:46 AM Tung Nguyen
>  wrote:
> >
> > Commit ed42989eab57 ("fix the skb_unshare() in tipc_buf_append()")
> > replaced skb_unshare() with skb_copy() to not reduce the data reference
> > counter of the original skb intentionally. This is not the correct
> > way to handle the cloned skb because it causes a memory leak in the following
> > 2 cases:
> >  1/ Sending multicast messages via broadcast link
> >   The original skb list is cloned to the local skb list for local
> >   destination. After that, the data reference counter of each skb
> >   in the original list has the value of 2. This causes each skb not
> >   to be freed after receiving ACK:
>
> Interesting, I can not immediately see how tipc_link_advance_transmq()
> clones the skb. You point out how it is freed but not cloned.
>
> It looks really odd to see the skb is held by some caller, then expected
> to be released by the unshare in tipc_buf_append(). IMHO, the refcnt
> should be released where it is held.

More importantly, prior to Xin Long's change of skb_unshare(),
skb_unclone() was used, which does not touch the skb refcnt either.
So, why does it rely on skb_unshare() to release this refcnt now?

Thanks.


Re: [tipc-discussion] [net v2 1/1] tipc: fix memory leak caused by tipc_buf_append()

2020-10-27 Thread Cong Wang
On Mon, Oct 26, 2020 at 3:46 AM Tung Nguyen
 wrote:
>
> Commit ed42989eab57 ("fix the skb_unshare() in tipc_buf_append()")
> replaced skb_unshare() with skb_copy() to not reduce the data reference
> counter of the original skb intentionally. This is not the correct
> way to handle the cloned skb because it causes a memory leak in the following
> 2 cases:
>  1/ Sending multicast messages via broadcast link
>   The original skb list is cloned to the local skb list for local
>   destination. After that, the data reference counter of each skb
>   in the original list has the value of 2. This causes each skb not
>   to be freed after receiving ACK:

Interesting, I can not immediately see how tipc_link_advance_transmq()
clones the skb. You point out how it is freed but not cloned.

It looks really odd to see the skb is held by some caller, then expected
to be released by the unshare in tipc_buf_append(). IMHO, the refcnt
should be released where it is held.

Can you be more specific here?

Thanks.


Re: [PATCH v2 net] net/sched: act_mpls: Add softdep on mpls_gso.ko

2020-10-27 Thread Cong Wang
On Mon, Oct 26, 2020 at 4:23 AM Guillaume Nault  wrote:
>
> TCA_MPLS_ACT_PUSH and TCA_MPLS_ACT_MAC_PUSH might be used on gso
> packets. Such packets will thus require mpls_gso.ko for segmentation.

Any reason not to call request_module() at run time?


Re: [PATCH net] net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action()

2020-10-20 Thread Cong Wang
On Tue, Oct 20, 2020 at 8:34 AM Guillaume Nault  wrote:
>
> We need to jump to the "err_out_locked" label when
> tcf_gate_get_entries() fails. Otherwise, tc_setup_flow_action() exits
> with ->tcfa_lock still held.
>
> Fixes: d29bdd69ecdd ("net: schedule: add action gate offloading")
> Signed-off-by: Guillaume Nault 

Looks like the err_out label can just be removed after this patch?
If any compiler complains, you have to fix it in v2; otherwise it can be in a
separate patch.

Other than this,

Acked-by: Cong Wang 

Thanks!


Re: [PATCH net] net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels

2020-10-20 Thread Cong Wang
On Tue, Oct 20, 2020 at 3:03 PM Davide Caratti  wrote:
>
> the following command
>
>  # tc action add action tunnel_key \
>  > set src_ip 2001:db8::1 dst_ip 2001:db8::2 id 10 erspan_opts 1:6789:0:0
>
> generates the following splat:
...
> using IPv6 tunnels, act_tunnel_key allocates a fixed amount of memory for
> the tunnel metadata, but then it expects additional bytes to store tunnel
> specific metadata with tunnel_key_copy_opts().
>
> Fix the arguments of __ipv6_tun_set_dst(), so that 'md_size' contains the
> size previously computed by tunnel_key_get_opts_len(), like it's done for
> IPv4 tunnels.

Acked-by: Cong Wang 

Thanks.


Re: [PATCH net] net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info

2020-10-14 Thread Cong Wang
On Wed, Oct 14, 2020 at 1:56 AM Leon Romanovsky  wrote:
>
> From: Leon Romanovsky 
>
> The access of tcf_tunnel_info() produces the following splat, so fix it
> by dereferencing the tcf_tunnel_key_params pointer with a marker that the
> internal tcfa_lock is held.

Looks reasonable to me,

Acked-by: Cong Wang 

Thanks.


[Patch net v3] ip_gre: set dev->hard_header_len and dev->needed_headroom properly

2020-10-12 Thread Cong Wang
GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
conditionally. When it is set, it assumes the outer IP header is
already created before ipgre_xmit().

This is not true when we send packets through a raw packet socket,
where L2 headers are supposed to be constructed by user. Packet
socket calls dev_validate_header() to validate the header. But
GRE tunnel does not set dev->hard_header_len, so that check can
be simply bypassed, therefore uninit memory could be passed down
to ipgre_xmit(). Similar for dev->needed_headroom.

dev->hard_header_len is supposed to be the length of the header
created by dev->header_ops->create(), so it should be used whenever
header_ops is set, and dev->needed_headroom should be used when it
is not set.

Reported-and-tested-by: syzbot+4a2c52677a8a1aa28...@syzkaller.appspotmail.com
Cc: Xie He 
Cc: William Tu 
Acked-by: Willem de Bruijn 
Signed-off-by: Cong Wang 
---
Note, there are some other suspicious uses of dev->hard_header_len in
the same file, but let's leave them to a separate patch if really
needed.

v2: pass 0 to skb_cow_head()
v1: fix dev->needed_headroom and update ipgre_link_update()

 net/ipv4/ip_gre.c | 15 +++
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 4e31f23e4117..e70291748889 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -625,9 +625,7 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb,
}
 
if (dev->header_ops) {
-   /* Need space for new headers */
-   if (skb_cow_head(skb, dev->needed_headroom -
- (tunnel->hlen + sizeof(struct iphdr))))
+   if (skb_cow_head(skb, 0))
goto free_skb;
 
tnl_params = (const struct iphdr *)skb->data;
@@ -748,7 +746,11 @@ static void ipgre_link_update(struct net_device *dev, bool set_mtu)
len = tunnel->tun_hlen - len;
tunnel->hlen = tunnel->hlen + len;
 
-   dev->needed_headroom = dev->needed_headroom + len;
+   if (dev->header_ops)
+   dev->hard_header_len += len;
+   else
+   dev->needed_headroom += len;
+
if (set_mtu)
dev->mtu = max_t(int, dev->mtu - len, 68);
 
@@ -944,6 +946,7 @@ static void __gre_tunnel_init(struct net_device *dev)
tunnel->parms.iph.protocol = IPPROTO_GRE;
 
tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen;
+   dev->needed_headroom = tunnel->hlen + sizeof(tunnel->parms.iph);
 
dev->features   |= GRE_FEATURES;
dev->hw_features|= GRE_FEATURES;
@@ -987,10 +990,14 @@ static int ipgre_tunnel_init(struct net_device *dev)
return -EINVAL;
dev->flags = IFF_BROADCAST;
dev->header_ops = &ipgre_header_ops;
+   dev->hard_header_len = tunnel->hlen + sizeof(*iph);
+   dev->needed_headroom = 0;
}
 #endif
} else if (!tunnel->collect_md) {
dev->header_ops = &ipgre_header_ops;
+   dev->hard_header_len = tunnel->hlen + sizeof(*iph);
+   dev->needed_headroom = 0;
}
 
return ip_tunnel_init(dev);
-- 
2.28.0



Re: [Patch net v2] ip_gre: set dev->hard_header_len and dev->needed_headroom properly

2020-10-12 Thread Cong Wang
On Sun, Oct 11, 2020 at 3:46 PM Xie He  wrote:
>
> On Sun, Oct 11, 2020 at 12:11 PM Cong Wang  wrote:
> >
> > @@ -626,8 +626,7 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb,
> >
> > if (dev->header_ops) {
> > /* Need space for new headers */
> > -   if (skb_cow_head(skb, dev->needed_headroom -
> > - (tunnel->hlen + sizeof(struct iphdr))))
> > +   if (skb_cow_head(skb, dev->hard_header_len))
> > goto free_skb;
> >
> > tnl_params = (const struct iphdr *)skb->data;
>
> As I understand, the skb_cow functions are for ensuring enough header
> space before skb->data. (Right?) However, at this stage our skb->data
> is already at the outer IP header, I think we don't need to request
> additional header space before the outer IP header.

Good point, I thought skb_headroom() == dev->hard_header_len,
but skb->data already points to the tunnel header like you said, so
we should pass 0 to skb_cow_head() here.

Thanks.


Re: [External] Re: [PATCH] mm: proc: add Sock to /proc/meminfo

2020-10-12 Thread Cong Wang
On Mon, Oct 12, 2020 at 2:53 AM Muchun Song  wrote:
> We are not complaining about TCP using too much memory, but how do
> we know that TCP uses a lot of memory? When I first faced this problem,
> I did not know what was using the 25GB of memory, and it was not shown
> in /proc/meminfo.
> If we can know the amount of memory used by socket buffers via /proc/meminfo,
> we may not need to spend a lot of time troubleshooting this problem. Not
> everyone knows that a lot of memory may be used here. But I believe many
> people know to check /proc/meminfo to confirm memory users.

Well, I'd bet networking people know `ss -m` better than /proc/meminfo,
generally speaking.

The practice here is that if you want some networking-specific counters,
add them where networking people know to look, that is, `ss -m` or /proc/net/...

Or maybe the problem you described is not specific to networking at all,
there must be some other places where pages are allocated but not charged.
If so, adding a general mm counter in /proc/meminfo makes sense, but
it won't be specific to networking.

Thanks.


Re: [External] Re: [PATCH] mm: proc: add Sock to /proc/meminfo

2020-10-12 Thread Cong Wang
On Sun, Oct 11, 2020 at 9:22 PM Muchun Song  wrote:
>
> On Mon, Oct 12, 2020 at 2:39 AM Cong Wang  wrote:
> >
> > On Sat, Oct 10, 2020 at 3:39 AM Muchun Song  
> > wrote:
> > >
> > > The amount of memory allocated to sockets buffer can become significant.
> > > However, we do not display the amount of memory consumed by sockets
> > > buffer. In this case, knowing where the memory is consumed by the kernel
> >
> > We do it via `ss -m`. Is it not sufficient? And if not, why not add it
> > there rather than /proc/meminfo?
>
> If the system has little free memory, we can know where the memory is via
> /proc/meminfo. If a lot of memory is consumed by socket buffer, we cannot
> know it when the Sock is not shown in the /proc/meminfo. If the unaware user
> can't think of the socket buffer, naturally they will not `ss -m`. The
> end result

Interesting, we already have a few counters related to socket buffers,
are you saying these are not accounted in /proc/meminfo either?
If yes, why are page frags so special here? If not, they are more
important than page frags, so you probably want to deal with them
first.


> is that we still don’t know where the memory is consumed. And we add the
> Sock to /proc/meminfo just like memcg does (the 'sock' item in the cgroup
> v2 memory.stat). So I think that adding it to /proc/meminfo is sufficient.

It looks like the socket page frag is actually already accounted;
for example, in tcp_sendmsg_locked():

copy = min_t(int, copy, pfrag->size - pfrag->offset);

if (!sk_wmem_schedule(sk, copy))
goto wait_for_memory;


>
> >
> > >  static inline void __skb_frag_unref(skb_frag_t *frag)
> > >  {
> > > -   put_page(skb_frag_page(frag));
> > > +   struct page *page = skb_frag_page(frag);
> > > +
> > > +   if (put_page_testzero(page)) {
> > > +   dec_sock_node_page_state(page);
> > > +   __put_page(page);
> > > +   }
> > >  }
> >
> > You mix socket page frag with skb frag at least, not sure this is exactly
> > what you want, because clearly skb page frags are frequently used
> > by network drivers rather than sockets.
> >
> > Also, which one matches this dec_sock_node_page_state()? Clearly
> > not skb_fill_page_desc() or __skb_frag_ref().
>
> Yeah, we call inc_sock_node_page_state() in the skb_page_frag_refill().

How is skb_page_frag_refill() possibly paired with __skb_frag_unref()?

> So if someone gets the page returned by skb_page_frag_refill(), it must
> put the page via __skb_frag_unref()/skb_frag_unref(). We use PG_private
> to indicate that we need to dec the node page state when the refcount of
> page reaches zero.

skb_page_frag_refill() is called on frags not within an skb, for instance,
sk_page_frag_refill() uses it for a per-socket or per-process page frag.
But, __skb_frag_unref() is specifically used for skb frags, which are
supposed to be filled by skb_fill_page_desc() (page is allocated by driver).

They are different things that you are mixing up, which looks clearly
wrong or at least misleading.
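
To illustrate the two life cycles being conflated, roughly (a sketch of the
usual patterns, not code from the patch):

/* skb frag path (e.g. driver RX): the page is attached to an skb */
skb_fill_page_desc(skb, i, page, offset, size);
__skb_frag_ref(&skb_shinfo(skb)->frags[i]);	/* get_page() */
...
__skb_frag_unref(&skb_shinfo(skb)->frags[i]);	/* put_page() */

/* socket page frag path: a per-socket allocator, not attached to an skb */
if (sk_page_frag_refill(sk, pfrag)) {
	/* copy user data into pfrag->page at pfrag->offset */
}

Only the first path goes through __skb_frag_unref(), so accounting pages in
skb_page_frag_refill() and un-accounting them in __skb_frag_unref() counts
two different kinds of pages.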

Thanks.


Re: BUG: unable to handle kernel paging request in tcf_action_dump_terse

2020-10-11 Thread Cong Wang
#syz fix: net_sched: check error pointer in tcf_dump_walker()


[Patch net v2] ip_gre: set dev->hard_header_len and dev->needed_headroom properly

2020-10-11 Thread Cong Wang
GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
conditionally. When it is set, it assumes the outer IP header is
already created before ipgre_xmit().

This is not true when we send packets through a raw packet socket,
where L2 headers are supposed to be constructed by user. Packet
socket calls dev_validate_header() to validate the header. But
GRE tunnel does not set dev->hard_header_len, so that check can
be simply bypassed, therefore uninit memory could be passed down
to ipgre_xmit(). Similar for dev->needed_headroom.

dev->hard_header_len is supposed to be the length of the header
created by dev->header_ops->create(), so it should be used whenever
header_ops is set, and dev->needed_headroom should be used when it
is not set.

Reported-and-tested-by: syzbot+4a2c52677a8a1aa28...@syzkaller.appspotmail.com
Cc: Xie He 
Cc: William Tu 
Cc: Willem de Bruijn 
Signed-off-by: Cong Wang 
---
Note, there are some other suspicious uses of dev->hard_header_len in
the same file, but let's leave them to a separate patch if really
needed.

 net/ipv4/ip_gre.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 4e31f23e4117..82fee0010353 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -626,8 +626,7 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb,
 
if (dev->header_ops) {
/* Need space for new headers */
-   if (skb_cow_head(skb, dev->needed_headroom -
- (tunnel->hlen + sizeof(struct iphdr))))
+   if (skb_cow_head(skb, dev->hard_header_len))
goto free_skb;
 
tnl_params = (const struct iphdr *)skb->data;
@@ -748,7 +747,11 @@ static void ipgre_link_update(struct net_device *dev, bool set_mtu)
len = tunnel->tun_hlen - len;
tunnel->hlen = tunnel->hlen + len;
 
-   dev->needed_headroom = dev->needed_headroom + len;
+   if (dev->header_ops)
+   dev->hard_header_len += len;
+   else
+   dev->needed_headroom += len;
+
if (set_mtu)
dev->mtu = max_t(int, dev->mtu - len, 68);
 
@@ -944,6 +947,7 @@ static void __gre_tunnel_init(struct net_device *dev)
tunnel->parms.iph.protocol = IPPROTO_GRE;
 
tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen;
+   dev->needed_headroom = tunnel->hlen + sizeof(tunnel->parms.iph);
 
dev->features   |= GRE_FEATURES;
dev->hw_features|= GRE_FEATURES;
@@ -987,10 +991,14 @@ static int ipgre_tunnel_init(struct net_device *dev)
return -EINVAL;
dev->flags = IFF_BROADCAST;
dev->header_ops = &ipgre_header_ops;
+   dev->hard_header_len = tunnel->hlen + sizeof(*iph);
+   dev->needed_headroom = 0;
}
 #endif
} else if (!tunnel->collect_md) {
dev->header_ops = &ipgre_header_ops;
+   dev->hard_header_len = tunnel->hlen + sizeof(*iph);
+   dev->needed_headroom = 0;
}
 
return ip_tunnel_init(dev);
-- 
2.28.0



Re: [PATCH] mm: proc: add Sock to /proc/meminfo

2020-10-11 Thread Cong Wang
On Sat, Oct 10, 2020 at 3:39 AM Muchun Song  wrote:
>
> The amount of memory allocated to sockets buffer can become significant.
> However, we do not display the amount of memory consumed by sockets
> buffer. In this case, knowing where the memory is consumed by the kernel

We do it via `ss -m`. Is it not sufficient? And if not, why not add it there
rather than /proc/meminfo?

>  static inline void __skb_frag_unref(skb_frag_t *frag)
>  {
> -   put_page(skb_frag_page(frag));
> +   struct page *page = skb_frag_page(frag);
> +
> +   if (put_page_testzero(page)) {
> +   dec_sock_node_page_state(page);
> +   __put_page(page);
> +   }
>  }

You mix socket page frag with skb frag at least, not sure this is exactly
what you want, because clearly skb page frags are frequently used
by network drivers rather than sockets.

Also, which one matches this dec_sock_node_page_state()? Clearly
not skb_fill_page_desc() or __skb_frag_ref().

Thanks.


Re: [Patch net] ip_gre: set dev->hard_header_len properly

2020-10-10 Thread Cong Wang
On Fri, Oct 9, 2020 at 8:10 PM Xie He  wrote:
>
> On Fri, Oct 9, 2020 at 6:07 PM Cong Wang  wrote:
> >
> > Looking a bit deeper, I doubt ipgre_header_ops->create is necessary,
> > because 1) all other tunnel devices do not have it (ip_tunnel_header_ops
> > only contains ->parse_protocol); 2) GRE headers are pushed in xmit
> > anyway, so at least SOCK_DGRAM does not need it; 3) ipgre_header()
> > creates the GRE header, later ipgre_xmit() pulls it back, then __gre_xmit()
> > builds the GRE header again...
>
> Haha, I also don't like ipgre_header_ops->create and want to get rid
> of it. We are thinking the same :)

Great!

>
> From what I see, the flow when sending skbs (when header_ops is used)
> is like this:
>
> 1) First ipgre_header creates the IP header and the GRE base header,
> leaving space for the GRE optional fields and the "encap" header (the
> space for the "encap" header should be left before the GRE header, but
> it is incorrectly left after the GRE header).
>
> 2) Then ipgre_xmit pulls the header created by ipgre_header (but
> leaves the data there). Then it calls __gre_xmit.
>
> 3) __gre_xmit calls gre_build_header. gre_build_header will push back
> the GRE header, read the GRE base header and build the GRE optional
> fields.
>
> 4) Then __gre_xmit calls ip_tunnel_xmit. ip_tunnel_xmit will build the
> "encap" header by calling ip_tunnel_encap.
>
> So what ipgre_header does is actually creating the IP header and the
> GRE header, and leaving some holes for the GRE optional fields and the
> "encap" header to be built later.

Right. For the encap header, I'd guess it is because this is relatively new.

>
> This seems so weird to me. If a user is using an AF_PACKET/RAW socket,
> the user is supposed to do what the header_ops->create function does
> (that is, creating two headers and leaving two holes to be filled in
> later). I think no user would actually do that. That is so weird.

Well, AF_PACKET RAW socket is supposed to construct L2 headers
from the user buffer, and for a tunnel device these headers are indeed its
L2. If users do not want to do this, they can switch to DGRAM anyway.

I know how inconvenient it is to construct a GRE tunnel header; I guess
this is why all other tunnel devices do not provide a header_ops::create.
The GRE tunnel is not a special case, which is why I agree on removing its
->create(), although it does look like all tunnel devices should let users
construct headers by definition of RAW.

Of course, it may be too late to change this behavior even if we really
want, users may already assume there is always no tunnel header needed
to construct.

Thanks.


Re: [Patch net] ip_gre: set dev->hard_header_len properly

2020-10-09 Thread Cong Wang
On Fri, Oct 9, 2020 at 1:38 PM Cong Wang  wrote:
>
> Interesting point. I think needed_headroom is 0 until we call
> ipgre_changelink(), but needed_headroom is already being used
> in multiple places for skb_cow_head() in the same file. I guess
> they should be replaced with hard_header_len because for a GRE tunnel
> those are its link-layer header. What makes it more complicated
> is that header_ops is actually set conditionally, so
> hard_header_len/needed_headroom should be set accordingly.

Looking a bit deeper, I doubt ipgre_header_ops->create is necessary,
because 1) all other tunnel devices do not have it (ip_tunnel_header_ops
only contains ->parse_protocol); 2) GRE headers are pushed in xmit
anyway, so at least SOCK_DGRAM does not need it; 3) ipgre_header()
creates the GRE header, later ipgre_xmit() pulls it back, then __gre_xmit()
builds the GRE header again...

Please correct me if I am wrong.

Thanks.


Re: [Patch net] ip_gre: set dev->hard_header_len properly

2020-10-09 Thread Cong Wang
On Fri, Oct 9, 2020 at 12:51 PM Xie He  wrote:
>
> On Fri, Oct 9, 2020 at 12:41 PM Xie He  wrote:
> >
> > Thanks. But there is still something that I don't understand. What is
> > needed_headroom used for? If we are requesting space for t->encap_hlen
> > and t->tun_hlen in hard_header_len, do we still need to use
> > needed_headroom?
>
> It seems to me that the original code includes t->encap_hlen,
> t->tun_hlen, and the IP header length in needed_headroom. (Right?) If
> we are including these in hard_header_len, we need to move them out of
> needed_headroom.

Interesting point. I think needed_headroom is 0 until we call
ipgre_changelink(), but needed_headroom is already being used
in multiple places for skb_cow_head() in the same file. I guess
they should be replaced with hard_header_len because for a GRE tunnel
those are its link-layer header. What makes it more complicated
is that header_ops is actually set conditionally, so
hard_header_len/needed_headroom should be set accordingly.

Thanks.


Re: [Patch net] ip_gre: set dev->hard_header_len properly

2020-10-09 Thread Cong Wang
On Thu, Oct 8, 2020 at 4:40 PM Xie He  wrote:
>
> I found another possible issue. Shouldn't we update hard_header_len
> every time t->tun_hlen and t->hlen are updated in ipgre_link_update?

Good catch. It should be updated there like ->needed_headroom.
I will update my patch.

Thanks.


Re: [Patch net] can: initialize skbcnt in j1939_tp_tx_dat_new()

2020-10-08 Thread Cong Wang
On Thu, Oct 8, 2020 at 10:34 AM Jakub Kicinski  wrote:
>
> On Wed,  7 Oct 2020 23:18:21 -0700 Cong Wang wrote:
> > This fixes an uninit-value warning:
> > BUG: KMSAN: uninit-value in can_receive+0x26b/0x630 net/can/af_can.c:650
> >
> > Reported-and-tested-by: 
> > syzbot+3f3837e61a48d32b4...@syzkaller.appspotmail.com
> > Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
> > Cc: Robin van der Gracht 
> > Cc: Oleksij Rempel 
> > Cc: Pengutronix Kernel Team 
> > Cc: Oliver Hartkopp 
> > Cc: Marc Kleine-Budde 
> > Signed-off-by: Cong Wang 
> > ---
> >  net/can/j1939/transport.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/net/can/j1939/transport.c b/net/can/j1939/transport.c
> > index 0cec4152f979..88cf1062e1e9 100644
> > --- a/net/can/j1939/transport.c
> > +++ b/net/can/j1939/transport.c
> > @@ -580,6 +580,7 @@ sk_buff *j1939_tp_tx_dat_new(struct j1939_priv *priv,
> >   skb->dev = priv->ndev;
> >   can_skb_reserve(skb);
> >   can_skb_prv(skb)->ifindex = priv->ndev->ifindex;
> > + can_skb_prv(skb)->skbcnt = 0;
> >   /* reserve CAN header */
> >   skb_reserve(skb, offsetof(struct can_frame, data));
>
> Thanks! Looks like there is another can_skb_reserve(skb) on line 1489,
> is that one fine?

I don't know, I only attempted to address the syzbot report. To me,
it at least does no harm to fix that one too. I am fine either way.

Thanks.


Re: [Patch net] ip_gre: set dev->hard_header_len properly

2020-10-08 Thread Cong Wang
On Thu, Oct 8, 2020 at 12:18 PM Willem de Bruijn
 wrote:
>
> On Thu, Oct 8, 2020 at 3:04 PM Willem de Bruijn
>  wrote:
> >
> > On Thu, Oct 8, 2020 at 1:34 PM Cong Wang  wrote:
> > >
> > > On Thu, Oct 8, 2020 at 4:49 AM Willem de Bruijn
> > >  wrote:
> > > >
> > > > On Wed, Oct 7, 2020 at 9:22 PM Cong Wang  
> > > > wrote:
> > > > >
> > > > > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
> > > > > conditionally. When it is set, it assumes the outer IP header is
> > > > > already created before ipgre_xmit().
> > > > >
> > > > > This is not true when we send packets through a raw packet socket,
> > > > > where L2 headers are supposed to be constructed by user. Packet
> > > > > socket calls dev_validate_header() to validate the header. But
> > > > > GRE tunnel does not set dev->hard_header_len, so that check can
> > > > > be simply bypassed, therefore uninit memory could be passed down
> > > > > to ipgre_xmit().
> > > >
> > > > If dev->hard_header_len is zero, the packet socket will not reserve
> > > > room for the link layer header, so skb->data points to network_header.
> > > > But I don't see any uninitialized packet data?
> > >
> > > The uninit data is allocated by packet_alloc_skb(); if dev->hard_header_len
> > > is 0 and 'len' is anything between [0, tunnel->hlen + sizeof(struct iphdr)),
> > > dev_validate_header() still returns true obviously, but only 'len' bytes
> > > are copied from user space by skb_copy_datagram_from_iter(). Therefore,
> > > those bytes within range (len, tunnel->hlen + sizeof(struct iphdr)] are
> > > uninitialized.
>
> Oh, do you mean that ipgre_xmit will process undefined data because it
> calls skb_pull(skb, tunnel->hlen + sizeof(struct iphdr))?

The syzbot report has the information for both of your questions:
https://syzkaller.appspot.com/text?tag=CrashReport&x=1184556850

It clearly shows packet_snd() and ipgre_xmit().

Thanks.


Re: [Patch net] tipc: fix the skb_unshare() in tipc_buf_append()

2020-10-08 Thread Cong Wang
On Thu, Oct 8, 2020 at 1:45 AM Xin Long  wrote:
>
> On Thu, Oct 8, 2020 at 12:12 PM Cong Wang  wrote:
> >
> > skb_unshare() drops a reference count on the old skb unconditionally,
> > so in the failure case, we end up freeing the skb twice here.
> > And because the skb is allocated in fclone and cloned by caller
> > tipc_msg_reassemble(), the consequence is actually freeing the
> > original skb too, thus triggered the UAF by syzbot.
> Do you mean:
> frag = skb_clone(skb, GFP_ATOMIC);
> frag = skb_unshare(frag) will free the 'skb' too?

Yes, more precisely, I mean:

new = skb_clone(old)
kfree_skb(new)
kfree_skb(new)

would free 'old' eventually when 'old' is a fast clone. The skb_clone()
sets ->fclone_ref to 2 and returns the clone, whose skb->fclone is
SKB_FCLONE_CLONE. So, the first call of kfree_skbmem() will
just decrease ->fclone_ref by 1, but the second call will trigger
kmem_cache_free() which frees _both_  skb's.
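
To spell it out (illustration only; the names are just for the example):

struct sk_buff *old = alloc_skb_fclone(size, GFP_ATOMIC);
struct sk_buff *new = skb_clone(old, GFP_ATOMIC);
/* now fclone_ref == 2 and new->fclone == SKB_FCLONE_CLONE */

kfree_skb(new);	/* fclone_ref 2 -> 1, nothing is actually freed yet */
kfree_skb(new);	/* fclone_ref 1 -> 0, kmem_cache_free() releases the whole
		 * sk_buff_fclones pair, so 'old' is gone too even though
		 * its owner never dropped its own reference */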

Thanks.


Re: [Patch net] ip_gre: set dev->hard_header_len properly

2020-10-08 Thread Cong Wang
On Thu, Oct 8, 2020 at 4:49 AM Willem de Bruijn
 wrote:
>
> On Wed, Oct 7, 2020 at 9:22 PM Cong Wang  wrote:
> >
> > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
> > conditionally. When it is set, it assumes the outer IP header is
> > already created before ipgre_xmit().
> >
> > This is not true when we send packets through a raw packet socket,
> > where L2 headers are supposed to be constructed by user. Packet
> > socket calls dev_validate_header() to validate the header. But
> > GRE tunnel does not set dev->hard_header_len, so that check can
> > be simply bypassed, therefore uninit memory could be passed down
> > to ipgre_xmit().
>
> If dev->hard_header_len is zero, the packet socket will not reserve
> room for the link layer header, so skb->data points to network_header.
> But I don't see any uninitialized packet data?

The uninit data is allocated by packet_alloc_skb(); if dev->hard_header_len
is 0 and 'len' is anything between [0, tunnel->hlen + sizeof(struct iphdr)),
dev_validate_header() still returns true obviously, but only 'len' bytes
are copied from user space by skb_copy_datagram_from_iter(). Therefore,
those bytes within range (len, tunnel->hlen + sizeof(struct iphdr)] are
uninitialized.
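
For context, dev_validate_header() is roughly the following (sketched from
include/linux/netdevice.h of that era, so details may differ):

static inline bool dev_validate_header(const struct net_device *dev,
				       char *ll_header, int len)
{
	if (likely(len >= dev->hard_header_len))
		return true;	/* always true when hard_header_len is 0 */
	if (len < dev->min_header_len)
		return false;
	if (capable(CAP_SYS_RAWIO)) {
		memset(ll_header + len, 0, dev->hard_header_len - len);
		return true;
	}
	if (dev->header_ops && dev->header_ops->validate)
		return dev->header_ops->validate(ll_header, len);
	return false;
}

With hard_header_len == 0, the first check passes for any len.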

Thanks.


[Patch net] can: initialize skbcnt in j1939_tp_tx_dat_new()

2020-10-07 Thread Cong Wang
This fixes an uninit-value warning:
BUG: KMSAN: uninit-value in can_receive+0x26b/0x630 net/can/af_can.c:650

Reported-and-tested-by: syzbot+3f3837e61a48d32b4...@syzkaller.appspotmail.com
Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol")
Cc: Robin van der Gracht 
Cc: Oleksij Rempel 
Cc: Pengutronix Kernel Team 
Cc: Oliver Hartkopp 
Cc: Marc Kleine-Budde 
Signed-off-by: Cong Wang 
---
 net/can/j1939/transport.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/can/j1939/transport.c b/net/can/j1939/transport.c
index 0cec4152f979..88cf1062e1e9 100644
--- a/net/can/j1939/transport.c
+++ b/net/can/j1939/transport.c
@@ -580,6 +580,7 @@ sk_buff *j1939_tp_tx_dat_new(struct j1939_priv *priv,
skb->dev = priv->ndev;
can_skb_reserve(skb);
can_skb_prv(skb)->ifindex = priv->ndev->ifindex;
+   can_skb_prv(skb)->skbcnt = 0;
/* reserve CAN header */
skb_reserve(skb, offsetof(struct can_frame, data));
 
-- 
2.28.0



[Patch net] tipc: fix the skb_unshare() in tipc_buf_append()

2020-10-07 Thread Cong Wang
skb_unshare() drops a reference count on the old skb unconditionally,
so in the failure case, we end up freeing the skb twice here.
And because the skb is allocated in fclone and cloned by caller
tipc_msg_reassemble(), the consequence is actually freeing the
original skb too, thus triggered the UAF by syzbot.

Fix this by replacing this skb_unshare() with skb_cloned()+skb_copy().

Fixes: ff48b6222e65 ("tipc: use skb_unshare() instead in tipc_buf_append()")
Reported-and-tested-by: syzbot+e96a7ba46281824cc...@syzkaller.appspotmail.com
Cc: Xin Long 
Cc: Jon Maloy 
Cc: Ying Xue 
Signed-off-by: Cong Wang 
---
 net/tipc/msg.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/tipc/msg.c b/net/tipc/msg.c
index 52e93ba4d8e2..681224401871 100644
--- a/net/tipc/msg.c
+++ b/net/tipc/msg.c
@@ -150,7 +150,8 @@ int tipc_buf_append(struct sk_buff **headbuf, struct sk_buff **buf)
if (fragid == FIRST_FRAGMENT) {
if (unlikely(head))
goto err;
-   frag = skb_unshare(frag, GFP_ATOMIC);
+   if (skb_cloned(frag))
+   frag = skb_copy(frag, GFP_ATOMIC);
if (unlikely(!frag))
goto err;
head = *headbuf = frag;
-- 
2.28.0



Re: [PATCH net-next] net/sched: get rid of qdisc->padded

2020-10-07 Thread Cong Wang
On Wed, Oct 7, 2020 at 9:51 AM Eric Dumazet  wrote:
>
> From: Eric Dumazet 
>
> kmalloc() of a sufficiently big portion of memory is cache-aligned
> in regular conditions. If some debugging options are used,
> there is no reason qdisc structures would need 64-byte alignment
> if most other kernel structures are not aligned.
>
> This gets rid of QDISC_ALIGN and QDISC_ALIGNTO.
>
> The addition of the privdata field will help implement
> the reverse of qdisc_priv() and documents where
> the private data is.
>
> Signed-off-by: Eric Dumazet 
> Cc: Cong Wang 
> Cc: Allen Pais 

Acked-by: Cong Wang 

Thanks.


[Patch net] ip_gre: set dev->hard_header_len properly

2020-10-07 Thread Cong Wang
GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
conditionally. When it is set, it assumes the outer IP header is
already created before ipgre_xmit().

This is not true when we send packets through a raw packet socket,
where L2 headers are supposed to be constructed by user. Packet
socket calls dev_validate_header() to validate the header. But
GRE tunnel does not set dev->hard_header_len, so that check can
be simply bypassed, therefore uninit memory could be passed down
to ipgre_xmit().

Fix this by setting dev->hard_header_len whenever sets header_ops,
as dev->hard_header_len is supposed to be the length of the header
created by dev->header_ops->create() anyway.

Reported-and-tested-by: syzbot+4a2c52677a8a1aa28...@syzkaller.appspotmail.com
Cc: William Tu 
Cc: Willem de Bruijn 
Signed-off-by: Cong Wang 
---
 net/ipv4/ip_gre.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 4e31f23e4117..43b62095559e 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -987,10 +987,12 @@ static int ipgre_tunnel_init(struct net_device *dev)
return -EINVAL;
dev->flags = IFF_BROADCAST;
dev->header_ops = &ipgre_header_ops;
+   dev->hard_header_len = tunnel->hlen + sizeof(*iph);
}
 #endif
} else if (!tunnel->collect_md) {
dev->header_ops = &ipgre_header_ops;
+   dev->hard_header_len = tunnel->hlen + sizeof(*iph);
}
 
return ip_tunnel_init(dev);
-- 
2.28.0



[Patch net] net_sched: check error pointer in tcf_dump_walker()

2020-10-02 Thread Cong Wang
Although we take RTNL on dump path, it is possible to
skip RTNL on insertion path. So the following race condition
is possible:

rtnl_lock()			// no rtnl lock
				mutex_lock(&idrinfo->lock);
				// insert ERR_PTR(-EBUSY)
				mutex_unlock(&idrinfo->lock);
tc_dump_action()
rtnl_unlock()

So we have to skip those temporary -EBUSY entries on dump path
too.

Reported-and-tested-by: syzbot+b47bc4f247856fb4d...@syzkaller.appspotmail.com
Fixes: 0fedc63fadf0 ("net_sched: commit action insertions together")
Cc: Vlad Buslov 
Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/act_api.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 5612b336e18e..798430e1a79f 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -235,6 +235,8 @@ static int tcf_dump_walker(struct tcf_idrinfo *idrinfo, struct sk_buff *skb,
index++;
if (index < s_i)
continue;
+   if (IS_ERR(p))
+   continue;
 
if (jiffy_since &&
time_after(jiffy_since,
-- 
2.28.0



Re: INFO: task hung in tcf_action_init_1

2020-09-30 Thread Cong Wang
#syz test: https://github.com/congwang/linux.git net


Re: general protection fault in tcf_generic_walker

2020-09-30 Thread Cong Wang
On Wed, Sep 30, 2020 at 10:42 AM syzbot
 wrote:
>
> syzbot has found a reproducer for the following issue on:
>
> HEAD commit:2b3e981a Merge branch 'mptcp-Fix-for-32-bit-DATA_FIN'
> git tree:   net
> console output: https://syzkaller.appspot.com/x/log.txt?x=1653724790
> kernel config:  https://syzkaller.appspot.com/x/.config?x=99a7c78965c75e07
> dashboard link: https://syzkaller.appspot.com/bug?extid=b47bc4f247856fb4d9e1
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1412a5a790
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+b47bc4f247856fb4d...@syzkaller.appspotmail.com
>
> general protection fault, probably for non-canonical address 
> 0xdc04:  [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x0020-0x0027]
> CPU: 0 PID: 8855 Comm: syz-executor.1 Not tainted 5.9.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:tcf_dump_walker net/sched/act_api.c:240 [inline]
> RIP: 0010:tcf_generic_walker+0x367/0xba0 net/sched/act_api.c:343
> Code: 24 31 ff 48 89 de e8 c8 55 eb fa 48 85 db 74 3f e8 3e 59 eb fa 48 8d 7d 
> 30 48 b9 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 <80> 3c 08 00 0f 85 26 
> 07 00 00 48 8b 5d 30 31 ff 48 2b 1c 24 48 89
> RSP: 0018:c9000b6ff3a8 EFLAGS: 00010202
> RAX: 0004 RBX: c000aae4 RCX: dc00
> RDX: 8880a82aa140 RSI: 868ae502 RDI: 0020
> RBP: fff0 R08:  R09: 8880a8c41e07
> R10:  R11:  R12: 88809f226340
> R13:  R14:  R15: 
> FS:  7f156f7fa700() GS:8880ae40() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 55d25128b348 CR3: a7d3d000 CR4: 001506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  tc_dump_action+0x6d5/0xe60 net/sched/act_api.c:1609


Probably just need another IS_ERR() check on the dump path.
I will take a second look.

Thanks.


Re: KASAN: use-after-free Read in tcf_action_init

2020-09-28 Thread Cong Wang
#syz fix: net_sched: commit action insertions together


[Patch net] net_sched: remove a redundant goto chain check

2020-09-28 Thread Cong Wang
All TC actions call tcf_action_check_ctrlact() to validate the
goto chain, so this check in tcf_action_init_1() is actually
redundant. Remove it to avoid the trouble of leaking memory.

Fixes: e49d8c22f126 ("net_sched: defer tcf_idr_insert() in tcf_action_init_1()")
Reported-by: Vlad Buslov 
Suggested-by: Davide Caratti 
Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/act_api.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 104b47f5184f..5612b336e18e 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -722,13 +722,6 @@ int tcf_action_destroy(struct tc_action *actions[], int bind)
return ret;
 }
 
-static int tcf_action_destroy_1(struct tc_action *a, int bind)
-{
-   struct tc_action *actions[] = { a, NULL };
-
-   return tcf_action_destroy(actions, bind);
-}
-
 static int tcf_action_put(struct tc_action *p)
 {
return __tcf_action_put(p, false);
@@ -1000,13 +993,6 @@ struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
if (err < 0)
goto err_mod;
 
-   if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN) &&
-   !rcu_access_pointer(a->goto_chain)) {
-   tcf_action_destroy_1(a, bind);
-   NL_SET_ERR_MSG(extack, "can't use goto chain with NULL chain");
-   return ERR_PTR(-EINVAL);
-   }
-
if (!name && tb[TCA_ACT_COOKIE])
tcf_set_action_cookie(&a->act_cookie, cookie);
 
-- 
2.28.0



Re: [Patch net 1/2] net_sched: defer tcf_idr_insert() in tcf_action_init_1()

2020-09-28 Thread Cong Wang
On Mon, Sep 28, 2020 at 3:14 AM Davide Caratti  wrote:
>
> hello,
>
> On Fri, 2020-09-25 at 22:45 +0300, Vlad Buslov wrote:
> > On Fri 25 Sep 2020 at 22:22, Cong Wang  wrote:
> > > On Fri, Sep 25, 2020 at 8:24 AM Vlad Buslov  wrote:
> > > > > + if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN) &&
> > > > > + !rcu_access_pointer(a->goto_chain)) {
> > > > > + tcf_action_destroy_1(a, bind);
> > > > > + NL_SET_ERR_MSG(extack, "can't use goto chain with NULL chain");
> > > > > + return ERR_PTR(-EINVAL);
> > > > > + }
> > > >
> > > > I don't think calling tcf_action_destroy_1() is enough here. Since you
> > > > moved this block before assigning the cookie and releasing the module, you
> > > > also need to release them manually in addition to destroying the action
> > > > instance.
> > > >
> > >
> > > tcf_action_destroy_1() eventually calls free_tcf(), which frees the cookie,
> > > and tcf_action_destroy(), which releases the module refcnt.
> > >
> > > What am I missing here?
> > >
> > > Thanks.
> >
> > The memory referenced by the function local pointer "cookie" hasn't been
> > assigned yet to a->act_cookie because in your patch you moved the
> > goto_chain validation code before the cookie change. That means that if the
> > user overwrites an existing action, then the old a->act_cookie will be
> > freed by tcf_action_destroy_1() but the new cookie that was allocated by
> > nla_memdup_cookie() will leak.

Yes, good catch!


>
> maybe we can just delete this if (TC_ACT_EXT_CMP(...)) { ... }
> statement, instead of moving it? Each TC action already does the check
> for NULL "goto chains" with a_o->init() -> tcf_action_check_ctrlact(),
> so this if () statement looks like dead code to me; I probably forgot to
> remove it after all actions were converted to validate the control
> action inside their .init() function.

Good point, I think you are right, I will send a patch to remove it.

Thanks!


Re: [Patch net 1/2] net_sched: defer tcf_idr_insert() in tcf_action_init_1()

2020-09-25 Thread Cong Wang
On Fri, Sep 25, 2020 at 8:24 AM Vlad Buslov  wrote:
> > + if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN) &&
> > + !rcu_access_pointer(a->goto_chain)) {
> > + tcf_action_destroy_1(a, bind);
> > + NL_SET_ERR_MSG(extack, "can't use goto chain with NULL chain");
> > + return ERR_PTR(-EINVAL);
> > + }
>
> I don't think calling tcf_action_destroy_1() is enough here. Since you
> moved this block before assigning the cookie and releasing the module, you
> also need to release them manually in addition to destroying the action
> instance.
>

tcf_action_destroy_1() eventually calls free_tcf(), which frees the cookie,
and tcf_action_destroy(), which releases the module refcnt.

What am I missing here?

Thanks.


[Patch net 1/2] net_sched: defer tcf_idr_insert() in tcf_action_init_1()

2020-09-22 Thread Cong Wang
All TC actions call tcf_idr_insert() for a new action at the end
of their ->init(), so we can actually move it to a central place
in tcf_action_init_1().

And once the action is inserted into the global IDR, another parallel
process could free it immediately as its refcnt is still 1, so we
cannot fail after this; we need to move it after the goto action
validation to avoid handling the failure case after insertion.

This was found during code review and was not directly triggered by syzbot.
And this prepares for the next patch.

Cc: Vlad Buslov 
Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 include/net/act_api.h  |  2 --
 net/sched/act_api.c| 38 --
 net/sched/act_bpf.c|  4 +---
 net/sched/act_connmark.c   |  1 -
 net/sched/act_csum.c   |  3 ---
 net/sched/act_ct.c |  2 --
 net/sched/act_ctinfo.c |  3 ---
 net/sched/act_gact.c   |  2 --
 net/sched/act_gate.c   |  3 ---
 net/sched/act_ife.c|  3 ---
 net/sched/act_ipt.c|  2 --
 net/sched/act_mirred.c |  2 --
 net/sched/act_mpls.c   |  2 --
 net/sched/act_nat.c|  3 ---
 net/sched/act_pedit.c  |  2 --
 net/sched/act_police.c |  2 --
 net/sched/act_sample.c |  2 --
 net/sched/act_simple.c |  2 --
 net/sched/act_skbedit.c|  2 --
 net/sched/act_skbmod.c |  2 --
 net/sched/act_tunnel_key.c |  3 ---
 net/sched/act_vlan.c   |  2 --
 22 files changed, 21 insertions(+), 66 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index cb382a89ea58..87214927314a 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -166,8 +166,6 @@ int tcf_idr_create_from_flags(struct tc_action_net *tn, u32 
index,
  struct nlattr *est, struct tc_action **a,
  const struct tc_action_ops *ops, int bind,
  u32 flags);
-void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a);
-
 void tcf_idr_cleanup(struct tc_action_net *tn, u32 index);
 int tcf_idr_check_alloc(struct tc_action_net *tn, u32 *index,
struct tc_action **a, int bind);
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 063d8aaf2900..0030f00234ee 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -467,17 +467,6 @@ int tcf_idr_create_from_flags(struct tc_action_net *tn, 
u32 index,
 }
 EXPORT_SYMBOL(tcf_idr_create_from_flags);
 
-void tcf_idr_insert(struct tc_action_net *tn, struct tc_action *a)
-{
-   struct tcf_idrinfo *idrinfo = tn->idrinfo;
-
-   mutex_lock(&idrinfo->lock);
-   /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
-   WARN_ON(!IS_ERR(idr_replace(&idrinfo->action_idr, a, a->tcfa_index)));
-   mutex_unlock(&idrinfo->lock);
-}
-EXPORT_SYMBOL(tcf_idr_insert);
-
 /* Cleanup idr index that was allocated but not initialized. */
 
 void tcf_idr_cleanup(struct tc_action_net *tn, u32 index)
@@ -902,6 +891,16 @@ static const struct nla_policy 
tcf_action_policy[TCA_ACT_MAX + 1] = {
[TCA_ACT_HW_STATS]  = NLA_POLICY_BITFIELD32(TCA_ACT_HW_STATS_ANY),
 };
 
+static void tcf_idr_insert(struct tc_action *a)
+{
+   struct tcf_idrinfo *idrinfo = a->idrinfo;
+
+   mutex_lock(&idrinfo->lock);
+   /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
+   WARN_ON(!IS_ERR(idr_replace(&idrinfo->action_idr, a, a->tcfa_index)));
+   mutex_unlock(&idrinfo->lock);
+}
+
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
struct nlattr *nla, struct nlattr *est,
char *name, int ovr, int bind,
@@ -989,6 +988,16 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
if (err < 0)
goto err_mod;
 
+   if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN) &&
+   !rcu_access_pointer(a->goto_chain)) {
+   tcf_action_destroy_1(a, bind);
+   NL_SET_ERR_MSG(extack, "can't use goto chain with NULL chain");
+   return ERR_PTR(-EINVAL);
+   }
+
+   if (err == ACT_P_CREATED)
+   tcf_idr_insert(a);
+
if (!name && tb[TCA_ACT_COOKIE])
tcf_set_action_cookie(&a->act_cookie, cookie);
 
@@ -1002,13 +1011,6 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
if (err != ACT_P_CREATED)
module_put(a_o->owner);
 
-   if (TC_ACT_EXT_CMP(a->tcfa_action, TC_ACT_GOTO_CHAIN) &&
-   !rcu_access_pointer(a->goto_chain)) {
-   tcf_action_destroy_1(a, bind);
-   NL_SET_ERR_MSG(extack, "can't use goto chain with NULL chain");
-   return ERR_PTR(-EINVAL);
-   }
-
return a;
 
 err_mod:
diff --git a/net/sched/act_

[Patch net 0/2] net_sched: fix a UAF in tcf_action_init()

2020-09-22 Thread Cong Wang
This patchset fixes a use-after-free triggered by syzbot. Please
find more details in each patch description.

---

Cong Wang (2):
  net_sched: defer tcf_idr_insert() in tcf_action_init_1()
  net_sched: commit action insertions together

 include/net/act_api.h  |  2 --
 net/sched/act_api.c| 52 +-
 net/sched/act_bpf.c|  4 +--
 net/sched/act_connmark.c   |  1 -
 net/sched/act_csum.c   |  3 ---
 net/sched/act_ct.c |  2 --
 net/sched/act_ctinfo.c |  3 ---
 net/sched/act_gact.c   |  2 --
 net/sched/act_gate.c   |  3 ---
 net/sched/act_ife.c|  3 ---
 net/sched/act_ipt.c|  2 --
 net/sched/act_mirred.c |  2 --
 net/sched/act_mpls.c   |  2 --
 net/sched/act_nat.c|  3 ---
 net/sched/act_pedit.c  |  2 --
 net/sched/act_police.c |  2 --
 net/sched/act_sample.c |  2 --
 net/sched/act_simple.c |  2 --
 net/sched/act_skbedit.c|  2 --
 net/sched/act_skbmod.c |  2 --
 net/sched/act_tunnel_key.c |  3 ---
 net/sched/act_vlan.c   |  2 --
 22 files changed, 35 insertions(+), 66 deletions(-)

-- 
2.28.0



[Patch net 2/2] net_sched: commit action insertions together

2020-09-22 Thread Cong Wang
syzbot is able to trigger a failure case inside the loop in
tcf_action_init(), and when this happens we clean up with
tcf_action_destroy(). But, as these actions are already inserted
into the global IDR, another parallel process could free them
before tcf_action_destroy(), and then we trigger a use-after-free.

Fix this by deferring the insertions even later, after the loop,
and committing all the insertions in a separate loop, so we will
never fail in the middle of the insertions any more.

One side effect is that the window between allocation and final
insertion becomes larger; it is now more likely that the loop in
tcf_del_walker() sees the placeholder -EBUSY pointer. So we have
to check for the error pointer in tcf_del_walker().

Reported-and-tested-by: syzbot+2287853d392e4b423...@syzkaller.appspotmail.com
Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action")
Cc: Vlad Buslov 
Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/act_api.c | 32 +++-
 1 file changed, 23 insertions(+), 9 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 0030f00234ee..104b47f5184f 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -307,6 +307,8 @@ static int tcf_del_walker(struct tcf_idrinfo *idrinfo, 
struct sk_buff *skb,
 
mutex_lock(&idrinfo->lock);
idr_for_each_entry_ul(idr, p, tmp, id) {
+   if (IS_ERR(p))
+   continue;
ret = tcf_idr_release_unsafe(p);
if (ret == ACT_P_DELETED) {
module_put(ops->owner);
@@ -891,14 +893,24 @@ static const struct nla_policy 
tcf_action_policy[TCA_ACT_MAX + 1] = {
[TCA_ACT_HW_STATS]  = NLA_POLICY_BITFIELD32(TCA_ACT_HW_STATS_ANY),
 };
 
-static void tcf_idr_insert(struct tc_action *a)
+static void tcf_idr_insert_many(struct tc_action *actions[])
 {
-   struct tcf_idrinfo *idrinfo = a->idrinfo;
+   int i;
 
-   mutex_lock(&idrinfo->lock);
-   /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc */
-   WARN_ON(!IS_ERR(idr_replace(&idrinfo->action_idr, a, a->tcfa_index)));
-   mutex_unlock(&idrinfo->lock);
+   for (i = 0; i < TCA_ACT_MAX_PRIO; i++) {
+   struct tc_action *a = actions[i];
+   struct tcf_idrinfo *idrinfo;
+
+   if (!a)
+   continue;
+   idrinfo = a->idrinfo;
+   mutex_lock(&idrinfo->lock);
+   /* Replace ERR_PTR(-EBUSY) allocated by tcf_idr_check_alloc if
+* it is just created, otherwise this is just a nop.
+*/
+   idr_replace(&idrinfo->action_idr, a, a->tcfa_index);
+   mutex_unlock(&idrinfo->lock);
+   }
 }
 
 struct tc_action *tcf_action_init_1(struct net *net, struct tcf_proto *tp,
@@ -995,9 +1007,6 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct tcf_proto *tp,
return ERR_PTR(-EINVAL);
}
 
-   if (err == ACT_P_CREATED)
-   tcf_idr_insert(a);
-
if (!name && tb[TCA_ACT_COOKIE])
tcf_set_action_cookie(&a->act_cookie, cookie);
 
@@ -1053,6 +1062,11 @@ int tcf_action_init(struct net *net, struct tcf_proto 
*tp, struct nlattr *nla,
actions[i - 1] = act;
}
 
+   /* We have to commit them all together, because if any error happened in
+* between, we could not handle the failure gracefully.
+*/
+   tcf_idr_insert_many(actions);
+
*attr_size = tcf_action_full_attrs_size(sz);
return i - 1;
 
-- 
2.28.0



Re: [PATCH v2 net] net: sch_generic: avoid concurrent reset and enqueue op for lockless qdisc

2020-09-17 Thread Cong Wang
On Fri, Sep 11, 2020 at 1:13 AM Yunsheng Lin  wrote:
>
> On 2020/9/11 4:07, Cong Wang wrote:
> > On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin  wrote:
> >>
> >> Currently there is concurrent reset and enqueue operation for the
> >> same lockless qdisc when there is no lock to synchronize the
> >> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >> skb with a larger queue_mapping after the corresponding qdisc is
> >> reset, and call hns3_nic_net_xmit() with that skb later.
> >>
> >> Reused the existing synchronize_net() in dev_deactivate_many() to
> >> make sure skb with larger queue_mapping enqueued to old qdisc(which
> >> is saved in dev_queue->qdisc_sleeping) will always be reset when
> >> dev_reset_queue() is called.
> >>
> >> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> >> Signed-off-by: Yunsheng Lin 
> >> ---
> >> ChangeLog V2:
> >> Reuse existing synchronize_net().
> >> ---
> >>  net/sched/sch_generic.c | 48 
> >> +---
> >>  1 file changed, 33 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> >> index 265a61d..54c4172 100644
> >> --- a/net/sched/sch_generic.c
> >> +++ b/net/sched/sch_generic.c
> >> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
> >>
> >>  static void qdisc_deactivate(struct Qdisc *qdisc)
> >>  {
> >> -   bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> >> -
> >> if (qdisc->flags & TCQ_F_BUILTIN)
> >> return;
> >> -   if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
> >> -   return;
> >> -
> >> -   if (nolock)
> >> -   spin_lock_bh(&qdisc->seqlock);
> >> -   spin_lock_bh(qdisc_lock(qdisc));
> >>
> >> set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
> >> -
> >> -   qdisc_reset(qdisc);
> >> -
> >> -   spin_unlock_bh(qdisc_lock(qdisc));
> >> -   if (nolock)
> >> -   spin_unlock_bh(&qdisc->seqlock);
> >>  }
> >>
> >>  static void dev_deactivate_queue(struct net_device *dev,
> >> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct net_device 
> >> *dev,
> >> }
> >>  }
> >>
> >> +static void dev_reset_queue(struct net_device *dev,
> >> +   struct netdev_queue *dev_queue,
> >> +   void *_unused)
> >> +{
> >> +   struct Qdisc *qdisc;
> >> +   bool nolock;
> >> +
> >> +   qdisc = dev_queue->qdisc_sleeping;
> >> +   if (!qdisc)
> >> +   return;
> >> +
> >> +   nolock = qdisc->flags & TCQ_F_NOLOCK;
> >> +
> >> +   if (nolock)
> >> +   spin_lock_bh(&qdisc->seqlock);
> >> +   spin_lock_bh(qdisc_lock(qdisc));
> >
> >
> > I think you do not need this lock for the lockless one.
>
> It seems so.
> Maybe another patch to remove qdisc_lock(qdisc) for lockless
> qdisc?

Yeah, but I am not sure we still want this lockless qdisc any more;
it brings more trouble than gain.

>
>
> >
> >> +
> >> +   qdisc_reset(qdisc);
> >> +
> >> +   spin_unlock_bh(qdisc_lock(qdisc));
> >> +   if (nolock)
> >> +   spin_unlock_bh(&qdisc->seqlock);
> >> +}
> >> +
> >>  static bool some_qdisc_is_busy(struct net_device *dev)
> >>  {
> >> unsigned int i;
> >> @@ -1213,12 +1223,20 @@ void dev_deactivate_many(struct list_head *head)
> >> dev_watchdog_down(dev);
> >> }
> >>
> >> -   /* Wait for outstanding qdisc-less dev_queue_xmit calls.
> >> +   /* Wait for outstanding qdisc-less dev_queue_xmit calls or
> >> +* outstanding qdisc enqueuing calls.
> >>  * This is avoided if all devices are in dismantle phase :
> >>  * Caller will call synchronize_net() for us
> >>  */

Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-17 Thread Cong Wang
On Sun, Sep 13, 2020 at 7:10 PM Yunsheng Lin  wrote:
>
> On 2020/9/11 4:19, Cong Wang wrote:
> > On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng  wrote:
> >> I also tried Cong's patch (shown below on my tree) and it could avoid
> >> the issue (stressing for 30 minutes, three times, and no jitter
> >> observed).
> >
> > Thanks for verifying it!
> >
> >>
> >> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> >> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> >> @@ -127,8 +127,7 @@
> >>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
> >>  {
> >>   if (qdisc->flags & TCQ_F_NOLOCK) {
> >> - if (!spin_trylock(&qdisc->seqlock))
> >> - return false;
> >> + spin_lock(&qdisc->seqlock);
> >>   } else if (qdisc_is_running(qdisc)) {
> >>   return false;
> >>   }
> >>
> >> I do not actually know what you are discussing above. It seems to me
> >> that Cong's patch is similar to disabling the lockless feature.
> >
> > From performance's perspective, yeah. Did you see any performance
> > downgrade with my patch applied? It would be great if you can compare
> > it with removing NOLOCK. And if the performance is as bad as no
> > NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
> > for now.
>
> It seems the lockless qdisc may have the concurrency problem below:
>
>   cpu0:                                              cpu1:
> q->enqueue                                             .
> qdisc_run_begin(q)                                     .
> __qdisc_run(q) -> qdisc_restart() -> dequeue_skb()     .
>                -> sch_direct_xmit()                    .
>                                                        .
>                                                    q->enqueue
>                                                    qdisc_run_begin(q)
> qdisc_run_end(q)
>
>
> cpu1 enqueues a skb without calling __qdisc_run(), and cpu0 does not see
> the enqueued skb when calling __qdisc_run(q), because cpu1 may enqueue
> the skb after cpu0 called __qdisc_run(q) and before cpu0 called
> qdisc_run_end(q).

This is the same problem that my patch fixes; I do not know
why you are suggesting another patch despite quoting mine.
Please read the whole thread if you want to participate.

Thanks.


Re: [PATCH] net/sched: cbs: fix calculation error of idleslope credits

2020-09-17 Thread Cong Wang
On Wed, Sep 16, 2020 at 11:05 PM Xiaoyong Yan  wrote:
>
> in the function cbs_dequeue_soft(), when q->credits < 0, [now - q->last]
> should be accounted to the sendslope, not the idleslope.
>
> so the solution is as follows: when q->credits is less than 0, directly
> calculate the delay time, activate the hrtimer, and when the hrtimer
> fires, calculate the idleslope credits and update q->credits.
>
> because there is no way to supply a self-defined qdisc watchdog
> function, add qdisc_watchdog_init_func and
> qdisc_watchdog_init_clockid_func to the generic sch_api, so sch_cbs
> can use them to define its own handler.

You do not have to define these as generic API; you can just use the
hrtimer API directly in sch_cbs, as it is the only user within net/sched/.
Hopefully this would reduce the size of your patch too.
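
Something along these lines would do; a minimal sketch only, assuming a
struct hrtimer plus a back-pointer to the qdisc are added to
cbs_sched_data, and with cbs_timer_func() as a hypothetical callback name:

static enum hrtimer_restart cbs_timer_func(struct hrtimer *timer)
{
	struct cbs_sched_data *q = container_of(timer,
						struct cbs_sched_data,
						timer);

	/* refill idleslope credits here, then kick the qdisc */
	__netif_schedule(q->sch);	/* assumed back-pointer */
	return HRTIMER_NORESTART;
}

/* and in cbs_init():
 *	hrtimer_init(&q->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 *	q->timer.function = cbs_timer_func;
 */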


>
> the patch is based on v5.4.42, and the test results are as follows:
> the NIC is 100Mb/s ,full duplex.
>
> step1:
> tc qdisc add dev ens33 root cbs idleslope 75 sendslope -25 hicredit 100 
> locredit -100 offload 0
> step2:
> root@ubuntu:/home/yxy/kernel/linux-stable# iperf -c 192.168.1.114 -i 1
> 
> Client connecting to 192.168.1.114, TCP port 5001
> TCP window size:  246 KByte (default)
> 
> [  3] local 192.168.1.120 port 42004 connected with 192.168.1.114 port 5001
> [ ID] Interval   Transfer Bandwidth
> [  3]  0.0- 1.0 sec  9.00 MBytes  75.5 Mbits/sec
> [  3]  1.0- 2.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  2.0- 3.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  3.0- 4.0 sec  8.38 MBytes  70.3 Mbits/sec
> [  3]  4.0- 5.0 sec  8.38 MBytes  70.3 Mbits/sec
> [  3]  5.0- 6.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  6.0- 7.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  7.0- 8.0 sec  8.62 MBytes  72.4 Mbits/sec
> [  3]  8.0- 9.0 sec  8.50 MBytes  71.3 Mbits/sec
> [  3]  9.0-10.0 sec  8.62 MBytes  72.4 Mbits/sec
> [  3]  0.0-10.0 sec  85.5 MBytes  71.5 Mbits/sec
>
> Signed-off-by: Xiaoyong Yan 
> ---
>  include/net/pkt_sched.h |  7 +++
>  net/sched/sch_api.c | 20 ++--
>  net/sched/sch_cbs.c | 41 +
>  3 files changed, 50 insertions(+), 18 deletions(-)
>
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index 6a70845bd9ab..5fec23b15e1c 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
> #define DEFAULT_TX_QUEUE_LEN1000
>
> @@ -72,6 +73,12 @@ struct qdisc_watchdog {
> struct Qdisc*qdisc;
>  };
>
> +typedef enum hrtimer_restart (*qdisc_watchdog_func_t)(struct hrtimer *timer);
> +void qdisc_watchdog_init_clockid_func(struct qdisc_watchdog *wd, struct 
> Qdisc *qdisc,
> +clockid_t clockid,qdisc_watchdog_func_t func);
> +void qdisc_watchdog_init_func(struct qdisc_watchdog *wd,struct Qdisc 
> *qdisc,qdisc_watchdog_func_t func);
> +enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer);
> +
>  void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
> *qdisc,
>  clockid_t clockid);
>  void qdisc_watchdog_init(struct qdisc_watchdog *wd, struct Qdisc *qdisc);
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 50794125bf02..fea08d10c811 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -22,7 +22,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>
> @@ -591,7 +590,7 @@ void qdisc_warn_nonwc(const char *txt, struct Qdisc 
> *qdisc)
>  }
>  EXPORT_SYMBOL(qdisc_warn_nonwc);
>
> -static enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
> +enum hrtimer_restart qdisc_watchdog(struct hrtimer *timer)
>  {
> struct qdisc_watchdog *wd = container_of(timer, struct qdisc_watchdog,
>  timer);
> @@ -602,7 +601,24 @@ static enum hrtimer_restart qdisc_watchdog(struct 
> hrtimer *timer)
>
> return HRTIMER_NORESTART;
>  }
> +EXPORT_SYMBOL(qdisc_watchdog);
>
> +void qdisc_watchdog_init_clockid_func(struct qdisc_watchdog *wd,struct Qdisc 
> *qdisc,clockid_t clockid,qdisc_watchdog_func_t func)
> +{
> +   hrtimer_init(&wd->timer,clockid,HRTIMER_MODE_ABS_PINNED);
> +   if(!func)
> +   wd->timer.function = qdisc_watchdog;
> +   else
> +   wd->timer.function = func;
> +   wd->qdisc = qdisc;
> +}
> +EXPORT_SYMBOL(qdisc_watchdog_init_clockid_func);
> +
> +void qdisc_watchdog_init_func(struct qdisc_watchdog *wd,struct Qdisc 
> *qdisc,qdisc_watchdog_func_t func)
> +{
> +   qdisc_watchdog_init_clockid_func(wd,qdisc,CLOCK_MONOTONIC,func);
> +}
> +EXPORT_SYMBOL(qdisc_watchdog_init_func);
>  void qdisc_watchdog_init_clockid(struct qdisc_watchdog *wd, struct Qdisc 
> *qdisc,
>  clockid_t clockid)
>  {
> diff --git a/net/sched/sch_cbs.c b/net/sched/sch_cbs.c
> index 2eaac2ff380f..351d

Re: [PATCH net-next] genetlink: Remove unused function genl_err_attr()

2020-09-17 Thread Cong Wang
On Wed, Sep 16, 2020 at 9:33 AM YueHaibing  wrote:
>
> It is never used, so can remove it.

This is a bit confusing; it was actually used before, see commit
ab0d76f6823cc3a4e2.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-10 Thread Cong Wang
On Thu, Sep 3, 2020 at 8:21 PM Kehuan Feng  wrote:
> I also tried Cong's patch (shown below on my tree) and it could avoid
> the issue (stressing for 30 minutes, three times, and no jitter
> observed).

Thanks for verifying it!

>
> --- ./include/net/sch_generic.h.orig 2020-08-21 15:13:51.787952710 +0800
> +++ ./include/net/sch_generic.h 2020-09-03 21:36:11.468383738 +0800
> @@ -127,8 +127,7 @@
>  static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>  {
>   if (qdisc->flags & TCQ_F_NOLOCK) {
> - if (!spin_trylock(&qdisc->seqlock))
> - return false;
> + spin_lock(&qdisc->seqlock);
>   } else if (qdisc_is_running(qdisc)) {
>   return false;
>   }
>
> I do not actually know what you are discussing above. It seems to me
> that Cong's patch is similar to disabling the lockless feature.

From performance's perspective, yeah. Did you see any performance
downgrade with my patch applied? It would be great if you can compare
it with removing NOLOCK. And if the performance is as bad as no
NOLOCK, then we can remove the NOLOCK bit for pfifo_fast, at least
for now.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-10 Thread Cong Wang
On Thu, Sep 3, 2020 at 10:08 PM John Fastabend  wrote:
> Maybe this would unlock us,
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7df6c9617321..9b09429103f1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -3749,7 +3749,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, 
> struct Qdisc *q,
>
> if (q->flags & TCQ_F_NOLOCK) {
> rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
> -   qdisc_run(q);
> +   __qdisc_run(q);
>
> if (unlikely(to_free))
> kfree_skb_list(to_free);
>
>
> Per other thread we also need the state deactivated check added
> back.

I guess not, because pfifo_dequeue() seems to require q->seqlock,
according to the comments in qdisc_run(), so we cannot just get rid of
qdisc_run_begin()/qdisc_run_end() here.
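
For reference, qdisc_run() at the time of this discussion is roughly
(include/net/pkt_sched.h, simplified):

static inline void qdisc_run(struct Qdisc *q)
{
	if (qdisc_run_begin(q)) {	/* takes q->seqlock for NOLOCK qdiscs */
		__qdisc_run(q);
		qdisc_run_end(q);
	}
}

Calling __qdisc_run() directly would skip exactly the seqlock that
serializes the pfifo_fast dequeue.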

Thanks.


Re: [PATCH v2 net] net: sch_generic: avoid concurrent reset and enqueue op for lockless qdisc

2020-09-10 Thread Cong Wang
On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin  wrote:
>
> Currently there is concurrent reset and enqueue operation for the
> same lockless qdisc when there is no lock to synchronize the
> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> out-of-bounds access for priv->ring[] in hns3 driver if user has
> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> skb with a larger queue_mapping after the corresponding qdisc is
> reset, and call hns3_nic_net_xmit() with that skb later.
>
> Reused the existing synchronize_net() in dev_deactivate_many() to
> make sure skb with larger queue_mapping enqueued to old qdisc(which
> is saved in dev_queue->qdisc_sleeping) will always be reset when
> dev_reset_queue() is called.
>
> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
> Signed-off-by: Yunsheng Lin 
> ---
> ChangeLog V2:
> Reuse existing synchronize_net().
> ---
>  net/sched/sch_generic.c | 48 +---
>  1 file changed, 33 insertions(+), 15 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 265a61d..54c4172 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
>
>  static void qdisc_deactivate(struct Qdisc *qdisc)
>  {
> -   bool nolock = qdisc->flags & TCQ_F_NOLOCK;
> -
> if (qdisc->flags & TCQ_F_BUILTIN)
> return;
> -   if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
> -   return;
> -
> -   if (nolock)
> -   spin_lock_bh(&qdisc->seqlock);
> -   spin_lock_bh(qdisc_lock(qdisc));
>
> set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
> -
> -   qdisc_reset(qdisc);
> -
> -   spin_unlock_bh(qdisc_lock(qdisc));
> -   if (nolock)
> -   spin_unlock_bh(&qdisc->seqlock);
>  }
>
>  static void dev_deactivate_queue(struct net_device *dev,
> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct net_device 
> *dev,
> }
>  }
>
> +static void dev_reset_queue(struct net_device *dev,
> +   struct netdev_queue *dev_queue,
> +   void *_unused)
> +{
> +   struct Qdisc *qdisc;
> +   bool nolock;
> +
> +   qdisc = dev_queue->qdisc_sleeping;
> +   if (!qdisc)
> +   return;
> +
> +   nolock = qdisc->flags & TCQ_F_NOLOCK;
> +
> +   if (nolock)
> +   spin_lock_bh(&qdisc->seqlock);
> +   spin_lock_bh(qdisc_lock(qdisc));


I think you do not need this lock for the lockless one.

> +
> +   qdisc_reset(qdisc);
> +
> +   spin_unlock_bh(qdisc_lock(qdisc));
> +   if (nolock)
> +   spin_unlock_bh(&qdisc->seqlock);
> +}
> +
>  static bool some_qdisc_is_busy(struct net_device *dev)
>  {
> unsigned int i;
> @@ -1213,12 +1223,20 @@ void dev_deactivate_many(struct list_head *head)
> dev_watchdog_down(dev);
> }
>
> -   /* Wait for outstanding qdisc-less dev_queue_xmit calls.
> +   /* Wait for outstanding qdisc-less dev_queue_xmit calls or
> +* outstanding qdisc enqueuing calls.
>  * This is avoided if all devices are in dismantle phase :
>  * Caller will call synchronize_net() for us
>  */
> synchronize_net();
>
> +   list_for_each_entry(dev, head, close_list) {
> +   netdev_for_each_tx_queue(dev, dev_reset_queue, NULL);
> +
> +   if (dev_ingress_queue(dev))
> +   dev_reset_queue(dev, dev_ingress_queue(dev), NULL);
> +   }
> +
> /* Wait for outstanding qdisc_run calls. */
> list_for_each_entry(dev, head, close_list) {
> while (some_qdisc_is_busy(dev)) {

Do you want to reset before waiting for the TX action?

I think it is safer to do it after; at least prior to commit 759ae57f1b
we did it after.
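
In other words, something like this ordering (pseudo-code only, based on
the helpers introduced by this patch; the wait loop is a placeholder for
the existing one in dev_deactivate_many()):

	synchronize_net();

	/* wait for outstanding qdisc_run calls first ... */
	list_for_each_entry(dev, head, close_list) {
		while (some_qdisc_is_busy(dev))
			msleep(1);	/* placeholder */
	}

	/* ... and only then reset the qdiscs */
	list_for_each_entry(dev, head, close_list) {
		netdev_for_each_tx_queue(dev, dev_reset_queue, NULL);
		if (dev_ingress_queue(dev))
			dev_reset_queue(dev, dev_ingress_queue(dev), NULL);
	}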

Thanks.


Re: [Question] Oops when using connector in linux-4.19

2020-09-05 Thread Cong Wang
On Sat, Sep 5, 2020 at 12:28 AM Yang Yingliang  wrote:
>
> Hi,
>
> I got some crashes when using connector module in linux-4.19:

Can you test a reasonably recent kernel?


> The invalid address[0x0003004c] is the value of nlmsghdr from cn 
> netlink, nlmsg_type is 3 and nlmsg_len is 0x4c.
>
> It seems the skb->data pointer is freed wrongly:
>
> Process A   Process B
>
> calls cn_netlink_send_mult()
> skb = nlmsg_new(size, gfp_mask);
> unknown process calls 
> kfree(skb->data)
> //put skb->data 
> pointer back to freelist of struct kmem_cache_cpu or struct page
>
> nlh = nlmsg_put(skb, 0, msg->seq, NLMSG_DONE, size, 0);
> //set (*skb->data) to 0x0003004c,
> //so the freelist is broken here.

This does not make sense. The newly allocated skb is only visible to
process A at this point; it is impossible for it to be freed by another
process.

I guess there might be some buffer overrun on the heap; you probably
need to turn on other memory debugging options like SLUB debug:
https://www.kernel.org/doc/Documentation/vm/slub.txt.
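
For example, something like this on the kernel command line (one possible
flag combination; see the document above for the full list):

	slub_debug=FZPU

where F enables sanity checks, Z red zoning, P poisoning, and U
allocation/free tracking.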

Thanks.


[Patch net] act_ife: load meta modules before tcf_idr_check_alloc()

2020-09-03 Thread Cong Wang
The following deadlock scenario is triggered by syzbot:

Thread A:   Thread B:
tcf_idr_check_alloc()
...
populate_metalist()
  rtnl_unlock()
rtnl_lock()
...
  request_module()  tcf_idr_check_alloc()
  rtnl_lock()

At this point, thread A is waiting for thread B to release the RTNL
lock, while thread B is waiting for thread A to commit the IDR
change with tcf_idr_insert() later.

Break this deadlock by preloading the ife meta modules earlier,
before tcf_idr_check_alloc(); this is fine because we only load
the modules we may actually need.

Reported-and-tested-by: syzbot+80e32b5d1f9923f8a...@syzkaller.appspotmail.com
Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action")
Cc: Jamal Hadi Salim 
Cc: Vlad Buslov 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 net/sched/act_ife.c | 44 ++--
 1 file changed, 34 insertions(+), 10 deletions(-)

diff --git a/net/sched/act_ife.c b/net/sched/act_ife.c
index c1fcd85719d6..5c568757643b 100644
--- a/net/sched/act_ife.c
+++ b/net/sched/act_ife.c
@@ -436,6 +436,25 @@ static void tcf_ife_cleanup(struct tc_action *a)
kfree_rcu(p, rcu);
 }
 
+static int load_metalist(struct nlattr **tb, bool rtnl_held)
+{
+   int i;
+
+   for (i = 1; i < max_metacnt; i++) {
+   if (tb[i]) {
+   void *val = nla_data(tb[i]);
+   int len = nla_len(tb[i]);
+   int rc;
+
+   rc = load_metaops_and_vet(i, val, len, rtnl_held);
+   if (rc != 0)
+   return rc;
+   }
+   }
+
+   return 0;
+}
+
 static int populate_metalist(struct tcf_ife_info *ife, struct nlattr **tb,
 bool exists, bool rtnl_held)
 {
@@ -449,10 +468,6 @@ static int populate_metalist(struct tcf_ife_info *ife, 
struct nlattr **tb,
val = nla_data(tb[i]);
len = nla_len(tb[i]);
 
-   rc = load_metaops_and_vet(i, val, len, rtnl_held);
-   if (rc != 0)
-   return rc;
-
rc = add_metainfo(ife, i, val, len, exists);
if (rc)
return rc;
@@ -509,6 +524,21 @@ static int tcf_ife_init(struct net *net, struct nlattr 
*nla,
if (!p)
return -ENOMEM;
 
+   if (tb[TCA_IFE_METALST]) {
+   err = nla_parse_nested_deprecated(tb2, IFE_META_MAX,
+ tb[TCA_IFE_METALST], NULL,
+ NULL);
+   if (err) {
+   kfree(p);
+   return err;
+   }
+   err = load_metalist(tb2, rtnl_held);
+   if (err) {
+   kfree(p);
+   return err;
+   }
+   }
+
index = parm->index;
err = tcf_idr_check_alloc(tn, &index, a, bind);
if (err < 0) {
@@ -570,15 +600,9 @@ static int tcf_ife_init(struct net *net, struct nlattr 
*nla,
}
 
if (tb[TCA_IFE_METALST]) {
-   err = nla_parse_nested_deprecated(tb2, IFE_META_MAX,
- tb[TCA_IFE_METALST], NULL,
- NULL);
-   if (err)
-   goto metadata_parse_err;
err = populate_metalist(ife, tb2, exists, rtnl_held);
if (err)
goto metadata_parse_err;
-
} else {
/* if no passed metadata allow list or passed allow-all
 * then here we process by adding as many supported metadatum
-- 
2.28.0



Re: INFO: task hung in tcf_ife_init

2020-09-03 Thread Cong Wang
On Thu, Sep 3, 2020 at 2:47 AM Eric Dumazet  wrote:
> This commit might be related :
>
> commit 4e407ff5cd67ec76a1deec227b7982dc7f66
> Author: Cong Wang 
> Date:   Sun Aug 19 12:22:12 2018 -0700
>
> act_ife: move tcfa_lock down to where necessary

It does not look like my commit's fault. From my _quick_ understanding
of this problem, we somehow have a "deadlock" situation:

Thread A allocates an IDR entry with a specific index and calls
populate_metalist(), which drops and re-acquires the RTNL lock.

Thread B gets the RTNL lock right after thread A releases it, then it
waits for thread A to finally commit the IDR change.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-03 Thread Cong Wang
On Thu, Sep 3, 2020 at 1:40 AM Paolo Abeni  wrote:
>
> On Wed, 2020-09-02 at 22:01 -0700, Cong Wang wrote:
> > Can you test the attached one-line fix? I think we are overthinking
> > this; probably all we need here is a busy wait.
>
> I think that will solve, but I also think that will kill NOLOCK
> performances due to really increased contention.

Yeah, we somehow end up with more locks (seqlock, skb array lock)
for lockless qdisc. What an irony... ;)

>
> At this point I fear we could consider reverting the NOLOCK stuff.
> I personally would hate doing so, but it looks like NOLOCK benefits are
> outweighed by its issues.

I agree, NOLOCK brings more pain than gain. There are many race
conditions hidden in the generic qdisc layer; another one is enqueue
vs. reset, which is being discussed in another thread.

Thanks.


Re: Packet gets stuck in NOLOCK pfifo_fast qdisc

2020-09-02 Thread Cong Wang
Hello, Kehuan

Can you test the attached one-line fix? I think we are overthinking
this; probably all we need here is a busy wait.

Thanks.
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index d60e7c39d60c..fc1bacdb102b 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -156,8 +156,7 @@ static inline bool qdisc_is_empty(const struct Qdisc *qdisc)
 static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK) {
-		if (!spin_trylock(&qdisc->seqlock))
-			return false;
+		spin_lock(&qdisc->seqlock);
 		WRITE_ONCE(qdisc->empty, false);
 	} else if (qdisc_is_running(qdisc)) {
 		return false;


Re: [PATCH net-next] net: sch_generic: avoid concurrent reset and enqueue op for lockless qdisc

2020-09-02 Thread Cong Wang
On Wed, Sep 2, 2020 at 7:22 PM Yunsheng Lin  wrote:
>
> On 2020/9/3 9:48, Cong Wang wrote:
> > On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/3 8:35, Cong Wang wrote:
> >>> On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> On 2020/9/2 12:41, Cong Wang wrote:
> >>>>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  
> >>>>> wrote:
> >>>>>>
> >>>>>> On 2020/9/2 2:24, Cong Wang wrote:
> >>>>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Currently there is concurrent reset and enqueue operation for the
> >>>>>>>> same lockless qdisc when there is no lock to synchronize the
> >>>>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>>>>
> >>>>>>> Can you be more specific here? Which call path requests a smaller
> >>>>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>>>>>> we already have a synchronize_net() there.
> >>>>>>
> >>>>>> When the netdevice is in active state, the synchronize_net() seems to
> >>>>>> do the correct work, as below:
> >>>>>>
> >>>>>> CPU 0:   CPU1:
> >>>>>> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> >>>>>> rcu_read_lock_bh();
> >>>>>> netdev_core_pick_tx(dev, skb, sb_dev);
> >>>>>> .
> >>>>>> .   dev->real_num_tx_queues = txq;
> >>>>>> .   .
> >>>>>> .   .
> >>>>>> .   synchronize_net();
> >>>>>> .   .
> >>>>>> q->enqueue().
> >>>>>> .   .
> >>>>>> rcu_read_unlock_bh().
> >>>>>> qdisc_reset_all_tx_gt
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> Right.
> >>>>>
> >>>>>
> >>>>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a 
> >>>>>> problem
> >>>>>> too.
> >>>>>>
> >>>>>> The problem we hit is as below:
> >>>>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is 
> >>>>>> called
> >>>>>> to deactive the netdevice when user requested a smaller queue num, and
> >>>>>> txq->qdisc is already changed to noop_qdisc when calling
> >>>>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the 
> >>>>>> function
> >>>>>> netif_set_real_num_tx_queues() does not help here.
> >>>>>
> >>>>> How could qdisc still be running after deactivating the device?
> >>>>
> >>>> qdisc could be running during the device deactivating process.
> >>>>
> >>>> The main process of changing channel number is as below:
> >>>>
> >>>> 1. dev_deactivate()
> >>>> 2. hns3 handware related setup
> >>>> 3. netif_set_real_num_tx_queues()
> >>>> 4. netif_tx_wake_all_queues()
> >>>> 5. dev_activate()
> >>>>
> >>>> During step 1, qdisc could be running while qdisc is resetting, so
> >>>> there could be skb left in the old qdisc(which will be restored back to

Re: [PATCH net-next] net: sch_generic: avoid concurrent reset and enqueue op for lockless qdisc

2020-09-02 Thread Cong Wang
On Wed, Sep 2, 2020 at 6:22 PM Yunsheng Lin  wrote:
>
> On 2020/9/3 8:35, Cong Wang wrote:
> > On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/2 12:41, Cong Wang wrote:
> >>> On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> On 2020/9/2 2:24, Cong Wang wrote:
> >>>>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  
> >>>>> wrote:
> >>>>>>
> >>>>>> Currently there is concurrent reset and enqueue operation for the
> >>>>>> same lockless qdisc when there is no lock to synchronize the
> >>>>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>>>
> >>>>> Can you be more specific here? Which call path requests a smaller
> >>>>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>>>> we already have a synchronize_net() there.
> >>>>
> >>>> When the netdevice is in active state, the synchronize_net() seems to
> >>>> do the correct work, as below:
> >>>>
> >>>> CPU 0:   CPU1:
> >>>> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> >>>> rcu_read_lock_bh();
> >>>> netdev_core_pick_tx(dev, skb, sb_dev);
> >>>> .
> >>>> .   dev->real_num_tx_queues = txq;
> >>>> .   .
> >>>> .   .
> >>>> .   synchronize_net();
> >>>> .   .
> >>>> q->enqueue().
> >>>> .   .
> >>>> rcu_read_unlock_bh().
> >>>> qdisc_reset_all_tx_gt
> >>>>
> >>>>
> >>>
> >>> Right.
> >>>
> >>>
> >>>> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >>>> too.
> >>>>
> >>>> The problem we hit is as below:
> >>>> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >>>> to deactive the netdevice when user requested a smaller queue num, and
> >>>> txq->qdisc is already changed to noop_qdisc when calling
> >>>> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >>>> netif_set_real_num_tx_queues() does not help here.
> >>>
> >>> How could qdisc still be running after deactivating the device?
> >>
> >> qdisc could be running during the device deactivating process.
> >>
> >> The main process of changing channel number is as below:
> >>
> >> 1. dev_deactivate()
> >> 2. hns3 handware related setup
> >> 3. netif_set_real_num_tx_queues()
> >> 4. netif_tx_wake_all_queues()
> >> 5. dev_activate()
> >>
> >> During step 1, qdisc could be running while qdisc is resetting, so
> >> there could be skb left in the old qdisc(which will be restored back to
> >> txq->qdisc during dev_activate()), as below:
> >>
> >> CPU 0:   CPU1:
> >> __dev_queue_xmit():  dev_deactivate_many():
> >> rcu_read_lock_bh();  qdisc_deactivate(qdisc);
> >> q = rcu_dereference_bh(txq->qdisc); .
> >> netdev_core_pick_tx(dev, skb, sb_dev);  .
> >> .
> >> .   
> >> rcu_assign_pointer(dev_queue->qdisc, qdisc_default);
> >> .   .
> >> .   .
> >> .   .
> >> .   .
> >> q->enqueue().
> >
> >
> > Well, like I said, if the deactivated bit were tested before ->enqueue(),
> > there would be no packet queued after qdisc_deactivate().
>
> Only if the deactivated bit testing is also protected by qdisc->seqlock?
> Otherwise there is still a window between setting and testing the
> deactivated bit.

Can you be more specific here? Why would testing or setting a bit not
be atomic?

AFAIU, qdisc->seqlock is an optimization to replace
__QDISC_STATE_RUNNING, which has nothing to do with the deactivate bit.

Thanks.


Re: [PATCH net-next] net: sch_generic: avoid concurrent reset and enqueue op for lockless qdisc

2020-09-02 Thread Cong Wang
On Tue, Sep 1, 2020 at 11:35 PM Yunsheng Lin  wrote:
>
> On 2020/9/2 12:41, Cong Wang wrote:
> > On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  wrote:
> >>
> >> On 2020/9/2 2:24, Cong Wang wrote:
> >>> On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> Currently there is concurrent reset and enqueue operation for the
> >>>> same lockless qdisc when there is no lock to synchronize the
> >>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >>>> skb with a larger queue_mapping after the corresponding qdisc is
> >>>> reset, and call hns3_nic_net_xmit() with that skb later.
> >>>
> >>> Can you be more specific here? Which call path requests a smaller
> >>> tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> >>> we already have a synchronize_net() there.
> >>
> >> When the netdevice is in active state, the synchronize_net() seems to
> >> do the correct work, as below:
> >>
> >> CPU 0:   CPU1:
> >> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> >> rcu_read_lock_bh();
> >> netdev_core_pick_tx(dev, skb, sb_dev);
> >> .
> >> .   dev->real_num_tx_queues = txq;
> >> .   .
> >> .   .
> >> .   synchronize_net();
> >> .   .
> >> q->enqueue().
> >> .   .
> >> rcu_read_unlock_bh().
> >> qdisc_reset_all_tx_gt
> >>
> >>
> >
> > Right.
> >
> >
> >> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> >> too.
> >>
> >> The problem we hit is as below:
> >> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> >> to deactive the netdevice when user requested a smaller queue num, and
> >> txq->qdisc is already changed to noop_qdisc when calling
> >> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> >> netif_set_real_num_tx_queues() does not help here.
> >
> > How could qdisc still be running after deactivating the device?
>
> qdisc could be running during the device deactivating process.
>
> The main process of changing channel number is as below:
>
> 1. dev_deactivate()
> 2. hns3 handware related setup
> 3. netif_set_real_num_tx_queues()
> 4. netif_tx_wake_all_queues()
> 5. dev_activate()
>
> During step 1, qdisc could be running while qdisc is resetting, so
> there could be skb left in the old qdisc(which will be restored back to
> txq->qdisc during dev_activate()), as below:
>
> CPU 0:   CPU1:
> __dev_queue_xmit():  dev_deactivate_many():
> rcu_read_lock_bh();  qdisc_deactivate(qdisc);
> q = rcu_dereference_bh(txq->qdisc); .
> netdev_core_pick_tx(dev, skb, sb_dev);  .
> .
> .   rcu_assign_pointer(dev_queue->qdisc, 
> qdisc_default);
> .   .
> .   .
> .   .
> .   .
> q->enqueue().


Well, like I said, if the deactivated bit were tested before ->enqueue(),
there would be no packet queued after qdisc_deactivate().


> .   .
> rcu_read_unlock_bh().
>
> And During step 3, txq->qdisc is pointing to noop_qdisc, so the qdisc_reset()
> only reset the noop_qdisc, but not the actual qdisc, which is stored in
> txq->qdisc_sleeping, so the actual qdisc may still have skb.
>
> When hns3_link_status_change() call step 4 and 5, it will restore all queue's
> qdisc using txq->qdisc_sleeping and schedule all queue with net_tx_action().
> The skb enqueued in step 1 may be dequeued and run

Re: [PATCH net-next] net: sch_generic: avoid concurrent reset and enqueue op for lockless qdisc

2020-09-01 Thread Cong Wang
On Tue, Sep 1, 2020 at 6:42 PM Yunsheng Lin  wrote:
>
> On 2020/9/2 2:24, Cong Wang wrote:
> > On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  wrote:
> >>
> >> Currently there is concurrent reset and enqueue operation for the
> >> same lockless qdisc when there is no lock to synchronize the
> >> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> >> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> >> out-of-bounds access for priv->ring[] in hns3 driver if user has
> >> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> >> skb with a larger queue_mapping after the corresponding qdisc is
> >> reset, and call hns3_nic_net_xmit() with that skb later.
> >
> > Can you be more specific here? Which call path requests a smaller
> > tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
> > we already have a synchronize_net() there.
>
> When the netdevice is in active state, the synchronize_net() seems to
> do the correct work, as below:
>
> CPU 0:   CPU1:
> __dev_queue_xmit()   netif_set_real_num_tx_queues()
> rcu_read_lock_bh();
> netdev_core_pick_tx(dev, skb, sb_dev);
> .
> .   dev->real_num_tx_queues = txq;
> .   .
> .   .
> .   synchronize_net();
> .   .
> q->enqueue().
> .   .
> rcu_read_unlock_bh().
> qdisc_reset_all_tx_gt
>
>

Right.


> but dev->real_num_tx_queues is not RCU-protected, maybe that is a problem
> too.
>
> The problem we hit is as below:
> In hns3_set_channels(), hns3_reset_notify(h, HNAE3_DOWN_CLIENT) is called
> to deactive the netdevice when user requested a smaller queue num, and
> txq->qdisc is already changed to noop_qdisc when calling
> netif_set_real_num_tx_queues(), so the synchronize_net() in the function
> netif_set_real_num_tx_queues() does not help here.

How could qdisc still be running after deactivating the device?


>
> >
> >>
> >> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> >> after assigning new qdisc to dev_queue->qdisc and before calling
> >> qdisc_deactivate() to make sure skb with larger queue_mapping
> >> enqueued to old qdisc will always be reset when qdisc_deactivate()
> >> is called.
> >
> > Like Eric said, it is not nice to call such a blocking function when
> > we have a large number of TX queues. Possibly we just need to
> > add a synchronize_net() as in netif_set_real_num_tx_queues(),
> > if it is missing.
>
> As above, the synchronize_net() in netif_set_real_num_tx_queues() seems
> to work when netdevice is in active state, but does not work when in
> deactive.

Please explain why a deactivated device still has its qdisc running?

At least before commit 379349e9bc3b4, we always tested the deactivate
bit before enqueueing. Are you complaining about that commit?
That commit is indeed suspicious; at the very least, it does not
precisely revert commit ba27b4cdaaa66561aaedb21 as it claims.
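
For reference, the removed test looked roughly like this in the NOLOCK
path of __dev_xmit_skb() (a simplified reconstruction, not the exact
code):

	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
		__qdisc_drop(skb, &to_free);
		rc = NET_XMIT_DROP;
	} else {
		rc = q->enqueue(skb, q, &to_free) & NET_XMIT_MASK;
		qdisc_run(q);
	}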


>
> And we do not want skbs left in the old qdisc when the netdevice is
> deactivated, right?

Yes, and more importantly, the qdisc should not be running after deactivation.

Thanks.


Re: [PATCH net-next] net: sch_generic: avoid concurrent reset and enqueue op for lockless qdisc

2020-09-01 Thread Cong Wang
On Mon, Aug 31, 2020 at 5:59 PM Yunsheng Lin  wrote:
>
> Currently there is concurrent reset and enqueue operation for the
> same lockless qdisc when there is no lock to synchronize the
> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
> out-of-bounds access for priv->ring[] in hns3 driver if user has
> requested a smaller queue num when __dev_xmit_skb() still enqueue a
> skb with a larger queue_mapping after the corresponding qdisc is
> reset, and call hns3_nic_net_xmit() with that skb later.

Can you be more specific here? Which call path requests a smaller
tx queue num? If you mean netif_set_real_num_tx_queues(), clearly
we already have a synchronize_net() there.

>
> Avoid the above concurrent op by calling synchronize_rcu_tasks()
> after assigning new qdisc to dev_queue->qdisc and before calling
> qdisc_deactivate() to make sure skb with larger queue_mapping
> enqueued to old qdisc will always be reset when qdisc_deactivate()
> is called.

Like Eric said, it is not nice to call such a blocking function when
we have a large number of TX queues. Possibly we just need to
add a synchronize_net() as in netif_set_real_num_tx_queues(),
if it is missing.

Thanks.


Re: [PATCH net-next] net/sched: add act_ct_output support

2020-08-28 Thread Cong Wang
On Tue, Aug 25, 2020 at 1:45 AM  wrote:
>
> From: wenxu 
>
> Fragmented packets are defragmented in the act_ct module. If the
> reassembled packet is to be sent out to another net device, this
> over-MTU packet should be fragmented before sending. This patch adds
> act ct_output to achieve this.

There are a lot of things missing in your changelog.

For example: why do we need a new action here? Why is segmentation
not done on the target device?

At least for the egress side, dev_queue_xmit() is called by act_mirred,
and it will perform segmentation with skb_gso_segment() if needed.
So why can bigger packets not be segmented here?
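
For reference, the existing egress path already does this, roughly
(simplified from validate_xmit_skb() in net/core/dev.c; the error label
belongs to the surrounding function):

	if (netif_needs_gso(skb, features)) {
		struct sk_buff *segs;

		segs = skb_gso_segment(skb, features);
		if (IS_ERR(segs)) {
			goto out_kfree_skb;
		} else if (segs) {
			consume_skb(skb);
			skb = segs;
		}
	}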

Please add all these necessary details into your changelog.

Thanks.


Re: [PATCH net-next] net/sched: add act_ct_output support

2020-08-28 Thread Cong Wang
On Tue, Aug 25, 2020 at 8:33 AM Marcelo Ricardo Leitner
 wrote:
> I still don't understand Cong's argument for not having this on
> act_mirred because TC is L2. That's actually not right. TC hooks at L2

You are missing a very important point: it is already too late to rename
act_mirred to reflect whatever new feature is added to it.


> but deals with L3 and L4 (after all, it does static NAT, mangles L4
> headers, and classifies based on virtually anything) since the
> beginning, and this is just another case.

So eventually you want TC to deal with all L3 stuff?? I think you are
exaggerating: modifying L3/L4 headers does not mean TC handles the L3
protocol. But doing IP-layer fragmentation is clearly doing something
that belongs to the IP protocol. Look at the code: you never need to
call into IP-layer code (except some trivial helpers) until you do CT
or fragmentation. This is why I do not like act_ct either; it fits
oddly into TC.

Why not just do segmentation instead of fragmentation? GSO is
already performed at L2 by software.

Thanks.


[Patch net] net_sched: fix error path in red_init()

2020-08-27 Thread Cong Wang
When ->init() fails, ->destroy() is called to clean up.
So it is unnecessary to clean up in red_init(); doing so
causes a refcount underflow.

Fixes: aee9caa03fc3 ("net: sched: sch_red: Add qevents "early_drop" and "mark"")
Reported-and-tested-by: syzbot+b33c1cb0a30ebdc8a...@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+e5ea5f8a3ecfd4427...@syzkaller.appspotmail.com
Cc: Petr Machata 
Signed-off-by: Cong Wang 
---
 net/sched/sch_red.c | 20 
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index deac82f3ad7b..e89fab6ccb34 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -353,23 +353,11 @@ static int red_init(struct Qdisc *sch, struct nlattr *opt,
  FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP,
  tb[TCA_RED_EARLY_DROP_BLOCK], extack);
if (err)
-   goto err_early_drop_init;
-
-   err = tcf_qevent_init(&q->qe_mark, sch,
- FLOW_BLOCK_BINDER_TYPE_RED_MARK,
- tb[TCA_RED_MARK_BLOCK], extack);
-   if (err)
-   goto err_mark_init;
-
-   return 0;
+   return err;
 
-err_mark_init:
-   tcf_qevent_destroy(&q->qe_early_drop, sch);
-err_early_drop_init:
-   del_timer_sync(&q->adapt_timer);
-   red_offload(sch, false);
-   qdisc_put(q->qdisc);
-   return err;
+   return tcf_qevent_init(&q->qe_mark, sch,
+  FLOW_BLOCK_BINDER_TYPE_RED_MARK,
+  tb[TCA_RED_MARK_BLOCK], extack);
 }
 
 static int red_change(struct Qdisc *sch, struct nlattr *opt,
-- 
2.28.0



Re: [PATCH net] tipc: not enable tipc when ipv6 works as a module

2020-08-17 Thread Cong Wang
On Mon, Aug 17, 2020 at 2:39 PM David Miller  wrote:
>
> From: Cong Wang 
> Date: Mon, 17 Aug 2020 13:59:46 -0700
>
> > Is this a new Kconfig feature? ipv6_stub was introduced for
> > VXLAN; at that time I don't remember having this kind of
> > Kconfig rule, otherwise it would not have been needed.
>
> The ipv6_stub exists in order to allow the troublesome
> "ipv6=m && feature_using_ipv6=y" combination.

Hmm, so "IPV6=m && TIPC=y" is not a concern here as you pick
this patch over adding a ipv6_stub?

Thanks.


Re: [PATCH net] tipc: not enable tipc when ipv6 works as a module

2020-08-17 Thread Cong Wang
On Mon, Aug 17, 2020 at 1:43 PM Randy Dunlap  wrote:
>
> On 8/17/20 1:29 PM, Cong Wang wrote:
> > On Mon, Aug 17, 2020 at 12:55 PM Randy Dunlap  wrote:
> >>
> >> TIPC=m and IPV6=m builds just fine.
> >>
> >> Having tipc autoload ipv6 is a different problem. (IMO)
> >>
> >>
> >> This Kconfig entry:
> >>  menuconfig TIPC
> >> tristate "The TIPC Protocol"
> >> depends on INET
> >> +   depends on IPV6 || IPV6=n
> >>
> >> says:
> >> If IPV6=n, TIPC can be y/m/n.
> >> If IPV6=y/m, TIPC is limited to whatever IPV6 is set to.
> >
> > Hmm, nowadays we _do_ have IPV6=y on popular distros.
> > So this means TIPC would have to be builtin after this patch??
>
> No, it does not mean that. We can still have IPV6=y and TIPC=m.
>
> Hm, maybe I should have said this instead:
>
>   If IPV6=y/m, TIPC is limited _by_ whatever IPV6 is set to.
> (instead of_to_ )
>
> Does that help any?
>
> The "limited" in Kconfig rules is a "less than or equal to"
> limit, where 'm' < 'y'.

Yeah, this is clearer now. If so, that means all the symbols
we have in ipv6_stub can go away now, assuming the module
dependency is automatically resolved when both are modules.

Is this a new Kconfig feature? ipv6_stub was introduced for
VXLAN; at that time I don't remember having this kind of
Kconfig rule, otherwise it would not have been needed.

I also glanced at Documentation/kbuild/kconfig-language.txt;
I do not see such a rule, so I guess the doc is not up to date.


> > Still sounds harsh, right?
> >
> > At least on my OpenSUSE I have CONFIG_IPV6=y and
> > CONFIG_TIPC=m.
> >
> >> TIPC cannot be =y unless IPV6=y.
> >
> > Interesting, I never correctly understood that "depends on"
> > behavior.
> >
> > But even if it builds, how could the TIPC module find and load the
> > IPV6 module? Does the IPV6 module automatically become its
> > dependency now?
>
> Sorry, I don't know about this.

You can check `modinfo tipc` output after compiling both as
modules. (Sorry that I only have CONFIG_MODULES=n here.)

Thanks.


Re: [PATCH net] tipc: not enable tipc when ipv6 works as a module

2020-08-17 Thread Cong Wang
On Mon, Aug 17, 2020 at 12:55 PM Randy Dunlap  wrote:
>
> TIPC=m and IPV6=m builds just fine.
>
> Having tipc autoload ipv6 is a different problem. (IMO)
>
>
> This Kconfig entry:
>  menuconfig TIPC
> tristate "The TIPC Protocol"
> depends on INET
> +   depends on IPV6 || IPV6=n
>
> says:
> If IPV6=n, TIPC can be y/m/n.
> If IPV6=y/m, TIPC is limited to whatever IPV6 is set to.

Hmm, nowadays we _do_ have IPV6=y on popular distros.
So this means TIPC would have to be builtin after this patch??
Still sounds harsh, right?

At least on my OpenSUSE I have CONFIG_IPV6=y and
CONFIG_TIPC=m.

> TIPC cannot be =y unless IPV6=y.

Interesting, I never correctly understood that "depends on"
behavior.

But even if it builds, how could the TIPC module find and load the
IPV6 module? Does the IPV6 module automatically become its
dependency now?

Thanks.


Re: [PATCH net-next 1/1] net/sched: Introduce skb hash classifier

2020-08-17 Thread Cong Wang
On Mon, Aug 17, 2020 at 4:19 AM Jamal Hadi Salim  wrote:
>
> On 2020-08-16 2:59 p.m., Cong Wang wrote:
> > On Thu, Aug 13, 2020 at 5:52 AM Jamal Hadi Salim  wrote:
>
>
> [..]
> >> How do you know whether to use hash or mark or both
> >> for that specific key?
> >
> > Hmm, you can just unconditionally pass skb->hash and skb->mark,
> > no? Something like:
> >
> > if (filter_parameter_has_hash) {
> >  match skb->hash with cls->param_hash
> > }
> >
> > if (filter_parameter_has_mark) {
> >  match skb->mark with cls->param_mark
> > }
> >
>  >
> > fw_classify() uses skb->mark unconditionally anyway, without checking
> > whether it is set or not first.
> >
>
> There is no ambiguity of intent in the fw case, there is only one field.
> In the case of having multiple fields it is ambigious if you
> unconditionally look.
>
> Example: policy says to match skb mark of 5 and hash of 3.
> If a packet arrives with skb->mark 5 and skb->hash 3, it
> very clearly matched the intent of the policy.
> If a packet arrives with skb->mark 7 and hash 3, it clearly
> did not match the intent. etc.

This example clearly shows there is no ambiguity, right? ;)


>
> > But if filters were put in a global hashtable, the above would be
> > much harder to implement.
> >
>
> Ok, yes. My assumption has been you will have some global shared
> structure where all filters will be installed on.

Sure, if not a hashtable, we could simply put them in a list:

list_for_each_filter {
  if (filter_parameter_has_hash) {
match skb->hash with cls->param_hash
  }
  if (filter_parameter_has_mark) {
match skb->mark with cls->param_mark
  }
}


>
> I think i may have misunderstood all along what you were saying
> which is:
>
> a) add the rules so they are each _independent with different
> priorities_ in a chain.

Yes, because this gives users the freedom to pick a prio different
from the value (hash or mark).


>
> b)  when i do lookup for packet arrival, i will only see a filter
>   that matches "match mark 5 and hash 3" (meaning there is no
>   ambiguity on intent). If packet data doesnt match policy then
>   i will iterate to another filter on the chain list with lower
>   priority.

Right. Multiple values mean AND, not OR, so if you specify
mark 5 and hash 3, it will match skb->mark==5 && skb->hash==3.
If not matched, it will continue the iteration until the end.
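
In C-like form, the lookup would be something like this (a sketch only;
the field and list names are made up):

	list_for_each_entry(f, &head->filters, link) {
		/* AND semantics: every key the filter specifies must match */
		if (f->has_hash && skb->hash != f->hash)
			continue;
		if (f->has_mark && skb->mark != f->mark)
			continue;
		*res = f->res;	/* first (highest-prio) match wins */
		return 0;
	}
	return -1;	/* no match */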

>
> Am i correct in my understanding?
>
> If i am - then we still have a problem with lookup scale in presence
> of a large number of filters since essentially this approach
> is linear lookup (similar problem iptables has). I am afraid
> a hash table or something with similar principle goals is needed.

Yeah, this is why I asked you in previous emails whether we have to
put them in a hashtable: since a hashtable organizes them by a key,
it is hard to combine multiple fields into one key while still
allowing easy extension in the future. But other people smarter than
me may have better ideas here.

Thanks.


Re: [PATCH net] tipc: not enable tipc when ipv6 works as a module

2020-08-17 Thread Cong Wang
On Mon, Aug 17, 2020 at 12:00 PM Randy Dunlap  wrote:
>
> On 8/17/20 11:55 AM, Cong Wang wrote:
> > On Mon, Aug 17, 2020 at 11:49 AM Randy Dunlap  wrote:
> >>
> >> On 8/17/20 11:31 AM, Cong Wang wrote:
> >>> On Sun, Aug 16, 2020 at 11:37 PM Xin Long  wrote:
> >>>>
> >>>> On Mon, Aug 17, 2020 at 2:29 AM Cong Wang  
> >>>> wrote:
> >>>>>
> >>>>> Or put it into struct ipv6_stub?
> >>>> Hi Cong,
> >>>>
> >>>> That could be one way. We may do it when this new function becomes more 
> >>>> common.
> >>>> By now, I think it's okay to make TIPC depend on IPV6 || IPV6=n.
> >>>
> >>> I am not a fan of IPV6=m, but disallowing it for one symbol seems
> >>> too harsh.
> >>
> >> Hi,
> >>
> >> Maybe I'm not following you, but this doesn't disallow IPV6=m.
> >
> > Well, by "disallowing IPV6=m" I meant "disallowing IPV6=m when
> > enabling TIPC", for sure... Sorry that it misled you into believing
> > I meant completely disallowing IPV6=m globally.
> >
> >>
> >> It just restricts how TIPC can be built, so that
> >> TIPC=y and IPV6=m cannot happen together, which causes
> >> a build error.
> >
> > It also disallows TIPC=m and IPV6=m, right? In short, it disallows
> > IPV6=m when TIPC is enabled. And this is exactly what I am
> > complaining about, as it looks too harsh.
>
> I haven't tested that specifically, but that should work.
> This patch won't prevent that from working.

Please give it a try. I do not see how it allows IPV6=m and TIPC=m
but disallows IPV6=m and TIPC=y.

>
> We have loadable modules calling other loadable modules
> all over the kernel.

True, we rely on request_module(). But I do not see TIPC calling
request_module() to load the IPV6 module before using "ipv6_dev_find".
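
For reference, the usual runtime pattern for such a module-to-module
dependency is the sketch below (TIPC does not do this today, which is
exactly the point):

/* Ask modprobe to load ipv6 before relying on its symbols;
 * request_module() returns 0 on success. */
err = request_module("ipv6");
if (err < 0)
	return err;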

Thanks.


Re: [PATCH v5 1/3] net: introduce helper sendpage_ok() in include/linux/net.h

2020-08-17 Thread Cong Wang
On Sun, Aug 16, 2020 at 10:45 PM Christoph Hellwig  wrote:
>
> On Sun, Aug 16, 2020 at 10:55:09AM -0700, Cong Wang wrote:
> > On Sun, Aug 16, 2020 at 1:36 AM Coly Li  wrote:
> > >
> > > The original problem was from the nvme-over-tcp code, which mistakenly
> > > uses kernel_sendpage() to send pages allocated by __get_free_pages()
> > > without the __GFP_COMP flag. Such pages don't have a refcount
> > > (page_count is 0) on tail pages; sending them by kernel_sendpage() may
> > > trigger a kernel panic from a corrupted kernel heap, because these
> > > pages are incorrectly freed in the network stack as page_count 0 pages.
> > >
> > > This patch introduces a helper, sendpage_ok(); it returns true if the
> > > page being checked
> > > - is not a slab page: PageSlab(page) is false.
> > > - has a page refcount: page_count(page) is not zero.
> > >
> > > All drivers that want to send a page to the remote end by
> > > kernel_sendpage() may use this helper to check whether the page is OK.
> > > If the helper does not return true, the driver should try another
> > > non-sendpage method (e.g. sock_no_sendpage()) to handle the page.
> >
> > Can we leave this helper to mm subsystem?
> >
> > I know it is for sendpage, but its implementation is all about some
> > mm details and its two callers do not belong to net subsystem either.
> >
> > Think of this another way: who would fix it if it is buggy? I bet mm people
> > should. ;)
>
> No.  This is all about a really unusual limitation in sendpage, which

So netdev people will have to understand and support PageSlab() or
page_count()?

If it is unusual even for mm people, how are netdev people supposed
to understand this unusual mm bug? At least not any better.

> is pretty much unexpected.  In fact the best thing would be to make
> sock_sendpage do the right thing and call sock_no_sendpage based
> on this condition, so that driver writers don't have to worry at all.

Agreed, but kernel_sendpage() still relies on mm to provide a helper
to make the decision and ensure this helper is always up-to-date.

In short, it is all about ownership.
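
Based on the two conditions quoted above, the helper under discussion
boils down to something like this sketch (the proposed semantics, not
necessarily the final merged code):

/* True if the page is safe to pass to kernel_sendpage():
 * not a slab page, and holding a non-zero refcount. */
static inline bool sendpage_ok(struct page *page)
{
	return !PageSlab(page) && page_count(page) >= 1;
}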

Thanks.


Re: [PATCH net] tipc: not enable tipc when ipv6 works as a module

2020-08-17 Thread Cong Wang
On Mon, Aug 17, 2020 at 11:49 AM Randy Dunlap  wrote:
>
> On 8/17/20 11:31 AM, Cong Wang wrote:
> > On Sun, Aug 16, 2020 at 11:37 PM Xin Long  wrote:
> >>
> >> On Mon, Aug 17, 2020 at 2:29 AM Cong Wang  wrote:
> >>>
> >>> Or put it into struct ipv6_stub?
> >> Hi Cong,
> >>
> >> That could be one way. We may do it when this new function becomes more 
> >> common.
> >> For now, I think it's okay to make TIPC depend on IPV6 || IPV6=n.
> >
> > I am not a fan of IPV6=m, but disallowing it for one symbol seems
> > too harsh.
>
> Hi,
>
> Maybe I'm not following you, but this doesn't disallow IPV6=m.

Well, by "disallowing IPV6=m" I meant "disallowing IPV6=m when
enabling TIPC" for sure... Sorry that it misled you into believing
I meant completely disallowing IPV6=m globally.

>
> It just restricts how TIPC can be built, so that
> TIPC=y and IPV6=m cannot happen together, which causes
> a build error.

It also disallows TIPC=m and IPV6=m, right? In short, it disallows
IPV6=m when TIPC is enabled. And this is exactly what I complain,
as it looks too harsh.

Thanks.


Re: [PATCH net] tipc: not enable tipc when ipv6 works as a module

2020-08-17 Thread Cong Wang
On Sun, Aug 16, 2020 at 11:37 PM Xin Long  wrote:
>
> On Mon, Aug 17, 2020 at 2:29 AM Cong Wang  wrote:
> >
> > Or put it into struct ipv6_stub?
> Hi Cong,
>
> That could be one way. We may do it when this new function becomes more 
> common.
> For now, I think it's okay to make TIPC depend on IPV6 || IPV6=n.

I am not a fan of IPV6=m, but disallowing it for one symbol seems
too harsh.


Re: [PATCH net-next 1/1] net/sched: Introduce skb hash classifier

2020-08-16 Thread Cong Wang
On Thu, Aug 13, 2020 at 5:52 AM Jamal Hadi Salim  wrote:
>
> The _main_ requirement is to scale to a large number of filters
> (a million is a good handwave number). Scale means
> 1) fast datapath lookup time + 2) fast insertion/deletion/get/dump
> from control/user space.
> fwmark is good at all these goals today for #2. It is good for #1 for
> maybe 1K rules (limitation is the 256 buckets, constrained by rcu
> trickery). Then you start having collisions in a bucket and your
> lookup requires long linked list walks.
>
> Generally something like a hash table with sufficient number of buckets
> will work out ok.
> There may be other approaches (IDR in the kernel looks interesting,
> but i didn't look closely).
>
> So to the implementation issue:
> Major issue is removing ambiguity while at the same time trying
> to get good performance.
>
> Let's say we decided to classify skbmark and skbhash at this point.
> For a hash table, one simple approach is to set
> lookupkey = hash<<32|mark
>
> the key is used as input to the hash algo to find the bucket.
>
> There are two outstanding challenges in my mind:
>
> 1)  To use the policy like you describe above
> as an example:
>
> $TC filter add dev $DEV1 parent ffff: protocol ip prio 3 handle 1
> skb hash Y flowid 1:12 action ok
>
> and say you receive a packet with both skb->hash and skb->mark set
> Then there is ambiguity
>
> How do you know whether to use hash or mark or both
> for that specific key?

Hmm, you can just unconditionally pass skb->hash and skb->mark,
no? Something like:

if (filter_parameter_has_hash) {
	match skb->hash with cls->param_hash
}

if (filter_parameter_has_mark) {
	match skb->mark with cls->param_mark
}

fw_classify() uses skb->mark unconditionally anyway, without checking
whether it is set or not first.

But if filters were put in a global hashtable, the above would be
much harder to implement.


> You can probably do some trick but I can't think of a cheap way to
> achieve this goal. Of course this issue doesnt exist if you have
> separate classifiers.
>
> 2) If you decide tomorrow to add tcindex/prio etc, you will have to
> rework this as well.
>
> #2 is not as a big deal as #1.

Well, I think #2 is more serious than #1, if we have to use a hashtable.
(If we don't have to, then it would be much easier to extend, of course.)

Thanks.


Re: [PATCH net] tipc: not enable tipc when ipv6 works as a module

2020-08-16 Thread Cong Wang
On Sun, Aug 16, 2020 at 4:54 AM Xin Long  wrote:
>
> When using ipv6_dev_find() in one module, it requires ipv6 not to
> work as a module. Otherwise, this error occurs in build:
>
>   undefined reference to `ipv6_dev_find'.
>
> So fix it by adding "depends on IPV6 || IPV6=n" to tipc/Kconfig,
> as it does in sctp/Kconfig.

Or put it into struct ipv6_stub?


Re: [PATCH v5 1/3] net: introduce helper sendpage_ok() in include/linux/net.h

2020-08-16 Thread Cong Wang
On Sun, Aug 16, 2020 at 1:36 AM Coly Li  wrote:
>
> The original problem was from the nvme-over-tcp code, which mistakenly
> uses kernel_sendpage() to send pages allocated by __get_free_pages()
> without the __GFP_COMP flag. Such pages don't have a refcount
> (page_count is 0) on tail pages; sending them by kernel_sendpage() may
> trigger a kernel panic from a corrupted kernel heap, because these
> pages are incorrectly freed in the network stack as page_count 0 pages.
>
> This patch introduces a helper, sendpage_ok(); it returns true if the
> page being checked
> - is not a slab page: PageSlab(page) is false.
> - has a page refcount: page_count(page) is not zero.
>
> All drivers that want to send a page to the remote end by
> kernel_sendpage() may use this helper to check whether the page is OK.
> If the helper does not return true, the driver should try another
> non-sendpage method (e.g. sock_no_sendpage()) to handle the page.

Can we leave this helper to mm subsystem?

I know it is for sendpage, but its implementation is all about some
mm details and its two callers do not belong to net subsystem either.

> Think of this another way: who would fix it if it is buggy? I bet mm people
should. ;)

Thanks.


[Patch net] tipc: fix uninit skb->data in tipc_nl_compat_dumpit()

2020-08-15 Thread Cong Wang
__tipc_nl_compat_dumpit() has two callers, and it expects them to
pass a valid nlmsghdr via arg->data. This header is artificial and
crafted just for __tipc_nl_compat_dumpit().

tipc_nl_compat_publ_dump() does so by putting a genlmsghdr as well
as some nested attribute, TIPC_NLA_SOCK. But the other caller
tipc_nl_compat_dumpit() does not, which leaves arg->data uninitialized
on this call path.

Fix this by just adding a similar nlmsghdr without any payload in
tipc_nl_compat_dumpit().

This bug has existed since day 1, but the recent commit 6ea67769ff33
("net: tipc: prepare attrs in __tipc_nl_compat_dumpit()") makes it
easier to trigger.

Reported-and-tested-by: syzbot+0e7181deafa7e0b79...@syzkaller.appspotmail.com
Fixes: d0796d1ef63d ("tipc: convert legacy nl bearer dump to nl compat")
Cc: Jon Maloy 
Cc: Ying Xue 
Cc: Richard Alpe 
Signed-off-by: Cong Wang 
---
 net/tipc/netlink_compat.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/net/tipc/netlink_compat.c b/net/tipc/netlink_compat.c
index 217516357ef2..90e3c70a91ad 100644
--- a/net/tipc/netlink_compat.c
+++ b/net/tipc/netlink_compat.c
@@ -275,8 +275,9 @@ static int __tipc_nl_compat_dumpit(struct tipc_nl_compat_cmd_dump *cmd,
 static int tipc_nl_compat_dumpit(struct tipc_nl_compat_cmd_dump *cmd,
 struct tipc_nl_compat_msg *msg)
 {
-   int err;
+   struct nlmsghdr *nlh;
struct sk_buff *arg;
+   int err;
 
if (msg->req_type && (!msg->req_size ||
  !TLV_CHECK_TYPE(msg->req, msg->req_type)))
@@ -305,6 +306,15 @@ static int tipc_nl_compat_dumpit(struct tipc_nl_compat_cmd_dump *cmd,
return -ENOMEM;
}
 
+   nlh = nlmsg_put(arg, 0, 0, tipc_genl_family.id, 0, NLM_F_MULTI);
+   if (!nlh) {
+   kfree_skb(arg);
+   kfree_skb(msg->rep);
+   msg->rep = NULL;
+   return -EMSGSIZE;
+   }
+   nlmsg_end(arg, nlh);
+
err = __tipc_nl_compat_dumpit(cmd, msg, arg);
if (err) {
kfree_skb(msg->rep);
-- 
2.28.0



[Patch net] bonding: fix a potential double-unregister

2020-08-15 Thread Cong Wang
When we tear down a network namespace, we unregister all
the netdevices within it. So we may queue a slave device
and a bonding device together in the same unregister queue.

If the only slave device is non-ethernet, it would
automatically unregister the bonding device as well. Thus,
we may end up unregistering the bonding device twice.

Work around this special case by checking reg_state.

Fixes: 9b5e383c11b0 ("net: Introduce unregister_netdevice_many()")
Reported-by: syzbot+af23e7f3e0a7e10c8...@syzkaller.appspotmail.com
Cc: Eric Dumazet 
Cc: Andy Gospodarek 
Cc: Jay Vosburgh 
Signed-off-by: Cong Wang 
---
 drivers/net/bonding/bond_main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5ad43aaf76e5..995fcb4eed92 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2205,7 +2205,8 @@ static int bond_release_and_destroy(struct net_device *bond_dev,
int ret;
 
ret = __bond_release_one(bond_dev, slave_dev, false, true);
-   if (ret == 0 && !bond_has_slaves(bond)) {
+   if (ret == 0 && !bond_has_slaves(bond) &&
+   bond_dev->reg_state != NETREG_UNREGISTERING) {
bond_dev->priv_flags |= IFF_DISABLE_NETPOLL;
netdev_info(bond_dev, "Destroying bond\n");
bond_remove_proc_entry(bond);
-- 
2.28.0



Re: [PATCH v2] net: openvswitch: introduce common code for flushing flows

2020-08-13 Thread Cong Wang
On Wed, Aug 12, 2020 at 2:59 AM  wrote:
>
> From: Tonghao Zhang 
>
> To avoid some issues, for example an RCU usage warning and a double free,
> we should flush the flows under ovs_lock. This patch refactors
> table_instance_destroy and introduces table_instance_flow_flush
> which can be invoked by __dp_destroy or ovs_flow_tbl_flush.
>
> Fixes: 50b0e61b32ee ("net: openvswitch: fix possible memleak on destroy flow-table")
> Reported-by: Johan Knöös 
> Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2020-August/050489.html
> Signed-off-by: Tonghao Zhang 

Reviewed-by: Cong Wang 

Thanks.


Re: [PATCH] net: Fix potential memory leak in proto_register()

2020-08-12 Thread Cong Wang
On Wed, Aug 12, 2020 at 2:21 AM linmiaohe  wrote:
>
> Hi all:
> David Miller  wrote:
> >From: Cong Wang 
> >Date: Tue, 11 Aug 2020 16:02:51 -0700
> >
> >>> @@ -3406,6 +3406,16 @@ static void sock_inuse_add(struct net *net,
> >>> int val)  }  #endif
> >>>
> >>> +static void tw_prot_cleanup(struct timewait_sock_ops *twsk_prot) {
> >>> +   if (!twsk_prot)
> >>> +   return;
> >>> +   kfree(twsk_prot->twsk_slab_name);
> >>> +   twsk_prot->twsk_slab_name = NULL;
> >>> +   kmem_cache_destroy(twsk_prot->twsk_slab);
> >>
> >> Hmm, are you sure you can free the kmem cache name before
> >> kmem_cache_destroy()? To me, it seems kmem_cache_destroy() frees the
> >> name via slab_kmem_cache_release() via kfree_const().
> >> With your patch, we have a double-free on the name?
> >>
> >> Or am I missing anything?
> >
> >Yep, there is a double free here.
> >
> >Please fix this.
>
> Many thanks to both of you for pointing this issue out. But I don't really
> understand; could you please explain it more?
> As far as I can see, the double free path is:
> 1. kfree(twsk_prot->twsk_slab_name)
> 2. kmem_cache_destroy
> --> shutdown_memcg_caches
> --> shutdown_cache
> --> slab_kmem_cache_release
> --> kfree_const(s->name)
> But twsk_prot->twsk_slab_name is allocated from kasprintf via 
> kmalloc_track_caller while twsk_prot->twsk_slab->name is allocated
> via kstrdup_const. So I think twsk_prot->twsk_slab_name and 
> twsk_prot->twsk_slab->name point to different memory, and there is no double 
> free.
>

You are right. Since it is duplicated, there is no need to keep
->twsk_slab_name; we can just use twsk_slab->name. I will send
a patch to get rid of it.

> Or am I missing anything?
>
> By the way, req_prot_cleanup() does the same thing, i.e. frees the slab_name
> before invoking kmem_cache_destroy(). If there were a double
> free, would it apply here as well?

Ditto. It can just be removed.
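
The promised cleanup would look roughly like the sketch below; since
the slab cache keeps its own kstrdup_const() copy of the name,
twsk_slab_name can go away entirely:

static void tw_prot_cleanup(struct timewait_sock_ops *twsk_prot)
{
	if (!twsk_prot)
		return;
	/* kmem_cache_destroy() frees the cache's own copy of the
	 * name via kfree_const(), so no separate name is kept. */
	kmem_cache_destroy(twsk_prot->twsk_slab);
	twsk_prot->twsk_slab = NULL;
}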

Thanks.


Re: [PATCH] net: openvswitch: introduce common code for flushing flows

2020-08-11 Thread Cong Wang
On Mon, Aug 10, 2020 at 6:14 PM  wrote:
>
> From: Tonghao Zhang 
>
> To avoid some issues, for example an RCU usage warning, we should
> flush the flows under ovs_lock. This patch refactors
> table_instance_destroy and introduces table_instance_flow_flush
> which can be invoked by __dp_destroy or ovs_flow_tbl_flush.
>
> Signed-off-by: Tonghao Zhang 

Please add a Fixes tag here; I think it is probably your memory leak fix
that introduced this issue. And a Reported-by, to give credit to the bug
reporters.

Plus one minor issue below:

> -static void table_instance_destroy(struct flow_table *table,
> -  struct table_instance *ti,
> -  struct table_instance *ufid_ti,
> -  bool deferred)
> +/* Must be called with OVS mutex held. */
> +void table_instance_flow_flush(struct flow_table *table,
> +  struct table_instance *ti,
> +  struct table_instance *ufid_ti)
>  {
> int i;
>
> -   if (!ti)
> -   return;
> -
> -   BUG_ON(!ufid_ti);
> if (ti->keep_flows)
> -   goto skip_flows;
> +   return;
>
> for (i = 0; i < ti->n_buckets; i++) {
> -   struct sw_flow *flow;
> struct hlist_head *head = &ti->buckets[i];
> struct hlist_node *n;
> +   struct sw_flow *flow;

This is at most a coding style change; please do not mix
coding style changes into bug fixes. You can always push coding
style changes separately when net-next is open.

Thanks.


Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-11 Thread Cong Wang
On Mon, Aug 10, 2020 at 10:59 PM Tonghao Zhang  wrote:
>
> On Tue, Aug 11, 2020 at 12:08 PM Cong Wang  wrote:
> >
> > On Mon, Aug 10, 2020 at 8:27 PM Tonghao Zhang  
> > wrote:
> > >
> > > On Tue, Aug 11, 2020 at 10:24 AM Cong Wang  
> > > wrote:
> > > >
> > > > On Mon, Aug 10, 2020 at 6:16 PM Tonghao Zhang 
> > > >  wrote:
> > > > > Hi all, I sent a patch to fix this. The rcu warnings disappear. I
> > > > > can't reproduce the double free issue.
> > > > > But I guess this patch may address this issue.
> > > > >
> > > > > http://patchwork.ozlabs.org/project/netdev/patch/20200811011001.75690-1-xiangxia.m@gmail.com/
> > > >
> > I don't see how your patch addresses the double-free, as we still
> > free the mask array twice after your patch: once in tbl_mask_array_realloc()
> > > > and once in ovs_flow_tbl_destroy().
> > > Hi Cong.
> > > Before my patch, we use the ovsl_dereference
> > > (rcu_dereference_protected) in the rcu callback.
> > > ovs_flow_tbl_destroy
> > > ->table_instance_destroy
> > > ->table_instance_flow_free
> > > ->flow_mask_remove
> > > ASSERT_OVSL(will print warning)
> > > ->tbl_mask_array_del_mask
> > > ovsl_dereference(rcu usage warning)
> > >
> >
> > I understand how your patch addresses the RCU annotation issue,
> > which is different from double-free.
> >
> >
> > > so we should invoke the table_instance_destroy or others under
> > > ovs_lock to avoid (ASSERT_OVSL and rcu usage warning).
> >
> > Of course... I never doubt it.
> >
> >
> > > with this patch, we reallocate the mask_array under ovs_lock, and free
> > > it in the rcu callback. Without it, we  reallocate and free it in the
> > > rcu callback.
> > > I think we may fix it with this patch.
> >
> > Does it matter in which context tbl_mask_array_realloc() is called?
> > Even with ovs_lock, we can still double free:
> >
> > ovs_lock()
> > tbl_mask_array_realloc()
> >  => call_rcu(&old->rcu, mask_array_rcu_cb);
> > ovs_unlock()
> > ...
> > ovs_flow_tbl_destroy()
> >  => call_rcu(&old->rcu, mask_array_rcu_cb);
> >
> > So still twice, right? To fix the double-free, we have to eliminate one
> > of them, don't we? ;)
> No
> Without my patch: in rcu callback:
> ovs_flow_tbl_destroy
> ->call_rcu(&ma->rcu, mask_array_rcu_cb);
> ->table_instance_destroy
> ->tbl_mask_array_realloc(Shrink the mask array if necessary)
> ->call_rcu(&old->rcu, mask_array_rcu_cb);
>
> With the patch:
> ovs_lock
> table_instance_flow_flush (free the flow)
> tbl_mask_array_realloc(shrink the mask array if necessary, will free
> mask_array in rcu(mask_array_rcu_cb) and rcu_assign_pointer new
> mask_array)
> ovs_unlock
>
> in rcu callback:
> ovs_flow_tbl_destroy
> call_rcu(&ma->rcu, mask_array_rcu_cb);(that is new mask_array)

Ah, I see. I thought the mask array was cached in the caller and passed
along; it is in fact refetched via &dp->table.

Thanks.


Re: [PATCH net-next 1/1] net/sched: Introduce skb hash classifier

2020-08-11 Thread Cong Wang
On Sun, Aug 9, 2020 at 4:41 PM Jamal Hadi Salim  wrote:
>
> Interesting idea. Note: my experience is that the typical setup is
> to have only one of those (from an offload perspective). Ariel,
> do your use cases require, say, both fields?
>
> From a policy perspective, i think the above will get more complex
> mostly because you have to deal with either mark or hash
> being optional. It also opens doors for more complex matching
> requirements. Example "match mark X AND hash Y" and
> "match mark X OR hash Y".
> The new classifier will have to deal with that semantic.
>
> With fw and hash being separate, the complex/optional semantics are easy:
>
> "match mark X AND hash Y":
> $TC filter add dev $DEV1 parent ffff: protocol ip prio 3 handle X
> skbhash flowid 1:12 action continue
> $TC filter add dev $DEV1 parent ffff: protocol ip prio 4 handle Y fw
> flowid 1:12 action ok
>
> "match mark X OR hash Y":
> $TC filter add dev $DEV1 parent ffff: protocol ip prio 3 handle X
> skbhash flowid 1:12 action ok
> $TC filter add dev $DEV1 parent ffff: protocol ip prio 4 handle Y fw
> flowid 1:12 action ok

Not sure if I get you correctly, but with a combined implementation
you can do the above too, right? Something like:

(AND case)
$TC filter add dev $DEV1 parent ffff: protocol ip prio 3 handle 1
skb hash Y mark X flowid 1:12 action ok

(OR case)
$TC filter add dev $DEV1 parent ffff: protocol ip prio 3 handle 1
skb hash Y flowid 1:12 action ok
$TC filter add dev $DEV1 parent ffff: protocol ip prio 4 handle 2
skb mark X flowid 1:12 action ok

Side note: you don't have to use handle as the value of hash/mark,
which gives people freedom to choose different handles.


>
> Then the question is how to implement it. Is it one hash table for
> both or two (one for mark and one for hash), etc.?
>

Good question. I am not sure, maybe no hash table at all?
Unless there are a lot of filters, we do not have to organize
them in a hash table, do we?

Thanks.


Re: [PATCH] net: Fix potential memory leak in proto_register()

2020-08-11 Thread Cong Wang
On Mon, Aug 10, 2020 at 5:19 AM Miaohe Lin  wrote:
>
> If we failed to assign proto idx, we free the twsk_slab_name but forget to
> free the twsk_slab. Add a helper function tw_prot_cleanup() to free these
> together and also use this helper function in proto_unregister().
>
> Fixes: b45ce32135d1 ("sock: fix potential memory leak in proto_register()")
> Signed-off-by: Miaohe Lin 
> ---
>  net/core/sock.c | 25 +++--
>  1 file changed, 15 insertions(+), 10 deletions(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 49cd5ffe673e..c9083ad44ea1 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -3406,6 +3406,16 @@ static void sock_inuse_add(struct net *net, int val)
>  }
>  #endif
>
> +static void tw_prot_cleanup(struct timewait_sock_ops *twsk_prot)
> +{
> +   if (!twsk_prot)
> +   return;
> +   kfree(twsk_prot->twsk_slab_name);
> +   twsk_prot->twsk_slab_name = NULL;
> +   kmem_cache_destroy(twsk_prot->twsk_slab);

Hmm, are you sure you can free the kmem cache name before
kmem_cache_destroy()? To me, it seems kmem_cache_destroy()
frees the name via slab_kmem_cache_release() via kfree_const().
With your patch, we have a double-free on the name?

Or am I missing anything?

Thanks.


Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-10 Thread Cong Wang
On Mon, Aug 10, 2020 at 8:27 PM Tonghao Zhang  wrote:
>
> On Tue, Aug 11, 2020 at 10:24 AM Cong Wang  wrote:
> >
> > On Mon, Aug 10, 2020 at 6:16 PM Tonghao Zhang  
> > wrote:
> > > Hi all, I sent a patch to fix this. The rcu warnings disappear. I
> > > can't reproduce the double free issue.
> > > But I guess this patch may address this issue.
> > >
> > > http://patchwork.ozlabs.org/project/netdev/patch/20200811011001.75690-1-xiangxia.m@gmail.com/
> >
> > I don't see how your patch addresses the double-free, as we still
> > free the mask array twice after your patch: once in tbl_mask_array_realloc()
> > and once in ovs_flow_tbl_destroy().
> Hi Cong.
> Before my patch, we use the ovsl_dereference
> (rcu_dereference_protected) in the rcu callback.
> ovs_flow_tbl_destroy
> ->table_instance_destroy
> ->table_instance_flow_free
> ->flow_mask_remove
> ASSERT_OVSL(will print warning)
> ->tbl_mask_array_del_mask
> ovsl_dereference(rcu usage warning)
>

I understand how your patch addresses the RCU annotation issue,
which is different from double-free.


> so we should invoke the table_instance_destroy or others under
> ovs_lock to avoid (ASSERT_OVSL and rcu usage warning).

Of course... I never doubt it.


> with this patch, we reallocate the mask_array under ovs_lock, and free
> it in the rcu callback. Without it, we  reallocate and free it in the
> rcu callback.
> I think we may fix it with this patch.

Does it matter in which context tbl_mask_array_realloc() is called?
Even with ovs_lock, we can still double free:

ovs_lock()
tbl_mask_array_realloc()
 => call_rcu(&old->rcu, mask_array_rcu_cb);
ovs_unlock()
...
ovs_flow_tbl_destroy()
 => call_rcu(&old->rcu, mask_array_rcu_cb);

So still twice, right? To fix the double-free, we have to eliminate one
of them, don't we? ;)


>
> > Have you tried my patch which is supposed to address this double-free?
> I can't reproduce it. But your patch does not avoid the rcu usage warning
> and ASSERT_OVSL.

Sure, I never intended to fix anything other than the double-free. The $subject is
about double free, I double checked. ;)

Thanks.


Re: [Linux-kernel-mentees] [PATCH net] ipvs: Fix uninit-value in do_ip_vs_set_ctl()

2020-08-10 Thread Cong Wang
On Mon, Aug 10, 2020 at 3:10 PM Peilin Ye  wrote:
>
> do_ip_vs_set_ctl() is referencing an uninitialized stack value when `len` is
> zero. Fix it.

Which exact 'cmd' is it here?

I _guess_ it is one of those left uninitialized in set_arglen[], which is 0.
But if that is the case, shouldn't it be initialized to
sizeof(struct ip_vs_service_user) instead, because ip_vs_copy_usvc_compat()
is called anyway? Or maybe we should just ban the len==0 case.

In either case, it does not look like you fix it correctly.
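
If we go the banning route, it is a one-liner at the top of
do_ip_vs_set_ctl() (a sketch only; whether len==0 is ever legitimate
still needs confirming):

	/* Reject zero-length input instead of reading
	 * uninitialized stack bytes later. */
	if (len == 0)
		return -EINVAL;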

Thanks.


Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-10 Thread Cong Wang
On Mon, Aug 10, 2020 at 6:16 PM Tonghao Zhang  wrote:
> Hi all, I sent a patch to fix this. The rcu warnings disappear. I
> can't reproduce the double free issue.
> But I guess this patch may address this issue.
>
> http://patchwork.ozlabs.org/project/netdev/patch/20200811011001.75690-1-xiangxia.m@gmail.com/

I don't see how your patch addresses the double-free, as we still
free the mask array twice after your patch: once in tbl_mask_array_realloc()
and once in ovs_flow_tbl_destroy().

Have you tried my patch which is supposed to address this double-free?
It simply skips the reallocation as it makes no sense to trigger reallocation
when destroying it.

Thanks.


Re: [PATCH net-next 1/1] net/sched: Introduce skb hash classifier

2020-08-09 Thread Cong Wang
On Fri, Aug 7, 2020 at 3:28 PM Jamal Hadi Salim  wrote:
>
> From: Jamal Hadi Salim 
>
> This classifier, in the same spirit as the tc skb mark classifier,
> provides a generic (fast lookup) approach to filter on the hash value
> and optional mask.
>
> like skb->mark, skb->hash could be set by multiple entities in the
> datapath including but not limited to hardware offloaded classification
> policies, hardware RSS (which already sets it today), XDP, ebpf programs
> and even by other tc actions like skbedit and IFE.

Looks like a lot of code duplication with cls_fw; is there any way to
reuse the code?

And I wonder if it is time to introduce a filter to match multiple skb
fields, as adding a filter for each skb field clearly does not scale.
Perhaps something like:

$TC filter add dev $DEV1 parent ffff: protocol ip prio 3 handle X skb \
hash Y mark Z flowid 1:12

Thanks.


Re: [PATCH] [net/ipv6] ip6_output: Add ipv6_pinfo null check

2020-08-08 Thread Cong Wang
On Sat, Aug 8, 2020 at 10:07 AM Gaurav Singh  wrote:
>
> This PR fixes a possible segmentation violation.
>
> In function: ip6_xmit(), we have
> const struct ipv6_pinfo *np = inet6_sk(sk); which returns NULL
> unconditionally (regardless of sk being NULL or not).
>
> In include/linux/ipv6.h:
>
> static inline struct ipv6_pinfo * inet6_sk(const struct sock *__sk)
> {
> return NULL;
> }
>

Tell us who will use ip6_autoflowlabel() when CONFIG_IPV6
is disabled. :)


Re: [ovs-discuss] Double free in recent kernels after memleak fix

2020-08-07 Thread Cong Wang
On Fri, Aug 7, 2020 at 8:33 AM Johan Knöös  wrote:
>
> On Tue, Aug 4, 2020 at 8:52 AM Gregory Rose  wrote:
> >
> >
> >
> > On 8/3/2020 12:01 PM, Johan Knöös via discuss wrote:
> > > Hi Open vSwitch contributors,
> > >
> > > We have found openvswitch is causing double-freeing of memory. The
> > > issue was not present in kernel version 5.5.17 but is present in
> > > 5.6.14 and newer kernels.
> > >
> > > After reverting the RCU commits below for debugging, enabling
> > > slub_debug, lockdep, and KASAN, we see the warnings at the end of this
> > > email in the kernel log (the last one shows the double-free). When I
> > > revert 50b0e61b32ee890a75b4377d5fbe770a86d6a4c1 ("net: openvswitch:
> > > fix possible memleak on destroy flow-table"), the symptoms disappear.
> > > While I have a reliable way to reproduce the issue, I unfortunately
> > > don't yet have a process that's amenable to sharing. Please take a
> > > look.
> > >
> > > 189a6883dcf7 rcu: Remove kfree_call_rcu_nobatch()
> > > 77a40f97030b rcu: Remove kfree_rcu() special casing and lazy-callback 
> > > handling
> > > e99637becb2e rcu: Add support for debug_objects debugging for kfree_rcu()
> > > 0392bebebf26 rcu: Add multiple in-flight batches of kfree_rcu() work
> > > 569d767087ef rcu: Make kfree_rcu() use a non-atomic ->monitor_todo
> > > a35d16905efc rcu: Add basic support for kfree_rcu() batching
> > >
> > > Thanks,
> > > Johan Knöös
> >
> > Let's add the author of the patch you reverted and the Linux netdev
> > mailing list.
> >
> > - Greg
>
> I found we also sometimes get warnings from
> https://elixir.bootlin.com/linux/v5.5.17/source/kernel/rcu/tree.c#L2239
> under similar conditions even on kernel 5.5.17, which I believe may be
> related. However, it's much rarer and I don't have a reliable way of
> reproducing it. Perhaps 50b0e61b32ee890a75b4377d5fbe770a86d6a4c1 only
> increases the frequency of a pre-existing bug.

It seems clear we have a double free on table->mask_array when
the reallocation is triggered on the destroy path.

Are you able to test the attached patch (compile tested only)?
Also note: it is generated against the latest net tree, so it may not
apply cleanly to any earlier stable release.

Thanks!
diff --git a/net/openvswitch/flow_table.c b/net/openvswitch/flow_table.c
index 8c12675cbb67..cc7859db445a 100644
--- a/net/openvswitch/flow_table.c
+++ b/net/openvswitch/flow_table.c
@@ -294,7 +294,7 @@ static int tbl_mask_array_add_mask(struct flow_table *tbl,
 }
 
 static void tbl_mask_array_del_mask(struct flow_table *tbl,
-struct sw_flow_mask *mask)
+struct sw_flow_mask *mask, bool destroy)
 {
 	struct mask_array *ma = ovsl_dereference(tbl->mask_array);
 	int i, ma_count = READ_ONCE(ma->count);
@@ -314,6 +314,11 @@ static void tbl_mask_array_del_mask(struct flow_table *tbl,
 	rcu_assign_pointer(ma->masks[i], ma->masks[ma_count -1]);
 	RCU_INIT_POINTER(ma->masks[ma_count -1], NULL);
 
+	if (destroy) {
+		kfree(mask);
+		return;
+	}
+
 	kfree_rcu(mask, rcu);
 
 	/* Shrink the mask array if necessary. */
@@ -326,7 +331,8 @@ static void tbl_mask_array_del_mask(struct flow_table *tbl,
 }
 
 /* Remove 'mask' from the mask list, if it is not needed any more. */
-static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
+static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask,
+			 bool destroy)
 {
 	if (mask) {
 		/* ovs-lock is required to protect mask-refcount and
@@ -337,7 +343,7 @@ static void flow_mask_remove(struct flow_table *tbl, struct sw_flow_mask *mask)
 		mask->ref_count--;
 
 		if (!mask->ref_count)
-			tbl_mask_array_del_mask(tbl, mask);
+			tbl_mask_array_del_mask(tbl, mask, destroy);
 	}
 }
 
@@ -470,7 +476,7 @@ static void table_instance_flow_free(struct flow_table *table,
 			table->ufid_count--;
 	}
 
-	flow_mask_remove(table, flow->mask);
+	flow_mask_remove(table, flow->mask, !count);
 }
 
 static void table_instance_destroy(struct flow_table *table,
@@ -521,9 +527,9 @@ void ovs_flow_tbl_destroy(struct flow_table *table)
 	struct mask_cache *mc = rcu_dereference_raw(table->mask_cache);
 	struct mask_array *ma = rcu_dereference_raw(table->mask_array);
 
+	table_instance_destroy(table, ti, ufid_ti, false);
 	call_rcu(&mc->rcu, mask_cache_rcu_cb);
 	call_rcu(&ma->rcu, mask_array_rcu_cb);
-	table_instance_destroy(table, ti, ufid_ti, false);
 }
 
 struct sw_flow *ovs_flow_tbl_dump_next(struct table_instance *ti,


Re: rcu build on bug

2020-08-07 Thread Cong Wang
On Fri, Aug 7, 2020 at 8:16 AM Jamal Hadi Salim  wrote:
>
> I am guessing the hash table got too large.
> Smells like a hard-coded expectation?

Yeah, that is literally how kfree_rcu() works.

>
> How to fix?

I guess you just have to wrap up kfree() and pass it
to call_rcu().
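
The wrapper takes the classic shape below, where "struct foo" stands in
for whatever object embeds the rcu_head:

struct foo {
	struct rcu_head rcu;
	/* ... payload ... */
};

static void foo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

	/* instead of kfree_rcu(p, rcu): */
	call_rcu(&p->rcu, foo_free_rcu);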

Thanks.


Re: Question about TC filter

2020-08-06 Thread Cong Wang
On Thu, Aug 6, 2020 at 10:21 AM satish dhote  wrote:
>
> Hi Cong,
>
> I tried adding the patch below, i.e. "return cl == 0 ? q->block : NULL;",
> but after this I'm not able to see any output from the "tc filter show ..."
> command. Looks like the filter is not getting configured.

What exact command did you use? If you specify "ingress" rather
than "ffff:", this is exactly _my_ goal, because prior to clsact, only
"ffff:" could match the ingress qdisc; the keyword "ingress" did not
even exist.

Of course, it may not be Daniel's intention; he might expect
"ingress" to match the ingress qdisc too, for convenience.

Thanks.


Re: Question about TC filter

2020-08-05 Thread Cong Wang
Hello,

On Tue, Aug 4, 2020 at 10:39 PM satish dhote  wrote:
>
> Hi Team,
>
> I have a question regarding tc filter behavior. I tried to look
> for the answer over the web and netdev FAQ but didn't get the
> answer. Hence I'm looking for your help.
>
> I added an ingress qdisc for interface enp0s25 and then configured the
> tc filter as shown below, but after adding filters I realized that the
> rule is reflected as a result of both the ingress and egress filter
> commands. Is this the expected behaviour, or a bug? Why should the
> same filter be reflected in both the ingress and egress paths?

I am not very sure, but my feeling is this is a bug. Let's Cc Daniel, who
introduced this.

With the introduction of the clsact qdisc, the keywords "ingress" and
"egress" were introduced too, to match the clsact qdisc. However, the
major/minor handles are kinda confusing:

TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_INGRESS);
TC_H_MAKE(TC_H_CLSACT, TC_H_MIN_EGRESS);
#define TC_H_CLSACT TC_H_INGRESS

They could match the ingress qdisc (ffff:) too.


>
> I understand that policy is always configured for ingress traffic,
> so I believe that filters should not be reflected with egress.
> The behaviour is the same when I offloaded an ovs flow to the tc software
> datapath.

I believe so too, otherwise it would be too confusing to users.

If you are able to test a kernel patch, does the following one-line
change fix this problem? If not, I will try it on my side.

diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
index 84838128b9c5..4d9c1bb15545 100644
--- a/net/sched/sch_ingress.c
+++ b/net/sched/sch_ingress.c
@@ -49,7 +49,7 @@ static struct tcf_block *ingress_tcf_block(struct Qdisc *sch, unsigned long cl,
 {
 	struct ingress_sched_data *q = qdisc_priv(sch);
 
-	return q->block;
+	return cl == 0 ? q->block : NULL;
 }

 static void clsact_chain_head_change(struct tcf_proto *tp_head, void *priv)


Thanks.


Re: [PATCHv2 net-next] dropwatch: Support monitoring of dropped frames

2020-08-04 Thread Cong Wang
On Tue, Aug 4, 2020 at 9:14 AM  wrote:
>
> From: Izabela Bakollari 
>
> Dropwatch is a utility that monitors dropped frames by having userspace
> record them over the dropwatch protocol over a file. This augmentation
> allows live monitoring of dropped frames using tools like tcpdump.
>
> With this feature, dropwatch allows two additional commands (start and
> stop interface) which allow the assignment of a net_device to the
> dropwatch protocol. When assigned, dropwatch will clone dropped frames,
> and receive them on the assigned interface, allowing tools like tcpdump
> to monitor for them.
>
> With this feature, create a dummy ethernet interface (ip link add dev
> dummy0 type dummy), assign it to the dropwatch kernel subsystem by using
> these new commands, and then monitor dropped frames in real time by
> running tcpdump -i dummy0.

drop monitor is already able to send dropped packets to user-space,
and wireshark has already caught up with this feature:

https://code.wireshark.org/review/gitweb?p=wireshark.git;a=commitdiff;h=a94a860c0644ec3b8a129fd243674a2e376ce1c8

So what you propose here seems pretty much a duplicate?

Thanks.


Re: [PATCH net v2] net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct

2020-07-31 Thread Cong Wang
On Thu, Jul 30, 2020 at 7:45 PM  wrote:
>
> From: wenxu 
>
> When openvswitch offloads conntrack with the act_ct action, fragmented
> packets are defragged in the ingress tc act_ct action and miss the next
> chain. The packet is then passed to the openvswitch datapath without the
> mru, and the over-mtu packet will be dropped by the output action in
> openvswitch:
>
> "kernel: net2: dropped over-mtu packet: 1528 > 1500"
>
> This patch adds mru to the tc_skb_ext for the defrag-and-miss-next-chain
> situation, and also adds mru to the qdisc_skb_cb. act_ct sets the mru
> in the qdisc_skb_cb when the packet is defragged. When the chain misses,
> the mru is set in the tc_skb_ext, which can be read by the ovs datapath.
>
> Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct")
> Signed-off-by: wenxu 

Reviewed-by: Cong Wang 

Thanks.


Re: [PATCH net] net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct

2020-07-30 Thread Cong Wang
On Thu, Jul 30, 2020 at 1:53 AM wenxu  wrote:
>
>
> On 7/30/2020 2:03 PM, Cong Wang wrote:
> > On Wed, Jul 29, 2020 at 3:41 AM  wrote:
> >> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> >> index c510b03..45401d5 100644
> >> --- a/include/net/sch_generic.h
> >> +++ b/include/net/sch_generic.h
> >> @@ -384,6 +384,7 @@ struct qdisc_skb_cb {
> >> };
> >>  #define QDISC_CB_PRIV_LEN 20
> >> unsigned char   data[QDISC_CB_PRIV_LEN];
> >> +   u16 mru;
> >>  };
> > Hmm, can you put it in the anonymous struct before 'data'?
> >
> > We validate this cb size and data size like below:
> >
> > static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int 
> > sz)
> > {
> > struct qdisc_skb_cb *qcb;
> >
> > BUILD_BUG_ON(sizeof(skb->cb) < offsetof(struct qdisc_skb_cb,
> > data) + sz);
> > BUILD_BUG_ON(sizeof(qcb->data) < sz);
> > }
> >
> > It _kinda_ expects ->data at the end.
>
> It seems the offset of 'data' should be aligned with sizeof(u64)?
>
> If I modify it as follows:
>
> @@ -383,6 +383,9 @@ struct qdisc_skb_cb {
> unsigned int    pkt_len;
> u16 slave_dev_queue_mapping;
> u16 tc_classid;
> +   u16 mru;
> };
>  #define QDISC_CB_PRIV_LEN 20
> unsigned char   data[QDISC_CB_PRIV_LEN];
>
> compilation fails:
>
> net/core/filter.c:7625:3: note: in expansion of macro ‘BUILD_BUG_ON’
>BUILD_BUG_ON((offsetof(struct sk_buff, cb) +
>
> in the file: net/core/filter.c
>
> case offsetof(struct __sk_buff, cb[0]) ...
>
>  offsetofend(struct __sk_buff, cb[4]) - 1:
> BUILD_BUG_ON(sizeof_field(struct qdisc_skb_cb, data) < 20);
> BUILD_BUG_ON((offsetof(struct sk_buff, cb) +
>   offsetof(struct qdisc_skb_cb, data)) %
>  sizeof(__u64));
>
>
> If I modify it as follows:
>
> @@ -383,6 +383,9 @@ struct qdisc_skb_cb {
> unsigned int    pkt_len;
> u16 slave_dev_queue_mapping;
> u16 tc_classid;
> +   u16 mru;
> +   u16 _pad1;
> +   u32 _pad2;
> };
>  #define QDISC_CB_PRIV_LEN 20
> unsigned char   data[QDISC_CB_PRIV_LEN];
>
>
> compilation fails:
>
> ./include/linux/filter.h:633:2: note: in expansion of macro ‘BUILD_BUG_ON’
>   BUILD_BUG_ON(sizeof(*cb) > sizeof_field(struct sk_buff, cb));
>
>
static inline void bpf_compute_data_pointers(struct sk_buff *skb)
{
	struct bpf_skb_data_end *cb = (struct bpf_skb_data_end *)skb->cb;

	BUILD_BUG_ON(sizeof(*cb) > sizeof_field(struct sk_buff, cb));
	cb->data_meta = skb->data - skb_metadata_len(skb);
	cb->data_end  = skb->data + skb_headlen(skb);
}
>
>
> It seems there is no space to add a new value before 'data'?

Hmm, I didn't know bpf had such restrictions on qdisc_skb_cb.
It seems impossible to add a new field before data. If you keep it
after data, can you adjust qdisc_cb_private_validate() too, like below?

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index c510b03b9751..68235489a5d4 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -463,7 +463,7 @@ static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
 {
 	struct qdisc_skb_cb *qcb;
 
-	BUILD_BUG_ON(sizeof(skb->cb) < offsetof(struct qdisc_skb_cb, data) + sz);
+	BUILD_BUG_ON(sizeof(skb->cb) < sizeof(*qcb));
 	BUILD_BUG_ON(sizeof(qcb->data) < sz);
 }

Thanks.


Re: [PATCH net] net/sched: act_ct: fix miss set mru for ovs after defrag in act_ct

2020-07-29 Thread Cong Wang
On Wed, Jul 29, 2020 at 3:41 AM  wrote:
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index c510b03..45401d5 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -384,6 +384,7 @@ struct qdisc_skb_cb {
> };
>  #define QDISC_CB_PRIV_LEN 20
> unsigned char   data[QDISC_CB_PRIV_LEN];
> +   u16 mru;
>  };

Hmm, can you put it in the anonymous struct before 'data'?

We validate this cb size and data size like below:

static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
{
	struct qdisc_skb_cb *qcb;

	BUILD_BUG_ON(sizeof(skb->cb) < offsetof(struct qdisc_skb_cb, data) + sz);
	BUILD_BUG_ON(sizeof(qcb->data) < sz);
}

It _kinda_ expects ->data at the end.

The rest of your patch looks good to me, so feel free to add:
Reviewed-by: Cong Wang 

Thanks.


Re: [PATCH] drivers/net/wan/lapbether: Use needed_headroom instead of hard_header_len

2020-07-27 Thread Cong Wang
Hello,

On Mon, Jul 27, 2020 at 12:41 PM Xie He  wrote:
>
> Hi Cong Wang,
>
> I'm wishing to change a driver from using "hard_header_len" to using
> "needed_headroom" to declare its needed headroom. I submitted a patch
> and it is decided it needs to be reviewed. I see you participated in
> "hard_header_len vs needed_headroom" discussions in the past. Can you
> help me review this patch? Thanks!
>
> The patch is at:
> http://patchwork.ozlabs.org/project/netdev/patch/20200726110524.151957-1-xie.he.0...@gmail.com/
>
> In my understanding, hard_header_len should be the length of the header
> created by dev_hard_header. Any additional headroom needed should be
> declared in needed_headroom instead of hard_header_len. I came to this
> conclusion by examining the logic of net/packet/af_packet.c:packet_snd.

I am not familiar with this WAN driver, but I suggest you look at
the following commit, which provides a lot of useful information:

commit 9454f7a895b822dd8fb4588fc55fda7c96728869
Author: Brian Norris 
Date:   Wed Feb 26 16:05:11 2020 -0800

mwifiex: set needed_headroom, not hard_header_len

hard_header_len provides limitations for things like AF_PACKET, such
that we don't allow transmitting packets smaller than this.

needed_headroom provides a suggested minimum headroom for SKBs, so that
we can trivially add our headers to the front.

The latter is the correct field to use in this case, while the former
mostly just prevents sending small AF_PACKET frames.

In any case, mwifiex already does its own bounce buffering [1] if we
don't have enough headroom, so hints (not hard limits) are all that are
needed.

This is the essentially the same bug (and fix) that brcmfmac had, fixed
in commit cb39288fd6bb ("brcmfmac: use ndev->needed_headroom to reserve
additional header space").

[1] mwifiex_hard_start_xmit():
if (skb_headroom(skb) < MWIFIEX_MIN_DATA_HEADER_LEN) {
[...]
/* Insufficient skb headroom - allocate a new skb */
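
For the driver in question, the change is usually a one-line swap in
its setup path; a sketch, with MY_HEADER_LEN as an illustrative
constant:

static void xxx_dev_setup(struct net_device *dev)
{
	/* A hint, not a hard floor: skbs arrive with this much
	 * headroom so our headers can be prepended cheaply. */
	dev->needed_headroom = MY_HEADER_LEN;
	/* hard_header_len stays the length of whatever
	 * dev_hard_header() actually builds (possibly 0). */
}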

Hope this helps.

Thanks.


Re: [PATCH] [net/ipv6] ip6_output: Add ipv6_pinfo null check

2020-07-27 Thread Cong Wang
On Mon, Jul 27, 2020 at 7:14 PM Gaurav Singh  wrote:
>
> Add a return to fix the build issue. I haven't reproduced this issue on
> my end.
>
> My hypothesis is this: In function: ip6_xmit(), we have
> const struct ipv6_pinfo *np = inet6_sk(sk); which returns NULL.
>
> Further down the function, there's a check:
> if (np) hlimit = np->hop_limit;

This check has existed since before git history; at that time 'sk' could
be NULL, hence so could 'np'. That does not mean it is still necessary now.

I looked at all callers of ip6_xmit(); I don't see how it is called with
a non-full socket, nor could 'sk' be NULL after
commit b30bd282cbf5c46247a279a2e8d2aae027d9f1bf
("[IPV6]: ip6_xmit: remove unnecessary NULL ptr check").

Thanks.

