Re: [PATCH v2] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 4:48 PM,   wrote:
> diff --git a/include/net/fq_impl.h b/include/net/fq_impl.h
> index be7c0fa..cb911f0 100644
> --- a/include/net/fq_impl.h
> +++ b/include/net/fq_impl.h
> @@ -78,7 +78,10 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
> return NULL;
> }
>
> -   flow = list_first_entry(head, struct fq_flow, flowchain);
> +   flow = list_first_entry_or_null(head, struct fq_flow, flowchain);
> +
> +   if (WARN_ON_ONCE(!flow))
> +   return NULL;

This does not make sense either. list_first_entry_or_null()
returns NULL only when the list is empty, but we already check
list_empty() right before this code, and it is protected by fq->lock.


Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 3:04 PM, Al Viro  wrote:
> On Thu, Jun 07, 2018 at 02:45:58PM -0700, Cong Wang wrote:
>> On Thu, Jun 7, 2018 at 2:26 PM, Al Viro  wrote:
>> > On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote:
>> >> fchownat() doesn't even hold refcnt of fd until it figures out
>> >> fd is really needed (otherwise is ignored) and releases it after
>> >> it resolves the path. This means sock_close() could race with
>> >> sockfs_setattr(), which leads to a NULL pointer dereference
>> >> since typically we set sock->sk to NULL in ->release().
>> >>
>> >> As pointed out by Al, this is unique to sockfs. So we can fix this
>> >> in socket layer by acquiring inode_lock in sock_close() and
>> >> checking against NULL in sockfs_setattr().
>> >
>> > That looks like a massive overkill - it's way heavier than it should be.
>>
I don't see any other quick way to fix this. My initial thought was
to keep that refcnt until path_put(), but apparently you don't like
that either.
>
> You do realize that the same ->setattr() can be called by way of
> chown() on /proc/self/fd/<n>, right?  What would you do there -
> bump refcount on that struct file when traversing that symlink and
> hold it past the end of pathname resolution, until... what exactly?

I was thinking about this:

error = user_path_at(dfd, ...);  // hold dfd when needed

if (!error) {
        error = chown_common(&path, mode, ...);
        path_put(&path);  // release dfd if held
}

With this, we can guarantee ->release() can only be called
after chown_common(), which itself runs after ->setattr().


Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 2:41 PM, Ben Greear  wrote:
> On 06/07/2018 02:29 PM, Cong Wang wrote:
>>
>> On Thu, Jun 7, 2018 at 9:06 AM,   wrote:
>>>
>>> --- a/include/net/fq_impl.h
>>> +++ b/include/net/fq_impl.h
>>> @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>>>
>>> flow = list_first_entry(head, struct fq_flow, flowchain);
>>>
>>> +   if (WARN_ON_ONCE(!flow))
>>> +   return NULL;
>>> +
>>
>>
>> How could list_first_entry() possibly return NULL?
>> You need list_first_entry_or_null().
>>
>
> I don't know for certain that flow was NULL, but something was NULL in this method
> near that line, and it looked like a likely culprit.
>
> I guess possibly tin or fq was passed in as NULL?

A NULL pointer dereference does not always fault at address 0; you can
trigger a NULL-ptr-deref at an offset like 0x3c too, when a member of a
NULL struct is accessed. But your patch checks against 0; that is the
problem, and that is why list_first_entry_or_null() exists.


Re: [Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 2:26 PM, Al Viro  wrote:
> On Thu, Jun 07, 2018 at 01:39:49PM -0700, Cong Wang wrote:
>> fchownat() doesn't even hold refcnt of fd until it figures out
>> fd is really needed (otherwise is ignored) and releases it after
>> it resolves the path. This means sock_close() could race with
>> sockfs_setattr(), which leads to a NULL pointer dereference
>> since typically we set sock->sk to NULL in ->release().
>>
>> As pointed out by Al, this is unique to sockfs. So we can fix this
>> in socket layer by acquiring inode_lock in sock_close() and
>> checking against NULL in sockfs_setattr().
>
> That looks like a massive overkill - it's way heavier than it should be.

I don't see any other quick way to fix this. My initial thought was
to keep that refcnt until path_put(), but apparently you don't like
that either.


> And it's very likely to trigger shitloads of deadlock warnings, some
> possibly even true.

I did audit the inode_lock usage in networking and I don't see any
deadlock; of course, there could be some non-networking code using
the socket API that I missed. But _generally_, a socket doesn't have
a pointer through which this inode could be retrieved.


Re: [PATCH] net-fq: Add WARN_ON check for null flow.

2018-06-07 Thread Cong Wang
On Thu, Jun 7, 2018 at 9:06 AM,   wrote:
> --- a/include/net/fq_impl.h
> +++ b/include/net/fq_impl.h
> @@ -80,6 +80,9 @@ static struct sk_buff *fq_tin_dequeue(struct fq *fq,
>
> flow = list_first_entry(head, struct fq_flow, flowchain);
>
> +   if (WARN_ON_ONCE(!flow))
> +   return NULL;
> +

How could list_first_entry() possibly return NULL?
You need list_first_entry_or_null().


[Patch net] socket: close race condition between sock_close() and sockfs_setattr()

2018-06-07 Thread Cong Wang
fchownat() doesn't even hold refcnt of fd until it figures out
fd is really needed (otherwise is ignored) and releases it after
it resolves the path. This means sock_close() could race with
sockfs_setattr(), which leads to a NULL pointer dereference
since typically we set sock->sk to NULL in ->release().

As pointed out by Al, this is unique to sockfs. So we can fix this
in socket layer by acquiring inode_lock in sock_close() and
checking against NULL in sockfs_setattr().

sock_release() is called in many places, only the sock_close()
path matters here. And fortunately, this should not affect normal
sock_close() as it is only called when the last fd refcnt is gone.
It only affects sock_close() with a parallel sockfs_setattr() in
progress, which is not common.

Fixes: 86741ec25462 ("net: core: Add a UID field to struct sock.")
Reported-by: shankarapailoor 
Cc: Tetsuo Handa 
Cc: Lorenzo Colitti 
Cc: Al Viro 
Signed-off-by: Cong Wang 
---
 net/socket.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/socket.c b/net/socket.c
index af57d85bcb48..8a109012608a 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -541,7 +541,10 @@ static int sockfs_setattr(struct dentry *dentry, struct iattr *iattr)
if (!err && (iattr->ia_valid & ATTR_UID)) {
struct socket *sock = SOCKET_I(d_inode(dentry));
 
-   sock->sk->sk_uid = iattr->ia_uid;
+   if (sock->sk)
+   sock->sk->sk_uid = iattr->ia_uid;
+   else
+   err = -ENOENT;
}
 
return err;
@@ -590,12 +593,16 @@ EXPORT_SYMBOL(sock_alloc);
  * an inode not a file.
  */
 
-void sock_release(struct socket *sock)
+static void __sock_release(struct socket *sock, struct inode *inode)
 {
if (sock->ops) {
struct module *owner = sock->ops->owner;
 
+   if (inode)
+   inode_lock(inode);
sock->ops->release(sock);
+   if (inode)
+   inode_unlock(inode);
sock->ops = NULL;
module_put(owner);
}
@@ -609,6 +616,11 @@ void sock_release(struct socket *sock)
}
sock->file = NULL;
 }
+
+void sock_release(struct socket *sock)
+{
+   __sock_release(sock, NULL);
+}
 EXPORT_SYMBOL(sock_release);
 
 void __sock_tx_timestamp(__u16 tsflags, __u8 *tx_flags)
@@ -1171,7 +1183,7 @@ static int sock_mmap(struct file *file, struct vm_area_struct *vma)
 
 static int sock_close(struct inode *inode, struct file *filp)
 {
-   sock_release(SOCKET_I(inode));
+   __sock_release(SOCKET_I(inode), inode);
return 0;
 }
 
-- 
2.13.0



[Patch net v2] netdev-FAQ: clarify DaveM's position for stable backports

2018-06-05 Thread Cong Wang
Per discussion with David at netconf 2018, let's clarify
DaveM's position of handling stable backports in netdev-FAQ.

This is important for people relying on upstream -stable
releases.

Cc: sta...@vger.kernel.org
Cc: Greg Kroah-Hartman 
Signed-off-by: Cong Wang 
---
 Documentation/networking/netdev-FAQ.txt | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt
index 2a3278d5cf35..fa951b820b25 100644
--- a/Documentation/networking/netdev-FAQ.txt
+++ b/Documentation/networking/netdev-FAQ.txt
@@ -179,6 +179,15 @@ A: No.  See above answer.  In short, if you think it really belongs in
   dash marker line as described in Documentation/process/submitting-patches.rst to
   temporarily embed that information into the patch that you send.
 
+Q: Are all networking bug fixes backported to all stable releases?
+
+A: Due to capacity, Dave could only take care of the backports for the last
+   2 stable releases. For earlier stable releases, each stable branch maintainer
+   is supposed to take care of them. If you find any patch is missing from an
+   earlier stable branch, please notify sta...@vger.kernel.org with either a
+   commit ID or a formal patch backported, and CC Dave and other relevant
+   networking developers.
+
 Q: Someone said that the comment style and coding convention is different
for the networking content.  Is this true?
 
-- 
2.13.0



Re: [Patch net-next] netdev-FAQ: clarify DaveM's position for stable backports

2018-06-05 Thread Cong Wang
On Tue, Jun 5, 2018 at 6:43 AM, David Miller  wrote:
> From: Cong Wang 
> Date: Mon,  4 Jun 2018 11:07:19 -0700
>
>> +Q: Are all networking bug fixes backported to all stable releases?
>> +
>> +A: Due to capacity, Dave could only take care of the backports for the last
>> +   3 stable releases.
>
> As Greg stated, I only do 2 not 3.

Sure, will send v2.
We just need a number here. :)

Thanks!


[Patch net-next] netdev-FAQ: clarify DaveM's position for stable backports

2018-06-04 Thread Cong Wang
Per discussion with David at netconf 2018, let's clarify
DaveM's position of handling stable backports in netdev-FAQ.

This is important for people relying on upstream -stable
releases.

Cc: sta...@vger.kernel.org
Cc: Greg Kroah-Hartman 
Signed-off-by: Cong Wang 
---
 Documentation/networking/netdev-FAQ.txt | 9 +
 1 file changed, 9 insertions(+)

diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt
index 2a3278d5cf35..6dde6686c870 100644
--- a/Documentation/networking/netdev-FAQ.txt
+++ b/Documentation/networking/netdev-FAQ.txt
@@ -179,6 +179,15 @@ A: No.  See above answer.  In short, if you think it really belongs in
   dash marker line as described in Documentation/process/submitting-patches.rst to
   temporarily embed that information into the patch that you send.
 
+Q: Are all networking bug fixes backported to all stable releases?
+
+A: Due to capacity, Dave could only take care of the backports for the last
+   3 stable releases. For earlier stable releases, each stable branch maintainer
+   is supposed to take care of them. If you find any patch is missing from an
+   earlier stable branch, please notify sta...@vger.kernel.org with either a
+   commit ID or a formal patch backported, and CC Dave and other relevant
+   networking developers.
+
 Q: Someone said that the comment style and coding convention is different
for the networking content.  Is this true?
 
-- 
2.13.0



Re: [PATCH net-next 0/2] cls_flower: Various fixes

2018-06-04 Thread Cong Wang
On Mon, Jun 4, 2018 at 12:35 AM, Roi Dayan  wrote:
>
>
> On 03/06/2018 22:39, Jiri Pirko wrote:
>>
>> Sun, Jun 03, 2018 at 08:33:25PM CEST, xiyou.wangc...@gmail.com wrote:
>>>
>>> On Wed, May 30, 2018 at 1:17 AM, Paul Blakey  wrote:

 Two of the fixes are for my multiple mask patch

 Paul Blakey (2):
cls_flower: Fix missing free of rhashtable
cls_flower: Fix comparing of old filter mask with new filter
>>>
>>>
>>> Both are bug fixes and one-line fixes, so definitely should go
>>> to -net tree and -stable tree.
>>
>>
>> I agree.
>>
>
> it's because the commit they fix doesn't exist in net yet.
>

Oh, sorry, my bad, I thought Apr 30 was in a previous release...

Anyway, v2 should apply cleanly to -net-next, or to -net after
net-next is merged into it.


Re: [PATCH net-next 0/2] cls_flower: Various fixes

2018-06-03 Thread Cong Wang
On Wed, May 30, 2018 at 1:17 AM, Paul Blakey  wrote:
> Two of the fixes are for my multiple mask patch
>
> Paul Blakey (2):
>   cls_flower: Fix missing free of rhashtable
>   cls_flower: Fix comparing of old filter mask with new filter

Both are bug fixes and one-line fixes, so definitely should go
to -net tree and -stable tree.

I don't understand why you decide to rebase on net-next.


Re: [PATCH v3 00/11] Modify action API for implementing lockless actions

2018-05-28 Thread Cong Wang
On Sun, May 27, 2018 at 2:17 PM, Vlad Buslov  wrote:
> Currently, all netlink protocol handlers for updating rules, actions and
> qdiscs are protected with single global rtnl lock which removes any
> possibility for parallelism. This patch set is a first step to remove
> rtnl lock dependency from TC rules update path.
>
> Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
> Handlers registered with this flag are called without RTNL taken. End
> goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
> etc.) to be registered with UNLOCKED flag to allow parallel execution.
> However, there is no intention to completely remove or split rtnl lock
> itself. This patch set addresses specific problems in action API that
> prevents it from being executed concurrently.


Great, your goal is much clearer now! So can I expect this patchset to
_completely_ get rid of the RTNL lock from the action update paths?

I ask because this is only your first step: RTNL is still acquired at the
upper layer, that is, on the filter update paths.


>
> As a preparation for executing TC rules update handlers without rtnl
> lock, action API code was audited to determine areas that assume
> external synchronization with rtnl lock and must be changed to allow
> safe concurrent access with following results:
>
> 1. Action idr is already protected with spinlock. However, some code
>paths assume that idr state is not changes between several
>consecutive tcf_idr_* function calls.
> 2. tc_action reference and bind counters are implemented as plain
>integers. Their purpose was to allow single actions to be shared
>between multiple filters, not to provide means for concurrent
>modification.
> 3. tc_action 'cookie' pointer field is not protected against
>modification.
> 4. Action API functions, that work with set of actions, use intrusive
>linked list, which cannot be used concurrently without additional
>synchronization.
> 5. Action API functions don't take reference to actions while using
>them, assuming external synchronization with rtnl lock.


Fair enough, thanks for the details, but some high-level things are still
missing:

1. What lock protects action updates with your patches? Since you remove
RTNL from these paths, I assume no lock at all except the existing spinlock?
Please state this in your cover letter.


2. Assume 1) is correct, how do you guarantee an action update is atomic?
Let's say I have action foo:

struct act_foo
{
  int a;
  int b;
};

With RTNL:

rtnl_lock();
act_foo->a = a;
if (a == X)
  act_foo->b = b;
rtnl_unlock();

Without any lock (as I assumed):


act_foo->a = a;
// fast path now reads new ->a but old ->b
if (act_foo->a == X)
// Other slow path may be changing ->a too
  act_foo->b = b;

If my assumption is correct, please explain the above question in your
cover letter, it is very important for understanding your 11 patches.

If my assumption is wrong, please be specific on which lock protects
which paths here.


3. How are actions like act_mirred and act_ipt updated w/o RTNL?

act_mirred requires to hold a refcnt for target device:

if (dev != NULL) {
if (ret != ACT_P_CREATED)
dev_put(rcu_dereference_protected(m->tcfm_dev, 1));
dev_hold(dev);
rcu_assign_pointer(m->tcfm_dev, dev);
m->tcfm_mac_header_xmit = mac_header_xmit;
}

Without RTNL, how can dev_put()+dev_hold() be atomic in the !CREATED case?

act_ipt calls xt_request_find_target() and xt_check_target(); I guess both
assume RTNL?

Or do you just leave these exceptions as they are but make the rest of the
actions lockless? If so, please list all of them here and describe why
they are special.


Last, since your end goal is to remove RTNL from the filter update paths,
how does it work if a tc filter block is shared among different qdiscs?
Assume a tc filter block can be shared by different qdiscs on different
devs.


Thanks!


Re: [PATCH 00/14] Modify action API for implementing lockless actions

2018-05-25 Thread Cong Wang
On Fri, May 25, 2018 at 1:39 PM, Vlad Buslov <vla...@mellanox.com> wrote:
>
> On Thu 24 May 2018 at 23:34, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>> On Mon, May 14, 2018 at 7:27 AM, Vlad Buslov <vla...@mellanox.com> wrote:
>>> Currently, all netlink protocol handlers for updating rules, actions and
>>> qdiscs are protected with single global rtnl lock which removes any
>>> possibility for parallelism. This patch set is a first step to remove
>>> rtnl lock dependency from TC rules update path. It updates act API to
>>> use atomic operations, rcu and spinlocks for fine-grained locking. It
>>> also extend API with functions that are needed to update existing
>>> actions for parallel execution.
>>
>> Can you give a summary here for what and how it is achieved?
>
> Got it, will expand cover letter in V2 with summary.
>>
>> You said this is the first step, what do you want to achieve in this
>> very first step? And how do you achieve it? Do you break the RTNL
>
> But aren't these questions answered in the paragraph you quoted?


Obviously not: you said you remove it, but you never explain why it can
be removed and how it is removed. This is crucial for review.

"use atomic operations, rcu and spinlocks for fine-grained locking"
is literally nothing: why do atomics/RCU make RTNL unnecessary?
How is RCU used? What spinlocks are you talking about? What
do these spinlocks protect after removing RTNL? Why are they
safe with respect to other netdevice and netns operations?

You explain _nothing_ here. Really. Please don't force people to
read 14 patches to understand how it works. In fact, no one wants
to read the code unless there is some high-level explanation that
makes basic sense.


> What: Change act API to not rely on one-big-global-RTNL-lock and to use
> more fine-grained synchronization methods to allow safe concurrent
> execution.

Sure, but how fine-grained is it after your patchset? Why can this
fine-grained locking safely replace RTNL?

Could you stop making us guess at your puzzle words? It would save you
time exchanging emails with me, and it would save me time guessing at
your meaning. It is a win-win.


> How: Refactor act API code to use atomics, rcu and spinlocks, etc. for
> protecting shared data structures, add new functions required to update


What shared data structures? The per-netns idr which is already protected
by a spinlock? The TC hierarchy? The shared standalone actions? Hey,
why do I have to guess? :-/


> specific actions implementation for parallel execution. (step 2)


Claiming is easy; proving is hard. I can easily claim I break RTNL down
to a per-netns lock, but I can't prove it really works. :-D


>
> If you feel that this cover letter is too terse, I will add outline of
> changes in V2.

It is not my rule, it is how you have to help people review your
14 patches. I think it is a fair game: you help people like me
review your patches, and we help you get them reviewed and merged
if they all make sense.



>
>> lock down to, for a quick example, a per-device lock? Or perhaps you
>> completely remove it because of what reason?
>
> I want to remove RTNL _dependency_ from act API data structures and
> code. I probably should be more specific in this case:
>
> Florian recently made a change that allows registering netlink protocol
> handlers with flag RTNL_FLAG_DOIT_UNLOCKED. Handlers registered with
> this flag are called without RTNL taken. My end goal is to have rule
> update handlers(RTM_NEWTFILTER, RTM_DELTFILTER, etc.) to be registered
> with UNLOCKED flag to allow parallel execution.


Please add this paragraph to your cover letter; it is very important for review.

>
> I do not intend to globally remove or break RTNL.
>
>>
>> I go through all the descriptions of your 14 patches (but not any code),
>> I still have no clue how you successfully avoid RTNL. Please don't
>> let me read into your code to understand that, there must be some
>> high-level justification on how it works. Without it, I don't even want
>> to read into the code.
>
> On internal code review I've been asked not to duplicate info from
> commit messages in cover letter, but I guess I can expand it with some
> high level outline in V2.

In cover letter, you should put a high-level overview of "why" and "how".
If, in the worst case, on high-level it doesn't make sense, why should
we bother to read the code? In short, you have to convince people to
read your code here.

In each patch description, you should explain what a single patch does.
I don't see any duplication.


Re: [PATCH net-next] net:sched: add action inheritdsfield to skbmod

2018-05-25 Thread Cong Wang
On Thu, May 24, 2018 at 10:45 PM, Fu, Qiaobin  wrote:
> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->priority. This enables
> later classification of packets based on the DS field.

Please move it to skbedit.


Re: [PATCH 00/14] Modify action API for implementing lockless actions

2018-05-24 Thread Cong Wang
On Mon, May 14, 2018 at 7:27 AM, Vlad Buslov  wrote:
> Currently, all netlink protocol handlers for updating rules, actions and
> qdiscs are protected with single global rtnl lock which removes any
> possibility for parallelism. This patch set is a first step to remove
> rtnl lock dependency from TC rules update path. It updates act API to
> use atomic operations, rcu and spinlocks for fine-grained locking. It
> also extend API with functions that are needed to update existing
> actions for parallel execution.

Can you give a summary here for what and how it is achieved?

You said this is the first step, what do you want to achieve in this
very first step? And how do you achieve it? Do you break the RTNL
lock down to, for a quick example, a per-device lock? Or perhaps you
completely remove it because of what reason?

I go through all the descriptions of your 14 patches (but not any code),
I still have no clue how you successfully avoid RTNL. Please don't
let me read into your code to understand that, there must be some
high-level justification on how it works. Without it, I don't even want
to read into the code.

Thanks.


Re: [PATCH net-next v3] net: sched: don't disable bh when accessing action idr

2018-05-23 Thread Cong Wang
On Wed, May 23, 2018 at 1:52 AM, Vlad Buslov <vla...@mellanox.com> wrote:
> Initial net_device implementation used ingress_lock spinlock to synchronize
> ingress path of device. This lock was used in both process and bh context.
> In some code paths action map lock was obtained while holding ingress_lock.
> Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs")
> modified actions to always disable bh, while using action map lock, in
> order to prevent deadlock on ingress_lock in softirq. This lock was removed
> in commit 555353cfa1ae ("netdev: The ingress_lock member is no longer
> needed.").
>
> Another reason to disable bh was filters delete code, that released actions
> in rcu callback. This code was changed to release actions from workqueue
> context in patch set "net_sched: close the race between call_rcu() and
> cleanup_net()".
>
> With these changes it is no longer necessary to continue disable bh while
> accessing action map.
>
> Replace all action idr spinlock usage with regular calls that do not
> disable bh.

Looks much better now!

I _guess_ we could perhaps even get rid of this spinlock, since most of
the callers hold the RTNL lock; I'm not sure about the dump() path, where
RTNL might have been removed recently.

Anyway,

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>


Re: [PATCH net-next] net:sched: add action inheritdsfield to skbmod

2018-05-23 Thread Cong Wang
On Thu, May 17, 2018 at 12:33 PM, Fu, Qiaobin  wrote:
> net/sched: add action inheritdsfield to skbmod
>
> The new action inheritdsfield copies the field DS of
> IPv4 and IPv6 packets into skb->priority. This enables
> later classification of packets based on the DS field.
>
> Original idea by Jamal Hadi Salim 
>
> Signed-off-by: Qiaobin Fu 
> Reviewed-by: Michel Machado 

Hmm, but skbedit seems better than skbmod for this job,
given:

1) It already modifies skb->priority, although with a given value

2) skbmod doesn't change skb metadata, it only changes payload

I am _not_ saying there is a strict rule for what skbmod can or can't
change; it calls itself a "data modifier", so I am saying we probably
need to follow this existing practice.


[Patch net-next] net_sched: switch to rcu_work

2018-05-23 Thread Cong Wang
Commit 05f0fe6b74db ("RCU, workqueue: Implement rcu_work") introduces
new API's for dispatching work in a RCU callback. Now we can just
switch to the new API's for tc filters. This could get rid of a lot
of code.

Cc: Tejun Heo <t...@kernel.org>
Cc: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
Cc: Jamal Hadi Salim <j...@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/net/pkt_cls.h|  2 +-
 net/sched/cls_api.c  |  5 +++--
 net/sched/cls_basic.c| 24 +++-
 net/sched/cls_bpf.c  | 22 ++
 net/sched/cls_cgroup.c   | 23 +--
 net/sched/cls_flow.c | 24 +++-
 net/sched/cls_flower.c   | 40 ++--
 net/sched/cls_fw.c   | 24 +++-
 net/sched/cls_matchall.c | 21 +
 net/sched/cls_route.c| 23 +--
 net/sched/cls_rsvp.h | 20 +---
 net/sched/cls_tcindex.c  | 41 ++---
 net/sched/cls_u32.c  | 37 ++---
 13 files changed, 85 insertions(+), 221 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 0005f0b40fe9..f3ec43725724 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -33,7 +33,7 @@ struct tcf_block_ext_info {
 };
 
 struct tcf_block_cb;
-bool tcf_queue_work(struct work_struct *work);
+bool tcf_queue_work(struct rcu_work *rwork, work_func_t func);
 
 #ifdef CONFIG_NET_CLS
 struct tcf_chain *tcf_chain_get(struct tcf_block *block, u32 chain_index,
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 963e4bf0aab8..a4a5ace834c3 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -103,9 +103,10 @@ int unregister_tcf_proto_ops(struct tcf_proto_ops *ops)
 }
 EXPORT_SYMBOL(unregister_tcf_proto_ops);
 
-bool tcf_queue_work(struct work_struct *work)
+bool tcf_queue_work(struct rcu_work *rwork, work_func_t func)
 {
-   return queue_work(tc_filter_wq, work);
+   INIT_RCU_WORK(rwork, func);
+   return queue_rcu_work(tc_filter_wq, rwork);
 }
 EXPORT_SYMBOL(tcf_queue_work);
 
diff --git a/net/sched/cls_basic.c b/net/sched/cls_basic.c
index 6b7ab3512f5b..95367f37098d 100644
--- a/net/sched/cls_basic.c
+++ b/net/sched/cls_basic.c
@@ -35,10 +35,7 @@ struct basic_filter {
struct tcf_result   res;
struct tcf_proto*tp;
struct list_headlink;
-   union {
-   struct work_struct  work;
-   struct rcu_head rcu;
-   };
+   struct rcu_work rwork;
 };
 
 static int basic_classify(struct sk_buff *skb, const struct tcf_proto *tp,
@@ -97,21 +94,14 @@ static void __basic_delete_filter(struct basic_filter *f)
 
 static void basic_delete_filter_work(struct work_struct *work)
 {
-   struct basic_filter *f = container_of(work, struct basic_filter, work);
-
+   struct basic_filter *f = container_of(to_rcu_work(work),
+ struct basic_filter,
+ rwork);
rtnl_lock();
__basic_delete_filter(f);
rtnl_unlock();
 }
 
-static void basic_delete_filter(struct rcu_head *head)
-{
-   struct basic_filter *f = container_of(head, struct basic_filter, rcu);
-
-   INIT_WORK(&f->work, basic_delete_filter_work);
-   tcf_queue_work(&f->work);
-}
-
 static void basic_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
 {
struct basic_head *head = rtnl_dereference(tp->root);
@@ -122,7 +112,7 @@ static void basic_destroy(struct tcf_proto *tp, struct netlink_ext_ack *extack)
tcf_unbind_filter(tp, &f->res);
idr_remove(&head->handle_idr, f->handle);
if (tcf_exts_get_net(&f->exts))
-   call_rcu(&f->rcu, basic_delete_filter);
+   tcf_queue_work(&f->rwork, basic_delete_filter_work);
else
__basic_delete_filter(f);
}
@@ -140,7 +130,7 @@ static int basic_delete(struct tcf_proto *tp, void *arg, bool *last,
tcf_unbind_filter(tp, &f->res);
idr_remove(&head->handle_idr, f->handle);
tcf_exts_get_net(&f->exts);
-   call_rcu(&f->rcu, basic_delete_filter);
+   tcf_queue_work(&f->rwork, basic_delete_filter_work);
*last = list_empty(&head->flist);
return 0;
 }
@@ -234,7 +224,7 @@ static int basic_change(struct net *net, struct sk_buff *in_skb,
list_replace_rcu(&fold->link, &fnew->link);
tcf_unbind_filter(tp, &fold->res);
tcf_exts_get_net(&fold->exts);
-   call_rcu(&fold->rcu, basic_delete_filter);
+   tcf_queue_work(&fold->rwork, basic_delete_filter_work);
} else {
list_add_rcu(&fnew->link, &head->flist);
}
diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c

Re: [PATCH net-next v2] net: sched: don't disable bh when accessing action idr

2018-05-22 Thread Cong Wang
On Mon, May 21, 2018 at 1:03 PM, Vlad Buslov  wrote:
> Initial net_device implementation used ingress_lock spinlock to synchronize
> ingress path of device. This lock was used in both process and bh context.
> In some code paths action map lock was obtained while holding ingress_lock.
> Commit e1e992e52faa ("[NET_SCHED] protect action config/dump from irqs")
> modified actions to always disable bh, while using action map lock, in
> order to prevent deadlock on ingress_lock in softirq. This lock was removed
> from net_device, so disabling bh, while accessing action map, is no longer
> necessary.
>
> Replace all action idr spinlock usage with regular calls that do not
> disable bh.

While your patch is probably fine, the justification above is not.

In the past, tc actions could be released in BH context because tc
filters used call_rcu(). However, I moved them to a workqueue recently.
So before my change I don't think you could remove the BH protection;
otherwise you would race with idr_remove()...


Re: [PATCH] net: sched: don't disable bh when accessing action idr

2018-05-18 Thread Cong Wang
On Fri, May 18, 2018 at 8:45 AM, Vlad Buslov  wrote:
> Underlying implementation of action map has changed and doesn't require
> disabling bh anymore. Replace all action idr spinlock usage with regular
> calls that do not disable bh.

Please explain explicitly why it is not required; don't make people
dig, it would save everyone's time.

Also, this should be targeted for net-next, right?

Thanks.


Re: [PATCH net-next v12 3/7] sch_cake: Add optional ACK filter

2018-05-17 Thread Cong Wang
On Thu, May 17, 2018 at 4:23 AM, Toke Høiland-Jørgensen  wrote:
> Eric Dumazet  writes:
>
>> On 05/16/2018 01:29 PM, Toke Høiland-Jørgensen wrote:
>>> The ACK filter is an optional feature of CAKE which is designed to improve
>>> performance on links with very asymmetrical rate limits. On such links
>>> (which are unfortunately quite prevalent, especially for DSL and cable
>>> subscribers), the downstream throughput can be limited by the number of
>>> ACKs capable of being transmitted in the *upstream* direction.
>>>
>>
>> ...
>>
>>>
>>> Signed-off-by: Toke Høiland-Jørgensen 
>>> ---
>>>  net/sched/sch_cake.c |  260 ++
>>>  1 file changed, 258 insertions(+), 2 deletions(-)
>>>
>>>
>>
>> I have decided to implement ACK compression in TCP stack itself.
>
> Awesome! Will look forward to seeing that!

+1

It is really odd to put this into a TC qdisc; the TCP stack is a much
better place.


Re: [PATCH net-next v12 2/7] sch_cake: Add ingress mode

2018-05-16 Thread Cong Wang
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
> +   if (tb[TCA_CAKE_AUTORATE]) {
> +   if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE]))
> +   q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS;
> +   else
> +   q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS;
> +   }
> +
> +   if (tb[TCA_CAKE_INGRESS]) {
> +   if (!!nla_get_u32(tb[TCA_CAKE_INGRESS]))
> +   q->rate_flags |= CAKE_FLAG_INGRESS;
> +   else
> +   q->rate_flags &= ~CAKE_FLAG_INGRESS;
> +   }
> +
> if (tb[TCA_CAKE_MEMORY])
> q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
>
> @@ -1559,6 +1628,14 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff 
> *skb)
> if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
> goto nla_put_failure;
>
> +   if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
> +   !!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
> +   goto nla_put_failure;
> +
> +   if (nla_put_u32(skb, TCA_CAKE_INGRESS,
> +   !!(q->rate_flags & CAKE_FLAG_INGRESS)))
> +   goto nla_put_failure;
> +

Why do you want to dump each bit of the rate_flags separately rather than
dumping the whole rate_flags as an integer?


Re: [PATCH net-next v12 4/7] sch_cake: Add NAT awareness to packet classifier

2018-05-16 Thread Cong Wang
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
> When CAKE is deployed on a gateway that also performs NAT (which is a
> common deployment mode), the host fairness mechanism cannot distinguish
> internal hosts from each other, and so fails to work correctly.
>
> To fix this, we add an optional NAT awareness mode, which will query the
> kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
> and use that in the flow and host hashing.
>
> When the shaper is enabled and the host is already performing NAT, the cost
> of this lookup is negligible. However, in unlimited mode with no NAT being
> performed, there is a significant CPU cost at higher bandwidths. For this
> reason, the feature is turned off by default.
>
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
>  net/sched/sch_cake.c |   73 
> ++
>  1 file changed, 73 insertions(+)
>
> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
> index 65439b643c92..e1038a7b6686 100644
> --- a/net/sched/sch_cake.c
> +++ b/net/sched/sch_cake.c
> @@ -71,6 +71,12 @@
>  #include 
>  #include 
>
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +#include 
> +#include 
> +#include 
> +#endif
> +
>  #define CAKE_SET_WAYS (8)
>  #define CAKE_MAX_TINS (8)
>  #define CAKE_QUEUES (1024)
> @@ -514,6 +520,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
> return drop;
>  }
>
> +#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
> +
> +static void cake_update_flowkeys(struct flow_keys *keys,
> +const struct sk_buff *skb)
> +{
> +   const struct nf_conntrack_tuple *tuple;
> +   enum ip_conntrack_info ctinfo;
> +   struct nf_conn *ct;
> +   bool rev = false;
> +
> +   if (tc_skb_protocol(skb) != htons(ETH_P_IP))
> +   return;
> +
> +   ct = nf_ct_get(skb, &ctinfo);
> +   if (ct) {
> +   tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
> +   } else {
> +   const struct nf_conntrack_tuple_hash *hash;
> +   struct nf_conntrack_tuple srctuple;
> +
> +   if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
> +  NFPROTO_IPV4, dev_net(skb->dev),
> +  &srctuple))
> +   return;
> +
> +   hash = nf_conntrack_find_get(dev_net(skb->dev),
> +&nf_ct_zone_dflt,
> +&srctuple);
> +   if (!hash)
> +   return;
> +
> +   rev = true;
> +   ct = nf_ct_tuplehash_to_ctrack(hash);
> +   tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
> +   }
> +
> +   keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
> +   keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
> +
> +   if (keys->ports.ports) {
> +   keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
> +   keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
> +   }
> +   if (rev)
> +   nf_ct_put(ct);
> +}
> +#else
> +static void cake_update_flowkeys(struct flow_keys *keys,
> +const struct sk_buff *skb)
> +{
> +   /* There is nothing we can do here without CONNTRACK */
> +}
> +#endif
> +
>  /* Cake has several subtle multiple bit settings. In these cases you
>   *  would be matching triple isolate mode as well.
>   */
> @@ -541,6 +601,9 @@ static u32 cake_hash(struct cake_tin_data *q, const 
> struct sk_buff *skb,
> +   skb_flow_dissect_flow_keys(skb, &keys,
>FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
>
> +   if (flow_mode & CAKE_FLOW_NAT_FLAG)
> +   cake_update_flowkeys(&keys, skb);
> +
> /* flow_hash_from_keys() sorts the addresses by value, so we have
>  * to preserve their order in a separate data structure to treat
>  * src and dst host addresses as independently selectable.
> @@ -1727,6 +1790,12 @@ static int cake_change(struct Qdisc *sch, struct 
> nlattr *opt,
> q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
> CAKE_FLOW_MASK);
>
> +   if (tb[TCA_CAKE_NAT]) {
> +   q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
> +   q->flow_mode |= CAKE_FLOW_NAT_FLAG *
> +   !!nla_get_u32(tb[TCA_CAKE_NAT]);
> +   }


I think it's better to return -EOPNOTSUPP when CONFIG_NF_CONNTRACK
is not enabled.


> +
> if (tb[TCA_CAKE_RTT]) {
> q->interval = nla_get_u32(tb[TCA_CAKE_RTT]);
>
> @@ -1892,6 +1961,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff 
> *skb)
> if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
> goto nla_put_failure;
>
> +   if (nla_put_u32(skb, TCA_CAKE_NAT,
> +   !!(q->flow_mode & 

Re: [PATCH net-next v12 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-16 Thread Cong Wang
On Wed, May 16, 2018 at 1:29 PM, Toke Høiland-Jørgensen  wrote:
> +
> +static struct Qdisc *cake_leaf(struct Qdisc *sch, unsigned long arg)
> +{
> +   return NULL;
> +}
> +
> +static unsigned long cake_find(struct Qdisc *sch, u32 classid)
> +{
> +   return 0;
> +}
> +
> +static void cake_walk(struct Qdisc *sch, struct qdisc_walker *arg)
> +{
> +}


Thanks for adding the support to other TC filters, it is much better now!

A quick question: why is class_ops->dump_stats still NULL?

It is supposed to dump the stats of each flow. Is there still any difficulty
in mapping it to a tc class? I thought you figured it out when you added the
tcf_classify() call.


Re: [PATCH net-next] sched: cls: enable verbose logging

2018-05-14 Thread Cong Wang
On Mon, May 14, 2018 at 1:47 PM, Marcelo Ricardo Leitner
<marcelo.leit...@gmail.com> wrote:
> On Mon, May 14, 2018 at 01:30:53PM -0700, Cong Wang wrote:
>> On Sun, May 13, 2018 at 1:44 PM, Marcelo Ricardo Leitner
>> <marcelo.leit...@gmail.com> wrote:
>> > Currently, when the rule is not to be exclusively executed by the
>> > hardware, extack is not passed along and offloading failures don't
>> > get logged. The idea was that hardware failures are okay because the
>> > rule will get executed in software then and this way it doesn't confuse
>> > unaware users.
>> >
>> > But this is not helpful in case one needs to understand why a certain
>> > rule failed to get offloaded. Considering it may have been a temporary
>> > failure, like resources exceeded or so, reproducing it later and knowing
>> > that it is triggering the same reason may be challenging.
>>
>> I fail to understand why you need a flag here, IOW, why not just pass
>> extack unconditionally?
>
> Because (as discussed in the RFC[1], should have linked it here) it
> could confuse users that are not aware of offloading and, in other
> cases, it can be just noise (like it would be right now for ebpf,
> which is mostly used in sw-path).
>
> 1.https://www.mail-archive.com/netdev@vger.kernel.org/msg223016.html

My point is that a TC filter flag should be used for a filter attribute;
logging is apparently not part of the filter. At least, put it into the HW
offloading path, not into the TC filter.

I know DaveM hates module parameters, but a module parameter here
is more suitable than a TC filter flag.


Re: [PATCH net-next] sched: cls: enable verbose logging

2018-05-14 Thread Cong Wang
On Sun, May 13, 2018 at 1:44 PM, Marcelo Ricardo Leitner
 wrote:
> Currently, when the rule is not to be exclusively executed by the
> hardware, extack is not passed along and offloading failures don't
> get logged. The idea was that hardware failures are okay because the
> rule will get executed in software then and this way it doesn't confuse
> unaware users.
>
> But this is not helpful in case one needs to understand why a certain
> rule failed to get offloaded. Considering it may have been a temporary
> failure, like resources exceeded or so, reproducing it later and knowing
> that it is triggering the same reason may be challenging.

I fail to understand why you need a flag here, IOW, why not just pass
extack unconditionally?


Re: [PATCH net 1/1] net sched actions: fix refcnt leak in skbmod

2018-05-11 Thread Cong Wang
On Fri, May 11, 2018 at 11:35 AM, Roman Mashak <m...@mojatatu.com> wrote:
> When an application fails to pass flags in the netlink TLV when replacing
> an existing skbmod action, the kernel will leak a refcnt:
>
> $ tc actions get action skbmod index 1
> total acts 0
>
> action order 0: skbmod pipe set smac 00:11:22:33:44:55
>  index 1 ref 1 bind 0
>
> For example, at this point a buggy application replaces the action with
> index 1 with a new smac 00:aa:22:33:44:55; it fails because of the zero
> flags, but the refcnt still gets bumped:
>
> $ tc actions get actions skbmod index 1
> total acts 0
>
> action order 0: skbmod pipe set smac 00:11:22:33:44:55
>  index 1 ref 2 bind 0
> $
>
> The patch fixes this by calling tcf_idr_release() on existing actions.
>
> Fixes: 86da71b57383d ("net_sched: Introduce skbmod action")
> Signed-off-by: Roman Mashak <m...@mojatatu.com>

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>


Re: [patch net] net: sched: fix error path in tcf_proto_create() when modules are not configured

2018-05-11 Thread Cong Wang
On Fri, May 11, 2018 at 8:45 AM, Jiri Pirko <j...@resnulli.us> wrote:
> From: Jiri Pirko <j...@mellanox.com>
>
> In case modules are not configured, error out when tp->ops is null
> and prevent later null pointer dereference.
>
> Fixes: 33a48927c193 ("sched: push TC filter protocol creation into a separate 
> function")
> Signed-off-by: Jiri Pirko <j...@mellanox.com>

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>


Re: [PATCH v2 net 1/1] net sched actions: fix invalid pointer dereferencing if skbedit flags missing

2018-05-11 Thread Cong Wang
On Fri, May 11, 2018 at 7:55 AM, Roman Mashak <m...@mojatatu.com> wrote:
> When an application fails to pass flags in the netlink TLV for a new skbedit
> action, the kernel produces the following oops:
>
> [8.307732] BUG: unable to handle kernel paging request at 00021130
> [8.309167] PGD 8000193d1067 P4D 8000193d1067 PUD 180e0067 PMD 0
> [8.310595] Oops:  [#1] SMP PTI
> [8.311334] Modules linked in: kvm_intel kvm irqbypass crct10dif_pclmul 
> crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd 
> cryptd glue_helper serio_raw
> [8.314190] CPU: 1 PID: 397 Comm: tc Not tainted 4.17.0-rc3+ #357
> [8.315252] RIP: 0010:__tcf_idr_release+0x33/0x140
> [8.316203] RSP: 0018:a0718038f840 EFLAGS: 00010246
> [8.317123] RAX: 0001 RBX: 00021100 RCX: 
> 
> [8.319831] RDX:  RSI:  RDI: 
> 00021100
> [8.321181] RBP:  R08: 0004adf8 R09: 
> 0122
> [8.322645] R10:  R11: 9e5b01ed R12: 
> 
> [8.324157] R13: 9e0d3cc0 R14:  R15: 
> 
> [8.325590] FS:  7f591292e700() GS:8fcf5bc4() 
> knlGS:
> [8.327001] CS:  0010 DS:  ES:  CR0: 80050033
> [8.327987] CR2: 00021130 CR3: 180e6004 CR4: 
> 001606a0
> [8.329289] Call Trace:
> [8.329735]  tcf_skbedit_init+0xa7/0xb0
> [8.330423]  tcf_action_init_1+0x362/0x410
> [8.331139]  ? try_to_wake_up+0x44/0x430
> [8.331817]  tcf_action_init+0x103/0x190
> [8.332511]  tc_ctl_action+0x11a/0x220
> [8.333174]  rtnetlink_rcv_msg+0x23d/0x2e0
> [8.333902]  ? _cond_resched+0x16/0x40
> [8.334569]  ? __kmalloc_node_track_caller+0x5b/0x2c0
> [8.335440]  ? rtnl_calcit.isra.31+0xf0/0xf0
> [8.336178]  netlink_rcv_skb+0xdb/0x110
> [8.336855]  netlink_unicast+0x167/0x220
> [8.337550]  netlink_sendmsg+0x2a7/0x390
> [8.338258]  sock_sendmsg+0x30/0x40
> [8.338865]  ___sys_sendmsg+0x2c5/0x2e0
> [8.339531]  ? pagecache_get_page+0x27/0x210
> [8.340271]  ? filemap_fault+0xa2/0x630
> [8.340943]  ? page_add_file_rmap+0x108/0x200
> [8.341732]  ? alloc_set_pte+0x2aa/0x530
> [8.342573]  ? finish_fault+0x4e/0x70
> [8.343332]  ? __handle_mm_fault+0xbc1/0x10d0
> [8.344337]  ? __sys_sendmsg+0x53/0x80
> [8.345040]  __sys_sendmsg+0x53/0x80
> [8.345678]  do_syscall_64+0x4f/0x100
> [8.346339]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [8.347206] RIP: 0033:0x7f591191da67
> [8.347831] RSP: 002b:7fff745abd48 EFLAGS: 0246 ORIG_RAX: 
> 002e
> [8.349179] RAX: ffda RBX: 7fff745abe70 RCX: 
> 7f591191da67
> [8.350431] RDX:  RSI: 7fff745abdc0 RDI: 
> 0003
> [8.351659] RBP: 5af35251 R08: 0001 R09: 
> 
> [8.352922] R10: 05f1 R11: 0246 R12: 
> 
> [8.354183] R13: 7fff745afed0 R14: 0001 R15: 
> 006767c0
> [8.355400] Code: 41 89 d4 53 89 f5 48 89 fb e8 aa 20 fd ff 85 c0 0f 84 ed 
> 00
> 00 00 48 85 db 0f 84 cf 00 00 00 40 84 ed 0f 85 cd 00 00 00 45 84 e4 <8b> 53 
> 30
> 74 0d 85 d2 b8 ff ff ff ff 0f 8f b3 00 00 00 8b 43 2c
> [8.358699] RIP: __tcf_idr_release+0x33/0x140 RSP: a0718038f840
> [8.359770] CR2: 00021130
> [8.360438] ---[ end trace 60c66be45dfc14f0 ]---
>
> The caller calls action's ->init() and passes pointer to "struct tc_action 
> *a",
> which later may be initialized to point at the existing action, otherwise
> "struct tc_action *a" is still invalid, and therefore dereferencing it is an
> error as happens in tcf_idr_release, where refcnt is decremented.
>
> So in case of missing flags tcf_idr_release must be called only for
> existing actions.
>
> v2:
> - prepare patch for net tree
>
> Signed-off-by: Roman Mashak <m...@mojatatu.com>

Fixes: 5e1567aeb7fe ("net sched: skbedit action fix late binding")

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>


Re: [PATCH net V2] tun: fix use after free for ptr_ring

2018-05-11 Thread Cong Wang
On Thu, May 10, 2018 at 7:49 PM, Jason Wang  wrote:
>  static void __tun_detach(struct tun_file *tfile, bool clean)
>  {
> struct tun_file *ntfile;
> @@ -736,7 +727,8 @@ static void __tun_detach(struct tun_file *tfile, bool 
> clean)
> tun->dev->reg_state == NETREG_REGISTERED)
> unregister_netdevice(tun->dev);
> }
> -   tun_cleanup_tx_ring(tfile);
> +   if (tun)
> +   xdp_rxq_info_unreg(&tfile->xdp_rxq);
> sock_put(&tfile->sk);
> }
>  }
> @@ -783,14 +775,14 @@ static void tun_detach_all(struct net_device *dev)
> tun_napi_del(tun, tfile);
> /* Drop read queue */
> tun_queue_purge(tfile);
> +   xdp_rxq_info_unreg(&tfile->xdp_rxq);
> sock_put(&tfile->sk);
> -   tun_cleanup_tx_ring(tfile);
> }
> list_for_each_entry_safe(tfile, tmp, >disabled, next) {
> tun_enable_queue(tfile);
> tun_queue_purge(tfile);
> +   xdp_rxq_info_unreg(&tfile->xdp_rxq);
> sock_put(&tfile->sk);
> -   tun_cleanup_tx_ring(tfile);

Are you sure this is safe?

xdp_rxq_info_unreg() can't be called more than once either,
please make sure the warning that commit c13da21cdb80
("tun: avoid calling xdp_rxq_info_unreg() twice") fixed will not
show up again.


Re: [PATCH net] tun: fix use after free for ptr_ring

2018-05-10 Thread Cong Wang
On Tue, May 8, 2018 at 11:59 PM, Jason Wang  wrote:
> We used to initialize the ptr_ring during TUNSETIFF because its
> size depends on the tx_queue_len of the netdevice, and we tried to clean
> it up when the socket was detached from the netdevice. A race was spotted
> when trying to do the uninit during a read, which leads to a use after
> free of the pointer ring. Solve this by always initializing a zero-size
> ptr_ring in open() and resizing it during TUNSETIFF; then we can safely
> do the cleanup during close(). With this, there's no need for the workaround
> that was introduced by commit 4df0bfc79904 ("tun: fix a memory leak
> for tfile->tx_array").
>

Ah, I didn't know ptr_ring_init(0) could work... Nice patch!
Except one thing below.


> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index ef33950..298cb96 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -681,15 +681,6 @@ static void tun_queue_purge(struct tun_file *tfile)
> skb_queue_purge(&tfile->sk.sk_error_queue);
>  }
>
> -static void tun_cleanup_tx_ring(struct tun_file *tfile)
> -{
> -   if (tfile->tx_ring.queue) {
> -   ptr_ring_cleanup(&tfile->tx_ring, tun_ptr_free);
> -   xdp_rxq_info_unreg(&tfile->xdp_rxq);
> -   memset(&tfile->tx_ring, 0, sizeof(tfile->tx_ring));
> -   }
> -}


I don't think you can totally remove ptr_ring_cleanup(), it should be
called unconditionally with your ptr_ring_init(0) trick, right?


Re: [PATCH net-next 1/1] net sched actions: fix invalid pointer dereferencing if skbedit flags missing

2018-05-10 Thread Cong Wang
On Wed, May 9, 2018 at 2:18 PM, Roman Mashak  wrote:
> The caller calls action's ->init() and passes pointer to "struct tc_action 
> *a",
> which later may be initialized to point at the existing action; otherwise
> "struct tc_action *a" is still invalid, and therefore dereferencing it is
> an error.

So technically we just need to check if(exists) before calling
tcf_idr_release(), right?


>
> So checking flags should be done as early as possible, before idr lookups.

In theory, this could break the case of binding an existing action.
But in practice it seems nonsense to pass invalid flags in this case
either, so probably this is okay.

BTW, this should be a candidate for -net and possibly -stable too.


Re: [PATCH net-next] net:sched: add gkprio scheduler

2018-05-10 Thread Cong Wang
On Wed, May 9, 2018 at 7:09 AM, Michel Machado <mic...@digirati.com.br> wrote:
> On 05/08/2018 10:24 PM, Cong Wang wrote:
>>
>> On Tue, May 8, 2018 at 5:59 AM, Michel Machado <mic...@digirati.com.br>
>> wrote:
>>>>>
>>>>> Overall it looks good to me, just one thing below:
>>>>>
>>>>>> +struct Qdisc_ops gkprio_qdisc_ops __read_mostly = {
>>>>>> +   .id =   "gkprio",
>>>>>> +   .priv_size  =   sizeof(struct gkprio_sched_data),
>>>>>> +   .enqueue=   gkprio_enqueue,
>>>>>> +   .dequeue=   gkprio_dequeue,
>>>>>> +   .peek   =   qdisc_peek_dequeued,
>>>>>> +   .init   =   gkprio_init,
>>>>>> +   .reset  =   gkprio_reset,
>>>>>> +   .change =   gkprio_change,
>>>>>> +   .dump   =   gkprio_dump,
>>>>>> +   .destroy=   gkprio_destroy,
>>>>>> +   .owner  =   THIS_MODULE,
>>>>>> +};
>>>>>
>>>>>
>>>>>
>>>>> You probably want to add Qdisc_class_ops here so that you can
>>>>> dump the stats of each internal queue.
>>>
>>>
>>>
>>> Hi Cong,
>>>
>>> In the production scenario we are targeting, this priority queue must
>>> be
>>> classless; being classful would only bloat the code for us. I don't see
>>> making this queue classful as a problem per se, but I suggest leaving it
>>> as
>>> a future improvement for when someone can come up with a useful scenario
>>> for
>>> it.
>>
>>
>>
>> Take a look at sch_prio, it is fairly simple since your internal
>> queues are just an array... Per-queue stats are quite useful
>> in production, we definitely want to observe which queues are
>> full which are not.
>>
>
> DSprio cannot add Qdisc_class_ops without a rewrite of other queue
> disciplines, which doesn't seem desirable. Since the method cops->leaf is
> required (see register_qdisc()), we would need to replace the array struct
> sk_buff_head qdiscs[GKPRIO_MAX_PRIORITY] in struct gkprio_sched_data with
> the array struct Qdisc *queues[GKPRIO_MAX_PRIORITY] to be able to return a
> Qdisc in dsprio_leaf(). The problem with this change is that Qdisc does not
> have a method to dequeue from its tail. This new method may not even make
> sense in other queue disciplines. But without this method, gkprio_enqueue()
> cannot drop the lowest priority packet when the queue is full and an
> incoming packet has higher priority.

Sorry for giving you a bad example. Take a look at sch_fq_codel instead,
it returns NULL for ->leaf() and maps its internal flows to classes.

I thought sch_prio uses internal qdiscs, but I was wrong, as you noticed
it actually exposes them to user via classes.

My point is never to make it classful, just want to expose the useful stats,
like how fq_codel dumps its internal flows.


>
> Nevertheless, I see your point on being able to observe the distribution of
> queued packets per priority. A solution for that would be to add the array
> __u32 qlen[GKPRIO_MAX_PRIORITY] in struct tc_gkprio_qopt. This solution even
> avoids adding overhead in the critical paths of DSprio. Do you see a better
> solution?

I believe you can return NULL for ->leaf() and don't need to worry about
->graft() either. ;)


>
> By the way, I've used GKPRIO_MAX_PRIORITY and other names that include
> "gkprio" above to reflect the version 1 of this patch that we are
> discussing. We will rename these identifiers for version 2 of this patch to
> replace "gkprio" with "dsprio".
>

Sounds good.

Thanks.


Re: BUG: spinlock bad magic in tun_do_read

2018-05-08 Thread Cong Wang
On Mon, May 7, 2018 at 11:04 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
>
> On 05/07/2018 10:54 PM, Cong Wang wrote:
>>
>> Yeah, we should return early before hitting this uninitialized ptr ring...
>> Something like:
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index ef33950a45d9..638c87a95247 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2128,6 +2128,9 @@ static void *tun_ring_recv(struct tun_file
>> *tfile, int noblock, int *err)
>> void *ptr = NULL;
>> int error = 0;
>>
>> +   if (!tfile->tx_ring.queue)
>> +   goto out;
>> +
>>
>> Or, checking if tun is detached...
>>
>>
>
> tx_ring was properly initialized when the first ptr_ring_consume() at
> line 2131 was attempted.
>
> The bug happens later at line 2143 , after a schedule() call, line 2155
>
> So a single check at the function prologue won't solve the case where the
> thread had to sleep and then some uninit happened.


Very good point. The RTNL lock is supposed to protect the cleanup path, but
I don't think we can acquire RTNL in the tun_chr_read_iter() path...


Re: [PATCH net-next] net:sched: add gkprio scheduler

2018-05-08 Thread Cong Wang
On Tue, May 8, 2018 at 6:29 AM, Jamal Hadi Salim  wrote:
> Have you considered using skb->prio instead of peeking into the packet
> header.
> Also have you looked at the dsmark qdisc?
>

dsmark modifies ds fields, while this one just maps ds fields into
different queues.


Re: [PATCH net-next] net:sched: add gkprio scheduler

2018-05-08 Thread Cong Wang
On Tue, May 8, 2018 at 5:59 AM, Michel Machado  wrote:
>>> Overall it looks good to me, just one thing below:
>>>
 +struct Qdisc_ops gkprio_qdisc_ops __read_mostly = {
 +   .id =   "gkprio",
 +   .priv_size  =   sizeof(struct gkprio_sched_data),
 +   .enqueue=   gkprio_enqueue,
 +   .dequeue=   gkprio_dequeue,
 +   .peek   =   qdisc_peek_dequeued,
 +   .init   =   gkprio_init,
 +   .reset  =   gkprio_reset,
 +   .change =   gkprio_change,
 +   .dump   =   gkprio_dump,
 +   .destroy=   gkprio_destroy,
 +   .owner  =   THIS_MODULE,
 +};
>>>
>>>
>>> You probably want to add Qdisc_class_ops here so that you can
>>> dump the stats of each internal queue.
>
>
> Hi Cong,
>
>In the production scenario we are targeting, this priority queue must be
> classless; being classful would only bloat the code for us. I don't see
> making this queue classful as a problem per se, but I suggest leaving it as
> a future improvement for when someone can come up with a useful scenario for
> it.


Take a look at sch_prio, it is fairly simple since your internal
queues are just an array... Per-queue stats are quite useful
in production, we definitely want to observe which queues are
full which are not.


Re: BUG: spinlock bad magic in tun_do_read

2018-05-07 Thread Cong Wang
On Mon, May 7, 2018 at 10:27 PM, syzbot
 wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:75bc37fefc44 Linux 4.17-rc4
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1162c69780
> kernel config:  https://syzkaller.appspot.com/x/.config?x=31f4b3733894ef79
> dashboard link: https://syzkaller.appspot.com/bug?extid=e8b902c3c3fadf0a9dba
> compiler:   gcc (GCC) 8.0.1 20180413 (experimental)
> userspace arch: i386
> syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=172e4c9780
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+e8b902c3c3fadf0a9...@syzkaller.appspotmail.com
>
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> random: sshd: uninitialized urandom read (32 bytes read)
> IPVS: ftp: loaded support on port[0] = 21
> BUG: spinlock bad magic on CPU#0, syz-executor0/4586
>  lock: 0x8801ae8928c8, .magic: , .owner: <none>/-1, .owner_cpu: 0
> CPU: 0 PID: 4586 Comm: syz-executor0 Not tainted 4.17.0-rc4+ #62
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
>  spin_dump+0x160/0x169 kernel/locking/spinlock_debug.c:67
>  spin_bug kernel/locking/spinlock_debug.c:75 [inline]
>  debug_spin_lock_before kernel/locking/spinlock_debug.c:83 [inline]
>  do_raw_spin_lock.cold.3+0x37/0x3c kernel/locking/spinlock_debug.c:112
>  __raw_spin_lock include/linux/spinlock_api_smp.h:143 [inline]
>  _raw_spin_lock+0x32/0x40 kernel/locking/spinlock.c:144
>  spin_lock include/linux/spinlock.h:310 [inline]
>  ptr_ring_consume include/linux/ptr_ring.h:335 [inline]
>  tun_ring_recv drivers/net/tun.c:2143 [inline]

Yeah, we should return early before hitting this uninitialized ptr ring...
Something like:

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index ef33950a45d9..638c87a95247 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2128,6 +2128,9 @@ static void *tun_ring_recv(struct tun_file
*tfile, int noblock, int *err)
void *ptr = NULL;
int error = 0;

+   if (!tfile->tx_ring.queue)
+   goto out;
+

Or, checking if tun is detached...


>  tun_do_read+0x18b1/0x29f0 drivers/net/tun.c:2182
>  tun_chr_read_iter+0xe5/0x1e0 drivers/net/tun.c:2214
>  call_read_iter include/linux/fs.h:1778 [inline]
>  new_sync_read fs/read_write.c:406 [inline]
>  __vfs_read+0x696/0xa50 fs/read_write.c:418
>  vfs_read+0x17f/0x3d0 fs/read_write.c:452
>  ksys_pread64+0x174/0x1a0 fs/read_write.c:626
>  __do_compat_sys_x86_pread arch/x86/ia32/sys_ia32.c:177 [inline]
>  __se_compat_sys_x86_pread arch/x86/ia32/sys_ia32.c:174 [inline]
>  __ia32_compat_sys_x86_pread+0xc4/0x130 arch/x86/ia32/sys_ia32.c:174
>  do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
>  do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
>  entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
> RIP: 0023:0xf7fc0cb9
> RSP: 002b:f7fbc0ac EFLAGS: 0282 ORIG_RAX: 00b4
> RAX: ffda RBX: 0003 RCX: 2080
> RDX: 006e RSI:  RDI: 
> RBP:  R08:  R09: 
> R10:  R11: 0292 R12: 
> R13:  R14:  R15: 
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.


Re: [PATCH net-next] net:sched: add gkprio scheduler

2018-05-07 Thread Cong Wang
On Mon, May 7, 2018 at 2:36 AM, Nishanth Devarajan  wrote:
> net/sched: add gkprio scheduler
>
> Gkprio (Gatekeeper Priority Queue) is a queueing discipline that prioritizes
> IPv4 and IPv6 packets according to their DSCP field. Although Gkprio can be
> employed in any QoS scenario in which a higher DSCP field means a higher
> priority packet, Gkprio was conceived as a solution for denial-of-service
> defenses that need to route packets with different priorities.


Can we give it a better name? "Gatekeeper" is meaningless if we read
it alone; it ties to your Gatekeeper project, which is more than just this
kernel module. Maybe "DS Priority Queue"?

Overall it looks good to me, just one thing below:

> +struct Qdisc_ops gkprio_qdisc_ops __read_mostly = {
> +   .id =   "gkprio",
> +   .priv_size  =   sizeof(struct gkprio_sched_data),
> +   .enqueue=   gkprio_enqueue,
> +   .dequeue=   gkprio_dequeue,
> +   .peek   =   qdisc_peek_dequeued,
> +   .init   =   gkprio_init,
> +   .reset  =   gkprio_reset,
> +   .change =   gkprio_change,
> +   .dump   =   gkprio_dump,
> +   .destroy=   gkprio_destroy,
> +   .owner  =   THIS_MODULE,
> +};

You probably want to add Qdisc_class_ops here so that you can
dump the stats of each internal queue.


Re: [PATCH net-next v8 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-07 Thread Cong Wang
On Mon, May 7, 2018 at 11:37 AM, Toke Høiland-Jørgensen <t...@toke.dk> wrote:
> Cong Wang <xiyou.wangc...@gmail.com> writes:
>
>> On Fri, May 4, 2018 at 12:10 PM, Toke Høiland-Jørgensen <t...@toke.dk> wrote:
>>> Thank you for the review! A few comments below, I'll fix the rest.
>>>
>>>> [...]
>>>>
>>>> So sch_cake doesn't accept normal tc filters? Is this intentional?
>>>> If so, why?
>>>
>>> For two reasons:
>>>
>>> - The two-level scheduling used in CAKE (tins / diffserv classes, and
>>>   flow hashing) does not map in an obvious way to the classification
>>>   index of tc filters.
>>
>> Sounds like you need to extend struct tcf_result?
>
> Well, the obvious way to support filters would be to have skb->priority
> override the diffserv mapping if set, and have the filter classification
> result select the queue within that tier. That would probably be doable,
> but see below.
>
>>> - No one has asked for it. We have done our best to accommodate the
>>>   features people want in a home router qdisc directly in CAKE, and the
>>>   ability to integrate tc filters has never been requested.
>>
>> It is not hard to integrate, basically you need to call
>> tcf_classify(). Although it is not mandatory, it is odd to merge a
>> qdisc that doesn't work with existing tc filters (and actions too).
>
> I looked at the fq_codel code to do this. Is it possible to support
> filtering without implementing Qdisc_class_ops? If so, I'll give it a
> shot; but implementing the class ops is more than I can commit to...

Good question. The tc classes in flow-based qdiscs are actually
used as flows rather than as normal tc classes in a hierarchical qdisc.
As in fq_codel, the classes are mapped to individual flows, and because
of that we can dump the stats of each flow.

I am not sure you can totally bypass class_ops; you need to look
into these APIs. Most of them are easy to implement, except perhaps
->dump_stats(), so I don't think it is a barrier here.


>
>>>>> +static int cake_init(struct Qdisc *sch, struct nlattr *opt,
>>>>> +struct netlink_ext_ack *extack)
>>>>> +{
>>>>> +   struct cake_sched_data *q = qdisc_priv(sch);
>>>>> +   int i, j;
>>>>> +
>>>>> +   sch->limit = 10240;
>>>>> +   q->tin_mode = CAKE_DIFFSERV_BESTEFFORT;
>>>>> +   q->flow_mode  = CAKE_FLOW_TRIPLE;
>>>>> +
>>>>> +   q->rate_bps = 0; /* unlimited by default */
>>>>> +
>>>>> +   q->interval = 10; /* 100ms default */
>>>>> +   q->target   =   5000; /* 5ms: codel RFC argues
>>>>> +  * for 5 to 10% of interval
>>>>> +  */
>>>>> +
>>>>> +   q->cur_tin = 0;
>>>>> +   q->cur_flow  = 0;
>>>>> +
>>>>> +   if (opt) {
>>>>> +   int err = cake_change(sch, opt, extack);
>>>>> +
>>>>> +   if (err)
>>>>> +   return err;
>>>>
>>>>
>>>> Not sure if you really want to reallocate q->tins below for this
>>>> case.
>>>
>>> I'm not sure what you mean here? If there's an error we return it and
>>> the qdisc is not created. If there's not, we allocate and on subsequent
>>> changes cake_change() will be called directly, or? Can the init function
>>> ever be called again during the lifetime of the qdisc?
>>>
>>
>> In the non-error case, you call cake_change() first and then allocate
>> ->tins with kvzalloc() below. For me it looks like you don't need to
>> allocate it again when ->tins!=NULL.
>
> No, we definitely don't. It's just not clear to me how cake_init() could
> ever be called with q->tins already allocated?
>
> I can add a check in any case, though, I see that there is one in
> fq_codel as well...

Ah, that's right, you have a check in cake_change() before
cake_reconfigure().


Re: [PATCH net-next v8 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-07 Thread Cong Wang
On Fri, May 4, 2018 at 12:10 PM, Toke Høiland-Jørgensen  wrote:
> Thank you for the review! A few comments below, I'll fix the rest.
>
>> [...]
>>
>> So sch_cake doesn't accept normal tc filters? Is this intentional?
>> If so, why?
>
> For two reasons:
>
> - The two-level scheduling used in CAKE (tins / diffserv classes, and
>   flow hashing) does not map in an obvious way to the classification
>   index of tc filters.

Sounds like you need to extend struct tcf_result?

>
> - No one has asked for it. We have done our best to accommodate the
>   features people want in a home router qdisc directly in CAKE, and the
>   ability to integrate tc filters has never been requested.

It is not hard to integrate; basically you need to call tcf_classify().
Although it is not mandatory, it is odd to merge a qdisc that doesn't
work with existing tc filters (and actions too).
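For comparison, the fq_codel-style classify hook shows roughly what the integration entails. This is a hedged, non-buildable sketch of kernel-side code; the cake_* helpers, the CAKE_QUEUES bound, and the filter_list field are assumptions about how it would be wired up.

```c
/* Sketch only: flow selection via attached tc filters, modeled on
 * fq_codel_classify(). Falls back to the built-in hash when no filter
 * is attached or none matches. cake_* names are illustrative. */
static unsigned int cake_classify(struct sk_buff *skb, struct Qdisc *sch,
				  int *qerr)
{
	struct cake_sched_data *q = qdisc_priv(sch);
	struct tcf_proto *filter;
	struct tcf_result res;
	int result;

	filter = rcu_dereference_bh(q->filter_list);
	if (!filter)
		return cake_hash(q, skb);	/* built-in flow hash */

	*qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
	result = tcf_classify(skb, filter, &res, false);
	if (result >= 0) {
#ifdef CONFIG_NET_CLS_ACT
		switch (result) {
		case TC_ACT_STOLEN:
		case TC_ACT_QUEUED:
		case TC_ACT_TRAP:
			*qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
			/* fall through */
		case TC_ACT_SHOT:
			return 0;	/* drop */
		}
#endif
		if (TC_H_MIN(res.classid) <= CAKE_QUEUES)
			return TC_H_MIN(res.classid);	/* filter-chosen flow */
	}
	return cake_hash(q, skb);
}
```

The enqueue path would then use the returned flow index in place of the plain hash, so actions (drop, mirred, etc.) work as on other classful qdiscs.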


>>> +static int cake_init(struct Qdisc *sch, struct nlattr *opt,
>>> +struct netlink_ext_ack *extack)
>>> +{
>>> +   struct cake_sched_data *q = qdisc_priv(sch);
>>> +   int i, j;
>>> +
>>> +   sch->limit = 10240;
>>> +   q->tin_mode = CAKE_DIFFSERV_BESTEFFORT;
>>> +   q->flow_mode  = CAKE_FLOW_TRIPLE;
>>> +
>>> +   q->rate_bps = 0; /* unlimited by default */
>>> +
>>> +   q->interval = 100000; /* 100ms default */
>>> +   q->target   =   5000; /* 5ms: codel RFC argues
>>> +  * for 5 to 10% of interval
>>> +  */
>>> +
>>> +   q->cur_tin = 0;
>>> +   q->cur_flow  = 0;
>>> +
>>> +   if (opt) {
>>> +   int err = cake_change(sch, opt, extack);
>>> +
>>> +   if (err)
>>> +   return err;
>>
>>
>> Not sure if you really want to reallocate q->tins below for this
>> case.
>
> I'm not sure what you mean here? If there's an error we return it and
> the qdisc is not created. If there's not, we allocate and on subsequent
> changes cake_change() will be called directly, or? Can the init function
> ever be called again during the lifetime of the qdisc?
>

In non-error case, you call cake_change() first and then allocate ->tins
with kvzalloc() below. For me it looks like you don't need to allocate it
again when ->tins!=NULL.


Re: [PATCH net-next v8 1/7] sched: Add Common Applications Kept Enhanced (cake) qdisc

2018-05-04 Thread Cong Wang
On Fri, May 4, 2018 at 7:02 AM, Toke Høiland-Jørgensen  wrote:
> +struct cake_sched_data {
> +   struct cake_tin_data *tins;
> +
> +   struct cake_heap_entry overflow_heap[CAKE_QUEUES * CAKE_MAX_TINS];
> +   u16 overflow_timeout;
> +
> +   u16 tin_cnt;
> +   u8  tin_mode;
> +   u8  flow_mode;
> +   u8  ack_filter;
> +   u8  atm_mode;
> +
> +   /* time_next = time_this + ((len * rate_ns) >> rate_shft) */
> +   u16 rate_shft;
> +   u64 time_next_packet;
> +   u64 failsafe_next_packet;
> +   u32 rate_ns;
> +   u32 rate_bps;
> +   u16 rate_flags;
> +   s16 rate_overhead;
> +   u16 rate_mpu;
> +   u32 interval;
> +   u32 target;
> +
> +   /* resource tracking */
> +   u32 buffer_used;
> +   u32 buffer_max_used;
> +   u32 buffer_limit;
> +   u32 buffer_config_limit;
> +
> +   /* indices for dequeue */
> +   u16 cur_tin;
> +   u16 cur_flow;
> +
> +   struct qdisc_watchdog watchdog;
> +   const u8*tin_index;
> +   const u8*tin_order;
> +
> +   /* bandwidth capacity estimate */
> +   u64 last_packet_time;
> +   u64 avg_packet_interval;
> +   u64 avg_window_begin;
> +   u32 avg_window_bytes;
> +   u32 avg_peak_bandwidth;
> +   u64 last_reconfig_time;
> +
> +   /* packet length stats */
> +   u32 avg_netoff;
> +   u16 max_netlen;
> +   u16 max_adjlen;
> +   u16 min_netlen;
> +   u16 min_adjlen;
> +};


So sch_cake doesn't accept normal tc filters? Is this intentional?
If so, why?


> +static u16 quantum_div[CAKE_QUEUES + 1] = {0};
> +
> +#define REC_INV_SQRT_CACHE (16)
> +static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0};
> +
> +/* http://en.wikipedia.org/wiki/Methods_of_computing_square_roots
> + * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
> + *
> + * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
> + */
> +
> +static void cobalt_newton_step(struct cobalt_vars *vars)
> +{
> +   u32 invsqrt = vars->rec_inv_sqrt;
> +   u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
> +   u64 val = (3LL << 32) - ((u64)vars->count * invsqrt2);
> +
> +   val >>= 2; /* avoid overflow in following multiply */
> +   val = (val * invsqrt) >> (32 - 2 + 1);
> +
> +   vars->rec_inv_sqrt = val;
> +}
> +
> +static void cobalt_invsqrt(struct cobalt_vars *vars)
> +{
> +   if (vars->count < REC_INV_SQRT_CACHE)
> +   vars->rec_inv_sqrt = cobalt_rec_inv_sqrt_cache[vars->count];
> +   else
> +   cobalt_newton_step(vars);
> +}


Looks pretty much duplicated with codel...
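To see what the shared codel/cobalt fixed-point logic actually computes, the quoted Newton step and cache init can be ported to userspace C and checked against a floating-point 1/sqrt(count). This is an illustrative port for experimentation (names shortened), not the kernel code itself.

```c
/* Userspace re-implementation of the Q0.32 Newton step shared by
 * codel and cobalt, to show what the cached values converge to. */
#include <assert.h>
#include <math.h>
#include <stdint.h>

#define REC_INV_SQRT_CACHE 16

static uint32_t inv_sqrt_cache[REC_INV_SQRT_CACHE];

struct cvars {
	uint32_t count;
	uint32_t rec_inv_sqrt;	/* Q0.32 estimate of 1/sqrt(count) */
};

/* new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2) */
static void newton_step(struct cvars *v)
{
	uint32_t invsqrt = v->rec_inv_sqrt;
	uint32_t invsqrt2 = (uint32_t)(((uint64_t)invsqrt * invsqrt) >> 32);
	uint64_t val = (3ULL << 32) - (uint64_t)v->count * invsqrt2;

	val >>= 2;	/* avoid overflow in the following multiply */
	val = (val * invsqrt) >> (32 - 2 + 1);
	v->rec_inv_sqrt = (uint32_t)val;
}

/* Mirror of cobalt_cache_init(): four steps per count, reusing the
 * previous count's estimate as the starting point. */
static void cache_init(void)
{
	struct cvars v = { .count = 0, .rec_inv_sqrt = ~0U };

	inv_sqrt_cache[0] = v.rec_inv_sqrt;
	for (v.count = 1; v.count < REC_INV_SQRT_CACHE; v.count++) {
		newton_step(&v);
		newton_step(&v);
		newton_step(&v);
		newton_step(&v);
		inv_sqrt_cache[v.count] = v.rec_inv_sqrt;
	}
}

/* Convert a cached Q0.32 value to a double for comparison. */
static double cache_as_double(int count)
{
	return inv_sqrt_cache[count] / 4294967296.0;
}
```

After cache_init(), inv_sqrt_cache[c] / 2^32 tracks 1/sqrt(c) to well under 1% for c in 1..15, which is why a single further Newton step is enough above 16.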


> +
> +/* There is a big difference in timing between the accurate values placed in
> + * the cache and the approximations given by a single Newton step for small
> + * count values, particularly when stepping from count 1 to 2 or vice versa.
> + * Above 16, a single Newton step gives sufficient accuracy in either
> + * direction, given the precision stored.
> + *
> + * The magnitude of the error when stepping up to count 2 is such as to give
> + * the value that *should* have been produced at count 4.
> + */
> +
> +static void cobalt_cache_init(void)
> +{
> +   struct cobalt_vars v;
> +
> +   memset(&v, 0, sizeof(v));
> +   v.rec_inv_sqrt = ~0U;
> +   cobalt_rec_inv_sqrt_cache[0] = v.rec_inv_sqrt;
> +
> +   for (v.count = 1; v.count < REC_INV_SQRT_CACHE; v.count++) {
> +   cobalt_newton_step(&v);
> +   cobalt_newton_step(&v);
> +   cobalt_newton_step(&v);
> +   cobalt_newton_step(&v);
> +
> +   cobalt_rec_inv_sqrt_cache[v.count] = v.rec_inv_sqrt;
> +   }
> +}
> +
> +static void cobalt_vars_init(struct cobalt_vars *vars)
> +{
> +   memset(vars, 0, sizeof(*vars));
> +
> +   if (!cobalt_rec_inv_sqrt_cache[0]) {
> +   cobalt_cache_init();
> +   cobalt_rec_inv_sqrt_cache[0] = ~0;
> +   }
> +}
> +
> +/* CoDel control_law is t + interval/sqrt(count)
> + * We maintain in rec_inv_sqrt the reciprocal value of sqrt(count) to avoid
> + * both sqrt() and divide operation.
> + */
> +static cobalt_time_t cobalt_control(cobalt_time_t t,
> +   cobalt_time_t interval,
> +   u32 rec_inv_sqrt)
> +{
> +   return t + reciprocal_scale(interval, rec_inv_sqrt);
> +}
> +
> +/* Call this when a packet had to be dropped due to queue overflow.  Returns
> + * true if the BLUE state was quiescent before but active after this call.
> + */
> +static bool cobalt_queue_full(struct cobalt_vars *vars,
> +

Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-05-03 Thread Cong Wang
On Wed, May 2, 2018 at 10:44 PM, Andrew Morton
<a...@linux-foundation.org> wrote:
> On Wed, 2 May 2018 21:58:25 -0700 Cong Wang <xiyou.wangc...@gmail.com> wrote:
>
>> On Wed, May 2, 2018 at 9:27 PM, Andrew Morton <a...@linux-foundation.org> 
>> wrote:
>> >
>> > So it's saying that something which got committed into Linus's tree
>> > after 4.17-rc3 has caused a NULL deref in
>> > sock_release->llc_ui_release+0x3a/0xd0
>>
>> Do you mean it contains commit 3a04ce7130a7
>> ("llc: fix NULL pointer deref for SOCK_ZAPPED")?
>
> That was in 4.17-rc3 so if this report's bisection is correct, that
> patch is innocent.
>
> origin.patch (http://ozlabs.org/~akpm/mmots/broken-out/origin.patch)
> contains no changes to net/llc/af_llc.c so perhaps this crash is also
> occurring in 4.17-rc3 base.

The commit I pointed out is supposed to fix this bug...

Please let me know if it doesn't.


Re: [lkp-robot] 486ad79630 [ 15.532543] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004

2018-05-02 Thread Cong Wang
On Wed, May 2, 2018 at 9:27 PM, Andrew Morton  wrote:
>
> So it's saying that something which got committed into Linus's tree
> after 4.17-rc3 has caused a NULL deref in
> sock_release->llc_ui_release+0x3a/0xd0

Do you mean it contains commit 3a04ce7130a7
("llc: fix NULL pointer deref for SOCK_ZAPPED")?


Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc

2018-05-01 Thread Cong Wang
On Tue, May 1, 2018 at 12:12 PM, Eric Dumazet  wrote:
>
> I guess that nobody really wants to really review Cake if
> it is a file with 2700 lines of code and hundreds of variables/tunables.
>
> Sure, we have big files in networking land, as a result of thousands of 
> changes.
>
> If you split it, then maybe the work can be split among reviewers as a result.
>

+1

At *least* split it into CAKE without ack-filter and ack-filter implementation.


Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc

2018-04-30 Thread Cong Wang
On Mon, Apr 30, 2018 at 2:27 PM, Dave Taht <dave.t...@gmail.com> wrote:
> On Mon, Apr 30, 2018 at 2:21 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>> On Sun, Apr 29, 2018 at 2:34 PM, Toke Høiland-Jørgensen <t...@toke.dk> wrote:
>>> sch_cake targets the home router use case and is intended to squeeze the
>>> most bandwidth and latency out of even the slowest ISP links and routers,
>>> while presenting an API simple enough that even an ISP can configure it.
>>>
>>> Example of use on a cable ISP uplink:
>>>
>>> tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
>>>
>>> To shape a cable download link (ifb and tc-mirred setup elided)
>>>
>>> tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
>>>
>>> CAKE is filled with:
>>>
>>> * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
>>>   derived Flow Queuing system, which autoconfigures based on the bandwidth.
>>> * A novel "triple-isolate" mode (the default) which balances per-host
>>>   and per-flow FQ even through NAT.
> * A deficit based shaper, that can also be used in an unlimited mode.
>>> * 8 way set associative hashing to reduce flow collisions to a minimum.
>>> * A reasonable interpretation of various diffserv latency/loss tradeoffs.
>>> * Support for zeroing diffserv markings for entering and exiting traffic.
>>> * Support for interacting well with Docsis 3.0 shaper framing.
>>> * Extensive support for DSL framing types.
>>> * Support for ack filtering.
>>
>> Why this TCP ACK filtering has to be built into CAKE qdisc rather than
>> an independent TC filter? Why other qdisc's can't use it?
>
> I actually have a tc - bpf based ack filter, during the development of
> cake's ack-thinner, that I should submit one of these days. It
> proved to be of limited use.

Yeah.

>
> Probably the biggest mistake we made is by calling this cake feature a
> filter. It isn't.


It inspects the payload of each packet and drops packets, therefore
it is a filter by definition, no matter how you name it.

>
> Maybe we should have called it a "thinner" or something like that? In
> order to properly "thin" or "reduce" an ack stream
> you have to have a queue to look at and some related state. TC filters
> do not operate on queues, qdiscs do. Thus the "ack-filter" here is
> deeply embedded into cake's flow isolation and queue structures.


TC filters are installed on qdiscs, and in the beginning qdiscs were
queues, for example pfifo. We already have flow-based filters too
(cls_flower), so we can make them work together, although probably
it is not straightforward.


Re: [PATCH net-next v6] Add Common Applications Kept Enhanced (cake) qdisc

2018-04-30 Thread Cong Wang
On Sun, Apr 29, 2018 at 2:34 PM, Toke Høiland-Jørgensen  wrote:
> sch_cake targets the home router use case and is intended to squeeze the
> most bandwidth and latency out of even the slowest ISP links and routers,
> while presenting an API simple enough that even an ISP can configure it.
>
> Example of use on a cable ISP uplink:
>
> tc qdisc add dev eth0 cake bandwidth 20Mbit nat docsis ack-filter
>
> To shape a cable download link (ifb and tc-mirred setup elided)
>
> tc qdisc add dev ifb0 cake bandwidth 200mbit nat docsis ingress wash
>
> CAKE is filled with:
>
> * A hybrid Codel/Blue AQM algorithm, "Cobalt", tied to an FQ_Codel
>   derived Flow Queuing system, which autoconfigures based on the bandwidth.
> * A novel "triple-isolate" mode (the default) which balances per-host
>   and per-flow FQ even through NAT.
> * A deficit based shaper, that can also be used in an unlimited mode.
> * 8 way set associative hashing to reduce flow collisions to a minimum.
> * A reasonable interpretation of various diffserv latency/loss tradeoffs.
> * Support for zeroing diffserv markings for entering and exiting traffic.
> * Support for interacting well with Docsis 3.0 shaper framing.
> * Extensive support for DSL framing types.
> * Support for ack filtering.

Why this TCP ACK filtering has to be built into CAKE qdisc rather than
an independent TC filter? Why other qdisc's can't use it?


> * Extensive statistics for measuring, loss, ecn markings, latency
>   variation.
>
> A paper describing the design of CAKE is available at
> https://arxiv.org/abs/1804.07617
>

Thanks.


Re: [PATCH net-next] net: sch: prio: Set bands to default on delete instead of noop

2018-04-27 Thread Cong Wang
On Thu, Apr 26, 2018 at 6:32 AM, Nogah Frankel  wrote:
> When a band is created, it is set to the default qdisc, which is
> "invisible" pfifo.


Isn't TCA_DUMP_INVISIBLE for dumping this invisible qdisc?


> However, if a band is set to a qdisc that is later being deleted, it will
> be set to noop qdisc. This can cause a packet loss, while there is no clear
> user indication for it. ("invisible" qdisc are not being shown by default).
This patch sets a band to the default qdisc, rather than the noop qdisc, on
> delete operation.

It is set to noop historically; maybe that is not reasonable, but
changing it could break things.

What's wrong with using TCA_DUMP_INVISIBLE to dump it?


[Patch nf] ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()

2018-04-23 Thread Cong Wang
Similarly, tbl->entries is not initialized after kmalloc() and therefore
causes an uninit-value warning in ip_vs_lblc_check_expire(),
as reported by syzbot.

Reported-by: <syzbot+3e9695f147fb529aa...@syzkaller.appspotmail.com>
Cc: Simon Horman <ho...@verge.net.au>
Cc: Julian Anastasov <j...@ssi.bg>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/netfilter/ipvs/ip_vs_lblc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/ipvs/ip_vs_lblc.c b/net/netfilter/ipvs/ip_vs_lblc.c
index 3057e453bf31..83918119ceb8 100644
--- a/net/netfilter/ipvs/ip_vs_lblc.c
+++ b/net/netfilter/ipvs/ip_vs_lblc.c
@@ -371,6 +371,7 @@ static int ip_vs_lblc_init_svc(struct ip_vs_service *svc)
tbl->counter = 1;
tbl->dead = false;
tbl->svc = svc;
+   atomic_set(&tbl->entries, 0);
 
/*
 *Hook periodic timer for garbage collection
-- 
2.13.0



[Patch nf] ipvs: initialize tbl->entries after allocation

2018-04-23 Thread Cong Wang
tbl->entries is not initialized after kmalloc() and therefore
causes an uninit-value warning in ip_vs_lblc_check_expire(),
as reported by syzbot.

Reported-by: <syzbot+3dfdea57819073a04...@syzkaller.appspotmail.com>
Cc: Simon Horman <ho...@verge.net.au>
Cc: Julian Anastasov <j...@ssi.bg>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/netfilter/ipvs/ip_vs_lblcr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/ipvs/ip_vs_lblcr.c b/net/netfilter/ipvs/ip_vs_lblcr.c
index 92adc04557ed..bc2bc5eebcb8 100644
--- a/net/netfilter/ipvs/ip_vs_lblcr.c
+++ b/net/netfilter/ipvs/ip_vs_lblcr.c
@@ -534,6 +534,7 @@ static int ip_vs_lblcr_init_svc(struct ip_vs_service *svc)
tbl->counter = 1;
tbl->dead = false;
tbl->svc = svc;
+   atomic_set(&tbl->entries, 0);
 
/*
 *Hook periodic timer for garbage collection
-- 
2.13.0



Re: KASAN: null-ptr-deref Read in refcount_inc_not_zero

2018-04-23 Thread Cong Wang
#syz fix: llc: fix NULL pointer deref for SOCK_ZAPPED


[Patch net] llc: fix NULL pointer deref for SOCK_ZAPPED

2018-04-19 Thread Cong Wang
For SOCK_ZAPPED socket, we don't need to care about llc->sap,
so we should just skip these refcount functions in this case.

Fixes: f7e43672683b ("llc: hold llc_sap before release_sock()")
Reported-by: kernel test robot <l...@intel.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/llc/af_llc.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 6d29b2b94e84..cb80ebb38311 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -189,7 +189,6 @@ static int llc_ui_release(struct socket *sock)
 {
struct sock *sk = sock->sk;
struct llc_sock *llc;
-   struct llc_sap *sap;
 
if (unlikely(sk == NULL))
goto out;
@@ -200,15 +199,19 @@ static int llc_ui_release(struct socket *sock)
llc->laddr.lsap, llc->daddr.lsap);
if (!llc_send_disc(sk))
llc_ui_wait_for_disc(sk, sk->sk_rcvtimeo);
-   sap = llc->sap;
-   /* Hold this for release_sock(), so that llc_backlog_rcv() could still
-* use it.
-*/
-   llc_sap_hold(sap);
-   if (!sock_flag(sk, SOCK_ZAPPED))
+   if (!sock_flag(sk, SOCK_ZAPPED)) {
+   struct llc_sap *sap = llc->sap;
+
+   /* Hold this for release_sock(), so that llc_backlog_rcv()
+* could still use it.
+*/
+   llc_sap_hold(sap);
llc_sap_remove_socket(llc->sap, sk);
-   release_sock(sk);
-   llc_sap_put(sap);
+   release_sock(sk);
+   llc_sap_put(sap);
+   } else {
+   release_sock(sk);
+   }
if (llc->dev)
dev_put(llc->dev);
sock_put(sk);
-- 
2.13.0



[Patch net] llc: delete timers synchronously in llc_sk_free()

2018-04-19 Thread Cong Wang
The connection timers of an llc sock could still be running
after we delete them in llc_sk_free(), and even possibly
after we free the sock. We can just wait synchronously
here to avoid trouble.

Note, I leave other call paths as they are, since they may
not have to wait, at least we can change them to synchronously
when needed.

Also, move the code to net/llc/llc_conn.c, which is apparently
a better place.

Reported-by: <syzbot+f922284c18ea23a8e...@syzkaller.appspotmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/net/llc_conn.h |  1 +
 net/llc/llc_c_ac.c |  9 +
 net/llc/llc_conn.c | 22 +-
 3 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/include/net/llc_conn.h b/include/net/llc_conn.h
index 5c40f118c0fa..df528a623548 100644
--- a/include/net/llc_conn.h
+++ b/include/net/llc_conn.h
@@ -97,6 +97,7 @@ static __inline__ char llc_backlog_type(struct sk_buff *skb)
 
 struct sock *llc_sk_alloc(struct net *net, int family, gfp_t priority,
  struct proto *prot, int kern);
+void llc_sk_stop_all_timers(struct sock *sk, bool sync);
 void llc_sk_free(struct sock *sk);
 
 void llc_sk_reset(struct sock *sk);
diff --git a/net/llc/llc_c_ac.c b/net/llc/llc_c_ac.c
index 163121192aca..4d78375f9872 100644
--- a/net/llc/llc_c_ac.c
+++ b/net/llc/llc_c_ac.c
@@ -1099,14 +1099,7 @@ int llc_conn_ac_inc_tx_win_size(struct sock *sk, struct sk_buff *skb)
 
 int llc_conn_ac_stop_all_timers(struct sock *sk, struct sk_buff *skb)
 {
-   struct llc_sock *llc = llc_sk(sk);
-
-   del_timer(&llc->pf_cycle_timer.timer);
-   del_timer(&llc->ack_timer.timer);
-   del_timer(&llc->rej_sent_timer.timer);
-   del_timer(&llc->busy_state_timer.timer);
-   llc->ack_must_be_send = 0;
-   llc->ack_pf = 0;
+   llc_sk_stop_all_timers(sk, false);
return 0;
 }
 
diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c
index 110e32bcb399..c0ac522b48a1 100644
--- a/net/llc/llc_conn.c
+++ b/net/llc/llc_conn.c
@@ -961,6 +961,26 @@ struct sock *llc_sk_alloc(struct net *net, int family, gfp_t priority, struct pr
return sk;
 }
 
+void llc_sk_stop_all_timers(struct sock *sk, bool sync)
+{
+   struct llc_sock *llc = llc_sk(sk);
+
+   if (sync) {
+   del_timer_sync(&llc->pf_cycle_timer.timer);
+   del_timer_sync(&llc->ack_timer.timer);
+   del_timer_sync(&llc->rej_sent_timer.timer);
+   del_timer_sync(&llc->busy_state_timer.timer);
+   } else {
+   del_timer(&llc->pf_cycle_timer.timer);
+   del_timer(&llc->ack_timer.timer);
+   del_timer(&llc->rej_sent_timer.timer);
+   del_timer(&llc->busy_state_timer.timer);
+   }
+
+   llc->ack_must_be_send = 0;
+   llc->ack_pf = 0;
+}
+
 /**
  * llc_sk_free - Frees a LLC socket
  * @sk - socket to free
@@ -973,7 +993,7 @@ void llc_sk_free(struct sock *sk)
 
llc->state = LLC_CONN_OUT_OF_SVC;
/* Stop all (possibly) running timers */
-   llc_conn_ac_stop_all_timers(sk, NULL);
+   llc_sk_stop_all_timers(sk, true);
 #ifdef DEBUG_LLC_CONN_ALLOC
printk(KERN_INFO "%s: unackq=%d, txq=%d\n", __func__,
skb_queue_len(&llc->pdu_unack_q),
-- 
2.13.0



[Patch net] llc: hold llc_sap before release_sock()

2018-04-18 Thread Cong Wang
syzbot reported we still access llc->sap in llc_backlog_rcv()
after it is freed in llc_sap_remove_socket():

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:430
 llc_conn_ac_send_sabme_cmd_p_set_x+0x3a8/0x460 net/llc/llc_c_ac.c:785
 llc_exec_conn_trans_actions net/llc/llc_conn.c:475 [inline]
 llc_conn_service net/llc/llc_conn.c:400 [inline]
 llc_conn_state_process+0x4e1/0x13a0 net/llc/llc_conn.c:75
 llc_backlog_rcv+0x195/0x1e0 net/llc/llc_conn.c:891
 sk_backlog_rcv include/net/sock.h:909 [inline]
 __release_sock+0x12f/0x3a0 net/core/sock.c:2335
 release_sock+0xa4/0x2b0 net/core/sock.c:2850
 llc_ui_release+0xc8/0x220 net/llc/af_llc.c:204

llc->sap is refcount'ed and llc_sap_remove_socket() is paired
with llc_sap_add_socket(). This can be amended by holding its refcount
before llc_sap_remove_socket() and releasing it after release_sock().

Reported-by: <syzbot+6e181fc95081c2cf9...@syzkaller.appspotmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/llc/af_llc.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 01dcc0823d1f..6d29b2b94e84 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -189,6 +189,7 @@ static int llc_ui_release(struct socket *sock)
 {
struct sock *sk = sock->sk;
struct llc_sock *llc;
+   struct llc_sap *sap;
 
if (unlikely(sk == NULL))
goto out;
@@ -199,9 +200,15 @@ static int llc_ui_release(struct socket *sock)
llc->laddr.lsap, llc->daddr.lsap);
if (!llc_send_disc(sk))
llc_ui_wait_for_disc(sk, sk->sk_rcvtimeo);
+   sap = llc->sap;
+   /* Hold this for release_sock(), so that llc_backlog_rcv() could still
+* use it.
+*/
+   llc_sap_hold(sap);
if (!sock_flag(sk, SOCK_ZAPPED))
llc_sap_remove_socket(llc->sap, sk);
release_sock(sk);
+   llc_sap_put(sap);
if (llc->dev)
dev_put(llc->dev);
sock_put(sk);
-- 
2.13.0



Re: [PATCH 8/8] CREDITS: Add Chris Novakovic

2018-04-17 Thread Cong Wang
On Tue, Apr 17, 2018 at 1:58 PM, Chris Novakovic  wrote:
> Signed-off-by: Chris Novakovic 
> ---
>  CREDITS | 4 

It is rare to update CREDITS, people now use git to credit
contributions.


[Patch net] tipc: use the right skb in tipc_sk_fill_sock_diag()

2018-04-06 Thread Cong Wang
Commit 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
tried to fix the crash but failed, the crash is still 100% reproducible
with it.

In tipc_sk_fill_sock_diag(), skb is the diag dump we are filling, it is not
correct to retrieve its NETLINK_CB(), instead, like other protocol diag,
we should use NETLINK_CB(cb->skb).sk here.

Reported-by: <syzbot+326e587eff1074657...@syzkaller.appspotmail.com>
Fixes: 4b2e6877b879 ("tipc: Fix namespace violation in tipc_sk_fill_sock_diag")
Fixes: c30b70deb5f4 ("tipc: implement socket diagnostics for AF_TIPC")
Cc: GhantaKrishnamurthy MohanKrishna 
<mohan.krishna.ghanta.krishnamur...@ericsson.com>
Cc: Jon Maloy <jon.ma...@ericsson.com>
Cc: Ying Xue <ying@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/tipc/diag.c   | 2 +-
 net/tipc/socket.c | 6 +++---
 net/tipc/socket.h | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/tipc/diag.c b/net/tipc/diag.c
index 46d9cd62f781..aaabb0b776dd 100644
--- a/net/tipc/diag.c
+++ b/net/tipc/diag.c
@@ -59,7 +59,7 @@ static int __tipc_add_sock_diag(struct sk_buff *skb,
if (!nlh)
return -EMSGSIZE;
 
-   err = tipc_sk_fill_sock_diag(skb, tsk, req->tidiag_states,
+   err = tipc_sk_fill_sock_diag(skb, cb, tsk, req->tidiag_states,
 __tipc_diag_gen_cookie);
if (err)
return err;
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index cee6674a3bf4..1fd1c8b5ce03 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -3257,8 +3257,8 @@ int tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
 }
 EXPORT_SYMBOL(tipc_nl_sk_walk);
 
-int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct tipc_sock *tsk,
-  u32 sk_filter_state,
+int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct netlink_callback *cb,
+  struct tipc_sock *tsk, u32 sk_filter_state,
   u64 (*tipc_diag_gen_cookie)(struct sock *sk))
 {
struct sock *sk = &tsk->sk;
@@ -3280,7 +3280,7 @@ int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct tipc_sock *tsk,
nla_put_u32(skb, TIPC_NLA_SOCK_TIPC_STATE, (u32)sk->sk_state) ||
nla_put_u32(skb, TIPC_NLA_SOCK_INO, sock_i_ino(sk)) ||
nla_put_u32(skb, TIPC_NLA_SOCK_UID,
-   from_kuid_munged(sk_user_ns(NETLINK_CB(skb).sk),
+   from_kuid_munged(sk_user_ns(NETLINK_CB(cb->skb).sk),
 sock_i_uid(sk))) ||
nla_put_u64_64bit(skb, TIPC_NLA_SOCK_COOKIE,
  tipc_diag_gen_cookie(sk),
diff --git a/net/tipc/socket.h b/net/tipc/socket.h
index aae3fd4cd06c..aff9b2ae5a1f 100644
--- a/net/tipc/socket.h
+++ b/net/tipc/socket.h
@@ -61,8 +61,8 @@ int tipc_sk_rht_init(struct net *net);
 void tipc_sk_rht_destroy(struct net *net);
 int tipc_nl_sk_dump(struct sk_buff *skb, struct netlink_callback *cb);
 int tipc_nl_publ_dump(struct sk_buff *skb, struct netlink_callback *cb);
-int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct tipc_sock *tsk,
-  u32 sk_filter_state,
+int tipc_sk_fill_sock_diag(struct sk_buff *skb, struct netlink_callback *cb,
+  struct tipc_sock *tsk, u32 sk_filter_state,
   u64 (*tipc_diag_gen_cookie)(struct sock *sk));
 int tipc_nl_sk_walk(struct sk_buff *skb, struct netlink_callback *cb,
int (*skb_handler)(struct sk_buff *skb,
-- 
2.13.0



[Patch net] net_sched: fix a missing idr_remove() in u32_delete_key()

2018-04-06 Thread Cong Wang
When we delete a u32 key via u32_delete_key(), we forget to
call idr_remove() to remove its handle from IDR.

Fixes: e7614370d6f0 ("net_sched: use idr to allocate u32 filter handles")
Reported-by: Marcin Kabiesz <ad...@hostcenter.eu>
Cc: Linus Torvalds <torva...@linux-foundation.org>
Cc: Jamal Hadi Salim <j...@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/sched/cls_u32.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index ed8b6a24b9e9..bac47b5d18fd 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -489,6 +489,7 @@ static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
RCU_INIT_POINTER(*kp, key->next);
 
tcf_unbind_filter(tp, &key->res);
+   idr_remove(&ht->handle_idr, key->handle);
tcf_exts_get_net(&key->exts);
call_rcu(&key->rcu, u32_delete_key_freepf_rcu);
return 0;
-- 
2.13.0
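The failure mode this one-liner fixes is easy to reproduce in miniature: if delete tears down the object but never returns its handle, repeated add/delete cycles exhaust the table and further adds fail, which is exactly the "No space left on device" symptom. A toy userspace model (plain C with a bitmap allocator — not the kernel IDR API, just an illustration):

```c
/* Toy handle allocator: delete_key_buggy() frees the object (elided)
 * but forgets to return the handle, so churn exhausts the table the
 * same way the missing idr_remove() did. Purely illustrative. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NHANDLES 8

static bool handle_in_use[NHANDLES];

/* Allocate the lowest free handle, or -1 when the table is full --
 * the point where the kernel answers "No space left on device". */
static int handle_alloc(void)
{
	for (int i = 0; i < NHANDLES; i++) {
		if (!handle_in_use[i]) {
			handle_in_use[i] = true;
			return i;
		}
	}
	return -1;
}

static void handle_remove(int h)
{
	handle_in_use[h] = false;
}

/* Buggy delete: forgets the handle -- the original bug. */
static void delete_key_buggy(int h)
{
	(void)h;	/* missing handle_remove(h) */
}

/* Fixed delete: mirrors the one-line idr_remove() fix. */
static void delete_key_fixed(int h)
{
	handle_remove(h);
}

/* Repeatedly add and delete one key; returns -1 once allocation fails. */
static int churn(void (*del)(int), int cycles)
{
	memset(handle_in_use, 0, sizeof(handle_in_use));
	for (int i = 0; i < cycles; i++) {
		int h = handle_alloc();

		if (h < 0)
			return -1;
		del(h);
	}
	return 0;
}
```

With the buggy delete, the table fails after NHANDLES cycles; with the fixed one, churn runs indefinitely.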



Re: Problem with the kernel 4.15 - cutting the band (tc)

2018-04-06 Thread Cong Wang
On Fri, Apr 6, 2018 at 2:56 PM, Linus Torvalds
 wrote:
> Forwarding a report about what looks like a regression between 4.14 and 4.15.
>
> New ENOSPC issue? I don't even knew where to start guessing where to look.
>
> Help me, Davem-Wan Kenobi, you are my only hope.
>
> (But adding netdev just in case somebody else goes "That's obviously Xyz")
>
>   Linus
>
> -- Forwarded message --
> From: Marcin Kabiesz 
> Date: Thu, Apr 5, 2018 at 10:38 AM
> Subject: Problem with the kernel 4.15 - cutting the band (tc)
>
>
> Hello,
> I have a problem with bandwidth cutting on kernel 4.15. On the version
> up to 4.15, i.e. 4.14, this problem does not occur.
>
> uname -a: Linux router 4.14.15 #1 SMP x86_64 Intel Xeon E3-1230 v6
> command to reproduce:
>
> tc qdisc add dev ifb0 root handle 1: htb r2q 2
> tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
> 10gbit quantum 16000
> tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
> tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
> match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
>
> This is ok: no errors or warnings, and nothing in the dmesg log.
>
> uname -a: Linux router 4.15.8 #1 SMP x86_64 Intel Xeon E3-1230 v6 (or
> 4.15.14 this same effect)
> command to reproduce:
>
> tc qdisc add dev ifb0 root handle 1: htb r2q 2
> tc class add dev ifb0 parent 1: classid 1:1 htb rate 10gbit ceil
> 10gbit quantum 16000
> tc filter add dev ifb0 parent 1: prio 5 handle 1: protocol all u32 divisor 256
> tc filter add dev ifb0 protocol all parent 1: prio 5 u32 ht 800::
> match ip dst 0.0.0.0/0 hashkey mask 0x00ff at 16 link 1:
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> tc filter del dev ifb0 parent 1:0 handle 1:2c:1 prio 5 u32
> tc filter add dev ifb0 parent 1:0 handle ::1 protocol all prio 5 u32
> ht 1:2c: match ip dst 192.168.3.44/32 flowid 1:2
> RTNETLINK answers: No space left on device
> We have an error talking to the kernel
>
> This is not ok: there is an error, and nothing in the dmesg log.

We forgot to call idr_remove() when deleting u32 key...

I am cooking a fix now.

Thanks!


Re: possible deadlock in skb_queue_tail

2018-04-03 Thread Cong Wang
On Tue, Apr 3, 2018 at 4:42 AM, Kirill Tkhai  wrote:
> On 03.04.2018 14:25, Dmitry Vyukov wrote:
>> On Tue, Apr 3, 2018 at 11:50 AM, Kirill Tkhai  wrote:
>>> sk_diag_dump_icons() dumps only sockets in TCP_LISTEN state.
>>> TCP_LISTEN state may be assigned in only place in net/unix/af_unix.c:
>>> it's unix_listen(). The function is applied to stream and seqpacket
>>> socket types.
>>>
>>> It can't be stream because of the second stack, and seqpacket also can't,
>>> as I don't think it's possible for gcc to inline unix_seqpacket_sendmsg()
>>> in the way, we don't see it in the stack.
>>>
>>> So, this looks like a false positive result to me.
>>>
>>> Kirill
>>
>> Do you mean that these &(>lock)->rlock/1 referenced in 2 stacks are
>> always different?
>
> In these 2 particular stacks they have to be different.

So actually my patch could fix this false positive? I thought it couldn't.
https://patchwork.ozlabs.org/patch/894342/


[Patch net] af_unix: remove redundant lockdep class

2018-04-02 Thread Cong Wang
After commit 581319c58600 ("net/socket: use per af lockdep classes for sk queues")
sock queue locks now have per-af lockdep classes, including unix socket.
It is no longer necessary to workaround it.

I noticed this while looking at a syzbot deadlock report, this patch
itself doesn't fix it (this is why I don't add Reported-by).

Fixes: 581319c58600 ("net/socket: use per af lockdep classes for sk queues")
Cc: Paolo Abeni <pab...@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/unix/af_unix.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 2d465bdeccbc..45971e173924 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -745,14 +745,6 @@ static struct proto unix_proto = {
.obj_size   = sizeof(struct unix_sock),
 };
 
-/*
- * AF_UNIX sockets do not interact with hardware, hence they
- * dont trigger interrupts - so it's safe for them to have
- * bh-unsafe locking for their sk_receive_queue.lock. Split off
- * this special lock-class by reinitializing the spinlock key:
- */
-static struct lock_class_key af_unix_sk_receive_queue_lock_key;
-
 static struct sock *unix_create1(struct net *net, struct socket *sock, int 
kern)
 {
struct sock *sk = NULL;
@@ -767,8 +759,6 @@ static struct sock *unix_create1(struct net *net, struct 
socket *sock, int kern)
goto out;
 
sock_init_data(sock, sk);
-   lockdep_set_class(>sk_receive_queue.lock,
-   _unix_sk_receive_queue_lock_key);
 
sk->sk_allocation   = GFP_KERNEL_ACCOUNT;
sk->sk_write_space  = unix_write_space;
-- 
2.13.0



[Patch nf] nf_conntrack_extend: silent a memory leak warning

2018-03-30 Thread Cong Wang
The following memory leak report is a false positive:

unreferenced object 0x8f37f156fb38 (size 128):
  comm "softirq", pid 0, jiffies 4294899665 (age 11.292s)
  hex dump (first 32 bytes):
6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
00 00 00 00 30 00 20 00 48 6b 6b 6b 6b 6b 6b 6b  ....0. .Hkkkkkkk
  backtrace:
[<4fda266a>] __kmalloc_track_caller+0x10d/0x141
[<7b0a7e3c>] __krealloc+0x45/0x62
[<d08e0bfb>] nf_ct_ext_add+0xdc/0x133
[<99b47fd8>] init_conntrack+0x1b1/0x392
[<86dc36ec>] nf_conntrack_in+0x1ee/0x34b
[<940592de>] nf_hook_slow+0x36/0x95
[<d1bd4da7>] nf_hook.constprop.43+0x1c3/0x1dd
[<c3673266>] __ip_local_out+0xae/0xb4
[<3e4192a6>] ip_local_out+0x17/0x33
[<b64356de>] igmp_ifc_timer_expire+0x23e/0x26f
[<6a8f3032>] call_timer_fn+0x14c/0x2a5
[<650c1725>] __run_timers.part.34+0x150/0x182
[<90e6946e>] run_timer_softirq+0x2a/0x4c
[<4d1e7293>] __do_softirq+0x1d1/0x3c2
[<4643557d>] irq_exit+0x53/0xa2
[<29ddee8f>] smp_apic_timer_interrupt+0x22a/0x235

because __krealloc() is not supposed to release the old
memory and it is released later via kfree_rcu(). Since this is
the only external user of __krealloc(), just mark the old memory as
not a leak here.

Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: Jozsef Kadlecsik <kad...@blackhole.kfki.hu>
Cc: Florian Westphal <f...@strlen.de>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/netfilter/nf_conntrack_extend.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/nf_conntrack_extend.c 
b/net/netfilter/nf_conntrack_extend.c
index 9fe0ddc333fb..bd71a828ebde 100644
--- a/net/netfilter/nf_conntrack_extend.c
+++ b/net/netfilter/nf_conntrack_extend.c
@@ -71,6 +71,7 @@ void *nf_ct_ext_add(struct nf_conn *ct, enum nf_ct_ext_id id, 
gfp_t gfp)
rcu_read_unlock();
 
alloc = max(newlen, NF_CT_EXT_PREALLOC);
+   kmemleak_not_leak(old);
new = __krealloc(old, alloc, gfp);
if (!new)
return NULL;
-- 
2.13.0



Re: WARNING in refcount_dec

2018-03-28 Thread Cong Wang
(Cc'ing netdev and Willem)

On Wed, Mar 28, 2018 at 12:03 PM, Byoungyoung Lee
 wrote:
> Another crash patterns observed: race between (setsockopt$packet_int)
> and (bind$packet).
>
> --
> [  357.731597] kernel BUG at
> /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/net/packet/af_packet.c:3107!
> [  357.733382] invalid opcode:  [#1] SMP KASAN
> [  357.734017] Modules linked in:
> [  357.734662] CPU: 1 PID: 3871 Comm: repro.exe Not tainted 4.16.0-rc3 #1
> [  357.735791] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [  357.737434] RIP: 0010:packet_do_bind+0x88d/0x950
> [  357.738121] RSP: 0018:8800b2787b08 EFLAGS: 00010293
> [  357.738906] RAX: 8800b2fdc780 RBX: 880234358cc0 RCX: 
> 838b244c
> [  357.739905] RDX:  RSI: 838b257d RDI: 
> 0001
> [  357.741315] RBP: 8800b2787c10 R08: 8800b2fdc780 R09: 
> 
> [  357.743055] R10: 0001 R11:  R12: 
> 88023352ecc0
> [  357.744744] R13:  R14: 0001 R15: 
> 1d00
> [  357.746377] FS:  7f4b43733700() GS:8800b8b0()
> knlGS:
> [  357.749599] CS:  0010 DS:  ES:  CR0: 80050033
> [  357.752096] CR2: 20058000 CR3: 0002334b8000 CR4: 
> 06e0
> [  357.755045] Call Trace:
> [  357.755822]  ? compat_packet_setsockopt+0x100/0x100
> [  357.757324]  ? __sanitizer_cov_trace_const_cmp8+0x18/0x20
> [  357.758810]  packet_bind+0xa2/0xe0
> [  357.759640]  SYSC_bind+0x279/0x2f0
> [  357.760364]  ? move_addr_to_kernel.part.19+0xc0/0xc0
> [  357.761491]  ? __handle_mm_fault+0x25d0/0x25d0
> [  357.762449]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
> [  357.763663]  ? __do_page_fault+0x417/0xba0
> [  357.764569]  ? vmalloc_fault+0x910/0x910
> [  357.765405]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
> [  357.766525]  ? mark_held_locks+0x25/0xb0
> [  357.767336]  ? SyS_socketpair+0x4a0/0x4a0
> [  357.768182]  SyS_bind+0x24/0x30
> [  357.768851]  do_syscall_64+0x209/0x5d0
> [  357.769650]  ? syscall_return_slowpath+0x3e0/0x3e0
> [  357.770665]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
> [  357.771779]  ? syscall_return_slowpath+0x260/0x3e0
> [  357.772748]  ? mark_held_locks+0x25/0xb0
> [  357.773581]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
> [  357.774720]  ? retint_user+0x18/0x18
> [  357.775493]  ? trace_hardirqs_off_caller+0xb5/0x120
> [  357.776567]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [  357.777512]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [  357.778508] RIP: 0033:0x4503a9
> [  357.779156] RSP: 002b:7f4b43732ce8 EFLAGS: 0246 ORIG_RAX:
> 0031
> [  357.780737] RAX: ffda RBX:  RCX: 
> 004503a9
> [  357.782169] RDX: 0014 RSI: 20058000 RDI: 
> 0003
> [  357.783710] RBP: 7f4b43732d10 R08:  R09: 
> 
> [  357.785202] R10:  R11: 0246 R12: 
> 
> [  357.786664] R13:  R14: 7f4b437339c0 R15: 
> 7f4b43733700
> [  357.788210] Code: c0 fd 48 c7 c2 00 c8 d9 84 be ab 02 00 00 48 c7
> c7 60 c8 d9 84 c6 05 e7 a2 48 02 01 e8 3f 17 af fd e9 60 fb ff ff e8
> 43 b3 c0 fd <0f> 0b e8 3c b3 c0 fd 48 8b bd 20 ff ff ff e8 60 1e e7 fd
> 4c 89
> [  357.792260] RIP: packet_do_bind+0x88d/0x950 RSP: 8800b2787b08
> [  357.793698] ---[ end trace 0c5a2539f0247369 ]---
> [  357.794696] Kernel panic - not syncing: Fatal exception
> [  357.795918] Kernel Offset: disabled
> [  357.796614] Rebooting in 86400 seconds..
>
> On Wed, Mar 28, 2018 at 1:19 AM, Byoungyoung Lee  
> wrote:
>> We report the crash: WARNING in refcount_dec
>>
>> This crash has been found in v4.16-rc3 using RaceFuzzer (a modified
>> version of Syzkaller), which we describe more at the end of this
>> report. Our analysis shows that the race occurs when invoking two
>> syscalls concurrently, (setsockopt$packet_int) and
>> (setsockopt$packet_rx_ring).
>>
>> C repro code : 
>> https://kiwi.cs.purdue.edu/static/race-fuzzer/repro-refcount_dec.c
>> kernel config: 
>> https://kiwi.cs.purdue.edu/static/race-fuzzer/kernel-config-v4.16-rc3


I tried your reproducer, no luck here.



>>
>> ---
>> [  305.838560] refcount_t: decrement hit 0; leaking memory.
>> [  305.839669] WARNING: CPU: 0 PID: 3867 at
>> /home/blee/project/race-fuzzer/kernels/kernel_v4.16-rc3/lib/refcount.c:228
>> refcount_dec+0x62/0x70
>> [  305.841441] Modules linked in:
>> [  305.841883] CPU: 0 PID: 3867 Comm: repro.exe Not tainted 4.16.0-rc3 #1
>> [  305.842803] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>> [  305.844345] RIP: 0010:refcount_dec+0x62/0x70
>> [  305.845005] RSP: 0018:880224d374f8 

[Patch net] llc: properly handle dev_queue_xmit() return value

2018-03-26 Thread Cong Wang
llc_conn_send_pdu() pushes the skb into the write queue and
calls llc_conn_send_pdus() to flush it out. However, the
status of dev_queue_xmit() is not returned to the caller,
in this case llc_conn_state_process().

llc_conn_state_process() needs to hold the skb whether the
transmit succeeds or fails, because it still uses the skb
afterwards; therefore we should hold the skb before dev_queue_xmit()
when that skb is the one being processed by llc_conn_state_process().

Other callers can just pass NULL and keep ignoring
the return value, as they do today.

Reported-by: Noam Rathaus <no...@beyondsecurity.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/net/llc_conn.h |  2 +-
 net/llc/llc_c_ac.c | 15 +--
 net/llc/llc_conn.c | 32 +++-
 3 files changed, 33 insertions(+), 16 deletions(-)

diff --git a/include/net/llc_conn.h b/include/net/llc_conn.h
index fe994d2e5286..5c40f118c0fa 100644
--- a/include/net/llc_conn.h
+++ b/include/net/llc_conn.h
@@ -103,7 +103,7 @@ void llc_sk_reset(struct sock *sk);
 
 /* Access to a connection */
 int llc_conn_state_process(struct sock *sk, struct sk_buff *skb);
-void llc_conn_send_pdu(struct sock *sk, struct sk_buff *skb);
+int llc_conn_send_pdu(struct sock *sk, struct sk_buff *skb);
 void llc_conn_rtn_pdu(struct sock *sk, struct sk_buff *skb);
 void llc_conn_resend_i_pdu_as_cmd(struct sock *sk, u8 nr, u8 first_p_bit);
 void llc_conn_resend_i_pdu_as_rsp(struct sock *sk, u8 nr, u8 first_f_bit);
diff --git a/net/llc/llc_c_ac.c b/net/llc/llc_c_ac.c
index f59648018060..163121192aca 100644
--- a/net/llc/llc_c_ac.c
+++ b/net/llc/llc_c_ac.c
@@ -389,7 +389,7 @@ static int llc_conn_ac_send_i_cmd_p_set_0(struct sock *sk, 
struct sk_buff *skb)
llc_pdu_init_as_i_cmd(skb, 0, llc->vS, llc->vR);
rc = llc_mac_hdr_init(skb, llc->dev->dev_addr, llc->daddr.mac);
if (likely(!rc)) {
-   llc_conn_send_pdu(sk, skb);
+   rc = llc_conn_send_pdu(sk, skb);
llc_conn_ac_inc_vs_by_1(sk, skb);
}
return rc;
@@ -916,7 +916,7 @@ static int llc_conn_ac_send_i_rsp_f_set_ackpf(struct sock 
*sk,
llc_pdu_init_as_i_cmd(skb, llc->ack_pf, llc->vS, llc->vR);
rc = llc_mac_hdr_init(skb, llc->dev->dev_addr, llc->daddr.mac);
if (likely(!rc)) {
-   llc_conn_send_pdu(sk, skb);
+   rc = llc_conn_send_pdu(sk, skb);
llc_conn_ac_inc_vs_by_1(sk, skb);
}
return rc;
@@ -935,14 +935,17 @@ static int llc_conn_ac_send_i_rsp_f_set_ackpf(struct sock 
*sk,
 int llc_conn_ac_send_i_as_ack(struct sock *sk, struct sk_buff *skb)
 {
struct llc_sock *llc = llc_sk(sk);
+   int ret;
 
if (llc->ack_must_be_send) {
-   llc_conn_ac_send_i_rsp_f_set_ackpf(sk, skb);
+   ret = llc_conn_ac_send_i_rsp_f_set_ackpf(sk, skb);
llc->ack_must_be_send = 0 ;
llc->ack_pf = 0;
-   } else
-   llc_conn_ac_send_i_cmd_p_set_0(sk, skb);
-   return 0;
+   } else {
+   ret = llc_conn_ac_send_i_cmd_p_set_0(sk, skb);
+   }
+
+   return ret;
 }
 
 /**
diff --git a/net/llc/llc_conn.c b/net/llc/llc_conn.c
index 9177dbb16dce..110e32bcb399 100644
--- a/net/llc/llc_conn.c
+++ b/net/llc/llc_conn.c
@@ -30,7 +30,7 @@
 #endif
 
 static int llc_find_offset(int state, int ev_type);
-static void llc_conn_send_pdus(struct sock *sk);
+static int llc_conn_send_pdus(struct sock *sk, struct sk_buff *skb);
 static int llc_conn_service(struct sock *sk, struct sk_buff *skb);
 static int llc_exec_conn_trans_actions(struct sock *sk,
   struct llc_conn_state_trans *trans,
@@ -193,11 +193,11 @@ int llc_conn_state_process(struct sock *sk, struct 
sk_buff *skb)
return rc;
 }
 
-void llc_conn_send_pdu(struct sock *sk, struct sk_buff *skb)
+int llc_conn_send_pdu(struct sock *sk, struct sk_buff *skb)
 {
/* queue PDU to send to MAC layer */
skb_queue_tail(&sk->sk_write_queue, skb);
-   llc_conn_send_pdus(sk);
+   return llc_conn_send_pdus(sk, skb);
 }
 
 /**
@@ -255,7 +255,7 @@ void llc_conn_resend_i_pdu_as_cmd(struct sock *sk, u8 nr, 
u8 first_p_bit)
if (howmany_resend > 0)
llc->vS = (llc->vS + 1) % LLC_2_SEQ_NBR_MODULO;
/* any PDUs to re-send are queued up; start sending to MAC */
-   llc_conn_send_pdus(sk);
+   llc_conn_send_pdus(sk, NULL);
 out:;
 }
 
@@ -296,7 +296,7 @@ void llc_conn_resend_i_pdu_as_rsp(struct sock *sk, u8 nr, 
u8 first_f_bit)
if (howmany_resend > 0)
llc->vS = (llc->vS + 1) % LLC_2_SEQ_NBR_MODULO;
/* any PDUs to re-send are queued up; start sending to MAC */
-   llc_conn_send_pdus(sk);
+   llc_conn_send_pdus(sk, NULL);
 out:;
 }
 
@@ -340,12 +340,16 @@ int llc_conn_remove_acked_pdus(struct sock *sk, u8 nr, 
u16 *how_many_unacked)
 /**
  * llc_conn_sen

Re: [net PATCH v2] net: sched, fix OOO packets with pfifo_fast

2018-03-26 Thread Cong Wang
On Sat, Mar 24, 2018 at 10:25 PM, John Fastabend
 wrote:
> After the qdisc lock was dropped in pfifo_fast we allow multiple
> enqueue threads and dequeue threads to run in parallel. On the
> enqueue side the skb bit ooo_okay is used to ensure all related
> skbs are enqueued in-order. On the dequeue side though there is
> no similar logic. What we observe is with fewer queues than CPUs
> it is possible to re-order packets when two instances of
> __qdisc_run() are running in parallel. Each thread will dequeue
> a skb and then whichever thread calls the ndo op first will
> be sent on the wire. This doesn't typically happen because
> qdisc_run() is usually triggered by the same core that did the
> enqueue. However, drivers will trigger __netif_schedule()
> when queues are transitioning from stopped to awake using the
> netif_tx_wake_* APIs. When this happens netif_schedule() calls
> qdisc_run() on the same CPU that did the netif_tx_wake_* which
> is usually done in the interrupt completion context. This CPU
> is selected with the irq affinity which is unrelated to the
> enqueue operations.

Interesting. Why is this unique to pfifo_fast? It seems to me it
could happen with other qdiscs too: when we release the qdisc root
lock in sch_direct_xmit(), another CPU could dequeue from
the same qdisc and transmit the skb in parallel?

...

> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 7e3fbe9..39c144b 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -373,24 +373,33 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc 
> *q,
>   */
>  static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>  {
> +   bool more, validate, nolock = q->flags & TCQ_F_NOLOCK;
> spinlock_t *root_lock = NULL;
> struct netdev_queue *txq;
> struct net_device *dev;
> struct sk_buff *skb;
> -   bool validate;
>
> /* Dequeue packet */
> +   if (nolock && test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
> +   return false;
> +

Nit: you probably want to move the comment below this if check,
or simply remove it since it is useless...


Re: [PATCH net v2 0/7] fix idr leak in actions

2018-03-20 Thread Cong Wang
On Mon, Mar 19, 2018 at 7:31 AM, Davide Caratti <dcara...@redhat.com> wrote:
> This series fixes situations where a temporary failure to install a TC
> action results in the permanent impossibility to reuse the configured
> value of 'index'.
>
> Thanks to Cong Wang for the initial review.
>
> v2: fix build error in act_ipt.c, reported by kbuild test robot

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>

Looks good. Just a nit: it seems tcf_idr_cleanup() can be removed
after your patchset. If so, please send a follow-up patch once
these patches are merged into net-next.


Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]

2018-03-15 Thread Cong Wang
On Wed, Mar 14, 2018 at 1:10 AM, Marco Berizzi <pupi...@libero.it> wrote:
>> Il 9 marzo 2018 alle 0.14 Cong Wang <xiyou.wangc...@gmail.com> ha scritto:
>>
>>
>> On Thu, Mar 8, 2018 at 8:02 AM, Marco Berizzi <pupi...@libero.it> wrote:
>> >> Marco Berizzi wrote:
>> >>
>> >>
>> >> Hello everyone,
>> >>
>> >> Yesterday I got this error on a slackware linux 4.16-rc4 system
>> >> running as a traffic shaping gateway and netfilter nat.
>> >> The error has been arisen after a partial ISP network outage,
>> >> so unfortunately it will not trivial for me to reproduce it again.
>> >
>> > Hello everyone,
>> >
>> > I'm getting this error twice/day, so fortunately I'm able to
>> > reproduce it.
>>
>> IIRC, there was a patch for this, but it got lost...
>>
>> I will take a look anyway.
>
> ok, thanks for the response. Let me know when there will be a patch
> available to test.

It has been reported here:
https://bugzilla.kernel.org/show_bug.cgi?id=109581

And there is a workaround from Konstantin:
https://patchwork.ozlabs.org/patch/803885/

Unfortunately I don't think that is a real fix, we probably need to
fix HFSC itself rather than just work around the qlen==0 case. It is
not trivial since the HFSC implementation is not easy to understand.
Maybe Jamal knows better than me.


Thanks


Re: [PATCH net] net/sched: act_simple: don't leak 'index' in the error path

2018-03-15 Thread Cong Wang
On Wed, Mar 14, 2018 at 3:43 PM, Davide Caratti <dcara...@redhat.com> wrote:
> hello Cong, thank you for reviewing this.
>
> On Wed, 2018-03-14 at 11:41 -0700, Cong Wang wrote:
>> On Tue, Mar 13, 2018 at 7:13 PM, Davide Caratti <dcara...@redhat.com> wrote:
>>
>> Looks like we just need to replace the tcf_idr_cleanup() with
>> tcf_idr_release()? Which is also simpler.
>
> I just tried it on act_simple, and I can confirm: 'index' does not leak
> anymore if alloc_defdata() fails to kzalloc(), and then tcf_idr_release()
> is called in place of of tcf_idr_cleanup().

Good.

>
>> Looks like all other callers of tcf_idr_cleanup() need to be replaced too,
>> but I didn't audit all of them...
>
> no problem, I can try to do that, it's not going to be a big series
> anyway.


Please audit all of them.


>
> while at it, I will also fix other spots where the same bug can be
> reproduced, even if tcf_idr_cleanup() is not there: for example, when
> tcf_vlan_init() fails allocating struct tcf_vlan_params *p,
>
> ASSERT_RTNL();
> p = kzalloc(sizeof(*p), GFP_KERNEL);
> if (!p) {
> if (ovr)
> tcf_idr_release(*a, bind);
> return -ENOMEM;
> }
>
> the followinng behavior can be observed:
>
> # tc actions flush action vlan
> # tc actions add action vlan pop index 5
> RTNETLINK answers: Cannot allocate memory
> We have an error talking to the kernel
> # tc actions add action vlan pop index 5
> RTNETLINK answers: No space left on device
> We have an error talking to the kernel
> # tc actions add action vlan pop index 5
> RTNETLINK answers: No space left on device
> We have an error talking to the kernel
>
> Probably testing the value of 'ovr' here is wrong, or maybe it's
> not enough: I will also verify what happens using 'replace'
> keyword instead of 'add'.

Please fix it separately if really needed, and it would be nicer
if you can add your test cases to tools/testing/selftests/tc-testing/.

Thanks!


Re: [PATCH net] net/sched: act_simple: don't leak 'index' in the error path

2018-03-14 Thread Cong Wang
On Tue, Mar 13, 2018 at 7:13 PM, Davide Caratti  wrote:
> Similarly to what other TC actions do, we can duplicate 'sdata' before
> calling tcf_idr_create(), and avoid calling tcf_idr_cleanup(), so that
> leaks of 'index' don't occur anymore.

Looks like we just need to replace the tcf_idr_cleanup() with
tcf_idr_release()? Which is also simpler.


>
> Signed-off-by: Davide Caratti 
> ---
>
> Notes:
> Hello,
>
> I observed this faulty behavior on act_bpf, in case of negative return
> value of tcf_bpf_init_from_ops() and tcf_bpf_init_from_efd(). Then I
> tried on act_simple, that parses its parameter in a similar way, and
> reproduced the same leakage of 'index'.
>
> So, unless you think we should fix this issue in a different way (i.e.
> changing tcf_idr_cleanup() ), I will post a similar fix for act_bpf.
>
> Any feedback is welcome, thank you in advance!

Looks like all other callers of tcf_idr_cleanup() need to be replaced too,
but I didn't audit all of them...


[...]

> if (!exists) {
> +   defdata = nla_strdup(tb[TCA_DEF_DATA], GFP_KERNEL);
> +   if (unlikely(!defdata))
> +   return -ENOMEM;
> +
> ret = tcf_idr_create(tn, parm->index, est, a,
>  _simp_ops, bind, false);
> if (ret)
> return ret;

You leak memory here on failure, BTW.


Re: [PATCH 1/1] net: check dev->reg_state before deref of napi netdev_ops

2018-03-12 Thread Cong Wang
On Sun, Mar 11, 2018 at 12:22 PM, Josh Elsasser  wrote:
> init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads
> to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi
> wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll.
>
> Avoid this by ensuring that napi->dev is not a dummy device before
> dereferencing napi dev's netdev_ops, preventing the following panic:

Hmm, how about just checking ->netdev_ops? Checking reg_state looks
odd, although it works.


Re: [PATCH net-next 1/1] net sched actions: return explicit error when tunnel_key mode is not specified

2018-03-12 Thread Cong Wang
On Mon, Mar 12, 2018 at 1:20 PM, Roman Mashak <m...@mojatatu.com> wrote:
> If set/unset mode of the tunnel_key action is not provided, ->init() still
> returns 0, and the caller proceeds with bogus 'struct tc_action *' object,
> this results in crash:

...

> Fixes: d0f6dd8a914f ("net/sched: Introduce act_tunnel_key")
> Signed-off-by: Roman Mashak <m...@mojatatu.com>

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>

This should go to -net rather than -net-next.

Thanks.


Re: WARNING in __proc_create

2018-03-09 Thread Cong Wang
On Fri, Mar 9, 2018 at 3:21 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
>
> On 03/09/2018 03:05 PM, Cong Wang wrote:
>>
>>
>> BTW, the warning itself is all about empty names, so perhaps
>> it's better to fix them separately.
>
>
> Huh ? You want more syzbot reports ? I do not.

I always prefer one patch per problem, and, as I already said,
checking "." and ".." is probably not as trivial as checking for an
empty name.

This is why I believe 2 (or more) patches here are better than a
single patch.

BTW, I don't have any patch, it is so trivial that I don't want to spend
my time on it.

>
> I unblocked this report today [1], you can be sure that as soon
> as syzbot gets the correct tag (Reported-by: ...), it will find the other
> problems, which are currently on hold to avoid spam.
>
> [1] It has been sitting for a while here at Google, since January 28th.
>At some point you have to ping Pablo/Florian again ;)

Sure, it can always find more problems... why just complain about
this one alone? ;)


Re: WARNING in __proc_create

2018-03-09 Thread Cong Wang
On Fri, Mar 9, 2018 at 2:58 PM, Eric Dumazet  wrote:
>
>
> On 03/09/2018 02:56 PM, Eric Dumazet wrote:
>
>>
>> I sent a patch a while back, but Pablo/Florian wanted more than that
>> simple fix.
>>
>> We also need to filter special characters like '/'

proc_create_data() itself accepts '/', so it must be that xt_hashlimit
doesn't want it.

>>
>> Or maybe I am mixing with something else.
>
>
> Yes, Florian mentioned that we also had to reject "."  and ".."
>

It could be, but it looks like that is not as trivial as a simple strstr(".")?

BTW, the warning itself is all about empty names, so perhaps
it's better to fix them separately.


Re: WARNING in __proc_create

2018-03-09 Thread Cong Wang
On Fri, Mar 9, 2018 at 1:59 PM, syzbot
 wrote:
> Hello,
>
> syzbot hit the following crash on net-next commit
> 617aebe6a97efa539cc4b8a52adccd89596e6be0 (Sun Feb 4 00:25:42 2018 +)
> Merge tag 'usercopy-v4.16-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
>
> So far this crash happened 12 times on net-next, upstream.
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+0502b00edac2a0680...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> audit: type=1400 audit(1518223777.419:7): avc:  denied  { map } for
> pid=4159 comm="syzkaller598581" path="/root/syzkaller598581546" dev="sda1"
> ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
> [ cut here ]
> name len 0
> WARNING: CPU: 0 PID: 4159 at fs/proc/generic.c:354 __proc_create+0x696/0x880
> fs/proc/generic.c:354

We need to reject empty names.


Re: WARNING: CPU: 3 PID: 0 at net/sched/sch_hfsc.c:1388 hfsc_dequeue+0x319/0x350 [sch_hfsc]

2018-03-08 Thread Cong Wang
On Thu, Mar 8, 2018 at 8:02 AM, Marco Berizzi  wrote:
>> Marco Berizzi wrote:
>>
>>
>> Hello everyone,
>>
>> Yesterday I got this error on a slackware linux 4.16-rc4 system
>> running as a traffic shaping gateway and netfilter nat.
>> The error has been arisen after a partial ISP network outage,
>> so unfortunately it will not trivial for me to reproduce it again.
>
> Hello everyone,
>
> I'm getting this error twice/day, so fortunately I'm able to
> reproduce it.

IIRC, there was a patch for this, but it got lost...

I will take a look anyway.


[Patch nf-next] netfilter: make xt_rateest hash table per net

2018-03-01 Thread Cong Wang
As suggested by Eric, we need to make the xt_rateest
hash table and its lock per-netns to reduce lock
contention.

Cc: Florian Westphal <f...@strlen.de>
Cc: Eric Dumazet <eduma...@google.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/net/netfilter/xt_rateest.h |  4 +-
 net/netfilter/xt_RATEEST.c | 91 +++---
 net/netfilter/xt_rateest.c | 10 ++---
 3 files changed, 72 insertions(+), 33 deletions(-)

diff --git a/include/net/netfilter/xt_rateest.h 
b/include/net/netfilter/xt_rateest.h
index b1db13772554..832ab69efda5 100644
--- a/include/net/netfilter/xt_rateest.h
+++ b/include/net/netfilter/xt_rateest.h
@@ -21,7 +21,7 @@ struct xt_rateest {
struct net_rate_estimator __rcu *rate_est;
 };
 
-struct xt_rateest *xt_rateest_lookup(const char *name);
-void xt_rateest_put(struct xt_rateest *est);
+struct xt_rateest *xt_rateest_lookup(struct net *net, const char *name);
+void xt_rateest_put(struct net *net, struct xt_rateest *est);
 
 #endif /* _XT_RATEEST_H */
diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
index 141c295191f6..dec843cadf46 100644
--- a/net/netfilter/xt_RATEEST.c
+++ b/net/netfilter/xt_RATEEST.c
@@ -14,15 +14,21 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
 #include 
 
-static DEFINE_MUTEX(xt_rateest_mutex);
-
 #define RATEEST_HSIZE  16
-static struct hlist_head rateest_hash[RATEEST_HSIZE] __read_mostly;
+
+struct xt_rateest_net {
+   struct mutex hash_lock;
+   struct hlist_head hash[RATEEST_HSIZE];
+};
+
+static unsigned int xt_rateest_id;
+
 static unsigned int jhash_rnd __read_mostly;
 
 static unsigned int xt_rateest_hash(const char *name)
@@ -31,21 +37,23 @@ static unsigned int xt_rateest_hash(const char *name)
   (RATEEST_HSIZE - 1);
 }
 
-static void xt_rateest_hash_insert(struct xt_rateest *est)
+static void xt_rateest_hash_insert(struct xt_rateest_net *xn,
+  struct xt_rateest *est)
 {
unsigned int h;
 
h = xt_rateest_hash(est->name);
-   hlist_add_head(&est->list, &rateest_hash[h]);
+   hlist_add_head(&est->list, &xn->hash[h]);
 }
 
-static struct xt_rateest *__xt_rateest_lookup(const char *name)
+static struct xt_rateest *__xt_rateest_lookup(struct xt_rateest_net *xn,
+ const char *name)
 {
struct xt_rateest *est;
unsigned int h;
 
h = xt_rateest_hash(name);
-   hlist_for_each_entry(est, &rateest_hash[h], list) {
+   hlist_for_each_entry(est, &xn->hash[h], list) {
if (strcmp(est->name, name) == 0) {
est->refcnt++;
return est;
@@ -55,20 +63,23 @@ static struct xt_rateest *__xt_rateest_lookup(const char 
*name)
return NULL;
 }
 
-struct xt_rateest *xt_rateest_lookup(const char *name)
+struct xt_rateest *xt_rateest_lookup(struct net *net, const char *name)
 {
+   struct xt_rateest_net *xn = net_generic(net, xt_rateest_id);
struct xt_rateest *est;
 
-   mutex_lock(&xt_rateest_mutex);
-   est = __xt_rateest_lookup(name);
-   mutex_unlock(&xt_rateest_mutex);
+   mutex_lock(&xn->hash_lock);
+   est = __xt_rateest_lookup(xn, name);
+   mutex_unlock(&xn->hash_lock);
return est;
 }
 EXPORT_SYMBOL_GPL(xt_rateest_lookup);
 
-void xt_rateest_put(struct xt_rateest *est)
+void xt_rateest_put(struct net *net, struct xt_rateest *est)
 {
-   mutex_lock(&xt_rateest_mutex);
+   struct xt_rateest_net *xn = net_generic(net, xt_rateest_id);
+
+   mutex_lock(&xn->hash_lock);
if (--est->refcnt == 0) {
hlist_del(&est->list);
gen_kill_estimator(&est->rate_est);
@@ -78,7 +89,7 @@ void xt_rateest_put(struct xt_rateest *est)
 */
kfree_rcu(est, rcu);
}
-   mutex_unlock(&xt_rateest_mutex);
+   mutex_unlock(&xn->hash_lock);
 }
 EXPORT_SYMBOL_GPL(xt_rateest_put);
 
@@ -98,6 +109,7 @@ xt_rateest_tg(struct sk_buff *skb, const struct 
xt_action_param *par)
 
 static int xt_rateest_tg_checkentry(const struct xt_tgchk_param *par)
 {
+   struct xt_rateest_net *xn = net_generic(par->net, xt_rateest_id);
struct xt_rateest_target_info *info = par->targinfo;
struct xt_rateest *est;
struct {
@@ -108,10 +120,10 @@ static int xt_rateest_tg_checkentry(const struct 
xt_tgchk_param *par)
 
net_get_random_once(&jhash_rnd, sizeof(jhash_rnd));
 
-   mutex_lock(&xt_rateest_mutex);
-   est = __xt_rateest_lookup(info->name);
+   mutex_lock(&xn->hash_lock);
+   est = __xt_rateest_lookup(xn, info->name);
if (est) {
-   mutex_unlock(&xt_rateest_mutex);
+   mutex_unlock(&xn->hash_lock);
/*
 * If estimator parameters are specified, they must match the
 * existing estimator.
@@ -1

[Patch 4.14 4/4] net: sched: fix use-after-free in tcf_block_put_ext

2018-03-01 Thread Cong Wang
From: Jiri Pirko 

[ Upstream commit df45bf84e4f5a48f23d4b1a07d21d5 ]

Since the block is freed with last chain being put, once we reach the
end of iteration of list_for_each_entry_safe, the block may be
already freed. I'm hitting this only by creating and deleting clsact:

[  202.171952] 
==
[  202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390
[  202.187590] Read of size 8 at addr 880225539a80 by task tc/796
[  202.194508]
[  202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5
[  202.203200] Hardware name: Mellanox Technologies Ltd. 
"MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016
[  202.213613] Call Trace:
[  202.216369]  dump_stack+0xda/0x169
[  202.220192]  ? dma_virt_map_sg+0x147/0x147
[  202.224790]  ? show_regs_print_info+0x54/0x54
[  202.229691]  ? tcf_chain_destroy+0x1dc/0x250
[  202.234494]  print_address_description+0x83/0x3d0
[  202.239781]  ? tcf_block_put_ext+0x240/0x390
[  202.244575]  kasan_report+0x1ba/0x460
[  202.248707]  ? tcf_block_put_ext+0x240/0x390
[  202.253518]  tcf_block_put_ext+0x240/0x390
[  202.258117]  ? tcf_chain_flush+0x290/0x290
[  202.262708]  ? qdisc_hash_del+0x82/0x1a0
[  202.267111]  ? qdisc_hash_add+0x50/0x50
[  202.271411]  ? __lock_is_held+0x5f/0x1a0
[  202.275843]  clsact_destroy+0x3d/0x80 [sch_ingress]
[  202.281323]  qdisc_destroy+0xcb/0x240
[  202.285445]  qdisc_graft+0x216/0x7b0
[  202.289497]  tc_get_qdisc+0x260/0x560

Fix this by holding the block also by chain 0 and put chain 0
explicitly, out of the list_for_each_entry_safe loop at the very
end of tcf_block_put_ext.

Fixes: efbf78973978 ("net_sched: get rid of rcu_barrier() in 
tcf_block_put_ext()")
Signed-off-by: Jiri Pirko 
Signed-off-by: David S. Miller 
---
 net/sched/cls_api.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 135147c1deed..934c239cf98d 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -290,21 +290,22 @@ void tcf_block_put(struct tcf_block *block)
if (!block)
return;
 
-   /* Hold a refcnt for all chains, except 0, so that they don't disappear
+   /* Hold a refcnt for all chains, so that they don't disappear
 * while we are iterating.
 */
list_for_each_entry(chain, &block->chain_list, list)
-   if (chain->index)
-   tcf_chain_hold(chain);
+   tcf_chain_hold(chain);
 
list_for_each_entry(chain, &block->chain_list, list)
tcf_chain_flush(chain);
 
-   /* At this point, all the chains should have refcnt >= 1. Block will be
-* freed after all chains are gone.
-*/
+   /* At this point, all the chains should have refcnt >= 1. */
list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
tcf_chain_put(chain);
+
+   /* Finally, put chain 0 and allow block to be freed. */
+   chain = list_first_entry(&block->chain_list, struct tcf_chain, list);
+   tcf_chain_put(chain);
 }
 EXPORT_SYMBOL(tcf_block_put);
 
-- 
2.13.0



[Patch 4.14 3/4] net_sched: get rid of rcu_barrier() in tcf_block_put_ext()

2018-03-01 Thread Cong Wang
[ Upstream commit efbf78973978b0d25af59bc26c8013 ]

Both Eric and Paolo noticed the rcu_barrier() we use in
tcf_block_put_ext() could be a performance bottleneck when
we have a lot of tc classes.

Paolo provided the following to demonstrate the issue:

tc qdisc add dev lo root htb
for I in `seq 1 1000`; do
tc class add dev lo parent 1: classid 1:$I htb rate 100kbit
tc qdisc add dev lo parent 1:$I handle $((I + 1)): htb
for J in `seq 1 10`; do
tc filter add dev lo parent $((I + 1)): u32 match ip src 1.1.1.$J
done
done
time tc qdisc del dev lo root

real	0m54.764s
user	0m0.023s
sys	0m0.000s

The rcu_barrier() there is to ensure we free the block after all chains
are gone, that is, to queue tcf_block_put_final() at the tail of workqueue.
We can achieve this ordering requirement by refcnt'ing tcf block instead,
that is, the tcf block is freed only when the last chain in this block is
gone. This also simplifies the code.

Paolo reported after this patch we get:

real	0m0.017s
user	0m0.000s
sys	0m0.017s

Tested-by: Paolo Abeni <pab...@redhat.com>
Cc: Eric Dumazet <eduma...@google.com>
Cc: Jiri Pirko <j...@mellanox.com>
Cc: Jamal Hadi Salim <j...@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
Signed-off-by: David S. Miller <da...@davemloft.net>
---
 include/net/sch_generic.h |  1 -
 net/sched/cls_api.c   | 29 +
 2 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 236bfe5b2ffe..6073e8bae025 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -273,7 +273,6 @@ struct tcf_chain {
 
 struct tcf_block {
struct list_head chain_list;
-   struct work_struct work;
 };
 
 static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index ddae7053b745..135147c1deed 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -211,8 +211,12 @@ static void tcf_chain_flush(struct tcf_chain *chain)
 
 static void tcf_chain_destroy(struct tcf_chain *chain)
 {
+   struct tcf_block *block = chain->block;
+
list_del(&chain->list);
kfree(chain);
+   if (list_empty(&block->chain_list))
+   kfree(block);
 }
 
 static void tcf_chain_hold(struct tcf_chain *chain)
@@ -276,26 +280,12 @@ int tcf_block_get(struct tcf_block **p_block,
 }
 EXPORT_SYMBOL(tcf_block_get);
 
-static void tcf_block_put_final(struct work_struct *work)
-{
-   struct tcf_block *block = container_of(work, struct tcf_block, work);
-   struct tcf_chain *chain, *tmp;
-
-   rtnl_lock();
-
-   /* At this point, all the chains should have refcnt == 1. */
-   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
-   tcf_chain_put(chain);
-   rtnl_unlock();
-   kfree(block);
-}
-
 /* XXX: Standalone actions are not allowed to jump to any chain, and bound
  * actions should be all removed after flushing.
  */
 void tcf_block_put(struct tcf_block *block)
 {
-   struct tcf_chain *chain;
+   struct tcf_chain *chain, *tmp;
 
if (!block)
return;
@@ -310,12 +300,11 @@ void tcf_block_put(struct tcf_block *block)
list_for_each_entry(chain, &block->chain_list, list)
tcf_chain_flush(chain);
 
-   INIT_WORK(&block->work, tcf_block_put_final);
-   /* Wait for RCU callbacks to release the reference count and make
-* sure their works have been queued before this.
+   /* At this point, all the chains should have refcnt >= 1. Block will be
+* freed after all chains are gone.
 */
-   rcu_barrier();
-   tcf_queue_work(&block->work);
+   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   tcf_chain_put(chain);
 }
 EXPORT_SYMBOL(tcf_block_put);
 
-- 
2.13.0
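The refcnt'ing scheme from the changelog above (the block is freed when its last chain is destroyed, so no rcu_barrier() is needed to order the final free) can be sketched in userspace C. This is an illustration only, not the kernel code: the types are simplified stand-ins, and the `freed` flag stands in for the final kfree(block).

```c
/* Userspace sketch of "free the container when its last member goes":
 * mirrors the idea of tcf_chain_destroy() freeing the block once
 * block->chain_list becomes empty. Not kernel code. */
#include <assert.h>
#include <stdlib.h>

struct chain;

struct block {
    struct chain *chains;   /* singly linked list of chains */
    int freed;              /* demo flag, stands in for kfree(block) */
};

struct chain {
    struct block *block;
    struct chain *next;
};

static struct chain *chain_create(struct block *b)
{
    struct chain *c = malloc(sizeof(*c));

    c->block = b;
    c->next = b->chains;
    b->chains = c;
    return c;
}

/* Mirrors tcf_chain_destroy(): unlink the chain, then "free" the block
 * if this was the last chain on it. */
static void chain_destroy(struct chain *c)
{
    struct block *b = c->block;
    struct chain **pp = &b->chains;

    while (*pp != c)
        pp = &(*pp)->next;
    *pp = c->next;
    free(c);
    if (!b->chains)
        b->freed = 1;
}
```

Because the last chain_destroy() is what releases the block, the "block freed after all chains are gone" ordering holds by construction, with no barrier.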



[Patch 4.14 2/4] net: sched: crash on blocks with goto chain action

2018-03-01 Thread Cong Wang
From: Roman Kapl 

[ Upstream commit a60b3f515d30d0fe8537c64671926879a3548103 ]

tcf_block_put_ext has assumed that all filters (and thus their goto
actions) are destroyed in RCU callback and thus can not race with our
list iteration. However, that is not true during netns cleanup (see
tcf_exts_get_net comment).

Prevent the use-after-free by holding all chains (except 0, that one is
already held). foreach_safe is not enough in this case.

To reproduce, run the following in a netns and then delete the ns:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1 parent ffff: handle 1 prio 1 flower action goto chain 2

Fixes: 822e86d997 ("net_sched: remove tcf_block_put_deferred()")
Signed-off-by: Roman Kapl 
Acked-by: Jiri Pirko 
Signed-off-by: David S. Miller 
---
 net/sched/cls_api.c | 17 -
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 1451a56a8f93..ddae7053b745 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -282,7 +282,8 @@ static void tcf_block_put_final(struct work_struct *work)
struct tcf_chain *chain, *tmp;
 
rtnl_lock();
-   /* Only chain 0 should be still here. */
+
+   /* At this point, all the chains should have refcnt == 1. */
list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
tcf_chain_put(chain);
rtnl_unlock();
@@ -290,17 +291,23 @@ static void tcf_block_put_final(struct work_struct *work)
 }
 
 /* XXX: Standalone actions are not allowed to jump to any chain, and bound
- * actions should be all removed after flushing. However, filters are now
- * destroyed in tc filter workqueue with RTNL lock, they can not race here.
+ * actions should be all removed after flushing.
  */
 void tcf_block_put(struct tcf_block *block)
 {
-   struct tcf_chain *chain, *tmp;
+   struct tcf_chain *chain;
 
if (!block)
return;
 
-   list_for_each_entry_safe(chain, tmp, &block->chain_list, list)
+   /* Hold a refcnt for all chains, except 0, so that they don't disappear
+* while we are iterating.
+*/
+   list_for_each_entry(chain, &block->chain_list, list)
+   if (chain->index)
+   tcf_chain_hold(chain);
+
+   list_for_each_entry(chain, &block->chain_list, list)
tcf_chain_flush(chain);
 
INIT_WORK(&block->work, tcf_block_put_final);
-- 
2.13.0



[Patch 4.14 1/4] net: sched: fix crash when deleting secondary chains

2018-03-01 Thread Cong Wang
From: Roman Kapl 

[ Upstream commit d7aa04a5e82b4f254d306926c81eae8df69e5200 ]

If you flush (delete) a filter chain other than chain 0 (such as when
deleting the device), the kernel may run into a use-after-free. The
chain refcount must not be decremented unless we are sure we are done
with the chain.

To reproduce the bug, run:
ip link add dtest type dummy
tc qdisc add dev dtest ingress
tc filter add dev dtest chain 1 parent ffff: flower
ip link del dtest

Introduced in: commit f93e1cdcf42c ("net/sched: fix filter flushing"),
but unless you have KAsan or luck, you won't notice it until
commit 0dadc117ac8b ("cls_flower: use tcf_exts_get_net() before call_rcu()")

Fixes: f93e1cdcf42c ("net/sched: fix filter flushing")
Acked-by: Jiri Pirko 
Signed-off-by: Roman Kapl 
Signed-off-by: David S. Miller 
---
 net/sched/cls_api.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index ecbb019efcbd..1451a56a8f93 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -197,14 +197,15 @@ static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
 
 static void tcf_chain_flush(struct tcf_chain *chain)
 {
-   struct tcf_proto *tp;
+   struct tcf_proto *tp = rtnl_dereference(chain->filter_chain);
 
if (chain->p_filter_chain)
RCU_INIT_POINTER(*chain->p_filter_chain, NULL);
-   while ((tp = rtnl_dereference(chain->filter_chain)) != NULL) {
+   while (tp) {
RCU_INIT_POINTER(chain->filter_chain, tp->next);
-   tcf_chain_put(chain);
tcf_proto_destroy(tp);
+   tp = rtnl_dereference(chain->filter_chain);
+   tcf_chain_put(chain);
}
 }
 
-- 
2.13.0



[Patch 4.14 0/4] net_sched: backport tc filter fixes to 4.14

2018-03-01 Thread Cong Wang
This patchset backports 4 important bug fixes for tc filter to
4.14 stable branch. Due to some big changes between 4.14 and 4.15,
the backports are not trivial, I have to adjust and fix the conflicts
manually.

Thanks to Roland for reporting the kernel warning and testing
the patches.

Reported-by: Roland Franke <fl...@franke-prem.de>
Tested-by: Roland Franke <fl...@franke-prem.de>
Cc: Jiri Pirko <j...@mellanox.com>
Cc: Roman Kapl <c...@rkapl.cz>
Cc: David S. Miller <da...@davemloft.net>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>

---

Cong Wang (1):
  net_sched: get rid of rcu_barrier() in tcf_block_put_ext()

Jiri Pirko (1):
  net: sched: fix use-after-free in tcf_block_put_ext

Roman Kapl (2):
  net: sched: fix crash when deleting secondary chains
  net: sched: crash on blocks with goto chain action

 include/net/sch_generic.h |  1 -
 net/sched/cls_api.c   | 48 +++
 2 files changed, 23 insertions(+), 26 deletions(-)

-- 
2.13.0



Re: [PATCH] net: make tc-police action MTU behavior match documentation

2018-02-28 Thread Cong Wang
On Tue, Feb 27, 2018 at 9:41 AM, Andrew Collins
 wrote:
>> I don't find such statement from the man page:
>> http://man7.org/linux/man-pages/man8/tc-police.8.html
>
>
>> What am I missing?
>
> Under MTU the man page states:
>
> mtu BYTES[/BYTES]
>   This is the maximum packet size handled by the policer (larger 
> ones will be handled like they exceeded the configured rate). Setting this 
> value correctly will improve the scheduler's precision.
>   Value  formatting  is identical to burst above. ->Defaults to 
> unlimited<-.
>
> Peakrate requiring MTU isn't mentioned directly in the man page, but if you 
> provide peakrate without MTU, tc complains:
>
> "mtu" is required, if "peakrate" is requested.
>
> The idea here is just to make the actual implementation match these two 
> statements, MTU is unlimited, unless you use peakrate in which case you have 
> to provide it (although if you craft netlink messages without tc you can set 
> peakrate with no mtu, and the action will still default to a reasonable mtu 
> rather than falling over).
>

If this is just an iproute2 issue, please fix it there rather in kernel.


Re: [PATCH] net: make tc-police action MTU behavior match documentation

2018-02-26 Thread Cong Wang
On Mon, Feb 26, 2018 at 12:10 PM, Andrew Collins
 wrote:
> The man page for tc-police states that the MTU defaults to
> unlimited if peakrate is not specified, but it actually defaults
> to 2047.

I don't find such statement from the man page:
http://man7.org/linux/man-pages/man8/tc-police.8.html


What am I missing?


Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest

2018-02-23 Thread Cong Wang
On Fri, Feb 23, 2018 at 3:27 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap <rdun...@infradead.org> wrote:
>> [adding netdev]
>>
>> On 02/23/2018 08:05 AM, Khalid Aziz wrote:
>>> I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when running 
>>> selftests
>>> from tools/testing/selftests. Last messages from selftest before kernel 
>>> panic are:
>>>
> ...
>>> Same selftest does not cause panic on 4.15. git bisect pointed to commit 
>>> 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs more 
>>> efficient").
>>> Kernel config is attached.
>
> Looks like something horribly wrong with u32 key id idr...

Adding a few printk's, I got:

[   31.231560] requested handle = ffe0
[   31.232426] allocated handle = 0
...
[   31.246475] requested handle = ffd0
[   31.247555] allocated handle = 1


So the bug is here where we can't allocate a specific handle:

err = idr_alloc_u32(&tp_c->handle_idr, ht, &handle,
handle, GFP_KERNEL);
if (err) {
kfree(ht);
return err;
}


Re: Kernel panic with 4.16-rc1 (and 4.16-rc2) running selftest

2018-02-23 Thread Cong Wang
On Fri, Feb 23, 2018 at 11:00 AM, Randy Dunlap  wrote:
> [adding netdev]
>
> On 02/23/2018 08:05 AM, Khalid Aziz wrote:
>> I am seeing a kernel panic with 4.16-rc1 and 4.16-rc2 kernels when running 
>> selftests
>> from tools/testing/selftests. Last messages from selftest before kernel 
>> panic are:
>>
...
>> Same selftest does not cause panic on 4.15. git bisect pointed to commit 
>> 6ce711f2750031d12cec91384ac5cfa0a485b60a ("idr: Make 1-based IDRs more 
>> efficient").
>> Kernel config is attached.

Looks like something horribly wrong with u32 key id idr...


Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-21 Thread Cong Wang
On Thu, Feb 15, 2018 at 11:31 AM, Guillaume Nault  wrote:
> On Thu, Feb 15, 2018 at 06:01:16PM +0200, Denys Fedoryshchenko wrote:
>> On 2018-02-15 17:55, Guillaume Nault wrote:
>> > On Thu, Feb 15, 2018 at 12:19:52PM +0200, Denys Fedoryshchenko wrote:
>> > > Here we go:
>> > >
>> > >   [24558.921549]
>> > > ==
>> > >   [24558.922167] BUG: KASAN: use-after-free in
>> > > ppp_ioctl+0xa6a/0x1522
>> > > [ppp_generic]
>> > >   [24558.922776] Write of size 8 at addr 8803d35bf3f8 by task
>> > > accel-pppd/12622
>> > >   [24558.923113]
>> > >   [24558.923451] CPU: 0 PID: 12622 Comm: accel-pppd Tainted: G
>> > > W
>> > > 4.15.3-build-0134 #1
>> > >   [24558.924058] Hardware name: HP ProLiant DL320e Gen8 v2,
>> > > BIOS P80
>> > > 04/02/2015
>> > >   [24558.924406] Call Trace:
>> > >   [24558.924753]  dump_stack+0x46/0x59
>> > >   [24558.925103]  print_address_description+0x6b/0x23b
>> > >   [24558.925451]  ? ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > >   [24558.925797]  kasan_report+0x21b/0x241
>> > >   [24558.926136]  ppp_ioctl+0xa6a/0x1522 [ppp_generic]
>> > >   [24558.926479]  ? ppp_nl_newlink+0x1da/0x1da [ppp_generic]
>> > >   [24558.926829]  ? sock_sendmsg+0x89/0x99
>> > >   [24558.927176]  ? __vfs_write+0xd9/0x4ad
>> > >   [24558.927523]  ? kernel_read+0xed/0xed
>> > >   [24558.927872]  ? SyS_getpeername+0x18c/0x18c
>> > >   [24558.928213]  ? bit_waitqueue+0x2a/0x2a
>> > >   [24558.928561]  ? wake_atomic_t_function+0x115/0x115
>> > >   [24558.928898]  vfs_ioctl+0x6e/0x81
>> > >   [24558.929228]  do_vfs_ioctl+0xa00/0xb10
>> > >   [24558.929571]  ? sigprocmask+0x1a6/0x1d0
>> > >   [24558.929907]  ? sigsuspend+0x13e/0x13e
>> > >   [24558.930239]  ? ioctl_preallocate+0x14e/0x14e
>> > >   [24558.930568]  ? SyS_rt_sigprocmask+0xf1/0x142
>> > >   [24558.930904]  ? sigprocmask+0x1d0/0x1d0
>> > >   [24558.931252]  SyS_ioctl+0x39/0x55
>> > >   [24558.931595]  ? do_vfs_ioctl+0xb10/0xb10
>> > >   [24558.931942]  do_syscall_64+0x1b1/0x31f
>> > >   [24558.932288]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > >   [24558.932627] RIP: 0033:0x7f302849d8a7
>> > >   [24558.932965] RSP: 002b:7f3029a52af8 EFLAGS: 0206
>> > > ORIG_RAX:
>> > > 0010
>> > >   [24558.933578] RAX: ffda RBX: 7f3027d861e3 RCX:
>> > > 7f302849d8a7
>> > >   [24558.933927] RDX: 7f3023f49468 RSI: 4004743a RDI:
>> > > 3a67
>> > >   [24558.934266] RBP: 7f3029a52b20 R08:  R09:
>> > > 55c8308d8e40
>> > >   [24558.934607] R10: 0008 R11: 0206 R12:
>> > > 7f3023f49358
>> > >   [24558.934947] R13: 7ffe86e5723f R14:  R15:
>> > > 7f3029a53700
>> > >   [24558.935288]
>> > >   [24558.935626] Allocated by task 12622:
>> > >   [24558.935972]  ppp_register_net_channel+0x5f/0x5c6
>> > > [ppp_generic]
>> > >   [24558.936306]  pppoe_connect+0xab7/0xc71 [pppoe]
>> > >   [24558.936640]  SyS_connect+0x14b/0x1b7
>> > >   [24558.936975]  do_syscall_64+0x1b1/0x31f
>> > >   [24558.937319]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > >   [24558.937655]
>> > >   [24558.937993] Freed by task 12622:
>> > >   [24558.938321]  kfree+0xb0/0x11d
>> > >   [24558.938658]  ppp_release+0x111/0x120 [ppp_generic]
>> > >   [24558.938994]  __fput+0x2ba/0x51a
>> > >   [24558.939332]  task_work_run+0x11c/0x13d
>> > >   [24558.939676]  exit_to_usermode_loop+0x7c/0xaf
>> > >   [24558.940022]  do_syscall_64+0x2ea/0x31f
>> > >   [24558.940368]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>> > >   [24558.947099]
>> >
>> > Your first guess was right. It looks like we have an issue with
>> > reference counting on the channels. Can you send me your ppp_generic.o?
>> http://nuclearcat.com/ppp_generic.o
>> Compiled with gcc version 6.4.0 (Gentoo 6.4.0-r1 p1.3)
>>
> From what I can see, ppp_release() and ioctl(PPPIOCCONNECT) are called
> concurrently on the same ppp_file. Even if this ppp_file was pointed at
> by two different file descriptors, I can't see how this could defeat
> the reference counting mechanism. I'm going to think more about it.

For me it looks like pch->clist is not removed from the list ppp->channels
when destroyed via ppp_release(). But I don't want to pretend I understand
ppp logic.


Re: [PATCH 2/3] net: Make cleanup_list and net::cleanup_list of llist type

2018-02-20 Thread Cong Wang
On Mon, Feb 19, 2018 at 1:58 AM, Kirill Tkhai  wrote:
>  void __put_net(struct net *net)
>  {
> /* Cleanup the network namespace in process context */
> -   unsigned long flags;
> -
> -   spin_lock_irqsave(&cleanup_list_lock, flags);
> -   list_add(&net->cleanup_list, &cleanup_list);
> -   spin_unlock_irqrestore(&cleanup_list_lock, flags);
> -
> +   llist_add(&net->cleanup_list, &cleanup_list);
> queue_work(netns_wq, &net_cleanup_work);
>  }

Is llist safe against IRQ too?


Re: [PATCH v2] sched: report if filter is too large to dump

2018-02-20 Thread Cong Wang
On Mon, Feb 19, 2018 at 12:32 PM, Roman Kapl <c...@rkapl.cz> wrote:
> So far, if the filter was too large to fit in the allocated skb, the
> kernel did not return any error and stopped dumping. Modify the dumper
> so that it returns -EMSGSIZE when a filter fails to dump and it is the
> first filter in the skb. If we are not first, we will get a next chance
> with more room.
>
> I understand this is pretty near to being an API change, but the
> original design (silent truncation) can be considered a bug.
>
> Note: The error case can happen pretty easily if you create a filter
> with 32 actions and have 4kb pages. Also recent versions of iproute try
> to be clever with their buffer allocation size, which in turn leads to
>
> Signed-off-by: Roman Kapl <c...@rkapl.cz>

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>
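The dump policy Roman describes can be sketched in userspace C. This is an illustration under simplified assumptions, not the tc dumper itself: `sizes[]` stands in for the encoded size of each filter and `buf_len` for the room left in the skb.

```c
/* Sketch of the acked policy: report -EMSGSIZE only when the filter
 * that doesn't fit is the FIRST one for this buffer (it can never fit);
 * otherwise stop and let the next dump round retry with a fresh buffer. */
#include <assert.h>
#include <errno.h>

static int dump_filters(const int *sizes, int n, int buf_len)
{
    int used = 0, i;

    for (i = 0; i < n; i++) {
        if (used + sizes[i] > buf_len)
            return used == 0 ? -EMSGSIZE : i;  /* i filters dumped so far */
        used += sizes[i];
    }
    return n;
}
```

The old behavior was the `used == 0` branch silently returning success, which is exactly the silent truncation the patch treats as a bug.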


Re: KASAN: use-after-free Read in sit_tunnel_xmit

2018-02-15 Thread Cong Wang
On Tue, Feb 13, 2018 at 10:48 AM, Dmitry Vyukov <dvyu...@google.com> wrote:
> On Mon, Oct 30, 2017 at 7:41 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>> On Mon, Oct 30, 2017 at 8:34 AM, syzbot
>> <bot+1aa412fe58f4059538c0204a0f096524e6dce...@syzkaller.appspotmail.com>
>> wrote:
>>> Hello,
>>>
>>> syzkaller hit the following crash on
>>> 4dc12ffeaeac939097a3f55c881d3dc3523dff0c
>>> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
>>> compiler: gcc (GCC) 7.1.1 20170620
>>> .config is attached
>>> Raw console output is attached.
>>>
>>> skbuff: bad partial csum: csum=53081/14726 len=2273
>>> ==
>>> BUG: KASAN: use-after-free in ipv6_get_dsfield include/net/dsfield.h:23
>>> [inline]
>>> BUG: KASAN: use-after-free in ipip6_tunnel_xmit net/ipv6/sit.c:968 [inline]
>>> BUG: KASAN: use-after-free in sit_tunnel_xmit+0x2a41/0x3130
>>> net/ipv6/sit.c:1016
>>> Read of size 2 at addr 8801c64afd00 by task syz-executor3/16942
>>>
>>> CPU: 0 PID: 16942 Comm: syz-executor3 Not tainted 4.14.0-rc5+ #97
>>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>>> Google 01/01/2011
>>> Call Trace:
>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>  dump_stack+0x194/0x257 lib/dump_stack.c:52
>>>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>>>  kasan_report_error mm/kasan/report.c:351 [inline]
>>>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>>>  __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428
>>>  ipv6_get_dsfield include/net/dsfield.h:23 [inline]
>>>  ipip6_tunnel_xmit net/ipv6/sit.c:968 [inline]
>>>  sit_tunnel_xmit+0x2a41/0x3130 net/ipv6/sit.c:1016
>>>  __netdev_start_xmit include/linux/netdevice.h:4022 [inline]
>>>  netdev_start_xmit include/linux/netdevice.h:4031 [inline]
>>>  xmit_one net/core/dev.c:3008 [inline]
>>>  dev_hard_start_xmit+0x248/0xac0 net/core/dev.c:3024
>>>  __dev_queue_xmit+0x17d2/0x2070 net/core/dev.c:3505
>>>  dev_queue_xmit+0x17/0x20 net/core/dev.c:3538
>>>  neigh_direct_output+0x15/0x20 net/core/neighbour.c:1390
>>>  neigh_output include/net/neighbour.h:481 [inline]
>>>  ip6_finish_output2+0xad1/0x22a0 net/ipv6/ip6_output.c:120
>>>  ip6_fragment+0x25ae/0x3420 net/ipv6/ip6_output.c:723
>>>  ip6_finish_output+0x319/0x920 net/ipv6/ip6_output.c:144
>>>  NF_HOOK_COND include/linux/netfilter.h:238 [inline]
>>>  ip6_output+0x1f4/0x850 net/ipv6/ip6_output.c:163
>>>  dst_output include/net/dst.h:459 [inline]
>>>  ip6_local_out+0x95/0x160 net/ipv6/output_core.c:176
>>>  ip6_send_skb+0xa1/0x330 net/ipv6/ip6_output.c:1658
>>>  ip6_push_pending_frames+0xb3/0xe0 net/ipv6/ip6_output.c:1678
>>>  rawv6_push_pending_frames net/ipv6/raw.c:616 [inline]
>>>  rawv6_sendmsg+0x2eb9/0x3e40 net/ipv6/raw.c:935
>>>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
>>>  sock_sendmsg_nosec net/socket.c:633 [inline]
>>>  sock_sendmsg+0xca/0x110 net/socket.c:643
>>>  SYSC_sendto+0x352/0x5a0 net/socket.c:1750
>>>  SyS_sendto+0x40/0x50 net/socket.c:1718
>>>  entry_SYSCALL_64_fastpath+0x1f/0xbe
>>> RIP: 0033:0x452869
>>> RSP: 002b:7fe3c12e5be8 EFLAGS: 0212 ORIG_RAX: 002c
>>> RAX: ffda RBX: 007580d8 RCX: 00452869
>>> RDX: 07f1 RSI: 2013b7ff RDI: 0014
>>> RBP: 0161 R08: 204e8fe4 R09: 001c
>>> R10: 0100 R11: 0212 R12: 006f01b8
>>> R13:  R14: 7fe3c12e66d4 R15: 0017
>>>
>>> Allocated by task 16924:
>>>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>>>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>>>  set_track mm/kasan/kasan.c:459 [inline]
>>>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>>>  __do_kmalloc_node mm/slab.c:3689 [inline]
>>>  __kmalloc_node_track_caller+0x47/0x70 mm/slab.c:3703
>>>  __kmalloc_reserve.isra.40+0x41/0xd0 net/core/skbuff.c:138
>>>  __alloc_skb+0x13b/0x780 net/core/skbuff.c:206
>>>  alloc_skb include/linux/skbuff.h:985 [inline]
>>>  sock_wmalloc+0x140/0x1d0 net/core/sock.c:1932
>>>  __ip6_append_data.isra.43+0x2681/0x3340 net/ipv6/ip6_output.c:1397
>>>  ip6_append_data+0x189/0x290 net/ipv6/ip6_output.c:1552
>>>  rawv6_sendmsg+0x1dd9/0x3e40 net/ipv6/raw.c:928
>

Re: Fwd: u32 ht filters

2018-02-10 Thread Cong Wang
On Wed, Feb 7, 2018 at 11:38 PM, Jiri Pirko  wrote:
> Thu, Feb 08, 2018 at 12:08:36AM CET, xiyou.wangc...@gmail.com wrote:
>>On Tue, Feb 6, 2018 at 11:01 PM, Jiri Pirko  wrote:
>>> Wed, Feb 07, 2018 at 06:09:15AM CET, xiyou.wangc...@gmail.com wrote:
Hi, Jiri

Your  commit 7fa9d974f3c2a016b9accb18f4ee2ed2a738585c
breaks the tc script by Paweł. Please find below for details.
>>>
>>> Did you do the bisection?
>>> The commit just uses block struct instead of q, but since they
>>> are in 1:1 relation, that should be equvivalent. So basically you still
>>> have per-qdisc hashtables for u32.
>>
>>Well, at least the following fixes the problem here. But I am not sure
>>if it is expected too for shared block among multiple qdiscs.
>
> For shared block, block->q is null.

According to this comment:
/* block_index not 0 means the shared block is requested */

and the code,

if (!block) {
block = tcf_block_create(net, q, extack);

block->q is set to q, and q is always non-NULL AFAIU.

Also, I don't know if it is intended, but block->q always points to
the parent qdisc rather than the qdisc attached to a class.


>>
>>
>>@@ -338,7 +330,7 @@ static struct hlist_head *tc_u_common_hash;
>>
>> static unsigned int tc_u_hash(const struct tcf_proto *tp)
>> {
>>-   return hash_ptr(tp->chain->block, U32_HASH_SHIFT);
>>+   return hash_ptr(tp->chain->block->q, U32_HASH_SHIFT);
>> }
>>
>> static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp)
>>@@ -348,7 +340,7 @@ static struct tc_u_common *tc_u_common_find(const
>>struct tcf_proto *tp)
>>
>>h = tc_u_hash(tp);
>>hlist_for_each_entry(tc, &tc_u_common_hash[h], hnode) {
>>-   if (tc->block == tp->chain->block)
>>+   if (tc->block->q == tp->chain->block->q)
>
> :O I don't get it. tc->block is pointer, tc->block->q is pointer. And
> they are different at the same time for non-shared block.

If you look into Pawel's script, a new block is created for each class
therefore a different tc_u_common is created, which causes the
lookup of ht 9:22 to fail.


[Patch net v2] ipt_CLUSTERIP: fix a refcount bug in clusterip_config_find_get()

2018-02-08 Thread Cong Wang
In clusterip_config_find_get() we hold RCU read lock so it could
run concurrently with clusterip_config_entry_put(), as a result,
the refcnt could go back to 1 from 0, which leads to a double
list_del()... Just replace refcount_inc() with
refcount_inc_not_zero(), as for c->refcount.

Fixes: d73f33b16883 ("netfilter: CLUSTERIP: RCU conversion")
Cc: Eric Dumazet <eric.duma...@gmail.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: Florian Westphal <f...@strlen.de>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 1ff72b87a066..4b02ab39ebc5 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -154,8 +154,12 @@ clusterip_config_find_get(struct net *net, __be32 clusterip, int entry)
 #endif
if (unlikely(!refcount_inc_not_zero(&c->refcount)))
c = NULL;
-   else if (entry)
-   refcount_inc(&c->entries);
+   else if (entry) {
+   if (unlikely(!refcount_inc_not_zero(&c->entries))) {
+   clusterip_config_put(c);
+   c = NULL;
+   }
+   }
}
rcu_read_unlock_bh();
 
-- 
2.13.0
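For reference, the semantics this fix relies on can be sketched in userspace C: refcount_inc_not_zero() only takes a reference while the count is still non-zero, so the 0 -> 1 revival that leads to the double list_del() described above cannot happen. This is an illustrative reimplementation on top of C11 atomics, not the kernel's refcount_t code.

```c
/* Userspace sketch of refcount_inc_not_zero(): take a reference only if
 * the object is still live (count > 0). A plain increment would revive
 * an object whose last reference was already dropped -- the bug fixed
 * by the patch above. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static bool refcount_inc_not_zero(atomic_int *r)
{
    int old = atomic_load(r);

    while (old != 0) {
        /* On failure, old is reloaded and the liveness check repeats. */
        if (atomic_compare_exchange_weak(r, &old, old + 1))
            return true;    /* reference taken */
    }
    return false;           /* object already dying, caller must bail out */
}
```

A caller that sees `false` must treat the object as gone, which is exactly why the patch sets `c = NULL` (and drops the already-taken c->refcount) in that branch.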



Re: [Patch net] ipt_CLUSTERIP: fix a refcount bug in clusterip_config_find_get()

2018-02-08 Thread Cong Wang
On Thu, Feb 8, 2018 at 12:01 AM, Florian Westphal <f...@strlen.de> wrote:
> Cong Wang <xiyou.wangc...@gmail.com> wrote:
>> In clusterip_config_find_get() we hold RCU read lock so it could
>> run concurrently with clusterip_config_entry_put(), as a result,
>> the refcnt could go back to 1 from 0, which leads to a double
>> list_del()... Just replace refcount_inc() with
>> refcount_inc_not_zero(), as for c->refcount.
>>
>> Fixes: d73f33b16883 ("netfilter: CLUSTERIP: RCU conversion")
>> Cc: Eric Dumazet <eric.duma...@gmail.com>
>> Cc: Pablo Neira Ayuso <pa...@netfilter.org>
>> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>> ---
>>  net/ipv4/netfilter/ipt_CLUSTERIP.c | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
>> index 1ff72b87a066..4537b1686c7c 100644
>> --- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
>> +++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
>> @@ -154,8 +154,10 @@ clusterip_config_find_get(struct net *net, __be32 clusterip, int entry)
>>  #endif
>>   if (unlikely(!refcount_inc_not_zero(&c->refcount)))
>>   c = NULL;
>> - else if (entry)
>> - refcount_inc(&c->entries);
>> + else if (entry) {
>> + if (unlikely(!refcount_inc_not_zero(&c->entries)))
>
> this needs to call clusterip_config_put(c); too, else we leak one
> reference.
>
> Other than that this looks good.

Right, good catch! I will send v2.


[Patch net] ipt_CLUSTERIP: fix a refcount bug in clusterip_config_find_get()

2018-02-07 Thread Cong Wang
In clusterip_config_find_get() we hold RCU read lock so it could
run concurrently with clusterip_config_entry_put(), as a result,
the refcnt could go back to 1 from 0, which leads to a double
list_del()... Just replace refcount_inc() with
refcount_inc_not_zero(), as for c->refcount.

Fixes: d73f33b16883 ("netfilter: CLUSTERIP: RCU conversion")
Cc: Eric Dumazet <eric.duma...@gmail.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 1ff72b87a066..4537b1686c7c 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -154,8 +154,10 @@ clusterip_config_find_get(struct net *net, __be32 clusterip, int entry)
 #endif
if (unlikely(!refcount_inc_not_zero(&c->refcount)))
c = NULL;
-   else if (entry)
-   refcount_inc(&c->entries);
+   else if (entry) {
+   if (unlikely(!refcount_inc_not_zero(&c->entries)))
+   c = NULL;
+   }
}
rcu_read_unlock_bh();
 
-- 
2.13.0



[Patch net] ipt_CLUSTERIP: fix a race condition of proc file creation

2018-02-07 Thread Cong Wang
There is a race condition between clusterip_config_entry_put()
and clusterip_config_init(), after we release the spinlock in
clusterip_config_entry_put(), a new proc file with a same IP could
be created immediately since it is already removed from the configs
list, therefore it triggers this warning:

[ cut here ]
proc_dir_entry 'ipt_CLUSTERIP/172.20.0.170' already registered
WARNING: CPU: 1 PID: 4152 at fs/proc/generic.c:330 proc_register+0x2a4/0x370 fs/proc/generic.c:329
Kernel panic - not syncing: panic_on_warn set ...

As a quick fix, just move the proc_remove() inside the spinlock.

Reported-by: <syzbot+03218bcdba6aa7644...@syzkaller.appspotmail.com>
Fixes: 6c5d5cfbe3c5 ("netfilter: ipt_CLUSTERIP: check duplicate config when initializing")
Tested-by: Paolo Abeni <pab...@redhat.com>
Cc: Xin Long <lucien@gmail.com>
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 3a84a60f6b39..1ff72b87a066 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -107,12 +107,6 @@ clusterip_config_entry_put(struct net *net, struct clusterip_config *c)
 
local_bh_disable();
if (refcount_dec_and_lock(&c->entries, &cn->lock)) {
-   list_del_rcu(&c->list);
-   spin_unlock(&cn->lock);
-   local_bh_enable();
-
-   unregister_netdevice_notifier(&c->notifier);
-
/* In case anyone still accesses the file, the open/close
 * functions are also incrementing the refcount on their own,
 * so it's safe to remove the entry even if it's in use. */
@@ -120,6 +114,12 @@ clusterip_config_entry_put(struct net *net, struct clusterip_config *c)
if (cn->procdir)
proc_remove(c->pde);
 #endif
+   list_del_rcu(&c->list);
+   spin_unlock(&cn->lock);
+   local_bh_enable();
+
+   unregister_netdevice_notifier(&c->notifier);
+
return;
}
local_bh_enable();
-- 
2.13.0



Re: Fwd: u32 ht filters

2018-02-07 Thread Cong Wang
On Tue, Feb 6, 2018 at 11:01 PM, Jiri Pirko  wrote:
> Wed, Feb 07, 2018 at 06:09:15AM CET, xiyou.wangc...@gmail.com wrote:
>>Hi, Jiri
>>
>>Your  commit 7fa9d974f3c2a016b9accb18f4ee2ed2a738585c
>>breaks the tc script by Paweł. Please find below for details.
>
> Did you do the bisection?
> The commit just uses block struct instead of q, but since they
> are in 1:1 relation, that should be equvivalent. So basically you still
> have per-qdisc hashtables for u32.

Well, at least the following fixes the problem here. But I am not sure
if it is expected too for shared block among multiple qdiscs.


@@ -338,7 +330,7 @@ static struct hlist_head *tc_u_common_hash;

 static unsigned int tc_u_hash(const struct tcf_proto *tp)
 {
-   return hash_ptr(tp->chain->block, U32_HASH_SHIFT);
+   return hash_ptr(tp->chain->block->q, U32_HASH_SHIFT);
 }

 static struct tc_u_common *tc_u_common_find(const struct tcf_proto *tp)
@@ -348,7 +340,7 @@ static struct tc_u_common *tc_u_common_find(const
struct tcf_proto *tp)

h = tc_u_hash(tp);
hlist_for_each_entry(tc, &tc_u_common_hash[h], hnode) {
-   if (tc->block == tp->chain->block)
+   if (tc->block->q == tp->chain->block->q)
return tc;
}
return NULL;


Re: WARNING: proc registration bug in clusterip_tg_check

2018-02-06 Thread Cong Wang
On Tue, Feb 6, 2018 at 6:27 AM, syzbot
 wrote:
> Hello,
>
> syzbot hit the following crash on net-next commit
> 617aebe6a97efa539cc4b8a52adccd89596e6be0 (Sun Feb 4 00:25:42 2018 +)
> Merge tag 'usercopy-v4.16-rc1' of
> git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
>
> So far this crash happened 5 times on net-next, upstream.
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+03218bcdba6aa7644...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> x_tables: ip_tables: osf match: only valid for protocol 6
> x_tables: ip_tables: osf match: only valid for protocol 6
> x_tables: ip_tables: osf match: only valid for protocol 6
> [ cut here ]
> proc_dir_entry 'ipt_CLUSTERIP/172.20.0.170' already registered
> WARNING: CPU: 1 PID: 4152 at fs/proc/generic.c:330 proc_register+0x2a4/0x370
> fs/proc/generic.c:329
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 1 PID: 4152 Comm: syzkaller851476 Not tainted 4.15.0+ #221
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  panic+0x1e4/0x41c kernel/panic.c:183
>  __warn+0x1dc/0x200 kernel/panic.c:547
>  report_bug+0x211/0x2d0 lib/bug.c:184
>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1097
> RIP: 0010:proc_register+0x2a4/0x370 fs/proc/generic.c:329
> RSP: 0018:8801cbd6ee20 EFLAGS: 00010286
> RAX: dc08 RBX: 8801d2181038 RCX: 815a57ae
> RDX:  RSI: 1100397add74 RDI: 1100397add49
> RBP: 8801cbd6ee70 R08: 1100397add0b R09: 
> R10: 8801cbd6ecd8 R11:  R12: 8801b2bb1cc0
> R13: dc00 R14: 8801b0d8dbc8 R15: 8801b2bb1d81
>  proc_create_data+0xf8/0x180 fs/proc/generic.c:494
>  clusterip_config_init net/ipv4/netfilter/ipt_CLUSTERIP.c:250 [inline]

I think there is probably a race condition between clusterip_config_entry_put()
and clusterip_config_init(): after we release the spinlock, a new proc entry
with the same IP could be created, which triggers this warning.

I am not sure if it is enough to just move the proc_remove() under the
spinlock...


diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c
b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 3a84a60f6b39..1ff72b87a066 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -107,12 +107,6 @@ clusterip_config_entry_put(struct net *net, struct clusterip_config *c)
 
 	local_bh_disable();
 	if (refcount_dec_and_lock(&c->entries, &cn->lock)) {
-		list_del_rcu(&c->list);
-		spin_unlock(&cn->lock);
-		local_bh_enable();
-
-		unregister_netdevice_notifier(&c->notifier);
-
 		/* In case anyone still accesses the file, the open/close
 		 * functions are also incrementing the refcount on their own,
 		 * so it's safe to remove the entry even if it's in use. */
@@ -120,6 +114,12 @@ clusterip_config_entry_put(struct net *net, struct clusterip_config *c)
 		if (cn->procdir)
 			proc_remove(c->pde);
 #endif
+		list_del_rcu(&c->list);
+		spin_unlock(&cn->lock);
+		local_bh_enable();
+
+		unregister_netdevice_notifier(&c->notifier);
+
 		return;
 	}
 	local_bh_enable();


>  clusterip_tg_check+0xf9c/0x16d0 net/ipv4/netfilter/ipt_CLUSTERIP.c:488
>  xt_check_target+0x22c/0x7d0 net/netfilter/x_tables.c:850
>  check_target net/ipv4/netfilter/ip_tables.c:513 [inline]
>  find_check_entry.isra.8+0x8c8/0xcb0 net/ipv4/netfilter/ip_tables.c:554
>  translate_table+0xed1/0x1610 net/ipv4/netfilter/ip_tables.c:725
>  do_replace net/ipv4/netfilter/ip_tables.c:1141 [inline]
>  do_ipt_set_ctl+0x370/0x5f0 net/ipv4/netfilter/ip_tables.c:1675
>  nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
>  nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115
>  ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1259
>  sctp_setsockopt+0x2b6/0x61d0 net/sctp/socket.c:4104
>  sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2975
>  SYSC_setsockopt net/socket.c:1849 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1828
>  entry_SYSCALL_64_fastpath+0x29/0xa0
> RIP: 0033:0x446839
> RSP: 002b:7f0309d0fdb8 EFLAGS: 0246 ORIG_RAX: 0036
> RAX: 

Re: Two net_sched fixes for stable

2018-02-06 Thread Cong Wang
On Tue, Feb 6, 2018 at 12:20 PM, David Miller <da...@davemloft.net> wrote:
> From: Cong Wang <xiyou.wangc...@gmail.com>
>>
>> Please let me know how you want to handle this for 4.14.
>
> Ok, I sent this off for 4.15 -stable but I need you to do the
> 4.14 backport.
>

OK. I assume you mean I should send the backports directly
to stable. I will do it.

Thanks.


Fwd: u32 ht filters

2018-02-06 Thread Cong Wang
Hi, Jiri

Your  commit 7fa9d974f3c2a016b9accb18f4ee2ed2a738585c
breaks the tc script by Paweł. Please find below for details.


commit 7fa9d974f3c2a016b9accb18f4ee2ed2a738585c
Author: Jiri Pirko <j...@mellanox.com>
Date:   Fri Oct 13 14:01:02 2017 +0200

net: sched: cls_u32: use block instead of q in tc_u_common

tc_u_common is now per-q. With blocks, it has to be converted to be
per-block.

Signed-off-by: Jiri Pirko <j...@mellanox.com>
Signed-off-by: David S. Miller <da...@davemloft.net>

Before this commit, u32 hashtables are per-qdisc, after this commit
it becomes per-block or per-class... this is why the script below is broken.


-- Forwarded message --
From: Paweł Staszewski <pstaszew...@itcare.pl>
Date: Tue, Feb 6, 2018 at 8:05 AM
Subject: u32 ht filters
To: Cong Wang <xiyou.wangc...@gmail.com>


Hi


Has something changed in kernel 4.15 that causes problems with old
tc filter configurations that use hashing filters?

for example this :

tc qdisc del root dev ifb1

tc qdisc add dev ifb1 root handle 1:0 hfsc default 8000
tc filter add dev ifb1 parent 1:0 protocol ip u32
tc class add dev ifb1 parent 1:0 classid 1:1 hfsc ls m2 1Mbit ul m2 1Mbit
tc class add dev ifb1 parent 1:1 classid 1:2 hfsc ls m2 1Mbit ul m2 1Mbit
tc class add dev ifb1 parent 1:1 classid 1:3 hfsc ls m2 5000Mbit ul m2 5000Mbit
tc class add dev ifb1 parent 1:2 classid 1:8000 hfsc ls m2 1Mbit ul m2 1Mbit
tc qdisc add dev ifb1 parent 1:8000 handle 8000: sfq perturb 60
tc qdisc add dev ifb1 parent 1:3 handle 3: pfifo limit 1


tc filter add dev ifb1 protocol ip parent 1:0 handle 9: u32 divisor 256
tc filter add dev ifb1 protocol ip parent 1:0 u32 ht 800:: match ip dst 192.168.0.0/24 hashkey mask 0x00ff at 16 link 9:
tc class add dev ifb1 parent 1:2 classid 1:60 hfsc ls m2 8kbit ul m2 51200kbit
echo 1
tc filter add dev ifb1 parent 1:2 protocol ip u32 ht 9:22 match ip dst 192.168.0.34 flowid 1:60
echo 2
tc qdisc add dev ifb1 parent 1:60 handle 60: pfifo limit 8192


This works with 4.13, but it does not work with 4.15.

The error appears when adding:

tc filter add dev ifb1 protocol ip parent 1:2 prio 4 u32 ht 9:0x22 match ip dst 192.168.0.34 flowid 1:60
RTNETLINK answers: Invalid argument
We have an error talking to the kernel




Thanks

Paweł Staszewski


Two net_sched fixes for stable

2018-02-06 Thread Cong Wang
Hi, David

Can you queue the following commits for stable? They fix a
sleep-in-atomic warning reported by Roland.


commit efbf78973978b0d25af59bc26c8013a942af6e64
Author: Cong Wang <xiyou.wangc...@gmail.com>
Date:   Mon Dec 4 10:48:18 2017 -0800

net_sched: get rid of rcu_barrier() in tcf_block_put_ext()

commit df45bf84e4f5a48f23d4b1a07d21d566e8b587b2
Author: Jiri Pirko <j...@mellanox.com>
Date:   Fri Dec 8 19:27:27 2017 +0100

net: sched: fix use-after-free in tcf_block_put_ext


This problem was introduced by:

commit e2ef75445340ca7ec2c4558f84ae6c8c5d650fc8
Author: Cong Wang <xiyou.wangc...@gmail.com>
Date:   Mon Sep 11 16:33:31 2017 -0700

net_sched: fix reference counting of tc filter chain

so should apply to 4.14 too.

For 4.15, they can just apply nearly cleanly. However, it is not
straight-forward to cherry pick them to 4.14 due to some big changes
between 4.14 and 4.15.

Please let me know how you want to handle this for 4.14.

Thanks.


[Patch net v2] xt_RATEEST: acquire xt_rateest_mutex for hash insert

2018-02-05 Thread Cong Wang
rateest_hash is supposed to be protected by xt_rateest_mutex,
and, as suggested by Eric, lookup and insert should be atomic,
so we should acquire the xt_rateest_mutex once for both.

So introduce a non-locking helper for internal use and keep the
locking one for external callers.

Reported-by: <syzbot+5cb189720978275e4...@syzkaller.appspotmail.com>
Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Cc: Eric Dumazet <eric.duma...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/netfilter/xt_RATEEST.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
index 498b54fd04d7..141c295191f6 100644
--- a/net/netfilter/xt_RATEEST.c
+++ b/net/netfilter/xt_RATEEST.c
@@ -39,23 +39,31 @@ static void xt_rateest_hash_insert(struct xt_rateest *est)
 	hlist_add_head(&est->list, &rateest_hash[h]);
 }
 
-struct xt_rateest *xt_rateest_lookup(const char *name)
+static struct xt_rateest *__xt_rateest_lookup(const char *name)
 {
 	struct xt_rateest *est;
 	unsigned int h;
 
 	h = xt_rateest_hash(name);
-	mutex_lock(&xt_rateest_mutex);
 	hlist_for_each_entry(est, &rateest_hash[h], list) {
 		if (strcmp(est->name, name) == 0) {
 			est->refcnt++;
-			mutex_unlock(&xt_rateest_mutex);
 			return est;
 		}
 	}
-	mutex_unlock(&xt_rateest_mutex);
+
 	return NULL;
 }
+
+struct xt_rateest *xt_rateest_lookup(const char *name)
+{
+	struct xt_rateest *est;
+
+	mutex_lock(&xt_rateest_mutex);
+	est = __xt_rateest_lookup(name);
+	mutex_unlock(&xt_rateest_mutex);
+	return est;
+}
 EXPORT_SYMBOL_GPL(xt_rateest_lookup);
 
 void xt_rateest_put(struct xt_rateest *est)
@@ -100,8 +108,10 @@ static int xt_rateest_tg_checkentry(const struct xt_tgchk_param *par)
 
 	net_get_random_once(&jhash_rnd, sizeof(jhash_rnd));
 
-	est = xt_rateest_lookup(info->name);
+	mutex_lock(&xt_rateest_mutex);
+	est = __xt_rateest_lookup(info->name);
 	if (est) {
+		mutex_unlock(&xt_rateest_mutex);
 		/*
 		 * If estimator parameters are specified, they must match the
 		 * existing estimator.
@@ -139,11 +149,13 @@ static int xt_rateest_tg_checkentry(const struct xt_tgchk_param *par)
 
 	info->est = est;
 	xt_rateest_hash_insert(est);
+	mutex_unlock(&xt_rateest_mutex);
 	return 0;
 
 err2:
 	kfree(est);
 err1:
+	mutex_unlock(&xt_rateest_mutex);
 	return ret;
 }
 
-- 
2.13.0



Re: [PATCH net v4] cls_u32: fix use after free in u32_destroy_key()

2018-02-05 Thread Cong Wang
On Mon, Feb 5, 2018 at 1:23 PM, Paolo Abeni <pab...@redhat.com> wrote:
> The problem is that the htnode is freed before the linked knodes, and the
> latter will try to access the former at u32_destroy_key() time.
> This change addresses the issue using the htnode refcnt to guarantee
> the correct free order. While at it also add a RCU annotation,
> to keep sparse happy.
>
> v1 -> v2: use rtnl_derefence() instead of RCU read locks
> v2 -> v3:
>   - don't check refcnt in u32_destroy_hnode()
>   - cleaned-up u32_destroy() implementation
>   - cleaned-up code comment
> v3 -> v4:
>   - dropped unneeded comment
>
> Reported-by: Li Shuang <shu...@redhat.com>
> Fixes: c0d378ef1266 ("net_sched: use tcf_queue_work() in u32 filter")
> Signed-off-by: Paolo Abeni <pab...@redhat.com>

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>

Thanks.


Re: [PATCH net v3] cls_u32: fix use after free in u32_destroy_key()

2018-02-05 Thread Cong Wang
On Mon, Feb 5, 2018 at 1:20 AM, Paolo Abeni  wrote:
> @@ -625,6 +627,8 @@ static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
> idr_destroy(&ht->handle_idr);
> idr_remove_ext(&tp_c->handle_idr, ht->handle);
> RCU_INIT_POINTER(*hn, ht->next);
> +
> +
> +   /* the caller ensures ht->refcnt is 0 at this point */

This comment is not needed, because there is already a WARN_ON():

 608 static int u32_destroy_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
 609  struct netlink_ext_ack *extack)
 610 {
 611 struct tc_u_common *tp_c = tp->data;
 612 struct tc_u_hnode __rcu **hn;
 613 struct tc_u_hnode *phn;
 614
 615 WARN_ON(ht->refcnt);


Other than this, looks good.

Thanks.

