Re: [PATCH net] cls_u32: add missing RCU annotation.

2018-02-02 Thread Cong Wang
On Fri, Feb 2, 2018 at 7:02 AM, Paolo Abeni <pab...@redhat.com> wrote:
> In a couple of points of the control path, n->ht_down is currently
> accessed without the required RCU annotation. The accesses are
> safe, but sparse complaints. Since we already held the
> rtnl lock, let use rtnl_dereference().
>
> Fixes: a1b7c5fd7fe9 ("net: sched: add cls_u32 offload hooks for netdevs")
> Fixes: de5df63228fc ("net: sched: cls_u32 changes to knode must appear atomic 
> to readers")
> Signed-off-by: Paolo Abeni <pab...@redhat.com>

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>


Re: [PATCH net v2] cls_u32: fix use after free in u32_destroy_key()

2018-02-02 Thread Cong Wang
On Fri, Feb 2, 2018 at 6:30 AM, Paolo Abeni  wrote:
> The problem is that the htnode is freed before the linked knodes and the
> latter will try to access the first at u32_destroy_key() time.
> This change addresses the issue using the htnode refcnt to guarantee
> the correct free order. While at it also add a RCU annotation,
> to keep sparse happy.
>
> v1 -> v2: use rtnl_derefence() instead of RCU read locks
>
> Reported-by: Li Shuang 
> Fixes: c0d378ef1266 ("net_sched: use tcf_queue_work() in u32 filter")
> Signed-off-by: Paolo Abeni 
> ---
>  net/sched/cls_u32.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
> index 60c892c36a60..10440fbf3c68 100644
> --- a/net/sched/cls_u32.c
> +++ b/net/sched/cls_u32.c
> @@ -398,10 +398,12 @@ static int u32_init(struct tcf_proto *tp)
>  static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n,
>bool free_pf)
>  {
> +   struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
> +
> tcf_exts_destroy(>exts);
> tcf_exts_put_net(>exts);
> -   if (n->ht_down)
> -   n->ht_down->refcnt--;
> +   if (ht && ht->refcnt-- == 0)
> +   kfree(ht);
>  #ifdef CONFIG_CLS_U32_PERF
> if (free_pf)
> free_percpu(n->pf);
> @@ -624,7 +626,12 @@ static int u32_destroy_hnode(struct tcf_proto *tp, 
> struct tc_u_hnode *ht,
> idr_destroy(>handle_idr);
> idr_remove_ext(_c->handle_idr, ht->handle);
> RCU_INIT_POINTER(*hn, ht->next);
> -   kfree_rcu(ht, rcu);
> +
> +   /* u32_destroy_key() will will later free ht for us, 
> if
> +* it's still referenced by some knode
> +*/
> +   if (ht->refcnt == 0)
> +   kfree_rcu(ht, rcu);


Isn't u32_destroy_hnode() always called with ht->refcnt==0 ?
So no need this check at all?


> return 0;
> }
> }
> @@ -667,7 +674,11 @@ static void u32_destroy(struct tcf_proto *tp, struct 
> netlink_ext_ack *extack)
>
> while ((ht = rtnl_dereference(tp_c->hlist)) != NULL) {
> RCU_INIT_POINTER(tp_c->hlist, ht->next);
> -   kfree_rcu(ht, rcu);
> +   /* u32_destroy_key() will will later free ht for us, 
> if


Nit: double "will"


> +* it's still referenced by some knode
> +*/
> +   if (ht->refcnt == 0)
> +   kfree_rcu(ht, rcu);


This part looks fine.

Thanks!


Re: iproute2 4.14.1 tc class add come to kernel-panic

2018-02-01 Thread Cong Wang
On Tue, Jan 30, 2018 at 11:46 PM, Roland Franke <fl...@franke-prem.de> wrote:
> Hello,
>
>> On Tue, Jan 30, 2018 at 9:46 AM, Roland Franke <fl...@franke-prem.de>
>> wrote:
>>>
>>> Hello,
>>>
>>>> Well, not min_qdisc things, but it should be resolved by:
>>>
>>>
>>>> commit efbf78973978b0d25af59bc26c8013a942af6e64
>>>> Author: Cong Wang <xiyou.wangc...@gmail.com>
>>>> Date:   Mon Dec 4 10:48:18 2017 -0800
>>>
>>>
>>>> net_sched: get rid of rcu_barrier() in tcf_block_put_ext()
>>>
>>>
>>>
>>> Against what kernel-version was this be made?
>>> patch from https://patchwork.ozlabs.org/patch/844372/
>>> will not fit against kernel 4.14.15
>>>
>
>> It is merged during 4.15 window, it was to address a performance
>> issue, I don't realize it fixes any bug until you report this.
>
>
>> If you need, we can ask DaveM to queue it for stable, but it may
>> need some other commit too.
>
>
> It is not for me. It is a general question, if tc schould work with
> the requested function general not with the kernel-line 4.14 (As
> i was thinking that this should be come to a very long support.
> (As i was suggest to read in an german page for computers; but
> it may be a mistake from my side)
> Or if the 4.14 series come to a "short end" and then it take no
> matter when the function not work.
>

Sure, I will send a list of commits to backport to DaveM.

Thanks.


Re: [Patch net] xt_RATEEST: acquire xt_rateest_mutex for hash insert

2018-02-01 Thread Cong Wang
On Wed, Jan 31, 2018 at 5:44 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Wed, 2018-01-31 at 16:26 -0800, Cong Wang wrote:
>> rateest_hash is supposed to be protected by xt_rateest_mutex.
>>
>> Reported-by: <syzbot+5cb189720978275e4...@syzkaller.appspotmail.com>
>> Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
>> Cc: Pablo Neira Ayuso <pa...@netfilter.org>
>> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>> ---
>>  net/netfilter/xt_RATEEST.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
>> index 498b54fd04d7..83ec3a282755 100644
>> --- a/net/netfilter/xt_RATEEST.c
>> +++ b/net/netfilter/xt_RATEEST.c
>> @@ -36,7 +36,9 @@ static void xt_rateest_hash_insert(struct xt_rateest *est)
>>   unsigned int h;
>>
>>   h = xt_rateest_hash(est->name);
>> + mutex_lock(_rateest_mutex);
>>   hlist_add_head(>list, _hash[h]);
>> + mutex_unlock(_rateest_mutex);
>>  }
>
> We probably should make this module netns aware, otherwise bad things
> will happen.

Right, both the lock and the hashtable. I can do it for net-next,
if you don't.

>
> (It seems multiple threads could run, so getting the mutex twice
> (xt_rateest_lookup then xt_rateest_hash_insert() is racy)

Yeah, need to merge these two critical sections.

Thanks.


[Patch net] xt_RATEEST: acquire xt_rateest_mutex for hash insert

2018-01-31 Thread Cong Wang
rateest_hash is supposed to be protected by xt_rateest_mutex.

Reported-by: <syzbot+5cb189720978275e4...@syzkaller.appspotmail.com>
Fixes: 5859034d7eb8 ("[NETFILTER]: x_tables: add RATEEST target")
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/netfilter/xt_RATEEST.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netfilter/xt_RATEEST.c b/net/netfilter/xt_RATEEST.c
index 498b54fd04d7..83ec3a282755 100644
--- a/net/netfilter/xt_RATEEST.c
+++ b/net/netfilter/xt_RATEEST.c
@@ -36,7 +36,9 @@ static void xt_rateest_hash_insert(struct xt_rateest *est)
unsigned int h;
 
h = xt_rateest_hash(est->name);
+   mutex_lock(_rateest_mutex);
hlist_add_head(>list, _hash[h]);
+   mutex_unlock(_rateest_mutex);
 }
 
 struct xt_rateest *xt_rateest_lookup(const char *name)
-- 
2.13.0



[Patch net] xt_cgroup: initialize info->priv in cgroup_mt_check_v1()

2018-01-31 Thread Cong Wang
xt_cgroup_info_v1->priv is an internal pointer only used for kernel,
we should not trust what user-space provides.

Reported-by: <syzbot+4fbcfcc0d2e6592bd...@syzkaller.appspotmail.com>
Fixes: c38c4597e4bf ("netfilter: implement xt_cgroup cgroup2 path match")
Cc: Pablo Neira Ayuso <pa...@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/netfilter/xt_cgroup.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index 1db1ce59079f..891f4e7e8ea7 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -52,6 +52,7 @@ static int cgroup_mt_check_v1(const struct xt_mtchk_param 
*par)
return -EINVAL;
}
 
+   info->priv = NULL;
if (info->has_path) {
cgrp = cgroup_get_from_path(info->path);
if (IS_ERR(cgrp)) {
-- 
2.13.0



Re: iproute2 4.14.1 tc class add come to kernel-panic

2018-01-30 Thread Cong Wang
On Tue, Jan 30, 2018 at 9:46 AM, Roland Franke <fl...@franke-prem.de> wrote:
> Hello,
>
>> Well, not min_qdisc things, but it should be resolved by:
>
>
>> commit efbf78973978b0d25af59bc26c8013a942af6e64
>> Author: Cong Wang <xiyou.wangc...@gmail.com>
>> Date:   Mon Dec 4 10:48:18 2017 -0800
>
>
>> net_sched: get rid of rcu_barrier() in tcf_block_put_ext()
>
>
> Against what kernel-version was this be made?
> patch from https://patchwork.ozlabs.org/patch/844372/
> will not fit against kernel 4.14.15
>

It is merged during 4.15 window, it was to address a performance
issue, I don't realize it fixes any bug until you report this.

If you need, we can ask DaveM to queue it for stable, but it may
need some other commit too.

Thanks.


Re: net: hang in unregister_netdevice: waiting for lo to become free

2018-01-30 Thread Cong Wang
On Tue, Jan 30, 2018 at 4:09 AM, Dmitry Vyukov  wrote:
> Hello,
>
> The following program creates a hang in unregister_netdevice.
> cleanup_net work hangs there forever periodically printing
> "unregister_netdevice: waiting for lo to become free. Usage count = 3"
> and creation of any new network namespaces hangs forever.

Interestingly, this is not reproducible on net-next.


Re: [Patch net-next v3 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-29 Thread Cong Wang
On Mon, Jan 29, 2018 at 9:43 AM, David Miller  wrote:
>
> Please follow up with John about making the queue allocation use
> a prepare + commit phase.

Will do it once net-next is re-open.


>
> And as for the TX queue state handling on change, I think it's
> fine to purge the whole queue.  That is definitely better than
> allowing the queue length to be visibly over the tx_queue_len
> setting.
>

OK. Thanks.


Re: iproute2 4.14.1 tc class add come to kernel-panic

2018-01-29 Thread Cong Wang
On Mon, Jan 29, 2018 at 9:03 AM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> On Mon, Jan 29, 2018 at 8:00 AM, Stephen Hemminger
> <step...@networkplumber.org> wrote:
>> On Mon, 29 Jan 2018 16:18:07 +0100
>> "Roland Franke" <fl...@franke-prem.de> wrote:
>>
>>> Hello,
>>>
>>> > To: Roland Franke ; netdev@vger.kernel.org
>>> > Subject: Re: BUG: iproute2 4.14.1 tc class add come to kernel-panic
>>> >>
>>> >> tc qdisc add dev eth0 root handle 20: htb default 4 r2q 1
>>> >> tc class add dev eth0 parent 20: classid 20:7 htb rate 1kbit
>>> >> tc qdisc add dev eth0 parent 20:7 sfq perturb 10
>>> >> tc class add dev eth0 parent 20:7 classid 20:1 htb rate 200kbit ceil
>>> >> 1kbit prio 0
>>> >>
>>> >> I become an Kernel-panic with the following output:
>>> >> kern.err kernel: BUG: scheduling while atomic: tc/1036/0x0200
>>> >
>>>
>>> > Would you have a stack trace to share with us ?
>>>
>>> As i will be an absolute newby here, i will not know how to
>>> get the stack trace out.
>>> When i will get some information how to get this, i can try to
>>> give you this information.
>>> But by my last tests i made the first 3 commands on an console
>>> and had no error. Only by typing the last line i will get the error and
>>> here i get actally only the "kern.err kernel: BUG: ." message.
>>>
>>> Roland
>>
>> It generates this with lockdep (on 4.15)
>>
>> [  151.355076] HTB: quantum of class 27 is big. Consider r2q change.
>> [  151.371495] BUG: sleeping function called from invalid context at 
>> kernel/locking/mutex.c:747
>> [  151.371815] in_atomic(): 1, irqs_disabled(): 0, pid: 1135, name: tc
>> [  151.371875] 2 locks held by tc/1135:
>> [  151.371878]  #0:  (rtnl_mutex){+.+.}, at: [] 
>> rtnetlink_rcv_msg+0x23b/0x5f0
>> [  151.371899]  #1:  (_tx_lock){+...}, at: [] 
>> htb_change_class+0x55f/0x880 [sch_htb]
>> [  151.371918] CPU: 2 PID: 1135 Comm: tc Not tainted 4.14.15 #2
>> [  151.371921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.10.2-1 04/01/2014
>> [  151.371924] Call Trace:
>> [  151.371934]  dump_stack+0x7c/0xbe
>> [  151.371944]  ___might_sleep+0x21e/0x250
>> [  151.371953]  __mutex_lock+0x59/0x9a0
>> [  151.371960]  ? lock_acquire+0xec/0x1e0
>> [  151.371973]  ? __raw_spin_lock_init+0x1c/0x50
>> [  151.371990]  ? _rcu_barrier+0x2f/0x170
>> [  151.371995]  ? __lockdep_init_map+0x5c/0x1d0
>> [  151.371998]  _rcu_barrier+0x2f/0x170
>> [  151.372006]  tcf_block_put+0x7f/0xa0
>> [  151.372013]  sfq_destroy+0x15/0x50 [sch_sfq]
>> [  151.372021]  qdisc_destroy+0x6c/0xe0
>> [  151.372028]  htb_change_class+0x704/0x880 [sch_htb]
>
>
> We hold qdisc tree spinlock but call rcu_barrier() in
> mini_qdisc_pair_swap()...

Well, not min_qdisc things, but it should be resolved by:

commit efbf78973978b0d25af59bc26c8013a942af6e64
Author: Cong Wang <xiyou.wangc...@gmail.com>
Date:   Mon Dec 4 10:48:18 2017 -0800

net_sched: get rid of rcu_barrier() in tcf_block_put_ext()


Re: iproute2 4.14.1 tc class add come to kernel-panic

2018-01-29 Thread Cong Wang
On Mon, Jan 29, 2018 at 8:00 AM, Stephen Hemminger
 wrote:
> On Mon, 29 Jan 2018 16:18:07 +0100
> "Roland Franke"  wrote:
>
>> Hello,
>>
>> > To: Roland Franke ; netdev@vger.kernel.org
>> > Subject: Re: BUG: iproute2 4.14.1 tc class add come to kernel-panic
>> >>
>> >> tc qdisc add dev eth0 root handle 20: htb default 4 r2q 1
>> >> tc class add dev eth0 parent 20: classid 20:7 htb rate 1kbit
>> >> tc qdisc add dev eth0 parent 20:7 sfq perturb 10
>> >> tc class add dev eth0 parent 20:7 classid 20:1 htb rate 200kbit ceil
>> >> 1kbit prio 0
>> >>
>> >> I become an Kernel-panic with the following output:
>> >> kern.err kernel: BUG: scheduling while atomic: tc/1036/0x0200
>> >
>>
>> > Would you have a stack trace to share with us ?
>>
>> As i will be an absolute newby here, i will not know how to
>> get the stack trace out.
>> When i will get some information how to get this, i can try to
>> give you this information.
>> But by my last tests i made the first 3 commands on an console
>> and had no error. Only by typing the last line i will get the error and
>> here i get actally only the "kern.err kernel: BUG: ." message.
>>
>> Roland
>
> It generates this with lockdep (on 4.15)
>
> [  151.355076] HTB: quantum of class 27 is big. Consider r2q change.
> [  151.371495] BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:747
> [  151.371815] in_atomic(): 1, irqs_disabled(): 0, pid: 1135, name: tc
> [  151.371875] 2 locks held by tc/1135:
> [  151.371878]  #0:  (rtnl_mutex){+.+.}, at: [] 
> rtnetlink_rcv_msg+0x23b/0x5f0
> [  151.371899]  #1:  (_tx_lock){+...}, at: [] 
> htb_change_class+0x55f/0x880 [sch_htb]
> [  151.371918] CPU: 2 PID: 1135 Comm: tc Not tainted 4.14.15 #2
> [  151.371921] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [  151.371924] Call Trace:
> [  151.371934]  dump_stack+0x7c/0xbe
> [  151.371944]  ___might_sleep+0x21e/0x250
> [  151.371953]  __mutex_lock+0x59/0x9a0
> [  151.371960]  ? lock_acquire+0xec/0x1e0
> [  151.371973]  ? __raw_spin_lock_init+0x1c/0x50
> [  151.371990]  ? _rcu_barrier+0x2f/0x170
> [  151.371995]  ? __lockdep_init_map+0x5c/0x1d0
> [  151.371998]  _rcu_barrier+0x2f/0x170
> [  151.372006]  tcf_block_put+0x7f/0xa0
> [  151.372013]  sfq_destroy+0x15/0x50 [sch_sfq]
> [  151.372021]  qdisc_destroy+0x6c/0xe0
> [  151.372028]  htb_change_class+0x704/0x880 [sch_htb]


We hold qdisc tree spinlock but call rcu_barrier() in
mini_qdisc_pair_swap()...



> [  151.372050]  tc_ctl_tclass+0x193/0x3c0
> [  151.372075]  rtnetlink_rcv_msg+0x270/0x5f0
> [  151.372088]  ? rtnetlink_put_metrics+0x1c0/0x1c0
> [  151.372096]  netlink_rcv_skb+0xde/0x110
> [  151.372109]  netlink_unicast+0x1e4/0x310
> [  151.372118]  netlink_sendmsg+0x2dc/0x3c0
> [  151.372136]  sock_sendmsg+0x30/0x40
> [  151.372142]  ___sys_sendmsg+0x2b9/0x2d0
> [  151.372171]  ? __handle_mm_fault+0x7f8/0x1120
> [  151.372189]  ? __do_page_fault+0x2a5/0x530
> [  151.372200]  ? __sys_sendmsg+0x51/0x90
> [  151.372205]  __sys_sendmsg+0x51/0x90
> [  151.372225]  entry_SYSCALL_64_fastpath+0x29/0xa0
> [  151.372230] RIP: 0033:0x7f7049397490
> [  151.372233] RSP: 002b:7ffd0c432f48 EFLAGS: 0246
> [  151.372445] BUG: scheduling while atomic: tc/1135/0x0202
> [  151.372524] 3 locks held by tc/1135:
> [  151.372527]  #0:  (rtnl_mutex){+.+.}, at: [] 
> rtnetlink_rcv_msg+0x23b/0x5f0
> [  151.372544]  #1:  (_tx_lock){+...}, at: [] 
> htb_change_class+0x55f/0x880 [sch_htb]
> [  151.372560]  #2:  (rcu_sched_state.barrier_mutex){+.+.}, at: 
> [] _rcu_barrier+0x2f/0x170
> [  151.372575] Modules linked in: sch_sfq sch_htb ppdev input_leds serio_raw 
> joydev parport_pc parport i2c_piix4 mac_hid ib_iser rdma_cm iw_cm ib_cm 
> ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs 
> zstd_decompress zstd_compress xxhash raid10 raid456 libcrc32c 
> async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 
> raid0 multipath linear qxl drm_kms_helper syscopyarea sysfillrect sysimgblt 
> fb_sys_fops ttm drm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
> aesni_intel aes_x86_64 crypto_simd cryptd glue_helper psmouse floppy 
> pata_acpi hid_generic usbhid hid
> [  151.372748] CPU: 2 PID: 1135 Comm: tc Tainted: GW   4.14.15 #2
> [  151.372751] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.10.2-1 04/01/2014
> [  151.372753] Call Trace:
> [  151.372759]  dump_stack+0x7c/0xbe
> [  151.372767]  __schedule_bug+0x62/0x90
> [  151.372772]  __schedule+0x7d9/0xb80
> [  151.372788]  schedule+0x39/0x90
> [  151.372793]  schedule_timeout+0x24d/0x5c0
> [  151.372813]  ? mark_held_locks+0x71/0xa0
> [  151.372825]  ? wait_for_completion+0x12e/0x1a0
> [  151.372829]  wait_for_completion+0x12e/0x1a0
> [  151.372835]  ? wake_up_q+0x60/0x60
> [  151.372846]  _rcu_barrier+0x11a/0x170
> [  151.372853]  tcf_block_put+0x7f/0xa0
> [  151.372860]  

Re: [Patch net-next v3 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-28 Thread Cong Wang
On Sun, Jan 28, 2018 at 10:09 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 01/28/2018 09:57 PM, Cong Wang wrote:
>> On Sun, Jan 28, 2018 at 9:35 PM, John Fastabend
>> <john.fastab...@gmail.com> wrote:
>>> On 01/25/2018 06:26 PM, Cong Wang wrote:
>>>> This pathcset restores the pfifo_fast qdisc behavior of dropping
>>>> packets based on latest dev->tx_queue_len. Patch 1 introduces
>>>> a helper, patch 2 introduces a new Qdisc ops which is called when
>>>> we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.
>>>>
>>>> Please see each patch for details.
>>>>
>>>
>>> Overall this series is better than what we have at the moment, but
>>> a better fix would preallocate the memory, to avoid ENOMEM errors,
>>
>> I am not against for better ENOMEM error handling, but I still have to
>> remind you that attach_one_default_qdisc() doesn't handle it either.
>> Since no one complained about it, why this one is so special?
>
> Its not we should fix both cases. And also clean up the multiple
> net sync calls in these paths as well.

Now can we agree error handling can be improved later? You
already agree this is not a new problem introduced by this patchset,
why do you want to block a regression fix just because of an old
problem which I will fix later?


>
>>
>>
>>> and add a ptr_ring API to use the preallocated buffers.
>>
>>
>> What ptr_ring API could cure netdev tx queues problem here?
>> Looks like you still don't understand the problem here.
>>
>
> The resize multiple array API can only fail due to alloc failures.
> We need to break this API into two pieces preallocate the memory
> and then commit array changes. Alternatively the qdisc layer could
> allocate new arrays and then swap them after all the arrays been
> initialized correctly using ptr_ring APIs below the ptr_ring
> resize API calls.
>
> Having ptr_ring API operations to support this seems best.
>


Apparently qdisc layer doesn't care about ptr_ring, as you describe
here this potentially needs to change two layers: 1) qdisc layer
2) ptr_ring API. It is more work than just a simple error handling,
potentially bigger than this patchset itself.


>>
>>>
>>> We have time (its only in net-next) so lets do the complete fix
>>> rather than this series IMO.
>>>
>>
>> Why not just accept this and complete the error handling later
>> given the fact that I already add a TODO? IOW, why it is now?
>>
>
> Because its not correct and its not too much work to get it so
> the error is avoided.

So you must want a fix for net too, because attach_one_default_qdisc()
"is not correct" either and "it is not too much work".

I can't agree on any of your claims here. Sorry.


Re: [Patch net-next v3 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-28 Thread Cong Wang
On Sun, Jan 28, 2018 at 9:35 PM, John Fastabend
 wrote:
> We have time (its only in net-next) so lets do the complete fix
> rather than this series IMO.

Also have to remind you: this patchset fixes a regression introduced
by your patchset in net-next, it is not a new feature. I don't think it is
a serious regression, but it is still one...


Re: [Patch net-next v3 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-28 Thread Cong Wang
On Sun, Jan 28, 2018 at 9:35 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 01/25/2018 06:26 PM, Cong Wang wrote:
>> This pathcset restores the pfifo_fast qdisc behavior of dropping
>> packets based on latest dev->tx_queue_len. Patch 1 introduces
>> a helper, patch 2 introduces a new Qdisc ops which is called when
>> we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.
>>
>> Please see each patch for details.
>>
>
> Overall this series is better than what we have at the moment, but
> a better fix would preallocate the memory, to avoid ENOMEM errors,

I am not against for better ENOMEM error handling, but I still have to
remind you that attach_one_default_qdisc() doesn't handle it either.
Since no one complained about it, why this one is so special?


> and add a ptr_ring API to use the preallocated buffers.


What ptr_ring API could cure netdev tx queues problem here?
Looks like you still don't understand the problem here.


>
> We have time (its only in net-next) so lets do the complete fix
> rather than this series IMO.
>

Why not just accept this and complete the error handling later
given the fact that I already add a TODO? IOW, why it is now?


Re: [PATCH net] net_sched: gen_estimator: fix lockdep splat

2018-01-28 Thread Cong Wang
On Sat, Jan 27, 2018 at 10:58 AM, Eric Dumazet <eric.duma...@gmail.com> wrote:
> From: Eric Dumazet <eduma...@google.com>
>
> syzbot reported a lockdep splat in gen_new_estimator() /
> est_fetch_counters() when attempting to lock est->stats_lock.
>
> Since est_fetch_counters() is called from BH context from timer
> interrupt, we need to block BH as well when calling it from process
> context.
>
> Most qdiscs use per cpu counters and are immune to the problem,
> but net/sched/act_api.c and net/netfilter/xt_RATEEST.c are using
> a spinlock to protect their data. They both call gen_new_estimator()
> while object is created and not yet alive, so this bug could
> not trigger a deadlock, only a lockdep splat.
>
> Fixes: 1c0d32fde5bd ("net_sched: gen_estimator: complete rewrite of rate 
> estimators")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Reported-by: syzbot <syzkal...@googlegroups.com>

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>

Nit: perhaps it is better to move the spin_lock out of est_fetch_counters()
and let callers decide to use spin_lock or spin_lock_bh.


Re: [PATCH net-next,v2 2/2] net: sched: add em_ipt ematch for calling xtables matches

2018-01-28 Thread Cong Wang
On Fri, Jan 26, 2018 at 11:57 AM, Eyal Birger  wrote:
> On Fri, Jan 26, 2018 at 8:50 PM, Pablo Neira Ayuso  
> wrote:
>> Isn't there a way to reject the use of this from ->change()? ie. from
>> control plane configuration.
>
> I wasn't able to find a simple way of doing so:
>
> - AFAIU tc filters are detached from the qdiscs they operate on via
> tcf_block instances
>   that may be shared by different qdiscs. I was not able to be sure that 
> filters
>   attached to ingress qdiscs via tcf_blocks at configuration time
> cannot be later be shared
>   with non ingress qdiscs. Nor was I able to find another classifier
> making the ingress/egress
>   distinction at configuration time.
>
> - ematches are not provided with 'ingress/egress' information at
> 'change()' invocation, though
>   of course the infrastructure could be extended to provide this,
> given the distinction is available.
>

In the past you can check tp->q, but now we support shared tc filter
block, so it is hard. I think your v1 is okay, which just silently passes
the match on egress side. Or maybe we can just add a pr_info()
unconditionally in em_ipt_change() saying only ingress is supported.


Re: [Patch net-next v3 3/3] net_sched: implement ->change_tx_queue_len() for pfifo_fast

2018-01-28 Thread Cong Wang
On Fri, Jan 26, 2018 at 6:10 AM, Michael S. Tsirkin  wrote:
>
> This part?

Yes, dev_deactivate() as you quote.

>
> +   bool up = dev->flags & IFF_UP;
> +   unsigned int i;
> +   int ret = 0;
> +
> +   if (up)
> +   dev_deactivate(dev);
> +
> +   for (i = 0; i < dev->num_tx_queues; i++) {
> +   ret = qdisc_change_tx_queue_len(dev, >_tx[i]);
> +
> +   /* TODO: revert changes on a partial failure */
> +   if (ret)
> +   break;
> +   }
> +
> +   if (up)
> +   dev_activate(dev);
>
>
> I wonder whether it really is safe to read dev->flags like that
> without any locks.

I really to hate to point it out again we have RTNL here. You
missed my previous response to John. ;)

Please read v1 and v2 when you response to v3.


Re: [Patch net-next v3 2/3] net_sched: plug in qdisc ops change_tx_queue_len

2018-01-28 Thread Cong Wang
On Fri, Jan 26, 2018 at 6:16 AM, Michael S. Tsirkin  wrote:
>
> This drops all packets in the queue. I don't think tweaking the queue
> length did this previously - did it?

No, because previously only enqueue reads the value.

> If not this change might surprise some people.

It is hard to say which behavior is better, it depends on what you
expect:

1) If you want to tx_queue_len change immediately takes affects,
it should drop those in queue too, maybe not all, but at least when
we shrinking the queue length, we should drop those beyond limit.

2) If you want to tx_queue_len change takes affects after all pending
packets are pushed out. This is literally the old behavior, so at some
moments, the number of packets in queue could be larger than the
new tx_queue_len.

I don't see which one is obviously better than the other one,
therefore it is hard to say which one people really expect.


Re: 答复: [PATCH] net: clean the sk_frag.page of new cloned socket

2018-01-25 Thread Cong Wang
On Thu, Jan 25, 2018 at 7:14 PM, Eric Dumazet  wrote:
> On Fri, 2018-01-26 at 02:09 +, Li,Rongqing wrote:
>
>>
>> crash> bt 8683
>> PID: 8683   TASK: 881faa088000  CPU: 10  COMMAND: "mynode"
>>  #0 [881fff145e78] crash_nmi_callback at 81031712
>>  #1 [881fff145e88] nmi_handle at 816cafe9
>>  #2 [881fff145ec8] do_nmi at 816cb0f0
>>  #3 [881fff145ef0] end_repeat_nmi at 816ca4a1
>> [exception RIP: _raw_spin_lock_irqsave+62]
>> RIP: 816c9a9e  RSP: 881fa992b990  RFLAGS: 0002
>> RAX: 4358  RBX: 88207ffd7e80  RCX: 4358
>> RDX: 4356  RSI: 0246  RDI: 88207ffd7ee8
>> RBP: 881fa992b990   R8:    R9: 019a16e6
>> R10: 4d24  R11: 4000  R12: 0242
>> R13: 4d24  R14: 0001  R15: 
>> ORIG_RAX:   CS: 0010  SS: 0018
>> ---  ---
>>  #4 [881fa992b990] _raw_spin_lock_irqsave at 816c9a9e
>>  #5 [881fa992b998] get_page_from_freelist at 8113ce5f
>>  #6 [881fa992ba70] __alloc_pages_nodemask at 8113d15f
>>  #7 [881fa992bba0] alloc_pages_current at 8117ab29
>>  #8 [881fa992bbe8] sk_page_frag_refill at 815dd310
>>  #9 [881fa992bc18] tcp_sendmsg at 8163e4f3
>> #10 [881fa992bcd8] inet_sendmsg at 81668434
>> #11 [881fa992bd08] sock_sendmsg at 815d9719
>> #12 [881fa992be58] SYSC_sendto at 815d9c81
>> #13 [881fa992bf70] sys_sendto at 815da6ae
>> #14 [881fa992bf80] system_call_fastpath at 816d2189
>>
>
> Note that tcp_sendmsg() does not use sk->sk_frag, but the per task
> page.
>
> Unless something changes sk->sk_allocation, which a user application
> can not do.
>

Some kernel TCP socket uses atomic allocation, e.g.,
o2net_open_listening_sock().


Re: [Patch net-next v3 3/3] net_sched: implement ->change_tx_queue_len() for pfifo_fast

2018-01-25 Thread Cong Wang
On Thu, Jan 25, 2018 at 7:57 PM, Jason Wang <jasow...@redhat.com> wrote:
>
>
> On 2018年01月26日 10:26, Cong Wang wrote:
>>
>> pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
>> so we have to resize skb array when we change tx_queue_len.
>>
>> Other qdiscs which read tx_queue_len are fine because they
>> all save it to sch->limit or somewhere else in qdisc during init.
>> They don't have to implement this, it is nicer if they do so
>> that users don't have to re-configure qdisc after changing
>> tx_queue_len.
>>
>> Cc: John Fastabend <john.fastab...@gmail.com>
>> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>> ---
>>   net/sched/sch_generic.c | 18 ++
>>   1 file changed, 18 insertions(+)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 08f9fa27e06e..190570f21b20 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -763,6 +763,23 @@ static void pfifo_fast_destroy(struct Qdisc *sch)
>> }
>>   }
>>   +static int pfifo_fast_change_tx_queue_len(struct Qdisc *sch,
>> + unsigned int new_len)
>> +{
>> +   struct pfifo_fast_priv *priv = qdisc_priv(sch);
>> +   struct skb_array *bands[PFIFO_FAST_BANDS];
>> +   int prio;
>> +
>> +   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
>> +   struct skb_array *q = band2list(priv, prio);
>> +
>> +   bands[prio] = q;
>> +   }
>> +
>> +   return skb_array_resize_multiple(bands, PFIFO_FAST_BANDS, new_len,
>> +GFP_KERNEL);
>> +}
>> +
>>   struct Qdisc_ops pfifo_fast_ops __read_mostly = {
>> .id =   "pfifo_fast",
>> .priv_size  =   sizeof(struct pfifo_fast_priv),
>> @@ -773,6 +790,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
>> .destroy=   pfifo_fast_destroy,
>> .reset  =   pfifo_fast_reset,
>> .dump   =   pfifo_fast_dump,
>> +   .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
>> .owner  =   THIS_MODULE,
>> .static_flags   =   TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
>>   };
>
>
> Is __skb_array_empty() in pfifo_fast_dequeue() still safe after this change?

Yes, we sync with dequeue path before calling ->change_tx_queue_len().
I already mentioned this in patch 2/3.


[Patch net-next v3 2/3] net_sched: plug in qdisc ops change_tx_queue_len

2018-01-25 Thread Cong Wang
Introduce a new qdisc ops ->change_tx_queue_len() so that
each qdisc could decide how to implement this if it wants.
Previously we simply read dev->tx_queue_len, after pfifo_fast
switches to skb array, we need this API to resize the skb array
when we change dev->tx_queue_len.

To avoid handling race conditions with TX BH, we need to
deactivate all TX queues before change the value and bring them
back after we are done, this also makes implementation easier.

Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/net/sch_generic.h |  2 ++
 net/core/dev.c|  1 +
 net/sched/sch_generic.c   | 33 +
 3 files changed, 36 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index eac43e8ca96d..e2ab13687fb9 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -200,6 +200,7 @@ struct Qdisc_ops {
  struct nlattr *arg,
  struct netlink_ext_ack *extack);
void(*attach)(struct Qdisc *sch);
+   int (*change_tx_queue_len)(struct Qdisc *, unsigned 
int);
 
int (*dump)(struct Qdisc *, struct sk_buff *);
int (*dump_stats)(struct Qdisc *, struct gnet_dump 
*);
@@ -489,6 +490,7 @@ void qdisc_class_hash_remove(struct Qdisc_class_hash *,
 void qdisc_class_hash_grow(struct Qdisc *, struct Qdisc_class_hash *);
 void qdisc_class_hash_destroy(struct Qdisc_class_hash *);
 
+int dev_qdisc_change_tx_queue_len(struct net_device *dev);
 void dev_init_scheduler(struct net_device *dev);
 void dev_shutdown(struct net_device *dev);
 void dev_activate(struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index e0b0c2784070..c8443cfaa17a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7070,6 +7070,7 @@ int dev_change_tx_queue_len(struct net_device *dev, 
unsigned long new_len)
dev->tx_queue_len = orig_len;
return res;
}
+   return dev_qdisc_change_tx_queue_len(dev);
}
 
return 0;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 1816bde47256..08f9fa27e06e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1178,6 +1178,39 @@ void dev_deactivate(struct net_device *dev)
 }
 EXPORT_SYMBOL(dev_deactivate);
 
+static int qdisc_change_tx_queue_len(struct net_device *dev,
+struct netdev_queue *dev_queue)
+{
+   struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
+   const struct Qdisc_ops *ops = qdisc->ops;
+
+   if (ops->change_tx_queue_len)
+   return ops->change_tx_queue_len(qdisc, dev->tx_queue_len);
+   return 0;
+}
+
+int dev_qdisc_change_tx_queue_len(struct net_device *dev)
+{
+   bool up = dev->flags & IFF_UP;
+   unsigned int i;
+   int ret = 0;
+
+   if (up)
+   dev_deactivate(dev);
+
+   for (i = 0; i < dev->num_tx_queues; i++) {
+   ret = qdisc_change_tx_queue_len(dev, >_tx[i]);
+
+   /* TODO: revert changes on a partial failure */
+   if (ret)
+   break;
+   }
+
+   if (up)
+   dev_activate(dev);
+   return ret;
+}
+
 static void dev_init_scheduler_queue(struct net_device *dev,
 struct netdev_queue *dev_queue,
 void *_qdisc)
-- 
2.13.0



[Patch net-next v3 1/3] net: introduce helper dev_change_tx_queue_len()

2018-01-25 Thread Cong Wang
This patch promotes the local change_tx_queue_len() to a core
helper function, dev_change_tx_queue_len(), so that rtnetlink
and net-sysfs could share the code. This also prepares for the
following patch.

Note, the -EFAULT in the original code doesn't make sense,
we should propagate the errno from notifiers.

Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c| 28 
 net/core/net-sysfs.c  | 25 +
 net/core/rtnetlink.c  | 18 +-
 4 files changed, 35 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 24a62d590350..0804e1d38c78 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3330,6 +3330,7 @@ int dev_get_alias(const struct net_device *, char *, 
size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int __dev_set_mtu(struct net_device *, int);
 int dev_set_mtu(struct net_device *, int);
+int dev_change_tx_queue_len(struct net_device *, unsigned long);
 void dev_set_group(struct net_device *, int);
 int dev_set_mac_address(struct net_device *, struct sockaddr *);
 int dev_change_carrier(struct net_device *, bool new_carrier);
diff --git a/net/core/dev.c b/net/core/dev.c
index 4670ccabe23a..e0b0c2784070 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7048,6 +7048,34 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
 EXPORT_SYMBOL(dev_set_mtu);
 
 /**
+ * dev_change_tx_queue_len - Change TX queue length of a netdevice
+ * @dev: device
+ * @new_len: new tx queue length
+ */
+int dev_change_tx_queue_len(struct net_device *dev, unsigned long new_len)
+{
+   unsigned int orig_len = dev->tx_queue_len;
+   int res;
+
+   if (new_len != (unsigned int)new_len)
+   return -ERANGE;
+
+   if (new_len != orig_len) {
+   dev->tx_queue_len = new_len;
+   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
+   res = notifier_to_errno(res);
+   if (res) {
+   netdev_err(dev,
+  "refused to change device tx_queue_len\n");
+   dev->tx_queue_len = orig_len;
+   return res;
+   }
+   }
+
+   return 0;
+}
+
+/**
  * dev_set_group - Change group this device belongs to
  * @dev: device
  * @new_group: group this device should belong to
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c4a28f4667b6..60a5ad2c33ee 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -346,29 +346,6 @@ static ssize_t flags_store(struct device *dev, struct 
device_attribute *attr,
 }
 NETDEVICE_SHOW_RW(flags, fmt_hex);
 
-static int change_tx_queue_len(struct net_device *dev, unsigned long new_len)
-{
-   unsigned int orig_len = dev->tx_queue_len;
-   int res;
-
-   if (new_len != (unsigned int)new_len)
-   return -ERANGE;
-
-   if (new_len != orig_len) {
-   dev->tx_queue_len = new_len;
-   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
-   res = notifier_to_errno(res);
-   if (res) {
-   netdev_err(dev,
-  "refused to change device tx_queue_len\n");
-   dev->tx_queue_len = orig_len;
-   return -EFAULT;
-   }
-   }
-
-   return 0;
-}
-
 static ssize_t tx_queue_len_store(struct device *dev,
  struct device_attribute *attr,
  const char *buf, size_t len)
@@ -376,7 +353,7 @@ static ssize_t tx_queue_len_store(struct device *dev,
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 
-   return netdev_store(dev, attr, buf, len, change_tx_queue_len);
+   return netdev_store(dev, attr, buf, len, dev_change_tx_queue_len);
 }
 NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 97874daa1336..6fa6b9c60694 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2291,19 +2291,11 @@ static int do_setlink(const struct sk_buff *skb,
 
if (tb[IFLA_TXQLEN]) {
unsigned int value = nla_get_u32(tb[IFLA_TXQLEN]);
-   unsigned int orig_len = dev->tx_queue_len;
-
-   if (dev->tx_queue_len ^ value) {
-   dev->tx_queue_len = value;
-   err = call_netdevice_notifiers(
- NETDEV_CHANGE_TX_QUEUE_LEN, dev);
-   err = notifier_to_errno(err);
-   if (err) {
-   dev->tx_queue_len = orig_len;
-   goto er

[Patch net-next v3 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-25 Thread Cong Wang
This pathcset restores the pfifo_fast qdisc behavior of dropping
packets based on latest dev->tx_queue_len. Patch 1 introduces
a helper, patch 2 introduces a new Qdisc ops which is called when
we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.

Please see each patch for details.

---
v3: use skb_array_resize_multiple()
v2: handle error case for ->change_tx_queue_len()

Cong Wang (3):
  net: introduce helper dev_change_tx_queue_len()
  net_sched: plug in qdisc ops change_tx_queue_len
  net_sched: implement ->change_tx_queue_len() for pfifo_fast

 include/linux/netdevice.h |  1 +
 include/net/sch_generic.h |  2 ++
 net/core/dev.c| 29 +++
 net/core/net-sysfs.c  | 25 +--
 net/core/rtnetlink.c  | 18 +
 net/sched/sch_generic.c   | 51 +++
 6 files changed, 89 insertions(+), 37 deletions(-)

-- 
2.13.0



[Patch net-next v3 3/3] net_sched: implement ->change_tx_queue_len() for pfifo_fast

2018-01-25 Thread Cong Wang
pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
so we have to resize skb array when we change tx_queue_len.

Other qdiscs which read tx_queue_len are fine because they
all save it to sch->limit or somewhere else in qdisc during init.
They don't have to implement this, it is nicer if they do so
that users don't have to re-configure qdisc after changing
tx_queue_len.

Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/sched/sch_generic.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 08f9fa27e06e..190570f21b20 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -763,6 +763,23 @@ static void pfifo_fast_destroy(struct Qdisc *sch)
}
 }
 
+static int pfifo_fast_change_tx_queue_len(struct Qdisc *sch,
+ unsigned int new_len)
+{
+   struct pfifo_fast_priv *priv = qdisc_priv(sch);
+   struct skb_array *bands[PFIFO_FAST_BANDS];
+   int prio;
+
+   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+   struct skb_array *q = band2list(priv, prio);
+
+   bands[prio] = q;
+   }
+
+   return skb_array_resize_multiple(bands, PFIFO_FAST_BANDS, new_len,
+GFP_KERNEL);
+}
+
 struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.id =   "pfifo_fast",
.priv_size  =   sizeof(struct pfifo_fast_priv),
@@ -773,6 +790,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.destroy=   pfifo_fast_destroy,
.reset  =   pfifo_fast_reset,
.dump   =   pfifo_fast_dump,
+   .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
.owner  =   THIS_MODULE,
.static_flags   =   TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
 };
-- 
2.13.0



Re: [Patch net-next v2 2/3] net_sched: plug in qdisc ops change_tx_queue_len

2018-01-25 Thread Cong Wang
On Wed, Jan 24, 2018 at 4:05 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 01/23/2018 10:18 AM, Cong Wang wrote:
>> +int dev_qdisc_change_tx_queue_len(struct net_device *dev)
>> +{
>> + bool up = dev->flags & IFF_UP;
>> + unsigned int i;
>> + int ret = 0;
>> +
>> + if (up)
>> + dev_deactivate(dev);
>> +
>> + for (i = 0; i < dev->num_tx_queues; i++) {
>> + ret = qdisc_change_tx_queue_len(dev, >_tx[i]);
>> +
>> + /* TODO: revert changes on a partial failure */
>> + if (ret)
>> + break;
>
> After another look it seems we can solve this without too much pain
> by using skb_array_resize_multiple() in patch 3/3. Then pass the
> error pack here via qdisc_change_tx_queue_len and reset queue length
> to orig_length.

Yeah, I was not aware of that API, looks reasonable. But, since you
are referring to dev_qdisc_change_tx_queue_len(), it doesn't help 2/3
because we still have to iterate each tx queue, so the above comment
still stands.

I will update patch 3/3.

Thanks.


[Patch net-next v2 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-23 Thread Cong Wang
This pathcset restores the pfifo_fast qdisc behavior of dropping
packets based on latest dev->tx_queue_len. Patch 1 introduces
a helper, patch 2 introduces a new Qdisc ops which is called when
we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.

Please see each patch for details.

---
v2: handle error case for ->change_tx_queue_len()

Cong Wang (3):
  net: introduce helper dev_change_tx_queue_len()
  net_sched: plug in qdisc ops change_tx_queue_len
  net_sched: implement ->change_tx_queue_len() for pfifo_fast

 include/linux/netdevice.h |  1 +
 include/net/sch_generic.h |  2 ++
 net/core/dev.c| 29 +++
 net/core/net-sysfs.c  | 25 +---
 net/core/rtnetlink.c  | 18 +
 net/sched/sch_generic.c   | 50 +++
 6 files changed, 88 insertions(+), 37 deletions(-)

-- 
2.13.0



[Patch net-next v2 1/3] net: introduce helper dev_change_tx_queue_len()

2018-01-23 Thread Cong Wang
This patch promotes the local change_tx_queue_len() to a core
helper function, dev_change_tx_queue_len(), so that rtnetlink
and net-sysfs could share the code. This also prepares for the
following patch.

Note, the -EFAULT in the original code doesn't make sense,
we should propagate the errno from notifiers.

Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c| 28 
 net/core/net-sysfs.c  | 25 +
 net/core/rtnetlink.c  | 18 +-
 4 files changed, 35 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 581495f4e487..97fab46bdea2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3327,6 +3327,7 @@ int dev_get_alias(const struct net_device *, char *, 
size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int __dev_set_mtu(struct net_device *, int);
 int dev_set_mtu(struct net_device *, int);
+int dev_change_tx_queue_len(struct net_device *, unsigned long);
 void dev_set_group(struct net_device *, int);
 int dev_set_mac_address(struct net_device *, struct sockaddr *);
 int dev_change_carrier(struct net_device *, bool new_carrier);
diff --git a/net/core/dev.c b/net/core/dev.c
index 77795f66c246..913655e82859 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7037,6 +7037,34 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
 EXPORT_SYMBOL(dev_set_mtu);
 
 /**
+ * dev_change_tx_queue_len - Change TX queue length of a netdevice
+ * @dev: device
+ * @new_len: new tx queue length
+ */
+int dev_change_tx_queue_len(struct net_device *dev, unsigned long new_len)
+{
+   unsigned int orig_len = dev->tx_queue_len;
+   int res;
+
+   if (new_len != (unsigned int)new_len)
+   return -ERANGE;
+
+   if (new_len != orig_len) {
+   dev->tx_queue_len = new_len;
+   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
+   res = notifier_to_errno(res);
+   if (res) {
+   netdev_err(dev,
+  "refused to change device tx_queue_len\n");
+   dev->tx_queue_len = orig_len;
+   return res;
+   }
+   }
+
+   return 0;
+}
+
+/**
  * dev_set_group - Change group this device belongs to
  * @dev: device
  * @new_group: group this device should belong to
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index c4a28f4667b6..60a5ad2c33ee 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -346,29 +346,6 @@ static ssize_t flags_store(struct device *dev, struct 
device_attribute *attr,
 }
 NETDEVICE_SHOW_RW(flags, fmt_hex);
 
-static int change_tx_queue_len(struct net_device *dev, unsigned long new_len)
-{
-   unsigned int orig_len = dev->tx_queue_len;
-   int res;
-
-   if (new_len != (unsigned int)new_len)
-   return -ERANGE;
-
-   if (new_len != orig_len) {
-   dev->tx_queue_len = new_len;
-   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
-   res = notifier_to_errno(res);
-   if (res) {
-   netdev_err(dev,
-  "refused to change device tx_queue_len\n");
-   dev->tx_queue_len = orig_len;
-   return -EFAULT;
-   }
-   }
-
-   return 0;
-}
-
 static ssize_t tx_queue_len_store(struct device *dev,
  struct device_attribute *attr,
  const char *buf, size_t len)
@@ -376,7 +353,7 @@ static ssize_t tx_queue_len_store(struct device *dev,
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 
-   return netdev_store(dev, attr, buf, len, change_tx_queue_len);
+   return netdev_store(dev, attr, buf, len, dev_change_tx_queue_len);
 }
 NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 97874daa1336..6fa6b9c60694 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2291,19 +2291,11 @@ static int do_setlink(const struct sk_buff *skb,
 
if (tb[IFLA_TXQLEN]) {
unsigned int value = nla_get_u32(tb[IFLA_TXQLEN]);
-   unsigned int orig_len = dev->tx_queue_len;
-
-   if (dev->tx_queue_len ^ value) {
-   dev->tx_queue_len = value;
-   err = call_netdevice_notifiers(
- NETDEV_CHANGE_TX_QUEUE_LEN, dev);
-   err = notifier_to_errno(err);
-   if (err) {
-   dev->tx_queue_len = orig_len;
-   goto er

[Patch net-next v2 2/3] net_sched: plug in qdisc ops change_tx_queue_len

2018-01-23 Thread Cong Wang
Introduce a new qdisc ops ->change_tx_queue_len() so that
each qdisc could decide how to implement this if it wants.
Previously we simply read dev->tx_queue_len, after pfifo_fast
switches to skb array, we need this API to resize the skb array
when we change dev->tx_queue_len.

To avoid handling race conditions with TX BH, we need to
deactivate all TX queues before change the value and bring them
back after we are done, this also makes implementation easier.

Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/net/sch_generic.h |  2 ++
 net/core/dev.c|  1 +
 net/sched/sch_generic.c   | 33 +
 3 files changed, 36 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index cd1be1f25c36..d13dd129d085 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -200,6 +200,7 @@ struct Qdisc_ops {
  struct nlattr *arg,
  struct netlink_ext_ack *extack);
void(*attach)(struct Qdisc *sch);
+   int (*change_tx_queue_len)(struct Qdisc *, unsigned 
int);
 
int (*dump)(struct Qdisc *, struct sk_buff *);
int (*dump_stats)(struct Qdisc *, struct gnet_dump 
*);
@@ -488,6 +489,7 @@ void qdisc_class_hash_remove(struct Qdisc_class_hash *,
 void qdisc_class_hash_grow(struct Qdisc *, struct Qdisc_class_hash *);
 void qdisc_class_hash_destroy(struct Qdisc_class_hash *);
 
+int dev_qdisc_change_tx_queue_len(struct net_device *dev);
 void dev_init_scheduler(struct net_device *dev);
 void dev_shutdown(struct net_device *dev);
 void dev_activate(struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index 913655e82859..a9d7d883416d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7059,6 +7059,7 @@ int dev_change_tx_queue_len(struct net_device *dev, 
unsigned long new_len)
dev->tx_queue_len = orig_len;
return res;
}
+   return dev_qdisc_change_tx_queue_len(dev);
}
 
return 0;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 1816bde47256..08f9fa27e06e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1178,6 +1178,39 @@ void dev_deactivate(struct net_device *dev)
 }
 EXPORT_SYMBOL(dev_deactivate);
 
+static int qdisc_change_tx_queue_len(struct net_device *dev,
+struct netdev_queue *dev_queue)
+{
+   struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
+   const struct Qdisc_ops *ops = qdisc->ops;
+
+   if (ops->change_tx_queue_len)
+   return ops->change_tx_queue_len(qdisc, dev->tx_queue_len);
+   return 0;
+}
+
+int dev_qdisc_change_tx_queue_len(struct net_device *dev)
+{
+   bool up = dev->flags & IFF_UP;
+   unsigned int i;
+   int ret = 0;
+
+   if (up)
+   dev_deactivate(dev);
+
+   for (i = 0; i < dev->num_tx_queues; i++) {
+   ret = qdisc_change_tx_queue_len(dev, >_tx[i]);
+
+   /* TODO: revert changes on a partial failure */
+   if (ret)
+   break;
+   }
+
+   if (up)
+   dev_activate(dev);
+   return ret;
+}
+
 static void dev_init_scheduler_queue(struct net_device *dev,
 struct netdev_queue *dev_queue,
 void *_qdisc)
-- 
2.13.0



[Patch net-next v2 3/3] net_sched: implement ->change_tx_queue_len() for pfifo_fast

2018-01-23 Thread Cong Wang
pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
so we have to resize skb array when we change tx_queue_len.

Other qdiscs which read tx_queue_len are fine because they
all save it to sch->limit or somewhere else in qdisc during init.
They don't have to implement this, it is nicer if they do so
that users don't have to re-configure qdisc after changing
tx_queue_len.

Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/sched/sch_generic.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 08f9fa27e06e..d7b82c379617 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -763,6 +763,22 @@ static void pfifo_fast_destroy(struct Qdisc *sch)
}
 }
 
+static int pfifo_fast_change_tx_queue_len(struct Qdisc *sch, unsigned int 
new_len)
+{
+   struct pfifo_fast_priv *priv = qdisc_priv(sch);
+   int ret = 0, prio;
+
+   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+   struct skb_array *q = band2list(priv, prio);
+
+   ret = skb_array_resize(q, new_len, GFP_KERNEL);
+   if (ret)
+   break;
+   }
+
+   return ret;
+}
+
 struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.id =   "pfifo_fast",
.priv_size  =   sizeof(struct pfifo_fast_priv),
@@ -773,6 +789,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.destroy=   pfifo_fast_destroy,
.reset  =   pfifo_fast_reset,
.dump   =   pfifo_fast_dump,
+   .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
.owner  =   THIS_MODULE,
.static_flags   =   TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
 };
-- 
2.13.0



Re: [Patch net-next 2/3] net_sched: plug in qdisc ops change_tx_queue_len

2018-01-23 Thread Cong Wang
On Mon, Jan 22, 2018 at 2:44 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>
> Good point, but netdev_for_each_tx_queue() doesn't handle errors
> in current code base, it is not trivial to make it returning int to reflect
> errors, since we have to rollback a partial failure too.

I think for now we can just check for errors but not rollback,
at least attach_default_qdiscs() does not rollback either.
And I will add a comment saying we need to do it in future.


Re: Network interface "stops working"

2018-01-22 Thread Cong Wang
(Please always Cc netdev for networking related bugs.)

On Mon, Jan 22, 2018 at 2:02 AM, Turbo Fredriksson  wrote:
> I just got a new broadband delivered at home. It is "Hyperoptic 1Gbps fiber" 
> which comes as a ethernet connector at home. I wasn’t around
> when they connected up everything, so I’m not sure *where* the fiber starts, 
> but either way, I have an ethernet jack in one of my rooms.
>
> They also provided me with a ZTE router. I have need for my own services 
> (firewalling, NATing, IPSEC and what not), so don’t want to use
> the provided router..
>
> However, I’m having serious trouble keeping the interface up! Works for a few 
> minutes and then just “stops working”. Don’t know why, there’s
> nothing in the logs or from dmesg..
>
> Taking the interface down and then up again usually solves it. For a few 
> minutes.
>
> Also, when running the interface (a Intel 82576 Gigabit dual port, using the 
> igb driver - tried e1000 and e1000e but they don’t find any interfaces),
> in 1Gbps mode, the interface starts flapping up and down and I can’t get a 
> connection at all. So my interface definition runs a script to use
> ethtool to set the speed to 100Mbps, full duplex, no auto negotiation. Which 
> “kinda” works (for a while, hence my problems).
>
> Because the provided router works just fine, I’m sure it’s something on my 
> Linux box (Debian GNU/Linux Jessie) that does it.. I’ve tried running
> it without any iptables, in both static and DHCP mode but same problem..
>
> I’m not a complete beginner with Linux nor networks, but this is to “close to 
> the hardware” for me. I’m at a loss to what else to try..
>
> This isn’t a new machine, it have served me very well for five years give or 
> take and I’ve never had any problems with it (not to say that it still
> can’t be hardware problems, but I find that somewhat unlikely at the moment).
>
>
> Could anyone please advice to what I can try to try to pinpoint the problem 
> (and/or possibly fix it)?


Re: [Patch net-next 2/3] net_sched: plug in qdisc ops change_tx_queue_len

2018-01-22 Thread Cong Wang
On Sun, Jan 21, 2018 at 2:12 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 01/19/2018 03:09 PM, Cong Wang wrote:
>
> hmm what happens if the resize fails in the next patch,
>
>>
>> +static void pfifo_fast_change_tx_queue_len(struct Qdisc *sch, unsigned int 
>> new_len)
>> +{
>> + struct pfifo_fast_priv *priv = qdisc_priv(sch);
>> + int prio;
>> +
>> + for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
>> + struct skb_array *q = band2list(priv, prio);
>> +
>> + skb_array_resize(q, new_len, GFP_KERNEL);
>> + }
>> +}
>> +
>
> Here skb_array_resize() can fail with ENOMEM, do we need to unwind the
> change and push the error up the stack?

Good point, but netdev_for_each_tx_queue() doesn't handle errors
in current code base, it is not trivial to make it returning int to reflect
errors, since we have to rollback a partial failure too.


Re: [PATCH v2 net 2/2] net: sched: fix TCF_LAYER_LINK case in tcf_get_base_ptr

2018-01-22 Thread Cong Wang
On Mon, Jan 22, 2018 at 2:53 AM, Wolfgang Bumiller
 wrote:
> TCF_LAYER_LINK and TCF_LAYER_NETWORK returned the same pointer as
> skb->data points to the network header.
> Use skb_mac_header instead.

skb->data points to network header only on ingress side, IIRC.


Re: [PATCH v2 net 0/2] nbyte, cmp and text filter fixups

2018-01-22 Thread Cong Wang
On Mon, Jan 22, 2018 at 2:53 AM, Wolfgang Bumiller
<w.bumil...@proxmox.com> wrote:
> Changes:
>   * Fixed up commit message
>   * Removed not really related iproute2 patch from this thread.
>
> This fixes an oob read in em_nbyte and allows 'layer 0' in cmp and
> nbyte and em_text to actually match layer 0 rather than being the same
> as specifying layer 1.
>


Acked-by: Cong Wang <xiyou.wangc...@gmail.com>


[Patch net-next] tun: avoid calling xdp_rxq_info_unreg() twice

2018-01-22 Thread Cong Wang
Similarly to tx ring, xdp_rxq_info is only registered
when !tfile->detached, so we need to avoid calling
xdp_rxq_info_unreg() twice too. The helper tun_cleanup_tx_ring()
already checks for this properly, so it is correct to put
xdp_rxq_info_unreg() just inside there.

Reported-by: syzbot+1c788d7ce0f0888f1...@syzkaller.appspotmail.com
Fixes: 8565d26bcb2f ("Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
Cc: Jason Wang <jasow...@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 drivers/net/tun.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2bc18b16a45b..a0c5cb1a1617 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -774,14 +774,12 @@ static void tun_detach_all(struct net_device *dev)
tun_napi_del(tun, tfile);
/* Drop read queue */
tun_queue_purge(tfile);
-   xdp_rxq_info_unreg(>xdp_rxq);
sock_put(>sk);
tun_cleanup_tx_ring(tfile);
}
list_for_each_entry_safe(tfile, tmp, >disabled, next) {
tun_enable_queue(tfile);
tun_queue_purge(tfile);
-   xdp_rxq_info_unreg(>xdp_rxq);
sock_put(>sk);
tun_cleanup_tx_ring(tfile);
}
-- 
2.13.0



Re: [PATCH net-next v3 0/2] net/sched: remove spinlock from 'csum' action

2018-01-22 Thread Cong Wang
On Mon, Jan 22, 2018 at 9:14 AM, Davide Caratti <dcara...@redhat.com> wrote:
> Similarly to what has been done earlier with other actions [1][2], this
> series tries to improve the performance of 'csum' tc action, removing a
> spinlock in the data path. Patch 1 lets act_csum use per-CPU counters;
> patch 2 removes spin_{,un}lock_bh() calls from the act() method.
>
> test procedure (using pktgen from https://github.com/netoptimizer):
>
>  # ip link add name eth1 type dummy
>  # ip link set dev eth1 up
>  # tc qdisc add dev eth1 root handle 1: prio
>  # for a in pass drop; do
>  > tc filter del dev eth1 parent 1: pref 10 matchall action csum udp
>  > tc filter add dev eth1 parent 1: pref 10 matchall action csum udp $a
>  > for n in 2 4; do
>  > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $n -n 100 -i eth1
>  > done
>  > done
>
> test results:
>
>   ||  before patch   |   after patch
>   $a  | $n | avg. pps/thread | avg. pps/thread
>  -++-+
>  pass |  2 |1671463 ± 4% |1920789 ± 3%
>  pass |  4 | 648797 ± 1% | 738190 ± 1%
>  drop |  2 |3212692 ± 2% |    3719811 ± 2%
>  drop |  4 |1078824 ± 1% |1328099 ± 1%

Looks good to me,

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>


Re: WARNING in xdp_rxq_info_unreg

2018-01-22 Thread Cong Wang
On Mon, Jan 22, 2018 at 11:58 AM, syzbot
 wrote:
> Hello,
>
> syzbot hit the following crash on net-next commit
> ebdd7b491b8a65d65936e07004caabca4a3c94a0 (Sun Jan 21 23:21:31 2018 +)
> Merge branch 'mlxsw-Add-support-for-mirror-action-with-flower'
>
> So far this crash happened 52 times on net-next.
> C reproducer is attached.
> syzkaller reproducer is attached.
> Raw console output is attached.
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+1c788d7ce0f0888f1...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> [ cut here ]
> Driver BUG
> WARNING: CPU: 1 PID: 3661 at net/core/xdp.c:22 xdp_rxq_info_unreg+0xc1/0xf0
> net/core/xdp.c:22
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 1 PID: 3661 Comm: syzkaller881172 Not tainted 4.15.0-rc8+ #202
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  panic+0x1e4/0x41c kernel/panic.c:183
>  __warn+0x1dc/0x200 kernel/panic.c:547
>  report_bug+0x211/0x2d0 lib/bug.c:184
>  fixup_bug.part.11+0x37/0x80 arch/x86/kernel/traps.c:178
>  fixup_bug arch/x86/kernel/traps.c:247 [inline]
>  do_error_trap+0x2d7/0x3e0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315
>  invalid_op+0x22/0x40 arch/x86/entry/entry_64.S:1096
> RIP: 0010:xdp_rxq_info_unreg+0xc1/0xf0 net/core/xdp.c:22
> RSP: 0018:8801bc62ee78 EFLAGS: 00010282
> RAX: dc08 RBX: 8801d950b1c0 RCX: 8159f28e
> RDX:  RSI: 1100378c5d8a RDI: 0293
> RBP: 8801bc62ee90 R08: 1100378c5d4c R09: 
> R10:  R11:  R12: 8801d950b1cc
> R13: 0002 R14: 03e8 R15: 
>  tun_cleanup_tx_ring.part.47+0x236/0x450 drivers/net/tun.c:686
>  tun_cleanup_tx_ring drivers/net/tun.c:684 [inline]
>  tun_detach_all+0x4fd/0xbf0 drivers/net/tun.c:779


This is caused by the merge conflict, I will send a fix.


>  tun_net_uninit+0x15/0x20 drivers/net/tun.c:1003
>  rollback_registered_many+0x9a8/0xe20 net/core/dev.c:7360
>  unregister_netdevice_many.part.118+0x87/0x420 net/core/dev.c:8421
>  unregister_netdevice_many+0xbb/0x100 net/core/dev.c:8420
>  rtnl_delete_link+0x10f/0x170 net/core/rtnetlink.c:2586
>  rtnl_dellink+0x2d5/0x7e0 net/core/rtnetlink.c:2625
>  rtnetlink_rcv_msg+0x57f/0xb10 net/core/rtnetlink.c:4530
>  netlink_rcv_skb+0x14b/0x380 net/netlink/af_netlink.c:2442
>  rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4548
>  netlink_unicast_kernel net/netlink/af_netlink.c:1308 [inline]
>  netlink_unicast+0x4c4/0x6b0 net/netlink/af_netlink.c:1334
>  netlink_sendmsg+0xa4a/0xe60 net/netlink/af_netlink.c:1897
>  sock_sendmsg_nosec net/socket.c:630 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:640
>  ___sys_sendmsg+0x767/0x8b0 net/socket.c:2020
>  __sys_sendmsg+0xe5/0x210 net/socket.c:2054
>  SYSC_sendmsg net/socket.c:2065 [inline]
>  SyS_sendmsg+0x2d/0x50 net/socket.c:2061
>  entry_SYSCALL_64_fastpath+0x29/0xa0
> RIP: 0033:0x445189
> RSP: 002b:7fff8d14a8c8 EFLAGS: 0207 ORIG_RAX: 002e
> RAX: ffda RBX: 004a6a32 RCX: 00445189
> RDX: 2040 RSI: 2001bfc8 RDI: 0004
> RBP: 7fff8d14a9d8 R08:  R09: 
> R10:  R11: 0207 R12: 7fff8d14a9d8
> R13: 00402650 R14:  R15: 
> Dumping ftrace buffer:
>(ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this bug, please reply with:
> #syz test: git://repo/address.git branch
> and provide the patch inline or as an attachment.
> To mark this as a duplicate of another syzbot report, please reply with:
> #syz dup: exact-subject-of-another-report
> If it's a one-off invalid bug report, please reply with:
> #syz invalid
> Note: if the crash happens again, it will cause creation of a new bug
> report.
> Note: all commands must start from beginning of the line in the email body.


Re: [Patch net-next 1/3] net: introduce helper dev_change_tx_queue_len()

2018-01-21 Thread Cong Wang
On Sat, Jan 20, 2018 at 8:52 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 01/19/2018 03:09 PM, Cong Wang wrote:
>> This patch promotes the local change_tx_queue_len() to a core
>> helper function, dev_change_tx_queue_len(), so that rtnetlink
>> and net-sysfs could share the code. This also prepares for the
>> following patch.
>>
>> Note, the -EFAULT in the original code doesn't make sense,
>> we should propagate the errno from notifiers.
>>
>> Cc: John Fastabend <john.fastab...@gmail.com>
>> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>> ---
>
>
> [...]
>
>>  static ssize_t tx_queue_len_store(struct device *dev,
>> struct device_attribute *attr,
>> const char *buf, size_t len)
>> @@ -355,7 +332,7 @@ static ssize_t tx_queue_len_store(struct device *dev,
>>   if (!capable(CAP_NET_ADMIN))
>>   return -EPERM;
>>
>> - return netdev_store(dev, attr, buf, len, change_tx_queue_len);
>> + return netdev_store(dev, attr, buf, len, dev_change_tx_queue_len);
>>  }
>
> Is this protected by RTNL lock? If not what happens if this and do_setlink
> both try to change tx queue length at the same time? Seems we could get
> a race with multiple dev_deactivate/dev_activate sequences in-flight in
> the following 2/3 patch.

Surely it is protected by RTNL...

  96 if (!rtnl_trylock())
  97 return restart_syscall();
  98
  99 if (dev_isalive(netdev)) {
 100 ret = (*set)(netdev, new);
 101 if (ret == 0)
 102 ret = len;
 103 }
 104 rtnl_unlock();


Re: [PATCH net-next v2 2/2] net/sched: act_csum: don't use spinlock in the fast path

2018-01-21 Thread Cong Wang
On Fri, Jan 19, 2018 at 6:12 AM, Davide Caratti  wrote:
>  static int tcf_csum_dump(struct sk_buff *skb, struct tc_action *a, int bind,
> @@ -575,15 +594,18 @@ static int tcf_csum_dump(struct sk_buff *skb, struct 
> tc_action *a, int bind,
>  {
> unsigned char *b = skb_tail_pointer(skb);
> struct tcf_csum *p = to_tcf_csum(a);
> +   struct tcf_csum_params *params;
> struct tc_csum opt = {
> -   .update_flags = p->update_flags,
> .index   = p->tcf_index,
> -   .action  = p->tcf_action,
> .refcnt  = p->tcf_refcnt - ref,
> .bindcnt = p->tcf_bindcnt - bind,
> };
> struct tcf_t t;
>
> +   params = rcu_dereference(p->params);

I think you need rtnl_dereference() here, as we don't have RCU read lock here?


> +   opt.action = params->action;
> +   opt.update_flags = params->update_flags;
> +


Thanks.


Re: net merged into net-next

2018-01-20 Thread Cong Wang
On Fri, Jan 19, 2018 at 8:02 PM, David Miller  wrote:
>
> Cong, please check my conflict resolution of drivers/net/tun.c, thank
> you.

It looks good to me except I am not sure about the xdp_rxq_info_unreg()
inside tun_cleanup_tx_ring().


[Patch net-next 2/3] net_sched: plug in qdisc ops change_tx_queue_len

2018-01-19 Thread Cong Wang
Introduce a new qdisc ops ->change_tx_queue_len() so that
each qdisc could decide how to implement this if it wants.
Previously we simply read dev->tx_queue_len, after pfifo_fast
switches to skb array, we need this API to resize the skb array
when we change dev->tx_queue_len.

To avoid handling race conditions with TX BH, we need to
deactivate all TX queues before change the value and bring them
back after we are done, this also makes implementation easier.

Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/net/sch_generic.h |  2 ++
 net/core/dev.c|  1 +
 net/sched/sch_generic.c   | 22 ++
 3 files changed, 25 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index cd1be1f25c36..aae1baa1c30f 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -200,6 +200,7 @@ struct Qdisc_ops {
  struct nlattr *arg,
  struct netlink_ext_ack *extack);
void(*attach)(struct Qdisc *sch);
+   void(*change_tx_queue_len)(struct Qdisc *, unsigned 
int);
 
int (*dump)(struct Qdisc *, struct sk_buff *);
int (*dump_stats)(struct Qdisc *, struct gnet_dump 
*);
@@ -488,6 +489,7 @@ void qdisc_class_hash_remove(struct Qdisc_class_hash *,
 void qdisc_class_hash_grow(struct Qdisc *, struct Qdisc_class_hash *);
 void qdisc_class_hash_destroy(struct Qdisc_class_hash *);
 
+void dev_qdisc_change_tx_queue_len(struct net_device *dev);
 void dev_init_scheduler(struct net_device *dev);
 void dev_shutdown(struct net_device *dev);
 void dev_activate(struct net_device *dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index 99d353e4cbb2..24809d858a64 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7058,6 +7058,7 @@ int dev_change_tx_queue_len(struct net_device *dev, 
unsigned long new_len)
dev->tx_queue_len = orig_len;
return res;
}
+   dev_qdisc_change_tx_queue_len(dev);
}
 
return 0;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ef8b4ecde2ac..30aaeb3c1bf1 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1178,6 +1178,28 @@ void dev_deactivate(struct net_device *dev)
 }
 EXPORT_SYMBOL(dev_deactivate);
 
+static void qdisc_change_tx_queue_len(struct net_device *dev,
+ struct netdev_queue *dev_queue,
+ void *unused)
+{
+   struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
+   const struct Qdisc_ops *ops = qdisc->ops;
+
+   if (ops->change_tx_queue_len)
+   ops->change_tx_queue_len(qdisc, dev->tx_queue_len);
+}
+
+void dev_qdisc_change_tx_queue_len(struct net_device *dev)
+{
+   bool up = dev->flags & IFF_UP;
+
+   if (up)
+   dev_deactivate(dev);
+   netdev_for_each_tx_queue(dev, qdisc_change_tx_queue_len, NULL);
+   if (up)
+   dev_activate(dev);
+}
+
 static void dev_init_scheduler_queue(struct net_device *dev,
 struct netdev_queue *dev_queue,
 void *_qdisc)
-- 
2.13.0



[Patch net-next 1/3] net: introduce helper dev_change_tx_queue_len()

2018-01-19 Thread Cong Wang
This patch promotes the local change_tx_queue_len() to a core
helper function, dev_change_tx_queue_len(), so that rtnetlink
and net-sysfs could share the code. This also prepares for the
following patch.

Note, the -EFAULT in the original code doesn't make sense,
we should propagate the errno from notifiers.

Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 include/linux/netdevice.h |  1 +
 net/core/dev.c| 28 
 net/core/net-sysfs.c  | 25 +
 net/core/rtnetlink.c  | 18 +-
 4 files changed, 35 insertions(+), 37 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ed0799a12bf2..75587a053c33 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3325,6 +3325,7 @@ int dev_get_alias(const struct net_device *, char *, 
size_t);
 int dev_change_net_namespace(struct net_device *, struct net *, const char *);
 int __dev_set_mtu(struct net_device *, int);
 int dev_set_mtu(struct net_device *, int);
+int dev_change_tx_queue_len(struct net_device *, unsigned long);
 void dev_set_group(struct net_device *, int);
 int dev_set_mac_address(struct net_device *, struct sockaddr *);
 int dev_change_carrier(struct net_device *, bool new_carrier);
diff --git a/net/core/dev.c b/net/core/dev.c
index 94435cd09072..99d353e4cbb2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7036,6 +7036,34 @@ int dev_set_mtu(struct net_device *dev, int new_mtu)
 EXPORT_SYMBOL(dev_set_mtu);
 
 /**
+ * dev_change_tx_queue_len - Change TX queue length of a netdevice
+ * @dev: device
+ * @new_len: new tx queue length
+ */
+int dev_change_tx_queue_len(struct net_device *dev, unsigned long new_len)
+{
+   unsigned int orig_len = dev->tx_queue_len;
+   int res;
+
+   if (new_len != (unsigned int)new_len)
+   return -ERANGE;
+
+   if (new_len != orig_len) {
+   dev->tx_queue_len = new_len;
+   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
+   res = notifier_to_errno(res);
+   if (res) {
+   netdev_err(dev,
+  "refused to change device tx_queue_len\n");
+   dev->tx_queue_len = orig_len;
+   return res;
+   }
+   }
+
+   return 0;
+}
+
+/**
  * dev_set_group - Change group this device belongs to
  * @dev: device
  * @new_group: group this device should belong to
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 7bf8b85ade16..584a480c6274 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -325,29 +325,6 @@ static ssize_t flags_store(struct device *dev, struct 
device_attribute *attr,
 }
 NETDEVICE_SHOW_RW(flags, fmt_hex);
 
-static int change_tx_queue_len(struct net_device *dev, unsigned long new_len)
-{
-   unsigned int orig_len = dev->tx_queue_len;
-   int res;
-
-   if (new_len != (unsigned int)new_len)
-   return -ERANGE;
-
-   if (new_len != orig_len) {
-   dev->tx_queue_len = new_len;
-   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
-   res = notifier_to_errno(res);
-   if (res) {
-   netdev_err(dev,
-  "refused to change device tx_queue_len\n");
-   dev->tx_queue_len = orig_len;
-   return -EFAULT;
-   }
-   }
-
-   return 0;
-}
-
 static ssize_t tx_queue_len_store(struct device *dev,
  struct device_attribute *attr,
  const char *buf, size_t len)
@@ -355,7 +332,7 @@ static ssize_t tx_queue_len_store(struct device *dev,
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 
-   return netdev_store(dev, attr, buf, len, change_tx_queue_len);
+   return netdev_store(dev, attr, buf, len, dev_change_tx_queue_len);
 }
 NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 16d644a4f974..fb2f38193b86 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2282,19 +2282,11 @@ static int do_setlink(const struct sk_buff *skb,
 
if (tb[IFLA_TXQLEN]) {
unsigned int value = nla_get_u32(tb[IFLA_TXQLEN]);
-   unsigned int orig_len = dev->tx_queue_len;
-
-   if (dev->tx_queue_len ^ value) {
-   dev->tx_queue_len = value;
-   err = call_netdevice_notifiers(
- NETDEV_CHANGE_TX_QUEUE_LEN, dev);
-   err = notifier_to_errno(err);
-   if (err) {
-   dev->tx_queue_len = orig_len;
-   goto er

[Patch net-next 3/3] net_sched: implement ->change_tx_queue_len() for pfifo_fast

2018-01-19 Thread Cong Wang
pfifo_fast used to drop based on qdisc_dev(qdisc)->tx_queue_len,
so we have to resize skb array when we change tx_queue_len.

Other qdiscs which read tx_queue_len are fine because they
all save it to sch->limit or somewhere else in qdisc during init.
They don't have to implement this, it is nicer if they do so
that users don't have to re-configure qdisc after changing
tx_queue_len.

Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/sched/sch_generic.c | 13 +
 1 file changed, 13 insertions(+)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 30aaeb3c1bf1..7c48a6942238 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -763,6 +763,18 @@ static void pfifo_fast_destroy(struct Qdisc *sch)
}
 }
 
+static void pfifo_fast_change_tx_queue_len(struct Qdisc *sch, unsigned int 
new_len)
+{
+   struct pfifo_fast_priv *priv = qdisc_priv(sch);
+   int prio;
+
+   for (prio = 0; prio < PFIFO_FAST_BANDS; prio++) {
+   struct skb_array *q = band2list(priv, prio);
+
+   skb_array_resize(q, new_len, GFP_KERNEL);
+   }
+}
+
 struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.id =   "pfifo_fast",
.priv_size  =   sizeof(struct pfifo_fast_priv),
@@ -773,6 +785,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
.destroy=   pfifo_fast_destroy,
.reset  =   pfifo_fast_reset,
.dump   =   pfifo_fast_dump,
+   .change_tx_queue_len =  pfifo_fast_change_tx_queue_len,
.owner  =   THIS_MODULE,
.static_flags   =   TCQ_F_NOLOCK | TCQ_F_CPUSTATS,
 };
-- 
2.13.0



[Patch net-next 0/3] net_sched: reflect tx_queue_len change for pfifo_fast

2018-01-19 Thread Cong Wang
This pathcset restores the pfifo_fast qdisc behavior of dropping
packets based on latest dev->tx_queue_len. Patch 1 introduces
a helper, patch 2 introduces a new Qdisc ops which is called when
we modify tx_queue_len, patch 3 implements this ops for pfifo_fast.

Please see each patch for details.

---

Cong Wang (3):
  net: introduce helper dev_change_tx_queue_len()
  net_sched: plug in qdisc ops change_tx_queue_len
  net_sched: implement ->change_tx_queue_len() for pfifo_fast

 include/linux/netdevice.h |  1 +
 include/net/sch_generic.h |  2 ++
 net/core/dev.c| 29 +
 net/core/net-sysfs.c  | 25 +
 net/core/rtnetlink.c  | 18 +-
 net/sched/sch_generic.c   | 35 +++
 6 files changed, 73 insertions(+), 37 deletions(-)

-- 
2.13.0



Re: [patch net-next v10 02/13] net: sched: introduce shared filter blocks infrastructure

2018-01-16 Thread Cong Wang
On Tue, Jan 16, 2018 at 7:33 AM, Jiri Pirko  wrote:
>  static int __init tc_filter_init(void)
>  {
> +   int err;
> +
> tc_filter_wq = alloc_ordered_workqueue("tc_filter_workqueue", 0);
> if (!tc_filter_wq)
> return -ENOMEM;
>
> +   err = register_pernet_subsys(_net_ops);
> +   if (err)
> +   return err;

Need to destroy the above workqueue on error.


Re: [PATCH net-next 7/8] net: sched: cls: add extack support for tc_setup_cb_call

2018-01-16 Thread Cong Wang
On Tue, Jan 16, 2018 at 9:20 AM, Alexander Aring  wrote:
>  int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
> -enum tc_setup_type type, void *type_data, bool err_stop)
> +enum tc_setup_type type, void *type_data, bool err_stop,
> +struct netlink_ext_ack *extack)
>  {
> int ok_count;
> int ret;
>
> ret = tcf_block_cb_call(block, type, type_data, err_stop);
> -   if (ret < 0)
> +   if (ret < 0) {
> +   NL_SET_ERR_MSG(extack, "Failed to inialize tcf block");


s/inialize/initialize/

> return ret;
> +   }
> ok_count = ret;
>
> if (!exts)
> return ok_count;
> ret = tc_exts_setup_cb_egdev_call(exts, type, type_data, err_stop);
> -   if (ret < 0)
> +   if (ret < 0) {
> +   NL_SET_ERR_MSG(extack, "Failed to inialize tcf block 
> extensions");

Ditto.


Re: [PATCH net-next 2/8] net: sched: cls_api: handle generic cls errors

2018-01-16 Thread Cong Wang
On Tue, Jan 16, 2018 at 9:20 AM, Alexander Aring  wrote:
> @@ -1117,8 +1146,10 @@ int tcf_exts_validate(struct net *net, struct 
> tcf_proto *tp, struct nlattr **tb,
> }
>  #else
> if ((exts->action && tb[exts->action]) ||
> -   (exts->police && tb[exts->police]))
> +   (exts->police && tb[exts->police])) {
> +   NL_SET_ERR_MSG(extack, "Actions are not supported. Check 
> compile options");
> return -EOPNOTSUPP;
> +   }
>  #endif

"Check compile options" is confusing, it is clearer if we can just
say we need to enable CONFIG_NET_CLS_ACT here.


Re: [PATCH net-next 8/8] net: sched: cls_u32: add extack support

2018-01-16 Thread Cong Wang
On Tue, Jan 16, 2018 at 9:20 AM, Alexander Aring  wrote:
> -   if (root_ht == ht)
> +   if (root_ht == ht) {
> +   NL_SET_ERR_MSG(extack, "Not allowd to delete root node");

s/allowd/allowed/


Re: KASAN: use-after-free Read in tipc_group_is_open

2018-01-16 Thread Cong Wang
On Tue, Jan 16, 2018 at 5:23 AM, Jon Maloy <jon.ma...@ericsson.com> wrote:
>
>
>> -Original Message-----
>> From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
>> Sent: Monday, January 15, 2018 23:44
>> To: syzbot <syzbot+799dafde028679585...@syzkaller.appspotmail.com>
>> Cc: David Miller <da...@davemloft.net>; Jon Maloy
>> <jon.ma...@ericsson.com>; LKML <linux-ker...@vger.kernel.org>; Linux
>> Kernel Network Developers <netdev@vger.kernel.org>; syzkaller-
>> b...@googlegroups.com; tipc-discuss...@lists.sourceforge.net; Ying Xue
>> <ying@windriver.com>
>> Subject: Re: KASAN: use-after-free Read in tipc_group_is_open
>>
>> On Mon, Jan 15, 2018 at 7:58 PM, syzbot
>> <syzbot+799dafde028679585...@syzkaller.appspotmail.com> wrote:
>> > Hello,
>> >
>> > syzkaller hit the following crash on
>> > 594831a8aba3fd045c3212a3e3bb9788c77b989d
>> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/maste
>> > r
>> > compiler: gcc (GCC) 7.1.1 20170620
>> > .config is attached
>> > Raw console output is attached.
>> > C reproducer is attached
>> > syzkaller reproducer is attached. See https://goo.gl/kgGztJ for
>> > information about syzkaller reproducers
>> >
>> >
>> > IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> > Reported-by: syzbot+799dafde028679585...@syzkaller.appspotmail.com
>> > It will help syzbot understand when the bug is fixed. See footer for
>> > details.
>> > If you forward the report, please keep this part and the footer.
>> >
>> >
>> ==
>> 
>> > BUG: KASAN: use-after-free in tipc_group_is_open+0x3a/0x40
>> > net/tipc/group.c:106
>> > Read of size 1 at addr 8801d89f7378 by task syzkaller275009/3704
>> >
>> > CPU: 0 PID: 3704 Comm: syzkaller275009 Not tainted 4.15.0-rc7+ #190
>> > Hardware name: Google Google Compute Engine/Google Compute Engine,
>> > BIOS Google 01/01/2011 Call Trace:
>> >  __dump_stack lib/dump_stack.c:17 [inline]
>> >  dump_stack+0x194/0x257 lib/dump_stack.c:53
>> >  print_address_description+0x73/0x250 mm/kasan/report.c:252
>> > kasan_report_error mm/kasan/report.c:351 [inline]
>> >  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>> >  __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
>> >  tipc_group_is_open+0x3a/0x40 net/tipc/group.c:106
>> >  tipc_poll+0x364/0x4d0 net/tipc/socket.c:740
>>
>>
>> commit eb929a91b213d2a72c5a8b4af9a1acf63bfb8287
>> Author: Jon Maloy <jon.ma...@ericsson.com>
>> Date:   Mon Jan 8 21:03:31 2018 +0100
>>
>> tipc: improve poll() for group member socket
>>
>> Apparently Jon's commit doesn't fix this. I also don't understand why Jon
>> believes sock_poll_wait() could sync with setsockopt path...
>
> While sock_poll_wait() is sleeping, it is possible that the item the 'grp' 
> stack variable is pointing to might be deleted, invalidating the pointer.
> This is why I moved the initialization of the pointer to after 
> sock_poll_wait().

This doesn't matter at all as long as it doesn't sync setsockopt() path.


> However, now tipc_sock->group is clearly set to NULL at both locations where 
> a group item might be deleted, so the reason for the warning must be 
> something else.
> I am open to suggestions.

Another thread calling  tipc_sk_join() could jump in at any time
since we don't have the sock locked here. This is why I said
we probably need to lock the sock here, unless of course
you can refactor the logic to make tipc_poll() not to touch
tsk->group.


Re: [Patch net v3] tun: fix a memory leak for tfile->tx_array

2018-01-15 Thread Cong Wang
On Mon, Jan 15, 2018 at 10:37 PM, Jason Wang <jasow...@redhat.com> wrote:
>
>
> On 2018年01月16日 14:33, Cong Wang wrote:
>> But __tun_detach(true) is not a hot path, a memset() doesn't harm
>> anything.
>
>
> Yes, but it looks more more like a workaround or trick to me.
>

I'd blame skb_array API for this. ;) Ideally, skb_array_cleanup()
should take care of everything I put in tun_cleanup_tx_array().
As I mentioned in changelog, we can always improve it in
-net-next, so I don't want to bother it for -net.


Re: [PATCH net-next] net: sched: fix use before alloc of per cpu stats

2018-01-15 Thread Cong Wang
On Mon, Jan 15, 2018 at 10:33 PM, Prashant Bhole
 wrote:
> Pardon my ignorance, I think it is necessary to forward d59f5ffa59d8 to
> -net. Because previously flags were set during init and checked after init
> to allocate memory. static_flags makes the flags available before init,
> hence we can allocate memory before init.

Oh, right, but we can allocate it unconditionally and free it if not needed
after ->init(). It is not prettier though. ;)

> Sorry, but from your reply I couldn't understand whether we need v2 for
> net-next.
>

I don't think you need to send v2, because the one for -net will be merged
(with a conflict) into -net-next by DaveM.


Re: [Patch net v3] tun: fix a memory leak for tfile->tx_array

2018-01-15 Thread Cong Wang
On Mon, Jan 15, 2018 at 10:12 PM, Jason Wang <jasow...@redhat.com> wrote:
>
>
> On 2018年01月16日 14:07, Cong Wang wrote:
>>
>> On Mon, Jan 15, 2018 at 10:00 PM, Jason Wang <jasow...@redhat.com> wrote:
>>>
>>> I mean we can leave __tun_detach() as is, and just add the cleanup to
>>> tun_detach_all(). This is because in both cases, we're sure skb array has
>>> been initialized before.
>>>
>> Oh, I thought the same before sending v3, but I believe it is easier to
>> understand 'if (tfile->tx_array.ring.queue)' than 'if (tun)', because
>> tx_array
>> only depends on itself rather tfile->tun in this way.
>
>
> Maybe just add a comment to explain in __tun_detach(), it avoids memset()
> anyway.
>

But __tun_detach(true) is not a hot path, a memset() doesn't harm anything.


Re: [PATCH net-next 2/2] net: sched: add xfrm policy ematch

2018-01-15 Thread Cong Wang
On Fri, Jan 12, 2018 at 4:57 AM, Eyal Birger  wrote:
> +static void em_policy_destroy(struct tcf_ematch *em)
> +{
> +   const struct xt_policy_info *info = (const void *)em->data;
> +
> +   if (!info)
> +   return;
> +
> +   kfree((void *)em->data);
> +}

Nit: kfree() could handle NULL, no need to check.


Re: [Patch net v3] tun: fix a memory leak for tfile->tx_array

2018-01-15 Thread Cong Wang
On Mon, Jan 15, 2018 at 10:00 PM, Jason Wang  wrote:
> I mean we can leave __tun_detach() as is, and just add the cleanup to
> tun_detach_all(). This is because in both cases, we're sure skb array has
> been initialized before.
>

Oh, I thought the same before sending v3, but I believe it is easier to
understand 'if (tfile->tx_array.ring.queue)' than 'if (tun)', because tx_array
only depends on itself rather tfile->tun in this way.


Re: [PATCH net-next] net: sched: fix use before alloc of per cpu stats

2018-01-15 Thread Cong Wang
On Mon, Jan 15, 2018 at 9:47 PM, Prashant Bhole
<bhole_prashant...@lab.ntt.co.jp> wrote:
>
>
> On 1/16/2018 2:08 PM, Cong Wang wrote:
>>
>> On Sun, Jan 14, 2018 at 9:52 PM, Prashant Bhole
>> <bhole_prashant...@lab.ntt.co.jp> wrote:
>>>
>>> About bug:
>>> During init of clsact/ingress, it links qdisc's cpu_bstats,cpu_qstats
>>> with mini qdisc. TCQ_F_CPUSTATS is set in qdisc->flags during init and
>>> this flag is checked after init to allocate memory for stats.
>>>
>>> Hence mini qdisc points to null per-cpu-stats. The problem isn't caught
>>> while updating stats via mini qdisc because per_cpu_ptr(NULL, cpu_num)
>>> or this_cpu_ptr(NULL) gives a valid pointer.
>>>
>>> About fix:
>>> Currently stats memory is allocated at two places.
>>> - in qdisc_alloc() if TCQ_F_CPUSTATS is set in Qdisc_ops->static_flags
>>> - in qdisc_create() if TCQ_F_CPUSTATS is set in Qdisc->flags
>>>
>>> Qdisc_ops->static_flags is propagated to Qdisc->flags. So to fix this
>>> bug,
>>> we set TCQ_F_CPUSTATS in static flags and additional condition to avoid
>>> allocation after init.
>>
>>
>>
>> Good catch! One nit below.
>>
>>
>>>
>>> Fixes: 46209401f8f6 ("net: core: introduce mini_Qdisc and eliminate usage
>>> of tp->q for clsact fastpath")
>>> Signed-off-by: Prashant Bhole <bhole_prashant...@lab.ntt.co.jp>
>>> ---
>>>   net/sched/sch_api.c | 3 ++-
>>>   net/sched/sch_ingress.c | 6 ++
>>>   2 files changed, 4 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>>> index 8a04c36e579f..de99a5e80944 100644
>>> --- a/net/sched/sch_api.c
>>> +++ b/net/sched/sch_api.c
>>> @@ -1094,7 +1094,8 @@ static struct Qdisc *qdisc_create(struct net_device
>>> *dev,
>>>  goto err_out5;
>>>  }
>>>
>>> -   if (qdisc_is_percpu_stats(sch)) {
>>> +   if (!(ops->static_flags & TCQ_F_CPUSTATS) &&
>>> +   qdisc_is_percpu_stats(sch)) {
>>
>>
>> Instead of checking both flags, it is simpler to just check if
>> sch->cpu_bstats
>> and sch->cpu_qstats are NULL here?
>>
>> Also, this patch should go to -net tree, not just net-next, but -net
>> doesn't have ops->static_flags yet. ;) Given this, consider moving
>> the sch->cpu_bstats allocation before ops->init(), which might be
>> simpler.
>
>
> Cong,
> Thanks for reviewing. Based on this patch, Daniel submitted another patch
> for -net:
> http://patchwork.ozlabs.org/patch/861098/

Odd, I don't even see it in my inbox...


>
> Let me know if we still need v2 of my patch on -net-next (with your
> suggested nit picks).

It is okay, but it doesn't have to forward commit d59f5ffa59d8 to
net, reordering allocation before ->init() should be simpler.


Re: [Patch net v3] tun: fix a memory leak for tfile->tx_array

2018-01-15 Thread Cong Wang
On Mon, Jan 15, 2018 at 9:46 PM, Jason Wang  wrote:
>
>
> I think then you don't even need the memset trick since we are sure it has
> been implemented?

It doesn't look like sk_alloc() zero's the memory of tfile.


Re: [PATCH net-next] net: sched: fix use before alloc of per cpu stats

2018-01-15 Thread Cong Wang
On Sun, Jan 14, 2018 at 9:52 PM, Prashant Bhole
 wrote:
> About bug:
> During init of clsact/ingress, it links qdisc's cpu_bstats,cpu_qstats
> with mini qdisc. TCQ_F_CPUSTATS is set in qdisc->flags during init and
> this flag is checked after init to allocate memory for stats.
>
> Hence mini qdisc points to null per-cpu-stats. The problem isn't caught
> while updating stats via mini qdisc because per_cpu_ptr(NULL, cpu_num)
> or this_cpu_ptr(NULL) gives a valid pointer.
>
> About fix:
> Currently stats memory is allocated at two places.
> - in qdisc_alloc() if TCQ_F_CPUSTATS is set in Qdisc_ops->static_flags
> - in qdisc_create() if TCQ_F_CPUSTATS is set in Qdisc->flags
>
> Qdisc_ops->static_flags is propagated to Qdisc->flags. So to fix this bug,
> we set TCQ_F_CPUSTATS in static flags and additional condition to avoid
> allocation after init.


Good catch! One nit below.


>
> Fixes: 46209401f8f6 ("net: core: introduce mini_Qdisc and eliminate usage of 
> tp->q for clsact fastpath")
> Signed-off-by: Prashant Bhole 
> ---
>  net/sched/sch_api.c | 3 ++-
>  net/sched/sch_ingress.c | 6 ++
>  2 files changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
> index 8a04c36e579f..de99a5e80944 100644
> --- a/net/sched/sch_api.c
> +++ b/net/sched/sch_api.c
> @@ -1094,7 +1094,8 @@ static struct Qdisc *qdisc_create(struct net_device 
> *dev,
> goto err_out5;
> }
>
> -   if (qdisc_is_percpu_stats(sch)) {
> +   if (!(ops->static_flags & TCQ_F_CPUSTATS) &&
> +   qdisc_is_percpu_stats(sch)) {

Instead of checking both flags, it is simpler to just check if sch->cpu_bstats
and sch->cpu_qstats are NULL here?

Also, this patch should go to -net tree, not just net-next, but -net
doesn't have ops->static_flags yet. ;) Given this, consider moving
the sch->cpu_bstats allocation before ops->init(), which might be
simpler.

Thanks.


Re: KASAN: use-after-free Read in tipc_group_is_open

2018-01-15 Thread Cong Wang
On Mon, Jan 15, 2018 at 7:58 PM, syzbot
 wrote:
> Hello,
>
> syzkaller hit the following crash on
> 594831a8aba3fd045c3212a3e3bb9788c77b989d
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+799dafde028679585...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> ==
> BUG: KASAN: use-after-free in tipc_group_is_open+0x3a/0x40
> net/tipc/group.c:106
> Read of size 1 at addr 8801d89f7378 by task syzkaller275009/3704
>
> CPU: 0 PID: 3704 Comm: syzkaller275009 Not tainted 4.15.0-rc7+ #190
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load1_noabort+0x14/0x20 mm/kasan/report.c:427
>  tipc_group_is_open+0x3a/0x40 net/tipc/group.c:106
>  tipc_poll+0x364/0x4d0 net/tipc/socket.c:740


commit eb929a91b213d2a72c5a8b4af9a1acf63bfb8287
Author: Jon Maloy 
Date:   Mon Jan 8 21:03:31 2018 +0100

tipc: improve poll() for group member socket

Apparently Jon's commit doesn't fix this. I also don't understand why
Jon believes sock_poll_wait() could sync with setsockopt path...



>  sock_poll+0x141/0x320 net/socket.c:
>  do_pollfd fs/select.c:822 [inline]
>  do_poll fs/select.c:872 [inline]
>  do_sys_poll+0x715/0x10b0 fs/select.c:966
>  SYSC_poll fs/select.c:1024 [inline]
>  SyS_poll+0x10d/0x450 fs/select.c:1012
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> RIP: 0033:0x446129
> RSP: 002b:7f0f2df96db8 EFLAGS: 0246 ORIG_RAX: 0007
> RAX: ffda RBX: 006dbc54 RCX: 00446129
> RDX: 7fff RSI: 000a RDI: 20ef5000
> RBP: 006dbc50 R08:  R09: 
> R10:  R11: 0246 R12: 
> R13: 7ffc07e2923f R14: 7f0f2df979c0 R15: 0007
>
> Allocated by task 3702:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3610
>  kmalloc include/linux/slab.h:499 [inline]
>  kzalloc include/linux/slab.h:688 [inline]
>  tipc_group_create+0x144/0x900 net/tipc/group.c:180
>  tipc_sk_join net/tipc/socket.c:2762 [inline]
>  tipc_setsockopt+0x274/0xce0 net/tipc/socket.c:2877
>  SYSC_setsockopt net/socket.c:1823 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1802
>  entry_SYSCALL_64_fastpath+0x23/0x9a
>
> Freed by task 3702:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3488 [inline]
>  kfree+0xd6/0x260 mm/slab.c:3803
>  tipc_group_delete+0x2c8/0x3d0 net/tipc/group.c:234
>  tipc_sk_join net/tipc/socket.c:2775 [inline]
>  tipc_setsockopt+0xaa9/0xce0 net/tipc/socket.c:2877
>  SYSC_setsockopt net/socket.c:1823 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1802
>  entry_SYSCALL_64_fastpath+0x23/0x9a
>
> The buggy address belongs to the object at 8801d89f7300
>  which belongs to the cache kmalloc-128 of size 128
> The buggy address is located 120 bytes inside of
>  128-byte region [8801d89f7300, 8801d89f7380)
> The buggy address belongs to the page:
> page:ea0007627dc0 count:1 mapcount:0 mapping:8801d89f7000 index:0x0
> flags: 0x2fffc000100(slab)
> raw: 02fffc000100 8801d89f7000  00010015
> raw: ea0007622720 ea0007627ce0 8801dac00640 
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  8801d89f7200: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
>  8801d89f7280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>>
>> 8801d89f7300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
> ^
>  8801d89f7380: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
>  8801d89f7400: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
> ==
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.

[Patch net v3] tun: fix a memory leak for tfile->tx_array

2018-01-15 Thread Cong Wang
tfile->tun could be detached before we close the tun fd,
via tun_detach_all(), so it should not be used to check for
tfile->tx_array.

As Jason suggested, we probably have to clean it up
unconditionally both in __tun_deatch() and tun_detach_all(),
but this requires to check if it is initialized or not.
Currently skb_array_cleanup() doesn't have such a check,
so I check it in the caller and introduce a helper function,
it is a bit ugly but we can always improve it in net-next.

Reported-by: Dmitry Vyukov <dvyu...@google.com>
Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
Cc: Jason Wang <jasow...@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 drivers/net/tun.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4f4a842a1c9c..a8ec589d1359 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -611,6 +611,14 @@ static void tun_queue_purge(struct tun_file *tfile)
skb_queue_purge(>sk.sk_error_queue);
 }
 
+static void tun_cleanup_tx_array(struct tun_file *tfile)
+{
+   if (tfile->tx_array.ring.queue) {
+   skb_array_cleanup(>tx_array);
+   memset(>tx_array, 0, sizeof(tfile->tx_array));
+   }
+}
+
 static void __tun_detach(struct tun_file *tfile, bool clean)
 {
struct tun_file *ntfile;
@@ -657,8 +665,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
tun->dev->reg_state == NETREG_REGISTERED)
unregister_netdevice(tun->dev);
}
-   if (tun)
-   skb_array_cleanup(>tx_array);
+   tun_cleanup_tx_array(tfile);
sock_put(>sk);
}
 }
@@ -700,11 +707,13 @@ static void tun_detach_all(struct net_device *dev)
/* Drop read queue */
tun_queue_purge(tfile);
sock_put(>sk);
+   tun_cleanup_tx_array(tfile);
}
list_for_each_entry_safe(tfile, tmp, >disabled, next) {
tun_enable_queue(tfile);
tun_queue_purge(tfile);
sock_put(>sk);
+   tun_cleanup_tx_array(tfile);
}
BUG_ON(tun->numdisabled != 0);
 
@@ -2851,6 +2860,8 @@ static int tun_chr_open(struct inode *inode, struct file 
* file)
 
sock_set_flag(>sk, SOCK_ZEROCOPY);
 
+   memset(>tx_array, 0, sizeof(tfile->tx_array));
+
return 0;
 }
 
-- 
2.13.0



Re: [Patch net v2] tun: fix a memory leak for tfile->tx_array

2018-01-15 Thread Cong Wang
On Sun, Jan 14, 2018 at 11:07 PM, Jason Wang <jasow...@redhat.com> wrote:
>
>
> On 2018年01月14日 01:31, Cong Wang wrote:
>>
>> On Thu, Jan 11, 2018 at 2:16 AM, Jason Wang <jasow...@redhat.com> wrote:
>>>
>>> It looks to me what is actual missed is the cleanups tun_detach_all().
>>> For
>>> me the only case that could leak is
>>>
>>> open
>>> attach
>>> ip link del link dev tap0
>>> close or another set_iff()
>>>
>>> So in this case, clean during close is not sufficient since it could be
>>> attached to another device.
>>
>> In this case, close() still calls tun_detach() with clean=true, so
>> with my patch, the tx_array is still cleaned. What am I missing here?
>> Are you implying clean=true is not sufficient?
>
>
> Consider the corner case:
>
> 1) open
> 2) tun_set_iff() (which calls tun_attach to initialize skb_array)
> 3) ip link del link dev tap0 (which calls tun_detach_all())
> 4) tun_set_iff() (current codes does not forbid this and it will allocate
> skb array again)
>
> Consider the skb array was only initialized when attach it to a real device,
> we should do the cleanup when we detach it from a device which happens on
> two places:
>
> - actively: close to an tun fd (__tun_deatch())
> - passively: tun device was destroyed (tun_detach_all())

Fair enough, I will send out v3.

Thanks.


Re: [PATCH net] ipv6: Fix cleanup ordering on inet6_init() error path

2018-01-13 Thread Cong Wang
On Tue, Jan 9, 2018 at 10:21 AM, Ben Hutchings
<ben.hutchi...@codethink.co.uk> wrote:
> Commit 15e668070a64 reordered the initialisation in inet6_init() to
> fix a crash on an error path further down the call stack.  It also
> reordered cleanup on the error path in inet6_init(), but the result
> is not the reverse of the initialisation order.  This presumably
> can result in a resource leak or crash in some error cases.  Reorder
> cleanup again to fix this.
>
> Fixes: 15e668070a64 ("ipv6: reorder icmpv6_init() and ip6_mr_init()")
> Signed-off-by: Ben Hutchings <ben.hutchi...@codethink.co.uk>

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>


Re: [PATCH net] ipv6: Fix cleanup ordering on inet6_init() error path

2018-01-13 Thread Cong Wang
On Thu, Jan 11, 2018 at 8:48 AM, Ben Hutchings
<ben.hutchi...@codethink.co.uk> wrote:
> On Wed, 2018-01-10 at 14:25 -0800, Cong Wang wrote:
>> On Tue, Jan 9, 2018 at 10:21 AM, Ben Hutchings
>> <ben.hutchi...@codethink.co.uk> wrote:
>> > Commit 15e668070a64 reordered the initialisation in inet6_init() to
>> > fix a crash on an error path further down the call stack.  It also
>> > reordered cleanup on the error path in inet6_init(), but the result
>> > is not the reverse of the initialisation order.  This presumably
>> > can result in a resource leak or crash in some error
>> > cases.  Reorder
>> > cleanup again to fix this.
>>
>> Can you be specific on what resource we leak here?
>
> If icmpv6_init() fails, after ip6_mr_init(), then ip6_mr_cleanup() is
> not called.
>
> Also, if ip6_mr_init() fails, we don't unregister inet6_net_ops.  I
> think that will result in a crash - immediately if ipv6 is a module,
> otherwise when the next net namespace is created.

Ah, I somehow misread the patch. It looks good.

>
>> Also, it looks like you not just revert the order changed in commit
>> 15e668070a64, but also you move  icmpv6_cleanup() even earlier.
>
> So should I add another Fixes: there?

No, I think it is okay.


Re: [Patch net v2] tun: fix a memory leak for tfile->tx_array

2018-01-13 Thread Cong Wang
On Thu, Jan 11, 2018 at 2:16 AM, Jason Wang  wrote:
>
> It looks to me what is actual missed is the cleanups tun_detach_all(). For
> me the only case that could leak is
>
> open
> attach
> ip link del link dev tap0
> close or another set_iff()
>
> So in this case, clean during close is not sufficient since it could be
> attached to another device.

In this case, close() still calls tun_detach() with clean=true, so
with my patch, the tx_array is still cleaned. What am I missing here?
Are you implying clean=true is not sufficient?


Re: sctp: memory leak in sctp_endpoint_init

2018-01-10 Thread Cong Wang
On Tue, Jan 9, 2018 at 9:44 AM, 'Dmitry Vyukov' via syzkaller
 wrote:
> Hello,
>
> syzkaller has hit the following memory leak on 4.15-rc7.
> Reproducer is attached.
>
> unferenced object 0x88007bbaa720 (size 32):
>   comm "syz-executor4", pid 12479, jiffies 4295951917 (age 9.779s)
>   hex dump (first 32 bytes):
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>   backtrace:
> [] kmemleak_alloc_recursive
> include/linux/kmemleak.h:55 [inline]
> [] slab_post_alloc_hook mm/slab.h:440 [inline]
> [] slab_alloc_node mm/slub.c:2725 [inline]
> [] slab_alloc mm/slub.c:2733 [inline]
> [] kmem_cache_alloc_trace+0x126/0x290 mm/slub.c:2750
> [<52b69e97>] kmalloc include/linux/slab.h:499 [inline]
> [<52b69e97>] kzalloc include/linux/slab.h:688 [inline]
> [<52b69e97>] sctp_endpoint_init net/sctp/endpointola.c:66 [inline]
> [<52b69e97>] sctp_endpoint_new+0x16d/0xef0
> net/sctp/endpointola.c:195
> [] sctp_init_sock+0xc18/0x13e0 net/sctp/socket.c:4490
> [] inet6_create+0xba7/0x1290 net/ipv6/af_inet6.c:255
> [] __sock_create+0x521/0x920 net/socket.c:1265
> [] sock_create net/socket.c:1305 [inline]
> [] SYSC_socket net/socket.c:1335 [inline]
> [] SyS_socket+0x102/0x1f0 net/socket.c:1315

This could be probably fixed by the patch
"Fix a leak in socket(2) when we fail to allocate a file descriptor." too.


Re: [PATCH net] ipv6: Fix cleanup ordering on inet6_init() error path

2018-01-10 Thread Cong Wang
On Tue, Jan 9, 2018 at 10:21 AM, Ben Hutchings
 wrote:
> Commit 15e668070a64 reordered the initialisation in inet6_init() to
> fix a crash on an error path further down the call stack.  It also
> reordered cleanup on the error path in inet6_init(), but the result
> is not the reverse of the initialisation order.  This presumably
> can result in a resource leak or crash in some error cases.  Reorder
> cleanup again to fix this.

Can you be specific on what resource we leak here?

Also, it looks like you not just revert the order changed in commit
15e668070a64, but also you move  icmpv6_cleanup() even earlier.


[Patch net] tipc: fix a memory leak in tipc_nl_node_get_link()

2018-01-10 Thread Cong Wang
When tipc_node_find_by_name() fails, the nlmsg is not
freed.

While on it, switch to a goto label to properly
free it.

Fixes: be9c086715c ("tipc: narrow down exposure of struct tipc_node")
Reported-by: Dmitry Vyukov <dvyu...@google.com>
Cc: Jon Maloy <jon.ma...@ericsson.com>
Cc: Ying Xue <ying@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/tipc/node.c | 26 ++
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/net/tipc/node.c b/net/tipc/node.c
index 507017fe0f1b..9036d8756e73 100644
--- a/net/tipc/node.c
+++ b/net/tipc/node.c
@@ -1880,36 +1880,38 @@ int tipc_nl_node_get_link(struct sk_buff *skb, struct 
genl_info *info)
 
if (strcmp(name, tipc_bclink_name) == 0) {
err = tipc_nl_add_bc_link(net, );
-   if (err) {
-   nlmsg_free(msg.skb);
-   return err;
-   }
+   if (err)
+   goto err_free;
} else {
int bearer_id;
struct tipc_node *node;
struct tipc_link *link;
 
node = tipc_node_find_by_name(net, name, _id);
-   if (!node)
-   return -EINVAL;
+   if (!node) {
+   err = -EINVAL;
+   goto err_free;
+   }
 
tipc_node_read_lock(node);
link = node->links[bearer_id].link;
if (!link) {
tipc_node_read_unlock(node);
-   nlmsg_free(msg.skb);
-   return -EINVAL;
+   err = -EINVAL;
+   goto err_free;
}
 
err = __tipc_nl_add_link(net, , link, 0);
tipc_node_read_unlock(node);
-   if (err) {
-   nlmsg_free(msg.skb);
-   return err;
-   }
+   if (err)
+   goto err_free;
}
 
return genlmsg_reply(msg.skb, info);
+
+err_free:
+   nlmsg_free(msg.skb);
+   return err;
 }
 
 int tipc_nl_node_reset_link_stats(struct sk_buff *skb, struct genl_info *info)
-- 
2.13.0



Re: KASAN: use-after-free Read in rb_first_postorder

2018-01-10 Thread Cong Wang
On Tue, Jan 9, 2018 at 2:58 PM, syzbot
 wrote:
> Hello,
>
> syzkaller hit the following crash on
> 61ad64080e039dce99a7f8d89b729bbea995e2f7
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+e3eeae78ea88b8d6d...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> audit: type=1400 audit(1515534326.722:7): avc:  denied  { map } for
> pid=3495 comm="syzkaller785641" path="/root/syzkaller785641882" dev="sda1"
> ino=16481 scontext=unconfined_u:system_r:insmod_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:user_home_t:s0 tclass=file permissive=1
> ==
> BUG: KASAN: use-after-free in rb_first_postorder+0x7c/0xa0 lib/rbtree.c:671
> Read of size 8 at addr 8801d9929c00 by task syzkaller785641/3495
>
> CPU: 0 PID: 3495 Comm: syzkaller785641 Not tainted 4.15.0-rc7+ #181
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:430
>  rb_first_postorder+0x7c/0xa0 lib/rbtree.c:671
>  tipc_group_join+0x120/0x2d0 net/tipc/group.c:210
>  tipc_sk_join net/tipc/socket.c:2781 [inline]
>  tipc_setsockopt+0x67e/0xcc0 net/tipc/socket.c:2876
>  SYSC_setsockopt net/socket.c:1821 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1800
>  entry_SYSCALL_64_fastpath+0x23/0x9a

Blame this:

commit d12d2e12cec2d66eab6cd58f592dad9fd386b97d
Author: Jon Maloy 
Date:   Mon Jan 8 21:03:28 2018 +0100

tipc: send out join messages as soon as new member is discovered


which assumes a return when tipc_sk_publish() fails but we don't.


> RIP: 0033:0x43fca9
> RSP: 002b:7ffe286872e8 EFLAGS: 0203 ORIG_RAX: 0036
> RAX: ffda RBX: 004002c8 RCX: 0043fca9
> RDX: 0087 RSI: 010f RDI: 0003
> RBP: 006ca018 R08: 0010 R09: 
> R10: 20540ff0 R11: 0203 R12: 00401610
> R13: 004016a0 R14:  R15: 
>
> Allocated by task 3495:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3610
>  kmalloc include/linux/slab.h:499 [inline]
>  kzalloc include/linux/slab.h:688 [inline]
>  tipc_group_create+0x144/0x900 net/tipc/group.c:180
>  tipc_sk_join net/tipc/socket.c:2762 [inline]
>  tipc_setsockopt+0x274/0xcc0 net/tipc/socket.c:2876
>  SYSC_setsockopt net/socket.c:1821 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1800
>  entry_SYSCALL_64_fastpath+0x23/0x9a
>
> Freed by task 3495:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3488 [inline]
>  kfree+0xd6/0x260 mm/slab.c:3803
>  tipc_group_delete+0x2c8/0x3d0 net/tipc/group.c:234
>  tipc_sk_join net/tipc/socket.c:2775 [inline]
>  tipc_setsockopt+0xba3/0xcc0 net/tipc/socket.c:2876
>  SYSC_setsockopt net/socket.c:1821 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1800
>  entry_SYSCALL_64_fastpath+0x23/0x9a
>
> The buggy address belongs to the object at 8801d9929c00
>  which belongs to the cache kmalloc-128 of size 128
> The buggy address is located 0 bytes inside of
>  128-byte region [8801d9929c00, 8801d9929c80)
> The buggy address belongs to the page:
> page:ea0007664a40 count:1 mapcount:0 mapping:8801d9929000 index:0x0
> flags: 0x2fffc000100(slab)
> raw: 02fffc000100 8801d9929000  00010015
> raw: ea0007660ca0 ea0007665020 8801dac00640 
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  8801d9929b00: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
>  8801d9929b80: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
>>
>> 8801d9929c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>
>^
>  8801d9929c80: fc fc fc fc fc fc fc fc 00 00 00 00 00 00 00 00
>  8801d9929d00: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
> 

[Patch net v2] tun: fix a memory leak for tfile->tx_array

2018-01-10 Thread Cong Wang
tfile->tun could be detached before we close the tun fd,
via tun_detach_all(), so it should not be used to check for
tfile->tx_array.

As Jason suggested, we probably have to clean it up
unconditionally, but this requires to check if it is initialized
or not. Currently skb_array_cleanup() doesn't have such a check,
so I check it in the caller, it is ugly but we can always
improve it in net-next.

Reported-by: Dmitry Vyukov <dvyu...@google.com>
Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
Cc: Jason Wang <jasow...@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 drivers/net/tun.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4f4a842a1c9c..4c85474ffbaf 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -657,7 +657,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
tun->dev->reg_state == NETREG_REGISTERED)
unregister_netdevice(tun->dev);
}
-   if (tun)
+   if (tfile->tx_array.ring.queue)
skb_array_cleanup(>tx_array);
sock_put(>sk);
}
@@ -2851,6 +2851,8 @@ static int tun_chr_open(struct inode *inode, struct file 
* file)
 
sock_set_flag(>sk, SOCK_ZEROCOPY);
 
+   memset(>tx_array, 0, sizeof(tfile->tx_array));
+
return 0;
 }
 
-- 
2.13.0



Re: [Patch net] tun: fix a memory leak for tfile->tx_array

2018-01-10 Thread Cong Wang
On Tue, Jan 9, 2018 at 7:00 PM, Jason Wang <jasow...@redhat.com> wrote:
>
>
> On 2018年01月10日 08:07, Cong Wang wrote:
>>
>> tfile->tun could be detached before we close the tun fd,
>> via tun_detach_all(), so it should not be used to check for
>> tfile->tx_array.
>>
>> Use the same logic as in tun_attach(), just test !tfile->deatched.
>>
>> Reported-by: Dmitry Vyukov <dvyu...@google.com>
>> Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
>> Cc: Jason Wang <jasow...@redhat.com>
>> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>> ---
>>   drivers/net/tun.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index 4f4a842a1c9c..1a1f834440a6 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -613,6 +613,7 @@ static void tun_queue_purge(struct tun_file *tfile)
>> static void __tun_detach(struct tun_file *tfile, bool clean)
>>   {
>> +   bool clean_tx_array = !tfile->detached;
>> struct tun_file *ntfile;
>> struct tun_struct *tun;
>>   @@ -657,7 +658,7 @@ static void __tun_detach(struct tun_file *tfile,
>> bool clean)
>> tun->dev->reg_state == NETREG_REGISTERED)
>> unregister_netdevice(tun->dev);
>> }
>> -   if (tun)
>> +   if (clean_tx_array)
>> skb_array_cleanup(>tx_array);
>> sock_put(>sk);
>> }
>
>
> Looks like we may still leak if we do
>
> open
> attach
> detach
> close

Good catch.


>
> Should we do cleanup unconditionally?

It doesn't look like we can, because tfile is not zero'ed at
least, it is allocated by sk_alloc() so tfile->tx_array could be
some random value which prevents skb_array_cleanup()
functioning correctly. Zero'ing tfile->tx_array in open()
is not sufficient, as spinlocks are not yet initialized.

I think we probably need to test tfile->tx_array.ring.queue,
it is ugly but I don't see there is an API for it.


[Patch net] tun: fix a memory leak for tfile->tx_array

2018-01-09 Thread Cong Wang
tfile->tun could be detached before we close the tun fd,
via tun_detach_all(), so it should not be used to check for
tfile->tx_array.

Use the same logic as in tun_attach(), just test !tfile->deatched.

Reported-by: Dmitry Vyukov <dvyu...@google.com>
Fixes: 1576d9860599 ("tun: switch to use skb array for tx")
Cc: Jason Wang <jasow...@redhat.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4f4a842a1c9c..1a1f834440a6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -613,6 +613,7 @@ static void tun_queue_purge(struct tun_file *tfile)
 
 static void __tun_detach(struct tun_file *tfile, bool clean)
 {
+   bool clean_tx_array = !tfile->detached;
struct tun_file *ntfile;
struct tun_struct *tun;
 
@@ -657,7 +658,7 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
tun->dev->reg_state == NETREG_REGISTERED)
unregister_netdevice(tun->dev);
}
-   if (tun)
+   if (clean_tx_array)
skb_array_cleanup(>tx_array);
sock_put(>sk);
}
-- 
2.13.0



Re: [Patch net] 8021q: fix a memory leak for VLAN 0 device

2018-01-09 Thread Cong Wang
On Tue, Jan 9, 2018 at 2:53 PM, Nikolay Aleksandrov
<niko...@cumulusnetworks.com> wrote:
> On 01/10/2018 12:47 AM, Cong Wang wrote:
>> On Tue, Jan 9, 2018 at 2:30 PM, Nikolay Aleksandrov
>> <niko...@cumulusnetworks.com> wrote:
>>>
>>> Just for reference - this is identical to the first part of:
>>> https://patchwork.ozlabs.org/patch/252891/
>>>
>>> I knew this looked familiar. :-)
>>>
>>
>> Yeah, except bonding is not even involved. Unless I misread,
>> DaveM rejected it because of bond, which I never touch here.
>>
>> The refcnt is paired in vlan_vid_{add,del}, and the calls are
>> paired in register/unreigster and NETDEV_UP/NETDEV_DOWN
>> after this patch.
>>
>
> You should read all of my replies to Dave, specifically the last one where I
> describe exactly a memory leak, and IIRC the rejection was not because of the
> bonding part but exactly because of this change - the removal of the vlan_id
> conditional.

Quote:
"If you have the 8021q module available, and you bring a device up, it gets
VLAN 0 by default, and if necessary programmed into the HW filters of the
device."

This is exactly a complain about your bonding check added for NETDEVUP,
which is clearly not here.

> I'm not arguing about this patch now, I've said what I had to say back then,
> I just gave it as a reference in case there's still relevant information in
> there.

Me neither, I just want to point it out memory leak is real
and not even related to bond.


Re: [Patch net] 8021q: fix a memory leak for VLAN 0 device

2018-01-09 Thread Cong Wang
On Tue, Jan 9, 2018 at 2:30 PM, Nikolay Aleksandrov
 wrote:
>
> Just for reference - this is identical to the first part of:
> https://patchwork.ozlabs.org/patch/252891/
>
> I knew this looked familiar. :-)
>

Yeah, except bonding is not even involved. Unless I misread,
DaveM rejected it because of bond, which I never touch here.

The refcnt is paired in vlan_vid_{add,del}, and the calls are
paired in register/unreigster and NETDEV_UP/NETDEV_DOWN
after this patch.


[Patch net] 8021q: fix a memory leak for VLAN 0 device

2018-01-09 Thread Cong Wang
A vlan device with vid 0 is allow to creat by not able to be fully
cleaned up by unregister_vlan_dev() which checks for vlan_id!=0.

Also, VLAN 0 is probably not a valid number and it is kinda
"reserved" for HW accelerating devices, but it is probably too
late to reject it from creation even if makes sense. Instead,
just remove the check in unregister_vlan_dev().

Reported-by: Dmitry Vyukov <dvyu...@google.com>
Fixes: ad1afb003939 ("vlan_dev: VLAN 0 should be treated as "no vlan tag" 
(802.1p packet)")
Cc: Vlad Yasevich <vyasev...@gmail.com>
Cc: Ben Hutchings <ben.hutchi...@codethink.co.uk>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/8021q/vlan.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 8dfdd94e430f..bad01b14a4ad 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -111,12 +111,7 @@ void unregister_vlan_dev(struct net_device *dev, struct 
list_head *head)
vlan_gvrp_uninit_applicant(real_dev);
}
 
-   /* Take it out of our own structures, but be sure to interlock with
-* HW accelerating devices or SW vlan input packet processing if
-* VLAN is not 0 (leave it there for 802.1p).
-*/
-   if (vlan_id)
-   vlan_vid_del(real_dev, vlan->vlan_proto, vlan_id);
+   vlan_vid_del(real_dev, vlan->vlan_proto, vlan_id);
 
/* Get rid of the vlan's reference to real_dev */
dev_put(real_dev);
-- 
2.13.0



Re: KASAN: use-after-free Read in tipc_group_size

2018-01-08 Thread Cong Wang
On Mon, Jan 8, 2018 at 6:58 AM, syzbot
 wrote:
> Hello,
>
> syzkaller hit the following crash on
> b2cd1df66037e7c4697c7e40496bf7e4a5e16a2d
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+aae58876fb5a1fad0...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> ==
> BUG: KASAN: use-after-free in tipc_group_size+0x40/0x50 net/tipc/group.c:158
> Read of size 2 at addr 8801c08ba280 by task syzkaller447710/3513
>
> CPU: 0 PID: 3513 Comm: syzkaller447710 Not tainted 4.15.0-rc7+ #252
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  __asan_report_load2_noabort+0x14/0x20 mm/kasan/report.c:428
>  tipc_group_size+0x40/0x50 net/tipc/group.c:158
>  tipc_poll+0x374/0x4f0 net/tipc/socket.c:739

Seems we have to lock the sock for tipc_group_size() in
tipc_poll().


>  sock_poll+0x141/0x320 net/socket.c:1117
>  do_pollfd fs/select.c:822 [inline]
>  do_poll fs/select.c:872 [inline]
>  do_sys_poll+0x715/0x10b0 fs/select.c:966
>  SYSC_poll fs/select.c:1024 [inline]
>  SyS_poll+0x10d/0x450 fs/select.c:1012
>  entry_SYSCALL_64_fastpath+0x23/0x9a
> RIP: 0033:0x445cb9
> RSP: 002b:7f04886b1ce8 EFLAGS: 0246 ORIG_RAX: 0007
> RAX: ffda RBX: 006dac3c RCX: 00445cb9
> RDX:  RSI: 0001 RDI: 20fa2ff8
> RBP: 006dac38 R08:  R09: 
> R10:  R11: 0246 R12: 
> R13: 7ffc84d56d7f R14: 7f04886b29c0 R15: 0005
>
> Allocated by task 3510:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551
>  kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3610
>  kmalloc include/linux/slab.h:499 [inline]
>  kzalloc include/linux/slab.h:688 [inline]
>  tipc_group_create+0x116/0x9c0 net/tipc/group.c:167
>  tipc_sk_join net/tipc/socket.c:2747 [inline]
>  tipc_setsockopt+0x249/0xc10 net/tipc/socket.c:2861
>  SYSC_setsockopt net/socket.c:1829 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1808
>  entry_SYSCALL_64_fastpath+0x23/0x9a
>
> Freed by task 3510:
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:447
>  set_track mm/kasan/kasan.c:459 [inline]
>  kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524
>  __cache_free mm/slab.c:3488 [inline]
>  kfree+0xd6/0x260 mm/slab.c:3803
>  tipc_group_delete+0x2c8/0x3d0 net/tipc/group.c:206
>  tipc_sk_join net/tipc/socket.c:2760 [inline]
>  tipc_setsockopt+0xb0d/0xc10 net/tipc/socket.c:2861
>  SYSC_setsockopt net/socket.c:1829 [inline]
>  SyS_setsockopt+0x189/0x360 net/socket.c:1808
>  entry_SYSCALL_64_fastpath+0x23/0x9a
>
> The buggy address belongs to the object at 8801c08ba200
>  which belongs to the cache kmalloc-192 of size 192
> The buggy address is located 128 bytes inside of
>  192-byte region [8801c08ba200, 8801c08ba2c0)
> The buggy address belongs to the page:
> page:ea0007022e80 count:1 mapcount:0 mapping:8801c08ba000 index:0x0
> flags: 0x2fffc000100(slab)
> raw: 02fffc000100 8801c08ba000  00010010
> raw: ea00071338a0 ea0006fe2360 8801dac00040 
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>  8801c08ba180: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>  8801c08ba200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>>
>> 8801c08ba280: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>
>^
>  8801c08ba300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>  8801c08ba380: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
> ==
>
>
> ---
> This bug is generated by a dumb bot. It may contain errors.
> See https://goo.gl/tpsmEJ for details.
> Direct all questions to syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report.
> If you forgot to add the Reported-by tag, once the fix for this bug is
> merged
> into any tree, please reply to this email with:
> #syz fix: exact-commit-title
> If you want to test a patch for this 

Re: [PATCH net-next] net: sched: fix tcf_block_get_ext() in case CONFIG_NET_CLS is not set

2018-01-04 Thread Cong Wang
On Thu, Jan 4, 2018 at 1:59 AM, Quentin Monnet
<quentin.mon...@netronome.com> wrote:
> Hi Cong,
>
> 2018-01-03 18:08 UTC-0800 ~ Cong Wang <xiyou.wangc...@gmail.com>
>> On Wed, Jan 3, 2018 at 5:30 PM, Jakub Kicinski
>> <jakub.kicin...@netronome.com> wrote:
>>> From: Quentin Monnet <quentin.mon...@netronome.com>
>>>
>>> The definition of functions tcf_block_get() and tcf_block_get_ext()
>>> depends of CONFIG_NET_CLS being set. When those functions gained extack
>>> support, only one version of the declaration of those functions was
>>> updated. Function tcf_block_get() was later fixed with commit
>>> 3c1490913f3b ("net: sch: api: fix tcf_block_get").
>>>
>>> Change arguments of tcf_block_get_ext() for the case when CONFIG_NET_CLS
>>> is not set.
>>
>> There is one already:
>> https://patchwork.kernel.org/patch/10130849/
>>
>
> Thanks! But this patch is the one I mentioned in the commit log: it
> fixes a different function, tcf_block_get(). My patch is an additional
> fix for tcf_block_get_ext().

Oh, I thought it is same one.

Acked-by: Cong Wang <xiyou.wangc...@gmail.com>


Re: [PATCH net-next] net: sched: fix tcf_block_get_ext() in case CONFIG_NET_CLS is not set

2018-01-03 Thread Cong Wang
On Wed, Jan 3, 2018 at 5:30 PM, Jakub Kicinski
 wrote:
> From: Quentin Monnet 
>
> The definition of functions tcf_block_get() and tcf_block_get_ext()
> depends of CONFIG_NET_CLS being set. When those functions gained extack
> support, only one version of the declaration of those functions was
> updated. Function tcf_block_get() was later fixed with commit
> 3c1490913f3b ("net: sch: api: fix tcf_block_get").
>
> Change arguments of tcf_block_get_ext() for the case when CONFIG_NET_CLS
> is not set.

There is one already:
https://patchwork.kernel.org/patch/10130849/


Re: WARNING in sk_stream_kill_queues (2)

2018-01-03 Thread Cong Wang
#syz dup: KASAN: slab-out-of-bounds Write in tcp_v6_syn_recv_sock


Re: "lockless" qdisc breaks tx_queue_len change too?

2018-01-03 Thread Cong Wang
On Wed, Jan 3, 2018 at 10:09 AM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 01/02/2018 08:41 PM, Cong Wang wrote:
>> Hi, John
>>
>> While reviewing your ptr_ring fix again today, it looks like your
>> "lockless" qdisc patchset breaks dev->tx_queue_len behavior.
>>
>> Before your patchset, dev->tx_queue_len is merely an integer to read,
>> after your patchset, the skb array has to be resized when
>> dev->tx_queue_len changes, but I don't see any qdisc code handles
>> this...
>>
>> Also, because of that, I doubt __skb_array_empty() in
>> pfifo_fast_dequeue() can be safe any more even with your ptr_ring fix.
>>
>> What am I missing?
>>
>
> I dropped support for tx_queue_len changes after qdisc has been
> created. The only check is at init time when building the qdisc.

This is where it breaks.


>
> Before this series teql and pfifo_fast were the only qdiscs that
> used tx_queue_len other qdiscs used other mechanisms or copied
> tx_queue_len at init time. So the API is inconsistent.

Yeah, pfifo_fast was able to drop based on latest value of tx_queue_len
before your patchset, this is why I am complaining.


>
> OK, but arguably its kAPI now and needs to be supported on live
> qdiscs. So couple options drop the __skb_array_empty() check,
> stop supporting changes on running qdiscs, or do a qdisc swap
> with the new array.

I don't think we can break the old behavior of tx_queue_len change
for pfifo_fast, people may already rely on it.

Doing a swap seems reasonable.

>
> I'm tempted to make the qdisc swap work, still need benchmarks
> I guess without the empty check. Either way to get it working
> we need a callback from tx_queue_len code paths.

Right, probably need a new ops in Qdisc_ops.


>
> Unfortunately, I guess someone somewhere probably uses pfifo_fast
> and changes there queue length with a script after creating the
> qdisc and expects it to work.
>

This is my concern as well. I will work on some patches, this doesn't
look trivial to solve at all.

Thanks.


Re: KASAN: slab-out-of-bounds Write in tcp_v6_syn_recv_sock

2018-01-03 Thread Cong Wang
On Wed, Jan 3, 2018 at 12:55 PM, Ozgur <oz...@goosey.org> wrote:
>
>
> 03.01.2018, 21:57, "Cong Wang" <xiyou.wangc...@gmail.com>:
>> On Tue, Jan 2, 2018 at 3:58 PM, syzbot
>> <syzbot+6dc95bddc6976b800...@syzkaller.appspotmail.com> wrote:
>>>  Hello,
>>>
>>>  syzkaller hit the following crash on
>>>  61233580f1f33c50e159c50e24d80ffd2ba2e06b
>>>  git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
>>>  compiler: gcc (GCC) 7.1.1 20170620
>>>  .config is attached
>>>  Raw console output is attached.
>>>  C reproducer is attached
>>>  syzkaller reproducer is attached. See https://goo.gl/kgGztJ
>>>  for information about syzkaller reproducers
>>>
>>>  IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>  Reported-by: syzbot+6dc95bddc6976b800...@syzkaller.appspotmail.com
>>>  It will help syzbot understand when the bug is fixed. See footer for
>>>  details.
>>>  If you forward the report, please keep this part and the footer.
>>>
>>>  TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending
>>>  cookies. Check SNMP counters.
>>>  ==
>>>  BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:344 
>>> [inline]
>>>  BUG: KASAN: slab-out-of-bounds in tcp_v6_syn_recv_sock+0x628/0x23a0
>>>  net/ipv6/tcp_ipv6.c:1144
>>>  Write of size 160 at addr 8801cbdd7460 by task syzkaller545407/3196
>>>
>>>  CPU: 1 PID: 3196 Comm: syzkaller545407 Not tainted 4.15.0-rc5+ #241
>>>  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>>>  Google 01/01/2011
>>>  Call Trace:
>>>   
>>>   __dump_stack lib/dump_stack.c:17 [inline]
>>>   dump_stack+0x194/0x257 lib/dump_stack.c:53
>>>   print_address_description+0x73/0x250 mm/kasan/report.c:252
>>>   kasan_report_error mm/kasan/report.c:351 [inline]
>>>   kasan_report+0x25b/0x340 mm/kasan/report.c:409
>>>   check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>>>   check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
>>>   memcpy+0x37/0x50 mm/kasan/kasan.c:303
>>>   memcpy include/linux/string.h:344 [inline]
>>>   tcp_v6_syn_recv_sock+0x628/0x23a0 net/ipv6/tcp_ipv6.c:1144
>>
>> tls_init() changes sk->sk_prot from IPv6 to IPv4, which leads
>> to this bug. I guess IPv6 is not supported for TLS? If so, need
>> a check on proto in tls_init()...
>
> Hello,
>
> I think IPv6 supports with TLS.
> There was a previously posted commit by Mellanox:
>
> https://patchwork.ozlabs.org/patch/801530/

Good to know.

Can you resend the fix? It could probably fix another warning
reported by syzbot too.


Re: KASAN: slab-out-of-bounds Write in tcp_v6_syn_recv_sock

2018-01-03 Thread Cong Wang
On Tue, Jan 2, 2018 at 3:58 PM, syzbot
 wrote:
> Hello,
>
> syzkaller hit the following crash on
> 61233580f1f33c50e159c50e24d80ffd2ba2e06b
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
>
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+6dc95bddc6976b800...@syzkaller.appspotmail.com
> It will help syzbot understand when the bug is fixed. See footer for
> details.
> If you forward the report, please keep this part and the footer.
>
> TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending
> cookies.  Check SNMP counters.
> ==
> BUG: KASAN: slab-out-of-bounds in memcpy include/linux/string.h:344 [inline]
> BUG: KASAN: slab-out-of-bounds in tcp_v6_syn_recv_sock+0x628/0x23a0
> net/ipv6/tcp_ipv6.c:1144
> Write of size 160 at addr 8801cbdd7460 by task syzkaller545407/3196
>
> CPU: 1 PID: 3196 Comm: syzkaller545407 Not tainted 4.15.0-rc5+ #241
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:17 [inline]
>  dump_stack+0x194/0x257 lib/dump_stack.c:53
>  print_address_description+0x73/0x250 mm/kasan/report.c:252
>  kasan_report_error mm/kasan/report.c:351 [inline]
>  kasan_report+0x25b/0x340 mm/kasan/report.c:409
>  check_memory_region_inline mm/kasan/kasan.c:260 [inline]
>  check_memory_region+0x137/0x190 mm/kasan/kasan.c:267
>  memcpy+0x37/0x50 mm/kasan/kasan.c:303
>  memcpy include/linux/string.h:344 [inline]
>  tcp_v6_syn_recv_sock+0x628/0x23a0 net/ipv6/tcp_ipv6.c:1144


tls_init() changes sk->sk_prot from IPv6 to IPv4, which leads
to this bug. I guess IPv6 is not supported for TLS? If so, need
a check on proto in tls_init()...




>  tcp_get_cookie_sock+0x102/0x540 net/ipv4/syncookies.c:213
>  cookie_v6_check+0x177d/0x2160 net/ipv6/syncookies.c:255
>  tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1008 [inline]
>  tcp_v6_do_rcv+0xe4d/0x11c0 net/ipv6/tcp_ipv6.c:1316
>  tcp_v6_rcv+0x22ee/0x2b40 net/ipv6/tcp_ipv6.c:1510
>  ip6_input_finish+0x36f/0x1700 net/ipv6/ip6_input.c:284
>  NF_HOOK include/linux/netfilter.h:250 [inline]
>  ip6_input+0xe9/0x560 net/ipv6/ip6_input.c:327
>  dst_input include/net/dst.h:466 [inline]
>  ip6_rcv_finish+0x1a9/0x7a0 net/ipv6/ip6_input.c:71
>  NF_HOOK include/linux/netfilter.h:250 [inline]
>  ipv6_rcv+0xf1f/0x1f80 net/ipv6/ip6_input.c:208
>  __netif_receive_skb_core+0x1a3e/0x3450 net/core/dev.c:4461
>  __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:4526
>  process_backlog+0x203/0x740 net/core/dev.c:5205
>  napi_poll net/core/dev.c:5603 [inline]
>  net_rx_action+0x792/0x1910 net/core/dev.c:5669
>  __do_softirq+0x2d7/0xb85 kernel/softirq.c:285
>  do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1115
>  
>  do_softirq.part.21+0x14d/0x190 kernel/softirq.c:329
>  do_softirq kernel/softirq.c:177 [inline]
>  __local_bh_enable_ip+0x1ee/0x230 kernel/softirq.c:182
>  local_bh_enable include/linux/bottom_half.h:32 [inline]
>  rcu_read_unlock_bh include/linux/rcupdate.h:727 [inline]
>  ip6_finish_output2+0xba6/0x2390 net/ipv6/ip6_output.c:121
>  ip6_finish_output+0x2f9/0x920 net/ipv6/ip6_output.c:146
>  NF_HOOK_COND include/linux/netfilter.h:239 [inline]
>  ip6_output+0x1eb/0x840 net/ipv6/ip6_output.c:163
>  dst_output include/net/dst.h:460 [inline]
>  NF_HOOK include/linux/netfilter.h:250 [inline]
>  ip6_xmit+0xd75/0x2080 net/ipv6/ip6_output.c:269
>  inet6_csk_xmit+0x2fc/0x580 net/ipv6/inet6_connection_sock.c:139
>  tcp_transmit_skb+0x1b12/0x38b0 net/ipv4/tcp_output.c:1176
>  tcp_write_xmit+0x680/0x5190 net/ipv4/tcp_output.c:2367
>  __tcp_push_pending_frames+0xa0/0x250 net/ipv4/tcp_output.c:2543
>  tcp_send_fin+0x1b0/0xd20 net/ipv4/tcp_output.c:3087
>  tcp_close+0xbe0/0xfc0 net/ipv4/tcp.c:2234
>  inet_release+0xed/0x1c0 net/ipv4/af_inet.c:426
>  inet6_release+0x50/0x70 net/ipv6/af_inet6.c:432
>  sock_release+0x8d/0x1e0 net/socket.c:600
>  sock_close+0x16/0x20 net/socket.c:1129
>  __fput+0x327/0x7e0 fs/file_table.c:210
>  fput+0x15/0x20 fs/file_table.c:244
>  task_work_run+0x199/0x270 kernel/task_work.c:113
>  exit_task_work include/linux/task_work.h:22 [inline]
>  do_exit+0x9bb/0x1ad0 kernel/exit.c:865
>  do_group_exit+0x149/0x400 kernel/exit.c:968
>  get_signal+0x73f/0x16c0 kernel/signal.c:2335
>  do_signal+0x94/0x1ee0 arch/x86/kernel/signal.c:809
>  exit_to_usermode_loop+0x214/0x310 arch/x86/entry/common.c:158
>  prepare_exit_to_usermode arch/x86/entry/common.c:195 [inline]
>  syscall_return_slowpath+0x490/0x550 arch/x86/entry/common.c:264
>  entry_SYSCALL_64_fastpath+0x94/0x96
> RIP: 0033:0x4456e9
> RSP: 002b:7fb4de631da8 EFLAGS: 0246 ORIG_RAX: 

"lockless" qdisc breaks tx_queue_len change too?

2018-01-02 Thread Cong Wang
Hi, John

While reviewing your ptr_ring fix again today, it looks like your
"lockless" qdisc patchset breaks dev->tx_queue_len behavior.

Before your patchset, dev->tx_queue_len is merely an integer to read,
after your patchset, the skb array has to be resized when
dev->tx_queue_len changes, but I don't see any qdisc code handles
this...

Also, because of that, I doubt __skb_array_empty() in
pfifo_fast_dequeue() can be safe any more even with your ptr_ring fix.

What am I missing?

Thanks.


Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()

2017-12-27 Thread Cong Wang
On Sat, Dec 23, 2017 at 10:57 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 12/22/2017 12:31 PM, Cong Wang wrote:
>> I understand why you had it, but it is just not safe. You don't want
>> to achieve performance gain by crashing system, right?
>
> huh? So my point is the patch you submit here is not a
> real fix but a work around. To peek the head of a consumer/producer ring
> without a lock, _should_ be fine. This _should_ work as well with
> consumer or producer operations happening at the same time. After some
> digging the issue is in the ptr_ring code.


The comments disagree with you:

/* Might be slightly faster than skb_array_empty below, but only safe if the
 * array is never resized. Also, callers invoking this in a loop must take care
 * to use a compiler barrier, for example cpu_relax().
 */

If the comments are right, you miss a barrier here too.


>
> The peek code (what empty check calls) is the following,
>
> static inline void *__ptr_ring_peek(struct ptr_ring *r)
> {
> if (likely(r->size))
> return r->queue[r->consumer_head];
> return NULL;
> }
>
> So what the splat is detecting is consumer head being 'out of bounds'.
> This happens because ptr_ring_discard_one increments the consumer_head
> and then checks to see if it overran the array size. If above peek
> happens after the increment, but before the size check we get the
> splat. There are two ways, as far as I can see, to fix this. First
> do the check before incrementing the consumer head. Or the easier
> fix,
>
> --- a/include/linux/ptr_ring.h
> +++ b/include/linux/ptr_ring.h
> @@ -438,7 +438,7 @@ static inline int ptr_ring_consume_batched_bh(struct
> ptr_ring *r,
>
>  static inline void **__ptr_ring_init_queue_alloc(unsigned int size,
> gfp_t gfp)
>  {
> -   return kcalloc(size, sizeof(void *), gfp);
> +   return kcalloc(size + 1, sizeof(void *), gfp);
>  }
>
> With Jakub's help (Thanks!) I was able to reproduce the original splat
> and also verify the above removes it.
>
> To be clear "resizing" a skb_array only refers to changing the actual
> array size not adding/removing elements.

I never look into the implementation, just simply trust the comments.

At least the comments above __skb_array_empty() need to improve.


>
>>
>>>
>>> Although its not logical IMO to have both reset and dequeue running at
>>> the same time. Some skbs would get through others would get sent, sort
>>> of a mess. I don't see how it can be an issue. The never resized bit
>>> in the documentation is referring to resizing the ring size _not_ popping
>>> off elements of the ring. array_empty just reads the consumer head.
>>> The only ring resizing in pfifo fast should be at init and destroy where
>>> enqueue/dequeue should be disconnected by then. Although based on the
>>> trace I missed a case.
>>
>>
>> Both pfifo_fast_reset() and pfifo_fast_dequeue() call
>> skb_array_consume_bh(), so there is no difference w.r.t. resizing.
>>
>
> Sorry not following.
>
>> And ->reset() is called in qdisc_graft() too. Let's say we have 
>> htb+pfifo_fast,
>> htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
>> so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
>> concurrently.
>
> Yes and this _should_ be perfectly fine for pfifo_fast. I'm wondering
> though if this API can be cleaned up. What are the paths that do a reset
> without a destroy.. Do we really need to have this pattern where reset
> is called then later destroy. Seems destroy could do the entire cleanup
> and this would simplify things. None of this has to do with the splat
> though.

I don't follow your point any more.

We are talking about ->reset() race with ->dequeue() which is the
cause of the bug, right?

If you expect ->reset() runs in parallel with ->dequeue() for pfifo_fast,
why did you even mention synchronize_net() from the beginning???
Also you changed the code too, to adjust rcu grace period.


>
>>
>>
>>>
>>> I think the right fix is to only call reset/destroy patterns after
>>> waiting a grace period and for all tx_action calls in-flight to
>>> complete. This is also better going forward for more complex qdiscs.
>>
>> But we don't even have rcu read lock in TX BH, do we?
>>
>> Also, people certainly don't like yet another synchronize_net()...
>>
>
> This needs a fix and is a _real_ bug, but removing __skb_array_empty()
> doesn't help solve this at all. Will work on a fix after the holiday
> break. The fix here is to ensure the destroy is not going to happen
> while tx_action is in-flight. Can be done with qdisc_run and checking
> correct bits in lockless case.

Sounds like you missed a lot of things with your "lockless" patches
First qdisc rcu callback, second rcu read lock in TX BH...

My quick one-line fix is to amend this bug before you going deeper
in this rabbit hole.


Re: [Patch net-next] net_sched: call qdisc_reset() with qdisc lock

2017-12-22 Thread Cong Wang
On Thu, Dec 21, 2017 at 7:36 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 12/21/2017 04:03 PM, Cong Wang wrote:
>> qdisc_reset() should always be called with qdisc spinlock
>> and with BH disabled, otherwise qdisc ->reset() could race
>> with TX BH.
>>
> hmm I don't see how this fixes the issue. pfifo_fast is no longer
> using the qdisc lock so that doesn't help.  And it is only a
> local_bh_disable.


First of all, this is to fix non-pfifo_fast which you seem totally forget.


>
>
>> Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
>> Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
>> Cc: John Fastabend <john.fastab...@gmail.com>
>> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>> ---
>>  net/sched/sch_generic.c | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 10aaa3b615ce..00ddb5f8f430 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
>>  {
>>   struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
>>
>> - if (qdisc)
>> + if (qdisc) {
>> + spin_lock_bh(qdisc_lock(qdisc));
>>   qdisc_reset(qdisc);
>> + spin_unlock_bh(qdisc_lock(qdisc));
>> + }
>>  }
>>
>>  /**
>>
>
> OK first the cases to get to qdisc_reset that I've tracked
> down are,
>
> dev_shutdown()
>   qdisc_destroy()
>
> dev_deactivate_many()
>   dev_qdisc_reset() <- for each txq
>  qdisc_reset()
>
> chained calls from qdisc_reset ops
>
> At the moment all the lockless qdiscs don't care about chained
> calls so we can ignore that, but would be nice to keep in mind.
>
> Next qdisc_reset() is doing a couple things calling the qdisc
> ops reset call but also walking gso_skb and skb_bad_txq. The
> 'unlink' operations there are not safe to be called while an
> enqueue/dequeue op is in-flight. Also pfifo_fast's reset op
> is not safe to be called with enqueue/dequeue ops in-flight.

This is why I sent two patches instead just this one.


>
> So I've made the assumption that qdisc_reset is _only_ ever
> called after a qdisc is no longer attached on the enqueue
> dev_xmit side and also any in-progress tx_action calls are
> completed. For what its worth this has always been the assumption
> AFAIK.
>
> So those are the assumptions what did I miss?


Speaking of this, qdisc_reset() is also called in dev_deactivate_queue()
even before synchronize_net()...

>
> The biggest gap I see is dev_deactivate_many() is supposed
> to wait for all tx_action calls to complete, this bit:

In some_qdisc_is_busy() you hold qdisc spinlock for non-lockless
ones, I don't understand why you don't want it in dev_qdisc_reset().


Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()

2017-12-22 Thread Cong Wang
On Thu, Dec 21, 2017 at 7:06 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 12/21/2017 04:03 PM, Cong Wang wrote:
>> __skb_array_empty() is only safe if array is never resized.
>> pfifo_fast_dequeue() is called in TX BH context and without
>> qdisc lock, so even after we disable BH on ->reset() path
>> we can still race with other CPU's.
>>
>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>> Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
>> Cc: John Fastabend <john.fastab...@gmail.com>
>> Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>> ---
>>  net/sched/sch_generic.c | 3 ---
>>  1 file changed, 3 deletions(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 00ddb5f8f430..9279258ce060 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc 
>> *qdisc)
>>   for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>   struct skb_array *q = band2list(priv, band);
>>
>> - if (__skb_array_empty(q))
>> - continue;
>> -
>>   skb = skb_array_consume_bh(q);
>>   }
>>   if (likely(skb)) {
>>
>
>
> So this is a performance thing we don't want to grab the consumer lock on
> empty bands. Which can be fairly common depending on traffic patterns.


I understand why you had it, but it is just not safe. You don't want
to achieve performance gain by crashing system, right?

>
> Although its not logical IMO to have both reset and dequeue running at
> the same time. Some skbs would get through others would get sent, sort
> of a mess. I don't see how it can be an issue. The never resized bit
> in the documentation is referring to resizing the ring size _not_ popping
> off elements of the ring. array_empty just reads the consumer head.
> The only ring resizing in pfifo fast should be at init and destroy where
> enqueue/dequeue should be disconnected by then. Although based on the
> trace I missed a case.


Both pfifo_fast_reset() and pfifo_fast_dequeue() call
skb_array_consume_bh(), so there is no difference w.r.t. resizing.

And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
concurrently.


>
> I think the right fix is to only call reset/destroy patterns after
> waiting a grace period and for all tx_action calls in-flight to
> complete. This is also better going forward for more complex qdiscs.

But we don't even have rcu read lock in TX BH, do we?

Also, people certainly don't like yet another synchronize_net()...


[Patch net-next] net_sched: call qdisc_reset() with qdisc lock

2017-12-21 Thread Cong Wang
qdisc_reset() should always be called with qdisc spinlock
and with BH disabled, otherwise qdisc ->reset() could race
with TX BH.

Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/sched/sch_generic.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 10aaa3b615ce..00ddb5f8f430 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
 {
struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
 
-   if (qdisc)
+   if (qdisc) {
+   spin_lock_bh(qdisc_lock(qdisc));
qdisc_reset(qdisc);
+   spin_unlock_bh(qdisc_lock(qdisc));
+   }
 }
 
 /**
-- 
2.13.0



[Patch net-next] net_sched: remove the unsafe __skb_array_empty()

2017-12-21 Thread Cong Wang
__skb_array_empty() is only safe if array is never resized.
pfifo_fast_dequeue() is called in TX BH context and without
qdisc lock, so even after we disable BH on ->reset() path
we can still race with other CPU's.

Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/sched/sch_generic.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 00ddb5f8f430..9279258ce060 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc 
*qdisc)
for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
struct skb_array *q = band2list(priv, band);
 
-   if (__skb_array_empty(q))
-   continue;
-
skb = skb_array_consume_bh(q);
}
if (likely(skb)) {
-- 
2.13.0



Re: RCU callback crashes

2017-12-21 Thread Cong Wang
On Wed, Dec 20, 2017 at 4:50 PM, Jakub Kicinski  wrote:
> On Wed, 20 Dec 2017 16:41:14 -0800, Jakub Kicinski wrote:
>> Just as I hit send... :)  but this looks unrelated, "Comm: sshd" -
>> so probably from the management interface.
>>
>> [  154.604041] 
>> ==
>> [  154.612245] BUG: KASAN: slab-out-of-bounds in 
>> pfifo_fast_dequeue+0x140/0x2d0
>> [  154.620219] Read of size 8 at addr 88086bb64040 by task sshd/983
>> [  154.627403]
>> [  154.629161] CPU: 10 PID: 983 Comm: sshd Not tainted 
>> 4.15.0-rc3-perf-00984-g82d3fc87a4aa-dirty #13
>> [  154.639190] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 
>> 11/08/2016
>> [  154.647665] Call Trace:
>> [  154.650494]  dump_stack+0xa6/0x118
>> [  154.654387]  ? _atomic_dec_and_lock+0xe8/0xe8
>> [  154.659355]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
>> [  154.666263]  ? rcu_segcblist_enqueue+0xe9/0x120
>> [  154.671422]  ? _raw_spin_unlock_bh+0x91/0xc0
>> [  154.676286]  ? pfifo_fast_dequeue+0x140/0x2d0
>> [  154.681251]  print_address_description+0x6a/0x270
>> [  154.686601]  ? pfifo_fast_dequeue+0x140/0x2d0
>> [  154.691565]  kasan_report+0x23f/0x350
>> [  154.695752]  pfifo_fast_dequeue+0x140/0x2d0
>
> If we trust stack decode it's:
>
>615  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>616  {
>617  struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>618  struct sk_buff *skb = NULL;
>619  int band;
>620
>621  for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>622  struct skb_array *q = band2list(priv, band);
>623
>>> 624  if (__skb_array_empty(q))
>625  continue;
>626
>627  skb = skb_array_consume_bh(q);
>628  }
>629  if (likely(skb)) {
>630  qdisc_qstats_cpu_backlog_dec(qdisc, skb);
>631  qdisc_bstats_cpu_update(qdisc, skb);
>632  qdisc_qstats_cpu_qlen_dec(qdisc);
>633  }
>634
>635  return skb;
>636  }

Hi, Jakub

Could you test the attached patch? It looks like the __skb_array_empty()
use is unsafe.

Thanks!
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 10aaa3b615ce..8d47fb4aadb4 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -621,10 +621,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc 
*qdisc)
 
for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
struct skb_array *q = band2list(priv, band);
-
-   if (__skb_array_empty(q))
-   continue;
-
skb = skb_array_consume_bh(q);
}
if (likely(skb)) {


Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"

2017-12-21 Thread Cong Wang
On Thu, Dec 21, 2017 at 12:39 AM, Jiri Pirko  wrote:
>
> Why just moving qdisc_free to rcu is not enough? It would resolve this
> issue and also avoid using synchronize net. Something like:

If you mean Jakub's issue, apparently not:
https://www.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/RCUCBordering.html

Jiri, you have to use a rcu barrier to wait for a rcu callback, not
queuing another rcu callback, the ordering is simply NOT guaranteed.

What's more importantly, you already have one rcu barrier in the
same function. Why keep believing you don't need it?


Re: [Patch net] net_sched: fix a missing rcu barrier in mini_qdisc_pair_swap()

2017-12-21 Thread Cong Wang
On Thu, Dec 21, 2017 at 11:01 AM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> On Thu, Dec 21, 2017 at 1:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
>>
>>
>> But again, we don't we just free qdisc in call_rcu and avoid the
>> barrier?
>
>
> Non-sense again. Why qdisc code should be adjusted for your
> miniq code? It is your own responsibility to take care of this shit.
> Don't spread it out of minq.

Also, in case you believe call_rcu to free qdisc is queued after
the call_rcu in miniq, you are wrong again:

https://www.kernel.org/pub/linux/kernel/people/paulmck/Answers/RCU/RCUCBordering.html

The rcu callbacks don't guarantee FIFO ordering.


Re: RCU callback crashes

2017-12-21 Thread Cong Wang
On Thu, Dec 21, 2017 at 8:26 AM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 12/20/2017 11:27 PM, Cong Wang wrote:
>> On Wed, Dec 20, 2017 at 4:50 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
>>> On Wed, 20 Dec 2017 16:41:14 -0800, Jakub Kicinski wrote:
>>>> Just as I hit send... :)  but this looks unrelated, "Comm: sshd" -
>>>> so probably from the management interface.
>>>>
>>>> [  154.604041] 
>>>> ==
>>>> [  154.612245] BUG: KASAN: slab-out-of-bounds in 
>>>> pfifo_fast_dequeue+0x140/0x2d0
>>>> [  154.620219] Read of size 8 at addr 88086bb64040 by task sshd/983
>>>> [  154.627403]
>>>> [  154.629161] CPU: 10 PID: 983 Comm: sshd Not tainted 
>>>> 4.15.0-rc3-perf-00984-g82d3fc87a4aa-dirty #13
>>>> [  154.639190] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 
>>>> 11/08/2016
>>>> [  154.647665] Call Trace:
>>>> [  154.650494]  dump_stack+0xa6/0x118
>>>> [  154.654387]  ? _atomic_dec_and_lock+0xe8/0xe8
>>>> [  154.659355]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
>>>> [  154.666263]  ? rcu_segcblist_enqueue+0xe9/0x120
>>>> [  154.671422]  ? _raw_spin_unlock_bh+0x91/0xc0
>>>> [  154.676286]  ? pfifo_fast_dequeue+0x140/0x2d0
>>>> [  154.681251]  print_address_description+0x6a/0x270
>>>> [  154.686601]  ? pfifo_fast_dequeue+0x140/0x2d0
>>>> [  154.691565]  kasan_report+0x23f/0x350
>>>> [  154.695752]  pfifo_fast_dequeue+0x140/0x2d0
>>>
>>> If we trust stack decode it's:
>>>
>>>615  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>>616  {
>>>617  struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>>>618  struct sk_buff *skb = NULL;
>>>619  int band;
>>>620
>>>621  for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>>622  struct skb_array *q = band2list(priv, band);
>>>623
>>>>> 624  if (__skb_array_empty(q))
>>>625  continue;
>>>626
>>>627  skb = skb_array_consume_bh(q);
>>>628  }
>>>629  if (likely(skb)) {
>>>630  qdisc_qstats_cpu_backlog_dec(qdisc, skb);
>>>631  qdisc_bstats_cpu_update(qdisc, skb);
>>>632  qdisc_qstats_cpu_qlen_dec(qdisc);
>>>633  }
>>>634
>>>635  return skb;
>>>636  }
>>
>> Yeah, this one is clearly a different one and it is introduced by John's
>> "lockless" patchset.
>>
>> I will take a look tomorrow if John doesn't.
>>
>
> I guess this path
>
>   dev_deactivate_many
> dev_deactivate_queue
>   qdisc_reset
>
> here we have the qdisc lock but no rcu call or sync before the reset
> does a kfree_skb and cleans up list walks. So possible for xmit path to
> also be pushing skbs onto the array/lists still. I don't think this is
> the issue triggered above but needs to be fixed

No, we already have a synchronize_net() there. It is probably just
a race with BH.

I think you missed the spinlock in dev_qdisc_reset(), at least
the comment right above qdisc_reset() says so.


diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 10aaa3b615ce..00ddb5f8f430 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
 {
struct Qdisc *qdisc = dev_queue->qdisc_sleeping;

-   if (qdisc)
+   if (qdisc) {
+   spin_lock_bh(qdisc_lock(qdisc));
qdisc_reset(qdisc);
+   spin_unlock_bh(qdisc_lock(qdisc));
+   }
 }

 /**


Re: [Patch net] net_sched: fix a missing rcu barrier in mini_qdisc_pair_swap()

2017-12-21 Thread Cong Wang
On Thu, Dec 21, 2017 at 1:03 AM, Jiri Pirko <j...@resnulli.us> wrote:
> Thu, Dec 21, 2017 at 08:26:24AM CET, xiyou.wangc...@gmail.com wrote:
>>The rcu_barrier_bh() in mini_qdisc_pair_swap() is to wait for
>>flying RCU callback installed by a previous mini_qdisc_pair_swap(),
>>however we miss it on the tp_head==NULL path, which leads to that
>>the RCU callback still uses miniq_old->rcu after it is freed together
>>with qdisc in qdisc_graft(). So just add it on that path too.
>>
>>Fixes: 46209401f8f6 ("net: core: introduce mini_Qdisc and eliminate usage of 
>>tp->q for clsact fastpath ")
>
> This fixes:
> 752fbcc33405 ("net_sched: no need to free qdisc in RCU callback")
>
> Before that, the issue was not there as the qdisc struct got removed
> after a grace period.


This is non-sense. You have to read the stack trace from Jakub again
and tell me why you keep believing any RCU reader involved.

I am pretty sure no one reported any crash between commit
752fbcc33405 and 46209401f8f6.


>
>
>>Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
>>Tested-by: Jakub Kicinski <jakub.kicin...@netronome.com>
>>Cc: Jiri Pirko <j...@mellanox.com>
>>Cc: John Fastabend <john.fastab...@gmail.com>
>>Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>>---
>> net/sched/sch_generic.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>>diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>index cd1b200acae7..661c7144b53a 100644
>>--- a/net/sched/sch_generic.c
>>+++ b/net/sched/sch_generic.c
>>@@ -1040,6 +1040,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair 
>>*miniqp,
>>
>>   if (!tp_head) {
>>   RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
>>+  /* Wait for flying RCU callback before it is freed. */
>>+  rcu_barrier_bh();
>
>
>>   return;
>>   }
>>
>>@@ -1055,7 +1057,7 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair 
>>*miniqp,
>>   rcu_assign_pointer(*miniqp->p_miniq, miniq);
>>
>>   if (miniq_old)
>>-  /* This is counterpart of the rcu barrier above. We need to
>>+  /* This is counterpart of the rcu barriers above. We need to
>
> This is incorrect. Here we block in order to not use the same miniq
> again in scenario
>
> miniq1 (X)
> miniq2
> miniq1 (yet there are reader using X)
>
> This call_rcu has 0 relation to the barrier you are adding.


Seriously? It is this call_rcu still flying after we free the qdisc.
Did you seriously look into the stack trace from Jakub?


>
>
> But again, we don't we just free qdisc in call_rcu and avoid the
> barrier?


Non-sense again. Why qdisc code should be adjusted for your
miniq code? It is your own responsibility to take care of this shit.
Don't spread it out of minq.


Re: RCU callback crashes

2017-12-20 Thread Cong Wang
On Wed, Dec 20, 2017 at 4:50 PM, Jakub Kicinski  wrote:
> On Wed, 20 Dec 2017 16:41:14 -0800, Jakub Kicinski wrote:
>> Just as I hit send... :)  but this looks unrelated, "Comm: sshd" -
>> so probably from the management interface.
>>
>> [  154.604041] 
>> ==
>> [  154.612245] BUG: KASAN: slab-out-of-bounds in 
>> pfifo_fast_dequeue+0x140/0x2d0
>> [  154.620219] Read of size 8 at addr 88086bb64040 by task sshd/983
>> [  154.627403]
>> [  154.629161] CPU: 10 PID: 983 Comm: sshd Not tainted 
>> 4.15.0-rc3-perf-00984-g82d3fc87a4aa-dirty #13
>> [  154.639190] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 
>> 11/08/2016
>> [  154.647665] Call Trace:
>> [  154.650494]  dump_stack+0xa6/0x118
>> [  154.654387]  ? _atomic_dec_and_lock+0xe8/0xe8
>> [  154.659355]  ? trace_event_raw_event_rcu_torture_read+0x190/0x190
>> [  154.666263]  ? rcu_segcblist_enqueue+0xe9/0x120
>> [  154.671422]  ? _raw_spin_unlock_bh+0x91/0xc0
>> [  154.676286]  ? pfifo_fast_dequeue+0x140/0x2d0
>> [  154.681251]  print_address_description+0x6a/0x270
>> [  154.686601]  ? pfifo_fast_dequeue+0x140/0x2d0
>> [  154.691565]  kasan_report+0x23f/0x350
>> [  154.695752]  pfifo_fast_dequeue+0x140/0x2d0
>
> If we trust stack decode it's:
>
>615  static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>616  {
>617  struct pfifo_fast_priv *priv = qdisc_priv(qdisc);
>618  struct sk_buff *skb = NULL;
>619  int band;
>620
>621  for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>622  struct skb_array *q = band2list(priv, band);
>623
>>> 624  if (__skb_array_empty(q))
>625  continue;
>626
>627  skb = skb_array_consume_bh(q);
>628  }
>629  if (likely(skb)) {
>630  qdisc_qstats_cpu_backlog_dec(qdisc, skb);
>631  qdisc_bstats_cpu_update(qdisc, skb);
>632  qdisc_qstats_cpu_qlen_dec(qdisc);
>633  }
>634
>635  return skb;
>636  }

Yeah, this one is clearly a different one and it is introduced by John's
"lockless" patchset.

I will take a look tomorrow if John doesn't.


Re: RCU callback crashes

2017-12-20 Thread Cong Wang
On Wed, Dec 20, 2017 at 4:37 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
> On Wed, 20 Dec 2017 16:03:49 -0800, Cong Wang wrote:
>> On Wed, Dec 20, 2017 at 10:31 AM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>> > On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangc...@gmail.com> 
>> > wrote:
>> >>
>> >> I guess it is q->miniqp which is freed in qdisc_graft() without properly
>> >> waiting for rcu readers?
>> >
>> > It is probably so, the call_rcu_bh(_old->rcu, mini_qdisc_rcu_func)
>> > in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
>> > but miniq is being freed, no rcu barrier waits for it...
>> >
>> > You can try to add a rcu_barrier_bh() at the end to see if this crash
>> > is gone, but I don't think people like adding yet another rcu barrier...
>>
>> Hi, Jakub
>>
>> Can you test the following fix? I am not a fan of rcu barrier but we
>> already have one so...
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 876fab2604b8..1b68fedea124 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -1240,6 +1240,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair 
>> *miniqp,
>>
>> if (!tp_head) {
>> RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
>> +   /* Wait for existing flying RCU callback before being freed. 
>> */
>> +   rcu_barrier_bh();
>> return;
>> }
>
> Looks good after 30 minutes, feel free to add if you post officially:
>
> Tested-by: Jakub Kicinski <jakub.kicin...@netronome.com>

Thanks for testing! I just sent a formal patch out.


[Patch net] net_sched: fix a missing rcu barrier in mini_qdisc_pair_swap()

2017-12-20 Thread Cong Wang
The rcu_barrier_bh() in mini_qdisc_pair_swap() is to wait for
flying RCU callback installed by a previous mini_qdisc_pair_swap(),
however we miss it on the tp_head==NULL path, which leads to that
the RCU callback still uses miniq_old->rcu after it is freed together
with qdisc in qdisc_graft(). So just add it on that path too.

Fixes: 46209401f8f6 ("net: core: introduce mini_Qdisc and eliminate usage of 
tp->q for clsact fastpath ")
Reported-by: Jakub Kicinski <jakub.kicin...@netronome.com>
Tested-by: Jakub Kicinski <jakub.kicin...@netronome.com>
Cc: Jiri Pirko <j...@mellanox.com>
Cc: John Fastabend <john.fastab...@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
---
 net/sched/sch_generic.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index cd1b200acae7..661c7144b53a 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1040,6 +1040,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
 
if (!tp_head) {
RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
+   /* Wait for flying RCU callback before it is freed. */
+   rcu_barrier_bh();
return;
}
 
@@ -1055,7 +1057,7 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
rcu_assign_pointer(*miniqp->p_miniq, miniq);
 
if (miniq_old)
-   /* This is counterpart of the rcu barrier above. We need to
+   /* This is counterpart of the rcu barriers above. We need to
 * block potential new user of miniq_old until all readers
 * are not seeing it.
 */
-- 
2.13.6



Re: RCU callback crashes

2017-12-20 Thread Cong Wang
On Wed, Dec 20, 2017 at 10:31 AM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>>
>> I guess it is q->miniqp which is freed in qdisc_graft() without properly
>> waiting for rcu readers?
>
> It is probably so, the call_rcu_bh(_old->rcu, mini_qdisc_rcu_func)
> in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
> but miniq is being freed, no rcu barrier waits for it...
>
> You can try to add a rcu_barrier_bh() at the end to see if this crash
> is gone, but I don't think people like adding yet another rcu barrier...

Hi, Jakub

Can you test the following fix? I am not a fan of rcu barrier but we
already have one so...

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 876fab2604b8..1b68fedea124 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1240,6 +1240,8 @@ void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,

if (!tp_head) {
RCU_INIT_POINTER(*miniqp->p_miniq, NULL);
+   /* Wait for existing flying RCU callback before being freed. */
+   rcu_barrier_bh();
return;
}


Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"

2017-12-20 Thread Cong Wang
On Wed, Dec 20, 2017 at 3:05 PM, John Fastabend
<john.fastab...@gmail.com> wrote:
> On 12/20/2017 02:41 PM, Cong Wang wrote:
>> On Wed, Dec 20, 2017 at 12:09 PM, John Fastabend
>> <john.fastab...@gmail.com> wrote:
>>> RCU grace period is needed for lockless qdiscs added in the commit
>>> c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array").
>>>
>>> It is needed now that qdiscs may be lockless otherwise we risk
>>> free'ing a qdisc that is still in use from datapath. Additionally,
>>> push list cleanup into RCU callback. Otherwise we risk the datapath
>>> adding skbs during removal.
>>
>> What about qdisc_graft() -> dev_deactivate() -> synchronize_net() ?
>> It doesn't work with your "lockless" patches?
>>
>
> Well this is only in the 'parent == NULL' case otherwise we call
> cops->graft(). Most sch_* seem to use qdisc_replace and this uses
> sch_tree_lock().
>
> The only converted qdisc mq and mqprio at this point don't care
> though and do their own dev_deactivate/activate. So its not fixing
> anything in the above mentioned commit.

Sure, removing a class does not impact the whole device,
but removing the root qdisc does.

After your "lockless", skb_array_consume_bh() is called in
pfifo_fast_reset() and ptr_ring_cleanup() is called in
pfifo_fast_destroy(), assuming skb_array is not buggy, what race
do we have here with datapath?


>
> I still think it will need to be done eventually. If it resolves
> the miniq case it seems like a good idea. Although per Jakub's comment
> perhaps I pulled too much into the RCU handler.

The case Jakub reported is a RCU callback missing a rcu
barrier. I don't understand why you keep believing it is RCU
readers on datapath.

Not even to mention ingress is not affected by your "lockless"
thing.


Re: [PATCH] net: Revert "net_sched: no need to free qdisc in RCU callback"

2017-12-20 Thread Cong Wang
On Wed, Dec 20, 2017 at 12:09 PM, John Fastabend
 wrote:
> RCU grace period is needed for lockless qdiscs added in the commit
> c5ad119fb6c09 ("net: sched: pfifo_fast use skb_array").
>
> It is needed now that qdiscs may be lockless otherwise we risk
> free'ing a qdisc that is still in use from datapath. Additionally,
> push list cleanup into RCU callback. Otherwise we risk the datapath
> adding skbs during removal.

What about qdisc_graft() -> dev_deactivate() -> synchronize_net() ?
It doesn't work with your "lockless" patches?


Re: RCU callback crashes

2017-12-20 Thread Cong Wang
On Wed, Dec 20, 2017 at 12:23 PM, John Fastabend
 wrote:
> I'm trying to see how removing that rcu grace period was safe in the
> first place. The datapath is using rcu_read critical section to protect
> the qdisc but the control path (a) doesn't use rcu grace period and (b)
> doesn't use the qidisc lock. Going to go get a coffee and I'll think
> about it a bit more. Any ideas Cong?

qdisc_graft() -> dev_deactivate() -> synchronize_net() is the reason
you want to find?

Also, you can try `git log` to see why it was introduced in the beginning,
here it is:

commit 5d944c640b4ae5f37c537acf491c2f0eb89fa0d6
Author: Eric Dumazet 
Date:   Wed Mar 31 07:06:04 2010 +

gen_estimator: deadlock fix


Re: RCU callback crashes

2017-12-20 Thread Cong Wang
On Wed, Dec 20, 2017 at 12:14 PM, John Fastabend
 wrote:
>
> Hi,
>
> Just sent a patch to complete qdisc_destroy from rcu callback. This
> is needed to resolve a race with the lockless qdisc patches.
>
> But I guess it should fix the miniq issue as well?


If you ever look into tools/testing/selftests/bpf/test_offload.py, it is
ingress qdisc. Don't know why you keep believing it is related
to your patches.


Re: RCU callback crashes

2017-12-20 Thread Cong Wang
On Wed, Dec 20, 2017 at 10:17 AM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
>
> I guess it is q->miniqp which is freed in qdisc_graft() without properly
> waiting for rcu readers?

It is probably so, the call_rcu_bh(_old->rcu, mini_qdisc_rcu_func)
in the end of mini_qdisc_pair_swap() is invoked on miniq_old->rcu,
but miniq is being freed, no rcu barrier waits for it...

You can try to add a rcu_barrier_bh() at the end to see if this crash
is gone, but I don't think people like adding yet another rcu barrier...


<    1   2   3   4   5   6   7   8   9   10   >