Re: tc tp creation performance degratation since kernel 5.1

Jiri Pirko Thu, 13 Jun 2019 08:31:43 -0700

Thu, Jun 13, 2019 at 12:09:32PM CEST, [email protected] wrote:
>On Thu 13 Jun 2019 at 11:11, Jiri Pirko <[email protected]> wrote:
>> I made a mistake during measurements, sorry about that.
>>
>> This is the correct script:
>> -----------------------------------------------------------------------
>> #!/bin/bash
>>
>> dev=testdummy
>> ip link add name $dev type dummy
>> ip link set dev $dev up
>> tc qdisc add dev $dev ingress
>>
>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
>> pref_id=1
>>
>> while [ $pref_id -lt 20000 ]
>> do
>>         echo "filter add dev $dev ingress proto ip pref $pref_id flower 
>> action drop" >> $tmp_file_name
>>         #echo "filter add dev $dev ingress proto ip pref $pref_id matchall 
>> action drop" >> $tmp_file_name
>>         ((pref_id++))
>> done
>>
>> start=$(date +"%s")
>> tc -b $tmp_file_name
>> stop=$(date +"%s")
>> echo "Insertion duration: $(($stop - $start)) sec"
>> rm -f $tmp_file_name
>>
>> ip link del dev $dev
>> -----------------------------------------------------------------------
>>
>> Note the commented out matchall. I don't see the regression with
>> matchall. However, I see that with flower:
>> kernel 5.1
>> Insertion duration: 4 sec
>> kernel 5.2
>> Insertion duration: 163 sec
>>
>> I don't see any significant difference in perf:
>> kernel 5.1
>>     77.24%  tc               [kernel.vmlinux]    [k] tcf_chain_tp_find
>>      1.67%  tc               [kernel.vmlinux]    [k] mutex_spin_on_owner
>>      1.44%  tc               [kernel.vmlinux]    [k] 
>> _raw_spin_unlock_irqrestore
>>      0.93%  tc               [kernel.vmlinux]    [k] idr_get_free
>>      0.79%  tc_pref_scale_o  [kernel.vmlinux]    [k] do_syscall_64
>>      0.69%  tc               [kernel.vmlinux]    [k] finish_task_switch
>>      0.53%  tc               libc-2.28.so        [.] 
>> __memset_sse2_unaligned_erms
>>      0.49%  tc               [kernel.vmlinux]    [k] __memset
>>      0.36%  tc_pref_scale_o  libc-2.28.so        [.] malloc
>>      0.30%  tc_pref_scale_o  libc-2.28.so        [.] _int_free
>>      0.24%  tc               [kernel.vmlinux]    [k] __memcpy
>>      0.23%  tc               [cls_flower]        [k] fl_change
>>      0.23%  tc               [kernel.vmlinux]    [k] __nla_validate_parse
>>      0.22%  tc               [kernel.vmlinux]    [k] __slab_alloc
>>
>>
>>     75.57%  tc               [kernel.kallsyms]  [k] tcf_chain_tp_find
>>      2.70%  tc               [kernel.kallsyms]  [k] 
>> _raw_spin_unlock_irqrestore
>>      1.13%  tc_pref_scale_o  [kernel.kallsyms]  [k] do_syscall_64
>>      0.87%  tc               libc-2.28.so       [.] 
>> __memset_sse2_unaligned_erms
>>      0.86%  ip               [kernel.kallsyms]  [k] finish_task_switch
>>      0.67%  tc               [kernel.kallsyms]  [k] memset
>>      0.63%  tc               [kernel.kallsyms]  [k] mutex_spin_on_owner
>>      0.52%  tc_pref_scale_o  libc-2.28.so       [.] malloc
>>      0.48%  tc               [kernel.kallsyms]  [k] idr_get_free
>>      0.46%  tc               [kernel.kallsyms]  [k] fl_change
>>      0.42%  tc_pref_scale_o  libc-2.28.so       [.] _int_free
>>      0.35%  tc_pref_scale_o  libc-2.28.so       [.] __GI___strlen_sse2
>>      0.35%  tc_pref_scale_o  libc-2.28.so       [.] __mbrtowc
>>      0.34%  tc_pref_scale_o  libc-2.28.so       [.] 
>> __fcntl64_nocancel_adjusted
>>
>> Any ideas?
>
>Thanks for providing reproduction script!
>
>I've investigate the problem and found the root cause. First of all I
>noticed that CPU utilization during problematic tc run is quite low
>(<10%), so I decided to investigate why tc sleeps so much. I've used bcc
>and obtained following off-CPU trace (uninteresting traces are omitted
>for brevity):
>
>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
>...
>    finish_task_switch
>    __sched_text_start
>    schedule
>    schedule_timeout
>    wait_for_completion
>    __wait_rcu_gp
>    synchronize_rcu
>    fl_change
>    tc_new_tfilter
>    rtnetlink_rcv_msg
>    netlink_rcv_skb
>    netlink_unicast
>    netlink_sendmsg
>    sock_sendmsg
>    ___sys_sendmsg
>    __sys_sendmsg
>    do_syscall_64
>    entry_SYSCALL_64_after_hwframe
>    -                tc (2069)
>        142284953
>
>As you can see 142 seconds are spent sleeping in synchronize_rcu(). The
>code is in fl_create_new_mask() function:
>
>       err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
>                                     &newmask->ht_node, mask_ht_params);
>       if (err)
>               goto errout_destroy;
>
>       /* Wait until any potential concurrent users of mask are finished */
>       synchronize_rcu();
>
>The justification for this is described in comment in
>fl_check_assign_mask() (user of fl_create_new_mask()):
>
>       /* Insert mask as temporary node to prevent concurrent creation of mask
>        * with same key. Any concurrent lookups with same key will return
>        * -EAGAIN because mask's refcnt is zero. It is safe to insert
>        * stack-allocated 'mask' to masks hash table because we call
>        * synchronize_rcu() before returning from this function (either in case
>        * of error or after replacing it with heap-allocated mask in
>        * fl_create_new_mask()).
>        */
>       fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
>                                                      &mask->ht_node,
>                                                      mask_ht_params);
>
>The offending commit is part of my series that implements unlocked
>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
>insertion")
>
>The justification presented in it is no longer relevant since Ivan
>Vecera changed mask to be dynamically allocated in commit 2cddd2014782
>("net/sched: cls_flower: allocate mask dynamically in fl_change()").
>With this we can just change fl_change() to deallocate temporary mask
>with rcu grace period and remove offending syncrhonize_rcu() call.
>
>Any other suggestions?


So basically you just change synchronize_rcu() to kfree_rcu(mask),
   correct?

Re: tc tp creation performance degratation since kernel 5.1

Reply via email to