Thu, Jun 13, 2019 at 12:09:32PM CEST, [email protected] wrote: >On Thu 13 Jun 2019 at 11:11, Jiri Pirko <[email protected]> wrote: >> I made a mistake during measurements, sorry about that. >> >> This is the correct script: >> ----------------------------------------------------------------------- >> #!/bin/bash >> >> dev=testdummy >> ip link add name $dev type dummy >> ip link set dev $dev up >> tc qdisc add dev $dev ingress >> >> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp") >> pref_id=1 >> >> while [ $pref_id -lt 20000 ] >> do >> echo "filter add dev $dev ingress proto ip pref $pref_id flower >> action drop" >> $tmp_file_name >> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall >> action drop" >> $tmp_file_name >> ((pref_id++)) >> done >> >> start=$(date +"%s") >> tc -b $tmp_file_name >> stop=$(date +"%s") >> echo "Insertion duration: $(($stop - $start)) sec" >> rm -f $tmp_file_name >> >> ip link del dev $dev >> ----------------------------------------------------------------------- >> >> Note the commented out matchall. I don't see the regression with >> matchall. However, I see that with flower: >> kernel 5.1 >> Insertion duration: 4 sec >> kernel 5.2 >> Insertion duration: 163 sec >> >> I don't see any significant difference in perf: >> kernel 5.1 >> 77.24% tc [kernel.vmlinux] [k] tcf_chain_tp_find >> 1.67% tc [kernel.vmlinux] [k] mutex_spin_on_owner >> 1.44% tc [kernel.vmlinux] [k] >> _raw_spin_unlock_irqrestore >> 0.93% tc [kernel.vmlinux] [k] idr_get_free >> 0.79% tc_pref_scale_o [kernel.vmlinux] [k] do_syscall_64 >> 0.69% tc [kernel.vmlinux] [k] finish_task_switch >> 0.53% tc libc-2.28.so [.] >> __memset_sse2_unaligned_erms >> 0.49% tc [kernel.vmlinux] [k] __memset >> 0.36% tc_pref_scale_o libc-2.28.so [.] malloc >> 0.30% tc_pref_scale_o libc-2.28.so [.] _int_free >> 0.24% tc [kernel.vmlinux] [k] __memcpy >> 0.23% tc [cls_flower] [k] fl_change >> 0.23% tc [kernel.vmlinux] [k] __nla_validate_parse >> 0.22% tc [kernel.vmlinux] [k] __slab_alloc >> >> >> 75.57% tc [kernel.kallsyms] [k] tcf_chain_tp_find >> 2.70% tc [kernel.kallsyms] [k] >> _raw_spin_unlock_irqrestore >> 1.13% tc_pref_scale_o [kernel.kallsyms] [k] do_syscall_64 >> 0.87% tc libc-2.28.so [.] >> __memset_sse2_unaligned_erms >> 0.86% ip [kernel.kallsyms] [k] finish_task_switch >> 0.67% tc [kernel.kallsyms] [k] memset >> 0.63% tc [kernel.kallsyms] [k] mutex_spin_on_owner >> 0.52% tc_pref_scale_o libc-2.28.so [.] malloc >> 0.48% tc [kernel.kallsyms] [k] idr_get_free >> 0.46% tc [kernel.kallsyms] [k] fl_change >> 0.42% tc_pref_scale_o libc-2.28.so [.] _int_free >> 0.35% tc_pref_scale_o libc-2.28.so [.] __GI___strlen_sse2 >> 0.35% tc_pref_scale_o libc-2.28.so [.] __mbrtowc >> 0.34% tc_pref_scale_o libc-2.28.so [.] >> __fcntl64_nocancel_adjusted >> >> Any ideas? > >Thanks for providing reproduction script! > >I've investigate the problem and found the root cause. First of all I >noticed that CPU utilization during problematic tc run is quite low >(<10%), so I decided to investigate why tc sleeps so much. I've used bcc >and obtained following off-CPU trace (uninteresting traces are omitted >for brevity): > >~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc` >Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end. >... > finish_task_switch > __sched_text_start > schedule > schedule_timeout > wait_for_completion > __wait_rcu_gp > synchronize_rcu > fl_change > tc_new_tfilter > rtnetlink_rcv_msg > netlink_rcv_skb > netlink_unicast > netlink_sendmsg > sock_sendmsg > ___sys_sendmsg > __sys_sendmsg > do_syscall_64 > entry_SYSCALL_64_after_hwframe > - tc (2069) > 142284953 > >As you can see 142 seconds are spent sleeping in synchronize_rcu(). The >code is in fl_create_new_mask() function: > > err = rhashtable_replace_fast(&head->ht, &mask->ht_node, > &newmask->ht_node, mask_ht_params); > if (err) > goto errout_destroy; > > /* Wait until any potential concurrent users of mask are finished */ > synchronize_rcu(); > >The justification for this is described in comment in >fl_check_assign_mask() (user of fl_create_new_mask()): > > /* Insert mask as temporary node to prevent concurrent creation of mask > * with same key. Any concurrent lookups with same key will return > * -EAGAIN because mask's refcnt is zero. It is safe to insert > * stack-allocated 'mask' to masks hash table because we call > * synchronize_rcu() before returning from this function (either in case > * of error or after replacing it with heap-allocated mask in > * fl_create_new_mask()). > */ > fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht, > &mask->ht_node, > mask_ht_params); > >The offending commit is part of my series that implements unlocked >flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask >insertion") > >The justification presented in it is no longer relevant since Ivan >Vecera changed mask to be dynamically allocated in commit 2cddd2014782 >("net/sched: cls_flower: allocate mask dynamically in fl_change()"). >With this we can just change fl_change() to deallocate temporary mask >with rcu grace period and remove offending syncrhonize_rcu() call. > >Any other suggestions?
So basically you just change synchronize_rcu() to kfree_rcu(mask), correct?
