Thu, Jun 13, 2019 at 01:26:17PM CEST, [email protected] wrote:
>
>On Thu 13 Jun 2019 at 14:11, Jiri Pirko <[email protected]> wrote:
>> Thu, Jun 13, 2019 at 12:09:32PM CEST, [email protected] wrote:
>>>On Thu 13 Jun 2019 at 11:11, Jiri Pirko <[email protected]> wrote:
>>>> I made a mistake during measurements, sorry about that.
>>>>
>>>> This is the correct script:
>>>> -----------------------------------------------------------------------
>>>> #!/bin/bash
>>>>
>>>> dev=testdummy
>>>> ip link add name $dev type dummy
>>>> ip link set dev $dev up
>>>> tc qdisc add dev $dev ingress
>>>>
>>>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
>>>> pref_id=1
>>>>
>>>> while [ $pref_id -lt 20000 ]
>>>> do
>>>> echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
>>>> #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>>>> ((pref_id++))
>>>> done
>>>>
>>>> start=$(date +"%s")
>>>> tc -b $tmp_file_name
>>>> stop=$(date +"%s")
>>>> echo "Insertion duration: $(($stop - $start)) sec"
>>>> rm -f $tmp_file_name
>>>>
>>>> ip link del dev $dev
>>>> -----------------------------------------------------------------------
>>>>
>>>> Note the commented-out matchall. I don't see the regression with
>>>> matchall. However, I do see it with flower:
>>>> kernel 5.1
>>>> Insertion duration: 4 sec
>>>> kernel 5.2
>>>> Insertion duration: 163 sec
>>>>
>>>> I don't see any significant difference in perf:
>>>> kernel 5.1
>>>>    77.24%  tc               [kernel.vmlinux]   [k] tcf_chain_tp_find
>>>>     1.67%  tc               [kernel.vmlinux]   [k] mutex_spin_on_owner
>>>>     1.44%  tc               [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
>>>>     0.93%  tc               [kernel.vmlinux]   [k] idr_get_free
>>>>     0.79%  tc_pref_scale_o  [kernel.vmlinux]   [k] do_syscall_64
>>>>     0.69%  tc               [kernel.vmlinux]   [k] finish_task_switch
>>>>     0.53%  tc               libc-2.28.so       [.] __memset_sse2_unaligned_erms
>>>>     0.49%  tc               [kernel.vmlinux]   [k] __memset
>>>>     0.36%  tc_pref_scale_o  libc-2.28.so       [.] malloc
>>>>     0.30%  tc_pref_scale_o  libc-2.28.so       [.] _int_free
>>>>     0.24%  tc               [kernel.vmlinux]   [k] __memcpy
>>>>     0.23%  tc               [cls_flower]       [k] fl_change
>>>>     0.23%  tc               [kernel.vmlinux]   [k] __nla_validate_parse
>>>>     0.22%  tc               [kernel.vmlinux]   [k] __slab_alloc
>>>>
>>>> kernel 5.2
>>>>    75.57%  tc               [kernel.kallsyms]  [k] tcf_chain_tp_find
>>>>     2.70%  tc               [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>>>>     1.13%  tc_pref_scale_o  [kernel.kallsyms]  [k] do_syscall_64
>>>>     0.87%  tc               libc-2.28.so       [.] __memset_sse2_unaligned_erms
>>>>     0.86%  ip               [kernel.kallsyms]  [k] finish_task_switch
>>>>     0.67%  tc               [kernel.kallsyms]  [k] memset
>>>>     0.63%  tc               [kernel.kallsyms]  [k] mutex_spin_on_owner
>>>>     0.52%  tc_pref_scale_o  libc-2.28.so       [.] malloc
>>>>     0.48%  tc               [kernel.kallsyms]  [k] idr_get_free
>>>>     0.46%  tc               [kernel.kallsyms]  [k] fl_change
>>>>     0.42%  tc_pref_scale_o  libc-2.28.so       [.] _int_free
>>>>     0.35%  tc_pref_scale_o  libc-2.28.so       [.] __GI___strlen_sse2
>>>>     0.35%  tc_pref_scale_o  libc-2.28.so       [.] __mbrtowc
>>>>     0.34%  tc_pref_scale_o  libc-2.28.so       [.] __fcntl64_nocancel_adjusted
>>>>
>>>> Any ideas?
>>>
>>>Thanks for providing the reproduction script!
>>>
>>>I've investigated the problem and found the root cause. First of all I
>>>noticed that CPU utilization during the problematic tc run is quite low
>>>(<10%), so I decided to investigate why tc sleeps so much.
>>>I've used bcc and obtained the following off-CPU trace (uninteresting
>>>traces are omitted for brevity):
>>>
>>>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
>>>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
>>>...
>>>    finish_task_switch
>>>    __sched_text_start
>>>    schedule
>>>    schedule_timeout
>>>    wait_for_completion
>>>    __wait_rcu_gp
>>>    synchronize_rcu
>>>    fl_change
>>>    tc_new_tfilter
>>>    rtnetlink_rcv_msg
>>>    netlink_rcv_skb
>>>    netlink_unicast
>>>    netlink_sendmsg
>>>    sock_sendmsg
>>>    ___sys_sendmsg
>>>    __sys_sendmsg
>>>    do_syscall_64
>>>    entry_SYSCALL_64_after_hwframe
>>>    -                tc (2069)
>>>        142284953
>>>
>>>As you can see, 142 seconds are spent sleeping in synchronize_rcu(). The
>>>code is in the fl_create_new_mask() function:
>>>
>>>	err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
>>>				      &newmask->ht_node, mask_ht_params);
>>>	if (err)
>>>		goto errout_destroy;
>>>
>>>	/* Wait until any potential concurrent users of mask are finished */
>>>	synchronize_rcu();
>>>
>>>The justification for this is described in the comment in
>>>fl_check_assign_mask() (a user of fl_create_new_mask()):
>>>
>>>	/* Insert mask as temporary node to prevent concurrent creation of mask
>>>	 * with same key. Any concurrent lookups with same key will return
>>>	 * -EAGAIN because mask's refcnt is zero. It is safe to insert
>>>	 * stack-allocated 'mask' to masks hash table because we call
>>>	 * synchronize_rcu() before returning from this function (either in case
>>>	 * of error or after replacing it with heap-allocated mask in
>>>	 * fl_create_new_mask()).
>>>	 */
>>>	fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
>>>						       &mask->ht_node,
>>>						       mask_ht_params);
>>>
>>>The offending commit is part of my series that implements unlocked
>>>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
>>>insertion").
>>>
>>>The justification presented in it is no longer relevant since Ivan
>>>Vecera changed the mask to be dynamically allocated in commit 2cddd2014782
>>>("net/sched: cls_flower: allocate mask dynamically in fl_change()").
>>>With this we can just change fl_change() to deallocate the temporary mask
>>>with an RCU grace period and remove the offending synchronize_rcu() call.
>>>
>>>Any other suggestions?
>>
>> So basically you just change synchronize_rcu() to kfree_rcu(mask),
>> correct?
>
>Not really. I remove synchronize_rcu() and change all kfree(mask) calls in
>fl_change() to tcf_queue_work(&mask->rwork, fl_mask_free_work), which
>uses queue_rcu_work() internally. This would allow us to deallocate
>fl_flow_mask in the same manner on all code paths and wouldn't require any
>extensions to the fl_flow_mask struct (kfree_rcu() would require extending
>it with an rcu_head).
Got it. Makes sense to me. Thanks!
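
For concreteness, below is a minimal sketch of the change described above: drop the synchronize_rcu() call and defer the temporary mask free through tcf_queue_work(). The struct layout and the body of fl_mask_free_work() here are simplified assumptions for illustration, not the actual cls_flower patch; the only part taken directly from the thread is that fl_flow_mask carries a struct rcu_work rwork member used with tcf_queue_work(&mask->rwork, fl_mask_free_work).

#include <linux/slab.h>
#include <linux/workqueue.h>
#include <net/pkt_cls.h>

/* Heavily simplified stand-in for the real struct fl_flow_mask in
 * net/sched/cls_flower.c; only the member relevant to the deferred free
 * is shown.
 */
struct fl_flow_mask {
	/* ... key, ht_node, refcnt, filters list, etc. ... */
	struct rcu_work rwork;	/* used by tcf_queue_work() for deferred free */
};

/* Work callback: tcf_queue_work() queues it via queue_rcu_work(), so it
 * only runs after an RCU grace period has elapsed and no concurrent
 * reader can still hold a reference to the temporary mask.
 */
static void fl_mask_free_work(struct work_struct *work)
{
	struct fl_flow_mask *mask = container_of(to_rcu_work(work),
						 struct fl_flow_mask, rwork);

	kfree(mask);
}

/* On the fl_change() cleanup paths that previously did kfree(mask):
 *
 *	tcf_queue_work(&mask->rwork, fl_mask_free_work);
 *
 * With the free deferred past a grace period on every path, the
 * synchronize_rcu() after rhashtable_replace_fast() in
 * fl_create_new_mask() can simply be removed: the grace period is now
 * waited for asynchronously instead of blocking each filter insertion.
 * The kfree_rcu(mask, ...) alternative would achieve the same deferral
 * but, as noted above, would require adding an rcu_head to the struct.
 */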
