On Thu 13 Jun 2019 at 14:11, Jiri Pirko <[email protected]> wrote:
> Thu, Jun 13, 2019 at 12:09:32PM CEST, [email protected] wrote:
>>On Thu 13 Jun 2019 at 11:11, Jiri Pirko <[email protected]> wrote:
>>> I made a mistake during measurements, sorry about that.
>>>
>>> This is the correct script:
>>> -----------------------------------------------------------------------
>>> #!/bin/bash
>>>
>>> dev=testdummy
>>> ip link add name $dev type dummy
>>> ip link set dev $dev up
>>> tc qdisc add dev $dev ingress
>>>
>>> tmp_file_name=$(date +"/tmp/tc_batch.%s.%N.tmp")
>>> pref_id=1
>>>
>>> while [ $pref_id -lt 20000 ]
>>> do
>>>     echo "filter add dev $dev ingress proto ip pref $pref_id flower action drop" >> $tmp_file_name
>>>     #echo "filter add dev $dev ingress proto ip pref $pref_id matchall action drop" >> $tmp_file_name
>>>     ((pref_id++))
>>> done
>>>
>>> start=$(date +"%s")
>>> tc -b $tmp_file_name
>>> stop=$(date +"%s")
>>> echo "Insertion duration: $(($stop - $start)) sec"
>>> rm -f $tmp_file_name
>>>
>>> ip link del dev $dev
>>> -----------------------------------------------------------------------
>>>
>>> Note the commented-out matchall. I don't see the regression with
>>> matchall. However, I do see it with flower:
>>> kernel 5.1
>>> Insertion duration: 4 sec
>>> kernel 5.2
>>> Insertion duration: 163 sec
>>>
>>> I don't see any significant difference in perf:
>>> kernel 5.1
>>>  77.24%  tc               [kernel.vmlinux]   [k] tcf_chain_tp_find
>>>   1.67%  tc               [kernel.vmlinux]   [k] mutex_spin_on_owner
>>>   1.44%  tc               [kernel.vmlinux]   [k] _raw_spin_unlock_irqrestore
>>>   0.93%  tc               [kernel.vmlinux]   [k] idr_get_free
>>>   0.79%  tc_pref_scale_o  [kernel.vmlinux]   [k] do_syscall_64
>>>   0.69%  tc               [kernel.vmlinux]   [k] finish_task_switch
>>>   0.53%  tc               libc-2.28.so       [.] __memset_sse2_unaligned_erms
>>>   0.49%  tc               [kernel.vmlinux]   [k] __memset
>>>   0.36%  tc_pref_scale_o  libc-2.28.so       [.] malloc
>>>   0.30%  tc_pref_scale_o  libc-2.28.so       [.] _int_free
>>>   0.24%  tc               [kernel.vmlinux]   [k] __memcpy
>>>   0.23%  tc               [cls_flower]       [k] fl_change
>>>   0.23%  tc               [kernel.vmlinux]   [k] __nla_validate_parse
>>>   0.22%  tc               [kernel.vmlinux]   [k] __slab_alloc
>>>
>>> kernel 5.2
>>>  75.57%  tc               [kernel.kallsyms]  [k] tcf_chain_tp_find
>>>   2.70%  tc               [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>>>   1.13%  tc_pref_scale_o  [kernel.kallsyms]  [k] do_syscall_64
>>>   0.87%  tc               libc-2.28.so       [.] __memset_sse2_unaligned_erms
>>>   0.86%  ip               [kernel.kallsyms]  [k] finish_task_switch
>>>   0.67%  tc               [kernel.kallsyms]  [k] memset
>>>   0.63%  tc               [kernel.kallsyms]  [k] mutex_spin_on_owner
>>>   0.52%  tc_pref_scale_o  libc-2.28.so       [.] malloc
>>>   0.48%  tc               [kernel.kallsyms]  [k] idr_get_free
>>>   0.46%  tc               [kernel.kallsyms]  [k] fl_change
>>>   0.42%  tc_pref_scale_o  libc-2.28.so       [.] _int_free
>>>   0.35%  tc_pref_scale_o  libc-2.28.so       [.] __GI___strlen_sse2
>>>   0.35%  tc_pref_scale_o  libc-2.28.so       [.] __mbrtowc
>>>   0.34%  tc_pref_scale_o  libc-2.28.so       [.] __fcntl64_nocancel_adjusted
>>>
>>> Any ideas?
>>
>>Thanks for providing the reproduction script!
>>
>>I've investigated the problem and found the root cause. First of all I
>>noticed that CPU utilization during the problematic tc run is quite low
>>(<10%), so I decided to investigate why tc sleeps so much. I used bcc
>>and obtained the following off-CPU trace (uninteresting traces are
>>omitted for brevity):
>>
>>~$ sudo /usr/share/bcc/tools/offcputime -K -p `pgrep -nx tc`
>>Tracing off-CPU time (us) of PID 2069 by kernel stack... Hit Ctrl-C to end.
>>...
>>    finish_task_switch
>>    __sched_text_start
>>    schedule
>>    schedule_timeout
>>    wait_for_completion
>>    __wait_rcu_gp
>>    synchronize_rcu
>>    fl_change
>>    tc_new_tfilter
>>    rtnetlink_rcv_msg
>>    netlink_rcv_skb
>>    netlink_unicast
>>    netlink_sendmsg
>>    sock_sendmsg
>>    ___sys_sendmsg
>>    __sys_sendmsg
>>    do_syscall_64
>>    entry_SYSCALL_64_after_hwframe
>>    -                tc (2069)
>>        142284953
>>
>>As you can see, 142 seconds are spent sleeping in synchronize_rcu()
>>(142284953 us over ~20000 inserts works out to roughly 7 ms per call,
>>i.e. one RCU grace period per filter insertion). The call is in
>>fl_create_new_mask():
>>
>>	err = rhashtable_replace_fast(&head->ht, &mask->ht_node,
>>				      &newmask->ht_node, mask_ht_params);
>>	if (err)
>>		goto errout_destroy;
>>
>>	/* Wait until any potential concurrent users of mask are finished */
>>	synchronize_rcu();
>>
>>The justification for this is described in a comment in
>>fl_check_assign_mask() (a user of fl_create_new_mask()):
>>
>>	/* Insert mask as temporary node to prevent concurrent creation of mask
>>	 * with same key. Any concurrent lookups with same key will return
>>	 * -EAGAIN because mask's refcnt is zero. It is safe to insert
>>	 * stack-allocated 'mask' to masks hash table because we call
>>	 * synchronize_rcu() before returning from this function (either in case
>>	 * of error or after replacing it with heap-allocated mask in
>>	 * fl_create_new_mask()).
>>	 */
>>	fnew->mask = rhashtable_lookup_get_insert_fast(&head->ht,
>>						       &mask->ht_node,
>>						       mask_ht_params);
>>
>>The offending commit is part of my series that implements unlocked
>>flower: 195c234d15c9 ("net: sched: flower: handle concurrent mask
>>insertion")
>>
>>The justification presented there is no longer relevant since Ivan
>>Vecera changed the mask to be dynamically allocated in commit
>>2cddd2014782 ("net/sched: cls_flower: allocate mask dynamically in
>>fl_change()"). With this we can simply change fl_change() to deallocate
>>the temporary mask after an RCU grace period and remove the offending
>>synchronize_rcu() call.
>>
>>Any other suggestions?
>
> So basically you just change synchronize_rcu() to kfree_rcu(mask),
> correct?
Not really. I remove synchronize_rcu() and change all kfree(mask) calls
in fl_change() to tcf_queue_work(&mask->rwork, fl_mask_free_work), which
uses queue_rcu_work() internally. This lets us deallocate fl_flow_mask
in the same manner on all code paths and doesn't require extending the
fl_flow_mask struct (kfree_rcu() would require adding an rcu_head to it).
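
To illustrate, here is a rough sketch of the pattern, not the actual
patch. tcf_queue_work() is the existing helper in net/sched/cls_api.c;
fl_mask_free_work() is simplified to a bare kfree() here, while the real
worker also has to destroy the filter hashtable embedded in the mask:

-----------------------------------------------------------------------
/* Existing helper in net/sched/cls_api.c: queue_rcu_work() defers the
 * work item until after an RCU grace period has elapsed, without
 * blocking the caller.
 */
bool tcf_queue_work(struct rcu_work *rwork, work_func_t func)
{
	INIT_RCU_WORK(rwork, func);
	return queue_rcu_work(tc_filter_wq, rwork);
}

/* Runs only after the grace period, so no concurrent RCU reader can
 * still hold a pointer to the temporary mask at this point.
 */
static void fl_mask_free_work(struct work_struct *work)
{
	struct fl_flow_mask *mask = container_of(to_rcu_work(work),
						 struct fl_flow_mask, rwork);

	kfree(mask);	/* simplified; real worker does full mask cleanup */
}

/* In fl_change() cleanup paths, the synchronous pattern
 *
 *	synchronize_rcu();	// blocks one grace period per insert
 *	kfree(mask);
 *
 * becomes a non-blocking deferred free:
 *
 *	tcf_queue_work(&mask->rwork, fl_mask_free_work);
 */
-----------------------------------------------------------------------

fl_flow_mask already embeds the struct rcu_work rwork for regular mask
teardown, which is why this approach needs no struct changes.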

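For comparison, a sketch of what the kfree_rcu() variant you suggest
would require, with the rcu_head member being exactly the extension I'd
like to avoid:

-----------------------------------------------------------------------
struct fl_flow_mask {
	/* ... existing members, including struct rcu_work rwork ... */
	struct rcu_head rcu;	/* would have to be added for kfree_rcu() */
};

	/* each kfree(mask) in fl_change() would then become: */
	kfree_rcu(mask, rcu);
-----------------------------------------------------------------------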