At 2018-05-11 22:56:04, "Willem de Bruijn" <willemdebruijn.ker...@gmail.com> wrote:
>On Fri, May 11, 2018 at 10:44 AM, Gao Feng <gfree.w...@vip.163.com> wrote:
>> At 2018-05-11 21:23:55, "Willem de Bruijn" <willemdebruijn.ker...@gmail.com> wrote:
>>>On Fri, May 11, 2018 at 2:20 AM, Gao Feng <gfree.w...@vip.163.com> wrote:
>>>> At 2018-05-11 11:54:55, "Willem de Bruijn" <willemdebruijn.ker...@gmail.com> wrote:
>>>>>On Thu, May 10, 2018 at 4:28 AM, <gfree.w...@vip.163.com> wrote:
>>>>>> From: Gao Feng <gfree.w...@vip.163.com>
>>>>>>
>>>>>> The skb flow limit is implemented for each CPU independently. In the
>>>>>> current code, skb_flow_limit gets the softnet_data via this_cpu_ptr.
>>>>>> But when RPS is enabled, the target cpu of enqueue_to_backlog may not
>>>>>> be the current cpu. As a result, skb_flow_limit checks the stats of
>>>>>> the current CPU while the skb is appended to the queue of another CPU,
>>>>>> which isn't the expected behavior.
>>>>>>
>>>>>> Now pass the target cpu's softnet_data as a param to skb_flow_limit
>>>>>> to make this consistent.
>>>>>
>>>>>The local cpu softnet_data is used on purpose. The operations in
>>>>>skb_flow_limit() on sd fields could race if not executed on the local cpu.
>>>>
>>>> I think the race doesn't exist because of the rps_lock.
>>>> enqueue_to_backlog holds the rps_lock before calling skb_flow_limit.
>>>
>>>Indeed, I overlooked that. There still is the matter of cache contention.
>>
>> Is the cache contention really important in this case?
>> I don't think so, because enqueue_to_backlog has already touched and
>> modified the softnet_data of the target cpu.
>>
>>>>>Flow limit tries to detect large ("elephant") DoS flows with a fixed
>>>>>four-tuple. These would always hit the same RPS cpu, so that cpu being
>>>>>backlogged
>>>>
>>>> They may hit different target CPUs when RFS is enabled, because the app
>>>> could be scheduled to another CPU; RFS then tries to deliver the skb to
>>>> the latest core, which has a hot cache.
>>>
>>>This even more suggests using the initial (or IRQ) cpu to track state,
>>>instead of the destination (RPS/RFS) cpu.
>>
>> I don't understand why it is better to track state on the initial cpu
>> rather than the target cpu. The latter could give a more accurate result.
>
>For a single DoS flow with normally cpu-pinned IRQs, the results will be
>equally good when tracked on the initial IRQ cpu..
>
>>>>>may be an indication that such a flow is active. But the flow will also
>>>>>always arrive on the same initial cpu courtesy of RSS. So storing the
>>>>>lookup table
>>>>
>>>> RSS can't ensure the irq is handled by the same cpu. It may be balanced
>>>> between the cpus.
>>>
>>>IRQs are usually pinned to cores. Unless using something like irqbalance,
>>>but that operates at too coarse a timescale to do anything useful at Mpps
>>>packet rates.
>>
>> There are some motherboards which can't ensure the irq stays pinned.
>> On those, flow_limit doesn't work as well as expected.
>
>.. this seems to be the crux of the argument. I am not aware of any network
>interrupts that do not adhere to the cpu pinning configuration in
>
>  /proc/irq/$IRQ/smp_affinity(_list)
>
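[The heuristic being debated keeps, per cpu, a short ring of recent flow
hashes plus per-bucket counts, and flags a flow once it owns more than half
of that history while the backlog is over half full. Below is a standalone
userspace model of that logic, a sketch mirroring skb_flow_limit() in
net/core/dev.c; the constants and the flow_limit()/main() names here are
illustrative, not the kernel's exact defaults or API.]

/* Userspace model of the per-cpu flow limit heuristic. A ring of the
 * last FLOW_LIMIT_HISTORY flow hashes plus per-bucket counts; a packet
 * is dropped when its flow owns more than half the history while the
 * backlog queue is over half full. */
#include <stdbool.h>
#include <stdio.h>

#define FLOW_LIMIT_HISTORY 128   /* illustrative; power of two */
#define NUM_BUCKETS        64    /* illustrative; power of two */

static unsigned int history[FLOW_LIMIT_HISTORY];
static unsigned int history_head;
static unsigned int buckets[NUM_BUCKETS];

static bool flow_limit(unsigned int hash, unsigned int qlen,
		       unsigned int max_backlog)
{
	unsigned int old_flow, new_flow;

	if (qlen < (max_backlog >> 1))  /* only act once backlog is half full */
		return false;

	/* slide the history window: retire the oldest hash, record the new */
	new_flow = hash & (NUM_BUCKETS - 1);
	old_flow = history[history_head];
	history[history_head] = new_flow;
	history_head = (history_head + 1) & (FLOW_LIMIT_HISTORY - 1);

	if (buckets[old_flow])
		buckets[old_flow]--;

	/* one flow holding >50% of recent history marks an elephant */
	return buckets[new_flow]++ > (FLOW_LIMIT_HISTORY >> 1);
}

int main(void)
{
	unsigned int i, drops = 0;

	/* 1000 packets of a single flow (hash 42) against a full backlog */
	for (i = 0; i < 1000; i++)
		drops += flow_limit(42, 900, 1000);
	printf("dropped %u of 1000 packets from the elephant flow\n", drops);
	return 0;
}

[Built with any C compiler, this drops the bulk of the single dominant
flow, while a mix of many short flows would spread across buckets and
pass untouched.]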
When smp_affinity is configured to 0xff, I have met some hardware that
delivers most irqs to one specific cpu, while the remaining irqs are spread
across the other cpus. So it can't ensure the irqs of one flow are delivered
to a fixed cpu. I don't know whether it is caused by the APIC or the
motherboard, or what information you would need to confirm it.

>What kind of hardware ignores this setting and sprays interrupts? I agree
>that in that case flow_limit as is may be ineffective (if migration happens
>at rates comparable to packet rates). But this should not happen?
>
>>>>>on the initial CPU is also fine. There may be false positives on other
>>>>>CPUs with the same RPS destination, but that is unlikely with a highly
>>>>>concurrent traffic server mix ("mice").
>>>>
>>>> If my comment is right, the flow can't always arrive on the same initial
>>>> cpu, although it may be sent to the same target cpu.
>>>>
>>>>>Note that the sysctl net.core.flow_limit_cpu_bitmap enables the feature
>>>>>for the cpus on which traffic initially lands, not the RPS destination
>>>>>cpus. See also Documentation/networking/scaling.txt
>>>>>
>>>>>That said, I had to reread the code, as it does seem sensible that the
>>>>>same softnet_data is intended to be used both when testing qlen and
>>>>>flow_limit.
>>>>
>>>> In most cases, the user configures the same map for RPS and flow_limit,
>>>> like 0xff, because the user can't predict which core the evil flow will
>>>> arrive on.
>>>>
>>>> Take an example: there are 2 cores, cpu0 and cpu1. One flow is an evil
>>>> flow, but its irq is sent to cpu0. After RPS/RFS, the target cpu is cpu1.
>>>> Now cpu0 invokes enqueue_to_backlog, and skb_flow_limit checks the queue
>>>> length of cpu0. It passes the skb_flow_limit check because there is no
>>>> evil flow on cpu0.
>>>
>>>No, enqueue_to_backlog passes qlen to skb_flow_limit, so that does
>>>check the queue length of the RPS cpu.
>>
>> Sorry, I overlooked that qlen is the queue length of the rps cpu.
>> Then it's ok, except that the stats may be inaccurate when the irq isn't
>> pinned.
>>
>> But I still wonder: is it really important to track state on the initial
>> cpu rather than the target cpu? enqueue_to_backlog has already touched the
>> softnet_data of the target cpu.
>
>I think the merit of both IRQ and RPS cpu can be argued for attaching the
>flow_limit state.
>
>Either way, the current behavior is not a bug, so I don't think that this
>is a candidate for net.
>
>The cost of moving from IRQ to RPS cpu will be the cacheline contention on
>a system with multiple IRQ cpus that all try to update the sd->flow_limit
>of the same RPS cpu, which is particularly likely with RFS. I suspect that
>this cost is non-trivial and not worth the benefit of handling hardware
>with unpinned IRQs.

Ok, I agree with you. Thanks for the detailed discussion.

Best Regards
Feng
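[Willem's closing point about cacheline contention can be seen in isolation
with a toy microbenchmark: the shared counter below stands in for a single
RPS cpu's flow table being updated by several IRQ cpus. This is purely
illustrative userspace code, nothing kernel-specific; the thread and
iteration counts are arbitrary.]

/* Toy model of the contention argument: NTHREADS "IRQ cpus" all update
 * one shared cacheline (as if they shared one RPS cpu's flow table) vs.
 * each updating its own padded cacheline (as with per-IRQ-cpu state).
 * Build: cc -O2 -pthread contend.c */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 4
#define ITERS    (20 * 1000 * 1000UL)

struct padded {
	volatile unsigned long v;
	char pad[64 - sizeof(unsigned long)];   /* fill one cacheline */
};

static volatile unsigned long shared;        /* one line, all threads */
static struct padded per_thread[NTHREADS];   /* one line per thread */

static void *hit_shared(void *arg)
{
	unsigned long i;

	(void)arg;
	for (i = 0; i < ITERS; i++)
		__sync_fetch_and_add(&shared, 1);
	return NULL;
}

static void *hit_private(void *arg)
{
	struct padded *p = arg;
	unsigned long i;

	for (i = 0; i < ITERS; i++)
		__sync_fetch_and_add(&p->v, 1);
	return NULL;
}

static double run(void *(*fn)(void *), int private)
{
	pthread_t t[NTHREADS];
	struct timespec a, b;
	int i;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&t[i], NULL, fn,
			       private ? (void *)&per_thread[i] : NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(t[i], NULL);
	clock_gettime(CLOCK_MONOTONIC, &b);
	return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void)
{
	printf("shared cacheline:   %.2fs\n", run(hit_shared, 0));
	printf("private cachelines: %.2fs\n", run(hit_private, 1));
	return 0;
}

[On a typical multicore machine the shared-cacheline run is several times
slower than the padded per-thread run; that gap is the cost being weighed
against supporting hardware with unpinned IRQs.]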