On Wed, Jul 27, 2011 at 2:24 PM, Pravin Shelar <[email protected]> wrote:
> On Wed, Jul 27, 2011 at 12:17 PM, Jesse Gross <[email protected]> wrote:
>> On Wed, Jul 27, 2011 at 11:14 AM, Ethan Jackson <[email protected]> wrote:
>>>> One strategy that I have considered is to be able to ask only for flows
>>>> that have a non-zero packet count. That would help with the common case
>>>> where, when there is a large number of flows, they are caused by a port
>>>> scan or some other activity with 1-packet flows. It wouldn't help at
>>>> all in your case.
>>>
>>> You could also have the kernel pass down to userspace what logically
>>> amounts to a list of the flows which have had their statistics change
>>> in the past 10 seconds. A bloom filter would be a sensible approach.
>>> Again, probably won't help at all in Simon's case, and may or may-not
>>> be a useful optimization above simply not pushing down statistics for
>>> flows which have a zero packet count.
>>
>> I don't think that you could implement a Bloom filter like this in a
>> manner that wouldn't cause cache contention. Probably you would still
>> need to iterate over every flow in the kernel, you would just be
>> comparing last used time to current time - 10 instead of packet count
>> not equal to zero.
>>
> cpu cache contention can be fixed by partitioning all flow by
> something (e.g. port no) and assigning cache replacement processing
> to a cpu. replacement algo could simple as active and inactive LRU
> list. this is how kernel page cache replacement looks like from high
> level.
This isn't really a cache replacement problem, though. Maybe that's the high-level goal being solved, but I wouldn't want to make that assumption in the kernel, as it would likely impose too many restrictions on what userspace can do if it wants to implement something completely different in the future. Anything the kernel provides should just be a simple primitive, potentially analogous to the referenced bit that you would find in a page table.

You also can't impose a CPU partitioning scheme on flows because we don't control the CPU that packets are processed on. That's determined by the originator of the packet (such as RSS on the NIC), and then we just handle it on the same CPU.

However, you can use a per-CPU data structure to store information regardless of flow and then merge the results later. This actually works well enough for something like a Bloom filter because you can superimpose the results on top of each other without a problem. The issue is what happens when you want to clear it to start the next iteration. I can think of a couple of different ways, none of which are all that appealing:

* If you do a destructive read, you can just zero each filter out as you read from the different CPUs. This is fast but potentially results in losing some flows if one CPU sets a bit in between the read and the clear.

* Also for a destructive read: put a lock in the per-CPU data that gets taken both by the CPUs processing packets and by the reader CPU. This works, but I don't want to introduce an additional spinlock on the packet processing path, even if it isn't contended and is used by the same CPU the vast majority of the time.

* If you truly want time-based expiration, then it gets even messier, since what you want is stats for the trailing X seconds. To do that, you would probably need a ring buffer of Bloom filters on each CPU that bucket by the second.
By also keeping track of a per-CPU last-used time, you could know which buckets were valid and when to clear them without doing a write from a foreign CPU. As long as the number of buckets you are tracking exceeds the time interval that you care about, and that interval is greater than the interval between reads, you should never miss a flow. So I guess it is possible, but this seems like a disproportionately complicated solution to the problem being solved.

_______________________________________________
dev mailing list
[email protected]
http://openvswitch.org/mailman/listinfo/dev
