On 22 May 2024, at 0:11, Ilya Maximets wrote:
> On 5/16/24 19:03, Adrian Moreno wrote: >> >> >> On 4/24/24 9:53 PM, Adrian Moreno wrote: >>> This is the userspace counterpart of the work being done in the kernel >>> [1]. Sending it as RFC to get some early feedback on the overall >>> solution. >>> >>> ** Problem description ** >>> Currently, OVS supports several observability features, such as >>> per-bridge IPFIX, per-flow IPFIX and sFlow. However, given the >>> complexity of OVN-generated pipelines, a single sample is typically not >>> enough to troubleshoot what is OVN/OVS is doing with the packet. We need >>> highler level metadata alongside the packet sample. >>> >>> This can be achieved by the means of per-flow IPFIX sampling and >>> NXAST_SAMPLE action, through which OVN can add two arbitrary 32-bit >>> values (observation_domain_id and observation_point_id) that are sent >>> alongside the packet information in the IPFIX frames. >>> >>> However, there is a fundamental limitation of the existing sampling >>> infrastructure, specially in the kernel datapath: samples are handled by >>> ovs-vswitchd, forcing the need for an upcall (userspace action). The >>> fact that samples share the same netlink sockets and handler thread cpu >>> time with actual packet misse, can easily cause packet drops making this >>> solution unfit for use in highly loaded production systems. >>> >>> Users are left with little option than guessing what sampling rate will >>> be OK for their traffic pattern and dealing with the lost accuracy. >>> >>> ** Feature Description ** >>> In order to solve this situation and enable this feature to be safely >>> turned on in production, this RFC uses the psample support proposed in >>> [1] to send NXAST_SAMPLE samples to psample adding the observability >>> domain and point information as metadata. >>> >>> ~~ API ~~ >>> The API is simply a new field called "psample_group" in the >>> Flow_Sample_Collector_Set (FSCS) table. Configuring this value is >>> orthogonal to also associating the FSCS entry to an entry in the IPFIX >>> table. >>> >>> Apart from that configuration, the controller needs to add NXAST_SAMPLE >>> actions that refer the entry's id. >>> >>> ~~ HW Offload ~~ >>> psample is already supported by tc using the act_sample action. The >>> metadata is taken from the actions' cookie. >>> This series also adds support for offloading the odp sample action, >>> only when it includes psample information but not nested actions. >>> >>> A bit of special care has to be taken in the tc layer to not store the >>> ufid in the sample's cookie as it's used to carry action-specific data. >>> >>> ~~ Datapath support ~~ >>> This new behavior of the datapth sample action is only supported on the >>> kernel datapath. A more detailed analysis is needed (and planned) to >>> find a way to also optimize the userspace datapath. However, although >>> potentially inefficient, there is not that much risk of dropping packets >>> with the existing IPFIX infrastructure. >>> >>> ~~ Testing ~~ >>> The series includes an utility program called "ovs-psample" that listens >>> to packets multicasted by the psample module and prints them (also >>> printing the obs_domain and obs_point ids). In addition the kernel >>> series includes a tracepoint for easy testing and troubleshooting. >>> >>> [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=847473 >>> >>> >> >> Hi all, >> >> I had an offline meeting with Eelco, Ilya and Aaron about this topic and >> wanted >> to put out what was discussed and hopefully get more feedback. >> >> In general, 3 options were considered: >> >> Option A: Reusing the sample action >> =================================== >> This is essentially what this proposal (and it's kernel counterpart) consists >> in. The datapath action would look like this: >> >> sample(probability=50%, actions(...), group=10, cookie=0x123) >> >> Pros >> ~~~~ >> - In userspace, it fits nicely with the existing per-flow sampling >> infrastructure as this RFC exemplifies. >> >> - The probability is present in the context of sending the packet to psample >> so >> it's easily carried to psample's rate, making the consumer aware of the >> accuracy >> of the sample. >> >> - Relatively easy to implement in tc as a act_sample exists with similar >> semantics. >> >> Cons >> ~~~~ >> - It breaks the original design of the "sample" action. The "sample" action >> means: The packet is sampled (a probability is calculated) and, if the >> result is >> positive, a set of nested actions is executed. This follows the >> "building-blocks" approach of datapath action. This option breaks this >> assumption by adding more semantics and behavior to the "sample" action. >> >> - If we want to add "trunc" to this sampling, the result would probably work >> but >> is not very nice because we need it outside of the sample() action, e.g: >> trunc(100),sample(probability=50%, actions(...), group=10, >> cookie=0x123),trunc(0) >> >> - Makes the uAPI a bit clunky by adding more attributes to an existing, >> simple >> action. Also, new attributes and existing nested actions are completely >> orthogonal so code needs to exist in both userspace and kernel to properly >> split >> this behavior. >> >> - When "trunc" is used, psample will report the original skb length, >> regardless >> of whether the "trunc" is associated to the sample action or not. Not sure >> this >> is a huge problem nor if it's easy or worth doing it better. >> >> Option B: Creating a new action: emit_action >> ============================================ >> Credits to Ilya. Creating a new action to implement this would fit into the >> rest >> of the actions yielding a flow such: >> >> sample(probability=50%, >> actions(trunc(100),emit_sample(group=10,cookie=0x123))) >> >> Pros >> ~~~~ >> - It better aligns with datapath action design by reusing existing building >> blocks. >> >> - Code additions are probably cleaner and easier to review as it's adding a >> new >> action instead of changing the behavior of an existing one. >> >> - Fit with existing per-flow IPFIX sampling is equally good since a combined >> IPFIX/psample action would just add both "userspace" and "emit_sample" >> actions >> will be nested inside "sample". >> >> - "trunc" can now be inside the "sample" nested actions. It should not modify >> the packet so, in theory, this should not cause a packet clone. >> >> Cons >> ~~~~ >> - The probability is now longer available directly from the context of >> "emit_sample". We would need to carry the probability as metadata (in private >> skb section) from the outer "sample" to the inner "emit_sample". This >> mechanism >> is not using explicit actions so it should be very clearly documented. >> >> - offloading to tc might be a bit more complicated as we need the probability >> for act_sample. We would need to introspect the "sample" actions to see if it >> _only_ contains "emit_sample" and only then replace it with act_sample. > > This is not a real con, because TC doesn't have anything like a generic > sample() action with an action list. And offload of sample() action is > not supported today. We will always have to introspect the content of > the sample() action's action list, because we'll be able to offload only > a very narrow subset of these inner actions. Basically, emit_sample() > will be an only offloadable option, since sFlow-via-psample approach > doesn't seem to be possible. > >> >> >> Option C: Using a special vport >> =============================== >> Credits to Aaron: Adding a special vport type that sends packets to psample. >> The >> action would look like this: >> sample(probability=50%, actions(trunc(100),output(pX)) >> >> Pros >> ~~~~ >> - It also aligns well with the datapath building blocks. It is very clear >> semantically: "send it to observation" >> >> - From ovs-vswitchd pov, it resembles per-flow mirroring which is well >> integrated. > > Note: there is no such thing as per-flow mirroring today. But this would > be an implementation of it. > >> >> - Statistics are generated in the kernel as other vports. >> >> Cons >> ~~~~ >> - The probability is not present in the context of output, nor the output has >> this concept currently. We would need to carry it over in the skb private >> area, >> same as for Option B. >> However, documenting this out-of-bands carry over of the sampling probability >> and its interaction with other actions can be cleaner if this mechanism only >> affects a new action than an existing one (specially if it depends on the >> underlying vport type). >> >> - The group and cookie is not on the output action either so we need new >> actions, lets call it SET_OBSERVABILITY_METADATA and a new non-masked key >> field >> that keeps that metadata around for the action to use it. >> >> - Currently, "trunc" is implemented (i.e: packet actually gets trimmed) >> before >> the per-vport-type code is reached, causing the packet to be cloned inside >> the >> "sample" action. The plan for direct psample execution was to add an >> optimization that would memory copies if there were are no listeners in the >> multicast group. This nulls-out this potential optimization. >> >> - Although not super-expensive (if skb is not modified), it _always_ >> requires an >> skb_clone (even without "trunc"). >> >> - It requires a very big amount of work in ovs-vswitchd: >> - We'd need to decide who creates the port, if it's the user via Openflow >> and >> it's exposed as a port in a bridge or if it's a hidden vport created by the >> dpif >> layer. >> - Controls need to be established to limit OFP actions to send traffic to >> this port or it receiving traffic. >> - DPDK datapath would probably require a new netdev_class as well. >> >> - tc offload is more complicated. If we want to use act_sample (I cannot >> think >> of a way that doesn't involve act_sample), we'd need to track "sample" >> actions >> and "set_observability_metadata) to be able to build an act_sample for this. >> >> Ilya, Aaron, Eelco, did I miss something? >> >> Thoughts? Preferences? > > I think, option C is a little too heavy for the end functionality for the > user. > And I'm not sure if we'll be able to reuse this infrastructure for anything > else. > > I'm not a fan of the design breaking part of the A and duplication of > attributes. > > So, I'd prefer the B here, but I'm a little biased. +1 on the above. I don’t think A is in line with the idea behind the sample() action. For C, it’s a lot of infrastructure to maintain for a single feature. And the action to just store some metadata for a later action to be used also sounds odd. So my vote is also for B. Cheers, Eelco _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev