On Wed, May 22, 2024 at 02:55:35PM GMT, Ilya Maximets wrote: > Adding the list back to CC. >
Sorry, I hit the wrong button! > On 5/22/24 13:28, amore...@redhat.com wrote: > > On Wed, May 22, 2024 at 12:11:37AM GMT, Ilya Maximets wrote: > >> On 5/16/24 19:03, Adrian Moreno wrote: > >>> > >>> > >>> On 4/24/24 9:53 PM, Adrian Moreno wrote: > >>>> This is the userspace counterpart of the work being done in the kernel > >>>> [1]. Sending it as RFC to get some early feedback on the overall > >>>> solution. > >>>> > >>>> ** Problem description ** > >>>> Currently, OVS supports several observability features, such as > >>>> per-bridge IPFIX, per-flow IPFIX and sFlow. However, given the > >>>> complexity of OVN-generated pipelines, a single sample is typically not > >>>> enough to troubleshoot what is OVN/OVS is doing with the packet. We need > >>>> highler level metadata alongside the packet sample. > >>>> > >>>> This can be achieved by the means of per-flow IPFIX sampling and > >>>> NXAST_SAMPLE action, through which OVN can add two arbitrary 32-bit > >>>> values (observation_domain_id and observation_point_id) that are sent > >>>> alongside the packet information in the IPFIX frames. > >>>> > >>>> However, there is a fundamental limitation of the existing sampling > >>>> infrastructure, specially in the kernel datapath: samples are handled by > >>>> ovs-vswitchd, forcing the need for an upcall (userspace action). The > >>>> fact that samples share the same netlink sockets and handler thread cpu > >>>> time with actual packet misse, can easily cause packet drops making this > >>>> solution unfit for use in highly loaded production systems. > >>>> > >>>> Users are left with little option than guessing what sampling rate will > >>>> be OK for their traffic pattern and dealing with the lost accuracy. > >>>> > >>>> ** Feature Description ** > >>>> In order to solve this situation and enable this feature to be safely > >>>> turned on in production, this RFC uses the psample support proposed in > >>>> [1] to send NXAST_SAMPLE samples to psample adding the observability > >>>> domain and point information as metadata. > >>>> > >>>> ~~ API ~~ > >>>> The API is simply a new field called "psample_group" in the > >>>> Flow_Sample_Collector_Set (FSCS) table. Configuring this value is > >>>> orthogonal to also associating the FSCS entry to an entry in the IPFIX > >>>> table. > >>>> > >>>> Apart from that configuration, the controller needs to add NXAST_SAMPLE > >>>> actions that refer the entry's id. > >>>> > >>>> ~~ HW Offload ~~ > >>>> psample is already supported by tc using the act_sample action. The > >>>> metadata is taken from the actions' cookie. > >>>> This series also adds support for offloading the odp sample action, > >>>> only when it includes psample information but not nested actions. > >>>> > >>>> A bit of special care has to be taken in the tc layer to not store the > >>>> ufid in the sample's cookie as it's used to carry action-specific data. > >>>> > >>>> ~~ Datapath support ~~ > >>>> This new behavior of the datapth sample action is only supported on the > >>>> kernel datapath. A more detailed analysis is needed (and planned) to > >>>> find a way to also optimize the userspace datapath. However, although > >>>> potentially inefficient, there is not that much risk of dropping packets > >>>> with the existing IPFIX infrastructure. > >>>> > >>>> ~~ Testing ~~ > >>>> The series includes an utility program called "ovs-psample" that listens > >>>> to packets multicasted by the psample module and prints them (also > >>>> printing the obs_domain and obs_point ids). In addition the kernel > >>>> series includes a tracepoint for easy testing and troubleshooting. > >>>> > >>>> [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=847473 > >>>> > >>>> > >>> > >>> Hi all, > >>> > >>> I had an offline meeting with Eelco, Ilya and Aaron about this topic and > >>> wanted > >>> to put out what was discussed and hopefully get more feedback. > >>> > >>> In general, 3 options were considered: > >>> > >>> Option A: Reusing the sample action > >>> =================================== > >>> This is essentially what this proposal (and it's kernel counterpart) > >>> consists > >>> in. The datapath action would look like this: > >>> > >>> sample(probability=50%, actions(...), group=10, cookie=0x123) > >>> > >>> Pros > >>> ~~~~ > >>> - In userspace, it fits nicely with the existing per-flow sampling > >>> infrastructure as this RFC exemplifies. > >>> > >>> - The probability is present in the context of sending the packet to > >>> psample so > >>> it's easily carried to psample's rate, making the consumer aware of the > >>> accuracy > >>> of the sample. > >>> > >>> - Relatively easy to implement in tc as a act_sample exists with similar > >>> semantics. > >>> > >>> Cons > >>> ~~~~ > >>> - It breaks the original design of the "sample" action. The "sample" > >>> action > >>> means: The packet is sampled (a probability is calculated) and, if the > >>> result is > >>> positive, a set of nested actions is executed. This follows the > >>> "building-blocks" approach of datapath action. This option breaks this > >>> assumption by adding more semantics and behavior to the "sample" action. > >>> > >>> - If we want to add "trunc" to this sampling, the result would probably > >>> work but > >>> is not very nice because we need it outside of the sample() action, e.g: > >>> trunc(100),sample(probability=50%, actions(...), group=10, > >>> cookie=0x123),trunc(0) > >>> > >>> - Makes the uAPI a bit clunky by adding more attributes to an existing, > >>> simple > >>> action. Also, new attributes and existing nested actions are completely > >>> orthogonal so code needs to exist in both userspace and kernel to > >>> properly split > >>> this behavior. > >>> > >>> - When "trunc" is used, psample will report the original skb length, > >>> regardless > >>> of whether the "trunc" is associated to the sample action or not. Not > >>> sure this > >>> is a huge problem nor if it's easy or worth doing it better. > >>> > >>> Option B: Creating a new action: emit_action > >>> ============================================ > >>> Credits to Ilya. Creating a new action to implement this would fit into > >>> the rest > >>> of the actions yielding a flow such: > >>> > >>> sample(probability=50%, > >>> actions(trunc(100),emit_sample(group=10,cookie=0x123))) > >>> > >>> Pros > >>> ~~~~ > >>> - It better aligns with datapath action design by reusing existing > >>> building blocks. > >>> > >>> - Code additions are probably cleaner and easier to review as it's adding > >>> a new > >>> action instead of changing the behavior of an existing one. > >>> > >>> - Fit with existing per-flow IPFIX sampling is equally good since a > >>> combined > >>> IPFIX/psample action would just add both "userspace" and "emit_sample" > >>> actions > >>> will be nested inside "sample". > >>> > >>> - "trunc" can now be inside the "sample" nested actions. It should not > >>> modify > >>> the packet so, in theory, this should not cause a packet clone. > >>> > >>> Cons > >>> ~~~~ > >>> - The probability is now longer available directly from the context of > >>> "emit_sample". We would need to carry the probability as metadata (in > >>> private > >>> skb section) from the outer "sample" to the inner "emit_sample". This > >>> mechanism > >>> is not using explicit actions so it should be very clearly documented. > >>> > >>> - offloading to tc might be a bit more complicated as we need the > >>> probability > >>> for act_sample. We would need to introspect the "sample" actions to see > >>> if it > >>> _only_ contains "emit_sample" and only then replace it with act_sample. > >> > >> This is not a real con, because TC doesn't have anything like a generic > >> sample() action with an action list. And offload of sample() action is > >> not supported today. We will always have to introspect the content of > >> the sample() action's action list, because we'll be able to offload only > >> a very narrow subset of these inner actions. Basically, emit_sample() > >> will be an only offloadable option, since sFlow-via-psample approach > >> doesn't seem to be possible. > >> > > > > Good point. > > > >>> > >>> > >>> Option C: Using a special vport > >>> =============================== > >>> Credits to Aaron: Adding a special vport type that sends packets to > >>> psample. The > >>> action would look like this: > >>> sample(probability=50%, actions(trunc(100),output(pX)) > >>> > >>> Pros > >>> ~~~~ > >>> - It also aligns well with the datapath building blocks. It is very clear > >>> semantically: "send it to observation" > >>> > >>> - From ovs-vswitchd pov, it resembles per-flow mirroring which is well > >>> integrated. > >> > >> Note: there is no such thing as per-flow mirroring today. But this would > >> be an implementation of it. > >> > > > > That is what I was trying to express, thanks. > > > >>> > >>> - Statistics are generated in the kernel as other vports. > >>> > >>> Cons > >>> ~~~~ > >>> - The probability is not present in the context of output, nor the output > >>> has > >>> this concept currently. We would need to carry it over in the skb private > >>> area, > >>> same as for Option B. > >>> However, documenting this out-of-bands carry over of the sampling > >>> probability > >>> and its interaction with other actions can be cleaner if this mechanism > >>> only > >>> affects a new action than an existing one (specially if it depends on the > >>> underlying vport type). > >>> > >>> - The group and cookie is not on the output action either so we need new > >>> actions, lets call it SET_OBSERVABILITY_METADATA and a new non-masked key > >>> field > >>> that keeps that metadata around for the action to use it. > >>> > >>> - Currently, "trunc" is implemented (i.e: packet actually gets trimmed) > >>> before > >>> the per-vport-type code is reached, causing the packet to be cloned > >>> inside the > >>> "sample" action. The plan for direct psample execution was to add an > >>> optimization that would memory copies if there were are no listeners in > >>> the > >>> multicast group. This nulls-out this potential optimization. > >>> > >>> - Although not super-expensive (if skb is not modified), it _always_ > >>> requires an > >>> skb_clone (even without "trunc"). > >>> > >>> - It requires a very big amount of work in ovs-vswitchd: > >>> - We'd need to decide who creates the port, if it's the user via > >>> Openflow and > >>> it's exposed as a port in a bridge or if it's a hidden vport created by > >>> the dpif > >>> layer. > >>> - Controls need to be established to limit OFP actions to send traffic > >>> to > >>> this port or it receiving traffic. > >>> - DPDK datapath would probably require a new netdev_class as well. > >>> > >>> - tc offload is more complicated. If we want to use act_sample (I cannot > >>> think > >>> of a way that doesn't involve act_sample), we'd need to track "sample" > >>> actions > >>> and "set_observability_metadata) to be able to build an act_sample for > >>> this. > >>> > >>> Ilya, Aaron, Eelco, did I miss something? > >>> > >>> Thoughts? Preferences? > >> > >> I think, option C is a little too heavy for the end functionality for the > >> user. > >> And I'm not sure if we'll be able to reuse this infrastructure for > >> anything else. > >> > >> I'm not a fan of the design breaking part of the A and duplication of > >> attributes. > >> > >> So, I'd prefer the B here, but I'm a little biased. > >> > > > > Thanks for your feedback Ilya. > > > > In my opinion, option C yields a very elegant datapath flow. However, > > reusing the action does not give us a lot of the typical benefits of > > reusing (i.e: less code). Instead we'll end up having lots of checks to > > see if the output corresponds to this special vport. E.g: cannot have > > OVS_VPORT_ATTR_UPCALL_PID or OVS_VPORT_ATTR_UPCALL_STATS, statistics > > must be handled differently, cannot receive traffic, does not > > correspond to any OFP port, this action does _not_ modify the packet and > > should not trigger a clone in sample(), etc. > > > > So I also lean towards option B but I guess I'm also biased since I > > already started working on option A which is more similar to B than to > > C. > > > > Thanks. > > -- > > Adrián > > > _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev