On Wed, May 22, 2024 at 02:55:35PM GMT, Ilya Maximets wrote:
> Adding the list back to CC.
>

Sorry, I hit the wrong button!

> On 5/22/24 13:28, amore...@redhat.com wrote:
> > On Wed, May 22, 2024 at 12:11:37AM GMT, Ilya Maximets wrote:
> >> On 5/16/24 19:03, Adrian Moreno wrote:
> >>>
> >>>
> >>> On 4/24/24 9:53 PM, Adrian Moreno wrote:
> >>>> This is the userspace counterpart of the work being done in the kernel
> >>>> [1]. Sending it as RFC to get some early feedback on the overall
> >>>> solution.
> >>>>
> >>>> ** Problem description **
> >>>> Currently, OVS supports several observability features, such as
> >>>> per-bridge IPFIX, per-flow IPFIX and sFlow. However, given the
> >>>> complexity of OVN-generated pipelines, a single sample is typically not
> >>>> enough to troubleshoot what is OVN/OVS is doing with the packet. We need
> >>>> highler level metadata alongside the packet sample.
> >>>>
> >>>> This can be achieved by the means of per-flow IPFIX sampling and
> >>>> NXAST_SAMPLE action, through which OVN can add two arbitrary 32-bit
> >>>> values (observation_domain_id and observation_point_id) that are sent
> >>>> alongside the packet information in the IPFIX frames.
> >>>>
> >>>> However, there is a fundamental limitation of the existing sampling
> >>>> infrastructure, specially in the kernel datapath: samples are handled by
> >>>> ovs-vswitchd, forcing the need for an upcall (userspace action). The
> >>>> fact that samples share the same netlink sockets and handler thread cpu
> >>>> time with actual packet misse, can easily cause packet drops making this
> >>>> solution unfit for use in highly loaded production systems.
> >>>>
> >>>> Users are left with little option than guessing what sampling rate will
> >>>> be OK for their traffic pattern and dealing with the lost accuracy.
> >>>>
> >>>> ** Feature Description **
> >>>> In order to solve this situation and enable this feature to be safely
> >>>> turned on in production, this RFC uses the psample support proposed in
> >>>> [1] to send NXAST_SAMPLE samples to psample adding the observability
> >>>> domain and point information as metadata.
> >>>>
> >>>> ~~ API ~~
> >>>> The API is simply a new field called "psample_group" in the
> >>>> Flow_Sample_Collector_Set (FSCS) table. Configuring this value is
> >>>> orthogonal to also associating the FSCS entry to an entry in the IPFIX
> >>>> table.
> >>>>
> >>>> Apart from that configuration, the controller needs to add NXAST_SAMPLE
> >>>> actions that refer the entry's id.
> >>>>
> >>>> ~~ HW Offload ~~
> >>>> psample is already supported by tc using the act_sample action. The
> >>>> metadata is taken from the actions' cookie.
> >>>> This series also adds support for offloading the odp sample action,
> >>>> only when it includes psample information but not nested actions.
> >>>>
> >>>> A bit of special care has to be taken in the tc layer to not store the
> >>>> ufid in the sample's cookie as it's used to carry action-specific data.
> >>>>
> >>>> ~~ Datapath support ~~
> >>>> This new behavior of the datapth sample action is only supported on the
> >>>> kernel datapath. A more detailed analysis is needed (and planned) to
> >>>> find a way to also optimize the userspace datapath. However, although
> >>>> potentially inefficient, there is not that much risk of dropping packets
> >>>> with the existing IPFIX infrastructure.
> >>>>
> >>>> ~~ Testing ~~
> >>>> The series includes an utility program called "ovs-psample" that listens
> >>>> to packets multicasted by the psample module and prints them (also
> >>>> printing the obs_domain and obs_point ids). In addition the kernel
> >>>> series includes a tracepoint for easy testing and troubleshooting.
> >>>>
> >>>> [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=847473
> >>>>
> >>>>
> >>>
> >>> Hi all,
> >>>
> >>> I had an offline meeting with Eelco, Ilya and Aaron about this topic and 
> >>> wanted
> >>> to put out what was discussed and hopefully get more feedback.
> >>>
> >>> In general, 3 options were considered:
> >>>
> >>> Option A: Reusing the sample action
> >>> ===================================
> >>> This is essentially what this proposal (and it's kernel counterpart) 
> >>> consists
> >>> in. The datapath action would look like this:
> >>>
> >>>   sample(probability=50%, actions(...), group=10, cookie=0x123)
> >>>
> >>> Pros
> >>> ~~~~
> >>> - In userspace, it fits nicely with the existing per-flow sampling
> >>> infrastructure as this RFC exemplifies.
> >>>
> >>> - The probability is present in the context of sending the packet to 
> >>> psample so
> >>> it's easily carried to psample's rate, making the consumer aware of the 
> >>> accuracy
> >>> of the sample.
> >>>
> >>> - Relatively easy to implement in tc as a act_sample exists with similar 
> >>> semantics.
> >>>
> >>> Cons
> >>> ~~~~
> >>> - It breaks the original design of the "sample" action. The "sample" 
> >>> action
> >>> means: The packet is sampled (a probability is calculated) and, if the 
> >>> result is
> >>> positive, a set of nested actions is executed. This follows the
> >>> "building-blocks" approach of datapath action. This option breaks this
> >>> assumption by adding more semantics and behavior to the "sample" action.
> >>>
> >>> - If we want to add "trunc" to this sampling, the result would probably 
> >>> work but
> >>> is not very nice because we need it outside of the sample() action, e.g:
> >>>     trunc(100),sample(probability=50%, actions(...), group=10,
> >>> cookie=0x123),trunc(0)
> >>>
> >>> - Makes the uAPI a bit clunky by adding more attributes to an existing, 
> >>> simple
> >>> action. Also, new attributes and existing nested actions are completely
> >>> orthogonal so code needs to exist in both userspace and kernel to 
> >>> properly split
> >>> this behavior.
> >>>
> >>> - When "trunc" is used, psample will report the original skb length, 
> >>> regardless
> >>> of whether the "trunc" is associated to the sample action or not. Not 
> >>> sure this
> >>> is a huge problem nor if it's easy or worth doing it better.
> >>>
> >>> Option B: Creating a new action: emit_action
> >>> ============================================
> >>> Credits to Ilya. Creating a new action to implement this would fit into 
> >>> the rest
> >>> of the actions yielding a flow such:
> >>>
> >>>   sample(probability=50%, 
> >>> actions(trunc(100),emit_sample(group=10,cookie=0x123)))
> >>>
> >>> Pros
> >>> ~~~~
> >>> - It better aligns with datapath action design by reusing existing 
> >>> building blocks.
> >>>
> >>> - Code additions are probably cleaner and easier to review as it's adding 
> >>> a new
> >>> action instead of changing the behavior of an existing one.
> >>>
> >>> - Fit with existing per-flow IPFIX sampling is equally good since a 
> >>> combined
> >>> IPFIX/psample action would just add both "userspace" and "emit_sample" 
> >>> actions
> >>> will be nested inside "sample".
> >>>
> >>> - "trunc" can now be inside the "sample" nested actions. It should not 
> >>> modify
> >>> the packet so, in theory, this should not cause a packet clone.
> >>>
> >>> Cons
> >>> ~~~~
> >>> - The probability is now longer available directly from the context of
> >>> "emit_sample". We would need to carry the probability as metadata (in 
> >>> private
> >>> skb section) from the outer "sample" to the inner "emit_sample". This 
> >>> mechanism
> >>> is not using explicit actions so it should be very clearly documented.
> >>>
> >>> - offloading to tc might be a bit more complicated as we need the 
> >>> probability
> >>> for act_sample. We would need to introspect the "sample" actions to see 
> >>> if it
> >>> _only_ contains "emit_sample" and only then replace it with act_sample.
> >>
> >> This is not a real con, because TC doesn't have anything like a generic
> >> sample() action with an action list.  And offload of sample() action is
> >> not supported today.  We will always have to introspect the content of
> >> the sample() action's action list, because we'll be able to offload only
> >> a very narrow subset of these inner actions.  Basically, emit_sample()
> >> will be an only offloadable option, since sFlow-via-psample approach
> >> doesn't seem to be possible.
> >>
> >
> > Good point.
> >
> >>>
> >>>
> >>> Option C: Using a special vport
> >>> ===============================
> >>> Credits to Aaron: Adding a special vport type that sends packets to 
> >>> psample. The
> >>> action would look like this:
> >>>    sample(probability=50%, actions(trunc(100),output(pX))
> >>>
> >>> Pros
> >>> ~~~~
> >>> - It also aligns well with the datapath building blocks. It is very clear
> >>> semantically: "send it to observation"
> >>>
> >>> - From ovs-vswitchd pov, it resembles per-flow mirroring which is well 
> >>> integrated.
> >>
> >> Note: there is no such thing as per-flow mirroring today.  But this would
> >> be an implementation of it.
> >>
> >
> > That is what I was trying to express, thanks.
> >
> >>>
> >>> - Statistics are generated in the kernel as other vports.
> >>>
> >>> Cons
> >>> ~~~~
> >>> - The probability is not present in the context of output, nor the output 
> >>> has
> >>> this concept currently. We would need to carry it over in the skb private 
> >>> area,
> >>> same as for Option B.
> >>> However, documenting this out-of-bands carry over of the sampling 
> >>> probability
> >>> and its interaction with other actions can be cleaner if this mechanism 
> >>> only
> >>> affects a new action than an existing one (specially if it depends on the
> >>> underlying vport type).
> >>>
> >>> - The group and cookie is not on the output action either so we need new
> >>> actions, lets call it SET_OBSERVABILITY_METADATA and a new non-masked key 
> >>> field
> >>> that keeps that metadata around for the action to use it.
> >>>
> >>> - Currently, "trunc" is implemented (i.e: packet actually gets trimmed) 
> >>> before
> >>> the per-vport-type code is reached, causing the packet to be cloned 
> >>> inside the
> >>> "sample" action. The plan for direct psample execution was to add an
> >>> optimization that would memory copies if there were are no listeners in 
> >>> the
> >>> multicast group. This nulls-out this potential optimization.
> >>>
> >>> - Although not super-expensive (if skb is not modified), it _always_ 
> >>> requires an
> >>> skb_clone (even without "trunc").
> >>>
> >>> - It requires a very big amount of work in ovs-vswitchd:
> >>>    - We'd need to decide who creates the port, if it's the user via 
> >>> Openflow and
> >>> it's exposed as a port in a bridge or if it's a hidden vport created by 
> >>> the dpif
> >>> layer.
> >>>    - Controls need to be established to limit OFP actions to send traffic 
> >>> to
> >>> this port or it receiving traffic.
> >>>    - DPDK datapath would probably require a new netdev_class as well.
> >>>
> >>> - tc offload is more complicated. If we want to use act_sample (I cannot 
> >>> think
> >>> of a way that doesn't involve act_sample), we'd need to track "sample" 
> >>> actions
> >>> and "set_observability_metadata) to be able to build an act_sample for 
> >>> this.
> >>>
> >>> Ilya, Aaron, Eelco, did I miss something?
> >>>
> >>> Thoughts? Preferences?
> >>
> >> I think, option C is a little too heavy for the end functionality for the 
> >> user.
> >> And I'm not sure if we'll be able to reuse this infrastructure for 
> >> anything else.
> >>
> >> I'm not a fan of the design breaking part of the A and duplication of 
> >> attributes.
> >>
> >> So, I'd prefer the B here, but I'm a little biased.
> >>
> >
> > Thanks for your feedback Ilya.
> >
> > In my opinion, option C yields a very elegant datapath flow. However,
> > reusing the action does not give us a lot of the typical benefits of
> > reusing (i.e: less code). Instead we'll end up having lots of checks to
> > see if the output corresponds to this special vport. E.g: cannot have
> > OVS_VPORT_ATTR_UPCALL_PID or OVS_VPORT_ATTR_UPCALL_STATS, statistics
> > must be handled differently, cannot receive traffic, does not
> > correspond to any OFP port, this action does _not_ modify the packet and
> > should not trigger a clone in sample(), etc.
> >
> > So I also lean towards option B but I guess I'm also biased since I
> > already started working on option A which is more similar to B than to
> > C.
> >
> > Thanks.
> > --
> > Adrián
> >
>

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to