On 22 May 2024, at 0:11, Ilya Maximets wrote:

> On 5/16/24 19:03, Adrian Moreno wrote:
>>
>>
>> On 4/24/24 9:53 PM, Adrian Moreno wrote:
>>> This is the userspace counterpart of the work being done in the kernel
>>> [1]. Sending it as RFC to get some early feedback on the overall
>>> solution.
>>>
>>> ** Problem description **
>>> Currently, OVS supports several observability features, such as
>>> per-bridge IPFIX, per-flow IPFIX and sFlow. However, given the
>>> complexity of OVN-generated pipelines, a single sample is typically not
>>> enough to troubleshoot what is OVN/OVS is doing with the packet. We need
>>> highler level metadata alongside the packet sample.
>>>
>>> This can be achieved by the means of per-flow IPFIX sampling and
>>> NXAST_SAMPLE action, through which OVN can add two arbitrary 32-bit
>>> values (observation_domain_id and observation_point_id) that are sent
>>> alongside the packet information in the IPFIX frames.
>>>
>>> However, there is a fundamental limitation of the existing sampling
>>> infrastructure, specially in the kernel datapath: samples are handled by
>>> ovs-vswitchd, forcing the need for an upcall (userspace action). The
>>> fact that samples share the same netlink sockets and handler thread cpu
>>> time with actual packet misse, can easily cause packet drops making this
>>> solution unfit for use in highly loaded production systems.
>>>
>>> Users are left with little option than guessing what sampling rate will
>>> be OK for their traffic pattern and dealing with the lost accuracy.
>>>
>>> ** Feature Description **
>>> In order to solve this situation and enable this feature to be safely
>>> turned on in production, this RFC uses the psample support proposed in
>>> [1] to send NXAST_SAMPLE samples to psample adding the observability
>>> domain and point information as metadata.
>>>
>>> ~~ API ~~
>>> The API is simply a new field called "psample_group" in the
>>> Flow_Sample_Collector_Set (FSCS) table. Configuring this value is
>>> orthogonal to also associating the FSCS entry to an entry in the IPFIX
>>> table.
>>>
>>> Apart from that configuration, the controller needs to add NXAST_SAMPLE
>>> actions that refer the entry's id.
>>>
>>> ~~ HW Offload ~~
>>> psample is already supported by tc using the act_sample action. The
>>> metadata is taken from the actions' cookie.
>>> This series also adds support for offloading the odp sample action,
>>> only when it includes psample information but not nested actions.
>>>
>>> A bit of special care has to be taken in the tc layer to not store the
>>> ufid in the sample's cookie as it's used to carry action-specific data.
>>>
>>> ~~ Datapath support ~~
>>> This new behavior of the datapth sample action is only supported on the
>>> kernel datapath. A more detailed analysis is needed (and planned) to
>>> find a way to also optimize the userspace datapath. However, although
>>> potentially inefficient, there is not that much risk of dropping packets
>>> with the existing IPFIX infrastructure.
>>>
>>> ~~ Testing ~~
>>> The series includes an utility program called "ovs-psample" that listens
>>> to packets multicasted by the psample module and prints them (also
>>> printing the obs_domain and obs_point ids). In addition the kernel
>>> series includes a tracepoint for easy testing and troubleshooting.
>>>
>>> [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=847473
>>>
>>>
>>
>> Hi all,
>>
>> I had an offline meeting with Eelco, Ilya and Aaron about this topic and 
>> wanted
>> to put out what was discussed and hopefully get more feedback.
>>
>> In general, 3 options were considered:
>>
>> Option A: Reusing the sample action
>> ===================================
>> This is essentially what this proposal (and it's kernel counterpart) consists
>> in. The datapath action would look like this:
>>
>>   sample(probability=50%, actions(...), group=10, cookie=0x123)
>>
>> Pros
>> ~~~~
>> - In userspace, it fits nicely with the existing per-flow sampling
>> infrastructure as this RFC exemplifies.
>>
>> - The probability is present in the context of sending the packet to psample 
>> so
>> it's easily carried to psample's rate, making the consumer aware of the 
>> accuracy
>> of the sample.
>>
>> - Relatively easy to implement in tc as a act_sample exists with similar 
>> semantics.
>>
>> Cons
>> ~~~~
>> - It breaks the original design of the "sample" action. The "sample" action
>> means: The packet is sampled (a probability is calculated) and, if the 
>> result is
>> positive, a set of nested actions is executed. This follows the
>> "building-blocks" approach of datapath action. This option breaks this
>> assumption by adding more semantics and behavior to the "sample" action.
>>
>> - If we want to add "trunc" to this sampling, the result would probably work 
>> but
>> is not very nice because we need it outside of the sample() action, e.g:
>>     trunc(100),sample(probability=50%, actions(...), group=10,
>> cookie=0x123),trunc(0)
>>
>> - Makes the uAPI a bit clunky by adding more attributes to an existing, 
>> simple
>> action. Also, new attributes and existing nested actions are completely
>> orthogonal so code needs to exist in both userspace and kernel to properly 
>> split
>> this behavior.
>>
>> - When "trunc" is used, psample will report the original skb length, 
>> regardless
>> of whether the "trunc" is associated to the sample action or not. Not sure 
>> this
>> is a huge problem nor if it's easy or worth doing it better.
>>
>> Option B: Creating a new action: emit_action
>> ============================================
>> Credits to Ilya. Creating a new action to implement this would fit into the 
>> rest
>> of the actions yielding a flow such:
>>
>>   sample(probability=50%, 
>> actions(trunc(100),emit_sample(group=10,cookie=0x123)))
>>
>> Pros
>> ~~~~
>> - It better aligns with datapath action design by reusing existing building 
>> blocks.
>>
>> - Code additions are probably cleaner and easier to review as it's adding a 
>> new
>> action instead of changing the behavior of an existing one.
>>
>> - Fit with existing per-flow IPFIX sampling is equally good since a combined
>> IPFIX/psample action would just add both "userspace" and "emit_sample" 
>> actions
>> will be nested inside "sample".
>>
>> - "trunc" can now be inside the "sample" nested actions. It should not modify
>> the packet so, in theory, this should not cause a packet clone.
>>
>> Cons
>> ~~~~
>> - The probability is now longer available directly from the context of
>> "emit_sample". We would need to carry the probability as metadata (in private
>> skb section) from the outer "sample" to the inner "emit_sample". This 
>> mechanism
>> is not using explicit actions so it should be very clearly documented.
>>
>> - offloading to tc might be a bit more complicated as we need the probability
>> for act_sample. We would need to introspect the "sample" actions to see if it
>> _only_ contains "emit_sample" and only then replace it with act_sample.
>
> This is not a real con, because TC doesn't have anything like a generic
> sample() action with an action list.  And offload of sample() action is
> not supported today.  We will always have to introspect the content of
> the sample() action's action list, because we'll be able to offload only
> a very narrow subset of these inner actions.  Basically, emit_sample()
> will be an only offloadable option, since sFlow-via-psample approach
> doesn't seem to be possible.
>
>>
>>
>> Option C: Using a special vport
>> ===============================
>> Credits to Aaron: Adding a special vport type that sends packets to psample. 
>> The
>> action would look like this:
>>    sample(probability=50%, actions(trunc(100),output(pX))
>>
>> Pros
>> ~~~~
>> - It also aligns well with the datapath building blocks. It is very clear
>> semantically: "send it to observation"
>>
>> - From ovs-vswitchd pov, it resembles per-flow mirroring which is well 
>> integrated.
>
> Note: there is no such thing as per-flow mirroring today.  But this would
> be an implementation of it.
>
>>
>> - Statistics are generated in the kernel as other vports.
>>
>> Cons
>> ~~~~
>> - The probability is not present in the context of output, nor the output has
>> this concept currently. We would need to carry it over in the skb private 
>> area,
>> same as for Option B.
>> However, documenting this out-of-bands carry over of the sampling probability
>> and its interaction with other actions can be cleaner if this mechanism only
>> affects a new action than an existing one (specially if it depends on the
>> underlying vport type).
>>
>> - The group and cookie is not on the output action either so we need new
>> actions, lets call it SET_OBSERVABILITY_METADATA and a new non-masked key 
>> field
>> that keeps that metadata around for the action to use it.
>>
>> - Currently, "trunc" is implemented (i.e: packet actually gets trimmed) 
>> before
>> the per-vport-type code is reached, causing the packet to be cloned inside 
>> the
>> "sample" action. The plan for direct psample execution was to add an
>> optimization that would memory copies if there were are no listeners in the
>> multicast group. This nulls-out this potential optimization.
>>
>> - Although not super-expensive (if skb is not modified), it _always_ 
>> requires an
>> skb_clone (even without "trunc").
>>
>> - It requires a very big amount of work in ovs-vswitchd:
>>    - We'd need to decide who creates the port, if it's the user via Openflow 
>> and
>> it's exposed as a port in a bridge or if it's a hidden vport created by the 
>> dpif
>> layer.
>>    - Controls need to be established to limit OFP actions to send traffic to
>> this port or it receiving traffic.
>>    - DPDK datapath would probably require a new netdev_class as well.
>>
>> - tc offload is more complicated. If we want to use act_sample (I cannot 
>> think
>> of a way that doesn't involve act_sample), we'd need to track "sample" 
>> actions
>> and "set_observability_metadata) to be able to build an act_sample for this.
>>
>> Ilya, Aaron, Eelco, did I miss something?
>>
>> Thoughts? Preferences?
>
> I think, option C is a little too heavy for the end functionality for the 
> user.
> And I'm not sure if we'll be able to reuse this infrastructure for anything 
> else.
>
> I'm not a fan of the design breaking part of the A and duplication of 
> attributes.
>
> So, I'd prefer the B here, but I'm a little biased.

+1 on the above.

I don’t think A is in line with the idea behind the sample() action.

For C, it’s a lot of infrastructure to maintain for a single feature. And the 
action to just store some metadata for a later action to be used also sounds 
odd.

So my vote is also for B.

Cheers,

Eelco

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to