On 4/7/22 17:01, Ilya Maximets wrote:
On 4/7/22 16:42, Van Haaren, Harry wrote:
-----Original Message-----
From: Ilya Maximets <i.maxim...@ovn.org>
Sent: Thursday, April 7, 2022 3:40 PM
To: Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry
<harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>;
Richardson, Bruce <bruce.richard...@intel.com>
Cc: i.maxim...@ovn.org; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian
<ian.sto...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Ferriter, Cian
<cian.ferri...@intel.com>; ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara,
John <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>;
Finn, Emma <emma.f...@intel.com>
Subject: Re: OVS DPDK DMA-Dev library/Design Discussion

On 4/7/22 16:25, Maxime Coquelin wrote:
Hi Harry,

On 4/7/22 16:04, Van Haaren, Harry wrote:
Hi OVS & DPDK, Maintainers & Community,

Top posting an overview of the discussion as replies to the thread become slower:
perhaps it is a good time to review and plan for next steps?

From my perspective, those most vocal in the thread seem to be in favour of the
clean rx/tx split ("defer work"), with the tradeoff that the application must be
aware of handling the async DMA completions. If there are any concerns opposing
upstreaming of this method, please indicate this promptly, and we can continue
technical discussions here now.

Wasn't there some discussion about handling the Virtio completions with
the DMA engine? With that, we wouldn't need the deferral of work.

+1

Yes there was; the DMA/virtq completions thread is here for reference:
https://mail.openvswitch.org/pipermail/ovs-dev/2022-March/392908.html

I do not believe that there is a viable path to actually implementing it, and
particularly not in the more complex cases, e.g. virtio with guest-interrupt enabled.

The thread above mentions additional threads and various other options, none of
which I believe to be a clean or workable solution. I'd like input from other folks
more familiar with the exact implementations of VHost/vrings, as well as those with
DMA engine expertise.

I tend to trust Maxime as a vhost maintainer in such questions. :)

In my own opinion though, the implementation is possible, and the concerns don't
sound deal-breaking, as solutions for them might work well enough. So I think
the viability should be tested out before the solution is disregarded. Especially
because the decision will form the API of the vhost library.

I agree, we need a PoC adding interrupt support to dmadev API using
eventfd, and adding a thread in Vhost library that polls for DMA
interrupts and calls vhost_vring_call if needed.
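
Roughly, the shape of that PoC would be as follows. Note this is only a sketch:
the dmadev interrupt/eventfd hook does not exist in the dmadev API today, and the
helper marked hypothetical is illustrative only.

    #include <stdint.h>
    #include <unistd.h>
    #include <rte_vhost.h>

    struct dma_irq_ctx {
        int efd;            /* eventfd a (hypothetical) dmadev IRQ would kick */
        int vid;            /* vhost device id */
        uint16_t vring_idx; /* virtqueue to signal */
    };

    /* Dedicated vhost-library thread: sleep until the DMA engine reports
     * completions, update the ring, then notify the guest if needed. */
    static void *
    dma_completion_thread(void *arg)
    {
        struct dma_irq_ctx *ctx = arg;
        uint64_t ticks;

        for (;;) {
            if (read(ctx->efd, &ticks, sizeof(ticks)) != sizeof(ticks))
                continue;

            /* Hypothetical step: mark the DMA-copied descriptors as used
             * in the virtq before notifying the guest. */

            /* Existing public API: kicks the guest (callfd) only if
             * notifications are enabled for this vring. */
            rte_vhost_vring_call(ctx->vid, ctx->vring_idx);
        }
        return NULL;
    }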



With the virtio completions handled by DMA itself, the vhost port
turns almost into a real HW NIC.  With that we will not need any
extra manipulations from the OVS side, i.e. no need to defer any
work while maintaining clear split between rx and tx operations.

I'd vote for that.


Thanks,
Maxime

Thanks for the prompt responses, and let's understand if there is a viable,
workable way to totally hide DMA completions from the application.

Regards,  -Harry


In the absence of continued technical discussion here, I suggest Sunil and Ian
collaborate on getting the OVS defer-work approach and the DPDK VHost async
patchsets available on GitHub for easier consumption and future development
(as suggested in the slides presented on the last call).

Regards, -Harry

No inline-replies below; message just for context.

-----Original Message-----
From: Van Haaren, Harry
Sent: Wednesday, March 30, 2022 10:02 AM
To: Morten Brørup <m...@smartsharesystems.com>; Richardson, Bruce
<bruce.richard...@intel.com>
Cc: Maxime Coquelin <maxime.coque...@redhat.com>; Pai G, Sunil
<sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Hu, Jiayu
<jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; Ilya
Maximets
<i.maxim...@ovn.org>; ovs-dev@openvswitch.org; d...@dpdk.org;
Mcnamara,
John <john.mcnam...@intel.com>; O'Driscoll, Tim
<tim.odrisc...@intel.com>;
Finn, Emma <emma.f...@intel.com>
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion

-----Original Message-----
From: Morten Brørup <m...@smartsharesystems.com>
Sent: Tuesday, March 29, 2022 8:59 PM
To: Van Haaren, Harry <harry.van.haa...@intel.com>; Richardson, Bruce
<bruce.richard...@intel.com>
Cc: Maxime Coquelin <maxime.coque...@redhat.com>; Pai G, Sunil
<sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Hu, Jiayu
<jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; Ilya
Maximets
<i.maxim...@ovn.org>; ovs-dev@openvswitch.org; d...@dpdk.org;
Mcnamara,
John
<john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>;
Finn,
Emma <emma.f...@intel.com>
Subject: RE: OVS DPDK DMA-Dev library/Design Discussion

From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
Sent: Tuesday, 29 March 2022 19.46

From: Morten Brørup <m...@smartsharesystems.com>
Sent: Tuesday, March 29, 2022 6:14 PM

From: Bruce Richardson [mailto:bruce.richard...@intel.com]
Sent: Tuesday, 29 March 2022 19.03

On Tue, Mar 29, 2022 at 06:45:19PM +0200, Morten Brørup wrote:
From: Maxime Coquelin [mailto:maxime.coque...@redhat.com]
Sent: Tuesday, 29 March 2022 18.24

Hi Morten,

On 3/29/22 16:44, Morten Brørup wrote:
From: Van Haaren, Harry [mailto:harry.van.haa...@intel.com]
Sent: Tuesday, 29 March 2022 15.02

From: Morten Brørup <m...@smartsharesystems.com>
Sent: Tuesday, March 29, 2022 1:51 PM

Having thought more about it, I think that a completely different
architectural approach is required:

Many of the DPDK Ethernet PMDs implement a variety of RX and TX packet burst
functions, each optimized for different CPU vector instruction sets. The
availability of a DMA engine should be treated the same way. So I suggest that
PMDs copying packet contents, e.g. memif, pcap, vmxnet3, should implement DMA
optimized RX and TX packet burst functions.

Similarly for the DPDK vhost library.

In such an architecture, it would be the application's job to allocate DMA
channels and assign them to the specific PMDs that should use them. But the
actual use of the DMA channels would move down below the application and into
the DPDK PMDs and libraries.


Med venlig hilsen / Kind regards,
-Morten Brørup

Hi Morten,

That's *exactly* how this architecture is designed & implemented.
1.    The DMA configuration and initialization is up to the application (OVS).
2.    The VHost library is passed the DMA-dev ID through its new async rx/tx
APIs, and uses the DMA device to accelerate the copy.

Looking forward to talking on the call that just started.
Regards, -Harry


OK, thanks - as I said on the call, I haven't looked at the patches.

Then, I suppose that the TX completions can be handled in the TX function, and
the RX completions can be handled in the RX function, just like the Ethdev PMDs
handle packet descriptors:

TX_Burst(tx_packet_array):
1.    Clean up descriptors processed by the NIC chip. --> Process TX DMA
channel completions. (Effectively, the 2nd pipeline stage.)
2.    Pass on the tx_packet_array to the NIC chip descriptors. --> Pass on the
tx_packet_array to the TX DMA channel. (Effectively, the 1st pipeline stage.)
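
A rough sketch of that two-stage pattern against the public dmadev API is below;
the destination addresses and the "mark descriptors available" step are
hypothetical helpers, the point is only the ordering of the stages.

    #include <stdint.h>
    #include <stdbool.h>
    #include <rte_dmadev.h>
    #include <rte_mbuf.h>

    uint16_t
    tx_burst_dma(int16_t dma_id, uint16_t vchan, rte_iova_t *dst_iova,
                 struct rte_mbuf **pkts, uint16_t nb_pkts)
    {
        uint16_t last_idx, i, nb_enq = 0;
        bool error = false;

        /* Stage 2 of the *previous* burst: process TX DMA completions and
         * (hypothetically) make those descriptors available to the peer. */
        uint16_t done = rte_dma_completed(dma_id, vchan, UINT16_MAX,
                                          &last_idx, &error);
        (void)done; /* e.g. mark_descs_available(done); -- hypothetical */

        /* Stage 1: hand the new packets to the TX DMA channel. */
        for (i = 0; i < nb_pkts; i++) {
            if (rte_dma_copy(dma_id, vchan, rte_mbuf_data_iova(pkts[i]),
                             dst_iova[i], rte_pktmbuf_data_len(pkts[i]),
                             0) < 0)
                break;
            nb_enq++;
        }
        rte_dma_submit(dma_id, vchan); /* one doorbell per burst */

        return nb_enq;
    }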

The problem is that the Tx function might not be called again, so packets
enqueued in step 2 may never be completed from a Virtio point of view. IOW,
the packets will be copied to the Virtio descriptors' buffers, but the
descriptors will not be made available to the Virtio driver.

In that case, the application needs to call TX_Burst() periodically with an
empty array, for completion purposes.

This is what the "defer work" does at the OVS thread-level, but instead of
"brute-forcing" and *always* making the call, the defer-work concept tracks
*when* there is outstanding work (DMA copies) to be completed ("deferred work")
and calls the generic completion function at that point.

So "defer work" is generic infrastructure at the OVS thread level to handle
work that needs to be done "later", e.g. DMA completion handling.
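
To make the defer-work idea concrete, here is a minimal sketch with hypothetical
names (this is not the actual OVS patch); the point is only that the completion
hook is called per loop iteration while, and only while, a txq has outstanding
DMA copies.

    #include <stdint.h>

    /* Per-txq bookkeeping kept at the PMD-thread level. */
    struct txq_defer {
        void *netdev;          /* port owning the txq */
        int qid;               /* txq id */
        uint32_t outstanding;  /* DMA copies submitted but not completed */
    };

    /* Hypothetical stand-in for the netdev "async_process"/txq-complete
     * hook; returns how many DMA copies completed. */
    typedef uint32_t (*complete_fn)(void *netdev, int qid);

    /* Called right after submitting DMA copies for this txq. */
    static inline void
    defer_work_register(struct txq_defer *d, uint32_t n_submitted)
    {
        d->outstanding += n_submitted;
    }

    /* Called once per PMD-thread loop iteration: "no work" == "no cycles". */
    static inline void
    defer_work_run(struct txq_defer *d, complete_fn complete)
    {
        if (d->outstanding == 0)
            return;
        d->outstanding -= complete(d->netdev, d->qid);
    }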


Or some sort of TX_Keepalive() function can be added to the DPDK library, to
handle DMA completion. It might even handle multiple DMA channels, if
convenient - and if possible without locking or other weird complexity.

That's exactly how it is done: the VHost library has a new API added which
allows for handling completions. And in the "Netdev layer" (~OVS ethdev
abstraction) we add a function to allow the OVS thread to do those completions
in a new Netdev-abstraction API called "async_process", where the completions
can be checked.

The only method to abstract them is to "hide" them somewhere that will always
be polled, e.g. an ethdev port's RX function. Both the V3 and V4 approaches use
this method. This allows "completions" to be transparent to the app, at the
tradeoff of having bad separation of concerns, as Rx and Tx are now tied
together.

The point is, the Application layer must *somehow* handle completions.
So fundamentally there are 2 options for the Application level:

A) Make the application periodically call a "handle completions" function
     A1) Defer work: call when needed, and track "needed" at the app layer,
calling into the vhost txq complete as required.
             Elegant in that "no work" means "no cycles spent" on checking DMA
completions.
     A2) Brute-force-always-call, and pay some overhead when not required.
             Cycle-cost in "no work" scenarios. Depending on the # of vhost
queues, this adds up, as polling is required *per vhost txq*.
             Also note that "checking DMA completions" means taking a
virtq-lock, so this "brute-force" can needlessly increase x-thread contention!

A side note: I don't see why locking is required to test for DMA completions.
rte_dma_vchan_status() is lockless, e.g.:

https://elixir.bootlin.com/dpdk/latest/source/drivers/dma/ioat/ioat_dmadev.c#L560
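
For reference, a minimal sketch of that lockless check using the public dmadev
API (no virtq locks involved):

    #include <stdbool.h>
    #include <rte_dmadev.h>

    static inline bool
    dma_has_work_in_flight(int16_t dma_id, uint16_t vchan)
    {
        enum rte_dma_vchan_status st;

        /* Lockless query of the virtual DMA channel state. */
        if (rte_dma_vchan_status(dma_id, vchan, &st) < 0)
            return true; /* unsupported/unknown: assume work may be pending */

        return st == RTE_DMA_VCHAN_ACTIVE;
    }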

Correct, DMA-dev is "ethdev like"; each DMA-id can be used in a lockfree manner
from a single thread.

The locks I refer to are at the OVS-netdev level, as virtqs are shared across
OVS's dataplane threads. So the "M to N" comes from M dataplane threads to N
virtqs, hence requiring some locking.


B) Hide completions and live with the complexity/architectural sacrifice of
mixed-RxTx.
     Various downsides here in my opinion, see the slide deck presented earlier
today for a summary.

In my opinion, A1 is the most elegant solution, as it has a clean separation of
concerns, does not cause avoidable contention on virtq locks, and spends no
cycles when there is no completion work to do.


Thank you for elaborating, Harry.

Thanks for taking part in the discussion & providing your insight!

I strongly oppose hiding any part of TX processing in an RX function. It is
just wrong in so many ways!

I agree that A1 is the most elegant solution. And being the most elegant
solution, it is probably also the most future-proof solution. :-)

I think so too, yes.

I would also like to stress that DMA completion handling belongs in the DPDK
library, not in the application. And yes, the application will be required to
call some "handle DMA completions" function in the DPDK library. But since the
application already knows that it uses DMA, the application should also know
that it needs to call this extra function - so I consider this requirement
perfectly acceptable.

Agree here.

I prefer if the DPDK vhost library can hide its inner workings from the
application, and just expose the additional "handle completions" function. This
also means that the inner workings can be implemented as "defer work", or by
some other algorithm. And it can be tweaked and optimized later.

Yes, the choice in how to call the handle_completions function is an
Application-layer one. For OVS we designed Defer Work, V3 and V4. But it is an
App-level choice, and every application is free to choose its own method.

Thinking about the long term perspective, this design pattern is common for
both the vhost library and other DPDK libraries that could benefit from DMA
(e.g. the vmxnet3 and pcap PMDs), so it could be abstracted into the DMA
library or a separate library. But for now, we should focus on the vhost use
case, and just keep the long term roadmap for using DMA in mind.

Totally agree to keep the long term roadmap in mind; but I'm not sure we can
refactor this logic out of vhost. When DMA completions arrive, the virtq needs
to be updated; this causes a tight coupling between the DMA completion count
and the vhost library.

As Ilya raised on the call yesterday, there is an "in_order" requirement in the
vhost library: per virtq, the packets are presented to the guest "in order" of
enqueue. (To be clear, *not* in order of DMA completion! As Jiayu mentioned,
the Vhost library handles this today by re-ordering the DMA completions.)
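
For readers new to the in-order point, a conceptual sketch of the re-ordering
technique (this is not the actual vhost code): completions are recorded per
slot in enqueue order, and only the contiguous completed prefix is exposed to
the guest.

    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SZ 256u              /* illustrative, power of two */

    struct inorder_tracker {
        bool done[RING_SZ];           /* done[i]: copies for slot i finished */
        uint16_t head;                /* next slot to expose to the guest */
    };

    /* DMA reported that the copies for 'slot' finished (possibly not in
     * enqueue order, e.g. when multiple copies/channels are involved). */
    static inline void
    inorder_mark_done(struct inorder_tracker *t, uint16_t slot)
    {
        t->done[slot & (RING_SZ - 1)] = true;
    }

    /* Release only the contiguous prefix starting at head; returns how many
     * descriptors may now be made available to the guest, in enqueue order. */
    static inline uint16_t
    inorder_release(struct inorder_tracker *t)
    {
        uint16_t n = 0;

        while (t->done[t->head & (RING_SZ - 1)]) {
            t->done[t->head & (RING_SZ - 1)] = false;
            t->head++;
            n++;
        }
        return n;
    }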


Rephrasing what I said on the conference call: This vhost design will become
the common design pattern for using DMA in DPDK libraries. If we get it wrong,
we are stuck with it.

Agree, and if we get it right, then we're stuck with it too! :)


Here is another idea, inspired by a presentation at one of the DPDK Userspace
conferences. It may be wishful thinking, though:

Add an additional transaction to each DMA burst; a special transaction
containing the memory write operation that makes the descriptors available to
the Virtio driver.
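
Sketched with the public dmadev API, and purely conceptual (the address
parameters are assumptions, and this is not an existing vhost feature), the
idea would look something like the following, where the FENCE flag orders the
small index write after the payload copies:

    #include <stdint.h>
    #include <rte_dmadev.h>

    /* Enqueue one extra, fenced copy that writes the updated used-ring index
     * into guest-visible memory after all previously enqueued payload copies
     * on this vchan have been processed. */
    static int
    dma_enqueue_index_update(int16_t dma_id, uint16_t vchan,
                             rte_iova_t new_idx_iova,  /* host staging copy  */
                             rte_iova_t used_idx_iova) /* guest-visible idx  */
    {
        return rte_dma_copy(dma_id, vchan, new_idx_iova, used_idx_iova,
                            sizeof(uint16_t),
                            RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT);
    }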


That is something that can work, so long as the receiver is operating in
polling mode. For cases where virtio interrupts are enabled, you still need to
do a write to the eventfd in the kernel in vhost to signal the virtio side.
That's not something that can be offloaded to a DMA engine, sadly, so we still
need some form of completion call.
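
To make that concrete: signalling the guest when interrupts are enabled
ultimately boils down to a write() on the vring's call eventfd, i.e. a syscall
into the kernel, which then injects the interrupt - something a DMA engine
cannot issue. A minimal illustration:

    #include <stdint.h>
    #include <unistd.h>

    /* callfd: the eventfd the frontend provided via the vring-call setup
     * message; writing to it triggers the guest interrupt. */
    static void
    signal_guest(int callfd)
    {
        uint64_t kick = 1;

        if (write(callfd, &kick, sizeof(kick)) < 0) {
            /* Sketch only; real code would log the failure. */
        }
    }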

I guess that virtio interrupts are the most widely deployed scenario, so let's
ignore the DMA TX completion transaction for now - and call it a possible
future optimization for specific use cases. So it seems that some form of
completion call is unavoidable.

Agree to leave this aside; there is in theory a potential optimization, but it
is unlikely to be of large value.


One more thing: When using DMA to pass packets on into a guest, there could be
a delay from when the DMA completes until the guest is signaled. Is there any
CPU cache hotness regarding the guest's access to the packet data to consider
here? I.e. if we wait to signal the guest, the packet data may get cold.

Interesting question; we can likely spawn a new thread around this topic!
In short, it depends on how/where the DMA hardware writes the copy.

With technologies like DDIO, the "dest" part of the copy will be in LLC. The
core reading the dest data will benefit from the LLC locality (instead of
snooping it from a remote core's L1/L2).

Delays in notifying the guest could result in LLC capacity eviction, yes.
The application layer decides how often/promptly to check for completions and
notify the guest of them. Calling the function more often will result in less
delay in that portion of the pipeline.

Overall, there are caching benefits with DMA acceleration, and the application
can control the latency introduced between dma-completion done in HW and guest
vring update.





