Hi Heng Qi,

Sorry for joining this conversation a little late.
Your email has a very useful summary.

Unfortunately, non-text (HTML) content doesn't get archived.
So I am changing the format to plain text to capture your useful comments.
If you can change your email client settings to plain-text mode, it will be easier
to converse.

We have an equal interest in efficient split header support and would like to work
on it together with you.
Please find my responses below the "Response" tag at the end of the email, to avoid
top posting.

From: virtio-comm...@lists.oasis-open.org <virtio-comm...@lists.oasis-open.org> 
On Behalf Of hengqi
Sent: Tuesday, January 31, 2023 4:23 AM
To: virtio-dev <virtio-dev@lists.oasis-open.org>; virtio-comment 
<virtio-comm...@lists.oasis-open.org>
Cc: Michael S. Tsirkin <m...@redhat.com>; Jason Wang <jasow...@redhat.com>; 
Cornelia Huck <coh...@redhat.com>; Kangjie Xu <kangjie...@linux.alibaba.com>; 
Xuan Zhuo <xuanz...@linux.alibaba.com>
Subject: [virtio-comment] Re: [virtio-dev] [PATCH v8] virtio_net: support for
split transport header

Hi, all.

Split header is a technique with important applications. For example, Eric
(https://lwn.net/Articles/754681/) and Jonathan Lemon
(https://lore.kernel.org/io-uring/20221007211713.170714-1-jonathan.le...@gmail.com/T/#m678770d1fa7040fd76ed35026b93dfcbf25f6196)
each implement receive zero-copy, and both approaches have one thing in common:
the header and the payload need to be in separate buffers. Eric's method
additionally requires the payload to be page-aligned.

We implemented zero-copy on the virtio-net driver according to Eric's method. 
The commands and
environment are as follows:
# environment
VM1<---->vhost-user<->OVS<->vhost-user<---->VM2
CPU model name: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
kernel version: 6.0

# commands (linux/tools/testing/selftests/net)
./tcp_mmap -s -z -4 -p 1000 &
./tcp_mmap -H 10.0.0.2 -z -4 -p 1000

The performance data is as follows (implemented according to the split header 
v7 version,
https://lists.oasis-open.org/archives/virtio-dev/202209/msg00004.html):
# direct copy
17.6604 s 10.08 s
# zero copy
1.9 GB/s 3.3 GB/s

We have discussed this a lot before; the core point is the choice between method A
and method C, and we seem unable to reach an agreement on it. Given the summary
above and the previous discussion
(https://lists.oasis-open.org/archives/virtio-dev/202210/msg00017.html),
how can we resolve this conflict and let this important feature move forward?
I really need your help. Cc Jason, Michael, Cornelia, Xuan.

Thanks.
------------------------------------------------------------------
From: Heng Qi <mailto:hen...@linux.alibaba.com>
Sent: Thursday, October 20, 2022 16:34
To: Jason Wang <mailto:jasow...@redhat.com>
Cc: Michael S. Tsirkin <mailto:m...@redhat.com>; Xuan Zhuo
<mailto:xuanz...@linux.alibaba.com>; Virtio-Dev
<mailto:virtio-dev@lists.oasis-open.org>; Kangjie Xu
<mailto:kangjie...@linux.alibaba.com>
Subject: Re: [virtio-dev] [PATCH v8] virtio_net: support for split transport header

On Sat, Oct 08, 2022 at 12:37:45PM +0800, Jason Wang wrote:
> On Thu, Sep 29, 2022 at 3:04 PM Michael S. Tsirkin <mailto:m...@redhat.com> wrote:
> >
> > On Thu, Sep 29, 2022 at 09:48:33AM +0800, Jason Wang wrote:
> > > > On Wed, Sep 28, 2022 at 9:39 PM Michael S. Tsirkin <mailto:m...@redhat.com> wrote:
> > > >
> > > > On Mon, Sep 26, 2022 at 04:06:17PM +0800, Jason Wang wrote:
> > > > > > Jason I think the issue with previous proposals is that they conflict
> > > > > > with VIRTIO_F_ANY_LAYOUT. We have repeatedly found that giving the
> > > > > > driver flexibility in arranging the packet in memory is beneficial.
> > > > >
> > > > >
> > > > > Yes, but I didn't find how it can conflict with any_layout. The device can just
> > > > > not split the header when the layout doesn't fit for header splitting.
> > > > > (And this seems the case even if we're using buffers).
> > > >
> > > > Well spec says:
> > > >
> > > >         indicates to both the device and the driver that no
> > > >         assumptions were made about framing.
> > > >
> > > > if device assumes that descriptor boundaries are where
> > > > driver wants packet to be stored that is clearly
> > > > an assumption.
> > >
> > > Yes but what I want to say is, the device can choose to not split the
> > > packet if the framing doesn't fit. Does it still comply with the above
> > > description?
> > >
> > > Thanks
> >
> > The point of ANY_LAYOUT is to give drivers maximum flexibility.
> > For example, if driver wants to split the header at some specific
> > offset this is already possible without extra functionality.
> 
> I'm not sure how this would work without the support from the device.
> This probably can only work if:
> 
> 1) the driver know what kind of packet it can receive
> 2) protocol have fixed length of the header
> 
> This is probably not true consider:
> 
> 1) TCP and UDP have different header length
> 2) IPv6 has a variable length header
> 
> 
> >
> > Let's keep it that way.
> >
> > Now, let's formulate what are some of the problems with the current way.
> >
> >
> >
> > A- mergeable buffers is even more flexible, since a single packet
> >   is built up of multiple buffers. And in theory device can
> >   choose arbitrary set of buffers to store a packet.
> >   So you could supply a small buffer for headers followed by a bigger
> >   one for payload, in theory even without any changes.
> >   Problem 1: However since this is not how devices currently operate,
> >   a feature bit would be helpful.
> 
> How do we know the bigger buffer is sufficient for the packet? If we
> try to allocate 64K (not sufficient for the future even) it breaks the
> effort of the mergeable buffer:
> 
> header buffer #1
> payload buffer #1
> header buffer #2
> payload buffer #2
> 
> Is the device expected to
> 
> 1) fill payload in header buffer #2, this breaks the effort that we
> want to make payload page aligned
> 2) skip header buffer #2, in this case, the device assumes the framing
> when it breaks any layout
> 
> >
> >   Problem 2: Also, in the past we found it useful to be able to figure out whether
> >   packet fits in a single buffer without looking at the header.
> >   For this reason, we have this text:
> >
> >         If a receive packet is spread over multiple buffers, the device
> >         MUST use all buffers but the last (i.e. the first \field{num_buffers} -
> >         1 buffers) completely up to the full length of each buffer
> >         supplied by the driver.
> >
> >   if we want to keep this optimization and allow using a separate
> >   buffer for headers, then I think we could rely on the feature bit
> >   from Problem 1 and just make an exception for the first buffer.
> >   Also num_buffers is then always >= 2, maybe state this to avoid
> >   confusion.
> >
> >
> >
> >
> >
> > B- without mergeable, there's no flexibility. In particular, there can
> > not be uninitialized space between header and data.
> 
> I had two questions
> 
> 1) why is this not a problem of mergeable? There's no guarantee that
> the header is just the length of what the driver allocates for header
> buffer anyhow
> 
> E.g the header length could be smaller than the header buffer, the
> device still needs to skip part of the space in the header buffer.
> 
> 2) it should be the responsibility of the driver to handle the
> uninitialized space, it should do anything that is necessary for
> security, more below
> 


We've talked a bit more about split header so far, but there still seem to
be some issues, so let's recap.

I. Method Discussion Review

In order to adapt to Eric's TCP receive interface and achieve zero copy, the header
and payload must be stored separately, with the payload stored in a page-aligned
way. Therefore, we have discussed several options for split header, as follows:

1. Method A (depends on the descriptor chain)
|                         receive buffer                            | 
|              0th descriptor                      | 1st descriptor |
| virtnet hdr | mac | ip hdr | tcp hdr|<-- hold -->|      payload   | 
Method A uses a buffer plus a separate page when allocating the receive buffer.
In this way, we can ensure that every payload is placed independently in a page,
which is very beneficial for the zero copy implemented by the upper layer.

The advantage of method A is that the implementation is clearer: it can support
normal header split and its fallback conditions, and it can also easily support
XDP. The downside is that a device operating directly on the descriptor chain may
cause a layering violation and may also hurt performance.
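
Purely for illustration (this is not from the actual patch; HDR_BUF_LEN, rq->vq and
the surrounding driver context are assumed), a driver following method A might post
each receive buffer as a two-descriptor chain roughly like this:

/* Sketch only: one receive buffer is a chain of two descriptors, a small
 * header area (virtio-net hdr + packet headers) followed by a whole page
 * reserved for the payload. Error handling is omitted. */
struct scatterlist sg[2];
void *hdr_buf = kmalloc(HDR_BUF_LEN, GFP_ATOMIC);   /* 0th descriptor */
struct page *payload = alloc_page(GFP_ATOMIC);      /* 1st descriptor */

sg_init_table(sg, 2);
sg_set_buf(&sg[0], hdr_buf, HDR_BUF_LEN);
sg_set_page(&sg[1], payload, PAGE_SIZE, 0);

/* hand the two-entry chain to the device as a single receive buffer */
err = virtqueue_add_inbuf(rq->vq, sg, 2, hdr_buf, GFP_ATOMIC);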

2. Method B (depends on mergeable buffers)
|                   receive buffer (page)                                 | receive buffer (page) |
| <-- offset(hold) --> | virtnet hdr | mac | ip hdr | tcp hdr|<-- hold -->|        payload        |
^
|
pointer to device

Method B is based on your previous suggestion: it is implemented on top of mergeable
buffers, filling a separate page each time.

If split header is negotiated and the packet can be successfully split by the
device, the device needs to find at least two buffers, namely two pages: one for
the virtio-net header and transport header, and the other for the payload.

The advantage of method B is that it relies on mergeable buffers instead of the
descriptor chain. It overcomes the shortcoming of method A and achieves the goal of
having the device focus on buffers instead of descriptors. Its disadvantage is that
it wastes memory.

3. Method C (depends on mergeable buffers)
| small buffer | data buffer (page) | small buffer | data buffer (page) | small buffer | data buffer (page) |

Method B fills a separate page each time, while method C fills small buffers and
page buffers separately: method C puts the header in a small buffer and the payload
in a page.

The advantage of method C is that separate buffers are filled for headers and data
respectively, which reduces the memory waste of method B. However, with this method
it is difficult to balance the number of header buffers and data buffers to fill,
and an unreasonable proportion will hurt performance. For example, in a scenario
with a large number of large packets, too many header buffers will hurt performance,
while in a scenario with a large number of small packets, too many data buffers will
also hurt performance. At the same time, if protocols that carry a large number of
packets do not support split header, the presence of the header buffers will also
hurt performance. (A rough driver-side sketch follows.)
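
For comparison with method A, here is a rough sketch of how a driver following
method C might refill a mergeable-buffer RQ (again, HDR_BUF_LEN, rq->vq and the 1:1
posting ratio shown here are only illustrative assumptions):

/* Sketch only: header buffers and page-sized data buffers are posted as
 * independent single-descriptor buffers; choosing the ratio between the
 * two kinds is exactly the tuning problem described above. */
struct scatterlist sg;
void *hdr = kmalloc(HDR_BUF_LEN, GFP_ATOMIC);    /* small header buffer */
struct page *data = alloc_page(GFP_ATOMIC);      /* page for the payload */

sg_init_one(&sg, hdr, HDR_BUF_LEN);
virtqueue_add_inbuf(rq->vq, &sg, 1, hdr, GFP_ATOMIC);

sg_init_one(&sg, page_address(data), PAGE_SIZE);
virtqueue_add_inbuf(rq->vq, &sg, 1, data, GFP_ATOMIC);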

II. Points of agreement and disagreement

1. What we have agreed upon so far:
None of the three methods breaks VIRTIO_F_ANY_LAYOUT; they all keep the virtio-net
header and the packet header stored together.

We have also agreed to relax the following requirement in the split header scenario:
 "indicates to both the device and the driver that no assumptions were made about framing."
because when a bigger packet arrives and a data buffer is not enough to hold it,
the device must either skip the next header buffer, breaking what the spec says
above, or not skip the header buffer and fail to make the payload page aligned.
Therefore, all three methods need to relax the above requirement.

2. What we have not yet agreed upon:
We have not had a more precise discussion of which approach to take and are still
bouncing between approaches. At present, all three approaches seem to meet our
requirements, but each has advantages and disadvantages. Should we focus on the
most important criteria, such as performance, to choose? It seems a little
difficult to cover everything.

III. Two forms of implementing receive zerocopy

Eric's TCP receive interface requires the header and payload to be stored in
separate buffers, with the payload stored in a page-aligned way.

Now io_uring also proposes a new receive zerocopy method, which requires the header
and payload to be stored in separate buffers but does not require the payload to be
page aligned:
https://lore.kernel.org/io-uring/20221007211713.170714-1-jonathan.le...@gmail.com/T/#m678770d1fa7040fd76ed35026b93dfcbf25f6196

Response....

Page alignment requirements should not come from the virtio spec.
There are a variety of cases that may use non-page-aligned data buffers:
a. A kernel-only consumer that has no mmap requirement.
b. A VQ accessed directly from user space, which may also work without page
alignment.
c. A system with a 64k page size, where page-aligned memory has a fair amount of
wastage.
d. The io_uring example you pointed to, which also has non-page-aligned uses.

So let the driver deal with any alignment restriction, outside of the virtio spec.

In header/data split cases, data buffer utilization is more important than the
utilization of the tiny header buffers.
How about having the headers not interfere with the data buffers at all?

In other words, say a given RQ is optionally linked to a circular queue of header
buffers. All header buffers are the same size and are supplied one time. This header
size and the circular queue address are configured once, at RQ creation time.

With this, the device doesn't need to process a header buffer size for every single
incoming packet. Data buffers can continue as chains, or merged mode can be
supported. When a received packet's header cannot fit, the packet continues as-is in
the data buffer. The virtio-net header, as suggested, indicates whether a header
buffer was used and its offset/index.
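
To make the idea concrete, a very rough sketch of such a one-time RQ configuration
(the structure and field names below are purely illustrative, not a proposed spec
layout):

struct virtio_net_rq_hdr_ring_cfg {     /* illustrative only */
        le64 hdr_ring_addr;     /* base address of the circular header buffer area */
        le32 hdr_buf_size;      /* size of every header buffer, e.g. 128 bytes */
        le32 hdr_ring_entries;  /* number of header buffers in the ring */
};

Per received packet, the device would then only report (for example in the
virtio-net header) whether a header buffer was used and its index/offset, while the
driver only advances a tail index as header buffers are reclaimed.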

This method has a few benefits for performance and buffer efficiency, as below.
1. Data buffers can be directly mapped at best utilization
2. The device doesn't need to match up per-packet header sizes and descriptor
sizes, which is efficient for the device to implement.
3. No need to keep reposting the header buffers; only the tail index needs to be
updated. This directly gives a 50% cycle reduction in buffer traversal on the
driver side in the rx path.
4. Ability to share this header buffer queue among multiple RQs if needed.
5. In the future there may be an extension to place tiny packets entirely in the
header buffer, when the whole packet fits there.
6. The device can always fall back to placing the packet header in the data buffer
when a header buffer is not available or is smaller than a newer protocol's header.
7. Because the header buffers come from virtually contiguous memory and are not
intermixed with data buffers, there are no small per-header allocations.
8. It also works in both chained and merged mode.
9. Memory utilization for an RQ of depth 256: with a 4K page size, data buffers
total 1M, and header buffers total 256 * 128 bytes, which is only 3% of the data
buffer memory (the arithmetic is spelled out below).
So, in the worst case, when no packet uses the header buffers, the wastage is only 3%.
When a high number of packets larger than 4K use the header buffer, say 8K packets,
header buffer utilization is 50%, so the wastage is only 1.5%.
With 1500-MTU-sized data buffers in merged mode, the header buffer memory is also
< 10% of the data buffer memory.
All three cases are in a very manageable range of buffer utilization.
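
Spelling out the arithmetic above (assuming 256 header buffer entries of 128 bytes
each, matching the RQ depth):

  data buffers : 256 * 4K   = 1M
  hdr buffers  : 256 * 128B = 32K  -> 32K / 1M ~= 3%  (worst-case wastage)
  8K packets   : 2 data pages per packet -> 128 of 256 hdr buffers used
                 -> 50% hdr buffer utilization, ~1.5% wastage
  1500-MTU merged-mode data buffers: 128B / ~1500B ~= 8.5%  (< 10%)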

Crafting the feature bits and modifying the virtio-net header from your v7 version
to get there is not difficult if we like this approach.
