On 10/28/07, Paul Brook <[EMAIL PROTECTED]> wrote:
> > I changed Slirp output to use vectored IO to avoid the slowdown from
> > memcpy (see the patch for the work in progress, gives a small
> > performance improvement). But then I got the idea that using AIO would
> > be nice at the outgoing end of the network IO processing. In fact,
> > vectored AIO model could even be used for the generic DMA! The benefit
> > is that no buffering or copying should be needed.
>
> An interesting idea, however I don't want to underestimate the difficulty of
> implementing this correctly. I suspect to get real benefits you need to
> support zero-copy async operation all the way through. Things get really
> hairy if you allow some operations to complete synchronously, and some to be
> deferred.

Zero-copy can be the first goal, async may come later. I hope we can do the
change in stages, perhaps introducing temporary conversion helpers as needed.
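To make the vectored-output idea concrete, here is a minimal sketch (not taken
from the patch; the fragment layout and names are invented) of sending a packet
split into fragments with writev() instead of flattening it with memcpy first:

/* Hypothetical sketch: gather a fragmented packet with writev() instead of
 * flattening it into one bounce buffer.  Layout and names are illustrative;
 * the real Slirp patch may look different. */
#include <sys/uio.h>
#include <unistd.h>
#include <string.h>

struct frag {
    const void *data;
    size_t len;
};

/* Copying version: one memcpy per fragment into a bounce buffer. */
static ssize_t send_copy(int fd, const struct frag *frags, int n)
{
    char bounce[65536];
    size_t off = 0;
    for (int i = 0; i < n; i++) {
        memcpy(bounce + off, frags[i].data, frags[i].len);
        off += frags[i].len;
    }
    return write(fd, bounce, off);
}

/* Vectored version: build an iovec and let the kernel gather the pieces. */
static ssize_t send_vectored(int fd, const struct frag *frags, int n)
{
    struct iovec iov[16];
    int m = n < 16 ? n : 16;
    for (int i = 0; i < m; i++) {
        iov[i].iov_base = (void *)frags[i].data;
        iov[i].iov_len = frags[i].len;
    }
    return writev(fd, iov, m);
}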
> I've done async operation for SCSI and USB. The latter is really not pretty,
> and the former has some notable warts. A generic IODMA framework needs to
> make sure it covers these requirements without making things worse. Hopefully
> it'll also help fix the things that are wrong with them.
>
> > For the specific Sparc32 case, unfortunately Lance bus byte swapping
> > makes buffering necessary at that stage, unless we can make N vectors
> > with just a single byte faster than memcpy + bswap of memory block
> > with size N.
>
> We really want to be dealing with largeish blocks. The {ptr,size} vector is 64
> or 128 bytes per element, so the overhead on blocks < 64 bytes is going to be
> really brutal. Also time taken to do address translation will be O(number of
> vectors).

That's what I suspected as well.

> > Inside Qemu the vectors would use target physical addresses (struct
> > qemu_iovec), but at some point the addresses would change to host
> > pointers suitable for real AIO.
>
> Phrases like "at some point" worry me :-)
>
> I think it would be good to get a top-down description of what each different
> entity (initiating device, host endpoint, bus translation, memory) is
> responsible for, and how they all fit together.
>
> I have some ideas, but without more detailed investigation can't tell if they
> will actually work in practice, or if they fit into the code fragments you've
> posted. My suspicion is they don't as I can't make head or tail of how your
> gdma_aiov.diff patch would be used in practice.

Ok, I'll try a mental exercise with this chain:
SCSI -> ESP -> ESPDMA -> IOMMU -> memory write.

Scenario: a SCSI read is issued, 8k in size. I'll track the (address, size)
vectors at each stage. scsi-disk uses host memory addresses. ESP uses
addresses ranging from 0 to the end of the request. ESPDMA forces the MS byte
to be 0xfc. The IOMMU translates page 0xfc000000 to 0x1000 and page
0xfc001000 to 0x4000. Memory translates 0x1000 to phys_ram_base + 0x1000,
likewise for 0x4000. From this point on, we are using host memory addresses
again. Each stage may change the callback if needed.

Currently scsi-disk provides a buffer. For true zero copy, this needs to be
changed so that the buffer is instead provided by the caller at each stage
until we reach host memory, but I'll use the scsi-disk buffer for now.

Initially the (address, size) vector provided by scsi-disk is:
src_vector = (&SCSIDevice->SCSIRequest->dma_buf[0], 8192).
What's the destination vector, (NULL, 0)?

scsi-disk calls bus_write_north, which transfers control to ESP. ESP changes
the vectors to:
src_vector = (&SCSIDevice->SCSIRequest->dma_buf[0], 8192),
dst_vector = (0, 8192),
and calls bus_write_north -> ESPDMA.

ESPDMA: src (&SCSIDevice->SCSIRequest->dma_buf[0], 8192),
dst (0xfc000000, 8192). -> IOMMU.

After IOMMU: src (&SCSIDevice->SCSIRequest->dma_buf[0], 8192),
dst ((0x1000, 4096), (0x4000, 4096)).

After memory: src (&SCSIDevice->SCSIRequest->dma_buf[0], 8192),
dst ((phys_ram_base + 0x1000, 4096), (phys_ram_base + 0x4000, 4096)).

scsi-disk or memory (which?) can now perform the memcpy. But we now also have
the information to perform the disk read without copying. Do we need the
source vectors at all?
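As a sketch of the destination-side translation in the read example above
(the 0xfc000000 -> 0x1000 and 0xfc001000 -> 0x4000 mappings and phys_ram_base
come from the example; the iovec_entry type and the *_translate function names
are invented for illustration):

/* Sketch of the destination vector translation from the SCSI read example.
 * Only the two IOMMU mappings from the example are modelled; all names are
 * illustrative, not existing QEMU code. */
#include <stdint.h>

#define PAGE_SIZE 4096
uint8_t *phys_ram_base;            /* set up by the emulator elsewhere */

typedef struct {
    uint64_t addr;                 /* bus/physical address or host pointer */
    uint64_t len;
} iovec_entry;

/* ESPDMA: force the most significant byte of the address to 0xfc. */
static void espdma_translate(iovec_entry *v, int n)
{
    for (int i = 0; i < n; i++)
        v[i].addr = 0xfc000000 | (v[i].addr & 0x00ffffff);
}

/* IOMMU: translate page by page, possibly splitting one entry into several.
 * Only the two mappings from the example are known here. */
static uint64_t iommu_page(uint64_t va)
{
    switch (va & ~(uint64_t)(PAGE_SIZE - 1)) {
    case 0xfc000000: return 0x1000;
    case 0xfc001000: return 0x4000;
    default:         return 0;     /* fault; ignored in this sketch */
    }
}

static int iommu_translate(const iovec_entry *in, int n,
                           iovec_entry *out, int max_out)
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        uint64_t addr = in[i].addr, left = in[i].len;
        while (left > 0 && m < max_out) {
            uint64_t in_page = PAGE_SIZE - (addr & (PAGE_SIZE - 1));
            uint64_t chunk = left < in_page ? left : in_page;
            out[m].addr = iommu_page(addr) | (addr & (PAGE_SIZE - 1));
            out[m].len = chunk;
            m++;
            addr += chunk;
            left -= chunk;
        }
    }
    return m;
}

/* Memory: turn target physical addresses into host pointers. */
static void memory_translate(iovec_entry *v, int n)
{
    for (int i = 0; i < n; i++)
        v[i].addr = (uintptr_t)(phys_ram_base + v[i].addr);
}

Applied in sequence to the single ESP destination entry (0, 8192), this yields
(0xfc000000, 8192), then ((0x1000, 4096), (0x4000, 4096)), and finally two host
pointers that the disk read (or the memcpy from dma_buf) could target directly.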
Let's try the other direction, a SCSI write. Other parameters are unchanged.
Now the destination is the scsi-disk buffer and the source vector will be
translated:
src = (NULL, 0), dst = (&SCSIDevice->SCSIRequest->dma_buf[0], 8192).

scsi-disk calls bus_read_north, which transfers control to ESP.
ESP: src = (0, 8192), dst unchanged.
ESPDMA: src = (0xfc000000, 8192), dst unchanged.
IOMMU: src ((0x1000, 4096), (0x4000, 4096)).
Memory: src ((phys_ram_base + 0x1000, 4096), (phys_ram_base + 0x4000, 4096)).

Having done this exercise, I think we only need a translation function. It
changes the addresses and adds a callback to handle the intermediate buffers.

Let's try this improved model in a more complex scenario:
SLIRP TCP socket (host) -> SLIRP IP -> SLIRP interface -> Lance -> LEDMA ->
IOMMU -> memory.

SLIRP IP adds the IP headers, the SLIRP Ethernet link adds the Ethernet
headers. LEDMA must buffer the data to perform byte swapping, and the MS byte
will be forced to 0xfc. For the IOMMU we reuse the disk parameters.

We need to give a buffer to the host recvmsg(). How can we determine the
buffer size? We really know that only after the packet has been received.
Let's pick a packet size of 4096.

TCP: (0, 4096)

IP adds iphdr: ((0, sizeof(iphdr)), (sizeof(iphdr), 4096))

Link adds the Ethernet header: ((0, sizeof(ethhdr)),
(sizeof(ethhdr), sizeof(iphdr)), (sizeof(ethhdr) + sizeof(iphdr), 4096))

Lance searches the receive descriptors for a buffer. We need to set a bit to
indicate that the buffer is in use, so a callback is needed. The translated
vectors are:
((0x1000, sizeof(ethhdr)), (0x1000 + sizeof(ethhdr), sizeof(iphdr)),
(0x1000 + sizeof(ethhdr) + sizeof(iphdr), 4096)),
callback lance_rmd_store.

LEDMA provides a bswap translation buffer, so for TCP the final vectors are:
((&DMAState->lebuffer[0], s(eth)), (&DMAState->lebuffer[s(eth)], s(ip)),
(&DMAState->lebuffer[s(eth) + s(ip)], 4096)),
callbacks (le_bswap_buffer, lance_rmd_store).

TCP recv writes to lebuffer with AIO, and the callback le_bswap_buffer is
issued. le_bswap_buffer performs the bswap and wants to copy the buffer to
target memory. The new vector is:
(0xfc000000, 4k + stuff) (can we merge the vectors at this point?),
callback lance_rmd_store.

IOMMU: ((0x1000, 4096), (0x4000, stuff))

Memory: ((phys_ram_base + 0x1000, 4096), (phys_ram_base + 0x4000, stuff))

lance_rmd_store is called: it does a memcpy from lebuffer to the destination,
updates the descriptor (another translation), raises the IRQ and so on.

I think this should work.
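As a rough sketch of how the callback chaining at the LEDMA/Lance end could
look (lebuffer, le_bswap_buffer and lance_rmd_store are the names used above;
the types, the callback signature and the 16-bit swap are assumptions): the
AIO completion first runs the bswap callback on the bounce buffer, which then
hands the already-translated destination vectors to the descriptor-store
callback.

/* Sketch of chained completion callbacks for the LEDMA byte-swap buffer.
 * Names mirror the discussion above; the code is illustrative only. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    uint8_t *addr;                   /* already translated to host pointers */
    size_t len;
} iovec_entry;

typedef void (*dma_cb)(void *opaque, iovec_entry *dst, int dst_n,
                       uint8_t *buf, size_t len);

typedef struct {
    uint8_t lebuffer[65536];         /* LEDMA bounce buffer */
    dma_cb next;                     /* next callback in the chain */
    void *next_opaque;
} DMAState;

/* Final stage: copy the swapped data into guest memory; the real device
 * would also update the Lance receive descriptor and raise the IRQ. */
static void lance_rmd_store(void *opaque, iovec_entry *dst, int dst_n,
                            uint8_t *buf, size_t len)
{
    size_t off = 0;
    for (int i = 0; i < dst_n && off < len; i++) {
        size_t chunk = dst[i].len < len - off ? dst[i].len : len - off;
        memcpy(dst[i].addr, buf + off, chunk);
        off += chunk;
    }
    /* ...mark the receive descriptor as owned by the guest, raise IRQ... */
}

/* LEDMA stage: the Lance bus is byte swapped, so (in this sketch) swap
 * 16-bit lanes in the bounce buffer, then hand on to the next callback. */
static void le_bswap_buffer(void *opaque, iovec_entry *dst, int dst_n,
                            uint8_t *buf, size_t len)
{
    DMAState *s = opaque;
    for (size_t i = 0; i + 1 < len; i += 2) {
        uint8_t t = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = t;
    }
    s->next(s->next_opaque, dst, dst_n, buf, len);
}

With this chaining, the only copy left is the one out of lebuffer into guest
memory, which the Lance byte swapping forces anyway; everything upstream of
LEDMA could stay zero copy.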