Here are some notes about performance that I prepared a while ago.

> TX is "packets from the guest", RX is "packets for the guest".
> 
> For discussion purposes, here's how the TX path works (this is the fast
> case - if there are resource shortages, ring fills, etc. things are more
> complex):
> 
> domU: xnf is passed a packet chain (typically only a single packet). It:
>       - flattens the message to a single mblk which is contained in a
>         single page (this might be a no-op),
>       - allocates a grant reference,
>       - grants the backend access to the page containing the packet,
>       - gets a slot in the tx ring,
>       - updates the tx ring,
>       - hypercall notifies the backend that a packet is ready.
>       
>       The TX ring is cleaned lazily, usually when getting a slot from
>       the ring fails. Cleaning the ring results in freeing any buffers
>       that were used for transmit.
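> 
>      To make the per-packet work concrete, here's a rough sketch of
>      that transmit step (fast path only, no error handling). The ring
>      macros and netif_tx_request_t come from the Xen headers; the
>      xnf_t fields and the mfn_of(), save_tx_buffer() and
>      notify_backend() helpers are illustrative names rather than the
>      real driver symbols:
> 
>      /*
>       * Sketch only: xnf-style transmit of a single packet.  Assumes
>       * <sys/stream.h>, <sys/strsun.h> and the Xen io/ring.h and
>       * io/netif.h headers.
>       */
>      static int
>      sketch_xnf_tx_one(xnf_t *xnfp, mblk_t *mp)
>      {
>          netif_tx_request_t *txreq;
>          grant_ref_t gref;
>          mblk_t *flat;
>          int notify;
> 
>          /*
>           * Flatten the chain to a single mblk; the real driver also
>           * makes sure the result sits within one page.
>           */
>          if (mp->b_cont != NULL) {
>              if ((flat = msgpullup(mp, -1)) == NULL)
>                  return (ENOMEM);
>              freemsg(mp);
>              mp = flat;
>          }
> 
>          /* Grant dom0 read access to the page holding the packet. */
>          gref = gnttab_claim_grant_reference(&xnfp->xnf_gref_head);
>          gnttab_grant_foreign_access_ref(gref, xnfp->xnf_backend_id,
>              mfn_of(mp->b_rptr), 1 /* read-only */);
> 
>          /* Take a slot in the TX ring and describe the packet. */
>          txreq = RING_GET_REQUEST(&xnfp->xnf_tx_ring,
>              xnfp->xnf_tx_ring.req_prod_pvt);
>          txreq->gref = gref;
>          txreq->offset = (uintptr_t)mp->b_rptr & PAGEOFFSET;
>          txreq->size = MBLKL(mp);
>          /* Remember the buffer/gref for lazy cleaning later. */
>          txreq->id = save_tx_buffer(xnfp, mp, gref);
>          xnfp->xnf_tx_ring.req_prod_pvt++;
> 
>          /* Push the request; notify the backend only if needed. */
>          RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(&xnfp->xnf_tx_ring, notify);
>          if (notify)
>              notify_backend(xnfp); /* event channel hypercall */
> 
>          return (0);
>      }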
> 
> dom0: xnb receives an interrupt to say that the xnf sent one or more
>       packets. It:
>       - for each consumed slot in the ring:
>         - add the grant reference of the page containing the packet to a
>           list.
>       - hypercall to map all of the pages for which we have grant
>         references.
>       - for each consumed slot in the ring:
>         - allocate an mblk for the packet.
>         - copy data from the granted page to the mblk.
>         - store mblk in a list.
>       - hypercall to unmap all of the granted pages.
>       - pass the packet chain down to the NIC (typically a VNIC).
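> 
>      Roughly, in code (again a sketch: the gnttab structures, ring
>      macros and GNTMAP_* flags are from the Xen headers, while the
>      xnb_t fields, the pre-allocated mapping addresses and the
>      missing error handling are illustrative short-cuts):
> 
>      /*
>       * Sketch only: xnb-style reaping of consumed TX ring slots
>       * into an mblk chain for the NIC.
>       */
>      static mblk_t *
>      sketch_xnb_from_peer(xnb_t *xnbp)
>      {
>          gnttab_map_grant_ref_t map[NET_TX_RING_SIZE];
>          gnttab_unmap_grant_ref_t unmap[NET_TX_RING_SIZE];
>          mblk_t *head = NULL, **tail = &head;
>          RING_IDX cons, prod;
>          int i, n = 0;
> 
>          prod = xnbp->xnb_tx_ring.sring->req_prod;
> 
>          /* Pass 1: collect the grant reference of each consumed slot. */
>          for (cons = xnbp->xnb_tx_ring.req_cons; cons != prod;
>              cons++, n++) {
>              netif_tx_request_t *txreq =
>                  RING_GET_REQUEST(&xnbp->xnb_tx_ring, cons);
> 
>              map[n].ref = txreq->gref;
>              map[n].dom = xnbp->xnb_peer;
>              map[n].flags = GNTMAP_host_map | GNTMAP_readonly;
>              map[n].host_addr = (uintptr_t)xnbp->xnb_tx_va[n];
>          }
> 
>          /* One hypercall maps every granted page into dom0. */
>          (void) HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, map, n);
> 
>          /* Pass 2: copy each packet into a freshly allocated mblk. */
>          cons = xnbp->xnb_tx_ring.req_cons;
>          for (i = 0; i < n; i++, cons++) {
>              netif_tx_request_t *txreq =
>                  RING_GET_REQUEST(&xnbp->xnb_tx_ring, cons);
>              mblk_t *mp = allocb(txreq->size, BPRI_MED);
> 
>              bcopy(xnbp->xnb_tx_va[i] + txreq->offset, mp->b_wptr,
>                  txreq->size);
>              mp->b_wptr += txreq->size;
>              *tail = mp;
>              tail = &mp->b_next;
> 
>              unmap[i].host_addr = map[i].host_addr;
>              unmap[i].handle = map[i].handle;
>              unmap[i].dev_bus_addr = 0;
>          }
> 
>          /* One hypercall unmaps them all again. */
>          (void) HYPERVISOR_grant_table_op(GNTTABOP_unmap_grant_ref,
>              unmap, n);
>          xnbp->xnb_tx_ring.req_cons = cons;
> 
>          /* The caller passes this chain down to the NIC. */
>          return (head);
>      }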
> 
> Simpler improvements:
>     - Add support for the scatter-gather extension to our
>       frontend/backend driver pair. This would mean that we don't need
>       to flatten mblk chains that belong to a single packet in the
>       frontend driver. I have a quick prototype of this based on some
>       work that Russ did (the Windows driver tends to use long packet
>       chains, so it's wanted in our backend).
>     - Look at using the 'hypervisor copy' hypercall to move data from
>       guest pages into mblks in the backend driver. This would remove
>       the need to map the granted pages into dom0 (which is
>       expensive). Prototyping this should be straightforward and it may
>       provide a big win, but without trying we don't know. Certainly it
>       would push the dom0 CPU time down (by moving the work into the
>       hypervisor).
>     - Use the guest-provided buffers directly (esballoc) rather than
>       copying the data into new buffers (see the sketch after this
>       list). I had an implementation of this and it suffered in three
>       ways:
>       - The buffer management was poor, causing a lot of lock contention
>         over the ring (the tx completion freed the buffer and this
>         contended with the tx lock used to reap packets from the
>         ring). This could be fixed with a little time.
>       - There are a limited number of ring entries (256) and they cannot
>         be reused until the associated buffer is freed. If the dom0
>         stack or a driver holds on to transmit buffers for a long time,
>         we see ring exhaustion. The Neptune driver was particularly bad
>         for this.
>       - Guests grant read-only mappings for these pages. Unfortunately
>         the Solaris IP stack expects to be able to modify packets,
>         which causes page faults. There are a couple of workarounds
>         available:
>         - Modify Solaris guests to grant read/write mappings and
>           indicate this. I have this implemented and it works, but it's
>           somewhat undesirable (and doesn't help with Linux or Windows
>           guests).
>         - Indicate to the MAC layer that these packets are 'read only'
>           and have it copy them if they are for the local stack.
>         - Implement an address space manager for the pages used for
>           these packets and handle faults as they occur - somewhat
>           blue-sky this one :-)
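> 
>      For reference, the esballoc() approach looks roughly like this
>      (a sketch of the idea rather than the implementation I had;
>      unmap_guest_page() and complete_tx_slot() are placeholders).
>      The free routine is the crux: the ring slot can't be completed
>      until the rest of the stack calls freemsg(), which is where the
>      lock contention and ring exhaustion problems above come from:
> 
>      /* Sketch only: context for one esballoc()ed TX buffer. */
>      typedef struct xnb_txbuf {
>          frtn_t          xt_frtn;   /* free routine for esballoc() */
>          xnb_t           *xt_xnbp;
>          grant_handle_t  xt_handle; /* mapping of the granted page */
>          caddr_t         xt_va;
>          uint16_t        xt_id;     /* ring slot to complete on free */
>      } xnb_txbuf_t;
> 
>      static void
>      sketch_xnb_txbuf_free(void *arg)
>      {
>          xnb_txbuf_t *txp = arg;
> 
>          /* Runs whenever the dom0 stack/driver finally frees the mblk. */
>          unmap_guest_page(txp->xt_xnbp, txp->xt_handle, txp->xt_va);
>          complete_tx_slot(txp->xt_xnbp, txp->xt_id);
>          kmem_free(txp, sizeof (*txp));
>      }
> 
>      static mblk_t *
>      sketch_xnb_tx_wrap(xnb_t *xnbp, netif_tx_request_t *txreq,
>          caddr_t va, grant_handle_t handle)
>      {
>          xnb_txbuf_t *txp = kmem_zalloc(sizeof (*txp), KM_NOSLEEP);
>          mblk_t *mp;
> 
>          txp->xt_frtn.free_func = sketch_xnb_txbuf_free;
>          txp->xt_frtn.free_arg = (caddr_t)txp;
>          txp->xt_xnbp = xnbp;
>          txp->xt_handle = handle;
>          txp->xt_va = va;
>          txp->xt_id = txreq->id;
> 
>          /*
>           * Wrap the guest's data in place - no bcopy().  Note that
>           * the guest granted this page read-only, hence the page
>           * fault problem described above.
>           */
>          mp = esballoc((uchar_t *)va + txreq->offset, txreq->size,
>              BPRI_MED, &txp->xt_frtn);
>          mp->b_wptr += txreq->size;
> 
>          return (mp);
>      }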
> 
> More complex improvements:
>     - Avoid mapping the guest pages into dom0 completely if the packet
>       is not destined for dom0. If the guest is sending a packet to a
>       third party, dom0 doesn't need to map in the packet at all - it
>       can pass the MA[1] to the DMA engine of the NIC without ever
>       acquiring a VA. Issues:
>       - We need the destination MAC address of the packet to be included
>         in the TX ring so that we can route the packet (e.g. decide if
>         it's for dom0, another domU or external). There's no room for it
>         in the current ring structures, see "netchannel2" comments
>         further on.
>       - The MAC layer and any interested drivers would need to learn
>         about packets for which there is currently no VA. This will
>         require *big* changes.
>     - Cache mappings of the granted pages from the guest domain. It's
>       not clear how much benefit this would have for the transmit path -
>       we'd need to measure how often the same pages are re-used for
>       transmit buffers by the guest.
> 
> Here's the RX path (again, simpler case):
> 
> domU: When the interface is created, it:
>       - for each entry in the RX ring:
>         - allocate an MTU sized buffer,
>         - find the PA and MFN[2] of the buffer,
>         - allocate a grant reference for the buffer,
>         - update the ring with the details of the buffer (gref and id)
>       - signal the backend that RX buffers are available
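> 
>      A sketch of posting a single buffer (the real driver fills the
>      whole ring and then pushes the requests and notifies in one go;
>      mfn_of() and save_rx_buffer() are illustrative helpers):
> 
>      /* Sketch only: pre-post one MTU-sized RX buffer to the ring. */
>      static void
>      sketch_xnf_post_rx_buf(xnf_t *xnfp)
>      {
>          mblk_t *mp = allocb(xnfp->xnf_mtu, BPRI_MED);
>          netif_rx_request_t *rxreq;
>          grant_ref_t gref;
> 
>          /* Grant the backend write access to the buffer's page. */
>          gref = gnttab_claim_grant_reference(&xnfp->xnf_gref_head);
>          gnttab_grant_foreign_access_ref(gref, xnfp->xnf_backend_id,
>              mfn_of(mp->b_rptr), 0 /* writable - the backend fills it */);
> 
>          /* The ring entry is just an id and a grant reference. */
>          rxreq = RING_GET_REQUEST(&xnfp->xnf_rx_ring,
>              xnfp->xnf_rx_ring.req_prod_pvt);
>          rxreq->id = save_rx_buffer(xnfp, mp, gref);
>          rxreq->gref = gref;
>          xnfp->xnf_rx_ring.req_prod_pvt++;
>      }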
> 
> dom0: When a packet arrives[3]:
>       - driver calls mac_rx() having prepared a packet,
>       - MAC layer classifies the packet (unless that comes for free
>         from the ring it arrived on),
>       - MAC layer passes packet chain (usually just one packet) to xnb
>         RX function
>       - xnb RX function:
>         - for each packet in the chain (b_next):
>           - get a slot in the RX ring
>           - for each mblk in the packet (b_cont):
>             - for each page in the mblk[4]:
>               - fill in a hypervisor copy request for this chunk
>           - hypercall to perform the copies
>           - mark the RX ring entry completed
>         - notify the frontend of new packets (if required[5]).
>         - free the packet chain.
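> 
>      The interesting part is building the copy requests - one per
>      page-sized chunk, per footnote [4]. A sketch (gnttab_copy_t and
>      GNTCOPY_dest_gref are from the Xen headers; mfn_of() is a
>      placeholder):
> 
>      /*
>       * Sketch only: fill hypervisor copy requests for one packet,
>       * destined for the guest buffer described by rxreq.  Returns
>       * the number of requests written; the caller batches them into
>       * a single GNTTABOP_copy hypercall.
>       */
>      static int
>      sketch_xnb_rx_copy_reqs(xnb_t *xnbp, mblk_t *mp,
>          const netif_rx_request_t *rxreq, gnttab_copy_t *gop)
>      {
>          size_t dest_off = 0;
>          mblk_t *ml;
>          int n = 0;
> 
>          for (ml = mp; ml != NULL; ml = ml->b_cont) {
>              uchar_t *rptr = ml->b_rptr;
>              size_t left = MBLKL(ml);
> 
>              while (left > 0) {
>                  /* A chunk must not cross a source page boundary [4]. */
>                  size_t chunk = MIN(left,
>                      PAGESIZE - ((uintptr_t)rptr & PAGEOFFSET));
> 
>                  gop[n].source.u.gmfn = mfn_of(rptr);
>                  gop[n].source.domid = DOMID_SELF;
>                  gop[n].source.offset = (uintptr_t)rptr & PAGEOFFSET;
> 
>                  gop[n].dest.u.ref = rxreq->gref; /* guest RX buffer */
>                  gop[n].dest.domid = xnbp->xnb_peer;
>                  gop[n].dest.offset = dest_off;
> 
>                  gop[n].len = chunk;
>                  gop[n].flags = GNTCOPY_dest_gref;
> 
>                  rptr += chunk;
>                  dest_off += chunk;
>                  left -= chunk;
>                  n++;
>              }
>          }
> 
>          return (n);
>      }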
> 
> domU: When a packet arrives (notified by the backend):
>       - for each dirty entry in the RX ring:
>         - allocate an mblk for the data
>         - copy the data from the RX buffer to the mblk
>         - add the mblk to the packet chain
>         - mark the ring entry free (e.g. re-post the buffer)
>       - notify the backend that the ring has free entries (if required).
>       - pass the packet chain to mac_rx().
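> 
>      A sketch of that interrupt path as it works today
>      (lookup_buffer(), repost_buffer() and the xnf_buf_t type stand
>      in for the driver's buffer bookkeeping; error handling elided):
> 
>      /* Sketch only: frontend RX drain, copy-and-repost style. */
>      static mblk_t *
>      sketch_xnf_rx_drain(xnf_t *xnfp)
>      {
>          mblk_t *head = NULL, **tail = &head;
> 
>          while (RING_HAS_UNCONSUMED_RESPONSES(&xnfp->xnf_rx_ring)) {
>              netif_rx_response_t *rsp;
>              xnf_buf_t *bufp;
>              mblk_t *mp;
> 
>              rsp = RING_GET_RESPONSE(&xnfp->xnf_rx_ring,
>                  xnfp->xnf_rx_ring.rsp_cons);
>              xnfp->xnf_rx_ring.rsp_cons++;
>              bufp = lookup_buffer(xnfp, rsp->id);
> 
>              /* Today: allocate a new mblk and copy out of the buffer... */
>              mp = allocb(rsp->status, BPRI_MED);
>              bcopy(bufp->xb_va + rsp->offset, mp->b_wptr, rsp->status);
>              mp->b_wptr += rsp->status;
>              *tail = mp;
>              tail = &mp->b_next;
> 
>              /* ...then mark the entry free by re-posting the buffer. */
>              repost_buffer(xnfp, bufp);
>          }
> 
>          /* The caller hands this chain to mac_rx(). */
>          return (head);
>      }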
> 
> Simpler improvements:
>     - Don't allocate a new mblk and copy the data in the domU interrupt
>       path; instead wrap an mblk around the existing buffer and post a
>       fresh one in its place (see the sketch after this list). This
>       looks like it would be a good win - definitely worth building
>       something to see how it behaves. Obviously the buffer management
>       gets a little more complicated, but it may be worth it. The
>       downside is that it reduces the likely benefit of having the
>       backend cache mappings for the pre-posted RX buffers, as we are
>       much less likely to recycle the same buffers over and over again
>       (which is what happens today).
>     - Update the frontend driver to use the Crossbow polling
>       implementation, significantly reducing the interrupt load on the
>       guest. Max started on this but it has languished since he left
>       us.
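> 
>      To illustrate the 'wrap rather than copy' idea from the first
>      item above: desballoc() is the non-sleeping relative of
>      esballoc(), and the free routine is where the buffer management
>      gets more complicated (xnf_buf_t and the pool helpers are
>      illustrative):
> 
>      /* Sketch only: runs when the stack eventually frees the mblk. */
>      static void
>      sketch_xnf_buf_recycle(void *arg)
>      {
>          xnf_buf_t *bufp = arg;
> 
>          /* This can be arbitrarily later, so the pool must be deep. */
>          return_buffer_to_pool(bufp->xb_xnfp, bufp);
>      }
> 
>      static mblk_t *
>      sketch_xnf_rx_wrap(xnf_t *xnfp, xnf_buf_t *bufp,
>          netif_rx_response_t *rsp)
>      {
>          mblk_t *mp;
> 
>          bufp->xb_frtn.free_func = sketch_xnf_buf_recycle;
>          bufp->xb_frtn.free_arg = (caddr_t)bufp;
> 
>          /* No bcopy(): hand the pre-posted buffer itself upstream... */
>          mp = desballoc((uchar_t *)bufp->xb_va + rsp->offset,
>              rsp->status, BPRI_MED, &bufp->xb_frtn);
>          mp->b_wptr += rsp->status;
> 
>          /* ...and post a different buffer into the slot just used. */
>          post_fresh_buffer(xnfp);
> 
>          return (mp);
>      }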
> 
> More complex improvements:
>     - Given that the guest pre-posts the buffers that it will use for
>       received data, push these buffers down into the MAC layer,
>       allowing the driver to directly place packets into guest
>       buffers. This presumes that we can get an RX ring in the driver
>       assigned for the MAC address of the guest.
> 
> General things (TX and RX):
>     - Implementing scatter-gather should improve some cases, but it's
>       not that big a win. It allows us to implement jumbo-frames, which
>       will show improvements in benchmarks. It also leads to...
>     - Implementing LSO/LRO between dom0 and domU could have big
>       benefits, as it will reduce the number of interrupts and the
>       number of hypercalls.
>     - All of the backend xnb instances currently operate independently -
>       they share no state. If there are a large number of active guests
>       it will probably be worth looking at a scheme where we shift to a
>       worker thread per CPU and have that thread responsible for
>       multiple xnb instances. This would allow us to reduce the
>       hypercall count even more.
>     - netchannel2 is a new inter-domain protocol implementation intended
>       to address some of the shortcomings in the current protocol. It
>       includes:
>       - multiple pages of TX/RX descriptors which can either be just
>         bigger rings or independent rings,
>       - multiple event channels (which means multiple interrupts),
>       - improved ring structure (space for MAC addresses, ...).
>       With it there is a proposal for a soft IOMMU implementation to
>       improve the use of grant mappings.
> 
>       We've done nothing with netchannel2 so far. In Linux it's
>       currently a prototype with changes to an Intel driver to use it
>       with VMDQ.
> 
> Footnotes: 
> [1]  Machine address. In Xen it's no longer the case that all memory is
>      mapped into the dom0 kernel - you may not even have a physical
>      mapping for the memory.
> [2]  Machine frame number, analogous to PFN.
> [3]  This assumes packets from an external source. Locally generated
>      packets destined for a guest jump into the flow a couple of items
>      down the list.
> [4]  Each chunk passed to the hypervisor copy routine must only contain
>      a single page, as we don't know that the pages are machine
>      contiguous (and it's pretty expensive to find out).
> [5]  The frontend controls whether or not notification takes place using
>      a watermark in the ring.

dme.
-- 
David Edmondson, Sun Microsystems, http://dme.org