Here are some notes about performance that I prepared a while ago.

> TX is "packets from the guest", RX is "packets for the guest".
>
> For discussion purposes, here's how the TX path works (this is the
> fast case - if there are resource shortages, ring fills, etc. things
> are more complex):
>
> domU: xnf is passed a packet chain (typically only a single packet).
>       It:
>       - flattens the message to a single mblk which is contained in a
>         single page (this might be a no-op),
>       - allocates a grant reference,
>       - grants the backend access to the page containing the packet,
>       - gets a slot in the tx ring,
>       - updates the tx ring,
>       - hypercall notifies the backend that a packet is ready.
>
>       The TX ring is cleaned lazily, usually when getting a slot from
>       the ring fails. Cleaning the ring results in freeing any buffers
>       that were used for transmit.
>
> dom0: xnb receives an interrupt to say that the xnf sent one or more
>       packets. It:
>       - for each consumed slot in the ring:
>         - adds the grant reference of the page containing the packet
>           to a list,
>       - hypercall to map all of the pages for which we have grant
>         references,
>       - for each consumed slot in the ring:
>         - allocates an mblk for the packet,
>         - copies data from the granted page to the mblk,
>         - stores the mblk in a list,
>       - hypercall to unmap all of the granted pages,
>       - passes the packet chain down to the NIC (typically a VNIC).
>
> Simpler improvements:
> - Add support for the scatter-gather extension to our
>   frontend/backend driver pair. This would mean that we don't need to
>   flatten mblk chains that belong to a single packet in the frontend
>   driver. I have a quick prototype of this based on some work that
>   Russ did (the Windows driver tends to use long packet chains, so
>   it's wanted in our backend).
> - Look at using the 'hypervisor copy' hypercall to move data from
>   guest pages into mblks in the backend driver (there's a sketch of
>   this after the list). This would remove the need to map the granted
>   pages into dom0 (which is expensive). Prototyping this should be
>   straightforward and it may provide a big win, but without trying we
>   don't know. Certainly it would push the dom0 CPU time down (by
>   moving the work into the hypervisor).
> - Use the guest-provided buffers directly (esballoc) rather than
>   copying the data into more buffers. I had an implementation of this
>   and it suffered in three ways:
>   - The buffer management was poor, causing a lot of lock contention
>     over the ring (the tx completion freed the buffer and this
>     contended with the tx lock used to reap packets from the ring).
>     This could be fixed with a little time.
>   - There are a limited number of ring entries (256) and they cannot
>     be reused until the associated buffer is freed. If the dom0 stack
>     or a driver holds on to transmit buffers for a long time, we see
>     ring exhaustion. The Neptune driver was particularly bad for
>     this.
>   - Guests grant read-only mappings for these pages. Unfortunately
>     the Solaris IP stack expects to be able to modify packets, which
>     causes page faults. There are a few workarounds available:
>     - Modify Solaris guests to grant read/write mappings and indicate
>       this. I have this implemented and it works, but it's somewhat
>       undesirable (and doesn't help with Linux or Windows guests).
>     - Indicate to the MAC layer that these packets are 'read only'
>       and have it copy them if they are for the local stack.
>     - Implement an address space manager for the pages used for these
>       packets and handle faults as they occur - somewhat blue-sky
>       this one :-)
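> To make the 'hypervisor copy' idea above concrete, here is a rough
> sketch (not the real xnb code) of how the backend might pull one TX
> packet out of the guest with GNTTABOP_copy instead of mapping the
> granted page. The grant-table and netif structures are the standard
> Xen ones; xnb_copy_txreq() is an invented name, and the use of
> pfn_to_mfn()/hat_getpfnum() and the mblk handling are simplified:
>
>     /*
>      * Assumed context: the xnb backend, with the usual STREAMS and
>      * Xen headers (sys/stream.h, sys/hypervisor.h and the public
>      * netif/grant-table definitions) already included.
>      */
>     static mblk_t *
>     xnb_copy_txreq(netif_tx_request_t *txreq, domid_t otherend)
>     {
>         gnttab_copy_t op;
>         mblk_t *mp;
>
>         mp = allocb(txreq->size, BPRI_MED);
>         if (mp == NULL)
>             return (NULL);
>
>         /* Source is the granted guest page, destination is the
>          * machine frame backing the mblk's data buffer.  Note: this
>          * assumes the destination doesn't cross a page boundary; a
>          * real implementation would split the copy (cf. footnote
>          * [4] below). */
>         op.source.u.ref = txreq->gref;
>         op.source.domid = otherend;
>         op.source.offset = txreq->offset;
>         op.dest.u.gmfn = pfn_to_mfn(hat_getpfnum(kas.a_hat,
>             (caddr_t)mp->b_rptr));
>         op.dest.domid = DOMID_SELF;
>         op.dest.offset = (uint16_t)((uintptr_t)mp->b_rptr & PAGEOFFSET);
>         op.len = txreq->size;
>         op.flags = GNTCOPY_source_gref;
>
>         if (HYPERVISOR_grant_table_op(GNTTABOP_copy, &op, 1) != 0 ||
>             op.status != GNTST_okay) {
>             freeb(mp);
>             return (NULL);
>         }
>
>         mp->b_wptr = mp->b_rptr + txreq->size;
>         return (mp);
>     }
>
> The win over the current path is that dom0 never takes or tears down
> a mapping of the guest page; the copy happens entirely inside the
> hypervisor.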
> More complex improvements:
> - Avoid mapping the guest pages into dom0 completely if the packet is
>   not destined for dom0. If the guest is sending a packet to a third
>   party, dom0 doesn't need to map in the packet at all - it can pass
>   the MA[1] to the DMA engine of the NIC without ever acquiring a VA.
>   Issues:
>   - We need the destination MAC address of the packet to be included
>     in the TX ring so that we can route the packet (e.g. decide if
>     it's for dom0, another domU or external). There's no room for it
>     in the current ring structures; see the "netchannel2" comments
>     further on.
>   - The MAC layer and any interested drivers would need to learn
>     about packets for which there is currently no VA. This will
>     require *big* changes.
> - Cache mappings of the granted pages from the guest domain. It's not
>   clear how much benefit this would have for the transmit path - we'd
>   need to see how often the same pages are reused for transmit
>   buffers by the guest.
>
> Here's the RX path (again, simpler case):
>
> domU: When the interface is created, domU:
>       - for each entry in the RX ring:
>         - allocate an MTU-sized buffer,
>         - find the PA and MFN[2] of the buffer,
>         - allocate a grant reference for the buffer,
>         - update the ring with the details of the buffer (gref and
>           id),
>       - signal the backend that RX buffers are available.
>
> dom0: When a packet arrives[3]:
>       - driver calls mac_rx() having prepared a packet,
>       - MAC layer classifies the packet (unless that comes for free
>         from the ring it arrived on),
>       - MAC layer passes the packet chain (usually just one packet)
>         to the xnb RX function,
>       - xnb RX function:
>         - for each packet in the chain (b_next):
>           - get a slot in the RX ring,
>           - for each mblk in the packet (b_cont):
>             - for each page in the mblk[4]:
>               - fill in a hypervisor copy request for this chunk,
>         - hypercall to perform the copies,
>         - mark the RX ring entries completed,
>         - notify the frontend of new packets (if required[5]),
>         - free the packet chain.
>
> domU: When a packet arrives (notified by the backend):
>       - for each dirty entry in the RX ring:
>         - allocate an mblk for the data,
>         - copy the data from the RX buffer to the mblk,
>         - add the mblk to the packet chain,
>         - mark the ring entry free (e.g. re-post the buffer),
>       - notify the backend that the ring has free entries (if
>         required),
>       - pass the packet chain to mac_rx().
>
> Simpler improvements:
> - Don't allocate a new mblk and copy the data in the domU interrupt
>   path; instead wrap the existing buffer in an mblk and post a
>   replacement buffer to the ring (see the sketch after this list).
>   This looks like it would be a good win - definitely worth building
>   something to see how it behaves. Obviously the buffer management
>   gets a little more complicated, but it may be worth it. The
>   downside is that it reduces the likely benefit of having the
>   backend cache mappings for the pre-posted RX buffers, as we are
>   much less likely to recycle the same buffers over and over again
>   (which is what happens today).
> - Update the frontend driver to use the Crossbow polling
>   implementation, significantly reducing the interrupt load on the
>   guest. Max started on this but it has languished since he left us.
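> For the 'wrap and re-post' idea above, here is a rough sketch of what
> the xnf interrupt path might do, using desballoc(9F) to loan the
> pre-posted buffer up the stack instead of copying it. xnf_buf_t,
> xnf_buf_recycle() and xnf_post_new_buf() are invented names for
> illustration, not the real driver's:
>
>     /* Per-RX-buffer state the frontend would keep anyway. */
>     typedef struct xnf_buf {
>         frtn_t      xb_free_rtn;  /* free routine for desballoc() */
>         caddr_t     xb_va;        /* buffer virtual address */
>         size_t      xb_size;      /* MTU-sized */
>         grant_ref_t xb_gref;      /* grant given to the backend */
>     } xnf_buf_t;
>
>     static void xnf_buf_recycle(xnf_buf_t *);  /* return to cache */
>     static int xnf_post_new_buf(void);         /* grant + re-post
>                                                   (details elided) */
>
>     static mblk_t *
>     xnf_rx_wrap(xnf_buf_t *bdesc, size_t len)
>     {
>         mblk_t *mp;
>
>         /* The free routine runs when the stack finally frees the
>          * mblk; it returns the buffer to the driver's cache rather
>          * than to the kernel allocator. */
>         bdesc->xb_free_rtn.free_func = xnf_buf_recycle;
>         bdesc->xb_free_rtn.free_arg = (caddr_t)bdesc;
>
>         mp = desballoc((unsigned char *)bdesc->xb_va, len, BPRI_MED,
>             &bdesc->xb_free_rtn);
>         if (mp == NULL)
>             return (NULL);   /* caller falls back to copying */
>         mp->b_wptr = mp->b_rptr + len;
>
>         /* This buffer now belongs to the stack, so a replacement
>          * must be allocated, granted and re-posted to the ring. */
>         (void) xnf_post_new_buf();
>
>         return (mp);
>     }
>
> The awkward part is exactly the buffer management mentioned above:
> the wrapped buffer can't be reused until xnf_buf_recycle() eventually
> runs, so the driver needs a cache of spare buffers to post in the
> meantime.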
> More complex improvements:
> - Given that the guest pre-posts the buffers that it will use for
>   received data, push these buffers down into the MAC layer, allowing
>   the driver to directly place packets into guest buffers. This
>   presumes that we can get an RX ring in the driver assigned for the
>   MAC address of the guest.
>
> General things (TX and RX):
> - Implementing scatter-gather should improve some cases, but it's not
>   that big a win. It allows us to implement jumbo-frames, which will
>   show improvements in benchmarks. It also leads to...
> - Implementing LSO/LRO between dom0 and domU could have big benefits,
>   as it will reduce the number of interrupts and the number of
>   hypercalls.
> - All of the backend xnb instances currently operate independently -
>   they share no state. If there are a large number of active guests
>   it will probably be worth looking at a scheme where we shift to a
>   worker thread per CPU and have that thread responsible for multiple
>   xnb instances. This would allow us to reduce the hypercall count
>   even more.
> - netchannel2 is a new inter-domain protocol implementation intended
>   to address some of the shortcomings in the current protocol. It
>   includes:
>   - multiple pages of TX/RX descriptors, which can either be just
>     bigger rings or independent rings,
>   - multiple event channels (which means multiple interrupts),
>   - improved ring structure (space for MAC addresses, ...).
>   With it there is a proposal for a soft IOMMU implementation to
>   improve the use of grant mappings.
>
>   We've done nothing with netchannel2 so far. In Linux it's currently
>   a prototype with changes to an Intel driver to use it with VMDQ.
>
> Footnotes:
> [1] Machine address. In Xen it's no longer the case that all memory
>     is mapped into the dom0 kernel - you may not even have a physical
>     mapping for the memory.
> [2] Machine frame number, analogous to PFN.
> [3] This assumes packets from an external source. Locally generated
>     packets destined for a guest jump into the flow a couple of items
>     down the list.
> [4] Each chunk passed to the hypervisor copy routine must only
>     contain a single page, as we don't know that the pages are
>     machine contiguous (and it's pretty expensive to find out). A
>     sketch of this follows the footnotes.
> [5] The frontend controls whether or not notification takes place
>     using a watermark in the ring.
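> Finally, to make footnote [4] concrete, here is a rough sketch
> (again, invented names, not the real xnb code) of how the dom0 RX
> path might build the per-chunk hypervisor copy requests for one
> packet, clamping each chunk at a page boundary:
>
>     /* 'gref' is the grant the guest posted in the RX ring slot and
>      * 'otherend' is the frontend domain.  Returns the number of copy
>      * operations filled in, or -1 if 'max_ops' is too small; the
>      * caller batches them into a single GNTTABOP_copy hypercall. */
>     static int
>     xnb_rx_fill_copies(mblk_t *mp, grant_ref_t gref, domid_t otherend,
>         gnttab_copy_t *ops, int max_ops)
>     {
>         uint16_t dest_off = 0;  /* offset into the guest RX buffer */
>         int n = 0;
>
>         for (; mp != NULL; mp = mp->b_cont) {
>             caddr_t va = (caddr_t)mp->b_rptr;
>             size_t left = MBLKL(mp);
>
>             while (left > 0) {
>                 /* Clamp the chunk to the end of the source page,
>                  * since we can't assume the mblk's pages are machine
>                  * contiguous. */
>                 size_t chunk = MIN(left,
>                     PAGESIZE - ((uintptr_t)va & PAGEOFFSET));
>
>                 if (n == max_ops)
>                     return (-1);
>
>                 ops[n].source.u.gmfn = pfn_to_mfn(
>                     hat_getpfnum(kas.a_hat, va));
>                 ops[n].source.domid = DOMID_SELF;
>                 ops[n].source.offset =
>                     (uint16_t)((uintptr_t)va & PAGEOFFSET);
>                 ops[n].dest.u.ref = gref;
>                 ops[n].dest.domid = otherend;
>                 ops[n].dest.offset = dest_off;
>                 ops[n].len = (uint16_t)chunk;
>                 ops[n].flags = GNTCOPY_dest_gref;
>                 n++;
>
>                 va += chunk;
>                 left -= chunk;
>                 dest_off += chunk;
>             }
>         }
>         return (n);
>     }
>
> This leans on the fact that the guest's RX buffer is a single
> MTU-sized granted page; for larger buffers the destination page
> boundary would have to be respected as well.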
dme.
--
David Edmondson, Sun Microsystems, http://dme.org
