On 20/09/2016 10:38, Pierre Pfister (ppfister) wrote:
> More inline, but first I want to propose something.
> 
> One of the expensive operations in a device workload is mapping a physical
> address found in the descriptor to a virtual address, the reason being that
> there are multiple memory regions.
> 
> When there are only a few regions, an optimal implementation uses a loop.
> Otherwise, a tree lookup can be used. In any case, it is costly.
> 
> So here is the proposal.
> The driver numbers the memory regions and provides this numbering to the
> device, such that both device and driver have the same memory region
> indexing.
> 
> Then, instead of transmitting '__le64 addr' in the descriptor, we put
> '__le16 region_idx' and '__le32 offset'.
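For concreteness, the proposal replaces the 64-bit address field with a
(region index, offset) pair. A sketch of the two layouts side by side (the
second struct and its field names are illustrative only, not defined by the
spec or by any implementation):

#include <linux/types.h>

/* virtio 1.0 descriptor: 16 bytes, carries a full guest-physical address. */
struct vring_desc {
	__le64 addr;   /* guest-physical address of the buffer */
	__le32 len;    /* length of the buffer */
	__le16 flags;  /* NEXT / WRITE / INDIRECT */
	__le16 next;   /* index of the chained descriptor */
};

/* Proposed variant (illustrative): the address becomes a (region, offset)
 * pair, using a region numbering shared between driver and device. */
struct vring_desc_region {
	__le16 region_idx; /* index into the agreed-upon region table */
	__le32 offset;     /* offset of the buffer within that region */
	__le32 len;
	__le16 flags;
	__le16 next;
};

The second layout is 14 bytes instead of 16, which is where the size
discussion further down comes from.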
This makes it much more complicated to change the memory map (which happens
rarely, but does happen, e.g. on memory hotplug). There is also the additional
logic to split regions larger than 4GB, since the offset is only 32 bits;
that makes the tables larger and less cache-friendly.

The simple solution is to remember the last two or four regions that were
used, and only fall back to a binary search for everything else. For the
common case, which does not involve memory hotplug, this is the fastest
method, because there are effectively only one or two regions in use.

Again: let's not second-guess the software. Software is easy to improve,
hardware is hard.

> We also reduce the size of the descriptor structure by 16 bits.
> Which is not very helpful for now.
> At least for networking it wouldn't be a problem to change the 'len' field
> to '__le16' too, hence reducing even more.
> But maybe for storage devices this could be an issue?

Yeah, storage requires 32 bits for the length. Also, 16 bytes is kind of a
sweet spot because it is naturally aligned and it divides the cache line.
14 bytes per descriptor doesn't really buy you anything. 12 bytes would let
you fit 5 descriptors in a cache line in some cases, but in other cases some
descriptors would straddle two cache lines. 16 seems better.

>>>> But, VRING_DESC_F_INDIRECT adds a layer of indirection and makes
>>>> prefetching a bit more complicated.
>>>> The reason why VRING_DESC_F_INDIRECT was useful in the first place is
>>>> that virtio 1.0 didn't have enough descriptors in the main queue (it is
>>>> equal to the queue size).
>>>
>>> rings are physically contiguous, and allocating large physically
>>> contiguous buffers is problematic for many guests.
>>
>> If having physically contiguous rings is very important, could we allocate
>> multiple arrays instead of one?
>>
>> For example, allocating 8 rings of 1024 descriptors:
>>
>>     struct desc *current_desc = &vring[idx >> 10][idx & 0x3ff];
>>
>> Just like indirect descriptors, this has two levels of indirection, but
>> unlike indirect descriptors, the lookup is from non-shared memory and is
>> guaranteed not to change.

Like Michael, I would like a benchmark for this.

Above 2 or 3 descriptors the cost of indirection should be a wash, because if
anything the loop over the indirect descriptors (loop for desc.len / 16
iterations) is more predictable than the loop over direct descriptors (loop
until !VRING_DESC_F_NEXT). And even for 2 or 3 direct descriptors there is
some risk of repeated cache line bouncing between the consumer and the
producer who is preparing the next item. When you process an indirect
descriptor, instead, you do fewer reads on the ring's cache line, and all of
them are very close to each other.

Considering the simplified logic in the consumer _and_ the possibility of
preallocating the second-level descriptors in the producer, indirect
descriptors seem like a very straightforward win.

Paolo
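To make the "remember the last few regions, then binary search" point above
concrete, here is a rough sketch of what the device-side translation could
look like. The region table layout, the mru cache, and the function name are
all invented for illustration; they are not from any existing implementation:

#include <stddef.h>
#include <stdint.h>

/* Illustrative region table: entries sorted by guest-physical start address
 * and non-overlapping. */
struct mem_region {
	uint64_t gpa_start;	/* guest-physical start of the region */
	uint64_t size;		/* length of the region in bytes */
	void    *hva;		/* virtual address the region is mapped at */
};

struct region_map {
	struct mem_region *regions;	/* sorted by gpa_start */
	size_t nregions;
	size_t mru[2];			/* last two regions that hit; assumed
					 * initialized to valid indices */
};

/* Translate a guest-physical address to a virtual address.  Check the most
 * recently used regions first and only fall back to a binary search over the
 * whole sorted table on a miss. */
static void *gpa_to_hva(struct region_map *map, uint64_t gpa)
{
	size_t lo, hi, i;

	/* Fast path: in the common case only one or two regions are in use,
	 * so these two compares are usually all that runs. */
	for (i = 0; i < 2; i++) {
		struct mem_region *r = &map->regions[map->mru[i]];

		if (gpa >= r->gpa_start && gpa - r->gpa_start < r->size)
			return (char *)r->hva + (gpa - r->gpa_start);
	}

	/* Slow path: binary search over all regions. */
	lo = 0;
	hi = map->nregions;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		struct mem_region *r = &map->regions[mid];

		if (gpa < r->gpa_start) {
			hi = mid;
		} else if (gpa - r->gpa_start >= r->size) {
			lo = mid + 1;
		} else {
			/* Remember the hit so the next lookups take the
			 * fast path. */
			map->mru[1] = map->mru[0];
			map->mru[0] = mid;
			return (char *)r->hva + (gpa - r->gpa_start);
		}
	}

	return NULL;	/* not covered by any region */
}

Changing the memory map then only means rebuilding the sorted table and
resetting the two cached indices, which is cheap precisely because it is
rare.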