Re: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support
Hi Michael and Christoph:

I just sent out the V5 patchset. I use alloc_pages() to allocate the rx/tx ring buffers in an Isolation VM and map them first with vmap(), because vmbus_establish_gpadl() still needs a virtual address in low-end memory to initialize the GPADL buffer. After calling vmbus_establish_gpadl(), the VA returned by vmap() is unmapped to release virtual address space that is not used by the following code, and the pages are then mapped into the extra address space above shared_gpa_boundary via vmap_pfn(). (A sketch of this sequence follows the quoted discussion below.) Please have a look.

https://lkml.org/lkml/2021/9/14/672

Thanks.

On 9/2/2021 11:57 PM, Michael Kelley wrote:
> From: Christoph Hellwig Sent: Thursday, September 2, 2021 1:00 AM
> >
> > On Tue, Aug 31, 2021 at 05:16:19PM +, Michael Kelley wrote:
> > > As a quick overview, I think there are four places where the shared_gpa_boundary must be applied to adjust the guest physical address that is used. Each requires mapping a corresponding virtual address range. Here are the four places:
> > >
> > > 1) The so-called "monitor pages" that are a core communication mechanism between the guest and Hyper-V. These are two single pages, and the mapping is handled by calling memremap() for each of the two pages. See Patch 7 of Tianyu's series.
> >
> > Ah, interesting.
> >
> > > 3) The network driver send and receive buffers. vmap_phys_range() should work here.
> >
> > Actually it won't. The problem with these buffers is that they are physically non-contiguous allocations.
>
> Indeed you are right. These buffers are allocated with vzalloc().
>
> > We really have two sensible options:
> >
> > 1) use vmap_pfn as in the current series. But in that case I think we should get rid of the other mapping created by vmalloc. I thought a bit about finding a way to apply the offset in vmalloc itself, but I think it would be too invasive to the normal fast path. So the other sub-option would be to allocate the pages manually (maybe even using high order allocations to reduce TLB pressure) and then remap them
>
> What's the benefit of getting rid of the other mapping created by vmalloc if it isn't referenced? Just page table space? The default sizes are a 16 Meg receive buffer and a 1 Meg send buffer for each VMbus channel used by netvsc, and usually the max number of channels is 8. So there's 128 Meg of virtual space to be saved on the receive buffers, which could be worth it.
>
> Allocating the pages manually is also an option, but we have to be careful about high order allocations. While typically these buffers are allocated during system boot, these synthetic NICs can be hot added and removed while the VM is running. The channel count can also be changed while the VM is running. So multiple 16 Meg receive buffer allocations may need to be done after the system has been running a long time.
>
> > 2) do away with the contiguous kernel mapping entirely. This means the simple memcpy calls become loops over kmap_local_pfn. As I just found out for the send side that would be pretty easy, but the receive side would be more work. We'd also need to check the performance implications.
>
> Doing away with the contiguous kernel mapping entirely seems like it would result in fairly messy code to access the buffer. What's the benefit of doing away with the mapping? I'm not an expert on the netvsc driver, but decoding the incoming packets is already fraught with complexities because of the nature of the protocol with Hyper-V. The contiguous kernel mapping at least keeps the basics sane.
>
> > > 4) The swiotlb memory used for bounce buffers. vmap_phys_range() should work here as well.
> >
> > Or memremap if it works for 1.
> >
> > > Case #2 above does unusual mapping. The ring buffer consists of a ring buffer header page, followed by one or more pages that are the actual ring buffer. The pages making up the actual ring buffer are mapped twice in succession. For example, if the ring buffer has 4 pages (one header page and three ring buffer pages), the contiguous virtual mapping must cover these seven pages: 0, 1, 2, 3, 1, 2, 3. The duplicate contiguous mapping allows the code that is reading or writing the actual ring buffer to not be concerned about wrap-around because writing off the end of the ring buffer is automatically wrapped-around by the mapping. The amount of data read or written in one batch never exceeds the size of the ring buffer, and after a batch is read or written, the read or write indices are adjusted to put them back into the range of the first mapping of the actual ring buffer pages. So there's method to the madness, and the technique works pretty well. But this kind of mapping is not amenable to using vmap_phys_range().
> >
> > Hmm. Can you point me to where this is mapped? Especially for the classic non-isolated case where no vmap/vmalloc mapping is involved at all?
>
> The existing code is in hv_ringbuffer_init() in drivers/hv/ring_buffer.c. The code hasn't changed in a while, so any recent upstream code tree is valid to look at. The memory pages are typically allocated in vmbus_alloc_ring() in drivers/hv/channel.c.
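[Editor's note: a minimal sketch of the allocate/map/establish/unmap/remap sequence Tianyu describes above, assuming the ms_hyperv.shared_gpa_boundary field introduced by the series; names such as page_count, pages, pfns, buf, channel, and gpadl_handle are illustrative, and error handling is omitted.]

	/* 1. Allocate discontiguous pages and give them a low-memory VA so
	 *    vmbus_establish_gpadl() can walk the buffer through a normal
	 *    kernel virtual address.
	 */
	for (i = 0; i < page_count; i++)
		pages[i] = alloc_page(GFP_KERNEL);
	buf = vmap(pages, page_count, VM_MAP, PAGE_KERNEL);

	ret = vmbus_establish_gpadl(channel, buf, page_count * PAGE_SIZE,
				    &gpadl_handle);

	/* 2. The low-memory mapping is not used after the GPADL is set up;
	 *    release the virtual address space.
	 */
	vunmap(buf);

	/* 3. Remap the same pages in the extra address space above
	 *    shared_gpa_boundary for all later accesses.
	 */
	for (i = 0; i < page_count; i++)
		pfns[i] = page_to_pfn(pages[i]) +
			  (ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT);
	buf = vmap_pfn(pfns, page_count, PAGE_KERNEL);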
RE: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support
From: Christoph Hellwig Sent: Thursday, September 2, 2021 1:00 AM
>
> On Tue, Aug 31, 2021 at 05:16:19PM +, Michael Kelley wrote:
> > As a quick overview, I think there are four places where the
> > shared_gpa_boundary must be applied to adjust the guest physical
> > address that is used. Each requires mapping a corresponding
> > virtual address range. Here are the four places:
> >
> > 1) The so-called "monitor pages" that are a core communication
> > mechanism between the guest and Hyper-V. These are two single
> > pages, and the mapping is handled by calling memremap() for
> > each of the two pages. See Patch 7 of Tianyu's series.
>
> Ah, interesting.
>
> > 3) The network driver send and receive buffers. vmap_phys_range()
> > should work here.
>
> Actually it won't. The problem with these buffers is that they are
> physically non-contiguous allocations.

Indeed you are right. These buffers are allocated with vzalloc().

> We really have two sensible options:
>
> 1) use vmap_pfn as in the current series. But in that case I think
> we should get rid of the other mapping created by vmalloc. I
> thought a bit about finding a way to apply the offset in vmalloc
> itself, but I think it would be too invasive to the normal fast
> path. So the other sub-option would be to allocate the pages
> manually (maybe even using high order allocations to reduce TLB
> pressure) and then remap them

What's the benefit of getting rid of the other mapping created by vmalloc if it isn't referenced? Just page table space? The default sizes are a 16 Meg receive buffer and a 1 Meg send buffer for each VMbus channel used by netvsc, and usually the max number of channels is 8. So there's 128 Meg of virtual space to be saved on the receive buffers, which could be worth it.

Allocating the pages manually is also an option, but we have to be careful about high order allocations. While typically these buffers are allocated during system boot, these synthetic NICs can be hot added and removed while the VM is running. The channel count can also be changed while the VM is running. So multiple 16 Meg receive buffer allocations may need to be done after the system has been running a long time.

> 2) do away with the contiguous kernel mapping entirely. This means
> the simple memcpy calls become loops over kmap_local_pfn. As
> I just found out for the send side that would be pretty easy,
> but the receive side would be more work. We'd also need to check
> the performance implications.

Doing away with the contiguous kernel mapping entirely seems like it would result in fairly messy code to access the buffer. What's the benefit of doing away with the mapping? I'm not an expert on the netvsc driver, but decoding the incoming packets is already fraught with complexities because of the nature of the protocol with Hyper-V. The contiguous kernel mapping at least keeps the basics sane.

> > 4) The swiotlb memory used for bounce buffers. vmap_phys_range()
> > should work here as well.
>
> Or memremap if it works for 1.
>
> > Case #2 above does unusual mapping. The ring buffer consists of a ring
> > buffer header page, followed by one or more pages that are the actual
> > ring buffer. The pages making up the actual ring buffer are mapped
> > twice in succession. For example, if the ring buffer has 4 pages
> > (one header page and three ring buffer pages), the contiguous
> > virtual mapping must cover these seven pages: 0, 1, 2, 3, 1, 2, 3.
> > The duplicate contiguous mapping allows the code that is reading
> > or writing the actual ring buffer to not be concerned about wrap-around
> > because writing off the end of the ring buffer is automatically
> > wrapped-around by the mapping. The amount of data read or
> > written in one batch never exceeds the size of the ring buffer, and
> > after a batch is read or written, the read or write indices are adjusted
> > to put them back into the range of the first mapping of the actual
> > ring buffer pages. So there's method to the madness, and the
> > technique works pretty well. But this kind of mapping is not
> > amenable to using vmap_phys_range().
>
> Hmm. Can you point me to where this is mapped? Especially for the
> classic non-isolated case where no vmap/vmalloc mapping is involved
> at all?

The existing code is in hv_ringbuffer_init() in drivers/hv/ring_buffer.c. The code hasn't changed in a while, so any recent upstream code tree is valid to look at. The memory pages are typically allocated in vmbus_alloc_ring() in drivers/hv/channel.c.

Michael
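[Editor's note: to make the wrap-around behavior described above concrete, here is a hedged illustration, not code from the driver; ring_data, write_index, ring_data_size, src, and len are made-up names.]

	/*
	 * With pages 0, 1, 2, 3, 1, 2, 3 mapped contiguously, a copy that
	 * starts near the end of the ring simply spills into the second
	 * mapping of pages 1-3, which aliases the start of the ring.
	 */
	memcpy(ring_data + write_index, src, len);	/* may cross the "end" */

	write_index += len;
	if (write_index >= ring_data_size)
		write_index -= ring_data_size;	/* fold back into the first mapping */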
Re: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support
On 9/2/2021 3:59 PM, Christoph Hellwig wrote:
> On Tue, Aug 31, 2021 at 05:16:19PM +, Michael Kelley wrote:
> > As a quick overview, I think there are four places where the shared_gpa_boundary must be applied to adjust the guest physical address that is used. Each requires mapping a corresponding virtual address range. Here are the four places:
> >
> > 1) The so-called "monitor pages" that are a core communication mechanism between the guest and Hyper-V. These are two single pages, and the mapping is handled by calling memremap() for each of the two pages. See Patch 7 of Tianyu's series.
>
> Ah, interesting.
>
> > 3) The network driver send and receive buffers. vmap_phys_range() should work here.
>
> Actually it won't. The problem with these buffers is that they are physically non-contiguous allocations.
>
> We really have two sensible options:
>
> 1) use vmap_pfn as in the current series. But in that case I think we should get rid of the other mapping created by vmalloc. I thought a bit about finding a way to apply the offset in vmalloc itself, but I think it would be too invasive to the normal fast path. So the other sub-option would be to allocate the pages manually (maybe even using high order allocations to reduce TLB pressure) and then remap them

Agree. In such a case, the mapping for memory below shared_gpa_boundary is not necessary. alloc_pages() is limited by MAX_ORDER and needs to be called repeatedly to get enough memory.

> 2) do away with the contiguous kernel mapping entirely. This means the simple memcpy calls become loops over kmap_local_pfn. As I just found out for the send side that would be pretty easy, but the receive side would be more work. We'd also need to check the performance implications.

kmap_local_pfn() requires a pfn with a backing struct page, and that doesn't work for pfns above shared_gpa_boundary.

> > 4) The swiotlb memory used for bounce buffers. vmap_phys_range() should work here as well.
>
> Or memremap if it works for 1.

vmap_pfn() is used for this now, and the hv map function is reused in the netvsc driver.

> > Case #2 above does unusual mapping. The ring buffer consists of a ring buffer header page, followed by one or more pages that are the actual ring buffer. The pages making up the actual ring buffer are mapped twice in succession. For example, if the ring buffer has 4 pages (one header page and three ring buffer pages), the contiguous virtual mapping must cover these seven pages: 0, 1, 2, 3, 1, 2, 3. The duplicate contiguous mapping allows the code that is reading or writing the actual ring buffer to not be concerned about wrap-around because writing off the end of the ring buffer is automatically wrapped-around by the mapping. The amount of data read or written in one batch never exceeds the size of the ring buffer, and after a batch is read or written, the read or write indices are adjusted to put them back into the range of the first mapping of the actual ring buffer pages. So there's method to the madness, and the technique works pretty well. But this kind of mapping is not amenable to using vmap_phys_range().
>
> Hmm. Can you point me to where this is mapped? Especially for the classic non-isolated case where no vmap/vmalloc mapping is involved at all?

This is done via vmap() in hv_ringbuffer_init():

	/* Initialize the ring buffer. */
	int hv_ringbuffer_init(struct hv_ring_buffer_info *ring_info,
			       struct page *pages, u32 page_cnt, u32 max_pkt_size)
	{
		int i;
		struct page **pages_wraparound;

		BUILD_BUG_ON((sizeof(struct hv_ring_buffer) != PAGE_SIZE));

		/*
		 * First page holds struct hv_ring_buffer, do wraparound mapping for
		 * the rest.
		 */
		pages_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(struct page *),
					   GFP_KERNEL);
		if (!pages_wraparound)
			return -ENOMEM;

		/* prepare to wrap page array */
		pages_wraparound[0] = pages;
		for (i = 0; i < 2 * (page_cnt - 1); i++)
			pages_wraparound[i + 1] = &pages[i % (page_cnt - 1) + 1];

		/* map */
		ring_info->ring_buffer = (struct hv_ring_buffer *)
			vmap(pages_wraparound, page_cnt * 2 - 1, VM_MAP, PAGE_KERNEL);

		kfree(pages_wraparound);

		if (!ring_info->ring_buffer)
			return -ENOMEM;

		ring_info->ring_buffer->read_index =
			ring_info->ring_buffer->write_index = 0;
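[Editor's note: for the Isolation VM case, a hedged sketch of how the same wraparound trick could be expressed with vmap_pfn(), filling a PFN array with PFNs offset above shared_gpa_boundary; this follows the shape of what the series proposes, but pfns_wraparound and gpa_offset are illustrative names.]

	unsigned long *pfns_wraparound;
	unsigned long gpa_offset = ms_hyperv.shared_gpa_boundary >> PAGE_SHIFT;

	pfns_wraparound = kcalloc(page_cnt * 2 - 1, sizeof(unsigned long),
				  GFP_KERNEL);
	if (!pfns_wraparound)
		return -ENOMEM;

	/*
	 * Same 0, 1..n-1, 1..n-1 layout as the page-based version, but as
	 * PFNs above shared_gpa_boundary so the host-visible alias is mapped.
	 */
	pfns_wraparound[0] = page_to_pfn(pages) + gpa_offset;
	for (i = 0; i < 2 * (page_cnt - 1); i++)
		pfns_wraparound[i + 1] = page_to_pfn(pages) +
					 i % (page_cnt - 1) + 1 + gpa_offset;

	ring_info->ring_buffer = (struct hv_ring_buffer *)
		vmap_pfn(pfns_wraparound, page_cnt * 2 - 1, PAGE_KERNEL);
	kfree(pfns_wraparound);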
Re: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support
On Tue, Aug 31, 2021 at 05:16:19PM +, Michael Kelley wrote:
> As a quick overview, I think there are four places where the
> shared_gpa_boundary must be applied to adjust the guest physical
> address that is used. Each requires mapping a corresponding
> virtual address range. Here are the four places:
>
> 1) The so-called "monitor pages" that are a core communication
> mechanism between the guest and Hyper-V. These are two single
> pages, and the mapping is handled by calling memremap() for
> each of the two pages. See Patch 7 of Tianyu's series.

Ah, interesting.

> 3) The network driver send and receive buffers. vmap_phys_range()
> should work here.

Actually it won't. The problem with these buffers is that they are physically non-contiguous allocations.

We really have two sensible options:

1) use vmap_pfn as in the current series. But in that case I think we should get rid of the other mapping created by vmalloc. I thought a bit about finding a way to apply the offset in vmalloc itself, but I think it would be too invasive to the normal fast path. So the other sub-option would be to allocate the pages manually (maybe even using high order allocations to reduce TLB pressure) and then remap them

2) do away with the contiguous kernel mapping entirely. This means the simple memcpy calls become loops over kmap_local_pfn. As I just found out for the send side that would be pretty easy, but the receive side would be more work. We'd also need to check the performance implications.

> 4) The swiotlb memory used for bounce buffers. vmap_phys_range()
> should work here as well.

Or memremap if it works for 1.

> Case #2 above does unusual mapping. The ring buffer consists of a ring
> buffer header page, followed by one or more pages that are the actual
> ring buffer. The pages making up the actual ring buffer are mapped
> twice in succession. For example, if the ring buffer has 4 pages
> (one header page and three ring buffer pages), the contiguous
> virtual mapping must cover these seven pages: 0, 1, 2, 3, 1, 2, 3.
> The duplicate contiguous mapping allows the code that is reading
> or writing the actual ring buffer to not be concerned about wrap-around
> because writing off the end of the ring buffer is automatically
> wrapped-around by the mapping. The amount of data read or
> written in one batch never exceeds the size of the ring buffer, and
> after a batch is read or written, the read or write indices are adjusted
> to put them back into the range of the first mapping of the actual
> ring buffer pages. So there's method to the madness, and the
> technique works pretty well. But this kind of mapping is not
> amenable to using vmap_phys_range().

Hmm. Can you point me to where this is mapped? Especially for the classic non-isolated case where no vmap/vmalloc mapping is involved at all?
Re: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support
On Tue, Aug 31, 2021 at 11:20:06PM +0800, Tianyu Lan wrote:
>> If so I suspect the best way to allocate them is by not using vmalloc
>> but just discontiguous pages, and then use kmap_local_pfn where the
>> PFN includes the share_gpa offset when actually copying from/to the
>> skbs.
>>
> When netvsc needs to copy packet data to send buffer, it needs to calculate
> position with section_index and send_section_size.
> Please see netvsc_copy_to_send_buf() for detail. So the contiguous virtual
> address of send buffer is necessary to copy data and batch packets.

Actually that makes the kmap approach much easier. The phys_to_virt can just be replaced with a kmap_local_pfn and the unmap needs to be added. I've been mostly focussing on the receive path, which would need a similar treatment.
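[Editor's note: a minimal sketch of what that send-side replacement could look like, assuming the driver keeps an array of send-buffer PFNs already offset by shared_gpa_boundary; all names here are illustrative, not taken from netvsc.]

	static void copy_to_send_buf_pfns(unsigned long *send_buf_pfns,
					  u32 offset, const void *src, u32 len)
	{
		while (len) {
			unsigned long pfn = send_buf_pfns[offset >> PAGE_SHIFT];
			u32 in_page = offset % PAGE_SIZE;
			u32 chunk = min_t(u32, len, PAGE_SIZE - in_page);
			void *dst = kmap_local_pfn(pfn);

			/* Copy at most one page worth, then drop the
			 * temporary mapping before moving on.
			 */
			memcpy(dst + in_page, src, chunk);
			kunmap_local(dst);

			src += chunk;
			offset += chunk;
			len -= chunk;
		}
	}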
RE: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support
From: Christoph Hellwig Sent: Monday, August 30, 2021 5:01 AM
>
> Sorry for the delayed answer, but I looked at the vmap_pfn usage in the
> previous version and tried to come up with a better version. This
> mostly untested branch:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hyperv-vmap
>
> gets us there for swiotlb and the channel infrastructure. I've started
> looking at the network driver and didn't get anywhere due to other work.
>
> As far as I can tell the network driver does gigantic multi-megabyte
> vmalloc allocation for the send and receive buffers, which are then
> passed to the hardware, but always copied to/from when interacting
> with the networking stack. Did I see that right? Are these big
> buffers actually required unlike the normal buffer management schemes
> in other Linux network drivers?
>
> If so I suspect the best way to allocate them is by not using vmalloc
> but just discontiguous pages, and then use kmap_local_pfn where the
> PFN includes the share_gpa offset when actually copying from/to the
> skbs.

As a quick overview, I think there are four places where the shared_gpa_boundary must be applied to adjust the guest physical address that is used. Each requires mapping a corresponding virtual address range. Here are the four places:

1) The so-called "monitor pages" that are a core communication mechanism between the guest and Hyper-V. These are two single pages, and the mapping is handled by calling memremap() for each of the two pages. See Patch 7 of Tianyu's series.

2) The VMbus channel ring buffers. You have proposed using your new vmap_phys_range() helper, but I don't think that works here. More details below.

3) The network driver send and receive buffers. vmap_phys_range() should work here.

4) The swiotlb memory used for bounce buffers. vmap_phys_range() should work here as well.

Case #2 above does unusual mapping. The ring buffer consists of a ring buffer header page, followed by one or more pages that are the actual ring buffer. The pages making up the actual ring buffer are mapped twice in succession. For example, if the ring buffer has 4 pages (one header page and three ring buffer pages), the contiguous virtual mapping must cover these seven pages: 0, 1, 2, 3, 1, 2, 3. The duplicate contiguous mapping allows the code that is reading or writing the actual ring buffer to not be concerned about wrap-around because writing off the end of the ring buffer is automatically wrapped-around by the mapping. The amount of data read or written in one batch never exceeds the size of the ring buffer, and after a batch is read or written, the read or write indices are adjusted to put them back into the range of the first mapping of the actual ring buffer pages. So there's method to the madness, and the technique works pretty well. But this kind of mapping is not amenable to using vmap_phys_range().

Michael
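[Editor's note: for case 1), the remapping amounts to something like the following hedged sketch of what Patch 7 is described as doing; the variable names are assumptions, not the patch's actual identifiers.]

	/* Remap a monitor page at its host-visible alias above the boundary. */
	phys_addr_t monitor_pa = virt_to_phys(monitor_page) +
				 ms_hyperv.shared_gpa_boundary;
	void *monitor_va = memremap(monitor_pa, PAGE_SIZE, MEMREMAP_WB);

	if (!monitor_va)
		return -ENOMEM;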
Re: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support
Hi Christoph:

On 8/30/2021 8:00 PM, Christoph Hellwig wrote:
> Sorry for the delayed answer, but I looked at the vmap_pfn usage in the previous version and tried to come up with a better version. This mostly untested branch:
>
> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hyperv-vmap

No problem. Thank you very much for your suggestion patches; they are very helpful.

> gets us there for swiotlb and the channel infrastructure. I've started looking at the network driver and didn't get anywhere due to other work.
>
> As far as I can tell the network driver does gigantic multi-megabyte vmalloc allocation for the send and receive buffers, which are then passed to the hardware, but always copied to/from when interacting with the networking stack. Did I see that right? Are these big buffers actually required unlike the normal buffer management schemes in other Linux network drivers?

For a send packet, netvsc tries to batch packets in the send buffer if possible. It passes the original skb pages directly to the hypervisor when the send buffer is not big enough or the packet length is larger than the section size. These packets are finally sent via vmbus_sendpacket_pagebuffer(). Please see netvsc_send() for detail. The following code checks whether the packet can be copied into the send buffer. If not, the packet is sent with the original skb pages.

	/* batch packets in send buffer if possible */
	msdp = &nvchan->msd;
	if (msdp->pkt)
		msd_len = msdp->pkt->total_data_buflen;

	try_batch = msd_len > 0 && msdp->count < net_device->max_pkt;
	if (try_batch && msd_len + pktlen + net_device->pkt_align <
	    net_device->send_section_size) {
		section_index = msdp->pkt->send_buf_index;

	} else if (try_batch && msd_len + packet->rmsg_size <
		   net_device->send_section_size) {
		section_index = msdp->pkt->send_buf_index;
		packet->cp_partial = true;

	} else if (pktlen + net_device->pkt_align <
		   net_device->send_section_size) {
		section_index = netvsc_get_next_send_section(net_device);
		if (unlikely(section_index == NETVSC_INVALID_INDEX)) {
			++ndev_ctx->eth_stats.tx_send_full;
		} else {
			move_pkt_msd(&msd_send, &msd_skb, msdp);
			msd_len = 0;
		}
	}

For a receive packet, the data is always copied from the recv buffer.

> If so I suspect the best way to allocate them is by not using vmalloc but just discontiguous pages, and then use kmap_local_pfn where the PFN includes the share_gpa offset when actually copying from/to the skbs.

When netvsc needs to copy packet data to the send buffer, it needs to calculate the position with section_index and send_section_size. Please see netvsc_copy_to_send_buf() for detail. So the contiguous virtual address of the send buffer is necessary to copy data and batch packets.
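[Editor's note: the position calculation Tianyu refers to boils down to simple arithmetic over one contiguous send-buffer VA; a hedged sketch follows, where field names follow the netvsc style but are not guaranteed to match the driver exactly.]

	/* Destination inside the chosen section of the send buffer:
	 * start of buffer + selected section + data already pending there.
	 */
	char *dest = net_device->send_buf +
		     section_index * net_device->send_section_size +
		     pend_size;

	memcpy(dest, packet_data, len);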
Re: [PATCH V4 00/13] x86/Hyper-V: Add Hyper-V Isolation VM support
Sorry for the delayed answer, but I looked at the vmap_pfn usage in the previous version and tried to come up with a better version. This mostly untested branch:

http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/hyperv-vmap

gets us there for swiotlb and the channel infrastructure. I've started looking at the network driver and didn't get anywhere due to other work.

As far as I can tell the network driver does gigantic multi-megabyte vmalloc allocation for the send and receive buffers, which are then passed to the hardware, but always copied to/from when interacting with the networking stack. Did I see that right? Are these big buffers actually required unlike the normal buffer management schemes in other Linux network drivers?

If so I suspect the best way to allocate them is by not using vmalloc but just discontiguous pages, and then use kmap_local_pfn where the PFN includes the share_gpa offset when actually copying from/to the skbs.