Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 22.04.16 at 14:54, Haozhong Zhang wrote:
> On 04/22/16 06:36, Jan Beulich wrote:
>> >>> On 22.04.16 at 14:26, Haozhong Zhang wrote:
>> > On 04/22/16 04:53, Jan Beulich wrote:
>> >> Perhaps I have got confused by the back and forth. If we're to use struct page_info, then everything should be following a similar flow to what happens for normal RAM, i.e. normal page allocation, and normal assignment of pages to guests.
>> >
>> > I'll follow the normal assignment of pages to guests for pmem, but not the normal page allocation, because it is difficult for allocation to always return the same pmem area for the same guest every time. It still needs input from others (e.g. the toolstack) that can provide the exact address.
>>
>> Understood.
>>
>> > Because the address is now not decided by the Xen hypervisor, some permission tracking is needed. For this part, we will re-use the existing one for MMIO. Directly using the existing range struct for pmem may consume too much space, so I proposed to choose different data structures or to put a limitation on the existing range struct to avoid or mitigate this problem.
>>
>> Why would these consume too much space? I'd expect there to be just one or very few chunks, just like is the case for MMIO ranges on devices.
>
> As Ian Jackson indicated [1], there are several cases where a pmem page can be accessed from more than one domain. Then every domain involved needs a range struct to track its access permission to that pmem page. In the worst case, e.g. when the first of every two contiguous pages on a pmem device is assigned to a domain and shared with all other domains, even though the size of the range structs for a single domain may be acceptable, the total will still be very large.

Everything Ian has mentioned there is what normal RAM pages can also get used for, yet, as you have yourself said (still visible in context above), you mean to only do allocation differently. Hence the permission tracking you talk of should be necessary only for the owning domain (to get validated during allocation); everything else should follow the normal life cycle of a RAM page.

Jan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
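Jan's point above, that permission need only be validated once at allocation time for the owning domain, with the pages following the normal RAM life cycle afterwards, could be sketched roughly as follows. This is an illustrative sketch only, not actual Xen code: `struct pmem_perm` and `pmem_alloc_permitted()` are hypothetical names, and the real check would live in the hypervisor's pmem allocation path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical permission record for the owning domain: the pmem MFN
 * range(s) the toolstack is allowed to hand to this domain. */
struct pmem_perm {
    uint64_t start_mfn, end_mfn;   /* inclusive */
};

/* Allocation-time check: the toolstack-provided range [mfn, mfn + nr)
 * must lie entirely inside one permitted range. After this single
 * check, the pages would follow the normal RAM page life cycle. */
static bool pmem_alloc_permitted(const struct pmem_perm *perms,
                                 unsigned int nr_perms,
                                 uint64_t mfn, uint64_t nr)
{
    for (unsigned int i = 0; i < nr_perms; i++)
        if (mfn >= perms[i].start_mfn && mfn + nr - 1 <= perms[i].end_mfn)
            return true;
    return false;
}
```

Note that under this scheme only one such record set exists per pmem region (for its owner), so the tracking cost does not multiply with the number of domains sharing the pages.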
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 22.04.16 at 14:26, Haozhong Zhang wrote:
> On 04/22/16 04:53, Jan Beulich wrote:
>> Perhaps I have got confused by the back and forth. If we're to use struct page_info, then everything should be following a similar flow to what happens for normal RAM, i.e. normal page allocation, and normal assignment of pages to guests.
>
> I'll follow the normal assignment of pages to guests for pmem, but not the normal page allocation, because it is difficult for allocation to always return the same pmem area for the same guest every time. It still needs input from others (e.g. the toolstack) that can provide the exact address.

Understood.

> Because the address is now not decided by the Xen hypervisor, some permission tracking is needed. For this part, we will re-use the existing one for MMIO. Directly using the existing range struct for pmem may consume too much space, so I proposed to choose different data structures or to put a limitation on the existing range struct to avoid or mitigate this problem.

Why would these consume too much space? I'd expect there to be just one or very few chunks, just like is the case for MMIO ranges on devices.

Jan

> The data structure change will be applied only to pmem, and only the code that manipulates the range structs (rangeset_*) will be changed for pmem. So the permission tracking part will still follow the existing one.
>
> Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 04/22/16 04:53, Jan Beulich wrote:
> >>> On 22.04.16 at 12:16, Haozhong Zhang wrote:
> > On 04/22/16 02:24, Jan Beulich wrote:
> > [..]
> >> >> >> Well, using the existing range struct to manage guest access permissions to nvdimm could consume too much space which could not fit in either memory or nvdimm. If the above solution looks really error-prone, perhaps we can still come back to the existing one and restrict the number of range structs each domain could have for nvdimm (e.g. reserve one 4K page per domain for them) to make it work for nvdimm, though it may reject nvdimm mappings that are terribly fragmented.
> >> >> >
> >> >> > Hi Jan,
> >> >> >
> >> >> > Any comments for this?
> >> >>
> >> >> Well, nothing new, i.e. my previous opinion on the old proposal didn't change. I'm really opposed to any artificial limitations here, as I am to any secondary (and hence error prone) code paths. IOW I continue to think that there's no reasonable alternative to re-using the existing memory management infrastructure for at least the PMEM case.
> >> >
> >> > By re-using the existing memory management infrastructure, do you mean re-using the existing model of MMIO for passthrough PCI devices to handle the permission of pmem?
> >>
> >> No, re-using struct page_info.
> >>
> >> >> The only open question remains to be where to place the control structures, and I think the thresholding proposal of yours was quite sensible.
> >> >
> >> > I'm a little confused here. Is 'restrict the number of range structs' in my previous reply the 'thresholding proposal' you mean? Or is it one of the 'artificial limitations'?
> >>
> >> Neither. It's the decision on where to place the struct page_info arrays needed to manage the PMEM ranges.
> >
> > In [1][2], we have agreed to use struct page_info to manage mappings for pmem and place them in a reserved area on pmem.
> >
> > But I think the discussion in this thread is to decide the data structure which will be used to track access permission to host pmem. The discussion started from my question in [3]:
> > | I'm not sure whether the xen toolstack as a userspace program is
> > | considered to be safe to pass the host physical address (of host
> > | NVDIMM) to the hypervisor.
> > In reply [4], you mentioned:
> > | As long as the passing of physical addresses follows the model of
> > | MMIO for passed-through PCI devices, I don't think there's a problem
> > | with the tool stack bypassing the Dom0 kernel. So it really all
> > | depends on how you make sure that the guest won't get to see memory
> > | it has no permission to access.
> >
> > I interpreted it as: the same access permission control mechanism used for MMIO of passthrough PCI devices (built around the range struct) should be used for pmem as well, so that we can safely allow the toolstack to pass the host physical address of an NVDIMM to the hypervisor. Was my understanding wrong from the beginning?
>
> Perhaps I have got confused by the back and forth. If we're to use struct page_info, then everything should be following a similar flow to what happens for normal RAM, i.e. normal page allocation, and normal assignment of pages to guests.

I'll follow the normal assignment of pages to guests for pmem, but not the normal page allocation, because it is difficult for allocation to always return the same pmem area for the same guest every time. It still needs input from others (e.g. the toolstack) that can provide the exact address.

Because the address is now not decided by the Xen hypervisor, some permission tracking is needed. For this part, we will re-use the existing one for MMIO. Directly using the existing range struct for pmem may consume too much space, so I proposed to choose different data structures or to put a limitation on the existing range struct to avoid or mitigate this problem. The data structure change will be applied only to pmem, and only the code that manipulates the range structs (rangeset_*) will be changed for pmem. So the permission tracking part will still follow the existing one.

Haozhong
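The space concern with reusing the range-struct model comes down to the number of ranges growing with fragmentation: one range struct is needed per run of contiguous permitted pages, so alternating permitted/forbidden pages need nr_pages/2 ranges per domain. A small illustrative helper (hypothetical, not Xen code) makes that concrete:

```c
#include <assert.h>
#include <stdbool.h>

/* Counts how many range structs would be needed to describe a page-level
 * permission pattern: one per run of contiguous permitted pages. */
static unsigned int ranges_needed(const bool *permitted, unsigned int nr_pages)
{
    unsigned int runs = 0;

    for (unsigned int i = 0; i < nr_pages; i++)
        if (permitted[i] && (i == 0 || !permitted[i - 1]))
            runs++;   /* a new run of permitted pages starts here */
    return runs;
}
```

For a fully contiguous assignment this returns 1 (Jan's expected common case); for the alternating worst case described above it returns half the page count, which for a multi-terabyte NVDIMM is what motivates looking at other data structures.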
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 22.04.16 at 12:16, Haozhong Zhang wrote:
> On 04/22/16 02:24, Jan Beulich wrote:
> [..]
>> >> >> Well, using the existing range struct to manage guest access permissions to nvdimm could consume too much space which could not fit in either memory or nvdimm. If the above solution looks really error-prone, perhaps we can still come back to the existing one and restrict the number of range structs each domain could have for nvdimm (e.g. reserve one 4K page per domain for them) to make it work for nvdimm, though it may reject nvdimm mappings that are terribly fragmented.
>> >> >
>> >> > Hi Jan,
>> >> >
>> >> > Any comments for this?
>> >>
>> >> Well, nothing new, i.e. my previous opinion on the old proposal didn't change. I'm really opposed to any artificial limitations here, as I am to any secondary (and hence error prone) code paths. IOW I continue to think that there's no reasonable alternative to re-using the existing memory management infrastructure for at least the PMEM case.
>> >
>> > By re-using the existing memory management infrastructure, do you mean re-using the existing model of MMIO for passthrough PCI devices to handle the permission of pmem?
>>
>> No, re-using struct page_info.
>>
>> >> The only open question remains to be where to place the control structures, and I think the thresholding proposal of yours was quite sensible.
>> >
>> > I'm a little confused here. Is 'restrict the number of range structs' in my previous reply the 'thresholding proposal' you mean? Or is it one of the 'artificial limitations'?
>>
>> Neither. It's the decision on where to place the struct page_info arrays needed to manage the PMEM ranges.
>
> In [1][2], we have agreed to use struct page_info to manage mappings for pmem and place them in a reserved area on pmem.
>
> But I think the discussion in this thread is to decide the data structure which will be used to track access permission to host pmem. The discussion started from my question in [3]:
> | I'm not sure whether the xen toolstack as a userspace program is
> | considered to be safe to pass the host physical address (of host
> | NVDIMM) to the hypervisor.
> In reply [4], you mentioned:
> | As long as the passing of physical addresses follows the model of
> | MMIO for passed-through PCI devices, I don't think there's a problem
> | with the tool stack bypassing the Dom0 kernel. So it really all
> | depends on how you make sure that the guest won't get to see memory
> | it has no permission to access.
>
> I interpreted it as: the same access permission control mechanism used for MMIO of passthrough PCI devices (built around the range struct) should be used for pmem as well, so that we can safely allow the toolstack to pass the host physical address of an NVDIMM to the hypervisor. Was my understanding wrong from the beginning?

Perhaps I have got confused by the back and forth. If we're to use struct page_info, then everything should be following a similar flow to what happens for normal RAM, i.e. normal page allocation, and normal assignment of pages to guests.

Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 04/22/16 02:24, Jan Beulich wrote:
[..]
> >> >> Well, using the existing range struct to manage guest access permissions to nvdimm could consume too much space which could not fit in either memory or nvdimm. If the above solution looks really error-prone, perhaps we can still come back to the existing one and restrict the number of range structs each domain could have for nvdimm (e.g. reserve one 4K page per domain for them) to make it work for nvdimm, though it may reject nvdimm mappings that are terribly fragmented.
> >> >
> >> > Hi Jan,
> >> >
> >> > Any comments for this?
> >>
> >> Well, nothing new, i.e. my previous opinion on the old proposal didn't change. I'm really opposed to any artificial limitations here, as I am to any secondary (and hence error prone) code paths. IOW I continue to think that there's no reasonable alternative to re-using the existing memory management infrastructure for at least the PMEM case.
> >
> > By re-using the existing memory management infrastructure, do you mean re-using the existing model of MMIO for passthrough PCI devices to handle the permission of pmem?
>
> No, re-using struct page_info.
>
> >> The only open question remains to be where to place the control structures, and I think the thresholding proposal of yours was quite sensible.
> >
> > I'm a little confused here. Is 'restrict the number of range structs' in my previous reply the 'thresholding proposal' you mean? Or is it one of the 'artificial limitations'?
>
> Neither. It's the decision on where to place the struct page_info arrays needed to manage the PMEM ranges.

In [1][2], we have agreed to use struct page_info to manage mappings for pmem and place them in a reserved area on pmem.

But I think the discussion in this thread is to decide the data structure which will be used to track access permission to host pmem. The discussion started from my question in [3]:
| I'm not sure whether the xen toolstack as a userspace program is
| considered to be safe to pass the host physical address (of host
| NVDIMM) to the hypervisor.
In reply [4], you mentioned:
| As long as the passing of physical addresses follows the model of
| MMIO for passed-through PCI devices, I don't think there's a problem
| with the tool stack bypassing the Dom0 kernel. So it really all
| depends on how you make sure that the guest won't get to see memory
| it has no permission to access.

I interpreted it as: the same access permission control mechanism used for MMIO of passthrough PCI devices (built around the range struct) should be used for pmem as well, so that we can safely allow the toolstack to pass the host physical address of an NVDIMM to the hypervisor. Was my understanding wrong from the beginning?

Thanks,
Haozhong

[1] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01161.html
[2] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01201.html
[3] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html
[4] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01981.html
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 22.04.16 at 04:36, Haozhong Zhang wrote:
> On 04/21/16 01:04, Jan Beulich wrote:
>> >>> On 21.04.16 at 07:09, Haozhong Zhang wrote:
>> > On 04/12/16 16:45, Haozhong Zhang wrote:
>> >> On 04/08/16 09:52, Jan Beulich wrote:
>> >> > >>> On 08.04.16 at 07:02, Haozhong Zhang wrote:
>> >> > > On 03/29/16 04:49, Jan Beulich wrote:
>> >> > >> >>> On 29.03.16 at 12:10, Haozhong Zhang wrote:
>> >> > >> > On 03/29/16 03:11, Jan Beulich wrote:
>> >> > >> >> >>> On 29.03.16 at 10:47, Haozhong Zhang wrote:
>> >> > > [..]
>> >> > >> >> > I still cannot find a neat approach to manage guest permissions for nvdimm pages. A possible one is to use a per-domain bitmap to track permissions: each bit corresponding to an nvdimm page. The bitmap can save lots of space and can even be stored in normal RAM, but operating on it for a large nvdimm range, especially a contiguous one, is slower than a rangeset.
>> >> > >> >>
>> >> > >> >> I don't follow: What would a single bit in that bitmap mean? Any guest may access the page? That surely wouldn't be what we need.
>> >> > >> >
>> >> > >> > For a host having N pages of nvdimm, each domain will have an N-bit bitmap. If the m'th bit of a domain's bitmap is set, then that domain has the permission to access the m'th host nvdimm page.
>> >> > >>
>> >> > >> Which will be more overhead as soon as there are enough such domains in a system.
>> >> > >
>> >> > > Sorry for the late reply.
>> >> > >
>> >> > > I think we can make some optimizations to reduce the space consumed by the bitmap.
>> >> > >
>> >> > > A per-domain bitmap covering the entire host NVDIMM address range is wasteful, especially if the actually used ranges are congregated. We may take the following ways to reduce its space.
>> >> > >
>> >> > > 1) Split the per-domain bitmap into multiple sub-bitmaps, where each sub-bitmap covers a smaller and contiguous sub host NVDIMM address range. In the beginning, no sub-bitmap is allocated for the domain. If the access permission to a host NVDIMM page in a sub host address range is added to a domain, only the sub-bitmap for that address range is allocated for the domain. If access permissions to all host NVDIMM pages in a sub range are removed from a domain, the corresponding sub-bitmap can be freed.
>> >> > >
>> >> > > 2) If a domain has access permissions to all host NVDIMM pages in a sub range, the corresponding sub-bitmap will be replaced by a range struct. If range structs are used to track adjacent ranges, they will be merged into one range struct. If access permissions to some pages in that sub range are removed from a domain, the range struct should be converted back to bitmap segment(s).
>> >> > >
>> >> > > 3) Because there might be lots of the above bitmap segments and range structs per domain, we can organize them in a balanced interval tree to quickly search/add/remove an individual structure.
>> >> > >
>> >> > > In the worst case, where each sub range has non-contiguous pages assigned to a domain, the above solution will use all sub-bitmaps and consume more space than a single bitmap because of the extra space for organization. I assume that the sysadmin should be responsible to ensure the host nvdimm ranges assigned to each domain are as contiguous and congregated as possible in order to avoid the worst case. However, if the worst case does happen, the Xen hypervisor should refuse to assign nvdimm to a guest when it runs out of memory.
>> >> >
>> >> > To be honest, this all sounds pretty unconvincing wrt not using existing code paths - a lot of special treatment, and hence a lot of things that can go (slightly) wrong.
>> >>
>> >> Well, using the existing range struct to manage guest access permissions to nvdimm could consume too much space which could not fit in either memory or nvdimm. If the above solution looks really error-prone, perhaps we can still come back to the existing one and restrict the number of range structs each domain could have for nvdimm (e.g. reserve one 4K page per domain for them) to make it work for nvdimm, though it may reject nvdimm mappings that are terribly fragmented.
>> >
>> > Hi Jan,
>> >
>> > Any comments for this?
>>
>> Well, nothing new, i.e. my previous opinion on the old proposal didn't change. I'm really opposed to any artificial limitations here, as I am to any secondary (and hence error prone) code paths. IOW I continue to think that there's no reasonable alternative to re-using the existing memory management infrastructure for at least the PMEM case.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 04/21/16 01:04, Jan Beulich wrote:
> >>> On 21.04.16 at 07:09, Haozhong Zhang wrote:
> > On 04/12/16 16:45, Haozhong Zhang wrote:
> >> On 04/08/16 09:52, Jan Beulich wrote:
> >> > >>> On 08.04.16 at 07:02, Haozhong Zhang wrote:
> >> > > On 03/29/16 04:49, Jan Beulich wrote:
> >> > >> >>> On 29.03.16 at 12:10, Haozhong Zhang wrote:
> >> > >> > On 03/29/16 03:11, Jan Beulich wrote:
> >> > >> >> >>> On 29.03.16 at 10:47, Haozhong Zhang wrote:
> >> > > [..]
> >> > >> >> > I still cannot find a neat approach to manage guest permissions for nvdimm pages. A possible one is to use a per-domain bitmap to track permissions: each bit corresponding to an nvdimm page. The bitmap can save lots of space and can even be stored in normal RAM, but operating on it for a large nvdimm range, especially a contiguous one, is slower than a rangeset.
> >> > >> >>
> >> > >> >> I don't follow: What would a single bit in that bitmap mean? Any guest may access the page? That surely wouldn't be what we need.
> >> > >> >
> >> > >> > For a host having N pages of nvdimm, each domain will have an N-bit bitmap. If the m'th bit of a domain's bitmap is set, then that domain has the permission to access the m'th host nvdimm page.
> >> > >>
> >> > >> Which will be more overhead as soon as there are enough such domains in a system.
> >> > >
> >> > > Sorry for the late reply.
> >> > >
> >> > > I think we can make some optimizations to reduce the space consumed by the bitmap.
> >> > >
> >> > > A per-domain bitmap covering the entire host NVDIMM address range is wasteful, especially if the actually used ranges are congregated. We may take the following ways to reduce its space.
> >> > >
> >> > > 1) Split the per-domain bitmap into multiple sub-bitmaps, where each sub-bitmap covers a smaller and contiguous sub host NVDIMM address range. In the beginning, no sub-bitmap is allocated for the domain. If the access permission to a host NVDIMM page in a sub host address range is added to a domain, only the sub-bitmap for that address range is allocated for the domain. If access permissions to all host NVDIMM pages in a sub range are removed from a domain, the corresponding sub-bitmap can be freed.
> >> > >
> >> > > 2) If a domain has access permissions to all host NVDIMM pages in a sub range, the corresponding sub-bitmap will be replaced by a range struct. If range structs are used to track adjacent ranges, they will be merged into one range struct. If access permissions to some pages in that sub range are removed from a domain, the range struct should be converted back to bitmap segment(s).
> >> > >
> >> > > 3) Because there might be lots of the above bitmap segments and range structs per domain, we can organize them in a balanced interval tree to quickly search/add/remove an individual structure.
> >> > >
> >> > > In the worst case, where each sub range has non-contiguous pages assigned to a domain, the above solution will use all sub-bitmaps and consume more space than a single bitmap because of the extra space for organization. I assume that the sysadmin should be responsible to ensure the host nvdimm ranges assigned to each domain are as contiguous and congregated as possible in order to avoid the worst case. However, if the worst case does happen, the Xen hypervisor should refuse to assign nvdimm to a guest when it runs out of memory.
> >> >
> >> > To be honest, this all sounds pretty unconvincing wrt not using existing code paths - a lot of special treatment, and hence a lot of things that can go (slightly) wrong.
> >>
> >> Well, using the existing range struct to manage guest access permissions to nvdimm could consume too much space which could not fit in either memory or nvdimm. If the above solution looks really error-prone, perhaps we can still come back to the existing one and restrict the number of range structs each domain could have for nvdimm (e.g. reserve one 4K page per domain for them) to make it work for nvdimm, though it may reject nvdimm mappings that are terribly fragmented.
> >
> > Hi Jan,
> >
> > Any comments for this?
>
> Well, nothing new, i.e. my previous opinion on the old proposal didn't change. I'm really opposed to any artificial limitations here, as I am to any secondary (and hence error prone) code paths. IOW I continue to think that there's no reasonable alternative to re-using the existing memory management infrastructure for at least the PMEM case. The only open question remains to be where to place the control structures, and I think the thresholding proposal of yours was quite sensible.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 21.04.16 at 07:09, Haozhong Zhang wrote:
> On 04/12/16 16:45, Haozhong Zhang wrote:
>> On 04/08/16 09:52, Jan Beulich wrote:
>> > >>> On 08.04.16 at 07:02, Haozhong Zhang wrote:
>> > > On 03/29/16 04:49, Jan Beulich wrote:
>> > >> >>> On 29.03.16 at 12:10, Haozhong Zhang wrote:
>> > >> > On 03/29/16 03:11, Jan Beulich wrote:
>> > >> >> >>> On 29.03.16 at 10:47, Haozhong Zhang wrote:
>> > > [..]
>> > >> >> > I still cannot find a neat approach to manage guest permissions for nvdimm pages. A possible one is to use a per-domain bitmap to track permissions: each bit corresponding to an nvdimm page. The bitmap can save lots of space and can even be stored in normal RAM, but operating on it for a large nvdimm range, especially a contiguous one, is slower than a rangeset.
>> > >> >>
>> > >> >> I don't follow: What would a single bit in that bitmap mean? Any guest may access the page? That surely wouldn't be what we need.
>> > >> >
>> > >> > For a host having N pages of nvdimm, each domain will have an N-bit bitmap. If the m'th bit of a domain's bitmap is set, then that domain has the permission to access the m'th host nvdimm page.
>> > >>
>> > >> Which will be more overhead as soon as there are enough such domains in a system.
>> > >
>> > > Sorry for the late reply.
>> > >
>> > > I think we can make some optimizations to reduce the space consumed by the bitmap.
>> > >
>> > > A per-domain bitmap covering the entire host NVDIMM address range is wasteful, especially if the actually used ranges are congregated. We may take the following ways to reduce its space.
>> > >
>> > > 1) Split the per-domain bitmap into multiple sub-bitmaps, where each sub-bitmap covers a smaller and contiguous sub host NVDIMM address range. In the beginning, no sub-bitmap is allocated for the domain. If the access permission to a host NVDIMM page in a sub host address range is added to a domain, only the sub-bitmap for that address range is allocated for the domain. If access permissions to all host NVDIMM pages in a sub range are removed from a domain, the corresponding sub-bitmap can be freed.
>> > >
>> > > 2) If a domain has access permissions to all host NVDIMM pages in a sub range, the corresponding sub-bitmap will be replaced by a range struct. If range structs are used to track adjacent ranges, they will be merged into one range struct. If access permissions to some pages in that sub range are removed from a domain, the range struct should be converted back to bitmap segment(s).
>> > >
>> > > 3) Because there might be lots of the above bitmap segments and range structs per domain, we can organize them in a balanced interval tree to quickly search/add/remove an individual structure.
>> > >
>> > > In the worst case, where each sub range has non-contiguous pages assigned to a domain, the above solution will use all sub-bitmaps and consume more space than a single bitmap because of the extra space for organization. I assume that the sysadmin should be responsible to ensure the host nvdimm ranges assigned to each domain are as contiguous and congregated as possible in order to avoid the worst case. However, if the worst case does happen, the Xen hypervisor should refuse to assign nvdimm to a guest when it runs out of memory.
>> >
>> > To be honest, this all sounds pretty unconvincing wrt not using existing code paths - a lot of special treatment, and hence a lot of things that can go (slightly) wrong.
>>
>> Well, using the existing range struct to manage guest access permissions to nvdimm could consume too much space which could not fit in either memory or nvdimm. If the above solution looks really error-prone, perhaps we can still come back to the existing one and restrict the number of range structs each domain could have for nvdimm (e.g. reserve one 4K page per domain for them) to make it work for nvdimm, though it may reject nvdimm mappings that are terribly fragmented.
>
> Hi Jan,
>
> Any comments for this?

Well, nothing new, i.e. my previous opinion on the old proposal didn't change. I'm really opposed to any artificial limitations here, as I am to any secondary (and hence error prone) code paths. IOW I continue to think that there's no reasonable alternative to re-using the existing memory management infrastructure for at least the PMEM case. The only open question remains to be where to place the control structures, and I think the thresholding proposal of yours was quite sensible.

Jan
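The "where to place the control structures" question has a simple space dimension: the struct page_info array describing a pmem region can be carved out of the region itself. A back-of-envelope sketch, with an assumed (purely illustrative) 32-byte struct page_info and hypothetical macro names to avoid clashing with system headers:

```c
#include <assert.h>
#include <stdint.h>

#define PMEM_PAGE_SIZE      4096ULL
#define PMEM_PAGE_INFO_SIZE 32ULL   /* assumed sizeof(struct page_info) */

/* Number of pages to reserve at the start of an nr_pages-long pmem
 * region so the region can hold its own struct page_info array. */
static uint64_t pmem_mgmt_pages(uint64_t nr_pages)
{
    return (nr_pages * PMEM_PAGE_INFO_SIZE + PMEM_PAGE_SIZE - 1)
           / PMEM_PAGE_SIZE;
}
```

With these assumed sizes the management overhead is PMEM_PAGE_INFO_SIZE / PMEM_PAGE_SIZE, i.e. under 1% of the region, which is what makes placing the arrays on the pmem device itself attractive; the thresholding proposal referred to above would decide between this placement and normal RAM.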
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 04/12/16 16:45, Haozhong Zhang wrote: > On 04/08/16 09:52, Jan Beulich wrote: > > >>> On 08.04.16 at 07:02,wrote: > > > On 03/29/16 04:49, Jan Beulich wrote: > > >> >>> On 29.03.16 at 12:10, wrote: > > >> > On 03/29/16 03:11, Jan Beulich wrote: > > >> >> >>> On 29.03.16 at 10:47, wrote: > > > [..] > > >> >> > I still cannot find a neat approach to manage guest permissions for > > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track > > >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap > > >> >> > can > > >> >> > save lots of spaces and even be stored in the normal ram, but > > >> >> > operating it for a large nvdimm range, especially for a contiguous > > >> >> > one, is slower than rangeset. > > >> >> > > >> >> I don't follow: What would a single bit in that bitmap mean? Any > > >> >> guest may access the page? That surely wouldn't be what we > > >> >> need. > > >> >> > > >> > > > >> > For a host having a N pages of nvdimm, each domain will have a N bits > > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain > > >> > has the permission to access the m'th host nvdimm page. > > >> > > >> Which will be more overhead as soon as there are enough such > > >> domains in a system. > > >> > > > > > > Sorry for the late reply. > > > > > > I think we can make some optimization to reduce the space consumed by > > > the bitmap. > > > > > > A per-domain bitmap covering the entire host NVDIMM address range is > > > wasteful especially if the actual used ranges are congregated. We may > > > take following ways to reduce its space. > > > > > > 1) Split the per-domain bitmap into multiple sub-bitmap and each > > >sub-bitmap covers a smaller and contiguous sub host NVDIMM address > > >range. In the beginning, no sub-bitmap is allocated for the > > >domain. 
If the access permission to a host NVDIMM page in a sub > > >host address range is added to a domain, only the sub-bitmap for > > >that address range is allocated for the domain. If access > > >permissions to all host NVDIMM pages in a sub range are removed > > >from a domain, the corresponding sub-bitmap can be freed. > > > > > > 2) If a domain has access permissions to all host NVDIMM pages in a > > >sub range, the corresponding sub-bitmap will be replaced by a range > > >struct. If range structs are used to track adjacent ranges, they > > >will be merged into one range struct. If access permissions to some > > >pages in that sub range are removed from a domain, the range struct > > >should be converted back to bitmap segment(s). > > > > > > 3) Because there might be lots of above bitmap segments and range > > >structs per-domain, we can organize them in a balanced interval > > >tree to quickly search/add/remove an individual structure. > > > > > > In the worst case that each sub range has non-contiguous pages > > > assigned to a domain, above solution will use all sub-bitmaps and > > > consume more space than a single bitmap because of the extra space for > > > organization. I assume that the sysadmin should be responsible to > > > ensure the host nvdimm ranges assigned to each domain as contiguous > > > and congregated as possible in order to avoid the worst case. However, > > > if the worst case does happen, xen hypervisor should refuse to assign > > > nvdimm to guest when it runs out of memory. > > > > To be honest, this all sounds pretty unconvincing wrt not using > > existing code paths - a lot of special treatment, and hence a lot > > of things that can go (slightly) wrong. > > > > Well, using existing range struct to manage guest access permissions > to nvdimm could consume too much space which could not fit in either > memory or nvdimm. 
If the above solution looks really error-prone, > perhaps we can still come back to the existing one and restrict the > number of range structs each domain could have for nvdimm > (e.g. reserve one 4K-page per-domain for them) to make it work for > nvdimm, though it may reject nvdimm mapping that is terribly > fragmented. Hi Jan, Any comments for this? Thanks, Haozhong ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
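The lazily allocated sub-bitmaps described in point 1) above can be sketched as follows. This is an illustrative model only, not Xen code: the sub-range size and all names are assumptions, and a Python set stands in for the actual bit array.

```python
# Illustrative sketch of point 1): per-domain permission tracking with
# lazily allocated sub-bitmaps. Not Xen code; SUB_RANGE_PAGES and all
# names are assumptions, and a set stands in for a real bitmap.

SUB_RANGE_PAGES = 4096  # pages covered by one sub-bitmap (assumed)

class DomainPmemPerms:
    def __init__(self):
        # sub-range index -> set of page bits; no sub-bitmap exists
        # until the first permission in its sub-range is granted
        self.sub = {}

    def grant(self, pfn):
        idx = pfn // SUB_RANGE_PAGES
        self.sub.setdefault(idx, set()).add(pfn % SUB_RANGE_PAGES)

    def revoke(self, pfn):
        idx = pfn // SUB_RANGE_PAGES
        bits = self.sub.get(idx)
        if bits:
            bits.discard(pfn % SUB_RANGE_PAGES)
            if not bits:           # last permission in the sub-range gone:
                del self.sub[idx]  # free the sub-bitmap

    def allowed(self, pfn):
        bits = self.sub.get(pfn // SUB_RANGE_PAGES)
        return bool(bits) and (pfn % SUB_RANGE_PAGES) in bits
```

The point of the sketch is only the allocation policy: space is consumed per sub-range actually granted, not for the whole host NVDIMM.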
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 04/08/16 09:52, Jan Beulich wrote: > >>> On 08.04.16 at 07:02,wrote: > > On 03/29/16 04:49, Jan Beulich wrote: > >> >>> On 29.03.16 at 12:10, wrote: > >> > On 03/29/16 03:11, Jan Beulich wrote: > >> >> >>> On 29.03.16 at 10:47, wrote: > > [..] > >> >> > I still cannot find a neat approach to manage guest permissions for > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track > >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can > >> >> > save lots of spaces and even be stored in the normal ram, but > >> >> > operating it for a large nvdimm range, especially for a contiguous > >> >> > one, is slower than rangeset. > >> >> > >> >> I don't follow: What would a single bit in that bitmap mean? Any > >> >> guest may access the page? That surely wouldn't be what we > >> >> need. > >> >> > >> > > >> > For a host having a N pages of nvdimm, each domain will have a N bits > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain > >> > has the permission to access the m'th host nvdimm page. > >> > >> Which will be more overhead as soon as there are enough such > >> domains in a system. > >> > > > > Sorry for the late reply. > > > > I think we can make some optimization to reduce the space consumed by > > the bitmap. > > > > A per-domain bitmap covering the entire host NVDIMM address range is > > wasteful especially if the actual used ranges are congregated. We may > > take following ways to reduce its space. > > > > 1) Split the per-domain bitmap into multiple sub-bitmap and each > >sub-bitmap covers a smaller and contiguous sub host NVDIMM address > >range. In the beginning, no sub-bitmap is allocated for the > >domain. If the access permission to a host NVDIMM page in a sub > >host address range is added to a domain, only the sub-bitmap for > >that address range is allocated for the domain. 
If access > >permissions to all host NVDIMM pages in a sub range are removed > >from a domain, the corresponding sub-bitmap can be freed. > > > > 2) If a domain has access permissions to all host NVDIMM pages in a > >sub range, the corresponding sub-bitmap will be replaced by a range > >struct. If range structs are used to track adjacent ranges, they > >will be merged into one range struct. If access permissions to some > >pages in that sub range are removed from a domain, the range struct > >should be converted back to bitmap segment(s). > > > > 3) Because there might be lots of above bitmap segments and range > >structs per-domain, we can organize them in a balanced interval > >tree to quickly search/add/remove an individual structure. > > > > In the worst case that each sub range has non-contiguous pages > > assigned to a domain, above solution will use all sub-bitmaps and > > consume more space than a single bitmap because of the extra space for > > organization. I assume that the sysadmin should be responsible to > > ensure the host nvdimm ranges assigned to each domain as contiguous > > and congregated as possible in order to avoid the worst case. However, > > if the worst case does happen, xen hypervisor should refuse to assign > > nvdimm to guest when it runs out of memory. > > To be honest, this all sounds pretty unconvincing wrt not using > existing code paths - a lot of special treatment, and hence a lot > of things that can go (slightly) wrong. > Well, using existing range struct to manage guest access permissions to nvdimm could consume too much space which could not fit in either memory or nvdimm. If the above solution looks really error-prone, perhaps we can still come back to the existing one and restrict the number of range structs each domain could have for nvdimm (e.g. reserve one 4K-page per-domain for them) to make it work for nvdimm, though it may reject nvdimm mapping that is terribly fragmented. 
Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 08.04.16 at 07:02,wrote: > On 03/29/16 04:49, Jan Beulich wrote: >> >>> On 29.03.16 at 12:10, wrote: >> > On 03/29/16 03:11, Jan Beulich wrote: >> >> >>> On 29.03.16 at 10:47, wrote: > [..] >> >> > I still cannot find a neat approach to manage guest permissions for >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can >> >> > save lots of spaces and even be stored in the normal ram, but >> >> > operating it for a large nvdimm range, especially for a contiguous >> >> > one, is slower than rangeset. >> >> >> >> I don't follow: What would a single bit in that bitmap mean? Any >> >> guest may access the page? That surely wouldn't be what we >> >> need. >> >> >> > >> > For a host having a N pages of nvdimm, each domain will have a N bits >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain >> > has the permission to access the m'th host nvdimm page. >> >> Which will be more overhead as soon as there are enough such >> domains in a system. >> > > Sorry for the late reply. > > I think we can make some optimization to reduce the space consumed by > the bitmap. > > A per-domain bitmap covering the entire host NVDIMM address range is > wasteful especially if the actual used ranges are congregated. We may > take following ways to reduce its space. > > 1) Split the per-domain bitmap into multiple sub-bitmap and each >sub-bitmap covers a smaller and contiguous sub host NVDIMM address >range. In the beginning, no sub-bitmap is allocated for the >domain. If the access permission to a host NVDIMM page in a sub >host address range is added to a domain, only the sub-bitmap for >that address range is allocated for the domain. If access >permissions to all host NVDIMM pages in a sub range are removed >from a domain, the corresponding sub-bitmap can be freed. 
> > 2) If a domain has access permissions to all host NVDIMM pages in a >sub range, the corresponding sub-bitmap will be replaced by a range >struct. If range structs are used to track adjacent ranges, they >will be merged into one range struct. If access permissions to some >pages in that sub range are removed from a domain, the range struct >should be converted back to bitmap segment(s). > > 3) Because there might be lots of above bitmap segments and range >structs per-domain, we can organize them in a balanced interval >tree to quickly search/add/remove an individual structure. > > In the worst case that each sub range has non-contiguous pages > assigned to a domain, above solution will use all sub-bitmaps and > consume more space than a single bitmap because of the extra space for > organization. I assume that the sysadmin should be responsible to > ensure the host nvdimm ranges assigned to each domain as contiguous > and congregated as possible in order to avoid the worst case. However, > if the worst case does happen, xen hypervisor should refuse to assign > nvdimm to guest when it runs out of memory. To be honest, this all sounds pretty unconvincing wrt not using existing code paths - a lot of special treatment, and hence a lot of things that can go (slightly) wrong. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
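The range coalescing in point 2) above, where adjacent range structs are merged into one, amounts to the following (an illustrative sketch, not Xen's rangeset implementation; the representation of a range as an inclusive (start, end) tuple is assumed):

```python
# Illustrative sketch of the merging step in point 2): adjacent or
# overlapping inclusive [start, end] page ranges collapse into one
# range struct. Not Xen's rangeset code.

def merge_adjacent(ranges):
    """Coalesce inclusive (start, end) ranges that touch or overlap."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # touches/overlaps the previous range: extend it in place
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Granting a sub-range that abuts an existing one thus costs no extra range struct, which is what keeps the scheme compact for congregated assignments.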
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/29/16 04:49, Jan Beulich wrote: > >>> On 29.03.16 at 12:10,wrote: > > On 03/29/16 03:11, Jan Beulich wrote: > >> >>> On 29.03.16 at 10:47, wrote: [..] > >> > I still cannot find a neat approach to manage guest permissions for > >> > nvdimm pages. A possible one is to use a per-domain bitmap to track > >> > permissions: each bit corresponding to an nvdimm page. The bitmap can > >> > save lots of spaces and even be stored in the normal ram, but > >> > operating it for a large nvdimm range, especially for a contiguous > >> > one, is slower than rangeset. > >> > >> I don't follow: What would a single bit in that bitmap mean? Any > >> guest may access the page? That surely wouldn't be what we > >> need. > >> > > > > For a host having a N pages of nvdimm, each domain will have a N bits > > bitmap. If the m'th bit of a domain's bitmap is set, then that domain > > has the permission to access the m'th host nvdimm page. > > Which will be more overhead as soon as there are enough such > domains in a system. > Sorry for the late reply. I think we can make some optimization to reduce the space consumed by the bitmap. A per-domain bitmap covering the entire host NVDIMM address range is wasteful especially if the actual used ranges are congregated. We may take following ways to reduce its space. 1) Split the per-domain bitmap into multiple sub-bitmap and each sub-bitmap covers a smaller and contiguous sub host NVDIMM address range. In the beginning, no sub-bitmap is allocated for the domain. If the access permission to a host NVDIMM page in a sub host address range is added to a domain, only the sub-bitmap for that address range is allocated for the domain. If access permissions to all host NVDIMM pages in a sub range are removed from a domain, the corresponding sub-bitmap can be freed. 2) If a domain has access permissions to all host NVDIMM pages in a sub range, the corresponding sub-bitmap will be replaced by a range struct. 
If range structs are used to track adjacent ranges, they will be merged into one range struct. If access permissions to some pages in that sub range are removed from a domain, the range struct should be converted back to bitmap segment(s). 3) Because there might be lots of above bitmap segments and range structs per-domain, we can organize them in a balanced interval tree to quickly search/add/remove an individual structure. In the worst case that each sub range has non-contiguous pages assigned to a domain, above solution will use all sub-bitmaps and consume more space than a single bitmap because of the extra space for organization. I assume that the sysadmin should be responsible to ensure the host nvdimm ranges assigned to each domain as contiguous and congregated as possible in order to avoid the worst case. However, if the worst case does happen, xen hypervisor should refuse to assign nvdimm to guest when it runs out of memory. Haozhong ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 29.03.16 at 12:10, <haozhong.zh...@intel.com> wrote: > On 03/29/16 03:11, Jan Beulich wrote: >> >>> On 29.03.16 at 10:47, <haozhong.zh...@intel.com> wrote: >> > On 03/17/16 22:21, Haozhong Zhang wrote: >> >> On 03/17/16 14:00, Ian Jackson wrote: >> >> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM > support for Xen"): >> >> > > QEMU keeps mappings of guest memory because (1) that mapping is >> >> > > created by itself, and/or (2) certain device emulation needs to access >> >> > > the guest memory. But for vNVDIMM, I'm going to move the creation of >> >> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not >> >> > > access vNVDIMM pages mapped to guest, so it's not necessary to let >> >> > > qemu keeps vNVDIMM mappings. >> >> > >> >> > I'm confused by this. >> >> > >> >> > Suppose a guest uses an emulated device (or backend) provided by qemu, >> >> > to do DMA to an vNVDIMM. Then qemu will need to map the real NVDIMM >> >> > pages into its own address space, so that it can write to the memory >> >> > (ie, do the virtual DMA). >> >> > >> >> > That virtual DMA might well involve a direct mapping in the kernel >> >> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write >> >> > directly to the NVDIMM, and with luck the actual device backing the >> >> > virtual device will be able to DMA to the NVDIMM. >> >> > >> >> > All of this seems to me to mean that qemu needs to be able to map >> >> > its guest's parts of NVDIMMs >> >> > >> >> > There are probably other example: memory inspection systems used by >> >> > virus scanners etc.; debuggers used to inspect a guest from outside; >> >> > etc. >> >> > >> >> > I haven't even got started on save/restore... >> >> > >> >> >> >> Oops, so many cases I missed. Thanks Ian for pointing out all these! >> >> Now I need to reconsider how to manage guest permissions for NVDIMM pages. 
>> >> >> > >> > I still cannot find a neat approach to manage guest permissions for >> > nvdimm pages. A possible one is to use a per-domain bitmap to track >> > permissions: each bit corresponding to an nvdimm page. The bitmap can >> > save lots of spaces and even be stored in the normal ram, but >> > operating it for a large nvdimm range, especially for a contiguous >> > one, is slower than rangeset. >> >> I don't follow: What would a single bit in that bitmap mean? Any >> guest may access the page? That surely wouldn't be what we >> need. >> > > For a host having a N pages of nvdimm, each domain will have a N bits > bitmap. If the m'th bit of a domain's bitmap is set, then that domain > has the permission to access the m'th host nvdimm page. Which will be more overhead as soon as there are enough such domains in a system. >> > BTW, if I take the other way to map nvdimm pages to guest >> > (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html) >> > | 2. Or, given the same inputs, we may combine above two steps into a new >> > |dom0 system call that (1) gets the SPA ranges, (2) calls xen >> > |hypercall to map SPA ranges >> > and treat nvdimm as normal ram, then xen will not need to use rangeset >> > or above bitmap to track guest permissions for nvdimm? But looking at >> > how qemu currently populates guest memory via XENMEM_populate_physmap >> > , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks >> > like that mapping a _dedicated_ piece of host ram to guest is not >> > allowed out of the hypervisor (and not allowed even in dom0 kernel)? >> > Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking >> > guest memory? >> >> Well, it's simply because RAM is a resource managed through >> allocation/freeing, instead of via reserving chunks for special >> purposes. 
>> > > So that means xen can always ensure the ram assigned to a guest is > what the guest is permitted to access, so no data structures like > iomem_caps is needed for ram. If I have to introduce a hypercall that > maps the dedicated host ram/nvdimm to guest, then the explicit > permission management is still needed, regardless of who (dom0 kernel, > qemu or toolstack) will use it. Right? Yes (if you really mean to go that route). Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/29/16 03:11, Jan Beulich wrote: > >>> On 29.03.16 at 10:47, <haozhong.zh...@intel.com> wrote: > > On 03/17/16 22:21, Haozhong Zhang wrote: > >> On 03/17/16 14:00, Ian Jackson wrote: > >> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM > >> > support for Xen"): > >> > > QEMU keeps mappings of guest memory because (1) that mapping is > >> > > created by itself, and/or (2) certain device emulation needs to access > >> > > the guest memory. But for vNVDIMM, I'm going to move the creation of > >> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not > >> > > access vNVDIMM pages mapped to guest, so it's not necessary to let > >> > > qemu keeps vNVDIMM mappings. > >> > > >> > I'm confused by this. > >> > > >> > Suppose a guest uses an emulated device (or backend) provided by qemu, > >> > to do DMA to an vNVDIMM. Then qemu will need to map the real NVDIMM > >> > pages into its own address space, so that it can write to the memory > >> > (ie, do the virtual DMA). > >> > > >> > That virtual DMA might well involve a direct mapping in the kernel > >> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write > >> > directly to the NVDIMM, and with luck the actual device backing the > >> > virtual device will be able to DMA to the NVDIMM. > >> > > >> > All of this seems to me to mean that qemu needs to be able to map > >> > its guest's parts of NVDIMMs > >> > > >> > There are probably other example: memory inspection systems used by > >> > virus scanners etc.; debuggers used to inspect a guest from outside; > >> > etc. > >> > > >> > I haven't even got started on save/restore... > >> > > >> > >> Oops, so many cases I missed. Thanks Ian for pointing out all these! > >> Now I need to reconsider how to manage guest permissions for NVDIMM pages. > >> > > > > I still cannot find a neat approach to manage guest permissions for > > nvdimm pages. 
A possible one is to use a per-domain bitmap to track > > permissions: each bit corresponding to an nvdimm page. The bitmap can > > save lots of spaces and even be stored in the normal ram, but > > operating it for a large nvdimm range, especially for a contiguous > > one, is slower than rangeset. > > I don't follow: What would a single bit in that bitmap mean? Any > guest may access the page? That surely wouldn't be what we > need. > For a host having a N pages of nvdimm, each domain will have a N bits bitmap. If the m'th bit of a domain's bitmap is set, then that domain has the permission to access the m'th host nvdimm page. > > BTW, if I take the other way to map nvdimm pages to guest > > (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html) > > | 2. Or, given the same inputs, we may combine above two steps into a new > > |dom0 system call that (1) gets the SPA ranges, (2) calls xen > > |hypercall to map SPA ranges > > and treat nvdimm as normal ram, then xen will not need to use rangeset > > or above bitmap to track guest permissions for nvdimm? But looking at > > how qemu currently populates guest memory via XENMEM_populate_physmap > > , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks > > like that mapping a _dedicated_ piece of host ram to guest is not > > allowed out of the hypervisor (and not allowed even in dom0 kernel)? > > Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking > > guest memory? > > Well, it's simply because RAM is a resource managed through > allocation/freeing, instead of via reserving chunks for special > purposes. > So that means xen can always ensure the ram assigned to a guest is what the guest is permitted to access, so no data structures like iomem_caps is needed for ram. 
If I have to introduce a hypercall that maps the dedicated host ram/nvdimm to guest, then the explicit permission management is still needed, regardless of who (dom0 kernel, qemu or toolstack) will use it. Right? Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 29.03.16 at 10:47, <haozhong.zh...@intel.com> wrote: > On 03/17/16 22:21, Haozhong Zhang wrote: >> On 03/17/16 14:00, Ian Jackson wrote: >> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM >> > support for Xen"): >> > > QEMU keeps mappings of guest memory because (1) that mapping is >> > > created by itself, and/or (2) certain device emulation needs to access >> > > the guest memory. But for vNVDIMM, I'm going to move the creation of >> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not >> > > access vNVDIMM pages mapped to guest, so it's not necessary to let >> > > qemu keeps vNVDIMM mappings. >> > >> > I'm confused by this. >> > >> > Suppose a guest uses an emulated device (or backend) provided by qemu, >> > to do DMA to an vNVDIMM. Then qemu will need to map the real NVDIMM >> > pages into its own address space, so that it can write to the memory >> > (ie, do the virtual DMA). >> > >> > That virtual DMA might well involve a direct mapping in the kernel >> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write >> > directly to the NVDIMM, and with luck the actual device backing the >> > virtual device will be able to DMA to the NVDIMM. >> > >> > All of this seems to me to mean that qemu needs to be able to map >> > its guest's parts of NVDIMMs >> > >> > There are probably other example: memory inspection systems used by >> > virus scanners etc.; debuggers used to inspect a guest from outside; >> > etc. >> > >> > I haven't even got started on save/restore... >> > >> >> Oops, so many cases I missed. Thanks Ian for pointing out all these! >> Now I need to reconsider how to manage guest permissions for NVDIMM pages. >> > > I still cannot find a neat approach to manage guest permissions for > nvdimm pages. A possible one is to use a per-domain bitmap to track > permissions: each bit corresponding to an nvdimm page. 
The bitmap can > save lots of spaces and even be stored in the normal ram, but > operating it for a large nvdimm range, especially for a contiguous > one, is slower than rangeset. I don't follow: What would a single bit in that bitmap mean? Any guest may access the page? That surely wouldn't be what we need. > BTW, if I take the other way to map nvdimm pages to guest > (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html) > | 2. Or, given the same inputs, we may combine above two steps into a new > |dom0 system call that (1) gets the SPA ranges, (2) calls xen > |hypercall to map SPA ranges > and treat nvdimm as normal ram, then xen will not need to use rangeset > or above bitmap to track guest permissions for nvdimm? But looking at > how qemu currently populates guest memory via XENMEM_populate_physmap > , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks > like that mapping a _dedicated_ piece of host ram to guest is not > allowed out of the hypervisor (and not allowed even in dom0 kernel)? > Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking > guest memory? Well, it's simply because RAM is a resource managed through allocation/freeing, instead of via reserving chunks for special purposes. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
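For a sense of the overhead Jan is pointing at: a flat per-domain bitmap costs one bit per NVDIMM page, multiplied by the number of domains. A quick back-of-envelope calculation (page size and NVDIMM size assumed for illustration):

```python
# Back-of-envelope cost of the flat per-domain bitmap: one bit per
# 4 KiB NVDIMM page, one full bitmap per domain. Figures are assumed
# for illustration only.

def bitmap_bytes(nvdimm_bytes, page_size=4096):
    pages = nvdimm_bytes // page_size
    return (pages + 7) // 8  # one bit per page, rounded up to bytes

TiB = 1 << 40
# A 1 TiB NVDIMM needs a 32 MiB bitmap *per domain*; the total grows
# linearly with the number of domains, which is the objection above.
```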
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/17/16 22:21, Haozhong Zhang wrote: > On 03/17/16 14:00, Ian Jackson wrote: > > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM > > support for Xen"): > > > QEMU keeps mappings of guest memory because (1) that mapping is > > > created by itself, and/or (2) certain device emulation needs to access > > > the guest memory. But for vNVDIMM, I'm going to move the creation of > > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not > > > access vNVDIMM pages mapped to guest, so it's not necessary to let > > > qemu keeps vNVDIMM mappings. > > > > I'm confused by this. > > > > Suppose a guest uses an emulated device (or backend) provided by qemu, > > to do DMA to an vNVDIMM. Then qemu will need to map the real NVDIMM > > pages into its own address space, so that it can write to the memory > > (ie, do the virtual DMA). > > > > That virtual DMA might well involve a direct mapping in the kernel > > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write > > directly to the NVDIMM, and with luck the actual device backing the > > virtual device will be able to DMA to the NVDIMM. > > > > All of this seems to me to mean that qemu needs to be able to map > > its guest's parts of NVDIMMs > > > > There are probably other example: memory inspection systems used by > > virus scanners etc.; debuggers used to inspect a guest from outside; > > etc. > > > > I haven't even got started on save/restore... > > > > Oops, so many cases I missed. Thanks Ian for pointing out all these! > Now I need to reconsider how to manage guest permissions for NVDIMM pages. > I still cannot find a neat approach to manage guest permissions for nvdimm pages. A possible one is to use a per-domain bitmap to track permissions: each bit corresponding to an nvdimm page. The bitmap can save lots of spaces and even be stored in the normal ram, but operating it for a large nvdimm range, especially for a contiguous one, is slower than rangeset. 
BTW, if I take the other way to map nvdimm pages to guest (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html) | 2. Or, given the same inputs, we may combine above two steps into a new |dom0 system call that (1) gets the SPA ranges, (2) calls xen |hypercall to map SPA ranges and treat nvdimm as normal ram, then xen will not need to use rangeset or above bitmap to track guest permissions for nvdimm? But looking at how qemu currently populates guest memory via XENMEM_populate_physmap , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks like that mapping a _dedicated_ piece of host ram to guest is not allowed out of the hypervisor (and not allowed even in dom0 kernel)? Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking guest memory? Thanks, Haozhong ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
Hi Jan and Konrad, On 03/04/16 15:30, Haozhong Zhang wrote: > Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM > or files on NVDIMM. We can move that work to toolstack and pass SPA > ranges got by toolstack to qemu. In this way, no privileged operations > (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to > work even with vNVDIMM hotplug in future. > As I'm going to let the toolstack get NVDIMM SPA ranges, this can be done via dom0 kernel interfaces and xen hypercalls, and can be implemented in different ways. I'm wondering which of the following is preferred for xen. 1. Given * a file descriptor of either a NVDIMM device or a file on NVDIMM, and * the domain id and guest MFN where the vNVDIMM is going to be, the xen toolstack (1) gets its SPA ranges via a dom0 kernel interface (e.g. sysfs and the FIEMAP ioctl), and (2) calls a hypercall to map the above SPA ranges to the given guest MFN of the given domain. 2. Or, given the same inputs, we may combine the above two steps into a new dom0 system call that (1) gets the SPA ranges, (2) calls a xen hypercall to map the SPA ranges, and, one step further, (3) returns the SPA ranges to userspace (because QEMU needs these addresses to build ACPI). The first way does not need to modify the dom0 linux kernel, while the second requires a new system call. I'm not sure whether the xen toolstack, as a userspace program, is considered safe to pass host physical addresses to the hypervisor. If not, maybe the second one is better? Thanks, Haozhong
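The data flow of the first option - byte extents of the file/device reported by the dom0 kernel, converted to page frames for a mapping hypercall - can be sketched as follows. The function name and the frame-range representation are made up for illustration; a real implementation would obtain the extents via the FIEMAP ioctl rather than take them as a list.

```python
# Illustrative sketch only: turning byte extents (as a FIEMAP-style
# interface might report for a file on NVDIMM) into page-frame ranges
# that a hypothetical mapping hypercall could consume. All names are
# made up.

PAGE_SHIFT = 12

def extents_to_frame_ranges(extents):
    """extents: list of (spa_start, length) in bytes, page-aligned.
    Returns a list of (first_frame, nr_frames) tuples."""
    ranges = []
    for spa, length in extents:
        # the toolstack can only map whole pages, so extents must be
        # page-aligned; a real implementation would reject others
        assert spa % (1 << PAGE_SHIFT) == 0
        assert length % (1 << PAGE_SHIFT) == 0
        ranges.append((spa >> PAGE_SHIFT, length >> PAGE_SHIFT))
    return ranges
```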
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"): > So that again leaves unaddressed the question of what you > imply to do when a guest elects to use such a page as page > table. I'm afraid any attempt of yours to invent something that > is not struct page_info will not be suitable for all possible needs. It is not clear to me whether this is a realistic thing for a guest to want to do. Haozhong, maybe you want to consider this aspect. If you can come up with an argument why it is OK to simply not permit this, then maybe the recordkeeping requirements can be relaxed? Ian.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 17.03.16 at 09:58,wrote: > On 03/16/16 09:23, Jan Beulich wrote: >> >>> On 16.03.16 at 15:55, wrote: >> > On 03/16/16 08:23, Jan Beulich wrote: >> >> >>> On 16.03.16 at 14:55, wrote: >> >> > On 03/16/16 07:16, Jan Beulich wrote: >> >> >> And >> >> >> talking of fragmentation - how do you mean to track guest >> >> >> permissions for an unbounded number of address ranges? >> >> >> >> >> > >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot >> >> > of memory, so I think they are another candidate that should be put in >> >> > the reserved area on NVDIMM. If we only allow to grant access >> >> > permissions to NVDIMM page by page (rather than byte), the number of >> >> > range structs for each NVDIMM in the worst case is still decidable. >> >> >> >> Of course the permission granularity is going to by pages, not >> >> bytes (or else we couldn't allow the pages to be mapped into >> >> guest address space). And the limit on the per-domain range >> >> sets isn't going to be allowed to be bumped significantly, at >> >> least not for any of the existing ones (or else you'd have to >> >> prove such bumping can't be abused). >> > >> > What is that limit? the total number of range structs in per-domain >> > range sets? I must miss something when looking through 'case >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that >> > limit, unless it means alloc_range() will fail when there are lots of >> > range structs. >> >> Oh, I'm sorry, that was a different set of range sets I was >> thinking about. But note that excessive creation of ranges >> through XEN_DOMCTL_iomem_permission is not a security issue >> just because of XSA-77, i.e. we'd still not knowingly allow a >> severe increase here. >> > > I didn't notice that multiple domains can all have access permission > to an iomem range, i.e. there can be multiple range structs for a > single iomem range. 
If range structs for NVDIMM are put on NVDIMM, > then there would be still a huge amount of them on NVDIMM in the worst > case (maximum number of domains * number of NVDIMM pages). > > A workaround is to only allow a range of NVDIMM pages be accessed by a > single domain. Whenever we add the access permission of NVDIMM pages > to a domain, we also remove the permission from its current > grantee. In this way, we only need to put 'number of NVDIMM pages' > range structs on NVDIMM in the worst case. But will this work? There's a reason multiple domains are permitted access: The domain running qemu for the guest, for example, needs to be able to access guest memory. No matter how much you and others are opposed to this, I can't help myself thinking that PMEM regions should be treated like RAM (and hence be under full control of Xen), whereas PBLK regions could indeed be treated like MMIO (and hence partly be under the control of Dom0). Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
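The single-grantee workaround proposed above can be sketched as follows (illustrative only; as the reply in the same message points out, it is likely too restrictive, since e.g. the domain running qemu for the guest also needs access):

```python
# Illustrative sketch of the single-grantee workaround: each NVDIMM
# page records at most one domain with access, so granting to a new
# domain implicitly revokes the previous grantee. Not Xen code; the
# reply above explains why this bookkeeping is too restrictive.

class ExclusivePmemGrants:
    def __init__(self):
        self.grantee = {}  # page -> domid of the sole grantee

    def grant(self, domid, pages):
        for p in pages:
            self.grantee[p] = domid  # old grantee, if any, loses access

    def allowed(self, domid, page):
        return self.grantee.get(page) == domid
```

This caps the bookkeeping at one record per NVDIMM page rather than one per (domain, page) pair, which is the space saving claimed above.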
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
> Then there is another problem (which also exists in the current > design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM > that access label storage area (for namespace) for example: No. And it really can't, as each vendor's _DSM is different - and there is no ACPI AML interpreter inside the Xen hypervisor. > > The way Linux reserving space on pmem mode NVDIMM is to leave the > reserved space at the beginning of pmem mode NVDIMM and create a pmem > namespace which starts from the end of the reserved space. Because the > reservation information is written in the namespace in the NVDIMM > label storage area, every OS that follows the namespace spec would not > mistakenly write files in the reserved area. I prefer to the same way > if Xen is going to do the reservation. We definitely don't want dom0 > to break the label storage area, so Xen seemingly needs to emulate the > corresponding _DSM functions for dom0? If so, which part, the > hypervisor or the toolstack, should do the emulation? But we do not want Xen to do the reservation. The control guest (Dom0) is the one that will mount the NVDIMM, and extract the system ranges from the files on the NVDIMM - and glue them to a guest. It is also the job of Dom0 to actually partition the NVDIMM as it sees fit. Actually let me step back. It is the job of the guest who has the full NVDIMM in it. At bootup it is Dom0 - but you can very well 'unplug' the NVDIMM from Dom0 and assign it wholesale to a guest. Granted, at that point the _DSM operations have to go through QEMU, which ends up calling the dom0 ioctls on PMEM to do the operation (like getting the SMART data). > > Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 17.03.16 at 13:44,wrote: > On 03/17/16 05:04, Jan Beulich wrote: >> >>> On 17.03.16 at 09:58, wrote: >> > On 03/16/16 09:23, Jan Beulich wrote: >> >> >>> On 16.03.16 at 15:55, wrote: >> >> > On 03/16/16 08:23, Jan Beulich wrote: >> >> >> >>> On 16.03.16 at 14:55, wrote: >> >> >> > On 03/16/16 07:16, Jan Beulich wrote: >> >> >> >> And >> >> >> >> talking of fragmentation - how do you mean to track guest >> >> >> >> permissions for an unbounded number of address ranges? >> >> >> >> >> >> >> > >> >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a >> >> >> > lot >> >> >> > of memory, so I think they are another candidate that should be put >> >> >> > in >> >> >> > the reserved area on NVDIMM. If we only allow to grant access >> >> >> > permissions to NVDIMM page by page (rather than byte), the number of >> >> >> > range structs for each NVDIMM in the worst case is still decidable. >> >> >> >> >> >> Of course the permission granularity is going to by pages, not >> >> >> bytes (or else we couldn't allow the pages to be mapped into >> >> >> guest address space). And the limit on the per-domain range >> >> >> sets isn't going to be allowed to be bumped significantly, at >> >> >> least not for any of the existing ones (or else you'd have to >> >> >> prove such bumping can't be abused). >> >> > >> >> > What is that limit? the total number of range structs in per-domain >> >> > range sets? I must miss something when looking through 'case >> >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that >> >> > limit, unless it means alloc_range() will fail when there are lots of >> >> > range structs. >> >> >> >> Oh, I'm sorry, that was a different set of range sets I was >> >> thinking about. But note that excessive creation of ranges >> >> through XEN_DOMCTL_iomem_permission is not a security issue >> >> just because of XSA-77, i.e. we'd still not knowingly allow a >> >> severe increase here. 
>> >> >> > >> > I didn't notice that multiple domains can all have access permission >> > to an iomem range, i.e. there can be multiple range structs for a >> > single iomem range. If range structs for NVDIMM are put on NVDIMM, >> > then there would be still a huge amount of them on NVDIMM in the worst >> > case (maximum number of domains * number of NVDIMM pages). >> > >> > A workaround is to only allow a range of NVDIMM pages be accessed by a >> > single domain. Whenever we add the access permission of NVDIMM pages >> > to a domain, we also remove the permission from its current >> > grantee. In this way, we only need to put 'number of NVDIMM pages' >> > range structs on NVDIMM in the worst case. >> >> But will this work? There's a reason multiple domains are permitted >> access: The domain running qemu for the guest, for example, >> needs to be able to access guest memory. >> > > QEMU now only maintains ACPI tables and emulates _DSM for vNVDIMM > which both do not need to access NVDIMM pages mapped to guest. For one - this was only an example. And then - iirc qemu keeps mappings of certain guest RAM ranges. If I'm remembering this right, then why would it be excluded that it also may need mappings of guest NVDIMM? >> No matter how much you and others are opposed to this, I can't >> help myself thinking that PMEM regions should be treated like RAM >> (and hence be under full control of Xen), whereas PBLK regions >> could indeed be treated like MMIO (and hence partly be under the >> control of Dom0). >> > > Hmm, making Xen has full control could at least make reserving space > on NVDIMM easier. I guess full control does not include manipulating > file systems on NVDIMM which can be still left to dom0? > > Then there is another problem (which also exists in the current > design): does Xen need to emulate NVDIMM _DSM for dom0? 
Take the _DSM > that access label storage area (for namespace) for example: > > The way Linux reserving space on pmem mode NVDIMM is to leave the > reserved space at the beginning of pmem mode NVDIMM and create a pmem > namespace which starts from the end of the reserved space. Because the > reservation information is written in the namespace in the NVDIMM > label storage area, every OS that follows the namespace spec would not > mistakenly write files in the reserved area. I prefer to the same way > if Xen is going to do the reservation. We definitely don't want dom0 > to break the label storage area, so Xen seemingly needs to emulate the > corresponding _DSM functions for dom0? If so, which part, the > hypervisor or the toolstack, should do the emulation? I don't think I can answer all but the very last point: Of course this can't be done in the tool stack, since afaict the Dom0 kernel will want to evaluate _DSM before the tool stack even runs. Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/17/16 14:00, Ian Jackson wrote: > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support > for Xen"): > > QEMU keeps mappings of guest memory because (1) that mapping is > > created by itself, and/or (2) certain device emulation needs to access > > the guest memory. But for vNVDIMM, I'm going to move the creation of > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not > > access vNVDIMM pages mapped to guest, so it's not necessary to let > > qemu keeps vNVDIMM mappings. > > I'm confused by this. > > Suppose a guest uses an emulated device (or backend) provided by qemu, > to do DMA to an vNVDIMM. Then qemu will need to map the real NVDIMM > pages into its own address space, so that it can write to the memory > (ie, do the virtual DMA). > > That virtual DMA might well involve a direct mapping in the kernel > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write > directly to the NVDIMM, and with luck the actual device backing the > virtual device will be able to DMA to the NVDIMM. > > All of this seems to me to mean that qemu needs to be able to map > its guest's parts of NVDIMMs > > There are probably other example: memory inspection systems used by > virus scanners etc.; debuggers used to inspect a guest from outside; > etc. > > I haven't even got started on save/restore... > Oops, so many cases I missed. Thanks Ian for pointing out all these! Now I need to reconsider how to manage guest permissions for NVDIMM pages. Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/17/16 22:12, Xu, Quan wrote: > On March 17, 2016 9:37pm, Haozhong Zhang wrote: > > For PV guests (if we add vNVDIMM support for them in future), as I'm going > > to > > use page_info struct for it, I suppose the current mechanism in Xen can > > handle > > this case. I'm not familiar with PV memory management > > The below web may be helpful: > http://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management > > :) > Quan > Thanks! Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/16/16 08:23, Jan Beulich wrote: > >>> On 16.03.16 at 14:55,wrote: > > On 03/16/16 07:16, Jan Beulich wrote: > >> Which reminds me: When considering a file on NVDIMM, how > >> are you making sure the mapping of the file to disk (i.e. > >> memory) blocks doesn't change while the guest has access > >> to it, e.g. due to some defragmentation going on? > > > > The current linux kernel 4.5 has an experimental "raw device dax > > support" (enabled by removing "depends on BROKEN" from "config > > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver > > developers are going to make it non-broken in linux kernel 4.6. > > But there you talk about full devices, whereas my question was > for files. > the raw device dax support is for files on NVDIMM. > >> And > >> talking of fragmentation - how do you mean to track guest > >> permissions for an unbounded number of address ranges? > >> > > > > In this case range structs in iomem_caps for NVDIMMs may consume a lot > > of memory, so I think they are another candidate that should be put in > > the reserved area on NVDIMM. If we only allow to grant access > > permissions to NVDIMM page by page (rather than byte), the number of > > range structs for each NVDIMM in the worst case is still decidable. > > Of course the permission granularity is going to by pages, not > bytes (or else we couldn't allow the pages to be mapped into > guest address space). And the limit on the per-domain range > sets isn't going to be allowed to be bumped significantly, at > least not for any of the existing ones (or else you'd have to > prove such bumping can't be abused). What is that limit? the total number of range structs in per-domain range sets? I must miss something when looking through 'case XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that limit, unless it means alloc_range() will fail when there are lots of range structs. 
> Putting such control > structures on NVDIMM is a nice idea, but following our isolation > model for normal memory, any such memory used by Xen > would then need to be (made) inaccessible to Dom0. > I'm not clear how this is done. By marking those inaccessible pages as not present in dom0's page table? Or is there any example I can follow? Thanks, Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On Wed, Mar 16, 2016 at 08:55:08PM +0800, Haozhong Zhang wrote: > Hi Jan and Konrad, > > On 03/04/16 15:30, Haozhong Zhang wrote: > > Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM > > or files on NVDIMM. We can move that work to toolstack and pass SPA > > ranges got by toolstack to qemu. In this way, no privileged operations > > (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to > > work even with vNVDIMM hotplug in future. > > > > As I'm going to let toolstack to get NVDIMM SPA ranges. This can be > done via dom0 kernel interface and xen hypercalls, and can be > implemented in different ways. I'm wondering which of the following > ones is preferred by xen. > > 1. Given > * a file descriptor of either a NVDIMM device or a file on NVDIMM, and > * domain id and guest MFN where vNVDIMM is going to be. > xen toolstack (1) gets its SPA ranges via dom0 kernel interface > (e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map > above SPA ranges to the given guest MFN of the given domain. > > 2. Or, given the same inputs, we may combine above two steps into a new > dom0 system call that (1) gets the SPA ranges, (2) calls xen > hypercall to map SPA ranges, and, one step further, (3) returns SPA > ranges to userspace (because QEMU needs these addresses to build ACPI). > > The first way does not need to modify dom0 linux kernel, while the > second requires a new system call. I'm not sure whether xen toolstack > as a userspace program is considered to be safe to pass the host physical > address to hypervisor. If not, maybe the second one is better? Well, the toolstack does it already (for MMIO ranges of PCIe devices and such). I would prefer 1) as it means less kernel code. > > Thanks, > Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 17.03.16 at 14:29,wrote: > On 03/17/16 06:59, Jan Beulich wrote: >> >>> On 17.03.16 at 13:44, wrote: >> > Hmm, making Xen has full control could at least make reserving space >> > on NVDIMM easier. I guess full control does not include manipulating >> > file systems on NVDIMM which can be still left to dom0? >> > >> > Then there is another problem (which also exists in the current >> > design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM >> > that access label storage area (for namespace) for example: >> > >> > The way Linux reserving space on pmem mode NVDIMM is to leave the >> > reserved space at the beginning of pmem mode NVDIMM and create a pmem >> > namespace which starts from the end of the reserved space. Because the >> > reservation information is written in the namespace in the NVDIMM >> > label storage area, every OS that follows the namespace spec would not >> > mistakenly write files in the reserved area. I prefer to the same way >> > if Xen is going to do the reservation. We definitely don't want dom0 >> > to break the label storage area, so Xen seemingly needs to emulate the >> > corresponding _DSM functions for dom0? If so, which part, the >> > hypervisor or the toolstack, should do the emulation? >> >> I don't think I can answer all but the very last point: Of course this >> can't be done in the tool stack, since afaict the Dom0 kernel will >> want to evaluate _DSM before the tool stack even runs. > > Or, we could modify dom0 kernel to just use the label storage area as is > and does not modify it. Can xen hypervisor trust dom0 kernel in this aspect? I think so, yes. Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/16/16 07:16, Jan Beulich wrote: > >>> On 16.03.16 at 13:55,wrote: > > Hi Jan and Konrad, > > > > On 03/04/16 15:30, Haozhong Zhang wrote: > >> Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM > >> or files on NVDIMM. We can move that work to toolstack and pass SPA > >> ranges got by toolstack to qemu. In this way, no privileged operations > >> (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to > >> work even with vNVDIMM hotplug in future. > >> > > > > As I'm going to let toolstack to get NVDIMM SPA ranges. This can be > > done via dom0 kernel interface and xen hypercalls, and can be > > implemented in different ways. I'm wondering which of the following > > ones is preferred by xen. > > > > 1. Given > > * a file descriptor of either a NVDIMM device or a file on NVDIMM, and > > * domain id and guest MFN where vNVDIMM is going to be. > >xen toolstack (1) gets it SPA ranges via dom0 kernel interface > >(e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map > >above SPA ranges to the given guest MFN of the given domain. > > > > 2. Or, given the same inputs, we may combine above two steps into a new > >dom0 system call that (1) gets the SPA ranges, (2) calls xen > >hypercall to map SPA ranges, and, one step further, (3) returns SPA > >ranges to userspace (because QEMU needs these addresses to build ACPI). > > DYM GPA here? Qemu should hardly have a need for SPA when > wanting to build ACPI tables for the guest. > Oh, it should be GPA for QEMU and (3) is not needed. > > The first way does not need to modify dom0 linux kernel, while the > > second requires a new system call. I'm not sure whether xen toolstack > > as a userspace program is considered to be safe to pass the host physical > > address to hypervisor. If not, maybe the second one is better? 
> > As long as the passing of physical addresses follows to model > of MMIO for passed through PCI devices, I don't think there's > problem with the tool stack bypassing the Dom0 kernel. So it > really all depends on how you make sure that the guest won't > get to see memory it has no permission to access. > So the toolstack should first use XEN_DOMCTL_iomem_permission to grant permissions to the guest and then call XEN_DOMCTL_memory_mapping for the mapping. > Which reminds me: When considering a file on NVDIMM, how > are you making sure the mapping of the file to disk (i.e. > memory) blocks doesn't change while the guest has access > to it, e.g. due to some defragmentation going on? The current linux kernel 4.5 has an experimental "raw device dax support" (enabled by removing "depends on BROKEN" from "config BLK_DEV_DAX") which can guarantee a consistent mapping. The driver developers are going to make it non-broken in linux kernel 4.6. > And > talking of fragmentation - how do you mean to track guest > permissions for an unbounded number of address ranges? > In this case range structs in iomem_caps for NVDIMMs may consume a lot of memory, so I think they are another candidate that should be put in the reserved area on NVDIMM. If we only allow granting access permissions to NVDIMM page by page (rather than by byte), the number of range structs for each NVDIMM in the worst case is still bounded. Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 16.03.16 at 13:55,wrote: > Hi Jan and Konrad, > > On 03/04/16 15:30, Haozhong Zhang wrote: >> Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM >> or files on NVDIMM. We can move that work to toolstack and pass SPA >> ranges got by toolstack to qemu. In this way, no privileged operations >> (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to >> work even with vNVDIMM hotplug in future. >> > > As I'm going to let toolstack to get NVDIMM SPA ranges. This can be > done via dom0 kernel interface and xen hypercalls, and can be > implemented in different ways. I'm wondering which of the following > ones is preferred by xen. > > 1. Given > * a file descriptor of either a NVDIMM device or a file on NVDIMM, and > * domain id and guest MFN where vNVDIMM is going to be. > xen toolstack (1) gets its SPA ranges via dom0 kernel interface > (e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map > above SPA ranges to the given guest MFN of the given domain. > > 2. Or, given the same inputs, we may combine above two steps into a new > dom0 system call that (1) gets the SPA ranges, (2) calls xen > hypercall to map SPA ranges, and, one step further, (3) returns SPA > ranges to userspace (because QEMU needs these addresses to build ACPI). DYM GPA here? Qemu should hardly have a need for SPA when wanting to build ACPI tables for the guest. > The first way does not need to modify dom0 linux kernel, while the > second requires a new system call. I'm not sure whether xen toolstack > as a userspace program is considered to be safe to pass the host physical > address to hypervisor. If not, maybe the second one is better? As long as the passing of physical addresses follows the model of MMIO for passed-through PCI devices, I don't think there's a problem with the tool stack bypassing the Dom0 kernel. So it really all depends on how you make sure that the guest won't get to see memory it has no permission to access.
Which reminds me: When considering a file on NVDIMM, how are you making sure the mapping of the file to disk (i.e. memory) blocks doesn't change while the guest has access to it, e.g. due to some defragmentation going on? And talking of fragmentation - how do you mean to track guest permissions for an unbounded number of address ranges? Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/16/16 09:23, Jan Beulich wrote: > >>> On 16.03.16 at 15:55,wrote: > > On 03/16/16 08:23, Jan Beulich wrote: > >> >>> On 16.03.16 at 14:55, wrote: > >> > On 03/16/16 07:16, Jan Beulich wrote: > >> >> Which reminds me: When considering a file on NVDIMM, how > >> >> are you making sure the mapping of the file to disk (i.e. > >> >> memory) blocks doesn't change while the guest has access > >> >> to it, e.g. due to some defragmentation going on? > >> > > >> > The current linux kernel 4.5 has an experimental "raw device dax > >> > support" (enabled by removing "depends on BROKEN" from "config > >> > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver > >> > developers are going to make it non-broken in linux kernel 4.6. > >> > >> But there you talk about full devices, whereas my question was > >> for files. > >> > > > > the raw device dax support is for files on NVDIMM. > > Okay, I can only trust you here. I thought FS_DAX is the file level > thing. > > >> >> And > >> >> talking of fragmentation - how do you mean to track guest > >> >> permissions for an unbounded number of address ranges? > >> >> > >> > > >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot > >> > of memory, so I think they are another candidate that should be put in > >> > the reserved area on NVDIMM. If we only allow to grant access > >> > permissions to NVDIMM page by page (rather than byte), the number of > >> > range structs for each NVDIMM in the worst case is still decidable. > >> > >> Of course the permission granularity is going to by pages, not > >> bytes (or else we couldn't allow the pages to be mapped into > >> guest address space). And the limit on the per-domain range > >> sets isn't going to be allowed to be bumped significantly, at > >> least not for any of the existing ones (or else you'd have to > >> prove such bumping can't be abused). > > > > What is that limit? the total number of range structs in per-domain > > range sets? 
I must miss something when looking through 'case > > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that > > limit, unless it means alloc_range() will fail when there are lots of > > range structs. > > Oh, I'm sorry, that was a different set of range sets I was > thinking about. But note that excessive creation of ranges > through XEN_DOMCTL_iomem_permission is not a security issue > just because of XSA-77, i.e. we'd still not knowingly allow a > severe increase here. > I didn't notice that multiple domains can all have access permission to an iomem range, i.e. there can be multiple range structs for a single iomem range. If range structs for NVDIMM are put on NVDIMM, then there would be still a huge amount of them on NVDIMM in the worst case (maximum number of domains * number of NVDIMM pages). A workaround is to only allow a range of NVDIMM pages be accessed by a single domain. Whenever we add the access permission of NVDIMM pages to a domain, we also remove the permission from its current grantee. In this way, we only need to put 'number of NVDIMM pages' range structs on NVDIMM in the worst case. > >> Putting such control > >> structures on NVDIMM is a nice idea, but following our isolation > >> model for normal memory, any such memory used by Xen > >> would then need to be (made) inaccessible to Dom0. > > > > I'm not clear how this is done. By marking those inaccessible pages as > > unpresent in dom0's page table? Or any example I can follow? > > That's the problem - so far we had no need to do so since Dom0 > was only ever allowed access to memory Xen didn't use for itself > or knows it wants to share. Whereas now you want such a > resource controlled first by Dom0, and only then handed to Xen. > So yes, Dom0 would need to zap any mappings of these pages > (and Xen would need to verify that, which would come mostly > without new code as long as struct page_info gets properly > used for all this memory) before Xen could use it. 
Much like > ballooning out a normal RAM page. > Thanks, I'll look into this balloon approach. Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/17/16 06:59, Jan Beulich wrote: > >>> On 17.03.16 at 13:44,wrote: > > On 03/17/16 05:04, Jan Beulich wrote: > >> >>> On 17.03.16 at 09:58, wrote: > >> > On 03/16/16 09:23, Jan Beulich wrote: > >> >> >>> On 16.03.16 at 15:55, wrote: > >> >> > On 03/16/16 08:23, Jan Beulich wrote: > >> >> >> >>> On 16.03.16 at 14:55, wrote: > >> >> >> > On 03/16/16 07:16, Jan Beulich wrote: > >> >> >> >> And > >> >> >> >> talking of fragmentation - how do you mean to track guest > >> >> >> >> permissions for an unbounded number of address ranges? > >> >> >> >> > >> >> >> > > >> >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a > >> >> >> > lot > >> >> >> > of memory, so I think they are another candidate that should be > >> >> >> > put in > >> >> >> > the reserved area on NVDIMM. If we only allow to grant access > >> >> >> > permissions to NVDIMM page by page (rather than byte), the number > >> >> >> > of > >> >> >> > range structs for each NVDIMM in the worst case is still decidable. > >> >> >> > >> >> >> Of course the permission granularity is going to by pages, not > >> >> >> bytes (or else we couldn't allow the pages to be mapped into > >> >> >> guest address space). And the limit on the per-domain range > >> >> >> sets isn't going to be allowed to be bumped significantly, at > >> >> >> least not for any of the existing ones (or else you'd have to > >> >> >> prove such bumping can't be abused). > >> >> > > >> >> > What is that limit? the total number of range structs in per-domain > >> >> > range sets? I must miss something when looking through 'case > >> >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that > >> >> > limit, unless it means alloc_range() will fail when there are lots of > >> >> > range structs. > >> >> > >> >> Oh, I'm sorry, that was a different set of range sets I was > >> >> thinking about. 
But note that excessive creation of ranges > >> >> through XEN_DOMCTL_iomem_permission is not a security issue > >> >> just because of XSA-77, i.e. we'd still not knowingly allow a > >> >> severe increase here. > >> >> > >> > > >> > I didn't notice that multiple domains can all have access permission > >> > to an iomem range, i.e. there can be multiple range structs for a > >> > single iomem range. If range structs for NVDIMM are put on NVDIMM, > >> > then there would be still a huge amount of them on NVDIMM in the worst > >> > case (maximum number of domains * number of NVDIMM pages). > >> > > >> > A workaround is to only allow a range of NVDIMM pages be accessed by a > >> > single domain. Whenever we add the access permission of NVDIMM pages > >> > to a domain, we also remove the permission from its current > >> > grantee. In this way, we only need to put 'number of NVDIMM pages' > >> > range structs on NVDIMM in the worst case. > >> > >> But will this work? There's a reason multiple domains are permitted > >> access: The domain running qemu for the guest, for example, > >> needs to be able to access guest memory. > >> > > > > QEMU now only maintains ACPI tables and emulates _DSM for vNVDIMM > > which both do not need to access NVDIMM pages mapped to guest. > > For one - this was only an example. And then - iirc qemu keeps > mappings of certain guest RAM ranges. If I'm remembering this > right, then why would it be excluded that it also may need > mappings of guest NVDIMM? > QEMU keeps mappings of guest memory because (1) that mapping is created by itself, and/or (2) certain device emulation needs to access the guest memory. But for vNVDIMM, I'm going to move the creation of its mappings out of qemu to toolstack and vNVDIMM in QEMU does not access vNVDIMM pages mapped to guest, so it's not necessary to let qemu keep vNVDIMM mappings.
> >> No matter how much you and others are opposed to this, I can't > >> help myself thinking that PMEM regions should be treated like RAM > >> (and hence be under full control of Xen), whereas PBLK regions > >> could indeed be treated like MMIO (and hence partly be under the > >> control of Dom0). > >> > > > > Hmm, making Xen has full control could at least make reserving space > > on NVDIMM easier. I guess full control does not include manipulating > > file systems on NVDIMM which can be still left to dom0? > > > > Then there is another problem (which also exists in the current > > design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM > > that access label storage area (for namespace) for example: > > > > The way Linux reserving space on pmem mode NVDIMM is to leave the > > reserved space at the beginning of pmem mode NVDIMM and create a pmem > > namespace which starts from the end of the reserved space. Because the > > reservation information is written in the namespace in the NVDIMM > > label storage area, every OS that follows the namespace spec would not > > mistakenly write files in the
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/17/16 11:05, Ian Jackson wrote: > Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for > Xen"): > > So that again leaves unaddressed the question of what you > > imply to do when a guest elects to use such a page as page > > table. I'm afraid any attempt of yours to invent something that > > is not struct page_info will not be suitable for all possible needs. > > It is not clear to me whether this is a realistic thing for a guest to > want to do. Haozhong, maybe you want to consider this aspect. > For HVM guests, it is their own responsibility not to grant (e.g. in xen-blk/net drivers) a vNVDIMM page containing page tables to others. For PV guests (if we add vNVDIMM support for them in future), as I'm going to use page_info struct for it, I suppose the current mechanism in Xen can handle this case. I'm not familiar with PV memory management and have to admit I didn't find the exact code that handles the case where a memory page contains the guest page table. Jan, could you indicate the code that I can follow to understand what xen does in this case? Thanks, Haozhong > If you can come up with an argument why it is OK to simply not permit > this, then maybe the recordkeeping requirements can be relaxed ? > > Ian.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 16.03.16 at 14:55,wrote: > On 03/16/16 07:16, Jan Beulich wrote: >> Which reminds me: When considering a file on NVDIMM, how >> are you making sure the mapping of the file to disk (i.e. >> memory) blocks doesn't change while the guest has access >> to it, e.g. due to some defragmentation going on? > > The current linux kernel 4.5 has an experimental "raw device dax > support" (enabled by removing "depends on BROKEN" from "config > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver > developers are going to make it non-broken in linux kernel 4.6. But there you talk about full devices, whereas my question was for files. >> And >> talking of fragmentation - how do you mean to track guest >> permissions for an unbounded number of address ranges? >> > > In this case range structs in iomem_caps for NVDIMMs may consume a lot > of memory, so I think they are another candidate that should be put in > the reserved area on NVDIMM. If we only allow to grant access > permissions to NVDIMM page by page (rather than byte), the number of > range structs for each NVDIMM in the worst case is still decidable. Of course the permission granularity is going to be pages, not bytes (or else we couldn't allow the pages to be mapped into guest address space). And the limit on the per-domain range sets isn't going to be allowed to be bumped significantly, at least not for any of the existing ones (or else you'd have to prove such bumping can't be abused). Putting such control structures on NVDIMM is a nice idea, but following our isolation model for normal memory, any such memory used by Xen would then need to be (made) inaccessible to Dom0. Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On March 17, 2016 9:37pm, Haozhong Zhang wrote: > For PV guests (if we add vNVDIMM support for them in future), as I'm going to > use page_info struct for it, I suppose the current mechanism in Xen can handle > this case. I'm not familiar with PV memory management The web page below may be helpful: http://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management :) Quan > and have to admit I > didn't find the exact code that handles the case that a memory page contains > the guest page table. Jan, could you indicate the code that I can follow to > understand what xen does in this case?
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 16.03.16 at 15:55, wrote:
> On 03/16/16 08:23, Jan Beulich wrote:
>> >>> On 16.03.16 at 14:55, wrote:
>> > On 03/16/16 07:16, Jan Beulich wrote:
>> >> Which reminds me: When considering a file on NVDIMM, how
>> >> are you making sure the mapping of the file to disk (i.e.
>> >> memory) blocks doesn't change while the guest has access
>> >> to it, e.g. due to some defragmentation going on?
>> >
>> > The current linux kernel 4.5 has an experimental "raw device dax
>> > support" (enabled by removing "depends on BROKEN" from "config
>> > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver
>> > developers are going to make it non-broken in linux kernel 4.6.
>>
>> But there you talk about full devices, whereas my question was
>> for files.
>
> the raw device dax support is for files on NVDIMM.

Okay, I can only trust you here. I thought FS_DAX is the file level
thing.

>> >> And talking of fragmentation - how do you mean to track guest
>> >> permissions for an unbounded number of address ranges?
>> >
>> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
>> > of memory, so I think they are another candidate that should be put in
>> > the reserved area on NVDIMM. If we only allow to grant access
>> > permissions to NVDIMM page by page (rather than byte), the number of
>> > range structs for each NVDIMM in the worst case is still bounded.
>>
>> Of course the permission granularity is going to be pages, not
>> bytes (or else we couldn't allow the pages to be mapped into
>> guest address space). And the limit on the per-domain range
>> sets isn't going to be allowed to be bumped significantly, at
>> least not for any of the existing ones (or else you'd have to
>> prove such bumping can't be abused).
>
> What is that limit? the total number of range structs in per-domain
> range sets? I must have missed something when looking through 'case
> XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> limit, unless it means alloc_range() will fail when there are lots of
> range structs.

Oh, I'm sorry, that was a different set of range sets I was thinking
about. But note that excessive creation of ranges through
XEN_DOMCTL_iomem_permission is not a security issue just because of
XSA-77, i.e. we'd still not knowingly allow a severe increase here.

>> Putting such control structures on NVDIMM is a nice idea, but
>> following our isolation model for normal memory, any such memory
>> used by Xen would then need to be (made) inaccessible to Dom0.
>
> I'm not clear how this is done. By marking those inaccessible pages as
> unpresent in dom0's page table? Or any example I can follow?

That's the problem - so far we had no need to do so since Dom0 was
only ever allowed access to memory Xen didn't use for itself or knows
it wants to share. Whereas now you want such a resource controlled
first by Dom0, and only then handed to Xen. So yes, Dom0 would need to
zap any mappings of these pages (and Xen would need to verify that,
which would come mostly without new code as long as struct page_info
gets properly used for all this memory) before Xen could use it. Much
like ballooning out a normal RAM page.

Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 17.03.16 at 14:37, <haozhong.zh...@intel.com> wrote:
> On 03/17/16 11:05, Ian Jackson wrote:
>> Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
>> > So that again leaves unaddressed the question of what you
>> > imply to do when a guest elects to use such a page as page
>> > table. I'm afraid any attempt of yours to invent something that
>> > is not struct page_info will not be suitable for all possible needs.
>>
>> It is not clear to me whether this is a realistic thing for a guest to
>> want to do. Haozhong, maybe you want to consider this aspect.
>
> For HVM guests, it's their own responsibility to not grant (e.g. in
> xen-blk/net drivers) a vNVDIMM page containing page tables to others.
>
> For PV guests (if we add vNVDIMM support for them in future), as I'm
> going to use page_info struct for it, I suppose the current mechanism
> in Xen can handle this case. I'm not familiar with PV memory
> management and have to admit I didn't find the exact code that handles
> the case that a memory page contains the guest page table. Jan, could
> you indicate the code that I can follow to understand what xen does in
> this case?

xen/arch/x86/mm.c has functions like __get_page_type(),
alloc_page_type(), alloc_l[1234]_table(), and mod_l[1234]_entry()
which all participate in this.

Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/17/16 07:56, Jan Beulich wrote:
> >>> On 17.03.16 at 14:37, <haozhong.zh...@intel.com> wrote:
> > On 03/17/16 11:05, Ian Jackson wrote:
> >> Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> >> > So that again leaves unaddressed the question of what you
> >> > imply to do when a guest elects to use such a page as page
> >> > table. I'm afraid any attempt of yours to invent something that
> >> > is not struct page_info will not be suitable for all possible needs.
> >>
> >> It is not clear to me whether this is a realistic thing for a guest to
> >> want to do. Haozhong, maybe you want to consider this aspect.
> >
> > For HVM guests, it's their own responsibility to not grant (e.g. in
> > xen-blk/net drivers) a vNVDIMM page containing page tables to others.
> >
> > For PV guests (if we add vNVDIMM support for them in future), as I'm
> > going to use page_info struct for it, I suppose the current mechanism
> > in Xen can handle this case. I'm not familiar with PV memory
> > management and have to admit I didn't find the exact code that handles
> > the case that a memory page contains the guest page table. Jan, could
> > you indicate the code that I can follow to understand what xen does in
> > this case?
>
> xen/arch/x86/mm.c has functions like __get_page_type(),
> alloc_page_type(), alloc_l[1234]_table(), and mod_l[1234]_entry()
> which all participate in this.

Thanks!

Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/17/16 05:04, Jan Beulich wrote:
> >>> On 17.03.16 at 09:58, wrote:
> > On 03/16/16 09:23, Jan Beulich wrote:
> >> >>> On 16.03.16 at 15:55, wrote:
> >> > On 03/16/16 08:23, Jan Beulich wrote:
> >> >> >>> On 16.03.16 at 14:55, wrote:
> >> >> > On 03/16/16 07:16, Jan Beulich wrote:
> >> >> >> And talking of fragmentation - how do you mean to track guest
> >> >> >> permissions for an unbounded number of address ranges?
> >> >> >
> >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
> >> >> > of memory, so I think they are another candidate that should be put in
> >> >> > the reserved area on NVDIMM. If we only allow to grant access
> >> >> > permissions to NVDIMM page by page (rather than byte), the number of
> >> >> > range structs for each NVDIMM in the worst case is still bounded.
> >> >>
> >> >> Of course the permission granularity is going to be pages, not
> >> >> bytes (or else we couldn't allow the pages to be mapped into
> >> >> guest address space). And the limit on the per-domain range
> >> >> sets isn't going to be allowed to be bumped significantly, at
> >> >> least not for any of the existing ones (or else you'd have to
> >> >> prove such bumping can't be abused).
> >> >
> >> > What is that limit? the total number of range structs in per-domain
> >> > range sets? I must have missed something when looking through 'case
> >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> >> > limit, unless it means alloc_range() will fail when there are lots of
> >> > range structs.
> >>
> >> Oh, I'm sorry, that was a different set of range sets I was
> >> thinking about. But note that excessive creation of ranges
> >> through XEN_DOMCTL_iomem_permission is not a security issue
> >> just because of XSA-77, i.e. we'd still not knowingly allow a
> >> severe increase here.
> >
> > I didn't notice that multiple domains can all have access permission
> > to an iomem range, i.e. there can be multiple range structs for a
> > single iomem range. If range structs for NVDIMM are put on NVDIMM,
> > then there would be still a huge amount of them on NVDIMM in the worst
> > case (maximum number of domains * number of NVDIMM pages).
> >
> > A workaround is to only allow a range of NVDIMM pages be accessed by a
> > single domain. Whenever we add the access permission of NVDIMM pages
> > to a domain, we also remove the permission from its current
> > grantee. In this way, we only need to put 'number of NVDIMM pages'
> > range structs on NVDIMM in the worst case.
>
> But will this work? There's a reason multiple domains are permitted
> access: The domain running qemu for the guest, for example,
> needs to be able to access guest memory.

QEMU now only maintains ACPI tables and emulates _DSM for vNVDIMM,
neither of which needs to access NVDIMM pages mapped to the guest.

> No matter how much you and others are opposed to this, I can't
> help myself thinking that PMEM regions should be treated like RAM
> (and hence be under full control of Xen), whereas PBLK regions
> could indeed be treated like MMIO (and hence partly be under the
> control of Dom0).

Hmm, making Xen have full control could at least make reserving space
on NVDIMM easier. I guess full control does not include manipulating
file systems on NVDIMM, which can still be left to dom0?

Then there is another problem (which also exists in the current
design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
that accesses the label storage area (for namespaces) for example:

The way Linux reserves space on a pmem mode NVDIMM is to leave the
reserved space at the beginning of the pmem mode NVDIMM and create a
pmem namespace which starts from the end of the reserved space.
Because the reservation information is written in the namespace in the
NVDIMM label storage area, every OS that follows the namespace spec
would not mistakenly write files in the reserved area.

I'd prefer the same way if Xen is going to do the reservation. We
definitely don't want dom0 to break the label storage area, so Xen
seemingly needs to emulate the corresponding _DSM functions for dom0?
If so, which part, the hypervisor or the toolstack, should do the
emulation?

Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"):
> QEMU keeps mappings of guest memory because (1) that mapping is
> created by itself, and/or (2) certain device emulation needs to access
> the guest memory. But for vNVDIMM, I'm going to move the creation of
> its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
> access vNVDIMM pages mapped to guest, so it's not necessary to let
> qemu keep vNVDIMM mappings.

I'm confused by this. Suppose a guest uses an emulated device (or
backend) provided by qemu, to do DMA to a vNVDIMM. Then qemu will need
to map the real NVDIMM pages into its own address space, so that it
can write to the memory (ie, do the virtual DMA).

That virtual DMA might well involve a direct mapping in the kernel
underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
directly to the NVDIMM, and with luck the actual device backing the
virtual device will be able to DMA to the NVDIMM.

All of this seems to me to mean that qemu needs to be able to map its
guest's parts of NVDIMMs.

There are probably other examples: memory inspection systems used by
virus scanners etc.; debuggers used to inspect a guest from outside;
etc. I haven't even got started on save/restore...

Ian.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/09/16 09:17, Jan Beulich wrote:
> >>> On 09.03.16 at 13:22, wrote:
> > On 03/08/16 02:27, Jan Beulich wrote:
> >> >>> On 08.03.16 at 10:15, wrote:
[...]
> > I should re-explain the choice of data structures and where to put them.
> >
> > For handling MCE for NVDIMM, we need to track following data:
> > (1) SPA ranges of host NVDIMMs (one range per pmem interleave set),
> > which are used to check whether a MCE is for NVDIMM.
> > (2) GFN to which a NVDIMM page is mapped, which is used to determine
> > the address put in vMCE.
> > (3) the domain to which a NVDIMM page is mapped, which is used to
> > determine whether a vMCE needs to be injected and where it will be
> > injected.
> > (4) a flag to mark whether a NVDIMM page is broken, which is used to
> > avoid mapping broken page to guests.
> >
> > For granting NVDIMM pages (e.g. xen-blkback/netback),
> > (5) a reference counter is needed for each NVDIMM page
> >
> > Above data can be organized as below:
> >
> > * For (1) SPA ranges, we can record them in a global data structure,
> > e.g. a list
> >
> > struct list_head nvdimm_iset_list;
> >
> > struct nvdimm_iset
> > {
> >  uint64_t base;  /* starting SPA of this interleave set */
> >  uint64_t size;  /* size of this interleave set */
> >  struct nvdimm_page *pages;  /* information for individual pages in
> >                                 this interleave set */
> >  struct list_head list;
> > };
> >
> > * For (2) GFN, an intuitive place to get this information is from M2P
> > table machine_to_phys_mapping[]. However, the address of NVDIMM is
> > not required to be contiguous with normal ram, so, if NVDIMM starts
> > from an address that is much higher than the end address of normal
> > ram, it may result in a M2P table that maybe too large to fit in the
> > normal ram. Therefore, we choose to not put GFNs of NVDIMM in M2P
> > table.
>
> Any page that _may_ be used by a guest as normal RAM page
> must have its mach->phys translation entered in the M2P. That's
> because a r/o variant of that table is part of the hypervisor ABI
> for PV guests. Size considerations simply don't apply here - the
> table may be sparse (guests are required to deal with accesses
> potentially faulting), and the 256Gb of virtual address space set
> aside for it cover all memory up to the 47-bit boundary (there's
> room for doubling this). Memory at addresses with bit 47 (or
> higher) set would need a complete overhaul of that mechanism,
> and whatever new mechanism we may pick would mean old
> guests won't be able to benefit.

OK, then we can use M2P to get PFNs of NVDIMM pages. And ...

> > Another possible solution is to extend page_info to include GFN for
> > NVDIMM and use frame_table. A benefit of this solution is that other
> > data (3)-(5) can be got from page_info as well. However, due to the
> > same reason for machine_to_phys_mapping[] and the concern that the
> > large number of page_info structures required for large NVDIMMs may
> > consume lots of ram, page_info and frame_table seems not a good place
> > either.
>
> For this particular item struct page_info is the wrong place
> anyway, due to what I've said above. Also extension
> suggestions of struct page_info are quite problematic, as any
> such implies a measurable increase on the memory overhead
> the hypervisor incurs. Plus the structure right now is (with the
> exception of the bigmem configuration) a carefully arranged
> for power of two in size.
>
> > * At the end, we choose to introduce a new data structure for above
> > per-page data (2)-(5)
> >
> > struct nvdimm_page
> > {
> >  struct domain *domain;     /* for (3) */
> >  uint64_t gfn;              /* for (2) */
> >  unsigned long count_info;  /* for (4) and (5), same as
> >                                page_info->count_info */
> >  /* other fields if needed, e.g. lock */
> > }
>
> So that again leaves unaddressed the question of what you
> imply to do when a guest elects to use such a page as page
> table. I'm afraid any attempt of yours to invent something that
> is not struct page_info will not be suitable for all possible needs.

... we can use page_info struct rather than nvdimm_page struct for
NVDIMM pages and benefit from whatever has been done with page_info.

> > On each NVDIMM interleave set, we could reserve an area to place an
> > array of nvdimm_page structures for pages in that interleave set. In
> > addition, the corresponding global nvdimm_iset structure is set to
> > point to this array via its 'pages' field.
>
> And I see no problem doing exactly that, just for an array of
> struct page_info.

Yes, page_info arrays. Because page_info structs for NVDIMM may be put
in NVDIMM, existing code that gets page_info from frame_table needs to
be adjusted for NVDIMM pages to use nvdimm_iset
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 09.03.16 at 13:22, wrote:
> On 03/08/16 02:27, Jan Beulich wrote:
>> >>> On 08.03.16 at 10:15, wrote:
>> > More thoughts on reserving NVDIMM space for per-page structures
>> >
>> > Currently, a per-page struct for managing mapping of NVDIMM pages may
>> > include following fields:
>> >
>> > struct nvdimm_page
>> > {
>> >  uint64_t mfn;       /* MFN of SPA of this NVDIMM page */
>> >  uint64_t gfn;       /* GFN where this NVDIMM page is mapped */
>> >  domid_t domain_id;  /* which domain is this NVDIMM page mapped to */
>> >  int is_broken;      /* Is this NVDIMM page broken? (for MCE) */
>> > }
>> >
>> > Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
>> > nvdimm_page structures would occupy 12 GB space, which is too hard to
>> > fit in the normal ram on a small memory host. However, for smaller
>> > NVDIMMs and/or hosts with large ram, those structures may still be able
>> > to fit in the normal ram. In the latter circumstance, nvdimm_page
>> > structures are stored in the normal ram, so they can be accessed more
>> > quickly.
>>
>> Not sure how you came to the above structure - it's the first time
>> I see it, yet figuring out what information it needs to hold is what
>> this design process should be about. For example, I don't see why
>> it would need to duplicate M2P / P2M information. Nor do I see why
>> per-page data needs to hold the address of a page (struct
>> page_info also doesn't). And whether storing a domain ID (rather
>> than a pointer to struct domain, as in struct page_info) is the
>> correct thing is also to be determined (rather than just stated).
>>
>> Otoh you make no provisions at all for any kind of ref counting.
>> What if a guest wants to put page tables into NVDIMM space?
>>
>> Since all of your calculations are based upon that fixed assumption
>> on the structure layout, I'm afraid they're not very meaningful
>> without first settling on what data needs tracking in the first place.
>
> I should re-explain the choice of data structures and where to put them.
>
> For handling MCE for NVDIMM, we need to track following data:
> (1) SPA ranges of host NVDIMMs (one range per pmem interleave set),
> which are used to check whether a MCE is for NVDIMM.
> (2) GFN to which a NVDIMM page is mapped, which is used to determine
> the address put in vMCE.
> (3) the domain to which a NVDIMM page is mapped, which is used to
> determine whether a vMCE needs to be injected and where it will be
> injected.
> (4) a flag to mark whether a NVDIMM page is broken, which is used to
> avoid mapping broken page to guests.
>
> For granting NVDIMM pages (e.g. xen-blkback/netback),
> (5) a reference counter is needed for each NVDIMM page
>
> Above data can be organized as below:
>
> * For (1) SPA ranges, we can record them in a global data structure,
> e.g. a list
>
> struct list_head nvdimm_iset_list;
>
> struct nvdimm_iset
> {
>  uint64_t base;  /* starting SPA of this interleave set */
>  uint64_t size;  /* size of this interleave set */
>  struct nvdimm_page *pages;  /* information for individual pages in
>                                 this interleave set */
>  struct list_head list;
> };
>
> * For (2) GFN, an intuitive place to get this information is from M2P
> table machine_to_phys_mapping[]. However, the address of NVDIMM is
> not required to be contiguous with normal ram, so, if NVDIMM starts
> from an address that is much higher than the end address of normal
> ram, it may result in a M2P table that maybe too large to fit in the
> normal ram. Therefore, we choose to not put GFNs of NVDIMM in M2P
> table.

Any page that _may_ be used by a guest as normal RAM page must have
its mach->phys translation entered in the M2P. That's because a r/o
variant of that table is part of the hypervisor ABI for PV guests.
Size considerations simply don't apply here - the table may be sparse
(guests are required to deal with accesses potentially faulting), and
the 256Gb of virtual address space set aside for it cover all memory
up to the 47-bit boundary (there's room for doubling this). Memory at
addresses with bit 47 (or higher) set would need a complete overhaul
of that mechanism, and whatever new mechanism we may pick would mean
old guests won't be able to benefit.

> Another possible solution is to extend page_info to include GFN for
> NVDIMM and use frame_table. A benefit of this solution is that other
> data (3)-(5) can be got from page_info as well. However, due to the
> same reason for machine_to_phys_mapping[] and the concern that the
> large number of page_info structures required for large NVDIMMs may
> consume lots of ram, page_info and frame_table seems not a good place
> either.

For this particular item struct page_info is the wrong place anyway,
due to what I've said above. Also extension suggestions of struct
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/08/16 02:27, Jan Beulich wrote:
> >>> On 08.03.16 at 10:15, wrote:
> > More thoughts on reserving NVDIMM space for per-page structures
> >
> > Currently, a per-page struct for managing mapping of NVDIMM pages may
> > include following fields:
> >
> > struct nvdimm_page
> > {
> >  uint64_t mfn;       /* MFN of SPA of this NVDIMM page */
> >  uint64_t gfn;       /* GFN where this NVDIMM page is mapped */
> >  domid_t domain_id;  /* which domain is this NVDIMM page mapped to */
> >  int is_broken;      /* Is this NVDIMM page broken? (for MCE) */
> > }
> >
> > Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
> > nvdimm_page structures would occupy 12 GB space, which is too hard to
> > fit in the normal ram on a small memory host. However, for smaller
> > NVDIMMs and/or hosts with large ram, those structures may still be able
> > to fit in the normal ram. In the latter circumstance, nvdimm_page
> > structures are stored in the normal ram, so they can be accessed more
> > quickly.
>
> Not sure how you came to the above structure - it's the first time
> I see it, yet figuring out what information it needs to hold is what
> this design process should be about. For example, I don't see why
> it would need to duplicate M2P / P2M information. Nor do I see why
> per-page data needs to hold the address of a page (struct
> page_info also doesn't). And whether storing a domain ID (rather
> than a pointer to struct domain, as in struct page_info) is the
> correct thing is also to be determined (rather than just stated).
>
> Otoh you make no provisions at all for any kind of ref counting.
> What if a guest wants to put page tables into NVDIMM space?
>
> Since all of your calculations are based upon that fixed assumption
> on the structure layout, I'm afraid they're not very meaningful
> without first settling on what data needs tracking in the first place.
>
> Jan

I should re-explain the choice of data structures and where to put
them.

For handling MCE for NVDIMM, we need to track following data:
(1) SPA ranges of host NVDIMMs (one range per pmem interleave set),
which are used to check whether a MCE is for NVDIMM.
(2) GFN to which a NVDIMM page is mapped, which is used to determine
the address put in vMCE.
(3) the domain to which a NVDIMM page is mapped, which is used to
determine whether a vMCE needs to be injected and where it will be
injected.
(4) a flag to mark whether a NVDIMM page is broken, which is used to
avoid mapping broken page to guests.

For granting NVDIMM pages (e.g. xen-blkback/netback),
(5) a reference counter is needed for each NVDIMM page

Above data can be organized as below:

* For (1) SPA ranges, we can record them in a global data structure,
e.g. a list

struct list_head nvdimm_iset_list;

struct nvdimm_iset
{
 uint64_t base;  /* starting SPA of this interleave set */
 uint64_t size;  /* size of this interleave set */
 struct nvdimm_page *pages;  /* information for individual pages in
                                this interleave set */
 struct list_head list;
};

* For (2) GFN, an intuitive place to get this information is from M2P
table machine_to_phys_mapping[]. However, the address of NVDIMM is
not required to be contiguous with normal ram, so, if NVDIMM starts
from an address that is much higher than the end address of normal
ram, it may result in a M2P table that maybe too large to fit in the
normal ram. Therefore, we choose to not put GFNs of NVDIMM in M2P
table.

Another possible solution is to extend page_info to include GFN for
NVDIMM and use frame_table. A benefit of this solution is that other
data (3)-(5) can be got from page_info as well. However, due to the
same reason for machine_to_phys_mapping[] and the concern that the
large number of page_info structures required for large NVDIMMs may
consume lots of ram, page_info and frame_table seems not a good place
either.

* At the end, we choose to introduce a new data structure for above
per-page data (2)-(5)

struct nvdimm_page
{
 struct domain *domain;     /* for (3) */
 uint64_t gfn;              /* for (2) */
 unsigned long count_info;  /* for (4) and (5), same as
                               page_info->count_info */
 /* other fields if needed, e.g. lock */
}

(MFN is not needed indeed)

On each NVDIMM interleave set, we could reserve an area to place an
array of nvdimm_page structures for pages in that interleave set. In
addition, the corresponding global nvdimm_iset structure is set to
point to this array via its 'pages' field.

* One disadvantage of above solution is that accessing NVDIMM is
slower than normal ram, so some usage scenarios that requires
frequently accesses to nvdimm_page structures may suffer poor
performance. Therefore, we may add a boot parameter to allow users to
choose normal ram for above nvdimm_page arrays if their
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 08.03.16 at 10:15, wrote:
> More thoughts on reserving NVDIMM space for per-page structures
>
> Currently, a per-page struct for managing mapping of NVDIMM pages may
> include following fields:
>
> struct nvdimm_page
> {
>  uint64_t mfn;       /* MFN of SPA of this NVDIMM page */
>  uint64_t gfn;       /* GFN where this NVDIMM page is mapped */
>  domid_t domain_id;  /* which domain is this NVDIMM page mapped to */
>  int is_broken;      /* Is this NVDIMM page broken? (for MCE) */
> }
>
> Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
> nvdimm_page structures would occupy 12 GB space, which is too hard to
> fit in the normal ram on a small memory host. However, for smaller
> NVDIMMs and/or hosts with large ram, those structures may still be able
> to fit in the normal ram. In the latter circumstance, nvdimm_page
> structures are stored in the normal ram, so they can be accessed more
> quickly.

Not sure how you came to the above structure - it's the first time
I see it, yet figuring out what information it needs to hold is what
this design process should be about. For example, I don't see why
it would need to duplicate M2P / P2M information. Nor do I see why
per-page data needs to hold the address of a page (struct
page_info also doesn't). And whether storing a domain ID (rather
than a pointer to struct domain, as in struct page_info) is the
correct thing is also to be determined (rather than just stated).

Otoh you make no provisions at all for any kind of ref counting.
What if a guest wants to put page tables into NVDIMM space?

Since all of your calculations are based upon that fixed assumption
on the structure layout, I'm afraid they're not very meaningful
without first settling on what data needs tracking in the first place.

Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/04/16 10:20, Haozhong Zhang wrote:
> On 03/02/16 06:03, Jan Beulich wrote:
> > >>> On 02.03.16 at 08:14, wrote:
> > > It means NVDIMM is very possibly mapped in page granularity, and
> > > hypervisor needs per-page data structures like page_info (rather than the
> > > range set style nvdimm_pages) to manage those mappings.
> > >
> > > Then we will face the problem that the potentially huge number of
> > > per-page data structures may not fit in the normal ram. Linux kernel
> > > developers came across the same problem, and their solution is to
> > > reserve an area of NVDIMM and put the page structures in the reserved
> > > area (https://lwn.net/Articles/672457/). I think we may take the similar
> > > solution:
> > > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
> > > (besides the one used by Linux kernel itself) and reports the address
> > > and size to Xen hypervisor.
> > >
> > > Reasons to choose Linux kernel to make the reservation include:
> > > (a) only Dom0 Linux kernel has the NVDIMM driver,
> > > (b) make it flexible for Dom0 Linux kernel to handle all
> > > reservations (for itself and Xen).
> > >
> > > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
> > > stores them in above reserved areas.
> > [...]
> > Furthermore - why would Dom0 waste space
> > creating per-page control structures for regions which are
> > meant to be handed to guests anyway?
>
> I found my description was not accurate after consulting with our driver
> developers. By default the linux kernel does not create page structures
> for NVDIMM, which the kernel calls "raw mode". We could enforce
> the Dom0 kernel to pin NVDIMM in "raw mode" so as to avoid waste.

More thoughts on reserving NVDIMM space for per-page structures

Currently, a per-page struct for managing mapping of NVDIMM pages may
include following fields:

struct nvdimm_page
{
 uint64_t mfn;       /* MFN of SPA of this NVDIMM page */
 uint64_t gfn;       /* GFN where this NVDIMM page is mapped */
 domid_t domain_id;  /* which domain is this NVDIMM page mapped to */
 int is_broken;      /* Is this NVDIMM page broken? (for MCE) */
}

Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
nvdimm_page structures would occupy 12 GB space, which is too hard to
fit in the normal ram on a small memory host. However, for smaller
NVDIMMs and/or hosts with large ram, those structures may still be able
to fit in the normal ram. In the latter circumstance, nvdimm_page
structures are stored in the normal ram, so they can be accessed more
quickly.

So we may add a boot parameter for Xen to allow users to configure
which place, the normal ram or nvdimm, is used to store those
structures. For the config of using normal ram, Xen could manage
nvdimm_page structures more quickly (and hence start a domain with
NVDIMM more quickly), but leaves less normal ram for VMs. For the
config of using nvdimm, Xen would take more time to manage nvdimm_page
structures, but leaves more normal ram for VMs.

Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/07/16 15:53, Konrad Rzeszutek Wilk wrote: > On Wed, Mar 02, 2016 at 03:14:52PM +0800, Haozhong Zhang wrote: > > On 03/01/16 13:49, Konrad Rzeszutek Wilk wrote: > > > On Tue, Mar 01, 2016 at 06:33:32PM +, Ian Jackson wrote: > > > > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM > > > > support for Xen"): > > > > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote: > > > > > > [someone:] > > > > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign, > > > > > > >(a) never map idx in them to GFNs occupied by vNVDIMM, and > > > > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM > > > > > > > > > > > > Would that mean that guest xen-blkback or xen-netback wouldn't > > > > > > be able to fetch data from the GFNs? As in, what if the HVM guest > > > > > > that has the NVDIMM also serves as a device domain - that is it > > > > > > has xen-blkback running to service other guests? > > > > > > > > > > I'm not familiar with xen-blkback and xen-netback, so following > > > > > statements maybe wrong. > > > > > > > > > > In my understanding, xen-blkback/-netback in a device domain maps the > > > > > pages from other domains into its own domain, and copies data between > > > > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM > > > > > driver in device domain. In which steps of this procedure that > > > > > xen-blkback/-netback needs to map into GFNs of vNVDIMM? > > > > > > > > I think I agree with what you are saying. I don't understand exactly > > > > what you are proposing above in XENMAPSPACE_gmfn but I don't see how > > > > anything about this would interfere with blkback. > > > > > > > > blkback when talking to an nvdimm will just go through the block layer > > > > front door, and do a copy, I presume. > > > > > > I believe you are right. The block layer, and then the fs would copy in. > > > > > > > > I don't see how netback comes into it at all. > > > > > > > > But maybe I am just confused or ignorant! 
Please do explain :-). > > > > > > s/back/frontend/ > > > > > > My fear was refcounting. > > > > > > Specifically where we do not do copying. For example, you could > > > be sending data from the NVDIMM GFNs (scp?) to some other location > > > (another host?). It would go over the xen-netback (in the dom0) > > > - which would then grant map it (dom0 would). > > > > > > > Thanks for the explanation! > > > > It means NVDIMM is very possibly mapped in page granularity, and > > hypervisor needs per-page data structures like page_info (rather than the > > range set style nvdimm_pages) to manage those mappings. > > I do not know. I figured you need some accounting in the hypervisor > as the pages can be grant mapped but I don't know the intricate details > of the P2M code to tell you for certain. > > [edit: Your later email seems to imply that you do not need all this > information? Just ranges?] I'm not quite sure which one you mean. But at least in this example, NVDIMM can be granted in units of pages, so I think Xen still needs a per-page data structure to track this mapping information; a range structure is not enough. > > > > Then we will face the problem that the potentially huge number of > > per-page data structures may not fit in the normal ram. Linux kernel > > developers came across the same problem, and their solution is to > > reserve an area of NVDIMM and put the page structures in the reserved > > area (https://lwn.net/Articles/672457/). I think we may take the similar > > solution: > > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage > > (besides the one used by Linux kernel itself) and reports the address > > and size to Xen hypervisor. > > > > Reasons to choose Linux kernel to make the reservation include: > > (a) only Dom0 Linux kernel has the NVDIMM driver, > > (b) make it flexible for Dom0 Linux kernel to handle all > > reservations (for itself and Xen). 
> > > > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and > > stores them in above reserved areas. > > > > (3) The reserved area is used as volatile, i.e. above two steps must be > >
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On Wed, Mar 02, 2016 at 03:14:52PM +0800, Haozhong Zhang wrote: > On 03/01/16 13:49, Konrad Rzeszutek Wilk wrote: > > On Tue, Mar 01, 2016 at 06:33:32PM +, Ian Jackson wrote: > > > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM > > > support for Xen"): > > > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote: > > > > > [someone:] > > > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign, > > > > > >(a) never map idx in them to GFNs occupied by vNVDIMM, and > > > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM > > > > > > > > > > Would that mean that guest xen-blkback or xen-netback wouldn't > > > > > be able to fetch data from the GFNs? As in, what if the HVM guest > > > > > that has the NVDIMM also serves as a device domain - that is it > > > > > has xen-blkback running to service other guests? > > > > > > > > I'm not familiar with xen-blkback and xen-netback, so following > > > > statements maybe wrong. > > > > > > > > In my understanding, xen-blkback/-netback in a device domain maps the > > > > pages from other domains into its own domain, and copies data between > > > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM > > > > driver in device domain. In which steps of this procedure that > > > > xen-blkback/-netback needs to map into GFNs of vNVDIMM? > > > > > > I think I agree with what you are saying. I don't understand exactly > > > what you are proposing above in XENMAPSPACE_gmfn but I don't see how > > > anything about this would interfere with blkback. > > > > > > blkback when talking to an nvdimm will just go through the block layer > > > front door, and do a copy, I presume. > > > > I believe you are right. The block layer, and then the fs would copy in. > > > > > > I don't see how netback comes into it at all. > > > > > > But maybe I am just confused or ignorant! Please do explain :-). > > > > s/back/frontend/ > > > > My fear was refcounting. 
> > > > Specifically where we do not do copying. For example, you could > > be sending data from the NVDIMM GFNs (scp?) to some other location > > (another host?). It would go over the xen-netback (in the dom0) > > - which would then grant map it (dom0 would). > > > > Thanks for the explanation! > > It means NVDIMM is very possibly mapped in page granularity, and > hypervisor needs per-page data structures like page_info (rather than the > range set style nvdimm_pages) to manage those mappings. I do not know. I figured you need some accounting in the hypervisor as the pages can be grant mapped but I don't know the intricate details of the P2M code to tell you for certain. [edit: Your later email seems to imply that you do not need all this information? Just ranges?] > > Then we will face the problem that the potentially huge number of > per-page data structures may not fit in the normal ram. Linux kernel > developers came across the same problem, and their solution is to > reserve an area of NVDIMM and put the page structures in the reserved > area (https://lwn.net/Articles/672457/). I think we may take the similar > solution: > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage > (besides the one used by Linux kernel itself) and reports the address > and size to Xen hypervisor. > > Reasons to choose Linux kernel to make the reservation include: > (a) only Dom0 Linux kernel has the NVDIMM driver, > (b) make it flexible for Dom0 Linux kernel to handle all > reservations (for itself and Xen). > > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and > stores them in above reserved areas. > > (3) The reserved area is used as volatile, i.e. above two steps must be > done for every host boot. > > > In effect Xen there are two guests (dom0 and domU) pointing in the > > P2M to the same GPFN. 
And that would mean: > > > > > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM > > > > Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done > > so perhaps this is not an issue? > > > > The other situation I was envisioning - where the driver domain has > > the NVDIMM passed in, and as well SR-IOV network card and functions > > as an iSCSI target. That should work OK as we just need the IOMMU > > to have the NVDIMM GPFNs program
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/16/16 05:55, Jan Beulich wrote: > >>> On 16.02.16 at 12:14,wrote: > > On Mon, 15 Feb 2016, Zhang, Haozhong wrote: > >> On 02/04/16 20:24, Stefano Stabellini wrote: > >> > On Thu, 4 Feb 2016, Haozhong Zhang wrote: > >> > > On 02/03/16 15:22, Stefano Stabellini wrote: > >> > > > On Wed, 3 Feb 2016, George Dunlap wrote: > >> > > > > On 03/02/16 12:02, Stefano Stabellini wrote: > >> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote: > >> > > > > >> Or, we can make a file system on /dev/pmem0, create files on > >> > > > > >> it, set > >> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then > >> > > > > >> pass > >> > > > > >> those files to QEMU. In this way, non-root QEMU should be able > >> > > > > >> to > >> > > > > >> mmap those files. > >> > > > > > > >> > > > > > Maybe that would work. Worth adding it to the design, I would > >> > > > > > like to > >> > > > > > read more details on it. > >> > > > > > > >> > > > > > Also note that QEMU initially runs as root but drops privileges > >> > > > > > to > >> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially > >> > > > > > QEMU > >> > > > > > *could* mmap /dev/pmem0 while is still running as root, but then > >> > > > > > it > >> > > > > > wouldn't work for any devices that need to be mmap'ed at run time > >> > > > > > (hotplug scenario). > >> > > > > > >> > > > > This is basically the same problem we have for a bunch of other > >> > > > > things, > >> > > > > right? Having xl open a file and then pass it via qmp to qemu > >> > > > > should > >> > > > > work in theory, right? > >> > > > > >> > > > Is there one /dev/pmem? per assignable region? > >> > > > >> > > Yes. > >> > > > >> > > BTW, I'm wondering whether and how non-root qemu works with xl disk > >> > > configuration that is going to access a host block device, e.g. > >> > > disk = [ '/dev/sdb,,hda' ] > >> > > If that works with non-root qemu, I may take the similar solution for > >> > > pmem. 
> >> > > >> > Today the user is required to give the correct ownership and access mode > >> > to the block device, so that non-root QEMU can open it. However in the > >> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence > >> > the feature doesn't work at all with non-root QEMU > >> > (http://marc.info/?l=xen-devel=145261763600528). > >> > > >> > If there is one /dev/pmem device per assignable region, then it would be > >> > conceivable to change its ownership so that non-root QEMU can open it. > >> > Or, better, the file descriptor could be passed by the toolstack via > >> > qmp. > >> > >> Passing file descriptor via qmp is not enough. > >> > >> Let me clarify where the requirement for root/privileged permissions > >> comes from. The primary workflow in my design that maps a host pmem > >> region or files in host pmem region to guest is shown as below: > >> (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on > >> /dev/pmem0) to its virtual address space, i.e. the guest virtual > >> address space. > >> (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA > >> occupied by the host pmem to a DomU. This step requires the > >> translation from the guest virtual address (where the host pmem is > >> mmaped in (1)) to the host physical address. The translation can be > >> done by either > >> (a) QEMU that parses its own /proc/self/pagemap, > >> or > >> (b) Xen hypervisor that does the translation by itself [1] (though > >> this choice is not quite doable from Konrad's comments [2]). > >> > >> [1] > >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html > >> [2] > >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html > >> > >> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability > >> since linux kernel 4.0. 
Furthermore, if we don't mlock the mapped host > >> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap), > >> pagemap will not contain all mappings. However, mlock may require > >> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because > >> mlock operates on memory, the permission to open(2) the host pmem files > >> does not solve the problem and therefore passing file descriptor via qmp > >> does not help. > >> > >> For 2-b, from Konrad's comments [2], mlock is also required and > >> privileged permission may be required consequently. > >> > >> Note that the mapping and the address translation are done before QEMU > >> dropping privileged permissions, so non-root QEMU should be able to work > >> with above design until we start considering vNVDIMM hotplug (which has > >> not been supported by the current vNVDIMM implementation in QEMU). In > >> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it > >> running with root permissions. > > > > Are we all good with the fact that vNVDIMM
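For reference, the /proc/self/pagemap lookup described in step 2-a above can be sketched as below. The entry layout (64-bit entries, PFN in bits 0-54, "present" in bit 63, PFN zeroed for unprivileged readers since kernel 4.0) is the documented pagemap ABI; the helper names are illustrative only, not part of any existing code:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGEMAP_PFN_MASK  ((UINT64_C(1) << 55) - 1)  /* bits 0-54: PFN */
#define PAGEMAP_PRESENT   (UINT64_C(1) << 63)        /* bit 63: page present */

/* Byte offset into /proc/self/pagemap holding the entry for 'va'. */
static uint64_t pagemap_offset(uintptr_t va, size_t page_size)
{
    return (uint64_t)(va / page_size) * sizeof(uint64_t);
}

/*
 * Translate a virtual address to a physical address, given the raw
 * 64-bit pagemap entry covering its page.  Returns 0 if the page is
 * not present, or if the PFN was zeroed because the reader lacked
 * CAP_SYS_ADMIN (kernel >= 4.0).
 */
static uint64_t pagemap_entry_to_pa(uint64_t entry, uintptr_t va,
                                    size_t page_size)
{
    if (!(entry & PAGEMAP_PRESENT))
        return 0;

    uint64_t pfn = entry & PAGEMAP_PFN_MASK;
    if (pfn == 0)   /* hidden from unprivileged readers */
        return 0;

    return pfn * page_size + va % page_size;
}
```

A real caller would pread() 8 bytes at pagemap_offset(va, page_size) from /proc/self/pagemap, which is exactly where the CAP_SYS_ADMIN and mlock caveats discussed above come in.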
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/02/16 06:03, Jan Beulich wrote: > >>> On 02.03.16 at 08:14, wrote: > > It means NVDIMM is very possibly mapped in page granularity, and > > hypervisor needs per-page data structures like page_info (rather than the > > range set style nvdimm_pages) to manage those mappings. > > > > Then we will face the problem that the potentially huge number of > > per-page data structures may not fit in the normal ram. Linux kernel > > developers came across the same problem, and their solution is to > > reserve an area of NVDIMM and put the page structures in the reserved > > area (https://lwn.net/Articles/672457/). I think we may take the similar > > solution: > > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage > > (besides the one used by Linux kernel itself) and reports the address > > and size to Xen hypervisor. > > > > Reasons to choose Linux kernel to make the reservation include: > > (a) only Dom0 Linux kernel has the NVDIMM driver, > > (b) make it flexible for Dom0 Linux kernel to handle all > > reservations (for itself and Xen). > > > > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and > > stores them in above reserved areas. > > Another argument against this being primarily Dom0-managed, > I would say. Yes, Xen should, at least, manage all address mappings for NVDIMM. Dom0 Linux and QEMU then provide a user-friendly interface to configure NVDIMM and vNVDIMM, e.g. providing files (instead of addresses) as the abstraction of SPA ranges of NVDIMM. > Furthermore - why would Dom0 waste space > creating per-page control structures for regions which are > meant to be handed to guests anyway? > My earlier description turned out to be inaccurate after consulting with our driver developers. By default the Linux kernel does not create page structures for NVDIMM; the kernel calls this "raw mode". We could force the Dom0 kernel to keep NVDIMM in "raw mode" so as to avoid the waste. 
Haozhong ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
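To quantify the waste Jan refers to, a back-of-the-envelope calculation helps; the figures below (4 KiB pages, a 32-byte per-page structure) are illustrative assumptions, not the actual sizes in any particular Xen build:

```c
#include <stdint.h>

/*
 * Bytes of per-page control-structure storage needed to cover
 * 'nvdimm_bytes' of pmem, for a given page size and per-page
 * structure size.  E.g. 1 TiB of NVDIMM at 4 KiB pages with a
 * 32-byte structure needs 8 GiB of storage -- far too much to
 * carve out of normal RAM, which is why both Linux and the
 * proposal above put these structures on the NVDIMM itself.
 */
static uint64_t page_info_overhead(uint64_t nvdimm_bytes,
                                   uint64_t page_size,
                                   uint64_t info_size)
{
    return nvdimm_bytes / page_size * info_size;
}
```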
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 02.03.16 at 08:14, wrote: > It means NVDIMM is very possibly mapped in page granularity, and > hypervisor needs per-page data structures like page_info (rather than the > range set style nvdimm_pages) to manage those mappings. > > Then we will face the problem that the potentially huge number of > per-page data structures may not fit in the normal ram. Linux kernel > developers came across the same problem, and their solution is to > reserve an area of NVDIMM and put the page structures in the reserved > area (https://lwn.net/Articles/672457/). I think we may take the similar > solution: > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage > (besides the one used by Linux kernel itself) and reports the address > and size to Xen hypervisor. > > Reasons to choose Linux kernel to make the reservation include: > (a) only Dom0 Linux kernel has the NVDIMM driver, > (b) make it flexible for Dom0 Linux kernel to handle all > reservations (for itself and Xen). > > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and > stores them in above reserved areas. Another argument against this being primarily Dom0-managed, I would say. Furthermore - why would Dom0 waste space creating per-page control structures for regions which are meant to be handed to guests anyway? Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/01/16 13:49, Konrad Rzeszutek Wilk wrote: > On Tue, Mar 01, 2016 at 06:33:32PM +, Ian Jackson wrote: > > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM > > support for Xen"): > > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote: > > > > [someone:] > > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign, > > > > >(a) never map idx in them to GFNs occupied by vNVDIMM, and > > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM > > > > > > > > Would that mean that guest xen-blkback or xen-netback wouldn't > > > > be able to fetch data from the GFNs? As in, what if the HVM guest > > > > that has the NVDIMM also serves as a device domain - that is it > > > > has xen-blkback running to service other guests? > > > > > > I'm not familiar with xen-blkback and xen-netback, so following > > > statements maybe wrong. > > > > > > In my understanding, xen-blkback/-netback in a device domain maps the > > > pages from other domains into its own domain, and copies data between > > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM > > > driver in device domain. In which steps of this procedure that > > > xen-blkback/-netback needs to map into GFNs of vNVDIMM? > > > > I think I agree with what you are saying. I don't understand exactly > > what you are proposing above in XENMAPSPACE_gmfn but I don't see how > > anything about this would interfere with blkback. > > > > blkback when talking to an nvdimm will just go through the block layer > > front door, and do a copy, I presume. > > I believe you are right. The block layer, and then the fs would copy in. > > > > I don't see how netback comes into it at all. > > > > But maybe I am just confused or ignorant! Please do explain :-). > > s/back/frontend/ > > My fear was refcounting. > > Specifically where we do not do copying. For example, you could > be sending data from the NVDIMM GFNs (scp?) to some other location > (another host?). 
It would go over the xen-netback (in the dom0) > - which would then grant map it (dom0 would). > Thanks for the explanation! It means NVDIMM is very possibly mapped at page granularity, and the hypervisor needs per-page data structures like page_info (rather than the range set style nvdimm_pages) to manage those mappings. Then we will face the problem that the potentially huge number of per-page data structures may not fit in normal RAM. Linux kernel developers came across the same problem, and their solution is to reserve an area of NVDIMM and put the page structures in the reserved area (https://lwn.net/Articles/672457/). I think we may adopt a similar solution: (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage (besides the one used by Linux kernel itself) and reports the address and size to the Xen hypervisor. Reasons to choose Linux kernel to make the reservation include: (a) only Dom0 Linux kernel has the NVDIMM driver, (b) make it flexible for Dom0 Linux kernel to handle all reservations (for itself and Xen). (2) Then Xen hypervisor builds the page structures for NVDIMM pages and stores them in the above reserved areas. (3) The reserved area is used as volatile, i.e. the above two steps must be done for every host boot. > In effect Xen there are two guests (dom0 and domU) pointing in the > P2M to the same GPFN. And that would mean: > > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done so perhaps this is not an issue? The other situation I was envisioning - where the driver domain has the NVDIMM passed in, and as well SR-IOV network card and functions as an iSCSI target. That should work OK as we just need the IOMMU to have the NVDIMM GPFNs programmed in. > For this IOMMU usage example and the above granted pages example, there remains one question: who is responsible for performing the NVDIMM flush (clwb/clflushopt/pcommit)? 
For the granted page example, if an NVDIMM page is granted to xen-netback, does the hypervisor need to tell xen-netback it's an NVDIMM page so that xen-netback can perform a proper flush when it writes to that page? Or should we keep the NVDIMM transparent to xen-netback, and let Xen perform the flush when xen-netback gives up the granted NVDIMM page? For the IOMMU example, my understanding is that there is a piece of software in the driver domain that handles SCSI commands received from the network card and drives the network card to read/write certain areas of NVDIMM. That software should then be aware of the existence of NVDIMM and perform the flush properly. Is that right? Haozhong
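Whichever component ends up responsible for the flush, the operation itself is a cache-line writeback over the dirty range followed by a fence. Below is a minimal sketch using the universally available SSE2 clflush intrinsic; a real pmem driver would prefer clwb or clflushopt (which avoid invalidating the line) on CPUs that have them, and the function name is illustrative:

```c
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2) */
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64

/*
 * Write back every cache line covering [addr, addr + len) and fence.
 * clflush also invalidates the flushed line; clwb (on newer CPUs)
 * writes back without invalidating, which is what NVDIMM drivers
 * normally use for better performance.
 */
static void pmem_flush(const void *addr, size_t len)
{
    uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHE_LINE)
        _mm_clflush((const void *)p);

    _mm_mfence();   /* order the flushes before subsequent stores */
}
```

This is exactly the work that would have to happen either inside xen-netback (if it is told the page is NVDIMM) or in Xen when the grant is released, per the two alternatives above.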
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On Tue, Mar 01, 2016 at 06:33:32PM +, Ian Jackson wrote: > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support > for Xen"): > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote: > > > [someone:] > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign, > > > >(a) never map idx in them to GFNs occupied by vNVDIMM, and > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM > > > > > > Would that mean that guest xen-blkback or xen-netback wouldn't > > > be able to fetch data from the GFNs? As in, what if the HVM guest > > > that has the NVDIMM also serves as a device domain - that is it > > > has xen-blkback running to service other guests? > > > > I'm not familiar with xen-blkback and xen-netback, so following > > statements maybe wrong. > > > > In my understanding, xen-blkback/-netback in a device domain maps the > > pages from other domains into its own domain, and copies data between > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM > > driver in device domain. In which steps of this procedure that > > xen-blkback/-netback needs to map into GFNs of vNVDIMM? > > I think I agree with what you are saying. I don't understand exactly > what you are proposing above in XENMAPSPACE_gmfn but I don't see how > anything about this would interfere with blkback. > > blkback when talking to an nvdimm will just go through the block layer > front door, and do a copy, I presume. I believe you are right. The block layer, and then the fs would copy in. > > I don't see how netback comes into it at all. > > But maybe I am just confused or ignorant! Please do explain :-). s/back/frontend/ My fear was refcounting. Specifically where we do not do copying. For example, you could be sending data from the NVDIMM GFNs (scp?) to some other location (another host?). It would go over the xen-netback (in the dom0) - which would then grant map it (dom0 would). 
In effect, in Xen there are two guests (dom0 and domU) pointing in the P2M to the same GPFN. And that would mean: > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done so perhaps this is not an issue? The other situation I was envisioning - where the driver domain has the NVDIMM passed in, as well as an SR-IOV network card, and functions as an iSCSI target. That should work OK as we just need the IOMMU to have the NVDIMM GPFNs programmed in. > > Thanks, > Ian.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen"): > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote: > > [someone:] > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign, > > >(a) never map idx in them to GFNs occupied by vNVDIMM, and > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM > > > > Would that mean that guest xen-blkback or xen-netback wouldn't > > be able to fetch data from the GFNs? As in, what if the HVM guest > > that has the NVDIMM also serves as a device domain - that is it > > has xen-blkback running to service other guests? > > I'm not familiar with xen-blkback and xen-netback, so following > statements maybe wrong. > > In my understanding, xen-blkback/-netback in a device domain maps the > pages from other domains into its own domain, and copies data between > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM > driver in device domain. In which steps of this procedure that > xen-blkback/-netback needs to map into GFNs of vNVDIMM? I think I agree with what you are saying. I don't understand exactly what you are proposing above in XENMAPSPACE_gmfn but I don't see how anything about this would interfere with blkback. blkback when talking to an nvdimm will just go through the block layer front door, and do a copy, I presume. I don't see how netback comes into it at all. But maybe I am just confused or ignorant! Please do explain :-). Thanks, Ian.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 01.03.16 at 14:51, wrote: > Haozhong Zhang writes ("Re: [RFC Design Doc] Add vNVDIMM support for Xen"): >> On 02/29/16 05:04, Jan Beulich wrote: >> > Which will involve adding how much new code to it? >> >> Because hvmloader only accepts AML device rather than arbitrary objects, >> only code that builds the outmost part of AML device is needed. In ACPI >> spec, an AML device is defined as >> DefDevice := DeviceOp PkgLength NameString ObjectList >> hvmloader only needs to build the first 3 terms, while the last one is >> passed from qemu. > > Jan, is this a satisfactory answer ? Well, sort of yes, but subject to me seeing the actual code this converts to. Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
Haozhong Zhang writes ("Re: [RFC Design Doc] Add vNVDIMM support for Xen"): > On 02/29/16 05:04, Jan Beulich wrote: > > Which will involve adding how much new code to it? > > Because hvmloader only accepts AML device rather than arbitrary objects, > only code that builds the outmost part of AML device is needed. In ACPI > spec, an AML device is defined as > DefDevice := DeviceOp PkgLength NameString ObjectList > hvmloader only needs to build the first 3 terms, while the last one is > passed from qemu. Jan, is this a satisfactory answer ? Ian.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote: > > > > QEMU would always use MFN above guest normal ram and I/O holes for > > > > vNVDIMM. It would attempt to search in that space for a contiguous range > > > > that is large enough for that that vNVDIMM devices. Is guest able to > > > > punch holes in such GFN space? > > > > > > See XENMAPSPACE_* and their uses. > > > > > > > I think we can add following restrictions to avoid uses of XENMAPSPACE_* > > punching holes in GFNs of vNVDIMM: > > > > (1) For XENMAPSPACE_shared_info and _grant_table, never map idx in them > > to GFNs occupied by vNVDIMM. > > OK, that sounds correct. > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign, > >(a) never map idx in them to GFNs occupied by vNVDIMM, and > >(b) never map idx corresponding to GFNs occupied by vNVDIMM > > Would that mean that guest xen-blkback or xen-netback wouldn't > be able to fetch data from the GFNs? As in, what if the HVM guest > that has the NVDIMM also serves as a device domain - that is it > has xen-blkback running to service other guests? > I'm not familiar with xen-blkback and xen-netback, so the following statements may be wrong. In my understanding, xen-blkback/-netback in a device domain maps the pages from other domains into its own domain, and copies data between those pages and vNVDIMM. The access to vNVDIMM is performed by the NVDIMM driver in the device domain. In which step of this procedure does xen-blkback/-netback need to map GFNs of vNVDIMM? Thanks, Haozhong
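The restrictions in (1) and (2) above amount to an interval check against the vNVDIMM GFN range before servicing a XENMAPSPACE_* request. A minimal sketch of that check; all names are illustrative, not actual Xen internals (Xen uses a typesafe gfn_t, and a guest could have several vNVDIMM ranges):

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t gfn_raw_t;   /* illustrative stand-in for Xen's gfn_t */

struct gfn_range {
    gfn_raw_t start;   /* first GFN of the vNVDIMM region */
    uint64_t  nr;      /* number of frames in the region */
};

/*
 * True if any frame of [gfn, gfn + nr) lies inside the vNVDIMM
 * range, i.e. the XENMAPSPACE_* request would punch a hole in (or
 * remap over) vNVDIMM and must be refused.
 */
static bool overlaps_vnvdimm(const struct gfn_range *v,
                             gfn_raw_t gfn, uint64_t nr)
{
    return gfn < v->start + v->nr && v->start < gfn + nr;
}
```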
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/29/16 05:04, Jan Beulich wrote: > >>> On 29.02.16 at 12:52,wrote: > > On 02/29/16 03:12, Jan Beulich wrote: > >> >>> On 29.02.16 at 10:45, wrote: > >> > On 02/29/16 02:01, Jan Beulich wrote: > >> >> >>> On 28.02.16 at 15:48, wrote: > >> >> > Anyway, we may avoid some conflicts between ACPI tables/objects by > >> >> > restricting which tables and objects can be passed from QEMU to Xen: > >> >> > (1) For ACPI tables, xen does not accept those built by itself, > >> >> > e.g. DSDT and SSDT. > >> >> > (2) xen does not accept ACPI tables for devices that are not attached > >> >> > to > >> >> > a domain, e.g. if NFIT cannot be passed if a domain does not have > >> >> > vNVDIMM. > >> >> > (3) For ACPI objects, xen only accepts namespace devices and requires > >> >> > their names does not conflict with existing ones provided by Xen. > >> >> > >> >> And how do you imagine to enforce this without parsing the > >> >> handed AML? (Remember there's no AML parser in hvmloader.) > >> > > >> > As I proposed in last reply, instead of passing an entire ACPI object, > >> > QEMU passes the device name and the AML code under the AML device entry > >> > separately. Because the name is explicitly given, no AML parser is > >> > needed in hvmloader. > >> > >> I must not only have missed that proposal, but I also don't see > >> how you mean this to work: Are you suggesting for hvmloader to > >> construct valid AML from the passed in blob? Or are you instead > >> considering to pass redundant information (name once given > >> explicitly and once embedded in the AML blob), allowing the two > >> to be out of sync? > > > > I mean the former one. > > Which will involve adding how much new code to it? > Because hvmloader only accepts AML device rather than arbitrary objects, only code that builds the outmost part of AML device is needed. 
In the ACPI spec, an AML device is defined as DefDevice := DeviceOp PkgLength NameString ObjectList. hvmloader only needs to build the first 3 terms, while the last one is passed from QEMU. Haozhong
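The "first 3 terms" hvmloader would emit are just a fixed two-byte opcode, an encoded PkgLength, and a 4-character NameSeg. A sketch for the small-package case (per the ACPI spec, a PkgLength below 64 is encoded in a single byte that counts itself; larger packages need the multi-byte encoding, which this illustrative helper deliberately omits):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/*
 * Emit DeviceOp PkgLength NameString into 'out', followed by the
 * caller-supplied ObjectList blob (the part that would come from
 * QEMU).  Returns the number of bytes written, or 0 if the package
 * is too large for the single-byte PkgLength encoding handled here.
 */
static size_t build_aml_device(uint8_t *out, const char name[4],
                               const uint8_t *body, size_t body_len)
{
    /* PkgLength covers itself + the 4-byte NameSeg + the body. */
    size_t pkg_len = 1 + 4 + body_len;

    if (pkg_len >= 0x40)       /* one-byte form requires bits 7:6 clear */
        return 0;

    out[0] = 0x5B;             /* ExtOpPrefix */
    out[1] = 0x82;             /* DeviceOp */
    out[2] = (uint8_t)pkg_len; /* PkgLength, single-byte encoding */
    memcpy(&out[3], name, 4);  /* NameSeg, e.g. "NVDR" */
    memcpy(&out[7], body, body_len); /* ObjectList passed from QEMU */
    return 7 + body_len;
}
```

For an empty body and the (hypothetical) name "NVDR" this yields the 7 bytes 5B 82 05 4E 56 44 52, which is what a disassembler would show as `Device (NVDR) {}`.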
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 29.02.16 at 12:52, wrote: > On 02/29/16 03:12, Jan Beulich wrote: >> >>> On 29.02.16 at 10:45, wrote: >> > On 02/29/16 02:01, Jan Beulich wrote: >> >> >>> On 28.02.16 at 15:48, wrote: >> >> > Anyway, we may avoid some conflicts between ACPI tables/objects by >> >> > restricting which tables and objects can be passed from QEMU to Xen: >> >> > (1) For ACPI tables, xen does not accept those built by itself, >> >> > e.g. DSDT and SSDT. >> >> > (2) xen does not accept ACPI tables for devices that are not attached to >> >> > a domain, e.g. if NFIT cannot be passed if a domain does not have >> >> > vNVDIMM. >> >> > (3) For ACPI objects, xen only accepts namespace devices and requires >> >> > their names does not conflict with existing ones provided by Xen. >> >> >> >> And how do you imagine to enforce this without parsing the >> >> handed AML? (Remember there's no AML parser in hvmloader.) >> > >> > As I proposed in last reply, instead of passing an entire ACPI object, >> > QEMU passes the device name and the AML code under the AML device entry >> > separately. Because the name is explicitly given, no AML parser is >> > needed in hvmloader. >> >> I must not only have missed that proposal, but I also don't see >> how you mean this to work: Are you suggesting for hvmloader to >> construct valid AML from the passed in blob? Or are you instead >> considering to pass redundant information (name once given >> explicitly and once embedded in the AML blob), allowing the two >> to be out of sync? > > I mean the former one. Which will involve adding how much new code to it? Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/29/16 03:12, Jan Beulich wrote: > >>> On 29.02.16 at 10:45, wrote: > > On 02/29/16 02:01, Jan Beulich wrote: > >> >>> On 28.02.16 at 15:48, wrote: > >> > Anyway, we may avoid some conflicts between ACPI tables/objects by > >> > restricting which tables and objects can be passed from QEMU to Xen: > >> > (1) For ACPI tables, xen does not accept those built by itself, > >> > e.g. DSDT and SSDT. > >> > (2) xen does not accept ACPI tables for devices that are not attached to > >> > a domain, e.g. if NFIT cannot be passed if a domain does not have > >> > vNVDIMM. > >> > (3) For ACPI objects, xen only accepts namespace devices and requires > >> > their names does not conflict with existing ones provided by Xen. > >> > >> And how do you imagine to enforce this without parsing the > >> handed AML? (Remember there's no AML parser in hvmloader.) > > > > As I proposed in last reply, instead of passing an entire ACPI object, > > QEMU passes the device name and the AML code under the AML device entry > > separately. Because the name is explicitly given, no AML parser is > > needed in hvmloader. > > I must not only have missed that proposal, but I also don't see > how you mean this to work: Are you suggesting for hvmloader to > construct valid AML from the passed in blob? Or are you instead > considering to pass redundant information (name once given > explicitly and once embedded in the AML blob), allowing the two > to be out of sync? > I mean the former one. Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 29.02.16 at 10:45,wrote: > On 02/29/16 02:01, Jan Beulich wrote: >> >>> On 28.02.16 at 15:48, wrote: >> > Anyway, we may avoid some conflicts between ACPI tables/objects by >> > restricting which tables and objects can be passed from QEMU to Xen: >> > (1) For ACPI tables, xen does not accept those built by itself, >> > e.g. DSDT and SSDT. >> > (2) xen does not accept ACPI tables for devices that are not attached to >> > a domain, e.g. if NFIT cannot be passed if a domain does not have >> > vNVDIMM. >> > (3) For ACPI objects, xen only accepts namespace devices and requires >> > their names does not conflict with existing ones provided by Xen. >> >> And how do you imagine to enforce this without parsing the >> handed AML? (Remember there's no AML parser in hvmloader.) > > As I proposed in last reply, instead of passing an entire ACPI object, > QEMU passes the device name and the AML code under the AML device entry > separately. Because the name is explicitly given, no AML parser is > needed in hvmloader. I must not only have missed that proposal, but I also don't see how you mean this to work: Are you suggesting for hvmloader to construct valid AML from the passed in blob? Or are you instead considering to pass redundant information (name once given explicitly and once embedded in the AML blob), allowing the two to be out of sync? Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/29/16 02:01, Jan Beulich wrote: > >>> On 28.02.16 at 15:48,wrote: > > On 02/24/16 09:54, Jan Beulich wrote: > >> >>> On 24.02.16 at 16:48, wrote: > >> > On 02/24/16 07:24, Jan Beulich wrote: > >> >> >>> On 24.02.16 at 14:28, wrote: > >> >> > On 02/18/16 10:17, Jan Beulich wrote: > >> >> >> >>> On 01.02.16 at 06:44, wrote: > >> >> >> > 3.3 Guest ACPI Emulation > >> >> >> > > >> >> >> > 3.3.1 My Design > >> >> >> > > >> >> >> > Guest ACPI emulation is composed of two parts: building guest NFIT > >> >> >> > and SSDT that defines ACPI namespace devices for NVDIMM, and > >> >> >> > emulating guest _DSM. > >> >> >> > > >> >> >> > (1) Building Guest ACPI Tables > >> >> >> > > >> >> >> > This design reuses and extends hvmloader's existing mechanism > >> >> >> > that > >> >> >> > loads passthrough ACPI tables from binary files to load NFIT and > >> >> >> > SSDT tables built by QEMU: > >> >> >> > 1) Because the current QEMU does not building any ACPI tables > >> >> >> > when > >> >> >> > it runs as the Xen device model, this design needs to patch > >> >> >> > QEMU > >> >> >> > to build NFIT and SSDT (so far only NFIT and SSDT) in this > >> >> >> > case. > >> >> >> > > >> >> >> > 2) QEMU copies NFIT and SSDT to the end of guest memory below > >> >> >> > 4G. The guest address and size of those tables are written > >> >> >> > into > >> >> >> > xenstore > >> >> >> > (/local/domain/domid/hvmloader/dm-acpi/{address,length}). > >> >> >> > > >> >> >> > 3) hvmloader is patched to probe and load device model > >> >> >> > passthrough > >> >> >> > ACPI tables from above xenstore keys. The detected ACPI tables > >> >> >> > are then appended to the end of existing guest ACPI tables > >> >> >> > just > >> >> >> > like what current construct_passthrough_tables() does. > >> >> >> > > >> >> >> > Reasons for this design are listed below: > >> >> >> > - NFIT and SSDT in question are quite self-contained, i.e. 
they > >> >> >> > do > >> >> >> > not refer to other ACPI tables and not conflict with existing > >> >> >> > guest ACPI tables in Xen. Therefore, it is safe to copy them > >> >> >> > from > >> >> >> > QEMU and append to existing guest ACPI tables. > >> >> >> > >> >> >> How is this not conflicting being guaranteed? In particular I don't > >> >> >> see how tables containing AML code and coming from different > >> >> >> sources won't possibly cause ACPI name space collisions. > >> >> >> > >> >> > > >> >> > Really there is no effective mechanism to avoid ACPI name space > >> >> > collisions (and other kinds of conflicts) between ACPI tables loaded > >> >> > from QEMU and ACPI tables built by hvmloader. Because which ACPI > >> >> > tables > >> >> > are loaded is determined by developers, IMO it's developers' > >> >> > responsibility to avoid any collisions and conflicts with existing > >> >> > ACPI > >> >> > tables. > >> >> > >> >> Right, but this needs to be spelled out and settled on at design > >> >> time (i.e. now), rather leaving things unspecified, awaiting the > >> >> first clash. > >> > > >> > So that means if no collision-proof mechanism is introduced, Xen should > >> > not > >> > trust any passed-in ACPI tables and should build them by itself? > >> > >> Basically yes, albeit collision-proof may be too much to demand. > >> Simply separating name spaces (for hvmloader and qemu to have > >> their own sub-spaces) would be sufficient imo. We should trust > >> ourselves to play by such a specification. > >> > > > > I don't quite understand 'separating name spaces'. Do you mean, for > > example, if both hvmloader and qemu want to put a namespace device under > > \_SB, they could be put in different sub-scopes under \_SB? But it does > > not work for Linux at least. > > Aiui just the leaf names matter for sufficient separation, i.e. > recurring sub-scopes should not be a problem. 
> > > Anyway, we may avoid some conflicts between ACPI tables/objects by > > restricting which tables and objects can be passed from QEMU to Xen: > > (1) For ACPI tables, xen does not accept those built by itself, > > e.g. DSDT and SSDT. > > (2) xen does not accept ACPI tables for devices that are not attached to > > a domain, e.g. if NFIT cannot be passed if a domain does not have > > vNVDIMM. > > (3) For ACPI objects, xen only accepts namespace devices and requires > > their names does not conflict with existing ones provided by Xen. > > And how do you imagine to enforce this without parsing the > handed AML? (Remember there's no AML parser in hvmloader.) > As I proposed in last reply, instead of passing an entire ACPI object, QEMU passes the device name and the AML code under the AML device entry separately. Because the name is explicitly given, no AML parser is needed in hvmloader. Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 28.02.16 at 15:48,wrote: > On 02/24/16 09:54, Jan Beulich wrote: >> >>> On 24.02.16 at 16:48, wrote: >> > On 02/24/16 07:24, Jan Beulich wrote: >> >> >>> On 24.02.16 at 14:28, wrote: >> >> > On 02/18/16 10:17, Jan Beulich wrote: >> >> >> >>> On 01.02.16 at 06:44, wrote: >> >> >> > 3.3 Guest ACPI Emulation >> >> >> > >> >> >> > 3.3.1 My Design >> >> >> > >> >> >> > Guest ACPI emulation is composed of two parts: building guest NFIT >> >> >> > and SSDT that defines ACPI namespace devices for NVDIMM, and >> >> >> > emulating guest _DSM. >> >> >> > >> >> >> > (1) Building Guest ACPI Tables >> >> >> > >> >> >> > This design reuses and extends hvmloader's existing mechanism that >> >> >> > loads passthrough ACPI tables from binary files to load NFIT and >> >> >> > SSDT tables built by QEMU: >> >> >> > 1) Because the current QEMU does not building any ACPI tables when >> >> >> > it runs as the Xen device model, this design needs to patch QEMU >> >> >> > to build NFIT and SSDT (so far only NFIT and SSDT) in this case. >> >> >> > >> >> >> > 2) QEMU copies NFIT and SSDT to the end of guest memory below >> >> >> > 4G. The guest address and size of those tables are written into >> >> >> > xenstore >> >> >> > (/local/domain/domid/hvmloader/dm-acpi/{address,length}). >> >> >> > >> >> >> > 3) hvmloader is patched to probe and load device model passthrough >> >> >> > ACPI tables from above xenstore keys. The detected ACPI tables >> >> >> > are then appended to the end of existing guest ACPI tables just >> >> >> > like what current construct_passthrough_tables() does. >> >> >> > >> >> >> > Reasons for this design are listed below: >> >> >> > - NFIT and SSDT in question are quite self-contained, i.e. they do >> >> >> > not refer to other ACPI tables and not conflict with existing >> >> >> > guest ACPI tables in Xen. Therefore, it is safe to copy them from >> >> >> > QEMU and append to existing guest ACPI tables. >> >> >> >> >> >> How is this not conflicting being guaranteed? 
In particular I don't >> >> >> see how tables containing AML code and coming from different >> >> >> sources won't possibly cause ACPI name space collisions. >> >> >> >> >> > >> >> > Really there is no effective mechanism to avoid ACPI name space >> >> > collisions (and other kinds of conflicts) between ACPI tables loaded >> >> > from QEMU and ACPI tables built by hvmloader. Because which ACPI tables >> >> > are loaded is determined by developers, IMO it's developers' >> >> > responsibility to avoid any collisions and conflicts with existing ACPI >> >> > tables. >> >> >> >> Right, but this needs to be spelled out and settled on at design >> >> time (i.e. now), rather leaving things unspecified, awaiting the >> >> first clash. >> > >> > So that means if no collision-proof mechanism is introduced, Xen should not >> > trust any passed-in ACPI tables and should build them by itself? >> >> Basically yes, albeit collision-proof may be too much to demand. >> Simply separating name spaces (for hvmloader and qemu to have >> their own sub-spaces) would be sufficient imo. We should trust >> ourselves to play by such a specification. >> > > I don't quite understand 'separating name spaces'. Do you mean, for > example, if both hvmloader and qemu want to put a namespace device under > \_SB, they could be put in different sub-scopes under \_SB? But it does > not work for Linux at least. Aiui just the leaf names matter for sufficient separation, i.e. recurring sub-scopes should not be a problem. > Anyway, we may avoid some conflicts between ACPI tables/objects by > restricting which tables and objects can be passed from QEMU to Xen: > (1) For ACPI tables, xen does not accept those built by itself, > e.g. DSDT and SSDT. > (2) xen does not accept ACPI tables for devices that are not attached to > a domain, e.g. if NFIT cannot be passed if a domain does not have > vNVDIMM. 
> (3) For ACPI objects, xen only accepts namespace devices and requires > their names does not conflict with existing ones provided by Xen. And how do you imagine to enforce this without parsing the handed AML? (Remember there's no AML parser in hvmloader.) Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/24/16 09:54, Jan Beulich wrote: > >>> On 24.02.16 at 16:48,wrote: > > On 02/24/16 07:24, Jan Beulich wrote: > >> >>> On 24.02.16 at 14:28, wrote: > >> > On 02/18/16 10:17, Jan Beulich wrote: > >> >> >>> On 01.02.16 at 06:44, wrote: > >> >> > 3.3 Guest ACPI Emulation > >> >> > > >> >> > 3.3.1 My Design > >> >> > > >> >> > Guest ACPI emulation is composed of two parts: building guest NFIT > >> >> > and SSDT that defines ACPI namespace devices for NVDIMM, and > >> >> > emulating guest _DSM. > >> >> > > >> >> > (1) Building Guest ACPI Tables > >> >> > > >> >> > This design reuses and extends hvmloader's existing mechanism that > >> >> > loads passthrough ACPI tables from binary files to load NFIT and > >> >> > SSDT tables built by QEMU: > >> >> > 1) Because the current QEMU does not building any ACPI tables when > >> >> > it runs as the Xen device model, this design needs to patch QEMU > >> >> > to build NFIT and SSDT (so far only NFIT and SSDT) in this case. > >> >> > > >> >> > 2) QEMU copies NFIT and SSDT to the end of guest memory below > >> >> > 4G. The guest address and size of those tables are written into > >> >> > xenstore > >> >> > (/local/domain/domid/hvmloader/dm-acpi/{address,length}). > >> >> > > >> >> > 3) hvmloader is patched to probe and load device model passthrough > >> >> > ACPI tables from above xenstore keys. The detected ACPI tables > >> >> > are then appended to the end of existing guest ACPI tables just > >> >> > like what current construct_passthrough_tables() does. > >> >> > > >> >> > Reasons for this design are listed below: > >> >> > - NFIT and SSDT in question are quite self-contained, i.e. they do > >> >> > not refer to other ACPI tables and not conflict with existing > >> >> > guest ACPI tables in Xen. Therefore, it is safe to copy them from > >> >> > QEMU and append to existing guest ACPI tables. > >> >> > >> >> How is this not conflicting being guaranteed? 
In particular I don't > >> >> see how tables containing AML code and coming from different > >> >> sources won't possibly cause ACPI name space collisions. > >> >> > >> > > >> > Really there is no effective mechanism to avoid ACPI name space > >> > collisions (and other kinds of conflicts) between ACPI tables loaded > >> > from QEMU and ACPI tables built by hvmloader. Because which ACPI tables > >> > are loaded is determined by developers, IMO it's developers' > >> > responsibility to avoid any collisions and conflicts with existing ACPI > >> > tables. > >> > >> Right, but this needs to be spelled out and settled on at design > >> time (i.e. now), rather leaving things unspecified, awaiting the > >> first clash. > > > > So that means if no collision-proof mechanism is introduced, Xen should not > > trust any passed-in ACPI tables and should build them by itself? > > Basically yes, albeit collision-proof may be too much to demand. > Simply separating name spaces (for hvmloader and qemu to have > their own sub-spaces) would be sufficient imo. We should trust > ourselves to play by such a specification. > I don't quite understand 'separating name spaces'. Do you mean, for example, if both hvmloader and qemu want to put a namespace device under \_SB, they could be put in different sub-scopes under \_SB? But it does not work for Linux at least. Anyway, we may avoid some conflicts between ACPI tables/objects by restricting which tables and objects can be passed from QEMU to Xen: (1) For ACPI tables, xen does not accept those built by itself, e.g. DSDT and SSDT. (2) xen does not accept ACPI tables for devices that are not attached to a domain, e.g. if NFIT cannot be passed if a domain does not have vNVDIMM. (3) For ACPI objects, xen only accepts namespace devices and requires their names does not conflict with existing ones provided by Xen. 
In implementation, QEMU could put the passed-in ACPI tables and objects in a series of blobs in the following format:

    +------+------+------+
    | type | size | data |
    +------+------+------+

where
(1) 'type' indicates which kind of data is stored in this blob: 0 for an ACPI table, 1 for an ACPI namespace device,
(2) 'size' indicates this blob's size in bytes. The next blob (if any) can be found by adding 'size' to the base address of the current blob,
(3) 'data' is of variable length and stores the actual passed content:
    (a) for a type 0 blob (ACPI table), a complete ACPI table including the table header is stored in 'data',
    (b) for a type 1 blob (ACPI namespace device), 'data' begins with a 4-byte device name, followed by the AML code within that device, e.g. for a device

        Device (DEV0)
        {
            Name (_HID, "ACPI1234")
            Method (_DSM) { ... }
        }

    "DEV0" is stored at the beginning of 'data', and then is AML code of
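The layout above can be made concrete with a short C sketch. The struct and constant names below are my own illustration of the proposal, not an agreed interface; a header copy via memcpy() is used so the walker does not depend on buffer alignment:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { DM_ACPI_BLOB_TABLE = 0, DM_ACPI_BLOB_NSDEV = 1 };

/* Header of each blob; 'size - sizeof(header)' bytes of data follow. */
struct dm_acpi_blob_hdr {
    uint32_t type;   /* 0: complete ACPI table; 1: namespace device */
    uint32_t size;   /* total blob size in bytes, header included */
};

/* Count blobs in a buffer by hopping 'size' bytes at a time. */
static unsigned int count_blobs(const uint8_t *buf, size_t len)
{
    unsigned int n = 0;
    size_t off = 0;
    struct dm_acpi_blob_hdr hdr;

    while (off + sizeof(hdr) <= len) {
        memcpy(&hdr, buf + off, sizeof(hdr));   /* alignment-safe read */
        if (hdr.size < sizeof(hdr) || hdr.size > len - off)
            break;                              /* malformed: stop walking */
        n++;
        off += hdr.size;
    }
    return n;
}
```

Walking by 'size' like this also shows why the consuming side needs bounds checks: a malformed 'size' must never send the walker outside the shared buffer.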
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
> > > QEMU would always use MFN above guest normal ram and I/O holes for > > > vNVDIMM. It would attempt to search in that space for a contiguous range > > > that is large enough for the vNVDIMM devices. Is guest able to > > > punch holes in such GFN space? > > > > See XENMAPSPACE_* and their uses. > > > > I think we can add following restrictions to avoid uses of XENMAPSPACE_* > punching holes in GFNs of vNVDIMM: > > (1) For XENMAPSPACE_shared_info and _grant_table, never map idx in them > to GFNs occupied by vNVDIMM. OK, that sounds correct. > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign, > (a) never map idx in them to GFNs occupied by vNVDIMM, and > (b) never map idx corresponding to GFNs occupied by vNVDIMM Would that mean that guest xen-blkback or xen-netback wouldn't be able to fetch data from the GFNs? As in, what if the HVM guest that has the NVDIMM also serves as a device domain - that is it has xen-blkback running to service other guests? > > > Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 01.02.16 at 06:44, wrote: > This design treats host NVDIMM devices as ordinary MMIO devices: Wrt the cachability note earlier on, I assume you're aware that with the XSA-154 changes we disallow any cachable mappings of MMIO by default. > (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT) > and drive host NVDIMM devices (implementing block device > interface). Namespaces and file systems on host NVDIMM devices > are handled by Dom0 Linux as well. > > (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its > virtual address space (buf). > > (3) QEMU gets the host physical address of buf, i.e. the host system > physical address that is occupied by /dev/pmem0, and calls Xen > hypercall XEN_DOMCTL_memory_mapping to map it to a DomU. > > (ACPI part is described in Section 3.3 later) > > Above (1)(2) have already been done in current QEMU. Only (3) is > needed to implement in QEMU. No change is needed in Xen for address > mapping in this design. > > Open: It seems no system call/ioctl is provided by Linux kernel to > get the physical address from a virtual address. > /proc//pagemap provides information of mapping from > VA to PA. Is it an acceptable solution to let QEMU parse this > file to get the physical address? > > Open: For a large pmem, mmap(2) is very possible to not map all SPA > occupied by pmem at the beginning, i.e. QEMU may not be able to > get all SPA of pmem from buf (in virtual address space) when > calling XEN_DOMCTL_memory_mapping. > Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the > entire pmem being mmaped? A fundamental question I have here is: Why does qemu need to map this at all? It shouldn't itself need to access those ranges, since the guest is given direct access. It would seem quite a bit more natural if qemu simply inquired to underlying GFN range(s) and handed those to Xen for translation to MFNs and mapping into guest space.
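On the first open question: the pagemap lookup QEMU would have to do can be sketched with two pure helpers. The pagemap file holds one little-endian 64-bit entry per virtual page; bit 63 is the "present" bit and bits 0-54 hold the PFN (which recent kernels zero for unprivileged readers). This is only an illustration of the arithmetic; a real QEMU patch would pread() the entry at the computed offset from /proc/self/pagemap and handle errors and privileges:

```c
#include <stdint.h>

/*
 * Sketch of the pagemap-based VA->PA translation.  Bit layout per the
 * kernel's pagemap documentation: bit 63 = page present,
 * bits 0-54 = PFN.
 */
#define PAGEMAP_PRESENT(e)  (((e) >> 63) & 1)
#define PAGEMAP_PFN(e)      ((e) & ((1ULL << 55) - 1))

/* Byte offset of the pagemap entry covering virtual address 'va'. */
static uint64_t pagemap_offset(uint64_t va, uint64_t page_size)
{
    return (va / page_size) * sizeof(uint64_t);
}

/* Translate a VA given its pagemap entry; returns 0 if not present. */
static uint64_t pagemap_to_phys(uint64_t entry, uint64_t va,
                                uint64_t page_size)
{
    if (!PAGEMAP_PRESENT(entry))
        return 0;
    return PAGEMAP_PFN(entry) * page_size + va % page_size;
}
```

Note this only answers "how" mechanically; the CAP_SYS_ADMIN and mlock concerns raised later in the thread still apply.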
> I notice that current XEN_DOMCTL_memory_mapping does not make santiy > check for the physical address and size passed from caller > (QEMU). Can QEMU be always trusted? If not, we would need to make Xen > aware of the SPA range of pmem so that it can refuse map physical > address in neither the normal ram nor pmem. I'm not sure what missing sanity checks this is about: The handling involves two iomem_access_permitted() calls. > 3.3 Guest ACPI Emulation > > 3.3.1 My Design > > Guest ACPI emulation is composed of two parts: building guest NFIT > and SSDT that defines ACPI namespace devices for NVDIMM, and > emulating guest _DSM. > > (1) Building Guest ACPI Tables > > This design reuses and extends hvmloader's existing mechanism that > loads passthrough ACPI tables from binary files to load NFIT and > SSDT tables built by QEMU: > 1) Because the current QEMU does not building any ACPI tables when > it runs as the Xen device model, this design needs to patch QEMU > to build NFIT and SSDT (so far only NFIT and SSDT) in this case. > > 2) QEMU copies NFIT and SSDT to the end of guest memory below > 4G. The guest address and size of those tables are written into > xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}). > > 3) hvmloader is patched to probe and load device model passthrough > ACPI tables from above xenstore keys. The detected ACPI tables > are then appended to the end of existing guest ACPI tables just > like what current construct_passthrough_tables() does. > > Reasons for this design are listed below: > - NFIT and SSDT in question are quite self-contained, i.e. they do > not refer to other ACPI tables and not conflict with existing > guest ACPI tables in Xen. Therefore, it is safe to copy them from > QEMU and append to existing guest ACPI tables. How is this not conflicting being guaranteed? In particular I don't see how tables containing AML code and coming from different sources won't possibly cause ACPI name space collisions. 
> 3.3.3 Alternative Design 2: keeping in Xen > > Alternative to switching to QEMU, another design would be building > NFIT and SSDT in hvmloader or toolstack. > > The amount and parameters of sub-structures in guest NFIT vary > according to different vNVDIMM configurations that can not be decided > at compile-time. In contrast, current hvmloader and toolstack can > only build static ACPI tables, i.e. their contents are decided > statically at compile-time and independent from the guest > configuration. In order to build guest NFIT at runtime, this design > may take following steps: > (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU > options, > > (2) QEMU accepts above options, figures out the start SPA range > address/size/NVDIMM device handles/..., and writes them in > xenstore. No ACPI table is built by
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/17/16 02:08, Jan Beulich wrote: > >>> On 17.02.16 at 10:01,wrote: > > On 02/15/16 04:07, Jan Beulich wrote: > >> >>> On 15.02.16 at 09:43, wrote: > >> > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote: > >> >> > Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of > >> >> > three parts: > >> >> > (1) Guest clwb/clflushopt/pcommit enabling, > >> >> > (2) Memory mapping, and > >> >> > (3) Guest ACPI emulation. > >> >> > >> >> > >> >> .. MCE? and vMCE? > >> >> > >> > > >> > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a > >> > way similar to what mc_memerr_dhandler() does, with some differences in > >> > the data structure and the broken page offline parts: > >> > > >> > Broken NVDIMM pages should be marked as "offlined" so that Xen > >> > hypervisor can refuse further requests that map them to DomU. > >> > > >> > The real problem here is what data structure will be used to record > >> > information of NVDIMM pages. Because the size of NVDIMM is usually much > >> > larger than normal ram, using struct page_info for NVDIMM pages would > >> > occupy too much memory. > >> > >> I don't see how your alternative below would be less memory > >> hungry: Since guests have at least partial control of their GFN > >> space, a malicious guest could punch holes into the contiguous > >> GFN range that you appear to be thinking about, thus causing > >> arbitrary splitting of the control structure. > >> > > > > QEMU would always use MFN above guest normal ram and I/O holes for > > vNVDIMM. It would attempt to search in that space for a contiguous range > > that is large enough for that that vNVDIMM devices. Is guest able to > > punch holes in such GFN space? > > See XENMAPSPACE_* and their uses. > I think we can add following restrictions to avoid uses of XENMAPSPACE_* punching holes in GFNs of vNVDIMM: (1) For XENMAPSPACE_shared_info and _grant_table, never map idx in them to GFNs occupied by vNVDIMM. 
(2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign, (a) never map idx in them to GFNs occupied by vNVDIMM, and (b) never map idx corresponding to GFNs occupied by vNVDIMM Haozhong
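The proposed restrictions amount to a range check against the vNVDIMM GFN ranges before servicing an XENMAPSPACE_* request, so a mapping cannot punch holes into the contiguous range. A hypothetical sketch, not actual Xen code (struct and function names are mine):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous guest-frame range backing a vNVDIMM device. */
struct vnvdimm_region {
    uint64_t gfn_start, gfn_end;   /* [gfn_start, gfn_end) */
};

/*
 * Would this GFN collide with a vNVDIMM region?  A guard like this
 * would run before honouring an add-to-physmap style request.
 */
static bool gfn_hits_vnvdimm(const struct vnvdimm_region *regions,
                             size_t nr, uint64_t gfn)
{
    for (size_t i = 0; i < nr; i++)
        if (gfn >= regions[i].gfn_start && gfn < regions[i].gfn_end)
            return true;
    return false;
}
```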
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 17.02.16 at 10:01,wrote: > On 02/15/16 04:07, Jan Beulich wrote: >> >>> On 15.02.16 at 09:43, wrote: >> > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote: >> >> > Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of >> >> > three parts: >> >> > (1) Guest clwb/clflushopt/pcommit enabling, >> >> > (2) Memory mapping, and >> >> > (3) Guest ACPI emulation. >> >> >> >> >> >> .. MCE? and vMCE? >> >> >> > >> > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a >> > way similar to what mc_memerr_dhandler() does, with some differences in >> > the data structure and the broken page offline parts: >> > >> > Broken NVDIMM pages should be marked as "offlined" so that Xen >> > hypervisor can refuse further requests that map them to DomU. >> > >> > The real problem here is what data structure will be used to record >> > information of NVDIMM pages. Because the size of NVDIMM is usually much >> > larger than normal ram, using struct page_info for NVDIMM pages would >> > occupy too much memory. >> >> I don't see how your alternative below would be less memory >> hungry: Since guests have at least partial control of their GFN >> space, a malicious guest could punch holes into the contiguous >> GFN range that you appear to be thinking about, thus causing >> arbitrary splitting of the control structure. >> > > QEMU would always use MFN above guest normal ram and I/O holes for > vNVDIMM. It would attempt to search in that space for a contiguous range > that is large enough for that that vNVDIMM devices. Is guest able to > punch holes in such GFN space? See XENMAPSPACE_* and their uses. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/16/16 05:55, Jan Beulich wrote: > >>> On 16.02.16 at 12:14,wrote: > > On Mon, 15 Feb 2016, Zhang, Haozhong wrote: > >> On 02/04/16 20:24, Stefano Stabellini wrote: > >> > On Thu, 4 Feb 2016, Haozhong Zhang wrote: > >> > > On 02/03/16 15:22, Stefano Stabellini wrote: > >> > > > On Wed, 3 Feb 2016, George Dunlap wrote: > >> > > > > On 03/02/16 12:02, Stefano Stabellini wrote: > >> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote: > >> > > > > >> Or, we can make a file system on /dev/pmem0, create files on > >> > > > > >> it, set > >> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then > >> > > > > >> pass > >> > > > > >> those files to QEMU. In this way, non-root QEMU should be able > >> > > > > >> to > >> > > > > >> mmap those files. > >> > > > > > > >> > > > > > Maybe that would work. Worth adding it to the design, I would > >> > > > > > like to > >> > > > > > read more details on it. > >> > > > > > > >> > > > > > Also note that QEMU initially runs as root but drops privileges > >> > > > > > to > >> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially > >> > > > > > QEMU > >> > > > > > *could* mmap /dev/pmem0 while is still running as root, but then > >> > > > > > it > >> > > > > > wouldn't work for any devices that need to be mmap'ed at run time > >> > > > > > (hotplug scenario). > >> > > > > > >> > > > > This is basically the same problem we have for a bunch of other > >> > > > > things, > >> > > > > right? Having xl open a file and then pass it via qmp to qemu > >> > > > > should > >> > > > > work in theory, right? > >> > > > > >> > > > Is there one /dev/pmem? per assignable region? > >> > > > >> > > Yes. > >> > > > >> > > BTW, I'm wondering whether and how non-root qemu works with xl disk > >> > > configuration that is going to access a host block device, e.g. > >> > > disk = [ '/dev/sdb,,hda' ] > >> > > If that works with non-root qemu, I may take the similar solution for > >> > > pmem. 
> >> > > >> > Today the user is required to give the correct ownership and access mode > >> > to the block device, so that non-root QEMU can open it. However in the > >> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence > >> > the feature doesn't work at all with non-root QEMU > >> > (http://marc.info/?l=xen-devel=145261763600528). > >> > > >> > If there is one /dev/pmem device per assignable region, then it would be > >> > conceivable to change its ownership so that non-root QEMU can open it. > >> > Or, better, the file descriptor could be passed by the toolstack via > >> > qmp. > >> > >> Passing file descriptor via qmp is not enough. > >> > >> Let me clarify where the requirement for root/privileged permissions > >> comes from. The primary workflow in my design that maps a host pmem > >> region or files in host pmem region to guest is shown as below: > >> (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on > >> /dev/pmem0) to its virtual address space, i.e. the guest virtual > >> address space. > >> (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA > >> occupied by the host pmem to a DomU. This step requires the > >> translation from the guest virtual address (where the host pmem is > >> mmaped in (1)) to the host physical address. The translation can be > >> done by either > >> (a) QEMU that parses its own /proc/self/pagemap, > >> or > >> (b) Xen hypervisor that does the translation by itself [1] (though > >> this choice is not quite doable from Konrad's comments [2]). > >> > >> [1] > >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html > >> [2] > >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html > >> > >> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability > >> since linux kernel 4.0. 
Furthermore, if we don't mlock the mapped host > >> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap), > >> pagemap will not contain all mappings. However, mlock may require > >> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because > >> mlock operates on memory, the permission to open(2) the host pmem files > >> does not solve the problem and therefore passing file descriptor via qmp > >> does not help. > >> > >> For 2-b, from Konrad's comments [2], mlock is also required and > >> privileged permission may be required consequently. > >> > >> Note that the mapping and the address translation are done before QEMU > >> dropping privileged permissions, so non-root QEMU should be able to work > >> with above design until we start considering vNVDIMM hotplug (which has > >> not been supported by the current vNVDIMM implementation in QEMU). In > >> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it > >> running with root permissions. > > > > Are we all good with the fact that vNVDIMM
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/15/16 04:07, Jan Beulich wrote: > >>> On 15.02.16 at 09:43,wrote: > > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote: > >> > Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of > >> > three parts: > >> > (1) Guest clwb/clflushopt/pcommit enabling, > >> > (2) Memory mapping, and > >> > (3) Guest ACPI emulation. > >> > >> > >> .. MCE? and vMCE? > >> > > > > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a > > way similar to what mc_memerr_dhandler() does, with some differences in > > the data structure and the broken page offline parts: > > > > Broken NVDIMM pages should be marked as "offlined" so that Xen > > hypervisor can refuse further requests that map them to DomU. > > > > The real problem here is what data structure will be used to record > > information of NVDIMM pages. Because the size of NVDIMM is usually much > > larger than normal ram, using struct page_info for NVDIMM pages would > > occupy too much memory. > > I don't see how your alternative below would be less memory > hungry: Since guests have at least partial control of their GFN > space, a malicious guest could punch holes into the contiguous > GFN range that you appear to be thinking about, thus causing > arbitrary splitting of the control structure. > QEMU would always use MFN above guest normal ram and I/O holes for vNVDIMM. It would attempt to search in that space for a contiguous range that is large enough for that that vNVDIMM devices. Is guest able to punch holes in such GFN space? > Also - see how you all of the sudden came to think of using > struct page_info here (implying hypervisor control of these > NVDIMM ranges)? 
> > > (4) When a MCE for host NVDIMM SPA range [start_mfn, end_mfn] happens, > > (a) search xen_nvdimm_pages_list for affected nvdimm_pages structures, > > (b) for each affected nvdimm_pages, if it belongs to a domain d and > > its broken field is already set, the domain d will be shutdown to > > prevent malicious guest accessing broken page (similarly to what > > offline_page() does). > > (c) for each affected nvdimm_pages, set its broken field to 1, and > > (d) for each affected nvdimm_pages, inject to domain d a vMCE that > > covers its GFN range if that nvdimm_pages belongs to domain d. > > I don't see why you'd want to mark the entire range bad: All > that's known to be broken is a single page. Hence this would be > another source of splits of the proposed control structures. > Oh yes, I should split the range here rather than mark it entirely bad. Such splits are caused by hardware errors; unless the host NVDIMM is terribly broken, there should not be a large number of splits. Haozhong ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
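The (a)-(c) flow above, combined with Jan's point that only the faulting page is known bad, can be sketched as follows. This is a simplified, hypothetical model (ranges in a flat array rather than the proposed linked lists, domains as integer ids), not actual Xen code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the nvdimm_pages structure discussed in the
 * thread: each range covers MFNs [mfn, mfn + nr_pages - 1]. */
struct nv_range {
    unsigned long mfn, nr_pages;
    int domain_id;          /* -1 if not assigned to any domain */
    bool broken;
};

enum mce_action { MCE_NONE, MCE_OFFLINE, MCE_SHUTDOWN_DOMAIN };

/* Steps (a)-(c): find ranges overlapping the faulting MFN span; a
 * second fault on an already-broken, assigned range forces a domain
 * shutdown, mirroring what offline_page() does for normal RAM. */
static enum mce_action handle_mce(struct nv_range *r, size_t n,
                                  unsigned long start_mfn,
                                  unsigned long end_mfn)
{
    enum mce_action act = MCE_NONE;
    for (size_t i = 0; i < n; i++) {
        unsigned long r_end = r[i].mfn + r[i].nr_pages - 1;
        if (end_mfn < r[i].mfn || start_mfn > r_end)
            continue;                       /* no overlap */
        if (r[i].broken && r[i].domain_id >= 0)
            return MCE_SHUTDOWN_DOMAIN;     /* step (b) */
        r[i].broken = true;                 /* step (c) */
        act = MCE_OFFLINE;
    }
    return act;
}
```

Step (d), the vMCE injection, and the range split Jan asks for (so that only the single bad page, rather than the whole range, is marked broken) are omitted from this sketch.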
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 16.02.16 at 12:14,wrote: > On Mon, 15 Feb 2016, Zhang, Haozhong wrote: >> On 02/04/16 20:24, Stefano Stabellini wrote: >> > On Thu, 4 Feb 2016, Haozhong Zhang wrote: >> > > On 02/03/16 15:22, Stefano Stabellini wrote: >> > > > On Wed, 3 Feb 2016, George Dunlap wrote: >> > > > > On 03/02/16 12:02, Stefano Stabellini wrote: >> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote: >> > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, >> > > > > >> set >> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then >> > > > > >> pass >> > > > > >> those files to QEMU. In this way, non-root QEMU should be able to >> > > > > >> mmap those files. >> > > > > > >> > > > > > Maybe that would work. Worth adding it to the design, I would like >> > > > > > to >> > > > > > read more details on it. >> > > > > > >> > > > > > Also note that QEMU initially runs as root but drops privileges to >> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially >> > > > > > QEMU >> > > > > > *could* mmap /dev/pmem0 while is still running as root, but then it >> > > > > > wouldn't work for any devices that need to be mmap'ed at run time >> > > > > > (hotplug scenario). >> > > > > >> > > > > This is basically the same problem we have for a bunch of other >> > > > > things, >> > > > > right? Having xl open a file and then pass it via qmp to qemu should >> > > > > work in theory, right? >> > > > >> > > > Is there one /dev/pmem? per assignable region? >> > > >> > > Yes. >> > > >> > > BTW, I'm wondering whether and how non-root qemu works with xl disk >> > > configuration that is going to access a host block device, e.g. >> > > disk = [ '/dev/sdb,,hda' ] >> > > If that works with non-root qemu, I may take the similar solution for >> > > pmem. >> > >> > Today the user is required to give the correct ownership and access mode >> > to the block device, so that non-root QEMU can open it. 
However in the >> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence >> > the feature doesn't work at all with non-root QEMU >> > (http://marc.info/?l=xen-devel=145261763600528). >> > >> > If there is one /dev/pmem device per assignable region, then it would be >> > conceivable to change its ownership so that non-root QEMU can open it. >> > Or, better, the file descriptor could be passed by the toolstack via >> > qmp. >> >> Passing file descriptor via qmp is not enough. >> >> Let me clarify where the requirement for root/privileged permissions >> comes from. The primary workflow in my design that maps a host pmem >> region or files in host pmem region to guest is shown as below: >> (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on >> /dev/pmem0) to its virtual address space, i.e. the guest virtual >> address space. >> (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA >> occupied by the host pmem to a DomU. This step requires the >> translation from the guest virtual address (where the host pmem is >> mmaped in (1)) to the host physical address. The translation can be >> done by either >> (a) QEMU that parses its own /proc/self/pagemap, >> or >> (b) Xen hypervisor that does the translation by itself [1] (though >> this choice is not quite doable from Konrad's comments [2]). >> >> [1] >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html >> [2] >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html >> >> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability >> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host >> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap), >> pagemap will not contain all mappings. However, mlock may require >> privileged permission to lock memory larger than RLIMIT_MEMLOCK. 
Because >> mlock operates on memory, the permission to open(2) the host pmem files >> does not solve the problem and therefore passing file descriptor via qmp >> does not help. >> >> For 2-b, from Konrad's comments [2], mlock is also required and >> privileged permission may be required consequently. >> >> Note that the mapping and the address translation are done before QEMU >> dropping privileged permissions, so non-root QEMU should be able to work >> with above design until we start considering vNVDIMM hotplug (which has >> not been supported by the current vNVDIMM implementation in QEMU). In >> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it >> running with root permissions. > > Are we all good with the fact that vNVDIMM hotplug won't work (unless > the user explicitly asks for it at domain creation time, which is > very unlikely otherwise she could use coldplug)? No, at least there needs to be a road towards hotplug, even if initially this may not be supported/implemented. Jan
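The pagemap mechanism referred to in 2-a above is well defined: each 8-byte entry in /proc/<pid>/pagemap corresponds to one virtual page, with the PFN in bits 0-54 and a "present" flag in bit 63, and since Linux 4.2 the PFN field reads as zero without CAP_SYS_ADMIN, which is exactly the restriction being discussed. A minimal sketch of the parsing QEMU would have to do (helper names are illustrative, not from any actual patch):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>

/* Entry layout per Documentation/vm/pagemap.txt:
 * bits 0-54: PFN (if present), bit 62: swapped, bit 63: present. */
#define PM_PFN_MASK  ((UINT64_C(1) << 55) - 1)
#define PM_PRESENT   (UINT64_C(1) << 63)

/* Byte offset into /proc/self/pagemap for a given virtual address. */
static off_t pagemap_offset(uintptr_t va, long page_size)
{
    return (off_t)(va / (uintptr_t)page_size) * 8;
}

/* Extract the PFN from one pagemap entry; returns -1 if the page is
 * not present (e.g. never faulted in because it was not mlock'ed). */
static int pagemap_pfn(uint64_t entry, uint64_t *pfn)
{
    if (!(entry & PM_PRESENT))
        return -1;
    *pfn = entry & PM_PFN_MASK;
    return 0;
}
```

To translate a mapped pmem address, QEMU would pread() 8 bytes at pagemap_offset(va, sysconf(_SC_PAGESIZE)) and feed the result to pagemap_pfn(); a zero PFN with the present bit set indicates the unprivileged-reader restriction above.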
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On Mon, 15 Feb 2016, Zhang, Haozhong wrote: > On 02/04/16 20:24, Stefano Stabellini wrote: > > On Thu, 4 Feb 2016, Haozhong Zhang wrote: > > > On 02/03/16 15:22, Stefano Stabellini wrote: > > > > On Wed, 3 Feb 2016, George Dunlap wrote: > > > > > On 03/02/16 12:02, Stefano Stabellini wrote: > > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote: > > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, > > > > > >> set > > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass > > > > > >> those files to QEMU. In this way, non-root QEMU should be able to > > > > > >> mmap those files. > > > > > > > > > > > > Maybe that would work. Worth adding it to the design, I would like > > > > > > to > > > > > > read more details on it. > > > > > > > > > > > > Also note that QEMU initially runs as root but drops privileges to > > > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU > > > > > > *could* mmap /dev/pmem0 while is still running as root, but then it > > > > > > wouldn't work for any devices that need to be mmap'ed at run time > > > > > > (hotplug scenario). > > > > > > > > > > This is basically the same problem we have for a bunch of other > > > > > things, > > > > > right? Having xl open a file and then pass it via qmp to qemu should > > > > > work in theory, right? > > > > > > > > Is there one /dev/pmem? per assignable region? > > > > > > Yes. > > > > > > BTW, I'm wondering whether and how non-root qemu works with xl disk > > > configuration that is going to access a host block device, e.g. > > > disk = [ '/dev/sdb,,hda' ] > > > If that works with non-root qemu, I may take the similar solution for > > > pmem. > > > > Today the user is required to give the correct ownership and access mode > > to the block device, so that non-root QEMU can open it. 
However in the > > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence > > the feature doesn't work at all with non-root QEMU > > (http://marc.info/?l=xen-devel=145261763600528). > > > > If there is one /dev/pmem device per assignable region, then it would be > > conceivable to change its ownership so that non-root QEMU can open it. > > Or, better, the file descriptor could be passed by the toolstack via > > qmp. > > Passing file descriptor via qmp is not enough. > > Let me clarify where the requirement for root/privileged permissions > comes from. The primary workflow in my design that maps a host pmem > region or files in host pmem region to guest is shown as below: > (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on > /dev/pmem0) to its virtual address space, i.e. the guest virtual > address space. > (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA > occupied by the host pmem to a DomU. This step requires the > translation from the guest virtual address (where the host pmem is > mmaped in (1)) to the host physical address. The translation can be > done by either > (a) QEMU that parses its own /proc/self/pagemap, > or > (b) Xen hypervisor that does the translation by itself [1] (though > this choice is not quite doable from Konrad's comments [2]). > > [1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html > [2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html > > For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability > since linux kernel 4.0. Furthermore, if we don't mlock the mapped host > pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap), > pagemap will not contain all mappings. However, mlock may require > privileged permission to lock memory larger than RLIMIT_MEMLOCK. 
Because > mlock operates on memory, the permission to open(2) the host pmem files > does not solve the problem and therefore passing file descriptor via qmp > does not help. > > For 2-b, from Konrad's comments [2], mlock is also required and > privileged permission may be required consequently. > > Note that the mapping and the address translation are done before QEMU > dropping privileged permissions, so non-root QEMU should be able to work > with above design until we start considering vNVDIMM hotplug (which has > not been supported by the current vNVDIMM implementation in QEMU). In > the hotplug case, we may let Xen pass explicit flags to QEMU to keep it > running with root permissions. Are we all good with the fact that vNVDIMM hotplug won't work (unless the user explicitly asks for it at domain creation time, which is very unlikely otherwise she could use coldplug)? If so, the design is OK for me. ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 15.02.16 at 09:43,wrote: > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote: >> > Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of >> > three parts: >> > (1) Guest clwb/clflushopt/pcommit enabling, >> > (2) Memory mapping, and >> > (3) Guest ACPI emulation. >> >> >> .. MCE? and vMCE? >> > > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a > way similar to what mc_memerr_dhandler() does, with some differences in > the data structure and the broken page offline parts: > > Broken NVDIMM pages should be marked as "offlined" so that Xen > hypervisor can refuse further requests that map them to DomU. > > The real problem here is what data structure will be used to record > information of NVDIMM pages. Because the size of NVDIMM is usually much > larger than normal ram, using struct page_info for NVDIMM pages would > occupy too much memory. I don't see how your alternative below would be less memory hungry: Since guests have at least partial control of their GFN space, a malicious guest could punch holes into the contiguous GFN range that you appear to be thinking about, thus causing arbitrary splitting of the control structure. Also - see how you all of the sudden came to think of using struct page_info here (implying hypervisor control of these NVDIMM ranges)? > (4) When a MCE for host NVDIMM SPA range [start_mfn, end_mfn] happens, > (a) search xen_nvdimm_pages_list for affected nvdimm_pages structures, > (b) for each affected nvdimm_pages, if it belongs to a domain d and > its broken field is already set, the domain d will be shutdown to > prevent malicious guest accessing broken page (similarly to what > offline_page() does). > (c) for each affected nvdimm_pages, set its broken field to 1, and > (d) for each affected nvdimm_pages, inject to domain d a vMCE that > covers its GFN range if that nvdimm_pages belongs to domain d. 
I don't see why you'd want to mark the entire range bad: All that's known to be broken is a single page. Hence this would be another source of splits of the proposed control structures. Jan ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/16 23:47, Konrad Rzeszutek Wilk wrote: > > > > > Open: It seems no system call/ioctl is provided by Linux kernel to > > > > >get the physical address from a virtual address. > > > > >/proc//pagemap provides information of mapping from > > > > >VA to PA. Is it an acceptable solution to let QEMU parse this > > > > >file to get the physical address? > > > > > > > > Does it work in a non-root scenario? > > > > > > > > > > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel: > > > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get > > > PFNs. > > > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from > > > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. > > > | Reason: information about PFNs helps in exploiting Rowhammer > > > vulnerability. > > Ah right. > > > > > > A possible alternative is to add a new hypercall similar to > > > XEN_DOMCTL_memory_mapping but receiving virtual address as the address > > > parameter and translating to machine address in the hypervisor. > > > > That might work. > > That won't work. > > This is a userspace VMA - which means that once the ioctl is done we swap > to kernel virtual addresses. Now we may know that the prior cr3 has the > userspace virtual address and walk it down - but what if the domain > that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere > inside the kernel. > > Which means this hypercall would need to know the Linux kernel task structure > to find this. > > May I propose another solution - a stacking driver (similar to loop). You > set it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img > created). > Then mmap the /dev/mapper/guest.img - all of the operations are the same - > except > it may have an extra ioctl - get_pfns - which would provide the data in a > form similar > to pagemap.txt.
> This stacking driver approach seems to still need privileged permission and more modifications in the kernel, so ... > But folks will then ask - why don't you just use pagemap? Could the pagemap > have an extra security capability check? One that can be set for > QEMU? > ... I would like to use pagemap and mlock. Haozhong > > > > > > > > > Open: For a large pmem, mmap(2) is very possible to not map all SPA > > > > >occupied by pmem at the beginning, i.e. QEMU may not be able to > > > > >get all SPA of pmem from buf (in virtual address space) when > > > > >calling XEN_DOMCTL_memory_mapping. > > > > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the > > > > >entire pmem being mmaped? > > > > > > > > Ditto > > > > > > > > > > No. If I take the above alternative for the first open, maybe the new > > > hypercall above can inject page faults into dom0 for the unmapped > > > virtual address so as to enforce dom0 Linux to create the page > > > mapping. > > Ugh. That sounds hacky. And you wouldn't necessarily be safe. > Imagine that the system admin decides to defrag the /dev/pmem filesystem. > Or move the files (disk images) around. If they do that - we may > still have the guest mapped to system addresses which may contain filesystem > metadata now, or a different guest image. We MUST mlock or lock the file > for the duration of the guest.
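The RLIMIT_MEMLOCK constraint driving this part of the discussion is easy to express: locking a multi-gigabyte pmem mapping will exceed the soft limit of an unprivileged process. A small sketch (the helper name is illustrative, not from any patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/resource.h>

/* Would mlock()ing a region of the given size exceed the process's
 * RLIMIT_MEMLOCK soft limit?  If so, privileged permission
 * (CAP_IPC_LOCK) is needed - the problem described above for
 * non-root QEMU locking a whole host pmem region. */
static bool mlock_needs_privilege(unsigned long long region_bytes,
                                  const struct rlimit *rl)
{
    if (rl->rlim_cur == RLIM_INFINITY)
        return false;
    return region_bytes > (unsigned long long)rl->rlim_cur;
}
```

In practice QEMU would call getrlimit(RLIMIT_MEMLOCK, &rl) before attempting mmap(..., MAP_SHARED | MAP_LOCKED, ...) or mlock(); the default soft limit is typically 64 KiB or a few MiB, far below any realistic pmem region.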
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote: > > 3. Design of vNVDIMM in Xen > > Thank you for this design! > > > > > Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of > > three parts: > > (1) Guest clwb/clflushopt/pcommit enabling, > > (2) Memory mapping, and > > (3) Guest ACPI emulation. > > > .. MCE? and vMCE? > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a way similar to what mc_memerr_dhandler() does, with some differences in the data structure and the broken page offline parts: Broken NVDIMM pages should be marked as "offlined" so that Xen hypervisor can refuse further requests that map them to DomU. The real problem here is what data structure will be used to record information of NVDIMM pages. Because the size of NVDIMM is usually much larger than normal ram, using struct page_info for NVDIMM pages would occupy too much memory. Alternatively, we may use a range set to represent NVDIMM pages:

struct nvdimm_pages {
    unsigned long mfn;   /* starting MFN of a range of NVDIMM pages */
    unsigned long gfn;   /* starting GFN where this range is mapped,
                            initially INVALID_GFN */
    unsigned long len;   /* length of this range in bytes */
    int broken;          /* 0: initial value, 1: this range of NVDIMM
                            pages is broken and offlined */
    struct domain *d;    /* NULL: initial value, non-NULL: the domain
                            this range is mapped to */
    /*
     * Every nvdimm_pages structure is linked in the global
     * xen_nvdimm_pages_list.
     *
     * If it is mapped to a domain d, it is also linked in
     * d->arch.nvdimm_pages_list.
     */
    struct list_head domain_list;
    struct list_head global_list;
};

struct list_head xen_nvdimm_pages_list;

/* in asm-x86/domain.h */
struct arch_domain {
    ...
    struct list_head nvdimm_pages_list;
};

(1) Initially, Xen hypervisor creates a nvdimm_pages structure for each pmem region (starting SPA and size reported by Dom0 NVDIMM driver) and links all nvdimm_pages structures in xen_nvdimm_pages_list.
(2) If Xen hypervisor is then requested to map a range of NVDIMM pages [start_mfn, end_mfn] to gfn of domain d, it will
    (a) Check whether the GFN range [gfn, gfn + end_mfn - start_mfn] of domain d has been occupied (e.g. by normal ram, I/O or other vNVDIMM).
    (b) Search xen_nvdimm_pages_list for one or multiple nvdimm_pages that [start_mfn, end_mfn] can fit in. If a nvdimm_pages structure is entirely covered by [start_mfn, end_mfn], then link that nvdimm_pages structure to d->arch.nvdimm_pages_list. If only a portion of a nvdimm_pages structure is covered by [start_mfn, end_mfn], then split that nvdimm_pages structure into multiple ones (the one entirely covered and at most two not covered), link the covered one to d->arch.nvdimm_pages_list and all of them to xen_nvdimm_pages_list as well. The gfn and d fields of nvdimm_pages structures linked to d->arch.nvdimm_pages_list are also set accordingly.
(3) When a domain d is shutdown/destroyed, merge its nvdimm_pages structures (i.e. those in d->arch.nvdimm_pages_list) back into xen_nvdimm_pages_list.
(4) When a MCE for host NVDIMM SPA range [start_mfn, end_mfn] happens,
    (a) search xen_nvdimm_pages_list for affected nvdimm_pages structures,
    (b) for each affected nvdimm_pages, if it belongs to a domain d and its broken field is already set, the domain d will be shut down to prevent a malicious guest accessing the broken page (similarly to what offline_page() does).
    (c) for each affected nvdimm_pages, set its broken field to 1, and
    (d) for each affected nvdimm_pages, inject to domain d a vMCE that covers its GFN range if that nvdimm_pages belongs to domain d.
Comments, pls. Thanks, Haozhong
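The split in step (2)(b) above is pure interval arithmetic and can be sketched as follows (a simplified, hypothetical helper operating on one range; not actual Xen code):

```c
#include <assert.h>
#include <stddef.h>

struct piece { unsigned long mfn, nr; };

/* Split the range [mfn, mfn + nr - 1] around the assigned subrange
 * [s, e] (the caller guarantees mfn <= s <= e <= mfn + nr - 1):
 * emits the covered piece plus at most two uncovered remainders, as
 * described in step (2)(b).  Returns the number of pieces (1 to 3). */
static size_t split_range(unsigned long mfn, unsigned long nr,
                          unsigned long s, unsigned long e,
                          struct piece out[3])
{
    size_t n = 0;
    unsigned long end = mfn + nr - 1;

    if (s > mfn)
        out[n++] = (struct piece){ mfn, s - mfn };      /* left remainder */
    out[n++] = (struct piece){ s, e - s + 1 };          /* covered piece */
    if (e < end)
        out[n++] = (struct piece){ e + 1, end - e };    /* right remainder */
    return n;
}
```

The merge in step (3) is the inverse: adjacent pieces with the same broken state and no owner can be coalesced back into a single structure.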
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/04/16 20:24, Stefano Stabellini wrote: > On Thu, 4 Feb 2016, Haozhong Zhang wrote: > > On 02/03/16 15:22, Stefano Stabellini wrote: > > > On Wed, 3 Feb 2016, George Dunlap wrote: > > > > On 03/02/16 12:02, Stefano Stabellini wrote: > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote: > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass > > > > >> those files to QEMU. In this way, non-root QEMU should be able to > > > > >> mmap those files. > > > > > > > > > > Maybe that would work. Worth adding it to the design, I would like to > > > > > read more details on it. > > > > > > > > > > Also note that QEMU initially runs as root but drops privileges to > > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU > > > > > *could* mmap /dev/pmem0 while is still running as root, but then it > > > > > wouldn't work for any devices that need to be mmap'ed at run time > > > > > (hotplug scenario). > > > > > > > > This is basically the same problem we have for a bunch of other things, > > > > right? Having xl open a file and then pass it via qmp to qemu should > > > > work in theory, right? > > > > > > Is there one /dev/pmem? per assignable region? > > > > Yes. > > > > BTW, I'm wondering whether and how non-root qemu works with xl disk > > configuration that is going to access a host block device, e.g. > > disk = [ '/dev/sdb,,hda' ] > > If that works with non-root qemu, I may take the similar solution for > > pmem. > > Today the user is required to give the correct ownership and access mode > to the block device, so that non-root QEMU can open it. However in the > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence > the feature doesn't work at all with non-root QEMU > (http://marc.info/?l=xen-devel=145261763600528). 
> > If there is one /dev/pmem device per assignable region, then it would be > conceivable to change its ownership so that non-root QEMU can open it. > Or, better, the file descriptor could be passed by the toolstack via > qmp. Passing file descriptor via qmp is not enough. Let me clarify where the requirement for root/privileged permissions comes from. The primary workflow in my design that maps a host pmem region or files in host pmem region to guest is shown as below: (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on /dev/pmem0) to its virtual address space, i.e. the guest virtual address space. (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA occupied by the host pmem to a DomU. This step requires the translation from the guest virtual address (where the host pmem is mmaped in (1)) to the host physical address. The translation can be done by either (a) QEMU that parses its own /proc/self/pagemap, or (b) Xen hypervisor that does the translation by itself [1] (though this choice is not quite doable from Konrad's comments [2]). [1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html [2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability since linux kernel 4.0. Furthermore, if we don't mlock the mapped host pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap), pagemap will not contain all mappings. However, mlock may require privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because mlock operates on memory, the permission to open(2) the host pmem files does not solve the problem and therefore passing file descriptor via qmp does not help. For 2-b, from Konrad's comments [2], mlock is also required and privileged permission may be required consequently. 
Note that the mapping and the address translation are done before QEMU dropping privileged permissions, so non-root QEMU should be able to work with above design until we start considering vNVDIMM hotplug (which has not been supported by the current vNVDIMM implementation in QEMU). In the hotplug case, we may let Xen pass explicit flags to QEMU to keep it running with root permissions. Haozhong ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/05/2016 08:43 PM, Haozhong Zhang wrote: > On 02/05/16 09:40, Ross Philipson wrote: >> On 02/03/2016 09:09 AM, Andrew Cooper wrote: > [...] >>> I agree. >>> >>> There has to be a single entity responsible for collating the eventual >>> ACPI handed to the guest, and this is definitely HVMLoader. >>> >>> However, it is correct that Qemu create the ACPI tables for the devices >>> it emulates for the guest. >>> >>> We need to agree on a mechanism whereby each entity can provide their >>> own subset of the ACPI tables to HVMLoader, and have HVMLoader present >>> the final set properly to the VM. >>> >>> There is an existing usecase of passing the Host SLIC table to a VM, for >>> OEM Versions of Windows. I believe this is achieved with >>> HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little >>> inflexible and could probably do with being made a little more generic. >> >> A while back I added a generic mechanism to load extra ACPI tables into a >> guest, configurable at runtime. It looks like the functionality is still >> present. That might be an option. >> >> Also, following the thread, it wasn't clear if some of the tables like the >> SSDT for the NVDIMM device and it's _FIT/_DSM methods were something that >> could be statically created at build time. If it is something that needs to >> be generated at runtime (e.g. platform specific), I have a library that can >> generate any AML on the fly and create SSDTs. >> >> Anyway just FYI in case this is helpful. >> > > Hi Ross, > > Thanks for the information! > > SSDT for NVDIMM devices can not be created statically, because the > number of some items in it can not be determined at build time. For > example, the number of NVDIMM ACPI namespace devices (_DSM is under it) > defined in SSDT is determined by the number of vNVDIMM devices in domain > configuration. 
FYI, a sample SSDT for NVDIMM looks like
>
> Scope (\_SB) {
>     Device (NVDR)          // NVDIMM Root device
>     {
>         Name (_HID, "ACPI0012")
>         Method (_STA) {...}
>         Method (_FIT) {...}
>         Method (_DSM, ...) {
>             ...
>         }
>     }
>
>     Device (NVD0)          // 1st NVDIMM Device
>     {
>         Name (_ADR, h0)
>         Method (_DSM, ...) {
>             ...
>         }
>     }
>
>     Device (NVD1)          // 2nd NVDIMM Device
>     {
>         Name (_ADR, h1)
>         Method (_DSM, ...) {
>             ...
>         }
>     }
>
>     ...
> }

Makes sense. > I had ported QEMU's AML builder code as well as NVDIMM ACPI building > code to hvmloader and it did work, but then there was too much > duplicated code for vNVDIMM between QEMU and hvmloader. > Therefore, I prefer to let QEMU, which emulates the vNVDIMM devices, > build those tables, as in Andrew and Jan's replies. Yea it looks like QEMU's AML generating code is quite complete nowadays. Back when I wrote my library there wasn't really much out there. Anyway this is where it is if there is something that I might generate that is missing: https://github.com/OpenXT/xctools/tree/master/libxenacpi > Thanks, > Haozhong -- Ross Philipson
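The point that the SSDT cannot be static follows directly from the sample above: the number of NVDnn devices tracks the domain configuration. A toy illustration of the runtime generation (emitting ASL source text only; real code would go through QEMU's AML build API, and the device methods are elided):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Emit a skeletal NVDIMM SSDT scope with one NVDnn device per
 * configured vNVDIMM.  Purely illustrative string generation, not an
 * actual AML encoder; assumes buf is large enough. */
static int emit_nvdimm_ssdt(char *buf, size_t len, int nr_vnvdimms)
{
    size_t off = 0;
    off += snprintf(buf + off, len - off,
                    "Scope (\\_SB) {\n"
                    "  Device (NVDR) { Name (_HID, \"ACPI0012\") }\n");
    for (int i = 0; i < nr_vnvdimms; i++)
        off += snprintf(buf + off, len - off,
                        "  Device (NVD%d) { Name (_ADR, %d) }\n", i, i);
    off += snprintf(buf + off, len - off, "}\n");
    return (int)off;
}
```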
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/2016 09:09 AM, Andrew Cooper wrote: On 03/02/16 09:13, Jan Beulich wrote: On 03.02.16 at 08:00,wrote: On 02/02/16 17:11, Stefano Stabellini wrote: Once upon a time somebody made the decision that ACPI tables on Xen should be static and included in hvmloader. That might have been a bad decision but at least it was coherent. Loading only *some* tables from QEMU, but not others, it feels like an incomplete design to me. For example, QEMU is currently in charge of emulating the PCI bus, why shouldn't it be QEMU that generates the PRT and MCFG? To Keir, Jan and Andrew: Are there anything related to ACPI that must be done (or are better to be done) in hvmloader? Some of the static tables (FADT and HPET come to mind) likely would better continue to live in hvmloader. MCFG (for example) coming from qemu, otoh, would be quite natural (and would finally allow MMCFG support for guests in the first place). I.e. ... I prefer switching to QEMU building all ACPI tables for devices that it is emulating. However this alternative is good too because it is coherent with the current design. I would prefer to this one if the final conclusion is that only one agent should be allowed to build guest ACPI. As I said above, it looks like a big change to switch to QEMU for all ACPI tables and I'm afraid it would break some existing guests. ... I indeed think that tables should come from qemu for components living in qemu, and from hvmloader for components coming from Xen. I agree. There has to be a single entity responsible for collating the eventual ACPI handed to the guest, and this is definitely HVMLoader. However, it is correct that Qemu create the ACPI tables for the devices it emulates for the guest. We need to agree on a mechanism whereby each entity can provide their own subset of the ACPI tables to HVMLoader, and have HVMLoader present the final set properly to the VM. There is an existing usecase of passing the Host SLIC table to a VM, for OEM Versions of Windows. 
I believe this is achieved with HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little inflexible and could probably do with being made a little more generic. A while back I added a generic mechanism to load extra ACPI tables into a guest, configurable at runtime. It looks like the functionality is still present. That might be an option. Also, following the thread, it wasn't clear if some of the tables like the SSDT for the NVDIMM device and its _FIT/_DSM methods were something that could be statically created at build time. If it is something that needs to be generated at runtime (e.g. platform specific), I have a library that can generate any AML on the fly and create SSDTs. Anyway just FYI in case this is helpful. ~Andrew -- Ross Philipson
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/05/16 09:40, Ross Philipson wrote: > On 02/03/2016 09:09 AM, Andrew Cooper wrote: [...] > >I agree. > > > >There has to be a single entity responsible for collating the eventual > >ACPI handed to the guest, and this is definitely HVMLoader. > > > >However, it is correct that Qemu create the ACPI tables for the devices > >it emulates for the guest. > > > >We need to agree on a mechanism whereby each entity can provide their > >own subset of the ACPI tables to HVMLoader, and have HVMLoader present > >the final set properly to the VM. > > > >There is an existing usecase of passing the Host SLIC table to a VM, for > >OEM Versions of Windows. I believe this is achieved with > >HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little > >inflexible and could probably do with being made a little more generic. > > A while back I added a generic mechanism to load extra ACPI tables into a > guest, configurable at runtime. It looks like the functionality is still > present. That might be an option. > > Also, following the thread, it wasn't clear if some of the tables like the > SSDT for the NVDIMM device and it's _FIT/_DSM methods were something that > could be statically created at build time. If it is something that needs to > be generated at runtime (e.g. platform specific), I have a library that can > generate any AML on the fly and create SSDTs. > > Anyway just FYI in case this is helpful. > Hi Ross, Thanks for the information! SSDT for NVDIMM devices can not be created statically, because the number of some items in it can not be determined at build time. For example, the number of NVDIMM ACPI namespace devices (_DSM is under it) defined in SSDT is determined by the number of vNVDIMM devices in domain configuration. FYI, a sample SSDT for NVDIMM looks like

Scope (\_SB) {
    Device (NVDR)          // NVDIMM Root device
    {
        Name (_HID, "ACPI0012")
        Method (_STA) {...}
        Method (_FIT) {...}
        Method (_DSM, ...) {
            ...
        }
    }

    Device (NVD0)          // 1st NVDIMM Device
    {
        Name (_ADR, h0)
        Method (_DSM, ...) {
            ...
        }
    }

    Device (NVD1)          // 2nd NVDIMM Device
    {
        Name (_ADR, h1)
        Method (_DSM, ...) {
            ...
        }
    }

    ...
}

I had ported QEMU's AML builder code as well as NVDIMM ACPI building code to hvmloader and it did work, but then there was too much duplicated code for vNVDIMM between QEMU and hvmloader. Therefore, I prefer to let QEMU, which emulates the vNVDIMM devices, build those tables, as in Andrew and Jan's replies. Thanks, Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On Thu, 4 Feb 2016, Haozhong Zhang wrote: > On 02/03/16 15:22, Stefano Stabellini wrote: > > On Wed, 3 Feb 2016, George Dunlap wrote: > > > On 03/02/16 12:02, Stefano Stabellini wrote: > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote: > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass > > > >> those files to QEMU. In this way, non-root QEMU should be able to > > > >> mmap those files. > > > > > > > > Maybe that would work. Worth adding it to the design, I would like to > > > > read more details on it. > > > > > > > > Also note that QEMU initially runs as root but drops privileges to > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU > > > > *could* mmap /dev/pmem0 while is still running as root, but then it > > > > wouldn't work for any devices that need to be mmap'ed at run time > > > > (hotplug scenario). > > > > > > This is basically the same problem we have for a bunch of other things, > > > right? Having xl open a file and then pass it via qmp to qemu should > > > work in theory, right? > > > > Is there one /dev/pmem? per assignable region? > > Yes. > > BTW, I'm wondering whether and how non-root qemu works with xl disk > configuration that is going to access a host block device, e.g. > disk = [ '/dev/sdb,,hda' ] > If that works with non-root qemu, I may take the similar solution for > pmem. Today the user is required to give the correct ownership and access mode to the block device, so that non-root QEMU can open it. However in the case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence the feature doesn't work at all with non-root QEMU (http://marc.info/?l=xen-devel=145261763600528). If there is one /dev/pmem device per assignable region, then it would be conceivable to change its ownership so that non-root QEMU can open it. Or, better, the file descriptor could be passed by the toolstack via qmp. 
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 03.02.16 at 08:00, wrote:
> On 02/02/16 17:11, Stefano Stabellini wrote:
>> Once upon a time somebody made the decision that ACPI tables
>> on Xen should be static and included in hvmloader. That might have been
>> a bad decision but at least it was coherent. Loading only *some* tables
>> from QEMU, but not others, feels like an incomplete design to me.
>>
>> For example, QEMU is currently in charge of emulating the PCI bus, why
>> shouldn't it be QEMU that generates the PRT and MCFG?
>>
>
> To Keir, Jan and Andrew:
>
> Is there anything related to ACPI that must be done (or is better done)
> in hvmloader?

Some of the static tables (FADT and HPET come to mind) likely would
better continue to live in hvmloader. MCFG (for example) coming from
qemu, otoh, would be quite natural (and would finally allow MMCFG
support for guests in the first place). I.e. ...

>> I prefer switching to QEMU building all ACPI tables for devices that it
>> is emulating. However this alternative is good too because it is
>> coherent with the current design.
>
> I would prefer this one if the final conclusion is that only one
> agent should be allowed to build guest ACPI. As I said above, it looks
> like a big change to switch to QEMU for all ACPI tables and I'm afraid
> it would break some existing guests.

... I indeed think that tables should come from qemu for components
living in qemu, and from hvmloader for components coming from Xen.

Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 03.02.16 at 13:22, wrote:
> On 02/03/16 02:18, Jan Beulich wrote:
>> >>> On 03.02.16 at 09:28, wrote:
>> > On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
>> >> > 3.1 Guest clwb/clflushopt/pcommit Enabling
>> >> >
>> >> > The instruction enabling is simple and we do the same work as in KVM/QEMU.
>> >> > - All three instructions are exposed to guest via guest cpuid.
>> >> > - L1 guest pcommit is never intercepted by Xen.
>> >>
>> >> I wish there were some watermarks like the PLE has.
>> >>
>> >> My fear is that an unfriendly guest can issue sfence all day long,
>> >> flushing out other guests' MMC queue (the writes followed by pcommits).
>> >> Which means that a guest may have degraded performance as their
>> >> memory writes are being flushed out immediately, as if they were
>> >> being written to UC instead of WB memory.
>> >
>> > pcommit takes no parameter and it seems hard to solve this problem
>> > in hardware for now. And the current VMX does not provide a mechanism
>> > to limit the commit rate of pcommit like PLE does for pause.
>> >
>> >> In other words - the NVDIMM resource does not provide any resource
>> >> isolation. However this may not be any different from what we have
>> >> nowadays with CPU caches.
>> >>
>> >
>> > Does Xen have any mechanism to isolate multiple guests' operations on
>> > CPU caches?
>>
>> No. All it does is disallow wbinvd for guests not controlling any
>> actual hardware. Perhaps pcommit should at least be limited in
>> a similar way?
>>
>
> But pcommit is a must to make writes persistent on pmem. I'll
> look at how guest wbinvd is limited in Xen.

But we could intercept it on guests _not_ supposed to use it, in order
to simply drop it on the floor.

> Any functions suggested, vmx_wbinvd_intercept()?

A good example, yes.

Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/16 12:02, Stefano Stabellini wrote: > On Wed, 3 Feb 2016, Haozhong Zhang wrote: > > On 02/02/16 17:11, Stefano Stabellini wrote: > > > On Mon, 1 Feb 2016, Haozhong Zhang wrote: [...] > > > > This design treats host NVDIMM devices as ordinary MMIO devices: > > > > (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT) > > > > and drive host NVDIMM devices (implementing block device > > > > interface). Namespaces and file systems on host NVDIMM devices > > > > are handled by Dom0 Linux as well. > > > > > > > > (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its > > > > virtual address space (buf). > > > > > > > > (3) QEMU gets the host physical address of buf, i.e. the host system > > > > physical address that is occupied by /dev/pmem0, and calls Xen > > > > hypercall XEN_DOMCTL_memory_mapping to map it to a DomU. > > > > > > How is this going to work from a security perspective? Is it going to > > > require running QEMU as root in Dom0, which will prevent NVDIMM from > > > working by default on Xen? If so, what's the plan? > > > > > > > Oh, I forgot to address the non-root qemu issues in this design ... > > > > The default user:group of /dev/pmem0 is root:disk, and its permission > > is rw-rw. We could lift the others permission to rw, so that > > non-root QEMU can mmap /dev/pmem0. But it looks too risky. > > Yep, too risky. > > > > Or, we can make a file system on /dev/pmem0, create files on it, set > > the owner of those files to xen-qemuuser-domid$domid, and then pass > > those files to QEMU. In this way, non-root QEMU should be able to > > mmap those files. > > Maybe that would work. Worth adding it to the design, I would like to > read more details on it. > > Also note that QEMU initially runs as root but drops privileges to > xen-qemuuser-domid$domid before the guest is started. 
Initially QEMU > *could* mmap /dev/pmem0 while is still running as root, but then it > wouldn't work for any devices that need to be mmap'ed at run time > (hotplug scenario). > Thanks for this information. I'll test some experimental code and then post a design to address the non-root qemu issue. > > > > > (ACPI part is described in Section 3.3 later) > > > > > > > > Above (1)(2) have already been done in current QEMU. Only (3) is > > > > needed to implement in QEMU. No change is needed in Xen for address > > > > mapping in this design. > > > > > > > > Open: It seems no system call/ioctl is provided by Linux kernel to > > > >get the physical address from a virtual address. > > > >/proc//pagemap provides information of mapping from > > > >VA to PA. Is it an acceptable solution to let QEMU parse this > > > >file to get the physical address? > > > > > > Does it work in a non-root scenario? > > > > > > > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel: > > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. > > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from > > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. > > | Reason: information about PFNs helps in exploiting Rowhammer > > vulnerability. > > > > A possible alternative is to add a new hypercall similar to > > XEN_DOMCTL_memory_mapping but receiving virtual address as the address > > parameter and translating to machine address in the hypervisor. > > That might work. > > > > > > Open: For a large pmem, mmap(2) is very possible to not map all SPA > > > >occupied by pmem at the beginning, i.e. QEMU may not be able to > > > >get all SPA of pmem from buf (in virtual address space) when > > > >calling XEN_DOMCTL_memory_mapping. > > > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the > > > >entire pmem being mmaped? > > > > > > Ditto > > > > > > > No. 
> > If I take the above alternative for the first open, maybe the new
> > hypercall above can inject page faults into dom0 for the unmapped
> > virtual address so as to force dom0 Linux to create the page
> > mapping.
>
> Otherwise you need to use something like the mapcache in QEMU
> (xen-mapcache.c), which admittedly, given its complexity, would be best
> to avoid.
>

Definitely not mapcache-like things. What I want is something similar to
what emulate_gva_to_mfn() in Xen does.

[...]

> > > If we start asking QEMU to build ACPI tables, why should we stop at NFIT
> > > and SSDT?
> >
> > For easing my development of supporting vNVDIMM in Xen ... I mean
> > NFIT and SSDT are the only two tables needed for this purpose, and I'm
> > afraid of breaking existing guests if I completely switch to QEMU for
> > guest ACPI tables.
>
> I realize that my words have been a bit confusing. Not /all/ ACPI
> tables, just all the tables regarding devices for which QEMU is in
> charge (the PCI bus and all devices behind it). Anything related to cpus
> and memory (FADT, MADT, etc)
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/16 02:18, Jan Beulich wrote: > >>> On 03.02.16 at 09:28,wrote: > > On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote: > >> > 3.1 Guest clwb/clflushopt/pcommit Enabling > >> > > >> > The instruction enabling is simple and we do the same work as in > >> > KVM/QEMU. > >> > - All three instructions are exposed to guest via guest cpuid. > >> > - L1 guest pcommit is never intercepted by Xen. > >> > >> I wish there was some watermarks like the PLE has. > >> > >> My fear is that an unfriendly guest can issue sfence all day long > >> flushing out other guests MMC queue (the writes followed by pcommits). > >> Which means that an guest may have degraded performance as their > >> memory writes are being flushed out immediately as if they were > >> being written to UC instead of WB memory. > > > > pcommit takes no parameter and it seems hard to solve this problem > > from hardware for now. And the current VMX does not provide mechanism > > to limit the commit rate of pcommit like PLE for pause. > > > >> In other words - the NVDIMM resource does not provide any resource > >> isolation. However this may not be any different than what we had > >> nowadays with CPU caches. > >> > > > > Does Xen have any mechanism to isolate multiple guests' operations on > > CPU caches? > > No. All it does is disallow wbinvd for guests not controlling any > actual hardware. Perhaps pcommit should at least be limited in > a similar way? > But pcommit is a must that makes writes be persistent on pmem. I'll look at how guest wbinvd is limited in Xen. Any functions suggested, vmx_wbinvd_intercept()? Thanks, Haozhong > >> > - L1 hypervisor is allowed to intercept L2 guest pcommit. > >> > >> clwb? > > > > VMX is not capable to intercept clwb. Any reason to intercept it? > > I don't think so - otherwise normal memory writes might also need > intercepting. 
> Bus bandwidth simply is shared (and CLWB operates
> on a guest virtual address, so can only cause bus traffic for cache
> lines the guest has managed to dirty).
>
> Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote: > > 3. Design of vNVDIMM in Xen > > Thank you for this design! > > > > > Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of > > three parts: > > (1) Guest clwb/clflushopt/pcommit enabling, > > (2) Memory mapping, and > > (3) Guest ACPI emulation. > > > .. MCE? and vMCE? > Specifications on my hand seem not mention much about MCE for NVDIMM, but I remember that NVDIMM driver in Linux kernel does have MCE code. I'll have a look at that code and add this part later. > > > > The rest of this section present the design of each part > > respectively. The basic design principle to reuse existing code in > > Linux NVDIMM driver and QEMU as much as possible. As recent > > discussions in the both Xen and QEMU mailing lists for the v1 patch > > series, alternative designs are also listed below. > > > > > > 3.1 Guest clwb/clflushopt/pcommit Enabling > > > > The instruction enabling is simple and we do the same work as in KVM/QEMU. > > - All three instructions are exposed to guest via guest cpuid. > > - L1 guest pcommit is never intercepted by Xen. > > I wish there was some watermarks like the PLE has. > > My fear is that an unfriendly guest can issue sfence all day long > flushing out other guests MMC queue (the writes followed by pcommits). > Which means that an guest may have degraded performance as their > memory writes are being flushed out immediately as if they were > being written to UC instead of WB memory. > pcommit takes no parameter and it seems hard to solve this problem from hardware for now. And the current VMX does not provide mechanism to limit the commit rate of pcommit like PLE for pause. > In other words - the NVDIMM resource does not provide any resource > isolation. However this may not be any different than what we had > nowadays with CPU caches. > Does Xen have any mechanism to isolate multiple guests' operations on CPU caches? 
> > - L1 hypervisor is allowed to intercept L2 guest pcommit.
>
> clwb?

VMX is not capable of intercepting clwb. Any reason to intercept it?

> >
> > 3.2 Address Mapping
> >
> > 3.2.1 My Design
> >
> > The overview of this design is shown in the following figure.
> >
> > [figure: address-mapping overview — in Dom0, the Linux NVDIMM driver
> > exposes the NVDIMM as /dev/pmem0; QEMU mmap(2)s it (including the
> > label storage area backing vACPI/v_DSM) into its virtual address
> > space, and the SPA range is mapped into the DomU via
> > XEN_DOMCTL_memory_mapping, while the guest-side ACPI/_DSM come from
> > hvmloader/xl]
> >
> > This design treats host NVDIMM devices as ordinary MMIO devices:

Nice.

But it also means you need Xen to 'share' the ranges of an MMIO device.

That is, you may need the dom0 _DSM method to access certain ranges
(the AML code may need to poke there) - and the guest may want to access
those as well.

Currently, we
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 03.02.16 at 09:28, wrote:
> On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
>> > 3.1 Guest clwb/clflushopt/pcommit Enabling
>> >
>> > The instruction enabling is simple and we do the same work as in KVM/QEMU.
>> > - All three instructions are exposed to guest via guest cpuid.
>> > - L1 guest pcommit is never intercepted by Xen.
>>
>> I wish there were some watermarks like the PLE has.
>>
>> My fear is that an unfriendly guest can issue sfence all day long,
>> flushing out other guests' MMC queue (the writes followed by pcommits).
>> Which means that a guest may have degraded performance as their
>> memory writes are being flushed out immediately, as if they were
>> being written to UC instead of WB memory.
>
> pcommit takes no parameter and it seems hard to solve this problem
> in hardware for now. And the current VMX does not provide a mechanism
> to limit the commit rate of pcommit like PLE does for pause.
>
>> In other words - the NVDIMM resource does not provide any resource
>> isolation. However this may not be any different from what we have
>> nowadays with CPU caches.
>>
>
> Does Xen have any mechanism to isolate multiple guests' operations on
> CPU caches?

No. All it does is disallow wbinvd for guests not controlling any
actual hardware. Perhaps pcommit should at least be limited in
a similar way?

>> > - L1 hypervisor is allowed to intercept L2 guest pcommit.
>>
>> clwb?
>
> VMX is not capable of intercepting clwb. Any reason to intercept it?

I don't think so - otherwise normal memory writes might also need
intercepting. Bus bandwidth simply is shared (and CLWB operates
on a guest virtual address, so can only cause bus traffic for cache
lines the guest has managed to dirty).

Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/02/16 09:18, Jan Beulich wrote:
>>> In other words - the NVDIMM resource does not provide any resource
>>> isolation. However this may not be any different than what we had
>>> nowadays with CPU caches.
>>>
>> Does Xen have any mechanism to isolate multiple guests' operations on
>> CPU caches?
> No.

PSR Cache Allocation is supported in Xen 4.6 on supporting hardware, so
the administrator can partition guests if necessary.

~Andrew
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
>>> On 03.02.16 at 15:30, wrote:
> On 03/02/16 09:18, Jan Beulich wrote:
>>>> In other words - the NVDIMM resource does not provide any resource
>>>> isolation. However this may not be any different than what we had
>>>> nowadays with CPU caches.
>>>>
>>> Does Xen have any mechanism to isolate multiple guests' operations on
>>> CPU caches?
>> No.
>
> PSR Cache Allocation is supported in Xen 4.6 on supporting hardware, so
> the administrator can partition guests if necessary.

And if the hardware supports it (which for a while might be more the
exception than the rule).

Jan
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/02/16 12:02, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>> Or, we can make a file system on /dev/pmem0, create files on it, set
>> the owner of those files to xen-qemuuser-domid$domid, and then pass
>> those files to QEMU. In this way, non-root QEMU should be able to
>> mmap those files.
>
> Maybe that would work. Worth adding it to the design, I would like to
> read more details on it.
>
> Also note that QEMU initially runs as root but drops privileges to
> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> *could* mmap /dev/pmem0 while it is still running as root, but then it
> wouldn't work for any devices that need to be mmap'ed at run time
> (hotplug scenario).

This is basically the same problem we have for a bunch of other things,
right? Having xl open a file and then pass it via qmp to qemu should
work in theory, right?

-George
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On Wed, 3 Feb 2016, George Dunlap wrote:
> On 03/02/16 12:02, Stefano Stabellini wrote:
> > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > > Or, we can make a file system on /dev/pmem0, create files on it, set
> > > the owner of those files to xen-qemuuser-domid$domid, and then pass
> > > those files to QEMU. In this way, non-root QEMU should be able to
> > > mmap those files.
> >
> > Maybe that would work. Worth adding it to the design, I would like to
> > read more details on it.
> >
> > Also note that QEMU initially runs as root but drops privileges to
> > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > *could* mmap /dev/pmem0 while it is still running as root, but then it
> > wouldn't work for any devices that need to be mmap'ed at run time
> > (hotplug scenario).
>
> This is basically the same problem we have for a bunch of other things,
> right? Having xl open a file and then pass it via qmp to qemu should
> work in theory, right?

Is there one /dev/pmem? per assignable region? Otherwise it wouldn't be
safe.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/02/16 13:11, Haozhong Zhang wrote: > On 02/03/16 12:02, Stefano Stabellini wrote: >> On Wed, 3 Feb 2016, Haozhong Zhang wrote: >>> On 02/02/16 17:11, Stefano Stabellini wrote: On Mon, 1 Feb 2016, Haozhong Zhang wrote: > [...] > This design treats host NVDIMM devices as ordinary MMIO devices: > (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT) > and drive host NVDIMM devices (implementing block device > interface). Namespaces and file systems on host NVDIMM devices > are handled by Dom0 Linux as well. > > (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its > virtual address space (buf). > > (3) QEMU gets the host physical address of buf, i.e. the host system > physical address that is occupied by /dev/pmem0, and calls Xen > hypercall XEN_DOMCTL_memory_mapping to map it to a DomU. How is this going to work from a security perspective? Is it going to require running QEMU as root in Dom0, which will prevent NVDIMM from working by default on Xen? If so, what's the plan? >>> Oh, I forgot to address the non-root qemu issues in this design ... >>> >>> The default user:group of /dev/pmem0 is root:disk, and its permission >>> is rw-rw. We could lift the others permission to rw, so that >>> non-root QEMU can mmap /dev/pmem0. But it looks too risky. >> Yep, too risky. >> >> >>> Or, we can make a file system on /dev/pmem0, create files on it, set >>> the owner of those files to xen-qemuuser-domid$domid, and then pass >>> those files to QEMU. In this way, non-root QEMU should be able to >>> mmap those files. >> Maybe that would work. Worth adding it to the design, I would like to >> read more details on it. >> >> Also note that QEMU initially runs as root but drops privileges to >> xen-qemuuser-domid$domid before the guest is started. Initially QEMU >> *could* mmap /dev/pmem0 while is still running as root, but then it >> wouldn't work for any devices that need to be mmap'ed at run time >> (hotplug scenario). >> > Thanks for this information. 
I'll test some experimental code and then > post a design to address the non-root qemu issue. > > (ACPI part is described in Section 3.3 later) > > Above (1)(2) have already been done in current QEMU. Only (3) is > needed to implement in QEMU. No change is needed in Xen for address > mapping in this design. > > Open: It seems no system call/ioctl is provided by Linux kernel to >get the physical address from a virtual address. >/proc//pagemap provides information of mapping from >VA to PA. Is it an acceptable solution to let QEMU parse this >file to get the physical address? Does it work in a non-root scenario? >>> Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel: >>> | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. >>> | In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from >>> | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. >>> | Reason: information about PFNs helps in exploiting Rowhammer >>> vulnerability. >>> >>> A possible alternative is to add a new hypercall similar to >>> XEN_DOMCTL_memory_mapping but receiving virtual address as the address >>> parameter and translating to machine address in the hypervisor. >> That might work. >> >> > Open: For a large pmem, mmap(2) is very possible to not map all SPA >occupied by pmem at the beginning, i.e. QEMU may not be able to >get all SPA of pmem from buf (in virtual address space) when >calling XEN_DOMCTL_memory_mapping. >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the >entire pmem being mmaped? Ditto >>> No. If I take the above alternative for the first open, maybe the new >>> hypercall above can inject page faults into dom0 for the unmapped >>> virtual address so as to enforce dom0 Linux to create the page >>> mapping. >> Otherwise you need to use something like the mapcache in QEMU >> (xen-mapcache.c), which admittedly, given its complexity, would be best >> to avoid. >> > Definitely not mapcache like things. 
What I want is something similar to > what emulate_gva_to_mfn() in Xen does. Please not quite like that. It would restrict this to only working in a PV dom0. MFNs are an implementation detail. Interfaces should take GFNs which are consistent logical meaning between PV and HVM domains. As an introduction, http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/xen/mm.h;h=a795dd6001eff7c5dd942bbaf153e3efa5202318;hb=refs/heads/staging#l8 We also need to consider the Xen side security. Currently a domain may be given privilege to map an MMIO range. IIRC, this allows the emulator domain to make mappings for the guest, and for the guest to make mappings itself. With PMEM, we can't allow a domain to make
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/02/16 15:22, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, George Dunlap wrote:
>> On 03/02/16 12:02, Stefano Stabellini wrote:
>>> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>>>> Or, we can make a file system on /dev/pmem0, create files on it, set
>>>> the owner of those files to xen-qemuuser-domid$domid, and then pass
>>>> those files to QEMU. In this way, non-root QEMU should be able to
>>>> mmap those files.
>>>
>>> Maybe that would work. Worth adding it to the design, I would like to
>>> read more details on it.
>>>
>>> Also note that QEMU initially runs as root but drops privileges to
>>> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
>>> *could* mmap /dev/pmem0 while it is still running as root, but then it
>>> wouldn't work for any devices that need to be mmap'ed at run time
>>> (hotplug scenario).
>>
>> This is basically the same problem we have for a bunch of other things,
>> right? Having xl open a file and then pass it via qmp to qemu should
>> work in theory, right?
>
> Is there one /dev/pmem? per assignable region? Otherwise it wouldn't be
> safe.

If I understood Haozhong's description right, you'd be passing through
the entirety of one thing that Linux gave you. At the moment that's one
/dev/pmemX, which at the moment corresponds to one region as specified
in the ACPI tables.

I understood his design going forward to mean that it would rely on
Linux to do any further partitioning within regions if that was desired;
in which case there would again be a single file that qemu would have
access to.

-George
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 03/02/16 09:13, Jan Beulich wrote: On 03.02.16 at 08:00,wrote: >> On 02/02/16 17:11, Stefano Stabellini wrote: >>> Once upon a time somebody made the decision that ACPI tables >>> on Xen should be static and included in hvmloader. That might have been >>> a bad decision but at least it was coherent. Loading only *some* tables >>> from QEMU, but not others, it feels like an incomplete design to me. >>> >>> For example, QEMU is currently in charge of emulating the PCI bus, why >>> shouldn't it be QEMU that generates the PRT and MCFG? >>> >> To Keir, Jan and Andrew: >> >> Are there anything related to ACPI that must be done (or are better to >> be done) in hvmloader? > Some of the static tables (FADT and HPET come to mind) likely would > better continue to live in hvmloader. MCFG (for example) coming from > qemu, otoh, would be quite natural (and would finally allow MMCFG > support for guests in the first place). I.e. ... > >>> I prefer switching to QEMU building all ACPI tables for devices that it >>> is emulating. However this alternative is good too because it is >>> coherent with the current design. >> I would prefer to this one if the final conclusion is that only one >> agent should be allowed to build guest ACPI. As I said above, it looks >> like a big change to switch to QEMU for all ACPI tables and I'm afraid >> it would break some existing guests. > ... I indeed think that tables should come from qemu for components > living in qemu, and from hvmloader for components coming from Xen. I agree. There has to be a single entity responsible for collating the eventual ACPI handed to the guest, and this is definitely HVMLoader. However, it is correct that Qemu create the ACPI tables for the devices it emulates for the guest. We need to agree on a mechanism whereby each entity can provide their own subset of the ACPI tables to HVMLoader, and have HVMLoader present the final set properly to the VM. 
There is an existing use case of passing the host SLIC table to a VM, for
OEM versions of Windows. I believe this is achieved with
HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
inflexible and could probably do with being made a little more generic.

~Andrew
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/16 14:09, Andrew Cooper wrote: > On 03/02/16 09:13, Jan Beulich wrote: > On 03.02.16 at 08:00,wrote: > >> On 02/02/16 17:11, Stefano Stabellini wrote: > >>> Once upon a time somebody made the decision that ACPI tables > >>> on Xen should be static and included in hvmloader. That might have been > >>> a bad decision but at least it was coherent. Loading only *some* tables > >>> from QEMU, but not others, it feels like an incomplete design to me. > >>> > >>> For example, QEMU is currently in charge of emulating the PCI bus, why > >>> shouldn't it be QEMU that generates the PRT and MCFG? > >>> > >> To Keir, Jan and Andrew: > >> > >> Are there anything related to ACPI that must be done (or are better to > >> be done) in hvmloader? > > Some of the static tables (FADT and HPET come to mind) likely would > > better continue to live in hvmloader. MCFG (for example) coming from > > qemu, otoh, would be quite natural (and would finally allow MMCFG > > support for guests in the first place). I.e. ... > > > >>> I prefer switching to QEMU building all ACPI tables for devices that it > >>> is emulating. However this alternative is good too because it is > >>> coherent with the current design. > >> I would prefer to this one if the final conclusion is that only one > >> agent should be allowed to build guest ACPI. As I said above, it looks > >> like a big change to switch to QEMU for all ACPI tables and I'm afraid > >> it would break some existing guests. > > ... I indeed think that tables should come from qemu for components > > living in qemu, and from hvmloader for components coming from Xen. > > I agree. > > There has to be a single entity responsible for collating the eventual > ACPI handed to the guest, and this is definitely HVMLoader. > > However, it is correct that Qemu create the ACPI tables for the devices > it emulates for the guest. 
> > We need to agree on a mechanism whereby each entity can provide their > own subset of the ACPI tables to HVMLoader, and have HVMLoader present > the final set properly to the VM. > > There is an existing usecase of passing the Host SLIC table to a VM, for > OEM Versions of Windows. I believe this is achieved with > HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little > inflexible and could probably do with being made a little more generic. > Yes, that is what one of my v1 patches does ([PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu). It extends the existing construct_passthrough_tables() to get the address and size of acpi tables from its parameters (a pair of xenstore keys) rather than the hardcoded ones.
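The xenstore-key mechanism discussed above can be sketched as follows. This is a hedged illustration, not hvmloader code: the key names follow HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but `parse_acpi_pt` and its hex-string interface are hypothetical stand-ins for the xenstore reads that construct_passthrough_tables() would perform before copying the pass-through tables into guest memory.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Illustrative parsing of the ACPI pass-through xenstore key values
 * (HVM_XS_ACPI_PT_ADDRESS / HVM_XS_ACPI_PT_LENGTH hold hex strings).
 * The actual xenstore reads are elided; this sketch only shows how an
 * (address, length) pair would be recovered and sanity-checked. */
static int parse_acpi_pt(const char *addr_str, const char *len_str,
                         uint64_t *addr, uint32_t *len)
{
    char *end;

    *addr = strtoull(addr_str, &end, 16);
    if (*end != '\0')
        return -1;
    *len = (uint32_t)strtoul(len_str, &end, 16);
    if (*end != '\0' || *len == 0)
        return -1;
    return 0;
}
```

With the pair recovered, hvmloader can treat the range as an opaque blob of concatenated ACPI tables to copy and link, which is what makes the mechanism easy to generalise beyond SLIC.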
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On Wed, Feb 03, 2016 at 03:22:59PM +, Stefano Stabellini wrote: > On Wed, 3 Feb 2016, George Dunlap wrote: > > On 03/02/16 12:02, Stefano Stabellini wrote: > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote: > > >> Or, we can make a file system on /dev/pmem0, create files on it, set > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass > > >> those files to QEMU. In this way, non-root QEMU should be able to > > >> mmap those files. > > > > > > Maybe that would work. Worth adding it to the design, I would like to > > > read more details on it. > > > > > > Also note that QEMU initially runs as root but drops privileges to > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU > > > *could* mmap /dev/pmem0 while it is still running as root, but then it > > > wouldn't work for any devices that need to be mmap'ed at run time > > > (hotplug scenario). > > > > This is basically the same problem we have for a bunch of other things, > > right? Having xl open a file and then pass it via qmp to qemu should > > work in theory, right? > > Is there one /dev/pmem? per assignable region? Otherwise it wouldn't be > safe. Can be - which may be interleaved on multiple NVDIMMs. But we would operate on files (on the /dev/pmem which has a DAX-enabled filesystem).
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/16 10:47, Konrad Rzeszutek Wilk wrote: > > > > > Open: It seems no system call/ioctl is provided by Linux kernel to > > > > >get the physical address from a virtual address. > > > > >/proc//pagemap provides information of mapping from > > > > >VA to PA. Is it an acceptable solution to let QEMU parse this > > > > >file to get the physical address? > > > > > > > > Does it work in a non-root scenario? > > > > > > > > > > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel: > > > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get > > > PFNs. > > > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from > > > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. > > > | Reason: information about PFNs helps in exploiting Rowhammer > > > vulnerability. > > Ah right. > > > > > > A possible alternative is to add a new hypercall similar to > > > XEN_DOMCTL_memory_mapping but receiving virtual address as the address > > > parameter and translating to machine address in the hypervisor. > > > > That might work. > > That won't work. > > This is a userspace VMA - which means the once the ioctl is done we swap > to kernel virtual addresses. Now we may know that the prior cr3 has the > userspace virtual address and walk it down - but what if the domain > that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere > inside the kernel. > > Which means this hypercall would need to know the Linux kernel task structure > to find this. > Thanks for pointing out this. Really it's not a workable solution. > May I propose another solution - an stacking driver (similar to loop). You > setup it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img > created). > Then mmap the /dev/mapper/guest.img - all of the operations are the same - > except > it may have an extra ioctl - get_pfns - which would provide the data in > similar > form to pagemap.txt. > I'll have a look at this, thanks! 
> But folks will then ask - why don't you just use pagemap? Could the pagemap > have an extra security capability check? One that can be set for > QEMU? > Basically, for the concern on whether non-root QEMU could work, as in Stefano's comments. > > > > > > > > > Open: For a large pmem, mmap(2) is very possible to not map all SPA > > > > >occupied by pmem at the beginning, i.e. QEMU may not be able to > > > > >get all SPA of pmem from buf (in virtual address space) when > > > > >calling XEN_DOMCTL_memory_mapping. > > > > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the > > > > >entire pmem being mmaped? > > > > > > > > Ditto > > > > > > > > > > No. If I take the above alternative for the first open, maybe the new > > > hypercall above can inject page faults into dom0 for the unmapped > > > virtual address so as to enforce dom0 Linux to create the page > > > mapping. > > Ugh. That sounds hacky. And you wouldn't necessarily be safe. > Imagine that the system admin decides to defrag the /dev/pmem filesystem. > Or move the files (disk images) around. If they do that - we may > still have the guest mapped to system addresses which may contain filesystem > metadata now, or a different guest image. We MUST mlock or lock the file > during the duration of the guest. > > So mlocking or locking the mmaped file, or other ways to 'pin' the mmaped file on pmem is a necessity. Thanks, Haozhong
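For reference, the pagemap format being debated here (Documentation/vm/pagemap.txt) is one 64-bit entry per virtual page: bits 0-54 hold the PFN, bit 63 the present flag. A minimal decode sketch, assuming 4 KiB pages; `pagemap_entry_to_paddr` is an illustrative helper, not code from QEMU or Xen:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PM_PFN_MASK ((1ULL << 55) - 1)   /* bits 0-54: page frame number */
#define PM_PRESENT  (1ULL << 63)         /* bit 63: page present in RAM */

/* Decode one /proc/<pid>/pagemap entry; returns 0 and fills *paddr on
 * success, -1 if the page is not present or the PFN is hidden (since
 * Linux 4.2 the PFN field reads as zero without CAP_SYS_ADMIN, which
 * is exactly the non-root problem discussed above). */
static int pagemap_entry_to_paddr(uint64_t entry, uint64_t vaddr,
                                  uint64_t *paddr)
{
    uint64_t pfn = entry & PM_PFN_MASK;

    if (!(entry & PM_PRESENT) || pfn == 0)
        return -1;
    /* Physical address = PFN * page size + offset within the page. */
    *paddr = (pfn << PAGE_SHIFT) | (vaddr & ((1ULL << PAGE_SHIFT) - 1));
    return 0;
}
```

A caller would pread(2) 8 bytes at offset `(vaddr >> PAGE_SHIFT) * 8` of /proc/self/pagemap to obtain `entry`; Konrad's hypothetical get_pfns ioctl would return data in the same shape.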
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/16 14:20, Andrew Cooper wrote: > > (ACPI part is described in Section 3.3 later) > > > > Above (1)(2) have already been done in current QEMU. Only (3) is > > needed to implement in QEMU. No change is needed in Xen for address > > mapping in this design. > > > > Open: It seems no system call/ioctl is provided by Linux kernel to > >get the physical address from a virtual address. > >/proc//pagemap provides information of mapping from > >VA to PA. Is it an acceptable solution to let QEMU parse this > >file to get the physical address? > Does it work in a non-root scenario? > > >>> Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel: > >>> | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get > >>> PFNs. > >>> | In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from > >>> | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. > >>> | Reason: information about PFNs helps in exploiting Rowhammer > >>> vulnerability. > >>> > >>> A possible alternative is to add a new hypercall similar to > >>> XEN_DOMCTL_memory_mapping but receiving virtual address as the address > >>> parameter and translating to machine address in the hypervisor. > >> That might work. > >> > >> > > Open: For a large pmem, mmap(2) is very possible to not map all SPA > >occupied by pmem at the beginning, i.e. QEMU may not be able to > >get all SPA of pmem from buf (in virtual address space) when > >calling XEN_DOMCTL_memory_mapping. > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the > >entire pmem being mmaped? > Ditto > > >>> No. If I take the above alternative for the first open, maybe the new > >>> hypercall above can inject page faults into dom0 for the unmapped > >>> virtual address so as to enforce dom0 Linux to create the page > >>> mapping. > >> Otherwise you need to use something like the mapcache in QEMU > >> (xen-mapcache.c), which admittedly, given its complexity, would be best > >> to avoid. 
> >> > > Definitely not mapcache like things. What I want is something similar to > > what emulate_gva_to_mfn() in Xen does. > > Please not quite like that. It would restrict this to only working in a > PV dom0. > > MFNs are an implementation detail. I don't get this point. What do you mean by 'implementation detail'? Architectural differences? > Interfaces should take GFNs which > have a consistent logical meaning between PV and HVM domains. > > As an introduction, > http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/xen/mm.h;h=a795dd6001eff7c5dd942bbaf153e3efa5202318;hb=refs/heads/staging#l8 > > We also need to consider the Xen side security. Currently a domain may > be given privilege to map an MMIO range. IIRC, this allows the emulator > domain to make mappings for the guest, and for the guest to make > mappings itself. With PMEM, we can't allow a domain to make mappings > itself because it could end up mapping resources which belong to another > domain. We probably need an intermediate level which only permits an > emulator to make the mappings. > Agree, this hypercall should not be called by arbitrary domains. Is there any existing mechanism in Xen to restrict callers of hypercalls? > > > > [...] > If we start asking QEMU to build ACPI tables, why should we stop at NFIT > and SSDT? > >>> for easing my development of supporting vNVDIMM in Xen ... I mean > >>> NFIT and SSDT are the only two tables needed for this purpose and I'm > >>> afraid to break existing guests if I completely switch to QEMU for > >>> guest ACPI tables. > >> I realize that my words have been a bit confusing. Not /all/ ACPI > >> tables, just all the tables regarding devices for which QEMU is in > >> charge (the PCI bus and all devices behind it). Anything related to cpus > >> and memory (FADT, MADT, etc) would still be left to hvmloader. > > OK, then it's clear for me. From Jan's reply, at least MCFG is from > > QEMU. 
I'll look at whether other PCI related tables are also from QEMU > > or similar to those in QEMU. If yes, then it looks reasonable to let > > QEMU generate them. > > It is entirely likely that the current split of sources of ACPI tables > is incorrect. We should also see what can be done about fixing that. > How about Jan's comment | tables should come from qemu for components living in qemu, and from | hvmloader for components coming from Xen Thanks, Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/16 15:22, Stefano Stabellini wrote: > On Wed, 3 Feb 2016, George Dunlap wrote: > > On 03/02/16 12:02, Stefano Stabellini wrote: > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote: > > >> Or, we can make a file system on /dev/pmem0, create files on it, set > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass > > >> those files to QEMU. In this way, non-root QEMU should be able to > > >> mmap those files. > > > > > > Maybe that would work. Worth adding it to the design, I would like to > > > read more details on it. > > > > > > Also note that QEMU initially runs as root but drops privileges to > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU > > > *could* mmap /dev/pmem0 while it is still running as root, but then it > > > wouldn't work for any devices that need to be mmap'ed at run time > > > (hotplug scenario). > > > > This is basically the same problem we have for a bunch of other things, > > right? Having xl open a file and then pass it via qmp to qemu should > > work in theory, right? > > Is there one /dev/pmem? per assignable region? Yes. BTW, I'm wondering whether and how non-root qemu works with xl disk configuration that is going to access a host block device, e.g. disk = [ '/dev/sdb,,hda' ] If that works with non-root qemu, I may take a similar solution for pmem. Thanks, Haozhong > Otherwise it wouldn't be safe.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
> > > > Open: It seems no system call/ioctl is provided by Linux kernel to > > > >get the physical address from a virtual address. > > > >/proc//pagemap provides information of mapping from > > > >VA to PA. Is it an acceptable solution to let QEMU parse this > > > >file to get the physical address? > > > > > > Does it work in a non-root scenario? > > > > > > > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel: > > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs. > > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM. Starting from > > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN. > > | Reason: information about PFNs helps in exploiting Rowhammer > > vulnerability. Ah right. > > > > A possible alternative is to add a new hypercall similar to > > XEN_DOMCTL_memory_mapping but receiving virtual address as the address > > parameter and translating to machine address in the hypervisor. > > That might work. That won't work. This is a userspace VMA - which means once the ioctl is done we swap to kernel virtual addresses. Now we may know that the prior cr3 has the userspace virtual address and walk it down - but what if the domain that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere inside the kernel. Which means this hypercall would need to know the Linux kernel task structure to find this. May I propose another solution - a stacking driver (similar to loop). You set it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img created). Then mmap the /dev/mapper/guest.img - all of the operations are the same - except it may have an extra ioctl - get_pfns - which would provide the data in similar form to pagemap.txt. But folks will then ask - why don't you just use pagemap? Could the pagemap have an extra security capability check? One that can be set for QEMU? 
> > > > > > Open: For a large pmem, mmap(2) is very possible to not map all SPA > > > >occupied by pmem at the beginning, i.e. QEMU may not be able to > > > >get all SPA of pmem from buf (in virtual address space) when > > > >calling XEN_DOMCTL_memory_mapping. > > > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the > > > >entire pmem being mmaped? > > > > > > Ditto > > > > > > > No. If I take the above alternative for the first open, maybe the new > > hypercall above can inject page faults into dom0 for the unmapped > > virtual address so as to enforce dom0 Linux to create the page > > mapping. Ugh. That sounds hacky. And you wouldn't necessarily be safe. Imagine that the system admin decides to defrag the /dev/pmem filesystem. Or move the files (disk images) around. If they do that - we may still have the guest mapped to system addresses which may contain filesystem metadata now, or a different guest image. We MUST mlock or lock the file during the duration of the guest.
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
On 02/03/16 05:38, Jan Beulich wrote: > >>> On 03.02.16 at 13:22, wrote: > > On 02/03/16 02:18, Jan Beulich wrote: > >> >>> On 03.02.16 at 09:28, wrote: > >> > On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote: > >> >> > 3.1 Guest clwb/clflushopt/pcommit Enabling > >> >> > > >> >> > The instruction enabling is simple and we do the same work as in > > KVM/QEMU. > >> >> > - All three instructions are exposed to guest via guest cpuid. > >> >> > - L1 guest pcommit is never intercepted by Xen. > >> >> > >> >> I wish there was a watermark like the PLE has. > >> >> > >> >> My fear is that an unfriendly guest can issue sfence all day long > >> >> flushing out other guests' MMC queue (the writes followed by pcommits). > >> >> Which means that a guest may have degraded performance as their > >> >> memory writes are being flushed out immediately as if they were > >> >> being written to UC instead of WB memory. > >> > > >> > pcommit takes no parameter and it seems hard to solve this problem > >> > from hardware for now. And the current VMX does not provide mechanism > >> > to limit the commit rate of pcommit like PLE for pause. > >> > > >> >> In other words - the NVDIMM resource does not provide any resource > >> >> isolation. However this may not be any different than what we had > >> >> nowadays with CPU caches. > >> >> > >> > > >> > Does Xen have any mechanism to isolate multiple guests' operations on > >> > CPU caches? > >> > >> No. All it does is disallow wbinvd for guests not controlling any > >> actual hardware. Perhaps pcommit should at least be limited in > >> a similar way? > >> > > > > But pcommit is a must that makes writes persistent on pmem. I'll > > look at how guest wbinvd is limited in Xen. > > But we could intercept it on guests _not_ supposed to use it, in order > to simply drop it on the floor. > Oh yes! 
We can drop pcommit from domains not having access to host NVDIMM, just like vmx_wbinvd_intercept() dropping wbinvd from domains not accessing host iomem and ioport. > > Any functions suggested, vmx_wbinvd_intercept()? > > A good example, yes. > > Jan > Thanks, Haozhong
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
> > 2.2 vNVDIMM Implementation in KVM/QEMU > > > > (1) Address Mapping > > > > As described before, the host Linux NVDIMM driver provides a block > > device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM > > region. QEMU can then mmap(2) that device into its virtual address > > space (buf). QEMU is responsible to find a proper guest physical > > address space range that is large enough to hold /dev/pmem0. Then > > QEMU passes the virtual address of mmapped buf to a KVM API > > KVM_SET_USER_MEMORY_REGION that maps in EPT the host physical > > address range of buf to the guest physical address space range where > > the virtual pmem device will be. > > > > In this way, all guest writes/reads on the virtual pmem device are > > applied directly to the host one. > > > > Besides, the above implementation also allows backing a virtual pmem > > device by a mmapped regular file or a piece of ordinary ram. > > What's the point of backing pmem with ordinary ram? I can buy-in > the value of the file-backed option which although slower does sustain > the persistency attribute. However with the ram-backed method there's > no persistency so it violates guest expectation. Containers - like the Intel Clear Containers? You can use this work to stitch an exploded initramfs on a tmpfs right in the guest. And you could do that for multiple guests. Granted this has nothing to do with pmem, but this work would allow one to set up containers this way.
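The KVM_SET_USER_MEMORY_REGION step described above amounts to filling a region descriptor with the guest physical base, the size, and the host virtual address of the mmapped buf. The struct below mirrors struct kvm_userspace_memory_region from <linux/kvm.h> (redeclared locally so the sketch stands alone); the helper and its alignment checks are illustrative, not QEMU code:

```c
#include <assert.h>
#include <stdint.h>

#define PMEM_PAGE_SIZE 4096ULL

/* Mirrors struct kvm_userspace_memory_region from <linux/kvm.h>. */
struct user_memory_region {
    uint32_t slot;
    uint32_t flags;
    uint64_t guest_phys_addr;   /* where the vNVDIMM appears to the guest */
    uint64_t memory_size;       /* bytes, page-aligned */
    uint64_t userspace_addr;    /* QEMU's mmap(2) of /dev/pmem0 (buf) */
};

/* Fill the region descriptor QEMU would pass to the
 * KVM_SET_USER_MEMORY_REGION ioctl; returns -1 if any field is not
 * page-aligned, since KVM rejects such regions. */
static int make_pmem_region(struct user_memory_region *r, uint32_t slot,
                            uint64_t gpa, uint64_t hva, uint64_t size)
{
    if ((gpa | hva | size) & (PMEM_PAGE_SIZE - 1))
        return -1;
    r->slot = slot;
    r->flags = 0;
    r->guest_phys_addr = gpa;
    r->memory_size = size;
    r->userspace_addr = hva;
    return 0;
}
```

KVM then installs EPT entries translating [gpa, gpa+size) to whatever host pages back [hva, hva+size) - the pmem pages in this design, or file/ram pages for the alternative backings discussed above.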
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
> 3. Design of vNVDIMM in Xen Thank you for this design! > > Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of > three parts: > (1) Guest clwb/clflushopt/pcommit enabling, > (2) Memory mapping, and > (3) Guest ACPI emulation. .. MCE? and vMCE? > > The rest of this section presents the design of each part > respectively. The basic design principle is to reuse existing code in > Linux NVDIMM driver and QEMU as much as possible. Following recent > discussions in both the Xen and QEMU mailing lists for the v1 patch > series, alternative designs are also listed below. > > > 3.1 Guest clwb/clflushopt/pcommit Enabling > > The instruction enabling is simple and we do the same work as in KVM/QEMU. > - All three instructions are exposed to guest via guest cpuid. > - L1 guest pcommit is never intercepted by Xen. I wish there was a watermark like the PLE has. My fear is that an unfriendly guest can issue sfence all day long flushing out other guests' MMC queue (the writes followed by pcommits). Which means that a guest may have degraded performance as their memory writes are being flushed out immediately as if they were being written to UC instead of WB memory. In other words - the NVDIMM resource does not provide any resource isolation. However this may not be any different than what we had nowadays with CPU caches. > - L1 hypervisor is allowed to intercept L2 guest pcommit. clwb? > > > 3.2 Address Mapping > > 3.2.1 My Design > > The overview of this design is shown in the following figure. 
> [Figure: ASCII architecture diagram, mangled in the archive. It showed: in Dom0, QEMU mmap(2)s /dev/pmem0 (driven by the Dom0 Linux NVDIMM driver) into its virtual address space (buf), alongside the label storage area, vACPI and v_DSM; the host SPA range backing buf is mapped into the DomU's guest SPA range via XEN_DOMCTL_memory_mapping; in DomU, the guest-visible ACPI and _DSM come from hvmloader/xl; Xen sits between both domains and the host NVDIMM hardware.]
>
> This design treats host NVDIMM devices as ordinary MMIO devices: Nice. But it also means you need Xen to 'share' the ranges of an MMIO device. That is you may need dom0 _DSM method to access certain ranges (the AML code may need to poke there) - and the guest may want to access those as well. And keep in mind that this NVDIMM management may not need to be always in initial domain. As in you could have NVDIMM device drivers that would carve out the ranges to guests. > (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT) > and drive host NVDIMM devices (implementing block device > interface). Namespaces and file systems on host NVDIMM devices > are handled by Dom0 Linux as well. > > (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its > virtual address space (buf). > > (3) QEMU gets the host physical address of buf, i.e. the host system > physical address that is occupied by /dev/pmem0, and calls Xen > hypercall XEN_DOMCTL_memory_mapping to map it to a DomU. > > (ACPI part is described in Section 3.3 later) > > Above (1)(2)
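Step (3) above hands Xen a (guest frame, machine frame, count) triple: XEN_DOMCTL_memory_mapping, as wrapped by libxc's xc_domain_memory_mapping(), takes first_gfn/first_mfn/nr_mfns rather than byte addresses. A sketch of the byte-to-frame conversion QEMU would have to do, assuming 4 KiB frames; the helper name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

#define XEN_PAGE_SHIFT 12
#define XEN_PAGE_SIZE  (1UL << XEN_PAGE_SHIFT)

/* Convert a byte-granular mapping request into the frame-number form
 * taken by XEN_DOMCTL_memory_mapping. Returns the number of frames
 * (nr_mfns), or 0 if any of the addresses or the length is not
 * page-aligned. */
static unsigned long pmem_mapping_frames(uint64_t guest_paddr,
                                         uint64_t host_paddr, uint64_t len,
                                         unsigned long *first_gfn,
                                         unsigned long *first_mfn)
{
    if ((guest_paddr | host_paddr | len) & (XEN_PAGE_SIZE - 1))
        return 0;
    *first_gfn = (unsigned long)(guest_paddr >> XEN_PAGE_SHIFT);
    *first_mfn = (unsigned long)(host_paddr >> XEN_PAGE_SHIFT);
    return (unsigned long)(len >> XEN_PAGE_SHIFT);
}
```

The open questions in the thread are exactly about the two inputs: host_paddr must come from somewhere (pagemap, a new hypercall, or a stacking driver), and the SPA range must stay pinned for as long as the mapping exists.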
Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen
> From: Zhang, Haozhong > Sent: Tuesday, February 02, 2016 3:53 PM > > On 02/02/16 15:48, Tian, Kevin wrote: > > > From: Zhang, Haozhong > > > Sent: Tuesday, February 02, 2016 3:39 PM > > > > > > > btw, how is persistency guaranteed in KVM/QEMU, across guest > > > > power off/on? I guess since the Qemu process is killed the allocated pmem > > > > will be freed so you may switch to file-backed method to keep > > > > persistency (however copy would take time for large pmem trunk). Or > > > > will you find some way to keep pmem managed separately from the > > > > qemu life-cycle (then pmem is not efficiently reused)? > > > > > > > > > > It all depends on guests themselves. clwb/clflushopt/pcommit > > > instructions are exposed to guest that are used by guests to make > > > writes to pmem persistent. > > > > > > > I meant from guest p.o.v, a range of pmem should be persistent > > across VM power on/off, i.e. the content needs to be maintained > > somewhere so guest can get it at next power on... > > > > Thanks > > Kevin > > It's just like what we do for guest disk: as long as we always assign > the same host pmem device or the same files on file systems on a host > pmem device to the guest, the guest can find its last data on pmem. > > Haozhong This is the detail which I'd like to learn. If it's Qemu that requests host pmem and then frees it on exit, the very same pmem may be allocated to another process later. How do you achieve the 'as long as'? Thanks Kevin