Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-22 Thread Jan Beulich
>>> On 22.04.16 at 14:54,  wrote:
> On 04/22/16 06:36, Jan Beulich wrote:
>> >>> On 22.04.16 at 14:26,  wrote:
>> > On 04/22/16 04:53, Jan Beulich wrote:
>> >> Perhaps I have got confused by the back and forth. If we're to
>> >> use struct page_info, then everything should be following a
>> >> similar flow to what happens for normal RAM, i.e. normal page
>> >> allocation, and normal assignment of pages to guests.
>> >>
>> > 
>> > I'll follow the normal assignment of pages to guests for pmem, but not
>> > the normal page allocation. Because allocation is difficult to always
>> > get the same pmem area for the same guest every time. It still needs
>> > input from others (e.g. toolstack) that can provide the exact address.
>> 
>> Understood.
>> 
>> > Because the address is now not decided by xen hypervisor, certain
>> > permission track is needed. For this part, we will re-use the existing
>> > one for MMIO. Directly using existing range struct for pmem may
>> > consume too much space, so I proposed to choose different data
>> > structures or put limitation on exiting range struct to avoid or
>> > mitigate this problem.
>> 
>> Why would these consume too much space? I'd expect there to be
>> just one or very few chunks, just like is the case for MMIO ranges
>> on devices.
> 
> As Ian Jackson indicated [1], there are several cases that a pmem page
> can be accessed from more than one domains. Then every domain involved
> needs a range struct to track its access permission to that pmem
> page. In a worst case, e.g. the first of every two contiguous pages on
> a pmem are assigned to a domain and are shared with all other domains,
> though the size of range struct for a single domain maybe acceptable,
> the total will still be very large.

Everything Ian has mentioned there is what normal RAM pages can
also be used for, yet as you have yourself said (still visible in
context above) you mean to only do allocation differently. Hence the
permission tracking you talk of should be necessary only for the
owning domain (to be validated during allocation); everything else
should follow the normal life cycle of a RAM page.
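
As a rough illustration of that split (a sketch only; the names
pmem_owner_permitted() and assign_pmem_pages() below are hypothetical,
not existing Xen interfaces):

/* Sketch: one permission check against the owning domain at the time
 * the toolstack-specified pmem range is turned into domain pages;
 * afterwards the pages are handled like ordinary RAM. */
static int assign_pmem_pages(struct domain *d,
                             unsigned long smfn, unsigned long nr)
{
    if ( !pmem_owner_permitted(d, smfn, smfn + nr - 1) )
        return -EPERM;

    /* From here on: normal page_info ownership, refcounting and
     * teardown, exactly as for RAM assigned to the guest. */
    return 0;
}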

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-22 Thread Jan Beulich
>>> On 22.04.16 at 14:26,  wrote:
> On 04/22/16 04:53, Jan Beulich wrote:
>> Perhaps I have got confused by the back and forth. If we're to
>> use struct page_info, then everything should be following a
>> similar flow to what happens for normal RAM, i.e. normal page
>> allocation, and normal assignment of pages to guests.
>>
> 
> I'll follow the normal assignment of pages to guests for pmem, but not
> the normal page allocation. Because allocation is difficult to always
> get the same pmem area for the same guest every time. It still needs
> input from others (e.g. toolstack) that can provide the exact address.

Understood.

> Because the address is now not decided by xen hypervisor, certain
> permission track is needed. For this part, we will re-use the existing
> one for MMIO. Directly using existing range struct for pmem may
> consume too much space, so I proposed to choose different data
> structures or put limitation on exiting range struct to avoid or
> mitigate this problem.

Why would these consume too much space? I'd expect there to be
just one or very few chunks, just like is the case for MMIO ranges
on devices.

Jan

> The data structure change will be applied only
> to pmem, and only the code that manipulate the range structs
> (rangeset_*) will be changed for pmem. So for the permission tracking
> part, it will still follow the exiting one.
> 
> Haozhong






Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-22 Thread Haozhong Zhang
On 04/22/16 04:53, Jan Beulich wrote:
> >>> On 22.04.16 at 12:16,  wrote:
> > On 04/22/16 02:24, Jan Beulich wrote:
> > [..]
> >> >> >> Well, using existing range struct to manage guest access permissions
> >> >> >> to nvdimm could consume too much space which could not fit in either
> >> >> >> memory or nvdimm. If the above solution looks really error-prone,
> >> >> >> perhaps we can still come back to the existing one and restrict the
> >> >> >> number of range structs each domain could have for nvdimm
> >> >> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
> >> >> >> nvdimm, though it may reject nvdimm mapping that is terribly
> >> >> >> fragmented.
> >> >> > 
> >> >> > Hi Jan,
> >> >> > 
> >> >> > Any comments for this?
> >> >> 
> >> >> Well, nothing new, i.e. my previous opinion on the old proposal didn't
> >> >> change. I'm really opposed to any artificial limitations here, as I am 
> >> >> to
> >> >> any secondary (and hence error prone) code paths. IOW I continue
> >> >> to think that there's no reasonable alternative to re-using the existing
> >> >> memory management infrastructure for at least the PMEM case.
> >> > 
> >> > By re-using the existing memory management infrastructure, do you mean
> >> > re-using the existing model of MMIO for passthrough PCI devices to
> >> > handle the permission of pmem?
> >> 
> >> No, re-using struct page_info.
> >> 
> >> >> The
> >> >> only open question remains to be where to place the control structures,
> >> >> and I think the thresholding proposal of yours was quite sensible.
> >> > 
> >> > I'm little confused here. Is 'restrict the number of range structs' in
> >> > my previous reply the 'thresholding proposal' you mean? Or it's one of
> >> > 'artificial limitations'?
> >> 
> >> Neither. It's the decision on where to place the struct page_info
> >> arrays needed to manage the PMEM ranges.
> >>
> > 
> > In [1][2], we have agreed to use struct page_info to manage mappings
> > for pmem and place them in reserved area on pmem.
> > 
> > But I think the discussion in this thread is to decide the data
> > structure which will be used to track access permission to host pmem.
> > The discussion started from my question in [3]:
> > | I'm not sure whether xen toolstack as a userspace program is
> > | considered to be safe to pass the host physical address (of host
> > | NVDIMM) to hypervisor.
> > In reply [4], you mentioned:
> > | As long as the passing of physical addresses follows to model of
> > | MMIO for passed through PCI devices, I don't think there's problem
> > | with the tool stack bypassing the Dom0 kernel. So it really all
> > | depends on how you make sure that the guest won't get to see memory
> > | it has no permission to access.
> > 
> > I interpreted it as the same access permission control mechanism used
> > for MMIO of passthrough pci devices (built around range struct) should
> > be used for pmem as well, so that we can safely allow toolstack to
> > pass the host physical address of nvdimm to hypervisor.
> > Was my understanding wrong from the beginning?
> 
> Perhaps I have got confused by the back and forth. If we're to
> use struct page_info, then everything should be following a
> similar flow to what happens for normal RAM, i.e. normal page
> allocation, and normal assignment of pages to guests.
>

I'll follow the normal assignment of pages to guests for pmem, but not
the normal page allocation, because the allocator cannot guarantee
getting the same pmem area for the same guest every time. It still
needs input from others (e.g. the toolstack) that can provide the
exact address.

Because the address is now not decided by the xen hypervisor, certain
permission tracking is needed. For this part, we will re-use the
existing mechanism for MMIO. Directly using the existing range struct
for pmem may consume too much space, so I proposed to choose different
data structures or to put a limit on the existing range structs to
avoid or mitigate this problem. The data structure change will be
applied only to pmem, and only the code that manipulates the range
structs (rangeset_*) will be changed for pmem. So the permission
tracking part will still follow the existing model.
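
To make the intended reuse concrete, a minimal sketch, assuming a new
per-domain rangeset analogous to d->iomem_caps (the name pmem_caps and
both functions below are hypothetical; only the rangeset_* calls exist
in Xen today):

/* Grant a domain access to host pmem pages [smfn, emfn] (toolstack path). */
static int pmem_permit_access(struct domain *d,
                              unsigned long smfn, unsigned long emfn)
{
    return rangeset_add_range(d->pmem_caps, smfn, emfn);
}

/* Check a mapping request, mirroring iomem_access_permitted(). */
static int pmem_access_permitted(struct domain *d,
                                 unsigned long smfn, unsigned long emfn)
{
    return rangeset_contains_range(d->pmem_caps, smfn, emfn);
}

If the range structs turn out to be too big for pmem, only the backing
store behind these two entry points would need to change; the callers
would stay the same.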

Haozhong




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-22 Thread Jan Beulich
>>> On 22.04.16 at 12:16,  wrote:
> On 04/22/16 02:24, Jan Beulich wrote:
> [..]
>> >> >> Well, using existing range struct to manage guest access permissions
>> >> >> to nvdimm could consume too much space which could not fit in either
>> >> >> memory or nvdimm. If the above solution looks really error-prone,
>> >> >> perhaps we can still come back to the existing one and restrict the
>> >> >> number of range structs each domain could have for nvdimm
>> >> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
>> >> >> nvdimm, though it may reject nvdimm mapping that is terribly
>> >> >> fragmented.
>> >> > 
>> >> > Hi Jan,
>> >> > 
>> >> > Any comments for this?
>> >> 
>> >> Well, nothing new, i.e. my previous opinion on the old proposal didn't
>> >> change. I'm really opposed to any artificial limitations here, as I am to
>> >> any secondary (and hence error prone) code paths. IOW I continue
>> >> to think that there's no reasonable alternative to re-using the existing
>> >> memory management infrastructure for at least the PMEM case.
>> > 
>> > By re-using the existing memory management infrastructure, do you mean
>> > re-using the existing model of MMIO for passthrough PCI devices to
>> > handle the permission of pmem?
>> 
>> No, re-using struct page_info.
>> 
>> >> The
>> >> only open question remains to be where to place the control structures,
>> >> and I think the thresholding proposal of yours was quite sensible.
>> > 
>> > I'm little confused here. Is 'restrict the number of range structs' in
>> > my previous reply the 'thresholding proposal' you mean? Or it's one of
>> > 'artificial limitations'?
>> 
>> Neither. It's the decision on where to place the struct page_info
>> arrays needed to manage the PMEM ranges.
>>
> 
> In [1][2], we have agreed to use struct page_info to manage mappings
> for pmem and place them in reserved area on pmem.
> 
> But I think the discussion in this thread is to decide the data
> structure which will be used to track access permission to host pmem.
> The discussion started from my question in [3]:
> | I'm not sure whether xen toolstack as a userspace program is
> | considered to be safe to pass the host physical address (of host
> | NVDIMM) to hypervisor.
> In reply [4], you mentioned:
> | As long as the passing of physical addresses follows to model of
> | MMIO for passed through PCI devices, I don't think there's problem
> | with the tool stack bypassing the Dom0 kernel. So it really all
> | depends on how you make sure that the guest won't get to see memory
> | it has no permission to access.
> 
> I interpreted it as the same access permission control mechanism used
> for MMIO of passthrough pci devices (built around range struct) should
> be used for pmem as well, so that we can safely allow toolstack to
> pass the host physical address of nvdimm to hypervisor.
> Was my understanding wrong from the beginning?

Perhaps I have got confused by the back and forth. If we're to
use struct page_info, then everything should be following a
similar flow to what happens for normal RAM, i.e. normal page
allocation, and normal assignment of pages to guests.
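
For reference, the normal RAM flow being referred to is roughly the
following (a simplified sketch; exact signatures differ between Xen
versions):

    /* 1. allocate pages for the domain from the heap */
    struct page_info *pg = alloc_domheap_pages(d, order, 0);

    /* 2. enter them into the guest physmap */
    if ( pg != NULL )
        guest_physmap_add_page(d, gfn, page_to_mfn(pg), order);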

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-22 Thread Haozhong Zhang
On 04/22/16 02:24, Jan Beulich wrote:
[..]
> >> >> Well, using existing range struct to manage guest access permissions
> >> >> to nvdimm could consume too much space which could not fit in either
> >> >> memory or nvdimm. If the above solution looks really error-prone,
> >> >> perhaps we can still come back to the existing one and restrict the
> >> >> number of range structs each domain could have for nvdimm
> >> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
> >> >> nvdimm, though it may reject nvdimm mapping that is terribly
> >> >> fragmented.
> >> > 
> >> > Hi Jan,
> >> > 
> >> > Any comments for this?
> >> 
> >> Well, nothing new, i.e. my previous opinion on the old proposal didn't
> >> change. I'm really opposed to any artificial limitations here, as I am to
> >> any secondary (and hence error prone) code paths. IOW I continue
> >> to think that there's no reasonable alternative to re-using the existing
> >> memory management infrastructure for at least the PMEM case.
> > 
> > By re-using the existing memory management infrastructure, do you mean
> > re-using the existing model of MMIO for passthrough PCI devices to
> > handle the permission of pmem?
> 
> No, re-using struct page_info.
> 
> >> The
> >> only open question remains to be where to place the control structures,
> >> and I think the thresholding proposal of yours was quite sensible.
> > 
> > I'm little confused here. Is 'restrict the number of range structs' in
> > my previous reply the 'thresholding proposal' you mean? Or it's one of
> > 'artificial limitations'?
> 
> Neither. It's the decision on where to place the struct page_info
> arrays needed to manage the PMEM ranges.
>

In [1][2], we agreed to use struct page_info to manage mappings for
pmem and to place those structures in a reserved area on pmem.

But I think the discussion in this thread is about deciding the data
structure that will be used to track access permissions to host pmem.
The discussion started from my question in [3]:
| I'm not sure whether xen toolstack as a userspace program is
| considered to be safe to pass the host physical address (of host
| NVDIMM) to hypervisor.
In reply [4], you mentioned:
| As long as the passing of physical addresses follows to model of
| MMIO for passed through PCI devices, I don't think there's problem
| with the tool stack bypassing the Dom0 kernel. So it really all
| depends on how you make sure that the guest won't get to see memory
| it has no permission to access.

I interpreted that to mean that the same access permission control
mechanism used for MMIO of passthrough PCI devices (built around the
range struct) should be used for pmem as well, so that we can safely
allow the toolstack to pass the host physical address of the nvdimm to
the hypervisor. Was my understanding wrong from the beginning?

Thanks,
Haozhong

[1] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01161.html
[2] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01201.html
[3] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html
[4] http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01981.html



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-22 Thread Jan Beulich
>>> On 22.04.16 at 04:36,  wrote:
> On 04/21/16 01:04, Jan Beulich wrote:
>> >>> On 21.04.16 at 07:09,  wrote:
>> > On 04/12/16 16:45, Haozhong Zhang wrote:
>> >> On 04/08/16 09:52, Jan Beulich wrote:
>> >> > >>> On 08.04.16 at 07:02,  wrote:
>> >> > > On 03/29/16 04:49, Jan Beulich wrote:
>> >> > >> >>> On 29.03.16 at 12:10,  wrote:
>> >> > >> > On 03/29/16 03:11, Jan Beulich wrote:
>> >> > >> >> >>> On 29.03.16 at 10:47,  wrote:
>> >> > > [..]
>> >> > >> >> > I still cannot find a neat approach to manage guest permissions 
>> >> > >> >> > for
>> >> > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to 
>> >> > >> >> > track
>> >> > >> >> > permissions: each bit corresponding to an nvdimm page. The 
>> >> > >> >> > bitmap can
>> >> > >> >> > save lots of spaces and even be stored in the normal ram, but
>> >> > >> >> > operating it for a large nvdimm range, especially for a 
>> >> > >> >> > contiguous
>> >> > >> >> > one, is slower than rangeset.
>> >> > >> >> 
>> >> > >> >> I don't follow: What would a single bit in that bitmap mean? Any
>> >> > >> >> guest may access the page? That surely wouldn't be what we
>> >> > >> >> need.
>> >> > >> >>
>> >> > >> > 
>> >> > >> > For a host having a N pages of nvdimm, each domain will have a N 
>> >> > >> > bits
>> >> > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that 
>> >> > >> > domain
>> >> > >> > has the permission to access the m'th host nvdimm page.
>> >> > >> 
>> >> > >> Which will be more overhead as soon as there are enough such
>> >> > >> domains in a system.
>> >> > >>
>> >> > > 
>> >> > > Sorry for the late reply.
>> >> > > 
>> >> > > I think we can make some optimization to reduce the space consumed by
>> >> > > the bitmap.
>> >> > > 
>> >> > > A per-domain bitmap covering the entire host NVDIMM address range is
>> >> > > wasteful especially if the actual used ranges are congregated. We may
>> >> > > take following ways to reduce its space.
>> >> > > 
>> >> > > 1) Split the per-domain bitmap into multiple sub-bitmap and each
>> >> > >sub-bitmap covers a smaller and contiguous sub host NVDIMM address
>> >> > >range. In the beginning, no sub-bitmap is allocated for the
>> >> > >domain. If the access permission to a host NVDIMM page in a sub
>> >> > >host address range is added to a domain, only the sub-bitmap for
>> >> > >that address range is allocated for the domain. If access
>> >> > >permissions to all host NVDIMM pages in a sub range are removed
>> >> > >from a domain, the corresponding sub-bitmap can be freed.
>> >> > > 
>> >> > > 2) If a domain has access permissions to all host NVDIMM pages in a
>> >> > >sub range, the corresponding sub-bitmap will be replaced by a range
>> >> > >struct. If range structs are used to track adjacent ranges, they
>> >> > >will be merged into one range struct. If access permissions to some
>> >> > >pages in that sub range are removed from a domain, the range struct
>> >> > >should be converted back to bitmap segment(s).
>> >> > > 
>> >> > > 3) Because there might be lots of above bitmap segments and range
>> >> > >structs per-domain, we can organize them in a balanced interval
>> >> > >tree to quickly search/add/remove an individual structure.
>> >> > > 
>> >> > > In the worst case that each sub range has non-contiguous pages
>> >> > > assigned to a domain, above solution will use all sub-bitmaps and
>> >> > > consume more space than a single bitmap because of the extra space for
>> >> > > organization. I assume that the sysadmin should be responsible to
>> >> > > ensure the host nvdimm ranges assigned to each domain as contiguous
>> >> > > and congregated as possible in order to avoid the worst case. However,
>> >> > > if the worst case does happen, xen hypervisor should refuse to assign
>> >> > > nvdimm to guest when it runs out of memory.
>> >> > 
>> >> > To be honest, this all sounds pretty unconvincing wrt not using
>> >> > existing code paths - a lot of special treatment, and hence a lot
>> >> > of things that can go (slightly) wrong.
>> >> > 
>> >> 
>> >> Well, using existing range struct to manage guest access permissions
>> >> to nvdimm could consume too much space which could not fit in either
>> >> memory or nvdimm. If the above solution looks really error-prone,
>> >> perhaps we can still come back to the existing one and restrict the
>> >> number of range structs each domain could have for nvdimm
>> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
>> >> nvdimm, though it may reject nvdimm mapping that is terribly
>> >> fragmented.
>> > 
>> > Hi Jan,
>> > 
>> > Any comments for this?
>> 
>> Well, nothing new, i.e. my previous opinion on the old proposal didn't
>> change. I'm really opposed to any artificial limitations here, as I am to
>> any secondary (and hence 

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-21 Thread Haozhong Zhang
On 04/21/16 01:04, Jan Beulich wrote:
> >>> On 21.04.16 at 07:09,  wrote:
> > On 04/12/16 16:45, Haozhong Zhang wrote:
> >> On 04/08/16 09:52, Jan Beulich wrote:
> >> > >>> On 08.04.16 at 07:02,  wrote:
> >> > > On 03/29/16 04:49, Jan Beulich wrote:
> >> > >> >>> On 29.03.16 at 12:10,  wrote:
> >> > >> > On 03/29/16 03:11, Jan Beulich wrote:
> >> > >> >> >>> On 29.03.16 at 10:47,  wrote:
> >> > > [..]
> >> > >> >> > I still cannot find a neat approach to manage guest permissions 
> >> > >> >> > for
> >> > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to 
> >> > >> >> > track
> >> > >> >> > permissions: each bit corresponding to an nvdimm page. The 
> >> > >> >> > bitmap can
> >> > >> >> > save lots of spaces and even be stored in the normal ram, but
> >> > >> >> > operating it for a large nvdimm range, especially for a 
> >> > >> >> > contiguous
> >> > >> >> > one, is slower than rangeset.
> >> > >> >> 
> >> > >> >> I don't follow: What would a single bit in that bitmap mean? Any
> >> > >> >> guest may access the page? That surely wouldn't be what we
> >> > >> >> need.
> >> > >> >>
> >> > >> > 
> >> > >> > For a host having a N pages of nvdimm, each domain will have a N 
> >> > >> > bits
> >> > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that 
> >> > >> > domain
> >> > >> > has the permission to access the m'th host nvdimm page.
> >> > >> 
> >> > >> Which will be more overhead as soon as there are enough such
> >> > >> domains in a system.
> >> > >>
> >> > > 
> >> > > Sorry for the late reply.
> >> > > 
> >> > > I think we can make some optimization to reduce the space consumed by
> >> > > the bitmap.
> >> > > 
> >> > > A per-domain bitmap covering the entire host NVDIMM address range is
> >> > > wasteful especially if the actual used ranges are congregated. We may
> >> > > take following ways to reduce its space.
> >> > > 
> >> > > 1) Split the per-domain bitmap into multiple sub-bitmap and each
> >> > >sub-bitmap covers a smaller and contiguous sub host NVDIMM address
> >> > >range. In the beginning, no sub-bitmap is allocated for the
> >> > >domain. If the access permission to a host NVDIMM page in a sub
> >> > >host address range is added to a domain, only the sub-bitmap for
> >> > >that address range is allocated for the domain. If access
> >> > >permissions to all host NVDIMM pages in a sub range are removed
> >> > >from a domain, the corresponding sub-bitmap can be freed.
> >> > > 
> >> > > 2) If a domain has access permissions to all host NVDIMM pages in a
> >> > >sub range, the corresponding sub-bitmap will be replaced by a range
> >> > >struct. If range structs are used to track adjacent ranges, they
> >> > >will be merged into one range struct. If access permissions to some
> >> > >pages in that sub range are removed from a domain, the range struct
> >> > >should be converted back to bitmap segment(s).
> >> > > 
> >> > > 3) Because there might be lots of above bitmap segments and range
> >> > >structs per-domain, we can organize them in a balanced interval
> >> > >tree to quickly search/add/remove an individual structure.
> >> > > 
> >> > > In the worst case that each sub range has non-contiguous pages
> >> > > assigned to a domain, above solution will use all sub-bitmaps and
> >> > > consume more space than a single bitmap because of the extra space for
> >> > > organization. I assume that the sysadmin should be responsible to
> >> > > ensure the host nvdimm ranges assigned to each domain as contiguous
> >> > > and congregated as possible in order to avoid the worst case. However,
> >> > > if the worst case does happen, xen hypervisor should refuse to assign
> >> > > nvdimm to guest when it runs out of memory.
> >> > 
> >> > To be honest, this all sounds pretty unconvincing wrt not using
> >> > existing code paths - a lot of special treatment, and hence a lot
> >> > of things that can go (slightly) wrong.
> >> > 
> >> 
> >> Well, using existing range struct to manage guest access permissions
> >> to nvdimm could consume too much space which could not fit in either
> >> memory or nvdimm. If the above solution looks really error-prone,
> >> perhaps we can still come back to the existing one and restrict the
> >> number of range structs each domain could have for nvdimm
> >> (e.g. reserve one 4K-page per-domain for them) to make it work for
> >> nvdimm, though it may reject nvdimm mapping that is terribly
> >> fragmented.
> > 
> > Hi Jan,
> > 
> > Any comments for this?
> 
> Well, nothing new, i.e. my previous opinion on the old proposal didn't
> change. I'm really opposed to any artificial limitations here, as I am to
> any secondary (and hence error prone) code paths. IOW I continue
> to think that there's no reasonable alternative to re-using the existing
> memory management infrastructure for at 

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-21 Thread Jan Beulich
>>> On 21.04.16 at 07:09,  wrote:
> On 04/12/16 16:45, Haozhong Zhang wrote:
>> On 04/08/16 09:52, Jan Beulich wrote:
>> > >>> On 08.04.16 at 07:02,  wrote:
>> > > On 03/29/16 04:49, Jan Beulich wrote:
>> > >> >>> On 29.03.16 at 12:10,  wrote:
>> > >> > On 03/29/16 03:11, Jan Beulich wrote:
>> > >> >> >>> On 29.03.16 at 10:47,  wrote:
>> > > [..]
>> > >> >> > I still cannot find a neat approach to manage guest permissions for
>> > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
>> > >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap 
>> > >> >> > can
>> > >> >> > save lots of spaces and even be stored in the normal ram, but
>> > >> >> > operating it for a large nvdimm range, especially for a contiguous
>> > >> >> > one, is slower than rangeset.
>> > >> >> 
>> > >> >> I don't follow: What would a single bit in that bitmap mean? Any
>> > >> >> guest may access the page? That surely wouldn't be what we
>> > >> >> need.
>> > >> >>
>> > >> > 
>> > >> > For a host having a N pages of nvdimm, each domain will have a N bits
>> > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
>> > >> > has the permission to access the m'th host nvdimm page.
>> > >> 
>> > >> Which will be more overhead as soon as there are enough such
>> > >> domains in a system.
>> > >>
>> > > 
>> > > Sorry for the late reply.
>> > > 
>> > > I think we can make some optimization to reduce the space consumed by
>> > > the bitmap.
>> > > 
>> > > A per-domain bitmap covering the entire host NVDIMM address range is
>> > > wasteful especially if the actual used ranges are congregated. We may
>> > > take following ways to reduce its space.
>> > > 
>> > > 1) Split the per-domain bitmap into multiple sub-bitmap and each
>> > >sub-bitmap covers a smaller and contiguous sub host NVDIMM address
>> > >range. In the beginning, no sub-bitmap is allocated for the
>> > >domain. If the access permission to a host NVDIMM page in a sub
>> > >host address range is added to a domain, only the sub-bitmap for
>> > >that address range is allocated for the domain. If access
>> > >permissions to all host NVDIMM pages in a sub range are removed
>> > >from a domain, the corresponding sub-bitmap can be freed.
>> > > 
>> > > 2) If a domain has access permissions to all host NVDIMM pages in a
>> > >sub range, the corresponding sub-bitmap will be replaced by a range
>> > >struct. If range structs are used to track adjacent ranges, they
>> > >will be merged into one range struct. If access permissions to some
>> > >pages in that sub range are removed from a domain, the range struct
>> > >should be converted back to bitmap segment(s).
>> > > 
>> > > 3) Because there might be lots of above bitmap segments and range
>> > >structs per-domain, we can organize them in a balanced interval
>> > >tree to quickly search/add/remove an individual structure.
>> > > 
>> > > In the worst case that each sub range has non-contiguous pages
>> > > assigned to a domain, above solution will use all sub-bitmaps and
>> > > consume more space than a single bitmap because of the extra space for
>> > > organization. I assume that the sysadmin should be responsible to
>> > > ensure the host nvdimm ranges assigned to each domain as contiguous
>> > > and congregated as possible in order to avoid the worst case. However,
>> > > if the worst case does happen, xen hypervisor should refuse to assign
>> > > nvdimm to guest when it runs out of memory.
>> > 
>> > To be honest, this all sounds pretty unconvincing wrt not using
>> > existing code paths - a lot of special treatment, and hence a lot
>> > of things that can go (slightly) wrong.
>> > 
>> 
>> Well, using existing range struct to manage guest access permissions
>> to nvdimm could consume too much space which could not fit in either
>> memory or nvdimm. If the above solution looks really error-prone,
>> perhaps we can still come back to the existing one and restrict the
>> number of range structs each domain could have for nvdimm
>> (e.g. reserve one 4K-page per-domain for them) to make it work for
>> nvdimm, though it may reject nvdimm mapping that is terribly
>> fragmented.
> 
> Hi Jan,
> 
> Any comments for this?

Well, nothing new, i.e. my previous opinion on the old proposal didn't
change. I'm really opposed to any artificial limitations here, as I am to
any secondary (and hence error prone) code paths. IOW I continue
to think that there's no reasonable alternative to re-using the existing
memory management infrastructure for at least the PMEM case. The
only open question remains to be where to place the control structures,
and I think the thresholding proposal of yours was quite sensible.
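
For a sense of scale (assuming struct page_info is roughly 32 bytes on
x86-64, which is an assumption, not a figure from this thread):

    1 TB pmem / 4 KB per page    = 268,435,456 pages
    268,435,456 pages * 32 bytes = 8 GB of page_info arrays (1/128 of the pmem)

which is why keeping these arrays in a reserved area on the pmem
itself, as agreed earlier in the thread, is attractive.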

Jan


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-20 Thread Haozhong Zhang
On 04/12/16 16:45, Haozhong Zhang wrote:
> On 04/08/16 09:52, Jan Beulich wrote:
> > >>> On 08.04.16 at 07:02,  wrote:
> > > On 03/29/16 04:49, Jan Beulich wrote:
> > >> >>> On 29.03.16 at 12:10,  wrote:
> > >> > On 03/29/16 03:11, Jan Beulich wrote:
> > >> >> >>> On 29.03.16 at 10:47,  wrote:
> > > [..]
> > >> >> > I still cannot find a neat approach to manage guest permissions for
> > >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
> > >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap 
> > >> >> > can
> > >> >> > save lots of spaces and even be stored in the normal ram, but
> > >> >> > operating it for a large nvdimm range, especially for a contiguous
> > >> >> > one, is slower than rangeset.
> > >> >> 
> > >> >> I don't follow: What would a single bit in that bitmap mean? Any
> > >> >> guest may access the page? That surely wouldn't be what we
> > >> >> need.
> > >> >>
> > >> > 
> > >> > For a host having a N pages of nvdimm, each domain will have a N bits
> > >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
> > >> > has the permission to access the m'th host nvdimm page.
> > >> 
> > >> Which will be more overhead as soon as there are enough such
> > >> domains in a system.
> > >>
> > > 
> > > Sorry for the late reply.
> > > 
> > > I think we can make some optimization to reduce the space consumed by
> > > the bitmap.
> > > 
> > > A per-domain bitmap covering the entire host NVDIMM address range is
> > > wasteful especially if the actual used ranges are congregated. We may
> > > take following ways to reduce its space.
> > > 
> > > 1) Split the per-domain bitmap into multiple sub-bitmap and each
> > >sub-bitmap covers a smaller and contiguous sub host NVDIMM address
> > >range. In the beginning, no sub-bitmap is allocated for the
> > >domain. If the access permission to a host NVDIMM page in a sub
> > >host address range is added to a domain, only the sub-bitmap for
> > >that address range is allocated for the domain. If access
> > >permissions to all host NVDIMM pages in a sub range are removed
> > >from a domain, the corresponding sub-bitmap can be freed.
> > > 
> > > 2) If a domain has access permissions to all host NVDIMM pages in a
> > >sub range, the corresponding sub-bitmap will be replaced by a range
> > >struct. If range structs are used to track adjacent ranges, they
> > >will be merged into one range struct. If access permissions to some
> > >pages in that sub range are removed from a domain, the range struct
> > >should be converted back to bitmap segment(s).
> > > 
> > > 3) Because there might be lots of above bitmap segments and range
> > >structs per-domain, we can organize them in a balanced interval
> > >tree to quickly search/add/remove an individual structure.
> > > 
> > > In the worst case that each sub range has non-contiguous pages
> > > assigned to a domain, above solution will use all sub-bitmaps and
> > > consume more space than a single bitmap because of the extra space for
> > > organization. I assume that the sysadmin should be responsible to
> > > ensure the host nvdimm ranges assigned to each domain as contiguous
> > > and congregated as possible in order to avoid the worst case. However,
> > > if the worst case does happen, xen hypervisor should refuse to assign
> > > nvdimm to guest when it runs out of memory.
> > 
> > To be honest, this all sounds pretty unconvincing wrt not using
> > existing code paths - a lot of special treatment, and hence a lot
> > of things that can go (slightly) wrong.
> > 
> 
> Well, using existing range struct to manage guest access permissions
> to nvdimm could consume too much space which could not fit in either
> memory or nvdimm. If the above solution looks really error-prone,
> perhaps we can still come back to the existing one and restrict the
> number of range structs each domain could have for nvdimm
> (e.g. reserve one 4K-page per-domain for them) to make it work for
> nvdimm, though it may reject nvdimm mapping that is terribly
> fragmented.

Hi Jan,

Any comments for this?

Thanks,
Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-12 Thread Haozhong Zhang
On 04/08/16 09:52, Jan Beulich wrote:
> >>> On 08.04.16 at 07:02,  wrote:
> > On 03/29/16 04:49, Jan Beulich wrote:
> >> >>> On 29.03.16 at 12:10,  wrote:
> >> > On 03/29/16 03:11, Jan Beulich wrote:
> >> >> >>> On 29.03.16 at 10:47,  wrote:
> > [..]
> >> >> > I still cannot find a neat approach to manage guest permissions for
> >> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
> >> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
> >> >> > save lots of spaces and even be stored in the normal ram, but
> >> >> > operating it for a large nvdimm range, especially for a contiguous
> >> >> > one, is slower than rangeset.
> >> >> 
> >> >> I don't follow: What would a single bit in that bitmap mean? Any
> >> >> guest may access the page? That surely wouldn't be what we
> >> >> need.
> >> >>
> >> > 
> >> > For a host having a N pages of nvdimm, each domain will have a N bits
> >> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
> >> > has the permission to access the m'th host nvdimm page.
> >> 
> >> Which will be more overhead as soon as there are enough such
> >> domains in a system.
> >>
> > 
> > Sorry for the late reply.
> > 
> > I think we can make some optimization to reduce the space consumed by
> > the bitmap.
> > 
> > A per-domain bitmap covering the entire host NVDIMM address range is
> > wasteful especially if the actual used ranges are congregated. We may
> > take following ways to reduce its space.
> > 
> > 1) Split the per-domain bitmap into multiple sub-bitmap and each
> >sub-bitmap covers a smaller and contiguous sub host NVDIMM address
> >range. In the beginning, no sub-bitmap is allocated for the
> >domain. If the access permission to a host NVDIMM page in a sub
> >host address range is added to a domain, only the sub-bitmap for
> >that address range is allocated for the domain. If access
> >permissions to all host NVDIMM pages in a sub range are removed
> >from a domain, the corresponding sub-bitmap can be freed.
> > 
> > 2) If a domain has access permissions to all host NVDIMM pages in a
> >sub range, the corresponding sub-bitmap will be replaced by a range
> >struct. If range structs are used to track adjacent ranges, they
> >will be merged into one range struct. If access permissions to some
> >pages in that sub range are removed from a domain, the range struct
> >should be converted back to bitmap segment(s).
> > 
> > 3) Because there might be lots of above bitmap segments and range
> >structs per-domain, we can organize them in a balanced interval
> >tree to quickly search/add/remove an individual structure.
> > 
> > In the worst case that each sub range has non-contiguous pages
> > assigned to a domain, above solution will use all sub-bitmaps and
> > consume more space than a single bitmap because of the extra space for
> > organization. I assume that the sysadmin should be responsible to
> > ensure the host nvdimm ranges assigned to each domain as contiguous
> > and congregated as possible in order to avoid the worst case. However,
> > if the worst case does happen, xen hypervisor should refuse to assign
> > nvdimm to guest when it runs out of memory.
> 
> To be honest, this all sounds pretty unconvincing wrt not using
> existing code paths - a lot of special treatment, and hence a lot
> of things that can go (slightly) wrong.
> 

Well, using the existing range struct to manage guest access
permissions to nvdimm could consume too much space, which might not
fit in either memory or nvdimm. If the above solution looks really
error-prone, perhaps we can still come back to the existing approach
and restrict the number of range structs each domain may have for
nvdimm (e.g. reserve one 4K page per domain for them) to make it work
for nvdimm, though it may reject nvdimm mappings that are terribly
fragmented.
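
As a rough idea of what that limit would mean (assuming a range struct
of about 32 bytes, i.e. two list pointers plus start/end, which is an
assumption rather than a measured number):

    4096 bytes / 32 bytes per range = ~128 ranges per domain

i.e. reserving one 4K page per domain would cap each domain at roughly
128 discontiguous nvdimm ranges, which is where the rejection of
terribly fragmented mappings would come from.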

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-08 Thread Jan Beulich
>>> On 08.04.16 at 07:02,  wrote:
> On 03/29/16 04:49, Jan Beulich wrote:
>> >>> On 29.03.16 at 12:10,  wrote:
>> > On 03/29/16 03:11, Jan Beulich wrote:
>> >> >>> On 29.03.16 at 10:47,  wrote:
> [..]
>> >> > I still cannot find a neat approach to manage guest permissions for
>> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
>> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
>> >> > save lots of spaces and even be stored in the normal ram, but
>> >> > operating it for a large nvdimm range, especially for a contiguous
>> >> > one, is slower than rangeset.
>> >> 
>> >> I don't follow: What would a single bit in that bitmap mean? Any
>> >> guest may access the page? That surely wouldn't be what we
>> >> need.
>> >>
>> > 
>> > For a host having a N pages of nvdimm, each domain will have a N bits
>> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
>> > has the permission to access the m'th host nvdimm page.
>> 
>> Which will be more overhead as soon as there are enough such
>> domains in a system.
>>
> 
> Sorry for the late reply.
> 
> I think we can make some optimization to reduce the space consumed by
> the bitmap.
> 
> A per-domain bitmap covering the entire host NVDIMM address range is
> wasteful especially if the actual used ranges are congregated. We may
> take following ways to reduce its space.
> 
> 1) Split the per-domain bitmap into multiple sub-bitmap and each
>sub-bitmap covers a smaller and contiguous sub host NVDIMM address
>range. In the beginning, no sub-bitmap is allocated for the
>domain. If the access permission to a host NVDIMM page in a sub
>host address range is added to a domain, only the sub-bitmap for
>that address range is allocated for the domain. If access
>permissions to all host NVDIMM pages in a sub range are removed
>from a domain, the corresponding sub-bitmap can be freed.
> 
> 2) If a domain has access permissions to all host NVDIMM pages in a
>sub range, the corresponding sub-bitmap will be replaced by a range
>struct. If range structs are used to track adjacent ranges, they
>will be merged into one range struct. If access permissions to some
>pages in that sub range are removed from a domain, the range struct
>should be converted back to bitmap segment(s).
> 
> 3) Because there might be lots of above bitmap segments and range
>structs per-domain, we can organize them in a balanced interval
>tree to quickly search/add/remove an individual structure.
> 
> In the worst case that each sub range has non-contiguous pages
> assigned to a domain, above solution will use all sub-bitmaps and
> consume more space than a single bitmap because of the extra space for
> organization. I assume that the sysadmin should be responsible to
> ensure the host nvdimm ranges assigned to each domain as contiguous
> and congregated as possible in order to avoid the worst case. However,
> if the worst case does happen, xen hypervisor should refuse to assign
> nvdimm to guest when it runs out of memory.

To be honest, this all sounds pretty unconvincing wrt not using
existing code paths - a lot of special treatment, and hence a lot
of things that can go (slightly) wrong.

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-04-07 Thread Haozhong Zhang
On 03/29/16 04:49, Jan Beulich wrote:
> >>> On 29.03.16 at 12:10,  wrote:
> > On 03/29/16 03:11, Jan Beulich wrote:
> >> >>> On 29.03.16 at 10:47,  wrote:
[..]
> >> > I still cannot find a neat approach to manage guest permissions for
> >> > nvdimm pages. A possible one is to use a per-domain bitmap to track
> >> > permissions: each bit corresponding to an nvdimm page. The bitmap can
> >> > save lots of spaces and even be stored in the normal ram, but
> >> > operating it for a large nvdimm range, especially for a contiguous
> >> > one, is slower than rangeset.
> >> 
> >> I don't follow: What would a single bit in that bitmap mean? Any
> >> guest may access the page? That surely wouldn't be what we
> >> need.
> >>
> > 
> > For a host having a N pages of nvdimm, each domain will have a N bits
> > bitmap. If the m'th bit of a domain's bitmap is set, then that domain
> > has the permission to access the m'th host nvdimm page.
> 
> Which will be more overhead as soon as there are enough such
> domains in a system.
>

Sorry for the late reply.

I think we can make some optimization to reduce the space consumed by
the bitmap.

A per-domain bitmap covering the entire host NVDIMM address range is
wasteful especially if the actual used ranges are congregated. We may
take following ways to reduce its space.

1) Split the per-domain bitmap into multiple sub-bitmap and each
   sub-bitmap covers a smaller and contiguous sub host NVDIMM address
   range. In the beginning, no sub-bitmap is allocated for the
   domain. If the access permission to a host NVDIMM page in a sub
   host address range is added to a domain, only the sub-bitmap for
   that address range is allocated for the domain. If access
   permissions to all host NVDIMM pages in a sub range are removed
   from a domain, the corresponding sub-bitmap can be freed.

2) If a domain has access permissions to all host NVDIMM pages in a
   sub range, the corresponding sub-bitmap will be replaced by a range
   struct. If range structs are used to track adjacent ranges, they
   will be merged into one range struct. If access permissions to some
   pages in that sub range are removed from a domain, the range struct
   should be converted back to bitmap segment(s).

3) Because there might be lots of above bitmap segments and range
   structs per-domain, we can organize them in a balanced interval
   tree to quickly search/add/remove an individual structure.

In the worst case that each sub range has non-contiguous pages
assigned to a domain, above solution will use all sub-bitmaps and
consume more space than a single bitmap because of the extra space for
organization. I assume that the sysadmin should be responsible to
ensure the host nvdimm ranges assigned to each domain as contiguous
and congregated as possible in order to avoid the worst case. However,
if the worst case does happen, xen hypervisor should refuse to assign
nvdimm to guest when it runs out of memory.
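
A minimal sketch of the per-sub-range node described in 1)-3); all
names are hypothetical and the interval tree / bit operations are only
indicated, not implemented:

/* One node per allocated sub-range, kept in a per-domain balanced
 * interval tree keyed by [start_mfn, end_mfn]. */
struct nvdimm_perm_node {
    unsigned long start_mfn;   /* first host nvdimm page of the sub-range */
    unsigned long end_mfn;     /* last host nvdimm page of the sub-range  */
    bool full_range;           /* whole sub-range permitted (case 2);
                                * the bitmap is then unused/freed         */
    unsigned long *bitmap;     /* one bit per page when !full_range       */
    /* plus the tree linkage (e.g. an rb_node for the interval tree)      */
};

Looking up a page means finding the node covering its mfn and then
either returning true (full_range) or testing the corresponding bit;
the absence of a node means no permission.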

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-29 Thread Jan Beulich
>>> On 29.03.16 at 12:10, <haozhong.zh...@intel.com> wrote:
> On 03/29/16 03:11, Jan Beulich wrote:
>> >>> On 29.03.16 at 10:47, <haozhong.zh...@intel.com> wrote:
>> > On 03/17/16 22:21, Haozhong Zhang wrote:
>> >> On 03/17/16 14:00, Ian Jackson wrote:
>> >> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM 
> support for Xen"):
>> >> > > QEMU keeps mappings of guest memory because (1) that mapping is
>> >> > > created by itself, and/or (2) certain device emulation needs to access
>> >> > > the guest memory. But for vNVDIMM, I'm going to move the creation of
>> >> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
>> >> > > access vNVDIMM pages mapped to guest, so it's not necessary to let
>> >> > > qemu keeps vNVDIMM mappings.
>> >> > 
>> >> > I'm confused by this.
>> >> > 
>> >> > Suppose a guest uses an emulated device (or backend) provided by qemu,
>> >> > to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
>> >> > pages into its own address space, so that it can write to the memory
>> >> > (ie, do the virtual DMA).
>> >> > 
>> >> > That virtual DMA might well involve a direct mapping in the kernel
>> >> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
>> >> > directly to the NVDIMM, and with luck the actual device backing the
>> >> > virtual device will be able to DMA to the NVDIMM.
>> >> > 
>> >> > All of this seems to me to mean that qemu needs to be able to map
>> >> > its guest's parts of NVDIMMs
>> >> > 
>> >> > There are probably other example: memory inspection systems used by
>> >> > virus scanners etc.; debuggers used to inspect a guest from outside;
>> >> > etc.
>> >> > 
>> >> > I haven't even got started on save/restore...
>> >> > 
>> >> 
>> >> Oops, so many cases I missed. Thanks Ian for pointing out all these!
>> >> Now I need to reconsider how to manage guest permissions for NVDIMM pages.
>> >> 
>> > 
>> > I still cannot find a neat approach to manage guest permissions for
>> > nvdimm pages. A possible one is to use a per-domain bitmap to track
>> > permissions: each bit corresponding to an nvdimm page. The bitmap can
>> > save lots of spaces and even be stored in the normal ram, but
>> > operating it for a large nvdimm range, especially for a contiguous
>> > one, is slower than rangeset.
>> 
>> I don't follow: What would a single bit in that bitmap mean? Any
>> guest may access the page? That surely wouldn't be what we
>> need.
>>
> 
> For a host having a N pages of nvdimm, each domain will have a N bits
> bitmap. If the m'th bit of a domain's bitmap is set, then that domain
> has the permission to access the m'th host nvdimm page.

Which will be more overhead as soon as there are enough such
domains in a system.

>> > BTW, if I take the other way to map nvdimm pages to guest
>> > (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html)
>> > | 2. Or, given the same inputs, we may combine above two steps into a new
>> > |dom0 system call that (1) gets the SPA ranges, (2) calls xen
>> > |hypercall to map SPA ranges
>> > and treat nvdimm as normal ram, then xen will not need to use rangeset
>> > or above bitmap to track guest permissions for nvdimm? But looking at
>> > how qemu currently populates guest memory via XENMEM_populate_physmap
>> > , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks
>> > like that mapping a _dedicated_ piece of host ram to guest is not
>> > allowed out of the hypervisor (and not allowed even in dom0 kernel)?
>> > Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking
>> > guest memory?
>> 
>> Well, it's simply because RAM is a resource managed through
>> allocation/freeing, instead of via reserving chunks for special
>> purposes.
>> 
> 
> So that means xen can always ensure the ram assigned to a guest is
> what the guest is permitted to access, so no data structures like
> iomem_caps is needed for ram. If I have to introduce a hypercall that
> maps the dedicated host ram/nvdimm to guest, then the explicit
> permission management is still needed, regardless of who (dom0 kernel,
> qemu or toolstack) will use it. Right?

Yes (if you really mean to go that route).

Jan



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-29 Thread Haozhong Zhang
On 03/29/16 03:11, Jan Beulich wrote:
> >>> On 29.03.16 at 10:47, <haozhong.zh...@intel.com> wrote:
> > On 03/17/16 22:21, Haozhong Zhang wrote:
> >> On 03/17/16 14:00, Ian Jackson wrote:
> >> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM 
> >> > support for Xen"):
> >> > > QEMU keeps mappings of guest memory because (1) that mapping is
> >> > > created by itself, and/or (2) certain device emulation needs to access
> >> > > the guest memory. But for vNVDIMM, I'm going to move the creation of
> >> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
> >> > > access vNVDIMM pages mapped to guest, so it's not necessary to let
> >> > > qemu keeps vNVDIMM mappings.
> >> > 
> >> > I'm confused by this.
> >> > 
> >> > Suppose a guest uses an emulated device (or backend) provided by qemu,
> >> > to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
> >> > pages into its own address space, so that it can write to the memory
> >> > (ie, do the virtual DMA).
> >> > 
> >> > That virtual DMA might well involve a direct mapping in the kernel
> >> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
> >> > directly to the NVDIMM, and with luck the actual device backing the
> >> > virtual device will be able to DMA to the NVDIMM.
> >> > 
> >> > All of this seems to me to mean that qemu needs to be able to map
> >> > its guest's parts of NVDIMMs
> >> > 
> >> > There are probably other example: memory inspection systems used by
> >> > virus scanners etc.; debuggers used to inspect a guest from outside;
> >> > etc.
> >> > 
> >> > I haven't even got started on save/restore...
> >> > 
> >> 
> >> Oops, so many cases I missed. Thanks Ian for pointing out all these!
> >> Now I need to reconsider how to manage guest permissions for NVDIMM pages.
> >> 
> > 
> > I still cannot find a neat approach to manage guest permissions for
> > nvdimm pages. A possible one is to use a per-domain bitmap to track
> > permissions: each bit corresponding to an nvdimm page. The bitmap can
> > save lots of spaces and even be stored in the normal ram, but
> > operating it for a large nvdimm range, especially for a contiguous
> > one, is slower than rangeset.
> 
> I don't follow: What would a single bit in that bitmap mean? Any
> guest may access the page? That surely wouldn't be what we
> need.
>

For a host having N pages of nvdimm, each domain will have an N-bit
bitmap. If the m'th bit of a domain's bitmap is set, then that domain
has permission to access the m'th host nvdimm page.
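
In code terms the check would be something like the following (a
sketch only; nvdimm_bitmap, nvdimm_base_mfn and nvdimm_nr_pages are
hypothetical fields, not existing Xen state):

static bool nvdimm_page_permitted(const struct domain *d, unsigned long mfn)
{
    unsigned long idx = mfn - nvdimm_base_mfn;  /* index of the host nvdimm page */

    return idx < nvdimm_nr_pages && test_bit(idx, d->nvdimm_bitmap);
}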

> > BTW, if I take the other way to map nvdimm pages to guest
> > (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html)
> > | 2. Or, given the same inputs, we may combine above two steps into a new
> > |dom0 system call that (1) gets the SPA ranges, (2) calls xen
> > |hypercall to map SPA ranges
> > and treat nvdimm as normal ram, then xen will not need to use rangeset
> > or above bitmap to track guest permissions for nvdimm? But looking at
> > how qemu currently populates guest memory via XENMEM_populate_physmap
> > , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks
> > like that mapping a _dedicated_ piece of host ram to guest is not
> > allowed out of the hypervisor (and not allowed even in dom0 kernel)?
> > Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking
> > guest memory?
> 
> Well, it's simply because RAM is a resource managed through
> allocation/freeing, instead of via reserving chunks for special
> purposes.
> 

So that means xen can always ensure the ram assigned to a guest is
what the guest is permitted to access, so no data structure like
iomem_caps is needed for ram. If I have to introduce a hypercall that
maps dedicated host ram/nvdimm to a guest, then explicit permission
management is still needed, regardless of who (dom0 kernel, qemu or
toolstack) will use it. Right?
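
If so, the toolstack-facing side of that explicit permission
management could be modeled on XEN_DOMCTL_iomem_permission, e.g. (this
sub-op and struct are purely hypothetical, not existing Xen ABI):

struct xen_domctl_nvdimm_permission {
    uint64_t first_mfn;     /* first host nvdimm page            */
    uint64_t nr_mfns;       /* number of pages                   */
    uint8_t  allow_access;  /* non-zero to grant, zero to revoke */
};

issued for the target domain before the mapping hypercall, with the
hypervisor recording it in whichever structure (rangeset, bitmap, ...)
is eventually chosen.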

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-29 Thread Jan Beulich
>>> On 29.03.16 at 10:47, <haozhong.zh...@intel.com> wrote:
> On 03/17/16 22:21, Haozhong Zhang wrote:
>> On 03/17/16 14:00, Ian Jackson wrote:
>> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM 
>> > support for Xen"):
>> > > QEMU keeps mappings of guest memory because (1) that mapping is
>> > > created by itself, and/or (2) certain device emulation needs to access
>> > > the guest memory. But for vNVDIMM, I'm going to move the creation of
>> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
>> > > access vNVDIMM pages mapped to guest, so it's not necessary to let
>> > > qemu keeps vNVDIMM mappings.
>> > 
>> > I'm confused by this.
>> > 
>> > Suppose a guest uses an emulated device (or backend) provided by qemu,
>> > to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
>> > pages into its own address space, so that it can write to the memory
>> > (ie, do the virtual DMA).
>> > 
>> > That virtual DMA might well involve a direct mapping in the kernel
>> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
>> > directly to the NVDIMM, and with luck the actual device backing the
>> > virtual device will be able to DMA to the NVDIMM.
>> > 
>> > All of this seems to me to mean that qemu needs to be able to map
>> > its guest's parts of NVDIMMs
>> > 
>> > There are probably other example: memory inspection systems used by
>> > virus scanners etc.; debuggers used to inspect a guest from outside;
>> > etc.
>> > 
>> > I haven't even got started on save/restore...
>> > 
>> 
>> Oops, so many cases I missed. Thanks Ian for pointing out all these!
>> Now I need to reconsider how to manage guest permissions for NVDIMM pages.
>> 
> 
> I still cannot find a neat approach to manage guest permissions for
> nvdimm pages. A possible one is to use a per-domain bitmap to track
> permissions: each bit corresponding to an nvdimm page. The bitmap can
> save lots of spaces and even be stored in the normal ram, but
> operating it for a large nvdimm range, especially for a contiguous
> one, is slower than rangeset.

I don't follow: What would a single bit in that bitmap mean? Any
guest may access the page? That surely wouldn't be what we
need.

> BTW, if I take the other way to map nvdimm pages to guest
> (http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html)
> | 2. Or, given the same inputs, we may combine above two steps into a new
> |dom0 system call that (1) gets the SPA ranges, (2) calls xen
> |hypercall to map SPA ranges
> and treat nvdimm as normal ram, then xen will not need to use rangeset
> or above bitmap to track guest permissions for nvdimm? But looking at
> how qemu currently populates guest memory via XENMEM_populate_physmap
> , and other hypercalls like XENMEM_[in|de]crease_reservation, it looks
> like that mapping a _dedicated_ piece of host ram to guest is not
> allowed out of the hypervisor (and not allowed even in dom0 kernel)?
> Is it for security concerns, e.g. avoiding a malfunctioned dom0 leaking
> guest memory?

Well, it's simply because RAM is a resource managed through
allocation/freeing, instead of via reserving chunks for special
purposes.

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-29 Thread Haozhong Zhang
On 03/17/16 22:21, Haozhong Zhang wrote:
> On 03/17/16 14:00, Ian Jackson wrote:
> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM 
> > support for Xen"):
> > > QEMU keeps mappings of guest memory because (1) that mapping is
> > > created by itself, and/or (2) certain device emulation needs to access
> > > the guest memory. But for vNVDIMM, I'm going to move the creation of
> > > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
> > > access vNVDIMM pages mapped to guest, so it's not necessary to let
> > > qemu keeps vNVDIMM mappings.
> > 
> > I'm confused by this.
> > 
> > Suppose a guest uses an emulated device (or backend) provided by qemu,
> > to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
> > pages into its own address space, so that it can write to the memory
> > (ie, do the virtual DMA).
> > 
> > That virtual DMA might well involve a direct mapping in the kernel
> > underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
> > directly to the NVDIMM, and with luck the actual device backing the
> > virtual device will be able to DMA to the NVDIMM.
> > 
> > All of this seems to me to mean that qemu needs to be able to map
> > its guest's parts of NVDIMMs
> > 
> > There are probably other example: memory inspection systems used by
> > virus scanners etc.; debuggers used to inspect a guest from outside;
> > etc.
> > 
> > I haven't even got started on save/restore...
> > 
> 
> Oops, so many cases I missed. Thanks Ian for pointing out all these!
> Now I need to reconsider how to manage guest permissions for NVDIMM pages.
> 

I still cannot find a neat approach to manage guest permissions for
nvdimm pages. A possible one is to use a per-domain bitmap to track
permissions, with each bit corresponding to one nvdimm page. The
bitmap can save a lot of space and could even be stored in normal
RAM, but operating on it for a large nvdimm range, especially a
contiguous one, is slower than a rangeset.
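
To make the trade-off concrete, here is a minimal user-space sketch of
such a per-domain bitmap (this is not Xen code; the type and helper
names are made up for illustration). One bit per 4 KiB page is about
32 KiB of bitmap per GiB of pmem per domain, which bounds the
worst-case space, while granting a large contiguous range still costs a
loop over every page:

/* perm-bitmap sketch: one bit per nvdimm page, per domain */
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_WORD (8 * sizeof(unsigned long))

struct nvdimm_perm {
    unsigned long base_mfn;   /* first frame of the tracked pmem range */
    unsigned long nr_pages;   /* number of frames covered */
    unsigned long *bits;      /* bit set => this domain may access the page */
};

static int perm_init(struct nvdimm_perm *p, unsigned long base_mfn,
                     unsigned long nr_pages)
{
    p->bits = calloc((nr_pages + BITS_PER_WORD - 1) / BITS_PER_WORD,
                     sizeof(unsigned long));
    if (!p->bits)
        return -1;
    p->base_mfn = base_mfn;
    p->nr_pages = nr_pages;
    return 0;
}

/* granting a large contiguous range touches every page ... */
static void perm_grant(struct nvdimm_perm *p, unsigned long mfn,
                       unsigned long nr)
{
    for (unsigned long i = mfn - p->base_mfn; nr--; i++)
        p->bits[i / BITS_PER_WORD] |= 1UL << (i % BITS_PER_WORD);
}

/* ... while the per-page check itself is O(1) */
static int perm_test(const struct nvdimm_perm *p, unsigned long mfn)
{
    unsigned long i = mfn - p->base_mfn;
    return (p->bits[i / BITS_PER_WORD] >> (i % BITS_PER_WORD)) & 1;
}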

BTW, if I take the other way to map nvdimm pages to a guest
(http://lists.xenproject.org/archives/html/xen-devel/2016-03/msg01972.html)
| 2. Or, given the same inputs, we may combine above two steps into a new
|dom0 system call that (1) gets the SPA ranges, (2) calls xen
|hypercall to map SPA ranges
and treat nvdimm as normal RAM, then xen will not need to use a
rangeset or the above bitmap to track guest permissions for nvdimm?
But looking at how qemu currently populates guest memory via
XENMEM_populate_physmap, and at other hypercalls like
XENMEM_[in|de]crease_reservation, it looks like mapping a _dedicated_
piece of host RAM to a guest is not allowed outside the hypervisor
(and not allowed even in the dom0 kernel)? Is that for security
concerns, e.g. avoiding a malfunctioning dom0 leaking guest memory?

Thanks,
Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-20 Thread Haozhong Zhang
Hi Jan and Konrad,

On 03/04/16 15:30, Haozhong Zhang wrote:
> Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM
> or files on NVDIMM. We can move that work to toolstack and pass SPA
> ranges got by toolstack to qemu. In this way, no privileged operations
> (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to
> work even with vNVDIMM hotplug in future.
> 

As I'm going to let the toolstack get the NVDIMM SPA ranges, this can
be done via dom0 kernel interfaces and xen hypercalls, and can be
implemented in different ways. I'm wondering which of the following
ones is preferred by xen.

1. Given
* a file descriptor of either an NVDIMM device or a file on NVDIMM, and
* the domain id and guest MFN where the vNVDIMM is going to be,
   the xen toolstack (1) gets its SPA ranges via dom0 kernel interfaces
   (e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map the
   above SPA ranges to the given guest MFN of the given domain.

2. Or, given the same inputs, we may combine the above two steps into a
   new dom0 system call that (1) gets the SPA ranges, (2) calls a xen
   hypercall to map the SPA ranges, and, one step further, (3) returns
   the SPA ranges to userspace (because QEMU needs these addresses to
   build ACPI).

The first way does not need to modify the dom0 linux kernel, while the
second requires a new system call. I'm not sure whether the xen
toolstack, as a userspace program, is considered safe to pass host
physical addresses to the hypervisor. If not, maybe the second one is
better?
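
To illustrate step (1) of the first way, the sketch below uses the
FIEMAP ioctl to learn where a file's blocks sit on the underlying pmem
block device. Note this is only a sketch: turning the returned device
offsets into SPAs (by adding the pmem region's base address from sysfs
and accounting for any partition offset) is an assumption here, not a
settled design detail.

/* fiemap-sketch.c: dump the extents of a file living on a pmem namespace */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
    int fd;
    struct stat st;
    struct fiemap *fm;

    if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 || fstat(fd, &st) < 0)
        return 1;

    fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
    if (!fm)
        return 1;
    fm->fm_start = 0;
    fm->fm_length = st.st_size;
    fm->fm_flags = FIEMAP_FLAG_SYNC;   /* flush first so the extents are stable */
    fm->fm_extent_count = 32;

    if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
        return 1;

    for (unsigned i = 0; i < fm->fm_mapped_extents; i++)
        printf("file offset %llu -> device offset %llu, length %llu\n",
               (unsigned long long)fm->fm_extents[i].fe_logical,
               (unsigned long long)fm->fm_extents[i].fe_physical,
               (unsigned long long)fm->fm_extents[i].fe_length);

    /* fe_physical is an offset into the block device; the toolstack would
     * still have to add the pmem region's SPA base to get host physical
     * addresses to pass to the mapping hypercall. */
    return 0;
}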

Thanks,
Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Ian Jackson
Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for 
Xen"):
> So that again leaves unaddressed the question of what you
> imply to do when a guest elects to use such a page as page
> table. I'm afraid any attempt of yours to invent something that
> is not struct page_info will not be suitable for all possible needs.

It is not clear to me whether this is a realistic thing for a guest to
want to do.  Haozhong, maybe you want to consider this aspect.

If you can come up with an argument why it is OK to simply not permit
this, then maybe the recordkeeping requirements can be relaxed ?

Ian.



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Jan Beulich
>>> On 17.03.16 at 09:58,  wrote:
> On 03/16/16 09:23, Jan Beulich wrote:
>> >>> On 16.03.16 at 15:55,  wrote:
>> > On 03/16/16 08:23, Jan Beulich wrote:
>> >> >>> On 16.03.16 at 14:55,  wrote:
>> >> > On 03/16/16 07:16, Jan Beulich wrote:
>> >> >> And
>> >> >> talking of fragmentation - how do you mean to track guest
>> >> >> permissions for an unbounded number of address ranges?
>> >> >>
>> >> > 
>> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
>> >> > of memory, so I think they are another candidate that should be put in
>> >> > the reserved area on NVDIMM. If we only allow to grant access
>> >> > permissions to NVDIMM page by page (rather than byte), the number of
>> >> > range structs for each NVDIMM in the worst case is still decidable.
>> >> 
>> >> Of course the permission granularity is going to by pages, not
>> >> bytes (or else we couldn't allow the pages to be mapped into
>> >> guest address space). And the limit on the per-domain range
>> >> sets isn't going to be allowed to be bumped significantly, at
>> >> least not for any of the existing ones (or else you'd have to
>> >> prove such bumping can't be abused).
>> > 
>> > What is that limit? the total number of range structs in per-domain
>> > range sets? I must miss something when looking through 'case
>> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
>> > limit, unless it means alloc_range() will fail when there are lots of
>> > range structs.
>> 
>> Oh, I'm sorry, that was a different set of range sets I was
>> thinking about. But note that excessive creation of ranges
>> through XEN_DOMCTL_iomem_permission is not a security issue
>> just because of XSA-77, i.e. we'd still not knowingly allow a
>> severe increase here.
>>
> 
> I didn't notice that multiple domains can all have access permission
> to an iomem range, i.e. there can be multiple range structs for a
> single iomem range. If range structs for NVDIMM are put on NVDIMM,
> then there would be still a huge amount of them on NVDIMM in the worst
> case (maximum number of domains * number of NVDIMM pages).
> 
> A workaround is to only allow a range of NVDIMM pages be accessed by a
> single domain. Whenever we add the access permission of NVDIMM pages
> to a domain, we also remove the permission from its current
> grantee. In this way, we only need to put 'number of NVDIMM pages'
> range structs on NVDIMM in the worst case.

But will this work? There's a reason multiple domains are permitted
access: The domain running qemu for the guest, for example,
needs to be able to access guest memory.

No matter how much you and others are opposed to this, I can't
help myself thinking that PMEM regions should be treated like RAM
(and hence be under full control of Xen), whereas PBLK regions
could indeed be treated like MMIO (and hence partly be under the
control of Dom0).

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Konrad Rzeszutek Wilk
> Then there is another problem (which also exists in the current
> design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
> that access label storage area (for namespace) for example:

No. And it really can't, as each vendor's _DSM is different - and there
is no ACPI AML interpreter inside the Xen hypervisor.
> 
> The way Linux reserving space on pmem mode NVDIMM is to leave the
> reserved space at the beginning of pmem mode NVDIMM and create a pmem
> namespace which starts from the end of the reserved space. Because the
> reservation information is written in the namespace in the NVDIMM
> label storage area, every OS that follows the namespace spec would not
> mistakenly write files in the reserved area. I prefer to the same way
> if Xen is going to do the reservation. We definitely don't want dom0
> to break the label storage area, so Xen seemingly needs to emulate the
> corresponding _DSM functions for dom0? If so, which part, the
> hypervisor or the toolstack, should do the emulation?

But we do not want Xen to do the reservation. The control guest (Dom0)
is the one that will mount the NVDIMM, extract the system ranges
from the files on the NVDIMM - and glue them to a guest.

It is also the job of Dom0 to actually partition the NVDIMM as it
sees fit. Actually, let me step back: it is the job of whichever guest
has the full NVDIMM assigned to it. At bootup that is Dom0 - but you
can very well 'unplug' the NVDIMM from Dom0 and assign it wholesale to
a guest.

Granted, at that point the _DSM operations have to go through QEMU,
which ends up calling the dom0 ioctls on PMEM to do the operation
(like getting the SMART data).
> 
> Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Jan Beulich
>>> On 17.03.16 at 13:44,  wrote:
> On 03/17/16 05:04, Jan Beulich wrote:
>> >>> On 17.03.16 at 09:58,  wrote:
>> > On 03/16/16 09:23, Jan Beulich wrote:
>> >> >>> On 16.03.16 at 15:55,  wrote:
>> >> > On 03/16/16 08:23, Jan Beulich wrote:
>> >> >> >>> On 16.03.16 at 14:55,  wrote:
>> >> >> > On 03/16/16 07:16, Jan Beulich wrote:
>> >> >> >> And
>> >> >> >> talking of fragmentation - how do you mean to track guest
>> >> >> >> permissions for an unbounded number of address ranges?
>> >> >> >>
>> >> >> > 
>> >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a 
>> >> >> > lot
>> >> >> > of memory, so I think they are another candidate that should be put 
>> >> >> > in
>> >> >> > the reserved area on NVDIMM. If we only allow to grant access
>> >> >> > permissions to NVDIMM page by page (rather than byte), the number of
>> >> >> > range structs for each NVDIMM in the worst case is still decidable.
>> >> >> 
>> >> >> Of course the permission granularity is going to by pages, not
>> >> >> bytes (or else we couldn't allow the pages to be mapped into
>> >> >> guest address space). And the limit on the per-domain range
>> >> >> sets isn't going to be allowed to be bumped significantly, at
>> >> >> least not for any of the existing ones (or else you'd have to
>> >> >> prove such bumping can't be abused).
>> >> > 
>> >> > What is that limit? the total number of range structs in per-domain
>> >> > range sets? I must miss something when looking through 'case
>> >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
>> >> > limit, unless it means alloc_range() will fail when there are lots of
>> >> > range structs.
>> >> 
>> >> Oh, I'm sorry, that was a different set of range sets I was
>> >> thinking about. But note that excessive creation of ranges
>> >> through XEN_DOMCTL_iomem_permission is not a security issue
>> >> just because of XSA-77, i.e. we'd still not knowingly allow a
>> >> severe increase here.
>> >>
>> > 
>> > I didn't notice that multiple domains can all have access permission
>> > to an iomem range, i.e. there can be multiple range structs for a
>> > single iomem range. If range structs for NVDIMM are put on NVDIMM,
>> > then there would be still a huge amount of them on NVDIMM in the worst
>> > case (maximum number of domains * number of NVDIMM pages).
>> > 
>> > A workaround is to only allow a range of NVDIMM pages be accessed by a
>> > single domain. Whenever we add the access permission of NVDIMM pages
>> > to a domain, we also remove the permission from its current
>> > grantee. In this way, we only need to put 'number of NVDIMM pages'
>> > range structs on NVDIMM in the worst case.
>> 
>> But will this work? There's a reason multiple domains are permitted
>> access: The domain running qemu for the guest, for example,
>> needs to be able to access guest memory.
>>
> 
> QEMU now only maintains ACPI tables and emulates _DSM for vNVDIMM
> which both do not need to access NVDIMM pages mapped to guest.

For one - this was only an example. And then - iirc qemu keeps
mappings of certain guest RAM ranges. If I'm remembering this
right, then why would it be excluded that it also may need
mappings of guest NVDIMM?

>> No matter how much you and others are opposed to this, I can't
>> help myself thinking that PMEM regions should be treated like RAM
>> (and hence be under full control of Xen), whereas PBLK regions
>> could indeed be treated like MMIO (and hence partly be under the
>> control of Dom0).
>>
> 
> Hmm, making Xen has full control could at least make reserving space
> on NVDIMM easier. I guess full control does not include manipulating
> file systems on NVDIMM which can be still left to dom0?
> 
> Then there is another problem (which also exists in the current
> design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
> that access label storage area (for namespace) for example:
> 
> The way Linux reserving space on pmem mode NVDIMM is to leave the
> reserved space at the beginning of pmem mode NVDIMM and create a pmem
> namespace which starts from the end of the reserved space. Because the
> reservation information is written in the namespace in the NVDIMM
> label storage area, every OS that follows the namespace spec would not
> mistakenly write files in the reserved area. I prefer to the same way
> if Xen is going to do the reservation. We definitely don't want dom0
> to break the label storage area, so Xen seemingly needs to emulate the
> corresponding _DSM functions for dom0? If so, which part, the
> hypervisor or the toolstack, should do the emulation?

I don't think I can answer all but the very last point: Of course this
can't be done in the tool stack, since afaict the Dom0 kernel will
want to evaluate _DSM before the tool stack even runs.

Jan


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Haozhong Zhang
On 03/17/16 14:00, Ian Jackson wrote:
> Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support 
> for Xen"):
> > QEMU keeps mappings of guest memory because (1) that mapping is
> > created by itself, and/or (2) certain device emulation needs to access
> > the guest memory. But for vNVDIMM, I'm going to move the creation of
> > its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
> > access vNVDIMM pages mapped to guest, so it's not necessary to let
> > qemu keeps vNVDIMM mappings.
> 
> I'm confused by this.
> 
> Suppose a guest uses an emulated device (or backend) provided by qemu,
> to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
> pages into its own address space, so that it can write to the memory
> (ie, do the virtual DMA).
> 
> That virtual DMA might well involve a direct mapping in the kernel
> underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
> directly to the NVDIMM, and with luck the actual device backing the
> virtual device will be able to DMA to the NVDIMM.
> 
> All of this seems to me to mean that qemu needs to be able to map
> its guest's parts of NVDIMMs
> 
> There are probably other example: memory inspection systems used by
> virus scanners etc.; debuggers used to inspect a guest from outside;
> etc.
> 
> I haven't even got started on save/restore...
> 

Oops, so many cases I missed. Thanks Ian for pointing out all these!
Now I need to reconsider how to manage guest permissions for NVDIMM pages.

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Zhang, Haozhong
On 03/17/16 22:12, Xu, Quan wrote:
> On March 17, 2016 9:37pm, Haozhong Zhang  wrote:
> > For PV guests (if we add vNVDIMM support for them in future), as I'm going 
> > to
> > use page_info struct for it, I suppose the current mechanism in Xen can 
> > handle
> > this case. I'm not familiar with PV memory management 
> 
> The below web may be helpful:
> http://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management
> 
> :)
> Quan
> 

Thanks!

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Haozhong Zhang
On 03/16/16 08:23, Jan Beulich wrote:
> >>> On 16.03.16 at 14:55,  wrote:
> > On 03/16/16 07:16, Jan Beulich wrote:
> >> Which reminds me: When considering a file on NVDIMM, how
> >> are you making sure the mapping of the file to disk (i.e.
> >> memory) blocks doesn't change while the guest has access
> >> to it, e.g. due to some defragmentation going on?
> > 
> > The current linux kernel 4.5 has an experimental "raw device dax
> > support" (enabled by removing "depends on BROKEN" from "config
> > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver
> > developers are going to make it non-broken in linux kernel 4.6.
> 
> But there you talk about full devices, whereas my question was
> for files.
>

The raw device dax support is for files on NVDIMM.

> >> And
> >> talking of fragmentation - how do you mean to track guest
> >> permissions for an unbounded number of address ranges?
> >>
> > 
> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
> > of memory, so I think they are another candidate that should be put in
> > the reserved area on NVDIMM. If we only allow to grant access
> > permissions to NVDIMM page by page (rather than byte), the number of
> > range structs for each NVDIMM in the worst case is still decidable.
> 
> Of course the permission granularity is going to by pages, not
> bytes (or else we couldn't allow the pages to be mapped into
> guest address space). And the limit on the per-domain range
> sets isn't going to be allowed to be bumped significantly, at
> least not for any of the existing ones (or else you'd have to
> prove such bumping can't be abused).

What is that limit? The total number of range structs in the per-domain
range sets? I must have missed something when looking through 'case
XEN_DOMCTL_iomem_permission' of do_domctl(), as I didn't find that
limit, unless it means alloc_range() will fail when there are lots of
range structs.

> Putting such control
> structures on NVDIMM is a nice idea, but following our isolation
> model for normal memory, any such memory used by Xen
> would then need to be (made) inaccessible to Dom0.
>

I'm not clear how this is done. By marking those inaccessible pages as
not-present in dom0's page tables? Is there any example I can follow?

Thanks,
Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Konrad Rzeszutek Wilk
On Wed, Mar 16, 2016 at 08:55:08PM +0800, Haozhong Zhang wrote:
> Hi Jan and Konrad,
> 
> On 03/04/16 15:30, Haozhong Zhang wrote:
> > Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM
> > or files on NVDIMM. We can move that work to toolstack and pass SPA
> > ranges got by toolstack to qemu. In this way, no privileged operations
> > (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to
> > work even with vNVDIMM hotplug in future.
> > 
> 
> As I'm going to let toolstack to get NVDIMM SPA ranges. This can be
> done via dom0 kernel interface and xen hypercalls, and can be
> implemented in different ways. I'm wondering which of the following
> ones is preferred by xen.
> 
> 1. Given
> * a file descriptor of either a NVDIMM device or a file on NVDIMM, and
> * domain id and guest MFN where vNVDIMM is going to be.
>xen toolstack (1) gets it SPA ranges via dom0 kernel interface
>(e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map
>above SPA ranges to the given guest MFN of the given domain.
> 
> 2. Or, given the same inputs, we may combine above two steps into a new
>dom0 system call that (1) gets the SPA ranges, (2) calls xen
>hypercall to map SPA ranges, and, one step further, (3) returns SPA
>ranges to userspace (because QEMU needs these addresses to build ACPI).
> 
> The first way does not need to modify dom0 linux kernel, while the
> second requires a new system call. I'm not sure whether xen toolstack
> as a userspace program is considered to be safe to pass the host physical
> address to hypervisor. If not, maybe the second one is better?

Well, the toolstack does it already (for MMIO ranges of PCIe devices
and such).

I would prefer 1) as it means less kernel code.
> 
> Thanks,
> Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Jan Beulich
>>> On 17.03.16 at 14:29,  wrote:
> On 03/17/16 06:59, Jan Beulich wrote:
>> >>> On 17.03.16 at 13:44,  wrote:
>> > Hmm, making Xen has full control could at least make reserving space
>> > on NVDIMM easier. I guess full control does not include manipulating
>> > file systems on NVDIMM which can be still left to dom0?
>> > 
>> > Then there is another problem (which also exists in the current
>> > design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
>> > that access label storage area (for namespace) for example:
>> > 
>> > The way Linux reserving space on pmem mode NVDIMM is to leave the
>> > reserved space at the beginning of pmem mode NVDIMM and create a pmem
>> > namespace which starts from the end of the reserved space. Because the
>> > reservation information is written in the namespace in the NVDIMM
>> > label storage area, every OS that follows the namespace spec would not
>> > mistakenly write files in the reserved area. I prefer to the same way
>> > if Xen is going to do the reservation. We definitely don't want dom0
>> > to break the label storage area, so Xen seemingly needs to emulate the
>> > corresponding _DSM functions for dom0? If so, which part, the
>> > hypervisor or the toolstack, should do the emulation?
>> 
>> I don't think I can answer all but the very last point: Of course this
>> can't be done in the tool stack, since afaict the Dom0 kernel will
>> want to evaluate _DSM before the tool stack even runs.
> 
> Or, we could modify dom0 kernel to just use the label storage area as is
> and does not modify it. Can xen hypervisor trust dom0 kernel in this aspect?

I think so, yes.

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Haozhong Zhang
On 03/16/16 07:16, Jan Beulich wrote:
> >>> On 16.03.16 at 13:55,  wrote:
> > Hi Jan and Konrad,
> > 
> > On 03/04/16 15:30, Haozhong Zhang wrote:
> >> Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM
> >> or files on NVDIMM. We can move that work to toolstack and pass SPA
> >> ranges got by toolstack to qemu. In this way, no privileged operations
> >> (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to
> >> work even with vNVDIMM hotplug in future.
> >> 
> > 
> > As I'm going to let toolstack to get NVDIMM SPA ranges. This can be
> > done via dom0 kernel interface and xen hypercalls, and can be
> > implemented in different ways. I'm wondering which of the following
> > ones is preferred by xen.
> > 
> > 1. Given
> > * a file descriptor of either a NVDIMM device or a file on NVDIMM, and
> > * domain id and guest MFN where vNVDIMM is going to be.
> >xen toolstack (1) gets it SPA ranges via dom0 kernel interface
> >(e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map
> >above SPA ranges to the given guest MFN of the given domain.
> > 
> > 2. Or, given the same inputs, we may combine above two steps into a new
> >dom0 system call that (1) gets the SPA ranges, (2) calls xen
> >hypercall to map SPA ranges, and, one step further, (3) returns SPA
> >ranges to userspace (because QEMU needs these addresses to build ACPI).
> 
> DYM GPA here? Qemu should hardly have a need for SPA when
> wanting to build ACPI tables for the guest.
>

Oh, it should be GPA for QEMU and (3) is not needed.

> > The first way does not need to modify dom0 linux kernel, while the
> > second requires a new system call. I'm not sure whether xen toolstack
> > as a userspace program is considered to be safe to pass the host physical
> > address to hypervisor. If not, maybe the second one is better?
> 
> As long as the passing of physical addresses follows to model
> of MMIO for passed through PCI devices, I don't think there's
> problem with the tool stack bypassing the Dom0 kernel. So it
> really all depends on how you make sure that the guest won't
> get to see memory it has no permission to access.
>

So the toolstack should first use XEN_DOMCTL_iomem_permission to grant
permissions to the guest and then call XEN_DOMCTL_memory_mapping for
the mapping.
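
A rough libxc-level sketch of that two-step flow, modelled on what is
done today for passed-through MMIO (error handling trimmed; the frame
numbers are placeholders supplied by the caller):

#include <xenctrl.h>

static int map_pmem_to_guest(uint32_t domid,
                             unsigned long spa_mfn,  /* host SPA >> PAGE_SHIFT */
                             unsigned long gfn,      /* guest frame to map at */
                             unsigned long nr_pages)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    int rc;

    if (!xch)
        return -1;

    /* 1) grant the guest permission to access the host frames */
    rc = xc_domain_iomem_permission(xch, domid, spa_mfn, nr_pages, 1);

    /* 2) establish the p2m mapping (last argument: 1 == add the mapping) */
    if (!rc)
        rc = xc_domain_memory_mapping(xch, domid, gfn, spa_mfn, nr_pages, 1);

    xc_interface_close(xch);
    return rc;
}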

> Which reminds me: When considering a file on NVDIMM, how
> are you making sure the mapping of the file to disk (i.e.
> memory) blocks doesn't change while the guest has access
> to it, e.g. due to some defragmentation going on?

The current linux kernel 4.5 has an experimental "raw device dax
support" (enabled by removing "depends on BROKEN" from "config
BLK_DEV_DAX") which can guarantee a consistent mapping. The driver
developers are going to make it non-broken in linux kernel 4.6.

> And
> talking of fragmentation - how do you mean to track guest
> permissions for an unbounded number of address ranges?
>

In this case the range structs in iomem_caps for NVDIMMs may consume a
lot of memory, so I think they are another candidate that should be put
in the reserved area on NVDIMM. If we only allow granting access
permissions to NVDIMM page by page (rather than byte by byte), the
number of range structs for each NVDIMM in the worst case is still
bounded.

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Jan Beulich
>>> On 16.03.16 at 13:55,  wrote:
> Hi Jan and Konrad,
> 
> On 03/04/16 15:30, Haozhong Zhang wrote:
>> Suddenly realize it's unnecessary to let QEMU get SPA ranges of NVDIMM
>> or files on NVDIMM. We can move that work to toolstack and pass SPA
>> ranges got by toolstack to qemu. In this way, no privileged operations
>> (mmap/mlock/...) are needed in QEMU and non-root QEMU should be able to
>> work even with vNVDIMM hotplug in future.
>> 
> 
> As I'm going to let toolstack to get NVDIMM SPA ranges. This can be
> done via dom0 kernel interface and xen hypercalls, and can be
> implemented in different ways. I'm wondering which of the following
> ones is preferred by xen.
> 
> 1. Given
> * a file descriptor of either a NVDIMM device or a file on NVDIMM, and
> * domain id and guest MFN where vNVDIMM is going to be.
>xen toolstack (1) gets it SPA ranges via dom0 kernel interface
>(e.g. sysfs and ioctl FIEMAP), and (2) calls a hypercall to map
>above SPA ranges to the given guest MFN of the given domain.
> 
> 2. Or, given the same inputs, we may combine above two steps into a new
>dom0 system call that (1) gets the SPA ranges, (2) calls xen
>hypercall to map SPA ranges, and, one step further, (3) returns SPA
>ranges to userspace (because QEMU needs these addresses to build ACPI).

DYM GPA here? Qemu should hardly have a need for SPA when
wanting to build ACPI tables for the guest.

> The first way does not need to modify dom0 linux kernel, while the
> second requires a new system call. I'm not sure whether xen toolstack
> as a userspace program is considered to be safe to pass the host physical
> address to hypervisor. If not, maybe the second one is better?

As long as the passing of physical addresses follows the model
of MMIO for passed-through PCI devices, I don't think there's a
problem with the tool stack bypassing the Dom0 kernel. So it
really all depends on how you make sure that the guest won't
get to see memory it has no permission to access.

Which reminds me: When considering a file on NVDIMM, how
are you making sure the mapping of the file to disk (i.e.
memory) blocks doesn't change while the guest has access
to it, e.g. due to some defragmentation going on? And
talking of fragmentation - how do you mean to track guest
permissions for an unbounded number of address ranges?

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Haozhong Zhang
On 03/16/16 09:23, Jan Beulich wrote:
> >>> On 16.03.16 at 15:55,  wrote:
> > On 03/16/16 08:23, Jan Beulich wrote:
> >> >>> On 16.03.16 at 14:55,  wrote:
> >> > On 03/16/16 07:16, Jan Beulich wrote:
> >> >> Which reminds me: When considering a file on NVDIMM, how
> >> >> are you making sure the mapping of the file to disk (i.e.
> >> >> memory) blocks doesn't change while the guest has access
> >> >> to it, e.g. due to some defragmentation going on?
> >> > 
> >> > The current linux kernel 4.5 has an experimental "raw device dax
> >> > support" (enabled by removing "depends on BROKEN" from "config
> >> > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver
> >> > developers are going to make it non-broken in linux kernel 4.6.
> >> 
> >> But there you talk about full devices, whereas my question was
> >> for files.
> >>
> > 
> > the raw device dax support is for files on NVDIMM.
> 
> Okay, I can only trust you here. I thought FS_DAX is the file level
> thing.
> 
> >> >> And
> >> >> talking of fragmentation - how do you mean to track guest
> >> >> permissions for an unbounded number of address ranges?
> >> >>
> >> > 
> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
> >> > of memory, so I think they are another candidate that should be put in
> >> > the reserved area on NVDIMM. If we only allow to grant access
> >> > permissions to NVDIMM page by page (rather than byte), the number of
> >> > range structs for each NVDIMM in the worst case is still decidable.
> >> 
> >> Of course the permission granularity is going to by pages, not
> >> bytes (or else we couldn't allow the pages to be mapped into
> >> guest address space). And the limit on the per-domain range
> >> sets isn't going to be allowed to be bumped significantly, at
> >> least not for any of the existing ones (or else you'd have to
> >> prove such bumping can't be abused).
> > 
> > What is that limit? the total number of range structs in per-domain
> > range sets? I must miss something when looking through 'case
> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> > limit, unless it means alloc_range() will fail when there are lots of
> > range structs.
> 
> Oh, I'm sorry, that was a different set of range sets I was
> thinking about. But note that excessive creation of ranges
> through XEN_DOMCTL_iomem_permission is not a security issue
> just because of XSA-77, i.e. we'd still not knowingly allow a
> severe increase here.
>

I didn't notice that multiple domains can all have access permission
to an iomem range, i.e. there can be multiple range structs for a
single iomem range. If the range structs for NVDIMM are put on NVDIMM,
there could still be a huge number of them on NVDIMM in the worst
case (maximum number of domains * number of NVDIMM pages).

A workaround is to allow a range of NVDIMM pages to be accessed by
only a single domain at a time. Whenever we add the access permission
for NVDIMM pages to a domain, we also remove the permission from its
current grantee. In this way, we only need to put 'number of NVDIMM
pages' range structs on NVDIMM in the worst case.
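
To put a rough number on that worst case (the 4 KiB page size and the
per-range overhead below are back-of-envelope assumptions of mine, not
figures from this thread): a 1 TiB pmem region has 1 TiB / 4 KiB =
268,435,456 pages, so "one range struct per page" at roughly 32-48
bytes per struct is on the order of 8-12 GiB, which is why bounding the
number of range structs, or keeping them off normal RAM, matters.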

> >> Putting such control
> >> structures on NVDIMM is a nice idea, but following our isolation
> >> model for normal memory, any such memory used by Xen
> >> would then need to be (made) inaccessible to Dom0.
> > 
> > I'm not clear how this is done. By marking those inaccessible pages as
> > unpresent in dom0's page table? Or any example I can follow?
> 
> That's the problem - so far we had no need to do so since Dom0
> was only ever allowed access to memory Xen didn't use for itself
> or knows it wants to share. Whereas now you want such a
> resource controlled first by Dom0, and only then handed to Xen.
> So yes, Dom0 would need to zap any mappings of these pages
> (and Xen would need to verify that, which would come mostly
> without new code as long as struct page_info gets properly
> used for all this memory) before Xen could use it. Much like
> ballooning out a normal RAM page.
> 

Thanks, I'll look into this balloon approach.

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Haozhong Zhang
On 03/17/16 06:59, Jan Beulich wrote:
> >>> On 17.03.16 at 13:44,  wrote:
> > On 03/17/16 05:04, Jan Beulich wrote:
> >> >>> On 17.03.16 at 09:58,  wrote:
> >> > On 03/16/16 09:23, Jan Beulich wrote:
> >> >> >>> On 16.03.16 at 15:55,  wrote:
> >> >> > On 03/16/16 08:23, Jan Beulich wrote:
> >> >> >> >>> On 16.03.16 at 14:55,  wrote:
> >> >> >> > On 03/16/16 07:16, Jan Beulich wrote:
> >> >> >> >> And
> >> >> >> >> talking of fragmentation - how do you mean to track guest
> >> >> >> >> permissions for an unbounded number of address ranges?
> >> >> >> >>
> >> >> >> > 
> >> >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a 
> >> >> >> > lot
> >> >> >> > of memory, so I think they are another candidate that should be 
> >> >> >> > put in
> >> >> >> > the reserved area on NVDIMM. If we only allow to grant access
> >> >> >> > permissions to NVDIMM page by page (rather than byte), the number 
> >> >> >> > of
> >> >> >> > range structs for each NVDIMM in the worst case is still decidable.
> >> >> >> 
> >> >> >> Of course the permission granularity is going to by pages, not
> >> >> >> bytes (or else we couldn't allow the pages to be mapped into
> >> >> >> guest address space). And the limit on the per-domain range
> >> >> >> sets isn't going to be allowed to be bumped significantly, at
> >> >> >> least not for any of the existing ones (or else you'd have to
> >> >> >> prove such bumping can't be abused).
> >> >> > 
> >> >> > What is that limit? the total number of range structs in per-domain
> >> >> > range sets? I must miss something when looking through 'case
> >> >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> >> >> > limit, unless it means alloc_range() will fail when there are lots of
> >> >> > range structs.
> >> >> 
> >> >> Oh, I'm sorry, that was a different set of range sets I was
> >> >> thinking about. But note that excessive creation of ranges
> >> >> through XEN_DOMCTL_iomem_permission is not a security issue
> >> >> just because of XSA-77, i.e. we'd still not knowingly allow a
> >> >> severe increase here.
> >> >>
> >> > 
> >> > I didn't notice that multiple domains can all have access permission
> >> > to an iomem range, i.e. there can be multiple range structs for a
> >> > single iomem range. If range structs for NVDIMM are put on NVDIMM,
> >> > then there would be still a huge amount of them on NVDIMM in the worst
> >> > case (maximum number of domains * number of NVDIMM pages).
> >> > 
> >> > A workaround is to only allow a range of NVDIMM pages be accessed by a
> >> > single domain. Whenever we add the access permission of NVDIMM pages
> >> > to a domain, we also remove the permission from its current
> >> > grantee. In this way, we only need to put 'number of NVDIMM pages'
> >> > range structs on NVDIMM in the worst case.
> >> 
> >> But will this work? There's a reason multiple domains are permitted
> >> access: The domain running qemu for the guest, for example,
> >> needs to be able to access guest memory.
> >>
> > 
> > QEMU now only maintains ACPI tables and emulates _DSM for vNVDIMM
> > which both do not need to access NVDIMM pages mapped to guest.
> 
> For one - this was only an example. And then - iirc qemu keeps
> mappings of certain guest RAM ranges. If I'm remembering this
> right, then why would it be excluded that it also may need
> mappings of guest NVDIMM?
>

QEMU keeps mappings of guest memory because (1) that mapping is
created by itself, and/or (2) certain device emulation needs to access
the guest memory. But for vNVDIMM, I'm going to move the creation of
its mappings out of qemu to the toolstack, and vNVDIMM in QEMU does not
access vNVDIMM pages mapped to the guest, so it's not necessary to let
qemu keep vNVDIMM mappings.

> >> No matter how much you and others are opposed to this, I can't
> >> help myself thinking that PMEM regions should be treated like RAM
> >> (and hence be under full control of Xen), whereas PBLK regions
> >> could indeed be treated like MMIO (and hence partly be under the
> >> control of Dom0).
> >>
> > 
> > Hmm, making Xen has full control could at least make reserving space
> > on NVDIMM easier. I guess full control does not include manipulating
> > file systems on NVDIMM which can be still left to dom0?
> > 
> > Then there is another problem (which also exists in the current
> > design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
> > that access label storage area (for namespace) for example:
> > 
> > The way Linux reserving space on pmem mode NVDIMM is to leave the
> > reserved space at the beginning of pmem mode NVDIMM and create a pmem
> > namespace which starts from the end of the reserved space. Because the
> > reservation information is written in the namespace in the NVDIMM
> > label storage area, every OS that follows the namespace spec would not
> > mistakenly write files in the reserved area.

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Haozhong Zhang
On 03/17/16 11:05, Ian Jackson wrote:
> Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for 
> Xen"):
> > So that again leaves unaddressed the question of what you
> > imply to do when a guest elects to use such a page as page
> > table. I'm afraid any attempt of yours to invent something that
> > is not struct page_info will not be suitable for all possible needs.
> 
> It is not clear to me whether this is a realistic thing for a guest to
> want to do.  Haozhong, maybe you want to consider this aspect.
>

For HVM guests, it is their own responsibility not to grant (e.g. in
xen-blk/net drivers) a vNVDIMM page containing page tables to others.

For PV guests (if we add vNVDIMM support for them in future), as I'm
going to use the page_info struct for it, I suppose the current
mechanism in Xen can handle this case. I'm not familiar with PV memory
management and have to admit I didn't find the exact code that handles
the case where a memory page contains a guest page table. Jan, could
you point me to the code I can follow to understand what xen does in
this case?

Thanks,
Haozhong

> If you can come up with an argument why it is OK to simply not permit
> this, then maybe the recordkeeping requirements can be relaxed ?
> 
> Ian.



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Jan Beulich
>>> On 16.03.16 at 14:55,  wrote:
> On 03/16/16 07:16, Jan Beulich wrote:
>> Which reminds me: When considering a file on NVDIMM, how
>> are you making sure the mapping of the file to disk (i.e.
>> memory) blocks doesn't change while the guest has access
>> to it, e.g. due to some defragmentation going on?
> 
> The current linux kernel 4.5 has an experimental "raw device dax
> support" (enabled by removing "depends on BROKEN" from "config
> BLK_DEV_DAX") which can guarantee the consistent mapping. The driver
> developers are going to make it non-broken in linux kernel 4.6.

But there you talk about full devices, whereas my question was
for files.

>> And
>> talking of fragmentation - how do you mean to track guest
>> permissions for an unbounded number of address ranges?
>>
> 
> In this case range structs in iomem_caps for NVDIMMs may consume a lot
> of memory, so I think they are another candidate that should be put in
> the reserved area on NVDIMM. If we only allow to grant access
> permissions to NVDIMM page by page (rather than byte), the number of
> range structs for each NVDIMM in the worst case is still decidable.

Of course the permission granularity is going to be pages, not
bytes (or else we couldn't allow the pages to be mapped into
guest address space). And the limit on the per-domain range
sets isn't going to be allowed to be bumped significantly, at
least not for any of the existing ones (or else you'd have to
prove such bumping can't be abused). Putting such control
structures on NVDIMM is a nice idea, but following our isolation
model for normal memory, any such memory used by Xen
would then need to be (made) inaccessible to Dom0.

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Xu, Quan
On March 17, 2016 9:37pm, Haozhong Zhang  wrote:
> For PV guests (if we add vNVDIMM support for them in future), as I'm going to
> use page_info struct for it, I suppose the current mechanism in Xen can handle
> this case. I'm not familiar with PV memory management 

The below web may be helpful:
http://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management

:)
Quan

> and have to admit I
> didn't find the exact code that handles the case that a memory page contains
> the guest page table. Jan, could you indicate the code that I can follow to
> understand what xen does in this case?



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Jan Beulich
>>> On 16.03.16 at 15:55,  wrote:
> On 03/16/16 08:23, Jan Beulich wrote:
>> >>> On 16.03.16 at 14:55,  wrote:
>> > On 03/16/16 07:16, Jan Beulich wrote:
>> >> Which reminds me: When considering a file on NVDIMM, how
>> >> are you making sure the mapping of the file to disk (i.e.
>> >> memory) blocks doesn't change while the guest has access
>> >> to it, e.g. due to some defragmentation going on?
>> > 
>> > The current linux kernel 4.5 has an experimental "raw device dax
>> > support" (enabled by removing "depends on BROKEN" from "config
>> > BLK_DEV_DAX") which can guarantee the consistent mapping. The driver
>> > developers are going to make it non-broken in linux kernel 4.6.
>> 
>> But there you talk about full devices, whereas my question was
>> for files.
>>
> 
> the raw device dax support is for files on NVDIMM.

Okay, I can only trust you here. I thought FS_DAX was the file-level
thing.

>> >> And
>> >> talking of fragmentation - how do you mean to track guest
>> >> permissions for an unbounded number of address ranges?
>> >>
>> > 
>> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
>> > of memory, so I think they are another candidate that should be put in
>> > the reserved area on NVDIMM. If we only allow to grant access
>> > permissions to NVDIMM page by page (rather than byte), the number of
>> > range structs for each NVDIMM in the worst case is still decidable.
>> 
>> Of course the permission granularity is going to by pages, not
>> bytes (or else we couldn't allow the pages to be mapped into
>> guest address space). And the limit on the per-domain range
>> sets isn't going to be allowed to be bumped significantly, at
>> least not for any of the existing ones (or else you'd have to
>> prove such bumping can't be abused).
> 
> What is that limit? the total number of range structs in per-domain
> range sets? I must miss something when looking through 'case
> XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> limit, unless it means alloc_range() will fail when there are lots of
> range structs.

Oh, I'm sorry, that was a different set of range sets I was
thinking about. But note that excessive creation of ranges
through XEN_DOMCTL_iomem_permission is not a security issue
just because of XSA-77, i.e. we'd still not knowingly allow a
severe increase here.

>> Putting such control
>> structures on NVDIMM is a nice idea, but following our isolation
>> model for normal memory, any such memory used by Xen
>> would then need to be (made) inaccessible to Dom0.
> 
> I'm not clear how this is done. By marking those inaccessible pages as
> unpresent in dom0's page table? Or any example I can follow?

That's the problem - so far we had no need to do so since Dom0
was only ever allowed access to memory Xen didn't use for itself
or knows it wants to share. Whereas now you want such a
resource controlled first by Dom0, and only then handed to Xen.
So yes, Dom0 would need to zap any mappings of these pages
(and Xen would need to verify that, which would come mostly
without new code as long as struct page_info gets properly
used for all this memory) before Xen could use it. Much like
ballooning out a normal RAM page.
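
For reference, the ballooning analogy in its current form looks roughly
like the (Linux-flavoured, heavily trimmed) sketch below. Whether the
same XENMEM_decrease_reservation pattern would be applied to NVDIMM
pages handed over to Xen is exactly the open question here, so this is
only an illustration of the analogy, not a proposed interface:

#include <xen/interface/xen.h>
#include <xen/interface/memory.h>
#include <asm/xen/hypercall.h>

static int balloon_out_one_page(xen_pfn_t frame)
{
    struct xen_memory_reservation reservation = {
        .nr_extents   = 1,
        .extent_order = 0,
        .domid        = DOMID_SELF,
    };

    /* the kernel must already have unmapped and stopped using the page */
    set_xen_guest_handle(reservation.extent_start, &frame);

    /* returns the number of extents actually released (1 on success) */
    return HYPERVISOR_memory_op(XENMEM_decrease_reservation, &reservation);
}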

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-18 Thread Jan Beulich
>>> On 17.03.16 at 14:37, <haozhong.zh...@intel.com> wrote:
> On 03/17/16 11:05, Ian Jackson wrote:
>> Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support 
>> for 
> Xen"):
>> > So that again leaves unaddressed the question of what you
>> > imply to do when a guest elects to use such a page as page
>> > table. I'm afraid any attempt of yours to invent something that
>> > is not struct page_info will not be suitable for all possible needs.
>> 
>> It is not clear to me whether this is a realistic thing for a guest to
>> want to do.  Haozhong, maybe you want to consider this aspect.
>>
> 
> For HVM guests, it's themselves responsibility to not grant (e.g. in
> xen-blk/net drivers) a vNVDIMM page containing page tables to others.
> 
> For PV guests (if we add vNVDIMM support for them in future), as I'm
> going to use page_info struct for it, I suppose the current mechanism
> in Xen can handle this case. I'm not familiar with PV memory
> management and have to admit I didn't find the exact code that handles
> the case that a memory page contains the guest page table. Jan, could
> you indicate the code that I can follow to understand what xen does in
> this case?

xen/arch/x86/mm.c has functions like __get_page_type(),
alloc_page_type(), alloc_l[1234]_table(), and mod_l[1234]_entry()
which all participate in this.

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-18 Thread Haozhong Zhang
On 03/17/16 07:56, Jan Beulich wrote:
> >>> On 17.03.16 at 14:37, <haozhong.zh...@intel.com> wrote:
> > On 03/17/16 11:05, Ian Jackson wrote:
> >> Jan Beulich writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support 
> >> for 
> > Xen"):
> >> > So that again leaves unaddressed the question of what you
> >> > imply to do when a guest elects to use such a page as page
> >> > table. I'm afraid any attempt of yours to invent something that
> >> > is not struct page_info will not be suitable for all possible needs.
> >> 
> >> It is not clear to me whether this is a realistic thing for a guest to
> >> want to do.  Haozhong, maybe you want to consider this aspect.
> >>
> > 
> > For HVM guests, it's themselves responsibility to not grant (e.g. in
> > xen-blk/net drivers) a vNVDIMM page containing page tables to others.
> > 
> > For PV guests (if we add vNVDIMM support for them in future), as I'm
> > going to use page_info struct for it, I suppose the current mechanism
> > in Xen can handle this case. I'm not familiar with PV memory
> > management and have to admit I didn't find the exact code that handles
> > the case that a memory page contains the guest page table. Jan, could
> > you indicate the code that I can follow to understand what xen does in
> > this case?
> 
> xen/arch/x86/mm.c has functions like __get_page_type(),
> alloc_page_type(), alloc_l[1234]_table(), and mod_l[1234]_entry()
> which all participate in this.
>

Thanks!

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-18 Thread Haozhong Zhang
On 03/17/16 05:04, Jan Beulich wrote:
> >>> On 17.03.16 at 09:58,  wrote:
> > On 03/16/16 09:23, Jan Beulich wrote:
> >> >>> On 16.03.16 at 15:55,  wrote:
> >> > On 03/16/16 08:23, Jan Beulich wrote:
> >> >> >>> On 16.03.16 at 14:55,  wrote:
> >> >> > On 03/16/16 07:16, Jan Beulich wrote:
> >> >> >> And
> >> >> >> talking of fragmentation - how do you mean to track guest
> >> >> >> permissions for an unbounded number of address ranges?
> >> >> >>
> >> >> > 
> >> >> > In this case range structs in iomem_caps for NVDIMMs may consume a lot
> >> >> > of memory, so I think they are another candidate that should be put in
> >> >> > the reserved area on NVDIMM. If we only allow to grant access
> >> >> > permissions to NVDIMM page by page (rather than byte), the number of
> >> >> > range structs for each NVDIMM in the worst case is still decidable.
> >> >> 
> >> >> Of course the permission granularity is going to by pages, not
> >> >> bytes (or else we couldn't allow the pages to be mapped into
> >> >> guest address space). And the limit on the per-domain range
> >> >> sets isn't going to be allowed to be bumped significantly, at
> >> >> least not for any of the existing ones (or else you'd have to
> >> >> prove such bumping can't be abused).
> >> > 
> >> > What is that limit? the total number of range structs in per-domain
> >> > range sets? I must miss something when looking through 'case
> >> > XEN_DOMCTL_iomem_permission' of do_domctl() and didn't find that
> >> > limit, unless it means alloc_range() will fail when there are lots of
> >> > range structs.
> >> 
> >> Oh, I'm sorry, that was a different set of range sets I was
> >> thinking about. But note that excessive creation of ranges
> >> through XEN_DOMCTL_iomem_permission is not a security issue
> >> just because of XSA-77, i.e. we'd still not knowingly allow a
> >> severe increase here.
> >>
> > 
> > I didn't notice that multiple domains can all have access permission
> > to an iomem range, i.e. there can be multiple range structs for a
> > single iomem range. If range structs for NVDIMM are put on NVDIMM,
> > then there would be still a huge amount of them on NVDIMM in the worst
> > case (maximum number of domains * number of NVDIMM pages).
> > 
> > A workaround is to only allow a range of NVDIMM pages be accessed by a
> > single domain. Whenever we add the access permission of NVDIMM pages
> > to a domain, we also remove the permission from its current
> > grantee. In this way, we only need to put 'number of NVDIMM pages'
> > range structs on NVDIMM in the worst case.
> 
> But will this work? There's a reason multiple domains are permitted
> access: The domain running qemu for the guest, for example,
> needs to be able to access guest memory.
>

QEMU now only maintains the ACPI tables and emulates _DSM for vNVDIMM,
neither of which needs to access NVDIMM pages mapped to the guest.

> No matter how much you and others are opposed to this, I can't
> help myself thinking that PMEM regions should be treated like RAM
> (and hence be under full control of Xen), whereas PBLK regions
> could indeed be treated like MMIO (and hence partly be under the
> control of Dom0).
>

Hmm, making Xen have full control could at least make reserving space
on NVDIMM easier. I guess full control does not include manipulating
file systems on NVDIMM, which can still be left to dom0?

Then there is another problem (which also exists in the current
design): does Xen need to emulate NVDIMM _DSM for dom0? Take the _DSM
that accesses the label storage area (for namespaces) as an example:

The way Linux reserves space on a pmem mode NVDIMM is to leave the
reserved space at the beginning of the pmem mode NVDIMM and create a
pmem namespace which starts from the end of the reserved space. Because
the reservation information is recorded in the namespace in the NVDIMM
label storage area, every OS that follows the namespace spec would not
mistakenly write files in the reserved area. I prefer the same way
if Xen is going to do the reservation. We definitely don't want dom0
to break the label storage area, so Xen seemingly needs to emulate the
corresponding _DSM functions for dom0? If so, which part, the
hypervisor or the toolstack, should do the emulation?

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-18 Thread Ian Jackson
Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support 
for Xen"):
> QEMU keeps mappings of guest memory because (1) that mapping is
> created by itself, and/or (2) certain device emulation needs to access
> the guest memory. But for vNVDIMM, I'm going to move the creation of
> its mappings out of qemu to toolstack and vNVDIMM in QEMU does not
> access vNVDIMM pages mapped to guest, so it's not necessary to let
> qemu keeps vNVDIMM mappings.

I'm confused by this.

Suppose a guest uses an emulated device (or backend) provided by qemu,
to do DMA to an vNVDIMM.  Then qemu will need to map the real NVDIMM
pages into its own address space, so that it can write to the memory
(ie, do the virtual DMA).

That virtual DMA might well involve a direct mapping in the kernel
underlying qemu: ie, qemu might use O_DIRECT to have its kernel write
directly to the NVDIMM, and with luck the actual device backing the
virtual device will be able to DMA to the NVDIMM.

All of this seems to me to mean that qemu needs to be able to map
its guest's parts of NVDIMMs

There are probably other example: memory inspection systems used by
virus scanners etc.; debuggers used to inspect a guest from outside;
etc.

I haven't even got started on save/restore...

Ian.



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-09 Thread Haozhong Zhang
On 03/09/16 09:17, Jan Beulich wrote:
> >>> On 09.03.16 at 13:22,  wrote:
> > On 03/08/16 02:27, Jan Beulich wrote:
> >> >>> On 08.03.16 at 10:15,  wrote:
[...]
> > I should reexplain the choice of data structures and where to put them.
> > 
> > For handling MCE for NVDIMM, we need to track following data:
> > (1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which
> > are used to check whether a MCE is for NVDIMM.
> > (2) GFN to which a NVDIMM page is mapped, which is used to determine the
> > address put in vMCE.
> > (3) the domain to which a NVDIMM page is mapped, which is used to
> > determine whether a vMCE needs to be injected and where it will be
> > injected.
> > (4) a flag to mark whether a NVDIMM page is broken, which is used to
> > avoid mapping broken page to guests.
> > 
> > For granting NVDIMM pages (e.g. xen-blkback/netback),
> > (5) a reference counter is needed for each NVDIMM page
> > 
> > Above data can be organized as below:
> > 
> > * For (1) SPA ranges, we can record them in a global data structure,
> >   e.g. a list
> > 
> > struct list_head nvdimm_iset_list;
> > 
> > struct nvdimm_iset
> > {
> >  uint64_t   base;  /* starting SPA of this interleave set */
> >  uint64_t   size;  /* size of this interleave set */
> >  struct nvdimm_page *pages;  /* information for individual pages in this interleave set */
> >  struct list_head   list;
> > };
> > 
> > * For (2) GFN, an intuitive place to get this information is from M2P
> >   table machine_to_phys_mapping[].  However, the address of NVDIMM is
> >   not required to be contiguous with normal ram, so, if NVDIMM starts
> >   from an address that is much higher than the end address of normal
> >   ram, it may result in a M2P table that maybe too large to fit in the
> >   normal ram. Therefore, we choose to not put GFNs of NVDIMM in M2P
> >   table.
> 
> Any page that _may_ be used by a guest as normal RAM page
> must have its mach->phys translation entered in the M2P. That's
> because a r/o variant of that table is part of the hypervisor ABI
> for PV guests. Size considerations simply don't apply here - the
> table may be sparse (guests are required to deal with accesses
> potentially faulting), and the 256Gb of virtual address space set
> aside for it cover all memory up to the 47-bit boundary (there's
> room for doubling this). Memory at addresses with bit 47 (or
> higher) set would need a complete overhaul of that mechanism,
> and whatever new mechanism we may pick would mean old
> guests won't be able to benefit.
>

OK, then we can use M2P to get PFNs of NVDIMM pages. And ...

> >   Another possible solution is to extend page_info to include GFN for
> >   NVDIMM and use frame_table. A benefit of this solution is that other
> >   data (3)-(5) can be got from page_info as well. However, due to the
> >   same reason for machine_to_phys_mapping[] and the concern that the
> >   large number of page_info structures required for large NVDIMMs may
> >   consume lots of ram, page_info and frame_table seems not a good place
> >   either.
> 
> For this particular item struct page_info is the wrong place
> anyway, due to what I've said above. Also extension
> suggestions of struct page_info are quite problematic, as any
> such implies a measurable increase on the memory overhead
> the hypervisor incurs. Plus the structure right now is (with the
> exception of the bigmem configuration) a carefully arranged
> for power of two in size.
> 
> > * At the end, we choose to introduce a new data structure for above
> >   per-page data (2)-(5)
> > 
> > struct nvdimm_page
> > {
> > struct domain *domain;/* for (3) */
> > uint64_t  gfn;/* for (2) */
> > unsigned long count_info; /* for (4) and (5), same as 
> > page_info->count_info */
> > /* other fields if needed, e.g. lock */
> > }
> 
> So that again leaves unaddressed the question of what you
> imply to do when a guest elects to use such a page as page
> table. I'm afraid any attempt of yours to invent something that
> is not struct page_info will not be suitable for all possible needs.
>

... we can use the page_info struct rather than the nvdimm_page struct for
NVDIMM pages, and benefit from whatever has been done with page_info.

> >   On each NVDIMM interleave set, we could reserve an area to place an
> >   array of nvdimm_page structures for pages in that interleave set. In
> >   addition, the corresponding global nvdimm_iset structure is set to
> >   point to this array via its 'pages' field.
> 
> And I see no problem doing exactly that, just for an array of
> struct page_info.
>

Yes, page_info arrays.

Because page_info structs for NVDIMM may be put in NVDIMM, existing code
that gets page_info from frame_table needs to be adjusted for NVDIMM
pages to use the per-interleave-set arrays referenced from nvdimm_iset
instead.
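
A rough sketch of the shape this could take (illustrative only: it
assumes the nvdimm_iset list proposed earlier in this thread, with its
'pages' field now pointing at a struct page_info array, and it assumes
the NVDIMM SPA range is covered by Xen's direct map, which may not hold):

/* Reserve room for the page_info array at the start of an interleave
 * set and remember it in the nvdimm_iset (sketch, not existing code). */
static int nvdimm_iset_init_pages(struct nvdimm_iset *iset)
{
    unsigned long nr = iset->size >> PAGE_SHIFT;
    unsigned long res = ROUNDUP(nr * sizeof(struct page_info), PAGE_SIZE);

    if ( res >= iset->size )
        return -EINVAL;

    iset->pages = __va(iset->base);          /* direct-map assumption */
    memset(iset->pages, 0, res);
    /* Frames in [base, base + res) must never be handed to guests. */
    return 0;
}

/* Stand-in for mfn_to_page()/frame_table lookups on NVDIMM MFNs. */
static struct page_info *nvdimm_mfn_to_page(unsigned long mfn)
{
    struct nvdimm_iset *iset;
    paddr_t spa = (paddr_t)mfn << PAGE_SHIFT;

    list_for_each_entry ( iset, &nvdimm_iset_list, list )
        if ( spa >= iset->base && spa < iset->base + iset->size )
            return &iset->pages[(spa - iset->base) >> PAGE_SHIFT];

    return NULL;   /* not an NVDIMM frame */
}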

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-09 Thread Jan Beulich
>>> On 09.03.16 at 13:22,  wrote:
> On 03/08/16 02:27, Jan Beulich wrote:
>> >>> On 08.03.16 at 10:15,  wrote:
>> > More thoughts on reserving NVDIMM space for per-page structures
>> > 
>> > Currently, a per-page struct for managing mapping of NVDIMM pages may
>> > include following fields:
>> > 
>> > struct nvdimm_page
>> > {
>> > uint64_t mfn;/* MFN of SPA of this NVDIMM page */
>> > uint64_t gfn;/* GFN where this NVDIMM page is mapped */
>> > domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
>> > int  is_broken;  /* Is this NVDIMM page broken? (for MCE) */
>> > }
>> > 
>> > Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
>> > nvdimm_page structures would occupy 12 GB space, which is too hard to
>> > fit in the normal ram on a small memory host. However, for smaller
>> > NVDIMMs and/or hosts with large ram, those structures may still be able
>> > to fit in the normal ram. In the latter circumstance, nvdimm_page
>> > structures are stored in the normal ram, so they can be accessed more
>> > quickly.
>> 
>> Not sure how you came to the above structure - it's the first time
>> I see it, yet figuring out what information it needs to hold is what
>> this design process should be about. For example, I don't see why
>> it would need to duplicate M2P / P2M information. Nor do I see why
>> per-page data needs to hold the address of a page (struct
>> page_info also doesn't). And whether storing a domain ID (rather
>> than a pointer to struct domain, as in struct page_info) is the
>> correct thing is also to be determined (rather than just stated).
>> 
>> Otoh you make no provisions at all for any kind of ref counting.
>> What if a guest wants to put page tables into NVDIMM space?
>> 
>> Since all of your calculations are based upon that fixed assumption
>> on the structure layout, I'm afraid they're not very meaningful
>> without first settling on what data needs tracking in the first place.
> 
> I should reexplain the choice of data structures and where to put them.
> 
> For handling MCE for NVDIMM, we need to track following data:
> (1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which are
> used to check whether a MCE is for NVDIMM.
> (2) GFN to which a NVDIMM page is mapped, which is used to determine the
> address put in vMCE.
> (3) the domain to which a NVDIMM page is mapped, which is used to
> determine whether a vMCE needs to be injected and where it will be
> injected.
> (4) a flag to mark whether a NVDIMM page is broken, which is used to
> avoid mapping broken page to guests.
> 
> For granting NVDIMM pages (e.g. xen-blkback/netback),
> (5) a reference counter is needed for each NVDIMM page
> 
> Above data can be organized as below:
> 
> * For (1) SPA ranges, we can record them in a global data structure,
>   e.g. a list
> 
> struct list_head nvdimm_iset_list;
> 
> struct nvdimm_iset
> {
>  uint64_t   base;  /* starting SPA of this interleave set */
>  uint64_t   size;  /* size of this interleave set */
>  struct nvdimm_page *pages;/* information for individual pages in 
> this interleave set */
>  struct list_head   list;
> };
> 
> * For (2) GFN, an intuitive place to get this information is from M2P
>   table machine_to_phys_mapping[].  However, the address of NVDIMM is
>   not required to be contiguous with normal ram, so, if NVDIMM starts
>   from an address that is much higher than the end address of normal
>   ram, it may result in a M2P table that maybe too large to fit in the
>   normal ram. Therefore, we choose to not put GFNs of NVDIMM in M2P
>   table.

Any page that _may_ be used by a guest as normal RAM page
must have its mach->phys translation entered in the M2P. That's
because a r/o variant of that table is part of the hypervisor ABI
for PV guests. Size considerations simply don't apply here - the
table may be sparse (guests are required to deal with accesses
potentially faulting), and the 256Gb of virtual address space set
aside for it covers all memory up to the 47-bit boundary (there's
room for doubling this). Memory at addresses with bit 47 (or
higher) set would need a complete overhaul of that mechanism,
and whatever new mechanism we may pick would mean old
guests won't be able to benefit.
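
To make the implication concrete: the 256Gb figure is simply 2^47 bytes
of covered memory / 4K pages * 8 bytes per M2P entry = 2^38 bytes. A
minimal sketch of what keeping the M2P valid would mean when NVDIMM
frames are assigned to a guest, assuming the existing x86 helper
set_gpfn_from_mfn() can be used for NVDIMM MFNs as well (locking and
error handling omitted; not existing code):

/* Sketch only: record mach->phys translations for an NVDIMM range so
 * that PV guests reading the r/o M2P see valid entries. */
static void nvdimm_update_m2p(unsigned long smfn, unsigned long sgfn,
                              unsigned long nr_frames)
{
    unsigned long i;

    for ( i = 0; i < nr_frames; i++ )
        set_gpfn_from_mfn(smfn + i, sgfn + i);
}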

>   Another possible solution is to extend page_info to include GFN for
>   NVDIMM and use frame_table. A benefit of this solution is that other
>   data (3)-(5) can be got from page_info as well. However, due to the
>   same reason for machine_to_phys_mapping[] and the concern that the
>   large number of page_info structures required for large NVDIMMs may
>   consume lots of ram, page_info and frame_table seems not a good place
>   either.

For this particular item struct page_info is the wrong place
anyway, due to what I've said above. Also extension
suggestions of struct page_info are quite problematic, as any
such implies a measurable increase on the memory overhead
the hypervisor incurs. Plus the structure right now is (with the
exception of the bigmem configuration) carefully arranged to be
a power of two in size.

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-09 Thread Haozhong Zhang
On 03/08/16 02:27, Jan Beulich wrote:
> >>> On 08.03.16 at 10:15,  wrote:
> > More thoughts on reserving NVDIMM space for per-page structures
> > 
> > Currently, a per-page struct for managing mapping of NVDIMM pages may
> > include following fields:
> > 
> > struct nvdimm_page
> > {
> > uint64_t mfn;/* MFN of SPA of this NVDIMM page */
> > uint64_t gfn;/* GFN where this NVDIMM page is mapped */
> > domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
> > int  is_broken;  /* Is this NVDIMM page broken? (for MCE) */
> > }
> > 
> > Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
> > nvdimm_page structures would occupy 12 GB space, which is too hard to
> > fit in the normal ram on a small memory host. However, for smaller
> > NVDIMMs and/or hosts with large ram, those structures may still be able
> > to fit in the normal ram. In the latter circumstance, nvdimm_page
> > structures are stored in the normal ram, so they can be accessed more
> > quickly.
> 
> Not sure how you came to the above structure - it's the first time
> I see it, yet figuring out what information it needs to hold is what
> this design process should be about. For example, I don't see why
> it would need to duplicate M2P / P2M information. Nor do I see why
> per-page data needs to hold the address of a page (struct
> page_info also doesn't). And whether storing a domain ID (rather
> than a pointer to struct domain, as in struct page_info) is the
> correct thing is also to be determined (rather than just stated).
> 
> Otoh you make no provisions at all for any kind of ref counting.
> What if a guest wants to put page tables into NVDIMM space?
> 
> Since all of your calculations are based upon that fixed assumption
> on the structure layout, I'm afraid they're not very meaningful
> without first settling on what data needs tracking in the first place.
> 
> Jan
> 

I should reexplain the choice of data structures and where to put them.

For handling MCE for NVDIMM, we need to track following data:
(1) SPA ranges of host NVDIMMs (one range per pmem interleave set), which are
used to check whether a MCE is for NVDIMM.
(2) GFN to which a NVDIMM page is mapped, which is used to determine the
address put in vMCE.
(3) the domain to which a NVDIMM page is mapped, which is used to
determine whether a vMCE needs to be injected and where it will be
injected.
(4) a flag to mark whether a NVDIMM page is broken, which is used to
avoid mapping broken pages to guests.

For granting NVDIMM pages (e.g. xen-blkback/netback),
(5) a reference counter is needed for each NVDIMM page

Above data can be organized as below:

* For (1) SPA ranges, we can record them in a global data structure,
  e.g. a list

struct list_head nvdimm_iset_list;

struct nvdimm_iset
{
 uint64_t   base;  /* starting SPA of this interleave set */
 uint64_t   size;  /* size of this interleave set */
 struct nvdimm_page *pages;/* information for individual pages in this 
interleave set */
 struct list_head   list;
};

* For (2) GFN, an intuitive place to get this information is from M2P
  table machine_to_phys_mapping[].  However, the address of NVDIMM is
  not required to be contiguous with normal ram, so, if NVDIMM starts
  from an address that is much higher than the end address of normal
  ram, it may result in an M2P table that may be too large to fit in the
  normal ram. Therefore, we choose to not put GFNs of NVDIMM in M2P
  table.

  Another possible solution is to extend page_info to include GFN for
  NVDIMM and use frame_table. A benefit of this solution is that other
  data (3)-(5) can be got from page_info as well. However, due to the
  same reason for machine_to_phys_mapping[] and the concern that the
  large number of page_info structures required for large NVDIMMs may
  consume lots of ram, page_info and frame_table do not seem to be a good
  place either.

* At the end, we choose to introduce a new data structure for above
  per-page data (2)-(5)

struct nvdimm_page
{
struct domain *domain;/* for (3) */
uint64_t  gfn;/* for (2) */
unsigned long count_info; /* for (4) and (5), same as 
page_info->count_info */
/* other fields if needed, e.g. lock */
}

  (MFN is not needed indeed)

  On each NVDIMM interleave set, we could reserve an area to place an
  array of nvdimm_page structures for pages in that interleave set. In
  addition, the corresponding global nvdimm_iset structure is set to
  point to this array via its 'pages' field.

* One disadvantage of the above solution is that accessing NVDIMM is slower
  than normal ram, so usage scenarios that require frequent accesses to
  nvdimm_page structures may suffer poor
  performance. Therefore, we may add a boot parameter to allow users to
  choose normal ram for above nvdimm_page arrays if their 

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-08 Thread Jan Beulich
>>> On 08.03.16 at 10:15,  wrote:
> More thoughts on reserving NVDIMM space for per-page structures
> 
> Currently, a per-page struct for managing mapping of NVDIMM pages may
> include following fields:
> 
> struct nvdimm_page
> {
> uint64_t mfn;/* MFN of SPA of this NVDIMM page */
> uint64_t gfn;/* GFN where this NVDIMM page is mapped */
> domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
> int  is_broken;  /* Is this NVDIMM page broken? (for MCE) */
> }
> 
> Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
> nvdimm_page structures would occupy 12 GB space, which is too hard to
> fit in the normal ram on a small memory host. However, for smaller
> NVDIMMs and/or hosts with large ram, those structures may still be able
> to fit in the normal ram. In the latter circumstance, nvdimm_page
> structures are stored in the normal ram, so they can be accessed more
> quickly.

Not sure how you came to the above structure - it's the first time
I see it, yet figuring out what information it needs to hold is what
this design process should be about. For example, I don't see why
it would need to duplicate M2P / P2M information. Nor do I see why
per-page data needs to hold the address of a page (struct
page_info also doesn't). And whether storing a domain ID (rather
than a pointer to struct domain, as in struct page_info) is the
correct thing is also to be determined (rather than just stated).

Otoh you make no provisions at all for any kind of ref counting.
What if a guest wants to put page tables into NVDIMM space?

Since all of your calculations are based upon that fixed assumption
on the structure layout, I'm afraid they're not very meaningful
without first settling on what data needs tracking in the first place.

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-08 Thread Haozhong Zhang
On 03/04/16 10:20, Haozhong Zhang wrote:
> On 03/02/16 06:03, Jan Beulich wrote:
> > >>> On 02.03.16 at 08:14,  wrote:
> > > It means NVDIMM is very possibly mapped in page granularity, and
> > > hypervisor needs per-page data structures like page_info (rather than the
> > > range set style nvdimm_pages) to manage those mappings.
> > > 
> > > Then we will face the problem that the potentially huge number of
> > > per-page data structures may not fit in the normal ram. Linux kernel
> > > developers came across the same problem, and their solution is to
> > > reserve an area of NVDIMM and put the page structures in the reserved
> > > area (https://lwn.net/Articles/672457/). I think we may take the similar
> > > solution:
> > > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
> > > (besides the one used by Linux kernel itself) and reports the address
> > > and size to Xen hypervisor.
> > > 
> > > Reasons to choose Linux kernel to make the reservation include:
> > > (a) only Dom0 Linux kernel has the NVDIMM driver,
> > > (b) make it flexible for Dom0 Linux kernel to handle all
> > > reservations (for itself and Xen).
> > > 
> > > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
> > > stores them in above reserved areas.
> > 
[...]
> > Furthermore - why would Dom0 waste space
> > creating per-page control structures for regions which are
> > meant to be handed to guests anyway?
> > 
> 
> I found my description was not accurate after consulting with our driver
> developers. By default the linux kernel does not create page structures
> for NVDIMM which is called by kernel the "raw mode". We could enforce
> the Dom0 kernel to pin NVDIMM in "raw mode" so as to avoid waste.
> 

More thoughts on reserving NVDIMM space for per-page structures

Currently, a per-page struct for managing mapping of NVDIMM pages may
include following fields:

struct nvdimm_page
{
uint64_t mfn;/* MFN of SPA of this NVDIMM page */
uint64_t gfn;/* GFN where this NVDIMM page is mapped */
domid_t  domain_id;  /* which domain is this NVDIMM page mapped to */
int  is_broken;  /* Is this NVDIMM page broken? (for MCE) */
}

Its size is 24 bytes (or 22 bytes if packed). For a 2 TB NVDIMM,
nvdimm_page structures would occupy 12 GB (2 TB / 4 KB pages * 24 bytes
each), which is hard to fit in the normal ram of a small-memory host.
However, for smaller
NVDIMMs and/or hosts with large ram, those structures may still be able
to fit in the normal ram. In the latter circumstance, nvdimm_page
structures are stored in the normal ram, so they can be accessed more
quickly.

So we may add a boot parameter for Xen to allow users to configure which
place, the normal ram or nvdimm, is used to store those structures. For
the config of using normal ram, Xen could manage nvdimm_page structures
more quickly (and hence start a domain with NVDIMM more quickly), but
leaves less normal ram for VMs. For the config of using nvdimm, Xen would
take more time to manage nvdimm_page structures, but leaves more normal
ram for VMs.
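
A hedged sketch of what such an option could look like on the Xen side
(the parameter name and nvdimm_reserve_area() are made up for
illustration; only boolean_param() and xzalloc_bytes() are existing
interfaces):

/* Sketch: let the admin choose where the nvdimm_page arrays live. */
static bool_t __initdata opt_nvdimm_pages_on_ram;
boolean_param("nvdimm-pages-on-ram", opt_nvdimm_pages_on_ram);

static struct nvdimm_page *alloc_nvdimm_pages(struct nvdimm_iset *iset,
                                              unsigned long bytes)
{
    if ( opt_nvdimm_pages_on_ram )
        return xzalloc_bytes(bytes);  /* faster, but eats normal ram */

    /* Hypothetical helper: carve the array out of the interleave set
     * itself, trading access speed for normal ram. */
    return nvdimm_reserve_area(iset, bytes);
}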

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-07 Thread Haozhong Zhang
On 03/07/16 15:53, Konrad Rzeszutek Wilk wrote:
> On Wed, Mar 02, 2016 at 03:14:52PM +0800, Haozhong Zhang wrote:
> > On 03/01/16 13:49, Konrad Rzeszutek Wilk wrote:
> > > On Tue, Mar 01, 2016 at 06:33:32PM +, Ian Jackson wrote:
> > > > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM 
> > > > support for Xen"):
> > > > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > > > > [someone:]
> > > > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > > > > > >(a) never map idx in them to GFNs occupied by vNVDIMM, and
> > > > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM
> > > > > > 
> > > > > > Would that mean that guest xen-blkback or xen-netback wouldn't
> > > > > > be able to fetch data from the GFNs? As in, what if the HVM guest
> > > > > > that has the NVDIMM also serves as a device domain - that is it
> > > > > > has xen-blkback running to service other guests?
> > > > > 
> > > > > I'm not familiar with xen-blkback and xen-netback, so following
> > > > > statements maybe wrong.
> > > > > 
> > > > > In my understanding, xen-blkback/-netback in a device domain maps the
> > > > > pages from other domains into its own domain, and copies data between
> > > > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> > > > > driver in device domain. In which steps of this procedure that
> > > > > xen-blkback/-netback needs to map into GFNs of vNVDIMM?
> > > > 
> > > > I think I agree with what you are saying.  I don't understand exactly
> > > > what you are proposing above in XENMAPSPACE_gmfn but I don't see how
> > > > anything about this would interfere with blkback.
> > > > 
> > > > blkback when talking to an nvdimm will just go through the block layer
> > > > front door, and do a copy, I presume.
> > > 
> > > I believe you are right. The block layer, and then the fs would copy in.
> > > > 
> > > > I don't see how netback comes into it at all.
> > > > 
> > > > But maybe I am just confused or ignorant!  Please do explain :-).
> > > 
> > > s/back/frontend/  
> > > 
> > > My fear was refcounting.
> > > 
> > > Specifically where we do not do copying. For example, you could
> > > be sending data from the NVDIMM GFNs (scp?) to some other location
> > > (another host?). It would go over the xen-netback (in the dom0)
> > > - which would then grant map it (dom0 would).
> > >
> > 
> > Thanks for the explanation!
> > 
> > It means NVDIMM is very possibly mapped in page granularity, and
> > hypervisor needs per-page data structures like page_info (rather than the
> > range set style nvdimm_pages) to manage those mappings.
> 
> I do not know. I figured you need some accounting in the hypervisor
> as the pages can be grant mapped but I don't know the intricate details
> of the P2M code to tell you for certain.
> 
> [edit: Your later email seems to imply that you do not need all this
> information? Just ranges?]

Not quite sure which one you mean. But at least in this example,
NVDIMM can be granted in the unit of a page, so I think Xen still needs
a per-page data structure to track this mapping information; a range
structure is not enough.

> > 
> > Then we will face the problem that the potentially huge number of
> > per-page data structures may not fit in the normal ram. Linux kernel
> > developers came across the same problem, and their solution is to
> > reserve an area of NVDIMM and put the page structures in the reserved
> > area (https://lwn.net/Articles/672457/). I think we may take the similar
> > solution:
> > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
> > (besides the one used by Linux kernel itself) and reports the address
> > and size to Xen hypervisor.
> > 
> > Reasons to choose Linux kernel to make the reservation include:
> > (a) only Dom0 Linux kernel has the NVDIMM driver,
> > (b) make it flexible for Dom0 Linux kernel to handle all
> > reservations (for itself and Xen).
> > 
> > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
> > stores them in above reserved areas.
> > 
> > (3) The reserved area is used as volatile, i.e. above two steps must be
> >

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-07 Thread Konrad Rzeszutek Wilk
On Wed, Mar 02, 2016 at 03:14:52PM +0800, Haozhong Zhang wrote:
> On 03/01/16 13:49, Konrad Rzeszutek Wilk wrote:
> > On Tue, Mar 01, 2016 at 06:33:32PM +, Ian Jackson wrote:
> > > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM 
> > > support for Xen"):
> > > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > > > [someone:]
> > > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > > > > >(a) never map idx in them to GFNs occupied by vNVDIMM, and
> > > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM
> > > > > 
> > > > > Would that mean that guest xen-blkback or xen-netback wouldn't
> > > > > be able to fetch data from the GFNs? As in, what if the HVM guest
> > > > > that has the NVDIMM also serves as a device domain - that is it
> > > > > has xen-blkback running to service other guests?
> > > > 
> > > > I'm not familiar with xen-blkback and xen-netback, so following
> > > > statements maybe wrong.
> > > > 
> > > > In my understanding, xen-blkback/-netback in a device domain maps the
> > > > pages from other domains into its own domain, and copies data between
> > > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> > > > driver in device domain. In which steps of this procedure that
> > > > xen-blkback/-netback needs to map into GFNs of vNVDIMM?
> > > 
> > > I think I agree with what you are saying.  I don't understand exactly
> > > what you are proposing above in XENMAPSPACE_gmfn but I don't see how
> > > anything about this would interfere with blkback.
> > > 
> > > blkback when talking to an nvdimm will just go through the block layer
> > > front door, and do a copy, I presume.
> > 
> > I believe you are right. The block layer, and then the fs would copy in.
> > > 
> > > I don't see how netback comes into it at all.
> > > 
> > > But maybe I am just confused or ignorant!  Please do explain :-).
> > 
> > s/back/frontend/  
> > 
> > My fear was refcounting.
> > 
> > Specifically where we do not do copying. For example, you could
> > be sending data from the NVDIMM GFNs (scp?) to some other location
> > (another host?). It would go over the xen-netback (in the dom0)
> > - which would then grant map it (dom0 would).
> >
> 
> Thanks for the explanation!
> 
> It means NVDIMM is very possibly mapped in page granularity, and
> hypervisor needs per-page data structures like page_info (rather than the
> range set style nvdimm_pages) to manage those mappings.

I do not know. I figured you need some accounting in the hypervisor
as the pages can be grant mapped but I don't know the intricate details
of the P2M code to tell you for certain.

[edit: Your later email seems to imply that you do not need all this
information? Just ranges?]
> 
> Then we will face the problem that the potentially huge number of
> per-page data structures may not fit in the normal ram. Linux kernel
> developers came across the same problem, and their solution is to
> reserve an area of NVDIMM and put the page structures in the reserved
> area (https://lwn.net/Articles/672457/). I think we may take the similar
> solution:
> (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
> (besides the one used by Linux kernel itself) and reports the address
> and size to Xen hypervisor.
> 
> Reasons to choose Linux kernel to make the reservation include:
> (a) only Dom0 Linux kernel has the NVDIMM driver,
> (b) make it flexible for Dom0 Linux kernel to handle all
> reservations (for itself and Xen).
> 
> (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
> stores them in above reserved areas.
> 
> (3) The reserved area is used as volatile, i.e. above two steps must be
> done for every host boot.
> 
> > In effect Xen there are two guests (dom0 and domU) pointing in the
> > P2M to the same GPFN. And that would mean:
> > 
> > > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM
> > 
> > Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done
> > so perhaps this is not an issue?
> > 
> > The other situation I was envisioning - where the driver domain has
> > the NVDIMM passed in, and as well SR-IOV network card and functions
> > as an iSCSI target. That should work OK as we just need the IOMMU
> > to have the NVDIMM GPFNs programmed in.

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-03 Thread Haozhong Zhang
On 02/16/16 05:55, Jan Beulich wrote:
> >>> On 16.02.16 at 12:14,  wrote:
> > On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
> >> On 02/04/16 20:24, Stefano Stabellini wrote:
> >> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> >> > > On 02/03/16 15:22, Stefano Stabellini wrote:
> >> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
> >> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> >> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> >> > > > > >> Or, we can make a file system on /dev/pmem0, create files on 
> >> > > > > >> it, set
> >> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then 
> >> > > > > >> pass
> >> > > > > >> those files to QEMU. In this way, non-root QEMU should be able 
> >> > > > > >> to
> >> > > > > >> mmap those files.
> >> > > > > >
> >> > > > > > Maybe that would work. Worth adding it to the design, I would 
> >> > > > > > like to
> >> > > > > > read more details on it.
> >> > > > > >
> >> > > > > > Also note that QEMU initially runs as root but drops privileges 
> >> > > > > > to
> >> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially 
> >> > > > > > QEMU
> >> > > > > > *could* mmap /dev/pmem0 while is still running as root, but then 
> >> > > > > > it
> >> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
> >> > > > > > (hotplug scenario).
> >> > > > >
> >> > > > > This is basically the same problem we have for a bunch of other 
> >> > > > > things,
> >> > > > > right?  Having xl open a file and then pass it via qmp to qemu 
> >> > > > > should
> >> > > > > work in theory, right?
> >> > > >
> >> > > > Is there one /dev/pmem? per assignable region?
> >> > > 
> >> > > Yes.
> >> > > 
> >> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
> >> > > configuration that is going to access a host block device, e.g.
> >> > >  disk = [ '/dev/sdb,,hda' ]
> >> > > If that works with non-root qemu, I may take the similar solution for
> >> > > pmem.
> >> >  
> >> > Today the user is required to give the correct ownership and access mode
> >> > to the block device, so that non-root QEMU can open it. However in the
> >> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> >> > the feature doesn't work at all with non-root QEMU
> >> > (http://marc.info/?l=xen-devel&m=145261763600528).
> >> > 
> >> > If there is one /dev/pmem device per assignable region, then it would be
> >> > conceivable to change its ownership so that non-root QEMU can open it.
> >> > Or, better, the file descriptor could be passed by the toolstack via
> >> > qmp.
> >> 
> >> Passing file descriptor via qmp is not enough.
> >> 
> >> Let me clarify where the requirement for root/privileged permissions
> >> comes from. The primary workflow in my design that maps a host pmem
> >> region or files in host pmem region to guest is shown as below:
> >>  (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
> >>  /dev/pmem0) to its virtual address space, i.e. the guest virtual
> >>  address space.
> >>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
> >>  occupied by the host pmem to a DomU. This step requires the
> >>  translation from the guest virtual address (where the host pmem is
> >>  mmaped in (1)) to the host physical address. The translation can be
> >>  done by either
> >> (a) QEMU that parses its own /proc/self/pagemap,
> >>  or
> >> (b) Xen hypervisor that does the translation by itself [1] (though
> >> this choice is not quite doable from Konrad's comments [2]).
> >> 
> >> [1] 
> >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html 
> >> [2] 
> >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html 
> >> 
> >> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
> >> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
> >> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
> >> pagemap will not contain all mappings. However, mlock may require
> >> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
> >> mlock operates on memory, the permission to open(2) the host pmem files
> >> does not solve the problem and therefore passing file descriptor via qmp
> >> does not help.
> >> 
> >> For 2-b, from Konrad's comments [2], mlock is also required and
> >> privileged permission may be required consequently.
> >> 
> >> Note that the mapping and the address translation are done before QEMU
> >> dropping privileged permissions, so non-root QEMU should be able to work
> >> with above design until we start considering vNVDIMM hotplug (which has
> >> not been supported by the current vNVDIMM implementation in QEMU). In
> >> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
> >> running with root permissions.
> > 
> > Are we all good with the fact that vNVDIMM 
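
As a side note on the 2-a path quoted above, a minimal userspace sketch
of the /proc/self/pagemap lookup (assuming the page has been faulted in
or mlock'ed, and that the process still has the CAP_SYS_ADMIN needed on
Linux >= 4.0 to see PFNs):

/* Sketch only: translate a virtual address to the backing PFN via
 * /proc/self/pagemap (bit 63 = present, bits 0-54 = PFN). */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

static uint64_t va_to_pfn(const void *va)
{
    uint64_t entry = 0;
    long psz = sysconf(_SC_PAGESIZE);
    off_t off = (off_t)((uintptr_t)va / psz) * sizeof(entry);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if (fd < 0)
        return 0;
    if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
        entry = 0;
    close(fd);

    if (!(entry & (1ULL << 63)))   /* page not present */
        return 0;
    return entry & ((1ULL << 55) - 1);
}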

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-03 Thread Haozhong Zhang
On 03/02/16 06:03, Jan Beulich wrote:
> >>> On 02.03.16 at 08:14,  wrote:
> > It means NVDIMM is very possibly mapped in page granularity, and
> > hypervisor needs per-page data structures like page_info (rather than the
> > range set style nvdimm_pages) to manage those mappings.
> > 
> > Then we will face the problem that the potentially huge number of
> > per-page data structures may not fit in the normal ram. Linux kernel
> > developers came across the same problem, and their solution is to
> > reserve an area of NVDIMM and put the page structures in the reserved
> > area (https://lwn.net/Articles/672457/). I think we may take the similar
> > solution:
> > (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
> > (besides the one used by Linux kernel itself) and reports the address
> > and size to Xen hypervisor.
> > 
> > Reasons to choose Linux kernel to make the reservation include:
> > (a) only Dom0 Linux kernel has the NVDIMM driver,
> > (b) make it flexible for Dom0 Linux kernel to handle all
> > reservations (for itself and Xen).
> > 
> > (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
> > stores them in above reserved areas.
> 
> Another argument against this being primarily Dom0-managed,
> I would say.

Yes, Xen should, at least, manage all address mappings for NVDIMM. Dom0
Linux and QEMU then provide a user-friendly interface to configure
NVDIMM and vNVDIMM, e.g. providing files (instead of addresses) as the
abstraction of SPA ranges of NVDIMM.

> Furthermore - why would Dom0 waste space
> creating per-page control structures for regions which are
> meant to be handed to guests anyway?
> 

I found my description was not accurate after consulting with our driver
developers. By default the linux kernel does not create page structures
for NVDIMM, which the kernel calls "raw mode". We could force
the Dom0 kernel to pin NVDIMM in "raw mode" so as to avoid the waste.

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-02 Thread Jan Beulich
>>> On 02.03.16 at 08:14,  wrote:
> It means NVDIMM is very possibly mapped in page granularity, and
> hypervisor needs per-page data structures like page_info (rather than the
> range set style nvdimm_pages) to manage those mappings.
> 
> Then we will face the problem that the potentially huge number of
> per-page data structures may not fit in the normal ram. Linux kernel
> developers came across the same problem, and their solution is to
> reserve an area of NVDIMM and put the page structures in the reserved
> area (https://lwn.net/Articles/672457/). I think we may take the similar
> solution:
> (1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
> (besides the one used by Linux kernel itself) and reports the address
> and size to Xen hypervisor.
> 
> Reasons to choose Linux kernel to make the reservation include:
> (a) only Dom0 Linux kernel has the NVDIMM driver,
> (b) make it flexible for Dom0 Linux kernel to handle all
> reservations (for itself and Xen).
> 
> (2) Then Xen hypervisor builds the page structures for NVDIMM pages and
> stores them in above reserved areas.

Another argument against this being primarily Dom0-managed,
I would say. Furthermore - why would Dom0 waste space
creating per-page control structures for regions which are
meant to be handed to guests anyway?

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-01 Thread Haozhong Zhang
On 03/01/16 13:49, Konrad Rzeszutek Wilk wrote:
> On Tue, Mar 01, 2016 at 06:33:32PM +, Ian Jackson wrote:
> > Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM 
> > support for Xen"):
> > > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > > [someone:]
> > > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > > > >(a) never map idx in them to GFNs occupied by vNVDIMM, and
> > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM
> > > > 
> > > > Would that mean that guest xen-blkback or xen-netback wouldn't
> > > > be able to fetch data from the GFNs? As in, what if the HVM guest
> > > > that has the NVDIMM also serves as a device domain - that is it
> > > > has xen-blkback running to service other guests?
> > > 
> > > I'm not familiar with xen-blkback and xen-netback, so following
> > > statements maybe wrong.
> > > 
> > > In my understanding, xen-blkback/-netback in a device domain maps the
> > > pages from other domains into its own domain, and copies data between
> > > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> > > driver in device domain. In which steps of this procedure that
> > > xen-blkback/-netback needs to map into GFNs of vNVDIMM?
> > 
> > I think I agree with what you are saying.  I don't understand exactly
> > what you are proposing above in XENMAPSPACE_gmfn but I don't see how
> > anything about this would interfere with blkback.
> > 
> > blkback when talking to an nvdimm will just go through the block layer
> > front door, and do a copy, I presume.
> 
> I believe you are right. The block layer, and then the fs would copy in.
> > 
> > I don't see how netback comes into it at all.
> > 
> > But maybe I am just confused or ignorant!  Please do explain :-).
> 
> s/back/frontend/  
> 
> My fear was refcounting.
> 
> Specifically where we do not do copying. For example, you could
> be sending data from the NVDIMM GFNs (scp?) to some other location
> (another host?). It would go over the xen-netback (in the dom0)
> - which would then grant map it (dom0 would).
>

Thanks for the explanation!

It means NVDIMM is very possibly mapped at page granularity, and the
hypervisor needs per-page data structures like page_info (rather than the
range-set style nvdimm_pages) to manage those mappings.

Then we will face the problem that the potentially huge number of
per-page data structures may not fit in the normal ram. Linux kernel
developers came across the same problem, and their solution is to
reserve an area of NVDIMM and put the page structures in the reserved
area (https://lwn.net/Articles/672457/). I think we may take the similar
solution:
(1) Dom0 Linux kernel reserves an area on each NVDIMM for Xen usage
(besides the one used by Linux kernel itself) and reports the address
and size to Xen hypervisor.

Reasons to choose Linux kernel to make the reservation include:
(a) only Dom0 Linux kernel has the NVDIMM driver,
(b) make it flexible for Dom0 Linux kernel to handle all
reservations (for itself and Xen).

(2) Then Xen hypervisor builds the page structures for NVDIMM pages and
stores them in above reserved areas.

(3) The reserved area is used as volatile, i.e. above two steps must be
done for every host boot.

> In effect Xen there are two guests (dom0 and domU) pointing in the
> P2M to the same GPFN. And that would mean:
> 
> > > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM
> 
> Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done
> so perhaps this is not an issue?
> 
> The other situation I was envisioning - where the driver domain has
> the NVDIMM passed in, and as well SR-IOV network card and functions
> as an iSCSI target. That should work OK as we just need the IOMMU
> to have the NVDIMM GPFNs programmed in.
>

For this IOMMU usage example and the granted-pages example above, there
remains one question: who is responsible for performing the NVDIMM flush
(clwb/clflushopt/pcommit)?

For the granted page example, if a NVDIMM page is granted to
xen-netback, does the hypervisor need to tell xen-netback it's a NVDIMM
page so that xen-netback can perform proper flush when it writes to that
page? Or we may keep the NVDIMM transparent to xen-netback, and let Xen
perform the flush when xen-netback gives up the granted NVDIMM page?

For the IOMMU example, my understanding is that there is a piece of
software in the driver domain that handles SCSI commands received from
network card and drives the network card to read/write certain areas of
NVDIMM. Then that software should be aware of the existence of NVDIMM
and perform the flush properly. Is that right?
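
For reference, a hedged sketch of the flush sequence those instructions
imply, whichever component ends up issuing it (CPU feature checks and
the clflushopt/clflush fallbacks are omitted; pcommit is spelled as raw
bytes in case the assembler lacks the mnemonic):

/* Sketch: push a buffer out to the NVDIMM's persistence domain:
 * clwb each cache line, fence, pcommit, fence. */
#define CL_SIZE 64UL   /* assumption: 64-byte cache lines */

static inline void pmem_persist(const void *addr, unsigned long len)
{
    unsigned long p = (unsigned long)addr & ~(CL_SIZE - 1);
    unsigned long end = (unsigned long)addr + len;

    for ( ; p < end; p += CL_SIZE )
        asm volatile ( "clwb %0" : "+m" (*(volatile char *)p) );

    asm volatile ( "sfence" ::: "memory" );
    asm volatile ( ".byte 0x66, 0x0f, 0xae, 0xf8" ::: "memory" ); /* pcommit */
    asm volatile ( "sfence" ::: "memory" );
}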

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-01 Thread Konrad Rzeszutek Wilk
On Tue, Mar 01, 2016 at 06:33:32PM +, Ian Jackson wrote:
> Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support 
> for Xen"):
> > On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > [someone:]
> > > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > > >(a) never map idx in them to GFNs occupied by vNVDIMM, and
> > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM
> > > 
> > > Would that mean that guest xen-blkback or xen-netback wouldn't
> > > be able to fetch data from the GFNs? As in, what if the HVM guest
> > > that has the NVDIMM also serves as a device domain - that is it
> > > has xen-blkback running to service other guests?
> > 
> > I'm not familiar with xen-blkback and xen-netback, so following
> > statements maybe wrong.
> > 
> > In my understanding, xen-blkback/-netback in a device domain maps the
> > pages from other domains into its own domain, and copies data between
> > those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> > driver in device domain. In which steps of this procedure that
> > xen-blkback/-netback needs to map into GFNs of vNVDIMM?
> 
> I think I agree with what you are saying.  I don't understand exactly
> what you are proposing above in XENMAPSPACE_gmfn but I don't see how
> anything about this would interfere with blkback.
> 
> blkback when talking to an nvdimm will just go through the block layer
> front door, and do a copy, I presume.

I believe you are right. The block layer, and then the fs would copy in.
> 
> I don't see how netback comes into it at all.
> 
> But maybe I am just confused or ignorant!  Please do explain :-).

s/back/frontend/  

My fear was refcounting.

Specifically where we do not do copying. For example, you could
be sending data from the NVDIMM GFNs (scp?) to some other location
(another host?). It would go over the xen-netback (in the dom0)
- which would then grant map it (dom0 would).

In effect, in Xen there are two guests (dom0 and domU) pointing in the
P2M to the same GPFN. And that would mean:

> > > >(b) never map idx corresponding to GFNs occupied by vNVDIMM

Granted the XENMAPSPACE_gmfn happens _before_ the grant mapping is done
so perhaps this is not an issue?

The other situation I was envisioning - where the driver domain has
the NVDIMM passed in, as well as an SR-IOV network card, and functions
as an iSCSI target. That should work OK as we just need the IOMMU
to have the NVDIMM GPFNs programmed in.

> 
> Thanks,
> Ian.



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-01 Thread Ian Jackson
Haozhong Zhang writes ("Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support 
for Xen"):
> On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > [someone:]
> > > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> > >(a) never map idx in them to GFNs occupied by vNVDIMM, and
> > >(b) never map idx corresponding to GFNs occupied by vNVDIMM
> > 
> > Would that mean that guest xen-blkback or xen-netback wouldn't
> > be able to fetch data from the GFNs? As in, what if the HVM guest
> > that has the NVDIMM also serves as a device domain - that is it
> > has xen-blkback running to service other guests?
> 
> I'm not familiar with xen-blkback and xen-netback, so following
> statements maybe wrong.
> 
> In my understanding, xen-blkback/-netback in a device domain maps the
> pages from other domains into its own domain, and copies data between
> those pages and vNVDIMM. The access to vNVDIMM is performed by NVDIMM
> driver in device domain. In which steps of this procedure that
> xen-blkback/-netback needs to map into GFNs of vNVDIMM?

I think I agree with what you are saying.  I don't understand exactly
what you are proposing above in XENMAPSPACE_gmfn but I don't see how
anything about this would interfere with blkback.

blkback when talking to an nvdimm will just go through the block layer
front door, and do a copy, I presume.

I don't see how netback comes into it at all.

But maybe I am just confused or ignorant!  Please do explain :-).

Thanks,
Ian.



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-01 Thread Jan Beulich
>>> On 01.03.16 at 14:51,  wrote:
> Haozhong Zhang writes ("Re: [RFC Design Doc] Add vNVDIMM support for Xen"):
>> On 02/29/16 05:04, Jan Beulich wrote:
>> > Which will involve adding how much new code to it?
>> 
>> Because hvmloader only accepts AML device rather than arbitrary objects,
>> only code that builds the outmost part of AML device is needed. In ACPI
>> spec, an AML device is defined as
>> DefDevice := DeviceOp PkgLength NameString ObjectList
>> hvmloader only needs to build the first 3 terms, while the last one is
>> passed from qemu.
> 
> Jan, is this a satisfactory answer ?

Well, sort of yes, but subject to me seeing the actual code this
converts to.

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-01 Thread Ian Jackson
Haozhong Zhang writes ("Re: [RFC Design Doc] Add vNVDIMM support for Xen"):
> On 02/29/16 05:04, Jan Beulich wrote:
> > Which will involve adding how much new code to it?
> 
> Because hvmloader only accepts AML device rather than arbitrary objects,
> only code that builds the outmost part of AML device is needed. In ACPI
> spec, an AML device is defined as
> DefDevice := DeviceOp PkgLength NameString ObjectList
> hvmloader only needs to build the first 3 terms, while the last one is
> passed from qemu.

Jan, is this a satisfactory answer ?

Ian.



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-29 Thread Haozhong Zhang
On 02/18/16 21:14, Konrad Rzeszutek Wilk wrote:
> > > > QEMU would always use MFN above guest normal ram and I/O holes for
> > > > vNVDIMM. It would attempt to search in that space for a contiguous range
> > > > that is large enough for those vNVDIMM devices. Is the guest able to
> > > > punch holes in such GFN space?
> > > 
> > > See XENMAPSPACE_* and their uses.
> > > 
> > 
> > I think we can add following restrictions to avoid uses of XENMAPSPACE_*
> > punching holes in GFNs of vNVDIMM:
> > 
> > (1) For XENMAPSPACE_shared_info and _grant_table, never map idx in them
> > to GFNs occupied by vNVDIMM.
> 
> OK, that sounds correct.
> > 
> > (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
> >(a) never map idx in them to GFNs occupied by vNVDIMM, and
> >(b) never map idx corresponding to GFNs occupied by vNVDIMM
> 
> Would that mean that guest xen-blkback or xen-netback wouldn't
> be able to fetch data from the GFNs? As in, what if the HVM guest
> that has the NVDIMM also serves as a device domain - that is it
> has xen-blkback running to service other guests?
> 

I'm not familiar with xen-blkback and xen-netback, so the following
statements may be wrong.

In my understanding, xen-blkback/-netback in a device domain maps the
pages from other domains into its own domain, and copies data between
those pages and vNVDIMM. The access to vNVDIMM is performed by the NVDIMM
driver in the device domain. In which step of this procedure does
xen-blkback/-netback need to map GFNs of vNVDIMM?

Thanks,
Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-29 Thread Haozhong Zhang
On 02/29/16 05:04, Jan Beulich wrote:
> >>> On 29.02.16 at 12:52,  wrote:
> > On 02/29/16 03:12, Jan Beulich wrote:
> >> >>> On 29.02.16 at 10:45,  wrote:
> >> > On 02/29/16 02:01, Jan Beulich wrote:
> >> >> >>> On 28.02.16 at 15:48,  wrote:
> >> >> > Anyway, we may avoid some conflicts between ACPI tables/objects by
> >> >> > restricting which tables and objects can be passed from QEMU to Xen:
> >> >> > (1) For ACPI tables, xen does not accept those built by itself,
> >> >> > e.g. DSDT and SSDT.
> >> >> > (2) xen does not accept ACPI tables for devices that are not attached 
> >> >> > to
> >> >> > a domain, e.g. if NFIT cannot be passed if a domain does not have
> >> >> > vNVDIMM.
> >> >> > (3) For ACPI objects, xen only accepts namespace devices and requires
> >> >> > their names does not conflict with existing ones provided by Xen.
> >> >> 
> >> >> And how do you imagine to enforce this without parsing the
> >> >> handed AML? (Remember there's no AML parser in hvmloader.)
> >> > 
> >> > As I proposed in last reply, instead of passing an entire ACPI object,
> >> > QEMU passes the device name and the AML code under the AML device entry
> >> > separately. Because the name is explicitly given, no AML parser is
> >> > needed in hvmloader.
> >> 
> >> I must not only have missed that proposal, but I also don't see
> >> how you mean this to work: Are you suggesting for hvmloader to
> >> construct valid AML from the passed in blob? Or are you instead
> >> considering to pass redundant information (name once given
> >> explicitly and once embedded in the AML blob), allowing the two
> >> to be out of sync?
> > 
> > I mean the former one.
> 
> Which will involve adding how much new code to it?
>

Because hvmloader only accepts AML devices rather than arbitrary objects,
only code that builds the outermost part of an AML device is needed. In the
ACPI spec, an AML device is defined as
DefDevice := DeviceOp PkgLength NameString ObjectList
hvmloader only needs to build the first 3 terms, while the last one is
passed from qemu.
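
To make that concrete, a hedged sketch of the wrapping hvmloader would
have to do (PkgLength encoding per the ACPI spec; the function names and
buffer handling are illustrative, and the 4-byte PkgLength form is left
out for brevity):

#include <stdint.h>
#include <string.h>

/* Encode an AML PkgLength for a body of 'inner' bytes; the encoding
 * must count its own bytes as well. */
static unsigned int aml_pkg_length(uint8_t *out, uint32_t inner)
{
    if ( inner + 1 <= 0x3f )
    {
        out[0] = inner + 1;
        return 1;
    }
    if ( inner + 2 <= 0xfff )
    {
        out[0] = 0x40 | ((inner + 2) & 0xf);
        out[1] = (inner + 2) >> 4;
        return 2;
    }
    out[0] = 0x80 | ((inner + 3) & 0xf);
    out[1] = ((inner + 3) >> 4) & 0xff;
    out[2] = ((inner + 3) >> 12) & 0xff;
    return 3;
}

/* Wrap 'body' (the ObjectList passed from qemu) into Device(name){...}. */
static unsigned int aml_build_device(uint8_t *out, const char name[4],
                                     const uint8_t *body, uint32_t body_len)
{
    uint8_t pkg[4];
    unsigned int pkg_len = aml_pkg_length(pkg, 4 /* NameSeg */ + body_len);

    out[0] = 0x5b;                        /* ExtOpPrefix */
    out[1] = 0x82;                        /* DeviceOp    */
    memcpy(out + 2, pkg, pkg_len);
    memcpy(out + 2 + pkg_len, name, 4);   /* NameString: one 4-char NameSeg */
    memcpy(out + 2 + pkg_len + 4, body, body_len);
    return 2 + pkg_len + 4 + body_len;
}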

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-29 Thread Jan Beulich
>>> On 29.02.16 at 12:52,  wrote:
> On 02/29/16 03:12, Jan Beulich wrote:
>> >>> On 29.02.16 at 10:45,  wrote:
>> > On 02/29/16 02:01, Jan Beulich wrote:
>> >> >>> On 28.02.16 at 15:48,  wrote:
>> >> > Anyway, we may avoid some conflicts between ACPI tables/objects by
>> >> > restricting which tables and objects can be passed from QEMU to Xen:
>> >> > (1) For ACPI tables, xen does not accept those built by itself,
>> >> > e.g. DSDT and SSDT.
>> >> > (2) xen does not accept ACPI tables for devices that are not attached to
>> >> > a domain, e.g. if NFIT cannot be passed if a domain does not have
>> >> > vNVDIMM.
>> >> > (3) For ACPI objects, xen only accepts namespace devices and requires
>> >> > their names does not conflict with existing ones provided by Xen.
>> >> 
>> >> And how do you imagine to enforce this without parsing the
>> >> handed AML? (Remember there's no AML parser in hvmloader.)
>> > 
>> > As I proposed in last reply, instead of passing an entire ACPI object,
>> > QEMU passes the device name and the AML code under the AML device entry
>> > separately. Because the name is explicitly given, no AML parser is
>> > needed in hvmloader.
>> 
>> I must not only have missed that proposal, but I also don't see
>> how you mean this to work: Are you suggesting for hvmloader to
>> construct valid AML from the passed in blob? Or are you instead
>> considering to pass redundant information (name once given
>> explicitly and once embedded in the AML blob), allowing the two
>> to be out of sync?
> 
> I mean the former one.

Which will involve adding how much new code to it?

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-29 Thread Haozhong Zhang
On 02/29/16 03:12, Jan Beulich wrote:
> >>> On 29.02.16 at 10:45,  wrote:
> > On 02/29/16 02:01, Jan Beulich wrote:
> >> >>> On 28.02.16 at 15:48,  wrote:
> >> > Anyway, we may avoid some conflicts between ACPI tables/objects by
> >> > restricting which tables and objects can be passed from QEMU to Xen:
> >> > (1) For ACPI tables, xen does not accept those built by itself,
> >> > e.g. DSDT and SSDT.
> >> > (2) xen does not accept ACPI tables for devices that are not attached to
> >> > a domain, e.g. if NFIT cannot be passed if a domain does not have
> >> > vNVDIMM.
> >> > (3) For ACPI objects, xen only accepts namespace devices and requires
> >> > their names does not conflict with existing ones provided by Xen.
> >> 
> >> And how do you imagine to enforce this without parsing the
> >> handed AML? (Remember there's no AML parser in hvmloader.)
> > 
> > As I proposed in last reply, instead of passing an entire ACPI object,
> > QEMU passes the device name and the AML code under the AML device entry
> > separately. Because the name is explicitly given, no AML parser is
> > needed in hvmloader.
> 
> I must not only have missed that proposal, but I also don't see
> how you mean this to work: Are you suggesting for hvmloader to
> construct valid AML from the passed in blob? Or are you instead
> considering to pass redundant information (name once given
> explicitly and once embedded in the AML blob), allowing the two
> to be out of sync?
>

I mean the former one.

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-29 Thread Jan Beulich
>>> On 29.02.16 at 10:45,  wrote:
> On 02/29/16 02:01, Jan Beulich wrote:
>> >>> On 28.02.16 at 15:48,  wrote:
>> > Anyway, we may avoid some conflicts between ACPI tables/objects by
>> > restricting which tables and objects can be passed from QEMU to Xen:
>> > (1) For ACPI tables, xen does not accept those built by itself,
>> > e.g. DSDT and SSDT.
>> > (2) xen does not accept ACPI tables for devices that are not attached to
>> > a domain, e.g. if NFIT cannot be passed if a domain does not have
>> > vNVDIMM.
>> > (3) For ACPI objects, xen only accepts namespace devices and requires
>> > their names does not conflict with existing ones provided by Xen.
>> 
>> And how do you imagine to enforce this without parsing the
>> handed AML? (Remember there's no AML parser in hvmloader.)
> 
> As I proposed in last reply, instead of passing an entire ACPI object,
> QEMU passes the device name and the AML code under the AML device entry
> separately. Because the name is explicitly given, no AML parser is
> needed in hvmloader.

I must not only have missed that proposal, but I also don't see
how you mean this to work: Are you suggesting for hvmloader to
construct valid AML from the passed in blob? Or are you instead
considering to pass redundant information (name once given
explicitly and once embedded in the AML blob), allowing the two
to be out of sync?

Jan




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-29 Thread Haozhong Zhang
On 02/29/16 02:01, Jan Beulich wrote:
> >>> On 28.02.16 at 15:48,  wrote:
> > On 02/24/16 09:54, Jan Beulich wrote:
> >> >>> On 24.02.16 at 16:48,  wrote:
> >> > On 02/24/16 07:24, Jan Beulich wrote:
> >> >> >>> On 24.02.16 at 14:28,  wrote:
> >> >> > On 02/18/16 10:17, Jan Beulich wrote:
> >> >> >> >>> On 01.02.16 at 06:44,  wrote:
> >> >> >> > 3.3 Guest ACPI Emulation
> >> >> >> > 
> >> >> >> > 3.3.1 My Design
> >> >> >> > 
> >> >> >> >  Guest ACPI emulation is composed of two parts: building guest NFIT
> >> >> >> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> >> >> >> >  emulating guest _DSM.
> >> >> >> > 
> >> >> >> >  (1) Building Guest ACPI Tables
> >> >> >> > 
> >> >> >> >   This design reuses and extends hvmloader's existing mechanism 
> >> >> >> > that
> >> >> >> >   loads passthrough ACPI tables from binary files to load NFIT and
> >> >> >> >   SSDT tables built by QEMU:
> >> >> >> >   1) Because the current QEMU does not building any ACPI tables 
> >> >> >> > when
> >> >> >> >  it runs as the Xen device model, this design needs to patch 
> >> >> >> > QEMU
> >> >> >> >  to build NFIT and SSDT (so far only NFIT and SSDT) in this 
> >> >> >> > case.
> >> >> >> > 
> >> >> >> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
> >> >> >> >  4G. The guest address and size of those tables are written 
> >> >> >> > into
> >> >> >> >  xenstore 
> >> >> >> > (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> >> >> >> > 
> >> >> >> >   3) hvmloader is patched to probe and load device model 
> >> >> >> > passthrough
> >> >> >> >  ACPI tables from above xenstore keys. The detected ACPI tables
> >> >> >> >  are then appended to the end of existing guest ACPI tables 
> >> >> >> > just
> >> >> >> >  like what current construct_passthrough_tables() does.
> >> >> >> > 
> >> >> >> >   Reasons for this design are listed below:
> >> >> >> >   - NFIT and SSDT in question are quite self-contained, i.e. they 
> >> >> >> > do
> >> >> >> > not refer to other ACPI tables and not conflict with existing
> >> >> >> > guest ACPI tables in Xen. Therefore, it is safe to copy them 
> >> >> >> > from
> >> >> >> > QEMU and append to existing guest ACPI tables.
> >> >> >> 
> >> >> >> How is this not conflicting being guaranteed? In particular I don't
> >> >> >> see how tables containing AML code and coming from different
> >> >> >> sources won't possibly cause ACPI name space collisions.
> >> >> >>
> >> >> > 
> >> >> > Really there is no effective mechanism to avoid ACPI name space
> >> >> > collisions (and other kinds of conflicts) between ACPI tables loaded
> >> >> > from QEMU and ACPI tables built by hvmloader. Because which ACPI 
> >> >> > tables
> >> >> > are loaded is determined by developers, IMO it's developers'
> >> >> > responsibility to avoid any collisions and conflicts with existing 
> >> >> > ACPI
> >> >> > tables.
> >> >> 
> >> >> Right, but this needs to be spelled out and settled on at design
> >> >> time (i.e. now), rather leaving things unspecified, awaiting the
> >> >> first clash.
> >> > 
> >> > So that means if no collision-proof mechanism is introduced, Xen should 
> >> > not
> >> > trust any passed-in ACPI tables and should build them by itself?
> >> 
> >> Basically yes, albeit collision-proof may be too much to demand.
> >> Simply separating name spaces (for hvmloader and qemu to have
> >> their own sub-spaces) would be sufficient imo. We should trust
> >> ourselves to play by such a specification.
> >>
> > 
> > I don't quite understand 'separating name spaces'. Do you mean, for
> > example, if both hvmloader and qemu want to put a namespace device under
> > \_SB, they could be put in different sub-scopes under \_SB? But it does
> > not work for Linux at least.
> 
> Aiui just the leaf names matter for sufficient separation, i.e.
> recurring sub-scopes should not be a problem.
>
> > Anyway, we may avoid some conflicts between ACPI tables/objects by
> > restricting which tables and objects can be passed from QEMU to Xen:
> > (1) For ACPI tables, Xen does not accept those that it already builds
> > by itself, e.g. DSDT and SSDT.
> > (2) Xen does not accept ACPI tables for devices that are not attached to
> > a domain, e.g. NFIT cannot be passed if a domain does not have
> > vNVDIMM.
> > (3) For ACPI objects, Xen only accepts namespace devices and requires
> > that their names do not conflict with existing ones provided by Xen.
> 
> And how do you imagine to enforce this without parsing the
> handed AML? (Remember there's no AML parser in hvmloader.)
>

As I proposed in my last reply, instead of passing an entire ACPI object,
QEMU passes the device name and the AML code under the AML device entry
separately. Because the name is explicitly given, no AML parser is
needed in hvmloader.
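
For illustration only: wrapping such a passed-in AML body into
Device (NAME) { ... } needs no AML parser, only emitting the fixed device
header (DeviceOp, a PkgLength and the 4-character name) in front of the
opaque body. The sketch below makes that concrete; wrap_aml_device() and
build_pkg_length() are illustrative names, not existing hvmloader
functions, and buffer bounds checking is omitted.

/*
 * Illustrative sketch only (not existing hvmloader code): wrap a device
 * name plus an opaque AML body into Device (NAME) { <body> }.  AML
 * encoding used: DeviceOp is ExtOpPrefix (0x5b) followed by 0x82, then a
 * PkgLength that covers itself, the NameSeg and the body.
 */
#include <stdint.h>
#include <string.h>

/* Encode an AML PkgLength; 'payload' excludes the PkgLength bytes themselves. */
static unsigned int build_pkg_length(uint8_t *dst, unsigned int payload)
{
    unsigned int nr_bytes = 1, total;

    while ( nr_bytes < 4 )
    {
        total = payload + nr_bytes;
        if ( (nr_bytes == 1 && total <= 0x3f) ||
             (nr_bytes > 1 && total < (1u << (4 + 8 * (nr_bytes - 1)))) )
            break;
        nr_bytes++;
    }
    total = payload + nr_bytes;

    if ( nr_bytes == 1 )
        dst[0] = total;                     /* bits 5:0 hold the length */
    else
    {
        unsigned int i;

        dst[0] = ((nr_bytes - 1) << 6) | (total & 0xf);
        for ( i = 1; i < nr_bytes; i++ )
            dst[i] = (total >> (4 + 8 * (i - 1))) & 0xff;
    }
    return nr_bytes;
}

/* Emit Device (name) { body }; 'name' must be a 4-character NameSeg. */
static unsigned int wrap_aml_device(uint8_t *dst, const char name[4],
                                    const uint8_t *body, unsigned int body_len)
{
    uint8_t pkg[4];
    unsigned int pkg_len = build_pkg_length(pkg, 4 /* NameSeg */ + body_len);
    unsigned int off = 0;

    dst[off++] = 0x5b;                      /* ExtOpPrefix */
    dst[off++] = 0x82;                      /* DeviceOp */
    memcpy(dst + off, pkg, pkg_len); off += pkg_len;
    memcpy(dst + off, name, 4); off += 4;   /* 4-character device name */
    memcpy(dst + off, body, body_len); off += body_len;

    return off;
}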

Haozhong


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-29 Thread Jan Beulich
>>> On 28.02.16 at 15:48,  wrote:
> On 02/24/16 09:54, Jan Beulich wrote:
>> >>> On 24.02.16 at 16:48,  wrote:
>> > On 02/24/16 07:24, Jan Beulich wrote:
>> >> >>> On 24.02.16 at 14:28,  wrote:
>> >> > On 02/18/16 10:17, Jan Beulich wrote:
>> >> >> >>> On 01.02.16 at 06:44,  wrote:
>> >> >> > 3.3 Guest ACPI Emulation
>> >> >> > 
>> >> >> > 3.3.1 My Design
>> >> >> > 
>> >> >> >  Guest ACPI emulation is composed of two parts: building guest NFIT
>> >> >> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
>> >> >> >  emulating guest _DSM.
>> >> >> > 
>> >> >> >  (1) Building Guest ACPI Tables
>> >> >> > 
>> >> >> >   This design reuses and extends hvmloader's existing mechanism that
>> >> >> >   loads passthrough ACPI tables from binary files to load NFIT and
>> >> >> >   SSDT tables built by QEMU:
>> >> >> >   1) Because the current QEMU does not build any ACPI tables when
>> >> >> >  it runs as the Xen device model, this design needs to patch QEMU
>> >> >> >  to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
>> >> >> > 
>> >> >> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
>> >> >> >  4G. The guest address and size of those tables are written into
>> >> >> >  xenstore 
>> >> >> > (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
>> >> >> > 
>> >> >> >   3) hvmloader is patched to probe and load device model passthrough
>> >> >> >  ACPI tables from above xenstore keys. The detected ACPI tables
>> >> >> >  are then appended to the end of existing guest ACPI tables just
>> >> >> >  like what current construct_passthrough_tables() does.
>> >> >> > 
>> >> >> >   Reasons for this design are listed below:
>> >> >> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
>> >> >> > not refer to other ACPI tables and not conflict with existing
>> >> >> > guest ACPI tables in Xen. Therefore, it is safe to copy them from
>> >> >> > QEMU and append to existing guest ACPI tables.
>> >> >> 
>> >> >> How is this not conflicting being guaranteed? In particular I don't
>> >> >> see how tables containing AML code and coming from different
>> >> >> sources won't possibly cause ACPI name space collisions.
>> >> >>
>> >> > 
>> >> > Really there is no effective mechanism to avoid ACPI name space
>> >> > collisions (and other kinds of conflicts) between ACPI tables loaded
>> >> > from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
>> >> > are loaded is determined by developers, IMO it's developers'
>> >> > responsibility to avoid any collisions and conflicts with existing ACPI
>> >> > tables.
>> >> 
>> >> Right, but this needs to be spelled out and settled on at design
>> >> time (i.e. now), rather leaving things unspecified, awaiting the
>> >> first clash.
>> > 
>> > So that means if no collision-proof mechanism is introduced, Xen should not
>> > trust any passed-in ACPI tables and should build them by itself?
>> 
>> Basically yes, albeit collision-proof may be too much to demand.
>> Simply separating name spaces (for hvmloader and qemu to have
>> their own sub-spaces) would be sufficient imo. We should trust
>> ourselves to play by such a specification.
>>
> 
> I don't quite understand 'separating name spaces'. Do you mean, for
> example, if both hvmloader and qemu want to put a namespace device under
> \_SB, they could be put in different sub-scopes under \_SB? But it does
> not work for Linux at least.

Aiui just the leaf names matter for sufficient separation, i.e.
recurring sub-scopes should not be a problem.

> Anyway, we may avoid some conflicts between ACPI tables/objects by
> restricting which tables and objects can be passed from QEMU to Xen:
> (1) For ACPI tables, Xen does not accept those that it already builds
> by itself, e.g. DSDT and SSDT.
> (2) Xen does not accept ACPI tables for devices that are not attached to
> a domain, e.g. NFIT cannot be passed if a domain does not have
> vNVDIMM.
> (3) For ACPI objects, Xen only accepts namespace devices and requires
> that their names do not conflict with existing ones provided by Xen.

And how do you imagine to enforce this without parsing the
handed AML? (Remember there's no AML parser in hvmloader.)

Jan

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-28 Thread Haozhong Zhang
On 02/24/16 09:54, Jan Beulich wrote:
> >>> On 24.02.16 at 16:48,  wrote:
> > On 02/24/16 07:24, Jan Beulich wrote:
> >> >>> On 24.02.16 at 14:28,  wrote:
> >> > On 02/18/16 10:17, Jan Beulich wrote:
> >> >> >>> On 01.02.16 at 06:44,  wrote:
> >> >> > 3.3 Guest ACPI Emulation
> >> >> > 
> >> >> > 3.3.1 My Design
> >> >> > 
> >> >> >  Guest ACPI emulation is composed of two parts: building guest NFIT
> >> >> >  and SSDT that defines ACPI namespace devices for NVDIMM, and
> >> >> >  emulating guest _DSM.
> >> >> > 
> >> >> >  (1) Building Guest ACPI Tables
> >> >> > 
> >> >> >   This design reuses and extends hvmloader's existing mechanism that
> >> >> >   loads passthrough ACPI tables from binary files to load NFIT and
> >> >> >   SSDT tables built by QEMU:
> >> >> >   1) Because the current QEMU does not build any ACPI tables when
> >> >> >  it runs as the Xen device model, this design needs to patch QEMU
> >> >> >  to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> >> >> > 
> >> >> >   2) QEMU copies NFIT and SSDT to the end of guest memory below
> >> >> >  4G. The guest address and size of those tables are written into
> >> >> >  xenstore 
> >> >> > (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> >> >> > 
> >> >> >   3) hvmloader is patched to probe and load device model passthrough
> >> >> >  ACPI tables from above xenstore keys. The detected ACPI tables
> >> >> >  are then appended to the end of existing guest ACPI tables just
> >> >> >  like what current construct_passthrough_tables() does.
> >> >> > 
> >> >> >   Reasons for this design are listed below:
> >> >> >   - NFIT and SSDT in question are quite self-contained, i.e. they do
> >> >> > not refer to other ACPI tables and not conflict with existing
> >> >> > guest ACPI tables in Xen. Therefore, it is safe to copy them from
> >> >> > QEMU and append to existing guest ACPI tables.
> >> >> 
> >> >> How is this not conflicting being guaranteed? In particular I don't
> >> >> see how tables containing AML code and coming from different
> >> >> sources won't possibly cause ACPI name space collisions.
> >> >>
> >> > 
> >> > Really there is no effective mechanism to avoid ACPI name space
> >> > collisions (and other kinds of conflicts) between ACPI tables loaded
> >> > from QEMU and ACPI tables built by hvmloader. Because which ACPI tables
> >> > are loaded is determined by developers, IMO it's developers'
> >> > responsibility to avoid any collisions and conflicts with existing ACPI
> >> > tables.
> >> 
> >> Right, but this needs to be spelled out and settled on at design
> >> time (i.e. now), rather leaving things unspecified, awaiting the
> >> first clash.
> > 
> > So that means if no collision-proof mechanism is introduced, Xen should not
> > trust any passed-in ACPI tables and should build them by itself?
> 
> Basically yes, albeit collision-proof may be too much to demand.
> Simply separating name spaces (for hvmloader and qemu to have
> their own sub-spaces) would be sufficient imo. We should trust
> ourselves to play by such a specification.
>

I don't quite understand 'separating name spaces'. Do you mean, for
example, if both hvmloader and qemu want to put a namespace device under
\_SB, they could be put in different sub-scopes under \_SB? But it does
not work for Linux at least.

Anyway, we may avoid some conflicts between ACPI tables/objects by
restricting which tables and objects can be passed from QEMU to Xen:
(1) For ACPI tables, Xen does not accept those that it already builds
by itself, e.g. DSDT and SSDT.
(2) Xen does not accept ACPI tables for devices that are not attached to
a domain, e.g. NFIT cannot be passed if a domain does not have
vNVDIMM.
(3) For ACPI objects, Xen only accepts namespace devices and requires
that their names do not conflict with existing ones provided by Xen.

In the implementation, QEMU could put the passed-in ACPI tables and objects
in a series of blobs in the following format:
  +------+------+--------+
  | type | size |  data  |
  +------+------+--------+
where
(1) 'type' indicates which data are stored in this blob:
0 for ACPI table,
1 for ACPI namespace device,
(2) 'size' indicates this blob's size in bytes. The next blob (if any)
can be found by adding 'size' to the base address of the current blob.
(3) 'data' is of variable length and stores the actual passed content:
(a) for type 0 blob (ACPI table), a complete ACPI table including the
table header is stored in 'data'
(b) for type 1 blob (ACPI namespace device), at the beginning of
   'data' is a 4-byte device name, followed by the AML code within
   that device, e.g. for a device
  Device (DEV0) {
  Name (_HID, "ACPI1234")
  Method (_DSM) { ... }
  }
   "DEV0" is stored at the beginning of 'data', and then is AML code of
  
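
To make the layout concrete, a minimal sketch of how hvmloader could walk
such a blob sequence is below. The field widths (1-byte 'type', 4-byte
little-endian 'size') and the assumption that 'size' covers the whole blob
including its header are illustrative choices only; the real
QEMU/hvmloader interface would have to pin them down.

#include <stdint.h>
#include <string.h>

#define BLOB_TYPE_ACPI_TABLE   0
#define BLOB_TYPE_ACPI_NSDEV   1
#define BLOB_HDR_SIZE          5   /* assumed: 1-byte type + 4-byte size */

static void walk_dm_acpi_blobs(const uint8_t *buf, uint32_t total_len)
{
    uint32_t off = 0;

    while ( off + BLOB_HDR_SIZE <= total_len )
    {
        uint8_t type = buf[off];
        uint32_t size = buf[off + 1] | (buf[off + 2] << 8) |
                        (buf[off + 3] << 16) | ((uint32_t)buf[off + 4] << 24);
        const uint8_t *data = buf + off + BLOB_HDR_SIZE;

        if ( size < BLOB_HDR_SIZE || size > total_len - off )
            break;                          /* malformed blob; stop */

        if ( type == BLOB_TYPE_ACPI_TABLE )
        {
            /* 'data' holds a complete ACPI table, header included; append
             * it like construct_passthrough_tables() does today. */
        }
        else if ( type == BLOB_TYPE_ACPI_NSDEV )
        {
            char name[5];

            memcpy(name, data, 4);          /* leading 4-byte device name */
            name[4] = '\0';
            /* AML code within the device follows at data + 4 with length
             * size - BLOB_HDR_SIZE - 4; 'name' can be checked against the
             * names hvmloader already uses before wrapping the body into
             * the guest SSDT. */
        }

        off += size;                        /* 'size' covers the whole blob */
    }
}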

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-18 Thread Konrad Rzeszutek Wilk
> > > QEMU would always use MFN above guest normal ram and I/O holes for
> > > vNVDIMM. It would attempt to search in that space for a contiguous range
> > > that is large enough for the vNVDIMM devices. Is the guest able to
> > > punch holes in such GFN space?
> > 
> > See XENMAPSPACE_* and their uses.
> > 
> 
> I think we can add the following restrictions to avoid uses of XENMAPSPACE_*
> punching holes in GFNs of vNVDIMM:
> 
> (1) For XENMAPSPACE_shared_info and _grant_table, never map idx in them
> to GFNs occupied by vNVDIMM.

OK, that sounds correct.
> 
> (2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
>(a) never map idx in them to GFNs occupied by vNVDIMM, and
>(b) never map idx corresponding to GFNs occupied by vNVDIMM

Would that mean that guest xen-blkback or xen-netback wouldn't
be able to fetch data from the GFNs? As in, what if the HVM guest
that has the NVDIMM also serves as a device domain - that is it
has xen-blkback running to service other guests?

> 
> 
> Haozhong

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-18 Thread Jan Beulich
>>> On 01.02.16 at 06:44,  wrote:
>  This design treats host NVDIMM devices as ordinary MMIO devices:

Wrt the cachability note earlier on, I assume you're aware that with
the XSA-154 changes we disallow any cachable mappings of MMIO
by default.

>  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
>  and drive host NVDIMM devices (implementing block device
>  interface). Namespaces and file systems on host NVDIMM devices
>  are handled by Dom0 Linux as well.
> 
>  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
>  virtual address space (buf).
> 
>  (3) QEMU gets the host physical address of buf, i.e. the host system
>  physical address that is occupied by /dev/pmem0, and calls Xen
>  hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
> 
>  (ACPI part is described in Section 3.3 later)
> 
>  Above (1)(2) have already been done in current QEMU. Only (3) needs
>  to be implemented in QEMU. No change is needed in Xen for address
>  mapping in this design.
> 
>  Open: It seems no system call/ioctl is provided by Linux kernel to
>get the physical address from a virtual address.
>/proc/<pid>/pagemap provides information of mapping from
>VA to PA. Is it an acceptable solution to let QEMU parse this
>file to get the physical address?
> 
>  Open: For a large pmem, mmap(2) may well not map all of the SPA
>occupied by pmem at the beginning, i.e. QEMU may not be able to
>get all SPA of pmem from buf (in virtual address space) when
>calling XEN_DOMCTL_memory_mapping.
>Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
>entire pmem being mmaped?

A fundamental question I have here is: Why does qemu need to
map this at all? It shouldn't itself need to access those ranges,
since the guest is given direct access. It would seem quite a bit
more natural if qemu simply inquired about the underlying GFN range(s)
and handed those to Xen for translation to MFNs and mapping
into guest space.

>  I notice that the current XEN_DOMCTL_memory_mapping does not make sanity
>  checks for the physical address and size passed from the caller
>  (QEMU). Can QEMU always be trusted? If not, we would need to make Xen
>  aware of the SPA range of pmem so that it can refuse to map physical
>  addresses that are in neither normal ram nor pmem.

I'm not sure what missing sanity checks this is about: The handling
involves two iomem_access_permitted() calls.

> 3.3 Guest ACPI Emulation
> 
> 3.3.1 My Design
> 
>  Guest ACPI emulation is composed of two parts: building guest NFIT
>  and SSDT that defines ACPI namespace devices for NVDIMM, and
>  emulating guest _DSM.
> 
>  (1) Building Guest ACPI Tables
> 
>   This design reuses and extends hvmloader's existing mechanism that
>   loads passthrough ACPI tables from binary files to load NFIT and
>   SSDT tables built by QEMU:
>   1) Because the current QEMU does not build any ACPI tables when
>  it runs as the Xen device model, this design needs to patch QEMU
>  to build NFIT and SSDT (so far only NFIT and SSDT) in this case.
> 
>   2) QEMU copies NFIT and SSDT to the end of guest memory below
>  4G. The guest address and size of those tables are written into
>  xenstore (/local/domain/domid/hvmloader/dm-acpi/{address,length}).
> 
>   3) hvmloader is patched to probe and load device model passthrough
>  ACPI tables from above xenstore keys. The detected ACPI tables
>  are then appended to the end of existing guest ACPI tables just
>  like what current construct_passthrough_tables() does.
> 
>   Reasons for this design are listed below:
>   - NFIT and SSDT in question are quite self-contained, i.e. they do
> not refer to other ACPI tables and not conflict with existing
> guest ACPI tables in Xen. Therefore, it is safe to copy them from
> QEMU and append to existing guest ACPI tables.

How is this not conflicting being guaranteed? In particular I don't
see how tables containing AML code and coming from different
sources won't possibly cause ACPI name space collisions.

> 3.3.3 Alternative Design 2: keeping in Xen
> 
>  Alternative to switching to QEMU, another design would be building
>  NFIT and SSDT in hvmloader or toolstack.
> 
>  The amount and parameters of sub-structures in guest NFIT vary
>  according to different vNVDIMM configurations that can not be decided
>  at compile-time. In contrast, current hvmloader and toolstack can
>  only build static ACPI tables, i.e. their contents are decided
>  statically at compile-time and independent from the guest
>  configuration. In order to build guest NFIT at runtime, this design
>  may take following steps:
>  (1) xl converts NVDIMM configurations in xl.cfg to corresponding QEMU
>  options,
> 
>  (2) QEMU accepts above options, figures out the start SPA range
>  address/size/NVDIMM device handles/..., and writes them in
>  xenstore. No ACPI table is built by 

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-17 Thread Haozhong Zhang
On 02/17/16 02:08, Jan Beulich wrote:
> >>> On 17.02.16 at 10:01,  wrote:
> > On 02/15/16 04:07, Jan Beulich wrote:
> >> >>> On 15.02.16 at 09:43,  wrote:
> >> > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
> >> >> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >> >> >  three parts:
> >> >> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >> >> >  (2) Memory mapping, and
> >> >> >  (3) Guest ACPI emulation.
> >> >> 
> >> >> 
> >> >> .. MCE? and vMCE?
> >> >> 
> >> > 
> >> > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
> >> > way similar to what mc_memerr_dhandler() does, with some differences in
> >> > the data structure and the broken page offline parts:
> >> > 
> >> > Broken NVDIMM pages should be marked as "offlined" so that Xen
> >> > hypervisor can refuse further requests that map them to DomU.
> >> > 
> >> > The real problem here is what data structure will be used to record
> >> > information of NVDIMM pages. Because the size of NVDIMM is usually much
> >> > larger than normal ram, using struct page_info for NVDIMM pages would
> >> > occupy too much memory.
> >> 
> >> I don't see how your alternative below would be less memory
> >> hungry: Since guests have at least partial control of their GFN
> >> space, a malicious guest could punch holes into the contiguous
> >> GFN range that you appear to be thinking about, thus causing
> >> arbitrary splitting of the control structure.
> >>
> > 
> > QEMU would always use MFN above guest normal ram and I/O holes for
> > vNVDIMM. It would attempt to search in that space for a contiguous range
> > that is large enough for the vNVDIMM devices. Is the guest able to
> > punch holes in such GFN space?
> 
> See XENMAPSPACE_* and their uses.
> 

I think we can add the following restrictions to avoid uses of XENMAPSPACE_*
punching holes in GFNs of vNVDIMM:

(1) For XENMAPSPACE_shared_info and _grant_table, never map idx in them
to GFNs occupied by vNVDIMM.

(2) For XENMAPSPACE_gmfn, _gmfn_range and _gmfn_foreign,
   (a) never map idx in them to GFNs occupied by vNVDIMM, and
   (b) never map idx corresponding to GFNs occupied by vNVDIMM
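
As a minimal sketch of how (1) and (2)(a) could be enforced, the
XENMEM_add_to_physmap handler could refuse any request whose target GFN
falls into a vNVDIMM range. gfn_is_vnvdimm() below is a hypothetical
helper that assumes the per-domain nvdimm_pages bookkeeping proposed
earlier in this thread; the same check applied to the source idx would
cover (2)(b).

/* Hypothetical helper, not existing Xen code: assumes nvdimm_pages
 * entries mapped to a domain are kept on d->arch.nvdimm_pages_list. */
static bool gfn_is_vnvdimm(struct domain *d, unsigned long gfn)
{
    const struct nvdimm_pages *np;

    list_for_each_entry ( np, &d->arch.nvdimm_pages_list, domain_list )
        if ( gfn >= np->gfn && gfn < np->gfn + (np->len >> PAGE_SHIFT) )
            return true;                /* GFN is backed by vNVDIMM */

    return false;
}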


Haozhong

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-17 Thread Jan Beulich
>>> On 17.02.16 at 10:01,  wrote:
> On 02/15/16 04:07, Jan Beulich wrote:
>> >>> On 15.02.16 at 09:43,  wrote:
>> > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
>> >> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
>> >> >  three parts:
>> >> >  (1) Guest clwb/clflushopt/pcommit enabling,
>> >> >  (2) Memory mapping, and
>> >> >  (3) Guest ACPI emulation.
>> >> 
>> >> 
>> >> .. MCE? and vMCE?
>> >> 
>> > 
>> > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
>> > way similar to what mc_memerr_dhandler() does, with some differences in
>> > the data structure and the broken page offline parts:
>> > 
>> > Broken NVDIMM pages should be marked as "offlined" so that Xen
>> > hypervisor can refuse further requests that map them to DomU.
>> > 
>> > The real problem here is what data structure will be used to record
>> > information of NVDIMM pages. Because the size of NVDIMM is usually much
>> > larger than normal ram, using struct page_info for NVDIMM pages would
>> > occupy too much memory.
>> 
>> I don't see how your alternative below would be less memory
>> hungry: Since guests have at least partial control of their GFN
>> space, a malicious guest could punch holes into the contiguous
>> GFN range that you appear to be thinking about, thus causing
>> arbitrary splitting of the control structure.
>>
> 
> QEMU would always use MFN above guest normal ram and I/O holes for
> vNVDIMM. It would attempt to search in that space for a contiguous range
> that is large enough for the vNVDIMM devices. Is the guest able to
> punch holes in such GFN space?

See XENMAPSPACE_* and their uses.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-17 Thread Haozhong Zhang
On 02/16/16 05:55, Jan Beulich wrote:
> >>> On 16.02.16 at 12:14,  wrote:
> > On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
> >> On 02/04/16 20:24, Stefano Stabellini wrote:
> >> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> >> > > On 02/03/16 15:22, Stefano Stabellini wrote:
> >> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
> >> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> >> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> >> > > > > >> Or, we can make a file system on /dev/pmem0, create files on 
> >> > > > > >> it, set
> >> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then 
> >> > > > > >> pass
> >> > > > > >> those files to QEMU. In this way, non-root QEMU should be able 
> >> > > > > >> to
> >> > > > > >> mmap those files.
> >> > > > > >
> >> > > > > > Maybe that would work. Worth adding it to the design, I would 
> >> > > > > > like to
> >> > > > > > read more details on it.
> >> > > > > >
> >> > > > > > Also note that QEMU initially runs as root but drops privileges 
> >> > > > > > to
> >> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially 
> >> > > > > > QEMU
> >> > > > > > *could* mmap /dev/pmem0 while it is still running as root, but then 
> >> > > > > > it
> >> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
> >> > > > > > (hotplug scenario).
> >> > > > >
> >> > > > > This is basically the same problem we have for a bunch of other 
> >> > > > > things,
> >> > > > > right?  Having xl open a file and then pass it via qmp to qemu 
> >> > > > > should
> >> > > > > work in theory, right?
> >> > > >
> >> > > > Is there one /dev/pmem? per assignable region?
> >> > > 
> >> > > Yes.
> >> > > 
> >> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
> >> > > configuration that is going to access a host block device, e.g.
> >> > >  disk = [ '/dev/sdb,,hda' ]
> >> > > If that works with non-root qemu, I may take the similar solution for
> >> > > pmem.
> >> >  
> >> > Today the user is required to give the correct ownership and access mode
> >> > to the block device, so that non-root QEMU can open it. However in the
> >> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> >> > the feature doesn't work at all with non-root QEMU
> >> > (http://marc.info/?l=xen-devel=145261763600528).
> >> > 
> >> > If there is one /dev/pmem device per assignable region, then it would be
> >> > conceivable to change its ownership so that non-root QEMU can open it.
> >> > Or, better, the file descriptor could be passed by the toolstack via
> >> > qmp.
> >> 
> >> Passing file descriptor via qmp is not enough.
> >> 
> >> Let me clarify where the requirement for root/privileged permissions
> >> comes from. The primary workflow in my design that maps a host pmem
> >> region or files in host pmem region to guest is shown as below:
> >>  (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
> >>  /dev/pmem0) to its virtual address space, i.e. the guest virtual
> >>  address space.
> >>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
> >>  occupied by the host pmem to a DomU. This step requires the
> >>  translation from the guest virtual address (where the host pmem is
> >>  mmaped in (1)) to the host physical address. The translation can be
> >>  done by either
> >> (a) QEMU that parses its own /proc/self/pagemap,
> >>  or
> >> (b) Xen hypervisor that does the translation by itself [1] (though
> >> this choice is not quite doable from Konrad's comments [2]).
> >> 
> >> [1] 
> >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html 
> >> [2] 
> >> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html 
> >> 
> >> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
> >> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
> >> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
> >> pagemap will not contain all mappings. However, mlock may require
> >> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
> >> mlock operates on memory, the permission to open(2) the host pmem files
> >> does not solve the problem and therefore passing file descriptor via qmp
> >> does not help.
> >> 
> >> For 2-b, from Konrad's comments [2], mlock is also required and
> >> privileged permission may be required consequently.
> >> 
> >> Note that the mapping and the address translation are done before QEMU
> >> dropping privileged permissions, so non-root QEMU should be able to work
> >> with above design until we start considering vNVDIMM hotplug (which has
> >> not been supported by the current vNVDIMM implementation in QEMU). In
> >> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
> >> running with root permissions.
> > 
> > Are we all good with the fact that vNVDIMM 

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-17 Thread Haozhong Zhang
On 02/15/16 04:07, Jan Beulich wrote:
> >>> On 15.02.16 at 09:43,  wrote:
> > On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
> >> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >> >  three parts:
> >> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >> >  (2) Memory mapping, and
> >> >  (3) Guest ACPI emulation.
> >> 
> >> 
> >> .. MCE? and vMCE?
> >> 
> > 
> > NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
> > way similar to what mc_memerr_dhandler() does, with some differences in
> > the data structure and the broken page offline parts:
> > 
> > Broken NVDIMM pages should be marked as "offlined" so that Xen
> > hypervisor can refuse further requests that map them to DomU.
> > 
> > The real problem here is what data structure will be used to record
> > information of NVDIMM pages. Because the size of NVDIMM is usually much
> > larger than normal ram, using struct page_info for NVDIMM pages would
> > occupy too much memory.
> 
> I don't see how your alternative below would be less memory
> hungry: Since guests have at least partial control of their GFN
> space, a malicious guest could punch holes into the contiguous
> GFN range that you appear to be thinking about, thus causing
> arbitrary splitting of the control structure.
>

QEMU would always use MFN above guest normal ram and I/O holes for
vNVDIMM. It would attempt to search in that space for a contiguous range
that is large enough for the vNVDIMM devices. Is the guest able to
punch holes in such GFN space?

> Also - see how you all of the sudden came to think of using
> struct page_info here (implying hypervisor control of these
> NVDIMM ranges)?
>
> > (4) When a MCE for host NVDIMM SPA range [start_mfn, end_mfn] happens,
> >   (a) search xen_nvdimm_pages_list for affected nvdimm_pages structures,
> >   (b) for each affected nvdimm_pages, if it belongs to a domain d and
> >   its broken field is already set, the domain d will be shutdown to
> >   prevent malicious guest accessing broken page (similarly to what
> >   offline_page() does).
> >   (c) for each affected nvdimm_pages, set its broken field to 1, and
> >   (d) for each affected nvdimm_pages, inject to domain d a vMCE that
> >   covers its GFN range if that nvdimm_pages belongs to domain d.
> 
> I don't see why you'd want to mark the entire range bad: All
> that's known to be broken is a single page. Hence this would be
> another source of splits of the proposed control structures.
>

Oh yes, I should split the range here instead of marking it entirely
bad. Such splits are caused by hardware errors, so unless the host
NVDIMM is terribly broken there should not be a large number of them.

Haozhong

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-16 Thread Jan Beulich
>>> On 16.02.16 at 12:14,  wrote:
> On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
>> On 02/04/16 20:24, Stefano Stabellini wrote:
>> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
>> > > On 02/03/16 15:22, Stefano Stabellini wrote:
>> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
>> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
>> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>> > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, 
>> > > > > >> set
>> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then 
>> > > > > >> pass
>> > > > > >> those files to QEMU. In this way, non-root QEMU should be able to
>> > > > > >> mmap those files.
>> > > > > >
>> > > > > > Maybe that would work. Worth adding it to the design, I would like 
>> > > > > > to
>> > > > > > read more details on it.
>> > > > > >
>> > > > > > Also note that QEMU initially runs as root but drops privileges to
>> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially 
>> > > > > > QEMU
>> > > > > > > *could* mmap /dev/pmem0 while it is still running as root, but then it
>> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
>> > > > > > (hotplug scenario).
>> > > > >
>> > > > > This is basically the same problem we have for a bunch of other 
>> > > > > things,
>> > > > > right?  Having xl open a file and then pass it via qmp to qemu should
>> > > > > work in theory, right?
>> > > >
>> > > > Is there one /dev/pmem? per assignable region?
>> > > 
>> > > Yes.
>> > > 
>> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
>> > > configuration that is going to access a host block device, e.g.
>> > >  disk = [ '/dev/sdb,,hda' ]
>> > > If that works with non-root qemu, I may take the similar solution for
>> > > pmem.
>> >  
>> > Today the user is required to give the correct ownership and access mode
>> > to the block device, so that non-root QEMU can open it. However in the
>> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
>> > the feature doesn't work at all with non-root QEMU
>> > (http://marc.info/?l=xen-devel=145261763600528).
>> > 
>> > If there is one /dev/pmem device per assignable region, then it would be
>> > conceivable to change its ownership so that non-root QEMU can open it.
>> > Or, better, the file descriptor could be passed by the toolstack via
>> > qmp.
>> 
>> Passing file descriptor via qmp is not enough.
>> 
>> Let me clarify where the requirement for root/privileged permissions
>> comes from. The primary workflow in my design that maps a host pmem
>> region or files in host pmem region to guest is shown as below:
>>  (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
>>  /dev/pmem0) to its virtual address space, i.e. the guest virtual
>>  address space.
>>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
>>  occupied by the host pmem to a DomU. This step requires the
>>  translation from the guest virtual address (where the host pmem is
>>  mmaped in (1)) to the host physical address. The translation can be
>>  done by either
>> (a) QEMU that parses its own /proc/self/pagemap,
>>  or
>> (b) Xen hypervisor that does the translation by itself [1] (though
>> this choice is not quite doable from Konrad's comments [2]).
>> 
>> [1] 
>> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html 
>> [2] 
>> http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html 
>> 
>> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
>> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
>> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
>> pagemap will not contain all mappings. However, mlock may require
>> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
>> mlock operates on memory, the permission to open(2) the host pmem files
>> does not solve the problem and therefore passing file descriptor via qmp
>> does not help.
>> 
>> For 2-b, from Konrad's comments [2], mlock is also required and
>> privileged permission may be required consequently.
>> 
>> Note that the mapping and the address translation are done before QEMU
>> dropping privileged permissions, so non-root QEMU should be able to work
>> with above design until we start considering vNVDIMM hotplug (which has
>> not been supported by the current vNVDIMM implementation in QEMU). In
>> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
>> running with root permissions.
> 
> Are we all good with the fact that vNVDIMM hotplug won't work (unless
> the user explicitly asks for it at domain creation time, which is
> very unlikely, since otherwise she could just use coldplug)?

No, at least there needs to be a road towards hotplug, even if
initially this may not be supported/implemented.

Jan


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-16 Thread Stefano Stabellini
On Mon, 15 Feb 2016, Zhang, Haozhong wrote:
> On 02/04/16 20:24, Stefano Stabellini wrote:
> > On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> > > On 02/03/16 15:22, Stefano Stabellini wrote:
> > > > On Wed, 3 Feb 2016, George Dunlap wrote:
> > > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > > > > >> Or, we can make a file system on /dev/pmem0, create files on it, 
> > > > > >> set
> > > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> > > > > >> mmap those files.
> > > > > >
> > > > > > Maybe that would work. Worth adding it to the design, I would like 
> > > > > > to
> > > > > > read more details on it.
> > > > > >
> > > > > > Also note that QEMU initially runs as root but drops privileges to
> > > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > > > > *could* mmap /dev/pmem0 while it is still running as root, but then it
> > > > > > wouldn't work for any devices that need to be mmap'ed at run time
> > > > > > (hotplug scenario).
> > > > >
> > > > > This is basically the same problem we have for a bunch of other 
> > > > > things,
> > > > > right?  Having xl open a file and then pass it via qmp to qemu should
> > > > > work in theory, right?
> > > >
> > > > Is there one /dev/pmem? per assignable region?
> > > 
> > > Yes.
> > > 
> > > BTW, I'm wondering whether and how non-root qemu works with xl disk
> > > configuration that is going to access a host block device, e.g.
> > >  disk = [ '/dev/sdb,,hda' ]
> > > If that works with non-root qemu, I may take the similar solution for
> > > pmem.
> >  
> > Today the user is required to give the correct ownership and access mode
> > to the block device, so that non-root QEMU can open it. However in the
> > case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> > the feature doesn't work at all with non-root QEMU
> > (http://marc.info/?l=xen-devel=145261763600528).
> > 
> > If there is one /dev/pmem device per assignable region, then it would be
> > conceivable to change its ownership so that non-root QEMU can open it.
> > Or, better, the file descriptor could be passed by the toolstack via
> > qmp.
> 
> Passing file descriptor via qmp is not enough.
> 
> Let me clarify where the requirement for root/privileged permissions
> comes from. The primary workflow in my design that maps a host pmem
> region or files in host pmem region to guest is shown as below:
>  (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
>  /dev/pmem0) to its virtual address space, i.e. the guest virtual
>  address space.
>  (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
>  occupied by the host pmem to a DomU. This step requires the
>  translation from the guest virtual address (where the host pmem is
>  mmaped in (1)) to the host physical address. The translation can be
>  done by either
> (a) QEMU that parses its own /proc/self/pagemap,
>  or
> (b) Xen hypervisor that does the translation by itself [1] (though
> this choice is not quite doable from Konrad's comments [2]).
> 
> [1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html
> [2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html
> 
> For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
> since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
> pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
> pagemap will not contain all mappings. However, mlock may require
> privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
> mlock operates on memory, the permission to open(2) the host pmem files
> does not solve the problem and therefore passing file descriptor via qmp
> does not help.
> 
> For 2-b, from Konrad's comments [2], mlock is also required and
> privileged permission may be required consequently.
> 
> Note that the mapping and the address translation are done before QEMU
> dropping privileged permissions, so non-root QEMU should be able to work
> with above design until we start considering vNVDIMM hotplug (which has
> not been supported by the current vNVDIMM implementation in QEMU). In
> the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
> running with root permissions.

Are we all good with the fact that vNVDIMM hotplug won't work (unless
the user explicitly asks for it at domain creation time, which is
very unlikely, since otherwise she could just use coldplug)?

If so, the design is OK for me.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-15 Thread Jan Beulich
>>> On 15.02.16 at 09:43,  wrote:
> On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
>> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
>> >  three parts:
>> >  (1) Guest clwb/clflushopt/pcommit enabling,
>> >  (2) Memory mapping, and
>> >  (3) Guest ACPI emulation.
>> 
>> 
>> .. MCE? and vMCE?
>> 
> 
> NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
> way similar to what mc_memerr_dhandler() does, with some differences in
> the data structure and the broken page offline parts:
> 
> Broken NVDIMM pages should be marked as "offlined" so that Xen
> hypervisor can refuse further requests that map them to DomU.
> 
> The real problem here is what data structure will be used to record
> information of NVDIMM pages. Because the size of NVDIMM is usually much
> larger than normal ram, using struct page_info for NVDIMM pages would
> occupy too much memory.

I don't see how your alternative below would be less memory
hungry: Since guests have at least partial control of their GFN
space, a malicious guest could punch holes into the contiguous
GFN range that you appear to be thinking about, thus causing
arbitrary splitting of the control structure.

Also - see how you all of the sudden came to think of using
struct page_info here (implying hypervisor control of these
NVDIMM ranges)?

> (4) When a MCE for host NVDIMM SPA range [start_mfn, end_mfn] happens,
>   (a) search xen_nvdimm_pages_list for affected nvdimm_pages structures,
>   (b) for each affected nvdimm_pages, if it belongs to a domain d and
>   its broken field is already set, the domain d will be shutdown to
>   prevent malicious guest accessing broken page (similarly to what
>   offline_page() does).
>   (c) for each affected nvdimm_pages, set its broken field to 1, and
>   (d) for each affected nvdimm_pages, inject to domain d a vMCE that
>   covers its GFN range if that nvdimm_pages belongs to domain d.

I don't see why you'd want to mark the entire range bad: All
that's known to be broken is a single page. Hence this would be
another source of splits of the proposed control structures.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-15 Thread Zhang, Haozhong
On 02/03/16 23:47, Konrad Rzeszutek Wilk wrote:
> > > > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > > > >get the physical address from a virtual address.
> > > > >/proc/<pid>/pagemap provides information of mapping from
> > > > >VA to PA. Is it an acceptable solution to let QEMU parse this
> > > > >file to get the physical address?
> > > > 
> > > > Does it work in a non-root scenario?
> > > >
> > > 
> > > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> > > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get 
> > > PFNs.
> > > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> > > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> > > | Reason: information about PFNs helps in exploiting Rowhammer 
> > > vulnerability.
> 
> Ah right.
> > >
> > > A possible alternative is to add a new hypercall similar to
> > > XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> > > parameter and translating to machine address in the hypervisor.
> > 
> > That might work.
> 
> That won't work.
> 
> This is a userspace VMA - which means that once the ioctl is done we swap
> to kernel virtual addresses. Now we may know that the prior cr3 has the
> userspace virtual address and walk it down - but what if the domain
> that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere
> inside the kernel.
> 
> Which means this hypercall would need to know the Linux kernel task structure
> to find this.
> 
> May I propose another solution - a stacking driver (similar to loop). You
> set it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img
> created).
> Then mmap the /dev/mapper/guest.img - all of the operations are the same - 
> except
> it may have an extra ioctl - get_pfns - which would provide the data in 
> similar
> form to pagemap.txt.
>

This stacking driver approach still seems to need privileged permissions
and more kernel modifications, so ...

> But folks will then ask - why don't you just use pagemap? Could the pagemap
> have an extra security capability check? One that can be set for
> QEMU?
>

... I would like to use pagemap and mlock.

Haozhong

> > 
> > 
> > > > >  Open: For a large pmem, mmap(2) may well not map all of the SPA
> > > > >occupied by pmem at the beginning, i.e. QEMU may not be able to
> > > > >get all SPA of pmem from buf (in virtual address space) when
> > > > >calling XEN_DOMCTL_memory_mapping.
> > > > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > > > >entire pmem being mmaped?
> > > > 
> > > > Ditto
> > > >
> > > 
> > > No. If I take the above alternative for the first open, maybe the new
> > > hypercall above can inject page faults into dom0 for the unmapped
> > > virtual address so as to enforce dom0 Linux to create the page
> > > mapping.
> 
> Ugh. That sounds hacky. And you wouldn't necessarily be safe.
> Imagine that the system admin decides to defrag the /dev/pmem filesystem.
> Or move the files (disk images) around. If they do that - we may
> still have the guest mapped to system addresses which may contain filesystem
> metadata now, or a different guest image. We MUST mlock or lock the file
> during the duration of the guest.
> 
> 
> ___
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-15 Thread Haozhong Zhang
On 02/03/16 03:15, Konrad Rzeszutek Wilk wrote:
> > 3. Design of vNVDIMM in Xen
> 
> Thank you for this design!
> 
> > 
> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >  three parts:
> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >  (2) Memory mapping, and
> >  (3) Guest ACPI emulation.
> 
> 
> .. MCE? and vMCE?
> 

NVDIMM can generate UCR errors like normal ram. Xen may handle them in a
way similar to what mc_memerr_dhandler() does, with some differences in
the data structure and the broken page offline parts:

Broken NVDIMM pages should be marked as "offlined" so that Xen
hypervisor can refuse further requests that map them to DomU.

The real problem here is what data structure will be used to record
information of NVDIMM pages. Because the size of NVDIMM is usually much
larger than normal ram, using struct page_info for NVDIMM pages would
occupy too much memory.

Alternatively, we may use a range set to represent NVDIMM pages:

struct nvdimm_pages
{
    unsigned long mfn; /* starting MFN of a range of NVDIMM pages */
    unsigned long gfn; /* starting GFN where this range is mapped,
                          initially INVALID_GFN */
    unsigned long len; /* length of this range in bytes */

    int broken;        /* 0: initial value,
                          1: this range of NVDIMM pages is broken
                             and offlined */

    struct domain *d;  /* NULL: initial value,
                          not NULL: the domain this range is mapped to */

    /*
     * Every nvdimm_pages structure is linked in the global
     * xen_nvdimm_pages_list.
     *
     * If it is mapped to a domain d, it is also linked in
     * d->arch.nvdimm_pages_list.
     */
    struct list_head domain_list;
    struct list_head global_list;
};

struct list_head xen_nvdimm_pages_list;

/* in asm-x86/domain.h */
struct arch_domain
{
    ...
    struct list_head nvdimm_pages_list;
};

(1) Initially, Xen hypervisor creates a nvdimm_pages structure for each
pmem region (starting SPA and size reported by Dom0 NVDIMM driver)
and links all nvdimm_pages structures in xen_nvdimm_pages_list.

(2) If Xen hypervisor is then requested to map a range of NVDIMM pages
[start_mfn, end_mfn] to gfn of domain d, it will

   (a) Check whether the GFN range [gfn, gfn + end_mfn - start_mfn + 1]
   of domain d has been occupied (e.g. by normal ram, I/O or other
   vNVDIMM).

   (b) Search xen_nvdimm_pages_list for one or multiple nvdimm_pages
   that [start_mfn, end_mfn] can fit in.

   If a nvdimm_pages structure is entirely covered by [start_mfn,
   end_mfn], then link that nvdimm_pages structure to
   d->arch.nvdimm_pages_list.

   If only a portion of a nvdimm_pages structure is covered by
   [start_mfn, end_mfn], then split that nvdimm_pages structure
   into multiple ones (the one entirely covered and at most two not
   covered), link the covered one to d->arch.nvdimm_pages_list and
   all of them to xen_nvdimm_pages_list as well.

   gfn and d fields of nvdimm_pages structures linked to
   d->arch.nvdimm_pages_list are also set accordingly.

(3) When a domain d is shutdown/destroyed, merge its nvdimm_pages
structures (i.e. those in d->arch.nvdimm_pages_list) in
xen_nvdimm_pages_list.

(4) When a MCE for host NVDIMM SPA range [start_mfn, end_mfn] happens,
  (a) search xen_nvdimm_pages_list for affected nvdimm_pages structures,
  (b) for each affected nvdimm_pages, if it belongs to a domain d and
  its broken field is already set, the domain d will be shutdown to
  prevent malicious guest accessing broken page (similarly to what
  offline_page() does).
  (c) for each affected nvdimm_pages, set its broken field to 1, and
  (d) for each affected nvdimm_pages, inject to domain d a vMCE that
  covers its GFN range if that nvdimm_pages belongs to domain d.
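
To make step (2)(b) above concrete, a rough sketch of the split-and-assign
operation is below. nvdimm_pages_alloc() is an assumed helper returning a
zeroed structure with gfn set to INVALID_GFN; locking and error handling
are omitted.

/* Rough sketch of step (2)(b); nvdimm_pages_alloc() is an assumed helper. */
static struct nvdimm_pages *nvdimm_assign_range(struct domain *d,
                                                unsigned long start_mfn,
                                                unsigned long end_mfn,
                                                unsigned long gfn)
{
    struct nvdimm_pages *np;

    list_for_each_entry ( np, &xen_nvdimm_pages_list, global_list )
    {
        unsigned long np_end = np->mfn + (np->len >> PAGE_SHIFT) - 1;

        if ( np->d || start_mfn < np->mfn || end_mfn > np_end )
            continue;                 /* already assigned or not contained */

        if ( start_mfn > np->mfn )    /* split off the uncovered head */
        {
            struct nvdimm_pages *head = nvdimm_pages_alloc();

            head->mfn = np->mfn;
            head->len = (start_mfn - np->mfn) << PAGE_SHIFT;
            list_add(&head->global_list, &xen_nvdimm_pages_list);
        }
        if ( end_mfn < np_end )       /* split off the uncovered tail */
        {
            struct nvdimm_pages *tail = nvdimm_pages_alloc();

            tail->mfn = end_mfn + 1;
            tail->len = (np_end - end_mfn) << PAGE_SHIFT;
            list_add(&tail->global_list, &xen_nvdimm_pages_list);
        }

        /* Shrink np to exactly the covered part and assign it to d. */
        np->mfn = start_mfn;
        np->len = (end_mfn - start_mfn + 1) << PAGE_SHIFT;
        np->gfn = gfn;
        np->d   = d;
        list_add(&np->domain_list, &d->arch.nvdimm_pages_list);

        return np;
    }

    return NULL;    /* no single free range covers [start_mfn, end_mfn] */
}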

Comments, pls.

Thanks,
Haozhong

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-14 Thread Zhang, Haozhong
On 02/04/16 20:24, Stefano Stabellini wrote:
> On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> > On 02/03/16 15:22, Stefano Stabellini wrote:
> > > On Wed, 3 Feb 2016, George Dunlap wrote:
> > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> > > > >> mmap those files.
> > > > >
> > > > > Maybe that would work. Worth adding it to the design, I would like to
> > > > > read more details on it.
> > > > >
> > > > > Also note that QEMU initially runs as root but drops privileges to
> > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > > > *could* mmap /dev/pmem0 while it is still running as root, but then it
> > > > > wouldn't work for any devices that need to be mmap'ed at run time
> > > > > (hotplug scenario).
> > > >
> > > > This is basically the same problem we have for a bunch of other things,
> > > > right?  Having xl open a file and then pass it via qmp to qemu should
> > > > work in theory, right?
> > >
> > > Is there one /dev/pmem? per assignable region?
> > 
> > Yes.
> > 
> > BTW, I'm wondering whether and how non-root qemu works with xl disk
> > configuration that is going to access a host block device, e.g.
> >  disk = [ '/dev/sdb,,hda' ]
> > If that works with non-root qemu, I may take the similar solution for
> > pmem.
>  
> Today the user is required to give the correct ownership and access mode
> to the block device, so that non-root QEMU can open it. However in the
> case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> the feature doesn't work at all with non-root QEMU
> (http://marc.info/?l=xen-devel=145261763600528).
> 
> If there is one /dev/pmem device per assignable region, then it would be
> conceivable to change its ownership so that non-root QEMU can open it.
> Or, better, the file descriptor could be passed by the toolstack via
> qmp.

Passing file descriptor via qmp is not enough.

Let me clarify where the requirement for root/privileged permissions
comes from. The primary workflow in my design that maps a host pmem
region or files in host pmem region to guest is shown as below:
 (1) QEMU in Dom0 mmap the host pmem (the host /dev/pmem0 or files on
 /dev/pmem0) to its virtual address space, i.e. the guest virtual
 address space.
 (2) QEMU asks Xen hypervisor to map the host physical address, i.e. SPA
 occupied by the host pmem to a DomU. This step requires the
 translation from the guest virtual address (where the host pmem is
 mmaped in (1)) to the host physical address. The translation can be
 done by either
(a) QEMU that parses its own /proc/self/pagemap,
 or
(b) Xen hypervisor that does the translation by itself [1] (though
this choice is not quite doable from Konrad's comments [2]).

[1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html
[2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html

For 2-a, reading /proc/self/pagemap requires CAP_SYS_ADMIN capability
since linux kernel 4.0. Furthermore, if we don't mlock the mapped host
pmem (by adding MAP_LOCKED flag to mmap or calling mlock after mmap),
pagemap will not contain all mappings. However, mlock may require
privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
mlock operates on memory, the permission to open(2) the host pmem files
does not solve the problem and therefore passing file descriptor via qmp
does not help.

For 2-b, from Konrad's comments [2], mlock is also required and
privileged permission may be required consequently.

Note that the mapping and the address translation are done before QEMU
dropping privileged permissions, so non-root QEMU should be able to work
with above design until we start considering vNVDIMM hotplug (which has
not been supported by the current vNVDIMM implementation in QEMU). In
the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
running with root permissions.
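
For reference, a minimal sketch of the 2-a translation (not part of any
posted patch): look up the pagemap entry of a virtual address inside the
mmap'ed and mlock'ed pmem region. The entry format (bit 63 = present,
bits 0-54 = PFN) is taken from Documentation/vm/pagemap.txt; as said
above, the PFN field is only visible with CAP_SYS_ADMIN.

#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Translate a virtual address in a mmap'ed+mlock'ed pmem region to the
 * host physical address via /proc/self/pagemap; returns 0 on failure. */
static uint64_t virt_to_phys(const void *va)
{
    long psize = sysconf(_SC_PAGESIZE);
    uint64_t entry, pa = 0;
    off_t off = ((uintptr_t)va / psize) * sizeof(entry);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if ( fd < 0 )
        return 0;

    if ( pread(fd, &entry, sizeof(entry), off) == sizeof(entry) &&
         (entry & (1ULL << 63)) )                   /* page present */
    {
        uint64_t pfn = entry & ((1ULL << 55) - 1);  /* bits 0-54 */

        if ( pfn )           /* zero when CAP_SYS_ADMIN is missing */
            pa = pfn * psize + ((uintptr_t)va & (psize - 1));
    }

    close(fd);
    return pa;
}

QEMU would then hand the resulting SPA range(s) to
XEN_DOMCTL_memory_mapping as in step (2).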

Haozhong


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-06 Thread Ross Philipson
On 02/05/2016 08:43 PM, Haozhong Zhang wrote:
> On 02/05/16 09:40, Ross Philipson wrote:
>> On 02/03/2016 09:09 AM, Andrew Cooper wrote:
> [...]
>>> I agree.
>>>
>>> There has to be a single entity responsible for collating the eventual
>>> ACPI handed to the guest, and this is definitely HVMLoader.
>>>
>>> However, it is correct that Qemu create the ACPI tables for the devices
>>> it emulates for the guest.
>>>
>>> We need to agree on a mechanism whereby each entity can provide their
>>> own subset of the ACPI tables to HVMLoader, and have HVMLoader present
>>> the final set properly to the VM.
>>>
>>> There is an existing usecase of passing the Host SLIC table to a VM, for
>>> OEM Versions of Windows.  I believe this is achieved with
>>> HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
>>> inflexible and could probably do with being made a little more generic.
>>
>> A while back I added a generic mechanism to load extra ACPI tables into a
>> guest, configurable at runtime. It looks like the functionality is still
>> present. That might be an option.
>>
>> Also, following the thread, it wasn't clear if some of the tables like the
>> SSDT for the NVDIMM device and its _FIT/_DSM methods were something that
>> could be statically created at build time. If it is something that needs to
>> be generated at runtime (e.g. platform specific), I have a library that can
>> generate any AML on the fly and create SSDTs.
>>
>> Anyway just FYI in case this is helpful.
>>
> 
> Hi Ross,
> 
> Thanks for the information!
> 
> SSDT for NVDIMM devices can not be created statically, because the
> number of some items in it can not be determined at build time. For
> example, the number of NVDIMM ACPI namespace devices (_DSM is under it)
> defined in SSDT is determined by the number of vNVDIMM devices in domain
> configuration. FYI, a sample SSDT for NVDIMM looks like
> 
>   Scope (\_SB){
>   Device (NVDR) // NVDIMM Root device
>   {
>   Name (_HID, "ACPI0012")
>   Method (_STA) {...}
>   Method (_FIT) {...}
>   Method (_DSM, ...) {
>   ...
>   }
>   }
> 
>   Device (NVD0) // 1st NVDIMM Device
>   {
>   Name(_ADR, h0)
>   Method (_DSM, ...) {
>   ...
>   }
>   }
> 
>   Device (NVD1) // 2nd NVDIMM Device
>   {
>   Name(_ADR, h1)
>   Method (_DSM, ...) {
>   ...
>   }
>   }
> 
>   ...
>   }

Makes sense.

> 
> I had ported QEMU's AML builder code as well as NVDIMM ACPI building
> code to hvmloader and it did work, but then there was too much
> duplicated code for vNVDIMM between QEMU and hvmloader.
> Therefore, I prefer to let QEMU, which emulates the vNVDIMM devices,
> build those tables, as in Andrew and Jan's replies.

Yeah, it looks like QEMU's AML-generating code is quite complete nowadays.
Back when I wrote my library there wasn't really much out there. Anyway,
this is where it lives in case there is something missing that it could
generate:

https://github.com/OpenXT/xctools/tree/master/libxenacpi


> 
> Thanks,
> Haozhong
> 


-- 
Ross Philipson

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-05 Thread Ross Philipson

On 02/03/2016 09:09 AM, Andrew Cooper wrote:

On 03/02/16 09:13, Jan Beulich wrote:

On 03.02.16 at 08:00,  wrote:

On 02/02/16 17:11, Stefano Stabellini wrote:

Once upon a time somebody made the decision that ACPI tables
on Xen should be static and included in hvmloader. That might have been
a bad decision but at least it was coherent. Loading only *some* tables
from QEMU, but not others, it feels like an incomplete design to me.

For example, QEMU is currently in charge of emulating the PCI bus, why
shouldn't it be QEMU that generates the PRT and MCFG?


To Keir, Jan and Andrew:

Are there anything related to ACPI that must be done (or are better to
be done) in hvmloader?

Some of the static tables (FADT and HPET come to mind) likely would
better continue to live in hvmloader. MCFG (for example) coming from
qemu, otoh, would be quite natural (and would finally allow MMCFG
support for guests in the first place). I.e. ...


I prefer switching to QEMU building all ACPI tables for devices that it
is emulating. However this alternative is good too because it is
coherent with the current design.

I would prefer to this one if the final conclusion is that only one
agent should be allowed to build guest ACPI. As I said above, it looks
like a big change to switch to QEMU for all ACPI tables and I'm afraid
it would break some existing guests.

... I indeed think that tables should come from qemu for components
living in qemu, and from hvmloader for components coming from Xen.


I agree.

There has to be a single entity responsible for collating the eventual
ACPI handed to the guest, and this is definitely HVMLoader.

However, it is correct that Qemu create the ACPI tables for the devices
it emulates for the guest.

We need to agree on a mechanism whereby each entity can provide their
own subset of the ACPI tables to HVMLoader, and have HVMLoader present
the final set properly to the VM.

There is an existing usecase of passing the Host SLIC table to a VM, for
OEM Versions of Windows.  I believe this is achieved with
HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
inflexible and could probably do with being made a little more generic.


A while back I added a generic mechanism to load extra ACPI tables into 
a guest, configurable at runtime. It looks like the functionality is 
still present. That might be an option.


Also, following the thread, it wasn't clear if some of the tables like 
the SSDT for the NVDIMM device and it's _FIT/_DSM methods were something 
that could be statically created at build time. If it is something that 
needs to be generated at runtime (e.g. platform specific), I have a 
library that can generate any AML on the fly and create SSDTs.


Anyway just FYI in case this is helpful.



~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel




--
Ross Philipson

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-05 Thread Haozhong Zhang
On 02/05/16 09:40, Ross Philipson wrote:
> On 02/03/2016 09:09 AM, Andrew Cooper wrote:
[...]
> >I agree.
> >
> >There has to be a single entity responsible for collating the eventual
> >ACPI handed to the guest, and this is definitely HVMLoader.
> >
> >However, it is correct that Qemu create the ACPI tables for the devices
> >it emulates for the guest.
> >
> >We need to agree on a mechanism whereby each entity can provide their
> >own subset of the ACPI tables to HVMLoader, and have HVMLoader present
> >the final set properly to the VM.
> >
> >There is an existing usecase of passing the Host SLIC table to a VM, for
> >OEM Versions of Windows.  I believe this is achieved with
> >HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
> >inflexible and could probably do with being made a little more generic.
>
> A while back I added a generic mechanism to load extra ACPI tables into a
> guest, configurable at runtime. It looks like the functionality is still
> present. That might be an option.
>
> Also, following the thread, it wasn't clear if some of the tables like the
> SSDT for the NVDIMM device and it's _FIT/_DSM methods were something that
> could be statically created at build time. If it is something that needs to
> be generated at runtime (e.g. platform specific), I have a library that can
> generate any AML on the fly and create SSDTs.
>
> Anyway just FYI in case this is helpful.
>

Hi Ross,

Thanks for the information!

SSDT for NVDIMM devices can not be created statically, because the
number of some items in it can not be determined at build time. For
example, the number of NVDIMM ACPI namespace devices (_DSM is under it)
defined in SSDT is determined by the number of vNVDIMM devices in domain
configuration. FYI, a sample SSDT for NVDIMM looks like

  Scope (\_SB){
  Device (NVDR) // NVDIMM Root device
  {
  Name (_HID, "ACPI0012")
  Method (_STA) {...}
  Method (_FIT) {...}
  Method (_DSM, ...) {
  ...
  }
  }

  Device (NVD0) // 1st NVDIMM Device
  {
  Name(_ADR, h0)
  Method (_DSM, ...) {
  ...
  }
  }

  Device (NVD1) // 2nd NVDIMM Device
  {
  Name(_ADR, h1)
  Method (_DSM, ...) {
  ...
  }
  }

  ...
  }

I had ported QEMU's AML builder code as well as NVDIMM ACPI building
code to hvmloader and it did work, but then there was too much
duplicated vNVDIMM code between QEMU and hvmloader.
Therefore, I prefer to let QEMU, which emulates the vNVDIMM devices,
build those tables, as in Andrew and Jan's replies.
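
(For readers who have not looked at that builder: below is a rough sketch of
how the per-vNVDIMM devices in the sample above can be emitted at runtime
with QEMU-style aml_*() helpers. Exact signatures differ across QEMU
versions, and nvdimm_count / build_dsm_method() are placeholders, so treat
this as an illustration rather than the actual QEMU or hvmloader code.)

  #include "hw/acpi/aml-build.h"   /* Aml, aml_device(), aml_append(), ... */

  /* Emit the NVDR root device plus one NVDx device per configured vNVDIMM,
   * mirroring the SSDT layout shown above (assumes fewer than 10 DIMMs so
   * the 4-character ACPI names stay valid). */
  static void build_nvdimm_ssdt_body(Aml *ssdt, unsigned nvdimm_count)
  {
      Aml *sb = aml_scope("\\_SB");
      Aml *root = aml_device("NVDR");
      unsigned i;

      aml_append(root, aml_name_decl("_HID", aml_string("ACPI0012")));
      aml_append(root, build_dsm_method());     /* root _DSM/_FIT, placeholder */
      aml_append(sb, root);

      for (i = 0; i < nvdimm_count; i++) {
          Aml *dev = aml_device("NVD%u", i);     /* NVD0, NVD1, ... */
          /* handle for the i-th DIMM; exact numbering is an assumption */
          aml_append(dev, aml_name_decl("_ADR", aml_int(i + 1)));
          aml_append(dev, build_dsm_method());   /* per-device _DSM, placeholder */
          aml_append(sb, dev);                   /* siblings of NVDR, as in the sample */
      }
      aml_append(ssdt, sb);
  }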

Thanks,
Haozhong

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-04 Thread Stefano Stabellini
On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> On 02/03/16 15:22, Stefano Stabellini wrote:
> > On Wed, 3 Feb 2016, George Dunlap wrote:
> > > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > > >> those files to QEMU. In this way, non-root QEMU should be able to
> > > >> mmap those files.
> > > >
> > > > Maybe that would work. Worth adding it to the design, I would like to
> > > > read more details on it.
> > > >
> > > > Also note that QEMU initially runs as root but drops privileges to
> > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > > wouldn't work for any devices that need to be mmap'ed at run time
> > > > (hotplug scenario).
> > >
> > > This is basically the same problem we have for a bunch of other things,
> > > right?  Having xl open a file and then pass it via qmp to qemu should
> > > work in theory, right?
> >
> > Is there one /dev/pmem? per assignable region?
> 
> Yes.
> 
> BTW, I'm wondering whether and how non-root qemu works with xl disk
> configuration that is going to access a host block device, e.g.
>  disk = [ '/dev/sdb,,hda' ]
> If that works with non-root qemu, I may take the similar solution for
> pmem.
 
Today the user is required to give the correct ownership and access mode
to the block device, so that non-root QEMU can open it. However in the
case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
the feature doesn't work at all with non-root QEMU
(http://marc.info/?l=xen-devel&m=145261763600528).

If there is one /dev/pmem device per assignable region, then it would be
conceivable to change its ownership so that non-root QEMU can open it.
Or, better, the file descriptor could be passed by the toolstack via
qmp.
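
(For reference, the fd-over-QMP plumbing already exists: the toolstack opens
the file, then sends a "getfd" command over the QMP UNIX socket with the
descriptor attached as SCM_RIGHTS ancillary data, and QEMU can later use it
by name without ever being able to open the path itself. A minimal sketch of
the sending side, with the command string and fd name purely illustrative:)

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Send a QMP command such as
   *   {"execute":"getfd","arguments":{"fdname":"pmem0"}}
   * over the connected QMP UNIX socket, attaching fd as ancillary data. */
  static int qmp_send_fd(int qmp_sock, int fd, const char *cmd)
  {
      struct iovec iov = { .iov_base = (void *)cmd, .iov_len = strlen(cmd) };
      union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
      };
      struct cmsghdr *cmsg;

      memset(u.buf, 0, sizeof(u.buf));
      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type  = SCM_RIGHTS;             /* pass the descriptor itself */
      cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

      return sendmsg(qmp_sock, &msg, 0) < 0 ? -1 : 0;
  }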

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Jan Beulich
>>> On 03.02.16 at 08:00,  wrote:
> On 02/02/16 17:11, Stefano Stabellini wrote:
>> Once upon a time somebody made the decision that ACPI tables
>> on Xen should be static and included in hvmloader. That might have been
>> a bad decision but at least it was coherent. Loading only *some* tables
>> from QEMU, but not others, it feels like an incomplete design to me.
>>
>> For example, QEMU is currently in charge of emulating the PCI bus, why
>> shouldn't it be QEMU that generates the PRT and MCFG?
>>
> 
> To Keir, Jan and Andrew:
> 
> Are there anything related to ACPI that must be done (or are better to
> be done) in hvmloader?

Some of the static tables (FADT and HPET come to mind) likely would
better continue to live in hvmloader. MCFG (for example) coming from
qemu, otoh, would be quite natural (and would finally allow MMCFG
support for guests in the first place). I.e. ...

>> I prefer switching to QEMU building all ACPI tables for devices that it
>> is emulating. However this alternative is good too because it is
>> coherent with the current design.
> 
> I would prefer to this one if the final conclusion is that only one
> agent should be allowed to build guest ACPI. As I said above, it looks
> like a big change to switch to QEMU for all ACPI tables and I'm afraid
> it would break some existing guests. 

... I indeed think that tables should come from qemu for components
living in qemu, and from hvmloader for components coming from Xen.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Jan Beulich
>>> On 03.02.16 at 13:22,  wrote:
> On 02/03/16 02:18, Jan Beulich wrote:
>> >>> On 03.02.16 at 09:28,  wrote:
>> > On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
>> >> > 3.1 Guest clwb/clflushopt/pcommit Enabling
>> >> > 
>> >> >  The instruction enabling is simple and we do the same work as in 
> KVM/QEMU.
>> >> >  - All three instructions are exposed to guest via guest cpuid.
>> >> >  - L1 guest pcommit is never intercepted by Xen.
>> >> 
>> >> I wish there was some watermarks like the PLE has.
>> >> 
>> >> My fear is that an unfriendly guest can issue sfence all day long
>> >> flushing out other guests MMC queue (the writes followed by pcommits).
>> >> Which means that an guest may have degraded performance as their
>> >> memory writes are being flushed out immediately as if they were
>> >> being written to UC instead of WB memory. 
>> > 
>> > pcommit takes no parameter and it seems hard to solve this problem
>> > from hardware for now. And the current VMX does not provide mechanism
>> > to limit the commit rate of pcommit like PLE for pause.
>> > 
>> >> In other words - the NVDIMM resource does not provide any resource
>> >> isolation. However this may not be any different than what we had
>> >> nowadays with CPU caches.
>> >>
>> > 
>> > Does Xen have any mechanism to isolate multiple guests' operations on
>> > CPU caches?
>> 
>> No. All it does is disallow wbinvd for guests not controlling any
>> actual hardware. Perhaps pcommit should at least be limited in
>> a similar way?
>>
> 
> But pcommit is a must that makes writes be persistent on pmem. I'll
> look at how guest wbinvd is limited in Xen.

But we could intercept it on guests _not_ supposed to use it, in order
to simply drop it on the floor.

> Any functions suggested, vmx_wbinvd_intercept()?

A good example, yes.

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Haozhong Zhang
On 02/03/16 12:02, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > On 02/02/16 17:11, Stefano Stabellini wrote:
> > > On Mon, 1 Feb 2016, Haozhong Zhang wrote:
[...]
> > > >  This design treats host NVDIMM devices as ordinary MMIO devices:
> > > >  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
> > > >  and drive host NVDIMM devices (implementing block device
> > > >  interface). Namespaces and file systems on host NVDIMM devices
> > > >  are handled by Dom0 Linux as well.
> > > > 
> > > >  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
> > > >  virtual address space (buf).
> > > > 
> > > >  (3) QEMU gets the host physical address of buf, i.e. the host system
> > > >  physical address that is occupied by /dev/pmem0, and calls Xen
> > > >  hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
> > > 
> > > How is this going to work from a security perspective? Is it going to
> > > require running QEMU as root in Dom0, which will prevent NVDIMM from
> > > working by default on Xen? If so, what's the plan?
> > >
> > 
> > Oh, I forgot to address the non-root qemu issues in this design ...
> > 
> > The default user:group of /dev/pmem0 is root:disk, and its permission
> > is rw-rw----. We could lift the others permission to rw, so that
> > non-root QEMU can mmap /dev/pmem0. But it looks too risky.
> 
> Yep, too risky.
> 
> 
> > Or, we can make a file system on /dev/pmem0, create files on it, set
> > the owner of those files to xen-qemuuser-domid$domid, and then pass
> > those files to QEMU. In this way, non-root QEMU should be able to
> > mmap those files.
> 
> Maybe that would work. Worth adding it to the design, I would like to
> read more details on it.
> 
> Also note that QEMU initially runs as root but drops privileges to
> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> *could* mmap /dev/pmem0 while is still running as root, but then it
> wouldn't work for any devices that need to be mmap'ed at run time
> (hotplug scenario).
>

Thanks for this information. I'll test some experimental code and then
post a design to address the non-root qemu issue.

> 
> > > >  (ACPI part is described in Section 3.3 later)
> > > > 
> > > >  Above (1)(2) have already been done in current QEMU. Only (3) is
> > > >  needed to implement in QEMU. No change is needed in Xen for address
> > > >  mapping in this design.
> > > > 
> > > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > > >get the physical address from a virtual address.
> > > >/proc//pagemap provides information of mapping from
> > > >VA to PA. Is it an acceptable solution to let QEMU parse this
> > > >file to get the physical address?
> > > 
> > > Does it work in a non-root scenario?
> > >
> > 
> > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
> > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> > | Reason: information about PFNs helps in exploiting Rowhammer 
> > vulnerability.
> >
> > A possible alternative is to add a new hypercall similar to
> > XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> > parameter and translating to machine address in the hypervisor.
> 
> That might work.
> 
> 
> > > >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> > > >occupied by pmem at the beginning, i.e. QEMU may not be able to
> > > >get all SPA of pmem from buf (in virtual address space) when
> > > >calling XEN_DOMCTL_memory_mapping.
> > > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > > >entire pmem being mmaped?
> > > 
> > > Ditto
> > >
> > 
> > No. If I take the above alternative for the first open, maybe the new
> > hypercall above can inject page faults into dom0 for the unmapped
> > virtual address so as to enforce dom0 Linux to create the page
> > mapping.
> 
> Otherwise you need to use something like the mapcache in QEMU
> (xen-mapcache.c), which admittedly, given its complexity, would be best
> to avoid.
>

Definitely not mapcache like things. What I want is something similar to
what emulate_gva_to_mfn() in Xen does.

[...]
> > > If we start asking QEMU to build ACPI tables, why should we stop at NFIT
> > > and SSDT?
> > 
> > for easing my development of supporting vNVDIMM in Xen ... I mean
> > NFIT and SSDT are the only two tables needed for this purpose and I'm
> > afraid to break exiting guests if I completely switch to QEMU for
> > guest ACPI tables.
> 
> I realize that my words have been a bit confusing. Not /all/ ACPI
> tables, just all the tables regarding devices for which QEMU is in
> charge (the PCI bus and all devices behind it). Anything related to cpus
> and memory (FADT, MADT, etc) 

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Haozhong Zhang
On 02/03/16 02:18, Jan Beulich wrote:
> >>> On 03.02.16 at 09:28,  wrote:
> > On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
> >> > 3.1 Guest clwb/clflushopt/pcommit Enabling
> >> > 
> >> >  The instruction enabling is simple and we do the same work as in 
> >> > KVM/QEMU.
> >> >  - All three instructions are exposed to guest via guest cpuid.
> >> >  - L1 guest pcommit is never intercepted by Xen.
> >> 
> >> I wish there was some watermarks like the PLE has.
> >> 
> >> My fear is that an unfriendly guest can issue sfence all day long
> >> flushing out other guests MMC queue (the writes followed by pcommits).
> >> Which means that an guest may have degraded performance as their
> >> memory writes are being flushed out immediately as if they were
> >> being written to UC instead of WB memory. 
> > 
> > pcommit takes no parameter and it seems hard to solve this problem
> > from hardware for now. And the current VMX does not provide mechanism
> > to limit the commit rate of pcommit like PLE for pause.
> > 
> >> In other words - the NVDIMM resource does not provide any resource
> >> isolation. However this may not be any different than what we had
> >> nowadays with CPU caches.
> >>
> > 
> > Does Xen have any mechanism to isolate multiple guests' operations on
> > CPU caches?
> 
> No. All it does is disallow wbinvd for guests not controlling any
> actual hardware. Perhaps pcommit should at least be limited in
> a similar way?
>

But pcommit is a must that makes writes be persistent on pmem. I'll
look at how guest wbinvd is limited in Xen. Any functions suggested,
vmx_wbinvd_intercept()?

Thanks,
Haozhong

> >> >  - L1 hypervisor is allowed to intercept L2 guest pcommit.
> >> 
> >> clwb?
> > 
> > VMX is not capable to intercept clwb. Any reason to intercept it?
> 
> I don't think so - otherwise normal memory writes might also need
> intercepting. Bus bandwidth simply is shared (and CLWB operates
> on a guest virtual address, so can only cause bus traffic for cache
> lines the guest has managed to dirty).
> 
> Jan
> 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Haozhong Zhang
On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
> > 3. Design of vNVDIMM in Xen
> 
> Thank you for this design!
> 
> > 
> >  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
> >  three parts:
> >  (1) Guest clwb/clflushopt/pcommit enabling,
> >  (2) Memory mapping, and
> >  (3) Guest ACPI emulation.
> 
> 
> .. MCE? and vMCE?
>

Specifications on my hand seem not mention much about MCE for NVDIMM,
but I remember that NVDIMM driver in Linux kernel does have MCE
code. I'll have a look at that code and add this part later.

> > 
> >  The rest of this section present the design of each part
> >  respectively. The basic design principle to reuse existing code in
> >  Linux NVDIMM driver and QEMU as much as possible. As recent
> >  discussions in the both Xen and QEMU mailing lists for the v1 patch
> >  series, alternative designs are also listed below.
> > 
> > 
> > 3.1 Guest clwb/clflushopt/pcommit Enabling
> > 
> >  The instruction enabling is simple and we do the same work as in KVM/QEMU.
> >  - All three instructions are exposed to guest via guest cpuid.
> >  - L1 guest pcommit is never intercepted by Xen.
> 
> I wish there was some watermarks like the PLE has.
> 
> My fear is that an unfriendly guest can issue sfence all day long
> flushing out other guests MMC queue (the writes followed by pcommits).
> Which means that an guest may have degraded performance as their
> memory writes are being flushed out immediately as if they were
> being written to UC instead of WB memory. 
>

pcommit takes no parameter and it seems hard to solve this problem
from hardware for now. And the current VMX does not provide mechanism
to limit the commit rate of pcommit like PLE for pause.

> In other words - the NVDIMM resource does not provide any resource
> isolation. However this may not be any different than what we had
> nowadays with CPU caches.
>

Does Xen have any mechanism to isolate multiple guests' operations on
CPU caches?

> 
> >  - L1 hypervisor is allowed to intercept L2 guest pcommit.
> 
> clwb?
>

VMX is not capable to intercept clwb. Any reason to intercept it?

> > 
> > 
> > 3.2 Address Mapping
> > 
> > 3.2.1 My Design
> > 
> >  The overview of this design is shown in the following figure.
> > 
> >  [Figure: ASCII-art overview of the address mapping design; the drawing
> >   did not survive the archive. It shows QEMU in Dom0 mmap(2)-ing
> >   /dev/pmem0 (driven by the Dom0 Linux NVDIMM driver) into its virtual
> >   address space, along with the label storage area; QEMU emulates
> >   vACPI/v_DSM; the pmem SPA range is mapped into the DomU via
> >   XEN_DOMCTL_memory_mapping; guest ACPI/_DSM come from hvmloader/xl;
> >   Xen sits between Dom0/DomU and the host NVDIMM hardware.]
> > 
> > 
> >  This design treats host NVDIMM devices as ordinary MMIO devices:
> 
> Nice.
> 
> But it also means you need Xen to 'share' the ranges of an MMIO device.
> 
> That is you may need dom0 _DSM method to access certain ranges
> (the AML code may need to poke there) - and the guest may want to access
> those as well.
>

Currently, we 

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Jan Beulich
>>> On 03.02.16 at 09:28,  wrote:
> On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
>> > 3.1 Guest clwb/clflushopt/pcommit Enabling
>> > 
>> >  The instruction enabling is simple and we do the same work as in KVM/QEMU.
>> >  - All three instructions are exposed to guest via guest cpuid.
>> >  - L1 guest pcommit is never intercepted by Xen.
>> 
>> I wish there was some watermarks like the PLE has.
>> 
>> My fear is that an unfriendly guest can issue sfence all day long
>> flushing out other guests MMC queue (the writes followed by pcommits).
>> Which means that an guest may have degraded performance as their
>> memory writes are being flushed out immediately as if they were
>> being written to UC instead of WB memory. 
> 
> pcommit takes no parameter and it seems hard to solve this problem
> from hardware for now. And the current VMX does not provide mechanism
> to limit the commit rate of pcommit like PLE for pause.
> 
>> In other words - the NVDIMM resource does not provide any resource
>> isolation. However this may not be any different than what we had
>> nowadays with CPU caches.
>>
> 
> Does Xen have any mechanism to isolate multiple guests' operations on
> CPU caches?

No. All it does is disallow wbinvd for guests not controlling any
actual hardware. Perhaps pcommit should at least be limited in
a similar way?

>> >  - L1 hypervisor is allowed to intercept L2 guest pcommit.
>> 
>> clwb?
> 
> VMX is not capable to intercept clwb. Any reason to intercept it?

I don't think so - otherwise normal memory writes might also need
intercepting. Bus bandwidth simply is shared (and CLWB operates
on a guest virtual address, so can only cause bus traffic for cache
lines the guest has managed to dirty).

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Andrew Cooper
On 03/02/16 09:18, Jan Beulich wrote:
>>
>>> In other words - the NVDIMM resource does not provide any resource
>>> isolation. However this may not be any different than what we had
>>> nowadays with CPU caches.
>>>
>> Does Xen have any mechanism to isolate multiple guests' operations on
>> CPU caches?
> No.

PSR Cache Allocation is supported in Xen 4.6 on supporting hardware, so
the administrator can partition guests if necessary.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Jan Beulich
>>> On 03.02.16 at 15:30,  wrote:
> On 03/02/16 09:18, Jan Beulich wrote:
>>>
 In other words - the NVDIMM resource does not provide any resource
 isolation. However this may not be any different than what we had
 nowadays with CPU caches.

>>> Does Xen have any mechanism to isolate multiple guests' operations on
>>> CPU caches?
>> No.
> 
> PSR Cache Allocation is supported in Xen 4.6 on supporting hardware, so
> the administrator can partition guests if necessary.

And if the hardware supports it (which for a while might be more the
exception than the rule).

Jan


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread George Dunlap
On 03/02/16 12:02, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>> Or, we can make a file system on /dev/pmem0, create files on it, set
>> the owner of those files to xen-qemuuser-domid$domid, and then pass
>> those files to QEMU. In this way, non-root QEMU should be able to
>> mmap those files.
> 
> Maybe that would work. Worth adding it to the design, I would like to
> read more details on it.
> 
> Also note that QEMU initially runs as root but drops privileges to
> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> *could* mmap /dev/pmem0 while is still running as root, but then it
> wouldn't work for any devices that need to be mmap'ed at run time
> (hotplug scenario).

This is basically the same problem we have for a bunch of other things,
right?  Having xl open a file and then pass it via qmp to qemu should
work in theory, right?

 -George


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Stefano Stabellini
On Wed, 3 Feb 2016, George Dunlap wrote:
> On 03/02/16 12:02, Stefano Stabellini wrote:
> > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> >> Or, we can make a file system on /dev/pmem0, create files on it, set
> >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> >> those files to QEMU. In this way, non-root QEMU should be able to
> >> mmap those files.
> > 
> > Maybe that would work. Worth adding it to the design, I would like to
> > read more details on it.
> > 
> > Also note that QEMU initially runs as root but drops privileges to
> > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > *could* mmap /dev/pmem0 while is still running as root, but then it
> > wouldn't work for any devices that need to be mmap'ed at run time
> > (hotplug scenario).
> 
> This is basically the same problem we have for a bunch of other things,
> right?  Having xl open a file and then pass it via qmp to qemu should
> work in theory, right?

Is there one /dev/pmem? per assignable region? Otherwise it wouldn't be
safe.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Andrew Cooper
On 03/02/16 13:11, Haozhong Zhang wrote:
> On 02/03/16 12:02, Stefano Stabellini wrote:
>> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
>>> On 02/02/16 17:11, Stefano Stabellini wrote:
 On Mon, 1 Feb 2016, Haozhong Zhang wrote:
> [...]
>  This design treats host NVDIMM devices as ordinary MMIO devices:
>  (1) Dom0 Linux NVDIMM driver is responsible to detect (through NFIT)
>  and drive host NVDIMM devices (implementing block device
>  interface). Namespaces and file systems on host NVDIMM devices
>  are handled by Dom0 Linux as well.
>
>  (2) QEMU mmap(2) the pmem NVDIMM devices (/dev/pmem0) into its
>  virtual address space (buf).
>
>  (3) QEMU gets the host physical address of buf, i.e. the host system
>  physical address that is occupied by /dev/pmem0, and calls Xen
>  hypercall XEN_DOMCTL_memory_mapping to map it to a DomU.
 How is this going to work from a security perspective? Is it going to
 require running QEMU as root in Dom0, which will prevent NVDIMM from
 working by default on Xen? If so, what's the plan?

>>> Oh, I forgot to address the non-root qemu issues in this design ...
>>>
>>> The default user:group of /dev/pmem0 is root:disk, and its permission
> >>> is rw-rw----. We could lift the others permission to rw, so that
>>> non-root QEMU can mmap /dev/pmem0. But it looks too risky.
>> Yep, too risky.
>>
>>
>>> Or, we can make a file system on /dev/pmem0, create files on it, set
>>> the owner of those files to xen-qemuuser-domid$domid, and then pass
>>> those files to QEMU. In this way, non-root QEMU should be able to
>>> mmap those files.
>> Maybe that would work. Worth adding it to the design, I would like to
>> read more details on it.
>>
>> Also note that QEMU initially runs as root but drops privileges to
>> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
>> *could* mmap /dev/pmem0 while is still running as root, but then it
>> wouldn't work for any devices that need to be mmap'ed at run time
>> (hotplug scenario).
>>
> Thanks for this information. I'll test some experimental code and then
> post a design to address the non-root qemu issue.
>
>  (ACPI part is described in Section 3.3 later)
>
>  Above (1)(2) have already been done in current QEMU. Only (3) is
>  needed to implement in QEMU. No change is needed in Xen for address
>  mapping in this design.
>
>  Open: It seems no system call/ioctl is provided by Linux kernel to
>get the physical address from a virtual address.
>/proc//pagemap provides information of mapping from
>VA to PA. Is it an acceptable solution to let QEMU parse this
>file to get the physical address?
 Does it work in a non-root scenario?

>>> Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
>>> | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
>>> | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
>>> | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
>>> | Reason: information about PFNs helps in exploiting Rowhammer 
>>> vulnerability.
>>>
>>> A possible alternative is to add a new hypercall similar to
>>> XEN_DOMCTL_memory_mapping but receiving virtual address as the address
>>> parameter and translating to machine address in the hypervisor.
>> That might work.
>>
>>
>  Open: For a large pmem, mmap(2) is very possible to not map all SPA
>occupied by pmem at the beginning, i.e. QEMU may not be able to
>get all SPA of pmem from buf (in virtual address space) when
>calling XEN_DOMCTL_memory_mapping.
>Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
>entire pmem being mmaped?
 Ditto

>>> No. If I take the above alternative for the first open, maybe the new
>>> hypercall above can inject page faults into dom0 for the unmapped
>>> virtual address so as to enforce dom0 Linux to create the page
>>> mapping.
>> Otherwise you need to use something like the mapcache in QEMU
>> (xen-mapcache.c), which admittedly, given its complexity, would be best
>> to avoid.
>>
> Definitely not mapcache like things. What I want is something similar to
> what emulate_gva_to_mfn() in Xen does.

Please not quite like that.  It would restrict this to only working in a
PV dom0.

MFNs are an implementation detail.  Interfaces should take GFNs which
are consistent logical meaning between PV and HVM domains.

As an introduction,
http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/xen/mm.h;h=a795dd6001eff7c5dd942bbaf153e3efa5202318;hb=refs/heads/staging#l8

We also need to consider the Xen side security.  Currently a domain may
be given privilege to map an MMIO range.  IIRC, this allows the emulator
domain to make mappings for the guest, and for the guest to make
mappings itself.  With PMEM, we can't allow a domain to make 

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread George Dunlap
On 03/02/16 15:22, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, George Dunlap wrote:
>> On 03/02/16 12:02, Stefano Stabellini wrote:
>>> On Wed, 3 Feb 2016, Haozhong Zhang wrote:
 Or, we can make a file system on /dev/pmem0, create files on it, set
 the owner of those files to xen-qemuuser-domid$domid, and then pass
 those files to QEMU. In this way, non-root QEMU should be able to
 mmap those files.
>>>
>>> Maybe that would work. Worth adding it to the design, I would like to
>>> read more details on it.
>>>
>>> Also note that QEMU initially runs as root but drops privileges to
>>> xen-qemuuser-domid$domid before the guest is started. Initially QEMU
>>> *could* mmap /dev/pmem0 while is still running as root, but then it
>>> wouldn't work for any devices that need to be mmap'ed at run time
>>> (hotplug scenario).
>>
>> This is basically the same problem we have for a bunch of other things,
>> right?  Having xl open a file and then pass it via qmp to qemu should
>> work in theory, right?
> 
> Is there one /dev/pmem? per assignable region? Otherwise it wouldn't be
> safe.

If I understood Haozhong's description right, you'd be passing through
the entirety of one thing that Linux gave you.  At the moment that's one
/dev/pmemX, which at the moment corresponds to one region as specified
in the ACPI tables.  I understood his design going forward to mean that
it would rely on Linux to do any further partitioning within regions if
that was desired; in which case there would again be a single file that
qemu would have access to.

 -George

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Andrew Cooper
On 03/02/16 09:13, Jan Beulich wrote:
 On 03.02.16 at 08:00,  wrote:
>> On 02/02/16 17:11, Stefano Stabellini wrote:
>>> Once upon a time somebody made the decision that ACPI tables
>>> on Xen should be static and included in hvmloader. That might have been
>>> a bad decision but at least it was coherent. Loading only *some* tables
>>> from QEMU, but not others, it feels like an incomplete design to me.
>>>
>>> For example, QEMU is currently in charge of emulating the PCI bus, why
>>> shouldn't it be QEMU that generates the PRT and MCFG?
>>>
>> To Keir, Jan and Andrew:
>>
>> Are there anything related to ACPI that must be done (or are better to
>> be done) in hvmloader?
> Some of the static tables (FADT and HPET come to mind) likely would
> better continue to live in hvmloader. MCFG (for example) coming from
> qemu, otoh, would be quite natural (and would finally allow MMCFG
> support for guests in the first place). I.e. ...
>
>>> I prefer switching to QEMU building all ACPI tables for devices that it
>>> is emulating. However this alternative is good too because it is
>>> coherent with the current design.
>> I would prefer to this one if the final conclusion is that only one
>> agent should be allowed to build guest ACPI. As I said above, it looks
>> like a big change to switch to QEMU for all ACPI tables and I'm afraid
>> it would break some existing guests. 
> ... I indeed think that tables should come from qemu for components
> living in qemu, and from hvmloader for components coming from Xen.

I agree.

There has to be a single entity responsible for collating the eventual
ACPI handed to the guest, and this is definitely HVMLoader.

However, it is correct that Qemu create the ACPI tables for the devices
it emulates for the guest.

We need to agree on a mechanism whereby each entity can provide their
own subset of the ACPI tables to HVMLoader, and have HVMLoader present
the final set properly to the VM.

There is an existing usecase of passing the Host SLIC table to a VM, for
OEM Versions of Windows.  I believe this is achieved with
HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
inflexible and could probably do with being made a little more generic.

~Andrew

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Haozhong Zhang
On 02/03/16 14:09, Andrew Cooper wrote:
> On 03/02/16 09:13, Jan Beulich wrote:
>  On 03.02.16 at 08:00,  wrote:
> >> On 02/02/16 17:11, Stefano Stabellini wrote:
> >>> Once upon a time somebody made the decision that ACPI tables
> >>> on Xen should be static and included in hvmloader. That might have been
> >>> a bad decision but at least it was coherent. Loading only *some* tables
> >>> from QEMU, but not others, it feels like an incomplete design to me.
> >>>
> >>> For example, QEMU is currently in charge of emulating the PCI bus, why
> >>> shouldn't it be QEMU that generates the PRT and MCFG?
> >>>
> >> To Keir, Jan and Andrew:
> >>
> >> Are there anything related to ACPI that must be done (or are better to
> >> be done) in hvmloader?
> > Some of the static tables (FADT and HPET come to mind) likely would
> > better continue to live in hvmloader. MCFG (for example) coming from
> > qemu, otoh, would be quite natural (and would finally allow MMCFG
> > support for guests in the first place). I.e. ...
> >
> >>> I prefer switching to QEMU building all ACPI tables for devices that it
> >>> is emulating. However this alternative is good too because it is
> >>> coherent with the current design.
> >> I would prefer to this one if the final conclusion is that only one
> >> agent should be allowed to build guest ACPI. As I said above, it looks
> >> like a big change to switch to QEMU for all ACPI tables and I'm afraid
> >> it would break some existing guests. 
> > ... I indeed think that tables should come from qemu for components
> > living in qemu, and from hvmloader for components coming from Xen.
> 
> I agree.
> 
> There has to be a single entity responsible for collating the eventual
> ACPI handed to the guest, and this is definitely HVMLoader.
> 
> However, it is correct that Qemu create the ACPI tables for the devices
> it emulates for the guest.
> 
> We need to agree on a mechanism whereby each entity can provide their
> own subset of the ACPI tables to HVMLoader, and have HVMLoader present
> the final set properly to the VM.
> 
> There is an existing usecase of passing the Host SLIC table to a VM, for
> OEM Versions of Windows.  I believe this is achieved with
> HVM_XS_ACPI_PT_{ADDRESS,LENGTH}, but that mechanism is a little
> inflexible and could probably do with being made a little more generic.
>

Yes, that is what one of my v1 patches does
([PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu).

It extends the existing construct_passthrough_tables() to get the
address and size of acpi tables from its parameters (a pair of
xenstore keys) rather than the hardcoded ones.
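
(The shape of that change is small; a sketch, with the key names and helper
names chosen for illustration rather than copied from the patch:)

  /* Sketch only: construct_passthrough_tables() gains two parameters so a
   * caller can point it at tables a device model published via xenstore,
   * instead of only the hardcoded SLIC keys. */
  static int construct_passthrough_tables(unsigned long *table_ptrs,
                                          int nr_tables,
                                          const char *addr_key,
                                          const char *length_key)
  {
      const char *s;
      unsigned long addr, length;

      if ( (s = xenstore_read(addr_key, NULL)) == NULL )
          return nr_tables;
      addr = strtoll(s, NULL, 0);
      if ( (s = xenstore_read(length_key, NULL)) == NULL )
          return nr_tables;
      length = strtoll(s, NULL, 0);

      /* ... existing body: copy each ACPI table found in [addr, addr+length)
       * into guest memory and append its address to table_ptrs[] ... */
      return nr_tables /* + number of tables appended */;
  }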

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Konrad Rzeszutek Wilk
On Wed, Feb 03, 2016 at 03:22:59PM +, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, George Dunlap wrote:
> > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > >> those files to QEMU. In this way, non-root QEMU should be able to
> > >> mmap those files.
> > > 
> > > Maybe that would work. Worth adding it to the design, I would like to
> > > read more details on it.
> > > 
> > > Also note that QEMU initially runs as root but drops privileges to
> > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > wouldn't work for any devices that need to be mmap'ed at run time
> > > (hotplug scenario).
> > 
> > This is basically the same problem we have for a bunch of other things,
> > right?  Having xl open a file and then pass it via qmp to qemu should
> > work in theory, right?
> 
> Is there one /dev/pmem? per assignable region? Otherwise it wouldn't be
> safe.

Can be - which may be interleaved on multiple NVDIMMs. But we would operate
on files (on the /dev/pmem which has a DAX-enabled filesystem).


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Haozhong Zhang
On 02/03/16 10:47, Konrad Rzeszutek Wilk wrote:
> > > > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > > > >get the physical address from a virtual address.
> > > > >/proc//pagemap provides information of mapping from
> > > > >VA to PA. Is it an acceptable solution to let QEMU parse this
> > > > >file to get the physical address?
> > > >
> > > > Does it work in a non-root scenario?
> > > >
> > >
> > > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> > > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get 
> > > PFNs.
> > > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> > > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> > > | Reason: information about PFNs helps in exploiting Rowhammer 
> > > vulnerability.
>
> Ah right.
> > >
> > > A possible alternative is to add a new hypercall similar to
> > > XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> > > parameter and translating to machine address in the hypervisor.
> >
> > That might work.
>
> That won't work.
>
> This is a userspace VMA - which means the once the ioctl is done we swap
> to kernel virtual addresses. Now we may know that the prior cr3 has the
> userspace virtual address and walk it down - but what if the domain
> that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere
> inside the kernel.
>
> Which means this hypercall would need to know the Linux kernel task structure
> to find this.
>

Thanks for pointing out this. Really it's not a workable solution.

> May I propose another solution - a stacking driver (similar to loop). You
> setup it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img 
> created).
> Then mmap the /dev/mapper/guest.img - all of the operations are the same - 
> except
> it may have an extra ioctl - get_pfns - which would provide the data in 
> similar
> form to pagemap.txt.
>

I'll have a look at this, thanks!

> But folks will then ask - why don't you just use pagemap? Could the pagemap
> have an extra security capability check? One that can be set for
> QEMU?
>

Basically for the concern on whether non-root QEMU could work as in
Stefano's comments.

> >
> >
> > > > >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> > > > >occupied by pmem at the beginning, i.e. QEMU may not be able to
> > > > >get all SPA of pmem from buf (in virtual address space) when
> > > > >calling XEN_DOMCTL_memory_mapping.
> > > > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > > > >entire pmem being mmaped?
> > > >
> > > > Ditto
> > > >
> > >
> > > No. If I take the above alternative for the first open, maybe the new
> > > hypercall above can inject page faults into dom0 for the unmapped
> > > virtual address so as to enforce dom0 Linux to create the page
> > > mapping.
>
> Ugh. That sounds hacky. And you wouldn't necessarily be safe.
> Imagine that the system admin decides to defrag the /dev/pmem filesystem.
> Or move the files (disk images) around. If they do that - we may
> still have the guest mapped to system addresses which may contain filesystem
> metadata now, or a different guest image. We MUST mlock or lock the file
> during the duration of the guest.
>
>

So mlocking or locking the mmaped file, or other ways to 'pin' the
mmaped file on pmem is a necessity.

Thanks,
Haozhong

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Haozhong Zhang
On 02/03/16 14:20, Andrew Cooper wrote:
> >  (ACPI part is described in Section 3.3 later)
> >
> >  Above (1)(2) have already been done in current QEMU. Only (3) is
> >  needed to implement in QEMU. No change is needed in Xen for address
> >  mapping in this design.
> >
> >  Open: It seems no system call/ioctl is provided by Linux kernel to
> >get the physical address from a virtual address.
> >/proc//pagemap provides information of mapping from
> >VA to PA. Is it an acceptable solution to let QEMU parse this
> >file to get the physical address?
>  Does it work in a non-root scenario?
> 
> >>> Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> >>> | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get 
> >>> PFNs.
> >>> | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> >>> | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> >>> | Reason: information about PFNs helps in exploiting Rowhammer 
> >>> vulnerability.
> >>>
> >>> A possible alternative is to add a new hypercall similar to
> >>> XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> >>> parameter and translating to machine address in the hypervisor.
> >> That might work.
> >>
> >>
> >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> >occupied by pmem at the beginning, i.e. QEMU may not be able to
> >get all SPA of pmem from buf (in virtual address space) when
> >calling XEN_DOMCTL_memory_mapping.
> >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> >entire pmem being mmaped?
>  Ditto
> 
> >>> No. If I take the above alternative for the first open, maybe the new
> >>> hypercall above can inject page faults into dom0 for the unmapped
> >>> virtual address so as to enforce dom0 Linux to create the page
> >>> mapping.
> >> Otherwise you need to use something like the mapcache in QEMU
> >> (xen-mapcache.c), which admittedly, given its complexity, would be best
> >> to avoid.
> >>
> > Definitely not mapcache like things. What I want is something similar to
> > what emulate_gva_to_mfn() in Xen does.
>
> Please not quite like that.  It would restrict this to only working in a
> PV dom0.
>
> MFNs are an implementation detail.

I don't get this point.
What do you mean by 'implementation detail'? Architectural differences?

> Interfaces should take GFNs which
> are consistent logical meaning between PV and HVM domains.
>
> As an introduction,
> http://xenbits.xen.org/gitweb/?p=xen.git;a=blob;f=xen/include/xen/mm.h;h=a795dd6001eff7c5dd942bbaf153e3efa5202318;hb=refs/heads/staging#l8
>
> We also need to consider the Xen side security.  Currently a domain may
> be given privilege to map an MMIO range.  IIRC, this allows the emulator
> domain to make mappings for the guest, and for the guest to make
> mappings itself.  With PMEM, we can't allow a domain to make mappings
> itself because it could end up mapping resources which belong to another
> domain.  We probably need an intermediate level which only permits an
> emulator to make the mappings.
>

agree, this hypercall should not be called by arbitrary domains. Any
existing mechanism in Xen to restrict callers of hypercalls?
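
(As a pointer: the existing XEN_DOMCTL_memory_mapping path is gated on the
per-domain I/O memory capabilities plus an XSM hook, roughly as in the sketch
below (written from memory, so treat the exact names as approximate). A pmem
variant could do the same against a pmem-specific rangeset so that only the
toolstack/emulator, not an arbitrary domain, can create mappings:)

  /* Rough shape of the existing check: both the caller and the target
   * domain must have been granted the frame range, and XSM gets a veto. */
  static int mapping_permitted(struct domain *d, unsigned long mfn,
                               unsigned long mfn_end, int add)
  {
      if ( !iomem_access_permitted(current->domain, mfn, mfn_end) ||
           !iomem_access_permitted(d, mfn, mfn_end) )
          return -EPERM;

      return xsm_iomem_mapping(XSM_HOOK, d, mfn, mfn_end, add);
  }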

> >
> > [...]
>  If we start asking QEMU to build ACPI tables, why should we stop at NFIT
>  and SSDT?
> >>> for easing my development of supporting vNVDIMM in Xen ... I mean
> >>> NFIT and SSDT are the only two tables needed for this purpose and I'm
> >>> afraid to break exiting guests if I completely switch to QEMU for
> >>> guest ACPI tables.
> >> I realize that my words have been a bit confusing. Not /all/ ACPI
> >> tables, just all the tables regarding devices for which QEMU is in
> >> charge (the PCI bus and all devices behind it). Anything related to cpus
> >> and memory (FADT, MADT, etc) would still be left to hvmloader.
> > OK, then it's clear for me. From Jan's reply, at least MCFG is from
> > QEMU. I'll look at whether other PCI related tables are also from QEMU
> > or similar to those in QEMU. If yes, then it looks reasonable to let
> > QEMU generate them.
>
> It is entirely likely that the current split of sources of APCI tables
> is incorrect.  We should also see what can be done about fixing that.
>

How about Jan's comment
| tables should come from qemu for components living in qemu, and from
| hvmloader for components coming from Xen

Thanks,
Haozhong

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Haozhong Zhang
On 02/03/16 15:22, Stefano Stabellini wrote:
> On Wed, 3 Feb 2016, George Dunlap wrote:
> > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > >> those files to QEMU. In this way, non-root QEMU should be able to
> > >> mmap those files.
> > >
> > > Maybe that would work. Worth adding it to the design, I would like to
> > > read more details on it.
> > >
> > > Also note that QEMU initially runs as root but drops privileges to
> > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > wouldn't work for any devices that need to be mmap'ed at run time
> > > (hotplug scenario).
> >
> > This is basically the same problem we have for a bunch of other things,
> > right?  Having xl open a file and then pass it via qmp to qemu should
> > work in theory, right?
>
> Is there one /dev/pmem? per assignable region?

Yes.

BTW, I'm wondering whether and how non-root qemu works with xl disk
configuration that is going to access a host block device, e.g.
 disk = [ '/dev/sdb,,hda' ]
If that works with non-root qemu, I may take the similar solution for
pmem.

Thanks,
Haozhong

> Otherwise it wouldn't be safe.

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Konrad Rzeszutek Wilk
> > > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > > >get the physical address from a virtual address.
> > > >/proc//pagemap provides information of mapping from
> > > >VA to PA. Is it an acceptable solution to let QEMU parse this
> > > >file to get the physical address?
> > > 
> > > Does it work in a non-root scenario?
> > >
> > 
> > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get PFNs.
> > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> > | Reason: information about PFNs helps in exploiting Rowhammer 
> > vulnerability.

Ah right.
> >
> > A possible alternative is to add a new hypercall similar to
> > XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> > parameter and translating to machine address in the hypervisor.
> 
> That might work.

That won't work.

This is a userspace VMA - which means the once the ioctl is done we swap
to kernel virtual addresses. Now we may know that the prior cr3 has the
userspace virtual address and walk it down - but what if the domain
that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere
inside the kernel.

Which means this hypercall would need to know the Linux kernel task structure
to find this.

May I propose another solution - a stacking driver (similar to loop). You
setup it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img 
created).
Then mmap the /dev/mapper/guest.img - all of the operations are the same - 
except
it may have an extra ioctl - get_pfns - which would provide the data in similar
form to pagemap.txt.
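
(To make the proposal concrete, the get_pfns interface could look something
like the following; this is entirely hypothetical, no such driver or ioctl
exists today:)

  /* Hypothetical UAPI for the proposed stacking driver. */
  #include <linux/ioctl.h>
  #include <linux/types.h>

  struct pmem_stack_get_pfns {
      __u64 offset;      /* byte offset into the backing file */
      __u64 nr_pages;    /* number of pages to translate */
      __u64 pfns_ptr;    /* user pointer to __u64[nr_pages], filled in */
  };

  #define PMEM_STACK_IOC_MAGIC  'P'     /* made-up magic number */
  #define PMEM_STACK_GET_PFNS   _IOWR(PMEM_STACK_IOC_MAGIC, 1, \
                                      struct pmem_stack_get_pfns)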

But folks will then ask - why don't you just use pagemap? Could the pagemap
have an extra security capability check? One that can be set for
QEMU?
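
(For comparison, the pagemap route needs very little code on the QEMU side,
modulo the CAP_SYS_ADMIN requirement quoted above; a minimal sketch:)

  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Translate a virtual address of the calling process to a PFN via
   * /proc/self/pagemap.  Returns 0 on failure; since Linux 4.0/4.2 the PFN
   * field is only available (non-zero) with CAP_SYS_ADMIN. */
  static uint64_t va_to_pfn(const void *va)
  {
      long psize = sysconf(_SC_PAGESIZE);
      int fd = open("/proc/self/pagemap", O_RDONLY);
      uint64_t entry = 0;
      off_t off = ((uintptr_t)va / psize) * sizeof(entry);

      if (fd < 0)
          return 0;
      if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
          entry = 0;
      close(fd);

      if (!(entry & (1ULL << 63)))              /* bit 63: page present */
          return 0;
      return entry & ((1ULL << 55) - 1);        /* bits 0-54: PFN */
  }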

> 
> 
> > > >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> > > >occupied by pmem at the beginning, i.e. QEMU may not be able to
> > > >get all SPA of pmem from buf (in virtual address space) when
> > > >calling XEN_DOMCTL_memory_mapping.
> > > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > > >entire pmem being mmaped?
> > > 
> > > Ditto
> > >
> > 
> > No. If I take the above alternative for the first open, maybe the new
> > hypercall above can inject page faults into dom0 for the unmapped
> > virtual address so as to enforce dom0 Linux to create the page
> > mapping.

Ugh. That sounds hacky. And you wouldn't necessarily be safe.
Imagine that the system admin decides to defrag the /dev/pmem filesystem.
Or move the files (disk images) around. If they do that - we may
still have the guest mapped to system addresses which may contain filesystem
metadata now, or a different guest image. We MUST mlock or lock the file
during the duration of the guest.


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-03 Thread Haozhong Zhang
On 02/03/16 05:38, Jan Beulich wrote:
> >>> On 03.02.16 at 13:22,  wrote:
> > On 02/03/16 02:18, Jan Beulich wrote:
> >> >>> On 03.02.16 at 09:28,  wrote:
> >> > On 02/02/16 14:15, Konrad Rzeszutek Wilk wrote:
> >> >> > 3.1 Guest clwb/clflushopt/pcommit Enabling
> >> >> > 
> >> >> >  The instruction enabling is simple and we do the same work as in 
> > KVM/QEMU.
> >> >> >  - All three instructions are exposed to guest via guest cpuid.
> >> >> >  - L1 guest pcommit is never intercepted by Xen.
> >> >> 
> >> >> I wish there was some watermarks like the PLE has.
> >> >> 
> >> >> My fear is that an unfriendly guest can issue sfence all day long
> >> >> flushing out other guests MMC queue (the writes followed by pcommits).
> >> >> Which means that an guest may have degraded performance as their
> >> >> memory writes are being flushed out immediately as if they were
> >> >> being written to UC instead of WB memory. 
> >> > 
> >> > pcommit takes no parameter and it seems hard to solve this problem
> >> > from hardware for now. And the current VMX does not provide mechanism
> >> > to limit the commit rate of pcommit like PLE for pause.
> >> > 
> >> >> In other words - the NVDIMM resource does not provide any resource
> >> >> isolation. However this may not be any different than what we had
> >> >> nowadays with CPU caches.
> >> >>
> >> > 
> >> > Does Xen have any mechanism to isolate multiple guests' operations on
> >> > CPU caches?
> >> 
> >> No. All it does is disallow wbinvd for guests not controlling any
> >> actual hardware. Perhaps pcommit should at least be limited in
> >> a similar way?
> >>
> > 
> > But pcommit is a must that makes writes be persistent on pmem. I'll
> > look at how guest wbinvd is limited in Xen.
> 
> But we could intercept it on guests _not_ supposed to use it, in order
> to simply drop it on the floor.
>

Oh yes! We can drop pcommit from domains not having access to host
NVDIMM, just like vmx_wbinvd_intercept() dropping wbinvd from domains
not accessing host iomem and ioport.
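
(A sketch of what that could look like on the VMX exit path; this is
hypothetical code: PCOMMIT exiting would only be enabled for domains with no
host pmem assigned, the exit-reason constant is assumed, and
update_guest_eip() is the usual vmx.c helper that advances RIP past the
intercepted instruction:)

  /* Modelled on vmx_wbinvd_intercept(): when this runs, the domain has no
   * host NVDIMM behind its writes, so PCOMMIT is a no-op and can simply be
   * dropped on the floor. */
  static void vmx_pcommit_intercept(void)
  {
      update_guest_eip();    /* skip the intercepted PCOMMIT */
  }

  /* ... and in vmx_vmexit_handler():
   *     case EXIT_REASON_PCOMMIT:        // hypothetical constant
   *         vmx_pcommit_intercept();
   *         break;
   */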

> > Any functions suggested, vmx_wbinvd_intercept()?
> 
> A good example, yes.
> 
> Jan
> 

Thanks,
Haozhong

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-02 Thread Konrad Rzeszutek Wilk
> > 2.2 vNVDIMM Implementation in KVM/QEMU
> > 
> >  (1) Address Mapping
> > 
> >   As described before, the host Linux NVDIMM driver provides a block
> >   device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
> >   region. QEMU can then mmap(2) that device into its virtual address
> >   space (buf). QEMU is responsible for finding a proper guest physical
> >   address space range that is large enough to hold /dev/pmem0. Then
> >   QEMU passes the virtual address of the mmapped buf to the KVM API
> >   KVM_SET_USER_MEMORY_REGION, which maps in EPT the host physical
> >   address range of buf to the guest physical address space range where
> >   the virtual pmem device will be.
> > 
> >   In this way, all guest writes/reads on the virtual pmem device are
> >   applied directly to the host one.
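
For reference, a bare-bones sketch of that mmap + KVM_SET_USER_MEMORY_REGION
step; the device path, slot number and guest-physical address are illustrative,
and real QEMU goes through its memory-backend and NVDIMM device code rather
than raw ioctls:

#include <fcntl.h>
#include <linux/fs.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* vm_fd is a KVM VM file descriptor obtained elsewhere (KVM_CREATE_VM). */
static int map_pmem_into_guest(int vm_fd, uint64_t guest_phys_addr)
{
    int fd = open("/dev/pmem0", O_RDWR);              /* illustrative path */
    if (fd < 0) { perror("open"); return -1; }

    uint64_t size = 0;
    if (ioctl(fd, BLKGETSIZE64, &size) < 0) { perror("size"); return -1; }

    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return -1; }

    struct kvm_userspace_memory_region region = {
        .slot            = 1,                         /* illustrative slot */
        .flags           = 0,
        .guest_phys_addr = guest_phys_addr,
        .memory_size     = size,
        .userspace_addr  = (uint64_t)(uintptr_t)buf,
    };

    /* After this, guest accesses at guest_phys_addr go straight to the
     * host pmem pages backing buf (via EPT), as described above. */
    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0) {
        perror("KVM_SET_USER_MEMORY_REGION");
        return -1;
    }
    return 0;
}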
> > 
> >   Besides, above implementation also allows to back a virtual pmem
> >   device by a mmapped regular file or a piece of ordinary ram.
> 
> What's the point of backing pmem with ordinary ram? I can buy in to
> the value of the file-backed option, which, although slower, does sustain
> the persistency attribute. However, with the ram-backed method there is
> no persistency, so it violates the guest's expectation.

Containers - like the Intel Clear Containers? You can use this work
to stitch an exploded initramfs on a tmpfs right in the guest.
And you could do that for multiple guests.

Granted, this has nothing to do with pmem, but this work would allow
one to set up containers this way.



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-02 Thread Konrad Rzeszutek Wilk
> 3. Design of vNVDIMM in Xen

Thank you for this design!

> 
>  Similarly to that in KVM/QEMU, enabling vNVDIMM in Xen is composed of
>  three parts:
>  (1) Guest clwb/clflushopt/pcommit enabling,
>  (2) Memory mapping, and
>  (3) Guest ACPI emulation.


.. MCE? and vMCE?

> 
>  The rest of this section presents the design of each part
>  respectively. The basic design principle is to reuse existing code in
>  the Linux NVDIMM driver and QEMU as much as possible. Following recent
>  discussions on both the Xen and QEMU mailing lists about the v1 patch
>  series, alternative designs are also listed below.
> 
> 
> 3.1 Guest clwb/clflushopt/pcommit Enabling
> 
>  The instruction enabling is simple and we do the same work as in KVM/QEMU.
>  - All three instructions are exposed to guest via guest cpuid.
>  - L1 guest pcommit is never intercepted by Xen.
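
For the record, these are CPUID.(EAX=07H,ECX=0):EBX feature bits - bit 23 is
CLFLUSHOPT, bit 24 is CLWB, and bit 22 is PCOMMIT - so "exposed via guest
cpuid" means letting the guest see those bits in its virtualized leaf 7. A
host-side check, just as a sketch:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 7, subleaf 0. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 1;

    printf("clflushopt: %u\n", (ebx >> 23) & 1);
    printf("clwb:       %u\n", (ebx >> 24) & 1);
    printf("pcommit:    %u\n", (ebx >> 22) & 1);
    return 0;
}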

I wish there were some watermarks like the PLE has.

My fear is that an unfriendly guest can issue sfence all day long,
flushing out other guests' MMC queue (the writes followed by pcommits).
Which means that a guest may see degraded performance as its
memory writes are being flushed out immediately as if they were
being written to UC instead of WB memory.

In other words - the NVDIMM resource does not provide any resource
isolation. However, this may not be any different from what we have
today with CPU caches.


>  - L1 hypervisor is allowed to intercept L2 guest pcommit.

clwb?

> 
> 
> 3.2 Address Mapping
> 
> 3.2.1 My Design
> 
>  The overview of this design is shown in the following figure.
> 
>  [Figure: overview of the address-mapping design.  On the Dom0 side, QEMU
>   mmap(2)s the label storage area and /dev/pmem0 into its virtual address
>   space (buf) and emulates vACPI/v_DSM, while the Dom0 Linux NVDIMM driver
>   owns the SPA range backing /dev/pmem0.  XEN_DOMCTL_memory_mapping maps
>   that SPA range into the DomU, where it appears as the guest's pmem SPA
>   range and hvmloader/xl provide the guest ACPI/_DSM; Xen sits between both
>   domains and the host NVDIMM hardware.]
> 
> 
>  This design treats host NVDIMM devices as ordinary MMIO devices:

Nice.

But it also means you need Xen to 'share' the ranges of an MMIO device.

That is, the dom0 _DSM method may need to access certain ranges
(the AML code may need to poke there), and the guest may want to access
those as well.

And keep in mind that this NVDIMM management may not always need to be
in the initial domain. As in, you could have NVDIMM device drivers that
carve out the ranges to guests.

>  (1) The Dom0 Linux NVDIMM driver is responsible for detecting (through NFIT)
>  and driving host NVDIMM devices (implementing the block device
>  interface). Namespaces and file systems on host NVDIMM devices
>  are handled by Dom0 Linux as well.
> 
>  (2) QEMU mmap(2)s the pmem NVDIMM device (/dev/pmem0) into its
>  virtual address space (buf).
> 
>  (3) QEMU gets the host physical address of buf, i.e. the host system
>  physical address that is occupied by /dev/pmem0, and calls the Xen
>  hypercall XEN_DOMCTL_memory_mapping to map it into a DomU.
> 
>  (ACPI part is described in Section 3.3 later)
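
For concreteness, a sketch of what the XEN_DOMCTL_memory_mapping call in step
(3) could look like through libxenctrl. The frame numbers and domid are
placeholders, and how the caller learns the SPA (machine frame range) behind
the mmapped buf is one of the open issues in this thread:

#include <stdint.h>
#include <stdio.h>
#include <xenctrl.h>

static int map_pmem_to_domu(uint32_t domid,
                            unsigned long guest_gfn,  /* where DomU sees it */
                            unsigned long host_mfn,   /* SPA >> PAGE_SHIFT  */
                            unsigned long nr_frames)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    if (!xch) {
        fprintf(stderr, "cannot open xenctrl handle\n");
        return -1;
    }

    /* Final argument: 1 adds the mapping, 0 removes it. */
    int rc = xc_domain_memory_mapping(xch, domid, guest_gfn,
                                      host_mfn, nr_frames, 1);
    if (rc)
        fprintf(stderr, "XEN_DOMCTL_memory_mapping failed: %d\n", rc);

    xc_interface_close(xch);
    return rc;
}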
> 
>  Above (1)(2) 

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-02 Thread Tian, Kevin
> From: Zhang, Haozhong
> Sent: Tuesday, February 02, 2016 3:53 PM
> 
> On 02/02/16 15:48, Tian, Kevin wrote:
> > > From: Zhang, Haozhong
> > > Sent: Tuesday, February 02, 2016 3:39 PM
> > >
> > > > btw, how is persistency guaranteed in KVM/QEMU across guest
> > > > power off/on? I guess since the QEMU process is killed, the allocated pmem
> > > > will be freed, so you may switch to the file-backed method to keep
> > > > persistency (however, the copy would take time for a large pmem chunk). Or
> > > > will you find some way to keep pmem management separate from the QEMU
> > > > life-cycle (then pmem is not efficiently reused)?
> > > >
> > >
> > > It all depends on the guests themselves. The clwb/clflushopt/pcommit
> > > instructions are exposed to the guest, and the guest uses them to make
> > > its writes to pmem persistent.
> > >
> >
> > I meant from the guest's p.o.v, a range of pmem should be persistent
> > across VM power on/off, i.e. the content needs to be maintained
> > somewhere so the guest can get it at the next power on...
> >
> > Thanks
> > Kevin
> 
> It's just like what we do for guest disk: as long as we always assign
> the same host pmem device or the same files on file systems on a host
> pmem device to the guest, the guest can find its last data on pmem.
> 
> Haozhong

This is the detail I'd like to learn. If it's QEMU that requests
host pmem and then frees it on exit, the very same pmem may be
allocated to another process later. How do you achieve the 'as
long as'?

Thanks
Kevin


