Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-09 Thread Olaf Hering
On Wed, Jul 09, Vitaly Kuznetsov wrote:

> > Also I am wondering why it was not done as part of copy_oldmem_page()
> > so that respective arch could hide all the details.
> Afaiac that wouldn't solve the mmap issue I'm trying to address but we
> can ask Olaf why he preferred pfn_is_ram() path.

Every copy_oldmem_page would need to know about the pfn_is_ram function,
so I think it's better to keep that part of the code private to
fs/proc/vmcore.c.

Perhaps pfn_is_ram could be named pfn_is_backed_by_ram, but the comments
make it clear what the function does.
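
A minimal user-space sketch of the arrangement described here, where the
"is this pfn backed by RAM?" decision sits behind one private hook and
callers only see a wrapper. The names register_pfn_is_ram_hook,
unregister_pfn_is_ram_hook, pfn_is_backed_by_ram and
demo_hypervisor_pfn_is_ram are hypothetical and only modelled on the names
in this thread; this is not the code in fs/proc/vmcore.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type: returns non-zero if the pfn is real RAM. */
typedef int (*pfn_is_ram_fn)(unsigned long pfn);

/* Kept private to this file, like the hook discussed in the thread. */
static pfn_is_ram_fn pfn_is_ram_hook;

/* Returns 0 on success, -1 if a hook is already registered. */
int register_pfn_is_ram_hook(pfn_is_ram_fn fn)
{
	assert(fn != NULL);
	if (pfn_is_ram_hook)
		return -1;
	pfn_is_ram_hook = fn;
	return 0;
}

void unregister_pfn_is_ram_hook(void)
{
	pfn_is_ram_hook = NULL;
}

/* With no hook registered, every pfn is treated as RAM (bare-metal case). */
bool pfn_is_backed_by_ram(unsigned long pfn)
{
	if (!pfn_is_ram_hook)
		return true;
	return pfn_is_ram_hook(pfn) != 0;
}

/* Assumed hypervisor-side answer: pfns 5 and 9 are not RAM. */
int demo_hypervisor_pfn_is_ram(unsigned long pfn)
{
	return pfn != 5 && pfn != 9;
}
```

Because only the wrapper is visible to callers, a copy routine never needs
to know whether a hypervisor registered anything.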


Olaf
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-09 Thread David Vrabel
On 09/07/14 10:17, Vitaly Kuznetsov wrote:
> David Vrabel <david.vra...@citrix.com> writes:
> 
>> On 07/07/14 21:33, Andrew Morton wrote:
>>> On Mon,  7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov <vkuzn...@redhat.com> wrote:
>>>
>>>> we have a special check in read_vmcore() handler to check if the page was
>>>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>>>> vmcore is read with mmap() no such check is performed. That can lead to
>>>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>>>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>>>> enormous load in both DomU and Dom0.
>>
>> Does it make forward progress though?  Or is it ending up repeatedly
>> retrying the same instruction?
> 
> If memcpy is using the SSE2 optimization, the 16-byte 'movdqu' instruction
> never finishes (repeatedly retrying to issue two 8-byte requests to
> qemu-dm). qemu-dm decides that it's hitting 'Neither RAM nor known MMIO
> space' and returns 8 0xff bytes for both of these requests (I was testing
> with qemu-traditional).

Yes, the emulation of instructions with 16-byte operands is a bit
broken.  It should be fixed.

>> Is it failing on a ballooned page in a RAM region? Or is it mapping non-RAM
>> regions as well?
> 
> I wasn't using ballooning, it happens that oldmem has several (two in my
> test) pages which are HVMMEM_mmio_dm but qemu-dm considers them being
> neither ram nor mmio.

I think this would also happen with ballooned pages, which are also
not-present in the p2m and thus would show up as HVMMEM_mmio_dm type and
accesses will also be forwarded to qemu (qemu gets everything by default).

>>>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>>>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>>>> bare metal.
>>>>
>>>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>>>> xen-specific code, as remap_oldmem_pfn_range() was designed for.
>>>> That, however, would involve non-obvious xen code path for all x86 builds
>>>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>>>> code on x86 arch from doing the same override.
>>
>> The oldmem_pfn_is_ram() is Xen-specific but this problem (ballooned
>> pages) must be common to KVM.  How does KVM handle this?
> 
> As far as I'm concerned the issue was never hit with KVM. I *think* the
> issue has something to do with the conjunction of 16-byte 'movdqu'
> emulation for io pages in xen hypervisor, 8-byte event channel requests
> and qemu-traditional. But even if it gets fixed on hypervisor side I
> believe fixing the issue kernel-side is still worth it as there are
> non-fixed hypervisors out there (e.g. AWS EC2).

I think it would be preferable to fix this on the hypervisor side so
Xen guests behave in the same way as KVM guests.

But if this needs to work on non-fixed hypervisors then this patch looks
sensible.  FWIW,

Acked-by: David Vrabel <david.vra...@citrix.com>

David


Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-09 Thread Vitaly Kuznetsov
Konrad Rzeszutek Wilk <konrad.w...@oracle.com> writes:

> On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote:
>> we have a special check in read_vmcore() handler to check if the page was
>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>> vmcore is read with mmap() no such check is performed. That can lead to
>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>> enormous load in both DomU and Dom0.
>> 
>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>> bare metal.
>> 
>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>> xen-specific code, as remap_oldmem_pfn_range() was designed for.
>> That, however, would involve non-obvious xen code path for all x86 builds
>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>> code on x86 arch from doing the same override.
>
> Could 'remap_oldmem_pfn_range' become a function op? I see there
> is a 'register_oldmem_pfn_is_ram' - so could there be a similar one for
> 'pfn_range'?

Yes, it is possible to replace '__weak remap_oldmem_pfn_range' with a
registration hook in the style of 'register_oldmem_pfn_is_ram'. However,
the s390 arch overrides this function in arch/s390/kernel/crash_dump.c, so
we'll have to make some changes
there as well.
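
To make the trade-off concrete, here is a small user-space sketch of the
"function ops" alternative: a generic default that an arch or hypervisor
layer can replace at run time through a registration call, instead of a
link-time __weak override. The struct oldmem_remap_ops, the functions
register_oldmem_remap_ops and remap_oldmem_range, and the counters are all
hypothetical names invented for this illustration; nothing like them is
claimed to exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ops table; invented for this sketch. */
struct oldmem_remap_ops {
	int (*remap_range)(unsigned long vaddr, unsigned long pfn,
			   unsigned long size);
};

int generic_remap_calls;	/* counters so the dispatch can be observed */
int arch_remap_calls;

/* Generic fallback, playing the role of the __weak default. */
static int generic_remap_range(unsigned long vaddr, unsigned long pfn,
			       unsigned long size)
{
	(void)vaddr; (void)pfn; (void)size;
	generic_remap_calls++;
	return 0;
}

static const struct oldmem_remap_ops generic_ops = {
	.remap_range = generic_remap_range,
};

static const struct oldmem_remap_ops *active_ops = &generic_ops;

/* Returns 0 on success, -1 if another override is already installed. */
int register_oldmem_remap_ops(const struct oldmem_remap_ops *ops)
{
	assert(ops != NULL && ops->remap_range != NULL);
	if (active_ops != &generic_ops)
		return -1;
	active_ops = ops;
	return 0;
}

/* Callers always go through this wrapper. */
int remap_oldmem_range(unsigned long vaddr, unsigned long pfn,
		       unsigned long size)
{
	return active_ops->remap_range(vaddr, pfn, size);
}

/* Example override, standing in for what an arch layer might supply. */
static int arch_remap_range(unsigned long vaddr, unsigned long pfn,
			    unsigned long size)
{
	(void)vaddr; (void)pfn; (void)size;
	arch_remap_calls++;
	return 0;
}

const struct oldmem_remap_ops arch_ops = {
	.remap_range = arch_remap_range,
};
```

With this shape, an existing link-time override such as the s390 one would
have to be converted into a registered ops instance, which is the extra
work mentioned above.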

>
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
>> ---
>>  fs/proc/vmcore.c | 68 +++-
>>  1 file changed, 62 insertions(+), 6 deletions(-)
>> 
>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>> index 382aa89..2716e19 100644
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
>>   * virtually contiguous user-space in ELF layout.
>>   */
>>  #ifdef CONFIG_MMU
>> +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
>> +				    unsigned long pfn, unsigned long page_count)
>> +{
>> +	unsigned long pos;
>> +	size_t size;
>> +	unsigned long vma_addr;
>> +	unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
>> +
>> +	for (pos = pfn; (pos - pfn) <= page_count; pos++) {
>> +		if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
>> +			/* we hit a page which is not ram or reached the end */
>> +			if (pos - pfn > 0) {
>> +				/* remapping continuous region */
>> +				size = (pos - pfn) << PAGE_SHIFT;
>> +				vma_addr = vma->vm_start + len;
>> +				if (remap_oldmem_pfn_range(vma, vma_addr,
>> +							   pfn, size,
>> +							   vma->vm_page_prot))
>> +					return len;
>> +				len += size;
>> +				page_count -= (pos - pfn);
>> +			}
>> +			if (page_count > 0) {
>> +				/* we hit a page which is not ram, replacing
>> +				   with an empty one */
>> +				vma_addr = vma->vm_start + len;
>> +				if (remap_oldmem_pfn_range(vma, vma_addr,
>> +							   emptypage_pfn,
>> +							   PAGE_SIZE,
>> +							   vma->vm_page_prot))
>> +					return len;
>> +				len += PAGE_SIZE;
>> +				pfn = pos + 1;
>> +				page_count--;
>> +			}
>> +		}
>> +	}
>> +	return len;
>> +}
>> +
>>  static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>>  {
>>  	size_t size = vma->vm_end - vma->vm_start;
>> @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>>  
>>  	list_for_each_entry(m, &vmcore_list, list) {
>>  		if (start < m->offset + m->size) {
>> -			u64 paddr = 0;
>> +			u64 paddr = 0, original_len;
>> +			unsigned long pfn, page_count;
>>  
>>  			tsz = min_t(size_t, m->offset + m->size - start, size);
>>  			paddr = m->paddr + start - m->offset;
>> -			if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
>> -						   paddr >> PAGE_SHIFT, tsz,
>> -						   vma->vm_page_prot))
>> -				goto fail;
>> +
>> +			/* check if oldmem_pfn_is_ram was registered to avoid
>> +			   looping over all pages without a reason */
>> +			if (oldmem_pfn_is_ram) {
>> +				pfn = paddr >> PAGE_SHIFT;
>> +				page_count = tsz >> PAGE_SHIFT;

Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-09 Thread Vitaly Kuznetsov
David Vrabel <david.vra...@citrix.com> writes:

> On 07/07/14 21:33, Andrew Morton wrote:
>> On Mon,  7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov <vkuzn...@redhat.com> wrote:
>> 
>>> we have a special check in read_vmcore() handler to check if the page was
>>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>>> vmcore is read with mmap() no such check is performed. That can lead to
>>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>>> enormous load in both DomU and Dom0.
>
> Does it make forward progress though?  Or is it ending up repeatedly
> retrying the same instruction?

If memcpy is using the SSE2 optimization, the 16-byte 'movdqu' instruction
never finishes (repeatedly retrying to issue two 8-byte requests to
qemu-dm). qemu-dm decides that it's hitting 'Neither RAM nor known MMIO
space' and returns 8 0xff bytes for both of these requests (I was testing
with qemu-traditional).

>
> Is it failing on a ballooned page in a RAM region? Or is it mapping non-RAM
> regions as well?

I wasn't using ballooning, it happens that oldmem has several (two in my
test) pages which are HVMMEM_mmio_dm but qemu-dm considers them being
neither ram nor mmio.

>
>>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>>> bare metal.
>>>
>>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>>> xen-specific code, as remap_oldmem_pfn_range() was designed for.
>>> That, however, would involve non-obvious xen code path for all x86 builds
>>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>>> code on x86 arch from doing the same override.
>
> The oldmem_pfn_is_ram() is Xen-specific but this problem (ballooned
> pages) must be common to KVM.  How does KVM handle this?

As far as I'm concerned the issue was never hit with KVM. I *think* the
issue has something to do with the conjunction of 16-byte 'movdqu'
emulation for io pages in xen hypervisor, 8-byte event channel requests
and qemu-traditional. But even if it gets fixed on hypervisor side I
believe fixing the issue kernel-side is still worth it as there are
non-fixed hypervisors out there (e.g. AWS EC2).
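
For readers who want to see the kernel-side idea in isolation, the
following is a simplified user-space model of it: walk a pfn range, map
each contiguous run of RAM pages with a single call, and substitute the
zero page for every page the hypervisor reports as non-RAM. It is not the
patch itself. struct model_vma, model_pfn_is_ram, model_remap_run,
model_remap_checked, MODEL_ZERO_PFN and the choice of pfns 102 and 105 as
non-RAM are all assumptions made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PAGES 64
#define MODEL_ZERO_PFN  0UL	/* stand-in for the zero page's pfn (assumed) */

/* Toy replacement for a vma: records which pfn backs each mapped page. */
struct model_vma {
	unsigned long backing_pfn[MODEL_MAX_PAGES];
	size_t remap_calls;	/* how many remap requests were issued */
};

/* Assumed hypervisor answer: pfns 102 and 105 are not RAM. */
static bool model_pfn_is_ram(unsigned long pfn)
{
	return pfn != 102 && pfn != 105;
}

/* Map 'count' consecutive pfns starting at 'pfn' at page index 'idx'. */
static void model_remap_run(struct model_vma *vma, size_t idx,
			    unsigned long pfn, size_t count)
{
	size_t i;

	for (i = 0; i < count; i++)
		vma->backing_pfn[idx + i] = pfn + i;
	vma->remap_calls++;
}

/*
 * Walk [pfn, pfn + page_count): RAM runs are mapped with one call each,
 * every non-RAM page is replaced by the zero page.
 */
void model_remap_checked(struct model_vma *vma, unsigned long pfn,
			 size_t page_count)
{
	size_t run_start = 0;	/* index where the current RAM run began */
	size_t i;

	assert(page_count <= MODEL_MAX_PAGES);
	for (i = 0; i < page_count; i++) {
		if (model_pfn_is_ram(pfn + i))
			continue;
		if (i > run_start)	/* flush the RAM run ending before i */
			model_remap_run(vma, run_start, pfn + run_start,
					i - run_start);
		model_remap_run(vma, i, MODEL_ZERO_PFN, 1);
		run_start = i + 1;
	}
	if (page_count > run_start)	/* trailing RAM run, if any */
		model_remap_run(vma, run_start, pfn + run_start,
				page_count - run_start);
}
```

Calling model_remap_checked() on eight pages starting at pfn 100 issues
five remap requests in this model: three RAM runs and two zero-page
substitutions, matching the "two pages in my test" situation.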

>
> David

-- 
  Vitaly


Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-08 Thread Konrad Rzeszutek Wilk
On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote:
> we have a special check in read_vmcore() handler to check if the page was
> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
> vmcore is read with mmap() no such check is performed. That can lead to
> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
> enormous load in both DomU and Dom0.
> 
> Fix the issue by mapping each non-ram page to the zero page. Keep direct
> path with remap_oldmem_pfn_range() to avoid looping through all pages on
> bare metal.
> 
> The issue can also be solved by overriding remap_oldmem_pfn_range() in
> xen-specific code, as remap_oldmem_pfn_range() was designed for.
> That, however, would involve non-obvious xen code path for all x86 builds
> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
> code on x86 arch from doing the same override.

Could 'remap_oldmem_pfn_range' become a function op? I see there
is a 'register_oldmem_pfn_is_ram' - so could there be a similar one for
'pfn_range'?

> 
> Signed-off-by: Vitaly Kuznetsov <vkuzn...@redhat.com>
> ---
>  fs/proc/vmcore.c | 68 +++-
>  1 file changed, 62 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 382aa89..2716e19 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
>   * virtually contiguous user-space in ELF layout.
>   */
>  #ifdef CONFIG_MMU
> +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
> +				    unsigned long pfn, unsigned long page_count)
> +{
> +	unsigned long pos;
> +	size_t size;
> +	unsigned long vma_addr;
> +	unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
> +
> +	for (pos = pfn; (pos - pfn) <= page_count; pos++) {
> +		if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
> +			/* we hit a page which is not ram or reached the end */
> +			if (pos - pfn > 0) {
> +				/* remapping continuous region */
> +				size = (pos - pfn) << PAGE_SHIFT;
> +				vma_addr = vma->vm_start + len;
> +				if (remap_oldmem_pfn_range(vma, vma_addr,
> +							   pfn, size,
> +							   vma->vm_page_prot))
> +					return len;
> +				len += size;
> +				page_count -= (pos - pfn);
> +			}
> +			if (page_count > 0) {
> +				/* we hit a page which is not ram, replacing
> +				   with an empty one */
> +				vma_addr = vma->vm_start + len;
> +				if (remap_oldmem_pfn_range(vma, vma_addr,
> +							   emptypage_pfn,
> +							   PAGE_SIZE,
> +							   vma->vm_page_prot))
> +					return len;
> +				len += PAGE_SIZE;
> +				pfn = pos + 1;
> +				page_count--;
> +			}
> +		}
> +	}
> +	return len;
> +}
> +
>  static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>  {
>  	size_t size = vma->vm_end - vma->vm_start;
> @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>  
>  	list_for_each_entry(m, &vmcore_list, list) {
>  		if (start < m->offset + m->size) {
> -			u64 paddr = 0;
> +			u64 paddr = 0, original_len;
> +			unsigned long pfn, page_count;
>  
>  			tsz = min_t(size_t, m->offset + m->size - start, size);
>  			paddr = m->paddr + start - m->offset;
> -			if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> -						   paddr >> PAGE_SHIFT, tsz,
> -						   vma->vm_page_prot))
> -				goto fail;
> +
> +			/* check if oldmem_pfn_is_ram was registered to avoid
> +			   looping over all pages without a reason */
> +			if (oldmem_pfn_is_ram) {
> +				pfn = paddr >> PAGE_SHIFT;
> +				page_count = tsz >> PAGE_SHIFT;
> +				original_len = len;
> +				len = remap_oldmem_pfn_checked(vma, len, pfn,
> +							       page_count);
> +				if (len != original_len + tsz)

Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

2014-07-08 Thread David Vrabel
On 07/07/14 21:33, Andrew Morton wrote:
> On Mon,  7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov <vkuzn...@redhat.com> wrote:
> 
>> we have a special check in read_vmcore() handler to check if the page was
>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>> vmcore is read with mmap() no such check is performed. That can lead to
>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>> enormous load in both DomU and Dom0.

Does it make forward progress though?  Or is it ending up repeatedly
retrying the same instruction?

Is it failing on a ballooned page in a RAM region? Or is it mapping non-RAM
regions as well?

>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>> bare metal.
>>
>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>> xen-specific code, as remap_oldmem_pfn_range() was designed for.
>> That, however, would involve non-obvious xen code path for all x86 builds
>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>> code on x86 arch from doing the same override.

The oldmem_pfn_is_ram() is Xen-specific but this problem (ballooned
pages) must be common to KVM.  How does KVM handle this?

David

