[Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla

2018-07-31 Thread bugproxy
--- Comment From s...@us.ibm.com 2018-07-31 08:56 EDT---
I believe the customer has asked to close this bug. Canonical, please confirm.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1781038

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  Opinion
Status in linux package in Ubuntu:
  New

Bug description:
  Per an email forwarded within IBM, we wish to use this Launchpad bug
  to work on the technical discussion with the Canonical development
  folks and the IBM KVM and kernel team surrounding the analysis made by
  Daniel Axtens of Canonical for the customer issue raised in Case
  #00177825.

  The only statement from the KVM team so far is that CMA fragmentation
  is known to cause various problems for KVM guests. However, as
  mentioned, this bug is meant to let all the developers discuss what
  can be done to alleviate the situation or to understand the root
  cause further.

  Please also note that we should not be attaching customer data to this
  bug. If that is necessary then we expect Canonical to help provide a
  controlled environment for reviewing that data so we avoid any privacy
  issues (e.g. for GDPR compliance).

  Here is the email from Daniel:

  I have looked at the sosreport you uploaded. Here is my analysis so
  far.

  Virtualisation on powerpc has some special requirements. To start a
  guest on a powerpc host, you need to allocate a contiguous area of
  memory to hold the guest's hash page table (HPT, or HTAB, depending on
  which document you look at). The HPT is required to track and manage
  guest memory.

  Your error reports show qemu asking the kernel to allocate an HTAB,
  and the kernel reporting that it had insufficient memory to do so. The
  required memory for the HPT scales with the guest memory size - it
  should be about 1/128th of guest memory, so for a 16GB guest, that's
  128MB. However, the HPT has to be allocated as a single contiguous
  memory region. (This is in contrast to regular guest memory, which is
  not required to be contiguous from the host point of view.)
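
The sizing rule above can be sketched as simple arithmetic. This is only an illustration of the "about 1/128th" rule of thumb from the email; the function name and the list of guest sizes are made up for the example, and the real allocation size is chosen by QEMU and the kernel.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def approx_hpt_bytes(guest_mem_bytes: int) -> int:
    """Rule of thumb from the text: the HPT is about 1/128th of guest memory."""
    return guest_mem_bytes // 128

# Hypothetical guest sizes, chosen only to show how the requirement scales.
for guest_gib in (4, 16, 64, 256):
    hpt = approx_hpt_bytes(guest_gib * GIB)
    print(f"{guest_gib:>4} GiB guest -> ~{hpt // MIB} MiB contiguous HPT")
```

The point to take away is that the contiguous request grows linearly with guest size, so larger guests are more exposed to CMA shortages.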

  The kernel keeps a special contiguous memory area (CMA) for these
  purposes, and keeps track of the total amounts in use and still
  available. These are shown in /proc/meminfo. From the system that ran
  the sosreport, we see:

  CmaTotal: 26853376 kB
  CmaFree: 4024448 kB
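
For anyone wanting to watch these two counters over time, a minimal sketch of extracting them follows. The helper name `parse_cma_kib` is hypothetical, and the sample text simply repeats the two values quoted above.

```python
def parse_cma_kib(meminfo_text: str) -> dict:
    """Return the CmaTotal and CmaFree values (in kB) found in meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.strip() in ("CmaTotal", "CmaFree"):
            # The value is the first token after the colon; the unit follows it.
            values[key.strip()] = int(rest.split()[0])
    return values

# Sample input: the two lines quoted from the sosreport.
SAMPLE = """\
CmaTotal: 26853376 kB
CmaFree: 4024448 kB
"""

cma = parse_cma_kib(SAMPLE)
for name, kib in cma.items():
    print(f"{name}: {kib / 1024 ** 2:.1f} GiB")
```

On a live host the same function could be fed the contents of /proc/meminfo, for example `parse_cma_kib(open("/proc/meminfo").read())`. Note that CmaFree only reports the total free amount and says nothing about how fragmented it is.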

  So there is a total of about 25GB of CMA, of which about 3.8GB remain.
  This is obviously more than 128MB, so there are two likely explanations
  for the failure:

  - It's very possible that between the error and the sosreport, more
  contiguous memory became available. This would match the intermittent
  nature of the issue.

  - It also might be that the failure was due to fragmentation of memory
  in the CMA pool. That is, there might be more than 128MB, but it might
  all be in chunks that are smaller than 128MB, or which don't have the
  required alignment for an HPT.
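
The fragmentation case can be illustrated with a toy model. This is a sketch under stated assumptions, not how the kernel's CMA allocator works internally: the pool is modelled as a list of 16 MiB slots, and the HPT is assumed to need a run of free slots aligned to its own size.

```python
def find_aligned_run(free, run_len):
    """Return the start index of the first run of `run_len` free slots whose
    start is a multiple of `run_len`, or None if there is no such run."""
    for start in range(0, len(free) - run_len + 1, run_len):
        if all(free[start:start + run_len]):
            return start
    return None

# Toy pool: 64 slots of 16 MiB each (1 GiB in total); True means free.
# Every other slot is in use, so half the pool is free but badly fragmented.
fragmented = [i % 2 == 0 for i in range(64)]
# Same amount of free memory, but all of it in one block at the end.
compact = [i >= 32 for i in range(64)]

SLOTS_FOR_128_MIB = 8  # 8 x 16 MiB
for name, pool in (("fragmented", fragmented), ("compact", compact)):
    free_mib = sum(pool) * 16
    start = find_aligned_run(pool, SLOTS_FOR_128_MIB)
    print(f"{name}: {free_mib} MiB free, aligned 128 MiB run at slot {start}")
```

Both pools report 512 MiB free, yet only the compact one can satisfy a 128 MiB aligned request. That is consistent with a host showing gigabytes of CmaFree while HPT allocation still fails.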

  Given that the system's uptime was 112 days when the sosreport was
  generated, it would be unsurprising if fragmentation had occurred!
  (Relatedly - you're running 4.4.0-109, which does not have the Spectre
  and Meltdown fixes.)

  This issue has come up before - both in a public Canonical-IBM
  synchronised bug report[1], and with Red Hat[2]. It appears that there
  is some work within IBM to address this, but it seems to have stalled.
  I will get in touch with the IBM powerpc kernel team on their public
  mailing list and ask about the status. I will keep you updated.

  In the meantime, I have a potential solution/workaround. By default,
  5% of memory is reserved for CMA (kernel source:
  arch/powerpc/kvm/book3s_hv_builtin.c, kvm_cma_resv_ratio). You can
  increase this with a boot parameter, so for example to reserve 10%,
  you could boot with kvm_cma_resv_ratio=10. This can be set in
  petitboot. This should significantly reduce the incidence of this
  issue - perhaps eliminating it entirely - at the cost of locking away
  more of the system's memory. You would need to experiment to determine
  the optimal value. Perhaps given that you are seeing the problem only
  intermittently, a ratio of 7% would be sufficient - that would give
  you ~35GB of CMA.
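
The figures above follow from straightforward percentage arithmetic. The sketch below assumes a 512GB host (the size mentioned elsewhere in this thread) and uses a hypothetical helper name; the kernel rounds the reservation for alignment, so the real CmaTotal will differ slightly.

```python
def cma_reserved_gb(host_ram_gb: float, resv_ratio_percent: int) -> float:
    """Approximate memory reserved for the KVM CMA pool at a given ratio."""
    return host_ram_gb * resv_ratio_percent / 100

HOST_RAM_GB = 512  # assumption: host size taken from elsewhere in this thread
for ratio in (5, 7, 10):
    reserved = cma_reserved_gb(HOST_RAM_GB, ratio)
    print(f"kvm_cma_resv_ratio={ratio}: ~{reserved:.1f} GB")
```

At the default 5% this gives about 25.6GB, which matches the CmaTotal reported in the sosreport.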

  Please let me know if testing this setting would be an option for you.
  Please also let me know if you require further information on setting
  boot parameters with Petitboot.

  Regards,
  Daniel

  [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300

  Before we go any further, let's get the basic info here. Apparently
  there was a sosreport somewhere else, and a link would be good, but,
  here's what we need here -- at least -- to get started:

  1. What is the server model and at least basic config info (I/O cards,
  firmware level)? 

[Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla

2018-07-23 Thread bugproxy
--- Comment From p...@au1.ibm.com 2018-07-24 00:03 EDT---
There are a couple of points about the CMA and HPT allocations:

1. The HPT is probably getting sized according to the maximum memory of
the guest, not the initial amount of memory. Since we haven't been given
any specifics about the configuration of the guests, I can't tell
whether they are configured with maxmem greater than current/initial
memory. If the guests are configured with a very large maxmem, their
HPTs will be much larger than the 128MB mentioned in previous comments,
and that will obviously make it much more likely to have problems
allocating the HPT.

(With a sufficiently recent host kernel, guest kernel and QEMU, the HPT
can be resized while the guest is running, and in that case, QEMU
determines the size of the HPT from the current memory rather than the
maximum memory. However, HPT resizing went into the kernel later than
4.4, and I don't believe it has been backported into the Ubuntu version
of 4.4, hence the "probably" in the previous paragraph.)
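
To show how much difference maxmem can make, here is a rough sketch of the sizing. It assumes the HPT is 1/128th of the memory size rounded up to a power of two, clamped to between 2^18 and 2^46 bytes; this reflects my understanding of how QEMU sizes the table, so treat it as an approximation. The guest configuration (16 GiB current, 256 GiB maxmem) is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

def hpt_shift_for_ramsize(ram_bytes: int) -> int:
    """log2 of the HPT size: RAM rounded up to a power of two, divided by 128,
    clamped to the 2**18 .. 2**46 range (assumed limits)."""
    pow2_shift = (ram_bytes - 1).bit_length()  # log2 of RAM rounded up to a power of two
    return min(max(pow2_shift - 7, 18), 46)

# Hypothetical guest: 16 GiB of current memory, maxmem set to 256 GiB.
current_mem = 16 * GIB
max_mem = 256 * GIB

for label, size in (("current memory", current_mem), ("maxmem", max_mem)):
    hpt_bytes = 1 << hpt_shift_for_ramsize(size)
    print(f"HPT sized from {label}: {hpt_bytes // MIB} MiB")
```

In this example the contiguous request grows from 128 MiB to 2 GiB, which is far harder to satisfy from a fragmented CMA zone.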

2. The memory in the CMA zone is not locked away in the way that
previous comments imply. Memory in the CMA zone is still available for
movable allocations, which includes page cache and anonymous pages for
user processes, as well as memory for KVM guests. It is not available
for kernel allocations (including things like network packet buffers).

Thus it is worthwhile trying a larger kvm_cma_resv_ratio value in
situations like this. When fragmentation occurs, the parts of the CMA
zone that are too fragmented to use for HPTs can still be used for
running user processes and backing KVM guests.

3. Other relevant factors are whether the guest has any real PCI devices
passed through to it, and whether the guest is backed with large pages.
If the guest has any PCI devices passed through, then when the guest
sets up the DDW (dynamic DMA windows) TCE (iommu) table at boot time,
that will have the effect of pinning all the guest memory. Some of the
guest memory may have been allocated from the CMA zone. Balbir's patch
(which is in the Ubuntu 4.4 kernel now) will try to migrate any pages
that are in the CMA zone to somewhere else before they get pinned, but
if memory is in short supply, it may not succeed in moving the page out
of the CMA zone, and also the patch doesn't cope with large (16M) pages,
whether THP or explicit large pages.

Thus it would be worth disabling THP in the host.
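
Before changing anything, it may help to check which THP mode the host is currently using. The kernel exposes this in /sys/kernel/mm/transparent_hugepage/enabled, with the active mode shown in square brackets. The helper name below is hypothetical and the sample string is just a typical example of that file's content.

```python
def active_thp_mode(sysfs_text: str) -> str:
    """Return the bracketed (currently active) mode from the sysfs file content."""
    for word in sysfs_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode found in {sysfs_text!r}")

# Example content; on a real host, read the sysfs file instead.
print(active_thp_mode("always [madvise] never"))
```

Writing `never` to the same file as root disables THP until the next reboot; the boot parameter transparent_hugepage=never makes the setting persistent.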

I'm not sure whether explicit large pages could ever come from the CMA
zone. If the guests were backed by large pages, it would be worth trying
without large-page backing.

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New

[Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla

2018-07-23 Thread bugproxy
--- Comment From s...@us.ibm.com 2018-07-23 16:30 EDT---
I think the key patch (from LP 1632045) was pulled in as of 4.4.0-51.72. As
far as I can tell, the CMA debugging is not in Ubuntu (actually, I'm not
positive it was even pulled in upstream, but that's less immediately
important). I will continue to poke, but I don't think there's a good way even
to determine, without a kdump, where that memory is in use. I think the only
thing to do is try to work around the problem by devoting more memory to the
CMA by specifying a boot parameter like cma=50g, for example. That would at
least greatly alleviate the problem, though not fix it. To fix it, I think we
will need to arrange for a kdump from a failing system and then see exactly
where the memory is in use.

I will keep asking my colleagues; however, that's where I am for now.

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New

[Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla

2018-07-23 Thread bugproxy
--- Comment From s...@us.ibm.com 2018-07-23 15:10 EDT---
I'm still looking. I'm confident that Daniel has correctly identified and
explained the failure mode (i.e. not enough CMA to create guests). I also
don't *think* we have enough information at the moment to say for sure why this
particular memory pool is exhausted; however, I'm still poking around to see
what information we *do* need.

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New

[Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla

2018-07-20 Thread bugproxy
--- Comment From s...@us.ibm.com 2018-07-20 19:03 EDT---
Looking through the qemu logs, I'm not seeing anything obvious, though I will
have a colleague look at them also on Monday. I see that even "some time later"
we have a system with 512GB of RAM, the usual default 5% in CMA (25GB) and less
than 4GB free in the CMA. I'm not sure how to see what's using up the CMA, so
I'll ask about that, also.

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New

Re: [Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla

2018-07-18 Thread Daniel Axtens
Hi,

I am told that the sosreport is from the same machine, but that it was not
taken while the machine was showing symptoms: because the problem is
intermittent, it was captured some time later. This matches what I see
in the logs, so I have no reason to doubt it.

Regards,
Daniel

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New

[Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla

2018-07-18 Thread bugproxy
--- Comment From s...@us.ibm.com 2018-07-18 08:38 EDT---
Thanks, Daniel. Are you confident that the provided logs, etc. are taken from a 
machine that is currently actually showing the symptoms? The customer does tend 
to take the view that all machines are the same at all times. I don't want to 
try to dig into information taken from another machine or even the same machine 
after rebooting, etc.

Status in The Ubuntu-power-systems project:
  Triaged
Status in linux package in Ubuntu:
  New

Bug description:
  Per an email forwarded within IBM, we wish to use this Launchpad bug
  to work on the technical discussion with the Canonical development
  folks and the IBM KVM and kernel team surrounding the analysis made by
  Daniel Axtens of Canonical for the customer issue raised in Case
  #00177825.

  The only statement at the moment by the KVM team was that there were
  various issues associated with CMA fragmentation causing issues with
  KVM guests. However, as mentioned, this bug is to allow the dialog
  amongst all the developers to see what can be done to help alleviate
  the situation or understand the root cause further.

  Please also note that we should not be attaching customer data to this
  bug. If that is necessary then we expect Canonical to help provide a
  controlled environment for reviewing that data so we avoid any privacy
  issues (e.g. for GDPR compliance).

  Here is the email from Daniel:

  I have looked at the sosreport you uploaded. Here is my analysis so
  far.

  Virtualisation on powerpc has some special requirements. To start a
  guest on a powerpc host, you need to allocate a contiguous area of
  memory to hold the guest's hash page table (HPT, or HTAB, depending on
  which document you look at). The HPT is required to track and manage
  guest memory.

  Your error reports show qemu asking the kernel to allocate an HTAB,
  and the kernel reporting that it had insufficient memory to do so. The
  required memory for the HPT scales with the guest memory size - it
  should be about 1/128th of guest memory, so for a 16GB guest, that's
  128MB. However, the HPT has to be allocated as a single contiguous
  memory region. (This is in contrast to regular guest memory, which is
  not required to be contiguous from the host point of view.)
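  As a rough illustration of the sizing rule above, the following sketch
  applies the "about 1/128th" figure to a few guest sizes. The constant
  and the function name are illustrative assumptions, not taken from qemu
  or the kernel, and the real allocation may be rounded differently.

```python
# Rule of thumb from the text: HPT size is about 1/128th of guest memory.
HPT_RATIO = 128  # assumption: the "about 1/128th" figure quoted above

GIB = 1024 ** 3
MIB = 1024 ** 2


def estimate_hpt_bytes(guest_mem_bytes: int) -> int:
    """Return the approximate HPT size in bytes for a guest of the given size."""
    return guest_mem_bytes // HPT_RATIO


for guest_gib in (4, 16, 64):
    hpt_mib = estimate_hpt_bytes(guest_gib * GIB) // MIB
    print(f"{guest_gib} GiB guest -> ~{hpt_mib} MiB contiguous HPT")
```

  For the 16GB guest this gives 128MB, all of which must come from one
  contiguous region.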

  The kernel keeps a special contiguous memory area (CMA) for these
  purposes, and keeps track of the total amounts in use and still
  available. These are shown in /proc/meminfo. From the system that ran
  the sosreport, we see:

  CmaTotal: 26853376 kB
  CmaFree: 4024448 kB

  So there is a total of about 25GB of CMA, of which about 3.8GB remain
  free. This is obviously more than the 128MB needed, so the failure
  needs another explanation. There are two likely candidates:

  - It's very possible that between the error and the sosreport, more
  contiguous memory became available. This would match the intermittent
  nature of the issue.

  - It also might be that the failure was due to fragmentation of memory
  in the CMA pool. That is, there might be more than 128MB, but it might
  all be in chunks that are smaller than 128MB, or which don't have the
  required alignment for an HPT.
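  The two CMA counters can be pulled out of /proc/meminfo-style text with
  a few lines of Python. This is a minimal sketch: the helper name and
  the embedded sample are illustrative, and the sample simply repeats the
  two values quoted from the sosreport.

```python
def parse_cma(meminfo_text: str) -> dict:
    """Extract CmaTotal and CmaFree (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        key = key.strip()
        if key in ("CmaTotal", "CmaFree"):
            # The value is the first field after the colon; the unit is kB.
            values[key] = int(rest.split()[0])
    return values


# Sample text: the two lines quoted from the sosreport.
SAMPLE = """\
CmaTotal: 26853376 kB
CmaFree: 4024448 kB
"""

KB_PER_GIB = 1024 * 1024

cma = parse_cma(SAMPLE)
total_gib = cma["CmaTotal"] / KB_PER_GIB
free_gib = cma["CmaFree"] / KB_PER_GIB
print(f"CMA total: {total_gib:.1f} GiB, free: {free_gib:.1f} GiB")
```

  On a live host the same function could be fed the contents of
  /proc/meminfo. Note that these counters only report totals: they say
  nothing about the size or alignment of the largest free chunk, so they
  cannot confirm or rule out fragmentation on their own.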

  Given that the system's uptime was 112 days when the sosreport was
  generated, it would be unsurprising if fragmentation had occurred!
  (Relatedly - you're running 4.4.0-109, which does not have the Spectre
  and Meltdown fixes.)

  This issue has come up before - both in a public Canonical-IBM
  synchronised bug report[1], and with Red Hat[2]. It appears that there
  is some work within IBM to address this, but it seems to have stalled.
  I will get in touch with the IBM powerpc kernel team on their public
  mailing list and ask about the status. I will keep you updated.

  In the meantime, I have a potential solution/workaround. By default,
  5% of memory is reserved for CMA (kernel source:
  arch/powerpc/kvm/book3s_hv_builtin.c, kvm_cma_resv_ratio). You can
  increase this with a boot parameter, so for example to reserve 10%,
  you could boot with kvm_cma_resv_ratio=10. This can be set in
  petitboot. This should significantly reduce the incidence of this
  issue - perhaps eliminating it entirely - at the cost of locking away
  more of the system's memory. You would need to experiment to determine
  the optimal value. Perhaps, given that you are seeing the problem only
  intermittently, a ratio of 7% would be sufficient - that would give
  you ~35GB of CMA.
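  The ~35GB figure can be reproduced by scaling the observed CmaTotal.
  The sketch below assumes the host was booted with the default 5% ratio
  and that the reservation scales linearly with kvm_cma_resv_ratio; the
  function name is illustrative.

```python
def cma_at_ratio(cma_total_kb: int, current_ratio_pct: int, new_ratio_pct: int) -> float:
    """Scale an observed CmaTotal (kB) from one reservation ratio to another.

    Returns the estimated CMA size in GiB, assuming linear scaling.
    """
    return cma_total_kb * new_ratio_pct / current_ratio_pct / (1024 * 1024)


OBSERVED_CMA_TOTAL_KB = 26853376  # CmaTotal from the sosreport
DEFAULT_RATIO_PCT = 5             # default ratio, per the text above

for ratio in (5, 7, 10):
    gib = cma_at_ratio(OBSERVED_CMA_TOTAL_KB, DEFAULT_RATIO_PCT, ratio)
    print(f"kvm_cma_resv_ratio={ratio}: ~{gib:.1f} GiB of CMA")
```

  Every extra percentage point reserves roughly another 5GB on this
  host, which is the memory being "locked away" in exchange for a larger
  pool.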

  Please let me know if testing this setting would be an option for you.
  Please also let me know if you require further information on setting
  boot parameters with Petitboot.

  Regards,
  Daniel

  [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1632045
  [2] https://bugzilla.redhat.com/show_bug.cgi?id=1304300

  Before we go any 

[Kernel-packages] [Bug 1781038] Comment bridged from LTC Bugzilla

2018-07-12 Thread bugproxy
--- Comment From s...@us.ibm.com 2018-07-12 10:00 EDT---
Clearly, the general problem of running out of allocatable CMA space is not
solvable in the current architecture anyway. However, that is exactly why we
need to know the specifics of this customer's situation: to understand the
problem and whether there is something that can be done -- or, depending on
the kernel level, has already been done.

Title:
  KVM guest hash page table failed to allocate contiguous memory (CMA)

Status in The Ubuntu-power-systems project:
  New
Status in linux package in Ubuntu:
  New
