Re: [PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-23 Thread Paolo Bonzini

On 24/06/21 05:57, David Stevens wrote:

KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
follow_pte in gfn_to_pfn. However, the resolved pfns may not have
associated struct pages, so they should not be passed to pfn_to_page.
This series removes such calls from the x86 and arm64 secondary MMU. To
do this, this series modifies gfn_to_pfn to return a struct page in
addition to a pfn, if the hva was resolved by gup. This allows the
caller to call put_page only when necessitated by gup.

This series provides a helper function that unwraps the new return type
of gfn_to_pfn to reproduce the old behavior. As I
have no hardware to test powerpc/mips changes, the function is used
there for minimally invasive changes. Additionally, as gfn_to_page and
gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
easily changed over to only use pfns.

This addresses CVE-2021-22543 on x86 and arm64.


Thank you very much for this.  I agree that it makes sense to have a 
minimal change; I had similar changes almost ready, but was stuck with 
deadlocks in the gfn_to_pfn_cache case.  In retrospect I should have 
posted something similar to your patches.


I have started reviewing the patches, and they look good.  I will try to 
include them in 5.13.


Paolo



[PATCH] powerpc/sysfs: Replace sizeof(arr)/sizeof(arr[0]) with ARRAY_SIZE

2021-06-23 Thread Jason Wang
The ARRAY_SIZE macro is more compact and is the idiomatic way to compute
an array's element count in the Linux source tree.
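
For reference, the kernel's helper is essentially the same division
wrapped in a macro (the real definition in include/linux/kernel.h also
adds a compile-time check that the argument is a true array); simplified:

	#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))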

Signed-off-by: Jason Wang 
---
 arch/powerpc/kernel/sysfs.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 2e08640bb3b4..5ff0e55d0db1 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -843,14 +843,14 @@ static int register_cpu_online(unsigned int cpu)
 #ifdef HAS_PPC_PMC_IBM
case PPC_PMC_IBM:
attrs = ibm_common_attrs;
-   nattrs = sizeof(ibm_common_attrs) / sizeof(struct device_attribute);
+   nattrs = ARRAY_SIZE(ibm_common_attrs);
pmc_attrs = classic_pmc_attrs;
break;
 #endif /* HAS_PPC_PMC_IBM */
 #ifdef HAS_PPC_PMC_G4
case PPC_PMC_G4:
attrs = g4_common_attrs;
-   nattrs = sizeof(g4_common_attrs) / sizeof(struct device_attribute);
+   nattrs = ARRAY_SIZE(g4_common_attrs);
pmc_attrs = classic_pmc_attrs;
break;
 #endif /* HAS_PPC_PMC_G4 */
@@ -858,7 +858,7 @@ static int register_cpu_online(unsigned int cpu)
case PPC_PMC_PA6T:
/* PA Semi starts counting at PMC0 */
attrs = pa6t_attrs;
-   nattrs = sizeof(pa6t_attrs) / sizeof(struct device_attribute);
+   nattrs = ARRAY_SIZE(pa6t_attrs);
pmc_attrs = NULL;
break;
 #endif
@@ -940,14 +940,14 @@ static int unregister_cpu_online(unsigned int cpu)
 #ifdef HAS_PPC_PMC_IBM
case PPC_PMC_IBM:
attrs = ibm_common_attrs;
-   nattrs = sizeof(ibm_common_attrs) / sizeof(struct device_attribute);
+   nattrs = ARRAY_SIZE(ibm_common_attrs);
pmc_attrs = classic_pmc_attrs;
break;
 #endif /* HAS_PPC_PMC_IBM */
 #ifdef HAS_PPC_PMC_G4
case PPC_PMC_G4:
attrs = g4_common_attrs;
-   nattrs = sizeof(g4_common_attrs) / sizeof(struct device_attribute);
+   nattrs = ARRAY_SIZE(g4_common_attrs);
pmc_attrs = classic_pmc_attrs;
break;
 #endif /* HAS_PPC_PMC_G4 */
@@ -955,7 +955,7 @@ static int unregister_cpu_online(unsigned int cpu)
case PPC_PMC_PA6T:
/* PA Semi starts counting at PMC0 */
attrs = pa6t_attrs;
-   nattrs = sizeof(pa6t_attrs) / sizeof(struct device_attribute);
+   nattrs = ARRAY_SIZE(pa6t_attrs);
pmc_attrs = NULL;
break;
 #endif
-- 
2.31.1





[PATCH] powerpc: ps3: remove unneeded semicolon

2021-06-23 Thread 13145886936
From: gushengxian 

Remove unneeded semicolons.

Signed-off-by: gushengxian 
---
 arch/powerpc/platforms/ps3/system-bus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c
index 1a5665875165..f57f37fe038c 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -604,7 +604,7 @@ static dma_addr_t ps3_ioc0_map_page(struct device *_dev, struct page *page,
default:
/* not happned */
BUG();
-   };
+   }
result = ps3_dma_map(dev->d_region, (unsigned long)ptr, size,
 &bus_addr, iopte_flag);
 
@@ -763,7 +763,7 @@ int ps3_system_bus_device_register(struct ps3_system_bus_device *dev)
break;
default:
BUG();
-   };
+   }
 
dev->core.of_node = NULL;
set_dev_node(&dev->core, 0);
-- 
2.25.1




[PATCH] crypto: scatterwalk - Remove obsolete PageSlab check

2021-06-23 Thread Herbert Xu
On Fri, Jun 18, 2021 at 11:12:58AM -0700, Ira Weiny wrote:
>
> Interesting!  Thanks!
> 
> Digging around a bit more I found:
> 
> https://lore.kernel.org/patchwork/patch/439637/

Nice find.  So we can at least get rid of the PageSlab call from
the Crypto API.

---8<---
As it is now legal to call flush_dcache_page on slab pages we
no longer need to do the check in the Crypto API.

Reported-by: Ira Weiny 
Signed-off-by: Herbert Xu 

diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h
index c837d0775474..7af08174a721 100644
--- a/include/crypto/scatterwalk.h
+++ b/include/crypto/scatterwalk.h
@@ -81,12 +81,7 @@ static inline void scatterwalk_pagedone(struct scatter_walk *walk, int out,
struct page *page;
 
page = sg_page(walk->sg) + ((walk->offset - 1) >> PAGE_SHIFT);
-   /* Test ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE first as
-* PageSlab cannot be optimised away per se due to
-* use of volatile pointer.
-*/
-   if (ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE && !PageSlab(page))
-   flush_dcache_page(page);
+   flush_dcache_page(page);
}
 
if (more && walk->offset >= walk->sg->offset + walk->sg->length)
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-23 Thread Claire Chang
On Thu, Jun 24, 2021 at 1:43 PM Christoph Hellwig  wrote:
>
> On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
> > is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
> >
> > is_swiotlb_force_bounce() was the new function introduced in this patch here.
> >
> > +static inline bool is_swiotlb_force_bounce(struct device *dev)
> > +{
> > + return dev->dma_io_tlb_mem->force_bounce;
> > +}
>
> To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
> turn this into :
>
> return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;
>
> for a quick debug check?

I just realized that dma_io_tlb_mem might be NULL, as Christoph
pointed out, since swiotlb might not get initialized.
However, `Unable to handle kernel paging request at virtual address
dfff800e` looks more like the address is garbage rather than
NULL?
I wonder if that's because dev->dma_io_tlb_mem is not assigned
properly (which would mean device_initialize is not called?).


Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-23 Thread Christoph Hellwig
On Wed, Jun 23, 2021 at 02:44:34PM -0400, Qian Cai wrote:
> is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
> 
> is_swiotlb_force_bounce() was the new function introduced in this patch here.
> 
> +static inline bool is_swiotlb_force_bounce(struct device *dev)
> +{
> + return dev->dma_io_tlb_mem->force_bounce;
> +}

To me the crash looks like dev->dma_io_tlb_mem is NULL.  Can you
turn this into :

return dev->dma_io_tlb_mem && dev->dma_io_tlb_mem->force_bounce;

for a quick debug check?


Re: [PATCH 6/6] drm/i915/gvt: use gfn_to_pfn's page instead of pfn

2021-06-23 Thread David Stevens
Please ignore this last patch. It was put together as an afterthought
and wasn't properly tested.

-David

On Thu, Jun 24, 2021 at 12:59 PM David Stevens  wrote:
>
> Return struct page instead of pfn from gfn_to_mfn. This function is only
> used to determine if the page is a transparent hugepage, to enable 2MB
> huge gtt shadowing. Returning the page directly avoids the risk of
> calling pfn_to_page on a VM_IO|VM_PFNMAP pfn.
>
> This change also properly releases the reference on the page returned by
> gfn_to_pfn.
>
> Signed-off-by: David Stevens 
> ---
>  drivers/gpu/drm/i915/gvt/gtt.c   | 12 
>  drivers/gpu/drm/i915/gvt/hypercall.h |  3 ++-
>  drivers/gpu/drm/i915/gvt/kvmgt.c | 12 
>  drivers/gpu/drm/i915/gvt/mpt.h   |  8 
>  4 files changed, 18 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
> index 9478c132d7b6..b2951c560582 100644
> --- a/drivers/gpu/drm/i915/gvt/gtt.c
> +++ b/drivers/gpu/drm/i915/gvt/gtt.c
> @@ -1160,16 +1160,20 @@ static int is_2MB_gtt_possible(struct intel_vgpu 
> *vgpu,
> struct intel_gvt_gtt_entry *entry)
>  {
> struct intel_gvt_gtt_pte_ops *ops = vgpu->gvt->gtt.pte_ops;
> -   unsigned long pfn;
> +   struct page *page;
> +   bool is_trans_huge;
>
> if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
> return 0;
>
> -   pfn = intel_gvt_hypervisor_gfn_to_mfn(vgpu, ops->get_pfn(entry));
> -   if (pfn == INTEL_GVT_INVALID_ADDR)
> +   page = intel_gvt_hypervisor_gfn_to_mfn_page(vgpu, 
> ops->get_pfn(entry));
> +   if (!page)
> return -EINVAL;
>
> -   return PageTransHuge(pfn_to_page(pfn));
> +   is_trans_huge = PageTransHuge(page);
> +   put_page(page);
> +
> +   return is_trans_huge;
>  }
>
>  static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
> diff --git a/drivers/gpu/drm/i915/gvt/hypercall.h 
> b/drivers/gpu/drm/i915/gvt/hypercall.h
> index b79da5124f83..017190ff52d5 100644
> --- a/drivers/gpu/drm/i915/gvt/hypercall.h
> +++ b/drivers/gpu/drm/i915/gvt/hypercall.h
> @@ -60,7 +60,8 @@ struct intel_gvt_mpt {
> unsigned long len);
> int (*write_gpa)(unsigned long handle, unsigned long gpa, void *buf,
>  unsigned long len);
> -   unsigned long (*gfn_to_mfn)(unsigned long handle, unsigned long gfn);
> +   struct page *(*gfn_to_mfn_page)(unsigned long handle,
> +   unsigned long gfn);
>
> int (*dma_map_guest_page)(unsigned long handle, unsigned long gfn,
>   unsigned long size, dma_addr_t *dma_addr);
> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c 
> b/drivers/gpu/drm/i915/gvt/kvmgt.c
> index b829ff67e3d9..1e97ae813ed0 100644
> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
> @@ -1928,21 +1928,17 @@ static int kvmgt_inject_msi(unsigned long handle, u32 
> addr, u16 data)
> return -EFAULT;
>  }
>
> -static unsigned long kvmgt_gfn_to_pfn(unsigned long handle, unsigned long 
> gfn)
> +static struct page *kvmgt_gfn_to_page(unsigned long handle, unsigned long 
> gfn)
>  {
> struct kvmgt_guest_info *info;
> kvm_pfn_t pfn;
>
> if (!handle_valid(handle))
> -   return INTEL_GVT_INVALID_ADDR;
> +   return NULL;
>
> info = (struct kvmgt_guest_info *)handle;
>
> -   pfn = kvm_pfn_page_unwrap(gfn_to_pfn(info->kvm, gfn));
> -   if (is_error_noslot_pfn(pfn))
> -   return INTEL_GVT_INVALID_ADDR;
> -
> -   return pfn;
> +   return gfn_to_pfn(info->kvm, gfn).page;
>  }
>
>  static int kvmgt_dma_map_guest_page(unsigned long handle, unsigned long gfn,
> @@ -2112,7 +2108,7 @@ static const struct intel_gvt_mpt kvmgt_mpt = {
> .disable_page_track = kvmgt_page_track_remove,
> .read_gpa = kvmgt_read_gpa,
> .write_gpa = kvmgt_write_gpa,
> -   .gfn_to_mfn = kvmgt_gfn_to_pfn,
> +   .gfn_to_mfn_page = kvmgt_gfn_to_page,
> .dma_map_guest_page = kvmgt_dma_map_guest_page,
> .dma_unmap_guest_page = kvmgt_dma_unmap_guest_page,
> .dma_pin_guest_page = kvmgt_dma_pin_guest_page,
> diff --git a/drivers/gpu/drm/i915/gvt/mpt.h b/drivers/gpu/drm/i915/gvt/mpt.h
> index 550a456e936f..9169b83cf0f6 100644
> --- a/drivers/gpu/drm/i915/gvt/mpt.h
> +++ b/drivers/gpu/drm/i915/gvt/mpt.h
> @@ -214,17 +214,17 @@ static inline int intel_gvt_hypervisor_write_gpa(struct 
> intel_vgpu *vgpu,
>  }
>
>  /**
> - * intel_gvt_hypervisor_gfn_to_mfn - translate a GFN to MFN
> + * intel_gvt_hypervisor_gfn_to_mfn_page - translate a GFN to MFN page
>   * @vgpu: a vGPU
>   * @gpfn: guest pfn
>   *
>   * Returns:
> - * MFN on success, INTEL_GVT_INVALID_ADDR if failed.
> + * struct page* on success, NULL if failed.
>   */
> -static inline unsigned long intel_gvt_hypervisor_gfn_to_mfn(
> +static inline unsigned long int

[PATCH 6/6] drm/i915/gvt: use gfn_to_pfn's page instead of pfn

2021-06-23 Thread David Stevens
Return struct page instead of pfn from gfn_to_mfn. This function is only
used to determine if the page is a transparent hugepage, to enable 2MB
huge gtt shadowing. Returning the page directly avoids the risk of
calling pfn_to_page on a VM_IO|VM_PFNMAP pfn.

This change also properly releases the reference on the page returned by
gfn_to_pfn.

Signed-off-by: David Stevens 
---
 drivers/gpu/drm/i915/gvt/gtt.c   | 12 
 drivers/gpu/drm/i915/gvt/hypercall.h |  3 ++-
 drivers/gpu/drm/i915/gvt/kvmgt.c | 12 
 drivers/gpu/drm/i915/gvt/mpt.h   |  8 
 4 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/gtt.c b/drivers/gpu/drm/i915/gvt/gtt.c
index 9478c132d7b6..b2951c560582 100644
--- a/drivers/gpu/drm/i915/gvt/gtt.c
+++ b/drivers/gpu/drm/i915/gvt/gtt.c
@@ -1160,16 +1160,20 @@ static int is_2MB_gtt_possible(struct intel_vgpu *vgpu,
struct intel_gvt_gtt_entry *entry)
 {
struct intel_gvt_gtt_pte_ops *ops = vgpu->gvt->gtt.pte_ops;
-   unsigned long pfn;
+   struct page *page;
+   bool is_trans_huge;
 
if (!HAS_PAGE_SIZES(vgpu->gvt->gt->i915, I915_GTT_PAGE_SIZE_2M))
return 0;
 
-   pfn = intel_gvt_hypervisor_gfn_to_mfn(vgpu, ops->get_pfn(entry));
-   if (pfn == INTEL_GVT_INVALID_ADDR)
+   page = intel_gvt_hypervisor_gfn_to_mfn_page(vgpu, ops->get_pfn(entry));
+   if (!page)
return -EINVAL;
 
-   return PageTransHuge(pfn_to_page(pfn));
+   is_trans_huge = PageTransHuge(page);
+   put_page(page);
+
+   return is_trans_huge;
 }
 
 static int split_2MB_gtt_entry(struct intel_vgpu *vgpu,
diff --git a/drivers/gpu/drm/i915/gvt/hypercall.h b/drivers/gpu/drm/i915/gvt/hypercall.h
index b79da5124f83..017190ff52d5 100644
--- a/drivers/gpu/drm/i915/gvt/hypercall.h
+++ b/drivers/gpu/drm/i915/gvt/hypercall.h
@@ -60,7 +60,8 @@ struct intel_gvt_mpt {
unsigned long len);
int (*write_gpa)(unsigned long handle, unsigned long gpa, void *buf,
 unsigned long len);
-   unsigned long (*gfn_to_mfn)(unsigned long handle, unsigned long gfn);
+   struct page *(*gfn_to_mfn_page)(unsigned long handle,
+   unsigned long gfn);
 
int (*dma_map_guest_page)(unsigned long handle, unsigned long gfn,
  unsigned long size, dma_addr_t *dma_addr);
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index b829ff67e3d9..1e97ae813ed0 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1928,21 +1928,17 @@ static int kvmgt_inject_msi(unsigned long handle, u32 addr, u16 data)
return -EFAULT;
 }
 
-static unsigned long kvmgt_gfn_to_pfn(unsigned long handle, unsigned long gfn)
+static struct page *kvmgt_gfn_to_page(unsigned long handle, unsigned long gfn)
 {
struct kvmgt_guest_info *info;
kvm_pfn_t pfn;
 
if (!handle_valid(handle))
-   return INTEL_GVT_INVALID_ADDR;
+   return NULL;
 
info = (struct kvmgt_guest_info *)handle;
 
-   pfn = kvm_pfn_page_unwrap(gfn_to_pfn(info->kvm, gfn));
-   if (is_error_noslot_pfn(pfn))
-   return INTEL_GVT_INVALID_ADDR;
-
-   return pfn;
+   return gfn_to_pfn(info->kvm, gfn).page;
 }
 
 static int kvmgt_dma_map_guest_page(unsigned long handle, unsigned long gfn,
@@ -2112,7 +2108,7 @@ static const struct intel_gvt_mpt kvmgt_mpt = {
.disable_page_track = kvmgt_page_track_remove,
.read_gpa = kvmgt_read_gpa,
.write_gpa = kvmgt_write_gpa,
-   .gfn_to_mfn = kvmgt_gfn_to_pfn,
+   .gfn_to_mfn_page = kvmgt_gfn_to_page,
.dma_map_guest_page = kvmgt_dma_map_guest_page,
.dma_unmap_guest_page = kvmgt_dma_unmap_guest_page,
.dma_pin_guest_page = kvmgt_dma_pin_guest_page,
diff --git a/drivers/gpu/drm/i915/gvt/mpt.h b/drivers/gpu/drm/i915/gvt/mpt.h
index 550a456e936f..9169b83cf0f6 100644
--- a/drivers/gpu/drm/i915/gvt/mpt.h
+++ b/drivers/gpu/drm/i915/gvt/mpt.h
@@ -214,17 +214,17 @@ static inline int intel_gvt_hypervisor_write_gpa(struct intel_vgpu *vgpu,
 }
 
 /**
- * intel_gvt_hypervisor_gfn_to_mfn - translate a GFN to MFN
+ * intel_gvt_hypervisor_gfn_to_mfn_page - translate a GFN to MFN page
  * @vgpu: a vGPU
  * @gpfn: guest pfn
  *
  * Returns:
- * MFN on success, INTEL_GVT_INVALID_ADDR if failed.
+ * struct page* on success, NULL if failed.
  */
-static inline unsigned long intel_gvt_hypervisor_gfn_to_mfn(
+static inline unsigned long intel_gvt_hypervisor_gfn_to_mfn_page(
struct intel_vgpu *vgpu, unsigned long gfn)
 {
-   return intel_gvt_host.mpt->gfn_to_mfn(vgpu->handle, gfn);
+   return intel_gvt_host.mpt->gfn_to_mfn_page(vgpu->handle, gfn);
 }
 
 /**
-- 
2.32.0.93.g670b81a890-goog



[PATCH 5/6] KVM: mmu: remove over-aggressive warnings

2021-06-23 Thread David Stevens
From: David Stevens 

Remove two warnings that require ref counts for pages to be non-zero, as
mapped pfns from follow_pfn may not have an initialized ref count.

Signed-off-by: David Stevens 
---
 arch/x86/kvm/mmu/mmu.c | 7 ---
 virt/kvm/kvm_main.c| 2 +-
 2 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8fa4a4a411ba..19249ad4b5b8 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -546,13 +546,6 @@ static int mmu_spte_clear_track_bits(u64 *sptep)
 
pfn = spte_to_pfn(old_spte);
 
-   /*
-* KVM does not hold the refcount of the page used by
-* kvm mmu, before reclaiming the page, we should
-* unmap it from mmu first.
-*/
-   WARN_ON(!kvm_is_reserved_pfn(pfn) && !page_count(pfn_to_page(pfn)));
-
if (is_accessed_spte(old_spte))
kvm_set_pfn_accessed(pfn);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 898e90be4d0e..671361f30476 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -168,7 +168,7 @@ bool kvm_is_zone_device_pfn(kvm_pfn_t pfn)
 * the device has been pinned, e.g. by get_user_pages().  WARN if the
 * page_count() is zero to help detect bad usage of this helper.
 */
-   if (!pfn_valid(pfn) || WARN_ON_ONCE(!page_count(pfn_to_page(pfn))))
+   if (!pfn_valid(pfn) || !page_count(pfn_to_page(pfn)))
return false;
 
return is_zone_device_page(pfn_to_page(pfn));
-- 
2.32.0.93.g670b81a890-goog



[PATCH 4/6] KVM: arm64/mmu: avoid struct page in MMU

2021-06-23 Thread David Stevens
From: David Stevens 

Avoid converting pfns returned by follow_fault_pfn to struct pages to
transiently take a reference. The reference was originally taken to
match the reference taken by gup. However, pfns returned by
follow_fault_pfn may not have a struct page set up for reference
counting.

Signed-off-by: David Stevens 
---
 arch/arm64/kvm/mmu.c | 43 +++
 1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 896b3644b36f..a741972cb75f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -779,17 +779,17 @@ static bool fault_supports_stage2_huge_mapping(struct kvm_memory_slot *memslot,
  */
 static unsigned long
 transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
-   unsigned long hva, kvm_pfn_t *pfnp,
+   unsigned long hva, struct kvm_pfn_page *pfnpgp,
phys_addr_t *ipap)
 {
-   kvm_pfn_t pfn = *pfnp;
+   kvm_pfn_t pfn = pfnpgp->pfn;
 
/*
 * Make sure the adjustment is done only for THP pages. Also make
 * sure that the HVA and IPA are sufficiently aligned and that the
 * block map is contained within the memslot.
 */
-   if (kvm_is_transparent_hugepage(pfn) &&
+   if (pfnpgp->page && kvm_is_transparent_hugepage(pfn) &&
fault_supports_stage2_huge_mapping(memslot, hva, PMD_SIZE)) {
/*
 * The address we faulted on is backed by a transparent huge
@@ -810,10 +810,11 @@ transparent_hugepage_adjust(struct kvm_memory_slot *memslot,
 * page accordingly.
 */
*ipap &= PMD_MASK;
-   kvm_release_pfn_clean(pfn);
+   put_page(pfnpgp->page);
pfn &= ~(PTRS_PER_PMD - 1);
-   kvm_get_pfn(pfn);
-   *pfnp = pfn;
+   pfnpgp->pfn = pfn;
+   pfnpgp->page = pfn_to_page(pfnpgp->pfn);
+   get_page(pfnpgp->page);
 
return PMD_SIZE;
}
@@ -836,7 +837,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct vm_area_struct *vma;
short vma_shift;
gfn_t gfn;
-   kvm_pfn_t pfn;
+   struct kvm_pfn_page pfnpg;
bool logging_active = memslot_is_logging(memslot);
unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
unsigned long vma_pagesize, fault_granule;
@@ -933,17 +934,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 */
smp_rmb();
 
-   pfn = kvm_pfn_page_unwrap(__gfn_to_pfn_memslot(memslot, gfn, false,
-  NULL, write_fault,
-  &writable, NULL));
-   if (pfn == KVM_PFN_ERR_HWPOISON) {
+   pfnpg = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+write_fault, &writable, NULL);
+   if (pfnpg.pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
}
-   if (is_error_noslot_pfn(pfn))
+   if (is_error_noslot_pfn(pfnpg.pfn))
return -EFAULT;
 
-   if (kvm_is_device_pfn(pfn)) {
+   if (kvm_is_device_pfn(pfnpg.pfn)) {
device = true;
force_pte = true;
} else if (logging_active && !write_fault) {
@@ -968,16 +968,16 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 */
if (vma_pagesize == PAGE_SIZE && !force_pte)
vma_pagesize = transparent_hugepage_adjust(memslot, hva,
-  &pfn, &fault_ipa);
+  &pfnpg, &fault_ipa);
if (writable)
prot |= KVM_PGTABLE_PROT_W;
 
if (fault_status != FSC_PERM && !device)
-   clean_dcache_guest_page(pfn, vma_pagesize);
+   clean_dcache_guest_page(pfnpg.pfn, vma_pagesize);
 
if (exec_fault) {
prot |= KVM_PGTABLE_PROT_X;
-   invalidate_icache_guest_page(pfn, vma_pagesize);
+   invalidate_icache_guest_page(pfnpg.pfn, vma_pagesize);
}
 
if (device)
@@ -994,20 +994,23 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
ret = kvm_pgtable_stage2_relax_perms(pgt, fault_ipa, prot);
} else {
ret = kvm_pgtable_stage2_map(pgt, fault_ipa, vma_pagesize,
-__pfn_to_phys(pfn), prot,
+__pfn_to_phys(pfnpg.pfn), prot,
 memcache);
}
 
/* Mark the page dirty only if the fault is handled successfully */
if (writable && !ret) {
-   kvm_set_pfn_di

[PATCH 3/6] KVM: x86/mmu: avoid struct page in MMU

2021-06-23 Thread David Stevens
From: David Stevens 

Avoid converting pfns returned by follow_fault_pfn to struct pages to
transiently take a reference. The reference was originally taken to
match the reference taken by gup. However, pfns returned by
follow_fault_pfn may not have a struct page set up for reference
counting.

Signed-off-by: David Stevens 
---
 arch/x86/kvm/mmu/mmu.c  | 56 +++--
 arch/x86/kvm/mmu/mmu_audit.c| 13 
 arch/x86/kvm/mmu/mmu_internal.h |  3 +-
 arch/x86/kvm/mmu/paging_tmpl.h  | 36 -
 arch/x86/kvm/mmu/tdp_mmu.c  |  7 +++--
 arch/x86/kvm/mmu/tdp_mmu.h  |  4 +--
 arch/x86/kvm/x86.c  |  9 +++---
 7 files changed, 73 insertions(+), 55 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 84913677c404..8fa4a4a411ba 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2610,16 +2610,16 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
return ret;
 }
 
-static kvm_pfn_t pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu, gfn_t gfn,
-bool no_dirty_log)
+static struct kvm_pfn_page pte_prefetch_gfn_to_pfn(struct kvm_vcpu *vcpu,
+  gfn_t gfn, bool no_dirty_log)
 {
struct kvm_memory_slot *slot;
 
slot = gfn_to_memslot_dirty_bitmap(vcpu, gfn, no_dirty_log);
if (!slot)
-   return KVM_PFN_ERR_FAULT;
+   return KVM_PFN_PAGE_ERR(KVM_PFN_ERR_FAULT);
 
-   return kvm_pfn_page_unwrap(gfn_to_pfn_memslot_atomic(slot, gfn));
+   return gfn_to_pfn_memslot_atomic(slot, gfn);
 }
 
 static int direct_pte_prefetch_many(struct kvm_vcpu *vcpu,
@@ -2748,7 +2748,8 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
 
 int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
int max_level, kvm_pfn_t *pfnp,
-   bool huge_page_disallowed, int *req_level)
+   struct page *page, bool huge_page_disallowed,
+   int *req_level)
 {
struct kvm_memory_slot *slot;
kvm_pfn_t pfn = *pfnp;
@@ -2760,6 +2761,9 @@ int kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t gfn,
if (unlikely(max_level == PG_LEVEL_4K))
return PG_LEVEL_4K;
 
+   if (!page)
+   return PG_LEVEL_4K;
+
if (is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn))
return PG_LEVEL_4K;
 
@@ -2814,7 +2818,8 @@ void disallowed_hugepage_adjust(u64 spte, gfn_t gfn, int cur_level,
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
-   int map_writable, int max_level, kvm_pfn_t pfn,
+   int map_writable, int max_level,
+   const struct kvm_pfn_page *pfnpg,
bool prefault, bool is_tdp)
 {
bool nx_huge_page_workaround_enabled = is_nx_huge_page_enabled();
@@ -2826,11 +2831,12 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
int level, req_level, ret;
gfn_t gfn = gpa >> PAGE_SHIFT;
gfn_t base_gfn = gfn;
+   kvm_pfn_t pfn = pfnpg->pfn;
 
if (WARN_ON(!VALID_PAGE(vcpu->arch.mmu->root_hpa)))
return RET_PF_RETRY;
 
-   level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn,
+   level = kvm_mmu_hugepage_adjust(vcpu, gfn, max_level, &pfn, pfnpg->page,
huge_page_disallowed, &req_level);
 
trace_kvm_mmu_spte_requested(gpa, level, pfn);
@@ -3672,8 +3678,8 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 }
 
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
-bool write, bool *writable)
+gpa_t cr2_or_gpa, struct kvm_pfn_page *pfnpg,
+hva_t *hva, bool write, bool *writable)
 {
struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
bool async;
@@ -3688,17 +3694,16 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 
/* Don't expose private memslots to L2. */
if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
-   *pfn = KVM_PFN_NOSLOT;
+   *pfnpg = KVM_PFN_PAGE_ERR(KVM_PFN_NOSLOT);
*writable = false;
return false;
}
 
async = false;
-   *pfn = kvm_pfn_page_unwrap(__gfn_to_pfn_memslot(slot, gfn, false,
-   &async, write,
-   writable, hva));
+   *pfnpg = __gfn_to_pfn_memslot(slot, gfn, false, &async,
+ write, writable, hva);
if (!async)
-   return false; /* *pfn has correct page already */
+ 

[PATCH 2/6] KVM: mmu: also return page from gfn_to_pfn

2021-06-23 Thread David Stevens
From: David Stevens 

Return a struct kvm_pfn_page containing both a pfn and an optional
struct page from the gfn_to_pfn family of functions. This differentiates
the gup and follow_fault_pfn cases, which allows callers that only need
a pfn to avoid touching the page struct in the latter case. For callers
that need a struct page, introduce a helper function that unwraps a
struct kvm_pfn_page into a struct page. This helper makes the call to
kvm_get_pfn which had previously been in hva_to_pfn_remapped.

For now, wrap all calls to gfn_to_pfn functions in the new helper
function. Callers which don't need the page struct will be updated in
follow-up patches.
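
Caller-side, the two resulting idioms look roughly like this (a sketch
using the series' types; use_pfn() is a hypothetical consumer):

	/* callers that only need a pfn keep the old behavior: */
	kvm_pfn_t pfn = kvm_pfn_page_unwrap(gfn_to_pfn(kvm, gfn));

	/* callers that care about the gup case check the page directly: */
	struct kvm_pfn_page pfnpg = gfn_to_pfn(kvm, gfn);

	if (!is_error_noslot_pfn(pfnpg.pfn))
		use_pfn(pfnpg.pfn);
	if (pfnpg.page)
		put_page(pfnpg.page);	/* drop the gup reference, if any */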

Signed-off-by: David Stevens 
---
 arch/arm64/kvm/mmu.c   |   5 +-
 arch/mips/kvm/mmu.c|   3 +-
 arch/powerpc/kvm/book3s.c  |   3 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c|   5 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   5 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c |   4 +-
 arch/powerpc/kvm/e500_mmu_host.c   |   2 +-
 arch/x86/kvm/mmu/mmu.c |  11 ++-
 arch/x86/kvm/mmu/mmu_audit.c   |   2 +-
 arch/x86/kvm/x86.c |   2 +-
 drivers/gpu/drm/i915/gvt/kvmgt.c   |   2 +-
 include/linux/kvm_host.h   |  27 --
 include/linux/kvm_types.h  |   5 +
 virt/kvm/kvm_main.c| 121 +
 14 files changed, 109 insertions(+), 88 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index c10207fed2f3..896b3644b36f 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -933,8 +933,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 */
smp_rmb();
 
-   pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
-  write_fault, &writable, NULL);
+   pfn = kvm_pfn_page_unwrap(__gfn_to_pfn_memslot(memslot, gfn, false,
+  NULL, write_fault,
+  &writable, NULL));
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
diff --git a/arch/mips/kvm/mmu.c b/arch/mips/kvm/mmu.c
index 6d1f68cf4edf..f4e5e48bc6bf 100644
--- a/arch/mips/kvm/mmu.c
+++ b/arch/mips/kvm/mmu.c
@@ -630,7 +630,8 @@ static int kvm_mips_map_page(struct kvm_vcpu *vcpu, unsigned long gpa,
smp_rmb();
 
/* Slow path - ask KVM core whether we can access this GPA */
-   pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writeable);
+   pfn = kvm_pfn_page_unwrap(gfn_to_pfn_prot(kvm, gfn,
+ write_fault, &writeable));
if (is_error_noslot_pfn(pfn)) {
err = -EFAULT;
goto out;
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 2b691f4d1f26..2dff01d0632a 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -417,7 +417,8 @@ kvm_pfn_t kvmppc_gpa_to_pfn(struct kvm_vcpu *vcpu, gpa_t gpa, bool writing,
return pfn;
}
 
-   return gfn_to_pfn_prot(vcpu->kvm, gfn, writing, writable);
+   return kvm_pfn_page_unwrap(gfn_to_pfn_prot(vcpu->kvm, gfn,
+  writing, writable));
 }
 EXPORT_SYMBOL_GPL(kvmppc_gpa_to_pfn);
 
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2d9193cd73be..ba094b9f87a9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -590,8 +590,9 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
write_ok = true;
} else {
/* Call KVM generic code to do the slow-path check */
-   pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
-  writing, &write_ok, NULL);
+   pfn = kvm_pfn_page_unwrap(
+   __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+writing, &write_ok, NULL));
if (is_error_noslot_pfn(pfn))
return -EFAULT;
page = NULL;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index d909c069363e..e7892f148222 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -821,8 +821,9 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long pfn;
 
/* Call KVM generic code to do the slow-path check */
-   pfn = __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
-  writing, upgrade_p, NULL);
+   pfn = kvm_pfn_page_unwrap(
+   __gfn_to_pfn_memslot(memslot, gfn, false, NULL,
+   

[PATCH 1/6] KVM: x86/mmu: release audited pfns

2021-06-23 Thread David Stevens
From: David Stevens 

Signed-off-by: David Stevens 
---
 arch/x86/kvm/mmu/mmu_audit.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu_audit.c b/arch/x86/kvm/mmu/mmu_audit.c
index cedc17b2f60e..97ff184084b4 100644
--- a/arch/x86/kvm/mmu/mmu_audit.c
+++ b/arch/x86/kvm/mmu/mmu_audit.c
@@ -121,6 +121,8 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
 "ent %llxn", vcpu->arch.mmu->root_level, pfn,
 hpa, *sptep);
+
+   kvm_release_pfn_clean(pfn);
 }
 
 static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
-- 
2.32.0.93.g670b81a890-goog



[PATCH 0/6] KVM: Remove uses of struct page from x86 and arm64 MMU

2021-06-23 Thread David Stevens
KVM supports mapping VM_IO and VM_PFNMAP memory into the guest by using
follow_pte in gfn_to_pfn. However, the resolved pfns may not have
associated struct pages, so they should not be passed to pfn_to_page.
This series removes such calls from the x86 and arm64 secondary MMU. To
do this, this series modifies gfn_to_pfn to return a struct page in
addition to a pfn, if the hva was resolved by gup. This allows the
caller to call put_page only when necessitated by gup.

This series provides a helper function that unwraps the new return type
of gfn_to_pfn to reproduce the old behavior. As I
have no hardware to test powerpc/mips changes, the function is used
there for minimally invasive changes. Additionally, as gfn_to_page and
gfn_to_pfn_cache are not integrated with mmu notifier, they cannot be
easily changed over to only use pfns.

This addresses CVE-2021-22543 on x86 and arm64.
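
For reference, the new return type and the compatibility helper have
roughly the following shape; this is a sketch based on patch 2, not the
exact code (error handling details differ):

	struct kvm_pfn_page {
		kvm_pfn_t pfn;
		struct page *page;	/* non-NULL only when resolved by gup */
	};

	/* old behavior: also take a ref for follow_pte-resolved pfns */
	static inline kvm_pfn_t kvm_pfn_page_unwrap(struct kvm_pfn_page pfnpg)
	{
		if (!pfnpg.page)
			kvm_get_pfn(pfnpg.pfn);	/* was in hva_to_pfn_remapped */
		return pfnpg.pfn;
	}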

David Stevens (6):
  KVM: x86/mmu: release audited pfns
  KVM: mmu: also return page from gfn_to_pfn
  KVM: x86/mmu: avoid struct page in MMU
  KVM: arm64/mmu: avoid struct page in MMU
  KVM: mmu: remove over-aggressive warnings
  drm/i915/gvt: use gfn_to_pfn's page instead of pfn

 arch/arm64/kvm/mmu.c   |  42 +
 arch/mips/kvm/mmu.c|   3 +-
 arch/powerpc/kvm/book3s.c  |   3 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c|   5 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   5 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c |   4 +-
 arch/powerpc/kvm/e500_mmu_host.c   |   2 +-
 arch/x86/kvm/mmu/mmu.c |  60 ++--
 arch/x86/kvm/mmu/mmu_audit.c   |  13 ++-
 arch/x86/kvm/mmu/mmu_internal.h|   3 +-
 arch/x86/kvm/mmu/paging_tmpl.h |  36 +---
 arch/x86/kvm/mmu/tdp_mmu.c |   7 +-
 arch/x86/kvm/mmu/tdp_mmu.h |   4 +-
 arch/x86/kvm/x86.c |   9 +-
 drivers/gpu/drm/i915/gvt/gtt.c |  12 ++-
 drivers/gpu/drm/i915/gvt/hypercall.h   |   3 +-
 drivers/gpu/drm/i915/gvt/kvmgt.c   |  12 +--
 drivers/gpu/drm/i915/gvt/mpt.h |   8 +-
 include/linux/kvm_host.h   |  27 --
 include/linux/kvm_types.h  |   5 +
 virt/kvm/kvm_main.c| 123 +
 21 files changed, 212 insertions(+), 174 deletions(-)

-- 
2.32.0.93.g670b81a890-goog



[PATCH v16 4/4] kasan: use MAX_PTRS_PER_* for early shadow tables

2021-06-23 Thread Daniel Axtens
powerpc has a variable number of PTRS_PER_*, set at runtime based
on the MMU that the kernel is booted under.

This means the PTRS_PER_* are no longer compile-time constants, which
breaks the build of statically sized arrays such as KASAN's early shadow
tables. Switch to using MAX_PTRS_PER_*, which are constant.

Suggested-by: Christophe Leroy 
Suggested-by: Balbir Singh 
Reviewed-by: Christophe Leroy 
Reviewed-by: Balbir Singh 
Reviewed-by: Marco Elver 
Reviewed-by: Andrey Konovalov 
Signed-off-by: Daniel Axtens 
---
 include/linux/kasan.h | 6 +++---
 mm/kasan/init.c   | 6 +++---
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index 768d7d342757..5310e217bd74 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -41,9 +41,9 @@ struct kunit_kasan_expectation {
 #endif
 
 extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
-extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE + PTE_HWTABLE_PTRS];
-extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
-extern pud_t kasan_early_shadow_pud[PTRS_PER_PUD];
+extern pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE + PTE_HWTABLE_PTRS];
+extern pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD];
+extern pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD];
 extern p4d_t kasan_early_shadow_p4d[MAX_PTRS_PER_P4D];
 
 int kasan_populate_early_shadow(const void *shadow_start,
diff --git a/mm/kasan/init.c b/mm/kasan/init.c
index 348f31d15a97..cc64ed6858c6 100644
--- a/mm/kasan/init.c
+++ b/mm/kasan/init.c
@@ -41,7 +41,7 @@ static inline bool kasan_p4d_table(pgd_t pgd)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 3
-pud_t kasan_early_shadow_pud[PTRS_PER_PUD] __page_aligned_bss;
+pud_t kasan_early_shadow_pud[MAX_PTRS_PER_PUD] __page_aligned_bss;
 static inline bool kasan_pud_table(p4d_t p4d)
 {
return p4d_page(p4d) == virt_to_page(lm_alias(kasan_early_shadow_pud));
@@ -53,7 +53,7 @@ static inline bool kasan_pud_table(p4d_t p4d)
 }
 #endif
 #if CONFIG_PGTABLE_LEVELS > 2
-pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD] __page_aligned_bss;
+pmd_t kasan_early_shadow_pmd[MAX_PTRS_PER_PMD] __page_aligned_bss;
 static inline bool kasan_pmd_table(pud_t pud)
 {
return pud_page(pud) == virt_to_page(lm_alias(kasan_early_shadow_pmd));
@@ -64,7 +64,7 @@ static inline bool kasan_pmd_table(pud_t pud)
return false;
 }
 #endif
-pte_t kasan_early_shadow_pte[PTRS_PER_PTE + PTE_HWTABLE_PTRS]
+pte_t kasan_early_shadow_pte[MAX_PTRS_PER_PTE + PTE_HWTABLE_PTRS]
__page_aligned_bss;
 
 static inline bool kasan_pte_table(pmd_t pmd)
-- 
2.30.2



[PATCH v16 3/4] mm: define default MAX_PTRS_PER_* in include/pgtable.h

2021-06-23 Thread Daniel Axtens
Commit c65e774fb3f6 ("x86/mm: Make PGDIR_SHIFT and PTRS_PER_P4D variable")
made PTRS_PER_P4D variable on x86 and introduced MAX_PTRS_PER_P4D as a
constant for cases which need a compile-time constant (e.g. fixed-size
arrays).

powerpc likewise has boot-time selectable MMU features which can cause
other mm "constants" to vary. For KASAN, we have some static
PTE/PMD/PUD/P4D arrays so we need compile-time maximums for all these
constants. Extend the MAX_PTRS_PER_ idiom, and place default definitions
in include/pgtable.h. These define MAX_PTRS_PER_x to be PTRS_PER_x unless
an architecture has defined MAX_PTRS_PER_x in its arch headers.

Clean up pgtable-nop4d.h and s390's MAX_PTRS_PER_P4D definitions while
we're at it: both can just pick up the default now.
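
For illustration, an architecture with boot-time selectable MMUs can
define its own bound before the generic fallback applies; a sketch with
hypothetical hash/radix macro names:

	/* hypothetical arch header: bound is the max of the runtime options */
	#define MAX_PTRS_PER_PTE \
		(H_PTRS_PER_PTE > R_PTRS_PER_PTE ? H_PTRS_PER_PTE : R_PTRS_PER_PTE)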

Acked-by: Andrey Konovalov 
Reviewed-by: Christophe Leroy 
Reviewed-by: Marco Elver 
Signed-off-by: Daniel Axtens 

---

s390 was compile tested only.
---
 arch/s390/include/asm/pgtable.h |  2 --
 include/asm-generic/pgtable-nop4d.h |  1 -
 include/linux/pgtable.h | 22 ++
 3 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 79742f497cb5..dcac7b2df72c 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -343,8 +343,6 @@ static inline int is_module_addr(void *addr)
 #define PTRS_PER_P4D   _CRST_ENTRIES
 #define PTRS_PER_PGD   _CRST_ENTRIES
 
-#define MAX_PTRS_PER_P4D   PTRS_PER_P4D
-
 /*
  * Segment table and region3 table entry encoding
  * (R = read-only, I = invalid, y = young bit):
diff --git a/include/asm-generic/pgtable-nop4d.h 
b/include/asm-generic/pgtable-nop4d.h
index 2f1d0aad645c..03b7dae47dd4 100644
--- a/include/asm-generic/pgtable-nop4d.h
+++ b/include/asm-generic/pgtable-nop4d.h
@@ -9,7 +9,6 @@
 typedef struct { pgd_t pgd; } p4d_t;
 
 #define P4D_SHIFT  PGDIR_SHIFT
-#define MAX_PTRS_PER_P4D   1
 #define PTRS_PER_P4D   1
 #define P4D_SIZE   (1UL << P4D_SHIFT)
 #define P4D_MASK   (~(P4D_SIZE-1))
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index fb20c57de2ce..d147480cdefc 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1634,4 +1634,26 @@ typedef unsigned int pgtbl_mod_mask;
 #define pte_leaf_size(x) PAGE_SIZE
 #endif
 
+/*
+ * Some architectures have MMUs that are configurable or selectable at boot
+ * time. These lead to variable PTRS_PER_x. For statically allocated arrays it
+ * helps to have a static maximum value.
+ */
+
+#ifndef MAX_PTRS_PER_PTE
+#define MAX_PTRS_PER_PTE PTRS_PER_PTE
+#endif
+
+#ifndef MAX_PTRS_PER_PMD
+#define MAX_PTRS_PER_PMD PTRS_PER_PMD
+#endif
+
+#ifndef MAX_PTRS_PER_PUD
+#define MAX_PTRS_PER_PUD PTRS_PER_PUD
+#endif
+
+#ifndef MAX_PTRS_PER_P4D
+#define MAX_PTRS_PER_P4D PTRS_PER_P4D
+#endif
+
 #endif /* _LINUX_PGTABLE_H */
-- 
2.30.2



[PATCH v16 2/4] kasan: allow architectures to provide an outline readiness check

2021-06-23 Thread Daniel Axtens
Allow architectures to define a kasan_arch_is_ready() hook that bails
out of any function that's about to touch the shadow unless the arch
says that it is ready for the memory to be accessed. This is minimally
invasive and should have a negligible performance penalty.

This will only work in outline mode, so an arch must specify
ARCH_DISABLE_KASAN_INLINE if it requires this.
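
For illustration, an arch opting in might gate the hook on a flag it
sets once the shadow is mapped; a sketch with hypothetical names:

	/* hypothetical arch/xxx/include/asm/kasan.h */
	extern bool xxx_kasan_shadow_ready;

	static inline bool kasan_arch_is_ready(void)
	{
		return likely(xxx_kasan_shadow_ready);
	}
	#define kasan_arch_is_ready kasan_arch_is_ready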

Cc: Balbir Singh 
Cc: Aneesh Kumar K.V 
Suggested-by: Christophe Leroy 
Reviewed-by: Marco Elver 
Signed-off-by: Daniel Axtens 

--

Both previous RFCs for ppc64 - by 2 different people - have
needed this trick! See:
 - https://lore.kernel.org/patchwork/patch/592820/ # ppc64 hash series
 - https://patchwork.ozlabs.org/patch/795211/  # ppc radix series

Build tested on arm64 with SW_TAGS and x86 with INLINE: the error fires
if I add a kasan_arch_is_ready define.
---
 mm/kasan/common.c  | 3 +++
 mm/kasan/generic.c | 3 +++
 mm/kasan/kasan.h   | 6 ++
 mm/kasan/shadow.c  | 6 ++
 4 files changed, 18 insertions(+)

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 10177cc26d06..2baf121fb8c5 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -331,6 +331,9 @@ static inline bool kasan_slab_free(struct kmem_cache *cache, void *object,
u8 tag;
void *tagged_object;
 
+   if (!kasan_arch_is_ready())
+   return false;
+
tag = get_tag(object);
tagged_object = object;
object = kasan_reset_tag(object);
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index 53cbf28859b5..c3f5ba7a294a 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -163,6 +163,9 @@ static __always_inline bool check_region_inline(unsigned long addr,
size_t size, bool write,
unsigned long ret_ip)
 {
+   if (!kasan_arch_is_ready())
+   return true;
+
if (unlikely(size == 0))
return true;
 
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index 8f450bc28045..4dbc8def64f4 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -449,6 +449,12 @@ static inline void kasan_poison_last_granule(const void *address, size_t size) {
 
 #endif /* CONFIG_KASAN_GENERIC */
 
+#ifndef kasan_arch_is_ready
+static inline bool kasan_arch_is_ready(void)   { return true; }
+#elif !defined(CONFIG_KASAN_GENERIC) || !defined(CONFIG_KASAN_OUTLINE)
+#error kasan_arch_is_ready only works in KASAN generic outline mode!
+#endif
+
 /*
  * Exported functions for interfaces called from assembly or from generated
  * code. Declarations here to avoid warning about missing declarations.
diff --git a/mm/kasan/shadow.c b/mm/kasan/shadow.c
index 082ee5b6d9a1..8d95ee52d019 100644
--- a/mm/kasan/shadow.c
+++ b/mm/kasan/shadow.c
@@ -73,6 +73,9 @@ void kasan_poison(const void *addr, size_t size, u8 value, bool init)
 {
void *shadow_start, *shadow_end;
 
+   if (!kasan_arch_is_ready())
+   return;
+
/*
 * Perform shadow offset calculation based on untagged address, as
 * some of the callers (e.g. kasan_poison_object_data) pass tagged
@@ -99,6 +102,9 @@ EXPORT_SYMBOL(kasan_poison);
 #ifdef CONFIG_KASAN_GENERIC
 void kasan_poison_last_granule(const void *addr, size_t size)
 {
+   if (!kasan_arch_is_ready())
+   return;
+
if (size & KASAN_GRANULE_MASK) {
u8 *shadow = (u8 *)kasan_mem_to_shadow(addr + size);
*shadow = size & KASAN_GRANULE_MASK;
-- 
2.30.2



[PATCH v16 1/4] kasan: allow an architecture to disable inline instrumentation

2021-06-23 Thread Daniel Axtens
For annoying architectural reasons, it's very difficult to support inline
instrumentation on powerpc64.*

Add a Kconfig flag to allow an arch to disable inline. (It's a bit
annoying to be 'backwards', but I'm not aware of any way to have
an arch force a symbol to be 'n', rather than 'y'.)

We also disable stack instrumentation in this case as it does things that
are functionally equivalent to inline instrumentation, namely adding
code that touches the shadow directly without going through a C helper.

* on ppc64 atm, the shadow lives in virtual memory and isn't accessible in
real mode. However, before we turn on virtual memory, we parse the device
tree to determine which platform and MMU we're running under. That calls
generic DT code, which is instrumented. Inline instrumentation in DT would
unconditionally attempt to touch the shadow region, which we won't have
set up yet, and would crash. We can make outline mode wait for the arch to
be ready, but we can't change what the compiler inserts for inline mode.
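
A hypothetical arch would then opt out from its Kconfig with something
like:

	config XXX_ARCH
		bool
		select ARCH_DISABLE_KASAN_INLINE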

Reviewed-by: Marco Elver 
Signed-off-by: Daniel Axtens 
---
 lib/Kconfig.kasan | 12 
 1 file changed, 12 insertions(+)

diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
index cffc2ebbf185..c3b228828a80 100644
--- a/lib/Kconfig.kasan
+++ b/lib/Kconfig.kasan
@@ -12,6 +12,13 @@ config HAVE_ARCH_KASAN_HW_TAGS
 config HAVE_ARCH_KASAN_VMALLOC
bool
 
+config ARCH_DISABLE_KASAN_INLINE
+   bool
+   help
+ An architecture might not support inline instrumentation.
+ When this option is selected, inline and stack instrumentation are
+ disabled.
+
 config CC_HAS_KASAN_GENERIC
def_bool $(cc-option, -fsanitize=kernel-address)
 
@@ -130,6 +137,7 @@ config KASAN_OUTLINE
 
 config KASAN_INLINE
bool "Inline instrumentation"
+   depends on !ARCH_DISABLE_KASAN_INLINE
help
  Compiler directly inserts code checking shadow memory before
  memory accesses. This is faster than outline (in some workloads
@@ -141,6 +149,7 @@ endchoice
 config KASAN_STACK
bool "Enable stack instrumentation (unsafe)" if CC_IS_CLANG && 
!COMPILE_TEST
depends on KASAN_GENERIC || KASAN_SW_TAGS
+   depends on !ARCH_DISABLE_KASAN_INLINE
default y if CC_IS_GCC
help
  The LLVM stack address sanitizer has a know problem that
@@ -154,6 +163,9 @@ config KASAN_STACK
  but clang users can still enable it for builds without
  CONFIG_COMPILE_TEST.  On gcc it is assumed to always be safe
  to use and enabled by default.
+ If the architecture disables inline instrumentation, stack
+ instrumentation is also disabled as it adds inline-style
+ instrumentation that is run unconditionally.
 
 config KASAN_SW_TAGS_IDENTIFY
bool "Enable memory corruption identification"
-- 
2.30.2



[PATCH v16 0/4] KASAN core changes for ppc64 radix KASAN

2021-06-23 Thread Daniel Axtens
Building on the work of Christophe, Aneesh and Balbir, I've ported
KASAN to 64-bit Book3S kernels running on the Radix MMU. I've been
trying this for a while, but we keep having collisions between the
kasan code in the mm tree and the code I want to put into the ppc
tree.

This series just contains the kasan core changes that we need. These
can go in via the mm tree. I will then propose the powerpc changes for
a later cycle. (The most recent RFC for the powerpc changes is in the
v12 series at
https://lore.kernel.org/linux-mm/20210615014705.2234866-1-...@axtens.net/
)

v16 applies to next-20210622. There should be no noticeable changes to
other platforms.

Changes since v15: Review comments from Andrey. Thanks Andrey.

Changes since v14: Included a bunch of Reviewed-by:s, thanks
Christophe and Marco. Cleaned up the build time error #ifdefs, thanks
Christophe.

Changes since v13: move the MAX_PTR_PER_* definitions out of kasan and
into pgtable.h. Add a build time error to hopefully prevent any
confusion about when the new hook is applicable. Thanks Marco and
Christophe.

Changes since v12: respond to Marco's review comments - clean up the
help for ARCH_DISABLE_KASAN_INLINE, and add an arch readiness check to
the new granule poisioning function. Thanks Marco.

Daniel Axtens (4):
  kasan: allow an architecture to disable inline instrumentation
  kasan: allow architectures to provide an outline readiness check
  mm: define default MAX_PTRS_PER_* in include/pgtable.h
  kasan: use MAX_PTRS_PER_* for early shadow tables

 arch/s390/include/asm/pgtable.h |  2 --
 include/asm-generic/pgtable-nop4d.h |  1 -
 include/linux/kasan.h   |  6 +++---
 include/linux/pgtable.h | 22 ++
 lib/Kconfig.kasan   | 12 
 mm/kasan/common.c   |  3 +++
 mm/kasan/generic.c  |  3 +++
 mm/kasan/init.c |  6 +++---
 mm/kasan/kasan.h|  6 ++
 mm/kasan/shadow.c   |  6 ++
 10 files changed, 58 insertions(+), 9 deletions(-)

-- 
2.30.2



Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details

2021-06-23 Thread David Gibson
On Thu, Jun 17, 2021 at 04:29:01PM +0530, Aneesh Kumar K.V wrote:
> On 6/17/21 1:16 PM, David Gibson wrote:
> > On Tue, Jun 15, 2021 at 12:35:17PM +0530, Aneesh Kumar K.V wrote:
> > > David Gibson  writes:
> > > 
> > > > On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:
> > > > > David Gibson  writes:
> 
> ...
> 
> > > > It's weird to me that you'd want to consider them in different nodes
> > > > for those different purposes.
> > > 
> > > 
> > > --
> > >|NUMA node0 |
> > >|ProcA -> MEMA  |
> > >| | |
> > >|  | |
> > >|  ---> PMEMB|
> > >|   |
> > > ---
> > > 
> > > ---
> > >|NUMA node1 |
> > >|   |
> > >|ProcB ---> MEMC|
> > >|  | |
> > >|  ---> PMEMD|
> > >|   |
> > >|   |
> > > ---
> > > 
> > > For a topology like the above application running of ProcA wants to find 
> > > out
> > > persistent memory mount local to its NUMA node. Hence when using it as
> > > pmem fsdax mount or devdax device we want PMEMB to have associativity
> > > of NUMA node0 and PMEMD to have associativity of NUMA node 1. But when
> > > we want to use it as memory using dax kmem driver, we want both PMEMB
> > > and PMEMD to appear as memory only NUMA node at a distance that is
> > > derived based on the latency of the media.
> > 
> > I'm still not understanding why the latency we care about is different
> > in the two cases.  Can you give an example of when this would result
> > in different actual node assignments for the two different cases?
> > 
> 
> In the above example in order allow use of PMEMB and PMEMD as memory only
> NUMA nodes
> we need platform to represent them in its own domainID. Let's assume that
> platform assigned id 40 and 41 and hence both PMEMB and PMEMD will have
> associativity array like below
> 
> { 4, 6, 0}  -> PROCA/MEMA
> { 4, 6, 40} -> PMEMB
> { 4, 6, 41} -> PMEMD
> { 4, 6, 1} ->  PROCB/MEMB
> 
> When we want to use this device PMEMB and PMEMD as fsdax/devdax devices, we
> essentially look for the first nearest online node. Which means both PMEMB
> and PMEMD will appear as devices attached to node0. That is not ideal for
> for many applications.

Not if you actually look at the distance table which tells you that
PMEMB is closer to node0 and PMEMD is closer to node1.  That's exactly
what the distance table is for - making this information explicit,
rather than intuited from a confusing set of nested domains.
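
To make that concrete (distances invented for illustration), a FORM2
distance table for the topology above might read:

	from\to  node0  node1  PMEMB(40)  PMEMD(41)
	node0       10     20         40         80
	node1       20     10         80         40

An application on ProcA asking for the nearest persistent memory then
picks PMEMB (distance 40 vs 80) straight from the table, with no
guessing from nested associativity domains.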

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v4 4/7] powerpc/pseries: Consolidate DLPAR NUMA distance update

2021-06-23 Thread David Gibson
On Thu, Jun 17, 2021 at 10:21:02PM +0530, Aneesh Kumar K.V wrote:
> The associativity details of the newly added resourced are collected from
> the hypervisor via "ibm,configure-connector" rtas call. Update the numa
> distance details of the newly added numa node after the above call. In
> later patch we will remove updating NUMA distance when we are looking
> for node id from associativity array.
> 
> Signed-off-by: Aneesh Kumar K.V 

I think this patch and the next would be easier to review if merged
together.  That would make the fact that this is (half of) a code
motion clearer.

> ---
>  arch/powerpc/mm/numa.c| 41 +++
>  arch/powerpc/platforms/pseries/hotplug-cpu.c  |  2 +
>  .../platforms/pseries/hotplug-memory.c|  2 +
>  arch/powerpc/platforms/pseries/pseries.h  |  1 +
>  4 files changed, 46 insertions(+)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0ec16999beef..645a95e3a7ea 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -287,6 +287,47 @@ int of_node_to_nid(struct device_node *device)
>  }
>  EXPORT_SYMBOL(of_node_to_nid);
>  
> +static void __initialize_form1_numa_distance(const __be32 *associativity)
> +{
> + int i, nid;
> +
> + if (of_read_number(associativity, 1) >= primary_domain_index) {
> + nid = of_read_number(&associativity[primary_domain_index], 1);
> +
> + for (i = 0; i < max_domain_index; i++) {
> + const __be32 *entry;
> +
> + entry = 
> &associativity[be32_to_cpu(distance_ref_points[i])];
> + distance_lookup_table[nid][i] = of_read_number(entry, 
> 1);
> + }
> + }
> +}
> +
> +static void initialize_form1_numa_distance(struct device_node *node)
> +{
> + const __be32 *associativity;
> +
> + associativity = of_get_associativity(node);
> + if (!associativity)
> + return;
> +
> + __initialize_form1_numa_distance(associativity);
> + return;
> +}
> +
> +/*
> + * Used to update distance information w.r.t newly added node.
> + */
> +void update_numa_distance(struct device_node *node)
> +{
> + if (affinity_form == FORM0_AFFINITY)
> + return;
> + else if (affinity_form == FORM1_AFFINITY) {
> + initialize_form1_numa_distance(node);
> + return;
> + }
> +}
> +
>  static int __init find_primary_domain_index(void)
>  {
>   int index;
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 7e970f81d8ff..778b6ab35f0d 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -498,6 +498,8 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
>   return saved_rc;
>   }
>  
> + update_numa_distance(dn);
> +
>   rc = dlpar_online_cpu(dn);
>   if (rc) {
>   saved_rc = rc;
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index 8377f1f7c78e..0e602c3b01ea 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -180,6 +180,8 @@ static int update_lmb_associativity_index(struct 
> drmem_lmb *lmb)
>   return -ENODEV;
>   }
>  
> + update_numa_distance(lmb_node);
> +
>   dr_node = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
>   if (!dr_node) {
>   dlpar_free_cc_nodes(lmb_node);
> diff --git a/arch/powerpc/platforms/pseries/pseries.h 
> b/arch/powerpc/platforms/pseries/pseries.h
> index 1f051a786fb3..663a0859cf13 100644
> --- a/arch/powerpc/platforms/pseries/pseries.h
> +++ b/arch/powerpc/platforms/pseries/pseries.h
> @@ -113,4 +113,5 @@ extern u32 pseries_security_flavor;
>  void pseries_setup_security_mitigations(void);
>  void pseries_lpar_read_hblkrm_characteristics(void);
>  
> +void update_numa_distance(struct device_node *node);
>  #endif /* _PSERIES_PSERIES_H */

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v4 2/7] powerpc/pseries: rename distance_ref_points_depth to max_associativity_domain_index

2021-06-23 Thread David Gibson
On Thu, Jun 17, 2021 at 10:21:00PM +0530, Aneesh Kumar K.V wrote:
> No functional change in this patch

I've been convinced of your other rename, but I'm not yet convinced
this one actually clarifies anything.

> 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/mm/numa.c | 20 ++--
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 8365b298ec48..132813dd1a6c 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -56,7 +56,7 @@ static int n_mem_addr_cells, n_mem_size_cells;
>  static int form1_affinity;
>  
>  #define MAX_DISTANCE_REF_POINTS 4
> -static int distance_ref_points_depth;
> +static int max_associativity_domain_index;
>  static const __be32 *distance_ref_points;
>  static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
>  
> @@ -169,7 +169,7 @@ int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
>  
>   int i, index;
>  
> - for (i = 0; i < distance_ref_points_depth; i++) {
> + for (i = 0; i < max_associativity_domain_index; i++) {
>   index = be32_to_cpu(distance_ref_points[i]);
>   if (cpu1_assoc[index] == cpu2_assoc[index])
>   break;
> @@ -193,7 +193,7 @@ int __node_distance(int a, int b)
>   if (!form1_affinity)
>   return ((a == b) ? LOCAL_DISTANCE : REMOTE_DISTANCE);
>  
> - for (i = 0; i < distance_ref_points_depth; i++) {
> + for (i = 0; i < max_associativity_domain_index; i++) {
>   if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
>   break;
>  
> @@ -213,7 +213,7 @@ static void initialize_distance_lookup_table(int nid,
>   if (!form1_affinity)
>   return;
>  
> - for (i = 0; i < distance_ref_points_depth; i++) {
> + for (i = 0; i < max_associativity_domain_index; i++) {
>   const __be32 *entry;
>  
>   entry = &associativity[be32_to_cpu(distance_ref_points[i]) - 1];
> @@ -240,7 +240,7 @@ static int associativity_to_nid(const __be32 
> *associativity)
>   nid = NUMA_NO_NODE;
>  
>   if (nid > 0 &&
> - of_read_number(associativity, 1) >= distance_ref_points_depth) {
> + of_read_number(associativity, 1) >= 
> max_associativity_domain_index) {
>   /*
>* Skip the length field and send start of associativity array
>*/
> @@ -310,14 +310,14 @@ static int __init find_primary_domain_index(void)
>*/
>   distance_ref_points = of_get_property(root,
>   "ibm,associativity-reference-points",
> - &distance_ref_points_depth);
> + &max_associativity_domain_index);
>  
>   if (!distance_ref_points) {
>   dbg("NUMA: ibm,associativity-reference-points not found.\n");
>   goto err;
>   }
>  
> - distance_ref_points_depth /= sizeof(int);
> + max_associativity_domain_index /= sizeof(int);
>  
>   if (firmware_has_feature(FW_FEATURE_OPAL) ||
>   firmware_has_feature(FW_FEATURE_TYPE1_AFFINITY)) {
> @@ -328,7 +328,7 @@ static int __init find_primary_domain_index(void)
>   if (form1_affinity) {
>   index = of_read_number(distance_ref_points, 1);
>   } else {
> - if (distance_ref_points_depth < 2) {
> + if (max_associativity_domain_index < 2) {
>   printk(KERN_WARNING "NUMA: "
>   "short ibm,associativity-reference-points\n");
>   goto err;
> @@ -341,10 +341,10 @@ static int __init find_primary_domain_index(void)
>* Warn and cap if the hardware supports more than
>* MAX_DISTANCE_REF_POINTS domains.
>*/
> - if (distance_ref_points_depth > MAX_DISTANCE_REF_POINTS) {
> + if (max_associativity_domain_index > MAX_DISTANCE_REF_POINTS) {
>   printk(KERN_WARNING "NUMA: distance array capped at "
>   "%d entries\n", MAX_DISTANCE_REF_POINTS);
> - distance_ref_points_depth = MAX_DISTANCE_REF_POINTS;
> + max_associativity_domain_index = MAX_DISTANCE_REF_POINTS;
>   }
>  
>   of_node_put(root);

-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


signature.asc
Description: PGP signature


Re: [PATCH v4 7/7] powerpc/pseries: Add support for FORM2 associativity

2021-06-23 Thread David Gibson
On Thu, Jun 17, 2021 at 10:21:05PM +0530, Aneesh Kumar K.V wrote:
> PAPR interface currently supports two different ways of communicating resource
> grouping details to the OS. These are referred to as Form 0 and Form 1
> associativity grouping. Form 0 is the older format and is now considered
> deprecated. This patch adds another resource grouping named FORM2.
> 
> Signed-off-by: Daniel Henrique Barboza 
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  Documentation/powerpc/associativity.rst   | 135 
>  arch/powerpc/include/asm/firmware.h   |   3 +-
>  arch/powerpc/include/asm/prom.h   |   1 +
>  arch/powerpc/kernel/prom_init.c   |   3 +-
>  arch/powerpc/mm/numa.c| 149 +-
>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>  6 files changed, 286 insertions(+), 6 deletions(-)
>  create mode 100644 Documentation/powerpc/associativity.rst
> 
> diff --git a/Documentation/powerpc/associativity.rst 
> b/Documentation/powerpc/associativity.rst
> new file mode 100644
> index ..93be604ac54d
> --- /dev/null
> +++ b/Documentation/powerpc/associativity.rst
> @@ -0,0 +1,135 @@
> +
> +NUMA resource associativity
> +===========================
> +
> +Associativity represents the groupings of the various platform resources into
> +domains of substantially similar mean performance relative to resources 
> outside
> +of that domain. Resources subsets of a given domain that exhibit better
> +performance relative to each other than relative to other resources subsets
> +are represented as being members of a sub-grouping domain. This performance
> +characteristic is presented in terms of NUMA node distance within the Linux 
> kernel.
> +From the platform view, these groups are also referred to as domains.
> +
> +PAPR interface currently supports different ways of communicating these 
> resource
> +grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
> +associativity grouping. Form 0 is the older format and is now considered 
> deprecated.
> +
> +Hypervisor indicates the type/form of associativity used via 
> "ibm,arcitecture-vec-5 property".
> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of 
> Form 0 or Form 1.
> +A value of 1 indicates the usage of Form 1 associativity. For Form 2 
> associativity
> +bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
> +
> +Form 0
> +------
> +Form 0 associativity supports only two NUMA distances (LOCAL and REMOTE).
> +
> +Form 1
> +------
> +With Form 1 a combination of ibm,associativity-reference-points and 
> ibm,associativity
> +device tree properties are used to determine the NUMA distance between 
> resource groups/domains.
> +
> +The “ibm,associativity” property contains one or more lists of numbers 
> (domainID)
> +representing the resource’s platform grouping domains.
> +
> +The “ibm,associativity-reference-points” property contains one or more list 
> of numbers
> +(domainID index) that represents the 1 based ordinal in the associativity 
> lists.
> +The list of domainID index represnets increasing hierachy of
> resource grouping.

Typo "represnets".  Also s/hierachy/hierarchy/

> +
> +ex:
> +{ primary domainID index, secondary domainID index, tertiary domainID 
> index.. }

> +Linux kernel uses the domainID at the primary domainID index as the NUMA 
> node id.
> +Linux kernel computes NUMA distance between two domains by recursively 
> comparing
> +if they belong to the same higher-level domains. For mismatch at every higher
> +level of the resource group, the kernel doubles the NUMA distance between the
> +comparing domains.

The Form1 description is still kinda confusing, but I don't really
care.  Form1 *is* confusing, it's Form2 that I hope will be clearer.
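For illustration, the doubling rule described in the quoted text boils down
to the __node_distance() loop shown earlier in this series (a sketch, with
LOCAL_DISTANCE = 10 as in the kernel, using the renamed variables):

	int distance = 10;	/* LOCAL_DISTANCE */

	for (i = 0; i < max_associativity_domain_index; i++) {
		if (distance_lookup_table[a][i] == distance_lookup_table[b][i])
			break;		/* domains match at this level */
		distance *= 2;		/* mismatch: double the distance */
	}

	/* match at the first level: 10; one mismatched level: 20;
	 * two mismatched levels: 40; and so on. */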

> +
> +Form 2
> +------
> +Form 2 associativity format adds separate device tree properties 
> representing NUMA node distance
> +thereby making the node distance computation flexible. Form 2 also allows 
> flexible primary
> +domain numbering. With numa distance computation now detached from the index 
> value of
> +"ibm,associativity" property, Form 2 allows a large number of primary domain 
> ids at the
> +same domainID index representing resource groups of different 
> performance/latency characteristics.

So, I see you've removed the special handling of secondary IDs for pmem
- big improvement, thanks.  IIUC, in this revised version, for Form2
there's really no reason for ibm,associativity-reference-points to
have >1 entry.  Is that right?

In Form2 everything revolves around the primary domain ID - so much
that I suggest we come up with a short name for it.  How about just
"node id" since that's how Linux uses it.

> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 
> in the
> +"ibm,architecture-vec-5" property.
> +
> +"ibm,numa-lookup-index-table" property contains one or more list numbers 
> representing
> +

Re: [PATCH v4 1/7] powerpc/pseries: rename min_common_depth to primary_domain_index

2021-06-23 Thread David Gibson
On Thu, Jun 17, 2021 at 10:20:59PM +0530, Aneesh Kumar K.V wrote:
> No functional change in this patch.
> 
> Signed-off-by: Aneesh Kumar K.V 

Reviewed-by: David Gibson 

> ---
>  arch/powerpc/mm/numa.c | 38 +++---
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index f2bf98bdcea2..8365b298ec48 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -51,7 +51,7 @@ EXPORT_SYMBOL(numa_cpu_lookup_table);
>  EXPORT_SYMBOL(node_to_cpumask_map);
>  EXPORT_SYMBOL(node_data);
>  
> -static int min_common_depth;
> +static int primary_domain_index;
>  static int n_mem_addr_cells, n_mem_size_cells;
>  static int form1_affinity;
>  
> @@ -232,8 +232,8 @@ static int associativity_to_nid(const __be32 
> *associativity)
>   if (!numa_enabled)
>   goto out;
>  
> - if (of_read_number(associativity, 1) >= min_common_depth)
> - nid = of_read_number(&associativity[min_common_depth], 1);
> + if (of_read_number(associativity, 1) >= primary_domain_index)
> + nid = of_read_number(&associativity[primary_domain_index], 1);
>  
>   /* POWER4 LPAR uses 0x as invalid node */
>   if (nid == 0x || nid >= nr_node_ids)
> @@ -284,9 +284,9 @@ int of_node_to_nid(struct device_node *device)
>  }
>  EXPORT_SYMBOL(of_node_to_nid);
>  
> -static int __init find_min_common_depth(void)
> +static int __init find_primary_domain_index(void)
>  {
> - int depth;
> + int index;
>   struct device_node *root;
>  
>   if (firmware_has_feature(FW_FEATURE_OPAL))
> @@ -326,7 +326,7 @@ static int __init find_min_common_depth(void)
>   }
>  
>   if (form1_affinity) {
> - depth = of_read_number(distance_ref_points, 1);
> + index = of_read_number(distance_ref_points, 1);
>   } else {
>   if (distance_ref_points_depth < 2) {
>   printk(KERN_WARNING "NUMA: "
> @@ -334,7 +334,7 @@ static int __init find_min_common_depth(void)
>   goto err;
>   }
>  
> - depth = of_read_number(&distance_ref_points[1], 1);
> + index = of_read_number(&distance_ref_points[1], 1);
>   }
>  
>   /*
> @@ -348,7 +348,7 @@ static int __init find_min_common_depth(void)
>   }
>  
>   of_node_put(root);
> - return depth;
> + return index;
>  
>  err:
>   of_node_put(root);
> @@ -437,16 +437,16 @@ int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>   int nid = default_nid;
>   int rc, index;
>  
> - if ((min_common_depth < 0) || !numa_enabled)
> + if ((primary_domain_index < 0) || !numa_enabled)
>   return default_nid;
>  
>   rc = of_get_assoc_arrays(&aa);
>   if (rc)
>   return default_nid;
>  
> - if (min_common_depth <= aa.array_sz &&
> + if (primary_domain_index <= aa.array_sz &&
>   !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < 
> aa.n_arrays) {
> - index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
> + index = lmb->aa_index * aa.array_sz + primary_domain_index - 1;
>   nid = of_read_number(&aa.arrays[index], 1);
>  
>   if (nid == 0x || nid >= nr_node_ids)
> @@ -708,18 +708,18 @@ static int __init parse_numa_properties(void)
>   return -1;
>   }
>  
> - min_common_depth = find_min_common_depth();
> + primary_domain_index = find_primary_domain_index();
>  
> - if (min_common_depth < 0) {
> + if (primary_domain_index < 0) {
>   /*
> -  * if we fail to parse min_common_depth from device tree
> +  * if we fail to parse primary_domain_index from device tree
>* mark the numa disabled, boot with numa disabled.
>*/
>   numa_enabled = false;
> - return min_common_depth;
> + return primary_domain_index;
>   }
>  
> - dbg("NUMA associativity depth for CPU/Memory: %d\n", min_common_depth);
> + dbg("NUMA associativity depth for CPU/Memory: %d\n", 
> primary_domain_index);
>  
>   /*
>* Even though we connect cpus to numa domains later in SMP
> @@ -919,14 +919,14 @@ static void __init find_possible_nodes(void)
>   goto out;
>   }
>  
> - max_nodes = of_read_number(&domains[min_common_depth], 1);
> + max_nodes = of_read_number(&domains[primary_domain_index], 1);
>   for (i = 0; i < max_nodes; i++) {
>   if (!node_possible(i))
>   node_set(i, node_possible_map);
>   }
>  
>   prop_length /= sizeof(int);
> - if (prop_length > min_common_depth + 2)
> + if (prop_length > primary_domain_index + 2)
>   coregroup_enabled = 1;
>  
>  out:
> @@ -1259,7 +1259,7 @@ int cpu_to_coregroup_id(int cpu)
>   goto out;
>  
>   index = of_read_number(associativity, 1);
> -

Re: [PATCH 2/3] powerpc: Define swapper_pg_dir[] in C

2021-06-23 Thread Daniel Axtens
Michael Ellerman  writes:

> Daniel Axtens  writes:
>> Hi Christophe,
>>
>> This breaks booting a radix KVM guest with 4k pages for me:
>>
>> make pseries_le_defconfig
>> scripts/config -d CONFIG_PPC_64K_PAGES
>> scripts/config -e CONFIG_PPC_4K_PAGES
>> make vmlinux
>> sudo qemu-system-ppc64 -enable-kvm -M pseries -m 1G -nographic -vga none 
>> -smp 4 -cpu host -kernel vmlinux
>>
>> Boot hangs after printing 'Booting Linux via __start()' and qemu's 'info
>> registers' reports that it's stuck at the instruction fetch exception.
>>
>> My host is Power9, 64k page size radix, and
>> gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 
>> 2.34
>>
>
> ...
>>> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
>>> index 730838c7ca39..79f2d1e61abd 100644
>>> --- a/arch/powerpc/kernel/head_64.S
>>> +++ b/arch/powerpc/kernel/head_64.S
>>> @@ -997,18 +997,3 @@ start_here_common:
>>>  0: trap
>>> EMIT_BUG_ENTRY 0b, __FILE__, __LINE__, 0
>>> .previous
>>> -
>>> -/*
>>> - * We put a few things here that have to be page-aligned.
>>> - * This stuff goes at the beginning of the bss, which is page-aligned.
>>> - */
>>> -   .section ".bss"
>>> -/*
>>> - * pgd dir should be aligned to PGD_TABLE_SIZE which is 64K.
>>> - * We will need to find a better way to fix this
>>> - */
>>> -   .align  16
>>> -
>>> -   .globl  swapper_pg_dir
>>> -swapper_pg_dir:
>>> -   .space  PGD_TABLE_SIZE
>
> This is now 4K aligned whereas it used to be 64K.
>
> This fixes it and is not completely ugly?
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index 1707ab580ee2..298469beaa90 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -28,7 +28,13 @@
>  #include 
>  #include 
>  
> -pgd_t swapper_pg_dir[MAX_PTRS_PER_PGD] __page_aligned_bss;
> +#ifdef CONFIG_PPC64
> +#define PGD_ALIGN 0x10000
> +#else
> +#define PGD_ALIGN PAGE_SIZE
> +#endif
> +
> +pgd_t swapper_pg_dir[MAX_PTRS_PER_PGD] __section(".bss..page_aligned") 
> __aligned(PGD_ALIGN);

The fix works for me, thank you.

Kind regards,
Daniel
>  
>  static inline int is_exec_fault(void)
>  {
>
>
> cheers


Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-23 Thread Qian Cai



On 6/23/2021 2:37 PM, Will Deacon wrote:
> On Wed, Jun 23, 2021 at 12:39:29PM -0400, Qian Cai wrote:
>>
>>
>> On 6/18/2021 11:40 PM, Claire Chang wrote:
>>> Propagate the swiotlb_force into io_tlb_default_mem->force_bounce and
>>> use it to determine whether to bounce the data or not. This will be
>>> useful later to allow for different pools.
>>>
>>> Signed-off-by: Claire Chang 
>>> Reviewed-by: Christoph Hellwig 
>>> Tested-by: Stefano Stabellini 
>>> Tested-by: Will Deacon 
>>> Acked-by: Stefano Stabellini 
>>
>> Reverting the rest of the series up to this patch fixed a boot crash with 
>> NVMe on today's linux-next.
> 
> Hmm, so that makes patch 7 the suspicious one, right?

Will, no. It is rather patch #6 (this patch). Only the patches from #6 to #12
were reverted to fix the issue. Also, looking at the offset of the crash,

pc : dma_direct_map_sg+0x304/0x8f0
is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119

is_swiotlb_force_bounce() is the new function introduced in this patch:

+static inline bool is_swiotlb_force_bounce(struct device *dev)
+{
+   return dev->dma_io_tlb_mem->force_bounce;
+}
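If dev->dma_io_tlb_mem can be NULL for a device that never got a swiotlb
pool - an assumption based on the faulting address, not a confirmed
diagnosis - a defensive variant would look like this sketch:

static inline bool is_swiotlb_force_bounce(struct device *dev)
{
	struct io_tlb_mem *mem = dev->dma_io_tlb_mem;

	/* guard the dereference for devices with no io_tlb_mem */
	return mem && mem->force_bounce;
}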

> 
> Looking at that one more closely, it looks like swiotlb_find_slots() takes
> 'alloc_size + offset' as its 'alloc_size' parameter from
> swiotlb_tbl_map_single() and initialises 'mem->slots[i].alloc_size' based
> on 'alloc_size + offset', which looks like a change in behaviour from the
> old code, which didn't include the offset there.
> 
> swiotlb_release_slots() then adds the offset back on afaict, so we end up
> accounting for it twice and possibly unmap more than we're supposed to?
> 
> Will
> 


Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-23 Thread Will Deacon
On Wed, Jun 23, 2021 at 12:39:29PM -0400, Qian Cai wrote:
> 
> 
> On 6/18/2021 11:40 PM, Claire Chang wrote:
> > Propagate the swiotlb_force into io_tlb_default_mem->force_bounce and
> > use it to determine whether to bounce the data or not. This will be
> > useful later to allow for different pools.
> > 
> > Signed-off-by: Claire Chang 
> > Reviewed-by: Christoph Hellwig 
> > Tested-by: Stefano Stabellini 
> > Tested-by: Will Deacon 
> > Acked-by: Stefano Stabellini 
> 
> Reverting the rest of the series up to this patch fixed a boot crash with 
> NVMe on today's linux-next.

Hmm, so that makes patch 7 the suspicious one, right?

Looking at that one more closely, it looks like swiotlb_find_slots() takes
'alloc_size + offset' as its 'alloc_size' parameter from
swiotlb_tbl_map_single() and initialises 'mem->slots[i].alloc_size' based
on 'alloc_size + offset', which looks like a change in behaviour from the
old code, which didn't include the offset there.

swiotlb_release_slots() then adds the offset back on afaict, so we end up
accounting for it twice and possibly unmap more than we're supposed to?
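A toy model of the suspected double-accounting, with made-up numbers and
none of the real swiotlb structures:

	unsigned int alloc_size = 4096, offset = 512;

	/* map path: swiotlb_find_slots() is handed alloc_size + offset,
	 * so the recorded slot size already includes the offset */
	unsigned int recorded = alloc_size + offset;	/* 4608 */

	/* release path: if the offset is added back on top of the
	 * recorded size, the freed range exceeds what was mapped */
	unsigned int released = recorded + offset;	/* 5120 */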

Will


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Odin Ugedal
On Wed, 23 Jun 2021 at 19:27, Vincent Guittot wrote:
>
> On Wed, 23 Jun 2021 at 18:55, Vincent Guittot
>  wrote:
> >
> > On Wed, 23 Jun 2021 at 18:46, Sachin Sant  
> > wrote:
> > >
> > >
> > > > Ok. This becomes even more weird. Could you share your config file and 
> > > > more details about
> > > > your setup?
> > > >
> > > > Have you applied the patch below ?
> > > > https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
> > > >
> > > > Regarding the load_avg warning, I can see a possible problem during 
> > > > attach. Could you add
> > > > the patch below. The load_avg warning seems to happen during boot and 
> > > > sched_entity
> > > > creation.
> > > >
> > >
> > > Here is a summary of my testing.
> > >
> > > I have a POWER box with PowerVM hypervisor. On this box I have a logical 
> > > partition(LPAR) or guest
> > > (allocated with 32 cpus 90G memory) running linux-next.
> > >
> > > I started with a clean slate.
> > > Moved to linux-next 5.13.0-rc7-next-20210622 as base code.
> > > Applied patch #1 from Vincent which contains changes to dequeue_load_avg()
> > > Applied patch #2 from Vincent which contains changes to enqueue_load_avg()
> > > Applied patch #3 from Vincent which contains changes to 
> > > attach_entity_load_avg()
> > > Applied patch #4 from 
> > > https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
> > >
> > > With these changes applied I was still able to recreate the issue. I 
> > > could see kernel warning
> > > during boot.
> > >
> > > I then applied patch #5 from Odin which contains changes to 
> > > update_cfs_rq_load_avg()
> > >
> > > With all the 5 patches applied I was able to boot the kernel without any 
> > > warning messages.
> > > I also ran scheduler related tests from ltp (./runltp -f sched) . All 
> > > tests including cfs_bandwidth01
> > > ran successfully. No kernel warnings were observed.
> >
> > ok so Odin's patch fixes the problem which highlights that we
> > overestimate _sum or don't sync _avg and _sum correctly
> >
> > I'm going to look at this further
>
> The problem is that "_avg * divider" assumes that all pending
> contributions are non-null, whereas they can be null.

Yeah.

> Odin's patch is the right way to fix this. The other patches should not
> be needed for your problem.

Ack. As I see it, given how PELT works now, it is the only way to
mitigate it (without doing a lot of extra PELT stuff).
Will post it as a patch together with a proper message later today or tomorrow.

>
> >
> > >
> > > Have also attached .config in case it is useful. config has 
> > > CONFIG_HZ_100=y
> >
> > Thanks, I will have a look
> >
> > >
> > > Thanks
> > > -Sachin
> > >

Thanks for reporting Sachin!

Thanks
Odin


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 18:55, Vincent Guittot
 wrote:
>
> On Wed, 23 Jun 2021 at 18:46, Sachin Sant  wrote:
> >
> >
> > > Ok. This becomes even more weird. Could you share your config file and 
> > > more details about
> > > your setup?
> > >
> > > Have you applied the patch below ?
> > > https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
> > >
> > > Regarding the load_avg warning, I can see a possible problem during attach. 
> > > Could you add
> > > the patch below. The load_avg warning seems to happen during boot and 
> > > sched_entity
> > > creation.
> > >
> >
> > Here is a summary of my testing.
> >
> > I have a POWER box with PowerVM hypervisor. On this box I have a logical 
> > partition(LPAR) or guest
> > (allocated with 32 cpus 90G memory) running linux-next.
> >
> > I started with a clean slate.
> > Moved to linux-next 5.13.0-rc7-next-20210622 as base code.
> > Applied patch #1 from Vincent which contains changes to dequeue_load_avg()
> > Applied patch #2 from Vincent which contains changes to enqueue_load_avg()
> > Applied patch #3 from Vincent which contains changes to 
> > attach_entity_load_avg()
> > Applied patch #4 from 
> > https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
> >
> > With these changes applied I was still able to recreate the issue. I could 
> > see kernel warning
> > during boot.
> >
> > I then applied patch #5 from Odin which contains changes to 
> > update_cfs_rq_load_avg()
> >
> > With all the 5 patches applied I was able to boot the kernel without any 
> > warning messages.
> > I also ran scheduler related tests from ltp (./runltp -f sched) . All tests 
> > including cfs_bandwidth01
> > ran successfully. No kernel warnings were observed.
>
> ok so Odin's patch fixes the problem which highlights that we
> overestimate _sum or don't sync _avg and _sum correctly
>
> I'm going to look at this further

The problem is that "_avg * divider" assumes that all pending
contributions are non-null, whereas they can be null.
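A toy example of how the clamping in sub_positive() can then leave _avg
positive while _sum hits zero (all numbers invented; the divider is
LOAD_AVG_MAX - 1024 + period_contrib):

	unsigned long load_avg = 10;
	u64 load_sum = 400000;	/* stale: below load_avg * divider (477420) */
	u32 divider = 47742;	/* LOAD_AVG_MAX - 1024 + 1024 */
	unsigned long r = 9;	/* load being removed */

	load_avg -= r;		/* -> 1 */
	/* sub_positive() clamps at zero: 9 * 47742 = 429678 > 400000 */
	load_sum = load_sum > (u64)r * divider ?
			load_sum - (u64)r * divider : 0;
	/* -> 0: _sum is zero while _avg is still 1, tripping the warning */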

Odin's patch is the right way to fix this. The other patches should not
be needed for your problem.

>
> >
> > Have also attached .config in case it is useful. config has CONFIG_HZ_100=y
>
> Thanks, I will have a look
>
> >
> > Thanks
> > -Sachin
> >


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 18:46, Sachin Sant  wrote:
>
>
> > Ok. This becomes even more weird. Could you share your config file and more 
> > details about
> > your setup?
> >
> > Have you applied the patch below ?
> > https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
> >
> > Regarding the load_avg warning, I can see a possible problem during attach. 
> > Could you add
> > the patch below. The load_avg warning seems to happen during boot and 
> > sched_entity
> > creation.
> >
>
> Here is a summary of my testing.
>
> I have a POWER box with PowerVM hypervisor. On this box I have a logical 
> partition(LPAR) or guest
> (allocated with 32 cpus 90G memory) running linux-next.
>
> I started with a clean slate.
> Moved to linux-next 5.13.0-rc7-next-20210622 as base code.
> Applied patch #1 from Vincent which contains changes to dequeue_load_avg()
> Applied patch #2 from Vincent which contains changes to enqueue_load_avg()
> Applied patch #3 from Vincent which contains changes to 
> attach_entity_load_avg()
> Applied patch #4 from 
> https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/
>
> With these changes applied I was still able to recreate the issue. I could 
> see kernel warning
> during boot.
>
> I then applied patch #5 from Odin which contains changes to 
> update_cfs_rq_load_avg()
>
> With all the 5 patches applied I was able to boot the kernel without any 
> warning messages.
> I also ran scheduler related tests from ltp (./runltp -f sched) . All tests 
> including cfs_bandwidth01
> ran successfully. No kernel warnings were observed.

ok so Odin's patch fixes the problem which highlights that we
overestimate _sum or don't sync _avg and _sum correctly

I'm going to look at this further

>
> Have also attached .config in case it is useful. config has CONFIG_HZ_100=y

Thanks, I will have a look

>
> Thanks
> -Sachin
>


Re: [PATCH v14 06/12] swiotlb: Use is_swiotlb_force_bounce for swiotlb data bouncing

2021-06-23 Thread Qian Cai



On 6/18/2021 11:40 PM, Claire Chang wrote:
> Propagate the swiotlb_force into io_tlb_default_mem->force_bounce and
> use it to determine whether to bounce the data or not. This will be
> useful later to allow for different pools.
> 
> Signed-off-by: Claire Chang 
> Reviewed-by: Christoph Hellwig 
> Tested-by: Stefano Stabellini 
> Tested-by: Will Deacon 
> Acked-by: Stefano Stabellini 

Reverting the rest of the series up to this patch fixed a boot crash with NVMe 
on today's linux-next.

[   22.286574][T7] Unable to handle kernel paging request at virtual 
address dfff800e
[   22.295225][T7] Mem abort info:
[   22.298743][T7]   ESR = 0x9604
[   22.302496][T7]   EC = 0x25: DABT (current EL), IL = 32 bits
[   22.308525][T7]   SET = 0, FnV = 0
[   22.312274][T7]   EA = 0, S1PTW = 0
[   22.316131][T7]   FSC = 0x04: level 0 translation fault
[   22.321704][T7] Data abort info:
[   22.325278][T7]   ISV = 0, ISS = 0x0004
[   22.329840][T7]   CM = 0, WnR = 0
[   22.333503][T7] [dfff800e] address between user and kernel 
address ranges
[   22.338543][  T256] igb 0006:01:00.0: Intel(R) Gigabit Ethernet Network 
Connection
[   22.341400][T7] Internal error: Oops: 9604 [#1] SMP
[   22.348915][  T256] igb 0006:01:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 
4c:38:d5:09:c8:83
[   22.354458][T7] Modules linked in: igb(+) i2c_algo_bit nvme mlx5_core(+) 
i2c_core nvme_core firmware_class
[   22.362512][  T256] igb 0006:01:00.0: eth0: PBA No: G69016-004
[   22.372287][T7] CPU: 13 PID: 7 Comm: kworker/u64:0 Not tainted 
5.13.0-rc7-next-20210623+ #47
[   22.372293][T7] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, 
BIOS 1.6 06/28/2020
[   22.372298][T7] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[   22.378145][  T256] igb 0006:01:00.0: Using MSI-X interrupts. 4 rx queue(s), 
4 tx queue(s)
[   22.386901][T7] 
[   22.386905][T7] pstate: 1005 (nzcV daif -PAN -UAO -TCO BTYPE=--)
[   22.386910][T7] pc : dma_direct_map_sg+0x304/0x8f0

is_swiotlb_force_bounce at /usr/src/linux-next/./include/linux/swiotlb.h:119
(inlined by) dma_direct_map_page at /usr/src/linux-next/kernel/dma/direct.h:90
(inlined by) dma_direct_map_sg at /usr/src/linux-next/kernel/dma/direct.c:428

[   22.386919][T7] lr : dma_map_sg_attrs+0x6c/0x118
[   22.386924][T7] sp : 80001dc8eac0
[   22.386926][T7] x29: 80001dc8eac0 x28: 199e70b0 x27: 

[   22.386935][T7] x26: 000847ee7000 x25: 80001158e570 x24: 
0002
[   22.386943][T7] x23: dfff8000 x22: 0100 x21: 
199e7460
[   22.386951][T7] x20: 199e7488 x19: 0001 x18: 
10062670
[   22.386955][  T253] Unable to handle kernel paging request at virtual 
address dfff800e
[   22.386958][T7] x17: 8000109f6a90 x16: 8000109e1b4c x15: 
89303420
[   22.386965][  T253] Mem abort info:
[   22.386967][T7] x14: 0001 x13: 80001158e000
[   22.386970][  T253]   ESR = 0x9604
[   22.386972][T7]  x12: 1fffe00108fdce01
[   22.386975][  T253]   EC = 0x25: DABT (current EL), IL = 32 bits
[   22.386976][T7] x11: 1fffe00108fdce03 x10: 000847ee700c x9 : 
0004
[   22.386981][  T253]   SET = 0, FnV = 0
[   22.386983][T7] 
[   22.386985][T7] x8 : 73b91d72
[   22.386986][  T253]   EA = 0, S1PTW = 0
[   22.386987][T7]  x7 :  x6 : 000e
[   22.386990][  T253]   FSC = 0x04: level 0 translation fault
[   22.386992][T7] 
[   22.386994][T7] x5 : dfff8000
[   22.386995][  T253] Data abort info:
[   22.386997][T7]  x4 : 0008c7ede000
[   22.386999][  T253]   ISV = 0, ISS = 0x0004
[   22.386999][T7]  x3 : 0008c7ede000
[   22.387003][T7] x2 : 1000
[   22.387003][  T253]   CM = 0, WnR = 0
[   22.387006][T7]  x1 :  x0 : 0071
[   22.387008][  T253] [dfff800e] address between user and kernel 
address ranges
[   22.387011][T7] 
[   22.387013][T7] Call trace:
[   22.387016][T7]  dma_direct_map_sg+0x304/0x8f0
[   22.387022][T7]  dma_map_sg_attrs+0x6c/0x118
[   22.387026][T7]  nvme_map_data+0x2ec/0x21d8 [nvme]
[   22.387040][T7]  nvme_queue_rq+0x274/0x3f0 [nvme]
[   22.387052][T7]  blk_mq_dispatch_rq_list+0x2ec/0x18a0
[   22.387060][T7]  __blk_mq_sched_dispatch_requests+0x2a0/0x3e8
[   22.387065][T7]  blk_mq_sched_dispatch_requests+0xa4/0x100
[   22.387070][T7]  __blk_mq_run_hw_queue+0x148/0x1d8
[   22.387075][T7]  __blk_mq_delay_run_hw_queue+0x3f8/0x730
[   22.414539][  T269] igb 0006:01:00.0 enP6p1s0: renamed from eth0
[   22.418957][T7]  blk_mq_run_hw_queue+0x148/0x248
[   22.418969][T7]  blk_mq_sched_insert_request+0x2a4/0x330
[   22.418975][T7]  blk_execute_rq_nowait+0xc8/0x118
[   22.418981][T7]  blk_execute_

Re: [PATCH 1/1] ASoC: fsl: remove unnecessary oom message

2021-06-23 Thread Mark Brown
On Thu, 17 Jun 2021 18:31:41 +0800, Zhen Lei wrote:
> Fixes scripts/checkpatch.pl warning:
> WARNING: Possible unnecessary 'out of memory' message
> 
> Removing it can help us save a bit of memory.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl: remove unnecessary oom message
  commit: 723ca2f89412abe47b7cbb276f683ddb292c172c

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 17:13, Odin Ugedal  wrote:
>
> On Wed, 23 Jun 2021 at 15:56, Vincent Guittot wrote:
> >
> >
> > The PELT value of the sched_entity is synced with the cfs_rq and its contrib
> > before being removed.
>
>
> Hmm. Not sure what you mean by sched_entity here, since this is only
> taking the "removed" load_avg
> and removing it from cfs_rq, together with (removed.load_avg *
> divider) from load_sum. (Although. ".removed" comes from
> a sched entity)

The sched_entity's load_avg that is put in removed.load is synced with
the cfs_rq PELT signal, which includes contrib, before being added to
removed.load.

>
> > Then, we start to remove this load in update_cfs_rq_load_avg() before
> > calling __update_load_avg_cfs_rq, so contrib should not have changed and
> > we should be safe
>
> For what it is worth, I am now able to reproduce it (maybe
> CONFIG_HZ=300/250 is the thing) as reported by Sachin,
> and my patch makes it disappear. Without my patch I see situations
> where _sum is zero while _avg is eg. 1 or 2 or 14 (in that range).

hmm, so there is something wrong in the propagation

> This happens for both load, runnable and util.
>
> Lets see what Sachin reports back.
>
> Thanks
> Odin


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Odin Ugedal
On Wed, 23 Jun 2021 at 15:56, Vincent Guittot wrote:
>
>
> The PELT value of the sched_entity is synced with the cfs_rq and its contrib
> before being removed.


Hmm. Not sure what you mean by sched_entity here, since this is only
taking the "removed" load_avg
and removing it from cfs_rq, together with (removed.load_avg *
divider) from load_sum. (Although. ".removed" comes from
a sched entity)

> Then, we start to remove this load in update_cfs_rq_load_avg() before
> calling __update_load_avg_cfs_rq, so contrib should not have changed and
> we should be safe

For what it is worth, I am now able to reproduce it (maybe
CONFIG_HZ=300/250 is the thing) as reported by Sachin,
and my patch makes it disappear. Without my patch I see situations
where _sum is zero while _avg is eg. 1 or 2 or 14 (in that range).
This happens for both load, runnable and util.

Lets see what Sachin reports back.

Thanks
Odin


[powerpc:next-test 164/170] arch/powerpc/kernel/interrupt.c:36:20: error: unused function 'exit_must_hard_disable'

2021-06-23 Thread kernel test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
next-test
head:   a23408e2575e49c4394f8733c78dce907286ac8e
commit: 63369fa0176120a0db94e95c3aea3b2e6bd3fe54 [164/170] powerpc/64: use 
interrupt restart table to speed up return from interrupt
config: powerpc-randconfig-r004-20210622 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 
b259740801d3515810ecc15bf0c24b0d476a1608)
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# install powerpc cross compiling tool for clang build
# apt-get install binutils-powerpc-linux-gnu
# 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=63369fa0176120a0db94e95c3aea3b2e6bd3fe54
git remote add powerpc 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git
git fetch --no-tags powerpc next-test
git checkout 63369fa0176120a0db94e95c3aea3b2e6bd3fe54
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

>> arch/powerpc/kernel/interrupt.c:36:20: error: unused function 
>> 'exit_must_hard_disable' [-Werror,-Wunused-function]
   static inline bool exit_must_hard_disable(void)
  ^
   1 error generated.


vim +/exit_must_hard_disable +36 arch/powerpc/kernel/interrupt.c

28  
29  #ifdef CONFIG_PPC_BOOK3S_64
30  DEFINE_STATIC_KEY_FALSE(interrupt_exit_not_reentrant);
31  static inline bool exit_must_hard_disable(void)
32  {
33  return static_branch_unlikely(&interrupt_exit_not_reentrant);
34  }
35  #else
  > 36  static inline bool exit_must_hard_disable(void)
37  {
38  return false;
39  }
40  #endif
41  
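One conventional way to keep a config-dependent helper like this quiet
under clang's -Wunused-function, sketched here only - not necessarily what
the series will end up doing:

static inline bool __maybe_unused exit_must_hard_disable(void)
{
	return false;
}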

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 14:37, Odin Ugedal  wrote:
>
> On Wed, 23 Jun 2021 at 14:22, Vincent Guittot wrote:
> >
> > In theory it should not, because _sum should always be larger than or equal
> > to _avg * divider. Otherwise, it means that we have something wrong
> > somewhere else.
>
> Yeah, that might be the case. Still trying to wrap my head around
> this. I might be wrong, but isn't there a possibility
> that avg->period_contrib is increasing in PELT's accumulate_sum
> without _sum increasing? This makes the pelt divider increase,
> making the statement "_sum should always be larger than or equal to _avg * divider"
> false? Or am I missing something obvious here?

The PELT value of the sched_entity is synced with the cfs_rq and its contrib
before being removed.
Then, we start to remove this load in update_cfs_rq_load_avg() before
calling __update_load_avg_cfs_rq, so contrib should not have changed and
we should be safe.

>
> Still unable to reproduce what Sachin is reporting tho.
>
> Odin


Re: nand: WARNING: a0000000.nand: the ECC used on your system (1b/256B) is too weak compared to the one required by the NAND chip (4b/512B)

2021-06-23 Thread Miquel Raynal
Hi Christophe,

Christophe Leroy  wrote on Wed, 23 Jun
2021 11:41:46 +0200:

> Le 19/06/2021 à 20:40, Miquel Raynal a écrit :
> > Hi Christophe,
> >   
>  Now and then I'm using one of the latest kernels (Today is 5.13-rc6), 
>  and sometime in one of the 5.x releases, I started to get errors like:
> 
>  [5.098265] ecc_sw_hamming_correct: uncorrectable ECC error
>  [5.103859] ubi0 warning: ubi_io_read: error -74 (ECC error) while 
>  reading 60
>  bytes from PEB 99:59824, read only 60 bytes, retry
>  [5.525843] ecc_sw_hamming_correct: uncorrectable ECC error
>  [5.531571] ecc_sw_hamming_correct: uncorrectable ECC error
>  [5.537490] ubi0 warning: ubi_io_read: error -74 (ECC error) while 
>  reading 30
>  73 bytes from PEB 107:108976, read only 3073 bytes, retry
>  [5.691121] ecc_sw_hamming_correct: uncorrectable ECC error
>  [5.696709] ecc_sw_hamming_correct: uncorrectable ECC error
>  [5.702426] ecc_sw_hamming_correct: uncorrectable ECC error
>  [5.708141] ecc_sw_hamming_correct: uncorrectable ECC error
>  [5.714103] ubi0 warning: ubi_io_read: error -74 (ECC error) while 
>  reading 30
>  35 bytes from PEB 107:25144, read only 3035 bytes, retry
>  [   20.523689] random: crng init done
>  [   21.892130] ecc_sw_hamming_correct: uncorrectable ECC error
>  [   21.897730] ubi0 warning: ubi_io_read: error -74 (ECC error) while 
>  reading 13
>  94 bytes from PEB 116:75776, read only 1394 bytes, retry
> 
>  Most of the time, when the reading of the file fails, I just have to 
>  read it once more and it gets read without that error.  
> >>>
> >>> It really looks like a regular bitflip happening "sometimes". Is this a
> >>> board which already had a life? What are the usage counters (UBI should
> >>> tell you this) compared to the official endurance of your chip (see the
> >>> datasheet)?  
> >>
> >> The board had a peaceful life:
> >>
> >> UBI reports "ubi0: max/mean erase counter: 49/20, WL threshold: 4096"  
> > 
> > Mmmh. Indeed.
> >   
> >>
> >> I have tried with half a dozen of boards and all have the issue.
> >>  
> >>> What am I supposed to do to avoid the ECC weakness warning at 
> >>> startup and to fix that ECC error issue ?  
> >>>
> >>> I honestly don't think the errors come from the 5.1x kernels given the
> >>> above logs. If you flash back your old 4.14 I am pretty sure you'll
> >>> have the same errors at some point.  
> >>
> >> I don't have any problem like that with 4.14 with any of the board.
> >>
> >> When booting a 4.14 kernel I don't get any problem on the same board.
> >>  
> > 
> > If you can reliably show that when returning to a 4.14 kernel the ECC
> > weakness disappears, then there is certainly something new. What driver
> > are you using? Maybe you can do a bisection?  
> 
> Using the GPIO driver, and the NAND chip is a HYNIX.
> 
> I can say that the ECC weakness doesn't exist up to and including v5.5. The
> weakness appears with v5.6.
> 
> I have tried bisecting between those two versions and I couldn't reach a
> reliable result. The closer to v5.5 you go, the more difficult it is to
> reproduce the issue.
> 
> So I looked at what was done around those places, and in fact it's mainly
> optimisation in the powerpc code. It seems that the more powerpc is
> optimised, the more the problem occurs.
> 
> Looking at the GPIO NAND driver, I saw the no-op gpio_nand_dosync()
> function. By adding a memory barrier in that function, the ECC weakness
> disappeared completely.

I see that the 'fix' in gpio_nand_dosync() has only been designed for
ARM platforms, perhaps it would make sense to have a PPC variant here?
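Something along these lines, perhaps - a sketch only, assuming the
currently empty non-ARM path of gpio_nand_dosync() in
drivers/mtd/nand/raw/gpio.c, with mb() deliberately heavy-handed just to
show where a barrier would sit:

static void gpio_nand_dosync(struct gpiomtd *gpiomtd)
{
	/* order the GPIO control-line writes against the NAND data
	 * bus accesses that follow them */
	mb();
}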

> Not sure what the final solution has to be.

Perhaps PowerPC maintainers can shed some light on these findings?

Thanks,
Miquèl


[powerpc:topic/ppc-kvm 26/41] arch/powerpc/kvm/book3s_hv_builtin.c:419:22: sparse: sparse: incorrect type in assignment (different base types)

2021-06-23 Thread kernel test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
topic/ppc-kvm
head:   51696f39cbee5bb684e7959c0c98b5f54548aa34
commit: 2ce008c8b25467ceacf45bcf0e183d660edb82f2 [26/41] KVM: PPC: Book3S HV: 
Remove unused nested HV tests in XICS emulation
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.3-341-g8af24329-dirty
# 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=2ce008c8b25467ceacf45bcf0e183d660edb82f2
git remote add powerpc 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git
git fetch --no-tags powerpc topic/ppc-kvm
git checkout 2ce008c8b25467ceacf45bcf0e183d660edb82f2
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 
CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 


sparse warnings: (new ones prefixed by >>)
   arch/powerpc/kvm/book3s_hv_builtin.c:417:41: sparse: sparse: incorrect type 
in argument 1 (different base types) @@ expected unsigned int [usertype] 
*out_xirr @@ got restricted __be32 * @@
   arch/powerpc/kvm/book3s_hv_builtin.c:417:41: sparse: expected unsigned 
int [usertype] *out_xirr
   arch/powerpc/kvm/book3s_hv_builtin.c:417:41: sparse: got restricted 
__be32 *
>> arch/powerpc/kvm/book3s_hv_builtin.c:419:22: sparse: sparse: incorrect type 
>> in assignment (different base types) @@ expected restricted __be32 
>> [addressable] [usertype] xirr @@ got unsigned int @@
   arch/powerpc/kvm/book3s_hv_builtin.c:419:22: sparse: expected restricted 
__be32 [addressable] [usertype] xirr
   arch/powerpc/kvm/book3s_hv_builtin.c:419:22: sparse: got unsigned int
>> arch/powerpc/kvm/book3s_hv_builtin.c:450:41: sparse: sparse: incorrect type 
>> in argument 1 (different base types) @@ expected unsigned int [usertype] 
>> val @@ got restricted __be32 [addressable] [usertype] xirr @@
   arch/powerpc/kvm/book3s_hv_builtin.c:450:41: sparse: expected unsigned 
int [usertype] val
   arch/powerpc/kvm/book3s_hv_builtin.c:450:41: sparse: got restricted 
__be32 [addressable] [usertype] xirr
   arch/powerpc/kvm/book3s_hv_builtin.c: note: in included file:
   arch/powerpc/include/asm/kvm_ppc.h:966:1: sparse: sparse: cast to restricted 
__be64
   arch/powerpc/include/asm/kvm_ppc.h:966:1: sparse: sparse: cast to restricted 
__le64
   arch/powerpc/include/asm/kvm_ppc.h:962:1: sparse: sparse: incorrect type in 
assignment (different base types) @@ expected unsigned long long [usertype] 
srr0 @@ got restricted __be64 [usertype] @@
   arch/powerpc/include/asm/kvm_ppc.h:962:1: sparse: expected unsigned long 
long [usertype] srr0
   arch/powerpc/include/asm/kvm_ppc.h:962:1: sparse: got restricted __be64 
[usertype]
   arch/powerpc/include/asm/kvm_ppc.h:962:1: sparse: sparse: incorrect type in 
assignment (different base types) @@ expected unsigned long long [usertype] 
srr0 @@ got restricted __le64 [usertype] @@
   arch/powerpc/include/asm/kvm_ppc.h:962:1: sparse: expected unsigned long 
long [usertype] srr0
   arch/powerpc/include/asm/kvm_ppc.h:962:1: sparse: got restricted __le64 
[usertype]
   arch/powerpc/include/asm/kvm_ppc.h:963:1: sparse: sparse: incorrect type in 
assignment (different base types) @@ expected unsigned long long [usertype] 
srr1 @@ got restricted __be64 [usertype] @@
   arch/powerpc/include/asm/kvm_ppc.h:963:1: sparse: expected unsigned long 
long [usertype] srr1
   arch/powerpc/include/asm/kvm_ppc.h:963:1: sparse: got restricted __be64 
[usertype]
   arch/powerpc/include/asm/kvm_ppc.h:963:1: sparse: sparse: incorrect type in 
assignment (different base types) @@ expected unsigned long long [usertype] 
srr1 @@ got restricted __le64 [usertype] @@
   arch/powerpc/include/asm/kvm_ppc.h:963:1: sparse: expected unsigned long 
long [usertype] srr1
   arch/powerpc/include/asm/kvm_ppc.h:963:1: sparse: got restricted __le64 
[usertype]

vim +419 arch/powerpc/kvm/book3s_hv_builtin.c

f725758b899f11 Paul Mackerras 2016-11-18  395  
f725758b899f11 Paul Mackerras 2016-11-18  396  static long 
kvmppc_read_one_intr(bool *again)
37f55d30df2eef Suresh Warrier 2016-08-19  397  {
d381d7caf812f7 Benjamin Herrenschmidt 2017-04-05  398   void __iomem *xics_phys;
37f55d30df2eef Suresh Warrier 2016-08-19  399   u32 h_xirr;
37f55d30df2eef Suresh Warrier 2016-08-19  400   __be32 xirr;
37f55d30df2eef Suresh Warrier 2016-08-19  401   u32 xisr;
37f55d30df2eef Suresh Warrier 2016-08-19  402   u8 host_ipi;
f725758b899f11 Paul Ma

[PATCH] powerpc: Fix is_kvm_guest() / kvm_para_available()

2021-06-23 Thread Michael Ellerman
Commit a21d1becaa3f ("powerpc: Reintroduce is_kvm_guest() as a fast-path
check") added is_kvm_guest() and changed kvm_para_available() to use it.

is_kvm_guest() checks a static key, kvm_guest, and that static key is
set in check_kvm_guest().

The problem is check_kvm_guest() is only called on pseries, and even
then only in some configurations. That means is_kvm_guest() always
returns false on all non-pseries and some pseries depending on
configuration. That's a bug.

For PR KVM guests this is noticeable because they no longer do live
patching of themselves, which can be detected by the omission of a
message in dmesg such as:

  KVM: Live patching for a fast VM worked

To fix it make check_kvm_guest() an initcall, to ensure it's always
called at boot. It needs to be core so that it runs before
kvm_guest_init() which is postcore. To be an initcall it needs to return
int, where 0 means success, so update that.

We still call it manually in pSeries_smp_probe(), because that runs
before init calls are run.

Fixes: a21d1becaa3f ("powerpc: Reintroduce is_kvm_guest() as a fast-path check")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/include/asm/kvm_guest.h |  4 ++--
 arch/powerpc/kernel/firmware.c   | 10 ++
 arch/powerpc/platforms/pseries/smp.c |  4 +++-
 3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_guest.h 
b/arch/powerpc/include/asm/kvm_guest.h
index 2fca299f7e19..c63105d2c9e7 100644
--- a/arch/powerpc/include/asm/kvm_guest.h
+++ b/arch/powerpc/include/asm/kvm_guest.h
@@ -16,10 +16,10 @@ static inline bool is_kvm_guest(void)
return static_branch_unlikely(&kvm_guest);
 }
 
-bool check_kvm_guest(void);
+int check_kvm_guest(void);
 #else
 static inline bool is_kvm_guest(void) { return false; }
-static inline bool check_kvm_guest(void) { return false; }
+static inline int check_kvm_guest(void) { return 0; }
 #endif
 
 #endif /* _ASM_POWERPC_KVM_GUEST_H_ */
diff --git a/arch/powerpc/kernel/firmware.c b/arch/powerpc/kernel/firmware.c
index c9e2819b095a..c7022c41cc31 100644
--- a/arch/powerpc/kernel/firmware.c
+++ b/arch/powerpc/kernel/firmware.c
@@ -23,18 +23,20 @@ EXPORT_SYMBOL_GPL(powerpc_firmware_features);
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_KVM_GUEST)
 DEFINE_STATIC_KEY_FALSE(kvm_guest);
-bool check_kvm_guest(void)
+int __init check_kvm_guest(void)
 {
struct device_node *hyper_node;
 
hyper_node = of_find_node_by_path("/hypervisor");
if (!hyper_node)
-   return false;
+   return 0;
 
if (!of_device_is_compatible(hyper_node, "linux,kvm"))
-   return false;
+   return 0;
 
static_branch_enable(&kvm_guest);
-   return true;
+
+   return 0;
 }
+core_initcall(check_kvm_guest); // before kvm_guest_init()
 #endif
diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index c70b4be9f0a5..096629f54576 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -211,7 +211,9 @@ static __init void pSeries_smp_probe(void)
if (!cpu_has_feature(CPU_FTR_SMT))
return;
 
-   if (check_kvm_guest()) {
+   check_kvm_guest();
+
+   if (is_kvm_guest()) {
/*
 * KVM emulates doorbells by disabling FSCR[MSGP] so msgsndp
 * faults to the hypervisor which then reads the instruction
-- 
2.25.1



[PATCH] powerpc/64s: Make prom_init require RELOCATABLE

2021-06-23 Thread Michael Ellerman
When we boot from open firmware (OF) using PPC_OF_BOOT_TRAMPOLINE, aka.
prom_init, we run parts of the kernel at an address other than the link
address. That happens because OF loads the kernel above zero (OF is at
zero) and we run prom_init before copying the kernel down to zero.

Currently that works even for non-relocatable kernels, because we do
various fixups to the prom_init code to make it run where it's loaded.

However those fixups are not sufficient if the kernel becomes large
enough. In that case prom_init()'s final call to __start() can end up
generating a plt branch:

bl  c218 <0078.plt_branch.__start>

That results in the kernel jumping to the linked address of __start,
0xc000, when really it needs to jump to the
0xc000 + the runtime address because the kernel is still
running at the load address.

We could do further shenanigans to handle that, see Jordan's patch for
example:
  
https://lore.kernel.org/linuxppc-dev/20210421021721.1539289-1-jniet...@gmail.com

However it is much simpler to just require a kernel with prom_init() to
be built relocatable. The result works in all configurations without
further work, and requires less code.

This should have no effect on most people, as our defconfigs and
essentially all distro configs already have RELOCATABLE enabled.

Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/prom_init.c | 58 ++---
 arch/powerpc/platforms/Kconfig  |  1 +
 2 files changed, 3 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 05ce15b854e2..a5bf355ce1d6 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -3243,54 +3243,6 @@ static void __init prom_check_initrd(unsigned long r3, 
unsigned long r4)
 #endif /* CONFIG_BLK_DEV_INITRD */
 }
 
-#ifdef CONFIG_PPC64
-#ifdef CONFIG_RELOCATABLE
-static void reloc_toc(void)
-{
-}
-
-static void unreloc_toc(void)
-{
-}
-#else
-static void __reloc_toc(unsigned long offset, unsigned long nr_entries)
-{
-   unsigned long i;
-   unsigned long *toc_entry;
-
-   /* Get the start of the TOC by using r2 directly. */
-   asm volatile("addi %0,2,-0x8000" : "=b" (toc_entry));
-
-   for (i = 0; i < nr_entries; i++) {
-   *toc_entry = *toc_entry + offset;
-   toc_entry++;
-   }
-}
-
-static void reloc_toc(void)
-{
-   unsigned long offset = reloc_offset();
-   unsigned long nr_entries =
-   (__prom_init_toc_end - __prom_init_toc_start) / sizeof(long);
-
-   __reloc_toc(offset, nr_entries);
-
-   mb();
-}
-
-static void unreloc_toc(void)
-{
-   unsigned long offset = reloc_offset();
-   unsigned long nr_entries =
-   (__prom_init_toc_end - __prom_init_toc_start) / sizeof(long);
-
-   mb();
-
-   __reloc_toc(-offset, nr_entries);
-}
-#endif
-#endif
-
 #ifdef CONFIG_PPC_SVM
 /*
  * Perform the Enter Secure Mode ultracall.
@@ -3324,14 +3276,12 @@ static void __init setup_secure_guest(unsigned long 
kbase, unsigned long fdt)
 * relocated it so the check will fail. Restore the original image by
 * relocating it back to the kernel virtual base address.
 */
-   if (IS_ENABLED(CONFIG_RELOCATABLE))
-   relocate(KERNELBASE);
+   relocate(KERNELBASE);
 
ret = enter_secure_mode(kbase, fdt);
 
/* Relocate the kernel again. */
-   if (IS_ENABLED(CONFIG_RELOCATABLE))
-   relocate(kbase);
+   relocate(kbase);
 
if (ret != U_SUCCESS) {
prom_printf("Returned %d from switching to secure mode.\n", 
ret);
@@ -3359,8 +3309,6 @@ unsigned long __init prom_init(unsigned long r3, unsigned 
long r4,
 #ifdef CONFIG_PPC32
unsigned long offset = reloc_offset();
reloc_got2(offset);
-#else
-   reloc_toc();
 #endif
 
/*
@@ -3537,8 +3485,6 @@ unsigned long __init prom_init(unsigned long r3, unsigned 
long r4,
 
 #ifdef CONFIG_PPC32
reloc_got2(-offset);
-#else
-   unreloc_toc();
 #endif
 
/* Move to secure memory if we're supposed to be secure guests. */
diff --git a/arch/powerpc/platforms/Kconfig b/arch/powerpc/platforms/Kconfig
index 2f071fb9694c..e02d29a9d12f 100644
--- a/arch/powerpc/platforms/Kconfig
+++ b/arch/powerpc/platforms/Kconfig
@@ -51,6 +51,7 @@ config PPC_NATIVE
 config PPC_OF_BOOT_TRAMPOLINE
bool "Support booting from Open Firmware or yaboot"
depends on PPC_BOOK3S_32 || PPC64
+   select RELOCATABLE if PPC64
default y
help
  Support from booting from Open Firmware or yaboot using an
-- 
2.25.1



Re: [PATCH 2/3] powerpc: Define swapper_pg_dir[] in C

2021-06-23 Thread Michael Ellerman
Daniel Axtens  writes:
> Hi Christophe,
>
> This breaks booting a radix KVM guest with 4k pages for me:
>
> make pseries_le_defconfig
> scripts/config -d CONFIG_PPC_64K_PAGES
> scripts/config -e CONFIG_PPC_4K_PAGES
> make vmlinux
> sudo qemu-system-ppc64 -enable-kvm -M pseries -m 1G -nographic -vga none -smp 
> 4 -cpu host -kernel vmlinux
>
> Boot hangs after printing 'Booting Linux via __start()' and qemu's 'info
> registers' reports that it's stuck at the instruction fetch exception.
>
> My host is Power9, 64k page size radix, and
> gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 
> 2.34
>

...
>> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
>> index 730838c7ca39..79f2d1e61abd 100644
>> --- a/arch/powerpc/kernel/head_64.S
>> +++ b/arch/powerpc/kernel/head_64.S
>> @@ -997,18 +997,3 @@ start_here_common:
>>  0:  trap
>>  EMIT_BUG_ENTRY 0b, __FILE__, __LINE__, 0
>>  .previous
>> -
>> -/*
>> - * We put a few things here that have to be page-aligned.
>> - * This stuff goes at the beginning of the bss, which is page-aligned.
>> - */
>> -.section ".bss"
>> -/*
>> - * pgd dir should be aligned to PGD_TABLE_SIZE which is 64K.
>> - * We will need to find a better way to fix this
>> - */
>> -.align  16
>> -
>> -.globl  swapper_pg_dir
>> -swapper_pg_dir:
>> -.space  PGD_TABLE_SIZE

This is now 4K aligned whereas it used to be 64K.

This fixes it and is not completely ugly?

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 1707ab580ee2..298469beaa90 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -28,7 +28,13 @@
 #include 
 #include 
 
-pgd_t swapper_pg_dir[MAX_PTRS_PER_PGD] __page_aligned_bss;
+#ifdef CONFIG_PPC64
+#define PGD_ALIGN 0x10000
+#else
+#define PGD_ALIGN PAGE_SIZE
+#endif
+
+pgd_t swapper_pg_dir[MAX_PTRS_PER_PGD] __section(".bss..page_aligned") 
__aligned(PGD_ALIGN);
 
 static inline int is_exec_fault(void)
 {
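For context, the 64K figure follows from the PGD geometry with 4K pages on
book3s64 - a sketch of the arithmetic, assuming the radix values from
arch/powerpc/include/asm/book3s/64/radix-4k.h:

	/* radix with 4K pages: 13-bit PGD index, 8-byte entries */
	PGD_TABLE_SIZE = sizeof(pgd_t) << RADIX_PGD_INDEX_SIZE
	               = 8 << 13
	               = 64K	/* hence the 64K alignment requirement */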


cheers


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Odin Ugedal
ons. 23. jun. 2021 kl. 14:22 skrev Vincent Guittot :
>
> In theory it should not, because _sum should always be larger than or equal
> to _avg * divider. Otherwise, it means that we have something wrong
> somewhere else.

Yeah, that might be the case. Still trying to wrap my head around
this. I might be wrong, but isn't there a possibility
that avg->period_contrib is increasing in PELT's accumulate_sum
without _sum increasing? This makes the pelt divider increase,
making the statement "_sum should always be larger than or equal to _avg * divider"
false? Or am I missing something obvious here?
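For reference, the divider in question is derived from period_contrib like
this (mirroring get_pelt_divider() in kernel/sched/pelt.h; LOAD_AVG_MAX is
47742 and period_contrib runs from 0 to 1023):

static inline u32 get_pelt_divider(struct sched_avg *avg)
{
	return LOAD_AVG_MAX - 1024 + avg->period_contrib;
}

So if accumulate_sum() advances period_contrib while adding nothing to
_sum, the divider grows while _sum stands still, and _avg * divider can
overtake _sum.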

Still unable to reproduce what Sachin is reporting tho.

Odin


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wed, 23 Jun 2021 at 14:18, Odin Ugedal  wrote:
>
> Hi,
>
> Wouldn't the attached diff below also help when load is removed,
> Vincent? Isn't there a theoretical chance that x_sum ends up at zero
> while x_load ends up as a positive value (without this patch)? Can
> post as a separate patch if it works for Sachin.

In theory it should not, because _sum should always be larger than or equal
to _avg * divider. Otherwise, it means that we have something wrong
somewhere else.

>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bfaa6e1f6067..def48bc2e90b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3688,15 +3688,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>
> r = removed_load;
> sub_positive(&sa->load_avg, r);
> -   sub_positive(&sa->load_sum, r * divider);
> +   sa->load_sum = sa->load_avg * divider;
>
> r = removed_util;
> sub_positive(&sa->util_avg, r);
> -   sub_positive(&sa->util_sum, r * divider);
> +   sa->util_sum = sa->util_avg * divider;
>
> r = removed_runnable;
> sub_positive(&sa->runnable_avg, r);
> -   sub_positive(&sa->runnable_sum, r * divider);
> +   sa->runnable_sum = sa->runnable_avg * divider;
>
> /*
>  * removed_runnable is the unweighted version of
> removed_load so we


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Odin Ugedal
Hi,

Wouldn't the attached diff below also help when load is removed,
Vincent? Isn't there a theoretical chance that x_sum ends up at zero
while x_load ends up as a positive value (without this patch)? Can
post as a separate patch if it works for Sachin.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfaa6e1f6067..def48bc2e90b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3688,15 +3688,15 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)

r = removed_load;
sub_positive(&sa->load_avg, r);
-   sub_positive(&sa->load_sum, r * divider);
+   sa->load_sum = sa->load_avg * divider;

r = removed_util;
sub_positive(&sa->util_avg, r);
-   sub_positive(&sa->util_sum, r * divider);
+   sa->util_sum = sa->util_avg * divider;

r = removed_runnable;
sub_positive(&sa->runnable_avg, r);
-   sub_positive(&sa->runnable_sum, r * divider);
+   sa->runnable_sum = sa->runnable_avg * divider;

/*
 * removed_runnable is the unweighted version of
removed_load so we


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
On Wednesday, 23 June 2021 at 15:52:59 (+0530), Sachin Sant wrote:
> 
> 
> > On 23-Jun-2021, at 1:28 PM, Sachin Sant  wrote:
> > 
> > 
>  Could you try the patch below ? I have been able to reproduce the 
>  problem locally and this
>  fixes it on my system:
>  
> >>> I can recreate the issue with this patch.
> >> 
> >> ok, so your problem seems to be different from my assumption. Could you try
> >> the patch below on top of the previous one?
> >> 
> >> This will help us to confirm that the problem comes from load_avg and that
> >> it's linked to the cfs load_avg and it's not a problem happening earlier in
> >> the update of PELT.
> >> 
> > 
> > Indeed. With both the patches applied I see following warning related to 
> > load_avg
> 
> I left the machine running for sometime. Then attempted a kernel compile.
> I subsequently saw warnings triggered for util_avg as well as runnable_avg
> 
> [ 8371.964935] [ cut here ]
> [ 8371.964958] cfs_rq->avg.util_avg
> [ 8371.964969] WARNING: CPU: 16 PID: 479551 at kernel/sched/fair.c:3283 
> update_blocked_averages+0x700/0x830
> ……..
> ……..
> [ 8664.754506] [ cut here ]
> [ 8664.754569] cfs_rq->avg.runnable_avg
> [ 8664.754583] WARNING: CPU: 23 PID: 125 at kernel/sched/fair.c:3284 
> update_blocked_averages+0x730/0x830
> …….
>

Ok. This becomes even weirder. Could you share your config file and more
details about your setup?

Have you applied the patch below?
https://lore.kernel.org/lkml/20210621174330.11258-1-vincent.guit...@linaro.org/

Regarding the load_avg warning, I can see a possible problem during attach.
Could you also apply the patch below? The load_avg warning seems to happen
during boot and sched_entity creation.

---
 kernel/sched/fair.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a6566f945a0..5e86139524c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3753,11 +3753,12 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 	se->avg.runnable_sum = se->avg.runnable_avg * divider;
 
-	se->avg.load_sum = divider;
 	if (se_weight(se)) {
 		se->avg.load_sum =
-			div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
-	}
+			div_u64(se->avg.load_avg * divider, se_weight(se));
+	} else {
+		se->avg.load_avg = 0;
+	}
 
 	enqueue_load_avg(cfs_rq, se);
 	cfs_rq->avg.util_avg += se->avg.util_avg;
-- 
2.17.1


> > 
> > Starting NTP client/server...
> > Starting VDO volume services...
> > [9.029054] [ cut here ]
> > [9.029084] cfs_rq->avg.load_avg
> > [9.029111] WARNING: CPU: 21 PID: 1169 at kernel/sched/fair.c:3282 
> > update_blocked_averages+0x760/0x830
> > [9.029151] Modules linked in: pseries_rng xts vmx_crypto 
> > uio_pdrv_genirq uio sch_fq_codel ip_tables xfs libcrc32c sr_mod sd_mod 
> > cdrom t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror 
> > dm_region_hash dm_log dm_mod fuse
> > [9.029233] CPU: 21 PID: 1169 Comm: grep Not tainted 
> > 5.13.0-rc7-next-20210621-dirty #3
> > [9.029246] NIP:  c01b6150 LR: c01b614c CTR: 
> > c0728f40
> > [9.029259] REGS: ce177650 TRAP: 0700   Not tainted  
> > (5.13.0-rc7-next-20210621-dirty)
> > [9.029271] MSR:  80029033   CR: 48088224  
> > XER: 0005
> > [9.029296] CFAR: c014d120 IRQMASK: 1 
> > [9.029296] GPR00: c01b614c ce1778f0 c29bb900 
> > 0014 
> > [9.029296] GPR04: fffe ce1775b0 0027 
> > c0154f637e18 
> > [9.029296] GPR08: 0023 0001 0027 
> > c0167f1d7fe8 
> > [9.029296] GPR12: 8000 c0154ffe0e80 b820 
> > 00021a2c6864 
> > [9.029296] GPR16: c000482cc000 c0154f6c2580 0001 
> >  
> > [9.029296] GPR20: c291a7f9 c000482cc100  
> > 020d 
> > [9.029296] GPR24:  c0154f6c2f90 0001 
> > c00030b84400 
> > [9.029296] GPR28: 020d c000482cc1c0 0338 
> >  
> > [9.029481] NIP [c01b6150] update_blocked_averages+0x760/0x830
> > [9.029494] LR [c01b614c] update_blocked_averages+0x75c/0x830
> > [9.029508] Call Trace:
> > [9.029515] [ce1778f0] [c01b614c] 
> > update_blocked_averages+0x75c/0x830 (unreliable)
> > [9.029533] [ce177a20] [c01bd388] 
> > newidle_balance+0x258/0x5c0
> > [9.029542] [ce177ab0] [c01bd7cc] 
> > pick_next_task_fair+0x7c/0x4c0
> > [9.029574] [ce177b10] [c0cee3dc] __schedule+0x15c/0x1780
> > [9.029599] [ce177c50] [c01a5984] do_t

Re: [PATCH 2/3] powerpc: Define swapper_pg_dir[] in C

2021-06-23 Thread Daniel Axtens
Hi Christophe,

This breaks booting a radix KVM guest with 4k pages for me:

make pseries_le_defconfig
scripts/config -d CONFIG_PPC_64K_PAGES
scripts/config -e CONFIG_PPC_4K_PAGES
make vmlinux
sudo qemu-system-ppc64 -enable-kvm -M pseries -m 1G -nographic -vga none -smp 4 
-cpu host -kernel vmlinux

Boot hangs after printing 'Booting Linux via __start()' and qemu's 'info
registers' reports that it's stuck at the instruction fetch exception.

My host is Power9, 64k page size radix, and
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0, GNU ld (GNU Binutils for Ubuntu) 2.34

Kind regards,
Daniel

> Don't duplicate swapper_pg_dir[] in each platform's head.S
>
> Define it in mm/pgtable.c
>
> Define MAX_PTRS_PER_PGD because on book3s/64 PTRS_PER_PGD is
> not a constant.
>
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h |  3 +++
>  arch/powerpc/include/asm/pgtable.h   |  4 
>  arch/powerpc/kernel/asm-offsets.c|  5 -
>  arch/powerpc/kernel/head_40x.S   | 11 ---
>  arch/powerpc/kernel/head_44x.S   | 17 +
>  arch/powerpc/kernel/head_64.S| 15 ---
>  arch/powerpc/kernel/head_8xx.S   | 12 
>  arch/powerpc/kernel/head_book3s_32.S | 11 ---
>  arch/powerpc/kernel/head_fsl_booke.S | 12 
>  arch/powerpc/mm/pgtable.c|  2 ++
>  10 files changed, 10 insertions(+), 82 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index a666d561b44d..4d9941b2fe51 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -232,6 +232,9 @@ extern unsigned long __pmd_frag_size_shift;
>  #define PTRS_PER_PUD (1 << PUD_INDEX_SIZE)
>  #define PTRS_PER_PGD (1 << PGD_INDEX_SIZE)
>  
> +#define MAX_PTRS_PER_PGD (1 << (H_PGD_INDEX_SIZE > RADIX_PGD_INDEX_SIZE 
> ? \
> +H_PGD_INDEX_SIZE : RADIX_PGD_INDEX_SIZE))
> +
>  /* PMD_SHIFT determines what a second-level page table entry can map */
>  #define PMD_SHIFT(PAGE_SHIFT + PTE_INDEX_SIZE)
>  #define PMD_SIZE (1UL << PMD_SHIFT)
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index c6a676714f04..b9c8641654f4 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -41,6 +41,10 @@ struct mm_struct;
>  
>  #ifndef __ASSEMBLY__
>  
> +#ifndef MAX_PTRS_PER_PGD
> +#define MAX_PTRS_PER_PGD PTRS_PER_PGD
> +#endif
> +
>  /* Keep these as a macros to avoid include dependency mess */
>  #define pte_page(x)  pfn_to_page(pte_pfn(x))
>  #define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot))
> diff --git a/arch/powerpc/kernel/asm-offsets.c 
> b/arch/powerpc/kernel/asm-offsets.c
> index 0480f4006e0c..f1b6ff14c8a0 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -361,11 +361,6 @@ int main(void)
>   DEFINE(BUG_ENTRY_SIZE, sizeof(struct bug_entry));
>  #endif
>  
> -#ifdef CONFIG_PPC_BOOK3S_64
> - DEFINE(PGD_TABLE_SIZE, (sizeof(pgd_t) << max(RADIX_PGD_INDEX_SIZE, 
> H_PGD_INDEX_SIZE)));
> -#else
> - DEFINE(PGD_TABLE_SIZE, PGD_TABLE_SIZE);
> -#endif
>   DEFINE(PTE_SIZE, sizeof(pte_t));
>  
>  #ifdef CONFIG_KVM
> diff --git a/arch/powerpc/kernel/head_40x.S b/arch/powerpc/kernel/head_40x.S
> index 92b6c7356161..7d72ee5ab387 100644
> --- a/arch/powerpc/kernel/head_40x.S
> +++ b/arch/powerpc/kernel/head_40x.S
> @@ -701,14 +701,3 @@ _GLOBAL(abort)
>  mfspr   r13,SPRN_DBCR0
>  orisr13,r13,DBCR0_RST_SYSTEM@h
>  mtspr   SPRN_DBCR0,r13
> -
> -/* We put a few things here that have to be page-aligned. This stuff
> - * goes at the beginning of the data segment, which is page-aligned.
> - */
> - .data
> - .align  12
> - .globl  sdata
> -sdata:
> - .globl  swapper_pg_dir
> -swapper_pg_dir:
> - .space  PGD_TABLE_SIZE
> diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S
> index e037eb615757..ddc978a2d381 100644
> --- a/arch/powerpc/kernel/head_44x.S
> +++ b/arch/powerpc/kernel/head_44x.S
> @@ -1233,23 +1233,8 @@ head_start_common:
>   isync
>   blr
>  
> -/*
> - * We put a few things here that have to be page-aligned. This stuff
> - * goes at the beginning of the data segment, which is page-aligned.
> - */
> - .data
> - .align  PAGE_SHIFT
> - .globl  sdata
> -sdata:
> -
> -/*
> - * To support >32-bit physical addresses, we use an 8KB pgdir.
> - */
> - .globl  swapper_pg_dir
> -swapper_pg_dir:
> - .space  PGD_TABLE_SIZE
> -
>  #ifdef CONFIG_SMP
> + .data
>   .align  12
>  temp_boot_stack:
>   .space  1024
> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
> index 730838c7ca39..79f2d1e61abd 100644
> --- a/arch/powerpc/kernel/head_64.S
> +++ b/arch/

[powerpc:topic/ppc-kvm 9/41] arch/powerpc/kvm/book3s_xive.c:151:41: sparse: sparse: incorrect type in assignment (different base types)

2021-06-23 Thread kernel test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
topic/ppc-kvm
head:   51696f39cbee5bb684e7959c0c98b5f54548aa34
commit: 023c3c96ca4d196c09d554d5a98900406e4d7ecb [9/41] KVM: PPC: Book3S HV P9: 
implement kvmppc_xive_pull_vcpu in C
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.3-341-g8af24329-dirty
# 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=023c3c96ca4d196c09d554d5a98900406e4d7ecb
git remote add powerpc 
https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git
git fetch --no-tags powerpc topic/ppc-kvm
git checkout 023c3c96ca4d196c09d554d5a98900406e4d7ecb
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 
CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 


sparse warnings: (new ones prefixed by >>)
   arch/powerpc/kvm/book3s_xive.c: note: in included file:
   arch/powerpc/kvm/book3s_xive_template.c:26:15: sparse: sparse: cast to 
restricted __be16
   arch/powerpc/kvm/book3s_xive_template.c:339:39: sparse: sparse: incorrect 
type in initializer (different base types) @@ expected restricted __be64 
[usertype] qw1 @@ got unsigned long @@
   arch/powerpc/kvm/book3s_xive_template.c:339:39: sparse: expected 
restricted __be64 [usertype] qw1
   arch/powerpc/kvm/book3s_xive_template.c:339:39: sparse: got unsigned long
   arch/powerpc/kvm/book3s_xive.c:79:49: sparse: sparse: incorrect type in 
argument 1 (different base types) @@ expected unsigned long v @@ got 
restricted __be64 [usertype] w01 @@
   arch/powerpc/kvm/book3s_xive.c:79:49: sparse: expected unsigned long v
   arch/powerpc/kvm/book3s_xive.c:79:49: sparse: got restricted __be64 
[usertype] w01
   arch/powerpc/kvm/book3s_xive.c:80:32: sparse: sparse: incorrect type in 
argument 1 (different base types) @@ expected unsigned int v @@ got 
restricted __be32 [usertype] xive_cam_word @@
   arch/powerpc/kvm/book3s_xive.c:80:32: sparse: expected unsigned int v
   arch/powerpc/kvm/book3s_xive.c:80:32: sparse: got restricted __be32 
[usertype] xive_cam_word
>> arch/powerpc/kvm/book3s_xive.c:151:41: sparse: sparse: incorrect type in 
>> assignment (different base types) @@ expected restricted __be64 
>> [usertype] w01 @@ got unsigned long @@
   arch/powerpc/kvm/book3s_xive.c:151:41: sparse: expected restricted 
__be64 [usertype] w01
   arch/powerpc/kvm/book3s_xive.c:151:41: sparse: got unsigned long

vim +151 arch/powerpc/kvm/book3s_xive.c

   129  
   130  /*
   131   * Pull a vcpu's context from the XIVE on guest exit.
   132   * This assumes we are in virtual mode (MMU on)
   133   */
   134  void kvmppc_xive_pull_vcpu(struct kvm_vcpu *vcpu)
   135  {
   136  void __iomem *tima = local_paca->kvm_hstate.xive_tima_virt;
   137  
   138  if (!vcpu->arch.xive_pushed)
   139  return;
   140  
   141  /*
   142   * Should not have been pushed if there is no tima
   143   */
   144  if (WARN_ON(!tima))
   145  return;
   146  
   147  eieio();
   148  /* First load to pull the context, we ignore the value */
   149  __raw_readl(tima + TM_SPC_PULL_OS_CTX);
   150  /* Second load to recover the context state (Words 0 and 1) */
 > 151  vcpu->arch.xive_saved_state.w01 = __raw_readq(tima + TM_QW1_OS);
   152  
   153  /* Fixup some of the state for the next load */
   154  vcpu->arch.xive_saved_state.lsmfb = 0;
   155  vcpu->arch.xive_saved_state.ack = 0xff;
   156  vcpu->arch.xive_pushed = 0;
   157  eieio();
   158  }
   159  EXPORT_SYMBOL_GPL(kvmppc_xive_pull_vcpu);
   160  
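
A possible way to address this particular warning (a sketch only; the
annotation the maintainers choose may differ) is to mark the raw MMIO read,
whose result carries no inherent byte order, with a __force cast to the
destination's __be64 type:

	/* Second load to recover the context state (Words 0 and 1) */
	vcpu->arch.xive_saved_state.w01 = (__force __be64)__raw_readq(tima + TM_QW1_OS);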

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


kernel panic with "Unrecoverable FP Unavailable Exception 800 at c00000000009e308"

2021-06-23 Thread Ryan Wong
Hi,

Recently I encountered a kernel panic announcing "Unrecoverable FP
Unavailable Exception 800 at c009e308". I have attached the panic
log at the end of the mail.
As I understand it, this exception occurs when a hardware floating-point
instruction is executed with the FPU disabled; if the instruction comes from
kernel space, the kernel assumes the fault is unrecoverable and panics.
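(For reference, this message is printed by kernel_fp_unavailable_exception()
in arch/powerpc/kernel/traps.c; roughly, going from memory of the 4.1-era
code rather than an exact quote:

	void kernel_fp_unavailable_exception(struct pt_regs *regs)
	{
		printk(KERN_EMERG "Unrecoverable FP Unavailable Exception "
				  "%lx at %lx\n", regs->trap, regs->nip);
		die("Unrecoverable FP Unavailable Exception", regs, SIGABRT);
	}
)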
*Here is the investigation I have done.*
I checked the MSR first: MSR[PR] = 0 and MSR[FP] = 0, so the system did
match the panic condition.
Because MSR[PR] = 0, the instruction seemed to come from the kernel, but the
kernel does not normally do floating-point calculations, so I was quite
curious about the code which triggered the exception. From the backtrace log,
it should be the "update_min_vruntime" function.
Unfortunately, I didn't see any floating-point operation in that function.
Then I disassembled the vmlinux, found the disassembly of that function, and
matched it against the instruction dump:

c009e2b8 <.update_min_vruntime>:
...
c009e2d8:   e9 1f 00 20     ld      r8,32(r31)
c009e2dc:   2f a9 00 00     cmpdi   cr7,r9,0
c009e2e0:   41 9e 00 68     beq     cr7,c009e348 <.update_min_vruntime+0x90>
c009e2e4:   e9 5f 00 30     ld      r10,48(r31)
c009e2e8:   e9 29 00 50     ld      r9,80(r9)
c009e2ec:   2f aa 00 00     cmpdi   cr7,r10,0
c009e2f0:   41 9e 00 10     beq     cr7,c009e300 <.update_min_vruntime+0x48>
c009e2f4:   e9 4a 00 40     ld      r10,64(r10)
c009e2f8:   7c e9 50 51     subf.   r7,r9,r10
c009e2fc:   41 80 00 24     blt     c009e320 <.update_min_vruntime+0x68>
c009e300:   7c e8 48 51     subf.   r7,r8,r9
c009e304:   40 81 00 28     ble     c009e32c <.update_min_vruntime+0x74>
c009e308:   f9 3f 00 20     std     r9,32(r31)
c009e30c:   38 21 00 80     addi    r1,r1,128
c009e310:   e8 01 00 10     ld      r0,16(r1)
c009e314:   eb e1 ff f8     ld      r31,-8(r1)

And the offending instruction is:

c009e308:   f9 3f 00 20     std     r9,32(r31)

This has nothing to do with floating point; I cannot imagine why it would
trigger the exception.

Do you have any idea about this condition? Any reply is appreciated.

*Panic log*
...
Linux version 4.1.21 (ryan@ubuntu) (gcc version 5.2.0) #22 SMP PREEMPT Wed
Oct 28 10:04:32 CST 2020
...
<1>Kernel command line: ramdisk_size=0x70 root=/dev/ram rw init=/init
mem=3840M reserve=256M@3840M console=ttyS0,115200 crashkernel=128M@32M
bportals=s1 qportals=s1
...
<0>linux-kernel-bde (16258): Allocating DMA memory using method dmaalloc=0
<0>linux-kernel-bde (16258): _use_dma_mapping:1 _dma_vbase:c0006000
_dma_pbase:6000 _cpu_pbase:6000 allocated:200 dmaalloc:0
<0>linux-kernel-bde (16247): _interrupt_connect d 0
<0>linux-kernel-bde (16247): connect primary isr
<0>linux-kernel-bde (16247): _interrupt_connect(3514):device# = 0,
irq_flags = 128, irq = 41
<1>device eth0.4092 entered promiscuous mode
<1>Unrecoverable FP Unavailable Exception 800 at c009e308
<0>Oops: Unrecoverable FP Unavailable Exception, sig: 6 [#1]
<0>PREEMPT SMP NR_CPUS=4 CoreNet Generic
<0>Modules linked in: linux_user_bde(PO) linux_kernel_bde(PO) dma2(O)
dma(O) watchdog(O) ttyVS(O) gpiodev(O) lbdev(O) spid(O) block2mtd
mpc85xx_edac edac_core sch_fq_codel uio_seville(O) loop [last unloaded:
linux_kernel_bde]
<1>CPU: 1 PID: 7 Comm: rcu_preempt Tainted: P   O4.1.21 #22
<1>task: c000e11a4680 ti: c000e11d8000 task.ti: c000e11d8000
<0>NIP: c009e308 LR: c009eda4 CTR: c00a2de8
<0>REGS: c000e11db4d0 TRAP: 0800   Tainted: P   O (4.1.21)
<0>MSR: 80029000   CR: 44a44242  XER: 
<0>SOFTE: 0
<0>GPR00: c009eda4 c000e11db750 c1763800
c000efe476a0
<0>GPR04: c000e11a4680 c000efe4fea0 c000efe47fa0
c1643800
<0>GPR08: 06b94a32fd58 06b949bb61f8 
c000e11f
<0>GPR12: 44a44244 cfffe6c0 

<0>GPR16: c16a9fa0 c16aa108 00fa
0001
<0>GPR20: c176d578  0001

<0>GPR24: 0001 c0b08a18 
c000efe47640
<0>NIP [c009e308] .update_min_vruntime+0x50/0xa4
<0>LR [c009eda4] .update_curr+0x80/0x1ec
<0>Call Trace:
<0>[c000e11db750] [c000e1004560] 0xc000e1004560 (unreliable)
<0>[c000e11db7d0] [c009eda4] .update_curr+0x80/0x1ec
<0>[c000e11db870] [c00a2e80] .dequeue_task_fair+0x98/0xaf0
<0>[c000e11db960] [c009376c] .dequeue_task+0x68/0x88
<0>[c000e11db9f0] [c0ae8f88] .__schedule+0x2f4/0x7b4
<0>[c000e11dbaa0] [c0ae9484] .schedule+0x3c/0xa8
<0>[c000e11dbb20] [c0aecc98] .sche

Re: [PATCH v3 0/4] Add perf interface to expose nvdimm

2021-06-23 Thread Michael Ellerman
Peter Zijlstra  writes:
> On Wed, Jun 23, 2021 at 01:40:38PM +0530, kajoljain wrote:
>> 
>> On 6/22/21 6:44 PM, Peter Zijlstra wrote:
>> > On Thu, Jun 17, 2021 at 06:56:13PM +0530, Kajol Jain wrote:
>> >> ---
>> >> Kajol Jain (4):
>> >>   drivers/nvdimm: Add nvdimm pmu structure
>> >>   drivers/nvdimm: Add perf interface to expose nvdimm performance stats
>> >>   powerpc/papr_scm: Add perf interface support
>> >>   powerpc/papr_scm: Document papr_scm sysfs event format entries
>> > 
>> > Don't see anything obviously wrong with this one.
>> > 
>> > Acked-by: Peter Zijlstra (Intel) 
>> > 
>> 
>> Hi Peter,
>> Thanks for reviewing the patch. Can you help me with how to get
>> these patches into Linus' tree, or can you take them?
>
> I would expect either the NVDIMM or PPC maintainers to take this. Dan,
> Michael ?

I can take it but would need Acks from nvdimm folks.

cheers


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Sachin Sant



> On 23-Jun-2021, at 1:28 PM, Sachin Sant  wrote:
> 
> 
 Could you try the patch below ? I have been able to reproduce the problem 
 locally and this
 fix it on my system:
 
>>> I can recreate the issue with this patch.
>> 
>> ok, so your problem seem to be different from my assumption. Could you try
>> the patch below on top of the previous one ?
>> 
>> This will help us to confirm that the problem comes from load_avg and that
>> it's linked to the cfs load_avg and it's not a problem happening earlier in
>> the update of PELT.
>> 
> 
> Indeed. With both the patches applied I see following warning related to 
> load_avg

I left the machine running for sometime. Then attempted a kernel compile.
I subsequently saw warnings triggered for util_avg as well as runnable_avg

[ 8371.964935] [ cut here ]
[ 8371.964958] cfs_rq->avg.util_avg
[ 8371.964969] WARNING: CPU: 16 PID: 479551 at kernel/sched/fair.c:3283 
update_blocked_averages+0x700/0x830
……..
……..
[ 8664.754506] [ cut here ]
[ 8664.754569] cfs_rq->avg.runnable_avg
[ 8664.754583] WARNING: CPU: 23 PID: 125 at kernel/sched/fair.c:3284 
update_blocked_averages+0x730/0x830
…….

> 
> Starting NTP client/server...
> Starting VDO volume services...
> [9.029054] [ cut here ]
> [9.029084] cfs_rq->avg.load_avg
> [9.029111] WARNING: CPU: 21 PID: 1169 at kernel/sched/fair.c:3282 
> update_blocked_averages+0x760/0x830
> [9.029151] Modules linked in: pseries_rng xts vmx_crypto uio_pdrv_genirq 
> uio sch_fq_codel ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg 
> ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod 
> fuse
> [9.029233] CPU: 21 PID: 1169 Comm: grep Not tainted 
> 5.13.0-rc7-next-20210621-dirty #3
> [9.029246] NIP:  c01b6150 LR: c01b614c CTR: 
> c0728f40
> [9.029259] REGS: ce177650 TRAP: 0700   Not tainted  
> (5.13.0-rc7-next-20210621-dirty)
> [9.029271] MSR:  80029033   CR: 48088224  
> XER: 0005
> [9.029296] CFAR: c014d120 IRQMASK: 1 
> [9.029296] GPR00: c01b614c ce1778f0 c29bb900 
> 0014 
> [9.029296] GPR04: fffe ce1775b0 0027 
> c0154f637e18 
> [9.029296] GPR08: 0023 0001 0027 
> c0167f1d7fe8 
> [9.029296] GPR12: 8000 c0154ffe0e80 b820 
> 00021a2c6864 
> [9.029296] GPR16: c000482cc000 c0154f6c2580 0001 
>  
> [9.029296] GPR20: c291a7f9 c000482cc100  
> 020d 
> [9.029296] GPR24:  c0154f6c2f90 0001 
> c00030b84400 
> [9.029296] GPR28: 020d c000482cc1c0 0338 
>  
> [9.029481] NIP [c01b6150] update_blocked_averages+0x760/0x830
> [9.029494] LR [c01b614c] update_blocked_averages+0x75c/0x830
> [9.029508] Call Trace:
> [9.029515] [ce1778f0] [c01b614c] 
> update_blocked_averages+0x75c/0x830 (unreliable)
> [9.029533] [ce177a20] [c01bd388] 
> newidle_balance+0x258/0x5c0
> [9.029542] [ce177ab0] [c01bd7cc] 
> pick_next_task_fair+0x7c/0x4c0
> [9.029574] [ce177b10] [c0cee3dc] __schedule+0x15c/0x1780
> [9.029599] [ce177c50] [c01a5984] do_task_dead+0x64/0x70
> [9.029622] [ce177c80] [c0156338] do_exit+0x848/0xcc0
> [9.029646] [ce177d50] [c0156884] do_group_exit+0x64/0xe0
> [9.029666] [ce177d90] [c0156924] sys_exit_group+0x24/0x30
> [9.029688] [ce177db0] [c00310c0] 
> system_call_exception+0x150/0x2d0
> Startin[9.029710] [gce177e10 Hardware Monito] 
> [c000_common+0xec/0x2lling Sensors...
> 78
> [9.029743] --- interrupt: c00 at 0x7fff943fddcc
> [9.029758] NIP:  7fff943fddcc LR: 7fff94357f04 CTR: 
> 
> [9.029786] REGS: ce177e80 TRAP: 0c00   Not tainted  
> (5.13.0-rc7-next-20210621-dirty)
> [9.029798] MSR:  8280f033   
> CR: 28000402  XER: 
> [9.029825] IRQMASK: 0 
> [9.029825] GPR00: 00ea 759c0170 7fff94527100 
> 0001 
> [9.029825] GPR04:   0001 
>  
> [9.029825] GPR08:    
>  
> [9.029825] GPR12:  7fff9466af00  
>  
> [9.029825] GPR16:    
>  
> [9.029825] GPR20:  7fff94524f98 0002 
> 0001 
> [9.029825] GPR24: 7fff94520950  0001 
> 0001 
> [  

Re: nand: WARNING: a0000000.nand: the ECC used on your system (1b/256B) is too weak compared to the one required by the NAND chip (4b/512B)

2021-06-23 Thread Christophe Leroy




On 19/06/2021 at 20:40, Miquel Raynal wrote:

Hi Christophe,


Now and then I'm using one of the latest kernels (Today is 5.13-rc6), and 
sometime in one of the 5.x releases, I started to get errors like:

[5.098265] ecc_sw_hamming_correct: uncorrectable ECC error
[5.103859] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 60
bytes from PEB 99:59824, read only 60 bytes, retry
[5.525843] ecc_sw_hamming_correct: uncorrectable ECC error
[5.531571] ecc_sw_hamming_correct: uncorrectable ECC error
[5.537490] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 30
73 bytes from PEB 107:108976, read only 3073 bytes, retry
[5.691121] ecc_sw_hamming_correct: uncorrectable ECC error
[5.696709] ecc_sw_hamming_correct: uncorrectable ECC error
[5.702426] ecc_sw_hamming_correct: uncorrectable ECC error
[5.708141] ecc_sw_hamming_correct: uncorrectable ECC error
[5.714103] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 30
35 bytes from PEB 107:25144, read only 3035 bytes, retry
[   20.523689] random: crng init done
[   21.892130] ecc_sw_hamming_correct: uncorrectable ECC error
[   21.897730] ubi0 warning: ubi_io_read: error -74 (ECC error) while reading 13
94 bytes from PEB 116:75776, read only 1394 bytes, retry

Most of the time, when the reading of the file fails, I just have to read it 
once more and it gets read without that error.


It really looks like a regular bitflip happening "sometimes". Is this a
board which already had a life? What are the usage counters (UBI should
tell you this) compared to the official endurance of your chip (see the
datasheet)?


The board had a peaceful life:

UBI reports "ubi0: max/mean erase counter: 49/20, WL threshold: 4096"


Mmmh. Indeed.



I have tried with half a dozen boards and all have the issue.

   

What am I supposed to do to avoid the ECC weakness warning at startup and to
fix that ECC error issue?


I honestly don't think the errors come from the 5.1x kernels given the
above logs. If you flash back your old 4.14 I am pretty sure you'll
have the same errors at some point.


I don't have any problem like that with 4.14 on any of the boards.

When booting a 4.14 kernel I don't get any problem on the same board.



If you can reliably show that when returning to a 4.14 kernel the ECC
weakness disappears, then there is certainly something new. What driver
are you using? Maybe you can do a bisection?


Using the GPIO driver, and the NAND chip is a HYNIX.

I can say that the ECC weakness doesn't exist up to and including v5.5. The
weakness appears with v5.6.

I have tried bisecting between those two versions and couldn't reach a
reliable result. The closer you get to v5.5, the more difficult it is to
reproduce the issue.


So I looked at what was changed around those places, and in fact it's mainly
optimisation of the powerpc code. It seems that the more the powerpc code is
optimised, the more often the problem occurs.


Looking at the GPIO NAND driver, I saw the no-op gpio_nand_dosync() function.
By adding a memory barrier in that function, the ECC weakness disappeared
completely.
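
To illustrate the experiment (a sketch only, not a final fix; gpiomtd is the
driver's private structure in drivers/mtd/nand/raw/gpio.c):

	/* Previously a no-op on non-ARM builds: force completion/ordering
	 * of the control-line GPIO writes against the NAND data accesses.
	 */
	static void gpio_nand_dosync(struct gpiomtd *gpiomtd)
	{
		mb();	/* full barrier; something lighter may be enough */
	}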


Not sure what the final solution has to be.

Christophe


Re: [PATCH v15 2/4] kasan: allow architectures to provide an outline readiness check

2021-06-23 Thread Daniel Axtens
>> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
>> index 10177cc26d06..0ad615f3801d 100644
>> --- a/mm/kasan/common.c
>> +++ b/mm/kasan/common.c
>> @@ -331,6 +331,10 @@ static inline bool kasan_slab_free(struct 
>> kmem_cache *cache, void *object,
>> u8 tag;
>> void *tagged_object;
>>
>> +   /* Bail if the arch isn't ready */
>
> This comment brings no value. The fact that we bail is clear from the
> following line. The comment should explain why we bail.
>
>> +   if (!kasan_arch_is_ready())
>> +   return false;

Fair enough, I've just dropped the comments as I don't think there's
really a lot of scope for the generic/core comment to explain why a
particular architecture might not be ready.

> Have you considered including these checks into the high-level
> wrappers in include/linux/kasan.h? Would that work?

I don't think those wrappers will catch the outline check functions
like __asan_load*, which also need guarding.
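
As a minimal sketch of what guarding the outline path means (assuming
kasan_arch_is_ready() from this series and check_region_inline() in
mm/kasan/generic.c, which the compiler-emitted __asan_load*()/__asan_store*()
hooks funnel through; the body below elides the real checks):

	static __always_inline bool check_region_inline(unsigned long addr,
							size_t size, bool write,
							unsigned long ret_ip)
	{
		/* Bail before the arch has set up shadow memory */
		if (!kasan_arch_is_ready())
			return true;

		/* ... the usual shadow-memory checks follow ... */
		return true;
	}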

Kind regards,
Daniel


Re: [PATCH v14 00/12] Restricted DMA

2021-06-23 Thread Claire Chang
On Wed, Jun 23, 2021 at 4:38 PM Konrad Rzeszutek Wilk
 wrote:
>
> On Sat, Jun 19, 2021 at 11:40:31AM +0800, Claire Chang wrote:
> > This series implements mitigations for lack of DMA access control on
> > systems without an IOMMU, which could result in the DMA accessing the
> > system memory at unexpected times and/or unexpected addresses, possibly
> > leading to data leakage or corruption.
> >
> > For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> > not behind an IOMMU. As PCI-e, by design, gives the device full access to
> > system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> > to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> > full chain of exploits; [2], [3]).
> >
> > To mitigate the security concerns, we introduce restricted DMA. Restricted
> > DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> > specially allocated region and does memory allocation from the same region.
> > The feature on its own provides a basic level of protection against the DMA
> > overwriting buffer contents at unexpected times. However, to protect
> > against general data leakage and system memory corruption, the system needs
> > to provide a way to restrict the DMA to a predefined memory region (this is
> > usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
> >
> > [1a] 
> > https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
> > [1b] 
> > https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
> > [2] https://blade.tencent.com/en/advisories/qualpwn/
> > [3] 
> > https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
> > [4] 
> > https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
>
> Heya Claire,
>
> I put all your patches on
> https://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git/log/?h=devel/for-linus-5.14
>
> Please double-check that they all look ok.
>
> Thank you!

They look fine. Thank you!


Re: [PATCH v14 00/12] Restricted DMA

2021-06-23 Thread Konrad Rzeszutek Wilk
On Sat, Jun 19, 2021 at 11:40:31AM +0800, Claire Chang wrote:
> This series implements mitigations for lack of DMA access control on
> systems without an IOMMU, which could result in the DMA accessing the
> system memory at unexpected times and/or unexpected addresses, possibly
> leading to data leakage or corruption.
> 
> For example, we plan to use the PCI-e bus for Wi-Fi and that PCI-e bus is
> not behind an IOMMU. As PCI-e, by design, gives the device full access to
> system memory, a vulnerability in the Wi-Fi firmware could easily escalate
> to a full system exploit (remote wifi exploits: [1a], [1b] that shows a
> full chain of exploits; [2], [3]).
> 
> To mitigate the security concerns, we introduce restricted DMA. Restricted
> DMA utilizes the existing swiotlb to bounce streaming DMA in and out of a
> specially allocated region and does memory allocation from the same region.
> The feature on its own provides a basic level of protection against the DMA
> overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system needs
> to provide a way to restrict the DMA to a predefined memory region (this is
> usually done at firmware level, e.g. MPU in ATF on some ARM platforms [4]).
> 
> [1a] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_4.html
> [1b] 
> https://googleprojectzero.blogspot.com/2017/04/over-air-exploiting-broadcoms-wi-fi_11.html
> [2] https://blade.tencent.com/en/advisories/qualpwn/
> [3] 
> https://www.bleepingcomputer.com/news/security/vulnerabilities-found-in-highly-popular-firmware-for-wifi-chips/
> [4] 
> https://github.com/ARM-software/arm-trusted-firmware/blob/master/plat/mediatek/mt8183/drivers/emi_mpu/emi_mpu.c#L132
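
For context, the series ties a device to such a region via a reserved-memory
node in the device tree; a minimal sketch (node names, addresses, sizes, and
the &wifi_dev label are illustrative only):

	reserved-memory {
		#address-cells = <2>;
		#size-cells = <2>;
		ranges;

		/* pool the Wi-Fi device must bounce/allocate through */
		wifi_restricted_dma: restricted-dma@50000000 {
			compatible = "restricted-dma-pool";
			reg = <0x0 0x50000000 0x0 0x400000>;
		};
	};

	&wifi_dev {
		memory-region = <&wifi_restricted_dma>;
	};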

Heya Claire,

I put all your patches on
https://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git/log/?h=devel/for-linus-5.14

Please double-check that they all look ok.

Thank you!


Re: [PATCH v3 0/4] Add perf interface to expose nvdimm

2021-06-23 Thread Peter Zijlstra
On Wed, Jun 23, 2021 at 01:40:38PM +0530, kajoljain wrote:
> 
> 
> On 6/22/21 6:44 PM, Peter Zijlstra wrote:
> > On Thu, Jun 17, 2021 at 06:56:13PM +0530, Kajol Jain wrote:
> >> ---
> >> Kajol Jain (4):
> >>   drivers/nvdimm: Add nvdimm pmu structure
> >>   drivers/nvdimm: Add perf interface to expose nvdimm performance stats
> >>   powerpc/papr_scm: Add perf interface support
> >>   powerpc/papr_scm: Document papr_scm sysfs event format entries
> > 
> > Don't see anything obviously wrong with this one.
> > 
> > Acked-by: Peter Zijlstra (Intel) 
> > 
> 
> Hi Peter,
> Thanks for reviewing the patch. Can you help me with how to get
> these patches into Linus' tree, or can you take them?

I would expect either the NVDIMM or PPC maintainers to take this. Dan,
Michael ?


Re: [PATCH v3 0/4] Add perf interface to expose nvdimm

2021-06-23 Thread kajoljain



On 6/22/21 6:44 PM, Peter Zijlstra wrote:
> On Thu, Jun 17, 2021 at 06:56:13PM +0530, Kajol Jain wrote:
>> ---
>> Kajol Jain (4):
>>   drivers/nvdimm: Add nvdimm pmu structure
>>   drivers/nvdimm: Add perf interface to expose nvdimm performance stats
>>   powerpc/papr_scm: Add perf interface support
>>   powerpc/papr_scm: Document papr_scm sysfs event format entries
> 
> Don't see anything obviously wrong with this one.
> 
> Acked-by: Peter Zijlstra (Intel) 
> 

Hi Peter,
Thanks for reviewing the patch. Can you help me with how to get
these patches into Linus' tree, or can you take them?

Thanks,
Kajol Jain


Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Sachin Sant


>>> Could you try the patch below ? I have been able to reproduce the problem 
>>> locally and this
>>> fix it on my system:
>>> 
>> I can recreate the issue with this patch.
> 
> ok, so your problem seem to be different from my assumption. Could you try
> the patch below on top of the previous one ?
> 
> This will help us to confirm that the problem comes from load_avg and that
> it's linked to the cfs load_avg and it's not a problem happening earlier in
> the update of PELT.
> 

Indeed. With both the patches applied I see following warning related to 
load_avg

 Starting NTP client/server...
 Starting VDO volume services...
[9.029054] [ cut here ]
[9.029084] cfs_rq->avg.load_avg
[9.029111] WARNING: CPU: 21 PID: 1169 at kernel/sched/fair.c:3282 
update_blocked_averages+0x760/0x830
[9.029151] Modules linked in: pseries_rng xts vmx_crypto uio_pdrv_genirq 
uio sch_fq_codel ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg ibmvscsi 
ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod fuse
[9.029233] CPU: 21 PID: 1169 Comm: grep Not tainted 
5.13.0-rc7-next-20210621-dirty #3
[9.029246] NIP:  c01b6150 LR: c01b614c CTR: c0728f40
[9.029259] REGS: ce177650 TRAP: 0700   Not tainted  
(5.13.0-rc7-next-20210621-dirty)
[9.029271] MSR:  80029033   CR: 48088224  
XER: 0005
[9.029296] CFAR: c014d120 IRQMASK: 1 
[9.029296] GPR00: c01b614c ce1778f0 c29bb900 
0014 
[9.029296] GPR04: fffe ce1775b0 0027 
c0154f637e18 
[9.029296] GPR08: 0023 0001 0027 
c0167f1d7fe8 
[9.029296] GPR12: 8000 c0154ffe0e80 b820 
00021a2c6864 
[9.029296] GPR16: c000482cc000 c0154f6c2580 0001 
 
[9.029296] GPR20: c291a7f9 c000482cc100  
020d 
[9.029296] GPR24:  c0154f6c2f90 0001 
c00030b84400 
[9.029296] GPR28: 020d c000482cc1c0 0338 
 
[9.029481] NIP [c01b6150] update_blocked_averages+0x760/0x830
[9.029494] LR [c01b614c] update_blocked_averages+0x75c/0x830
[9.029508] Call Trace:
[9.029515] [ce1778f0] [c01b614c] 
update_blocked_averages+0x75c/0x830 (unreliable)
[9.029533] [ce177a20] [c01bd388] newidle_balance+0x258/0x5c0
[9.029542] [ce177ab0] [c01bd7cc] 
pick_next_task_fair+0x7c/0x4c0
[9.029574] [ce177b10] [c0cee3dc] __schedule+0x15c/0x1780
[9.029599] [ce177c50] [c01a5984] do_task_dead+0x64/0x70
[9.029622] [ce177c80] [c0156338] do_exit+0x848/0xcc0
[9.029646] [ce177d50] [c0156884] do_group_exit+0x64/0xe0
[9.029666] [ce177d90] [c0156924] sys_exit_group+0x24/0x30
[9.029688] [ce177db0] [c00310c0] 
system_call_exception+0x150/0x2d0
 Startin[9.029710] [gce177e10 Hardware Monito] 
[c000_common+0xec/0x2lling Sensors...
78
[9.029743] --- interrupt: c00 at 0x7fff943fddcc
[9.029758] NIP:  7fff943fddcc LR: 7fff94357f04 CTR: 
[9.029786] REGS: ce177e80 TRAP: 0c00   Not tainted  
(5.13.0-rc7-next-20210621-dirty)
[9.029798] MSR:  8280f033   CR: 
28000402  XER: 
[9.029825] IRQMASK: 0 
[9.029825] GPR00: 00ea 759c0170 7fff94527100 
0001 
[9.029825] GPR04:   0001 
 
[9.029825] GPR08:    
 
[9.029825] GPR12:  7fff9466af00  
 
[9.029825] GPR16:    
 
[9.029825] GPR20:  7fff94524f98 0002 
0001 
[9.029825] GPR24: 7fff94520950  0001 
0001 
[9.029825] GPR28:   7fff94663f10 
0001 
[9.029935] NIP [7fff943fddcc] 0x7fff943fddcc
[9.029944] LR [7fff94357f04] 0x7fff94357f04
[9.029952] --- interrupt: c00
[9.029959] Instruction dump:
[9.029966] 0fe0 4bfffc64 6000 6000 89340007 2f89 409efc38 
e8610098 
[9.029987] 3921 99340007 4bf96f71 6000 <0fe0> 4bfffc1c 6000 
6000 
[9.030013] ---[ end trace 3d7e3a29c9539d96 ]---
 Starting Authorization Manager…

Thanks
-Sachin


> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index da91db1c137f..8a6566f945a0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3030,8 +3030,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct 
> sche

Re: [powerpc][next-20210621] WARNING at kernel/sched/fair.c:3277 during boot

2021-06-23 Thread Vincent Guittot
Hi Sachin,

On Tuesday, 22 June 2021 at 21:29:36 (+0530), Sachin Sant wrote:
> >> On Tue, 22 Jun 2021 at 09:39, Sachin Sant  
> >> wrote:
> >>> 
> >>> While booting 5.13.0-rc7-next-20210621 on a PowerVM LPAR following warning
> >>> is seen
> >>> 
> >>> [   30.922154] [ cut here ]
> >>> [   30.922201] cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || 
> >>> cfs_rq->avg.runnable_avg
> >>> [   30.922219] WARNING: CPU: 6 PID: 762 at kernel/sched/fair.c:3277 
> >>> update_blocked_averages+0x758/0x780
> >> 
> >> Yes. That was exactly the purpose of the patch. There is one last
> >> remaining part which could generate this. I'm going to prepare a patch
> > 
> > Could you try the patch below ? I have been able to reproduce the problem 
> > locally and this
> > fix it on my system:
> > 
> I can recreate the issue with this patch.

ok, so your problem seem to be different from my assumption. Could you try
the patch below on top of the previous one ?

This will help us to confirm that the problem comes from load_avg and that
it's linked to the cfs load_avg and it's not a problem happening earlier in
the update of PELT.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da91db1c137f..8a6566f945a0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3030,8 +3030,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct 
sched_entity *se)
 static inline void
 enqueue_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+   u32 divider = get_pelt_divider(&se->avg);
cfs_rq->avg.load_avg += se->avg.load_avg;
-   cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;
+   cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * divider;
 }
 
 static inline void
@@ -3304,9 +3305,9 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq 
*cfs_rq)
 * Make sure that rounding and/or propagation of PELT values never
 * break this.
 */
-   SCHED_WARN_ON(cfs_rq->avg.load_avg ||
- cfs_rq->avg.util_avg ||
- cfs_rq->avg.runnable_avg);
+   SCHED_WARN_ON(cfs_rq->avg.load_avg);
+   SCHED_WARN_ON(cfs_rq->avg.util_avg);
+   SCHED_WARN_ON(cfs_rq->avg.runnable_avg);
 
return true;
 }


> 
>  Starting Terminate Plymouth Boot Screen...
>  Starting Hold until boot process finishes up...
> [FAILED] Failed to start Crash recovery kernel arming.
> See 'systemctl status kdump.service' for details.
> [   10.737913] [ cut here ]
> [   10.737960] cfs_rq->avg.load_avg || cfs_rq->avg.util_avg || 
> cfs_rq->avg.runnable_avg
> [   10.737976] WARNING: CPU: 27 PID: 146 at kernel/sched/fair.c:3279 
> update_blocked_averages+0x758/0x780
> [   10.738010] Modules linked in: stp llc rfkill sunrpc pseries_rng xts 
> vmx_crypto uio_pdrv_genirq uio sch_fq_codel ip_tables xfs libcrc32c sr_mod 
> sd_mod cdrom t10_pi sg ibmvscsi ibmveth scsi_transport_srp dm_mirror 
> dm_region_hash dm_log dm_mod fuse
> [   10.738089] CPU: 27 PID: 146 Comm: ksoftirqd/27 Not tainted 
> 5.13.0-rc7-next-20210621-dirty #2
> [   10.738103] NIP:  c01b2768 LR: c01b2764 CTR: 
> c0729120
> [   10.738116] REGS: c00015973840 TRAP: 0700   Not tainted  
> (5.13.0-rc7-next-20210621-dirty)
> [   10.738130] MSR:  8282b033   CR: 
> 48000224  XER: 0005
> [   10.738161] CFAR: c014d120 IRQMASK: 1 
> [   10.738161] GPR00: c01b2764 c00015973ae0 c29bb900 
> 0048 
> [   10.738161] GPR04: fffe c000159737a0 0027 
> c0154f9f7e18 
> [   10.738161] GPR08: 0023 0001 0027 
> c0167f1d7fe8 
> [   10.738161] GPR12:  c0154ffd7e80 c0154fa82580 
> b78a 
> [   10.738161] GPR16: 00028007883c 02ed c00038d31000 
>  
> [   10.738161] GPR20:  c29fdfe0  
> 037b 
> [   10.738161] GPR24:  c0154fa82f90 0001 
> c0003d4ca400 
> [   10.738161] GPR28: 02ed c00038d311c0 c00038d31100 
>  
> [   10.738281] NIP [c01b2768] update_blocked_averages+0x758/0x780
> [   10.738290] LR [c01b2764] update_blocked_averages+0x754/0x780
> [   10.738299] Call Trace:
> [   10.738303] [c00015973ae0] [c01b2764] 
> update_blocked_averages+0x754/0x780 (unreliable)
> [   10.738315] [c00015973c00] [c01be720] 
> run_rebalance_domains+0xa0/0xd0
> [   10.738326] [c00015973c30] [c0cf9acc] __do_softirq+0x15c/0x3d4
> [   10.738337] [c00015973d20] [c0158464] run_ksoftirqd+0x64/0x90
> [   10.738346] [c00015973d40] [c018fd24] 
> smpboot_thread_fn+0x204/0x270
> [   10.738357] [c00015973da0] [c0189770] kthread+0x190/0x1a0
> [   10.738367] [c00015973e10] [c000ceec] 
> ret_from_kernel_thread+0x5c/0x70
> [   10.738381] Instruction dump:
> [   10.73838