RE: [PATCH v2] device-dax: use fallback nid when numa node is invalid
Hi Dan, > -Original Message- > From: Dan Williams > Sent: Friday, September 10, 2021 11:42 PM > To: Justin He > Cc: Vishal Verma ; Dave Jiang > ; David Hildenbrand ; Linux NVDIMM > ; Linux Kernel Mailing List ker...@vger.kernel.org> > Subject: Re: [PATCH v2] device-dax: use fallback nid when numa node is > invalid > > On Fri, Sep 10, 2021 at 5:46 AM Jia He wrote: > > > > Previously, numa_off was set unconditionally in dummy_numa_init() > > even with a fake numa node. Then ACPI sets node id as NUMA_NO_NODE(-1) > > after acpi_map_pxm_to_node() because it regards numa_off as turning > > off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on > > arm64 with fake numa case. > > > > Without this patch, pmem can't be probed as RAM devices on arm64 if > > SRAT table isn't present: > > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g > -a 64K > > kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with > invalid node: -1 > > kmem: probe of dax0.0 failed with error -22 > > > > This fixes it by using fallback memory_add_physaddr_to_nid() as nid. > > > > Suggested-by: David Hildenbrand > > Signed-off-by: Jia He > > --- > > v2: - rebase it based on David's "memory group" patch. > > - drop the changes in dev_dax_kmem_remove() since nid had been > > removed in remove_memory(). > > drivers/dax/kmem.c | 31 +-- > > 1 file changed, 17 insertions(+), 14 deletions(-) > > > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > > index a37622060fff..e4836eb7539e 100644 > > --- a/drivers/dax/kmem.c > > +++ b/drivers/dax/kmem.c > > @@ -47,20 +47,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > > unsigned long total_len = 0; > > struct dax_kmem_data *data; > > int i, rc, mapped = 0; > > - int numa_node; > > - > > - /* > > -* Ensure good NUMA information for the persistent memory. > > -* Without this check, there is a risk that slow memory > > -* could be mixed in a node with faster memory, causing > > -* unavoidable performance issues. > > -*/ > > - numa_node = dev_dax->target_node; > > - if (numa_node < 0) { > > - dev_warn(dev, "rejecting DAX region with invalid > node: %d\n", > > - numa_node); > > - return -EINVAL; > > - } > > + int numa_node = dev_dax->target_node; > > > > for (i = 0; i < dev_dax->nr_range; i++) { > > struct range range; > > @@ -71,6 +58,22 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax) > > i, range.start, range.end); > > continue; > > } > > + > > + /* > > +* Ensure good NUMA information for the persistent > memory. > > +* Without this check, there is a risk but not fatal > that slow > > +* memory could be mixed in a node with faster memory, > causing > > +* unavoidable performance issues. Warn this and use > fallback > > +* node id. > > +*/ > > + if (numa_node < 0) { > > + int new_node = > memory_add_physaddr_to_nid(range.start); > > + > > + dev_info(dev, "changing nid from %d to %d for > DAX region [%#llx-%#llx]\n", > > +numa_node, new_node, range.start, > range.end); > > + numa_node = new_node; > > + } > > + > > total_len += range_len(&range); > > This fallback change belongs where the parent region for the namespace > adopts its target_node, because it's not clear > memory_add_physaddr_to_nid() is the right fallback in all situations. > Here is where this setting is happening currently: > > drivers/acpi/nfit/core.c:3004: ndr_desc->target_node = > pxm_to_node(spa->proximity_domain); On my local arm64 guest('virt' machine type), the target_node is set to -1 at this line. 
That is, the condition "spa->flags & ACPI_NFIT_PROXIMITY_VALID" is hit. > drivers/acpi/nfit/core.c:3007: ndr_desc->target_node = > NUMA_NO_NODE; > drivers/nvdimm/e820.c:29: ndr_desc.target_node = nid; > drivers/nvdimm/of_pmem.c:58:ndr_desc.t
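For reference, a minimal sketch of what moving the fallback into the region registration (as Dan suggests) might look like, around the nfit lines quoted above. This is illustrative only, not merged code; whether memory_add_physaddr_to_nid() is the right fallback at this spot is exactly the open question:

	/* sketch only: fall back where the region adopts its target_node;
	 * using memory_add_physaddr_to_nid() here is an assumption */
	if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) {
		ndr_desc->target_node = pxm_to_node(spa->proximity_domain);
		if (ndr_desc->target_node == NUMA_NO_NODE)
			ndr_desc->target_node =
				memory_add_physaddr_to_nid(spa->address);
	} else {
		ndr_desc->target_node = NUMA_NO_NODE;
	}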
RE: [PATCH v2] device-dax: use fallback nid when numa node is invalid
> -Original Message- > From: Dan Williams > Sent: Wednesday, September 15, 2021 1:16 PM > To: Justin He > Cc: Vishal Verma ; Dave Jiang > ; David Hildenbrand ; Linux NVDIMM > ; Linux Kernel Mailing List ker...@vger.kernel.org>; nd > Subject: Re: [PATCH v2] device-dax: use fallback nid when numa node is > invalid > > On Mon, Sep 13, 2021 at 7:06 PM Justin He wrote: > > > > Hi Dan, > > > > > -Original Message- > > > From: Dan Williams > > > Sent: Friday, September 10, 2021 11:42 PM > > > To: Justin He > > > Cc: Vishal Verma ; Dave Jiang > > > ; David Hildenbrand ; Linux > NVDIMM > > > ; Linux Kernel Mailing List > > ker...@vger.kernel.org> > > > Subject: Re: [PATCH v2] device-dax: use fallback nid when numa node is > > > invalid > > > > > > On Fri, Sep 10, 2021 at 5:46 AM Jia He wrote: > > > > > > > > Previously, numa_off was set unconditionally in dummy_numa_init() > > > > even with a fake numa node. Then ACPI sets node id as NUMA_NO_NODE(-1) > > > > after acpi_map_pxm_to_node() because it regards numa_off as turning > > > > off the numa node. Hence dev_dax->target_node is NUMA_NO_NODE on > > > > arm64 with fake numa case. > > > > > > > > Without this patch, pmem can't be probed as RAM devices on arm64 if > > > > SRAT table isn't present: > > > > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s > 1g > > > -a 64K > > > > kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] > with > > > invalid node: -1 > > > > kmem: probe of dax0.0 failed with error -22 > > > > > > > > This fixes it by using fallback memory_add_physaddr_to_nid() as nid. > > > > > > > > Suggested-by: David Hildenbrand > > > > Signed-off-by: Jia He > > > > --- > > > > v2: - rebase it based on David's "memory group" patch. > > > > - drop the changes in dev_dax_kmem_remove() since nid had been > > > > removed in remove_memory(). > > > > drivers/dax/kmem.c | 31 +-- > > > > 1 file changed, 17 insertions(+), 14 deletions(-) > > > > > > > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > > > > index a37622060fff..e4836eb7539e 100644 > > > > --- a/drivers/dax/kmem.c > > > > +++ b/drivers/dax/kmem.c > > > > @@ -47,20 +47,7 @@ static int dev_dax_kmem_probe(struct dev_dax > *dev_dax) > > > > unsigned long total_len = 0; > > > > struct dax_kmem_data *data; > > > > int i, rc, mapped = 0; > > > > - int numa_node; > > > > - > > > > - /* > > > > -* Ensure good NUMA information for the persistent memory. > > > > -* Without this check, there is a risk that slow memory > > > > -* could be mixed in a node with faster memory, causing > > > > -* unavoidable performance issues. > > > > -*/ > > > > - numa_node = dev_dax->target_node; > > > > - if (numa_node < 0) { > > > > - dev_warn(dev, "rejecting DAX region with invalid > > > node: %d\n", > > > > - numa_node); > > > > - return -EINVAL; > > > > - } > > > > + int numa_node = dev_dax->target_node; > > > > > > > > for (i = 0; i < dev_dax->nr_range; i++) { > > > > struct range range; > > > > @@ -71,6 +58,22 @@ static int dev_dax_kmem_probe(struct dev_dax > *dev_dax) > > > > i, range.start, range.end); > > > > continue; > > > > } > > > > + > > > > + /* > > > > +* Ensure good NUMA information for the persistent > > > memory. > > > > +* Without this check, there is a risk but not fatal > > > that slow > > > > +* memory could be mixed in a node with faster memory, > > > causing > > > > +* unavoidable performance issues. Warn this and use > > > fallback > > > > +* node id. > > > > +*/ > > > > + if (numa_
RE: [PATCH v3] virtio_vsock: Fix race condition in virtio_transport_recv_pkt()
Hi Markus > -Original Message- > From: Markus Elfring > Sent: Saturday, May 30, 2020 6:41 PM > To: Justin He ; k...@vger.kernel.org; > net...@vger.kernel.org; virtualizat...@lists.linux-foundation.org > Cc: kernel-janit...@vger.kernel.org; linux-kernel@vger.kernel.org; > sta...@vger.kernel.org; David S. Miller ; Jakub > Kicinski ; Kaly Xin ; Stefan Hajnoczi > ; Stefano Garzarella > Subject: Re: [PATCH v3] virtio_vsock: Fix race condition in > virtio_transport_recv_pkt > > > This fixes it by checking sk->sk_shutdown(suggested by Stefano) after > > lock_sock since sk->sk_shutdown is set to SHUTDOWN_MASK under the > > protection of lock_sock_nested. > > How do you think about a wording variant like the following? > > Thus check the data structure member “sk_shutdown” (suggested by Stefano) > after a call of the function “lock_sock” since this field is set to > “SHUTDOWN_MASK” under the protection of “lock_sock_nested”. > Okay, will update the commit msg. > > Would you like to add the tag “Fixes” to the commit message? Sure. Thanks -- Cheers, Justin (Jia He)
RE: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
> -Original Message- > From: David Hildenbrand > Sent: Wednesday, July 29, 2020 5:35 PM > To: Mike Rapoport ; Justin He > Cc: Dan Williams ; Vishal Verma > ; Catalin Marinas ; > Will Deacon ; Greg Kroah-Hartman > ; Rafael J. Wysocki ; Dave > Jiang ; Andrew Morton ; > Steve Capper ; Mark Rutland ; > Logan Gunthorpe ; Anshuman Khandual > ; Hsin-Yi Wang ; Jason > Gunthorpe ; Dave Hansen ; Kees > Cook ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; linux-nvd...@lists.01.org; linux...@kvack.org; Wei > Yang ; Pankaj Gupta > ; Ira Weiny ; Kaly Xin > > Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem > alignment > > On 29.07.20 11:31, Mike Rapoport wrote: > > Hi Justin, > > > > On Wed, Jul 29, 2020 at 08:27:58AM +, Justin He wrote: > >> Hi David > >>>> > >>>> Without this series, if qemu creates a 4G bytes nvdimm device, we can > >>> only > >>>> use 2G bytes for dax pmem(kmem) in the worst case. > >>>> e.g. > >>>> 24000-33fdf : Persistent Memory > >>>> We can only use the memblock between [24000, 2] due to > the > >>> hard > >>>> limitation. It wastes too much memory space. > >>>> > >>>> Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, > but > >>> there > >>>> are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, > >>>> SPARSEMEM_VMEMMAP, page bits in struct page ... > >>>> > >>>> Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem > >>> alignment > >>>> with memory_block_size_bytes(). > >>>> > >>>> Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. > dax > >>> pmem > >>>> can be used as ram with smaller gap. Also the kmem hotplug add/remove > >>> are both > >>>> tested on arm64/x86 guest. > >>>> > >>> > >>> Hi, > >>> > >>> I am not convinced this use case is worth such hacks (that’s what it > is) > >>> for now. On real machines pmem is big - your example (losing 50% is > >>> extreme). > >>> > >>> I would much rather want to see the section size on arm64 reduced. I > >>> remember there were patches and that at least with a base page size of > 4k > >>> it can be reduced drastically (64k base pages are more problematic due > to > >>> the ridiculous THP size of 512M). But could be a section size of 512 > is > >>> possible on all configs right now. > >> > >> Yes, I once investigated how to reduce section size on arm64 > thoughtfully: > >> There are many constraints for reducing SECTION_SIZE_BITS > >> 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be > reduced too > >>much. > >> 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be > counted > >>into page->flags. > >> 3. MAX_ORDER depends on SECTION_SIZE_BITS > >> - 3.1 mmzone.h > >> #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS > >> #error Allocator MAX_ORDER exceeds SECTION_SIZE > >> #endif > >> - 3.2 hugepage_init() > >> MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER); > >> > >> Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled, > >> SECTION_SIZE_BITS can be reduced to 27. > >> But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13. > >> Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS > can not > >> be reduced to 27. > >> > >> In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the > Kconfig > >> might be very complicated,e.g. we still need to consider the case for > >> ARM64_16K_PAGES. > > > > It is not necessary to pollute Kconfig with that. 
> > arch/arm64/include/asm/sparesemem.h can have something like > > > > #ifdef CONFIG_ARM64_64K_PAGES > > #define SPARSE_SECTION_SIZE 29 > > #elif defined(CONFIG_ARM16K_PAGES) > > #define SPARSE_SECTION_SIZE 28 > > #elif defined(CONFIG_ARM4K_PAGES) > > #define SPARSE_SECTION_SIZE 27 > > #else > > #error > > #endif > > ack Thanks, David and Mike. Will discuss it further with Arm internally about a careful section_size change -- Cheers, Justin (Jia He)
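For clarity, Mike's sketch spelled out with the actual arm64 Kconfig symbols (the quoted snippet abbreviates CONFIG_ARM64_16K_PAGES and CONFIG_ARM64_4K_PAGES) and with the existing SECTION_SIZE_BITS name; the exact values are still the subject of the constraints discussed above:

	/* arch/arm64/include/asm/sparsemem.h, sketch only */
	#ifdef CONFIG_ARM64_64K_PAGES
	#define SECTION_SIZE_BITS	29
	#elif defined(CONFIG_ARM64_16K_PAGES)
	#define SECTION_SIZE_BITS	28
	#elif defined(CONFIG_ARM64_4K_PAGES)
	#define SECTION_SIZE_BITS	27
	#else
	#error "Unexpected page size"
	#endif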
RE: [RFC PATCH 0/2] Avoid booting stall caused by idmap_kpti_install_ng_mappings
Hi, Kindly ping 😊 > -Original Message- > From: Jia He > Sent: Wednesday, January 13, 2021 9:41 AM > To: Catalin Marinas ; Will Deacon > ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org > Cc: Anshuman Khandual ; Suzuki Poulose > ; Justin He ; Mark Rutland > ; Gustavo A. R. Silva ; > Richard Henderson ; Dave P Martin > ; Steven Price ; Andrew Morton > ; Mike Rapoport ; Ard > Biesheuvel ; Gavin Shan ; Kefeng Wang > ; Mark Brown ; Marc Zyngier > ; Cristian Marussi > Subject: [RFC PATCH 0/2] Avoid booting stall caused by > > There is a 10s stall in idmap_kpti_install_ng_mappings when the kernel boots > on an Ampere EMAG server. > > Commit f992b4dfd58b ("arm64: kpti: Add ->enable callback to remap > swapper using nG mappings") updates the nG bit at runtime if kpti is > required. > > But things get worse if rodata=full in map_mem(). NO_BLOCK_MAPPINGS | > NO_CONT_MAPPINGS is required when creating the pagetable mapping. Hence all > ptes are fully mapped in this case. On an Ampere EMAG server with 256G > memory (pagesize=4k), it causes the 10s stall. > > After moving init_cpu_features() ahead of early_fixmap_init(), we can use > cpu_have_const_cap earlier than before. Hence we can avoid this stall > by updating arm64_use_ng_mappings. > > After this patch series, it reduces the kernel boot time from 14.7s to > 4.1s: > Before: > [ 14.757569] Freeing initrd memory: 60752K > After: > [4.138819] Freeing initrd memory: 60752K > > Set it as RFC because I want to resolve any other points which I may have > misjudged. > > Jia He (2): > arm64/cpuinfo: Move init_cpu_features() ahead of early_fixmap_init() > arm64: kpti: Update arm64_use_ng_mappings before pagetable mapping > > arch/arm64/include/asm/cpu.h | 1 + > arch/arm64/kernel/cpuinfo.c | 13 ++--- > arch/arm64/kernel/setup.c| 18 +- > arch/arm64/kernel/smp.c | 3 +-- > 4 files changed, 25 insertions(+), 10 deletions(-) > > -- > 2.17.1
RE: [PATCH] KVM: arm64: Fix unaligned addr case in mmu walking
Hi Quentin and Marc I noticed Marc had sent out new version on behalf of me, thanks for the help. I hated the time difference, sorry for the late. Just answer the comments below to make it clear. > -Original Message- > From: Quentin Perret > Sent: Wednesday, March 3, 2021 7:09 PM > To: Marc Zyngier > Cc: Justin He ; kvm...@lists.cs.columbia.edu; James > Morse ; Julien Thierry ; > Suzuki Poulose ; Catalin Marinas > ; Will Deacon ; Gavin Shan > ; Yanan Wang ; linux-arm- > ker...@lists.infradead.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] KVM: arm64: Fix unaligned addr case in mmu walking > > On Wednesday 03 Mar 2021 at 09:54:25 (+), Marc Zyngier wrote: > > Hi Jia, > > > > On Wed, 03 Mar 2021 02:42:25 +, > > Jia He wrote: > > > > > > If the start addr is not aligned with the granule size of that level. > > > loop step size should be adjusted to boundary instead of simple > > > kvm_granual_size(level) increment. Otherwise, some mmu entries might > miss > > > the chance to be walked through. > > > E.g. Assume the unmap range [data->addr, data->end] is > > > [0xff00ab2000,0xff00cb2000] in level 2 walking and NOT block mapping. > > > > When does this occur? Upgrade from page mappings to block? Swap out? > > > > > And the 1st part of that pmd entry is [0xff00ab2000,0xff00c0]. The > > > pmd value is 0x83fbd2c1002 (not valid entry). In this case, data->addr > > > should be adjusted to 0xff00c0 instead of 0xff00cb2000. > > > > Let me see if I understand this. Assuming 4k pages, the region > > described above spans *two* 2M entries: > > > > (a) ff00ab2000-ff00c0, part of ff00a0-ff00c0 > > (b) ff00c0-ff00db2000, part of ff00c0-ff00e0 > > > > (a) has no valid mapping, but (b) does. Because we fail to correctly > > align on a block boundary when skipping (a), we also skip (b), which > > is then left mapped. > > > > Did I get it right? If so, yes, this is... annoying. > > Yes, exactly the case > > Understanding the circumstances this triggers in would be most > > interesting. This current code seems to assume that we get ranges > > aligned to mapping boundaries, but I seem to remember that the old > > code did use the stage2_*_addr_end() helpers to deal with this case. > > > > Will: I don't think things have changed in that respect, right? > > Indeed we should still use stage2_*_addr_end(), especially in the unmap > path that is mentioned here, so it would be helpful to have a little bit > more context. Yes, stage2_pgd_addr_end() was still there but the stage2_pmd_addr_end() was removed. > > > > Without this fix, userspace "segment fault" error can be easily > > > triggered by running simple gVisor runsc cases on an Ampere Altra > > > server: > > > docker run --runtime=runsc -it --rm ubuntu /bin/bash > > > > > > In container: > > > for i in `seq 1 100`;do ls;done > > > > The workload on its own isn't that interesting. What I'd like to > > understand is what happens on the host during that time. 
Okay > > > > > > > > Reported-by: Howard Zhang > > > Signed-off-by: Jia He > > > --- > > > arch/arm64/kvm/hyp/pgtable.c | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c > b/arch/arm64/kvm/hyp/pgtable.c > > > index bdf8e55ed308..4d99d07c610c 100644 > > > --- a/arch/arm64/kvm/hyp/pgtable.c > > > +++ b/arch/arm64/kvm/hyp/pgtable.c > > > @@ -225,6 +225,7 @@ static inline int __kvm_pgtable_visit(struct > kvm_pgtable_walk_data *data, > > > goto out; > > > > > > if (!table) { > > > + data->addr = ALIGN_DOWN(data->addr, kvm_granule_size(level)); > > > data->addr += kvm_granule_size(level); > > > goto out; > > > } > > > > It otherwise looks good to me. Quentin, Will: unless you object to > > this, I plan to take it in the next round of fixes with > > Though I'm still unsure how we hit that today, the change makes sense on > its own I think, so no objection from me. > > Thanks, > Quentin
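To make the arithmetic in the example above concrete, here is a tiny standalone demo (userspace C, illustration only) of the two stepping strategies at the 2M level-2 granule:

	#include <stdio.h>

	#define GRANULE			0x200000UL /* 2M: level-2 granule, 4k pages */
	#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))

	int main(void)
	{
		unsigned long addr = 0xff00ab2000UL; /* misaligned start from the example */

		/* buggy step: the misalignment propagates, so entry (b) is skipped */
		printf("buggy next: %#lx\n", addr + GRANULE);	/* 0xff00cb2000 */

		/* fixed step: snap down to the boundary, then advance one granule */
		printf("fixed next: %#lx\n",
		       ALIGN_DOWN(addr, GRANULE) + GRANULE);	/* 0xff00c00000 */
		return 0;
	}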
RE: [PATCH] KVM: arm64: Fix unaligned addr case in mmu walking
Hi Marc > -Original Message- > From: Will Deacon > Sent: Thursday, March 4, 2021 5:13 AM > To: Marc Zyngier > Cc: Justin He ; kvm...@lists.cs.columbia.edu; James > Morse ; Julien Thierry ; > Suzuki Poulose ; Catalin Marinas > ; Gavin Shan ; Yanan Wang > ; Quentin Perret ; linux-arm- > ker...@lists.infradead.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] KVM: arm64: Fix unaligned addr case in mmu walking > > On Wed, Mar 03, 2021 at 07:07:37PM +, Marc Zyngier wrote: > > From e0524b41a71e0f17d6dc8f197e421e677d584e72 Mon Sep 17 00:00:00 2001 > > From: Jia He > > Date: Wed, 3 Mar 2021 10:42:25 +0800 > > Subject: [PATCH] KVM: arm64: Fix range alignment when walking page tables > > > > When walking the page tables at a given level, and if the start > > address for the range isn't aligned for that level, we propagate > > the misalignment on each iteration at that level. > > > > This results in the walker ignoring a number of entries (depending > > on the original misalignment) on each subsequent iteration. > > > > Properly aligning the address at the before the next iteration > > "at the before the next" ??? > > > addresses the issue. > > > > Cc: sta...@vger.kernel.org > > Reported-by: Howard Zhang > > Signed-off-by: Jia He > > Fixes: b1e57de62cfb ("KVM: arm64: Add stand-alone page-table walker > infrastructure") > > [maz: rewrite commit message] > > Signed-off-by: Marc Zyngier > > Link: https://lore.kernel.org/r/20210303024225.2591-1-justin...@arm.com > > --- > > arch/arm64/kvm/hyp/pgtable.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c > > index 4d177ce1d536..124cd2f93020 100644 > > --- a/arch/arm64/kvm/hyp/pgtable.c > > +++ b/arch/arm64/kvm/hyp/pgtable.c > > @@ -223,7 +223,7 @@ static inline int __kvm_pgtable_visit(struct > kvm_pgtable_walk_data *data, > > goto out; > > > > if (!table) { > > - data->addr += kvm_granule_size(level); > > + data->addr = ALIGN(data->addr, kvm_granule_size(level)); What if the previous data->addr is already aligned with kvm_granule_size(level)? Then ALIGN() would not advance the address, so wouldn't this be an infinite loop? Am I missing anything else? -- Cheers, Justin (Jia He) > > goto out; > > } > > If Jia is happy with it, please feel free to add: > > Acked-by: Will Deacon > > Will
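Justin's concern checks out arithmetically: ALIGN() rounds up, so on an already-aligned address it is a no-op and the walker would stop advancing. A quick illustration with userspace macros mirroring the kernel's:

	#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
	#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
	#define GRANULE			0x200000UL

	unsigned long addr = 0xff00c00000UL;	/* already 2M-aligned */

	/* ALIGN() alone does not advance: result is still 0xff00c00000 */
	unsigned long next_bad  = ALIGN(addr, GRANULE);

	/* ALIGN_DOWN() plus one granule always advances: 0xff00e00000 */
	unsigned long next_good = ALIGN_DOWN(addr, GRANULE) + GRANULE;

This is why the ALIGN_DOWN-then-increment form from the original patch is the safe variant.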
RE: [PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()
Hi Peter > -Original Message- > From: Peter Xu > Sent: Wednesday, November 25, 2020 2:12 AM > To: Justin He > Cc: Alex Williamson ; Cornelia Huck > ; k...@vger.kernel.org; linux-kernel@vger.kernel.org > Subject: Re: [PATCH] vfio iommu type1: Bypass the vma permission check in > vfio_pin_pages_remote() > > Hi, Jia, > > On Thu, Nov 19, 2020 at 10:27:37PM +0800, Jia He wrote: > > The permission of vfio iommu is different and incompatible with vma > > permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will > > simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't > > map a dma region with NONE permission. > > > > This corner case will be exposed in coming virtio_fs cache_size > > commit [1] > > - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > >memory_region_init_ram_ptr() > > - re-mmap the above area with read/write authority. > > If iiuc here we'll remap the above PROT_NONE into PROT_READ|PROT_WRITE, > then... > > > - vfio_dma_map() will be invoked when vfio device is hotplug added. > > ... here I'm slightly confused on why VFIO_IOMMU_MAP_DMA would encounter > vma > check fail - aren't they already get rw permissions? No, we haven't got the vma rw permission yet; the iommu permission in this case simply defaults to rw. When the qemu side invokes vfio_dma_map(), the iommu rw permission is added automatically [1] [2] (currently, mapping a NONE region is not supported in qemu vfio). [1] https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=6ff1daa763f87a1ed5351bcc19aeb027c43b8a8f;hb=HEAD#l479 [2] https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=6ff1daa763f87a1ed5351bcc19aeb027c43b8a8f;hb=HEAD#l486 But on the kernel side, the vma was created with PROT_NONE, so the check in check_vma_flags() at [3] fails. [3] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/mm/gup.c#n929 > > I'd appreciate if you could explain why vfio needs to dma map some > PROT_NONE Virtiofs maps a PROT_NONE cache window region first, then remaps sub-regions of that cache window with read or write permission. I guess this might be a security concern. CC'ing virtiofs expert Stefan to answer it more accurately. -- Cheers, Justin (Jia He) > pages after all, and whether QEMU would be able to postpone the vfio map of > those PROT_NONE pages until they got to become with RW permissions. > > Thanks, > > -- > Peter Xu
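A sketch of the virtiofs DAX-window pattern Justin describes (names and parameters are hypothetical, not the actual QEMU code):

	#include <stddef.h>
	#include <sys/mman.h>

	/* 1. reserve the whole cache window with no permissions */
	static void *reserve_cache_window(size_t cache_size)
	{
		return mmap(NULL, cache_size, PROT_NONE,
			    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	}

	/* 2. later, remap a sub-region in place with real permissions;
	 *    the rest of the window stays PROT_NONE, which is what the
	 *    vma check in the pin path then trips over */
	static void *map_chunk(void *cache, size_t off, size_t len, int fd)
	{
		return mmap((char *)cache + off, len, PROT_READ | PROT_WRITE,
			    MAP_FIXED | MAP_SHARED, fd, 0);
	}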
RE: [RFC PATCH 0/2] Avoid booting stall caused by idmap_kpti_install_ng_mappings
Hi Marc > -Original Message- > From: Marc Zyngier > Sent: Wednesday, January 20, 2021 6:58 PM > To: Justin He > Cc: Catalin Marinas ; Will Deacon > ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; Anshuman Khandual ; > Suzuki Poulose ; Mark Rutland > ; Gustavo A. R. Silva ; > Richard Henderson ; Dave P Martin > ; Steven Price ; Andrew Morton > ; Mike Rapoport ; Ard > Biesheuvel ; Gavin Shan ; Kefeng Wang > ; Mark Brown ; Cristian > Marussi > Subject: Re: [RFC PATCH 0/2] Avoid booting stall caused by > idmap_kpti_install_ng_mappings > > Hi Justin, > > On 2021-01-20 04:51, Justin He wrote: > > Hi, > > Kindly ping 😊 > > > >> -Original Message- > >> From: Jia He > >> Sent: Wednesday, January 13, 2021 9:41 AM > >> To: Catalin Marinas ; Will Deacon > >> ; linux-arm-ker...@lists.infradead.org; linux- > >> ker...@vger.kernel.org > >> Cc: Anshuman Khandual ; Suzuki Poulose > >> ; Justin He ; Mark Rutland > >> ; Gustavo A. R. Silva ; > >> Richard Henderson ; Dave P Martin > >> ; Steven Price ; Andrew > >> Morton > >> ; Mike Rapoport ; Ard > >> Biesheuvel ; Gavin Shan ; Kefeng > >> Wang > >> ; Mark Brown ; Marc > >> Zyngier > >> ; Cristian Marussi > >> Subject: [RFC PATCH 0/2] Avoid booting stall caused by > >> > >> There is a 10s stall in idmap_kpti_install_ng_mappings when kernel > >> boots > >> on a Ampere EMAG server. > >> > >> Commit f992b4dfd58b ("arm64: kpti: Add ->enable callback to remap > >> swapper using nG mappings") updates the nG bit runtime if kpti is > >> required. > >> > >> But things get worse if rodata=full in map_mem(). NO_BLOCK_MAPPINGS | > >> NO_CONT_MAPPINGS is required when creating pagetable mapping. Hence > >> all > >> ptes are fully mapped in this case. On a Ampere EMAG server with 256G > >> memory(pagesize=4k), it causes the 10s stall. > >> > >> After moving init_cpu_features() ahead of early_fixmap_init(), we can > >> use > >> cpu_have_const_cap earlier than before. Hence we can avoid this stall > >> by updating arm64_use_ng_mappings. > >> > >> After this patch series, it reduces the kernel boot time from 14.7s to > >> 4.1s: > >> Before: > >> [ 14.757569] Freeing initrd memory: 60752K > >> After: > >> [4.138819] Freeing initrd memory: 60752K > >> > >> Set it as RFC because I want to resolve any other points which I have > >> misconerned. > > But you don't really explain *why* having the CPU Feature discovery > early helps at all. Is that so that you can bypass the idmap mapping? Adding nG bits can be avoided by having the discovery of boot cpu feature earlier since the nG bit had been set in PTE_MAYBE_NG/PMD_MAYBE_NG Before this patch: 1. kernel will firstly create mapping in setup_arch->paging_init->map_mem -> __map_memblock 2. Then if kpti is required, kernel will add nG bits for each pte entry. 3. In extreme case, e.g. physical memory is 256G,rodata=full, and pagesize is 4K, the nG bits updating in step 2 takes about 10s. > I'd expect something that explain the problem instead of paraphrasing > the patches. > > Another thing is whether you have tested this on some ThunderX HW I will find a TX1 as you told to see any difference. -- Cheers, Justin (Jia He) > (the first version, not TX2), as this is the whole reason for this > code... > > Thanks, > > M. > -- > Jazz is not dead. It just smells funny...
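For context on why early discovery is enough: once arm64_use_ng_mappings has its final value before paging_init(), the nG decision is folded into the initial page-table attributes through the MAYBE_NG macros rather than patched in afterwards. From arch/arm64/include/asm/pgtable-prot.h (kernels of this era):

	#define PTE_MAYBE_NG		(arm64_use_ng_mappings ? PTE_NG : 0)
	#define PMD_MAYBE_NG		(arm64_use_ng_mappings ? PMD_SECT_NG : 0)
	#define PROT_DEFAULT		(_PROT_DEFAULT | PTE_MAYBE_NG)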
RE: [RFC PATCH 0/2] Avoid booting stall caused by idmap_kpti_install_ng_mappings
Hi Marc > -Original Message- > From: Justin He > Sent: Wednesday, January 20, 2021 11:56 PM > To: Marc Zyngier > Cc: Catalin Marinas ; Will Deacon > ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; Anshuman Khandual ; > Suzuki Poulose ; Mark Rutland > ; Gustavo A. R. Silva ; > Richard Henderson ; Dave P Martin > ; Steven Price ; Andrew Morton > ; Mike Rapoport ; Ard > Biesheuvel ; Gavin Shan ; Kefeng Wang > ; Mark Brown ; Cristian > Marussi > Subject: RE: [RFC PATCH 0/2] Avoid booting stall caused by > idmap_kpti_install_ng_mappings > > Hi Marc > > > -Original Message- > > From: Marc Zyngier > > Sent: Wednesday, January 20, 2021 6:58 PM > > To: Justin He > > Cc: Catalin Marinas ; Will Deacon > > ; linux-arm-ker...@lists.infradead.org; linux- > > ker...@vger.kernel.org; Anshuman Khandual ; > > Suzuki Poulose ; Mark Rutland > > ; Gustavo A. R. Silva ; > > Richard Henderson ; Dave P Martin > > ; Steven Price ; Andrew Morton > > ; Mike Rapoport ; Ard > > Biesheuvel ; Gavin Shan ; Kefeng Wang > > ; Mark Brown ; Cristian > > Marussi > > Subject: Re: [RFC PATCH 0/2] Avoid booting stall caused by > > idmap_kpti_install_ng_mappings > > > > Hi Justin, > > > > On 2021-01-20 04:51, Justin He wrote: > > > Hi, > > > Kindly ping 😊 > > > > > >> -Original Message- > > >> From: Jia He > > >> Sent: Wednesday, January 13, 2021 9:41 AM > > >> To: Catalin Marinas ; Will Deacon > > >> ; linux-arm-ker...@lists.infradead.org; linux- > > >> ker...@vger.kernel.org > > >> Cc: Anshuman Khandual ; Suzuki Poulose > > >> ; Justin He ; Mark Rutland > > >> ; Gustavo A. R. Silva ; > > >> Richard Henderson ; Dave P Martin > > >> ; Steven Price ; Andrew > > >> Morton > > >> ; Mike Rapoport ; Ard > > >> Biesheuvel ; Gavin Shan ; Kefeng > > >> Wang > > >> ; Mark Brown ; Marc > > >> Zyngier > > >> ; Cristian Marussi > > >> Subject: [RFC PATCH 0/2] Avoid booting stall caused by > > >> > > >> There is a 10s stall in idmap_kpti_install_ng_mappings when kernel > > >> boots > > >> on a Ampere EMAG server. > > >> > > >> Commit f992b4dfd58b ("arm64: kpti: Add ->enable callback to remap > > >> swapper using nG mappings") updates the nG bit runtime if kpti is > > >> required. > > >> > > >> But things get worse if rodata=full in map_mem(). NO_BLOCK_MAPPINGS | > > >> NO_CONT_MAPPINGS is required when creating pagetable mapping. Hence > > >> all > > >> ptes are fully mapped in this case. On a Ampere EMAG server with 256G > > >> memory(pagesize=4k), it causes the 10s stall. > > >> > > >> After moving init_cpu_features() ahead of early_fixmap_init(), we can > > >> use > > >> cpu_have_const_cap earlier than before. Hence we can avoid this stall > > >> by updating arm64_use_ng_mappings. > > >> > > >> After this patch series, it reduces the kernel boot time from 14.7s to > > >> 4.1s: > > >> Before: > > >> [ 14.757569] Freeing initrd memory: 60752K > > >> After: > > >> [4.138819] Freeing initrd memory: 60752K > > >> > > >> Set it as RFC because I want to resolve any other points which I have > > >> misconerned. > > > > But you don't really explain *why* having the CPU Feature discovery > > early helps at all. Is that so that you can bypass the idmap mapping? > > Adding nG bits can be avoided by having the discovery of boot cpu feature > earlier since the nG bit had been set in PTE_MAYBE_NG/PMD_MAYBE_NG > > Before this patch: > 1. kernel will firstly create mapping in setup_arch->paging_init->map_mem > -> __map_memblock > 2. Then if kpti is required, kernel will add nG bits for each pte entry. > 3. In extreme case, e.g. 
physical memory is 256G,rodata=full, and pagesize > is 4K, the nG bits updating in step 2 takes about 10s. > > > I'd expect something that explain the problem instead of paraphrasing > > the patches. > > > > Another thing is whether you have tested this on some ThunderX HW > > I will find a TX1 as you told to see any difference. > > Fortunately, I found a Cavium TX1. It seems that unmap_kernel_at_el0 is false: ... [0.00] Machine model: Cavium ThunderX CN88XX board ... [0.00] CPU features: kernel page table isolation forced OFF by ARM64_WORKAROUND_CAVIUM_27456 ... Hence there is no such stall either *before* or *after* this patch set, because kpti is not enabled. -- Cheers, Justin (Jia He)
RE: [PATCH net] vsock/virtio: discard packets only when socket is really closed
> -Original Message- > From: Stefano Garzarella > Sent: Friday, November 20, 2020 6:48 PM > To: net...@vger.kernel.org > Cc: Sergio Lopez ; David S. Miller ; > Stefano Garzarella ; Justin He ; > k...@vger.kernel.org; linux-kernel@vger.kernel.org; Stefan Hajnoczi > ; virtualizat...@lists.linux-foundation.org; Jakub > Kicinski > Subject: [PATCH net] vsock/virtio: discard packets only when socket is > really closed > > Starting from commit 8692cefc433f ("virtio_vsock: Fix race condition > in virtio_transport_recv_pkt"), we discard packets in > virtio_transport_recv_pkt() if the socket has been released. > > When the socket is connected, we schedule a delayed work to wait the > RST packet from the other peer, also if SHUTDOWN_MASK is set in > sk->sk_shutdown. > This is done to complete the virtio-vsock shutdown algorithm, releasing > the port assigned to the socket definitively only when the other peer > has consumed all the packets. > > If we discard the RST packet received, the socket will be closed only > when the VSOCK_CLOSE_TIMEOUT is reached. > > Sergio discovered the issue while running ab(1) HTTP benchmark using > libkrun [1] and observing a latency increase with that commit. > > To avoid this issue, we discard packet only if the socket is really > closed (SOCK_DONE flag is set). > We also set SOCK_DONE in virtio_transport_release() when we don't need > to wait any packets from the other peer (we didn't schedule the delayed > work). In this case we remove the socket from the vsock lists, releasing > the port assigned. > > [1] https://github.com/containers/libkrun > > Fixes: 8692cefc433f ("virtio_vsock: Fix race condition in > virtio_transport_recv_pkt") Acked-by: Jia He -- Cheers, Justin (Jia He) > Cc: justin...@arm.com > Reported-by: Sergio Lopez > Tested-by: Sergio Lopez > Signed-off-by: Stefano Garzarella > --- > net/vmw_vsock/virtio_transport_common.c | 8 +--- > 1 file changed, 5 insertions(+), 3 deletions(-) > > diff --git a/net/vmw_vsock/virtio_transport_common.c > b/net/vmw_vsock/virtio_transport_common.c > index 0edda1edf988..5956939eebb7 100644 > --- a/net/vmw_vsock/virtio_transport_common.c > +++ b/net/vmw_vsock/virtio_transport_common.c > @@ -841,8 +841,10 @@ void virtio_transport_release(struct vsock_sock *vsk) > virtio_transport_free_pkt(pkt); > } > > -if (remove_sock) > +if (remove_sock) { > +sock_set_flag(sk, SOCK_DONE); > vsock_remove_sock(vsk); > +} > } > EXPORT_SYMBOL_GPL(virtio_transport_release); > > @@ -1132,8 +1134,8 @@ void virtio_transport_recv_pkt(struct > virtio_transport *t, > > lock_sock(sk); > > -/* Check if sk has been released before lock_sock */ > -if (sk->sk_shutdown == SHUTDOWN_MASK) { > +/* Check if sk has been closed before lock_sock */ > +if (sock_flag(sk, SOCK_DONE)) { > (void)virtio_transport_reset_no_sock(t, pkt); > release_sock(sk); > sock_put(sk); > -- > 2.26.2
RE: [PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()
Hi Alex, thanks for the comments. See mine below: > -Original Message- > From: Alex Williamson > Sent: Friday, November 20, 2020 1:05 AM > To: Justin He > Cc: Cornelia Huck ; k...@vger.kernel.org; linux- > ker...@vger.kernel.org > Subject: Re: [PATCH] vfio iommu type1: Bypass the vma permission check in > vfio_pin_pages_remote() > > On Thu, 19 Nov 2020 22:27:37 +0800 > Jia He wrote: > > > The permission of vfio iommu is different and incompatible with vma > > permission. If the iotlb->perm is IOMMU_NONE (e.g. qemu side), qemu will > > simply call unmap ioctl() instead of mapping. Hence vfio_dma_map() can't > > map a dma region with NONE permission. > > > > This corner case will be exposed in coming virtio_fs cache_size > > commit [1] > > - mmap(NULL, size, PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); > >memory_region_init_ram_ptr() > > - re-mmap the above area with read/write authority. > > - vfio_dma_map() will be invoked when vfio device is hotplug added. > > > > qemu: > > vfio_listener_region_add() > > vfio_dma_map(..., readonly=false) > > map.flags is set to VFIO_DMA_MAP_FLAG_READ|VFIO_..._WRITE > > ioctl(VFIO_IOMMU_MAP_DMA) > > > > kernel: > > vfio_dma_do_map() > > vfio_pin_map_dma() > > vfio_pin_pages_remote() > > vaddr_get_pfn() > > ... > > check_vma_flags() failed! because > > vm_flags hasn't VM_WRITE && gup_flags > > has FOLL_WRITE > > > > It will report error in qemu log when hotplug adding(vfio) a nvme disk > > to qemu guest on an Ampere EMAG server: > > "VFIO_MAP_DMA failed: Bad address" > > I don't fully understand the argument here, I think this is suggesting > that because QEMU won't call VFIO_IOMMU_MAP_DMA on a region that has > NONE permission, the kernel can ignore read/write permission by using > FOLL_FORCE. Not only is QEMU not the only userspace driver for vfio, > but regardless of that, we can't trust the behavior of any given > userspace driver. Bypassing the permission check with FOLL_FORCE seems > like it's placing the trust in the user, which seems like a security > issue. Thanks, Yes, this might have a security impact. But besides this simple fix (adding FOLL_FORCE), do you think it would be a good idea for QEMU to provide a special vfio_dma_map_none_perm() to allow mapping a region with NONE permission? Thanks for any suggestion. -- Cheers, Justin (Jia He) > > Alex > > > > [1] https://gitlab.com/virtio-fs/qemu/-/blob/virtio-fs- dev/hw/virtio/vhost-user-fs.c#L502 > > > > Signed-off-by: Jia He > > --- > > drivers/vfio/vfio_iommu_type1.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/vfio/vfio_iommu_type1.c > b/drivers/vfio/vfio_iommu_type1.c > > index 67e827638995..33faa6b7dbd4 100644 > > --- a/drivers/vfio/vfio_iommu_type1.c > > +++ b/drivers/vfio/vfio_iommu_type1.c > > @@ -453,7 +453,8 @@ static int vaddr_get_pfn(struct mm_struct *mm, > unsigned long vaddr, > > flags |= FOLL_WRITE; > > > > mmap_read_lock(mm); > > -ret = pin_user_pages_remote(mm, vaddr, 1, flags | FOLL_LONGTERM, > > +ret = pin_user_pages_remote(mm, vaddr, 1, > > +flags | FOLL_LONGTERM | FOLL_FORCE, > > page, NULL, NULL); > > if (ret == 1) { > > *pfn = page_to_pfn(page[0]);
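For what it's worth, the alternative Justin floats could look roughly like this on the QEMU side (entirely hypothetical, no such helper exists; vfio_dma_map() itself is a real QEMU function):

	/* hypothetical wrapper: skip VFIO_IOMMU_MAP_DMA for sections that
	 * currently have no access permissions, mapping them only once
	 * they gain rw, instead of forcing the pin in the kernel */
	static int vfio_dma_map_if_accessible(VFIOContainer *container,
					      hwaddr iova, ram_addr_t size,
					      void *vaddr, bool readonly,
					      bool accessible)
	{
		if (!accessible)
			return 0; /* defer until the region gains permissions */

		return vfio_dma_map(container, iova, size, vaddr, readonly);
	}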
RE: [PATCH] vhost: vsock: don't send pkt when vq is not started
Hi Stefano > -Original Message- > From: Stefano Garzarella > Sent: Thursday, April 30, 2020 4:26 PM > To: Justin He > Cc: Stefan Hajnoczi ; Michael S. Tsirkin > ; Jason Wang ; > k...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; > net...@vger.kernel.org; linux-kernel@vger.kernel.org; Kaly Xin > > Subject: Re: [PATCH] vhost: vsock: don't send pkt when vq is not started > > Hi Jia, > thanks for the patch, some comments below: > > On Thu, Apr 30, 2020 at 10:13:14AM +0800, Jia He wrote: > > Ning Bo reported an abnormal 2-second gap when booting Kata container > [1]. > > The unconditional timeout is caused by > VSOCK_DEFAULT_CONNECT_TIMEOUT of > > connect at client side. The vhost vsock client tries to connect an > > initlizing virtio vsock server. > > > > The abnormal flow looks like: > > host-userspace vhost vsock guest vsock > > == === > > connect() > vhost_transport_send_pkt_work() initializing > >| vq->private_data==NULL > >| will not be queued > >V > > schedule_timeout(2s) > > vhost_vsock_start() <- device ready > > set vq->private_data > > > > wait for 2s and failed > > > > connect() again vq->private_data!=NULL recv connecting pkt > > > > 1. host userspace sends a connect pkt, at that time, guest vsock is under > > initializing, hence the vhost_vsock_start has not been called. So > > vq->private_data==NULL, and the pkt is not been queued to send to guest. > > 2. then it sleeps for 2s > > 3. after guest vsock finishes initializing, vq->private_data is set. > > 4. When host userspace wakes up after 2s, send connecting pkt again, > > everything is fine. > > > > This fixes it by checking vq->private_data in vhost_transport_send_pkt, > > and return at once if !vq->private_data. This makes user connect() > > be returned with ECONNREFUSED. > > > > After this patch, kata-runtime (with vsock enabled) boottime reduces from > > 3s to 1s on ThunderX2 arm64 server. > > > > [1] https://github.com/kata-containers/runtime/issues/1917 > > > > Reported-by: Ning Bo > > Signed-off-by: Jia He > > --- > > drivers/vhost/vsock.c | 8 > > 1 file changed, 8 insertions(+) > > > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c > > index e36aaf9ba7bd..67474334dd88 100644 > > --- a/drivers/vhost/vsock.c > > +++ b/drivers/vhost/vsock.c > > @@ -241,6 +241,7 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt > *pkt) > > { > > struct vhost_vsock *vsock; > > int len = pkt->len; > > +struct vhost_virtqueue *vq; > > > > rcu_read_lock(); > > > > @@ -252,6 +253,13 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt > *pkt) > > return -ENODEV; > > } > > > > +vq = &vsock->vqs[VSOCK_VQ_RX]; > > +if (!vq->private_data) { > > I think is better to use vhost_vq_get_backend(): > > if (!vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) { > ... > > This function should be called with 'vq->mutex' acquired as explained in > the comment, but here we can avoid that, because we are not using the vq, > so it is safe, because in vhost_transport_do_send_pkt() we check it again. > > Please add a comment explaining that. > Thanks, vhost_vq_get_backend is better. I chose a 5.3 kernel to develop and missed this helper. > > As an alternative to this patch, should we kick the send worker when the > device is ready? > > IIUC we reach the timeout because the send worker (that runs > vhost_transport_do_send_pkt()) exits immediately since 'vq->private_data' > is NULL, and no one will requeue it. 
> > Let's do it when we know the device is ready: > > diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c > index e36aaf9ba7bd..295b5867944f 100644 > --- a/drivers/vhost/vsock.c > +++ b/drivers/vhost/vsock.c > @@ -543,6 +543,11 @@ static int vhost_vsock_start(struct vhost_vsock > *vsock) > mutex_unlock(&vq->mutex); > } > > + /* Some packets may have been queued before the device was started, > +* let's kick the send worker to send them. > +*/ > + vhost_work_queue(&vsock->dev, &vsock->send_pkt_work); > + Yes, it works. But do you think a threshold should be set here to prevent the queue from
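A sketch of the check as Justin plans to respin it, using vhost_vq_get_backend() per Stefano's suggestion and with the requested comment (assumed shape, not the final patch):

	/* We don't need to hold vq->mutex for this check: even if the
	 * backend changes under us, vhost_transport_do_send_pkt() will
	 * re-check it under the lock before touching the vq. */
	if (!vhost_vq_get_backend(&vsock->vqs[VSOCK_VQ_RX])) {
		rcu_read_unlock();
		virtio_transport_free_pkt(pkt);
		return -ENODEV;
	}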
RE: [RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is invalid
Hi David > -Original Message- > From: David Hildenbrand > Sent: Tuesday, July 7, 2020 7:34 PM > To: Justin He ; Catalin Marinas > ; Will Deacon ; Dan Williams > ; Vishal Verma ; Dave > Jiang > Cc: Michal Hocko ; Andrew Morton foundation.org>; Mike Rapoport ; Baoquan He > ; Chuhong Yuan ; linux-arm- > ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > m...@kvack.org; linux-nvd...@lists.01.org; Kaly Xin > Subject: Re: [RFC PATCH v2 2/3] device-dax: use fallback nid when > numa_node is invalid > > On 07.07.20 07:59, Jia He wrote: > > Previously, numa_off is set unconditionally at the end of > dummy_numa_init(), > > even with a fake numa node. Then ACPI detects node id as NUMA_NO_NODE(-1) > in > > acpi_map_pxm_to_node() because it regards numa_off as turning off the > numa > > node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa. > > > > Without this patch, pmem can't be probed as a RAM device on arm64 if > SRAT table > > isn't present: > > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g - > a 64K > > kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with > invalid node: -1 > > kmem: probe of dax0.0 failed with error -22 > > > > This fixes it by using fallback memory_add_physaddr_to_nid() as nid. > > > > Suggested-by: David Hildenbrand > > Signed-off-by: Jia He > > --- > > I noticed that on powerpc memory_add_physaddr_to_nid is not exported for > module > > driver. Set it to RFC due to this concern. > > > > drivers/dax/kmem.c | 22 ++ > > 1 file changed, 14 insertions(+), 8 deletions(-) > > > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > > index 275aa5f87399..68e693ca6d59 100644 > > --- a/drivers/dax/kmem.c > > +++ b/drivers/dax/kmem.c > > @@ -28,20 +28,22 @@ int dev_dax_kmem_probe(struct device *dev) > > resource_size_t kmem_end; > > struct resource *new_res; > > const char *new_res_name; > > - int numa_node; > > + int numa_node, new_node; > > int rc; > > > > /* > > * Ensure good NUMA information for the persistent memory. > > -* Without this check, there is a risk that slow memory > > -* could be mixed in a node with faster memory, causing > > -* unavoidable performance issues. > > +* Without this check, there is a risk but not fatal that slow > > +* memory could be mixed in a node with faster memory, causing > > +* unavoidable performance issues. Furthermore, fallback node > > +* id can be used when numa_node is invalid. > > */ > > numa_node = dev_dax->target_node; > > if (numa_node < 0) { > > - dev_warn(dev, "rejecting DAX region %pR with invalid > node: %d\n", > > -res, numa_node); > > - return -EINVAL; > > + new_node = memory_add_physaddr_to_nid(kmem_start); > > + dev_info(dev, "changing nid from %d to %d for DAX > region %pR\n", > > + numa_node, new_node, res); > > + numa_node = new_node; > > Now, the warning does not really make sense. We have NUMA_NO_NODE (< 0), > that is not a change in the nid, but a selection of a nid. Printing > NUMA_NO_NODE does not make too much sense. I suggest just getting rid of > new_node and turning the dev_info() into something like > > dev_info(dev, "using nid %d for DAX region with undefined nid %pR\n", > numa_node, res); > Okay, I will update it per your suggestion. Thanks -- Cheers, Justin (Jia He)
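With David's wording folded in, the hunk would end up roughly as:

	numa_node = dev_dax->target_node;
	if (numa_node < 0) {
		numa_node = memory_add_physaddr_to_nid(kmem_start);
		dev_info(dev, "using nid %d for DAX region with undefined nid %pR\n",
			 numa_node, res);
	}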
RE: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL
Hi Michal and David > -Original Message- > From: Michal Hocko > Sent: Tuesday, July 7, 2020 7:55 PM > To: Justin He > Cc: Catalin Marinas ; Will Deacon > ; Dan Williams ; Vishal Verma > ; Dave Jiang ; Andrew > Morton ; Mike Rapoport ; > Baoquan He ; Chuhong Yuan ; linux- > arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > m...@kvack.org; linux-nvd...@lists.01.org; Kaly Xin > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid > as EXPORT_SYMBOL_GPL > > On Tue 07-07-20 13:59:15, Jia He wrote: > > This exports memory_add_physaddr_to_nid() for module driver to use. > > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in case > > NUMA_NO_NID is detected. > > > > Suggested-by: David Hildenbrand > > Signed-off-by: Jia He > > --- > > arch/arm64/mm/numa.c | 5 +++-- > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c > > index aafcee3e3f7e..7eeb31740248 100644 > > --- a/arch/arm64/mm/numa.c > > +++ b/arch/arm64/mm/numa.c > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void) > > > > /* > > * We hope that we will be hotplugging memory on nodes we already know > about, > > - * such that acpi_get_node() succeeds and we never fall back to this... > > + * such that acpi_get_node() succeeds. But when SRAT is not present, > the node > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a fallback > option. > > */ > > int memory_add_physaddr_to_nid(u64 addr) > > { > > - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", > addr); > > return 0; > > } > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); > > Does it make sense to export a noop function? Wouldn't make more sense > to simply make it static inline somewhere in a header? I haven't checked > whether there is an easy way to do that sanely bu this just hit my eyes. Okay, I can make a change in memory_hotplug.h, sth like: --- a/include/linux/memory_hotplug.h +++ b/include/linux/memory_hotplug.h @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages, struct mhp_params *params); #endif /* ARCH_HAS_ADD_PAGES */ -#ifdef CONFIG_NUMA -extern int memory_add_physaddr_to_nid(u64 start); -#else +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid) static inline int memory_add_physaddr_to_nid(u64 start) { return 0; } +#else +extern int memory_add_physaddr_to_nid(u64 start); #endif And then check the memory_add_physaddr_to_nid() helper on all arches, if it is noop(return 0), I can simply remove it. if it is not noop, after the helper, #define memory_add_physaddr_to_nid What do you think of this proposal? -- Cheers, Justin (Jia He)
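Spelled out, the per-arch opt-out Justin is describing at the end is the usual kernel pattern: an arch that keeps a real implementation declares it and defines the symbol so the generic inline stub compiles out (sketch):

	/* in the arch header, after the declaration */
	int memory_add_physaddr_to_nid(u64 start);
	#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid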
RE: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL
Hi Dan > -Original Message- > From: Dan Williams > Sent: Wednesday, July 8, 2020 11:57 AM > To: Justin He > Cc: Michal Hocko ; David Hildenbrand ; > Catalin Marinas ; Will Deacon ; > Vishal Verma ; Dave Jiang ; > Andrew Morton ; Mike Rapoport > ; Baoquan He ; Chuhong Yuan > ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; linux...@kvack.org; linux-nvd...@lists.01.org; > Kaly Xin > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid > as EXPORT_SYMBOL_GPL > > On Tue, Jul 7, 2020 at 7:20 PM Justin He wrote: > > > > Hi Michal and David > > > > > -Original Message- > > > From: Michal Hocko > > > Sent: Tuesday, July 7, 2020 7:55 PM > > > To: Justin He > > > Cc: Catalin Marinas ; Will Deacon > > > ; Dan Williams ; Vishal > Verma > > > ; Dave Jiang ; Andrew > > > Morton ; Mike Rapoport ; > > > Baoquan He ; Chuhong Yuan ; > linux- > > > arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > > > m...@kvack.org; linux-nvd...@lists.01.org; Kaly Xin > > > Subject: Re: [PATCH v2 1/3] arm64/numa: export > memory_add_physaddr_to_nid > > > as EXPORT_SYMBOL_GPL > > > > > > On Tue 07-07-20 13:59:15, Jia He wrote: > > > > This exports memory_add_physaddr_to_nid() for module driver to use. > > > > > > > > memory_add_physaddr_to_nid() is a fallback option to get the nid in > case > > > > NUMA_NO_NID is detected. > > > > > > > > Suggested-by: David Hildenbrand > > > > Signed-off-by: Jia He > > > > --- > > > > arch/arm64/mm/numa.c | 5 +++-- > > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c > > > > index aafcee3e3f7e..7eeb31740248 100644 > > > > --- a/arch/arm64/mm/numa.c > > > > +++ b/arch/arm64/mm/numa.c > > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void) > > > > > > > > /* > > > > * We hope that we will be hotplugging memory on nodes we already > know > > > about, > > > > - * such that acpi_get_node() succeeds and we never fall back to > this... > > > > + * such that acpi_get_node() succeeds. But when SRAT is not present, > > > the node > > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a > fallback > > > option. > > > > */ > > > > int memory_add_physaddr_to_nid(u64 addr) > > > > { > > > > - pr_warn("Unknown node for memory at 0x%llx, assuming node 0\n", > > > addr); > > > > return 0; > > > > } > > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); > > > > > > Does it make sense to export a noop function? Wouldn't make more sense > > > to simply make it static inline somewhere in a header? I haven't > checked > > > whether there is an easy way to do that sanely bu this just hit my > eyes. > > > > Okay, I can make a change in memory_hotplug.h, sth like: > > --- a/include/linux/memory_hotplug.h > > +++ b/include/linux/memory_hotplug.h > > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, > unsigned long nr_pages, > > struct mhp_params *params); > > #endif /* ARCH_HAS_ADD_PAGES */ > > > > -#ifdef CONFIG_NUMA > > -extern int memory_add_physaddr_to_nid(u64 start); > > -#else > > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid) > > static inline int memory_add_physaddr_to_nid(u64 start) > > { > > return 0; > > } > > +#else > > +extern int memory_add_physaddr_to_nid(u64 start); > > #endif > > > > And then check the memory_add_physaddr_to_nid() helper on all arches, > > if it is noop(return 0), I can simply remove it. 
> > if it is not noop, after the helper, > > #define memory_add_physaddr_to_nid > > > > What do you think of this proposal? > > Especially for architectures that use memblock info for numa info > (which seems to be everyone except x86) why not implement a generic > memory_add_physaddr_to_nid() that does: > > int memory_add_physaddr_to_nid(u64 addr) > { > unsigned long start_pfn, end_pfn, pfn = PHYS_PFN(addr); > int nid; > > for_each_online_node(nid) { > get_pfn_range_for_nid(nid, &start_pfn, &end_pfn); > if (pfn >= start_pfn && pfn <= end_pfn) > return nid; > } > return NUMA_NO_NODE; > } Thanks for your suggestion, Could I wrap the codes and let memory_add_physaddr_to_nid simply invoke phys_to_target_node()? -- Cheers, Justin (Jia He)
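The wrapper Justin asks about could be as small as this (a sketch, assuming phys_to_target_node() gains a generic implementation along the lines of Dan's loop above):

	int memory_add_physaddr_to_nid(u64 addr)
	{
		int nid = phys_to_target_node(addr);

		/* keep node 0 as the last-resort fallback */
		return nid == NUMA_NO_NODE ? 0 : nid;
	}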
RE: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid as EXPORT_SYMBOL_GPL
Hi Dan > -Original Message- > From: Dan Williams > Sent: Wednesday, July 8, 2020 1:48 PM > To: Mike Rapoport > Cc: Justin He ; Michal Hocko ; David > Hildenbrand ; Catalin Marinas ; > Will Deacon ; Vishal Verma ; > Dave Jiang ; Andrew Morton foundation.org>; Baoquan He ; Chuhong Yuan > ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; linux...@kvack.org; linux-nvd...@lists.01.org; > Kaly Xin > Subject: Re: [PATCH v2 1/3] arm64/numa: export memory_add_physaddr_to_nid > as EXPORT_SYMBOL_GPL > > On Tue, Jul 7, 2020 at 10:33 PM Mike Rapoport wrote: > > > > On Tue, Jul 07, 2020 at 08:56:36PM -0700, Dan Williams wrote: > > > On Tue, Jul 7, 2020 at 7:20 PM Justin He wrote: > > > > > > > > Hi Michal and David > > > > > > > > > -----Original Message- > > > > > From: Michal Hocko > > > > > Sent: Tuesday, July 7, 2020 7:55 PM > > > > > To: Justin He > > > > > Cc: Catalin Marinas ; Will Deacon > > > > > ; Dan Williams ; Vishal > Verma > > > > > ; Dave Jiang ; > Andrew > > > > > Morton ; Mike Rapoport > ; > > > > > Baoquan He ; Chuhong Yuan ; > linux- > > > > > arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; > linux- > > > > > m...@kvack.org; linux-nvd...@lists.01.org; Kaly Xin > > > > > > Subject: Re: [PATCH v2 1/3] arm64/numa: export > memory_add_physaddr_to_nid > > > > > as EXPORT_SYMBOL_GPL > > > > > > > > > > On Tue 07-07-20 13:59:15, Jia He wrote: > > > > > > This exports memory_add_physaddr_to_nid() for module driver to > use. > > > > > > > > > > > > memory_add_physaddr_to_nid() is a fallback option to get the nid > in case > > > > > > NUMA_NO_NID is detected. > > > > > > > > > > > > Suggested-by: David Hildenbrand > > > > > > Signed-off-by: Jia He > > > > > > --- > > > > > > arch/arm64/mm/numa.c | 5 +++-- > > > > > > 1 file changed, 3 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c > > > > > > index aafcee3e3f7e..7eeb31740248 100644 > > > > > > --- a/arch/arm64/mm/numa.c > > > > > > +++ b/arch/arm64/mm/numa.c > > > > > > @@ -464,10 +464,11 @@ void __init arm64_numa_init(void) > > > > > > > > > > > > /* > > > > > > * We hope that we will be hotplugging memory on nodes we > already know > > > > > about, > > > > > > - * such that acpi_get_node() succeeds and we never fall back to > this... > > > > > > + * such that acpi_get_node() succeeds. But when SRAT is not > present, > > > > > the node > > > > > > + * id may be probed as NUMA_NO_NODE by acpi, Here provide a > fallback > > > > > option. > > > > > > */ > > > > > > int memory_add_physaddr_to_nid(u64 addr) > > > > > > { > > > > > > - pr_warn("Unknown node for memory at 0x%llx, assuming node > 0\n", > > > > > addr); > > > > > > return 0; > > > > > > } > > > > > > +EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid); > > > > > > > > > > Does it make sense to export a noop function? Wouldn't make more > sense > > > > > to simply make it static inline somewhere in a header? I haven't > checked > > > > > whether there is an easy way to do that sanely bu this just hit my > eyes. 
> > > > > > > > Okay, I can make a change in memory_hotplug.h, sth like: > > > > --- a/include/linux/memory_hotplug.h > > > > +++ b/include/linux/memory_hotplug.h > > > > @@ -149,13 +149,13 @@ int add_pages(int nid, unsigned long start_pfn, > unsigned long nr_pages, > > > > struct mhp_params *params); > > > > #endif /* ARCH_HAS_ADD_PAGES */ > > > > > > > > -#ifdef CONFIG_NUMA > > > > -extern int memory_add_physaddr_to_nid(u64 start); > > > > -#else > > > > +#if !defined(CONFIG_NUMA) || !defined(memory_add_physaddr_to_nid) > > > > static inline int memory_add_physaddr_to_nid(u64 start) > > > > { > > > > return 0; > > > > }
RE: [PATCH v2 08/22] memblock: Introduce a generic phys_addr_to_target_node()
Hi Dan > -Original Message- > From: Dan Williams > Sent: Monday, July 13, 2020 11:48 PM > To: Mike Rapoport > Cc: linux-nvdimm ; Justin He > ; Will Deacon ; David Hildenbrand > ; Andrew Morton ; Peter > Zijlstra ; Vishal L Verma ; > Dave Hansen ; Ard Biesheuvel > ; Linux MM ; Linux Kernel > Mailing List ; Linux ACPI a...@vger.kernel.org>; Christoph Hellwig ; Joao Martins > > Subject: Re: [PATCH v2 08/22] memblock: Introduce a generic > phys_addr_to_target_node() > > On Mon, Jul 13, 2020 at 12:04 AM Mike Rapoport wrote: > > > > Hi Dan, > > > > On Sun, Jul 12, 2020 at 09:26:48AM -0700, Dan Williams wrote: > > > Similar to how generic memory_add_physaddr_to_nid() interrogates > > > memblock data for numa information, introduce > > > get_reserved_pfn_range_from_nid() to enable the same operation for > > > reserved memory ranges. Example memory ranges that are reserved, but > > > still have associated numa-info are persistent memory or Soft Reserved > > > (EFI_MEMORY_SP) memory. > > > > Here again, I would prefer to add a weak default for > > phys_to_target_node() because the "generic" implementation is not really > > generic. > > > > The fallback to reserved ranges is x86 specfic because on x86 most of > the > > reserved areas is not in memblock.memory. AFAIK, no other architecture > > does this. > > True, I was pre-fetching ARM using the new EFI "Special Purpose" > memory attribute. However, until that becomes something that platforms > deploy in practice I'm ok with not solving that problem for now. > > > And x86 anyway has implementation of phys_to_target_node(). > > Sure, let's go with the default stub for non-x86. > > Justin, do you think it would make sense to fold your dax_kmem > enabling for arm64 series into my enabling of dax_hmem for all > memory-hotplug archs? It is ok with me, thanks for the folding 😊 -- Cheers, Justin (Jia He)
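The "default stub for non-x86" agreed on above naturally takes the shape of a weak symbol that x86 overrides with its strong definition; a sketch consistent with the discussion:

	int __weak phys_to_target_node(u64 start)
	{
		pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
			     start);
		return 0;
	}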
RE: [PATCH] virtio_vsock: Fix race condition in virtio_transport_recv_pkt
Hi Stefano > -Original Message- > From: Stefano Garzarella > Sent: Friday, May 29, 2020 10:11 PM > To: Justin He > Cc: Stefan Hajnoczi ; David S. Miller > ; Jakub Kicinski ; > k...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; > net...@vger.kernel.org; linux-kernel@vger.kernel.org; Kaly Xin > ; sta...@vger.kernel.org > Subject: Re: [PATCH] virtio_vsock: Fix race condition in > virtio_transport_recv_pkt > > Hi Jia, > thanks for the patch! I have some comments. > > On Fri, May 29, 2020 at 09:31:23PM +0800, Jia He wrote: > > When client tries to connect(SOCK_STREAM) the server in the guest with > NONBLOCK > > mode, there will be a panic on a ThunderX2 (armv8a server): > > [ 463.718844][ T5040] Unable to handle kernel NULL pointer dereference at > virtual address > > [ 463.718848][ T5040] Mem abort info: > > [ 463.718849][ T5040] ESR = 0x9644 > > [ 463.718852][ T5040] EC = 0x25: DABT (current EL), IL = 32 bits > > [ 463.718853][ T5040] SET = 0, FnV = 0 > > [ 463.718854][ T5040] EA = 0, S1PTW = 0 > > [ 463.718855][ T5040] Data abort info: > > [ 463.718856][ T5040] ISV = 0, ISS = 0x0044 > > [ 463.718857][ T5040] CM = 0, WnR = 1 > > [ 463.718859][ T5040] user pgtable: 4k pages, 48-bit VAs, > pgdp=008f6f6e9000 > > [ 463.718861][ T5040] [] pgd= > > [ 463.718866][ T5040] Internal error: Oops: 9644 [#1] SMP > > [...] > > [ 463.718977][ T5040] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G > O 5.7.0-rc7+ #139 > > [ 463.718980][ T5040] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, > BIOS F06 09/25/2018 > > [ 463.718982][ T5040] pstate: 6049 (nZCv daif +PAN -UAO) > > [ 463.718995][ T5040] pc : virtio_transport_recv_pkt+0x4c8/0xd40 > [vmw_vsock_virtio_transport_common] > > [ 463.718999][ T5040] lr : virtio_transport_recv_pkt+0x1fc/0xd40 > [vmw_vsock_virtio_transport_common] > > [ 463.719000][ T5040] sp : 80002dbe3c40 > > [...] > > [ 463.719025][ T5040] Call trace: > > [ 463.719030][ T5040] virtio_transport_recv_pkt+0x4c8/0xd40 > [vmw_vsock_virtio_transport_common] > > [ 463.719034][ T5040] vhost_vsock_handle_tx_kick+0x360/0x408 > [vhost_vsock] > > [ 463.719041][ T5040] vhost_worker+0x100/0x1a0 [vhost] > > [ 463.719048][ T5040] kthread+0x128/0x130 > > [ 463.719052][ T5040] ret_from_fork+0x10/0x18 > > > > The race condition as follows: > > Task1Task2 > > == > > __sock_release virtio_transport_recv_pkt > > __vsock_release vsock_find_bound_socket (found) > > lock_sock_nested > > vsock_remove_sock > > sock_orphan > > sk_set_socket(sk, NULL) > > ... > > release_sock > > lock_sock > >virtio_transport_recv_connecting > > sk->sk_socket->state (panic) > > > > This fixes it by checking vsk again whether it is in bound/connected table. 
> > > > Signed-off-by: Jia He > > Cc: sta...@vger.kernel.org > > --- > > net/vmw_vsock/virtio_transport_common.c | 11 +++ > > 1 file changed, 11 insertions(+) > > > > diff --git a/net/vmw_vsock/virtio_transport_common.c > b/net/vmw_vsock/virtio_transport_common.c > > index 69efc891885f..0dbd6a45f0ed 100644 > > --- a/net/vmw_vsock/virtio_transport_common.c > > +++ b/net/vmw_vsock/virtio_transport_common.c > > @@ -1132,6 +1132,17 @@ void virtio_transport_recv_pkt(struct > virtio_transport *t, > > > > lock_sock(sk); > > > > +/* Check it again if vsk is removed by vsock_remove_sock */ > > +spin_lock_bh(&vsock_table_lock); > > +if (!__vsock_in_bound_table(vsk) > && !__vsock_in_connected_table(vsk)) { > > +spin_unlock_bh(&vsock_table_lock); > > +(void)virtio_transport_reset_no_sock(t, pkt); > > +release_sock(sk); > > +sock_put(sk); > > +goto free_pkt; > > +} > > +spin_unlock_bh(&vsock_table_lock); > > + > > As an a simpler alternative, can we check the sk_shutdown or the socket > state without check again both bound and connected tables? > > This is a data path, so we should take it faster. > > I mean something like this: > > if (sk->sk_shutdown == SHUTDOWN_MASK) { > ... > } > Thanks for the suggestion, I verified it worked fine. And it is a more lightweight checking than mine. I will send v2 with above change -- Cheers, Justin (Jia He) > or > &
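To see why testing sk_shutdown is a sound (and cheaper) substitute for re-walking the bound/connected tables, it helps to look at the release side of the race. A sketch of the relevant ordering in __vsock_release(), reconstructed from this discussion (the exact statements are an assumption, not a quote of the source):

/* Task1: __vsock_release() — everything below runs under lock_sock() */
lock_sock(sk);
vsock_remove_sock(vsk);
sock_orphan(sk);			/* sk_set_socket(sk, NULL) */
sk->sk_shutdown = SHUTDOWN_MASK;	/* visible to whoever locks sk next */
release_sock(sk);

/*
 * Task2: virtio_transport_recv_pkt() then acquires lock_sock(sk) and can
 * test sk->sk_shutdown == SHUTDOWN_MASK up front, instead of dereferencing
 * the now-NULL sk->sk_socket and crashing.
 */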
RE: [PATCH v2] virtio_vsock: Fix race condition in virtio_transport_recv_pkt
Hi Stefano > -Original Message- > From: Stefano Garzarella > Sent: Saturday, May 30, 2020 12:34 AM > To: Justin He > Cc: Stefan Hajnoczi ; David S. Miller > ; Jakub Kicinski ; > k...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; > net...@vger.kernel.org; linux-kernel@vger.kernel.org; Kaly Xin > ; sta...@vger.kernel.org > Subject: Re: [PATCH v2] virtio_vsock: Fix race condition in > virtio_transport_recv_pkt > > On Fri, May 29, 2020 at 11:21:02PM +0800, Jia He wrote: > > When client tries to connect(SOCK_STREAM) the server in the guest with > > NONBLOCK mode, there will be a panic on a ThunderX2 (armv8a server): > > [ 463.718844][ T5040] Unable to handle kernel NULL pointer dereference > at virtual address > > [ 463.718848][ T5040] Mem abort info: > > [ 463.718849][ T5040] ESR = 0x9644 > > [ 463.718852][ T5040] EC = 0x25: DABT (current EL), IL = 32 bits > > [ 463.718853][ T5040] SET = 0, FnV = 0 > > [ 463.718854][ T5040] EA = 0, S1PTW = 0 > > [ 463.718855][ T5040] Data abort info: > > [ 463.718856][ T5040] ISV = 0, ISS = 0x0044 > > [ 463.718857][ T5040] CM = 0, WnR = 1 > > [ 463.718859][ T5040] user pgtable: 4k pages, 48-bit VAs, > pgdp=008f6f6e9000 > > [ 463.718861][ T5040] [] pgd= > > [ 463.718866][ T5040] Internal error: Oops: 9644 [#1] SMP > > [...] > > [ 463.718977][ T5040] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G > O 5.7.0-rc7+ #139 > > [ 463.718980][ T5040] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, > BIOS F06 09/25/2018 > > [ 463.718982][ T5040] pstate: 6049 (nZCv daif +PAN -UAO) > > [ 463.718995][ T5040] pc : virtio_transport_recv_pkt+0x4c8/0xd40 > [vmw_vsock_virtio_transport_common] > > [ 463.718999][ T5040] lr : virtio_transport_recv_pkt+0x1fc/0xd40 > [vmw_vsock_virtio_transport_common] > > [ 463.719000][ T5040] sp : 80002dbe3c40 > > [...] > > [ 463.719025][ T5040] Call trace: > > [ 463.719030][ T5040] virtio_transport_recv_pkt+0x4c8/0xd40 > [vmw_vsock_virtio_transport_common] > > [ 463.719034][ T5040] vhost_vsock_handle_tx_kick+0x360/0x408 > [vhost_vsock] > > [ 463.719041][ T5040] vhost_worker+0x100/0x1a0 [vhost] > > [ 463.719048][ T5040] kthread+0x128/0x130 > > [ 463.719052][ T5040] ret_from_fork+0x10/0x18 > ^ ^ > Maybe we can remove these two columns from the commit message. > > > > > The race condition as follows: > > Task1Task2 > > == > > __sock_release virtio_transport_recv_pkt > > __vsock_release vsock_find_bound_socket (found) > > lock_sock_nested > > vsock_remove_sock > > sock_orphan > > sk_set_socket(sk, NULL) > > Here we can add: > sk->sk_shutdown = SHUTDOWN_MASK; Indeed. This makes it more clearly -- Cheers, Justin (Jia He) > > > ... > > release_sock > > lock_sock > >virtio_transport_recv_connecting > > sk->sk_socket->state (panic) > > > > The root cause is that vsock_find_bound_socket can't hold the lock_sock, > > so there is a small race window between vsock_find_bound_socket() and > > lock_sock(). If there is __vsock_release() in another task, sk->sk_socket > > will be set to NULL inadvertently. > > > > This fixes it by checking sk->sk_shutdown. 
> > > > Signed-off-by: Jia He > > Cc: sta...@vger.kernel.org > > Cc: Stefano Garzarella > > --- > > v2: use lightweight checking suggested by Stefano Garzarella > > > > net/vmw_vsock/virtio_transport_common.c | 8 > > 1 file changed, 8 insertions(+) > > > > diff --git a/net/vmw_vsock/virtio_transport_common.c > b/net/vmw_vsock/virtio_transport_common.c > > index 69efc891885f..0edda1edf988 100644 > > --- a/net/vmw_vsock/virtio_transport_common.c > > +++ b/net/vmw_vsock/virtio_transport_common.c > > @@ -1132,6 +1132,14 @@ void virtio_transport_recv_pkt(struct > virtio_transport *t, > > > > lock_sock(sk); > > > > +/* Check if sk has been released before lock_sock */ > > +if (sk->sk_shutdown == SHUTDOWN_MASK) { > > +(void)virtio_transport_reset_no_sock(t, pkt); > > +release_sock(sk); > > +sock_put(sk); > > +goto free_pkt; > > +} > > + > > /* Update CID in case it has changed after a transport reset event */ > > vsk->local_addr.svm_cid = dst.svm_cid; > > > > -- > > 2.17.1 > > > > Anyway, the patch LGTM, let's see what David and the others say. > > Reviewed-by: Stefano Garzarella > > Thanks, > Stefano
RE: [GIT PULL] vhost: fixes
Hi Michael > -Original Message- > From: Michael S. Tsirkin > Sent: Monday, May 4, 2020 8:16 PM > To: Linus Torvalds > Cc: k...@vger.kernel.org; virtualizat...@lists.linux-foundation.org; > net...@vger.kernel.org; linux-kernel@vger.kernel.org; Justin He > ; ldi...@redhat.com; m...@redhat.com; n...@live.com; > stefa...@redhat.com > Subject: [GIT PULL] vhost: fixes > > The following changes since commit > 6a8b55ed4056ea5559ebe4f6a4b247f627870d4c: > > Linux 5.7-rc3 (2020-04-26 13:51:02 -0700) > > are available in the Git repository at: > > https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus > > for you to fetch changes up to > 0b841030625cde5f784dd62aec72d6a766faae70: > > vhost: vsock: kick send_pkt worker once device is started (2020-05-02 > 10:28:21 -0400) > > > virtio: fixes > > A couple of bug fixes. > > Signed-off-by: Michael S. Tsirkin > > > Jia He (1): > vhost: vsock: kick send_pkt worker once device is started Should this fix also be CC-ed to stable? Sorry, I forgot to CC it to stable. -- Cheers, Justin (Jia He) > > Stefan Hajnoczi (1): > virtio-blk: handle block_device_operations callbacks after hot unplug > > drivers/block/virtio_blk.c | 86 > +- > drivers/vhost/vsock.c | 5 +++ > 2 files changed, 83 insertions(+), 8 deletions(-)
RE: [PATCH v3 4/6] mm: don't export memory_add_physaddr_to_nid in arch specific directory
Hi Matthew > -Original Message- > From: Matthew Wilcox > Sent: Thursday, July 9, 2020 10:11 AM > To: Justin He > Cc: Catalin Marinas ; Will Deacon > ; Tony Luck ; Fenghua Yu > ; Yoshinori Sato ; Rich > Felker ; Dave Hansen ; Andy > Lutomirski ; Peter Zijlstra ; > Thomas Gleixner ; Ingo Molnar ; > Borislav Petkov ; David Hildenbrand ; > x...@kernel.org; H. Peter Anvin ; Dan Williams > ; Vishal Verma ; Dave > Jiang ; Andrew Morton ; > Baoquan He ; Chuhong Yuan ; Mike > Rapoport ; Logan Gunthorpe ; > Masahiro Yamada ; Michal Hocko ; > linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > i...@vger.kernel.org; linux...@vger.kernel.org; linux-nvd...@lists.01.org; > linux...@kvack.org; Jonathan Cameron ; Kaly > Xin > Subject: Re: [PATCH v3 4/6] mm: don't export memory_add_physaddr_to_nid in > arch specific directory > > On Thu, Jul 09, 2020 at 10:06:27AM +0800, Jia He wrote: > > After a general version of __weak memory_add_physaddr_to_nid implemented > > and exported , it is no use exporting twice in arch directory even if > > e,g, ia64/x86 have their specific version. > > > > This is to suppress the modpost warning: > > WARNING: modpost: vmlinux: 'memory_add_physaddr_to_nid' exported twice. > > Previous export was in vmlinux > > It's bad form to introduce a warning and then send a follow-up patch to > fix the warning. Just fold this patch into patch 1/6. Thanks, will do Cheers, Justin He
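A sketch of the fold Matthew is asking for: the weak generic definition and its single EXPORT_SYMBOL_GPL live together in mm/, and the arch-specific strong definitions keep no export of their own, which is what removes the "exported twice" warning (the message string here is an assumption):

/* mm/memory_hotplug.c: generic fallback, exported exactly once */
int __weak memory_add_physaddr_to_nid(u64 start)
{
	pr_info_once("Unknown online node for memory at 0x%llx, assuming node 0\n",
		     start);
	return 0;
}
EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);

/*
 * arch/x86/mm/numa.c, arch/ia64/...: keep the strong definition (it
 * overrides the __weak one at link time) but drop the duplicate
 * EXPORT_SYMBOL_GPL line.
 */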
RE: [PATCH v3 5/6] device-dax: use fallback nid when numa_node is invalid
Hi Dan > -Original Message- > From: Dan Williams > Sent: Thursday, July 9, 2020 11:39 AM > To: Justin He > Cc: Catalin Marinas ; Will Deacon > ; Tony Luck ; Fenghua Yu > ; Yoshinori Sato ; Rich > Felker ; Dave Hansen ; Andy > Lutomirski ; Peter Zijlstra ; > Thomas Gleixner ; Ingo Molnar ; > Borislav Petkov ; David Hildenbrand ; X86 > ML ; H. Peter Anvin ; Vishal Verma > ; Dave Jiang ; Andrew > Morton ; Baoquan He ; Chuhong > Yuan ; Mike Rapoport ; Logan > Gunthorpe ; Masahiro Yamada ; > Michal Hocko ; Linux ARM ker...@lists.infradead.org>; Linux Kernel Mailing List ker...@vger.kernel.org>; linux-i...@vger.kernel.org; Linux-sh s...@vger.kernel.org>; linux-nvdimm ; Linux MM > ; Jonathan Cameron ; Kaly > Xin > Subject: Re: [PATCH v3 5/6] device-dax: use fallback nid when numa_node is > invalid > > On Wed, Jul 8, 2020 at 7:07 PM Jia He wrote: > > > > numa_off is set unconditionally at the end of dummy_numa_init(), > > even with a fake numa node. ACPI detects node id as NUMA_NO_NODE(-1) in > > acpi_map_pxm_to_node() because it regards numa_off as turning off the > numa > > node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa. > > > > Without this patch, pmem can't be probed as a RAM device on arm64 if > SRAT table > > isn't present: > > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g - > a 64K > > kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with > invalid node: -1 > > kmem: probe of dax0.0 failed with error -22 > > > > This fixes it by using fallback memory_add_physaddr_to_nid() as nid. > > > > Suggested-by: David Hildenbrand > > Signed-off-by: Jia He > > --- > > drivers/dax/kmem.c | 21 + > > 1 file changed, 13 insertions(+), 8 deletions(-) > > > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > > index 275aa5f87399..218f66057994 100644 > > --- a/drivers/dax/kmem.c > > +++ b/drivers/dax/kmem.c > > @@ -31,22 +31,23 @@ int dev_dax_kmem_probe(struct device *dev) > > int numa_node; > > int rc; > > > > + /* Hotplug starting at the beginning of the next block: */ > > + kmem_start = ALIGN(res->start, memory_block_size_bytes()); > > + > > /* > > * Ensure good NUMA information for the persistent memory. > > * Without this check, there is a risk that slow memory > > * could be mixed in a node with faster memory, causing > > -* unavoidable performance issues. > > +* unavoidable performance issues. Furthermore, fallback node > > +* id can be used when numa_node is invalid. > > */ > > numa_node = dev_dax->target_node; > > if (numa_node < 0) { > > - dev_warn(dev, "rejecting DAX region %pR with invalid > node: %d\n", > > -res, numa_node); > > - return -EINVAL; > > + numa_node = memory_add_physaddr_to_nid(kmem_start); > > I think this fixup belongs to the core to set a fallback value for > dev_dax->target_node. > > I'm close to having patches to provide a functional > phys_addr_to_target_node() for arm64. Should My this patch(5/6) wait on your new phys_addr_to_target_node() patch? Thanks for the clarification. -- Cheers, Justin (Jia He)
RE: [PATCH v3 4/6] mm: don't export memory_add_physaddr_to_nid in arch specific directory
Hi David > -Original Message- > From: David Hildenbrand > Sent: Thursday, July 9, 2020 5:19 PM > To: Mike Rapoport ; Matthew Wilcox > > Cc: Justin He ; Catalin Marinas > ; Will Deacon ; Tony Luck > ; Fenghua Yu ; Yoshinori Sato > ; Rich Felker ; Dave Hansen > ; Andy Lutomirski ; Peter > Zijlstra ; Thomas Gleixner ; > Ingo Molnar ; Borislav Petkov ; > x...@kernel.org; H. Peter Anvin ; Dan Williams > ; Vishal Verma ; Dave > Jiang ; Andrew Morton ; > Baoquan He ; Chuhong Yuan ; Logan > Gunthorpe ; Masahiro Yamada ; > Michal Hocko ; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org; linux-i...@vger.kernel.org; linux- > s...@vger.kernel.org; linux-nvd...@lists.01.org; linux...@kvack.org; > Jonathan Cameron ; Kaly Xin > Subject: Re: [PATCH v3 4/6] mm: don't export memory_add_physaddr_to_nid in > arch specific directory > > On 09.07.20 11:18, Mike Rapoport wrote: > > On Thu, Jul 09, 2020 at 03:11:04AM +0100, Matthew Wilcox wrote: > >> On Thu, Jul 09, 2020 at 10:06:27AM +0800, Jia He wrote: > >>> After a general version of __weak memory_add_physaddr_to_nid > implemented > >>> and exported , it is no use exporting twice in arch directory even if > >>> e,g, ia64/x86 have their specific version. > >>> > >>> This is to suppress the modpost warning: > >>> WARNING: modpost: vmlinux: 'memory_add_physaddr_to_nid' exported twice. > >>> Previous export was in vmlinux > >> > >> It's bad form to introduce a warning and then send a follow-up patch to > >> fix the warning. Just fold this patch into patch 1/6. > > > > Moreover, I think that patches 1-4 can be merged into one. > > > > +1 Okay, will update, thanks -- Cheers, Justin (Jia He)
RE: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem alignment
Hi David > -Original Message- > From: David Hildenbrand > Sent: Wednesday, July 29, 2020 2:37 PM > To: Justin He > Cc: Dan Williams ; Vishal Verma > ; Mike Rapoport ; David > Hildenbrand ; Catalin Marinas ; > Will Deacon ; Greg Kroah-Hartman > ; Rafael J. Wysocki ; Dave > Jiang ; Andrew Morton ; > Steve Capper ; Mark Rutland ; > Logan Gunthorpe ; Anshuman Khandual > ; Hsin-Yi Wang ; Jason > Gunthorpe ; Dave Hansen ; Kees > Cook ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; linux-nvd...@lists.01.org; linux...@kvack.org; Wei > Yang ; Pankaj Gupta > ; Ira Weiny ; Kaly Xin > > Subject: Re: [RFC PATCH 0/6] decrease unnecessary gap due to pmem kmem > alignment > > > > > Am 29.07.2020 um 05:35 schrieb Jia He : > > > > When enabling dax pmem as RAM device on arm64, I noticed that kmem_start > > addr in dev_dax_kmem_probe() should be aligned w/ > SECTION_SIZE_BITS(30),i.e. > > 1G memblock size. Even Dan Williams' sub-section patch series [1] had > been > > upstream merged, it was not helpful due to hard limitation of kmem_start: > > $ndctl create-namespace -e namespace0.0 --mode=devdax --map=dev -s 2g -f > -a 2M > > $echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind > > $echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id > > $cat /proc/iomem > > ... > > 23c00-23fff : System RAM > > 23dd4-23fec : reserved > > 23fed-23fff : reserved > > 24000-33fdf : Persistent Memory > > 24000-2403f : namespace0.0 > > 28000-2bfff : dax0.0 <- aligned with 1G boundary > >28000-2bfff : System RAM > > Hence there is a big gap between 0x2403f and 0x28000 due to the > 1G > > alignment. > > > > Without this series, if qemu creates a 4G bytes nvdimm device, we can > only > > use 2G bytes for dax pmem(kmem) in the worst case. > > e.g. > > 24000-33fdf : Persistent Memory > > We can only use the memblock between [24000, 2] due to the > hard > > limitation. It wastes too much memory space. > > > > Decreasing the SECTION_SIZE_BITS on arm64 might be an alternative, but > there > > are too many concerns from other constraints, e.g. PAGE_SIZE, hugetlb, > > SPARSEMEM_VMEMMAP, page bits in struct page ... > > > > Beside decreasing the SECTION_SIZE_BITS, we can also relax the kmem > alignment > > with memory_block_size_bytes(). > > > > Tested on arm64 guest and x86 guest, qemu creates a 4G pmem device. dax > pmem > > can be used as ram with smaller gap. Also the kmem hotplug add/remove > are both > > tested on arm64/x86 guest. > > > > Hi, > > I am not convinced this use case is worth such hacks (that’s what it is) > for now. On real machines pmem is big - your example (losing 50% is > extreme). > > I would much rather want to see the section size on arm64 reduced. I > remember there were patches and that at least with a base page size of 4k > it can be reduced drastically (64k base pages are more problematic due to > the ridiculous THP size of 512M). But could be a section size of 512 is > possible on all configs right now. Yes, I once investigated how to reduce section size on arm64 thoughtfully: There are many constraints for reducing SECTION_SIZE_BITS 1. Given page->flags bits is limited, SECTION_SIZE_BITS can't be reduced too much. 2. Once CONFIG_SPARSEMEM_VMEMMAP is enabled, section id will not be counted into page->flags. 3. 
MAX_ORDER depends on SECTION_SIZE_BITS - 3.1 mmzone.h #if (MAX_ORDER - 1 + PAGE_SHIFT) > SECTION_SIZE_BITS #error Allocator MAX_ORDER exceeds SECTION_SIZE #endif - 3.2 hugepage_init() MAYBE_BUILD_BUG_ON(HPAGE_PMD_ORDER >= MAX_ORDER); Hence when ARM64_4K_PAGES && CONFIG_SPARSEMEM_VMEMMAP are enabled, SECTION_SIZE_BITS can be reduced to 27. But when ARM64_64K_PAGES, given 3.2, MAX_ORDER > 29-16 = 13. Given 3.1 SECTION_SIZE_BITS >= MAX_ORDER+15 > 28. So SECTION_SIZE_BITS can not be reduced to 27. In one word, if we considered to reduce SECTION_SIZE_BITS on arm64, the Kconfig might be very complicated,e.g. we still need to consider the case for ARM64_16K_PAGES. > > In the long term we might want to rework the memory block device model > (eventually supporting old/new as discussed with Michal some time ago > using a kernel parameter), dropping the fixed sizes Has this been posted to Linux mm maillist? Sorry, searched and didn't find it. -- Cheers, Justin (Jia He) > - allowing sizes / addresses aligned with subsection size > - drastically reducing the number of devices for boot memory to only a > hand full (e.g., one per resource
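Spelling out the arithmetic behind the 64K-page conclusion earlier in this mail, using only the two constraints quoted from mmzone.h and hugepage_init():

/*
 * Worked check for ARM64_64K_PAGES:
 *
 *   PAGE_SHIFT      = 16
 *   PMD_SHIFT       = 16 + 13 = 29   (one 64K page holds 2^13 8-byte PTEs,
 *                                     hence the 512M THP size)
 *   HPAGE_PMD_ORDER = PMD_SHIFT - PAGE_SHIFT = 13
 *
 * hugepage_init() needs HPAGE_PMD_ORDER < MAX_ORDER, so MAX_ORDER >= 14.
 * mmzone.h needs MAX_ORDER - 1 + PAGE_SHIFT <= SECTION_SIZE_BITS, so:
 *
 *   SECTION_SIZE_BITS >= 14 - 1 + 16 = 29
 *
 * With 64K pages a section therefore cannot shrink below 2^29 = 512M,
 * while with 4K pages (PAGE_SHIFT = 12, MAX_ORDER = 11) this particular
 * bound is only 22 bits, well below the 27 bits (128M) settled on above.
 */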
RE: [PATCH 1/3] arm64/numa: set numa_off to false when numa node is fake
Hi David, thanks for the comments. See my answer please: > -Original Message- > From: David Hildenbrand > Sent: Monday, July 6, 2020 4:03 PM > To: Justin He ; Catalin Marinas > ; Will Deacon > Cc: Andrew Morton ; Mike Rapoport > ; Baoquan He ; Chuhong Yuan > ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; linux...@kvack.org; Kaly Xin > Subject: Re: [PATCH 1/3] arm64/numa: set numa_off to false when numa node > is fake > > On 06.07.20 03:19, Jia He wrote: > > Previously, numa_off is set to true unconditionally in dummy_numa_init(), > > even if there is a fake numa node. > > > > But acpi will translate node id to NUMA_NO_NODE(-1) in > acpi_map_pxm_to_node() > > because it regards numa_off as turning off the numa node. > > > > Without this patch, pmem can't be probed as a RAM device on arm64 if > SRAT table > > isn't present. > > > > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g - > a 64K > > kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with > invalid node: -1 > > kmem: probe of dax0.0 failed with error -22 > > > > This fixes it by setting numa_off to false. > > > > Signed-off-by: Jia He > > --- > > arch/arm64/mm/numa.c | 3 ++- > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c > > index aafcee3e3f7e..7689986020d9 100644 > > --- a/arch/arm64/mm/numa.c > > +++ b/arch/arm64/mm/numa.c > > @@ -440,7 +440,8 @@ static int __init dummy_numa_init(void) > > return ret; > > } > > > > - numa_off = true; > > + /* force numa_off to be false since we have a fake numa node here > */ > > + numa_off = false; > > return 0; > > } > > > > > > What would happen if we use something like this in drivers/dax/kmem.c > instead: > > numa_node = dev_dax->target_node; > if (numa_node == NUMA_NO_NODE) > numa_node = memory_add_physaddr_to_nid(kmem_start); > > and eventually dropping the pr_warn in > arm64/memory_add_physaddr_to_nid() ? Would that work? Yes, it works. I sent a similar patch [1] before. But seems pmem maintainer didn't satisfy it. Do you think memory_add_physaddr_to_nid() is better than numa_mem_id()? [1] https://lkml.org/lkml/2019/8/16/367 -- Cheers, Justin (Jia He)
RE: [PATCH 1/3] arm64/numa: set numa_off to false when numa node is fake
Hi Jonathan, thanks for the comments. > -Original Message- > From: Jonathan Cameron > Sent: Monday, July 6, 2020 6:46 PM > To: Justin He > Cc: Catalin Marinas ; Will Deacon > ; Andrew Morton ; Mike > Rapoport ; Baoquan He ; Chuhong Yuan > ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; linux...@kvack.org; Kaly Xin > Subject: Re: [PATCH 1/3] arm64/numa: set numa_off to false when numa node > is fake > > On Mon, 6 Jul 2020 11:29:21 +0100 > Jonathan Cameron wrote: > > > On Mon, 6 Jul 2020 09:19:45 +0800 > > Jia He wrote: > > > > Hi, > > > > > Previously, numa_off is set to true unconditionally in > dummy_numa_init(), > > > even if there is a fake numa node. > > > > > > But acpi will translate node id to NUMA_NO_NODE(-1) in > acpi_map_pxm_to_node() > > > because it regards numa_off as turning off the numa node. > > > > That is correct. It is operating exactly as it should, if SRAT hasn't > been parsed > > and you are on ACPI platform there are no nodes. They cannot be created > at > > some later date. The dummy code doesn't change this. It just does > enough to carry > > on operating with no specified nodes. > > > > > > > > Without this patch, pmem can't be probed as a RAM device on arm64 if > SRAT table > > > isn't present. > > > > > > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g > -a 64K > > > kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with > invalid node: -1 > > > kmem: probe of dax0.0 failed with error -22 > > > > > > This fixes it by setting numa_off to false. > > > > Without the SRAT protection patch [1] you may well run into problems Sorry, doesn't quite understand here. Do you mean your [1] can resolve this issue? But acpi_map_pxm_to_node() has returned with NUMA_NO_NODE after following check: if (pxm < 0 || pxm >= MAX_PXM_DOMAINS || numa_off) return NUMA_NO_NODE; Seems even with your [1] patch, it is not helpful? Thanks for clarification if my understanding is wrong. [1] https://patchwork.kernel.org/patch/11632063/ > > because someone somewhere will have _PXM in a DSDT but will > > have a non existent SRAT. We had this happen on an AMD platform when > we > > tried to introduce working _PXM support for PCI. [2] > > > > So whilst this seems superficially safe, I'd definitely be crossing your > fingers. > > Note, at that time I proposed putting the numa_off = false into the x86 > code > > path precisely to cut out that possibility (was rejected at the time, at > least > > partly because the clarifications to the ACPI spec were not pubilc.) > > > > The patch in [1] should sort things out however by ensuring we only > create > > new domains where we should actually be doing so. However, in your case > > it will return NUMA_NO_NODE anyway so this isn't the right way to fix > things. Okay, let me try to summarize, there might be 3 possible fixing ways: 1. this patch, seems it is not satisfied by you and David 😉 2. my previous proposal [2], similar as what David suggested 3. remove numa_off check in acpi_map_pxm_to_node() e.g. ... if (pxm < 0 || pxm >= MAX_PXM_DOMAINS /*|| numa_off*/) return NUMA_NO_NODE; [2] https://lkml.org/lkml/2019/8/16/367 -- Cheers, Justin (Jia He)
RE: [PATCH 2/3] mm/memory_hotplug: harden try_offline_node against bogus nid
Hi David > -Original Message- > From: David Hildenbrand > Sent: Monday, July 6, 2020 3:58 PM > To: Justin He ; Catalin Marinas > ; Will Deacon > Cc: Andrew Morton ; Mike Rapoport > ; Baoquan He ; Chuhong Yuan > ; linux-arm-ker...@lists.infradead.org; linux- > ker...@vger.kernel.org; linux...@kvack.org; Kaly Xin > Subject: Re: [PATCH 2/3] mm/memory_hotplug: harden try_offline_node > against bogus nid > > On 06.07.20 03:19, Jia He wrote: > > When testing the remove_memory path of dax pmem, there will be a panic > with > > call trace: > > try_remove_memory+0x84/0x170 > > remove_memory+0x38/0x58 > > dev_dax_kmem_remove+0x3c/0x84 [kmem] > > device_release_driver_internal+0xfc/0x1c8 > > device_release_driver+0x28/0x38 > > bus_remove_device+0xd4/0x158 > > device_del+0x160/0x3a0 > > unregister_dev_dax+0x30/0x68 > > devm_action_release+0x20/0x30 > > release_nodes+0x150/0x240 > > devres_release_all+0x6c/0x1d0 > > device_release_driver_internal+0x10c/0x1c8 > > driver_detach+0xac/0x170 > > bus_remove_driver+0x64/0x130 > > driver_unregister+0x34/0x60 > > dax_pmem_exit+0x14/0xffc4 [dax_pmem] > > __arm64_sys_delete_module+0x18c/0x2d0 > > el0_svc_common.constprop.2+0x78/0x168 > > do_el0_svc+0x34/0xa0 > > el0_sync_handler+0xe0/0x188 > > el0_sync+0x164/0x180 > > > > It is caused by the bogus nid (-1). Although the root cause is pmem dax > > translates from pxm to node_id incorrectly due to numa_off, it is worth > > hardening the codes in try_offline_node(), quiting if !pgdat. > > > > Signed-off-by: Jia He > > --- > > mm/memory_hotplug.c | 3 +++ > > 1 file changed, 3 insertions(+) > > > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > > index da374cd3d45b..e1e290577b45 100644 > > --- a/mm/memory_hotplug.c > > +++ b/mm/memory_hotplug.c > > @@ -1680,6 +1680,9 @@ void try_offline_node(int nid) > > pg_data_t *pgdat = NODE_DATA(nid); > > int rc; > > > > + if (WARN_ON(!pgdat)) > > + return; > > + > > /* > > * If the node still spans pages (especially ZONE_DEVICE), don't > > * offline it. A node spans memory after move_pfn_range_to_zone(), > > > > Hm. If I am not wrong, somebody used add_memory() with another nid than > try_remove_memory()? > Yes after commit fa6d9ec790550, it can prevent this possibility. I will drop this single patch. Thanks -- Cheers, Justin (Jia He)
RE: [RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is invalid
[+] add Powerpc maintainers to check my concern about memory_add_physaddr_to_nid Exporting -- Cheers, Justin (Jia He) > -Original Message- > From: Jia He > Sent: Tuesday, July 7, 2020 1:59 PM > To: Catalin Marinas ; Will Deacon > ; Dan Williams ; Vishal Verma > ; Dave Jiang > Cc: Michal Hocko ; Andrew Morton foundation.org>; Mike Rapoport ; Baoquan He > ; Chuhong Yuan ; linux-arm- > ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > m...@kvack.org; linux-nvd...@lists.01.org; Kaly Xin ; > Justin He > Subject: [RFC PATCH v2 2/3] device-dax: use fallback nid when numa_node is > invalid > > Previously, numa_off is set unconditionally at the end of > dummy_numa_init(), > even with a fake numa node. Then ACPI detects node id as NUMA_NO_NODE(-1) > in > acpi_map_pxm_to_node() because it regards numa_off as turning off the numa > node. Hence dev_dax->target_node is NUMA_NO_NODE on arm64 with fake numa. > > Without this patch, pmem can't be probed as a RAM device on arm64 if SRAT > table > isn't present: > $ndctl create-namespace -fe namespace0.0 --mode=devdax --map=dev -s 1g -a > 64K > kmem dax0.0: rejecting DAX region [mem 0x24040-0x2bfff] with > invalid node: -1 > kmem: probe of dax0.0 failed with error -22 > > This fixes it by using fallback memory_add_physaddr_to_nid() as nid. > > Suggested-by: David Hildenbrand > Signed-off-by: Jia He > --- > I noticed that on powerpc memory_add_physaddr_to_nid is not exported for > module > driver. Set it to RFC due to this concern. > > drivers/dax/kmem.c | 22 ++ > 1 file changed, 14 insertions(+), 8 deletions(-) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index 275aa5f87399..68e693ca6d59 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -28,20 +28,22 @@ int dev_dax_kmem_probe(struct device *dev) > resource_size_t kmem_end; > struct resource *new_res; > const char *new_res_name; > - int numa_node; > + int numa_node, new_node; > int rc; > > /* >* Ensure good NUMA information for the persistent memory. > - * Without this check, there is a risk that slow memory > - * could be mixed in a node with faster memory, causing > - * unavoidable performance issues. > + * Without this check, there is a risk but not fatal that slow > + * memory could be mixed in a node with faster memory, causing > + * unavoidable performance issues. Furthermore, fallback node > + * id can be used when numa_node is invalid. >*/ > numa_node = dev_dax->target_node; > if (numa_node < 0) { > - dev_warn(dev, "rejecting DAX region %pR with invalid > node: %d\n", > - res, numa_node); > - return -EINVAL; > + new_node = memory_add_physaddr_to_nid(kmem_start); > + dev_info(dev, "changing nid from %d to %d for DAX > region %pR\n", > + numa_node, new_node, res); > + numa_node = new_node; > } > > /* Hotplug starting at the beginning of the next block: */ > @@ -100,6 +102,7 @@ static int dev_dax_kmem_remove(struct device *dev) > resource_size_t kmem_start = res->start; > resource_size_t kmem_size = resource_size(res); > const char *res_name = res->name; > + int numa_node = dev_dax->target_node; > int rc; > > /* > @@ -108,7 +111,10 @@ static int dev_dax_kmem_remove(struct device *dev) >* there is no way to hotremove this memory until reboot because > device >* unbind will succeed even if we return failure. 
>*/ > - rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size); > + if (numa_node < 0) > + numa_node = memory_add_physaddr_to_nid(kmem_start); > + > + rc = remove_memory(numa_node, kmem_start, kmem_size); > if (rc) { > any_hotremove_failed = true; > dev_err(dev, > -- > 2.17.1
RE: [PATCH v4 0/2] Fix and enable pmem as RAM device on arm64
Hi David > -Original Message- > From: David Hildenbrand > Sent: Friday, July 10, 2020 4:30 PM > To: Justin He ; Catalin Marinas > ; Will Deacon ; Tony Luck > ; Fenghua Yu ; Yoshinori Sato > ; Rich Felker ; Dave Hansen > ; Andy Lutomirski ; Peter > Zijlstra ; Thomas Gleixner ; > Ingo Molnar ; Borislav Petkov > Cc: x...@kernel.org; H. Peter Anvin ; Dan Williams > ; Vishal Verma ; Dave > Jiang ; Andrew Morton ; > Baoquan He ; Chuhong Yuan ; Mike > Rapoport ; Logan Gunthorpe ; > Masahiro Yamada ; Michal Hocko ; > linux-arm-ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > i...@vger.kernel.org; linux...@vger.kernel.org; linux-nvd...@lists.01.org; > linux...@kvack.org; Jonathan Cameron ; Kaly > Xin > Subject: Re: [PATCH v4 0/2] Fix and enable pmem as RAM device on arm64 > > On 10.07.20 05:16, Jia He wrote: > > This fixes a few issues I hit when trying to enable pmem as a RAM device on > arm64. > > > > To use memory_add_physaddr_to_nid as a fallback nid, it would be better > > to implement a general version (__weak) in mm/memory_hotplug. After that, > arm64/ > > sh/s390 can simply use the general version, and PowerPC/ia64/x86 will > use > > the arch-specific version. > > > > Tested on ThunderX2 host/qemu "-M virt" guest with an nvdimm device. The > > memblocks from the dax pmem device can be either hot-added or hot- > removed > > on the arm64 guest. Also passed the compilation test on x86. > > > > Changes: > > v4: - remove "device-dax: use fallback nid when numa_node is invalid", > wait > > for Dan Williams' phys_addr_to_target_node() patch > > So, this series no longer does what it promises? "Fix and enable pmem as > RAM device on arm64" > Hmm, a little bit awkward, but it seems it no longer does what it promises. How about sending patch 1 and patch 2 individually, without this cover letter? -- Cheers, Justin (Jia He)
RE: [PATCH 1/2] drivers/dax/kmem: use default numa_mem_id if target_node is invalid
Hi, ping. The target_node will be -1 if NUMA is disabled. IIUC, it is a generic issue, not only on arm64. -- Cheers, Justin (Jia He) > -Original Message- > From: Jia He > Sent: August 16, 2019 19:19 > To: Dan Williams ; Vishal Verma > > Cc: Keith Busch ; Dave Jiang > ; linux-nvd...@lists.01.org; linux- > ker...@vger.kernel.org; Justin He (Arm Technology China) > > Subject: [PATCH 1/2] drivers/dax/kmem: use default numa_mem_id if > target_node is invalid > > On some platforms (e.g. an arm64 guest), the NFIT info might not be ready. > Then target_node might be -1. But if there is a default numa_mem_id(), > we can use it to avoid an unnecessary fatal EINVAL error. > > devm_memremap_pages() also uses this logic if nid is invalid, so we can > stay consistent with it. > > Signed-off-by: Jia He > --- > drivers/dax/kmem.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c > index a02318c6d28a..ad62d551d94e 100644 > --- a/drivers/dax/kmem.c > +++ b/drivers/dax/kmem.c > @@ -33,9 +33,9 @@ int dev_dax_kmem_probe(struct device *dev) >*/ > numa_node = dev_dax->target_node; > if (numa_node < 0) { > - dev_warn(dev, "rejecting DAX region %pR with invalid > node: %d\n", > - res, numa_node); > - return -EINVAL; > + dev_warn(dev, "DAX %pR with invalid node, assume it > as %d\n", > + res, numa_node, numa_mem_id()); > + numa_node = numa_mem_id(); > } > > /* Hotplug starting at the beginning of the next block: */ > -- > 2.17.1
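One nit worth flagging in the quoted patch: the new dev_warn() passes three value arguments (res, numa_node, numa_mem_id()) to a format string with only two specifiers, so the assumed node id is never actually printed. A corrected form would presumably look like this (my sketch, not what was posted):

	numa_node = dev_dax->target_node;
	if (numa_node < 0) {
		dev_warn(dev, "DAX %pR with invalid node %d, assuming node %d\n",
			 res, numa_node, numa_mem_id());
		numa_node = numa_mem_id();
	}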
RE: [PATCH 2/2] lib/test_printf: add test of null/invalid pointer dereference for dentry
Hi Petr > -Original Message- > From: Petr Mladek > Sent: August 16, 2019 16:27 > To: Justin He (Arm Technology China) > Cc: Geert Uytterhoeven ; Sergey Senozhatsky > ; Thomas Gleixner ; > Andy Shevchenko ; linux- > ker...@vger.kernel.org; Kees Cook ; Steven > Rostedt (VMware) ; Shuah Khan > ; Tobin C. Harding > Subject: Re: [PATCH 2/2] lib/test_printf: add test of null/invalid pointer > dereference for dentry > > On Fri 2019-08-09 09:24:57, Jia He wrote: > > This adds some additional test cases of null/invalid pointer dereference > > for dentry and file (%pd and %pD) > > > > Signed-off-by: Jia He > > --- > > lib/test_printf.c | 7 +++ > > 1 file changed, 7 insertions(+) > > > > diff --git a/lib/test_printf.c b/lib/test_printf.c > > index 944eb50f3862..befedffeb476 100644 > > --- a/lib/test_printf.c > > +++ b/lib/test_printf.c > > @@ -455,6 +455,13 @@ dentry(void) > > test("foo", "%pd", &test_dentry[0]); > > test("foo", "%pd2", &test_dentry[0]); > > > > + /* test the null/invalid pointer case for dentry */ > > + test("(null)", "%pd", NULL); > > + test("(efault)", "%pd", PTR_INVALID); > > + /* test the null/invalid pointer case for file */ > > The two comments mention something that is obvious from the code. > No problem, ok with me 😊 -- Cheers, Justin (Jia He) > I have pushed the patch as is and removed the comments in > a follow-up patch [1]. Both are in printk.git, branch for-5.4. > > > + test("(null)", "%pD", NULL); > > + test("(efault)", "%pD", PTR_INVALID); > > Reference: > [1] > https://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk.git/commit/?h=for-5.4&id=8ebea6ea1a7ed5d67ecbb2a493c716a2a89c0be2 > > Best Regards, > Petr
RE: [PATCH v3 2/2] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Kirill > -Original Message- > From: Kirill A. Shutemov > Sent: 2019年9月16日 17:16 > To: Justin He (Arm Technology China) > Cc: Catalin Marinas ; Will Deacon > ; Mark Rutland ; James Morse > ; Marc Zyngier ; Matthew > Wilcox ; Kirill A. Shutemov > ; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org; linux...@kvack.org; Punit Agrawal > ; Anshuman Khandual > ; Jun Yao ; > Alex Van Brunt ; Robin Murphy > ; Thomas Gleixner ; > Andrew Morton ; Jérôme Glisse > ; Ralph Campbell ; > hejia...@gmail.com > Subject: Re: [PATCH v3 2/2] mm: fix double page fault on arm64 if PTE_AF > is cleared > > On Sat, Sep 14, 2019 at 12:32:39AM +0800, Jia He wrote: > > When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, > there > > will be a double page fault in __copy_from_user_inatomic of > cow_user_page. > > > > Below call trace is from arm64 do_page_fault for debugging purpose > > [ 110.016195] Call trace: > > [ 110.016826] do_page_fault+0x5a4/0x690 > > [ 110.017812] do_mem_abort+0x50/0xb0 > > [ 110.018726] el1_da+0x20/0xc4 > > [ 110.019492] __arch_copy_from_user+0x180/0x280 > > [ 110.020646] do_wp_page+0xb0/0x860 > > [ 110.021517] __handle_mm_fault+0x994/0x1338 > > [ 110.022606] handle_mm_fault+0xe8/0x180 > > [ 110.023584] do_page_fault+0x240/0x690 > > [ 110.024535] do_mem_abort+0x50/0xb0 > > [ 110.025423] el0_da+0x20/0x24 > > > > The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): > > [9b007000] pgd=00023d4f8003, pud=00023da9b003, > pmd=00023d4b3003, pte=36298607bd3 > > > > As told by Catalin: "On arm64 without hardware Access Flag, copying > from > > user will fail because the pte is old and cannot be marked young. So we > > always end up with zeroed page after fork() + CoW for pfn mappings. we > > don't always have a hardware-managed access flag on arm64." > > > > This patch fix it by calling pte_mkyoung. Also, the parameter is > > changed because vmf should be passed to cow_user_page() > > > > [1] > https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork > > > > Reported-by: Yibo Cai > > Signed-off-by: Jia He > > --- > > mm/memory.c | 30 +- > > 1 file changed, 25 insertions(+), 5 deletions(-) > > > > diff --git a/mm/memory.c b/mm/memory.c > > index e2bb51b6242e..a64af6495f71 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = > > 2; > > #endif > > > > +#ifndef arch_faults_on_old_pte > > +static inline bool arch_faults_on_old_pte(void) > > +{ > > + return false; > > +} > > +#endif > > + > > static int __init disable_randmaps(char *s) > > { > > randomize_va_space = 0; > > @@ -2140,7 +2147,8 @@ static inline int pte_unmap_same(struct > mm_struct *mm, pmd_t *pmd, > > return same; > > } > > > > -static inline void cow_user_page(struct page *dst, struct page *src, > unsigned long va, struct vm_area_struct *vma) > > +static inline void cow_user_page(struct page *dst, struct page *src, > > + struct vm_fault *vmf) > > { > > debug_dma_assert_idle(src); > > > > @@ -2152,20 +2160,32 @@ static inline void cow_user_page(struct page > *dst, struct page *src, unsigned lo > > */ > > if (unlikely(!src)) { > > void *kaddr = kmap_atomic(dst); > > - void __user *uaddr = (void __user *)(va & PAGE_MASK); > > + void __user *uaddr = (void __user *)(vmf->address & > PAGE_MASK); > > + pte_t entry; > > > > /* > > * This really shouldn't fail, because the page is there > > * in the page tables. But it might just be unreadable, > > * in which case we just give up and fill the result with > > -* zeroes. 
> > +* zeroes. If PTE_AF is cleared on arm64, it might > > +* cause double page fault. So makes pte young here > > */ > > + if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) > { > > + spin_lock(vmf->ptl); > > + entry = pte_mkyoung(vmf->orig_pte); > > Should't you re-validate that orig_pte after re-taking ptl? It can be > stale by now. Thanks, do you mean flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte)) before pte_
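The reply above is cut off mid-sentence. For context, a sketch of the re-validation Kirill is asking for: take the page-table lock, then confirm the PTE has not changed before touching it. This is roughly the shape later revisions of the patch take (cow_user_page() is reworked to report success), but the details here are reconstructed, not quoted:

	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
		pte_t entry;

		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address,
					       &vmf->ptl);
		if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
			/* The PTE changed under us: bail out and retry. */
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			return false;	/* caller re-faults */
		}
		entry = pte_mkyoung(vmf->orig_pte);
		if (ptep_set_access_flags(vma, vmf->address, vmf->pte,
					  entry, 0))
			update_mmu_cache(vma, vmf->address, vmf->pte);
		/* The actual patch keeps the PTL held across the
		 * following __copy_from_user_inatomic() as well.
		 */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
	}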
RE: [PATCH v10 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
Hi Will and Marc Sorry for the late response, just came back from a vacation. > -Original Message- > From: Marc Zyngier > Sent: October 1, 2019 21:19 > To: Will Deacon > Cc: Justin He (Arm Technology China) ; Catalin > Marinas ; Mark Rutland > ; James Morse ; > Matthew Wilcox ; Kirill A. Shutemov > ; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org; linux...@kvack.org; Punit Agrawal > ; Thomas Gleixner ; > Andrew Morton ; hejia...@gmail.com; Kaly > Xin (Arm Technology China) > Subject: Re: [PATCH v10 1/3] arm64: cpufeature: introduce helper > cpu_has_hw_af() > > On Tue, 1 Oct 2019 13:54:47 +0100 > Will Deacon wrote: > > > On Mon, Sep 30, 2019 at 09:57:38AM +0800, Jia He wrote: > > > We unconditionally set the HW_AFDBM capability and only enable it on > > > CPUs which really have the feature. But sometimes we need to know > > > whether this cpu has the capability of HW AF. So decouple AF from > > > DBM by new helper cpu_has_hw_af(). > > > > > > Signed-off-by: Jia He > > > Suggested-by: Suzuki Poulose > > > Reviewed-by: Catalin Marinas > > > --- > > > arch/arm64/include/asm/cpufeature.h | 10 ++ > > > 1 file changed, 10 insertions(+) > > > > > > diff --git a/arch/arm64/include/asm/cpufeature.h > b/arch/arm64/include/asm/cpufeature.h > > > index 9cde5d2e768f..949bc7c85030 100644 > > > --- a/arch/arm64/include/asm/cpufeature.h > > > +++ b/arch/arm64/include/asm/cpufeature.h > > > @@ -659,6 +659,16 @@ static inline u32 > id_aa64mmfr0_parange_to_phys_shift(int parange) > > > default: return CONFIG_ARM64_PA_BITS; > > > } > > > } > > > + > > > +/* Check whether hardware update of the Access flag is supported */ > > > +static inline bool cpu_has_hw_af(void) > > > +{ > > > + if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) > > > + return read_cpuid(ID_AA64MMFR1_EL1) & 0xf; > > > > 0xf? I think we should have a mask in sysreg.h for this constant. > > We don't have the mask, but we certainly have the shift. > > GENMASK(ID_AA64MMFR1_HADBS_SHIFT + 3, > ID_AA64MMFR1_HADBS_SHIFT) is a bit > of a mouthful though. Ideally, we'd have a helper for that. > OK, I will implement the helper if there isn't one already, and then replace the 0xf with it. -- Cheers, Justin (Jia He) > M. > -- > Without deviation from the norm, progress is not possible.
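For reference, a sketch of the helper-based version this converges on, using the existing cpuid_feature_extract_unsigned_field() accessor in place of the bare 0xf (a reconstruction of the agreed direction, not necessarily the committed code):

/* Check whether hardware update of the Access flag is supported. */
static inline bool cpu_has_hw_af(void)
{
	u64 mmfr1;

	if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
		return false;

	mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
	return cpuid_feature_extract_unsigned_field(mmfr1,
						    ID_AA64MMFR1_HADBS_SHIFT);
}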
RE: [PATCH v10 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
Hi Will and Marc > -Original Message- > From: Marc Zyngier > Sent: 2019年10月1日 21:32 > To: Will Deacon > Cc: Justin He (Arm Technology China) ; Catalin > Marinas ; Mark Rutland > ; James Morse ; > Matthew Wilcox ; Kirill A. Shutemov > ; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org; linux...@kvack.org; Punit Agrawal > ; Thomas Gleixner ; > Andrew Morton ; hejia...@gmail.com; Kaly > Xin (Arm Technology China) > Subject: Re: [PATCH v10 2/3] arm64: mm: implement > arch_faults_on_old_pte() on arm64 > > On Tue, 1 Oct 2019 13:50:32 +0100 > Will Deacon wrote: > > > On Mon, Sep 30, 2019 at 09:57:39AM +0800, Jia He wrote: > > > On arm64 without hardware Access Flag, copying fromuser will fail > because > > > the pte is old and cannot be marked young. So we always end up with > zeroed > > > page after fork() + CoW for pfn mappings. we don't always have a > > > hardware-managed access flag on arm64. > > > > > > Hence implement arch_faults_on_old_pte on arm64 to indicate that it > might > > > cause page fault when accessing old pte. > > > > > > Signed-off-by: Jia He > > > Reviewed-by: Catalin Marinas > > > --- > > > arch/arm64/include/asm/pgtable.h | 14 ++ > > > 1 file changed, 14 insertions(+) > > > > > > diff --git a/arch/arm64/include/asm/pgtable.h > b/arch/arm64/include/asm/pgtable.h > > > index 7576df00eb50..e96fb82f62de 100644 > > > --- a/arch/arm64/include/asm/pgtable.h > > > +++ b/arch/arm64/include/asm/pgtable.h > > > @@ -885,6 +885,20 @@ static inline void update_mmu_cache(struct > vm_area_struct *vma, > > > #define phys_to_ttbr(addr) (addr) > > > #endif > > > > > > +/* > > > + * On arm64 without hardware Access Flag, copying from user will fail > because > > > + * the pte is old and cannot be marked young. So we always end up > with zeroed > > > + * page after fork() + CoW for pfn mappings. We don't always have a > > > + * hardware-managed access flag on arm64. > > > + */ > > > +static inline bool arch_faults_on_old_pte(void) > > > +{ > > > + WARN_ON(preemptible()); > > > + > > > + return !cpu_has_hw_af(); > > > +} > > > > Does this work correctly in a KVM guest? (i.e. is the MMFR sanitised in > that > > case, despite not being the case on the host?) > > Yup, all the 64bit MMFRs are trapped (HCR_EL2.TID3 is set for an > AArch64 guest), and we return the sanitised version. Thanks for Marc's explanation. I verified the patch series on a kvm guest (-M virt) with simulated nvdimm device created by qemu. The host is ThunderX2 aarch64. > > But that's an interesting remark: we're now trading an extra fault on > CPUs that do not support HWAFDBS for a guaranteed trap for each and > every guest under the sun that will hit the COW path... > > My gut feeling is that this is going to be pretty visible. Jia, do you > have any numbers for this kind of behaviour? It is not a common COW path, but a COW for PFN mapping pages only. I add a g_counter before pte_mkyoung in force_mkyoung{} when testing vmmalloc_fork at [1]. In this test case, it will start M fork processes and N pthreads. The default is M=2,N=4. the g_counter is about 241, that is it will hit my patch series for 241 times. If I set M=20 and N=40 for TEST3, the g_counter is about 1492. [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork -- Cheers, Justin (Jia He)
RE: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Will > -Original Message- > From: Will Deacon > Sent: 2019年10月1日 20:54 > To: Justin He (Arm Technology China) > Cc: Catalin Marinas ; Mark Rutland > ; James Morse ; Marc > Zyngier ; Matthew Wilcox ; Kirill A. > Shutemov ; linux-arm- > ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > m...@kvack.org; Punit Agrawal ; Thomas > Gleixner ; Andrew Morton foundation.org>; hejia...@gmail.com; Kaly Xin (Arm Technology China) > > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF > is cleared > > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote: > > When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, > there > > will be a double page fault in __copy_from_user_inatomic of > cow_user_page. > > > > Below call trace is from arm64 do_page_fault for debugging purpose > > [ 110.016195] Call trace: > > [ 110.016826] do_page_fault+0x5a4/0x690 > > [ 110.017812] do_mem_abort+0x50/0xb0 > > [ 110.018726] el1_da+0x20/0xc4 > > [ 110.019492] __arch_copy_from_user+0x180/0x280 > > [ 110.020646] do_wp_page+0xb0/0x860 > > [ 110.021517] __handle_mm_fault+0x994/0x1338 > > [ 110.022606] handle_mm_fault+0xe8/0x180 > > [ 110.023584] do_page_fault+0x240/0x690 > > [ 110.024535] do_mem_abort+0x50/0xb0 > > [ 110.025423] el0_da+0x20/0x24 > > > > The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): > > [9b007000] pgd=00023d4f8003, pud=00023da9b003, > pmd=00023d4b3003, pte=36298607bd3 > > > > As told by Catalin: "On arm64 without hardware Access Flag, copying > from > > user will fail because the pte is old and cannot be marked young. So we > > always end up with zeroed page after fork() + CoW for pfn mappings. we > > don't always have a hardware-managed access flag on arm64." > > > > This patch fix it by calling pte_mkyoung. Also, the parameter is > > changed because vmf should be passed to cow_user_page() > > > > Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns > error > > in case there can be some obscure use-case.(by Kirill) > > > > [1] > https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork > > > > Signed-off-by: Jia He > > Reported-by: Yibo Cai > > Reviewed-by: Catalin Marinas > > Acked-by: Kirill A. Shutemov > > --- > > mm/memory.c | 99 > + > > 1 file changed, 84 insertions(+), 15 deletions(-) > > > > diff --git a/mm/memory.c b/mm/memory.c > > index b1ca51a079f2..1f56b0118ef5 100644 > > --- a/mm/memory.c > > +++ b/mm/memory.c > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = > > 2; > > #endif > > > > +#ifndef arch_faults_on_old_pte > > +static inline bool arch_faults_on_old_pte(void) > > +{ > > + return false; > > +} > > +#endif > > Kirill has acked this, so I'm happy to take the patch as-is, however isn't > it the case that /most/ architectures will want to return true for > arch_faults_on_old_pte()? In which case, wouldn't it make more sense for > that to be the default, and have x86 and arm64 provide an override? For > example, aren't most architectures still going to hit the double fault > scenario even with your patch applied? No, after applying my patch series, only those architectures which don't provide setting access flag by hardware AND don't implement their arch_faults_on_old_pte will hit the double page fault. The meaning of true for arch_faults_on_old_pte() is "this arch doesn't have the hardware setting access flag way, it might cause page fault on an old pte" I don't want to change other architectures' default behavior here. So by default, arch_faults_on_old_pte() is false. 
Btw, currently I have only observed this double page fault on an arm64 guest (the host is ThunderX2). On an x86 guest (host is Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz) there is no such double page fault, since x86 has a similar hardware mechanism for setting the access flag. -- Cheers, Justin (Jia He)
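The v10 diff is truncated in the quote above; the WARN_ON_ONCE mentioned in the changelog sits in the user-copy path and looks roughly like this (a reconstructed sketch, not the verbatim patch):

		/*
		 * This really shouldn't fail, because the page is there
		 * in the page tables. But it might just be unreadable,
		 * in which case we just give up and fill the result with
		 * zeroes.
		 */
		if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
			/* Warn in case there is some obscure use-case. */
			WARN_ON_ONCE(1);
			clear_page(kaddr);
		}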
RE: [PATCH v10 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
> -Original Message- > From: Justin He (Arm Technology China) > Sent: 2019年10月8日 9:55 > To: Marc Zyngier ; Will Deacon > Cc: Catalin Marinas ; Mark Rutland > ; James Morse ; > Matthew Wilcox ; Kirill A. Shutemov > ; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org; linux...@kvack.org; Punit Agrawal > ; Thomas Gleixner ; > Andrew Morton ; hejia...@gmail.com; Kaly > Xin (Arm Technology China) ; nd > Subject: RE: [PATCH v10 2/3] arm64: mm: implement > arch_faults_on_old_pte() on arm64 > > Hi Will and Marc > > > -Original Message- > > From: Marc Zyngier > > Sent: 2019年10月1日 21:32 > > To: Will Deacon > > Cc: Justin He (Arm Technology China) ; Catalin > > Marinas ; Mark Rutland > > ; James Morse ; > > Matthew Wilcox ; Kirill A. Shutemov > > ; linux-arm-ker...@lists.infradead.org; > > linux-kernel@vger.kernel.org; linux...@kvack.org; Punit Agrawal > > ; Thomas Gleixner ; > > Andrew Morton ; hejia...@gmail.com; > Kaly > > Xin (Arm Technology China) > > Subject: Re: [PATCH v10 2/3] arm64: mm: implement > > arch_faults_on_old_pte() on arm64 > > > > On Tue, 1 Oct 2019 13:50:32 +0100 > > Will Deacon wrote: > > > > > On Mon, Sep 30, 2019 at 09:57:39AM +0800, Jia He wrote: > > > > On arm64 without hardware Access Flag, copying fromuser will fail > > because > > > > the pte is old and cannot be marked young. So we always end up with > > zeroed > > > > page after fork() + CoW for pfn mappings. we don't always have a > > > > hardware-managed access flag on arm64. > > > > > > > > Hence implement arch_faults_on_old_pte on arm64 to indicate that > it > > might > > > > cause page fault when accessing old pte. > > > > > > > > Signed-off-by: Jia He > > > > Reviewed-by: Catalin Marinas > > > > --- > > > > arch/arm64/include/asm/pgtable.h | 14 ++ > > > > 1 file changed, 14 insertions(+) > > > > > > > > diff --git a/arch/arm64/include/asm/pgtable.h > > b/arch/arm64/include/asm/pgtable.h > > > > index 7576df00eb50..e96fb82f62de 100644 > > > > --- a/arch/arm64/include/asm/pgtable.h > > > > +++ b/arch/arm64/include/asm/pgtable.h > > > > @@ -885,6 +885,20 @@ static inline void update_mmu_cache(struct > > vm_area_struct *vma, > > > > #define phys_to_ttbr(addr) (addr) > > > > #endif > > > > > > > > +/* > > > > + * On arm64 without hardware Access Flag, copying from user will > fail > > because > > > > + * the pte is old and cannot be marked young. So we always end up > > with zeroed > > > > + * page after fork() + CoW for pfn mappings. We don't always have a > > > > + * hardware-managed access flag on arm64. > > > > + */ > > > > +static inline bool arch_faults_on_old_pte(void) > > > > +{ > > > > + WARN_ON(preemptible()); > > > > + > > > > + return !cpu_has_hw_af(); > > > > +} > > > > > > Does this work correctly in a KVM guest? (i.e. is the MMFR sanitised in > > that > > > case, despite not being the case on the host?) > > > > Yup, all the 64bit MMFRs are trapped (HCR_EL2.TID3 is set for an > > AArch64 guest), and we return the sanitised version. > Thanks for Marc's explanation. I verified the patch series on a kvm guest (- > M virt) > with simulated nvdimm device created by qemu. The host is ThunderX2 > aarch64. > > > > > But that's an interesting remark: we're now trading an extra fault on > > CPUs that do not support HWAFDBS for a guaranteed trap for each and > > every guest under the sun that will hit the COW path... > > > > My gut feeling is that this is going to be pretty visible. Jia, do you > > have any numbers for this kind of behaviour? 
> It is not a common COW path, but a COW for PFN mapping pages only. > I add a g_counter before pte_mkyoung in force_mkyoung{} when testing > vmmalloc_fork at [1]. > > In this test case, it will start M fork processes and N pthreads. The default > is > M=2,N=4. the g_counter is about 241, that is it will hit my patch series for > 241 > times. > If I set M=20 and N=40 for TEST3, the g_counter is about 1492. The time overhead of test vmmalloc_fork is: real 0m5.411s user 0m4.206s sys 0m2.699s > > [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork > > > -- > Cheers, > Justin (Jia He) >
RE: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Will > -Original Message- > From: Will Deacon > Sent: 2019年10月8日 20:40 > To: Justin He (Arm Technology China) > Cc: Catalin Marinas ; Mark Rutland > ; James Morse ; Marc > Zyngier ; Matthew Wilcox ; Kirill A. > Shutemov ; linux-arm- > ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > m...@kvack.org; Punit Agrawal ; Thomas > Gleixner ; Andrew Morton foundation.org>; hejia...@gmail.com; Kaly Xin (Arm Technology China) > ; nd > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if PTE_AF > is cleared > > On Tue, Oct 08, 2019 at 02:19:05AM +, Justin He (Arm Technology > China) wrote: > > > -Original Message- > > > From: Will Deacon > > > Sent: 2019年10月1日 20:54 > > > To: Justin He (Arm Technology China) > > > Cc: Catalin Marinas ; Mark Rutland > > > ; James Morse ; > Marc > > > Zyngier ; Matthew Wilcox ; > Kirill A. > > > Shutemov ; linux-arm- > > > ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > > > m...@kvack.org; Punit Agrawal ; Thomas > > > Gleixner ; Andrew Morton > > foundation.org>; hejia...@gmail.com; Kaly Xin (Arm Technology China) > > > > > > Subject: Re: [PATCH v10 3/3] mm: fix double page fault on arm64 if > PTE_AF > > > is cleared > > > > > > On Mon, Sep 30, 2019 at 09:57:40AM +0800, Jia He wrote: > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > index b1ca51a079f2..1f56b0118ef5 100644 > > > > --- a/mm/memory.c > > > > +++ b/mm/memory.c > > > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = > > > > 2; > > > > #endif > > > > > > > > +#ifndef arch_faults_on_old_pte > > > > +static inline bool arch_faults_on_old_pte(void) > > > > +{ > > > > + return false; > > > > +} > > > > +#endif > > > > > > Kirill has acked this, so I'm happy to take the patch as-is, however isn't > > > it the case that /most/ architectures will want to return true for > > > arch_faults_on_old_pte()? In which case, wouldn't it make more sense > for > > > that to be the default, and have x86 and arm64 provide an override? > For > > > example, aren't most architectures still going to hit the double fault > > > scenario even with your patch applied? > > > > No, after applying my patch series, only those architectures which don't > provide > > setting access flag by hardware AND don't implement their > arch_faults_on_old_pte > > will hit the double page fault. > > > > The meaning of true for arch_faults_on_old_pte() is "this arch doesn't > have the hardware > > setting access flag way, it might cause page fault on an old pte" > > I don't want to change other architectures' default behavior here. So by > default, > > arch_faults_on_old_pte() is false. > > ...and my complaint is that this is the majority of supported architectures, > so you're fixing something for arm64 which also affects arm, powerpc, > alpha, mips, riscv, ... So, IIUC, you suggested that: 1. by default, arch_faults_on_old_pte() return true 2. on X86, let arch_faults_on_old_pte() be overrided as returning false 3. on arm64, let it be as-is my patch set. 4. let other architectures decide the behavior. (But by default, it will set pte_young) I am ok with that if no objections from others. @Kirill A. Shutemov Do you have any comments? Thanks > > Chances are, they won't even realise they need to implement > arch_faults_on_old_pte() until somebody runs into the double fault and > wastes lots of time debugging it before they spot your patch. As to this point, I added a WARN_ON in patch 03 to speed up the debugging process. 
--
Cheers,
Justin (Jia He)

> > Btw, currently I have only observed this double page fault on an arm64 guest (host is ThunderX2). On an x86 guest (host is Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz), there is no such double page fault; it has a similar hardware way of setting the access flag.
>
> Right, and that's why I'm not concerned about x86 for this problem.
>
> Will
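For reference, Will's suggestion in this exchange amounts to inverting the default. A minimal sketch, assuming only the names already used in the thread; this is the proposal under discussion, not the code as merged:

/* mm/memory.c: conservative generic default (sketch) */
#ifndef arch_faults_on_old_pte
static inline bool arch_faults_on_old_pte(void)
{
    /* Assume a software-managed access flag unless the arch says otherwise. */
    return true;
}
#endif

/* arch/x86/include/asm/pgtable.h: x86 updates the access flag in hardware */
#define arch_faults_on_old_pte arch_faults_on_old_pte
static inline bool arch_faults_on_old_pte(void)
{
    return false;
}

With this inversion, an architecture that never implements the hook gets the harmless extra pte_mkyoung() work instead of the hard-to-debug double fault.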
RE: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
> -----Original Message-----
> From: kbuild test robot
> Sent: Thursday, September 19, 2019 3:36 AM
> Subject: Re: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
>
> Hi Jia,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on linus/master]
> [cannot apply to v5.3 next-20190917]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url: https://github.com/0day-ci/linux/commits/Jia-He/fix-double-page-fault-on-arm64/20190918-220036
> config: arm64-allnoconfig (attached as .config)
> compiler: aarch64-linux-gcc (GCC) 7.4.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         GCC_VERSION=7.4.0 make.cross ARCH=arm64
>
> If you fix the issue, kindly add following tag
> Reported-by: kbuild test robot
>
> All errors (new ones prefixed by >>):
>
>    mm/memory.o: In function `wp_page_copy':
> >> memory.c:(.text+0x8fc): undefined reference to `cpu_has_hw_af'
>    memory.c:(.text+0x8fc): relocation truncated to fit: R_AARCH64_CALL26 against undefined symbol `cpu_has_hw_af'

Ah, I should add a stub for when CONFIG_ARM64_HW_AFDBM is 'N' on arm64. Will fix it asap.

--
Cheers,
Justin (Jia He)

> 0-DAY kernel test infrastructure                Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all       Intel Corporation
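The stub Justin describes would look roughly like this (a sketch; the v5 review further below moves the helper inline into cpufeature.h instead, which avoids the out-of-line stub entirely):

/* arch/arm64/kernel/cpufeature.c (sketch) */
#ifdef CONFIG_ARM64_HW_AFDBM
/* Decouple AF from AFDBM. */
bool cpu_has_hw_af(void)
{
    return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;
}
#else
/* Stub so callers such as wp_page_copy() still link when HW_AFDBM is compiled out. */
bool cpu_has_hw_af(void)
{
    return false;
}
#endif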
RE: [PATCH v4 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
Hi Suzuki

> -----Original Message-----
> From: Catalin Marinas
> Sent: Thursday, September 19, 2019 12:46 AM
> To: Suzuki Poulose
> Subject: Re: [PATCH v4 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
>
> On Wed, Sep 18, 2019 at 03:20:41PM +0100, Suzuki K Poulose wrote:
> > On 18/09/2019 14:19, Jia He wrote:
> > > diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> > > index c96ffa4722d3..206b6e3954cf 100644
> > > --- a/arch/arm64/include/asm/cpufeature.h
> > > +++ b/arch/arm64/include/asm/cpufeature.h
> > > @@ -390,6 +390,7 @@ extern DECLARE_BITMAP(boot_capabilities, ARM64_NPATCHABLE);
> > >     for_each_set_bit(cap, cpu_hwcaps, ARM64_NCAPS)
> > >  bool this_cpu_has_cap(unsigned int cap);
> > > +bool cpu_has_hw_af(void);
> > >  void cpu_set_feature(unsigned int num);
> > >  bool cpu_have_feature(unsigned int num);
> > >  unsigned long cpu_get_elf_hwcap(void);
> > > diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> > > index b1fdc486aed8..c5097f58649d 100644
> > > --- a/arch/arm64/kernel/cpufeature.c
> > > +++ b/arch/arm64/kernel/cpufeature.c
> > > @@ -1141,6 +1141,12 @@ static bool has_hw_dbm(const struct arm64_cpu_capabilities *cap,
> > >     return true;
> > >  }
> > > +/* Decouple AF from AFDBM. */
> > > +bool cpu_has_hw_af(void)
> > > +{
> >
> > Sorry for not having asked this earlier. Are we interested in "whether *this* CPU has AF support?" or "whether *at least one* CPU has the AF support"? The following code does the former.
> >
> > > +	return (read_cpuid(ID_AA64MMFR1_EL1) & 0xf);
>
> In a non-preemptible context, the former is ok (per-CPU).

Yes, just as Catalin explained, we need the former, because the page fault can occur on any CPU.

--
Cheers,
Justin (Jia He)

> --
> Catalin
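To make Suzuki's former/latter distinction concrete, here is a sketch with hypothetical names (only cpu_has_hw_af() exists in the patch; read_sanitised_ftr_reg() is the usual arm64 interface for the system-wide view):

/* "Does *this* CPU have AF?" Only stable with preemption disabled. */
static inline bool this_cpu_has_hw_af(void)     /* hypothetical name */
{
    return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;
}

/* "Does every CPU have AF?" Uses the sanitised, system-wide register view. */
static inline bool system_has_hw_af(void)       /* hypothetical name */
{
    return read_sanitised_ftr_reg(SYS_ID_AA64MMFR1_EL1) & 0xf;
}

The patch wants the per-CPU answer, because the question is asked about the CPU that actually took the fault.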
RE: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Kirill Thanks for the detailed explanation. -- Cheers, Justin (Jia He) > -Original Message- > From: Kirill A. Shutemov > Sent: 2019年9月19日 22:58 > To: Jia He > Cc: Justin He (Arm Technology China) ; Catalin > Marinas ; Will Deacon ; Mark > Rutland ; James Morse > ; Marc Zyngier ; Matthew > Wilcox ; Kirill A. Shutemov > ; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org; linux...@kvack.org; Suzuki Poulose > ; Punit Agrawal ; > Anshuman Khandual ; Jun Yao > ; Alex Van Brunt ; > Robin Murphy ; Thomas Gleixner > ; Andrew Morton ; > Jérôme Glisse ; Ralph Campbell > ; Kaly Xin (Arm Technology China) > > Subject: Re: [PATCH v4 3/3] mm: fix double page fault on arm64 if PTE_AF is > cleared > > On Thu, Sep 19, 2019 at 10:16:34AM +0800, Jia He wrote: > > Hi Kirill > > > > [On behalf of justin...@arm.com because some mails are filted...] > > > > On 2019/9/18 22:00, Kirill A. Shutemov wrote: > > > On Wed, Sep 18, 2019 at 09:19:14PM +0800, Jia He wrote: > > > > When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 > guest, there > > > > will be a double page fault in __copy_from_user_inatomic of > cow_user_page. > > > > > > > > Below call trace is from arm64 do_page_fault for debugging purpose > > > > [ 110.016195] Call trace: > > > > [ 110.016826] do_page_fault+0x5a4/0x690 > > > > [ 110.017812] do_mem_abort+0x50/0xb0 > > > > [ 110.018726] el1_da+0x20/0xc4 > > > > [ 110.019492] __arch_copy_from_user+0x180/0x280 > > > > [ 110.020646] do_wp_page+0xb0/0x860 > > > > [ 110.021517] __handle_mm_fault+0x994/0x1338 > > > > [ 110.022606] handle_mm_fault+0xe8/0x180 > > > > [ 110.023584] do_page_fault+0x240/0x690 > > > > [ 110.024535] do_mem_abort+0x50/0xb0 > > > > [ 110.025423] el0_da+0x20/0x24 > > > > > > > > The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): > > > > [9b007000] pgd=00023d4f8003, pud=00023da9b003, > pmd=00023d4b3003, pte=36298607bd3 > > > > > > > > As told by Catalin: "On arm64 without hardware Access Flag, copying > from > > > > user will fail because the pte is old and cannot be marked young. So > we > > > > always end up with zeroed page after fork() + CoW for pfn mappings. > we > > > > don't always have a hardware-managed access flag on arm64." > > > > > > > > This patch fix it by calling pte_mkyoung. 
Also, the parameter is > > > > changed because vmf should be passed to cow_user_page() > > > > > > > > [1] > https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork > > > > > > > > Reported-by: Yibo Cai > > > > Signed-off-by: Jia He > > > > --- > > > > mm/memory.c | 35 ++- > > > > 1 file changed, 30 insertions(+), 5 deletions(-) > > > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > > index e2bb51b6242e..d2c130a5883b 100644 > > > > --- a/mm/memory.c > > > > +++ b/mm/memory.c > > > > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = > > > > 2; > > > > #endif > > > > +#ifndef arch_faults_on_old_pte > > > > +static inline bool arch_faults_on_old_pte(void) > > > > +{ > > > > + return false; > > > > +} > > > > +#endif > > > > + > > > > static int __init disable_randmaps(char *s) > > > > { > > > > randomize_va_space = 0; > > > > @@ -2140,8 +2147,12 @@ static inline int pte_unmap_same(struct > mm_struct *mm, pmd_t *pmd, > > > > return same; > > > > } > > > > -static inline void cow_user_page(struct page *dst, struct page *src, > unsigned long va, struct vm_area_struct *vma) > > > > +static inline void cow_user_page(struct page *dst, struct page *src, > > > > +struct vm_fault *vmf) > > > > { > > > > + struct vm_area_struct *vma = vmf->vma; > > > > + unsigned long addr = vmf->address; > > > > + > > > > debug_dma_assert_idle(src); > > > > /* > > > > @@ -2152,20 +2163,34 @@ static inline void cow_user_page(struct > page *dst, struct page *src, unsigned lo > > >
RE: [PATCH v5 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
Hi Catalin

> -----Original Message-----
> From: Catalin Marinas
> Sent: Friday, September 20, 2019 12:37 AM
> Subject: Re: [PATCH v5 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
>
> On Fri, Sep 20, 2019 at 12:12:02AM +0800, Jia He wrote:
> > diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
> > index b1fdc486aed8..fb0e9425d286 100644
> > --- a/arch/arm64/kernel/cpufeature.c
> > +++ b/arch/arm64/kernel/cpufeature.c
> > @@ -1141,6 +1141,16 @@ static bool has_hw_dbm(const struct arm64_cpu_capabilities *cap,
> >     return true;
> >  }
> >
> > +/* Decouple AF from AFDBM. */
> > +bool cpu_has_hw_af(void)
> > +{
> > +	return (read_cpuid(ID_AA64MMFR1_EL1) & 0xf);
> > +}
> > +#else /* CONFIG_ARM64_HW_AFDBM */
> > +bool cpu_has_hw_af(void)
> > +{
> > +	return false;
> > +}
> >  #endif
>
> Please place this function in cpufeature.h directly, no need for an additional function call. Something like:
>
> static inline bool cpu_has_hw_af(void)
> {
> 	if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
> 		return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;
> 	return false;
> }

Ok, thanks

--
Cheers,
Justin (Jia He)

> --
> Catalin
RE: [PATCH v5 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Catalin

> -----Original Message-----
> From: Catalin Marinas
> Sent: Friday, September 20, 2019 12:42 AM
> Subject: Re: [PATCH v5 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
>
> On Fri, Sep 20, 2019 at 12:12:04AM +0800, Jia He wrote:
> > @@ -2152,7 +2163,29 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
> >  	 */
> >  	if (unlikely(!src)) {
> >  		void *kaddr = kmap_atomic(dst);
> > -		void __user *uaddr = (void __user *)(va & PAGE_MASK);
> > +		void __user *uaddr = (void __user *)(addr & PAGE_MASK);
> > +		pte_t entry;
> > +
> > +		/* On architectures with software "accessed" bits, we would
> > +		 * take a double page fault, so mark it accessed here.
> > +		 */
> > +		if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
> > +			spin_lock(vmf->ptl);
> > +			if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
> > +				entry = pte_mkyoung(vmf->orig_pte);
> > +				if (ptep_set_access_flags(vma, addr,
> > +							  vmf->pte, entry, 0))
> > +					update_mmu_cache(vma, addr, vmf->pte);
> > +			} else {
> > +				/* Other thread has already handled the fault
> > +				 * and we don't need to do anything. If it's
> > +				 * not the case, the fault will be triggered
> > +				 * again on the same address.
> > +				 */
> > +				return -1;
> > +			}
> > +			spin_unlock(vmf->ptl);
>
> Returning with the spinlock held doesn't normally go very well ;).

Yes, my bad. Will fix asap.

--
Cheers,
Justin (Jia He)
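A minimal fix for the bug Catalin spotted: both exit paths must drop the lock. A sketch against the hunk quoted above:

spin_lock(vmf->ptl);
if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
    entry = pte_mkyoung(vmf->orig_pte);
    if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
        update_mmu_cache(vma, addr, vmf->pte);
    spin_unlock(vmf->ptl);
} else {
    /* Another thread already handled the fault: unlock, then bail out. */
    spin_unlock(vmf->ptl);
    return -1;
}

(Later versions switch to pte_offset_map_lock()/pte_unmap_unlock(), as seen in the v8 review below.)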
RE: [PATCH v7 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Thanks for your patient review 😊

--
Cheers,
Justin (Jia He)

> -----Original Message-----
> From: Kirill A. Shutemov
> Sent: Friday, September 20, 2019 10:21 PM
> Subject: Re: [PATCH v7 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
>
> On Fri, Sep 20, 2019 at 09:54:37PM +0800, Jia He wrote:
> > When we tested pmdk unit test [1] vmmalloc_fork TEST1 in an arm64 guest, there will be a double page fault in __copy_from_user_inatomic of cow_user_page.
> >
> > Below call trace is from arm64 do_page_fault for debugging purpose
> > [  110.016195] Call trace:
> > [  110.016826]  do_page_fault+0x5a4/0x690
> > [  110.017812]  do_mem_abort+0x50/0xb0
> > [  110.018726]  el1_da+0x20/0xc4
> > [  110.019492]  __arch_copy_from_user+0x180/0x280
> > [  110.020646]  do_wp_page+0xb0/0x860
> > [  110.021517]  __handle_mm_fault+0x994/0x1338
> > [  110.022606]  handle_mm_fault+0xe8/0x180
> > [  110.023584]  do_page_fault+0x240/0x690
> > [  110.024535]  do_mem_abort+0x50/0xb0
> > [  110.025423]  el0_da+0x20/0x24
> >
> > The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
> > [9b007000] pgd=00023d4f8003, pud=00023da9b003, pmd=00023d4b3003, pte=36298607bd3
> >
> > As told by Catalin: "On arm64 without hardware Access Flag, copying from user will fail because the pte is old and cannot be marked young. So we always end up with zeroed page after fork() + CoW for pfn mappings. we don't always have a hardware-managed access flag on arm64."
> >
> > This patch fixes it by calling pte_mkyoung. Also, the parameter is changed because vmf should be passed to cow_user_page()
> >
> > Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error in case there can be some obscure use-case. (by Kirill)
> >
> > [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
> >
> > Reported-by: Yibo Cai
> > Signed-off-by: Jia He
>
> Acked-by: Kirill A. Shutemov
>
> --
> Kirill A. Shutemov
RE: [PATCH v9 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Matthew and Kirill I didn't add your previous r-b and a-b tag since I refactored the cow_user_page and changed the ptl range in v9. Please have a review, thanks -- Cheers, Justin (Jia He) > -Original Message- > From: Jia He > Sent: 2019年9月25日 10:59 > To: Catalin Marinas ; Will Deacon > ; Mark Rutland ; James Morse > ; Marc Zyngier ; Matthew > Wilcox ; Kirill A. Shutemov > ; linux-arm-ker...@lists.infradead.org; > linux-kernel@vger.kernel.org; linux...@kvack.org; Suzuki Poulose > > Cc: Punit Agrawal ; Anshuman Khandual > ; Alex Van Brunt > ; Robin Murphy ; > Thomas Gleixner ; Andrew Morton foundation.org>; Jérôme Glisse ; Ralph Campbell > ; hejia...@gmail.com; Kaly Xin (Arm Technology > China) ; nd ; Justin He (Arm > Technology China) > Subject: [PATCH v9 3/3] mm: fix double page fault on arm64 if PTE_AF is > cleared > > When we tested pmdk unit test [1] vmmalloc_fork TEST1 in arm64 guest, > there > will be a double page fault in __copy_from_user_inatomic of > cow_user_page. > > Below call trace is from arm64 do_page_fault for debugging purpose > [ 110.016195] Call trace: > [ 110.016826] do_page_fault+0x5a4/0x690 > [ 110.017812] do_mem_abort+0x50/0xb0 > [ 110.018726] el1_da+0x20/0xc4 > [ 110.019492] __arch_copy_from_user+0x180/0x280 > [ 110.020646] do_wp_page+0xb0/0x860 > [ 110.021517] __handle_mm_fault+0x994/0x1338 > [ 110.022606] handle_mm_fault+0xe8/0x180 > [ 110.023584] do_page_fault+0x240/0x690 > [ 110.024535] do_mem_abort+0x50/0xb0 > [ 110.025423] el0_da+0x20/0x24 > > The pte info before __copy_from_user_inatomic is (PTE_AF is cleared): > [9b007000] pgd=00023d4f8003, pud=00023da9b003, > pmd=00023d4b3003, pte=36298607bd3 > > As told by Catalin: "On arm64 without hardware Access Flag, copying from > user will fail because the pte is old and cannot be marked young. So we > always end up with zeroed page after fork() + CoW for pfn mappings. we > don't always have a hardware-managed access flag on arm64." > > This patch fix it by calling pte_mkyoung. 
Also, the parameter is > changed because vmf should be passed to cow_user_page() > > Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns error > in case there can be some obscure use-case.(by Kirill) > > [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork > > Signed-off-by: Jia He > Reported-by: Yibo Cai > --- > mm/memory.c | 99 > + > 1 file changed, 84 insertions(+), 15 deletions(-) > > diff --git a/mm/memory.c b/mm/memory.c > index e2bb51b6242e..a0a381b36ff2 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -118,6 +118,13 @@ int randomize_va_space __read_mostly = > 2; > #endif > > +#ifndef arch_faults_on_old_pte > +static inline bool arch_faults_on_old_pte(void) > +{ > + return false; > +} > +#endif > + > static int __init disable_randmaps(char *s) > { > randomize_va_space = 0; > @@ -2140,32 +2147,82 @@ static inline int pte_unmap_same(struct > mm_struct *mm, pmd_t *pmd, > return same; > } > > -static inline void cow_user_page(struct page *dst, struct page *src, > unsigned long va, struct vm_area_struct *vma) > +static inline bool cow_user_page(struct page *dst, struct page *src, > + struct vm_fault *vmf) > { > + bool ret; > + void *kaddr; > + void __user *uaddr; > + bool force_mkyoung; > + struct vm_area_struct *vma = vmf->vma; > + struct mm_struct *mm = vma->vm_mm; > + unsigned long addr = vmf->address; > + > debug_dma_assert_idle(src); > > + if (likely(src)) { > + copy_user_highpage(dst, src, addr, vma); > + return true; > + } > + > /* >* If the source page was a PFN mapping, we don't have >* a "struct page" for it. We do a best-effort copy by >* just copying from the original user address. If that >* fails, we just zero-fill it. Live with it. >*/ > - if (unlikely(!src)) { > - void *kaddr = kmap_atomic(dst); > - void __user *uaddr = (void __user *)(va & PAGE_MASK); > + kaddr = kmap_atomic(dst); > + uaddr = (void __user *)(addr & PAGE_MASK); > + > + /* > + * On architectures with software "accessed" bits, we would > + * take a double page fault, so mark it accessed here. > + */ > + force_mkyoung = arch_faults_on_old_pte() && !pte_young(vmf- > >orig_pte); > + if (force_mkyoung) { > + p
RE: [PATCH v9 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
Hi Catalin

> -----Original Message-----
> From: Catalin Marinas
> Sent: Wednesday, September 25, 2019 10:38 PM
> Subject: Re: [PATCH v9 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
>
> On Wed, Sep 25, 2019 at 10:59:20AM +0800, Jia He wrote:
> > We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this cpu has the capability of HW AF. So decouple AF from DBM by new helper cpu_has_hw_af().
> >
> > Signed-off-by: Jia He
> > Suggested-by: Suzuki Poulose
> > Reported-by: kbuild test robot
>
> Which bug did the kbuild robot actually report? I'd drop this line.

This line was added because of [1]: "If you fix the issue, kindly add following tag Reported-by: kbuild test robot". Yes, I understand your concern; it is a little bit confusing. But I don't know how to distinguish between a) an original bug report and b) a robot report against my patch implementation. Thanks for any suggestion.

[1] https://www.lkml.org/lkml/2019/9/18/940

--
Cheers,
Justin (Jia He)
RE: [PATCH v8 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
Hi Catalin

> -----Original Message-----
> From: Catalin Marinas
> Sent: Tuesday, September 24, 2019 12:07 AM
> Subject: Re: [PATCH v8 1/3] arm64: cpufeature: introduce helper cpu_has_hw_af()
>
> On Sat, Sep 21, 2019 at 09:50:52PM +0800, Jia He wrote:
> > We unconditionally set the HW_AFDBM capability and only enable it on CPUs which really have the feature. But sometimes we need to know whether this cpu has the capability of HW AF. So decouple AF from DBM by new helper cpu_has_hw_af().
> >
> > Reported-by: kbuild test robot
> > Suggested-by: Suzuki Poulose
> > Signed-off-by: Jia He
> > ---
> >  arch/arm64/include/asm/cpufeature.h | 10 ++
> >  1 file changed, 10 insertions(+)
> >
> > diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> > index c96ffa4722d3..46caf934ba4e 100644
> > --- a/arch/arm64/include/asm/cpufeature.h
> > +++ b/arch/arm64/include/asm/cpufeature.h
> > @@ -667,6 +667,16 @@ static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
> >     default: return CONFIG_ARM64_PA_BITS;
> >     }
> >  }
> > +
> > +/* Decouple AF from AFDBM. */
>
> We could do with a better comment here or just remove it altogether. The aim of the patch was to decouple AF check from the AF+DBM but the comment here should describe what the function does. Maybe something like: "Check whether hardware update of the Access flag is supported".

Okay, I will update it.

--
Cheers,
Justin (Jia He)

> > +static inline bool cpu_has_hw_af(void)
> > +{
> > +	if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
> > +		return read_cpuid(ID_AA64MMFR1_EL1) & 0xf;
> > +
> > +	return false;
> > +}
>
> Other than the comment above,
>
> Reviewed-by: Catalin Marinas
RE: [PATCH v8 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
> -----Original Message-----
> From: Catalin Marinas
> Sent: Tuesday, September 24, 2019 12:18 AM
> Subject: Re: [PATCH v8 2/3] arm64: mm: implement arch_faults_on_old_pte() on arm64
>
> On Sat, Sep 21, 2019 at 09:50:53PM +0800, Jia He wrote:
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index e09760ece844..4a9939615e41 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -868,6 +868,18 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
> >  #define phys_to_ttbr(addr)	(addr)
> >  #endif
> >
> > +/*
> > + * On arm64 without hardware Access Flag, copying fromuser will fail because
>
>     from user

Ok

> > + * the pte is old and cannot be marked young. So we always end up with zeroed
> > + * page after fork() + CoW for pfn mappings. we don't always have a
>                                                 ^^
>                                                 We

Ok

> > + * hardware-managed access flag on arm64.
> > + */
> > +static inline bool arch_faults_on_old_pte(void)
> > +{
> > +	return !cpu_has_hw_af();
>
> I saw an early incarnation of your patch having a WARN_ON(preemptible()). I think we need this back just in case this function will be used elsewhere in the future.

Okay

--
Cheers,
Justin (Jia He)

> > +}
> > +#define arch_faults_on_old_pte arch_faults_on_old_pte
>
> Otherwise,
>
> Reviewed-by: Catalin Marinas
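Putting the two wording fixes and the restored WARN_ON together, the arm64 helper would end up roughly as follows (a sketch of the v8 patch plus this review feedback):

/*
 * On arm64 without hardware Access Flag, copying from user will fail
 * because the pte is old and cannot be marked young. So we always end
 * up with zeroed page after fork() + CoW for pfn mappings. We don't
 * always have a hardware-managed access flag on arm64.
 */
static inline bool arch_faults_on_old_pte(void)
{
    WARN_ON(preemptible());

    return !cpu_has_hw_af();
}
#define arch_faults_on_old_pte arch_faults_on_old_pte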
RE: [PATCH v8 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
Hi Catalin

Please see an important comment inline, thanks.

> -----Original Message-----
> From: Catalin Marinas
> Sent: Tuesday, September 24, 2019 1:05 AM
> Subject: Re: [PATCH v8 3/3] mm: fix double page fault on arm64 if PTE_AF is cleared
>
> On Sat, Sep 21, 2019 at 09:50:54PM +0800, Jia He wrote:
> > @@ -2151,21 +2163,53 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
> >  	 * fails, we just zero-fill it. Live with it.
> >  	 */
> >  	if (unlikely(!src)) {
> > -		void *kaddr = kmap_atomic(dst);
> > -		void __user *uaddr = (void __user *)(va & PAGE_MASK);
> > +		void *kaddr;
> > +		pte_t entry;
> > +		void __user *uaddr = (void __user *)(addr & PAGE_MASK);
> >
> > +		/* On architectures with software "accessed" bits, we would
> > +		 * take a double page fault, so mark it accessed here.
> > +		 */
>
> Nitpick: please follow the kernel coding style for multi-line comments (above and for the rest of the patch):
>
> 	/*
> 	 * Your multi-line comment.
> 	 */
>
> > +		if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
> > +			vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr,
> > +						       &vmf->ptl);
> > +			if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
> > +				entry = pte_mkyoung(vmf->orig_pte);
> > +				if (ptep_set_access_flags(vma, addr,
> > +							  vmf->pte, entry, 0))
> > +					update_mmu_cache(vma, addr, vmf->pte);
> > +			} else {
> > +				/* Other thread has already handled the fault
> > +				 * and we don't need to do anything. If it's
> > +				 * not the case, the fault will be triggered
> > +				 * again on the same address.
> > +				 */
> > +				pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +				return false;
> > +			}
> > +			pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +		}
>
> Another nit, you could rewrite this block slightly to avoid too much indentation. Something like (untested):
>
> 	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
> 		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr,
> 					       &vmf->ptl);
> 		if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> 			/*
> 			 * Other thread has already handled the fault
> 			 * and we don't need to do anything. If it's
> 			 * not the case, the fault will be triggered
> 			 * again on the same address.
> 			 */
> 			pte_unmap_unlock(vmf->pte, vmf->ptl);
> 			return false;
> 		}
> 		entry = pte_mkyoung(vmf->orig_pte);
> 		if (ptep_set_access_flags(vma, addr, vmf->pte, entry, 0))
> 			update_mmu_cache(vma, addr, vmf->pte);
> 		pte_unmap_unlock(vmf->pte, vmf->ptl);
> 	}
>
> > +
> > +		kaddr = kmap_atomic(dst);
>
> Since you moved the kmap_atomic() here, could the above arch_faults_on_old_pte() run in a preemptible context? I suggested to add a WARN_ON in patch 2 to be sure.

Should I move kmap_atomic() back to its original position, so we can be sure that arch_faults_on_old_pte() runs with preemption disabled? Otherwise, arch_faults_on_old_
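One way to settle the preemption question is simply to keep the atomic kmap first: kmap_atomic() disables preemption, so everything after it, including the arch_faults_on_old_pte() check and its WARN_ON(preemptible()), runs non-preemptibly. A sketch of that ordering:

kaddr = kmap_atomic(dst);    /* disables preemption from here on */
uaddr = (void __user *)(addr & PAGE_MASK);

/* Safe: we are now inside the kmap_atomic() critical section. */
if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
    /* ... take vmf->ptl and mark the pte young, as in the hunk above ... */
}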
RE: [PATCH v11 1/4] arm64: cpufeature: introduce helper cpu_has_hw_af()
Hi Catalin > -Original Message- > From: Catalin Marinas > Sent: Friday, October 11, 2019 12:43 AM > To: Justin He (Arm Technology China) > Cc: Will Deacon ; Mark Rutland > ; James Morse ; Marc > Zyngier ; Matthew Wilcox ; Kirill A. > Shutemov ; linux-arm- > ker...@lists.infradead.org; linux-kernel@vger.kernel.org; linux- > m...@kvack.org; Suzuki Poulose ; Borislav > Petkov ; H. Peter Anvin ; x...@kernel.org; > Thomas Gleixner ; Andrew Morton foundation.org>; hejia...@gmail.com; Kaly Xin (Arm Technology China) > ; nd > Subject: Re: [PATCH v11 1/4] arm64: cpufeature: introduce helper > cpu_has_hw_af() > > On Wed, Oct 09, 2019 at 04:42:43PM +0800, Jia He wrote: > > We unconditionally set the HW_AFDBM capability and only enable it on > > CPUs which really have the feature. But sometimes we need to know > > whether this cpu has the capability of HW AF. So decouple AF from > > DBM by a new helper cpu_has_hw_af(). > > > > Signed-off-by: Jia He > > Suggested-by: Suzuki Poulose > > Reviewed-by: Catalin Marinas > > I don't think I reviewed this version of the patch. Sorry about that. > > > diff --git a/arch/arm64/include/asm/cpufeature.h > b/arch/arm64/include/asm/cpufeature.h > > index 9cde5d2e768f..1a95396ea5c8 100644 > > --- a/arch/arm64/include/asm/cpufeature.h > > +++ b/arch/arm64/include/asm/cpufeature.h > > @@ -659,6 +659,20 @@ static inline u32 > id_aa64mmfr0_parange_to_phys_shift(int parange) > > default: return CONFIG_ARM64_PA_BITS; > > } > > } > > + > > +/* Check whether hardware update of the Access flag is supported */ > > +static inline bool cpu_has_hw_af(void) > > +{ > > + if (IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) { > > Please just return early here to avoid unnecessary indentation: Okay > > if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM)) > return false; > > > + u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1); > > + > > + return !!cpuid_feature_extract_unsigned_field(mmfr1, > > + > ID_AA64MMFR1_HADBS_SHIFT); > > No need for !!, the return type is a bool already. But cpuid_feature_extract_unsigned_field has the return type "unsigned int" [1] [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/cpufeature.h#n444 > > Anyway, apart from these nitpicks, the patch is fine you can keep my > reviewed-by. Thanks 😉 > > If later we noticed a potential performance issue on this path, we can > turn it into a static label as with other CPU features. Okay -- Cheers, Justin (Jia He)
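Folding in both nitpicks here (the early return, and dropping the !! once the conversion question is settled in the follow-up below), the helper would end up roughly as:

/* Check whether hardware update of the Access flag is supported. */
static inline bool cpu_has_hw_af(void)
{
    u64 mmfr1;

    if (!IS_ENABLED(CONFIG_ARM64_HW_AFDBM))
        return false;

    mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
    return cpuid_feature_extract_unsigned_field(mmfr1,
                                                ID_AA64MMFR1_HADBS_SHIFT);
}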
RE: [PATCH v11 1/4] arm64: cpufeature: introduce helper cpu_has_hw_af()
Hi Catalin

Thanks for the detailed explanation. Will send out v12 soon after testing.

--
Cheers,
Justin (Jia He)

> -----Original Message-----
> From: Catalin Marinas
> Sent: Friday, October 11, 2019 6:39 PM
> Subject: Re: [PATCH v11 1/4] arm64: cpufeature: introduce helper cpu_has_hw_af()
>
> On Fri, Oct 11, 2019 at 01:16:36AM +0000, Justin He (Arm Technology China) wrote:
> > From: Catalin Marinas
> > > On Wed, Oct 09, 2019 at 04:42:43PM +0800, Jia He wrote:
> > > > +	u64 mmfr1 = read_cpuid(ID_AA64MMFR1_EL1);
> > > > +
> > > > +	return !!cpuid_feature_extract_unsigned_field(mmfr1,
> > > > +					ID_AA64MMFR1_HADBS_SHIFT);
> > >
> > > No need for !!, the return type is a bool already.
> >
> > But cpuid_feature_extract_unsigned_field has the return type "unsigned int" [1]
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/include/asm/cpufeature.h#n444
>
> And the C language gives you the automatic conversion from unsigned int to bool without the need for !!. The reason we use !! in some places is for converting long to int (not bool) and losing the top 32-bit. See commit 84fe6826c28f ("arm64: mm: Add double logical invert to pte accessors") for an explanation.
>
> --
> Catalin
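A tiny userspace illustration of Catalin's point, assuming a typical LP64 target where unsigned long is 64-bit and int is 32-bit (not kernel code):

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    bool b = 0x100000000UL;   /* bool conversion: any non-zero value -> true */
    int  i = 0x100000000UL;   /* plain truncation drops the top 32 bits -> 0 */
    int  j = !!0x100000000UL; /* !! tests for non-zero before truncating -> 1 */

    printf("%d %d %d\n", b, i, j);  /* prints "1 0 1" */
    return 0;
}

So unsigned int to bool needs no !!; the idiom only matters when the destination type can silently drop the interesting bits.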
RE: [PATCH 1/2] vsprintf: Prevent crash when dereferencing invalid pointers for %pD
> -----Original Message-----
> From: Andy Shevchenko
> Sent: Friday, August 9, 2019 6:52 PM
> Subject: Re: [PATCH 1/2] vsprintf: Prevent crash when dereferencing invalid pointers for %pD
>
> On Fri, Aug 9, 2019 at 4:28 AM Jia He wrote:
> >
> > Commit 3e5903eb9cff ("vsprintf: Prevent crash when dereferencing invalid pointers") prevents most crashes except for %pD. There is an additional pointer dereference before dentry_name.
> >
> > At least, vma->vm_file can be NULL and be passed to printk %pD in print_bad_pte, which can cause a crash.
> >
> > This patch fixes it by introducing a new file_dentry_name.
> >
> > Reviewed-by: Andy Shevchenko
>
> Perhaps you need to add a Fixes tag

Thanks, Andy.

Fixes: 3e5903eb9cff ("vsprintf: Prevent crash when dereferencing invalid pointers")

Do I need to repost a v2?

--
Cheers,
Justin (Jia He)

> > Signed-off-by: Jia He
> > ---
> >  lib/vsprintf.c | 13 ++---
> >  1 file changed, 10 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/vsprintf.c b/lib/vsprintf.c
> > index 63937044c57d..b4a119176fdb 100644
> > --- a/lib/vsprintf.c
> > +++ b/lib/vsprintf.c
> > @@ -869,6 +869,15 @@ char *dentry_name(char *buf, char *end, const struct dentry *d, struct printf_sp
> >     return widen_string(buf, n, end, spec);
> >  }
> >
> > +static noinline_for_stack
> > +char *file_dentry_name(char *buf, char *end, const struct file *f,
> > +			struct printf_spec spec, const char *fmt)
> > +{
> > +	if (check_pointer(&buf, end, f, spec))
> > +		return buf;
> > +
> > +	return dentry_name(buf, end, f->f_path.dentry, spec, fmt);
> > +}
> >  #ifdef CONFIG_BLOCK
> >  static noinline_for_stack
> >  char *bdev_name(char *buf, char *end, struct block_device *bdev,
> > @@ -2166,9 +2175,7 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
> >     case 'C':
> >         return clock(buf, end, ptr, spec, fmt);
> >     case 'D':
> > -		return dentry_name(buf, end,
> > -				   ((const struct file *)ptr)->f_path.dentry,
> > -				   spec, fmt);
> > +		return file_dentry_name(buf, end, ptr, spec, fmt);
> >  #ifdef CONFIG_BLOCK
> >     case 'g':
> >         return bdev_name(buf, end, ptr, spec, fmt);
> > --
> > 2.17.1
>
> --
> With Best Regards,
> Andy Shevchenko
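The failure mode being fixed, sketched: print_bad_pte() can hand a NULL struct file pointer to %pD, and before this patch pointer() dereferenced f->f_path.dentry before any NULL check ran.

struct file *f = NULL;

/*
 * Before the patch: crashes, since f->f_path.dentry is read
 * unconditionally. After: check_pointer() intercepts the NULL
 * and emits an "(null)"-style marker instead.
 */
printk(KERN_ERR "bad pte, file:%pD\n", f);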
RE: [PATCH] arm64: mm: add missing PTE_SPECIAL in pte_mkdevmap on arm64
Hi Anshuman

Thanks for the comments; please see mine below.

> -----Original Message-----
> From: Anshuman Khandual
> Sent: Thursday, August 8, 2019 1:19 PM
> Subject: Re: [PATCH] arm64: mm: add missing PTE_SPECIAL in pte_mkdevmap on arm64
>
> [...]
>
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 5fdcfe237338..e09760ece844 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -209,7 +209,7 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
> >
> >  static inline pte_t pte_mkdevmap(pte_t pte)
> >  {
> > -	return set_pte_bit(pte, __pgprot(PTE_DEVMAP));
> > +	return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
> >  }
> >
> >  static inline void set_pte(pte_t *ptep, pte_t pte)
> > @@ -396,7 +396,10 @@ static inline int pmd_protnone(pmd_t pmd)
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  #define pmd_devmap(pmd)	pte_devmap(pmd_pte(pmd))
> >  #endif
> > -#define pmd_mkdevmap(pmd)	pte_pmd(pte_mkdevmap(pmd_pte(pmd)))
> > +static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> > +{
> > +	return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
> > +}
>
> Though I could see other platforms like powerpc and x86 following same approach (DEVMAP + SPECIAL) for pte so that it checks positive for pte_special() but then just DEVMAP for pmd which could never have a pmd_special(). But a more fundamental question is - why should a devmap be a special pte as well?

IIUC, the special pte bit makes handling easier compared with arches that have no special bit: the memory code treats a devmap page as special rather than as a normal page. A devmap page's struct page can be stored in RAM, in pmem, or not exist at all.

> Also in vm_normal_page() why cannot it tests for pte_devmap() before it starts looking for CONFIG_ARCH_HAS_PTE_SPECIAL.

AFAICT that would work, but it changes too much beyond the arm64 code. 😊

> Is this the only path for which we need to set SPECIAL bit on a devmap pte or there are other paths where this semantics is assumed?

No idea.

--
Cheers,
Justin (Jia He)
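For context, the dependency under discussion, as a much-simplified sketch of vm_normal_page() with CONFIG_ARCH_HAS_PTE_SPECIAL (the real function also consults vm_ops->find_special_page and the VM_PFNMAP/VM_MIXEDMAP cases):

struct page *vm_normal_page_sketch(struct vm_area_struct *vma,
                                   unsigned long addr, pte_t pte)
{
    /*
     * The special bit is the one cheap test telling the core MM
     * not to treat this pfn as a normal page, which is why a
     * devmap pte must carry PTE_SPECIAL on arm64 as well.
     */
    if (pte_special(pte))
        return NULL;

    return pfn_to_page(pte_pfn(pte));
}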