Re: [PATCH v5 01/10] arm64: Provide a command line to disable spectre_v2 mitigation

2019-02-28 Thread Catalin Marinas
On Thu, Feb 28, 2019 at 06:14:34PM +, Suzuki K Poulose wrote:
> On 27/02/2019 01:05, Jeremy Linton wrote:
> > There are various reasons, including benchmarking, to disable spectrev2
> > mitigation on a machine. Provide a command-line to do so.
> > 
> > Signed-off-by: Jeremy Linton 
> > Cc: Jonathan Corbet 
> > Cc: linux-...@vger.kernel.org
> 
> 
> > diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
> > index 9950bb0cbd52..d2b2c69d31bb 100644
> > --- a/arch/arm64/kernel/cpu_errata.c
> > +++ b/arch/arm64/kernel/cpu_errata.c
> > @@ -220,6 +220,14 @@ static void qcom_link_stack_sanitization(void)
> > >  : "=&r" (tmp));
> >   }
> > +static bool __nospectre_v2;
> > +static int __init parse_nospectre_v2(char *str)
> > +{
> > +   __nospectre_v2 = true;
> > +   return 0;
> > +}
> > +early_param("nospectre_v2", parse_nospectre_v2);
> > +
> >   static void
> >   enable_smccc_arch_workaround_1(const struct arm64_cpu_capabilities *entry)
> >   {
> > @@ -231,6 +239,11 @@ enable_smccc_arch_workaround_1(const struct arm64_cpu_capabilities *entry)
> > if (!entry->matches(entry, SCOPE_LOCAL_CPU))
> > return;
> > +   if (__nospectre_v2) {
> > +   pr_info_once("spectrev2 mitigation disabled by command line option\n");
> > +   return;
> > +   }
> > +
> 
> Could we not disable the "cap" altogether instead, rather than disabling the
> workaround? Or do we need that information?

There are a few ideas here but I think we settled on always reporting in
sysfs even if the mitigation is disabled in .config. So I guess we need
the "cap" around for the reporting part.
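
(For the record, the detection is needed because the sysfs reporting added
later in the series ends up looking something like the sketch below; the
state variables are placeholders for illustration, not the names actually
used in the patches:)

    ssize_t cpu_show_spectre_v2(struct device *dev,
                                struct device_attribute *attr, char *buf)
    {
        if (!spectre_v2_affected)        /* placeholder: detection result */
            return sprintf(buf, "Not affected\n");
        if (spectre_v2_mitigated)        /* placeholder: workaround applied */
            return sprintf(buf, "Mitigation: Branch predictor hardening\n");
        return sprintf(buf, "Vulnerable\n");
    }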

-- 
Catalin


Re: [PATCH] lib/raid6: use vdupq_n_u8 to avoid endianness warnings

2019-02-28 Thread Catalin Marinas
On Mon, Feb 25, 2019 at 08:03:42PM -0800, ndesaulni...@google.com wrote:
> Clang warns: vector initializers are not compatible with NEON intrinsics
> in big endian mode [-Wnonportable-vector-initialization]
> 
> While this is usually the case, it's not an issue for this case since
> we're initializing the uint8x16_t (16x uint8_t's) with the same value.
> 
> Instead, use vdupq_n_u8 which both compilers lower into a single movi
> instruction: https://godbolt.org/z/vBrgzt
> 
> This avoids the static storage for a constant value.
> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/214
> Suggested-by: Nathan Chancellor 
> Signed-off-by: Nick Desaulniers 
> ---
>  lib/raid6/neon.uc| 5 ++---
>  lib/raid6/recov_neon_inner.c | 7 ++-
>  2 files changed, 4 insertions(+), 8 deletions(-)
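
For anyone curious, the kind of change this makes is roughly the following
(an illustrative sketch using the raid6 0x1d constant, not the exact hunk
from the patch):

    #include <arm_neon.h>

    /* before: element-wise initializer, warns with clang on big-endian */
    uint8x16_t x1d_old = { 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d,
                           0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d, 0x1d };

    /* after: splat the same byte into every lane; lowers to a single movi */
    uint8x16_t x1d = vdupq_n_u8(0x1d);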

Queued for 5.1. Thanks.

-- 
Catalin


Re: [PATCH v2 0/3] Ensure inX() is ordered wrt delay() routines

2019-02-28 Thread Catalin Marinas
On Fri, Feb 22, 2019 at 06:04:51PM +, Will Deacon wrote:
> Will Deacon (3):
>   asm-generic/io: Pass result of I/O accessor to __io_[p]ar()
>   riscv: io: Update __io_[p]ar() macros to take an argument
>   arm64: io: Hook up __io_par() for inX() ordering

Since we have the acks in place, I plan to queue these patches via the
arm64 tree. They don't conflict with -next (as of today) but if anyone
expects other issues, please let me know.

-- 
Catalin


Re: [PATCH v5] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-02-28 Thread Catalin Marinas
On Tue, Feb 26, 2019 at 06:43:41PM +, James Morse wrote:
> From: Zhang Lei 
> 
> On the Fujitsu-A64FX cores ver(1.0, 1.1), memory access may cause
> an undefined fault (Data abort, DFSC=0b11). This fault occurs under
> a specific hardware condition when a load/store instruction performs an
> address translation. Any load/store instruction, except non-fault access
> including Armv8 and SVE, might cause this undefined fault.
> 
> The TCR_ELx.NFD1 bit is used by the kernel when CONFIG_RANDOMIZE_BASE
> is enabled to mitigate timing attacks against KASLR where the kernel
> address space could be probed using the FFR and suppressed fault on
> SVE loads.
> 
> Since this erratum causes spurious exceptions, which may corrupt
> the exception registers, we clear the TCR_ELx.NFDx=1 bits when
> booting on an affected CPU.
> 
> Signed-off-by: Zhang Lei 
> [Generated MIDR value/mask for __cpu_setup(), removed spurious-fault handler
>  and always disabled the NFDx bits on affected CPUs]
> Signed-off-by: James Morse 
> Tested-by: zhang.lei 

Queued for 5.1. Thanks.

-- 
Catalin


Re: [PATCH v5 00/10] arm64: add system vulnerability sysfs entries

2019-02-28 Thread Catalin Marinas
Hi Jeremy,

On Tue, Feb 26, 2019 at 07:05:34PM -0600, Jeremy Linton wrote:
> Jeremy Linton (6):
>   arm64: Provide a command line to disable spectre_v2 mitigation
>   arm64: add sysfs vulnerability show for meltdown
>   arm64: Always enable spectrev2 vulnerability detection
>   arm64: add sysfs vulnerability show for spectre v2
>   arm64: Always enable ssb vulnerability detection
>   arm64: add sysfs vulnerability show for speculative store bypass
> 
> Marc Zyngier (2):
>   arm64: Advertise mitigation of Spectre-v2, or lack thereof
>   arm64: Use firmware to detect CPUs that are not affected by Spectre-v2
> 
> Mian Yousaf Kaukab (2):
>   arm64: add sysfs vulnerability show for spectre v1
>   arm64: enable generic CPU vulnerabilites support

The patches look fine to me (I'm giving them some testing now). However,
it would be nice if we got the acks/reviewed-by tags from the people
that looked at the previous series (Andre, Suzuki, Julien). You haven't
included any tags in this series; I guess there were sufficient changes
not to carry them over.

If I get the acks by tomorrow, I'll queue them for 5.1, otherwise they'd
have to wait for the next merging window.

Thanks.

-- 
Catalin


Re: [PATCH v3] mm/page_ext: fix an imbalance with kmemleak

2019-02-27 Thread Catalin Marinas
On Wed, Feb 27, 2019 at 12:31:47PM -0500, Qian Cai wrote:
> After offlining a memory block, kmemleak scan will trigger a crash, as it
> encounters a page ext address that has already been freed during memory
> offlining. At the beginning in alloc_page_ext(), it calls
> kmemleak_alloc(), but it does not call kmemleak_free() in
> free_page_ext().
> 
> BUG: unable to handle kernel paging request at 888453d0
> PGD 128a01067 P4D 128a01067 PUD 128a04067 PMD 47e09e067 PTE 800bac2ff060
> Oops:  [#1] SMP DEBUG_PAGEALLOC KASAN PTI
> CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15
> Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
> RIP: 0010:scan_block+0xb5/0x290
> Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01
> 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b
> e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
> RSP: 0018:8881ec57f8e0 EFLAGS: 00010082
> RAX:  RBX: 888453d0 RCX: a61e5a54
> RDX:  RSI: 0008 RDI: 888453d0
> RBP: 8881ec57f920 R08: fbfff4ed588d R09: fbfff4ed588c
> R10: fbfff4ed588c R11: a76ac463 R12: dc00
> R13: 888453d00ff9 R14: 8881f80cef48 R15: 8881f80cef48
> FS:  7f6c0e3f8740() GS:8881f768() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 888453d0 CR3: 0001c4244003 CR4: 001606a0
> Call Trace:
>  scan_gray_list+0x269/0x430
>  kmemleak_scan+0x5a8/0x10f0
>  kmemleak_write+0x541/0x6ca
>  full_proxy_write+0xf8/0x190
>  __vfs_write+0xeb/0x980
>  vfs_write+0x15a/0x4f0
>  ksys_write+0xd2/0x1b0
>  __x64_sys_write+0x73/0xb0
>  do_syscall_64+0xeb/0xaaa
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7f6c0dad73b8
> Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa
> 48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00
> f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
> RSP: 002b:7ffd5b863cb8 EFLAGS: 0246 ORIG_RAX: 0001
> RAX: ffda RBX: 0005 RCX: 7f6c0dad73b8
> RDX: 0005 RSI: 55a9216e1710 RDI: 0001
> RBP: 55a9216e1710 R08: 000a R09: 7ffd5b863840
> R10: 000a R11: 0246 R12: 7f6c0dda9780
> R13: 0005 R14: 7f6c0dda4740 R15: 0005
> Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm
> irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb
> i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod
> efivarfs
> CR2: 888453d0
> ---[ end trace ccf646c7456717c5 ]---
> RIP: 0010:scan_block+0xb5/0x290
> Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01
> 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b
> e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
> RSP: 0018:8881ec57f8e0 EFLAGS: 00010082
> RAX:  RBX: 888453d0 RCX: a61e5a54
> RDX:  RSI: 0008 RDI: 888453d0
> RBP: 8881ec57f920 R08: fbfff4ed588d R09: fbfff4ed588c
> R10: fbfff4ed588c R11: a76ac463 R12: dc00
> R13: 888453d00ff9 R14: 8881f80cef48 R15: 8881f80cef48
> FS:  7f6c0e3f8740() GS:8881f768() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 888453d0 CR3: 0001c4244003 CR4: 001606a0
> Kernel panic - not syncing: Fatal exception
> Shutting down cpus with NMI
> Kernel Offset: 0x24c0 from 0x8100 (relocation range:
> 0x8000-0xbfff)
> ---[ end Kernel panic - not syncing: Fatal exception ]---
> 
> Signed-off-by: Qian Cai 

Reviewed-by: Catalin Marinas 


Re: [PATCH v2] mm/page_ext: fix an imbalance with kmemleak

2019-02-27 Thread Catalin Marinas
On Wed, Feb 27, 2019 at 12:24:45PM -0500, Qian Cai wrote:
> After offlining a memory block, kmemleak scan will trigger a crash, as it
> encounters a page ext address that has already been freed during memory
> offlining. At the beginning in alloc_page_ext(), it calls
> kmemleak_alloc(), but it does not call kmemleak_free() in
> free_page_ext().
[...]
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 8c78b8d45117..0b6637d7bae9 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -274,6 +274,7 @@ static void free_page_ext(void *addr)
>  
>   BUG_ON(PageReserved(page));
>   free_pages_exact(addr, table_size);
> + kmemleak_free(addr);

Same comment as for v1, call kmemleak_free() before free_pages_exact().
With that:

Reviewed-by: Catalin Marinas 


Re: [PATCH] mm/page_ext: fix an imbalance with kmemleak

2019-02-27 Thread Catalin Marinas
On Wed, Feb 27, 2019 at 12:15:56PM -0500, Qian Cai wrote:
> After offlining a memory block, kmemleak scan will trigger a crash, as it
> encounters a page ext address that has already been freed during memory
> offlining. At the beginning in alloc_page_ext(), it calls
> kmemleak_alloc(), but it does not call kmemleak_free() in
> __free_page_ext().
[...]
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index 8c78b8d45117..b68f2a58ea3b 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -288,6 +288,7 @@ static void __free_page_ext(unsigned long pfn)
>   base = get_entry(ms->page_ext, pfn);
>   free_page_ext(base);
>   ms->page_ext = NULL;
> + kmemleak_free(base);
>  }

The kmemleak_free() call should be placed before free_page_ext() to
avoid a small window where the address has been freed but kmemleak not
informed.
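
In other words, something along these lines (a sketch of the suggested
ordering, not the final hunk):

    base = get_entry(ms->page_ext, pfn);
    kmemleak_free(base);    /* inform kmemleak while the object still exists */
    free_page_ext(base);
    ms->page_ext = NULL;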

-- 
Catalin


Re: [PATCH 3/3] arm64: Remove documentation about TIF_USEDFPU

2019-02-26 Thread Catalin Marinas
On Fri, Feb 08, 2019 at 05:04:25PM +, Julien Grall wrote:
> TIF_USEDFPU is not defined as thread flags for Arm64. So drop it from
> the documentation.
> 
> Signed-off-by: Julien Grall 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: linux-arm-ker...@lists.infradead.org

Queued for 5.1. Thanks.

-- 
Catalin


Re: [PATCH] mm/cma: cma_declare_contiguous: correct err handling

2019-02-26 Thread Catalin Marinas
On Tue, Feb 19, 2019 at 07:46:11PM +0200, Mike Rapoport wrote:
> On Tue, Feb 19, 2019 at 05:55:33PM +0100, Vlastimil Babka wrote:
> > On 2/14/19 9:38 PM, Andrew Morton wrote:
> > > On Thu, 14 Feb 2019 12:45:51 + Peng Fan  wrote:
> > > 
> > >> In case cma_init_reserved_mem failed, need to free the memblock allocated
> > >> by memblock_reserve or memblock_alloc_range.
> > >>
> > >> ...
> > >>
> > >> --- a/mm/cma.c
> > >> +++ b/mm/cma.c
> > >> @@ -353,12 +353,14 @@ int __init cma_declare_contiguous(phys_addr_t base,
> > >>  
> > >>  ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma);
> > >>  if (ret)
> > >> -goto err;
> > >> +goto free_mem;
> > >>  
> > >>  pr_info("Reserved %ld MiB at %pa\n", (unsigned long)size / SZ_1M,
> > >>  &base);
> > >>  return 0;
> > >>  
> > >> +free_mem:
> > >> +memblock_free(base, size);
> > >>  err:
> > >>  pr_err("Failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
> > >>  return ret;
> > > 
> > > This doesn't look right to me.  In the `fixed==true' case we didn't
> > > actually allocate anything and in the `fixed==false' case, the
> > > allocated memory is at `addr', not at `base'.
> > 
> > I think it's ok as the fixed==true path has "memblock_reserve()", but
> > better leave this to the memblock maintainer :)
> 
> As Peng Fan noted in the other e-mail, fixed==true has memblock_reserve()
> and fixed==false resets base = addr, so this is Ok.
>  
> > There's also 'kmemleak_ignore_phys(addr)' which should probably be
> > undone (or not called at all) in the failure case. But it seems to be
> > missing from the fixed==true path?
> 
> Well, memblock and kmemleak interaction does not seem to have clear
> semantics anyway. memblock_free() calls kmemleak_free_part_phys() which
> does not seem to care about ignored objects.
> As for the fixed==true path, memblock_reserve() does not register the area
> with kmemleak, so there would be no object to free in memblock_free().
> AFAIU, kmemleak simply ignores this.

Kmemleak is supposed to work with the memblock_{alloc,free} pair and it
ignores the memblock_reserve() as a memblock_alloc() implementation
detail. It is, however, tolerant to memblock_free() being called on a
sub-range or just a different range from a previous memblock_alloc(). So
the original patch looks fine to me. FWIW:

Reviewed-by: Catalin Marinas 


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-25 Thread Catalin Marinas
Hi Szabolcs,

Thanks for looking into this. Comments below.

On Tue, Feb 19, 2019 at 06:38:31PM +, Szabolcs Nagy wrote:
> i think these rules work for the cases i care about, a more
> tricky question is when/how to check for the new syscall abi
> and when/how the TCR_EL1.TBI0 setting may be turned off.

I don't think turning TBI0 off is critical (it's handy for PAC with
52-bit VA but then it's short-lived if you want more security features
like MTE).

> consider the following cases (tb == top byte):
> 
> binary 1: user tb = any, syscall tb = 0
>   tbi is on, "legacy binary"
> 
> binary 2: user tb = any, syscall tb = any
>   tbi is on, "new binary using tb"
>   for backward compat it needs to check for new syscall abi.
> 
> binary 3: user tb = 0, syscall tb = 0
>   tbi can be off, "new binary",
>   binary is marked to indicate unused tb,
>   kernel may turn tbi off: additional pac bits.
> 
> binary 4: user tb = mte, syscall tb = mte
>   like binary 3, but with mte, "new binary using mte"
>   does it have to check for new syscall abi?
>   or MTE HWCAP would imply it?
>   (is it possible to use mte without new syscall abi?)

I think MTE HWCAP should imply it.

> in userspace we want most binaries to be like binary 3 and 4
> eventually, i.e. marked as not-relying-on-tbi, if a dso is
> loaded that is unmarked (legacy or new tb user), then either
> the load fails (e.g. if mte is already used? or can we turn
> mte off at runtime?) or tbi has to be enabled (prctl? does
> this work with pac? or multi-threads?).

We could enable it via prctl. That's the plan for MTE as well (in
addition maybe to some ELF flag).

> as for checking the new syscall abi: i don't see much semantic
> difference between AT_HWCAP and AT_FLAGS (either way, the user
> has to check a feature flag before using the feature of the
> underlying system and it does not matter much if it's a syscall
> abi feature or cpu feature), but i don't see anything wrong
> with AT_FLAGS if the kernel prefers that.

The AT_FLAGS is aimed at capturing binary 2 case above, i.e. the
relaxation of the syscall ABI to accept tb = any. The MTE support will
have its own AT_HWCAP, likely in addition to AT_FLAGS. Arguably,
AT_FLAGS is either redundant here if MTE implies it (and no harm in
keeping it around) or the meaning is different: a tb != 0 may be checked
by the kernel against the allocation tag (i.e. get_user() could fail,
the tag is not entirely ignored).
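
For completeness, the user-space check would end up looking something like
the sketch below (the flag bit is a placeholder for illustration, nothing
had been allocated at this point):

    #include <elf.h>
    #include <sys/auxv.h>

    /* placeholder bit, for illustration only */
    #define AT_FLAGS_TAGGED_ADDR_ABI    (1UL << 0)

    static int tagged_syscall_args_supported(void)
    {
        return (getauxval(AT_FLAGS) & AT_FLAGS_TAGGED_ADDR_ABI) != 0;
    }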

> the discussion here was mostly about binary 2,

That's because passing tb != 0 into the syscall ABI is the main blocker
here that needs clearing out before merging the MTE support. There is,
of course, a variation of binary 1 for MTE:

binary 5: user tb = mte, syscall tb = 0

but this requires a lot of C lib changes to support properly.

> but for
> me the open question is if we can make binary 3/4 work.
> (which requires some elf binary marking, that is recognised
> by the kernel and dynamic loader, and efficient handling of
> the TBI0 bit, ..if it's not possible, then i don't see how
> mte will be deployed).

If we ignore binary 3, we can keep TBI0 = 1 permanently, whether we have
MTE or not.

> and i guess on the kernel side the open question is if the
> rules 1/2/3/4 can be made to work in corner cases e.g. when
> pointers embedded into structs are passed down in ioctl.

We've been trying to track these down since last summer and we came to
the conclusion that it should be (mostly) fine for the non-weird memory
described above.

-- 
Catalin


Re: [PATCH] Use flush tlb last level when change protection

2019-02-24 Thread Catalin Marinas
On Sat, Feb 23, 2019 at 02:47:27PM +0800, Xuefeng Wang wrote:
> The protection attributes are only kept in last level tlb, so
> protection changing only need invalidate last level tlb, exclude
> the PWC entries.
> 
> Signed-off-by: Xuefeng Wang 
> ---
>  mm/mprotect.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 36cb358..0c4303d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -287,7 +287,7 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>  
>   /* Only flush the TLB if we actually modified any entries: */
>   if (pages)
> - flush_tlb_range(vma, start, end);
> + __flush_tlb_range(vma, start, end, PAGE_SIZE, true);
>   dec_tlb_flush_pending(mm);

You are changing a generic file to call an arm64-internal function,
breaking all the other architectures, so NAK.

Do you actually see any performance improvement?

-- 
Catalin


Re: [PATCH v4 03/12] arm64: Remove the ability to build a kernel without ssbd

2019-02-15 Thread Catalin Marinas
On Wed, Jan 30, 2019 at 06:04:15PM +, Andre Przywara wrote:
> On Fri, 25 Jan 2019 12:07:02 -0600
> Jeremy Linton  wrote:
> > Buried behind EXPERT is the ability to build a kernel without
> > SSBD, this needlessly clutters up the code as well as creates
> > the opportunity for bugs. It also removes the kernel's ability
> > to determine if the machine it's running on is vulnerable.
> 
> I don't know the original motivation for this config option, typically
> they are not around for no reason.
> I see the benefit of dropping those config options, but we want to make
> sure that people don't start hacking around to remove them again.
> 
> > Since it's also possible to disable it at boot time, let's remove
> > the config option.
> 
> Given the level of optimisation a compiler can do with the state being
> known at compile time, I would imagine that it's not the same (though
> probably very close).
> 
> But that's not my call, it would be good to hear some maintainer's
> opinion on this.

Having spoken to Will, we'd rather keep the config options if possible.
Even if they are behind EXPERT and default y, they come in handy when
debugging.

Can we still have the sysfs information regardless of whether the config
is enabled or not? IOW, move the #ifdefs around to always have the
detection while being able to disable the actual workarounds via config?

Are the code paths between config and cmdline disabling identical? At a
quick look I got the impression they are not exactly the same.

-- 
Catalin


Re: [PATCH v3 0/4] arm64: kprobes: Update blacklist checking on arm64

2019-02-15 Thread Catalin Marinas
Hi Masami,

On Wed, Feb 13, 2019 at 12:42:53AM +0900, Masami Hiramatsu wrote:
> In v3, I rebased on the latest arm64 kernel which includes
> James' KVM/HYP fixes for kprobes, and fixed a reported bug
> in [4/4].

I will queue these patches at 5.1-rc1 since the arm64 for-next/core
branch is currently based on -rc3 and does not include James' fix. Even
if I rebased your patches on top of -rc3, we'd still get a conflict when
pushing into mainline, so I'd rather send them after -rc1.

Thanks.

-- 
Catalin


Re: [PATCH v2 3/5] kmemleak: account for tagged pointers when calculating pointer range

2019-02-15 Thread Catalin Marinas
On Wed, Feb 13, 2019 at 02:58:28PM +0100, Andrey Konovalov wrote:
> kmemleak keeps two global variables, min_addr and max_addr, which store
> the range of valid (encountered by kmemleak) pointer values, which it
> later uses to speed up pointer lookup when scanning blocks.
> 
> With tagged pointers this range will get bigger than it needs to be.
> This patch makes kmemleak untag pointers before saving them to min_addr
> and max_addr and when performing a lookup.
> 
> Signed-off-by: Andrey Konovalov 
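
For context, the gist of the change is along these lines (a sketch, not the
exact hunks):

    /* when recording an object */
    untagged_ptr = (unsigned long)kasan_reset_tag((void *)ptr);
    min_addr = min(min_addr, untagged_ptr);
    max_addr = max(max_addr, untagged_ptr + size);

    /* when scanning, before the min_addr/max_addr range check */
    untagged_ptr = (unsigned long)kasan_reset_tag((void *)pointer);
    if (untagged_ptr < min_addr || untagged_ptr >= max_addr)
        continue;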

I reviewed the old series. This patch also looks fine:

Acked-by: Catalin Marinas 


Re: [PATCH 3/5] kmemleak: account for tagged pointers when calculating pointer range

2019-02-15 Thread Catalin Marinas
On Mon, Feb 11, 2019 at 10:59:52PM +0100, Andrey Konovalov wrote:
> kmemleak keeps two global variables, min_addr and max_addr, which store
> the range of valid (encountered by kmemleak) pointer values, which it
> later uses to speed up pointer lookup when scanning blocks.
> 
> With tagged pointers this range will get bigger than it needs to be.
> This patch makes kmemleak untag pointers before saving them to min_addr
> and max_addr and when performing a lookup.
> 
> Signed-off-by: Andrey Konovalov 

Acked-by: Catalin Marinas 


Re: [PATCH 6/8] initramfs: move the legacy keepinitrd parameter to core code

2019-02-14 Thread Catalin Marinas
On Wed, Feb 13, 2019 at 06:46:19PM +0100, Christoph Hellwig wrote:
> No need to handle the freeing disable in arch code when we already
> have a core hook (and a different name for the option) for it.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/Kconfig |  7 +++
>  arch/arm/Kconfig |  1 +
>  arch/arm/mm/init.c   | 25 ++---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/mm/init.c | 17 ++---
>  arch/unicore32/Kconfig   |  1 +
>  arch/unicore32/mm/init.c | 14 +-
>  init/initramfs.c |  9 +
>  8 files changed, 28 insertions(+), 47 deletions(-)

For arm64:

Acked-by: Catalin Marinas 


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2019-02-12 Thread Catalin Marinas
On Mon, Feb 11, 2019 at 12:32:55PM -0800, Evgenii Stepanov wrote:
> On Mon, Feb 11, 2019 at 9:28 AM Kevin Brodsky  wrote:
> > On 19/12/2018 12:52, Dave Martin wrote:
> > > Really, the kernel should do the expected thing with all "non-weird"
> > > memory.
> > >
> > > In lieu of a proper definition of "non-weird", I think we should have
> > > some lists of things that are explicitly included, and also excluded:
> > >
> > > OK:
> > >   kernel-allocated process stack
> > >   brk area
> > >   MAP_ANONYMOUS | MAP_PRIVATE
> > >   MAP_PRIVATE mappings of /dev/zero
> > >
> > > Not OK:
> > >   MAP_SHARED
> > >   mmaps of non-memory-like devices
> > >   mmaps of anything that is not a regular file
> > >   the VDSO
> > >   ...
> > >
> > > In general, userspace can tag memory that it "owns", and we do not assume
> > > a transfer of ownership except in the "OK" list above.  Otherwise, it's
> > > the kernel's memory, or the owner is simply not well defined.
> >
> > Agreed on the general idea: a process should be able to pass tagged 
> > pointers at the
> > syscall interface, as long as they point to memory privately owned by the 
> > process. I
> > think it would be possible to simplify the definition of "non-weird" memory 
> > by using
> > only this "OK" list:
> > - mmap() done by the process itself, where either:
> >* flags = MAP_PRIVATE | MAP_ANONYMOUS
> >* flags = MAP_PRIVATE and fd refers to a regular file or a well-defined 
> > list of
> > device files (like /dev/zero)
> > - brk() done by the process itself
> > - Any memory mapped by the kernel in the new process's address space during 
> > execve(),
> > with the same restrictions as above ([vdso]/[vvar] are therefore excluded)

Sounds reasonable.
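
To make the first "OK" case above concrete, the kind of usage we are talking
about is roughly this (illustrative only, error handling omitted and the tag
value arbitrary):

    #include <sys/mman.h>
    #include <unistd.h>

    void demo(int fd, size_t size)
    {
        void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        void *tagged = (void *)((unsigned long)p | (0x2aUL << 56));

        /* expected to behave exactly as with the untagged pointer */
        read(fd, tagged, size);
    }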

> > >   * Userspace should set tags at the point of allocation only.
> >
> > Yes, tags are only supposed to be set at the point of either allocation or
> > deallocation/reallocation. However, allocators can in principle be nested, 
> > so an
> > allocator could  take a region allocated by malloc() as input and subdivide 
> > it
> > (changing tags in the process). That said, this suballocator must not 
> > free() that
> > region until all the suballocations themselves have been freed (thereby 
> > restoring the
> > tags initially set by malloc()).
> >
> > >   * If you don't know how an object was allocated, you cannot modify the
> > > tag, period.
> >
> > Agreed, allocators that tag memory can only operate on memory with a 
> > well-defined
> > provenance (for instance anonymous mmap() or malloc()).
> >
> > >   * A single C object should be accessed using a single, fixed pointer tag
> > > throughout its entire lifetime.
> >
> > Agreed. Allocators themselves may need to be excluded though, depending on 
> > how they
> > represent their managed memory.
> >
> > >   * Tags can be changed only when there are no outstanding pointers to
> > > the affected object or region that may be used to access the object
> > > or region (i.e., if the object were allocated from the C heap and
> > > is it safe to realloc() it, then it is safe to change the tag; for
> > > other types of allocation, analogous arguments can be applied).
> >
> > Tags can only be changed at the point of deallocation/reallocation. 
> > Pointers to the
> > object become invalid and cannot be used after it has been deallocated; 
> > memory
> > tagging allows to catch such invalid usage.

All the above sound well but that's mostly a guideline on what a C
library can do. It doesn't help much with defining the kernel ABI.
Anyway, it's good to clarify the use-cases.

> > >   * When the kernel dereferences a pointer on userspace's behalf, it
> > > shall behave equivalently to userspace dereferencing the same pointer,
> > > including use of the same tag (where passed by userspace).
> > >
> > >   * Where the pointer tag affects pointer dereference behaviour (i.e.,
> > > with hardware memory colouring) the kernel makes no guarantee to
> > > honour pointer tags correctly for every location a buffer based on a
> > > pointer passed by userspace to the kernel.
> > >
> > > (This means for example that for a read(fd, buf, size), we can check
> > > the tag for a single arbitrary location in *(char (*)[size])buf
> > > before passing the buffer to get_user_pages().  Hopefully this could
> > > be done in get_user_pages() itself rather than hunting call sites.
> > > For userspace, it means that you're on your own if you ask the
> > > kernel to operate on a buffer than spans multiple, independently-
> > > allocated objects, or a deliberately striped single object.)
> >
> > I think both points are reasonable. It is very valuable for the kernel to 
> > access
> > userspace memory using the user-provided tag, because it enables kernel 
> > accesses to
> > be checked in the same way as user accesses, allowing to detect bugs that 
> > are
> > potentially hard to find. For 

Re: add config symbols for arch_{setup,teardown}_dma_ops

2019-02-11 Thread Catalin Marinas
On Mon, Feb 11, 2019 at 02:21:56PM +0100, Christoph Hellwig wrote:
> Any chance to get a quick review on this small series?

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH] arm64: use memblocks_present

2019-02-11 Thread Catalin Marinas
On Sun, Feb 10, 2019 at 09:28:43AM +, Peng Fan wrote:
> arm64_memory_present is doing the same thing as memblocks_present, so
> let's use the common memblocks_present code instead of the
> platform-specific arm64_memory_present.
> 
> Signed-off-by: Peng Fan 

I already merged a similar one (see commit
a2c801c53d1682871fba1e037c9d3b0c9fffee8a in -next).

-- 
Catalin


Re: [PATCH v9 0/8] arm64: untag user pointers passed to the kernel

2019-02-11 Thread Catalin Marinas
Hi Dave,

On Wed, Dec 12, 2018 at 05:01:12PM +, Dave P Martin wrote:
> On Mon, Dec 10, 2018 at 01:50:57PM +0100, Andrey Konovalov wrote:
> > arm64 has a feature called Top Byte Ignore, which allows to embed pointer
> > tags into the top byte of each pointer. Userspace programs (such as
> > HWASan, a memory debugging tool [1]) might use this feature and pass
> > tagged user pointers to the kernel through syscalls or other interfaces.
[...]
> It looks like there's been a lot of progress made here towards smoking
> out most of the sites in the kernel where pointers need to be untagged.

In summary, based on last summer's analysis, there are two main (and
rather broad) scenarios of __user pointers use in the kernel: (a)
uaccess macros, together with access_ok() checks and (b) identifying
of user address ranges (find_vma() and related, some ioctls). The
patches here handle the former by allowing sign-extension in access_ok()
and subsequent uaccess routines work fine with tagged pointers.
Identifying the latter is a bit more problematic and the approach we
took was tracking down pointer-to-long conversions, which seems to cover
the majority of cases. However, this approach doesn't scale as, for
example, we'd rather change get_user_pages() to sign-extend the address
rather than all the callers. In lots of other cases we don't even need
untagging as we don't expect user space to tag such pointers (i.e.
mmap() of device memory).
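
In code terms, the relaxation for (a) boils down to a sign-extension,
something like the sketch below (the idea from the series, not the exact
hunks):

    /*
     * Bit 55 selects TTBR0 vs TTBR1, so sign-extending from it strips the
     * tag for TTBR0 (user) addresses.
     */
    #define untagged_addr(addr)    sign_extend64(addr, 55)

    #define access_ok(addr, size)    \
        __range_ok(untagged_addr((unsigned long)(addr)), size)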

We might be able to improve the static analysis by introducing a
virt_addr_t but that's significant effort and we still won't cover all
cases (e.g. it doesn't necessarily catch tcp_zerocopy_receive() which
wouldn't use a pointer, just a u64 for address).

> However, I do think that we need a clear policy for how existing kernel
> interfaces are to be interpreted in the presence of tagged pointers.
> Unless we have that nailed down, we are likely to be able to make only
> vague guarantees to userspace about what works, and the ongoing risk
> of ABI regressions and inconsistencies seems high.

I agree.

> Can we define an opt-in for tagged-pointer userspace, that rejects all
> syscalls that we haven't checked and whitelisted (or that are
> uncheckable like ioctl)? 

Defining an opt-in is not a problem, however, rejecting all syscalls
that we haven't whitelisted is not feasible. We can have an opt-in per
process (that's what we were going to do with MTE) but the only thing
we can reasonably do is change the behaviour of access_ok(). That's too
big a knob and a new syscall that we haven't got around to whitelist may
just work. This eventually leads to de-facto ABI and our whitelist would
simply be ignored.

I'm not really keen on a big syscall shim in the arm64 kernel which
checks syscall arguments, including in-struct values. If we are to do
this, I'd rather keep it in user space as part of the C library.

> In the meantime, I think we really need to nail down the kernel's
> policies on
> 
>  * in the default configuration (without opt-in), is the presence of
> non-address bits in pointers exchanged with the kernel simply
> considered broken?  (Even with this series, the de factor answer
> generally seems to be "yes", although many specific things will now
> work fine)

Without these patches, passing non-address bits in pointers is
considered broken. I couldn't find a case where it would still work with
non-zero tag but maybe I haven't looked hard enough.

>  * if not, how do we tighten syscall / interface specifications to
> describe what happens with pointers containing non-address bits, while
> keeping the existing behaviour for untagged pointers?
> 
> We would want a general recipe that gives clear guidance on what
> userspace should expect an arbitrarily chosen syscall to do with its
> pointers, without having to enumerate each and every case.

That's what we are aiming at with the pointer origins: to move away from a
syscall whitelist to a generic definition. That said, the two approaches
are orthogonal, we can use the pointer origins as the base rule for
which syscalls can be whitelisted.

If we step back a bit to look at the use-case for TBI (and MTE), the
normal application programmer shouldn't really care about this ABI
(well, most of the time). The app gets a tagged pointer from the C
library as a result of a malloc()/realloc() (possibly alloca()) call and
it expects to be able to pass it back into the kernel (usually via the C
library) without any awareness of the non-address bits. Now, we can't
define a user/kernel ABI based on the provenance of the pointer in user
space (i.e. we only support tags for heap and stack), so we are trying
to generalise this based where the pointer originated from in the kernel
(e.g. anonymous mmap()).

> There may already be some background on these topics -- can you throw me
> a link if so?

That's an interesting sub-thread to read:

https://lore.kernel.org/lkml/5d54526e5ff2e5ad63d0dfdd9ab17cf359afa4f2.1535629099.git.andreyk...@google.com/

-- 

Re: [PATCH v10 12/25] arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking

2019-02-08 Thread Catalin Marinas
On Fri, Feb 08, 2019 at 09:36:48AM +, Julien Thierry wrote:
> From e839dec632bbf440efe8314751138ba46324078c Mon Sep 17 00:00:00 2001
> From: Julien Thierry 
> Date: Fri, 8 Feb 2019 09:21:58 +
> Subject: [PATCH] arm64: irqflags: Fix clang build warnings
> 
> Clang complains when passing asm operands that are smaller than the
> registers they are mapped to:
> 
> arch/arm64/include/asm/irqflags.h:50:10: warning: value size does not
>   match register size specified by the constraint and modifier
>   [-Wasm-operand-widths]
> : "r" (GIC_PRIO_IRQON)
> 
> Fix it by casting the affected input operands to a type of the correct
> size.
> 
> Reported-by: Nathan Chancellor 
> Signed-off-by: Julien Thierry 

Applied. Thanks.

-- 
Catalin


Re: [PATCH v10 00/25] arm64: provide pseudo NMI with GICv3

2019-02-06 Thread Catalin Marinas
Hi Julien,

On Thu, Jan 31, 2019 at 02:58:38PM +, Julien Thierry wrote:
> This series is a continuation of the work started by Daniel [1]. The goal
> is to use GICv3 interrupt priorities to simulate an NMI.
> 
> The patches depend on the core API for NMIs patches [2]. Both series can
> be found on this branch:
> git clone http://linux-arm.org/linux-jt.git -b v5.0-pseudo-nmi

I queued these patches in the arm64 for-next/core, on top of Marc's
generic-nmi branch:

git://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms irq/generic-nmi

I'll push the changes out later today, once my tests finished.

Thanks for pushing this series through.

-- 
Catalin


Re: [PATCH -next] efi/arm64: return zero from ptdump_init()

2019-02-05 Thread Catalin Marinas
On Mon, Feb 04, 2019 at 04:01:01PM -0500, Qian Cai wrote:
> The commit e2a2e56e4082 ("arm64: dump: no need to check return value of
> debugfs_create functions") converted ptdump_debugfs_register() to return
> void, but forgot to fix the efi version of ptdump_init().
> 
> drivers/firmware/efi/arm-runtime.c: In function 'ptdump_init':
> drivers/firmware/efi/arm-runtime.c:52:9: error: void value not ignored as it ought to be
>52 |  return ptdump_debugfs_register(&efi_ptdump_info, "efi_page_tables");
>   | ^~~~
> drivers/firmware/efi/arm-runtime.c:53:1: warning: control reaches end of non-void function [-Wreturn-type]
> 
> Fixes: e2a2e56e4082 ("arm64: dump: no need to check return value of debugfs_create functions")
> Signed-off-by: Qian Cai 

A fix is already in -next (commit 67f52a9540e0).

-- 
Catalin


Re: [RFC PATCH 1/3] arm64: entry: Remove unneeded need_resched() loop

2019-02-04 Thread Catalin Marinas
On Thu, Jan 31, 2019 at 06:23:37PM +, Valentin Schneider wrote:
> Since the enabling and disabling of IRQs within preempt_schedule_irq()
> is contained in a need_resched() loop, we don't need the outer arch
> code loop.
> 
> Reported-by: Julien Thierry 
> Reported-by: Will Deacon 
> Signed-off-by: Valentin Schneider 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Mark Rutland 
> Cc: Marc Zyngier 
> Cc: Julien Grall 
> Cc: Julien Thierry 
> Cc: Thomas Gleixner 
> Cc: linux-arm-ker...@lists.infradead.org

Queued for 5.1. Thanks

-- 
Catalin


Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-29 Thread Catalin Marinas
Hi,

Could you please copy the whole description from the cover letter to the
actual patch and only send one email (full description as in here
together with the patch)? If we commit this to the kernel, it would be
useful to have the information in the log for reference later on.

More comments below:

On Tue, Jan 29, 2019 at 12:29:58PM +, Zhang, Lei wrote:
> On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),  
> memory accesses may cause an undefined fault (Data abort, DFSC=0b11).
> This problem will be fixed in the next version of the Fujitsu-A64FX.
> 
> This fault occurs under a specific hardware condition 
> when a load/store instruction performs an address translation using:
>   case-1  TTBR0_EL1 with TCR_EL1.NFD0 == 1.
>   case-2  TTBR0_EL2 with TCR_EL2.NFD0 == 1.
>   case-3  TTBR1_EL1 with TCR_EL1.NFD1 == 1.
>   case-4  TTBR1_EL2 with TCR_EL2.NFD1 == 1.
> And this fault occurs completely spuriously.

So this looks like new information on the hardware behaviour since the
v2 of the patch. Can this fault occur for any type of instruction
accessing the memory or only for SVE instructions?

> Since TCR_ELx.NFD1 is set to '1' by the kernel in versions
> past 4.17, case-3 or case-4 may happen.
> 
> This fault can be taken only at stage-1, 
> so this fault is taken from EL0 to EL1/EL2, from EL1 to EL1, 
> or from EL2 to EL2.
> 
> I would like to post a workaround to avoid this problem on 
> existing Fujitsu-A64FX versions.

How likely is it to trigger this erratum? In other words, aren't we
better off with a spurious fault that we ignore rather than toggling the
TCR_ELx.NFD1 bit?

> There are 2 points in this workaround.
> Point1: trap from EL1 to EL1, EL2 to EL2
> Set '0' to TCR_ELx.NFD1 in kernel-entry,
> and set '1' in kernel-exit.
> 
> From the view point of ARM specification, there is no problem to 
> reset TCR_ELx.{NFD0,NFD1} while in EL1/EL2, because 
> TCR_ELx.{NFD0,NFD1} controls whether to perform a translation 
> table walk in response to an access from EL0.

The problem is that this bit may be cached in the TLB (I haven't checked
the ARM ARM but that's usually the case with the TCR_ELx bits). If
that's the case, you can't guarantee a change unless you also perform a
TLBI VMALL. Arguably, if Fujitsu's microarchitecture doesn't cache the
NFD bits in the TLB, we could apply the workaround but I'd rather have
the spurious trap if it's not too often.

> I confirmed that:
> ・There is no load/store instruction between 
>   tramp_ventry and setting TCR_ELx.NFD1 to '0'.
> ・There is no load/store instruction between 
>   setting TCR_ELx.NFD1 to '1' and tramp_exit.

Could speculative loads also trigger this? Another option would be to
toggle it during kernel_neon_begin/end (with the caveat of TLBI as
mentioned above).

-- 
Catalin


Re: [PATCH 1/4] arm64: dump: no need to check return value of debugfs_create functions

2019-01-25 Thread Catalin Marinas
On Tue, Jan 22, 2019 at 03:41:11PM +0100, Greg Kroah-Hartman wrote:
> When calling debugfs functions, there is no need to ever check the
> return value.  The function can work or not, but the code logic should
> never do something different based on this.
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Marc Zyngier 
> Cc: Peng Donglin 
> Cc: 
> Signed-off-by: Greg Kroah-Hartman 

Acked-by: Catalin Marinas 


Re: [PATCH v2 1/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

2019-01-25 Thread Catalin Marinas
On Tue, Jan 22, 2019 at 08:54:33AM +, Zhang, Lei wrote:
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index efb7b2c..37e4f18 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -666,6 +666,28 @@ static int do_sea(unsigned long addr, unsigned int esr, struct pt_regs *regs)
>   return 0;
>  }
>  
> +static int do_bad_unknown_63(unsigned long addr, unsigned int esr, struct pt_regs *regs)
> +{
> + /*
> +  * On some variants of the Fujitsu-A64FX cores ver(1.0, 1.1),
> +  * memory accesses may spuriously trigger data aborts with
> +  * DFSC=0b11.
> +  */
> + if (IS_ENABLED(CONFIG_FUJITSU_ERRATUM_010001)) {
> + if (cpus_have_cap(ARM64_WORKAROUND_FUJITSU_A64FX_011)) {
> + return 0;
> + } else { /* cpu capabilities maybe not ready*/
> + unsigned int current_cpu_midr = read_cpuid_id();
> + const struct midr_range fujitsu_a64fx_midr_range = {
> + MIDR_FUJITSU_A64FX, MIDR_CPU_VAR_REV(0, 0), MIDR_CPU_VAR_REV(1, 0)
> + };
> + if (is_midr_in_range(current_cpu_midr, &fujitsu_a64fx_midr_range) == TRUE)
> + return 0;
> + }
> + }
> + return do_bad(addr, esr, regs);
> +}

IIUC, this can happen very early when the errata framework isn't yet
ready. Given that this is not on a fast path (you already took a fault),
I don't think it's worth optimising for cpus_have_cap() (and
ARM64_WORKAROUND_FUJITSU_A64FX_011). I've seen Mark's comments on
why checking MIDR in a preemptible context is not a good idea but I
suspect your platform is homogeneous (i.e. not big.LITTLE).
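
In other words, something along the lines of the sketch below (illustrative
only, with fujitsu_a64fx_midr_range moved to file scope):

    static int do_bad_unknown_63(unsigned long addr, unsigned int esr,
                                 struct pt_regs *regs)
    {
        if (IS_ENABLED(CONFIG_FUJITSU_ERRATUM_010001) &&
            is_midr_in_range(read_cpuid_id(), &fujitsu_a64fx_midr_range))
            return 0;    /* ignore the spurious Data Abort */

        return do_bad(addr, esr, regs);
    }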

-- 
Catalin


Re: [PATCH v2 06/21] memblock: memblock_phys_alloc_try_nid(): don't panic

2019-01-25 Thread Catalin Marinas
On Mon, Jan 21, 2019 at 10:03:53AM +0200, Mike Rapoport wrote:
> diff --git a/arch/arm64/mm/numa.c b/arch/arm64/mm/numa.c
> index ae34e3a..2c61ea4 100644
> --- a/arch/arm64/mm/numa.c
> +++ b/arch/arm64/mm/numa.c
> @@ -237,6 +237,10 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>   pr_info("Initmem setup node %d []\n", nid);
>  
>   nd_pa = memblock_phys_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> + if (!nd_pa)
> + panic("Cannot allocate %zu bytes for node %d data\n",
> +   nd_size, nid);
> +
>   nd = __va(nd_pa);
>  
>   /* report and initialize */

Does it mean that memblock_phys_alloc_try_nid() never returns valid
physical memory starting at 0?

-- 
Catalin


Re: [PATCH v2 7/9] arm64: kdump: No need to mark crashkernel pages manually PG_reserved

2019-01-25 Thread Catalin Marinas
On Mon, Jan 14, 2019 at 01:59:01PM +0100, David Hildenbrand wrote:
> The crashkernel is reserved via memblock_reserve(). memblock_free_all()
> will call free_low_memory_core_early(), which will go over all reserved
> memblocks, marking the pages as PG_reserved.
> 
> So manually marking pages as PG_reserved is not necessary, they are
> already in the desired state (otherwise they would have been handed over
> to the buddy as free pages and bad things would happen).
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: James Morse 
> Cc: Bhupesh Sharma 
> Cc: David Hildenbrand 
> Cc: Mark Rutland 
> Cc: Dave Kleikamp 
> Cc: Andrew Morton 
> Cc: Mike Rapoport 
> Cc: Michal Hocko 
> Cc: Florian Fainelli 
> Cc: Stefan Agner 
> Cc: Laura Abbott 
> Cc: Greg Hackmann 
> Cc: Johannes Weiner 
> Cc: Kristina Martsenko 
> Cc: CHANDAN VN 
> Cc: AKASHI Takahiro 
> Cc: Logan Gunthorpe 
> Reviewed-by: Matthias Brugger 
> Signed-off-by: David Hildenbrand 

Acked-by: Catalin Marinas 


Re: [PATCH v2 6/9] arm64: kexec: no need to ClearPageReserved()

2019-01-25 Thread Catalin Marinas
On Mon, Jan 14, 2019 at 01:59:00PM +0100, David Hildenbrand wrote:
> This will be done by free_reserved_page().
> 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Bhupesh Sharma 
> Cc: James Morse 
> Cc: Marc Zyngier 
> Cc: Dave Kleikamp 
> Cc: Mark Rutland 
> Cc: Andrew Morton 
> Cc: Michal Hocko 
> Cc: Matthew Wilcox 
> Acked-by: James Morse 
> Signed-off-by: David Hildenbrand 

Acked-by: Catalin Marinas 


Re: [PATCH v2 29/29] y2038: add 64-bit time_t syscalls to all 32-bit architectures

2019-01-25 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 05:18:35PM +0100, Arnd Bergmann wrote:
> This adds 21 new system calls on each ABI that has 32-bit time_t
> today. All of these have the exact same semantics as their existing
> counterparts, and the new ones all have macro names that end in 'time64'
> for clarification.
> 
> This gets us to the point of being able to safely use a C library
> that has 64-bit time_t in user space. There are still a couple of
> loose ends to tie up in various areas of the code, but this is the
> big one, and should be entirely uncontroversial at this point.
> 
> In particular, there are four system calls (getitimer, setitimer,
> waitid, and getrusage) that don't have a 64-bit counterpart yet,
> but these can all be safely implemented in the C library by wrapping
> around the existing system calls because the 32-bit time_t they
> pass only counts elapsed time, not time since the epoch. They
> will be dealt with later.
> 
> Signed-off-by: Arnd Bergmann 

Acked-by: Catalin Marinas 

(as long as compat follows the arm32 syscall numbers)


Re: [PATCH v2 25/29] y2038: syscalls: rename y2038 compat syscalls

2019-01-25 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 05:18:31PM +0100, Arnd Bergmann wrote:
> A lot of system calls that pass a time_t somewhere have an implementation
> using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
> been reworked so that this implementation can now be used on 32-bit
> architectures as well.
> 
> The missing step is to redefine them using the regular SYSCALL_DEFINEx()
> to get them out of the compat namespace and make it possible to build them
> on 32-bit architectures.
> 
> Any system call that ends in 'time' gets a '32' suffix on its name for
> that version, while the others get a '_time32' suffix, to distinguish
> them from the normal version, which takes a 64-bit time argument in the
> future.
> 
> In this step, only 64-bit architectures are changed, doing this rename
> first lets us avoid touching the 32-bit architectures twice.
> 
> Signed-off-by: Arnd Bergmann 

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v2 07/29] ARM: add kexec_file_load system call number

2019-01-25 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 05:18:13PM +0100, Arnd Bergmann wrote:
> diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
> index 86de9eb34296..20ed7e026723 100644
> --- a/arch/arm/tools/syscall.tbl
> +++ b/arch/arm/tools/syscall.tbl
> @@ -415,3 +415,4 @@
>  398  common  rseqsys_rseq
>  399  common  io_pgetevents   sys_io_pgetevents
>  400  common  migrate_pages   sys_migrate_pages
> +401  common  kexec_file_load sys_kexec_file_load

I presume on arm32 this would still return -ENOSYS.

> diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
> index 261216c3336e..2c30e6f145ff 100644
> --- a/arch/arm64/include/asm/unistd.h
> +++ b/arch/arm64/include/asm/unistd.h
> @@ -44,7 +44,7 @@
>  #define __ARM_NR_compat_set_tls  (__ARM_NR_COMPAT_BASE + 5)
>  #define __ARM_NR_COMPAT_END  (__ARM_NR_COMPAT_BASE + 0x800)
>  
> -#define __NR_compat_syscalls 401
> +#define __NR_compat_syscalls 402
>  #endif
>  
>  #define __ARCH_WANT_SYS_CLONE
> diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
> index f15bcbacb8f6..8ca1d4c304f4 100644
> --- a/arch/arm64/include/asm/unistd32.h
> +++ b/arch/arm64/include/asm/unistd32.h
> @@ -823,6 +823,8 @@ __SYSCALL(__NR_rseq, sys_rseq)
>  __SYSCALL(__NR_io_pgetevents, compat_sys_io_pgetevents)
>  #define __NR_migrate_pages 400
>  __SYSCALL(__NR_migrate_pages, compat_sys_migrate_pages)
> +#define __NR_kexec_file_load 401
> +__SYSCALL(__NR_kexec_file_load, sys_kexec_file_load)

For arm64:

Acked-by: Catalin Marinas 


Re: [PATCH v2 06/29] ARM: add migrate_pages() system call

2019-01-25 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 05:18:12PM +0100, Arnd Bergmann wrote:
> The migrate_pages system call has an assigned number on all architectures
> except ARM. When it got added initially in commit d80ade7b3231 ("ARM:
> Fix warning: #warning syscall migrate_pages not implemented"), it was
> intentionally left out based on the observation that there are no 32-bit
> ARM NUMA systems.
> 
> However, there are now arm64 NUMA machines that can in theory run 32-bit
> kernels (actually enabling NUMA there would require additional work)
> as well as 32-bit user space on 64-bit kernels, so that argument is no
> longer very strong.
> 
> Assigning the number lets us use the system call on 64-bit kernels as well
> as providing a more consistent set of syscalls across architectures.
> 
> Signed-off-by: Arnd Bergmann 

For the arm64 part:

Acked-by: Catalin Marinas 


Re: [PATCH] arm64: Kconfig.platforms: fix warning unmet direct dependencies

2019-01-25 Thread Catalin Marinas
On Tue, Jan 15, 2019 at 08:18:39PM +0100, Anders Roxell wrote:
> When ARCH_MXC gets enabled, ARM64_ERRATUM_845719 will be selected and
> this warning will happen when COMPAT isn't set.
> 
> WARNING: unmet direct dependencies detected for ARM64_ERRATUM_845719
>   Depends on [n]: COMPAT [=n]
>   Selected by [y]:
>   - ARCH_MXC [=y]
> 
> Rework to add 'if COMPAT' before ARM64_ERRATUM_845719 gets selected,
> since ARM64_ERRATUM_845719 depends on COMPAT.
> 
> Signed-off-by: Anders Roxell 
> ---
>  arch/arm64/Kconfig.platforms | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig.platforms b/arch/arm64/Kconfig.platforms
> index 251ecf34cb02..d4faca775d9c 100644
> --- a/arch/arm64/Kconfig.platforms
> +++ b/arch/arm64/Kconfig.platforms
> @@ -145,7 +145,7 @@ config ARCH_MVEBU
>  config ARCH_MXC
>   bool "ARMv8 based NXP i.MX SoC family"
>   select ARM64_ERRATUM_843419
> - select ARM64_ERRATUM_845719
> + select ARM64_ERRATUM_845719 if COMPAT
>   help
> This enables support for the ARMv8 based SoCs in the
> NXP i.MX family.

Actually, do we need to select the errata workarounds explicitly? That
seems to be the only case where we do it (commit 930507c18304, "arm64:
add basic Kconfig symbols for i.MX8"). They are default y, so we
shouldn't need to force them on.

-- 
Catalin


Re: [PATCH v3 0/4] uaccess: Add unsafe accessors for arm64

2019-01-25 Thread Catalin Marinas
Hi Julien,

On Tue, Jan 15, 2019 at 01:58:25PM +, Julien Thierry wrote:
> Julien Thierry (4):
>   arm64: uaccess: Cleanup get/put_user()
>   arm64: uaccess: Implement unsafe accessors
>   uaccess: Check no rescheduling function is called in unsafe region
>   arm64: uaccess: Implement user_access_region_active

I queued the first two patches for 5.1. It would be nice to upstream the
other two but patch 3 needs an ack from (or merged by) the scheduler
folk.

Thanks.

-- 
Catalin


Re: [Qestion] Softlockup when send IPI to other CPUs

2019-01-24 Thread Catalin Marinas
Hi Shijith,

On Thu, Jan 24, 2019 at 07:00:42AM +, Shijith Thotton wrote:
> On 01/23/2019 11:45 PM, Catalin Marinas wrote:
> > diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
> > index 30695a868107..5c9073bace83 100644
> > --- a/arch/arm64/mm/flush.c
> > +++ b/arch/arm64/mm/flush.c
> > @@ -33,7 +33,11 @@ void sync_icache_aliases(void *kaddr, unsigned long len)
> > __clean_dcache_area_pou(kaddr, len);
> > __flush_icache_all();
> > } else {
> > -   flush_icache_range(addr, addr + len);
> > +   /*
> > +* Don't issue kick_all_cpus_sync() after I-cache invalidation
> > +* for user mappings.
> > +*/
> > +   __flush_icache_range(addr, addr + len);
> > }
> >   }
> 
> We also faced a similar issue with the LTP test migrate_pages03 in the past
> and this patch fixes the issue.
> 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2019-January/623574.html

Thanks for confirming. I presume I can add your tested-by.

-- 
Catalin


Re: [Qestion] Softlockup when send IPI to other CPUs

2019-01-23 Thread Catalin Marinas
On Tue, Jan 22, 2019 at 05:44:02AM +, Will Deacon wrote:
> On Mon, Jan 21, 2019 at 02:21:28PM +0000, Catalin Marinas wrote:
> > arm64: Do not issue IPIs for user executable ptes
> > 
> > From: Catalin Marinas 
> > 
> > Commit 3b8c9f1cdfc5 ("arm64: IPI each CPU after invalidating the I-cache
> > for kernel mappings") was aimed at fixing the I-cache invalidation for
> > kernel mappings. However, it inadvertently caused all cache maintenance
> > for user mappings via set_pte_at() -> __sync_icache_dcache() to call
> > kick_all_cpus_sync().
> > 
> > > Fixes: 3b8c9f1cdfc5 ("arm64: IPI each CPU after invalidating the I-cache for kernel mappings")
> > Cc:  # 4.19.x-
> > Signed-off-by: Catalin Marinas 
> > ---
> >  arch/arm64/include/asm/cacheflush.h |2 +-
> >  arch/arm64/kernel/probes/uprobes.c  |2 +-
> >  arch/arm64/mm/flush.c   |   14 ++
> >  3 files changed, 12 insertions(+), 6 deletions(-)
> > 
> > > diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
> > index 19844211a4e6..18e92d9dacd4 100644
> > --- a/arch/arm64/include/asm/cacheflush.h
> > +++ b/arch/arm64/include/asm/cacheflush.h
> > >  extern void __clean_dcache_area_poc(void *addr, size_t len);
> >  extern void __clean_dcache_area_pop(void *addr, size_t len);
> >  extern void __clean_dcache_area_pou(void *addr, size_t len);
> > >  extern long __flush_cache_user_range(unsigned long start, unsigned long end);
> > -extern void sync_icache_aliases(void *kaddr, unsigned long len);
> > +extern void sync_icache_aliases(void *kaddr, unsigned long len, bool sync);
> 
> I'd much prefer just adding something like sync_user_icache_aliases(), which
> would avoid the IPI, since bool arguments tend to make the callsites
> unreadable imo.

I wonder whether we need the IPI for uprobes and ptrace. I would say we
can avoid it for ptrace since normally the debugged thread should be
stopped. I think it's different for uprobes but changing the text of a
live thread doesn't come with many guarantees anyway. So I'm tempted to
simply change sync_icache_aliases() to not issue an IPI at all, in which
case we wouldn't even need the bool argument; it's only used for user
addresses. So the new diff would be (the text is the same):

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 30695a868107..5c9073bace83 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -33,7 +33,11 @@ void sync_icache_aliases(void *kaddr, unsigned long len)
__clean_dcache_area_pou(kaddr, len);
__flush_icache_all();
} else {
-   flush_icache_range(addr, addr + len);
+   /*
+* Don't issue kick_all_cpus_sync() after I-cache invalidation
+* for user mappings.
+*/
+   __flush_icache_range(addr, addr + len);
}
 }
 


Re: [PATCH v9 09/26] arm64: Unmask PMR before going idle

2019-01-22 Thread Catalin Marinas
On Mon, Jan 21, 2019 at 03:33:28PM +, Julien Thierry wrote:
> CPU does not received signals for interrupts with a priority masked by
> ICC_PMR_EL1. This means the CPU might not come back from a WFI
> instruction.
> 
> Make sure ICC_PMR_EL1 does not mask interrupts when doing a WFI.
> 
> Since the logic of cpu_do_idle is becoming a bit more complex than just
> two instructions, let's turn it from ASM to C.
> 
> Signed-off-by: Julien Thierry 
> Suggested-by: Daniel Thompson 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
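
For reference, the shape of the C version under review is roughly the
following (a simplified sketch, not the exact hunk; GIC_PRIO_IRQON and
SYS_ICC_PMR_EL1 are the symbols used elsewhere in the series):

    static void __cpu_do_idle(void)
    {
        dsb(sy);
        wfi();
    }

    static void __cpu_do_idle_irqprio(void)
    {
        unsigned long pmr;
        unsigned long daif_bits;

        /* mask IRQs at the PSTATE level before opening up PMR */
        daif_bits = read_sysreg(daif);
        write_sysreg(daif_bits | PSR_I_BIT, daif);

        /* unmask PMR, otherwise the WFI may never be woken up */
        pmr = read_sysreg_s(SYS_ICC_PMR_EL1);
        write_sysreg_s(GIC_PRIO_IRQON, SYS_ICC_PMR_EL1);

        __cpu_do_idle();

        write_sysreg_s(pmr, SYS_ICC_PMR_EL1);
        write_sysreg(daif_bits, daif);
    }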

Reviewed-by: Catalin Marinas 


Re: [PATCH v9 12/26] arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking

2019-01-22 Thread Catalin Marinas
On Mon, Jan 21, 2019 at 03:33:31PM +, Julien Thierry wrote:
> diff --git a/arch/arm64/include/asm/irqflags.h b/arch/arm64/include/asm/irqflags.h
> index 24692ed..7e82a92 100644
> --- a/arch/arm64/include/asm/irqflags.h
> +++ b/arch/arm64/include/asm/irqflags.h
> @@ -18,7 +18,9 @@
>  
>  #ifdef __KERNEL__
>  
> +#include <asm/alternative.h>
>  #include <asm/ptrace.h>
> +#include <asm/sysreg.h>
>  
>  /*
>   * Aarch64 has flags for masking: Debug, Asynchronous (serror), Interrupts 
> and
> @@ -36,33 +38,31 @@
>  /*
>   * CPU interrupt mask handling.
>   */
> -static inline unsigned long arch_local_irq_save(void)
> -{
> - unsigned long flags;
> - asm volatile(
> - "mrs%0, daif// arch_local_irq_save\n"
> - "msrdaifset, #2"
> - : "=r" (flags)
> - :
> - : "memory");
> - return flags;
> -}
> -
>  static inline void arch_local_irq_enable(void)
>  {
> - asm volatile(
> - "msrdaifclr, #2 // arch_local_irq_enable"
> - :
> + unsigned long unmasked = GIC_PRIO_IRQON;
> +
> + asm volatile(ALTERNATIVE(
> + "msrdaifclr, #2 // arch_local_irq_enable\n"
> + "nop",
> + "msr_s  " __stringify(SYS_ICC_PMR_EL1) ",%0\n"
> + "dsbsy",
> + ARM64_HAS_IRQ_PRIO_MASKING)
>   :
> + : "r" (unmasked)
>   : "memory");
>  }
>  
>  static inline void arch_local_irq_disable(void)
>  {
> - asm volatile(
> - "msrdaifset, #2 // arch_local_irq_disable"
> - :
> + unsigned long masked = GIC_PRIO_IRQOFF;
> +
> + asm volatile(ALTERNATIVE(
> + "msrdaifset, #2 // arch_local_irq_disable",
> + "msr_s  " __stringify(SYS_ICC_PMR_EL1) ", %0",
> + ARM64_HAS_IRQ_PRIO_MASKING)
>   :
> + : "r" (masked)
>   : "memory");
>  }

Nitpicks: you could drop masked/unmasked variables here (it's up to you,
it wouldn't make any difference on the generated asm).

> @@ -71,12 +71,44 @@ static inline void arch_local_irq_disable(void)
>   */
>  static inline unsigned long arch_local_save_flags(void)
>  {
> + unsigned long daif_bits;
>   unsigned long flags;
> - asm volatile(
> -	"mrs	%0, daif	// arch_local_save_flags"
> - : "=r" (flags)
> - :
> +
> + daif_bits = read_sysreg(daif);
> +
> + /*
> +  * The asm is logically equivalent to:
> +  *
> +  * if (system_uses_irq_prio_masking())
> +  *  flags = (daif_bits & PSR_I_BIT) ?
> +  *  GIC_PRIO_IRQOFF :
> +  *  read_sysreg_s(SYS_ICC_PMR_EL1);
> +  * else
> +  *  flags = daif_bits;
> +  */
> + asm volatile(ALTERNATIVE(
> +	"mov	%0, %1\n"
> + "nop\n"
> + "nop",
> +     "mrs_s  %0, " __stringify(SYS_ICC_PMR_EL1) "\n"
> + "ands   %1, %1, " __stringify(PSR_I_BIT) "\n"
> + "csel   %0, %0, %2, eq",
> + ARM64_HAS_IRQ_PRIO_MASKING)
> +	: "=&r" (flags), "+r" (daif_bits)
> + : "r" (GIC_PRIO_IRQOFF)
>   : "memory");
> +
> + return flags;
> +}

BTW, how's the code generated from the C version? It will have a branch
but may not be too bad. Either way is fine by me.
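
(I.e. presumably something along the lines of:

	static inline unsigned long arch_local_save_flags(void)
	{
		unsigned long daif_bits = read_sysreg(daif);

		if (system_uses_irq_prio_masking())
			return (daif_bits & PSR_I_BIT) ?
				GIC_PRIO_IRQOFF :
				read_sysreg_s(SYS_ICC_PMR_EL1);

		return daif_bits;
	}

just guessing from the comment in your patch, so take it as a sketch
only.)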

Reviewed-by: Catalin Marinas 


Re: [PATCH 2/2] arm64: mm: make use of new memblocks_present() helper

2019-01-21 Thread Catalin Marinas
On Thu, Jan 10, 2019 at 09:51:52AM -0700, Logan Gunthorpe wrote:
> On 2019-01-10 7:15 a.m., Will Deacon wrote:
> > On Wed, Jan 09, 2019 at 01:21:00PM -0700, Logan Gunthorpe wrote:
> >> Cleanup the arm64_memory_present() function seeing it's very
> >> similar to other arches.
> >>
> >> memblocks_present() is a direct replacement of arm64_memory_present()
> >>
> >> Signed-off-by: Logan Gunthorpe 
> >> Acked-by: Catalin Marinas 
> >> Cc: Will Deacon 
> >> ---
> >>  arch/arm64/mm/init.c | 20 +---
> >>  1 file changed, 1 insertion(+), 19 deletions(-)
> > 
> > Acked-by: Will Deacon 
> > 
> > Did you want us to take this via the arm64 tree?
> 
> Yes please.

Patch 2/2 queued for 5.1. Thanks.

-- 
Catalin


Re: kmemleak panic

2019-01-21 Thread Catalin Marinas
On Mon, Jan 21, 2019 at 07:35:11AM -0600, Rob Herring wrote:
> On Mon, Jan 21, 2019 at 6:19 AM Robin Murphy  wrote:
> >
> > On 21/01/2019 11:57, Marc Gonzalez wrote:
> > [...]
> > > # echo dump=0xffc021e0 > /sys/kernel/debug/kmemleak
> > > kmemleak: Object 0xffc021e0 (size 2097152):
> > > kmemleak:   comm "swapper/0", pid 0, jiffies 4294892296
> > > kmemleak:   min_count = 0
> > > kmemleak:   count = 0
> > > kmemleak:   flags = 0x1
> > > kmemleak:   checksum = 0
> > > kmemleak:   backtrace:
> > >   kmemleak_alloc_phys+0x48/0x60
> > >   memblock_alloc_range_nid+0x8c/0xa4
> > >   memblock_alloc_base_nid+0x4c/0x60
> > >   __memblock_alloc_base+0x3c/0x4c
> > >   early_init_dt_alloc_reserved_memory_arch+0x54/0xa4
> > >   fdt_init_reserved_mem+0x308/0x3ec
> > >   early_init_fdt_scan_reserved_mem+0x88/0xb0
> > >   arm64_memblock_init+0x1dc/0x254
> > >   setup_arch+0x1c8/0x4ec
> > >   start_kernel+0x84/0x44c
> > >   0x
> >
> > OK, so via the __va(phys) call in kmemleak_alloc_phys(), you end up with
> > the linear map address of a no-map reservation, which unsurprisingly
> > turns out not to be mapped. Is there a way to tell kmemleak that it
> > can't scan within a particular object?
> 
> There was this patch posted[1]. I never got a reply, so it hasn't been 
> applied.
> 
> https://patchwork.ozlabs.org/patch/995367/

Thanks Rob, I wasn't aware of this patch (or I just missed it at the
time).

I wonder whether kmemleak should simply remove ranges passed to
memblock_remove(), or at least mark them as no-scan.
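
Something like the below maybe (completely untested, just to show the
idea; it assumes kmemleak_free_part() on the linear map alias is the
right way to tell kmemleak to forget the range):

	/* mm/memblock.c - untested sketch */
	int __init_memblock memblock_remove(phys_addr_t base, phys_addr_t size)
	{
		/*
		 * The range is going away from memblock's point of view,
		 * stop kmemleak scanning/reporting it via the linear alias.
		 */
		kmemleak_free_part(__va(base), size);

		return memblock_remove_range(&memblock.memory, base, size);
	}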

-- 
Catalin


Re: [Qestion] Softlockup when send IPI to other CPUs

2019-01-21 Thread Catalin Marinas
On Sat, Jan 19, 2019 at 11:58:27PM +, Will Deacon wrote:
> On Thu, Jan 17, 2019 at 07:42:44AM +, chenwandun wrote:
> > Recently, I do some tests on linux-4.19 and hit a softlockup issue.
> > 
> > I find some CPUs get the spinlock in the __split_huge_pmd function and
> > then send IPI to other CPUs, waiting the response, while several CPUs
> > enter the __split_huge_pmd function, want to get the spinlock, but always
> > in queued_spin_lock_slowpath,
> > 
> > Because long time no response to the IPI, that results in a softlockup.
> > 
> > As to sending IPI, it was in the patch
> > 3b8c9f1cdfc506e94e992ae66b68bbe416f89610.  The patch is mean to send IPI
> > to each CPU after invalidating the I-cache for kernel mappings.  In this
> > case, after modify pmd, it sends IPI to other CPUS to sync memory
> > mappings.
> > 
> > No stable test case to repeat the result, it is hard to repeat the test 
> > procedure.
> > 
> > The environment is arm64, 64 CPUs. Except for idle CPU, there are 6 kind
> > of callstacks in total.
> 
> This looks like another lockup that would be solved if we deferred our
> I-cache invalidation when mapping user-executable pages, and instead
> performed the invalidation off the back of a UXN permission fault, where we
> could avoid holding any locks.

Looking back at commit 3b8c9f1cdfc5 ("arm64: IPI each CPU after
invalidating the I-cache for kernel mappings"), the text implies that it
should only do this for kernel mappings. I don't think we need this for
user mappings. We have a few scenarios where we invoke set_pte_at() with
exec permission:

1. Page faulted in - the pte was not previously accessible and the CPU
   should not have stale decoded instructions (my interpretation of the
   ARM ARM).

2. huge pmd splitting - there is no change to the underlying data. I
   have a suspicion here that we shouldn't even call
   sync_icache_aliases() but probably the PG_arch_1 isn't carried over
   from the head compound page to the small pages (needs checking).

3. page migration - there is no change to the underlying data and
   instructions, so I don't think we need the all CPUs sync.

4. mprotect() setting exec permission - that's normally the
   responsibility of the user to ensure cache maintenance

(I can add more text to the patch below but need to get to the bottom of
this first)

-8<-
arm64: Do not issue IPIs for user executable ptes

From: Catalin Marinas 

Commit 3b8c9f1cdfc5 ("arm64: IPI each CPU after invalidating the I-cache
for kernel mappings") was aimed at fixing the I-cache invalidation for
kernel mappings. However, it inadvertently caused all cache maintenance
for user mappings via set_pte_at() -> __sync_icache_dcache() to call
kick_all_cpus_sync().

Fixes: 3b8c9f1cdfc5 ("arm64: IPI each CPU after invalidating the I-cache for 
kernel mappings")
Cc:  # 4.19.x-
Signed-off-by: Catalin Marinas 
---
 arch/arm64/include/asm/cacheflush.h |2 +-
 arch/arm64/kernel/probes/uprobes.c  |2 +-
 arch/arm64/mm/flush.c   |   14 ++
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h 
b/arch/arm64/include/asm/cacheflush.h
index 19844211a4e6..18e92d9dacd4 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -80,7 +80,7 @@ extern void __clean_dcache_area_poc(void *addr, size_t len);
 extern void __clean_dcache_area_pop(void *addr, size_t len);
 extern void __clean_dcache_area_pou(void *addr, size_t len);
 extern long __flush_cache_user_range(unsigned long start, unsigned long end);
-extern void sync_icache_aliases(void *kaddr, unsigned long len);
+extern void sync_icache_aliases(void *kaddr, unsigned long len, bool sync);
 
 static inline void flush_icache_range(unsigned long start, unsigned long end)
 {
diff --git a/arch/arm64/kernel/probes/uprobes.c 
b/arch/arm64/kernel/probes/uprobes.c
index 636ca0119c0e..595e8c8f41cd 100644
--- a/arch/arm64/kernel/probes/uprobes.c
+++ b/arch/arm64/kernel/probes/uprobes.c
@@ -24,7 +24,7 @@ void arch_uprobe_copy_ixol(struct page *page, unsigned long 
vaddr,
memcpy(dst, src, len);
 
/* flush caches (dcache/icache) */
-   sync_icache_aliases(dst, len);
+   sync_icache_aliases(dst, len, true);
 
kunmap_atomic(xol_page_kaddr);
 }
diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 30695a868107..5c2f23a92d14 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -25,15 +25,17 @@
 #include 
 #include 
 
-void sync_icache_aliases(void *kaddr, unsigned long len)
+void sync_icache_aliases(void *kaddr, unsigned long len, bool sync)
 {
unsigned long addr = (unsigned long)kaddr;
 
if (icache_is_aliasing()) {
__clean_dcach

Re: kmemleak panic

2019-01-19 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 04:36:59PM +0100, Marc Gonzalez wrote:
> mount -t debugfs nodev /sys/kernel/debug/ 
> echo scan > /sys/kernel/debug/kmemleak
> 
> Unable to handle kernel paging request at virtual address ffc021e0
[...]
> Call trace:
>  scan_block+0x70/0x190
>  scan_gray_list+0x108/0x1c0
>  kmemleak_scan+0x33c/0x7c0
>  kmemleak_write+0x410/0x4b0

As per Robin's remark, this address seems to be pretty easy to
reproduce. It also happens via scan_gray_list() which indicates an
object kmemleak was informed about via kmemleak_alloc() (so this
excludes the pfn that Qian noticed).

Can you configure the kernel with CONFIG_DEBUG_KMEMLEAK_AUTO_SCAN off
just to avoid the bug being triggered early and run:

  mount -t debugfs nodev /sys/kernel/debug/
  echo dump=0xffc021e0 > /sys/kernel/debug/kmemleak

Then run another scan to make sure this is the address that triggered
the page fault:

  echo scan > /sys/kernel/debug/kmemleak

The above should tell us where the object that kmemleak is trying to
scan came from.

Of course, ideally we should bisect this but I haven't been able to
reproduce it.

-- 
Catalin


Re: [PATCH v8 12/26] arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking

2019-01-18 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 05:30:02PM +, Catalin Marinas wrote:
> On Fri, Jan 18, 2019 at 04:57:32PM +, Julien Thierry wrote:
> > On 18/01/2019 16:09, Catalin Marinas wrote:
> > > On Tue, Jan 08, 2019 at 02:07:30PM +, Julien Thierry wrote:
> > >> +asm volatile(ALTERNATIVE(
> > >> +"nop",
> > >> +"mrs_s  %0, " __stringify(SYS_ICC_PMR_EL1),
> > >> +ARM64_HAS_IRQ_PRIO_MASKING)
> > >> +	: "=&r" (pmr)
> > >>  :
> > >>  : "memory");
> > >> +
> > >> +return _get_irqflags(daif_bits, pmr);
> > >> +}
> > > 
> > > I find this confusing spread over two inline asm statements. IIUC, you
> > > want something like below (it could be written as inline asm but I need
> > > to understand it first):
> > > 
> > >   daif_bits = read_sysreg(daif);
> > > 
> > >   if (system_uses_irq_prio_masking()) {
> > >   pmr = read_gicreg(ICC_PMR_EL1);
> > >   flags = pmr & ~(daif_bits & PSR_I_BIT);
> > >   } else {
> > >   flags = daif_bits;
> > >   }
> > > 
> > >   return flags;
> > > 
> > > In the case where the interrupts are disabled at the PSR level, is the
> > > PMR value still relevant? Could we just return the GIC_PRIO_IRQOFF?
> > > Something like:
> > > 
> > >   flags = read_sysreg(daif);
> > > 
> > >   if (system_uses_irq_prio_masking())
> > >   flags = flags & PSR_I_BIT ?
> > >   GIC_PRIO_IRQOFF : read_gicreg(ICC_PMR_EL1);
> > > 
> > 
> > You're right, returning GIC_PRIO_IRQOFF should be good enough (it is
> > actually what happens in this version because GIC_PRIO_IRQOFF ==
> > GIC_PRIO_IRQON & ~PSR_I_BIT happens to be true).
> 
> This wasn't entirely clear to me, I got confused by:
> 
> +   BUILD_BUG_ON(GIC_PRIO_IRQOFF < (GIC_PRIO_IRQON & ~PSR_I_BIT));  \
> 
> and I thought there isn't necessarily an equality between the two.
> 
> > Your suggestion would
> > make things easier to reason about. Maybe something like:
> > 
> > 
> > static inline unsigned long arch_local_save_flags(void)
> > {
> > unsigned long daif_bits;
> > unsigned long prio_off = GIC_PRIO_IRQOFF;
> > 
> > daif_bits = read_sysreg(daif);
> > 
> > asm volatile(ALTERNATIVE(
> > 	"mov	%0, %1\n"
> > "nop\n"
> > "nop",
> > 	"mrs	%0, SYS_ICC_PMR_EL1\n"
> > "ands   %1, %1, PSR_I_BIT\n"
> > "csel   %0, %0, %2, eq")
> > 	: "=&r" (flags)
> > : "r" (daif_bits), "r" (prio_off)
> > : "memory");
> > 
> > return flags;
> > }
> 
> It looks fine. If you turn the BUILD_BUG_ON into a !=, you could
> probably simplify the asm a bit (though the number of instructions
> generated would probably be the same). Untested:
> 
> static inline unsigned long arch_local_save_flags(void)
> {
>   unsigned long flags;
> 
>   flags = read_sysreg(daif);
> 
>   asm volatile(ALTERNATIVE(
>   "nop",
> 	"bic	%0, %1, %2")
> 	: "=&r" (flags)
>   : "r" (flags & PSR_I_BIT), "r" (GIC_PRIO_IRQOFF)
>   : "memory");

Ah, I missed a read from SYS_ICC_PMR_EL1 here. Anyway, the idea was that
you don't need to set prio_off to a variable, just pass "r" (constant)
here and the compiler does the trick.

-- 
Catalin


Re: [PATCH v8 12/26] arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking

2019-01-18 Thread Catalin Marinas
On Fri, Jan 18, 2019 at 04:57:32PM +, Julien Thierry wrote:
> On 18/01/2019 16:09, Catalin Marinas wrote:
> > On Tue, Jan 08, 2019 at 02:07:30PM +, Julien Thierry wrote:
> >> +  asm volatile(ALTERNATIVE(
> >> +  "nop",
> >> +  "mrs_s  %0, " __stringify(SYS_ICC_PMR_EL1),
> >> +  ARM64_HAS_IRQ_PRIO_MASKING)
> >> +	: "=&r" (pmr)
> >>:
> >>: "memory");
> >> +
> >> +  return _get_irqflags(daif_bits, pmr);
> >> +}
> > 
> > I find this confusing spread over two inline asm statements. IIUC, you
> > want something like below (it could be written as inline asm but I need
> > to understand it first):
> > 
> > daif_bits = read_sysreg(daif);
> > 
> > if (system_uses_irq_prio_masking()) {
> > pmr = read_gicreg(ICC_PMR_EL1);
> > flags = pmr & ~(daif_bits & PSR_I_BIT);
> > } else {
> > flags = daif_bits;
> > }
> > 
> > return flags;
> > 
> > In the case where the interrupts are disabled at the PSR level, is the
> > PMR value still relevant? Could we just return the GIC_PRIO_IRQOFF?
> > Something like:
> > 
> > flags = read_sysreg(daif);
> > 
> > if (system_uses_irq_prio_masking())
> > flags = flags & PSR_I_BIT ?
> > GIC_PRIO_IRQOFF : read_gicreg(ICC_PMR_EL1);
> > 
> 
> You're right, returning GIC_PRIO_IRQOFF should be good enough (it is
> actually what happens in this version because GIC_PRIO_IRQOFF ==
> GIC_PRIO_IRQON & ~PSR_I_BIT happens to be true).

This wasn't entirely clear to me, I got confused by:

+   BUILD_BUG_ON(GIC_PRIO_IRQOFF < (GIC_PRIO_IRQON & ~PSR_I_BIT));  \

and I thought there isn't necessarily an equality between the two.

> Your suggestion would
> make things easier to reason about. Maybe something like:
> 
> 
> static inline unsigned long arch_local_save_flags(void)
> {
>   unsigned long daif_bits;
>   unsigned long prio_off = GIC_PRIO_IRQOFF;
> 
>   daif_bits = read_sysreg(daif);
> 
>   asm volatile(ALTERNATIVE(
> 	"mov	%0, %1\n"
>   "nop\n"
>   "nop",
> 	"mrs	%0, SYS_ICC_PMR_EL1\n"
>   "ands   %1, %1, PSR_I_BIT\n"
>   "csel   %0, %0, %2, eq")
> 	: "=&r" (flags)
>   : "r" (daif_bits), "r" (prio_off)
>   : "memory");
> 
>   return flags;
> }

It looks fine. If you turn the BUILD_BUG_ON into a !=, you could
probably simplify the asm a bit (though the number of instructions
generated would probably be the same). Untested:

static inline unsigned long arch_local_save_flags(void)
{
	unsigned long flags;

	flags = read_sysreg(daif);

	asm volatile(ALTERNATIVE(
		"nop",
		"bic	%0, %1, %2")
		: "=&r" (flags)
		: "r" (flags & PSR_I_BIT), "r" (GIC_PRIO_IRQOFF)
		: "memory");

	return flags;
}
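
(and the corresponding build-time check would presumably become:

	BUILD_BUG_ON(GIC_PRIO_IRQOFF != (GIC_PRIO_IRQON & ~PSR_I_BIT));

since the asm above relies on the two values being equal.)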

-- 
Catalin


Re: [PATCH v8 13/26] arm64: daifflags: Include PMR in daifflags restore operations

2019-01-18 Thread Catalin Marinas
On Tue, Jan 08, 2019 at 02:07:31PM +, Julien Thierry wrote:
> The addition of PMR should not bypass the semantics of daifflags.
> 
> When DA_F are set, I bit is also set as no interrupts (even of higher
> priority) is allowed.
> 
> When DA_F are cleared, I bit is cleared and interrupt enabling/disabling
> goes through ICC_PMR_EL1.
> 
> Signed-off-by: Julien Thierry 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: James Morse 

Reviewed-by: Catalin Marinas 


Re: [PATCH v8 11/26] efi: Let architectures decide the flags that should be saved/restored

2019-01-18 Thread Catalin Marinas
On Tue, Jan 08, 2019 at 02:07:29PM +, Julien Thierry wrote:
> Currently, irqflags are saved before calling runtime services and
> checked for mismatch on return.
> 
> Provide a pair of overridable macros to save and restore (if needed) the
> state that need to be preserved on return from a runtime service.
> This allows to check for flags that are not necesarly related to
> irqflags.
> 
> Signed-off-by: Julien Thierry 

Acked-by: Catalin Marinas 


Re: [PATCH v8 10/26] arm64: kvm: Unmask PMR before entering guest

2019-01-18 Thread Catalin Marinas
On Tue, Jan 08, 2019 at 02:07:28PM +, Julien Thierry wrote:
> Interrupts masked by ICC_PMR_EL1 will not be signaled to the CPU. This
> means that hypervisor will not receive masked interrupts while running a
> guest.
> 
> Avoid this by making sure ICC_PMR_EL1 is unmasked when we enter a guest.
> 
> Signed-off-by: Julien Thierry 
> Cc: Christoffer Dall 
> Cc: Marc Zyngier 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: kvm...@lists.cs.columbia.edu

Acked-by: Catalin Marinas 


Re: [PATCH v8 09/26] arm64: Unmask PMR before going idle

2019-01-18 Thread Catalin Marinas
On Tue, Jan 08, 2019 at 02:07:27PM +, Julien Thierry wrote:
> CPU does not received signals for interrupts with a priority masked by
> ICC_PMR_EL1. This means the CPU might not come back from a WFI
> instruction.
> 
> Make sure ICC_PMR_EL1 does not mask interrupts when doing a WFI.
> 
> Since the logic of cpu_do_idle is becoming a bit more complex than just
> two instructions, lets turn it from ASM to C.

I haven't checked all the call paths, so asking here: does the core code
normally call arch_cpu_idle() with IRQs off?

-- 
Catalin


Re: [PATCH v8 08/26] arm64: Make PMR part of task context

2019-01-18 Thread Catalin Marinas
On Tue, Jan 08, 2019 at 02:07:26PM +, Julien Thierry wrote:
> In order to replace PSR.I interrupt disabling/enabling with ICC_PMR_EL1
> interrupt masking, ICC_PMR_EL1 needs to be saved/restored when
> taking/returning from an exception. This mimics the way hardware saves
> and restores PSR.I bit in spsr_el1 for exceptions and ERET.
> 
> Add PMR to the registers to save in the pt_regs struct upon kernel entry,
> and restore it before ERET. Also, initialize it to a sane value when
> creating new tasks.
> 
> Signed-off-by: Julien Thierry 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Oleg Nesterov 
> Cc: Dave Martin 

Reviewed-by: Catalin Marinas 


Re: [PATCH v8 12/26] arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking

2019-01-18 Thread Catalin Marinas
Hi Julien,

On Tue, Jan 08, 2019 at 02:07:30PM +, Julien Thierry wrote:
> + * Having two ways to control interrupt status is a bit complicated. Some
> + * locations like exception entries will have PSR.I bit set by the 
> architecture
> + * while PMR is unmasked.
> + * We need the irqflags to represent that interrupts are disabled in such 
> cases.
> + *
> + * For this, we lower the value read from PMR when the I bit is set so it is
> + * considered as an irq masking priority. (With PMR, lower value means 
> masking
> + * more interrupts).
> + */
> +#define _get_irqflags(daif_bits, pmr)
> \
> +({   \
> + unsigned long flags;\
> + \
> + BUILD_BUG_ON(GIC_PRIO_IRQOFF < (GIC_PRIO_IRQON & ~PSR_I_BIT));  \
> + asm volatile(ALTERNATIVE(   \
> +	"mov	%0, %1\n"	\
> + "nop\n" \
> + "nop",  \
> +	"and	%0, %1, #" __stringify(PSR_I_BIT) "\n"	\
> +	"mvn	%0, %0\n"	\
> +	"and	%0, %0, %2",	\
> + ARM64_HAS_IRQ_PRIO_MASKING) \

Can you write the last two instructions as a single:

bic %0, %2, %0

> +	: "=&r" (flags)	\
> + : "r" (daif_bits), "r" (pmr)\
> + : "memory");\
> + \
> + flags;  \
> +})
> +
> +/*
>   * Save the current interrupt enable state.
>   */
>  static inline unsigned long arch_local_save_flags(void)
>  {
> - unsigned long flags;
> - asm volatile(
> -	"mrs	%0, daif	// arch_local_save_flags"
> - : "=r" (flags)
> + unsigned long daif_bits;
> + unsigned long pmr; // Only used if alternative is on
> +
> + daif_bits = read_sysreg(daif);
> +
> + // Get PMR

Nitpick: don't use C++ (or arm asm) comment style in C code.

> + asm volatile(ALTERNATIVE(
> + "nop",
> + "mrs_s  %0, " __stringify(SYS_ICC_PMR_EL1),
> + ARM64_HAS_IRQ_PRIO_MASKING)
> +	: "=&r" (pmr)
>   :
>   : "memory");
> +
> + return _get_irqflags(daif_bits, pmr);
> +}

I find this confusing spread over two inline asm statements. IIUC, you
want something like below (it could be written as inline asm but I need
to understand it first):

daif_bits = read_sysreg(daif);

if (system_uses_irq_prio_masking()) {
pmr = read_gicreg(ICC_PMR_EL1);
flags = pmr & ~(daif_bits & PSR_I_BIT);
} else {
flags = daif_bits;
}

return flags;

In the case where the interrupts are disabled at the PSR level, is the
PMR value still relevant? Could we just return the GIC_PRIO_IRQOFF?
Something like:

flags = read_sysreg(daif);

if (system_uses_irq_prio_masking())
flags = flags & PSR_I_BIT ?
GIC_PRIO_IRQOFF : read_gicreg(ICC_PMR_EL1);

-- 
Catalin


Re: [PATCH v8 07/26] arm64: ptrace: Provide definitions for PMR values

2019-01-14 Thread Catalin Marinas
On Tue, Jan 08, 2019 at 02:07:25PM +, Julien Thierry wrote:
> Introduce fixed values for PMR that are going to be used to mask and
> unmask interrupts by priority.
> 
> The current priority given to GIC interrupts is 0xa0, so clearing PMR's
> most significant bit is enough to mask interrupts.
> 
> Signed-off-by: Julien Thierry 
> Suggested-by: Daniel Thompson 
> Cc: Oleg Nesterov 
> Cc: Catalin Marinas 
> Cc: Will Deacon 

Acked-by: Catalin Marinas 


Re: [PATCH v8 01/26] arm64: Fix HCR.TGE status for NMI contexts

2019-01-14 Thread Catalin Marinas
On Tue, Jan 08, 2019 at 02:07:19PM +, Julien Thierry wrote:
> When using VHE, the host needs to clear HCR_EL2.TGE bit in order
> to interract with guest TLBs, switching from EL2&0 translation regime
> to EL1&0.
> 
> However, some non-maskable asynchronous event could happen while TGE is
> cleared like SDEI. Because of this address translation operations
> relying on EL2&0 translation regime could fail (tlb invalidation,
> userspace access, ...).

Why would an NMI context need to access user space? (just curious what
breaks exactly without this patch; otherwise it looks fine)

-- 
Catalin


Re: [PATCH] net/core/neighbour: tell kmemleak about hash tables

2019-01-11 Thread Catalin Marinas
On Fri, Jan 11, 2019 at 07:26:09AM +0300, Konstantin Khlebnikov wrote:
> On Thu, Jan 10, 2019 at 11:45 PM Cong Wang  wrote:
> > On Tue, Jan 8, 2019 at 1:30 AM Konstantin Khlebnikov
> >  wrote:
> > > @@ -443,12 +444,14 @@ static struct neigh_hash_table 
> > > *neigh_hash_alloc(unsigned int shift)
> > > ret = kmalloc(sizeof(*ret), GFP_ATOMIC);
> > > if (!ret)
> > > return NULL;
> > > -   if (size <= PAGE_SIZE)
> > > +   if (size <= PAGE_SIZE) {
> > > buckets = kzalloc(size, GFP_ATOMIC);
> > > -   else
> > > +   } else {
> > > buckets = (struct neighbour __rcu **)
> > >   __get_free_pages(GFP_ATOMIC | __GFP_ZERO,
> > >get_order(size));
> > > +   kmemleak_alloc(buckets, size, 0, GFP_ATOMIC);
> >
> > Why min_count is 0 rather than 1 here?
> 
> The api isn't clear and I've misread description.
> So it should be 1 for reporting leak of hash table itself.
> But 0 doesn't add any new issues.

Correct. I'd say it makes sense to set it to 1 so that you'd have
similar behaviour to the kzalloc() allocation.
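
I.e. something like:

	kmemleak_alloc(buckets, size, 1, GFP_ATOMIC);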


-- 
Catalin


Re: [PATCH] selinux: avc: mark avc node as not a leak

2019-01-09 Thread Catalin Marinas
Hi Prateek,

On Wed, Jan 09, 2019 at 02:09:22PM +0530, Prateek Patel wrote:
> From: Sri Krishna chowdary 
> 
> kmemleak detects allocated objects as leaks if not accessed for
> default scan time. The memory allocated using avc_alloc_node
> is freed using rcu mechanism when nodes are reclaimed or on
> avc_flush. So, there is no real leak here and kmemleak_scan
> detects it as a leak which is false positive. Hence, mark it as
> kmemleak_not_leak.

In theory, kmemleak should detect the node->rhead in the lists used by
call_rcu() and not report it as a leak. Which RCU options do you have
enabled (just to check whether kmemleak tracks the RCU internal lists)?

Also, does this leak eventually disappear without your patch? Does

  echo dump=0xffc0dd1a0e60 > /sys/kernel/debug/kmemleak

still display this object?

Thanks.

-- 
Catalin


Re: [PATCH v2] kmemleak: survive in a low-memory situation

2019-01-07 Thread Catalin Marinas
On Thu, Jan 03, 2019 at 06:07:35PM +0100, Michal Hocko wrote:
> > > On Wed 02-01-19 13:06:19, Qian Cai wrote:
> > > [...]
> > >> diff --git a/mm/kmemleak.c b/mm/kmemleak.c
> > >> index f9d9dc250428..9e1aa3b7df75 100644
> > >> --- a/mm/kmemleak.c
> > >> +++ b/mm/kmemleak.c
> > >> @@ -576,6 +576,16 @@ static struct kmemleak_object 
> > >> *create_object(unsigned long ptr, size_t size,
> > >>  struct rb_node **link, *rb_parent;
> > >>  
> > >>  object = kmem_cache_alloc(object_cache, gfp_kmemleak_mask(gfp));
> > >> +#ifdef CONFIG_PREEMPT_COUNT
> > >> +if (!object) {
> > >> +/* last-ditch effort in a low-memory situation */
> > >> +if (irqs_disabled() || is_idle_task(current) || 
> > >> in_atomic())
> > >> +gfp = GFP_ATOMIC;
> > >> +else
> > >> +gfp = gfp_kmemleak_mask(gfp) | 
> > >> __GFP_DIRECT_RECLAIM;
> > >> +object = kmem_cache_alloc(object_cache, gfp);
> > >> +}
> > >> +#endif
[...]
> I will not object to this workaround but I strongly believe that
> kmemleak should rethink the metadata allocation strategy to be really
> robust.

This would be nice indeed and it was discussed last year. I just haven't
got around to trying anything yet:

https://marc.info/?l=linux-mm=152812489819532

-- 
Catalin


Re: [PATCH] mm: kmemleak: Turn kmemleak_lock to spin lock and RCU primitives

2019-01-07 Thread Catalin Marinas
On Mon, Jan 07, 2019 at 03:31:18PM +0800, He Zhe wrote:
> On 1/5/19 2:37 AM, Catalin Marinas wrote:
> > On Fri, Jan 04, 2019 at 10:29:13PM +0800, zhe...@windriver.com wrote:
> >> It's not necessary to keep consistency between readers and writers of
> >> kmemleak_lock. RCU is more proper for this case. And in order to gain 
> >> better
> >> performance, we turn the reader locks to RCU read locks and writer locks to
> >> normal spin locks.
> > This won't work.
> >
> >> @@ -515,9 +515,7 @@ static struct kmemleak_object 
> >> *find_and_get_object(unsigned long ptr, int alias)
> >>struct kmemleak_object *object;
> >>  
> >>rcu_read_lock();
> >> -	read_lock_irqsave(&kmemleak_lock, flags);
> >>object = lookup_object(ptr, alias);
> >> -	read_unlock_irqrestore(&kmemleak_lock, flags);
> > The comment on lookup_object() states that the kmemleak_lock must be
> > held. That's because we don't have an RCU-like mechanism for removing
> > objects from the object_tree_root:
> >
> >>  
> >>/* check whether the object is still available */
> >>if (object && !get_object(object))
> >> @@ -537,13 +535,13 @@ static struct kmemleak_object 
> >> *find_and_remove_object(unsigned long ptr, int ali
> >>unsigned long flags;
> >>struct kmemleak_object *object;
> >>  
> >> -	write_lock_irqsave(&kmemleak_lock, flags);
> >> +	spin_lock_irqsave(&kmemleak_lock, flags);
> >>object = lookup_object(ptr, alias);
> >>if (object) {
> >>		rb_erase(&object->rb_node, &object_tree_root);
> >>		list_del_rcu(&object->object_list);
> >>}
> >> -	write_unlock_irqrestore(&kmemleak_lock, flags);
> >> +	spin_unlock_irqrestore(&kmemleak_lock, flags);
> > So here, while list removal is RCU-safe, rb_erase() is not.
> >
> > If you have time to implement an rb_erase_rcu(), then we could reduce
> > the locking in kmemleak.
> 
> Thanks, I really neglected that rb_erase is not RCU-safe here.
> 
> I'm not sure if it is practically possible to implement rb_erase_rcu. Here
> is my concern:
> In the code paths starting from rb_erase, the tree is tweaked at many
> places, in both __rb_erase_augmented and rb_erase_color. To my
> understanding, there are many intermediate versions of the tree
> during the erasion. In some of the versions, the tree is incomplete, i.e.
> some nodes(not the one to be deleted) are invisible to readers. I'm not
> sure if this is acceptable as an RCU implementation. Does it mean we
> need to form a rb_erase_rcu from scratch?

If it's possible, I think it would help. I had a quick look as well but
as it seemed non-trivial, I moved on to something else.

> And are there any other concerns about this attempt?

No concerns if it's possible at all. In the meantime, you could try to
replace the rw_lock with a classic spinlock. There was a thread recently
and I concluded that the rw_lock is no longer necessary as we don't have
multiple readers contention.
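
E.g. something like this for find_and_get_object() (untested sketch,
the other kmemleak_lock users would change in the same way):

	static DEFINE_SPINLOCK(kmemleak_lock);

	static struct kmemleak_object *find_and_get_object(unsigned long ptr,
							   int alias)
	{
		unsigned long flags;
		struct kmemleak_object *object;

		rcu_read_lock();
		spin_lock_irqsave(&kmemleak_lock, flags);
		object = lookup_object(ptr, alias);
		spin_unlock_irqrestore(&kmemleak_lock, flags);

		/* check whether the object is still available */
		if (object && !get_object(object))
			object = NULL;
		rcu_read_unlock();

		return object;
	}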

Yet another improvement could be to drop the kmemleak_object.lock
entirely and just rely on the main kmemleak_lock. I don't think the
fine-grained locking saves us much as in most cases where it acquires
the object->lock it already holds (or may have acquired/released) the
kmemleak_lock.

Note that even if we have an rb_erase_rcu(), we'd still need to acquire
the object->lock to prevent the scanned block being de-allocated
(unmapped in the case of vmalloc()). So if we manage with a single
kmemleak_lock (spin_lock_t), it may give a similar performance boost to
what you've got without kmemleak_lock.

FTR, the original aim of RCU grace period in kmemleak was to avoid a
recursive call into the slab freeing code; it later came in handy for
some list traversal.

-- 
Catalin


Re: kmemleak: Cannot allocate a kmemleak_object structure - Kernel 4.19.13

2019-01-07 Thread Catalin Marinas
Hi Nathan,

On Tue, Jan 01, 2019 at 01:17:06PM -0600, Nathan Royce wrote:
> I had a leak somewhere and I was directed to look into SUnreclaim
> which was 5.5 GB after an uptime of a little over 1 month on an 8 GB
> system. kmalloc-2048 was a problem.
> I just had enough and needed to find out the cause for my lagging system.
> 
> I finally upgraded from 4.18.16 to 4.19.13 and enabled kmemleak to
> hunt for the culprit. I don't think a day had elapsed before kmemleak
> crashed and disabled itself.

Under memory pressure, kmemleak may fail to allocate memory. See this
patch for an attempt to slightly improve things but it's not a proper
solution:

http://lkml.kernel.org/r/20190102180619.12392-1-...@lca.pw

-- 
Catalin


Re: [PATCH] mm: kmemleak: Turn kmemleak_lock to spin lock and RCU primitives

2019-01-04 Thread Catalin Marinas
On Fri, Jan 04, 2019 at 10:29:13PM +0800, zhe...@windriver.com wrote:
> It's not necessary to keep consistency between readers and writers of
> kmemleak_lock. RCU is more proper for this case. And in order to gain better
> performance, we turn the reader locks to RCU read locks and writer locks to
> normal spin locks.

This won't work.

> @@ -515,9 +515,7 @@ static struct kmemleak_object 
> *find_and_get_object(unsigned long ptr, int alias)
>   struct kmemleak_object *object;
>  
>   rcu_read_lock();
> - read_lock_irqsave(_lock, flags);
>   object = lookup_object(ptr, alias);
> - read_unlock_irqrestore(_lock, flags);

The comment on lookup_object() states that the kmemleak_lock must be
held. That's because we don't have an RCU-like mechanism for removing
objects from the object_tree_root:

>  
>   /* check whether the object is still available */
>   if (object && !get_object(object))
> @@ -537,13 +535,13 @@ static struct kmemleak_object 
> *find_and_remove_object(unsigned long ptr, int ali
>   unsigned long flags;
>   struct kmemleak_object *object;
>  
> -	write_lock_irqsave(&kmemleak_lock, flags);
> +	spin_lock_irqsave(&kmemleak_lock, flags);
>   object = lookup_object(ptr, alias);
>   if (object) {
> 		rb_erase(&object->rb_node, &object_tree_root);
> 		list_del_rcu(&object->object_list);
>   }
> -	write_unlock_irqrestore(&kmemleak_lock, flags);
> +	spin_unlock_irqrestore(&kmemleak_lock, flags);

So here, while list removal is RCU-safe, rb_erase() is not.

If you have time to implement an rb_erase_rcu(), then we could reduce
the locking in kmemleak.

-- 
Catalin


Re: [PATCH] kmemleak: survive in a low-memory situation

2019-01-02 Thread Catalin Marinas
Hi Qian,

On Wed, Jan 02, 2019 at 11:08:49AM -0500, Qian Cai wrote:
> Kmemleak could quickly fail to allocate an object structure and then
> disable itself in a low-memory situation. For example, running a mmap()
> workload triggering swapping and OOM [1].
> 
> First, it unnecessarily attempt to allocate even though the tracking
> object is NULL in kmem_cache_alloc(). For example,
> 
> alloc_io
>   bio_alloc_bioset
> mempool_alloc
>   mempool_alloc_slab
> kmem_cache_alloc
>   slab_alloc_node
> __slab_alloc <-- could return NULL
> slab_post_alloc_hook
>   kmemleak_alloc_recursive

kmemleak_alloc() only continues with the kmemleak_object allocation if
the given pointer is not NULL.
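
For reference, the entry point is roughly (paraphrasing mm/kmemleak.c):

	void __ref kmemleak_alloc(const void *ptr, size_t size, int min_count,
				  gfp_t gfp)
	{
		/* NULL and error pointers never reach create_object() */
		if (kmemleak_enabled && ptr && !IS_ERR(ptr))
			create_object((unsigned long)ptr, size, min_count, gfp);
		else if (kmemleak_early_log)
			log_early(KMEMLEAK_ALLOC, ptr, size, min_count);
	}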

> diff --git a/mm/slab.h b/mm/slab.h
> index 4190c24ef0e9..51a9a942cc56 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -435,15 +435,16 @@ static inline void slab_post_alloc_hook(struct 
> kmem_cache *s, gfp_t flags,
>  {
>   size_t i;
>  
> - flags &= gfp_allowed_mask;
> - for (i = 0; i < size; i++) {
> - void *object = p[i];
> -
> - kmemleak_alloc_recursive(object, s->object_size, 1,
> -  s->flags, flags);
> - p[i] = kasan_slab_alloc(s, object, flags);
> + if (*p) {
> + flags &= gfp_allowed_mask;
> + for (i = 0; i < size; i++) {
> + void *object = p[i];
> +
> + kmemleak_alloc_recursive(object, s->object_size, 1,
> +  s->flags, flags);
> + p[i] = kasan_slab_alloc(s, object, flags);
> + }
>   }

This is not necessary for kmemleak.

-- 
Catalin


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2018-12-18 Thread Catalin Marinas
On Tue, Dec 18, 2018 at 04:03:38PM +0100, Andrey Konovalov wrote:
> On Wed, Dec 12, 2018 at 4:02 PM Catalin Marinas  
> wrote:
> > The summary of our internal discussions (mostly between kernel
> > developers) is that we can't properly describe a user ABI that covers
> > future syscalls or syscall extensions while not all syscalls accept
> > tagged pointers. So we tweaked the requirements slightly to only allow
> > tagged pointers back into the kernel *if* the originating address is
> > from an anonymous mmap() or below sbrk(0). This should cover some of the
> > ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
> > pointer to a buffer obtained via mmap() on the device operations.
> >
> > (sorry for not being clear on what Vincenzo's proposal implies)
> 
> OK, I see. So I need to make the following changes to my patchset AFAIU.
> 
> 1. Make sure that we only allow tagged user addresses that originate
> from an anonymous mmap() or below sbrk(0). How exactly should this
> check be performed?

I don't think we should perform such checks. That's rather stating that
the kernel only guarantees that the tagged pointers work if they
originated from these memory ranges.

> 2. Allow tagged addressed passed to memory syscalls (as long as (1) is
> satisfied). Do I understand correctly that this means that I need to
> locate all find_vma() callers outside of mm/ and fix them up as well?

Yes (unless anyone has a better idea or objections to this approach).
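
(I.e. each such caller would end up stripping the tag before the
lookup, something like:

	vma = find_vma(mm, untagged_addr(addr));

with untagged_addr() or an equivalent helper from your series.)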

BTW, I'll be off until the new year, so won't be able to follow up.

-- 
Catalin


Re: [PATCH v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

2018-12-18 Thread Catalin Marinas
20 
>  block_ioctl+0x3d/0x50 
>  do_vfs_ioctl+0xa8/0x610 
>  ? vfs_write+0x166/0x1b0 
>  ksys_ioctl+0x67/0x90 
>  __x64_sys_ioctl+0x1a/0x20 
>  do_syscall_64+0x4d/0xf0 
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> kmemleak is an error detecting feature. We would not expect as good 
> performance
> as without it. As there is no raw rwlock defining helpers, we turn 
> kmemleak_lock
> to a raw spinlock.
> 
> Signed-off-by: He Zhe 
> Cc: catalin.mari...@arm.com
> Cc: bige...@linutronix.de
> Cc: t...@linutronix.de
> Cc: rost...@goodmis.org

As I replied already, I don't think this patch would increase the
kmemleak latency (or performance), although I haven't actually tested
it. FWIW:

Acked-by: Catalin Marinas 


Re: [PATCH v2] kmemleak: Turn kmemleak_lock to raw spinlock on RT

2018-12-18 Thread Catalin Marinas
On Tue, Dec 18, 2018 at 06:51:41PM +0800, He Zhe wrote:
> On 2018/12/6 03:14, Sebastian Andrzej Siewior wrote:
> > With raw locks you wouldn't have multiple readers at the same time.
> > Maybe you wouldn't have recursion but since you can't have multiple
> > readers you would add lock contention where was none (because you could
> > have two readers at the same time).
> 
> OK. I understand your concern finally. At the commit log said, I wanted to 
> use raw
> rwlock but didn't find the DEFINE helper for it. Thinking it would not be 
> expected to
> have great performance, I turn to use raw spinlock instead. And yes, this 
> would
> introduce worse performance.

Looking through the kmemleak code, I can't really find significant
reader contention. The longest holder of this lock (read) is the scan
thread which is also protected by a scan_mutex, so can't run
concurrently with another scanner (triggered through debugfs). The other
read_lock(&kmemleak_lock) user is find_and_get_object() called from a
few places. However, all such places normally follow a create_object()
call (kmemleak_alloc() and friends) which already performs a
write_lock(&kmemleak_lock), so it needs to wait for the scan thread to
release the kmemleak_lock.

It may be worth running some performance/latency tests during kmemleak
scanning (echo scan > /sys/kernel/debug/kmemleak) but at a quick look,
I don't think we'd see any difference with a raw_spin_lock_t.

With a bit more thinking (though I'll be off until the new year), we
could probably get rid of the kmemleak_lock entirely in scan_block() and
make lookup_object() and the related rbtree code in kmemleak RCU-safe.

-- 
Catalin


Re: [PATCH v10 5/6] arm64: select ACPI PCI code only both features are enabled

2018-12-17 Thread Catalin Marinas
On Mon, Dec 17, 2018 at 11:04:42AM +0100, Rafael J. Wysocki wrote:
> On Saturday, December 15, 2018 2:02:46 AM CET Sinan Kaya wrote:
> > ACPI and PCI are no longer coupled to each other. Specify requirements
> > for both when pulling in code.
> > 
> > Signed-off-by: Sinan Kaya 
> > ---
> >  arch/arm64/Kconfig | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index ea2ab0330e3a..bcb6262044d8 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -5,7 +5,7 @@ config ARM64
> > select ACPI_GTDT if ACPI
> > select ACPI_IORT if ACPI
> > select ACPI_REDUCED_HARDWARE_ONLY if ACPI
> > -   select ACPI_MCFG if ACPI
> > +   select ACPI_MCFG if (ACPI && PCI)
> > select ACPI_SPCR_TABLE if ACPI
> > select ACPI_PPTT if ACPI
> > select ARCH_CLOCKSOURCE_DATA
> > @@ -163,7 +163,7 @@ config ARM64
> > select OF
> > select OF_EARLY_FLATTREE
> > select OF_RESERVED_MEM
> > -   select PCI_ECAM if ACPI
> > +   select PCI_ECAM if (ACPI && PCI)
> > select POWER_RESET
> > select POWER_SUPPLY
> > select REFCOUNT_FULL
> > 
> 
> I need a maintainer ACK here.

Acked-by: Catalin Marinas 


[GIT PULL] arm64 fixes for 4.20

2018-12-14 Thread Catalin Marinas
Hi Linus,

Please pull the arm64 fix below. Thanks.

The following changes since commit 40e020c129cfc991e8ab4736d2665351ffd1468d:

  Linux 4.20-rc6 (2018-12-09 15:31:00 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-fixes

for you to fetch changes up to 3238c359acee4ab57f15abb5a82b8ab38a661ee7:

  arm64: dma-mapping: Fix FORCE_CONTIGUOUS buffer clearing (2018-12-11 11:55:32 
+)


- Invalidate the caches before clearing the DMA buffer via the
  non-cacheable alias in the FORCE_CONTIGUOUS case


Robin Murphy (1):
  arm64: dma-mapping: Fix FORCE_CONTIGUOUS buffer clearing

 arch/arm64/mm/dma-mapping.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
Catalin


Re: Can we drop upstream Linux x32 support?

2018-12-13 Thread Catalin Marinas
On Wed, Dec 12, 2018 at 10:03:30AM -0800, Andy Lutomirski wrote:
> On Wed, Dec 12, 2018 at 8:52 AM Rich Felker  wrote:
> > On Wed, Dec 12, 2018 at 08:39:53AM -0800, Andy Lutomirski wrote:
> > > I'm proposing another alternative.  Given that x32 already proves that
> > > the user bitness model doesn't have to match the kernel model (in x32,
> > > user "long" is 32-bit but the kernel ABI "long" is 64-bit), I'm
> > > proposing extending this to just make the kernel ABI be LP64.  So
> > > __kernel_size_t would be 64-bit and pointers in kernel data structures
> > > would be 64-bit.  In other words, most or all of the kernel ABI would
> > > just match x86_64.
> > >
> > > As far as I can tell, the only thing that really needs unusual
> > > toolchain features here is that C doesn't have an extra-wide pointer
> > > type.  The kernel headers would need a way to say "this pointer is
> > > still logically a pointer, and user code may assume that it's 32 bits,
> > > but it has 8-byte alignment."
> >
> > None of this works on the userspace/C side, nor should any attempt be
> > made to make it work. Types fundamentally cannot have alignments
> > larger than their size. If you want to make the alignment of some
> > pointers 8, you have to make their size 8, and then you just have LP64
> > again if you did it for all pointers.
> >
> > If on the other hand you tried to make just some pointers "wide
> > pointers", you'd also be completely breaking the specified API
> > contracts of standard interfaces. For example in struct iovec's
> > iov_base, >iov_base is no longer a valid pointer to an object of
> > type void* that you can pass to interfaces expecting void**. Sloppy
> > misunderstandings like what you're making now are exactly why x32 is
> > already broken and buggy (>tv_nsec already has wrong type for
> > struct timespec foo).
> 
> I don't think it's quite that broken.  For the struct iovec example,
> we currently have:
> 
>struct iovec {
>void  *iov_base;/* Starting address */
>size_t iov_len; /* Number of bytes to transfer */
>};
> 
> we could have, instead: (pardon any whitespace damage)
> 
>struct iovec {
>void  *iov_base;/* Starting address */
>uint32_t __pad0;
>size_t iov_len; /* Number of bytes to transfer */
>uint32_t __pad1;
>} __attribute__((aligned(8));
> 
> or the same thing but where iov_len is uint64_t.  A pointer to
> iov_base still works exactly as expected.  Something would need to be
> done to ensure that the padding is all zeroed, which might be a real
> problem.

We looked at this approach briefly for arm64/ILP32 and zeroing the pads
was the biggest problem. User programs would not explicitly zero the pad
and I'm not sure the compiler would be any smarter. This means it's the
kernel's responsibility to zero the pad (around get_user,
copy_from_user), so it doesn't actually simplify the kernel side of the
syscall interface.
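
To illustrate with the struct iovec above (__pad0/__pad1 and the helper
name are made up, untested):

	static int fetch_user_iovec(struct iovec *kiov,
				    const struct iovec __user *uiov)
	{
		if (copy_from_user(kiov, uiov, sizeof(*kiov)))
			return -EFAULT;

		/* user space won't have zeroed these, the kernel has to */
		kiov->__pad0 = 0;
		kiov->__pad1 = 0;

		return 0;
	}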

If the data flow goes the other way (kernel to user), this approach
works fine.

> No one wants to actually type all the macro gunk into the headers to
> make this work, but this type of transformation is what I have in mind
> when the compiler is asked to handle the headers.  Or there could
> potentially be a tool that automatically consumes the uapi headers and
> spits out modified headers like this.

If the compiler can handle the zeroing, that would be great, though not
sure how (some __attribute__((zero)) which generates a type constructor
for such structure; it kind of departs from what the C language offers).

> Realistically, I think a much better model would be to use true ILP32
> code, where all the memory layouts in the uapi match i386.

The conclusion we came to on arm64 was that an ILP32 ABI should not
really be any different from a _new_ 32-bit architecture ABI. It differs
from arm32 a bit (different syscall numbers, off_t is 64-bit,
sigcontext) but not significantly as it is still able to use the
majority of the compat_sys_* wrappers.

-- 
Catalin


Re: [RFC][PATCH 0/3] arm64 relaxed ABI

2018-12-12 Thread Catalin Marinas
Hi Andrey,

On Wed, Dec 12, 2018 at 03:23:25PM +0100, Andrey Konovalov wrote:
> On Mon, Dec 10, 2018 at 3:31 PM Vincenzo Frascino
>  wrote:
> > On arm64 the TCR_EL1.TBI0 bit has been set since Linux 3.x hence
> > the userspace (EL0) is allowed to set a non-zero value in the top
> > byte but the resulting pointers are not allowed at the user-kernel
> > syscall ABI boundary.
> >
> > This patchset proposes a relaxation of the ABI and a mechanism to
> > advertise it to the userspace via an AT_FLAGS.
> >
> > The rationale behind the choice of AT_FLAGS is that the Unix System V
> > ABI defines AT_FLAGS as "flags", leaving some degree of freedom in
> > interpretation.
> > There are two previous attempts of using AT_FLAGS in the Linux Kernel
> > for different reasons: the first was more generic and was used to expose
> > the support for the GNU STACK NX feature [1] and the second was done for
> > the MIPS architecture and was used to expose the support of "MIPS ABI
> > Extension for IEEE Std 754 Non-Compliant Interlinking" [2].
> > Both the changes are currently _not_ merged in mainline.
> > The only architecture that reserves some of the bits in AT_FLAGS is
> > currently MIPS, which introduced the concept of platform specific ABI
> > (psABI) reserving the top-byte [3].
> >
> > When ARM64_AT_FLAGS_SYSCALL_TBI is set the kernel is advertising
> > to the userspace that a relaxed ABI is supported hence this type
> > of pointers are now allowed to be passed to the syscalls when they are
> > in memory ranges obtained by anonymous mmap() or brk().
> >
> > The userspace _must_ verify that the flag is set before passing tagged
> > pointers to the syscalls allowed by this relaxation.
> >
> > More in general, exposing the ARM64_AT_FLAGS_SYSCALL_TBI flag and mandating
> > to the software to check that the feature is present, before using the
> > associated functionality, it provides a degree of control on the decision
> > of disabling such a feature in future without consequently breaking the
> > userspace.
[...]
> Acked-by: Andrey Konovalov 

Thanks for the ack. However, if we go ahead with this ABI proposal it
means that your patches need to be reworked to allow a non-zero top byte
in all syscalls, including mmap() and friends, ioctl(). There are ABI
concerns in either case but I'd rather have this discussion in the open.
It doesn't necessarily mean that I endorse this proposal, I would like
feedback and not just from kernel developers but user space ones.

The summary of our internal discussions (mostly between kernel
developers) is that we can't properly describe a user ABI that covers
future syscalls or syscall extensions while not all syscalls accept
tagged pointers. So we tweaked the requirements slightly to only allow
tagged pointers back into the kernel *if* the originating address is
from an anonymous mmap() or below sbrk(0). This should cover some of the
ioctls or getsockopt(TCP_ZEROCOPY_RECEIVE) where the user passes a
pointer to a buffer obtained via mmap() on the device operations.

(sorry for not being clear on what Vincenzo's proposal implies)

-- 
Catalin


Re: [PATCH V2] kmemleak: Add config to select auto scan

2018-12-12 Thread Catalin Marinas
On Wed, Dec 12, 2018 at 12:14:29PM +0530, Prateek Patel wrote:
> On 10/29/2018 4:13 PM, Catalin Marinas wrote:
> > On Mon, Oct 22, 2018 at 11:38:43PM +0530, Prateek Patel wrote:
> > > From: Sri Krishna chowdary 
> > > 
> > > Kmemleak scan can be cpu intensive and can stall user tasks at times.
> > > To prevent this, add config DEBUG_KMEMLEAK_AUTO_SCAN to enable/disable
> > > auto scan on boot up.
> > > Also protect first_run with DEBUG_KMEMLEAK_AUTO_SCAN as this is meant
> > > for only first automatic scan.
> > > 
> > > Signed-off-by: Sri Krishna chowdary 
> > > Signed-off-by: Sachin Nikam 
> > > Signed-off-by: Prateek 
> > Looks fine to me.
> > 
> > Reviewed-by: Catalin Marinas 
> 
> Can you mark this patch as acknowledged so that it can be picked up by the
> maintainer.

I thought Reviewed-by was sufficient. Anyway:

Acked-by: Catalin Marinas 


Re: Can we drop upstream Linux x32 support?

2018-12-11 Thread Catalin Marinas
On Tue, Dec 11, 2018 at 12:37:42PM +0100, Florian Weimer wrote:
> * Catalin Marinas:
> > On Tue, Dec 11, 2018 at 10:02:45AM +0100, Arnd Bergmann wrote:
> >> On Tue, Dec 11, 2018 at 6:35 AM Andy Lutomirski  wrote:
> >> > I tried to understand what's going on.  As far as I can tell, most of
> >> > the magic is the fact that __kernel_long_t and __kernel_ulong_t are
> >> > 64-bit as seen by x32 user code.  This means that a decent number of
> >> > uapi structures are the same on x32 and x86_64.  Syscalls that only
> >> > use structures like this should route to the x86_64 entry points.  But
> >> > the implementation is still highly dubious -- in_compat_syscall() will
> >> > be *true* in such system calls,
> >> 
> >> I think the fundamental issue was that the intention had always been
> >> to use only the 64-bit entry points for system calls, but the most
> >> complex one we have -- ioctl() -- has to use the compat entry point
> >> because device drivers define their own data structures using 'long'
> >> and pointer members and they need translation, as well as
> >> matching in_compat_syscall() checks. This in turn breaks down
> >> again whenever a driver defines an ioctl command that takes
> >> a __kernel_long_t or a derived type like timespec as its argument.
> >
> > With arm64 ILP32 we tried to avoid the ioctl() problem by having
> > __kernel_long_t 32-bit, IOW mimicking the arm32 ABI (compat). The
> > biggest pain point is signals where the state is completely different
> > from arm32 (more, wider registers) and can't be dealt with by the compat
> > layer.
> 
> I would expect to approach this from the opposite direction: use 64-bit
> types in places where the 64-bit kernel interface uses 64-bit types.
> After all, not everyone who is interested in ILP32 has a companion
> 32-bit architecture which could serve as a model for the application
> ABI.

I fully agree with you that if someone wants ILP32 for a 64-bit only
architecture, they should use the 64-bit kernel interface and ensure
POSIX is adjusted.

In the arm64 context, both options were discussed with the libc
community complaining that a partial 64-bit syscall ABI breaks POSIX
while the potential users were just asking for a 32-bit ABI to run their
existing software stack on ARMv8 machines without native 32-bit support
(until they complete the migration to 64-bit).

> (If there are conflicts with POSIX, then POSIX needs to be fixed to
> support this.)

This would have been nice but no-one volunteered and, more importantly,
there was no conclusive argument that ARM ILP32 is better than LP64
(well, apart from a minority of benchmarks) and something that people
would want to migrate to. Given that the only credible case made was
about legacy code, we decided to go ahead with a (mostly) compat 32-bit
ABI.

-- 
Catalin


Re: Can we drop upstream Linux x32 support?

2018-12-11 Thread Catalin Marinas
On Tue, Dec 11, 2018 at 10:02:45AM +0100, Arnd Bergmann wrote:
> On Tue, Dec 11, 2018 at 6:35 AM Andy Lutomirski  wrote:
> > I tried to understand what's going on.  As far as I can tell, most of
> > the magic is the fact that __kernel_long_t and __kernel_ulong_t are
> > 64-bit as seen by x32 user code.  This means that a decent number of
> > uapi structures are the same on x32 and x86_64.  Syscalls that only
> > use structures like this should route to the x86_64 entry points.  But
> > the implementation is still highly dubious -- in_compat_syscall() will
> > be *true* in such system calls,
> 
> I think the fundamental issue was that the intention had always been
> to use only the 64-bit entry points for system calls, but the most
> complex one we have -- ioctl() -- has to use the compat entry point
> because device drivers define their own data structures using 'long'
> and pointer members and they need translation, as well as
> matching in_compat_syscall() checks. This in turn breaks down
> again whenever a driver defines an ioctl command that takes
> a __kernel_long_t or a derived type like timespec as its argument.

With arm64 ILP32 we tried to avoid the ioctl() problem by having
__kernel_long_t 32-bit, IOW mimicking the arm32 ABI (compat). The
biggest pain point is signals where the state is completely different
from arm32 (more, wider registers) and can't be dealt with by the compat
layer.

Fortunately, we haven't merged it yet as we have the same dilemma about
real users and who's going to regularly test the ABI in the long run. In
the meantime, watching this thread with interest ;).

-- 
Catalin


Re: linux-next: manual merge of the arm64 tree with Linus' tree

2018-12-11 Thread Catalin Marinas
On Tue, Dec 11, 2018 at 08:52:57AM +1100, Stephen Rothwell wrote:
> Today's linux-next merge of the arm64 tree got conflicts in:
> 
>   Documentation/arm64/silicon-errata.txt
>   arch/arm64/Kconfig
> 
> between commit:
> 
>   ce8c80c536da ("arm64: Add workaround for Cortex-A76 erratum 1286807")
> 
> from Linus' tree and commit:
> 
>   a457b0f7f50d ("arm64: Add configuration/documentation for Cortex-A76 
> erratum 1165522")
> 
> from the arm64 tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

It looks fine. Thanks Stephen.

-- 
Catalin


Re: [PATCH v6 08/13] arm64: expose user PAC bit positions via ptrace

2018-12-10 Thread Catalin Marinas
On Mon, Dec 10, 2018 at 02:29:45PM +, Will Deacon wrote:
> On Mon, Dec 10, 2018 at 08:22:06AM -0600, Richard Henderson wrote:
> > On 12/10/18 6:03 AM, Catalin Marinas wrote:
> > >> However, it won't be too long before someone implements support for
> > >> ARMv8.2-LVA, at which point, without changes to mandatory pointer 
> > >> tagging, we
> > >> will only have 3 authentication bits: [54:52].  This seems useless and 
> > >> easily
> > >> brute-force-able.
[...]
> > Perhaps the opt-in should be at exec time, with ELF flags (or equivalent) on
> > the application.  Because, as you say, changing the shape of the PAC in the
> > middle of execution is in general not possible.
> 
> I think we'd still have a potential performance problem with that approach,
> since we'd end up having to context-switch TCR.T0SZ, which is permitted to
> be cached in a TLB and would therefore force us to introduce TLB
> invalidation when context-switching between tasks using 52-bit VAs and tasks
> using 48-bit VAs.
> 
> There's a chance we could get the architecture tightened here, but it's
> not something we've pushed for so far and it depends on what's already been
> built.

Just a quick summary of our internal discussion:

ARMv8.3 also comes with a new bit, TCR_EL1.TBIDx, which practically
disables TBI for code pointers. This bit allows us to use 11 bits for
code PtrAuth with 52-bit VA.

Now, the problem is that TBI for code pointers is user ABI, so we can't
simply disable it. We may be able to do this with memory tagging since
that's an opt-in feature (prctl) where the user is aware that the top
byte of a pointer is no longer ignored. However, that's probably for a
future discussion.

-- 
Catalin


Re: [PATCH v2 2/2] arm64: uaccess: Implement unsafe accessors

2018-12-10 Thread Catalin Marinas
On Fri, Dec 07, 2018 at 08:38:11AM +, Julien Thierry wrote:
> 
> 
> On 12/06/2018 06:25 PM, Catalin Marinas wrote:
> > On Mon, Dec 03, 2018 at 01:55:18PM +, Julien Thierry wrote:
> > > diff --git a/arch/arm64/include/asm/uaccess.h 
> > > b/arch/arm64/include/asm/uaccess.h
> > > index 07c3408..cabfcae 100644
> > > --- a/arch/arm64/include/asm/uaccess.h
> > > +++ b/arch/arm64/include/asm/uaccess.h
> > > @@ -233,6 +233,23 @@ static inline void uaccess_enable_not_uao(void)
> > >   __uaccess_enable(ARM64_ALT_PAN_NOT_UAO);
> > >   }
> > > +#define unsafe_user_region_active	uaccess_region_active
> > > +static inline bool uaccess_region_active(void)
> > > +{
> > > + if (system_uses_ttbr0_pan()) {
> > > + u64 ttbr;
> > > +
> > > + ttbr = read_sysreg(ttbr1_el1);
> > > + return ttbr & TTBR_ASID_MASK;
> > 
> > Nitpick: could write this in 1-2 lines.
> 
> True, I can do that in 1 line.
> 
> > > + } else if (cpus_have_const_cap(ARM64_ALT_PAN_NOT_UAO)) {
> > > + return (read_sysreg(sctlr_el1) & SCTLR_EL1_SPAN) ?
> > > + false :
> > > + !read_sysreg_s(SYS_PSTATE_PAN);
> > > + }
> > 
> > ARM64_ALT_PAN_NOT_UAO implies ARM64_HAS_PAN which implies SCTLR_EL1.SPAN
> > is 0 at run-time. Is this to cope with the case of being called prior to
> > cpu_enable_pan()?
> > 
> 
> Yes, the issue I can into is that for cpufeatures, .cpu_enable() callbacks
> are called inside stop_machine() which obviously might_sleep and so attempts
> to check whether user_access is on. But for features that get enabled before
> PAN, the PAN bit will be set.

OK, so the PSTATE.PAN bit only makes sense when SCTLR_EL1.SPAN is 0, IOW
the PAN hardware feature has been enabled. Maybe you could write it
(together with some comment):

	} else if (cpus_have_const_cap(ARM64_ALT_PAN_NOT_UAO) &&
		   !(read_sysreg(sctlr_el1) & SCTLR_EL1_SPAN)) {
		/* only if PAN is present and enabled */
		return !read_sysreg_s(SYS_PSTATE_PAN);
	}

On the cpufeature.c side of things, it seems that we enable the
static_branch before calling the cpu_enable. I wonder whether changing
the order here would help avoid the SCTLR_EL1 read (not sure what
else it would break; cc'ing Suzuki).
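
In any case, putting the two comments above together, a minimal sketch
of how the helper could end up looking (untested, using the same
accessors as in your patch):

	static inline bool uaccess_region_active(void)
	{
		if (system_uses_ttbr0_pan())
			return read_sysreg(ttbr1_el1) & TTBR_ASID_MASK;

		/* PSTATE.PAN is only meaningful once SCTLR_EL1.SPAN is clear */
		if (cpus_have_const_cap(ARM64_ALT_PAN_NOT_UAO) &&
		    !(read_sysreg(sctlr_el1) & SCTLR_EL1_SPAN))
			return !read_sysreg_s(SYS_PSTATE_PAN);

		return false;
	}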

-- 
Catalin


Re: [PATCH v6 10/24] arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking

2018-12-10 Thread Catalin Marinas
On Thu, Dec 06, 2018 at 09:50:18AM +, Julien Thierry wrote:
> On 05/12/18 18:26, Catalin Marinas wrote:
> > On Wed, Dec 05, 2018 at 04:55:54PM +, Julien Thierry wrote:
> >> On 04/12/18 17:36, Catalin Marinas wrote:
> >>> On Mon, Nov 12, 2018 at 11:57:01AM +, Julien Thierry wrote:
> >>>> diff --git a/arch/arm64/include/asm/irqflags.h 
> >>>> b/arch/arm64/include/asm/irqflags.h
> >>>> index 24692ed..e0a32e4 100644
> >>>> --- a/arch/arm64/include/asm/irqflags.h
> >>>> +++ b/arch/arm64/include/asm/irqflags.h
> >>>> @@ -18,7 +18,27 @@
> >>>>  
> >>>>  #ifdef __KERNEL__
> >>>>  
> >>>> +#include 
> >>>> +#include 
> >>>>  #include 
> >>>> +#include 
> >>>> +
> >>>> +
> >>>> +/*
> >>>> + * When ICC_PMR_EL1 is used for interrupt masking, only the bit 
> >>>> indicating
> >>>> + * whether the normal interrupts are masked is kept along with the daif
> >>>> + * flags.
> >>>> + */
> >>>> +#define ARCH_FLAG_PMR_EN 0x1
> >>>> +
> >>>> +#define MAKE_ARCH_FLAGS(daif, pmr)  
> >>>> \
> >>>> +((daif) | (((pmr) >> GIC_PRIO_STATUS_SHIFT) & ARCH_FLAG_PMR_EN))
> >>>> +
> >>>> +#define ARCH_FLAGS_GET_PMR(flags)				\
> >>>> +	((((flags) & ARCH_FLAG_PMR_EN) << GIC_PRIO_STATUS_SHIFT) \
> >>>> +		| GIC_PRIO_IRQOFF)
> >>>> +
> >>>> +#define ARCH_FLAGS_GET_DAIF(flags) ((flags) & ~ARCH_FLAG_PMR_EN)
> >>>
> >>> I wonder whether we could just use the PSR_I_BIT here to decide whether
> >>> to set the GIC_PRIO_IRQ{ON,OFF}. We could clear the PSR_I_BIT in
> >>> _restore_daif() with an alternative.
> >>
> >> So, the issue with it is that some contexts might be using PSR.I to
> >> disable interrupts (any contexts with async errors or debug exceptions
> >> disabled, kvm guest entry paths, pseudo-NMIs, ...).
> >>
> >> If any of these contexts calls local_irq_save()/local_irq_restore() or
> >> local_daif_save()/local_daif_restore(), by only relying on PSR_I_BIT to
> >> represent the PMR status, we might end up clearing PSR.I when we shouldn't.
> >>
> >> I'm not sure whether there are no callers of these functions in those
> >> context. But if that is the case, we could simplify things, yes.
> > 
> > There are callers of local_daif_save() (3) and local_daif_mask() (7) but
> > do they all need to disable the pseudo-NMIs?
> 
> Hmmm, I really think that both of those should be disabling NMIs.
> Otherwise, if we take an NMI, the first thing the el1_irq handler is
> going to do is "enable_da_f()" which could lead to potential issues.
> 
> One thing that could be done is:
> - local_daif_save() and local_daif_mask() both mask all daif bits
> (taking care to represent PMR value in the I bit of the saved flags)
> - local_daif_restore() restores da_f as expected and decides values to
> put for PMR and PSR.I as follows:
>   * do the da_f restore
>   * if PSR.A bit is cleared in the saved flags, then we also do a 
> start_nmi()
> 
> However, this would not work with a local_daif_save()/restore() on the
> return path of an NMI because I think it is the only context with NMIs
> "stopped" that can take aborts. I can add a WARN_ON(in_nmi()) for
> local_daif_restore() if that doesn't affect performance too much.

FTR, as we discussed this in the office, the conclusion (IIUC) we got to
was: leave the *_daif_*() functions unchanged, touching all the
corresponding PSTATE bits, but change the arch_local_irq_*() macros to
only touch the PMR when the feature is enabled.

-- 
Catalin


Re: [REPOST PATCH v6 0/4] kgdb: Fix kgdb_roundup_cpus()

2018-12-10 Thread Catalin Marinas
Hi Doug,

On Fri, Dec 07, 2018 at 10:40:24AM -0800, Doug Anderson wrote:
> On Fri, Dec 7, 2018 at 9:42 AM Catalin Marinas  
> wrote:
> > On Tue, Dec 04, 2018 at 07:38:24PM -0800, Douglas Anderson wrote:
> > > Douglas Anderson (4):
> > >   kgdb: Remove irq flags from roundup
> > >   kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
> > >   kgdb: Don't round up a CPU that failed rounding up before
> > >   kdb: Don't back trace on a cpu that didn't round up
> >
> > FWIW, trying these on arm64 (ThunderX2) with CONFIG_KGDB_TESTS_ON_BOOT=y
> > on top of 4.20-rc5 doesn't boot. It looks like they leave interrupts
> > disabled when they shouldn't and it trips over the BUG at
> > mm/vmalloc.c:1380 (called via do_fork -> copy_process).
> >
> > Now, I don't think these patches make things worse on arm64 since prior
> > to them the kgdb boot tests on arm64 were stuck in a loop (RUN
> > singlestep).
> 
> Thanks for the report!  ...actually, I'd never tried CONFIG_KGDB_TESTS
> before.  ...so I tried them now:
> 
> A) chromeos-4.19 tree on qcom-sdm845 without this series: booted up OK
> B) chromeos-4.19 tree on qcom-sdm845 with this series: booted up OK
> C) v4.20-rc5-90-g30002dd008ed on rockchip-rk3399 (kevin) with this
> series: booted up OK
> 
> Example output from B) above:
> 
> localhost ~ # dmesg | grep kgdbts
> [2.139814] KGDB: Registered I/O driver kgdbts
> [2.144582] kgdbts:RUN plant and detach test
> [2.165333] kgdbts:RUN sw breakpoint test
> [2.172990] kgdbts:RUN bad memory access test
> [2.178640] kgdbts:RUN singlestep test 1000 iterations
> [2.187765] kgdbts:RUN singlestep [0/1000]
> [2.559596] kgdbts:RUN singlestep [100/1000]
> [2.931419] kgdbts:RUN singlestep [200/1000]
> [3.303474] kgdbts:RUN singlestep [300/1000]
> [3.675121] kgdbts:RUN singlestep [400/1000]
> [4.046867] kgdbts:RUN singlestep [500/1000]
> [4.418920] kgdbts:RUN singlestep [600/1000]
> [4.790824] kgdbts:RUN singlestep [700/1000]
> [5.162479] kgdbts:RUN singlestep [800/1000]
> [5.534103] kgdbts:RUN singlestep [900/1000]
> [5.902299] kgdbts:RUN do_fork for 100 breakpoints
> [8.463900] KGDB: Unregistered I/O driver kgdbts, debugger disabled
> 
> ...so I guess I'm a little confused.  Either I have a different config
> than you do or something is special about your machine?

I tried it now on a Juno board, both as a host and as a guest, and it
boots fine. It must be something that only triggers on ThunderX2.
Ignore the report for now; if I find anything interesting I'll let you
know.

-- 
Catalin


Re: [PATCH v6 08/13] arm64: expose user PAC bit positions via ptrace

2018-12-10 Thread Catalin Marinas
On Sun, Dec 09, 2018 at 09:41:31AM -0600, Richard Henderson wrote:
> On 12/7/18 12:39 PM, Kristina Martsenko wrote:
> > When pointer authentication is in use, data/instruction pointers have a
> > number of PAC bits inserted into them. The number and position of these
> > bits depends on the configured TCR_ELx.TxSZ and whether tagging is
> > enabled. ARMv8.3 allows tagging to differ for instruction and data
> > pointers.
> 
> At this point I think it's worth starting a discussion about pointer tagging,
> and how we can make it controllable and not mandatory.
> 
> With this patch set, we are enabling 7 authentication bits: [54:48].
> 
> However, it won't be too long before someone implements support for
> ARMv8.2-LVA, at which point, without changes to mandatory pointer tagging, we
> will only have 3 authentication bits: [54:52].  This seems useless and easily
> brute-force-able.

Such support is already here (about to be queued):

https://lore.kernel.org/linux-arm-kernel/20181206225042.11548-1-steve.cap...@arm.com/

> I assume that pointer tagging is primarily used by Android, since I'm not 
> aware
> of anything else that uses it at all.

I would expect it to be enabled more widely (Linux distros), though only
the support for instructions currently in the NOP space.

> Unfortunately, there is no obvious path to making this optional that does not
> break compatibility with Documentation/arm64/tagged-pointers.txt.

There is also the ARMv8.5 MTE (memory tagging) which relies on tagged
pointers.

> I've been thinking that there ought to be some sort of global setting, akin to
> /proc/sys/kernel/randomize_va_space, as well as a prctl which an application
> could use to selectively enable TBI/TBID for an application that actually uses
> tagging.

An alternative would be to allow the opt-in to 52-bit VA, leaving it at
48-bit by default. However, it has the problem of changing the PAC size
and not being able to return.

> The global /proc setting allows the default to remain 1, which would let any
> application using tagging to continue working.  If there are none, the 
> sysadmin
> can set the default to 0.  Going forward, applications could be updated to use
> the prctl, allowing more systems to set the default to 0.
> 
> FWIW, pointer authentication continues to work when enabling TBI, but not the
> other way around.  Thus the prctl could be used to enable TBI at any point, 
> but
> if libc is built with PAuth, there's no way to turn it back off again.

This may work but, as you said, TBI is user ABI at this point and we
can't take it away now (at the time we didn't foresee pauth).

Talking briefly with Will/Kristina/Mark, I think the best option is to
make 52-bit VA default off in the kernel config. Whoever needs it
enabled (enterprise systems) should be aware of the reduced PAC bits. I
don't really think we have a better solution.

-- 
Catalin


[GIT PULL] arm64 fixes for 4.20-rc6

2018-12-07 Thread Catalin Marinas
Hi Linus,

Please pull the arm64 fix below. Thanks.

The following changes since commit 2595646791c319cadfdbf271563aac97d0843dc7:

  Linux 4.20-rc5 (2018-12-02 15:07:55 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux tags/arm64-fixes

for you to fetch changes up to b4aecf78083d8c6424657c1746c7c3de6e61669f:

  arm64: hibernate: Avoid sending cross-calling with interrupts disabled 
(2018-12-07 15:52:39 +)


- Avoid sending IPIs with interrupts disabled


Will Deacon (1):
  arm64: hibernate: Avoid sending cross-calling with interrupts disabled

 arch/arm64/kernel/hibernate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
Catalin


Re: [REPOST PATCH v6 0/4] kgdb: Fix kgdb_roundup_cpus()

2018-12-07 Thread Catalin Marinas
On Tue, Dec 04, 2018 at 07:38:24PM -0800, Douglas Anderson wrote:
> Douglas Anderson (4):
>   kgdb: Remove irq flags from roundup
>   kgdb: Fix kgdb_roundup_cpus() for arches who used smp_call_function()
>   kgdb: Don't round up a CPU that failed rounding up before
>   kdb: Don't back trace on a cpu that didn't round up

FWIW, trying these on arm64 (ThunderX2) with CONFIG_KGDB_TESTS_ON_BOOT=y
on top of 4.20-rc5 doesn't boot. It looks like they leave interrupts
disabled when they shouldn't and it trips over the BUG at
mm/vmalloc.c:1380 (called via do_fork -> copy_process).

Now, I don't think these patches make things worse on arm64 since prior
to them the kgdb boot tests on arm64 were stuck in a loop (RUN
singlestep).

-- 
Catalin


Re: [PATCH] mm, kmemleak: Little optimization while scanning

2018-12-07 Thread Catalin Marinas
On Thu, Dec 06, 2018 at 02:19:18PM +0100, Oscar Salvador wrote:
> kmemleak_scan() goes through all online nodes and tries
> to scan all used pages.
> We can do better and use pfn_to_online_page(), so in case we have
> CONFIG_MEMORY_HOTPLUG, offlined pages will be skipped automatically.
> For boxes where CONFIG_MEMORY_HOTPLUG is not present, pfn_to_online_page()
> will fallback to pfn_valid().
> 
> Another little optimization is to check if the page belongs to the node
> we are currently checking, so in case we have nodes interleaved we will
> not check the same pfn multiple times.
> 
> I ran some tests:
> 
> Add some memory to node1 and node2 making it interleaved:
> 
> (qemu) object_add memory-backend-ram,id=ram0,size=1G
> (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
> (qemu) object_add memory-backend-ram,id=ram1,size=1G
> (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=2
> (qemu) object_add memory-backend-ram,id=ram2,size=1G
> (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1
> 
> Then, we offline that memory:
>  # for i in {32..39} ; do echo "offline" > 
> /sys/devices/system/node/node1/memory$i/state;done
>  # for i in {48..55} ; do echo "offline" > 
> /sys/devices/system/node/node1/memory$i/state;don
>  # for i in {40..47} ; do echo "offline" > 
> /sys/devices/system/node/node2/memory$i/state;done
> 
> And we run kmemleak_scan:
> 
>  # echo "scan" > /sys/kernel/debug/kmemleak
> 
> before the patch:
> 
> kmemleak: time spend: 41596 us
> 
> after the patch:
> 
> kmemleak: time spend: 34899 us
> 
> Signed-off-by: Oscar Salvador 
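
For reference, the patched per-node loop should end up looking roughly
like this (sketch reconstructed from the description above):

	int i;

	for_each_online_node(i) {
		unsigned long start_pfn = node_start_pfn(i);
		unsigned long end_pfn = node_end_pfn(i);
		unsigned long pfn;

		for (pfn = start_pfn; pfn < end_pfn; pfn++) {
			/* skips offlined/hot-removed ranges */
			struct page *page = pfn_to_online_page(pfn);

			if (!page)
				continue;

			/* only scan pages belonging to this node */
			if (page_to_nid(page) != i)
				continue;

			/* only scan if page is in use */
			if (page_count(page) == 0)
				continue;

			scan_block(page, page + 1, NULL);
		}
	}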

Acked-by: Catalin Marinas 


Re: [PATCH v2 2/2] arm64: uaccess: Implement unsafe accessors

2018-12-06 Thread Catalin Marinas
On Mon, Dec 03, 2018 at 01:55:18PM +, Julien Thierry wrote:
> diff --git a/arch/arm64/include/asm/uaccess.h 
> b/arch/arm64/include/asm/uaccess.h
> index 07c3408..cabfcae 100644
> --- a/arch/arm64/include/asm/uaccess.h
> +++ b/arch/arm64/include/asm/uaccess.h
> @@ -233,6 +233,23 @@ static inline void uaccess_enable_not_uao(void)
>   __uaccess_enable(ARM64_ALT_PAN_NOT_UAO);
>  }
>  
> +#define unsafe_user_region_active	uaccess_region_active
> +static inline bool uaccess_region_active(void)
> +{
> + if (system_uses_ttbr0_pan()) {
> + u64 ttbr;
> +
> + ttbr = read_sysreg(ttbr1_el1);
> + return ttbr & TTBR_ASID_MASK;

Nitpick: could write this in 1-2 lines.

> + } else if (cpus_have_const_cap(ARM64_ALT_PAN_NOT_UAO)) {
> + return (read_sysreg(sctlr_el1) & SCTLR_EL1_SPAN) ?
> + false :
> + !read_sysreg_s(SYS_PSTATE_PAN);
> + }

ARM64_ALT_PAN_NOT_UAO implies ARM64_HAS_PAN which implies SCTLR_EL1.SPAN
is 0 at run-time. Is this to cope with the case of being called prior to
cpu_enable_pan()?

-- 
Catalin


Re: [RESEND PATCH] drivers/base: kmemleak ignore a known leak

2018-12-06 Thread Catalin Marinas
On Thu, Dec 06, 2018 at 11:07:51AM -0500, Qian Cai wrote:
> unreferenced object 0x808ec6dc5a80 (size 128):
>   comm "swapper/0", pid 1, jiffies 4294938063 (age 2560.530s)
>   hex dump (first 32 bytes):
> ff ff ff ff 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b  
> 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  
>   backtrace:
> [<476dcf8c>] kmem_cache_alloc_trace+0x430/0x500
> [<4f708d37>] platform_device_register_full+0xbc/0x1e8
> [<6c2a7ec7>] acpi_create_platform_device+0x370/0x450
> [<ef135642>] acpi_default_enumeration+0x34/0x78
> [<3bd9a052>] acpi_bus_attach+0x2dc/0x3e0
> [<3cf4f7f2>] acpi_bus_attach+0x108/0x3e0
> [<3cf4f7f2>] acpi_bus_attach+0x108/0x3e0
> [<2968643e>] acpi_bus_scan+0xb0/0x110
> [<10dd0bd7>] acpi_scan_init+0x1a8/0x410
> [<965b3c5a>] acpi_init+0x408/0x49c
> [<ed4b9fe2>] do_one_initcall+0x178/0x7f4
> [<a5ac5a74>] kernel_init_freeable+0x9d4/0xa9c
> [<70ea6c15>] kernel_init+0x18/0x138
> [<fb8fff06>] ret_from_fork+0x10/0x1c
> [<41273a0d>] 0x
> 
> Then, faddr2line pointed out this line,
> 
> /*
>  * This memory isn't freed when the device is put,
>  * I don't have a nice idea for that though.  Conceptually
>  * dma_mask in struct device should not be a pointer.
>  * See http://thread.gmane.org/gmane.linux.kernel.pci/9081
>  */
> pdev->dev.dma_mask =
>   kmalloc(sizeof(*pdev->dev.dma_mask), GFP_KERNEL);
> 
> Since this leak has been existed for more than 8-year now and it does not
> reference other part of the memory, let kmemleak ignore it, so users don't
> need to waste time reporting this in the future.
> 
> Signed-off-by: Qian Cai 
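
Presumably the fix boils down to marking the allocation right after it
is made, something along these lines (sketch, not the exact diff):

	pdev->dev.dma_mask = kmalloc(sizeof(*pdev->dev.dma_mask),
				     GFP_KERNEL);
	if (!pdev->dev.dma_mask)
		goto err;

	/* known leak, don't report it again */
	kmemleak_ignore(pdev->dev.dma_mask);

	*pdev->dev.dma_mask = pdevinfo->dma_mask;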

Acked-by: Catalin Marinas 


Re: [RESEND PATCH] efi: let kmemleak ignore false positives

2018-12-06 Thread Catalin Marinas
On Thu, Dec 06, 2018 at 11:16:33AM -0500, Qian Cai wrote:
> unreferenced object 0x8096c1acf580 (size 128):
>   comm "swapper/63", pid 0, jiffies 4294937418 (age 1201.230s)
>   hex dump (first 32 bytes):
> 80 87 b5 c1 96 00 00 00 00 00 cc c2 16 00 00 00  
> 00 00 01 00 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b  
>   backtrace:
> [<1d2549ba>] kmem_cache_alloc_trace+0x430/0x500
> [<93a6dfab>] efi_mem_reserve_persistent+0x50/0xf8
> [<0a730828>] its_cpu_init_lpis+0x394/0x4b8
> [<edf04e07>] its_cpu_init+0x104/0x150
> [<4d0342c5>] gic_starting_cpu+0x34/0x40
> [<5d9da772>] cpuhp_invoke_callback+0x228/0x1d68
> [<61eace9b>] notify_cpu_starting+0xc0/0x118
> [<48bc2dc5>] secondary_start_kernel+0x23c/0x3b0
> [<15137d6a>] 0x
> 
> efi_mem_reserve_persistent+0x50/0xf8:
> kmalloc at include/linux/slab.h:546
> (inlined by) efi_mem_reserve_persistent at drivers/firmware/efi/efi.c:979
> 
> This line,
> 
> rsv = kmalloc(sizeof(*rsv), GFP_ATOMIC);
> 
> Kmemleak has a known limitation that can only track pointers in the kernel
> virtual space. Hence, it will report false positives due to "rsv" will only
> reference to other physical addresses,
> 
> rsv->next = efi_memreserve_root->next;
> efi_memreserve_root->next = __pa(rsv);
> 
> Signed-off-by: Qian Cai 
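
The fix is presumably a one-line kmemleak_ignore() right after the
allocation quoted above, e.g.:

	rsv = kmalloc(sizeof(*rsv), GFP_ATOMIC);
	if (!rsv)
		return -ENOMEM;

	/* rsv is only referenced via __pa(rsv), which kmemleak can't follow */
	kmemleak_ignore(rsv);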

Acked-by: Catalin Marinas 


Re: [PATCH] arm64: hugetlb: Register hugepages during arch init

2018-12-06 Thread Catalin Marinas
It seems that we somehow missed this patch. Cc'ing a few more people
that touched hugetlbpage.c.

Catalin

On Tue, Oct 23, 2018 at 06:36:57AM +0530, Allen Pais wrote:
> Add hstate for each supported hugepage size using arch initcall.
> 
> * no hugepage parameters
> 
>   Without hugepage parameters, only a default hugepage size is
>   available for dynamic allocation.  It's different, for example, from
>   x86_64 and sparc64 where all supported hugepage sizes are available.
> 
> * only default_hugepagesz= is specified and set not to HPAGE_SIZE
> 
>   In spite of the fact that default_hugepagesz= is set to a valid
>   hugepage size, it's treated as unsupported and reverted to
>   HPAGE_SIZE.  Such behaviour is also different from x86_64 and
>   sparc64.
> 
> Reviewed-by: Tom Saeger 
> Signed-off-by: Dmitry Klochkov 
> Signed-off-by: Allen Pais 
> ---
>  arch/arm64/mm/hugetlbpage.c | 33 ++---
>  1 file changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> index f58ea50..28cbc22 100644
> --- a/arch/arm64/mm/hugetlbpage.c
> +++ b/arch/arm64/mm/hugetlbpage.c
> @@ -429,6 +429,27 @@ void huge_ptep_clear_flush(struct vm_area_struct *vma,
>   clear_flush(vma->vm_mm, addr, ptep, pgsize, ncontig);
>  }
>  
> +static void __init add_huge_page_size(unsigned long size)
> +{
> + if (size_to_hstate(size))
> + return;
> +
> + hugetlb_add_hstate(ilog2(size) - PAGE_SHIFT);
> +}
> +
> +static int __init hugetlbpage_init(void)
> +{
> +#ifdef CONFIG_ARM64_4K_PAGES
> + add_huge_page_size(PUD_SIZE);
> +#endif
> + add_huge_page_size(PMD_SIZE * CONT_PMDS);
> + add_huge_page_size(PMD_SIZE);
> + add_huge_page_size(PAGE_SIZE * CONT_PTES);
> +
> + return 0;
> +}
> +arch_initcall(hugetlbpage_init);
> +
>  static __init int setup_hugepagesz(char *opt)
>  {
>   unsigned long ps = memparse(opt, &opt);
> @@ -440,7 +461,7 @@ static __init int setup_hugepagesz(char *opt)
>   case PMD_SIZE * CONT_PMDS:
>   case PMD_SIZE:
>   case PAGE_SIZE * CONT_PTES:
> - hugetlb_add_hstate(ilog2(ps) - PAGE_SHIFT);
> + add_huge_page_size(ps);
>   return 1;
>   }
>  
> @@ -449,13 +470,3 @@ static __init int setup_hugepagesz(char *opt)
>   return 0;
>  }
>  __setup("hugepagesz=", setup_hugepagesz);
> -
> -#ifdef CONFIG_ARM64_64K_PAGES
> -static __init int add_default_hugepagesz(void)
> -{
> - if (size_to_hstate(CONT_PTES * PAGE_SIZE) == NULL)
> - hugetlb_add_hstate(CONT_PTE_SHIFT);
> - return 0;
> -}
> -arch_initcall(add_default_hugepagesz);
> -#endif
> -- 
> 1.8.3.1
> 
> 
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [PATCH v8 0/8] arm64: untag user pointers passed to the kernel

2018-12-06 Thread Catalin Marinas
On Thu, Dec 06, 2018 at 01:44:24PM +0100, Andrey Konovalov wrote:
> On Thu, Nov 29, 2018 at 7:16 PM Catalin Marinas  
> wrote:
> > On Thu, Nov 08, 2018 at 03:48:10PM +0100, Andrey Konovalov wrote:
> > > On Thu, Nov 8, 2018 at 3:36 PM, Andrey Konovalov  
> > > wrote:
> > > > Changes in v8:
> > > > - Rebased onto 65102238 (4.20-rc1).
> > > > - Added a note to the cover letter on why syscall wrappers/shims that 
> > > > untag
> > > >   user pointers won't work.
> > > > - Added a note to the cover letter that this patchset has been merged 
> > > > into
> > > >   the Pixel 2 kernel tree.
> > > > - Documentation fixes, in particular added a list of syscalls that don't
> > > >   support tagged user pointers.
> > >
> > > I've changed the documentation to be more specific, please take a look.
> > >
> > > I haven't done anything about adding a way for the user to find out
> > > that the kernel supports this ABI extension. I don't know what would
> > > the the preferred way to do this, and we haven't received any comments
> > > on that from anybody else. Probing "on some innocuous syscall
> > > currently returning -EFAULT on tagged pointer arguments" works though,
> > > as you mentioned.
> >
> > We've had some internal discussions and also talked to some people at
> > Plumbers. I think the best option is to introduce an AT_FLAGS bit to
> > describe the ABI relaxation on tagged pointers. Vincenzo is going to
> > propose a patch on top of this series.
> 
> So should I wait for a patch from Vincenzo before posting v9 or post
> it as is? Or try to develop this patch myself?

The reason Vincenzo hasn't posted his patches yet is that we are still
debating internally how to document which syscalls accept a non-zero
top byte and what to do with future syscalls for which we don't know
the semantics.

Happy to take the discussion to the public list if Vincenzo posts his
patches. The conclusion of the ABI discussion may have an impact on the
actual implementation that you are proposing in this series.
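
From userspace, probing for the AT_FLAGS bit mentioned above would
then be trivial, something like this (the flag name and bit value here
are made up; the real ones would come from Vincenzo's patches):

	#include <elf.h>		/* AT_FLAGS */
	#include <stdbool.h>
	#include <sys/auxv.h>

	/* hypothetical bit advertised via the AT_FLAGS auxv entry */
	#define ARM64_AT_FLAGS_TAGGED_ADDR	(1UL << 0)

	static bool kernel_accepts_tagged_ptrs(void)
	{
		return getauxval(AT_FLAGS) & ARM64_AT_FLAGS_TAGGED_ADDR;
	}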

-- 
Catalin


Re: [PATCH v6 10/24] arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking

2018-12-05 Thread Catalin Marinas
On Wed, Dec 05, 2018 at 04:55:54PM +, Julien Thierry wrote:
> On 04/12/18 17:36, Catalin Marinas wrote:
> > On Mon, Nov 12, 2018 at 11:57:01AM +, Julien Thierry wrote:
> >> diff --git a/arch/arm64/include/asm/irqflags.h 
> >> b/arch/arm64/include/asm/irqflags.h
> >> index 24692ed..e0a32e4 100644
> >> --- a/arch/arm64/include/asm/irqflags.h
> >> +++ b/arch/arm64/include/asm/irqflags.h
> >> @@ -18,7 +18,27 @@
> >>  
> >>  #ifdef __KERNEL__
> >>  
> >> +#include 
> >> +#include 
> >>  #include 
> >> +#include 
> >> +
> >> +
> >> +/*
> >> + * When ICC_PMR_EL1 is used for interrupt masking, only the bit indicating
> >> + * whether the normal interrupts are masked is kept along with the daif
> >> + * flags.
> >> + */
> >> +#define ARCH_FLAG_PMR_EN 0x1
> >> +
> >> +#define MAKE_ARCH_FLAGS(daif, pmr)
> >> \
> >> +  ((daif) | (((pmr) >> GIC_PRIO_STATUS_SHIFT) & ARCH_FLAG_PMR_EN))
> >> +
> >> +#define ARCH_FLAGS_GET_PMR(flags)				\
> >> +	((((flags) & ARCH_FLAG_PMR_EN) << GIC_PRIO_STATUS_SHIFT) \
> >> +		| GIC_PRIO_IRQOFF)
> >> +
> >> +#define ARCH_FLAGS_GET_DAIF(flags) ((flags) & ~ARCH_FLAG_PMR_EN)
> > 
> > I wonder whether we could just use the PSR_I_BIT here to decide whether
> > to set the GIC_PRIO_IRQ{ON,OFF}. We could clear the PSR_I_BIT in
> > _restore_daif() with an alternative.
> 
> So, the issue with it is that some contexts might be using PSR.I to
> disable interrupts (any contexts with async errors or debug exceptions
> disabled, kvm guest entry paths, pseudo-NMIs, ...).
> 
> If any of these contexts calls local_irq_save()/local_irq_restore() or
> local_daif_save()/local_daif_restore(), by only relying on PSR_I_BIT to
> represent the PMR status, we might end up clearing PSR.I when we shouldn't.
> 
> I'm not sure whether there are no callers of these functions in those
> context. But if that is the case, we could simplify things, yes.

There are callers of local_daif_save() (3) and local_daif_mask() (7) but
do they all need to disable the pseudo-NMIs?

At a brief look at x86, it seems that they have something like
stop_nmi() and restart_nmi(). These don't have save/restore semantics,
so we could do something similar on arm64 that only deals with the
PSTATE.I bit directly and keep the software (flags) PSR.I as the PMR
bit. But we'd have to go through the 10 local_daif_* cases above to see
which actually need the stop_nmi() semantics.

-- 
Catalin


Re: [PATCH v6 21/24] arm64: Handle serror in NMI context

2018-12-04 Thread Catalin Marinas
On Mon, Nov 12, 2018 at 11:57:12AM +, Julien Thierry wrote:
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 5f4d9ac..66344cd 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -897,13 +897,17 @@ bool arm64_is_fatal_ras_serror(struct pt_regs *regs, 
> unsigned int esr)
>  
>  asmlinkage void do_serror(struct pt_regs *regs, unsigned int esr)
>  {
> - nmi_enter();
> + const bool was_in_nmi = in_nmi();
> +
> + if (!was_in_nmi)
> + nmi_enter();
>  
>   /* non-RAS errors are not containable */
>   if (!arm64_is_ras_serror(esr) || arm64_is_fatal_ras_serror(regs, esr))
>   arm64_serror_panic(regs, esr);
>  
> - nmi_exit();
> + if (!was_in_nmi)
> + nmi_exit();
>  }

Do we actually need nmi_enter/exit in the outer do_serror() function?
Could we just move it to arm64_serror_panic()?
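
I.e. something along these lines (only to illustrate the question, not
a tested change):

	asmlinkage void do_serror(struct pt_regs *regs, unsigned int esr)
	{
		/* non-RAS errors are not containable */
		if (!arm64_is_ras_serror(esr) ||
		    arm64_is_fatal_ras_serror(regs, esr))
			arm64_serror_panic(regs, esr);
	}

with the nmi_enter()/nmi_exit() pair moved inside arm64_serror_panic().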

-- 
Catalin


Re: [PATCH v6 15/24] arm64: Switch to PMR masking when starting CPUs

2018-12-04 Thread Catalin Marinas
On Mon, Nov 12, 2018 at 11:57:06AM +, Julien Thierry wrote:
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 8dc9dde..e495360 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -35,6 +35,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -175,6 +176,25 @@ int __cpu_up(unsigned int cpu, struct task_struct *idle)
>   return ret;
>  }
>  
> +static void init_gic_priority_masking(void)
> +{
> + u32 gic_sre = gic_read_sre();
> + u32 cpuflags;
> +
> + if (WARN_ON(!(gic_sre & ICC_SRE_EL1_SRE)))
> + return;
> +
> + WARN_ON(!irqs_disabled());
> +
> + gic_write_pmr(GIC_PRIO_IRQOFF);
> +
> + cpuflags = read_sysreg(daif);
> +
> + /* We can only unmask PSR.I if we can take aborts */
> + if (!(cpuflags & PSR_A_BIT))
> + write_sysreg(cpuflags & ~PSR_I_BIT, daif);

I don't understand this. If you don't switch off PSR_I_BIT here, where
does it happen? In which scenario do we actually have the A bit still
set? At a quick look, smp_prepare_boot_cpu() would have the A bit
cleared previously by setup_arch(). We have secondary_start_kernel()
where you call init_gic_priority_masking() before local_daif_restore().

So what happens if you always turn off PSR_I_BIT here?

-- 
Catalin


Re: [PATCH v6 10/24] arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking

2018-12-04 Thread Catalin Marinas
On Mon, Nov 12, 2018 at 11:57:01AM +, Julien Thierry wrote:
> diff --git a/arch/arm64/include/asm/irqflags.h 
> b/arch/arm64/include/asm/irqflags.h
> index 24692ed..e0a32e4 100644
> --- a/arch/arm64/include/asm/irqflags.h
> +++ b/arch/arm64/include/asm/irqflags.h
> @@ -18,7 +18,27 @@
>  
>  #ifdef __KERNEL__
>  
> +#include 
> +#include 
>  #include 
> +#include 
> +
> +
> +/*
> + * When ICC_PMR_EL1 is used for interrupt masking, only the bit indicating
> + * whether the normal interrupts are masked is kept along with the daif
> + * flags.
> + */
> +#define ARCH_FLAG_PMR_EN 0x1
> +
> +#define MAKE_ARCH_FLAGS(daif, pmr)   \
> + ((daif) | (((pmr) >> GIC_PRIO_STATUS_SHIFT) & ARCH_FLAG_PMR_EN))
> +
> +#define ARCH_FLAGS_GET_PMR(flags)				\
> +	((((flags) & ARCH_FLAG_PMR_EN) << GIC_PRIO_STATUS_SHIFT) \
> +		| GIC_PRIO_IRQOFF)
> +
> +#define ARCH_FLAGS_GET_DAIF(flags) ((flags) & ~ARCH_FLAG_PMR_EN)

I wonder whether we could just use the PSR_I_BIT here to decide whether
to set the GIC_PRIO_IRQ{ON,OFF}. We could clear the PSR_I_BIT in
_restore_daif() with an alternative.

> +/*
> + * CPU interrupt mask handling.
> + */
>  static inline void arch_local_irq_enable(void)
>  {
> - asm volatile(
> - "msrdaifclr, #2 // arch_local_irq_enable"
> - :
> + unsigned long unmasked = GIC_PRIO_IRQON;
> +
> + asm volatile(ALTERNATIVE(
> + "msrdaifclr, #2 // arch_local_irq_enable\n"
> + "nop",
> + "msr_s  " __stringify(SYS_ICC_PMR_EL1) ",%0\n"
> + "dsbsy",
> + ARM64_HAS_IRQ_PRIO_MASKING)

DSB needed here as well? I guess I'd have to read the GIC spec before
asking again ;).

-- 
Catalin


Re: [PATCH v6 07/24] arm64: Make PMR part of task context

2018-12-04 Thread Catalin Marinas
On Mon, Nov 12, 2018 at 11:56:58AM +, Julien Thierry wrote:
> diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
> index 039144e..eb8120e 100644
> --- a/arch/arm64/kernel/entry.S
> +++ b/arch/arm64/kernel/entry.S
> @@ -249,6 +249,12 @@ alternative_else_nop_endif
>   msr sp_el0, tsk
>   .endif
>  
> + /* Save pmr */
> +alternative_if ARM64_HAS_IRQ_PRIO_MASKING
> + mrs_s   x20, SYS_ICC_PMR_EL1
> + str x20, [sp, #S_PMR_SAVE]
> +alternative_else_nop_endif
> +
>   /*
>* Registers that may be useful after this macro is invoked:
>*
> @@ -269,6 +275,13 @@ alternative_else_nop_endif
>   /* No need to restore UAO, it will be restored from SPSR_EL1 */
>   .endif
>  
> + /* Restore pmr */
> +alternative_if ARM64_HAS_IRQ_PRIO_MASKING
> + ldr x20, [sp, #S_PMR_SAVE]
> + msr_s   SYS_ICC_PMR_EL1, x20
> + dsb sy
> +alternative_else_nop_endif

What's this DSB for? If it's needed, please add a comment.

I would have expected an ISB (or none at all as we are going to return
from an exception).

-- 
Catalin

