Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-02-04 Thread Arvind Sankar
On Thu, Feb 04, 2021 at 11:13:18PM +0100, Borislav Petkov wrote:
> On Thu, Feb 04, 2021 at 04:43:58PM -0500, Arvind Sankar wrote:
> > This should check EFI_VA_END instead of EFI_VA_START, and maybe throw in
> > a BUG_ON if EFI_VA_END >= EFI_VA_START.
> 
> No need:
> 
> if (efi_va < EFI_VA_END) {
> pr_warn(FW_WARN "VA address range overflow!\n");
> return;
> }
> 
> We already check we're not going over at map time. And our runtime
> services range is hardcoded. And we're switching to that PGD on each
> runtime services call.
> 
> So I don't see the point for keeping any of the assertions.
> 
> Unless you have other valid arguments for keeping them...
> 

No, I don't have any objections to removing them altogether. All the
comments other than the one about changing the #define's only apply if
it's decided to keep them.


Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-02-04 Thread Arvind Sankar
On Thu, Feb 04, 2021 at 11:51:55AM +0100, Borislav Petkov wrote:
> On Wed, Feb 03, 2021 at 09:29:18PM +0100, Ard Biesheuvel wrote:
> > I think we have agreement on the approach but it is unclear who is
> > going to write the patch.
> 
> How's that below?
> 
> And frankly, I'd even vote for removing those assertions altogether. If
> somehow the EFI pgd lands somewhere else, the kernel will crash'n'burn
> spectacularly and quickly so it's not like we won't catch it...

Removing altogether should be fine, but see below if we don't.

> 
> ---
> diff --git a/arch/x86/include/asm/pgtable_64_types.h 
> b/arch/x86/include/asm/pgtable_64_types.h
> index 91ac10654570..b6be19c09841 100644
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -156,8 +156,8 @@ extern unsigned int ptrs_per_p4d;
>  #define CPU_ENTRY_AREA_PGD   _AC(-4, UL)
>  #define CPU_ENTRY_AREA_BASE  (CPU_ENTRY_AREA_PGD << P4D_SHIFT)
>  
> -#define EFI_VA_START ( -4 * (_AC(1, UL) << 30))
> -#define EFI_VA_END   (-68 * (_AC(1, UL) << 30))
> +#define EFI_VA_START ( -4UL * (_AC(1, UL) << 30))
> +#define EFI_VA_END   (-68UL * (_AC(1, UL) << 30))

This doesn't have any effect, right? And the reason for the _AC() stuff
in there is to allow the #define to be used in assembler -- this
particular one isn't, but it makes no sense to use the UL suffix as well
as _AC() in the same macro.
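
For reference, _AC() boils down to this (simplified from
include/uapi/linux/const.h):

#ifdef __ASSEMBLY__
#define _AC(X, Y)	X
#else
#define __AC(X, Y)	(X##Y)
#define _AC(X, Y)	__AC(X, Y)
#endif

so _AC(1, UL) is already 1UL in C and a plain 1 for the assembler, while a
bare -4UL/-68UL multiplier duplicates the suffix in C and wouldn't even
assemble if the macro were ever used from .S code.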

>  
>  #define EARLY_DYNAMIC_PAGE_TABLES	64
>  
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index e1e8d4e3a213..56fdc0bbb554 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -123,9 +123,7 @@ void efi_sync_low_kernel_mappings(void)
>* only span a single PGD entry and that the entry also maps
>* other important kernel regions.
>*/
> - MAYBE_BUILD_BUG_ON(pgd_index(EFI_VA_END) != pgd_index(MODULES_END));
> - MAYBE_BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) !=
> - (EFI_VA_END & PGDIR_MASK));
> + MAYBE_BUILD_BUG_ON((EFI_VA_START & PGDIR_MASK) != PGDIR_MASK);

This check is superfluous. Just do the P4D one.

>  
>   pgd_efi = efi_pgd + pgd_index(PAGE_OFFSET);
>   pgd_k = pgd_offset_k(PAGE_OFFSET);
> @@ -137,8 +135,7 @@ void efi_sync_low_kernel_mappings(void)
>* As with PGDs, we share all P4D entries apart from the one entry
>* that covers the EFI runtime mapping space.
>*/
> - BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END));
> - BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & P4D_MASK));
> + BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != P4D_MASK);

This should check EFI_VA_END instead of EFI_VA_START, and maybe throw in
a BUG_ON if EFI_VA_END >= EFI_VA_START.
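
Something like this is what I mean (untested sketch; since both macros are
compile-time constants, the second check can be a BUILD_BUG_ON too):

	BUILD_BUG_ON((EFI_VA_END & P4D_MASK) != P4D_MASK);	/* end of range is in the top-most P4D entry */
	BUILD_BUG_ON(EFI_VA_END >= EFI_VA_START);		/* EFI_VA_END is the lower address of the range */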

>  
>   pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
>   pgd_k = pgd_offset_k(EFI_VA_END);
> 
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH v6 1/2] Kbuild: make DWARF version a choice

2021-01-29 Thread Arvind Sankar
On Fri, Jan 29, 2021 at 12:57:20PM -0800, Nick Desaulniers wrote:
> On Fri, Jan 29, 2021 at 12:19 PM Nick Desaulniers
>  wrote:
> >
> > On Fri, Jan 29, 2021 at 12:17 PM Jakub Jelinek  wrote:
> > >
> > > On Fri, Jan 29, 2021 at 11:43:17AM -0800, Nick Desaulniers wrote:
> > > > Modifies CONFIG_DEBUG_INFO_DWARF4 to be a member of a choice. Adds an
> > > > explicit CONFIG_DEBUG_INFO_DWARF2, which is the default. Does so in a
> > > > way that's forward compatible with existing configs, and makes adding
> > > > future versions more straightforward.
> > > >
> > > > Suggested-by: Arvind Sankar 
> > > > Suggested-by: Fangrui Song 
> > > > Suggested-by: Nathan Chancellor 
> > > > Suggested-by: Masahiro Yamada 
> > > > Signed-off-by: Nick Desaulniers 
> > > > ---
> > > >  Makefile  |  6 +++---
> > > >  lib/Kconfig.debug | 21 -
> > > >  2 files changed, 19 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/Makefile b/Makefile
> > > > index 95ab9856f357..20141cd9319e 100644
> > > > --- a/Makefile
> > > > +++ b/Makefile
> > > > @@ -830,9 +830,9 @@ ifneq ($(LLVM_IAS),1)
> > > >  KBUILD_AFLAGS+= -Wa,-gdwarf-2
> > > >  endif
> > > >
> > > > -ifdef CONFIG_DEBUG_INFO_DWARF4
> > > > -DEBUG_CFLAGS += -gdwarf-4
> > > > -endif
> > > > +dwarf-version-$(CONFIG_DEBUG_INFO_DWARF2) := 2
> > > > +dwarf-version-$(CONFIG_DEBUG_INFO_DWARF4) := 4
> > > > +DEBUG_CFLAGS += -gdwarf-$(dwarf-version-y)
> > >
> > > Why do you make DWARF2 the default?  That seems a big step back from what
> > > the Makefile used to do before, where it defaulted to whatever DWARF 
> > > version
> > > the compiler defaulted to?
> > > E.g. GCC 4.8 up to 10 defaults to -gdwarf-4 and GCC 11 will default to
> > > -gdwarf-5.
> > > DWARF2 is more than 27 years old standard, DWARF3 15 years old,
> > > DWARF4 over 10 years old and DWARF5 almost 4 years old...
> > > It is true that some tools aren't DWARF5 ready at this point, but with GCC
> > > defaulting to that it will change quickly, but at least DWARF4 support has
> > > been around for years.
> >
> > I agree with you; I also do not want to change the existing defaults
> > in this series. That is a separate issue to address.
> 
> Thinking more about this over lunch...
> 
> I agree that DWARF v2 is quite old and I don't have a concrete reason
> why the Linux kernel should continue to support it in 2021.
> 
> I agree that this patch takes away the compiler vendor's choice as to
> what the implicit default choice is for dwarf version for the kernel.
> (We, the Linux kernel, do so already for implicit default -std=gnuc*
> as well).
> 
> I would not mind making this commit more explicit along the lines of:
> """
> If you previously had not explicitly opted into
> CONFIG_DEBUG_INFO_DWARF4, you will be opted in to
> CONFIG_DEBUG_INFO_DWARF2 rather than the compiler's implicit default
> (which changes over time).
> """
> If you would rather see dwarf4 be the explicit default, that can be
> done before or after this patch series, but to avoid further
> "rope-a-dope" over getting DWARFv5 enabled, I suggest waiting until
> after.
> 
> If Masahiro or Arvind (or whoever) feel differently about preserving
> the previous "don't care" behavior related to DWARF version for
> developers who had previously not opted in to
> CONFIG_DEBUG_INFO_DWARF4, I can drop this patch, and resend v7 of
> 0002/0002 simply adding CONFIG_DEBUG_INFO_DWARF5 and making that and
> CONFIG_DEBUG_INFO_DWARF4 depend on ! each other (I think).  But I'm
> going to suggest we follow the Zen of Python: explicit is better than
> implicit.  Supporting "I choose not to choose (my dwarf version)"
> doesn't seem worthwhile to me, but could be convinced otherwise.
> -- 
> Thanks,
> ~Nick Desaulniers

Given what Jakub is saying, i.e. it was previously impossible to get
dwarf2 with gcc, and you get dwarf4 whether or not DEBUG_INFO_DWARF4 was
actually selected, we should make the default choice DEBUG_INFO_DWARF4
with the new menu to avoid surprising users. We should probably just
drop DWARF2 and make the menu in this patch have only DWARF4, and then
add DWARF5 as the second choice. The menu is still a good thing for
future-proofing even if it only has two options currently.

Thanks.


Re: [PATCH v4 1/2] arm: lib: xor-neon: remove unnecessary GCC < 4.6 warning

2021-01-20 Thread Arvind Sankar
On Wed, Jan 20, 2021 at 03:09:53PM -0800, Nick Desaulniers wrote:
> On Tue, Jan 19, 2021 at 1:35 PM Arnd Bergmann  wrote:
> >
> > On Tue, Jan 19, 2021 at 10:18 PM 'Nick Desaulniers' via Clang Built
> > Linux  wrote:
> > >
> > > On Tue, Jan 19, 2021 at 5:17 AM Adrian Ratiu  
> > > wrote:
> > > > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > > > index b99dd8e1c93f..f9f3601cc2d1 100644
> > > > --- a/arch/arm/lib/xor-neon.c
> > > > +++ b/arch/arm/lib/xor-neon.c
> > > > @@ -14,20 +14,22 @@ MODULE_LICENSE("GPL");
> > > >  #error You should compile this file with '-march=armv7-a 
> > > > -mfloat-abi=softfp -mfpu=neon'
> > > >  #endif
> > > >
> > > > +/*
> > > > + * TODO: Even though -ftree-vectorize is enabled by default in Clang, 
> > > > the
> > > > + * compiler does not produce vectorized code due to its cost model.
> > > > + * See: https://github.com/ClangBuiltLinux/linux/issues/503
> > > > + */
> > > > +#ifdef CONFIG_CC_IS_CLANG
> > > > +#warning Clang does not vectorize code in this file.
> > > > +#endif
> > >
> > > Arnd, remind me again why it's a bug that the compiler's cost model
> > > says it's faster to not produce a vectorized version of these loops?
> > > I stand by my previous comment: 
> > > https://bugs.llvm.org/show_bug.cgi?id=40976#c8
> >
> > The point is that without vectorizing the code, there is no point in 
> > building
> > both the default xor code and a "neon" version that has to save/restore
> > the neon registers but doesn't actually use them.
> 
> Doesn't that already happen today with GCC when the pointer arguments
> are overlapping?  The loop is "versioned" and may not actually use the
> NEON registers at all at runtime for such arguments.
> https://godbolt.org/z/q48q8v  See also:
> https://bugs.llvm.org/show_bug.cgi?id=40976#c11.  Or am I missing
> something?

The gcc version is at least useful when the arguments _don't_ overlap,
which is presumably most/all the time.

There's no point to building this file if it's not going to produce a
vectorized version at all. The warning seems unnecessary, and it's not
really a compiler bug either -- until we agree on a way to get clang to
produce a vectorized version, the best thing would be for the neon
version to not be built at all with clang. Is that too messy to achieve?

> 
> So I'm thinking if we extend out this pattern to the rest of the
> functions, we can actually avoid calls to
> kernel_neon_begin()/kernel_neon_end() for cases in which pointers
> would be too close to use the vectorized loop version; meaning for GCC
> this would be an optimization (don't save neon registers when you're
> not going to use them).  I would probably consider moving
> include/asm-generic/xor.h somewhere under arch/arm/
> perhaps...err...something for the other users of <asm-generic/xor.h>.

We can't directly do the patch below since there are other users of the
asm-generic/xor.h implementations than just the neon file.  If it's too
much work to check and add __restrict everywhere, I think we'd either
need to copy the code into the xor-neon file, or maybe do some ifdeffery
so __restrict is only used for the neon version.
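
To make the ifdeffery idea concrete, something along these lines (a sketch
only -- XOR_RESTRICT is a made-up name, and just one of the reference
routines is shown):

/*
 * In asm-generic/xor.h (or a copy of it): let a wrapper opt in to
 * __restrict, so only the NEON build promises that the buffers don't alias.
 */
#ifdef XOR_RESTRICT_POINTERS
#define XOR_RESTRICT	__restrict
#else
#define XOR_RESTRICT
#endif

static void
xor_8regs_2(unsigned long bytes, unsigned long * XOR_RESTRICT p1,
	    unsigned long * XOR_RESTRICT p2)
{
	long lines = bytes / (sizeof(long)) / 8;

	do {
		p1[0] ^= p2[0];
		p1[1] ^= p2[1];
		p1[2] ^= p2[2];
		p1[3] ^= p2[3];
		p1[4] ^= p2[4];
		p1[5] ^= p2[5];
		p1[6] ^= p2[6];
		p1[7] ^= p2[7];
		p1 += 8;
		p2 += 8;
	} while (--lines > 0);
}

xor-neon.c would then #define XOR_RESTRICT_POINTERS before including the
header, while every other user keeps the current signatures.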

> 
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index aefddec79286..abd748d317e8 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -148,12 +148,12 @@ extern struct xor_block_template const
> xor_block_neon_inner;
>  static void
>  xor_neon_2(unsigned long bytes, unsigned long *p1, unsigned long *p2)
>  {
> -   if (in_interrupt()) {
> -   xor_arm4regs_2(bytes, p1, p2);
> -   } else {
> +   if (!in_interrupt() && abs((uintptr_t)p1, (uintptr_t)p2) >= 8) {
> kernel_neon_begin();
> xor_block_neon_inner.do_2(bytes, p1, p2);
> kernel_neon_end();
> +   } else {
> +   xor_arm4regs_2(bytes, p1, p2);
> }
>  }
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index b99dd8e1c93f..0e8e474c0523 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -14,22 +14,6 @@ MODULE_LICENSE("GPL");
>  #error You should compile this file with '-march=armv7-a
> -mfloat-abi=softfp -mfpu=neon'
>  #endif
> 
> -/*
> - * Pull in the reference implementations while instructing GCC (through
> - * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> - * NEON instructions.
> - */
> -#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 6)
> -#pragma GCC optimize "tree-vectorize"
> -#else
> -/*
> - * While older versions of GCC do not generate incorrect code, they fail to
> - * recognize the parallel nature of these functions, and emit plain ARM code,
> - * which is known to be slower than the optimized ARM code in asm-arm/xor.h.
> - */
> -#warning This code requires at least version 4.6 of GCC
> -#endif
> -
>  #pragma GCC diagnostic ignored "-Wunused-variable"
>  #include <asm-generic/xor.h>
> diff --git a/include/asm-generic/xor.h b/include/asm-generic/xor.h
> index 

Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-01-18 Thread Arvind Sankar
On Mon, Jan 18, 2021 at 09:24:09PM +0100, Borislav Petkov wrote:
> > > > As a matter of fact, it seems like the four assertions could be combined
> > > > into:
> > > >   BUILD_BUG_ON((EFI_VA_END & P4D_MASK) != (MODULES_END & P4D_MASK));
> > > >   BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & 
> > > > P4D_MASK));
> > > > instead of separately asserting they're the same PGD entry and the same
> > > > P4D entry.
> > > >
> > > > Thanks.
> > >
> > > I actually don't quite get the MODULES_END check -- Ard, do you know
> > > what that's for?
> > >
> > 
> > Maybe Boris remembers? He wrote the original code for the 'new' EFI
> > page table layout.
> 
> That was added by Kirill for 5-level pgtables:
> 
>   e981316f5604 ("x86/efi: Add 5-level paging support")

That just duplicates the existing pgd_index() check for the p4d_index()
as well. It looks like the original commit adding
efi_sync_low_kernel_mappings() used to copy up to the PGD entry including
MODULES_END:
  d2f7cbe7b26a7 ("x86/efi: Runtime services virtual mapping")
and then Matt changed that when creating efi_mm:
  67a9108ed4313 ("x86/efi: Build our own page table structures")
to use EFI_VA_END instead but have a check that EFI_VA_END is in the
same entry as MODULES_END.

AFAICT, MODULES_END is only relevant as being something that happens to
be in the top 512GiB, and -1ul would be clearer.
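
That is, something like this would say the same thing more directly
(illustration only):

	MAYBE_BUILD_BUG_ON(pgd_index(EFI_VA_END) != pgd_index(-1ul));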

> 
>  Documentation/x86/x86_64/mm.rst should explain the pagetable layout:
> 
>    ffffff8000000000 | -512    GB | ffffffeeffffffff |  444 GB | ... unused hole
>    ffffffef00000000 |  -68    GB | fffffffeffffffff |   64 GB | EFI region mapping space
>    ffffffff00000000 |   -4    GB | ffffffff7fffffff |    2 GB | ... unused hole
>    ffffffff80000000 |   -2    GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
>    ffffffff80000000 |-2048    MB |                  |         |
>    ffffffffa0000000 |-1536    MB | fffffffffeffffff | 1520 MB | module mapping space
>    ffffffffff000000 |  -16    MB |                  |         |
>       FIXADDR_START | ~-11    MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
> 
> That thing which starts at -512 GB above is the last PGD on the
> pagetable. In it, between -4G and -68G there are 64G which are the EFI
> region mapping space for runtime services.
> 
> Frankly I'm not sure what this thing is testing because the EFI VA range
> is hardcoded and I can't imagine it being somewhere else *except* in the
> last PGD.

It's just so that someone doesn't just change the #define's for
EFI_VA_END/START and think that it will work, I guess.

Another reasonable option, for example, would be to reserve an entire
PGD entry, allowing everything but the PGD level to be shared, and
adding the EFI PGD to the pgd_list and getting rid of
efi_sync_low_kernel_mappings() altogether. There aren't that many PGD
entries still unused though, so this is probably not worth it.

> 
> Lemme add Kirill for clarification.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-01-15 Thread Arvind Sankar
On Fri, Jan 15, 2021 at 03:12:25PM -0500, Arvind Sankar wrote:
> On Fri, Jan 15, 2021 at 08:54:18PM +0100, Arnd Bergmann wrote:
> > On Fri, Jan 15, 2021 at 8:18 PM Borislav Petkov  wrote:
> > >
> > > On Fri, Jan 15, 2021 at 02:11:25PM -0500, Arvind Sankar wrote:
> > > > That's how build-time assertions work: they are _supposed_ to be
> > > > optimized away completely when the assertion is true. If they're
> > > > _not_ optimized away, the build will fail.
> > >
> > > Yah, that I know, thanks.
> > >
> > > If gcc really inlines p4d_index() and does a lot more aggressive
> > > optimization to determine that the condition is false and thus optimize
> > > everything away (and clang doesn't), then that would explain the
> > > observation.
> > 
> > One difference is that gcc does not have
> > -fsanitize=unsigned-integer-overflow at all, and I don't see the
> > assertion without that on clang either, so it's possible that clang
> > behaves as designed here.
> > 
> > The description is:
> > -fsanitize=unsigned-integer-overflow: Unsigned integer overflow, where
> >  the result of an unsigned integer computation cannot be represented in
> >  its type. Unlike signed integer overflow, this is not undefined 
> > behavior,
> >  but it is often unintentional. This sanitizer does not check for
> > lossy implicit
> >  conversions performed before such a computation (see
> > -fsanitize=implicit-conversion).
> > 
> > The "-68 * ((1UL) << 30" computation does overflow an unsigned long
> > as intended, right? Maybe this is enough for the ubsan code in clang to
> > just disable some of the optimization steps that the assertion relies on.
> > 
> > Arnd
> 
> That does seem to be an overflow, but that happens at compile-time.
> Maybe
>   _AC(-68, UL) << 30
> would be a better definition to avoid overflow.

Eh, that's an overflow too, isn't it :( Is this option really useful
with kernel code -- this sort of thing is probably done all over the
place?

> 
> The real issue might be (ptrs_per_p4d - 1) which can overflow at
> run-time, and maybe the added ubsan code makes p4d_index() not worth
> inlining according to clang?


Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-01-15 Thread Arvind Sankar
On Fri, Jan 15, 2021 at 02:07:51PM -0500, Arvind Sankar wrote:
> On Thu, Jan 07, 2021 at 11:34:15PM +0100, Arnd Bergmann wrote:
> > From: Arnd Bergmann 
> > 
> > When 5-level page tables are enabled, clang triggers a BUILD_BUG_ON():
> > 
> > x86_64-linux-ld: arch/x86/platform/efi/efi_64.o: in function 
> > `efi_sync_low_kernel_mappings':
> > efi_64.c:(.text+0x22c): undefined reference to `__compiletime_assert_354'
> > 
> > Use the same method as in commit c65e774fb3f6 ("x86/mm: Make PGDIR_SHIFT
> > and PTRS_PER_P4D variable") and change it to MAYBE_BUILD_BUG_ON(),
> > so it only triggers for constant input.
> > 
> > Link: https://github.com/ClangBuiltLinux/linux/issues/256
> > Signed-off-by: Arnd Bergmann 
> > ---
> >  arch/x86/platform/efi/efi_64.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> > index e1e8d4e3a213..62bb1616b4a5 100644
> > --- a/arch/x86/platform/efi/efi_64.c
> > +++ b/arch/x86/platform/efi/efi_64.c
> > @@ -137,8 +137,8 @@ void efi_sync_low_kernel_mappings(void)
> >  * As with PGDs, we share all P4D entries apart from the one entry
> >  * that covers the EFI runtime mapping space.
> >  */
> > -   BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END));
> > -   BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & P4D_MASK));
> > +   MAYBE_BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END));
> > +   MAYBE_BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & 
> > P4D_MASK));
> >  
> > pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
> > pgd_k = pgd_offset_k(EFI_VA_END);
> > -- 
> > 2.29.2
> > 
> 
> I think this needs more explanation as to why clang is triggering this.
> The issue mentions clang not inlining p4d_index(), and I guess not
> performing inter-procedural analysis either?
> 
> For the second assertion there, everything is always constant AFAICT:
> EFI_VA_START, EFI_VA_END and P4D_MASK are all constants regardless of
> CONFIG_5LEVEL.
> 
> For the first assertion, it isn't technically constant, but if
> p4d_index() gets inlined, the compiler should be able to see that the
> two are always equal, even though ptrs_per_p4d is not constant:
>   EFI_VA_END >> 39 == MODULES_END >> 39
> so the masking with ptrs_per_p4d-1 doesn't matter for the comparison.
> 
> As a matter of fact, it seems like the four assertions could be combined
> into:
>   BUILD_BUG_ON((EFI_VA_END & P4D_MASK) != (MODULES_END & P4D_MASK));
>   BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & P4D_MASK));
> instead of separately asserting they're the same PGD entry and the same
> P4D entry.
> 
> Thanks.

I actually don't quite get the MODULES_END check -- Ard, do you know
what that's for?

What we really should be checking is that EFI_VA_START is in the top-most
PGD entry and the top-most P4D entry, since we only copy PGD/P4D entries
before EFI_VA_END, but not after EFI_VA_START. So the checks should
really be
BUILD_BUG_ON(((EFI_VA_START - 1) & P4D_MASK) != (-1ul & P4D_MASK));
BUILD_BUG_ON(((EFI_VA_START - 1) & P4D_MASK) != (EFI_VA_END & P4D_MASK));
imo. I guess that's what using MODULES_END is effectively checking, but
it would be clearer to check it directly.


Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-01-15 Thread Arvind Sankar
On Fri, Jan 15, 2021 at 08:54:18PM +0100, Arnd Bergmann wrote:
> On Fri, Jan 15, 2021 at 8:18 PM Borislav Petkov  wrote:
> >
> > On Fri, Jan 15, 2021 at 02:11:25PM -0500, Arvind Sankar wrote:
> > > That's how build-time assertions work: they are _supposed_ to be
> > > optimized away completely when the assertion is true. If they're
> > > _not_ optimized away, the build will fail.
> >
> > Yah, that I know, thanks.
> >
> > If gcc really inlines p4d_index() and does a lot more aggressive
> > optimization to determine that the condition is false and thus optimize
> > everything away (and clang doesn't), then that would explain the
> > observation.
> 
> One difference is that gcc does not have
> -fsanitize=unsigned-integer-overflow at all, and I don't see the
> assertion without that on clang either, so it's possible that clang
> behaves as designed here.
> 
> The description is:
> -fsanitize=unsigned-integer-overflow: Unsigned integer overflow, where
>  the result of an unsigned integer computation cannot be represented in
>  its type. Unlike signed integer overflow, this is not undefined behavior,
>  but it is often unintentional. This sanitizer does not check for
> lossy implicit
>  conversions performed before such a computation (see
> -fsanitize=implicit-conversion).
> 
> The "-68 * ((1UL) << 30" computation does overflow an unsigned long
> as intended, right? Maybe this is enough for the ubsan code in clang to
> just disable some of the optimization steps that the assertion relies on.
> 
> Arnd

That does seem to be an overflow, but that happens at compile-time.
Maybe
_AC(-68, UL) << 30
would be a better definition to avoid overflow.

The real issue might be (ptrs_per_p4d - 1) which can overflow at
run-time, and maybe the added ubsan code makes p4d_index() not worth
inlining according to clang?


Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-01-15 Thread Arvind Sankar
On Fri, Jan 15, 2021 at 08:07:29PM +0100, Borislav Petkov wrote:
> On Fri, Jan 15, 2021 at 11:32:03AM -0700, Nathan Chancellor wrote:
> > I triggered it with CONFIG_UBSAN=y + CONFIG_UBSAN_UNSIGNED_OVERFLOW=y
> > (it can be exposed with an allyesconfig/allmodconfig on mainline
> > currently).
> 
> Yah, I can trigger with that, thanks.
> 
> But I'll be damned, check this out:
> 
> clang preprocesses to this:
> 
>  do { extern void __compiletime_assert_332(void) ; if (!(!(p4d_index((-68 * ((1UL) << 30))) != p4d_index((0xffffffffff000000UL))))) __compiletime_assert_332(); } while (0);
> 
> The resulting asm is:
> 
> .LBB1_32:
>         movabsq $-73014444032, %r13     # imm = 0xFFFFFFEF00000000
>         testb   $1, %al
>         jne     .LBB1_33
> .LBB1_34:
>         xorl    %r14d, %ebx
>         testl   $33554431, %ebx         # imm = 0x1FFFFFF
>         je      .LBB1_36
> # %bb.35:
>         callq   __compiletime_assert_332
> 
> so the undefined symbol is there, leading to:
> 
> ld: arch/x86/platform/efi/efi_64.o: in function 
> `efi_sync_low_kernel_mappings':
> /home/boris/kernel/linux/arch/x86/platform/efi/efi_64.c:140: undefined 
> reference to `__compiletime_assert_332'
> 
> Now look at gcc:
> 
> It preprocesses to:
> 
>  do { extern void __compiletime_assert_332(void) __attribute__((__error__("BUILD_BUG_ON failed: " "p4d_index(EFI_VA_END) != p4d_index(MODULES_END)"))); if (!(!(p4d_index((-68 * ((1UL) << 30))) != p4d_index((0xffffffffff000000UL))))) __compiletime_assert_332(); } while (0);
> 
> 
> Resulting asm:
> 
> $ grep __compiletime_assert_332  arch/x86/platform/efi/efi_64.s
> $
> 
> That thing has been optimized away!
> 
> Which means, those build assertions are gone on gcc and they don't catch
> diddly squat. I sure hope I'm missing something here...

That's how build-time assertions work: they are _supposed_ to be
optimized away completely when the assertion is true. If they're
_not_ optimized away, the build will fail.
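
For reference, the machinery is roughly this (simplified from
include/linux/compiler_types.h and build_bug.h; NNN stands in for the
__COUNTER__-generated suffix like the _332 above):

/*
 * __compiletime_assert_NNN() is declared but never defined.  gcc's
 * __error__ attribute turns any call that survives optimization into a
 * compile-time error; if the call survives and the attribute isn't acted
 * on, the failure shows up as an undefined reference at link time instead,
 * which is what the clang build above ran into.
 */
#define compiletime_assert(condition, msg)				\
	do {								\
		extern void __compiletime_assert_NNN(void)		\
			__attribute__((__error__(msg)));		\
		if (!(condition))					\
			__compiletime_assert_NNN();			\
	} while (0)

#define BUILD_BUG_ON(condition)						\
	compiletime_assert(!(condition), "BUILD_BUG_ON failed: " #condition)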


Re: [PATCH] x86: efi: avoid BUILD_BUG_ON() for non-constant p4d_index

2021-01-15 Thread Arvind Sankar
On Thu, Jan 07, 2021 at 11:34:15PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> When 5-level page tables are enabled, clang triggers a BUILD_BUG_ON():
> 
> x86_64-linux-ld: arch/x86/platform/efi/efi_64.o: in function 
> `efi_sync_low_kernel_mappings':
> efi_64.c:(.text+0x22c): undefined reference to `__compiletime_assert_354'
> 
> Use the same method as in commit c65e774fb3f6 ("x86/mm: Make PGDIR_SHIFT
> and PTRS_PER_P4D variable") and change it to MAYBE_BUILD_BUG_ON(),
> so it only triggers for constant input.
> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/256
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/x86/platform/efi/efi_64.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
> index e1e8d4e3a213..62bb1616b4a5 100644
> --- a/arch/x86/platform/efi/efi_64.c
> +++ b/arch/x86/platform/efi/efi_64.c
> @@ -137,8 +137,8 @@ void efi_sync_low_kernel_mappings(void)
>* As with PGDs, we share all P4D entries apart from the one entry
>* that covers the EFI runtime mapping space.
>*/
> - BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END));
> - BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & P4D_MASK));
> + MAYBE_BUILD_BUG_ON(p4d_index(EFI_VA_END) != p4d_index(MODULES_END));
> + MAYBE_BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & 
> P4D_MASK));
>  
>   pgd_efi = efi_pgd + pgd_index(EFI_VA_END);
>   pgd_k = pgd_offset_k(EFI_VA_END);
> -- 
> 2.29.2
> 

I think this needs more explanation as to why clang is triggering this.
The issue mentions clang not inlining p4d_index(), and I guess not
performing inter-procedural analysis either?

For the second assertion there, everything is always constant AFAICT:
EFI_VA_START, EFI_VA_END and P4D_MASK are all constants regardless of
CONFIG_5LEVEL.

For the first assertion, it isn't technically constant, but if
p4d_index() gets inlined, the compiler should be able to see that the
two are always equal, even though ptrs_per_p4d is not constant:
EFI_VA_END >> 39 == MODULES_END >> 39
so the masking with ptrs_per_p4d-1 doesn't matter for the comparison.
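
For concreteness, with the actual values (plain arithmetic):

	EFI_VA_END  = 0xffffffef00000000, EFI_VA_END  >> P4D_SHIFT = 0x1ffffff
	MODULES_END = 0xffffffffff000000, MODULES_END >> P4D_SHIFT = 0x1ffffff

so (addr >> 39) & (ptrs_per_p4d - 1) is the same for both, whatever
ptrs_per_p4d turns out to be at runtime.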

As a matter of fact, it seems like the four assertions could be combined
into:
BUILD_BUG_ON((EFI_VA_END & P4D_MASK) != (MODULES_END & P4D_MASK));
BUILD_BUG_ON((EFI_VA_START & P4D_MASK) != (EFI_VA_END & P4D_MASK));
instead of separately asserting they're the same PGD entry and the same
P4D entry.

Thanks.


[tip: x86/cleanups] x86/mm: Remove duplicate definition of _PAGE_PAT_LARGE

2021-01-08 Thread tip-bot2 for Arvind Sankar
The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID: 4af0e6e39b7ed77796a41537db91d717fedd0ac3
Gitweb:
https://git.kernel.org/tip/4af0e6e39b7ed77796a41537db91d717fedd0ac3
Author:Arvind Sankar 
AuthorDate:Wed, 11 Nov 2020 11:09:46 -05:00
Committer: Borislav Petkov 
CommitterDate: Fri, 08 Jan 2021 22:04:51 +01:00

x86/mm: Remove duplicate definition of _PAGE_PAT_LARGE

_PAGE_PAT_LARGE is already defined next to _PAGE_PAT. Remove the
duplicate.

Fixes: 4efb56649132 ("x86/mm: Tabulate the page table encoding definitions")
Signed-off-by: Arvind Sankar 
Signed-off-by: Borislav Petkov 
Acked-by: Dave Hansen 
Link: https://lkml.kernel.org/r/2020160946.147341-2-nived...@alum.mit.edu
---
 arch/x86/include/asm/pgtable_types.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 394757e..f24d7ef 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -177,8 +177,6 @@ enum page_cache_mode {
 #define __pgprot(x)	((pgprot_t) { (x) } )
 #define __pg(x)	__pgprot(x)
 
-#define _PAGE_PAT_LARGE	(_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
-
 #define PAGE_NONE   __pg(   0|   0|   0|___A|   0|   0|   0|___G)
 #define PAGE_SHARED __pg(__PP|__RW|_USR|___A|__NX|   0|   0|   0)
 #define PAGE_SHARED_EXEC __pg(__PP|__RW|_USR|___A|   0|   0|   0|   0)


Re: [PATCH 2/2] x86/mm: Remove duplicate definition of _PAGE_PAT_LARGE

2021-01-05 Thread Arvind Sankar
On Wed, Nov 11, 2020 at 11:09:46AM -0500, Arvind Sankar wrote:
> _PAGE_PAT_LARGE is already defined next to _PAGE_PAT. Remove the
> duplicate.
> 
> Signed-off-by: Arvind Sankar 
> Fixes: 4efb56649132 ("x86/mm: Tabulate the page table encoding definitions")
> ---
>  arch/x86/include/asm/pgtable_types.h | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h 
> b/arch/x86/include/asm/pgtable_types.h
> index 394757ee030a..f24d7ef8fffa 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -177,8 +177,6 @@ enum page_cache_mode {
>  #define __pgprot(x)  ((pgprot_t) { (x) } )
>  #define __pg(x)  __pgprot(x)
>  
> -#define _PAGE_PAT_LARGE  (_AT(pteval_t, 1) << 
> _PAGE_BIT_PAT_LARGE)
> -
>  #define PAGE_NONE __pg(   0|   0|   0|___A|   0|   0|   0|___G)
>  #define PAGE_SHARED   __pg(__PP|__RW|_USR|___A|__NX|   0|   0|   0)
>  #define PAGE_SHARED_EXEC __pg(__PP|__RW|_USR|___A|   0|   0|   0|   0)
> -- 
> 2.26.2
> 

Ping


Re: [PATCH] x86/kaslr: try process e820 entries if can not get suitable regions from efi

2021-01-05 Thread Arvind Sankar
On Tue, Jan 05, 2021 at 09:54:52AM +0100, Ard Biesheuvel wrote:
> (cc Arvind)
> 
> On Tue, 5 Jan 2021 at 09:54, Lin Feng  wrote:
> >
> > On efi64 x86_64 system, the EFI_CONVENTIONAL_MEMORY regions will not
> > be mapped when making EFI runtime calls. So kexec-tools can not get
> > these from /sys/firmware/efi/runtime-map. Then compressed boot os
> > can not get suitable regions in process_efi_entries and print debug
> > message as follow:
> > Physical KASLR disabled: no suitable memory region!
> > To enable physical kaslr with kexec, call process_e820_entries when
> > no suitable regions in efi memmaps.
> >
> > Signed-off-by: Lin Feng 
> >
> > ---
> >
> > I find a regular of Kernel code and data placement with kexec. It
> > seems unsafe. The reason is showed above.
> >
> > I'm not familiar with efi firmware. I wonder if there are some risks
> > to get regions according to e820 when there is no suitable region
> > in efi memmaps.
> > ---
> >  arch/x86/boot/compressed/kaslr.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/boot/compressed/kaslr.c 
> > b/arch/x86/boot/compressed/kaslr.c
> > index b92fffbe761f..dbd7244b71aa 100644
> > --- a/arch/x86/boot/compressed/kaslr.c
> > +++ b/arch/x86/boot/compressed/kaslr.c
> > @@ -685,6 +685,7 @@ process_efi_entries(unsigned long minimum, unsigned 
> > long image_size)
> >  {
> > struct efi_info *e = &boot_params->efi_info;
> > bool efi_mirror_found = false;
> > +   bool efi_mem_region_found = false;
> > struct mem_vector region;
> > efi_memory_desc_t *md;
> > unsigned long pmap;
> > @@ -742,12 +743,13 @@ process_efi_entries(unsigned long minimum, unsigned 
> > long image_size)
> > !(md->attribute & EFI_MEMORY_MORE_RELIABLE))
> > continue;
> >
> > +   efi_mem_region_found = false;
   ^^ this should be true, not false.

Other than that, I think this should be okay. The reason EFI memmap is
preferred over E820, according to commit

  0982adc74673 ("x86/boot/KASLR: Work around firmware bugs by excluding 
EFI_BOOT_SERVICES_* and EFI_LOADER_* from KASLR's choice")

was to avoid allocating inside EFI_BOOT_SERVICES/EFI_LOADER_DATA etc.
That's not a danger during kexec, and I believe runtime services regions
should be marked as reserved in the E820 map, right?

Also, something a little fishy-looking here is that the first loop to
see if there is any EFI_MEMORY_MORE_RELIABLE region does not apply any
of the checks on the memory region type/attributes. If there is a mirror
region but it isn't conventional memory, or if it was soft-reserved, we
shouldn't be setting efi_mirror_found.
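
For illustration, the mirror scan could apply the same filtering as the
allocation loop further down, something along these lines (untested
sketch):

	for (i = 0; i < nr_desc; i++) {
		md = efi_early_memdesc_ptr(pmap, e->efi_memdesc_size, i);

		if (md->type != EFI_CONVENTIONAL_MEMORY)
			continue;

		if (efi_soft_reserve_enabled() &&
		    (md->attribute & EFI_MEMORY_SP))
			continue;

		if (md->attribute & EFI_MEMORY_MORE_RELIABLE) {
			efi_mirror_found = true;
			break;
		}
	}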


> > region.start = md->phys_addr;
> > region.size = md->num_pages << EFI_PAGE_SHIFT;
> > if (process_mem_region(&region, minimum, image_size))
> > break;
> > }
> > -   return true;
> > +   return efi_mem_region_found;
> >  }
> >  #else
> >  static inline bool
> > --
> > 2.23.0
> >


Re: [PATCH] x86/kaslr: support separate multiple memmaps parameter parsing

2021-01-05 Thread Arvind Sankar
On Tue, Dec 22, 2020 at 07:31:24PM +0800, Pan Zhang wrote:
> When the kernel is loading,
> the load address of the kernel needs to be calculated firstly.
> 
> If the kernel address space layout randomization(`kaslr`) is enabled,
> the memory range reserved by the memmap parameter will be excluded
> to avoid loading the kernel address into the memmap reserved area.
> 
> Currently, this is what the manual says:
>   `memmap = nn [KMG] @ss [KMG]
>   [KNL] Force usage of a specific region of memory.
>   Region of memory to be used is from ss to ss + nn.
>       If @ss [KMG] is omitted, it is equivalent to mem = nn [KMG],
>   which limits max address to nn [KMG].
>   Multiple different regions can be specified,
>   comma delimited.
>   Example:
>   memmap=100M@2G, 100M#3G, 1G!1024G
>   `
> 
> Can we relax the use of memmap?
> In our production environment we see many people who use it like this:
> Separate multiple memmaps parameters to reserve memory,
> memmap=xx\$xxx memmap=xx\$xxx memmap=xx\$xxx memmap=xx\$xxx memmap=xx\$xxx
> 
> If this format is used, and the reserved memory segment is greater than 4,
> there is no way to parse the 5th and subsequent memmaps and the kaslr cannot 
> be disabled by `memmap_too_large`
> so the kernel loading address may fall within the memmap range
> (reserved memory area from memmap after fourth segment),
> which will have bad consequences for use of reserved memory.
> 
> Signed-off-by: Pan Zhang 
> ---
>  arch/x86/boot/compressed/kaslr.c | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/kaslr.c 
> b/arch/x86/boot/compressed/kaslr.c
> index d7408af..24a2778 100644
> --- a/arch/x86/boot/compressed/kaslr.c
> +++ b/arch/x86/boot/compressed/kaslr.c
> @@ -203,9 +203,6 @@ static void mem_avoid_memmap(enum parse_mode mode, char 
> *str)
>  {
>   static int i;
>  
> - if (i >= MAX_MEMMAP_REGIONS)
> - return;
> -
>   while (str && (i < MAX_MEMMAP_REGIONS)) {
>   int rc;
>   unsigned long long start, size;
> @@ -233,7 +230,7 @@ static void mem_avoid_memmap(enum parse_mode mode, char 
> *str)
>   }
>  
>   /* More than 4 memmaps, fail kaslr */
> - if ((i >= MAX_MEMMAP_REGIONS) && str)
> + if ((i >= MAX_MEMMAP_REGIONS) && !memmap_too_large)

I think this should stay the way it was, otherwise KASLR will be
disabled even if exactly MAX_MEMMAP_REGIONS were specified. Removing the
early return as you did above should be enough to cause the flag to be
set if a 5th memmap is specified in a separate parameter, right?

>   memmap_too_large = true;
>  }
>  
> -- 
> 2.7.4
> 


Re: [PATCH] x86/cmdline: Disable jump tables for cmdline.c

2020-12-21 Thread Arvind Sankar
On Mon, Dec 21, 2020 at 11:14:39AM -0800, Nick Desaulniers wrote:
> On Wed, Sep 2, 2020 at 7:31 PM Arvind Sankar  wrote:
> >
> > When CONFIG_RETPOLINE is disabled, Clang uses a jump table for the
> > switch statement in cmdline_find_option (jump tables are disabled when
> > CONFIG_RETPOLINE is enabled). This function is called very early in boot
> > from sme_enable() if CONFIG_AMD_MEM_ENCRYPT is enabled. At this time,
> 
> Hi Arvind, sorry I missed this when you first sent it.  I'm going
> through and mass deleting my inbox (email bankruptcy) but noticed
> this.  I couldn't reproduce jump tables in cmdline_find_option with
> CONFIG_RETPOLINE disabled but CONFIG_AMD_MEM_ENCRYPT on today's
> linux-next. Can you please confirm that this is still an issue? I will
> reread the disassembly, but it looks like a bunch of cmp/test+jumps.
> 

The patch got merged and is in v5.10 -- are you testing with it reverted
or just plain linux-next?


Re: [RFC PATCH v1 07/12] efi: Replace strstarts() by str_has_prefix().

2020-12-11 Thread Arvind Sankar
On Fri, Dec 11, 2020 at 09:45:15AM +, David Laight wrote:
> From: Arvind Sankar
> > Sent: 10 December 2020 18:14
> ...
> > I wasn't aware of str_has_prefix() at the time. It does seem useful to
> > eliminate the duplication of the string literal, I like the
> > skip_prefix() API suggestion, maybe even
> > 
> > bool str_skip_prefix(const char **s, const char *pfx)
> > {
> > size_t len = str_has_prefix(*s, pfx);
> > *s += len;
> > return !!len;
> > }
> > ...
> > if (str_skip_prefix(&option, prefix)) { ... }
> > 
> > to avoid the intermediate variable.
> 
> That'll generate horrid code - the 'option' variable has to be
> repeatedly reloaded from memory (unless it is all inlined).
> 
> Perhaps the #define
> 
> #define str_skip_prefix(str, prefix) \
> {( \
>   size_t _pfx_len = strlen(prefix)); \
>   memcmp(str, pfx, _pfx_len) ? 0 : ((str) += _pfx_len, 1); \
> )}
> 
> There's probably something that'll let you use sizeof() instead
> of strlen() for quoted strings (if only sizeof pfx != sizeof (char *)).
> 
>   David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)

Avoiding the intermediate varable is for readability at the call-site,
not performance. The performance of this function shouldn't matter --
you shouldn't be parsing strings in a hotpath anyway, and nobody cares
if the EFI stub takes a few nanoseconds longer if that makes the code
more robust and/or readable.

That said,
- I'd be surprised if the overhead of an L1D cache access was measurable
  over the strncmp()/strlen() calls.
- You can't replace strncmp() with memcmp().
- Regarding sizeof() vs strlen(), you can simply use __builtin_strlen()
  to optimize string literals, there's no need for hacks using the
  preprocessor.
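
For the last two points, a minimal sketch of what I mean (hypothetical
helper, not something I'm proposing as-is):

static inline bool str_skip_prefix(const char **s, const char *pfx)
{
	/* __builtin_strlen() folds to a constant when pfx is a literal */
	size_t len = __builtin_strlen(pfx);

	/* strncmp() stops at the first NUL, so a shorter *s is handled safely */
	if (strncmp(*s, pfx, len))
		return false;

	*s += len;
	return true;
}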


Re: [PATCH v2 4/4] Kbuild: implement support for DWARF v5

2020-12-10 Thread Arvind Sankar
On Thu, Dec 10, 2020 at 03:18:45PM -0800, Nick Desaulniers wrote:
> On Fri, Dec 4, 2020 at 9:06 AM Arvind Sankar  wrote:
> >
> > Why? Does this actually cause any problems?
> >
> > It seems like the options for gcc can actually be very straightforward:
> > you just need a cc-option check, and then add -gdwarf-N to both CFLAGS
> > and AFLAGS and you're done.  Adding the -Wa flag is superfluous and
> > carries the risk of interfering with what the compiler driver does. Just
> > let the gcc driver handle the details.
> >
> > Clang/IAS is almost as straightforward, with the only additional edge
> > case being that for assembler files, DWARF 2 doesn't work, so the CFLAGS
> > is the same -gdwarf-N, but AFLAGS gets -gdwarf-N only if N > 2.
> >
> > The messy case is only Clang/IAS=0, which needs to check the support
> > from the external assembler, and needs CFLAGS of -gdwarf-N and AFLAGS of
> > -Wa,--gdwarf-N, because Clang doesn't pass that option on to an external
> > assembler. This is why I was asking if the assembler support check can
> > be restricted to CC_IS_CLANG: nothing but Clang/IAS=0 actually requires
> > that check.
> 
> Oh, I see. Yeah, that might be a nicer approach.  What should we do in
> the case of gcc < 7 though, where -gdwarf-5 won't produce DWARF v5?
> Maybe that's ok, but the intent behind the Kconfig check was to
> prevent the option from being selectable if the tools do not support
> it.  Maybe it's more flexible to pass the arguments along, and hope
> for the best?
> 
> As a gcc-5 user, I might be surprised if I chose
> CONFIG_DEBUG_INFO_DWARF5 if what I got was not actually DWARF v5; it
> does violate the principle of least surprise.  Maybe that doesn't
> matter though?

Even the current gcc documentation still says "DWARF Version 5 is only
experimental".  If the user wants to try it out, I think it's fine to
let them get whatever subset their tool chain produces, as long as it's
not completely broken. Your latest help text does say that gcc 7+ is
required, maybe add another sentence saying that gcc 5+ only has partial
support for some draft DWARF 5 features?

Thanks.


Re: [RFC PATCH v1 07/12] efi: Replace strstarts() by str_has_prefix().

2020-12-10 Thread Arvind Sankar
On Sat, Dec 05, 2020 at 08:36:02PM +0100, Ard Biesheuvel wrote:
> On Fri, 4 Dec 2020 at 19:02, James Bottomley
>  wrote:
> >
> > On Fri, 2020-12-04 at 18:07 +0100, Ard Biesheuvel wrote:
> > > On Fri, 4 Dec 2020 at 18:06, 
> > > wrote:
> > > > From: Francis Laniel 
> > > >
> > > > The two functions indicates if a string begins with a given prefix.
> > > > The only difference is that strstarts() returns a bool while
> > > > str_has_prefix()
> > > > returns the length of the prefix if the string begins with it or 0
> > > > otherwise.
> > > >
> > >
> > > Why?
> >
> > I think I can answer that.  If the conversion were done properly (which
> > it's not) you could get rid of the double strings in the code which are
> > error prone if you update one and forget another.  This gives a good
> > example: 3d739c1f6156 ("tracing: Use the return of str_has_prefix() to
> > remove open coded numbers"). so in your code you'd replace things like
> >
> > if (strstarts(option, "rgb")) {
> > option += strlen("rgb");
> > ...
> >
> > with
> >
> > len = str_has_prefix(option, "rgb");
> > if (len) {
> > option += len
> > ...
> >
> > Obviously you also have cases where strstart is used as a boolean with
> > no need to know the length ... I think there's no value to converting
> > those.
> >
> 
> This will lead to worse code being generated. strlen() is evaluated at
> build time by the compiler if the argument is a string literal, so
> your 'before' version gets turned into 'option += 3', whereas the
> latter needs to use a runtime variable.

The EFI stub is -ffreestanding, so you actually get multiple calls to
strlen() in any case. I could have used strncmp() directly with sizeof()
to avoid that, but the strstarts()/strlen() was slightly more readable
and the performance of this code doesn't really matter.

I wasn't aware of str_has_prefix() at the time. It does seem useful to
eliminate the duplication of the string literal, I like the
skip_prefix() API suggestion, maybe even

bool str_skip_prefix(const char **s, const char *pfx)
{
size_t len = str_has_prefix(*s, pfx);
*s += len;
return !!len;
}
...
if (str_skip_prefix(&option, prefix)) { ... }

to avoid the intermediate variable.

> 
> So I don't object to using str_has_prefix() in new code in this way,
> but I really don't see the point of touching existing code.


[tip: x86/urgent] x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP

2020-12-10 Thread tip-bot2 for Arvind Sankar
The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: 29ac40cbed2bc06fa218ca25d7f5e280d3d08a25
Gitweb:
https://git.kernel.org/tip/29ac40cbed2bc06fa218ca25d7f5e280d3d08a25
Author:Arvind Sankar 
AuthorDate:Wed, 11 Nov 2020 11:09:45 -05:00
Committer: Borislav Petkov 
CommitterDate: Thu, 10 Dec 2020 12:28:06 +01:00

x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP

The PAT bit is in different locations for 4k and 2M/1G page table
entries.

Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
index for write-protected pages.

Fixes: 6ebcb060713f ("x86/mm: Add support to encrypt the kernel in-place")
Signed-off-by: Arvind Sankar 
Signed-off-by: Borislav Petkov 
Tested-by: Tom Lendacky 
Cc: sta...@vger.kernel.org
Link: https://lkml.kernel.org/r/2020160946.147341-1-nived...@alum.mit.edu
---
 arch/x86/include/asm/pgtable_types.h | 1 +
 arch/x86/mm/mem_encrypt_identity.c   | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 816b31c..394757e 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -155,6 +155,7 @@ enum page_cache_mode {
 #define _PAGE_ENC  (_AT(pteval_t, sme_me_mask))
 
 #define _PAGE_CACHE_MASK   (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT)
+#define _PAGE_LARGE_CACHE_MASK (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT_LARGE)
 
 #define _PAGE_NOCACHE  (cachemode2protval(_PAGE_CACHE_MODE_UC))
 #define _PAGE_CACHE_WP (cachemode2protval(_PAGE_CACHE_MODE_WP))
diff --git a/arch/x86/mm/mem_encrypt_identity.c 
b/arch/x86/mm/mem_encrypt_identity.c
index 733b983..6c5eb6f 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -45,8 +45,8 @@
 #define PMD_FLAGS_LARGE	(__PAGE_KERNEL_LARGE_EXEC & ~_PAGE_GLOBAL)
 
 #define PMD_FLAGS_DEC  PMD_FLAGS_LARGE
-#define PMD_FLAGS_DEC_WP   ((PMD_FLAGS_DEC & ~_PAGE_CACHE_MASK) | \
-(_PAGE_PAT | _PAGE_PWT))
+#define PMD_FLAGS_DEC_WP   ((PMD_FLAGS_DEC & ~_PAGE_LARGE_CACHE_MASK) | \
+(_PAGE_PAT_LARGE | _PAGE_PWT))
 
 #define PMD_FLAGS_ENC  (PMD_FLAGS_LARGE | _PAGE_ENC)
 


[tip: x86/cpu] x86/cpu/amd: Remove dead code for TSEG region remapping

2020-12-08 Thread tip-bot2 for Arvind Sankar
The following commit has been merged into the x86/cpu branch of tip:

Commit-ID: 262bd5724afdefd4c48a260d6100e78cc43ee06b
Gitweb:
https://git.kernel.org/tip/262bd5724afdefd4c48a260d6100e78cc43ee06b
Author:Arvind Sankar 
AuthorDate:Fri, 27 Nov 2020 12:13:24 -05:00
Committer: Borislav Petkov 
CommitterDate: Tue, 08 Dec 2020 18:45:21 +01:00

x86/cpu/amd: Remove dead code for TSEG region remapping

Commit

  26bfa5f89486 ("x86, amd: Cleanup init_amd")

moved the code that remaps the TSEG region using 4k pages from
init_amd() to bsp_init_amd().

However, bsp_init_amd() is executed well before the direct mapping is
actually created:

  setup_arch()
-> early_cpu_init()
  -> early_identify_cpu()
-> this_cpu->c_bsp_init()
  -> bsp_init_amd()
...
-> init_mem_mapping()

So the change effectively disabled the 4k remapping, because
pfn_range_is_mapped() is always false at this point.

It has been over six years since the commit, and no-one seems to have
noticed this, so just remove the code. The original code was also
incomplete, since it doesn't check how large the TSEG address range
actually is, so it might remap only part of it in any case.

Hygon has copied the incorrect version, so the code has never run on it
since the cpu support was added two years ago. Remove it from there as
well.

Committer notes:

This workaround is incomplete anyway:

1. The code must check MSRC001_0113.TValid (SMM TSeg Mask MSR) first, to
check whether the TSeg address range is enabled.

2. The code must check whether the range is not 2M aligned - if it is,
there's nothing to work around.

3. In all the BIOSes tested, the TSeg range is in a e820 reserved area
and those are not mapped anymore, after

  66520ebc2df3 ("x86, mm: Only direct map addresses that are marked as 
E820_RAM")

which means, there's nothing to be worked around either.

So let's rip it out.

Signed-off-by: Arvind Sankar 
Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/20201127171324.1846019-1-nived...@alum.mit.edu
---
 arch/x86/kernel/cpu/amd.c   | 21 -
 arch/x86/kernel/cpu/hygon.c | 20 
 2 files changed, 41 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 1f71c76..f8ca66f 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -23,7 +23,6 @@
 
 #ifdef CONFIG_X86_64
 # include <asm/mmconfig.h>
-# include <asm/set_memory.h>
 #endif
 
 #include "cpu.h"
@@ -509,26 +508,6 @@ static void early_init_amd_mc(struct cpuinfo_x86 *c)
 
 static void bsp_init_amd(struct cpuinfo_x86 *c)
 {
-
-#ifdef CONFIG_X86_64
-   if (c->x86 >= 0xf) {
-   unsigned long long tseg;
-
-   /*
-* Split up direct mapping around the TSEG SMM area.
-* Don't do it for gbpages because there seems very little
-* benefit in doing so.
-*/
-   if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
-   unsigned long pfn = tseg >> PAGE_SHIFT;
-
-   pr_debug("tseg: %010llx\n", tseg);
-   if (pfn_range_is_mapped(pfn, pfn + 1))
-   set_memory_4k((unsigned long)__va(tseg), 1);
-   }
-   }
-#endif
-
if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {
 
if (c->x86 > 0x10 ||
diff --git a/arch/x86/kernel/cpu/hygon.c b/arch/x86/kernel/cpu/hygon.c
index dc0840a..ae59115 100644
--- a/arch/x86/kernel/cpu/hygon.c
+++ b/arch/x86/kernel/cpu/hygon.c
@@ -14,9 +14,6 @@
 #include 
 #include 
 #include 
-#ifdef CONFIG_X86_64
-# include <asm/set_memory.h>
-#endif
 
 #include "cpu.h"
 
@@ -203,23 +200,6 @@ static void early_init_hygon_mc(struct cpuinfo_x86 *c)
 
 static void bsp_init_hygon(struct cpuinfo_x86 *c)
 {
-#ifdef CONFIG_X86_64
-   unsigned long long tseg;
-
-   /*
-* Split up direct mapping around the TSEG SMM area.
-* Don't do it for gbpages because there seems very little
-* benefit in doing so.
-*/
-   if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
-   unsigned long pfn = tseg >> PAGE_SHIFT;
-
-   pr_debug("tseg: %010llx\n", tseg);
-   if (pfn_range_is_mapped(pfn, pfn + 1))
-   set_memory_4k((unsigned long)__va(tseg), 1);
-   }
-#endif
-
if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {
u64 val;
 


Re: [PATCH v2 4/4] Kbuild: implement support for DWARF v5

2020-12-04 Thread Arvind Sankar
On Thu, Dec 03, 2020 at 03:28:14PM -0800, Nick Desaulniers wrote:
> On Thu, Dec 3, 2020 at 3:22 PM Nick Desaulniers  
> wrote:
> >
> > On Tue, Nov 24, 2020 at 9:28 AM Arvind Sankar  wrote:
> > >
> > > On Tue, Nov 03, 2020 at 04:53:43PM -0800, Nick Desaulniers wrote:
> > > > DWARF v5 is the latest standard of the DWARF debug info format.
> > > >
> > > > Feature detection of DWARF5 is onerous, especially given that we've
> > > > removed $(AS), so we must query $(CC) for DWARF5 assembler directive
> > > > support.  GNU `as` only recently gained support for specifying
> > > > -gdwarf-5.
> > >
> > > With gcc, using -gdwarf-5 even without -Wa,--gdwarf-5 results in
> > > considerably smaller debug info. gcc does not seem to generate the .file 0
> > > directive that causes older GNU as to barf.
> > >
> > > Should the assembler support check be restricted to CC_IS_CLANG?
> >
> > No, because if LLVM_IAS=1 then the assembler support need not be checked.
> 
> Also, if your version of GCC supports DWARF Version 5, but your
> version of GAS does not, then I'm more inclined to not allow
> CONFIG_DEBUG_INFO_DWARF5 to be selectable, rather than mix and match
> or partially support this for one but not the other.  Either all tools
> used support DWARF 5, or you don't get to use DWARF 5.
> 

Why? Does this actually cause any problems?

It seems like the options for gcc can actually be very straightforward:
you just need a cc-option check, and then add -gdwarf-N to both CFLAGS
and AFLAGS and you're done.  Adding the -Wa flag is superfluous and
carries the risk of interfering with what the compiler driver does. Just
let the gcc driver handle the details.

Clang/IAS is almost as straightforward, with the only additional edge
case being that for assembler files, DWARF 2 doesn't work, so the CFLAGS
is the same -gdwarf-N, but AFLAGS gets -gdwarf-N only if N > 2.

The messy case is only Clang/IAS=0, which needs to check the support
from the external assembler, and needs CFLAGS of -gdwarf-N and AFLAGS of
-Wa,--gdwarf-N, because Clang doesn't pass that option on to an external
assembler. This is why I was asking if the assembler support check can
be restricted to CC_IS_CLANG: nothing but Clang/IAS=0 actually requires
that check.
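
Roughly, I'd expect the Makefile side to end up looking something like this
(illustrative sketch only, not a patch; the dwarf-version-y plumbing is the
one from earlier in the series):

dwarf-version-$(CONFIG_DEBUG_INFO_DWARF2) := 2
dwarf-version-$(CONFIG_DEBUG_INFO_DWARF4) := 4
dwarf-version-$(CONFIG_DEBUG_INFO_DWARF5) := 5
DEBUG_CFLAGS	+= -gdwarf-$(dwarf-version-y)

ifdef CONFIG_CC_IS_GCC
# gcc: the compiler driver passes the right option down to gas itself
KBUILD_AFLAGS	+= -gdwarf-$(dwarf-version-y)
else ifeq ($(LLVM_IAS),1)
# clang + integrated assembler: DWARF 2 is not supported for .S files
ifneq ($(dwarf-version-y),2)
KBUILD_AFLAGS	+= -gdwarf-$(dwarf-version-y)
endif
else
# clang + external assembler: clang does not forward -gdwarf-N to it, so
# pass it (and gate it on the assembler support check) explicitly
KBUILD_AFLAGS	+= -Wa,--gdwarf-$(dwarf-version-y)
endif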


Re: [PATCH] x86/cpu/amd: Remove dead code for TSEG region remapping

2020-12-03 Thread Arvind Sankar
On Thu, Dec 03, 2020 at 09:48:57AM +0100, Borislav Petkov wrote:
> On Wed, Dec 02, 2020 at 05:32:32PM -0500, Arvind Sankar wrote:
> > The pfn_range_is_mapped() call just checks whether it is mapped at all
> > in the direct mapping. Is the TSEG range supposed to be marked as
> > non-RAM in the E820 map? AFAICS, the only case when a direct mapping is
> > created for non-RAM is for the 0-1Mb real-mode range, and that will
> > always use 4k pages. Above that anything not marked as RAM will create
> > an unmapped hole in the direct map, so in this case the memory just
> > below the TSEG base would already use smaller pages if needed.
> > 
> > If it's possible that the E820 mapping says this range is RAM, then
> > should we also break up the direct map just after the end of the TSEG
> > range for the same reason?
> 
> So I have a machine where TSEG is not 2M aligned and somewhere in the 1G
> range:
> 
> [1.135094] tseg: 003bf00000
> 
> It is not in the E820 map either:
> 
> [0.019784] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [0.020014] init_memory_mapping: [mem 0x3bc00000-0x3bdfffff]
> [0.020166] init_memory_mapping: [mem 0x20000000-0x3bbfffff]
> [0.020327] init_memory_mapping: [mem 0x00100000-0x1fffffff]
> [0.020677] init_memory_mapping: [mem 0x3be00000-0x3be8ffff]
> 
> That doesn't mean that it can happen that there might be some
> configuration where it ends up being mapped.
> 
> So looking at what the code does, it kinda makes sense: you want the 2M
> range between 0x3be00000 and 0x3c000000 to be split into 4K mappings,
> *if* it is mapped.
> 
> I need to find a box where it is mapped *and* not 2M aligned, though,
> for testing. Which appears kinda hard to do as all the new ones are
> aligned.

Do any of them have it mapped at all, regardless of the alignment? There
seems to be nothing else in the kernel that ever looks at the TSEG MSR,
so I would guess that it has to be non-RAM in the E820 map, otherwise
nothing would prevent the kernel from allocating and using that space.

I found the actual original commit, which does has a description of the
reasoning. It's
  8346ea17aa20 ("x86: split large page mapping for AMD TSEG")

It looks like at the time, the direct mapping didn't really look at the
E820 map in any detail, and was always set up with at least 2Mb pages,
or Gb pages if they were available, from 0 to max_pfn_mapped. So the
direct mapping would have covered even holes that weren't in the E820
map.

Commit
  66520ebc2df3 ("x86, mm: Only direct map addresses that are marked as 
E820_RAM")
changed the direct map setup to avoid mapping holes, because it
apparently became more serious than performance issues: this commit
mentions MCE's getting triggered because of the overmapping.

> 
> The above is from a K8 box which should already be dead, as a matter of
> fact.
> 
> -- 
> Regards/Gruss,
> Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH] x86/cpu/amd: Remove dead code for TSEG region remapping

2020-12-02 Thread Arvind Sankar
On Wed, Dec 02, 2020 at 11:58:15AM -0600, Tom Lendacky wrote:
> On 11/27/20 11:27 AM, Borislav Petkov wrote:
> > On Fri, Nov 27, 2020 at 12:13:24PM -0500, Arvind Sankar wrote:
> >> Commit
> >>26bfa5f89486 ("x86, amd: Cleanup init_amd")
> >> moved the code that remaps the TSEG region using 4k pages from
> >> init_amd() to bsp_init_amd().
> >>
> >> However, bsp_init_amd() is executed well before the direct mapping is
> >> actually created:
> >>
> >>setup_arch()
> >>  -> early_cpu_init()
> >>-> early_identify_cpu()
> >>  -> this_cpu->c_bsp_init()
> >>  -> bsp_init_amd()
> >>  ...
> >>  -> init_mem_mapping()
> >>
> >> So the change effectively disabled the 4k remapping, because
> >> pfn_range_is_mapped() is always false at this point.
> >>
> >> It has been over six years since the commit, and no-one seems to have
> >> noticed this, so just remove the code. The original code was also
> >> incomplete, since it doesn't check how large the TSEG address range
> >> actually is, so it might remap only part of it in any case.
> > 
> > Yah, and the patch which added this:
> > 
> > 6c62aa4a3c12 ("x86: make amd.c have 64bit support code")
> > 
> > does not say what for (I'm not surprised, frankly).
> > 
> > So if AMD folks on Cc don't have any need for actually fixing this
> > properly, yap, we can zap it.
> 
> I believe this is geared towards performance. If the TSEG base address is 
> not 2MB aligned, then hardware has to break down a 2MB TLB entry if the OS 
> references the memory within the 2MB page that is before the TSEG base 
> address. This can occur whenever the 2MB TLB entry is re-installed because 
> of TLB flushes, etc.
> 
> I would hope that newer BIOSes are 2MB aligning the TSEG base address, but 
> if not, then this can help.
> 
> So moving it back wouldn't be a bad thing. It should probably only do the 
> set_memory_4k() if the TSEG base address is not 2MB aligned, which I think 
> is covered by the pfn_range_is_mapped() call?
> 

The pfn_range_is_mapped() call just checks whether it is mapped at all
in the direct mapping. Is the TSEG range supposed to be marked as
non-RAM in the E820 map? AFAICS, the only case when a direct mapping is
created for non-RAM is for the 0-1Mb real-mode range, and that will
always use 4k pages. Above that anything not marked as RAM will create
an unmapped hole in the direct map, so in this case the memory just
below the TSEG base would already use smaller pages if needed.

If it's possible that the E820 mapping says this range is RAM, then
should we also break up the direct map just after the end of the TSEG
range for the same reason?
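
For concreteness, a minimal sketch of what such a conditional split might
look like if the remap were brought back -- purely illustrative, not an
actual patch; it reuses the names from the removed code and assumes
IS_ALIGNED() and PMD_SIZE are usable at that point:

	if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
		unsigned long pfn = tseg >> PAGE_SHIFT;

		/* Only bother if the base is not already 2M aligned. */
		if (!IS_ALIGNED(tseg, PMD_SIZE) &&
		    pfn_range_is_mapped(pfn, pfn + 1))
			set_memory_4k((unsigned long)__va(tseg), 1);
	}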

Thanks.


[PATCH] x86/cpu/amd: Remove dead code for TSEG region remapping

2020-11-27 Thread Arvind Sankar
Commit
  26bfa5f89486 ("x86, amd: Cleanup init_amd")
moved the code that remaps the TSEG region using 4k pages from
init_amd() to bsp_init_amd().

However, bsp_init_amd() is executed well before the direct mapping is
actually created:

  setup_arch()
-> early_cpu_init()
  -> early_identify_cpu()
-> this_cpu->c_bsp_init()
  -> bsp_init_amd()
...
-> init_mem_mapping()

So the change effectively disabled the 4k remapping, because
pfn_range_is_mapped() is always false at this point.

It has been over six years since the commit, and no-one seems to have
noticed this, so just remove the code. The original code was also
incomplete, since it doesn't check how large the TSEG address range
actually is, so it might remap only part of it in any case.

Hygon has copied the incorrect version, so the code has never run on it
since the cpu support was added two years ago. Remove it from there as
well.

Signed-off-by: Arvind Sankar 
---
 arch/x86/kernel/cpu/amd.c   | 21 -
 arch/x86/kernel/cpu/hygon.c | 20 
 2 files changed, 41 deletions(-)

diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 1f71c7616917..f8ca66f3d861 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -23,7 +23,6 @@
 
 #ifdef CONFIG_X86_64
 # include <asm/mmconfig.h>
-# include <asm/set_memory.h>
 #endif
 
 #include "cpu.h"
@@ -509,26 +508,6 @@ static void early_init_amd_mc(struct cpuinfo_x86 *c)
 
 static void bsp_init_amd(struct cpuinfo_x86 *c)
 {
-
-#ifdef CONFIG_X86_64
-   if (c->x86 >= 0xf) {
-   unsigned long long tseg;
-
-   /*
-* Split up direct mapping around the TSEG SMM area.
-* Don't do it for gbpages because there seems very little
-* benefit in doing so.
-*/
-   if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
-   unsigned long pfn = tseg >> PAGE_SHIFT;
-
-   pr_debug("tseg: %010llx\n", tseg);
-   if (pfn_range_is_mapped(pfn, pfn + 1))
-   set_memory_4k((unsigned long)__va(tseg), 1);
-   }
-   }
-#endif
-
if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {
 
if (c->x86 > 0x10 ||
diff --git a/arch/x86/kernel/cpu/hygon.c b/arch/x86/kernel/cpu/hygon.c
index dc0840aae26c..ae59115d18f9 100644
--- a/arch/x86/kernel/cpu/hygon.c
+++ b/arch/x86/kernel/cpu/hygon.c
@@ -14,9 +14,6 @@
 #include 
 #include 
 #include 
-#ifdef CONFIG_X86_64
-# include <asm/set_memory.h>
-#endif
 
 #include "cpu.h"
 
@@ -203,23 +200,6 @@ static void early_init_hygon_mc(struct cpuinfo_x86 *c)
 
 static void bsp_init_hygon(struct cpuinfo_x86 *c)
 {
-#ifdef CONFIG_X86_64
-   unsigned long long tseg;
-
-   /*
-* Split up direct mapping around the TSEG SMM area.
-* Don't do it for gbpages because there seems very little
-* benefit in doing so.
-*/
-   if (!rdmsrl_safe(MSR_K8_TSEG_ADDR, &tseg)) {
-   unsigned long pfn = tseg >> PAGE_SHIFT;
-
-   pr_debug("tseg: %010llx\n", tseg);
-   if (pfn_range_is_mapped(pfn, pfn + 1))
-   set_memory_4k((unsigned long)__va(tseg), 1);
-   }
-#endif
-
if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {
u64 val;
 
-- 
2.26.2



Re: [PATCH v2 4/4] Kbuild: implement support for DWARF v5

2020-11-24 Thread Arvind Sankar
On Tue, Nov 03, 2020 at 04:53:43PM -0800, Nick Desaulniers wrote:
> DWARF v5 is the latest standard of the DWARF debug info format.
> 
> Feature detection of DWARF5 is onerous, especially given that we've
> removed $(AS), so we must query $(CC) for DWARF5 assembler directive
> support.  GNU `as` only recently gained support for specifying
> -gdwarf-5.

With gcc, using -gdwarf-5 even without -Wa,--gdwarf-5 results in
considerably smaller debug info. gcc does not seem to generate the .file 0
directive that causes older GNU as to barf.

Should the assembler support check be restricted to CC_IS_CLANG?

>  /* Stabs debugging sections. */
>  #define STABS_DEBUG  \
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 03c494eefabd..c5b54ba51060 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -274,6 +274,14 @@ config DEBUG_INFO_DWARF4
> It makes the debug information larger, but it significantly
> improves the success of resolving variables in gdb on optimized code.
>  
> +config DEBUG_INFO_DWARF5
> + bool "Generate DWARF5 debuginfo"
> + depends on $(cc-option,-gdwarf-5)
> + depends on $(success,$(srctree)/scripts/test_dwarf5_support.sh $(CC) 
> $(CLANG_FLAGS))
> + help
> +   Genereate dwarf5 debug info. Requires binutils 2.35+, gcc 5.1+, and
> +   gdb 8.0+.
> +
>  endchoice # "DWARF version"

Perhaps this can be expanded with some description of the advantages of
dwarf5 over dwarf4?

>  
>  config DEBUG_INFO_BTF
> diff --git a/scripts/test_dwarf5_support.sh b/scripts/test_dwarf5_support.sh
> new file mode 100755
> index ..156ad5ec4274
> --- /dev/null
> +++ b/scripts/test_dwarf5_support.sh
> @@ -0,0 +1,9 @@
> +#!/bin/sh
> +# SPDX-License-Identifier: GPL-2.0
> +
> +# Test that assembler accepts -gdwarf-5 and .file 0 directives, which were 
> bugs
> +# in binutils < 2.35.
> +# https://sourceware.org/bugzilla/show_bug.cgi?id=25612
> +# https://sourceware.org/bugzilla/show_bug.cgi?id=25614
> +set -e
> +echo '.file 0 "filename"' | $* -Wa,-gdwarf-5 -c -x assembler -o /dev/null -

This also actually needs --gdwarf-5 to really check the support for the
option, but older versions should error on the .file 0 in any case.


Re: [PATCH v2 3/4] Kbuild: make DWARF version a choice

2020-11-24 Thread Arvind Sankar
On Mon, Nov 23, 2020 at 06:33:57PM -0600, Segher Boessenkool wrote:
> On Mon, Nov 23, 2020 at 06:22:10PM -0500, Arvind Sankar wrote:
> > Btw, is -gsplit-dwarf at all useful for assembler files?
> 
> If you invoke the assembler via the compiler, with that flag it still
> creates separate .o and .dwo files (via objcopy invocations as usual).
> Whether that is useful depends on if you have any debug info that can
> be split :-)
> 
> 
> Segher

Right, the latter was what I was really asking :) We don't currently
pass -gsplit-dwarf for assembler and I was wondering if that mattered.

Thanks.


Re: [PATCH v2 3/4] Kbuild: make DWARF version a choice

2020-11-23 Thread Arvind Sankar
On Tue, Nov 03, 2020 at 04:53:42PM -0800, Nick Desaulniers wrote:
> Modifies CONFIG_DEBUG_INFO_DWARF4 to be a member of a choice. Adds an
> explicit CONFIG_DEBUG_INFO_DWARF2, which is the default. Does so in a
> way that's forward compatible with existing configs, and makes adding
> future versions more straightforward.
> 
> Suggested-by: Fangrui Song 
> Suggested-by: Masahiro Yamada 
> Signed-off-by: Nick Desaulniers 
> ---
>  Makefile  | 14 --
>  lib/Kconfig.debug | 19 +++
>  2 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/Makefile b/Makefile
> index 75b1a3dcbf30..e23786a4c1c7 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -826,12 +826,14 @@ else
>  DEBUG_CFLAGS += -g
>  endif
>  
> -ifndef LLVM_IAS
> -KBUILD_AFLAGS+= -Wa,-gdwarf-2
> -endif
> -
> -ifdef CONFIG_DEBUG_INFO_DWARF4
> -DEBUG_CFLAGS += -gdwarf-4
> +dwarf-version-$(CONFIG_DEBUG_INFO_DWARF2) := 2
> +dwarf-version-$(CONFIG_DEBUG_INFO_DWARF4) := 4
> +DEBUG_CFLAGS += -gdwarf-$(dwarf-version-y)
> +ifneq ($(dwarf-version-y)$(LLVM_IAS),21)
> +# Binutils 2.35+ required for -gdwarf-4+ support.
> +dwarf-aflag  := $(call as-option,-Wa$(comma)-gdwarf-$(dwarf-version-y))
> +DEBUG_CFLAGS += $(dwarf-aflag)
> +KBUILD_AFLAGS+= $(dwarf-aflag)
>  endif
>  

For LLVM_IAS=1, adding dwarf-aflag to DEBUG_CFLAGS should not be
necessary, no? This seems to add it for dwarf-4.

The as-option check will only work on binutils 2.35.1 onwards: earlier
versions will silently accept any -gdwarf-N option. Do we care? I think
it'll just get dwarf-2 for assembly files even though dwarf-4 might have
been configured. The earlier versions only error if the double-hyphen
form --gdwarf-N is used.

More generally, do we want to force this option via -Wa or should we
leave it up to the compiler driver when we can? For both Clang/IAS and
gcc/binutils, passing -gdwarf-N in KBUILD_AFLAGS will allow the compiler
to pass on the appropriate option to the assembler (with gcc you only
get --gdwarf-2 for the assembler except on trunk though). The only case
that would absolutely require -Wa is Clang without IAS, might be worth
adding the ability to pass on the flag to the external assembler?

Btw, is -gsplit-dwarf at all useful for assembler files?


Re: [PATCH v2 08/12] x86/paravirt: remove no longer needed 32-bit pvops cruft

2020-11-20 Thread Arvind Sankar
On Fri, Nov 20, 2020 at 12:46:26PM +0100, Juergen Gross wrote:
> PVOP_VCALL4() is only used for Xen PV, while PVOP_CALL4() isn't used
> at all. Keep PVOP_CALL4() for 64 bits due to symmetry reasons.
> 
> This allows to remove the 32-bit definitions of those macros leading
> to a substantial simplification of the paravirt macros, as those were
> the only ones needing non-empty "pre" and "post" parameters.
> 
> PVOP_CALLEE2() and PVOP_VCALLEE2() are used nowhere, so remove them.
> 
> Another no longer needed case is special handling of return types
> larger than unsigned long. Replace that with a BUILD_BUG_ON().
> 
> DISABLE_INTERRUPTS() is used in 32-bit code only, so it can just be
> replaced by cli.
> 
> INTERRUPT_RETURN in 32-bit code can be replaced by iret.
> 
> GET_CR2_INTO_AX and ENABLE_INTERRUPTS are used nowhere, so they can
> be removed.

As a heads-up, GET_CR2_INTO_AX has been removed in tip:master already by
  31d854603305 ("x86/head/64: Remove unused GET_CR2_INTO() macro")


[tip: x86/cleanups] x86/boot: Remove unused finalize_identity_maps()

2020-11-18 Thread tip-bot2 for Arvind Sankar
The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID: 0ac317e89791b76055ef11b952625ef77a1d2eba
Gitweb:
https://git.kernel.org/tip/0ac317e89791b76055ef11b952625ef77a1d2eba
Author:Arvind Sankar 
AuthorDate:Mon, 05 Oct 2020 11:12:07 -04:00
Committer: Borislav Petkov 
CommitterDate: Wed, 18 Nov 2020 16:04:23 +01:00

x86/boot: Remove unused finalize_identity_maps()

Commit

  8570978ea030 ("x86/boot/compressed/64: Don't pre-map memory in KASLR code")

removed all the references to finalize_identity_maps(), but neglected to
delete the actual function. Remove it.

Signed-off-by: Arvind Sankar 
Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/20201005151208.2212886-2-nived...@alum.mit.edu
---
 arch/x86/boot/compressed/ident_map_64.c | 10 --
 1 file changed, 10 deletions(-)

diff --git a/arch/x86/boot/compressed/ident_map_64.c 
b/arch/x86/boot/compressed/ident_map_64.c
index a5e5db6..6bf2022 100644
--- a/arch/x86/boot/compressed/ident_map_64.c
+++ b/arch/x86/boot/compressed/ident_map_64.c
@@ -167,16 +167,6 @@ void initialize_identity_maps(void *rmode)
write_cr3(top_level_pgt);
 }
 
-/*
- * This switches the page tables to the new level4 that has been built
- * via calls to add_identity_map() above. If booted via startup_32(),
- * this is effectively a no-op.
- */
-void finalize_identity_maps(void)
-{
-   write_cr3(top_level_pgt);
-}
-
 static pte_t *split_large_pmd(struct x86_mapping_info *info,
  pmd_t *pmdp, unsigned long __address)
 {


[tip: x86/cleanups] x86/head/64: Remove unused GET_CR2_INTO() macro

2020-11-18 Thread tip-bot2 for Arvind Sankar
The following commit has been merged into the x86/cleanups branch of tip:

Commit-ID: 31d8546033053b98de00846ede8088bdbe38651d
Gitweb:
https://git.kernel.org/tip/31d8546033053b98de00846ede8088bdbe38651d
Author:Arvind Sankar 
AuthorDate:Mon, 05 Oct 2020 11:12:08 -04:00
Committer: Borislav Petkov 
CommitterDate: Wed, 18 Nov 2020 18:09:38 +01:00

x86/head/64: Remove unused GET_CR2_INTO() macro

Commit

  4b47cdbda6f1 ("x86/head/64: Move early exception dispatch to C code")

removed the usage of GET_CR2_INTO().

Drop the definition as well, and related definitions in paravirt.h and
asm-offsets.h

Signed-off-by: Arvind Sankar 
Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/20201005151208.2212886-3-nived...@alum.mit.edu
---
 arch/x86/include/asm/paravirt.h | 11 ---
 arch/x86/kernel/asm-offsets.c   |  1 -
 arch/x86/kernel/head_64.S   |  9 -
 3 files changed, 21 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index d25cc68..f8dce11 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -812,17 +812,6 @@ extern void default_banner(void);
 #endif /* CONFIG_PARAVIRT_XXL */
 #endif /* CONFIG_X86_64 */
 
-#ifdef CONFIG_PARAVIRT_XXL
-
-#define GET_CR2_INTO_AX
\
-   PARA_SITE(PARA_PATCH(PV_MMU_read_cr2),  \
- ANNOTATE_RETPOLINE_SAFE;  \
- call PARA_INDIRECT(pv_ops+PV_MMU_read_cr2);   \
-)
-
-#endif /* CONFIG_PARAVIRT_XXL */
-
-
 #endif /* __ASSEMBLY__ */
 #else  /* CONFIG_PARAVIRT */
 # define default_banner x86_init_noop
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 70b7154..60b9f42 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -66,7 +66,6 @@ static void __used common(void)
OFFSET(PV_IRQ_irq_disable, paravirt_patch_template, irq.irq_disable);
OFFSET(PV_IRQ_irq_enable, paravirt_patch_template, irq.irq_enable);
OFFSET(PV_CPU_iret, paravirt_patch_template, cpu.iret);
-   OFFSET(PV_MMU_read_cr2, paravirt_patch_template, mmu.read_cr2);
 #endif
 
 #ifdef CONFIG_XEN
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 7eb2a1c..2215d4c 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -26,15 +26,6 @@
 #include 
 #include 
 
-#ifdef CONFIG_PARAVIRT_XXL
-#include 
-#include 
-#define GET_CR2_INTO(reg) GET_CR2_INTO_AX ; _ASM_MOV %_ASM_AX, reg
-#else
-#define INTERRUPT_RETURN iretq
-#define GET_CR2_INTO(reg) _ASM_MOV %cr2, reg
-#endif
-
 /*
  * We are not able to switch in one step to the final KERNEL ADDRESS SPACE
  * because we need identity-mapped pages.


Re: [PATCH 1/3] powerpc: boot: include compiler_attributes.h

2020-11-17 Thread Arvind Sankar
On Sun, Nov 15, 2020 at 08:35:30PM -0800, Nick Desaulniers wrote:
> The kernel uses `-include` to include include/linux/compiler_types.h
> into all translation units (see scripts/Makefile.lib), which #includes
> compiler_attributes.h.
> 
> arch/powerpc/boot/ uses different compiler flags from the rest of the
> kernel. As such, it doesn't contain the definitions from these headers,
> and redefines a few that it needs.
> 
> For the purpose of enabling -Wimplicit-fallthrough for ppc, include
> compiler_types.h via `-include`.

This should be "compiler_attributes.h".

> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/236
> Signed-off-by: Nick Desaulniers 
> ---
> We could just `#include "include/linux/compiler_types.h"` in the few .c
> sources used from lib/ (there are proper header guards in
> compiler_types.h).
> 
> It was also noted in 6a9dc5fd6170 that we could -D__KERNEL__ and
> -include compiler_types.h like the main kernel does, though testing that
> produces a whole sea of warnings to cleanup. This approach is minimally
> invasive.
> 
>  arch/powerpc/boot/Makefile | 1 +
>  arch/powerpc/boot/decompress.c | 1 -
>  2 files changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
> index f8ce6d2dde7b..1659963a8f1d 100644
> --- a/arch/powerpc/boot/Makefile
> +++ b/arch/powerpc/boot/Makefile
> @@ -31,6 +31,7 @@ endif
>  BOOTCFLAGS:= -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
>-fno-strict-aliasing -O2 -msoft-float -mno-altivec -mno-vsx \
>-pipe -fomit-frame-pointer -fno-builtin -fPIC -nostdinc \
> +  -include $(srctree)/include/linux/compiler_attributes.h \
>$(LINUXINCLUDE)
>  
>  ifdef CONFIG_PPC64_BOOT_WRAPPER
> diff --git a/arch/powerpc/boot/decompress.c b/arch/powerpc/boot/decompress.c
> index 8bf39ef7d2df..6098b879ac97 100644
> --- a/arch/powerpc/boot/decompress.c
> +++ b/arch/powerpc/boot/decompress.c
> @@ -21,7 +21,6 @@
>  
>  #define STATIC static
>  #define INIT
> -#define __always_inline inline
>  
>  /*
>   * The build process will copy the required zlib source files and headers
> -- 
> 2.29.2.299.gdc1121823c-goog
> 


Re: [PATCH] RISC-V: fix barrier() use in <vdso/processor.h>

2020-11-17 Thread Arvind Sankar
On Tue, Nov 17, 2020 at 11:22:39AM -0800, Randy Dunlap wrote:
> On 11/17/20 11:00 AM, Nick Desaulniers wrote:
> > On Mon, Nov 16, 2020 at 5:40 PM Randy Dunlap  wrote:
> >>
> >> riscv's <vdso/processor.h> uses barrier() so it should
> >> #include <asm/barrier.h> to prevent build errors.
> >>
> >> Fixes this build error:
> >>   CC [M]  drivers/net/ethernet/emulex/benet/be_main.o
> >> In file included from ./include/vdso/processor.h:10,
> >>  from ./arch/riscv/include/asm/processor.h:11,
> >>  from ./include/linux/prefetch.h:15,
> >>  from drivers/net/ethernet/emulex/benet/be_main.c:14:
> >> ./arch/riscv/include/asm/vdso/processor.h: In function 'cpu_relax':
> >> ./arch/riscv/include/asm/vdso/processor.h:14:2: error: implicit 
> >> declaration of function 'barrier' [-Werror=implicit-function-declaration]
> >>14 |  barrier();
> >>
> >> This happens with a total of 5 networking drivers -- they all use
> >> <linux/prefetch.h>.
> >>
> >> rv64 allmodconfig now builds cleanly after this patch.
> >>
> >> Fixes fallout from:
> >> 815f0ddb346c ("include/linux/compiler*.h: make compiler-*.h mutually 
> >> exclusive")
> >>
> >> Fixes: ad5d1122b82f ("riscv: use vDSO common flow to reduce the latency of 
> >> the time-related functions")
> > 
> > Hi Randy,
> > Thanks for the patch, it looks good to me.  I only had a question on
> > the commit message.
> > 
> > Is this also related to:
> > commit 3347acc6fcd4 ("compiler.h: fix barrier_data() on clang")
> 
> Hi Nick,
> Yes, it looks like I tagged the wrong commit in Fixes:
> Thanks for noticing that.
> 

Thanks for the patch!

Acked-by: Arvind Sankar 

Generally speaking, should compiler_types.h be considered "internal"?
i.e. everyone should include compiler.h, not compiler_types.h directly?

E.g., another potential case that might bite: RELOC_HIDE is
defined in compiler_types.h, except for clang, which will only have a
definition if compiler.h is included.


[tip: efi/core] efi/x86: Only copy the compressed kernel image in efi_relocate_kernel()

2020-11-17 Thread tip-bot2 for Arvind Sankar
The following commit has been merged into the efi/core branch of tip:

Commit-ID: 688eb28211abdf82a3f51e8997f1c8137947227d
Gitweb:
https://git.kernel.org/tip/688eb28211abdf82a3f51e8997f1c8137947227d
Author:Arvind Sankar 
AuthorDate:Sun, 11 Oct 2020 10:20:12 -04:00
Committer: Ard Biesheuvel 
CommitterDate: Mon, 26 Oct 2020 08:06:36 +01:00

efi/x86: Only copy the compressed kernel image in efi_relocate_kernel()

The image_size argument to efi_relocate_kernel() is currently specified
as init_size, but this is unnecessarily large. The compressed kernel is
much smaller, in fact, its image only extends up to the start of _bss,
since at this point, the .bss section is still uninitialized.

Depending on compression level, this can reduce the amount of data
copied by 4-5x.

Signed-off-by: Arvind Sankar 
Link: https://lore.kernel.org/r/20201011142012.96493-1-nived...@alum.mit.edu
Signed-off-by: Ard Biesheuvel 
---
 drivers/firmware/efi/libstub/x86-stub.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
b/drivers/firmware/efi/libstub/x86-stub.c
index 3672539..f14c4ff 100644
--- a/drivers/firmware/efi/libstub/x86-stub.c
+++ b/drivers/firmware/efi/libstub/x86-stub.c
@@ -715,8 +715,11 @@ unsigned long efi_main(efi_handle_t handle,
(IS_ENABLED(CONFIG_X86_32) && buffer_end > KERNEL_IMAGE_SIZE)||
(IS_ENABLED(CONFIG_X86_64) && buffer_end > MAXMEM_X86_64_4LEVEL) ||
(image_offset == 0)) {
+   extern char _bss[];
+
status = efi_relocate_kernel(&bzimage_addr,
-hdr->init_size, hdr->init_size,
+(unsigned long)_bss - bzimage_addr,
+hdr->init_size,
 hdr->pref_address,
 hdr->kernel_alignment,
 LOAD_PHYSICAL_ADDR);


[tip: efi/urgent] efi/x86: Free efi_pgd with free_pages()

2020-11-17 Thread tip-bot2 for Arvind Sankar
The following commit has been merged into the efi/urgent branch of tip:

Commit-ID: c2fe61d8be491ff8188edaf22e838f81146b
Gitweb:
https://git.kernel.org/tip/c2fe61d8be491ff8188edaf22e838f81146b
Author:Arvind Sankar 
AuthorDate:Tue, 10 Nov 2020 11:39:19 -05:00
Committer: Ard Biesheuvel 
CommitterDate: Tue, 10 Nov 2020 19:18:11 +01:00

efi/x86: Free efi_pgd with free_pages()

Commit

  d9e9a6418065 ("x86/mm/pti: Allocate a separate user PGD")

changed the PGD allocation to allocate PGD_ALLOCATION_ORDER pages, so in
the error path it should be freed using free_pages() rather than
free_page().

Commit

06ace26f4e6f ("x86/efi: Free efi_pgd with free_pages()")

fixed one instance of this, but missed another.

Move the freeing out-of-line to avoid code duplication and fix this bug.

Fixes: d9e9a6418065 ("x86/mm/pti: Allocate a separate user PGD")
Link: https://lore.kernel.org/r/20201110163919.1134431-1-nived...@alum.mit.edu
Signed-off-by: Arvind Sankar 
Signed-off-by: Ard Biesheuvel 
---
 arch/x86/platform/efi/efi_64.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 8f5759d..e1e8d4e 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -78,28 +78,30 @@ int __init efi_alloc_page_tables(void)
gfp_mask = GFP_KERNEL | __GFP_ZERO;
efi_pgd = (pgd_t *)__get_free_pages(gfp_mask, PGD_ALLOCATION_ORDER);
if (!efi_pgd)
-   return -ENOMEM;
+   goto fail;
 
pgd = efi_pgd + pgd_index(EFI_VA_END);
p4d = p4d_alloc(&efi_mm, pgd, EFI_VA_END);
-   if (!p4d) {
-   free_page((unsigned long)efi_pgd);
-   return -ENOMEM;
-   }
+   if (!p4d)
+   goto free_pgd;
 
pud = pud_alloc(&efi_mm, p4d, EFI_VA_END);
-   if (!pud) {
-   if (pgtable_l5_enabled())
-   free_page((unsigned long) pgd_page_vaddr(*pgd));
-   free_pages((unsigned long)efi_pgd, PGD_ALLOCATION_ORDER);
-   return -ENOMEM;
-   }
+   if (!pud)
+   goto free_p4d;
 
efi_mm.pgd = efi_pgd;
mm_init_cpumask(&efi_mm);
init_new_context(NULL, &efi_mm);
 
return 0;
+
+free_p4d:
+   if (pgtable_l5_enabled())
+   free_page((unsigned long)pgd_page_vaddr(*pgd));
+free_pgd:
+   free_pages((unsigned long)efi_pgd, PGD_ALLOCATION_ORDER);
+fail:
+   return -ENOMEM;
 }
 
 /*


Re: [PATCH] x86/boot/compressed/64: Drop the now-unused finalize_identity_maps()

2020-11-13 Thread Arvind Sankar
On Fri, Nov 13, 2020 at 04:08:10PM +0100, Vitaly Kuznetsov wrote:
> Since commit 8570978ea030 ("x86/boot/compressed/64: Don't pre-map memory in
> KASLR code") finalize_identity_maps() has no users, drop it.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
>  arch/x86/boot/compressed/ident_map_64.c | 10 --
>  1 file changed, 10 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/ident_map_64.c 
> b/arch/x86/boot/compressed/ident_map_64.c
> index a5e5db6ada3c..6bf20223dc0f 100644
> --- a/arch/x86/boot/compressed/ident_map_64.c
> +++ b/arch/x86/boot/compressed/ident_map_64.c
> @@ -167,16 +167,6 @@ void initialize_identity_maps(void *rmode)
>   write_cr3(top_level_pgt);
>  }
>  
> -/*
> - * This switches the page tables to the new level4 that has been built
> - * via calls to add_identity_map() above. If booted via startup_32(),
> - * this is effectively a no-op.
> - */
> -void finalize_identity_maps(void)
> -{
> - write_cr3(top_level_pgt);
> -}
> -
>  static pte_t *split_large_pmd(struct x86_mapping_info *info,
> pmd_t *pmdp, unsigned long __address)
>  {
> -- 
> 2.26.2
> 

I had sent this and another unused removal last month:
https://lore.kernel.org/lkml/20201005151208.2212886-1-nived...@alum.mit.edu/

Thanks.


Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization

2020-11-12 Thread Arvind Sankar
On Wed, Nov 11, 2020 at 04:15:59PM +0200, Adrian Ratiu wrote:
> On Tue, 10 Nov 2020, Nick Desaulniers  
> wrote:
> > 
> > Yes, though additionally Arvind points out that this code is 
> > kind of curious if there was overlap; maybe the parameters 
> > should just be restrict-qualified. 
> >
> 
> For now I think I'll just re-send the GCC changes and leave the 
> Clang optimization as is, until we better understand what's 
> happening and what's the best way to enable it.
> 

Note that the __restrict__ keywords also help GCC -- it saves it from
having to emit the non-vectorized version and switch between the two at
runtime. If we can verify it's safe, it's a good thing to add all
around.
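
To make that concrete, a stand-alone sketch (not the kernel code itself,
function name invented) of the kind of signature change being discussed;
with restrict-qualified pointers the compiler may assume the buffers do
not overlap, so it can emit just the vectorized loop instead of a vector
loop plus a scalar fallback guarded by a runtime overlap check:

#include <stddef.h>

/* Hypothetical 2-source XOR, shaped like the xor_block templates. */
void xor_arrays(size_t n,
		unsigned long *restrict p1,
		const unsigned long *restrict p2)
{
	size_t i;

	/* Non-overlap is now a stated precondition, not a runtime check. */
	for (i = 0; i < n; i++)
		p1[i] ^= p2[i];
}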


[PATCH 1/2] x86/mm/sme: Fix definition of PMD_FLAGS_DEC_WP

2020-11-11 Thread Arvind Sankar
The PAT bit is in different locations for 4k and 2M/1G page table
entries.

Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
index for write-protected pages.

Signed-off-by: Arvind Sankar 
Fixes: 6ebcb060713f ("x86/mm: Add support to encrypt the kernel in-place")
Tested-by: Tom Lendacky 
Cc: sta...@vger.kernel.org
---
 arch/x86/include/asm/pgtable_types.h | 1 +
 arch/x86/mm/mem_encrypt_identity.c   | 4 ++--
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 816b31c68550..394757ee030a 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -155,6 +155,7 @@ enum page_cache_mode {
 #define _PAGE_ENC  (_AT(pteval_t, sme_me_mask))
 
 #define _PAGE_CACHE_MASK   (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT)
+#define _PAGE_LARGE_CACHE_MASK (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT_LARGE)
 
 #define _PAGE_NOCACHE  (cachemode2protval(_PAGE_CACHE_MODE_UC))
 #define _PAGE_CACHE_WP (cachemode2protval(_PAGE_CACHE_MODE_WP))
diff --git a/arch/x86/mm/mem_encrypt_identity.c 
b/arch/x86/mm/mem_encrypt_identity.c
index 733b983f3a89..6c5eb6f3f14f 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -45,8 +45,8 @@
 #define PMD_FLAGS_LARGE(__PAGE_KERNEL_LARGE_EXEC & 
~_PAGE_GLOBAL)
 
 #define PMD_FLAGS_DEC  PMD_FLAGS_LARGE
-#define PMD_FLAGS_DEC_WP   ((PMD_FLAGS_DEC & ~_PAGE_CACHE_MASK) | \
-(_PAGE_PAT | _PAGE_PWT))
+#define PMD_FLAGS_DEC_WP   ((PMD_FLAGS_DEC & ~_PAGE_LARGE_CACHE_MASK) | \
+(_PAGE_PAT_LARGE | _PAGE_PWT))
 
 #define PMD_FLAGS_ENC  (PMD_FLAGS_LARGE | _PAGE_ENC)
 
-- 
2.26.2



[PATCH 2/2] x86/mm: Remove duplicate definition of _PAGE_PAT_LARGE

2020-11-11 Thread Arvind Sankar
_PAGE_PAT_LARGE is already defined next to _PAGE_PAT. Remove the
duplicate.

Signed-off-by: Arvind Sankar 
Fixes: 4efb56649132 ("x86/mm: Tabulate the page table encoding definitions")
---
 arch/x86/include/asm/pgtable_types.h | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 394757ee030a..f24d7ef8fffa 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -177,8 +177,6 @@ enum page_cache_mode {
 #define __pgprot(x)((pgprot_t) { (x) } )
 #define __pg(x)__pgprot(x)
 
-#define _PAGE_PAT_LARGE(_AT(pteval_t, 1) << 
_PAGE_BIT_PAT_LARGE)
-
 #define PAGE_NONE   __pg(   0|   0|   0|___A|   0|   0|   0|___G)
 #define PAGE_SHARED __pg(__PP|__RW|_USR|___A|__NX|   0|   0|   0)
 #define PAGE_SHARED_EXEC __pg(__PP|__RW|_USR|___A|   0|   0|   0|   0)
-- 
2.26.2



Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization

2020-11-10 Thread Arvind Sankar
On Tue, Nov 10, 2020 at 02:39:59PM -0800, Nick Desaulniers wrote:
> On Tue, Nov 10, 2020 at 2:39 PM Nick Desaulniers
>  wrote:
> >
> > On Tue, Nov 10, 2020 at 2:36 PM Nick Desaulniers
> >  wrote:
> > >
> > > On Tue, Nov 10, 2020 at 2:15 PM Arvind Sankar  
> > > wrote:
> > > >
> > > > On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> > > > > On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu 
> > > > >  wrote:
> > > > > >
> > > > > > On Fri, 06 Nov 2020, Nick Desaulniers 
> > > > > > wrote:
> > > > > > > +#pragma clang loop vectorize(enable)
> > > > > > > do {
> > > > > > > p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > > > > > p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > > > > > ``` seems to generate the vectorized code.
> > > > > > >
> > > > > > > Why don't we find a way to make those pragma's more toolchain
> > > > > > > portable, rather than open coding them like I have above rather
> > > > > > > than this series?
> > > > > >
> > > > > > Hi again Nick,
> > > > > >
> > > > > > How did you verify the above pragmas generate correct vectorized
> > > > > > code?  Have you tested this specific use case?
> > > > >
> > > > > I read the disassembly before and after my suggested use of pragmas;
> > > > > look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> > > > > CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> > > > > arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> > > > >
> > > >
> > > > https://godbolt.org/z/1oo9M6
> > > >
> > > > With the __restrict__ keywords added, clang seems to vectorize the loop,
> > > > but still reports that vectorization wasn't beneficial -- any idea
> > > > what's going on?
> >
> > Anyways, it's not safe to make that change in the kernel unless you
> > can guarantee that callers of these routines do not alias or overlap.
> 
> s/callers/parameters passed by callers/
> 

Yep, but that seems likely: it doesn't seem like the function would do
anything useful if the destination overlapped one of the sources. The
kernel just doesn't seem to make use of the restrict keyword.


Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization

2020-11-10 Thread Arvind Sankar
On Tue, Nov 10, 2020 at 01:41:17PM -0800, Nick Desaulniers wrote:
> On Mon, Nov 9, 2020 at 11:51 AM Adrian Ratiu  
> wrote:
> >
> > On Fri, 06 Nov 2020, Nick Desaulniers 
> > wrote:
> > > +#pragma clang loop vectorize(enable)
> > > do {
> > > p1[0] ^= p2[0] ^ p3[0] ^ p4[0] ^ p5[0]; p1[1] ^=
> > > p2[1] ^ p3[1] ^ p4[1] ^ p5[1];
> > > ``` seems to generate the vectorized code.
> > >
> > > Why don't we find a way to make those pragma's more toolchain
> > > portable, rather than open coding them like I have above rather
> > > than this series?
> >
> > Hi again Nick,
> >
> > How did you verify the above pragmas generate correct vectorized
> > code?  Have you tested this specific use case?
> 
> I read the disassembly before and after my suggested use of pragmas;
> look for vld/vstr.  You can also add -Rpass-missed=loop-vectorize to
> CFLAGS_xor-neon.o in arch/arm/lib/Makefile and rebuild
> arch/arm/lib/xor-neon.o with CONFIG_BTRFS enabled.
> 

https://godbolt.org/z/1oo9M6

With the __restrict__ keywords added, clang seems to vectorize the loop,
but still reports that vectorization wasn't beneficial -- any idea
what's going on?


[PATCH] efi/x86: Free efi_pgd with free_pages()

2020-11-10 Thread Arvind Sankar
Commit
  d9e9a6418065 ("x86/mm/pti: Allocate a separate user PGD")
changed the PGD allocation to allocate PGD_ALLOCATION_ORDER pages, so in
the error path it should be freed using free_pages() rather than
free_page().

Commit
  06ace26f4e6f ("x86/efi: Free efi_pgd with free_pages()")
fixed one instance of this, but missed another.

Move the freeing out-of-line to avoid code duplication and fix this bug.

Signed-off-by: Arvind Sankar 
Fixes: d9e9a6418065 ("x86/mm/pti: Allocate a separate user PGD")
---
 arch/x86/platform/efi/efi_64.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 8f5759df7776..e1e8d4e3a213 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -78,28 +78,30 @@ int __init efi_alloc_page_tables(void)
gfp_mask = GFP_KERNEL | __GFP_ZERO;
efi_pgd = (pgd_t *)__get_free_pages(gfp_mask, PGD_ALLOCATION_ORDER);
if (!efi_pgd)
-   return -ENOMEM;
+   goto fail;
 
pgd = efi_pgd + pgd_index(EFI_VA_END);
p4d = p4d_alloc(&efi_mm, pgd, EFI_VA_END);
-   if (!p4d) {
-   free_page((unsigned long)efi_pgd);
-   return -ENOMEM;
-   }
+   if (!p4d)
+   goto free_pgd;
 
pud = pud_alloc(&efi_mm, p4d, EFI_VA_END);
-   if (!pud) {
-   if (pgtable_l5_enabled())
-   free_page((unsigned long) pgd_page_vaddr(*pgd));
-   free_pages((unsigned long)efi_pgd, PGD_ALLOCATION_ORDER);
-   return -ENOMEM;
-   }
+   if (!pud)
+   goto free_p4d;
 
efi_mm.pgd = efi_pgd;
mm_init_cpumask(&efi_mm);
init_new_context(NULL, &efi_mm);
 
return 0;
+
+free_p4d:
+   if (pgtable_l5_enabled())
+   free_page((unsigned long)pgd_page_vaddr(*pgd));
+free_pgd:
+   free_pages((unsigned long)efi_pgd, PGD_ALLOCATION_ORDER);
+fail:
+   return -ENOMEM;
 }
 
 /*
-- 
2.26.2



Re: [PATCH] x86/mm/sme: Fix definition of PMD_FLAGS_DEC_WP

2020-11-09 Thread Arvind Sankar
On Mon, Nov 09, 2020 at 02:41:48PM -0600, Tom Lendacky wrote:
> On 11/9/20 11:35 AM, Arvind Sankar wrote:
> > The PAT bit is in different locations for 4k and 2M/1G page table
> > entries.
> > 
> > Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
> > caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
> > and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
> > index for write-protected pages.
> > 
> > Remove a duplication definition of _PAGE_PAT_LARGE.
> > 
> > Signed-off-by: Arvind Sankar 
> 
> Fixes: tag?

It's been broken since it was added in

  6ebcb060713f ("x86/mm: Add support to encrypt the kernel in-place")

but the code has been restructured since then. I think it should be
backportable to 4.19.x if you want, except for that "duplication
definition"[sic] I removed, which was only added in v5.6. Do I need to
split that out into a separate patch?

> 
> Tested-by: Tom Lendacky 
> 
> > ---
> >  arch/x86/include/asm/pgtable_types.h | 3 +--
> >  arch/x86/mm/mem_encrypt_identity.c   | 4 ++--
> >  2 files changed, 3 insertions(+), 4 deletions(-)
> > 


[PATCH] x86/mm/sme: Fix definition of PMD_FLAGS_DEC_WP

2020-11-09 Thread Arvind Sankar
The PAT bit is in different locations for 4k and 2M/1G page table
entries.

Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
index for write-protected pages.

Remove a duplication definition of _PAGE_PAT_LARGE.

Signed-off-by: Arvind Sankar 
---
 arch/x86/include/asm/pgtable_types.h | 3 +--
 arch/x86/mm/mem_encrypt_identity.c   | 4 ++--
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 816b31c68550..f24d7ef8fffa 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -155,6 +155,7 @@ enum page_cache_mode {
 #define _PAGE_ENC  (_AT(pteval_t, sme_me_mask))
 
 #define _PAGE_CACHE_MASK   (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT)
+#define _PAGE_LARGE_CACHE_MASK (_PAGE_PWT | _PAGE_PCD | _PAGE_PAT_LARGE)
 
 #define _PAGE_NOCACHE  (cachemode2protval(_PAGE_CACHE_MODE_UC))
 #define _PAGE_CACHE_WP (cachemode2protval(_PAGE_CACHE_MODE_WP))
@@ -176,8 +177,6 @@ enum page_cache_mode {
 #define __pgprot(x)((pgprot_t) { (x) } )
 #define __pg(x)__pgprot(x)
 
-#define _PAGE_PAT_LARGE(_AT(pteval_t, 1) << 
_PAGE_BIT_PAT_LARGE)
-
 #define PAGE_NONE   __pg(   0|   0|   0|___A|   0|   0|   0|___G)
 #define PAGE_SHARED __pg(__PP|__RW|_USR|___A|__NX|   0|   0|   0)
 #define PAGE_SHARED_EXEC __pg(__PP|__RW|_USR|___A|   0|   0|   0|   0)
diff --git a/arch/x86/mm/mem_encrypt_identity.c 
b/arch/x86/mm/mem_encrypt_identity.c
index 733b983f3a89..6c5eb6f3f14f 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -45,8 +45,8 @@
 #define PMD_FLAGS_LARGE(__PAGE_KERNEL_LARGE_EXEC & 
~_PAGE_GLOBAL)
 
 #define PMD_FLAGS_DEC  PMD_FLAGS_LARGE
-#define PMD_FLAGS_DEC_WP   ((PMD_FLAGS_DEC & ~_PAGE_CACHE_MASK) | \
-(_PAGE_PAT | _PAGE_PWT))
+#define PMD_FLAGS_DEC_WP   ((PMD_FLAGS_DEC & ~_PAGE_LARGE_CACHE_MASK) | \
+(_PAGE_PAT_LARGE | _PAGE_PWT))
 
 #define PMD_FLAGS_ENC  (PMD_FLAGS_LARGE | _PAGE_ENC)
 
-- 
2.26.2



Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization

2020-11-08 Thread Arvind Sankar
On Sun, Nov 08, 2020 at 12:40:14PM -0500, Arvind Sankar wrote:
> On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> > Due to a Clang bug [1] neon autoloop vectorization does not happen or
> > happens badly with no gains and considering previous GCC experiences
> > which generated unoptimized code which was worse than the default asm
> > implementation, it is safer to default clang builds to the known good
> > generic implementation.
> > 
> > The kernel currently supports a minimum Clang version of v10.0.1, see
> > commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> > 
> > When the bug gets eventually fixed, this commit could be reverted or,
> > if the minimum clang version bump takes a long time, a warning could
> > be added for users to upgrade their compilers like was done for GCC.
> > 
> > [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> > 
> > Signed-off-by: Adrian Ratiu 
> > ---
> >  arch/arm/include/asm/xor.h | 3 ++-
> >  arch/arm/lib/Makefile  | 3 +++
> >  arch/arm/lib/xor-neon.c| 4 
> >  3 files changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> > index aefddec79286..49937dafaa71 100644
> > --- a/arch/arm/include/asm/xor.h
> > +++ b/arch/arm/include/asm/xor.h
> > @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
> > NEON_TEMPLATES; \
> > } while (0)
> >  
> > -#ifdef CONFIG_KERNEL_MODE_NEON
> > +/* disabled on clang/arm due to 
> > https://bugs.llvm.org/show_bug.cgi?id=40976 */
> > +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
> >  
> >  extern struct xor_block_template const xor_block_neon_inner;
> >  
> > diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> > index 6d2ba454f25b..53f9e7dd9714 100644
> > --- a/arch/arm/lib/Makefile
> > +++ b/arch/arm/lib/Makefile
> > @@ -43,8 +43,11 @@ endif
> >  $(obj)/csumpartialcopy.o:  $(obj)/csumpartialcopygeneric.S
> >  $(obj)/csumpartialcopyuser.o:  $(obj)/csumpartialcopygeneric.S
> >  
> > +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> > +ifndef CONFIG_CC_IS_CLANG
> >  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
> >NEON_FLAGS   := -march=armv7-a -mfloat-abi=softfp 
> > -mfpu=neon
> >CFLAGS_xor-neon.o+= $(NEON_FLAGS)
> >obj-$(CONFIG_XOR_BLOCKS) += xor-neon.o
> >  endif
> > +endif
> > diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> > index e1e76186ec23..84c91c48dfa2 100644
> > --- a/arch/arm/lib/xor-neon.c
> > +++ b/arch/arm/lib/xor-neon.c
> > @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
> >   * Pull in the reference implementations while instructing GCC (through
> >   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
> >   * NEON instructions.
> > +
> > + * On Clang the loop vectorizer is enabled by default, but due to a bug
> > + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> > + * so xor-neon is disabled in favor of the default reg implementations.
> >   */
> >  #ifdef CONFIG_CC_IS_GCC
> >  #pragma GCC optimize "tree-vectorize"
> > -- 
> > 2.29.0
> > 
> 
> It's actually a bad idea to use #pragma GCC optimize. This is basically
> the same as tagging all the functions with __attribute__((optimize)),
> which GCC does not recommend for production use, as it _replaces_
> optimization options rather than appending to them, and has been
> observed to result in dropping important compiler flags.
> 
> There've been a few discussions recently around other such cases:
> https://lore.kernel.org/lkml/20201028171506.15682-1-a...@kernel.org/
> https://lore.kernel.org/lkml/20201028081123.gt2...@hirez.programming.kicks-ass.net/
> 
> For this file, given that it is supposed to use -ftree-vectorize for the
> whole file anyway, is there any reason it's not just added to CFLAGS via
> the Makefile? This seems to be the only use of pragma optimize in the
> kernel.

Eg, this shows that the pragma results in dropping -fno-strict-aliasing.
https://godbolt.org/z/1nfrKT

The first function does not use vectorization because s and s->a might
alias.
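
A stand-alone sketch of the same effect (not a reproduction of the
godbolt link; the struct and function names are made up, and it assumes
the file is built with -O2 -fno-strict-aliasing -ftree-vectorize):

struct s {
	int *a;
};

/*
 * With -fno-strict-aliasing in effect, the int stores through s->a may
 * alias the pointer member s->a itself, so s->a is reloaded each
 * iteration and the loop is not vectorized.
 */
void zero_plain(struct s *s, int n)
{
	for (int i = 0; i < n; i++)
		s->a[i] = 0;
}

/*
 * The optimize attribute is roughly what the pragma amounts to per
 * function; it has been observed to drop command-line flags such as
 * -fno-strict-aliasing, after which the aliasing concern goes away and
 * the loop gets vectorized.
 */
__attribute__((optimize("tree-vectorize")))
void zero_attr(struct s *s, int n)
{
	for (int i = 0; i < n; i++)
		s->a[i] = 0;
}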


Re: [PATCH 2/2] arm: lib: xor-neon: disable clang vectorization

2020-11-08 Thread Arvind Sankar
On Fri, Nov 06, 2020 at 07:14:36AM +0200, Adrian Ratiu wrote:
> Due to a Clang bug [1] neon autoloop vectorization does not happen or
> happens badly with no gains and considering previous GCC experiences
> which generated unoptimized code which was worse than the default asm
> implementation, it is safer to default clang builds to the known good
> generic implementation.
> 
> The kernel currently supports a minimum Clang version of v10.0.1, see
> commit 1f7a44f63e6c ("compiler-clang: add build check for clang 10.0.1").
> 
> When the bug gets eventually fixed, this commit could be reverted or,
> if the minimum clang version bump takes a long time, a warning could
> be added for users to upgrade their compilers like was done for GCC.
> 
> [1] https://bugs.llvm.org/show_bug.cgi?id=40976
> 
> Signed-off-by: Adrian Ratiu 
> ---
>  arch/arm/include/asm/xor.h | 3 ++-
>  arch/arm/lib/Makefile  | 3 +++
>  arch/arm/lib/xor-neon.c| 4 
>  3 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm/include/asm/xor.h b/arch/arm/include/asm/xor.h
> index aefddec79286..49937dafaa71 100644
> --- a/arch/arm/include/asm/xor.h
> +++ b/arch/arm/include/asm/xor.h
> @@ -141,7 +141,8 @@ static struct xor_block_template xor_block_arm4regs = {
>   NEON_TEMPLATES; \
>   } while (0)
>  
> -#ifdef CONFIG_KERNEL_MODE_NEON
> +/* disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976 
> */
> +#if defined(CONFIG_KERNEL_MODE_NEON) && !defined(CONFIG_CC_IS_CLANG)
>  
>  extern struct xor_block_template const xor_block_neon_inner;
>  
> diff --git a/arch/arm/lib/Makefile b/arch/arm/lib/Makefile
> index 6d2ba454f25b..53f9e7dd9714 100644
> --- a/arch/arm/lib/Makefile
> +++ b/arch/arm/lib/Makefile
> @@ -43,8 +43,11 @@ endif
>  $(obj)/csumpartialcopy.o:$(obj)/csumpartialcopygeneric.S
>  $(obj)/csumpartialcopyuser.o:$(obj)/csumpartialcopygeneric.S
>  
> +# disabled on clang/arm due to https://bugs.llvm.org/show_bug.cgi?id=40976
> +ifndef CONFIG_CC_IS_CLANG
>  ifeq ($(CONFIG_KERNEL_MODE_NEON),y)
>NEON_FLAGS := -march=armv7-a -mfloat-abi=softfp -mfpu=neon
>CFLAGS_xor-neon.o  += $(NEON_FLAGS)
>obj-$(CONFIG_XOR_BLOCKS)   += xor-neon.o
>  endif
> +endif
> diff --git a/arch/arm/lib/xor-neon.c b/arch/arm/lib/xor-neon.c
> index e1e76186ec23..84c91c48dfa2 100644
> --- a/arch/arm/lib/xor-neon.c
> +++ b/arch/arm/lib/xor-neon.c
> @@ -18,6 +18,10 @@ MODULE_LICENSE("GPL");
>   * Pull in the reference implementations while instructing GCC (through
>   * -ftree-vectorize) to attempt to exploit implicit parallelism and emit
>   * NEON instructions.
> +
> + * On Clang the loop vectorizer is enabled by default, but due to a bug
> + * (https://bugs.llvm.org/show_bug.cgi?id=40976) vectorization is broke
> + * so xor-neon is disabled in favor of the default reg implementations.
>   */
>  #ifdef CONFIG_CC_IS_GCC
>  #pragma GCC optimize "tree-vectorize"
> -- 
> 2.29.0
> 

It's actually a bad idea to use #pragma GCC optimize. This is basically
the same as tagging all the functions with __attribute__((optimize)),
which GCC does not recommend for production use, as it _replaces_
optimization options rather than appending to them, and has been
observed to result in dropping important compiler flags.

There've been a few discussions recently around other such cases:
https://lore.kernel.org/lkml/20201028171506.15682-1-a...@kernel.org/
https://lore.kernel.org/lkml/20201028081123.gt2...@hirez.programming.kicks-ass.net/

For this file, given that it is supposed to use -ftree-vectorize for the
whole file anyway, is there any reason it's not just added to CFLAGS via
the Makefile? This seems to be the only use of pragma optimize in the
kernel.


Definition of PMD_FLAGS_DEC_WP in arch/x86/mm/mem_encrypt_identity.c

2020-11-08 Thread Arvind Sankar
Hi, I have a question about this definition in
arch/x86/mm/mem_encrypt_identity.c:

#define PMD_FLAGS_LARGE (__PAGE_KERNEL_LARGE_EXEC & 
~_PAGE_GLOBAL)

#define PMD_FLAGS_DEC   PMD_FLAGS_LARGE
#define PMD_FLAGS_DEC_WP((PMD_FLAGS_DEC & ~_PAGE_CACHE_MASK) | \
 (_PAGE_PAT | _PAGE_PWT))

_PAGE_CACHE_MASK and _PAGE_PAT are for 4k pages, not 2M pages. The
definition of PMD_FLAGS_DEC_WP clears the PSE bit by masking out
_PAGE_CACHE_MASK, and sets it again by setting _PAGE_PAT, resulting in
PMD_FLAGS_DEC_WP actually being write-through, not write-protected,
using PAT index 1.

Shouldn't the definition be

#define PMD_FLAGS_DEC_WP(PMD_FLAGS_DEC | _PAGE_PAT_LARGE | 
_PAGE_PWT)

for write-protected using PAT index 5?
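
To spell out the index arithmetic, a stand-alone sketch using the bit
positions from pgtable_types.h (PAT index = PAT<<2 | PCD<<1 | PWT, with
the PAT bit at bit 7 for 4k entries and bit 12 for 2M/1G entries):

#include <stdio.h>

#define _PAGE_PWT	(1UL << 3)
#define _PAGE_PCD	(1UL << 4)
#define _PAGE_PSE	(1UL << 7)	/* same bit as _PAGE_PAT for 4k entries */
#define _PAGE_PAT_LARGE	(1UL << 12)	/* PAT bit for 2M/1G entries */

/* PAT index as the hardware computes it for a large-page entry. */
static unsigned int large_pat_index(unsigned long flags)
{
	return (!!(flags & _PAGE_PAT_LARGE) << 2) |
	       (!!(flags & _PAGE_PCD) << 1) |
		!!(flags & _PAGE_PWT);
}

int main(void)
{
	/* Current definition: sets bit 7, which is PSE here, not PAT. */
	unsigned long current_wp  = _PAGE_PSE | _PAGE_PWT;
	/* Suggested definition: sets the large-page PAT bit instead. */
	unsigned long proposed_wp = _PAGE_PAT_LARGE | _PAGE_PWT;

	printf("current:  PAT index %u\n", large_pat_index(current_wp));  /* 1 */
	printf("proposed: PAT index %u\n", large_pat_index(proposed_wp)); /* 5 */
	return 0;
}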

I guess the difference doesn't actually matter for encrypt-in-place? But
mem_encrypt_boot.S takes pains to initialize PA5 to be write-protected,
and it looks like it won't actually be used.

Thanks.


Re: [PATCH] Kbuild: implement support for DWARF5

2020-11-03 Thread Arvind Sankar
On Tue, Nov 03, 2020 at 04:05:36PM -0800, Nick Desaulniers wrote:
> On Tue, Nov 3, 2020 at 4:00 PM Arvind Sankar  wrote:
> >
> > On Wed, Oct 21, 2020 at 06:21:06PM -0700, Nick Desaulniers wrote:
> > > Further -gdwarf-X where X is an unsupported value doesn't
> > > produce an error in $(CC).
> >
> > Do you have more details here? On godbolt.org, gcc does report an error
> > for unsupported dwarf versions.
> >
> > https://godbolt.org/z/G35798
> >
> > gcc does not seem to pass the -gdwarf-* options to the assembler when
> > compiling C source. For assembler, gcc will pass an appropriate option
> > depending on the version of binutils it was configured with: if the
> > assembler doesn't support dwarf-5 it can call it with --gdwarf2 for eg.
> >
> > If the user is using a properly configured toolchain it doesn't look
> > like it should be an issue to just use cc-option?
> 
> I wrote the base patch back in May, and didn't revisit until recently.
> I could have sworn the cc-option silently failed for the check
> cc-option does, which is /dev/null input.  I need to recheck that, but
> it doesn't hurt to simply include it for now, which I've done in a v2
> I'm about to send.
> -- 
> Thanks,
> ~Nick Desaulniers

This is giving me deja vu about the -gz=zlib option.

Didn't Masahiro fix the cc-option issue with
  4d0831e8a029 ("kconfig: unify cc-option and as-option")?

The existing -Wa,-gdwarf-2 in the Makefile seems bogus, btw. GCC 4.9.0
at least appears to pass on --gdwarf2 automatically.


Re: [PATCH] Kbuild: implement support for DWARF5

2020-11-03 Thread Arvind Sankar
On Wed, Oct 21, 2020 at 06:21:06PM -0700, Nick Desaulniers wrote:
> DWARF5 is the latest standard of the DWARF debug info format.
> 
> Feature detection of DWARF5 is onerous, especially given that we've
> removed $(AS), so we must query $(CC) for DWARF5 assembler directive
> support. Further -gdwarf-X where X is an unsupported value doesn't
> produce an error in $(CC). GNU `as` only recently gained support for
> specifying -gdwarf-5.

Do you have more details here? On godbolt.org, gcc does report an error
for unsupported dwarf versions.

https://godbolt.org/z/G35798

gcc does not seem to pass the -gdwarf-* options to the assembler when
compiling C source. For assembler, gcc will pass an appropriate option
depending on the version of binutils it was configured with: if the
assembler doesn't support dwarf-5 it can call it with --gdwarf2 for eg.

If the user is using a properly configured toolchain it doesn't look
like it should be an issue to just use cc-option?


Re: [PATCH] [v2] x86: apic: avoid -Wshadow warning in header

2020-10-29 Thread Arvind Sankar
On Thu, Oct 29, 2020 at 09:41:13PM +0100, Thomas Gleixner wrote:
> On Thu, Oct 29 2020 at 17:59, Paolo Bonzini wrote:
> > On 29/10/20 17:56, Arvind Sankar wrote:
> >>> For those two just add:
> >>>   struct apic *apic = x86_system_apic;
> >>> before all the assignments.
> >>> Less churn and much better code.
> >>>
> >> Why would it be better code?
> >> 
> >
> > I think he means the compiler produces better code, because it won't
> > read the global variable repeatedly.  Not sure if that's true,(*) but I
> > think I do prefer that version if Arnd wants to do that tweak.
> 
> It's not true.
> 
>  foo *p = bar;
> 
>  p->a = 1;
>  p->b = 2;
> 
> The compiler is free to reload bar after accessing p->a and with
> 
> bar->a = 1;
> bar->b = 1;
> 
> it can either cache bar in a register or reread it after bar->a
> 
> The generated code is the same as long as there is no reason to reload,
> e.g. register pressure.
> 
> Thanks,
> 
> tglx

It's not quite the same.

https://godbolt.org/z/4dzPbM

With -fno-strict-aliasing, the compiler reloads the pointer if you write
to the start of what it points to, but not if you write to later
elements.
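
For illustration, a stand-alone sketch of the pattern (not the linked
godbolt; the type and function names are invented). Built with
-O2 -fno-strict-aliasing, gcc was seen to reload the global pointer after
the store to the first member, but not after stores to later members:

struct apic_ops {
	void (*read)(void);
	void (*write)(void);
	void (*eoi)(void);
};

extern struct apic_ops *apic_ptr;
extern void my_read(void), my_write(void), my_eoi(void);

void install_ops(void)
{
	apic_ptr->read  = my_read;	/* observed: apic_ptr reloaded after this */
	apic_ptr->write = my_write;	/* observed: no reload for these */
	apic_ptr->eoi   = my_eoi;
}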


[tip: x86/build] x86/build: Fix vmlinux size check on 64-bit

2020-10-29 Thread tip-bot2 for Arvind Sankar
The following commit has been merged into the x86/build branch of tip:

Commit-ID: ea3186b9572a1b0299448697cfc44920061872cf
Gitweb:
https://git.kernel.org/tip/ea3186b9572a1b0299448697cfc44920061872cf
Author:Arvind Sankar 
AuthorDate:Thu, 29 Oct 2020 12:19:03 -04:00
Committer: Borislav Petkov 
CommitterDate: Thu, 29 Oct 2020 21:54:35 +01:00

x86/build: Fix vmlinux size check on 64-bit

Commit

  b4e0409a36f4 ("x86: check vmlinux limits, 64-bit")

added a check that the size of the 64-bit kernel is less than
KERNEL_IMAGE_SIZE.

The check uses (_end - _text), but this is not enough. The initial
PMD used in startup_64() (level2_kernel_pgt) can only map up to
KERNEL_IMAGE_SIZE from __START_KERNEL_map, not from _text, and the
modules area (MODULES_VADDR) starts at KERNEL_IMAGE_SIZE.

The correct check is what is currently done for 32-bit, since
LOAD_OFFSET is defined appropriately for the two architectures. Just
check (_end - LOAD_OFFSET) against KERNEL_IMAGE_SIZE unconditionally.

Note that on 32-bit, the limit is not strict: KERNEL_IMAGE_SIZE is not
really used by the main kernel. The higher the kernel is located, the
less the space available for the vmalloc area. However, it is used by
KASLR in the compressed stub to limit the maximum address of the kernel
to a safe value.

Clean up various comments to clarify that despite the name,
KERNEL_IMAGE_SIZE is not a limit on the size of the kernel image, but a
limit on the maximum virtual address that the image can occupy.

Signed-off-by: Arvind Sankar 
Signed-off-by: Borislav Petkov 
Link: https://lkml.kernel.org/r/20201029161903.2553528-1-nived...@alum.mit.edu
---
 arch/x86/include/asm/page_32_types.h |  8 +++-
 arch/x86/include/asm/page_64_types.h |  6 --
 arch/x86/include/asm/pgtable_32.h| 18 ++
 arch/x86/kernel/head_64.S| 20 +---
 arch/x86/kernel/vmlinux.lds.S| 12 +++-
 5 files changed, 29 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/page_32_types.h 
b/arch/x86/include/asm/page_32_types.h
index f462895..faf9cc1 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -53,7 +53,13 @@
 #define STACK_TOP_MAX  STACK_TOP
 
 /*
- * Kernel image size is limited to 512 MB (see in arch/x86/kernel/head_32.S)
+ * In spite of the name, KERNEL_IMAGE_SIZE is a limit on the maximum virtual
+ * address for the kernel image, rather than the limit on the size itself. On
+ * 32-bit, this is not a strict limit, but this value is used to limit the
+ * link-time virtual address range of the kernel, and by KASLR to limit the
+ * randomized address from which the kernel is executed. A relocatable kernel
+ * can be loaded somewhat higher than KERNEL_IMAGE_SIZE as long as enough space
+ * remains for the vmalloc area.
  */
 #define KERNEL_IMAGE_SIZE  (512 * 1024 * 1024)
 
diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 3f49dac..645bd1d 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -98,8 +98,10 @@
 #define STACK_TOP_MAX  TASK_SIZE_MAX
 
 /*
- * Maximum kernel image size is limited to 1 GiB, due to the fixmap living
- * in the next 1 GiB (see level2_kernel_pgt in arch/x86/kernel/head_64.S).
+ * In spite of the name, KERNEL_IMAGE_SIZE is a limit on the maximum virtual
+ * address for the kernel image, rather than the limit on the size itself.
+ * This can be at most 1 GiB, due to the fixmap living in the next 1 GiB (see
+ * level2_kernel_pgt in arch/x86/kernel/head_64.S).
  *
  * On KASLR use 1 GiB by default, leaving 1 GiB for modules once the
  * page tables are fully set up.
diff --git a/arch/x86/include/asm/pgtable_32.h 
b/arch/x86/include/asm/pgtable_32.h
index d7acae4..7c9c968 100644
--- a/arch/x86/include/asm/pgtable_32.h
+++ b/arch/x86/include/asm/pgtable_32.h
@@ -57,19 +57,13 @@ do {\
 #endif
 
 /*
- * This is how much memory in addition to the memory covered up to
- * and including _end we need mapped initially.
- * We need:
- * (KERNEL_IMAGE_SIZE/4096) / 1024 pages (worst case, non PAE)
- * (KERNEL_IMAGE_SIZE/4096) / 512 + 4 pages (worst case for PAE)
+ * This is used to calculate the .brk reservation for initial pagetables.
+ * Enough space is reserved to allocate pagetables sufficient to cover all
+ * of LOWMEM_PAGES, which is an upper bound on the size of the direct map of
+ * lowmem.
  *
- * Modulo rounding, each megabyte assigned here requires a kilobyte of
- * memory, which is currently unreclaimed.
- *
- * This should be a multiple of a page.
- *
- * KERNEL_IMAGE_SIZE should be greater than pa(_end)
- * and small than max_low_pfn, otherwise will waste some page table entries
+ * With PAE paging (PTRS_PER_PMD > 1), we allocate PTRS_PER_PGD == 4 pages for
+ * the PMD's in addition to the pages required for the last lev

Re: [PATCH] [v2] x86: apic: avoid -Wshadow warning in header

2020-10-29 Thread Arvind Sankar
On Thu, Oct 29, 2020 at 05:59:54PM +0100, Paolo Bonzini wrote:
> On 29/10/20 17:56, Arvind Sankar wrote:
> >> For those two just add:
> >>struct apic *apic = x86_system_apic;
> >> before all the assignments.
> >> Less churn and much better code.
> >>
> > Why would it be better code?
> > 
> 
> I think he means the compiler produces better code, because it won't
> read the global variable repeatedly.  Not sure if that's true,(*) but I
> think I do prefer that version if Arnd wants to do that tweak.
> 
> Paolo
> 
> (*) if it is, it might also be due to Linux using -fno-strict-aliasing
> 

Nope, it doesn't read it multiple times. I guess it still assumes that
apic isn't in the middle of what it points to: it would reload the
address if the first element of *apic was modified, but doesn't for
other elements. Interesting.


Re: [PATCH] [v2] x86: apic: avoid -Wshadow warning in header

2020-10-29 Thread Arvind Sankar
On Thu, Oct 29, 2020 at 03:05:31PM +, David Laight wrote:
> From: Arnd Bergmann
> > Sent: 28 October 2020 21:21
> > 
> > From: Arnd Bergmann 
> > 
> > There are hundreds of warnings in a W=2 build about a local
> > variable shadowing the global 'apic' definition:
> > 
> > arch/x86/kvm/lapic.h:149:65: warning: declaration of 'apic' shadows a 
> > global declaration [-Wshadow]
> > 
> > Avoid this by renaming the global 'apic' variable to the more descriptive
> > 'x86_system_apic'. It was originally called 'genapic', but both that
> > and the current 'apic' seem to be a little overly generic for a global
> > variable.
> > 
> > Fixes: c48f14966cc4 ("KVM: inline kvm_apic_present() and 
> > kvm_lapic_enabled()")
> > Fixes: c8d46cf06dc2 ("x86: rename 'genapic' to 'apic'")
> > Signed-off-by: Arnd Bergmann 
> > ---
> > v2: rename the global instead of the local variable in the header
> ...
> > diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
> > index 284e73661a18..33e2dc78ca11 100644
> > --- a/arch/x86/hyperv/hv_apic.c
> > +++ b/arch/x86/hyperv/hv_apic.c
> > @@ -259,14 +259,14 @@ void __init hv_apic_init(void)
> > /*
> >  * Set the IPI entry points.
> >  */
> > -   orig_apic = *apic;
> > -
> > -   apic->send_IPI = hv_send_ipi;
> > -   apic->send_IPI_mask = hv_send_ipi_mask;
> > -   apic->send_IPI_mask_allbutself = hv_send_ipi_mask_allbutself;
> > -   apic->send_IPI_allbutself = hv_send_ipi_allbutself;
> > -   apic->send_IPI_all = hv_send_ipi_all;
> > -   apic->send_IPI_self = hv_send_ipi_self;
> > +   orig_apic = *x86_system_apic;
> > +
> > +   x86_system_apic->send_IPI = hv_send_ipi;
> > +   x86_system_apic->send_IPI_mask = hv_send_ipi_mask;
> > +   x86_system_apic->send_IPI_mask_allbutself = 
> > hv_send_ipi_mask_allbutself;
> > +   x86_system_apic->send_IPI_allbutself = hv_send_ipi_allbutself;
> > +   x86_system_apic->send_IPI_all = hv_send_ipi_all;
> > +   x86_system_apic->send_IPI_self = hv_send_ipi_self;
> > }
> > 
> > if (ms_hyperv.hints & HV_X64_APIC_ACCESS_RECOMMENDED) {
> > @@ -285,10 +285,10 @@ void __init hv_apic_init(void)
> >  */
> > apic_set_eoi_write(hv_apic_eoi_write);
> > if (!x2apic_enabled()) {
> > -   apic->read  = hv_apic_read;
> > -   apic->write = hv_apic_write;
> > -   apic->icr_write = hv_apic_icr_write;
> > -   apic->icr_read  = hv_apic_icr_read;
> > +   x86_system_apic->read  = hv_apic_read;
> > +   x86_system_apic->write = hv_apic_write;
> > +   x86_system_apic->icr_write = hv_apic_icr_write;
> > +   x86_system_apic->icr_read  = hv_apic_icr_read;
> > }
> 
> For those two just add:
>   struct apic *apic = x86_system_apic;
> before all the assignments.
> Less churn and much better code.
> 

Why would it be better code?


[PATCH v2] x86/build: Fix vmlinux size check on 64-bit

2020-10-29 Thread Arvind Sankar
Commit b4e0409a36f4 ("x86: check vmlinux limits, 64-bit") added a check
that the size of the 64-bit kernel is less than KERNEL_IMAGE_SIZE.

The check uses (_end - _text), but this is not enough. The initial PMD
used in startup_64() (level2_kernel_pgt) can only map up to
KERNEL_IMAGE_SIZE from __START_KERNEL_map, not from _text, and the
modules area (MODULES_VADDR) starts at KERNEL_IMAGE_SIZE.

The correct check is what is currently done for 32-bit, since
LOAD_OFFSET is defined appropriately for the two architectures. Just
check (_end - LOAD_OFFSET) against KERNEL_IMAGE_SIZE unconditionally.

Note that on 32-bit, the limit is not strict: KERNEL_IMAGE_SIZE is not
really used by the main kernel. The higher the kernel is located, the
less space is available for the vmalloc area. However, it is used by
KASLR in the compressed stub to limit the maximum address of the kernel
to a safe value.

Clean up various comments to clarify that despite the name,
KERNEL_IMAGE_SIZE is not a limit on the size of the kernel image, but a
limit on the maximum virtual address that the image can occupy.

Signed-off-by: Arvind Sankar 
---
 arch/x86/include/asm/page_32_types.h |  8 +++-
 arch/x86/include/asm/page_64_types.h |  6 --
 arch/x86/include/asm/pgtable_32.h| 18 ++
 arch/x86/kernel/head_64.S| 20 +---
 arch/x86/kernel/vmlinux.lds.S| 11 ++-
 5 files changed, 28 insertions(+), 35 deletions(-)

diff --git a/arch/x86/include/asm/page_32_types.h 
b/arch/x86/include/asm/page_32_types.h
index f462895a33e4..faf9cc1c14bb 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -53,7 +53,13 @@
 #define STACK_TOP_MAX  STACK_TOP
 
 /*
- * Kernel image size is limited to 512 MB (see in arch/x86/kernel/head_32.S)
+ * In spite of the name, KERNEL_IMAGE_SIZE is a limit on the maximum virtual
+ * address for the kernel image, rather than the limit on the size itself. On
+ * 32-bit, this is not a strict limit, but this value is used to limit the
+ * link-time virtual address range of the kernel, and by KASLR to limit the
+ * randomized address from which the kernel is executed. A relocatable kernel
+ * can be loaded somewhat higher than KERNEL_IMAGE_SIZE as long as enough space
+ * remains for the vmalloc area.
  */
 #define KERNEL_IMAGE_SIZE  (512 * 1024 * 1024)
 
diff --git a/arch/x86/include/asm/page_64_types.h 
b/arch/x86/include/asm/page_64_types.h
index 3f49dac03617..645bd1d0ee07 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -98,8 +98,10 @@
 #define STACK_TOP_MAX  TASK_SIZE_MAX
 
 /*
- * Maximum kernel image size is limited to 1 GiB, due to the fixmap living
- * in the next 1 GiB (see level2_kernel_pgt in arch/x86/kernel/head_64.S).
+ * In spite of the name, KERNEL_IMAGE_SIZE is a limit on the maximum virtual
+ * address for the kernel image, rather than the limit on the size itself.
+ * This can be at most 1 GiB, due to the fixmap living in the next 1 GiB (see
+ * level2_kernel_pgt in arch/x86/kernel/head_64.S).
  *
  * On KASLR use 1 GiB by default, leaving 1 GiB for modules once the
  * page tables are fully set up.
diff --git a/arch/x86/include/asm/pgtable_32.h 
b/arch/x86/include/asm/pgtable_32.h
index d7acae4120d5..7c9c968a42ef 100644
--- a/arch/x86/include/asm/pgtable_32.h
+++ b/arch/x86/include/asm/pgtable_32.h
@@ -57,19 +57,13 @@ do {\
 #endif
 
 /*
- * This is how much memory in addition to the memory covered up to
- * and including _end we need mapped initially.
- * We need:
- * (KERNEL_IMAGE_SIZE/4096) / 1024 pages (worst case, non PAE)
- * (KERNEL_IMAGE_SIZE/4096) / 512 + 4 pages (worst case for PAE)
+ * This is used to calculate the .brk reservation for initial pagetables.
+ * Enough space is reserved to allocate pagetables sufficient to cover all
+ * of LOWMEM_PAGES, which is an upper bound on the size of the direct map of
+ * lowmem.
  *
- * Modulo rounding, each megabyte assigned here requires a kilobyte of
- * memory, which is currently unreclaimed.
- *
- * This should be a multiple of a page.
- *
- * KERNEL_IMAGE_SIZE should be greater than pa(_end)
- * and small than max_low_pfn, otherwise will waste some page table entries
+ * With PAE paging (PTRS_PER_PMD > 1), we allocate PTRS_PER_PGD == 4 pages for
+ * the PMD's in addition to the pages required for the last level pagetables.
  */
 #if PTRS_PER_PMD > 1
 #define PAGE_TABLE_SIZE(pages) (((pages) / PTRS_PER_PMD) + PTRS_PER_PGD)
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 7eb2a1c87969..d41fa5bb77fe 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -524,21 +524,19 @@ SYM_DATA_END(level3_kernel_pgt)
 
 SYM_DATA_START_PAGE_ALIGNED(level2_kernel_pgt)
/*
-* 512 MB kernel mapping. We spend a full page on this pagetable
-* anyway.
+* Ker

Re: [PATCH] x86/entry/64: Use TEST %reg,%reg instead of CMP $0,%reg

2020-10-29 Thread Arvind Sankar
On Thu, Oct 29, 2020 at 03:29:15PM +0100, Uros Bizjak wrote:
> Use TEST %reg,%reg which sets the zero flag in the same way
> as CMP $0,%reg, but the encoding uses one byte less.
> 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Borislav Petkov 
> Cc: "H. Peter Anvin" 
> Signed-off-by: Uros Bizjak 

Please use x86/boot/64 or x86/boot/compressed/64 for the commit subject.
x86/entry is used for syscall/exception handler entries in the main
kernel.

> ---
>  arch/x86/boot/compressed/head_64.S | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/boot/compressed/head_64.S 
> b/arch/x86/boot/compressed/head_64.S
> index 017de6cc87dc..e94874f4bbc1 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -241,12 +241,12 @@ SYM_FUNC_START(startup_32)
>   lealrva(startup_64)(%ebp), %eax
>  #ifdef CONFIG_EFI_MIXED
>   movlrva(efi32_boot_args)(%ebp), %edi
> - cmp $0, %edi
> + testl   %edi, %edi
>   jz  1f
>   lealrva(efi64_stub_entry)(%ebp), %eax
>   movlrva(efi32_boot_args+4)(%ebp), %esi
>   movlrva(efi32_boot_args+8)(%ebp), %edx  // saved bootparams 
> pointer
> - cmpl$0, %edx
> + testl   %edx, %edx
>   jnz 1f
>   /*
>* efi_pe_entry uses MS calling convention, which requires 32 bytes of
> @@ -592,7 +592,7 @@ SYM_CODE_START(trampoline_32bit_src)
>   movl%eax, %cr0
>  
>   /* Check what paging mode we want to be in after the trampoline */
> - cmpl$0, %edx
> + testl   %edx, %edx
>   jz  1f
>  
>   /* We want 5-level paging: don't touch CR3 if it already points to 
> 5-level page tables */
> @@ -622,7 +622,7 @@ SYM_CODE_START(trampoline_32bit_src)
>  
>   /* Enable PAE and LA57 (if required) paging modes */
>   movl$X86_CR4_PAE, %eax
> - cmpl$0, %edx
> + testl   %edx, %edx
>   jz  1f
>   orl $X86_CR4_LA57, %eax
>  1:
> -- 
> 2.26.2
> 


Re: [PATCH v2 1/2] bpf: don't rely on GCC __attribute__((optimize)) to disable GCSE

2020-10-29 Thread Arvind Sankar
On Wed, Oct 28, 2020 at 04:20:01PM -0700, Alexei Starovoitov wrote:
> On Thu, Oct 29, 2020 at 12:10:52AM +0100, Ard Biesheuvel wrote:
> > On Wed, 28 Oct 2020 at 23:59, Alexei Starovoitov
> >  wrote:
> > >
> > > On Wed, Oct 28, 2020 at 11:15:04PM +0100, Ard Biesheuvel wrote:
> > > > On Wed, 28 Oct 2020 at 22:39, Alexei Starovoitov
> > > >  wrote:
> > > > >
> > > > > On Wed, Oct 28, 2020 at 06:15:05PM +0100, Ard Biesheuvel wrote:
> > > > > > Commit 3193c0836 ("bpf: Disable GCC -fgcse optimization for
> > > > > > ___bpf_prog_run()") introduced a __no_fgcse macro that expands to a
> > > > > > function scope __attribute__((optimize("-fno-gcse"))), to disable a
> > > > > > GCC specific optimization that was causing trouble on x86 builds, 
> > > > > > and
> > > > > > was not expected to have any positive effect in the first place.
> > > > > >
> > > > > > However, as the GCC manual documents, __attribute__((optimize))
> > > > > > is not for production use, and results in all other optimization
> > > > > > options to be forgotten for the function in question. This can
> > > > > > cause all kinds of trouble, but in one particular reported case,
> > > > > > it causes -fno-asynchronous-unwind-tables to be disregarded,
> > > > > > resulting in .eh_frame info to be emitted for the function.
> > > > > >
> > > > > > This reverts commit 3193c0836, and instead, it disables the -fgcse
> > > > > > optimization for the entire source file, but only when building for
> > > > > > X86 using GCC with CONFIG_BPF_JIT_ALWAYS_ON disabled. Note that the
> > > > > > original commit states that CONFIG_RETPOLINE=n triggers the issue,
> > > > > > whereas CONFIG_RETPOLINE=y performs better without the optimization,
> > > > > > so it is kept disabled in both cases.
> > > > > >
> > > > > > Fixes: 3193c0836 ("bpf: Disable GCC -fgcse optimization for 
> > > > > > ___bpf_prog_run()")
> > > > > > Link: 
> > > > > > https://lore.kernel.org/lkml/CAMuHMdUg0WJHEcq6to0-eODpXPOywLot6UD2=gfhpzoj_hc...@mail.gmail.com/
> > > > > > Signed-off-by: Ard Biesheuvel 
> > > > > > ---
> > > > > >  include/linux/compiler-gcc.h   | 2 --
> > > > > >  include/linux/compiler_types.h | 4 
> > > > > >  kernel/bpf/Makefile| 6 +-
> > > > > >  kernel/bpf/core.c  | 2 +-
> > > > > >  4 files changed, 6 insertions(+), 8 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/compiler-gcc.h 
> > > > > > b/include/linux/compiler-gcc.h
> > > > > > index d1e3c6896b71..5deb37024574 100644
> > > > > > --- a/include/linux/compiler-gcc.h
> > > > > > +++ b/include/linux/compiler-gcc.h
> > > > > > @@ -175,5 +175,3 @@
> > > > > >  #else
> > > > > >  #define __diag_GCC_8(s)
> > > > > >  #endif
> > > > > > -
> > > > > > -#define __no_fgcse __attribute__((optimize("-fno-gcse")))
> > > > >
> > > > > See my reply in the other thread.
> > > > > I prefer
> > > > > -#define __no_fgcse __attribute__((optimize("-fno-gcse")))
> > > > > +#define __no_fgcse 
> > > > > __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer")))
> > > > >
> > > > > Potentially with -fno-asynchronous-unwind-tables.
> > > > >
> > > >
> > > > So how would that work? arm64 has the following:
> > > >
> > > > KBUILD_CFLAGS += -fno-asynchronous-unwind-tables -fno-unwind-tables
> > > >
> > > > ifeq ($(CONFIG_SHADOW_CALL_STACK), y)
> > > > KBUILD_CFLAGS += -ffixed-x18
> > > > endif
> > > >
> > > > and it adds -fpatchable-function-entry=2 for compilers that support
> > > > it, but only when CONFIG_FTRACE is enabled.
> > >
> > > I think you're assuming that GCC drops all flags when it sees 
> > > __attribute__((optimize)).
> > > That's not the case.
> > >
> > 
> > So which flags does it drop, and which doesn't it drop? Is that
> > documented somewhere? Is that the same for all versions of GCC?
> > 
> > > > Also, as Nick pointed out, -fno-gcse does not work on Clang.
> > >
> > > yes and what's the point?
> > > #define __no_fgcse is GCC only. clang doesn't need this workaround.
> > >
> > 
> > Ah ok, that's at least something.
> > 
> > > > Every architecture will have a different set of requirements here. And
> > > > there is no way of knowing which -f options are disregarded when you
> > > > use the function attribute.
> > > >
> > > > So how on earth are you going to #define __no-fgcse correctly for
> > > > every configuration imaginable?
> > > >
> > > > > __attribute__((optimize("")) is not as broken as you're claiming to 
> > > > > be.
> > > > > It has quirky gcc internal logic, but it's still widely used
> > > > > in many software projects.
> > > >
> > > > So it's fine because it is only a little bit broken? I'm sorry, but
> > > > that makes no sense whatsoever.
> > > >
> > > > If you insist on sticking with this broken construct, can you please
> > > > make it GCC/x86-only at least?
> > >
> > > I'm totally fine with making
> > > #define __no_fgcse 
> > > __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer")))
> > > to be gcc+x86 only.
> > > I'd like to get rid of it, but 

Re: [PATCH] x86/build: Fix vmlinux size check on 64-bit

2020-10-28 Thread Arvind Sankar
On Wed, Oct 28, 2020 at 08:43:55PM +0100, Borislav Petkov wrote:
> On Wed, Oct 28, 2020 at 12:45:51PM -0400, Arvind Sankar wrote:
> > You don't want to try to run the kernel from physical address 0 in any
> > case. The default is set to 16MiB to avoid low memory, historically to
> > avoid the 24-bit ISA DMA range.
> 
> Sure, that's why I wrote:
> 
> "... so I guess this should be a range > 0 specification but I guess not
> important."
> 
> So how about a sentence or two alluding to that fact in the help text of
> that option?

It's mentioned in the commit message for ceefccc93932, but yeah, it
would be useful to have in the help text I guess. But that's not really
related to this patch.

> 
> > This doesn't matter for the 64-bit kernel, which can be run from any
> > physical address independent of the RELOCATABLE/PHYSICAL_START settings.
> > It only matters on 32-bit, where VA and PA are tied together by
> > VA == __PAGE_OFFSET + PA
> 
> You mean the kernel text mapping I assume because we do
> 
> #define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
> 
> on 64-bit too but that's the direct mapping of all physical memory.

Yes, I meant the virtual addresses of the kernel symbols: the 32-bit
kernel needs relocation processing to be loaded at a different physical
address, but the 64-bit kernel doesn't unless the virtual address is
also being changed.
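
To put numbers on it (usual defaults assumed, just for illustration):

	/*
	 * On 32-bit the link-time VA of every kernel symbol is tied to the
	 * load PA:
	 *
	 *	VA = __PAGE_OFFSET + PA		(e.g. 0xc0000000 + PA)
	 *
	 * so loading at a different PA changes the symbol addresses and
	 * requires relocation processing.  On 64-bit the kernel text has its
	 * own mapping starting at __START_KERNEL_map, so the load PA can
	 * change freely while the link-time VAs stay put.
	 */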

> 
> > KERNEL_IMAGE_SIZE is _not_ the size of the kernel image, the name is
> > misleading.
> 
> So that needs fixing too, I guess.

It's become ABI I think: looks like it's included by that name in
vmcoreinfo for kexec crash dumps.

> 
> > It is the maximum VA that the kernel can occupy, it is used
> > to prepopulate the PMD-level pagetable for initial boot (level2_kernel_pgt)
> > and is also used to define MODULES_VADDR, so it _is_ talking about
> > mappings. If you have a 30MiB kernel that is placed at a starting VA of
> > 510MiB when KERNEL_IMAGE_SIZE is 512MiB, it won't boot.
> 
> ... because not the whole kernel will be mapped, sure. There's a comment
> above KERNEL_IMAGE_SIZE which could use some of that explanation.

Hm, it also looks like KERNEL_IMAGE_SIZE is entirely unused on 32-bit
except for this linker script check and for KASLR. I'll do a v2 cleaning
up those comments.

> 
> > Increasing vmlinux size can trigger the problem by pushing _end
> > beyond KERNEL_IMAGE_SIZE, but the problem occurs once _end -
> > __START_KERNEL_map exceeds KERNEL_IMAGE_SIZE, not when _end - _text
> > exceeds it, hence this patch.
> 
> Understood - in both cases, once _end goes beyond the 512MiB end of the
> PMD mapping, we've lost. Please add that part to the commit message too
> because we will forget.
> 

That's what this bit in the commit message was trying to explain:
  The check uses (_end - _text), but this is not enough. The initial PMD
  used in startup_64() (level2_kernel_pgt) can only map upto
  KERNEL_IMAGE_SIZE from __START_KERNEL_map, not from _text.


Re: [PATCH v3 1/5] x86/boot/compressed/64: Introduce sev_status

2020-10-28 Thread Arvind Sankar
On Wed, Oct 28, 2020 at 09:23:52AM +0100, Joerg Roedel wrote:
> On Mon, Oct 26, 2020 at 07:27:06PM +0100, Borislav Petkov wrote:
> > A couple of lines above you call get_sev_encryption_bit() which already
> > reads MSR_AMD64_SEV. Why not set sev_status there too instead of reading
> > that MSR again here?
> > 
> > It can read that MSR once and use sev_status(%rip) from then on to avoid
> > reading that MSR multiple times...
> 
> Right, makes sense. I updated the patch.

Hang on, get_sev_encryption_bit() is also called from startup_32(),
so it can't contain any 64-bit instructions to set sev_status.


[tip: x86/mm] x86/mm/ident_map: Check for errors from ident_pud_init()

2020-10-28 Thread tip-bot2 for Arvind Sankar
The following commit has been merged into the x86/mm branch of tip:

Commit-ID: 1fcd009102ee02e217f2e7635ab65517d785da8e
Gitweb:
https://git.kernel.org/tip/1fcd009102ee02e217f2e7635ab65517d785da8e
Author:Arvind Sankar 
AuthorDate:Tue, 27 Oct 2020 19:06:48 -04:00
Committer: Borislav Petkov 
CommitterDate: Wed, 28 Oct 2020 14:48:30 +01:00

x86/mm/ident_map: Check for errors from ident_pud_init()

Commit

  ea3b5e60ce80 ("x86/mm/ident_map: Add 5-level paging support")

added ident_p4d_init() to support 5-level paging, but this function
doesn't check and return errors from ident_pud_init().

For example, the decompressor stub uses this code to create an identity
mapping. If it runs out of pages while trying to allocate a PMD
pagetable, the error will be currently ignored.

Fix this to propagate errors.

 [ bp: Space out statements for better readability. ]

Fixes: ea3b5e60ce80 ("x86/mm/ident_map: Add 5-level paging support")
Signed-off-by: Arvind Sankar 
Signed-off-by: Borislav Petkov 
Reviewed-by: Joerg Roedel 
Acked-by: Kirill A. Shutemov 
Link: https://lkml.kernel.org/r/20201027230648.1885111-1-nived...@alum.mit.edu
---
 arch/x86/mm/ident_map.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index fe7a125..968d700 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -62,6 +62,7 @@ static int ident_p4d_init(struct x86_mapping_info *info, 
p4d_t *p4d_page,
  unsigned long addr, unsigned long end)
 {
unsigned long next;
+   int result;
 
for (; addr < end; addr = next) {
p4d_t *p4d = p4d_page + p4d_index(addr);
@@ -73,13 +74,20 @@ static int ident_p4d_init(struct x86_mapping_info *info, 
p4d_t *p4d_page,
 
if (p4d_present(*p4d)) {
pud = pud_offset(p4d, 0);
-   ident_pud_init(info, pud, addr, next);
+   result = ident_pud_init(info, pud, addr, next);
+   if (result)
+   return result;
+
continue;
}
pud = (pud_t *)info->alloc_pgt_page(info->context);
if (!pud)
return -ENOMEM;
-   ident_pud_init(info, pud, addr, next);
+
+   result = ident_pud_init(info, pud, addr, next);
+   if (result)
+   return result;
+
set_p4d(p4d, __p4d(__pa(pud) | info->kernpg_flag));
}
 


Re: [PATCH] x86/build: Fix vmlinux size check on 64-bit

2020-10-28 Thread Arvind Sankar
On Wed, Oct 28, 2020 at 02:39:09PM +0100, Borislav Petkov wrote:
> On Tue, Oct 27, 2020 at 05:14:22PM -0400, Arvind Sankar wrote:
> > This is indeed just a small correctness fixlet, but I'm not following
> > the rest of your comments.
> 
> I'm just trying to make sense of that house of cards we have here.
> 
> > PHYSICAL_START has an effect independent of the setting of
> > RELOCATABLE.
> 
> Theoretically you can set PHYSICAL_START to 0x0:
> 
> config PHYSICAL_START
> hex "Physical address where the kernel is loaded" if (EXPERT || 
> CRASH_DUMP)
> default "0x100"
> help
>   This gives the physical address where the kernel is loaded.
> 
>   If kernel is a not relocatable (CONFIG_RELOCATABLE=n) then
>   bzImage will decompress itself to above physical address and
>   run from there.
> ^^
> 
> and disable RELOCATABLE:
> 
> CONFIG_PHYSICAL_START=0x0
> # CONFIG_RELOCATABLE is not set
> 
> but then you hit this:
> 
> ld: per-CPU data too large - increase CONFIG_PHYSICAL_START
> 
> full output at the end of the mail.

You don't want to try to run the kernel from physical address 0 in any
case. The default is set to 16MiB to avoid low memory, historically to
avoid the 24-bit ISA DMA range.

> > That said, AFAICT, RELOCATABLE and PHYSICAL_START look like historical
> > artifacts at this point: RELOCATABLE should be completely irrelevant for
> > the 64-bit kernel, and there's really no reason to be able to configure
> > the start VA of the kernel, that should just be constant independent of
> > PHYSICAL_START.
> 
> See the CONFIG_PHYSICAL_START help text. Apparently there has been a
> use case where one can set PHYSICAL_START to the region where a kdump
> kernel is going to be loaded and that kdump kernel is a vmlinux and not
> a bzImage and thus not relocatable.

This doesn't matter for the 64-bit kernel, which can be run from any
physical address independent of the RELOCATABLE/PHYSICAL_START settings.
It only matters on 32-bit, where VA and PA are tied together by
VA == __PAGE_OFFSET + PA
On 64-bit, the kernel's location in VA space and physical space can be
independently moved around, so a kernel that starts at 16MiB in VA space
can be loaded anywhere above 16MiB in physical space.

> 
> Going back to the question at hand, if you think about it, the kernel
> image *is* between _text or _stext and _end. And KERNEL_IMAGE_SIZE is
> exactly what it is - the size of the kernel image.
> 
> Now, if you were talking about a kernel *mapping* size, then I'd
> understand but this check is for the kernel *image* size.
> 

KERNEL_IMAGE_SIZE is _not_ the size of the kernel image, the name is
misleading. It is the maximum VA that the kernel can occupy, it is used
to prepopulate the PMD-level pagetable for initial boot (level2_kernel_pgt)
and is also used to define MODULES_VADDR, so it _is_ talking about
mappings. If you have a 30MiB kernel that is placed at a starting VA of
510MiB when KERNEL_IMAGE_SIZE is 512MiB, it won't boot.

> But reading that commit message again:
> 
> these build-time and link-time checks would have prevented the
> vmlinux size regression.
> 
> this *is* talking about vmlinux size and that starts at _text...
> 

Increasing vmlinux size can trigger the problem by pushing _end beyond
KERNEL_IMAGE_SIZE, but the problem occurs once _end - __START_KERNEL_map
exceeds KERNEL_IMAGE_SIZE, not when _end - _text exceeds it, hence this
patch.
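
To make that concrete with the usual defconfig values (numbers are
illustrative only, not taken from the patch):

	/*
	 * __START_KERNEL_map = 0xffffffff80000000
	 * _text              = __START_KERNEL_map + 16 MiB (CONFIG_PHYSICAL_START)
	 * KERNEL_IMAGE_SIZE  = 512 MiB
	 *
	 * level2_kernel_pgt only covers the first 512 MiB above
	 * __START_KERNEL_map, so the real constraint is
	 *
	 *	_end - __START_KERNEL_map <= 512 MiB
	 *
	 * i.e. _end - _text can be at most 496 MiB.  A check on
	 * (_end - _text) alone would still pass for anything up to 512 MiB
	 * and miss an overflow in the last 16 MiB.
	 */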


[PATCH] x86/mm/ident_map: Check for errors from ident_pud_init()

2020-10-27 Thread Arvind Sankar
Commit
  ea3b5e60ce80 ("x86/mm/ident_map: Add 5-level paging support")
added ident_p4d_init() to support 5-level paging, but this function
doesn't check and return errors from ident_pud_init().

For example, the decompressor stub uses this code to create an identity
mapping. If it runs out of pages while trying to allocate a PMD
pagetable, the error will be currently ignored.

Fix this to propagate errors.

Signed-off-by: Arvind Sankar 
Fixes: ea3b5e60ce80 ("x86/mm/ident_map: Add 5-level paging support")
---
 arch/x86/mm/ident_map.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/ident_map.c b/arch/x86/mm/ident_map.c
index fe7a12599d8e..5ecb0883cc88 100644
--- a/arch/x86/mm/ident_map.c
+++ b/arch/x86/mm/ident_map.c
@@ -62,6 +62,7 @@ static int ident_p4d_init(struct x86_mapping_info *info, 
p4d_t *p4d_page,
  unsigned long addr, unsigned long end)
 {
unsigned long next;
+   int result;
 
for (; addr < end; addr = next) {
p4d_t *p4d = p4d_page + p4d_index(addr);
@@ -73,13 +74,17 @@ static int ident_p4d_init(struct x86_mapping_info *info, 
p4d_t *p4d_page,
 
if (p4d_present(*p4d)) {
pud = pud_offset(p4d, 0);
-   ident_pud_init(info, pud, addr, next);
+   result = ident_pud_init(info, pud, addr, next);
+   if (result)
+   return result;
continue;
}
pud = (pud_t *)info->alloc_pgt_page(info->context);
if (!pud)
return -ENOMEM;
-   ident_pud_init(info, pud, addr, next);
+   result = ident_pud_init(info, pud, addr, next);
+   if (result)
+   return result;
set_p4d(p4d, __p4d(__pa(pud) | info->kernpg_flag));
}
 
-- 
2.26.2



Re: [PATCH v6 13/29] arm64/build: Assert for unwanted sections

2020-10-27 Thread Arvind Sankar
On Tue, Oct 27, 2020 at 01:40:43PM -0700, Nick Desaulniers wrote:
> On Tue, Oct 27, 2020 at 1:30 PM Arvind Sankar  wrote:
> >
> > On Tue, Oct 27, 2020 at 01:17:55PM -0700, Nick Desaulniers wrote:
> > > > >  (I feel the same about there
> > > > > being an empty asm(); statement in the definition of asm_volatile_goto
> > > > > for compiler-gcc.h).  Might be time to "fix the compiler."
> > > > >
> > > > > (It sounds like Arvind is both in agreement with my sentiment, and has
> > > > > the root cause).
> > > > >
> > Btw, the bug mentioned in asm_volatile_goto seems like its been fixed in
> > 4.9, so the hack could be dropped now?
> 
> https://lore.kernel.org/lkml/20180907222109.163802-1-ndesaulni...@google.com/
> 
> For the life of me I can't find Linus' response.  Maybe he shot it
> down in the PR, but I can't find it...Miguel do you recall?  I could
> paraphrase, but might be better to not rely on my memory.
> -- 
> Thanks,
> ~Nick Desaulniers

You couldn't find it in July either :)
https://lkml.org/lkml/2020/7/10/1026

Possibly he didn't like the version check? That should be unnecessary now.


Re: [PATCH] x86/build: Fix vmlinux size check on 64-bit

2020-10-27 Thread Arvind Sankar
On Tue, Oct 27, 2020 at 09:08:03PM +0100, Borislav Petkov wrote:
> On Mon, Oct 05, 2020 at 11:15:39AM -0400, Arvind Sankar wrote:
> > Commit b4e0409a36f4 ("x86: check vmlinux limits, 64-bit") added a check
> > that the size of the 64-bit kernel is less than KERNEL_IMAGE_SIZE.
> > 
> > The check uses (_end - _text), but this is not enough. The initial PMD
> > used in startup_64() (level2_kernel_pgt) can only map upto
> > KERNEL_IMAGE_SIZE from __START_KERNEL_map, not from _text.
> > 
> > The correct check is the same as for 32-bit, since LOAD_OFFSET is
> > defined appropriately for the two architectures. Just check
> > (_end - LOAD_OFFSET) against KERNEL_IMAGE_SIZE unconditionally.
> > 
> > Signed-off-by: Arvind Sankar 
> > ---
> >  arch/x86/kernel/vmlinux.lds.S | 11 ++-
> >  1 file changed, 2 insertions(+), 9 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> > index bf9e0adb5b7e..b38832821b98 100644
> > --- a/arch/x86/kernel/vmlinux.lds.S
> > +++ b/arch/x86/kernel/vmlinux.lds.S
> > @@ -454,13 +454,12 @@ SECTIONS
> > ASSERT(SIZEOF(.rela.dyn) == 0, "Unexpected run-time relocations (.rela) 
> > detected!")
> >  }
> >  
> > -#ifdef CONFIG_X86_32
> >  /*
> >   * The ASSERT() sink to . is intentional, for binutils 2.14 compatibility:
> >   */
> >  . = ASSERT((_end - LOAD_OFFSET <= KERNEL_IMAGE_SIZE),
> >"kernel image bigger than KERNEL_IMAGE_SIZE");
> > -#else
> > +#ifdef CONFIG_X86_64
> >  /*
> >   * Per-cpu symbols which need to be offset from __per_cpu_load
> >   * for the boot processor.
> > @@ -470,18 +469,12 @@ INIT_PER_CPU(gdt_page);
> >  INIT_PER_CPU(fixed_percpu_data);
> >  INIT_PER_CPU(irq_stack_backing_store);
> >  
> > -/*
> > - * Build-time check on the image size:
> > - */
> > -. = ASSERT((_end - _text <= KERNEL_IMAGE_SIZE),
> > -  "kernel image bigger than KERNEL_IMAGE_SIZE");
> 
> So we have this:
> 
> SECTIONS
> {   
> #ifdef CONFIG_X86_32
> . = LOAD_OFFSET + LOAD_PHYSICAL_ADDR;
> phys_startup_32 = ABSOLUTE(startup_32 - LOAD_OFFSET);
> #else 
> . = __START_KERNEL;
>   ^^
> 
> which sets the location counter to
> 
> #define __START_KERNEL  (__START_KERNEL_map + __PHYSICAL_START)
> 
> which is 0xffffffff80000000 + ALIGN(CONFIG_PHYSICAL_START, CONFIG_PHYSICAL_ALIGN)
> 
> and that second term after the '+' has effect only when
> CONFIG_RELOCATABLE=n and that's not really used on modern kernel configs
> as RELOCATABLE is selected by EFI_STUB and RANDOMIZE_BASE depends on at
> and and ...
> 
> So IOW, in a usual .config we have:
> 
> __START_KERNEL_map at 0xffffffff80000000
> _text              at 0xffffffff81000000
> 
> So practically and for the majority of configs, the kernel image really
> does start at _text and not at __START_KERNEL_map and we map 16Mb which
> is 4 PMDs of unused pages. So basically you're correcting that here -
> that the number tested against KERNEL_IMAGE_SIZE is 16Mb more.
> 
> Yes, no?
> 
> Or am I missing some more important aspect and this is more than just a
> small correctness fixlet?
> 
> Thx.
> 

This is indeed just a small correctness fixlet, but I'm not following
the rest of your comments. PHYSICAL_START has an effect independent of
the setting of RELOCATABLE. It's where the kernel image starts in
virtual address space, as shown by the 16MiB difference between
__START_KERNEL_map and _text in the usual .config situation. In all
configs, not just majority, the kernel image itself starts at _text. The
16MiB gap below _text is not actually mapped, but the important point is
that the way the initial construction of pagetables is currently setup,
the code cannot map anything above __START_KERNEL_map + KERNEL_IMAGE_SIZE,
so _end needs to be below that.

If KASLR was disabled (either at build-time or run-time), these
link-time addresses are where the kernel actually lives (in VA space);
and if it was enabled, it will make sure to place the _end of the kernel
below KERNEL_IMAGE_SIZE when choosing a random virtual location.

That said, AFAICT, RELOCATABLE and PHYSICAL_START look like historical
artifacts at this point: RELOCATABLE should be completely irrelevant for
the 64-bit kernel, and there's really no reason to be able to configure
the start VA of the kernel, that should just be constant independent of
PHYSICAL_START.

Thanks.


Re: [PATCH v6 13/29] arm64/build: Assert for unwanted sections

2020-10-27 Thread Arvind Sankar
On Tue, Oct 27, 2020 at 01:28:02PM -0700, Nick Desaulniers wrote:
> > commit 3193c0836f203a91bef96d88c64cccf0be090d9c
> > Author: Josh Poimboeuf 
> > Date:   Wed Jul 17 20:36:45 2019 -0500
> >
> > bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
> >
> > has
> >
> > Fixes: e55a73251da3 ("bpf: Fix ORC unwinding in non-JIT BPF code")
> >
> > and mentions objtool and CONFIG_RETPOLINE.
> 
> Thanks for the context.  It might be time to revisit the above commit.
> If I revert it (small conflict that's easy to fixup),
> kernel/bpf/core.o builds cleanly with defconfig+GCC-9.3, so maybe
> obtool did get smart enough to handle that case?  Probably regresses
> the performance of that main dispatch loop for BPF, but not sure what
> folks are expecting when retpolines are enabled.
> -- 
> Thanks,
> ~Nick Desaulniers

The objtool issue was with RETPOLINE disabled.


Re: [PATCH v6 13/29] arm64/build: Assert for unwanted sections

2020-10-27 Thread Arvind Sankar
On Tue, Oct 27, 2020 at 01:17:55PM -0700, Nick Desaulniers wrote:
> On Tue, Oct 27, 2020 at 1:15 PM Ard Biesheuvel  wrote:
> >
> > On Tue, 27 Oct 2020 at 21:12, Nick Desaulniers  
> > wrote:
> > >
> > > On Tue, Oct 27, 2020 at 12:25 PM Geert Uytterhoeven
> > >  wrote:
> > > >
> > > > Hi Nick,
> > > >
> > > > CC Josh
> > > >
> > > > On Mon, Oct 26, 2020 at 6:49 PM Nick Desaulniers
> > > >  wrote:
> > > > > On Mon, Oct 26, 2020 at 10:44 AM Geert Uytterhoeven
> > > > >  wrote:
> > > > > > On Mon, Oct 26, 2020 at 6:39 PM Ard Biesheuvel  
> > > > > > wrote:
> > > > > > > On Mon, 26 Oct 2020 at 17:01, Geert Uytterhoeven 
> > > > > > >  wrote:
> > > > > > > > On Mon, Oct 26, 2020 at 2:29 PM Geert Uytterhoeven 
> > > > > > > >  wrote:
> > > > > > > > > On Mon, Oct 26, 2020 at 1:29 PM Geert Uytterhoeven 
> > > > > > > > >  wrote:
> > > > > > > > > > I.e. including the ".eh_frame" warning. I have tried 
> > > > > > > > > > bisecting that
> > > > > > > > > > warning (i.e. with be2881824ae9eb92 reverted), but that 
> > > > > > > > > > leads me to
> > > > > > > > > > commit b3e5d80d0c48c0cc ("arm64/build: Warn on orphan 
> > > > > > > > > > section
> > > > > > > > > > placement"), which is another red herring.
> > > > > > > > >
> > > > > > > > > kernel/bpf/core.o is the only file containing an eh_frame 
> > > > > > > > > section,
> > > > > > > > > causing the warning.
> > > > >
> > > > > When I see .eh_frame, I think -fno-asynchronous-unwind-tables is
> > > > > missing from someone's KBUILD_CFLAGS.
> > > > > But I don't see anything curious in kernel/bpf/Makefile, unless
> > > > > cc-disable-warning is somehow broken.
> > > >
> > > > I tracked it down to kernel/bpf/core.c:___bpf_prog_run() being tagged
> > > > with __no_fgcse aka __attribute__((optimize("-fno-gcse"))).
> > > >
> > > > Even if the function is trivially empty ("return 0;"), a ".eh_frame" 
> > > > section
> > > > is generated.  Removing the __no_fgcse tag fixes that.
> > >
> > > That's weird.  I feel pretty strongly that unless we're working around
> > > a well understood compiler bug with a comment that links to a
> > > submitted bug report, turning off rando compiler optimizations is a
> > > terrible hack for which one must proceed straight to jail; do not pass
> > > go; do not collect $200.  But maybe I'd feel differently for this case
> > > given the context of the change that added it.  (Ard mentions
> > > retpolines+orc+objtool; can someone share the relevant SHA if you have
> > > it handy so I don't have to go digging?)
> >
> > commit 3193c0836f203a91bef96d88c64cccf0be090d9c
> > Author: Josh Poimboeuf 
> > Date:   Wed Jul 17 20:36:45 2019 -0500
> >
> > bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()
> >
> > has
> >
> > Fixes: e55a73251da3 ("bpf: Fix ORC unwinding in non-JIT BPF code")
> >
> > and mentions objtool and CONFIG_RETPOLINE.
> >
> > >  (I feel the same about there
> > > being an empty asm(); statement in the definition of asm_volatile_goto
> > > for compiler-gcc.h).  Might be time to "fix the compiler."
> > >
> > > (It sounds like Arvind is both in agreement with my sentiment, and has
> > > the root cause).
> > >
> >
> > I agree that the __no_fgcse hack is terrible. Does Clang support the
> > following pragmas?
> >
> > #pragma GCC push_options
> > #pragma GCC optimize ("-fno-gcse")
> > #pragma GCC pop_options
> >
> > ?
> 
> Put it in godbolt.org.  Pretty sure it's `#pragma clang` though.
> `#pragma GCC` might be supported in clang or silently ignored, but
> IIRC pragmas were a bit of a compat nightmare.  I think Arnd wrote
> some macros to set pragmas based on toolchain.  (Uses _Pragma, for
> pragmas in macros, IIRC).
> 
> -- 
> Thanks,
> ~Nick Desaulniers

https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html#Function-Specific-Option-Pragmas

#pragma GCC optimize is equivalent to the function attribute, so does
that actually help?

Btw, the bug mentioned in asm_volatile_goto seems like its been fixed in
4.9, so the hack could be dropped now?
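
For reference, the hack in question is the empty asm() appended to every
asm goto statement; roughly (quoting include/linux/compiler-gcc.h from
memory, so the exact wording may differ):

	/*
	 * Older GCC could miscompile certain 'asm goto' sequences; the
	 * trailing empty asm("") acts as a barrier that works around it.
	 */
	#define asm_volatile_goto(x...)	do { asm goto(x); asm (""); } while (0)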


Re: [PATCH v6 13/29] arm64/build: Assert for unwanted sections

2020-10-27 Thread Arvind Sankar
On Tue, Oct 27, 2020 at 08:33:00PM +0100, Ard Biesheuvel wrote:
> On Tue, 27 Oct 2020 at 20:25, Geert Uytterhoeven  wrote:
> > >
> > > When I see .eh_frame, I think -fno-asynchronous-unwind-tables is
> > > missing from someone's KBUILD_CFLAGS.
> > > But I don't see anything curious in kernel/bpf/Makefile, unless
> > > cc-disable-warning is somehow broken.
> >
> > I tracked it down to kernel/bpf/core.c:___bpf_prog_run() being tagged
> > with __no_fgcse aka __attribute__((optimize("-fno-gcse"))).
> >
> > Even if the function is trivially empty ("return 0;"), a ".eh_frame" section
> > is generated.  Removing the __no_fgcse tag fixes that.
> >
> 
> 
> Given that it was added for issues related to retpolines, ORC and
> objtool, it should be safe to make that annotation x86-only.

The optimize attribute is not meant for production use. I had mentioned
this at the time but it got lost: the optimize attribute apparently does
not add options, it replaces them completely. So I'm guessing this one
is dropping the -fno-asynchronous-unwind-tables and causing the eh_frame
sections, though I don't know why that doesn't cause eh_frame on x86?

https://lore.kernel.org/lkml/alpine.lsu.2.21.2004151445520.11...@wotan.suse.de/


Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel

2020-10-27 Thread Arvind Sankar
On Tue, Oct 27, 2020 at 03:40:07PM +0300, Kirill A. Shutemov wrote:
> On Sat, Oct 24, 2020 at 08:41:58PM -0400, Arvind Sankar wrote:
> > Hi, I think the definition of BOOT_PGT_SIZE in
> > arch/x86/include/asm/boot.h is insufficient, especially after
> >   ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
> > 
> > Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
> > 19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.
> > 
> > - The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
> >   debug messages, but warnings/errors are always output to VGA memory,
> >   so the two extra pages for mapping video RAM are always needed.
> > 
> > - The calculation wasn't updated for X86_5LEVEL, which requires at least
> >   one more page for the P4D level, and in theory could require two extra
> >   pages for each of the 4 mappings (compressed kernel, output kernel,
> >   boot_params and command line), though that would require a system with
> >   truly ginormous amounts of RAM.
> 
> Or sparse physical memory map. I hacked QEMU before for testing 5-level
> paging:
> 
> https://gist.github.com/kiryl/d45eb54110944ff95e544972d8bdac1d
> 
> > - If KASLR is disabled, there are only 6 pages, but now that we're
> >   always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
> >   two PUD and two PMD pages for kernel, boot_params and command line),
> >   and 2 more pages for the video RAM, and more for 5-level. Even for
> >   !RELOCATABLE, 13 pages might be needed.
> 
> The comment for BOOT_PGT_SIZE has to be updated.
> 
> BTW, what happens if we underestimate BOOT_PGT_SIZE? Do we overwrite
> something?

No, it checks whether it ran out of pages, so it will just error out and
hang.
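
(For reference, a rough sketch of the allocation callback that enforces
this in the decompressor -- simplified from memory, so names and details
may differ from the tree:)

	struct alloc_pgt_data {
		unsigned char	*pgt_buf;
		unsigned long	pgt_buf_size;
		unsigned long	pgt_buf_offset;
	};

	static void *alloc_pgt_page(void *context)
	{
		struct alloc_pgt_data *pages = context;
		unsigned char *entry;

		/* Refuse to hand out more than the preallocated .pgtable
		 * area; the identity-mapping code turns a NULL return into
		 * an error instead of overwriting whatever follows. */
		if (pages->pgt_buf_offset >= pages->pgt_buf_size)
			return NULL;

		entry = pages->pgt_buf + pages->pgt_buf_offset;
		pages->pgt_buf_offset += PAGE_SIZE;

		return entry;
	}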

> 
> > - SEV-ES needs one more page because it needs to do a PTE-level mapping
> >   for the GHCB page.
> > 
> > - The static calculation is also busted because
> >   boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
> >   configuration tables and the EFI memmap, and none of these are
> >   accounted for. They used to be scanned while still on the
> >   firmware/bootloader page tables, but now our page tables have to cover
> >   them as well. Trying to add up the worst case for all of these, and
> >   anything else the compressed kernel might potentially access seems
> >   like a lost cause.
> > 
> > We could do something similar to what the main kernel does with
> > early_dynamic_pgts: map the compressed kernel at a fixed virtual
> > address (in negative address space, say); recycle all the other mappings
> > until we're done with decompression, and then map the output,
> > boot_params and command line. The number of pages needed for this can be
> > statically calculated, for 4-level paging we'd need 2 pages for the
> > fixed mapping, 12 pages for the other three, and one PGD page.
> 
> Recycling idea look promising to me, but it would require handling #PF in
> decompression code, right? It is considerable complication of the code.
> 

The #PF handler is already there now with the SEV-ES series, but I agree
it would still complicate things. It's simpler to just increase
BOOT_PGT_SIZE and make it unconditional (i.e. bump it to say 32 or 64
even if !KASLR). It's @nobits anyway so it would not increase the size
of the bzImage, just require a slightly larger memory allocation by the
bootloader.

Another alternative is reusing the KASLR code, which contains a memory
allocator, and using it to find system memory for the page tables, but
that also seems like an over-engineered approach.
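
Just to restate the worst-case arithmetic from above in one place (the
macro name and exact accounting are illustrative only):

	/*
	 * Hypothetical sketch of the 4-level worst case discussed above:
	 * one PGD, 2 PUD + 2 PMD pages for each of the kernel, boot_params
	 * and command-line mappings, 2 pages for the VGA text buffer, and
	 * one PTE page for the SEV-ES GHCB.
	 */
	#define BOOT_PGT_PAGES_4LEVEL	(1 + (2 + 2) * 3 + 2 + 1)	/* = 16 */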


Re: [PATCH] firmware: tegra: fix strncpy()/strncat() confusion

2020-10-26 Thread Arvind Sankar
On Mon, Oct 26, 2020 at 05:10:19PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The way that bpmp_populate_debugfs_inband() uses strncpy()
> and strncat() makes no sense since the size argument for
> the first is insufficient to contain the trailing '/'
> and the second passes the length of the input rather than
> the output, which triggers a warning:
> 
> In function 'strncat',
> inlined from 'bpmp_populate_debugfs_inband' at 
> ../drivers/firmware/tegra/bpmp-debugfs.c:422:4:
> include/linux/string.h:289:30: warning: '__builtin_strncat' specified bound 
> depends on the length of the source argument [-Wstringop-overflow=]
>   289 | #define __underlying_strncat __builtin_strncat
>   |  ^
> include/linux/string.h:367:10: note: in expansion of macro 
> '__underlying_strncat'
>   367 |   return __underlying_strncat(p, q, count);
>   |  ^~~~
> drivers/firmware/tegra/bpmp-debugfs.c: In function 
> 'bpmp_populate_debugfs_inband':
> include/linux/string.h:288:29: note: length computed here
>   288 | #define __underlying_strlen __builtin_strlen
>   | ^
> include/linux/string.h:321:10: note: in expansion of macro 
> '__underlying_strlen'
>   321 |   return __underlying_strlen(p);
> 
> Simplify this to use an snprintf() instead.
> 
> Fixes: 5e37b9c137ee ("firmware: tegra: Add support for in-band debug")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/firmware/tegra/bpmp-debugfs.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/drivers/firmware/tegra/bpmp-debugfs.c 
> b/drivers/firmware/tegra/bpmp-debugfs.c
> index c1bbba9ee93a..9ec20ddc9a6b 100644
> --- a/drivers/firmware/tegra/bpmp-debugfs.c
> +++ b/drivers/firmware/tegra/bpmp-debugfs.c
> @@ -412,16 +412,12 @@ static int bpmp_populate_debugfs_inband(struct 
> tegra_bpmp *bpmp,
>   goto out;
>   }
>  
> - len = strlen(ppath) + strlen(name) + 1;
> + len = snprintf("%s%s/", pathlen, ppath, name);

Didn't you get any warnings with this? It should be
len = snprintf(pathbuf, pathlen, "%s%s/", ppath, name);
right?

>   if (len >= pathlen) {
>   err = -EINVAL;
>   goto out;
>   }
>  
> - strncpy(pathbuf, ppath, pathlen);
> - strncat(pathbuf, name, strlen(name));
> - strcat(pathbuf, "/");
> -
>   err = bpmp_populate_debugfs_inband(bpmp, dentry,
>  pathbuf);
>   if (err < 0)
> -- 
> 2.27.0
> 


Re: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops

2020-10-25 Thread Arvind Sankar
On Sun, Oct 25, 2020 at 11:23:52PM +, David Laight wrote:
> From: Arvind Sankar
> > Sent: 25 October 2020 20:18
> > 
> > On Sun, Oct 25, 2020 at 06:51:18PM +, David Laight wrote:
> > > From: Arvind Sankar
> > > > Sent: 25 October 2020 14:31
> > > >
> > > > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > > > (tested on Broadwell Xeon) while not increasing code size too much.
> > >
> > > I can't believe unrolling the BLEND loop makes any difference.
> > 
> > It's actually the BLEND loop that accounts for almost all of the
> > difference. The LOAD loop doesn't matter much in general: even replacing
> > it with a plain memcpy() only increases performance by 3-4%. But
> > unrolling it is low cost in code size terms, and clang actually does it
> > without being asked.
> 
> (memcpy is wrong - misses the byte swaps).

I know it's wrong, the point is that it's impossible to gain very much
from optimizing the LOAD loop because it doesn't account for much of the
total time.

> 
> That's odd, the BLEND loop is about 20 instructions.
> I wouldn't expect unrolling to help - unless you manage
> to use 16 registers for the active W[] values.
> 

I am not sure about what's going on inside the hardware, but even with
a straightforward assembly version that just reads out of memory the way
the calculation is specified, unrolling the BLEND loop 8x speeds up the
performance by 7-8%.

The compiler is actually pretty bad here: just translating everything
into assembler, with no attempt to optimize anything, gets a 10-12%
speedup over the C version.


Re: [PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops

2020-10-25 Thread Arvind Sankar
On Sun, Oct 25, 2020 at 06:51:18PM +, David Laight wrote:
> From: Arvind Sankar
> > Sent: 25 October 2020 14:31
> > 
> > Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
> > (tested on Broadwell Xeon) while not increasing code size too much.
> 
> I can't believe unrolling the BLEND loop makes any difference.

It's actually the BLEND loop that accounts for almost all of the
difference. The LOAD loop doesn't matter much in general: even replacing
it with a plain memcpy() only increases performance by 3-4%. But
unrolling it is low cost in code size terms, and clang actually does it
without being asked.

> WRT patch 5.
> On the C2758 the original unrolled code is slightly faster.
> On the i7-7700 the 8 unroll is a bit faster 'hot cache',
> but slower 'cold cache' - probably because of the d-cache
> loads for K[].
> 
> Non-x86 architectures might need to use d-cache reads for
> the 32bit 'K' constants even in the unrolled loop.
> X86 can use 'lea' with a 32bit offset to avoid data reads.
> So the cold-cache case for the old code may be similar.

Not sure I follow: in the old code, the K's are 32-bit immediates, so
they should come from the i-cache whether an add or an lea is used?

Why is the cold-cache case relevant anyway? If the code is only being
executed a couple of times or so, i.e. you're hashing a single, say,
64-128 byte input once in a blue moon, the performance of the hash
doesn't really matter, no?

> 
> Interestingly I had to write an asm ror32() to get reasonable
> code (in userspace). The C version the kernel uses didn't work.
> 
>   David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)
> 


Re: [PATCH] efi/x86: Only copy the compressed kernel image in efi_relocate_kernel()

2020-10-25 Thread Arvind Sankar
On Sun, Oct 25, 2020 at 05:28:06PM +0100, Ard Biesheuvel wrote:
> On Sun, 25 Oct 2020 at 17:19, Arvind Sankar  wrote:
> >
> > On Sun, Oct 11, 2020 at 10:20:12AM -0400, Arvind Sankar wrote:
> > > The image_size argument to efi_relocate_kernel() is currently specified
> > > as init_size, but this is unnecessarily large. The compressed kernel is
> > > much smaller, in fact, its image only extends up to the start of _bss,
> > > since at this point, the .bss section is still uninitialized.
> > >
> > > Depending on compression level, this can reduce the amount of data
> > > copied by 4-5x.
> > >
> > > Signed-off-by: Arvind Sankar 
> >
> > Ping
> >
> 
> I'll pick this up as a fix once the merge window closes.

Thanks!

> 
> > > ---
> > >  drivers/firmware/efi/libstub/x86-stub.c | 5 -
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> > > b/drivers/firmware/efi/libstub/x86-stub.c
> > > index 3672539cb96e..f14c4ff5839f 100644
> > > --- a/drivers/firmware/efi/libstub/x86-stub.c
> > > +++ b/drivers/firmware/efi/libstub/x86-stub.c
> > > @@ -715,8 +715,11 @@ unsigned long efi_main(efi_handle_t handle,
> > >   (IS_ENABLED(CONFIG_X86_32) && buffer_end > KERNEL_IMAGE_SIZE)   
> > >  ||
> > >   (IS_ENABLED(CONFIG_X86_64) && buffer_end > 
> > > MAXMEM_X86_64_4LEVEL) ||
> > >   (image_offset == 0)) {
> > > + extern char _bss[];
> > > +
> > >   status = efi_relocate_kernel(&bzimage_addr,
> > > -  hdr->init_size, hdr->init_size,
> > > +  (unsigned long)_bss - 
> > > bzimage_addr,
> > > +  hdr->init_size,
> > >hdr->pref_address,
> > >hdr->kernel_alignment,
> > >LOAD_PHYSICAL_ADDR);
> > > --
> > > 2.26.2
> > >


Re: [PATCH] efi/x86: Only copy the compressed kernel image in efi_relocate_kernel()

2020-10-25 Thread Arvind Sankar
On Sun, Oct 11, 2020 at 10:20:12AM -0400, Arvind Sankar wrote:
> The image_size argument to efi_relocate_kernel() is currently specified
> as init_size, but this is unnecessarily large. The compressed kernel is
> much smaller, in fact, its image only extends up to the start of _bss,
> since at this point, the .bss section is still uninitialized.
> 
> Depending on compression level, this can reduce the amount of data
> copied by 4-5x.
> 
> Signed-off-by: Arvind Sankar 

Ping

> ---
>  drivers/firmware/efi/libstub/x86-stub.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/firmware/efi/libstub/x86-stub.c 
> b/drivers/firmware/efi/libstub/x86-stub.c
> index 3672539cb96e..f14c4ff5839f 100644
> --- a/drivers/firmware/efi/libstub/x86-stub.c
> +++ b/drivers/firmware/efi/libstub/x86-stub.c
> @@ -715,8 +715,11 @@ unsigned long efi_main(efi_handle_t handle,
>   (IS_ENABLED(CONFIG_X86_32) && buffer_end > KERNEL_IMAGE_SIZE)||
>   (IS_ENABLED(CONFIG_X86_64) && buffer_end > MAXMEM_X86_64_4LEVEL) ||
>   (image_offset == 0)) {
> + extern char _bss[];
> +
>   status = efi_relocate_kernel(&bzimage_addr,
> -  hdr->init_size, hdr->init_size,
> +  (unsigned long)_bss - bzimage_addr,
> +  hdr->init_size,
>hdr->pref_address,
>hdr->kernel_alignment,
>LOAD_PHYSICAL_ADDR);
> -- 
> 2.26.2
> 


[PATCH v4 6/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops

2020-10-25 Thread Arvind Sankar
Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
(tested on Broadwell Xeon) while not increasing code size too much.

Signed-off-by: Arvind Sankar 
Reviewed-by: Eric Biggers 
---
 lib/crypto/sha256.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index e2e29d9b0ccd..cdef37c05972 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -76,12 +76,28 @@ static void sha256_transform(u32 *state, const u8 *input, 
u32 *W)
int i;
 
/* load the input */
-   for (i = 0; i < 16; i++)
-   LOAD_OP(i, W, input);
+   for (i = 0; i < 16; i += 8) {
+   LOAD_OP(i + 0, W, input);
+   LOAD_OP(i + 1, W, input);
+   LOAD_OP(i + 2, W, input);
+   LOAD_OP(i + 3, W, input);
+   LOAD_OP(i + 4, W, input);
+   LOAD_OP(i + 5, W, input);
+   LOAD_OP(i + 6, W, input);
+   LOAD_OP(i + 7, W, input);
+   }
 
/* now blend */
-   for (i = 16; i < 64; i++)
-   BLEND_OP(i, W);
+   for (i = 16; i < 64; i += 8) {
+   BLEND_OP(i + 0, W);
+   BLEND_OP(i + 1, W);
+   BLEND_OP(i + 2, W);
+   BLEND_OP(i + 3, W);
+   BLEND_OP(i + 4, W);
+   BLEND_OP(i + 5, W);
+   BLEND_OP(i + 6, W);
+   BLEND_OP(i + 7, W);
+   }
 
/* load the state into our registers */
a = state[0];  b = state[1];  c = state[2];  d = state[3];
-- 
2.26.2



[PATCH v4 5/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

2020-10-25 Thread Arvind Sankar
This reduces code size substantially (on x86_64 with gcc-10 the size of
sha256_update() goes from 7593 bytes to 1952 bytes including the new
SHA256_K array), and on x86 is slightly faster than the full unroll
(tested on Broadwell Xeon).

Signed-off-by: Arvind Sankar 
Reviewed-by: Eric Biggers 
---
 lib/crypto/sha256.c | 174 ++--
 1 file changed, 38 insertions(+), 136 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index c6bfeacc5b81..e2e29d9b0ccd 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -18,6 +18,25 @@
 #include 
 #include 
 
+static const u32 SHA256_K[] = {
+   0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
+   0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+   0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
+   0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+   0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
+   0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+   0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
+   0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+   0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
+   0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+   0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
+   0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+   0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
+   0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+   0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
+   0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
+};
+
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
return z ^ (x & (y ^ z));
@@ -43,9 +62,17 @@ static inline void BLEND_OP(int I, u32 *W)
W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
 }
 
+#define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {   \
+   u32 t1, t2; \
+   t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];  \
+   t2 = e0(a) + Maj(a, b, c);  \
+   d += t1;\
+   h = t1 + t2;\
+} while (0)
+
 static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 {
-   u32 a, b, c, d, e, f, g, h, t1, t2;
+   u32 a, b, c, d, e, f, g, h;
int i;
 
/* load the input */
@@ -61,141 +88,16 @@ static void sha256_transform(u32 *state, const u8 *input, 
u32 *W)
e = state[4];  f = state[5];  g = state[6];  h = state[7];
 
/* now iterate */
-   t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
-   t2 = e0(a) + Maj(a, b, c);d += t1;h = t1 + t2;
-   t1 = g + e1(d) + Ch(d, e, f) + 0x71374491 + W[1];
-   t2 = e0(h) + Maj(h, a, b);c += t1;g = t1 + t2;
-   t1 = f + e1(c) + Ch(c, d, e) + 0xb5c0fbcf + W[2];
-   t2 = e0(g) + Maj(g, h, a);b += t1;f = t1 + t2;
-   t1 = e + e1(b) + Ch(b, c, d) + 0xe9b5dba5 + W[3];
-   t2 = e0(f) + Maj(f, g, h);a += t1;e = t1 + t2;
-   t1 = d + e1(a) + Ch(a, b, c) + 0x3956c25b + W[4];
-   t2 = e0(e) + Maj(e, f, g);h += t1;d = t1 + t2;
-   t1 = c + e1(h) + Ch(h, a, b) + 0x59f111f1 + W[5];
-   t2 = e0(d) + Maj(d, e, f);g += t1;c = t1 + t2;
-   t1 = b + e1(g) + Ch(g, h, a) + 0x923f82a4 + W[6];
-   t2 = e0(c) + Maj(c, d, e);f += t1;b = t1 + t2;
-   t1 = a + e1(f) + Ch(f, g, h) + 0xab1c5ed5 + W[7];
-   t2 = e0(b) + Maj(b, c, d);e += t1;a = t1 + t2;
-
-   t1 = h + e1(e) + Ch(e, f, g) + 0xd807aa98 + W[8];
-   t2 = e0(a) + Maj(a, b, c);d += t1;h = t1 + t2;
-   t1 = g + e1(d) + Ch(d, e, f) + 0x12835b01 + W[9];
-   t2 = e0(h) + Maj(h, a, b);c += t1;g = t1 + t2;
-   t1 = f + e1(c) + Ch(c, d, e) + 0x243185be + W[10];
-   t2 = e0(g) + Maj(g, h, a);b += t1;f = t1 + t2;
-   t1 = e + e1(b) + Ch(b, c, d) + 0x550c7dc3 + W[11];
-   t2 = e0(f) + Maj(f, g, h);a += t1;e = t1 + t2;
-   t1 = d + e1(a) + Ch(a, b, c) + 0x72be5d74 + W[12];
-   t2 = e0(e) + Maj(e, f, g);h += t1;d = t1 + t2;
-   t1 = c + e1(h) + Ch(h, a, b) + 0x80deb1fe + W[13];
-   t2 = e0(d) + Maj(d, e, f);g += t1;c = t1 + t2;
-   t1 = b + e1(g) + Ch(g, h, a) + 0x9bdc06a7 + W[14];
-   t2 = e0(c) + Maj(c, d, e);f += t1;b = t1 + t2;
-   t1 = a + e1(f) + Ch(f, g, h) + 0xc19bf174 + W[15];
-   t2 = e0(b) + Maj(b, c, d);e += t1;a = t1 + t2;
-
-   t1 = h + e1(e) + Ch(e, f, g) + 0xe49b69c1 + W[16];
-   t2 = e0(a) + Maj(a, b, c);d += t1;h = t1 + t2;
-   t1 = g + e1(d) + Ch(d, e, f) + 0xefbe4786 + W[17];
-   t2 = e0(h) + Maj(h, a, b);c += t1;g = t1 + t2;
-   t1 = f + e1(c) + Ch(c, d, e) + 0x0fc19dc6 + W[18];
-   t2 = e0(g) + Maj(g, h, a);b += t1;f = t1 + t2;
-   t1 = e + e1(b) + Ch(b, c, d) + 0x240ca1cc + W[19];
-   t2 = e0(f) + Maj(f, g, h);a += t1;e = t1 + t2;
-   t1
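
With the SHA256_ROUND() macro above, each group of eight rounds rotates the
working variables one position, so the rolled-up loop that replaces the
removed lines looks roughly like this (illustrative sketch, which may
differ slightly from the actual patch):

	for (i = 0; i < 64; i += 8) {
		SHA256_ROUND(i + 0, a, b, c, d, e, f, g, h);
		SHA256_ROUND(i + 1, h, a, b, c, d, e, f, g);
		SHA256_ROUND(i + 2, g, h, a, b, c, d, e, f);
		SHA256_ROUND(i + 3, f, g, h, a, b, c, d, e);
		SHA256_ROUND(i + 4, e, f, g, h, a, b, c, d);
		SHA256_ROUND(i + 5, d, e, f, g, h, a, b, c);
		SHA256_ROUND(i + 6, c, d, e, f, g, h, a, b);
		SHA256_ROUND(i + 7, b, c, d, e, f, g, h, a);
	}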

[PATCH v4 3/6] crypto: lib/sha256 - Don't clear temporary variables

2020-10-25 Thread Arvind Sankar
The assignments to clear a through h and t1/t2 are optimized out by the
compiler because they are unused after the assignments.

Clearing individual scalar variables is unlikely to be useful, as they
may have been assigned to registers, and even if stack spilling was
required, there may be compiler-generated temporaries that are
impossible to clear in any case.

So drop the clearing of a through h and t1/t2.

Signed-off-by: Arvind Sankar 
Reviewed-by: Eric Biggers 
---
 lib/crypto/sha256.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index d43bc39ab05e..099cd11f83c1 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 *input)
state[4] += e; state[5] += f; state[6] += g; state[7] += h;
 
/* clear any sensitive info... */
-   a = b = c = d = e = f = g = h = t1 = t2 = 0;
memzero_explicit(W, 64 * sizeof(u32));
 }
 
-- 
2.26.2



[PATCH v4 1/6] crypto: lib/sha256 - Use memzero_explicit() for clearing state

2020-10-25 Thread Arvind Sankar
Without the barrier_data() inside memzero_explicit(), the compiler may
optimize away the state-clearing if it can tell that the state is not
used afterwards. At least in lib/crypto/sha256.c:__sha256_final(), the
function can get inlined into sha256(), in which case the memset is
optimized away.

Signed-off-by: Arvind Sankar 
Reviewed-by: Eric Biggers 
---
 lib/crypto/sha256.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 2321f6cb322f..d43bc39ab05e 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -265,7 +265,7 @@ static void __sha256_final(struct sha256_state *sctx, u8 
*out, int digest_words)
put_unaligned_be32(sctx->state[i], &digest[i]);
 
/* Zeroize sensitive information. */
-   memset(sctx, 0, sizeof(*sctx));
+   memzero_explicit(sctx, sizeof(*sctx));
 }
 
 void sha256_final(struct sha256_state *sctx, u8 *out)
-- 
2.26.2
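
For reference, the reason memzero_explicit() survives where a plain memset()
does not is the compiler barrier issued after the clear. A simplified sketch
(approximating the include/linux/string.h helper, shown only for illustration):

static inline void memzero_explicit_sketch(void *s, size_t count)
{
	memset(s, 0, count);
	/* barrier_data()-style barrier: the empty asm takes the pointer as
	 * an input and clobbers memory, so the compiler must assume the
	 * zeroed bytes are observed and cannot delete the memset() above. */
	__asm__ __volatile__("" : : "r" (s) : "memory");
}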



[PATCH v4 0/6] crypto: lib/sha256 - cleanup/optimization

2020-10-25 Thread Arvind Sankar
Patch 1/2 -- Use memzero_explicit() instead of structure assignment/plain
memset() to clear sensitive state.

Patch 3 -- Currently the temporary variables used in the generic sha256
implementation are cleared, but the clearing is optimized away due to
lack of compiler barriers. Drop the clearing.

The last three patches are optimizations for generic sha256.

v4:
- Split the first patch into two, the first one just does
  lib/crypto/sha256.c, so that the second one can be applied or dropped
  depending on the outcome of the discussion between Herbert/Eric.

v3:
- Add some more files to patch 1
- Reword commit message for patch 2
- Reformat SHA256_K array
- Drop v2 patch combining K and W arrays

v2:
- Add patch to combine K and W arrays, suggested by David
- Reformat SHA256_ROUND() macro a little

Arvind Sankar (6):
  crypto: lib/sha256 - Use memzero_explicit() for clearing state
  crypto: Use memzero_explicit() for clearing state
  crypto: lib/sha256 - Don't clear temporary variables
  crypto: lib/sha256 - Clear W[] in sha256_update() instead of
sha256_transform()
  crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64
  crypto: lib/sha256 - Unroll LOAD and BLEND loops

 arch/arm64/crypto/ghash-ce-glue.c |   2 +-
 arch/arm64/crypto/poly1305-glue.c |   2 +-
 arch/arm64/crypto/sha3-ce-glue.c  |   2 +-
 arch/x86/crypto/poly1305_glue.c   |   2 +-
 include/crypto/sha1_base.h|   3 +-
 include/crypto/sha256_base.h  |   3 +-
 include/crypto/sha512_base.h  |   3 +-
 include/crypto/sm3_base.h |   3 +-
 lib/crypto/sha256.c   | 212 +-
 9 files changed, 76 insertions(+), 156 deletions(-)

-- 
2.26.2



[PATCH v4 4/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform()

2020-10-25 Thread Arvind Sankar
The temporary W[] array is currently zeroed out once every call to
sha256_transform(), i.e. once every 64 bytes of input data. Moving it to
sha256_update() instead so that it is cleared only once per update can
save about 2-3% of the total time taken to compute the digest, with a
reasonable memset() implementation, and considerably more (~20%) with a
bad one (e.g. the x86 purgatory currently uses a memset() coded in C).

Signed-off-by: Arvind Sankar 
Reviewed-by: Eric Biggers 
---
 lib/crypto/sha256.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 099cd11f83c1..c6bfeacc5b81 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -43,10 +43,9 @@ static inline void BLEND_OP(int I, u32 *W)
W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
 }
 
-static void sha256_transform(u32 *state, const u8 *input)
+static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 {
u32 a, b, c, d, e, f, g, h, t1, t2;
-   u32 W[64];
int i;
 
/* load the input */
@@ -200,15 +199,13 @@ static void sha256_transform(u32 *state, const u8 *input)
 
state[0] += a; state[1] += b; state[2] += c; state[3] += d;
state[4] += e; state[5] += f; state[6] += g; state[7] += h;
-
-   /* clear any sensitive info... */
-   memzero_explicit(W, 64 * sizeof(u32));
 }
 
 void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
 {
unsigned int partial, done;
const u8 *src;
+   u32 W[64];
 
partial = sctx->count & 0x3f;
sctx->count += len;
@@ -223,11 +220,13 @@ void sha256_update(struct sha256_state *sctx, const u8 
*data, unsigned int len)
}
 
do {
-   sha256_transform(sctx->state, src);
+   sha256_transform(sctx->state, src, W);
done += 64;
src = data + done;
} while (done + 63 < len);
 
+   memzero_explicit(W, sizeof(W));
+
partial = 0;
}
memcpy(sctx->buf + partial, src, len - done);
-- 
2.26.2
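
Rough numbers behind the saving described above: W[] is 64 * 4 = 256 bytes,
so clearing it inside sha256_transform() costs 256 bytes of clearing per
64-byte block, i.e. 4 bytes cleared for every byte hashed. After this patch
the 256-byte clear happens once per sha256_update() call, a fixed cost that
becomes negligible for long inputs (the 2-3% and ~20% figures quoted above
are measurements, not derived from this arithmetic).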



[PATCH v4 2/6] crypto: Use memzero_explicit() for clearing state

2020-10-25 Thread Arvind Sankar
Without the barrier_data() inside memzero_explicit(), the compiler may
optimize away the state-clearing if it can tell that the state is not
used afterwards.

Signed-off-by: Arvind Sankar 
---
 arch/arm64/crypto/ghash-ce-glue.c | 2 +-
 arch/arm64/crypto/poly1305-glue.c | 2 +-
 arch/arm64/crypto/sha3-ce-glue.c  | 2 +-
 arch/x86/crypto/poly1305_glue.c   | 2 +-
 include/crypto/sha1_base.h| 3 ++-
 include/crypto/sha256_base.h  | 3 ++-
 include/crypto/sha512_base.h  | 3 ++-
 include/crypto/sm3_base.h | 3 ++-
 8 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-glue.c 
b/arch/arm64/crypto/ghash-ce-glue.c
index 8536008e3e35..2427e2f3a9a1 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -168,7 +168,7 @@ static int ghash_final(struct shash_desc *desc, u8 *dst)
put_unaligned_be64(ctx->digest[1], dst);
put_unaligned_be64(ctx->digest[0], dst + 8);
 
-   *ctx = (struct ghash_desc_ctx){};
+   memzero_explicit(ctx, sizeof(*ctx));
return 0;
 }
 
diff --git a/arch/arm64/crypto/poly1305-glue.c 
b/arch/arm64/crypto/poly1305-glue.c
index f33ada70c4ed..683de671741a 100644
--- a/arch/arm64/crypto/poly1305-glue.c
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -177,7 +177,7 @@ void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 
*dst)
}
 
poly1305_emit(&dctx->h, dst, dctx->s);
-   *dctx = (struct poly1305_desc_ctx){};
+   memzero_explicit(dctx, sizeof(*dctx));
 }
 EXPORT_SYMBOL(poly1305_final_arch);
 
diff --git a/arch/arm64/crypto/sha3-ce-glue.c b/arch/arm64/crypto/sha3-ce-glue.c
index 9a4bbfc45f40..e5a2936f0886 100644
--- a/arch/arm64/crypto/sha3-ce-glue.c
+++ b/arch/arm64/crypto/sha3-ce-glue.c
@@ -94,7 +94,7 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
if (digest_size & 4)
put_unaligned_le32(sctx->st[i], (__le32 *)digest);
 
-   *sctx = (struct sha3_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index e508dbd91813..64d09520d279 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -209,7 +209,7 @@ void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 
*dst)
}
 
poly1305_simd_emit(&dctx->h, dst, dctx->s);
-   *dctx = (struct poly1305_desc_ctx){};
+   memzero_explicit(dctx, sizeof(*dctx));
 }
 EXPORT_SYMBOL(poly1305_final_arch);
 
diff --git a/include/crypto/sha1_base.h b/include/crypto/sha1_base.h
index 20fd1f7468af..a5d6033efef7 100644
--- a/include/crypto/sha1_base.h
+++ b/include/crypto/sha1_base.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 
 #include 
 
@@ -101,7 +102,7 @@ static inline int sha1_base_finish(struct shash_desc *desc, 
u8 *out)
for (i = 0; i < SHA1_DIGEST_SIZE / sizeof(__be32); i++)
put_unaligned_be32(sctx->state[i], digest++);
 
-   *sctx = (struct sha1_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/include/crypto/sha256_base.h b/include/crypto/sha256_base.h
index 6ded110783ae..93f9fd21cc06 100644
--- a/include/crypto/sha256_base.h
+++ b/include/crypto/sha256_base.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 
 #include 
 
@@ -105,7 +106,7 @@ static inline int sha256_base_finish(struct shash_desc 
*desc, u8 *out)
for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be32))
put_unaligned_be32(sctx->state[i], digest++);
 
-   *sctx = (struct sha256_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/include/crypto/sha512_base.h b/include/crypto/sha512_base.h
index fb19c77494dc..93ab73baa38e 100644
--- a/include/crypto/sha512_base.h
+++ b/include/crypto/sha512_base.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 
 #include 
 
@@ -126,7 +127,7 @@ static inline int sha512_base_finish(struct shash_desc 
*desc, u8 *out)
for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be64))
put_unaligned_be64(sctx->state[i], digest++);
 
-   *sctx = (struct sha512_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/include/crypto/sm3_base.h b/include/crypto/sm3_base.h
index 1cbf9aa1fe52..2f3a32ab97bb 100644
--- a/include/crypto/sm3_base.h
+++ b/include/crypto/sm3_base.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 #include 
 
 typedef void (sm3_block_fn)(struct sm3_state *sst, u8 const *src, int blocks);
@@ -104,7 +105,7 @@ static inline int sm3_base_finish(struct shash_desc *desc, 
u8 *out)
for (i = 0; i < SM3_DIGEST_SIZE / sizeof(__be32); i++)
put_unaligned_be32(sctx->state[i], digest++);
 
-   *sctx = (struct sm3_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
-- 
2.26.2



Re: RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel

2020-10-24 Thread Arvind Sankar
On Sat, Oct 24, 2020 at 08:41:58PM -0400, Arvind Sankar wrote:
> Hi, I think the definition of BOOT_PGT_SIZE in
> arch/x86/include/asm/boot.h is insufficient, especially after
>   ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")
> 
> Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
> 19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.
> 
> - The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
>   debug messages, but warnings/errors are always output to VGA memory,
>   so the two extra pages for mapping video RAM are always needed.
> 
> - The calculation wasn't updated for X86_5LEVEL, which requires at least
>   one more page for the P4D level, and in theory could require two extra
>   pages for each of the 4 mappings (compressed kernel, output kernel,
>   boot_params and command line), though that would require a system with
>   truly ginormous amounts of RAM.
> 
> - If KASLR is disabled, there are only 6 pages, but now that we're
>   always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
>   two PUD and two PMD pages for kernel, boot_params and command line),
>   and 2 more pages for the video RAM, and more for 5-level. Even for
>   !RELOCATABLE, 13 pages might be needed.
> 
> - SEV-ES needs one more page because it needs to do a PTE-level mapping
>   for the GHCB page.
> 
> - The static calculation is also busted because
>   boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
>   configuration tables and the EFI memmap, and none of these are
>   accounted for. They used to be scanned while still on the
>   firmware/bootloader page tables, but now our page tables have to cover
>   them as well. Trying to add up the worst case for all of these, and
>   anything else the compressed kernel might potentially access seems
>   like a lost cause.
> 
> We could do something similar to what the main kernel does with
> early_dynamic_pgts: map the compressed kernel at a fixed virtual
> address (in negative address space, say); recycle all the other mappings
> until we're done with decompression, and then map the output,
> boot_params and command line. The number of pages needed for this can be
> statically calculated, for 4-level paging we'd need 2 pages for the
> fixed mapping, 12 pages for the other three, and one PGD page.
> 
> Thoughts?

Or just bump BOOT_PGT_SIZE to some largeish number?


RFC x86/boot/64: BOOT_PGT_SIZE definition for compressed kernel

2020-10-24 Thread Arvind Sankar
Hi, I think the definition of BOOT_PGT_SIZE in
arch/x86/include/asm/boot.h is insufficient, especially after
  ca0e22d4f011 ("x86/boot/compressed/64: Always switch to own page table")

Currently, it allocates 6 pages if KASLR is disabled, and either 17 or
19 pages depending on X86_VERBOSE_BOOTUP if KASLR is enabled.

- The X86_VERBOSE_BOOTUP test shouldn't be done: that only disables
  debug messages, but warnings/errors are always output to VGA memory,
  so the two extra pages for mapping video RAM are always needed.

- The calculation wasn't updated for X86_5LEVEL, which requires at least
  one more page for the P4D level, and in theory could require two extra
  pages for each of the 4 mappings (compressed kernel, output kernel,
  boot_params and command line), though that would require a system with
  truly ginormous amounts of RAM.

- If KASLR is disabled, there are only 6 pages, but now that we're
  always setting up our own page table, we need 1+(2+2)*3 (one PGD, and
  two PUD and two PMD pages for kernel, boot_params and command line),
  and 2 more pages for the video RAM, and more for 5-level. Even for
  !RELOCATABLE, 13 pages might be needed.

- SEV-ES needs one more page because it needs to do a PTE-level mapping
  for the GHCB page.

- The static calculation is also busted because
  boot/compressed/{kaslr.c,acpi.c} can scan the setup data, EFI
  configuration tables and the EFI memmap, and none of these are
  accounted for. They used to be scanned while still on the
  firmware/bootloader page tables, but now our page tables have to cover
  them as well. Trying to add up the worst case for all of these, and
  anything else the compressed kernel might potentially access seems
  like a lost cause.

We could do something similar to what the main kernel does with
early_dynamic_pgts: map the compressed kernel at a fixed virtual
address (in negative address space, say); recycle all the other mappings
until we're done with decompression, and then map the output,
boot_params and command line. The number of pages needed for this can be
statically calculated, for 4-level paging we'd need 2 pages for the
fixed mapping, 12 pages for the other three, and one PGD page.

Thoughts?
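
A back-of-the-envelope helper capturing the arithmetic above (a sketch only,
assuming 4-level paging, 2M mappings and one shared top-level page -- not a
proposed boot.h change):

/* Each separately mapped region may straddle a PUD/PMD boundary, so
 * budget two pages per level below the shared PGD. */
#define BOOT_PGD_PAGES		1
#define BOOT_PAGES_PER_REGION	(2 /* PUD */ + 2 /* PMD */)

static inline unsigned int boot_pgt_pages(unsigned int regions)
{
	return BOOT_PGD_PAGES + BOOT_PAGES_PER_REGION * regions;
}

/*
 * kernel + boot_params + command line:  boot_pgt_pages(3) = 13 pages,
 * plus 2 for the VGA text buffer mapping, plus 1 PTE-level page for the
 * SEV-ES GHCB, plus at least one extra P4D page with 5-level paging
 * (in theory up to two more per region).
 */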


[PATCH v3 3/5] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform()

2020-10-23 Thread Arvind Sankar
The temporary W[] array is currently zeroed out once every call to
sha256_transform(), i.e. once every 64 bytes of input data. Moving it to
sha256_update() instead so that it is cleared only once per update can
save about 2-3% of the total time taken to compute the digest, with a
reasonable memset() implementation, and considerably more (~20%) with a
bad one (e.g. the x86 purgatory currently uses a memset() coded in C).

Signed-off-by: Arvind Sankar 
Reviewed-by: Eric Biggers 
---
 lib/crypto/sha256.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 099cd11f83c1..c6bfeacc5b81 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -43,10 +43,9 @@ static inline void BLEND_OP(int I, u32 *W)
W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
 }
 
-static void sha256_transform(u32 *state, const u8 *input)
+static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 {
u32 a, b, c, d, e, f, g, h, t1, t2;
-   u32 W[64];
int i;
 
/* load the input */
@@ -200,15 +199,13 @@ static void sha256_transform(u32 *state, const u8 *input)
 
state[0] += a; state[1] += b; state[2] += c; state[3] += d;
state[4] += e; state[5] += f; state[6] += g; state[7] += h;
-
-   /* clear any sensitive info... */
-   memzero_explicit(W, 64 * sizeof(u32));
 }
 
 void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
 {
unsigned int partial, done;
const u8 *src;
+   u32 W[64];
 
partial = sctx->count & 0x3f;
sctx->count += len;
@@ -223,11 +220,13 @@ void sha256_update(struct sha256_state *sctx, const u8 
*data, unsigned int len)
}
 
do {
-   sha256_transform(sctx->state, src);
+   sha256_transform(sctx->state, src, W);
done += 64;
src = data + done;
} while (done + 63 < len);
 
+   memzero_explicit(W, sizeof(W));
+
partial = 0;
}
memcpy(sctx->buf + partial, src, len - done);
-- 
2.26.2



[PATCH v3 5/5] crypto: lib/sha256 - Unroll LOAD and BLEND loops

2020-10-23 Thread Arvind Sankar
Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
(tested on Broadwell Xeon) while not increasing code size too much.

Signed-off-by: Arvind Sankar 
Reviewed-by: Eric Biggers 
---
 lib/crypto/sha256.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index e2e29d9b0ccd..cdef37c05972 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -76,12 +76,28 @@ static void sha256_transform(u32 *state, const u8 *input, 
u32 *W)
int i;
 
/* load the input */
-   for (i = 0; i < 16; i++)
-   LOAD_OP(i, W, input);
+   for (i = 0; i < 16; i += 8) {
+   LOAD_OP(i + 0, W, input);
+   LOAD_OP(i + 1, W, input);
+   LOAD_OP(i + 2, W, input);
+   LOAD_OP(i + 3, W, input);
+   LOAD_OP(i + 4, W, input);
+   LOAD_OP(i + 5, W, input);
+   LOAD_OP(i + 6, W, input);
+   LOAD_OP(i + 7, W, input);
+   }
 
/* now blend */
-   for (i = 16; i < 64; i++)
-   BLEND_OP(i, W);
+   for (i = 16; i < 64; i += 8) {
+   BLEND_OP(i + 0, W);
+   BLEND_OP(i + 1, W);
+   BLEND_OP(i + 2, W);
+   BLEND_OP(i + 3, W);
+   BLEND_OP(i + 4, W);
+   BLEND_OP(i + 5, W);
+   BLEND_OP(i + 6, W);
+   BLEND_OP(i + 7, W);
+   }
 
/* load the state into our registers */
a = state[0];  b = state[1];  c = state[2];  d = state[3];
-- 
2.26.2



[PATCH v3 4/5] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

2020-10-23 Thread Arvind Sankar
This reduces code size substantially (on x86_64 with gcc-10 the size of
sha256_update() goes from 7593 bytes to 1952 bytes including the new
SHA256_K array), and on x86 is slightly faster than the full unroll
(tested on Broadwell Xeon).

Signed-off-by: Arvind Sankar 
---
 lib/crypto/sha256.c | 174 ++--
 1 file changed, 38 insertions(+), 136 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index c6bfeacc5b81..e2e29d9b0ccd 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -18,6 +18,25 @@
 #include 
 #include 
 
+static const u32 SHA256_K[] = {
+   0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
+   0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
+   0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
+   0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
+   0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
+   0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
+   0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
+   0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
+   0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
+   0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
+   0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
+   0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
+   0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
+   0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
+   0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
+   0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
+};
+
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
return z ^ (x & (y ^ z));
@@ -43,9 +62,17 @@ static inline void BLEND_OP(int I, u32 *W)
W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
 }
 
+#define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {   \
+   u32 t1, t2; \
+   t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];  \
+   t2 = e0(a) + Maj(a, b, c);  \
+   d += t1;\
+   h = t1 + t2;\
+} while (0)
+
 static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 {
-   u32 a, b, c, d, e, f, g, h, t1, t2;
+   u32 a, b, c, d, e, f, g, h;
int i;
 
/* load the input */
@@ -61,141 +88,16 @@ static void sha256_transform(u32 *state, const u8 *input, 
u32 *W)
e = state[4];  f = state[5];  g = state[6];  h = state[7];
 
/* now iterate */
-   t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
-   t2 = e0(a) + Maj(a, b, c);d += t1;h = t1 + t2;
-   t1 = g + e1(d) + Ch(d, e, f) + 0x71374491 + W[1];
-   t2 = e0(h) + Maj(h, a, b);c += t1;g = t1 + t2;
-   t1 = f + e1(c) + Ch(c, d, e) + 0xb5c0fbcf + W[2];
-   t2 = e0(g) + Maj(g, h, a);b += t1;f = t1 + t2;
-   t1 = e + e1(b) + Ch(b, c, d) + 0xe9b5dba5 + W[3];
-   t2 = e0(f) + Maj(f, g, h);a += t1;e = t1 + t2;
-   t1 = d + e1(a) + Ch(a, b, c) + 0x3956c25b + W[4];
-   t2 = e0(e) + Maj(e, f, g);h += t1;d = t1 + t2;
-   t1 = c + e1(h) + Ch(h, a, b) + 0x59f111f1 + W[5];
-   t2 = e0(d) + Maj(d, e, f);g += t1;c = t1 + t2;
-   t1 = b + e1(g) + Ch(g, h, a) + 0x923f82a4 + W[6];
-   t2 = e0(c) + Maj(c, d, e);f += t1;b = t1 + t2;
-   t1 = a + e1(f) + Ch(f, g, h) + 0xab1c5ed5 + W[7];
-   t2 = e0(b) + Maj(b, c, d);e += t1;a = t1 + t2;
-
-   t1 = h + e1(e) + Ch(e, f, g) + 0xd807aa98 + W[8];
-   t2 = e0(a) + Maj(a, b, c);d += t1;h = t1 + t2;
-   t1 = g + e1(d) + Ch(d, e, f) + 0x12835b01 + W[9];
-   t2 = e0(h) + Maj(h, a, b);c += t1;g = t1 + t2;
-   t1 = f + e1(c) + Ch(c, d, e) + 0x243185be + W[10];
-   t2 = e0(g) + Maj(g, h, a);b += t1;f = t1 + t2;
-   t1 = e + e1(b) + Ch(b, c, d) + 0x550c7dc3 + W[11];
-   t2 = e0(f) + Maj(f, g, h);a += t1;e = t1 + t2;
-   t1 = d + e1(a) + Ch(a, b, c) + 0x72be5d74 + W[12];
-   t2 = e0(e) + Maj(e, f, g);h += t1;d = t1 + t2;
-   t1 = c + e1(h) + Ch(h, a, b) + 0x80deb1fe + W[13];
-   t2 = e0(d) + Maj(d, e, f);g += t1;c = t1 + t2;
-   t1 = b + e1(g) + Ch(g, h, a) + 0x9bdc06a7 + W[14];
-   t2 = e0(c) + Maj(c, d, e);f += t1;b = t1 + t2;
-   t1 = a + e1(f) + Ch(f, g, h) + 0xc19bf174 + W[15];
-   t2 = e0(b) + Maj(b, c, d);e += t1;a = t1 + t2;
-
-   t1 = h + e1(e) + Ch(e, f, g) + 0xe49b69c1 + W[16];
-   t2 = e0(a) + Maj(a, b, c);d += t1;h = t1 + t2;
-   t1 = g + e1(d) + Ch(d, e, f) + 0xefbe4786 + W[17];
-   t2 = e0(h) + Maj(h, a, b);c += t1;g = t1 + t2;
-   t1 = f + e1(c) + Ch(c, d, e) + 0x0fc19dc6 + W[18];
-   t2 = e0(g) + Maj(g, h, a);b += t1;f = t1 + t2;
-   t1 = e + e1(b) + Ch(b, c, d) + 0x240ca1cc + W[19];
-   t2 = e0(f) + Maj(f, g, h);a += t1;e = t1 + t2;
-   t1 = d + e1(a) + Ch(a, 
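
(The archive truncates the hunk above. The deleted hand-unrolled rounds are
replaced in this patch by a rotating 8-round loop along the following lines;
reconstructed here as a sketch for readability, not quoted from the archived
message:)

	for (i = 0; i < 64; i += 8) {
		SHA256_ROUND(i + 0, a, b, c, d, e, f, g, h);
		SHA256_ROUND(i + 1, h, a, b, c, d, e, f, g);
		SHA256_ROUND(i + 2, g, h, a, b, c, d, e, f);
		SHA256_ROUND(i + 3, f, g, h, a, b, c, d, e);
		SHA256_ROUND(i + 4, e, f, g, h, a, b, c, d);
		SHA256_ROUND(i + 5, d, e, f, g, h, a, b, c);
		SHA256_ROUND(i + 6, c, d, e, f, g, h, a, b);
		SHA256_ROUND(i + 7, b, c, d, e, f, g, h, a);
	}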

[PATCH v3 0/5] crypto: lib/sha256 - cleanup/optimization

2020-10-23 Thread Arvind Sankar
Patch 1 -- Use memzero_explicit() instead of structure assignment/plain
memset() to clear sensitive state.

Patch 2 -- Currently the temporary variables used in the generic sha256
implementation are cleared, but the clearing is optimized away due to
lack of compiler barriers. Drop the clearing.

The last three patches are optimizations for generic sha256.

v3:
- Add some more files to patch 1
- Reword commit message for patch 2
- Reformat SHA256_K array
- Drop v2 patch combining K and W arrays

v2:
- Add patch to combine K and W arrays, suggested by David
- Reformat SHA256_ROUND() macro a little

Arvind Sankar (5):
  crypto: Use memzero_explicit() for clearing state
  crypto: lib/sha256 - Don't clear temporary variables
  crypto: lib/sha256 - Clear W[] in sha256_update() instead of
sha256_transform()
  crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64
  crypto: lib/sha256 - Unroll LOAD and BLEND loops

 arch/arm64/crypto/ghash-ce-glue.c |   2 +-
 arch/arm64/crypto/poly1305-glue.c |   2 +-
 arch/arm64/crypto/sha3-ce-glue.c  |   2 +-
 arch/x86/crypto/poly1305_glue.c   |   2 +-
 include/crypto/sha1_base.h|   3 +-
 include/crypto/sha256_base.h  |   3 +-
 include/crypto/sha512_base.h  |   3 +-
 include/crypto/sm3_base.h |   3 +-
 lib/crypto/sha256.c   | 212 +-
 9 files changed, 76 insertions(+), 156 deletions(-)

-- 
2.26.2



[PATCH v3 2/5] crypto: lib/sha256 - Don't clear temporary variables

2020-10-23 Thread Arvind Sankar
The assignments to clear a through h and t1/t2 are optimized out by the
compiler because they are unused after the assignments.

Clearing individual scalar variables is unlikely to be useful, as they
may have been assigned to registers, and even if stack spilling was
required, there may be compiler-generated temporaries that are
impossible to clear in any case.

So drop the clearing of a through h and t1/t2.

Signed-off-by: Arvind Sankar 
---
 lib/crypto/sha256.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index d43bc39ab05e..099cd11f83c1 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 *input)
state[4] += e; state[5] += f; state[6] += g; state[7] += h;
 
/* clear any sensitive info... */
-   a = b = c = d = e = f = g = h = t1 = t2 = 0;
memzero_explicit(W, 64 * sizeof(u32));
 }
 
-- 
2.26.2



[PATCH v3 1/5] crypto: Use memzero_explicit() for clearing state

2020-10-23 Thread Arvind Sankar
Without the barrier_data() inside memzero_explicit(), the compiler may
optimize away the state-clearing if it can tell that the state is not
used afterwards. At least in lib/crypto/sha256.c:__sha256_final(), the
function can get inlined into sha256(), in which case the memset is
optimized away.

Signed-off-by: Arvind Sankar 
---
 arch/arm64/crypto/ghash-ce-glue.c | 2 +-
 arch/arm64/crypto/poly1305-glue.c | 2 +-
 arch/arm64/crypto/sha3-ce-glue.c  | 2 +-
 arch/x86/crypto/poly1305_glue.c   | 2 +-
 include/crypto/sha1_base.h| 3 ++-
 include/crypto/sha256_base.h  | 3 ++-
 include/crypto/sha512_base.h  | 3 ++-
 include/crypto/sm3_base.h | 3 ++-
 lib/crypto/sha256.c   | 2 +-
 9 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-glue.c 
b/arch/arm64/crypto/ghash-ce-glue.c
index 8536008e3e35..2427e2f3a9a1 100644
--- a/arch/arm64/crypto/ghash-ce-glue.c
+++ b/arch/arm64/crypto/ghash-ce-glue.c
@@ -168,7 +168,7 @@ static int ghash_final(struct shash_desc *desc, u8 *dst)
put_unaligned_be64(ctx->digest[1], dst);
put_unaligned_be64(ctx->digest[0], dst + 8);
 
-   *ctx = (struct ghash_desc_ctx){};
+   memzero_explicit(ctx, sizeof(*ctx));
return 0;
 }
 
diff --git a/arch/arm64/crypto/poly1305-glue.c 
b/arch/arm64/crypto/poly1305-glue.c
index f33ada70c4ed..683de671741a 100644
--- a/arch/arm64/crypto/poly1305-glue.c
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -177,7 +177,7 @@ void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 
*dst)
}
 
poly1305_emit(&dctx->h, dst, dctx->s);
-   *dctx = (struct poly1305_desc_ctx){};
+   memzero_explicit(dctx, sizeof(*dctx));
 }
 EXPORT_SYMBOL(poly1305_final_arch);
 
diff --git a/arch/arm64/crypto/sha3-ce-glue.c b/arch/arm64/crypto/sha3-ce-glue.c
index 9a4bbfc45f40..e5a2936f0886 100644
--- a/arch/arm64/crypto/sha3-ce-glue.c
+++ b/arch/arm64/crypto/sha3-ce-glue.c
@@ -94,7 +94,7 @@ static int sha3_final(struct shash_desc *desc, u8 *out)
if (digest_size & 4)
put_unaligned_le32(sctx->st[i], (__le32 *)digest);
 
-   *sctx = (struct sha3_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/arch/x86/crypto/poly1305_glue.c b/arch/x86/crypto/poly1305_glue.c
index e508dbd91813..64d09520d279 100644
--- a/arch/x86/crypto/poly1305_glue.c
+++ b/arch/x86/crypto/poly1305_glue.c
@@ -209,7 +209,7 @@ void poly1305_final_arch(struct poly1305_desc_ctx *dctx, u8 
*dst)
}
 
poly1305_simd_emit(&dctx->h, dst, dctx->s);
-   *dctx = (struct poly1305_desc_ctx){};
+   memzero_explicit(dctx, sizeof(*dctx));
 }
 EXPORT_SYMBOL(poly1305_final_arch);
 
diff --git a/include/crypto/sha1_base.h b/include/crypto/sha1_base.h
index 20fd1f7468af..a5d6033efef7 100644
--- a/include/crypto/sha1_base.h
+++ b/include/crypto/sha1_base.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 
 #include 
 
@@ -101,7 +102,7 @@ static inline int sha1_base_finish(struct shash_desc *desc, 
u8 *out)
for (i = 0; i < SHA1_DIGEST_SIZE / sizeof(__be32); i++)
put_unaligned_be32(sctx->state[i], digest++);
 
-   *sctx = (struct sha1_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/include/crypto/sha256_base.h b/include/crypto/sha256_base.h
index 6ded110783ae..93f9fd21cc06 100644
--- a/include/crypto/sha256_base.h
+++ b/include/crypto/sha256_base.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 
 #include 
 
@@ -105,7 +106,7 @@ static inline int sha256_base_finish(struct shash_desc 
*desc, u8 *out)
for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be32))
put_unaligned_be32(sctx->state[i], digest++);
 
-   *sctx = (struct sha256_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/include/crypto/sha512_base.h b/include/crypto/sha512_base.h
index fb19c77494dc..93ab73baa38e 100644
--- a/include/crypto/sha512_base.h
+++ b/include/crypto/sha512_base.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 
 #include 
 
@@ -126,7 +127,7 @@ static inline int sha512_base_finish(struct shash_desc 
*desc, u8 *out)
for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be64))
put_unaligned_be64(sctx->state[i], digest++);
 
-   *sctx = (struct sha512_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/include/crypto/sm3_base.h b/include/crypto/sm3_base.h
index 1cbf9aa1fe52..2f3a32ab97bb 100644
--- a/include/crypto/sm3_base.h
+++ b/include/crypto/sm3_base.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 #include 
 
 typedef void (sm3_block_fn)(struct sm3_state *sst, u8 const *src, int blocks);
@@ -104,7 +105,7 @@ static inline int sm3_base_finish(struct shash_desc *desc, 
u8 *out)
for (i = 0; i < SM3_DIGEST_SIZE / sizeof(__be32); i++)

Re: [PATCH v2 1/6] crypto: Use memzero_explicit() for clearing state

2020-10-23 Thread Arvind Sankar
On Wed, Oct 21, 2020 at 09:36:33PM -0700, Eric Biggers wrote:
> On Tue, Oct 20, 2020 at 04:39:52PM -0400, Arvind Sankar wrote:
> > Without the barrier_data() inside memzero_explicit(), the compiler may
> > optimize away the state-clearing if it can tell that the state is not
> > used afterwards. At least in lib/crypto/sha256.c:__sha256_final(), the
> > function can get inlined into sha256(), in which case the memset is
> > optimized away.
> > 
> > Signed-off-by: Arvind Sankar 
> 
> Reviewed-by: Eric Biggers 
> 
> Maybe get the one in arch/arm64/crypto/sha3-ce-glue.c too?
> 
> - Eric

Hm, there are a few more as well like that. But now I'm thinking it's
only the generic sha256.c that may be problematic. The rest of them are
in _final() functions which will be stored as function pointers in a
structure, so there should be no risk of them getting optimized away?
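
A minimal sketch of the distinction being drawn here (illustrative only;
typical optimizing-compiler behaviour, not a guarantee):

#include <string.h>

struct ctx { unsigned int state[8]; unsigned char buf[64]; };

/* Only reachable through a function pointer: the compiler cannot see the
 * callers, so the clearing is normally kept. */
static int final_out_of_line(struct ctx *c, unsigned char *out)
{
	memcpy(out, c->state, sizeof(c->state));
	memset(c, 0, sizeof(*c));		/* stays */
	return 0;
}

struct ops { int (*final)(struct ctx *, unsigned char *); };
const struct ops my_ops = { .final = final_out_of_line };

/* Fully visible and operating on a local: after inlining, the compiler can
 * prove the cleared memory is never read again and drop the memset(). */
static void final_inlined(struct ctx *c, unsigned char *out)
{
	memcpy(out, c->state, sizeof(c->state));
	memset(c, 0, sizeof(*c));		/* may be deleted */
}

void one_shot_hash(unsigned char *out)
{
	struct ctx c = { { 0 } };		/* init/update elided */

	final_inlined(&c, out);			/* c is dead afterwards */
}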


Re: [PATCH v2 2/6] crypto: lib/sha256 - Don't clear temporary variables

2020-10-22 Thread Arvind Sankar
On Wed, Oct 21, 2020 at 09:58:50PM -0700, Eric Biggers wrote:
> On Tue, Oct 20, 2020 at 04:39:53PM -0400, Arvind Sankar wrote:
> > The assignments to clear a through h and t1/t2 are optimized out by the
> > compiler because they are unused after the assignments.
> > 
> > These variables shouldn't be very sensitive: t1/t2 can be calculated
> > from a through h, so they don't reveal any additional information.
> > Knowing a through h is equivalent to knowing one 64-byte block's SHA256
> > hash (with non-standard initial value) which, assuming SHA256 is secure,
> > doesn't reveal any information about the input.
> > 
> > Signed-off-by: Arvind Sankar 
> 
> I don't entirely buy the second paragraph.  It could be the case that the 
> input
> is less than or equal to one SHA-256 block (64 bytes), in which case leaking
> 'a' through 'h' would reveal the final SHA-256 hash if the input length is
> known.  And note that callers might consider either the input, the resulting
> hash, or both to be sensitive information -- it depends.

The "non-standard initial value" was just parenthetical -- my thinking
was that revealing the hash, whether the real SHA hash or an
intermediate one starting at some other initial value, shouldn't reveal
the input; not that you get any additional security from being an
intermediate block. But if the hash itself could be sensitive, yeah then
a-h are sensitive anyway.

> 
> > ---
> >  lib/crypto/sha256.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> > index d43bc39ab05e..099cd11f83c1 100644
> > --- a/lib/crypto/sha256.c
> > +++ b/lib/crypto/sha256.c
> > @@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 
> > *input)
> > state[4] += e; state[5] += f; state[6] += g; state[7] += h;
> >  
> > /* clear any sensitive info... */
> > -   a = b = c = d = e = f = g = h = t1 = t2 = 0;
> > memzero_explicit(W, 64 * sizeof(u32));
> >  }
> 
> Your change itself is fine, though.  As you mentioned, these assignments get
> optimized out, so they weren't accomplishing anything.
> 
> The fact is, there just isn't any way to guarantee in C code that all 
> sensitive
> variables get cleared.
> 
> So we shouldn't (and generally don't) bother trying to clear individual u32's,
> ints, etc. like this, but rather only structs and arrays, as clearing those is
> more likely to work as intended.
> 
> - Eric

Ok, I'll just drop the second paragraph from the commit message then.


Re: [PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times intead of 64

2020-10-22 Thread Arvind Sankar
On Wed, Oct 21, 2020 at 10:02:19PM -0700, Eric Biggers wrote:
> On Tue, Oct 20, 2020 at 04:39:55PM -0400, Arvind Sankar wrote:
> > This reduces code size substantially (on x86_64 with gcc-10 the size of
> > sha256_update() goes from 7593 bytes to 1952 bytes including the new
> > SHA256_K array), and on x86 is slightly faster than the full unroll
> > (tesed on Broadwell Xeon).
> 
> tesed => tested
> 
> > 
> > Signed-off-by: Arvind Sankar 
> > ---
> >  lib/crypto/sha256.c | 166 
> >  1 file changed, 30 insertions(+), 136 deletions(-)
> > 
> > diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
> > index c6bfeacc5b81..5efd390706c6 100644
> > --- a/lib/crypto/sha256.c
> > +++ b/lib/crypto/sha256.c
> > @@ -18,6 +18,17 @@
> >  #include 
> >  #include 
> >  
> > +static const u32 SHA256_K[] = {
> > +   0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 
> > 0x923f82a4, 0xab1c5ed5,
> > +   0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 
> > 0x9bdc06a7, 0xc19bf174,
> > +   0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 
> > 0x5cb0a9dc, 0x76f988da,
> > +   0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 
> > 0x06ca6351, 0x14292967,
> > +   0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 
> > 0x81c2c92e, 0x92722c85,
> > +   0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 
> > 0xf40e3585, 0x106aa070,
> > +   0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 
> > 0x5b9cca4f, 0x682e6ff3,
> > +   0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 
> > 0xbef9a3f7, 0xc67178f2,
> > +};
> 
> Limit this to 80 columns?

I was aiming for 8 columns per line to match all the other groupings by
eight. It does slightly exceed 100 columns but can this be an exception,
or should I maybe make it 4 columns per line?

> 
> Otherwise this looks good.
> 
> - Eric


Re: [PATCH 07/13] x86: Secure Launch kernel early boot stub

2020-10-21 Thread Arvind Sankar
On Wed, Oct 21, 2020 at 05:28:33PM +0200, Daniel Kiper wrote:
> On Mon, Oct 19, 2020 at 01:18:22PM -0400, Arvind Sankar wrote:
> > On Mon, Oct 19, 2020 at 04:51:53PM +0200, Daniel Kiper wrote:
> > > On Fri, Oct 16, 2020 at 04:51:51PM -0400, Arvind Sankar wrote:
> > > > On Thu, Oct 15, 2020 at 08:26:54PM +0200, Daniel Kiper wrote:
> > > > >
> > > > > I am discussing with Ross the other option. We can create
> > > > > .rodata.mle_header section and put it at fixed offset as
> > > > > kernel_info is. So, we would have, e.g.:
> > > > >
> > > > > arch/x86/boot/compressed/vmlinux.lds.S:
> > > > > .rodata.kernel_info KERNEL_INFO_OFFSET : {
> > > > > *(.rodata.kernel_info)
> > > > > }
> > > > > ASSERT(ABSOLUTE(kernel_info) == KERNEL_INFO_OFFSET, 
> > > > > "kernel_info at bad address!")
> > > > >
> > > > > .rodata.mle_header MLE_HEADER_OFFSET : {
> > > > > *(.rodata.mle_header)
> > > > > }
> > > > > ASSERT(ABSOLUTE(mle_header) == MLE_HEADER_OFFSET, "mle_header 
> > > > > at bad address!")
> > > > >
> > > > > arch/x86/boot/compressed/sl_stub.S:
> > > > > #define mleh_rva(X) (((X) - mle_header) + MLE_HEADER_OFFSET)
> > > > >
> > > > > .section ".rodata.mle_header", "a"
> > > > >
> > > > > SYM_DATA_START(mle_header)
> > > > > .long   0x9082ac5a/* UUID0 */
> > > > > .long   0x74a7476f/* UUID1 */
> > > > > .long   0xa2555c0f/* UUID2 */
> > > > > .long   0x42b651cb/* UUID3 */
> > > > > .long   0x0034/* MLE header size */
> > > > > .long   0x00020002/* MLE version 2.2 */
> > > > > .long   mleh_rva(sl_stub_entry)/* Linear entry point of 
> > > > > MLE (virt. address) */
> > > > > .long   0x/* First valid page of MLE */
> > > > > .long   0x/* Offset within binary of first byte 
> > > > > of MLE */
> > > > > .long   0x/* Offset within binary of last byte + 
> > > > > 1 of MLE */
> > > > > .long   0x0223/* Bit vector of MLE-supported 
> > > > > capabilities */
> > > > > .long   0x/* Starting linear address of command 
> > > > > line (unused) */
> > > > > .long   0x/* Ending linear address of command 
> > > > > line (unused) */
> > > > > SYM_DATA_END(mle_header)
> > > > >
> > > > > Of course MLE_HEADER_OFFSET has to be defined as a constant somewhere.
> > > > > Anyway, is it acceptable?
> > >
> > > What do you think about my MLE_HEADER_OFFSET and related stuff proposal?
> > >
> >
> > I'm wondering if it would be easier to just allow relocations in these
> > special "header" sections. I need to check how easy/hard it is to do
> > that without triggering linker warnings.
> 
> Ross and I still bouncing some ideas. We came to the conclusion that
> putting mle_header into kernel .rodata.kernel_info section or even
> arch/x86/boot/compressed/kernel_info.S file would be the easiest thing
> to do at this point. Of course I would suggest some renaming too. E.g.
> .rodata.kernel_info to .rodata.kernel_headers, etc. Does it make sense
> for you?
> 
> Daniel

I haven't been able to come up with any different options that don't
require post-processing of the kernel image. Allowing relocations in
specific sections seems to not be possible with lld, and anyway would
require the fields to be 64-bit sized so it doesn't really help.

Putting mle_header into kernel_info seems like a reasonable thing to me,
and if you do that, putting it into kernel_info.S would seem to be
necessary?  Would you also have a fixed field with the offset of the
mle_header from kernel_info?  That seems nicer than having the
bootloader scan the variable data for magic strings.


Re: [PATCH v2 6/6] crypto: lib/sha - Combine round constants and message schedule

2020-10-21 Thread Arvind Sankar
On Tue, Oct 20, 2020 at 09:36:00PM +, David Laight wrote:
> From: Arvind Sankar
> > Sent: 20 October 2020 21:40
> > 
> > Putting the round constants and the message schedule arrays together in
> > one structure saves one register, which can be a significant benefit on
> > register-constrained architectures. On x86-32 (tested on Broadwell
> > Xeon), this gives a 10% performance benefit.
> 
> I'm actually stunned it makes that much difference.
> The object code must be truly horrid (before and after).
> 
> There are probably other strange tweaks that give a similar
> improvement.
> 
>   David
> 

Hm yes, I took a closer look at the generated code, and gcc seems to be
doing something completely braindead. Before this change, it actually
copies 8 words at a time from SHA256_K onto the stack, and uses those
stack temporaries for the calculation. So this patch is giving a benefit
just because it only does the copy once instead of every time around the
loop.

It doesn't even really need a register to hold SHA256_K since this isn't
PIC code, it could just access it directly as SHA256_K(%ecx) if it just
multiplied the loop counter i by 4.
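
For illustration (hand-written sketch, not actual compiler output): on
non-PIC x86-32 the table's absolute address can be folded into the memory
operand, so no register is needed to hold its base.

	/* with %ecx = 4*i, as suggested above */
	movl	SHA256_K(%ecx), %eax

	/* or with %ecx = i, letting the addressing mode do the scaling */
	movl	SHA256_K(,%ecx,4), %eax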


[PATCH v2 1/6] crypto: Use memzero_explicit() for clearing state

2020-10-20 Thread Arvind Sankar
Without the barrier_data() inside memzero_explicit(), the compiler may
optimize away the state-clearing if it can tell that the state is not
used afterwards. At least in lib/crypto/sha256.c:__sha256_final(), the
function can get inlined into sha256(), in which case the memset is
optimized away.

Signed-off-by: Arvind Sankar 
---
 include/crypto/sha1_base.h   | 3 ++-
 include/crypto/sha256_base.h | 3 ++-
 include/crypto/sha512_base.h | 3 ++-
 include/crypto/sm3_base.h| 3 ++-
 lib/crypto/sha256.c  | 2 +-
 5 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/crypto/sha1_base.h b/include/crypto/sha1_base.h
index 20fd1f7468af..a5d6033efef7 100644
--- a/include/crypto/sha1_base.h
+++ b/include/crypto/sha1_base.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 
 #include 
 
@@ -101,7 +102,7 @@ static inline int sha1_base_finish(struct shash_desc *desc, 
u8 *out)
for (i = 0; i < SHA1_DIGEST_SIZE / sizeof(__be32); i++)
put_unaligned_be32(sctx->state[i], digest++);
 
-   *sctx = (struct sha1_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/include/crypto/sha256_base.h b/include/crypto/sha256_base.h
index 6ded110783ae..93f9fd21cc06 100644
--- a/include/crypto/sha256_base.h
+++ b/include/crypto/sha256_base.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 
 #include 
 
@@ -105,7 +106,7 @@ static inline int sha256_base_finish(struct shash_desc 
*desc, u8 *out)
for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be32))
put_unaligned_be32(sctx->state[i], digest++);
 
-   *sctx = (struct sha256_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/include/crypto/sha512_base.h b/include/crypto/sha512_base.h
index fb19c77494dc..93ab73baa38e 100644
--- a/include/crypto/sha512_base.h
+++ b/include/crypto/sha512_base.h
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 
 #include 
 
@@ -126,7 +127,7 @@ static inline int sha512_base_finish(struct shash_desc 
*desc, u8 *out)
for (i = 0; digest_size > 0; i++, digest_size -= sizeof(__be64))
put_unaligned_be64(sctx->state[i], digest++);
 
-   *sctx = (struct sha512_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/include/crypto/sm3_base.h b/include/crypto/sm3_base.h
index 1cbf9aa1fe52..2f3a32ab97bb 100644
--- a/include/crypto/sm3_base.h
+++ b/include/crypto/sm3_base.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include <linux/string.h>
 #include 
 
 typedef void (sm3_block_fn)(struct sm3_state *sst, u8 const *src, int blocks);
@@ -104,7 +105,7 @@ static inline int sm3_base_finish(struct shash_desc *desc, 
u8 *out)
for (i = 0; i < SM3_DIGEST_SIZE / sizeof(__be32); i++)
put_unaligned_be32(sctx->state[i], digest++);
 
-   *sctx = (struct sm3_state){};
+   memzero_explicit(sctx, sizeof(*sctx));
return 0;
 }
 
diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 2321f6cb322f..d43bc39ab05e 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -265,7 +265,7 @@ static void __sha256_final(struct sha256_state *sctx, u8 
*out, int digest_words)
put_unaligned_be32(sctx->state[i], &digest[i]);
 
/* Zeroize sensitive information. */
-   memset(sctx, 0, sizeof(*sctx));
+   memzero_explicit(sctx, sizeof(*sctx));
 }
 
 void sha256_final(struct sha256_state *sctx, u8 *out)
-- 
2.26.2



[PATCH v2 4/6] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

2020-10-20 Thread Arvind Sankar
This reduces code size substantially (on x86_64 with gcc-10 the size of
sha256_update() goes from 7593 bytes to 1952 bytes including the new
SHA256_K array), and on x86 is slightly faster than the full unroll
(tesed on Broadwell Xeon).

Signed-off-by: Arvind Sankar 
---
 lib/crypto/sha256.c | 166 
 1 file changed, 30 insertions(+), 136 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index c6bfeacc5b81..5efd390706c6 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -18,6 +18,17 @@
 #include 
 #include 
 
+static const u32 SHA256_K[] = {
+   0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 
0x923f82a4, 0xab1c5ed5,
+   0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 
0x9bdc06a7, 0xc19bf174,
+   0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 
0x5cb0a9dc, 0x76f988da,
+   0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 
0x06ca6351, 0x14292967,
+   0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 
0x81c2c92e, 0x92722c85,
+   0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 
0xf40e3585, 0x106aa070,
+   0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 
0x5b9cca4f, 0x682e6ff3,
+   0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 
0xbef9a3f7, 0xc67178f2,
+};
+
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
return z ^ (x & (y ^ z));
@@ -43,9 +54,17 @@ static inline void BLEND_OP(int I, u32 *W)
W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
 }
 
+#define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {   \
+   u32 t1, t2; \
+   t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];  \
+   t2 = e0(a) + Maj(a, b, c);  \
+   d += t1;\
+   h = t1 + t2;\
+} while (0)
+
 static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 {
-   u32 a, b, c, d, e, f, g, h, t1, t2;
+   u32 a, b, c, d, e, f, g, h;
int i;
 
/* load the input */
@@ -61,141 +80,16 @@ static void sha256_transform(u32 *state, const u8 *input, 
u32 *W)
e = state[4];  f = state[5];  g = state[6];  h = state[7];
 
/* now iterate */
-   t1 = h + e1(e) + Ch(e, f, g) + 0x428a2f98 + W[0];
-   t2 = e0(a) + Maj(a, b, c);d += t1;h = t1 + t2;
-   t1 = g + e1(d) + Ch(d, e, f) + 0x71374491 + W[1];
-   t2 = e0(h) + Maj(h, a, b);c += t1;g = t1 + t2;
-   t1 = f + e1(c) + Ch(c, d, e) + 0xb5c0fbcf + W[2];
-   t2 = e0(g) + Maj(g, h, a);b += t1;f = t1 + t2;
-   t1 = e + e1(b) + Ch(b, c, d) + 0xe9b5dba5 + W[3];
-   t2 = e0(f) + Maj(f, g, h);a += t1;e = t1 + t2;
-   t1 = d + e1(a) + Ch(a, b, c) + 0x3956c25b + W[4];
-   t2 = e0(e) + Maj(e, f, g);h += t1;d = t1 + t2;
-   t1 = c + e1(h) + Ch(h, a, b) + 0x59f111f1 + W[5];
-   t2 = e0(d) + Maj(d, e, f);g += t1;c = t1 + t2;
-   t1 = b + e1(g) + Ch(g, h, a) + 0x923f82a4 + W[6];
-   t2 = e0(c) + Maj(c, d, e);f += t1;b = t1 + t2;
-   t1 = a + e1(f) + Ch(f, g, h) + 0xab1c5ed5 + W[7];
-   t2 = e0(b) + Maj(b, c, d);e += t1;a = t1 + t2;
-
-   t1 = h + e1(e) + Ch(e, f, g) + 0xd807aa98 + W[8];
-   t2 = e0(a) + Maj(a, b, c);d += t1;h = t1 + t2;
-   t1 = g + e1(d) + Ch(d, e, f) + 0x12835b01 + W[9];
-   t2 = e0(h) + Maj(h, a, b);c += t1;g = t1 + t2;
-   t1 = f + e1(c) + Ch(c, d, e) + 0x243185be + W[10];
-   t2 = e0(g) + Maj(g, h, a);b += t1;f = t1 + t2;
-   t1 = e + e1(b) + Ch(b, c, d) + 0x550c7dc3 + W[11];
-   t2 = e0(f) + Maj(f, g, h);a += t1;e = t1 + t2;
-   t1 = d + e1(a) + Ch(a, b, c) + 0x72be5d74 + W[12];
-   t2 = e0(e) + Maj(e, f, g);h += t1;d = t1 + t2;
-   t1 = c + e1(h) + Ch(h, a, b) + 0x80deb1fe + W[13];
-   t2 = e0(d) + Maj(d, e, f);g += t1;c = t1 + t2;
-   t1 = b + e1(g) + Ch(g, h, a) + 0x9bdc06a7 + W[14];
-   t2 = e0(c) + Maj(c, d, e);f += t1;b = t1 + t2;
-   t1 = a + e1(f) + Ch(f, g, h) + 0xc19bf174 + W[15];
-   t2 = e0(b) + Maj(b, c, d);e += t1;a = t1 + t2;
-
-   t1 = h + e1(e) + Ch(e, f, g) + 0xe49b69c1 + W[16];
-   t2 = e0(a) + Maj(a, b, c);d += t1;h = t1 + t2;
-   t1 = g + e1(d) + Ch(d, e, f) + 0xefbe4786 + W[17];
-   t2 = e0(h) + Maj(h, a, b);c += t1;g = t1 + t2;
-   t1 = f + e1(c) + Ch(c, d, e) + 0x0fc19dc6 + W[18];
-   t2 = e0(g) + Maj(g, h, a);b += t1;f = t1 + t2;
-   t1 = e + e1(b) + Ch(b, c, d) + 0x240ca1cc + W[19];
-   t2 = e0(f) + Maj(f, g, h);a += t1;e = t1 + t2;
-   t1 = d + e1(a) + Ch(a, b, c) + 0x2de92c6f + W[20];
-   t2 = e0(e) + Maj(e, f, g); 

[PATCH v2 2/6] crypto: lib/sha256 - Don't clear temporary variables

2020-10-20 Thread Arvind Sankar
The assignments to clear a through h and t1/t2 are optimized out by the
compiler because they are unused after the assignments.

These variables shouldn't be very sensitive: t1/t2 can be calculated
from a through h, so they don't reveal any additional information.
Knowing a through h is equivalent to knowing one 64-byte block's SHA256
hash (with non-standard initial value) which, assuming SHA256 is secure,
doesn't reveal any information about the input.

Signed-off-by: Arvind Sankar 
---
 lib/crypto/sha256.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index d43bc39ab05e..099cd11f83c1 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -202,7 +202,6 @@ static void sha256_transform(u32 *state, const u8 *input)
state[4] += e; state[5] += f; state[6] += g; state[7] += h;
 
/* clear any sensitive info... */
-   a = b = c = d = e = f = g = h = t1 = t2 = 0;
memzero_explicit(W, 64 * sizeof(u32));
 }
 
-- 
2.26.2



[PATCH v2 5/6] crypto: lib/sha256 - Unroll LOAD and BLEND loops

2020-10-20 Thread Arvind Sankar
Unrolling the LOAD and BLEND loops improves performance by ~8% on x86_64
(tested on Broadwell Xeon) while not increasing code size too much.

Signed-off-by: Arvind Sankar 
---
 lib/crypto/sha256.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 5efd390706c6..3a8802d5f747 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -68,12 +68,28 @@ static void sha256_transform(u32 *state, const u8 *input, 
u32 *W)
int i;
 
/* load the input */
-   for (i = 0; i < 16; i++)
-   LOAD_OP(i, W, input);
+   for (i = 0; i < 16; i += 8) {
+   LOAD_OP(i + 0, W, input);
+   LOAD_OP(i + 1, W, input);
+   LOAD_OP(i + 2, W, input);
+   LOAD_OP(i + 3, W, input);
+   LOAD_OP(i + 4, W, input);
+   LOAD_OP(i + 5, W, input);
+   LOAD_OP(i + 6, W, input);
+   LOAD_OP(i + 7, W, input);
+   }
 
/* now blend */
-   for (i = 16; i < 64; i++)
-   BLEND_OP(i, W);
+   for (i = 16; i < 64; i += 8) {
+   BLEND_OP(i + 0, W);
+   BLEND_OP(i + 1, W);
+   BLEND_OP(i + 2, W);
+   BLEND_OP(i + 3, W);
+   BLEND_OP(i + 4, W);
+   BLEND_OP(i + 5, W);
+   BLEND_OP(i + 6, W);
+   BLEND_OP(i + 7, W);
+   }
 
/* load the state into our registers */
a = state[0];  b = state[1];  c = state[2];  d = state[3];
-- 
2.26.2



[PATCH v2 6/6] crypto: lib/sha - Combine round constants and message schedule

2020-10-20 Thread Arvind Sankar
Putting the round constants and the message schedule arrays together in
one structure saves one register, which can be a significant benefit on
register-constrained architectures. On x86-32 (tested on Broadwell
Xeon), this gives a 10% performance benefit.

Signed-off-by: Arvind Sankar 
Suggested-by: David Laight 
---
 lib/crypto/sha256.c | 49 ++---
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 3a8802d5f747..985cd0560d79 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -29,6 +29,11 @@ static const u32 SHA256_K[] = {
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 
0xbef9a3f7, 0xc67178f2,
 };
 
+struct KW {
+   u32 K[64];
+   u32 W[64];
+};
+
 static inline u32 Ch(u32 x, u32 y, u32 z)
 {
return z ^ (x & (y ^ z));
@@ -56,39 +61,39 @@ static inline void BLEND_OP(int I, u32 *W)
 
 #define SHA256_ROUND(i, a, b, c, d, e, f, g, h) do {   \
u32 t1, t2; \
-   t1 = h + e1(e) + Ch(e, f, g) + SHA256_K[i] + W[i];  \
+   t1 = h + e1(e) + Ch(e, f, g) + KW->K[i] + KW->W[i]; \
t2 = e0(a) + Maj(a, b, c);  \
d += t1;\
h = t1 + t2;\
 } while (0)
 
-static void sha256_transform(u32 *state, const u8 *input, u32 *W)
+static void sha256_transform(u32 *state, const u8 *input, struct KW *KW)
 {
u32 a, b, c, d, e, f, g, h;
int i;
 
/* load the input */
for (i = 0; i < 16; i += 8) {
-   LOAD_OP(i + 0, W, input);
-   LOAD_OP(i + 1, W, input);
-   LOAD_OP(i + 2, W, input);
-   LOAD_OP(i + 3, W, input);
-   LOAD_OP(i + 4, W, input);
-   LOAD_OP(i + 5, W, input);
-   LOAD_OP(i + 6, W, input);
-   LOAD_OP(i + 7, W, input);
+   LOAD_OP(i + 0, KW->W, input);
+   LOAD_OP(i + 1, KW->W, input);
+   LOAD_OP(i + 2, KW->W, input);
+   LOAD_OP(i + 3, KW->W, input);
+   LOAD_OP(i + 4, KW->W, input);
+   LOAD_OP(i + 5, KW->W, input);
+   LOAD_OP(i + 6, KW->W, input);
+   LOAD_OP(i + 7, KW->W, input);
}
 
/* now blend */
for (i = 16; i < 64; i += 8) {
-   BLEND_OP(i + 0, W);
-   BLEND_OP(i + 1, W);
-   BLEND_OP(i + 2, W);
-   BLEND_OP(i + 3, W);
-   BLEND_OP(i + 4, W);
-   BLEND_OP(i + 5, W);
-   BLEND_OP(i + 6, W);
-   BLEND_OP(i + 7, W);
+   BLEND_OP(i + 0, KW->W);
+   BLEND_OP(i + 1, KW->W);
+   BLEND_OP(i + 2, KW->W);
+   BLEND_OP(i + 3, KW->W);
+   BLEND_OP(i + 4, KW->W);
+   BLEND_OP(i + 5, KW->W);
+   BLEND_OP(i + 6, KW->W);
+   BLEND_OP(i + 7, KW->W);
}
 
/* load the state into our registers */
@@ -115,7 +120,7 @@ void sha256_update(struct sha256_state *sctx, const u8 
*data, unsigned int len)
 {
unsigned int partial, done;
const u8 *src;
-   u32 W[64];
+   struct KW KW;
 
partial = sctx->count & 0x3f;
sctx->count += len;
@@ -129,13 +134,15 @@ void sha256_update(struct sha256_state *sctx, const u8 
*data, unsigned int len)
src = sctx->buf;
}
 
+   memcpy(KW.K, SHA256_K, sizeof(KW.K));
+
do {
-   sha256_transform(sctx->state, src, W);
+   sha256_transform(sctx->state, src, &KW);
done += 64;
src = data + done;
} while (done + 63 < len);
 
-   memzero_explicit(W, sizeof(W));
+   memzero_explicit(KW.W, sizeof(KW.W));
 
partial = 0;
}
-- 
2.26.2
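
The register saving can be read straight off the struct layout: one base
register reaches both tables with constant displacements. A sketch of the
resulting x86-32 addressing (register choice arbitrary; %esi is assumed to
hold the struct KW pointer and %ecx to hold 4*i):

	movl	(%esi,%ecx), %eax	/* KW->K[i], offset 0            */
	addl	256(%esi,%ecx), %eax	/* KW->W[i], offset 64*4 = 256   */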



[PATCH v2 3/6] crypto: lib/sha256 - Clear W[] in sha256_update() instead of sha256_transform()

2020-10-20 Thread Arvind Sankar
The temporary W[] array is currently zeroed out once every call to
sha256_transform(), i.e. once every 64 bytes of input data. Moving it to
sha256_update() instead so that it is cleared only once per update can
save about 2-3% of the total time taken to compute the digest, with a
reasonable memset() implementation, and considerably more (~20%) with a
bad one (e.g. the x86 purgatory currently uses a memset() coded in C).

Signed-off-by: Arvind Sankar 
---
 lib/crypto/sha256.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/lib/crypto/sha256.c b/lib/crypto/sha256.c
index 099cd11f83c1..c6bfeacc5b81 100644
--- a/lib/crypto/sha256.c
+++ b/lib/crypto/sha256.c
@@ -43,10 +43,9 @@ static inline void BLEND_OP(int I, u32 *W)
W[I] = s1(W[I-2]) + W[I-7] + s0(W[I-15]) + W[I-16];
 }
 
-static void sha256_transform(u32 *state, const u8 *input)
+static void sha256_transform(u32 *state, const u8 *input, u32 *W)
 {
u32 a, b, c, d, e, f, g, h, t1, t2;
-   u32 W[64];
int i;
 
/* load the input */
@@ -200,15 +199,13 @@ static void sha256_transform(u32 *state, const u8 *input)
 
state[0] += a; state[1] += b; state[2] += c; state[3] += d;
state[4] += e; state[5] += f; state[6] += g; state[7] += h;
-
-   /* clear any sensitive info... */
-   memzero_explicit(W, 64 * sizeof(u32));
 }
 
 void sha256_update(struct sha256_state *sctx, const u8 *data, unsigned int len)
 {
unsigned int partial, done;
const u8 *src;
+   u32 W[64];
 
partial = sctx->count & 0x3f;
sctx->count += len;
@@ -223,11 +220,13 @@ void sha256_update(struct sha256_state *sctx, const u8 
*data, unsigned int len)
}
 
do {
-   sha256_transform(sctx->state, src);
+   sha256_transform(sctx->state, src, W);
done += 64;
src = data + done;
} while (done + 63 < len);
 
+   memzero_explicit(W, sizeof(W));
+
partial = 0;
}
memcpy(sctx->buf + partial, src, len - done);
-- 
2.26.2



[PATCH v2 0/6] crypto: lib/sha256 - cleanup/optimization

2020-10-20 Thread Arvind Sankar
Patch 1 -- Use memzero_explicit() instead of structure assignment/plain
memset() to clear sensitive state.

Patch 2 -- I am not sure about this one: currently the temporary
variables used in the generic sha256 implementation are cleared, but the
clearing is optimized away due to lack of compiler barriers. I don't
think it's really necessary to clear them, but I'm not a cryptanalyst,
so I would like comment on whether it's indeed safe not to, or we should
instead add the required barriers to force clearing.

The last four patches are optimizations for generic sha256.

v2:
- Add patch to combine K and W arrays, suggested by David
- Reformat SHA256_ROUND() macro a little

Arvind Sankar (6):
  crypto: Use memzero_explicit() for clearing state
  crypto: lib/sha256 - Don't clear temporary variables
  crypto: lib/sha256 - Clear W[] in sha256_update() instead of
sha256_transform()
  crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64
  crypto: lib/sha256 - Unroll LOAD and BLEND loops
  crypto: lib/sha - Combine round constants and message schedule

 include/crypto/sha1_base.h   |   3 +-
 include/crypto/sha256_base.h |   3 +-
 include/crypto/sha512_base.h |   3 +-
 include/crypto/sm3_base.h|   3 +-
 lib/crypto/sha256.c  | 211 +++
 5 files changed, 71 insertions(+), 152 deletions(-)

-- 
2.26.2



Re: [PATCH 4/5] crypto: lib/sha256 - Unroll SHA256 loop 8 times instead of 64

2020-10-20 Thread Arvind Sankar
On Tue, Oct 20, 2020 at 02:55:47PM +, David Laight wrote:
> From: Arvind Sankar
> > Sent: 20 October 2020 15:07
> > To: David Laight 
> > 
> > On Tue, Oct 20, 2020 at 07:41:33AM +, David Laight wrote:
> > > From: Arvind Sankar> Sent: 19 October 2020 16:30
> > > > To: Herbert Xu ; David S. Miller 
> > > > ; linux-
> > > > cry...@vger.kernel.org
> > > > Cc: linux-kernel@vger.kernel.org
> > > > Subject: [PATCH 4/5] crypto: lib/sha256 - Unroll SHA256 loop 8 times 
> > > > instead of 64
> > > >
> > > > This reduces code size substantially (on x86_64 with gcc-10 the size of
> > > > sha256_update() goes from 7593 bytes to 1952 bytes including the new
> > > > SHA256_K array), and on x86 is slightly faster than the full unroll.
> > >
> > > The speed will depend on exactly which cpu type is used.
> > > It is even possible that the 'not unrolled at all' loop
> > > (with the all the extra register moves) is faster on some x86-64 cpu.
> > 
> > Yes, I should have mentioned this was tested on a Broadwell Xeon, at
> > least on that processor, no unrolling is a measurable performance loss.
> > But the hope is that 8x unroll should be generally enough unrolling that
> > 64x is unlikely to speed it up more, and so has no advantage over 8x.
> 
> (I've just looked at the actual code, not just the patch.)
> 
> Yes I doubt the 64x unroll was ever a significant gain.
> Unrolling completely requires a load of register moves/renames;
> probably too many to be 'zero cost'.
> 
> With modern cpu you can often get the loop control instructions
> 'for free' so unrolling just kills the I-cache.
> Some of the cpu have loop buffers for decoded instructions,
> unroll beyond that size and you suddenly get the decoder costs
> hitting you again.
> 
> ...
> > > If you can put SHA256_K[] and W[] into a struct then the
> > > compiler can use the same register to address into both
> > > arrays (using an offset of 64*4 for the second one).
> > > (ie keep the two arrays, not an array of struct).
> > > This should reduce the register pressure slightly.
> > 
> > I can try that, could copy the data in sha256_update() so it's amortized
> > over the whole input.
> 
> Having looked more closely the extra copy needed is probably
> bigger than any saving.
> 

On x86-64 it doesn't make much difference, but it speeds up x86-32 by
around 10% (on long inputs).


Re: [PATCH v2 3/5] x86/boot/compressed/64: Check SEV encryption in 64-bit boot-path

2020-10-20 Thread Arvind Sankar
On Tue, Oct 20, 2020 at 05:48:12PM +0200, Joerg Roedel wrote:
> On Tue, Oct 20, 2020 at 10:12:59AM -0400, Arvind Sankar wrote:
> > On Tue, Oct 20, 2020 at 02:18:54PM +0200, Joerg Roedel wrote:
> > Why use r10-r12 rather than the caller-save registers? Even for the head
> > code where you need to perserve the cr3 value you can just return it in
> > rax?
> 
> It can surely be optimized, but it makes the code less robust.  This
> function is only called from assembly so the standard x86-64 calling
> conventions might not be followed strictly. I think its better to make
> as few assumptions as possible about the calling code to avoid
> regressions. Changes to the head code are not necessarily tested with
> SEV/SEV-ES guests by developers.
> 
> Regards,
> 
>   Joerg

This is called from both assembly and C, but anyway, you're already
assuming r10 and r11 can be clobbered safely, and you just took out the
save/restores in set_sev_encryption_mask, which is actually called only
from assembly.
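
For reference when reading this exchange (standard System V x86-64 ABI,
independent of the patch):

	/*
	 * caller-saved (volatile):  rax rcx rdx rsi rdi r8 r9 r10 r11
	 * callee-saved (preserved): rbx rbp r12 r13 r14 r15 (and rsp)
	 */

So clobbering r10/r11 already assumes the callers treat those registers as
volatile, and returning the old CR3 value in rax would follow the normal
convention.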

