Re: [PATCH 7/7] DWARF: add the config option
On May 22, 2017 10:49:06 PM PDT, Ingo Molnar wrote: > >* H. Peter Anvin wrote: > >> On 05/22/17 04:12, Ingo Molnar wrote: >> \>> >> >> This construct might be useful for other arches, which is why I >called >> >> it "FP" instead of "BP". But then I ruined that with the last 3 >:-) >> > >> > Please call it BP - 'FP' can easily be read as floating-point, >making it all >> > super-confusing. We should use canonical x86 register names and >ordering - even >> > if not all registers are used straight away. >> > >> >> Seriously, I suspect that at the end of the day we will have >reinvented >> DWARF. > >Absolutely - the main difference is: > >- the debug-info implementation is _internal_ to the kernel so it can >be fixed >instead of "oh, wait 2 years for the toolchain to fix this particular >bug, work >it around in the kernel meanwhile" kind of crazy flow and dependencies. >I.e. >the debug-info generation and parsing code is both part of the kernel >Git tree > and can be iterated (and fixed) at once with. > >- the debug-info is auto-generated for assembly as well, leaving >assembly code > maintainable. > >- the debug-info has a sane data structure designed for robustness and > compactness > >So even if it's a subset of the existing complexity of dwarf et al we >are still >literally infinitely better off with this model. > >Thanks, > > Ingo This assumes that it actually ends up being feasible for objtool to do so. It is worth noting that using DWARF for unwinding vs auto-generating the unwind information are independent issues. Another option is to use (or postprocess) the compiler-generated DWARF for C modules and pursue autogeneration only for assembly modules, which ought to be a much easier problem and is less dependent on discovering new compiler-generated patterns. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
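
For context on why hand-maintained unwind annotations in assembly are the pain point here: every function and every stack-pointer adjustment in a .S file needs matching CFI directives. A minimal sketch using the standard gas directives (the function itself is hypothetical):

            .globl  example_func
    example_func:
            .cfi_startproc
            pushq   %rbx                    # save a callee-saved register
            .cfi_adjust_cfa_offset 8        # the push moved the CFA
            .cfi_rel_offset %rbx, 0         # record where %rbx was saved
            subq    $16, %rsp               # local frame
            .cfi_adjust_cfa_offset 16
            # ... body ...
            addq    $16, %rsp
            .cfi_adjust_cfa_offset -16
            popq    %rbx
            .cfi_adjust_cfa_offset -8
            .cfi_restore %rbx
            ret
            .cfi_endproc

Keeping such annotations in sync by hand is exactly the burden that auto-generation (whether of DWARF or of a kernel-internal format) is meant to take off assembly authors.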
Re: [PATCHv1, RFC 0/8] Boot-time switching between 4- and 5-level paging
On May 26, 2017 8:51:48 AM PDT, Linus Torvalds wrote: >On Fri, May 26, 2017 at 6:00 AM, Kirill A. Shutemov > wrote: >> >> I don't see how kernel threads can use 4-level paging. It doesn't >work >> from virtual memory layout POV. Kernel claims half of full virtual >address >> space for itself -- 256 PGD entries, not one as we would effectively >have >> in case of switching to 4-level paging. For instance, addresses, >where >> vmalloc and vmemmap are mapped, are not canonical with 4-level >paging. > >I would have just assumed we'd map the kernel in the shared part that >fits in the top 47 bits. > >But it sounds like you can't switch back and forth anyway, so I guess >it's moot. > >Where *is* the LA57 documentation, btw? I had an old x86 architecture >manual, so I updated it, but LA57 isn't mentioned in the new one >either. > > Linus As one of the major motivations for LA57 is that we expect that we will have machines with more than 2^46 bytes of memory in the near future, it isn't feasible in most cases to do per-VM LA57. The only case where that even has any utility is for an application to want more than 128 TiB address space on a machine with no more than 64 TiB of RAM. It is kind of a narrow use case, I think. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
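
The arithmetic behind that claim, for reference:

    2^46 bytes     =  64 TiB   (the RAM threshold mentioned above)
    4-level paging: 48-bit VA  ->  2^47 = 128 TiB per half (user/kernel)
    5-level paging: 57-bit VA  ->  2^56 =  64 PiB per half

So a process only benefits from per-address-space LA57 if it wants more than the 128 TiB user half while the machine's physical memory still fits the 4-level direct map.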
Re: [PATCHv1, RFC 0/8] Boot-time switching between 4- and 5-level paging
On May 26, 2017 12:23:18 PM PDT, Dave Hansen wrote: >On 05/26/2017 11:24 AM, h...@zytor.com wrote: >> The only case where that even has any utility is for an application >> to want more than 128 TiB address space on a machine with no more >> than 64 TiB of RAM. It is kind of a narrow use case, I think. > >Doesn't more address space increase the effectiveness of ASLR? I >thought KASLR, especially, was limited in its effectiveness because of >a >lack of address space. The shortage of address space for KASLR is not addressable by LA57; rather, it would have to be addressed by compiling the kernel using a different (less efficient) memory model, presumably the "medium" memory model. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
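
The memory-model constraint here: x86-64 kernels are built with the "kernel" code model, which assumes all kernel text and static data sit in the top 2 GiB of the address space, so KASLR cannot spread the kernel image across a wider range regardless of LA57. A sketch of the flag involved (the first line is what arch/x86/Makefile does today; the commented alternative only illustrates the point and is not an actual kernel option):

    KBUILD_CFLAGS += -mcmodel=kernel    # symbols assumed within the top 2 GiB
    # KBUILD_CFLAGS += -mcmodel=medium  # hypothetical: 64-bit absolute
    #                                   # addressing, larger and slower code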
Re: [PATCHv1, RFC 0/8] Boot-time switching between 4- and 5-level paging
On May 26, 2017 6:00:57 AM PDT, "Kirill A. Shutemov" wrote: >On Thu, May 25, 2017 at 04:24:24PM -0700, Linus Torvalds wrote: >> On Thu, May 25, 2017 at 1:33 PM, Kirill A. Shutemov >> wrote: >> > Here' my first attempt to bring boot-time between 4- and 5-level >paging. >> > It looks not too terrible to me. I've expected it to be worse. >> >> If I read this right, you just made it a global on/off thing. >> >> May I suggest possibly a different model entirely? Can you make it a >> per-mm flag instead? >> >> And then we >> >> (a) make all kthreads use the 4-level page tables >> >> (b) which means that all the init code uses the 4-level page tables >> >> (c) which means that all those checks for "start_secondary" etc can >> just go away, because those all run with 4-level page tables. >> >> Or is it just much too expensive to switch between 4-level and >5-level >> paging at run-time? > >Hm.. > >I don't see how kernel threads can use 4-level paging. It doesn't work >from virtual memory layout POV. Kernel claims half of full virtual >address >space for itself -- 256 PGD entries, not one as we would effectively >have >in case of switching to 4-level paging. For instance, addresses, where >vmalloc and vmemmap are mapped, are not canonical with 4-level paging. > >And you cannot see whole direct mapping of physical memory. Back to >highmem? (Please, no, please). > >We could possible reduce number of PGD required by kernel. Currently, >layout for 5-level paging allows up-to 55-bit physical memory. It's >redundant as SDM claim that we never will get more than 52. So we could >reduce size of kernel part of layout by few bits, but not definitely to >1. > >I don't see how it can possibly work. > >Besides difficulties of getting switching between paging modes correct, >that Andy mentioned, it will also hurt performance. You cannot switch >between paging modes directly. It would require disabling paging >completely. It means we loose benefit from global page table entries on >such switching. More page-walks. > >Even ignoring all of above, I don't see much benefit of having per-mm >switching. It adds complexity without much benefit -- saving few lines >of >logic during early boot doesn't look as huge win to me. It also makes no sense – the kernel threads only need one common page table anyway. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
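
The 256-PGD-entries point, worked out:

    4-level: each PGD entry spans 2^39 B (512 GiB); 256 entries = 2^47 = 128 TiB
    5-level: each PGD entry spans 2^48 B (256 TiB); 256 entries = 2^56 =  64 PiB

A kernel laid out for the 5-level map (direct map, vmalloc, vmemmap) therefore occupies addresses that are not even canonical under 4-level translation, which is the layout problem Kirill describes.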
Re: [PATCH v2 2/7] x86: use long long for 64-bit atomic ops
On May 26, 2017 12:09:04 PM PDT, Dmitry Vyukov wrote: >Some 64-bit atomic operations use 'long long' as operand/return type >(e.g. asm-generic/atomic64.h, arch/x86/include/asm/atomic64_32.h); >while others use 'long' (e.g. arch/x86/include/asm/atomic64_64.h). >This makes it impossible to write portable code. >For example, there is no format specifier that prints result of >atomic64_read() without warnings. atomic64_try_cmpxchg() is almost >impossible to use in portable fashion because it requires either >'long *' or 'long long *' as argument depending on arch. > >Switch arch/x86/include/asm/atomic64_64.h to 'long long'. > >Signed-off-by: Dmitry Vyukov >Cc: Mark Rutland >Cc: Peter Zijlstra >Cc: Will Deacon >Cc: Andrew Morton >Cc: Andrey Ryabinin >Cc: Ingo Molnar >Cc: kasan-...@googlegroups.com >Cc: linux...@kvack.org >Cc: linux-kernel@vger.kernel.org >Cc: x...@kernel.org > >--- >Changes since v1: > - reverted stray s/long/long long/ replace in comment > - added arch/s390 changes to fix build errors/warnings >--- > arch/s390/include/asm/atomic_ops.h | 14 +- > arch/s390/include/asm/bitops.h | 12 - >arch/x86/include/asm/atomic64_64.h | 52 >+++--- > include/linux/types.h | 2 +- > 4 files changed, 40 insertions(+), 40 deletions(-) > >diff --git a/arch/s390/include/asm/atomic_ops.h >b/arch/s390/include/asm/atomic_ops.h >index ac9e2b939d04..055a9083e52d 100644 >--- a/arch/s390/include/asm/atomic_ops.h >+++ b/arch/s390/include/asm/atomic_ops.h >@@ -31,10 +31,10 @@ __ATOMIC_OPS(__atomic_and, int, "lan") > __ATOMIC_OPS(__atomic_or, int, "lao") > __ATOMIC_OPS(__atomic_xor, int, "lax") > >-__ATOMIC_OPS(__atomic64_add, long, "laag") >-__ATOMIC_OPS(__atomic64_and, long, "lang") >-__ATOMIC_OPS(__atomic64_or, long, "laog") >-__ATOMIC_OPS(__atomic64_xor, long, "laxg") >+__ATOMIC_OPS(__atomic64_add, long long, "laag") >+__ATOMIC_OPS(__atomic64_and, long long, "lang") >+__ATOMIC_OPS(__atomic64_or, long long, "laog") >+__ATOMIC_OPS(__atomic64_xor, long long, "laxg") > > #undef __ATOMIC_OPS > #undef __ATOMIC_OP >@@ -46,7 +46,7 @@ static inline void __atomic_add_const(int val, int >*ptr) > : [ptr] "+Q" (*ptr) : [val] "i" (val) : "cc"); > } > >-static inline void __atomic64_add_const(long val, long *ptr) >+static inline void __atomic64_add_const(long val, long long *ptr) > { > asm volatile( > " agsi%[ptr],%[val]\n" >@@ -82,7 +82,7 @@ __ATOMIC_OPS(__atomic_xor, "xr") > #undef __ATOMIC_OPS > > #define __ATOMIC64_OP(op_name, op_string) \ >-static inline long op_name(long val, long *ptr) >\ >+static inline long op_name(long val, long long *ptr) \ > { \ > long old, new; \ > \ >@@ -118,7 +118,7 @@ static inline int __atomic_cmpxchg(int *ptr, int >old, int new) > return old; > } > >-static inline long __atomic64_cmpxchg(long *ptr, long old, long new) >+static inline long __atomic64_cmpxchg(long long *ptr, long old, long >new) > { > asm volatile( > " csg %[old],%[new],%[ptr]" >diff --git a/arch/s390/include/asm/bitops.h >b/arch/s390/include/asm/bitops.h >index d92047da5ccb..8912f52bca5d 100644 >--- a/arch/s390/include/asm/bitops.h >+++ b/arch/s390/include/asm/bitops.h >@@ -80,7 +80,7 @@ static inline void set_bit(unsigned long nr, volatile >unsigned long *ptr) > } > #endif > mask = 1UL << (nr & (BITS_PER_LONG - 1)); >- __atomic64_or(mask, addr); >+ __atomic64_or(mask, (long long *)addr); > } > >static inline void clear_bit(unsigned long nr, volatile unsigned long >*ptr) >@@ -101,7 +101,7 @@ static inline void clear_bit(unsigned long nr, >volatile unsigned long *ptr) > } > #endif > mask = ~(1UL << (nr & (BITS_PER_LONG - 
1))); >- __atomic64_and(mask, addr); >+ __atomic64_and(mask, (long long *)addr); > } > >static inline void change_bit(unsigned long nr, volatile unsigned long >*ptr) >@@ -122,7 +122,7 @@ static inline void change_bit(unsigned long nr, >volatile unsigned long *ptr) > } > #endif > mask = 1UL << (nr & (BITS_PER_LONG - 1)); >- __atomic64_xor(mask, addr); >+ __atomic64_xor(mask, (long long *)addr); > } > > static inline int >@@ -132,7 +132,7 @@ test_and_set_bit(unsigned long nr, volatile >unsigned long *ptr) > unsigned long old, mask; > > mask = 1UL << (nr & (BITS_PER_LONG - 1)); >- old = __atomic64_or_barrier(mask, addr); >+ old = __atomic64_or_barrier(mask, (long long *)addr); > return (old & mask) != 0; > } > >@@ -143,7 +143,7 @@ test_and_clear_bit(unsigned long nr, volatile >unsigned long *ptr) > unsigned long old, mask; > > mask = ~(1UL << (nr & (BITS_PER_LONG - 1))); >- old = __atomic
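
A minimal illustration of the portability problem the changelog describes (a sketch only, assuming an atomic64_t v; this code is not from the patch): before the change, warning-free printing requires a cast on every architecture, because the return type is 'long' on x86-64 but 'long long' under asm-generic:

    /* portable either way, at the cost of a cast at each call site: */
    pr_info("counter: %lld\n", (long long)atomic64_read(&v));

    /* with the patch, x86-64 matches asm-generic and the cast can go: */
    pr_info("counter: %lld\n", atomic64_read(&v));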
Re: [PATCH v2 2/7] x86: use long long for 64-bit atomic ops
On May 28, 2017 2:29:32 AM PDT, Dmitry Vyukov wrote: >On Sun, May 28, 2017 at 1:02 AM, wrote: >> On May 26, 2017 12:09:04 PM PDT, Dmitry Vyukov >wrote: >>>Some 64-bit atomic operations use 'long long' as operand/return type >>>(e.g. asm-generic/atomic64.h, arch/x86/include/asm/atomic64_32.h); >>>while others use 'long' (e.g. arch/x86/include/asm/atomic64_64.h). >>>This makes it impossible to write portable code. >>>For example, there is no format specifier that prints result of >>>atomic64_read() without warnings. atomic64_try_cmpxchg() is almost >>>impossible to use in portable fashion because it requires either >>>'long *' or 'long long *' as argument depending on arch. >>> >>>Switch arch/x86/include/asm/atomic64_64.h to 'long long'. >>> >>>Signed-off-by: Dmitry Vyukov >>>Cc: Mark Rutland >>>Cc: Peter Zijlstra >>>Cc: Will Deacon >>>Cc: Andrew Morton >>>Cc: Andrey Ryabinin >>>Cc: Ingo Molnar >>>Cc: kasan-...@googlegroups.com >>>Cc: linux...@kvack.org >>>Cc: linux-kernel@vger.kernel.org >>>Cc: x...@kernel.org >>> >>>--- >>>Changes since v1: >>> - reverted stray s/long/long long/ replace in comment >>> - added arch/s390 changes to fix build errors/warnings >>>--- >>> arch/s390/include/asm/atomic_ops.h | 14 +- >>> arch/s390/include/asm/bitops.h | 12 - >>>arch/x86/include/asm/atomic64_64.h | 52 >>>+++--- >>> include/linux/types.h | 2 +- >>> 4 files changed, 40 insertions(+), 40 deletions(-) >>> >>>diff --git a/arch/s390/include/asm/atomic_ops.h >>>b/arch/s390/include/asm/atomic_ops.h >>>index ac9e2b939d04..055a9083e52d 100644 >>>--- a/arch/s390/include/asm/atomic_ops.h >>>+++ b/arch/s390/include/asm/atomic_ops.h >>>@@ -31,10 +31,10 @@ __ATOMIC_OPS(__atomic_and, int, "lan") >>> __ATOMIC_OPS(__atomic_or, int, "lao") >>> __ATOMIC_OPS(__atomic_xor, int, "lax") >>> >>>-__ATOMIC_OPS(__atomic64_add, long, "laag") >>>-__ATOMIC_OPS(__atomic64_and, long, "lang") >>>-__ATOMIC_OPS(__atomic64_or, long, "laog") >>>-__ATOMIC_OPS(__atomic64_xor, long, "laxg") >>>+__ATOMIC_OPS(__atomic64_add, long long, "laag") >>>+__ATOMIC_OPS(__atomic64_and, long long, "lang") >>>+__ATOMIC_OPS(__atomic64_or, long long, "laog") >>>+__ATOMIC_OPS(__atomic64_xor, long long, "laxg") >>> >>> #undef __ATOMIC_OPS >>> #undef __ATOMIC_OP >>>@@ -46,7 +46,7 @@ static inline void __atomic_add_const(int val, int >>>*ptr) >>> : [ptr] "+Q" (*ptr) : [val] "i" (val) : "cc"); >>> } >>> >>>-static inline void __atomic64_add_const(long val, long *ptr) >>>+static inline void __atomic64_add_const(long val, long long *ptr) >>> { >>> asm volatile( >>> " agsi%[ptr],%[val]\n" >>>@@ -82,7 +82,7 @@ __ATOMIC_OPS(__atomic_xor, "xr") >>> #undef __ATOMIC_OPS >>> >>> #define __ATOMIC64_OP(op_name, op_string) > \ >>>-static inline long op_name(long val, long *ptr) > \ >>>+static inline long op_name(long val, long long *ptr) > \ >>> { > \ >>> long old, new; > \ >>> > \ >>>@@ -118,7 +118,7 @@ static inline int __atomic_cmpxchg(int *ptr, int >>>old, int new) >>> return old; >>> } >>> >>>-static inline long __atomic64_cmpxchg(long *ptr, long old, long new) >>>+static inline long __atomic64_cmpxchg(long long *ptr, long old, long >>>new) >>> { >>> asm volatile( >>> " csg %[old],%[new],%[ptr]" >>>diff --git a/arch/s390/include/asm/bitops.h >>>b/arch/s390/include/asm/bitops.h >>>index d92047da5ccb..8912f52bca5d 100644 >>>--- a/arch/s390/include/asm/bitops.h >>>+++ b/arch/s390/include/asm/bitops.h >>>@@ -80,7 +80,7 @@ static inline void set_bit(unsigned long nr, >volatile >>>unsigned long *ptr) >>> } >>> #endif >>> mask = 1UL << (nr & (BITS_PER_LONG - 1)); >>>- 
__atomic64_or(mask, addr); >>>+ __atomic64_or(mask, (long long *)addr); >>> } >>> >>>static inline void clear_bit(unsigned long nr, volatile unsigned long >>>*ptr) >>>@@ -101,7 +101,7 @@ static inline void clear_bit(unsigned long nr, >>>volatile unsigned long *ptr) >>> } >>> #endif >>> mask = ~(1UL << (nr & (BITS_PER_LONG - 1))); >>>- __atomic64_and(mask, addr); >>>+ __atomic64_and(mask, (long long *)addr); >>> } >>> >>>static inline void change_bit(unsigned long nr, volatile unsigned >long >>>*ptr) >>>@@ -122,7 +122,7 @@ static inline void change_bit(unsigned long nr, >>>volatile unsigned long *ptr) >>> } >>> #endif >>> mask = 1UL << (nr & (BITS_PER_LONG - 1)); >>>- __atomic64_xor(mask, addr); >>>+ __atomic64_xor(mask, (long long *)addr); >>> } >>> >>> static inline int >>>@@ -132,7 +132,7 @@ test_and_set_bit(unsigned long nr, volatile >>>unsigned long *ptr) >>> unsigned long old, mask; >>> >>> mask = 1UL << (nr & (BITS_PER_LONG - 1));
Re: [PATCH 0/6] Boot-time switching between 4- and 5-level paging for 4.15, Part 1
On October 17, 2017 5:42:41 PM GMT+02:00, "Kirill A. Shutemov" wrote: >On Tue, Oct 03, 2017 at 11:27:54AM +0300, Kirill A. Shutemov wrote: >> On Fri, Sep 29, 2017 at 05:08:15PM +0300, Kirill A. Shutemov wrote: >> > The first bunch of patches that prepare kernel to boot-time >switching >> > between paging modes. >> > >> > Please review and consider applying. >> >> Ping? > >Ingo, is there anything I can do to get review easier for you? > >I hoped to get boot-time switching code into v4.15... One issue that has come up with this is what happens if the kernel is loaded above 4 GB and we need to switch page table mode. In that case we need enough memory below the 4 GB point to hold a root page table (since we can't write the upper half of cr3 outside of 64-bit mode) and a handful of instructions. We have no real way to know for sure what memory is safe without parsing all the memory maps and mapping out all the data structures that the bootloader has left for the kernel. I'm thinking that the best way to deal with this is to add an entry in setup_data to provide a pointer, with the kernel header specifying the necessary size and alignment. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
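
The setup_data mechanism referred to is the existing singly linked list of extra boot parameters handed from the bootloader to the kernel; each node has this layout (per arch/x86/include/uapi/asm/bootparam.h). The second struct and its type name below are purely hypothetical, sketching what a "safe low-memory region" entry could carry:

    struct setup_data {
            __u64 next;     /* physical address of the next node, 0 if last */
            __u32 type;     /* e.g. SETUP_E820_EXT, SETUP_DTB, ... */
            __u32 len;      /* length of data[] in bytes */
            __u8  data[0];
    };

    /* hypothetical payload: bootloader-allocated scratch below 4 GiB,
     * sized and aligned per fields the kernel header would advertise */
    struct setup_low_mem {
            __u64 addr;     /* physical address, below 4 GiB */
            __u64 size;
    };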
Re: [PATCH v3 0/7] x86/ptrace: regset access to the GDT and LDT
On June 21, 2018 6:58:51 PM PDT, Ingo Molnar wrote: > >* H. Peter Anvin, Intel wrote: > >> From: "H. Peter Anvin" >> >> Give a debugger access to the visible part of the GDT and LDT. This >> allows a debugger to find out what a particular segment descriptor >> corresponds to; e.g. if %cs is 16, 32, or 64 bits. >> >> v3: >> Requalify LDT segments for selectors that have actually changed. >> >> v2: >> Add missing change to elf.h for the very last patch. >> >> Cc: Ingo Molnar >> Cc: Thomas Gleixner >> Cc: Andy Lutomirski >> Cc: Chang S. Bae >> Cc: Markus T. Metzger >> Cc: H. Peter Anvin >> >> arch/x86/Kconfig | 4 + >> arch/x86/include/asm/desc.h| 24 +++- >> arch/x86/include/asm/ldt.h | 16 +++ >> arch/x86/include/asm/segment.h | 10 ++ >> arch/x86/kernel/Makefile | 3 +- >> arch/x86/kernel/ldt.c | 283 >- >> arch/x86/kernel/ptrace.c | 103 ++- >> arch/x86/kernel/tls.c | 102 +-- >> arch/x86/kernel/tls.h | 8 +- >> include/uapi/linux/elf.h | 2 + >> 10 files changed, 413 insertions(+), 142 deletions(-) > >Ok, this looks mostly good to me at a quick glance, but could you >please do one >more iteration and more explicitly describe where you change/extend >existing ABIs? > >I.e. instead of a terse and somewhat vague summary: > >> x86/tls,ptrace: provide regset access to the GDT >> >> Give a debugger access to the visible part of the GDT and LDT. This >> allows a debugger to find out what a particular segment descriptor >> corresponds to; e.g. if %cs is 16, 32, or 64 bits. > >Please make it really explicit how the ABI is affected, both in the >title and in >the description, and also _explain_ how this helps us over what we had >before, >plus also list limitations of the new ABI. > >I.e. something like: > > x86/tls, ptrace: Extend the ptrace ABI with the new REGSET_GDT method > > Add the new REGSET_GDT ptrace variant to PTRACE_{GET|SET}REGSET, > to give read and write access to the GDT: > >- Previously only certain parts of the GDT were visible, and only via >random >ABIs and instructions - there was no architectured access to all of it. > >- The SETREGSET variant is only allowed to change the TLS area of the >GDT. > >(or so.) > >This is important not just for documentation and library support >purposes, but >also to be able to review it properly, to match 'intent' to >'implementation'. > >(It might also help later on in finding bugs/quirks, if any.) > >Please do this for all patches in the series that change the ABI. > >Thanks, > > Ingo ACK. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
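
For reference, the regset interface being extended here is reached from userspace through PTRACE_GETREGSET with an iovec; a debugger using the proposed GDT regset would look roughly like the sketch below. NT_X86_GDT stands in for whatever ELF note type the series assigns, and the traced pid is assumed; both are illustrative only:

    #include <sys/ptrace.h>
    #include <sys/uio.h>
    #include <asm/ldt.h>                    /* struct user_desc */

    struct user_desc gdt[32];
    struct iovec iov = {
            .iov_base = gdt,
            .iov_len  = sizeof(gdt),
    };
    /* the third argument selects the regset by ELF note type */
    if (ptrace(PTRACE_GETREGSET, pid, NT_X86_GDT, &iov) == 0) {
            /* on return, iov.iov_len holds the number of bytes filled in */
    }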
Re: [PATCH v3 7/7] x86/ldt,ptrace: provide regset access to the LDT
On June 22, 2018 7:49:13 AM PDT, Andy Lutomirski wrote: >On Thu, Jun 21, 2018 at 2:18 PM H. Peter Anvin, Intel > wrote: >> >> From: "H. Peter Anvin" >> >> Provide ptrace/regset access to the LDT, if one exists. This >> interface provides both read and write access. The write code is >> unified with modify_ldt(); the read code doesn't have enough >> similarity so it has been kept separate. > >For this and for the GDT, you've chosen to use struct user_desc as >your format instead of using a native hardware descriptor format. Any >particular reason why? If nothing else, it will bloat core files a >bit more than needed. I did it that way because REGSET_TLS was implemented that way, and that is simply a subset of the GDT (which made the same code trivially applicable to both.) modify_ldt() does it *both* ways for extra fun (one for reading, and one for writing.) ->active is defined as "beyond this point the regset contains only the default value", which seemed appropriate in this case. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
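
For reference, the format in question: struct user_desc from asm/ldt.h is the unpacked representation (four ints plus bitfields, 16 bytes) of what the hardware stores in a packed 8-byte descriptor, which is the core-file bloat Andy refers to:

    struct user_desc {
            unsigned int  entry_number;
            unsigned int  base_addr;
            unsigned int  limit;
            unsigned int  seg_32bit:1;
            unsigned int  contents:2;
            unsigned int  read_exec_only:1;
            unsigned int  limit_in_pages:1;
            unsigned int  seg_not_present:1;
            unsigned int  useable:1;
    #ifdef __x86_64__
            unsigned int  lm:1;             /* 64-bit code segment */
    #endif
    };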
Re: [RFC PATCH 00/16] PTI support for x86-32
On January 21, 2018 6:11:07 PM PST, Linus Torvalds wrote: >On Sun, Jan 21, 2018 at 3:46 PM, Nadav Amit >wrote: >> I wanted to see whether segments protection can be a replacement for >PTI >> (yes, excluding SMEP emulation), or whether speculative execution >“ignores” >> limit checks, similarly to the way paging protection is skipped. >> >> It does seem that segmentation provides sufficient protection from >Meltdown. >> The “reliability” test of Gratz PoC fails if the segment limit is set >to >> prevent access to the kernel memory. [ It passes if the limit is not >set, >> even if the DS is reloaded. ] My test is enclosed below. > >Interesting. It might not be entirely reliable for all >microarchitectures, though. > >> So my question: wouldn’t it be much more efficient to use >segmentation >> protection for x86-32, and allow users to choose whether they want >SMEP-like >> protection if needed (and then enable PTI)? > >That's what we did long long ago, with user space segments actually >using the limit (in fact, if you go back far enough, the kernel even >used the base). > >You'd have to make sure that the LDT loading etc do not allow CPL3 >segments with base+limit past TASK_SIZE, so that people can't generate >their own. And the TLS segments also need to be limited (and >remember, the limit has to be TASK_SIZE-base, not just TASK_SIZE). > >And we should check with Intel that segment limit checking really is >guaranteed to be done before any access. > >Too bad x86-64 got rid of the segments ;) > > Linus No idea about Intel, but at least on Transmeta CPUs the limit check was asynchronous with the access. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v1] x86/io: Define readq()/writeq() to use 64-bit type
On January 22, 2018 4:32:14 PM PST, "Mehta, Sohil" wrote: >On Fri, 2018-01-19 at 16:33 +0200, Andy Shevchenko wrote: >> Since non atomic readq() and writeq() were added some of the drivers >> would like to use it in a manner of: >> >> #include >> ... >> pr_debug("Debug value of some register: %016llx\n", readq(addr)); >> >> However, lo_hi_readq() always returns __u64 data, while readq() >> on x86_64 defines it as unsigned long. and thus compiler warns >> about type mismatch, although they are both 64-bit on x86_64. >> >> Convert readq() and writeq() on x86 to operate on deterministic >> 64-bit type. The most of architectures in the kernel already are >> using >> either unsigned long long, or u64 type for readq() / writeq(). >> This change propagates consistency in that sense. >> >> While this is not an issue per se, though if someone wants to address >> it, >> the anchor could be the commit >> >> 797a796a13df ("asm-generic: architecture independent readq/writeq >> for 32bit environment") >> >> where non-atomic variants had been introduced. >> >> Note, there are only few users of above pattern and they will not be >> affected because they do cast returned value. The actual warning has >> been issued on not-yet-upstreamed code. >> >> Potentially we might get a new warnings if some 64-bit only code >> assigns returned value to unsigned long type of variable. This is >> assumed to be addressed on case-by-case basis. >> >> Reported-by: lkp >> Cc: Hitoshi Mitake >> Cc: "Mehta, Sohil" >> Signed-off-by: Andy Shevchenko >> --- >> arch/x86/include/asm/io.h | 8 >> 1 file changed, 4 insertions(+), 4 deletions(-) >> >> diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h >> index 95e948627fd0..365f5ba9222b 100644 >> --- a/arch/x86/include/asm/io.h >> +++ b/arch/x86/include/asm/io.h >> @@ -94,10 +94,10 @@ build_mmio_write(__writel, "l", unsigned int, >> "r", ) >> >> #ifdef CONFIG_X86_64 >> >> -build_mmio_read(readq, "q", unsigned long, "=r", :"memory") >> -build_mmio_read(__readq, "q", unsigned long, "=r", ) >> -build_mmio_write(writeq, "q", unsigned long, "r", :"memory") >> -build_mmio_write(__writeq, "q", unsigned long, "r", ) >> +build_mmio_read(readq, "q", unsigned long long, "=r", :"memory") >> +build_mmio_read(__readq, "q", unsigned long long, "=r", ) >> +build_mmio_write(writeq, "q", unsigned long long, "r", :"memory") >> +build_mmio_write(__writeq, "q", unsigned long long, "r", ) >> >> #define readq_relaxed(a)__readq(a) >> #define writeq_relaxed(v, a)__writeq(v, a) > >The patch works for me: > >Tested-by: Sohil Mehta > >Sohil Wouldn't simply u64 make more sense? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
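
hpa's suggestion, sketched against the same macros (u64 is the kernel's fixed-width unsigned 64-bit type, so it sidesteps the long vs. long long question entirely):

    build_mmio_read(readq, "q", u64, "=r", :"memory")
    build_mmio_read(__readq, "q", u64, "=r", )
    build_mmio_write(writeq, "q", u64, "r", :"memory")
    build_mmio_write(__writeq, "q", u64, "r", )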
Re: [PATCH v3] x86: i8237: Register based on FADT legacy boot flag
On March 26, 2018 2:11:51 AM PDT, Thomas Gleixner wrote: >On Mon, 26 Mar 2018, Rajneesh Bhardwaj wrote: > >> On Sun, Mar 25, 2018 at 01:50:40PM +0200, Thomas Gleixner wrote: >> > On Thu, 22 Mar 2018, Anshuman Gupta wrote: >> > >> > > From: Rajneesh Bhardwaj >> > > >> > > >From Skylake onwards, the platform controller hub (Sunrisepoint >PCH) does >> > > not support legacy DMA operations to IO ports 81h-83h, 87h, >89h-8Bh, 8Fh. >> > > Currently this driver registers as syscore ops and its resume >function is >> > > called on every resume from S3. On Skylake and Kabylake, this >causes a >> > > resume delay of around 100ms due to port IO operations, which is >a problem. >> > > >> > > This change allows to load the driver only when the platform bios >> > > explicitly supports such devices or has a cut-off date earlier >than 2017. >> > >> > Please explain WHY 2017 is the cut-off date. I still have no clue >how that >> > is decided aside of being a random number. >> >> Hello Thomas, >> >> We tested on few Intel platforms such as Skylake, Kabylake, >Geminilake etc >> and realized that the BIOS always sets the FADT flag to be true >though the >> device may not be physically present on the SoC. This is a BIOS bug. >To keep >> the impact minimum, we decided to add a cut-off date since we are not >aware >> of any BIOS (other than the coreboot link provided in the commit msg) >that >> properly sets this field. SoCs released after Skylake will not have >this DMA >> device on the PCH. So, because of these two reasons, we decided to >add a >> cut-off date as 2017. >> >> Please let us know if you feel strongly about it and we can change it >or >> remove it if you feel so. > >I don't feel strongly about the cut off itself, but I want a reasonable >explanation in the changelog or code comment because half a year from >now >nobody remembers > >Thanks, > > tglx Can we probe safely for this device? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
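
One conceivable probe, sketched with no claim that it is actually safe on the affected PCHs (port 0x87 is the documented DMA channel 0 page register; unimplemented legacy ports typically read back as 0xff, but that behavior is exactly what a real probe would have to verify):

    /* hypothetical presence check for the legacy 8237 DMA controller */
    static bool i8237_present(void)
    {
            u8 orig = inb(0x87);            /* channel 0 page register */
            outb(0x55, 0x87);
            if (inb(0x87) != 0x55)
                    return false;
            outb(0xaa, 0x87);
            if (inb(0x87) != 0xaa) {
                    outb(orig, 0x87);
                    return false;
            }
            outb(orig, 0x87);               /* restore the original value */
            return true;
    }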
Re: Sleeping in user_access section
On November 23, 2018 1:27:02 AM PST, Julien Thierry wrote: >Hi, > >I made an attempt at implementing the >user_access_begin()/user_access_end() macros along with the >get/put_user_unsafe() for arm64 by toggling the status of PAN (more or >less similar to x86's STAC/CLAC). > >With a small mistake in my patch, we realized that directly calling a >function that could reschedule while in a user_access section could >lead to: > >- scheduling another task keeping the user_access status enabled >despite >the task never calling user_access_begin() > >- when re-scheduling the task that was mid user_access section, >user_access would be disabled and the task would fault on the next >get/put_user_unsafe. > > >This is because __switch_to does not alter the user_access status when >switching from next to prev (at least on arm64 we currently don't, and >by looking at the x86 code I don't think this is done either). > > > From my understanding, this is not an issue when the task in >user_access mode gets scheduled out/in as a result of an interrupt as >PAN and EFLAGS.AC get saved/restored on exception entry/exit (at least I > >know it is the case for PAN, I am less sure for the x86 side). > > >So, the question is, should __switch_to take care of the user_access >status when scheduling new tasks? Or should there be a restriction >about >scheduling out a task with user_access mode enabled and maybe add a >warning if we can detect this? > >(Or did we miss something and this is not an issue on x86?) > >Thanks, You should never call a sleeping function from a user_access section. It is intended for very limited regions. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
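
The intended usage pattern, as a sketch (x86 flavor; the signatures follow the 4.20-era definitions and have shifted in later kernels): the begin/end pair brackets only a handful of unsafe accesses, with no function calls, and therefore no chance of sleeping or rescheduling, in between:

    if (!access_ok(VERIFY_WRITE, ptr, sizeof(*ptr)))
            return -EFAULT;
    user_access_begin();                    /* STAC on x86, PAN off on arm64 */
    unsafe_put_user(val, ptr, efault);      /* plain access; faults jump out */
    user_access_end();                      /* CLAC / PAN back on */
    return 0;

    efault:
    user_access_end();
    return -EFAULT;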
Re: [PATCH] x86/TSC: Use RDTSCP
On November 23, 2018 12:03:07 PM PST, Guenter Roeck wrote: >Hi, > >On Mon, Nov 19, 2018 at 07:45:56PM +0100, Borislav Petkov wrote: >> From: Borislav Petkov >> >> Currently, the kernel uses >> >> [LM]FENCE; RDTSC >> >> in the timekeeping code, to guarantee monotonicity of time where the >> *FENCE is selected based on vendor. >> >> Replace that sequence with RDTSCP which is faster or on-par and gives >> the same guarantees. >> >> A microbenchmark on Intel shows that the change is on-par. >> >> On AMD, the change is either on-par with the current LFENCE-prefixed >> RDTSC and some are slightly better with RDTSCP. >> >> The comparison is done with the LFENCE-prefixed RDTSC (and not with >the >> MFENCE-prefixed one, as one would normally expect) because all modern >> AMD families make LFENCE serializing and thus avoid the heavy MFENCE >by >> effectively enabling X86_FEATURE_LFENCE_RDTSC. >> >> Co-developed-by: Thomas Gleixner >> Signed-off-by: Thomas Gleixner >> Signed-off-by: Borislav Petkov >> Cc: Andy Lutomirski >> Cc: "H. Peter Anvin" >> Cc: John Stultz >> Cc: Thomas Lendacky >> Cc: x...@kernel.org >> --- >> arch/x86/include/asm/msr.h | 15 +-- >> 1 file changed, 13 insertions(+), 2 deletions(-) >> >> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h >> index 91e4cf189914..f00f2b61d326 100644 >> --- a/arch/x86/include/asm/msr.h >> +++ b/arch/x86/include/asm/msr.h >> @@ -217,6 +217,8 @@ static __always_inline unsigned long long >rdtsc(void) >> */ >> static __always_inline unsigned long long rdtsc_ordered(void) >> { >> +DECLARE_ARGS(val, low, high); >> + >> /* >> * The RDTSC instruction is not ordered relative to memory >> * access. The Intel SDM and the AMD APM are both vague on this >> @@ -227,9 +229,18 @@ static __always_inline unsigned long long >rdtsc_ordered(void) >> * ordering guarantees as reading from a global memory location >> * that some other imaginary CPU is updating continuously with a >> * time stamp. >> + * >> + * Thus, use the preferred barrier on the respective CPU, aiming >for >> + * RDTSCP as the default. >> */ >> -barrier_nospec(); >> -return rdtsc(); >> +asm volatile(ALTERNATIVE_2("mfence; rdtsc", >> + "lfence; rdtsc", X86_FEATURE_LFENCE_RDTSC, >> + "rdtscp", X86_FEATURE_RDTSCP) >> +: EAX_EDX_RET(val, low, high) >> +/* RDTSCP clobbers ECX with MSR_TSC_AUX. */ >> +:: "ecx"); >> + >> +return EAX_EDX_VAL(val, low, high); >> } >> > >This patch results in a crash with certain qemu emulations. > >[0.756869] hpet0: 3 comparators, 64-bit 100.00 MHz counter >[0.762233] invalid opcode: [#1] PTI >[0.762435] CPU: 0 PID: 1 Comm: swapper Not tainted >4.20.0-rc3-next-20181123 #2 >[0.762644] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS >Ubuntu-1.8.2-1ubuntu1 04/01/2014 >[0.762832] EIP: read_tsc+0x4/0x10 > >To reproduce: > >make ARCH=i386 defconfig >echo "CONFIG_HIGHMEM64G=y" >> .config >make ARCH=i386 olddefconfig >make -j30 ARCH=i386 > >qemu-system-i386 \ > -kernel arch/x86/boot/bzImage \ > -M q35 -cpu pentium3 -no-reboot -m 256 \ > -initrd rootfs.cpio \ > --append 'earlycon=uart8250,io,0x3f8,9600n8 rdinit=/sbin/init panic=-1 >console=ttyS0 console=tty' \ > -nographic -monitor none > >initrd or root file system doesn't really matter since the code never >gets there, but it is available from here: > > https://github.com/groeck/linux-build-test/blob/master/rootfs/x86/rootfs.cpio.gz >The qemu version does not matter either (I tried with 2.5 and 3.0). > >Reverting the patch fixes the problem. > >Guenter > >--- >crash log: > >... 
>[0.756366] HPET: 3 timers in total, 0 timers will be used for >per-cpu timer >[0.756718] hpet0: at MMIO 0xfed0, IRQs 2, 8, 0 >[0.756869] hpet0: 3 comparators, 64-bit 100.00 MHz counter >[0.762233] invalid opcode: [#1] PTI >[0.762435] CPU: 0 PID: 1 Comm: swapper Not tainted >4.20.0-rc3-next-20181123 >#2 >[0.762644] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS >Ubuntu-1.8.2-1ubuntu1 04/01/2014 >[0.762832] EIP: read_tsc+0x4/0x10 >[0.762832] Code: 00 01 00 eb 89 90 55 89 e5 5d c3 90 90 90 90 90 90 >90 90 90 90 90 55 a1 44 5a 8b c5 89 e5 5d c3 8d b6 00 00 00 00 55 89 e5 >57 <0f> ae f0b >[0.762832] EAX: c58062c0 EBX: c58062c0 ECX: 0008 EDX: c4c1f0a0 >[0.762832] ESI: c58168c0 EDI: 00200086 EBP: cb495ed4 ESP: cb495ed0 >[0.762832] DS: 007b ES: 007b FS: GS: 00e0 SS: 0068 EFLAGS: >0022 >[0.762832] CR0: 80050033 CR2: CR3: 0597a000 CR4: 06f0 >[0.762832] Call Trace: >[0.762832] tk_setup_internals.constprop.10+0x3d/0x260 >[0.762832] timekeeping_notify+0x56/0xc0 >[0.762832] __clocksource_select+0xf3/0x150 >[0.762832] ? boot_override_clock+0x42/0x42 >
Re: [PATCH] x86/TSC: Use RDTSCP
On November 19, 2018 12:40:25 PM PST, Borislav Petkov wrote: >On Mon, Nov 19, 2018 at 12:17:35PM -0800, H. Peter Anvin wrote: >> On 11/19/18 11:52 AM, Andy Lutomirski wrote: >> > >> > I thought I benchmarked this on Intel at some point and found the >> > LFENCE;RDTSC variant to be slightly faster. But I believe you, so: >> > >> > Acked-by: Andy Lutomirski >> > >> >> As long as the difference isn't significant, the simplicity would >seem to be a >> win. > >Right, I think by simplicity you mean RDTSCP. :) > >Also in the sense that you have a single instruction which gives you >that barrier of waiting for all older insns to retire before reading >the >TSC vs two where you don't necessarily know what happens on the uarch >level. I mean, nothing probably happens but RDTSCP is still simpler :) > >Also, hpa, isn't LFENCE; RDTSC and RDTSCP equivalent on Intel? In the >sense that RDTSCP microcode practically adds an LFENCE before reading >the TSC? > >Because SDM says: > >"The RDTSCP instruction is not a serializing instruction, but it does >wait until all previous instructions have executed and all previous >loads are globally visible." > >which sounds pretty much like an LFENCE to me: > >"LFENCE does not execute until all prior instructions have completed >locally, and no later instruction begins execution until LFENCE >completes." I don't know the exact sequence of fencing operations it is equivalent to, but it is probably something quite similar. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: PLEASE REVERT URGENTLY: Re: [PATCH v5 2/3] x86/boot: add acpi rsdp address to setup_header
On November 10, 2018 7:22:29 AM PST, Juergen Gross wrote: >On 09/11/2018 23:23, H. Peter Anvin wrote: >> I just noticed this patch -- I missed it because the cover message >> seemed far more harmless so I didn't notice this change. >> >> THIS PATCH IS FATALLY WRONG AND NEEDS TO BE IMMEDIATELY REVERTED >BEFORE >> ANYONE STARTS RELYING ON IT; IT HAS THE POTENTIAL OF BREAKING THE >> BOOTLOADER PROTOCOL FOR ALL FUTURE. >> >> It seems to be based on fundamental misconceptions about the various >> data structures in the protocol, and does so in a way that completely >> breaks the way the protocol is designed to work. >> >> The protocol is specifically designed such that fields are not >version >> dependencies. The version number is strictly to inform the boot >loader >> about which capabilities the kernel has, so that the boot loader can >> know if a certain data field is meaningful and/or honored. >> >>> +Protocol 2.14: (Kernel 4.20) Added acpi_rsdp_addr holding the >physical >>> + address of the ACPI RSDP table. >>> + The bootloader updates version with: >>> + 0x8000 | min(kernel-version, bootloader-version) >>> + kernel-version being the protocol version supported by >>> + the kernel and bootloader-version the protocol version >>> + supported by the bootloader. >> >> [...] >> >>> MEMORY LAYOUT >>> >>> The traditional memory map for the kernel loader, used for Image or >>> @@ -197,6 +209,7 @@ Offset Proto NameMeaning >>> 0258/8 2.10+ pref_addressPreferred loading address >>> 0260/4 2.10+ init_size Linear memory required during >>> initialization >>> 0264/4 2.11+ handover_offset Offset of handover entry point >>> +0268/8 2.14+ acpi_rsdp_addr Physical address of RSDP table >> >> NO. >> >> That is not how struct setup_header works, nor does this belong here. >> >> struct setup_header contains *initialized data*, and has a length >byte >> at offset 0x201. The bootloader is responsible for copying the full >> structure into the appropriate offset (0x1f1) in struct boot_params. >> >> The length byte isn't actually a requirement, since the maximum >possible >> size of this structure is 144 bytes, and the kernel will (obviously) >not >> look at the older fields anyway, but it is good practice. The kernel >or >> any other entity is free to zero out the bytes past this length >pointer. >> >> There are only 24 bytes left in this structure, and this would occupy >8 >> of them for no valid reason. The *only* valid reason to put a >> zero-initialized field in struct setup_header is if it used by the >> 16-bit legacy BIOS boot, which is obviously not the case here. >> >> This field thus belongs in struct boot_params, not struct >setup_header. > >Would you be okay with putting acpi_rsdp_addr at offset 0x0cc (_pad4)? > > >Juergen I'd prefer if you used __pad3 offset 0x70 to keep the large block, and that way your field is also aligned. However, if you have some specific reason to prefer __pad4 it's no big deal, although I'm curious what it would be. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
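
For orientation, a sketch of the struct boot_params region being discussed, with the offsets from this thread (see arch/x86/include/uapi/asm/bootparam.h for the authoritative layout):

    struct boot_params {
            struct screen_info   screen_info;       /* 0x000 */
            struct apm_bios_info apm_bios_info;     /* 0x040 */
            __u8                 _pad2[4];          /* 0x054 */
            __u64                tboot_addr;        /* 0x058 */
            struct ist_info      ist_info;          /* 0x060 */
            __u8                 _pad3[16];         /* 0x070: hpa's suggestion --
                                                     * carve an aligned __u64
                                                     * acpi_rsdp_addr from here */
            /* ... */
            __u8                 _pad4[116];        /* 0x0cc: Juergen's proposal */
            /* ... */
    };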
Re: [PATCH] x86/boot: clear rsdp address in boot_params for broken loaders
On December 3, 2018 2:38:11 AM PST, Juergen Gross wrote: >In case a broken boot loader doesn't clear its struct boot_params clear >rsdp_addr in sanitize_boot_params(). > >This fixes commit e6e094e053af75 ("x86/acpi, x86/boot: Take RSDP >address from boot params if available") e.g. for the case of a boot via >systemd-boot. > >Fixes: e6e094e053af75 ("x86/acpi, x86/boot: Take RSDP address from boot >params if available") >Reported-by: Gunnar Krueger >Tested-by: Gunnar Krueger >Signed-off-by: Juergen Gross >--- > arch/x86/include/asm/bootparam_utils.h | 1 + > 1 file changed, 1 insertion(+) > >diff --git a/arch/x86/include/asm/bootparam_utils.h >b/arch/x86/include/asm/bootparam_utils.h >index a07ffd23e4dd..f6f6ef436599 100644 >--- a/arch/x86/include/asm/bootparam_utils.h >+++ b/arch/x86/include/asm/bootparam_utils.h >@@ -36,6 +36,7 @@ static void sanitize_boot_params(struct boot_params >*boot_params) >*/ > if (boot_params->sentinel) { > /* fields in boot_params are left uninitialized, clear them */ >+ boot_params->acpi_rsdp_addr = 0; > memset(&boot_params->ext_ramdisk_image, 0, > (char *)&boot_params->efi_info - > (char *)&boot_params->ext_ramdisk_image); Isn't this already covered by the memset()? If not, we should extend the memset() to maximal coverage. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
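
For reference, the bootparam.h offsets answer the question: no. The memset() shown covers only the range from ext_ramdisk_image (offset 0x0c0) up to efi_info (0x1c0), while acpi_rsdp_addr lives at 0x070, below the cleared range, so it needs either its own clear or a wider memset():

    /* sanitize_boot_params() clears only [0x0c0, 0x1c0): */
    memset(&boot_params->ext_ramdisk_image, 0,      /* starts at 0x0c0 */
           (char *)&boot_params->efi_info -         /* ends before 0x1c0 */
           (char *)&boot_params->ext_ramdisk_image);
    /* acpi_rsdp_addr sits at offset 0x070 -- outside that range */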
Re: [PATCH] x86-64: use 32-bit XOR to zero registers
On June 25, 2018 9:33:35 AM PDT, Randy Dunlap wrote: >On 06/25/2018 03:25 AM, Jan Beulich wrote: >> Some Intel CPUs don't recognize 64-bit XORs as zeroing idioms - use >> 32-bit ones instead. > >Hmph. Is that considered a bug (errata)? > >URL/references? > >Are these changes really only zeroing the lower 32 bits of the >register? >and that's all that the code cares about? > >thanks. > >> Signed-off-by: Jan Beulich >> --- >> arch/x86/crypto/aegis128-aesni-asm.S |2 +- >> arch/x86/crypto/aegis128l-aesni-asm.S|2 +- >> arch/x86/crypto/aegis256-aesni-asm.S |2 +- >> arch/x86/crypto/aesni-intel_asm.S|8 >> arch/x86/crypto/aesni-intel_avx-x86_64.S |4 ++-- >> arch/x86/crypto/morus1280-avx2-asm.S |2 +- >> arch/x86/crypto/morus1280-sse2-asm.S |2 +- >> arch/x86/crypto/morus640-sse2-asm.S |2 +- >> arch/x86/crypto/sha1_ssse3_asm.S |2 +- >> arch/x86/kernel/head_64.S|2 +- >> arch/x86/kernel/paravirt_patch_64.c |2 +- >> arch/x86/lib/memcpy_64.S |2 +- >> arch/x86/power/hibernate_asm_64.S|2 +- >> 13 files changed, 17 insertions(+), 17 deletions(-) >> >> --- 4.18-rc2/arch/x86/crypto/aegis128-aesni-asm.S >> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aegis128-aesni-asm.S >> @@ -75,7 +75,7 @@ >> * %r9 >> */ >> __load_partial: >> -xor %r9, %r9 >> +xor %r9d, %r9d >> pxor MSG, MSG >> >> mov LEN, %r8 >> --- 4.18-rc2/arch/x86/crypto/aegis128l-aesni-asm.S >> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aegis128l-aesni-asm.S >> @@ -66,7 +66,7 @@ >> * %r9 >> */ >> __load_partial: >> -xor %r9, %r9 >> +xor %r9d, %r9d >> pxor MSG0, MSG0 >> pxor MSG1, MSG1 >> >> --- 4.18-rc2/arch/x86/crypto/aegis256-aesni-asm.S >> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aegis256-aesni-asm.S >> @@ -59,7 +59,7 @@ >> * %r9 >> */ >> __load_partial: >> -xor %r9, %r9 >> +xor %r9d, %r9d >> pxor MSG, MSG >> >> mov LEN, %r8 >> --- 4.18-rc2/arch/x86/crypto/aesni-intel_asm.S >> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aesni-intel_asm.S >> @@ -258,7 +258,7 @@ ALL_F: .octa 0x >> .macro GCM_INIT Iv SUBKEY AAD AADLEN >> mov \AADLEN, %r11 >> mov %r11, AadLen(%arg2) # ctx_data.aad_length = aad_length >> -xor %r11, %r11 >> +xor %r11d, %r11d >> mov %r11, InLen(%arg2) # ctx_data.in_length = 0 >> mov %r11, PBlockLen(%arg2) # ctx_data.partial_block_length = 0 >> mov %r11, PBlockEncKey(%arg2) # ctx_data.partial_block_enc_key = 0 >> @@ -286,7 +286,7 @@ ALL_F: .octa 0x >> movdqu HashKey(%arg2), %xmm13 >> add %arg5, InLen(%arg2) >> >> -xor %r11, %r11 # initialise the data pointer offset as zero >> +xor %r11d, %r11d # initialise the data pointer offset as zero >> PARTIAL_BLOCK %arg3 %arg4 %arg5 %r11 %xmm8 \operation >> >> sub %r11, %arg5 # sub partial block data used >> @@ -702,7 +702,7 @@ _no_extra_mask_1_\@: >> >> # GHASH computation for the last <16 Byte block >> GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6 >> -xor %rax,%rax >> +xor %eax, %eax >> >> mov %rax, PBlockLen(%arg2) >> jmp _dec_done_\@ >> @@ -737,7 +737,7 @@ _no_extra_mask_2_\@: >> >> # GHASH computation for the last <16 Byte block >> GHASH_MUL \AAD_HASH, %xmm13, %xmm0, %xmm10, %xmm11, %xmm5, %xmm6 >> -xor %rax,%rax >> +xor %eax, %eax >> >> mov %rax, PBlockLen(%arg2) >> jmp _encode_done_\@ >> --- 4.18-rc2/arch/x86/crypto/aesni-intel_avx-x86_64.S >> +++ >4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/aesni-intel_avx-x86_64.S >> @@ -463,7 +463,7 @@ _get_AAD_rest_final\@: >> >> _get_AAD_done\@: >> # initialize the data pointer offset as zero >> -xor %r11, %r11 >> +xor %r11d, %r11d >> >> # start AES for num_initial_blocks blocks >> mov arg5, %rax # rax = *Y0 >> @@ -1770,7 +1770,7 
@@ _get_AAD_rest_final\@: >> >> _get_AAD_done\@: >> # initialize the data pointer offset as zero >> -xor %r11, %r11 >> +xor %r11d, %r11d >> >> # start AES for num_initial_blocks blocks >> mov arg5, %rax # rax = *Y0 >> --- 4.18-rc2/arch/x86/crypto/morus1280-avx2-asm.S >> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/morus1280-avx2-asm.S >> @@ -113,7 +113,7 @@ ENDPROC(__morus1280_update_zero) >> * %r9 >> */ >> __load_partial: >> -xor %r9, %r9 >> +xor %r9d, %r9d >> vpxor MSG, MSG, MSG >> >> mov %rcx, %r8 >> --- 4.18-rc2/arch/x86/crypto/morus1280-sse2-asm.S >> +++ 4.18-rc2-x86_64-32bit-XOR/arch/x86/crypto/morus1280-sse2-asm.S >> @@ -235,7 +235,7 @@ ENDPROC(__morus1280_update_zero) >> * %r9 >> */ >> __load_partial: >> -xor %r9, %r9 >> +xor %r9d, %r9d >> pxor MSG_LO, MSG_LO >> pxor MSG_HI, MSG_HI >>
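
On the question of whether these changes really zero only the lower 32 bits: architecturally no. In 64-bit mode, any write to a 32-bit register zero-extends into the full 64-bit register, so both forms below fully clear %rax; the 32-bit form is one byte shorter (no REX.W prefix) and is the one recognized as a zeroing idiom on the affected Intel CPUs:

    xor %eax, %eax    # 2 bytes; zeroes all 64 bits of %rax via zero-extension
    xor %rax, %rax    # 3 bytes; same architectural result, but per the patch
                      # not treated as a zeroing idiom on some Intel CPUs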
Re: [PATCH v3 1/7] x86/ldt: refresh %fs and %gs in refresh_ldt_segments()
On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski wrote: >On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski >wrote: >> >> >> >>> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin > wrote: >>> On 06/22/18 07:24, Andy Lutomirski wrote: That RPL3 part is false. The following program does: #include int main() { unsigned short sel; asm volatile ("mov %%ss, %0" : "=rm" (sel)); sel &= ~3; printf("Will write 0x%hx to GS\n", sel); asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3)); asm volatile ("mov %%gs, %0" : "=rm" (sel)); printf("GS = 0x%hx\n", sel); return 0; } prints: Will write 0x28 to GS GS = 0x28 The x86 architecture is *insane*. Other than that, this patch seems generally sensible. But my objection that it's incorrect with FSGSBASE enabled for %fs and %gs still applies. >>> >>> Ugh, you're right... I misremembered. The CPL simply overrides the >RPL >>> rather than trapping. >>> >>> We still need to give legacy applications which have zero idea about >the >>> separate bases that apply only to 64-bit mode a way to DTRT. >Requiring >>> these old crufty applications to do something new is not an option. >> >>> >>> As ugly as it is, I'm thinking the Right Thing is to simply make it >a >>> part of the Linux ABI that if the FS or GS selector registers point >into >>> the LDT then we will requalify them; if a 64-bit app does that then >they >>> get that behavior. This isn't something that will happen >>> asynchronously, and if a 64-bit process loads an LDT value into FS >or >>> GS, they are considered to have opted in to that behavior. >> >> But the old and crusty apps don’t depend on requalification because >we never used to do it. >> >> I’m not convinced we ever need to refresh the base. In fact, we could >start preserving the base of LDT-referencing FS/GS across context >switches even without FSGSBASE at some minor performance cost, but I >don’t really see the point. I still think my proposed semantics are >easy to implement and preserve the ABI even if they have the sad >property that the FSGSBASE behavior and the non-FSGSBASE behavior end >up different. >> > >There's another reasonable solution: do exactly what your patch does, >minus the bugs. We would need to get the RPL != 3 case right (easy) >and the case where there's a non-running thread using the selector in >question. The latter is probably best handled by adding a flag to >thread_struct that says "fsbase needs reloading from the descriptor >table" and only applies if the selector is in the LDT or TLS area. Or >we could hijack a high bit in the selector. Then we'd need to update >everything that uses the fields. Obviously fix the bugs. How would you control this bit? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v3 1/7] x86/ldt: refresh %fs and %gs in refresh_ldt_segments()
On June 27, 2018 11:22:14 AM PDT, h...@zytor.com wrote: >On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski >wrote: >>On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski > >>wrote: >>> >>> >>> On Jun 22, 2018, at 11:29 AM, H. Peter Anvin >> wrote: > On 06/22/18 07:24, Andy Lutomirski wrote: > > That RPL3 part is false. The following program does: > > #include > > int main() > { >unsigned short sel; >asm volatile ("mov %%ss, %0" : "=rm" (sel)); >sel &= ~3; >printf("Will write 0x%hx to GS\n", sel); >asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3)); >asm volatile ("mov %%gs, %0" : "=rm" (sel)); >printf("GS = 0x%hx\n", sel); >return 0; > } > > prints: > > Will write 0x28 to GS > GS = 0x28 > > The x86 architecture is *insane*. > > Other than that, this patch seems generally sensible. But my > objection that it's incorrect with FSGSBASE enabled for %fs and >%gs > still applies. > Ugh, you're right... I misremembered. The CPL simply overrides the >>RPL rather than trapping. We still need to give legacy applications which have zero idea >about >>the separate bases that apply only to 64-bit mode a way to DTRT. >>Requiring these old crufty applications to do something new is not an option. >>> As ugly as it is, I'm thinking the Right Thing is to simply make it >>a part of the Linux ABI that if the FS or GS selector registers point >>into the LDT then we will requalify them; if a 64-bit app does that then >>they get that behavior. This isn't something that will happen asynchronously, and if a 64-bit process loads an LDT value into FS >>or GS, they are considered to have opted in to that behavior. >>> >>> But the old and crusty apps don’t depend on requalification because >>we never used to do it. >>> >>> I’m not convinced we ever need to refresh the base. In fact, we >could >>start preserving the base of LDT-referencing FS/GS across context >>switches even without FSGSBASE at some minor performance cost, but I >>don’t really see the point. I still think my proposed semantics are >>easy to implement and preserve the ABI even if they have the sad >>property that the FSGSBASE behavior and the non-FSGSBASE behavior end >>up different. >>> >> >>There's another reasonable solution: do exactly what your patch does, >>minus the bugs. We would need to get the RPL != 3 case right (easy) >>and the case where there's a non-running thread using the selector in >>question. The latter is probably best handled by adding a flag to >>thread_struct that says "fsbase needs reloading from the descriptor >>table" and only applies if the selector is in the LDT or TLS area. Or >>we could hijack a high bit in the selector. Then we'd need to update >>everything that uses the fields. > >Obviously fix the bugs. > >How would you control this bit? I can personally think of these options:

1. A prctl() to disable requalification;

2. Make the new instructions trap until used. This will add to the startup time of legitimate users of these instructions;

3. Either of these, but start out in "off" mode until one of the descriptor table system calls is called.

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v3 1/7] x86/ldt: refresh %fs and %gs in refresh_ldt_segments()
On June 28, 2018 1:33:24 PM PDT, Andy Lutomirski wrote: >On Wed, Jun 27, 2018 at 11:22 AM, wrote: >> On June 27, 2018 11:19:12 AM PDT, Andy Lutomirski >wrote: >>>On Fri, Jun 22, 2018 at 11:47 AM, Andy Lutomirski > >>>wrote: > On Jun 22, 2018, at 11:29 AM, H. Peter Anvin >>> wrote: > >> On 06/22/18 07:24, Andy Lutomirski wrote: >> >> That RPL3 part is false. The following program does: >> >> #include >> >> int main() >> { >>unsigned short sel; >>asm volatile ("mov %%ss, %0" : "=rm" (sel)); >>sel &= ~3; >>printf("Will write 0x%hx to GS\n", sel); >>asm volatile ("mov %0, %%gs" :: "rm" (sel & ~3)); >>asm volatile ("mov %%gs, %0" : "=rm" (sel)); >>printf("GS = 0x%hx\n", sel); >>return 0; >> } >> >> prints: >> >> Will write 0x28 to GS >> GS = 0x28 >> >> The x86 architecture is *insane*. >> >> Other than that, this patch seems generally sensible. But my >> objection that it's incorrect with FSGSBASE enabled for %fs and >%gs >> still applies. >> > > Ugh, you're right... I misremembered. The CPL simply overrides >the >>>RPL > rather than trapping. > > We still need to give legacy applications which have zero idea >about >>>the > separate bases that apply only to 64-bit mode a way to DTRT. >>>Requiring > these old crufty applications to do something new is not an >option. > > As ugly as it is, I'm thinking the Right Thing is to simply make >it >>>a > part of the Linux ABI that if the FS or GS selector registers >point >>>into > the LDT then we will requalify them; if a 64-bit app does that >then >>>they > get that behavior. This isn't something that will happen > asynchronously, and if a 64-bit process loads an LDT value into FS >>>or > GS, they are considered to have opted in to that behavior. But the old and crusty apps don’t depend on requalification because >>>we never used to do it. I’m not convinced we ever need to refresh the base. In fact, we >could >>>start preserving the base of LDT-referencing FS/GS across context >>>switches even without FSGSBASE at some minor performance cost, but I >>>don’t really see the point. I still think my proposed semantics are >>>easy to implement and preserve the ABI even if they have the sad >>>property that the FSGSBASE behavior and the non-FSGSBASE behavior end >>>up different. >>> >>>There's another reasonable solution: do exactly what your patch does, >>>minus the bugs. We would need to get the RPL != 3 case right (easy) >>>and the case where there's a non-running thread using the selector in >>>question. The latter is probably best handled by adding a flag to >>>thread_struct that says "fsbase needs reloading from the descriptor >>>table" and only applies if the selector is in the LDT or TLS area. >Or >>>we could hijack a high bit in the selector. Then we'd need to update >>>everything that uses the fields. >> >> Obviously fix the bugs. >> >> How would you control this bit? > >Sorry, I was wrong in my previous email. Let me try this again: > >Notwithstanding the RPL thing, the reason I don't like your patch as >is, and the reason I didn't write a similar patch myself, is that it >will behave nondeterministically on an FSGSBASE kernel. Suppose there >are two threads, A and B, that share an mm. A has %fs == 0x7 and >FSBASE = 0. The LDT has the base for entry 0 set to 0. > >Now thread B calls modify_ldt to change the base for entry 0 to 1. > >The Obviously Sane (tm) behavior is for task A's FSBASE to >asynchronously change to 1. 
This is the only deterministic behavior >that is even possible on a 32-bit kernel, and it's the only >not-totally-nutty behavior that is possible on a 64-bit non-FSGSBASE >kernel, and it's still perfectly reasonable for FSGSBASE. The problem >is that it's not so easly to implement. > >With your patch, on an FSGSBASE kernel, we get the desired behavior if >thread A is running while thread B calls modify_ldt(). But we get >different behavior if thread A is stopped -- thread A's FSBASE will >remain set to 0. > >With that in mind, my email was otherwise garbage, and the magic "bit" >idea was total crap. > >I can see three vaguely credible ways to implement this. > >1. thread B walks all threads on the system, notices that thread A has >the same mm, and asynchronously fixes it up. The locking is a bit >tricky, and the performance isn't exactly great. Maybe that's okay. > >2. We finally add an efficient way to find all threads that share an >mm and do (1) but faster. > >3. We add enough bookkeeping to thread_struct so that, the next time >thread A runs or has ptrace try to read its FSBASE, we notice that >FSBASE is stale and fix it up. > >(3) will perform the best, but the implementation is probably nasty. >If we want modify_ldt() to only reset the base for the modified >records, w
Re: [clang] stack protector and f1f029c7bf
On May 23, 2018 3:08:19 PM PDT, Nick Desaulniers wrote: >H. Peter, > >It was reported [0] that compiling the Linux kernel with Clang + >CC_STACKPROTECTOR_STRONG was causing a crash in native_save_fl(), due >to >how GCC does not emit a stack guard for static inline functions (see >Alistair's excellent report in [1]) but Clang does. > >When working with the LLVM release maintainers, Tom had suggested [2] >changing the inline assembly constraint in native_save_fl() from '=rm' >to >'=r', and Alistair had verified the disassembly: > >(good) code generated w/o -fstack-protector-strong: > >native_save_fl: > pushfq > popq -8(%rsp) > movq -8(%rsp), %rax > retq > >(good) code generated w/ =r input constraint: > >native_save_fl: > pushfq > popq %rax > retq > >(bad) code generated with -fstack-protector-strong: > >native_save_fl: > subq $24, %rsp > movq %fs:40, %rax > movq %rax, 16(%rsp) > pushfq > popq 8(%rsp) > movq 8(%rsp), %rax > movq %fs:40, %rcx > cmpq 16(%rsp), %rcx > jne .LBB0_2 > addq $24, %rsp > retq >.LBB0_2: > callq __stack_chk_fail > >It looks like the suggestion is actually a revert of your commit: >ab94fcf528d127fcb490175512a8910f37e5b346: >x86: allow "=rm" in native_save_fl() > >It seemed like there was a question internally about why worry about >pop >adjusting the stack if the stack could be avoided altogether. > >I think Sedat can retest this, but I was curious as well about the >commit >message in ab94fcf528d: "[ Impact: performance ]", but Alistair's >analysis >of the disassembly seems to indicate there is no performance impact (in >fact, looks better as there's one less mov). > >Is there a reason we should not revert ab94fcf528d12, or maybe a better >approach? > >[0] https://lkml.org/lkml/2018/5/7/534 >[1] https://bugs.llvm.org/show_bug.cgi?id=37512#c15 >[2] https://bugs.llvm.org/show_bug.cgi?id=37512#c22 A stack canary on an *inlined* function? That's bound to break things elsewhere too sooner or later. It feels like *once again* clang is asking for the Linux kernel to change to paper over technical or compatibility problems in clang/LLVM. This is not exactly helping the feeling that we should just rip out any and all clang hacks and call it a loss. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
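
For reference, the function at issue, essentially as it reads in arch/x86/include/asm/irqflags.h at the time, including the comment recording why "=rm" was considered safe:

    static inline unsigned long native_save_fl(void)
    {
            unsigned long flags;

            /*
             * "=rm" is safe here, because "pop" adjusts the stack before
             * it evaluates its effective address -- this is part of the
             * documented behavior of the "pop" instruction.
             */
            asm volatile("# __raw_save_flags\n\t"
                         "pushf ; pop %0"
                         : "=rm" (flags)
                         : /* no input */
                         : "memory");

            return flags;
    }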
Re: [clang] stack protector and f1f029c7bf
On May 23, 2018 3:08:19 PM PDT, Nick Desaulniers wrote: >H. Peter, > >It was reported [0] that compiling the Linux kernel with Clang + >CC_STACKPROTECTOR_STRONG was causing a crash in native_save_fl(), due >to >how GCC does not emit a stack guard for static inline functions (see >Alistair's excellent report in [1]) but Clang does. > >When working with the LLVM release maintainers, Tom had suggested [2] >changing the inline assembly constraint in native_save_fl() from '=rm' >to >'=r', and Alistair had verified the disassembly: > >(good) code generated w/o -fstack-protector-strong: > >native_save_fl: > pushfq > popq -8(%rsp) > movq -8(%rsp), %rax > retq > >(good) code generated w/ =r input constraint: > >native_save_fl: > pushfq > popq %rax > retq > >(bad) code generated with -fstack-protector-strong: > >native_save_fl: > subq $24, %rsp > movq %fs:40, %rax > movq %rax, 16(%rsp) > pushfq > popq 8(%rsp) > movq 8(%rsp), %rax > movq %fs:40, %rcx > cmpq 16(%rsp), %rcx > jne .LBB0_2 > addq $24, %rsp > retq >.LBB0_2: > callq __stack_chk_fail > >It looks like the suggestion is actually a revert of your commit: >ab94fcf528d127fcb490175512a8910f37e5b346: >x86: allow "=rm" in native_save_fl() > >It seemed like there was a question internally about why worry about >pop >adjusting the stack if the stack could be avoided altogether. > >I think Sedat can retest this, but I was curious as well about the >commit >message in ab94fcf528d: "[ Impact: performance ]", but Alistair's >analysis >of the disassembly seems to indicate there is no performance impact (in >fact, looks better as there's one less mov). > >Is there a reason we should not revert ab94fcf528d12, or maybe a better >approach? > >[0] https://lkml.org/lkml/2018/5/7/534 >[1] https://bugs.llvm.org/show_bug.cgi?id=37512#c15 >[2] https://bugs.llvm.org/show_bug.cgi?id=37512#c22 Ok, this is the *second* thing about LLVM-originated bug reports that drives me batty. When you *do* identify a real problem, you propose a paper over and/or talk about it as an LLVM issue and don't consider the often far bigger picture. Issue 1: Fundamentally, the compiler is doing The Wrong Thing if it generates worse code with a less constrained =rm than with =r. That is a compiler optimization bug, period. The whole point with the less constrained option is to give the compiler the freedom of action. You are claiming it doesn't buy us anything, but you are only looking at the paravirt case which is kind of "special" (in the short bus kind of way), and only because the compiler in question makes an incredibly stupid decision. Issue 2: What you are flagging seems to be a far more fundamental problem, which would affect *any* use of push/pop in inline assembly. If that is true, we need to pull in the gcc people too and create an interface to let the compiler know that inline assembly needs a certain number of stack slots. We do a lot of push/pop in assembly. The other option is to turn the stack canary explicitly off for all such functions. Issue 3: Let's face it, reading and writing the flags should be builtins, exactly because it has to do stack operations, which really means the compiler should be involved. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [clang] stack protector and f1f029c7bf
On May 24, 2018 12:49:56 PM PDT, Tom Stellard wrote: >On 05/24/2018 11:19 AM, h...@zytor.com wrote: >> On May 23, 2018 3:08:19 PM PDT, Nick Desaulniers > wrote: >>> H. Peter, >>> >>> It was reported [0] that compiling the Linux kernel with Clang + >>> CC_STACKPROTECTOR_STRONG was causing a crash in native_save_fl(), >due >>> to >>> how GCC does not emit a stack guard for static inline functions (see >>> Alistair's excellent report in [1]) but Clang does. >>> >>> When working with the LLVM release maintainers, Tom had suggested >[2] >>> changing the inline assembly constraint in native_save_fl() from >'=rm' >>> to >>> '=r', and Alistair had verified the disassembly: >>> >>> (good) code generated w/o -fstack-protector-strong: >>> >>> native_save_fl: >>> pushfq >>> popq -8(%rsp) >>> movq -8(%rsp), %rax >>> retq >>> >>> (good) code generated w/ =r input constraint: >>> >>> native_save_fl: >>> pushfq >>> popq %rax >>> retq >>> >>> (bad) code generated with -fstack-protector-strong: >>> >>> native_save_fl: >>> subq $24, %rsp >>> movq %fs:40, %rax >>> movq %rax, 16(%rsp) >>> pushfq >>> popq 8(%rsp) >>> movq 8(%rsp), %rax >>> movq %fs:40, %rcx >>> cmpq 16(%rsp), %rcx >>> jne .LBB0_2 >>> addq $24, %rsp >>> retq >>> .LBB0_2: >>> callq __stack_chk_fail >>> >>> It looks like the suggestion is actually a revert of your commit: >>> ab94fcf528d127fcb490175512a8910f37e5b346: >>> x86: allow "=rm" in native_save_fl() >>> >>> It seemed like there was a question internally about why worry about >>> pop >>> adjusting the stack if the stack could be avoided altogether. >>> >>> I think Sedat can retest this, but I was curious as well about the >>> commit >>> message in ab94fcf528d: "[ Impact: performance ]", but Alistair's >>> analysis >>> of the disassembly seems to indicate there is no performance impact >(in >>> fact, looks better as there's one less mov). >>> >>> Is there a reason we should not revert ab94fcf528d12, or maybe a >better >>> approach? >>> >>> [0] https://lkml.org/lkml/2018/5/7/534 >>> [1] https://bugs.llvm.org/show_bug.cgi?id=37512#c15 >>> [2] https://bugs.llvm.org/show_bug.cgi?id=37512#c22 >> >> A stack canary on an *inlined* function? That's bound to break things >elsewhere too sooner or later. >> >> It feels like *once again* clang is asking for the Linux kernel to >change to paper over technical or compatibility problems in clang/LLVM. >This is not exactly helping the feeling that we should just rip out any >and all clang hacks and call it a loss. >> > >In this case this fix is working-around a bug in the kernel. The >problem >here is that the caller of native_save_fl() assumes that it won't >clobber any of the general purpose registers even though the function >is defined as a normal c function which tells the compiler that it's >OK to clobber the registers. > >This is something that really needs to be fixed in the kernel. Relying >on heuristics of internal compiler algorithms in order for code to work >correctly is always going to lead to problems like this. > >-Tom Ok, yet another problem (this time with paravirtualization, no surprise there). Good, let's find the actual problems and fix them where they should be fixed. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [clang] stack protector and f1f029c7bf
On May 24, 2018 1:52:16 PM PDT, Nick Desaulniers wrote: >On Thu, May 24, 2018 at 1:26 PM Nick Desaulniers > >wrote: >> On Thu, May 24, 2018 at 11:59 AM wrote: >> > Issue 3: Let's face it, reading and writing the flags should be >builtins, >> exactly because it has to do stack operations, which really means the >> compiler should be involved. > >I'm happy to propose that as a feature request to llvm+gcc. > >Oh, looks like both clang and gcc have: >__builtin_ia32_readeflags_u64() > >https://godbolt.org/g/SwPjhq > >Maybe native_save_fl() and native_restore_fl() should be replaced in >the >kernel with >__builtin_ia32_readeflags_u64() and __builtin_ia32_writeeflags_u64()? It should, at least when the compiler supports them (we keep support in for old compilers back to about gcc 3.4 or 4.1 or so for x86, but they are decidedly special cases). For the PV case, I think we should create it as an explicit assembly function, meaning it should be an extern inline in C code. That way we're fixing the kernel right. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
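A sketch of what a builtin-based replacement could look like, assuming a compiler-version gate; the arch_* names and the exact version cutoff here are illustrative, not from the thread:

    /* Use the builtins where available; fall back to the out-of-line
     * assembly implementation otherwise. Note the _u64 variants exist
     * only on x86-64; 32-bit code would use the _u32 forms. */
    #if defined(__clang__) || GCC_VERSION >= 40900
    static __always_inline unsigned long arch_save_fl(void)
    {
            return __builtin_ia32_readeflags_u64();
    }

    static __always_inline void arch_restore_fl(unsigned long flags)
    {
            __builtin_ia32_writeeflags_u64(flags);
    }
    #else
    extern unsigned long native_save_fl(void);
    extern void native_restore_fl(unsigned long flags);
    #endif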
Re: [clang] stack protector and f1f029c7bf
On May 24, 2018 3:31:05 PM PDT, Nick Desaulniers wrote: >On Thu, May 24, 2018 at 3:05 PM H. Peter Anvin wrote: >> COMPILER AR: "=rm" should NEVER generate worse code than "=r". That >is >> unequivocally a compiler bug. > >Filed: https://bugs.llvm.org/show_bug.cgi?id=37583 > >> >> You are claiming it doesn't buy us anything, but you are only >looking >at >> > the paravirt case which is kind of "special" (in the short bus kind >of >way), >> > >> > That's fair. Is another possible solution to have paravirt maybe >not >use >> > native_save_fl() then, but its own >non-static-inline-without-m-constraint >> > implementation? > >> KERNEL AR: change native_save_fl() to an extern inline with an >assembly >> out-of-line implementation, to satisfy the paravirt requirement that >no >> GPRs other than %rax are clobbered. > >i'm happy to add that, do you have a recommendation if it should go in >an >existing .S file or a new one (and if so where/what shall I call it?). How about irqflags.c since that is what the .h file is called. It should simply be: push %rdi popf ret pushf pop %rax ret ... but with all the regular assembly decorations, of course. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [clang] stack protector and f1f029c7bf
On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers wrote: >On Thu, May 24, 2018 at 3:43 PM wrote: >> On May 24, 2018 3:31:05 PM PDT, Nick Desaulniers > >wrote: >> >On Thu, May 24, 2018 at 3:05 PM H. Peter Anvin >wrote: >> >> COMPILER AR: "=rm" should NEVER generate worse code than "=r". >That >> >is >> >> unequivocally a compiler bug. >> > >> >Filed: https://bugs.llvm.org/show_bug.cgi?id=37583 >> > >> >> >> You are claiming it doesn't buy us anything, but you are only >> >looking >> >at >> >> > the paravirt case which is kind of "special" (in the short bus >kind >> >of >> >way), >> >> > >> >> > That's fair. Is another possible solution to have paravirt >maybe >> >not >> >use >> >> > native_save_fl() then, but its own >> >non-static-inline-without-m-constraint >> >> > implementation? >> > >> >> KERNEL AR: change native_save_fl() to an extern inline with an >> >assembly >> >> out-of-line implementation, to satisfy the paravirt requirement >that >> >no >> >> GPRs other than %rax are clobbered. >> > >> >i'm happy to add that, do you have a recommendation if it should go >in >> >an >> >existing .S file or a new one (and if so where/what shall I call >it?). > >> How about irqflags.c since that is what the .h file is called. > >> It should simply be: > >> push %rdi >> popf >> ret > >> pushf >> pop %rax >> ret > >> ... but with all the regular assembly decorations, of course. > >Something like the following? > > >diff --git a/arch/x86/kernel/irqflags.c b/arch/x86/kernel/irqflags.c >new file mode 100644 >index ..59dc21bd3327 >--- /dev/null >+++ b/arch/x86/kernel/irqflags.c >@@ -0,0 +1,24 @@ >+#include <asm/asm.h> >+ >+extern unsigned long native_save_fl(void); >+extern void native_restore_fl(unsigned long flags); >+ >+asm( >+".pushsection .text;" >+".global native_save_fl;" >+".type native_save_fl, @function;" >+"native_save_fl:" >+"pushf;" >+"pop %" _ASM_AX ";" >+"ret;" >+".popsection"); >+ >+asm( >+".pushsection .text;" >+".global native_restore_fl;" >+".type native_restore_fl, @function;" >+"native_restore_fl:" >+"push %" _ASM_DI ";" >+"popf;" >+"ret;" >+".popsection"); > >And change the declaration in arch/x86/include/asm/irqflags.h to: >+extern inline unsigned long native_save_fl(void); >+extern inline void native_restore_fl(unsigned long flags); > >This seems to work, but >1. arch/x86/boot/compressed/misc.o warns that native_save_fl() is never >defined (arch_local_save_flags() uses it). Does that mean >arch_local_save_flags(), and friends would also have to move to the >newly >created .c file as well? >2. `extern inline` doesn't inline any instances (from what I can tell >from >disassembling vmlinux). I think this is strictly worse. Don't we only >want >pv_irq_ops.save_fl to be non-inlined in a way that no stack protector >can >be added? If that's the case, should my assembly based implementation >have >a different identifier (`native_save_fl_paravirt` or something). That >would >also fix point #1 above. But now the paravirt code has its own copy of >the >function. Sorry, I meant irqflags.S. It still should be available as an inline, however, but now "extern inline". -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [clang] stack protector and f1f029c7bf
On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers wrote: >On Thu, May 24, 2018 at 3:43 PM wrote: >> On May 24, 2018 3:31:05 PM PDT, Nick Desaulniers > >wrote: >> >On Thu, May 24, 2018 at 3:05 PM H. Peter Anvin >wrote: >> >> COMPILER AR: "=rm" should NEVER generate worse code than "=r". >That >> >is >> >> unequivocally a compiler bug. >> > >> >Filed: https://bugs.llvm.org/show_bug.cgi?id=37583 >> > >> >> >> You are claiming it doesn't buy us anything, but you are only >> >looking >> >at >> >> > the paravirt case which is kind of "special" (in the short bus >kind >> >of >> >way), >> >> > >> >> > That's fair. Is another possible solution to have paravirt >maybe >> >not >> >use >> >> > native_save_fl() then, but its own >> >non-static-inline-without-m-constraint >> >> > implementation? >> > >> >> KERNEL AR: change native_save_fl() to an extern inline with an >> >assembly >> >> out-of-line implementation, to satisfy the paravirt requirement >that >> >no >> >> GPRs other than %rax are clobbered. >> > >> >i'm happy to add that, do you have a recommendation if it should go >in >> >an >> >existing .S file or a new one (and if so where/what shall I call >it?). > >> How about irqflags.c since that is what the .h file is called. > >> It should simply be: > >> push %rdi >> popf >> ret > >> pushf >> pop %rax >> ret > >> ... but with all the regular assembly decorations, of course. > >Something like the following? > > >diff --git a/arch/x86/kernel/irqflags.c b/arch/x86/kernel/irqflags.c >new file mode 100644 >index ..59dc21bd3327 >--- /dev/null >+++ b/arch/x86/kernel/irqflags.c >@@ -0,0 +1,24 @@ >+#include <asm/asm.h> >+ >+extern unsigned long native_save_fl(void); >+extern void native_restore_fl(unsigned long flags); >+ >+asm( >+".pushsection .text;" >+".global native_save_fl;" >+".type native_save_fl, @function;" >+"native_save_fl:" >+"pushf;" >+"pop %" _ASM_AX ";" >+"ret;" >+".popsection"); >+ >+asm( >+".pushsection .text;" >+".global native_restore_fl;" >+".type native_restore_fl, @function;" >+"native_restore_fl:" >+"push %" _ASM_DI ";" >+"popf;" >+"ret;" >+".popsection"); > >And change the declaration in arch/x86/include/asm/irqflags.h to: >+extern inline unsigned long native_save_fl(void); >+extern inline void native_restore_fl(unsigned long flags); > >This seems to work, but >1. arch/x86/boot/compressed/misc.o warns that native_save_fl() is never >defined (arch_local_save_flags() uses it). Does that mean >arch_local_save_flags(), and friends would also have to move to the >newly >created .c file as well? >2. `extern inline` doesn't inline any instances (from what I can tell >from >disassembling vmlinux). I think this is strictly worse. Don't we only >want >pv_irq_ops.save_fl to be non-inlined in a way that no stack protector >can >be added? If that's the case, should my assembly based implementation >have >a different identifier (`native_save_fl_paravirt` or something). That >would >also fix point #1 above. But now the paravirt code has its own copy of >the >function. You keep compiling with paravirtualization enabled... that code will not do any inlining. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [clang] stack protector and f1f029c7bf
On May 25, 2018 9:46:42 AM PDT, Nick Desaulniers wrote: >On Fri, May 25, 2018 at 9:33 AM wrote: >> On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers > >wrote: >> >On Thu, May 24, 2018 at 3:43 PM wrote: >> >> On May 24, 2018 3:31:05 PM PDT, Nick Desaulniers >> > >> >wrote: >> >> >On Thu, May 24, 2018 at 3:05 PM H. Peter Anvin >> >wrote: >> >> >> COMPILER AR: "=rm" should NEVER generate worse code than "=r". >> >That >> >> >is >> >> >> unequivocally a compiler bug. >> >> > >> >> >Filed: https://bugs.llvm.org/show_bug.cgi?id=37583 >> >> > >> >> >> >> You are claiming it doesn't buy us anything, but you are >only >> >> >looking >> >> >at >> >> >> > the paravirt case which is kind of "special" (in the short >bus >> >kind >> >> >of >> >> >way), >> >> >> > >> >> >> > That's fair. Is another possible solution to have paravirt >> >maybe >> >> >not >> >> >use >> >> >> > native_save_fl() then, but its own >> >> >non-static-inline-without-m-constraint >> >> >> > implementation? >> >> > >> >> >> KERNEL AR: change native_save_fl() to an extern inline with an >> >> >assembly >> >> >> out-of-line implementation, to satisfy the paravirt requirement >> >that >> >> >no >> >> >> GPRs other than %rax are clobbered. >> >> > >> >> >i'm happy to add that, do you have a recommendation if it should >go >> >in >> >> >an >> >> >existing .S file or a new one (and if so where/what shall I call >> >it?). >> > >> >> How about irqflags.c since that is what the .h file is called. >> > >> >> It should simply be: >> > >> >> push %rdi >> >> popf >> >> ret >> > >> >> pushf >> >> pop %rax >> >> ret >> > >> >> ... but with all the regular assembly decorations, of course. >> > >> >Something like the following? >> > >> > >> >diff --git a/arch/x86/kernel/irqflags.c b/arch/x86/kernel/irqflags.c >> >new file mode 100644 >> >index ..59dc21bd3327 >> >--- /dev/null >> >+++ b/arch/x86/kernel/irqflags.c >> >@@ -0,0 +1,24 @@ >> >+#include <asm/asm.h> >> >+ >> >+extern unsigned long native_save_fl(void); >> >+extern void native_restore_fl(unsigned long flags); >> >+ >> >+asm( >> >+".pushsection .text;" >> >+".global native_save_fl;" >> >+".type native_save_fl, @function;" >> >+"native_save_fl:" >> >+"pushf;" >> >+"pop %" _ASM_AX ";" >> >+"ret;" >> >+".popsection"); >> >+ >> >+asm( >> >+".pushsection .text;" >> >+".global native_restore_fl;" >> >+".type native_restore_fl, @function;" >> >+"native_restore_fl:" >> >+"push %" _ASM_DI ";" >> >+"popf;" >> >+"ret;" >> >+".popsection"); >> > >> >And change the declaration in arch/x86/include/asm/irqflags.h to: >> >+extern inline unsigned long native_save_fl(void); >> >+extern inline void native_restore_fl(unsigned long flags); >> > >> >This seems to work, but >> >1. arch/x86/boot/compressed/misc.o warns that native_save_fl() is >never >> >defined (arch_local_save_flags() uses it). Does that mean >> >arch_local_save_flags(), and friends would also have to move to the >> >newly >> >created .c file as well? >> >2. `extern inline` doesn't inline any instances (from what I can >tell >> >from >> >disassembling vmlinux). I think this is strictly worse. Don't we >only >> >want >> >pv_irq_ops.save_fl to be non-inlined in a way that no stack >protector >> >can >> >be added? If that's the case, should my assembly based >implementation >> >have >> >a different identifier (`native_save_fl_paravirt` or something). >That >> >would >> >also fix point #1 above. But now the paravirt code has its own copy >of >> >the >> >function. > >> Sorry, I meant irqflags.S. 
> >> It still should be available as an inline, however, but now "extern >inline". > >Heh, ok I was confused. But in testing, I had also created: > >arch/x86/lib/irqflags.S >/* SPDX-License-Identifier: GPL-2.0 */ > >#include <linux/linkage.h> >#include <asm/asm.h> >#include <asm/export.h> > >/* > * unsigned long native_save_fl(void) > */ >ENTRY(native_save_fl) >pushf >pop %_ASM_AX >ret >ENDPROC(native_save_fl) >EXPORT_SYMBOL(native_save_fl) > >/* > * void native_restore_fl(unsigned long flags) > * %rdi: flags > */ >ENTRY(native_restore_fl) >push %_ASM_DI >popf >ret >ENDPROC(native_restore_fl) >EXPORT_SYMBOL(native_restore_fl) > >The issue is that this still has issues 1 & 2 listed earlier (and the >disassembly has a lot more trailing nops added). > >When you say > >> It still should be available as an inline, however, but now "extern >inline". > >Am I understanding correctly that native_save_fl should be inlined into >all >call sites (modulo the problematic pv_irq_ops.save_fl case)? Because >for >these two assembly implementations, it's not, but maybe there's >something >missing in my implementation? Yes, that's what "extern inline" means. Maybe it needs a must-inline annotation, but that's really messed up. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [clang] stack protector and f1f029c7bf
On May 25, 2018 10:31:51 AM PDT, Nick Desaulniers wrote: >On Fri, May 25, 2018 at 9:53 AM wrote: >> On May 25, 2018 9:46:42 AM PDT, Nick Desaulniers > >wrote: >> >On Fri, May 25, 2018 at 9:33 AM wrote: >> >> On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers >> > wrote: >> >When you say >> > >> >> It still should be available as an inline, however, but now >"extern >> >inline". >> > >> >Am I understanding correctly that native_save_fl should be inlined >into >> >all >> >call sites (modulo the problematic pv_irq_ops.save_fl case)? >Because >> >for >> >these two assembly implementations, it's not, but maybe there's >> >something >> >missing in my implementation? > >> Yes, that's what "extern inline" means. Maybe it needs a must-inline >annotation, but that's really messed up. > >I don't think it's possible to inline a function from an external >translation unit without something like LTO. > >If I move the implementation of native_save_fl() to a separate .c (with >out >of line assembly) or .S, neither clang nor gcc will inline that >assembly to >any call sites, whether the declaration of native_save_fl() looks like: > >extern inline unsigned long native_save_fl(void); > >or > >__attribute__((always_inline)) > >extern inline unsigned long native_save_fl(void); > >I think an external copy is the best approach for the paravirt code: > >diff --git a/arch/x86/kernel/irqflags.c b/arch/x86/kernel/irqflags.c >new file mode 100644 >index ..e173ba8bee7b >--- /dev/null >+++ b/arch/x86/kernel/irqflags.c >@@ -0,0 +1,24 @@ >+#include <asm/asm.h> >+ >+extern unsigned long native_save_fl_no_stack_protector(void); >+extern void native_restore_fl_no_stack_protector(unsigned long flags); >+ >+asm( >+".pushsection .text;" >+".global native_save_fl_no_stack_protector;" >+".type native_save_fl_no_stack_protector, @function;" >+"native_save_fl_no_stack_protector:" >+"pushf;" >+"pop %" _ASM_AX ";" >+"ret;" >+".popsection"); >+ >+asm( >+".pushsection .text;" >+".global native_restore_fl_no_stack_protector;" >+".type native_restore_fl_no_stack_protector, @function;" >+"native_restore_fl_no_stack_protector:" >+"push %" _ASM_DI ";" >+"popf;" >+"ret;" >+".popsection"); >diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile >index 02d6f5cf4e70..8824d01c0c35 100644 >--- a/arch/x86/kernel/Makefile >+++ b/arch/x86/kernel/Makefile >@@ -61,6 +61,7 @@ obj-y += alternative.o i8253.o >hw_breakpoint.o > obj-y += tsc.o tsc_msr.o io_delay.o rtc.o > obj-y += pci-iommu_table.o > obj-y += resource.o >+obj-y += irqflags.o > > obj-y += process.o > obj-y += fpu/ >--- a/arch/x86/kernel/paravirt.c >+++ b/arch/x86/kernel/paravirt.c >@@ -322,9 +322,12 @@ struct pv_time_ops pv_time_ops = { > .steal_clock = native_steal_clock, > }; > >+extern unsigned long native_save_fl_no_stack_protector(void); >+extern void native_restore_fl_no_stack_protector(unsigned long flags); >+ > __visible struct pv_irq_ops pv_irq_ops = { >- .save_fl = __PV_IS_CALLEE_SAVE(native_save_fl), >- .restore_fl = __PV_IS_CALLEE_SAVE(native_restore_fl), >+ .save_fl = >__PV_IS_CALLEE_SAVE(native_save_fl_no_stack_protector), >+ .restore_fl = >__PV_IS_CALLEE_SAVE(native_restore_fl_no_stack_protector), > .irq_disable = __PV_IS_CALLEE_SAVE(native_irq_disable), > .irq_enable = __PV_IS_CALLEE_SAVE(native_irq_enable), > .safe_halt = native_safe_halt, > >Thoughts? You need the extern inline in the .h file and the out-of-line .S file both. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
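To spell out the pairing being asked for (a sketch of the gnu89 semantics the kernel builds under with -std=gnu89): an `extern inline` definition in the header is inlined wherever possible but never emits an out-of-line copy, so the .S file must provide the one real symbol, and that symbol is what the paravirt PV_CALLEE_SAVE machinery takes the address of:

    /* in the header: inlined at call sites, never emitted here */
    extern inline unsigned long native_save_fl(void)
    {
            unsigned long flags;

            asm volatile("pushf ; pop %0" : "=rm" (flags) : : "memory");
            return flags;
    }

    /* irqflags.S then defines the global native_save_fl symbol with
     * identical behavior, for anyone who needs its address. */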
Re: [clang] stack protector and f1f029c7bf
On May 25, 2018 10:49:28 AM PDT, Nick Desaulniers wrote: >On Fri, May 25, 2018 at 10:35 AM Tom Stellard >wrote: >> On 05/25/2018 10:31 AM, Nick Desaulniers wrote: >> > On Fri, May 25, 2018 at 9:53 AM wrote: >> >> On May 25, 2018 9:46:42 AM PDT, Nick Desaulniers < >ndesaulni...@google.com> >> > wrote: >> >>> On Fri, May 25, 2018 at 9:33 AM wrote: >> On May 25, 2018 9:27:40 AM PDT, Nick Desaulniers >> >>> wrote: >> >>> When you say >> >>> >> It still should be available as as inline, however, but now >"extern >> >>> inline". >> >>> >> >>> Am I understanding correctly that native_save_fl should be >inlined >into >> >>> all >> >>> call sites (modulo the problematic pv_irq_ops.save_fl case)? >Because >> >>> for >> >>> these two assembly implementations, it's not, but maybe there's >> >>> something >> >>> missing in my implementation? >> > >> >> Yes, that's what "extern inline" means. Maybe it needs a must >inline >> > annotation, but that's really messed up. >> > > >> What about doing something like suggested here: >> https://bugs.llvm.org/show_bug.cgi?id=37512#c17 > >> This would keep the definition in C and make it easier for compilers >> to inline. > >The GCC docs for __attribute__((naked)) seem to imply this is a machine >specific constraint (of which x86 is not listed): >https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html > >But let's try with source: >https://godbolt.org/g/aJ4gZB > >Clang errors: >:3:3: error: non-ASM statement in naked function is not >supported > unsigned long flags; > ^ > >Is it valid to use assembly to place the results in %rax and mark the c >function somehow? > >gcc doesn't support this attribute until 4.9 (but we can add a feature >test >for attributes with gcc (unlike builtins)), but warns that: > >warning: ‘naked’ attribute directive ignored [-Wattributes] > >gcc 8.1 and trunk inserts a `ud2` instruction (what?!) (let me see if I >can >repro outside of godbolt, and will file a bug report). No, we found that the paravirt code can do the wrong thing for a C implementation. Nick, could you forward the list of problems so we all have it? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
RE: [PATCH] x86: pad assembly functions with INT3
On May 14, 2018 2:04:38 AM PDT, David Laight wrote: >From: H. Peter Anvin >> Sent: 11 May 2018 19:54 >> >> On 05/10/18 09:39, David Laight wrote: >> > From: Alexey Dobriyan >> >> Sent: 07 May 2018 22:38 >> >> >> >> Use INT3 instead of NOP. All that padding between functions is >> >> an illegal area, no legitimate code should jump into it. >> >> >> >> I've checked x86_64 allyesconfig disassembly, all changes looks >sane: >> >> INT3 is only used after RET or unconditional JMP. >> > >> > I thought there was a performance penalty (on at least some cpu) >> > depending on the number of and the actual instructions used for >padding. >> > >> > I believe that is why gcc generates a small number of very long >'nop' >> > instructions when padding code. >> > >> >> There is a performance penalty for using NOP instructions *in the >> fallthrough case.* In the case where the padding is never supposed >to >> be executed, which is what we're talking about here, it is >irrelevant. > >Not completely irrelevant, the instruction stream gets fetched and >decoded >beyond the jmp/ret at the end of the function. >At some point the cpu will decode the jmp/ret and fetch/decode from the >target address. >I guess it won't try to speculatively execute the 'pad' instructions >- but you can never really tell! > > David The CPU doesn't speculate down past an unconditional control transfer. Doing so would be idiotic. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
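In assembler terms, the padding in question can be produced by giving the alignment directive an explicit fill byte, 0xcc being the INT3 opcode; a minimal sketch (the actual patch achieves the effect differently, by changing what inter-function padding is filled with globally):

    	.text
    	.globl	func_a
    func_a:
    	ret
    	.p2align 4, 0xcc	/* fill to the 16-byte boundary with int3 */
    	.globl	func_b
    func_b:
    	ret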
Re: [PATCH V2 02/15] x86/fsgsbase/64: Make ptrace read FS/GS base accurately
On May 31, 2018 1:14:59 PM PDT, Andy Lutomirski wrote: >On Thu, May 31, 2018 at 10:58 AM Chang S. Bae > wrote: >> >> From: Andy Lutomirski >> >> ptrace can read FS/GS base using the register access API >> (PTRACE_PEEKUSER, etc) or PTRACE_ARCH_PRCTL. Make both of these >> mechanisms return the actual FS/GS base. >> >> This will improve debuggability by providing the correct information >> to ptracer (GDB and etc). > >LGTM. LGTM? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH V2 06/15] taint: Add taint for insecure
On May 31, 2018 1:25:39 PM PDT, Andy Lutomirski wrote: >On Thu, May 31, 2018 at 10:58 AM Chang S. Bae > wrote: >> >> When adding new feature support, patches need to be >> incrementally applied and tested with temporal parameters. >> For such testing (or root-only) purposes, the new flag >> will serve to tag the kernel taint state properly. > >I'm okay with this, I guess, but I'm not at all convinced we need it. This was my idea. It isn't the only thing that may want it, and I think it is critical that we give the system a way to flag that the system contains experimental code that is known to break security. Sometimes that kind of experimental code is useful (I have written some myself, e.g. to test SMAP), but it is a good idea to be able to flag such a kernel permanently, even if it's a module that can be removed. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v12 05/13] x86/sgx: architectural structures
On July 5, 2018 1:09:12 PM PDT, Jarkko Sakkinen wrote: >On Thu, Jul 05, 2018 at 08:31:42AM -0700, Dave Hansen wrote: >> On 07/03/2018 11:19 AM, Jarkko Sakkinen wrote: >> > +struct sgx_secs { >> > + uint64_t size; >> > + uint64_t base; >> > + uint32_t ssaframesize; >> > + uint32_t miscselect; >> > + uint8_t reserved1[SGX_SECS_RESERVED1_SIZE]; >> > + uint64_t attributes; >> > + uint64_t xfrm; >> > + uint32_t mrenclave[8]; >> > + uint8_t reserved2[SGX_SECS_RESERVED2_SIZE]; >> > + uint32_t mrsigner[8]; >> > + uint8_t reserved3[SGX_SECS_RESERVED3_SIZE]; >> > + uint16_t isvprodid; >> > + uint16_t isvsvn; >> > + uint8_t reserved4[SGX_SECS_RESERVED4_SIZE]; >> > +} __packed __aligned(4096); >> >> Why are the uint* versions in use here? Those are for userspace ABI, >> but this is entirely for in-kernel-use, right? >> >> We've used u8/16/32/64 in arch code in a bunch of places. They're at >> least a bit more compact and easier to read. It's this: >> >> u8 foo; >> u64 bar; >> >> vs. this: >> >> uint8_t foo; >> uint64_t bar; > >The reason was that with in-kernel LE these were in fact used by >user space code. Now they can be changed to those that you >suggested. > >/Jarkko For things exported to user space use __u* and __s* types... the _t types would actually violate the C standard with respect to namespace pollution. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
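A sketch of the convention being pointed at, for a structure that crosses the user-space boundary (the struct and field names here are illustrative):

    #include <linux/types.h>

    struct example_uapi_arg {
    	__u64	addr;
    	__u64	len;
    	__u32	flags;
    	__u8	reserved[4];
    } __attribute__((packed));

The __u*/__s* names live in the implementation's reserved namespace, so a user program including the header cannot collide with them; uint64_t and friends belong to <stdint.h>, which a kernel UAPI header has no business dragging in.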
Re: [PATCH 1/3] x86: verify_cpu: use 32-bit arithmetic
On May 17, 2018 2:30:01 PM PDT, Alexey Dobriyan wrote: >32-bit instructions are 1 byte shorter than 16-bit instructions. > >Signed-off-by: Alexey Dobriyan >--- > > arch/x86/kernel/verify_cpu.S | 8 ++++---- > 1 file changed, 4 insertions(+), 4 deletions(-) > >--- a/arch/x86/kernel/verify_cpu.S >+++ b/arch/x86/kernel/verify_cpu.S >@@ -56,14 +56,14 @@ ENTRY(verify_cpu) > cmpl $0x1,%eax > jb .Lverify_cpu_no_longmode # no cpuid 1 > >- xor %di,%di >+ xor %edi, %edi > cmpl $0x68747541,%ebx # AuthenticAMD > jnz .Lverify_cpu_noamd > cmpl $0x69746e65,%edx > jnz .Lverify_cpu_noamd > cmpl $0x444d4163,%ecx > jnz .Lverify_cpu_noamd >- mov $1,%di # cpu is from AMD >+ mov $1, %edi # cpu is from AMD > jmp .Lverify_cpu_check > > .Lverify_cpu_noamd: >@@ -122,13 +122,13 @@ ENTRY(verify_cpu) > andl $SSE_MASK,%edx > cmpl $SSE_MASK,%edx > je .Lverify_cpu_sse_ok >- test %di,%di >+ test %edi, %edi > jz .Lverify_cpu_no_longmode # only try to force SSE on AMD > movl $MSR_K7_HWCR,%ecx > rdmsr > btr $15,%eax # enable SSE > wrmsr >- xor %di,%di # don't loop >+ xor %edi, %edi # don't loop > jmp .Lverify_cpu_sse_test # try again > > .Lverify_cpu_no_longmode: Well, in 32/64-bit mode... this code is also assembled in 16-bit mode, and the 16-bit code is more space sensitive. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
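The flip in instruction size comes from the 0x66 operand-size prefix; a sketch of the encodings in the two modes (byte counts per the SDM):

    	.code32
    	xor	%edi, %edi	/* 31 FF    -- 2 bytes */
    	xor	%di, %di	/* 66 31 FF -- 3 bytes */

    	.code16
    	xor	%di, %di	/* 31 FF    -- 2 bytes */
    	xor	%edi, %edi	/* 66 31 FF -- 3 bytes */

So the patch saves a byte per instruction where the file is assembled as 32/64-bit code, but costs a byte anywhere the same source is assembled as 16-bit code.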
Re: [PATCH 2/6] x86: bug: prevent gcc distortions
On May 18, 2018 11:25:32 AM PDT, Linus Torvalds wrote: >On Fri, May 18, 2018 at 10:24 AM Nadav Amit wrote: > >> Will it be ok just to use a global inline asm to set an “.include” >directive >> that gas would later process? (I can probably wrap it in a C macro so >it >> won’t be too disgusting) > >Maybe. I'd almost prefer it to be the automatic kind of thing that we >already do for C files using "-include". > > Linus Unfortunately gcc doesn't guarantee that global assembly inlines will appear at the top of the file. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v11 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
On May 18, 2018 11:00:05 AM PDT, Bart Van Assche wrote: >The next patch in this series introduces a call to cmpxchg64() >in the block layer core for those architectures on which this >functionality is available. Make it possible to test whether >cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64. > >Signed-off-by: Bart Van Assche >Cc: Catalin Marinas >Cc: Will Deacon >Cc: Tony Luck >Cc: Fenghua Yu >Cc: Geert Uytterhoeven >Cc: "James E.J. Bottomley" >Cc: Helge Deller >Cc: Benjamin Herrenschmidt >Cc: Paul Mackerras >Cc: Michael Ellerman >Cc: Martin Schwidefsky >Cc: Heiko Carstens >Cc: David S. Miller >Cc: Thomas Gleixner >Cc: Ingo Molnar >Cc: H. Peter Anvin >Cc: Chris Zankel >Cc: Max Filippov >Cc: Arnd Bergmann >Cc: Jonathan Corbet >--- >.../features/locking/cmpxchg64/arch-support.txt| 33 >++ > arch/Kconfig | 4 +++ > arch/arm/Kconfig | 1 + > arch/ia64/Kconfig | 1 + > arch/m68k/Kconfig | 1 + > arch/mips/Kconfig | 1 + > arch/parisc/Kconfig| 1 + > arch/riscv/Kconfig | 1 + > arch/sparc/Kconfig | 1 + > arch/x86/Kconfig | 1 + > arch/xtensa/Kconfig| 1 + > 11 files changed, 46 insertions(+) >create mode 100644 >Documentation/features/locking/cmpxchg64/arch-support.txt > >diff --git a/Documentation/features/locking/cmpxchg64/arch-support.txt >b/Documentation/features/locking/cmpxchg64/arch-support.txt >new file mode 100644 >index ..84bfef7242b2 >--- /dev/null >+++ b/Documentation/features/locking/cmpxchg64/arch-support.txt >@@ -0,0 +1,33 @@ >+# >+# Feature name: cmpxchg64 >+# Kconfig: ARCH_HAVE_CMPXCHG64 >+# description: arch supports the cmpxchg64() API >+# >+--- >+| arch |status| >+--- >+| alpha: | ok | >+| arc: | .. | >+| arm: | ok | >+| arm64: | ok | >+| c6x: | .. | >+| h8300: | .. | >+| hexagon: | .. | >+|ia64: | ok | >+|m68k: | ok | >+| microblaze: | .. | >+|mips: | ok | >+| nds32: | .. | >+| nios2: | .. | >+|openrisc: | .. | >+| parisc: | ok | >+| powerpc: | ok | >+| riscv: | ok | >+|s390: | ok | >+| sh: | .. | >+| sparc: | ok | >+| um: | .. | >+| unicore32: | .. | >+| x86: | ok | >+| xtensa: | ok | >+--- >diff --git a/arch/Kconfig b/arch/Kconfig >index 8e0d665c8d53..9840b2577af1 100644 >--- a/arch/Kconfig >+++ b/arch/Kconfig >@@ -358,6 +358,10 @@ config HAVE_ALIGNED_STRUCT_PAGE > on a struct page for better performance. However selecting this > might increase the size of a struct page by a word. 
> >+config ARCH_HAVE_CMPXCHG64 >+ bool >+ default y if 64BIT >+ > config HAVE_CMPXCHG_LOCAL > bool > >diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig >index a7f8e7f4b88f..02c75697176e 100644 >--- a/arch/arm/Kconfig >+++ b/arch/arm/Kconfig >@@ -13,6 +13,7 @@ config ARM > select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL > select ARCH_HAS_STRICT_MODULE_RWX if MMU > select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST >+ select ARCH_HAVE_CMPXCHG64 if !THUMB2_KERNEL > select ARCH_HAVE_CUSTOM_GPIO_H > select ARCH_HAS_GCOV_PROFILE_ALL > select ARCH_MIGHT_HAVE_PC_PARPORT >diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig >index bbe12a038d21..31c49e1482e2 100644 >--- a/arch/ia64/Kconfig >+++ b/arch/ia64/Kconfig >@@ -41,6 +41,7 @@ config IA64 > select GENERIC_PENDING_IRQ if SMP > select GENERIC_IRQ_SHOW > select GENERIC_IRQ_LEGACY >+ select ARCH_HAVE_CMPXCHG64 > select ARCH_HAVE_NMI_SAFE_CMPXCHG > select GENERIC_IOMAP > select GENERIC_SMP_IDLE_THREAD >diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig >index 785612b576f7..7b87cda3bbed 100644 >--- a/arch/m68k/Kconfig >+++ b/arch/m68k/Kconfig >@@ -11,6 +11,7 @@ config M68K > select GENERIC_ATOMIC64 > select HAVE_UID16 > select VIRT_TO_BUS >+ select ARCH_HAVE_CMPXCHG64 > select ARCH_HAVE_NMI_SAFE_CMPXCHG if RMW_INSNS > select GENERIC_CPU_DEVICES > select GENERIC_IOMAP >diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig >index 225c95da23ce..088bca0fd9f2 100644 >--- a/arch/mips/Kconfig >+++ b/arch/mips/Kconfig >@@ -7,6 +7,7 @@ config MIPS > select ARCH_DISCARD_MEMBLOCK > select ARCH_HAS_ELF_RANDOMIZE > select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST >+ select
Re: [PATCH 2/6] x86: bug: prevent gcc distortions
On May 18, 2018 11:50:12 AM PDT, Linus Torvalds wrote: >On Fri, May 18, 2018 at 11:34 AM wrote: > >> On May 18, 2018 11:25:32 AM PDT, Linus Torvalds < >torva...@linux-foundation.org> wrote: > >> Unfortunately gcc doesn't guarantee that global assembly inlines will >appear at the top of the file. > >Yeah. It really would be better to do the "asm version of -inline". > >We already do something like that for real *.S files on some >architectures >(because their assembly really wants it, eg > > arch/arm/Makefile: >KBUILD_AFLAGS +=$(CFLAGS_ABI) $(AFLAGS_ISA) $(arch-y) $(tune-y) >-include >asm/unified.h -msoft-float > >but I do want to point out that KBUILD_AFLAGS is *not* used for >compiler-generated assembly, only for actual *.S files. > >Sadly, I don't actually know any way to make gcc call the 'as' phase >with >particular options. We can use "-Wa,xyzzy" to pass in xyzzy to the >assembler, but there is no "-include" option for GNU as afaik. > >Can you perhaps define a macro symbol for "--defsym"? Probably not. > > Linus I looked at this thing a long time ago; it's not there, and the best would probably be to get global asm() working properly in gcc. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH 2/6] x86: bug: prevent gcc distortions
On May 18, 2018 12:02:50 PM PDT, Nadav Amit wrote: >h...@zytor.com wrote: > >> On May 18, 2018 11:50:12 AM PDT, Linus Torvalds > wrote: >>> On Fri, May 18, 2018 at 11:34 AM wrote: >>> On May 18, 2018 11:25:32 AM PDT, Linus Torvalds < >>> torva...@linux-foundation.org> wrote: >>> Unfortunately gcc doesn't guarantee that global assembly inlines >will >>> appear at the top of the file. >>> >>> Yeah. It really would be better to do the "asm version of -inline". >>> >>> We already do something like that for real *.S files on some >>> architectures >>> (because their assembly really wants it, eg >>> >>> arch/arm/Makefile: >>> KBUILD_AFLAGS +=$(CFLAGS_ABI) $(AFLAGS_ISA) $(arch-y) $(tune-y) >>> -include >>> asm/unified.h -msoft-float >>> >>> but I do want to point out that KBUILD_AFLAGS is *not* used for >>> compiler-generated assembly, only for actual *.S files. >>> >>> Sadly, I don't actually know any way to make gcc call the 'as' phase >>> with >>> particular options. We can use "-Wa,xyzzy" to pass in xyzzy to the >>> assembler, but there is no "-include" option for GNU as afaik. >>> >>> Can you perhaps define a macro symbol for "--defsym"? Probably not. >>> >>> Linus >> >> I looked at this thing a long time ago; it's not there, and the best >would probably be to get global asm() working properly in gcc. > >I can add a -Wa,[filename.s] switch. It works, but sort of >undocumented. Ah, I thought that would have assembled each file separately. We can request it be documented, that's usually much easier than getting a new feature implemented. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
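The mechanism, sketched with illustrative file names: gas treats all of its inputs as a single source program, so a file smuggled in through -Wa is assembled together with the compiler-generated assembly rather than separately, and its .macro definitions become visible to the inline asm in the C file:

    # macros.s is assembled as part of the same "source program" as
    # the assembly gcc generates for foo.c (in practice it is seen
    # first, which is exactly the ordering guarantee global asm() lacks)
    gcc -Wa,macros.s -c foo.c -o foo.o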
Re: [PATCH 2/6] x86: bug: prevent gcc distortions
On May 18, 2018 12:21:00 PM PDT, Linus Torvalds wrote: >On Fri, May 18, 2018 at 12:18 PM Nadav Amit wrote: > >> Gnu ASM manual says: "Each time you run as it assembles exactly one >source >> program. The source program is made up of one or more files.” > >Ok, that counts as documentation, although it's confusing as hell. Is >it >one source program or multiple files again? I see what they are trying >to >say, but it really could be phrased more clearly ;) > > Linus At least we have a solution! -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH 2/6] x86: bug: prevent gcc distortions
On May 18, 2018 12:36:20 PM PDT, Nadav Amit wrote: >h...@zytor.com wrote: > >> On May 18, 2018 12:21:00 PM PDT, Linus Torvalds > wrote: >>> On Fri, May 18, 2018 at 12:18 PM Nadav Amit >wrote: >>> Gnu ASM manual says: "Each time you run as it assembles exactly one >>> source program. The source program is made up of one or more files.” >>> >>> Ok, that counts as documentation, although it's confusing as hell. >Is >>> it >>> one source program or multiple files again? I see what they are >trying >>> to >>> say, but it really could be phrased more clearly ;) >>> >>>Linus >> >> At least we have a solution! > >Thanks for all your help. I will try to do it over over the weekend. > >Funny, I see that this “trick” (-Wa,[filename.s]) is already being used >to >include arch/x86/boot/code16gcc.h so I feel “safer” to use it. > >Regards, >Nadav Oh. I don't remember when we did that... might even be my doing and then I just forgot about it :) -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Can we drop upstream Linux x32 support?
On December 10, 2018 5:40:33 PM PST, Linus Torvalds wrote: >On Mon, Dec 10, 2018 at 5:23 PM Andy Lutomirski >wrote: >> >> I'm seriously considering sending a patch to remove x32 support from >> upstream Linux. Here are some problems with it: > >I talked to Arnd (I think - we were talking about all the crazy ABI's, >but maybe it was with somebody else) about exactly this in Edinburgh. > >Apparently the main real use case is for extreme benchmarking. It's >the only use-case where the complexity of maintaining a whole >development environment and distro is worth it, it seems. Apparently a >number of Spec submissions have been done with the x32 model. > >I'm not opposed to trying to sunset the support, but let's see who >complains.. > > Linus The use case aside, I need to address the technical issues in this post; some of the behaviors that Andy is pointing out are quite intentional, even if they are perhaps somewhat confusing at first glance. That being said, some were due to tradeoffs that might have been wrong. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] x86/boot: drop memset from copy.S
On January 6, 2019 11:40:56 PM PST, Cao jin wrote: >According to objdump output of setup, function memset is not used in >setup code. Currently, all usage of memset in setup comes from macro >definition of string.h. > >Signed-off-by: Cao jin >--- >Compiled and booted under x86_64; compiled under i386. > >Questions: now there are 2 definitions of memcpy, one lies in copy.S, >another lies in string.h, which is mapped to a gcc builtin function. Do we >still need 2 definitions? Could we move the content of copy.S to >boot/string.c? > >At first glance, the usage of string.{c,h} of setup is kind of >confusing, >they are also used in compressed/ and purgatory/ > > arch/x86/boot/copy.S | 15 --------------- > 1 file changed, 15 deletions(-) > >diff --git a/arch/x86/boot/copy.S b/arch/x86/boot/copy.S >index 15d9f74b0008..5157d08b0ff2 100644 >--- a/arch/x86/boot/copy.S >+++ b/arch/x86/boot/copy.S >@@ -33,21 +33,6 @@ GLOBAL(memcpy) > retl > ENDPROC(memcpy) > >-GLOBAL(memset) >- pushw %di >- movw %ax, %di >- movzbl %dl, %eax >- imull $0x01010101,%eax >- pushw %cx >- shrw $2, %cx >- rep; stosl >- popw %cx >- andw $3, %cx >- rep; stosb >- popw %di >- retl >-ENDPROC(memset) >- > GLOBAL(copy_from_fs) > pushw %ds > pushw %fs This is dependent on both gcc version and flags. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] x86/boot: drop memset from copy.S
On January 7, 2019 12:52:57 AM PST, Cao jin wrote: >Hi, > >On 1/7/19 3:59 PM, h...@zytor.com wrote: >> On January 6, 2019 11:40:56 PM PST, Cao jin > wrote: >>> According to objdump output of setup, function memset is not used in >>> setup code. Currently, all usage of memset in setup comes from macro >>> definition of string.h. >>> >>> Signed-off-by: Cao jin >>> --- >>> Compiled and booted under x86_64; compiled under i386. >>> >>> Questions: now there are 2 definitions of memcpy, one lies in copy.S, >>> another lies in string.h, which is mapped to a gcc builtin function. Do >we >>> still need 2 definitions? Could we move the content of copy.S to >>> boot/string.c? >>> >>> At first glance, the usage of string.{c,h} of setup is kind of >>> confusing, >>> they are also used in compressed/ and purgatory/ >>> >>> arch/x86/boot/copy.S | 15 --------------- >>> 1 file changed, 15 deletions(-) >>> >>> diff --git a/arch/x86/boot/copy.S b/arch/x86/boot/copy.S >>> index 15d9f74b0008..5157d08b0ff2 100644 >>> --- a/arch/x86/boot/copy.S >>> +++ b/arch/x86/boot/copy.S >>> @@ -33,21 +33,6 @@ GLOBAL(memcpy) >>> retl >>> ENDPROC(memcpy) >>> >>> -GLOBAL(memset) >>> - pushw %di >>> - movw %ax, %di >>> - movzbl %dl, %eax >>> - imull $0x01010101,%eax >>> - pushw %cx >>> - shrw $2, %cx >>> - rep; stosl >>> - popw %cx >>> - andw $3, %cx >>> - rep; stosb >>> - popw %di >>> - retl >>> -ENDPROC(memset) >>> - >>> GLOBAL(copy_from_fs) >>> pushw %ds >>> pushw %fs >> >> This is dependent on both gcc version and flags. >> > >Thanks for your info, but I still don't quite get your point. All files >that have a memset reference in setup will also #include "string.h", so how >would gcc version and flags associate the memset reference with the assembly >function? Hope to get a little more detail when you are >convenient:) GCC will sometimes emit calls to certain functions, including memcpy() and memset(), even when the source never calls them explicitly. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
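A sketch of the kind of code behind this: the source never names memset(), yet the compiler is allowed to lower the zeroing into a library call, so the symbol must exist at link time regardless of what one particular objdump shows:

    struct big {
    	unsigned char data[4096];
    };

    void clear_big(struct big *p)
    {
    	/* may compile to "call memset", depending on gcc version,
    	 * optimization level and target; gcc documents that even
    	 * freestanding environments must provide memcpy, memmove,
    	 * memset and memcmp for this reason */
    	*p = (struct big){ 0 };
    }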
Re: [RFC v2 1/6] x86: introduce kernel restartable sequence
On December 30, 2018 11:21:07 PM PST, Nadav Amit wrote: >It is sometimes beneficial to have a restartable sequence - very few >instructions which if they are preempted jump to a predefined point. > >To provide such functionality on x86-64, we use an empty REX-prefix >(opcode 0x40) as an indication for instruction in such a sequence. >Before >calling the schedule IRQ routine, if the "magic" prefix is found, we >call a routine to adjust the instruction pointer. It is expected that >this opcode is not in common use. > >The following patch will make use of this function. Since there are no >other users (yet?), the patch does not bother to create a general >infrastructure and API that others can use for such sequences. Yet, it >should not be hard to make such extension later. > >Signed-off-by: Nadav Amit >--- > arch/x86/entry/entry_64.S | 16 ++-- > arch/x86/include/asm/nospec-branch.h | 12 > arch/x86/kernel/traps.c | 7 +++ > 3 files changed, 33 insertions(+), 2 deletions(-) > >diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S >index 1f0efdb7b629..e144ff8b914f 100644 >--- a/arch/x86/entry/entry_64.S >+++ b/arch/x86/entry/entry_64.S >@@ -644,12 +644,24 @@ retint_kernel: > /* Interrupts are off */ > /* Check if we need preemption */ > btl $9, EFLAGS(%rsp) /* were interrupts off? */ >- jnc 1f >+ jnc 2f > 0: cmpl $0, PER_CPU_VAR(__preempt_count) >+ jnz 2f >+ >+ /* >+ * Allow to use restartable code sections in the kernel. Consider an >+ * instruction with the first byte having REX prefix without any bits >+ * set as an indication for an instruction in such a section. >+ */ >+ movq RIP(%rsp), %rax >+ cmpb $KERNEL_RESTARTABLE_PREFIX, (%rax) > jnz 1f >+ mov %rsp, %rdi >+ call restart_kernel_rseq >+1: > call preempt_schedule_irq > jmp 0b >-1: >+2: > #endif > /* >* The iretq could re-enable interrupts: >diff --git a/arch/x86/include/asm/nospec-branch.h >b/arch/x86/include/asm/nospec-branch.h >index dad12b767ba0..be4713ef0940 100644 >--- a/arch/x86/include/asm/nospec-branch.h >+++ b/arch/x86/include/asm/nospec-branch.h >@@ -54,6 +54,12 @@ > jnz 771b; \ > add $(BITS_PER_LONG/8) * nr, sp; > >+/* >+ * An empty REX-prefix is an indication that this instruction is part >of kernel >+ * restartable sequence. >+ */ >+#define KERNEL_RESTARTABLE_PREFIX (0x40) >+ > #ifdef __ASSEMBLY__ > > /* >@@ -150,6 +156,12 @@ > #endif > .endm > >+.macro restartable_seq_prefix >+#ifdef CONFIG_PREEMPT >+ .byte KERNEL_RESTARTABLE_PREFIX >+#endif >+.endm >+ > #else /* __ASSEMBLY__ */ > > #define ANNOTATE_NOSPEC_ALTERNATIVE \ >diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c >index 85cccadb9a65..b1e855bad5ac 100644 >--- a/arch/x86/kernel/traps.c >+++ b/arch/x86/kernel/traps.c >@@ -59,6 +59,7 @@ > #include > #include > #include >+#include <asm/nospec-branch.h> > > #ifdef CONFIG_X86_64 > #include >@@ -186,6 +187,12 @@ int fixup_bug(struct pt_regs *regs, int trapnr) > return 0; > } > >+asmlinkage __visible void restart_kernel_rseq(struct pt_regs *regs) >+{ >+ if (user_mode(regs) || *(u8 *)regs->ip != KERNEL_RESTARTABLE_PREFIX) >+ return; >+} >+ > static nokprobe_inline int >do_trap_no_signal(struct task_struct *tsk, int trapnr, const char *str, > struct pt_regs *regs, long error_code) A 0x40 prefix is *not* a noop. It changes the interpretation of byte registers 4 through 7 from ah, ch, dh, bh to spl, bpl, sil and dil. It may not matter in your application but: a. You need to clarify that this is the case, and why; b. Phrase it differently so others don't propagate the same misunderstanding. -- Sent from my Android device with K-9 Mail. 
Please excuse my brevity.
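To make the decode difference concrete (a sketch; encodings per the SDM, 64-bit mode): the opcode and ModRM bytes are identical, but the presence of any REX prefix, including the empty 0x40, remaps byte-register encodings 4 through 7:

    	.byte 0x88, 0xe0	/* mov %ah,  %al               */
    	.byte 0x40, 0x88, 0xe0	/* mov %spl, %al -- not a no-op */

So the scheme is only transparent for instructions that never reference ah, ch, dh or bh, which is the restriction being asked to be spelled out.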
Re: Insanely high baud rates
On October 10, 2018 1:17:17 PM PDT, Alan Cox wrote: >On Tue, 9 Oct 2018 12:19:04 -0700 >"H. Peter Anvin" wrote: > >> [Resending to a wider audience] >> >> In trying to get the termios2 interface actually implemented in >glibc, >> the question came up if we will ever care about baud rates in excess >of >> 4 Gbps, even in the relatively remote future. > >Even RS485 at 4MBits involves deep magic. I think we are fairly safe. >Not >only that but our entire tty layer isn't capable of sustaining anything >even remotely in that range. > >I think its non issue. > >Alan I'm mostly wondering if it is worth future-proofing for new transports. It sounds like we can have a consensus on leaving the upper 4 bits of the speed fields reserved, but leave the details of implementation for the future? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v9 02/10] Makefile: Prepare for using macros for inline asm
On November 7, 2018 1:43:39 PM PST, Logan Gunthorpe wrote: > > >On 2018-11-07 11:56 a.m., Nadav Amit wrote: >> HPA indicated more than once that this is wrong (and that was indeed >my >> initial solution), since it is not guaranteed that the compiler would >put >> the macro assembly before its use. > >Hmm, that's very unfortunate. I don't really understand why the >compiler >would not put the macro assembly in the same order as it appears in the >code and it would be in the correct order there. > >In any case, I've submitted a couple of issues to icecc[1] and >distcc[2] >to see if they have any ideas about supporting this on their sides. > >Thanks, > >Logan > >[1] https://github.com/icecc/icecream/issues/428 >[2] https://github.com/distcc/distcc/issues/312 Apparently gcc will treat them like basic blocks and possibly move them around. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
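A sketch of the hazard (the macro name is illustrative): nothing ties a file-scope asm() to a position in the emitted assembly, so the definition can land after its first use, and the assembly then fails:

    /* file-scope ("global") inline asm defining an assembler macro */
    asm(".macro MY_NOP\n\t"
        "nop\n\t"
        ".endm");

    void f(void)
    {
    	/* gcc may emit this function's body before the .macro
    	 * directive above, at which point gas rejects MY_NOP */
    	asm volatile("MY_NOP");
    }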
Re: Insanely high baud rates
On October 11, 2018 5:31:34 AM PDT, Alan Cox wrote: >> I'm mostly wondering if it is worth future-proofing for new >transports. It sounds like we can have a consensus on leaving the upper >4 bits of the speed fields reserved, but leave the details of >implementation for the future? > >It seems reasonable, although I think the reality is that any future >transport is not going to be a true serial link, but some kind of >serial >emulation layer. For those the speed really only matters to tell >editors >and the like not to bother being clever. > >I mean - what is the baud rate of a pty ? > >Alan Whatever the master wants it to be... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
On October 4, 2018 1:33:33 AM PDT, Ingo Molnar wrote: > >* Ingo Molnar wrote: > >> I'm also somewhat annoyed at the fact that this series carries a >boatload >> of reviewed-by's and acked-by's, yet none of those reviewers found it >> important to point out the large chasm that is gaping between >description >> and reality. > >Another problem I just realized is that we now include >arch/x86/kernel/macros.S in every >translation pass when building the kernel, right? > >But arch/x86/kernel/macros.S expands to a pretty large hierarchy of >header files: > > $ make arch/x86/kernel/macros.s > >$ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort | >uniq) | wc -l > 4128 > >That's 4,100 extra lines of code to be preprocessed for every >translation unit, of >which there are tens of thousands. More if other pieces of code get >macrofied in >this fashion in the future. > >If we assume that a typical distribution kernel build has ~20,000 >translation units >then this change adds 82,560,000 more lines to be preprocessed, just to >work around >a stupid GCC bug? > >I'm totally unhappy about that. Can we do this without adding macros.S? > >It's also a pretty stupidly central file anyway that moves source code >away >from where it's used. > >Thanks, > > Ingo It's not just for working around a stupid GCC bug, but it also has a huge potential for cleaning up the inline asm in general. I would like to know if there is an actual number for the build overhead (an actual benchmark); I have asked for that once already. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
On October 4, 2018 1:56:37 AM PDT, Nadav Amit wrote: >h...@zytor.com wrote: > >> On October 4, 2018 1:33:33 AM PDT, Ingo Molnar >wrote: >>> * Ingo Molnar wrote: >>> >>>> I'm also somewhat annoyed at the fact that this series carries a >>> boatload >>>> of reviewed-by's and acked-by's, yet none of those reviewers found >it >>>> important to point out the large chasm that is gaping between >>> description >>>> and reality. >>> >>> Another problem I just realized is that we now include >>> arch/x86/kernel/macros.S in every >>> translation pass when building the kernel, right? >>> >>> But arch/x86/kernel/macros.S expands to a pretty large hierarchy of >>> header files: >>> >>> $ make arch/x86/kernel/macros.s >>> >>> $ cat $(grep include arch/x86/kernel/macros.s | cut -d\" -f2 | sort >| >>> uniq) | wc -l >>> 4128 >>> >>> That's 4,100 extra lines of code to be preprocessed for every >>> translation unit, of >>> which there are tens of thousands. More if other pieces of code get >>> macrofied in >>> this fashion in the future. >>> >>> If we assume that a typical distribution kernel build has ~20,000 >>> translation units >>> then this change adds 82,560,000 more lines to be preprocessed, just >to >>> work around >>> a stupid GCC bug? >>> >>> I'm totally unhappy about that. Can we do this without adding >macros.S? >>> >>> It's also a pretty stupidly central file anyway that moves source >code >>> away >>> from where it's used. >>> >>> Thanks, >>> >>> Ingo >> >> It's not just for working around a stupid GCC bug, but it also has a >huge >> potential for cleaning up the inline asm in general. >> >> I would like to know if there is an actual number for the build >overhead >> (an actual benchmark); I have asked for that once already. > >I can run some tests. (@hpa: I thought you asked about the -pipe >overhead; >perhaps I misunderstood). > >I guess you refer to the preprocessing of the assembler. Note that the >C >preprocessing of macros.S obviously happens only once. That’s the >reason >I assumed it’s not that expensive. > >Anyhow, I remember that we discussed at some point doing something like >‘asm(“.include XXX.s”)’ and somebody said it is not good, but I don’t >remember why and don’t see any reason it is so. Unless I am missing >something, I think it is possible to take each individual header and >preprocess the assembly part into a separate .s file. Then we can >put in >the C part of the header ‘asm(".include XXX.s”)’. > >What do you think? Ingo: I wasn't talking necessarily about the specifics of each bit, but rather the general concept about being able to use macros in inlines... I can send you something I have been working on in the background, but have been holding off on because of this, in the morning my time. But that's no excuse for not doing it right. Global asm() statements are problematic because there is no guarantee where in the assembly source they will end up. You'd almost need to intercept the assembly output, move all the includes to the top, and then process... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v9 04/10] x86: refcount: prevent gcc distortions
On October 4, 2018 2:12:22 AM PDT, Ingo Molnar wrote: > >* Nadav Amit wrote: > >> I can run some tests. (@hpa: I thought you asked about the -pipe >overhead; >> perhaps I misunderstood). > >Well, tests are unlikely to show the overhead of extra lines of this >magnitude, unless done very carefully, yet the added bloat exists and >is not even >mentioned by the changelog, it just says: > >Subject: [PATCH v9 02/10] Makefile: Prepare for using macros for inline >asm > > Using macros for inline assembly improves both readability and >compilation decisions that are distorted by big assembly blocks that >use > alternative sections. Compile macros.S and use it to assemble all C > files. Currently, only x86 will use it. > >> I guess you regard to the preprocessing of the assembler. Note that >the C >> preprocessing of macros.S obviously happens only once. That’s the >reason >> I assumed it’s not that expensive. > >True - so first we build macros.s, and that gets included in every C >file build, right? > >macros.s is smaller: 275 lines only in the distro test build I tried, >which looks >a lot better than my first 4,200 lines guesstimate. > >> Anyhow, I remember that we discussed at some point doing something >like >> ‘asm(“.include XXX.s”)’ and somebody said it is not good, but I don’t >> remember why and don’t see any reason it is so. Unless I am missing >> something, I think it is possible to take each individual header and >> preprocess the assembly part of into a separate .s file. Then we can >put in >> the C part of the header ‘asm(".include XXX.s”)’. >> >> What do you think? > >Hm, this looks quite complex - macros.s is better I think. Also, 275 >straight assembly lines is >a lot better than 4,200. > >Another, separate question I wanted to ask: how do we ensure that the >kernel stays fixed? >I.e. is there some tooling we can use to actually measure whether >there's bad inlining decisions >done, to detect all these bad patterns that cause bad GCC code >generation? > >Thanks, > > Ingo The assembly output from GCC is quite voluminous; I doubt tacking a few hundred lines on will matter one iota. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH stable v2 1/2] termios, tty/tty_baudrate.c: fix buffer overrun
On October 23, 2018 7:53:51 AM PDT, Greg Kroah-Hartman wrote: >On Mon, Oct 22, 2018 at 09:19:04AM -0700, H. Peter Anvin (Intel) wrote: >> From: "H. Peter Anvin" >> >> On architectures with CBAUDEX == 0 (Alpha and PowerPC), the code in >tty_baudrate.c does >> not do any limit checking on the tty_baudrate[] array, and in fact a >> buffer overrun is possible on both architectures. Add a limit check >to >> prevent that situation. >> >> This will be followed by a much bigger cleanup/simplification patch. >> >> Signed-off-by: H. Peter Anvin (Intel) >> Requested-by: Cc: Johan Hovold >> Cc: Greg Kroah-Hartman >> Cc: Jiri Slaby >> Cc: Al Viro >> Cc: Richard Henderson >> Cc: Ivan Kokshaysky >> Cc: Matt Turner >> Cc: Thomas Gleixner >> Cc: Kate Stewart >> Cc: Philippe Ombredanne >> Cc: Greg Kroah-Hartman >> Cc: Eugene Syromiatnikov >> Cc: >> Cc: >> Cc: Alan Cox >> Cc: >> --- >> drivers/tty/tty_baudrate.c | 4 ++-- >> 1 file changed, 2 insertions(+), 2 deletions(-) > >As I think Al's big termios cleanups are going to be hitting Linus's >tree soon, do you know how these patches interact with that? > >This patch seems like it will not, so I'll be glad to queue that up >after my first round of patches get merged to Linus later this week, >but >the second one worries me. > >thanks, > >greg k-h I have been working with Al; we had approached much the same problems but from different directions. Mine ended up being a bit more comprehensive as a result, so I think we're going to end up using my code with Al's reviews. So bottom line is that it should be all good. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
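The overrun fix itself is a one-liner in spirit; a hedged sketch of its shape (not the literal patch -- baud_table[] and n_baud_table are the names used in drivers/tty/tty_baudrate.c, where the table described above lives):

        /* On CBAUDEX == 0 architectures nothing bounded the decoded index,
         * so clamp it before touching the table. */
        static speed_t baud_lookup(unsigned int cbaud)
        {
                if (cbaud >= n_baud_table)      /* the previously missing check */
                        return 0;               /* treat out-of-range as B0 */
                return baud_table[cbaud];
        }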
Re: [PATCH RFC] x86: Don't include '-Wa,-' when building with Clang
On October 23, 2018 2:58:11 PM PDT, Nathan Chancellor wrote: >On Tue, Oct 23, 2018 at 01:01:22PM -0700, H. Peter Anvin wrote: >> On 10/23/18 11:40, Nick Desaulniers wrote: >> > On Mon, Oct 22, 2018 at 10:11 PM Nadav Amit >wrote: >> >> >> >> at 5:37 PM, Nathan Chancellor wrote: >> >> >> >> Commit 77b0bf55bc67 ("kbuild/Makefile: Prepare for using macros in >> >> inline assembly code to work around asm() related GCC inlining >bugs") >> >> added this flag to KBUILD_CFLAGS, where it works perfectly fine >with >> >> GCC. However, when building with Clang, all of the object files >compile >> >> fine but the build hangs indefinitely at init/main.o, right before >the >> >> linking stage. Don't include this flag when building with Clang. >> >> >> >> The kernel builds and boots to a shell in QEMU with both GCC and >Clang >> >> with this patch applied. >> >> >> >> Link: >https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FClangBuiltLinux%2Flinux%2Fissues%2F213&data=02%7C01%7Cnamit%40vmware.com%7C871daebc2ca44947d28d08d638811fb5%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636758524579997650&sdata=shuxW81QRrO3TSqbgf462wgZYdLeAKeQEdGRxmnUX30%3D&reserved=0 >> >> Signed-off-by: Nathan Chancellor >> >> --- >> >> >> >> The reason this patch is labeled RFC is while I can verify that >this >> >> fixes the issue, I'm not entirely sure why the '-Wa,-' works for >GCC >> >> and not Clang. I looked into what the flag means and I couldn't >really >> >> find anything so I just assume it's taking input from stdin? The >issue >> >> could stem from how GCC forks gas versus how Clang does it. If >this >> >> isn't of concern and the maintainers are happy with this patch as >is, >> >> feel free to take it. >> >> >> >> Perhaps someone could actually, you know, time the build and see how >> much -pipe actually matters, if at all? >> >> -hpa >> > >Thank you for the suggestion! With the attached diff for removing >'-pipe' and 'make -j1' with defconfig (just to make sure any variance >would stand out), here are my results: > >-pipe (GCC): > >real15m55.202s >user14m17.748s >sys 1m47.496s > >No -pipe (GCC): > >real16m4.430s >user14m16.277s >sys 1m46.604s > >-pipe (Clang): > >real21m26.016s >user19m21.722s >sys 2m2.606s > >No -pipe (Clang): > >real21m27.822s >user19m22.092s >sys 2m4.151s > >Certainly seems like -pipe doesn't make a ton of difference. If this is >a better fix, I am happy to draft up a proper commit message and send >it out for review. > >== > >diff --git a/arch/x86/Makefile b/arch/x86/Makefile >index 73f4831283ac..672c689c1faa 100644 >--- a/arch/x86/Makefile >+++ b/arch/x86/Makefile >@@ -213,8 +213,6 @@ ifdef CONFIG_X86_64 > KBUILD_LDFLAGS += $(call ld-option, -z max-page-size=0x20) > endif > >-# Speed up the build >-KBUILD_CFLAGS += -pipe ># Workaround for a gcc prelease that unfortunately was shipped in a >suse release > KBUILD_CFLAGS += -Wno-sign-compare > # >@@ -239,7 +237,7 @@ archheaders: > archmacros: >$(Q)$(MAKE) $(build)=arch/x86/kernel arch/x86/kernel/macros.s > >-ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s -Wa,- >+ASM_MACRO_FLAGS = -Wa,arch/x86/kernel/macros.s > export ASM_MACRO_FLAGS > KBUILD_CFLAGS += $(ASM_MACRO_FLAGS) make -j1 makes the test useless. You should aim for optimal performance for your machine. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH RFC] x86: Don't include '-Wa,-' when building with Clang
On October 23, 2018 3:53:10 PM PDT, Nick Desaulniers wrote: >On Tue, Oct 23, 2018 at 3:44 PM Nathan Chancellor > wrote: >> >> On Tue, Oct 23, 2018 at 03:08:53PM -0700, Nick Desaulniers wrote: >> > On Tue, Oct 23, 2018 at 2:58 PM Nathan Chancellor >> > wrote: >> > > >> > > On Tue, Oct 23, 2018 at 01:01:22PM -0700, H. Peter Anvin wrote: >> > > > On 10/23/18 11:40, Nick Desaulniers wrote: >> > > > > On Mon, Oct 22, 2018 at 10:11 PM Nadav Amit > wrote: >> > > > >> >> > > > >> at 5:37 PM, Nathan Chancellor >wrote: >> > > > >> >> > > > >> Commit 77b0bf55bc67 ("kbuild/Makefile: Prepare for using >macros in >> > > > >> inline assembly code to work around asm() related GCC >inlining bugs") >> > > > >> added this flag to KBUILD_CFLAGS, where it works perfectly >fine with >> > > > >> GCC. However, when building with Clang, all of the object >files compile >> > > > >> fine but the build hangs indefinitely at init/main.o, right >before the >> > > > >> linking stage. Don't include this flag when building with >Clang. >> > > > >> >> > > > >> The kernel builds and boots to a shell in QEMU with both GCC >and Clang >> > > > >> with this patch applied. >> > > > >> >> > > > >> Link: >https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FClangBuiltLinux%2Flinux%2Fissues%2F213&data=02%7C01%7Cnamit%40vmware.com%7C871daebc2ca44947d28d08d638811fb5%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636758524579997650&sdata=shuxW81QRrO3TSqbgf462wgZYdLeAKeQEdGRxmnUX30%3D&reserved=0 >> > > > >> Signed-off-by: Nathan Chancellor >> > > > >> --- >> > > > >> >> > > > >> The reason this patch is labeled RFC is while I can verify >that this >> > > > >> fixes the issue, I'm not entirely sure why the '-Wa,-' works >for GCC >> > > > >> and not Clang. I looked into what the flag means and I >couldn't really >> > > > >> find anything so I just assume it's taking input from stdin? >The issue >> > > > >> could stem from how GCC forks gas versus how Clang does it. >If this >> > > > >> isn't of concern and the maintainers are happy with this >patch as is, >> > > > >> feel free to take it. >> > > > >> >> > > > >> > > > Perhaps someone could actually, you know, time the build and >see how >> > > > much -pipe actually matters, if at all? >> > > > >> > > > -hpa >> > > > >> > > >> > > Thank you for the suggestion! With the attached diff for removing >> > > '-pipe' and 'make -j1' with defconfig (just to make sure any >variance >> > > would stand out), here are my results: >> > > >> > > -pipe (GCC): >> > > >> > > real15m55.202s >> > > user14m17.748s >> > > sys 1m47.496s >> > > >> > > No -pipe (GCC): >> > > >> > > real16m4.430s >> > > user14m16.277s >> > > sys 1m46.604s >> > > >> > > -pipe (Clang): >> > > >> > > real21m26.016s >> > > user19m21.722s >> > > sys 2m2.606s >> > > >> > > No -pipe (Clang): >> > > >> > > real21m27.822s >> > > user19m22.092s >> > > sys 2m4.151s >> > >> > Looks like Clang eats `-pipe`: >> > >https://github.com/llvm-mirror/clang/blob/391667a023f79287f9c40868f34f08c16156/lib/Driver/Driver.cpp#L962 >> > commit r110007 has the log: >> > Driver: Start ripping out support for -pipe, which is worthless >> > and complicates >> > too many other things. >> > >> >> In that case, we can either keep this change (I'll resend with the >> explanation that Clang doesn't respect -pipe) or we can just rip out >> -pipe for GCC too. Here are three separate results for GCC with my >> normal jobs flag: >> >> -pipe (GCC): >> >> real3m40.813s >> real3m44.449s >> real3m39.648s >> >> No -pipe (GCC): >> >> real3m38.4
Re: [PATCH] x86, random: Fix get_random_bytes() warning in x86 start_kernel
On May 29, 2018 9:58:10 AM PDT, Prarit Bhargava wrote: > > >On 05/29/2018 12:07 PM, Theodore Y. Ts'o wrote: >> On Tue, May 29, 2018 at 11:01:07AM -0400, Prarit Bhargava wrote: >>> Kees, in early boot no pool is available so the stack canary is >initialized from >>> the TSC. Later in boot, the stack canary will use the the crng. >>> >>> ie) in early boot only TSC is okay, and late boot (when crng_ready() >is true) >>> the pool will be used. >> >> But that means all of the kernel threads (e.g., workqueues, et. al) >> would not be well protected by the stack canary. That >> seems rather unfortunate. > >Well, as stated the TSC is used as a source of entropy in early boot. >It's >always been that way and get_random_bytes() AFAICT has always returned >0. CPUs >added later on via hotplug do use get_random_bytes(). > >Does anyone cc'd have a better idea on how to get another source of >entropy this >early in boot? > >P. > >> >> - Ted >> RDRAND/RDSEED for newer x86 processors. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
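For the record, pulling early entropy from RDRAND needs nothing but a CPUID feature check; a minimal sketch, close in shape to the kernel's own rdrand_long() in asm/archrandom.h (gate it on boot_cpu_has(X86_FEATURE_RDRAND), and retry because RDRAND may transiently return with CF=0):

        static bool rdrand_long(unsigned long *v)
        {
                bool ok;
                int retries = 10;       /* Intel's recommended retry count */

                do {
                        asm volatile("rdrand %0; setc %1"
                                     : "=r" (*v), "=qm" (ok));
                } while (!ok && --retries);

                return ok;
        }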
Re: Is this a kernel BUG? ///Re: [Question] Can we use SIGRTMIN when vdso disabled on X86?
On June 6, 2018 2:17:42 AM PDT, "Leizhen (ThunderTown)" wrote: >I found that glibc has already dealt with this case. So this issue must >have been met before, should it be maintained by libc/user? > > if (GLRO(dl_sysinfo_dso) == NULL) > { > kact.sa_flags |= SA_RESTORER; > > kact.sa_restorer = ((act->sa_flags & SA_SIGINFO) > ? &restore_rt : &restore); > } > > >On 2018/6/6 15:52, Leizhen (ThunderTown) wrote: >> >> >> On 2018/6/5 19:24, Leizhen (ThunderTown) wrote: >>> After I executed "echo 0 > /proc/sys/abi/vsyscall32" to disable >vdso, the rt_sigaction01 test case from ltp_2015 failed. >>> The test case source code please refer to the attachment, and the >output as blow: >>> >>> - >>> ./rt_sigaction01 >>> rt_sigaction010 TINFO : signal: 34 >>> rt_sigaction011 TPASS : rt_sigaction call succeeded: result = >0 >>> rt_sigaction010 TINFO : sa.sa_flags = SA_RESETHAND|SA_SIGINFO >>> rt_sigaction010 TINFO : Signal Handler Called with signal >number 34 >>> >>> Segmentation fault >>> -- >>> >>> >>> Is this the desired result? In function ia32_setup_rt_frame, I found >below code: >>> >>> if (ksig->ka.sa.sa_flags & SA_RESTORER) >>> restorer = ksig->ka.sa.sa_restorer; >>> else >>> restorer = current->mm->context.vdso + >>> vdso_image_32.sym___kernel_rt_sigreturn; >>> put_user_ex(ptr_to_compat(restorer), &frame->pretcode); >>> >>> Because the vdso is disabled, so current->mm->context.vdso is NULL, >which cause the result of frame->pretcode invalid. >>> >>> I'm not sure whether this is a kernel bug or just an error of test >case itself. Can anyone help me? >>> >> The use of signals without SA_RESTORER is considered obsolete, but it's somewhat surprising that the vdso isn't there; it should be mapped even for static binaries esp. on i386 since it is the preferred way to do system calls (you don't need to parse the ELF for that.) Are you explicitly disabling the VDSO? If so, Don't Do That. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
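The glibc fallback quoted above works because SA_RESTORER lets user space supply its own sigreturn trampoline, removing the dependency on the vdso. A hedged i386 sketch of the same dance (my_restore_rt, struct ksigaction, and the raw-syscall plumbing are illustrative, not glibc's internals):

        #include <signal.h>
        #include <unistd.h>
        #include <sys/syscall.h>

        #define SA_RESTORER 0x04000000  /* from asm/signal.h; glibc hides it */

        struct ksigaction {             /* the kernel's layout, not glibc's */
                void (*handler)(int, siginfo_t *, void *);
                unsigned long flags;
                void (*restorer)(void);
                unsigned long mask[2];
        };

        asm(".text\n"
            ".globl my_restore_rt\n"
            "my_restore_rt:\n"
            "        movl $173, %eax\n" /* __NR_rt_sigreturn on i386 */
            "        int $0x80\n");
        extern void my_restore_rt(void);

        static long install(int sig, void (*h)(int, siginfo_t *, void *))
        {
                struct ksigaction ksa = {
                        .handler  = h,
                        .flags    = SA_SIGINFO | SA_RESTORER,
                        .restorer = my_restore_rt,
                };
                return syscall(SYS_rt_sigaction, sig, &ksa, NULL, 8);
        }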
Re: [PATCH v2 7/8] x86/segments/32: Introduce CPU_NUMBER segment
On June 6, 2018 12:07:15 PM PDT, Brian Gerst wrote: >On Wed, Jun 6, 2018 at 1:16 PM, Andy Lutomirski >wrote: >> On Wed, Jun 6, 2018 at 9:23 AM Chang S. Bae > wrote: >>> >>> The new entry will be equivalent to that of x86-64 which >>> stores CPU number. The entry is placed in segment 23 in GDT >>> by bumping down 23-28 by one, which are all kernel-internal >>> segments and so have no impact on user space. >>> >>> CPU_NUMBER segment will always be at '%ss (USER_DS) + 80' >>> for the default (flat, initial) user space %ss. >> >> No, it won't :( This is because, on Xen PV, user code very >frequently >> sees a different, Xen-supplied "flat" SS value. This is definitely >> true right now on 64-bit, and I'm reasonably confident it's also the >> case on 32-bit. >> >> As it stands, as far as I can tell, we don't have a "cpu number" >> segment on 32-bit kernels. I see no compelling reason to add one, >and >> we should definitely not add one as part of the FSGSBASE series. I >> think the right solution is to rename the 64-bit segment to >> "CPU_NUMBER" and then have the rearrangement of the initialization >> code as a followup patch. The goal is to make the patches >> individually reviewable. As it stands, this patch adds some #defines >> without making them work, which is extra confusing. >> >> Given how many times we screwed it up, I really want the patch that >> moves the initialization of the 64-bit CPU number to be obviously >> correct and to avoid changing the sematics of anything except the >> actual CPU number fields during boot. >> >> So NAK to this patch, at least as part of the FSGSBASE series. >> >> (My apologies -- a bunch of this is because I along with everyone >else >> misunderstood the existing code.) > >The sole purpose of this segment is for the getcpu() function in the >VDSO. No other userspace code can rely on its presence or location. > >-- >Brian Gerst Unfortunately that is not true in reality :( -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
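For background on what this segment buys user space: the CPU (and node) number is packed into the segment's limit field, which an unprivileged LSL instruction reads without touching memory. A sketch of the consumer side, mirroring the x86-64 layout (the 12-bit split is the convention there; the 32-bit variant was the point of contention):

        static inline void read_cpu_node(unsigned int seg_sel,
                                         unsigned int *cpu, unsigned int *node)
        {
                unsigned int p;

                /* LSL fetches the segment limit: (node << 12) | cpu. */
                asm("lsl %1, %0" : "=r" (p) : "r" (seg_sel));
                *cpu  = p & 0xfff;
                *node = p >> 12;
        }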
Re: [PATCH] x86: pad assembly functions with INT3
On May 14, 2018 11:54:05 PM PDT, Ingo Molnar wrote: > >* h...@zytor.com wrote: > >> > I guess it won't try to speculatively execute the 'pad' >instructions - but you >> > can never really tell! >> > >> >David >> >> The CPU doesn't speculate down past an unconditional control >transfer. Doing so >> would be idiotic. > >I think, when it comes to speculative execution, our general >expectation that CPUs >don't do idiotic things got somewhat weakened in the past year or so >... > >Thanks, > > Ingo Sort-of-kind-of... the things that have cropped up were reasonable consequences of designing under a set of assumptions which turned out to be incorrect. Speculating through an unconditional control transfer did not make sense under those assumptions either. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v5 00/13] x86: Enable FSGSBASE instructions
On February 1, 2019 6:43:25 PM PST, Andy Lutomirski wrote: >Hi hpa- > >A while back, you were working on some patches to make modify_ldt() >play better with this series. What happened to them? > >--Andy Looks like I need to dig them out... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v5 01/13] taint: Introduce a new taint flag (insecure)
On February 5, 2019 2:46:11 PM PST, Randy Dunlap wrote: >On 2/5/19 1:21 PM, Andrew Morton wrote: >> On Fri, 1 Feb 2019 18:42:29 -0800 Andy Lutomirski >wrote: >> >>> On Fri, Feb 1, 2019 at 12:54 PM Chang S. Bae > wrote: For testing (or root-only) purposes, the new flag will serve to tag >the kernel taint accurately. When adding a new feature support, patches need to be incrementally applied and tested with temporal parameters. Currently, there is no >flag for this usage. >>> >>> I think this should be reviewed by someone like akpm. akpm, for >>> background, this is part of an x86 patch series. If only part of >the >>> series is applied, the kernel will be blatantly insecure (but still >>> functional and useful for testing and bisection), and this taint >flag >>> will be set if this kernel is booted. With the whole series >applied, >>> there are no users of the taint flag in the kernel. >>> >>> Do you think this is a good idea? >> >> What does "temporal parameters" mean? A complete description of this >> testing process would help. >> >> I sounds a bit strange. You mean it assumes that people will >partially >> apply the series to test its functionality? That would be >inconvenient. > >Ack. I don't think we need to (or should) worry about that kind of >muckup. > >> - Can the new and now-unused taint flag be removed again at >> end-of-series? >> >> - It would be a lot more convenient if we had some means of testing >> after the whole series is applied, on a permanent basis - some >> debugfs flag, perhaps? >> I would like to see this taint flag, though, because sometimes it is useful to write test modules (e.g. when I was testing SMAP) which are dangerous even if out of tree. In case an escape or pilot error gets it into the wrong kernel, it is a very good thing to have the kernel flagged. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
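A sketch of the sort of out-of-tree test module being described, so that an escaped binary flags itself (TAINT_INSECURE is the name proposed in this thread, not an upstream constant; add_taint() is the real interface):

        #include <linux/kernel.h>
        #include <linux/module.h>

        static int __init insecure_test_init(void)
        {
                /* Loudly mark the kernel: this module deliberately
                 * weakens protections for testing. */
                add_taint(TAINT_INSECURE, LOCKDEP_STILL_OK);
                pr_warn("insecure_test: kernel tainted for testing\n");
                return 0;
        }
        module_init(insecure_test_init);

        MODULE_LICENSE("GPL");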
Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel
On January 24, 2019 12:59:29 PM PST, Joel Fernandes wrote: >On Thu, Jan 24, 2019 at 07:57:26PM +0100, Karim Yaghmour wrote: >> >> On 1/23/19 11:37 PM, Daniel Colascione wrote: >[..] >> > > Personally I advocated a more aggressive approach with Joel in >private: >> > > just put the darn headers straight into the kernel image, it's >the >> > > *only* artifact we're sure will follow the Android device >whatever >> > > happens to it (like built-in ftrace). >> > >> > I was thinking along similar lines. Ordinarily, we make loadable >> > kernel modules. What we kind of want here is a non-loadable kernel >> > module --- or a non-loadable section in the kernel image proper. >I'm >> > not familiar with early-stage kernel loader operation: I know it's >> > possible to crease discardable sections in the kernel image, but >can >> > we create sections that are never slurped into memory in the first >> > place? If not, maybe loading and immediately discarding the header >> > section is good enough. >> >> Interesting. Maybe just append it to the image but have it not loaded >and >> have a kernel parameter than enables a "/proc/kheaders" driver to >know where >> the fetch the appended headers from storage at runtime. There would >be no >> RAM loading whatsoever of the headers, just some sort of >> "kheaders=/dev/foobar:offset:size" parameter. If you turn the option >on, you >> get a fatter kernel image size to store on permanent storage, but no >impact >> on what's loaded at boot time. > >Embedding anything into the kernel image does impact boot time though >because >it increase the time spent by bootloader. A module OTOH would not have >such >overhead. > >Also a kernel can be booted in any number of ways other than mass >storage so >it is not a generic Linux-wide solution to have a kheaders= option like >that. >If the option is forgotten, then the running system can't use the >feature. >The other issue is it requires a kernel command line option / >bootloader >changes for that which adds more configuration burden, which not be >needed >with a module. > >> > Would such a thing really do better than LZMA? LZMA already has >very >> > clever techniques for eliminating long-range redundancies in >> > compressible text, including redundancies at the sub-byte level. I >can >> > certainly understand the benefit of stripping comments, since >removing >> > comments really does decrease the total amount of information the >> > compressor has to preserve, but I'm not sure how much the encoding >> > scheme you propose below would help, since it reminds me of the >> > encoding scheme that LZMA would discover automatically. >> >> I'm no compression algorithm expert. If you say LZMA would do the >> same/better than what I suggested then I have no reason to contest >that. My >> goal is to see the headers as part of the kernel image that's >distributed on >> devices so that they don't have to be chased around. I'm just trying >to make >> it as palatable as possible. > >I believe LZMA is really good at that sort of thing too. > >Also at 3.3MB of module size, I think we are really good size-wise. But >Dan >is helping look at possibly reducing further if he gets time. Many >modules in >my experience are much bigger. amdgpu.ko on my Linux machine is 6.1MB. > >I really think making it a module is the best way to make sure this is >bundled with the kernel on the widest number of Android and other Linux >systems, without incurring boot time overhead, or any other command >line >configuration burden. 
> >I spoke to so many people at LPC personally with other kernel >contributors, >and many folks told me one word - MODULE :D. Even though I hesitated >at >first, now it seems the right solution. > >If no one seriously objects, I'll clean this up and post a v2 and with >the >RFC tag taken off. Thank you! > > - Joel So let me throw in a different notion. A kernel module really is nothing other than a kernel build system artifact stored in the filesystem. I really don't see any reason whatsoever why this is different from just producing an archive and putting it in the module directory, except that the latter is far simpler. I see literally *no* problem, social or technical, that you are solving by actually making it a kernel ELF object. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
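For comparison, the module route being argued over reduces to very little code: the build packs the headers into an archive, .incbin pulls the bytes into an object, and a proc file hands them out. A hedged sketch of that shape, using the current proc_ops interface (the version eventually merged as kernel/kheaders.c differs in detail, e.g. it exports via sysfs):

        /* kheaders_data.S -- embed the build-produced archive:
         *
         *      .section .rodata, "a"
         *      .globl kheaders_start, kheaders_end
         * kheaders_start:
         *      .incbin "kernel/kheaders_data.tar.xz"
         * kheaders_end:
         */

        #include <linux/fs.h>
        #include <linux/module.h>
        #include <linux/proc_fs.h>

        extern char kheaders_start[], kheaders_end[];

        static ssize_t kheaders_read(struct file *f, char __user *buf,
                                     size_t len, loff_t *off)
        {
                return simple_read_from_buffer(buf, len, off, kheaders_start,
                                               kheaders_end - kheaders_start);
        }

        static const struct proc_ops kheaders_ops = {
                .proc_read = kheaders_read,
        };

        static int __init kheaders_init(void)
        {
                proc_create("kheaders.tar.xz", 0444, NULL, &kheaders_ops);
                return 0;
        }
        module_init(kheaders_init);

        MODULE_LICENSE("GPL");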
Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel
On January 25, 2019 11:15:52 AM PST, Daniel Colascione wrote: >On Fri, Jan 25, 2019 at 11:01 AM wrote: >> >> On January 24, 2019 12:59:29 PM PST, Joel Fernandes > wrote: >> >On Thu, Jan 24, 2019 at 07:57:26PM +0100, Karim Yaghmour wrote: >> >> >> >> On 1/23/19 11:37 PM, Daniel Colascione wrote: >> >[..] >> >> > > Personally I advocated a more aggressive approach with Joel in >> >private: >> >> > > just put the darn headers straight into the kernel image, it's >> >the >> >> > > *only* artifact we're sure will follow the Android device >> >whatever >> >> > > happens to it (like built-in ftrace). >> >> > >> >> > I was thinking along similar lines. Ordinarily, we make loadable >> >> > kernel modules. What we kind of want here is a non-loadable >kernel >> >> > module --- or a non-loadable section in the kernel image proper. >> >I'm >> >> > not familiar with early-stage kernel loader operation: I know >it's >> >> > possible to crease discardable sections in the kernel image, but >> >can >> >> > we create sections that are never slurped into memory in the >first >> >> > place? If not, maybe loading and immediately discarding the >header >> >> > section is good enough. >> >> >> >> Interesting. Maybe just append it to the image but have it not >loaded >> >and >> >> have a kernel parameter than enables a "/proc/kheaders" driver to >> >know where >> >> the fetch the appended headers from storage at runtime. There >would >> >be no >> >> RAM loading whatsoever of the headers, just some sort of >> >> "kheaders=/dev/foobar:offset:size" parameter. If you turn the >option >> >on, you >> >> get a fatter kernel image size to store on permanent storage, but >no >> >impact >> >> on what's loaded at boot time. >> > >> >Embedding anything into the kernel image does impact boot time >though >> >because >> >it increase the time spent by bootloader. A module OTOH would not >have >> >such >> >overhead. >> > >> >Also a kernel can be booted in any number of ways other than mass >> >storage so >> >it is not a generic Linux-wide solution to have a kheaders= option >like >> >that. >> >If the option is forgotten, then the running system can't use the >> >feature. >> >The other issue is it requires a kernel command line option / >> >bootloader >> >changes for that which adds more configuration burden, which not be >> >needed >> >with a module. >> > >> >> > Would such a thing really do better than LZMA? LZMA already has >> >very >> >> > clever techniques for eliminating long-range redundancies in >> >> > compressible text, including redundancies at the sub-byte level. >I >> >can >> >> > certainly understand the benefit of stripping comments, since >> >removing >> >> > comments really does decrease the total amount of information >the >> >> > compressor has to preserve, but I'm not sure how much the >encoding >> >> > scheme you propose below would help, since it reminds me of the >> >> > encoding scheme that LZMA would discover automatically. >> >> >> >> I'm no compression algorithm expert. If you say LZMA would do the >> >> same/better than what I suggested then I have no reason to contest >> >that. My >> >> goal is to see the headers as part of the kernel image that's >> >distributed on >> >> devices so that they don't have to be chased around. I'm just >trying >> >to make >> >> it as palatable as possible. >> > >> >I believe LZMA is really good at that sort of thing too. >> > >> >Also at 3.3MB of module size, I think we are really good size-wise. 
>But >> >Dan >> >is helping look at possibly reducing further if he gets time. Many >> >modules in >> >my experience are much bigger. amdgpu.ko on my Linux machine is >6.1MB. >> > >> >I really think making it a module is the best way to make sure this >is >> >bundled with the kernel on the widest number of Android and other >Linux >> >systems, without incurring boot time overhead, or any other command >> >line >> >configuration burden. >> > >> >I spoke to so many people at LPC personally with other kernel >> >contributors, >> >and many folks told me one word - MODULE :D. Even though I >hesitated >> >at >> >first, now it seems the right solution. >> > >> >If no one seriously objects, I'll clean this up and post a v2 and >with >> >the >> >RFC tag taken off. Thank you! >> > >> > - Joel >> >> So let me throw in a different notion. >> >> A kernel module really is nothing other than a kernel build system >artifact stored in the filesystem. >> >> I really don't at any reason whatsoever why this is direct from just >producing an archive and putting it in the module directory, except >that the latter is far simpler. >> >> I see literally *no* problem, social or technical, you are solvin by >actually making it a kernel ELF object. > >Joel does have a point. Suppose we're on Android and we're testing a >random local kernel we've built. We can command the device to boot >into the bootloader, then send our locally-built kernel to the device >with "fastboot boot mykernelz". Having booted the device this wa
Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
On June 1, 2020 6:59:26 AM PDT, Andy Lutomirski wrote: > > >> On Jun 1, 2020, at 2:23 AM, Billy Laws wrote: >> >> >>> >>> On May 30, 2020, at 5:26 PM, Gabriel Krisman Bertazi > wrote: >>> >>> Andy Lutomirski writes: >>> >>> On May 29, 2020, at 11:00 PM, Gabriel Krisman Bertazi > wrote: >> >> Modern Windows applications are executing system call >instructions >> directly from the application's code without going through the >WinAPI. >> This breaks Wine emulation, because it doesn't have a chance to >> intercept and emulate these syscalls before they are submitted to >Linux. >> >> In addition, we cannot simply trap every system call of the >application >> to userspace using PTRACE_SYSEMU, because performance would >suffer, >> since our main use case is to run Windows games over Linux. >Therefore, >> we need some in-kernel filtering to decide whether the syscall >was >> issued by the wine code or by the windows application. Do you really need in-kernel filtering? What if you could have efficient userspace filtering instead? That is, set something up >so that all syscalls, except those from a special address, are >translated to CALL thunk where the thunk is configured per task. Then the >thunk can do whatever emulation is needed. >>> >>> Hi, >>> >>> I suggested something similar to my customer, by using >>> libsyscall-intercept. The idea would be overwritting the syscall >>> instruction with a call to the entry point. I'm not a specialist on >the >>> specifics of Windows games, (cc'ed Paul Gofman, who can provide more >>> details on that side), but as far as I understand, the reason why >that >>> is not feasible is that the anti-cheat protection in games will >abort >>> execution if the binary region was modified either on-disk or >in-memory. >>> >>> Is there some mechanism to do that without modiyfing the >application? >> >> Hi, >> >> I work on an emulator for the Nintendo Switch that uses a similar >technique, >> in our testing it works very well and is much more performant than >even >> PTRACE_SYSEMU. >> >> To work around DRM reading the memory contents I think mprotect could >> be used, after patching the syscall a copy of the original code could >be >> kept somewhere in memory and the patched region mapped --X. >> With this, any time the DRM attempts to read to the patched region >and >> perform integrity checks it will cause a segfault and a branch to the >> signal handler. This handler can then return the contents of the >original, >> unpatched region to satisfy them checks. >> >> Are memory contents checked by DRM solutions too often for this to be >> performant? > >A bigger issue is that hardware support for —X is quite spotty. There >is no x86 CPU that can do it cleanly in a bare metal setup, and client >CPUs that can do it at all without hypervisor help may be nonexistent. >I don’t know if the ARM situation is much better. > >> -- >> Billy Laws >>> Getting the details and especially the interaction with any seccomp filters that may be installed right could be tricky, but the >performance should be decent, at least on non-PTI systems. (If we go this route, I suspect that the correct interaction with seccomp is that this type of redirection takes precedence over >seccomp and seccomp filters are not invoked for redirected syscalls. After >all, a redirected syscall is, functionally, not a syscall at all.) >>> >>> >>> -- >>> Gabriel Krisman Bertazi Running these things in a minimal VM container would allow this kind of filtering/trapping to be done in the VMM, too. 
I don't know how many layers deep you invoke native Linux libraries, and so whether the option would exist to use out-of-range system call numbers for the Linux system calls? -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
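For contrast, the pure-userspace trap-and-emulate path that the in-kernel filter is meant to avoid looks roughly like this: a seccomp filter returns SECCOMP_RET_TRAP, and the SIGSYS handler routes by faulting address. A sketch only -- ip_in_windows_code(), emulate_windows_syscall() and forward_linux_syscall() are stand-ins for Wine internals, and the whole point of the series above is that taking a signal for every native syscall is too slow:

        #define _GNU_SOURCE
        #include <signal.h>
        #include <ucontext.h>

        extern int  ip_in_windows_code(unsigned long ip);          /* stand-in */
        extern long emulate_windows_syscall(int nr, greg_t *regs); /* stand-in */
        extern long forward_linux_syscall(int nr, greg_t *regs);   /* stand-in */

        static void sigsys_handler(int sig, siginfo_t *info, void *uctx)
        {
                ucontext_t *ctx = uctx;
                greg_t *regs = ctx->uc_mcontext.gregs;

                if (ip_in_windows_code(regs[REG_RIP]))
                        /* Issued by the Windows binary: emulate it. */
                        regs[REG_RAX] = emulate_windows_syscall(info->si_syscall, regs);
                else
                        /* Native Wine code: pass it through. */
                        regs[REG_RAX] = forward_linux_syscall(info->si_syscall, regs);
        }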
Re: [PATCH v1 1/3] x86/cpufeatures: Add low performance CRC32C instruction CPU feature
On January 6, 2021 10:37:50 PM PST, Borislav Petkov wrote: >On Thu, Jan 07, 2021 at 02:19:06PM +0800, Tony W Wang-oc wrote: >> SSE4.2 on Zhaoxin CPUs are compatible with Intel. The presence of >> CRC32C instruction is enumerated by CPUID.01H:ECX.SSE4_2[bit 20] = 1. >> Some Zhaoxin CPUs declare support SSE4.2 instruction sets but their >> CRC32C instruction are working with low performance. >> >> Add a synthetic CPU flag to indicates that the CRC32C instruction is >> not working as intended. This low performance CRC32C instruction flag >> is depend on X86_FEATURE_XMM4_2. >> >> Signed-off-by: Tony W Wang-oc >> --- >> arch/x86/include/asm/cpufeatures.h | 1 + >> arch/x86/kernel/cpu/cpuid-deps.c | 1 + >> 2 files changed, 2 insertions(+) >> >> diff --git a/arch/x86/include/asm/cpufeatures.h >b/arch/x86/include/asm/cpufeatures.h >> index 84b8878..9e8151b 100644 >> --- a/arch/x86/include/asm/cpufeatures.h >> +++ b/arch/x86/include/asm/cpufeatures.h >> @@ -292,6 +292,7 @@ >> #define X86_FEATURE_FENCE_SWAPGS_KERNEL (11*32+ 5) /* "" LFENCE in >kernel entry SWAPGS path */ >> #define X86_FEATURE_SPLIT_LOCK_DETECT (11*32+ 6) /* #AC for split >lock */ >> #define X86_FEATURE_PER_THREAD_MBA (11*32+ 7) /* "" Per-thread >Memory Bandwidth Allocation */ >> +#define X86_FEATURE_CRC32C (11*32+ 8) /* "" Low performance CRC32C >instruction */ > >Didn't hpa say to create a BUG flag for it - X86_BUG...? Low >performance >insn sounds like a bug and not a feature to me. > >And call it X86_BUG_CRC32C_SLOW or ..._UNUSABLE to denote what it >means. > >Thx. Yes, it should be a BUG due to the inclusion properties of BUG v FEATURE. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] crypto: x86/crc32c-intel - Don't match some Zhaoxin CPUs
On December 20, 2020 6:46:25 PM PST, tonywwang...@zhaoxin.com wrote: >On December 16, 2020 1:56:45 AM GMT+08:00, Eric Biggers > wrote: >>On Tue, Dec 15, 2020 at 10:15:29AM +0800, Tony W Wang-oc wrote: >>> >>> On 15/12/2020 04:41, Eric Biggers wrote: >>> > On Mon, Dec 14, 2020 at 10:28:19AM +0800, Tony W Wang-oc wrote: >>> >> On 12/12/2020 01:43, Eric Biggers wrote: >>> >>> On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc wrote: >>> The driver crc32c-intel match CPUs supporting >>X86_FEATURE_XMM4_2. >>> On platforms with Zhaoxin CPUs supporting this X86 feature, >When >>> crc32c-intel and crc32c-generic are both registered, system >will >>> use crc32c-intel because its .cra_priority is greater than >>> crc32c-generic. This case expect to use crc32c-generic driver >>for >>> some Zhaoxin CPUs to get performance gain, So remove these >>Zhaoxin >>> CPUs support from crc32c-intel. >>> >>> Signed-off-by: Tony W Wang-oc >>> >>> >>> >>> Does this mean that the performance of the crc32c instruction on >>those CPUs is >>> >>> actually slower than a regular C implementation? That's very >>weird. >>> >>> >>> >> >>> >> From the lmbench3 Create and Delete file test on those chips, I >>think yes. >>> >> >>> > >>> > Did you try measuring the performance of the hashing itself, and >>not some >>> > higher-level filesystem operations? >>> > >>> >>> Yes. Was testing on these Zhaoxin CPUs, the result is that with the >>same >>> input value the generic C implementation takes fewer time than the >>> crc32c instruction implementation. >>> >> >>And that is really "working as intended"? > >These CPU's crc32c instruction is not working as intended. > > Why do these CPUs even >>declare that >>they support the crc32c instruction, when it is so slow? >> > >The presence of crc32c and some other instructions supports are >enumerated by CPUID.01:ECX[SSE4.2] = 1, other instructions are ok >except the crc32c instruction. > >>Are there any other instruction sets (AES-NI, PCLMUL, SSE, SSE2, AVX, >>etc.) that >>these CPUs similarly declare support for but they are uselessly slow? > >No. > >Sincerely >Tonyw > >> >>- Eric Then the right thing to do is to disable the CPUID bit in the vendor-specific startup code. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] crypto: x86/crc32c-intel - Don't match some Zhaoxin CPUs
On December 21, 2020 7:01:39 PM PST, tonywwang...@zhaoxin.com wrote: >On December 22, 2020 3:27:33 AM GMT+08:00, h...@zytor.com wrote: >>On December 20, 2020 6:46:25 PM PST, tonywwang...@zhaoxin.com wrote: >>>On December 16, 2020 1:56:45 AM GMT+08:00, Eric Biggers >>> wrote: On Tue, Dec 15, 2020 at 10:15:29AM +0800, Tony W Wang-oc wrote: > > On 15/12/2020 04:41, Eric Biggers wrote: > > On Mon, Dec 14, 2020 at 10:28:19AM +0800, Tony W Wang-oc wrote: > >> On 12/12/2020 01:43, Eric Biggers wrote: > >>> On Fri, Dec 11, 2020 at 07:29:04PM +0800, Tony W Wang-oc >wrote: > The driver crc32c-intel match CPUs supporting X86_FEATURE_XMM4_2. > On platforms with Zhaoxin CPUs supporting this X86 feature, >>>When > crc32c-intel and crc32c-generic are both registered, system >>>will > use crc32c-intel because its .cra_priority is greater than > crc32c-generic. This case expect to use crc32c-generic driver for > some Zhaoxin CPUs to get performance gain, So remove these Zhaoxin > CPUs support from crc32c-intel. > > Signed-off-by: Tony W Wang-oc > >>> > >>> Does this mean that the performance of the crc32c instruction >>on those CPUs is > >>> actually slower than a regular C implementation? That's very weird. > >>> > >> > >> From the lmbench3 Create and Delete file test on those chips, I think yes. > >> > > > > Did you try measuring the performance of the hashing itself, and not some > > higher-level filesystem operations? > > > > Yes. Was testing on these Zhaoxin CPUs, the result is that with >the same > input value the generic C implementation takes fewer time than the > crc32c instruction implementation. > And that is really "working as intended"? >>> >>>These CPU's crc32c instruction is not working as intended. >>> >>> Why do these CPUs even declare that they support the crc32c instruction, when it is so slow? >>> >>>The presence of crc32c and some other instructions supports are >>>enumerated by CPUID.01:ECX[SSE4.2] = 1, other instructions are ok >>>except the crc32c instruction. >>> Are there any other instruction sets (AES-NI, PCLMUL, SSE, SSE2, >AVX, etc.) that these CPUs similarly declare support for but they are uselessly >slow? >>> >>>No. >>> >>>Sincerely >>>Tonyw >>> - Eric >> >>Then the right thing to do is to disable the CPUID bit in the >>vendor-specific startup code. > >This way makes these CPUs do not support all instruction sets >enumerated >by CPUID.01:ECX[SSE4.2]. >While only crc32c instruction is slow, just expect the crc32c-intel >driver do not >match these CPUs. > >Sincerely >Tonyw Then create a BUG flag for it, or factor out CRC32C into a synthetic flag. We *do not* bury this information in drivers; it becomes a recipe for the same problems over and over. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
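The shape of that suggestion, sketched (X86_BUG_CRC32C_SLOW is the name floated in this thread, not an upstream constant; set_cpu_bug(), boot_cpu_has_bug() and x86_match_cpu() are the real interfaces, and crc32c_is_slow() is an illustrative model check):

        /* arch/x86/kernel/cpu/zhaoxin.c (sketch): record the quirk once. */
        static void init_zhaoxin(struct cpuinfo_x86 *c)
        {
                if (crc32c_is_slow(c))
                        set_cpu_bug(c, X86_BUG_CRC32C_SLOW);
        }

        /* crypto/crc32c-intel_glue.c (sketch): consult the flag instead of
         * hardcoding vendor/model lists in the driver. */
        static int __init crc32c_intel_mod_init(void)
        {
                if (!x86_match_cpu(crc32c_cpu_id) ||
                    boot_cpu_has_bug(X86_BUG_CRC32C_SLOW))
                        return -ENODEV;
                return crypto_register_shash(&alg);
        }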
Re: [PATCH] arch/x86: Propagate $(CLANG_FLAGS) to $(REALMODE_FLAGS)
On December 25, 2020 11:29:30 PM PST, John Millikin wrote: >When compiling with Clang, the `$(CLANG_FLAGS)' variable contains >additional flags needed to cross-compile C and Assembly sources: > >* `-no-integrated-as' tells clang to assemble with GNU Assembler > instead of its built-in LLVM assembler. This flag is set by default > unless `LLVM_IAS=1' is set, because the LLVM assembler can't yet > parse certain GNU extensions. > >* `--target' sets the target architecture when cross-compiling. This > flag must be set for both compilation and assembly (`KBUILD_AFLAGS') > to support architecture-specific assembler directives. > >Signed-off-by: John Millikin >--- > arch/x86/Makefile | 5 + > 1 file changed, 5 insertions(+) > >diff --git a/arch/x86/Makefile b/arch/x86/Makefile >index 7116da3980be..725c65532482 100644 >--- a/arch/x86/Makefile >+++ b/arch/x86/Makefile >@@ -33,6 +33,11 @@ REALMODE_CFLAGS += -ffreestanding > REALMODE_CFLAGS += -fno-stack-protector > REALMODE_CFLAGS += $(call __cc-option, $(CC), $(REALMODE_CFLAGS), >-Wno-address-of-packed-member) > REALMODE_CFLAGS += $(call __cc-option, $(CC), $(REALMODE_CFLAGS), >$(cc_stack_align4)) >+ >+ifdef CONFIG_CC_IS_CLANG >+REALMODE_CFLAGS += $(CLANG_FLAGS) >+endif >+ > export REALMODE_CFLAGS > > # BITS is used as extension for files which are available in a 32 bit Why is CLANG_FLAGS non-null when unused? It would be better to centralize that. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCHSET] saner elf compat
On December 5, 2020 7:23:05 PM PST, Al Viro wrote: >On Thu, Dec 03, 2020 at 11:03:36PM +, Al Viro wrote: >> > > The answer (for mainline) is that mips compat does *NOT* want >> > > COMPAT_BINFMT_ELF. Not a problem with that series, though, so >I'd >> > > retested it (seems to work, both for x86_64 and mips64, execs and >> > > coredumps for all ABIs alike), with centralization of Kconfig >logics >> > > thrown in. >> > >> > Well, the diffstat looks nice: >> > >> > > 26 files changed, 127 insertions(+), 317 deletions(-) >> > >> > and the patches didn't trigger anything for me, but how much did >this >> > get tested? Do you actually have both kinds of 32-bit elf mips >> > binaries around and a machine to test on? >> >> Yes (aptitude install gcc-multilib on debian mips64el/stretch sets >the toolchain >> and libraries just fine, and then it's just a matter of -mabi=n32 >passed >> to gcc). "Machine" is qemu-system-mips64el -machine malta -m 1024 >-cpu 5KEc >> and the things appear to work; I hadn't tried that on the actual >hardware. >> I do have a Loongson-2 box, but it would take a while to dig it out >and >> get it up-to-date. >> >> > Linux-mips was cc'd, but I'm adding Thomas B to the cc here >explicitly >> > just so that he has a heads-up on this thing and can go and look at >> > the mailing list in case it goes to a separate mailbox for him.. >> >> I would certainly appreciate review and testing - this branch sat >> around in the "should post it someday" state since June (it was >> one of the followups grown from regset work back then), and I'm >> _not_ going to ask pulling it without an explicit OK from mips >> folks. > >BTW, there's something curious going on in ELF binary recognition for >x32. Unlike other 64bit architectures, here we have a 32bit binary >that successfully passes the native elf_check_arch(). Usually we >either have different EM_... values for 64bit and 32bit (e.g. for ppc >and sparc) or we have an explicit check for ->e_ident[EI_CLASS] >having the right value (ELFCLASS32 or ELFCLASS64 for 32bit and 64bit >binaries resp.) > >For x32 that's not true - we use EM_X86_64 for ->e_machine and that's >the only thing the native elf_check_arch() is looking at. IOW, >it looks like amd64 elf_load_binary() progresses past elf_check_arch() >for x32 binaries. And gets to load_elf_phdrs(), which would appear >to have a check of its own that should reject the sucker: >/* > * If the size of this structure has changed, then punt, since > * we will be doing the wrong thing. > */ >if (elf_ex->e_phentsize != sizeof(struct elf_phdr)) >goto out; >After all, ->e_phentsize is going to be 32 (sizeof(struct elf32_phdr) >rather than expected 56 (sizeof(struct elf64_phdr)) and off we bugger, >even though it happens at slightly later point than usual. Except that >we are looking at struct elf64_hdr ->e_phentsize - in struct elf32_hdr. >I.e. at offset 54, two bytes past the end of in-file struct elf32_hdr. > >Usually we won't find 0x38 0x00 in that location, so everything works, >but IMO that's too convoluted. > >Peter, is there any reason not to check ->ei_ident[EI_CLASS] in >amd64 elf_check_arch()? It's a 1-byte load from hot cacheline >(offset 4 and we'd just read the 4 bytes at offsets 0..3) + >compare + branch not taken, so performance impact is pretty much >nil. I'm not saying it's a security problem or anything of that >sort, just that it makes the analysis more subtle than it ought >to be... > >Is it about some malformed homegrown 64bit binaries with BS value >at offset 4? Confused... 
I can't think of any. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
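The check being proposed is a one-byte addition to the native macro; a sketch against the definition Al describes (the existing amd64 elf_check_arch() compares only e_machine):

        /* arch/x86/include/asm/elf.h (sketch): also insist on ELFCLASS64,
         * so a 32-bit (x32) header cannot pass the 64-bit check. */
        #define elf_check_arch(x)                       \
                ((x)->e_machine == EM_X86_64 &&         \
                 (x)->e_ident[EI_CLASS] == ELFCLASS64)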
Re: [PATCH] x86, build: remove -m16 workaround for unsupported versions of GCC
On November 30, 2020 5:13:06 PM PST, Nick Desaulniers wrote: >A revert of the following two commits. >commit de3accdaec88 ("x86, build: Build 16-bit code with -m16 where >possible") >commit a9cfccee6604 ("x86, build: Change code16gcc.h from a C header to >an assembly header") > >Since commit 0bddd227f3dc ("Documentation: update for gcc 4.9 >requirement") the minimum supported version of GCC is gcc-4.9. It's >now >safe to remove this code. > >Signed-off-by: Nick Desaulniers >--- > arch/x86/Makefile | 9 + > arch/x86/boot/code16gcc.h | 12 > 2 files changed, 1 insertion(+), 20 deletions(-) > delete mode 100644 arch/x86/boot/code16gcc.h > >diff --git a/arch/x86/Makefile b/arch/x86/Makefile >index 1bf21746f4ce..7116da3980be 100644 >--- a/arch/x86/Makefile >+++ b/arch/x86/Makefile >@@ -24,14 +24,7 @@ endif > ># How to compile the 16-bit code. Note we always compile for >-march=i386; > # that way we can complain to the user if the CPU is insufficient. >-# >-# The -m16 option is supported by GCC >= 4.9 and clang >= 3.5. For >-# older versions of GCC, include an *assembly* header to make sure >that >-# gcc doesn't play any games behind our back. >-CODE16GCC_CFLAGS := -m32 -Wa,$(srctree)/arch/x86/boot/code16gcc.h >-M16_CFLAGS := $(call cc-option, -m16, $(CODE16GCC_CFLAGS)) >- >-REALMODE_CFLAGS := $(M16_CFLAGS) -g -Os -DDISABLE_BRANCH_PROFILING \ >+REALMODE_CFLAGS := -m16 -g -Os -DDISABLE_BRANCH_PROFILING \ > -Wall -Wstrict-prototypes -march=i386 -mregparm=3 \ > -fno-strict-aliasing -fomit-frame-pointer -fno-pic \ > -mno-mmx -mno-sse >diff --git a/arch/x86/boot/code16gcc.h b/arch/x86/boot/code16gcc.h >deleted file mode 100644 >index e19fd7536307.. >--- a/arch/x86/boot/code16gcc.h >+++ /dev/null >@@ -1,12 +0,0 @@ >-/* SPDX-License-Identifier: GPL-2.0 */ >-# >-# code16gcc.h >-# >-# This file is added to the assembler via -Wa when compiling 16-bit C >code. >-# This is done this way instead via asm() to make sure gcc does not >reorder >-# things around us. >-# >-# gcc 4.9+ has a real -m16 option so we can drop this hack long term. >-# >- >- .code16gcc With enthusiasm: Acked-by: H. Peter Anvin -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: The killing of ideal_nops[]
On March 9, 2021 1:24:44 PM PST, Peter Zijlstra wrote: >On Tue, Mar 09, 2021 at 12:05:19PM -0500, Steven Rostedt wrote: >> On Tue, 9 Mar 2021 17:58:17 +0100 >> Peter Zijlstra wrote: >> >> > Hi, >> > >> > AFAICT everything made in the past 10 years ends up using p6_nops. >Is it >> > time to kill off ideal_nops[] and simplify life? >> > >> >> Well, the one bug that was reported recently was due to a box that >uses a >> different "ideal_nops" than p6_nops. Perhaps we should ask him if >there's >> any noticeable difference between using p6_nops for every function >than the >> ideal_nops that as found for that box. > >If the machine is more than a decade old, I'm not really caring about >optimal performance. If it is 32bit, I really couldn't be arsed as long >as it boots. p6_nops don't boot on all 32-bit chips. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] lib: Add shared copy of __lshrti3 from libgcc
On March 15, 2019 3:06:37 PM PDT, Nick Desaulniers wrote: >On Fri, Mar 15, 2019 at 1:54 PM Matthias Kaehlcke >wrote: >> >> The compiler may emit calls to __lshrti3 from the compiler runtime >> library, which results in undefined references: >> >> arch/x86/kvm/x86.o: In function `mul_u64_u64_shr': >> include/linux/math64.h:186: undefined reference to `__lshrti3' > >Looks like Clang will emit this at -Oz (but not -O2): >https://godbolt.org/z/w1_2YC > >> >> Add a copy of the __lshrti3 libgcc routine (from gcc v4.9.2). > >Has it changed since? If so why not a newer version of libgcc_s? > >> >> Include the function for x86 builds with clang, which is the >> environment where the above error was observed. >> >> Signed-off-by: Matthias Kaehlcke >> --- >> arch/x86/Kconfig | 1 + >> include/linux/libgcc.h | 16 >> lib/Kconfig| 3 +++ >> lib/Makefile | 1 + >> lib/lshrti3.c | 31 +++ >> 5 files changed, 52 insertions(+) >> create mode 100644 lib/lshrti3.c >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index c1f9b3cf437c..a5e0d923845d 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -105,6 +105,7 @@ config X86 >> select GENERIC_IRQ_PROBE >> select GENERIC_IRQ_RESERVATION_MODE >> select GENERIC_IRQ_SHOW >> + select GENERIC_LIB_LSHRTI3 if CC_IS_CLANG >> select GENERIC_PENDING_IRQ if SMP >> select GENERIC_SMP_IDLE_THREAD >> select GENERIC_STRNCPY_FROM_USER >> diff --git a/include/linux/libgcc.h b/include/linux/libgcc.h >> index 32e1e0f4b2d0..a71036471838 100644 >> --- a/include/linux/libgcc.h >> +++ b/include/linux/libgcc.h >> @@ -22,15 +22,26 @@ >> #include >> >> typedef int word_type __attribute__ ((mode (__word__))); >> +typedef int TItype __attribute__ ((mode (TI))); > >Well that's an interesting new compiler attribute. >https://stackoverflow.com/questions/4559025/what-does-gcc-attribute-modexx-actually-do >Please use `__mode(TI)` from include/linux/compiler_attributes.h ex. >typedef int TItype __mode(TI); > >> >> #ifdef __BIG_ENDIAN >> struct DWstruct { >> int high, low; >> }; >> + >> +struct DWstruct128 { >> + long long high, low; >> +}; >> + >> #elif defined(__LITTLE_ENDIAN) >> struct DWstruct { >> int low, high; >> }; >> + >> +struct DWstruct128 { >> + long long low, high; >> +}; >> + >> #else >> #error I feel sick. 
>> #endif >> @@ -40,4 +51,9 @@ typedef union { >> long long ll; >> } DWunion; >> >> +typedef union { >> + struct DWstruct128 s; >> + TItype ll; >> +} DWunion128; >> + >> #endif /* __ASM_LIBGCC_H */ >> diff --git a/lib/Kconfig b/lib/Kconfig >> index a9e56539bd11..369e10259ea6 100644 >> --- a/lib/Kconfig >> +++ b/lib/Kconfig >> @@ -624,6 +624,9 @@ config GENERIC_LIB_ASHRDI3 >> config GENERIC_LIB_LSHRDI3 >> bool >> >> +config GENERIC_LIB_LSHRTI3 >> + bool >> + >> config GENERIC_LIB_MULDI3 >> bool >> >> diff --git a/lib/Makefile b/lib/Makefile >> index 4e066120a0d6..42648411f451 100644 >> --- a/lib/Makefile >> +++ b/lib/Makefile >> @@ -277,6 +277,7 @@ obj-$(CONFIG_PARMAN) += parman.o >> obj-$(CONFIG_GENERIC_LIB_ASHLDI3) += ashldi3.o >> obj-$(CONFIG_GENERIC_LIB_ASHRDI3) += ashrdi3.o >> obj-$(CONFIG_GENERIC_LIB_LSHRDI3) += lshrdi3.o >> +obj-$(CONFIG_GENERIC_LIB_LSHRTI3) += lshrti3.o >> obj-$(CONFIG_GENERIC_LIB_MULDI3) += muldi3.o >> obj-$(CONFIG_GENERIC_LIB_CMPDI2) += cmpdi2.o >> obj-$(CONFIG_GENERIC_LIB_UCMPDI2) += ucmpdi2.o >> diff --git a/lib/lshrti3.c b/lib/lshrti3.c >> new file mode 100644 >> index ..2d2123bb3030 >> --- /dev/null >> +++ b/lib/lshrti3.c >> @@ -0,0 +1,31 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> + >> +#include >> +#include >> + >> +long long __lshrti3(long long u, word_type b) >> +{ >> + DWunion128 uu, w; >> + word_type bm; >> + >> + if (b == 0) >> + return u; >> + >> + uu.ll = u; >> + bm = 64 - b; >> + >> + if (bm <= 0) { >> + w.s.high = 0; >> + w.s.low = (unsigned long long) uu.s.high >> -bm; >> + } else { >> + const unsigned long long carries = >> + (unsigned long long) uu.s.high << bm; >> + w.s.high = (unsigned long long) uu.s.high >> b; >> + w.s.low = ((unsigned long long) uu.s.low >> b) | >carries; >> + } >> + >> + return w.ll; >> +} >> +#ifndef BUILD_VDSO >> +EXPORT_SYMBOL(__lshrti3); >> +#endif > >I don't think you want this. Maybe that was carried over from >https://lkml.org/lkml/2019/3/15/669 >by accident? The above linkage error mentions arch/x86/kvm/x86.o >which I wouldn't expect to be linked into the VDSO image? If we compile with the default ABI we can simply link with libgcc; with -mregparm=3 all versions of gcc are broken. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
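For context, the call site that drags __lshrti3 in: with CONFIG_ARCH_SUPPORTS_INT128, mul_u64_u64_shr() does its arithmetic in a 128-bit type, and at -Oz clang may emit the final right shift as a libcall rather than open-coding it. The shape of the 128-bit branch in include/linux/math64.h:

        #ifdef CONFIG_ARCH_SUPPORTS_INT128
        static inline u64 mul_u64_u64_shr(u64 a, u64 mul, unsigned int shift)
        {
                return (u64)(((unsigned __int128)a * mul) >> shift);
        }
        #endif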
Re: [PATCH] lib: Add shared copy of __lshrti3 from libgcc
On March 15, 2019 3:06:37 PM PDT, Nick Desaulniers wrote: >On Fri, Mar 15, 2019 at 1:54 PM Matthias Kaehlcke >wrote: >> >> The compiler may emit calls to __lshrti3 from the compiler runtime >> library, which results in undefined references: >> >> arch/x86/kvm/x86.o: In function `mul_u64_u64_shr': >> include/linux/math64.h:186: undefined reference to `__lshrti3' > >Looks like Clang will emit this at -Oz (but not -O2): >https://godbolt.org/z/w1_2YC > >> >> Add a copy of the __lshrti3 libgcc routine (from gcc v4.9.2). > >Has it changed since? If so why not a newer version of libgcc_s? > >> >> Include the function for x86 builds with clang, which is the >> environment where the above error was observed. >> >> Signed-off-by: Matthias Kaehlcke >> --- >> arch/x86/Kconfig | 1 + >> include/linux/libgcc.h | 16 >> lib/Kconfig| 3 +++ >> lib/Makefile | 1 + >> lib/lshrti3.c | 31 +++ >> 5 files changed, 52 insertions(+) >> create mode 100644 lib/lshrti3.c >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> index c1f9b3cf437c..a5e0d923845d 100644 >> --- a/arch/x86/Kconfig >> +++ b/arch/x86/Kconfig >> @@ -105,6 +105,7 @@ config X86 >> select GENERIC_IRQ_PROBE >> select GENERIC_IRQ_RESERVATION_MODE >> select GENERIC_IRQ_SHOW >> + select GENERIC_LIB_LSHRTI3 if CC_IS_CLANG >> select GENERIC_PENDING_IRQ if SMP >> select GENERIC_SMP_IDLE_THREAD >> select GENERIC_STRNCPY_FROM_USER >> diff --git a/include/linux/libgcc.h b/include/linux/libgcc.h >> index 32e1e0f4b2d0..a71036471838 100644 >> --- a/include/linux/libgcc.h >> +++ b/include/linux/libgcc.h >> @@ -22,15 +22,26 @@ >> #include >> >> typedef int word_type __attribute__ ((mode (__word__))); >> +typedef int TItype __attribute__ ((mode (TI))); > >Well that's an interesting new compiler attribute. >https://stackoverflow.com/questions/4559025/what-does-gcc-attribute-modexx-actually-do >Please use `__mode(TI)` from include/linux/compiler_attributes.h ex. >typedef int TItype __mode(TI); > >> >> #ifdef __BIG_ENDIAN >> struct DWstruct { >> int high, low; >> }; >> + >> +struct DWstruct128 { >> + long long high, low; >> +}; >> + >> #elif defined(__LITTLE_ENDIAN) >> struct DWstruct { >> int low, high; >> }; >> + >> +struct DWstruct128 { >> + long long low, high; >> +}; >> + >> #else >> #error I feel sick. 
>> #endif >> @@ -40,4 +51,9 @@ typedef union { >> long long ll; >> } DWunion; >> >> +typedef union { >> + struct DWstruct128 s; >> + TItype ll; >> +} DWunion128; >> + >> #endif /* __ASM_LIBGCC_H */ >> diff --git a/lib/Kconfig b/lib/Kconfig >> index a9e56539bd11..369e10259ea6 100644 >> --- a/lib/Kconfig >> +++ b/lib/Kconfig >> @@ -624,6 +624,9 @@ config GENERIC_LIB_ASHRDI3 >> config GENERIC_LIB_LSHRDI3 >> bool >> >> +config GENERIC_LIB_LSHRTI3 >> + bool >> + >> config GENERIC_LIB_MULDI3 >> bool >> >> diff --git a/lib/Makefile b/lib/Makefile >> index 4e066120a0d6..42648411f451 100644 >> --- a/lib/Makefile >> +++ b/lib/Makefile >> @@ -277,6 +277,7 @@ obj-$(CONFIG_PARMAN) += parman.o >> obj-$(CONFIG_GENERIC_LIB_ASHLDI3) += ashldi3.o >> obj-$(CONFIG_GENERIC_LIB_ASHRDI3) += ashrdi3.o >> obj-$(CONFIG_GENERIC_LIB_LSHRDI3) += lshrdi3.o >> +obj-$(CONFIG_GENERIC_LIB_LSHRTI3) += lshrti3.o >> obj-$(CONFIG_GENERIC_LIB_MULDI3) += muldi3.o >> obj-$(CONFIG_GENERIC_LIB_CMPDI2) += cmpdi2.o >> obj-$(CONFIG_GENERIC_LIB_UCMPDI2) += ucmpdi2.o >> diff --git a/lib/lshrti3.c b/lib/lshrti3.c >> new file mode 100644 >> index ..2d2123bb3030 >> --- /dev/null >> +++ b/lib/lshrti3.c >> @@ -0,0 +1,31 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> + >> +#include >> +#include >> + >> +long long __lshrti3(long long u, word_type b) >> +{ >> + DWunion128 uu, w; >> + word_type bm; >> + >> + if (b == 0) >> + return u; >> + >> + uu.ll = u; >> + bm = 64 - b; >> + >> + if (bm <= 0) { >> + w.s.high = 0; >> + w.s.low = (unsigned long long) uu.s.high >> -bm; >> + } else { >> + const unsigned long long carries = >> + (unsigned long long) uu.s.high << bm; >> + w.s.high = (unsigned long long) uu.s.high >> b; >> + w.s.low = ((unsigned long long) uu.s.low >> b) | >carries; >> + } >> + >> + return w.ll; >> +} >> +#ifndef BUILD_VDSO >> +EXPORT_SYMBOL(__lshrti3); >> +#endif > >I don't think you want this. Maybe that was carried over from >https://lkml.org/lkml/2019/3/15/669 >by accident? The above linkage error mentions arch/x86/kvm/x86.o >which I wouldn't expect to be linked into the VDSO image? Or just "u64"... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] x86/vdso: include generic __lshrdi3 in 32-bit vDSO
On March 15, 2019 3:29:06 PM PDT, Matthias Kaehlcke wrote: >Hi Nick, > >On Fri, Mar 15, 2019 at 02:31:09PM -0700, 'Nick Desaulniers' via Clang >Built Linux wrote: >> On Fri, Mar 15, 2019 at 12:54 PM Matthias Kaehlcke >wrote: >> > >> > Building the 32-bit vDSO with a recent clang version fails due >> > to undefined symbols: >> > >> > arch/x86/entry/vdso/vdso32.so.dbg: undefined symbols found >> > >> > The undefined symbol in this case is __lshrdi3, which is part of >> > the compiler runtime library, however the vDSO isn't linked against >> > this library. >> > >> > Include the kernel version of __lshrdi3 in the 32-bit vDSO build. >> >> __lshrdi3 is used for "logical shift right double-word by int" (best >> guess), so anywhere there's a right shift of a u64. Looks like >> there's a few of these in arch/x86/entry/vdso/, so it's legal for the >> compiler to emit this libcall. Do you know which function >> specifically in the .so has a relocation referencing __lshrdi3 >> specifically? > >It's the right shifts in do_realtime() and do_monotonic(). > >> Is there a config I can set to reproduce this, in order to help >> test? > >I encountered it with a Chrome OS specific configuration, but a >defconfig should do. Note that you probably need a development version >of clang to reproduce this. > >> > >> > Signed-off-by: Matthias Kaehlcke >> > --- >> > arch/x86/entry/vdso/Makefile | 7 ++- >> > lib/lshrdi3.c| 4 +++- >> > 2 files changed, 9 insertions(+), 2 deletions(-) >> > >> > diff --git a/arch/x86/entry/vdso/Makefile >b/arch/x86/entry/vdso/Makefile >> > index 5bfe2243a08f..7517cd87e10b 100644 >> > --- a/arch/x86/entry/vdso/Makefile >> > +++ b/arch/x86/entry/vdso/Makefile >> > @@ -144,6 +144,7 @@ KBUILD_CFLAGS_32 += $(call cc-option, >-fno-stack-protector) >> > KBUILD_CFLAGS_32 += $(call cc-option, -foptimize-sibling-calls) >> > KBUILD_CFLAGS_32 += -fno-omit-frame-pointer >> > KBUILD_CFLAGS_32 += -DDISABLE_BRANCH_PROFILING >> > +KBUILD_CFLAGS_32 += -DBUILD_VDSO >> > >> > ifdef CONFIG_RETPOLINE >> > ifneq ($(RETPOLINE_VDSO_CFLAGS),) >> > @@ -153,12 +154,16 @@ endif >> > >> > $(obj)/vdso32.so.dbg: KBUILD_CFLAGS = $(KBUILD_CFLAGS_32) >> > >> > +$(obj)/vdso32/lshrdi3.o: $(srctree)/lib/lshrdi3.c FORCE >> > + $(call if_changed_rule,cc_o_c) >> >> + Masahiro to help look at this part (I don't understand this part >> of kbuild). > >I bluntly stole that from arch/x86/purgatory/Makefile , which does >something similar. > >> >> > + >> > $(obj)/vdso32.so.dbg: FORCE \ >> > $(obj)/vdso32/vdso32.lds \ >> > $(obj)/vdso32/vclock_gettime.o \ >> > $(obj)/vdso32/note.o \ >> > $(obj)/vdso32/system_call.o \ >> > - $(obj)/vdso32/sigreturn.o >> > + $(obj)/vdso32/sigreturn.o \ >> > + $(obj)/vdso32/lshrdi3.o >> > $(call if_changed,vdso) >> > >> > # >> > diff --git a/lib/lshrdi3.c b/lib/lshrdi3.c >> > index 99cfa5721f2d..8a4fc6bcf3a4 100644 >> > --- a/lib/lshrdi3.c >> > +++ b/lib/lshrdi3.c >> > @@ -16,7 +16,7 @@ >> > * to the Free Software Foundation, Inc. >> > */ >> > >> > -#include >> > +#include >> >> Is this a simple cleanup, or? > >The vDSO build is unhappy when modules.h draws in a whole bunch of >other kernel headers and export.h is all that's need. It seemed >reasonable to do the 'cleanup' in this patch since we touch it anyway >to place EXPORT_SYMBOL within an #ifdef. 
> >> > #include <linux/libgcc.h> >> > >> > long long notrace __lshrdi3(long long u, >word_type b) >> > @@ -42,4 +42,6 @@ long long notrace __lshrdi3(long long u, >word_type b) >> > >> > return w.ll; >> > } >> > +#ifndef BUILD_VDSO >> > EXPORT_SYMBOL(__lshrdi3); >> > +#endif >> >> Compilers (GCC and Clang) will always assume their runtime has these >> helper functions; whether or not they emit libcalls vs inline >routines >> is implementation defined. So I agree with this patch; I just would >> like to help confirm/test it. > >Thanks for your help! > >Matthias Note: there is also probably no reason to use -Os/-Oz for the vDSO. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
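For context, the kind of source that makes the compiler emit this libcall is just a variable right shift of a 64-bit value on a 32-bit target; whether it becomes an inline shrdl/shrl sequence or a call to __lshrdi3 is the compiler's choice. A minimal illustration (hypothetical function, not the actual vDSO source):

#include <linux/types.h>

u64 shr64(u64 v, unsigned int n)
{
	return v >> n;	/* on 32-bit x86 this may compile to "call __lshrdi3" */
}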
Re: [PATCH] lib: Add shared copy of __lshrti3 from libgcc
On March 15, 2019 4:34:10 PM PDT, Matthias Kaehlcke wrote: >Hi Peter, > >On Fri, Mar 15, 2019 at 03:08:57PM -0700, h...@zytor.com wrote: >> On March 15, 2019 3:06:37 PM PDT, Nick Desaulniers > wrote: >> >On Fri, Mar 15, 2019 at 1:54 PM Matthias Kaehlcke >> >wrote: >> >> >> >> The compiler may emit calls to __lshrti3 from the compiler runtime >> >> library, which results in undefined references: >> >> >> >> arch/x86/kvm/x86.o: In function `mul_u64_u64_shr': >> >> include/linux/math64.h:186: undefined reference to `__lshrti3' >> > >> >Looks like Clang will emit this at -Oz (but not -O2): >> >https://godbolt.org/z/w1_2YC >> > >> >> >> >> Add a copy of the __lshrti3 libgcc routine (from gcc v4.9.2). >> > >> >Has it changed since? If so why not a newer version of libgcc_s? >> > >> >> >> >> Include the function for x86 builds with clang, which is the >> >> environment where the above error was observed. >> >> >> >> Signed-off-by: Matthias Kaehlcke >> >> --- >> >> arch/x86/Kconfig | 1 + >> >> include/linux/libgcc.h | 16 >> >> lib/Kconfig| 3 +++ >> >> lib/Makefile | 1 + >> >> lib/lshrti3.c | 31 +++ >> >> 5 files changed, 52 insertions(+) >> >> create mode 100644 lib/lshrti3.c >> >> >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> >> index c1f9b3cf437c..a5e0d923845d 100644 >> >> --- a/arch/x86/Kconfig >> >> +++ b/arch/x86/Kconfig >> >> @@ -105,6 +105,7 @@ config X86 >> >> select GENERIC_IRQ_PROBE >> >> select GENERIC_IRQ_RESERVATION_MODE >> >> select GENERIC_IRQ_SHOW >> >> + select GENERIC_LIB_LSHRTI3 if CC_IS_CLANG >> >> select GENERIC_PENDING_IRQ if SMP >> >> select GENERIC_SMP_IDLE_THREAD >> >> select GENERIC_STRNCPY_FROM_USER >> >> diff --git a/include/linux/libgcc.h b/include/linux/libgcc.h >> >> index 32e1e0f4b2d0..a71036471838 100644 >> >> --- a/include/linux/libgcc.h >> >> +++ b/include/linux/libgcc.h >> >> @@ -22,15 +22,26 @@ >> >> #include >> >> >> >> typedef int word_type __attribute__ ((mode (__word__))); >> >> +typedef int TItype __attribute__ ((mode (TI))); >> > >> >Well that's an interesting new compiler attribute. >> >>https://stackoverflow.com/questions/4559025/what-does-gcc-attribute-modexx-actually-do >> >Please use `__mode(TI)` from include/linux/compiler_attributes.h ex. >> >typedef int TItype __mode(TI); >> > >> >> >> >> #ifdef __BIG_ENDIAN >> >> struct DWstruct { >> >> int high, low; >> >> }; >> >> + >> >> +struct DWstruct128 { >> >> + long long high, low; >> >> +}; >> >> + >> >> #elif defined(__LITTLE_ENDIAN) >> >> struct DWstruct { >> >> int low, high; >> >> }; >> >> + >> >> +struct DWstruct128 { >> >> + long long low, high; >> >> +}; >> >> + >> >> #else >> >> #error I feel sick. 
>> >> #endif >> >> @@ -40,4 +51,9 @@ typedef union { >> >> long long ll; >> >> } DWunion; >> >> >> >> +typedef union { >> >> + struct DWstruct128 s; >> >> + TItype ll; >> >> +} DWunion128; >> >> + >> >> #endif /* __ASM_LIBGCC_H */ >> >> diff --git a/lib/Kconfig b/lib/Kconfig >> >> index a9e56539bd11..369e10259ea6 100644 >> >> --- a/lib/Kconfig >> >> +++ b/lib/Kconfig >> >> @@ -624,6 +624,9 @@ config GENERIC_LIB_ASHRDI3 >> >> config GENERIC_LIB_LSHRDI3 >> >> bool >> >> >> >> +config GENERIC_LIB_LSHRTI3 >> >> + bool >> >> + >> >> config GENERIC_LIB_MULDI3 >> >> bool >> >> >> >> diff --git a/lib/Makefile b/lib/Makefile >> >> index 4e066120a0d6..42648411f451 100644 >> >> --- a/lib/Makefile >> >> +++ b/lib/Makefile >> >> @@ -277,6 +277,7 @@ obj-$(CONFIG_PARMAN) += parman.o >> >> obj-$(CONFIG_GENERIC_LIB_ASHLDI3) += ashldi3.o >> >> obj-$(CONFIG_GENERIC_LIB_ASHRDI3) += ashrdi3.o >> >> obj-$(CONFIG_GENERIC_LIB_LSHRDI3) += lshrdi3.o >> >> +obj-$(CONFIG_GENERIC_LIB_LSHRTI3) += lshrti3.o >> >> obj-$(CONFIG_GENERIC_LIB_MULDI3) += muldi3.o >> >> obj-$(CONFIG_GENERIC_LIB_CMPDI2) += cmpdi2.o >> >> obj-$(CONFIG_GENERIC_LIB_UCMPDI2) += ucmpdi2.o >> >> diff --git a/lib/lshrti3.c b/lib/lshrti3.c >> >> new file mode 100644 >> >> index ..2d2123bb3030 >> >> --- /dev/null >> >> +++ b/lib/lshrti3.c >> >> @@ -0,0 +1,31 @@ >> >> +// SPDX-License-Identifier: GPL-2.0 >> >> + >> >> +#include >> >> +#include >> >> + >> >> +long long __lshrti3(long long u, word_type b) >> >> +{ >> >> + DWunion128 uu, w; >> >> + word_type bm; >> >> + >> >> + if (b == 0) >> >> + return u; >> >> + >> >> + uu.ll = u; >> >> + bm = 64 - b; >> >> + >> >> + if (bm <= 0) { >> >> + w.s.high = 0; >> >> + w.s.low = (unsigned long long) uu.s.high >> -bm; >> >> + } else { >> >> + const unsigned long long carries = >> >> + (unsigned long long) uu.s.high << bm; >> >> + w.s.high = (unsigned long long) uu.s.high >> b; >> >> + w.s.low = ((unsigned l
Re: [PATCH] lib: Add shared copy of __lshrti3 from libgcc
On March 15, 2019 4:47:01 PM PDT, Matthias Kaehlcke wrote: >On Fri, Mar 15, 2019 at 03:12:08PM -0700, h...@zytor.com wrote: >> On March 15, 2019 3:06:37 PM PDT, Nick Desaulniers > wrote: >> >On Fri, Mar 15, 2019 at 1:54 PM Matthias Kaehlcke >> >wrote: >> >> >> >> The compiler may emit calls to __lshrti3 from the compiler runtime >> >> library, which results in undefined references: >> >> >> >> arch/x86/kvm/x86.o: In function `mul_u64_u64_shr': >> >> include/linux/math64.h:186: undefined reference to `__lshrti3' >> > >> >Looks like Clang will emit this at -Oz (but not -O2): >> >https://godbolt.org/z/w1_2YC >> > >> >> >> >> Add a copy of the __lshrti3 libgcc routine (from gcc v4.9.2). >> > >> >Has it changed since? If so why not a newer version of libgcc_s? >> > >> >> >> >> Include the function for x86 builds with clang, which is the >> >> environment where the above error was observed. >> >> >> >> Signed-off-by: Matthias Kaehlcke >> >> --- >> >> arch/x86/Kconfig | 1 + >> >> include/linux/libgcc.h | 16 >> >> lib/Kconfig| 3 +++ >> >> lib/Makefile | 1 + >> >> lib/lshrti3.c | 31 +++ >> >> 5 files changed, 52 insertions(+) >> >> create mode 100644 lib/lshrti3.c >> >> >> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig >> >> index c1f9b3cf437c..a5e0d923845d 100644 >> >> --- a/arch/x86/Kconfig >> >> +++ b/arch/x86/Kconfig >> >> @@ -105,6 +105,7 @@ config X86 >> >> select GENERIC_IRQ_PROBE >> >> select GENERIC_IRQ_RESERVATION_MODE >> >> select GENERIC_IRQ_SHOW >> >> + select GENERIC_LIB_LSHRTI3 if CC_IS_CLANG >> >> select GENERIC_PENDING_IRQ if SMP >> >> select GENERIC_SMP_IDLE_THREAD >> >> select GENERIC_STRNCPY_FROM_USER >> >> diff --git a/include/linux/libgcc.h b/include/linux/libgcc.h >> >> index 32e1e0f4b2d0..a71036471838 100644 >> >> --- a/include/linux/libgcc.h >> >> +++ b/include/linux/libgcc.h >> >> @@ -22,15 +22,26 @@ >> >> #include >> >> >> >> typedef int word_type __attribute__ ((mode (__word__))); >> >> +typedef int TItype __attribute__ ((mode (TI))); >> > >> >Well that's an interesting new compiler attribute. >> >>https://stackoverflow.com/questions/4559025/what-does-gcc-attribute-modexx-actually-do >> >Please use `__mode(TI)` from include/linux/compiler_attributes.h ex. >> >typedef int TItype __mode(TI); >> > >> >> >> >> #ifdef __BIG_ENDIAN >> >> struct DWstruct { >> >> int high, low; >> >> }; >> >> + >> >> +struct DWstruct128 { >> >> + long long high, low; >> >> +}; >> >> + >> >> #elif defined(__LITTLE_ENDIAN) >> >> struct DWstruct { >> >> int low, high; >> >> }; >> >> + >> >> +struct DWstruct128 { >> >> + long long low, high; >> >> +}; >> >> + >> >> #else >> >> #error I feel sick. 
>> >> #endif >> >> @@ -40,4 +51,9 @@ typedef union { >> >> long long ll; >> >> } DWunion; >> >> >> >> +typedef union { >> >> + struct DWstruct128 s; >> >> + TItype ll; >> >> +} DWunion128; >> >> + >> >> #endif /* __ASM_LIBGCC_H */ >> >> diff --git a/lib/Kconfig b/lib/Kconfig >> >> index a9e56539bd11..369e10259ea6 100644 >> >> --- a/lib/Kconfig >> >> +++ b/lib/Kconfig >> >> @@ -624,6 +624,9 @@ config GENERIC_LIB_ASHRDI3 >> >> config GENERIC_LIB_LSHRDI3 >> >> bool >> >> >> >> +config GENERIC_LIB_LSHRTI3 >> >> + bool >> >> + >> >> config GENERIC_LIB_MULDI3 >> >> bool >> >> >> >> diff --git a/lib/Makefile b/lib/Makefile >> >> index 4e066120a0d6..42648411f451 100644 >> >> --- a/lib/Makefile >> >> +++ b/lib/Makefile >> >> @@ -277,6 +277,7 @@ obj-$(CONFIG_PARMAN) += parman.o >> >> obj-$(CONFIG_GENERIC_LIB_ASHLDI3) += ashldi3.o >> >> obj-$(CONFIG_GENERIC_LIB_ASHRDI3) += ashrdi3.o >> >> obj-$(CONFIG_GENERIC_LIB_LSHRDI3) += lshrdi3.o >> >> +obj-$(CONFIG_GENERIC_LIB_LSHRTI3) += lshrti3.o >> >> obj-$(CONFIG_GENERIC_LIB_MULDI3) += muldi3.o >> >> obj-$(CONFIG_GENERIC_LIB_CMPDI2) += cmpdi2.o >> >> obj-$(CONFIG_GENERIC_LIB_UCMPDI2) += ucmpdi2.o >> >> diff --git a/lib/lshrti3.c b/lib/lshrti3.c >> >> new file mode 100644 >> >> index ..2d2123bb3030 >> >> --- /dev/null >> >> +++ b/lib/lshrti3.c >> >> @@ -0,0 +1,31 @@ >> >> +// SPDX-License-Identifier: GPL-2.0 >> >> + >> >> +#include >> >> +#include >> >> + >> >> +long long __lshrti3(long long u, word_type b) >> >> +{ >> >> + DWunion128 uu, w; >> >> + word_type bm; >> >> + >> >> + if (b == 0) >> >> + return u; >> >> + >> >> + uu.ll = u; >> >> + bm = 64 - b; >> >> + >> >> + if (bm <= 0) { >> >> + w.s.high = 0; >> >> + w.s.low = (unsigned long long) uu.s.high >> -bm; >> >> + } else { >> >> + const unsigned long long carries = >> >> + (unsigned long long) uu.s.high << bm; >> >> + w.s.high = (unsigned long long) uu.s.high >> b; >> >> + w.s.low = ((unsigned long long) uu.
Re: [PATCH] lib: Add shared copy of __lshrti3 from libgcc
On March 18, 2019 2:31:13 PM PDT, Matthias Kaehlcke wrote: >On Fri, Mar 15, 2019 at 01:54:50PM -0700, Matthias Kaehlcke wrote: >> The compiler may emit calls to __lshrti3 from the compiler runtime >> library, which results in undefined references: >> >> arch/x86/kvm/x86.o: In function `mul_u64_u64_shr': >> include/linux/math64.h:186: undefined reference to `__lshrti3' >> >> Add a copy of the __lshrti3 libgcc routine (from gcc v4.9.2). >> >> Include the function for x86 builds with clang, which is the >> environment where the above error was observed. >> >> Signed-off-by: Matthias Kaehlcke > >With "Revert "kbuild: use -Oz instead of -Os when using clang" >(https://lore.kernel.org/patchwork/patch/1051932/) the above >error is fixed, a few comments inline for if the patch is >resurrected in the future because __lshrti3 is emitted in a >different context. > >> diff --git a/include/linux/libgcc.h b/include/linux/libgcc.h >> index 32e1e0f4b2d0..a71036471838 100644 >> --- a/include/linux/libgcc.h >> +++ b/include/linux/libgcc.h >> @@ -22,15 +22,26 @@ >> #include >> >> typedef int word_type __attribute__ ((mode (__word__))); >> +typedef int TItype __attribute__ ((mode (TI))); > >Consider using __int128 instead. Definition and use need a >'defined(__SIZEOF_INT128__)' guard (similar for mode (TI)), since >these 128 bit types aren't supported on all platforms. > >> #ifdef __BIG_ENDIAN >> struct DWstruct { >> int high, low; >> }; >> + >> +struct DWstruct128 { >> +long long high, low; >> +}; > >This struct isn't needed, struct DWstruct can be used. > >> diff --git a/lib/lshrti3.c b/lib/lshrti3.c >> new file mode 100644 >> index ..2d2123bb3030 >> --- /dev/null >> +++ b/lib/lshrti3.c >> @@ -0,0 +1,31 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> + >> +#include >> +#include >> + >> +long long __lshrti3(long long u, word_type b) > >use TItype for input/output, which is what gcc does, though the above >matches the interface in the documentation. > >> +{ >> +DWunion128 uu, w; >> +word_type bm; >> + >> +if (b == 0) >> +return u; >> + >> +uu.ll = u; >> +bm = 64 - b; >> + >> +if (bm <= 0) { >> +w.s.high = 0; >> +w.s.low = (unsigned long long) uu.s.high >> -bm; > >include and use u64 instead of unsigned long long. Ok, now I'm really puzzled. How could we need a 128-bit shift when the prototype only has 64 bits of input?! -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
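The guard suggested above would look something like this; a sketch, relying on __SIZEOF_INT128__, which GCC and Clang predefine on targets with 128-bit integer support:

#ifdef __SIZEOF_INT128__
typedef int TItype __attribute__((mode(TI)));
#else
/*
 * No 128-bit integer support: the compiler can never emit a
 * __lshrti3 libcall here, so the helper can simply be omitted.
 */
#endif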
Re: [PATCH] Revert "kbuild: use -Oz instead of -Os when using clang"
On March 18, 2019 2:56:05 PM PDT, Matthias Kaehlcke wrote: >On Mon, Mar 18, 2019 at 02:47:13PM -0700, 'Nick Desaulniers' via Clang >Built Linux wrote: >> On Mon, Mar 18, 2019 at 2:10 PM Matthias Kaehlcke >wrote: >> > >> > The clang option -Oz enables *aggressive* optimization for size, >> > which doesn't necessarily result in smaller images, but can have >> > negative impact on performance. Switch back to the less aggressive >> > -Os. >> > >> > This reverts commit 6748cb3c299de1ffbe56733647b01dbcc398c419. >> >> Does scripts/checkpatch.pl complain about this format (full sha, no >> title)? > >Nope, for reverts checkpatch is happy with it, I suppose because by >default the subject of the reverted patch is part of the revert's >subject. > >> Not sure that necessitates a v2, but if so would be good to >> mention that Clang generates excessive calls into libgcc/compiler-rt >> at -Oz. Thanks for the patch and the discussion/sanity check. > >I can send a v2 if 'needed'. > >> Reviewed-by: Nick Desaulniers > >Thanks! > >Matthias > >> > >> > Suggested-by: Peter Zijlstra >> > Signed-off-by: Matthias Kaehlcke >> > --- >> > Makefile | 2 +- >> > 1 file changed, 1 insertion(+), 1 deletion(-) >> > >> > diff --git a/Makefile b/Makefile >> > index 9ef547fc7ffe..191f3ce3cb5e 100644 >> > --- a/Makefile >> > +++ b/Makefile >> > @@ -667,7 +667,7 @@ KBUILD_CFLAGS += $(call >cc-disable-warning, format-overflow) >> > KBUILD_CFLAGS += $(call cc-disable-warning, int-in-bool-context) >> > >> > ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE >> > -KBUILD_CFLAGS += $(call cc-option,-Oz,-Os) >> > +KBUILD_CFLAGS += -Os >> > else >> > KBUILD_CFLAGS += -O2 >> > endif >> >> >> Would be nice, still, to have -Ok. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] lib: Add shared copy of __lshrti3 from libgcc
On March 18, 2019 3:16:39 PM PDT, Matthias Kaehlcke wrote: >On Mon, Mar 18, 2019 at 02:50:44PM -0700, h...@zytor.com wrote: >> On March 18, 2019 2:31:13 PM PDT, Matthias Kaehlcke > wrote: >> >On Fri, Mar 15, 2019 at 01:54:50PM -0700, Matthias Kaehlcke wrote: >> >> The compiler may emit calls to __lshrti3 from the compiler runtime >> >> library, which results in undefined references: >> >> >> >> arch/x86/kvm/x86.o: In function `mul_u64_u64_shr': >> >> include/linux/math64.h:186: undefined reference to `__lshrti3' >> >> >> >> Add a copy of the __lshrti3 libgcc routine (from gcc v4.9.2). >> >> >> >> Include the function for x86 builds with clang, which is the >> >> environment where the above error was observed. >> >> >> >> Signed-off-by: Matthias Kaehlcke >> > >> >With "Revert "kbuild: use -Oz instead of -Os when using clang" >> >(https://lore.kernel.org/patchwork/patch/1051932/) the above >> >error is fixed, a few comments inline for if the patch is >> >resurrected in the future because __lshrti3 is emitted in a >> >different context. >> > >> >> diff --git a/include/linux/libgcc.h b/include/linux/libgcc.h >> >> index 32e1e0f4b2d0..a71036471838 100644 >> >> --- a/include/linux/libgcc.h >> >> +++ b/include/linux/libgcc.h >> >> @@ -22,15 +22,26 @@ >> >> #include >> >> >> >> typedef int word_type __attribute__ ((mode (__word__))); >> >> +typedef int TItype __attribute__ ((mode (TI))); >> > >> >Consider using __int128 instead. Definition and use need a >> >'defined(__SIZEOF_INT128__)' guard (similar for mode (TI)), since >> >these 128 bit types aren't supported on all platforms. >> > >> >> #ifdef __BIG_ENDIAN >> >> struct DWstruct { >> >> int high, low; >> >> }; >> >> + >> >> +struct DWstruct128 { >> >> + long long high, low; >> >> +}; >> > >> >This struct isn't needed, struct DWstruct can be used. >> > >> >> diff --git a/lib/lshrti3.c b/lib/lshrti3.c >> >> new file mode 100644 >> >> index ..2d2123bb3030 >> >> --- /dev/null >> >> +++ b/lib/lshrti3.c >> >> @@ -0,0 +1,31 @@ >> >> +// SPDX-License-Identifier: GPL-2.0 >> >> + >> >> +#include >> >> +#include >> >> + >> >> +long long __lshrti3(long long u, word_type b) >> > >> >use TItype for input/output, which is what gcc does, though the >above >> >matches the interface in the documentation. >> > >> >> +{ >> >> + DWunion128 uu, w; >> >> + word_type bm; >> >> + >> >> + if (b == 0) >> >> + return u; >> >> + >> >> + uu.ll = u; >> >> + bm = 64 - b; >> >> + >> >> + if (bm <= 0) { >> >> + w.s.high = 0; >> >> + w.s.low = (unsigned long long) uu.s.high >> -bm; >> > >> >include and use u64 instead of unsigned long long. >> >> Ok, now I'm really puzzled. >> >> How could we need a 128-bit shift when the prototype only has 64 bits >of input?! > >Good question, this is the code from libgcc: > >TItype >__lshrti3 (TItype u, shift_count_type b) >{ > if (b == 0) >return u; > > const DWunion uu = {.ll = u}; > const shift_count_type bm = (8 * (8)) - b; > DWunion w; > > if (bm <= 0) >{ > w.s.high = 0; > w.s.low = (UDItype) uu.s.high >> -bm; >} > else >{ > const UDItype carries = (UDItype) uu.s.high << bm; > > w.s.high = (UDItype) uu.s.high >> b; > w.s.low = ((UDItype) uu.s.low >> b) | carries; >} > > return w.ll; >} > > >My compiler knowledge is limited, my guess is that the function is a >generic implementation, and while a long long is 64-bit wide under >Linux it could be 128-bit on other platforms. Yes, long long is just plain wrong. How could we end up calling this function on 32 bits?! -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH] lib: Add shared copy of __lshrti3 from libgcc
On March 18, 2019 4:52:19 PM PDT, Matthias Kaehlcke wrote: >On Mon, Mar 18, 2019 at 04:44:03PM -0700, h...@zytor.com wrote: >> On March 18, 2019 3:16:39 PM PDT, Matthias Kaehlcke > wrote: >> >On Mon, Mar 18, 2019 at 02:50:44PM -0700, h...@zytor.com wrote: >> >> On March 18, 2019 2:31:13 PM PDT, Matthias Kaehlcke >> > wrote: >> >> >On Fri, Mar 15, 2019 at 01:54:50PM -0700, Matthias Kaehlcke >wrote: >> >> >> The compiler may emit calls to __lshrti3 from the compiler >runtime >> >> >> library, which results in undefined references: >> >> >> >> >> >> arch/x86/kvm/x86.o: In function `mul_u64_u64_shr': >> >> >> include/linux/math64.h:186: undefined reference to >`__lshrti3' >> >> >> >> >> >> Add a copy of the __lshrti3 libgcc routine (from gcc v4.9.2). >> >> >> >> >> >> Include the function for x86 builds with clang, which is the >> >> >> environment where the above error was observed. >> >> >> >> >> >> Signed-off-by: Matthias Kaehlcke >> >> > >> >> >With "Revert "kbuild: use -Oz instead of -Os when using clang" >> >> >(https://lore.kernel.org/patchwork/patch/1051932/) the above >> >> >error is fixed, a few comments inline for if the patch is >> >> >resurrected in the future because __lshrti3 is emitted in a >> >> >different context. >> >> > >> >> >> diff --git a/include/linux/libgcc.h b/include/linux/libgcc.h >> >> >> index 32e1e0f4b2d0..a71036471838 100644 >> >> >> --- a/include/linux/libgcc.h >> >> >> +++ b/include/linux/libgcc.h >> >> >> @@ -22,15 +22,26 @@ >> >> >> #include >> >> >> >> >> >> typedef int word_type __attribute__ ((mode (__word__))); >> >> >> +typedef int TItype __attribute__ ((mode (TI))); >> >> > >> >> >Consider using __int128 instead. Definition and use need a >> >> >'defined(__SIZEOF_INT128__)' guard (similar for mode (TI)), >since >> >> >these 128 bit types aren't supported on all platforms. >> >> > >> >> >> #ifdef __BIG_ENDIAN >> >> >> struct DWstruct { >> >> >>int high, low; >> >> >> }; >> >> >> + >> >> >> +struct DWstruct128 { >> >> >> + long long high, low; >> >> >> +}; >> >> > >> >> >This struct isn't needed, struct DWstruct can be used. >> >> > >> >> >> diff --git a/lib/lshrti3.c b/lib/lshrti3.c >> >> >> new file mode 100644 >> >> >> index ..2d2123bb3030 >> >> >> --- /dev/null >> >> >> +++ b/lib/lshrti3.c >> >> >> @@ -0,0 +1,31 @@ >> >> >> +// SPDX-License-Identifier: GPL-2.0 >> >> >> + >> >> >> +#include >> >> >> +#include >> >> >> + >> >> >> +long long __lshrti3(long long u, word_type b) >> >> > >> >> >use TItype for input/output, which is what gcc does, though the >> >above >> >> >matches the interface in the documentation. >> >> > >> >> >> +{ >> >> >> + DWunion128 uu, w; >> >> >> + word_type bm; >> >> >> + >> >> >> + if (b == 0) >> >> >> + return u; >> >> >> + >> >> >> + uu.ll = u; >> >> >> + bm = 64 - b; >> >> >> + >> >> >> + if (bm <= 0) { >> >> >> + w.s.high = 0; >> >> >> + w.s.low = (unsigned long long) uu.s.high >> -bm; >> >> > >> >> >include and use u64 instead of unsigned long >long. >> >> >> >> Ok, now I'm really puzzled. >> >> >> >> How could we need a 128-bit shift when the prototype only has 64 >bits >> >of input?! 
>> > >> >Good question, this is the code from libgcc: >> > >> >TItype >> >__lshrti3 (TItype u, shift_count_type b) >> >{ >> > if (b == 0) >> >return u; >> > >> > const DWunion uu = {.ll = u}; >> > const shift_count_type bm = (8 * (8)) - b; >> > DWunion w; >> > >> > if (bm <= 0) >> >{ >> > w.s.high = 0; >> > w.s.low = (UDItype) uu.s.high >> -bm; >> >} >> > else >> >{ >> > const UDItype carries = (UDItype) uu.s.high << bm; >> > >> > w.s.high = (UDItype) uu.s.high >> b; >> > w.s.low = ((UDItype) uu.s.low >> b) | carries; >> >} >> > >> > return w.ll; >> >} >> > >> > >> >My compiler knowledge is limited, my guess is that the function is a >> >generic implementation, and while a long long is 64-bit wide under >> >Linux it could be 128-bit on other platforms. >> >> Yes, long long is just plain wrong. >> >> How could we end up calling this function on 32 bits?! > >We didn't, in this case the function is called in 64-bit code >(arch/x86/kvm/x86.o: In function `mul_u64_u64_shr'), for the 32-bit >vDSO it was __lshrdi3. That makes more sense. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
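For reference, the 64-bit caller at issue builds a 128-bit intermediate product, which is where the 128-bit shift comes from; roughly what include/linux/math64.h does when __SIZEOF_INT128__ is available (simplified sketch):

#ifdef __SIZEOF_INT128__
static inline u64 mul_u64_u64_shr(u64 a, u64 mul, unsigned int shift)
{
	/* full 64x64->128 multiply, then a 128-bit right shift */
	return (u64)(((unsigned __int128)a * mul) >> shift);
}
#endif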
Re: [PATCH] lib: Add shared copy of __lshrti3 from libgcc
On March 18, 2019 4:52:19 PM PDT, Matthias Kaehlcke wrote: >On Mon, Mar 18, 2019 at 04:44:03PM -0700, h...@zytor.com wrote: >> On March 18, 2019 3:16:39 PM PDT, Matthias Kaehlcke > wrote: >> >On Mon, Mar 18, 2019 at 02:50:44PM -0700, h...@zytor.com wrote: >> >> On March 18, 2019 2:31:13 PM PDT, Matthias Kaehlcke >> > wrote: >> >> >On Fri, Mar 15, 2019 at 01:54:50PM -0700, Matthias Kaehlcke >wrote: >> >> >> The compiler may emit calls to __lshrti3 from the compiler >runtime >> >> >> library, which results in undefined references: >> >> >> >> >> >> arch/x86/kvm/x86.o: In function `mul_u64_u64_shr': >> >> >> include/linux/math64.h:186: undefined reference to >`__lshrti3' >> >> >> >> >> >> Add a copy of the __lshrti3 libgcc routine (from gcc v4.9.2). >> >> >> >> >> >> Include the function for x86 builds with clang, which is the >> >> >> environment where the above error was observed. >> >> >> >> >> >> Signed-off-by: Matthias Kaehlcke >> >> > >> >> >With "Revert "kbuild: use -Oz instead of -Os when using clang" >> >> >(https://lore.kernel.org/patchwork/patch/1051932/) the above >> >> >error is fixed, a few comments inline for if the patch is >> >> >resurrected in the future because __lshrti3 is emitted in a >> >> >different context. >> >> > >> >> >> diff --git a/include/linux/libgcc.h b/include/linux/libgcc.h >> >> >> index 32e1e0f4b2d0..a71036471838 100644 >> >> >> --- a/include/linux/libgcc.h >> >> >> +++ b/include/linux/libgcc.h >> >> >> @@ -22,15 +22,26 @@ >> >> >> #include >> >> >> >> >> >> typedef int word_type __attribute__ ((mode (__word__))); >> >> >> +typedef int TItype __attribute__ ((mode (TI))); >> >> > >> >> >Consider using __int128 instead. Definition and use need a >> >> >'defined(__SIZEOF_INT128__)' guard (similar for mode (TI)), >since >> >> >these 128 bit types aren't supported on all platforms. >> >> > >> >> >> #ifdef __BIG_ENDIAN >> >> >> struct DWstruct { >> >> >>int high, low; >> >> >> }; >> >> >> + >> >> >> +struct DWstruct128 { >> >> >> + long long high, low; >> >> >> +}; >> >> > >> >> >This struct isn't needed, struct DWstruct can be used. >> >> > >> >> >> diff --git a/lib/lshrti3.c b/lib/lshrti3.c >> >> >> new file mode 100644 >> >> >> index ..2d2123bb3030 >> >> >> --- /dev/null >> >> >> +++ b/lib/lshrti3.c >> >> >> @@ -0,0 +1,31 @@ >> >> >> +// SPDX-License-Identifier: GPL-2.0 >> >> >> + >> >> >> +#include >> >> >> +#include >> >> >> + >> >> >> +long long __lshrti3(long long u, word_type b) >> >> > >> >> >use TItype for input/output, which is what gcc does, though the >> >above >> >> >matches the interface in the documentation. >> >> > >> >> >> +{ >> >> >> + DWunion128 uu, w; >> >> >> + word_type bm; >> >> >> + >> >> >> + if (b == 0) >> >> >> + return u; >> >> >> + >> >> >> + uu.ll = u; >> >> >> + bm = 64 - b; >> >> >> + >> >> >> + if (bm <= 0) { >> >> >> + w.s.high = 0; >> >> >> + w.s.low = (unsigned long long) uu.s.high >> -bm; >> >> > >> >> >include and use u64 instead of unsigned long >long. >> >> >> >> Ok, now I'm really puzzled. >> >> >> >> How could we need a 128-bit shift when the prototype only has 64 >bits >> >of input?! 
>> > >> >Good question, this is the code from libgcc: >> > >> >TItype >> >__lshrti3 (TItype u, shift_count_type b) >> >{ >> > if (b == 0) >> >return u; >> > >> > const DWunion uu = {.ll = u}; >> > const shift_count_type bm = (8 * (8)) - b; >> > DWunion w; >> > >> > if (bm <= 0) >> >{ >> > w.s.high = 0; >> > w.s.low = (UDItype) uu.s.high >> -bm; >> >} >> > else >> >{ >> > const UDItype carries = (UDItype) uu.s.high << bm; >> > >> > w.s.high = (UDItype) uu.s.high >> b; >> > w.s.low = ((UDItype) uu.s.low >> b) | carries; >> >} >> > >> > return w.ll; >> >} >> > >> > >> >My compiler knowledge is limited, my guess is that the function is a >> >generic implementation, and while a long long is 64-bit wide under >> >Linux it could be 128-bit on other platforms. >> >> Yes, long long is just plain wrong. >> >> How could we end up calling this function on 32 bits?! > >We didn't, in this case the function is called in 64-bit code >(arch/x86/kvm/x86.o: In function `mul_u64_u64_shr'), for the 32-bit >vDSO it was __lshrdi3. Again, for 64 bits we can include libgcc in the link (with --no-whole-archive). -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
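A sketch of that linking approach (an assumption about placement, e.g. arch/x86/Makefile, not a merged patch); by default the linker pulls in only the archive members needed to resolve undefined symbols, which is the --no-whole-archive behaviour referred to here:

# Ask the compiler for its runtime archive and link it in, so any
# stray libcalls (e.g. __lshrti3) resolve without kernel copies.
LIBGCC := $(shell $(CC) $(KBUILD_CFLAGS) -print-libgcc-file-name)
libs-y += $(LIBGCC)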
Re: [PATCH] x86/umip: Add emulation for 64-bit processes
On September 8, 2019 8:22:48 AM GMT+01:00, Borislav Petkov wrote: >On Sat, Sep 07, 2019 at 02:26:10PM -0700, Ricardo Neri wrote: >> > Wine users have encountered a number of 64-bit Windows games that >use >> > these instructions (particularly sgdt), and were crashing when run >on >> > UMIP-enabled systems. >> >> Emulation support for 64-bit processes was not initially included >> because no use cases had been identified. > >AFAIR, we said at the time that 64-bit doesn't need it because this is >legacy software only and 64-bit will get fixed properly not to use >those >insns. I can probably guess how that went ... I don't think Windows games were something we considered. However, needing to simulate these instructions is not a huge surprise. The important thing is that by simulating them, we can plug the leak of some very high-value kernel information – mainly the GDT, IDT and TSS addresses. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
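For reference, the leak in question is reachable from ordinary unprivileged code; without UMIP, a snippet along these lines returns the kernel's real GDT linear address (illustrative user-space sketch):

#include <stdint.h>

struct gdt_desc {
	uint16_t limit;
	uint64_t base;
} __attribute__((packed));

static inline uint64_t read_gdt_base(void)
{
	struct gdt_desc d;

	asm volatile("sgdt %0" : "=m" (d));
	return d.base;	/* kernel GDT address, unless spoofed by the kernel */
}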
Re: [PATCH] x86/umip: Add emulation for 64-bit processes
On September 6, 2019 12:22:21 AM GMT+01:00, Brendan Shanks wrote: >Add emulation of the sgdt, sidt, and smsw instructions for 64-bit >processes. > >Wine users have encountered a number of 64-bit Windows games that use >these instructions (particularly sgdt), and were crashing when run on >UMIP-enabled systems. > >Originally-by: Ricardo Neri >Signed-off-by: Brendan Shanks >--- > arch/x86/kernel/umip.c | 55 +- > 1 file changed, 33 insertions(+), 22 deletions(-) > >diff --git a/arch/x86/kernel/umip.c b/arch/x86/kernel/umip.c >index 5b345add550f..1812e95d2f55 100644 >--- a/arch/x86/kernel/umip.c >+++ b/arch/x86/kernel/umip.c >@@ -51,9 +51,7 @@ >* The instruction smsw is emulated to return the value that the >register CR0 > * has at boot time as set in the head_32. > * >- * Also, emulation is provided only for 32-bit processes; 64-bit >processes >- * that attempt to use the instructions that UMIP protects will >receive the >- * SIGSEGV signal issued as a consequence of the general protection >fault. >+ * Emulation is provided for both 32-bit and 64-bit processes. > * >* Care is taken to appropriately emulate the results when segmentation >is >* used. That is, rather than relying on USER_DS and USER_CS, the >function >@@ -63,17 +61,18 @@ > * application uses a local descriptor table. > */ > >-#define UMIP_DUMMY_GDT_BASE 0xfffe >-#define UMIP_DUMMY_IDT_BASE 0x >+#define UMIP_DUMMY_GDT_BASE 0xfffeULL >+#define UMIP_DUMMY_IDT_BASE 0xULL > > /* >* The SGDT and SIDT instructions store the contents of the global >descriptor >* table and interrupt table registers, respectively. The destination is >a >* memory operand of X+2 bytes. X bytes are used to store the base >address of >- * the table and 2 bytes are used to store the limit. In 32-bit >processes, the >- * only processes for which emulation is provided, X has a value of 4. >+ * the table and 2 bytes are used to store the limit. In 32-bit >processes X >+ * has a value of 4, in 64-bit processes X has a value of 8. > */ >-#define UMIP_GDT_IDT_BASE_SIZE 4 >+#define UMIP_GDT_IDT_BASE_SIZE_64BIT 8 >+#define UMIP_GDT_IDT_BASE_SIZE_32BIT 4 > #define UMIP_GDT_IDT_LIMIT_SIZE 2 > > #define UMIP_INST_SGDT 0 /* 0F 01 /0 */ >@@ -189,6 +188,7 @@ static int identify_insn(struct insn *insn) > * @umip_inst:A constant indicating the instruction to emulate > * @data: Buffer into which the dummy result is stored > * @data_size:Size of the emulated result >+ * @x86_64: true if process is 64-bit, false otherwise > * >* Emulate an instruction protected by UMIP and provide a dummy result. >The >* result of the emulation is saved in @data. The size of the results >depends >@@ -202,11 +202,8 @@ static int identify_insn(struct insn *insn) > * 0 on success, -EINVAL on error while emulating. > */ > static int emulate_umip_insn(struct insn *insn, int umip_inst, >- unsigned char *data, int *data_size) >+ unsigned char *data, int *data_size, bool x86_64) > { >- unsigned long dummy_base_addr, dummy_value; >- unsigned short dummy_limit = 0; >- > if (!data || !data_size || !insn) > return -EINVAL; > /* >@@ -219,6 +216,9 @@ static int emulate_umip_insn(struct insn *insn, int >umip_inst, >* is always returned irrespective of the operand size. >*/ > if (umip_inst == UMIP_INST_SGDT || umip_inst == UMIP_INST_SIDT) { >+ u64 dummy_base_addr; >+ u16 dummy_limit = 0; >+ > /* SGDT and SIDT do not use registers operands. 
*/ > if (X86_MODRM_MOD(insn->modrm.value) == 3) > return -EINVAL; >@@ -228,13 +228,24 @@ static int emulate_umip_insn(struct insn *insn, >int umip_inst, > else > dummy_base_addr = UMIP_DUMMY_IDT_BASE; > >- *data_size = UMIP_GDT_IDT_LIMIT_SIZE + UMIP_GDT_IDT_BASE_SIZE; >+ /* >+ * 64-bit processes use the entire dummy base address. >+ * 32-bit processes use the lower 32 bits of the base address. >+ * dummy_base_addr is always 64 bits, but we memcpy the correct >+ * number of bytes from it to the destination. >+ */ >+ if (x86_64) >+ *data_size = UMIP_GDT_IDT_BASE_SIZE_64BIT; >+ else >+ *data_size = UMIP_GDT_IDT_BASE_SIZE_32BIT; >+ >+ memcpy(data + 2, &dummy_base_addr, *data_size); > >- memcpy(data + 2, &dummy_base_addr, UMIP_GDT_IDT_BASE_SIZE); >+ *data_size += UMIP_GDT_IDT_LIMIT_SIZE; > memcpy(data, &dummy_limit, UMIP_GDT_IDT_LIMIT_SIZE); > > } else if (umip_inst == UMIP_INST_SMSW) { >- dummy_value = CR0_STATE; >+ unsigned long dummy_value = CR0_STATE; > > /* >* Eve
Re: [PATCH] x86/umip: Add emulation for 64-bit processes
On September 10, 2019 7:28:28 AM GMT+01:00, Ingo Molnar wrote: > >* h...@zytor.com wrote: > >> I would strongly suggest that we change the term "emulation" to >> "spoofing" for these instructions. We need to explain that we do >*not* >> execute these instructions the was the CPU would have, and unlike the > >> native instructions do not leak kernel information. > >Ok, I've edited the patch to add the 'spoofing' wording where >appropriate, and I also made minor fixes such as consistently >capitalizing instruction names. > >Can I also add your Reviewed-by tag? > >So the patch should show up in tip:x86/asm today-ish, and barring any >complications is v5.4 material. > >Thanks, > > Ingo Yes, please do. Reviewed-by: H. Peter Anvin (Intel) -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: New kernel interface for sys_tz and timewarp?
On August 15, 2019 8:05:30 AM PDT, "Theodore Y. Ts'o" wrote: >On Thu, Aug 15, 2019 at 03:22:45PM +0200, Arnd Bergmann wrote: >> If 64-bit Windows relies on a working EFI RTC implementation, we >could >> decide to leave the driver enabled on 64-bit and only disable it for >> 32-bit EFI. That way, future distros would no longer have to worry >about >> the localtime hack, at least the ones that have dropped support for >> 32-bit x86 kernels. > >... and who have also dropped support for legacy (non-UEFI) 64-bit >boot. Keep in mind that even for distributions which may install with >UEFI by default, if people have been upgrading from (for example) >Debian Jessie to Stretch to Buster, they may still be using non-UEFI >boot. This might be especially true on small Linode servers (such as, >for example, the one which is currently running my mail smarthost) > > - Ted There is also the ACPI TAD device, which can be used to expose this through ACPI. We might also be able to request that the ACPI information for the RTC be augmented with a CMOS timezone location. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: Tracing text poke / kernel self-modifying code (Was: Re: [RFC v2 0/6] x86: dynamic indirect branch promotion)
On September 12, 2019 8:00:39 AM GMT+01:00, Adrian Hunter wrote: >On 29/08/19 2:46 PM, Peter Zijlstra wrote: >> On Thu, Aug 29, 2019 at 12:40:56PM +0300, Adrian Hunter wrote: >>> Can you expand on "and ensure the poke_handler preserves the >existing >>> control flow"? Whatever the INT3-handler does will be traced >normally so >>> long as it does not itself execute self-modified code. >> >> My thinking was that the code shouldn't change semantics before >emitting >> the RECORD_TEXT_POKE; but you're right in that that doesn't seem to >> matter much. >> >> Either we run the old code or INT3 does 'something'. Then we get >> RECORD_TEXT_POKE and finish the poke. Which tells that the moment >INT3 >> stops the new text is in place. >> >> I suppose that works too, and requires less changes. > > >What about this? > > >diff --git a/arch/x86/include/asm/text-patching.h >b/arch/x86/include/asm/text-patching.h >index 70c09967a999..00aa9bef2b9d 100644 >--- a/arch/x86/include/asm/text-patching.h >+++ b/arch/x86/include/asm/text-patching.h >@@ -30,6 +30,7 @@ struct text_poke_loc { > void *addr; > size_t len; > const char opcode[POKE_MAX_OPCODE_SIZE]; >+ char old_opcode[POKE_MAX_OPCODE_SIZE]; > }; > >extern void text_poke_early(void *addr, const void *opcode, size_t >len); >diff --git a/arch/x86/kernel/alternative.c >b/arch/x86/kernel/alternative.c >index ccd32013c47a..c781bafd 100644 >--- a/arch/x86/kernel/alternative.c >+++ b/arch/x86/kernel/alternative.c >@@ -3,6 +3,7 @@ > > #include > #include >+#include > #include > #include > #include >@@ -1045,8 +1046,10 @@ void text_poke_bp_batch(struct text_poke_loc >*tp, unsigned int nr_entries) > /* >* First step: add a int3 trap to the address that will be patched. >*/ >- for (i = 0; i < nr_entries; i++) >+ for (i = 0; i < nr_entries; i++) { >+ memcpy(tp[i].old_opcode, tp[i].addr, tp[i].len); > text_poke(tp[i].addr, &int3, sizeof(int3)); >+ } > > on_each_cpu(do_sync_core, NULL, 1); > >@@ -1071,6 +1074,11 @@ void text_poke_bp_batch(struct text_poke_loc >*tp, unsigned int nr_entries) > on_each_cpu(do_sync_core, NULL, 1); > } > >+ for (i = 0; i < nr_entries; i++) { >+ perf_event_text_poke((unsigned long)tp[i].addr, >+ tp[i].old_opcode, tp[i].opcode, tp[i].len); >+ } >+ > /* >* Third step: replace the first byte (int3) by the first byte of >* replacing opcode. 
>diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h >index 61448c19a132..f4c6095d2110 100644 >--- a/include/linux/perf_event.h >+++ b/include/linux/perf_event.h >@@ -1183,6 +1183,8 @@ extern void perf_event_exec(void); > extern void perf_event_comm(struct task_struct *tsk, bool exec); > extern void perf_event_namespaces(struct task_struct *tsk); > extern void perf_event_fork(struct task_struct *tsk); >+extern void perf_event_text_poke(unsigned long addr, const void >*old_bytes, >+ const void *new_bytes, size_t len); > > /* Callchains */ > DECLARE_PER_CPU(struct perf_callchain_entry, perf_callchain_entry); >@@ -1406,6 +1408,10 @@ static inline void perf_event_exec(void) >{ } >static inline void perf_event_comm(struct task_struct *tsk, bool >exec) { } > static inline void perf_event_namespaces(struct task_struct *tsk) { } > static inline void perf_event_fork(struct task_struct *tsk) { } >+static inline void perf_event_text_poke(unsigned long addr, >+ const void *old_bytes, >+ const void *new_bytes, >+ size_t len) { } > static inline void perf_event_init(void) { } >static inline int perf_swevent_get_recursion_context(void){ return >-1; } > static inline void perf_swevent_put_recursion_context(int rctx) > { } >diff --git a/include/uapi/linux/perf_event.h >b/include/uapi/linux/perf_event.h >index bb7b271397a6..6396d4c0d2f9 100644 >--- a/include/uapi/linux/perf_event.h >+++ b/include/uapi/linux/perf_event.h >@@ -375,7 +375,8 @@ struct perf_event_attr { > ksymbol: 1, /* include ksymbol events > */ > bpf_event : 1, /* include bpf events */ > aux_output : 1, /* generate AUX records > instead of events */ >- __reserved_1 : 32; >+ text_poke : 1, /* include text poke >events */ >+ __reserved_1 : 31; > > union { > __u32 wakeup_events;/* wakeup every n events */ >@@ -1000,6 +1001,22 @@ enum perf_event_type { >*/ > PERF_RECORD_BPF_EVENT = 18, > >+ /* >+ * Records chan
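The quoted patch is cut off above, but the user-visible part is the new attribute bit; a hypothetical consumer would opt in along these lines (names as in the RFC, and the text_poke field exists only with the patch applied):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_text_poke_events(void)
{
	struct perf_event_attr attr = {
		.type		= PERF_TYPE_SOFTWARE,
		.config		= PERF_COUNT_SW_DUMMY,
		.size		= sizeof(struct perf_event_attr),
		.text_poke	= 1,	/* request text-poke records */
	};

	/* system-wide on CPU 0: pid = -1, cpu = 0 */
	return syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
}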
Re: [PATCH v2 1/2] x86/boot/64: Make level2_kernel_pgt pages invalid outside kernel area.
On September 23, 2019 11:15:20 AM PDT, Steve Wahl wrote: >Our hardware (UV aka Superdome Flex) has address ranges marked >reserved by the BIOS. Access to these ranges is caught as an error, >causing the BIOS to halt the system. > >Initial page tables mapped a large range of physical addresses that >were not checked against the list of BIOS reserved addresses, and >sometimes included reserved addresses in part of the mapped range. >Including the reserved range in the map allowed processor speculative >accesses to the reserved range, triggering a BIOS halt. > >Used early in booting, the page table level2_kernel_pgt addresses 1 >GiB divided into 2 MiB pages, and it was set up to linearly map a full >1 GiB of physical addresses that included the physical address range >of the kernel image, as chosen by KASLR. But this also included a >large range of unused addresses on either side of the kernel image. >And unlike the kernel image's physical address range, this extra >mapped space was not checked against the BIOS tables of usable RAM >addresses. So there were times when the addresses chosen by KASLR >would result in processor accessible mappings of BIOS reserved >physical addresses. > >The kernel code did not directly access any of this extra mapped >space, but having it mapped allowed the processor to issue speculative >accesses into reserved memory, causing system halts. > >This was encountered somewhat rarely on a normal system boot, and much >more often when starting the crash kernel if "crashkernel=512M,high" >was specified on the command line (this heavily restricts the physical >address of the crash kernel, in our case usually within 1 GiB of >reserved space). > >The solution is to invalidate the pages of this table outside the >kernel image's space before the page table is activated. This patch >has been validated to fix this problem on our hardware. > >Signed-off-by: Steve Wahl >Cc: sta...@vger.kernel.org >--- >Changes since v1: > * Added comment. > * Reworked changelog text. > > arch/x86/kernel/head64.c | 18 -- > 1 file changed, 16 insertions(+), 2 deletions(-) > >diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c >index 29ffa495bd1c..ee9d0e3e0c46 100644 >--- a/arch/x86/kernel/head64.c >+++ b/arch/x86/kernel/head64.c >@@ -222,13 +222,27 @@ unsigned long __head __startup_64(unsigned long >physaddr, >* we might write invalid pmds, when the kernel is relocated >* cleanup_highmap() fixes this up along with the mappings >* beyond _end. >+ * >+ * Only the region occupied by the kernel image has so far >+ * been checked against the table of usable memory regions >+ * provided by the firmware, so invalidate pages outside that >+ * region. A page table entry that maps to a reserved area of >+ * memory would allow processor speculation into that area, >+ * and on some hardware (particularly the UV platform) even >+ * speculative access to some reserved areas is caught as an >+ * error, causing the BIOS to halt the system. >*/ > > pmd = fixup_pointer(level2_kernel_pgt, physaddr); >- for (i = 0; i < PTRS_PER_PMD; i++) { >+ for (i = 0; i < pmd_index((unsigned long)_text); i++) >+ pmd[i] &= ~_PAGE_PRESENT; >+ >+ for (; i <= pmd_index((unsigned long)_end); i++) > if (pmd[i] & _PAGE_PRESENT) > pmd[i] += load_delta; >- } >+ >+ for (; i < PTRS_PER_PMD; i++) >+ pmd[i] &= ~_PAGE_PRESENT; > > /* >* Fixup phys_base - remove the memory encryption mask to obtain What does your MTRR setup look like, and what memory map do you present, in exact detail? 
The BIOS is normally expected to mark the relevant ranges as UC in the MTRRs (that is the remaining legitimate use of MTRRs). I'm somewhat sceptical that chopping off potentially several megabytes is a good thing. We also have the memory type interfaces, which can be used to map these ranges as UC in the page tables. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
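The page-table route mentioned here is the existing set_memory_uc() interface; a rough sketch of the alternative (illustrative only, since it is not usable as early as __startup_64(), where the patch operates):

#include <asm/set_memory.h>

/* Mark an already-mapped reserved range uncacheable instead of
 * unmapping it, so speculative fetches cannot touch it. */
static int mark_reserved_uc(unsigned long vaddr, int nr_pages)
{
	return set_memory_uc(vaddr, nr_pages);
}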
RE: [RFD] x86/split_lock: Request to Intel
On October 18, 2019 3:45:14 AM PDT, David Laight wrote: >From: Luck, Tony >> Sent: 18 October 2019 00:28 >... >> 2) Fix set_bit() et. al. to not issue atomic operations that cross >boundaries. >> >> Fenghua had been pursuing option #1 in previous iterations. He found >a few >> more places with the help of the "grep" patterns suggested by David >Laight. >> So that path is up to ~8 patches now that do one of: >> + Change from u32 to u64 >> + Force alignment with a union with a u64 >> + Change to non-atomic (places that didn't need atomic) >> >> Downside of strategy #1 is that people will add new misaligned cases >in the >> future. So this process has no defined end point. >> >> Strategy #2 begun when I looked at the split-lock issue I saw that >with a >> constant bit argument set_bit() just does a "ORB" on the affected >byte (i.e. >> no split lock). Similar for clear_bit() and change_bit(). Changing >code to also >> do that for the variable bit case is easy. >> >> test_and_clr_bit() needs more care, but luckily, we had Peter Anvin >nearby >> to give us a neat solution. > >Changing the x86-64 bitops to use 32bit memory cycles is trivial >(provided you are willing to accept a limit of 2G bits). > >OTOH this only works because x86 is LE. >On any BE systems passing an 'int []' to any of the bit-functions is so >terribly >wrong it is unbelievable. > >So changing the x86-64 bitops is largely papering over a crack. > >In essence any code that casts the argument to any of the bitops >functions >is almost certainly badly broken on BE systems. > >The x86 cpu features code is always LE. >It probably ought to have a typedef for a union of long [] and int []. > > David > >- >Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, >MK1 1PT, UK >Registration No: 1397386 (Wales) One thing I suggested is that we should actually expose the violations at compile time, either by wrapping them in macros using __alignof__ and/or by making the kernel compile with -Wcast-align. On x86 the btsl/btcl/btrl instructions can be used without the 2G-bit limit if the address is computed, the way one does for plain and, or, etc. However, if the real types for the arguments are exposed then it is possible to do better. Finally, as far as bigendian is concerned: the problem Linux has is that it tries to use littleendian bitmaps on bigendian machines, where bit 0 is naturally the MSB. If your reaction is "but that is absurd", then you have just grokked why bigendian is fundamentally broken. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
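A sketch of the compile-time exposure suggested here (an assumed form, not an existing kernel macro): wrap the bitop so that an under-aligned pointer type fails the build. Note that a cast to (unsigned long *) would still defeat it, which is where -Wcast-align comes in.

#include <linux/bitops.h>
#include <linux/build_bug.h>

/* Reject bitmap pointers whose pointee alignment is below that of
 * 'long'; catches e.g. set_bit() on a u32 * at compile time. */
#define ASSERT_BITOP_ALIGN(p)						\
	BUILD_BUG_ON_MSG(__alignof__(*(p)) < __alignof__(long),	\
			 "misaligned bitop argument")

#define set_bit_checked(nr, p)					\
do {								\
	ASSERT_BITOP_ALIGN(p);					\
	set_bit(nr, p);						\
} while (0)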
Re: [PATCH] sched/x86: Save [ER]FLAGS on context switch
On February 19, 2019 1:04:09 AM PST, Peter Zijlstra wrote: >On Mon, Feb 18, 2019 at 02:30:21PM -0800, H. Peter Anvin wrote: >> On 2/16/19 2:30 AM, Peter Zijlstra wrote: >> > On Fri, Feb 15, 2019 at 08:06:56PM -0800, h...@zytor.com wrote: >> >> This implies we invoke schedule -- a restricted operation >(consider >> >> may_sleep) during execution of STAC-enabled code, but *not* as an >> >> exception or interrupt, since those preserve the flags. >> > >> > Meet preempt_enable(). >> >> I believe this falls under "doctor, it hurts when I do that." And it >hurts for >> very good reasons. See below. > >I disagree; the basic rule is that if you're preemptible you must also >be able to schedule and vice-versa. These AC regions violate this. > >And, like illustrated, it is _very_ easy to get all sorts inside that >AC >crap. > >> >> I have serious concerns about this. This is more or less saying >that >> >> we have left an unlimited gap and have had AC escape. >> > >> > Yes; by allowing normal C in between STAC and CLAC this is so. >> > >> >> Does *anyone* see a need to allow this? I got a question at LPC >from >> >> someone about this, and what they were trying to do once all the >> >> layers had been unwound was so far down the wrong track for a root >> >> problem that actually has a very simple solution. >> > >> > Have you read the rest of the thread? >> > >> > All it takes for this to explode is a call to a C function that has >> > tracing on it in between the user_access_begin() and >user_access_end() >> > things. That is a _very_ easy thing to do. >> > >> > Heck, GCC is allowed to generate that broken code compiling >> > lib/strnlen_user.c; it doesn't _have_ to inline do_strnlen_user >(esp. >> > with CONFIG_OPTIMIZE_INLINING), and making that a function call >would >> > get us fentry hooks and ftrace and *BOOM*. >> > >> > (Now, of course, its a static function with a single caller, and >GCC >> > isn't _THAT_ stupid, but it could, if it wanted to) >> > >> > Since the bar is _that_ low for mistakes, I figure we'd better fix >it. >> > >> >> The question is what "fix it" means. > >It means that when we do schedule, the next task doesn't have AC set, >and when we schedule back, we'll restore our AC when we had it. Unlike >the current situation, where the next task will run with AC set. > >IOW, it confines AC to the task context where it was set. > >> I'm really concerned about AC escapes, >> and everyone else should be, too. > >Then _that_ should be asserted. > >> For an issue specific to tracing, it would be more appropriate to do >the >> appropriate things in the tracing entry/exit than in schedule. >Otherwise, we >> don't want to silently paper over mistakes which could mean that we >run a >> large portion of the kernel without protection we thought we had. >> >> In that sense, calling schedule() with AC set is in fact a great >place to have >> a WARN() or even BUG(), because such an event really could imply that >the >> kernel has been security compromised. > >It is not specific to tracing, tracing is just one of the most trivial >and non-obvious ways to make it go splat. > >There's lot of fairly innocent stuff that does preempt_disable() / >preempt_enable(). And just a warning in schedule() isn't sufficient, >you'd have to actually trigger a reschedule before you know your code >is >bad. > >And like I said; the invariant is: if you're preemptible you can >schedule and v.v. 
> >Now, clearly you don't want to mark these whole regions as >!preemptible, >because then you can also not take faults, but somehow you're not >worried about the whole fault handler, but you are worried about the >scheduler ?!? How does that work? The fault handler can touch a _ton_ >more code. Heck, the fault handler can schedule. > >So either pre-fault, and run the whole AC crap with preemption disabled >and retry, or accept that we have to schedule. > >> Does that make more sense? > >It appears to me you're going about it backwards. I'm not worried about the fault handler, because AC is always preserved/disabled on exception entry; otherwise I would of course be very concerned. To be clear: I'm not worried about the scheduler itself. I'm worried about how much code we have gone through to get there. Perhaps the scheduler itself is not the right point to probe for it, but we do need to catch things that have gone wrong, rather than just leaving the door wide open. I would personally be far more comfortable saying you have to disable preemption in user access regions. It may be an unnecessary constraint for x86 and ARM64, but it is safe. And Julien, yes, it is "somewhat arbitrary", but so are many engineering tradeoffs. Not everything has a very sharply definable line. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
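A sketch of the constraint argued for here, i.e. making the AC=1 window non-preemptible by construction (an assumption about the shape of such code, using the then-current user_access_begin()/unsafe_put_user() API, not the patch under discussion):

#include <linux/uaccess.h>
#include <linux/preempt.h>

static bool put_word_nopreempt(u32 __user *uptr, u32 val)
{
	bool ok = false;

	preempt_disable();
	if (user_access_begin(uptr, sizeof(*uptr))) {	/* STAC: AC=1 */
		unsafe_put_user(val, uptr, fail);
		ok = true;
fail:
		user_access_end();			/* CLAC: AC=0 */
	}
	preempt_enable();
	/* a fault inside the window is resolved via exception fixup;
	 * with preemption off the handler will not sleep, the access
	 * simply fails and we land on the 'fail' label */
	return ok;
}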
Re: [PATCH 01/17] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
On January 16, 2019 10:47:01 PM PST, Masami Hiramatsu wrote: >On Wed, 16 Jan 2019 16:32:43 -0800 >Rick Edgecombe wrote: > >> From: Nadav Amit >> >> text_mutex is currently expected to be held before text_poke() is >> called, but we kgdb does not take the mutex, and instead *supposedly* >> ensures the lock is not taken and will not be acquired by any other >core >> while text_poke() is running. >> >> The reason for the "supposedly" comment is that it is not entirely >clear >> that this would be the case if gdb_do_roundup is zero. >> >> This patch creates two wrapper functions, text_poke() and >> text_poke_kgdb() which do or do not run the lockdep assertion >> respectively. >> >> While we are at it, change the return code of text_poke() to >something >> meaningful. One day, callers might actually respect it and the >existing >> BUG_ON() when patching fails could be removed. For kgdb, the return >> value can actually be used. > >Looks good to me. > >Reviewed-by: Masami Hiramatsu > >Thank you, > >> >> Cc: Andy Lutomirski >> Cc: Kees Cook >> Cc: Dave Hansen >> Cc: Masami Hiramatsu >> Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex in >text_poke*()") >> Suggested-by: Peter Zijlstra >> Acked-by: Jiri Kosina >> Signed-off-by: Nadav Amit >> Signed-off-by: Rick Edgecombe >> --- >> arch/x86/include/asm/text-patching.h | 1 + >> arch/x86/kernel/alternative.c| 52 > >> arch/x86/kernel/kgdb.c | 11 +++--- >> 3 files changed, 45 insertions(+), 19 deletions(-) >> >> diff --git a/arch/x86/include/asm/text-patching.h >b/arch/x86/include/asm/text-patching.h >> index e85ff65c43c3..f8fc8e86cf01 100644 >> --- a/arch/x86/include/asm/text-patching.h >> +++ b/arch/x86/include/asm/text-patching.h >> @@ -35,6 +35,7 @@ extern void *text_poke_early(void *addr, const void >*opcode, size_t len); >> * inconsistent instruction while you patch. >> */ >> extern void *text_poke(void *addr, const void *opcode, size_t len); >> +extern void *text_poke_kgdb(void *addr, const void *opcode, size_t >len); >> extern int poke_int3_handler(struct pt_regs *regs); >> extern void *text_poke_bp(void *addr, const void *opcode, size_t >len, void *handler); >> extern int after_bootmem; >> diff --git a/arch/x86/kernel/alternative.c >b/arch/x86/kernel/alternative.c >> index ebeac487a20c..c6a3a10a2fd5 100644 >> --- a/arch/x86/kernel/alternative.c >> +++ b/arch/x86/kernel/alternative.c >> @@ -678,18 +678,7 @@ void *__init_or_module text_poke_early(void >*addr, const void *opcode, >> return addr; >> } >> >> -/** >> - * text_poke - Update instructions on a live kernel >> - * @addr: address to modify >> - * @opcode: source of the copy >> - * @len: length to copy >> - * >> - * Only atomic text poke/set should be allowed when not doing early >patching. >> - * It means the size must be writable atomically and the address >must be aligned >> - * in a way that permits an atomic write. It also makes sure we fit >on a single >> - * page. 
>> - */ >> -void *text_poke(void *addr, const void *opcode, size_t len) >> +static void *__text_poke(void *addr, const void *opcode, size_t len) >> { >> unsigned long flags; >> char *vaddr; >> @@ -702,8 +691,6 @@ void *text_poke(void *addr, const void *opcode, >size_t len) >> */ >> BUG_ON(!after_bootmem); >> >> -lockdep_assert_held(&text_mutex); >> - >> if (!core_kernel_text((unsigned long)addr)) { >> pages[0] = vmalloc_to_page(addr); >> pages[1] = vmalloc_to_page(addr + PAGE_SIZE); >> @@ -732,6 +719,43 @@ void *text_poke(void *addr, const void *opcode, >size_t len) >> return addr; >> } >> >> +/** >> + * text_poke - Update instructions on a live kernel >> + * @addr: address to modify >> + * @opcode: source of the copy >> + * @len: length to copy >> + * >> + * Only atomic text poke/set should be allowed when not doing early >patching. >> + * It means the size must be writable atomically and the address >must be aligned >> + * in a way that permits an atomic write. It also makes sure we fit >on a single >> + * page. >> + */ >> +void *text_poke(void *addr, const void *opcode, size_t len) >> +{ >> +lockdep_assert_held(&text_mutex); >> + >> +return __text_poke(addr, opcode, len); >> +} >> + >> +/** >> + * text_poke_kgdb - Update instructions on a live kernel by kgdb >> + * @addr: address to modify >> + * @opcode: source of the copy >> + * @len: length to copy >> + * >> + * Only atomic text poke/set should be allowed when not doing early >patching. >> + * It means the size must be writable atomically and the address >must be aligned >> + * in a way that permits an atomic write. It also makes sure we fit >on a single >> + * page. >> + * >> + * Context: should only be used by kgdb, which ensures no other core >is running, >> + * despite the fact it does not hold the text_mutex. >> + */ >> +void *text_poke_kgdb(void *addr, const void *opcode, size_t len) >> +{ >> +
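The quoted patch is truncated above; going by its own kerneldoc, the kgdb variant presumably just omits the lockdep assertion:

void *text_poke_kgdb(void *addr, const void *opcode, size_t len)
{
	/*
	 * No lockdep_assert_held(&text_mutex) here: kgdb ensures no
	 * other core is running instead of taking the mutex.
	 */
	return __text_poke(addr, opcode, len);
}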
Re: [PATCH 06/17] x86/alternative: use temporary mm for text poking
On January 17, 2019 1:43:54 PM PST, Nadav Amit wrote: >> On Jan 17, 2019, at 12:47 PM, Andy Lutomirski >wrote: >> >> On Thu, Jan 17, 2019 at 12:27 PM Andy Lutomirski >wrote: >>> On Wed, Jan 16, 2019 at 4:33 PM Rick Edgecombe >>> wrote: From: Nadav Amit text_poke() can potentially compromise the security as it sets >temporary PTEs in the fixmap. These PTEs might be used to rewrite the kernel >code from other cores accidentally or maliciously, if an attacker gains >the ability to write onto kernel memory. >>> >>> i think this may be sufficient, but barely. >>> + pte_clear(poking_mm, poking_addr, ptep); + + /* +* __flush_tlb_one_user() performs a redundant TLB flush >when PTI is on, +* as it also flushes the corresponding "user" address >spaces, which +* does not exist. +* +* Poking, however, is already very inefficient since it >does not try to +* batch updates, so we ignore this problem for the time >being. +* +* Since the PTEs do not exist in other kernel >address-spaces, we do +* not use __flush_tlb_one_kernel(), which when PTI is on >would cause +* more unwarranted TLB flushes. +* +* There is a slight anomaly here: the PTE is a >supervisor-only and +* (potentially) global and we use __flush_tlb_one_user() >but this +* should be fine. +*/ + __flush_tlb_one_user(poking_addr); + if (cross_page_boundary) { + pte_clear(poking_mm, poking_addr + PAGE_SIZE, ptep >+ 1); + __flush_tlb_one_user(poking_addr + PAGE_SIZE); + } >>> >>> In principle, another CPU could still have the old translation. >Your >>> mutex probably makes this impossible, but it makes me nervous. >>> Ideally you'd use flush_tlb_mm_range(), but I guess you can't do >that >>> with IRQs off. Hmm. I think you should add an inc_mm_tlb_gen() >here. >>> Arguably, if you did that, you could omit the flushes, but maybe >>> that's silly. >>> >>> If we start getting new users of use_temporary_mm(), we should give >>> some serious thought to the SMP semantics. >>> >>> Also, you're using PAGE_KERNEL. Please tell me that the global bit >>> isn't set in there. >> >> Much better solution: do unuse_temporary_mm() and *then* >> flush_tlb_mm_range(). This is entirely non-sketchy and should be >just >> about optimal, too. > >This solution sounds nice and clean. The fact the global-bit was set >didn’t >matter before (since __flush_tlb_one_user would get rid of it no matter >what), but would matter now, so I’ll change it too. > >Thanks! > >Nadav You can just disable the global bit at the top level, obviously. This approach also should make it far easier to do batching if desired. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
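A sketch of the ordering suggested above (helper names assumed from the patch set): leave the temporary mm first, then issue one ranged flush that covers both pages a cross-page poke can touch.

/* after patching, in text_poke(): */
unuse_temporary_mm(prev);
flush_tlb_mm_range(poking_mm, poking_addr,
		   poking_addr + 2 * PAGE_SIZE,
		   PAGE_SHIFT, false);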
Re: [PATCH 01/17] Fix "x86/alternatives: Lockdep-enforce text_mutex in text_poke*()"
On January 17, 2019 2:39:15 PM PST, Nadav Amit wrote: >> On Jan 17, 2019, at 1:15 PM, h...@zytor.com wrote: >> >> On January 16, 2019 10:47:01 PM PST, Masami Hiramatsu > wrote: >>> On Wed, 16 Jan 2019 16:32:43 -0800 >>> Rick Edgecombe wrote: >>> From: Nadav Amit text_mutex is currently expected to be held before text_poke() is called, but we kgdb does not take the mutex, and instead >*supposedly* ensures the lock is not taken and will not be acquired by any other >>> core while text_poke() is running. The reason for the "supposedly" comment is that it is not entirely >>> clear that this would be the case if gdb_do_roundup is zero. This patch creates two wrapper functions, text_poke() and text_poke_kgdb() which do or do not run the lockdep assertion respectively. While we are at it, change the return code of text_poke() to >>> something meaningful. One day, callers might actually respect it and the >>> existing BUG_ON() when patching fails could be removed. For kgdb, the return value can actually be used. >>> >>> Looks good to me. >>> >>> Reviewed-by: Masami Hiramatsu >>> >>> Thank you, >>> Cc: Andy Lutomirski Cc: Kees Cook Cc: Dave Hansen Cc: Masami Hiramatsu Fixes: 9222f606506c ("x86/alternatives: Lockdep-enforce text_mutex >in >>> text_poke*()") Suggested-by: Peter Zijlstra Acked-by: Jiri Kosina Signed-off-by: Nadav Amit Signed-off-by: Rick Edgecombe --- arch/x86/include/asm/text-patching.h | 1 + arch/x86/kernel/alternative.c| 52 >>> arch/x86/kernel/kgdb.c | 11 +++--- 3 files changed, 45 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/text-patching.h >>> b/arch/x86/include/asm/text-patching.h index e85ff65c43c3..f8fc8e86cf01 100644 --- a/arch/x86/include/asm/text-patching.h +++ b/arch/x86/include/asm/text-patching.h @@ -35,6 +35,7 @@ extern void *text_poke_early(void *addr, const >void >>> *opcode, size_t len); * inconsistent instruction while you patch. */ extern void *text_poke(void *addr, const void *opcode, size_t len); +extern void *text_poke_kgdb(void *addr, const void *opcode, size_t >>> len); extern int poke_int3_handler(struct pt_regs *regs); extern void *text_poke_bp(void *addr, const void *opcode, size_t >>> len, void *handler); extern int after_bootmem; diff --git a/arch/x86/kernel/alternative.c >>> b/arch/x86/kernel/alternative.c index ebeac487a20c..c6a3a10a2fd5 100644 --- a/arch/x86/kernel/alternative.c +++ b/arch/x86/kernel/alternative.c @@ -678,18 +678,7 @@ void *__init_or_module text_poke_early(void >>> *addr, const void *opcode, return addr; } -/** - * text_poke - Update instructions on a live kernel - * @addr: address to modify - * @opcode: source of the copy - * @len: length to copy - * - * Only atomic text poke/set should be allowed when not doing >early >>> patching. - * It means the size must be writable atomically and the address >>> must be aligned - * in a way that permits an atomic write. It also makes sure we >fit >>> on a single - * page. 
- */ -void *text_poke(void *addr, const void *opcode, size_t >len) { unsigned long flags; char *vaddr; @@ -702,8 +691,6 @@ void *text_poke(void *addr, const void *opcode, >>> size_t len) */ BUG_ON(!after_bootmem); - lockdep_assert_held(&text_mutex); - if (!core_kernel_text((unsigned long)addr)) { pages[0] = vmalloc_to_page(addr); pages[1] = vmalloc_to_page(addr + PAGE_SIZE); @@ -732,6 +719,43 @@ void *text_poke(void *addr, const void >*opcode, >>> size_t len) return addr; } +/** + * text_poke - Update instructions on a live kernel + * @addr: address to modify + * @opcode: source of the copy + * @len: length to copy + * + * Only atomic text poke/set should be allowed when not doing >early >>> patching. + * It means the size must be writable atomically and the address >>> must be aligned + * in a way that permits an atomic write. It also makes sure we >fit >>> on a single + * page. + */ +void *text_poke(void *addr, const void *opcode, size_t len) +{ + lockdep_assert_held(&text_mutex); + + return __text_poke(addr, opcode, len); +} + +/** + * text_poke_kgdb - Update instructions on a live kernel by kgdb + * @addr: address to modify + * @opcode: source of the copy + * @len: length to copy + * + * Only atomic text poke/set should be allowed when not doing >early >>> patching. + * It means the size must be writable atomically and the address must be aligned + * in a way that permits an atomic write. It also makes sure we fit on a single + * page. + * + * Context: should only be used by kgdb, which ensures no other core is running, + * despite the fact it does not hold the text_mutex. + */ +void *text_poke_kgdb(void *addr, const void *opcode, size_t len) +{ + return __text_poke(addr, opcode, len); +}
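In caller terms, the split works out as below. This is a usage sketch based on the quoted patch (new_insn is a placeholder buffer), not additional code from the series:

	/* Ordinary callers must wrap the poke in text_mutex: */
	mutex_lock(&text_mutex);
	text_poke(addr, new_insn, len);	/* lockdep-asserts text_mutex */
	mutex_unlock(&text_mutex);

	/*
	 * kgdb, which (supposedly) has parked every other CPU, may not
	 * hold text_mutex; it calls the wrapper that skips the lockdep
	 * assertion but performs the identical __text_poke():
	 */
	text_poke_kgdb(addr, new_insn, len);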
Re: [PATCH v3 0/6] Static calls
On January 11, 2019 11:34:34 AM PST, Linus Torvalds wrote: >On Fri, Jan 11, 2019 at 11:24 AM wrote: >> >> I still don't see why can't simply spin in the #BP handler until the >patch is complete. > >So here's at least one problem: > >text_poke_bp() > text_poke(addr, &int3, sizeof(int3)); > *interrupt* > interrupt has a static call >*BP* > poke_int3_handler > *BOOM* > >Note how at BOOM we cannot just spin (or return) to wait for the >'int3' to be switched back. Becuase it never will. Because we are >interrupting the thing that would do that switch-back. > >So we'd have to do the 'text_poke_bp()' sequence with interrupts >disabled. Which we can't do right now at least, because part of that >sequence involves that on_each_cpu(do_sync_core) thing, which needs >interrupts enabled. > >See? > >Or am I missing something? > >Linus Let me think about it. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v3 0/6] Static calls
On January 11, 2019 11:34:34 AM PST, Linus Torvalds wrote: >On Fri, Jan 11, 2019 at 11:24 AM wrote: >> >> I still don't see why can't simply spin in the #BP handler until the >patch is complete. > >So here's at least one problem: > >text_poke_bp() > text_poke(addr, &int3, sizeof(int3)); > *interrupt* > interrupt has a static call >*BP* > poke_int3_handler > *BOOM* > >Note how at BOOM we cannot just spin (or return) to wait for the >'int3' to be switched back. Becuase it never will. Because we are >interrupting the thing that would do that switch-back. > >So we'd have to do the 'text_poke_bp()' sequence with interrupts >disabled. Which we can't do right now at least, because part of that >sequence involves that on_each_cpu(do_sync_core) thing, which needs >interrupts enabled. > >See? > >Or am I missing something? > >Linus Ok, I was thinking more along the lines of spinning with an IRET and letting the exception be delivered. Patching with interrupts disabled has other problems... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
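What "spinning with an IRET" would mean concretely: the #BP handler steps the saved instruction pointer back onto the int3 and returns, so the trap is simply re-delivered until the patcher restores the real instruction. A sketch, reusing the bp_int3_addr/bp_patching_in_progress state the real poke_int3_handler() of that era kept; note it still runs into exactly the interrupted-patcher problem conceded above:

int poke_int3_handler(struct pt_regs *regs)
{
	if (user_mode(regs) || !bp_patching_in_progress)
		return 0;
	if (regs->ip != (unsigned long)bp_int3_addr)
		return 0;

	/*
	 * Step back onto the int3.  The IRET out of the handler is
	 * serializing, and the CPU takes the trap again until the
	 * original instruction has been put back, effectively spinning
	 * through exception delivery rather than in a loop.
	 */
	regs->ip -= 1;
	return 1;
}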
Re: [PATCH v3 0/6] Static calls
On January 14, 2019 3:27:55 PM PST, Andy Lutomirski wrote: >On Mon, Jan 14, 2019 at 2:01 PM H. Peter Anvin wrote: >> >> So I was already in the middle of composing this message when Andy >posted: >> >> > I don't even think this is sufficient. I think we also need >everyone >> > who clears the bit to check if all bits are clear and, if so, >remove >> > the breakpoint. Otherwise we have a situation where, if you are in >> > text_poke_bp() and you take an NMI (or interrupt or MCE or >whatever) >> > and that interrupt then hits the breakpoint, then you deadlock >because >> > no one removes the breakpoint. >> > >> > If we do this, and if we can guarantee that all CPUs make forward >> > progress, then maybe the problem is solved. Can we guarantee >something >> > like all NMI handlers that might wait in a spinlock or for any >other >> > reason will periodically check if a sync is needed while they're >> > spinning? >> >> So the really, really nasty case is when an asynchronous event on the >> *patching* processor gets stuck spinning on a resource which is >> unavailable due to another processor spinning on the #BP. We can >disable >> interrupts, but we can't stop NMIs from coming in (although we could >> test in the NMI handler if we are in that condition and return >> immediately; I'm not sure we want to do that, and we still have to >deal >> with #MC and what not.) >> >> The fundamental problem here is that we don't see the #BP on the >> patching processor, in which case we could simply complete the >patching >> from the #BP handler on that processor. >> >> On 1/13/19 6:40 PM, H. Peter Anvin wrote: >> > On 1/13/19 6:31 PM, H. Peter Anvin wrote: >> >> >> >> static cpumask_t text_poke_cpumask; >> >> >> >> static void text_poke_sync(void) >> >> { >> >> smp_wmb(); >> >> text_poke_cpumask = cpu_online_mask; >> >> smp_wmb(); /* Should be optional on x86 */ >> >> cpumask_clear_cpu(&text_poke_cpumask, smp_processor_id()); >> >> on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, >NULL, false); >> >> while (!cpumask_empty(&text_poke_cpumask)) { >> >> cpu_relax(); >> >> smp_rmb(); >> >> } >> >> } >> >> >> >> static void text_poke_sync_cpu(void *dummy) >> >> { >> >> (void)dummy; >> >> >> >> smp_rmb(); >> >> cpumask_clear_cpu(&poke_bitmask, smp_processor_id()); >> >> /* >> >> * We are guaranteed to return with an IRET, either from the >> >> * IPI or the #BP handler; this provides serialization. >> >> */ >> >> } >> >> >> > >> > The invariants here are: >> > >> > 1. The patching routine must set each bit in the cpumask after each >event >> >that requires synchronization is complete. >> > 2. The bit can be (atomically) cleared on the target CPU only, and >only in a >> >place that guarantees a synchronizing event (e.g. IRET) before >it may >> >reaching the poked instruction. >> > 3. At a minimum the IPI handler and #BP handler needs to clear the >bit. It >> >*is* also possible to clear it in other places, e.g. the NMI >handler, if >> >necessary as long as condition 2 is satisfied. >> > >> >> OK, so with interrupts enabled *on the processor doing the patching* >we >> still have a problem if it takes an interrupt which in turn takes a >#BP. >> Disabling interrupts would not help, because but an NMI and #MC >could >> still cause problems unless we can guarantee that no path which may >be >> invoked by NMI/#MC can do text_poke, which seems to be a very >aggressive >> assumption. >> >> Note: I am assuming preemption is disabled. 
>> >> The easiest/sanest way to deal with this might be to switch the IDT >(or >> provide a hook in the generic exception entry code) on the patching >> processor, such that if an asynchronous event comes in, we either >roll >> forward or revert. This is doable because the second sync we >currently >> do is not actually necessary per the hardware guys. > >This is IMO insanely complicated. I much prefer the kind of >complexity that is more or less deterministic and easy to test to the >kind of complexity (like this) that only happens in corner cases. > >I see two solutions here: > >1. Just suck it up and emulate the CALL. And find a way to write a >test case so we know it works. > >2. Find a non-deadlocky way to make the breakpoint handler wait for >the breakpoint to get removed, without any mucking at all with the >entry code. And find a way to write a test case so we know it works. >(E.g. stick an actual static_call call site *in text_poke_bp()* that >fires once on boot so that the really awful recursive case gets >exercised all the time. > >But if we're going to do any mucking with the entry code, let's just >do the simple mucking to make emulating CALL work. > >--Andy Ugh. So much for not really proofreading. Yes, I think the second solution is the right thing since I think I figured out how to do it without deadlock; see other mail. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
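For reference, here is the cpumask scheme from the quoted mail with its proofreading slips fixed: cpumask_clear_cpu() takes the CPU number first, cpu_online_mask is a pointer and must be copied with cpumask_copy() rather than assigned, and the handler's poke_bitmask was evidently meant to be text_poke_cpumask. Still just a sketch of a proposal that was never merged:

static cpumask_t text_poke_cpumask;

static void text_poke_sync_cpu(void *dummy)
{
	(void)dummy;

	smp_rmb();
	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
	/*
	 * We are guaranteed to return with an IRET, either from the
	 * IPI or the #BP handler; this provides serialization.
	 */
}

static void text_poke_sync(void)
{
	smp_wmb();
	cpumask_copy(&text_poke_cpumask, cpu_online_mask);
	smp_wmb();		/* should be optional on x86 */
	cpumask_clear_cpu(smp_processor_id(), &text_poke_cpumask);
	on_each_cpu_mask(&text_poke_cpumask, text_poke_sync_cpu, NULL, false);
	while (!cpumask_empty(&text_poke_cpumask)) {
		cpu_relax();
		smp_rmb();
	}
}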
Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel
On January 19, 2019 3:25:03 PM PST, Joel Fernandes wrote: >On Sat, Jan 19, 2019 at 12:43:35PM -0500, Daniel Colascione wrote: >> On Sat, Jan 19, 2019 at 11:27 AM Joel Fernandes > wrote: >> > >> > On Sat, Jan 19, 2019 at 09:25:32AM +0100, Greg KH wrote: >> > > On Fri, Jan 18, 2019 at 05:55:43PM -0500, Joel Fernandes wrote: >> > > > --- /dev/null >> > > > +++ b/kernel/kheaders.c >> >> Thanks a ton for this work. It'll make it much easier to do cool >> things with BPF. > >You're welcome, thanks. > >> One question: I can imagine wanting to probe >> structures that are defined, not in headers, but in random >> implementation files. Would it be possible to optionally include >*all* >> kernel source files? > >That would be prohibitively too large to justify keeping it in memory, >even >compressed. Arguably, such structures should be moved into include/ if >modules or whatever is extending the kernel like eBPF needs them. > >> If not, what about a hash, so we could at least >> do precise correlation between a candidate local tree and what's >> actually on device? > >That would make a tool too difficult to write wouldn't it, since they >you have to >correlate every possible hash and keep updating eBPF tools with new >hashes - >probably not scalable. I think what you want is to use the kernel >version to >assume what such internal structures look like although that's still >not >robust. > >> BTW, I'm not sure that the magic constants you've defined are long >> enough. I'd feel more comfortable with two UUIDs (16 bytes each). > >Ok, I'll expand it. > >> I'd also strongly consider LZMA compression: xz -9 on the kernel >> headers (with comments) brings the size down to 5MB, compared to the >> 7MB I get for gzip -9. Considering that this feature is optional, I >> think it's okay to introduce a dependency on widespread modern >> compression tools. (For comparison, bzip2 -9 gets us 6MB.) > >Ok, I'll look into LZMA. Thanks for checking the compression sizes. > >- Joel Don't use lzma, use xz if you are going to do something. However, it seems unlikely to me that someone not willing to spend the space in the filesystem will spend unswappable kernel memory. It would seem that a far saner way to do this is to use inittmpfs or perhaps an auxiliary "ktmpfs" so it can at least be swapped out if you have swap. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel
On January 19, 2019 2:36:06 AM PST, Greg KH wrote: >On Sat, Jan 19, 2019 at 02:28:00AM -0800, Christoph Hellwig wrote: >> This seems like a pretty horrible idea and waste of kernel memory. > >It's only a waste if you want it to be a waste, i.e. if you load the >kernel module. > >This really isn't any different from how /proc/config.gz works. > >> Just add support to kbuild to store a compressed archive in initramfs >> and unpack it in the right place. > >I think the issue is that some devices do not use initramfs, or switch >away from it after init happens or something like that. Joel has all >of >the looney details that he can provide. > >thanks, > >greg k-h Yeah, well... but it is kind of a losing game... the more in-kernel stuff there is the less likely things are to actually be supported. Modularizing it is in some ways even crazier in the sense that at that point you are relying on the filesystem containing the module, which has to be loaded into the kernel by a root user. One could even wonder if a better way to do this would be to have "make modules_install" park an archive file – or even a directory as opposed to a symlink – with this stuff in /lib/modules. We could even provide a tmpfs shim which autoloads such an archive via the firmware loader; this might even be generically useful, who knows. Note also that initramfs contents can be built into the kernel. Extracting such content into a single-instance tmpfs would again be a possibility. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
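A hand-wavy sketch of that tmpfs-shim idea, for concreteness. Only request_firmware()/release_firmware() are real, stable interfaces here; the archive name, the NULL device argument (a real shim would register a proper struct device), and the unpack step are assumptions for illustration:

#include <linux/firmware.h>
#include <linux/init.h>
#include <linux/module.h>

static int __init kheaders_shim_init(void)
{
	const struct firmware *fw;
	int ret;

	/* Pull the prebuilt archive through the normal firmware
	 * search path (e.g. /lib/firmware): */
	ret = request_firmware(&fw, "kernel-headers.tar.xz", NULL);
	if (ret)
		return ret;

	/*
	 * fw->data / fw->size now hold the archive; a real shim would
	 * unpack it into (or mount it over) a private tmpfs instance
	 * here, so the pages stay swappable.
	 */

	release_firmware(fw);
	return 0;
}
module_init(kheaders_shim_init);
MODULE_LICENSE("GPL");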
Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel
On January 20, 2019 5:45:53 PM PST, Joel Fernandes wrote: >On Sun, Jan 20, 2019 at 01:58:15PM -0800, h...@zytor.com wrote: >> On January 20, 2019 8:10:03 AM PST, Joel Fernandes > wrote: >> >On Sat, Jan 19, 2019 at 11:01:13PM -0800, h...@zytor.com wrote: >> >> On January 19, 2019 2:36:06 AM PST, Greg KH >> > wrote: >> >> >On Sat, Jan 19, 2019 at 02:28:00AM -0800, Christoph Hellwig >wrote: >> >> >> This seems like a pretty horrible idea and waste of kernel >memory. >> >> > >> >> >It's only a waste if you want it to be a waste, i.e. if you load >the >> >> >kernel module. >> >> > >> >> >This really isn't any different from how /proc/config.gz works. >> >> > >> >> >> Just add support to kbuild to store a compressed archive in >> >initramfs >> >> >> and unpack it in the right place. >> >> > >> >> >I think the issue is that some devices do not use initramfs, or >> >switch >> >> >away from it after init happens or something like that. Joel has >> >all >> >> >of >> >> >the looney details that he can provide. >> >> > >> >> >thanks, >> >> > >> >> >greg k-h >> >> >> >> Yeah, well... but it is kind of a losing game... the more >in-kernel >> >stuff there is the less likely things are to actually be supported. >> > >> >It is better than nothing, and if this makes things a bit easier and >> >solves >> >real-world issues people have been having, and is optional, then I >> >don't see >> >why not. >> > >> >> Modularizing it is in some ways even crazier in the sense that at >> >that point you are relying on the filesystem containing the module, >> >which has to be loaded into the kernel by a root user. One could >even >> >wonder if a better way to do this would be to have "make >> >modules_install" park an archive file – or even a directory as >opposed >> >to a symlink – with this stuff in /lib/modules. We could even >provide a >> >tmpfs shim which autoloads such an archive via the firmware loader; >> >this might even be generically useful, who knows. >> > >> >All this seems to assume where the modules are located. In Android, >we >> >don't >> >have /lib/modules. This patch generically fits into the grand scheme >> >things >> >and I think is just better made a part of the kernel since it is not >> >that >> >huge once compressed, as Dan also pointed. The more complex, and the >> >more >> >assumptions we make, the less likely people writing tools will get >it >> >right >> >and be able to easily use it. >> > >> >> >> >> Note also that initramfs contents can be built into the kernel. >> >Extracting such content into a single-instance tmpfs would again be >a >> >possibility >> > >> >Such an approach would bloat the kernel image size though, which may >> >not work >> >for everyone. The module based approach, on the other hand, gives an >> >option >> >to the user to enable the feature, but not have it loaded into >memory >> >or used >> >until it is really needed. >> > >> >thanks, >> > >> > - Joel >> >> Well, where are the modules? They must exist in the filesystem. > >The scheme of loading a module doesn't depend on _where_ the module is >on the >filesystem. As long as a distro knows how to load a module in its own >way (by >looking into whichever paths it cares about), that's all that matters. >And >the module contains compressed headers which saves space, vs storing it >uncompressed on the file system. > >To remove complete reliance on the filesystem, there is an option of >not >building it as a module, and making it as a built-in.
> >I think I see your point now - you're saying if its built-in, then it >becomes kernel memory that is lost and unswappable. Did I get that >right? >I am saying that if that's a major concern, then: >1. Don't make it a built-in, make it a module. >2. Don't enable it at for your distro, and use a linux-headers package >or >whatever else you have been using so far that works for you. > >thanks, > > - Joel My point is that if we're going to actually solve a problem, we need to make it so that the distro won't just disable it anyway, and it ought to be something scalable; otherwise nothing is gained. I am *not* disagreeing with the problem statement! Now, /proc isn't something that will autoload modules. A filesystem *will*, although you need to be able to mount it; furthermore, it makes it trivial to extend it (and the firmware interface provides an easy way to feed the data to such a filesystem without having to muck with anything magic.) Heck, we could even make it a squashfs image that can just be mounted. So, first of all, where does Android keep its modules, and what is actually included? Is /sbin/modprobe used to load the modules, as is normal? We might even be able to address this with some fairly trivial enhancements to modprobe; specifically to search in the module paths for something that isn't a module per se. The best scenario would be if we could simply have the tools find the location equivalent of /lib/modules/$version/source... -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
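If tooling did learn to look in the module tree as suggested, the lookup is one uname away. A trivial userspace sketch of the /lib/modules/$version/source convention mentioned above (the path convention itself is the assumption under discussion):

#include <stdio.h>
#include <sys/utsname.h>

/* Resolve the per-kernel source/headers directory the way a BPF-style
 * tool would under the suggested convention. */
static int kernel_source_dir(char *buf, size_t len)
{
	struct utsname u;

	if (uname(&u) != 0)
		return -1;
	snprintf(buf, len, "/lib/modules/%s/source", u.release);
	return 0;
}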
Re: [RFC] Provide in-kernel headers for making it easy to extend the kernel
On January 20, 2019 8:10:03 AM PST, Joel Fernandes wrote: >On Sat, Jan 19, 2019 at 11:01:13PM -0800, h...@zytor.com wrote: >> On January 19, 2019 2:36:06 AM PST, Greg KH > wrote: >> >On Sat, Jan 19, 2019 at 02:28:00AM -0800, Christoph Hellwig wrote: >> >> This seems like a pretty horrible idea and waste of kernel memory. >> > >> >It's only a waste if you want it to be a waste, i.e. if you load the >> >kernel module. >> > >> >This really isn't any different from how /proc/config.gz works. >> > >> >> Just add support to kbuild to store a compressed archive in >initramfs >> >> and unpack it in the right place. >> > >> >I think the issue is that some devices do not use initramfs, or >switch >> >away from it after init happens or something like that. Joel has >all >> >of >> >the looney details that he can provide. >> > >> >thanks, >> > >> >greg k-h >> >> Yeah, well... but it is kind of a losing game... the more in-kernel >stuff there is the less likely things are to actually be supported. > >It is better than nothing, and if this makes things a bit easier and >solves >real-world issues people have been having, and is optional, then I >don't see >why not. > >> Modularizing it is in some ways even crazier in the sense that at >that point you are relying on the filesystem containing the module, >which has to be loaded into the kernel by a root user. One could even >wonder if a better way to do this would be to have "make >modules_install" park an archive file – or even a directory as opposed >to a symlink – with this stuff in /lib/modules. We could even provide a >tmpfs shim which autoloads such an archive via the firmware loader; >this might even be generically useful, who knows. > >All this seems to assume where the modules are located. In Android, we >don't >have /lib/modules. This patch generically fits into the grand scheme >things >and I think is just better made a part of the kernel since it is not >that >huge once compressed, as Dan also pointed. The more complex, and the >more >assumptions we make, the less likely people writing tools will get it >right >and be able to easily use it. >> >> Note also that initramfs contents can be built into the kernel. >Extracting such content into a single-instance tmpfs would again be a >possibility > >Such an approach would bloat the kernel image size though, which may >not work >for everyone. The module based approach, on the other hand, gives an >option >to the user to enable the feature, but not have it loaded into memory >or used >until it is really needed. > >thanks, > > - Joel Well, where are the modules? They must exist in the filesystem. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [PATCH v3 2/2] initramfs: introduce do_readxattrs()
On May 20, 2019 2:39:46 AM PDT, Roberto Sassu wrote: >On 5/18/2019 12:17 AM, Arvind Sankar wrote: >> On Fri, May 17, 2019 at 02:47:31PM -0700, H. Peter Anvin wrote: >>> On 5/17/19 2:02 PM, Arvind Sankar wrote: >>>> On Fri, May 17, 2019 at 01:18:11PM -0700, h...@zytor.com wrote: >>>>> >>>>> Ok... I just realized this does not work for a modular initramfs, >composed at load time from multiple files, which is a very real >problem. Should be easy enough to deal with: instead of one large file, >use one companion file per source file, perhaps something like >filename..xattrs (suggesting double dots to make it less likely to >conflict with a "real" file.) No leading dot, as it makes it more >likely that archivers will sort them before the file proper. >>>> This version of the patch was changed from the previous one exactly >to deal with this case -- >>>> it allows for the bootloader to load multiple initramfs archives, >each >>>> with its own .xattr-list file, and to have that work properly. >>>> Could you elaborate on the issue that you see? >>>> >>> >>> Well, for one thing, how do you define "cpio archive", each with its >own >>> .xattr-list file? Second, that would seem to depend on the ordering, >no, >>> in which case you depend critically on .xattr-list file following >the >>> files, which most archivers won't do. >>> >>> Either way it seems cleaner to have this per file; especially if/as >it >>> can be done without actually mucking up the format. >>> >>> I need to run, but I'll post a more detailed explanation of what I >did >>> in a little bit. >>> >>> -hpa >>> >> Not sure what you mean by how do I define it? Each cpio archive will >> contain its own .xattr-list file with signatures for the files within >> it, that was the idea. >> >> You need to review the code more closely I think -- it does not >depend >> on the .xattr-list file following the files to which it applies. >> >> The code first extracts .xattr-list as though it was a regular file. >If >> a later dupe shows up (presumably from a second archive, although the >> patch will actually allow a second one in the same archive), it will >> then process the existing .xattr-list file and apply the attributes >> listed within it. It then will proceed to read the second one and >> overwrite the first one with it (this is the normal behaviour in the >> kernel cpio parser). At the end once all the archives have been >> extracted, if there is an .xattr-list file in the rootfs it will be >> parsed (it would've been the last one encountered, which hasn't been >> parsed yet, just extracted). >> >> Regarding the idea to use the high 16 bits of the mode field in >> the header that's another possibility. It would just require >additional >> support in the program that actually creates the archive though, >which >> the current patch doesn't. > >Yes, for adding signatures for a subset of files, no changes to the ram >disk generator are necessary. Everything is done by a custom module. To >support a generic use case, it would be necessary to modify the >generator to execute getfattr and the awk script after files have been >placed in the temporary directory. > >If I understood the new proposal correctly, it would be task for cpio >to >read file metadata after the content and create a new record for each >file with mode 0x18000, type of metadata encoded in the file name and >metadata as file content. I don't know how easy it would be to modify >cpio. Probably the amount of changes would be reasonable. > >The kernel will behave in a similar way. 
It will call do_readxattrs() >in >do_copy() for each file. Since the only difference between the current >and the new proposal would be two additional calls to do_readxattrs() >in >do_name() and unpack_to_rootfs(), maybe we could support both. > >Roberto The nice thing with explicit metadata is that it doesn't have to contain the filename per se, and each file is self-contained. There is a reason why each cpio header starts with the magic number: each cpio record is formally independent and can be processed in isolation. The TRAILER!!! thing is a huge wart in the format, although in practice TRAILER!!! always has a mode of 0 and so can be distinguished from an actual file. The use of mode 0x18000 for metadata allows for optional backwards compatibility for extraction; for encoding, this can be handled with very simple postprocessing. So my suggestion would be to encode the metadata as separate records of this type, one per file, with the type of metadata encoded in the file name. -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
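To make the 0x18000 convention concrete, here is a userspace-style sketch of how an extractor might classify records in a newc cpio stream. The field layout is the standard newc header; the metadata convention itself (c_mode of 0x18000, type encoded in the file name) is only the proposal under discussion, not an established format:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Each newc header field is 8 ASCII hex digits after the "070701"
 * magic; field 1 is c_mode. */
static uint32_t newc_field(const char *hdr, int idx)
{
	char buf[9];

	memcpy(buf, hdr + 6 + 8 * idx, 8);
	buf[8] = '\0';
	return (uint32_t)strtoul(buf, NULL, 16);
}

/*
 * Proposed convention: metadata records carry 0x18000 in c_mode,
 * i.e. S_IFREG (0x8000) plus a bit above the traditional 16-bit
 * mode.  An old extractor masking with S_IFMT sees a regular file,
 * which is where the optional backwards compatibility comes from;
 * a new one dispatches on the full value.
 */
static int newc_is_metadata(const char *hdr)
{
	return newc_field(hdr, 1) == 0x18000;
}

/* TRAILER!!! can still be told apart: in practice its c_mode is 0. */
static int newc_is_trailer(const char *hdr)
{
	return newc_field(hdr, 1) == 0;
}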