Re: [Xen-devel] [PATCH v2 1/2] x86/xpti: really hide almost all of Xen image

2018-03-05 Thread Jan Beulich
>>> On 02.03.18 at 18:23,  wrote:
> On 02/03/18 17:04, Jan Beulich wrote:
> On 02.03.18 at 17:53,  wrote:
>>> On 02/03/18 14:34, Jan Beulich wrote:
 Note that the removed BUILD_BUG_ON()s don't get replaced by anything -
 there already is a suitable ASSERT() in xen.lds.S.
>>> This isn't quite true.  You've changed the mechanism by which the stubs
>>> get mapped (from entirely common, to per-pcpu), removing the need for
>>> the BUILD_BUG_ON().
>>>
>>> The ASSERT() in xen.lds.S serves a different purpose, checking that the
>>> sum total of the stubs doesn't overlap with the compiled code.  (On this
>>> note... do we perform the same check for livepatches?  I can't spot
>>> anything.)
>> What you say may be true for the one that was in
>> setup_cpu_root_pgt(), but surely not the one I'm removing from
>> alloc_stub_page(). But I can drop this if you prefer.
> 
> I think it might avoid some confusion.

Okay. Do you have any other comments on _this_ patch then?
The TSS-related discussion below would result in another one
anyway.

 What should we do with the TSS? Currently together with it we expose
 almost a full page of other per-CPU data. A simple (but slightly
 hackish) option would be to use one of the two unused stack slots.
>>> In 64bit, the TSS can be mapped read-only, because hardware never has
>>> cause to write to it.
>>>
>>> I believe that Linux now uses a read-only TSS mapping to double as a
>>> guard page for the trampoline stack, which is a less hacky way of
>>> thinking about it.
>>>
>>> However, doing that in Xen would mean shattering the directmap
>>> superpages in all cases, and we'd inherit the SVM triple fault case into
>>> release builds.  A different alternative (and perhaps simpler to
>>> backport) might be to have .bss.percpu.page_aligned and use that to hide
>>> the surrounding data.
>> Well, yes, that's obviously an option, but pretty wasteful. I'd then
>> be tempted to at least do some sharing of the page similar to how
>> the stubs of several CPUs share a single page.
> 
> For backport to older releases?
> 
> I think the extra almost 4k per pcpu isn't going to concern people (it's
> the least of their problems right now), and there is a very tangible
> benefit of not leaking the other surrounding data.

Well, yes and no. I can see why people wouldn't be overly
concerned, but OTOH I pretty much dislike today's general attitude
towards this sort of wasteful use of resources. That isn't limited
to the software world; plus, I'm sorry, but I grew up with
less-than-1MB environments.

>>> Thinking about it, we've got the same problem with the TSS as the BSP
>>> IDT, if the link order happens to cause init_tss to cross a page boundary.
> I don't think so, no - the structure is 128 bytes in size and
> 128-byte aligned. When I created the original XPTI light patch I did
>> specifically check.
> 
> This only happens by chance, because sizeof(struct tss_struct) ==
> SMP_CACHE_BYTES.

Sort of true; OTOH the comment next to __cacheline_filler[] does state that intention.

> If we intend to rely on this behaviour, we want something like this:
> 
> diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
> index 9c70a98..fe647dc 100644
> --- a/xen/include/asm-x86/processor.h
> +++ b/xen/include/asm-x86/processor.h
> @@ -385,7 +385,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
>  #define IOBMP_BYTES 8192
>  #define IOBMP_INVALID_OFFSET    0x8000
>  
> -struct __packed __cacheline_aligned tss_struct {
> +struct __packed tss_struct {
>  uint32_t :32;
>  uint64_t rsp0, rsp1, rsp2;
>  uint64_t :64;
> @@ -398,7 +398,7 @@ struct __packed __cacheline_aligned tss_struct {
>  uint16_t :16, bitmap;
>  /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
>  uint8_t __cacheline_filler[24];
> -};
> +} __aligned(sizeof(struct tss_struct));
>  
>  #define IST_NONE 0UL
>  #define IST_DF   1UL
> 
> except that C can't cope with this expression.  I wonder if there is an
> alternate way with typedefs.

Typedefs can't possibly help, as they're fully equivalent to the
types they refer to. Wrapping another structure around the "raw"
one might help.
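
For illustration, a rough sketch of that wrapper idea (the tss_raw
name is made up; the field list is taken from the structure the diff
above modifies):

/* Define the raw layout first, so its size is known... */
struct __packed tss_raw {
    uint32_t :32;
    uint64_t rsp0, rsp1, rsp2;
    uint64_t :64;
    uint64_t ist[7];
    uint64_t :64;
    uint16_t :16, bitmap;
    /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
    uint8_t __cacheline_filler[24];
};

/* ...then wrap it, so the alignment can legally refer to that size. */
struct tss_struct {
    struct tss_raw raw;
} __aligned(sizeof(struct tss_raw));

All accesses would need to go through the .raw member (or a thin set
of wrapper macros), which is the main downside of this shape.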

But perhaps a BUILD_BUG_ON() would do? Or even just replacing 24
by SMP_CACHE_BYTES - 104 (yielding a negative array size should
SMP_CACHE_BYTES ever shrink for whatever reason), plus perhaps
adding __cacheline_aligned.
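
As a rough illustration of the BUILD_BUG_ON() route (the helper name
below is made up purely to give the check a home; in practice it would
be dropped into whichever function already deals with the TSS, since
BUILD_BUG_ON() needs function scope):

static inline void check_tss_layout(void)
{
    /* init_tss can only be relied upon not to cross a page boundary if
     * the structure's size and alignment both equal SMP_CACHE_BYTES. */
    BUILD_BUG_ON(sizeof(struct tss_struct) != SMP_CACHE_BYTES);
    BUILD_BUG_ON(__alignof__(struct tss_struct) != SMP_CACHE_BYTES);
}

The SMP_CACHE_BYTES - 104 padding variant expresses much the same
invariant inside the structure definition itself, at the price of a
magic 104 (the size of the fields preceding the filler).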

Jan


Re: [Xen-devel] [PATCH v2 1/2] x86/xpti: really hide almost all of Xen image

2018-03-02 Thread Andrew Cooper
On 02/03/18 17:04, Jan Beulich wrote:
 On 02.03.18 at 17:53,  wrote:
>> On 02/03/18 14:34, Jan Beulich wrote:
>>> Note that the removed BUILD_BUG_ON()s don't get replaced by anything -
>>> there already is a suitable ASSERT() in xen.lds.S.
>> This isn't quite true.  You've changed the mechanism by which the stubs
>> get mapped (from entirely common, to per-pcpu), removing the need for
>> the BUILD_BUG_ON().
>>
>> The ASSERT() in xen.lds.S serves a different purpose, checking that the
>> sum total of the stubs doesn't overlap with the compiled code.  (On this
>> note... do we perform the same check for livepatches?  I can't spot
>> anything.)
> What you say may be true for the one that was in
> setup_cpu_root_pgt(), but surely not the one I'm removing from
> alloc_stub_page(). But I can drop this if you prefer.

I think it might avoid some confusion.

>
>>> What should we do with the TSS? Currently together with it we expose
>>> almost a full page of other per-CPU data. A simple (but slightly
>>> hackish) option would be to use one of the two unused stack slots.
>> In 64bit, the TSS can be mapped read-only, because hardware never has
>> cause to write to it.
>>
>> I believe that Linux now uses a read-only TSS mapping to double as a
>> guard page for the trampoline stack, which is a less hacky way of
>> thinking about it.
>>
>> However, doing that in Xen would mean shattering the directmap
>> superpages in all cases, and we'd inherit the SVM triple fault case into
>> release builds.  A different alternative (and perhaps simpler to
>> backport) might be to have .bss.percpu.page_aligned and use that to hide
>> the surrounding data.
> Well, yes, that's obviously an option, but pretty wasteful. I'd then
> be tempted to at least do some sharing of the page similar to how
> the stubs of several CPUs share a single page.

For backport to older releases?

I think the extra almost 4k per pcpu isn't going to concern people (it's
the least of their problems right now), and there is a very tangible
benefit of not leaking the other surrounding data.

>
>> Thinking about it, we've got the same problem with the TSS as the BSP
>> IDT, if the link order happens to cause init_tss to cross a page boundary.
> I don't think so, no - the structure is 128 bytes in size and
> 128-byte aligned. When I created the original XPTI light patch I did
> specifically check.

This only happens by chance, because sizeof(struct tss_struct) ==
SMP_CACHE_BYTES.

If we intend to rely on this behaviour, we want something like this:

diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 9c70a98..fe647dc 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -385,7 +385,7 @@ static always_inline void __mwait(unsigned long eax, unsigned long ecx)
 #define IOBMP_BYTES 8192
 #define IOBMP_INVALID_OFFSET    0x8000
 
-struct __packed __cacheline_aligned tss_struct {
+struct __packed tss_struct {
 uint32_t :32;
 uint64_t rsp0, rsp1, rsp2;
 uint64_t :64;
@@ -398,7 +398,7 @@ struct __packed __cacheline_aligned tss_struct {
 uint16_t :16, bitmap;
 /* Pads the TSS to be cacheline-aligned (total size is 0x80). */
 uint8_t __cacheline_filler[24];
-};
+} __aligned(sizeof(struct tss_struct));
 
 #define IST_NONE 0UL
 #define IST_DF   1UL

except that C can't cope with this expression.  I wonder if there is an
alternate way with typedefs.

~Andrew


Re: [Xen-devel] [PATCH v2 1/2] x86/xpti: really hide almost all of Xen image

2018-03-02 Thread Jan Beulich
>>> On 02.03.18 at 17:53,  wrote:
> On 02/03/18 14:34, Jan Beulich wrote:
>> Note that the removed BUILD_BUG_ON()s don't get replaced by anything -
>> there already is a suitable ASSERT() in xen.lds.S.
> 
> This isn't quite true.  You've changed the mechanism by which the stubs
> get mapped (from entirely common, to per-pcpu), removing the need for
> the BUILD_BUG_ON().
> 
> The ASSERT() in xen.lds.S serves a different purpose, checking that the
> sum total of the stubs doesn't overlap with the compiled code.  (On this
> note... do we perform the same check for livepatches?  I can't spot
> anything.)

What you say may be true for the one that was in
setup_cpu_root_pgt(), but surely not the one I'm removing from
alloc_stub_page(). But I can drop this if you prefer.

>> What should we do with the TSS? Currently together with it we expose
>> almost a full page of other per-CPU data. A simple (but slightly
>> hackish) option would be to use one of the two unused stack slots.
> 
> In 64bit, the TSS can be mapped read-only, because hardware never has
> cause to write to it.
> 
> I believe that Linux now uses a read-only TSS mapping to double as a
> guard page for the trampoline stack, which is a less hacky way of
> thinking about it.
> 
> However, doing that in Xen would mean shattering the directmap
> superpages in all cases, and we'd inherit the SVM triple fault case into
> release builds.  A different alternative (and perhaps simpler to
> backport) might be to have .bss.percpu.page_aligned and use that to hide
> the surrounding data.

Well, yes, that's obviously an option, but pretty wasteful. I'd then
be tempted to at least do some sharing of the page similar to how
the stubs of several CPUs share a single page.

> Thinking about it, we've got the same problem with the TSS as the BSP
> IDT, if the link order happens to cause init_tss to cross a page boundary.

I don't think so, no - the structure is 128 bytes in size and
128-byte aligned. When I created the original XPTI light patch I did
specifically check.

Jan



Re: [Xen-devel] [PATCH v2 1/2] x86/xpti: really hide almost all of Xen image

2018-03-02 Thread Andrew Cooper
On 02/03/18 14:34, Jan Beulich wrote:
> Commit 422588e885 ("x86/xpti: Hide almost all of .text and all
> .data/.rodata/.bss mappings") carefully limited the Xen image cloning to
> just entry code, but then overwrote the just-allocated and populated L3
> entry with the normal one, which again covers both the Xen image and the
> stubs.

Some version of this patch definitely worked correctly, but it is clear
that this version doesn't.  Sorry :(

>
> Drop the respective code in favor of an explicit clone_mapping()
> invocation. This in turn now requires setup_cpu_root_pgt() to run after
> stub setup in all cases. Additionally, with (almost) no unintended
> mappings left, the BSP's IDT now also needs to be page aligned.
>
> Note that the removed BUILD_BUG_ON()s don't get replaced by anything -
> there already is a suitable ASSERT() in xen.lds.S.

This isn't quite true.  You've changed the mechanism by which the stubs
get mapped (from entirely common, to per-pcpu), removing the need for
the BUILD_BUG_ON().

The ASSERT() in xen.lds.S serves a different purpose, checking that the
sum total of the stubs doesn't overlap with the compiled code.  (On this
note... do we perform the same check for livepatches?  I can't spot
anything.)

>
> Moving cleanup_cpu_root_pgt() ahead is not strictly necessary for
> functionality, but things are more logical this way, and cleanup is
> still done in the inverse order of setup.
>
> Signed-off-by: Jan Beulich 
> ---
> v2: Add missing cleanup of the stub mapping.
> ---
> What should we do with the TSS? Currently together with it we expose
> almost a full page of other per-CPU data. A simple (but slightly
> hackish) option would be to use one of the two unused stack slots.

In 64bit, the TSS can be mapped read-only, because hardware never has
cause to write to it.

I believe that Linux now uses a read-only TSS mapping to double as a
guard page for the trampoline stack, which is a less hacky way of
thinking about it.

However, doing that in Xen would mean shattering the directmap
superpages in all cases, and we'd inherit the SVM triple fault case into
release builds.  A different alternative (and perhaps simpler to
backport) might be to have .bss.percpu.page_aligned and use that to hide
the surrounding data.
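
For concreteness, a minimal sketch of that idea (tss_page and
init_tss_page are made-up names, and the section/linker plumbing is
glossed over):

/* Give the TSS sole ownership of a page-aligned, page-sized per-CPU
 * slot, so cloning that page cannot leak neighbouring per-CPU data. */
struct tss_page {
    struct tss_struct tss;
    /* Explicit padding, so nothing else ever shares the cloned page. */
    uint8_t pad[PAGE_SIZE - sizeof(struct tss_struct)];
} __aligned(PAGE_SIZE);

DEFINE_PER_CPU(struct tss_page, init_tss_page);

For the alignment to actually hold at run time, the per-CPU area would
need to honour it, which is what a dedicated, page-aligned
.bss.percpu.page_aligned output section (placed ahead of .bss.percpu in
xen.lds.S) would guarantee; clone_mapping() of &per_cpu(init_tss_page,
cpu).tss would then expose nothing but the TSS page.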

Thinking about it, we've got the same problem with the TSS as the BSP
IDT, if the link order happens to cause init_tss to cross a page boundary.

~Andrew


[Xen-devel] [PATCH v2 1/2] x86/xpti: really hide almost all of Xen image

2018-03-02 Thread Jan Beulich
Commit 422588e885 ("x86/xpti: Hide almost all of .text and all
.data/.rodata/.bss mappings") carefully limited the Xen image cloning to
just entry code, but then overwrote the just-allocated and populated L3
entry with the normal one, which again covers both the Xen image and the
stubs.

Drop the respective code in favor of an explicit clone_mapping()
invocation. This in turn now requires setup_cpu_root_pgt() to run after
stub setup in all cases. Additionally, with (almost) no unintended
mappings left, the BSP's IDT now also needs to be page aligned.

Note that the removed BUILD_BUG_ON()s don't get replaced by anything -
there already is a suitable ASSERT() in xen.lds.S.

Moving cleanup_cpu_root_pgt() ahead is not strictly necessary for
functionality, but things are more logical this way, and cleanup is
still done in the inverse order of setup.

Signed-off-by: Jan Beulich 
---
v2: Add missing cleanup of the stub mapping.
---
What should we do with the TSS? Currently together with it we expose
almost a full page of other per-CPU data. A simple (but slightly
hackish) option would be to use one of the two unused stack slots.

--- a/xen/arch/x86/smpboot.c
+++ b/xen/arch/x86/smpboot.c
@@ -622,9 +622,6 @@ unsigned long alloc_stub_page(unsigned i
 unmap_domain_page(memset(__map_domain_page(pg), 0xcc, PAGE_SIZE));
 }
 
-/* Confirm that all stubs fit in a single L3 entry. */
-BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L3_PAGETABLE_SHIFT));
-
 stub_va = XEN_VIRT_END - (cpu + 1) * PAGE_SIZE;
 if ( map_pages_to_xen(stub_va, mfn_x(page_to_mfn(pg)), 1,
   PAGE_HYPERVISOR_RX | MAP_SMALL_PAGES) )
@@ -758,12 +755,12 @@ static int clone_mapping(const void *ptr
 boolean_param("xpti", opt_xpti);
 DEFINE_PER_CPU(root_pgentry_t *, root_pgt);
 
+static root_pgentry_t common_pgt;
+
 extern const char _stextentry[], _etextentry[];
 
 static int setup_cpu_root_pgt(unsigned int cpu)
 {
-static root_pgentry_t common_pgt;
-
 root_pgentry_t *rpt;
 unsigned int off;
 int rc;
@@ -786,8 +783,6 @@ static int setup_cpu_root_pgt(unsigned i
 /* One-time setup of common_pgt, which maps .text.entry and the stubs. */
 if ( unlikely(!root_get_intpte(common_pgt)) )
 {
-unsigned long stubs_linear = XEN_VIRT_END - 1;
-l3_pgentry_t *stubs_main, *stubs_shadow;
 const char *ptr;
 
 for ( rc = 0, ptr = _stextentry;
@@ -797,16 +792,6 @@ static int setup_cpu_root_pgt(unsigned i
 if ( rc )
 return rc;
 
-/* Confirm that all stubs fit in a single L3 entry. */
-BUILD_BUG_ON(NR_CPUS * PAGE_SIZE > (1u << L3_PAGETABLE_SHIFT));
-
-stubs_main = l4e_to_l3e(idle_pg_table[l4_table_offset(stubs_linear)]);
-stubs_shadow = l4e_to_l3e(rpt[l4_table_offset(stubs_linear)]);
-
-/* Splice into the regular L2 mapping the stubs. */
-stubs_shadow[l3_table_offset(stubs_linear)] =
-stubs_main[l3_table_offset(stubs_linear)];
-
 common_pgt = rpt[root_table_offset(XEN_VIRT_START)];
 }
 
@@ -820,6 +805,8 @@ static int setup_cpu_root_pgt(unsigned i
 rc = clone_mapping(idt_tables[cpu], rpt);
 if ( !rc )
 rc = clone_mapping(&per_cpu(init_tss, cpu), rpt);
+if ( !rc )
+rc = clone_mapping((void *)per_cpu(stubs.addr, cpu), rpt);
 
 return rc;
 }
@@ -828,6 +815,7 @@ static void cleanup_cpu_root_pgt(unsigne
 {
 root_pgentry_t *rpt = per_cpu(root_pgt, cpu);
 unsigned int r;
+unsigned long stub_linear = per_cpu(stubs.addr, cpu);
 
 if ( !rpt )
 return;
@@ -872,6 +860,16 @@ static void cleanup_cpu_root_pgt(unsigne
 }
 
 free_xen_pagetable(rpt);
+
+/* Also zap the stub mapping for this CPU. */
+if ( stub_linear )
+{
+l3_pgentry_t *l3t = l4e_to_l3e(common_pgt);
+l2_pgentry_t *l2t = l3e_to_l2e(l3t[l3_table_offset(stub_linear)]);
+l1_pgentry_t *l1t = l2e_to_l1e(l2t[l2_table_offset(stub_linear)]);
+
+l1t[l1_table_offset(stub_linear)] = l1e_empty();
+}
 }
 
 static void cpu_smpboot_free(unsigned int cpu)
@@ -895,6 +893,8 @@ static void cpu_smpboot_free(unsigned in
 if ( per_cpu(scratch_cpumask, cpu) != &scratch_cpu0mask )
 free_cpumask_var(per_cpu(scratch_cpumask, cpu));
 
+cleanup_cpu_root_pgt(cpu);
+
 if ( per_cpu(stubs.addr, cpu) )
 {
 mfn_t mfn = _mfn(per_cpu(stubs.mfn, cpu));
@@ -912,8 +912,6 @@ static void cpu_smpboot_free(unsigned in
 free_domheap_page(mfn_to_page(mfn));
 }
 
-cleanup_cpu_root_pgt(cpu);
-
 order = get_order_from_pages(NR_RESERVED_GDT_PAGES);
 free_xenheap_pages(per_cpu(gdt_table, cpu), order);
 
@@ -968,11 +966,6 @@ static int cpu_smpboot_alloc(unsigned in
 memcpy(idt_tables[cpu], idt_table, IDT_ENTRIES * sizeof(idt_entry_t));
 disable_each_ist(idt_tables[cpu]);
 
-rc = setup_cpu_root_pgt(cpu);
-if ( rc )
-goto out;
-rc = -ENOMEM;
-
 for ( stub_page = 0, i = cpu &