Re: [PATCH] xen: fix multicall debug data referencing
On 7/18/24 2:21 AM, Juergen Gross wrote: The recent adding of multicall debug mixed up the referencing of the debug data. A __percpu tagged pointer can't be initialized with a plain pointer, so use another percpu variable for the pointer and set it on each new cpu via a function. Fixes: 942d917cb92a ("xen: make multicall debug boot time selectable") Reported-by: kernel test robot Closes: https://lore.kernel.org/oe-kbuild-all/202407151106.5s7mnfpz-...@intel.com/ Signed-off-by: Juergen Gross IIUIC we only need data until SMP is initialized, so setting per-cpu pointer just on cpu0 may be sufficient. Not sure where to do this though. But this works as well. Reviewed-by: Boris Ostrovsky
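A minimal sketch of the pattern the fix describes (identifiers are illustrative, not copied from the actual patch): instead of statically initializing a __percpu tagged pointer with a plain pointer, keep a per-cpu pointer variable and set it from a function as each CPU is brought up.

static DEFINE_PER_CPU(struct mc_debug_data, mc_debug_data_buf);
static DEFINE_PER_CPU(struct mc_debug_data *, mc_debug_data);

/* Run once per CPU being brought up, including the boot CPU. */
void mc_percpu_init(unsigned int cpu)
{
        per_cpu(mc_debug_data, cpu) = &per_cpu(mc_debug_data_buf, cpu);
}

The multicall code would then use __this_cpu_read(mc_debug_data) rather than dereferencing the __percpu pointer directly.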
Re: [PATCH] x86/xen: remove deprecated xen_nopvspin boot parameter
On 7/10/24 7:01 AM, Juergen Gross wrote: The xen_nopvspin boot parameter has been deprecated since 2019. nopvspin can be used instead. Remove the xen_nopvspin boot parameter and replace the xen_pvspin variable use cases with nopvspin. This requires moving the nopvspin variable out of the .initdata section, as it needs to be accessed for cpu hotplug, too. Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH 0/2] x86/xen: cleanup of xen private headers
On 7/10/24 5:37 AM, Juergen Gross wrote: Cleanup the private header files in arch/x86/xen by merging them into one file and by removing unneeded stuff. Juergen Gross (2): x86/xen: make some functions static x86/xen: eliminate some private header files arch/x86/xen/apic.c | 2 - arch/x86/xen/debugfs.c | 2 +- arch/x86/xen/debugfs.h | 7 -- arch/x86/xen/enlighten.c | 2 - arch/x86/xen/enlighten_hvm.c | 2 - arch/x86/xen/enlighten_pv.c | 4 - arch/x86/xen/mmu.c | 3 +- arch/x86/xen/mmu.h | 28 --- arch/x86/xen/mmu_hvm.c | 2 +- arch/x86/xen/mmu_pv.c| 15 ++-- arch/x86/xen/multicalls.c| 3 +- arch/x86/xen/multicalls.h| 69 arch/x86/xen/p2m.c | 2 - arch/x86/xen/pmu.c | 1 - arch/x86/xen/pmu.h | 22 -- arch/x86/xen/setup.c | 1 - arch/x86/xen/smp.c | 1 - arch/x86/xen/smp.h | 51 arch/x86/xen/smp_hvm.c | 2 - arch/x86/xen/smp_pv.c| 3 - arch/x86/xen/suspend.c | 2 - arch/x86/xen/xen-ops.h | 148 ++- include/xen/events.h | 2 + 23 files changed, 158 insertions(+), 216 deletions(-) delete mode 100644 arch/x86/xen/debugfs.h delete mode 100644 arch/x86/xen/mmu.h delete mode 100644 arch/x86/xen/multicalls.h delete mode 100644 arch/x86/xen/pmu.h delete mode 100644 arch/x86/xen/smp.h Reviewed-by: Boris Ostrovsky
Re: [PATCH v2] xen: make multicall debug boot time selectable
On 7/10/24 5:27 AM, Juergen Gross wrote: Today Xen multicall debugging needs to be enabled via modifying a define in a source file for getting debug data of multicall errors encountered by users. Switch multicall debugging to depend on a boot parameter "xen_mc_debug" instead, enabling affected users to boot with the new parameter set in order to get better diagnostics. With debugging enabled print the following information in case at least one of the batched calls failed: - all calls of the batch with operation, result and caller - all parameters of each call - all parameters stored in the multicall data for each call Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH] xen: make multicall debug boot time selectable
On 7/8/24 5:04 AM, Jürgen Groß wrote: On 06.07.24 00:36, boris.ostrov...@oracle.com wrote: Also, would it be better to keep these fields as a struct of scalars and instead have the percpu array of this struct? Otherwise there is a whole bunch of [MC_BATCH] arrays, all of them really indexed by the same value. (And while at it, there is no reason to have callbacks[MC_BATCH] sized like that -- it has nothing to do with batch size and can probably be made smaller) As today the mc_buffer's entries are copied via a single memcpy(), there are 3 options: Ah yes, it's memcpy, I didn't think of that. Then leaving it as is is the best. - make mc_debug_data a percpu pointer to a single array, requiring to copy the mc_buffer's entries in a loop - let struct mc_debug_data contain two arrays (entries[] and struct foo {}[], with struct foo containing the other pointers/values) - keep the layout as in my patch Regarding the callbacks: I think the max number of callbacks is indeed MC_BATCH, as for each batch member one callback might be requested. So I'd rather keep it the way it is today. Right, I was trying to point out that it's the max number but I suspect it usually is smaller -- we currently ask for a callback in fewer than half of the cases where we submit a request. -boris
Re: [PATCH] xen: make multicall debug boot time selectable
On 7/3/24 7:56 AM, Juergen Gross wrote: #define MC_BATCH 32 -#define MC_DEBUG 0 - #define MC_ARGS (MC_BATCH * 16) struct mc_buffer { unsigned mcidx, argidx, cbidx; struct multicall_entry entries[MC_BATCH]; -#if MC_DEBUG - struct multicall_entry debug[MC_BATCH]; - void *caller[MC_BATCH]; -#endif unsigned char args[MC_ARGS]; struct callback { void (*fn)(void *); @@ -50,13 +46,84 @@ struct mc_buffer { } callbacks[MC_BATCH]; }; +struct mc_debug_data { + struct multicall_entry debug[MC_BATCH]; 'entries'? It's a mc_debug_data's copy of mc_buffer's entries. Also, would it be better to keep these fields as a struct of scalars and instead have the percpu array of this struct? Otherwise there is a whole bunch of [MC_BATCH] arrays, all of them really indexed by the same value. (And while at it, there is no reason to have callbacks[MC_BATCH] sized like that -- it has nothing to do with batch size and can probably be made smaller) + void *caller[MC_BATCH]; + size_t argsz[MC_BATCH]; +}; + static DEFINE_PER_CPU(struct mc_buffer, mc_buffer); +static struct mc_debug_data __percpu *mc_debug_data; +static struct mc_debug_data mc_debug_data_early __initdata; How about (I think this should work): static struct mc_debug_data __percpu *mc_debug_data __refdata = &mc_debug_data_early; Then you won't need get_mc_debug_ptr(). -boris
Re: [PATCH] x86/xen: fix percpu vcpu_info allocation
On 11/24/23 2:48 AM, Juergen Gross wrote: Today the percpu struct vcpu_info is allocated via DEFINE_PER_CPU(), meaning that it could cross a page boundary. In this case registering it with the hypervisor will fail, resulting in a panic(). This can easily be fixed by using DEFINE_PER_CPU_ALIGNED() instead, as struct vcpu_info is guaranteed to have a size of 64 bytes, matching the cache line size of x86 64-bit processors (Xen doesn't support 32-bit processors). Fixes: 5ead97c84fa7 ("xen: Core Xen implementation") Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky although I am not sure of the usefulness of the BUILD_BUG_ON --- 64 bytes is part of the ABI and the hypervisor already has its own BUILD_BUG_ON for this. -boris @@ -160,6 +163,7 @@ void xen_vcpu_setup(int cpu) int err; struct vcpu_info *vcpup; + BUILD_BUG_ON(sizeof(*vcpup) > SMP_CACHE_BYTES); BUG_ON(HYPERVISOR_shared_info == &xen_dummy_shared_info);
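For context, the allocation change under review amounts to something like the following (assuming the existing per-cpu variable is xen_vcpu_info; the diff is illustrative, not quoted from the patch):

-DEFINE_PER_CPU(struct vcpu_info, xen_vcpu_info);
+/* 64-byte alignment keeps the 64-byte struct inside a single page. */
+DEFINE_PER_CPU_ALIGNED(struct vcpu_info, xen_vcpu_info);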
Re: [PATCH RESEND 2/2] x86/percpu, xen: Correct PER_CPU_VAR usage to include symbol and its addend
On 10/15/23 12:08 PM, Uros Bizjak wrote: The PER_CPU_VAR macro should be applied to a symbol and its addend. Inconsistent usage is currently harmless, but needs to be corrected before %rip-relative addressing is introduced to the PER_CPU_VAR macro. No functional changes intended. Cc: Juergen Gross Cc: Boris Ostrovsky Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Signed-off-by: Uros Bizjak Reviewed-by: Boris Ostrovsky
Re: [PATCH 3/3] x86/xen: allow nesting of same lazy mode
On 9/13/23 7:38 AM, Juergen Gross wrote: When running as a paravirtualized guest under Xen, Linux is using "lazy mode" for issuing hypercalls which don't need to take immediate effect in order to improve performance (examples are e.g. multiple PTE changes). There are two different lazy modes defined: MMU and CPU lazy mode. Today it is not possible to nest multiple lazy mode sections, even if they are of the same kind. A recent change in memory management added nesting of MMU lazy mode sections, resulting in a regression when running as a Xen PV guest. Technically there is no reason why nesting of multiple sections of the same kind of lazy mode shouldn't be allowed. So add support for that in order to fix the regression. Fixes: bcc6cc832573 ("mm: add default definition of set_ptes()") Signed-off-by: Juergen Gross For patches 2 and 3 Reviewed-by: Boris Ostrovsky
Re: [PATCH] xen: simplify evtchn_do_upcall() call maze
On 8/24/23 11:41 AM, Juergen Gross wrote: There are several functions involved for performing the functionality of evtchn_do_upcall(): - __xen_evtchn_do_upcall() doing the real work - xen_hvm_evtchn_do_upcall() just being a wrapper for __xen_evtchn_do_upcall(), exposed for external callers - xen_evtchn_do_upcall() calling __xen_evtchn_do_upcall(), too, but without any user Simplify this maze by: - removing the unused xen_evtchn_do_upcall() - removing xen_hvm_evtchn_do_upcall() as the only left caller of __xen_evtchn_do_upcall(), while renaming __xen_evtchn_do_upcall() to xen_evtchn_do_upcall() Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH v1] xen: remove a confusing comment on auto-translated guest I/O
On 8/2/23 12:31 PM, Petr Tesarik wrote: From: Petr Tesarik After removing the conditional return from xen_create_contiguous_region(), the accompanying comment was left in place, but it now precedes an unrelated conditional and confuses readers. Fixes: 989513a735f5 ("xen: cleanup pvh leftovers from pv-only sources") Signed-off-by: Petr Tesarik Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/xen: fix secondary processor fpu initialization
On 7/3/23 9:00 AM, Juergen Gross wrote: Moving the call of fpu__init_cpu() from cpu_init() to start_secondary() broke Xen PV guests, as those don't call start_secondary() for APs. Fix that by adding the call of fpu__init_cpu() to cpu_bringup(), which is the Xen PV replacement of start_secondary(). Fixes: b81fac906a8f ("x86/fpu: Move FPU initialization into arch_cpu_finalize_init()") Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/xen: set default memory type for pv guests to WB
On 6/15/23 8:39 AM, Juergen Gross wrote: When running as an unprivileged PV-guest under Xen (not dom0), the default MTRR memory type should be write-back. Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH v4 0/2] x86: xen: add missing prototypes
On 6/14/23 3:34 AM, Juergen Gross wrote: Avoid missing prototype warnings. Arnd Bergmann (1): x86: xen: add missing prototypes Juergen Gross (1): x86/xen: add prototypes for paravirt mmu functions arch/x86/xen/efi.c | 2 ++ arch/x86/xen/mmu_pv.c | 16 arch/x86/xen/smp.h | 4 arch/x86/xen/smp_pv.c | 1 - arch/x86/xen/xen-ops.h | 3 +++ include/xen/xen.h | 3 +++ 6 files changed, 28 insertions(+), 1 deletion(-) Reviewed-by: Boris Ostrovsky
Re: [PATCH] [v2] x86: xen: add missing prototypes
On 5/23/23 4:37 PM, Arnd Bergmann wrote: On Sat, May 20, 2023, at 00:24, Boris Ostrovsky wrote: On 5/19/23 5:28 AM, Arnd Bergmann wrote: diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h index 22fb982ff971..81a7821dd07f 100644 --- a/arch/x86/xen/smp.h +++ b/arch/x86/xen/smp.h @@ -1,7 +1,11 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _XEN_SMP_H +void asm_cpu_bringup_and_idle(void); +asmlinkage void cpu_bringup_and_idle(void); These can go under CONFIG_SMP Not sure if there is much point for the second one, since it's only called from assembler, so the #else path is never seen, but I can do that for consistency if you like. I generally prefer to avoid the extra #if checks when there is no strict need for an empty stub. Do we need the empty stubs? Neither of these should be referenced if !SMP (or more precisely if !CONFIG_XEN_PV_SMP) I guess you also want me to add the empty stubs for the other functions that only have a bit in #ifdef but not #else then? +void xen_force_evtchn_callback(void); These ... +pteval_t xen_pte_val(pte_t pte); +pgdval_t xen_pgd_val(pgd_t pgd); +pte_t xen_make_pte(pteval_t pte); +pgd_t xen_make_pgd(pgdval_t pgd); +pmdval_t xen_pmd_val(pmd_t pmd); +pmd_t xen_make_pmd(pmdval_t pmd); +pudval_t xen_pud_val(pud_t pud); +pud_t xen_make_pud(pudval_t pud); +p4dval_t xen_p4d_val(p4d_t p4d); +p4d_t xen_make_p4d(p4dval_t p4d); +pte_t xen_make_pte_init(pteval_t pte); +void xen_start_kernel(struct start_info *si); ... should go under '#ifdef CONFIG_XEN_PV' What should the stubs do here? I guess just return the unmodified value? Same here -- these should only be referenced if CONFIG_XEN_PV is defined which is why I suggested moving them under ifdef. -boris
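A sketch of the arrangement being suggested, assuming the prototypes really are only referenced from PV-only translation units so no stubs are needed:

#ifdef CONFIG_XEN_PV
void xen_force_evtchn_callback(void);
pteval_t xen_pte_val(pte_t pte);
pte_t xen_make_pte(pteval_t pte);
/* ... remaining PV-only prototypes ... */
void xen_start_kernel(struct start_info *si);
#endif /* CONFIG_XEN_PV */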
Re: [PATCH] [v2] x86: xen: add missing prototypes
On 5/19/23 5:28 AM, Arnd Bergmann wrote: diff --git a/arch/x86/xen/smp.h b/arch/x86/xen/smp.h index 22fb982ff971..81a7821dd07f 100644 --- a/arch/x86/xen/smp.h +++ b/arch/x86/xen/smp.h @@ -1,7 +1,11 @@ /* SPDX-License-Identifier: GPL-2.0 */ #ifndef _XEN_SMP_H +void asm_cpu_bringup_and_idle(void); +asmlinkage void cpu_bringup_and_idle(void); These can go under CONFIG_SMP + #ifdef CONFIG_SMP + extern void xen_send_IPI_mask(const struct cpumask *mask, int vector); extern void xen_send_IPI_mask_allbutself(const struct cpumask *mask, diff --git a/arch/x86/xen/smp_pv.c b/arch/x86/xen/smp_pv.c index a92e8002b5cf..d5ae5de2daa2 100644 diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h index 6d7f6318fc07..0f71ee3fe86b 100644 --- a/arch/x86/xen/xen-ops.h +++ b/arch/x86/xen/xen-ops.h @@ -160,4 +160,18 @@ void xen_hvm_post_suspend(int suspend_cancelled); static inline void xen_hvm_post_suspend(int suspend_cancelled) {} #endif +void xen_force_evtchn_callback(void); These ... +pteval_t xen_pte_val(pte_t pte); +pgdval_t xen_pgd_val(pgd_t pgd); +pte_t xen_make_pte(pteval_t pte); +pgd_t xen_make_pgd(pgdval_t pgd); +pmdval_t xen_pmd_val(pmd_t pmd); +pmd_t xen_make_pmd(pmdval_t pmd); +pudval_t xen_pud_val(pud_t pud); +pud_t xen_make_pud(pudval_t pud); +p4dval_t xen_p4d_val(p4d_t p4d); +p4d_t xen_make_p4d(p4dval_t p4d); +pte_t xen_make_pte_init(pteval_t pte); +void xen_start_kernel(struct start_info *si); ... should go under '#ifdef CONFIG_XEN_PV' -boris
Re: [patch 16/37] x86/xen/smp_pv: Remove wait for CPU online
On 4/14/23 7:44 PM, Thomas Gleixner wrote: Now that the core code drops sparse_irq_lock after the idle thread synchronized, it's pointless to wait for the AP to mark itself online. Whether the control CPU runs in a wait loop or sleeps in the core code waiting for the online operation to complete makes no difference. Signed-off-by: Thomas Gleixner Cc: Juergen Gross Cc: Boris Ostrovsky Cc: xen-devel@lists.xenproject.org --- arch/x86/xen/smp_pv.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) --- a/arch/x86/xen/smp_pv.c +++ b/arch/x86/xen/smp_pv.c @@ -340,11 +340,11 @@ static int xen_pv_cpu_up(unsigned int cp xen_pmu_init(cpu); - rc = HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL); - BUG_ON(rc); - - while (cpu_report_state(cpu) != CPU_ONLINE) - HYPERVISOR_sched_op(SCHEDOP_yield, NULL); + /* +* Why is this a BUG? If the hypercall fails then everything can be +* rolled back, no? +*/ In many cases this indicates either some sort of hypervisor internal error or broken logic in the guest, so it is, well, a bug. But I suppose it may also be some transient condition in the hypervisor (I don't see it now but it can happen in the future) so perhaps we should indeed try not to die on the spot. -boris + BUG_ON(HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL)); return 0; }
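If the hypercall failure were treated as recoverable, the tail of xen_pv_cpu_up() could look roughly like this (a sketch only; the rollback step is an assumption, and the patch as posted keeps the BUG_ON):

        rc = HYPERVISOR_vcpu_op(VCPUOP_up, xen_vcpu_nr(cpu), NULL);
        if (rc) {
                /* Undo the per-cpu PMU setup done just above and report the failure. */
                xen_pmu_finish(cpu);
                return rc;
        }

        return 0;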
Re: [PATCH v4 05/12] x86/xen: set MTRR state when running as Xen PV initial domain
On 3/6/23 11:34 AM, Juergen Gross wrote: When running as Xen PV initial domain (aka dom0), MTRRs are disabled by the hypervisor, but the system should nevertheless use correct cache memory types. This has always kind of worked, as disabled MTRRs resulted in disabled PAT, too, so that the kernel avoided code paths resulting in inconsistencies. This bypassed all of the sanity checks the kernel is doing with enabled MTRRs in order to avoid memory mappings with conflicting memory types. This has been changed recently, leading to PAT being accepted to be enabled, while MTRRs stayed disabled. The result is that mtrr_type_lookup() no longer is accepting all memory type requests, but started to return WB even if UC- was requested. This led to driver failures during initialization of some devices. In reality MTRRs are still in effect, but they are under complete control of the Xen hypervisor. It is possible, however, to retrieve the MTRR settings from the hypervisor. In order to fix those problems, overwrite the MTRR state via mtrr_overwrite_state() with the MTRR data from the hypervisor, if the system is running as a Xen dom0. Fixes: 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen") Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH v3 05/12] x86/xen: set MTRR state when running as Xen PV initial domain
On 2/27/23 2:12 AM, Juergen Gross wrote: On 24.02.23 22:00, Boris Ostrovsky wrote: On 2/23/23 4:32 AM, Juergen Gross wrote: + + for (reg = 0; reg < MTRR_MAX_VAR_RANGES; reg++) { + op.u.read_memtype.reg = reg; + if (HYPERVISOR_platform_op(&op)) + break; If we fail on the first iteration, do we still want to mark MTRRs are enabled/set in mtrr_overwrite_state()? Hmm, good idea. I think we should just drop the call of mtrr_overwrite_state() in this case. TBH I am not sure what the right way is to handle errors here. What if the hypercall fails on second iteration? -boris
Re: [PATCH v3 05/12] x86/xen: set MTRR state when running as Xen PV initial domain
On 2/23/23 4:32 AM, Juergen Gross wrote: + + for (reg = 0; reg < MTRR_MAX_VAR_RANGES; reg++) { + op.u.read_memtype.reg = reg; + if (HYPERVISOR_platform_op(&op)) + break; If we fail on the first iteration, do we still want to mark MTRRs as enabled/set in mtrr_overwrite_state()? -boris + + /* +* Only called in dom0, which has all RAM PFNs mapped at +* RAM MFNs, and all PCI space etc. is identity mapped. +* This means we can treat MFN == PFN regarding MTRR settings. +*/ + var[reg].base_lo = op.u.read_memtype.type; + var[reg].base_lo |= op.u.read_memtype.mfn << PAGE_SHIFT; + var[reg].base_hi = op.u.read_memtype.mfn >> (32 - PAGE_SHIFT); + mask = ~((op.u.read_memtype.nr_mfns << PAGE_SHIFT) - 1); + mask &= (1UL << width) - 1; + if (mask) + mask |= 1 << 11; + var[reg].mask_lo = mask; + var[reg].mask_hi = mask >> 32; + } + + mtrr_overwrite_state(var, reg, MTRR_TYPE_UNCACHABLE); +#endif +}
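A sketch of the adjustment discussed in the follow-up above (only call mtrr_overwrite_state() if at least one range was actually read back; how failures beyond the first iteration should be treated is exactly the open question):

        for (reg = 0; reg < MTRR_MAX_VAR_RANGES; reg++) {
                op.u.read_memtype.reg = reg;
                if (HYPERVISOR_platform_op(&op))
                        break;          /* stop at the first failed read */
                /* ... fill var[reg] as in the patch ... */
        }

        /* Don't pretend the MTRR state is known if nothing could be read. */
        if (reg)
                mtrr_overwrite_state(var, reg, MTRR_TYPE_UNCACHABLE);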
Re: [PATCH] x86/Xen: make use of IBPB controlling VM assist
On 2/15/23 3:31 AM, Jan Beulich wrote: On 15.02.2023 01:07, Boris Ostrovsky wrote: On 2/14/23 6:53 PM, Boris Ostrovsky wrote: On 2/14/23 11:13 AM, Jan Beulich wrote: --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -18,6 +18,8 @@ #include #include +#include + #include #include #include @@ -32,6 +34,7 @@ #include #include #include +#include #include #include "cpu.h" @@ -934,7 +937,8 @@ do_cmd_auto: break; case RETBLEED_MITIGATION_IBPB: - setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB); + if (!xen_pv_domain() || xen_vm_assist_ibpb(true)) Is this going to compile without CONFIG_XEN? Yes. The declaration of xen_vm_assist_ibpb() is visible (satisfying the compiler) and DCE will eliminate the call to the function due to xen_pv_domain() being constant "false" in that case, avoiding any linking issues. The interesting case here really is building with XEN but without XEN_PV: That's why I needed to put the function in enlighten.c. This wouldn't be needed if xen_pv_domain() was also constant "false" in that case (just like xen_pvh_domain() is when !XEN_PVH). I also think these two conditions should be wrapped into something to limit exposure of non-Xen code to Xen-specific primitives. I would have done so, if I had any halfway sensible idea on how to go about doing so in this particular case. In the absence of that it looked okay-ish to me to reference Xen functions directly here. Oh, and this needs x86 maintainers. Eventually yes. But I would prefer to sort the above question first (which I'm sure would have been raised by them, in perhaps more harsh a way), hence the initially limited exposure. I also think there is a bit of a disconnect between how the mitigation is reported in the log/sysfs (retbleed_mitigation is RETBLEED_MITIGATION_IBPB, so "Mitigation: IBPB") and, for example, lscpu (since X86_FEATURE_ENTRY_IBPB is not set anymore). -boris
Re: [PATCH] x86/Xen: make use of IBPB controlling VM assist
On 2/14/23 6:53 PM, Boris Ostrovsky wrote: On 2/14/23 11:13 AM, Jan Beulich wrote: --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -18,6 +18,8 @@ #include #include +#include + #include #include #include @@ -32,6 +34,7 @@ #include #include #include +#include #include #include "cpu.h" @@ -934,7 +937,8 @@ do_cmd_auto: break; case RETBLEED_MITIGATION_IBPB: - setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB); + if (!xen_pv_domain() || xen_vm_assist_ibpb(true)) Is this going to compile without CONFIG_XEN? I also think these two conditions should be wrapped into something to limit exposure of non-Xen code to Xen-specific primitives. Oh, and this needs x86 maintainers. -boris
Re: [PATCH] x86/Xen: make use of IBPB controlling VM assist
On 2/14/23 11:13 AM, Jan Beulich wrote: --- a/arch/x86/kernel/cpu/bugs.c +++ b/arch/x86/kernel/cpu/bugs.c @@ -18,6 +18,8 @@ #include #include +#include + #include #include #include @@ -32,6 +34,7 @@ #include #include #include +#include #include #include "cpu.h" @@ -934,7 +937,8 @@ do_cmd_auto: break; case RETBLEED_MITIGATION_IBPB: - setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB); + if (!xen_pv_domain() || xen_vm_assist_ibpb(true)) Is this going to compile without CONFIG_XEN? I also think these two conditions should be wrapped into something to limit exposure of non-Xen code to Xen-specific primitives. -boris + setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB); mitigate_smt = true; break;
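The kind of wrapper being asked for might look like this (the helper name and its location are hypothetical; xen_vm_assist_ibpb() is the function added by the patch under review, assumed here to return 0 on success):

/* Somewhere Xen-specific, with a trivial "false" fallback when !CONFIG_XEN_PV. */
static inline bool xen_pv_ibpb_vm_assist(void)
{
        /* True if Xen's VM assist performs the IBPB on the kernel's behalf. */
        return xen_pv_domain() && !xen_vm_assist_ibpb(true);
}

        case RETBLEED_MITIGATION_IBPB:
                if (!xen_pv_ibpb_vm_assist())
                        setup_force_cpu_cap(X86_FEATURE_ENTRY_IBPB);
                mitigate_smt = true;
                break;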
Re: [RFC 01/10] x86: introduce AMD-V and Intel VT-x Kconfig options
On 2/14/23 2:48 AM, Jan Beulich wrote: On 13.02.2023 21:53, Boris Ostrovsky wrote: On 2/13/23 11:41 AM, Jan Beulich wrote: On 13.02.2023 17:30, Xenia Ragiadakou wrote: On 2/13/23 17:11, Jan Beulich wrote: On 13.02.2023 15:57, Xenia Ragiadakou wrote: --- a/xen/arch/x86/cpu/Makefile +++ b/xen/arch/x86/cpu/Makefile @@ -10,4 +10,6 @@ obj-y += intel.o obj-y += intel_cacheinfo.o obj-y += mwait-idle.o obj-y += shanghai.o -obj-y += vpmu.o vpmu_amd.o vpmu_intel.o +obj-y += vpmu.o +obj-$(CONFIG_AMD_SVM) += vpmu_amd.o +obj-$(CONFIG_INTEL_VMX) += vpmu_intel.o This code was moved from hvm/ to cpu/ for the specific reason of also being used by PV guests. (Sadly the comments at the top weren't corrected at that time.) Hmm, the init functions are prefixed with svm/vmx. Does vpmu depend on svm/vmx support? There are interactions, but I don't think there's a dependency. You may want to ask Boris (whom I have now added to Cc). MSR intercept bits need to be manipulated, which is SVM/VMX-specific. But that's "interaction", not "dependency" aiui: The intercept bits aren't relevant for PV guests, are they? Correct, they are not. All accesses to intercept bits are under is_hvm_vcpu(). So vpmu does not depend on presence of vmx/svm support. (And init routines shouldn't be prefixed with those) -boris
Re: [RFC 01/10] x86: introduce AMD-V and Intel VT-x Kconfig options
On 2/13/23 11:41 AM, Jan Beulich wrote: On 13.02.2023 17:30, Xenia Ragiadakou wrote: On 2/13/23 17:11, Jan Beulich wrote: On 13.02.2023 15:57, Xenia Ragiadakou wrote: --- a/xen/arch/x86/cpu/Makefile +++ b/xen/arch/x86/cpu/Makefile @@ -10,4 +10,6 @@ obj-y += intel.o obj-y += intel_cacheinfo.o obj-y += mwait-idle.o obj-y += shanghai.o -obj-y += vpmu.o vpmu_amd.o vpmu_intel.o +obj-y += vpmu.o +obj-$(CONFIG_AMD_SVM) += vpmu_amd.o +obj-$(CONFIG_INTEL_VMX) += vpmu_intel.o This code was moved from hvm/ to cpu/ for the specific reason of also being used by PV guests. (Sadly the comments at the top weren't corrected at that time.) Hmm, the init functions are prefixed with svm/vmx. Does vpmu depend on svm/vmx support? There are interactions, but I don't think there's a dependency. You may want to ask Boris (whom I have now added to Cc). MSR intercept bits need to be manipulated, which is SVM/VMX-specific. -boris
Re: [PATCH 0/2] x86/xen: don't return from xen_pv_play_dead()
On 1/16/23 2:49 AM, Juergen Gross wrote: On 25.11.22 07:32, Juergen Gross wrote: All play_dead() functions but xen_pv_play_dead() don't return to the caller. Adapt xen_pv_play_dead() to behave like the other play_dead() variants. Juergen Gross (2): x86/xen: don't let xen_pv_play_dead() return x86/xen: mark xen_pv_play_dead() as __noreturn arch/x86/xen/smp.h | 2 ++ arch/x86/xen/smp_pv.c | 17 - arch/x86/xen/xen-head.S | 7 +++ tools/objtool/check.c | 1 + 4 files changed, 14 insertions(+), 13 deletions(-) Ping? Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/paravirt: merge activate_mm and dup_mmap callbacks
On 1/12/23 10:21 AM, Juergen Gross wrote: The two paravirt callbacks .mmu.activate_mm and .mmu.dup_mmap are sharing the same implementations in all cases: for Xen PV guests they are pinning the PGD of the new mm_struct, and for all other cases they are a NOP. So merge them to a common callback .mmu.enter_mmap (in contrast to the corresponding already existing .mmu.exit_mmap). As the first parameter of the old callbacks isn't used, drop it from the replacement one. Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH linux-next v2] x86/xen/time: prefer tsc as clocksource when it is invariant
On 12/14/22 1:01 PM, Krister Johansen wrote: On Tue, Dec 13, 2022 at 04:25:32PM -0500, Boris Ostrovsky wrote: On 12/12/22 5:09 PM, Krister Johansen wrote: On Mon, Dec 12, 2022 at 01:48:24PM -0500, Boris Ostrovsky wrote: On 12/12/22 11:05 AM, Krister Johansen wrote: diff --git a/arch/x86/include/asm/xen/cpuid.h b/arch/x86/include/asm/xen/cpuid.h index 6daa9b0c8d11..d9d7432481e9 100644 --- a/arch/x86/include/asm/xen/cpuid.h +++ b/arch/x86/include/asm/xen/cpuid.h @@ -88,6 +88,12 @@ * EDX: shift amount for tsc->ns conversion * Sub-leaf 2: EAX: host tsc frequency in kHz */ +#define XEN_CPUID_TSC_EMULATED (1u << 0) +#define XEN_CPUID_HOST_TSC_RELIABLE (1u << 1) +#define XEN_CPUID_RDTSCP_INSTR_AVAIL (1u << 2) +#define XEN_CPUID_TSC_MODE_DEFAULT (0) +#define XEN_CPUID_TSC_MODE_EMULATE (1u) +#define XEN_CPUID_TSC_MODE_NOEMULATE (2u) This file is a copy of Xen public interface so this change should go to Xen first. Ok, should I split this into a separate patch on the linux side too? Yes. Once the Xen patch has been accepted you will either submit the same patch for Linux or sync Linux file with Xen (if there are more differences). Thanks. Based upon the feedback I received from you and Jan, I may try to shrink the check in xen_tsc_safe_clocksource() down a bit. In that case, I may only need to refer to a single field in the leaf that provides this information. In that case, are you alright with dropping the change to the header and referring to the value directly, or would you prefer that I proceed with adding these to the public API? It would certainly be appreciated if you updated the header files but it's up to maintainers to decide whether it's required. +static int __init xen_tsc_safe_clocksource(void) +{ + u32 eax, ebx, ecx, edx; + + if (!(xen_hvm_domain() || xen_pvh_domain())) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_CONSTANT_TSC))) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_NONSTOP_TSC))) + return 0; + + if (check_tsc_unstable()) + return 0; + + cpuid(xen_cpuid_base() + 3, &eax, &ebx, &ecx, &edx); + + if (eax & XEN_CPUID_TSC_EMULATED) + return 0; + + if (ebx != XEN_CPUID_TSC_MODE_NOEMULATE) + return 0; Why is the last test needed? I was under the impression that if the mode was 0 (default) it would be possible for the tsc to become emulated in the future, perhaps after a migration. The presence of the tsc_mode noemulate meant that we could count on the falseneess of the XEN_CPUID_TSC_EMULATED check remaining constant. This will filter out most modern processors with TSC scaling support where in default mode we don't intercept RDTCS after migration. But I don't think we have proper interface to determine this so we don't have much choice but to indeed make this check. Yes, if this remains a single boot-time check, I'm not sure that knowing whether the processor supports tsc scaling helps us. If tsc_mode is default, there's always a possibility of the tsc becoming emulated later on as part of migration, correct? If the processor supports TSC scaling I don't think it's possible (it can happen in theory) but if it doesn't and you migrate to a CPU running at different frequency then yes, hypervisor will start emulating RDTSC. The other thing that might be possible here is to add a background timer that periodically checks if the tsc is still not emulated, and if it suddenly becomes so, change the rating again to prefer the xen clocksource. 
I had written this off initially as an impractical solution, since it seemed like a lot more mechanism and because it meant the performance characteristics of the system would change without user intervention. However, if this seems like a good idea, I'm not opposed to giving it a try. I don't think we should do it. Having the kernel suddenly change clocksource will probably be somewhat of a surprise to users. -boris
Re: [PATCH linux-next v2] x86/xen/time: prefer tsc as clocksource when it is invariant
On 12/12/22 5:09 PM, Krister Johansen wrote: On Mon, Dec 12, 2022 at 01:48:24PM -0500, Boris Ostrovsky wrote: On 12/12/22 11:05 AM, Krister Johansen wrote: diff --git a/arch/x86/include/asm/xen/cpuid.h b/arch/x86/include/asm/xen/cpuid.h index 6daa9b0c8d11..d9d7432481e9 100644 --- a/arch/x86/include/asm/xen/cpuid.h +++ b/arch/x86/include/asm/xen/cpuid.h @@ -88,6 +88,12 @@ * EDX: shift amount for tsc->ns conversion * Sub-leaf 2: EAX: host tsc frequency in kHz */ +#define XEN_CPUID_TSC_EMULATED (1u << 0) +#define XEN_CPUID_HOST_TSC_RELIABLE (1u << 1) +#define XEN_CPUID_RDTSCP_INSTR_AVAIL (1u << 2) +#define XEN_CPUID_TSC_MODE_DEFAULT (0) +#define XEN_CPUID_TSC_MODE_EMULATE (1u) +#define XEN_CPUID_TSC_MODE_NOEMULATE (2u) This file is a copy of Xen public interface so this change should go to Xen first. Ok, should I split this into a separate patch on the linux side too? Yes. Once the Xen patch has been accepted you will either submit the same patch for Linux or sync Linux file with Xen (if there are more differences). +static int __init xen_tsc_safe_clocksource(void) +{ + u32 eax, ebx, ecx, edx; + + if (!(xen_hvm_domain() || xen_pvh_domain())) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_CONSTANT_TSC))) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_NONSTOP_TSC))) + return 0; + + if (check_tsc_unstable()) + return 0; + + cpuid(xen_cpuid_base() + 3, &eax, &ebx, &ecx, &edx); + + if (eax & XEN_CPUID_TSC_EMULATED) + return 0; + + if (ebx != XEN_CPUID_TSC_MODE_NOEMULATE) + return 0; Why is the last test needed? I was under the impression that if the mode was 0 (default) it would be possible for the tsc to become emulated in the future, perhaps after a migration. The presence of the tsc_mode noemulate meant that we could count on the falseness of the XEN_CPUID_TSC_EMULATED check remaining constant. This will filter out most modern processors with TSC scaling support where in default mode we don't intercept RDTSC after migration. But I don't think we have a proper interface to determine this so we don't have much choice but to indeed make this check. -boris
Re: [PATCH linux-next v2] x86/xen/time: prefer tsc as clocksource when it is invariant
On 12/12/22 11:05 AM, Krister Johansen wrote: diff --git a/arch/x86/include/asm/xen/cpuid.h b/arch/x86/include/asm/xen/cpuid.h index 6daa9b0c8d11..d9d7432481e9 100644 --- a/arch/x86/include/asm/xen/cpuid.h +++ b/arch/x86/include/asm/xen/cpuid.h @@ -88,6 +88,12 @@ * EDX: shift amount for tsc->ns conversion * Sub-leaf 2: EAX: host tsc frequency in kHz */ +#define XEN_CPUID_TSC_EMULATED (1u << 0) +#define XEN_CPUID_HOST_TSC_RELIABLE (1u << 1) +#define XEN_CPUID_RDTSCP_INSTR_AVAIL (1u << 2) +#define XEN_CPUID_TSC_MODE_DEFAULT (0) +#define XEN_CPUID_TSC_MODE_EMULATE (1u) +#define XEN_CPUID_TSC_MODE_NOEMULATE (2u) This file is a copy of Xen public interface so this change should go to Xen first. +static int __init xen_tsc_safe_clocksource(void) +{ + u32 eax, ebx, ecx, edx; + + if (!(xen_hvm_domain() || xen_pvh_domain())) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_CONSTANT_TSC))) + return 0; + + if (!(boot_cpu_has(X86_FEATURE_NONSTOP_TSC))) + return 0; + + if (check_tsc_unstable()) + return 0; + + cpuid(xen_cpuid_base() + 3, &eax, &ebx, &ecx, &edx); + + if (eax & XEN_CPUID_TSC_EMULATED) + return 0; + + if (ebx != XEN_CPUID_TSC_MODE_NOEMULATE) + return 0; Why is the last test needed? -boris
Re: [PATCH linux-next] x86/xen/time: prefer tsc as clocksource when it is invariant
On 12/8/22 11:36 AM, Krister Johansen wrote: + /* +* As Dom0 is never moved, no penalty on using TSC there. +* +* If the guest has invariant tsc, then set xen_clocksource rating +* below that of the tsc so that the system prefers tsc instead. This +* check excludes PV domains, because PV is unable to guarantee that the +* guest's cpuid call has been intercepted by the hypervisor. +*/ + if (xen_initial_domain()) { xen_clocksource.rating = 275; + } else if ((xen_hvm_domain() || xen_pvh_domain()) && + boot_cpu_has(X86_FEATURE_CONSTANT_TSC) && + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) && + !check_tsc_unstable()) { + xen_clocksource.rating = 299; + } What if RDTSC is intercepted? -boris
Re: [PATCH] drivers/xen/hypervisor: Expose VM SIF flags to userspace
On 11/29/22 10:00 AM, Per Bilse wrote: +static ssize_t flags_show(struct hyp_sysfs_attr *attr, char *buffer) +{ + static char const *const sifstr[SIFN_NUM_SIFN] = { + [SIFN_PRIV] = "privileged", + [SIFN_INIT] = "initdomain", + [SIFN_MULTI] = "multiboot", + [SIFN_PFN] = "mod_start_pfn", + [SIFN_VIRT] = "virtmap" + }; + unsigned sifnum, sifmask; + ssize_t ret = 0; + + sifmask = ~(~0U << SIFN_NUM_SIFN); // ...111... + if (xen_domain() && (xen_start_flags & sifmask) != 0) { + for (sifnum = 0; sifnum != SIFN_NUM_SIFN; sifnum++) { + if ((xen_start_flags & (1< Why not simply show the unprocessed xen_start_flags as a hex value? -boris
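A sketch of the simpler alternative being suggested (the sysfs attribute wiring is assumed to match the surrounding hypervisor sysfs code):

static ssize_t flags_show(struct hyp_sysfs_attr *attr, char *buffer)
{
        if (!xen_domain())
                return 0;

        /* Expose the raw SIF flags; decoding is left to userspace. */
        return sprintf(buffer, "%#x\n", xen_start_flags);
}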
Re: [PATCH 28/30] x86/xen: Use kstrtobool() instead of strtobool()
On 11/1/22 5:14 PM, Christophe JAILLET wrote: strtobool() is the same as kstrtobool(). However, the latter is more widely used within the kernel. In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name. While at it, include the corresponding header file () Signed-off-by: Christophe JAILLET --- This patch is part of a series that axes all usages of strtobool(). Each patch can be applied independently from the other ones. The last patch of the series removes the definition of strtobool(). You may not have been copied on the cover letter. So, if needed, it is available at [1]. [1]: https://lore.kernel.org/all/cover.1667336095.git.christophe.jail...@wanadoo.fr/ Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/xen: silence smatch warning in pmu_msr_chk_emulated()
On 10/20/22 9:34 AM, Juergen Gross wrote: On 20.10.22 15:16, Jan Beulich wrote: On 20.10.2022 13:37, Juergen Gross wrote: Commit 8714f7bcd3c2 ("xen/pv: add fault recovery control to pmu msr accesses") introduced code resulting in a warning issued by the smatch static checker, claiming to use an uninitialized variable. This is a false positive, but work around the warning nevertheless. The risk of introducing a problem might be quite low here, but in general it exists: With the adjustment you remove any chance of the compiler spotting a missing initialization before use. And I'm not convinced using 0 in such a case would actually be ending up sufficiently benign. Hmm, an alternative would be to initialize it to -1 and add a test for the index to be >= 0 before using it. Or to live with the smatch warning with the chance that a compiler might warn for the same reason in the future. Is smatch complaining about both variables or just index? There are two cases in is_intel_pmu_msr() where it returns true but index is not set so perhaps that's what bothers smatch? It should not complain if is_intel_pmu_msr() returns false. -boris
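The alternative mentioned above (initialize to -1 and test before use) would look roughly like this; whether smatch only complains about 'index' is the open question, so this is a sketch rather than the posted fix:

        int type = 0, index = -1;
        ...
        if (is_intel_pmu_msr(msr, &type, &index) && index >= 0)
                emulated = xen_intel_pmu_emulate(msr, val, type, index, 1);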
Re: [PATCH v2 3/3] xen/virtio: enable grant based virtio on x86
On 10/7/22 10:00 AM, Oleksandr Tyshchenko wrote: On 07.10.22 09:41, Juergen Gross wrote: Hello Juergen Use an x86-specific virtio_check_mem_acc_cb() for Xen in order to setup the correct DMA ops. Signed-off-by: Juergen Gross --- V2: - add missing PV check in xen_virtio_mem_acc() (Oleksandr Tyshchenko) - add xen_virtio_restricted_mem_acc() stub (Oleksandr Tyshchenko) Reviewed-by: Oleksandr Tyshchenko # common code Reviewed-by: Boris Ostrovsky
Re: [PATCH v2 1/3] xen/pv: allow pmu msr accesses to cause GP
On 10/4/22 4:43 AM, Juergen Gross wrote: bool pmu_msr_read(unsigned int msr, uint64_t *val, int *err) { - if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) { - if (is_amd_pmu_msr(msr)) { - if (!xen_amd_pmu_emulate(msr, val, 1)) - *val = native_read_msr_safe(msr, err); - return true; - } - } else { - int type, index; + int type, index; + bool emulated; - if (is_intel_pmu_msr(msr, &type, &index)) { - if (!xen_intel_pmu_emulate(msr, val, type, index, 1)) - *val = native_read_msr_safe(msr, err); - return true; - } + if (is_amd_pmu_msr(msr)) + emulated = xen_amd_pmu_emulate(msr, val, 1); + else if (is_intel_pmu_msr(msr, &type, &index)) + emulated = xen_intel_pmu_emulate(msr, val, type, index, 1); + else + return false; + + if (!emulated) { I think you can factor this out even further by moving the if/else if/else into a separate routine and then have 'if (!xen_emulate_pmu_msr(msr, val, 1))' (and pass zero from pmu_msr_write()) -boris
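A sketch of the further factoring suggested above (the helper name follows the suggestion; note that a real implementation still has to let the caller distinguish "not a PMU MSR" from "PMU MSR, but not emulated", which is what the three-way branch in the patch does):

static bool xen_emulate_pmu_msr(unsigned int msr, uint64_t *val, bool is_read)
{
        int type, index;

        if (is_amd_pmu_msr(msr))
                return xen_amd_pmu_emulate(msr, val, is_read);
        if (is_intel_pmu_msr(msr, &type, &index))
                return xen_intel_pmu_emulate(msr, val, type, index, is_read);

        return false;
}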
Re: [PATCH 1/3] xen/pv: allow pmu msr accesses to cause GP
On 9/26/22 10:18 AM, Juergen Gross wrote: bool pmu_msr_read(unsigned int msr, uint64_t *val, int *err) { if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL) { - if (is_amd_pmu_msr(msr)) { - if (!xen_amd_pmu_emulate(msr, val, 1)) - *val = native_read_msr_safe(msr, err); - return true; + if (!is_amd_pmu_msr(msr)) You should be able to move the vendor check inside is_*_pmu_msr(). -boris + return false; + if (!xen_amd_pmu_emulate(msr, val, 1)) { + *val = err ? native_read_msr_safe(msr, err) + : native_read_msr(msr); } + return true; } else {
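A sketch of what folding the vendor test into the helper could look like (is_amd_pmu_msr() is the existing helper; msr_in_amd_pmu_range() is a placeholder standing in for its existing MSR-range check):

static bool is_amd_pmu_msr(unsigned int msr)
{
        /* Vendor check moved here so callers no longer need to repeat it. */
        if (boot_cpu_data.x86_vendor != X86_VENDOR_AMD &&
            boot_cpu_data.x86_vendor != X86_VENDOR_HYGON)
                return false;

        return msr_in_amd_pmu_range(msr);      /* existing range test, unchanged */
}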
Re: [PATCH 1/4] x86/entry: Work around Clang __bdos() bug
On 9/20/22 3:21 PM, Kees Cook wrote: After expanding bounds checking to use __builtin_dynamic_object_size(), Clang produces a false positive when building with CONFIG_FORTIFY_SOURCE=y and CONFIG_UBSAN_BOUNDS=y when operating on an array with a dynamic offset. Work around this by using a direct assignment of an empty instance. Avoids this warning: ../include/linux/fortify-string.h:309:4: warning: call to __write_overflow_field declared with 'warning' attribute: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Wattribute-warning] __write_overflow_field(p_size_field, size); ^ which was isolated to the memset() call in xen_load_idt(). Note that this looks very much like another bug that was worked around: https://github.com/ClangBuiltLinux/linux/issues/1592 Cc: Juergen Gross Cc: Boris Ostrovsky Cc: Nathan Chancellor Cc: Nick Desaulniers Cc: xen-devel@lists.xenproject.org Cc: l...@lists.linux.dev Signed-off-by: Kees Cook Reviewed-by: Boris Ostrovsky
Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 9/20/22 10:54 AM, Jan Beulich wrote: On 20.09.2022 16:26, Boris Ostrovsky wrote: On 9/20/22 4:01 AM, Jan Beulich wrote: On 20.09.2022 00:42, Boris Ostrovsky wrote: It is saving vpmu data from current pcpu's MSRs for a remote vcpu so @v in vmx_find_msr() is not @current: vpmu_load() ... prev = per_cpu(last_vcpu, pcpu); vpmu_save_force(prev) core2_vpmu_save() __core2_vpmu_save() vmx_read_guest_msr() vmx_find_msr() The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though whether this call is needed when the code path above is executed (i.e. when we are saving a remote vcpu) How could it not be needed? We need to obtain the guest value. The thing I don't understand is why this forced saving is necessary, when context_switch() unconditionally calls vpmu_switch_from(). IIRC the logic is: 1. vcpuA runs on pcpu0 2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not yet called vpmu_load() from pcpu1 The calling of vpmu_load() shouldn't matter here. What does matter is that vpmu_save() was necessarily called already. I thought we don't always save MSRs on context switch because we optimize for the case when the same vcpu gets scheduled again. But I am not sure I see this now that I am looking at the code. -boris Therefore I'm having trouble seeing why ... 3. vcpuB is ready to run on pcpu0, calls vpmu_load() 4. vcpuB discovers that pcpu0's MSRs are still holding values from vcpuA 5. vcpuB calls vpmu_force_save(vcpuA) which stashes pcpu0's MSRs into vcpuA's vpmu context. ... forced saving would be necessary here. What's necessary at this point is only the loading of vcpuB's MSR values.
Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 9/20/22 4:01 AM, Jan Beulich wrote: On 20.09.2022 00:42, Boris Ostrovsky wrote: It is saving vpmu data from current pcpu's MSRs for a remote vcpu so @v in vmx_find_msr() is not @current: vpmu_load() ... prev = per_cpu(last_vcpu, pcpu); vpmu_save_force(prev) core2_vpmu_save() __core2_vpmu_save() vmx_read_guest_msr() vmx_find_msr() The call to vmx_find_msr() was introduced by 755087eb9b10c. I wonder though whether this call is needed when the code path above is executed (i.e. when we are saving a remote vcpu) How could it not be needed? We need to obtain the guest value. The thing I don't understand is why this forced saving is necessary, when context_switch() unconditionally calls vpmu_switch_from(). IIRC the logic is: 1. vcpuA runs on pcpu0 2. vcpuA is de-scheduled and is selected to run on pcpu1. It has not yet called vpmu_load() from pcpu1 3. vcpuB is ready to run on pcpu0, calls vpmu_load() 4. vcpuB discovers that pcpu0's MSRs are still holding values from vcpuA 5. vcpuB calls vpmu_force_save(vcpuA) which stashes pcpu0's MSRs into vcpuA's vpmu context. -boris
Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 9/19/22 10:56 AM, Jan Beulich wrote: On 19.09.2022 16:11, Tamas K Lengyel wrote: On Mon, Sep 19, 2022 at 9:58 AM Jan Beulich wrote: On 19.09.2022 15:24, Tamas K Lengyel wrote: On Mon, Sep 19, 2022 at 9:21 AM Jan Beulich wrote: On 19.09.2022 14:25, Tamas K Lengyel wrote: On Mon, Sep 19, 2022 at 5:28 AM Jan Beulich wrote: On 16.09.2022 23:35, Boris Ostrovsky wrote: On 9/16/22 8:52 AM, Jan Beulich wrote: On 15.09.2022 16:01, Tamas K Lengyel wrote: While experimenting with the vPMU subsystem an ASSERT failure was observed in vmx_find_msr because the vcpu_runnable state was true. The root cause of the bug appears to be the fact that the vPMU subsystem doesn't save its state on context_switch. For the further reply below - is this actually true? What is the vpmu_switch_from() (resolving to vpmu_save()) doing then early in context_switch()? The vpmu_load function will attempt to gather the PMU state if its still loaded two different ways: 1. if the current pcpu is not where the vcpu ran before doing a remote save 2. if the current pcpu had another vcpu active before doing a local save However, in case the prev vcpu is being rescheduled on another pcpu its state has already changed and vcpu_runnable is returning true, thus #2 will trip the ASSERT. The only way to avoid this race condition is to make sure the prev vcpu is paused while being checked and its context saved. Once the prev vcpu is resumed and does #1 it will find its state already saved. While I consider this explanation plausible, I'm worried: --- a/xen/arch/x86/cpu/vpmu.c +++ b/xen/arch/x86/cpu/vpmu.c @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest) vpmu = vcpu_vpmu(prev); /* Someone ran here before us */ +vcpu_pause(prev); vpmu_save_force(prev); vpmu_reset(vpmu, VPMU_CONTEXT_LOADED); +vcpu_unpause(prev); vpmu = vcpu_vpmu(v); } We're running with IRQs off here, yet vcpu_pause() waits for the vcpu to actually be de-scheduled. Even with IRQs on this is already a relatively heavy operation (also including its impact on the remote side). Additionally the function is called from context_switch(), and I'm unsure of the usability of vcpu_pause() on such a path. In particular: Is there a risk of two CPUs doing this mutually to one another? If so, is deadlocking excluded? Hence at the very least I think the description wants extending, to discuss the safety of the change. Boris - any chance you could comment here? Iirc that's code you did introduce. Is the assertion in vmx_find_msr() really needs to be for runnable vcpu or can it be a check on whether vcpu is actually running (e.g. RUNSTATE_running)? You cannot safely check for "running", as "runnable" may transition to/from "running" behind your back. The more I look at this the more I think the only sensible solution is to have the vPMU state be saved on vmexit for all vCPUs. Do you really mean vmexit? It would suffice if state was reliably saved during context-switch-out, wouldn't it? At that point the vCPU can't be resumed on another pCPU, yet. That way all this having to figure out where and when a context needs saving during scheduling goes away. Yes, it adds a bit of overhead for cases where the vCPU will resume on the same pCPU and that context saved could have been skipped, If you really mean vmexit, then I'm inclined to say that's more than just "a bit of overhead". I'd agree if you really meant context-switch-out, but as said further up it looks to me as if that was already occurring. Apparently I'm overlooking something crucial ... 
Yes, the current setup is doing exactly that, saving the vPMU context on context-switch-out, and that's where the ASSERT failure occurs because the vCPU it needs to save the context for is already runnable on another pCPU. Well, if that's the scenario (sorry for not understanding it that way earlier on), then the assertion is too strict: While in context switch, the vCPU may be runnable, but certainly won't actually run anywhere. Therefore I'd be inclined to suggest to relax the condition just enough to cover this case, without actually going to checking for "running". What ensures the vCPU won't actually run anywhere if it's in the runnable state? The fact that the vCPU is the subject of context_switch(). And how do I relax the condition just enough to cover just this case? That's the more difficult question. The immediate solution, passing a boolean or flag to vpmu_switch_from(), may not be practical, as it would likely mean passing this through many layers. Utilizing that I have Jürgen sitting next to me, I've discussed this with him, and he suggested to simply check for v == current. And indeed - set_current() in context_switch
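A sketch of the relaxation being converged on at the end of this exchange (whether this is the exact final form is not settled here):

        /*
         * The vCPU is either paused or is the subject of the context switch
         * currently running on this pCPU, so it may be formally "runnable"
         * without being able to actually run anywhere else yet.
         */
        ASSERT(v == current || !vcpu_runnable(v));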
Re: [PATCH] x86/vpmu: fix race-condition in vpmu_load
On 9/16/22 8:52 AM, Jan Beulich wrote: On 15.09.2022 16:01, Tamas K Lengyel wrote: While experimenting with the vPMU subsystem an ASSERT failure was observed in vmx_find_msr because the vcpu_runnable state was true. The root cause of the bug appears to be the fact that the vPMU subsystem doesn't save its state on context_switch. The vpmu_load function will attempt to gather the PMU state if it's still loaded, in two different ways: 1. if the current pcpu is not where the vcpu ran before doing a remote save 2. if the current pcpu had another vcpu active before doing a local save However, in case the prev vcpu is being rescheduled on another pcpu its state has already changed and vcpu_runnable is returning true, thus #2 will trip the ASSERT. The only way to avoid this race condition is to make sure the prev vcpu is paused while being checked and its context saved. Once the prev vcpu is resumed and does #1 it will find its state already saved. While I consider this explanation plausible, I'm worried: --- a/xen/arch/x86/cpu/vpmu.c +++ b/xen/arch/x86/cpu/vpmu.c @@ -419,8 +419,10 @@ int vpmu_load(struct vcpu *v, bool_t from_guest) vpmu = vcpu_vpmu(prev); /* Someone ran here before us */ +vcpu_pause(prev); vpmu_save_force(prev); vpmu_reset(vpmu, VPMU_CONTEXT_LOADED); +vcpu_unpause(prev); vpmu = vcpu_vpmu(v); } We're running with IRQs off here, yet vcpu_pause() waits for the vcpu to actually be de-scheduled. Even with IRQs on this is already a relatively heavy operation (also including its impact on the remote side). Additionally the function is called from context_switch(), and I'm unsure of the usability of vcpu_pause() on such a path. In particular: Is there a risk of two CPUs doing this mutually to one another? If so, is deadlocking excluded? Hence at the very least I think the description wants extending, to discuss the safety of the change. Boris - any chance you could comment here? Iirc that's code you did introduce. Does the assertion in vmx_find_msr() really need to be for a runnable vcpu, or can it be a check on whether the vcpu is actually running (e.g. RUNSTATE_running)? I think the call chain vpmu_load()->..->*_vpmu_save()->...->vmx_find_msr() is the only one where we are doing it for a non-current vcpu. If we can guarantee that the run state is set after the vpmu_load() call (maybe it is already, I haven't checked) then testing the state may avoid the assertion. -boris
Re: [PATCH v3] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/28/22 8:52 AM, Jane Malalane wrote: +/* + * Setup per-vCPU vector-type callbacks and trick toolstack to think The comment should be adjusted -- no need to mention toolstack now that that code has been factored out. Other than that Reviewed-by: Boris Ostrovsky + * we are enlightened. If this setup is unavailable, fallback to the + * global vector-type callback. + */ +static __init void xen_init_setup_upcall_vector(void) +{
Re: [PATCH v2] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/26/22 8:56 AM, Jane Malalane wrote: +/* Setup per-vCPU vector-type callbacks and trick toolstack to think + * we are enlightened. If this setup is unavailable, fallback to the + * global vector-type callback. */ Comment style. +static __init void xen_init_setup_upcall_vector(void) +{ + unsigned int cpu = 0; Unnecessary variable. + + if (!xen_have_vector_callback) + return; + + if ((cpuid_eax(xen_cpuid_base() + 4) & XEN_HVM_CPUID_UPCALL_VECTOR) && + !xen_set_upcall_vector(cpu) && !xen_set_callback_via(1)) + xen_percpu_upcall = true; + else if (xen_feature(XENFEAT_hvm_callback_vector)) + xen_setup_callback_vector(); + else + xen_have_vector_callback = false; +} + +int xen_set_upcall_vector(unsigned int cpu) +{ + int rc; + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + rc = HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, &op); + if (rc) + return rc; + + if (!cpu) A comment (e.g. "Let toolstack know that we are enlightened." or something along these lines) would be useful here. -boris + rc = xen_set_callback_via(1); +
Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/25/22 6:03 AM, Jane Malalane wrote: On 18/07/2022 14:59, Boris Ostrovsky wrote: On 7/18/22 4:56 AM, Andrew Cooper wrote: On 15/07/2022 14:10, Boris Ostrovsky wrote: On 7/15/22 5:50 AM, Andrew Cooper wrote: On 15/07/2022 09:18, Jane Malalane wrote: On 14/07/2022 00:27, Boris Ostrovsky wrote: xen_hvm_smp_init(); WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm)); diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c index 9d548b0c772f..be66e027ef28 100644 --- a/arch/x86/xen/suspend_hvm.c +++ b/arch/x86/xen/suspend_hvm.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "xen-ops.h" @@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled) xen_hvm_init_shared_info(); xen_vcpu_restore(); } - xen_setup_callback_vector(); + if (xen_ack_upcall) { + unsigned int cpu; + + for_each_online_cpu(cpu) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, + &op)); + /* Trick toolstack to think we are enlightened. */ + if (!cpu) + BUG_ON(xen_set_callback_via(1)); What are you trying to make the toolstack aware of? That we have *a* callback (either global or percpu)? Yes, specifically for the check in libxl__domain_pvcontrol_available. And others. This is all a giant bodge, but basically a lot of tooling uses the non-zero-ness of the CALLBACK_VIA param to determine whether the VM has Xen-aware drivers loaded or not. The value 1 is a CALLBACK_VIA value which encodes GSI 1, and the only reason this doesn't explode everywhere is because the evtchn_upcall_vector registration takes priority over GSI delivery. This is decades of tech debt piled on top of tech debt. Feels like it (setting the callback parameter) is something that the hypervisor should do --- no need to expose guests to this. Sensible or not, it is the ABI. Linux still needs to work (nicely) with older Xen's in the world, and we can't just retrofit a change in the hypervisor which says "btw, this ABI we've just changed now has a side effect of modifying a field that you also logically own". The hypercall has been around for a while so I understand ABI concerns there but XEN_HVM_CPUID_UPCALL_VECTOR was introduced only a month ago. Why not tie presence of this bit to no longer having to explicitly set the callback field? Any other opinions on this? (i.e., calling xen_set_callback_via(1) after HVMOP_set_evtchn_upcall_vector OR not exposing this to guests and instead having Xen call this function (in hvmop_set_evtchn_upcall_vector maybe) and tieing its presense to XEN_HVM_CPUID_UPCALL_VECTOR which was recently added) CPUID won't help here, I wasn't thinking clearly. Can we wrap the HVMOP_set_evtchn_upcall_vector hypercall in a function that will decide whether or not to also do xen_set_callback_via(1)? -boris
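A sketch of the wrapper being asked about here (the name is illustrative; in the v2/v3 patches quoted earlier in this digest the decision ends up folded into xen_set_upcall_vector() itself):

static int xen_percpu_upcall_init(unsigned int cpu)
{
        int rc = xen_set_upcall_vector(cpu);

        if (rc)
                return rc;

        /* Only the boot CPU advertises a callback to the toolstack. */
        if (!cpu)
                rc = xen_set_callback_via(1);

        return rc;
}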
Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/18/22 4:56 AM, Andrew Cooper wrote: On 15/07/2022 14:10, Boris Ostrovsky wrote: On 7/15/22 5:50 AM, Andrew Cooper wrote: On 15/07/2022 09:18, Jane Malalane wrote: On 14/07/2022 00:27, Boris Ostrovsky wrote: xen_hvm_smp_init(); WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm)); diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c index 9d548b0c772f..be66e027ef28 100644 --- a/arch/x86/xen/suspend_hvm.c +++ b/arch/x86/xen/suspend_hvm.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "xen-ops.h" @@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled) xen_hvm_init_shared_info(); xen_vcpu_restore(); } - xen_setup_callback_vector(); + if (xen_ack_upcall) { + unsigned int cpu; + + for_each_online_cpu(cpu) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, + &op)); + /* Trick toolstack to think we are enlightened. */ + if (!cpu) + BUG_ON(xen_set_callback_via(1)); What are you trying to make the toolstack aware of? That we have *a* callback (either global or percpu)? Yes, specifically for the check in libxl__domain_pvcontrol_available. And others. This is all a giant bodge, but basically a lot of tooling uses the non-zero-ness of the CALLBACK_VIA param to determine whether the VM has Xen-aware drivers loaded or not. The value 1 is a CALLBACK_VIA value which encodes GSI 1, and the only reason this doesn't explode everywhere is because the evtchn_upcall_vector registration takes priority over GSI delivery. This is decades of tech debt piled on top of tech debt. Feels like it (setting the callback parameter) is something that the hypervisor should do --- no need to expose guests to this. Sensible or not, it is the ABI. Linux still needs to work (nicely) with older Xen's in the world, and we can't just retrofit a change in the hypervisor which says "btw, this ABI we've just changed now has a side effect of modifying a field that you also logically own". The hypercall has been around for a while so I understand ABI concerns there but XEN_HVM_CPUID_UPCALL_VECTOR was introduced only a month ago. Why not tie presence of this bit to no longer having to explicitly set the callback field? -boris
Re: Build warnings in Xen 5.15.y and 5.10.y with retbleed backports
On 7/17/22 1:20 AM, Juergen Gross wrote: What about filling the complete hypercall page just with "int 3" or "ud2"? Any attempt to do a hypercall before the hypercall page has been initialized is a bug anyway. What good can come from calling a function which will return a basically random value instead of doing a privileged operation? This is all about objtool, that's why 'ret' was added there originally by f4b4bc10b0b8 ("x86/xen: Support objtool vmlinux.o validation in xen-head.S"). Before that it was all 'nop' which is similar to what you are suggesting ('int3' or 'ud2' would of course be better) -boris
Re: Build warnings in Xen 5.15.y and 5.10.y with retbleed backports
On 7/16/22 12:35 PM, Nicolai Stange wrote: Hi, I see a patch for this has been queued up for 5.10 already ([1]), I'm just sharing my findings in support of this patch here -- it doesn't merely exchange one warning for another, but fixes a real issue and should perhaps get applied to other stable branches as well. TL;DR: for this particular warning, objtool would exit early and fail to create any .orc_unwind* ELF sections for head_64.o, which are consumed by the ORC unwinder at runtime. Boris Ostrovsky writes: On 7/12/22 3:31 PM, Greg KH wrote: On Tue, Jul 12, 2022 at 03:19:39PM -0400, Boris Ostrovsky wrote: On 7/12/22 12:38 PM, Greg KH wrote: Hi all, I'm seeing the following build warning: arch/x86/kernel/head_64.o: warning: objtool: xen_hypercall_mmu_update(): can't find starting instruction in the 5.15.y and 5.10.y retbleed backports. The reason for this is that with RET being multibyte, it can cross those "xen_hypecall_*" symbol boundaries, because ... I don't know why just this one hypercall is being called out by objtool, and this warning isn't in 5.18 and Linus's tree due to I think commit 5b2fc51576ef ("x86/ibt,xen: Sprinkle the ENDBR") being there. But, is this a ret call that we "forgot" here? It's a "real" ret in Linus's branch: .pushsection .noinstr.text, "ax" .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC ANNOTATE_NOENDBR ANNOTATE_UNRET_SAFE ret /* * Xen will write the hypercall page, and sort out ENDBR. */ .skip 31, 0xcc .endr while 5.15.y and older has: .pushsection .text .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC .skip 31, 0x90 ... the "31" is no longer correct, ... ANNOTATE_UNRET_SAFE RET ... as with RET occupying more than one byte, the resulting hypercall entry's total size won't add up to 32 anymore. Right! I haven't thought about that part. I think this has been broken since 14b476e07fab ("x86: Prepare asm files for straight-line-speculation"). It still shouldn't matter as far as correct execution is concerned which is probably why noone complained. Note that those xen_hypercall_* symbols' values are getting statically calculated as 'hypercall page + n * 32' in the HYPERCALL() #define from xen-head.S. So there's a mismatch and with RET == 'ret; int3', the resulting .text effectively becomes 101e: 90 nop 101f: c3 ret 1020 : 1020: cc int3 1021: 90 nop 1022: 90 nop This is probably already not what has been intended, but because 'ret' and 'int3' both are single-byte encoded, objtool would still be able to find at least some "starting instruction" at this point. But with RET == 'jmp __x86_return_thunk', it becomes 101e: 90 nop 101f: e9 .byte 0xe9 1020 : 1020: 00 00 add%al,(%rax) 1022: 00 00 add%al,(%rax) 1024: 90 nop Here the 'e9 00 00 00 00' jmp crosses the symbol boundary and objtool errors out. Ah, thanks for explanation. Then I think we need to replace .skip 31, 0x90 with something like #if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO) #define SKIP_BYTES27/* RET is 'jmp __x86_return_thunk' (5 bytes) */ #else /* CONFIG_RETPOLINE */ #ifdef CONFIG_SLS #define SKIP_BYTES30/* RET is 'ret; int3' (2 bytes) */ #else #define SKIP_BYTES31/* RET is 'ret' (1 byte) */ #endif .skip SKIP_BYTES, 0x90 (I don't have patched 5.15 so I am going by what mainline looks like) Or replace RET with ret. (Although at least with unpatched 5.15 the warning below is still generated) -boris .endr So should the "ret" remain or be turned into "RET" in mainline right now? 
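For reference, the slot-size bookkeeping being sketched above with its invariant spelled out. This is a tidy-up of the quoted proposal, not a tested patch; the config names mirror the ones used in the mail.

	/* Each hypercall_page slot is 32 bytes: SKIP_BYTES + (bytes emitted by RET) == 32. */
	#if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && !defined(BUILD_VDSO)
	#define SKIP_BYTES	27	/* RET is 'jmp __x86_return_thunk' (5 bytes) */
	#elif defined(CONFIG_SLS)
	#define SKIP_BYTES	30	/* RET is 'ret; int3' (2 bytes) */
	#else
	#define SKIP_BYTES	31	/* RET is 'ret' (1 byte) */
	#endif
	/* used in xen-head.S as: .skip SKIP_BYTES, 0x90 followed by ANNOTATE_UNRET_SAFE; RET */

Keeping the sum at 32 bytes is what stops the final RET from spilling across the xen_hypercall_* symbol boundaries that the HYPERCALL() macro computes as hypercall_page + n * 32.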
It doesn't matter --- this is overwritten by the hypervisor during initialization when Xen fills in actual hypercall code. It does make a difference though: even though objtool reports only a warning, it still exits early in this particular case and won't create any of the .orc_unwind* or .return_sites sections for head_64.o as it's supposed to. The significance of not having .orc_unwind* for head_64.o is that the reliable
Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/15/22 5:50 AM, Andrew Cooper wrote: On 15/07/2022 09:18, Jane Malalane wrote: On 14/07/2022 00:27, Boris Ostrovsky wrote: xen_hvm_smp_init(); WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm)); diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c index 9d548b0c772f..be66e027ef28 100644 --- a/arch/x86/xen/suspend_hvm.c +++ b/arch/x86/xen/suspend_hvm.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "xen-ops.h" @@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled) xen_hvm_init_shared_info(); xen_vcpu_restore(); } - xen_setup_callback_vector(); + if (xen_ack_upcall) { + unsigned int cpu; + + for_each_online_cpu(cpu) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, + &op)); + /* Trick toolstack to think we are enlightened. */ + if (!cpu) + BUG_ON(xen_set_callback_via(1)); What are you trying to make the toolstack aware of? That we have *a* callback (either global or percpu)? Yes, specifically for the check in libxl__domain_pvcontrol_available. And others. This is all a giant bodge, but basically a lot of tooling uses the non-zero-ness of the CALLBACK_VIA param to determine whether the VM has Xen-aware drivers loaded or not. The value 1 is a CALLBACK_VIA value which encodes GSI 1, and the only reason this doesn't explode everywhere is because the evtchn_upcall_vector registration takes priority over GSI delivery. This is decades of tech debt piled on top of tech debt. Feels like it (setting the callback parameter) is something that the hypervisor should do --- no need to expose guests to this. -boris
Re: [PATCH] x86/xen: Add support for HVMOP_set_evtchn_upcall_vector
On 7/11/22 11:22 AM, Jane Malalane wrote: --- a/arch/x86/xen/enlighten_hvm.c +++ b/arch/x86/xen/enlighten_hvm.c @@ -7,6 +7,7 @@ #include #include +#include #include #include @@ -30,6 +31,9 @@ static unsigned long shared_info_pfn; +__ro_after_init bool xen_ack_upcall; +EXPORT_SYMBOL_GPL(xen_ack_upcall); Shouldn't this be called something like xen_percpu_upcall? + void xen_hvm_init_shared_info(void) { struct xen_add_to_physmap xatp; @@ -125,6 +129,9 @@ DEFINE_IDTENTRY_SYSVEC(sysvec_xen_hvm_callback) { struct pt_regs *old_regs = set_irq_regs(regs); + if (xen_ack_upcall) + ack_APIC_irq(); + inc_irq_stat(irq_hv_callback_count); xen_hvm_evtchn_do_upcall(); @@ -168,6 +175,15 @@ static int xen_cpu_up_prepare_hvm(unsigned int cpu) if (!xen_have_vector_callback) return 0; + if (xen_ack_upcall) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, &op)); Does this have to be fatal? Can't we just fail bringing this vcpu up? + } + if (xen_feature(XENFEAT_hvm_safe_pvclock)) xen_setup_timer(cpu); @@ -211,8 +227,7 @@ static void __init xen_hvm_guest_init(void) xen_panic_handler_init(); - if (!no_vector_callback && xen_feature(XENFEAT_hvm_callback_vector)) - xen_have_vector_callback = 1; + xen_have_vector_callback = !no_vector_callback; Can we get rid of one of those two variables then? xen_hvm_smp_init(); WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_hvm, xen_cpu_dead_hvm)); diff --git a/arch/x86/xen/suspend_hvm.c b/arch/x86/xen/suspend_hvm.c index 9d548b0c772f..be66e027ef28 100644 --- a/arch/x86/xen/suspend_hvm.c +++ b/arch/x86/xen/suspend_hvm.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "xen-ops.h" @@ -14,6 +15,23 @@ void xen_hvm_post_suspend(int suspend_cancelled) xen_hvm_init_shared_info(); xen_vcpu_restore(); } - xen_setup_callback_vector(); + if (xen_ack_upcall) { + unsigned int cpu; + + for_each_online_cpu(cpu) { + xen_hvm_evtchn_upcall_vector_t op = { + .vector = HYPERVISOR_CALLBACK_VECTOR, + .vcpu = per_cpu(xen_vcpu_id, cpu), + }; + + BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector, +&op)); + /* Trick toolstack to think we are enlightened. */ + if (!cpu) + BUG_ON(xen_set_callback_via(1)); What are you trying to make the toolstack aware of? That we have *a* callback (either global or percpu)? BTW, you can take it out the loop. And maybe @op definition too, except for .vcpu assignment. + } + } else { + xen_setup_callback_vector(); + } xen_unplug_emulated_devices(); } -boris
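As an aside, a sketch of the loop restructuring Boris asks for at the end of this review: the hypercall argument and the callback_via trick hoisted out of the per-CPU loop, with only .vcpu updated per iteration. Illustrative only, and using the xen_percpu_upcall name he suggests rather than the posted xen_ack_upcall.

	if (xen_percpu_upcall) {
		xen_hvm_evtchn_upcall_vector_t op = {
			.vector = HYPERVISOR_CALLBACK_VECTOR,
		};
		unsigned int cpu;

		/* Trick the toolstack into thinking we are enlightened. */
		BUG_ON(xen_set_callback_via(1));

		for_each_online_cpu(cpu) {
			op.vcpu = per_cpu(xen_vcpu_id, cpu);
			BUG_ON(HYPERVISOR_hvm_op(HVMOP_set_evtchn_upcall_vector,
						 &op));
		}
	} else {
		xen_setup_callback_vector();
	}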
Re: Build warnings in Xen 5.15.y and 5.10.y with retbleed backports
On 7/12/22 3:31 PM, Greg KH wrote: On Tue, Jul 12, 2022 at 03:19:39PM -0400, Boris Ostrovsky wrote: On 7/12/22 12:38 PM, Greg KH wrote: Hi all, I'm seeing the following build warning: arch/x86/kernel/head_64.o: warning: objtool: xen_hypercall_mmu_update(): can't find starting instruction in the 5.15.y and 5.10.y retbleed backports. I don't know why just this one hypercall is being called out by objtool, and this warning isn't in 5.18 and Linus's tree due to I think commit 5b2fc51576ef ("x86/ibt,xen: Sprinkle the ENDBR") being there. But, is this a ret call that we "forgot" here? It's a "real" ret in Linus's branch: .pushsection .noinstr.text, "ax" .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC ANNOTATE_NOENDBR ANNOTATE_UNRET_SAFE ret /* * Xen will write the hypercall page, and sort out ENDBR. */ .skip 31, 0xcc .endr while 5.15.y and older has: .pushsection .text .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC .skip 31, 0x90 ANNOTATE_UNRET_SAFE RET .endr So should the "ret" remain or be turned into "RET" in mainline right now? It doesn't matter --- this is overwritten by the hypervisor during initialization when Xen fills in actual hypercall code. So f4b4bc10b0b85ec66f1a9bf5dddf475e6695b6d2 added 'ret' to make objtool happy and then 14b476e07fab6 replaced 'ret' with RET as part of SLS fixes. The latter was not really necessary but harmless. So it can be 'ret', RET, or anything else that tools don't complain about. It will not be executed. Cool, thanks. But what about the objtool warning that I now see? Is that "real"? It's not real in the sense that the code there is not real, it will be overwritten. (Originally the whole page was 'nop's) I am getting a different error BTW: arch/x86/kernel/head_64.o: warning: objtool: .text+0x5: unreachable instruction I don't run any Xen systems, so I can't test any of this myself. You can't test any changes to that code --- it is rewritten when Xen guest is running. We probably do want to shut up objtool. Josh, any suggestions? -boris
Re: Build warnings in Xen 5.15.y and 5.10.y with retbleed backports
On 7/12/22 12:38 PM, Greg KH wrote: Hi all, I'm seeing the following build warning: arch/x86/kernel/head_64.o: warning: objtool: xen_hypercall_mmu_update(): can't find starting instruction in the 5.15.y and 5.10.y retbleed backports. I don't know why just this one hypercall is being called out by objtool, and this warning isn't in 5.18 and Linus's tree due to I think commit 5b2fc51576ef ("x86/ibt,xen: Sprinkle the ENDBR") being there. But, is this a ret call that we "forgot" here? It's a "real" ret in Linus's branch: .pushsection .noinstr.text, "ax" .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC ANNOTATE_NOENDBR ANNOTATE_UNRET_SAFE ret /* * Xen will write the hypercall page, and sort out ENDBR. */ .skip 31, 0xcc .endr while 5.15.y and older has: .pushsection .text .balign PAGE_SIZE SYM_CODE_START(hypercall_page) .rept (PAGE_SIZE / 32) UNWIND_HINT_FUNC .skip 31, 0x90 ANNOTATE_UNRET_SAFE RET .endr So should the "ret" remain or be turned into "RET" in mainline right now? It doesn't matter --- this is overwritten by the hypervisor during initialization when Xen fills in actual hypercall code. So f4b4bc10b0b85ec66f1a9bf5dddf475e6695b6d2 added 'ret' to make objtool happy and then 14b476e07fab6 replaced 'ret' with RET as part of SLS fixes. The latter was not really necessary but harmless. So it can be 'ret', RET, or anything else that tools don't complain about. It will not be executed. -boris
Re: Reg. Tee init fail...
On 6/29/22 4:03 PM, Stefano Stabellini wrote: Adding Juergen and Boris because this is a Linux/x86 issue. As you can see from this Linux driver: https://elixir.bootlin.com/linux/latest/source/drivers/crypto/ccp/tee-dev.c#L132 Linux as dom0 on x86 is trying to communicate with firmware (TEE). Linux is calling __pa to pass a physical address to firmware. However, __pa returns a "fake" address not an mfn. I imagine that a quick workaround would be to call "virt_to_machine" instead of "__pa" in tee-dev.c. It's probably worth a try but it seems we may need to OR the result with C-bit (i.e. sme_me_mask). Or (for testing purposes) run with TSME on, I think C-bit is not set then. -boris Normally, if this was a device, the "right fix" would be to use swiotlb-xen:xen_swiotlb_map_page to get back a real physical address. However, xen_swiotlb_map_page is meant to be used as part of the dma_ops API and takes a struct device *dev as input parameter. Maybe xen_swiotlb_map_page can be used for tee-dev as well? Basically tee-dev would need to call dma_map_page before passing addresses to firmware, and dma_unmap_page when it is done. E.g.: cmd_buffer = dma_map_page(dev, virt_to_page(cmd), cmd & ~PAGE_MASK, ring_size, DMA_TO_DEVICE); Juergen, Boris, what do you think? On Fri, 24 Jun 2022, Julien Grall wrote: Hi, (moving the discussion to xen-devel as I think it is more appropriate) On 24/06/2022 10:53, SK, SivaSangeetha (Siva Sangeetha) wrote: [AMD Official Use Only - General] Not clear what this means. Hi Xen team, In TEE driver, We allocate a ring buffer, get its physical address from __pa() macro, pass the physical address to secure processor for mapping it and using in secure processor side. Source: https://elixir.bootlin.com/linux/latest/source/drivers/crypto/ccp/tee-dev.c#L132 This works good natively in Dom0 on the target. When we boot the same Dom0 kernel, with Xen hypervisor enabled, ring init fails. Do you have any error message or error code? We suspect that the address passed to secure processor, is not same when xen is enabled, and when xen is enabled, some level of address translation might be required to get exact physical address. If you are using Xen upstream, Dom0 will be mapped with IPA == PA. So there should be no need for translation. Can you provide more details on your setup (version of Xen, Linux...)? Cheers, -- Julien Grall
Re: [PATCH v2 0/3] x86: fix brk area initialization
On 6/29/22 10:10 AM, Juergen Gross wrote: On 23.06.22 11:46, Juergen Gross wrote: The brk area needs to be zeroed initially, like the .bss section. At the same time its memory should be covered by the ELF program headers. Juergen Gross (3): x86/xen: use clear_bss() for Xen PV guests x86: fix setup of brk area x86: fix .brk attribute in linker script arch/x86/include/asm/setup.h | 3 +++ arch/x86/kernel/head64.c | 4 +++- arch/x86/kernel/vmlinux.lds.S | 2 +- arch/x86/xen/enlighten_pv.c | 8 ++-- arch/x86/xen/xen-head.S | 10 +- 5 files changed, 14 insertions(+), 13 deletions(-) Could I please have some feedback? This series is fixing a major regression regarding running as a Xen PV guest (depending on the kernel configuration the system will crash very early). #regzbot ^introduced e32683c6f7d2 I don't think you need this for Xen bits as Jan had already reviewed it, but in case you do Reviewed-by: Boris Ostrovsky
Re: [PATCH V3 4/8] xen/virtio: Enable restricted memory access using Xen grant mappings
On 6/2/22 8:49 AM, Oleksandr wrote: On 31.05.22 00:00, Oleksandr Tyshchenko wrote: Hello all. From: Juergen Gross In order to support virtio in Xen guests add a config option XEN_VIRTIO enabling the user to specify whether in all Xen guests virtio should be able to access memory via Xen grant mappings only on the host side. Also set PLATFORM_VIRTIO_RESTRICTED_MEM_ACCESS feature from the guest initialization code on Arm and x86 if CONFIG_XEN_VIRTIO is enabled. Signed-off-by: Juergen Gross Signed-off-by: Oleksandr Tyshchenko Reviewed-by: Stefano Stabellini --- Changes V1 -> V2: - new patch, split required changes from commit: "[PATCH V1 3/6] xen/virtio: Add option to restrict memory access under Xen" - rework according to new platform_has() infrastructure Changes V2 -> V3: - add Stefano's R-b May I please ask for the ack or comments for x86 side here? Reviewed-by: Boris Ostrovsky
Re: [PATCH] xen: replace xen_remap() with memremap()
On 5/30/22 4:26 AM, Juergen Gross wrote: xen_remap() is used to establish mappings for frames not under direct control of the kernel: for Xenstore and console ring pages, and for grant pages of non-PV guests. Today xen_remap() is defined to use ioremap() on x86 (doing uncached mappings), and ioremap_cache() on Arm (doing cached mappings). Uncached mappings for those use cases are bad for performance, so they should be avoided if possible. As all use cases of xen_remap() don't require uncached mappings (the mapped area is always physical RAM), a mapping using the standard WB cache mode is fine. As sparse is flagging some of the xen_remap() use cases to be not appropriate for iomem(), as the result is not annotated with the __iomem modifier, eliminate xen_remap() completely and replace all use cases with memremap() specifying the MEMREMAP_WB caching mode. xen_unmap() can be replaced with memunmap(). Reported-by: kernel test robot Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
[PATCH] MAINTAINERS: Update Xen maintainership
Due to time constraints I am stepping down as maintainer. I will stay as reviewer for x86 code (for which I create a separate category). Stefano is now maintainer for Xen hypervisor interface and Oleksandr has graciously agreed to become a reviewer. Signed-off-by: Boris Ostrovsky --- MAINTAINERS | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) diff --git a/MAINTAINERS b/MAINTAINERS index d6d879cb0afd..b879c4e6750e 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -21549,23 +21549,29 @@ F:arch/arm64/include/asm/xen/ F: arch/arm64/xen/ XEN HYPERVISOR INTERFACE -M: Boris Ostrovsky M: Juergen Gross -R: Stefano Stabellini +M: Stefano Stabellini +R: Oleksandr Tyshchenko L: xen-devel@lists.xenproject.org (moderated for non-subscribers) S: Supported T: git git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip.git F: Documentation/ABI/stable/sysfs-hypervisor-xen F: Documentation/ABI/testing/sysfs-hypervisor-xen -F: arch/x86/include/asm/pvclock-abi.h -F: arch/x86/include/asm/xen/ -F: arch/x86/platform/pvh/ -F: arch/x86/xen/ F: drivers/*/xen-*front.c F: drivers/xen/ F: include/uapi/xen/ F: include/xen/ +XEN HYPERVISOR X86 +M: Juergen Gross +R: Boris Ostrovsky +L: xen-devel@lists.xenproject.org (moderated for non-subscribers) +S: Supported +F: arch/x86/include/asm/pvclock-abi.h +F: arch/x86/include/asm/xen/ +F: arch/x86/platform/pvh/ +F: arch/x86/xen/ + XEN NETWORK BACKEND DRIVER M: Wei Liu M: Paul Durrant -- 1.8.3.1
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/16/22 2:30 PM, Oleksandr wrote: On 16.05.22 19:00, Boris Ostrovsky wrote: With the error handling in gnttab_init() fixed yes, this is a diff that I am going to apply for the next version: [snip] @@ -1596,19 +1601,20 @@ static int gnttab_expand(unsigned int req_entries) int gnttab_init(void) { int i; - unsigned long max_nr_grant_frames; + unsigned long max_nr_grant_frames, max_nr_grefs; unsigned int max_nr_glist_frames, nr_glist_frames; int ret; gnttab_request_version(); max_nr_grant_frames = gnttab_max_grant_frames(); + max_nr_grefs = max_nr_grant_frames * + gnttab_interface->grefs_per_grant_frame; nr_grant_frames = 1; /* Determine the maximum number of frames required for the * grant reference free list on the current hypervisor. */ - max_nr_glist_frames = (max_nr_grant_frames * - gnttab_interface->grefs_per_grant_frame / RPP); + max_nr_glist_frames = max_nr_grefs / RPP; gnttab_list = kmalloc_array(max_nr_glist_frames, sizeof(grant_ref_t *), @@ -1625,8 +1631,7 @@ int gnttab_init(void) } } - i = gnttab_interface->grefs_per_grant_frame * max_nr_grant_frames; - gnttab_free_bitmap = bitmap_zalloc(i, GFP_KERNEL); + gnttab_free_bitmap = bitmap_zalloc(max_nr_grefs, GFP_KERNEL); if (!gnttab_free_bitmap) { ret = -ENOMEM; goto ini_nomem; Looks good, thanks. -boris
Re: [PATCH LINUX v5 1/2] xen: sync xs_wire.h header with upstream xen
On 5/13/22 5:19 PM, Stefano Stabellini wrote: From: Stefano Stabellini Sync the xs_wire.h header file in Linux with the one in Xen. Signed-off-by: Stefano Stabellini --- Changes in v5: - add XSD_ERROR(E2BIG) - Boris gave his reviewed-by but due to this change I removed it Reviewed-by: Boris Ostrovsky
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/16/22 1:59 AM, Juergen Gross wrote: On 14.05.22 04:34, Boris Ostrovsky wrote: On 5/13/22 1:33 AM, Juergen Gross wrote: On 12.05.22 22:01, Boris Ostrovsky wrote: On 5/7/22 2:19 PM, Oleksandr Tyshchenko wrote: +/* Rebuilds the free grant list and tries to find count consecutive entries. */ +static int get_free_seq(unsigned int count) +{ + int ret = -ENOSPC; + unsigned int from, to; + grant_ref_t *last; + + gnttab_free_tail_ptr = &gnttab_free_head; + last = &gnttab_free_head; + + for (from = find_first_bit(gnttab_free_bitmap, gnttab_size); + from < gnttab_size; + from = find_next_bit(gnttab_free_bitmap, gnttab_size, to + 1)) { + to = find_next_zero_bit(gnttab_free_bitmap, gnttab_size, + from + 1); + if (ret < 0 && to - from >= count) { + ret = from; + bitmap_clear(gnttab_free_bitmap, ret, count); + from += count; + gnttab_free_count -= count; IIUIC we can have multiple passes over this, meaning that the gnttab_free_count may be decremented more than once. Is that intentional? After the first pass decrementing gnttab_free_cnt, ret will no longer be less than zero, so this can be hit only once. Oh, yes, of course. + if (from == to) + continue; + } + + while (from < to) { + *last = from; + last = __gnttab_entry(from); + gnttab_last_free = from; + from++; + } I have been looking at this loop and I can't understand what it is doing ;-( Can you enlighten me? It is recreating the free list in order to have it properly sorted. This is needed to make sure that the free tail has the maximum possible size (you can take the tail off the list without having to worry about breaking the linked list because of references into the tail). So let's say we have the (one-dimensional) table of length 13 idx .. 2 3 ... 10 11 12 grant 12 11 2 -1 3 and gnttab_free_head is 10. I.e. the free list is 2, 12, 3, 11. You meant 10, 2, 12, 3, 11, I guess? What will this look like after the 2 iterations of the outer loop? idx .. 2 3 ... 10 11 12 grant 3 10 11 12 -1 with gnttab_free_head being 2, i.e the free list is now 2, 3, 10, 11, 12. OK, thanks, that helped. I couldn't link the free chunks in my head With the error handling in gnttab_init() fixed Reviewed-by: Boris Ostrovsky -boris
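To make the relinking easier to follow, here is a small self-contained sketch of the same idea in user-space C: walk the free bitmap in ascending order and rewrite the per-entry links so the list ends up sorted. It mirrors the example in the thread (free indexes 2, 3, 10, 11, 12) but is not the kernel code itself.

	#include <stdio.h>

	#define GNTTAB_LIST_END	0xffffffffU
	#define SIZE		13

	static unsigned int next_ref[SIZE];	/* "grant entry" holding the successor link */
	static unsigned int free_head = GNTTAB_LIST_END;

	/* Rebuild the free list in ascending index order from the free-bitmap. */
	static void rebuild_free_list(const int *free_bit)
	{
		unsigned int *last = &free_head;
		unsigned int i;

		for (i = 0; i < SIZE; i++) {
			if (!free_bit[i])
				continue;
			*last = i;		/* append index i to the list ...        */
			last = &next_ref[i];	/* ... and remember where its link lives */
		}
		*last = GNTTAB_LIST_END;
	}

	int main(void)
	{
		/* the example from the thread: indexes 2, 3, 10, 11, 12 are free */
		int free_bit[SIZE] = { [2] = 1, [3] = 1, [10] = 1, [11] = 1, [12] = 1 };
		unsigned int i;

		rebuild_free_list(free_bit);
		for (i = free_head; i != GNTTAB_LIST_END; i = next_ref[i])
			printf("%u ", i);	/* prints: 2 3 10 11 12 */
		printf("\n");
		return 0;
	}

With the list sorted this way, gnttab_free_tail_ptr can reference the link that starts the trailing run, so a block of consecutive grants can be taken off the tail without breaking the chain.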
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/13/22 1:33 AM, Juergen Gross wrote: On 12.05.22 22:01, Boris Ostrovsky wrote: On 5/7/22 2:19 PM, Oleksandr Tyshchenko wrote: +/* Rebuilds the free grant list and tries to find count consecutive entries. */ +static int get_free_seq(unsigned int count) +{ + int ret = -ENOSPC; + unsigned int from, to; + grant_ref_t *last; + + gnttab_free_tail_ptr = &gnttab_free_head; + last = &gnttab_free_head; + + for (from = find_first_bit(gnttab_free_bitmap, gnttab_size); + from < gnttab_size; + from = find_next_bit(gnttab_free_bitmap, gnttab_size, to + 1)) { + to = find_next_zero_bit(gnttab_free_bitmap, gnttab_size, + from + 1); + if (ret < 0 && to - from >= count) { + ret = from; + bitmap_clear(gnttab_free_bitmap, ret, count); + from += count; + gnttab_free_count -= count; IIUIC we can have multiple passes over this, meaning that the gnttab_free_count may be decremented more than once. Is that intentional? After the first pass decrementing gnttab_free_cnt, ret will no longer be less than zero, so this can be hit only once. Oh, yes, of course. + if (from == to) + continue; + } + + while (from < to) { + *last = from; + last = __gnttab_entry(from); + gnttab_last_free = from; + from++; + } I have been looking at this loop and I can't understand what it is doing ;-( Can you enlighten me? It is recreating the free list in order to have it properly sorted. This is needed to make sure that the free tail has the maximum possible size (you can take the tail off the list without having to worry about breaking the linked list because of references into the tail). So let's say we have the (one-dimensional) table of length 13 idx..23 ... 10 11 12 grant 12 112 -1 3 and gnttab_free_head is 10. I.e. the free list is 2, 12, 3, 11. What will this look like after the 2 iterations of the outer loop? (I am really having a mental block on this). -boris
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/7/22 2:19 PM, Oleksandr Tyshchenko wrote: + +/* + * Handling of free grants: + * + * Free grants are in a simple list anchored in gnttab_free_head. They are + * linked by grant ref, the last element contains GNTTAB_LIST_END. The number + * of free entries is stored in gnttab_free_count. + * Additionally there is a bitmap of free entries anchored in + * gnttab_free_bitmap. This is being used for simplifying allocation of + * multiple consecutive grants, which is needed e.g. for support of virtio. + * gnttab_last_free is used to add free entries of new frames at the end of + * the free list. + * gnttab_free_tail_ptr specifies the variable which references the start If this references the start of the free interval, why is it called gnttab_free_*tail*_ptr? + * of consecutive free grants ending with gnttab_last_free. This pointer is + * updated in a rather defensive way, in order to avoid performance hits in + * hot paths. + * All those variables are protected by gnttab_list_lock. + */ static int gnttab_free_count; -static grant_ref_t gnttab_free_head; +static unsigned int gnttab_size; +static grant_ref_t gnttab_free_head = GNTTAB_LIST_END; +static grant_ref_t gnttab_last_free = GNTTAB_LIST_END; +static grant_ref_t *gnttab_free_tail_ptr; +static unsigned long *gnttab_free_bitmap; static DEFINE_SPINLOCK(gnttab_list_lock); + struct grant_frames xen_auto_xlat_grant_frames; static unsigned int xen_gnttab_version; module_param_named(version, xen_gnttab_version, uint, 0); @@ -170,16 +194,111 @@ static int get_free_entries(unsigned count) ref = head = gnttab_free_head; gnttab_free_count -= count; - while (count-- > 1) - head = gnttab_entry(head); + while (count--) { + bitmap_clear(gnttab_free_bitmap, head, 1); + if (gnttab_free_tail_ptr == __gnttab_entry(head)) + gnttab_free_tail_ptr = &gnttab_free_head; + if (count) + head = gnttab_entry(head); + } gnttab_free_head = gnttab_entry(head); gnttab_entry(head) = GNTTAB_LIST_END; + if (!gnttab_free_count) { + gnttab_last_free = GNTTAB_LIST_END; + gnttab_free_tail_ptr = NULL; + } + spin_unlock_irqrestore(&gnttab_list_lock, flags); return ref; } +static int get_seq_entry_count(void) +{ + if (gnttab_last_free == GNTTAB_LIST_END || !gnttab_free_tail_ptr || + *gnttab_free_tail_ptr == GNTTAB_LIST_END) + return 0; + + return gnttab_last_free - *gnttab_free_tail_ptr + 1; +} + +/* Rebuilds the free grant list and tries to find count consecutive entries. */ +static int get_free_seq(unsigned int count) +{ + int ret = -ENOSPC; + unsigned int from, to; + grant_ref_t *last; + + gnttab_free_tail_ptr = &gnttab_free_head; + last = &gnttab_free_head; + + for (from = find_first_bit(gnttab_free_bitmap, gnttab_size); +from < gnttab_size; +from = find_next_bit(gnttab_free_bitmap, gnttab_size, to + 1)) { + to = find_next_zero_bit(gnttab_free_bitmap, gnttab_size, + from + 1); + if (ret < 0 && to - from >= count) { + ret = from; + bitmap_clear(gnttab_free_bitmap, ret, count); + from += count; + gnttab_free_count -= count; IIUIC we can have multiple passes over this, meaning that the gnttab_free_count may be decremented more than once. Is that intentional? + if (from == to) + continue; + } + + while (from < to) { + *last = from; + last = __gnttab_entry(from); + gnttab_last_free = from; + from++; + } I have been looking at this loop and I can't understand what it is doing ;-( Can you enlighten me? 
-boris + if (to < gnttab_size) + gnttab_free_tail_ptr = __gnttab_entry(to - 1); + } + + *last = GNTTAB_LIST_END; + if (gnttab_last_free != gnttab_size - 1) + gnttab_free_tail_ptr = NULL; + + return ret; +} + +static int get_free_entries_seq(unsigned int count) +{ + unsigned long flags; + int ret = 0; + + spin_lock_irqsave(&gnttab_list_lock, flags); + + if (gnttab_free_count < count) { + ret = gnttab_expand(count - gnttab_free_count); + if (ret < 0) + goto out; + } + + if (get_seq_entry_count() < count) { + ret = get_free_seq(count); + if (ret >= 0) + goto out; + ret = gnttab_expand(count - get_seq_entry_count()); + if (ret < 0) + goto out; + } + + ret = *gnttab_free_t
Re: [PATCH V2 2/7] xen/grants: support allocating consecutive grants
On 5/11/22 2:00 PM, Oleksandr wrote: On 07.05.22 21:19, Oleksandr Tyshchenko wrote: Hello Boris, Stefano From: Juergen Gross For support of virtio via grant mappings in rare cases larger mappings using consecutive grants are needed. Support those by adding a bitmap of free grants. As consecutive grants will be needed only in very rare cases (e.g. when configuring a virtio device with a multi-page ring), optimize for the normal case of non-consecutive allocations. Signed-off-by: Juergen Gross --- Changes RFC -> V1: - no changes Changes V1 -> V2: - no changes May I please ask for the review here? I had a quick look but I am stuck on get_free_seq(), I need to stare at it some more. Unless someone else reviews this, I will try to get to this in the next couple of days. One thing I did notice is @@ -1452,6 +1624,13 @@ int gnttab_init(void) } } + i = gnttab_interface->grefs_per_grant_frame * max_nr_grant_frames; + gnttab_free_bitmap = bitmap_zalloc(i, GFP_KERNEL); + if (!gnttab_free_bitmap) { + ret = -ENOMEM; + goto ini_nomem; + } This overwrites 'i' and will break error handling at ini_nomem. -boris
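A minimal sketch of the fix for the clobbered counter: size the bitmap from its own variable so that 'i' still holds the gnttab_list[] index the ini_nomem error path expects. The max_nr_grefs name is taken from the follow-up version of the patch quoted further up; the fragment below is illustrative, not the final diff.

	unsigned long max_nr_grefs = max_nr_grant_frames *
				     gnttab_interface->grefs_per_grant_frame;

	gnttab_free_bitmap = bitmap_zalloc(max_nr_grefs, GFP_KERNEL);
	if (!gnttab_free_bitmap) {
		ret = -ENOMEM;
		goto ini_nomem;		/* 'i' untouched, so the unwind loop stays correct */
	}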
Re: Attributing linux related patches on xen-devel
On 5/9/22 3:01 PM, Stefano Stabellini wrote: On Mon, 9 May 2022, Juergen Gross wrote: On IRC the question came up whether it would be possible to have a special marker for Linux patches on the xen-devel ML. I suggested to use xen-devel+li...@lists.xenprojext.org for those patches. With a patch for the kernel's MAINTAINERS file this would be quite easy to achieve. Any thoughts? Fine by me, as long as xen-devel+li...@lists.xenprojext.org works :-) The alternative would be to come up with a different subject tag (e.g. "PATCH LINUX") but that doesn't work as it is not supported by today's Linux MAINTAINERS file. So I think xen-devel+linux is fine. I'd prefer '-' instead of '+' but either way is fine. -boris
Re: [PATCH v3 00/21] xen: simplify frontend side ring setup
On 5/5/22 4:16 AM, Juergen Gross wrote: Many Xen PV frontends share similar code for setting up a ring page (allocating and granting access for the backend) and for tearing it down. Create new service functions doing all needed steps in one go. This requires all frontends to use a common value for an invalid grant reference in order to make the functions idempotent. Changes in V3: - new patches 1 and 2, comments addressed Changes in V2: - new patch 9 and related changes in patches 10-18 For the patches that I was explicitly copied on: Reviewed-by: Boris Ostrovsky
Re: [PATCH v4 2/2] xen: add support for initializing xenstore later as HVM domain
On 5/4/22 8:23 PM, Stefano Stabellini wrote: @@ -957,25 +989,44 @@ static int __init xenbus_init(void) * been properly initialized. Instead of attempting to map a * wrong guest physical address return error. * -* Also recognize all bits set as an invalid value. +* Also recognize all bits set as an invalid/uninitialized value. What I really meant (but not what I actually wrote I guess) was that now we are treating -1 differently than 0 and so that comment should go ... */ - if (!v || !~v) { + if (!v) { err = -ENOENT; goto out_error; } - /* Avoid truncation on 32-bit. */ ... here. But this is nitpicking so for the series Reviewed-by: Boris Ostrovsky + if (v == ~0ULL) { + wait = true; + } else { + /* Avoid truncation on 32-bit. */
Re: [LINUX PATCH v3] xen: add support for initializing xenstore later as HVM domain
On 5/2/22 8:36 PM, Stefano Stabellini wrote: I gave it a try and there is a problem with unbinding the IRQ here. If I do that, later when xb_init_comms calls bind_evtchn_to_irqhandler, the event channel doesn't fire anymore. I did some testing and debugging and as far as I can tell the issue is that if we call unbind_from_irqhandler here, we end up calling xen_evtchn_close. Then, when xb_init_comms calls bind_evtchn_to_irqhandler on the same evtchn, it is not enough to receive event notifications anymore because from Xen POV the evtchn is "closed". Ah, yes. That's unfortunate. If I comment out xen_evtchn_close() in __unbind_from_irq, then yes, I can call unbind_from_irqhandler here instead of free_irq and everything works. My suggestion is to keep the call to free_irq here (not unbind_from_irqhandler) and improve the in-code comment. OK. You could add an argument to unbind_from_irq() to keep the event channel open (I in fact am not sure it should be closing the channel since bind routines don't open it) but that looks pretty ugly too. -boris
Re: [LINUX PATCH v3] xen: add support for initializing xenstore later as HVM domain
On 4/29/22 5:10 PM, Stefano Stabellini wrote: From: Luca Miccio When running as dom0less guest (HVM domain on ARM) the xenstore event channel is available at domain creation but the shared xenstore interface page only becomes available later on. In that case, wait for a notification on the xenstore event channel, then complete the xenstore initialization later, when the shared page is actually available. The xenstore page has few extra field. Add them to the shared struct. One of the field is "connection", when the connection is ready, it is zero. If the connection is not-zero, wait for a notification. Signed-off-by: Luca Miccio Signed-off-by: Stefano Stabellini CC: jgr...@suse.com CC: boris.ostrov...@oracle.com --- Changes in v3: - check for the connection field, if it is not zero, wait for event Changes in v2: - remove XENFEAT_xenstore_late_init --- drivers/xen/xenbus/xenbus_probe.c | 86 +++--- include/xen/interface/io/xs_wire.h | 3 ++ 2 files changed, 70 insertions(+), 19 deletions(-) diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c index fe360c33ce71..dc046d25789e 100644 --- a/drivers/xen/xenbus/xenbus_probe.c +++ b/drivers/xen/xenbus/xenbus_probe.c @@ -65,6 +65,7 @@ #include "xenbus.h" +static int xs_init_irq; int xen_store_evtchn; EXPORT_SYMBOL_GPL(xen_store_evtchn); @@ -750,6 +751,17 @@ static void xenbus_probe(void) { xenstored_ready = 1; + if (!xen_store_interface) { + xen_store_interface = xen_remap(xen_store_gfn << XEN_PAGE_SHIFT, + XEN_PAGE_SIZE); + /* +* Now it is safe to free the IRQ used for xenstore late +* initialization. No need to unbind: it is about to be +* bound again. This assumes knowledge of bind/unbind internals. I think we should unbind. +*/ + free_irq(xs_init_irq, &xb_waitq); + } + @@ -959,23 +988,42 @@ static int __init xenbus_init(void) * * Also recognize all bits set as an invalid value. Is this comment still correct? */ - if (!v || !~v) { + if (!v) { err = -ENOENT; goto out_error; } - /* Avoid truncation on 32-bit. */ + if (v == ~0ULL) { + wait = true; + } else { + /* Avoid truncation on 32-bit. */
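For context, a rough sketch of the late-init plumbing being reviewed here: bind a throwaway handler to the xenstore event channel whose only job is to wake the xenbus wait queue once xenstored populates the shared page; xenbus_probe() then maps the page and releases the temporary IRQ. The handler name and helper below are illustrative assumptions, not the exact hunks of the posted patch.

	static irqreturn_t xs_init_interrupt(int irq, void *dev_id)
	{
		wake_up(&xb_waitq);	/* xenstored may be up now, re-check the page */
		return IRQ_HANDLED;
	}

	static int xenbus_init_late(void)
	{
		int err;

		err = bind_evtchn_to_irqhandler(xen_store_evtchn, xs_init_interrupt,
						0, "xenbus_init", &xb_waitq);
		if (err < 0)
			return err;
		xs_init_irq = err;	/* freed again in xenbus_probe() */
		return 0;
	}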
Re: [PATCH] swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm
On 4/28/22 6:49 PM, Stefano Stabellini wrote: On Thu, 28 Apr 2022, Boris Ostrovsky wrote: On 4/28/22 5:49 PM, Stefano Stabellini wrote: On Thu, 28 Apr 2022, Christoph Hellwig wrote: On Tue, Apr 26, 2022 at 04:07:45PM -0700, Stefano Stabellini wrote: Reported-by: Rahul Singh Signed-off-by: Christoph Hellwig Reviewed-by: Stefano Stabellini Do you want to take this through the Xen tree or should I pick it up? Either way I'd love to see some testing on x86 as well. I agree on the x86 testing. Juergen, Boris? I'd say to take this patch via the Xen tree but Juergen has just sent a Xen pull request to Linux last Saturday. Juergen do you plan to send another one? Do you have something else lined up? If not, it might be better to let Christoph pick it up. We don't have anything pending. I can test it but at best tomorrow so not sure we can get this into rc5. Do you consider this an urgent fix or can we wait until 5.19? Because it's a bit too much IMO for rc6. On one hand, Linux doesn't boot on a platform without this fix. On the other hand, I totally see that this patch could introduce regressions on x86 so I think it is fair that we are careful with it. From my point of view, it might be better to wait for 5.19 and mark it as backport. No problems uncovered during testing. -boris
Re: [PATCH v2 0/4] xen/pv-scsi: update header and harden frontend
On 4/28/22 3:53 AM, Juergen Gross wrote: Update the Xen PV-scsi interface from the Xen tree and adapt the related drivers to use the new definitions. Harden the frontend driver to be no longer vulnerable to a malicious backend. Juergen Gross (4): xen: update vscsiif.h xen/scsiback: use new command result macros xen/scsifront: use new command result macros xen/scsifront: harden driver against malicious backend drivers/scsi/xen-scsifront.c | 168 ++--- drivers/xen/xen-scsiback.c | 82 +- include/xen/interface/io/vscsiif.h | 133 ++- 3 files changed, 340 insertions(+), 43 deletions(-) Reviewed-by: Boris Ostrovsky
Re: [PATCH] swiotlb-xen: fix DMA_ATTR_NO_KERNEL_MAPPING on arm
On 4/28/22 5:49 PM, Stefano Stabellini wrote: On Thu, 28 Apr 2022, Christoph Hellwig wrote: On Tue, Apr 26, 2022 at 04:07:45PM -0700, Stefano Stabellini wrote: Reported-by: Rahul Singh Signed-off-by: Christoph Hellwig Reviewed-by: Stefano Stabellini Do you want to take this through the Xen tree or should I pick it up? Either way I'd love to see some testing on x86 as well. I agree on the x86 testing. Juergen, Boris? I'd say to take this patch via the Xen tree but Juergen has just sent a Xen pull request to Linux last Saturday. Juergen do you plan to send another one? Do you have something else lined up? If not, it might be better to let Christoph pick it up. We don't have anything pending. I can test it but at best tomorrow so not sure we can get this into rc5. Do you consider this an urgent fix or can we wait until 5.19? Because it's a bit too much IMO for rc6. -boris
Re: x86/PV: (lack of) MTRR exposure
On 4/28/22 11:53 AM, Jan Beulich wrote: Hello, in the course of analyzing the i915 driver causing boot to fail in Linux 5.18 I found that Linux, for all the years, has been running in PV mode as if PAT was (mostly) disabled. This is a result of them tying PAT initialization to MTRR initialization, while we offer PAT but not MTRR in CPUID output. This was different before our moving to CPU featuresets, and as such one could view this behavior as a regression from that change. The first question here is whether not exposing MTRR as a feature was really intended, in particular also for PV Dom0. The XenoLinux kernel and its forward ports did make use of XENPF_*_memtype to deal with MTRRs. That's functionality which (maybe for a good reason) never made it into the pvops kernel. Note that PVH Dom0 does have access to the original settings, as the host values are used as initial state there. Initially MTRR was supposed to be supported but it was shot down by x86 maintainers who strongly suggested to use PAT. https://lists.xen.org/archives/html/xen-devel/2010-09/msg01634.html -boris The next question would be how we could go about improving the situation. For the particular issue in 5.18 I've found a relatively simple solution [1] (which also looks to help graphics performance on other systems, according to my initial observations with using the change), albeit its simplicity likely means it either is wrong in some way, or might not be liked for looking hacky and/or abusive. We can't, for example, simply turn on the MTRR bit in CPUID, as that would implicitly lead to the kernel trying to write the PAT MSR (if, see below, it didn't itself zap the bit). We also can't simply ignore PAT MSR writes, as the kernel won't check whether writes actually took effect. (All of that did work on top of old Xen apparently only because xen_init_capabilities() itself also forces the MTRR feature to off.) Jan [1] https://lists.xen.org/archives/html/xen-devel/2022-04/msg02392.html
Re: [PATCH V1 3/6] xen/virtio: Add option to restrict memory access under Xen
On 4/24/22 12:53 PM, Oleksandr wrote: On 23.04.22 19:40, Christoph Hellwig wrote: + +#ifdef CONFIG_ARCH_HAS_RESTRICTED_VIRTIO_MEMORY_ACCESS +int arch_has_restricted_virtio_memory_access(void) +{ + return (xen_has_restricted_virtio_memory_access() || + cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT)); +} So instead of hardcoding Xen here, this seems like a candidate for another cc_platform_has flag. I have a limited knowledge of x86 and Xen on x86. Would the Xen specific bits fit into Confidential Computing Platform checks? I will let Juergen/Boris comment on this. This is unrelated to confidential so I don't think we can add another CC_ flag. Would arch/x86/kernel/cpu/hypervisor.c be a better home for this? -boris
Re: [PATCH v2] xen: Convert kmap() to kmap_local_page()
On 4/19/22 7:43 PM, Alaa Mohamed wrote: kmap() is being deprecated and these usages are all local to the thread so there is no reason kmap_local_page() can't be used. Replace kmap() calls with kmap_local_page(). Signed-off-by: Alaa Mohamed Applied to for-linus-5.18. (Juergen, I kept your R-b from v1) -boris
Re: [PATCH 2/4] xen/scsiback: use new command result macros
On 4/21/22 4:40 AM, Juergen Gross wrote: On 20.04.22 18:12, Boris Ostrovsky wrote: On 4/20/22 5:25 AM, Juergen Gross wrote: @@ -569,7 +645,7 @@ static void scsiback_device_action(struct vscsibk_pend *pending_req, wait_for_completion(&pending_req->tmr_done); err = (se_cmd->se_tmr_req->response == TMR_FUNCTION_COMPLETE) ? - SUCCESS : FAILED; + XEN_VSCSIIF_RSLT_RESET_SUCCESS : XEN_VSCSIIF_RSLT_RESET_FAILED; scsiback_do_resp_with_sense(NULL, err, 0, pending_req); transport_generic_free_cmd(&pending_req->se_cmd, 0); You also want to initialize err to XEN_VSCSIIF_RSLT_RESET_FAILED. I did that. Yes you did. I don't know what I was looking at. And also looking at invocations of scsiback_do_resp_with_sense() I think those may need to be adjusted as well. No, the invocations are fine, but scsiback_result() needs to pass through the lowest 16 bits instead of only the lowest 8 bits of the result value. What I was thinking was that this could use the reverse of XEN_VSCSIIF_RSLT_HOST(), i.e. something like #define RSLT_HOST_TO_XEN_VSCSIIF(x) ((x)<<16) to be explicit about namespaces. BTW, should scsiback_result() use XEN_VSCSIIF_RSLT_HOST() at the top? -boris
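Spelled out, the suggestion is a macro that is the exact inverse of XEN_VSCSIIF_RSLT_HOST(), so host-status values are shifted into place with the namespacing made explicit. The usage line below is illustrative; the variable names are not from the posted series.

	/* reverse of XEN_VSCSIIF_RSLT_HOST(): put a host status into the host byte */
	#define RSLT_HOST_TO_XEN_VSCSIIF(x)	((x) << 16)

	/* e.g. building the response value from a host status plus the
	 * lowest 16 bits passed through from the backend's result: */
	ring_res.rslt = RSLT_HOST_TO_XEN_VSCSIIF(host_status) | (result & 0xffff);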
Re: [PATCH 09/18] xen/xenbus: add xenbus_setup_ring() service function
On 4/20/22 11:09 AM, Juergen Gross wrote: +/* + * xenbus_setup_ring + * @dev: xenbus device + * @vaddr: pointer to starting virtual address of the ring + * @nr_pages: number of pages to be granted + * @grefs: grant reference array to be filled in + * + * Allocate physically contiguous pages for a shared ring buffer and grant it + * to the peer of the given device. The ring buffer is initially filled with + * zeroes. The virtual address of the ring is stored at @vaddr and the + * grant references are stored in the @grefs array. In case of error @vaddr + * will be set to NULL and @grefs will be filled with INVALID_GRANT_REF. + */ +int xenbus_setup_ring(struct xenbus_device *dev, gfp_t gfp, void **vaddr, + unsigned int nr_pages, grant_ref_t *grefs) +{ + unsigned long ring_size = nr_pages * XEN_PAGE_SIZE; + unsigned int i; + int ret; + + *vaddr = alloc_pages_exact(ring_size, gfp | __GFP_ZERO); + if (!*vaddr) { + ret = -ENOMEM; + goto err; + } + + ret = xenbus_grant_ring(dev, *vaddr, nr_pages, grefs); + if (ret) + goto err; + + return 0; + + err: + if (*vaddr) + free_pages_exact(*vaddr, ring_size); + for (i = 0; i < nr_pages; i++) + grefs[i] = INVALID_GRANT_REF; + *vaddr = NULL; + + return ret; +} We can create a wrapper around this function that will also call SHARED_RING_INIT() and FRONT_RING_INIT(). A bunch of drivers do that. -boris
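Something like the following is what such a frontend-side wrapper could look like, for a driver whose ring types come from DEFINE_RING_TYPES(foo, ...); the foo names and the GFP flags are placeholders, not an existing API.

	static int foo_setup_ring(struct xenbus_device *dev,
				  struct foo_front_ring *front_ring, grant_ref_t *gref)
	{
		struct foo_sring *sring;
		void *vaddr;
		int err;

		err = xenbus_setup_ring(dev, GFP_NOIO | __GFP_HIGH, &vaddr, 1, gref);
		if (err)
			return err;

		sring = vaddr;
		SHARED_RING_INIT(sring);
		FRONT_RING_INIT(front_ring, sring, XEN_PAGE_SIZE);
		return 0;
	}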
Re: [PATCH 4/4] xen/scsifront: harden driver against malicious backend
Just a couple of nits. On 4/20/22 5:25 AM, Juergen Gross wrote: -static int scsifront_ring_drain(struct vscsifrnt_info *info) +static int scsifront_ring_drain(struct vscsifrnt_info *info, + unsigned int *eoiflag) { - struct vscsiif_response *ring_rsp; + struct vscsiif_response ring_rsp; RING_IDX i, rp; int more_to_do = 0; - rp = info->ring.sring->rsp_prod; - rmb(); /* ordering required respective to dom0 */ + rp = READ_ONCE(info->ring.sring->rsp_prod); + virt_rmb(); /* ordering required respective to backend */ + if (RING_RESPONSE_PROD_OVERFLOW(&info->ring, rp)) { + scsifront_set_error(info, "illegal number of responses"); In net and block drivers we report number of such responses. (But not in usb) + return 0; + } for (i = info->ring.rsp_cons; i != rp; i++) { - ring_rsp = RING_GET_RESPONSE(&info->ring, i); - scsifront_do_response(info, ring_rsp); + RING_COPY_RESPONSE(&info->ring, i, &ring_rsp); + scsifront_do_response(info, &ring_rsp); + if (info->host_active == STATE_ERROR) + return 0; + *eoiflag = 0; *eoiflags &= ~XEN_EOI_FLAG_SPURIOUS; ? We also use eoi_flags name in other instances in this file. -boris
Re: [PATCH 2/4] xen/scsiback: use new command result macros
On 4/20/22 5:25 AM, Juergen Gross wrote: @@ -569,7 +645,7 @@ static void scsiback_device_action(struct vscsibk_pend *pending_req, wait_for_completion(&pending_req->tmr_done); err = (se_cmd->se_tmr_req->response == TMR_FUNCTION_COMPLETE) ? - SUCCESS : FAILED; + XEN_VSCSIIF_RSLT_RESET_SUCCESS : XEN_VSCSIIF_RSLT_RESET_FAILED; scsiback_do_resp_with_sense(NULL, err, 0, pending_req); transport_generic_free_cmd(&pending_req->se_cmd, 0); You also want to initialize err to XEN_VSCSIIF_RSLT_RESET_FAILED. And also looking at invocations of scsiback_do_resp_with_sense() I think those may need to be adjusted as well. -boris
Re: [PATCH] xen/balloon: don't use PV mode extra memory for zone device allocations
On 4/7/22 5:38 AM, Juergen Gross wrote: When running as a Xen PV guest use the extra memory (memory which isn't allocated for the guest at boot time) only for ballooning purposes and not for zone device allocations. This will remove some code without any lack of functionality. While at it move some code to get rid of another #ifdef. Remove a comment which is stale since some time now. Signed-off-by: Juergen Gross Applied to for-linus-5.18 -boris
Re: [PATCH] xen/balloon: don't use PV mode extra memory for zone device allocations
On 4/7/22 5:38 AM, Juergen Gross wrote: When running as a Xen PV guest use the extra memory (memory which isn't allocated for the guest at boot time) only for ballooning purposes and not for zone device allocations. This will remove some code without any lack of functionality. While at it move some code to get rid of another #ifdef. Remove a comment which is stale since some time now. Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: cleanup swiotlb initialization v8
On 4/4/22 1:05 AM, Christoph Hellwig wrote: Hi all, this series tries to clean up the swiotlb initialization, including that of swiotlb-xen. To get there is also removes the x86 iommu table infrastructure that massively obsfucates the initialization path. Git tree: git://git.infradead.org/users/hch/misc.git swiotlb-init-cleanup Gitweb: http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/swiotlb-init-cleanup Tested-by: Boris Ostrovsky
Re: [PATCH v2] xen: fix is_xen_pmu()
On 3/25/22 10:20 AM, Juergen Gross wrote: is_xen_pmu() is taking the cpu number as parameter, but it is not using it. Instead it just tests whether the Xen PMU initialization on the current cpu did succeed. As this test is done by checking a percpu pointer, preemption needs to be disabled in order to avoid switching the cpu while doing the test. While resuming from suspend() this seems not to be the case: [ 88.082751] ACPI: PM: Low-level resume complete [ 88.087933] ACPI: EC: EC started [ 88.091464] ACPI: PM: Restoring platform NVS memory [ 88.097166] xen_acpi_processor: Uploading Xen processor PM info [ 88.103850] Enabling non-boot CPUs ... [ 88.108128] installing Xen timer for CPU 1 [ 88.112763] BUG: using smp_processor_id() in preemptible [] code: systemd-sleep/7138 [ 88.122256] caller is is_xen_pmu+0x12/0x30 [ 88.126937] CPU: 0 PID: 7138 Comm: systemd-sleep Tainted: GW 5.16.13-2.fc32.qubes.x86_64 #1 [ 88.137939] Hardware name: Star Labs StarBook/StarBook, BIOS 7.97 03/21/2022 [ 88.145930] Call Trace: [ 88.148757] [ 88.151193] dump_stack_lvl+0x48/0x5e [ 88.155381] check_preemption_disabled+0xde/0xe0 [ 88.160641] is_xen_pmu+0x12/0x30 [ 88.164441] xen_smp_intr_init_pv+0x75/0x100 Fix that by replacing is_xen_pmu() by a simple boolean variable which reflects the Xen PMU initialization state on cpu 0. Modify xen_pmu_init() to return early in case it is being called for a cpu other than cpu 0 and the boolean variable not being set. Fixes: bf6dfb154d93 ("xen/PMU: PMU emulation code") Reported-by: Marek Marczykowski-Górecki Signed-off-by: Juergen Gross Applied to for-linus-5.18.
Re: [PATCH] xen: don't hang when resuming PCI device
On 3/22/22 9:21 PM, Jakub Kądziołka wrote: If a xen domain with at least two VCPUs has a PCI device attached which enters the D3hot state during suspend, the kernel may hang while resuming, depending on the core on which an async resume task gets scheduled. The bug occurs because xen's do_suspend calls dpm_resume_start while only the timer of the boot CPU has been resumed (when xen_suspend called syscore_resume), before calling xen_arch_suspend to resume the timers of the other CPUs. This breaks pci_dev_d3_sleep. Thus this patch moves the call to xen_arch_resume before the call to dpm_resume_start, eliminating the hangs and restoring the stack-like structure of the suspend/restore procedure. Signed-off-by: Jakub Kądziołka Applied to for-linus-5.18
Re: [PATCH v2] xen: fix is_xen_pmu()
On 3/25/22 10:20 AM, Juergen Gross wrote: is_xen_pmu() is taking the cpu number as parameter, but it is not using it. Instead it just tests whether the Xen PMU initialization on the current cpu did succeed. As this test is done by checking a percpu pointer, preemption needs to be disabled in order to avoid switching the cpu while doing the test. While resuming from suspend() this seems not to be the case: [ 88.082751] ACPI: PM: Low-level resume complete [ 88.087933] ACPI: EC: EC started [ 88.091464] ACPI: PM: Restoring platform NVS memory [ 88.097166] xen_acpi_processor: Uploading Xen processor PM info [ 88.103850] Enabling non-boot CPUs ... [ 88.108128] installing Xen timer for CPU 1 [ 88.112763] BUG: using smp_processor_id() in preemptible [] code: systemd-sleep/7138 [ 88.122256] caller is is_xen_pmu+0x12/0x30 [ 88.126937] CPU: 0 PID: 7138 Comm: systemd-sleep Tainted: GW 5.16.13-2.fc32.qubes.x86_64 #1 [ 88.137939] Hardware name: Star Labs StarBook/StarBook, BIOS 7.97 03/21/2022 [ 88.145930] Call Trace: [ 88.148757] [ 88.151193] dump_stack_lvl+0x48/0x5e [ 88.155381] check_preemption_disabled+0xde/0xe0 [ 88.160641] is_xen_pmu+0x12/0x30 [ 88.164441] xen_smp_intr_init_pv+0x75/0x100 Fix that by replacing is_xen_pmu() by a simple boolean variable which reflects the Xen PMU initialization state on cpu 0. Modify xen_pmu_init() to return early in case it is being called for a cpu other than cpu 0 and the boolean variable not being set. Fixes: bf6dfb154d93 ("xen/PMU: PMU emulation code") Reported-by: Marek Marczykowski-Górecki Signed-off-by: Juergen Gross Reviewed-by: Boris Ostrovsky
Re: [PATCH] xen: fix is_xen_pmu()
On 3/22/22 11:50 AM, Juergen Gross wrote: is_xen_pmu() is taking the cpu number as parameter, but it is not using it. Instead it just tests whether the Xen PMU initialization on the current cpu did succeed. As this test is done by checking a percpu pointer, preemption needs to be disabled in order to avoid switching the cpu while doing the test. While resuming from suspend() this seems not to be the case: [ 88.082751] ACPI: PM: Low-level resume complete [ 88.087933] ACPI: EC: EC started [ 88.091464] ACPI: PM: Restoring platform NVS memory [ 88.097166] xen_acpi_processor: Uploading Xen processor PM info [ 88.103850] Enabling non-boot CPUs ... [ 88.108128] installing Xen timer for CPU 1 [ 88.112763] BUG: using smp_processor_id() in preemptible [] code: systemd-sleep/7138 [ 88.122256] caller is is_xen_pmu+0x12/0x30 [ 88.126937] CPU: 0 PID: 7138 Comm: systemd-sleep Tainted: GW 5.16.13-2.fc32.qubes.x86_64 #1 [ 88.137939] Hardware name: Star Labs StarBook/StarBook, BIOS 7.97 03/21/2022 [ 88.145930] Call Trace: [ 88.148757] [ 88.151193] dump_stack_lvl+0x48/0x5e [ 88.155381] check_preemption_disabled+0xde/0xe0 [ 88.160641] is_xen_pmu+0x12/0x30 [ 88.164441] xen_smp_intr_init_pv+0x75/0x100 There is actually another PMU-related problem on restore which was caused (or, rather, highlighted) by ff083a2d972f56bebfd82409ca62e5dfce950961: [ 116.861637] [ cut here ] [ 116.861651] WARNING: CPU: 1 PID: 31 at kernel/events/core.c:6614 perf_register_guest_info_callbacks+0x68/0x70 [ 116.861673] Modules linked in: [ 116.861682] CPU: 1 PID: 31 Comm: xenwatch Not tainted 5.17.0-rc7ostr #103 [ 116.861695] RIP: e030:perf_register_guest_info_callbacks+0x68/0x70 [ 116.861706] Code: c7 c7 40 e1 86 82 e8 d7 e7 ff ff 48 8b 53 10 48 85 d2 74 14 48 c7 c6 f0 0a c0 81 48 c7 c7 30 e1 86 82 5b e9 ba e7 ff ff 5b c3 <0f> 0b c3 0f 1f 44 00 00 0f 1f 44 00 00 48 8b 05 54 fd 0b 02 48 39 [ 116.861747] RSP: e02b:c9004016fe18 EFLAGS: 00010286 [ 116.861758] RAX: 82432850 RBX: 0003 RCX: 888079c0 [ 116.861768] RDX: 888079c0 RSI: c9004016fe30 RDI: 82432850 [ 116.861778] RBP: R08: 1600 R09: ea0ed340 [ 116.861788] R10: 0758 R11: R12: 888003b4d000 [ 116.861797] R13: 0003 R14: 8162cf10 R15: [ 116.861819] FS: () GS:888079c8() knlGS: [ 116.861830] CS: e030 DS: ES: CR0: 80050033 [ 116.861839] CR2: CR3: 062b6000 CR4: 00040660 [ 116.861853] Call Trace: [ 116.861861] [ 116.861866] xen_pmu_init+0x187/0x280 [ 116.861879] xen_arch_resume+0x30/0x50 [ 116.861888] do_suspend.cold+0x132/0x147 [ 116.861899] shutdown_handler+0x12e/0x140 [ 116.861910] xenwatch_thread+0x94/0x180 [ 116.861919] ? finish_wait+0x80/0x80 [ 116.861928] kthread+0xe7/0x110 [ 116.861938] ? kthread_complete_and_exit+0x20/0x20 [ 116.861948] ret_from_fork+0x22/0x30 [ 116.861959] [ 116.861964] ---[ end trace ]--- I was going to send a patch but I think yours can be slightly modified to take care of this problem as well. @@ -542,6 +539,7 @@ void xen_pmu_init(int cpu) per_cpu(xenpmu_shared, cpu).flags = 0; if (cpu == 0) { if (!is_xen_pmu) + is_xen_pmu = true; perf_register_guest_info_callbacks(&xen_guest_cbs); xen_pmu_arch_init(); } @@ -572,4 +570,7 @@ void xen_pmu_finish(int cpu) free_pages((unsigned long)per_cpu(xenpmu_shared, cpu).xenpmu_data, 0); per_cpu(xenpmu_shared, cpu).xenpmu_data = NULL; + + if (cpu == 0) + is_xen_pmu = false; And drop this hunk. -boris
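Putting the two comments together, the cpu 0 path of xen_pmu_init() would end up along these lines (a sketch of the described behaviour, not the final diff): register the perf callbacks only once, and let secondary CPUs bail out early when cpu 0 never completed PMU setup.

	bool is_xen_pmu;	/* true once PMU setup succeeded on cpu 0 */

	/* early in xen_pmu_init(cpu): */
	if (cpu != 0 && !is_xen_pmu)
		return;

	/* ... per-cpu xenpmu_data setup as before ... */

	if (cpu == 0 && !is_xen_pmu) {
		is_xen_pmu = true;
		perf_register_guest_info_callbacks(&xen_guest_cbs);
		xen_pmu_arch_init();
	}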
Re: [PATCH] arch:x86:xen: Remove unnecessary assignment in xen_apic_read()
On 3/14/22 3:05 AM, jianchunfu wrote: In the function xen_apic_read(), the initialized value of 'ret' is unused because it will be assigned by the function HYPERVISOR_platform_op(), thus remove it. Signed-off-by: jianchunfu Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18 (both patches)
Re: [PATCH 1/2] xen/grant-table: remove gnttab_*transfer*() functions
On 3/11/22 5:34 AM, Juergen Gross wrote: All grant table operations related to the "transfer" functionality are unused currently. There have been users in the old days of the "Xen-o-Linux" kernel, but those didn't make it upstream. So remove the "transfer" related functions. Signed-off-by: Juergen Gross Applied to for-linus-5.18 (both patches)
Re: [PATCH 1/2] xen/grant-table: remove gnttab_*transfer*() functions
On 3/15/22 2:10 AM, Juergen Gross wrote: On 15.03.22 00:37, Boris Ostrovsky wrote: On 3/11/22 5:34 AM, Juergen Gross wrote: All grant table operations related to the "transfer" functionality are unused currently. There have been users in the old days of the "Xen-o-Linux" kernel, but those didn't make it upstream. So remove the "transfer" related functions. Do we need to assert somewhere that transfer flags are not set? This would be an orthogonal change, right? My patch is just removing never called functions. I was thinking of having this done as part of code removal (maybe as a separate patch). In any case I believe checking those flags is the job of the hypervisor. If an operation is illegal due to a transfer flag being set, it needs to be rejected at hypervisor level. True. Reviewed-by: Boris Ostrovsky
Re: [PATCH 12/15] swiotlb: provide swiotlb_init variants that remap the buffer
On 3/15/22 2:36 AM, Christoph Hellwig wrote: @@ -271,12 +273,23 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags) * allow to pick a location everywhere for hypervisors with guest * memory encryption. */ +retry: + bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT); if (flags & SWIOTLB_ANY) tlb = memblock_alloc(bytes, PAGE_SIZE); else tlb = memblock_alloc_low(bytes, PAGE_SIZE); if (!tlb) goto fail; + if (remap && remap(tlb, nslabs) < 0) { + memblock_free(tlb, PAGE_ALIGN(bytes)); + + if (nslabs <= IO_TLB_MIN_SLABS) + panic("%s: Failed to remap %zu bytes\n", + __func__, bytes); + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); I spoke with Konrad (who wrote the original patch --- f4b2f07b2ed9b469ead87e06fc2fc3d12663a725) and apparently the reason for 2MB was to optimize for Xen's slab allocator, it had nothing to do with IO_TLB_MIN_SLABS. Since this is now common code we should not expose Xen-specific optimizations here and smaller values will still work so IO_TLB_MIN_SLABS is fine. I think this should be mentioned in the commit message though, probably best in the next patch where you switch to this code. As far as the hunk above, I don't think we need the max() here: with IO_TLB_MIN_SLABS being 512 we may get stuck in an infinite loop. Something like nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE); if (nslabs <= IO_TLB_MIN_SLABS) panic() should be sufficient. + goto retry; + } if (swiotlb_init_with_tbl(tlb, default_nslabs, flags)) goto fail_free_mem; return; @@ -287,12 +300,18 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags) pr_warn("Cannot allocate buffer"); } +void __init swiotlb_init(bool addressing_limit, unsigned int flags) +{ + return swiotlb_init_remap(addressing_limit, flags, NULL); +} + /* * Systems with larger DMA zones (those that don't support ISA) can * initialize the swiotlb later using the slab allocator if needed. * This should be just like above, but with some error catching. */ -int swiotlb_init_late(size_t size, gfp_t gfp_mask) +int swiotlb_init_late(size_t size, gfp_t gfp_mask, + int (*remap)(void *tlb, unsigned long nslabs)) { unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE); unsigned long bytes; @@ -303,6 +322,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask) if (swiotlb_force_disable) return 0; +retry: order = get_order(nslabs << IO_TLB_SHIFT); nslabs = SLABS_PER_PAGE << order; bytes = nslabs << IO_TLB_SHIFT; @@ -317,6 +337,16 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask) if (!vstart) return -ENOMEM; + if (remap) + rc = remap(vstart, nslabs); + if (rc) { + free_pages((unsigned long)vstart, order); + + if (IO_TLB_MIN_SLABS <= 1024) + return rc; + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); Same here. (The 'if' check above is wrong anyway). Patches 13 and 14 look good. -boris + goto retry; + } if (order != get_order(bytes)) { pr_warn("only able to allocate %ld MB\n",
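The termination problem is easy to demonstrate; below is a self-contained userspace sketch of the two loop shapes, with IO_TLB_SEGSIZE=128 and IO_TLB_MIN_SLABS=512 mirroring the swiotlb constants and a remap() that is assumed to always fail.

#include <stdio.h>

#define IO_TLB_SEGSIZE		128UL
#define IO_TLB_MIN_SLABS	512UL
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define MAX(a, b)		((a) > (b) ? (a) : (b))

int main(void)
{
	unsigned long nslabs;
	int i;

	/* Shape in the posted patch: nslabs can never drop below 1024,
	 * so with IO_TLB_MIN_SLABS == 512 the panic() condition is
	 * unreachable and a persistently failing remap() retries
	 * forever (capped at 10 iterations here just to show it). */
	nslabs = 32768;
	for (i = 0; i < 10; i++) {
		if (nslabs <= IO_TLB_MIN_SLABS)
			break;			/* would panic() */
		nslabs = MAX(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE));
	}
	printf("with max(): %d iterations, nslabs stuck at %lu\n", i, nslabs);

	/* Suggested shape: halve first, then compare against the
	 * minimum, so a permanently failing remap() terminates. */
	nslabs = 32768;
	for (i = 0; ; i++) {
		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
		if (nslabs <= IO_TLB_MIN_SLABS)
			break;			/* would panic() */
	}
	printf("without max(): stops after %d halvings at %lu slabs\n",
	       i + 1, nslabs);

	return 0;
}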
Re: [PATCH 1/2] xen/grant-table: remove gnttab_*transfer*() functions
On 3/11/22 5:34 AM, Juergen Gross wrote: All grant table operations related to the "transfer" functionality are unused currently. There have been users in the old days of the "Xen-o-Linux" kernel, but those didn't make it upstream. So remove the "transfer" related functions. Do we need to assert somewhere that transfer flags are not set? -boris
Re: [PATCH 14/15] swiotlb: remove swiotlb_init_with_tbl and swiotlb_init_late_with_tbl
On 3/14/22 3:31 AM, Christoph Hellwig wrote: @@ -314,6 +293,7 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags) int swiotlb_init_late(size_t size, gfp_t gfp_mask, int (*remap)(void *tlb, unsigned long nslabs)) { + struct io_tlb_mem *mem = &io_tlb_default_mem; unsigned long nslabs = ALIGN(size >> IO_TLB_SHIFT, IO_TLB_SEGSIZE); unsigned long bytes; unsigned char *vstart = NULL; @@ -355,33 +335,28 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask, (PAGE_SIZE << order) >> 20); nslabs = SLABS_PER_PAGE << order; } - rc = swiotlb_late_init_with_tbl(vstart, nslabs); - if (rc) - free_pages((unsigned long)vstart, order); - - return rc; -} - -int -swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs) -{ - struct io_tlb_mem *mem = &io_tlb_default_mem; - unsigned long bytes = nslabs << IO_TLB_SHIFT; - if (swiotlb_force_disable) - return 0; + if (remap) + rc = remap(vstart, nslabs); + if (rc) { + free_pages((unsigned long)vstart, order); - /* protect against double initialization */ - if (WARN_ON_ONCE(mem->nslabs)) - return -ENOMEM; + /* Min is 2MB */ + if (nslabs <= 1024) + return rc; + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); + goto retry; + } We now end up with two attempts to remap. I think this second one is what we want since it solves the problem I pointed in previous patch. -boris
Re: [PATCH 13/15] swiotlb: merge swiotlb-xen initialization into swiotlb
On 3/14/22 3:31 AM, Christoph Hellwig wrote: - static void __init pci_xen_swiotlb_init(void) { if (!xen_initial_domain() && !x86_swiotlb_enable) return; x86_swiotlb_enable = true; - xen_swiotlb = true; - xen_swiotlb_init_early(); + swiotlb_init_remap(true, x86_swiotlb_flags, xen_swiotlb_fixup); I think we need to have SWIOTLB_ANY set in x86_swiotlb_flags here. dma_ops = &xen_swiotlb_dma_ops; if (IS_ENABLED(CONFIG_PCI)) pci_request_acs(); @@ -88,14 +85,16 @@ static void __init pci_xen_swiotlb_init(void) int pci_xen_swiotlb_init_late(void) { - int rc; - - if (xen_swiotlb) + if (dma_ops == &xen_swiotlb_dma_ops) return 0; - rc = xen_swiotlb_init(); - if (rc) - return rc; + /* we can work with the default swiotlb */ + if (!io_tlb_default_mem.nslabs) { + int rc = swiotlb_init_late(swiotlb_size_or_default(), + GFP_KERNEL, xen_swiotlb_fixup); This may be comment for previous patch but looking at swiotlb_init_late(): retry: order = get_order(nslabs << IO_TLB_SHIFT); nslabs = SLABS_PER_PAGE << order; bytes = nslabs << IO_TLB_SHIFT; while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) { vstart = (void *)__get_free_pages(gfp_mask | __GFP_NOWARN, order); if (vstart) break; order--; } if (!vstart) return -ENOMEM; if (remap) rc = remap(vstart, nslabs); if (rc) { free_pages((unsigned long)vstart, order); /* Min is 2MB */ if (nslabs <= 1024) return rc; nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); goto retry; } if (order != get_order(bytes)) { pr_warn("only able to allocate %ld MB\n", (PAGE_SIZE << order) >> 20); nslabs = SLABS_PER_PAGE << order; <=== } rc = swiotlb_late_init_with_tbl(vstart, nslabs); Notice that we don't do remap() after final update to nslabs. We should. -boris
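To make the ordering concrete, here is a rough sketch (against the retry loop from this series, not the final upstream code) with the remap moved after the last point where nslabs can shrink; rc is assumed to have been initialised to 0 earlier in the function.

	/* Sketch only: remap after every adjustment of nslabs, including
	 * the smaller-allocation fallback, so the remapped size always
	 * matches the size handed to swiotlb. */
	if (order != get_order(bytes)) {
		pr_warn("only able to allocate %ld MB\n",
			(PAGE_SIZE << order) >> 20);
		nslabs = SLABS_PER_PAGE << order;
	}

	if (remap)
		rc = remap(vstart, nslabs);
	if (rc) {
		free_pages((unsigned long)vstart, order);
		nslabs = ALIGN(nslabs >> 1, IO_TLB_SEGSIZE);
		if (nslabs <= IO_TLB_MIN_SLABS)
			return rc;
		goto retry;
	}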
Re: [PATCH 12/15] swiotlb: provide swiotlb_init variants that remap the buffer
On 3/14/22 3:31 AM, Christoph Hellwig wrote: -void __init swiotlb_init(bool addressing_limit, unsigned int flags) +void __init swiotlb_init_remap(bool addressing_limit, unsigned int flags, + int (*remap)(void *tlb, unsigned long nslabs)) { - size_t bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT); + unsigned long nslabs = default_nslabs; + size_t bytes; void *tlb; if (!addressing_limit && !swiotlb_force_bounce) @@ -271,12 +273,24 @@ void __init swiotlb_init(bool addressing_limit, unsigned int flags) * allow to pick a location everywhere for hypervisors with guest * memory encryption. */ +retry: + bytes = PAGE_ALIGN(default_nslabs << IO_TLB_SHIFT); if (flags & SWIOTLB_ANY) tlb = memblock_alloc(bytes, PAGE_SIZE); else tlb = memblock_alloc_low(bytes, PAGE_SIZE); if (!tlb) goto fail; + if (remap && remap(tlb, nslabs) < 0) { + memblock_free(tlb, PAGE_ALIGN(bytes)); + + /* Min is 2MB */ + if (nslabs <= 1024) This is IO_TLB_MIN_SLABS, isn't it? (Xen code didn't say so but that's what it meant to say I believe) + panic("%s: Failed to remap %zu bytes\n", + __func__, bytes); + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); + goto retry; + } @@ -303,6 +323,7 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask) if (swiotlb_force_disable) return 0; +retry: order = get_order(nslabs << IO_TLB_SHIFT); nslabs = SLABS_PER_PAGE << order; bytes = nslabs << IO_TLB_SHIFT; @@ -317,6 +338,17 @@ int swiotlb_init_late(size_t size, gfp_t gfp_mask) if (!vstart) return -ENOMEM; + if (remap) + rc = remap(vstart, nslabs); + if (rc) { + free_pages((unsigned long)vstart, order); + + /* Min is 2MB */ + if (nslabs <= 1024) Same here. -boris + return rc; + nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE)); + goto retry; + } if (order != get_order(bytes)) { pr_warn("only able to allocate %ld MB\n",
Re: [PATCH] x86/xen: Fix kerneldoc warning
On 3/7/22 1:25 AM, Jiapeng Chong wrote: Fix the following W=1 kernel warning: arch/x86/xen/setup.c:725: warning: expecting prototype for machine_specific_memory_setup(). Prototype was for xen_memory_setup() instead. Reported-by: Abaci Robot Signed-off-by: Jiapeng Chong Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18
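The warning just says the name in the kerneldoc header no longer matches the function it sits on; the fix is along these lines (comment text paraphrased):

-/**
- * machine_specific_memory_setup - Hook for machine specific memory setup.
- **/
+/**
+ * xen_memory_setup - Hook for machine specific memory setup.
+ **/
 char * __init xen_memory_setup(void)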
Re: [PATCH] drivers/xen: use helper macro __ATTR_RW
On 3/5/22 8:38 AM, zhanglianjie wrote: Use the helper macro __ATTR_RW to define HYPERVISOR_ATTR_RW, making the code clearer. Minor readability improvement. Signed-off-by: zhanglianjie Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18 (with whitespace change noted in commit message)
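For reference, the shape of this kind of cleanup (hypothetical before/after, not the exact hunk): __ATTR_RW(_name) expands to __ATTR(_name, 0644, _name##_show, _name##_store), so spelling those arguments out by hand is redundant.

/* Before: mode, show and store spelled out by hand */
#define HYPERVISOR_ATTR_RW(_name) \
	static struct hyp_sysfs_attr _name##_attr = \
		__ATTR(_name, 0644, _name##_show, _name##_store)

/* After: __ATTR_RW() derives all three from the attribute name */
#define HYPERVISOR_ATTR_RW(_name) \
	static struct hyp_sysfs_attr _name##_attr = __ATTR_RW(_name)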
Re: [PATCH v4 2/2] xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32
On 3/2/22 11:40 AM, Dongli Zhang wrote: sched_clock() can be used very early since commit 857baa87b642 ("sched/clock: Enable sched clock early"). In addition, with commit 38669ba205d1 ("x86/xen/time: Output xen sched_clock time from 0"), the kdump kernel in a Xen HVM guest may panic at a very early stage when accessing &__this_cpu_read(xen_vcpu)->time, as in the following call chain: setup_arch() -> init_hypervisor_platform() -> x86_init.hyper.init_platform = xen_hvm_guest_init() -> xen_hvm_init_time_ops() -> xen_clocksource_read() -> src = &__this_cpu_read(xen_vcpu)->time; This is because Xen HVM provides at most MAX_VIRT_CPUS=32 'vcpu_info' structures embedded inside 'shared_info' during the early boot stage, until xen_vcpu_setup() is used to allocate/relocate the 'vcpu_info' for the boot cpu at an arbitrary address. When the Xen HVM guest panics on a vcpu >= 32 and the kdump kernel is entered on that vcpu, xen_vcpu_info_reset(0) leaves per_cpu(xen_vcpu, cpu) set to NULL for vcpus >= 32, so xen_clocksource_read() on such a vcpu panics. This patch calls xen_hvm_init_time_ops() again later, in xen_hvm_smp_prepare_boot_cpu(), after the 'vcpu_info' for the boot vcpu has been registered, covering the case where the boot vcpu is >= 32. The issue can be reproduced on purpose via the below command on the guest side when kdump/kexec is enabled: "taskset -c 33 echo c > /proc/sysrq-trigger" The bugfix for PVM is not implemented due to the lack of a testing environment. Cc: Joe Jin Signed-off-by: Dongli Zhang Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18 (with return path adjusted)
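A condensed sketch of the deferral logic (abbreviated from the patch, not the complete function): the early call bails out while the boot vcpu's vcpu_info is still unreachable, and the later call from xen_hvm_smp_prepare_boot_cpu() finishes the switch once it has been registered.

void __init xen_hvm_init_time_ops(void)
{
	static bool hvm_time_initialized;

	if (hvm_time_initialized)
		return;

	/* Only MAX_VIRT_CPUS (32) vcpu_info structures are embedded in
	 * shared_info.  If the kdump kernel boots on a vcpu >= 32, the
	 * pointer is still NULL here and xen_clocksource_read() would
	 * dereference it, so defer until xen_vcpu_setup() has run. */
	if (__this_cpu_read(xen_vcpu) == NULL)
		return;

	/* switch clocksource/sched_clock over to the Xen PV clock
	 * (details elided in this sketch) */
	xen_init_time_common();

	hvm_time_initialized = true;
}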
Re: [PATCH] xen: use time_is_before_eq_jiffies() instead of open coding it
On 2/27/22 10:15 PM, Qing Wang wrote: From: Wang Qing Use the helper function time_is_{before,after}_jiffies() to improve code readability. Signed-off-by: Wang Qing Reviewed-by: Boris Ostrovsky Applied to for-linus-5.18
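As a quick illustration (the struct field and handler names here are made up): time_is_before_eq_jiffies(a) is simply time_after_eq(jiffies, a), so the converted check reads in the same direction as the English.

/* Open-coded form: "has the deadline already passed?" */
if (time_after_eq(jiffies, ctx->deadline))
	handle_timeout(ctx);

/* Helper form, same meaning, with the direction spelled out in the name */
if (time_is_before_eq_jiffies(ctx->deadline))
	handle_timeout(ctx);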
Re: [PATCH 11/12] swiotlb: merge swiotlb-xen initialization into swiotlb
On 3/9/22 1:18 AM, Christoph Hellwig wrote: On Tue, Mar 08, 2022 at 04:38:21PM -0500, Boris Ostrovsky wrote: On 3/1/22 5:53 AM, Christoph Hellwig wrote: Allow to pass a remap argument to the swiotlb initialization functions to handle the Xen/x86 remap case. ARM/ARM64 never did any remapping from xen_swiotlb_fixup, so we don't even need that quirk. Any chance this patch could be split? Lots of things are happening here and it's somewhat hard to review. (Patch 7 too BTW but I think I managed to get through it) What would be your preferred split? swiotlb_init() rework to be done separately? diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index e0def4b1c3181..2f2c468acb955 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -71,15 +71,12 @@ static inline void __init pci_swiotlb_detect(void) #endif /* CONFIG_SWIOTLB */ #ifdef CONFIG_SWIOTLB_XEN -static bool xen_swiotlb; - static void __init pci_xen_swiotlb_init(void) { if (!xen_initial_domain() && !x86_swiotlb_enable) return; Now that there is a single call site for this routine I think this check can be dropped. We are only called here for xen_initial_domain()==true. The callsite just checks xen_pv_domain() and itself is called unconditionally during initialization. Oh, right, nevermind. *pv* domain. -boris
Re: [PATCH 11/12] swiotlb: merge swiotlb-xen initialization into swiotlb
On 3/1/22 5:53 AM, Christoph Hellwig wrote: Allow to pass a remap argument to the swiotlb initialization functions to handle the Xen/x86 remap case. ARM/ARM64 never did any remapping from xen_swiotlb_fixup, so we don't even need that quirk. Any chance this patch could be split? Lots of things are happening here and it's somewhat hard to review. (Patch 7 too BTW but I think I managed to get through it) diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c index e0def4b1c3181..2f2c468acb955 100644 --- a/arch/x86/kernel/pci-dma.c +++ b/arch/x86/kernel/pci-dma.c @@ -71,15 +71,12 @@ static inline void __init pci_swiotlb_detect(void) #endif /* CONFIG_SWIOTLB */ #ifdef CONFIG_SWIOTLB_XEN -static bool xen_swiotlb; - static void __init pci_xen_swiotlb_init(void) { if (!xen_initial_domain() && !x86_swiotlb_enable) return; Now that there is a single call site for this routine I think this check can be dropped. We are only called here for xen_initial_domain()==true. -boris
Re: [PATCH 11/12] swiotlb: merge swiotlb-xen initialization into swiotlb
On 3/4/22 12:43 PM, Christoph Hellwig wrote: On Fri, Mar 04, 2022 at 12:36:17PM -0500, Boris Ostrovsky wrote: I bisected it to "x86: remove the IOMMU table infrastructure" but haven't actually looked at the code yet. That looks like the swiotlb buffer did not get initialized at all, but I can't really explain why. Can you stick in a printk and see if xen_swiotlb_init_early gets called at all? Actually, that's the only thing I did do so far and yes, it does get called. So, specifically for "x86: remove the IOMMU table infrastructure" I think we need the one-liner below so that swiotlb_exit doesn't get called for the Xen case. But that should have been fixed up by the next patch already. This indeed allows dom0 to boot. Not sure I see where in the next patch this would have been fixed? (BTW, just noticed in iommu_setup() you set this variable to 1. Should be 'true') -boris