Re: [PATCH 1/7] KVM: Add API documentation for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT

2024-10-10 Thread Vitaly Kuznetsov
Nikolas Wipper  writes:

> Add API documentation for the new KVM_HYPERV_SET_TLB_FLUSH_INHIBIT ioctl.
>
> Signed-off-by: Nikolas Wipper 
> ---
>  Documentation/virt/kvm/api.rst | 41 ++
>  1 file changed, 41 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index a4b7dc4a9dda..9c11a8af336b 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6443,6 +6443,47 @@ the capability to be present.
>  `flags` must currently be zero.
>  
>  
> +4.144 KVM_HYPERV_SET_TLB_FLUSH_INHIBIT
> +--------------------------------------
> +
> +:Capability: KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT
> +:Architectures: x86
> +:Type: vcpu ioctl
> +:Parameters: struct kvm_hyperv_tlb_flush_inhibit
> +:Returns: 0 on success, this ioctl can't fail
> +
> +KVM_HYPERV_SET_TLB_FLUSH_INHIBIT allows userspace to prevent Hyper-V
> hyper-calls

Very minor nitpick: I suggest standardizing on the "hypercall" spelling
without the dash because:

$ grep -c hypercall Documentation/virt/kvm/api.rst
56
$ grep -c hyper-call Documentation/virt/kvm/api.rst
3

(I see all three 'hypercall', 'hyper-call', 'hyper call' usages in the
wild and I honestly don't think it matters but it would be nice to
adhere to one spelling across the same file / KVM docs).

> +that remotely flush a vCPU's TLB, i.e. HvFlushVirtualAddressSpace(Ex)/
> +HvFlushVirtualAddressList(Ex). When the flag is set, a vCPU attempting to 
> flush
> +an inhibited vCPU will be suspended and will only resume once the flag is
> +cleared again using this ioctl. During suspension, the vCPU will not finish 
> the
> +hyper-call, but may enter the guest to retry it. Because it is caused by a
> +hyper-call, the suspension naturally happens on a guest instruction boundary.
> +This behaviour and the suspend state itself are specified in Microsoft's
> +"Hypervisor Top Level Functional Specification" (TLFS).
> +
> +::
> +
> +  /* for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT */
> +  struct kvm_hyperv_tlb_flush_inhibit {
> +  /* in */
> +  __u16 flags;
> +  #define KVM_HYPERV_UNINHIBIT_TLB_FLUSH 0
> +  #define KVM_HYPERV_INHIBIT_TLB_FLUSH 1
> +  __u8  inhibit;
> +  __u8 padding[5];
> +  };
> +
> +No flags are specified so far, the corresponding field must be set to zero,
> +otherwise the ioctl will fail with -EINVAL.
> +
> +The suspension is transparent to userspace. It won't cause KVM_RUN to return 
> or
> +the MP state to be changed. The suspension cannot be manually induced or 
> exited
> +apart from changing the TLB flush inhibit flag of a targeted processor.
> +
> +There is no way for userspace to query the state of the flush inhibit flag.
> +Userspace must keep track of the required state itself.
> +
>  5. The kvm_run structure
>  

-- 
Vitaly




Re: [PATCH 2/7] KVM: x86: Implement Hyper-V's vCPU suspended state

2024-10-10 Thread Vitaly Kuznetsov
Nikolas Wipper  writes:

> Introduce a suspension state for Hyper-V enlightened vCPUs. Microsoft's
> "Hypervisor Top Level Functional Specification" (TLFS) introduces this
> state as a "vCPU that is stopped on a instruction guest boundary, either
> explicitly or implicitly due to an intercept". The only documented
> explicit suspension is caused in response to a TLB Flush hypercall, which
> will use the state switching API in subsequent patches.
>
> Each Hyper-V vCPU now has a 'suspended' boolean field, which is checked
> before entering the guest. When set, it forces the vCPU to block. Once in
> that state, the vCPU ignores any events. The vCPU is unsuspended by
> clearing 'suspended' and issuing a request to force the vCPU out of sleep.
>
> Suspensions are issued as a mechanism to halt execution until state change
> is observed on a remote vCPU. Hyper-V vCPUs track this with 'waiting_on',
> which holds the 'vcpu_id' of the remote vCPU that forced the vCPU to enter
> the suspended state. It's the remote vCPU's responsibility to wake up the
> suspended vCPUs when ready. 'waiting_on' ensures the remote vCPU can
> selectively unsuspend vCPUs that blocked on its behalf while leaving other
> suspended vCPUs undisturbed. One vCPU can only be suspended due to a
> single remote vCPU, but different vCPUs can be suspended on behalf of
> different or the same remote vCPU(s). The guest is responsible for
> avoiding circular dependencies between suspended vCPUs.
>
> Callers of the suspend API are responsible for ensuring that suspend and
> unsuspend aren't called in parallel while targeting the same pair of
> vCPUs. Otherwise kvm_hv_vcpu_{un}suspend_tlb_flush() ensure 'waiting_on'
> and 'suspended' are updated and accessed in the correct order. This, for
> example, avoids races where the unsuspended vCPU re-suspends before
> kvm_hv_vcpu_unsuspend_tlb_flush() is done updating 'waiting_on'.
>
> Signed-off-by: Nikolas Wipper 
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/hyperv.c   | 30 ++
>  arch/x86/kvm/hyperv.h   | 17 +
>  arch/x86/kvm/x86.c  |  4 +++-
>  4 files changed, 53 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 46e0a466d7fb..7571ac578884 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -695,6 +695,9 @@ struct kvm_vcpu_hv {
>   u64 vm_id;
>   u32 vp_id;
>   } nested;
> +
> + bool suspended;
> + int waiting_on;

I don't quite understand why we need 'suspended' at all. Isn't it always
suspended when 'waiting_on != -1'? I can see we always update these two
as a pair.

Also, I would suggest we use a more descriptive
name. 'waiting_on_vcpu_id', for example.
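
A rough sketch of what I have in mind (untested, 'waiting_on_vcpu_id' being
just the suggested name, with -1 meaning "not suspended"):

  struct kvm_vcpu_hv {
          ...
          /* -1 == not suspended, otherwise the vcpu_id we are waiting on */
          int waiting_on_vcpu_id;
  };

  static inline bool kvm_hv_vcpu_suspended(struct kvm_vcpu *vcpu)
  {
          struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);

          return hv_vcpu && READ_ONCE(hv_vcpu->waiting_on_vcpu_id) != -1;
  }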

>  };
>  
>  struct kvm_hypervisor_cpuid {
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 4f0a94346d00..6e7941ed25ae 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -971,6 +971,7 @@ int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
>  
>   vcpu->arch.hyperv = hv_vcpu;
>   hv_vcpu->vcpu = vcpu;
> + hv_vcpu->waiting_on = -1;
>  
>   synic_init(&hv_vcpu->synic);
>  
> @@ -2915,3 +2916,32 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct 
> kvm_cpuid2 *cpuid,
>  
>   return 0;
>  }
> +
> +void kvm_hv_vcpu_suspend_tlb_flush(struct kvm_vcpu *vcpu, int vcpu_id)

Can we make the parameter's name 'waiting_on_vcpu_id' as well? As-is,
I'm getting confused about which of these two CPUs is actually getting
suspended.

Also, why do we need '_tlb_flush' in the name? The mechanism seems to be
fairly generic, it's just that we use it for TLB flushes.

> +{
> + /* waiting_on's store should happen before suspended's */
> + WRITE_ONCE(vcpu->arch.hyperv->waiting_on, vcpu_id);
> + WRITE_ONCE(vcpu->arch.hyperv->suspended, true);
> +}
> +
> +void kvm_hv_vcpu_unsuspend_tlb_flush(struct kvm_vcpu *vcpu)

And here someone may expect this to mean 'unsuspend vcpu' but in reality
it means 'unsuspend all vCPUs which are waiting on vcpu'. I guess we
need a rename. How about

void kvm_hv_unsuspend_vcpus(struct kvm_vcpu *waiting_on_vcpu)

?

> +{
> + DECLARE_BITMAP(vcpu_mask, KVM_MAX_VCPUS);
> + struct kvm_vcpu_hv *vcpu_hv;
> + struct kvm_vcpu *v;
> + unsigned long i;
> +
> + kvm_for_each_vcpu(i, v, vcpu->kvm) {
> + vcpu_hv = to_hv_vcpu(v);
> +
> + if (kvm_hv_vcpu_suspended(v) &&
> + READ_ONCE(vcpu_hv->waiting_on) == vcpu->vcpu_id) {
> + /* waiting_on's store should happen before suspended's 
> */
> + WRITE_ONCE(v->arch.hyperv->waiting_on, -1);
> + WRITE_ONCE(v->arch.hyperv->suspended, false);
> + __set_bit(i, vcpu_mask);
> + }
> + }
> +
> + kvm_make_vcpus_request_mask(vcpu->kvm, KVM_REQ_EVENT, vcpu_mask);
> +}
> diff --git a/arc

Re: [PATCH 4/7] KVM: Introduce KVM_HYPERV_SET_TLB_FLUSH_INHIBIT

2024-10-10 Thread Vitaly Kuznetsov
Nikolas Wipper  writes:

> Introduce a new ioctl to control whether remote flushing via Hyper-V
> hyper-calls should be allowed on a vCPU. When the tlb_flush_inhibit bit is
> set, vCPUs attempting to flush the TLB of the inhibited vCPU will be
> suspended until the bit is cleared.
>
> Signed-off-by: Nikolas Wipper 
> ---
>  include/uapi/linux/kvm.h | 15 +++

I guess we can merge this patch with documentation (PATCH1) or even
implementation (PATCH5) as I don't see why separation helps here.

>  1 file changed, 15 insertions(+)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 637efc055145..3bc43fdcab88 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -933,6 +933,7 @@ struct kvm_enable_cap {
>  #define KVM_CAP_PRE_FAULT_MEMORY 236
>  #define KVM_CAP_X86_APIC_BUS_CYCLES_NS 237
>  #define KVM_CAP_X86_GUEST_MODE 238
> +#define KVM_CAP_HYPERV_TLB_FLUSH_INHIBIT 239
>  
>  struct kvm_irq_routing_irqchip {
>   __u32 irqchip;
> @@ -1573,4 +1574,18 @@ struct kvm_pre_fault_memory {
>   __u64 padding[5];
>  };
>  
> +/* Available with KVM_CAP_HYPERV_TLBFLUSH */
> +#define KVM_HYPERV_SET_TLB_FLUSH_INHIBIT \
> + _IOW(KVMIO,  0xd6, struct kvm_hyperv_tlb_flush_inhibit)
> +
> +/* for KVM_HYPERV_SET_TLB_FLUSH_INHIBIT */
> +struct kvm_hyperv_tlb_flush_inhibit {
> + /* in */
> + __u16 flags;
> +#define KVM_HYPERV_UNINHIBIT_TLB_FLUSH 0
> +#define KVM_HYPERV_INHIBIT_TLB_FLUSH 1
> + __u8  inhibit;
> + __u8  reserved[5];
> +};
> +
>  #endif /* __LINUX_KVM_H */

-- 
Vitaly




Re: [PATCH 5/7] KVM: x86: Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT

2024-10-10 Thread Vitaly Kuznetsov
Nikolas Wipper  writes:

> Implement KVM_HYPERV_SET_TLB_FLUSH_INHIBIT for x86. Apart from setting/
> clearing the internal TLB flush inhibit flag this ioctl also wakes up
> vCPUs suspended and waiting on this vCPU.
>
> When the flag is set, a vCPU trying to flush the inhibited vCPU's TLB with
> a Hyper-V hyper-call suspends until the bit is cleared again. The hyper-
> call finishes internally, but the instruction isn't skipped, and execution
> doesn't return to the guest. This behaviour and the suspended state itself
> are specified in Microsoft's "Hypervisor Top Level Functional
> Specification" (TLFS).
>
> To maintain thread safety during suspension and unsuspension, the IOCTL
> uses the KVM SRCU. After flipping the flush inhibit flag, it synchronizes
> SRCU to ensure that all running TLB flush attempts that saw the old state,
> finish before the IOCTL exits. If the flag was changed to inhibit new TLB
> flushes, this guarantees that no TLB flushes happen once the ioctl exits.
> If the flag was changed to no longer inhibit TLB flushes, the
> synchronization ensures that all suspended CPUs finished entering the
> suspended state properly, so they can be unsuspended again.
>
> Once TLB flushes are no longer inhibited, all suspended vCPUs are
> unsuspended.
>
> Signed-off-by: Nikolas Wipper 
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/hyperv.c   | 22 
>  arch/x86/kvm/x86.c  | 37 +
>  3 files changed, 61 insertions(+)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7571ac578884..ab3a9beb61a2 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -698,6 +698,8 @@ struct kvm_vcpu_hv {
>  
>   bool suspended;
>   int waiting_on;
> +
> + int tlb_flush_inhibit;

This is basically boolean, right? And we only make it 'int' to be able
to store 'u8' from the ioctl? This doesn't look very clean. Do you
envision anything but '1'/'0' in 'inhibit'? If not, maybe we can just
make it a flag (and e.g. extend 'flags' to be u32/u64)? This way we can
convert 'tlb_flush_inhibit' to a normal bool.
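
I.e. something like (just a sketch of the suggestion, not a final layout):

  struct kvm_hyperv_tlb_flush_inhibit {
          /* in */
          __u32 flags;
  #define KVM_HYPERV_INHIBIT_TLB_FLUSH   (1 << 0)
          __u32 padding;
  };

with a plain 'bool tlb_flush_inhibit;' in 'struct kvm_vcpu_hv'.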

>  };
>  
>  struct kvm_hypervisor_cpuid {
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index e68fbc0c7fc1..40ea8340838f 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -2137,6 +2137,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, 
> struct kvm_hv_hcall *hc)
>   bitmap_zero(vcpu_mask, KVM_MAX_VCPUS);
>  
>   kvm_for_each_vcpu(i, v, kvm) {
> + if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
> + goto ret_suspend;
> +
>   __set_bit(i, vcpu_mask);
>   }
>   } else if (!is_guest_mode(vcpu)) {
> @@ -2148,6 +2151,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, 
> struct kvm_hv_hcall *hc)
>   __clear_bit(i, vcpu_mask);
>   continue;
>   }
> +
> + if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
> + goto ret_suspend;
>   }
>   } else {
>   struct kvm_vcpu_hv *hv_v;
> @@ -2175,6 +2181,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, 
> struct kvm_hv_hcall *hc)
>   sparse_banks))
>   continue;
>  
> + if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit))
> + goto ret_suspend;
> +

These READ_ONCEs make me think I misunderstand something here, please
bear with me :-).

Like we're trying to protect against 'tlb_flush_inhibit' being read
somewhere in the beginning of the function and want to generate real
memory accesses. But what happens if tlb_flush_inhibit changes right
_after_ we checked it here and _before_ we actually do
kvm_make_vcpus_request_mask()? Wouldn't it be a problem? In case it
would, I think we need to reverse the order: do
kvm_make_vcpus_request_mask() anyway and after it go through vcpu_mask
checking whether any of the affected vCPUs has 'tlb_flush_inhibit' and
if it does, suspend the caller.
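
Something along these lines, maybe (completely untested, just to illustrate
the ordering; the exact request to use and the error handling are a guess):

  kvm_make_vcpus_request_mask(kvm, KVM_REQ_HV_TLB_FLUSH, vcpu_mask);

  for_each_set_bit(i, vcpu_mask, KVM_MAX_VCPUS) {
          v = kvm_get_vcpu(kvm, i);
          if (READ_ONCE(v->arch.hyperv->tlb_flush_inhibit)) {
                  kvm_hv_vcpu_suspend_tlb_flush(vcpu, v->vcpu_id);
                  return -EBUSY;
          }
  }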

>   __set_bit(i, vcpu_mask);
>   }
>   }
> @@ -2193,6 +2202,9 @@ static u64 kvm_hv_flush_tlb(struct kvm_vcpu *vcpu, 
> struct kvm_hv_hcall *hc)
>   /* We always do full TLB flush, set 'Reps completed' = 'Rep Count' */
>   return (u64)HV_STATUS_SUCCESS |
>   ((u64)hc->rep_cnt << HV_HYPERCALL_REP_COMP_OFFSET);
> +ret_suspend:
> + kvm_hv_vcpu_suspend_tlb_flush(vcpu, v->vcpu_id);
> + return -EBUSY;
>  }
>  
>  static void kvm_hv_send_ipi_to_many(struct kvm *kvm, u32 vector,
> @@ -2380,6 +2392,13 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu 
> *vcpu, u64 result)
>   u32 tlb_lock_count = 0;
>   int ret;
>  
> + /*
> +  * Reached when the

Re: [PATCH 01/18] KVM: x86: hyper-v: Introduce XMM output support

2024-07-29 Thread Vitaly Kuznetsov
Nicolas Saenz Julienne  writes:

> Hi Vitaly,
> Thanks for having a look at this.
>
> On Mon Jul 8, 2024 at 2:59 PM UTC, Vitaly Kuznetsov wrote:
>> Nicolas Saenz Julienne  writes:
>>
>> > Prepare infrastructure to be able to return data through the XMM
>> > registers when Hyper-V hypercalls are issued in fast mode. The XMM
>> > registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
>> > restored on successful hypercall completion.
>> >
>> > Signed-off-by: Nicolas Saenz Julienne 
>> >
>> > ---
>> >
>> > There was some discussion in the RFC about whether growing 'struct
>> > kvm_hyperv_exit' is ABI breakage. IMO it isn't:
>> > - There is padding in 'struct kvm_run' that ensures that a bigger
>> >   'struct kvm_hyperv_exit' doesn't alter the offsets within that struct.
>> > - Adding a new field at the bottom of the 'hcall' field within the
>> >   'struct kvm_hyperv_exit' should be fine as well, as it doesn't alter
>> >   the offsets within that struct either.
>> > - Ultimately, previous updates to 'struct kvm_hyperv_exit's hint that
>> >   its size isn't part of the uABI. It already grew when syndbg was
>> >   introduced.
>>
>> Yes but SYNDBG exit comes with KVM_EXIT_HYPERV_SYNDBG. While I don't see
>> any immediate issues with the current approach, we may want to introduce
>> something like KVM_EXIT_HYPERV_HCALL_XMM: the userspace must be prepared
>> to handle this new information anyway and it is better to make
unprepared userspace fail with 'unknown exit' than to mishandle a
>> hypercall by ignoring XMM portion of the data.
>
> OK, I'll go that way. Just wanted to get a better understanding of why
> you felt it was necessary.
>

(sorry for delayed reply, I was on vacation)

I don't think it's an absolute must but it appears as a cleaner approach
to me. 

Imagine there's some userspace which handles KVM_EXIT_HYPERV_HCALL today
and we want to add XMM handling there. How would we know if xmm portion
of the data is actually filled by KVM or not? With your patch, we can of
course check for HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE in
KVM_GET_SUPPORTED_HV_CPUID but this is not really straightforward, is
it? Checking the size is not good either. E.g. think about downstream
versions of KVM which may or may not have certain backports. In case we
(theoretically) do several additions to 'struct kvm_hyperv_exit', it
will quickly become a nightmare.

On the contrary, KVM_EXIT_HYPERV_HCALL_XMM (or just
KVM_EXIT_HYPERV_HCALL2) approach looks cleaner: once userspace sees it,
it knows that 'xmm' portion of the data can be relied upon.
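
Roughly, from the userspace side (sketch only; KVM_EXIT_HYPERV_HCALL_XMM is
the suggested, not-yet-existing exit type and the handle_*() helpers are
made up):

  switch (run->hyperv.type) {
  case KVM_EXIT_HYPERV_HCALL:
          /* old exit: no XMM data, handle as before */
          handle_hcall(run);
          break;
  case KVM_EXIT_HYPERV_HCALL_XMM:
          /* new exit: the 'xmm' portion of the data is valid */
          handle_hcall_xmm(run);
          break;
  }

Old userspace which doesn't know about the new exit type simply fails with
'unknown exit' instead of silently ignoring the XMM output.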

-- 
Vitaly




Re: [PATCH 01/18] KVM: x86: hyper-v: Introduce XMM output support

2024-07-08 Thread Vitaly Kuznetsov
Nicolas Saenz Julienne  writes:

> Prepare infrastructure to be able to return data through the XMM
> registers when Hyper-V hypercalls are issued in fast mode. The XMM
> registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
> restored on successful hypercall completion.
>
> Signed-off-by: Nicolas Saenz Julienne 
>
> ---
>
> There was some discussion in the RFC about whether growing 'struct
> kvm_hyperv_exit' is ABI breakage. IMO it isn't:
> - There is padding in 'struct kvm_run' that ensures that a bigger
>   'struct kvm_hyperv_exit' doesn't alter the offsets within that struct.
> - Adding a new field at the bottom of the 'hcall' field within the
>   'struct kvm_hyperv_exit' should be fine as well, as it doesn't alter
>   the offsets within that struct either.
> - Ultimately, previous updates to 'struct kvm_hyperv_exit's hint that
>   its size isn't part of the uABI. It already grew when syndbg was
>   introduced.

Yes but SYNDBG exit comes with KVM_EXIT_HYPERV_SYNDBG. While I don't see
any immediate issues with the current approach, we may want to introduce
something like KVM_EXIT_HYPERV_HCALL_XMM: the userspace must be prepared
to handle this new information anyway and it is better to make
unprepared userspace fail with 'unknown exit' than to mishandle a
hypercall by ignoring XMM portion of the data.

>
>  Documentation/virt/kvm/api.rst | 19 ++
>  arch/x86/include/asm/hyperv-tlfs.h |  2 +-
>  arch/x86/kvm/hyperv.c  | 56 +-
>  include/uapi/linux/kvm.h   |  6 
>  4 files changed, 81 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index a71d91978d9ef..17893b330b76f 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -8893,3 +8893,22 @@ Ordering of KVM_GET_*/KVM_SET_* ioctls
>  ^^
>  
>  TBD
> +
> +10. Hyper-V CPUIDs
> +==
> +
> +This section only applies to x86.

We can probably use 

:Architectures: x86

which we already use.

> +
> +New Hyper-V feature support is no longer being tracked through KVM
> +capabilities.  Userspace can check if a particular version of KVM supports a
> +feature using KVM_GET_SUPPORTED_HV_CPUID.  This section documents how Hyper-V
> +CPUIDs map to KVM functionality.
> +
> +10.1 HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE
> +------------------------------------------
> +
> +:Location: CPUID.40000003H:EDX[bit 15]
> +
> +This CPUID indicates that KVM supports returning data to the guest in response
> +to a hypercall using the XMM registers. It also extends ``struct
> +kvm_hyperv_exit`` to allow passing the XMM data from userspace.

It's always good to document things, thanks! I'm, however, wondering
what we should document as part of the KVM API. In the file, we already
have:
- "4.118 KVM_GET_SUPPORTED_HV_CPUID"
- "struct kvm_hyperv_exit" description in "5. The kvm_run structure"

The latter should definitely get extended to cover XMM and I guess the
former can accommodate the 'no longer being tracked' comment. With that,
maybe there's no need for a new section? 

> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index 3787d26810c1c..6a18c9f77d5fe 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -49,7 +49,7 @@
>  /* Support for physical CPU dynamic partitioning events is available*/
>  #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE  BIT(3)
>  /*
> - * Support for passing hypercall input parameter block via XMM
> + * Support for passing hypercall input and output parameter block via XMM
>   * registers is available
>   */
>  #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE BIT(4)

This change of the comment is weird (or I may have forgotten something
important), could you please elaborate? Currently, we have:

/*
 * Support for passing hypercall input parameter block via XMM
 * registers is available
 */
#define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE BIT(4)
...
/*
 * Support for returning hypercall output block via XMM
 * registers is available
 */
#define HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE    BIT(15)

which seems to be correct. TLFS also defines

Bit 4: XmmRegistersForFastHypercallAvailable

in CPUID 0x40000009.EDX (Nested Hypervisor Feature Identification) which
probably covers both but we don't set this leaf in KVM currently ...

> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 8a47f8541eab7..42f44546fe79c 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1865,6 +1865,7 @@ struct kvm_hv_hcall {
>   u16 rep_idx;
>   bool fast;
>   bool rep;
> + bool xmm_dirty;
>   sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
>  
>   /*
> @@ -2396,9 +2397,49 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu 
> *vcpu, u64 result)
>   return ret;
>  }
>  
> +static void kvm_hv_write_xmm

Re: [PATCH 00/18] Introducing Core Building Blocks for Hyper-V VSM Emulation

2024-07-03 Thread Vitaly Kuznetsov
Nicolas Saenz Julienne  writes:

> Hi Sean,
>
> On Sun Jun 9, 2024 at 3:49 PM UTC, Nicolas Saenz Julienne wrote:
>> This series introduces core KVM functionality necessary to emulate Hyper-V's
>> Virtual Secure Mode in a Virtual Machine Monitor (VMM).
>
> Just wanted to make sure the series is in your radar.
>

Not Sean here but I was planning to take a look at least at Hyper-V
parts of it next week.

-- 
Vitaly




Re: [RFC 03/33] KVM: x86: hyper-v: Introduce XMM output support

2023-11-08 Thread Vitaly Kuznetsov
Alexander Graf  writes:

> On 08.11.23 12:17, Nicolas Saenz Julienne wrote:
>> Prepare infrastructure to be able to return data through the XMM
>> registers when Hyper-V hypercalls are issued in fast mode. The XMM
>> registers are exposed to user-space through KVM_EXIT_HYPERV_HCALL and
>> restored on successful hypercall completion.
>>
>> Signed-off-by: Nicolas Saenz Julienne 
>> ---
>>   arch/x86/include/asm/hyperv-tlfs.h |  2 +-
>>   arch/x86/kvm/hyperv.c  | 33 +-
>>   include/uapi/linux/kvm.h   |  6 ++
>>   3 files changed, 39 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
>> b/arch/x86/include/asm/hyperv-tlfs.h
>> index 2ff26f53cd62..af594aa65307 100644
>> --- a/arch/x86/include/asm/hyperv-tlfs.h
>> +++ b/arch/x86/include/asm/hyperv-tlfs.h
>> @@ -49,7 +49,7 @@
>>   /* Support for physical CPU dynamic partitioning events is available*/
>>   #define HV_X64_CPU_DYNAMIC_PARTITIONING_AVAILABLE  BIT(3)
>>   /*
>> - * Support for passing hypercall input parameter block via XMM
>> + * Support for passing hypercall input and output parameter block via XMM
>>* registers is available
>>*/
>>   #define HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE   BIT(4)
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index 238afd7335e4..e1bc861ab3b0 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -1815,6 +1815,7 @@ struct kvm_hv_hcall {
>>  u16 rep_idx;
>>  bool fast;
>>  bool rep;
>> +bool xmm_dirty;
>>  sse128_t xmm[HV_HYPERCALL_MAX_XMM_REGISTERS];
>>   
>>  /*
>> @@ -2346,9 +2347,33 @@ static int kvm_hv_hypercall_complete(struct kvm_vcpu 
>> *vcpu, u64 result)
>>  return ret;
>>   }
>>   
>> +static void kvm_hv_write_xmm(struct kvm_hyperv_xmm_reg *xmm)
>> +{
>> +int reg;
>> +
>> +kvm_fpu_get();
>> +for (reg = 0; reg < HV_HYPERCALL_MAX_XMM_REGISTERS; reg++) {
>> +const sse128_t data = sse128(xmm[reg].low, xmm[reg].high);
>> +_kvm_write_sse_reg(reg, &data);
>> +}
>> +kvm_fpu_put();
>> +}
>> +
>> +static bool kvm_hv_is_xmm_output_hcall(u16 code)
>> +{
>> +return false;
>> +}
>> +
>>   static int kvm_hv_hypercall_complete_userspace(struct kvm_vcpu *vcpu)
>>   {
>> -return kvm_hv_hypercall_complete(vcpu, 
>> vcpu->run->hyperv.u.hcall.result);
>> +bool fast = !!(vcpu->run->hyperv.u.hcall.input & HV_HYPERCALL_FAST_BIT);
>> +u16 code = vcpu->run->hyperv.u.hcall.input & 0xffff;
>> +u64 result = vcpu->run->hyperv.u.hcall.result;
>> +
>> +if (kvm_hv_is_xmm_output_hcall(code) && hv_result_success(result) && 
>> fast)
>> +kvm_hv_write_xmm(vcpu->run->hyperv.u.hcall.xmm);
>> +
>> +return kvm_hv_hypercall_complete(vcpu, result);
>>   }
>>   
>>   static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct 
>> kvm_hv_hcall *hc)
>> @@ -2623,6 +2648,9 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>>  break;
>>  }
>>   
>> +if ((ret & HV_HYPERCALL_RESULT_MASK) == HV_STATUS_SUCCESS && 
>> hc.xmm_dirty)
>> +kvm_hv_write_xmm((struct kvm_hyperv_xmm_reg*)hc.xmm);
>> +
>>   hypercall_complete:
>>  return kvm_hv_hypercall_complete(vcpu, ret);
>>   
>> @@ -2632,6 +2660,8 @@ int kvm_hv_hypercall(struct kvm_vcpu *vcpu)
>>  vcpu->run->hyperv.u.hcall.input = hc.param;
>>  vcpu->run->hyperv.u.hcall.params[0] = hc.ingpa;
>>  vcpu->run->hyperv.u.hcall.params[1] = hc.outgpa;
>> +if (hc.fast)
>> +memcpy(vcpu->run->hyperv.u.hcall.xmm, hc.xmm, sizeof(hc.xmm));
>>  vcpu->arch.complete_userspace_io = kvm_hv_hypercall_complete_userspace;
>>  return 0;
>>   }
>> @@ -2780,6 +2810,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct 
>> kvm_cpuid2 *cpuid,
>>  ent->ebx |= HV_ENABLE_EXTENDED_HYPERCALLS;
>>   
>>  ent->edx |= HV_X64_HYPERCALL_XMM_INPUT_AVAILABLE;
>> +ent->edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;
>
>
> Shouldn't this be guarded by an ENABLE_CAP to make sure old user space 
> that doesn't know about xmm outputs is still able to run with newer kernels?
>

No, we don't do CAPs for new Hyper-V features anymore since we have
KVM_GET_SUPPORTED_HV_CPUID. Userspace is not supposed to simply copy
its output into guest visible CPUIDs, it must only enable features it
knows. Even 'hv_passthrough' option in QEMU doesn't pass unknown
features through.
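
I.e. the VMM is expected to do something like (pseudo-code; the
'vmm_knows_xmm_output' flag is made up and stands for the VMM's own
knowledge of the feature):

  /* only advertise what both KVM and the VMM itself understand */
  if ((kvm_supported_edx & HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE) &&
      vmm_knows_xmm_output)
          guest_cpuid_edx |= HV_X64_HYPERCALL_XMM_OUTPUT_AVAILABLE;

where 'kvm_supported_edx' comes from KVM_GET_SUPPORTED_HV_CPUID.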

>
>>  ent->edx |= HV_FEATURE_FREQUENCY_MSRS_AVAILABLE;
>>  ent->edx |= HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE;
>>   
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index d7a01766bf21..5ce06a1eee2b 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -192,6 +192,11 @@ struct kvm_s390_cmma_log {
>>  __u64 values;
>>   };
>>   
>> +struct kvm_hyperv_xmm_reg {
>> +__u64 low;
>> +__u64 high;
>> +};
>> +
>>   struct kvm_hyperv_exit 

Re: [PATCH V4 0/3] KVM/Hyper-V: Add Hyper-V direct tlb flush support

2019-09-17 Thread Vitaly Kuznetsov
Paolo Bonzini  writes:

> On 22/08/19 16:30, lantianyu1...@gmail.com wrote:
>> From: Tianyu Lan 
>> 
>> This patchset is to add Hyper-V direct tlb support in KVM. Hyper-V
>> in L0 can delegate L1 hypervisor to handle tlb flush request from
>> L2 guest when direct tlb flush is enabled in L1.
>> 
>> Patch 2 introduces new cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH to enable
>> feature from user space. User space should enable this feature only
>> when Hyper-V hypervisor capability is exposed to guest and KVM profile
> is hidden. There is a parameter conflict between KVM and Hyper-V hypercall.
> We hope the L2 guest doesn't use KVM hypercalls when the feature is
> enabled. For details, please see the comment of the new API
> "KVM_CAP_HYPERV_DIRECT_TLBFLUSH"
>> 
>> Change since v3:
>>- Update changelog in each patches. 
>> 
>> Change since v2:
>>- Move hv assist page(hv_pa_pg) from struct kvm  to struct kvm_hv.
>> 
>> Change since v1:
>>- Fix offset issue in the patch 1.
>>- Update description of KVM KVM_CAP_HYPERV_DIRECT_TLBFLUSH.
>> 
>> Tianyu Lan (2):
>>   x86/Hyper-V: Fix definition of struct hv_vp_assist_page
>>   KVM/Hyper-V: Add new KVM capability KVM_CAP_HYPERV_DIRECT_TLBFLUSH
>> 
>> Vitaly Kuznetsov (1):
>>   KVM/Hyper-V/VMX: Add direct tlb flush support
>> 
>>  Documentation/virtual/kvm/api.txt  | 13 +
>>  arch/x86/include/asm/hyperv-tlfs.h | 24 ++-
>>  arch/x86/include/asm/kvm_host.h|  4 
>>  arch/x86/kvm/vmx/evmcs.h   |  2 ++
>>  arch/x86/kvm/vmx/vmx.c | 39 
>> ++
>>  arch/x86/kvm/x86.c |  8 
>>  include/uapi/linux/kvm.h   |  1 +
>>  7 files changed, 86 insertions(+), 5 deletions(-)
>> 
>
> Queued, thanks.
>

I had a suggestion for how we can get away without the new capability (like
direct tlb flush gets automatically enabled when Hyper-V hypercall page
is activated and we know we can't handle KVM hypercalls any more)
but this can probably be done as a follow-up.
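
E.g. somewhere in the HV_X64_MSR_HYPERCALL handling, a rough sketch only:

  case HV_X64_MSR_HYPERCALL:
          ...
          /*
           * The guest switched to Hyper-V style hypercalls so KVM
           * hypercalls can't be used anymore and direct TLB flush
           * becomes safe to enable.
           */
          if (kvm_x86_ops->enable_direct_tlbflush)
                  kvm_x86_ops->enable_direct_tlbflush(vcpu);
          break;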

-- 
Vitaly


Re: [PATCH V3 0/3] KVM/Hyper-V: Add Hyper-V direct tlb flush support

2019-08-27 Thread Vitaly Kuznetsov
Tianyu Lan  writes:

> On Tue, Aug 27, 2019 at 8:38 PM Vitaly Kuznetsov  wrote:
>>
>> Tianyu Lan  writes:
>>
>> > On Tue, Aug 27, 2019 at 2:41 PM Vitaly Kuznetsov  
>> > wrote:
>> >>
>> >> lantianyu1...@gmail.com writes:
>> >>
>> >> > From: Tianyu Lan 
>> >> >
>> >> > This patchset is to add Hyper-V direct tlb support in KVM. Hyper-V
>> >> > in L0 can delegate L1 hypervisor to handle tlb flush request from
>> >> > L2 guest when direct tlb flush is enabled in L1.
>> >> >
>> >> > Patch 2 introduces new cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH to enable
>> >> > feature from user space. User space should enable this feature only
>> >> > when Hyper-V hypervisor capability is exposed to guest and KVM profile
>> >> > is hidden. There is a parameter conflict between KVM and Hyper-V
>> >> > hypercall.
>> >> > We hope the L2 guest doesn't use KVM hypercalls when the feature is
>> >> > enabled. For details, please see the comment of the new API
>> >> > "KVM_CAP_HYPERV_DIRECT_TLBFLUSH"
>> >>
>> >> I was thinking about this for awhile and I think I have a better
>> >> proposal. Instead of adding this new capability let's enable direct TLB
>> >> flush when KVM guest enables Hyper-V Hypercall page (writes to
>> >> HV_X64_MSR_HYPERCALL) - this guarantees that the guest doesn't need KVM
>> >> hypercalls as we can't handle both KVM-style and Hyper-V-style
>> >> hypercalls simultaneously and kvm_emulate_hypercall() does:
>> >>
>> >> if (kvm_hv_hypercall_enabled(vcpu->kvm))
>> >> return kvm_hv_hypercall(vcpu);
>> >>
>> >> What do you think?
>> >>
>> >> (and instead of adding the capability we can add kvm.ko module parameter
>> >> to enable direct tlb flush unconditionally, like
>> >> 'hv_direct_tlbflush=-1/0/1' with '-1' being the default (autoselect
>> >> based on Hyper-V hypercall enablement, '0' - permanently disabled, '1' -
>> >> permanently enabled)).
>> >>
>> >
>> > Hi Vitaly:
>> >  Actually, I had such an idea before. But user space should check
>> > whether hv tlb flush is exposed to the VM before enabling direct tlb
>> > flush. If not, user space should not enable direct tlb flush for the
>> > guest since Hyper-V will do more checks for each hypercall from the
>> > nested VM with the feature enabled.
>>
>> If TLB Flush enlightenment is not exposed to the VM at all there's no
>> difference if we enable direct TLB flush in eVMCS or not: the guest
>> won't be using 'TLB Flush' hypercall and will do TLB flushing with
>> IPIs. And, in case the guest enables Hyper-V hypercall page, it is
>> definitely not going to use KVM hypercalls so we can't break these.
>>
>
> Yes, this won't trigger the KVM/Hyper-V hypercall conflict. My point is
> that if tlb flush enlightenment is not enabled, enabling direct tlb
> flush will not accelerate anything and Hyper-V will still check each
> hypercall from the nested VM in order to intercept the tlb flush hypercall.
> But the guest won't use the tlb flush hypercall in this case. The check
> of each hypercall in Hyper-V is redundant. We may avoid the
> overhead via checking status of tlb flush enlightenment and just
> enable direct tlb flush when it's enabled.

Oh, I see. Yes, this optimization is possible and I'm not against it,
however I doubt it will make a significant difference because no matter
what upon VMCALL we first drop into L0 which can either inject this in
L1 or, in case of direct TLB flush, execute it by itself. Checking if
the hypercall is a TLB flush hypercall is just a register read, it
should be very cheap.
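
(Illustrating with how KVM itself parses a hypercall: the call code is just
the low 16 bits of the hypercall input register, so the check is roughly

  u64 param = kvm_rcx_read(vcpu);      /* 32-bit guests use EDX:EAX instead */
  u16 code = param & 0xffff;
  bool is_tlb_flush = code == HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE ||
                      code == HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST;

plus the *_EX variants; Hyper-V's own check should be equally trivial.)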

-- 
Vitaly


Re: [PATCH V3 0/3] KVM/Hyper-V: Add Hyper-V direct tlb flush support

2019-08-27 Thread Vitaly Kuznetsov
Tianyu Lan  writes:

> On Tue, Aug 27, 2019 at 2:41 PM Vitaly Kuznetsov  wrote:
>>
>> lantianyu1...@gmail.com writes:
>>
>> > From: Tianyu Lan 
>> >
>> > This patchset is to add Hyper-V direct tlb support in KVM. Hyper-V
>> > in L0 can delegate L1 hypervisor to handle tlb flush request from
>> > L2 guest when direct tlb flush is enabled in L1.
>> >
>> > Patch 2 introduces new cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH to enable
>> > feature from user space. User space should enable this feature only
>> > when Hyper-V hypervisor capability is exposed to guest and KVM profile
>> > is hidden. There is a parameter conflict between KVM and Hyper-V hypercall.
>> > We hope the L2 guest doesn't use KVM hypercalls when the feature is
>> > enabled. For details, please see the comment of the new API
>> > "KVM_CAP_HYPERV_DIRECT_TLBFLUSH"
>>
>> I was thinking about this for awhile and I think I have a better
>> proposal. Instead of adding this new capability let's enable direct TLB
>> flush when KVM guest enables Hyper-V Hypercall page (writes to
>> HV_X64_MSR_HYPERCALL) - this guarantees that the guest doesn't need KVM
>> hypercalls as we can't handle both KVM-style and Hyper-V-style
>> hypercalls simultaneously and kvm_emulate_hypercall() does:
>>
>> if (kvm_hv_hypercall_enabled(vcpu->kvm))
>> return kvm_hv_hypercall(vcpu);
>>
>> What do you think?
>>
>> (and instead of adding the capability we can add kvm.ko module parameter
>> to enable direct tlb flush unconditionally, like
>> 'hv_direct_tlbflush=-1/0/1' with '-1' being the default (autoselect
>> based on Hyper-V hypercall enablement, '0' - permanently disabled, '1' -
>> permanently enabled)).
>>
>
> Hi Vitaly:
>  Actually, I had such an idea before. But user space should check
> whether hv tlb flush is exposed to the VM before enabling direct tlb
> flush. If not, user space should not enable direct tlb flush for the
> guest since Hyper-V will do more checks for each hypercall from the
> nested VM with the feature enabled.

If TLB Flush enlightenment is not exposed to the VM at all there's no
difference if we enable direct TLB flush in eVMCS or not: the guest
won't be using 'TLB Flush' hypercall and will do TLB flushing with
IPIs. And, in case the guest enables Hyper-V hypercall page, it is
definitely not going to use KVM hypercalls so we can't break these.

-- 
Vitaly


Re: [PATCH V3 0/3] KVM/Hyper-V: Add Hyper-V direct tlb flush support

2019-08-26 Thread Vitaly Kuznetsov
lantianyu1...@gmail.com writes:

> From: Tianyu Lan 
>
> This patchset is to add Hyper-V direct tlb support in KVM. Hyper-V
> in L0 can delegate L1 hypervisor to handle tlb flush request from
> L2 guest when direct tlb flush is enabled in L1.
>
> Patch 2 introduces new cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH to enable
> feature from user space. User space should enable this feature only
> when Hyper-V hypervisor capability is exposed to guest and KVM profile
> is hidden. There is a parameter conflict between KVM and Hyper-V hypercall.
> We hope the L2 guest doesn't use KVM hypercalls when the feature is
> enabled. For details, please see the comment of the new API
> "KVM_CAP_HYPERV_DIRECT_TLBFLUSH"

I was thinking about this for awhile and I think I have a better
proposal. Instead of adding this new capability let's enable direct TLB
flush when KVM guest enables Hyper-V Hypercall page (writes to
HV_X64_MSR_HYPERCALL) - this guarantees that the guest doesn't need KVM
hypercalls as we can't handle both KVM-style and Hyper-V-style
hypercalls simultaneously and kvm_emulate_hypercall() does:

if (kvm_hv_hypercall_enabled(vcpu->kvm))
return kvm_hv_hypercall(vcpu);

What do you think?

(and instead of adding the capability we can add kvm.ko module parameter
to enable direct tlb flush unconditionally, like
'hv_direct_tlbflush=-1/0/1' with '-1' being the default (autoselect
based on Hyper-V hypercall enablement, '0' - permanently disabled, '1' -
permanently enabled)).

-- 
Vitaly


Re: [PATCH 2/3] KVM/Hyper-V: Add new KVM cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH

2019-08-09 Thread Vitaly Kuznetsov
lantianyu1...@gmail.com writes:

> From: Tianyu Lan 
>
> This patch adds new KVM cap KVM_CAP_HYPERV_DIRECT_TLBFLUSH and let
> user space to enable direct tlb flush function when only Hyper-V
> hypervsior capability is exposed to VM. This patch also adds
> enable_direct_tlbflush callback in the struct kvm_x86_ops and
> platforms may use it to implement direct tlb flush support.
>
> Signed-off-by: Tianyu Lan 
> ---
>  Documentation/virtual/kvm/api.txt | 10 ++
>  arch/x86/include/asm/kvm_host.h   |  2 ++
>  arch/x86/kvm/x86.c|  8 
>  include/uapi/linux/kvm.h  |  1 +
>  4 files changed, 21 insertions(+)
>
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 2cd6250b2896..45308ed6dd75 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -5289,3 +5289,13 @@ Architectures: x86
>  This capability indicates that KVM supports paravirtualized Hyper-V IPI send
>  hypercalls:
>  HvCallSendSyntheticClusterIpi, HvCallSendSyntheticClusterIpiEx.
> +8.21 KVM_CAP_HYPERV_DIRECT_TLBFLUSH
> +
> +Architecture: x86
> +
> +This capability indicates that KVM supports Hyper-V direct tlb flush 
> function.
> +User space should enable this feature only when Hyper-V hypervisor capability
> +is exposed to guest and KVM profile is hidden. Both Hyper-V and KVM hypercalls
> +use RAX and RCX registers to pass parameters. If KVM hypercall is exposed
> +to L2 guest with direct tlbflush enabled, Hyper-V may mistake KVM hypercall
> +for Hyper-V tlb flush hypercall due to parameter register overlap.

First, we need to explicitly state that this is for KVM on Hyper-V and
second, that this disables normal hypercall handling by KVM.

My take:

This capability indicates that KVM running on top of Hyper-V hypervisor
enables Direct TLB flush for its guests meaning that TLB flush
hypercalls are handled by the Level 0 hypervisor (Hyper-V), bypassing KVM.
Due to the different ABI for hypercall parameters between Hyper-V and
KVM, enabling this capability effectively disables all hypercall
handling by KVM (as some KVM hypercalls may be mistakenly treated as TLB
flush hypercalls by Hyper-V) so userspace should disable KVM
identification in CPUID.

I think we should also enforce this somehow leaving only Hyper-V style
hypercalls handling (for Windows guests) in place.
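
E.g. (sketch only, the exact condition is certainly up for discussion):

  case KVM_CAP_HYPERV_DIRECT_TLBFLUSH:
          if (!kvm_x86_ops->enable_direct_tlbflush)
                  return -ENOTTY;
          /* only makes sense for guests doing Hyper-V style hypercalls */
          if (!kvm_hv_hypercall_enabled(vcpu->kvm))
                  return -EINVAL;
          return kvm_x86_ops->enable_direct_tlbflush(vcpu);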

> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0cc5b611a113..667d154e89d4 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1205,6 +1205,8 @@ struct kvm_x86_ops {
>   uint16_t (*nested_get_evmcs_version)(struct kvm_vcpu *vcpu);
>  
>   bool (*need_emulation_on_page_fault)(struct kvm_vcpu *vcpu);
> +
> + int (*enable_direct_tlbflush)(struct kvm_vcpu *vcpu);
>  };
>  
>  struct kvm_arch_async_pf {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9d7b9e6a0939..a9d8ee7f7bf0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3183,6 +3183,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long 
> ext)
>   r = kvm_x86_ops->get_nested_state ?
>   kvm_x86_ops->get_nested_state(NULL, NULL, 0) : 0;
>   break;
> + case KVM_CAP_HYPERV_DIRECT_TLBFLUSH:
> + r = kvm_x86_ops->enable_direct_tlbflush ? 1 : 0;
> + break;
>   default:
>   break;
>   }
> @@ -3953,6 +3956,11 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu 
> *vcpu,
>   r = -EFAULT;
>   }
>   return r;
> + case KVM_CAP_HYPERV_DIRECT_TLBFLUSH:
> + if (!kvm_x86_ops->enable_direct_tlbflush)
> + return -ENOTTY;
> +
> + return kvm_x86_ops->enable_direct_tlbflush(vcpu);
>  
>   default:
>   return -EINVAL;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index a7c19540ce21..cb959bc925b1 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -996,6 +996,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ARM_PTRAUTH_ADDRESS 171
>  #define KVM_CAP_ARM_PTRAUTH_GENERIC 172
>  #define KVM_CAP_PMU_EVENT_FILTER 173
> +#define KVM_CAP_HYPERV_DIRECT_TLBFLUSH 174
>  
>  #ifdef KVM_CAP_IRQ_ROUTING

-- 
Vitaly


Re: [PATCH 1/3] x86/Hyper-V: Fix definition of struct hv_vp_assist_page

2019-08-09 Thread Vitaly Kuznetsov
lantianyu1...@gmail.com writes:

> From: Tianyu Lan 
>
> The struct hv_vp_assist_page was defined incorrectly.
> The "vtl_control" should be u64[3], "nested_enlightenments_control"
> should be a u64 and there are 7 reserved bytes following "enlighten_vmentry".
> This patch is to fix it.
>
> Signed-off-by: Tianyu Lan 
> ---
>  arch/x86/include/asm/hyperv-tlfs.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h 
> b/arch/x86/include/asm/hyperv-tlfs.h
> index af78cd72b8f3..a79703c56ebe 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -517,11 +517,11 @@ struct hv_timer_message_payload {
>  /* Define virtual processor assist page structure. */
>  struct hv_vp_assist_page {
>   __u32 apic_assist;
> - __u32 reserved;
> - __u64 vtl_control[2];
> + __u32 reserved1;
> + __u64 vtl_control[3];
>   __u64 nested_enlightenments_control[2];

In PATCH3 you define 'struct hv_nested_enlightenments_control' and it is
64bit long, not 128. We should change it here too as ...

> - __u32 enlighten_vmentry;

enlighten_vmentry field will get a very different offset, breaking
Enlightened VMCS.

> - __u32 padding;
> + __u8 enlighten_vmentry;
> + __u8 reserved2[7];
>   __u64 current_nested_vmcs;
>  } __packed;

-- 
Vitaly


Re: [PATCH] Revert "mm: rename _count, field of the struct page, to _refcount"

2016-06-16 Thread Vitaly Kuznetsov
Michal Hocko  writes:

> On Thu 16-06-16 12:30:16, Vitaly Kuznetsov wrote:
>> Christoph Hellwig  writes:
>> 
>> > On Thu, Jun 16, 2016 at 11:22:46AM +0200, Vitaly Kuznetsov wrote:
>> >> _count -> _refcount rename in commit 0139aa7b7fa12 ("mm: rename _count,
>> >> field of the struct page, to _refcount") broke kdump. makedumpfile(8) does
>> >> stuff like READ_MEMBER_OFFSET("page._count", page._count) and fails. While
>> >> it is definitely possible to fix this particular tool I'm not sure about
>> >> other tools which might be doing the same.
>> >> 
>> >> I suggest we remember the "we don't break userspace" rule and revert for
>> >> 4.7 while it's not too late.
>> >
>> > Err, sorry - this is not "userspace".  It's crazy crap digging into
>> > kernel internal structure.
>> >
>> > The rename was absolutely useful, so fix up your stinking pike in kdump.
>> 
>> Ok, sure, I'll send a patch to it. I was worried about other tools out
there which e.g. inspect /proc/vmcore. As it is something we support,
>> some conservatism around it is justified.
>
> struct page layout as some others that such a tool might depend on has
> changes several times in the past so I fail to see how is it any
> different this time.

IMO this time the change doesn't give us any advantage, it was just a
rename.

> struct page is nothing the userspace should depend on.

True but at least makedumpfile(8) is special and even if it's a 'crazy
crap digging into ...' we could avoid breaking it for no technical
reason.

-- 
  Vitaly


Re: [PATCH] Revert "mm: rename _count, field of the struct page, to _refcount"

2016-06-16 Thread Vitaly Kuznetsov
Christoph Hellwig  writes:

> On Thu, Jun 16, 2016 at 11:22:46AM +0200, Vitaly Kuznetsov wrote:
>> _count -> _refcount rename in commit 0139aa7b7fa12 ("mm: rename _count,
>> field of the struct page, to _refcount") broke kdump. makedumpfile(8) does
>> stuff like READ_MEMBER_OFFSET("page._count", page._count) and fails. While
>> it is definitely possible to fix this particular tool I'm not sure about
>> other tools which might be doing the same.
>> 
>> I suggest we remember the "we don't break userspace" rule and revert for
>> 4.7 while it's not too late.
>
> Err, sorry - this is not "userspace".  It's crazy crap digging into
> kernel internal structure.
>
> The rename was absolutely useful, so fix up your stinking pike in kdump.

Ok, sure, I'll send a patch to it. I was worried about other tools out
there which e.g. inspect /proc/vmcore. As it is something we support,
some conservatism around it is justified.

-- 
  Vitaly


[PATCH] Revert "mm: rename _count, field of the struct page, to _refcount"

2016-06-16 Thread Vitaly Kuznetsov
_count -> _refcount rename in commit 0139aa7b7fa12 ("mm: rename _count,
field of the struct page, to _refcount") broke kdump. makedumpfile(8) does
stuff like READ_MEMBER_OFFSET("page._count", page._count) and fails. While
it is definitely possible to fix this particular tool I'm not sure about
other tools which might be doing the same.

I suggest we remember the "we don't break userspace" rule and revert for
4.7 while it's not too late.

This is a partial revert, useful hunks in drivers which do
page_ref_{sub,add,inc} instead of open coded atomic_* operations stay.

Signed-off-by: Vitaly Kuznetsov 
---
 Documentation/vm/transhuge.txt   | 10 +-
 arch/tile/mm/init.c  |  2 +-
 drivers/block/aoe/aoecmd.c   |  2 +-
 drivers/hwtracing/intel_th/msu.c |  2 +-
 fs/proc/page.c   |  2 +-
 include/linux/mm.h   |  2 +-
 include/linux/mm_types.h | 14 +-
 include/linux/page_ref.h | 26 +-
 include/linux/pagemap.h  |  8 
 kernel/kexec_core.c  |  2 +-
 mm/huge_memory.c |  4 ++--
 mm/internal.h|  2 +-
 mm/page_alloc.c  |  4 ++--
 mm/slub.c|  4 ++--
 mm/vmscan.c  |  4 ++--
 15 files changed, 42 insertions(+), 46 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 7c871d6..75c774c 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -394,9 +394,9 @@ hugepage natively. Once finished you can drop the page 
table lock.
 Refcounting on THP is mostly consistent with refcounting on other compound
 pages:
 
-  - get_page()/put_page() and GUP operate in head page's ->_refcount.
+  - get_page()/put_page() and GUP operate in head page's ->_count.
 
-  - ->_refcount in tail pages is always zero: get_page_unless_zero() never
+  - ->_count in tail pages is always zero: get_page_unless_zero() never
 succeed on tail pages.
 
   - map/unmap of the pages with PTE entry increment/decrement ->_mapcount
@@ -426,15 +426,15 @@ requests to split pinned huge page: it expects page count 
to be equal to
 sum of mapcount of all sub-pages plus one (split_huge_page caller must
 have reference for head page).
 
-split_huge_page uses migration entries to stabilize page->_refcount and
+split_huge_page uses migration entries to stabilize page->_count and
 page->_mapcount.
 
 We safe against physical memory scanners too: the only legitimate way
 scanner can get reference to a page is get_page_unless_zero().
 
-All tail pages have zero ->_refcount until atomic_add(). This prevents the
+All tail pages have zero ->_count until atomic_add(). This prevents the
 scanner from getting a reference to the tail page up to that point. After the
-atomic_add() we don't care about the ->_refcount value.  We already known how
+atomic_add() we don't care about the ->_count value.  We already known how
 many references should be uncharged from the head page.
 
 For head page get_page_unless_zero() will succeed and we don't mind. It's
diff --git a/arch/tile/mm/init.c b/arch/tile/mm/init.c
index adce254..a0582b7 100644
--- a/arch/tile/mm/init.c
+++ b/arch/tile/mm/init.c
@@ -679,7 +679,7 @@ static void __init init_free_pfn_range(unsigned long start, 
unsigned long end)
 * Hacky direct set to avoid unnecessary
 * lock take/release for EVERY page here.
 */
-   p->_refcount.counter = 0;
+   p->_count.counter = 0;
p->_mapcount.counter = -1;
}
init_page_count(page);
diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
index d597e43..437b3a8 100644
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -861,7 +861,7 @@ rqbiocnt(struct request *r)
  * discussion.
  *
  * We cannot use get_page in the workaround, because it insists on a
- * positive page count as a precondition.  So we use _refcount directly.
+ * positive page count as a precondition.  So we use _count directly.
  */
 static void
 bio_pageinc(struct bio *bio)
diff --git a/drivers/hwtracing/intel_th/msu.c b/drivers/hwtracing/intel_th/msu.c
index e8d55a1..0974090 100644
--- a/drivers/hwtracing/intel_th/msu.c
+++ b/drivers/hwtracing/intel_th/msu.c
@@ -1172,7 +1172,7 @@ static void msc_mmap_close(struct vm_area_struct *vma)
if (!atomic_dec_and_mutex_lock(&msc->mmap_count, &msc->buf_mutex))
return;
 
-   /* drop page _refcounts */
+   /* drop page _counts */
for (pg = 0; pg < msc->nr_pages; pg++) {
struct page *page = msc_buffer_get_page(msc, pg);
 
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 3ecd445..712f1b9 100644
--- a/f

Re: [PATCH 0/2] memory_hotplug: introduce config and command line options to set the default onlining policy

2016-04-21 Thread Vitaly Kuznetsov
David Rientjes  writes:

> On Tue, 19 Apr 2016, Vitaly Kuznetsov wrote:
>
>> > I'd personally disagree that we need more and more config options to take 
>> > care of something that an initscript can easily do and most distros 
>> > already have their own initscripts that this can be added to.  I don't see 
>> > anything that the config option adds.
>> 
>> Yes, but why does every distro need to solve the exact same issue by 
>> a distro-specific init script when we can allow setting reasonable
>> default in kernel?
>> 
>
> No, only distros that want to change the long-standing default which is 
> "offline" since they apparently aren't worried about breaking existing 
> userspace.
>
> Changing defaults is always risky business in the kernel, especially when 
> it's long standing.  If the default behavior is changeable, userspace 
> needs to start testing for that and acting accordingly if it actually 
> wants to default to offline (and there are existing tools that suppose the 
> long-standing default).  The end result is that the kernel default doesn't 
> matter anymore, we've just pushed it to userspace to either online or 
> offline at the time of hotplug.
>

"We don't break userspace". Yes, I know, but is there an example of such
userspace which is going to break?

E.g. RHEL7 ships the following udev rule by default:
# Memory hotadd request
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

which is not very smart but it does the job (with issues I'm trying to
solve). I'm not aware of any breakages reported after it was introduced.

My understanding is that the legacy default 'offline' was introduced
before memory hotplug became a frequently used feature in virtual
machines. When you hotplug physical memory you go to your server room,
open your server, insert a memory DIMM, ... - in this scenario 'offline'
is a reasonable default. But in VMs memory hotplug is usually automatic
from the host side -- we address high memory pressure or tenant
requests.

>> If the config option itself is a problem (though I don't understand why)
>> we can get rid of it making the default 'online' and keeping the command
>> line parameter to disable it for cases when something goes wrong but why
>> not leave an option for those who want it the other way around?
>> 
>
> That could break existing userspace that assumes the default is offline; 
> if users are currently hotadding memory and then onlining it when needed 
> rather than immediately, they break.  So that's not a possibility.
>

Yes, so I introduce a config option. Next thing we do we enable it in
'bleeding edge' distros, e.g. Fedora and see who complains. My guess is
that nobody is going to complain.

>> Other than the above, let's imagine a 'unikernel' scenario when there
>> are no initscripts and we're in a virtualized environment. We may want to
>> have memory hotplug there too, but where would we put the 'onlining'
>> logic? In every userspace we want to run? This doesn't sound right.
>> 
>
> Nobody is resisting hotplug notifiers.

Yes, but we need to teach memory hotplug to every userspace instead.

-- 
  Vitaly


Re: [PATCH 0/2] memory_hotplug: introduce config and command line options to set the default onlining policy

2016-04-19 Thread Vitaly Kuznetsov
David Rientjes  writes:

> On Thu, 7 Apr 2016, Vitaly Kuznetsov wrote:
>
>> >> > This patchset continues the work I started with:
>> >> > 
>> >> > commit 31bc3858ea3ebcc3157b3f5f0e624c5962f5a7a6
>> >> > Author: Vitaly Kuznetsov 
>> >> > Date:   Tue Mar 15 14:56:48 2016 -0700
>> >> > 
>> >> > memory-hotplug: add automatic onlining policy for the newly added 
>> >> > memory
>> >> > 
>> >> > Initially I was going to stop there and bring the policy setting logic 
>> >> > to
>> >> > userspace. I met two issues on this way:
>> >> > 
>> >> > 1) It is possible to have memory hotplugged at boot (e.g. with QEMU). 
>> >> > These
>> >> >blocks stay offlined if we turn the onlining policy on by userspace.
>> >> > 
>> >> > 2) My attempt to bring this policy setting to systemd failed, systemd
>> >> >maintainers suggest to change the default in kernel or ... to use 
>> >> > tmpfiles.d
>> >> >to alter the policy (which looks like a hack to me): 
>> >> >https://github.com/systemd/systemd/pull/2938
>> >> 
>> >> That discussion really didn't come to a conclusion and I don't
>> >> understand why you consider Lennert's "recommended way" to be a hack?
>> >> 
>> >> > Here I suggest to add a config option to set the default value for the 
>> >> > policy
>> >> > and a kernel command line parameter to make the override.
>> >> 
>> >> But the patchset looks pretty reasonable regardless of the above.
>> >> 
>> >
>> > I don't understand why initscripts simply cannot crawl sysfs memory blocks 
>> > and online them for the same behavior.
>> 
>> Yes, they can. With this patchset I don't bring any new features, it's
>> rather a convenience so linux distros can make memory hotplug work
>> 'out of the box' without such distro-specific initscripts. Memory
>> hotplug is a standard feature of all major virt technologies so I think
>> it's pretty reasonable to have an option to make it work 'by default'
>> available.
>> 
>
> I'd personally disagree that we need more and more config options to take 
> care of something that an initscript can easily do and most distros 
> already have their own initscripts that this can be added to.  I don't see 
> anything that the config option adds.

Yes, but why does every distro need to solve the exact same issue by 
a distro-specific init script when we can allow setting a reasonable
default in the kernel?

If the config option itself is a problem (though I don't understand why)
we can get rid of it, making the default 'online' and keeping the command
line parameter to disable it for cases when something goes wrong. But why
not leave an option for those who want it the other way around?

Other than the above, let's imagine a 'unikernel' scenario when there
are no initscripts and we're in a virtualized environment. We may want to
have memory hotplug there too, but where would we put the 'onlining'
logic? In every userspace we want to run? This doesn't sound right.

-- 
  Vitaly


Re: [PATCH 0/2] memory_hotplug: introduce config and command line options to set the default onlining policy

2016-04-07 Thread Vitaly Kuznetsov
David Rientjes  writes:

> On Wed, 6 Apr 2016, Andrew Morton wrote:
>
>> > This patchset continues the work I started with:
>> > 
>> > commit 31bc3858ea3ebcc3157b3f5f0e624c5962f5a7a6
>> > Author: Vitaly Kuznetsov 
>> > Date:   Tue Mar 15 14:56:48 2016 -0700
>> > 
>> > memory-hotplug: add automatic onlining policy for the newly added memory
>> > 
>> > Initially I was going to stop there and bring the policy setting logic to
>> > userspace. I ran into two issues along the way:
>> > 
>> > 1) It is possible to have memory hotplugged at boot (e.g. with QEMU).
>> >    These blocks stay offline if the onlining policy is only turned on
>> >    later from userspace.
>> > 
>> > 2) My attempt to bring this policy setting to systemd failed; the systemd
>> >    maintainers suggested changing the default in the kernel or ... using
>> >    tmpfiles.d to alter the policy (which looks like a hack to me):
>> >    https://github.com/systemd/systemd/pull/2938
>> 
>> That discussion really didn't come to a conclusion and I don't
>> understand why you consider Lennart's "recommended way" to be a hack?
>> 
>> > Here I suggest adding a config option to set the default value for the
>> > policy and a kernel command line parameter to override it.
>> 
>> But the patchset looks pretty reasonable regardless of the above.
>> 
>
> I don't understand why initscripts simply cannot crawl sysfs memory blocks 
> and online them for the same behavior.

Yes, they can. With this patchset I don't bring any new features; it's
rather a convenience so Linux distros can make memory hotplug work
'out of the box' without such distro-specific initscripts. Memory
hotplug is a standard feature of all major virt technologies, so I think
it's pretty reasonable to have an option available that makes it work
'by default'.

-- 
  Vitaly


Re: [PATCH 0/2] memory_hotplug: introduce config and command line options to set the default onlining policy

2016-04-07 Thread Vitaly Kuznetsov
Andrew Morton  writes:

> On Wed,  6 Apr 2016 15:45:10 +0200 Vitaly Kuznetsov  wrote:
>
>> This patchset continues the work I started with:
>> 
>> commit 31bc3858ea3ebcc3157b3f5f0e624c5962f5a7a6
>> Author: Vitaly Kuznetsov 
>> Date:   Tue Mar 15 14:56:48 2016 -0700
>> 
>> memory-hotplug: add automatic onlining policy for the newly added memory
>> 
>> Initially I was going to stop there and bring the policy setting logic to
>> userspace. I ran into two issues along the way:
>> 
>> 1) It is possible to have memory hotplugged at boot (e.g. with QEMU).
>>    These blocks stay offline if the onlining policy is only turned on
>>    later from userspace.
>> 
>> 2) My attempt to bring this policy setting to systemd failed; the systemd
>>    maintainers suggested changing the default in the kernel or ... using
>>    tmpfiles.d to alter the policy (which looks like a hack to me):
>>    https://github.com/systemd/systemd/pull/2938
>
> That discussion really didn't come to a conclusion and I don't
> understand why you consider Lennart's "recommended way" to be a hack?

Just the name. To me 'tmpfiles.d' doesn't sound like an appropriate
place to search for kernel tunable settings. It would be much better if
we had something like 'tunables.d' for that.

-- 
  Vitaly
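
For context, the tmpfiles.d route suggested in that pull request amounts to
something like shipping a one-line configuration file that writes the policy
at boot. A sketch, assuming systemd-tmpfiles 'w' entries and an illustrative
file name:

  # /etc/tmpfiles.d/memory-hotplug.conf (illustrative name)
  # A 'w' entry makes systemd-tmpfiles write the argument to the given path at boot.
  w /sys/devices/system/memory/auto_online_blocks - - - - online

  # The same write can of course be done at runtime from a shell:
  echo online > /sys/devices/system/memory/auto_online_blocks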


[PATCH 1/2] memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE

2016-04-06 Thread Vitaly Kuznetsov
Introduce a config option to set the default value for the memory hotplug
onlining policy (/sys/devices/system/memory/auto_online_blocks). The
reasons one would want to turn this option on are to have memory that is
hot-plugged at boot onlined early and to avoid requiring any userspace
action to make memory hotplug work.

Signed-off-by: Vitaly Kuznetsov 
---
 Documentation/memory-hotplug.txt |  9 +++++----
 mm/Kconfig                       | 16 ++++++++++++++++
 mm/memory_hotplug.c              |  4 ++++
 3 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
index 443f4b4..0d7cb95 100644
--- a/Documentation/memory-hotplug.txt
+++ b/Documentation/memory-hotplug.txt
@@ -261,10 +261,11 @@ it according to the policy which can be read from "auto_online_blocks" file:
 
 % cat /sys/devices/system/memory/auto_online_blocks
 
-The default is "offline" which means the newly added memory is not in a
-ready-to-use state and you have to "online" the newly added memory blocks
-manually. Automatic onlining can be requested by writing "online" to
-"auto_online_blocks" file:
+The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
+option. If it is disabled, the default is "offline", which means the newly
+added memory is not in a ready-to-use state and you have to "online" the newly
+added memory blocks manually. Automatic onlining can be requested by writing
+"online" to the "auto_online_blocks" file:
 
 % echo online > /sys/devices/system/memory/auto_online_blocks
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 989f8f3..4c7dd6e 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -192,6 +192,22 @@ config MEMORY_HOTPLUG_SPARSE
 	def_bool y
 	depends on SPARSEMEM && MEMORY_HOTPLUG
 
+config MEMORY_HOTPLUG_DEFAULT_ONLINE
+	bool "Online the newly added memory blocks by default"
+	default n
+	depends on MEMORY_HOTPLUG
+	help
+	  This option changes the default policy setting for the memory hotplug
+	  onlining policy (/sys/devices/system/memory/auto_online_blocks), which
+	  determines what happens to newly added memory regions. The policy
+	  setting can always be changed at runtime.
+	  See Documentation/memory-hotplug.txt for more information.
+
+	  Say Y here if you want all hot-plugged memory blocks to appear in
+	  'online' state by default.
+	  Say N here if you want the default policy to keep all hot-plugged
+	  memory blocks in 'offline' state.
+
 config MEMORY_HOTREMOVE
 	bool "Allow for memory hot remove"
 	select MEMORY_ISOLATION
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index aa34431..072e0a1 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -78,7 +78,11 @@ static struct {
 #define memhp_lock_acquire()  lock_map_acquire(&mem_hotplug.dep_map)
 #define memhp_lock_release()  lock_map_release(&mem_hotplug.dep_map)
 
+#ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
 bool memhp_auto_online;
+#else
+bool memhp_auto_online = true;
+#endif
 EXPORT_SYMBOL_GPL(memhp_auto_online);
 
 void get_online_mems(void)
-- 
2.5.5
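
A quick usage sketch for this option (scripts/config ships with the kernel
tree; the expected sysfs value is what the changed default should produce,
everything else here is illustrative):

  # Enable the new default in the build configuration:
  scripts/config --enable MEMORY_HOTPLUG_DEFAULT_ONLINE
  grep MEMORY_HOTPLUG_DEFAULT_ONLINE .config

  # After booting the resulting kernel, the policy file should already
  # read "online" without any userspace intervention:
  cat /sys/devices/system/memory/auto_online_blocks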



[PATCH 0/2] memory_hotplug: introduce config and command line options to set the default onlining policy

2016-04-06 Thread Vitaly Kuznetsov
This patchset continues the work I started with:

commit 31bc3858ea3ebcc3157b3f5f0e624c5962f5a7a6
Author: Vitaly Kuznetsov 
Date:   Tue Mar 15 14:56:48 2016 -0700

memory-hotplug: add automatic onlining policy for the newly added memory

Initially I was going to stop there and bring the policy setting logic to
userspace. I ran into two issues along the way:

1) It is possible to have memory hotplugged at boot (e.g. with QEMU; an
   illustrative invocation is sketched at the end of this mail). These blocks
   stay offline if the onlining policy is only turned on later from userspace.

2) My attempt to bring this policy setting to systemd failed; the systemd
   maintainers suggested changing the default in the kernel or ... using
   tmpfiles.d to alter the policy (which looks like a hack to me):
   https://github.com/systemd/systemd/pull/2938

Here I suggest adding a config option to set the default value for the policy
and a kernel command line parameter to override it.

Vitaly Kuznetsov (2):
  memory_hotplug: introduce CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
  memory_hotplug: introduce memhp_default_state= command line parameter

 Documentation/kernel-parameters.txt |  8 ++++++++
 Documentation/memory-hotplug.txt    |  9 +++++----
 mm/Kconfig                          | 16 ++++++++++++++++
 mm/memory_hotplug.c                 | 15 +++++++++++++++
 4 files changed, 44 insertions(+), 4 deletions(-)

-- 
2.5.5
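
To make issue (1) above concrete: with QEMU, additional memory can arrive
already plugged at boot time, e.g. as a cold-plugged DIMM. The invocation
below is illustrative and not part of this series:

  qemu-system-x86_64 -enable-kvm -m 2G,slots=4,maxmem=8G \
      -object memory-backend-ram,id=mem1,size=1G \
      -device pc-dimm,id=dimm1,memdev=mem1 \
      -kernel bzImage -append "root=/dev/vda console=ttyS0"

  # Inside the guest, without automatic onlining, the corresponding
  # memory blocks show up offline:
  grep -l offline /sys/devices/system/memory/memory*/state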



[PATCH 2/2] memory_hotplug: introduce memhp_default_state= command line parameter

2016-04-06 Thread Vitaly Kuznetsov
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE specifies the default value for the
memory hotplug onlining policy. Add a command line parameter to make it
possible to override the default. It may come in handy for debugging and
testing purposes.

Signed-off-by: Vitaly Kuznetsov 
---
 Documentation/kernel-parameters.txt |  8 ++++++++
 mm/memory_hotplug.c                 | 11 +++++++++++
 2 files changed, 19 insertions(+)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index ecc74fa..b05ee2f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2141,6 +2141,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			[KNL,SH] Allow user to override the default size for
 			per-device physically contiguous DMA buffers.
 
+	memhp_default_state=online/offline
+			[KNL] Set the initial state for the memory hotplug
+			onlining policy. If not specified, the default value is
+			set according to the
+			CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
+			option.
+			See Documentation/memory-hotplug.txt.
+
 	memmap=exactmap	[KNL,X86] Enable setting of an exact
 			E820 memory map, as specified by the user.
 			Such memmap=exactmap lines can be constructed based on
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 072e0a1..179f3af 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -85,6 +85,17 @@ bool memhp_auto_online = true;
 #endif
 EXPORT_SYMBOL_GPL(memhp_auto_online);
 
+static int __init setup_memhp_default_state(char *str)
+{
+	if (!strcmp(str, "online"))
+		memhp_auto_online = true;
+	else if (!strcmp(str, "offline"))
+		memhp_auto_online = false;
+
+	return 1;
+}
+__setup("memhp_default_state=", setup_memhp_default_state);
+
 void get_online_mems(void)
 {
 	might_sleep();
-- 
2.5.5
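
A minimal way to check that the override took effect on a booted system; how
the parameter gets onto the command line depends on the boot loader, and the
lines below are illustrative:

  # After booting with e.g. memhp_default_state=offline appended to the
  # kernel command line, confirm the parameter was picked up ...
  grep -o 'memhp_default_state=[a-z]*' /proc/cmdline

  # ... and that it overrode the Kconfig default:
  cat /sys/devices/system/memory/auto_online_blocks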
